Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
4492
Derong Liu Shumin Fei Zengguang Hou Huaguang Zhang Changyin Sun (Eds.)
Advances in Neural Networks – ISNN 2007 4th International Symposium on Neural Networks, ISNN 2007 Nanjing, China, June 3-7, 2007 Proceedings, Part II
Volume Editors

Derong Liu
University of Illinois at Chicago, IL 60607-7053, USA
E-mail: [email protected]

Shumin Fei
Southeast University, School of Automation, Nanjing 210096, China
E-mail: [email protected]

Zengguang Hou
The Chinese Academy of Sciences, Institute of Automation, Beijing 100080, China
E-mail: [email protected]

Huaguang Zhang
Northeastern University, Shenyang 110004, China
E-mail: [email protected]

Changyin Sun
Hohai University, School of Electrical Engineering, Nanjing 210098, China
E-mail: [email protected]

Library of Congress Control Number: 2007926816
CR Subject Classification (1998): F.1, F.2, D.1, G.2, I.2, C.2, I.4-5, J.1-4
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-72392-7 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-72392-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12060771 06/3180 543210
Preface
ISNN 2007, the Fourth International Symposium on Neural Networks, was held in Nanjing, China, as a sequel to ISNN 2004, ISNN 2005, and ISNN 2006. ISNN has become a well-established conference series on neural networks in the region and around the world, with growing popularity and increasing quality. Nanjing, an ancient capital of China, is a modern metropolis with a 2470-year history and a rich cultural heritage. All participants of ISNN 2007 had a technically rewarding experience as well as a memorable stay in this great city.

A neural network is an information processing structure inspired by biological nervous systems such as the brain. It consists of a large number of highly interconnected processing elements, called neurons, and has the capability of learning from examples. The field of neural networks has evolved rapidly in recent years. It has become a fusion of a number of research areas in engineering, computer science, mathematics, artificial intelligence, operations research, systems theory, biology, and neuroscience. Neural networks have been widely applied in control, optimization, pattern recognition, image processing, signal processing, etc.

ISNN 2007 aimed to provide a high-level international forum for scientists, engineers, and educators to present the state of the art of neural network research and applications in diverse fields. The symposium featured plenary lectures given by world-renowned scholars, regular sessions with broad coverage, and special sessions focusing on popular topics. The symposium received a total of 1975 submissions from 55 countries and regions across all six continents. The proceedings consist of 454 papers, among which 262 were accepted as long papers and 192 as short papers.

We would like to express our sincere gratitude to all reviewers of ISNN 2007 for the time and effort they generously gave to the symposium. We are very grateful to the National Natural Science Foundation of China, the K. C. Wong Education Foundation of Hong Kong, the Southeast University of China, the Chinese University of Hong Kong, and the University of Illinois at Chicago for their financial support. We would also like to thank the publisher, Springer, for its cooperation in publishing the proceedings in the prestigious Lecture Notes in Computer Science series.

Derong Liu
Shumin Fei
Zeng-Guang Hou
Huaguang Zhang
Changyin Sun
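As a minimal illustration of the "learning from examples" idea described in the preface (a sketch for the reader, not drawn from any paper in this volume; all names below are hypothetical), a single artificial neuron can adjust its connection weights from labeled examples:

```python
# Minimal perceptron: one "neuron" that learns from (input, target) examples.
# Illustrative sketch only -- not taken from any ISNN 2007 paper.

def train_perceptron(samples, epochs=20, lr=0.1):
    """Learn weights and a bias from (input_vector, target) examples."""
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            # Threshold activation: fire if the weighted sum exceeds zero.
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out
            # Error-driven update: nudge each connection weight.
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Learn the logical AND function from four labeled examples.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(data)
```

Real networks interconnect many such units in layers; this single-neuron case only makes the error-driven weight update concrete.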
ISNN 2007 Organization
General Chair
Derong Liu, University of Illinois at Chicago, USA, and Yanshan University, China

General Co-chair
Marios M. Polycarpou, University of Cyprus

Organization Chair
Shumin Fei, Southeast University, China

Advisory Committee Chairs
Shun-Ichi Amari, RIKEN Brain Science Institute, Japan
Chunbo Feng, Southeast University, China
Zhenya He, Southeast University, China
Advisory Committee Members
Hojjat Adeli, Ohio State University, USA
Moonis Ali, Texas State University-San Marcos, USA
Zheng Bao, Xidian University, China
Tamer Basar, University of Illinois at Urbana-Champaign, USA
Tianyou Chai, Northeastern University, China
Guoliang Chen, University of Science and Technology of China, China
Ruwei Dai, Chinese Academy of Sciences, China
Dominique M. Durand, Case Western Reserve University, USA
Russ Eberhart, Indiana University Purdue University Indianapolis, USA
David Fogel, Natural Selection, Inc., USA
Walter J. Freeman, University of California-Berkeley, USA
Toshio Fukuda, Nagoya University, Japan
Kunihiko Fukushima, Kansai University, Japan
Tom Heskes, University of Nijmegen, The Netherlands
Okyay Kaynak, Bogazici University, Turkey
Frank L. Lewis, University of Texas at Arlington, USA
Deyi Li, National Natural Science Foundation of China, China
Yanda Li, Tsinghua University, China
Ruqian Lu, Chinese Academy of Sciences, China
John MacIntyre, University of Sunderland, UK
Robert J. Marks II, Baylor University, USA
Anthony N. Michel, University of Notre Dame, USA
Evangelia Micheli-Tzanakou, Rutgers University, USA
Erkki Oja, Helsinki University of Technology, Finland
Nikhil R. Pal, Indian Statistical Institute, India
Vincenzo Piuri, University of Milan, Italy
Jennie Si, Arizona State University, USA
Youxian Sun, Zhejiang University, China
Yuan Yan Tang, Hong Kong Baptist University, China
Tzyh Jong Tarn, Washington University, USA
Fei-Yue Wang, Chinese Academy of Sciences, China
Lipo Wang, Nanyang Technological University, Singapore
Shoujue Wang, Chinese Academy of Sciences, China
Paul J. Werbos, National Science Foundation, USA
Bernie Widrow, Stanford University, USA
Gregory A. Worrell, Mayo Clinic, USA
Hongxin Wu, Chinese Academy of Space Technology, China
Youlun Xiong, Huazhong University of Science and Technology, China
Lei Xu, Chinese University of Hong Kong, China
Shuzi Yang, Huazhong University of Science and Technology, China
Xin Yao, University of Birmingham, UK
Bo Zhang, Tsinghua University, China
Siying Zhang, Qingdao University, China
Nanning Zheng, Xi’an Jiaotong University, China
Jacek M. Zurada, University of Louisville, USA
Steering Committee Chair
Jun Wang, Chinese University of Hong Kong, China

Steering Committee Co-chair
Zongben Xu, Xi’an Jiaotong University, China

Steering Committee Members
Tianping Chen, Fudan University, China
Andrzej Cichocki, Brain Science Institute, Japan
Wlodzislaw Duch, Nicholaus Copernicus University, Poland
Chengan Guo, Dalian University of Technology, China
Anthony Kuh, University of Hawaii, USA
Xiaofeng Liao, Chongqing University, China
Xiaoxin Liao, Huazhong University of Science and Technology, China
Bao-Liang Lu, Shanghai Jiaotong University, China
Chenghong Wang, National Natural Science Foundation of China, China
Leszek Rutkowski, Technical University of Czestochowa, Poland
Zengqi Sun, Tsinghua University, China
Donald C. Wunsch II, University of Missouri-Rolla, USA
Gary G. Yen, Oklahoma State University, Stillwater, USA
Zhang Yi, University of Electronic Science and Technology, China
Hujun Yin, University of Manchester, UK
Liming Zhang, Fudan University, China
Chunguang Zhou, Jilin University, China
Program Chairs
Zeng-Guang Hou, Chinese Academy of Sciences, China
Huaguang Zhang, Northeastern University, China

Special Sessions Chairs
Lei Guo, Beihang University, China
Wen Yu, CINVESTAV-IPN, Mexico

Finance Chair
Xinping Guan, Yanshan University, China

Publicity Chair
Changyin Sun, Hohai University, China

Publicity Co-chairs
Zongli Lin, University of Virginia, USA
Weixing Zheng, University of Western Sydney, Australia

Publications Chair
Jinde Cao, Southeast University, China

Registration Chairs
Hua Liang, Hohai University, China
Bhaskhar DasGupta, University of Illinois at Chicago, USA
Local Arrangements Chairs
Enrong Wang, Nanjing Normal University, China
Shengyuan Xu, Nanjing University of Science and Technology, China
Junyong Zhai, Southeast University, China

Electronic Review Chair
Xiaofeng Liao, Chongqing University, China

Symposium Secretariats
Ting Huang, University of Illinois at Chicago, USA
Jinya Song, Hohai University, China
ISNN 2007 International Program Committee
Shigeo Abe, Kobe University, Japan
Ajith Abraham, Chung Ang University, Korea
Khurshid Ahmad, University of Surrey, UK
Angelo Alessandri, University of Genoa, Italy
Sabri Arik, Istanbul University, Turkey
K. Vijayan Asari, Old Dominion University, USA
Amit Bhaya, Federal University of Rio de Janeiro, Brazil
Abdesselam Bouzerdoum, University of Wollongong, Australia
Martin Brown, University of Manchester, UK
Ivo Bukovsky, Czech Technical University, Czech Republic
Jinde Cao, Southeast University, China
Matthew Casey, Surrey University, UK
Luonan Chen, Osaka-Sandai University, Japan
Songcan Chen, Nanjing University of Aeronautics and Astronautics, China
Xiao-Hu Chen, Nanjing Institute of Technology, China
Xinkai Chen, Shibaura Institute of Technology, Japan
Yuehui Chen, Jinan University, Shandong, China
Xiaochun Cheng, University of Reading, UK
Zheru Chi, Hong Kong Polytechnic University, China
Sungzoon Cho, Seoul National University, Korea
Seungjin Choi, Pohang University of Science and Technology, Korea
Tommy W. S. Chow, City University of Hong Kong, China
Emilio Corchado, University of Burgos, Spain
Jose Alfredo F. Costa, Federal University, UFRN, Brazil
Mingcong Deng, Okayama University, Japan
Shuxue Ding, University of Aizu, Japan
Meng Joo Er, Nanyang Technological University, Singapore
Deniz Erdogmus, Oregon Health & Science University, USA
Gary Feng, City University of Hong Kong, China
Jian Feng, Northeastern University, China
Mauro Forti, University of Siena, Italy
Wai Keung Fung, University of Manitoba, Canada
Marcus Gallagher, University of Queensland, Australia
John Qiang Gan, University of Essex, UK
Xiqi Gao, Southeast University, China
Chengan Guo, Dalian University of Technology, China
Dalei Guo, Chinese Academy of Sciences, China
Ping Guo, Beijing Normal University, China
Madan M. Gupta, University of Saskatchewan, Canada
Min Han, Dalian University of Technology, China
Haibo He, Stevens Institute of Technology, USA
Daniel Ho, City University of Hong Kong, China
Dewen Hu, National University of Defense Technology, China
Jinglu Hu, Waseda University, Japan
Sanqing Hu, Mayo Clinic, Rochester, Minnesota, USA
Xuelei Hu, Nanjing University of Science and Technology, China
Guang-Bin Huang, Nanyang Technological University, Singapore
Tingwen Huang, Texas A&M University at Qatar
Giacomo Indiveri, ETH Zurich, Switzerland
Malik Magdon-Ismail, Rensselaer Polytechnic Institute, USA
Danchi Jiang, University of Tasmania, Australia
Joarder Kamruzzaman, Monash University, Australia
Samuel Kaski, Helsinki University of Technology, Finland
Hon Keung Kwan, University of Windsor, Canada
James Kwok, Hong Kong University of Science and Technology, China
James Lam, University of Hong Kong, China
Kang Li, Queen’s University, UK
Xiaoli Li, University of Birmingham, UK
Yangmin Li, University of Macau, China
Yongwei Li, Hebei University of Science and Technology, China
Yuanqing Li, Institute of Infocomm Research, Singapore
Hualou Liang, University of Texas at Houston, USA
Jinling Liang, Southeast University, China
Yanchun Liang, Jilin University, China
Lizhi Liao, Hong Kong Baptist University, China
Guoping Liu, University of Glamorgan, UK
Ju Liu, Shandong University, China
Meiqin Liu, Zhejiang University, China
Xiangjie Liu, North China Electric Power University, China
Yutian Liu, Shandong University, China
Hongtao Lu, Shanghai Jiaotong University, China
Jinhu Lu, Chinese Academy of Sciences and Princeton University, USA
Wenlian Lu, Max Planck Institute for Mathematics in Sciences, Germany
Shuxian Lun, Bohai University, China
Fa-Long Luo, Anyka, Inc., USA
Jinwen Ma, Peking University, China
Xiangping Meng, Changchun Institute of Technology, China
Kevin L. Moore, Colorado School of Mines, USA
Ikuko Nishikawa, Ritsumeikan University, Japan
Stanislaw Osowski, Warsaw University of Technology, Poland
Seiichi Ozawa, Kobe University, Japan
Hector D. Patino, Universidad Nacional de San Juan, Argentina
Yi Shen, Huazhong University of Science and Technology, China
Daming Shi, Nanyang Technological University, Singapore
Yang Shi, University of Saskatchewan, Canada
Michael Small, Hong Kong Polytechnic University, China
Ashu MG Solo, Maverick Technologies America Inc., USA
Stefano Squartini, Universita Politecnica delle Marche, Italy
Ponnuthurai Nagaratnam Suganthan, Nanyang Technological University, Singapore
Fuchun Sun, Tsinghua University, China
Johan A. K. Suykens, Katholieke Universiteit Leuven, Belgium
Norikazu Takahashi, Kyushu University, Japan
Ying Tan, Peking University, China
Yonghong Tan, Guilin University of Electronic Technology, China
Peter Tino, Birmingham University, UK
Christos Tjortjis, University of Manchester, UK
Antonios Tsourdos, Cranfield University, UK
Marc van Hulle, Katholieke Universiteit Leuven, Belgium
Dan Ventura, Brigham Young University, USA
Michel Verleysen, Universite Catholique de Louvain, Belgium
Bing Wang, University of Hull, UK
Dan Wang, Dalian Maritime University, China
Pei-Fang Wang, SPAWAR Systems Center-San Diego, USA
Zhiliang Wang, Northeastern University, China
Si Wu, University of Sussex, UK
Wei Wu, Dalian University of Technology, China
Shunren Xia, Zhejiang University, China
Yousheng Xia, University of Waterloo, Canada
Cheng Xiang, National University of Singapore, Singapore
Daoyi Xu, Sichuan University, China
Xiaosong Yang, Huazhong University of Science and Technology, China
Yingjie Yang, De Montfort University, UK
Zi-Jiang Yang, Kyushu University, Japan
Mao Ye, University of Electronic Science and Technology of China, China
Jianqiang Yi, Chinese Academy of Sciences, China
Dingli Yu, Liverpool John Moores University, UK
Zhigang Zeng, Wuhan University of Technology, China
Guisheng Zhai, Osaka Prefecture University, Japan
Jie Zhang, University of Newcastle, UK
Liming Zhang, Fudan University, China
Liqing Zhang, Shanghai Jiaotong University, China
Nian Zhang, South Dakota School of Mines & Technology, USA
Qingfu Zhang, University of Essex, UK
Yanqing Zhang, Georgia State University, USA
Yifeng Zhang, Hefei Institute of Electrical Engineering, China
Yong Zhang, Jinan University, China
Dongbin Zhao, Chinese Academy of Sciences, China
Hongyong Zhao, Nanjing University of Aeronautics and Astronautics, China
Haibin Zhu, Nipissing University, Canada
Table of Contents – Part II
Chaos and Synchronization

Synchronization of Chaotic Systems Via the Laguerre–Polynomials-Based Neural Network . . . . . . . . . . . . . . . . . . . . . . . . Hongwei Wang and Hong Gu
1
Chaos Synchronization Between Unified Chaotic System and Genesio System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xianyong Wu, Zhi-Hong Guan, and Tao Li
8
Robust Impulsive Synchronization of Coupled Delayed Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lan Xiang, Jin Zhou, and Zengrong Liu
16
Synchronization of Impulsive Fuzzy Cellular Neural Networks with Parameter Mismatches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tingwen Huang and Chuandong Li
24
Global Synchronization in an Array of Delayed Neural Networks with Nonlinear Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinling Liang, Ping Li, and Yongqing Yang
33
Self-synchronization Blind Audio Watermarking Based on Feature Extraction and Subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaohong Ma, Bo Zhang, and Xiaoyan Ding
40
An Improved Extremum Seeking Algorithm Based on the Chaotic Annealing Recurrent Neural Network and Its Application . . . . . . . . . . . . . Yun-an Hu, Bin Zuo, and Jing Li
47
Solving the Delay Constrained Multicast Routing Problem Using the Transiently Chaotic Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wen Liu and Lipo Wang
57
Solving Prize-Collecting Traveling Salesman Problem with Time Windows by Chaotic Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yanyan Zhang and Lixin Tang
63
A Quickly Searching Algorithm for Optimization Problems Based on Hysteretic Transiently Chaotic Neural Network . . . . . . . . . . . . . . . . . . . . . . Xiuhong Wang and Qingli Qiao
72
Secure Media Distribution Scheme Based on Chaotic Neural Network . . . Shiguo Lian, Zhongxuan Liu, Zhen Ren, and Haila Wang
79
An Adaptive Radar Target Signal Processing Scheme Based on AMTI Filter and Chaotic Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Quansheng Ren, Jianye Zhao, Hongling Meng, and Jianye Zhao
88
Horseshoe Dynamics in a Small Hyperchaotic Neural Network . . . . . . . . . Qingdu Li and Xiao-Song Yang
96
The Chaotic Netlet Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Geehyuk Lee and Gwan-Su Yi
104
A Chaos Based Robust Spatial Domain Watermarking Algorithm . . . . . . Xianyong Wu, Zhi-Hong Guan, and Zhengping Wu
113
Integrating KPCA and LS-SVM for Chaotic Time Series Forecasting Via Similarity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jian Cheng, Jian-sheng Qian, Xiang-ting Wang, and Li-cheng Jiao
120
Prediction of Chaotic Time Series Using LS-SVM with Simulated Annealing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Meiying Ye
127
Radial Basis Function Neural Network Predictor for Parameter Estimation in Chaotic Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hongmei Xie and Xiaoyi Feng
135
Global Exponential Synchronization of Chaotic Neural Networks with Time Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jigui Jian, Baoxian Wang, and Xiaoxin Liao
143
Neural Fuzzy Systems

A Fuzzy Neural Network Based on Back-Propagation . . . . . . . . . . . . . . . . . Huang Jin, Gan Quan, and Cai Linhui
151
State Space Partition for Reinforcement Learning Based on Fuzzy Min-Max Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yong Duan, Baoxia Cui, and Xinhe Xu
160
Realization of an Improved Adaptive Neuro-Fuzzy Inference System in DSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xingxing Wu, Xilin Zhu, Xiaomei Li, and Haocheng Yu
170
Neurofuzzy Power Plant Predictive Control . . . . . . . . . . . . . . . . . . . . . . . . . . Xiang-Jie Liu and Ji-Zhen Liu
179
GA-Driven Fuzzy Set-Based Polynomial Neural Networks with Information Granules for Multi-variable Software Process . . . . . . . . . . . . . Seok-Beom Roh, Sung-Kwun Oh, and Tae-Chon Ahn
186
The ANN Inverse Control of Induction Motor with Robust Flux Observer Based on ESO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xin Wang and Xianzhong Dai
196
Design of Fuzzy Relation-Based Polynomial Neural Networks Using Information Granulation and Symbolic Gene Type Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . SungKwun Oh, InTae Lee, Witold Pedrycz, and HyunKi Kim
206
Fuzzy Neural Network Classification Design Using Support Vector Machine in Welding Defect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiao-guang Zhang, Shi-jin Ren, Xing-gan Zhang, and Fan Zhao
216
Multi-granular Control of Double Inverted Pendulum Based on Universal Logics Fuzzy Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bin Lu and Juan Chen
224
The Research of Decision Information Fusion Algorithm Based on the Fuzzy Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pei-Gang Sun, Hai Zhao, Xiao-Dan Zhang, Jiu-Qiang Xu, Zhen-Yu Yin, Xi-Yuan Zhang, and Si-Yuan Zhu
234
Equalization of Channel Distortion Using Nonlinear Neuro-Fuzzy Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rahib H. Abiyev, Fakhreddin Mamedov, and Tayseer Al-shanableh
241
Comparative Studies of Fuzzy Genetic Algorithms . . . . . . . . . . . . . . . . . . . Qing Li, Yixin Yin, Zhiliang Wang, and Guangjun Liu
251
Fuzzy Random Dependent-Chance Bilevel Programming with Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rui Liang, Jinwu Gao, and Kakuzo Iwamura
257
Fuzzy Optimization Problems with Critical Value-at-Risk Criteria . . . . . . Yan-Kui Liu, Zhi-Qiang Liu, and Ying Liu
267
Neural-Network-Driven Fuzzy Optimum Selection for Mechanism Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yingkui Gu and Xuewen He
275
Atrial Arrhythmias Detection Based on Neural Network Combining Fuzzy Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rongrong Sun and Yuanyuan Wang
284
A Neural-Fuzzy Pattern Recognition Algorithm Based Cutting Tool Condition Monitoring Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pan Fu and A.D. Hope
293
Research on Customer Classification in E-Supermarket by Using Modified Fuzzy Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu-An Tan, Zuo Wang, and Qi Luo
301
Recurrent Fuzzy Neural Network Based System for Battery Charging . . . R.A. Aliev, R.R. Aliev, B.G. Guirimov, and K. Uyar
307
Type-2 Fuzzy Neuro System Via Input-to-State-Stability Approach . . . . . Ching-Hung Lee and Yu-Ching Lin
317
Fuzzy Neural Petri Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hua Xu, Yuan Wang, and Peifa Jia
328
Hardware Design of an Adaptive Neuro-Fuzzy Network with On-Chip Learning Capability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tzu-Ping Kao, Chun-Chang Yu, Ting-Yu Chen, and Jeen-Shing Wang
336
Stock Prediction Using FCMAC-BYY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiacai Fu, Kok Siong Lum, Minh Nhut Nguyen, and Juan Shi
346
A Hybrid Rule Extraction Method Using Rough Sets and Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shufeng Wang, Gengfeng Wu, and Jianguo Pan
352
A Novel Approach for Extraction of Fuzzy Rules Using the Neuro-Fuzzy Network and Its Application in the Blending Process of Raw Slurry . . . . Rui Bai, Tianyou Chai, and Enjie Ma
362
Training and Learning Algorithms for Neural Networks

Neural Network Training Using Genetic Algorithm with a Novel Binary Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yong Liang, Kwong-Sak Leung, and Zong-Ben Xu
371
Adaptive Training of a Kernel-Based Representative and Discriminative Nonlinear Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Benyong Liu, Jing Zhang, and Xiaowei Chen
381
Indirect Training of Grey-Box Models: Application to a Bioprocess . . . . . Francisco Cruz, Gonzalo Acuña, Francisco Cubillos, Vicente Moreno, and Danilo Bassi
391
FNN (Feedforward Neural Network) Training Method Based on Robust Recursive Least Square Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . JunSeok Lim and KoengMo Sung
398
A Margin Maximization Training Algorithm for BP Network . . . . . . . . . . Kai Wang and Qingren Wang
406
Learning Bayesian Networks Based on a Mutual Information Scoring Function and EMI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fengzhan Tian, Haisheng Li, Zhihai Wang, and Jian Yu
414
Learning Dynamic Bayesian Networks Structure Based on Bayesian Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Song Gao, Qinkun Xiao, Quan Pan, and Qingguo Li
424
An On-Line Learning Algorithm of Parallel Mode for MLPN Models . . . D.L. Yu, T.K. Chang, and D.W. Yu
432
A Robust RPCL Algorithm and Its Application in Clustering of Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zeng-Shun Zhao, Zeng-Guang Hou, Min Tan, and An-Min Zou
438
An Evolutionary RBFNN Learning Algorithm for Complex Classification Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jin Tian, Minqiang Li, and Fuzan Chen
448
Stock Index Prediction Based on Adaptive Training and Pruning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinyuan Shen, Huaiyu Fan, and Shengjiang Chang
457
An Improved Algorithm for Eleman Neural Network by Adding a Modified Error Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhiqiang Zhang, Zheng Tang, GuoFeng Tang, Catherine Vairappan, XuGang Wang, and RunQun Xiong
465
Regularization Versus Dimension Reduction, Which Is Better? . . . . . . . . . Yunfei Jiang and Ping Guo
474
Integrated Analytic Framework for Neural Network Construction . . . . . . Kang Li, Jian-Xun Peng, Minrui Fei, Xiaoou Li, and Wen Yu
483
Neural Networks Structures

A Novel Method of Constructing ANN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiangping Meng, Quande Yuan, Yuzhen Pi, and Jianzhong Wang
493
Topographic Infomax in a Neural Multigrid . . . . . . . . . . . . . . . . . . . . . . . . . James Kozloski, Guillermo Cecchi, Charles Peck, and A. Ravishankar Rao
500
Genetic Granular Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yan-Qing Zhang, Bo Jin, and Yuchun Tang
510
A Multi-Level Probabilistic Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . Ning Zong and Xia Hong
516
An Artificial Immune Network Model Applied to Data Clustering and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chenggong Zhang and Zhang Yi
526
Sparse Coding in Sparse Winner Networks . . . . . . . . . . . . . . . . . . . . . . . . . . Janusz A. Starzyk, Yinyin Liu, and David Vogel
534
Multi-valued Cellular Neural Networks and Its Application for Associative Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhong Zhang, Takuma Akiduki, Tetsuo Miyake, and Takashi Imamura
542
Emergence of Topographic Cortical Maps in a Parameterless Local Competition Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. Ravishankar Rao, Guillermo Cecchi, Charles Peck, and James Kozloski
552
Graph Matching Recombination for Evolving Neural Networks . . . . . . . . . Ashique Mahmood, Sadia Sharmin, Debjanee Barua, and Md. Monirul Islam
562
Orthogonal Least Squares Based on QR Decomposition for Wavelet Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Han and Jia Yin
569
Implementation of Multi-valued Logic Based on Bi-threshold Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qiuxiang Deng and Zhigang Zeng
575
Iteratively Reweighted Fitting for Reduced Multivariate Polynomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wangmeng Zuo, Kuanquan Wang, David Zhang, and Feng Yue
583
Decomposition Method for Tree Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Huang and Jie Zhu
593
An Intelligent Hybrid Approach for Designing Increasing Translation Invariant Morphological Operators for Time Series Forecasting . . . . . . . . . Ricardo de A. Araújo, Robson P. de Sousa, and Tiago A.E. Ferreira
602
Ordering Grids to Identify the Clustering Structure . . . . . . . . . . . . . . . . . . Shihong Yue, Miaomiao Wei, Yi Li, and Xiuxiu Wang
612
An Improve to Human Computer Interaction, Recovering Data from Databases Through Spoken Natural Language . . . . . . . . . . . . . . . . . . . . . . . Omar Florez-Choque and Ernesto Cuadros-Vargas
620
3D Reconstruction Approach Based on Neural Network . . . . . . . . . . . . . . . Haifeng Hu and Zhi Yang
630
A New Method of IRFPA Nonuniformity Correction . . . . . . . . . . . . . . . . . . Shaosheng Dai, Tianqi Zhang, and Jian Gao
640
Novel Shape-From-Shading Methodology with Specular Reflectance Using Wavelet Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Yang and Jiu-qiang Han
646
Attribute Reduction Based on Bi-directional Distance Correlation and Radial Basis Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Li-Chao Chen, Wei Zhang, Ying-Jun Zhang, Bin Ye, Li-Hu Pan, and Jing Li
656
Unbiased Linear Neural-Based Fusion with Normalized Weighted Average Algorithm for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yunfeng Wu and S.C. Ng
664
Discriminant Analysis with Label Constrained Graph Partition . . . . . . . . Peng Guan, Yaoliang Yu, and Liming Zhang
671
The Kernelized Geometrical Bisection Methods . . . . . . . . . . . . . . . . . . . . . . Xiaomao Liu, Shujuan Cao, Junbin Gao, and Jun Zhang
680
Design and Implementation of a General Purpose Neural Network Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yi Qian, Ang Li, and Qin Wang
689
A Forward Constrained Selection Algorithm for Probabilistic Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ning Zong and Xia Hong
699
Probabilistic Motion Switch Tracking Method Based on Mean Shift and Double Model Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Risheng Han, Zhongliang Jing, and Gang Xiao
705
Neural Networks for Pattern Recognition

Human Action Recognition Using a Modified Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715
Ho-Joon Kim, Joseph S. Lee, and Hyun-Seung Yang
Neural Networks Based Image Recognition: A New Approach . . . . . . . . . . Jiyun Yang, Xiaofeng Liao, Shaojiang Deng, Miao Yu, and Hongying Zheng
724
Human Touching Behavior Recognition Based on Neural Networks . . . . . Joung Woo Ryu, Cheonshu Park, and Joo-Chan Sohn
730
Kernel Fisher NPE for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guoqiang Wang, Zongying Ou, Fan Ou, Dianting Liu, and Feng Han
740
A Parallel RBFNN Classifier Based on S-Transform for Recognition of Power Quality Disturbances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weiming Tong and Xuelei Song
746
Recognition of Car License Plates Using Morphological Features, Color Information and an Enhanced FCM Algorithm . . . . . . . . . . . . . . . . . . . . . . Kwang-Baek Kim, Choong-shik Park, and Young Woon Woo
756
Modified ART2A-DWNN for Automatic Digital Modulation Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xuexia Wang, Zhilu Wu, Yaqin Zhao, and Guanghui Ren
765
Target Recognition of FLIR Images on Radial Basis Function Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Liu, Xiyue Huang, Yong Chen, and Naishuai He
772
Two-Dimensional Bayesian Subspace Analysis for Face Recognition . . . . Daoqiang Zhang
778
A Wavelet-Based Neural Network Applied to Surface Defect Detection of LED Chips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hong-Dar Lin and Chung-Yu Chung
785
Graphic Symbol Recognition of Engineering Drawings Based on Multi-Scale Autoconvolution Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chuan-Min Zhai and Ji-Xiang Du
793
Driver Fatigue Detection by Fusing Multiple Cues . . . . . . . . . . . . . . . . . . . . Rajinda Senaratne, David Hardy, Bill Vanderaa, and Saman Halgamuge
801
Palmprint Recognition Using a Novel Sparse Coding Technique . . . . . . . . Li Shang, Fenwen Cao, Zhiqiang Zhao, Jie Chen, and Yu Zhang
810
Radial Basis Probabilistic Neural Networks Committee for Palmprint Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jixiang Du, Chuanmin Zhai, and Yuanyuan Wan
819
A Connectionist Thematic Grid Predictor for Pre-parsed Natural Language Sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825
João Luís Garcia Rosa
Perfect Recall on the Lernmatrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835
Israel Román-Godínez, Itzamá López-Yáñez, and Cornelio Yáñez-Márquez
A New Text Detection Approach Based on BP Neural Network for Vehicle License Plate Detection in Complex Background . . . . . . . . . . . . . . 842
Yanwen Li, Meng Li, Yinghua Lu, Ming Yang, and Chunguang Zhou
Searching Eye Centers Using a Context-Based Neural Network . . . . . . . . . 851
Jun Miao, Laiyun Qing, Lijuan Duan, and Wen Gao
A Fast New Small Target Detection Algorithm Based on Regularizing Partial Differential Equation in IR Clutter . . . . . . . . . . . . . . . . . . . . . . . . . . Biyin Zhang, Tianxu Zhang, and Kun Zhang
861
The Evaluation Measure of Text Clustering for the Variable Number of Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Taeho Jo and Malrey Lee
871
Clustering-Based Reference Set Reduction for k-Nearest Neighbor . . . . . . Seongseob Hwang and Sungzoon Cho
880
A Contourlet-Based Method for Wavelet Neural Network Automatic Target Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xue Mei, Liangzheng Xia, and Jiuxian Li
889
Facial Expression Analysis on Semantic Neighborhood Preserving Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shuang Xu, Yunde Jia, and Youdong Zhao
896
Face Recognition from a Single Image per Person Using Common Subfaces Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun-Bao Li, Jeng-Shyang Pan, and Shu-Chuan Chu
905
SOMs, ICA/PCA

A Structural Adapting Self-organizing Maps Neural Network . . . . . . . . . . 913
Xinzheng Xu, Wenhua Zeng, and Zuopeng Zhao
How Good Is the Backpropogation Neural Network Using a Self-Organised Network Inspired by Immune Algorithm (SONIA) When Used for Multi-step Financial Time Series Prediction? . . . . . . . . . . . . . . . . 921
Abir Jaafar Hussain and Dhiya Al-Jumeily
Edge Detection Combined Entropy Threshold and Self-Organizing Map (SOM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 931
Kun Wang, Liqun Gao, Zhaoyu Pian, Li Guo, and Jianhua Wu
Hierarchical SOMs: Segmentation of Cell-Migration Images . . . . . . . . . . . . 938
Chaoxin Zheng, Khurshid Ahmad, Aideen Long, Yuri Volkov, Anthony Davies, and Dermot Kelleher
Network Anomaly Detection Based on DSOM and ACO Clustering . . . . Yong Feng, Jiang Zhong, Zhong-yang Xiong, Chun-xiao Ye, and Kai-gui Wu
947
Hybrid Pipeline Structure for Self-Organizing Learning Array . . . . . . . . . . Janusz A. Starzyk, Mingwei Ding, and Yinyin Liu
956
CSOM for Mixed Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fedja Hadzic and Tharam S. Dillon
965
The Application of ICA to the X-Ray Digital Subtraction Angiography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Songyuan Tang, Yongtian Wang, and Yen-wei Chen
979
Relative Principle Component and Relative Principle Component Analysis Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cheng-Lin Wen, Jing Hu, and Tian-Zhen Wang
985
The Hybrid Principal Component Analysis Based on Wavelets and Moving Median Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cheng-lin Wen, Shao-hui Fan, and Zhi-guo Chen
994
Recursive Bayesian Linear Discriminant for Classification . . . . . . . . . . . . . 1002
D. Huang and C. Xiang
Histogram PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1012
P. Nagabhushan and R. Pradeep Kumar
Simultaneously Prediction of Network Traffic Flow Based on PCA-SVR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1022
Xuexiang Jin, Yi Zhang, and Danya Yao
An Efficient K-Hyperplane Clustering Algorithm and Its Application to Sparse Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1032
Zhaoshui He and Andrzej Cichocki
A PCA-Combined Neural Network Software Sensor for SBR Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1042
Liping Fan and Yang Xu
Symmetry Based Two-Dimensional Principal Component Analysis for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1048
Mingyong Ding, Congde Lu, Yunsong Lin, and Ling Tong
A Method Based on ICA and SVM/GMM for Mixed Acoustic Objects Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1056
Yaobo Li, Zhiliang Ren, Gong Chen, and Changcun Sun
ICA Based Super-Resolution Face Hallucination and Recognition . . . . . . 1065
Hua Yan, Ju Liu, Jiande Sun, and Xinghua Sun
Principal Component Analysis Based Probability Neural Network Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1072
Jie Xing, Deyun Xiao, and Jiaxiang Yu
A Multi-scale Dynamically Growing Hierarchical Self-organizing Map for Brain MRI Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081
Jingdan Zhang and Dao-Qing Dai
Biomedical Applications

A Study on How to Classify the Security Rating of Medical Information Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1090
Jaegu Song and Seoksoo Kim
Detecting Biomarkers for Major Adverse Cardiac Events Using SVM with PLS Feature Selection and Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 1097
Zheng Yin, Xiaobo Zhou, Honghui Wang, Youxian Sun, and Stephen T.C. Wong
Hybrid Systems and Artificial Immune Systems: Performances and Applications to Biomedical Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1107
Vitoantonio Bevilacqua, Cosimo G. de Musso, Filippo Menolascina, Giuseppe Mastronardi, and Antonio Pedone
NeuroOracle: Integration of Neural Networks into an Object-Relational Database System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115
Erich Schikuta and Paul Glantschnig
Discrimination of Coronary Microcirculatory Dysfunction Based on Generalized Relevance LVQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1125
Qi Zhang, Yuanyuan Wang, Weiqi Wang, Jianying Ma, Juying Qian, and Junbo Ge
Multiple Signal Classification Based on Genetic Algorithm for MEG Sources Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1133
Chenwei Jiang, Jieming Ma, Bin Wang, and Liming Zhang
Registration of 3D FMT and CT Images of Mouse Via Affine Transformation with Bayesian Iterative Closest Points . . . . . . . . . . . . . . . . 1140
Xia Zheng, Xiaobo Zhou, Youxian Sun, and Stephen T.C. Wong
Automatic Diagnosis of Foot Plant Pathologies: A Neural Networks Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1150
Marco Mora, Mary Carmen Jarur, Daniel Sbarbaro, and Leopoldo Pavesi
Phase Transitions Caused by Threshold in Random Neural Network and Its Medical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1159
Guangcheng Xi and Jianxin Chen
Multiresolution of Clinical EEG Recordings Based on Wavelet Packet Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1168
Lisha Sun, Guoliang Chang, and Patch J. Beadle
Comparing Analytical Decision Support Models Through Boolean Rule Extraction: A Case Study of Ovarian Tumour Malignancy . . . . . . . . . . . . . 1177
M.S.H. Aung, P.J.G. Lisboa, T.A. Etchells, A.C. Testa, B. Van Calster, S. Van Huffel, L. Valentin, and D. Timmerman
Human Sensibility Evaluation Using Neural Network and Multiple-Template Method on Electroencephalogram (EEG) . . . . . . . . . . . 1187
Dongjun Kim, Seungjin Woo, Jeongwhan Lee, and Kyeongseop Kim
A Decision Method for Air-Pressure Limit Value Based on the Respiratory Model with RBF Expression of Elastance . . . . . . . . . . . . . . . . 1194
Shunshoku Kanae, Zi-Jiang Yang, and Kiyoshi Wada
Hand Tremor Classification Using Bispectrum Analysis of Acceleration Signals and Back-Propagation Neural Network . . . . . . . . . . . . . . . . . . . . . . . 1202
Lingmei Ai, Jue Wang, Liyu Huang, and Xuelian Wang
A Novel Ensemble Approach for Cancer Data Classification . . . . . . . . . . . 1211
Yaou Zhao, Yuehui Chen, and Xueqin Zhang
Biological Sequence Data Preprocessing for Classification: A Case Study in Splice Site Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1221
A.K.M.A. Baten, Saman K. Halgamuge, Bill Chang, and Nalin Wickramarachchi
A Method of X-Ray Image Recognition Based on Fuzzy Rule and Parallel Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1231
Dongmei Liu and Zhaoxia Wang
Detection of Basal Cell Carcinoma Based on Gaussian Prototype Fitting of Confocal Raman Spectra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1240
Seong-Joon Baek, Aaron Park, Sangki Kang, Yonggwan Won, Jin Young Kim, and Seung You Na
Prediction of Helix, Strand Segments from Primary Protein Sequences by a Set of Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1248
Zhuo Song, Ning Zhang, Zhuo Yang, and Tao Zhang
A Novel EPA-KNN Gene Classification Algorithm . . . . . . . . . . . . . . . . . . . . 1254
Haijun Wang, Yaping Lin, Xinguo Lu, and Yalin Nie
A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1264
Shuxue Zou, Yanxin Huang, Yan Wang, Chengquan Hu, Yanchun Liang, and Chunguang Zhou
The Effect of Recording Reference on EEG: Phase Synchrony and Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1273
Sanqing Hu, Matt Stead, Andrew B. Gardner, and Gregory A. Worrell
Biological Inspired Global Descriptor for Shape Matching . . . . . . . . . . . . . 1281
Yan Li, Siwei Luo, and Qi Zou
Fuzzy Support Vector Machine for EMG Pattern Recognition and Myoelectrical Prosthesis Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1291
Lingling Chen, Peng Yang, Xiaoyun Xu, Xin Guo, and Xueping Zhang
Classification of Obstructive Sleep Apnea by Neural Networks . . . . . . . . . 1299
Zhongyu Pang, Derong Liu, and Stephen R. Lloyd
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1309
Synchronization of Chaotic Systems Via the Laguerre–Polynomials-Based Neural Network Hongwei Wang and Hong Gu Department of Automation, Dalian University of Technology
[email protected]

Abstract. In recent years, chaos synchronization has attracted many researchers' interest. For a class of chaotic synchronization systems with unknown uncertainties caused by both model variations and external disturbances, an orthogonal function neural network is utilized to realize the synchronization of chaotic systems. The basis functions of the orthogonal function neural network are Laguerre polynomials. First, the orthogonal function neural network is trained to learn the uncertain information. Then, the parameters of the Laguerre orthogonal neural network are adjusted, by the Lyapunov stability theorem, to accomplish the synchronization of two chaotic systems under perturbation. Finally, a numerical example is given to illustrate the validity of the proposed method.
1 Introduction

In recent years, chaos synchronization has attracted many researchers' interest. Different methods have been used for the synchronization of chaotic systems, such as the radial basis function neural network, the recurrent neural network and the wavelet neural network in the literature [1-6], all of which possess the ability to approximate nonlinear systems. Chaos synchronization can be viewed from a state-observer perspective, in the sense that the response system can be regarded as the state observer of the drive system [7-9]. In the state-observer based approach, the output can be chosen to be a linear or nonlinear combination or function of the system state variables. However, it has been shown that the state-observer-based scheme has an inherent disadvantage: transmission noise affects the performance of synchronization and communication [7]. On the other hand, control methods that are applicable to general nonlinear systems have been extensively developed since the early 1980s, for example those based on differential geometry theory [10]. Recently, the passivity approach has generated increasing interest for synchronization control laws for general nonlinear systems [11]. An important problem in this approach is how to achieve robust nonlinear control in the presence of unmodelled dynamics and external disturbances. Along this line there is the so-called H∞ nonlinear control approach [12-13]. One major difficulty with this approach, alongside its possible system structural instability, seems to be the requirement of solving the associated partial differential equations. In addition, for dynamic systems with complex, ill-conditioned, or nonlinear characteristics, the fuzzy modeling method is very effective for describing the properties of the systems, and in this research area there have been many attempts to achieve the synchronization of chaotic systems by fuzzy methods [14-15]. In this paper, an orthogonal function neural network, the Laguerre orthogonal neural network, is utilized to realize the synchronization of chaotic systems. The orthogonal function neural network is trained to learn the uncertain information of the system. Finally, simulation results for a numerical example demonstrate the validity of the proposed method.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1–7, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 The Description of the Problem

The chaotic system is of the form

\dot{x} = F(x) + u    (1)

where x ∈ R^n; F(x) is a nonlinear function, F(x) = (f_1(x), f_2(x), ..., f_n(x))^T with components f_i(x), i = 1, 2, ..., n; and u ∈ R^n is the input vector. In the actual system, the chaotic system is perturbed from outside, and Equation (1) becomes
\dot{x} = F(x) + ΔF(x) + u    (2)

where ΔF(x) is the external perturbation.

Definition: the reference chaotic system is
\dot{x}_r = g(x_r)    (3)

where x_r ∈ R^n is the state vector and g(·) is a vector of smooth nonlinear functions, which may have the same structure as F(·) or a different one. Let e = x − x_r; we require
lim_{t→∞} e(t) = 0    (4)

so that system (2) is synchronized with the reference system (3). Let A be a matrix whose eigenvalues all have negative real parts. The error dynamics are given by Equation (5).
\dot{e} = \dot{x} − \dot{x}_r = A(x − x_r) + Ax_r − Ax + F(x) + ΔF(x) + u − g(x_r)    (5)

which can be rewritten as

\dot{e} = Ae + Ax_r − Ax + F(x) + ΔF(x) + u − g(x_r)    (6)
Because the eigenvalues of the matrix A have negative real parts, there is a positive definite symmetric matrix P satisfying the Lyapunov equation

A^T P + PA = −Q    (7)
where Q is a positive definite symmetric matrix. Define G(x) as

G(x) = Ax_r − Ax + F(x) + ΔF(x) − g(x_r)    (8)
Then Equation (6) can be written as

\dot{e} = Ae + G(x) + u    (9)

When the function G(x) is unknown, its estimate Ĝ(x) is substituted for it, and the controller is defined as

u = −Ĝ(x) − σ    (10)

where σ is a vector with small norm.
3 The Controller of the Orthogonal Function Neural Network

The orthogonal function neural network has a simple structure and converges faster than the common BP neural network, and it can approximate any nonlinear function on a compact set. In this paper, the orthogonal functions of the neural network are the Laguerre polynomials, defined by

P_1(x) = 1
P_2(x) = 1 − x
P_i(x) = {[P_2(x) + 2(i − 2)] P_{i−1}(x) − (i − 2) P_{i−2}(x)} / (i − 1),  i = 3, ..., N    (11)
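As an illustration (our own sketch, not part of the paper), the recursion (11) can be evaluated directly; with this 1-based indexing, P_i(x) equals the classical Laguerre polynomial L_{i−1}(x):

```python
# Sketch of the Laguerre recursion (11); indexing is 1-based as in the paper,
# so P[i-1] below holds P_i(x) and equals the classical Laguerre L_{i-1}(x).
def laguerre_basis(x, N):
    """Return [P_1(x), ..., P_N(x)] computed by the recursion (11)."""
    P = [1.0, 1.0 - x]                      # P_1(x) = 1, P_2(x) = 1 - x
    for i in range(3, N + 1):
        # P_i = {[P_2 + 2(i-2)] P_{i-1} - (i-2) P_{i-2}} / (i-1)
        P.append((((1.0 - x) + 2 * (i - 2)) * P[i - 2]
                  - (i - 2) * P[i - 3]) / (i - 1))
    return P[:N]
```

For example, `laguerre_basis(1.0, 4)` yields approximately 1, 0, −1/2 and −2/3, matching the classical values L_0(1), ..., L_3(1).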
The global output of the orthogonal function neural network is defined as

y = Σ_{i=1}^{N} Φ_i W_i    (12)

where Φ_i = P_{1i}(x_1) × P_{2i}(x_2) × ··· × P_{ni}(x_n) = Π_{j=1}^{n} P_{ji}(x_j), and P_{ji}(x_j) is the Laguerre polynomial in x_j:

P_{j1}(x_j) = 1,  P_{j2}(x_j) = 1 − x_j,
P_{ji}(x_j) = {[P_{j2}(x_j) + 2(i − 2)] P_{j(i−1)}(x_j) − (i − 2) P_{j(i−2)}(x_j)} / (i − 1),

for i = 1, 2, ..., N and j = 1, 2, ..., n.
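A minimal sketch (ours; the helper re-implements recursion (11) locally so the snippet is self-contained) of the product basis Φ_i and the output (12):

```python
# Sketch of the network output (12): node i multiplies the i-th Laguerre
# polynomial of every input coordinate, Phi_i(x) = prod_j P_ji(x_j),
# and the global output is y = sum_i Phi_i(x) * W_i.
import math

def laguerre_values(t, N):
    """[P_1(t), ..., P_N(t)] via the recursion (11)."""
    P = [1.0, 1.0 - t]
    for i in range(3, N + 1):
        P.append((((1.0 - t) + 2 * (i - 2)) * P[i - 2]
                  - (i - 2) * P[i - 3]) / (i - 1))
    return P[:N]

def network_output(x, W):
    """x: input vector (length n); W: output weights (length N)."""
    N = len(W)
    per_coord = [laguerre_values(xj, N) for xj in x]
    Phi = [math.prod(per_coord[j][i] for j in range(len(x))) for i in range(N)]
    return sum(phi * w for phi, w in zip(Phi, W))
```

Since every Laguerre polynomial equals 1 at the origin, the output at x = 0 is simply the sum of the weights, which gives a quick sanity check of an implementation.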
Lemma 1. For any function g(X) defined on the interval [a, b] and any small positive number ε, there exist an orthogonal function sequence {Φ_1(X), Φ_2(X), ..., Φ_N(X)} and a real number sequence W_i (i = 1, 2, ..., N) satisfying

|g(X) − Σ_{i=1}^{N} W_i Φ_i(X)| ≤ ε    (13)
Based on Lemma 1, the following property is obtained.

Property 1. Given a positive constant ε_0 and a continuous function G(x), x ∈ R^n, there exists an optimal weight matrix W = W*, W* = [W_1*, W_2*, ..., W_N*], satisfying

‖G(x) − W*^T Φ(x)‖ ≤ ε_0    (14)

where Φ(x) = [Φ_1(x), Φ_2(x), ..., Φ_N(x)]^T. Based on Equation (14), G(x) can be written as

G(x) = W*^T Φ(x) + η    (15)

where η is a vector with small norm. Substituting Equation (15) into Equation (9) gives

\dot{e} = Ae + W*^T Φ(x) + η + u(t)    (16)

Theorem 1. Let P be a positive definite symmetric matrix satisfying A^T P + PA = −Q, where Q is a positive definite symmetric matrix, let the weights be updated by \dot{W} = Φ(x) e^T P, and design the controller as

u = u_1 + u_2    (17)

where u_1 = −W^T Φ(x) and u_2 = −η_0 sgn(Pe). Then the state of Equation (9) approaches zero, namely x → x_r, and system (2) is synchronized with system (3).

Proof. Let \tilde{W} = W* − W, where W* is the optimal weight matrix, W is the current weight matrix, and \tilde{W} is the weight estimation error. Define the matrix norm ‖R‖² = tr(RR^T) = tr(R^T R). The Lyapunov function is chosen as

V = (1/2) e^T P e + (1/2) ‖\tilde{W}‖²    (18)
Differentiating Equation (18) with respect to time gives

\dot{V} = (1/2)(\dot{e}^T P e + e^T P \dot{e}) + tr(\tilde{W}^T \dot{\tilde{W}})
  = (1/2) e^T (A^T P + PA) e + Φ^T(x) \tilde{W} P e + e^T P η − η_0 e^T P sgn(Pe) + tr(\tilde{W}^T \dot{\tilde{W}})
  = −(1/2) e^T Q e + Φ^T(x) \tilde{W} P e + e^T P η − η_0 e^T P sgn(Pe) + tr(\tilde{W}^T \dot{\tilde{W}})

Because Φ^T(x) \tilde{W} P e = tr(P e Φ^T(x) \tilde{W}) and e^T P η − η_0 e^T P sgn(Pe) ≤ 0,

\dot{V} ≤ −(1/2) e^T Q e + tr(\tilde{W}^T \dot{\tilde{W}} + P e Φ^T(x) \tilde{W}) = −(1/2) e^T Q e ≤ 0

By the Lyapunov theorem, e and \tilde{W} are bounded, and lim_{t→∞} e = 0.
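The controller (17) and the weight update law \dot{W} = Φ(x) e^T P of Theorem 1 can be sketched as a single discrete step (our own illustration; the Euler discretization and the step size dt are assumptions, not from the paper):

```python
# One step of the Theorem-1 scheme (illustrative sketch, not the authors' code).
# W: (N, n) weight matrix, Phi: (N,) basis vector, e: (n,) error, P: (n, n).
import numpy as np

def control_and_update(W, Phi, e, P, eta0, dt):
    u = -W.T @ Phi - eta0 * np.sign(P @ e)     # u = u1 + u2, Eq. (17)
    W_next = W + dt * np.outer(Phi, e) @ P     # Euler step of W' = Phi e^T P
    return u, W_next
```

The sign term bounds the residual approximation error η, while the weight update drives \tilde{W} so that the cross terms in \dot{V} cancel, as in the proof above.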
4 Simulation

The Lorenz chaotic system is given by

\dot{v}_1 = a(v_2 − v_1)
\dot{v}_2 = (b − v_3)v_1 − v_2
\dot{v}_3 = −c v_3 + v_1 v_2    (19)
In the actual system, the Lorenz chaotic system is perturbed by the environment, and is then represented as
\dot{v}_1 = (a + δa)(v_2 − v_1) + d_1 + u_1
\dot{v}_2 = (b + δb − v_3)v_1 − v_2 + d_2 + u_2
\dot{v}_3 = −(c + δc)v_3 + v_1 v_2 + d_3 + u_3    (20)
The system parameters are a = 10, b = 30, c = 8/3; the parameter perturbations are δa = 0.1, δb = 0.2, δc = 0.2; and the state perturbations are d_1 = 0.03 sin t, d_2 = 0.01 cos t, d_3 = 0.02 sin(3t). The parameters of the Laguerre orthogonal neural network are adjusted to accomplish the synchronization. The other parameters are A = diag(−1, −1, −1), Q = diag(2, 2, 2), η_0 = 0.02, P = diag(1, 1, 1).
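With these choices the Lyapunov equation (7) holds exactly, since A^T P + PA = −2I = −Q; a quick numerical confirmation (our own check, not part of the paper):

```python
# Verify the Lyapunov equation (7) for the simulation parameters.
import numpy as np

A = np.diag([-1.0, -1.0, -1.0])
P = np.diag([1.0, 1.0, 1.0])
Q = np.diag([2.0, 2.0, 2.0])
assert np.allclose(A.T @ P + P @ A, -Q)   # A^T P + P A = -Q holds
```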
Fig. 1. The response of error e_1
Fig. 2. The response of error e_2
Fig. 3. The response of error e_3
The error responses are shown in Figs. 1-3. These diagrams show that the synchronization of the two chaotic systems is achieved by the proposed method; the numerical example thus demonstrates its validity.
5 Conclusion

In this paper, an orthogonal function neural network based on Laguerre orthogonal polynomials is utilized to realize the synchronization of chaotic systems. The parameters of the orthogonal neural network are adjusted, by the Lyapunov stability theorem, to accomplish the synchronization of two chaotic systems with parameter perturbations. The proposed method guarantees the synchronization of the two perturbed chaotic systems.
Acknowledgement

This work was supported by the National Natural Science Foundation of China (60674061).
References
1. Liu, F., Ren, Y., Shan, X.M., Qiu, Z.L.: A Linear Feedback Synchronization Theorem for a Class of Chaotic Systems. Chaos, Solitons and Fractals 13 (2002) 723-730
2. Sarasola, C., Torrealdea, F.J.: Cost of Synchronizing Different Chaotic Systems. Mathematics and Computers in Simulation 58 (2002) 309-327
3. Shahverdiev, E.M., Sivaprakasam, S., Shore, K.A.: Lag Synchronization in Time-Delayed Systems. Physics Letters A 292 (2002) 320-324
4. Tsui, A., Jones, A.: Periodic Response to External Stimulation of a Chaotic Neural Network with Delayed Feedback. International Journal of Bifurcation and Chaos 9 (1999) 713-722
5. Tan, W., Wang, Y.N., Liu, Z.R., Zhou, S.W.: Neural Network Control for Nonlinear Chaotic Motion. Acta Physica Sinica 51 (2002) 2463-2466
6. Li, Z., Han, C.S.: Adaptive Control for a Class of Chaotic Systems with Uncertain Parameters. Acta Physica Sinica 50 (2002) 847-850
7. Alvarez-Ramirez, J., Cervantes, I.: Stability of Observer-Based Chaotic Communications for a Class of Lur'e Systems. International Journal of Bifurcation and Chaos 12 (2002) 1605-1618
8. Grassi, G., Mascolo, S.: Nonlinear Observer Design to Synchronize Hyperchaotic Systems via a Scalar Signal. IEEE Transactions on Circuits and Systems 44 (1997) 1011-1014
9. Jiang, G.P., Zheng, W.X.: An LMI Criterion for Chaos Synchronization via the Linear-State-Feedback Approach. IEEE International Symposium on Computer Aided Control System Design (2004) 368-371
10. Isidori, A.: Nonlinear Control Systems. 3rd Ed., Springer-Verlag, New York, USA, 1995
11. Hill, D.J., Moylan, P.: The Stability of Nonlinear Dissipative Systems. IEEE Transactions on Automatic Control 21 (1976) 708-711
12. Knobloch, H.W., Isidori, A., Flockerzi, D.: Topics in Control Theory. Birkhäuser, Boston, USA, 1993
13. Yu, G.R.: Fuzzy Synchronization of Chaos Using Gray Prediction for Secure Communication. IEEE International Conference on Systems, Man, and Cybernetics 4 (2004) 3104-3109
14. Hyun, C.H., Kim, J.J., Kim, E.: Adaptive Fuzzy Observer Based Synchronization Design and Secure Communications of Chaotic Systems. Chaos, Solitons and Fractals 27 (4) (2006) 930-940
15. Vasegh, N., Majd, V.J.: Adaptive Fuzzy Synchronization of Discrete-Time Chaotic Systems. Chaos, Solitons and Fractals 27 (4) (2006) 1029-1036
Chaos Synchronization Between Unified Chaotic System and Genesio System

Xianyong Wu^{1,2}, Zhi-Hong Guan^1, and Tao Li^1

^1 Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
^2 School of Electronics and Information, Yangtze University, Jingzhou, Hubei, 434023, China
[email protected]

Abstract. This work presents chaos synchronization between two different chaotic systems via active control and adaptive control. Synchronization between the unified chaotic system and the Genesio system is investigated, and different controllers are designed to synchronize the drive and response systems. Numerical simulations show the effectiveness of the proposed schemes.
1 Introduction

Since Pecora and Carroll introduced a method [1] to synchronize two identical chaotic systems with different initial conditions, chaos synchronization, an important topic in nonlinear science, has been investigated and studied extensively in the last few years. A variety of approaches have been proposed for the synchronization of chaotic systems, such as drive-response synchronization [2], linear and nonlinear feedback synchronization [3], adaptive synchronization [4-6], coupled synchronization [7,8], the active control method [9,10], impulsive synchronization [11,12], etc. Most of the methods mentioned above synchronize two identical chaotic systems with known or unknown parameters. However, the synchronization of two different chaotic systems is far from straightforward because of their different structures and parameter mismatch. In practice, it is hardly ever the case that every component can be assumed to be identical, especially when chaos synchronization is applied to secure communication, in which the structures of the drive and response systems are different. Therefore, synchronization of two different chaotic systems in the presence of known or unknown parameters is more essential and useful in real-life applications. Recently, Bai and Lonngren studied synchronization of unified chaotic systems via active control [10], and Ref. [13] used a backstepping approach to synchronize two Genesio systems. However, synchronization between the unified chaotic system and the Genesio system is seldom reported. In this paper, we propose a scheme to synchronize the unified chaotic system and the Genesio system, which have different structures, by two different methods: active control is applied when the system parameters are known, and adaptive synchronization is employed when the system parameters are unknown or uncertain. The controllers and the adaptive laws of the parameters are designed based on Lyapunov stability theory.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 8–15, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Systems Description and Mathematical Models

Consider the nonlinear chaotic systems

\dot{x} = f(t, x)
\dot{y} = g(t, y) + u(t, x, y)    (1)
where x, y ∈ R^n and f, g : R × R^n → R^n are differentiable functions; the first equation in (1) is the drive system, the second one is the response system, and u(t, x, y) is the control input. Let e = y − x be the synchronization error. Our goal is to design a controller u such that the trajectory of the response system with initial condition y_0 asymptotically approaches that of the drive system with initial condition x_0 and finally implements synchronization, in the sense that

lim_{t→∞} ‖e‖ = lim_{t→∞} ‖y(t, y_0) − x(t, x_0)‖ = 0
where ‖·‖ is the Euclidean norm.

The Genesio system, proposed by Genesio and Tesi [14], is one of the paradigms of chaos, since it captures many features of chaotic systems. It includes a simple square nonlinearity and three simple ordinary differential equations that depend on three negative real parameters. The dynamic equation of the system is

\dot{x} = y
\dot{y} = z
\dot{z} = ax + by + cz + x²    (2)
where x, y, z are state variables; when a = −6, b = −2.92, c = −1.2, system (2) is chaotic.

Lü et al. proposed a unified chaotic system [15], which is described by

\dot{x}_1 = (25α + 10)(y_1 − x_1)
\dot{y}_1 = (28 − 35α)x_1 + (29α − 1)y_1 − x_1 z_1
\dot{z}_1 = x_1 y_1 − ((8 + α)/3) z_1    (3)

where α ∈ [0, 1]. Obviously, system (3) becomes the original Lorenz system for α = 0, while system (3) becomes the original Chen system for α = 1. When α = 0.8, system (3) becomes the critical system. In particular, system (3) bridges the gap between the Lorenz system and the Chen system. Moreover, system (3) is always chaotic over the whole interval α ∈ [0, 1]. In the next sections, we will study chaos synchronization between the unified chaotic system and the Genesio system by two different methods.
3 Synchronization Between Unified Chaotic System and Genesio System Via Active Control

In order to observe the synchronization behavior between the unified chaotic system and the Genesio system via active control, we assume that the Genesio system (2) is the drive system and the controlled unified chaotic system (4) is the response system:

\dot{x}_1 = (25α + 10)(y_1 − x_1) + u_1
\dot{y}_1 = (28 − 35α)x_1 + (29α − 1)y_1 − x_1 z_1 + u_2
\dot{z}_1 = x_1 y_1 − ((8 + α)/3) z_1 + u_3    (4)
Three control functions u1, u2, u3 are introduced in system (4). In order to determine the control functions that realize synchronization between systems (2) and (4), we subtract (2) from (4) and get

$$\dot{e}_1 = (25\alpha+10)(y_1 - x_1) - y + u_1, \quad \dot{e}_2 = (28-35\alpha)x_1 + (29\alpha-1)y_1 - x_1 z_1 - z + u_2, \quad \dot{e}_3 = x_1 y_1 - \frac{8+\alpha}{3} z_1 - ax - by - cz - x^2 + u_3, \tag{5}$$
where e1 = x1 − x, e2 = y1 − y, e3 = z1 − z. We define the active control functions u1, u2 and u3 as follows:

$$u_1 = -(25\alpha+10)(y_1 - x) + y + V_1, \quad u_2 = -(28-35\alpha)x_1 - (29\alpha-1)y + x_1 z_1 + z + V_2, \quad u_3 = -x_1 y_1 + \frac{8+\alpha}{3} z + ax + by + cz + x^2 + V_3. \tag{6}$$
Hence the error system (5) becomes

$$\dot{e}_1 = -(25\alpha+10)e_1 + V_1, \quad \dot{e}_2 = (29\alpha-1)e_2 + V_2, \quad \dot{e}_3 = -\frac{8+\alpha}{3} e_3 + V_3. \tag{7}$$
The error system (7) to be controlled is a linear system with control inputs V1, V2 and V3 as functions of the error states e1, e2 and e3. As long as these feedbacks stabilize the system, e1, e2 and e3 converge to zero as time t tends to infinity. This implies that the unified chaotic system and the Genesio system are synchronized under feedback control. There are many possible choices for the controls V1, V2 and V3. We choose
$$\begin{bmatrix} V_1 \\ V_2 \\ V_3 \end{bmatrix} = A \begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix},$$

where A is a 3×3 constant matrix. In order to make the closed-loop system stable, the elements of the matrix A must be chosen so that the feedback system has all eigenvalues with negative real parts. Let the matrix A be chosen in the following form:

$$A = \begin{bmatrix} 25\alpha + 9 & 0 & 0 \\ 0 & -29\alpha & 0 \\ 0 & 0 & \dfrac{5+\alpha}{3} \end{bmatrix}.$$
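The eigenvalue claim of this choice is easy to verify numerically. The following minimal sketch (our illustration, assuming NumPy; the variable names are ours) forms the closed-loop matrix of (7) under the feedback V = Ae and checks that all three eigenvalues equal −1 for a sample α.

```python
import numpy as np

alpha = 0.2   # the cancellation below is exact for any alpha in [0, 1]
# Open-loop error dynamics (7): de/dt = M e + V, with feedback V = A e.
M = np.diag([-(25 * alpha + 10), 29 * alpha - 1, -(8 + alpha) / 3])
A = np.diag([25 * alpha + 9, -29 * alpha, (5 + alpha) / 3])
eigs = np.linalg.eigvals(M + A)
print(eigs)   # all three eigenvalues equal -1
```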
Chaos Synchronization Between Unified Chaotic System and Genesio System
With this particular choice, the closed-loop system (7) has the eigenvalues −1, −1 and −1. This choice makes the error states e1, e2 and e3 converge to zero as time t tends to infinity, and hence synchronization between the unified chaotic system and the Genesio system is achieved. In the simulation, the fourth-order Runge-Kutta method is used to solve the two systems of differential equations (2) and (4) with time step 0.001. We select the parameter of the unified chaotic system as α = 0.2 and the parameters of the Genesio system as a = −6, b = −2.92, c = −1.2. The initial values of the drive and response systems are (x(0), y(0), z(0)) = (1, 2, 3) and (x1(0), y1(0), z1(0)) = (−1, −2, 5), respectively, so the initial errors of system (5) are (e1(0), e2(0), e3(0)) = (−2, −4, 2). Fig. 1 shows the synchronization errors between the unified chaotic system and the Genesio system; one can see that the response system traces the drive system rapidly and the two finally become the same.
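The whole active-control experiment can be sketched in a few lines. The code below (an illustrative reconstruction, not the authors' original program) integrates the drive system (2) and the controlled response system (4) with the controllers (6) and V = Ae, using the parameters and initial values stated above; with this choice the error obeys ė = −e, so ‖e(t)‖ decays like e^(−t).

```python
import numpy as np

a, b, c = -6.0, -2.92, -1.2                    # Genesio parameters
alpha = 0.2                                    # unified-system parameter
A = np.diag([25 * alpha + 9, -29 * alpha, (5 + alpha) / 3])

def coupled(s):
    x, y, z, x1, y1, z1 = s
    e = np.array([x1 - x, y1 - y, z1 - z])
    V = A @ e
    # Active controllers (6)
    u1 = -(25 * alpha + 10) * (y1 - x) + y + V[0]
    u2 = -(28 - 35 * alpha) * x1 - (29 * alpha - 1) * y + x1 * z1 + z + V[1]
    u3 = -x1 * y1 + (8 + alpha) / 3 * z + a * x + b * y + c * z + x**2 + V[2]
    return np.array([
        y, z, a * x + b * y + c * z + x**2,                         # drive (2)
        (25 * alpha + 10) * (y1 - x1) + u1,                         # response (4)
        (28 - 35 * alpha) * x1 + (29 * alpha - 1) * y1 - x1 * z1 + u2,
        x1 * y1 - (8 + alpha) / 3 * z1 + u3,
    ])

def rk4(f, s, h):
    a1 = f(s); a2 = f(s + h / 2 * a1); a3 = f(s + h / 2 * a2); a4 = f(s + h * a3)
    return s + h / 6 * (a1 + 2 * a2 + 2 * a3 + a4)

s = np.array([1.0, 2.0, 3.0, -1.0, -2.0, 5.0])
for _ in range(10000):                         # integrate to t = 10, step 0.001
    s = rk4(coupled, s, 0.001)
err = np.linalg.norm(s[3:] - s[:3])
print(err)                                     # roughly ||e(0)|| * exp(-10)
```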
Fig. 1. Synchronization errors e1 , e2 , e3 between unified chaotic system and Genesio system via active control
4 Adaptive Synchronization Between Unified Chaotic System and Genesio System with Unknown Parameters

In order to compare with the active control method, we still assume that the Genesio system (2) is the drive system and the controlled unified chaotic system (8) is the response system.
$$\dot{x}_1 = (25\alpha+10)(y_1 - x_1) + u_1, \quad \dot{y}_1 = (28-35\alpha)x_1 + (29\alpha-1)y_1 - x_1 z_1 + u_2, \quad \dot{z}_1 = x_1 y_1 - \frac{8+\alpha}{3} z_1 + u_3. \tag{8}$$

Subtracting (2) from (8) yields
$$\dot{e}_1 = (25\alpha+10)(y_1 - x_1) - y + u_1, \quad \dot{e}_2 = (28-35\alpha)x_1 + (29\alpha-1)y_1 - x_1 z_1 - z + u_2, \quad \dot{e}_3 = x_1 y_1 - \frac{8+\alpha}{3} z_1 - ax - by - cz - x^2 + u_3, \tag{9}$$
where e1 = x1 − x, e2 = y1 − y, e3 = z1 − z. Our goal is to find proper controllers ui (i = 1, 2, 3) and parameter adaptive laws such that system (8) globally synchronizes with system (2) asymptotically, i.e., $\lim_{t\to\infty}\|e(t)\| = 0$, where e = [e1, e2, e3]ᵀ.

Theorem: If the controllers are chosen as
$$u_1 = -(25\hat{\alpha}+10)(y_1 - x_1) + y - k_1 e_1, \quad u_2 = -(28-35\hat{\alpha})x_1 - (29\hat{\alpha}-1)y_1 + x_1 z_1 + z - k_2 e_2, \quad u_3 = -x_1 y_1 + \frac{8+\hat{\alpha}}{3} z_1 + \hat{a}x + \hat{b}y + \hat{c}z + x^2 - k_3 e_3, \tag{10}$$
and the adaptive laws of the parameters are chosen as

$$\dot{\hat{a}} = -x e_3, \quad \dot{\hat{b}} = -y e_3, \quad \dot{\hat{c}} = -z e_3, \quad \dot{\hat{\alpha}} = 25(y_1 - x_1)e_1 - (35x_1 - 29y_1)e_2 - \tfrac{1}{3} z_1 e_3, \tag{11}$$
then system (8) globally synchronizes with system (2) asymptotically, where ki (i = 1, 2, 3) are positive constants and â, b̂, ĉ, α̂ are the estimates of a, b, c, α, respectively.

Proof: Applying the control laws (10) to (9) yields the error dynamics
$$\dot{e}_1 = -25\tilde{\alpha}(y_1 - x_1) - k_1 e_1, \quad \dot{e}_2 = 35\tilde{\alpha}x_1 - 29\tilde{\alpha}y_1 - k_2 e_2, \quad \dot{e}_3 = \tilde{a}x + \tilde{b}y + \tilde{c}z + \tfrac{1}{3}\tilde{\alpha}z_1 - k_3 e_3, \tag{12}$$

where $\tilde{a} = \hat{a} - a$, $\tilde{b} = \hat{b} - b$, $\tilde{c} = \hat{c} - c$, and $\tilde{\alpha} = \hat{\alpha} - \alpha$.
Consider the following Lyapunov function:

$$V = \frac{1}{2}\left( e^T e + \tilde{a}^2 + \tilde{b}^2 + \tilde{c}^2 + \tilde{\alpha}^2 \right).$$

The time derivative of V along the solutions of the error dynamical system (12) gives
$$\begin{aligned}
\frac{dV}{dt} &= e^T\dot{e} + \tilde{a}\dot{\hat{a}} + \tilde{b}\dot{\hat{b}} + \tilde{c}\dot{\hat{c}} + \tilde{\alpha}\dot{\hat{\alpha}} \\
&= e_1\big[-25\tilde{\alpha}(y_1-x_1) - k_1 e_1\big] + e_2\big[35\tilde{\alpha}x_1 - 29\tilde{\alpha}y_1 - k_2 e_2\big] + e_3\big[\tilde{a}x + \tilde{b}y + \tilde{c}z + \tfrac{1}{3}\tilde{\alpha}z_1 - k_3 e_3\big] \\
&\quad + \tilde{a}(-x e_3) + \tilde{b}(-y e_3) + \tilde{c}(-z e_3) + \tilde{\alpha}\big[25(y_1-x_1)e_1 - (35x_1 - 29y_1)e_2 - \tfrac{1}{3}z_1 e_3\big] \\
&= -k_1 e_1^2 - k_2 e_2^2 - k_3 e_3^2 = -e^T P e \le 0,
\end{aligned}$$
where P = diag{k1, k2, k3}. Since V is positive definite and dV/dt is negative semi-definite in a neighborhood of the zero solution of system (12), it follows that e, ã, b̃, c̃, α̃ ∈ L∞. From

$$\int_0^t \lambda_{\min}(P)\|e\|^2\,dt \le \int_0^t e^T P e\,dt = \int_0^t -\dot{V}\,dt = V(0) - V(t) \le V(0),$$

where λ_min(P) is the minimal eigenvalue of the positive definite matrix P, we have e ∈ L2. From Eq. (12), ė ∈ L∞, so by Barbalat's lemma lim_{t→∞} e = 0. Thus the response system (8) globally synchronizes with the drive system (2) asymptotically. This completes the proof.

In the simulation, the fourth-order Runge-Kutta method is used to solve the two systems of differential equations (2) and (8). We select the parameter of the unified chaotic system as α = 0.95 and the parameters of the Genesio system as a = −6, b = −2.92,
Fig. 2. Synchronization errors e1 , e2 , e3 between unified chaotic system and Genesio system via adaptive control
Fig. 3. Adaptive parameters aˆ , bˆ, cˆ of Genesio system
Fig. 4. Adaptive parameters αˆ of unified chaotic system
c = −1.2, ki = 2 (i = 1, 2, 3). The initial values of the drive and response systems are (x(0), y(0), z(0)) = (1, 2, 3) and (x1(0), y1(0), z1(0)) = (−1, −2, 5), respectively, so the initial errors of system (9) are (e1(0), e2(0), e3(0)) = (−2, −4, 2); the initial values of the parameter estimates are â(0) = b̂(0) = ĉ(0) = 1, α̂(0) = 2. The synchronization errors between the unified chaotic system and the Genesio system are shown in Fig. 2. The estimates of the parameters a, b, c and α are shown in Figs. 3 and 4, respectively. Obviously, the synchronization errors converge asymptotically to zero, so the two different systems indeed achieve chaos synchronization. Furthermore, the estimates of the parameters converge to their true values.
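The adaptive scheme (10)-(11) can likewise be simulated. The sketch below (our reconstruction under the paper's parameter values, with a coarser step size than the text's 0.001 to keep it fast) integrates the ten coupled equations for the six states and the four estimates and prints the final synchronization error; the controller uses only the estimates, never the true parameters.

```python
import numpy as np

a, b, c = -6.0, -2.92, -1.2   # true Genesio parameters (unknown to the controller)
alpha = 0.95                   # true unified-system parameter
k1 = k2 = k3 = 2.0

def rhs(s):
    x, y, z, x1, y1, z1, ah, bh, ch, alh = s
    e1, e2, e3 = x1 - x, y1 - y, z1 - z
    # Controllers (10), built from the parameter estimates only
    u1 = -(25 * alh + 10) * (y1 - x1) + y - k1 * e1
    u2 = -(28 - 35 * alh) * x1 - (29 * alh - 1) * y1 + x1 * z1 + z - k2 * e2
    u3 = (-x1 * y1 + (8 + alh) / 3 * z1
          + ah * x + bh * y + ch * z + x**2 - k3 * e3)
    return np.array([
        y, z, a * x + b * y + c * z + x**2,                          # drive (2)
        (25 * alpha + 10) * (y1 - x1) + u1,                          # response (8)
        (28 - 35 * alpha) * x1 + (29 * alpha - 1) * y1 - x1 * z1 + u2,
        x1 * y1 - (8 + alpha) / 3 * z1 + u3,
        -x * e3, -y * e3, -z * e3,                                   # adaptive laws (11)
        25 * (y1 - x1) * e1 - (35 * x1 - 29 * y1) * e2 - z1 * e3 / 3,
    ])

def rk4(f, s, h):
    a1 = f(s); a2 = f(s + h / 2 * a1); a3 = f(s + h / 2 * a2); a4 = f(s + h * a3)
    return s + h / 6 * (a1 + 2 * a2 + 2 * a3 + a4)

s = np.array([1, 2, 3, -1, -2, 5, 1, 1, 1, 2], dtype=float)
for _ in range(50000):          # integrate to t = 100, step 0.002
    s = rk4(rhs, s, 0.002)
err = np.linalg.norm(s[3:6] - s[:3])
print(err, s[6:])               # error shrinks; estimates drift toward (a, b, c, alpha)
```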
5 Conclusions

This paper presents two chaos synchronization schemes between the unified chaotic system and the Genesio system, which have different structures and parameters. Active control is used when the system parameters are known, and adaptive control is used when they are unknown. Computer simulations show the effectiveness of the proposed schemes.
Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grants 60573005 and 60603006.
References

1. Pecora, L.M., Carroll, T.L.: Synchronization in chaotic systems. Phys. Rev. Lett. 64 (1990) 821-824
2. Yang, X.S., Duan, C.K., Liao, X.X.: A note on mathematical aspects of drive-response type synchronization. Chaos, Solitons & Fractals 10 (1999) 1457-1462
3. Lu, J., Wu, X., Han, X., Lü, J.H.: Adaptive feedback synchronization of a unified chaotic system. Phys. Lett. A 329 (2004) 327-333
4. Femat, R., et al.: Adaptive synchronization of high-order chaotic systems: a feedback with low-order parameterization. Physica D 139 (2000) 231-246
5. Yassen, M.T.: Adaptive synchronization of two different uncertain chaotic systems. Phys. Lett. A 337 (2005) 335-341
6. Feki, M.: An adaptive chaos synchronization scheme applied to secure communication. Chaos, Solitons & Fractals 18 (2003) 141-148
7. Lü, J.H., Zhou, T.S., Zhang, S.C.: Chaos synchronization between linearly coupled chaotic systems. Chaos, Solitons & Fractals 14 (2002) 529-541
8. Alexeyev, A.A., Shalfeev, V.D.: Chaotic synchronization of mutually coupled generators with frequency-controlled feedback loop. Int. J. Bifurcat. Chaos 5 (1995) 551-557
9. Ho, M.C., Hung, Y.C.: Synchronization of two different systems by using generalized active control. Phys. Lett. A 301 (2002) 424-428
10. Ucar, A., Lonngren, K.E., Bai, E.W.: Synchronization of the unified chaotic systems via active control. Chaos, Solitons & Fractals 27 (2006) 1292-1297
11. Chen, S., Yang, Q., Wang, C.: Impulsive control and synchronization of unified chaotic system. Chaos, Solitons & Fractals 20 (2004) 751-758
12. Yang, T., Chua, L.O.: Impulsive control and synchronization of nonlinear dynamical systems and application to secure communication. Int. J. Bifurcat. Chaos 7 (1997) 645-664
13. Park, J.H.: Synchronization of Genesio chaotic system via backstepping approach. Chaos, Solitons & Fractals 27 (2006) 1369-1375
14. Genesio, R., Tesi, A.: A harmonic balance method for the analysis of chaotic dynamics in nonlinear systems. Automatica 28 (1992) 531-548
15. Lü, J.H., Chen, G., Cheng, D.Z., Celikovsky, S.: Bridge the gap between the Lorenz system and the Chen system. Int. J. Bifurcat. Chaos 12 (2002) 2917-2926
Robust Impulsive Synchronization of Coupled Delayed Neural Networks

Lan Xiang¹, Jin Zhou², and Zengrong Liu³

¹ Department of Physics, School of Science, Shanghai University, Shanghai 200444, P.R. China, [email protected]
² Shanghai Institute of Applied Mathematics and Mechanics, Shanghai University, Shanghai 200072, P.R. China, [email protected]
³ Institute of System Biology, Shanghai University, Shanghai 200444, P.R. China
[email protected]

Abstract. The present paper studies robust impulsive synchronization of coupled delayed neural networks. Based on impulsive control theory for dynamical systems, a simple yet less conservative criterion ensuring robust impulsive synchronization of coupled delayed neural networks is established. Furthermore, the theoretical result is applied to a typical scale-free (SF) network composed of representative chaotic delayed Hopfield neural network nodes, and numerical results are presented to demonstrate the effectiveness of the proposed control techniques.
1 Introduction
Over the last decade, control and synchronization of coupled chaotic dynamical systems have attracted a great deal of attention due to their potential applications in many fields, including secure communications, chemical reactions, biological systems and information science [1], [2], [3], [4]. As a typical example, synchronization of coupled neural networks is currently an active area of research, and a wide variety of strategies have been developed for chaos synchronization; see [3], [5], [6], [7], [8], [9], [10] and the references therein.

In the past several years, impulsive control has been widely used to stabilize and synchronize chaotic dynamical systems due to its potential advantages over general continuous control schemes [12], [13]. There are many important results focusing mainly on well-known chaotic dynamical systems such as the Lorenz system, Rössler system, Chua system, Duffing oscillator, Brusselator oscillator and so on [1], [13]. It has been proved, in the study of chaos synchronization, that the impulsive synchronization approach is effective and robust. Moreover, the controllers used usually have a relatively simple structure. In an impulsive synchronization scheme, only
Corresponding author.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 16–23, 2007. © Springer-Verlag Berlin Heidelberg 2007
the synchronization impulses are sent to the receiving systems at the impulsive instants, which decreases the information redundancy in the transmitted signal and increases robustness against disturbances. In this sense, impulsive synchronization schemes are very useful in practical applications, such as digital secure communication systems [4]. Therefore, the investigation of impulsive synchronization for coupled delayed neural networks is an important step toward the practical design and application of delayed neural networks.

This paper is mainly concerned with robust impulsive synchronization of coupled delayed neural networks. Based on impulsive control theory for delayed dynamical systems, a simple yet less conservative criterion is derived for robust impulsive synchronization of coupled delayed neural networks. It is shown that the approaches developed here further extend the ideas and techniques presented in the recent literature, and they are also simple to implement in practice. Finally, a typical scale-free (SF) network composed of representative chaotic delayed Hopfield neural network nodes is used as an example to illustrate this impulsive control scheme, and numerical simulations demonstrate the effectiveness and feasibility of the proposed control techniques.
2 Problem Formulations and Preliminaries
First, we consider a dynamical system consisting of N linearly coupled identical delayed neural networks, described by the following set of differential equations with time delays [3]:

$$\dot{x}_i(t) = -Cx_i(t) + Af(x_i(t)) + A^{\tau} g(x_i(t-\tau)) + I(t) + \sum_{j=1}^{N} b_{ij}\Gamma x_j(t), \quad i = 1, 2, \cdots, N, \tag{1}$$
where $x_i(t) = (x_{i1}(t), \cdots, x_{in}(t))^T$ is the state vector of the ith delayed neural network, $C = \mathrm{diag}(c_1, \ldots, c_n)$ is a diagonal matrix with positive diagonal entries $c_r > 0$ (r = 1, ⋯, n), $A = (a^0_{rs})_{n\times n}$ and $A^\tau = (a^\tau_{rs})_{n\times n}$ denote the connection weight matrix and the delayed connection weight matrix, respectively, $I(t) = (I_1(t), \cdots, I_n(t))^T$ is an external input vector, τ is the time delay, and the activation function vectors are $f(x_i(t)) = [f_1(x_{i1}(t)), \cdots, f_n(x_{in}(t))]^T$ and $g(x_i(t)) = [g_1(x_{i1}(t)), \cdots, g_n(x_{in}(t))]^T$. Here we assume that the activation functions $f_r$ and $g_r$ are globally Lipschitz continuous, i.e.,

(A₁) There exist constants $k_r > 0$, $l_r > 0$, r = 1, 2, ⋯, n, such that for any two different $x_1, x_2 \in R$,

$$0 \le \frac{f_r(x_1) - f_r(x_2)}{x_1 - x_2} \le k_r, \qquad |g_r(x_1) - g_r(x_2)| \le l_r |x_1 - x_2|, \quad r = 1, 2, \cdots, n.$$
For simplicity, we further assume that the inner connecting matrix Γ = diag(γ1 , · · · , γn ), and the coupling matrix B = (bij )N ×N is the Laplacian matrix, i.e., a symmetric irreducible matrix with zero-sum and real spectrum. This
implies that zero is an eigenvalue of B with multiplicity 1 and all the other eigenvalues of B are strictly negative [3].

Next, we consider impulsive control for robust synchronization of the coupled delayed neural network (1). By adding an impulsive controller $\{t_k, I_{ik}(t, x_i(t))\}$ to the ith dynamical node of (1), we obtain the following impulsively controlled coupled delayed neural network:

$$\begin{cases} \dot{x}_i(t) = -Cx_i(t) + Af(x_i(t)) + A^{\tau} g(x_i(t-\tau)) + I(t) + \displaystyle\sum_{j=1}^{N} b_{ij}\Gamma x_j(t), & t \ne t_k,\ t \ge t_0, \\ \Delta x_i = I_{ik}(t, x_i(t)), & t = t_k,\ k = 1, 2, \cdots, \end{cases} \tag{2}$$

where i = 1, 2, ⋯, N, the time sequence $\{t_k\}_{k=1}^{+\infty}$ satisfies $t_{k-1} < t_k$ and $\lim_{k\to\infty} t_k = +\infty$, and $\Delta x_i = x_i(t_k^+) - x_i(t_k^-)$ is the control law, in which $x_i(t_k^+) = \lim_{t\to t_k^+} x_i(t)$ and $x_i(t_k^-) = \lim_{t\to t_k^-} x_i(t)$. Without loss of generality, we assume that $\lim_{t\to t_k^+} x_i(t) = x_i(t_k)$, which means the solution x(t) is continuous from the right. The initial conditions of Eq. (2) are given by $x_i(t) = \phi_i(t) \in PC([t_0-\tau, t_0], R^n)$, where $PC([t_0-\tau, t_0], R^n)$ denotes the set of all functions of bounded variation that are right-continuous on any compact subinterval of $[t_0-\tau, t_0]$. We always assume that Eq. (2) has a unique solution with respect to the initial conditions. Clearly, if $I_{ik}(t, x_i(t)) = 0$, then the controlled model (2) reduces to the well-known continuous coupled delayed neural network (1) [3].

The main objective of this paper is to design and implement an appropriate impulsive controller $\{t_k, I_{ik}(t, x_i(t))\}$ such that the states of the controlled coupled delayed neural network (2) achieve synchronization, i.e.,

$$\lim_{t\to+\infty} \|x_i(t) - s(t)\| = 0, \quad i = 1, 2, \cdots, N, \tag{3}$$

where s(t) is called the synchronization state of the controlled coupled delayed neural network (2). It may be an equilibrium point, a periodic orbit, or a chaotic attractor. Throughout this paper, we define the synchronization state of (2) as $s(t) = \frac{1}{N}\sum_{i=1}^{N} x_i(t)$, where $x_i(t)$ (i = 1, 2, ⋯, N) are the solutions of the continuous coupled delayed neural network (1) [11].

For later use, the definition of robust impulsive synchronization of the controlled coupled delayed neural network (2) and the famous Halanay-type impulsive delay differential inequality are introduced as follows.

Definition 1. The controlled coupled delayed neural network (2) is robustly exponentially synchronized if there exist constants ε > 0 and M > 0 such that, for all $\phi_i(t) \in PC([t_0-\tau, t_0], R^n)$,

$$\|x_i(t) - s(t)\| \le M e^{-\varepsilon(t - t_0)}, \quad t \ge t_0, \quad i = 1, 2, \cdots, N. \tag{4}$$
Lemma 1. [12] Suppose p > q ≥ 0 and u(t) satisfies the scalar impulsive differential inequality

$$\begin{cases} D^+ u(t) \le -p\,u(t) + q \big( \sup_{t-\tau \le s \le t} u(s) \big), & t \ne t_k,\ t \ge t_0, \\ u(t_k) \le \alpha_k u(t_k^-), & \\ u(t) = \phi(t), & t \in [t_0 - \tau, t_0], \end{cases} \tag{5}$$

where u(t) is continuous at $t \ne t_k$, $t \ge t_0$, $u(t_k) = u(t_k^+) = \lim_{s\to 0^+} u(t_k + s)$, $u(t_k^-) = \lim_{s\to 0^-} u(t_k + s)$ exists, and $\phi \in PC([t_0-\tau, t_0], R)$. Then

$$u(t) \le \Big( \prod_{t_0 < t_k \le t} \theta_k \Big) e^{-\mu(t - t_0)} \big( \sup_{t_0 - \tau \le s \le t_0} \phi(s) \big), \tag{6}$$

where $\theta_k = \max\{1, \alpha_k\}$ and μ > 0 is a solution of the inequality $\mu - p + q e^{\mu\tau} \le 0$.
3 Robust Impulsive Synchronization
Based on impulsive control theory for delayed dynamical systems, the following sufficient condition for robust impulsive synchronization of the controlled coupled delayed neural network (2) is established.

Theorem 1. Consider the controlled coupled delayed neural network (2). Let the impulsive controller be

$$u_i(t, x_i) = \sum_{k=1}^{+\infty} I_{ik}(t, x_i(t))\,\delta(t - t_k) = \sum_{k=1}^{+\infty} d_k \big( x_i(t_k^-) - s(t) \big)\,\delta(t - t_k), \tag{7}$$

where $d_k$ is a constant called the control gain and δ(t) is the Dirac function, and let the eigenvalues of the coupling matrix B be ordered as

$$0 = \lambda_1 > \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_N. \tag{8}$$
Assume that, in addition to (A₁), the following conditions are satisfied for all i = 1, 2, ⋯, n and $k \in Z^+ = \{1, 2, \cdots\}$:

(A₂) There exist n positive numbers $\delta_1, \cdots, \delta_n$ and two numbers

$$p_i = \delta_i + c_i - (a^0_{ii})^+ k_i - \frac{1}{2}\sum_{j \ne i} \big( |a^0_{ij}| k_j + |a^0_{ji}| k_i \big) - \frac{1}{2}\sum_{j=1}^{n} |a^\tau_{ij}| l_j, \qquad q_i = \frac{1}{2}\sum_{j=1}^{n} |a^\tau_{ji}| l_i,$$

such that $p = \min_{1\le i\le n}\{2p_i\} > q = \max_{1\le i\le n}\{2q_i\}$ and $\gamma_i \lambda(\gamma_i) + \delta_i \le 0$, where $(a^0_{ii})^+ = \max\{a^0_{ii}, 0\}$ and

$$\lambda(\gamma_i) = \begin{cases} \lambda_2, & \gamma_i > 0, \\ 0, & \gamma_i = 0, \\ \lambda_N, & \gamma_i < 0. \end{cases}$$
(A₃) Let μ > 0 satisfy $\mu - p + q e^{\mu\tau} \le 0$, and let

$$\theta_k = \max\{1, (1 + d_k)^2\}, \qquad \theta = \sup_{k \in Z^+} \frac{\ln \theta_k}{t_k - t_{k-1}}, \tag{9}$$

be such that θ < μ. Then the controlled coupled delayed neural network (2) is robustly exponentially synchronized.

Brief Proof. Let $v_i(t) = x_i(t) - s(t)$ (i = 1, 2, ⋯, N). Then the error dynamical system can be rewritten as

$$\begin{cases} \dot{v}_i(t) = -Cv_i(t) + A\tilde{f}(v_i(t)) + A^\tau \tilde{g}(v_i(t-\tau)) + \displaystyle\sum_{j=1}^{N} b_{ij}\Gamma v_j(t) + J, & t \ne t_k,\ t \ge t_0, \\ v_i(t_k) = (1 + d_k)\, v_i(t_k^-), & t = t_k,\ k = 1, 2, \cdots, \end{cases} \tag{10}$$

where $\tilde{f}(v_i(t)) = f(v_i(t) + s(t)) - f(s(t))$, $\tilde{g}(v_i(t-\tau)) = g(v_i(t-\tau) + s(t-\tau)) - g(s(t-\tau))$, and $J = Af(s(t)) + A^\tau g(s(t-\tau)) - \frac{1}{N}\sum_{k=1}^{N}\big[Af(x_k(t)) + A^\tau g(x_k(t-\tau))\big]$.

Let us construct the Lyapunov function

$$V(t) = \frac{1}{2}\sum_{i=1}^{N} v_i^T(t)\, v_i(t). \tag{11}$$
Calculating the upper Dini derivative of V(t) along the solutions of Eq. (10), using Condition (A₁) and noting that $\sum_{i=1}^{N} v_i(t) = 0$, we get, for $t \ne t_k$,

$$\begin{aligned}
D^+ V(t) &\le \sum_{i=1}^{N}\sum_{r=1}^{n} \Big\{ \Big[ -\delta_r - c_r + (a^0_{rr})^+ k_r + \frac{1}{2}\sum_{s \ne r} \big( |a^0_{rs}| k_s + |a^0_{sr}| k_r \big) + \frac{1}{2}\sum_{s=1}^{n} |a^\tau_{rs}| l_s \Big] v_{ir}^2(t) \\
&\qquad + \frac{1}{2}\sum_{s=1}^{n} |a^\tau_{sr}| l_r\, v_{ir}^2(t-\tau) \Big\} + \sum_{i=1}^{N} v_i^T(t) \Big[ \sum_{j=1}^{N} b_{ij}\Gamma v_j(t) + \mathrm{diag}(\delta_1, \ldots, \delta_n)\, v_i(t) \Big] \\
&\le -pV(t) + qV(t-\tau) + \sum_{j=1}^{n} \bar{v}_j^T(t)\,(\gamma_j B + \delta_j I_N)\,\bar{v}_j(t),
\end{aligned} \tag{12}$$

where $\bar{v}_j(t) = (\bar{v}_{1j}(t), \cdots, \bar{v}_{Nj}(t))^T \in L \stackrel{\mathrm{def}}{=} \{ z = (z_1, \cdots, z_N)^T \in R^N \mid \sum_{i=1}^{N} z_i = 0 \}$, from which it can be concluded that if $\gamma_j \lambda(\gamma_j) + \delta_j \le 0$, then $\bar{v}_j^T(t)(\gamma_j B + \delta_j I_N)\bar{v}_j(t) \le 0$. This leads to

$$D^+ V(t) \le -pV(t) + q\big( \sup_{t-\tau \le s \le t} V(s) \big). \tag{13}$$
On the other hand, from the construction of V(t), we have

$$V(t_k) = \frac{1}{2}(1 + d_k)^2 \sum_{j=1}^{N} v_j^T(t_k^-)\, v_j(t_k^-) = (1 + d_k)^2\, V(t_k^-). \tag{14}$$
It then follows from Lemma 1 that if θ < μ, then for all $t > t_0$,

$$V(t) \le e^{-(\mu - \theta)(t - t_0)} \Big( \sup_{t_0 - \tau \le s \le t_0} V(s) \Big). \tag{15}$$
This completes the proof of Theorem 1.

Remark 1. From the proof of Theorem 1, it should be noted that, unlike previous investigations in [4], [5], [6], our strategy is to control all the states of the dynamical network to its synchronization state s(t), where s(t) need not be a solution of an isolated dynamical node. Moreover, it can be seen from (A₂) and (A₃) that robust impulsive synchronization of the controlled coupled delayed neural network (2) depends not only on the coupling matrix B, the inner connecting matrix Γ and the time delay τ, but is also heavily determined by the impulsive control gain $d_k$ and the impulsive control interval $t_k - t_{k-1}$. Therefore, the approaches developed here further extend the ideas and techniques presented in the recent literature, and they are also simple to implement in practice.

Example 1. Consider a model of the controlled coupled delayed neural network:

$$\begin{cases} \dot{x}_i(t) = -Cx_i(t) + Af(x_i(t)) + A^{\tau} g(x_i(t-\tau)) + I(t) + \displaystyle\sum_{j=1}^{N} b_{ij}\Gamma x_j(t), & t \ne t_k,\ t \ge t_0, \\ x_i(t_k) = s(t_k) + (1 + d_k)\big( x_i(t_k^-) - s(t_k) \big), & t = t_k,\ k = 1, 2, \cdots, \end{cases} \tag{16}$$

in which $x_i(t) = (x_{i1}(t), x_{i2}(t))^T$, $f(x_i(t)) = g(x_i(t)) = (\tanh(x_{i1}(t)), \tanh(x_{i2}(t)))^T$ (i = 1, ⋯, 100), $I(t) = (0, 0)^T$, and

$$C = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad A = \begin{pmatrix} 2.0 & -0.1 \\ -5.0 & 3.0 \end{pmatrix}, \quad A^\tau = \begin{pmatrix} -1.5 & -0.1 \\ -0.2 & -2.5 \end{pmatrix},$$

where the synchronization state of the coupled delayed neural network (16) is defined as $s(t) = \frac{1}{100}\sum_{k=1}^{100} x_k(t)$.

It should be noted that the isolated neural network

$$\dot{x}(t) = -Cx(t) + Af(x(t)) + A^\tau g(x(t-1)) \tag{17}$$

is actually a chaotic delayed Hopfield neural network [8], [9] (see Fig. 1(a)).
Now we consider a scale-free network with 100 dynamical nodes. We take the parameters N = 100, m = m₀ = 5 and κ = 3; the coupling matrix B = B_sf of the SF network is then randomly generated by the B-A scale-free model [11]. In this simulation, the second-largest eigenvalue and the smallest eigenvalue of the coupling matrix B_sf are λ₂ = −1.2412 and λ₁₀₀ = −34.1491, respectively. For simplicity, we consider the equidistant impulsive interval $t_k - t_{k-1} = 0.1$ and $d_k = -0.5000$ ($k \in Z^+$). By taking $k_r = l_r = 1$ and $\delta_r = \frac{1}{2}$ (r = 1, 2), it is easy to verify that if γ₁ = γ₂ = 6, then all the conditions of Theorem 1 are satisfied. Hence, the controlled coupled delayed neural network (16) achieves robust impulsive synchronization. The simulation results corresponding to this situation are shown in Fig. 1(b).
Fig. 1. (a) A fully developed double-scroll-like chaotic attractor of the isolated delayed Hopfield neural network (17). (b) Impulsive synchronization process of the state variables in the controlled coupled delayed neural network (16).
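A reduced version of Example 1 can be simulated directly. The sketch below is our construction, not the paper's program: it replaces the 100-node SF network by a small all-to-all coupled network with the same node dynamics (17), Γ = 6I, d_k = −0.5 and impulsive interval 0.1, uses a simple Euler scheme with a constant initial history, applies the impulses x_i ← s + (1 + d_k)(x_i − s), and checks that the spread of the node states shrinks.

```python
import numpy as np

# Node dynamics (17): chaotic delayed Hopfield network parameters
C = np.eye(2)
A = np.array([[2.0, -0.1], [-5.0, 3.0]])
At = np.array([[-1.5, -0.1], [-0.2, -2.5]])
tau, h = 1.0, 0.001

N = 5                                   # small all-to-all network (the paper
B = np.ones((N, N)) - N * np.eye(N)     # uses a 100-node SF coupling matrix)
Gamma = 6.0 * np.eye(2)
dk, gap = -0.5, 0.1                     # impulsive gain and interval

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, (N, 2))
n_hist = int(round(tau / h))
hist = [x.copy()] * n_hist              # constant initial history on [-tau, 0]

def spread(x):
    # Synchronization error: distance of the states from their mean s(t).
    return np.linalg.norm(x - x.mean(axis=0))

e0 = spread(x)
steps_per_impulse = int(round(gap / h))
for step in range(1, int(round(8.0 / h)) + 1):
    xd = hist[-n_hist]                  # delayed state x(t - tau)
    dx = -x @ C.T + np.tanh(x) @ A.T + np.tanh(xd) @ At.T + (B @ x) @ Gamma.T
    x = x + h * dx                      # forward Euler step
    if step % steps_per_impulse == 0:   # impulse: pull every node toward the mean
        s = x.mean(axis=0)
        x = s + (1 + dk) * (x - s)
    hist.append(x.copy())
print(e0, spread(x))                    # the spread shrinks by many orders
```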
4 Conclusions
In this paper, we have investigated the issues of robust impulsive synchronization of coupled delayed neural networks. A simple criterion for robust impulsive synchronization of such dynamical networks has been derived analytically. It is shown that the theoretical results can be applied to some typical chaotic neural networks such as delayed Hopfield neural networks and delayed cellular neural networks (CNN). The numerical results are given to verify and also visualize the theoretical results. Acknowledgments. This work was supported by the National Science Foundation of China (Grant nos. 60474071 and 10672094), the Science Foundation of Shanghai Education Commission (Grant no. 06AZ101), the Shanghai Leading Academic Discipline Project (Project nos. Y0103 and T0103) and the Shanghai Key Laboratory of Power Station Automation Technology.
References

1. Chen, G., Dong, X.: From Chaos to Order: Methodologies, Perspectives, and Applications. World Scientific, Singapore (1998)
2. Wu, C.W., Chua, L.O.: Synchronization in an Array of Linearly Coupled Dynamical Systems. IEEE Trans. CAS-I 42 (1995) 430-447
3. Chen, G., Zhou, J., Liu, Z.: Global Synchronization of Coupled Delayed Neural Networks and Applications to Chaotic CNN Model. Int. J. Bifur. Chaos 14 (2004) 2229-2240
4. Liu, B., Liu, X., Chen, G.: Robust Impulsive Synchronization of Uncertain Dynamical Networks. IEEE Trans. CAS-I 52 (2005) 1431-1441
5. Wang, W., Cao, J.: Synchronization in an Array of Linearly Coupled Networks with Time-varying Delay. Physica A 366 (2006) 197-211
6. Li, P., Cao, J., Wang, Z.: Robust Impulsive Synchronization of Coupled Delayed Neural Networks with Uncertainties. Physica A 373 (2006) 261-272
7. Zhou, J., Chen, T., Xiang, L.: Adaptive Synchronization of Coupled Chaotic Systems Based on Parameters Identification and Its Applications. Int. J. Bifur. Chaos 16 (2004) 2923-2933
8. Zhou, J., Chen, T., Xiang, L.: Robust Synchronization of Delayed Neural Networks Based on Adaptive Control and Parameters Identification. Chaos, Solitons & Fractals 27 (2006) 905-913
9. Zhou, J., Chen, T., Xiang, L.: Chaotic Lag Synchronization of Coupled Delayed Neural Networks and Its Applications in Secure Communication. Circuits, Systems and Signal Processing 24 (2005) 599-613
10. Zhou, J., Chen, T., Xiang, L.: Global Synchronization of Impulsive Coupled Delayed Neural Networks. In: Wang, J., Yi, Z. (eds.): Advances in Neural Networks - ISNN 2006. Lecture Notes in Computer Science, Vol. 3971. Springer-Verlag, Berlin Heidelberg New York (2006) 303-308
11. Zhou, J., Chen, T.: Synchronization in General Complex Delayed Dynamical Networks. IEEE Trans. CAS-I 53 (2006) 733-744
12. Yang, Z., Xu, D.: Stability Analysis of Delay Neural Networks with Impulsive Effects. IEEE Trans. CAS-II 52 (2005) 517-521
13. Yang, T.: Impulsive Control Theory. Springer-Verlag, Berlin Heidelberg New York (2001)
Synchronization of Impulsive Fuzzy Cellular Neural Networks with Parameter Mismatches

Tingwen Huang¹ and Chuandong Li²

¹ Texas A&M University at Qatar, Doha, P.O. Box 5825, Qatar, [email protected]
² College of Computer Science, Chongqing University, Chongqing 400030, China
[email protected]

Abstract. In this paper, we study the effect of parameter mismatches on fuzzy neural networks with impulses. Since it is impossible to make two non-identical neural networks completely synchronized, we study the synchronization of two such networks in terms of quasi-synchronization. Using the Lyapunov method and the linear matrix inequality method, we obtain a sufficient condition for a global synchronization error bound of the two neural networks.
1 Introduction
Since L. Pecora and T. Carroll [16] published their pioneering work on synchronization of chaos, synchronization of chaotic systems has been investigated intensively by many researchers [2,3,4,6,7,9-12,14-19,24,25] in various fields such as applied mathematics, physics and engineering, due to its practical applications such as secure communication. The most common regime of synchronization investigated is complete synchronization, which implies the coincidence of the states of the interacting (master and response) systems. However, due to parameter mismatch [2,6,7,9,18], which is unavoidable in real implementations, the master system and response system are not identical and the resulting synchronization is not exact: it is impossible to achieve complete synchronization. It is possible, however, to make the synchronization error bounded by a small positive constant ε that depends on the differences between the parameters of the two fuzzy neural networks. To the best of our knowledge, no results have been reported on quasi-synchronization of two non-identical fuzzy neural networks. In this paper, we investigate the effect of parameter mismatches on chaos synchronization of fuzzy neural networks under impulsive control. It is known that the main obstacle for impulsive synchronization in the presence of parameter mismatches is to obtain a good estimate of the synchronization error bound. To overcome this problem, we derive a numerically tractable, though suboptimal, sufficient condition using linear decomposition and the comparison-system method.

This paper is organized as follows. In Section 2, the problem is formulated and some preliminaries are given. In Section 3, the main results are presented. In Section 4, conclusions are drawn.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 24–32, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Problem Formulation and Preliminaries
Consider the driving system described by the following fuzzy cellular neural network:

$$\frac{dx_i}{dt} = -d_i x_i(t) + \bigwedge_{j=1}^{n}\alpha_{ij} f_j(x_j(t)) + \bigvee_{j=1}^{n}\beta_{ij} f_j(x_j(t)) + \bigvee_{j=1}^{n} H_{ij}\mu_j + \bigwedge_{j=1}^{n} T_{ij}\mu_j + I_i, \tag{1}$$
where i = 1, ⋯, n; $\alpha_{ij}$, $\beta_{ij}$, $T_{ij}$ and $H_{ij}$ are elements of the fuzzy feedback MIN template, fuzzy feedback MAX template, fuzzy feed-forward MIN template and fuzzy feed-forward MAX template, respectively; ⋀ and ⋁ denote the fuzzy AND and fuzzy OR operations, respectively; $x_i$, $\mu_i$ and $I_i$ denote the state, input and bias of the ith neuron, respectively; $f_i$ is the activation function.

At discrete instants $\tau_k$, k = 1, 2, 3, ⋯, the state variables of the driving system are transmitted to the driven system, and the state variables $y = (y_1, \cdots, y_n)^T$ of the driven system are subjected to sudden changes at these instants. In this sense, the driven system is described by the impulsive fuzzy neural network

$$\begin{cases} \dfrac{dy_i}{dt} = -\bar{d}_i y_i(t) + \displaystyle\bigwedge_{j=1}^{n}\bar{\alpha}_{ij} f_j(y_j(t)) + \bigvee_{j=1}^{n}\bar{\beta}_{ij} f_j(y_j(t)) + \bigvee_{j=1}^{n} H_{ij}\mu_j + \bigwedge_{j=1}^{n} T_{ij}\mu_j + I_i, & t \ne \tau_k, \\ \Delta y|_{t=\tau_k} = y(\tau_k^+) - y(\tau_k^-) = -Ce, & t = \tau_k, \end{cases} \tag{2}$$
where i = 1, ⋯, n, k = 1, 2, 3, ⋯, $C \in R^{n\times n}$ is the control gain, and e = x − y is the synchronization error. In general $\bar{d}_i$, $\bar{\alpha}_{ij}$, $\bar{\beta}_{ij}$ differ from $d_i$, $\alpha_{ij}$, $\beta_{ij}$; in other words, there exist parameter mismatches between the driving and driven systems. From systems (1) and (2), the error system of the impulsive synchronization is given by

$$\begin{cases} \dfrac{de_i(t)}{dt} = -d_i e_i(t) + (\bar{d}_i - d_i)y_i(t) + \displaystyle\bigwedge_{j=1}^{n}\alpha_{ij} f_j(x_j(t)) - \bigwedge_{j=1}^{n}\bar{\alpha}_{ij} f_j(y_j(t)) + \bigvee_{j=1}^{n}\beta_{ij} f_j(x_j(t)) - \bigvee_{j=1}^{n}\bar{\beta}_{ij} f_j(y_j(t)), & t \ne \tau_k, \\ \Delta y|_{t=\tau_k} = y(\tau_k^+) - y(\tau_k^-) = -Ce, & t = \tau_k. \end{cases} \tag{3}$$
Remark 1: It is clear that the origin e = 0 is not an equilibrium point of equation (3) when parameter mismatches exist, so complete synchronization between systems (1) and (2) is impossible.
In this paper, we assume that:

(H) $f_i$ is a bounded function defined on R and satisfies, for any x, y ∈ R,

$$|f_i(x) - f_i(y)| \le l_i |x - y|, \quad i = 1, \cdots, n. \tag{4}$$

In the following, we cite several concepts on quasi-synchronization and impulsive differential equations.

Definition 1. ([9]) Let χ denote a region of interest in the phase space that contains the chaotic attractor of system (1). The synchronization scheme (1)-(2) is said to be uniformly quasi-synchronized with error bound ε > 0 if there exists a $T \ge t_0$ such that $\|x(t) - y(t)\| \le \varepsilon$ for all t ≥ T, starting from any initial values $x(t_0) \in \chi$ and $y(t_0) \in \chi$.

Definition 2. ([23]) A function $V : R^+ \times R^n \to R^+$ is said to belong to class Σ if
1) V is continuous in $(\tau_{k-1}, \tau_k) \times R^n$ and, for each $x \in R^n$, k = 1, 2, ⋯, $\lim_{(t,y)\to(\tau_k^+, x)} V(t, y) = V(\tau_k^+, x)$ exists;
2) V is locally Lipschitzian in x.

For the following general impulsive differential equation

$$\begin{cases} \dot{x} = g(t, x), & t \ne \tau_k, \\ x(\tau_k^+) = \psi_k(x(\tau_k)), & t = \tau_k, \\ x(t_0) = x_0, & t_0 \ge 0, \end{cases} \tag{5}$$
the right-upper Dini derivative of V ∈ Σ is defined as follows:

Definition 3. ([23]) For (t, x) ∈ (τk−1, τk] × R^n, the right-upper Dini derivative of V ∈ Σ with respect to time is defined as

D^+V(t, x) = lim sup_{h→0^+} (1/h) {V[t + h, x + h g(t, x)] − V(t, x)}.   (6)
Definition 4. ([23]) For the impulsive system (5), let V ∈ Σ and assume that

D^+V(t, x) ≤ g[t, V(t, x)],   t ≠ τk,
V[t, ψk(x)] ≤ ψk[V(t, x)],   t = τk,   (7)

where g : R^+ × R^+ → R is continuous with g(t, 0) = 0, and ψk : R^+ → R^+ is non-decreasing. Then the system

ω̇ = g(t, ω),   t ≠ τk,
ω(τk^+) = ψk(ω(τk)),   t = τk,
ω(t0) = ω0,   t0 ≥ 0,   (8)

is called the comparison system for (5). For convenience, we give the matrix notations here. For A, B ∈ R^{n×n}, A ≤ B (A > B) means that each pair of corresponding elements of A and B satisfies the inequality ≤ (>). Also, if A = (aij), then |A| = (|aij|).
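As a side illustration (ours, not part of the paper), the behavior captured by system (5) — continuous flow under g between impulse instants plus a jump map ψk at each τk — can be simulated with a basic Euler scheme; the scalar choices g(t, x) = λx + δ and ψk(x) = ρx below mirror the comparison system used later in the proof:

```python
def simulate_impulsive(g, psi, x0, t0, t_end, taus, h=1e-4):
    """Euler-integrate x' = g(t, x) between impulses; apply x -> psi(x) at each tau_k."""
    t, x = t0, x0
    jumps = sorted(tau for tau in taus if t0 < tau < t_end)
    for tau in jumps + [t_end]:
        while t < tau:                      # continuous flow on (tau_{k-1}, tau_k)
            step = min(h, tau - t)
            x = x + step * g(t, x)
            t = t + step
        if tau < t_end:                     # impulsive jump at tau_k
            x = psi(x)
    return x

# Scalar example in the spirit of the comparison dynamics used later:
# z' = lam*z + delta between impulses, z(tau_k+) = rho * z(tau_k-).
lam, delta, rho = -1.0, 0.1, 0.5
z_final = simulate_impulsive(lambda t, z: lam * z + delta,
                             lambda z: rho * z,
                             1.0, 0.0, 5.0, taus=[1.0, 2.0, 3.0, 4.0])
```

With contracting flow and contracting impulses, the state settles into a small neighborhood of the origin rather than at the origin itself, which is exactly the quasi-synchronization picture of Definition 1.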
Synchronization of Impulsive Fuzzy Cellular Neural Networks

3 Main Results
In this section, we will obtain a sufficient condition for quasi-synchronization and, at the same time, estimate the synchronization bound using a Lyapunov-like function. Before stating the main results, we first recall the following lemma.

Lemma 1. ([22]). For any aij ∈ R and xj, yj ∈ R, i, j = 1, · · · , n, we have the following estimates:

|⋀_{j=1}^n aij xj − ⋀_{j=1}^n aij yj| ≤ Σ_{1≤j≤n} (|aij| · |xj − yj|)   (9)

and

|⋁_{j=1}^n aij xj − ⋁_{j=1}^n aij yj| ≤ Σ_{1≤j≤n} (|aij| · |xj − yj|).   (10)
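Identifying the fuzzy AND ⋀ with a componentwise minimum and the fuzzy OR ⋁ with a maximum, Lemma 1 can be spot-checked numerically (a sanity check of ours, not part of the original proof):

```python
import random

def fuzzy_and(a_row, x):            # fuzzy AND: min over j of a_ij * x_j
    return min(a * v for a, v in zip(a_row, x))

def fuzzy_or(a_row, x):             # fuzzy OR: max over j of a_ij * x_j
    return max(a * v for a, v in zip(a_row, x))

random.seed(0)
for _ in range(1000):
    a = [random.uniform(-2, 2) for _ in range(5)]
    x = [random.uniform(-3, 3) for _ in range(5)]
    y = [random.uniform(-3, 3) for _ in range(5)]
    bound = sum(abs(aj) * abs(xj - yj) for aj, xj, yj in zip(a, x, y))
    # Both (9) and (10) hold because min and max are 1-Lipschitz in the sup norm.
    assert abs(fuzzy_and(a, x) - fuzzy_and(a, y)) <= bound + 1e-12
    assert abs(fuzzy_or(a, x) - fuzzy_or(a, y)) <= bound + 1e-12
```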
Let D = diag(d1, · · · , dn), L = diag(l1, · · · , ln), D̄ = diag(d̄1, · · · , d̄n), A = (αij)_{n×n}, Ā = (ᾱij)_{n×n}, B = (βij)_{n×n}, B̄ = (β̄ij)_{n×n}, ΔD = D̄ − D, ΔA = Ā − A, ΔB = B̄ − B. Now we are ready to state and prove the main result on the synchronization of the driving system (1) and the driven system (2).

Theorem 1. Let χ = {x ∈ R^n : ||x|| ≤ δ1}, let the parameter mismatches satisfy ΔD^T ΔD + ΔA^T ΔA + ΔB^T ΔB ≤ δ2^2 I with δ = δ1^2 δ2^2, and let the sequence of impulses be equidistant, separated by an interval τ. If there exist a symmetric positive definite matrix P > 0 and constants λ and 0 < ρ < 1 such that the following conditions hold:

(i) −2P D + P^2 + 2L|A| + P^2 L^2 + 2L|B| + P^2 L^2 − λP ≤ 0,
(ii) (I + C)^T P (I + C) − ρP ≤ 0,
(iii) ln ρ + λτ < 0,

then the synchronization error system (3) converges exponentially to a small region containing the origin, namely

{e ∈ R^n : ||e|| ≤ √(−τδ / (λm(P) ρ (ln ρ + τλ)))}.   (11)

Thus, quasi-synchronization between systems (1) and (2) is achieved with error bound ε = √(−τδ / (λm(P) ρ (ln ρ + τλ))); by condition (iii), ln ρ + τλ < 0, so the quantity under the square root is positive.

Proof. Consider the following Lyapunov-like function:

V(e(t)) = e(t)^T P e(t).   (12)

From (3) and Lemma 1, for t ∈ (τk^+, τ_{k+1}], we have

dei(t)/dt = −di ei(t) + (d̄i − di)yi(t) + ⋀_{j=1}^n αij fj(xj(t)) − ⋀_{j=1}^n ᾱij fj(yj(t)) + ⋁_{j=1}^n βij fj(xj(t)) − ⋁_{j=1}^n β̄ij fj(yj(t))
= −di ei(t) + (d̄i − di)yi(t) + [⋀_{j=1}^n αij fj(xj(t)) − ⋀_{j=1}^n αij fj(yj(t))] + [⋀_{j=1}^n αij fj(yj(t)) − ⋀_{j=1}^n ᾱij fj(yj(t))]
+ [⋁_{j=1}^n βij fj(xj(t)) − ⋁_{j=1}^n βij fj(yj(t))] + [⋁_{j=1}^n βij fj(yj(t)) − ⋁_{j=1}^n β̄ij fj(yj(t))]
≤ −di ei(t) + (d̄i − di)yi(t) + Σ_{j=1}^n lj |αij| |xj(t) − yj(t)| + Σ_{j=1}^n lj |αij − ᾱij| |yj(t)|
+ Σ_{j=1}^n lj |βij| |xj(t) − yj(t)| + Σ_{j=1}^n lj |βij − β̄ij| |yj(t)|.   (13)

We write this inequality in matrix form:

de(t)/dt ≤ −De(t) + (D̄ − D)y(t) + L|A| |e(t)| + L|Ā − A| |y(t)| + L|B| |e(t)| + L|B̄ − B| |y(t)|.   (14)
Now we calculate the upper right derivative of V with respect to time t ∈ (τk^+, τ_{k+1}] along the solution of (3):

D^+V(e(t)) = 2e(t)^T P (de(t)/dt)
≤ 2e(t)^T P {−De(t) + (D̄ − D)y(t) + L|A| e(t) + L|Ā − A| |y(t)| + L|B| e(t) + L|B̄ − B| |y(t)|}
≤ −2e(t)^T P D e(t) + e(t)^T P^2 e(t) + y(t)^T (D̄ − D)^T (D̄ − D) y(t)
+ 2e(t)^T L|A| e(t) + e(t)^T P^2 L^2 e(t) + y(t)^T (Ā − A)^T (Ā − A) y(t)
+ 2e(t)^T L|B| e(t) + e(t)^T P^2 L^2 e(t) + y(t)^T (B̄ − B)^T (B̄ − B) y(t)
= e(t)^T (−2P D + P^2 + 2L|A| + P^2 L^2 + 2L|B| + P^2 L^2) e(t) + y(t)^T (ΔD^T ΔD + ΔA^T ΔA + ΔB^T ΔB) y(t)
= e(t)^T (−2P D + P^2 + 2L|A| + P^2 L^2 + 2L|B| + P^2 L^2 − λP) e(t) + λ e(t)^T P e(t) + y(t)^T (ΔD^T ΔD + ΔA^T ΔA + ΔB^T ΔB) y(t)
≤ λ e(t)^T P e(t) + y(t)^T (ΔD^T ΔD + ΔA^T ΔA + ΔB^T ΔB) y(t)
≤ λ V(e(t)) + δ.   (15)

At the impulse instants, we get

V(e(τk^+)) = V((I + C)e(τk^−)) = e(τk^−)^T (I + C)^T P (I + C) e(τk^−)
= e(τk^−)^T [(I + C)^T P (I + C) − ρP] e(τk^−) + ρ e(τk^−)^T P e(τk^−)
≤ ρ e(τk^−)^T P e(τk^−) = ρ V(e(τk^−)).
(16)

Thus, the error system has the following comparison system:

ż(t) = λ z(t) + δ,   t ≠ τk,
z(τk^+) = ρ z(τk^−),   t = τk,
z(t0) = z0 = V(e(t0^+)).   (17)
To estimate the solution of (17) explicitly, first consider the linear reference system for (17):

ż(t) = λ z(t),   t ≠ τk,
z(τk^+) = ρ z(τk^−),   t = τk,
z(t0) = z0 = V(e(t0^+)).   (18)

The unique solution of this equation is

z(t, t0, z0) = ρ^{n(t,t0)} e^{λ(t−t0)} z0,   t > t0,   (19)

where n(t, t0) = ⌊(t − t0)/τ⌋ and ⌊·⌋ denotes the floor operation. Since ρ < 1 and (t − s)/τ − 1 < n(t, s) ≤ (t − s)/τ for t > s, we have ρ^{n(t,s)} < ρ^{−1} ρ^{(t−s)/τ}; thus

z(t, s, z(s)) ≤ ρ^{−1} (ρ^{1/τ} e^λ)^{t−s} z(s),   t > s ≥ t0.   (20)
The solution of equation (17) with initial value z0 satisfies

z(t, t0, z0) = ρ^{n(t,t0)} e^{λ(t−t0)} z0 + ∫_{t0}^{t} ρ^{n(t,s)} e^{λ(t−s)} δ ds
≤ ρ^{−1} (ρ^{1/τ} e^λ)^{t−t0} z0 + δ ρ^{−1} ∫_{t0}^{t} (ρ^{1/τ} e^λ)^{t−s} ds
= ρ^{−1} (ρ^{1/τ} e^λ)^{t−t0} z0 − (τδ / (ρ(ln ρ + τλ))) [1 − (ρ^{1/τ} e^λ)^{t−t0}].   (21)
By Theorem 3.1.1 in [23], we have

V(e(t)) = e(t)^T P e(t) ≤ z(t, t0, z0),   t > t0,   (22)

where z0 = V(e(t0)). Let λm(P) denote the minimal eigenvalue of the matrix P. From (21) and (22), we have

λm(P) ||e(t)||^2 ≤ e(t)^T P e(t) = V(e(t)) ≤ z(t, t0, z0)
≤ ρ^{−1} (ρ^{1/τ} e^λ)^{t−t0} z0 − (τδ / (ρ(ln ρ + τλ))) [1 − (ρ^{1/τ} e^λ)^{t−t0}],   (23)

so

||e(t)||^2 ≤ (1/λm(P)) {ρ^{−1} (ρ^{1/τ} e^λ)^{t−t0} z0 − (τδ / (ρ(ln ρ + τλ))) [1 − (ρ^{1/τ} e^λ)^{t−t0}]}.   (24)

Since ρ^{1/τ} e^λ < 1 by condition (iii), the terms containing (ρ^{1/τ} e^λ)^{t−t0} on the right side of (24) decay to 0 exponentially as t approaches ∞. Thus, there exists a large T > 0 such that, for t ≥ T,

||e(t)||^2 ≤ −τδ / (λm(P) ρ (ln ρ + τλ)),   (25)

namely,

||e(t)|| ≤ √(−τδ / (λm(P) ρ (ln ρ + τλ))).   (26)

The proof of the theorem is completed.
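For intuition (our illustration, with made-up constants), the error bound of (26) can be evaluated directly; condition (iii), ln ρ + λτ < 0, is what keeps the quantity under the square root positive, and the bound shrinks as the impulse interval τ shrinks:

```python
import math

def quasi_sync_bound(tau, delta, lam_min_P, rho, lam):
    """Quasi-synchronization error bound of Eq. (26)."""
    decay = math.log(rho) + tau * lam
    assert decay < 0, "condition (iii) ln(rho) + lam*tau < 0 is violated"
    return math.sqrt(-tau * delta / (lam_min_P * rho * decay))

# Illustrative constants (not taken from the paper).
eps = quasi_sync_bound(tau=0.1, delta=0.05, lam_min_P=1.0, rho=0.5, lam=2.0)
```

Halving τ here yields a strictly smaller ε, matching the intuition that more frequent impulsive coupling tightens the quasi-synchronization region.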
4 Conclusion

Since parameter mismatches are inevitable and have a detrimental effect on the synchronization quality between the driving and driven systems, it is important to determine the effect of parameter mismatch on synchronization. In this paper, we have investigated the synchronization of two systems with parameter mismatches using the Lyapunov method and the comparison theorem. We obtained a sufficient condition for quasi-synchronization with error bound ε of two non-identical fuzzy neural networks under impulsive control.
Acknowledgments

The first author is grateful for the support of Texas A&M University at Qatar. This work was also partially supported by the National Natural Science Foundation of China (Grant No. 60574024).
References

1. Arik, S.: Global Robust Stability of Delayed Neural Networks. IEEE Trans. Circ. Syst. I 50 (2003) 156-160
2. Astakhov, V., Hasler, M., Kapitaniak, T., Shabunin, A., Anishchenko, V.: Effect of Parameter Mismatch on the Mechanism of Chaos Synchronization Loss in Coupled Systems. Physical Review E 58 (1998) 5620-5628
3. Cao, J., Li, P., Wang, W.: Global Synchronization in Arrays of Delayed Neural Networks with Constant and Delayed Coupling. Physics Letters A 353 (2006) 318-325
4. Cao, J., Li, H., Ho, D.: Synchronization Criteria of Lur'e Systems with Time-delay Feedback Control. Chaos Solitons & Fractals 23 (2005) 1285-1298
5. Huang, T.: Exponential Stability of Fuzzy Cellular Neural Networks with Distributed Delay. Physics Letters A 351 (2006) 48-52
6. Jalnine, A., Kim, S.-Y.: Characterization of the Parameter-mismatching Effect on the Loss of Chaos Synchronization. Physical Review E 65 (2002) 026210
7. Leung, H., Zhu, Z.: Time-varying Synchronization of Chaotic Systems in the Presence of System Mismatch. Physical Review E 69 (2004) 026201
8. Liu, Y., Tang, W.: Exponential Stability of Fuzzy Cellular Neural Networks with Constant and Time-varying Delays. Physics Letters A 323 (2004) 224-233
9. Li, C., Chen, G., Liao, X., Fan, Z.: Chaos Quasisynchronization Induced by Impulses with Parameter Mismatches. Chaos 16 (2006) 023102
10. Li, C., Chen, G., Liao, X., Zhang, X.: Impulsive Synchronization of Chaotic Systems. Chaos 15 (2005) 023104
11. Li, C., Liao, X., Yang, X., Huang, T.: Impulsive Stabilization and Synchronization of a Class of Chaotic Delay Systems. Chaos 15 (2005) 043103
12. Li, C., Liao, X., Wong, K.W.: Chaotic Lag Synchronization of Coupled Time-delayed Systems and Its Applications in Secure Communication. Physica D 194 (2004) 187-202
13. Liao, X., Wu, Z., Yu, J.: Stability Analyses for Cellular Neural Networks with Continuous Delay. Journal of Computational and Applied Mathematics 143 (2002) 29-47
14. Lu, J., Cao, J.: Adaptive Complete Synchronization of Two Identical or Different Chaotic (Hyperchaotic) Systems with Fully Unknown Parameters. Chaos 15 (2005) 043901
15. Lu, W., Chen, T.: New Approach to Synchronization Analysis of Linearly Coupled Ordinary Differential Systems. Physica D 213 (2006) 214-230
16. Pecora, L., Carroll, T.: Synchronization in Chaotic Systems. Physical Review Letters 64 (1990) 821-824
17. Wang, W., Cao, J.: Synchronization in an Array of Linearly Coupled Networks with Time-varying Delay. Physica A 366 (2006) 197-211
18. Wu, C.W., Chua, L.O.: A Unified Framework for Synchronization and Control of Dynamical Systems. Int. J. Bifurcation Chaos 4 (1994) 979-989
19. Xiong, W., Xie, W., Cao, J.: Adaptive Exponential Synchronization of Delayed Chaotic Networks. Physica A 370 (2006) 832-842
20. Yang, T., Yang, L.B., Wu, C.W., Chua, L.O.: Fuzzy Cellular Neural Networks: Theory. In: Proc. of the IEEE International Workshop on Cellular Neural Networks and Applications (1996) 181-186
21. Yang, T., Yang, L.B., Wu, C.W., Chua, L.O.: Fuzzy Cellular Neural Networks: Applications. In: Proc. of the IEEE International Workshop on Cellular Neural Networks and Applications (1996) 225-230
22. Yang, T., Yang, L.B.: The Global Stability of Fuzzy Cellular Neural Networks. IEEE Trans. Circuits and Systems I: Fundamental Theory and Applications 43 (1996) 880-883
23. Yang, T.: Impulsive Control Theory. Springer, Berlin (2001)
24. Zhang, X., Liao, X., Li, C.: Impulsive Control, Complete and Lag Synchronization of Unified Chaotic System with Continuous Periodic Switch. Chaos Solitons & Fractals 26 (2005) 845-854
25. Zhou, J., Chen, T., Xiang, L.: Robust Synchronization of Delayed Neural Networks Based on Adaptive Control and Parameters Identification. Chaos Solitons & Fractals 27 (2006) 905-913
Global Synchronization in an Array of Delayed Neural Networks with Nonlinear Coupling

Jinling Liang¹, Ping Li¹, and Yongqing Yang²

¹ Department of Mathematics, Southeast University, Nanjing, 210096, China
² School of Science, Southern Yangtze University, Wuxi, 214122, China
[email protected] Abstract. In this paper, synchronization is investigated for an array of nonlinearly coupled identical connected neural networks with delay. By employing the Lyapunov functional method and the Kronecker product technique, several sufficient conditions are derived. It is shown that global exponential synchronization of the coupled neural networks is guaranteed by a suitable design of the coupling matrix, the inner linking matrix and some free matrices representing the relationships between the system matrices. The conditions obtained in this paper are in the form of linear matrix inequalities, which can be easily computed and checked in practice. A typical example with chaotic nodes is finally given to illustrate the effectiveness of the proposed synchronization scheme.
1 Introduction
Dynamical behaviors of recurrent neural networks have been deeply investigated in the past decades due to their successful applications in optimization, signal processing, pattern recognition and associative memories, especially in processing static images [1]. Most of the previous studies concentrated predominantly on the stability analysis, periodic oscillations and dissipativity of such neural networks [2]. However, complex dynamics such as bifurcation and chaotic phenomena have also been shown to exist in these networks [3]. On the other hand, both theoretical studies and practical experiments have shown that synchronization phenomena occur generically in many settings, such as in the mammalian brain, in language emergence and in arrays of coupled identical neural networks. Arrays of coupled systems have received much attention recently because they can exhibit many interesting phenomena, such as spatio-temporal chaos and autowaves, and they can be utilized in engineering fields such as secure communication, chaos generator design and harmonic oscillation generation [4,5]. Synchronization of coupled chaotic systems has been extensively investigated; for more information one may refer to [6-10, 12-13] and the references cited therein. However, in these papers the coupling terms of the models studied are always linear; to the best of our knowledge, up till now there are very few results on arrays of nonlinearly coupled neural networks. Based on the

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 33–39, 2007.
© Springer-Verlag Berlin Heidelberg 2007
above discussions, in this paper the following nonlinearly coupled neural network model will be studied:

dxi(t)/dt = −Cxi(t) + Af(xi(t)) + Bf(xi(t − τ)) + I(t) + Σ_{j=1}^N Gij D f(xj(t)),   (1)
where i = 1, 2, . . . , N and xi(t) = (xi1(t), . . . , xin(t))^T is the state vector of the ith network at time t; C = diag(c1, . . . , cn) > 0 denotes the rate with which cell i resets its potential to the resting state when isolated from other cells and inputs; A and B are the weight matrix and the delayed weight matrix, respectively; the activation function f(xi(t)) = (f1(xi1(t)), . . . , fn(xin(t)))^T; I(t) = (I1(t), . . . , In(t))^T is the external input and τ > 0 represents the transmission delay; D is an n×n matrix and G = (Gij)_{N×N} denotes the coupling configuration of the array, satisfying the diffusive coupling conditions

Gij = Gji (i ≠ j),   Gii = −Σ_{j=1, j≠i}^N Gij,   for i, j = 1, 2, . . . , N.   (2)
For simplicity, let x(t) = (x1^T(t), x2^T(t), . . . , xN^T(t))^T, F(x(t)) = (f^T(x1(t)), f^T(x2(t)), . . . , f^T(xN(t)))^T and I(t) = (I^T(t), . . . , I^T(t))^T; using the notation ⊗ for the Kronecker product, model (1) can be rewritten as

dx(t)/dt = −(IN ⊗ C)x(t) + (IN ⊗ A)F(x(t)) + (IN ⊗ B)F(x(t − τ)) + I(t) + (G ⊗ D)F(x(t)).   (3)
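The equivalence between the node-wise form (1) and the stacked Kronecker form (3) can be checked numerically; in this sketch (ours) the system matrices are random placeholders and f is taken to be tanh purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 3, 2
C = np.diag(rng.uniform(0.5, 1.5, n))
A, B, D = (rng.standard_normal((n, n)) for _ in range(3))
G = np.array([[-2.0, 1.0, 1.0], [1.0, -2.0, 1.0], [1.0, 1.0, -2.0]])
f = np.tanh                                  # placeholder activation
x = rng.standard_normal((N, n))              # node states x_i(t)
x_tau = rng.standard_normal((N, n))          # delayed states x_i(t - tau)
I_ext = rng.standard_normal(n)

# Node-wise right-hand side of (1)
rhs_nodes = np.stack([
    -C @ x[i] + A @ f(x[i]) + B @ f(x_tau[i]) + I_ext
    + sum(G[i, j] * (D @ f(x[j])) for j in range(N))
    for i in range(N)])

# Stacked Kronecker right-hand side of (3)
xs, xs_tau = x.reshape(-1), x_tau.reshape(-1)
rhs_stack = (-np.kron(np.eye(N), C) @ xs + np.kron(np.eye(N), A) @ f(xs)
             + np.kron(np.eye(N), B) @ f(xs_tau) + np.tile(I_ext, N)
             + np.kron(G, D) @ f(xs))

assert np.allclose(rhs_nodes.reshape(-1), rhs_stack)
```

The agreement holds because f acts componentwise, so stacking the node states commutes with applying f.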
The initial conditions associated with (3) are given by

xi(s) = φi(s) ∈ C([−τ, 0], R^n),   i = 1, 2, . . . , N.   (4)
Throughout this paper, the following assumption is made:

(H) There exist constants lr > 0, r = 1, 2, . . . , n, such that

0 ≤ (fr(x1) − fr(x2)) / (x1 − x2) ≤ lr

for any distinct x1, x2 ∈ R.

Definition 1. Model (3) is said to be globally exponentially synchronized if there exist two constants ε > 0 and M > 0 such that, for all φi(s) (i = 1, 2, . . . , N) and for sufficiently large T > 0, ||xi(t) − xj(t)|| ≤ M e^{−εt} for all t > T, i, j = 1, 2, . . . , N.

Lemma 1 [11]. Let ⊗ denote the Kronecker product, α ∈ R, and let A, B, C and D be matrices with appropriate dimensions. Then

(1) (αA) ⊗ B = A ⊗ (αB);
(2) (A + B) ⊗ C = A ⊗ C + B ⊗ C;
(3) (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).
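The three identities of Lemma 1 are easy to confirm numerically (our sanity check; the shapes are chosen so that the products in identity (3) conform):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.5
A = rng.standard_normal((2, 3))
A2 = rng.standard_normal((2, 3))
C = rng.standard_normal((3, 4))
B = rng.standard_normal((5, 2))
D = rng.standard_normal((2, 6))

# (1) a scalar factor can move across the Kronecker product
assert np.allclose(np.kron(alpha * A, B), np.kron(A, alpha * B))
# (2) right-distributivity over addition
assert np.allclose(np.kron(A + A2, C), np.kron(A, C) + np.kron(A2, C))
# (3) mixed-product rule: (A x B)(C x D) = (AC) x (BD)
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))
```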
2 Main Results
In this section, the Lyapunov functional method will be employed to investigate the global exponential synchronization of system (3).

Theorem 1. Under the assumption (H), system (3) with initial condition (4) is globally exponentially synchronized if there exist three positive definite matrices Pi > 0 (i = 1, 2, 3) and two positive diagonal matrices S, W such that the following LMIs are satisfied for all 1 ≤ i < j ≤ N:

Ωij =
⎡ −P1C − CP1 + P3              LS + P1A − N Gij P1D   0     P1B      ⎤
⎢ SL + A^T P1 − N Gij D^T P1   P2 − 2S               0     0        ⎥
⎢ 0                            0                     −P3   LW       ⎥
⎣ B^T P1                       0                     WL    −P2 − 2W ⎦  < 0,   (5)

where L = diag(l1, l2, . . . , ln).

Proof. Condition (5) ensures that there exists a scalar ε > 0 such that

Ω̃ij =
⎡ εP1 − P1C − CP1 + e^{ετ}P3   LS + P1A − N Gij P1D   0     P1B      ⎤
⎢ SL + A^T P1 − N Gij D^T P1   e^{ετ}P2 − 2S          0     0        ⎥
⎢ 0                            0                     −P3   LW       ⎥
⎣ B^T P1                       0                     WL    −P2 − 2W ⎦  < 0.   (6)

Let e = (1, 1, . . . , 1)^T, let EN = ee^T be the N × N matrix of all 1's, and let U = N IN − EN, in which IN denotes the N × N identity matrix. Consider the following Lyapunov functional candidate for system (3):

V(t, xt) = V1(t, xt) + V2(t, xt) + V3(t, xt),   (7)

where

V1(t, xt) = e^{εt} x^T(t)(U ⊗ P1)x(t),
V2(t, xt) = ∫_{t−τ}^{t} e^{ε(s+τ)} F^T(x(s))(U ⊗ P2)F(x(s)) ds,
V3(t, xt) = ∫_{t−τ}^{t} e^{ε(s+τ)} x^T(s)(U ⊗ P3)x(s) ds.

Calculating the derivative of V(t, xt) along the solutions of (3), and noting that (U ⊗ P1)I(t) ≡ 0 and U G = N G, by Lemma 1 we have

dV(t, xt)/dt = ε e^{εt} x^T(t)(U ⊗ P1)x(t) + 2e^{εt} x^T(t)(U ⊗ P1)[−(IN ⊗ C)x(t) + (IN ⊗ A)F(x(t)) + (IN ⊗ B)F(x(t − τ)) + I(t) + (G ⊗ D)F(x(t))]
+ e^{ε(t+τ)} F^T(x(t))(U ⊗ P2)F(x(t)) − e^{εt} F^T(x(t − τ))(U ⊗ P2)F(x(t − τ))
+ e^{ε(t+τ)} x^T(t)(U ⊗ P3)x(t) − e^{εt} x^T(t − τ)(U ⊗ P3)x(t − τ)
= e^{εt} {x^T(t)[ε(U ⊗ P1) − 2U ⊗ (P1C)]x(t) + 2x^T(t)(U ⊗ (P1A) + (N G) ⊗ (P1D))F(x(t)) + 2x^T(t)(U ⊗ (P1B))F(x(t − τ))
+ e^{ετ} F^T(x(t))(U ⊗ P2)F(x(t)) − F^T(x(t − τ))(U ⊗ P2)F(x(t − τ))
+ e^{ετ} x^T(t)(U ⊗ P3)x(t) − x^T(t − τ)(U ⊗ P3)x(t − τ)}
= e^{εt} Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} {(xi(t) − xj(t))^T [(εP1 − 2P1C)(xi(t) − xj(t))
+ 2(P1A − N Gij P1D)(f(xi(t)) − f(xj(t))) + 2P1B(f(xi(t − τ)) − f(xj(t − τ)))]
+ e^{ετ} (f(xi(t)) − f(xj(t)))^T P2 (f(xi(t)) − f(xj(t))) − (f(xi(t − τ)) − f(xj(t − τ)))^T P2 (f(xi(t − τ)) − f(xj(t − τ)))
+ e^{ετ} (xi(t) − xj(t))^T P3 (xi(t) − xj(t)) − (xi(t − τ) − xj(t − τ))^T P3 (xi(t − τ) − xj(t − τ))}.   (8)
Under the assumption (H), one can easily get the following inequalities:

(f(xi(t)) − f(xj(t)))^T S (f(xi(t)) − f(xj(t))) ≤ (xi(t) − xj(t))^T LS (f(xi(t)) − f(xj(t))),   (9)
(f(xi(t − τ)) − f(xj(t − τ)))^T W (f(xi(t − τ)) − f(xj(t − τ))) ≤ (xi(t − τ) − xj(t − τ))^T LW (f(xi(t − τ)) − f(xj(t − τ))),   (10)
where 1 ≤ i < j ≤ N. Substituting (9) and (10) into (8), we obtain

dV(t, xt)/dt ≤ e^{εt} Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} {(xi(t) − xj(t))^T [εP1 − 2P1C + e^{ετ}P3](xi(t) − xj(t))
+ (f(xi(t)) − f(xj(t)))^T [e^{ετ}P2 − 2S](f(xi(t)) − f(xj(t)))
− (xi(t − τ) − xj(t − τ))^T P3 (xi(t − τ) − xj(t − τ))
− (f(xi(t − τ)) − f(xj(t − τ)))^T (P2 + 2W)(f(xi(t − τ)) − f(xj(t − τ)))
+ 2(xi(t) − xj(t))^T [LS + P1A − N Gij P1D](f(xi(t)) − f(xj(t)))
+ 2(xi(t) − xj(t))^T P1B (f(xi(t − τ)) − f(xj(t − τ)))
+ 2(xi(t − τ) − xj(t − τ))^T LW (f(xi(t − τ)) − f(xj(t − τ)))}
= e^{εt} Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} ξij^T Ω̃ij ξij,   (11)
in which ξij = [(xi(t) − xj(t))^T, (f(xi(t)) − f(xj(t)))^T, (xi(t − τ) − xj(t − τ))^T, (f(xi(t − τ)) − f(xj(t − τ)))^T]^T. From condition (6), the above inequality (11) implies that V(t) ≤ V(0); hence e^{εt} x^T(t)(U ⊗ P1)x(t) is bounded, and this yields

λmin(P1) ||xi(t) − xj(t)||^2 ≤ Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (xi(t) − xj(t))^T P1 (xi(t) − xj(t)) = O(e^{−εt}),

for all 1 ≤ i < j ≤ N. According to Definition 1, we can conclude that the dynamical system (3) is globally exponentially synchronized.

Based on Theorem 1, one can easily get the following corollary:

Corollary 1. Under the assumption (H), system (3) with initial condition (4) is globally exponentially synchronized if there exist three positive definite matrices
Pi > 0 (i = 1, 2, 3) and one positive diagonal matrix S such that the following LMIs are satisfied for all 1 ≤ i < j ≤ N:

⎡ −P1C − CP1 + P3              LS + P1A − N Gij P1D   0     P1B ⎤
⎢ SL + A^T P1 − N Gij D^T P1   P2 − 2S               0     0   ⎥
⎢ 0                            0                     −P3   0   ⎥
⎣ B^T P1                       0                     0     −P2 ⎦  < 0.   (12)
3 Numerical Example
Consider the 2-dimensional delayed neural network presented in [3]:

dy(t)/dt = −Cy(t) + Af(y(t)) + Bf(y(t − 0.93)) + I(t),   (13)
where y(t) = (y1(t), y2(t))^T ∈ R^2 is the state vector of the network, and the activation function f(y(t)) = (f1(y1(t)), f2(y2(t)))^T with fi(yi) = 0.5(|yi + 1| − |yi − 1|) (i = 1, 2); obviously, assumption (H) is satisfied with L = diag(1, 1). The external input vector I(t) = (0, 0)^T, and the other matrices are as follows:

C = ⎡1  0⎤    A = ⎡1 + π/4   20     ⎤    B = ⎡−1.3√2 π/4   0.1        ⎤
    ⎣0  1⎦,       ⎣0.1       1 + π/4⎦,       ⎣0.1          −1.3√2 π/4⎦.

The chaotic behavior with the initial conditions

y1(s) = 0.2,   y2(s) = 0.3,   ∀s ∈ [−0.93, 0],   (14)

is shown in Fig. 1.
1
0.6
0.8
0.4 0.6
0.2
e(t)
0.4
0
0.2
−0.2 0
−0.4 −0.2
−0.6 −0.4
−0.8 −15
−10
−5
0
5
10
0
5
Fig. 1. Chaotic trajectory of (13)
10
15
time t
15
Fig. 2. Synchronization error e(t)
Now consider a complex system consisting of three nonlinearly coupled identical models (13). The state equations of the entire array are

dxi(t)/dt = −Cxi(t) + Af(xi(t)) + Bf(xi(t − 0.93)) + I(t) + Σ_{j=1}^{3} Gij D f(xj(t)),   (15)
where xi(t) = (xi1(t), xi2(t))^T (i = 1, 2, 3) is the state vector of the ith neural network. Choose the coupling matrix G and the inner linking matrix D as

G = ⎡−3   1   2⎤    D = ⎡4  0⎤
    ⎢ 1  −2   1⎥,       ⎣0  4⎦.
    ⎣ 2   1  −3⎦

By applying the MATLAB LMI Control Toolbox, (12) can be solved to yield the following feasible solutions:
P1 = ⎡0.0632  0.0467⎤    P2 = ⎡ 0.4114   −0.5336⎤    P3 = ⎡ 0.0084   −0.0119⎤
     ⎣0.0467  2.2843⎦,        ⎣−0.5336   14.6908⎦,        ⎣−0.0119    0.2428⎦,

and S = diag(1.0214, 36.3493). According to Corollary 1, network (15) achieves global exponential synchronization. The synchronization performance is illustrated in Fig. 2, where e(t) = (e1(t), e2(t))^T with ej(t) = √(Σ_{i=2}^{3} (xij(t) − x1j(t))^2), and the initial states for (15) are taken as random constants in [0, 1] × [0, 1]. Fig. 2 confirms that the dynamical system (15) is globally exponentially synchronized.
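As a quick consistency check of the example setup (ours, not part of the paper), the chosen G indeed satisfies the diffusive coupling condition (2): it is symmetric off the diagonal and every row sums to zero:

```python
import numpy as np

G = np.array([[-3.0,  1.0,  2.0],
              [ 1.0, -2.0,  1.0],
              [ 2.0,  1.0, -3.0]])

assert np.allclose(G, G.T)               # G_ij = G_ji for i != j
assert np.allclose(G.sum(axis=1), 0.0)   # G_ii = -sum_{j != i} G_ij
```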
References

1. Hopfield, J.J.: Neurons with Graded Response Have Collective Computational Properties Like Those of Two-State Neurons. Proc. Natl. Acad. Sci. USA 81 (1984) 3088-3092
2. Zhang, J., Suda, Y., Iwasa, T.: Absolutely Exponential Stability of a Class of Neural Networks with Unbounded Delay. Neural Networks 17 (2004) 391-397
3. Gilli, M.: Strange Attractors in Delayed Cellular Neural Networks. IEEE Trans. Circuits Syst.-I 40(11) (1993) 849-853
4. Hoppensteadt, F.C., Izhikevich, E.M.: Pattern Recognition Via Synchronization in Phase-Locked Loop Neural Networks. IEEE Trans. Neural Networks 11(3) (2000) 734-738
5. Zheleznyak, A., Chua, L.O.: Coexistence of Low- and High-Dimensional Spatio-Temporal Chaos in a Chain of Dissipatively Coupled Chua's Circuits. Int. J. Bifur. Chaos 4(3) (1994) 639-674
6. Wu, C.W., Chua, L.O.: Synchronization in an Array of Linearly Coupled Dynamical Systems. IEEE Trans. Circuits Syst.-I 42(8) (1995) 430-447
7. Chen, G.R., Zhou, J., Liu, Z.R.: Global Synchronization of Coupled Delayed Neural Networks and Applications to Chaotic Models. Int. J. Bifur. Chaos 14(7) (2004) 2229-2240
8. Lu, W.L., Chen, T.P.: Synchronization of Coupled Connected Neural Networks with Delays. IEEE Trans. Circuits Syst.-I 51(12) (2004) 2491-2503
9. Cao, J., Li, P., Wang, W.W.: Global Synchronization in Arrays of Delayed Neural Networks with Constant and Delayed Coupling. Phys. Lett. A 353 (2006) 318-325
10. Li, Z., Chen, G.R.: Global Synchronization and Asymptotic Stability of Complex Dynamical Networks. IEEE Trans. Circuits Syst.-II 53(1) (2006) 28-33
11. Chen, J.L., Chen, X.H.: Special Matrices. Tsinghua University Press, China (2001)
12. Cao, J., Lu, J.: Adaptive Synchronization of Neural Networks with or without Time-Varying Delays. Chaos 16 (2006) 013133
13. Huang, X., Cao, J.: Generalized Synchronization for Delayed Chaotic Neural Networks: a Novel Coupling Scheme. Nonlinearity 19(12) (2006) 2797-2811
Self-synchronization Blind Audio Watermarking Based on Feature Extraction and Subsampling Xiaohong Ma, Bo Zhang, and Xiaoyan Ding School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116023, China
[email protected]

Abstract. A novel embedding-watermark generation scheme based on feature extraction is proposed in this paper. The original binary watermark image is divided into two blocks of the same size, and each block is reshaped into a one-dimensional sequence. Independent Component Analysis (ICA) is then used to extract their independent features, which are regarded as the two embedding watermark signals. In the embedding procedure, the embedding watermark signals are embedded into selected wavelet coefficients of the subaudios obtained by subsampling, and self-synchronization is implemented by a special peak-point extraction scheme. The blind extraction procedure is essentially the converse of the embedding one, and the original watermark image can be recovered with the help of the mixing matrix of the ICA. Experimental results show the validity of this scheme.
1 Introduction
Recent growth in the distribution of digital multimedia data over networks and the Internet has caused authentication and copyright problems. Digital watermarking has been proposed as an effective solution to these problems. The most important properties of digital watermarking are robustness and imperceptibility [1]. To achieve them, the watermark is usually embedded in a transformed domain. As the Discrete Wavelet Transform (DWT) reflects both time and frequency properties, many watermarking algorithms are based on the DWT [2], [3]. Synchronization attack is a serious problem for any audio watermarking scheme. Audio processing such as random cropping causes displacement between the embedded and detected signals in the time domain, and therefore it is difficult for the watermark to survive [4]. In [5], the authors proposed a synchronization scheme based on peak point extraction. The scheme proposed in this paper makes some improvements on it: it makes the search for synchronization points more accurate without adding extra information to the original audio signal. As a kind of blind source separation (BSS) algorithm, ICA has received much attention because of its potential applications in signal processing. In many audio watermark embedding schemes, it is used to separate the watermark and audio signals [1], [6], [7]. In digital image watermarking schemes, ICA can be used to obtain independent feature components of an image for watermark embedding so as to improve robustness [8].

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 40–46, 2007.
© Springer-Verlag Berlin Heidelberg 2007
A novel embedding-watermark generation method based on ICA is proposed in this paper. It is employed to extract the independent feature components of the watermark image, which serve as the embedding watermark signals. The mixing matrix obtained during the ICA is kept as a secret key. In the embedding procedure, a method called subsampling [9] is utilized, and a synchronization scheme based on peak point extraction is applied to resist cropping attacks. The original audio signal is not required during watermark extraction.
2 Watermark Embedding
The block diagram of watermark embedding is shown in Fig. 1. There are three main steps: the embedding watermark signal generation (enclosed with a dashed line in the figure), the synchronization point extraction and the watermark embedding.
Fig. 1. The block diagram of watermark embedding
2.1 Embedding Watermark Signal Generation
In this paper, to ensure the security of the scheme, the FastICA method [10] is applied to extract two independent features of the original watermark image, which serve as the two embedding watermark signals. An image can be considered a mixture of several independent features. Here, the watermark image is taken as a combination of two feature images, and FastICA is employed to extract these two independent feature components to generate the embedding watermark signals. The watermark image is divided into two subblocks, and each subblock is reshaped into a one-dimensional vector used as an observation signal for FastICA. After this process, two feature components and two matrices are obtained. This generation process makes the watermarking scheme much more secure. The original watermark W is a binary image of size 32 × 32. It is divided into 2 subblocks of 16 × 32 and reshaped into two vectors d1 and d2. Then the FastICA method is applied to them to obtain two feature components v1 and v2, v1 = {v1(i), i = 1, 2, · · · , 512}, v2 = {v2(i), i = 1, 2, · · · , 512}. The
mixing matrix is kept as secret key key1, which can recover the watermark image by multiplying it with the extracted feature signals in the watermark extraction scheme. There are altogether two possible element values in v1 and four possible element values in v2. The elements of v2 are separated into two groups t1 and t2, denoted as t1 = {t1(i), i = 1, 2, · · · , S} and t2 = {t2(i), i = 1, 2, · · · , 512 − S}. The elements of t1 have the same absolute value, and so do the elements of t2. The positions of t1 in v2 and the absolute values of v1, t1 and t2 are kept as secret key key2 for the extraction procedure. v1, t1 and t2 can be quantized as follows:

w1(i) = 1 if v1(i) > 0,   w1(i) = −1 if v1(i) < 0;   (1)
tk(i) = 1 if tk(i) > 0,   tk(i) = −1 if tk(i) < 0,   k = 1, 2.   (2)

The combination of t1 and t2, which can be described as w2 = [t1, t2], and w1, denoted as w1 = {w1(i), i = 1, 2, · · · , 512}, are the two embedding watermark signals.
2.2 Synchronization Point Extraction
Synchronization is a significant component of digital audio watermarking because attacks such as cropping are very destructive. Therefore, many synchronization schemes have been proposed to resist various attacks along the time axis. In [11], a Barker code is embedded into the original audio signal to indicate the location of the watermark. But extra information embedded in the audio signal may distort the original audio and draw the attention of attackers; what is more, the search for the synchronization code is always time-consuming. Another solution for synchronization is called self-synchronization. In this kind of scheme, feature points or regions of the signal are fully exploited. In [6], a feature of the audio signal is utilized to implement synchronization, but the synchronization points are not outstanding and are difficult to search for. In this scheme, the power of the original signal is specially shaped by raising each sample value to a high power, as in the following equation:

x′(n) = x⁴(n),   (3)

where x(n) is the original audio signal and x′(n) is the signal after the special shaping. The power of 4 is chosen for the convenience of identifying the outstanding peaks. This process amplifies the energy differences between the peak regions and the low-energy regions. The special regions are identified by comparison with a threshold th, set to 20% of the value of the highest peak after special shaping. Samples whose values are higher than the threshold are extracted as peak points. The peak points usually appear in groups consisting of many samples. If the number of consecutive peak points in a group is equal to or greater than N, this group is chosen for embedding. In [6], the last point of the group is taken
as the synchronization point. In this scheme, the largest point in the group is taken as the synchronization point instead, because it is more outstanding within the group. The selection of N is based on practical experiments and varies among different audio signals. In this scheme, to improve security and robustness, two synchronization points are selected and the watermark signals are embedded twice.
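Our reading of the synchronization search just described — fourth-power shaping as in (3), a threshold at 20% of the shaped maximum, and taking the largest sample of each sufficiently long run above threshold — can be sketched as follows (N and the toy signal are placeholders, not values from the paper):

```python
def sync_points(x, N=3, frac=0.2):
    """Return indices of synchronization points in signal x."""
    shaped = [s ** 4 for s in x]                 # special shaping, Eq. (3)
    th = frac * max(shaped)                      # threshold: 20% of highest peak
    points, run = [], []
    for i, v in enumerate(shaped):
        if v > th:
            run.append(i)                        # extend current peak group
        else:
            if len(run) >= N:                    # group long enough to embed
                points.append(max(run, key=lambda j: shaped[j]))
            run = []
    if len(run) >= N:                            # handle a group at the end
        points.append(max(run, key=lambda j: shaped[j]))
    return points

pts = sync_points([0.0, 0.1, 0.9, 1.0, 0.8, 0.1, 0.0, 0.05,
                   0.95, 0.85, 0.9, 0.1])
```

On this toy signal the two qualifying runs yield the indices of their largest shaped samples, illustrating why the largest point of a group is a more robust anchor than its last point.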
Embedding
L points of the original audio signal after the synchronization point are selected as the watermark embedding segment Audio, L = 4k, k = 1, · · · , M. It can be subsampled as follows:

Ai(k) = Audio(4k − 4 + i),  k = 1, 2, · · · , L/4,  i = 1, 2, 3, 4   (4)
where A1, A2, A3, A4 are four similar subaudios. To ensure robustness, a 3-level DWT is applied to these signals. Their approximate components are rearranged in descending order and then checked against Eq. (5) and Eq. (6) to see whether they satisfy the embedding condition:

Vj = (V1j + V2j)/2   (5)
|V1j − V2j| < 2aVj   (6)
where V1j and V2j are the rearranged approximate components of A1 and A2, and a is a positive constant. 512 coefficients satisfying Eq. (6) are picked out for embedding and denoted as V1i and V2i. At the same time, the embedding positions are kept as a secret key, key3. The watermark signal w1 is embedded according to Eq. (7):

V1i = Vi(1 + aw1(i)),  V2i = Vi(1 − aw1(i)),  i = 1, 2, · · · , 512   (7)
The selection of a is a tradeoff between audio distortion and detection accuracy. Because the four subaudios are similar, their approximate components are similar too, so the embedding positions of w2 are the same as those of w1, and the embedding procedure of w2 is exactly the same as that of w1. To resist the cropping attack, w1 and w2 are each embedded twice. Finally, the IDWT is applied to the modified coefficients together with the unmodified ones to obtain the watermarked audio signal.
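As a concrete illustration, the synchronization search of Section 2.2 can be sketched as follows. This is a minimal sketch: the function name and the toy signal are illustrative, and N is reduced from the paper's N = 10 so the example stays short.

```python
def find_sync_points(x, N, ratio=0.2):
    """Synchronization search of Section 2.2: shape with a 4th power,
    threshold at 20% of the highest shaped peak, group consecutive peak
    samples, and keep the largest sample of each group of length >= N."""
    shaped = [s ** 4 for s in x]              # special shaping, Eq. (3)
    th = ratio * max(shaped)                  # threshold th
    peaks = [i for i, v in enumerate(shaped) if v > th]
    groups, current = [], []
    for i in peaks:                           # split into consecutive runs
        if current and i != current[-1] + 1:
            groups.append(current)
            current = []
        current.append(i)
    if current:
        groups.append(current)
    return [max(g, key=lambda i: shaped[i]) for g in groups if len(g) >= N]

# toy signal with one burst of large samples around index 10
signal = [0.01] * 30
for i, v in zip(range(8, 14), [0.5, 0.8, 1.0, 0.9, 0.7, 0.6]):
    signal[i] = v
print(find_sync_points(signal, N=3))  # → [10]
```

Raising to the 4th power makes the burst stand out sharply against the 0.01 background, so only the run of samples around the largest peak survives the 20% threshold.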
3 Watermark Extraction
The block diagram of watermark extraction is shown in Fig. 2. Just as in the watermark embedding procedure, the synchronization points are searched for, and the following L points are selected and subsampled to obtain four
X. Ma, B. Zhang, and X. Ding
[Fig. 2 shows the extraction pipeline: the watermarked audio undergoes special shaping to locate the synchronization points; the audio for extraction is subsampled into four subaudios A1′, A2′, A3′, A4′; each subaudio is transformed by the DWT; the embedding coefficients are selected with key3 to yield w1′ and w2′; and postprocessing with key1 and key2 produces the extracted watermark.]
Fig. 2. The block diagram of watermark extraction
subaudios, as described in Fig. 2. A 3-level DWT is applied to each subaudio. According to the secret key key3, the embedding positions in the approximate components are obtained. The extraction of w1 follows Eq. (8). Considering that the watermarked audio signal may have undergone attacks or processing, the selected pairs of approximate components are denoted as U1i and U2i; the watermark signal w̃1 = {w̃1(i), i = 1, 2, · · · , 512} can be recovered as follows:

w̃1(i) = (1/a) · (U1i − U2i)/(U1i + U2i)   (8)

The extraction of w̃2 is exactly the same as that of w̃1. As discussed in the watermark signal generation, a reverting process is necessary for w̃1 and w̃2. The positive elements in w̃1 are replaced by the absolute value of v1, which is kept in key2; the remaining elements are replaced by the negatives of those values. The elements in the kept positions of w̃2 are replaced by the positive or negative absolute value of t1, depending on their own signs; the remaining elements in w̃2 are replaced according to the same rule. After that, the watermark can be recovered as follows:

ww = A · [w̃1 ; w̃2]   (9)

where ww is a 2 × 512 matrix. Taking 0 as the threshold, the elements in ww are mapped into {0, 255}. Each vector is reshaped into a 16 × 32 matrix and then combined into an integrated watermark image.
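The multiplicative embedding rule (7) and the blind extraction rule (8) form an exact round trip in the absence of attacks. A minimal sketch, with the DWT, coefficient selection, and key handling omitted, and toy coefficient values:

```python
a = 0.1  # embedding strength, as used in the experiments

def embed(V, w):
    """Eq. (7): V1_i = V_i(1 + a*w_i), V2_i = V_i(1 - a*w_i), w_i in {-1, +1}."""
    V1 = [v * (1 + a * wi) for v, wi in zip(V, w)]
    V2 = [v * (1 - a * wi) for v, wi in zip(V, w)]
    return V1, V2

def extract(U1, U2):
    """Eq. (8): w_i = (1/a)*(U1_i - U2_i)/(U1_i + U2_i); no original needed."""
    return [(u1 - u2) / (u1 + u2) / a for u1, u2 in zip(U1, U2)]

V = [2.0, 1.5, 3.0, 0.8]    # selected approximate DWT coefficients (toy values)
w = [1, -1, -1, 1]          # bipolar watermark bits
U1, U2 = embed(V, w)
recovered = extract(U1, U2)
print([round(r) for r in recovered])  # → [1, -1, -1, 1]
```

Algebraically, (U1i − U2i)/(U1i + U2i) = 2aViw1(i)/(2Vi) = aw1(i), which is why the extraction is blind: the ratio cancels the unknown host coefficient Vi.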
4 Experiment Results
The parameters in our experiment are given as follows: N = 10, a = 0.1, L = 70856. The sampling rate of the original audio signal is 44.1 kHz and its length is 112080 samples. The original audio signal and the watermarked audio signal are shown in Fig. 3(a) and Fig. 3(b), respectively. There is no visible difference between them, and no audible difference in the listening test.
Fig. 3. Original audio signal and watermarked audio signal. (a) original audio signal. (b) watermarked audio signal.
Fig. 4. The original watermark image and extracted watermark images under various conditions. (a) original watermark image. (b) extracted watermark image without any attack. (c) mp3 compression. (d) cropping. (e) adding white Gaussian noise (SNR is 25 dB). (f) requantizing. (g) resampling (from 44.1 kHz to 88.2 kHz, then to 44.1 kHz). (h) lowpass filtering.
The original watermark image is shown in Fig. 4(a), and the watermark extracted without any attack is shown in Fig. 4(b). Fig. 4(c)-Fig. 4(h) show the watermarks extracted under various attacks. All the extracted watermarks except Fig. 4(g) and Fig. 4(h) are very clear. Although the embedded watermark signal is degraded under the resampling and filtering attacks, the extracted watermark can still be recognized clearly.
5 Conclusion
A novel watermark signal generation scheme based on feature extraction is proposed in this paper. It uses ICA for feature extraction to generate the embedded watermark signals, which makes the audio watermarking scheme much more secure. The watermark signals are embedded in the DWT domain of four subaudios obtained by subsampling. The synchronization scheme improves the robustness
against the cropping attack without introducing additional information, and the extraction procedure is completely blind. Experimental results show excellent imperceptibility and good robustness against various attacks. Acknowledgments. This work was supported by the National Natural Science Foundation of China under Grant No. 60575011 and the Liaoning Province Natural Science Foundation of China under Grant No. 20052181.
References
1. Liu, J., Zhang, X.G., Najar, M., Lagunas, M.A.: A Robust Digital Watermarking Scheme Based on ICA. International Conference on Neural Networks and Signal, Oregon, USA 2 (2003) 1481-1484
2. Vieru, R., Tahboub, R., Constantinescu, C., Lazarescu, V.: New Results Using the Audio Watermarking Based on Wavelet Transform. International Symposium on Signals, Circuits, and Systems, Kobe, Japan 2 (2005) 441-444
3. Cvejic, N., Seppanen, T.: Robust Audio Watermarking in Wavelet Domain Using Frequency Hopping and Patchwork Method. The 3rd International Symposium on Image and Signal Processing and Analysis, Rome, Italy 1 (2003) 251-255
4. Li, W., Xue, X., Lu, P.: Localized Audio Watermarking Technique Robust Against Time-Scale Modification. IEEE Transactions on Multimedia 8 (2006) 60-69
5. Foo, S.W., Xue, F., Li, M.: A Blind Audio Watermarking Scheme Using Peak Point Extraction. IEEE International Symposium on Circuits and Systems, Kobe, Japan 5 (2005) 4409-4412
6. Toch, B., Lowe, D., Saad, D.: Watermarking of Audio Signals Using Independent Component Analysis. The Third International Conference on WEB Delivering of Music, Leeds, United Kingdom (2003) 71-74
7. Sener, S., Gunsel, B.: Blind Audio Watermark Decoding Using Independent Component Analysis. The 17th International Conference on Pattern Recognition, Cambridge, United Kingdom 2 (2004) 875-878
8. Sun, J., Liu, J.: A Novel Digital Watermark Scheme Based on Image Independent Feature. The 2003 IEEE International Conference on Robotics, Intelligent Systems and Signal Processing, Changsha, China 2 (2003) 1333-1338
9. Chu, W.C.: DCT-Based Image Watermarking Using Subsampling. IEEE Transactions on Multimedia 5 (1) (2003) 34-38
10. Hyvarinen, A., Oja, E.: A Fast Fixed-point Algorithm for Independent Component Analysis. Neural Computation 9 (7) (1997) 1483-1492
11. Huang, J., Wang, Y., Shi, Y.: A Blind Audio Watermarking Algorithm with Self-synchronization. IEEE International Symposium on Circuits and Systems, Arizona, USA 3 (2002) 627-630
An Improved Extremum Seeking Algorithm Based on the Chaotic Annealing Recurrent Neural Network and Its Application* Yun-an Hu, Bin Zuo, and Jing Li Department of Control Engineering, Naval Aeronautical Engineering Academy Yantai 264001, China
[email protected],
[email protected] Abstract. The application of sinusoidal periodic search signals in the general extremum seeking algorithm (ESA) results in "chatter" of the output, switching of the control law, and an incapability of escaping from local minima. An improved chaotic annealing recurrent neural network (CARNN) is proposed for ESA to solve those problems and to improve the global searching capability. The paper converts extremum seeking into seeking the global extreme point where the slope of the cost function is zero, and applies a CARNN to finding that global point and stabilizing the plant there. ESA combined with a CARNN does not make use of search signals such as sinusoidal periodic signals, which solves those problems of previous ESA and greatly improves the dynamic performance of the controlled system. During the process of optimization, chaotic annealing is realized by continuously decaying the amplitude of the chaos noise and the probability of acceptance. The process of optimization is divided into two phases: a coarse search based on chaos and an elaborate search based on an ARNN. Finally, the CARNN stabilizes the system at the global extreme point. The proposed method also simplifies the stability analysis of ESA. The simulation results of a simplified UAV tight formation flight model and a typical Schaffer function validate the advantages mentioned above.
1 Introduction
The extremum seeking problem deals with minimizing or maximizing the output of a plant over a set of decision variables [1]. Extremum seeking problems represent a class of widespread optimization problems arising in diverse design and planning contexts. Many large-scale and real-time applications, such as traffic routing and bioreactor systems, require solving large-scale extremum seeking problems in real time. To solve this class of problems, extremum seeking algorithms were proposed as early as the 1950s; early work on performance improvement by extremum seeking can be found in Tsien. In the 1950s and 1960s, the extremum seeking algorithm was considered an adaptive control method [2]. Sliding mode control was not utilized successfully for extremum seeking until the 1990s [3]. Subsequently, a method of adding compensator dynamics to ESA was proposed by Krstic, which
This research was supported by the Natural Science Foundation of P.R. China (No. 60674090).
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 47–56, 2007. © Springer-Verlag Berlin Heidelberg 2007
improved the stability of the system [4]. Although those methods tremendously improved the performance of ESA, the "chatter" problem of the output, the switching of the control law, and the incapability of escaping from local minima limit the application of ESA. The method of introducing a chaotic annealing recurrent neural network into ESA is proposed in this paper. First, the extremum seeking problem is converted into the process of seeking the global extreme point of the plant, where the slope of the cost function is zero. Second, an improved CARNN is constructed; then we apply the CARNN to finding the global extreme point and stabilizing the plant at that point. The CARNN proposed in this paper does not make use of search signals such as sinusoidal periodic signals, so the method solves the "chatter" problem of the output and of the switching of the control law in the general ESA and improves the dynamic performance of the ESA system. At the same time, the CARNN utilizes the randomness and global searching property of chaotic systems to improve the global searching capability of the system [5-6]. During the process of optimization, chaotic annealing is realized by continuously decaying the amplitude of the chaos noise and the accepting probability; adjusting the probability of acceptance can influence the rate of convergence. The process of optimization is divided into two phases: a coarse search based on chaos and an elaborate search based on the RNN. Finally, the CARNN stabilizes the system at the global extreme point, which is validated by simulating a simplified UAV tight formation flight model and a typical Schaffer function. At the same time, the stability analysis of ESA can be simplified by the proposed method.
2 Annealing Recurrent Neural Network Descriptions
2.1 Problem Formulation
Consider a general nonlinear system:

ẋ = f(x(t), u(t)),  y = F(x(t))   (1)

where x ∈ R^n, u ∈ R^m, and y ∈ R are the states, the system inputs, and the system output, respectively. F(x) is also defined as the cost function of the system. f(x, u) and F(x) are smooth functions.
If the nonlinear system (1) is an extremum seeking system, it must satisfy the three assumptions described in [7]. We know that there must be a smooth control law u(t) = α(x(t), θ) to stabilize the nonlinear system (1), where θ = [θ1, θ2, · · · , θi, · · · , θp]^T (i ∈ [1, 2, · · · , p]) is a parameter vector of dimension p which determines a unique equilibrium vector. Then there must also be a smooth function
xe : R^p → R^n such that f(x, α(x, θ)) = 0 ↔ x = xe(θ). Therefore, the static performance map at the equilibrium point xe(θ) from θ to y is represented by:

y = F(xe(θ)) = F(θ).   (2)

Differentiating (2) with respect to time yields the relation between θ and ẏ(t):

∂^T(θ(t)) θ̇(t) = ẏ(t),   (3)

where ∂(θ(t)) = [∂F(θ)/∂θ1, ∂F(θ)/∂θ2, · · · , ∂F(θ)/∂θp]^T and θ̇(t) = [θ̇1, θ̇2, · · · , θ̇p]^T.
Once the seeking vector θ of the extremum seeking system (1) converges to the global extreme vector θ∗, then ∂(θ) = [∂F(θ)/∂θ1, ∂F(θ)/∂θ2, · · · , ∂F(θ)/∂θp]^T must also converge to zero. A CARNN is introduced into ESA in order to minimize ∂(θ) in finite time, while the system (1) is subjected to (3). The extremum seeking problem can then be written as follows:

Minimize: f1(υ) = c^T υ
Subject to: p1(υ) = Aυ − b = 0,   (4)

where ∂^T(θ) denotes the transpose of ∂(θ), υ = [∂(θ)  |∂(θ)|  θ̇(t)]^T,

A = [ 1_{1×p}   −sign(∂^T(θ))   0_{1×p} ;
      θ̇^T(t)    0_{1×p}         0_{1×p} ;
      0_{1×p}    0_{1×p}         ∂^T(θ) ],

b = [0  ẏ(t)  ẏ(t)]^T,  c = [0_{1×p}  1_{1×p}  0_{1×p}]^T, and sign(x) = 1 for x > 0, 0 for x = 0, and −1 for x < 0.
By the dual theory, the dual program corresponding to the program (4) is:

Maximize: f2(ω) = b^T ω
Subject to: p2(ω) = A^T ω − c = 0,   (5)

where ω denotes the dual vector of υ, ω^T = [ω1  ω2  ω3]_{1×3}.
Therefore, an extremum seeking problem is converted into the programs defined in (4) and (5).
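The construction of A, b, and c in program (4) can be checked numerically: with υ = [∂(θ); |∂(θ)|; θ̇], the constraint Aυ = b encodes both the absolute-value consistency and the relation (3), while c^Tυ sums the absolute gradient entries. A small sketch; the gradient and θ̇ values are illustrative:

```python
def sign(x):
    return 1 if x > 0 else (-1 if x < 0 else 0)

def build_program(g, theta_dot, y_dot):
    """Assemble A (3 rows), b and c of program (4) for a gradient vector
    g = grad F(theta) and seeking-rate vector theta_dot (p = len(g))."""
    p = len(g)
    A = [
        [1.0] * p + [-float(sign(gi)) for gi in g] + [0.0] * p,  # ties middle block to |g|
        list(theta_dot) + [0.0] * p + [0.0] * p,                 # relation (3)
        [0.0] * p + [0.0] * p + list(g),                         # relation (3), transposed
    ]
    b = [0.0, y_dot, y_dot]
    c = [0.0] * p + [1.0] * p + [0.0] * p
    return A, b, c

g = [0.5, -2.0]                                        # illustrative gradient of F
theta_dot = [1.0, 0.25]
y_dot = sum(gi * ti for gi, ti in zip(g, theta_dot))   # Eq. (3)
A, b, c = build_program(g, theta_dot, y_dot)
v = g + [abs(gi) for gi in g] + theta_dot              # candidate point v
residual = [sum(aj * vj for aj, vj in zip(row, v)) - bi for row, bi in zip(A, b)]
obj = sum(cj * vj for cj, vj in zip(c, v))             # c^T v = sum of |g| entries
print(residual, obj)  # → [0.0, 0.0, 0.0] 2.5
```

The primal objective is the 1-norm of the gradient, so driving f1(υ) to zero under the constraint is exactly driving ∂(θ) to zero while respecting (3).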
2.2 Annealing Recurrent Neural Network (ARNN) Design
In view of the primal and dual programs (4) and (5), define the following energy function:

E(υ, ω) = T(t)(f1(υ) − f2(ω))²/2 + ‖p1(υ)‖²/2 + ‖p2(ω)‖²/2.   (6)

Clearly, the energy function (6) is convex and continuously differentiable. The first term in (6) is the squared difference between the objective functions of the programs (4) and (5). The second and third terms are for the equality constraints of (4) and (5).
T(t) denotes a time-varying annealing parameter.
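The energy function (6) combines an annealed duality gap with the two constraint residuals; a small numerical sketch with illustrative toy values:

```python
def energy(T, f1, f2, p1, p2):
    """Eq. (6): E = T*(f1 - f2)^2 / 2 + ||p1||^2 / 2 + ||p2||^2 / 2,
    i.e. an annealed duality gap plus the two constraint residuals."""
    return (T * (f1 - f2) ** 2 / 2
            + sum(r * r for r in p1) / 2
            + sum(r * r for r in p2) / 2)

# at a primal-dual optimum, gap and residuals vanish, so E = 0
assert energy(0.5, f1=3.0, f2=3.0, p1=[0.0], p2=[0.0, 0.0]) == 0.0
# once T(t) has annealed to zero, only constraint violations are penalized
assert energy(0.0, f1=5.0, f2=1.0, p1=[0.2], p2=[0.0]) == 0.2 * 0.2 / 2
print("Eq. (6) checks passed")
```

This also shows why T(t) must vanish in the limit: for t → ∞ the energy reduces to pure feasibility, which is the content of Theorem 2 below.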
With the energy function defined in (6), the dynamics for ARNN solving (4) and (5) can be defined by the negative gradient of the energy function as follows:
dσ/dt = −μ∇E(σ),   (7)

where σ = (υ, ω)^T, ∇E(σ) is the gradient of the energy function E(σ) defined in (6), and μ is a positive scalar constant, which is used to scale the convergence rate of the annealing recurrent neural network. The dynamical equation (7) of the annealing recurrent neural network can be expressed as:

du1/dt = −μ ∂E(υ, ω)/∂υ = −μ[T(t)c(c^Tυ − b^Tω) + A^T(Aυ − b)].   (8)
du2/dt = −μ ∂E(υ, ω)/∂ω = −μ[−T(t)b(c^Tυ − b^Tω) + A(A^Tω − c)].   (9)
υ = q(u1).   (10)
ω = q(u2).   (11)
where q(·) is a sigmoid activation function:

υ = q(u1) = (b1 − a1)/(1 + e^{−u1/ε1}) + a1,
ω = q(u2) = (b2 − a2)/(1 + e^{−u2/ε2}) + a2,

with ε1 > 0 and ε2 > 0. Here a1 and b1 denote the lower and upper bounds of υ, and a2 and b2 denote the lower and upper bounds of ω. The annealing recurrent neural network is described by equations (8)–(11), which are determined by the number of decision variables (υ, ω); (u1, u2) is the column vector of instantaneous net inputs to the neurons, and (υ, ω) is the column output vector of the neurons.
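To see the gradient dynamics (8)-(9) at work, here is a forward-Euler sketch on a one-constraint toy program. All numbers are illustrative, and the sigmoid output maps (10)-(11) are omitted for brevity, so the outputs are updated directly:

```python
# Toy program: minimize c^T v s.t. A v = b, with dual maximize b^T w
# s.t. A^T w = c.  The gradient dynamics (8)-(9) are integrated with a
# forward-Euler step (step size mu*dt chosen small for stability).
A = [1.0, 1.0]            # single constraint row
b = 1.0
c = [1.0, 1.0]
v = [0.8, -0.3]           # arbitrary primal start
w = 0.0                   # dual start
mu, dt = 1.0, 0.05

def E(v, w, T):
    """Energy function (6) for this toy program."""
    gap = c[0] * v[0] + c[1] * v[1] - b * w
    p1 = A[0] * v[0] + A[1] * v[1] - b
    p2 = [A[j] * w - c[j] for j in range(2)]
    return T * gap ** 2 / 2 + p1 ** 2 / 2 + (p2[0] ** 2 + p2[1] ** 2) / 2

T = 0.01
E0 = E(v, w, T)
for step in range(400):
    T = 0.01 * 0.99 ** step                 # decaying annealing parameter T(t)
    gap = c[0] * v[0] + c[1] * v[1] - b * w
    p1 = A[0] * v[0] + A[1] * v[1] - b
    grad_v = [T * c[j] * gap + A[j] * p1 for j in range(2)]                  # bracket of (8)
    grad_w = -T * b * gap + sum(A[j] * (A[j] * w - c[j]) for j in range(2))  # bracket of (9)
    v = [v[j] - mu * dt * grad_v[j] for j in range(2)]
    w = w - mu * dt * grad_w
print(E(v, w, T) < E0, round(v[0] + v[1], 2), round(w, 2))  # → True 1.0 1.0
```

The energy decreases along the trajectory and both constraints are met in the limit, mirroring the Lyapunov argument of Theorem 1 on this toy instance.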
3 Convergence Analysis
In this section, analytical results on the stability of the proposed annealing recurrent neural network and on the feasibility and optimality of the steady-state solutions to the programs described in (4) and (5) are presented.
Theorem 1. Assume that the Jacobian matrices J[q(u1)] and J[q(u2)] exist and are positive semidefinite. If the temperature parameter T(t) is nonnegative, strictly monotone decreasing for t ≥ 0, and approaches zero as time approaches infinity, then the annealing recurrent neural network (8)–(11) is asymptotically stable.
Proof: Consider the following Lyapunov function:

L = E(υ, ω) = T(t)(f1(υ) − f2(ω))²/2 + ‖p1(υ)‖²/2 + ‖p2(ω)‖²/2.   (12)

Apparently, L(t) > 0. The differentiation of L along the time trajectory of (12) is as follows:

dL/dt = [T(t) · ∂f1(υ)/∂υ · (f1(υ) − f2(ω)) + ∂p1(υ)/∂υ · p1(υ)] · dυ/dt
      + [−T(t) · ∂f2(ω)/∂ω · (f1(υ) − f2(ω)) + ∂p2(ω)/∂ω · p2(ω)] · dω/dt
      + (1/2) · dT(t)/dt · (f1(υ) − f2(ω))².   (13)

According to equations (8) and (9), together with dυ/dt = J[q(u1)] · du1/dt and dω/dt = J[q(u2)] · du2/dt, we have:

dL/dt = −(1/μ)(du1/dt)^T · J[q(u1)] · (du1/dt) − (1/μ)(du2/dt)^T · J[q(u2)] · (du2/dt) + (1/2) · dT(t)/dt · (f1(υ) − f2(ω))².   (14)
We know that the Jacobian matrices J[q(u1)] and J[q(u2)] both exist and are positive semidefinite, and μ is a positive scalar constant. If the time-varying annealing parameter T(t) is nonnegative, strictly monotone decreasing for t ≥ 0, and approaches zero as time approaches infinity, then dL/dt is negative definite. Because T(t) represents the annealing effect, simple examples of T(t) are T(t) = βα^{−ηt} or T(t) = β(1 + t)^{−η}, where α > 1, β > 0, and η > 0 are constant parameters; β and η can be used to scale the annealing parameter. Because L(t) is positive definite and radially unbounded and dL/dt is negative definite, according to Lyapunov's theorem the designed annealing recurrent neural network is asymptotically stable.
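Both example schedules satisfy the hypotheses of Theorem 1 (nonnegative, strictly decreasing, vanishing in the limit). A quick numerical check using the parameter values of Section 5:

```python
import math

beta, alpha, eta = 0.01, math.e, 5.0   # the values used in Section 5

def T_exp(t):
    """T(t) = beta * alpha**(-eta*t)."""
    return beta * alpha ** (-eta * t)

def T_poly(t):
    """T(t) = beta * (1 + t)**(-eta)."""
    return beta * (1.0 + t) ** (-eta)

for T in (T_exp, T_poly):
    vals = [T(0.1 * k) for k in range(100)]
    assert all(x > 0 for x in vals)                    # nonnegative
    assert all(p > q for p, q in zip(vals, vals[1:]))  # strictly decreasing
    assert T(100.0) < 1e-8                             # vanishes as t grows
print("both schedules satisfy the Theorem 1 hypotheses")
```

The exponential schedule anneals much faster than the polynomial one, which is the practical knob for trading convergence speed against search thoroughness.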
Theorem 2. Assume that the Jacobian matrices J[q(u1)] and J[q(u2)] exist and are positive semidefinite. If T(t) ≥ 0, dT(t)/dt < 0, and lim_{t→∞} T(t) = 0, then the steady state of the annealing neural network represents a feasible solution to the programs described in equations (4) and (5).

Proof: The proof of Theorem 1 shows that the energy function E(υ, ω) is positive definite and strictly monotone decreasing with respect to time t, which implies lim_{t→∞} E(υ, ω, T(t)) = 0. Because lim_{t→∞} T(t) = 0, we have

lim_{t→∞} E(υ, ω, T(t)) = lim_{t→∞} ( ‖p1(υ(t))‖²/2 + ‖p2(ω(t))‖²/2 ) = 0.   (15)

Because p1(υ(t)) and p2(ω(t)) are continuous,

lim_{t→∞} ( ‖p1(υ(t))‖²/2 + ‖p2(ω(t))‖²/2 ) = ‖p1(lim_{t→∞} υ(t))‖²/2 + ‖p2(lim_{t→∞} ω(t))‖²/2 = ‖p1(ῡ)‖²/2 + ‖p2(ω̄)‖²/2 = 0,

so we have p1(ῡ) = 0 and p2(ω̄) = 0, where ῡ and ω̄ are the stable solutions of υ and ω.

Now, let F1(υ) = [f1(υ)  f1(υ)  f1(υ)]^T and F2(ω) = [f2(ω)  f2(ω)  f2(ω)]^T
be the augmented vectors.
Theorem 3. Assume that the Jacobian matrices J[q(u1)] ≠ 0 and J[q(u2)] ≠ 0 and are positive semidefinite for all t ≥ 0, and that ∇(f1(υ)) ≠ 0 and ∇(f2(ω)) ≠ 0. If dT(t)/dt < 0, lim_{t→∞} T(t) = 0, and

T(t) ≥ max{ 0,
  ( ∇p1[υ(t)]^T J[q(u1)] (∂p1(υ)/∂υ) p1(υ) − ∇F1[υ(t)]^T J[q(u1)] (∂p1(υ)/∂υ) p1(υ) )
  / ( ∇F1[υ(t)]^T J[q(u1)] (∂f1(υ)/∂υ)(f1(υ) − f2(ω)) − ∇p1[υ(t)]^T J[q(u1)] (∂f1(υ)/∂υ)(f1(υ) − f2(ω)) ),
  ( ∇p2[ω(t)]^T J[q(u2)] (∂p2(ω)/∂ω) p2(ω) − ∇F2[ω(t)]^T J[q(u2)] (∂p2(ω)/∂ω) p2(ω) )
  / ( ∇F2[ω(t)]^T J[q(u2)] (∂f2(ω)/∂ω)(f1(υ) − f2(ω)) − ∇p2[ω(t)]^T J[q(u2)] (∂f2(ω)/∂ω)(f1(υ) − f2(ω)) ) },   (16)

then the steady states ῡ and ω̄ of the annealing neural network represent the optimal solutions υ∗ and ω∗ to the programs described in equations (4) and (5). Because of the length restriction, we omit the proof of Theorem 3.
4 A Chaotic Annealing Recurrent Neural Network Descriptions
In order to improve the global searching performance of the designed annealing recurrent neural network, we introduce chaotic factors into the designed neural network. The structure of the chaotic annealing recurrent neural network is described as follows. The net-input dynamics (17) and (18) are those of (8) and (9) augmented with a chaotic noise term ηi(t)χi(t) (i = 1, 2), which is injected when a random number in [0, 1] falls below the acceptance probability Pi(t); the outputs and the annealing of the chaotic drive are:

υ = q(u1) = (b1 − a1)/(1 + e^{−u1/ε1}) + a1,   (19)
ω = q(u2) = (b2 − a2)/(1 + e^{−u2/ε2}) + a2,   (20)
ηi(t + 1) = (1 − κ)ηi(t),  i = 1, 2,   (21)
Pi(t + 1) = (1 − δ)Pi(t),  Pi(0) > 0,   (22)
χi(t + 1) = γχi(t)(1 − χi(t)),   (23)

where γ = 4, 0 < κ < 1, 0 < δ < 1, ηi(0) > 0, ε1 > 0, and ε2 > 0. Equation (23) is a Logistic map; when γ = 4, the chaos phenomenon occurs in the system. As time approaches infinity, the chaotic annealing recurrent neural network evolves into the annealing recurrent neural network (8)–(11). Therefore, we need not repeat the analysis of the stability, solution feasibility, and solution optimality for the chaotic annealing recurrent neural network (17)–(23).
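The chaotic drive (23) and its annealing (21)-(22) can be checked numerically. This sketch assumes, as the decay parameters κ and δ suggest, that both the noise amplitude and the acceptance probability decay geometrically; the values are illustrative except χ1(0) = 0.912, which is taken from Section 5:

```python
# Logistic map at gamma = 4 wanders chaotically in (0, 1) while the noise
# amplitude eta and acceptance probability P decay geometrically.
gamma, kappa, delta = 4.0, 0.01, 0.01
chi, eta, P = 0.912, 1.0, 1.0   # chi_1(0) = 0.912 as in Section 5

orbit = []
for _ in range(200):
    chi = gamma * chi * (1.0 - chi)   # Eq. (23), chaotic drive
    eta = (1.0 - kappa) * eta         # Eq. (21), noise amplitude decay
    P = (1.0 - delta) * P             # Eq. (22), acceptance probability decay
    orbit.append(chi)

print(min(orbit) >= 0.0, max(orbit) <= 1.0, round(eta, 3))  # → True True 0.134
```

After 200 steps the noise amplitude has shrunk to about 13% of its initial value, which is the mechanism by which the coarse chaotic search hands over to the elaborate ARNN search.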
~
54
Y.-a. Hu, B. Zuo, and J. Li
5 Simulation Analysis
ⅰ
( ) A Simplified Tight Formation Flight Model Simulation Consider a simplified tight formation flight model consisting of two Unmanned Aerial Vehicles tested in reference [8]. The cost function of the tight formation flight model is given by
y ( t ) = −10 ( x1 ( t ) + 0) − 5( x3 ( t ) + 9) + 590 . 2
2
(24)
Clearly, if the states of the model are x1∗ = 0 and x3∗ = −9 , then the cost function
y ( t ) will reach its maximum y ∗ = 590 .
The initial conditions of the model are given as x1 ( 0 ) = − 2 , x 2 ( 0 ) = 0 ,
x 3 ( 0 ) = − 4 , x 4 ( 0 ) = 0 , θ 1 ( 0 ) = − 2 , θ 2 ( 0 ) = − 4 . Choose T ( t ) = β α −η t , where β = 0 .01 , α = e , η = 5 . Applying CARNN to the model described in reference [8], the parameters are given as: μ = 23.5 , γ = 4 , P1 ( 0 ) = P2 ( 0 ) = 1 , κ = 0.01 , δ = 0.01 , ε 1 = 10 , ε 2 = 10 , χ1 ( 0 ) = 0.912 ,
χ2 ( 0) = 0.551 , η1 ( 0) = [ −10 −1 5]T , η2 ( 0) = [3 10 5]T , b1 = b2 = 0.5 ,
a1 = a2 = −0.5 .
The simulation results are shown from figure 1 to figure 3. In those simulation results, solid lines are the results applying CARNN to ESA; dash lines are the results applying ESA with sliding mode[9]. Comparing those simulation results, we know the dynamic performance of the method proposed in the paper is superior to that of ESA with sliding mode. By figure 1 and figure 2, the “chatter” phenomenon disappears in the CARNN’s output, which is very harmful in practice. Moreover the convergence rate of ESA with CARNN can be scaled by adjusting the annealing parameter T ( t ) .
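The maximum claimed for the cost function (24) is easy to verify numerically; the grid below is illustrative:

```python
def y(x1, x3):
    """Cost function (24) of the simplified tight formation flight model."""
    return -10.0 * (x1 + 0.0) ** 2 - 5.0 * (x3 + 9.0) ** 2 + 590.0

# coarse grid around the operating region (grid choice is illustrative)
best = max(y(0.5 * i - 5.0, 0.5 * j - 14.0)
           for i in range(21) for j in range(21))
print(y(0.0, -9.0), best)  # → 590.0 590.0
```

No grid point beats the analytic maximizer (x1, x3) = (0, −9), consistent with the quadratic form of (24).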
Fig. 1. The simulation result of the state x1
Fig. 2. The simulation result of the state x3
(ii) Schaffer Function Simulation
In order to exhibit the global searching capability of the proposed CARNN, the typical Schaffer function (25) is used as the testing function [10]:

f(x1, x2) = (sin²√(x1² + x2²) − 0.5) / (1 + 0.001(x1² + x2²))² − 0.5,  |xi| ≤ 10, i = 1, 2.   (25)
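The Schaffer test function and its global minimum can be checked directly; the sampling grid is illustrative:

```python
import math

def schaffer(x1, x2):
    """Eq. (25); the global minimum is f(0, 0) = -1."""
    r2 = x1 ** 2 + x2 ** 2
    return (math.sin(math.sqrt(r2)) ** 2 - 0.5) / (1 + 0.001 * r2) ** 2 - 0.5

# sample the feasible square |xi| <= 10 on a coarse grid (step illustrative)
values = [schaffer(0.5 * i - 10.0, 0.5 * j - 10.0)
          for i in range(41) for j in range(41)]
print(schaffer(0.0, 0.0), min(values))  # → -1.0 -1.0
```

Away from the origin the denominator exceeds 1, so every other point evaluates strictly above −1; the dense ripples of the sine term are what create the many local extrema that trap gradient-only searches.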
When x1 = x2 = 0, the Schaffer function f(x1, x2) obtains the global minimum f(0, 0) = −1. However, there are numerous local minima and maxima within a range of 3.14 from the global minimum. Now, we define θ1 = x1 and θ2 = x2. The values of the CARNN's parameters are the same as those in subsection 5.1 except for μ = 35, η1(0) = [−200 −20 50]^T, and η2(0) = [100 300 50]^T. For simulation condition I, the initial conditions of the function (25) are given as x1(0) = −2 and x2(0) = 3.5; for simulation condition II, the initial conditions are given as x1(0) = −1 and x2(0) = 9.5. The simulation
Fig. 3. The simulation result of the output y
Fig. 4. The simulation result of f(x1, x2)
Fig. 5. The simulation result of x1
Fig. 6. The simulation result of x2
results are shown in figures 4 to 6, where the dash-dot lines are the results of simulation condition II and the solid lines are the results of simulation condition I. We have carried out a great number of simulations with different initial conditions; the ESA based on the chaotic annealing recurrent neural network found the global minimum of the Schaffer function in every simulation.
6 Conclusions
The method of introducing the CARNN into ESA greatly improves the dynamic performance and the global searching capability of the system. The two phases, a coarse search based on chaos and an elaborate search based on the ARNN, ensure that the system can fully carry out the chaotic search, find the global extremum point, and converge to that point. At the same time, the disappearance of the "chatter" of the system output and of the switching of the control law is beneficial to engineering applications.
References
1. Natalia, I.M.: Applications of the Adaptive Extremum Seeking Control Techniques to Bioreactor Systems. A Dissertation for the Degree of Master of Science. Ontario: Queen's University (2003)
2. Blackman, B.F.: Extremum-seeking Regulators. An Exposition of Adaptive Control, New York: Macmillan (1962) 36-50
3. Drakunov, S., Ozguner, U., Dix, P., Ashrafi, B.: ABS Control Using Optimum Search via Sliding Mode. IEEE Transactions on Control Systems Technology 3 (1995) 79-85
4. Krstic, M.: Toward Faster Adaptation in Extremum Seeking Control. Proc. of the 1999 IEEE Conference on Decision and Control, Phoenix, AZ (1999) 4766-4771
5. Tan, Y., Wang, B.Y., He, Z.Y.: Neural Networks with Transient Chaos and Time-variant Gain and Its Application to Optimization Computations. Acta Electronica Sinica 26 (1998) 123-127
6. Wang, L., Zheng, D.Z.: A Kind of Chaotic Neural Network Optimization Algorithm Based on Annealing Strategy. Control Theory and Applications 17 (2000) 139-142
7. Hu, Y.A., Zuo, B.: An Annealing Recurrent Neural Network for Extremum Seeking Control. International Journal of Information Technology 11 (2005) 45-52
8. Zuo, B., Hu, Y.A.: Optimizing UAV Close Formation Flight via Extremum Seeking. WCICA 2004 4 (2004) 3302-3305
9. Pan, Y., Ozguner, U., Acarman, T.: Stability and Performance Improvement of Extremum Seeking Control with Sliding Mode. Control, Vol. 76 (2003) 968-985
10. Wang, L.: Intelligent Optimization Algorithms with Application. Beijing: Tsinghua University Press (2004)
Solving the Delay Constrained Multicast Routing Problem Using the Transiently Chaotic Neural Network Wen Liu and Lipo Wang College of Information Engineering, Xiangtan University, Xiangtan, Hunan, China School of Electrical and Electronic Engineering, Nanyang Technology University, Block S1, 50 Nanyang Avenue, Singapore 639798 {liuw0004,elpwang}@ntu.edu.sg Abstract. Delay constrained multicast routing (DCMR) aims to construct a minimum-cost tree with end-to-end delay constraints. This routing problem is becoming more and more important to multimedia applications which are delay-sensitive and require real time communications. We solve the DCMR problem by the transiently chaotic neural network (TCNN) of Chen and Aihara. Simulation results show that the TCNN is more capable of reaching global optima compared with the Hopfield neural network (HNN).
1 Introduction
There are two types of multimedia delivery: real-time streaming and non-real-time downloads. Applications of real-time communication usually have various quality of service (QoS) requirements, such as bandwidth limits, cost minimization, and delay constraints. The QoS constrained routing problem covers a wide area, e.g., point-to-point and group-to-group routing, with different end-to-end QoS requirements [1, 2]. In this paper we focus on the delay constrained multicast routing (DCMR) problem, which is also called the constrained Steiner tree (CST) problem. Multicast routing [3, 4] covers the delivery service that cannot be accomplished by broadcast and point-to-point delivery. The multicast routing functionality includes three parts: the management of group membership, the construction of the data delivery route, and the information replication at the interior nodes. Our work is on the second part: constructing a delay constrained minimal cost tree with the transiently chaotic neural network (TCNN) [5]. Neural networks are applied to routing problems for their powerful parallel computational ability [6]. Rauch and Winarske use neural networks for the shortest path problem [7]. A modified version of the Hopfield neural network for delay constrained multicast routing was proposed in [8]. The model is capable of finding the solution for an 8-node network, but for large scale communication networks this HNN model may easily be trapped at local minima. To overcome this limitation of HNNs, Nozawa [9] proposed a chaotic
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 57–62, 2007. © Springer-Verlag Berlin Heidelberg 2007
neural network (CNN) by adding negative self-feedbacks into HNNs. Chen and Aihara [5] further developed the CNN and presented a neural network with transient chaos, namely the transiently chaotic neural network (TCNN). Since chaos improves the ability of a neural network model to reach global optima, the transiently chaotic neurodynamics makes the TCNN a promising tool for combinatorial optimization problems. Hence we use TCNNs here to solve the DCMR problem, exploiting the powerful searching ability of this transiently chaotic model. This paper is organized as follows. We introduce the delay constrained multicast routing problem in Section 2. The transiently chaotic neural network is reviewed in Section 3. Simulation results are presented and discussed in Section 4. Finally, we conclude this paper in Section 5.
2 The Delay Constrained Multicast Routing Problem
2.1 Problem Formulation
Based on the formulation presented in [10], an n-node, D-destination communication network is formulated on D n × n matrices, where matrix m is used to compute the constrained unicast route to destination dm (m = 1, · · · , D). Each element in a matrix is treated as a neuron, and neuron mxi describes the link from node x to node i for destination dm in the communication network. Pxi characterizes the connection status of the communication network: Pxi = 1 if the link from node x to node i does not exist; otherwise Pxi = 0. Vxi(m) is the output of the neuron at location (x, i) in matrix m; Vxi(m) = 1 implies that the link from node x to node i is on the final optimal tree for destination m; otherwise Vxi(m) = 0. Cxi and Lxi denote the cost and delay of a link from node x to node i, respectively, which are assumed to be real non-negative numbers [8]. For nonexisting arcs, Cxi = Lxi = 0. Costs and delays are assumed to be independent; e.g., costs could be a measure of channel utilization, and the delay could be a combination of propagation, transmission, and queuing delays.
2.2 Problem Definition
The delay constrained multicast routing problem is defined to construct a tree rooted at the source s and spanning to all the destination members of D = {d1 , d2 , . . . , dm } such that not only the total cost of the tree is minimum but also the delay from the source to each destination is not greater than the ren n (m) quired delay constraint, i.e., x=1 i=1,i=x Lxi Vxi ≤ Δ, where Δ is the delay (m)
bound. Vxi ∈ {0, 1} denotes the neuron output of constrained unicast route for destination dm . 2.3
The Energy Function
Pornavalai et al. [8] proposed the energy function for the delay constrained multicast routing problem. We change the neuron update rule by using the mean value of
Solving the DCMR Problem Using the TCNN
neuron outputs as the threshold to fire a neuron. In the original energy function, the outputs are forced to be 0 or 1 by an energy term $\sum_{x=1}^{n}\sum_{i=1}^{n} V_{xi}^{(m)}\big(1 - V_{xi}^{(m)}\big)$. The total energy function $E$ for delay constrained multicast routing is the sum of the energy functions of delay constrained unicast routing to every destination [8]: $E = \sum_{m=1,\, m \in D}^{N} E^{(m)}$, where $E^{(m)}$ is used to find the constrained unicast route from source node $s$ to destination $d_m$:

$$E^{(m)} = \mu_1 \Big[ \sum_{x=1}^{n} \sum_{i=1,\, i \neq x}^{n} C_{xi}\, f_{xi}^{(m)}(V)\, V_{xi}^{(m)} \Big] + \mu_2 \big(1 - V_{ms}^{(m)}\big) + \mu_3 \sum_{x=1}^{n} \Big\{ \sum_{i=1,\, i \neq x}^{n} V_{xi}^{(m)} - \sum_{i=1,\, i \neq x}^{n} V_{ix}^{(m)} \Big\}^2 + \mu_4 \sum_{x=1}^{n} \sum_{i=1,\, i \neq x}^{n} P_{xi} V_{xi}^{(m)} + \mu_5 \int_0^{z} h(z)\, dz \qquad (1)$$

where

$$f_{xi}^{(m)}(V) = \frac{1}{1 + \sum_{j=1,\, j \neq m}^{D} V_{xi}^{(j)}} \qquad (2)$$

$$h(z) = \begin{cases} 0, & \text{if } z \leq 0; \\ z, & \text{otherwise.} \end{cases} \qquad (3)$$

The $\mu_1$ term is the total cost of the unicast route for destination $d_m$; the function $f_{xi}^{(m)}(V)$ reduces the cost when unicast routes for different destinations choose the same link. The $\mu_2$ term creates a virtual link from destination $d_m$ to source $s$, which is used to satisfy the constraint stated in the $\mu_3$ term. The $\mu_3$ term ensures that, for every node, the number of incoming links equals the number of outgoing links. The $\mu_4$ term penalizes neurons that represent non-existing links. The $\mu_5$ term is used to satisfy the delay constraint, with $z = \sum_{x=1}^{n} \sum_{i=1,\, i \neq x}^{n} L_{xi} V_{xi}^{(m)} - \Delta$; thus the $\mu_5$ term contributes positively only when the delay constraint is violated [10].
3 Transiently Chaotic Neural Networks
Chen and Aihara proposed the transiently chaotic neural network (TCNN) [5] as follows:

$$U_{xi}^{(m)}(t+1) = k\, U_{xi}^{(m)}(t) + \sum_{y=1}^{N} \sum_{j=1,\, j \neq y}^{N} w_{yj,xi} V_{yj}^{(m)}(t) + I_{xi} - z_{xi}(t)\big[V_{xi}^{(m)}(t) - I_0\big] \qquad (4)$$

$$V_{xi}^{(m)}(t) = f_{xi}\big(U_{xi}^{(m)}(t)\big) = \frac{1}{1 + e^{-U_{xi}^{(m)}/\varepsilon_{xi}^{(m)}}} \qquad (5)$$

where

$$-\frac{\partial E}{\partial V_{xi}^{(m)}} = \sum_{y=1}^{N} \sum_{j=1,\, j \neq y}^{N} w_{yj,xi} V_{yj}^{(m)}(t) + I_{xi}$$
W. Liu and L.P. Wang
$$z_{xi}(t+1) = (1 - \beta)\, z_{xi}(t)$$

where $z_{xi}(t)$ is the self-feedback neuronal connection weight ($z_{xi}(t) \geq 0$). $U_{xi}^{(m)}$ and $V_{xi}^{(m)}$ are the internal state and output of neuron $(x, i)$ in matrix $m$, respectively; $k$ is the damping factor of the nerve membrane ($0 \leq k \leq 1$); $I_{xi}$ is the input bias of neuron $(x, i)$; $I_0$ is a positive bias; $\beta$ is the damping factor of the time-dependent neuronal self-coupling ($0 \leq \beta \leq 1$); and $\varepsilon_{xi}^{(m)}$ is the steepness parameter of the neuron activation function ($\varepsilon_{xi}^{(m)} > 0$).
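The update rules (4) and (5), together with the exponential self-feedback decay, can be sketched numerically as follows. This is a minimal illustration assuming NumPy; the coupling matrix `W`, the biases `I`, and all numeric values here are made-up stand-ins, not the energy-derived weights of the routing problem.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10                       # number of neurons (flattened)
k_damp = 0.9                 # damping factor k of the nerve membrane
I0, eps = 0.65, 0.004        # positive bias and steepness parameter
beta = 0.001                 # damping of the self-feedback z

# Made-up symmetric coupling and input biases (illustrative only).
W = rng.normal(0.0, 0.1, (N, N))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
I = rng.normal(0.0, 0.1, N)

U = rng.uniform(-1.0, 1.0, N)   # internal states U(0)
z = np.full(N, 0.1)             # self-feedback weights z(0)

def output(U):
    # Eq. (5): steep sigmoid (clipped to avoid floating-point overflow).
    return 1.0 / (1.0 + np.exp(-np.clip(U / eps, -60.0, 60.0)))

for t in range(500):
    V = output(U)
    # Eq. (4): damped state + gradient-derived input + chaotic self-feedback.
    U = k_damp * U + W @ V + I - z * (V - I0)
    # Self-feedback decays, so the transient chaos eventually dies out.
    z = (1.0 - beta) * z

print(z[0] < 0.1)   # True: the chaotic term has been annealed down
```

As `z` shrinks, the dynamics pass from chaotic search to convergent gradient-like behavior, which is exactly the "transient" character of the TCNN.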
4 Simulation Results
We implement the algorithm in VC++. The end of the iteration process is determined by the change in the energy function between two steps, $\Delta E = E(t) - E(t-1)$: the iterations stop when the magnitude of $\Delta E$ is smaller than a threshold (0.002) for three consecutive steps. The communication networks used in the simulations are generated by a graph generator [11]: a network with $n$ nodes is randomly placed on a Cartesian coordinate plane. Fig. 1 is an example of a randomly generated 80-node communication network, and Table 1 shows the specifications of the communication networks we generated. Values for the weighting coefficients are chosen as follows, based on [8]: $\mu_1 = 200$, $\mu_2 = 5000$, $\mu_3 = 1500$, $\mu_4 = 5000$, $\mu_5 = 250$. Following the parameter setting principle described in [12], we let $\varepsilon_{xi}^{(m)} = 0.004$, $I_0 = 0.65$, $\beta = 0.001$, and $z(0) = 0.1$. Initial inputs of the neural networks $U_{xi}(0)$ are randomly generated in $[-1, 1]$. At the end of each iteration, we set each neuron on or off according to the average value $V_T$ of
Fig. 1. An 80-node network used in our simulations, with an average degree of 4 (the number of links per node)
Table 1. Specifications of the randomly generated geometric instances

Instance   Nodes (N)   Destinations   Edges (E)   Delay bound (Δ)
Case #1         8           5             11            20
Case #2        16           5             22            20
Case #3        30           5             47            20
Case #4        80           5            154            25
Table 2. Results of the HNN and the TCNN for instances #1 to #4 ("sd" stands for "standard deviation")

                   Cost mean±sd                Time mean±sd (s)
Instance No.   HNN          TCNN           HNN          TCNN
1              12.19±0.53   8.72±0.31      0.18±0.22    0.68±0.29
2              32.06±3.23   23.20±0.28     2.17±0.64    6.49±0.70
3              24.35±1.31   22.67±1.95     23.32±3.71   46.86±4.18
4              29.07±1.02   25.13±0.98     537.3±79.2   1150±127
the output matrix: if $V_{xi} \geq V_T$, then $V_{xi} = 1$ and the link from $x$ to $i$ is on the final optimal tree, and vice versa. The algorithm is run 1000 times with randomly generated initial neuron states and compared with the conventional Hopfield network used in [8]. The results are listed in Table 2. The TCNN is capable of jumping out of local minima and reaching global optima thanks to its complex dynamics. As a trade-off, the execution time increases. In applications, we can balance the route optimality ratio against the execution time through the parameter $\beta$, which determines the decay of the chaotic dynamics: a larger $\beta$ makes the TCNN converge faster, while a smaller one makes the TCNN more likely to reach the global optimum.
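The stopping criterion and the mean-value firing rule described above can be sketched as follows; a minimal illustration with made-up energy values and a made-up output matrix.

```python
import numpy as np

def should_stop(energies, tol=0.002, window=3):
    """Stop when |Delta E| stays below tol for `window` consecutive steps."""
    if len(energies) < window + 1:
        return False
    deltas = np.abs(np.diff(energies[-(window + 1):]))
    return bool(np.all(deltas < tol))

def binarize(V):
    """Fire a neuron iff its output reaches the mean output V_T."""
    VT = V.mean()
    return (V >= VT).astype(int)

E_hist = [5.0, 4.2, 4.199, 4.1985, 4.198]   # made-up energy trace
print(should_stop(E_hist))                   # True: last three |dE| < 0.002

V = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(binarize(V))   # threshold V_T = 0.5, so the diagonal neurons fire
```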
5 Conclusion

We studied the delay constrained multicast routing problem, which is motivated by the fast development of delay-sensitive communication applications. We showed that the transiently chaotic neural network is more capable of reaching globally optimal solutions than the HNN.
Individual QoS parameters may be conflicting and interdependent, which makes the problem even more challenging [13]. Computing multicast routes that satisfy several QoS parameters simultaneously is an NP-hard problem, and it is even harder when each destination has different QoS requirements. Furthermore, the multicast group may be dynamic, i.e., nodes may join or leave the communication network at any instant. We will keep exploring this area in the future.
References

1. Reeves, D.S., Salama, H.F.: A Distributed Algorithm for Delay-constrained Unicast Routing. IEEE/ACM Transactions on Networking 8(2) (2000) 239-250
2. Chen, J., Chan, S.H.G., Li, V.O.K.: Multipath Routing for Video Delivery over Bandwidth-limited Networks. IEEE Journal on Selected Areas in Communications 22(10) (2004) 1920-1932
3. Chakraborty, D., Chakraborty, G., Shiratori, N.: A Dynamic Multicast Routing Satisfying Multiple QoS Constraints. International Journal of Network Management 13(5) (2003) 321-335
4. Ganjam, A., Zhang, H.: Internet Multicast Video Delivery. Proceedings of the IEEE 93(1) (2005) 159-170
5. Chen, L.N., Aihara, K.: Chaotic Simulated Annealing by a Neural Network Model with Transient Chaos. Neural Networks 8(6) (1995) 915-930
6. Venkataram, P., Ghosal, S., Kumar, B.P.V.: Neural Network Based Optimal Routing Algorithm for Communication Networks. Neural Networks 15(10) (2002) 1289-1298
7. Rauch, H.E., Winarske, T.: Neural Networks for Routing Communication Traffic. IEEE Control Systems Magazine 8(2) (1988) 26-31
8. Pornavalai, C., Chakraborty, G., Shiratori, N.: A Neural Network Approach to Multicast Routing in Real-time Communication Networks. In: International Conference on Network Protocols (ICNP-95) (1995) 332-339
9. Nozawa, H.: A Neural-network Model as a Globally Coupled Map and Applications Based on Chaos. Chaos 2(3) (1992) 377-386
10. Ali, M.K.M., Kamoun, F.: Neural Networks for Shortest Path Computation and Routing in Computer Networks. IEEE Transactions on Neural Networks 4(6) (1993) 941-954
11. Waxman, B.: Routing of Multipoint Connections. IEEE Journal on Selected Areas in Communications 6(9) (1988) 1617-1622
12. Wang, L.P., Li, S., Tian, F.Y., Fu, X.J.: A Noisy Chaotic Neural Network for Solving Combinatorial Optimization Problems: Stochastic Chaotic Simulated Annealing. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 34(5) (2004) 2119-2125
13. Roy, A., Banerjee, N., Das, S.K.: An Efficient Multi-objective QoS Routing Algorithm for Real-time Wireless Multicasting. In: Proceedings of the IEEE Vehicular Technology Conference (2002) 1160-1164
Solving Prize-Collecting Traveling Salesman Problem with Time Windows by Chaotic Neural Network

Yanyan Zhang and Lixin Tang

The Logistics Institute, Northeastern University, Shenyang, China
[email protected]

Abstract. This paper presents an artificial neural network algorithm for the prize-collecting traveling salesman problem with time windows, which is often encountered when scheduling color-coating coils in cold rolling production or slabs in a hot rolling mill. The objective is to find a subset sequence of all cities such that the sum of the traveling cost and the penalty costs of unvisited cities is minimized. To deal with this problem, we construct a mathematical model and the corresponding network formulation. Chaotic neurodynamics are introduced and designed to obtain solutions of the problem, and a workload reduction strategy is proposed to speed up the solving procedure. To verify the efficiency of the proposed method, we compare it with an ordinary Hopfield neural network in experiments on randomly generated problem instances. The results clearly indicate that the proposed method is effective and efficient, with respect to both solution quality and computation time, for problems of the given sizes.
1 Introduction

A great number of problems in theory and practice are related to combinatorial optimization, most of which are hard to solve and belong to the class of NP-hard problems. Research in this field therefore usually aims at developing efficient and effective techniques for finding better solutions instead of exact ones. From the practical viewpoint, fast approximate algorithms are useful and have achieved considerable success in practical cases. A typical combinatorial optimization problem of this kind can be found in the scheduling of color-coating coils in a cold rolling mill and of slabs in a hot rolling mill. In the production of color-coating coils, after surface treatment, the cold rolled coils and galvanized coils are coated with various paints by the roller-coating method. In the course of operation, considering productivity and cost, many requirements between adjacent coils must be taken into account. Most of these requirements can be transformed into a parameter [1] (similar to the sense of "distance" in the TSP) between adjacent coils (cities). The situation of slab scheduling in a hot rolling mill is much the same. Based on such a transformation, these production scheduling problems can be formulated in the framework of the well-studied Prize-Collecting Traveling Salesman Problem with Time Windows

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 63–71, 2007. © Springer-Verlag Berlin Heidelberg 2007
(PCTSPTW), which is characterized by a prize-collecting mechanism and time-windows requirements. With respect to the prize-collecting mechanism (each city is assigned a prize value and a penalty), the goal is to construct a tour that maximizes the total prize collected and (or) minimizes the penalties incurred while minimizing the distance traveled; this allows the salesman to skip certain unprofitable sites. Similar research has been done in [2][3]. As for the time-windows requirement, each city can only be visited within a given time interval. Because of the time windows, only feasible paths need to be considered. In practice, such windows are needed because there exists a holding-time constraint for each coil or slab: if a coil (slab) has been kept longer than the allowable time limit before processing, unnecessary pretreatment cost is incurred. In this research, based on the holding-time requirement, the arrival and visiting times of each city are defined as its time window, a real-time index of the city. Therefore, the best sequence in the above context is the sequence with minimum cost, with respect to both visiting costs and penalties. Unlike most research, which treats the time-windows requirements as soft constraints (violation of which leads to penalties), this paper formulates them as hard constraints: only solutions feasible with respect to the time windows are accepted, which increases the difficulty of solving. It has been proved that the elementary shortest path problem with time windows is strongly NP-hard [4], and relaxed versions of this problem have been reported [5]. Therefore the PCTSPTW with the complex constraints that we address is intractable; even finding a feasible solution is hard. As for the solving approach, artificial neural networks have been applied to many combinatorial optimization problems [6][7], such as the TSP and production scheduling problems.
However, in these problems, no time windows are taken into account. The performance and structure of artificial neural networks have been steadily improved, and among these developments the transiently chaotic neural network [8][9] is one of the most successful. In this paper, we propose a novel neural network algorithm in which a chaotic mechanism is introduced to escape from the local minima of traditional neural networks. To the authors' knowledge, it is the first such algorithm in the literature to solve the PCTSPTW. The contribution of this research involves the construction of a network formulation, the derivation of the running neurodynamics, innovations for reducing computation cost, and the design of the experiments.
2 Problem Description and Formulation

We define the Prize-Collecting Traveling Salesman Problem with Time Windows (PCTSPTW) as follows.

2.1 Notations

n — the number of all available cities to be processed.
i, j — city identifiers, i, j = 1, 2, ..., n.
C — the capacity demand (the upper bound on collected prizes) of the sequence.
p_i — the penalty incurred when city i is not selected in the current sequence.
r_i — the arrival time of city i.
e_i — the ending time of city i.
c_i — the visiting prize of city i.
B_i — the start time of city i.
k, l — processing positions, k, l = 1, 2, ..., n.
v_ik — the output of a neuron.
u_ik — the state of a neuron.
I_ik — the threshold value of a neuron.
W_ik,jl — the connection weight between two neurons.
IS_k — the immediately succeeding visiting position of k in the route.
IP_k — the immediately preceding visiting position of k in the route.
d_ij — the distance between cities i and j.
α — damping factor of the nerve membrane (0 ≤ α ≤ 1).
γ — positive scaling parameter for inputs.
z_ik — self-feedback connection weight or refractory strength (z_ik ≥ 0).
β — damping factor of the time-dependent z_ik (0 ≤ β ≤ 1).
I_0 — positive parameter.
Ω_s — the set of selected cities in the current sequence.
Ω_u — the set of unvisited cities.

Decision variables:

x_ij = 1 if city i is visited immediately before city j, and 0 otherwise.
y_i = 1 if city i is selected in the current sequence, and 0 otherwise.
2.2 Mathematical Model
The objective function:

$$\min \sum_{i=1}^{n} \sum_{j=1}^{n} x_{ij} d_{ij} + \sum_{i=1}^{n} p_i (1 - y_i) \qquad (1)$$

subject to

$$\sum_{i=1}^{n} x_{ij} \leq 1, \quad j = 1, 2, \ldots, n \qquad (2)$$

$$\sum_{j=1}^{n} x_{ij} \leq 1, \quad i = 1, 2, \ldots, n \qquad (3)$$

$$\sum_{i,\, j \in S} x_{ij} \leq |S| - 1, \quad \forall S \subseteq \Omega_s \qquad (4)$$

$$r_i \leq B_i \leq e_i, \quad \forall i \in \Omega_s \cup \Omega_u \qquad (5)$$

$$\sum_{i=1}^{n} c_i y_i \leq C \qquad (6)$$

$$\sum_{j \in \Omega \setminus i} x_{ij} = y_i, \quad \forall i \in \Omega_s \cup \Omega_u \qquad (7)$$

$$x_{ij} \in \{0, 1\}, \quad i, j = 1, 2, \ldots, n \qquad (8)$$

$$y_i \in \{0, 1\}, \quad i = 1, 2, \ldots, n \qquad (9)$$
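How objective (1) and the capacity constraint (6) act on a concrete selection can be sketched on a toy instance, assuming NumPy; all instance data below are made up for illustration, and the time windows (5) and degree constraints are omitted for brevity.

```python
import numpy as np

# Toy instance (made-up data): 4 cities, symmetric distances, penalties, prizes.
d = np.array([[0, 3, 4, 2],
              [3, 0, 5, 6],
              [4, 5, 0, 3],
              [2, 6, 3, 0]], dtype=float)
p = np.array([10.0, 4.0, 7.0, 5.0])   # penalty p_i for skipping city i
c = np.array([2.0, 1.0, 2.0, 1.0])    # prize c_i collected when city i is visited
C = 5.0                               # capacity: upper bound on collected prizes

def cost(route):
    """Objective (1): travel cost of the visited sequence
    plus penalties for the unselected cities."""
    travel = sum(d[a, b] for a, b in zip(route, route[1:]))
    skipped = [i for i in range(len(p)) if i not in route]
    return travel + sum(p[i] for i in skipped)

def feasible(route):
    """Constraint (6): total prize collected must not exceed C."""
    return sum(c[i] for i in route) <= C

route = [0, 3, 2]          # visit cities 0 -> 3 -> 2, skip city 1
print(feasible(route))     # True: prize 2 + 1 + 2 = 5 <= 5
print(cost(route))         # 9.0: travel 2 + 3, penalty 4 for skipping city 1
```

Visiting all four cities would collect a prize of 6 > C, so the prize-collecting mechanism forces at least one city to be skipped at the cost of its penalty.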
The first item in the objective function is the sum of the distances between all pairs of adjacent cities; the second item is the total penalty of unscheduled cities. Constraints (2) ensure that, for each city, at most one city is arranged before it. Constraints (3) guarantee that, for each city, at most one city is arranged after it. Constraints (4) ensure the feasibility of the obtained route, in that no cycle is allowed to exist, where S is the generated city sequence. Constraints (5) present the time window of each city, a hard real-time constraint: only within this window can the city be processed. Constraints (6) state that the total prize collected in the sequence may not exceed the upper bound, the capacity demand. Equations (7), (8) and (9) are the variable value constraints.

2.3 Network Formulation
Objective function:

$$\min \sum_{i=1}^{n} \sum_{k=1}^{n} \sum_{j=1,\, j \neq i}^{n} v_{ik} \big(v_{j, IS_k} + v_{j, IP_k}\big) d_{ij} + \sum_{i=1}^{n} p_i \Big(1 - \sum_{k=1}^{n} v_{ik}\Big) \qquad (10)$$

subject to

$$\sum_{i=1}^{n} \sum_{k=1}^{n} \sum_{j=1,\, j \neq i}^{n} v_{ik} v_{jk} = 0 \qquad (11)$$

$$\sum_{i=1}^{n} \sum_{k=1}^{n} \sum_{l=1,\, l \neq k}^{n} v_{ik} v_{il} = 0 \qquad (12)$$

$$\Big(\sum_{i=1}^{n} \sum_{k=1}^{n} v_{ik} - num\Big)^2 = 0 \qquad (13)$$

$$\Big(\sum_{i=1}^{n} \sum_{k=1}^{n} c_i v_{ik} - C\Big)^2 = 0 \qquad (14)$$

$$\min_{1 \leq i \leq n} \sum_{k=1}^{n} v_{ik} (B_i - r_i) \geq 0 \qquad (15)$$

$$\min_{1 \leq i \leq n} \sum_{k=1}^{n} v_{ik} (e_i - B_i) \geq 0 \qquad (16)$$
where $B_i$ is the start time of city $i$, $B_i = \max\{B_{IP_i} + d_{IP_i,\, i},\, r_i\}$. In the objective function, the first item is the sum of the distances between all pairs of adjacent cities, the second item is the penalty of all cities for due-date tardiness, and the third item is the penalty for unscheduled cities. Constraints (11) require that, at each position, only one city can be arranged. Constraints (12) mean that each city can only be assigned to one processing position. Constraints (13) state that the approximate number of scheduled cities is $num$, which corresponds to the capacity (prize) limitation and is expressed as $num = \big[C \big/ \big(\sum_{i=1}^{n} c_i / n\big)\big]$. Constraints (14) express the demand on the sum of prizes in the sequence. Constraints (15) and (16) are the time-window requirements: once a city is selected, its start time must be after its earliest possible start time and before its latest allowable start time. Then we get the following energy function.
$$E = \frac{A_1}{2} \Big( \sum_{i=1}^{n} \sum_{k=1}^{n} \sum_{j=1,\, j \neq i}^{n} v_{ik} \big(v_{j, IP_k} + v_{j, IS_k}\big) d_{ij} + \sum_{i=1}^{n} p_i \Big(1 - \sum_{k=1}^{n} v_{ik}\Big)^2 \Big) + \frac{A_2}{2} \Big( \sum_{i=1}^{n} \sum_{k=1}^{n} \sum_{j=1,\, j \neq i}^{n} v_{ik} v_{jk} + \sum_{i=1}^{n} \sum_{k=1}^{n} \sum_{l=1,\, l \neq k}^{n} v_{ik} v_{il} \Big) + \frac{A_3}{2} \Big(\sum_{i=1}^{n} \sum_{k=1}^{n} v_{ik} - num\Big)^2 + \frac{A_4}{2} F\Big(C - \sum_{i=1}^{n} \sum_{k=1}^{n} c_i v_{ik}\Big) + \frac{A_5}{2} \Big( G\Big(\min_{1 \leq i \leq n} \sum_{k=1}^{n} v_{ik} (B_i - r_i)\Big) + G\Big(\min_{1 \leq i \leq n} \sum_{k=1}^{n} v_{ik} (e_i - B_i)\Big) \Big) \qquad (17)$$
The connection weights and threshold values are as follows:

$$w_{ik,jl} = -A_1 \big( (1 - \delta_{ij}) (\delta_{l, IP_k} + \delta_{l, IS_k}) d_{ij} + \delta_{ij} p_i \big) - A_2 \big( (1 - \delta_{ij}) \delta_{kl} + \delta_{ij} (1 - \delta_{kl}) \big) - A_3 - A_4 c_i c_j\, g\Big(C - \sum_{p=1}^{n} \sum_{q=1}^{n} c_p v_{pq}\Big) \qquad (18)$$

$$I_{ik} = -A_1 \lambda_2 c_i - A_3\, num - A_4 C c_i\, g\Big(C - \sum_{p=1}^{n} \sum_{q=1}^{n} c_p v_{pq}\Big) - A_5 \Big( (B_i - r_i)\, g\Big(\min_{1 \leq j \leq n} \sum_{l=1}^{n} v_{jl} (B_j - r_j)\Big) + (e_i - B_i)\, g\Big(\min_{1 \leq j \leq n} \sum_{l=1}^{n} v_{jl} (e_j - B_j)\Big) \Big) \qquad (19)$$

Substituting the above formulas for $w_{ik,jl}$ and $I_{ik}$ into the following equation,

$$u_{ik}(t) = \sum_{j=1}^{n} \sum_{l=1,\, jl \neq ik}^{n} w_{ik,jl}\, v_{jl}(t) - I_{ik}. \qquad (20)$$
j =1 l =1, jl ≠ik
ij
)(δ l ,IPk+δ l ,ISk )d ij + δ ij pi ) − A2 ((1 − δ ij )δ kl (21)
⎞ + δ ij (1 − δ kl )) − A3 − A4 ci c j g (C − ∑∑ ci vik ) ⎟v jl (t ) − I ik i =1 k =1 ⎠ n
u ik (t ) = − A1 ( − A2 (
n
∑
j =1, j ≠ i
n
n
∑ (v j ,IPk (t ) + v j , ISk (t ))d ij + pi (∑ vil − 1))
j =1, j ≠ i
v jk (t ) + n
n
l =1
n
∑v
l =1, l ≠ k
n
n
il
n
(t )) − A3 (∑∑ v jl (t ) − num) j =1 l =1
n
n
− A4 g (C − ∑ ∑ c p v pq )c i (∑ ∑ v jl (t )c j − C ) p =1 q =1
j =1 l =1
n
n
− A5 (( Bi − ri ) g (min ∑ v jl ( B j − r j )) + (ei − Bi ) g (min ∑ v jl (e j − B j ))) 1≤ j ≤ n
l =1
1≤ j ≤ n
l =1
(22)
Solving Prize-Collecting Traveling Salesman Problem
69
When chaos is introduced, n
n
n
uik (t + 1) = αuik (t ) + β (∑∑∑
n
∑w
i =1 k =1 j =1 l =1, jl ≠ik
ik jl
v jl (t ) − I ik ) + z (t )(vik (t ) − I 0 )
n n ⎛ = αuik (t ) + γ ⎜⎜ − A1 ( ∑ (v j ,IPk (t ) + v j ,ISk (t ))d ij + pi (∑ vil − 1)) j =1, j ≠i l =1 ⎝
− A2 (
n
∑v
j =1, j ≠i
jk
(t ) +
n
n
∑v
l =1,l ≠ k
il
n
n
(t )) − A3 (∑∑ v jl (t ) − num) j =1 l =1
n
n
n
− A4 g (C − ∑∑ c p v pq )ci (∑∑ v jl (t )c j − C ) + zik (t )(vik (t ) − I 0 ) p =1 q =1
j =1 l =1
n ⎞ − A5 (( Bi − ri ) g (min ∑ v jl ( B j − rj )) + (ei − Bi ) g (min ∑ v jl (e j − B j ))) ⎟ 1≤ j ≤ n 1≤ j ≤ n l =1 l =1 ⎠ n
(23) Where
$$z_{ik}(t+1) = z_{ik}(t) \big/ \ln\big(e + \beta (1 - z_{ik}(t))\big)$$
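The logarithmic decay schedule for the self-feedback term can be sketched as follows; a minimal illustration in which the starting value of z is made up.

```python
import math

def z_next(z, beta=0.001):
    # Decay rule: z(t+1) = z(t) / ln(e + beta * (1 - z(t)))
    # For 0 < z < 1 the denominator exceeds 1, so z shrinks every step.
    return z / math.log(math.e + beta * (1.0 - z))

z = 0.8
trace = [z]
for _ in range(2000):
    z = z_next(z)
    trace.append(z)

# The self-feedback strength decays monotonically toward zero,
# annealing the chaotic search into convergent dynamics.
print(all(a > b for a, b in zip(trace, trace[1:])))   # True
```

Compared with a plain exponential schedule, this rule decays very slowly while z is large, prolonging the chaotic search phase.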
and the output is

$$v_{ik} = \frac{1}{1 + e^{-u_{ik}/\varepsilon}}, \qquad F(x) = \begin{cases} 0, & x \geq 0 \\ x^2, & x < 0, \end{cases} \qquad G(x) = \begin{cases} 0, & x \geq 0 \\ x, & x < 0. \end{cases}$$

For the hidden layer,

$$\delta_j(n) = \varphi'_j(x_j(n)) \cdot \sum_{k} \delta_k(n)\, w_{kj}(n) = \frac{b}{a}\, \big[a - y_j(n)\big]\big[a + y_j(n)\big] \sum_{k} \delta_k(n)\, w_{kj}(n). \qquad (7)$$
3.2 Adaptive MTI Filter Based on Burg Algorithm
Hawkes and Haykin pointed out that most clutters can be fitted by low-order auto-regressive (AR) sequences. The coefficients of the AR model are determined by the kind of clutter and the environment. The Maximum Entropy Method (MEM) of spectral estimation has the following power spectral expression:

$$P(\omega) = \frac{\sigma^2}{\big|A(e^{j\omega})\big|^2}, \qquad A(z) = \sum_{i=0}^{p} a(i)\, z^{-i}. \qquad (8)$$

Fig. 3. FIR filter scheme
Q. Ren et al.
Burg's method is one of the MEM algorithms, and it is equivalent to the AR model when a(0) = 1. To filter the clutter, an FIR filter is designed whose coefficients are exactly the coefficients $(a_0, a_1, \ldots, a_N)$ obtained by the Burg algorithm. The output of the filter is $y(n) = \sum_{k=0}^{N} a_k x(n-k)$, and the system function is $H(z) = \sum_{k=0}^{N} a_k z^{-k}$. The frequency response is

$$H(e^{j\omega}) = \sum_{k=0}^{N} a_k e^{-j\omega k}. \qquad (9)$$
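The adaptive MTI design, Burg AR estimation followed by FIR filtering with the estimated coefficients, can be sketched as follows. This is an illustrative sketch assuming NumPy; the `burg_ar` routine and the AR(2) "clutter" process are stand-ins, not the paper's implementation.

```python
import numpy as np

def burg_ar(x, order):
    """Estimate AR coefficients (a_0 = 1, a_1, ..., a_order) with Burg's method."""
    ef = np.asarray(x, dtype=float).copy()   # forward prediction errors
    eb = ef.copy()                           # backward prediction errors
    a = np.array([1.0])
    for _ in range(order):
        efp, ebp = ef[1:], eb[:-1]
        # Reflection coefficient minimizing forward+backward error power.
        k = -2.0 * (efp @ ebp) / (efp @ efp + ebp @ ebp)
        ef, eb = efp + k * ebp, ebp + k * efp
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
    return a

# Narrow-band "clutter": an AR(2) process with a strong spectral peak.
rng = np.random.default_rng(0)
n = 4096
x = np.zeros(n)
for t in range(2, n):
    x[t] = 1.6 * x[t - 1] - 0.9 * x[t - 2] + rng.standard_normal()

a = burg_ar(x, 2)                      # FIR coefficients = AR coefficients
y = np.convolve(x, a, mode="valid")    # y(n) = sum_k a_k x(n-k): the MTI filter

print(np.var(y) < 0.1 * np.var(x))     # True: clutter power strongly suppressed
```

The filter's zeros sit at the AR poles, so the narrow-band clutter resonance is notched out while broadband content passes.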
Comparing equation (8) with (9), one can see that the zeros of the FIR filter frequency response are exactly the poles of the Burg spectral expression. Therefore the filter has the ideal frequency response, which is just the "inverted" clutter spectrum. Since the central frequency and bandwidth of the clutter are estimated by the Burg algorithm during a "learning period" in practical operation, the filter can be adaptively adjusted to the characteristics of the particular clutter spectrum.

3.3 Chaotic Neural Network as Spectrum Repair Module
The effect of noise, such as thermal noise, is to expand the frequency spectrum peak of the target signal. This expansion brings difficulties and errors to the estimation of the distance and velocity information. One of the most useful functions of a chaotic neural network is its associative memory characteristic [10]. Because of its complex dynamics, a chaotic neural network has more memory capacity and error tolerance than a Hopfield neural network. In this paper, the chaotic neural network memorizes the ideal spectrum peaks and associates an expanded spectrum peak with the most likely memorized ideal peak. The expression of the chaotic neural network is the network (2) introduced in Section 3.1, but the output function adopts the sigmoid function

$$f(y) = \frac{1 - \exp(-\lambda y)}{1 + \exp(-\lambda y)}, \qquad (10)$$

where $\lambda$ is the steepness parameter and $w_{ij}$ is the synaptic weight to the $i$th neuron from the $j$th neuron. The chaotic neural network memorizes $T$ ideal spectrum peaks, and its learning rule adopts the Hebb rule

$$w_{ij} = \sum_{p=1}^{T} \big(x_i^p - \bar{x}_i\big)\big(x_j^p - \bar{x}_j\big), \qquad (11)$$

where $x_i^p$ is the $i$th element of the $p$th memorized peak.
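The associative-repair idea can be illustrated with the classic outer-product (Hopfield) form of the Hebbian rule. This sketch is not the covariance rule (11): it uses ±1 patterns and made-up mutually orthogonal "peak" stand-ins, assuming NumPy, purely to show how a corrupted pattern snaps back to a stored one.

```python
import numpy as np

# Three mutually orthogonal ±1 "ideal spectrum peak" patterns of length 64.
n = 64
x1 = np.array([1] * 32 + [-1] * 32)
x2 = np.tile([1] * 16 + [-1] * 16, 2)
x3 = np.tile([1] * 8 + [-1] * 8, 4)
patterns = [x1, x2, x3]

# Hebbian outer-product weights with zero self-coupling.
W = sum(np.outer(q, q) for q in patterns)
np.fill_diagonal(W, 0)

# "Expanded/noisy peak": pattern x2 with five entries flipped.
probe = x2.copy()
probe[[0, 7, 20, 33, 50]] *= -1

# One synchronous associative-recall step repairs the pattern exactly,
# because the stored patterns are orthogonal and the corruption is small.
recalled = np.sign(W @ probe).astype(int)
print(np.array_equal(recalled, x2))   # True
```

The chaotic network of the paper adds transiently chaotic dynamics on top of this basin-of-attraction mechanism, which is what gives it the larger error tolerance claimed above.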
An Adaptive Radar Target Signal Processing Scheme
4 Simulations

4.1 Simulations of the CNN Detecting Module
The main interference comes from the main spectral peak of the clutter. The central frequency spectral peaks of the clutter and the target are simulated, and the result is illustrated in Fig. 4; the solid line corresponds to the real signal, and the dashed line corresponds to the predicted signal. For (a) and (b), only the clutter central frequency is present in the radar echo, and the prediction error is very low, which implies that the CNN learns and reconstructs the clutter successfully. For (c) and (d), both the clutter and target central frequencies are present in the radar echo, and the prediction error is much higher than in the previous case, which implies that the CNN learns the clutter and detects the target signal successfully.
Fig. 4. The detection of the central frequency spectral peaks
4.2 Simulations of the Adaptive MTI Filter
Clutter can be filtered out by the designed adaptive MTI filter. Fig. 5 illustrates the spectral estimation results for the radar echo and the filtered signal. The powers of the clutter and the target signal are equal, i.e., the Signal-to-Clutter Ratio (SCR) is 0 dB. For case (a), there is only clutter (central frequency 6 kHz) and no target signal in the radar echo; after filtering, the spectral peak of the clutter is removed. For case (b), both clutter (central frequency 6 kHz) and a target signal (central frequency 12 kHz) are present in the radar echo; after filtering, the spectral peak of the clutter is removed and the spectral peak of the target signal is preserved. For case (c), the central frequency of the target signal is 48 kHz, and a similar result is obtained.

Fig. 5. The spectrum of the radar echo and the filtered signal
The associative memory of the CNN is utilized to repair the expanded spectrum; Fig. 6 illustrates this effect. For case (a), the chaotic neural network memorizes one ideal peak (26.6 kHz) as the sample and repairs the expanded spectrum by associative memory. For case (b), the chaotic neural network memorizes six ideal peaks (8.9 kHz, 17.8 kHz, 26.6 kHz, 35.5 kHz, 44.4 kHz, 53.3 kHz) as the samples and successfully repairs the expanded spectrum by associative memory.

Fig. 6. The spectral repair effect of the CNN
5 Conclusion

In this paper, we proposed a new scheme for adaptive radar target signal processing. The chaotic neural network is designed to reconstruct the chaotic clutter and to detect the target signal utilizing the Takens embedding theorem. After detection, the clutter is filtered by the Burg-algorithm-based adaptive MTI filter, and the distance and velocity information is obtained by Burg spectral estimation. Noise expands the spectral peak of the target signal; because of its complex dynamics, the chaotic neural network has more memory capacity and error tolerance than other neural networks. In the proposed scheme, the CNN module not only detects the target signal but also repairs the frequency spectrum according to its associative memory characteristic. The validity of the scheme is analyzed theoretically, and the simulation results show that it performs well against clutter and noise backgrounds. The adaptive method adopted in this paper facilitates radar design in complex environments.
References 1. Haykin, S., Puthusserypady, S.: Chaotic dynamics of sea clutter. Chaos 7 (1997) 777–802 2. Leung, H., Dubash, N., Xie, N.: Detection of small objects in clutter using a GA-RBF neural network. IEEE Trans. Aerosp. Electron. Syst. 38 (2002) 98–118
3. Haykin, S., Bakker, R., Currie, B.W.: Uncovering nonlinear dynamics-the case study of sea clutter. Proc. IEEE. 90 (2002) 860–881 4. Morrison, A.I., Srokosz, M.A.: Estimating the fractal dimension of the sea-surface—A 1st attempt. Annales Geophysicae-Atmospheres Hydrospheres and Space Sci. 11 (1993) 648–658 5. Hu, J., Tung, W.W., Gao, J.B.: Detection of low observable targets within sea clutter by structure function based multifractal analysis. IEEE Transactions on antennas and propagation 54 (2006) 136-143 6. Xiong, Z.L., Shi, X.Q.: A novel signal detection subsystem of radar based on HA-CNN. Lecture Notes in Computer Science 3174 (2004) 344-349 7. Huang, Y., Peng, Y.N.: Design of airborne adaptive recursive MTI filter for detecting targets of slow speed. IEEE National Radar Conference – Proceedings (2000) 215-218 8. Xiang, Y., Ma, X.Y.: AR model approaching-based method for AMTI filter design. Systems Engineering and Electronics 27 (2005) 1826-1830 9. Aihara, K., Takabe, T., Toyoda, M.: Chaotic neural networks. Physics Letters A 144 (1990) 333-340 10. Adachi, M., Aihara, K.: Associative dynamics in a chaotic neural network. Neural Networks 10 (1997) 83-98
Horseshoe Dynamics in a Small Hyperchaotic Neural Network

Qingdu Li¹ and Xiao-Song Yang²

¹ Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
[email protected]
² Department of Mathematics, Huazhong University of Science and Technology, Wuhan, 430074, China
[email protected]

Abstract. This paper studies the hyperchaotic dynamics in a four-dimensional Hopfield neural network. A topological horseshoe on a three-dimensional block is found in a carefully chosen Poincaré section hyperplane of the ordinary differential equations. Numerical studies show that there exist two-directional expansions in this horseshoe map. In this way, a computer-assisted verification of the hyperchaoticity of this neural network is presented by virtue of topological horseshoe theory.
1 Introduction
Among the various neural dynamics, deterministic chaos is of much interest; it has been regarded as a powerful mechanism for the storage, retrieval and creation of information in neural networks and has received considerable attention in recent years [1, 2, 3, 4]. Research in anatomy and physiology suggests trying to understand the emergent dynamical properties of a large network in terms of interacting smaller subnetworks [5, 6, 7, 8]. Therefore, a thorough investigation of the chaotic dynamics of small neural networks is significant for the study of brain functions and artificial neural networks [9, 10, 11, 4, 12, 13, 14, 15]. The existence of a horseshoe embedded in a dynamical system is perhaps the most compelling signature of chaos, since it can be used to prove the existence of chaos, show the structure of chaotic attractors and reveal the mechanism behind chaotic phenomena. It is now well recognized that horseshoe theory with symbolic dynamics provides a powerful tool for rigorous studies of chaos [16, 17, 18, 19]. This tool has been successfully applied in studies of common chaos, with one positive Lyapunov exponent, in neural networks [10, 20, 21, 22]. In this paper, we use this tool to carry out a rigorous study of the hyperchaotic dynamics of a small neural network proposed in [15], by showing a topological horseshoe with two-directional expansion and presenting a computer-assisted verification of hyperchaoticity.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 96–103, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Horseshoe Dynamics in the Hyperchaotic Neural Network
In this section, we first recall a result from horseshoe theory, and then present our main result.

2.1 A Result of Topological Horseshoe
Let $X$ be a metric space, let $D$ be a compact subset of $X$, and let $f : D \to X$ be a map satisfying the assumption that there exist $m$ mutually disjoint compact subsets $D_1, D_2, \ldots, D_m$ of $D$ such that the restriction of $f$ to each $D_i$, i.e., $f|_{D_i}$, is continuous.

Definition 1. Let $\gamma$ be a compact subset of $D$ such that for each $1 \leq i \leq m$, $\gamma_i = \gamma \cap D_i$ is nonempty and compact; then $\gamma$ is called a connection with respect to $D_1, D_2, \ldots, D_m$. Let $F$ be a family of connections $\gamma$ with respect to $D_1, D_2, \ldots, D_m$ satisfying the property $\gamma \in F \Rightarrow f(\gamma_i) \in F$. Then $F$ is said to be an $f$-connected family with respect to $D_1, D_2, \ldots, D_m$.

Theorem 1. Suppose that there exists an $f$-connected family $F$ with respect to $D_1, D_2, \ldots, D_m$. Then there exists a compact invariant set $K \subset D$ such that $f|_K$ is semiconjugate to $m$-shift dynamics.

Here, the semiconjugacy is conventionally defined as follows.

Definition 2. Let $X$ and $\Sigma_m$ be topological spaces, and let $f : X \to X$ and $\sigma : \Sigma_m \to \Sigma_m$ be continuous functions. We say that $f$ is topologically semiconjugate to $\sigma$ if there exists a continuous surjection $h : \Sigma_m \to X$ such that $f \circ h = h \circ \sigma$.

Proposition 1. Let $X$ be a compact metric space, and let $f : X \to X$ be a continuous map. If there exists an invariant set $\Lambda \subset X$ such that $f|_\Lambda$ is semiconjugate to the $m$-shift dynamics $\sigma|_{\Sigma_m}$, then

$$\mathrm{ent}(f) \geq \mathrm{ent}(\sigma) = \log m, \qquad (1)$$

where $\mathrm{ent}(f)$ denotes the entropy of the map $f$. In addition, for every positive integer $k$,

$$\mathrm{ent}(f^k) = k \cdot \mathrm{ent}(f). \qquad (2)$$

For details of the proof of Theorem 1, see [19]; for details of symbolic dynamics and horseshoe theory, see [16].

2.2 Poincaré Map and Horseshoe
Q. Li and X.-S. Yang

The dynamics of the 4D hyperchaotic Hopfield neural network can be described by the following ordinary differential equations:

\dot{x}_i = -c_i x_i + \sum_{j=1}^{4} w_{ij} \tanh(x_j), \quad i = 1, \ldots, 4,   (3)

where W = (w_ij) is the connection matrix. When the parameters take

c_1 = c_2 = c_3 = 1, \quad c_4 = 100, \quad W = \begin{pmatrix} 1 & 0.5 & -3 & -1 \\ 0 & 2.3 & 3 & 0 \\ 3 & -3 & 1 & 0 \\ 100 & 0 & 0 & 170 \end{pmatrix},

computer simulations show that (3) has an attractor, as illustrated in Fig. 1 [15]. Its Lyapunov exponents are 0.237, 0.024, -0.000 and -74.08, which suggests that the attractor is hyperchaotic. In what follows, we give a detailed discussion of a horseshoe embedded in this attractor.
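As an informal cross-check (separate from the interval-arithmetic computations used below), system (3) with these parameters can be integrated with a classical fourth-order Runge-Kutta stepper. A minimal Python sketch; the step size, initial condition, and helper names are our own choices, and the parameter values are taken from the reconstruction above:

```python
import numpy as np

# Parameters of system (3) as read from the text: c = (1, 1, 1, 100) and W below.
C = np.array([1.0, 1.0, 1.0, 100.0])
W = np.array([[1.0,   0.5, -3.0, -1.0],
              [0.0,   2.3,  3.0,  0.0],
              [3.0,  -3.0,  1.0,  0.0],
              [100.0, 0.0,  0.0, 170.0]])

def f(x):
    # Right-hand side of (3): dx_i/dt = -c_i x_i + sum_j w_ij tanh(x_j)
    return -C * x + W @ np.tanh(x)

def rk4_step(x, h):
    # One classical fourth-order Runge-Kutta step of size h.
    k1 = f(x)
    k2 = f(x + 0.5 * h * k1)
    k3 = f(x + 0.5 * h * k2)
    k4 = f(x + h * k3)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def trajectory(x0, h=1e-3, n=50000):
    # Integrate n steps from x0 and record the state after each step.
    x = np.asarray(x0, dtype=float)
    out = np.empty((n, 4))
    for i in range(n):
        x = rk4_step(x, h)
        out[i] = x
    return out

traj = trajectory([0.1, 0.2, -0.1, 0.05])
print(traj[-1])  # a point near the attractor after the transient
```

Plotting the (x1, x2, x4) components of `traj` after discarding a transient should qualitatively reproduce the attractor of Fig. 1; since tanh is bounded, the trajectory stays bounded for these parameters.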
Fig. 1. The phase plot of (3) and the position of block a and block b
As shown in Fig. 1, we choose a 3D section P = {x1 ∈ (−3.1, 2.8), x2 ∈ (0, 0.8), x3 ∈ (−0.19, 0.35)} in the hyperplane Q : x4 = 1.1. The Poincaré map π : P → Q is defined as follows: for each x ∈ P, π(x) is taken to be the second return point in Q under the flow with initial condition x. For a subset κ of P, its image under π is denoted by κ′ = π(κ) in the following discussion. The following statement can be obtained by numerical computations on P.

Theorem 2. For the Poincaré map π corresponding to the cross section P, there exists a closed invariant set Λ ⊂ P for which π²|Λ is semiconjugate to the 2-shift dynamics, and ent(π) ≥ (1/2) log 2.

Proof. In view of Theorem 1, we only need to show that there exists a π²-connected family F with respect to two subsets of P. After a number of attempts, we found two subsets a and b of P whose eight vertices each, in terms of (x1, x2, x3), are
A1 = (−0.29581779549343, −0.24019597042840, 0.09142880863296),
A2 = (−0.29542801291366, −0.25654351847307, 0.08383869638895),
A3 = (−0.29664321742707, −0.28604018124933, 0.10066504874467),
A4 = (−0.29726992667299, −0.27182492208005, 0.11184324741833),
A5 = (−0.29382176324355, −0.24020663129214, 0.09155427417961),
A6 = (−0.29343198066377, −0.25655417933681, 0.08396416193560),
A7 = (−0.29464718517719, −0.28605084211307, 0.10079051429132),
A8 = (−0.29527389442310, −0.27183558294379, 0.11196871296498),
B1 = (−0.30406472867537, −0.34778553401970, 0.21313753034174),
B2 = (−0.30304332426000, −0.32692785416949, 0.19847573234127),
B3 = (−0.30297948648404, −0.33166823595363, 0.19709871410172),
B4 = (−0.30423141620149, −0.35505411942204, 0.21523569550565),
B5 = (−0.30206868992023, −0.34779509551942, 0.21326298099230),
B6 = (−0.30104728550486, −0.32693741566922, 0.19860118299183),
B7 = (−0.30098344772890, −0.33167779745335, 0.19722416475228),
B8 = (−0.30223537744636, −0.35506368092176, 0.21536114615621),
on which π|a and π|b are both diffeomorphisms, as shown in Fig. 1. For block a, the top surface a_t = |A5A6A7A8| is parallel to the bottom surface a_b = |A1A2A3A4|; both are quadrangles, and the other four surfaces of a, called the side of a in the following discussion (denoted a_s), are all parallelograms. The situation for block b is the same. By means of interval analysis, the images of the blocks under π are computed as in [23, 24] and shown in Figs. 2 and 3. From Fig. 2, it is easy to see that the Poincaré map π sends block a to its image a′ = π(a) as follows: the top quadrangle a_t and the bottom quadrangle a_b of a are both expanded in two directions and transversely intersect block a between a_t and a_b, and block b between b_t and b_b; the side of a, i.e., a_s, is mapped outside of a_s and b_s, as shown in Fig. 2(b). In this case, since the image of every subset of a that transversely intersects a between a_t and a_b must transversely intersect blocks a and b between their top and bottom surfaces, we say that the image a′ = π(a) lies wholly across a and b. Similarly, it is easy to see from Fig. 3 that π sends block b to its image b′ = π(b) as follows: the top quadrangle b_t and the bottom quadrangle b_b of b are both expanded in two directions and transversely intersect block a between a_t and a_b; the side of b, i.e., b_s, is mapped outside of a_s, as shown in Fig. 3(b). In this case, we say that the image b′ = π(b) lies wholly across a. Since π|a and π|b are both diffeomorphisms, it is easy to find a sub-block ã of a and a sub-block b̃ of b such that ã and b̃ both lie wholly across ã and b̃ under π², e.g., ã = π⁻¹(π(a) ∩ a) and b̃ = π⁻¹(π(b) ∩ a). Since the subsets a and b are mutually disjoint, ã and b̃ must also be mutually disjoint. It is then not hard to find a π²-connected family F with respect to ã and b̃. In view of Theorem 1, the Poincaré map π² is semiconjugate to a 2-shift map, and ent(π) ≥ (1/2) log 2. □
The global picture of the images π(a) and π(b) suggests that π|a and π|b both expand in two directions (corresponding to the two positive Lyapunov exponents).
Fig. 2. a′ = π(a) wholly across a and b ((a) the 3D view; (b) the top view; (c) the side view)
The local expansion of π on a and b can be partially confirmed by numerically studying the Jacobian ∂π of π at randomly chosen points in the intersection of blocks a and b with their images. We find numerically that the matrix ∂π has one eigenvalue lying in the interior of the unit circle and two eigenvalues located outside the unit circle. This provides strong evidence that the attractor illustrated in Fig. 1 is hyperchaotic.
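This eigenvalue test can be sketched generically: estimate the Jacobian of a numerically defined map by central differences and inspect the moduli of its eigenvalues. In the sketch below, `jacobian_fd` and `expansion_signature` are our own helper names, and `demo` is a stand-in map with a known derivative; the actual test would pass the numerically computed second-return Poincaré map of (3) instead.

```python
import numpy as np

def jacobian_fd(pmap, x, eps=1e-6):
    # Central-difference estimate of the Jacobian of a map R^n -> R^n at x.
    x = np.asarray(x, dtype=float)
    n = x.size
    J = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (np.asarray(pmap(x + e)) - np.asarray(pmap(x - e))) / (2.0 * eps)
    return J

def expansion_signature(pmap, x):
    # Moduli of the Jacobian eigenvalues, sorted in ascending order.
    return np.sort(np.abs(np.linalg.eigvals(jacobian_fd(pmap, x))))

# Stand-in map with known derivative: its Jacobian is triangular with
# diagonal (0.5, 2, 3), so the moduli should come out near (0.5, 2, 3).
demo = lambda v: np.array([0.5 * v[0], 2.0 * v[1] + v[0] ** 2, 3.0 * v[2]])
sig = expansion_signature(demo, [0.1, 0.2, 0.3])
print(sig)  # approximately [0.5, 2.0, 3.0]: one contracting, two expanding directions
```

One eigenvalue modulus below 1 and two above 1 is exactly the signature reported for ∂π above.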
Fig. 3. b′ = π(b) wholly across a ((a) the 3D view; (b) the top view; (c) the side view)
3 Conclusions
We have presented a 3D topological horseshoe in the small hyperchaotic neural network proposed in [15]. Numerical studies suggest that this horseshoe map expands in two directions. In this way, a computer-assisted verification of hyperchaos has been provided by virtue of topological horseshoe theory, which is more intuitive and convincing than the usual method of calculating Lyapunov exponents.
Acknowledgements. This work was supported in part by the Program for New Century Excellent Talents in University (NCET-04-0713), the National Natural Science Foundation of China (10672062), and the Doctoral Thesis Fund of Huazhong University of Science and Technology (D0640).
References
1. Elbert, T., Ray, W.J., Kowalik, Z.J., Skinner, J.E., Graf, K.E., Birbaumer, N.: Chaos and Physiology: Deterministic Chaos in Excitable Cell Assemblies. Physiological Reviews 74 (1994) 1–47
2. Freeman, W.J., Yao, Y.: Model of Biological Pattern Recognition with Spatially Chaotic Dynamics. Neural Networks 3 (1990) 153–170
3. Babloyantz, A., Lourenco, C.: Brain Chaos and Computation. Int. J. Neural Syst. 7 (1996) 461–471
4. Lewis, J.E., Glass, L.: Nonlinear Dynamics and Symbolic Dynamics of Neural Networks. Neural Computation 4 (1992) 621–642
5. Abeles, M.: Corticonics. Cambridge University Press, Cambridge (1991)
6. Arbib, M.A., Érdi, P., Szentágothai, J.: Neural Organization - Structure, Function, and Dynamics. MIT Press, Massachusetts (1998)
7. Shepherd, G.M., ed.: The Synaptic Organization of the Brain. Oxford Univ. Press, New York (1990)
8. White, E.L.: Cortical Circuits: Synaptic Organization of the Cerebral Cortex - Structure, Function and Theory. Birkhäuser, Boston (1989)
9. Pasemann, F.: Complex Dynamics and the Structure of Small Neural Networks. Network: Comput. Neural Syst. 13 (2002) 195–216
10. Guckenheimer, J., Oliva, R.A.: Chaos in the Hodgkin-Huxley Model. SIAM J. Applied Dynamical Systems 1 (2002) 105–114
11. Das, A., Das, P., Roy, A.B.: Chaos in a Three-Dimensional General Model of Neural Network. Int. J. Bifurcation and Chaos 12 (2002) 2271–2281
12. Bersini, H.: The Frustrated and Compositional Nature of Chaos in Small Hopfield Networks. Neural Networks 11 (1998) 1017–1025
13. Bersini, H., Sener, P.: The Connections between the Frustrated Chaos and the Intermittency Chaos in Small Hopfield Networks. Neural Networks 15 (2002) 1197–1204
14. Li, Q., Yang, X.S.: Complex Dynamics in a Simple Hopfield-Type Neural Network. In: Wang, J., Liao, X., Yi, Z. (eds.): Advances in Neural Networks - ISNN 2005. Volume 3496, Springer-Verlag, New York (2005) 357–360
15. Li, Q., Yang, X.S., Yang, F.: Hyperchaos in Hopfield-Type Neural Networks. Neurocomputing 67 (2005) 275–280
16. Wiggins, S.: Introduction to Applied Nonlinear Dynamical Systems and Chaos. Springer-Verlag, New York (1990)
17. Szymczak, A.: The Conley Index and Symbolic Dynamics. Topology 35 (1996) 287–299
18. Kennedy, J., Yorke, J.A.: Topological Horseshoes. Transactions of the American Mathematical Society 353 (2001) 2513–2530
19. Yang, X.S., Tang, Y.: Horseshoes in Piecewise Continuous Maps. Chaos, Solitons and Fractals 19 (2004) 841–845
20. Li, Q., Yang, X.S.: Chaotic Dynamics in a Class of Three-Dimensional Glass Networks. Chaos 16 (2006) 033101
21. Yang, X.S., Yang, F.: A Rigorous Verification of Chaos in an Inertial Two-Neuron System. Chaos, Solitons and Fractals 20 (2004) 587–591
22. Yang, X.S., Li, Q.: Horseshoe Chaos in Cellular Neural Networks. Int. J. Bifurcation and Chaos 16 (2006) 131–140
23. Zgliczyński, P.: Computer Assisted Proof of Chaos in the Rössler Equations and in the Hénon Map. Nonlinearity 10 (1997) 243–252
24. Li, Q., Yang, X.S.: A Computer-Assisted Verification of Hyperchaos in the Saito Hysteresis Chaos Generator. J. Phys. A: Math. Gen. 39 (2006) 9139–9150
The Chaotic Netlet Map

Geehyuk Lee and Gwan-Su Yi
Information and Communications University, Daejeon 305-732, South Korea
[email protected], [email protected]

Abstract. The parametrically coupled map lattice (PCML) exhibits many interesting dynamical behaviors that are reminiscent of the adaptation and learning of the neural network. In order for the PCML to be a model of the neural network, however, it is necessary to identify the biological counterpart of the one-dimensional maps that constitute the PCML. One possible candidate is a netlet, a small population of randomly interconnected neurons, which was suggested to be a functional unit constituting the neural network. We studied the possibility of representing a netlet by a chaotic one-dimensional map; the result is the chaotic netlet map that we introduce in this paper.
1 Introduction
The coupled map lattice (CML) [1] is a mathematical model of a spatially extended dynamical system with discrete time, discrete space, and continuous states. In spite of its simplicity, the CML has been a successful tool for studying spatiotemporal chaos, which arises in many physical systems. More recently, the parametrically coupled map lattice (PCML) [2,3] was proposed as a model neural network capable of automatic adaptation and learning. While the PCML displays many unique dynamical behaviors that are reminiscent of the adaptation and learning of the neural network, it is not easy to relate the PCML to the neural network. Before we can relate them to each other, we need to identify the biological counterpart of the chaotic maps that constitute the PCML. One possible candidate is the chaotic neuron, as suggested by many researchers including Aihara [4] and Farhat [5]. Another candidate is the netlet [6,7], a small population of randomly interconnected neurons that was suggested to be a functional unit constituting the neural network. Farhat [2] considered the netlet to be the biological counterpart of the chaotic map constituting the PCML. However, there has not yet been a satisfactory explanation of the linkage between the netlet and a chaotic map. This paper is about our effort to find such an explanation. We started with a review of Harth's one-dimensional map model of a netlet [6], and observed that, although the map model was successful in explaining certain collective behaviors of a netlet, it is inherently unable to model the chaotic aspect of a netlet. This led us to reconsider the dynamics of a netlet on a different time scale.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 104–112, 2007.
© Springer-Verlag Berlin Heidelberg 2007

The result was a
chaotic one-dimensional map model of a netlet, which we call the chaotic netlet map. A brief review of Harth's map model is given in Sect. 2, and the possibility of chaos in Harth's map model is discussed in Sect. 3. The derivation of the chaotic netlet map is given in Sect. 4, followed by a reminder of the assumptions made in the derivation of the new model.
2 The Netlet
Harth and others [6] suggested that the structure of the neural network may be approximated by a set of discrete populations of randomly interconnected neurons, which they named netlets. The netlet concept was an answer to the question of redundancy in the neural network. A netlet is a reliable functional unit made of many less reliable units, i.e., neurons. Due to this redundancy, a netlet does not require precise wiring between its constituent neurons: the connections between the neurons are determined by only a few probability parameters. Nevertheless, due to the cooperative action among neurons, the netlet is much more reliable than an independent duplication of an equivalent number of identical neurons. A detailed description of the mathematical model of a netlet is given by Anninos et al. [7]. In this section, the derivation of their map model of a netlet is given for the special case of no inhibitory neurons.

Consider a netlet consisting of N neurons, each of which has μ afferent connections on average. The synaptic delay of a connection is identical throughout the netlet and is taken as the time unit. Assume that the absolute refractory period is longer than the synaptic delay, but shorter than twice the synaptic delay. Thus a neuron that fires at time n will be insensitive at n + 1, and will fully recover at t = n + 2. Next, define the activity α(n) of a netlet as the fractional number of neurons firing at time n. If we assume that the integration time of the postsynaptic potentials is less than the synaptic delay, we can see that α(n + 1) depends only on α(n). The expectation value of α(n + 1) depends on α(n) as follows. Since each of the excitatory neurons has μ efferent connections, there are α(n)Nμ excitatory postsynaptic potentials (EPSPs) at n + 1. Since the connections are assumed to be distributed uniformly, the expected number of EPSPs per neuron is α(n)μ.
The probability that a neuron receives l EPSPs is given by the Poisson distribution in the limit of a large total number of EPSPs:

p_l = \frac{(\alpha(n)\mu)^l}{l!}\, e^{-\alpha(n)\mu}.   (1)
The probability P(α(n)) that a neuron receives EPSPs exceeding its threshold at t = n + 1 is then given by

P(\alpha(n)) = \sum_{l=\eta}^{\alpha(n)N\mu} p_l \;\approx\; \sum_{l=\eta}^{\infty} p_l = 1 - e^{-\alpha(n)\mu} \sum_{l=0}^{\eta-1} \frac{(\alpha(n)\mu)^l}{l!},   (2)
where η is the minimum number of EPSPs necessary to trigger a neuron. The approximation here is possible because p_l is already negligibly small when l = α(n)Nμ. Finally, because (1 − α(n)) is the fraction of neurons that are not in the refractory period at t = n + 1, the expectation value of α(n + 1) is given by

\langle \alpha(n+1) \rangle = (1 - \alpha(n))\, P(\alpha(n)).   (3)
If we approximate α(n + 1) by its expectation value ⟨α(n + 1)⟩ and then use (2), we arrive at the following one-dimensional map:

\alpha(n+1) = (1 - \alpha(n)) \left[ 1 - e^{-\alpha(n)\mu} \sum_{l=0}^{\eta-1} \frac{(\alpha(n)\mu)^l}{l!} \right].   (4)

Figure 1 shows the return maps of (4), the one-dimensional map model of a netlet, for several different values of μ and η. These return maps show that a netlet can have three different dynamical modes. When η = 1, there are two fixed points: the one at the origin is a repeller and the other, off the origin, is an attractor; any sequence {α(n)} will eventually settle down to the attractor. When η is greater than a certain threshold, the return map is contained below the diagonal line α(n + 1) = α(n); in this case there is only one fixed point, at the origin, which is an attractor, and any sequence {α(n)} will eventually converge to 0. When η is between 1 and the threshold, there can be three fixed points: two attractors and one repeller between them, and a sequence {α(n)} will converge to either attractor. Considering the dynamical characteristics of the one-dimensional map given by (4), we will call it the stable netlet map in the following.
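The three dynamical modes described above are easy to reproduce numerically. A minimal Python sketch of map (4); the function names and parameter choices are ours:

```python
import math

def stable_netlet_map(alpha, mu, eta):
    # One step of map (4): expected fractional activity at the next tick.
    s = sum((alpha * mu) ** l / math.factorial(l) for l in range(eta))
    p_fire = 1.0 - math.exp(-alpha * mu) * s   # P(alpha(n)) from (2)
    return (1.0 - alpha) * p_fire

def iterate(alpha0, mu, eta, n=200):
    # Iterate map (4) n times from the initial activity alpha0.
    a = alpha0
    for _ in range(n):
        a = stable_netlet_map(a, mu, eta)
    return a

print(iterate(0.1, mu=10, eta=1))   # settles near the nonzero attractor
print(iterate(0.1, mu=10, eta=8))   # activity dies out toward 0
```

With μ = 10, η = 1 the activity settles near the nonzero attractor, while a large threshold such as η = 8 drives the netlet activity to extinction, matching the modes listed above.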
3 The Netlet and Chaos
Harth provided some evidence of the chaotic behavior of a netlet from computer simulations of a netlet. On the other hand, the stable netlet map given by (4) cannot exhibit chaos, for the following reason. From the derivation of (4), we know that the second factor on the right side of (4) is a probability, which was denoted by P(α(n)) in (2), i.e.,

\alpha(n+1) = (1 - \alpha(n))\, P(\alpha(n)).   (5)

Taking the derivative of the map function gives

\frac{d\alpha(n+1)}{d\alpha(n)} = -P(\alpha(n)) + (1 - \alpha(n))\, P'(\alpha(n)).   (6)

Regardless of the statistical reasoning used to evaluate P(·), P(·) is a probability and therefore cannot exceed 1. Since the effect of refractoriness is already taken into account by the first factor (1 − α(n)), the probability function P(·) must be an increasing function of α(n). Therefore,

\frac{d\alpha(n+1)}{d\alpha(n)} \geq -1 + (1 - \alpha(n))\, P'(\alpha(n)) \geq -1.   (7)
Fig. 1. The return-maps of the stable netlet map for μ = 5, 10, 15 and 20: five curves for 5 different values of η (η = 1, 2, 3, 4 and 5, from the top) in each plot
The second inequality follows from (1 − α(n)) ≥ 0. The fact that the map function cannot have a derivative smaller than −1 places a strict limitation on the possibility of chaos in the stable netlet map. The map function may cross the line α(n + 1) = α(n) and have one or more fixed points, but none of them has the chance of a flip bifurcation that could develop into a chaotic attractor. (See [8] for a detailed description of bifurcation mechanisms in unimodal one-dimensional maps.) The conclusion is that we cannot revise the stable netlet map to derive a map that can exhibit chaos. We also studied other map models of the netlet, for instance the one by Usher [9], but arrived at basically the same conclusion. This is the consequence of using the absolute refractory period as the time unit of the map models; in this regard, these map models are basically models of absolute refractoriness. We had to back off a step and look at the netlet again at a coarser time scale in order to free ourselves from the restriction of absolute refractoriness.
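The bound in (7) can be spot-checked numerically by scanning finite-difference slopes of map (4) over the unit interval. A sketch; the grid size, step, and tolerance are our own choices:

```python
import math

def stable_netlet_map(alpha, mu, eta):
    # Map (4): expected netlet activity at the next time step.
    s = sum((alpha * mu) ** l / math.factorial(l) for l in range(eta))
    return (1.0 - alpha) * (1.0 - math.exp(-alpha * mu) * s)

def min_slope(mu, eta, n=1000, h=1e-6):
    # Smallest central-difference slope of map (4) on a grid over (0, 1).
    return min((stable_netlet_map(i / n + h, mu, eta) -
                stable_netlet_map(i / n - h, mu, eta)) / (2.0 * h)
               for i in range(1, n))

for mu in (5, 10, 15, 20):
    for eta in (1, 2, 3, 4, 5):
        assert min_slope(mu, eta) >= -1.0 - 1e-6
print("slope never drops below -1, as inequality (7) predicts")
```

The scan covers the same (μ, η) values as Fig. 1 and finds no slope below −1, consistent with the impossibility of a flip bifurcation argued above.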
4 Chaotic Netlet Map
Consider a netlet of N neurons which are fully connected to one another. Full connection is not necessary in our derivation but will help keep our argument simple without affecting the validity of the final result. Let us begin by defining two basic time constants in the dynamics of a netlet.
– Let τr be the absolute refractory period of a neuron, which is assumed to be identical for all neurons in a netlet.
– Let τd be the pulse integration time of a neuron, which is also assumed to be identical for all neurons in a netlet.

In the derivation of the stable netlet map, τd was assumed to be of the same order as τr, and therefore τr played the role of the time unit. As we pointed out in the previous section, any effort to derive a map model of a netlet with τr as the time unit will lead to basically the same result as the stable netlet map. Therefore, we decided to examine the dynamics of a netlet on a different time scale. Our choice was to use τd as the time unit, but to assume that τd is several times larger than τr, which is in fact no less acceptable an assumption than that of the stable netlet map. In this case the dynamics of a netlet is better described in terms of the number of pulses that a neuron generates than in terms of the number of active neurons in a netlet. We chose yi(n), the average number of pulses generated by neuron i at time step n, to be the state variable of the target one-dimensional map. To avoid unnecessary complication by integer arithmetic, we assumed that the variable yi(n) takes on a real value.

Now the time unit and the state variable for a one-dimensional map are determined. The next step is to design a first-order map function for a netlet. At this point, it may be worth reviewing the history of the logistic map [10], since it is a model of a dynamical system that is also a kind of population, as a netlet is. The logistic map is a model of a population with the following two conditions:

1. There is a multiplying factor in the system. In the population dynamics of insects, a couple of insects gives birth to tens of offspring.
2. There is a resource constraint. In the population dynamics of insects, it is the limited supply of food.
We may be guided by these two conditions in our reasoning toward the development of a first-order map function for a netlet. We consider first a multiplying factor in a netlet and then two types of resource constraints in a netlet.

Multiplying factor: A neuron can deliver an output pulse to multiple postsynaptic neurons. If the postsynaptic neurons fire in response to the pulses with some probability, the net result is multiple pulses out of a single pulse. Suppose, for example, every neuron fires once at a certain time step in a netlet of 100 neurons (yi(n) = 1). Since the netlet is fully connected, each neuron will receive 100 EPSPs on average. If a neuron fires with probability 0.1 for an incoming pulse, every neuron in the netlet will fire 10 times on average in the next time step (yi(n + 1) = 10), meaning multiplication of the firing frequency by 10. In more general terms, this multiplying dynamics can be stated as follows:

y_i(n+1) = p \sum_{j=1}^{N} y_j(n),   (8)
where p is the probability for a neuron to fire in response to an incoming pulse. Using a mean-field argument, we replace the summation by N yi(n). Then

y(n+1) = Np\, y(n).   (9)
The subscript on the state variable is dropped, since individual neurons are no longer distinguished after the mean-field approximation.

Constraint by absolute refractoriness: The absolute refractory period limits the maximum number of pulses a neuron can generate in a unit time interval. In the current framework, τd is the time unit, and therefore the upper bound ŷ on the number of pulses that a neuron can generate in a unit time is given by τd/τr. With this hard bound on the state variable, (9) becomes

y(n+1) = \min(\hat{y},\, Np\, y(n)).   (10)
Metabolic constraint: In addition to the hard constraint imposed by the absolute refractory period, there are many environmental factors that can affect the efficiency of a neuron. For instance, ions and neurotransmitters are essential in the relay of signals between neurons, and their varying availability and activity in a netlet can therefore be one of the factors controlling the efficiency of a neuron. A cellular energy source such as adenosine triphosphate (ATP) can be a main environmental factor of neuronal activity. ATP is necessary for most cellular signal transduction and for the active transport of ions involved in the process of neuronal pulse generation. In particular, active transport is needed for the polarization and repolarization of a neuron, which can affect the efficiency of pulse generation directly and possibly set the upper bound on the number of pulses as well. Another issue to note here is that various external and internal factors of a neuron can invoke temporal and localized changes of the ATP level, which can be a source of inconsistent neuronal activity. (See [11] for a detailed description of the role of ATP in neuronal signal transduction.) At present, however, we are not able to describe the exact mechanism of the environmental factors in this process without further experimental evidence. It is inevitable to leave it as an assumption that needs to be justified in the future.

The efficiency of a neuron is represented by the parameter p in (8). The parameter p is proportional to the environmental condition, i.e., p = p_o z, where z ∈ [0, 1] represents the environmental condition and p_o is the value of p when the environment is in the best condition (z = 1). Since every firing of a neuron consumes some amount of resource in the environment, we may write p in the following form: p = p_o (1 − b y(n)), where b is a small positive constant. A flaw with this form of p is that p can become 0, which never occurs in an open system like a netlet.
A remedy for this flaw is to replace (1 − b y(n)) by e^{−b y(n)}, which approximates (1 − b y(n)) well when b y(n) is small and approaches 0 as b y(n) becomes larger, but never reaches 0. With this exponential factor incorporated, (9) becomes

y(n+1) = \min(\hat{y},\, a\, y(n)\, e^{-b\, y(n)}),   (11)
Fig. 2. The chaotic netlet map: the firing rate y(n) is multiplied by the population size N, attenuated by the metabolic constraint, and finally hard-limited by absolute refractoriness before it becomes y(n + 1)
Fig. 3. The return-maps of the chaotic netlet map: four plots for β = 5, 6, 7 and 8, and four curves in each plot for α = 5, 10, 15 and 20 (from the lowest curve)
where a ≡ N p_o is called the multiplying factor of a netlet and b is the resource factor of a neuron. Figure 2 is a graphical representation of (11): y(n) pulses generated by a neuron result in N y(n) EPSPs. A receiving neuron cannot fire in response to every incoming pulse; its responsiveness depends on the available metabolic resource left unused in the previous time step, which is modeled by
Fig. 4. The bifurcation diagrams of the chaotic netlet map with the parameter α being the bifurcation parameter: four diagrams for β = 5, 6, 7 and 8
the factor e^{−b y_i(n)}. Finally, the number of pulses is hard-limited by ŷ due to the absolute refractoriness of a neuron.

We introduce a normalized state variable x(n) = y(n)/ŷ to convert the map into a form better suited for comparison with other well-known maps defined on the unit interval. In terms of the normalized variable, (11) becomes

x(n+1) = \min(1,\, \alpha\, x(n)\, e^{-\beta\, x(n)}),   (12)

where α ≡ a and β ≡ b ŷ. The meaning of β can be understood when we note that e^{−β} is the minimum responsiveness of a neuron after it experiences the maximum activity allowed by the absolute refractory period in the previous time step. Since the resulting one-dimensional map given by (12) models the chaotic aspect of a netlet, we named it the chaotic netlet map.

Figure 3 shows the return-maps of the chaotic netlet map: four plots for β = 5, 6, 7 and 8, and four curves for α = 5, 10, 15 and 20 in each plot. From these return-maps, we can expect the bifurcation pattern of the map to be similar to that of the logistic map, since they have an unstable fixed point at the origin and another fixed point with a negative slope. The multiplying factor α can be used to change the slope, like the μ-parameter of the logistic map. Indeed, the bifurcation diagrams of the chaotic netlet map shown in Fig. 4 are similar to that
of the logistic map, except for the disappearance of the chaotic orbits for large values of α due to the clipping of the return-map by the absolute refractoriness.
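A minimal sketch of the normalized map (12) (function names ours). For small α the orbit settles on the nonzero fixed point x* = ln(α)/β, where the slope of the map is 1 − ln α; increasing α makes this slope more negative, producing the flip bifurcations and chaos seen in Fig. 4:

```python
import math

def chaotic_netlet_map(x, alpha, beta):
    # One step of the normalized map (12).
    return min(1.0, alpha * x * math.exp(-beta * x))

def orbit(x0, alpha, beta, n=1000, skip=500):
    # Iterate past a transient of `skip` steps, then record n orbit points.
    x = x0
    for _ in range(skip):
        x = chaotic_netlet_map(x, alpha, beta)
    pts = []
    for _ in range(n):
        x = chaotic_netlet_map(x, alpha, beta)
        pts.append(x)
    return pts

# Crude bifurcation probe for beta = 6: the number of distinct (rounded)
# orbit values grows from 1 (stable fixed point) as alpha increases.
for alpha in (3, 8, 15, 20):
    pts = orbit(0.1, alpha, 6.0)
    print(alpha, len({round(p, 6) for p in pts}))
```

Sweeping α on a fine grid and plotting the recorded orbit points against α would reproduce bifurcation diagrams of the kind shown in Fig. 4.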
5 Conclusions
We showed that, when the integration time of a neuron is several times larger than its absolute refractory period and we look at the dynamics of a netlet on such a coarser time scale, a netlet can be represented by a chaotic one-dimensional map that is similar in form and behavior to the logistic map. It seems that our initial goal of deriving a chaotic one-dimensional map model of a netlet has been achieved, but it should be remembered that we left many assumptions unverified. Among others, we still need to come up with evidence of a biological mechanism that can explain the resource constraint in a netlet. Also, it should be noted that the new map model exhibits chaos only if the parameters α and β are chosen properly. We have yet to check the validity of the ranges of the parameters of the chaotic netlet map from the biological point of view. The values of the parameter α seem acceptable, since the total number of neurons N is usually much larger than the number of pulses required for a neuron to fire. On the other hand, the validity of the values of β used in Fig. 4 needs further examination.
References
1. Kaneko, K.: Theory and Applications of Coupled Map Lattices. John Wiley & Sons, Chichester, New York (1993)
2. Farhat, N.H., Hernandez, E.D.M., Lee, G.: Strategies for Autonomous Adaptation and Learning in Dynamical Networks. In: IWANN '97 (1997)
3. Lee, G., Farhat, N.H.: Parametrically Coupled Sine Map Networks. International Journal of Bifurcation and Chaos 11 (2001) 1815–1834
4. Aihara, K., Takabe, T., Toyoda, M.: Chaotic Neural Networks. Physics Letters A 144 (1990) 333–340
5. Farhat, N.H., Eldefrawy, M.: The Bifurcating Neuron. In: Digest Annual OSA Meeting, San Jose, CA (1991) 10
6. Harth, E.M., Csermely, T.J., Beek, B., Lindsay, R.D.: Brain Functions and Neural Dynamics. Journal of Theoretical Biology 26 (1970) 93–120
7. Anninos, P.A., Beek, B., Csermely, T.J., Harth, E.M., Pertile, G.: Dynamics of Neural Structures. Journal of Theoretical Biology 26 (1970) 121–148
8. Hilborn, R.C.: Chaos and Nonlinear Dynamics. Oxford University Press, New York (1994)
9. Usher, M., Schuster, H.G., Niebur, E.: Dynamics of Populations of Integrate-and-Fire Neurons, Partial Synchronization and Memory. Neural Computation 5 (1993) 570–586
10. May, R.M.: Simple Mathematical Models with Very Complicated Dynamics. Nature 261 (1976) 459–467
11. Nicholls, J.G., Martin, A.R., Wallace, B.G.: From Neuron to Brain. 3rd edn. Sinauer, Sunderland, MA (1992)
A Chaos Based Robust Spatial Domain Watermarking Algorithm

Xianyong Wu 1,2, Zhi-Hong Guan 1, and Zhengping Wu 1

1 Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
2 School of Electronics and Information, Yangtze University, Jingzhou, Hubei 434023, China
[email protected]

Abstract. This paper presents a novel spatial domain watermarking scheme based on chaotic maps. Two chaotic maps are employed in our scheme, which distinguishes it from most existing chaotic watermarking methods: a 1-D Logistic map is used to encrypt the watermark signal, and a generalized 2-D Arnold cat map is used to encrypt the embedding positions in the host image. Simulation results show that the proposed digital watermarking scheme is effective and robust to commonly used image processing operations.
1 Introduction

Recently, a new information security technology, information hiding, has become a major concern; it includes digital watermarking and steganography. Many different watermarking schemes have been proposed in recent years [1-4], and they can be classified into two categories: spatial domain [5] and frequency domain [6-9] watermarking. Spatial domain watermarking shows that a large number of bits can be embedded without incurring noticeable visual artifacts, whereas frequency domain watermarking has been shown to be quite robust against JPEG compression, filtering, noise pollution and so on. In most spatial domain schemes, the watermark signal is embedded in the LSB (least significant bit) of the pixels of the host image, but the robustness against attacks is weak and the watermark can be detected easily. Therefore, many new variants of the LSB algorithm have been proposed to improve the robustness, but they are not secure enough. In [10], for example, a hash function is employed to improve the security of the watermarking algorithm. In [11], a digital signature approach that does not degrade the quality of the host image is proposed, but a mapping table is needed to record the embedding positions, which increases the complexity of the algorithm.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 113–119, 2007.
© Springer-Verlag Berlin Heidelberg 2007

In this paper, a 1-D Logistic map is used to encrypt the watermark signal. To spread the watermark signal over all regions of the host image chaotically, a 2-D Arnold cat map is employed to shuffle the embedding positions of pixels in the host image, which ensures the security of our scheme; another chaotic sequence is generated to locate the bit of each pixel of the host image, and watermark bits are used to modify the 3rd, 4th, 5th or 6th bit of the
114
X. Wu, Z.-H. Guan, and Z. Wu
corresponding shuffled pixels in host image randomly, which further enhances the robustness and security of the proposed scheme.
2 Chaos and Its Application in Watermarking

2.1 Encryption to Watermark Signal
Due to their extreme sensitivity to initial conditions and the spreading of orbits over the entire space, chaotic maps are widely used for watermarking and encryption. To ensure the security of the watermarking scheme, the watermark is encrypted before embedding. First, the watermark signal is encoded into binary bit streams using ASCII codes; then a random-like, uncorrelated, and deterministic chaotic sequence is created by a 1-D Logistic map, whose initial condition and parameters are kept as the secret key; next, the encoded watermark bits are encrypted with the chaotic sequence. In this way, a number of uncorrelated, random-like, and reproducible encrypted watermark signals are generated. A commonly used chaotic map is the Logistic map, described by
z_{n+1} = μ z_n (1 − z_n),    (1)

where z_n ∈ (0,1) and μ ∈ (0, 4]. When μ > 3.5699456, the sequence iterated from an initial value is chaotic, and different initial values generate different sequences. The encryption formula is

we_n = w ⊕ c_n,    (2)

where we_n is the n-th encrypted watermark signal, w is the original watermark signal, and c_n is the chaotic sequence.

2.2 Encryption to Embedding Position of Watermark
In order to shuffle the embedding positions of the host image, the 2-D Arnold cat map [12] is adopted in our scheme:

x_{n+1} = (x_n + y_n) mod 1,
y_{n+1} = (x_n + 2 y_n) mod 1,    (3)

where "x mod 1" denotes the fractional part of the real number x, obtained by adding or subtracting an appropriate integer. Therefore (x_n, y_n) is confined to the unit square [0,1] × [0,1]. Writing formula (3) in matrix form, we obtain

(x_{n+1}, y_{n+1})^T = [1 1; 1 2] (x_n, y_n)^T mod 1 = A (x_n, y_n)^T mod 1.    (4)

The unit square is first stretched by the linear transformation and then folded back by the modulo operation, so the cat map is area preserving: the determinant |A| of its linear transformation matrix equals 1. The map is known to be chaotic. In addition, it is a one-to-one map: each point of the unit square is uniquely mapped onto another point of the unit square. Hence watermark pixels at different positions get different embedding positions.

A Chaos Based Robust Spatial Domain Watermarking Algorithm    115

The cat map above can be extended as follows: first, the phase space is generalized to {0, 1, 2, ..., N − 1} × {0, 1, 2, ..., N − 1}, i.e., only the integers from 0 to N − 1 are taken; then equation (4) is generalized to the 2-D invertible chaotic map

(x_{n+1}, y_{n+1})^T = [a b; c d] (x_n, y_n)^T mod N = A (x_n, y_n)^T mod N,    (5)

where a, b, c, and d are positive integers with |A| = ad − bc = 1; under this condition only three of the four parameters a, b, c, d are independent. The generalized cat map (5) is also chaotic. Using the generalized cat map (5), we obtain the embedding positions of the watermark pixels: the coordinate (i, j) of a watermark pixel serves as the initial value; the three independent parameters and the iteration count n serve as the secret key; and after n iterations, the result (x_n, y_n) serves as the embedding position for the watermark pixel at (i, j). When the iteration count n is large enough, two adjacent watermark pixels are separated widely in the host image, and different watermark pixels get different embedding positions.

To locate the pixel bits to be modified in the host image, the 1-D Logistic map is used once more in our approach. Because the chaotic sequence is distributed over the interval (0,1) and is non-periodic, the interval (0,1) can be divided into several subintervals that correspond to different pixel bits of the host image.
3 Watermark Embedding and Extraction

3.1 Watermark Embedding
Let the binary watermark of size M_1 × M_2 be denoted by W = {w(i, j), 1 ≤ i ≤ M_1, 1 ≤ j ≤ M_2}, and the original host image of size N_1 × N_2 by F = {f(x, y), 1 ≤ x ≤ N_1, 1 ≤ y ≤ N_2}, where (i, j) and (x, y) are pixel coordinates of the binary watermark image and the original host image, respectively; w(i, j) ∈ {0, 1} and f(x, y) ∈ {0, 1, ..., 2^L − 1} are the pixel values of the watermark and the host image, and L is the number of bits per gray-level pixel. For simplicity, let M_1 = M_2 = M and N_1 = N_2 = N.

Watermark bits (one bit per pixel) are embedded at randomized positions in the host image for security (to prevent unauthorized extraction); the embedding position (x, y) is calculated by formula (5), so different watermark positions (i, j) are mapped onto different embedding positions (x, y). Because the Arnold cat map is one-to-one, no record table is needed to track colliding positions in our algorithm. The watermark bit at position (i, j) is embedded into the k-th bit of position (x, y) in the host image, where k = 3, 4, 5, or 6, determined by the subinterval of z_n generated by the Logistic map (1); the bit position to be embedded is thus located by the coordinate (x, y) together with k. Let f'(x, y) denote a pixel of the watermarked image. If w(i, j) equals the k-th bit of f(x, y), then f'(x, y) = f(x, y), i.e., the pixel value is kept unchanged; otherwise, the k-th bit of f(x, y) is replaced by w(i, j). The watermark embedding algorithm is as follows.

Step 1: Encrypt the watermark signal with the Logistic chaotic sequence to obtain the encrypted watermark signal.
Step 2: Designate the three independent parameters of the Arnold cat map, the initial value (i, j), the iteration count n, and the initial value z_0 of the Logistic map.
Step 3: For watermark pixel w(i, j), let x_0 = i, y_0 = j, and iterate (5) n times to obtain (x, y). Then let i = i + 1, j = j + 1.
Step 4: Perform one Logistic iteration to obtain z_n ∈ (0,1), and determine k: z_n ∈ (0, 0.25], k = 3; z_n ∈ (0.25, 0.5], k = 4; z_n ∈ (0.5, 0.75], k = 5; z_n ∈ (0.75, 1), k = 6. Then find the k-th bit b_k of f(x, y).
Step 5: If w(i, j) = b_k, then f'(x, y) = f(x, y); otherwise, replace the k-th bit of f(x, y) with w(i, j) to obtain f'(x, y).
Step 6: Take the next watermark pixel and repeat Steps 3 through 5 until all watermark pixels are embedded.

3.2 Watermark Extraction
Watermark extraction is simply the inverse of the embedding algorithm. The key parameters, the initial value z_0, and the watermark length are needed for extraction. Let w'(i, j) denote the watermark pixel to be extracted. The watermark extraction algorithm is as follows.

Step 1: Designate the three independent parameters of the Arnold cat map, the initial value (i, j), the iteration count n, and the initial value z_0 of the Logistic map.
Step 2: For the watermark pixel w'(i, j) to be extracted, let x_0 = i, y_0 = j, and iterate formula (5) n times to obtain (x, y). Then let i = i + 1, j = j + 1.
Step 3: Perform the same Logistic iteration as in the embedding process to obtain z_n, and determine k: z_n ∈ (0, 0.25], k = 3; z_n ∈ (0.25, 0.5], k = 4; z_n ∈ (0.5, 0.75], k = 5; z_n ∈ (0.75, 1), k = 6.
Step 4: Read the k-th bit p_k of f'(x, y) to obtain the encrypted watermark bit w'(i, j) = p_k.
Step 5: Repeat Steps 2 through 4 until all watermark pixels w'(i, j) (i, j = 1, 2, ..., M) are extracted.
Step 6: Decrypt the encrypted watermark to recover the original watermark.
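Sections 3.1-3.2 can be illustrated with a minimal round trip. This sketch is ours, not the authors' code: deriving the XOR key bit from the logistic state is a simplification (the paper uses a separate chaotic sequence for Eq. (2)), and all helper names are hypothetical.

```python
def logistic(z, mu=4.0):
    """One iteration of the Logistic map, Eq. (1)."""
    return mu * z * (1.0 - z)

def bit_plane(z):
    """Map z in (0,1) to k in {3, 4, 5, 6} by quartile, as in Step 4."""
    return 3 + min(int(z * 4), 3)

def embed_bit(pixel, w_bit, k):
    """Substitute the k-th bit (k = 1 being the LSB) of an 8-bit pixel."""
    mask = 1 << (k - 1)
    return (pixel & ~mask) | (mask if w_bit else 0)

def extract_bit(pixel, k):
    return (pixel >> (k - 1)) & 1

if __name__ == "__main__":
    z, w = logistic(0.3), 1            # a chaotic state and one watermark bit
    key_bit = int(z > 0.5)             # toy XOR key bit derived from the state
    k = bit_plane(z)                   # bit plane selected by the same state
    enc = w ^ key_bit                  # Eq. (2): encrypt before embedding
    marked = embed_bit(200, enc, k)    # embed into a host pixel of value 200
    recovered = extract_bit(marked, k) ^ key_bit
    print(recovered == w)              # True: the round trip is lossless
```

Because both the embedder and the extractor regenerate the same chaotic state from the shared secret key, extraction needs nothing beyond the key parameters and the watermark length, as Section 3.2 notes.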
4 Simulation Results

To demonstrate the effectiveness of the proposed algorithm, MATLAB simulations are performed using the 256 × 256 gray-level "peppers" image and a 64 × 64 binary watermark logo "HUST". The three independent parameters and the initial value of the Arnold cat map are chosen as a = 1, b = 2, c = 3 and (x_0, y_0) = (2, 3), respectively, with iteration count n = 20; the parameter and the initial value of the Logistic map are chosen as μ = 4 and z_0 = 0.5, respectively. The watermark bits are embedded randomly into the 3rd, 4th, 5th, or 6th bit of the pixel at position (x, y) in the host image.
Fig. 1. Demonstration of invisibility: (a) Original "peppers" image; (b) Watermark logo "HUST"; (c) Watermarked image; (d) Extracted watermark logo
Fig. 1 demonstrates the invisibility of the watermark: 1(a) and 1(b) show the original host image and the binary watermark logo, respectively, while 1(c) and 1(d) show the watermarked image (PSNR = 47.25 dB) and the extracted watermark logo "HUST", respectively. One can see that the watermark is perceptually invisible. Fig. 2 demonstrates the robustness of our algorithm: 2(a)-2(e) show the watermarked image after JPEG compression with quality 10, 5 × 5 median filtering, additive Gaussian noise (0, 0.01), cropping of one quarter at the upper-left corner, and 2° rotation, respectively; 2(f)-2(j) are the corresponding extracted watermark logos. The results show that the recovered watermark logos remain clearly recognizable even after the watermarked image has survived severe attacks.
Fig. 2. Demonstration of robustness: (a) JPEG compressed (quality = 10) watermarked image; (b) Watermarked image after 5 × 5 median filtering; (c) Noisy watermarked image (0, 0.01); (d) Watermarked image with one quarter cropped; (e) Rotated watermarked image (2°); (f)-(j) The corresponding extracted watermark logos
5 Conclusions

In this paper, a novel spatial-domain watermarking algorithm based on the Logistic map and the Arnold cat map is proposed. The embedding positions of the watermark signal are encrypted by the 2-D Arnold cat map, and the pixel bit to be modified in the host image is determined by the 1-D Logistic chaotic map. Computer simulations show that the scheme is secure and robust against commonly used image processing operations.
Acknowledgment

This work is supported by the National Natural Science Foundation of China under Grants 60573005 and 60603006.
References

1. Hsu, C.T., Wu, J.L.: Hidden Digital Watermarks in Images. IEEE Trans. Image Processing 8 (1999) 58-68
2. Cox, I.J., Miller, M.L., Bloom, J.A.: Digital Watermarking. Academic Press, New York (2002)
3. Lee, C.H., Lee, Y.K.: An Adaptive Digital Image Watermarking Technique for Copyright Protection. IEEE Trans. Consumer Electronics 45 (1999) 1005-1015
4. Zhang, J.S., Tian, L.H., Tai, M.: A New Watermarking Method Based on Chaotic Maps. IEEE International Conference on Multimedia and Expo (2004) 939-942
5. Bender, W.R., Gruhl, D., Morimoto, N.: Techniques for Data Hiding. In Proc. SPIE: Storage and Retrieval of Image and Video Databases 2420 (1995) 164-173
6. Barni, M., Bartolini, F., Cappellini, V., Piva, A.: A DCT-domain System for Robust Image Watermarking. Signal Processing 66 (1998) 357-372
7. Lin, S.D., Chen, C.F.: A DCT Based Image Watermarking with Threshold Embedding. Int. J. of Comp. and Applications 25 (2003) 130-135
8. Zhao, D.W., Chen, G.R., Liu, W.B.: A Chaos-Based Robust Wavelet Domain Watermarking Algorithm. Chaos, Solitons & Fractals 22 (2004) 47
9. Lu, W., Lu, H.T., Chung, F.L.: Chaos-Based Spread Spectrum Robust Watermarking in DWT Domain. Proceedings of the 4th International Conference on Machine Learning and Cybernetics (2005) 5308-5313
10. Hwang, M.S., Chang, C.C., Hwang, K.F.: A Watermarking Technique Based on One-way Hash Function. IEEE Trans. Consumer Electronics 45 (1999) 286-294
11. Chang, C., Hsiao, J., Chiang, C.: An Image Copyright Protection Scheme Based on Torus Automorphism. Proc. of the IEEE (2002) 217-224
12. Kohda, T., Aihara, K.: Chaos in Discrete Systems and Diagnosis of Experimental Chaos. Transactions of IEICE E 73 (1990) 772-783
Integrating KPCA and LS-SVM for Chaotic Time Series Forecasting Via Similarity Analysis

Jian Cheng1, Jian-sheng Qian1, Xiang-ting Wang1, and Li-cheng Jiao2

1 School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116, China
2 Institute of Intelligent Information Processing, Xidian University, Xi'an 710071, China
[email protected]

Abstract. A novel approach is presented that reconstructs the phase space using kernel principal component analysis (KPCA) with similarity analysis and forecasts chaotic time series with a least squares support vector machine (LS-SVM) in that phase space. A three-stage architecture is proposed to improve prediction accuracy and generalization performance for chaotic time series forecasting. In the first stage, KPCA is adopted to extract features and obtain the kernel principal components. In the second stage, the similarity between each principal component and the output variable is analyzed, and some principal components are chosen to construct the phase space of the chaotic time series according to their similarity degree to the model output. In the third stage, an LS-SVM is employed to forecast the chaotic time series. The method is evaluated experimentally on coal mine gas concentration data. The simulation shows that LS-SVM with phase space reconstruction using KPCA with similarity analysis performs much better than without similarity analysis.
1 Introduction

Interest in chaotic time series forecasting has been growing; however, most practical time series are nonlinear and chaotic in nature, which makes conventional linear prediction methods inapplicable. Although neural networks have been developed for chaotic time series prediction, some inherent drawbacks, e.g., the multiple local minima problem, the choice of the number of hidden units, and the danger of overfitting, make it difficult to put neural networks into practice. The support vector machine (SVM), established on the unique theory of the structural risk minimization principle [1], usually achieves higher generalization performance in many machine learning problems than traditional neural networks, which implement the empirical risk minimization principle. Another key characteristic of SVM is that training is equivalent to solving a linearly constrained quadratic programming problem, so the solution of SVM is always unique and globally optimal. The least squares support vector machine (LS-SVM) [2], a new kind of SVM, is easier to use than the usual SVM, so LS-SVM is employed here to forecast chaotic time series.

In developing an LS-SVM model for chaotic time series, the first important step is to reconstruct the embedding phase space. Traditional phase space reconstruction usually adopts the coordinate-delay method, whose key is to ascertain the embedding dimension and the time delay [3]. The G-P algorithm [4] and the FNN (false nearest neighbors) method [5] can both ascertain the embedding dimension. Besides being time-consuming, their most serious problem is that there may be correlations between different features in the reconstructed phase space, which degrade the quality of the phase space and the modeling effect. Principal component analysis (PCA) is a well-known method for feature extraction that acquires the embedding dimension from the time series directly, but PCA is linear in nature [6]. Kernel principal component analysis (KPCA) is a nonlinear PCA developed by generalizing the kernel method to PCA [7]: it first maps the original input space into a high-dimensional feature space using the kernel method and then performs linear PCA in that feature space; linear PCA in the high-dimensional feature space corresponds to nonlinear PCA in the original input space.

This paper proposes a phase space reconstruction method based on KPCA with similarity analysis in order to improve the quality of the phase space and the accuracy of chaotic time series modeling. On the basis of KPCA, some kernel principal components are chosen according to their similarity degree to the model output and are utilized to reconstruct the final phase space of the chaotic time series. The reconstructed phase space is then used as the input space of an LS-SVM to forecast the chaotic time series. Examining the performance in forecasting coal mine gas concentration, the simulation shows that LS-SVM with phase space reconstruction combining KPCA with similarity analysis performs much better than that without similarity analysis.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 120–126, 2007. © Springer-Verlag Berlin Heidelberg 2007

The rest of this paper is organized as follows. Section 2 presents the phase space reconstruction of chaotic time series based on KPCA. Section 3 presents the reduction of the phase space dimension.
The architecture and algorithm are given in Section 4. Section 5 presents the results and discussions on the experimental validation. Finally, some concluding remarks are drawn in Section 6.
2 Phase Space Reconstruction Based on KPCA

Given a set of centered chaotic time series x_k (k = 1, 2, ..., l, with \sum_{k=1}^{l} x_k = 0), the basic idea of KPCA is to map the original input vectors x_k into a high-dimensional feature space Φ(x_k) and then to perform linear PCA on Φ(x_k). By mapping x_k into Φ(x_k), KPCA solves the eigenvalue equation

λ_i u_i = C̃ u_i,  i = 1, 2, ..., l,    (1)

where C̃ = (1/l) \sum_{k=1}^{l} Φ(x_k) Φ(x_k)^T is the sample covariance matrix of Φ(x_k), λ_i is one of the non-zero eigenvalues of C̃, and u_i is the corresponding eigenvector. Equation (1) can be transformed into the eigenvalue equation

λ̃_i α_i = K α_i,  i = 1, 2, ..., l,    (2)

where K is the l × l kernel matrix. The value of each element of K equals the inner product of two high-dimensional feature vectors Φ(x_i) and Φ(x_j), that is, K(x_i, x_j) = Φ(x_i) · Φ(x_j). The advantage of using K is that one can deal with Φ(x_k) of arbitrary dimensionality without having to compute Φ(x_k) explicitly, as all the dot products (Φ(x_i) · Φ(x_j)) are replaced with the kernel function K(x_i, x_j); the mapping of Φ(x_k) from x_k is thus implicit. λ̃_i is one of the eigenvalues of K, satisfying λ̃_i = l λ_i, and α_i is the corresponding eigenvector of K, satisfying u_i = \sum_{j=1}^{l} α_i(j) Φ(x_j), where α_i(j), j = 1, 2, ..., l, are the components of α_i.
To ensure that u_i has unit length, each α_i must be normalized using the corresponding eigenvalue:

α̃_i = α_i / \sqrt{λ̃_i},  i = 1, 2, ..., l.    (3)

Based on the estimated α̃_i, the principal components of x_k are calculated by

s_k(i) = u_i^T Φ(x_k) = \sum_{j=1}^{l} α̃_i(j) K(x_j, x_k),  i = 1, 2, ..., l.    (4)

In addition, to enforce \sum_{k=1}^{l} Φ(x_k) = 0 in equation (4), the kernel matrices on the training set, K, and on the testing set, K_t, are modified respectively by

K̃ = (I − (1/l) 1_l 1_l^T) K (I − (1/l) 1_l 1_l^T),    (5)

K̃_t = (K_t − (1/l) 1_{l_t} 1_l^T K)(I − (1/l) 1_l 1_l^T),    (6)

where I is the l-dimensional identity matrix, l_t is the number of testing data points, 1_l and 1_{l_t} are vectors whose elements are all ones, with lengths l and l_t respectively, and K_t is the l_t × l kernel matrix for the testing data points. From the above equations, it can be seen that the maximal number of principal components extracted by KPCA is l. If only the first several eigenvectors, sorted in descending order of the eigenvalues, are considered, the number of principal components in s_k can be reduced. Popular kernel functions include the Gaussian kernel, the sigmoid kernel, the polynomial kernel, etc. The Gaussian kernel function is employed in this paper:

K(x, x_k) = exp(−||x − x_k||^2 / σ^2).    (7)
3 Reducing the Dimension of the Embedding Phase Space Via Similarity Analysis

The kernel principal components s_k in feature space are computed as in Section 2 and, for convenience, are denoted in this section by H_1, H_2, ..., H_l, where H_i = (H_1^i, H_2^i, ..., H_l^i)^T for i = 1, 2, ..., l, and H_j^i is the i-th principal component of the j-th sample. The first q principal components are chosen such that their accumulative contribution ratio is large enough, and they form the reconstructed phase space. As in formula (8), training sample pairs for chaotic time series modeling are formed as

X̃ = [ H_1^1  H_1^2  ...  H_1^q
      H_2^1  H_2^2  ...  H_2^q
       ...    ...   ...   ...
      H_l^1  H_l^2  ...  H_l^q ],    Y = (y_1, y_2, ..., y_l)^T.    (8)

Modeling the time series on the basis of KPCA phase space reconstruction means finding the hidden function f between input X̃ and output Y such that y_i = f(x_i). The KPCA-based phase space reconstruction above chooses the first q principal components only according to their accumulative contribution ratio (which must be large enough so that they represent most of the information of the original variables), without considering the similarity between each chosen principal component H_i (1 ≤ i ≤ q) and the output variable Y. This paper analyzes the similarity between the principal components and the output variable on the basis of KPCA. Set a threshold θ and compute the similarity coefficient between principal component H_i (1 ≤ i ≤ q) and output Y:

ρ_i = Cov(H_i, Y) / \sqrt{Cov(H_i, H_i) · Cov(Y, Y)},    (9)

where Cov(H_i, Y) is the covariance of the vectors H_i and Y. The principal components H_i (1 ≤ i ≤ q) whose similarity coefficient satisfies ρ_i ≥ θ are chosen to form the reconstructed phase space H.
4 The Proposed Architecture and Algorithm

The basic idea is to use KPCA with similarity analysis (SA) to reconstruct the phase space and to apply an LS-SVM to forecast the chaotic time series. Fig. 1 shows how the model is built.
Fig. 1. The architecture of the model for chaotic time series forecasting
This completes the process of predicting the chaotic time series. The detailed steps of the algorithm are as follows:

Step 1. For a chaotic time series x_k, KPCA is applied to assign the embedding dimension via the accumulative contribution ratio. The principal components s_k, whose dimension is less than that of x_k, are obtained.
Step 2. s_k is used as the input of the similarity analysis (SA), and an appropriate threshold θ is selected according to the result; the dimension of the final embedding phase space is thus assigned.
Step 3. In the reconstructed phase space, the LS-SVM model is built, trained, and validated on the respective partitioned data sets to determine the kernel parameters σ^2 and γ of the LS-SVM with Gaussian kernel function. The LS-SVM that produces the smallest error on the validation data set is chosen for chaotic time series forecasting.
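Step 3 trains an LS-SVM. In the standard formulation of Suykens et al. [2], training reduces to a single linear system; the sketch below is our own illustration on synthetic data, with the parameter values (γ = 25, σ^2 = 0.15) echoing those reported in Section 5.

```python
import numpy as np

def gauss_K(A, B, sigma2):
    """Gaussian kernel matrix between row sets A and B, Eq. (7)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma2)

def lssvm_fit(X, y, gamma, sigma2):
    """Solve the LS-SVM system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    l = X.shape[0]
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = A[1:, 0] = 1.0
    A[1:, 1:] = gauss_K(X, X, sigma2) + np.eye(l) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                       # bias b, coefficients alpha

def lssvm_predict(Xt, X, b, alpha, sigma2):
    """Kernel expansion y(x) = sum_k alpha_k K(x, x_k) + b."""
    return gauss_K(Xt, X, sigma2) @ alpha + b

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.uniform(-1, 1, size=(80, 1))
    y = np.sin(3 * X[:, 0])                      # smooth stand-in target
    b, alpha = lssvm_fit(X, y, gamma=25.0, sigma2=0.15)
    err = np.abs(lssvm_predict(X, X, b, alpha, 0.15) - y).max()
    print(round(float(err), 4))                  # a small training residual
```

The regularization term I/γ is what distinguishes LS-SVM training from plain kernel interpolation: larger γ fits the training data more tightly, which is why γ is tuned on the validation set in Step 3.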
5 Simulation Results

Gas concentration, which is a chaotic time series in essence, is one of the key factors endangering production in coal mines, so strengthening the forecasting and control of coal mine gas concentration has high social and economic benefit. In this study, 2010 samples were collected from an online sensor underground in a coal mine, after eliminating abnormal data. The goal of the task is to use known values of the time series up to the point x = t to predict the value at some future point x = t + τ. The forecasting method creates a mapping from d points of the time series spaced τ apart, that is, (x(t − (d − 1)τ), ..., x(t − τ), x(t)), to a forecast future value x(t + τ). 1200 samples are used to reconstruct the phase space with KPCA. Through several trials, σ^2 = 75 and the number of principal components is 28, at which the accumulative contribution ratio is 0.95. The embedding dimension of the phase space is then 15 after similarity analysis with θ = 0.90. The embedding phase space is thus reconstructed with the parameter values d = 45 and τ = 4 in the experiment. From the gas concentration time series x(t), we extracted 1200 input-output data pairs: the first 500 pairs are used as the training data set, the next 200 pairs as the validation data set for finding the optimal parameters of the LS-SVM, and the remaining 500 pairs as the testing data set for assessing the predictive power of the model. The prediction performance is evaluated using the root mean squared error (RMSE) and the normalized mean square error (NMSE) as follows:
RMSE = \sqrt{(1/n) \sum_{i=1}^{n} (y_i − ŷ_i)^2},    (10)

NMSE = (1/(δ^2 n)) \sum_{i=1}^{n} (y_i − ŷ_i)^2,  with  δ^2 = (1/(n−1)) \sum_{i=1}^{n} (y_i − ȳ)^2,    (11)
where n is the total number of data points in the test set, and y_i, ŷ_i, and ȳ are the actual value, the predicted value, and the mean of the actual values, respectively. When applying LS-SVM to modeling, the first thing to consider is which kernel function to use. As the dynamics of chaotic time series are strongly nonlinear, nonlinear kernel functions can intuitively be expected to achieve better performance than the linear kernel; in this investigation, the Gaussian kernel function tends to give good performance under general smoothness assumptions. The second thing to consider is which values of the kernel parameters (γ and σ^2) to use. As there is no structured way to choose the optimal parameters of LS-SVM, the values that produce the best result on the validation set are used. Through several trials, it was found that σ^2 and γ play an important role in the generalization performance of the LS-SVM, so σ^2 and γ are fixed at 0.15 and 25, respectively, for the following experiments. The simulation results are shown in Table 1, where SA denotes the similarity analysis.

Table 1. The converged RMSE and NMSE and the number of principal components for the gas concentration chaotic time series

Model                  #Principal Components  RMSE (Training)  RMSE (Testing)  NMSE (Training)  NMSE (Testing)
KPCA+LS-SVM            28                     0.0141           0.0150          0.0379           0.0725
KPCA(SA)+LS-SVM        15                     0.0101           0.0107          0.0291           0.0348
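Eqs. (10) and (11) translate directly into code. The helpers below are our own, and the numbers are synthetic, not the paper's gas data.

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error, Eq. (10)."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def nmse(y, yhat):
    """Normalized mean square error, Eq. (11), with the 1/(n-1) variance."""
    delta2 = np.var(y, ddof=1)
    return np.mean((y - yhat) ** 2) / delta2

if __name__ == "__main__":
    y = np.array([1.0, 2.0, 3.0, 4.0])
    yhat = np.array([1.1, 1.9, 3.2, 3.8])
    print(round(float(rmse(y, yhat)), 4))           # 0.1581
    print(round(float(nmse(y, yhat)), 4))           # 0.015
```

NMSE normalizes the squared error by the sample variance of the actual values, so a value near 1 means the model does no better than predicting the mean, which makes the testing figures in Table 1 directly comparable across models.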
Fig. 2. The forecasting errors in the KPCA+LS-SVM model (the dotted line) and the KPCA(SA)+LS-SVM model (the solid line)
From Table 1, it can be observed that KPCA(SA)+LS-SVM forecasts more closely to the actual values than KPCA+LS-SVM; correspondingly, the forecasting errors of KPCA(SA)+LS-SVM (the solid line) are smaller than those of KPCA+LS-SVM (the dotted line), as illustrated in Fig. 2.
6 Conclusions

This paper describes a novel methodology, an LS-SVM combined with KPCA and similarity analysis, to model and forecast chaotic time series. First, KPCA, a nonlinear PCA obtained by generalizing the kernel method to linear PCA, is adopted to extract features of the chaotic time series, fully reflecting their nonlinear characteristics. Second, on the basis of KPCA, the embedding dimension of the phase space of the chaotic time series is reduced according to the similarity degree of the principal components to the model output, so the model precision is greatly improved. The proposed model has been evaluated on coal mine gas concentration, and its superiority is demonstrated by comparing it with the model without similarity analysis: the simulation results show that the proposed model achieves higher prediction accuracy and better generalization performance. On the other hand, some issues should be investigated in future work, such as how to ascertain the accumulative contribution ratio of KPCA and the confidence threshold of the similarity analysis, which deeply affect the performance of the whole model, and how to construct the kernel function and determine the optimal kernel parameters.
Acknowledgements

This research is supported by the National Natural Science Foundation of China under grant 70533050 and the Young Science Foundation of CUMT under grant 2006A010.
References

1. Vapnik, V.N.: An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks 10 (5) (1999) 988-999
2. Suykens, J.A.K., Vandewalle, J., De Moor, B.: Optimal Control by Least Squares Support Vector Machines. Neural Networks 14 (2001) 23-35
3. Wei, X.K., Li, Y.H., et al.: Analysis and Applications of Time Series Forecasting Model via Support Vector Machines. Systems Engineering and Electronics 27 (3) (2005) 529-532
4. Chen, K., Han, B.T.: A Survey of State Space Reconstruction of Chaotic Time Series Analysis. Computer Science 32 (4) (2005) 67-70
5. Kennel, M.B., Brown, R., et al.: Determining Embedding Dimension for Phase-space Reconstruction Using a Geometrical Construction. Phys. Rev. A 45 (1992) 3403-3411
6. Palus, M., Dvorak, I.: Singular-value Decomposition in Attractor Reconstruction: Pitfalls and Precautions. Physica D 55 (1992) 221-234
7. Scholkopf, B., Smola, A.J., Muller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10 (1998) 1299-1319
Prediction of Chaotic Time Series Using LS-SVM with Simulated Annealing Algorithms

Meiying Ye

Department of Physics, Zhejiang Normal University, Jinhua 321004, China
[email protected]

Abstract. Least squares support vector machine (LS-SVM) is a popular tool for the analysis of time series data sets. Choosing optimal hyperparameter values for LS-SVM is an important step in time series analysis. In this paper, we combine LS-SVM with simulated annealing (SA) algorithms for nonlinear time series analysis. The LS-SVM is used to predict chaotic time series; its parameters are automatically tuned using SA, and generalization performance is estimated by minimizing the k-fold cross-validation error. A benchmark problem, the Mackey-Glass time series, is used as an example for demonstration. It is shown that this approach avoids the blindness of a manual choice of the LS-SVM parameters and enhances the capability of predicting chaotic time series.
1 Introduction

Time series prediction is an important practical problem with a diverse range of applications, from economic and business planning, inventory and production control, to weather forecasting, signal processing, and control. However, time series analysis is a complex problem: most time series of practical relevance are nonlinear and chaotic in nature, which makes conventional linear prediction methods inapplicable. Hence, a number of nonlinear prediction methods have been developed, including neural networks (NN), which, though not initially proposed for time series prediction, exceed conventional methods in accuracy by orders of magnitude. One of the most common NNs in the area of chaotic time series prediction is the multilayer NN with the error backpropagation learning algorithm, which has been successfully utilized to predict chaotic dynamical systems. The NN employs the gradient descent method to find suitable network weights by minimizing the sum of squared errors, with training usually done by iterative updating of the weights according to the error signal. Although NNs have been developed for chaotic time series prediction, some inherent drawbacks, e.g., the multiple local minima problem, the choice of the number of hidden units, and the danger of overfitting, make it difficult to put them into practical application. The present study focuses on the problem of chaotic time series prediction using least squares support vector machine (LS-SVM) regression [1][2], whose parameters are automatically tuned using SA [3], with generalization performance estimated by minimizing the k-fold cross-validation error [4].

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 127–134, 2007. © Springer-Verlag Berlin Heidelberg 2007
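The tuning strategy described above can be sketched generically. This is our own illustration, not the paper's implementation: the k-fold cross-validation error is replaced by a stand-in quadratic objective, and all names and parameter values are hypothetical.

```python
import math
import random

def anneal(cost, x0, step, t0=1.0, cooling=0.95, iters=200, seed=0):
    """Minimise cost(x) by simulated annealing with Gaussian proposal steps."""
    rng = random.Random(seed)
    x, fx, t = list(x0), cost(x0), t0
    best, fbest = list(x), fx
    for _ in range(iters):
        cand = [v + rng.gauss(0.0, step) for v in x]
        fc = cost(cand)
        # Downhill moves are always accepted; uphill moves with Boltzmann probability.
        if fc < fx or rng.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = list(x), fx
        t *= cooling                      # geometric cooling schedule
    return best, fbest

if __name__ == "__main__":
    # Stand-in error surface with its minimum at (gamma, sigma2) = (25, 0.15):
    cv_error = lambda p: (p[0] - 25.0) ** 2 + (p[1] - 0.15) ** 2
    best, f = anneal(cv_error, x0=[10.0, 1.0], step=2.0)
    print(f < cv_error([10.0, 1.0]))      # True: the search improves on the start
```

In the paper's setting, `cost` would evaluate the k-fold cross-validation error of an LS-SVM trained with the candidate (γ, σ^2); the annealing temperature lets the search escape local minima early and settle down as it cools.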
2 Problem Description A chaotic time series is an array of values belonging to subsequent samples usually coming from a nonlinear dynamic system's output. It is assumed that neither of state of the nonlinear dynamic system is measurable nor the equation describing its state is known. If the nonlinear dynamic system is deterministic, we can try to predict the chaotic time series by reconstructing the state space. The object of chaotic time series forecasting is building an estimate function for the system's transfer function only using its output. Many conventional regression techniques can be used to solve problems of estimating function. In this investigation, we concentrate on the LS-SVM. Let's assume that the chaotic time series is sampled every T . The chaotic time series can be express as x(T ), x(2T ), ", x(NT ) . The chaotic time series prediction
can be stated as a numerical problem: split the time series x(T), x(2T), …, x(NT) into windows (x((i − D + 1)T), …, x(iT)) of size D, and then find a good estimate of the function F: R^D → R such that

x((i + 1)T) = F(x((i − D + 1)T), …, x(iT)),    (1)

for every i ∈ {D, …, N − 1}. Here F(·) is an unknown function and D is a positive integer, the so-called embedding dimension. In many time series applications, one-step prediction schemes are used to predict the next sample of data, x((i + 1)T), based on previous samples. However, one-step prediction may not provide enough information, especially in situations where a broader knowledge of the time series behavior is useful or where it is desirable to anticipate the behavior of the time series process. The present study deals with chaotic time series prediction several steps ahead into the future, i.e., obtaining x((i + 1)T), x((i + 2)T), …, x((i + P)T) starting from information available at instant i. Hence, the goal is to approximate the function F(·) so that the model given by equation (1) can be used as a chaotic time series prediction scheme. In this work, we apply LS-SVM and SA to estimate the unknown function F(·).
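The windowing of equation (1) can be sketched as follows; the function name `make_windows` and the toy input series are my own, not from the paper.

```python
import numpy as np

def make_windows(series, D):
    """Split a time series into input windows of size D and one-step-ahead
    targets: inputs are (x[i-D+1], ..., x[i]) and the target is x[i+1],
    mirroring equation (1) with the sampling period T absorbed into the index."""
    X = np.array([series[i - D + 1 : i + 1] for i in range(D - 1, len(series) - 1)])
    y = np.array([series[i + 1] for i in range(D - 1, len(series) - 1)])
    return X, y

series = np.sin(0.1 * np.arange(100))   # stand-in for a chaotic series
X, y = make_windows(series, D=4)
# each row of X holds 4 consecutive samples; y holds the next sample
```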
3 SVM and Its Parameter Selection by SA
The present study focuses on the problem of chaotic time series prediction using LS-SVM and SA. In the following, we briefly introduce LS-SVM regression and SA; for further details on LS-SVM and SA we refer to Refs. [1], [2], and [3].
3.1 LS-SVM Model for Chaotic Time Series Prediction
Consider a given training set of N data points {x_k, y_k}_{k=1}^{N} with input data x_k ∈ R^D and output y_k ∈ R. In the feature space, LS-SVM models take the form

y(x) = w^T φ(x) + b,    (2)
Prediction of Chaotic Time Series Using LS-SVM with SA Algorithms
129
where the nonlinear mapping φ(·) maps the input data into a higher-dimensional feature space. Note that the dimension of w is not specified (it can be infinite). In LS-SVM for function estimation, the following optimization problem is formulated:

min_{w,e} J(w, e) = (1/2) w^T w + (γ/2) Σ_{k=1}^{N} e_k²,    (3)

subject to the equality constraints

y_k = w^T φ(x_k) + b + e_k,  k = 1, …, N.    (4)
Important differences from the standard SVM [5] are the equality constraints and the squared error term, which greatly simplify the problem. The solution is obtained after constructing the Lagrangian

L(w, b, e, α) = J(w, e) − Σ_{k=1}^{N} α_k {w^T φ(x_k) + b + e_k − y_k},    (5)

with Lagrange multipliers α_k. After optimizing equation (5) and eliminating e_k and w, the solution is given by the following set of linear equations:

[ 0    1^T          ] [ b ]   [ 0 ]
[ 1    Ω + γ^{−1} I ] [ α ] = [ y ],    (6)

where y = [y₁; …; y_N], 1 = [1; …; 1], α = [α₁; …; α_N], Ω_{kl} = φ(x_k)^T φ(x_l), and Mercer's condition

K(x_k, x_l) = φ(x_k)^T φ(x_l),  k, l = 1, …, N,    (7)
has been applied. This finally results in the following LS-SVM model for function estimation:

y(x) = Σ_{k=1}^{L} α_k K(x, x_k) + b,    (8)

where α_k and b are the solution of the linear system, K(·,·) is the kernel function corresponding to the high-dimensional feature space into which the input space is nonlinearly mapped, and L is the number of support vectors. The LS-SVM approximates the function using equation (8). Any function that satisfies Mercer's condition can be used as the kernel function K(·,·), and there are several possible choices. Popular kernel functions are the Gaussian kernel

K(x_k, x_l) = exp(−‖x_k − x_l‖² / (2σ²)),    (9)

and the polynomial kernel

K(x_k, x_l) = (1 + x_k · x_l)^β,    (10)

where σ and β are positive real constants.
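Equations (6)–(8) amount to one dense linear solve. The following NumPy sketch (function names are mine; the paper gives no code) fits an LS-SVM with the Gaussian kernel of equation (9) and evaluates the model of equation (8) on a toy regression problem:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """Gaussian kernel matrix, equation (9)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma, sigma):
    """Solve the linear system (6) for the bias b and multipliers alpha."""
    N = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(N) / gamma    # Omega + I / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]               # alpha, b

def lssvm_predict(X_train, alpha, b, sigma, X_new):
    """Evaluate the model of equation (8)."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sinc(X[:, 0])                     # toy target function
alpha, b = lssvm_fit(X, y, gamma=100.0, sigma=0.5)
y_hat = lssvm_predict(X, alpha, b, 0.5, X)
```

Note that the first row of system (6) enforces the constraint that the multipliers α sum to zero, which the solve reproduces numerically.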
In this work, the Gaussian kernel is used as the kernel function of the LS-SVM because Gaussian kernels tend to give good performance under general smoothness assumptions. Consequently, they are especially useful if no additional knowledge of the data is available. Note that in the case of Gaussian kernels, one has only two additional tuning parameters, viz. the kernel width parameter σ in equation (9) and the regularization parameter γ in equation (3), which is fewer than for the standard SVM.
3.2 SA for Parameter Tuning of LS-SVM
To obtain good prediction performance, some parameters of the LS-SVM have to be chosen carefully. These parameters include:

• the regularization parameter γ, which determines the tradeoff between minimizing the training error and minimizing model complexity; and
• the parameter (σ or β) of the kernel function, which implicitly defines the nonlinear mapping from the input space to some high-dimensional feature space (in this paper we focus entirely on the Gaussian kernel).
These "higher-level" parameters are usually referred to as hyperparameters. In this paper, they are automatically tuned using SA, and the generalization performance of the LS-SVM is estimated by minimizing the k-fold cross-validation error in the training phase. SA is an optimization technique analogous to the annealing process in materials physics. Boltzmann [6] pointed out that if a system is in thermal equilibrium at a temperature T, then the probability P_T(s) of the system being in a given state s is given by the Boltzmann distribution:

P_T(s) = exp(−E(s)/KT) / Σ_{w∈S} exp(−E(w)/KT),    (11)

where E(s) denotes the energy of state s, K represents the Boltzmann constant, and S is the set of all possible states. However, equation (11) contains no information on how a fluid reaches thermal equilibrium at a given temperature. Metropolis et al. [3] developed an algorithm that simulates this process. The Metropolis algorithm is summarized as follows. When the system is in the original state s_old with energy E(s_old), a randomly selected atom is perturbed, resulting in a state s_new with energy E(s_new). This new state is either accepted or rejected depending on the Metropolis criterion: if E(s_new) ≤ E(s_old), the new state is automatically accepted; in contrast, if E(s_new) > E(s_old), the probability of accepting the new state is given by

P_t(accept s_new) = exp((E(s_old) − E(s_new)) / KT).    (12)
Based on the studies of Boltzmann and Metropolis, Kirkpatrick et al. [7] proposed conducting the Metropolis procedure at each temperature of an annealing schedule until thermal equilibrium is reached. Additionally, a prerequisite for applying the SA algorithm is that a given set of the multiple variables defines a unique system state for which the objective function can be calculated. The SA algorithm in our investigation is described as follows:

Step 1 (Initialization). Set upper bounds for the two positive LS-SVM parameters, σ and γ. Then generate initial values of the two parameters and feed them into the LS-SVM model. The forecasting error is defined as the system state (E); this yields the initial state (E₀).

Step 2 (Provisional state). Make a random move to change the existing system state to a provisional state; another set of the two positive parameters is generated in this stage.

Step 3 (Acceptance tests). The following rule is employed to determine the acceptance or rejection of the provisional state:

accept the provisional state if E(s_new) > E(s_old) and p < P_t(accept s_new), 0 ≤ p < 1;
accept the provisional state if E(s_new) ≤ E(s_old);
reject the provisional state otherwise,    (13)

where p is a random number that determines the acceptance of the provisional state. If the provisional state is accepted, it is set as the current state.

Step 4 (Incumbent solutions). If the provisional state is not accepted, return to Step 2. Furthermore, if the current state is not superior to the system state, repeat Steps 2 and 3 until the current state is superior to the system state, and finally set the current state as the new system state. Previous studies [8,9] indicated that a maximum number of loops N_sa = 100·D avoids infinitely repeated loops, where D denotes the problem dimension. In this investigation, the two parameters (σ and γ) determine the system state; hence N_sa is set to 200.

Step 5 (Temperature reduction). After the new system state is obtained, reduce the temperature according to

new temperature = (current temperature) × ρ, where 0 < ρ < 1.    (14)

Here ρ is set to 0.9. If the pre-determined stopping temperature is reached, the algorithm terminates and the latest state is an approximate optimal solution; otherwise, go to Step 2.

Cross-validation is a popular technique for estimating generalization performance, and there are several versions. The k-fold cross-validation error is computed as follows: the training set is randomly divided into k mutually exclusive subsets (folds); the LS-SVM is trained on k − 1 subsets and then tested on the remaining subset to obtain the regression error. This procedure is repeated k times so that each subset is used for testing once. Averaging the test error over the k trials gives an estimate of the generalization performance. The k-fold cross-validation can be
applied to arbitrary learning algorithms. To evaluate the performance of the proposed method, we use k = 5 folds.
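Steps 1–5 can be sketched as the loop below. This is an illustrative implementation under assumptions of my own: the multiplicative log-space random move, the bounds, and the stubbed objective are not from the paper; a real objective would run the 5-fold cross-validation loop over LS-SVM fits for each candidate (σ, γ).

```python
import numpy as np

def cv_error(params):
    """Stand-in for the 5-fold cross-validation error of an LS-SVM trained
    with (sigma, gamma); a smooth toy objective is used here for brevity."""
    sigma, gamma = params
    return (np.log10(sigma) - 0.5) ** 2 + (np.log10(gamma) - 2.0) ** 2

def anneal(obj, x0, bounds, T0=1.0, rho=0.9, n_sa=200, T_min=1e-3, seed=0):
    """Simulated annealing following Steps 1-5, with rho = 0.9 as in the paper
    and n_sa = 100 * (problem dimension) = 200 moves per temperature."""
    rng = np.random.default_rng(seed)
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    x = np.array(x0, float)
    E = obj(x)
    T = T0
    while T > T_min:
        for _ in range(n_sa):
            # Step 2: random multiplicative move to a provisional state
            x_new = np.clip(x * np.exp(rng.normal(0.0, 0.3, x.size)), lo, hi)
            E_new = obj(x_new)
            # Step 3: Metropolis acceptance test, equation (13)
            if E_new <= E or rng.random() < np.exp((E - E_new) / T):
                x, E = x_new, E_new
        T *= rho                      # Step 5: temperature reduction (14)
    return x, E

(best_sigma, best_gamma), err = anneal(cv_error, x0=[1.0, 1.0],
                                       bounds=[(1e-2, 1e2), (1e-2, 1e4)])
```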
4 Benchmark Problem and Experimental Results
In this section, we present an example showing the effectiveness of LS-SVM with SA for chaotic time series prediction. We use data sets generated by the Mackey-Glass differential-delay equation [10], generating a time series by numerically integrating the Mackey-Glass time-delay differential equation

dx(t)/dt = −h x(t) + g x(t − τ) / (1 + x¹⁰(t − τ)),    (15)
with parameters g = 0.2, h = 0.1, τ = 17, and initial conditions x(0) = 1.2 and x(t) = 0 for t < 0. Equation (15) was originally introduced as a model of blood cell regulation.

Fig. 1. Predicted and desired values of the Mackey-Glass series; the parameter P is set to 36

Fig. 2. Root-mean-square errors (RMSE) as a function of P. The solid line indicates the prediction errors with SA and the dashed line those using 5-fold cross-validation
The time series data were obtained by applying the conventional fourth-order Runge-Kutta algorithm to compute the numerical solution of equation (15). This nonlinear time series is chaotic, so there is no clearly defined period; the series neither converges nor diverges, and the trajectory is highly sensitive to initial conditions. The prediction of future values of this series is a benchmark problem. In time series prediction, we want to use known values of the time series up to a point in time, say i, to predict the value at some point in the future, say i + P. The standard method for this type of prediction is to create a mapping from D sample data points, sampled every T units in time, (x(i − (D − 1)T), …, x(i − T), x(i)), to a predicted future value x(i + P). Following the conventional settings for predicting the Mackey-Glass time series, we set D = 4. For each i, the input training data for the LS-SVM is a four-dimensional vector. We extracted input/output data pairs of the following format:
[x(i − 18), x(i − 12), x(i − 6), x(i); x(i + 6)]    (16)
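The Mackey-Glass series of equation (15) can be generated as follows. The paper applies a fourth-order Runge-Kutta scheme; this sketch does the same, with the simplifying assumption (mine, not the paper's) that the delayed term is held constant within each step.

```python
import numpy as np

def mackey_glass(n_steps, g=0.2, h=0.1, tau=17.0, dt=0.1, x0=1.2):
    """Integrate equation (15) with RK4, holding the delayed term constant
    over each step; x(0) = 1.2 and x(t) = 0 for t < 0."""
    lag = int(round(tau / dt))
    x = np.zeros(n_steps + 1)
    x[0] = x0
    for n in range(n_steps):
        x_del = x[n - lag] if n >= lag else 0.0     # x(t - tau)
        f = lambda xv: -h * xv + g * x_del / (1.0 + x_del ** 10)
        k1 = f(x[n])
        k2 = f(x[n] + 0.5 * dt * k1)
        k3 = f(x[n] + 0.5 * dt * k2)
        k4 = f(x[n] + dt * k3)
        x[n + 1] = x[n] + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return x

x = mackey_glass(12000)     # 1200 time units at dt = 0.1
series = x[::60]            # sample every 6 time units, as in format (16)
```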
In Figure 1, multi-step predictions are considered, with the parameter P set to 36. Figure 2 shows the dependence of the prediction error on the prediction step when the LS-SVM parameters are automatically tuned using SA. We can see that the prediction is quite accurate, and that the prediction error grows with increasing prediction parameter P. The low prediction errors obtained may be attributable to the fact that SA is more likely to converge to the global optimum.
5 Conclusions
In this paper, LS-SVM with SA is used for chaotic time series prediction. The Mackey-Glass time series has been used as an example for demonstration. The results demonstrate that the prediction method using LS-SVM with SA is suitable for multi-step prediction, and show that this approach avoids the blindness of a manual choice of the LS-SVM parameters. Although the experiments focus on the Mackey-Glass differential-delay equation, we believe that the proposed method can be applied to other complex chaotic time series.
Acknowledgements. This project was supported by the Zhejiang Provincial Natural Science Foundation of China (Y105281, Y106786).
References 1. Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9 (1999) 293-300 2. Suykens, J.A.K., Brabanter, J.D., Gestel, T.V., Vandewalle, J.: Least Squares Support Vector Machines. World Scientific, Singapore (2002) 3. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equations of State Calculations by Fast Computing Machines. J. Chem. Phys. 21 (1953) 1087-1091 4. Duan, K., Keerthi, S.S., Poo, A.N.: Evaluation of Simple Performance Measures for Tuning SVM Hyperparameters. Neurocomputing 51 (2003) 41-59
5. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1999) 6. Cercignani, C.: The Boltzmann Equation and Its Applications. Springer-Verlag, Berlin (1988) 7. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Science 220 (1983) 671-680 8. Van Laarhoven, P.J.M., Aarts, E.H.L.: Simulated Annealing: Theory and Applications. Kluwer Academic Publishers, Dordrecht (1987) 9. Dekkers, A., Aarts, E.: Global Optimization and Simulated Annealing. Math. Programm. 50 (1991) 367-393 10. Mackey, M., Glass, L.: Oscillations and Chaos in Physiological Control Systems. Science 197 (1977) 287-289
Radial Basis Function Neural Network Predictor for Parameter Estimation in Chaotic Noise
Hongmei Xie¹,* and Xiaoyi Feng²
¹ Department of Electronics and Information Engineering, School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710072, P.R. China
[email protected]
² Department of Electronics Science and Technology, School of Electronics and Information, Northwestern Polytechnical University, Xi'an 710072, P.R. China
[email protected]
Abstract. Chaotic noise cancellation has potential applications in both secret communication and radar target identification. To solve the problem of parameter estimation in chaotic noise, a novel radial basis function neural network (RBF-NN)-based chaotic time series modeling method is presented in this paper. The algorithm combines a neural network's ability to approximate any nonlinear function with the classical spectral analysis technique. Based on the flexibility of the RBF-NN predictor and classical amplitude spectral analysis, this paper proposes a new algorithm for parameter estimation in chaotic noise. The principle of the proposed algorithm is analyzed and simulation results are presented, which show the effectiveness of the proposed method. We conclude that the study has potential applications in various fields, e.g., in secret communication for narrowband interference rejection or attenuation, and in radar signal processing for weak target detection and identification in sea clutter.
1 Introduction
Nonlinear dynamics is very important in describing many physical phenomena in practice [1]. In the field of radar signal processing, sea clutter can be modeled as chaotic noise. In communication systems, speech and indoor multi-path propagation have been demonstrated to be chaotic rather than purely random. In such applications as radar surveillance, secure communication, and narrowband interference cancellation, the chaotic signal is one kind of noise. Therefore, there is an enormous need to detect and extract useful signal parameters in chaotic noise. For example, chaotic modulation is used in secret communication, where impulse interference cancellation relies on the performance of frequency estimation in chaotic noise [2]. Modeling sea clutter using nonlinear dynamics (chaos) turns target velocity estimation into frequency estimation in chaotic noise.
*
Corresponding author.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 135–142, 2007. © Springer-Verlag Berlin Heidelberg 2007
136
H. Xie and X. Feng
To solve the problem of parameter estimation in chaotic noise, the minimum phase space volume (MPSV)-based algorithm, its improved version the genetic algorithm minimum phase space volume (GA-MPSV) [2][3][6], and least squares autoregressive (LS-AR) estimation [4] have been proposed. However, their performance is not satisfying. On the one hand, the MPSV-based algorithm is very complex because it involves inverse filter design followed by a global search and optimization procedure, although it accounts for the nature of chaotic noise and can achieve correct results. On the other hand, the LS-AR algorithm does not take the nature of chaotic noise into consideration, although its computational burden is small and it can work at relatively high signal-to-noise ratio (SNR). In this paper, we propose a neural network and power spectrum analysis based algorithm, which takes into account both the computational burden and numerical precision at the same time. Our motivation is that a neural network can fit the chaotic nonlinear dynamic function, since neural networks (NNs) have the ability to model nonlinear time series [5][6] globally. The principle of the proposed algorithm is analyzed and simulation results are presented; the results show the effectiveness of the proposed method. Time-delay chaotic reconstruction and power spectral density analysis are used to estimate the useful parameter. A systematic comparison of all three kinds of parameter estimation algorithms is also given in this paper. This paper is organized as follows: Section 2 describes the mathematical formulation and physical description of the problem to be solved. Section 3 presents the block diagram of the novel parameter estimation algorithm and some considerations concerning the selection of key factors. In Section 4, simulation experiments are designed and the results are presented and analyzed. In the last section, the discussion and conclusion are given.
2 Problem Formulation and Description
Generally speaking, the problem can be expressed as

x_t = s_t(θ₀) + n_t = Σ_{i=1}^{k} α_i sin(2π f_i t) + n_t,  t = 1, 2, …, N,    (1)

where θ₀ = [θ₁, …, θ_p] is the parameter vector to be estimated in the useful signal s_t(θ₀), and the additive noise n_t is chaotic noise. Here p is the dimension of the vector θ₀; in other words, p is the number of unknown parameters.
In a radar or sonar system, parameters like the DOA, moving velocity, and RCS are needed to describe the target exactly and to track it. By the Doppler effect, the moving velocity can be transformed into a frequency. Thus, for a real system, the task is to estimate certain frequencies in the signal, which can be written as Eq. (1).
RBF-NN Predictor for Parameter Estimation in Chaotic Noise
137
To solve the problem of parameter estimation in chaotic noise, what we need to do first is to model and predict the chaotic component correctly. Actually, from the point of view of signal processing, the modeling of a chaotic signal can be described as obtaining a proper state space from clean or noisy received signals. When one considers a discrete dynamical system whose state can be described by a set of physical variables, to simplify the problem one can assume that the observation data are acquired at discrete times, i.e., t = 1, 2, …. Then the dynamic rule can be converted into a mapping:

Y(t + 1) = ψ(Y(t)),    (2)

and each element of Y(t + 1) can be expressed as

y(t + 1) = ψ{y(t), y(t − 1), …, y(t − 2D)}.    (3)

In other words, each element's value y(t + 1) at time t + 1 can be obtained from the previous system values y(t), y(t − 1), …; i.e., Y(t + 1) can be obtained from Y(t), Y(t − 1), …. Therefore, the state of a dynamic system at time t + 1 can be formed from its former states. Basically, chaos is one kind of deterministic nonlinear system, and a chaotic signal can be predicted over a short period. Moreover, this local predictability is based on knowing the deterministic function ψ. Therefore, the aim is to construct a model that can reconstruct the mapping from the observations Y(t). According to Takens' delay embedding theorem, a compact manifold of dimension d can be reconstructed by a delay map of dimension at least m = 2d + 1. This indicates what must be considered when designing a delay-embedding reconstruction. To solve the problem of chaotic time series modeling, local methods using different m-order AR systems and global methods have been proposed. The main disadvantage of the former is that one needs to choose the size of the region, because this method will fail without proper region size selection. The global methods, which include polynomial modeling, radial basis functions, and feedforward neural networks, can overcome this disadvantage of the local methods [5].
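The delay reconstruction described above can be sketched as follows; the function name and toy signal are mine, not from the paper.

```python
import numpy as np

def delay_embed(y, m, lag=1):
    """Build delay vectors Y(t) = (y(t), y(t - lag), ..., y(t - (m-1)*lag)).

    By Takens' theorem, m >= 2*d + 1 suffices to reconstruct a
    d-dimensional attractor from a scalar observation sequence.
    """
    start = (m - 1) * lag
    return np.array([[y[t - j * lag] for j in range(m)]
                     for t in range(start, len(y))])

y = np.cos(0.2 * np.arange(50))     # stand-in for an observed scalar series
Y = delay_embed(y, m=3, lag=2)
# each row is one reconstructed state; rows are the inputs to a global
# predictor (e.g., an RBF network) of the next observation
```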
3 Depiction of the Proposed Scheme
The basic idea of the new scheme is based on the local predictability of chaotic noise. A brief description and implementation of our scheme is shown in Fig. 1. First, we use a neural network as a tool to reconstruct the nonlinear dynamic system of the chaotic component in the received signal. Then we subtract the reconstructed chaotic noise from the received signal to obtain the weak but useful remaining (error) signal, which mainly contains the information in which we are interested. After that, we perform power spectral density (PSD) analysis on the error signal and derive the parameter from the PSD results by traditional methods.
Fig. 1. Block diagram of the proposed scheme: the received signal r(t) = s(t) + n(t) is fed to a block that reconstructs the nonlinear function of the chaotic component, yielding r̂(t); the error between r(t) and r̂(t) is then passed to PSD analysis

Assumption H1. Each function gi(·) satisfies γi = inf_{x∈R} g′i(x) > 0 for i = 1, · · · , n.
Assumption H2. Each activation function f_i(·) is bounded and satisfies the Lipschitz condition with a Lipschitz constant L_i > 0, i.e., |f_i(x) − f_i(y)| ≤ L_i|x − y| for all x, y ∈ R.
The error dynamics between (1) and (2) can be expressed by

ė_i(t) = −G_i(e_i(t)) + Σ_{j=1}^{n} a_{ij} F_j(e_j(t)) + Σ_{j=1}^{n} b_{ij} F_j(e_j(t − τ)) − u_i,  i = 1, 2, …, n,    (3)

where G_i(e_i(t)) = g_i(x_i(t)) − g_i(y_i(t)) and F_i(e_i(t)) = f_i(x_i(t)) − f_i(y_i(t)). Model (3) can be rewritten in the matrix form

ė(t) = −G(e(t)) + A F(e(t)) + B F(e(t − τ)) − u(t),    (4)
where G(e(t)) = (G₁(e₁(t)), G₂(e₂(t)), …, G_n(e_n(t)))^T, F(e(t)) = (F₁(e₁(t)), F₂(e₂(t)), …, F_n(e_n(t)))^T, and u(t) = (u₁(t), u₂(t), …, u_n(t))^T.

Definition 1. System (1) and the uncontrolled system (2) (i.e., u ≡ 0 in (3)) are said to be exponentially synchronized if there exist constants M ≥ 1 and λ > 0 such that

‖x(t) − y(t)‖ ≤ M sup_{t₀−τ≤s≤t₀} ‖ϕ(s) − φ(s)‖ e^{−λ(t−t₀)},  t ≥ t₀,
where ϕ(s) = (ϕ₁(s), ϕ₂(s), …, ϕ_n(s))^T and φ(s) = (φ₁(s), φ₂(s), …, φ_n(s))^T. Moreover, the constant λ is called the exponential synchronization rate.

Lemma 1 (Halanay inequality) [11]. Let τ > 0 and let x(t) be a nonnegative continuous scalar function defined for t ≥ t₀ − τ which satisfies D⁺x(t) ≤ −r₁ x(t) + r₂ x̃(t) for t ≥ t₀, where x̃(t) = sup_{t−τ≤s≤t}{x(s)} and r₁, r₂ are constants. If r₁ > r₂ > 0, then

x(t) ≤ x̃(t₀) e^{−λ(t−t₀)},  t ≥ t₀,

where λ is the unique positive root of the equation λ = r₁ − r₂ e^{λτ}.

This Letter aims to determine the decentralized state-feedback control input u_i(t) for the purpose of exponentially synchronizing unidirectionally coupled identical chaotic neural networks with the same system parameters but different initial conditions.
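The rate λ in Lemma 1 is defined implicitly by λ = r₁ − r₂e^{λτ}. Since r₁ − r₂e^{λτ} − λ is strictly decreasing in λ, the root can be found by simple bisection; a sketch with hypothetical r₁, r₂ values (mine, not from the Letter):

```python
import math

def halanay_rate(r1, r2, tau, tol=1e-10):
    """Unique positive root of lambda = r1 - r2*exp(lambda*tau),
    which exists when r1 > r2 > 0 and tau > 0."""
    assert r1 > r2 > 0
    f = lambda lam: r1 - r2 * math.exp(lam * tau) - lam
    # f(0) = r1 - r2 > 0 and f(r1 - r2) = r2*(1 - exp((r1 - r2)*tau)) < 0,
    # so the root is bracketed by [0, r1 - r2]
    lo, hi = 0.0, r1 - r2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

lam = halanay_rate(r1=3.0, r2=1.0, tau=1.0)
# lam is the guaranteed exponential decay rate of x(t) in Lemma 1
```

A shorter delay τ yields a larger decay rate, as the second call in the check below illustrates.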
3 Main Results
Theorem 1. For the drive-response structure of chaotic neural networks (1) and (2) satisfying assumptions (H1) and (H2), suppose the control input u_i(t) in (3) is designed as

u_i(t) = η_i e_i(t),  i = 1, 2, …, n,    (5)
146
J. Jian, B. Wang, and X. Liao
where the constants η_i are chosen such that the matrix Ã = (ã_{ij})_{n×n} is negative definite and d_m λ_M(Ã) > d_M λ_M(B̃) for some positive diagonal matrix D = diag(d₁, d₂, …, d_n) > 0, with d_m = min_{1≤i≤n}{d_i} and d_M = max_{1≤i≤n}{d_i}. Here

ã_{ij} = −d_i(γ_i + η_i) + d_i L_i|a_{ii}| for i = j, and ã_{ij} = (d_i L_j|a_{ij}| + d_j L_i|a_{ji}|)/2 for i ≠ j (i, j = 1, 2, …, n),

B̃ = [ 0          D|B|L ]
    [ L|B|^T D   0     ]

is a 2n × 2n matrix with |B| = (|b_{ij}|)_{n×n} and L = diag(L₁, L₂, …, L_n), λ_M(B̃) is the maximum eigenvalue of B̃, and λ_M(Ã) denotes the negative of the maximum eigenvalue of Ã (so λ_M(Ã) > 0 since Ã is negative definite). Then system (3) is globally exponentially stable, i.e., global exponential synchronization of systems (1) and (2) is obtained with synchronization rate λ/2, where λ is the unique positive root of the equation

λ = 2λ_M(Ã)/d_M − λ_M(B̃)/d_m − (λ_M(B̃)/d_m) e^{λτ}.
Proof. To confirm that the origin of (3) or (4) is globally exponentially stable, consider the continuous Lyapunov function V(t) defined as

V(t) = (1/2) Σ_{i=1}^{n} d_i e_i²(t).    (6)

Then for all e ∈ R^n, the inequality

(1/2) d_m ‖e(t)‖₂² ≤ V(t) ≤ (1/2) d_M ‖e(t)‖₂²    (7)

holds.
Subsequently, with condition (5) and assumptions (H1) and (H2), evaluating the time derivative of V(t) along the trajectory of (3) gives

V̇(t) = −Σ_{i=1}^{n} d_i e_i(t) G_i(e_i(t)) + Σ_{i=1}^{n} Σ_{j=1}^{n} d_i a_{ij} e_i(t) F_j(e_j(t)) + Σ_{i=1}^{n} Σ_{j=1}^{n} d_i b_{ij} e_i(t) F_j(e_j(t − τ)) − Σ_{i=1}^{n} d_i η_i e_i²(t)
  ≤ Σ_{j=1}^{n} [(−d_j(γ_j + η_j) + d_j L_j|a_{jj}|) e_j²(t) + Σ_{i≠j} d_i L_j|a_{ij}| |e_i(t) e_j(t)|] + Σ_{j=1}^{n} Σ_{i=1}^{n} d_i L_j|b_{ij}| |e_i(t) e_j(t − τ)|
  ≤ |e(t)|^T Ã |e(t)| + (1/2) [|e(t)|; |ẽ(t)|]^T B̃ [|e(t)|; |ẽ(t)|]
  ≤ (−λ_M(Ã) + (1/2) λ_M(B̃)) e^T(t) e(t) + (1/2) λ_M(B̃) ẽ^T(t) ẽ(t)
  ≤ −r₁ V(t) + r₂ Ṽ(t),

where

r₁ = 2λ_M(Ã)/d_M − λ_M(B̃)/d_m,  r₂ = λ_M(B̃)/d_m,  Ṽ(t) = sup_{t−τ≤s≤t} V(s) = (1/2) sup_{t−τ≤s≤t} Σ_{i=1}^{n} d_i e_i²(s).
Global Exponential Synchronization of Chaotic Neural Networks
147
By Lemma 1, we obtain

V(t) ≤ Ṽ(t₀) exp(−λ(t − t₀)),  t ≥ t₀,    (8)

and combining (7) and (8) we have

‖e(t)‖ ≤ √(d_M/d_m) · sup_{t₀−τ≤s≤t₀} ‖ϕ(s) − φ(s)‖ · exp(−(λ/2)(t − t₀)),  t ≥ t₀.

Therefore, system (3) is globally exponentially stable, i.e., under the control input vector (5), every trajectory y_i(t) of system (2) synchronizes exponentially with the corresponding variable x_i(t) of neural network (1). The proof is completed.

Corollary 1. If the drive-response chaotic neural networks (1) and (2) satisfy assumptions (H1) and (H2), and the control input u_i(t) in (3) is given by (5) such that r₁ + r₂ < 0, then the exponential synchronization of systems (1) and (2) is obtained with synchronization rate λ/2, where λ is the unique positive root of the equation λ = −r₁ − r₂ exp(λτ) with

r₁ = max_{1≤j≤n} {−2(γ_j + η_j) + Σ_{i=1}^{n} [L_j|a_{ij}| + L_i(|a_{ji}| + |b_{ji}|)]},  r₂ = max_{1≤j≤n} {Σ_{i=1}^{n} L_j|b_{ij}|}.

Corollary 2. If the drive-response chaotic neural networks (1) and (2) satisfy assumptions (H1) and (H2), and the control input u_i(t) in (3) is given by (5) such that r₁ + r₂ + r₃ < 0, then the exponential synchronization of systems (1) and (2) is obtained with synchronization rate λ/2, where λ is the unique positive root of the equation λ = −r₁ − r₂ − r₃ exp(λτ) with

r₁ = max_{1≤j≤n} {−2(γ_j + η_j) + Σ_{i=1}^{n} (L_j|a_{ij}| + L_i|a_{ji}|)},  r₂ = max_{1≤j≤n} {Σ_{i=1}^{n} L_i|b_{ji}|},  r₃ = max_{1≤j≤n} {L_j Σ_{i=1}^{n} |b_{ij}|}.
4 An Illustrative Example
It has been demonstrated that if the system matrices A and B as well as the delay parameter τ are suitably specified, system (1) may display chaotic behavior [3,6,7]. The exponential synchronization condition for system (1) with delays is demonstrated by the following example.

Example. Consider a delayed Hopfield neural network (HNN) with two neurons [6]:

[ẋ₁]   [−x₁(t)]   [ 2   −0.1] [f₁(x₁(t))]   [−1.5  −0.1] [f₁(x₁(t − τ))]
[ẋ₂] = [−x₂(t)] + [−5     3 ] [f₂(x₂(t))] + [−0.2  −2.5] [f₂(x₂(t − τ))],    (9)
Fig. 1. The chaotic behavior of system (9) with the initial condition x(s) = [0.4, 0.6]^T, −1 ≤ s ≤ 0, in which the x label denotes the state x₁(t) and the y label denotes the state x₂(t)
Fig. 2. The synchronization errors e₁(t) and e₂(t) with the initial condition e(s) = [−1, 1]^T, −1 ≤ s ≤ 0, between system (9) and system (10), in which the dashed line depicts the trajectory of the error state e₁(t) and the solid line depicts the trajectory of the error state e₂(t)
where g_i(x_i) = x_i, τ = 1, and f_i(x_i) = tanh(x_i) for i = 1, 2. The system satisfies assumptions (H1) and (H2) with L₁ = L₂ = 1 and γ₁ = γ₂ = 1. It should be noted that system (9) is actually a chaotic delayed Hopfield neural network
with the initial condition (x₁(s), x₂(s))^T = (0.4, 0.6)^T for −1 ≤ s ≤ 0 (see [3,6]). The response chaotic Hopfield neural network with delays is designed as

[ẏ₁]   [−y₁(t)]   [ 2   −0.1] [f₁(y₁(t))]   [−1.5  −0.1] [f₁(y₁(t − τ))]   [u₁(t)]
[ẏ₂] = [−y₂(t)] + [−5     3 ] [f₂(y₂(t))] + [−0.2  −2.5] [f₂(y₂(t − τ))] + [u₂(t)].    (10)

The control input vectors are designed as u₁(t) = η₁e₁(t) and u₂(t) = η₂e₂(t). Let d₁ = d₂ = 1; then

Ã = [1 − η₁   2.55  ]        [ 0    0   1.5  0.1]
    [2.55     1 − η₂]   and  B̃ = [ 0    0   0.2  2.5]
                                 [1.5  0.2   0    0 ]
                                 [0.1  2.5   0    0 ]

with λ_M(B̃) = 2.52, where η₁ and η₂ can be chosen to ensure that Ã is negative definite and λ_M(Ã) > λ_M(B̃). If we let η₁ = η ≥ 7 and η₂ = η + 1, then λ_M(Ã) ≥ η − 3.55 ≥ 3.45 > λ_M(B̃) = 2.52. From Theorem 1, the exponential synchronization of systems (9) and (10) is obtained with synchronization rate λ/2, where λ is the unique positive root of the equation λ = 2(η − 3.55) − 2.52 − 2.52 exp(λτ). For instance, for η = 7 and η = 10, the exponential synchronization rates of (9) and (10) are at least λ/2 = 0.225 and λ/2 = 0.64, respectively. Figure 1 depicts the chaotic behavior of system (9) with the initial condition x(s) = [0.4, 0.6]^T, −1 ≤ s ≤ 0. Figure 2 depicts the synchronization errors e₁(t) and e₂(t) between the drive system (9) and the response system (10).
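The eigenvalue conditions of the example can be checked numerically. The sketch below is mine, not from the Letter; it uses the Letter's convention that λ_M(Ã) denotes the negated largest eigenvalue of Ã, so the condition becomes −max eig(Ã) > max eig(B̃).

```python
import numpy as np

eta = 7.0                                   # eta_1 = eta, eta_2 = eta + 1
A_tilde = np.array([[1 - eta, 2.55],
                    [2.55, -eta]])          # entries 1 - eta_1 and 1 - eta_2
absB = np.array([[1.5, 0.1],
                 [0.2, 2.5]])               # |B| for systems (9) and (10)
# B_tilde = [[0, |B|], [|B|^T, 0]], since D = L = I in this example
B_tilde = np.block([[np.zeros((2, 2)), absB],
                    [absB.T, np.zeros((2, 2))]])

max_eig_A = np.linalg.eigvalsh(A_tilde).max()
lam_M_B = np.linalg.eigvalsh(B_tilde).max()
assert max_eig_A < 0                        # A_tilde is negative definite
assert -max_eig_A > lam_M_B                 # synchronization condition holds
# lam_M_B is about 2.52, matching the value used in the text
```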
5 Conclusion
This Letter has proposed a decentralized control scheme to guarantee global exponential synchronization for a class of neural networks with time delays, including Hopfield neural networks and cellular neural networks. By constructing suitable controllers and using the Halanay inequality lemma, delay-independent criteria have been derived to ensure the global exponential synchronization of delayed chaotic neural networks. Furthermore, the synchronization degree can be easily estimated. Finally, a numerical example has been given to verify the correctness of our results.
Acknowledgments. This work was partially supported by the National Natural Science Foundation of China (60474011, 60574025), the Scientific Research Projects of Hubei Provincial Department of Education (D200613002), and the Doctoral Pre-Research Foundation of China Three Gorges University.
References 1. Pecora, L.M., Carroll, T.L.: Synchronization in Chaotic Systems. Phys Rev Lett 64 (8) (1990) 821-824 2. Carroll, T.L., Pecora, L.M.: Synchronization Chaotic Circuits. IEEE Trans Circ Syst 38 (4) (1991) 453-456 3. Cheng, C.J., Liao, T.L., Yan, J.J., Hwang, C.C.: Synchronization of Neural Networks by Decentralized Feedback Control. Physics Letters A 338 (2005) 28-35 4. Wang, Z.S., Zhang, H.G, Wang, Z.L.: Global Asymptotic Synchronization of a Class of Delayed Chaotic Neural Networks. Journal of Northeastern University (Natural Science) 27 (6) (2006) 598-601 5. Wang, Z.S., Zhang, H.G, Wang, Z.L.: Global Synchronization of a Class of Chaotic Neural Networks. Acta Physica Sinica 55 (6) (2006) 2687-2693 6. Cheng, C.J., Liao, T.L., Hwang, C.C.: Exponential Synchronization of a Class of Chaotic Neural Networks. Chaos, Solitons & Fractals 24 (2005) 197-206 7. Li, C., Chen, G.: Synchronization in General Complex Dynamical Networks with Coupling Delays. Physica A 343 (2004) 263-278 8. Liao, X.X., Chen, G.R., Wang, H.O.: On Global Synchronization of Chaotic Systems. Dynamics of Continuous, Discrete Impulsive Syst 10 (2003) 865-872 9. Cao, J., Li, P., Wang, W.: Global Synchronization in Arrays of Delayed Neural Networks with Constant and Delayed Coupling. Physics Letters A 353 (2006) 318-325 10. Jian, J.G., Kong, D.M., Luo, H.G., Liao, X.X.: Exponential Stability of Differential Systems with Separated Variables and Time Delays. J. Center South University (Science and Technology) 36 (2) (2005) 282-287 11. Liao, X.X., Xiao, D.M.: Globally Exponential Stability of Hopfield Neural Networks with Time-Varying Delays. Acta Electronica Sinica 28 (4) (2000) 87-90 12. Zhang, J.Y.: Globally Exponential Stability of Neural Networks with Variable Delays. IEEE Transactions on Circuits and Systems -I: Fundamental Theory and Applications 50 (2) (2003) 288-291
A Fuzzy Neural Network Based on Back-Propagation

Huang Jin1,2, Gan Quan1, and Cai Linhui1

1 NanJing Artillery Academy, NanJing 211132
2 NanJing University of Science and Technology, NanJing 210094
[email protected]

Abstract. Several fuzzy neural network algorithms have been put forward whose weights are restricted to special fuzzy numbers. This paper proposes the concept of the strong L-R type fuzzy number and derives a learning algorithm, based on the BP algorithm, via the level sets of strong L-R type fuzzy numbers. The restriction to special fuzzy numbers is thereby weakened to a more general case, which enlarges the range of application.
1 Introduction

Several fuzzy neural network models have been put forward in recent years [1], [2]. One approach to direct fuzzification is to transform real inputs and real targets into fuzzy numbers. Ishibuchi et al. proposed a neural network for fuzzy input vectors in which the connection weights are fuzzified; Hayashi also fuzzified the delta rule, while Ishibuchi et al. derived a crisp learning algorithm for triangular fuzzy weights. All of these approaches share the same restriction: the fuzzy weights must be symmetrical triangular fuzzy numbers, which confines the range of application within narrow limits. In this paper, we first put forward a fuzzy neural network whose input-output relation is defined by the extension principle of Zadeh [3]; the relation is numerically calculated by interval arithmetic via the level sets (i.e., α-cuts) of the fuzzy weights and fuzzy inputs. Next, we define the strong L-R type fuzzy number and show its good properties under interval arithmetic. Defining a cost function on the level sets of the fuzzy outputs and fuzzy targets, we propose a learning algorithm that adjusts the three parameters of each strong L-R type fuzzy weight. Lastly, we examine the ability of the proposed fuzzy neural network to implement fuzzy if-then rules.
2 Fuzzy Neural Network Algorithms

In fuzzy neural networks based on BP, neurons are organized into a number of different layers and signals flow in one direction; there are no interactions or feedback loops among the neurons of the same layer. Fig. 1 shows this fuzzy neural network model.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 151–159, 2007. © Springer-Verlag Berlin Heidelberg 2007
According to the type of inputs and weights, we define three different kinds of fuzzy neural networks: (I) crisp weights and fuzzy inputs; (II) fuzzy weights
and crisp inputs; (III) fuzzy weights and fuzzy inputs. This paper deals with type (III) fuzzy feedforward neural networks. In this model, the connections between the layers are represented by a matrix of fuzzy weights W_{ji}, where W_{ji} is the fuzzy weight of the connection between the ith neuron of the input layer and the jth neuron of the hidden layer. The total fuzzy input of the jth neuron in the second layer is defined as:
Net_{pj} = ∑_{i=1}^{N_I} W_{ji} · O_{pi} + Θ_j , (1)
where Net_{pj} is the total fuzzy input of the jth neuron of the hidden layer, O_{pi} = X_{pi} is the ith fuzzy input of that neuron, and Θ_j is the fuzzy bias of the jth neuron. The fuzzy output of the jth neuron is defined with the transfer function f(Net) = 1/(1 + exp(−Net)):

O_{pj} = f(Net_{pj}), j = 1, 2, …, N_H . (2)
Furthermore, the fuzzy output of the kth neuron of the output layer is defined as follows:

Net_{pk} = ∑_{j=1}^{N_H} W_{kj} · O_{pj} + Θ_k , (3)

O_{pk} = f(Net_{pk}). (4)
The input-output relation in (1)-(4) can be defined by the extension principle [3]. The fuzzy outputs are numerically calculated for the level sets (i.e., α-cuts) of the fuzzy inputs, fuzzy weights, and fuzzy biases. Next, we need a type of fuzzy number to represent the fuzzy inputs, weights, and biases whose properties make it easy to handle with interval arithmetic. Furthermore, let (X_p, T_p) be a fuzzy input-output pair, where T_p = (T_{p1}, T_{p2}, …, T_{pN_O}) is an N_O-dimensional fuzzy target vector corresponding to the fuzzy input vector X_p. The cost function for the input-output pair (X_p, T_p) is obtained as:
e_p = ∑_h e_{ph} . (5)
The cost function for the h-level sets of the fuzzy output vector O_p and the fuzzy target vector T_p is defined as:

e_{ph} = ∑_{k=1}^{N_O} e_{pkh} , (6)
where

e_{pkh} = e^L_{pkh} + e^U_{pkh} , (7)

e^L_{pkh} = h · ([T_{pk}]^L_h − [O_{pk}]^L_h)^2 / 2 , (8)

e^U_{pkh} = h · ([T_{pk}]^U_h − [O_{pk}]^U_h)^2 / 2 . (9)
In the next section, we introduce the strong L-R type fuzzy number and put forward an FNN algorithm based on BP.
3 Strong L-R Representation of Fuzzy Numbers

Definition 1. A function, usually denoted L or R, is a reference function of fuzzy numbers if: 1. L(0) = 1; 2. L(x) = L(−x); 3. L is nonincreasing on [0, +∞).
Definition 2. A fuzzy number M is said to be an L-R type fuzzy number if

μ_M(x) = L((m − x)/a) for x ≤ m (a > 0), and μ_M(x) = R((x − m)/b) for x ≥ m (b > 0), (10)

where L is the left and R the right reference function, m is the mean value of M, and a and b are the left and right spreads. Symbolically, we write M = (m, a, b)_{LR}.
Definition 3. A fuzzy number M is said to be a strong L-R type fuzzy number if L(1) = R(1) = 0.
This kind of fuzzy number has the following properties:
1. The α-cuts of every fuzzy number are closed intervals of real numbers.
2. Fuzzy numbers are convex fuzzy sets.
3. Let (m − x)/a = 1, i.e., x = m − a; then L((m − x)/a) = L(1) = 0, and similarly R((x − m)/b) = R(1) = 0 at x = m + b. Hence the support of every strong L-R type fuzzy number is the bounded interval (m − a, m + b) of real numbers.
These properties are essential for defining meaningful arithmetic operations on fuzzy numbers. Since each fuzzy set is uniquely represented by its α-cuts, and these are closed intervals of real numbers, arithmetic operations on fuzzy numbers can be defined in terms of arithmetic operations on closed intervals, the cornerstone of interval analysis, which is a well-established area of classical mathematics. We apply them in the next section to define arithmetic operations on fuzzy numbers. The strong L-R type is an important class of fuzzy numbers; the triangular fuzzy number (T.F.N.) is a special case of it. Any strong L-R type fuzzy number can be written symbolically as M = (α, β, γ)_{LR*}; in other words, a strong L-R type fuzzy number is uniquely represented by three parameters: the left endpoint α and right endpoint γ of its support and its mean β. Accordingly, we can adjust the three parameters of each strong L-R type fuzzy weight and fuzzy bias:
W_{kj} = (w^α_{kj}, w^β_{kj}, w^γ_{kj})_{LR*} , W_{ji} = (w^α_{ji}, w^β_{ji}, w^γ_{ji})_{LR*} , Θ_k = (θ^α_k, θ^β_k, θ^γ_k)_{LR*} , Θ_j = (θ^α_j, θ^β_j, θ^γ_j)_{LR*} .

What is more, let

c_{kj} = (w^γ_{kj} − w^β_{kj}) / (w^β_{kj} − w^α_{kj}) , c_{ji} = (w^γ_{ji} − w^β_{ji}) / (w^β_{ji} − w^α_{ji}) , c_k = (θ^γ_k − θ^β_k) / (θ^β_k − θ^α_k) , c_j = (θ^γ_j − θ^β_j) / (θ^β_j − θ^α_j) ;

then

w^β_{kj} = (w^γ_{kj} + c_{kj} · w^α_{kj}) / (1 + c_{kj}) ,

and w^β_{ji}, θ^β_k, θ^β_j have the same form as w^β_{kj}.
We now discuss how to learn the strong L-R type fuzzy weight W_{kj} = (w^α_{kj}, w^β_{kj}, w^γ_{kj})_{LR*} between the jth hidden unit and the kth output unit. Following Rumelhart's method, the adjustment of each parameter is computed from the cost function:
Δw^α_{kj}(t) = −η ∂e_{ph}/∂w^α_{kj} + ξ · Δw^α_{kj}(t − 1), (11)

Δw^γ_{kj}(t) = −η ∂e_{ph}/∂w^γ_{kj} + ξ · Δw^γ_{kj}(t − 1). (12)
The derivatives above can be written by the chain rule as follows:

∂e_{ph}/∂w^α_{kj} = (∂e_{ph}/∂[w_{kj}]^α_h) · (∂[w_{kj}]^α_h/∂w^α_{kj}) + (∂e_{ph}/∂[w_{kj}]^γ_h) · (∂[w_{kj}]^γ_h/∂w^α_{kj}), (13)

∂e_{ph}/∂w^γ_{kj} = (∂e_{ph}/∂[w_{kj}]^α_h) · (∂[w_{kj}]^α_h/∂w^γ_{kj}) + (∂e_{ph}/∂[w_{kj}]^γ_h) · (∂[w_{kj}]^γ_h/∂w^γ_{kj}). (14)
Since W_{kj} is a strong L-R type fuzzy number, its h-level and 0-level parameters are related as follows:

[w_{kj}]^α_h = (w^γ_{kj} + c_{kj} w^α_{kj}) / (1 + c_{kj}) − (w^γ_{kj} − w^α_{kj}) / (1 + c_{kj}) · L^{−1}(h), (15)

[w_{kj}]^γ_h = (w^γ_{kj} + c_{kj} w^α_{kj}) / (1 + c_{kj}) + c_{kj} (w^γ_{kj} − w^α_{kj}) / (1 + c_{kj}) · R^{−1}(h). (16)
Therefore,

∂e_{ph}/∂w^α_{kj} = (∂e_{ph}/∂[w_{kj}]^α_h) · [c_{kj}/(1 + c_{kj}) + L^{−1}(h)/(1 + c_{kj})] + (∂e_{ph}/∂[w_{kj}]^γ_h) · [c_{kj}/(1 + c_{kj}) − c_{kj} R^{−1}(h)/(1 + c_{kj})], (17)

∂e_{ph}/∂w^γ_{kj} = (∂e_{ph}/∂[w_{kj}]^α_h) · [1/(1 + c_{kj}) − L^{−1}(h)/(1 + c_{kj})] + (∂e_{ph}/∂[w_{kj}]^γ_h) · [1/(1 + c_{kj}) + c_{kj} R^{−1}(h)/(1 + c_{kj})]. (18)
These relations explain how the error signals ∂e_{ph}/∂[w_{kj}]^α_h and ∂e_{ph}/∂[w_{kj}]^γ_h for the h-level set propagate to the 0-level parameters of the strong L-R type fuzzy weight W_{kj}; the fuzzy weight is then updated by the following rules:
w^α_{kj}(t + 1) = w^α_{kj}(t) + Δw^α_{kj}(t), (19)

w^γ_{kj}(t + 1) = w^γ_{kj}(t) + Δw^γ_{kj}(t). (20)
We assume that n values of h (i.e., h_1, h_2, …, h_n) are used for learning the fuzzy neural network. The learning algorithm of the fuzzy neural network can then be defined as follows:
1: Initialize the fuzzy weights and the fuzzy biases.
2: Repeat step 3 for h = h_1, h_2, …, h_n.
3: Repeat the following procedures for p = 1, 2, …, m (the m input-output pairs (X_p, T_p)):
Forward calculation: calculate the h-level set of the fuzzy output vector O_p corresponding to the fuzzy input vector X_p.
Back-propagation: adjust the fuzzy weights and the fuzzy biases using the cost function e_{ph}.
4: If a pre-specified stopping condition (e.g., the total number of iterations) is not satisfied, go to step 2.
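The control flow of steps 1-4 can be sketched as the following loop skeleton (Python; `forward` and `backprop` are hypothetical callables standing in for the level-set forward pass and the parameter adjustment described above):

```python
def train(pairs, h_levels, forward, backprop, epochs=1):
    """Outer loop of the learning algorithm: for each h-level and each
    input-output pair, one forward pass then one back-propagation step."""
    steps = 0
    for _ in range(epochs):           # step 4: repeat until the stop condition
        for h in h_levels:            # step 2: loop over h = h1, ..., hn
            for Xp, Tp in pairs:      # step 3: loop over the m pairs (Xp, Tp)
                Op = forward(Xp, h)   # forward: h-level set of the output
                backprop(Op, Tp, h)   # backward: adjust weights via e_ph
                steps += 1
    return steps
```

In practice the stopping condition of step 4 would inspect the cost rather than a fixed epoch count; the fixed count keeps the sketch minimal.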
4 Simulation

We consider an n-dimensional fuzzy classification problem. It can be described by If-Then rules of the form: If x_{p1} is A_{p1} and … and x_{pn} is A_{pn}, then x_p = (x_{p1}, …, x_{pn}) belongs to G_p, where p = 1, 2, …, k and A_{pi} is a linguistic term such as "large" or "small". For convenience of computation, we assume that A_{pi} is a symmetrical strong L-R type fuzzy number, that is, L = R = max(0, 1 − |x|^2). We solve the above problem using the proposed fuzzy neural network. We denote the fuzzy input as A_p = (A_{p1}, A_{p2}, …, A_{pn}), and the target output T_p is defined as follows:
T_p = 1 if A_p ∈ Class 1, and T_p = 0 if A_p ∈ Class 2. (21)
According to the target output T and the actual output O, we define the error function:

e_{ph} = max{ (t_p − o_p)^2 / 2 : o_p ∈ [Y_p]_h } . (22)
We train this network so as to minimize e_{ph}. It is easy to see that the error function reduces to the classical BP error function

e = ∑_{p=1}^{k} (t_p − o_p)^2 / 2

when the input vectors A_p and the outputs Y_p are real numbers. We train the fuzzy neural network with the h-level sets (h = 0.2, 0.4, 0.6, 0.8); the error function of the pair is:

e_p = ∑_h h · max{ (t_p − o_p)^2 / 2 : o_p ∈ [Y_p]_h } . (23)
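Since [Y_p]_h is an interval, the inner maximum in eq (23) is attained at whichever endpoint lies farther from the target. A brief sketch (the dictionary-of-levels representation is an assumption of this illustration, not the paper's notation):

```python
def pair_error(t, level_sets):
    """e_p = sum_h h * max{(t - o)^2 / 2 : o in [Y_p]_h}, eq (23).
    level_sets maps each h to the interval (lo, hi) of the fuzzy output."""
    total = 0.0
    for h, (lo, hi) in level_sets.items():
        # the max of (t - o)^2 over [lo, hi] is at the endpoint farther from t
        worst = max((t - lo) ** 2, (t - hi) ** 2) / 2.0
        total += h * worst
    return total

e = pair_error(1.0, {0.2: (0.0, 0.9), 0.4: (0.2, 0.8)})
```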
In this way, we can handle the fuzzy classification problem using the model of Section 3, where the input vectors and the weights are symmetrical strong L-R type fuzzy numbers.
5 Example

As an example, we measure the height of a city wall. There are forty-five feature points on the wall: twenty-three regularly chosen feature points are taken as training samples, and the other twenty-two points are taken as testing samples. The BP algorithm applies the classical iterative error method; Table 1 shows the practical parameters.

Table 1. The parameters of the neural network system

Sample  Work  nInput  nHidden  nOutput  Eita  Alfa  Error  StepE  Trans Min-Max
23      45    3       15       1        1.2   0.5   0.3    6      0.2 - 0.8
5.1 The Result of the BP Network Calculation

Six-grade iteration is adopted; when the error falls below the given ε = 0.003, the loop ends. Table 2 shows the known height y0, the simulated height y, and the difference between the two heights at each observation point. Comparing the actual output with the measured value, the maximum error is 0.99 m and the minimum error is 0.01 m. The fitting result is not fully satisfactory because: (1) the city-wall feature points are linearly distributed, so this method is greatly limited in describing the spatial information; (2) the chosen extent of the city wall is broad, the height changes constantly, and the changing rule is hard to describe.
Table 2. The result of simulating the wall height at each feature point: the known height y0 and the simulated height y, for the training samples and the testing samples
5.2 Using the BP Algorithm to Interpolate in Segments

In this example, the broad and constantly changing extent of the city wall can be divided into relatively small segments, and the changing rule of each segment can then be found with the same neural network model. Taking points No. 0-9 as an example, five points are used as training samples and the other five as testing samples. Table 3 shows the result of the simulation. From these results, the maximum error of this interpolation is 9 cm and the minimum error is 0.8 cm, so the measuring accuracy satisfies the requirements of this project.
Table 3. The result of simulation in segments

point   y0         y          dy
0       15.32755   15.21728    .060271
2       13.85258   13.96085   -.088274
4       13.09941   13.06753    .031876
6       13.12414   13.02487    .089266
8       13.33703   13.24328    .093745
1       14.93377   14.91906    .00807
3       13.29222   13.34892   -.0567
5       13.0759    12.94139    .094508
7       13.26844   13.19278    .075663
9       13.21131   13.09183    .08948
6 Conclusion

In this paper, we proposed a fuzzy neural network architecture with strong L-R type fuzzy numbers and defined the corresponding learning algorithm. Since the strong L-R type fuzzy number is more general than the triangular fuzzy number, the proposed fuzzy network can be considered an extension of the earlier work.
References
[1] L, M., Quan, T.F., Luan, S.H.: An Attribute Recognition System Based on Rough Set Theory-Fuzzy Neural Network and Fuzzy Expert System. Fifth World Congress on Intelligent Control and Automation (WCICA) (2004) 2355-2359
[2] W, S.Q., L, Z.H., X, Z.H., Zhang, Z.P.: Application of GA-FNN Hybrid Control System for Hydroelectric Generating Units. Proc. Int. Conf. Machine Learning and Cybernetics 2 (2005) 840-845
[3] Dubois, D., Prade, H.: Fuzzy Sets and Systems: Theory and Applications. Academic Press, New York (1982)
[4] Feng, L., Liu, Z.Y.: Genetic Algorithms and Rough Fuzzy Neural Network-based Hybrid Approach for Short-term Load Forecasting. IEEE Power Engineering Society General Meeting (2006) 1-6
State Space Partition for Reinforcement Learning Based on Fuzzy Min-Max Neural Network

Yong Duan1, Baoxia Cui1, and Xinhe Xu2

1 School of Information Science & Engineering, Shenyang University of Technology, Shenyang 110023, China
2 Institute of AI and Robotics, Northeastern University, Shenyang 110004, China
[email protected]

Abstract. In this paper, a tabular reinforcement learning (RL) method, named FMM-RL, is proposed based on an improved fuzzy min-max (FMM) neural network. The FMM neural network is used to segment the state space of the RL problem, which alleviates the "curse of dimensionality" of RL and evidently improves the speed of convergence. The hyperboxes of the FMM serve as the regions of the state space, and the minimal and maximal points of each hyperbox define the partition boundaries. During the training of the FMM neural network, the state space is partitioned via operations on the hyperboxes, so a favorable generalization over the state space can be obtained. Finally, the method is applied to learning behaviors for a reactive robot. The experiments show that the algorithm can effectively solve the problem of navigation in a complicated unknown environment.
1 Introduction

Reinforcement learning (RL) requires the agent to obtain a mapping from states to actions, with the aim of maximizing the accumulated future reinforcement signals (rewards) received from the environment. In applications, the state and action spaces of RL are often large, so the search space for training becomes overly large and the agent can hardly visit every state-action pair. To cope with this problem, generalization approaches are used to approximate or quantize the state space, which reduces the complexity of the search space. Several quantization methods have been proposed, such as BOX [1, 2], whose basic idea is to quantize the state space of the RL problem into nonoverlapping regions, each called a BOX. Moore [3] proposed the Parti-game algorithm, which partitions the state space using a k-d tree; several improved Parti-game algorithms have since been studied [4, 5]. Murao and Kitamura put forward an approach called QLASS [6], in which the state space of an RL problem is constructed as a Voronoi diagram. Furthermore, Ivan S.K. Lee and Henry Y.K. Lau presented an on-line state space partition algorithm [7].
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 160–169, 2007. © Springer-Verlag Berlin Heidelberg 2007
Therefore, how to make the agent adaptively partition the state space according to the characteristics of the environment and the learning tasks has become a key issue of RL. In
order to solve the state-space problem effectively, the improved fuzzy min-max (FMM) neural network is applied in this paper to quantize the state space of RL. The hyperboxes of the FMM serve as the partition regions of the state space. By tuning the min and max points and the related parameters, the hyperboxes can adaptively reflect the distribution characteristics of the state space, so the quantization distortion can be decreased effectively. The hyperboxes of the FMM constitute a tabular RL, which helps to implement the exploration scheme and increases the learning speed. The FMM neural network [8-10] can be viewed as an online classifier based on hyperbox fuzzy sets. Each hyperbox represents one cluster, and the min-max points define the cluster boundaries. This clustering approach is soft: an input training pattern does not definitely belong to some hyperbox (cluster); instead, a fuzzy membership function denotes its degree of membership in the hyperbox, so the data can be classified accurately. Based on these merits, the FMM neural network is used to partition the state space of RL problems. The basic FMM algorithm is improved and integrated with Q(λ)-learning; the result is denoted FMM-RL. In the learning process, the state space is partitioned online through the operations of hyperbox expansion, contraction, mergence, and deletion, while Q(λ)-learning proceeds synchronously. Therefore, the method in this paper can construct the state space and solve the RL problem simultaneously. RL is well suited to the robot control domain: it is independent of environment models and self-adaptive, and it has opened a new research field for robotics. In autonomous mobile robots, RL is able not only to implement low-level elementary control of robot behavior but also to learn high-level behaviors and complicated strategies [11].
Therefore, the above FMM-based RL algorithm is used to control the behaviors of a reactive robot. The robot learns diversified behaviors and accomplishes appointed tasks through interaction with the environment, without supervision. In unknown and unstructured environments, the robot can sense environmental information only through its own sensors; the perceived sensor data are therefore viewed as the state vector of the RL problem. The state vector is quantized by the FMM neural network, the goal being to discretize the continuous state space and decrease the generalization distortion.
2 Q(λ)-Learning

In a Markov decision process (MDP), the agent perceives the state set S = {s_i} of the environment, and its action set is A = {a_i}. At time step t, the agent senses the current state s_t and chooses the action a_t. By executing this action, the agent receives the reward r_t from the environment and moves to the new state s_{t+1}. The aim of the RL method is to achieve an optimal control policy π: S → A.
Q-learning [12] is an important RL algorithm [1, 11]. Its idea is to directly optimize the Q-function, where Q(s_t, a_t) represents the evaluation value of the state-action pair <s_t, a_t>. Q-learning is given by [12]:

Q̂(s_t, a_t) = Q(s_t, a_t) + η · [r_t + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)] , (1)

where η is the learning rate and γ is the discount factor. Equation (1) updates the current Q-function based on the evaluation value of the next state, and is called one-step Q-learning [11, 13]. When the Q-function converges, the optimal policy can be read off. Introducing the TD(λ) method into Q-learning yields incremental multi-step Q-learning, denoted Q(λ)-learning [13]: the Q-function is first updated according to the normal one-step rule, and then the temporal difference of the greedy policy is used to update it again. Q(λ)-learning is therefore an on-line algorithm; it converges faster and is more effective than one-step Q-learning [11, 13]. Now let
ς_t = r_t + γ max Q_{t+1} − max Q_t (2)

and

ζ_t = r_t + γ max Q_{t+1} − Q_t . (3)
Then the Q-value update is calculated as follows: if s = s_t and a = a_t, then

Q̂(s_t, a_t) = Q(s_t, a_t) + η_t · [ζ_t + ς_t e_t(s_t, a_t)] ; (4)

otherwise,

Q̂(s, a) = Q(s, a) + η_t ς_t e_t(s, a) , (5)

where e_t(s, a) is the eligibility trace of the state-action pair <s, a>.
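Variants of Q(λ) differ in when the trace is incremented and decayed; the sketch below follows eqs (2)-(5) literally for the update itself and adds a conventional γλ trace decay (an assumption, since the decay step is not spelled out here). Q is a nested dict mapping state to action to value:

```python
def q_lambda_update(Q, trace, s_t, a_t, r_t, s_next, eta=0.1, gamma=0.9, lam=0.8):
    max_next = max(Q[s_next].values())
    max_curr = max(Q[s_t].values())
    sigma = r_t + gamma * max_next - max_curr     # varsigma_t, eq (2)
    zeta = r_t + gamma * max_next - Q[s_t][a_t]   # zeta_t, eq (3)
    trace[(s_t, a_t)] = trace.get((s_t, a_t), 0.0) + 1.0
    for (s, a), e in list(trace.items()):
        if (s, a) == (s_t, a_t):
            Q[s][a] += eta * (zeta + sigma * e)   # visited pair, eq (4)
        else:
            Q[s][a] += eta * sigma * e            # all other pairs, eq (5)
        trace[(s, a)] = gamma * lam * e           # decay the eligibility trace

Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
trace = {}
q_lambda_update(Q, trace, s_t=0, a_t=0, r_t=1.0, s_next=1)
```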
3 RL Based on FMM Neural Network

3.1 FMM Clustering Neural Network

The FMM neural network is an online learning classifier. Each n-dimensional hyperbox is delimited by its n-dimensional min and max points and can be regarded as a fuzzy set; its membership function describes the degree to which an input state vector S ∈ R^n pertains to the hyperbox. The hyperbox B_j is defined as follows [8, 9]: B_j = {S_i, V_j, W_j, μ_j(S_i, V_j, W_j)}, where S_i denotes the ith
input vector, S_i = (s_{i1}, s_{i2}, …, s_{iL}); V_j = (v_{j1}, v_{j2}, …, v_{jL}) and W_j = (w_{j1}, w_{j2}, …, w_{jL}) are, respectively, the minimum and maximum points of the hyperbox B_j; and μ_j(S_i, V_j, W_j) is the membership function of B_j, which is calculated as follows:
μ_j(S, V_j, W_j) = (1/L) ∑_{i=1}^{L} [1 − f(s_i − w_{ji}, γ) − f(v_{ji} − s_i, γ)] , (6)
where L is the dimension of the state vector, γ is the sensitivity parameter that regulates the gradient of the membership function, and f(·) is the two-parameter ramp threshold function:

f(x, γ) = 1 if xγ > 1; f(x, γ) = xγ if 0 ≤ xγ ≤ 1; f(x, γ) = 0 if xγ < 0. (7)
Fig. 1. FMM neural network element
As described in Fig. 1, the FMM can be viewed as a two-layer neural network. The input vectors and the hyperboxes serve as the input and output nodes, respectively. The connection weights are the min and max points of the hyperboxes: the weights connecting the ith input node and the jth output node are v_{ji} and w_{ji}, and the corresponding transfer function is the membership function of the hyperbox, so the output of the jth node is y_j = μ_j. Each output node denotes a unique cluster. Through competition, the winning hyperbox with the maximal degree of membership is the output of the network, and during training the weights (min-max points) of the winning node (hyperbox) are updated continuously. During learning, the FMM neural network gradually approximates the state space of RL by appending new hyperboxes. Hyperbox mergence and hyperbox deletion are added to the basic operations of the FMM neural network. The mergence operation combines similar hyperboxes into a single hyperbox; its condition is that the two hyperboxes are close enough and that their evaluation values Q are similar enough. Whether a hyperbox is deleted is determined by its visited frequency
and its accumulated reward. This improved method effectively removes redundant and insignificant hyperboxes, so the state space can be approximated by a finite set of hyperboxes.

3.2 Q(λ)-Learning Based on FMM
The FMM-RL system is composed of the FMM neural network and Q(λ)-learning. The state vectors of RL are taken as the training data of the FMM neural network, and the hyperboxes represent the partition regions of the RL state space, so the learning of the FMM neural network can be regarded as a dynamic partition of the state space. During learning, the partition boundaries are changed by tuning the min and max points of the hyperboxes; through expansion, contraction, appending, mergence, and deletion, the hyperboxes express the distribution characteristics of the state space self-adaptively. Consequently, the trade-off between coarse and fine partition of the state space is handled effectively. The hyperboxes of the FMM then serve as the discrete state vectors of the tabular RL. After training with RL, the action with the maximum Q-value of each state vector (hyperbox) is selected as the optimal policy, and the state vectors together with their optimal actions constitute the look-up table. Synthesizing the improved FMM neural network and Q(λ)-learning, the FMM-RL algorithm can be described as follows:
(1) Parameter initialization. Define the maximum hyperbox size ϑ and the sensitivity gain γ. Initialize the visited-frequency threshold κ, the hyperbox comparability threshold δ, the mean square difference threshold ε of the Q-values, and the accumulated reward threshold χ. Initialize the initial hyperbox B_0 with min point V_0 = 1 and max point W_0 = 0, and the initial evaluations Q(B_0, a_k) = 0, k = 1, …, N, where N denotes the number of selectable actions of RL. Initialize the accumulated reward AR_j = 0 and the visited frequency HF_j = 0 of each hyperbox B_j, and the eligibility trace e(B_0, a_k) = 0. Furthermore, record the current state S_0.
(2) The action a_t is selected.
By executing the action a_t, the agent obtains the immediate reward r_t and the next state S_{t+1}.
(3) Find the hyperbox closest to the current state. The degree of membership μ_j, j = 1, …, M_t, of the current state in each hyperbox is calculated by equation (6). The hyperbox B_{t+1} = B_j* with the highest degree of
membership is selected as the winning hyperbox, and the visited frequency HF_j of hyperbox B_j is updated.
(4) Check the hyperbox expansion condition. If the min and max points of hyperbox B_j satisfy equation (8), go to (5). Otherwise, check the expansion condition of the other hyperboxes until all hyperboxes are exhausted; if none satisfies the condition, a new hyperbox is appended and we go to (6).
∑_{i=1}^{L} [max(w_{ji}, s_i) − min(v_{ji}, s_i)] ≤ L · ϑ . (8)

(5)
Hyperbox expansion. The min and max points of the hyperbox are adjusted according to equations (9) and (10):

v_{ji}^{new} = min(v_{ji}^{old}, s_i) , ∀i = 1, 2, …, L , (9)

w_{ji}^{new} = max(w_{ji}^{old}, s_i) , ∀i = 1, 2, …, L . (10)
(6) Hyperbox append. The min and max points of the appended hyperbox are set to the current state vector, that is, V_{new} = W_{new} = S_{t+1}. Furthermore, the corresponding evaluation values Q(B_{new}, a_k) and eligibility traces e(B_{new}, a_k) are appended and initialized.
(7) Q(λ)-learning. The Q-values, the eligibility traces e(B_j, a_k), and the accumulated reward AR_j are updated according to the Q(λ)-learning algorithm.
(8) Hyperbox overlapping test. If the expanded or appended hyperbox B_j satisfies one of the following cases with respect to another hyperbox B_p, go to step (9); otherwise, go to (10).
Case 1: v_{pi} < v_{ji} < w_{pi} < w_{ji}; Case 2: v_{ji} < v_{pi} < w_{ji} < w_{pi}; Case 3: v_{pi} < v_{ji} ≤ w_{ji} < w_{pi}; Case 4: v_{ji} < v_{pi} ≤ w_{pi} < w_{ji}.
Case 2: v ji < v pi < w ji < w pi , new old old v new pi = w ji = (v pi + w ji ) / 2 .
Case 3: v pi < v ji ≤ w ji < w pi , If w ji − v pi < w pi − v ji , old v new pi = w ji . old Otherwise, wnew pi = v ji .
Case 4: v ji < v pi ≤ w pi < wki , The same assignments under the same conditions in case 3. (10) Hyperbox mergence. The conditions of the hyperbox mergence are two L
hyperboxes are sufficient to approach, that is, ∑ (v pi − vki ) 2 + ( w pi − wki ) 2 ≤ δ i =1
and Q_th ≤ ε, where Q_th = ∑_{k=1}^{N} [Q(B_p, a_k) − Q(B_j, a_k)]^2. The merged hyperbox is v^{new} = (v_p + v_j) / 2, w^{new} = (w_p + w_j) / 2, with the corresponding Q-value Q(B^{new}, a_k) = [Q(B_p, a_k) + Q(B_j, a_k)] / 2.
(11) Hyperbox deletion. If the visited frequency of a hyperbox is less than the threshold κ (HF_j < κ) and its accumulated reward is less than the threshold χ (AR_j < χ), the hyperbox is deleted.
(12) State transformation. S_t ← S_{t+1}; return to step (2).
(13) Iterate steps (2) to (12) until the min and max points of the hyperboxes no longer change.
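The geometric steps (4), (5), and (8) of the algorithm can be sketched as follows (Python; the all-dimensions intersection test is a compact stand-in for the four per-dimension overlap cases of step (8), which is a simplification of this illustration):

```python
def can_expand(v, w, s, theta):
    """Expansion criterion of eq (8): the stretched box must keep its
    summed edge length within L * theta."""
    L = len(s)
    return sum(max(w[i], s[i]) - min(v[i], s[i]) for i in range(L)) <= L * theta

def expand(v, w, s):
    """Eqs (9)-(10): stretch the min/max points to absorb the state s."""
    return ([min(vi, si) for vi, si in zip(v, s)],
            [max(wi, si) for wi, si in zip(w, s)])

def overlaps(v_j, w_j, v_p, w_p):
    """Step (8): the four overlap cases amount to the two boxes
    intersecting with nonzero width in every dimension."""
    return all(max(v_j[i], v_p[i]) < min(w_j[i], w_p[i])
               for i in range(len(v_j)))
```

When `overlaps` fires, the contraction rules of step (9) would then be applied dimension by dimension to restore disjoint boxes.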
4 The Robot Navigation Based on FMM-RL

The robot adopts a two-wheel differential drive at its geometric center (see Fig. 2); the drive motors of the two wheels are independent, with v_l and v_r the velocities of the left and right wheels. The sensors of the robot are divided into three groups according to their coverage areas, sensing the distances to obstacles to the right of, in front of, and to the left of the robot, respectively. In each group, the distance between the robot and the obstacles is the minimum value of the sensed data, i.e., D_min = min_i(d_i). θ is the angle between the moving direction of the robot and the line connecting the robot center with the target.
Fig. 2. The perceptive model of robot
Firstly, the robot senses the state information of the environment. As described in the previous section, the operations of hyperbox expansion, appending, contraction, mergence, and deletion are carried out; thereby, the robot partitions the state space online and performs Q(λ)-learning during the learning process. The action corresponding to the resulting state vector, looked up in the look-up table, is then used as the control variable of the robot. The robot control variables are the left and right wheel velocities v_l and v_r, each represented by five discrete values. They constitute 25 different
combinations, which are used as the action variables of RL. The corresponding Q-value of each action is updated by RL. After training, the action with the maximal Q-value is selected as the optimal action of the hyperbox.
5 Experimental Results

To demonstrate the effectiveness of the proposed FMM-RL, experiments are performed both in simulation and on the real mobile robot Pioneer II. As in the previous section, the ultrasonic sensors of Pioneer II are divided into three groups, each measuring the distance between the robot and the obstacles in one direction. To increase the learning speed of the FMM-RL method and decrease the wear on the real robot, we first apply the proposed method to a robot that learns the behaviors in a simulation environment; the learned results are then tested on the real robot Pioneer II. In this section, we study the learned wandering behavior: the robot explores the unknown and changing environment stochastically without collision, aiming to gather environment information or search for targets. For the wandering behavior, the sensor measurements D_l, D_c, and D_r in the three directions (left, front, and right) are taken as the state variables of RL, and the state vectors are partitioned online by the FMM-RL method. For obstacle avoidance, it is natural to keep the robot away from obstacles: if the robot comes close to an obstacle, it receives a punishment (negative reinforcement signal); otherwise, it receives a bonus (positive reinforcement signal). Thereby, the reinforcement signal function is defined as follows:
r_t = ⎧ −1,             d_t < D_S,
      ⎨ −τ(D_A − d_t),  D_S < d_t ≤ D_A,      (11)
      ⎩ 0,              otherwise.
where r_t is the immediate reinforcement signal at time step t, and d_t denotes the minimum obstacle distance over the three directions around the robot, i.e., d_t = min{D_l, D_c, D_r}. The parameter τ is a proportional gain. D_S represents the safe distance threshold: if the distance between the robot and an obstacle is less than D_S, the robot is considered to have collided. D_A is the distance threshold for obstacle avoidance; within the range from D_S to D_A, the robot is able to avoid obstacles effectively. The simulated robot is placed in a complicated unknown environment to train the obstacle-avoidance behavior. According to Eq. (11), the robot receives the reinforcement signal. If the robot collides with an obstacle, reaches the target or completes a trial, it returns to the start state and begins a new learning stage. Figure 3 shows the wandering trajectories of the robot in the unknown simulation
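A minimal sketch of the reinforcement signal of Eq. (11) follows; the numeric thresholds D_S, D_A and the gain τ are illustrative assumptions, since the paper does not state their values:

```python
def reinforcement_signal(d_l, d_c, d_r, d_s=0.3, d_a=1.0, tau=0.5):
    """Eq. (11): punish proximity to obstacles.
    Thresholds d_s, d_a and gain tau are illustrative, not from the paper."""
    d_t = min(d_l, d_c, d_r)       # minimum obstacle distance of the
                                   # three directions around the robot
    if d_t < d_s:                  # collision region
        return -1.0
    if d_t <= d_a:                 # obstacle-avoidance region
        return -tau * (d_a - d_t)
    return 0.0                     # safe region
```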
168
Y. Duan, B. Cui, and X. Xu
Fig. 3. Wandering trajectories in simulation
Fig. 4. Pioneer II robot wandering behavior
environment. Figure 4 shows the robot Pioneer II performing the wandering behavior in the real environment. The effectiveness of the proposed method is thus demonstrated through both simulation and real-robot experiments: the robot with the controller designed by FMM-RL can explore the environment without collision.
6 Conclusions

In this paper, the improved FMM neural network and RL are integrated to form the FMM-RL algorithm. First, the FMM neural network is used to quantize the continuous state space of RL, so that the continuous state space can be approximated by a finite set of FMM hyperboxes. The proposed algorithm not only partitions the state space self-adaptively, but also effectively deletes and merges insignificant state partition regions. Consequently, the tabular RL method can be applied. We study the behavior learning of a mobile robot based on the FMM-RL method. The experimental results indicate that the FMM-RL method, with a reasonable reinforcement signal function, can effectively complete the learning tasks.
References

1. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts (1998)
2. Michie, D., Chambers, R.A.: BOXES: An Experiment in Adaptive Control. Machine Intelligence 2 (1968) 137-152
3. Moore, A.W., Atkeson, C.G.: The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces. Machine Learning 21 (1995) 199-233
4. Munos, R., Moore, A.W.: Variable Resolution Discretization for High-accuracy Solutions of Optimal Control Problems. Proc. 16th International Joint Conf. on Artificial Intelligence (1999) 1348-1355
5. Reynolds, S.I.: Adaptive Resolution Model-free Reinforcement Learning: Decision Boundary Partitioning. Proc. 17th International Conf. on Machine Learning (2000) 783-790
6. Murao, H., Kitamura, S.: Q-Learning with Adaptive State Segmentation (QLASS). Proc. IEEE International Symposium on Computational Intelligence in Robotics and Automation (1997) 179-184
7. Lee, I.S.K., Lau, H.Y.K.: Adaptive State Space Partitioning for Reinforcement Learning. Engineering Applications of Artificial Intelligence 17 (2004) 577-588
8. Simpson, P.K.: Fuzzy Min-Max Neural Network-Part I: Classification. IEEE Trans. on Neural Networks 3 (5) (1992) 776-786
9. Simpson, P.K.: Fuzzy Min-Max Neural Network-Part II: Clustering. IEEE Trans. on Fuzzy Systems 1 (1) (1993) 32-45
10. Gabrys, B., Bargiela, A.: General Fuzzy Min-Max Neural Network for Clustering and Classification. IEEE Trans. on Neural Networks 11 (3) (1999) 769-783
11. Zhang, R.B.: Reinforcement Learning Theory and Applications. Harbin Engineering University Press (2000)
12. Watkins, C.J., Dayan, P.: Q-learning. Machine Learning 8 (3) (1992) 279-292
13. Peng, J., Williams, R.J.: Incremental Multi-step Q-learning. Machine Learning: Proceedings of the Eleventh International Conference (ML94), Morgan Kaufmann, New Brunswick, NJ, USA (1994) 226-232
Realization of an Improved Adaptive Neuro-Fuzzy Inference System in DSP

Xingxing Wu, Xilin Zhu, Xiaomei Li, and Haocheng Yu
College of Mechanical Science and Engineering, Jilin University, Changchun 130025, China
[email protected]

Abstract. The scaled conjugate gradient (SCG) algorithm is used to improve the adaptive neuro-fuzzy inference system (ANFIS). Applications in chaotic time-series prediction show that the improved ANFIS converges in less time and with fewer iterations than standard ANFIS or ANFIS improved with the Fletcher-Reeves update method. The way in which ANFIS can be improved on the basis of the standard algorithm using the fuzzy logic toolbox of MATLAB is described in detail. A convenient method to realize ANFIS in TI's digital signal processor (DSP) TMS320C5509 is presented. Experimental results indicate that the output of the ANFIS realized in DSP coincides with that in MATLAB, validating this method.
1 Introduction
Artificial neural networks and fuzzy inference systems have been applied in more and more engineering fields for their abilities to simulate human learning and inference. The adaptive neuro-fuzzy inference system (ANFIS) utilizes the learning principle and adaptive ability of neural networks to model a fuzzy inference system. In this way, membership function parameters and fuzzy inference rules can be obtained by learning from quantities of input and output data. ANFIS therefore has a special advantage for complex systems in which qualitative knowledge and experience are deficient or hard to obtain. With its self-learning ability, ANFIS can expand its fuzzy inference library according to changes in the application circumstances, improving the system's flexibility and adaptability. At present ANFIS has been successfully applied in many fields such as modeling and forecasting of nonlinear systems, fingerprint matching, etc. [1,2,3]. Programmable digital signal processors have been developing at high speed in the past twenty years. With increasingly high cost performance, the digital signal processor (DSP) has become the core of many electronic devices and is widely used in communication, automatic control, spaceflight and other fields [4]. The standard ANFIS algorithm is rather slow, because complex computations on large amounts of data must be carried out during training. In this paper the scaled conjugate gradient (SCG) algorithm [7] is used to improve the ANFIS algorithm, accelerating the training process and reducing the number of training iterations. In addition, a convenient way to realize ANFIS in DSP on the basis of MATLAB's fuzzy toolbox is presented.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 170-178, 2007.
© Springer-Verlag Berlin Heidelberg 2007
2 Improvement of ANFIS
Fuzzy inference systems can be classified into Mamdani-type, Sugeno-type, pure fuzzy inference systems, etc. A Mamdani-type fuzzy inference system can express knowledge conveniently because the form of its fuzzy inference rules coincides with human thought and language customs, but its computation is rather complicated and difficult to analyze mathematically. A Sugeno-type fuzzy inference system is simple in computation and easily combined with optimizing and self-adapting methods [9]. ANFIS, based on the Sugeno-type fuzzy inference system, was put forward by Jang [5]. In ANFIS, the parameters which determine the membership function shapes of each input are called premise parameters. The output of each rule is a linear combination of the inputs plus a constant; the linear combination coefficients and the constant are called consequent parameters. All these parameters are adjusted by the back propagation (BP) algorithm, or by a combination of least squares estimation and back propagation, in a way similar to a neural network. In the pure BP method both premise and consequent parameters are adjusted by the BP algorithm; in the hybrid method premise parameters are adjusted by the BP algorithm while consequent parameters are adjusted by least squares estimation. But the standard back propagation algorithm is often too slow for application and may get stuck in a shallow local minimum, so many faster algorithms have been presented, such as variable learning rate BP, resilient BP, conjugate gradient, Levenberg-Marquardt (LM), and so on. Conjugate gradient algorithms can themselves be divided into several kinds, such as scaled conjugate gradient (SCG), Fletcher-Reeves update (FRU), Powell-Beale restarts, etc. The following conclusions were drawn from experiments on the above algorithms with different network structures and precisions on six different kinds of practical problems [6].
Generally the LM algorithm has the fastest convergence for networks that contain up to a few hundred weights on function approximation problems, but its performance is relatively poor on pattern recognition problems. The resilient BP algorithm is the fastest on pattern recognition problems but does not perform well on function approximation problems. The conjugate gradient algorithms, in particular the scaled conjugate gradient (SCG) algorithm, perform well over a wide variety of problems, particularly for networks with a large number of weights. The SCG algorithm is almost as fast as the LM algorithm on function approximation problems (faster for large networks) and almost as fast as resilient back propagation on pattern recognition problems. So in this study ANFIS was improved with the SCG algorithm to quicken its training. The standard back propagation algorithm adjusts the weights in the steepest descent direction, as formulas (1) and (2) show:

Δf(W_n) = −α_n ∇f(W_n),      (1)

W_{n+1} = W_n + Δf(W_n),     (2)
where W_n is the weight vector at iteration n, α_n is the current step size and ∇f(W_n) is the current gradient vector. It turns out that this does not necessarily produce the
fastest convergence along the negative of the gradient. In conjugate gradient algorithms the search direction is conjugate to the previous search direction, except that the first search direction is along the negative of the gradient. Generally, conjugate gradient algorithms converge faster than the standard BP algorithm. In conjugate gradient algorithms such as FRU and Powell-Beale restarts, line searches are performed to determine the optimal distance to move along the search direction. The SCG algorithm put forward by Moller combines the model-trust region approach and the conjugate gradient approach to avoid the time-consuming line search and improve the convergence speed [7,8]. It is shown below.

(1) At n = 0, choose an initial weight vector W_0 and scalars 0 < σ < 10^{-4}, 0 < ρ_0 < 10^{-6}, ρ̄_0 = 0; set the Boolean success = true. Set the initial direction vector

D_0 = G_0 = −∇f(W_0).      (3)

(2) If success = true, calculate the second-order information:

σ_n = σ / |D_n|,      (4)

S_n = [∇f(W_n + σ_n D_n) − ∇f(W_n)] / σ_n,      (5)

θ_n = D_n^T S_n.      (6)

(3) Scale θ_n:

θ_n = θ_n + (ρ_n − ρ̄_n)|D_n|².      (7)

(4) If θ_n ≤ 0, make the Hessian positive definite:

ρ̄_n = 2(ρ_n − θ_n / |D_n|²),      (8)

θ_n = −θ_n + ρ_n|D_n|²,      (9)

ρ_n = ρ̄_n.      (10)

(5) Calculate the step size:

ξ_n = D_n^T G_n,      (11)

α_n = ξ_n / θ_n.      (12)

(6) Calculate the comparison parameter C_n:

C_n = 2θ_n [f(W_n) − f(W_n + α_n D_n)] / ξ_n².      (13)

(7) Update the weights and direction. If C_n > 0, a successful update can be made:

W_{n+1} = W_n + α_n D_n,      (14)

G_{n+1} = −∇f(W_{n+1}),      (15)

ρ̄_n = 0, success = true. If n mod N = 0, restart the algorithm with D_{n+1} = G_{n+1}; else

β_n = (|G_{n+1}|² − G_{n+1}^T G_n) / ξ_n,      (16)

D_{n+1} = G_{n+1} + β_n D_n.      (17)

If C_n ≥ 0.75, reduce the scale parameter:

ρ_n = ρ_n / 4.      (18)

If C_n ≤ 0, the update fails: set ρ̄_n = ρ_n and success = false.

(8) If C_n < 0.25, increase the scale parameter:

ρ_n = ρ_n + θ_n(1 − C_n) / |D_n|².      (19)
(9) If the steepest descent direction G_n ≠ 0, set n = n + 1 and go back to (2); else terminate and return W_{n+1} as the desired minimum.

In MATLAB the standard ANFIS function anfis() is provided, with which either the pure BP method or the hybrid method can be chosen to train the system. Analysis of the MATLAB source file anfis.m shows that anfis() calls anfismex.dll to realize the kernel training algorithm. The C language source code of anfismex.dll can be found in the directory toolbox\fuzzy\fuzzy\src of MATLAB. From this source code it can be concluded that a variable learning rate BP algorithm has been used to improve the convergence speed. The basic idea of variable learning rate BP is to increase or decrease the learning rate (also called step size) by judging whether the current training error is smaller than the last training error: if the training errors decrease several times in succession, the learning rate is increased; if the training errors oscillate, the learning rate is decreased; otherwise it is kept constant. As the algorithm flow above shows, in the SCG algorithm the learning rate is computed from second-order information of the performance function. Compared with variable learning rate BP, this avoids oscillations that may be caused by an inappropriate initial learning rate or increase/decrease rate. To construct a new ANFIS function anfisscg() based on the SCG algorithm, a new kernel training library anfisscgmex.dll must first be made. It was realized by modifying the source code of anfismex.dll according to the SCG algorithm described above. The main modification was made to the function anfislearning() in learning.c, as most learning procedures are completed there. In order to complete the computation of formulas (5), (13) and (16) conveniently, new members were added to the struct types FIS and NODE in anfis.h; code for allocating, freeing and initializing memory for the new members was added in datstruc.c.
After modification, the file containing the function mexFunction() was renamed anfisscgmex.c. Then anfisscgmex.dll was generated by the command "mex anfisscgmex.c -output anfisscgmex.dll" in MATLAB. Finally, the improved ANFIS function anfisscg() was obtained by substituting the function anfisscgmex for anfismex in anfis.m and renaming the file anfisscg.m.
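As a cross-check of the algorithm flow above, here is a minimal NumPy sketch of the SCG iteration in the same notation. This is an illustrative re-implementation under the stated update rules, not the modified anfisscgmex C code:

```python
import numpy as np

def scg_minimize(f, grad, w0, sigma0=1e-5, rho=1e-6, max_iter=500, tol=1e-8):
    """Scaled conjugate gradient (Moller): D is the search direction,
    G the negative gradient, rho/rho_bar the scale parameters, theta the
    curvature estimate, xi = D.G and C the comparison parameter."""
    w = np.asarray(w0, dtype=float)
    G = -grad(w)
    D = G.copy()
    rho_bar, success, delta = 0.0, True, 0.0
    N = len(w)
    for n in range(max_iter):
        dd = float(D @ D)
        if success:                        # step (2): second-order information
            sigma_n = sigma0 / np.sqrt(dd)
            S = (grad(w + sigma_n * D) - grad(w)) / sigma_n
            delta = float(D @ S)
        theta = delta + (rho - rho_bar) * dd      # step (3): scale
        if theta <= 0:                     # step (4): force positive definiteness
            rho_bar = 2.0 * (rho - theta / dd)
            theta = -theta + rho * dd
            rho = rho_bar
        xi = float(D @ G)                  # step (5): step size
        alpha = xi / theta
        C = 2.0 * theta * (f(w) - f(w + alpha * D)) / xi ** 2   # step (6)
        if C > 0:                          # step (7): successful update
            w = w + alpha * D
            G_new = -grad(w)
            rho_bar, success = 0.0, True
            if (n + 1) % N == 0:           # periodic restart
                D_new = G_new.copy()
            else:
                beta = float(G_new @ G_new - G_new @ G) / xi
                D_new = G_new + beta * D
            G, D = G_new, D_new
            if C >= 0.75:
                rho = rho / 4.0
        else:
            rho_bar, success = rho, False
        if C < 0.25:
            rho = rho + theta * (1.0 - C) / dd    # step (8)
        if float(G @ G) < tol ** 2:        # step (9): gradient has vanished
            return w
    return w
```

On a simple quadratic test function the sketch converges to the minimizer in a handful of iterations without any line search, which is the point of the model-trust-region scaling.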
3 Test of the Improved ANFIS Algorithm
To test the improved ANFIS algorithm, the standard ANFIS, the ANFIS improved with the Fletcher-Reeves update method, and the ANFIS improved with the SCG algorithm were each applied to forecasting a chaotic time series. A chaotic time series is generated by the following Mackey-Glass (MG) time-delay differential equation:

ẋ(t) = 0.2x(t − τ) / [1 + x¹⁰(t − τ)] − 0.1x(t).      (20)
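Equation (20) can be integrated numerically in the same spirit; the following sketch uses a fourth-order Runge-Kutta step, and the step size, the delay τ = 17, the initial value and the zero pre-history are illustrative assumptions:

```python
import numpy as np

def mackey_glass(n_samples=1000, tau=17, dt=1.0, x0=1.2):
    """Integrate dx/dt = 0.2*x(t-tau)/(1+x(t-tau)^10) - 0.1*x(t) with a
    fourth-order Runge-Kutta step, holding the delayed term constant
    over each step; x(t) = 0 is assumed for t < 0 (illustrative choice)."""
    hist = int(round(tau / dt))
    x = [x0]
    def f(xt, xd):
        return 0.2 * xd / (1.0 + xd ** 10) - 0.1 * xt
    for k in range(n_samples - 1):
        xd = x[k - hist] if k >= hist else 0.0   # delayed value x(t - tau)
        xt = x[-1]
        k1 = f(xt, xd)
        k2 = f(xt + 0.5 * dt * k1, xd)
        k3 = f(xt + 0.5 * dt * k2, xd)
        k4 = f(xt + dt * k3, xd)
        x.append(xt + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0)
    return np.array(x)
```

The generated series stays bounded and aperiodic, matching the qualitative behavior described next.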
This time series is chaotic, so there is no clearly defined period: the series neither converges nor diverges, and the trajectory is highly sensitive to initial conditions. It is a benchmark problem in the neural network and fuzzy modeling research communities [9]. In MATLAB the file mgdata.dat, provided in the directory toolbox\fuzzy\fuzdemos, contains time series data calculated by the fourth-order Runge-Kutta method. Half of the data were used for training while the other half were used for checking, to confirm that the modeling was successful. In all algorithms the system was initialized by the grid partition method and Gaussian membership functions were selected for the inputs; other parameters used the default values. Training results of the different algorithms are shown in Table 1.

Table 1. Training results of different algorithms

Algorithm                        Target error Et   Iterations N   Time T/s
Standard ANFIS, BP method        0.02              5919           338.3970
Improved by FRU, BP method       0.02              220            28.4810
Improved by SCG, BP method       0.02              40             6.8000
Standard ANFIS, hybrid method    0.0017            155            51.8140
Improved by FRU, hybrid method   0.0017            72             35.8820
Improved by SCG, hybrid method   0.0017            63             30.3430
Training error curves of the standard ANFIS, the ANFIS improved by FRU, and the ANFIS improved by SCG using the BP method are shown in Fig. 1. In the ANFIS improved by FRU, the parameter β_n, which determines how much the last search direction influences the current search direction, was computed according to formula (21):

β_n = |G_{n+1}|² / |G_n|².      (21)
The step size α_n was adjusted in the same way as in the variable learning rate BP algorithm. In the ANFIS improved by SCG using the hybrid method, consequent parameters were adjusted by least squares estimation and premise parameters were adjusted in a way similar to FRU, except that β_n was computed according to formula (16) instead of formula (21).
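The least-squares half of this hybrid scheme can be sketched on a toy one-input, two-rule Sugeno system; the membership centers, width and training data below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def gauss(x, c, s):
    # Gaussian membership function with center c and width s
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def fit_consequents(x, y, centers=(0.0, 1.0), sigma=0.5):
    """With the premise (membership) parameters held fixed, the ANFIS
    output is linear in the consequent parameters (p_i, r_i) of the
    rule outputs y_i = p_i*x + r_i, so they solve a least-squares problem."""
    w = np.stack([gauss(x, c, sigma) for c in centers])  # firing strengths
    wn = w / w.sum(axis=0)                               # normalized strengths
    # y_hat = wn_0*(p_0*x + r_0) + wn_1*(p_1*x + r_1)
    A = np.column_stack([wn[0] * x, wn[0], wn[1] * x, wn[1]])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta, A @ theta
```

Because the normalized strengths sum to one, any linear target is exactly representable, so the least-squares step reproduces it to machine precision.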
Fig. 1. Comparison of training error curves
It can be concluded from Table 1 and Fig. 1 that the ANFIS improved by SCG converges much faster than the standard ANFIS or the ANFIS improved by FRU: it takes less time and fewer iterations to reach the same target error.
4 Realization of ANFIS in DSP
High-speed real-time signal processing can be achieved in a DSP thanks to advanced technologies such as the Harvard architecture, superscalar pipelines, special MAC units and instructions, etc. The improved ANFIS algorithm realized in MATLAB running on a PC is suitable for analysis and simulation, but it cannot satisfy field signal processing demands such as low power, real-time operation and small size. It will greatly promote ANFIS applications in more fields if ANFIS can be realized in DSP conveniently. The DSP used in this study is the TMS320VC5509, which is based on the latest TMS320C55x DSP processor core. The C55x DSP architecture achieves high performance and low power through increased parallelism and a total focus on reduced power dissipation [10]. The devices used include an emulator, Code Composer Studio (CCS) 5000, and a target board extended with 1M×16-bit SDRAM and 512K×16-bit flash. CCS 5000 supports development and debugging of TMS320C55x C or assembly language programs. In this study there are two methods to realize ANFIS in DSP. In the first method, the C language source code of anfisscgmex.dll is modified according to the TMS320C55x C language and the hardware attributes of the TMS320C5509, to realize the whole training and inference process in DSP. In the second method, the training process is completed in MATLAB. After training, the system is saved to a .fis format file using the function writefis(); the structure and parameters of the inference system can then be extracted from the saved .fis file for system initialization. The stand-alone C code fuzzy inference engine contained in fis.c in
the directory toolbox\fuzzy\fuzzy of MATLAB was modified in CCS to complete the inference process in DSP. As there is no file system in DSP, extraction of the parameters was completed by calling the function returnFismatrix() in fis.c. In practical applications it often takes a long time to train the system, while the inference speed should be as fast as possible, so off-line training with online inference is a wise choice. As a result the second method is better, as it also demands less memory in DSP. As an example, the improved ANFIS for forecasting the chaotic time series was realized in DSP using the second method. Forecast results for time series numbers 124 to 223 in DSP are shown in Fig. 2.
Fig. 2. Output results in DSP
Here a dual time/frequency graph was used to show the computation results in DSP. The upper curve represents the forecast values obtained from the fuzzy inference computation; the lower curve represents the errors between the forecast values and the real values in mgdata.dat. The start addresses of the upper and lower curves were set to the names of the arrays in which the system outputs and errors were saved. Both the display buffer size and the display data size were set to 100. 32-bit IEEE floating point was used as the DSP data type, and the output of the system was shown in the stdout window. In MATLAB the outputs of the system were computed by calling the functions readfis() and evalfis() [9]. As Fig. 3 shows, the upper graph compares the real values (circles) with the outputs of the system (line) in MATLAB; the trend of the line coincides with that of the circles, which indicates the forecast is successful. The lower graph shows the forecast error curve. Parts of the forecast results in DSP and MATLAB are listed in Table 2. It can be seen from Fig. 2, Fig. 3 and Table 2 that chaotic time series forecasting has been successfully achieved both in MATLAB and in DSP. The system output
Fig. 3. Output results in MATLAB

Table 2. Forecast results in MATLAB and DSP

Time series   Real value   Forecast value         Forecast error
number                     MATLAB     DSP         MATLAB     DSP
123           1.0510       1.0516     1.051554     0.0006     0.000554
125           0.9564       0.9530     0.952994    -0.0034    -0.003406
136           0.6526       0.6541     0.654167     0.0014     0.001567
145           0.8663       0.8659     0.865982    -0.0004    -0.000318
159           1.2022       1.2021     1.202076    -0.0000    -0.000124
167           1.1540       1.1541     1.154046     0.0001     0.000046
186           0.5053       0.5040     0.504101    -0.0013    -0.001199
222           1.0022       1.0026     1.002582     0.0004     0.000382
in DSP coincides with the system output in MATLAB, which verifies the method of realizing ANFIS in DSP.
5 Conclusions
This paper presents an improved ANFIS algorithm and a convenient method to realize it in DSP. The ANFIS improved by the SCG algorithm converges faster than the standard ANFIS or the ANFIS improved by the FRU conjugate gradient algorithm, so the training iterations and training time of ANFIS can be reduced. ANFIS can be conveniently realized in DSP by means of off-line training and online inference. Tests on chaotic time series forecasting verify both the improved ANFIS algorithm and the method of realizing ANFIS in DSP. With its faster training speed and a convenient realization method in DSP, ANFIS can be applied in more and more practical fields.
References

1. Lee, K.C., Gardner, P.: Adaptive Neuro-Fuzzy Inference System (ANFIS) Digital Predistorter for RF Power Amplifier Linearization. IEEE Transactions on Vehicular Technology 55 (1) (2006) 43-51
2. Hui, H., Song, F.J., Widjaja, J.: ANFIS-Based Fingerprint-Matching Algorithm. Optical Engineering 43 (3) (2004) 415-438
3. Jwo, D.J., Chen, Z.M.: ANFIS Based Dynamic Model Compensator for Tracking and GPS Navigation Applications. Lecture Notes in Computer Science 3611. Springer-Verlag, Berlin Heidelberg (2005) 425-431
4. Wang, C.M., Sun, H.B., Ren, Z.G.: Design and Development Examples of TMS320C5000 Series DSP Systems. Publishing House of Electronics Industry, Beijing (2004)
5. Jang, J.R.: ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Transactions on Systems, Man and Cybernetics 23 (3) (1993) 665-685
6. Demuth, H., Beale, M., Hagan, M.: Neural Network Toolbox for Use with MATLAB User's Guide. 4th edn. The MathWorks, Inc., MA (2005)
7. Moller, M.F.: A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning. Neural Networks 6 (4) (1993) 525-533
8. Falas, T., Stafylopatis, A.: Implementing Temporal-Difference Learning with the Scaled Conjugate Gradient Algorithm. Neural Processing Letters 22 (3) (2005) 361-375
9. Fuzzy Logic Toolbox for Use with MATLAB User's Guide. 2nd edn. The MathWorks, Inc., MA (2005)
10. TMS320VC5509 Fixed-Point Digital Signal Processor Data Manual. Texas Instruments Inc., Dallas (2001)
Neurofuzzy Power Plant Predictive Control

Xiang-Jie Liu and Ji-Zhen Liu
Department of Automation, North China Electric Power University, Beijing 102206, China
[email protected]

Abstract. In unit steam-boiler generation, a coordinated control strategy is required to ensure a higher rate of load change without violating thermal constraints. The process is characterized by nonlinearity and uncertainty. Using neuro-fuzzy networks (NFNs) to represent a nonlinear dynamical process is one choice. Two alternative methods of exploiting the NFNs within a generalised predictive control (GPC) framework are described. Coordinated control of steam-boiler generation using the two nonlinear GPC methods shows excellent tracking and disturbance rejection results.
1 Introduction

In a modern power plant, the coordinated control scheme constitutes the uppermost layer of the control system. It is responsible for driving the boiler-turbine-generator set as a single entity, harmonising the slow response of the boiler with the faster response of the turbine-generator, to achieve fast and stable unit response during load tracking manoeuvres and load disturbances. Among existing methods, the PID controller is still the most widespread in power plant control loops. However, the steam-boiler turbine system is a complex industrial process that is highly nonlinear and non-minimum-phase, with uncertainty and load disturbances [1]. Load-cycling operation between full load and low load is a common feature of modern power plants. This moves the operating point right across the whole operating range, and variations in plant variables become quite nonlinear, which presents a great challenge to the power plant control system. Model predictive control (MPC) has emerged as an effective approach to power plant control. The application of a decentralized predictive control scheme was proposed in [2] based on a state space implementation of GPC for a combined-cycle power plant, in which a two-level decentralized Kalman filter was used to locally estimate the states of each subprocess. A nonlinear long-range predictive controller based on neural networks was developed in [3] to control the power plant process. In the presence of constraints, the optimum predicted control trajectory is defined through the on-line solution of a quadratic programming problem. For a nonlinear system, since the on-line optimization problem is generally nonconvex, the on-line computational demand is high for any reasonably nontrivial system. Using neurofuzzy networks (NFNs) [4] to learn the plant model from operational process data for nonlinear GPC is one solution.
In the NFNs, expert knowledge in linguistic form can be incorporated into the network through the fuzzy rules. This article describes how this nonlinear neurofuzzy modelling technique can be integrated within an MPC framework. It also discusses how constraint handling can be

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 179-185, 2007.
© Springer-Verlag Berlin Heidelberg 2007
incorporated in the nonlinear control scheme while ensuring the highest possible rate of load change. Comparative control studies produce good results for both nonlinear coordinated control schemes.
2 Neuro-Fuzzy Network Modelling

Consider the following general single-input single-output nonlinear dynamic system:
y(t) = f[y(t−1), …, y(t−n'_y), u(t−d), …, u(t−d−n'_u+1), e(t−1), …, e(t−n'_e)] + e(t)/Δ      (1)

where f[.] is a smooth nonlinear function such that a Taylor series expansion exists, e(t) is a zero mean white noise, Δ is the differencing operator, and n'_y, n'_u, n'_e and d are respectively the known orders and time delay of the system. Let the local linear model of the nonlinear system (1) at the operating point O(t) be given by:

Ã(z⁻¹)y(t) = z⁻ᵈ B(z⁻¹)Δu(t) + C(z⁻¹)e(t)      (2)
where Ã(z⁻¹) = ΔA(z⁻¹), and B(z⁻¹) and C(z⁻¹) are polynomials in z⁻¹, the backward shift operator. The nonlinear system (1) is partitioned into several operating regions, such that each region can be approximated by a local linear model. Since NFNs are a class of associative memory networks with knowledge stored locally [4], they can be applied to model this class of nonlinear systems. A schematic diagram of the NFN is shown in Fig. 1. The input of the network is the antecedent variable vector [x₁, x₂, …, x_n], and the output ŷ(t) is a weighted sum of the outputs of the local linear models ŷ_i(t). B-spline functions are used as the membership functions in the NFNs. The membership functions of the fuzzy variables can be obtained by

a_i = ∏_{k=1}^{n} μ_{A_k^i}(x_k), for i = 1, 2, …, p,      (3)

where n is the dimension of the input vector x, and p is the total number of weights:

p = ∏_{i=1}^{n} (R_i + k_i),      (4)

Fig. 1. Neuro-fuzzy network
where k_i and R_i are the order of the basis functions and the number of inner knots respectively. The output of the NFN is

ŷ = ( Σ_{i=1}^{p} ŷ_i a_i ) / ( Σ_{i=1}^{p} a_i ) = Σ_{i=1}^{p} ŷ_i α_i.      (5)
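Equation (5) can be sketched for one input with second-order (triangular) B-spline memberships; the knot centers and the local linear models below are illustrative assumptions, not identified plant models:

```python
import numpy as np

def triangular_memberships(x, centers):
    """Second-order B-spline (triangular) basis functions on the given
    knot centers; they form a partition of unity on [centers[0], centers[-1]]."""
    a = np.zeros(len(centers))
    x = min(max(x, centers[0]), centers[-1])
    for i in range(len(centers) - 1):
        if centers[i] <= x <= centers[i + 1]:
            t = (x - centers[i]) / (centers[i + 1] - centers[i])
            a[i], a[i + 1] = 1.0 - t, t
            break
    return a

def nfn_output(x, centers, local_models):
    """Eq. (5): y_hat = sum_i y_hat_i * a_i / sum_i a_i, where each local
    model is linear, y_hat_i = c0_i + c1_i * x."""
    a = triangular_memberships(x, centers)
    y_local = np.array([c0 + c1 * x for (c0, c1) in local_models])
    return float((y_local * a).sum() / a.sum())
```

At a knot center the output equals the corresponding local model exactly; between knots it interpolates the two neighbouring local models, which is the locality property exploited in the next section.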
3 Neuro-Fuzzy Network Predictive Control

3.1 Local Model-Based Generalized Predictive Control (LMB-GPC)
The neurofuzzy network provides a global nonlinear plant representation from a set of locally valid CARIMA models together with a weight function, producing a value close to one in the parts of the operating space where the local model is a good approximation and a value approaching zero elsewhere. This is the main property of B-spline neuro-fuzzy networks. An alternative way of developing a nonlinear controller is to use the same operating-regime-based model directly within a model-based control framework. In this way, global modeling information may be used to determine the control input at each sample time, and the closed-loop performance, stability and robustness are then all directly related both to the quality of the identified model and to the general properties of GPC. The interpolated model is assumed to constitute a linear representation of the process at any time instant and may then be used by a GPC controller to represent the process dynamics locally. The resultant LMB-GPC is shown in Fig. 2.
Fig. 2. Local model-based generalized predictive control
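The model scheduling at the heart of LMB-GPC can be sketched as follows; the load centers and local model parameters are illustrative assumptions (the paper's five local models for the plant are not reproduced here):

```python
import numpy as np

# Hypothetical load centers (MW) at which local models were identified
LOAD_CENTERS = np.array([140.0, 180.0, 220.0, 260.0, 300.0])
# Each row: parameters (a1, b0) of a local model y(t) = -a1*y(t-1) + b0*u(t-1)
LOCAL_PARAMS = np.array([[-0.90, 0.10],
                         [-0.88, 0.12],
                         [-0.85, 0.15],
                         [-0.82, 0.18],
                         [-0.80, 0.20]])

def interpolated_model(load):
    """Blend the local model parameters with triangular (second-order
    B-spline) weights; at most two neighbouring models are active, so the
    result is a convex combination of adjacent local models."""
    load = float(np.clip(load, LOAD_CENTERS[0], LOAD_CENTERS[-1]))
    i = int(np.searchsorted(LOAD_CENTERS, load, side="right")) - 1
    i = min(i, len(LOAD_CENTERS) - 2)
    t = (load - LOAD_CENTERS[i]) / (LOAD_CENTERS[i + 1] - LOAD_CENTERS[i])
    return (1.0 - t) * LOCAL_PARAMS[i] + t * LOCAL_PARAMS[i + 1]
```

The blended parameter vector is what the single GPC controller would use as its internal model at each sample instant.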
3.2 Composed Controller Generalized Predictive Control (CC-GPC)
The control structure here consists of the family of controllers and the scheduler. At each sample instant the latter decides which controller, or combination of controllers, to apply to the process. Generally, the controllers are tuned about a model obtained from experiments at a particular equilibrium point. The interpolated outputs are then
summed and used to supply the control commands to the process. The resultant CC-GPC structure is shown in Fig. 3. The interpolation function effectively smooths the transition between each of the local controllers. In addition, the transparency of the nonlinear control algorithm is improved, as the operating space is covered using controllers rather than models.
Fig. 3. Composed controller generalized predictive control
3.3 Constraint Handling
One of the main application benefits of using a linear predictive controller is its ability to handle process constraints directly within the control law. The inclusion of constraints in LMB-GPC is straightforward, since the least squares solution to the chosen cost function may be replaced by a constrained optimization technique such as quadratic programming; the drawback is the increased computation required to solve for the control sequence at each sample instant. When the same approach is applied to the CC-GPC, a problem arises: in general there is no guarantee that the summation of all of the controller outputs will not violate a process constraint. Notice, however, that we are using a B-spline neuro-fuzzy network, i.e., Σ_j μ_j^k(x) ≡ 1 for x ∈ [x_min, x_max], signifying that the basis functions form a partition of unity. In such a case the summation of all of the controller outputs will not violate a process constraint, since they are combined as a weighted sum by the normalized B-spline neuro-fuzzy network.
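The partition-of-unity argument can be checked numerically: a convex combination of values that each satisfy a box constraint also satisfies it. This is an illustrative sketch, not the paper's controller code:

```python
import numpy as np

def blended_command(mu, u_local):
    """CC-GPC blending: u = sum_i mu_i * u_i with mu_i >= 0 and
    sum_i mu_i = 1 (partition of unity). If every local controller
    output u_i satisfies u_min <= u_i <= u_max, the blended command,
    being a convex combination, satisfies the same bounds."""
    mu = np.asarray(mu, dtype=float)
    u_local = np.asarray(u_local, dtype=float)
    assert np.all(mu >= 0) and abs(mu.sum() - 1.0) < 1e-12
    return float(mu @ u_local)
```

For instance, with two local outputs clipped to a band [-0.005, 0.005], any weighting that sums to one keeps the blended command inside the same band.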
4 Coordinated Control in Steam-Boiler Generation

A valid neurofuzzy model of the plant, which is an essential tool for the improvement of the control system, has been established in [1]. The proposed two kinds of neurofuzzy predictive controllers are now incorporated in the system. In the control system shown in Fig. 4, W_Nμ(s) is the transfer function relating the steam valve setting to the load power, and W_NM(s) is the transfer function between the fuel consumption and the load power, i.e.,

W_Nμ(s) = K_μ W_T(s),   W_NM(s) = W_PM(s) K_P W_T(s).      (6)
In the CC-GPC, the nonlinear controller consists of five local controllers, each designed around one of the local models and thus each with its own set of tuning parameters. At each sample instant the load signal was fed to the interpolation membership
Neurofuzzy Power Plant Predictive Control
183
Fig. 4. Load control system in boiler-following mode
function of the B-spline NFNs, which in turn generates the activation weights for each of the local controllers. Each local controller was assumed to be linear, and hence the control sequence for each could be solved analytically. However, the summation of the interpolated outputs is nonlinear. Notice that, since the B-spline membership function was chosen to be of second order, there are two controllers working at any time instant. In the LMB-GPC, the NFN model of the process was used with a GPC algorithm for control purposes. At each sample instant the load signal was fed to the interpolation membership function of the NFNs. Each of the five sets of local model parameters was then passed through this B-spline interpolation function to form a local model, which accurately represents the process around that particular operating point. This local model may be assumed linear and is used by the GPC controller. Notice also that, since the B-spline membership function was chosen to be of second order, there are two local models working at any time instant. The LMB-GPC strategy requires only one set of tuning parameters; the internal model of a single GPC controller is updated at each sample instant. The linear GPC is obtained by minimizing the cost function

$$J = E\left\{\sum_{j=1}^{N} q_j\,[\hat{y}(t+j) - y_r(t+j)]^2\right\} + \sum_{j=1}^{M} \lambda_j\,[\Delta u(t+j-1)]^2 \tag{7}$$
subject to

$$u_{\min} < u(t+i-1) < u_{\max}, \qquad \Delta u_{\min} < \Delta u(t+i-1) < \Delta u_{\max}, \quad i = 1, 2, \ldots, m.$$

The controller parameters are chosen as $Q = I$ and $\lambda = 0.1 \times I$. The sampling interval is chosen to be 30 s, with $N = 10$ and $M = 6$. In the sliding pressure mode, the steam pressure setpoint was incremented every 10 minutes from 11 MPa to 19 MPa, leading to a load increase from 140 MW to 300 MW. This was done in order to move the process across a wide operating range. The "tuning knobs" of the neuro-fuzzy GPC were chosen as discussed above. Simulations were first performed under the unconstrained condition. The sliding pressure responses are shown in Fig. 5 by the dotted lines. It is readily apparent that the linear GPC controller could not offer satisfactory results in most cases, because its internal model was generated at the "Medium" load, where the plant gain is moderate. The nonlinear GPC controllers show good sliding pressure response, and overall there is very little difference between the two nonlinear controllers during this test. Simulations were then made under the constrained condition:

$$-0.005 \le u_1 \le 0.005, \qquad -1.0 \le u_2 \le 0.02 \tag{8}$$
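For reference, the unconstrained minimizer of a quadratic cost of the form (7) has the familiar batch least-squares form $\Delta U = (G^T Q G + \Lambda)^{-1} G^T Q (y_r - f)$. A generic sketch using the paper's horizons and weights, but with an illustrative random plant (the dynamic matrix G and free response f below are placeholders, not the boiler model):

```python
import numpy as np

N, M = 10, 6                      # prediction and control horizons, as in the paper
Q = np.eye(N)                     # output weighting Q = I
Lam = 0.1 * np.eye(M)             # control weighting lambda = 0.1 * I

rng = np.random.default_rng(0)
G = rng.standard_normal((N, M))   # step-response (dynamic) matrix -- illustrative
f = rng.standard_normal(N)        # free response of the plant -- illustrative
y_r = np.ones(N)                  # reference trajectory

# Unconstrained least-squares GPC solution of cost (7)
dU = np.linalg.solve(G.T @ Q @ G + Lam, G.T @ Q @ (y_r - f))

# Receding horizon: only the first control move is applied
du0 = float(dU[0])
print(du0)
```

Under constraints such as (8), this closed form is replaced by a quadratic program, as noted in Sect. 3.3.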
Fig. 5. Sliding pressure response and control efforts under (a) linear GPC, (b) local model-based GPC, and (c) composed controller GPC
The sliding pressure responses and control efforts are shown in Fig. 5 by the dotted lines. Similar comparative results were obtained, except that in every scheme the control change effort was limited, leading to a slower response. Boiler-following or "constant pressure" mode is the most commonly used mode in power plant coordinated control. Fig. 6-a shows the steam pressure transient while the load increases from 260 MW to 290 MW. The opening of the steam valve leads to a quick increase in the load, as energy stored in the boiler is released. The steam pressure, after decreasing, is restored to its original level by increasing the fuel delivery. All three controllers give similar performance, since the plant dynamics remain within one operating region and the tuning parameters of the linear controller are valid within this region. Fig. 6-b shows the steam pressure response while the load increases from 240 MW to 300 MW. The nonlinear controllers exhibit superior action,
Fig. 6. Steam pressure transient process under boiler following mode
since the tuning parameters of the linear controller were specified for one region while the plant dynamics change across two regions.
5 Conclusion

GPC can produce excellent results compared to conventional methods. One limitation of GPC is that it is usually based on a linear model, which can lead to large differences between the actual and predicted output values, especially when the current output is relatively far from the operating point at which the linear control model was generated. Introducing NFNs helps to solve this problem. The proposed nonlinear GPC controllers were applied in the simulation of power plant coordinated control, which is the kernel system of a steam-boiler unit. Better results are obtained when compared with the linear GPC. It is also shown how constraint handling can be incorporated into the GPC system by using the B-spline NFNs. The advantage of the method is that it is suitable for improving many industrial plants already controlled by linear controllers.
Acknowledgment. This work was supported by the National Natural Science Foundation of China under grants 50576022 and 69804003, and the Natural Science Foundation of Beijing under grant 4062030.
References

1. Liu, X.J., Lara-Rosano, F., Chan, C.W.: Neurofuzzy Network Modelling and Control of Steam Pressure in 300MW Steam-Boiler System. Engineering Applications of Artificial Intelligence 16(5) (2003) 431-440
2. Katebi, M.R., Johnson, M.A.: Predictive Control Design for Large-scale Systems. Automatica 33(3) (1997) 421-425
3. Prasad, G., Swidenbank, E., Hogg, B.W.: A Neural Net Model-based Multivariable Long-range Predictive Control Strategy Applied to Thermal Power Plant Control. IEEE Trans. Energy Conversion 13(2) (1998) 176-182
4. Brown, M., Harris, C.J.: Neurofuzzy Adaptive Modelling and Control. Prentice-Hall, Englewood Cliffs, NJ (1994)
GA-Driven Fuzzy Set-Based Polynomial Neural Networks with Information Granules for Multi-variable Software Process

Seok-Beom Roh1, Sung-Kwun Oh2, and Tae-Chon Ahn1

1 Department of Electrical Electronic and Information Engineering, Wonkwang University, 344-2, Shinyong-Dong, Iksan, Chon-Buk, 570-749, South Korea
{nado,tcahn}@wonkwang.ac.kr
2 Department of Electrical Engineering, The University of Suwon, San 2-2 Wau-ri, Bongdam-eup, Hwaseong-si, Gyeonggi-do, 445-743, South Korea
[email protected]

Abstract. In this paper, we investigate GA-driven fuzzy-neural networks, namely Fuzzy Set-based Polynomial Neural Networks (FSPNN), with information granules for the software engineering field, where the dimension of the dataset is high. FSPNNs are based on a fuzzy set-based polynomial neuron (FSPN) whose fuzzy rules include the information granules obtained through Information Granulation; the information granules are capable of representing specific characteristics of the system. We have developed a design methodology (genetic optimization using real number gene-type Genetic Algorithms) to find the optimal structure of the fuzzy-neural networks: the number of input variables, the order of the polynomial, the number of membership functions, and a collection of the specific subset of input variables. The augmented, genetically developed FSPNN (gFSPNN) with the aid of information granules is structurally optimized, and the information granules obtained by information granulation help the GA-driven FSPNN achieve good approximation in the field of software engineering. The GA-based design procedure applied at each layer of the FSPNN leads to the selection of the most suitable nodes (FSPNs) available within the FSPNN. Real number genetic algorithms are capable of reducing the solution space more than conventional genetic algorithms with binary gene-type chromosomes. The performance of the GA-driven FSPNN (gFSPNN) with the aid of real number genetic algorithms is quantified through experimentation using the Boston housing data.
1 Introduction

Recently, a great deal of attention has been directed towards the use of Computational Intelligence, such as fuzzy sets, neural networks, and evolutionary optimization, for system modeling on high-dimensional input-output spaces. Many researchers in system modeling have been interested in the multitude of challenging and conflicting objectives, such as compactness, approximation ability, and generalization

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 186–195, 2007. © Springer-Verlag Berlin Heidelberg 2007
GA-Driven FSPNN with Information Granules for Multi-variable Software Process
187
capability, and so on, which they wish to satisfy. Fuzzy sets emphasize the aspect of linguistic transparency of models and the role of a model designer whose prior knowledge about the system may be very helpful in facilitating all identification pursuits. It is difficult to build a fuzzy model that has good approximation ability and superior generalization capability in a multi-dimensional setting, and building models with substantial approximation capabilities on multi-dimensional fields calls for advanced tools. One of the representative and sophisticated design approaches is the family of fuzzy polynomial neuron (FPN)-based self-organizing neural networks (abbreviated as FPNN or SOPNN and treated as a new category of neuro-fuzzy networks) [1], [2], [3], [4]. The design procedure of the FPNNs tends to produce overly complex networks and comes with a repetitive computation load caused by the trial-and-error method that is part of the development process. The latter is in essence inherited from the original GMDH algorithm, which requires some repetitive parameter adjustment to be completed by the designer. In this study, to address the above problems of the conventional SOPNN (especially the FPN-based SOPNN called "FPNN") [1], [2], [3], [4] as well as the GMDH algorithm, we introduce a new genetic design approach together with a new FSPN structure treated as an FPN within the FPNN. Bearing this new design in mind, we will refer to such networks as GA-driven FPNN with fuzzy set-based PNs ("gFPNN" for short). We also introduce a new structure of fuzzy rules. This structure, based on the fuzzy set-based approach, changes the viewpoint of input space division; from this new understanding of fuzzy rules, the information granules are absorbed into the respective fuzzy rules.
The determination of the optimal values of the parameters available within an individual FSPN (viz. the number of input variables, the order of the polynomial corresponding to the type of fuzzy inference method, the number of membership functions (MFs), and a collection of the specific subset of input variables) leads to a structurally and parametrically optimized network. The network is directly contrasted with several existing neuro-fuzzy models reported in the literature.
2 The Architecture and Development of Fuzzy Set-Based Polynomial Neural Networks (FSPNN)

The FSPN encapsulates a family of nonlinear "if-then" rules. When put together, FSPNs result in a self-organizing Fuzzy Set-based Polynomial Neural Network (FSPNN). Each rule reads in the form

if $x_p$ is $A_k$ then $z$ is $P_{pk}(x_i, x_j, \mathbf{a}_{pk})$,
if $x_q$ is $B_k$ then $z$ is $P_{qk}(x_i, x_j, \mathbf{a}_{qk})$, (1)

where $\mathbf{a}_{qk}$ is a vector of the parameters of the conclusion part of the rule, while $P(x_i, x_j, \mathbf{a})$ denotes the regression polynomial forming the consequent part of the fuzzy rule. The activation levels of the rules contribute to the output of the FSPN being computed
188
S.-B. Roh, S.-K. Oh, and T.-C. Ahn
Fig. 1. A general topology of the FSPN-based FPNN along with the structure of the generic FSPN module (F: fuzzy set-based processing part, P: the polynomial form of mapping)

Table 1. Different forms of the regression polynomials forming the consequence part of the fuzzy rules

Order of polynomial \ No. of inputs:   1           2               3
0 (Type 1)                             Constant    Constant        Constant
1 (Type 2)                             Linear      Bilinear        Trilinear
2 (Type 3)                             Quadratic   Biquadratic-1   Triquadratic-1
2 (Type 4)                             Quadratic   Biquadratic-2   Triquadratic-2

1: Basic type, 2: Modified type
as a weighted average of the individual condition parts (functional transformations) $P_K$ (note that the index of the rule, namely "k", is a shorthand notation for the two indices of fuzzy sets used in the rule (1), that is, $K = (l, k)$):

$$z = \sum_{l=1}^{L}\frac{\sum_{k=1}^{K_l}\mu_{(l,k)}\,P_{(l,k)}(x_i, x_j, \mathbf{a}_{(l,k)})}{\sum_{k=1}^{K_l}\mu_{(l,k)}} = \sum_{l=1}^{L}\sum_{k=1}^{K_l}\tilde{\mu}_{(l,k)}\,P_{(l,k)}(x_i, x_j, \mathbf{a}_{(l,k)}) \tag{2}$$

where $L$ is the total number of inputs and $K_l$ is the number of rules related to input $l$.
In the above expression, we use an abbreviated notation to describe the normalized activation level of the "k"th rule:

$$\tilde{\mu}_{(l,k)} = \frac{\mu_{(l,k)}}{\sum_{k=1}^{K_l}\mu_{(l,k)}} \tag{3}$$
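Equations (2)-(3) amount to normalizing the activations per input and then summing the weighted polynomial consequents. A minimal sketch (the membership values and consequent polynomial values below are illustrative placeholders):

```python
import numpy as np

def fspn_output(mu, P_vals):
    """Compute the FSPN output z of Eq. (2).

    mu:     rule activations per input, shape (n_inputs, n_rules_per_input)
    P_vals: values of the consequent polynomials, same shape
    """
    mu = np.asarray(mu, float)
    P = np.asarray(P_vals, float)
    mu_norm = mu / mu.sum(axis=1, keepdims=True)   # normalization of Eq. (3)
    return float((mu_norm * P).sum())               # weighted sum of Eq. (2)

# Two inputs, two fuzzy sets each; consequent values are illustrative
mu = [[0.2, 0.8], [0.6, 0.4]]
P  = [[1.0, 2.0], [3.0, 1.0]]
print(fspn_output(mu, P))   # 0.2*1 + 0.8*2 + 0.6*3 + 0.4*1 = 4.0
```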
3 Information Granulation Through Hard C-Means Clustering Algorithm

Information granules are defined informally as linked collections of objects (data points, in particular) drawn together by the criteria of indistinguishability, similarity or functionality [12].

- Definition of the premise and consequent parts of fuzzy rules using Information Granulation

The fuzzy rules of the Information Granulation-based FSPN are as follows:

if $x_p$ is $A^*_k$ then $z - m_{pk} = P_{pk}((x_i - v_{ipk}), (x_j - v_{jpk}), \mathbf{a}_{pk})$,
if $x_q$ is $B^*_k$ then $z - m_{qk} = P_{qk}((x_i - v_{iqk}), (x_j - v_{jqk}), \mathbf{a}_{qk})$, (4)
where $A^*_k$ and $B^*_k$ denote fuzzy sets whose apexes are defined as the center points of the information granules (clusters); $m_{pk}$ is the center point related to the output variable on cluster$_{pk}$, $v_{ipk}$ is the center point related to the i-th input variable on cluster$_{pk}$, and $\mathbf{a}_{qk}$ is a vector of the parameters of the conclusion part of the rule, while $P((x_i - v_i), (x_j - v_j), \mathbf{a})$ denotes the regression polynomial forming the consequent part of the fuzzy rule. The given inputs are $X = [x_1\ x_2\ \ldots\ x_m]$ related to a certain application, and the output is $Y = [y_1\ y_2\ \ldots\ y_n]^T$.

Step 1) Build the universe set.
Step 2) Build m reference data pairs composed of $[x_1; Y]$, $[x_2; Y]$, ..., and $[x_m; Y]$.
Step 3) Classify the universe set U into l clusters such as $c_{i1}, c_{i2}, \ldots, c_{il}$ (subsets) by using HCM according to the reference data pair $[x_i; Y]$.
Step 4) Construct the premise part of the fuzzy rules related to the i-th input variable ($x_i$) using the center points obtained directly from HCM.
Step 5) Construct the consequent part of the fuzzy rules related to the i-th input variable ($x_i$).
Sub-step 1) Build a matrix as in (5) according to the clustered subsets:

$$A_j^i = \begin{bmatrix} x_{21} & x_{22} & \cdots & x_{2m} & y_2 \\ x_{51} & x_{52} & \cdots & x_{5m} & y_5 \\ x_{k1} & x_{k2} & \cdots & x_{km} & y_k \\ \vdots & \vdots & & \vdots & \vdots \end{bmatrix} \tag{5}$$
190
S.-B. Roh, S.-K. Oh, and T.-C. Ahn
where $\{x_{k1}, x_{k2}, \ldots, x_{km}, y_k\} \in c_{ij}$ and $A_j^i$ denotes the membership matrix of the j-th subset related to the i-th input variable.
Sub-step 2) Take the arithmetic mean of each column of $A_j^i$. The mean of each column is the additional center point of subset $c_{ij}$:

$$\text{center points} = \begin{bmatrix} v_{ij}^1 & v_{ij}^2 & \cdots & v_{ij}^m & m_{ij} \end{bmatrix} \tag{6}$$
Step 6) If i equals m, terminate; otherwise, set i = i + 1 and return to Step 3.
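Steps 3-5 above reduce to running hard C-means on each reference pair [x_i; Y] and taking per-cluster column means. A minimal HCM sketch (the data and cluster count are illustrative, and this is a generic HCM, not the authors' exact implementation):

```python
import numpy as np

def hcm(data, c, iters=50, seed=0):
    """Hard C-means: returns the c cluster centers of `data` (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), c, replace=False)]   # random initial centers
    for _ in range(iters):
        # assign each sample to its nearest center
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the per-cluster column mean (Sub-step 2)
        for j in range(c):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers

# A reference pair [x_i; Y]: one input column paired with the output column
pair = np.column_stack([np.linspace(0, 1, 20), np.linspace(0, 2, 20)])
centers = hcm(pair, c=2)
print(centers)   # rows give the fuzzy-set apexes v_ij and consequent offsets m_ij
```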
4 Genetic Optimization of FSPNN with the Aid of Real Number Gene-Type Genetic Algorithms

Let us briefly recall that GAs are a stochastic search technique based on the principles of evolution, natural selection, and genetic recombination, simulating "survival of the fittest" in a population of potential solutions (individuals) to the given problem. GAs are aimed at the global exploration of a solution space; they help pursue potentially fruitful search paths while examining randomly selected points in order to reduce the likelihood of being trapped in local minima. The main features of genetic algorithms concern individuals viewed as strings, population-based optimization (where the search is realized through the genotype space), and stochastic search mechanisms (selection and crossover). Conventional genetic algorithms use binary gene-type chromosomes, whereas real number gene-type genetic algorithms use real number chromosomes instead, which allows the solution space to be reduced; that is the important advantage of real number gene-type genetic algorithms. In order to enhance the learning of the FPNN, we use GAs to complete the structural optimization of the network by optimally selecting such parameters as the number of input variables (nodes), the order of the polynomial, and the input variables within an FSPN. In this study, the GA uses the serial method of binary type, roulette-wheel selection in the selection process, one-point crossover in the crossover operation, and binary inversion (complementation) in the mutation operator. To retain the best individual
Fig. 2. Overall genetically-driven structural optimization process of FSPNN
and carry it over to the next generation, we use an elitist strategy [3], [8]. The overall genetically-driven structural optimization process of the FPNN is visualized in Fig. 2.
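The genetic loop described above (roulette-wheel selection, one-point crossover, mutation, and an elitist strategy over real-number chromosomes) can be sketched as follows; the fitness function here is a stand-in objective, not the FSPNN performance index:

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(x):                     # stand-in objective: maximize -sum(x^2)
    return -np.sum(x**2)

pop = rng.uniform(-5, 5, size=(20, 4))        # real-number chromosomes
for gen in range(100):
    fit = np.array([fitness(ind) for ind in pop])
    elite = pop[fit.argmax()].copy()          # elitist strategy: keep the best
    # roulette-wheel selection on shifted-positive fitness
    p = fit - fit.min() + 1e-9
    parents = pop[rng.choice(len(pop), size=len(pop), p=p / p.sum())]
    # one-point crossover between consecutive parents
    children = parents.copy()
    for i in range(0, len(pop) - 1, 2):
        cut = rng.integers(1, pop.shape[1])
        children[i, cut:], children[i + 1, cut:] = (
            parents[i + 1, cut:].copy(), parents[i, cut:].copy())
    # mutation: small Gaussian perturbation of randomly chosen genes
    mask = rng.random(children.shape) < 0.1
    children[mask] += rng.normal(0, 0.5, mask.sum())
    children[0] = elite                       # carry the best individual over
    pop = children

best = pop[np.array([fitness(ind) for ind in pop]).argmax()]
print(best)
```

With elitism, the best fitness in the population never decreases from one generation to the next, which mirrors the layer-wise retention of the best FSPNs described in the text.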
5 Design Procedure of GA-Driven FPNN (gFPNN)

The framework of the design procedure of the GA-driven Fuzzy Polynomial Neural Networks (FPNN) with fuzzy set-based PNs (FSPNs) comprises the following steps:

[Step 1] Determine the system's input variables.
[Step 2] Form training and testing data.
[Step 3] Specify the initial design parameters:
- Fuzzy inference method
- Type of membership function: triangular or Gaussian-like MFs
- Number of MFs allocated to each input of a node
- Structure of the consequence part of the fuzzy rules
Fig. 3. The FSPN design–structural considerations and mapping the structure on a chromosome
[Step 4] Decide upon the FSPN structure through the use of the genetic design.
[Step 5] Carry out fuzzy-set-based fuzzy inference and coefficient parameter estimation for fuzzy identification in the selected node (FSPN).
[Step 6] Select nodes (FSPNs) with the best predictive capability and construct their corresponding layer.
[Step 7] Check the termination criterion.
[Step 8] Determine new input variables for the next layer.

Finally, an overall design flowchart of the genetic optimization of the FSPNN is shown in Fig. 4.
Fig. 4. An overall design flowchart for the genetic optimization of the FPNN architecture
GA-Driven FSPNN with Information Granules for Multi-variable Software Process
193
Table 2. System's variables description

CRIM      Per capita crime rate by town
ZN        Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     Proportion of non-retail business acres per town
NOX       Nitric oxides concentration (parts per 10 million)
CHAS      Charles River dummy variable (1 if tract bounds river, 0 otherwise)
RM        Average number of rooms per dwelling
AGE       Proportion of owner-occupied units built prior to 1940
DIS       Weighted distance to five Boston employment centers
RAD       Index of accessibility to radial highways
TAX       Full-value property-tax rate per $10,000
PTRATIO   Pupil-teacher ratio by town
B         1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      Median value of owner-occupied homes in $1000s
Fig. 5. Performance index of IG-gFSPNN (with Type T) with respect to the increase of number of layers
6 Experimental Studies

In this experiment, we investigate the Boston housing data set in the software engineering context [6]. It concerns a description of real estate in the Boston area, where housing is characterized by a number of features including crime rate, size of lots, number of rooms, age of houses, etc., and the median price of houses. The Boston dataset consists of 504 14-dimensional points, each dimension representing a single attribute (Table 2). The construction of the fuzzy model is completed for 336 data points treated as a training set. The rest of the data set (i.e., 168 data points) is retained for testing purposes. Fig. 5 depicts the performance index of each layer of the Information Granules-based gFSPNN with Type T as the maximal number of inputs to be selected increases. Fig. 6 illustrates the different optimization processes of the gFSPNN and the proposed IG-gFSPNN by visualizing the values of the performance index obtained in successive generations of the GA when using Type T*.
Fig. 6. The different optimization process between gFSPNN and IG-gFSPNN quantified by the values of the performance index (in case of using Gaussian MF with Max=4 and Type T)
When triangular MFs and Max=4 are used in the IG-gFSPNN, the minimal values of the performance indices, PI=3.5071 and EPI=16.9334, are obtained. When Gaussian-like MFs and Max=5 are used, the best results are PI=2.5726 and EPI=18.0604.
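The 336/168 split and the training/testing performance indices (PI/EPI) reported above can be illustrated on synthetic stand-in data (the Boston attributes themselves are not reproduced here, and the simple least-squares model is a placeholder for the FSPNN):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((504, 13))              # stand-in for the 13 input attributes
y = X @ rng.standard_normal(13) + 0.1 * rng.standard_normal(504)

idx = rng.permutation(504)
tr, te = idx[:336], idx[336:]                   # 336 training / 168 testing points

# fit a model on the training set only
w = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]
PI  = np.mean((y[tr] - X[tr] @ w) ** 2)         # training performance index
EPI = np.mean((y[te] - X[te] @ w) ** 2)         # testing performance index
print(PI, EPI)
```

Comparing PI against EPI in this way is what distinguishes approximation ability from generalization capability in the discussion above.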
7 Concluding Remarks

In this study, we have investigated the real number gene-type GA-based design procedure of Fuzzy Set-based Polynomial Neural Networks with information granules (IG-FSPNN), along with its architectural considerations. The design methodology emerges as a hybrid structural optimization framework (based on the GMDH method and genetic optimization) with parametric learning, regarded as a two-phase design procedure. The GMDH method is comprised of both a structural phase, such as
a self-organizing and evolutionary algorithm (rooted in the natural law of survival of the fittest), and the ensuing parametric phase of Least Square Estimation (LSE)-based learning. The comprehensive experimental studies involving well-known datasets quantify the strong performance of the network when compared with existing fuzzy and neuro-fuzzy models. Most importantly, the proposed framework of genetic optimization supports an efficient structural search resulting in structurally and parametrically optimal architectures of the networks.

Acknowledgements. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00194).
References

1. Oh, S.-K., Pedrycz, W.: Self-organizing Polynomial Neural Networks Based on PNs or FPNs: Analysis and Design. Fuzzy Sets and Systems 142(2) (2004) 163-198
2. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, Berlin Heidelberg New York (1996)
3. De Jong, K.A.: Are Genetic Algorithms Function Optimizers? In: Manner, R., Manderick, B. (eds.): Parallel Problem Solving from Nature 2. North-Holland, Amsterdam (1992)
4. Oh, S.K., Pedrycz, W.: Fuzzy Polynomial Neuron-Based Self-Organizing Neural Networks. Int. J. of General Systems 32 (2003) 237-250
5. Jang, J.S.R.: ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Trans. on Systems, Man and Cybernetics 23(3) (1993) 665-685
6. Pedrycz, W., Reformat, M.: Evolutionary Fuzzy Modeling. IEEE Trans. on Fuzzy Systems 11(5) (2003) 652-665
7. Oh, S.K., Pedrycz, W.: The Design of Self-organizing Polynomial Neural Networks. Information Sciences 141 (2002) 237-258
8. Sugeno, M., Yasukawa, T.: A Fuzzy-Logic-Based Approach to Qualitative Modeling. IEEE Trans. Fuzzy Systems 1(1) (1993) 7-31
9. Park, B.-J., Pedrycz, W., Oh, S.-K.: Fuzzy Polynomial Neural Networks: Hybrid Architectures of Fuzzy Modeling. IEEE Transactions on Fuzzy Systems 10(5) (2002) 607-621
10. Lapedes, A.S., Farber, R.: Non-linear Signal Processing Using Neural Networks: Prediction and System Modeling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, New Mexico (1987)
11. Zadeh, L.A.: Toward a Theory of Fuzzy Information Granulation and its Centrality in Human Reasoning and Fuzzy Logic. Fuzzy Sets and Systems 90 (1997) 111-117
12. Park, B.J., Lee, D.Y., Oh, S.K.: Rule-based Fuzzy Polynomial Neural Networks in Modeling Software Process Data. Int. J. of Control, Automation, and Systems 1(3) (2003) 321-331
The ANN Inverse Control of Induction Motor with Robust Flux Observer Based on ESO Xin Wang and Xianzhong Dai School of Automation, Southeast University Nanjing, 210096, P.R. China
[email protected], [email protected]

Abstract. When flux and speed are measurable, an artificial neural network inverse system (ANNIS) can almost linearize and decouple (L&D) an induction motor despite parameter variations. In practice, the rotor flux cannot be measured and is difficult to estimate accurately when parameters vary. An inaccurate flux estimate affects the ANNIS, the coordinate transformation, and the outer rotor flux loop, further degrading performance. Based on this, an artificial neural network inverse control (ANNIC) method for the induction motor with a robust flux observer based on an extended state observer (ESO) is proposed. The observer can estimate the rotor flux accurately when uncertainty exists. The proposed control method is expected to enhance the robustness and improve the performance of the whole control system. Finally, the feasibility of the proposed control method is confirmed by simulation.

Keywords: neural network inverse, extended state observer, linearize and decouple, induction motor, robust, simulation.
1 Introduction

In the last decades, many high-performance control methods for induction motors have been proposed, including field-oriented control (FOC), direct torque control (DTC), and other nonlinear control methods [1], [2]. In their original versions, these methods do not take the variation of machine parameters into account; that is, they depend on an exactly known model of the induction motor, so when the electrical parameters of the AC drive vary, the performance deteriorates. To overcome this, improved versions of these methods [3], [4] were proposed to achieve robust control systems, which are expected to obtain high performance when parameters vary under various operating conditions. The ANNIC of the induction motor is one of them [5]. Compared with the others, it has the following advantages: 1) it is more robust than analytic inverse system control, since the ANNIS can still almost L&D the induction motor system when the parameters of the controlled plant vary; 2) it extends the asymptotic decoupling and linearization of FOC to a global one; 3) it is simpler than other nonlinear adaptive controllers and robust to the variation of all parameters, unmodelled dynamics, etc. in practical applications. On the other hand, like other high-performance control methods, the ANNIC also needs an accurately estimated flux; an inaccurate one can influence the ANNIS, the

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 196–205, 2007. © Springer-Verlag Berlin Heidelberg 2007
The ANN Inverse Control of Induction Motor with Robust Flux Observer
197
coordinate transformation and the outer flux loop, consequently make the performance of whole system degrade. So it is necessary to design the ANNIC with robust flux observer to improve the performance of whole system.
2 Rotor Flux Observer Based on ESO

In the past, various flux observers have been proposed to estimate the flux [6], [7]. Ideally, a good rotor flux observer should use as few inputs as possible, be robust to parameter variations, and have the lightest possible computational burden. Because the rotor resistance can vary by up to 100% with temperature, a rotor flux observer based on the ESO is designed first in this paper; it can estimate the rotor flux accurately despite uncertainty in the rotor flux dynamics. The ESO is the core of the auto-disturbance rejection controller (ADRC) and is used as an essential part of ADRC in high-performance control of induction motor drives [9], [10]. The ESO is based on the concept of generalized derivatives and generalized functions; it is a nonlinear configuration for observing the states and disturbances of the system under control without knowledge of the exact system parameters. In this section, the theory of the ESO is used to handle the uncertainty in the rotor flux dynamics of the induction motor. The representation of the induction motor in the α, β two-phase stationary coordinate frame is given by
    di_s/dt = −γ i_s + (βτ − J β n_p ω_m) ψ_r + η u_s
    dψ_r/dt = τ L_m i_s − (τ − J n_p ω_m) ψ_r
    dω_m/dt = μ (ψ_r ⊗ i_s) − T_l / J    (1)
where

    β = L_m/(σ L_s L_r),  μ = n_p L_m/(J L_r),  γ = L_m² R_r/(σ L_s L_r²) + R_s/(σ L_s),  η = 1/(σ L_s),  τ = R_r/L_r,  J = [0 −1; 1 0]
R_s, R_r are the stator and rotor resistances; L_s, L_r, L_m are the stator, rotor, and mutual inductances; ω_m is the rotor mechanical angular velocity; n_p is the number of pole pairs; T_l is the load torque; J is the motor-load inertia; σ = 1 − L_m²/(L_s L_r) is the leakage coefficient; u_s = [u_sα, u_sβ]ᵀ, ψ_r = [ψ_rα, ψ_rβ]ᵀ, i_s = [i_sα, i_sβ]ᵀ. Note that the α, β appearing as subscripts are different from the α, β used elsewhere in the text. The parameters in equation (1), especially the rotor resistance, vary while the motor is in operation. The first row of (1), the current dynamic equation, is rewritten as follows so as to group the terms containing R_r:

    di_s/dt = −R_s η i_s − β [τ L_m i_s − (τ − J n_p ω_m) ψ_r] + η u_s    (2)
X. Wang and X. Dai
Let

    a(t) = τ L_m i_s − (τ − J n_p ω_m) ψ_r    (3)
which includes all the uncertainty arising from the related machine parameters; in particular, it takes the variation of R_r into account. Letting da(t)/dt = b(t), (2) can be extended to

    di_s/dt = −R_s η i_s − β a(t) + η u_s
    da(t)/dt = b(t)    (4)
The ESO of (4) is:

    dî_s/dt = −R_s η î_s − β â(t) + η u_s + g₁(î_s − i_s)
    dâ(t)/dt = g₂(î_s − i_s)    (5)
where in general g_i(x̂ − x) = β_i fal(x̂ − x, α, δ), i = 1, 2, and

    fal(ε, α, δ) = |ε|^α sgn(ε),   |ε| > δ
    fal(ε, α, δ) = ε / δ^(1−α),    |ε| ≤ δ    (6)
where ε = x̂ − x and sgn(ε) is the signum function. The exponent α ∈ (0, 1) and the scaling factors β_i determine the convergence speed of the ESO; the parameter δ determines its nonlinear region. Generally, δ is set to approximately 10% of the variation range of the input signal. Here a(t), the derivative of the rotor flux, varies within a given region. By choosing α, β₁, β₂, δ carefully, (5) can be made to approximate the state i_s and the uncertainty a(t) of the practical system, and the estimated rotor flux ψ̂_r is obtained by integrating â(t):

    ψ̂_r = ∫₀ᵗ â(t) dt    (7)
One can see from (5) that the ESO does not contain the rotor resistance R_r, so it is robust to R_r. The observed modulus and position of the rotor flux are
    |ψ̂_r| = √(ψ̂_rα² + ψ̂_rβ²)    (8.1)
    θ̂_s = ∫ ω̂_s dt = arctan(ψ̂_rβ / ψ̂_rα)    (8.2)
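As an illustration, the observer equations (5)–(7) can be sketched numerically. This is a minimal sketch, not the authors' implementation: the function names, the forward-Euler discretization, the scalar handling of β, and the sign of the correction gains (which must in practice be chosen so that î_s converges to i_s) are all assumptions.

```python
import numpy as np

def fal(e, alpha, delta):
    """Nonlinear function of Eq. (6): power law outside the band |e| > delta,
    linear inside it to avoid high-frequency chattering near the origin."""
    e = np.asarray(e, dtype=float)
    return np.where(np.abs(e) > delta,
                    np.abs(e) ** alpha * np.sign(e),
                    e / delta ** (1.0 - alpha))

def eso_step(i_hat, a_hat, i_meas, u_s, dt,
             Rs=5.9, eta=1.0, beta=1.0, beta1=75.0, beta2=375.0,
             alpha=0.5, delta=0.1):
    """One forward-Euler step of the ESO in Eq. (5), alpha-beta frame.
    i_hat, a_hat: current estimates of the stator current and of the
    uncertain term a(t); integrating a_hat over time gives the rotor
    flux estimate of Eq. (7)."""
    err = i_hat - i_meas
    di = -Rs * eta * i_hat - beta * a_hat + eta * u_s - beta1 * fal(err, alpha, delta)
    da = -beta2 * fal(err, alpha, delta)
    return i_hat + dt * di, a_hat + dt * da
```

Accumulating `a_hat * dt` over the run yields the flux estimate of (7), from which the modulus and angle of (8.1)–(8.2) follow directly.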
3 The ANNIC of Induction Motor

A detailed introduction to the design of the ANNIC of a plant was given in [8]. The steps for designing an ANNIC of the induction motor, assuming the flux is measurable, are given below.
3.1 Analytic Inverse Expression of Induction Motor
The model of the induction motor in the M-T coordinate frame can be represented as

    di_sm/dt = −γ i_sm + ω_s i_st + τ β ψ_r + η u_sm
    di_st/dt = −ω_s i_sm − γ i_st − n_p ω_m β ψ_r + η u_st
    dψ_r/dt  = τ L_m i_sm − τ ψ_r
    dω_m/dt  = μ i_st ψ_r − T_l / J    (9)
where ω_s = n_p ω_m + L_m R_r i_st/(L_r ψ_r) is the rotor flux rotating velocity (synchronous velocity). It is calculated here in the M-T frame and equals the one calculated in Section 2. i_sm, i_st are the M-axis and T-axis components of the stator currents; ψ_r is the rotor flux projected on the M-axis, i.e., the modulus of the rotor flux; u_sm, u_st are the M-axis and T-axis components of the stator voltages. In equation (9), let the state vector be X = [x₁, x₂, x₃, x₄]ᵀ = [i_sm, i_st, ψ_r, ω_m]ᵀ, the input vector u_s = [u₁, u₂]ᵀ = [u_sm, u_st]ᵀ, and the output vector y = [y₁, y₂]ᵀ = [ψ_r, ω_m]ᵀ. Then the state equation (9) can be written compactly as

    dX/dt = f(X, u_s, θ)    (10)

The output equation is

    y = [y₁, y₂]ᵀ = [ψ_r, ω_m]ᵀ    (11)
where θ is the vector of motor parameters; for convenience, it is assumed that T_l = 0 in this paper. The input-output type static analytic inverse expressions are expressed as

    u₁ = σ L_s [ AB/L_r − (n_p y₂ + L_m R_r C/(L_r y₁²)) C/y₁ − R_r(β L_m + 1) y₁/(L_m L_r) + L_r v₁/(R_r L_m) ]
    u₂ = σ L_s [ AC/(L_r y₁) + n_p y₂ (B + n_p β y₁) + v₂/(μ J y₁) ]    (12)
where

    A = γ L_r + R_r,  B = (R_r y₁ + L_r ẏ₁)/(L_m R_r),  C = (T_l + J ẏ₂)/(μ J)
The compact form of the inverse system expression can be written as

    u_s = G(y, ẏ, v, θ),  v = ÿ    (13)
3.2 The Design of ANNIC
According to (13), the relative degree of the system is 4, so the static neural network has 6 inputs and 2 outputs. A three-layer feedforward BP (backpropagation) neural network is chosen; the activation functions of the hidden and output layers are tansig() and purelin() respectively. The static neural network used to approximate the analytic inverse control is then expressed as

    u_sNN = NN(y, ẏ, v) = purelin(W₂ᵀ(tansig(W₁ᵀ Y + B₁)) + B₂)    (14)

where Y = [y, ẏ, v]ᵀ, B₁, B₂ are bias vectors, and W = [W₁ᵀ, W₂ᵀ]ᵀ; thus the problem is
transformed into searching for the optimum weight matrix Ŵ in the space Ω that satisfies equation (15):

    Ŵ = arg min_{W∈Ω} ( sup_{Y∈D} ‖ purelin(W₂ᵀ(tansig(W₁ᵀ Y + B₁)) + B₂) − u_s ‖ )    (15)
where D is the space spanned by the input data used to train the neural network. An appropriate algorithm is chosen to train the network; thus the analytic inverse system controller is replaced by the static neural network, and the number of additional integrators is determined according to the relative degree. The designed input-output integrated ANNIS can almost L&D the induction motor into a flux subsystem and a speed subsystem when flux and speed are measurable, so the flux and speed regulators can be designed separately with linear system theory. Comparing (14) with (13), we conclude that by introducing the ANN, the effect of θ is eliminated and replaced by the weight matrices of the ANN. Because a neural network is essentially an adaptive system, with advantages such as self-learning, fault tolerance, and robustness, replacing the analytic inverse system with the ANNIS improves the capability of rejecting parameter variations.
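For illustration, the forward pass of the 6-13-2 static network in (14) can be sketched as follows. This is a hypothetical sketch with random weights; in the paper the weights come from offline training, and tansig/purelin are emulated here by tanh and the identity, which is what those MATLAB activations compute.

```python
import numpy as np

def tansig(x):
    return np.tanh(x)   # MATLAB's tansig is numerically equivalent to tanh

def purelin(x):
    return x            # MATLAB's purelin is the identity (linear) activation

def annis_forward(Y, W1, B1, W2, B2):
    """Static inverse network of Eq. (14): Y = [y, y', v] (6 entries),
    output u_sNN = [u_sm, u_st]. Shapes: W1 (6, H), B1 (H,), W2 (H, 2), B2 (2,)."""
    return purelin(W2.T @ tansig(W1.T @ Y + B1) + B2)

# Hypothetical random 6-13-2 network (the structure found in Section 4.3)
rng = np.random.default_rng(0)
W1, B1 = rng.normal(size=(6, 13)), rng.normal(size=13)
W2, B2 = rng.normal(size=(13, 2)), rng.normal(size=2)
u = annis_forward(rng.normal(size=6), W1, B1, W2, B2)
```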
4 The ANNIC of Induction Motor with Robust Flux Observer

Following the method described in Section 3, in this section we propose an ANNIC of the induction motor with the robust rotor flux observer designed in Section 2, with the expectation of making the control system more robust and its performance better. The design of the ANNIC with the robust flux observer based on the ESO is as follows.

4.1 Design of Exciting Signals
The flux and speed exciting signals are designed according to the operating region of the motor and identification theory. Closed-loop identification is chosen so that the motor does not run out of its operating region. First, step reference signals of flux and speed are input to the analytic inverse control of the induction motor with the ESO-based flux observer, and the steady-state and dynamic parameters are ascertained from the obtained response curves. Uniform random exciting signals are then chosen by simulation inspection; their amplitudes are 0.1-1 Wb for flux and 0-150 rad/s for speed, with variation periods of 1 s and 0.9 s respectively.

Fig. 1. The input and output of the excited induction motor: (a) the stator voltages added to the induction motor, (b) the flux and speed response of the induction motor

The exciting signals added to the induction motor are shown in Fig. 1(a), where the solid line is the M-axis stator voltage component and the dashed line is the T-axis stator voltage component. The output signals of the motor are shown in Fig. 1(b), where the solid line is the flux response curve and the dashed line is the rotor speed response curve.

4.2 Data Sampling and Handling
The input and output signals of the induction motor are sampled at a rate much higher than that used in the subsequent control. The derivatives of the output variables {ψ̂_r, ω_m} are calculated up to the second order. Since the derivatives are calculated offline, a good numerical differentiation algorithm can be chosen to ensure the accuracy of the derivatives. All the sampled and calculated data are reassembled to form the training data set {ψ̂_r, ψ̂'_r, ψ̂''_r, ω_m, ω'_m, ω''_m} → {u_sm, u_st}; the former are the input data and the latter the desired outputs of the static ANN. For easier convergence, the data sets are normalized into [−4, +4].

4.3 Training and Testing of ANN
The ANN's structure is ascertained according to Section 3.2. The selection of the number of hidden-layer neurons is a compromise among output precision, training time, and generalization capability. With too few neurons, the network takes a long time to converge or does not converge to a satisfactory error; on the other hand, if the number of neurons is large or the training error is made very small, the ANN memorizes the training vectors and gives a large error on generalization vectors. The number of hidden-layer neurons is determined by trial and error; the final structure of the ANN is 6-13-2. The sampled and handled data were divided into two groups, one used for training and the other for testing. The LM (Levenberg-Marquardt) algorithm is chosen to train the static ANN offline, with 2000 training steps. The resulting ANN has a training error of 5.23832e-5.
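The offline data handling of Section 4.2, numerical differentiation and normalization to [−4, +4], can be sketched as below. This is a sketch: `np.gradient` as the central-difference differentiator and the min-max scaling formula are assumptions about how the handling was done.

```python
import numpy as np

def derivative(x, dt):
    """Offline numerical derivative (second-order central differences);
    applied once for first derivatives, twice for second derivatives."""
    return np.gradient(np.asarray(x, dtype=float), dt)

def normalize(x, lo=-4.0, hi=4.0):
    """Min-max scale a sampled signal into [lo, hi] for easier BP convergence."""
    x = np.asarray(x, dtype=float)
    return lo + (hi - lo) * (x - x.min()) / (x.max() - x.min())
```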
4.4 Design of Flux and Speed Regulators
The designed ANNIS using the estimated flux can adaptively L&D the motor into a flux subsystem G_ψ(s) ≈ 1/s² and a speed subsystem G_ω(s) ≈ 1/s². Proportional-derivative (PD) regulators are chosen to adjust the two subsystems; their parameters are K_pω = K_pψ = 1200 and K_dω = K_dψ = 50, where the subscripts ω, ψ denote the regulators of the speed and flux subsystems respectively. The ANNIC of the induction motor with the ESO-based observer is shown in Fig. 2.
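A quick check that these gains behave well on the idealized subsystem G(s) ≈ 1/s² can be sketched as follows (a sketch under the stated gains; the forward-Euler step and the step-reference scenario are assumptions):

```python
def simulate_pd(Kp=1200.0, Kd=50.0, r=1.0, dt=1e-4, T=1.0):
    """Forward-Euler simulation of a PD loop closed around y'' = v,
    the linearized subsystem produced by the ANNIS."""
    y, ydot = 0.0, 0.0
    for _ in range(int(T / dt)):
        v = Kp * (r - y) - Kd * ydot   # PD law on the step reference r
        ydot += v * dt
        y += ydot * dt
    return y
```

With these gains the closed-loop characteristic polynomial is s² + 50s + 1200 (ζ ≈ 0.72, ω_n ≈ 34.6 rad/s), so the step response settles within roughly 0.2 s.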
Fig. 2. The input-output type ANNIC of induction motor with robust flux observer based on ESO

Table 1. The parameters of the induction motor

Rated power         1.1 kW         Stator resistance   5.9 Ω
Rated speed         146.6 rad/s    Rotor resistance    5.6 Ω
Pairs of poles      2              Stator inductance   0.574 H
Motor-load inertia  0.0021 kg·m²   Rotor inductance    0.580 H
Rated load torque   7.5 N·m        Mutual inductance   0.55 H
5 Simulation

The proposed control system is studied by simulation. The simulation algorithm is ode45, the sampling period is 1e-4 s, the chosen motor parameters are listed in Table 1, and the ESO parameters are α = 0.5, δ = 0.1, β₁ = 75, β₂ = 375. In this section, the ANNIC with the flux observer is compared to the one with the real flux under two conditions.

5.1 Comparison Under Rr Constant
The simulation results of the ANNI control of the induction motor with the observer and of the one using the real flux value, with constant rotor resistance, are shown in Fig. 3. Fig. 3(a) shows the flux response of the ANNI control system with the real flux value (dashed line) and of the one with the observer (dotted line); the solid line represents the reference flux. Fig. 3(b) shows the corresponding speed responses; the solid line represents the reference speed.

Fig. 3. The comparison of the ANNI control system with the real flux value and the one with the flux observer when Rr is constant: (a) the flux response curves of the two control systems, (b) the speed response curves of the two control systems

From the results, we can conclude that when the motor model is exactly known, approximate L&D is obtained both by the ANNI control of the induction motor with the flux observer and by the one with the real flux value. The performance of the system with the ESO-based flux observer is comparable to that of the one with the real flux value; the small coupling that appears can be neglected.

5.2 Comparison Under Rr Varying
The simulation results of the ANNI control of the induction motor with the robust observer and of the one using the real flux value, with the rotor resistance varying as in (16), are shown in Fig. 4. Fig. 4(a) shows the flux response of the ANNI control system with the real flux value (dashed line) and of the one with the observer (dotted line); the solid line represents the reference flux. Fig. 4(b) shows the corresponding speed responses; the solid line represents the reference speed.

Fig. 4. The comparison of the ANNIC system with the real flux value and the one with the flux observer when Rr is varying: (a) the flux response curves of the two control systems, (b) the speed response curves of the two control systems
From the above results we conclude that, when the rotor resistance varies, the control performance of the system with the observer is little affected by the rotor resistance; it degrades slightly more than that of the system with the real flux, and the couplings appearing in both systems are acceptable. The inclusion of the observer does not destabilize the system. These properties bring the ANNIC closer to practical implementation.

    R_r = 5.6 + 4t,  t ≤ 1.5
    R_r = 11.6,      t > 1.5    (16)
6 Conclusions

In this paper, we address the problems that the rotor flux of the motor cannot be measured and that machine parameters, especially the rotor resistance, increase with temperature while the motor is in operation. An ANNIC of the induction motor with a robust flux observer based on the ESO, which does not depend strongly on the model of the induction motor, was proposed. A comparative study between the ANNIC of the induction motor with the flux observer and the one using the real flux value was carried out in Matlab/Simulink. The simulation results show that the ANNIC method with the robust observer can almost implement L&D control of the motor and achieves good tracking performance despite the varying rotor resistance. The ANNI control with the ESO-based flux observer possesses strong robustness, so the proposed control system is closer to practical implementation.

Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 60574097), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20050286029), and in part by the National Basic Research Program of China under Grant No. 2002CB312204.
References

1. Bodson, M., Chiasson, J., Novotnak, T.: High-performance Induction Motor Control via Input-output Linearization. IEEE Contr. Syst. Mag. 14(4) (1994) 25-33
2. Taylor, D.: Nonlinear Control of Electric Machines: An Overview. IEEE Contr. Syst. Mag. 14(6) (1994) 41-51
3. Marino, R., Peresada, S., Tomei, P.: Global Adaptive Output Feedback Control of Induction Motors with Uncertain Rotor Resistance. IEEE Trans. Automatic Control 44(5) (1999) 967-983
4. Kwan, C., Lewis, F. L.: Robust Backstepping Control of Nonlinear Systems Using Neural Networks. IEEE Trans. Systems, Man and Cybernetics, Part A 30(6) (2000) 753-766
5. Dai, X., Zhang, X., Liu, G., Zhang, L.: Decoupling Control of Induction Motor Based on Neural Networks Inverse (in Chinese). Proceedings of the CSEE 24(1) (2004) 114-117
6. Du, T., Vas, P., Stronach, F.: Design and Application of Extended Observers for Joint State and Parameter Estimation in High-performance AC Drives. Proc. IEE-Elect. Power Applicat. 142(2) (1995) 71-78
7. Soto, G. G., Mendes, E., Razek, A.: Reduced-Order Observers for Flux, Rotor Resistance and Speed Estimation for Vector Controlled Induction Motor Drives Using the Extended Kalman Filter Technique. Proc. IEE-Elect. Power Applicat. 146(3) (1999) 282-288
8. Dai, X., He, D., Zhang, X., Zhang, T.: MIMO System Invertibility and Decoupling Control Strategies Based on ANN α-th Order Inversion. IEE Proceedings Control Theory and Applications 148(2) (2001) 125-136
9. Feng, G., Liu, Y. F., Huang, L. P.: A New Robust Algorithm to Improve the Dynamic Performance on the Speed Control of Induction Motor Drive. IEEE Trans. Power Electronics 19(6) (2004) 1614-1627
10. Fei, L., Zhang, C. P., Song, W. C., Chen, S. S.: A Robust Rotor Flux Observer of Induction Motor with Unknown Rotor and Stator Resistance. Industrial Electronics Society, IECON '03 (2003) 738-741
Design of Fuzzy Relation-Based Polynomial Neural Networks Using Information Granulation and Symbolic Gene Type Genetic Algorithms

SungKwun Oh¹, InTae Lee¹, Witold Pedrycz², and HyunKi Kim¹

¹ Department of Electrical Engineering, The University of Suwon, San 2-2 Wau-ri, Bongdam-eup, Hwaseong-si, Gyeonggi-do, 445-743, South Korea
[email protected]
² Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2G6, Canada, and Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Abstract. In this study, we introduce and investigate genetically optimized fuzzy relation-based polynomial neural networks designed with the aid of information granulation (IG_gFRPNN), and develop a comprehensive design methodology involving mechanisms of genetic optimization with symbolic gene types. With the aid of information granules obtained by C-Means clustering, we determine the initial locations (apexes) of the membership functions and the initial values of the polynomial functions used in the premise and consequence parts of the fuzzy rules, respectively. The GA-based design procedure, applied at each layer of the IG_gFRPNN, leads to the selection of preferred nodes with specific local characteristics (such as the number of input variables, the order of the polynomial, a specific subset of input variables, and the number of membership functions) available within the network. The performance of the proposed model is contrasted with that of conventional intelligent models reported in the literature.
1 Introduction

While the theory of traditional equation-based approaches is well developed and successful in practice (particularly in linear cases), there has been a great deal of interest in applying model-free methods such as neural and fuzzy techniques to nonlinear function approximation [1]. GMDH was introduced by Ivakhnenko in the early 1970s [2], and GMDH-type algorithms have been used extensively since the mid-1970s for prediction and modeling of complex nonlinear processes. While providing a systematic design procedure, GMDH comes with some drawbacks. To alleviate them, Self-Organizing Neural Networks (SONN, here called FRPNN) were introduced by Oh and Pedrycz [3], [4], [5] as a new category of neural networks or neuro-fuzzy networks. Although the FRPNN has a flexible architecture whose potential can be fully utilized through a systematic design, it is difficult to obtain a structurally and parametrically optimized network because of the limited design of the nodes located in each layer of the FRPNN.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 206–215, 2007. © Springer-Verlag Berlin Heidelberg 2007
Design of Fuzzy Relation-Based Polynomial Neural Networks
In this study, considering the above problems of the conventional FRPNN, we introduce a new structure and organization of fuzzy rules as well as a new genetic design approach. In this new interpretation, information granules are incorporated into the fuzzy rules: in a nutshell, each fuzzy rule describes a related information granule. The determination of the optimal values of the parameters available within an individual FRPN (viz. the number of input variables, the order of the polynomial, a collection of preferred nodes, and the number of MFs) leads to a structurally and parametrically optimized network through the genetic approach.
2 FRPNN with Fuzzy Relation-Based Polynomial Neuron (FRPN)

The FRPN consists of two basic functional modules. The first one, labeled F, is a collection of fuzzy sets that form an interface between the input numeric variables and the processing part realized by the neuron. The second module (denoted here by P) carries out the function-based nonlinear (polynomial) processing. The detailed FRPN involving a certain regression polynomial is shown in Table 1. The choice of the number of input variables, the polynomial order, the input variables themselves, and the number of MFs available within each node helps select the best model with respect to the characteristics of the data, the model design strategy, nonlinearity, and predictive capabilities.

Table 1. Different forms of regression polynomial building a FRPN

Order of the polynomial \ No. of inputs |     1     |       2        |       3
0 (Type 1)                              | Constant  | Constant       | Constant
1 (Type 2)                              | Linear    | Bilinear       | Trilinear
2 (Type 3)                              | Quadratic | Biquadratic-1  | Triquadratic-1
2 (Type 4)                              |           | Biquadratic-2  | Triquadratic-2

1: Basic type, 2: Modified type
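As an illustration, the exponent sets of the consequence polynomials of Table 1 can be enumerated programmatically. This is a sketch: reading the "modified" Type 4 as keeping only distinct-variable cross-products is an assumption, since the paper does not spell out the modified form here.

```python
from itertools import combinations, combinations_with_replacement

def polynomial_terms(n_inputs, poly_type):
    """Return the variable-index tuples of the regression polynomial of a FRPN.
    () is the constant term, (i,) a linear term, (i, j) a second-order term."""
    terms = [()]                                     # Type 1: constant
    if poly_type >= 2:
        terms += [(i,) for i in range(n_inputs)]     # Type 2: (bi/tri)linear
    if poly_type == 3:                               # Type 3: full quadratic
        terms += list(combinations_with_replacement(range(n_inputs), 2))
    if poly_type == 4:                               # Type 4: cross terms only (assumed)
        terms += list(combinations(range(n_inputs), 2))
    return terms
```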
Proceeding with the FRPNN architecture, essential design decisions have to be made with regard to the number of input variables, the order of the polynomial forming the conclusion part of the rules, and the specific subset of input variables.

Table 2. Polynomial type according to the number of input variables in the conclusion part of fuzzy rules

Type of the consequence polynomial | Selected input variables in the premise part | Selected input variables in the consequence part
Type T                             | A                                            | A
Type T*                            | A                                            | B

Notation: A is the vector of the selected input variables (x1, x2, ..., xi); B is the vector of the entire system input variables (x1, x2, ..., xi, xj, ...); Type T: f(A) = f(x1, x2, ..., xi) is the type of polynomial function standing in the consequence part of the fuzzy rules; Type T*: f(B) = f(x1, x2, ..., xi, xj, ...) is the type of polynomial function occurring in the consequence part of the fuzzy rules.
3 The Structural Optimization of IG_gFRPNN

3.1 Information Granulation by Means of the C-Means Clustering Method

Information granulation is defined informally as linked collections of objects (data points, in particular) drawn together by the criteria of indistinguishability, similarity, or functionality [6]. Granulation of information is a procedure for extracting meaningful concepts from plain numerical data, and an inherent activity of human beings carried out with the intent of better understanding the problem. We extract information about the real system with the aid of the Hard C-Means clustering method [7], which deals with conventional crisp sets. Through HCM, we determine the initial locations (apexes) of the membership functions and the initial values of the polynomial functions used in the premise and consequence parts of the fuzzy rules, respectively. The fuzzy rules of the IG_gFRPNN are given as follows:

    R_j: If x_1 is A_j1 and ... and x_k is A_jk then
         y_j − M_j = f_j{(x_1 − v_j1), (x_2 − v_j2), ..., (x_k − v_jk)},

where A_jk denotes the fuzzy set whose apex is defined as the center point of an information granule (cluster), and M_j and v_jk are the center points of the new output and input variables created by information granulation.

3.2 Genetic Optimization of IG_gFRPNN

Let us briefly recall that GAs are a stochastic search technique based on the principles of evolution, natural selection, and genetic recombination, simulating a process of "survival of the fittest" in a population of potential solutions to the given problem. The main features of genetic algorithms concern individuals viewed as strings, population-based optimization, and a stochastic search mechanism (selection and crossover). In order to enhance the learning of the IG_gFRPNN and augment its performance, we use genetic algorithms to optimize the structure of the network by optimally selecting such parameters as the number of input variables (nodes), the order of the polynomial, the input variables, and the number of MFs within the IG_gFRPNN. Here, the GA uses a serial method of symbolic type, roulette-wheel selection, one-point crossover, and a uniform operation as the mutation operator [8].
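The HCM step of Section 3.1 can be sketched as a standard Lloyd iteration. This is a minimal sketch; the initialization from random data points and the convergence test are assumptions.

```python
import numpy as np

def hcm(X, c, n_iter=100, seed=0):
    """Hard C-Means: the returned centers serve as MF apexes (premise part)
    and as the centers v_jk, M_j of the granulated variables.
    X: (N, d) data matrix; c: number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center (crisp membership)
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == k].mean(0) if np.any(labels == k) else centers[k]
                        for k in range(c)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```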
4 The Algorithm and Design Procedure of IG_gFRPNN

The IG_gFRPNN comes with a highly versatile architecture, both in the flexibility of the individual nodes and in the interconnectivity between the nodes and organization of the layers. Evidently, these features contribute to the significant flexibility of the networks, yet require a prudent design methodology and well-thought-out learning mechanisms. The framework of the design procedure of the genetically optimized Fuzzy Relation-based Polynomial Neural Networks (gFRPNN) based on information granulation comprises the following steps.

[Step 1] Determine the system's input variables.

[Step 2] Form training and testing data. The input-output data set (xi, yi) = (x1i, x2i, ..., xni, yi), i = 1, 2, ..., N (with N being the total number of data points) is divided into two parts, that is, a training and a testing dataset.

[Step 3] Decide the apexes of the MFs by information granulation (HCM). As mentioned in Section 3.1, we obtain the new apexes of the MFs by information granulation, as shown in Fig. 4.

[Step 4] Decide the initial information for constructing the FRPNN structure. Here we decide upon the essential design parameters of the FRPNN structure: (a) initial specification of the fuzzy inference method and the fuzzy identification; (b) initial specification for the decision of the FRPNN structure.

[Step 5] Decide upon the FRPNN structure with the use of genetic design. We divide the chromosome used for genetic optimization into four sub-chromosomes as shown in Fig. 1. The 1st sub-chromosome contains the number of input variables, the 2nd includes the input variables coming to the corresponding node (FRPN), the 3rd contains the number of membership functions (MFs), and the last (the remaining bits) involves the order of the polynomial of the consequence part of the fuzzy rules. All these elements are optimized by running the GA.
R e la te d b it ite m s
B it str u ctu re o f su b c h r o m os o m e d iv id e d f or e a ch ite m
S y m b o lic G e n e T yp e G e n etic D e sig n
i) B its fo r th e se lec tion of th e n o . of in p u t v ar iab le s
3
1
3
F u z z y in fe r e n c e & f u z z y id e n tific a tio n
S e le c te d F R P N s
3 1
S e lec tion of n o. o f in p u t v ar iab les(r)
iii) B its fo r th e se lec tion o f in p u t va ria b le s
4 3
2 4
S ele c tio n o f in p u t va r iab les
iii) B its fo r th e se le ctio n th e n o . o f M F s e ac h in p u t v a ria b le
2
3
3
2
3
3
ii) B its for th e se le ctio n o f th e p o ly n o m ia l or d e r
2
3 3
S electio n o f no . o f M F s b it fo r ea ch in p u t va r ia b le
S e le ctio n o f th e o r d e r of p oly n om ia l (T y p e 1 ~ T y p e 4 )
F u zzy in fere ne m eth o d
M F Type
N o . o f M F s p er e a c h in pu t
T h e str u ctur e o f co n seq ue nt p a r t o f fu zzy r u les
S im p lifie d o r re g res sio n p o ly n o m ia l fu z zy in feren ce
T r ia n g u la r o r G a u ss ia n
N o . o f M F s p er ea c h in p u t va r ia b le 2 ~ 5
S elec ted in p u t v a ria b les o r en tir e sy s tem in pu t v a ria b les
FR PN
Fig. 1. The FRPN design used in the FRPNN architecture – structural considerations and a mapping of the structure on a chromosome
[Step 6] Carry out fuzzy inference and coefficient parameter estimation for fuzzy identification in the selected node.

i) Simplified inference. The consequence part of the simplified inference mechanism is a constant. Using information granulation, the new rules read in the form

    Rⁿ: If x_1 is A_n1 and ... and x_k is A_nk then y_n − M_n = a_n0,    (1)
where Rⁿ is the n-th fuzzy rule, x_l (l = 1, 2, ..., k) is an input variable, A_jl (j = 1, ..., n; l = 1, ..., k) is a membership function of the fuzzy sets, M_j (j = 1, ..., n) is the center point related to the newly created output variable, and n denotes the number of rules. The inferred output is

    ŷ_i = Σ_{j=1}^{n} μ_ji (a_j0 + M_j) / Σ_{j=1}^{n} μ_ji = Σ_{j=1}^{n} μ̂_ji (a_j0 + M_j),    (2)

    μ_ji = A_j1(x_1i) ∧ ... ∧ A_jk(x_ki),    (3)
where μ̂_ji is the normalized value of μ_ji, and Eq. (2) gives the value ŷ_i inferred from Eq. (1). The consequence parameters a_j0 are obtained by the standard least squares method, that is,
    a = (XᵀX)⁻¹ XᵀY,    (4)

where

    X = [x_1, x_2, ..., x_m]ᵀ,  x_i = [μ̂_1i, μ̂_2i, ..., μ̂_ni]ᵀ,  a = [a_10, ..., a_n0]ᵀ,
    Y = [ y_1 − Σ_{j=1}^{n} M_j μ̂_j1,  y_2 − Σ_{j=1}^{n} M_j μ̂_j2,  ...,  y_m − Σ_{j=1}^{n} M_j μ̂_jm ]ᵀ.
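A compact numerical sketch of the estimator (4) follows. It is illustrative only: the helper name and the use of `np.linalg.lstsq` (which solves the same normal equations) are assumptions.

```python
import numpy as np

def estimate_constants(mu_hat, y, M):
    """Least-squares estimate of the constants a_j0 in Eq. (4).
    mu_hat: (m, n) normalized activation levels, row i = [mu_hat_1i ... mu_hat_ni]
    y:      (m,) target outputs
    M:      (n,) centers of the output information granules"""
    Y = y - mu_hat @ M           # subtract sum_j M_j * mu_hat_ji from each y_i
    a, *_ = np.linalg.lstsq(mu_hat, Y, rcond=None)   # a = (X^T X)^{-1} X^T Y
    return a

# Recover known constants from synthetic data: y_i = sum_j mu_hat_ji (a_j0 + M_j)
mu = np.array([[0.2, 0.8], [0.9, 0.1], [0.5, 0.5], [0.3, 0.7], [0.6, 0.4]])
a_true, M = np.array([1.0, 2.0]), np.array([0.5, -0.5])
y = mu @ (a_true + M)
a_est = estimate_constants(mu, y, M)
```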
ii) Regression polynomial inference. The consequence part can be expressed by a linear, quadratic, or modified quadratic polynomial equation as shown in Table 1. The use of the regression polynomial inference method gives rise to the expression

    Rⁿ: If x_1 is A_n1 and ... and x_k is A_nk then
        y_n − M_n = f_n{(x_1 − v_n1), (x_2 − v_n2), ..., (x_k − v_nk)},    (5)
where Rⁿ is the n-th fuzzy rule, x_l (l = 1, 2, ..., k) is an input variable, A_jl (j = 1, ..., n; l = 1, ..., k) is a membership function of the fuzzy sets, v_jl (j = 1, ..., n; l = 1, ..., k) is the center point related to the newly created input variable, M_j (j = 1, ..., n) is the center point related to the newly created output variable, n denotes the number of rules, and f_n(·) is a regression polynomial function of the input variables as shown in Table 1.
The calculation of the numeric output of the model is carried out in the well-known form

    ŷ_i = Σ_{j=1}^{n} μ_ji {a_j0 + a_j1(x_1i − v_j1) + ... + a_jk(x_ki − v_jk) + M_j} / Σ_{j=1}^{n} μ_ji
        = Σ_{j=1}^{n} μ̂_ji {a_j0 + a_j1(x_1i − v_j1) + ... + a_jk(x_ki − v_jk) + M_j},    (6)
where i (i = 1, ..., m) indexes the data, a_jl (j = 1, ..., n; l = 0, ..., k) is a coefficient of the conclusion part of the fuzzy rule, and μ_ji is the same as in Eq. (3). The coefficients of the consequence part of the fuzzy rules are obtained by the least squares method (LSM) as follows:

    a = (XᵀX)⁻¹ XᵀY.    (7)

[Step 7] Select the nodes (FRPNs) with the highest predictive capability and construct the corresponding layer. To evaluate the performance of the FRPNs (nodes) constructed using the training dataset, the testing dataset is used. Based on this performance index, we calculate the fitness function, which reads as
F (fitness function) = 1 / (1 + EPI), (8)

where EPI denotes the performance index for the testing data (or validation data). In this case, the model is built from the training data, and EPI is obtained from the testing (or validation) data of the IG_gFRPNN model constructed from the training data.

[Step 8] Check the termination criterion. The termination condition that controls the growth of the model consists of two components: the performance index and the size of the network (expressed in terms of the maximal number of layers). As far as the performance index is concerned (which reflects the numeric accuracy of the layers), termination is straightforward and comes in the form

F_1 ≤ F*, (9)
where F_1 denotes the maximal fitness value occurring at the current layer, whereas F* stands for the maximal fitness value that occurred at the previous layer. As far as the depth of the network is concerned, the generation process is stopped once the depth reaches three layers. This size of the network has been found experimentally to achieve a sound compromise between the high accuracy of the resulting model and its complexity as well as its generalization abilities. In this study, we use the Root Mean Squared Error (RMSE) as the performance index:
212
S. Oh et al.
E (PI or EPI) = √( (1/N) Σ_{p=1}^{N} (y_p − ŷ_p)² ), (10)
where y_p is the p-th target output datum and ŷ_p stands for the p-th actual output of the model for this specific data point; N is the number of training (PI) or testing (EPI) input-output data pairs, and E is an overall (global) performance index defined over the N pairs.

[Step 9] Determine new input variables for the next layer. If the termination criterion has not been met, the model is expanded. The outputs of the preserved nodes (z_{1i}, z_{2i}, …, z_{Wi}) serve as the new inputs (x_{1j}, x_{2j}, …, x_{Wj}) (j = i + 1) to the next layer. This is captured by the expression

x_{1j} = z_{1i}, x_{2j} = z_{2i}, …, x_{Wj} = z_{Wi}.
(11)
The IG_gFRPNN algorithm is carried out by repeating steps 4-9.
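The numeric checks of Steps 7-8 and the RMSE index of Eqs. (8)-(10) can be sketched as follows; the function names are ours, not the paper's, and the maximal depth of three layers follows the text above.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error of Eq. (10): PI on training data,
    EPI on testing/validation data."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def fitness(epi):
    """Fitness function of Eq. (8): F = 1 / (1 + EPI)."""
    return 1.0 / (1.0 + epi)

def should_stop(f_current, f_previous, depth, max_depth=3):
    """Termination of Step 8: stop when the best fitness of the current
    layer no longer exceeds that of the previous layer (F1 <= F*,
    Eq. (9)) or when the maximal network depth is reached."""
    return f_current <= f_previous or depth >= max_depth
```

A layer is therefore accepted only while its best fitness strictly improves on the previous layer's and the depth bound has not been hit.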
5 Experimental Studies

We demonstrate how the IG-based gFRPNN can be utilized to predict future values of the chaotic Mackey-Glass time series. This time series is used as a benchmark in fuzzy and neuro-fuzzy modeling. The series is generated by the chaotic Mackey-Glass differential delay equation. To come up with a quantitative evaluation of the network, we use the standard RMSE performance index.

Table 3. Computational aspects of the genetic optimization of IG_gFRPNN
                                                         1st layer            2nd to 3rd layer
GAs         Maximum generation                           150                  150
            Total population size                        300                  300
            Selected population size (W)                 30                   30
            Crossover rate                               0.65                 0.65
            Mutation rate                                0.1                  0.1
            String length                                Max*2+1              Max*2+1
            Maximal no. (Max) of inputs to be selected   1 ≤ l ≤ Max (4~5)    1 ≤ l ≤ Max (4~5)
IG_gFRPNN   Polynomial type (Type T) of the
            consequent part of fuzzy rules (#)           1 ≤ T ≤ 4            1 ≤ T ≤ 4
            Consequent input type to be used
            for Type T (##)                              Type T*              Type T
            Membership function (MF) type                Triangular/Gaussian  Triangular/Gaussian
            No. of MFs per input                         2 or 3               2 or 3

l, T, Max: integers; # and ##: refer to Tables 1 and 2, respectively.
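The Mackey-Glass series used as the benchmark in this section is commonly generated by Euler integration of the delay equation dx/dt = a·x(t−τ)/(1 + x(t−τ)^10) − b·x(t). The sketch below is an assumption-laden illustration: the parameter values a = 0.2, b = 0.1, τ = 17, x(0) = 1.2 are the ones customarily used for this benchmark and are not stated in this excerpt.

```python
import numpy as np

def mackey_glass(n, tau=17, dt=1.0, x0=1.2, a=0.2, b=0.1):
    """Euler integration of the Mackey-Glass delay equation
        dx/dt = a*x(t-tau)/(1 + x(t-tau)**10) - b*x(t).
    Returns n samples; parameters are the usual benchmark values
    (an assumption, since the excerpt does not list them)."""
    delay = int(round(tau / dt))
    x = np.full(n + delay + 1, x0)      # constant initial history
    for t in range(delay, n + delay):
        x_tau = x[t - delay]            # delayed state x(t - tau)
        x[t + 1] = x[t] + dt * (a * x_tau / (1.0 + x_tau ** 10) - b * x[t])
    return x[delay + 1:]

series = mackey_glass(500)
```

Input-output pairs for the network are then formed from lagged values such as x(t−30), …, x(t−6), x(t), matching the inputs shown in Fig. 3.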
Fig. 2 depicts the performance index of each layer of IG_gFRPNN as the maximal number of inputs to be selected increases. In the figures, the left, middle, and right parts within A:(•;•;•)-B:(•;•;•) denote the optimal node numbers at each layer of the network, the polynomial order, and the numbers of MFs, respectively. Fig. 3 illustrates the detailed optimal topology of IG_gFRPNN with Gaussian-like MFs for 3 layers when using Max = 5. As shown in Fig. 3, the proposed network enables the architecture to be a structurally more optimized and simplified network
[Fig. 2. Performance index of IG_gFRPNN with respect to the increase of the number of layers (1-3), for the maximal number of inputs to be selected Max = 4 (A) and Max = 5 (B). Panels: (a-1) PI and (a-2) EPI with triangular membership functions; (b-1) PI and (b-2) EPI with Gaussian-like membership functions. Each curve is annotated with A:(•;•;•)/B:(•;•;•) triples giving the selected node numbers, the polynomial order, and the numbers of MFs at each layer.]
[Fig. 3. Optimal network structure of the GA-based FRPNN (for 3 layers): the inputs x(t-30), x(t-24), x(t-18), x(t-12), x(t-6), x(t) feed FPN nodes across three layers, producing the output ŷ; each node is annotated with the number of MFs per input (left) and the polynomial order (right).]
Table 4. Comparative analysis of the performance of the network; considered are models reported in the literature (performance indices: PI, PIs, EPIs, NDEI*). The models compared include Wang's model [10], the cascaded-correlation NN [14], a backpropagation MLP [14], a 6th-order polynomial [14], ANFIS [11], the FNN model [12], a recurrent neural network [15], SuPFuNIS [16], NFI [17], and SONN** [13] (basic and modified, Type I and Type II, 5th layer, Cases 1 and 2). The proposed IG_gFRPNN attains the lowest index values: with Max = 4, PIs = 8.09e-5 / EPIs = 3.77e-4 (triangular MFs) and PIs = 7.46e-5 / EPIs = 3.68e-4 (Gaussian-like MFs); with Max = 5, PIs = 2.40e-5 / EPIs = 6.28e-5 (triangular MFs) and PIs = 2.27e-5 / EPIs = 3.69e-5 (Gaussian-like MFs).
than the conventional FRPNN. In the nodes (FRPNs) of Fig. 3, 'FRPNn' denotes the n-th FRPN (node) of the corresponding layer; the numeric values in rectangles before a node (neuron) give the number of membership functions per input variable; the number on the left side denotes the number of nodes (inputs or FRPNs) coming into the corresponding node; and the number on the right side denotes the polynomial order of the conclusion part of the fuzzy rules used in that node.
6 Concluding Remarks

In this study, we introduced and investigated a new architecture and a comprehensive design methodology of IG_gFRPNNs and discussed their topologies. The proposed IG_gFRPNN is constructed with the aid of an algorithmic framework of information granulation based on C-Means clustering and a symbolic gene type. In the design of IG_gFRPNN, the characteristics inherent in the entire experimental data used in the construction of the gFRPNN architecture are reflected in the fuzzy rules available within each FRPN. Comprehensive experimental studies involving a well-known dataset quantify the superb performance of the network in comparison with existing fuzzy and neuro-fuzzy models.
Acknowledgements This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00194).
References
1. Nie, J.H., Lee, T.H.: Rule-based Modeling: Fast Construction and Optimal Manipulation. IEEE Trans. Syst., Man, Cybern. 26 (1996) 728-738
2. Ivakhnenko, A.G.: Polynomial Theory of Complex Systems. IEEE Trans. Syst., Man, Cybern. SMC-1 (1971) 364-378
3. Oh, S.K., Pedrycz, W.: The Design of Self-organizing Polynomial Neural Networks. Information Sciences 141 (2002) 237-258
4. Oh, S.K., Pedrycz, W., Park, B.J.: Polynomial Neural Networks Architecture: Analysis and Design. Computers and Electrical Engineering 29 (2003) 703-725
5. Oh, S.K., Pedrycz, W.: Fuzzy Polynomial Neuron-Based Self-Organizing Neural Networks. Int. J. of General Systems 32 (2003) 237-250
6. Zadeh, L.A.: Toward a Theory of Fuzzy Information Granulation and Its Centrality in Human Reasoning and Fuzzy Logic. Fuzzy Sets and Systems 90 (1997) 111-117
7. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, New York (1981)
8. De Jong, K.A.: Are Genetic Algorithms Function Optimizers? In: Manner, R., Manderick, B. (eds.): Parallel Problem Solving from Nature 2. North-Holland, Amsterdam (1992)
9. Vachtsevanos, G., Ramani, V., Hwang, T.W.: Prediction of Gas Turbine NOx Emissions Using Polynomial Neural Network. Technical Report, Georgia Institute of Technology, Atlanta (1995)
10. Wang, L.X., Mendel, J.M.: Generating Fuzzy Rules from Numerical Data with Applications. IEEE Trans. Syst., Man, Cybern. 22 (6) (1992) 1414-1427
11. Jang, J.S.R.: ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Trans. Syst., Man, Cybern. 23 (3) (1993) 665-685
12. Maguire, L.P., Roche, B., McGinnity, T.M., McDaid, L.J.: Predicting a Chaotic Time Series Using a Fuzzy Neural Network. Information Sciences 112 (1998) 125-136
13. Oh, S.K., Pedrycz, W., Ahn, T.C.: Self-organizing Neural Networks with Fuzzy Polynomial Neurons. Applied Soft Computing 2 (2002) 1-10
14. Crowder III, R.S.: Predicting the Mackey-Glass Time Series with Cascade-correlation Learning. In: Touretzky, D., Hinton, G., Sejnowski, T. (eds.): Proceedings of the 1990 Connectionist Models Summer School (1990) 117-123
15. Li, C.J., Huang, T.Y.: Automatic Structure and Parameter Training Methods for Modeling of Mechanical Systems by Recurrent Neural Networks. Applied Mathematical Modeling 23 (1999) 933-944
16. Paul, S., Kumar, S.: Subsethood-Product Fuzzy Neural Inference System (SuPFuNIS). IEEE Trans. Neural Networks 13 (3) (2002) 578-599
17. Song, Q., Kasabov, N.K.: NFI: Neuro-Fuzzy Inference Method for Transductive Reasoning. IEEE Trans. Fuzzy Systems 13 (6) (2005) 799-808
18. Park, B.J., Lee, D.Y., Oh, S.K.: Rule-based Fuzzy Polynomial Neural Networks in Modeling Software Process Data. Int. J. of Control, Automation, and Systems 1 (3) (2003) 321-331
Fuzzy Neural Network Classification Design Using Support Vector Machine in Welding Defect

Xiao-guang Zhang¹,², Shi-jin Ren³, Xing-gan Zhang², and Fan Zhao¹

¹ College of Mechanical and Electrical Engineering, China University of Mining and Technology, Xuzhou 221008, China
² Department of Electronic Science & Engineering, Nanjing University, Nanjing 210093, China
³ College of Computer Science & Technology, Xuzhou Normal University, Xuzhou 221116, China
[email protected]

Abstract. To cope with the variability of defect shadows in welding images, the complex relations between defect features and classes, and the poor generalization of fuzzy neural networks (FNN), a support vector machine (SVM)-based FNN classification algorithm for welding defects is presented. The algorithm first adopts supervised fuzzy clustering to obtain rules covering the input and output spaces, and a similarity probability is applied to calculate the importance of the rules. Then the parameters and structure of the FNN are determined through the SVM. Finally, the FNN is trained to classify the welding defects. Simulations on recognizing defects in welding images show the efficiency of the presented algorithm.
1 Introduction

FNN inherits the advantages of neural networks and fuzzy logic: it can make use of expert language, has self-learning ability, and is widely applied in machine learning. Most learning algorithms of FNN adopt BP and FCM clustering to obtain fuzzy rules and membership parameters from training data, but these algorithms cannot minimize both the empirical error and the expected error simultaneously. Besides, the training time is sensitive to the input dimension, and when there are redundant or conflicting rules, the precision of the FNN is unsatisfactory. SVM can effectively deal with small samples and obtains the global optimum through quadratic optimization [1]; more and more researchers now pay attention to it. Therefore, we propose a new FNN algorithm in which SVM is used to determine the initial parameters and structure of the FNN.

X-ray-based non-destructive inspection is an important method of controlling and inspecting welding quality. However, the recognition and evaluation of X-ray welding inspection images at present depend mainly on human experts, which often yields uncertain results. Over the past 30 years, classifiers have been used to recognize defects in the research on defect recognition [2]-[3]. Although they can obtain usable results, the correct recognition ratio is very low. Nowadays, neural networks improve the correct recognition ratio for all shapes of welding defects [4]. The main problem is that defects in welding images vary greatly and the relations between defect

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 216–223, 2007. © Springer-Verlag Berlin Heidelberg 2007
FNN Classification Design Using Support Vector Machine in Welding Defect
217
features and classes are complex. Based on the characteristics of defects in welding images, an SVM-based FNN classification algorithm is proposed. Firstly, supervised fuzzy clustering is adopted to extract fuzzy rules, and a weighting algorithm determines the importance of the rules, enabling the model to learn rules selectively. Then, SVM is applied to determine the structure and initial parameters of the FNN. Finally, the FNN is trained according to a weighted cost function. In this paper, Section 2 introduces the basic structure and realization method of the FNN. Section 3 introduces the fuzzy clustering algorithm based on supervised GK clustering and puts forward an algorithm to denote the weight of rule importance. Section 4 introduces the multi-class classification method of SVM and the SVM-based training algorithm of the FNN. The simulated results of weld-defect recognition are discussed in Section 5. The conclusion is reached in Section 6.
2 FNN

The FNN adopted in this paper consists of three basic modules (fuzzification, fuzzy inference, and fuzzy judgment). Every feature extracted from the welding defects is modeled, and every fuzzy reference model is a qualitative description of a feature and a classification of welding defects. Suppose there are n rules and m input variables.

Rule j: if x_1 is A_{j1} and … and x_m is A_{jm} then y is d_j, j = 1, 2, …, n,

where A_{ji} is the fuzzy set of input variable x_i and d_j is the consequent parameter of y. For ease of analysis, fuzzy rule 0 is added.

Rule 0: if x_1 is A_{01} and … and x_m is A_{0m} then y is d_0.

For the m-dimensional input x = [x_1, …, x_m], the antecedent of the fuzzy model is defined with the and operation realized by the product operator:

A_j(x) = ∏_{i=1}^{m} μ_{ji}(x_i), (1)

where A_j is the multivariable fuzzy set of the j-th rule and μ_{ji} is the membership function of a single variable. The model output is

y = Σ_{j=1}^{n} μ̄_j d_j + d_0,  where  μ̄_j = A_j(x) / Σ_{j'=1}^{n} A_{j'}(x) = [∏_{i=1}^{m} μ_{ji}(x_i)] / Σ_{j'=1}^{n} A_{j'}(x). (2)
218
X.-g. Zhang et al.
In this way, the output can be written as

y = Σ_{j=1}^{n} d_j ∏_{i=1}^{m} μ_{ji}(x_i) + d_0. (3)
Training the samples with the FNN strengthens the mapping ability of the network and improves its expressive power, while retaining simple and practical characteristics such as self-learning, redundancy, strong classification ability, and parallel processing. Introducing fuzzy theory into the neural network can improve the correct recognition ratio without adding new information.
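The forward pass of Eqs. (1)-(3) can be sketched as follows. This is a minimal illustration under stated assumptions: Gaussian membership functions are one common choice (the paper does not fix the shape here), and the rule centers and consequents below are hypothetical.

```python
import numpy as np

def gaussian_mf(x, center, sigma):
    """Single-variable Gaussian membership function (an assumed shape)."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def fnn_output(x, centers, sigmas, d, d0):
    """Output of the three-module FNN.
    centers, sigmas: (n_rules, m) membership parameters;
    d: (n_rules,) consequent parameters; d0: rule-0 consequent."""
    # Eq. (1): rule firing strength as product of single-variable MFs
    A = np.prod(gaussian_mf(x, centers, sigmas), axis=1)
    # Eq. (2): normalized firing strengths
    mu = A / A.sum()
    # Eq. (3): weighted sum of consequents plus the rule-0 term
    return float(mu @ d) + d0

centers = np.array([[0.0], [5.0]])   # hypothetical rule centers
sigmas = np.ones((2, 1))
d = np.array([1.0, -1.0])            # hypothetical consequents
print(fnn_output(np.array([0.0]), centers, sigmas, d, d0=0.0))
```

An input at the center of rule 1 is dominated by that rule's consequent, while an input midway between the two centers yields the average of the consequents.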
3 Fuzzy Clustering with Supervision

At present, the fuzzy subsets and membership functions of an FNN depend on manual experience, which is difficult for high input dimensions and large sample sets. How to extract fuzzy rules from sample data automatically is still an open problem. In this paper, we apply a supervised fuzzy clustering algorithm [5] to extract fuzzy rules. Practice proves that the method can make full use of the class-label information of the samples and can sufficiently cover the input and output spaces of the samples. It can also find important clusters and determine a rational number of clusters.

The GK fuzzy clustering algorithm has proved to be an effective clustering method for identifying TS fuzzy models. It uses an adaptive distance norm to detect clusters with different geometrical shapes [6]. Every cluster represents one rule in the rule base, and the clustering is based on minimizing the objective function

J = Σ_{j=1}^{M} Σ_{k=1}^{N} (μ_{kj})^m d_{kj}², (4)

subject to the conditions

Σ_{j=1}^{c} μ_{kj} = 1, μ_{kj} ≥ 0, 1 ≤ k ≤ n, 1 ≤ j ≤ c,

where m > 1 denotes the weighting exponent of the fuzzy clustering, c is the number of clusters, n is the number of samples in the clustering space, μ_{kj} is the membership of the k-th sample x_k in cluster j, and d_{kj}² = (x_k − v_j)^T (F_j)^{-1} (x_k − v_j) is the inner-product norm measuring the distance of sample x_k from the cluster center v_j; F_j is the diagonal matrix containing the variances, x_k ∈ R^s, v_j ∈ R^s, and s is the dimension of the input vectors. U = {μ_{kj}} denotes the n × c partition matrix and V = {v_1, v_2, …, v_c} denotes the s × c matrix of cluster centers. For {Z_i = (x_i, y_i)}_{i=1,2,…,N}, the steps of the supervised fuzzy clustering algorithm are as follows [5]:
Initialize: set the iteration counter l = 1, the number of clusters M, the contribution-ratio threshold ρ of the rules, the termination error ε > 0, and a random initial fuzzy partition matrix U^(0).

(1) Calculate the cluster centers:

v_i^(l) = [Σ_{k=1}^{N} (μ_{ki}^(l−1))^m z_k] / Σ_{k=1}^{N} (μ_{ki}^(l−1))^m. (5)

(2) Calculate the covariance matrices:

F_i = [Σ_{k=1}^{N} (μ_{ki}^(l−1))^m (z_k − v_i^(l))(z_k − v_i^(l))^T] / Σ_{k=1}^{N} (μ_{ki}^(l−1))^m. (6)

(3) Calculate the distances to the clusters:

d_{ki}² = (z_k − v_i^(l))^T D_i (z_k − v_i^(l)), 1 ≤ i ≤ M, 1 ≤ k ≤ N, (7)

where D_i = [det(F_i)]^{1/(n+1)} F_i^{-1}.

(4) Update the partition matrix: for 1 ≤ i ≤ M, 1 ≤ k ≤ N, if d_{ki} > 0,

μ_{ki}^(l) = 1 / Σ_{j=1}^{M} (d_{ki} / d_{kj})^{2/(m−1)}; (8)

otherwise (d_{ki} = 0), μ_{ki}^(l) = 1.

(5) Run the orthogonal-least-squares (OLS) cluster reduction algorithm [5] and find and save the M_s important clusters according to the principle of the maximal error change ratio. Set M := M_s, U^(l) = [u_i], i = 1, …, M_s, and renormalize U^(l).

(6) If ||U^(l) − U^(l−1)|| ≥ ε, set l = l + 1 and return to step (1); otherwise stop.

Since the sample data contain noise and even isolated points, it is possible that several isolated points form a cluster of their own, and the tightness of the clusters differs. The importance degrees of the fuzzy rules extracted from the clusters therefore differ as well. This factor should be considered in modeling; otherwise it will affect the accuracy of the final model. An algorithm denoting the importance weight of the rules is put forward in this paper. The weight w_i of rule i is

w_i = { [Σ_{k=1}^{N} (μ_{i,k})^m] / N } ∏_{j=1}^{n_x} (1 / √(2π F_{ij})). (9)
Its meaning is that, when the membership function is a Gaussian function and rule i exists, the former part is the prior probability of rule i and the latter part is the
reciprocal of the conditional membership of rule i. In this way, it is easy to extract rules from the clusters and calculate the importance weights of the corresponding rules.
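The rule-importance weight of Eq. (9) can be sketched as follows; this is an illustrative reading of the formula, with hypothetical membership values and per-dimension variances.

```python
import numpy as np

def rule_weight(mu_i, F_i, m=2.0):
    """Importance weight of rule i, Eq. (9): the prior probability of
    the rule (mean fuzzified membership over the N samples) times the
    reciprocal of the Gaussian normalization constants sqrt(2*pi*F_ij)
    over the nx input dimensions.
    mu_i: (N,) memberships of the samples in cluster i;
    F_i: (nx,) per-dimension variances of cluster i."""
    prior = np.mean(mu_i ** m)
    return prior * np.prod(1.0 / np.sqrt(2.0 * np.pi * F_i))
```

As the formula suggests, a rule backed by many high-membership samples and a tight (low-variance) cluster receives a larger weight than one built from a few loosely scattered points.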
4 FNN Training Based on SVM

For SVM, the most important problem is choosing an appropriate kernel function for the application at hand. The kernel function is defined as follows.

Theorem 1 [7]: For samples x and z, if the membership function μ(x): R → [0, 1] is a norm function, then the function

K(z, x) = ∏_{i=1}^{n} μ_j(x_i) μ_j(z_i), if x and z are in the j-th cluster; K(z, x) = 0, otherwise, (10)

is also a norm function and a Mercer kernel function.

Suppose the dimension of the samples is n_x and the number of samples is n. Use the supervised fuzzy clustering method of Section 3 to establish the initial fuzzy rules. Suppose the samples are divided into m clusters and the number of samples in cluster i is k_i. The following m clusters are obtained:

cluster_1 = {(x_1^1, y_1^1), …, (x_{k_1}^1, y_{k_1}^1)}, …, cluster_m = {(x_1^m, y_1^m), …, (x_{k_m}^m, y_{k_m}^m)}, with Σ_{i=1}^{m} k_i = n.

The corresponding kernel matrix is block diagonal, K = diag(K_1, …, K_m), where K_i is the k_i × k_i kernel matrix. The kernel parameter corresponding to the samples of cluster i is equivalent to the cluster variance. In this way, SVM can be used to learn the parameters of the FNN. In the case of two clusters, the nonlinear classification hyperplane can be obtained by solving the following optimization problem:
min L(w, ξ) = ||w||² + C Σ_{i=1}^{n} e_i ξ_i, (11)

subject to the constraints

y_i (w ⋅ φ(x_i) + d_0) ≥ 1 − ξ_i, i = 1, …, Σ_{i=1}^{m} k_i,
where C is a constant and e_i is the importance weight of sample i, determined by the method of Section 3; the samples belonging to the same cluster share the same weight. The dual problem is
max L(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j K(x_i, x_j), (12)

subject to the constraints

α_i ∈ [0, e_i C], Σ_{i=1}^{n} y_i α_i = 0,
where K(x_i, x_j) is the fuzzy kernel function. Suppose there are n_sv support vectors. Then

d_0 = [ Σ_{i=1}^{n_sv} a_i y_i x_i' x*(1) + Σ_{i=1}^{n_sv} a_i y_i x_i' x*(−1) ] / 2, (13)

where x*(1) and x*(−1) belong to class 1 and class 2, respectively.

Since the resulting decision function is y = Σ_{i=1}^{n_sv} a_i y_i K(x, x_i) + d_0, the parameters of the FNN can be taken as d_i = a_i y_i with the kernel of (10). The centers are the corresponding support vectors, and the parameter of each membership function is the standard deviation. The case of one class containing several clusters is handled by the same method. Since SVM realizes two-class classification, it must be rebuilt to handle multi-class problems. Many scholars have studied this problem; this paper adopts the "one-against-all" multi-class classifier because it is easy to realize [8]. In this case, N classification functions are constructed, one between each class and all the others. For example, the j-th SVM separates the samples of the j-th class from the others, so the sample labels in the training set are remodified: the labels of the j-th class are set to 1 and all others to −1. For classification, the comparison method is adopted: a testing sample x is entered into the N two-class classifiers, the discriminant function values of the sub-classifiers are calculated, and the class corresponding to the maximal discriminant value is chosen as the class of the testing datum. Each SVM classifier regards the clusters belonging to one class as class 1 and the rest as class 2, and the parameter selection and learning manner are the same as above. In this way, the N SVMs are determined. A new pattern x is evaluated by the N SVMs, and the decision function is
f(x) = c, f_c(x) = max_{i=1,…,N} f_i(x). (14)
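The one-against-all decision rule of Eq. (14) can be sketched as follows; the discriminant values below are hypothetical outputs of N trained sub-classifiers, not results from the paper.

```python
import numpy as np

def one_against_all_predict(decision_values):
    """Eq. (14): with N binary one-against-all classifiers, assign each
    sample to the class whose sub-classifier returns the maximal
    discriminant value f_i(x).
    decision_values: (n_samples, N) array of f_i(x)."""
    return np.argmax(decision_values, axis=1)

# Hypothetical discriminant values from N = 3 sub-classifiers
f = np.array([[ 0.9, -0.2, -1.1],
              [-0.4,  0.3, -0.1]])
print(one_against_all_predict(f))   # prints [0 1]
```

Each row is one test sample; the predicted class index is simply the column of the largest discriminant value in that row.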
5 Simulation Experiment

According to [9], weld defects can generally be classified as crack, lack of penetration, lack of fusion, strip-shaped slag inclusion, spherical slag inclusion, and
pore, six classes in all. Different features, such as defect shape, location, boundary flatness, and tip sharpness, are used to recognize and classify defects. In references [2] and [9], six shape feature parameters, namely the ratio of the long diameter to the short diameter, tip sharpness, boundary flatness, the obliquity with respect to the welding direction, the centroid coordinate relative to the weld center, and symmetry, are chosen as feature parameters. An X-ray inspection welding image is processed by preprocessing, segmentation, and contour tracing. The defect location parameter, defect perimeter, defect area, the long and short diameters of the defect, and their ratio are extracted as described in reference [10]. According to the definitions of the six feature parameters, the feature parameters can be obtained from these defect parameters. In this paper, using a standard image database of weld defects as experimental samples, 184 defect feature vectors of weld images are chosen as the total sample set (184 × 6 parameters in all). The feature parameters of the practical input samples are rearranged into input vectors x_i = (x_{i1}, x_{i2}, …, x_{i8}) (i = 1, 2, …, 184), composed of 44 pore samples, 40 spherical slag inclusion samples, 25 strip-shaped slag inclusion samples, 25 lack-of-penetration samples, 25 lack-of-fusion samples, and 25 crack samples. Of the six classes of defect samples, 124 are adopted as training samples and 60 as testing samples. The thresholds of the orthogonal least squares (OLS) algorithm are ρ = 7% and ε = 0.001. In the experiments, the initial number of clusters ranges from 12 down to 6. Although the class number and class labels are known, the appropriate cluster number is not; it is found after several experiments by choosing the clustering result whose fit errors on the training and testing samples are minimal. The final number of clusters is 7, and the results are shown as follows: Table 1.
The cluster results and classification composition

Defect                        Pore   Spherical      Strip-shaped   Lack of       Lack of   Crack
                                     slag incl.     slag incl.     penetration   fusion
Cluster number                2      1              1              1             1         1
Class importance              0.823  0.965          0.892          0.868         0.926     0.955
Training samples              30     30             16             16            16        16
Testing samples               14     10             9              9             9         9
Classification precision (%)  100    100            78             78            89        100
Using the one-against-all classification method described above, the radial basis function is chosen as the kernel function. The leave-one-out (LOO) method is used to determine the penalty coefficients of the SVM, yielding good classification results. The detailed numbers of training and testing samples are shown in Table 1, and the recognition ratio of the defects is 90% on average. The FNN trained by SVM possesses high learning and testing accuracy. The simulations were carried out in Matlab, and the SVM was trained with the optimization package in the Matlab toolbox.
6 Conclusion

An SVM-based FNN classification algorithm for welding defects is proposed to overcome the shortcomings of existing FNN learning algorithms. By weighting the error terms so as to emphasize rules of different importance, the precision and the anti-interference ability can be improved. Simulation results show that the proposed algorithm can effectively model the complex relations between defect features and classes and has better classification performance on small sample sets.
Acknowledgments

The authors express their appreciation for the financial support of the China Planned Projects for Postdoctoral Research Funds, grant No. 20060390277, and of the Jiangsu Planned Projects for Postdoctoral Research Funds, grant No. 0502010B. The paper is also supported by the Jiangsu "Six Talent Peaks" program, grant No. 06-E-052, and by the China University of Mining and Technology Science Research Funds, grant No. 2005B005.
References
1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
2. Zhou, W., Wang, C.: Research and Application of an Automatic Recognition System for Weld Defects. Transactions of the China Welding Institution 13 (1) (1992) 45-50
3. Silva, R.R. da, Siqueira, M.H.S., Caloba, L.P., et al.: Radiographic Pattern Recognition of Welding Defects Using Linear Classifiers. Insight 43 (10) (2001) 669-674
4. Ren, D., You, Z., Sun, C.: Automatic Analysis System of X-ray Weld Real-time Imaging. Transactions of the China Welding Institution 21 (1) (2000) 61-63
5. Setnes, M.: Supervised Fuzzy Clustering for Rule Extraction. IEEE Transactions on Fuzzy Systems 8 (4) (2000) 416-424
6. Nauck, D., Kruse, R.: Obtaining Interpretable Fuzzy Classification Rules from Medical Data. Artificial Intelligence in Medicine 16 (2) (1999) 149-169
7. Lin, C.T., Yeh, C.M., Hsu, C.F.: Fuzzy Neural Network Classification Design Using Support Vector Machine. IEEE International Symposium on Circuits and Systems 5 (2004) 724-727
8. Sun, Z.H.: Study on Support Vector Machine and Its Application in Control. Dissertation, Zhejiang University (2003)
9. The National Standard of PRC: Radiographic Inspection and Quality Grading of Fusion Welded Joints. GB3323-87 (1987)
10. Zhang, X.G.: The Extraction and Automatic Identification of Weld Defects with X-ray Inspection. National Defence Industry Press, Beijing (2004)
Multi-granular Control of Double Inverted Pendulum Based on Universal Logics Fuzzy Neural Networks*

Bin Lu¹ and Juan Chen²

¹ Department of Computer Science & Technology, North China Electric Power University, 071003 Baoding, China
[email protected]
² Department of Economic Management, North China Electric Power University, 071003 Baoding, China
[email protected]

Abstract. The control of a double inverted pendulum is one of the most difficult control problems, especially for the parallel-type pendulum, because of the high complexity of the control system. To attain a prescribed accuracy while reducing the control complexity, a multi-granular controller for stabilizing a double inverted pendulum system is presented based on universal logics fuzzy neural networks. It is a universal multi-granular fuzzy controller that represents the process of reaching the goal at different levels of information granularity. When the prescribed accuracy is low, a coarse fuzzy controller can be used. As the process moves from a high level to a low level, the prescribed accuracy becomes higher and the information granularity of the fuzzy controller becomes finer. In this controller, a rough plan is first generated to reach the final goal. The plan is then decomposed into many sub-goals that are submitted to the next lower level of the hierarchy, and more refined plans to reach these sub-goals are determined. If needed, this process of successive refinement continues until the final prescribed accuracy is obtained. With the assistance of universal logics fuzzy neural networks, more flexible structures suitable for any controlled object can easily be obtained, which greatly improves the performance of the controllers. Finally, simulation results indicate the effectiveness of the proposed controller.
1 Introduction

The double inverted pendulum is a classical and complex nonlinear system that is often used as a benchmark for verifying the effectiveness of a new control method because of the simplicity of its structure. In general, stable control of a double inverted pendulum system requires six input items to cover the angular controls of the two pendulums and the position control of the cart. The conventional fuzzy inference model, which puts all the input items into the antecedent part of each fuzzy rule, has difficulty settling fuzzy rules with six input items. Even if such a fuzzy rule base is built, it increases the complexity of the control system enormously because of its
* The research work is supported by the Ph.D. Science Foundation (20041211) and the Postdoctoral Science Foundation (20041101) of North China Electric Power University.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 224–233, 2007. © Springer-Verlag Berlin Heidelberg 2007
Multi-granular Control of Double Inverted Pendulum
225
huge size. How to reduce the size of the fuzzy rule base, and thereby reduce the control complexity, has become one of the main concerns of system designers. Yi J.Q. [1] constructed a controller based on the single-input rule modules (SIRMs) dynamically connected fuzzy inference model, in which each input item is assigned a SIRM and a dynamic importance degree. Tal C.W. [2] proposed a fuzzy adaptive approach to fuzzy controllers designed with the spatial model to reduce the complexity. Sun Q. [3] presented a design method for the stabilization of multivariable complex nonlinear systems that can be represented by fuzzy dynamic models in the decentralized control of large-scale systems, with an optimal fuzzy controller designed by genetic algorithms. Jinwoo K. [4] employed a multi-resolutional search paradigm to design optimal fuzzy logic controllers in a variable-structure simulation environment; the search paradigm was implemented using hierarchical distributed genetic-algorithm search agents solving problems at different degrees of abstraction. Besides all of the above, there are many other studies of fuzzy controllers aimed at reducing the computational complexity. Although these achievements improve the performance of controllers to a greater or lesser extent, limitations unavoidably remain.

In this paper, a multi-granular controller for stabilizing a double inverted pendulum system is presented based on universal logics fuzzy neural networks (ULFNN), which has an excellent effect on reducing the complexity of controllers and can guarantee the prescribed control accuracy for a certain class of uncertain systems. The fuzzy controller uses different levels of information granularity to attain the prescribed accuracy. When the prescribed accuracy is low, a fuzzy controller based on coarser granular information can be used.
As the process moves from the high level to the low level, the prescribed accuracy becomes higher and the information granularity of the fuzzy controller becomes finer. If needed, this process of successive refinement continues until the final prescribed accuracy is obtained. At the same time, by incorporating the ULFNN, the controller uses a flexible, open and adaptive family of operators, which covers all logical forms and inference patterns, parameterizes the basic fuzzy inference operators, and allows flexible combination of the rule premises, the rule activations and the rule outputs. Therefore, the performance of the controller is improved greatly. In the following sections, the analysis and design of the fuzzy controller are discussed. Although a parallel-type double inverted pendulum system is taken as the demonstration, the fuzzy controller can also be applied to series-type double inverted pendulum systems and other control systems.
2 Parallel-Type Double Inverted Pendulum System

As one of the family of inverted pendulum systems, the stabilization control of a parallel-type double inverted pendulum system is more difficult than that of single inverted pendulum systems, series-type double inverted pendulum systems, and so on. To stabilize a parallel-type double inverted pendulum is not only a challenging problem but also a useful way to show the power of a control method. As shown in fig. 1, the double inverted pendulum system considered here consists of a straight line rail, a cart moving on the rail, a longer pendulum 1, a shorter pendulum 2, and a driving unit.
B. Lu and J. Chen
Fig. 1. Double inverted pendulum
Here, the parameters M = 1.0 kg, m1 = 0.3 kg and m2 = 0.1 kg are the masses of the cart, pendulum 1 and pendulum 2, respectively. The parameter g = 9.8 m/s² is the gravitational acceleration. Suppose the mass of each pendulum is distributed uniformly. Half the length of the longer pendulum 1 is given as l1 = 0.6 m, and half the length of pendulum 2 is given as l2 = 0.2 m. The position of the cart from the rail origin is denoted as x, and is positive when the cart is on the right side of the rail origin. The angles of pendulum 1 and pendulum 2 from their upright positions are denoted as α and β respectively, with the clockwise direction positive. The driving force applied horizontally to the cart is denoted as F (N), with the right direction positive. Also, suppose no friction exists in the pendulum system. Then the dynamic equation of such a double inverted pendulum system can be obtained from Lagrange's equation of motion as
\[
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\begin{bmatrix} \ddot{x} \\ \ddot{\alpha} \\ \ddot{\beta} \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix},
\tag{1}
\]
where the coefficients are given by
\[
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
=
\begin{bmatrix} M + m_1 + m_2 & m_1 l_1 \cos\alpha & m_2 l_2 \cos\beta \\ m_1 l_1 \cos\alpha & 4 m_1 l_1^2/3 & 0 \\ m_2 l_2 \cos\beta & 0 & 4 m_2 l_2^2/3 \end{bmatrix},
\tag{2}
\]
and
\[
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}
=
\begin{bmatrix} F + m_1 l_1 \dot{\alpha}^2 \sin\alpha + m_2 l_2 \dot{\beta}^2 \sin\beta \\ m_1 l_1 g \sin\alpha \\ m_2 l_2 g \sin\beta \end{bmatrix}.
\tag{3}
\]
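As a quick numerical sanity check, equations (1)-(3) can be evaluated directly: build the mass matrix and the right-hand side from the state, then solve the 3x3 linear system for the accelerations. The sketch below is a plain Python illustration, not code from the paper; the explicit elimination solver is our own helper, and only the parameter values are taken from the text above.

```python
import math

# Physical parameters from the paper
M, m1, m2 = 1.0, 0.3, 0.1      # cart and pendulum masses (kg)
l1, l2, g = 0.6, 0.2, 9.8      # half-lengths (m) and gravity (m/s^2)

def dynamics(alpha, beta, dalpha, dbeta, F):
    """Solve eq. (1) for the accelerations (x'', alpha'', beta'')."""
    ca, cb = math.cos(alpha), math.cos(beta)
    # Mass matrix, eq. (2)
    A = [[M + m1 + m2, m1 * l1 * ca, m2 * l2 * cb],
         [m1 * l1 * ca, 4 * m1 * l1 ** 2 / 3, 0.0],
         [m2 * l2 * cb, 0.0, 4 * m2 * l2 ** 2 / 3]]
    # Right-hand side, eq. (3)
    b = [F + m1 * l1 * dalpha ** 2 * math.sin(alpha)
           + m2 * l2 * dbeta ** 2 * math.sin(beta),
         m1 * l1 * g * math.sin(alpha),
         m2 * l2 * g * math.sin(beta)]
    # Gaussian elimination with partial pivoting (tiny 3x3 helper)
    n = 3
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][c] * x[c] for c in range(i + 1, n))) / A[i][i]
    return x

# With both pendulums upright and at rest and no force, nothing accelerates;
# a small positive angle of pendulum 1 should produce a positive alpha''.
print(dynamics(0.0, 0.0, 0.0, 0.0, 0.0))
print(dynamics(0.1, 0.0, 0.0, 0.0, 0.0))
```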
3 Multi-granular Fuzzy Control

In the controller, starting from the initial state of the overall system, a rough plan is first generated to reach the final goal. Then, the plan is decomposed into many
sub-goals which are submitted to the next lower level of the hierarchy, and the more refined plans to reach these sub-goals are determined. This process of successive refinement continues until the final prescribed accuracy is obtained. The structure of the controller is shown in fig. 2. In the figure, the letter r denotes the desired output trajectory, e the error, y the actual output, and u the control action.
Fig. 2. Block diagram of controller
Since the number of rules increases exponentially as the number of system variables increases, one of the most important aims of the controller is to reduce the size of the rule base. The idea of the controller is based on the human operator's behavior or problem-solving methods: the operator would first try to bring the controlled process variable 'roughly' to a desirable situation and then to a precisely desirable one. Thus, in the case of a regulation problem, the controlled variable is first brought within a small deviation band around the set-point using 'coarse' resolution, and then finer information resolution is used.
Fig. 3. Switch of granularities
In fig. 3, at each level of information granularity, the goal is to reduce to zero the error, which is defined on a universe of discourse [−ε_i, ε_i]. As a result, the error falls within the threshold [−ε_i, ε_i]. When the zero at the i-th level is reached, the granulation of information becomes finer: the interval on which the membership functions are defined becomes smaller, and the membership functions are now described on the smaller universe of discourse [−ε_{i+1}, ε_{i+1}]. This process continues until the prescribed accuracy is reached. Thus, the task decomposition is achieved by defining the membership functions on ever-decreasing universes of discourse.
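The switching scheme of fig. 3 can be sketched as a simple supervisory loop: run a controller tuned for the current universe [−ε_i, ε_i], and once the error fits inside the next, smaller universe, switch to the finer-granularity controller. The sketch below is our own minimal illustration; the plant and controller stubs are hypothetical stand-ins, not the paper's fuzzy controller.

```python
def run_multigranular(plant_step, make_controller, epsilons, x0, max_steps=1000):
    """Successive refinement: epsilons is a decreasing list
    [eps_0, eps_1, ...] of error-universe half-widths."""
    x = x0
    level = 0
    controller = make_controller(epsilons[level])
    for _ in range(max_steps):
        error = -x  # regulation toward the set-point 0
        # switch to finer granularity once the error fits the next universe
        if level + 1 < len(epsilons) and abs(error) <= epsilons[level + 1]:
            level += 1
            controller = make_controller(epsilons[level])
        x = plant_step(x, controller(error))
        if abs(x) <= epsilons[-1]:   # final prescribed accuracy reached
            break
    return x, level

# Toy stand-ins: a first-order plant and a proportional controller
plant = lambda x, u: x + 0.1 * u
ctrl = lambda eps: (lambda e: 2.0 * e)   # gain independent of eps in this toy
x_final, level = run_multigranular(plant, ctrl, [1.0, 0.1, 0.01], x0=0.8)
print(x_final, level)
```

The toy plant contracts the error geometrically, so the loop reaches the finest accuracy band after passing through at least one granularity switch.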
4 Analysis of ULFNN

The ULFNN is a six-layer feed-forward net in which the AND operation, OR operation and implication operation are all realized with Universal Logics operators [5], which are parameterized families of operators including zero-level universal AND operators (ZUAND), zero-level universal OR operators (ZUOR) and zero-level universal implication operators (ZUIMP).

4.1 Structure of ULFNN
Consider a multiple input-single output ULFNN. The knowledge base of the system is defined by a set of linguistic rules of the type:

IF x_1 = A_{i1} AND … AND x_n = A_{in} THEN y = B_i, i = 1, 2, …, M.  (4)

In the above, A_{ij} are reference antecedent fuzzy sets of the n input variables x_1, x_2, …, x_n, and B_i are reference consequent fuzzy sets of the output variable y. Each x_i is defined on the universe of discourse X_i, i = 1, …, n, and y is defined on the universe of discourse Y. M denotes the number of rules.
Fig. 4. Structure of ULFNN
An ULFNN computationally identical to this type of reasoning is shown in fig. 4. It is a six-layer feed-forward net in which each node performs a particular function on incoming signals, with a set of parameters pertaining to that node. Note that the links in an adaptive network only indicate the flow direction of signals between nodes; no weights are associated with the links.

1) Layer 1 (the input layer). There are r crisp input variables x_1, x_2, …, x_r, which are defined on the universes of discourse X_i respectively, i = 1, …, r.

2) Layer 2 (the fuzzification layer). Compare the input variables with the membership functions of the premise part to obtain the membership values of each linguistic label. The output of a node is the degree to which the given input satisfies the linguistic label associated with that node. Usually, we choose Gauss-shaped membership functions
\[
A_i^j(x) = \exp\!\left[-\frac{(x - c_i^j)^2}{(\sigma_i^j)^2}\right] \in F(X_i), \quad i = 1, \ldots, r, \tag{5}
\]
to represent the linguistic terms, where {c_i^j, σ_i^j} is the parameter set. As the values of these parameters change, the Gauss-shaped functions vary accordingly, thus exhibiting various forms of membership functions for the linguistic labels A_i^j. In fact, any continuous membership function, such as a trapezoidal or triangular-shaped one, is also a qualified candidate for the node functions in this layer.

3) Layer 3 (the firing strength layer). Usually the membership values of the premise part are combined to get the firing strength of each rule through a specific t-norm operator, such as Min or Probabilistic. Here, however, the firing strength of the associated rule is computed through the parameterized ZUAND operators. The firing strength of the i-th rule is
\[
\tau_i = T(x_1, x_2, \ldots, x_n, h_{T_1})
= \left(\max\!\left(0^{m_{T_1}},\; x_1^{m_{T_1}} + x_2^{m_{T_1}} + \cdots + x_n^{m_{T_1}} - (n-1)\right)\right)^{1/m_{T_1}}. \tag{6}
\]
In the above, the real number m is related to the generalized correlation coefficient h by m = (3 − 4h)/(4h(1 − h)), h ∈ [0, 1], m ∈ R. Basic operators, such as Min, Probabilistic, etc., can be derived from the ZUAND operators by specifying the parameter. If the premise part of a rule is connected with the logic connective OR, it can be replaced by the parameterized ZUOR operators, and the firing strength of the i-th rule is
\[
\tau_i = S(x_1, x_2, \ldots, x_n, h_S)
= 1 - \left(\max\!\left(0^{m_S},\; (1 - x_1)^{m_S} + (1 - x_2)^{m_S} + \cdots + (1 - x_n)^{m_S} - (n-1)\right)\right)^{1/m_S}. \tag{7}
\]
Basic operators, such as Max, Strong, etc., can be derived from the ZUOR operators by specifying the parameter.

4) Layer 4 (the implication layer). Generate the qualified consequent of each rule depending on its firing strength. Here each node generates the consequence of a rule through the parameterized ZUIMP operators. The consequence of the i-th rule is
\[
F_i(y) = I(\tau_i, B_i(y), h_I)
= \left(\min\!\left(1 + 0^{m_I},\; 1 - \tau_i^{m_I} + B_i^{m_I}(y)\right)\right)^{1/m_I}. \tag{8}
\]
In the above, τ_i is the firing strength of the i-th rule, and F_i(y) is the fuzzy set output of the i-th rule. The most often used fuzzy implication operators, such as Goguen, Lukasiewicz and so on, can be derived from the ZUIMP operators by specifying the parameter.

5) Layer 5 (the aggregation layer). Aggregate the qualified consequences to produce a fuzzy output. Replacing the logical connective ALSO with ZUAND operators (because B_i ⊂ F_i according to the property of the ZUIMP operators), the overall fuzzy output of the output variable y is
\[
F(y) = T(F_1(y), F_2(y), \ldots, F_M(y), h_{T_2})
= \left(\max\!\left(0^{m_{T_2}},\; F_1^{m_{T_2}}(y) + F_2^{m_{T_2}}(y) + \cdots + F_M^{m_{T_2}}(y) - (M-1)\right)\right)^{1/m_{T_2}}. \tag{9}
\]
6) Layer 6 (the defuzzification layer). A crisp output can be obtained with a defuzzification method; usually we use the COA method. That is
\[
y^* = \frac{\int_Y y\,F(y)\,dy}{\int_Y F(y)\,dy}. \tag{10}
\]
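A minimal forward pass through the six layers can be sketched as follows. This is our own illustration, not the authors' code: it fixes h = 0.5 so that m = 1 (the Lukasiewicz family, where the ZUAND/ZUIMP expressions are well defined with m > 0), uses two rules over two inputs for brevity, and discretizes Y to approximate the COA integral of eq. (10). Names such as `zuand`, `zuimp` and `coa` are hypothetical helpers.

```python
import math

def gauss(x, c, sigma):
    """Layer 2: Gauss-shaped membership function, eq. (5)."""
    return math.exp(-((x - c) ** 2) / sigma ** 2)

def m_of_h(h):
    """Relation between the operator parameter m and the correlation h."""
    return (3 - 4 * h) / (4 * h * (1 - h))

def zuand(values, h):
    """Layers 3/5: zero-level universal AND, eq. (6) (assumes m > 0)."""
    m = m_of_h(h)
    return max(0.0, sum(v ** m for v in values) - (len(values) - 1)) ** (1 / m)

def zuimp(tau, b, h):
    """Layer 4: zero-level universal implication, eq. (8); for m > 0, 1 + 0^m = 1."""
    m = m_of_h(h)
    return min(1.0, 1.0 - tau ** m + b ** m) ** (1 / m)

def coa(ys, fs):
    """Layer 6: center-of-area defuzzification, eq. (10), on a grid."""
    den = sum(fs)
    return sum(y * f for y, f in zip(ys, fs)) / den if den else 0.0

h = 0.5                                       # gives m = 1 (Lukasiewicz family)
rules = [
    # (membership functions of x1 and x2, consequent membership on Y)
    ((lambda x: gauss(x, 1.0, 0.5), lambda x: gauss(x, 1.0, 0.5)),
     lambda y: gauss(y, 0.8, 0.2)),
    ((lambda x: gauss(x, 0.0, 0.5), lambda x: gauss(x, 0.0, 0.5)),
     lambda y: gauss(y, 0.2, 0.2)),
]
ys = [i / 20.0 for i in range(21)]            # discretized universe Y
x1, x2 = 0.9, 0.8                             # layer 1: crisp inputs

consequents = []
for (a1, a2), b in rules:
    tau = zuand([a1(x1), a2(x2)], h)                       # layer 3
    consequents.append([zuimp(tau, b(y), h) for y in ys])  # layer 4
agg = [zuand([f[i] for f in consequents], h)               # layer 5 (ALSO)
       for i in range(len(ys))]
y_star = coa(ys, agg)                                      # layer 6
print(round(y_star, 3))
```

With inputs near the first rule's antecedent, the crisp output lands between the grid midpoint and that rule's consequent center 0.8, as expected.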
4.2 Learning Algorithms
After discussing the structure of the ULFNN, we investigate how to determine a concrete controller. It is well known that neural networks have a strong learning capability, which can be introduced into the fuzzy system to determine the parameters of the universal logics fuzzy neural network controller through training, so as to meet the needs of different controlled objects. In order to balance the convergence and the speed of the learning process, a BP algorithm with an adaptive learning rate is given. Several important formulas are proved first.

Formula 1:
\[
\frac{\partial T(x_1, x_2, \ldots, x_n, h)}{\partial x_j} = A_T^{\frac{1}{m}-1}\, x_j^{m-1}.
\]
In the above, A_T = x_1^m + x_2^m + ⋯ + x_n^m − (n − 1).

Proof: Let A_T = x_1^m + x_2^m + ⋯ + x_n^m − (n − 1). From ref. [5], we have
\[
\frac{\partial T(x_1, x_2, \ldots, x_n, h)}{\partial x_j}
= \frac{\partial \left(x_1^m + x_2^m + \cdots + x_n^m - (n-1)\right)^{\frac{1}{m}}}{\partial x_j}
= \frac{1}{m} A_T^{\frac{1}{m}-1} \cdot m\, x_j^{m-1}
= A_T^{\frac{1}{m}-1}\, x_j^{m-1}.
\]
Formula 2:
\[
\frac{\partial T(x_1, x_2, \ldots, x_n, h)}{\partial h} =
\begin{cases}
A_T^{\frac{1}{m}-1}\,(A_T \ln A_T - m B_T)\,C, & h \in (0.75, 1), \text{ or } h \in (0, 0.75) \text{ and } A_T > 0, \\
0, & h \in (0, 0.75) \text{ and } A_T \le 0, \\
\text{does not exist}, & h = 0,\ 0.75,\ 1.
\end{cases}
\]
In the above, B_T = x_1^m ln x_1 + x_2^m ln x_2 + ⋯ + x_n^m ln x_n, and C = 1 + 3/(4h − 3)².
Proof: From ref. [5], when h ∈ (0, 0.75) and A_T ≤ 0, or h = 0, 0.75, 1, the formula holds obviously. It remains to prove the case h ∈ (0.75, 1), or h ∈ (0, 0.75) and A_T > 0. Let B_T = x_1^m ln x_1 + x_2^m ln x_2 + ⋯ + x_n^m ln x_n, and note that ∂m/∂h = −m²C. We have
\[
\frac{\partial T(x_1, \ldots, x_n, h)}{\partial h}
= \frac{\partial \left(x_1^m + \cdots + x_n^m - (n-1)\right)^{\frac{1}{m}}}{\partial m} \cdot \frac{\partial m}{\partial h}
= \frac{1}{m^2} A_T^{\frac{1}{m}-1}\,(m B_T - A_T \ln A_T) \cdot (-m^2 C)
= A_T^{\frac{1}{m}-1}\,(A_T \ln A_T - m B_T)\,C.
\]

Formula 3:
\[
\frac{\partial I(x_1, x_2, h)}{\partial x_j} = (-1)^j A_I^{\frac{1}{m}-1}\, x_j^{m-1}, \quad \text{where } A_I = 1 - x_1^m + x_2^m,\ j = 1, 2.
\]

Proof: Let A_I = 1 − x_1^m + x_2^m. From ref. [5], we have
\[
\frac{\partial I(x_1, x_2, h)}{\partial x_j}
= \frac{\partial \left(1 - x_1^m + x_2^m\right)^{\frac{1}{m}}}{\partial x_j}
= \frac{1}{m} \left(1 - x_1^m + x_2^m\right)^{\frac{1}{m}-1} (-1)^j\, m\, x_j^{m-1}
= (-1)^j A_I^{\frac{1}{m}-1}\, x_j^{m-1}.
\]
Formula 4:
\[
\frac{\partial I(x_1, x_2, h)}{\partial h} =
\begin{cases}
A_I^{\frac{1}{m}-1}\,(A_I \ln A_I - m B_I)\,C, & h \in (0.75, 1), \text{ or } h \in (0, 0.75) \text{ and } A_I < 1, \\
0, & h \in (0, 0.75) \text{ and } A_I \ge 1, \\
\text{does not exist}, & h = 0,\ 0.75,\ 1.
\end{cases}
\]
In the above, B_I = −x_1^m ln x_1 + x_2^m ln x_2.

Proof: From ref. [5], when h ∈ (0, 0.75) and A_I ≥ 1, or h = 0, 0.75, 1, the formula holds obviously. It remains to prove the case h ∈ (0.75, 1), or h ∈ (0, 0.75) and A_I < 1. Let B_I = −x_1^m ln x_1 + x_2^m ln x_2, and again ∂m/∂h = −m²C. We have
\[
\frac{\partial I(x_1, x_2, h)}{\partial h}
= \frac{\partial \left(1 - x_1^m + x_2^m\right)^{\frac{1}{m}}}{\partial m} \cdot \frac{\partial m}{\partial h}
= \frac{1}{m^2} A_I^{\frac{1}{m}-1}\,(m B_I - A_I \ln A_I) \cdot (-m^2 C)
= A_I^{\frac{1}{m}-1}\,(A_I \ln A_I - m B_I)\,C.
\]
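Formulas 1-4 can be checked numerically by comparing the analytic derivatives against finite differences. The short check below is our own verification, not from the paper; it tests Formula 1 for the ZUAND operator at h = 0.6 (where m = 0.625 > 0 and A_T > 0 for the chosen inputs), and the helper names are ours.

```python
def m_of_h(h):
    # m = (3 - 4h) / (4h(1 - h)), the m-h relation used throughout Sect. 4
    return (3 - 4 * h) / (4 * h * (1 - h))

def T(xs, h):
    # ZUAND operator, eq. (6) (valid here since m > 0 for h = 0.6)
    m = m_of_h(h)
    return max(0.0, sum(x ** m for x in xs) - (len(xs) - 1)) ** (1 / m)

def dT_dx1(xs, h):
    # Formula 1: dT/dx_j = A_T^(1/m - 1) * x_j^(m - 1)
    m = m_of_h(h)
    A_T = sum(x ** m for x in xs) - (len(xs) - 1)
    return A_T ** (1 / m - 1) * xs[0] ** (m - 1)

h, xs, eps = 0.6, [0.9, 0.8], 1e-6
numeric = (T([xs[0] + eps, xs[1]], h) - T([xs[0] - eps, xs[1]], h)) / (2 * eps)
print(abs(numeric - dT_dx1(xs, h)) < 1e-6)
```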
The formulas for the ZUOR operators are omitted owing to their similarity with the ZUAND operators. Let η be the learning rate of the adjustable parameters. The following two strategies should be adopted to adjust the learning rate during training:
- If the error measure undergoes 4 consecutive reductions, increase η.
- If the error measure undergoes 2 consecutive combinations of 1 increase and 1 reduction, decrease η.
In order to increase the convergence speed of the BP algorithm, the initial values of the adjustable parameters should be set to about 0.5.
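The two adjustment strategies above can be expressed as a small bookkeeping routine wrapped around the BP updates. The sketch below is our own illustration; the increase/decrease factors are hypothetical, since the paper does not specify them.

```python
def adapt_eta(errors, eta, up=1.1, down=0.9):
    """Adjust the learning rate eta from the recent error-measure history.

    Strategy 1: four consecutive reductions           -> increase eta.
    Strategy 2: two consecutive (increase, reduction)
                combinations (up, down, up, down)     -> decrease eta.
    """
    if len(errors) >= 5:
        last = errors[-5:]
        diffs = [b - a for a, b in zip(last, last[1:])]
        if all(d < 0 for d in diffs):                  # 4 reductions
            return eta * up
        if diffs[0] > 0 > diffs[1] and diffs[2] > 0 > diffs[3]:
            return eta * down                          # up, down, up, down
    return eta

print(adapt_eta([1.0, 0.9, 0.8, 0.7, 0.6], 0.1))       # increased
print(adapt_eta([1.0, 1.2, 1.1, 1.3, 1.2], 0.1))       # decreased
```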
5 Control Simulations

The fuzzy controller takes the angle and angular velocity of pendulum 1 and pendulum 2, and the position and velocity of the cart, as the input items, and takes the driving force F as the output item. Without loss of generality, the rail origin is selected as the desired position of the cart. The stabilization control of the double inverted pendulum system is then to balance the two pendulums upright and move the cart to the rail origin in a short time; if the six input items all converge to zero, the stabilization control is achieved. The membership functions of each variable are defined as Gauss-shaped. At each level of information granularity, the total number of fuzzy rules in the controller is reduced significantly. To verify the effectiveness of the proposed controller, many different simulations were done.
Fig. 5. Simulation results for initial angles 5° and 0° in (a), and both 5° in (b)
Fig. 5 shows the control results when the initial angles of the two pendulums are set to 5° and 0° in (a), and 5° and 5° in (b), while the initial values of the other state variables are all set to zero; the sampling period is 0.01 s. In these figures, line ① denotes the position of the cart, line ② the angle of pendulum 1, and line ③ the angle of pendulum 2. As the control results show, the stabilization time of both simulations is about 6 s. More experiments and simulation results are not listed here because of length limitations. However, all the simulation results show that the fuzzy controller can stabilize the parallel-type double inverted pendulum system in a relatively short time for a wide range of initial angles of the two pendulums. Since the conventional
fuzzy inference model has difficulty setting up all the fuzzy rules for 6 input items and changing the control structure, the method proposed in this paper offers clear advantages in the stabilization control of the double inverted pendulum system.
6 Conclusions and Future Work

In a word, the proposed controller is a universal fuzzy neural network controller for solving control problems, not only for the parallel-type double inverted pendulum system. It has a simple and intuitively understandable structure, and can attain the prescribed accuracy for a certain class of uncertain systems while reducing the control complexity. Moreover, with the assistance of the ULFNN, more flexible structures suitable for various controlled objects can be easily obtained, which improves the performance of the controllers greatly. In the future, our work will mainly focus on further improving the efficiency of fuzzy controllers.
References
1. Yi, J.Q., Naoyoshi, Y., Kaoru, H.: A New Fuzzy Controller for Stabilization of Parallel-type Double Inverted Pendulum System. Fuzzy Sets and Systems 126 (2002) 105-119
2. Tal, C.W., Taur, J.S.: Fuzzy Adaptive Approach to Fuzzy Controllers with Spacial Model. Fuzzy Sets and Systems 125 (2002) 61-77
3. Sun, Q., Li, R.H., Zhang, P.A.: Stable and Optimal Adaptive Fuzzy Control of Complex Systems Using Fuzzy Dynamic Model. Fuzzy Sets and Systems 133 (2003) 1-17
4. Jinwoo, K., Zeigler, B.P.: Designing Fuzzy Logic Controllers Using a Multiresolutional Search Paradigm. IEEE Trans. Fuzzy Systems 4(3) (1996) 213-226
5. He, H.C.: Principle of Universal Logics. Science Press, Beijing (2006)
The Research of Decision Information Fusion Algorithm Based on the Fuzzy Neural Networks

Pei-Gang Sun¹,², Hai Zhao¹, Xiao-Dan Zhang³, Jiu-Qiang Xu¹, Zhen-Yu Yin¹, Xi-Yuan Zhang¹, and Si-Yuan Zhu¹

¹ School of Information Science & Engineering, Northeastern University, Shenyang 110004, P.R. China
² Shenyang Artillery Academy, Shenyang 110162, P.R. China
³ Shenyang Institute of Aeronautical Engineering, Shenyang 110034, P.R. China
{sunpg,zhhai,xujq,cmy,zhangxy,zhusy}@neuera.com, [email protected]
http://www.netology.cn
Abstract. A new decision information fusion algorithm based on fuzzy neural networks is proposed, which introduces fuzzy comprehensive assessment into traditional decision information fusion technology under the "soft" decision architecture. The fusion process is composed of the comprehensive operation and the global decision: the local decisions of multiple sensors are fused at the fusion center to obtain the global decision about the concerned object. In practical application, the algorithm has been successfully applied in the temperature fault detection and diagnosis system of the hydroelectric simulation system of Jilin Fengman. In the analysis of factual data, the performance of the algorithm exceeds that of the traditional diagnosis method.
1 Introduction
Information fusion is a new information processing technology for combining data obtained from multiple sources, such as sensors, databases, knowledge bases and so on. It aims at obtaining a coherent explanation and description of the concerned object and environment by making the most of multi-sensor resources, combining the redundant and complementary information that each sensor has obtained, and rationally employing each sensor and its data. Information fusion is a comprehensive, multi-angle and multi-layer analysis process of the concerned object [1], [2]. Information fusion can be classified into three levels according to the abstraction level of the data: pixel-level fusion, characteristic-level fusion and decision-level fusion [3]. Decision fusion is a high-level fusion process, and its result is often utilized as the basis for the system decision. Because decision-level fusion often concerns all kinds of factors besides the data obtained by sensors, and the evidence of the decision fusion process is often uncertain, it is very difficult to construct an accurate, highly reliable model for a given problem.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 234–240, 2007. © Springer-Verlag Berlin Heidelberg 2007

But in practical application, the decision level
fusion can bring special benefits, such as high robustness and the ability to process different classes of information, so it has attracted the attention of scientists and engineers and has become an important subject in the study of information fusion theory and application [4], [5]. In this paper, a new decision-level fusion algorithm is studied which considers the fuzzy property of decision-level fusion and adopts the "soft" decision architecture of information fusion. The algorithm introduces fuzzy comprehensive assessment into the decision assessment in the fusion process. In practical application, the algorithm has been successfully applied in the temperature fault detection and diagnosis system of the hydroelectric simulation system of Jilin Fengman [6], [7], [8]. In the analysis of factual data, the performance of the algorithm exceeds that of the traditional diagnosis method.
2 Model of Fuzzy Comprehensive Assessment
The comprehensive assessment method is one of the important methods and tools in decision-making and analysis. Fuzzy comprehensive assessment is a comprehensive assessment method for objects and phenomena influenced by multiple factors, using fuzzy set theory [9]. The method has been successfully applied to industrial processes, product evaluation, quality supervision and so on [10]. In the process of fuzzy comprehensive assessment, (U, V, R) denotes the assessment model. The factor set U consists of all elements related to the assessment and can be represented as U = (u_1, u_2, ..., u_m). In general, every factor u_i has a different weight a_i. The weight set A is a fuzzy set represented by a fuzzy vector, A = (a_1, a_2, ..., a_m), where a_i is the value of the membership function of the factor u_i with respect to A; that is, it represents the importance of each factor in the comprehensive assessment. In general, it satisfies Σ_{i=1}^{m} a_i = 1, a_i > 0 (i = 1, 2, 3, ..., m).
The set V is the assessment set, which consists of the assessment degrees of the object. It can be represented as V = (v_1, v_2, ..., v_n), where v_j is an assessment degree. The matrix R = (r_ij)_{m×n} is a fuzzy mapping from U to V, where r_ij expresses the possibility degree of the j-th assessment when considering the i-th factor, i.e., the membership degree from u_i to v_j. In the process of fuzzy comprehensive assessment, let A = (a_1, a_2, ..., a_m) be the fuzzy set on the factor set U, in which a_i is the weight of u_i, and let B = (b_1, b_2, ..., b_n) be the fuzzy set on the assessment set V. The comprehensive assessment can then be represented as follows:

B = A ◦ R = (b_1, b_2, ..., b_n)  (1)
In formula (1), the operator ◦ is often defined as the assessment arithmetic operator (∧∗, ∨∗), so formula (1) can be written as:

∀ b_i ∈ B, b_i = (a_1 ∧∗ r_1i) ∨∗ (a_2 ∧∗ r_2i) ∨∗ ⋯ ∨∗ (a_m ∧∗ r_mi)  (2)
P.-G. Sun et al.
In general, the assessment arithmetic operator can be defined as the common matrix operations ("multiplication" and "addition") or the Zadeh fuzzy operations ("and" and "or"), according to the practical application. Following the comprehensive process, the synthetic evaluation of (b_1, b_2, ..., b_n) is a defuzzification process turning a fuzzy quantity into a precise quantity; methods such as the max membership principle [11], the centroid method [12], the weighted average method, etc., can be adopted. The max membership principle, also known as the height method, is limited to peaked output. The centroid method, also called the center of area or center of gravity, is the most prevalent and physically appealing of all the defuzzification methods. The weighted average method is only valid for symmetrical output membership functions, but it is simple and convenient [13]. In practical application, the exact method of synthetic evaluation usually depends upon the application.
3 The Decision Information Fusion Algorithm Based on the Fuzzy Neural Networks

3.1 The Architecture of the "Soft" Decision Information Fusion
The objects of decision information fusion are usually the local decisions of the sensors; that is, the process of decision information fusion is that of reaching a global decision on the basis of the local decisions of multiple sensors. The method or architecture of decision information fusion is usually classified as either "hard" decision or "soft" decision according to the form of the local decision of the sensor. In the "hard" decision, the local decision of the sensor is usually a binary hypothesis test whose result is either zero or one, according to the threshold level, so the local decision sent directly to the fusion center is either zero or one. In the "soft" decision, the whole decision region of the sensor is usually divided into multiple regions, and the result of the sensor includes not only the decision region but also the possibility value of belonging to that region; thus the information sent to the fusion center in the "soft" decision is the possibility of each hypothesis. In the "hard" decision, the sensor cannot provide any information below or above the threshold level, so that information is lost in the fusion process at the fusion center. Compared with the "hard" decision, the "soft" decision provides not only the decision region but also the possibility of the region, and at the fusion center both can be utilized in the fusion process. The architecture of the "soft" decision process under fuzzy comprehensive assessment is shown in Fig. 1.

3.2 The Description of the Algorithm
As shown in Fig. 1, the decision-level information fusion algorithm based on the fuzzy neural networks adopts the architecture of the "soft" decision. In the algorithm,
Fig. 1. The architecture of the "soft" decision fusion under the fuzzy comprehensive assessment
we consider an information fusion system consisting of m sensors that observe the same phenomenon. Each sensor makes its local decision based on its observation; the local decision, which includes the decision region and its possibility value, is sent to the fusion center, and the global decision based on the local decisions of the m sensors is obtained there. Let S = (s_1, s_2, ..., s_m) denote the sensor set, and let the result of the fusion center be classified into n regions, called the assessment set Y = (y_1, y_2, ..., y_n). In the "soft" decision process of each sensor, the result of each sensor is a vector of possibility values on the assessment set Y; for the i-th sensor, the local decision can be described as the vector r_i = (r_i1, r_i2, ..., r_in), and after normalization the input of the fusion center for the i-th sensor is the vector r̄_i = (r̄_i1, r̄_i2, ..., r̄_in). For all s_i ∈ S, the vectors r̄_i constitute the m × n matrix R, called the fusion matrix of the fusion center, which can be described as follows:
\[
R =
\begin{bmatrix}
\bar{r}_{11} & \bar{r}_{12} & \cdots & \bar{r}_{1n} \\
\bar{r}_{21} & \bar{r}_{22} & \cdots & \bar{r}_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\bar{r}_{m1} & \bar{r}_{m2} & \cdots & \bar{r}_{mn}
\end{bmatrix}
\tag{3}
\]
For each sensor in the fusion system, the effect is always different. Let A denote the sensor weight vector, a fuzzy set on the sensor set S described as the normalized fuzzy vector A = (a_1, a_2, ..., a_m) with a_i = μ(s_i), i = 1, 2, ..., m. In the comprehensive operation of the algorithm, the comprehensive result of the sensor weight vector and the fusion matrix is a fuzzy set on the assessment set, which can be described as follows:

B = A ◦ R = (b_1, b_2, ..., b_n)  (4)
For the comprehensive operator, the algorithm adopts the comprehensive operator (∧∗, ∨∗) of fuzzy comprehensive assessment. In the global decision process at the fusion center, the input is the vector (b_1, b_2, ..., b_n) resulting from the comprehensive operation. In this research, the max membership principle is adopted; that is, if there exists i ∈ (1, 2, ..., n) satisfying b_i = max(b_1, b_2, ..., b_n), then the global decision of the fusion center is the i-th assessment.
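Under the max-min choice of (∧∗, ∨∗), the whole fusion step of Sect. 3.2 fits in a few lines: normalize each sensor's local decision, compose the weight vector with the fusion matrix as in eq. (4), and take the arg-max. The sketch below is our own minimal illustration of this scheme; the function names are ours.

```python
def normalize(v):
    """Scale a possibility vector so that its components sum to 1."""
    s = sum(v)
    return [x / s for x in v] if s else v

def fuse(A, R):
    """Comprehensive operation B = A o R with the max-min operator, eq. (4)."""
    m, n = len(R), len(R[0])
    return [max(min(A[i], R[i][j]) for i in range(m)) for j in range(n)]

def global_decision(A, local_decisions):
    """Soft-decision fusion: normalize, fuse, then max membership principle."""
    R = [normalize(r) for r in local_decisions]
    B = fuse(A, R)
    return max(range(len(B)), key=lambda j: B[j])  # index of winning assessment
```

For example, with m = 2 sensors and n = 3 assessment regions, `global_decision([0.6, 0.4], [[2, 1, 1], [1, 3, 0]])` normalizes the two local decisions and returns the index of the region with the largest fused membership.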
4 Experiment Analysis
In the Hydroelectric Simulation System of Jilin Fengman, the generator system is an important component; its working condition has a great influence on the stability of the whole system, so fault detection and diagnosis of the generator is necessary. As things stand, the detection system in the hydroelectric system usually adopts the sensor-threshold method of fault detection and diagnosis: each primary parameter of the equipment is supervised by a sensor, and the data is sent to the detection center, where a threshold level for the parameter is set in advance; when the gathered data exceeds the threshold level, the corresponding fault alarm is triggered. The sensitivity of the whole detection system therefore depends on the threshold level. But in practical application the threshold level is set manually: if it is too high, alarms may fail to be reported; if it is too low, the system may raise alarms while the equipment is in order. To address these disadvantages of the traditional detection and diagnosis system, information fusion technology can be applied to fault detection and diagnosis. In the practical diagnosis system, multiple sensors have been embedded into the equipment to gather the current data of the environment. At the fusion center, redundant and complementary data are made full use of, so a precise estimate of the equipment status can be achieved, the belief quantity of the diagnosis system can be enhanced, and the fuzziness of the status is decreased. The application of information fusion thus improves detection performance by making full use of the resources of multiple sensors [14], [15]. In the simulation system, we applied the new decision information fusion algorithm to the temperature fault detection and diagnosis of the generator.
In this diagnostic system, three embedded temperature sensors have been installed in the generator, and the temperature of the equipment is gathered periodically [16], [17]. The sensor set can be defined as S = (s_1, s_2, s_3). It has been found in the practical application of the system that the causes of a temperature alarm of the generator can be classified into faults of the circulation water equipment, faults of the cooling water equipment, operator error, etc. So the assessment set can be defined as Y = (y_1, y_2, y_3, y_4, y_5) = {circulation water valve shut down by error, low pressure of circulation, cooling water valve shut down by error, cooling water pump loses pressure and backup pump not switched, other undefined reason} in the temperature fault diagnosis system.
The effect of the three sensors differs in the diagnosis system because of their positions, precision and so on, so in the practical application the weight vector has been allocated according to experience. That is,

A = (a_1, a_2, a_3) = (0.4400, 0.2300, 0.3300)  (5)
The three embedded sensors gather the data and make their local decisions; each local decision, which is the possibility value of the fault, is sent to the fusion center. The diagnosis process at the fusion center is as follows: first, the local decision of each sensor is normalized, and the normalized results constitute the fusion matrix; second, the comprehensive operation is performed between the sensor weight vector and the decision matrix; at last, the global decision about the fault is made according to the result of the comprehensive operation under the max membership principle. For example, in one diagnosis the normalized local decisions of the sensors are described in Tab. 1.

Table 1. Experiment data of the diagnosis system
        S1      S2      S3
O1    0.3750  0.2712  0.1450
O2    0.2200  0.4386  0.3338
O3    0.0000  0.0000  0.0000
O4    0.4050  0.2902  0.5212
O5    0.0000  0.0000  0.0000
In this research, the comprehensive operation of the fusion center adopts the fuzzy set conjunction and disjunction operations, i.e., the max-min operator, so the result of the comprehensive operation is

B = (0.3750, 0.3300, 0.0000, 0.4050, 0.0000)

After normalization, the following is obtained:

B̄ = (0.3378, 0.2973, 0.0000, 0.3649, 0.0000)

In the global decision, according to the max membership principle, the fault is diagnosed as "the cooling water pump loses pressure and the backup pump is not switched".
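The numbers above can be reproduced directly from Table 1 and the weight vector of eq. (5). The short check below is our own verification script, not part of the paper: it computes b_j = max_i min(a_i, r_ij) and then normalizes.

```python
A = [0.44, 0.23, 0.33]                           # sensor weights, eq. (5)
R = [[0.3750, 0.2200, 0.0000, 0.4050, 0.0000],   # sensor S1 over O1..O5
     [0.2712, 0.4386, 0.0000, 0.2902, 0.0000],   # sensor S2
     [0.1450, 0.3338, 0.0000, 0.5212, 0.0000]]   # sensor S3

# Max-min comprehensive operation: b_j = max_i min(a_i, r_ij)
B = [max(min(a, row[j]) for a, row in zip(A, R)) for j in range(5)]
B_norm = [round(b / sum(B), 4) for b in B]
print(B)        # matches (0.3750, 0.3300, 0.0000, 0.4050, 0.0000) in the text
print(B_norm)   # matches (0.3378, 0.2973, 0.0000, 0.3649, 0.0000) in the text
```

The largest normalized component is the fourth one, so the max membership principle selects assessment O4, exactly as stated above.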
5 Conclusion
In this paper, a new decision information fusion algorithm based on fuzzy neural networks is proposed. The fusion process consists of the comprehensive operation and the global decision: the local decisions of multiple sensors are fused at the fusion center to obtain a global decision about the object of interest. The algorithm has been successfully applied in the temperature fault detection and diagnosis system of a hydroelectric simulation system, and its performance exceeds that of the traditional diagnosis method.
P.-G. Sun et al.
Acknowledgments. The authors acknowledge the support of the Natural Science Foundation of China (NSF No. 69873007) and the National High-Tech Research and Development Plan of China (NHRD No. 2001AA415320) for this project, and the cooperation of the FengMan hydropower plant of Jilin province, China, in developing and running this system.
References
1. Liu, T.M., Xia, Z.X., Xie, H.C.: Data Fusion Techniques and Its Applications. National Defense Industry Press (1999)
2. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problems. Biological Cybernetics 52 (1985) 141-152
3. Carvalho, H.S., Heinzelman, W.B., Murphy, A.L., Coelho, C.J.N.: A General Data Fusion Architecture. Proceedings of Information Fusion 2003, 2 (2003) 1465-1472
4. Yu, N.H., Yin, Y.: Multiple Level Parallel Decision Fusion Model with Distributed Sensors Based on Dempster-Shafer Evidence Theory. Proceedings of the 2003 International Conference on Machine Learning and Cybernetics 5 (2003) 3104-3108
5. Wang, X., Foliente, G., Su, Z., Ye, L.: Multilevel Decision Fusion in a Distributed Active Sensor Network for Structural Damage Detection. Structural Health Monitoring 5(1) (2006) 45-58
6. Zhang, X.D., Zhao, H., Wang, G., Wei, S.Z.: Fusion Algorithm for Uncertain Information by Fuzzy Decision Tree. Journal of Northeastern University (Natural Science) 25(7) (2004) 657-660
7. Wang, G., Zhang, D.G., Zhao, H.: Speed Governor Model Based on Fuzzy Information Fusion. Journal of Northeastern University (Natural Science) 23(6) (2002) 519-522
8. Zhang, D.G., Zhao, H.: General Hydropower Simulation System Based on Information Fusion. Journal of System Simulation 14(10) (2002) 1344-1347
9. Hall, D.: Mathematical Techniques in Multisensor Data Fusion. Artech House Press, London (1992) 235-238
10. Waltz, E.L.: Multisensor Data Fusion. Artech House Press, Norwood (1991) 101-105
11. Wei, S.Z., Zhao, H., Wang, G., Liu, H.: Distributed Fusion Algorithms in Embedded Network On-line Fusion System. Proceedings of Information Fusion 2004, Stockholm, Sweden (2004) 622-628
12. Hou, Z.Q., Han, C.Z., Zheng, L.: A Fast Visual Tracking Algorithm Based on Circle Pixels Matching. Proceedings of Information Fusion 2003, 1 (2003) 291-295
13. Yager, R.R.: The Ordered Weighted Averaging Operators: Theory and Applications. Kluwer Academic Publishers (1997) 10-100
14. Llinas, J.: Assessing the Performance of Multisensor Fusion Systems. Proceedings of the International Society for Optical Engineering 1661 (1992) 2-27
15. Kai, F.G.: Conflict Resolution Using Strengthening and Weakening Operations in Decision Fusion. Proceedings of the 4th International Conference on Information Fusion 1 (2001) 19-25
16. Matsuda, S.: Theoretical Limitations of a Hopfield Network for Crossbar Switching. IEEE Transactions on Neural Networks 12(3) (2001) 456-462
17. Wang, G., Zhang, D.G., Zhao, H.: Speed Governor Model Based on Fuzzy Information Fusion. Journal of Northeastern University (Natural Science) 23(6) (2002) 519-522
Equalization of Channel Distortion Using Nonlinear Neuro-Fuzzy Network

Rahib H. Abiyev1, Fakhreddin Mamedov2, and Tayseer Al-shanableh2

1 Near East University, Department of Computer Engineering, Lefkosa, North Cyprus
[email protected]
2 Near East University, Department of Electrical and Electronic Engineering, Lefkosa, North Cyprus
Abstract. This paper presents the equalization of channel distortion using a Nonlinear Neuro-Fuzzy Network (NNFN). The NNFN is constructed on the basis of fuzzy rules that incorporate nonlinear functions, and its learning algorithm is presented. The NNFN is applied to the equalization of channel distortion in time-invariant and time-varying channels. The developed equalizer recovers the transmitted signal efficiently. The performance of the NNFN-based equalizer is compared with that of other nonlinear equalizers, and the effectiveness of the proposed system is evaluated using simulation results of the NNFN-based equalization system.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 241–250, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

In digital communications, channels are affected by both linear and nonlinear distortion, such as intersymbol interference and channel noise. Various equalizers have been applied to equalize these distortions and recover the original transmitted signal [1,2]. Linear equalizers cannot reconstruct the transmitted signal when channels have significant nonlinear distortion [3]. Since nonlinear distortion is often encountered on time-variant channels, linear equalizers do not perform well on such channels. When a channel has time-varying characteristics and the channel model is not precisely known, adaptive equalization is applied [4]. Nowadays, neural networks are widely used for the equalization of nonlinear channel distortion [5-12]. One class of adaptive equalizers is based on multilayer perceptrons (MLP) and radial basis functions (RBF) [5-10]. MLP equalizers require a long training time and are sensitive to the initial choice of network parameters [5,8,9]. RBF equalizers are simple and require less training time, but usually need a large number of centers, which increases the computational complexity [6,7,10]. An application of neural networks to adaptive equalization of nonlinear channels is given in [11], where the equalization of communication systems is simulated using the 16 QAM (quadrature amplitude modulation) scheme. In [12], a neural decision-feedback equalizer is developed using an adaptive filter algorithm and applied to the equalization of nonlinear communication channels.

One of the effective ways to develop adaptive equalizers for nonlinear channels is the use of fuzzy technology. This type of adaptive equalizer can process
numerical data and linguistic information in natural form [13,14,15]. Human experts determine fuzzy IF-THEN rules using input-output data pairs of the channel, and these rules are used to construct the filter for the nonlinear channel. In such systems the incorporation of linguistic and numerical information improves the adaptation speed and the bit error rate (BER) [13]. Fuzzy logic is used to implement a Bayesian equalizer that eliminates co-channel interference in [16,17]. A TSK-based decision feedback fuzzy equalizer is developed using an evolutionary algorithm and applied to a QAM communication system in [18]. Sometimes the construction of proper fuzzy rules for equalizers is difficult, and one of the effective technologies for constructing the equalizer's knowledge base is the use of neural networks. Much effort has been devoted to the development and improvement of fuzzy neural network models. The structures of most neuro-fuzzy systems implement TSK-type or Mamdani-type fuzzy reasoning mechanisms. The adaptive neuro-fuzzy inference system (ANFIS) implements a TSK-type fuzzy system [19]. The consequent parts of the TSK-type fuzzy system are linear functions, so such a system describes the considered problem by a combination of linear functions. When modeling complex nonlinear processes, these fuzzy systems may need many rules to obtain the desired accuracy, and increasing the number of rules increases the number of neurons in the hidden layer of the network. To improve the computational power of the neuro-fuzzy system, we use nonlinear functions in the consequent part of each rule. Based on these rules, the structure of the nonlinear neuro-fuzzy network (NNFN) is proposed. Because of these nonlinear functions, the NNFN has more computational power and can describe nonlinear processes with the desired accuracy. In this paper, the NNFN is used for the equalization of nonlinear channel distortion.
The NNFN allows the equalizer to be trained in a short time and gives better bit-error-rate results, at the cost of computational complexity. This paper is organized as follows. In Section 2, the architecture and learning algorithm of the NNFN are presented. In Section 3, the simulation of the NNFN-based channel equalization system is presented. Section 4 concludes the paper.
2 Nonlinear Neuro-Fuzzy Network

The kernel of a fuzzy inference system is the fuzzy knowledge base, in which the information consisting of input-output data points of the system is interpreted into linguistically interpretable fuzzy rules. In this paper, fuzzy rules of IF-THEN form constructed using nonlinear quadratic functions are used. The use of nonlinear functions increases the computational power of the neuro-fuzzy system. The rules have the following form:

If x1 is Aj1 and x2 is Aj2 and ... and xm is Ajm Then

y_j = \sum_{i=1}^{m} (w1_{ij} x_i^2 + w2_{ij} x_i) + b_j.    (1)
Here x1, x2, ..., xm are the input variables; yj (j = 1,...,n) are the output variables, which are nonlinear quadratic functions; Aji is the membership function of the j-th rule for the i-th
input, defined as a Gaussian membership function. w1ij, w2ij and bj (i = 1,...,m, j = 1,...,n) are the parameters of the network. The fuzzy model described by the IF-THEN rules is obtained by modifying the parameters of the conclusion and premise parts of the rules; in this paper, a gradient method is used to train these parameters in the neuro-fuzzy network structure. Using the fuzzy rules in equation (1), the architecture of the NNFN is proposed (Fig. 1). The NNFN includes seven layers. In the first layer, the number of nodes is equal to the number of input signals; these nodes distribute the input signals. In the second layer, each node corresponds to one linguistic term. For each input signal entering the system, the membership degree to which the input value belongs to a fuzzy set is calculated. The Gaussian membership function is used to describe the linguistic terms:
\mu_{1j}(x_i) = \exp\left( -\frac{(x_i - c_{ij})^2}{\sigma_{ij}^2} \right),  i = 1,...,m,  j = 1,...,J.    (2)

Fig. 1. The NNFN architecture
Here m is the number of input signals and J is the number of linguistic terms assigned to the external input signals xi. cij and σij are the centre and width of the Gaussian membership function of the j-th term of the i-th input variable, respectively, and μ1j(xi) is the membership function of the i-th input variable for the j-th term. In the third layer, the number of nodes corresponds to the number of rules (R1, R2, ..., Rn), and each node represents one fuzzy rule. To calculate the values of the output signals, the AND (min) operation is used; in formula (3), Π denotes the min operation:
\mu_l(x) = \prod_j \mu_{1j}(x_i),  l = 1,...,n,  j = 1,...,J.    (3)
The fourth layer is the consequent layer. It includes n nonlinear functions (NF), denoted NF1, NF2, ..., NFn. The output of each nonlinear function in Fig. 1 is calculated by

y_j = \sum_{i=1}^{m} (w1_{ij} x_i^2 + w2_{ij} x_i) + b_j,  j = 1,...,n.    (4)
In the fifth layer, the output signals μl(x) of the third layer are multiplied by the output signals of the nonlinear functions. In the sixth and seventh layers, defuzzification is performed to calculate the output of the entire network:

u = \frac{\sum_{l=1}^{n} \mu_l(x) y_l}{\sum_{l=1}^{n} \mu_l(x)}.    (5)
Here yl are the outputs of the fourth layer, i.e., the nonlinear quadratic functions, and u is the output of the whole network. After calculating the output signal of the NNFN, training of the network starts. Training includes the adjustment of the membership-function parameters cij and σij (i = 1,...,m, j = 1,...,n) in the second layer (premise part) and the nonlinear-quadratic-function parameters w1ij, w2ij, bj (i = 1,...,m, j = 1,...,n) in the fourth layer (consequent part). In the first step, the error at the network output is calculated:

E = \frac{1}{2} \sum_{i=1}^{O} (u_i^d - u_i)^2.    (6)
Here O is the number of network output signals (in the given case O = 1), and u_i^d and u_i are the desired and current output values of the network, respectively. The parameters w1ij, w2ij, bj and cij, σij (i = 1,...,m, j = 1,...,n) are adjusted using the following formulas:

w1_{ij}(t+1) = w1_{ij}(t) + \gamma \frac{\partial E}{\partial w1_{ij}} + \lambda (w1_{ij}(t) - w1_{ij}(t-1));
w2_{ij}(t+1) = w2_{ij}(t) + \gamma \frac{\partial E}{\partial w2_{ij}} + \lambda (w2_{ij}(t) - w2_{ij}(t-1));    (7)

b_j(t+1) = b_j(t) + \gamma \frac{\partial E}{\partial b_j} + \lambda (b_j(t) - b_j(t-1));    (8)

c_{ij}(t+1) = c_{ij}(t) + \gamma \frac{\partial E}{\partial c_{ij}};    (9)

\sigma_{ij}(t+1) = \sigma_{ij}(t) + \gamma \frac{\partial E}{\partial \sigma_{ij}}.    (10)
Here γ is the learning rate, λ is the momentum coefficient, m is the number of network input signals (input neurons), and n is the number of rules (hidden neurons), with i = 1,...,m and j = 1,...,n.
The derivatives in (7) and (8) are determined by the following formulas:

\frac{\partial E}{\partial w1_{ij}} = (u(t) - u^d(t)) \cdot \frac{\mu_l}{\sum_{l=1}^{n} \mu_l} \cdot x_i^2;
\frac{\partial E}{\partial w2_{ij}} = (u(t) - u^d(t)) \cdot \frac{\mu_l}{\sum_{l=1}^{n} \mu_l} \cdot x_i;
\frac{\partial E}{\partial b_j} = (u(t) - u^d(t)) \cdot \frac{\mu_l}{\sum_{l=1}^{n} \mu_l}.    (11)
The derivatives in (9) and (10) are determined by the following formulas:

\frac{\partial E}{\partial c_{ij}} = \sum_j \frac{\partial E}{\partial u} \frac{\partial u}{\partial \mu_l} \frac{\partial \mu_l}{\partial c_{ij}};   \frac{\partial E}{\partial \sigma_{ij}} = \sum_j \frac{\partial E}{\partial u} \frac{\partial u}{\partial \mu_l} \frac{\partial \mu_l}{\partial \sigma_{ij}}.    (12)

Here

\frac{\partial E}{\partial u} = u(t) - u^d(t),   \frac{\partial u}{\partial \mu_l} = \frac{y_l - u}{\sum_{l=1}^{n} \mu_l},   i = 1,...,m, j = 1,...,n, l = 1,...,n.    (13)

\frac{\partial \mu_l(x_j)}{\partial c_{ji}} = \mu_l(x_j) \frac{2(x_j - c_{ji})}{\sigma_{ji}^2} if node j is connected to rule node l, and 0 otherwise;

\frac{\partial \mu_l(x_j)}{\partial \sigma_{ji}} = \mu_l(x_j) \frac{2(x_j - c_{ji})^2}{\sigma_{ji}^3} if node j is connected to rule node l, and 0 otherwise.    (14)
Taking formulas (11) and (14) into account in (7)-(10), the learning of the NNFN parameters is carried out.
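The consequent-parameter gradients in eq. (11) can be verified numerically. The sketch below (Python/NumPy; the notation and random test data are ours) computes the analytic gradient of E with respect to w1 and compares one entry against a finite difference:

```python
import numpy as np

def forward(x, c, sigma, w1, w2, b):
    """NNFN forward pass; returns rule firings mu, rule outputs y, network output u."""
    mu = np.exp(-((x[:, None] - c) ** 2) / sigma ** 2).min(axis=0)
    y = (w1 * x[:, None] ** 2 + w2 * x[:, None]).sum(axis=0) + b
    return mu, y, (mu * y).sum() / mu.sum()

rng = np.random.default_rng(1)
m, n = 3, 5
x = rng.standard_normal(m)
c = rng.standard_normal((m, n))
sigma = np.abs(rng.standard_normal((m, n))) + 0.5
w1 = rng.standard_normal((m, n))
w2 = rng.standard_normal((m, n))
b = rng.standard_normal(n)
u_d = 0.7                       # desired output (arbitrary test value)

mu, y, u = forward(x, c, sigma, w1, w2, b)
# Analytic gradient of E = 0.5*(u - u_d)^2 w.r.t. w1, following eq. (11)
g_analytic = (u - u_d) * (mu / mu.sum())[None, :] * (x ** 2)[:, None]

# Finite-difference check on the (0, 0) entry
eps = 1e-6
w1p = w1.copy()
w1p[0, 0] += eps
_, _, up = forward(x, c, sigma, w1p, w2, b)
g_numeric = (0.5 * (up - u_d) ** 2 - 0.5 * (u - u_d) ** 2) / eps
print(abs(g_analytic[0, 0] - g_numeric) < 1e-4)   # True
```

The two values agree because, for fixed firing strengths, E is quadratic in each consequent parameter.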
3 Simulation

The architecture of the NNFN-based equalization system is shown in Fig. 2. Random binary input signals s(k) are transmitted through the communication channel. The channel medium includes the effects of the transmitter filter, the transmission medium, the receiver filter, and other components. Input signals can be distorted by noise and intersymbol interference. Intersymbol interference is mainly responsible for linear distortion; nonlinear distortions are introduced by converters and the propagation environment. The channel output signals are filtered and fed to the equalizer, which equalizes the distortion. During simulation, the transmitted signals s(k) are known input samples with an equal probability of being -1 and 1. These signals are corrupted by additive noise n(k), and the corrupted signals are the inputs of the equalizer. In channel equalization, the problem is the classification of the equalizer's incoming signal in a feature space divided into two decision regions. The equalizer makes a correct decision if its output ŝ(k) equals the transmitted signal s(k) (the channel input). Based on the values of the transmitted signal s(k) (i.e., ±1), the channel
Fig. 2. The architecture of the NNFN based equalization system
state can be partitioned into two classes, R+ = {x(k) | s(k) = 1} and R- = {x(k) | s(k) = -1}, where x(k) is the channel output signal. In this paper, the NNFN structure and its training algorithm are used to design the equalizer. Simulations have been carried out for the equalization of linear and nonlinear channels. In the first simulation, we use the following nonminimum-phase channel model:
x(k) = a1(k)s(k) + a2(k)s(k-1) + a3(k)s(k-2) + n(k),    (15)
where a1(k) = 0.3482, a2(k) = 0.8704 and a3(k) = 0.3482, and n(k) is additive noise. This type of channel is encountered in real communication systems. During equalizer design, the sequence of transmitted signals is given to the channel input; 200 symbols are used for training and 10^3 signals for testing. They are assumed to be an independent sequence taking values from {-1, 1} with equal probability. Additive Gaussian noise n(k) is added to the transmitted signal. At the output of the equalization system, the deviation of the original transmitted signal from the current equalizer output is determined, and this error e(k) is used to adjust the network parameters. Training is continued until the value of the error over the whole training sequence is acceptably low. During simulation, the input signals of the equalizer are the channel outputs x(k), x(k-1), x(k-2), x(k-3). The computer simulation of the equalization system has been performed using the NNFN, ANFIS [19], and feedforward neural networks. We used 27 rules (hidden neurons) in the NNFN, 27 hidden neurons in the feedforward neural network, and 36 rules (hidden neurons) in the ANFIS-based equalizer. The learning of the equalizers has been carried out over 3000 samples. After simulation, the performance characteristics (bit error rate versus signal-to-noise ratio) for all equalizers have been determined; the Bit Error Rate (BER) versus Signal-to-Noise Ratio (SNR) characteristics have been obtained for different noise levels. Fig. 3 shows the performance of the equalizers based on the NNFN (solid line), ANFIS (dashed line), and the feedforward neural network (dash-dotted line). As shown in the figure, in the region of low
Fig. 3. Performance of the NNFN (solid line with '+'), ANFIS (dashed line with 'o') and feedforward neural network (dash-dotted line with '*') based equalizers
SNR (high noise levels), the performance of the NNFN-based equalizer is better than that of the other equalizers. In the second simulation, the following nonlinear channel model was used:
x(k) = a1(k)s(k) + a2(k)s(k-1) - 0.9 (a1(k)s(k) + a2(k)s(k-1))^3 + n(k),    (16)
where a1(k) = 1 and a2(k) = 0.5. We consider the case when the channel is time-varying, that is, the coefficients a1(k) and a2(k) vary with time. They are generated using a second-order Markov model in which a white Gaussian noise source drives a second-order Butterworth low-pass filter [4,22]. In the simulation, a second-order Butterworth filter with cutoff frequency 0.1 is used. The colored Gaussian sequences used as the time-varying coefficients ai are generated with a standard deviation of 0.1. The curves representing the time variation of the channel coefficients are
Fig. 4. Time-varying coefficients of channel
Fig. 5. Performance of the NNFN (solid line with '+'), ANFIS (dashed line with 'o') and feedforward neural network (dash-dotted line with '*') based equalizers
Fig. 6. Error plot
depicted in Fig. 4. The first 200 symbols are used for training and 10^3 signals for testing. The simulations are performed using the NNFN, ANFIS, and feedforward neural networks, with 36 neurons in the hidden layer of each network. Fig. 5 illustrates the BER performance of the equalizers for channel (16), averaged over 10 independent trials. As shown in the figure, the performance of the NNFN-based equalizer is better than that of the others. Fig. 6 gives the error plot of the learning of the NNFN equalizer. The channel states are plotted in Fig. 7: Fig. 7(a) shows the noise-free channel states, Fig. 7(b) the channel states with additive noise, and Fig. 7(c) the channel states after equalization of the distortion, obtained after 3000 learning iterations. The obtained results confirm the efficiency of applying the NNFN technology to channel equalization.
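The generation of the time-varying coefficients and the channel (16) outputs can be sketched as follows (Python/NumPy; the biquad coefficient formulas, the noise level, and the sequence length are our assumptions, since the paper states only the filter order, cutoff, and standard deviation):

```python
import numpy as np

rng = np.random.default_rng(2)

def butter2_lowpass(x, cutoff):
    """Second-order Butterworth low-pass biquad (direct form I), with the
    cutoff normalized to the Nyquist frequency. Implemented inline so the
    sketch needs no signal-processing library."""
    w0 = np.pi * cutoff
    alpha = np.sin(w0) / np.sqrt(2.0)   # Q = 1/sqrt(2) gives the Butterworth response
    cosw = np.cos(w0)
    a0 = 1.0 + alpha
    b = np.array([(1 - cosw) / 2, 1 - cosw, (1 - cosw) / 2]) / a0
    a = np.array([-2 * cosw, 1 - alpha]) / a0
    y = np.zeros_like(x)
    x1 = x2 = y1 = y2 = 0.0
    for k in range(len(x)):
        y[k] = b[0] * x[k] + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x[k]
        y2, y1 = y1, y[k]
    return y

def time_varying_coeff(nominal, n, std=0.1, cutoff=0.1):
    """Colored Gaussian sequence: white noise through the Butterworth filter,
    rescaled to the stated standard deviation of 0.1."""
    colored = butter2_lowpass(rng.standard_normal(n), cutoff)
    return nominal + colored * (std / colored.std())

N = 1200                               # 200 training + 1000 test symbols
s = rng.choice([-1.0, 1.0], size=N)    # equiprobable +/-1 symbols
a1 = time_varying_coeff(1.0, N)
a2 = time_varying_coeff(0.5, N)

lin = a1 * s + a2 * np.concatenate(([0.0], s[:-1]))   # a1(k)s(k) + a2(k)s(k-1)
noise_std = 0.1                        # assumed noise level; the SNR is swept in the paper
x = lin - 0.9 * lin ** 3 + noise_std * rng.standard_normal(N)   # channel (16)
```

The resulting x(k), x(k-1), ... sequences are what the equalizer would receive as inputs.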
Fig. 7. Channel states: a) noise free, b) with noise, c) after equalization
4 Conclusion

The development of an NNFN-based equalizer has been carried out, and the learning algorithm has been applied to find its parameters. Using the developed equalizer, the equalization of linear and nonlinear time-varying channels in the presence of additive distortion has been performed. The simulation results of the NNFN-based equalizer are compared with those of equalizers based on feedforward neural networks. It was found that the NNFN-based equalizer has better BER performance than the other equalizers in noisy channels. The comparative simulation results confirm the efficiency of the NNFN in adaptive channel equalization.
References
[1] Proakis, J.: Digital Communications. McGraw-Hill, New York (1995)
[2] Qureshi, S.U.H.: Adaptive Equalization. Proc. IEEE 73 (9) (1985) 1349-1387
[3] Falconer, D.D.: Adaptive Equalization of Channel Nonlinearities in QAM Data Transmission Systems. Bell System Technical Journal 27 (7) (1978)
[4] Cowan, C.F.N., Semnani, S.: Time-Variant Equalization Using a Novel Nonlinear Adaptive Structure. Int. J. Adaptive Control Signal Processing 12 (2) (1998) 195-206
[5] Chen, S., Gibson, G.J., Cowan, C.F.N., Grant, P.M.: Adaptive Equalization of Finite Non-Linear Channels Using Multilayer Perceptrons. Signal Processing 20 (2) (1990) 107-119
[6] Chen, S., Gibson, G.J., Cowan, C.F.N., Grant, P.M.: Reconstruction of Binary Signals Using an Adaptive Radial-Basis Function Equalizer. Signal Processing 22 (1) (1991) 77-93
[7] Chen, S., McLaughlin, S., Mulgrew, B.: Complex Valued Radial Basis Function Network, Part II: Application to Digital Communications Channel Equalization. Signal Processing 36 (1994) 175-188
[8] Peng, M., Nikias, C.L., Proakis, J.: Adaptive Equalization for PAM and QAM Signals with Neural Networks. In: Proc. of 25th Asilomar Conf. on Signals, Systems & Computers 1 (1991) 496-500
[9] Peng, M., Nikias, C.L., Proakis, J.: Adaptive Equalization with Neural Networks: New Multilayer Perceptron Structures and Their Evaluation. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Proc., vol. II, San Francisco, CA (1992) 301-304
[10] Lee, J.S., Beach, C.D., Tepedelenlioglu, N.: Channel Equalization Using Radial Basis Function Neural Network. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Proc., vol. III, Atlanta, GA (1996) 1719-1722
[11] Erdogmus, D., Rende, D., Principe, J., Wong, T.F.: Nonlinear Channel Equalization Using Multilayer Perceptrons with Information-Theoretic Criterion. In: Proc. of 2001 IEEE Signal Processing Society Workshop (2001) 443-451
[12] Chen, Z., Antonio, C.: A New Neural Equalizer for Decision-Feedback Equalization. IEEE Workshop on Machine Learning for Signal Processing (2004)
[13] Wang, L.X., Mendel, J.M.: Fuzzy Adaptive Filters, with Application to Nonlinear Channel Equalization. IEEE Transactions on Fuzzy Systems 1 (3) (1993)
[14] Sarwal, P., Srinath, M.D.: A Fuzzy Logic System for Channel Equalization. IEEE Trans. Fuzzy Systems 3 (1995) 246-249
[15] Lee, K.Y.: Complex Fuzzy Adaptive Filters with LMS Algorithm. IEEE Transactions on Signal Processing 44 (1996) 424-429
[16] Patra, S.K., Mulgrew, B.: Efficient Architecture for Bayesian Equalization Using Fuzzy Filters. IEEE Transactions on Circuits and Systems II 45 (1998) 812-820
[17] Patra, S.K., Mulgrew, B.: Fuzzy Implementation of a Bayesian Equalizer in the Presence of Intersymbol and Co-Channel Interference. Proc. Inst. Elect. Eng. Comm. 145 (1998) 323-330
[18] Siu, S., Lu, C., Lee, C.M.: TSK-Based Decision Feedback Equalization Using an Evolutionary Algorithm Applied to QAM Communication Systems. IEEE Transactions on Circuits and Systems 52 (9) (2005)
[19] Jang, J., Sun, C., Mizutani, E.: Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice-Hall, NJ (1997)
[20] Choi, J., Antonio, C., Haykin, S.: Kalman Filter-Trained Recurrent Neural Equalizers for Time-Varying Channels. IEEE Transactions on Communications 53 (3) (2005)
[21] Abiyev, R., Mamedov, F., Al-shanableh, T.: Neuro-Fuzzy System for Channel Noise Equalization. In: International Conference on Artificial Intelligence, IC-AI'04, Las Vegas, Nevada, USA, June 21-24 (2004)
Comparative Studies of Fuzzy Genetic Algorithms

Qing Li1,*, Yixin Yin1, Zhiliang Wang1, and Guangjun Liu2

1 School of Information Engineering, University of Science and Technology Beijing, 100083 Beijing, China
{liqing,yyx}@ies.ustb.edu.cn, zhiliang [email protected]
2 Department of Aerospace Engineering, Ryerson University, M5B 2K3 Toronto, Canada
[email protected]

Abstract. Many adaptive schemes that use fuzzy logic to control the probabilities of crossover and mutation in genetic algorithms have been reported in recent years. However, no comparative study of these algorithms has been reported. In this paper, several fuzzy genetic algorithms are briefly summarized first, and they are then compared with each other under the same simulation conditions. The simulation results are analyzed in terms of search speed and search quality.

Keywords: genetic algorithm, crossover probability, mutation probability, fuzzy logic.
1 Introduction
It is well known that the probabilities of crossover and mutation in a genetic algorithm (GA) have great influence on its performance (e.g., search speed and search quality), and the correct setting of these parameters is not an easy task. In the last decade, numerous fuzzy logic based approaches for the adjustment of the crossover and mutation probabilities have been reported, such as [1] to [7]. Song et al. [1] propose a fuzzy logic controlled genetic algorithm (FCGA) for the regulation of the crossover and mutation probabilities, where the changes of the average fitness value between two consecutive generations are selected as the input variables. Yun and Gen [2] improve the work of Song et al. by modifying some fuzzy inference rules and introducing a scaling factor for normalizing the input variables. Li et al. [3] investigate another fuzzy genetic algorithm (FGA), where information about both the whole generation and particular individuals is used for controlling the crossover and mutation probabilities. Subbu et al. [4] suggest a fuzzy logic controlled genetic algorithm (FLC-GA) that uses two kinds of diversity information (genotypic diversity and phenotypic diversity) as the input. A new fuzzy genetic algorithm based on PD (Population Diversity) measurements is designed by Wang in [5], and experiments have
Corresponding author, currently a visiting scholar with the Department of Aerospace Engineering, Ryerson University.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 251–256, 2007. c Springer-Verlag Berlin Heidelberg 2007
demonstrated that premature convergence can be avoided by this method. Liu et al. [6] develop a hybrid fuzzy genetic algorithm (HFGA), in which the average fitness value and the best fitness value of each generation are adopted for dynamically tuning the crossover and mutation probabilities. Recently, an improved fuzzy genetic algorithm (IFGA) has been proposed by Li et al. in [7]. The differences in the average fitness value and standard deviation between two consecutive generations are selected as the input variables, and two adaptive scaling factors are introduced for normalizing them. Moreover, new rules based on domain heuristic knowledge are introduced for the fuzzy inference. Although most fuzzy genetic algorithms have demonstrated their effectiveness in the respective works, comparative studies and performance analyses have not been reported previously. The aim of this paper is to compare the performance of the above-mentioned algorithms under the same conditions. Three fuzzy genetic algorithms are selected for comparative studies and the simulation results are analyzed. The comparison results illustrate that the IFGA leads to improved performance in terms of search speed and search quality compared with the other two genetic algorithms on the same test functions. The numerical simulation studies of the three selected fuzzy genetic algorithms using the same test functions are presented in Section 2, followed by the conclusions and future work in Section 3.
2 Comparative Studies and Performance Analysis
In this section, three fuzzy genetic algorithms (FCGA in [2], FGA in [3] and IFGA in [7]) are selected for comparative studies. The detailed procedures of each algorithm are not introduced in this paper because of page limitations. Three test functions are applied for the numerical simulation studies, similarly as in [2]. Test function 1 (T1) is called "Binary f6"; it has a global maximum of 1.0 at the point x1 = x2 = 0 in its search range [-100, 100]. Its expression is as follows:

f(x_1, x_2) = 0.5 - \frac{\left(\sin\sqrt{x_1^2 + x_2^2}\right)^2 - 0.5}{\left(1.0 + 0.001(x_1^2 + x_2^2)\right)^2}.    (1)

Test function 2 (T2) is called the "Rosenbrock function"; it has a global minimum of 0 at x1 = x2 = 1 within the range from -2.048 to 2.048. Its expression is as follows:

f(x_1, x_2) = 100(x_1^2 - x_2)^2 + (1 - x_1)^2.    (2)

Test function 3 (T3) is called the "Rastrigin function"; it has a global minimum of 0 at the point x1 = x2 = x3 = x4 = x5 = 0 within the range [-5.12, 5.12]. It can be expressed as follows:

f(x_1, ..., x_5) = 15 + \sum_{i=1}^{5} \left(x_i^2 - 3\cos(2\pi x_i)\right).    (3)
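The three test functions can be written directly (a Python sketch; eq. (1) is reconstructed as the standard Schaffer "Binary f6" form, since the square root was lost in the printed formula):

```python
import math

def binary_f6(x1, x2):
    """T1, eq. (1): global maximum 1.0 at (0, 0)."""
    r2 = x1 ** 2 + x2 ** 2
    return 0.5 - (math.sin(math.sqrt(r2)) ** 2 - 0.5) / (1.0 + 0.001 * r2) ** 2

def rosenbrock(x1, x2):
    """T2, eq. (2): global minimum 0 at (1, 1)."""
    return 100 * (x1 ** 2 - x2) ** 2 + (1 - x1) ** 2

def rastrigin(x):
    """T3, eq. (3): global minimum 0 at the origin, for 5 variables."""
    return 15 + sum(xi ** 2 - 3 * math.cos(2 * math.pi * xi) for xi in x)

print(binary_f6(0, 0))                      # 1.0
print(rosenbrock(1, 1), rastrigin([0] * 5)) # 0 0.0
```

Evaluating each function at its stated optimum confirms the optimal values given in the text.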
The parameters of each algorithm are set as follows: population size 20, maximum generation 2000, initial crossover probability 0.5 and mutation probability
0.05. The roulette wheel selection operator, the uniform arithmetic crossover operator and the uniform mutation operator in [8] are adopted as the genetic operators in the recombination process. 20 iterations were executed to eliminate the randomness of the searches, and an elitism strategy is used to preserve the best individual of each generation. If a pre-defined maximum generation is reached or an optimal solution is located, the evolution process is stopped. Two indices are used to compare the performance of the three algorithms. One is the "average number of generations", defined as the average number of generations needed to reach the given stop conditions. The other is the "number of obtaining the optimal solution", which represents the total number of runs out of the 20 iterations that locate the optimal solution. The former index indicates the search speed and the latter stands for the search quality. All the simulation programs are executed on an Acer notebook (AMD Turion 64, 512MB DDR) and programmed in MATLAB. The simulation results are listed in Table 1.

Table 1. Simulation results of three test functions

                                                 FCGA    FGA     IFGA
Average number of generations              T1    1783.6  2137.3  1537.5
                                           T2    1407.7  1534.9  1205.2
                                           T3    1121.3  2285.6   976.8
Number of obtaining the optimal solution   T1    16      12      18
                                           T2    15      14      18
                                           T3    18      10      19
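The genetic operators named above can be sketched as follows (Python; one common reading of the operators in [8] — the exact implementations used in the compared papers may differ):

```python
import random

random.seed(3)

def roulette_select(pop, fitness):
    """Roulette-wheel selection: probability proportional to fitness.
    Assumes non-negative fitness values with a positive total."""
    total = sum(fitness)
    r = random.uniform(0, total)
    acc = 0.0
    for ind, f in zip(pop, fitness):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def uniform_arithmetic_crossover(p1, p2, pc):
    """With probability pc, blend the parents with a uniformly drawn
    mixing weight; otherwise return copies of the parents."""
    if random.random() >= pc:
        return p1[:], p2[:]
    a = random.random()
    c1 = [a * x + (1 - a) * y for x, y in zip(p1, p2)]
    c2 = [(1 - a) * x + a * y for x, y in zip(p1, p2)]
    return c1, c2

def uniform_mutation(ind, pm, lo, hi):
    """Each gene is replaced by a uniform draw from [lo, hi] with probability pm."""
    return [random.uniform(lo, hi) if random.random() < pm else g for g in ind]

# Example: one recombination step on the Rosenbrock search range
p1, p2 = [0.2, -0.4], [0.8, 0.6]
c1, c2 = uniform_arithmetic_crossover(p1, p2, pc=0.5)
child = uniform_mutation(c1, pm=0.05, lo=-2.048, hi=2.048)
```

A fuzzy GA would wrap these operators in a loop that updates pc and pm each generation from the fuzzy controller's output.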
Fig. 1. Behaviors of average fitness value in T2
Fig. 2. Behaviors of standard deviation in T2
Fig. 3. Behaviors of crossover probability in T2
In terms of the "average number of generations" in Table 1, the IFGA outperforms FCGA and FGA because it requires fewer generations to obtain the optimal solution. In terms of the "number of obtaining the optimal solution", the IFGA also surpasses FCGA and FGA, as it locates the optimal solution more often than the others. From the simulation results, we see that the IFGA performs better than FCGA and FGA on both indices.
Comparative Studies of Fuzzy Genetic Algorithms
255
Fig. 4. Behaviors of mutation probability in T2
For a more detailed comparison of the adaptive schemes, the average fitness value, standard deviation, crossover probability and mutation probability on test function 2 (T2) are shown in Figs. 1, 2, 3 and 4 over the first 50 generations. From Figs. 1 and 2, we see a lower average fitness value and a higher standard deviation for the IFGA than for the FCGA and FGA, implying that the IFGA is more efficient in search quality and exploration ability. From Figs. 3 and 4, we see that the crossover and mutation probabilities (especially the mutation probability) of the IFGA fluctuate more than those of the FCGA and FGA during the search, which shows that the IFGA has a stronger self-adaptive adjusting ability than the FCGA and FGA.
3
Conclusions and Future Work
Three fuzzy genetic algorithms are compared and analyzed under the same simulation conditions in this paper. The numerical simulation results show that the IFGA provides faster search speed, better search quality and stronger self-adaptability than the FCGA and FGA. There are at least two tasks to be performed in the near future: (1) higher-dimension and higher-order functions are to be applied to test the generality of the conclusion; and (2) other fuzzy genetic algorithms are to be taken into consideration for further comparative studies.
Acknowledgments. This work is supported by the NSFC (National Natural Science Foundation of China, Grant #60374032) and the CSC (China Scholarship Council).
References
1. Song Y., Wang G., Wang P., Johns A.: Environmental/Economic Dispatch Using Fuzzy Logic Controlled Genetic Algorithm. IEE Proceedings on Generation, Transmission and Distribution, Vol. 144. The Institution of Engineering and Technology, London (1997) 377–382
2. Yun Y., Gen M.: Performance Analysis of Adaptive Genetic Algorithm with Fuzzy Logic and Heuristics. Fuzzy Optimization and Decision Making 2 (2003) 161–175
3. Li Q., Zheng D., Tang Y., Chen Z.: A New Kind of Fuzzy Genetic Algorithm. Journal of University of Science and Technology Beijing 1 (2001) 85–89
4. Subbu R., Sanderson A.C., Bonissone P.P.: Fuzzy Logic Controlled Genetic Algorithms Versus Tuned Genetic Algorithms: An Agile Manufacturing Application. In: Proceedings of the 1998 IEEE ISIC/CIRA/ISAS Joint Conference, New Jersey (1998) 434–440
5. Wang K.: A New Fuzzy Genetic Algorithm Based on Population Diversity. In: Proceedings of the 2001 International Symposium on Computational Intelligence in Robotics and Automation, New Jersey (2001) 108–112
6. Liu H., Xu Z., Abraham A.: Hybrid Fuzzy-Genetic Algorithm Approach for Crew Grouping. In: Nedjah N., Mourelle L.M., Vellasco M.M.B.R., Abraham A., Koppen M. (eds.): Proceedings of the 2005 5th International Conference on Intelligent Systems Design and Applications. IEEE Computer Society, Washington, DC (2005) 332–337
7. Li Q., Tong X., Xie S., Liu G.: An Improved Adaptive Algorithm for Controlling the Probabilities of Crossover and Mutation Based on a Fuzzy Control Strategy. In: O'Conner L. (ed.): Proceedings of the 6th International Conference on Hybrid Intelligent Systems and 4th Conference on Neuro-Computing and Evolving Intelligence. IEEE Computer Society, Washington, DC (2006) 50
8. Michalewicz Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996)
Fuzzy Random Dependent-Chance Bilevel Programming with Applications

Rui Liang 1, Jinwu Gao 2, and Kakuzo Iwamura 3

1 Economy, Industry and Business Management College, Chongqing University, Chongqing 400044, China
2 School of Information, Renmin University of China, Beijing 100872, China
3 Department of Mathematics, Josai University, Sakado, Saitama 350-0248, Japan
Abstract. In this paper, a two-level decentralized decision-making problem is formulated as a fuzzy random dependent-chance bilevel programming model. We define the fuzzy random Nash equilibrium of the lower-level problem and the fuzzy random Stackelberg-Nash equilibrium of the overall problem. In order to find the equilibria, we propose a hybrid intelligent algorithm, in which a neural network, as an uncertain function approximator, plays a crucial role in saving computing time, and a genetic algorithm is used for optimization. Finally, we apply the fuzzy random dependent-chance bilevel programming model to a hierarchical resource allocation problem to illustrate the modelling idea and the effectiveness of the hybrid intelligent algorithm.
1
Introduction
Decentralized decision-making becomes more and more important for contemporary decentralized organizations, in which each department seeks its own interest while the organization seeks the overall interest. In order to deal with such problems, multilevel programming (MLP) was proposed by Bracken and McGill [4][5] in the early 1970s. Thereafter, despite its inherent NP-hardness [3], MLP has been applied to a wide variety of areas including economics [2][6], transportation [33][36], engineering [7][31], and so on. For detailed expositions, the reader may consult the review papers [34][35] and the books [9][19]. When multilevel programming is applied to real-world problems, some system parameters are often subject to fluctuations and difficult to measure. By assuming them to be random variables, Patriksson and Wynter [30] and Gao et al. [11] discussed stochastic multilevel programming together with numerical solution methods. Meanwhile, Gao and Liu [12]-[14] discussed fuzzy multilevel programming models with hybrid intelligent algorithms under the assumption of fuzzy parameters. However, in many situations, the system parameters involve both randomness and fuzziness. For instance, in an economic system, the demand consists of multiple demand sources, among which some are characterized by random variables, while others (e.g., new demand sources or demand sources in some
This work was supported by National Natural Science Foundation of China (No.70601034) and Research Foundation of Renmin University of China.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 257–266, 2007. c Springer-Verlag Berlin Heidelberg 2007
258
R. Liang, J. Gao, and K. Iwamura
unsteady states) are characterized by fuzzy variables. Then the total demand is the sum of some random and fuzzy variables, and is characterized by a fuzzy random variable. Kwakernaak [17][18] first introduced the notion of a fuzzy random variable. The concept of the chance measure of a fuzzy random event was first given in [22], and the fuzzy random dependent-chance programming model was initialized by Liu [23]. The underlying philosophy is to select the decision with the maximal chance to meet the fuzzy random event. In this paper, we formulate a two-level decentralized decision-making problem as a fuzzy random dependent-chance bilevel programming (FRDBP) model, and present a numerical solution method integrating neural network and genetic algorithm. The paper is organized as follows. First, we give some basic results of fuzzy random theory in Section 2. Then we formulate a two-level decentralized decision-making problem in fuzzy random environments as an FRDBP model in Section 3. In Section 4, we propose a hybrid intelligent algorithm integrating fuzzy random simulation, neural network and genetic algorithm. In Section 5, as an application, a hierarchical resource allocation problem with fuzzy random parameters is formulated by FRDBP, and the computational results further illustrate the idea of the FRDBP and the effectiveness of the hybrid intelligent algorithm. Lastly, we give a concluding remark.
2
Preliminaries
Let Θ be a nonempty set, P(Θ) the power set of Θ, and ξ a fuzzy variable with membership function μ. Then the credibility measure Cr of a fuzzy event A ∈ P(Θ) was defined by Liu and Liu [24] as

  Cr(A) = (1/2) ( sup_{x∈A} μ(x) + 1 − sup_{x∈A^c} μ(x) ).

Definition 1. (Liu and Liu [24]) A fuzzy random variable is a function ξ defined on a probability space (Ω, A, Pr) taking values in a set of fuzzy variables such that Cr{ξ(ω) ∈ B} is a measurable function of ω for any Borel set B of ℝ.

Definition 2. (Gao and Liu [10]) Let ξ be a fuzzy random variable, and B a Borel set of ℝ. Then the chance of the fuzzy random event ξ ∈ B is a function from (0, 1] to [0, 1], defined as

  Ch{ξ ∈ B}(α) = sup_{Pr{A}≥α} inf_{ω∈A} Cr{ξ(ω) ∈ B}.
Example 1. A fuzzy random variable ξ is said to be triangular, if for each ω, ξ(ω) is a triangular fuzzy variable.
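For intuition, the credibility measure above can be evaluated numerically for a triangular fuzzy variable (a sketch using a simple grid to approximate the suprema; the grid size n is an arbitrary choice, not from the paper):

```python
def credibility_leq(a, b, c, t, n=20001):
    # Credibility Cr{xi <= t} for a triangular fuzzy variable (a, b, c),
    # evaluated from the definition
    #   Cr(A) = (1/2)(sup_{x in A} mu(x) + 1 - sup_{x in A^c} mu(x))
    # with both suprema approximated on a uniform grid over [a, c].
    def mu(x):
        if a <= x <= b:
            return 1.0 if b == a else (x - a) / (b - a)
        if b < x <= c:
            return 1.0 if c == b else (c - x) / (c - b)
        return 0.0
    xs = [a + (c - a) * k / (n - 1) for k in range(n)]
    sup_in = max((mu(x) for x in xs if x <= t), default=0.0)
    sup_out = max((mu(x) for x in xs if x > t), default=0.0)
    return 0.5 * (sup_in + 1.0 - sup_out)
```

For the triangular fuzzy variable (0, 1, 2), this yields Cr{ξ ≤ 1} = 0.5 and Cr{ξ ≤ 2} = 1, as expected.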
3
Fuzzy Random Dependent-Chance Bilevel Programming
Consider a decentralized decision system with a two-level structure. The lower level consists of m decision makers, called followers. Symmetrically, the decision maker
Fuzzy Random Dependent-Chance Bilevel Programming with Applications
259
at the upper level is called the leader. Each decision maker has his own decision variables and objective. The leader can only influence the reactions of the followers through his own decision variables, while the followers have full authority to decide how to optimize their own objective functions in view of the decisions of the leader and the other followers. In order to model the fuzzy random decentralized decision-making problem, we use the following notation:
– i = 1, 2, · · · , m: index of the followers;
– x: control vector of the leader;
– yi: control vector of the ith follower;
– ξ = (ξ1, ξ2, · · · , ξn): n-dimensional fuzzy random vector into which the problem parameters are arranged;
– f0(x, y1, · · · , ym, ξ): objective function of the leader;
– fi(x, y1, · · · , ym, ξ): objective function of the ith follower;
– g0(x, ξ): constraint function of the leader;
– gi(x, y1, · · · , ym, ξ): constraint function of the ith follower.
Following the philosophy of fuzzy random dependent-chance programming [23], we formulate the problem as an FRDBP model as follows. First, assume that the leader's decision x and the other followers' decisions y1, · · · , yi−1, yi+1, · · · , ym are given, and that the ith follower is concerned with the event of his objective function achieving a prospective value f̄i. Then the rational reaction of the ith follower is the set of optimal solutions to the dependent-chance programming model

  max_{yi}  Ch{ fi(x, y1, y2, · · · , ym, ξ) ≥ f̄i }(αi)
  subject to:
    gi(x, y1, y2, · · · , ym, ξ) ≤ 0,          (1)

where αi is a predetermined confidence level. It is obvious that each follower's rational reaction depends not only on the leader's decision x but also on the other followers' decisions y1, · · · , yi−1, yi+1, · · · , ym.

Definition 3. An array (y∗1, y∗2, · · · , y∗m) is called a Nash equilibrium with respect to a given decision x of the leader, if

  Ch{ fi(x, y∗1, · · · , y∗i−1, yi, y∗i+1, · · · , y∗m, ξ) ≥ f̄i }(αi)
    ≤ Ch{ fi(x, y∗1, y∗2, · · · , y∗m, ξ) ≥ f̄i }(αi)          (2)

subject to the uncertain environment gi(x, y1, y2, · · · , ym, ξ) ≤ 0, i = 1, 2, · · · , m, for any feasible (y∗1, y∗2, · · · , y∗i−1, yi, y∗i+1, · · · , y∗m) and i = 1, 2, · · · , m.

Second, if the leader has given a confidence level α0 and wants to maximize the chance of his objective function achieving a prospective value f̄0, then the leader's problem is formulated as the following dependent-chance programming model
  max_x  Ch{ f0(x, y∗1, y∗2, · · · , y∗m, ξ) ≥ f̄0 }(α0)
  subject to:
    g0(x, ξ) ≤ 0,          (3)

where (y∗1, y∗2, · · · , y∗m) is the Nash equilibrium with respect to x. Now we present the concept of Stackelberg-Nash equilibrium, defined as follows.

Definition 4. An array (x∗, y∗1, y∗2, · · · , y∗m) is called a Stackelberg-Nash equilibrium, if

  Ch{ f0(x, y1, y2, · · · , ym, ξ) ≥ f̄0 }(α0)
    ≤ Ch{ f0(x∗, y∗1, y∗2, · · · , y∗m, ξ) ≥ f̄0 }(α0)          (4)

subject to the uncertain environment g0(x, ξ) ≤ 0, for any x and any Nash equilibrium (y1, y2, · · · , ym) with respect to x.

Finally, we assume that the leader first chooses his control vector x, and that the followers' rational reactions always form a Nash equilibrium. In order to maximize the chance functions of the leader and the followers, we have the following dependent-chance bilevel programming model:

  max_x  Ch{ f0(x, y∗1, y∗2, · · · , y∗m, ξ) ≥ f̄0 }(α0)
  subject to:
    g0(x, ξ) ≤ 0,
    where (y∗1, y∗2, · · · , y∗m) solves, for i = 1, 2, · · · , m,
      max_{yi}  Ch{ fi(x, y1, y2, · · · , ym, ξ) ≥ f̄i }(αi)
      subject to:
        gi(x, y1, y2, · · · , ym, ξ) ≤ 0.          (5)
4
Hybrid Intelligent Algorithm
Since the bilevel programming problem is NP-hard [3], successful implementations of multilevel programming rely largely on efficient numerical algorithms. As an extension of bilevel programming, FRDBP further enhances this difficulty. In this section, we integrate fuzzy random simulation, neural network and genetic algorithm to produce a hybrid intelligent algorithm for solving the FRDBP model.

4.1
Fuzzy Simulation
By uncertain functions we mean functions with fuzzy random parameters, such as

  U : (x, y1, y2, · · · , ym) → Ch{ f(x, y1, y2, · · · , ym, ξ) ≥ f̄ }(α).          (6)

Due to their complexity, we resort to the fuzzy random simulation technique for computing the uncertain functions. Here we shall not go into details; the interested reader may consult the book [26] by Liu.
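One common discretization of the chance measure in Definition 2 samples N equiprobable scenarios and takes the ⌈αN⌉-th largest credibility value (my sketch, not the scheme in [26]; cr_given_omega stands for a per-scenario credibility evaluation, itself computed by fuzzy simulation):

```python
import math

def chance_sim(cr_given_omega, omegas, alpha):
    # Ch{.}(alpha) = sup_{Pr{A}>=alpha} inf_{omega in A} Cr{.}:
    # with N equiprobable samples, the best set A of probability >= alpha
    # collects the ceil(alpha*N) scenarios with the largest credibility,
    # and the infimum over A is the smallest of them.
    crs = sorted((cr_given_omega(w) for w in omegas), reverse=True)
    k = math.ceil(alpha * len(crs))
    return crs[k - 1]
```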
4.2
Uncertain Function Approximation
A neural network is essentially a nonlinear mapping from the input space to the output space. It is known that a neural network with an arbitrary number of hidden neurons is a universal approximator for continuous functions [8][16]. Moreover, it operates at high speed after it is well trained on a set of input-output data. In order to speed up the solution process, we train neural networks to approximate the uncertain functions, and then use the trained neural networks to evaluate the uncertain functions in the solution process. For training a neural network to approximate an uncertain function, we must first generate a set of input-output data (x(k), y(k), z(k)), k = 1, 2, · · · , M, where x and y are control vectors of the leader and followers, respectively, and z(k) are the corresponding function values calculated by fuzzy random simulation. Then we train a neural network on this set of input-output data using the popular backpropagation algorithm. Finally, the trained network, characterized by U(x, y, w), where w denotes the network weights produced by the training process, can be used to evaluate the uncertain function. Thus, much computing time is saved. For a detailed discussion of uncertain function approximation, the reader may consult the book [26] by Liu.

4.3
Computing Nash Equilibrium
Define y−i = (y1, y2, · · · , yi−1, yi+1, · · · , ym), i = 1, 2, · · · , m. For any decision x revealed by the leader, if the ith follower knows the strategies y−i of the other followers, then the optimal reaction of the ith follower is represented by a mapping yi = ri(y−i) that solves the subproblem defined in (1). It is clear that the Nash equilibrium of the m followers will be the solution of the system of equations

  yi = ri(y−i),  i = 1, 2, · · · , m.          (7)
In other words, we should find a fixed point of the vector-valued function (r1, r2, · · · , rm). This task may be achieved by solving the following programming model:

  min  R(y1, y2, · · · , ym) = Σ_{i=1}^{m} || yi − ri(y−i) ||
  subject to:
    gi(x, y1, y2, · · · , ym, ξ) ≤ 0,  i = 1, 2, · · · , m.          (8)

If the optimal solution (y∗1, y∗2, · · · , y∗m) satisfies

  R(y∗1, y∗2, · · · , y∗m) = 0,          (9)

then y∗i = ri(y∗−i) for i = 1, 2, · · · , m. That is, (y∗1, y∗2, · · · , y∗m) must be a Nash equilibrium for the given x.
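The residual R in (8) can be written directly from the reaction mappings (a sketch; the reactions list is a hypothetical stand-in for solvers of the followers' subproblems (1)):

```python
import math

def nash_residual(y, reactions):
    # R(y_1, ..., y_m) = sum_i || y_i - r_i(y_-i) ||, where reactions[i]
    # maps the other followers' decisions to the ith best reply.
    total = 0.0
    for i, y_i in enumerate(y):
        y_minus_i = y[:i] + y[i + 1:]
        r_i = reactions[i](y_minus_i)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(y_i, r_i)))
    return total
```

A zero residual certifies a Nash equilibrium; in practice one stops when the residual drops below a small tolerance, as in (10).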
In a numerical solution process, if a solution (y ∗1 , y ∗2 , · · · , y ∗m ) satisfies that R(y ∗1 , y ∗2 , · · · , y ∗m ) ≤ ε,
(10)
where ε is a small positive number, then it can be regarded as a Nash equilibrium for the given x. Otherwise, we should continue the computing procedure. Since the objective function involves the m mappings ri(y−i), the optimization problem (8) may be very complex, so we employ a genetic algorithm to search for the Nash equilibrium.

Genetic Algorithm for Nash Equilibrium:
Step 1. Input a feasible control vector x.
Step 2. Generate a population of chromosomes y(j), j = 1, 2, · · · , pop_size, at random from the feasible set.
Step 3. Calculate the objective values of the chromosomes.
Step 4. Compute the fitness of each chromosome according to the objective values.
Step 5. Select the chromosomes by spinning the roulette wheel.
Step 6. Update the chromosomes by crossover and mutation operations.
Step 7. Repeat Steps 3–6 until the best chromosome satisfies inequality (10).
Step 8. Return the Nash equilibrium y∗ = (y∗1, y∗2, · · · , y∗m).

4.4
Hybrid Intelligent Algorithm
For any feasible control vector x revealed by the leader, denote the Nash equilibrium with respect to x by (y∗1, y∗2, · · · , y∗m). Then the Stackelberg-Nash equilibrium can be obtained by solving the leader's problem defined in (3). Since its objective function involves not only the uncertain parameters ξ but also a complex mapping x → (y∗1, y∗2, · · · , y∗m), the optimization problem may be very difficult to solve. Genetic algorithm is a good candidate, although it is relatively slow. Now we integrate fuzzy random simulation, neural network, and genetic algorithm to produce a hybrid intelligent algorithm for solving general FRDBP models.

Hybrid Intelligent Algorithm for Stackelberg-Nash Equilibrium:
Step 1. Generate input-output data of uncertain functions like (6).
Step 2. Train neural networks by the backpropagation algorithm.
Step 3. Initialize a population of chromosomes x(i), i = 1, 2, · · · , pop_size, randomly.
Step 4. Compute the Nash equilibrium for each chromosome.
Step 5. Compute the fitness of each chromosome according to the objective values.
Step 6. Select the chromosomes by spinning the roulette wheel.
Step 7. Update the chromosomes by crossover and mutation operations.
Step 8. Repeat Steps 4–7 for a given number of cycles.
Step 9. Return the best chromosome as the Stackelberg-Nash equilibrium.
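The outer loop of the hybrid algorithm can be sketched as follows (schematic only; nash_solver, surrogate and evolve are hypothetical callables standing for Steps 4–7, with the trained neural network replacing fuzzy random simulation in the fitness evaluation):

```python
def hybrid_search(init_pop, nash_solver, surrogate, evolve, cycles=50):
    # Outer GA over leader decisions x (Steps 3-9).
    pop = list(init_pop)
    best_x, best_val = None, float("-inf")
    for _ in range(cycles):
        scored = []
        for x in pop:
            y_star = nash_solver(x)     # inner GA: followers' Nash equilibrium
            val = surrogate(x, y_star)  # trained NN approximates the chance
            scored.append((val, x))
            if val > best_val:
                best_val, best_x = val, x
        pop = evolve(scored)            # selection, crossover, mutation
    return best_x, best_val
```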
5
Hierarchical Resource Allocation Problem
Consider an enterprise composed of a center that markets products and supplies resources, and two factories as subsystems, each of which produces two kinds of products by consuming allocated resources. The center makes a decision on the amounts of the resources so as to maximize its total profit from marketing the products, while each factory desires to attain its production activity goal based on efficiency, quality, and performance. Some notations are given as follows:
– xmj: the amount of resource j allocated to factory m;
– ymj: the amount of product j produced by factory m;
– Yj: the total amount of marketed product j, where Yj = y1j + y2j;
– f0(Y): the profit function of marketing Y = (Y1, Y2)^T;
– fm(ym): the objective function expressing the goal of factory m, where ym = (ym1, ym2)^T;
– y∗mj(x), Yj∗(x): the parametric optimal values of ymj and Yj, respectively, with respect to the resource allocation x = (x11, x12, x21, x22)^T.
The objective functions of the two factories are

  f1(y1(x)) = (y11 − 4.0)² + (y12 − 13.0)²

and

  f2(y2(x)) = (y21 − 35.0)² + (y22 − 2.0)²,

and the profit function of the center is

  f0(Y) = (ξ1 − Y1(x))Y1(x) + (ξ2 − Y2(x))Y2(x),

where ξ1 is a triangular fuzzy random variable with normal distribution, denoted by (N(200, 4²) − 10, N(200, 4²), N(200, 4²) + 10), and ξ2 is a triangular fuzzy random variable with normal distribution, denoted by (N(160, 3²) − 10, N(160, 3²), N(160, 3²) + 10). We note that the prototype of the above example comes from [1]. Here we fuzzy-randomize only two system parameters for the convenience of comparison. When ξ1 and ξ2 are substituted by their mean values 200 and 160, respectively, we get the problem in Ref. [1], whose optimal solution is known to be

  x∗ = (x∗11, x∗12, x∗21, x∗22) = (7.00, 3.00, 12.00, 18.00),

and the optimal reactions of the two factories are

  (y∗11, y∗12) = (0.00, 10.00)
and
∗ ∗ (y21 , y22 ) = (30.00, 0.00).
The optimal objective of the center is f0(Y(x∗)) = 6600. That is, the center can achieve a profit level of 6600 from the mean-value point of view. Due to the fuzzy randomness of the system parameters ξ1 and ξ2, the objective/profit function of the center is fuzzy random, too. Suppose that the center has set a profit level of 6200 and a probability level of 0.9, and wants to maximize the chance of its profit's achieving 6400. Then we have the following FRDBP model:

  max_x  Ch{ (ξ1 − Y1∗(x))Y1∗(x) + (ξ2 − Y2∗(x))Y2∗(x) ≥ 6400 }(α)
  s.t.
    x11 + x12 + x21 + x22 ≤ 40
    0 ≤ x11 ≤ 10,  0 ≤ x12 ≤ 5
    0 ≤ x21 ≤ 15,  0 ≤ x22 ≤ 20
  where y∗1, y∗2 solve the problems
    min_{y11,y12}  (y11 − 4.0)² + (y12 − 13.0)²
    s.t.
      4y11 + 7y12 ≤ 10x11
      6y11 + 3y12 ≤ 10x12
      0 ≤ y11, y12 ≤ 20
  and
    min_{y21,y22}  (y21 − 35.0)² + (y22 − 2.0)²
    s.t.
      4y21 + 7y22 ≤ 10x21
      6y21 + 3y22 ≤ 10x22
      0 ≤ y21, y22 ≤ 40.          (11)

A run of the hybrid intelligent algorithm for 200 generations yields the best solution x∗ = (6.20, 3.58, 12.18, 18.03), and the corresponding chance is 0.71. For the allocation x∗, the optimal solution and objective value of factory 1 are (1.30, 8.12) and 31.09, respectively; the optimal solution and objective value of factory 2 are (30.05, 0.00) and 28.48, respectively. That is, the center can achieve a profit level of 6200 with a credibility of 0.71 at the given probability level 0.90. However, this comes at the expense of the objective value of factory 1.
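As a quick arithmetic check of the mean-value case quoted above (my own verification, not from [1]):

```python
# Mean-value check of the center's profit for x* = (7, 3, 12, 18):
# substitute xi1 -> 200, xi2 -> 160 and the reported factory reactions.
y1 = (0.0, 10.0)    # factory 1 optimal reaction (y11, y12)
y2 = (30.0, 0.0)    # factory 2 optimal reaction (y21, y22)
Y1 = y1[0] + y2[0]  # total amount of product 1 = 30
Y2 = y1[1] + y2[1]  # total amount of product 2 = 10
profit = (200 - Y1) * Y1 + (160 - Y2) * Y2
print(profit)  # 6600.0
```

The reactions are also feasible for the followers' constraints at x∗: for factory 1, 4·0 + 7·10 = 70 ≤ 10·7 and 6·0 + 3·10 = 30 ≤ 10·3; for factory 2, 4·30 = 120 ≤ 10·12 and 6·30 = 180 ≤ 10·18.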
6
Conclusions
In this paper, we proposed the FRDBP model as well as a hybrid intelligent algorithm. As shown in their application to the hierarchical resource allocation problem, they
could be used to solve two-level decentralized decision-making problems in fuzzy random environments, for example in government policy making and engineering.
References
1. Aiyoshi E., Shimizu K.: Hierarchical decentralized system and its new solution by a barrier method. IEEE Transactions on Systems, Man, and Cybernetics SMC-11 (1981) 444–449
2. Bard J.F., Plummer J., Sourie J.C.: A bilevel programming approach to determining tax credits for biofuel production. European Journal of Operational Research 120 (2000) 30–46
3. Ben-Ayed O., Blair C.E.: Computational difficulties of bilevel linear programming. Operations Research 38 (1990) 556–560
4. Bracken J., McGill J.M.: Mathematical programs with optimization problems in the constraints. Operations Research 21 (1973) 37–44
5. Bracken J., McGill J.M.: A method for solving mathematical programs with nonlinear problems in the constraints. Operations Research 22 (1974) 1097–1101
6. Candler W., Fortuny-Amat W., McCarl B.: The potential role of multi-level programming in agricultural economics. American Journal of Agricultural Economics 63 (1981) 521–531
7. Clark P.A., Westerberg A.: Bilevel programming for chemical process design—I. Fundamentals and algorithms. Computers and Chemical Engineering 14 (1990) 87–97
8. Cybenko G.: Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (1989) 183–192
9. Dempe S.: Foundations of Bilevel Programming. Kluwer Academic Publishers, Dordrecht (2002)
10. Gao J., Liu B.: New primitive chance measures of fuzzy random event. International Journal of Fuzzy Systems 3 (2001) 527–531
11. Gao J., Liu B., Gen M.: A hybrid intelligent algorithm for stochastic multilevel programming. IEEJ Transactions on Electronics, Information and Systems 124-C (2004) 1991–1998
12. Gao J., Liu B.: On crisp equivalents of fuzzy chance-constrained multilevel programming. In: Proceedings of the 2004 IEEE International Conference on Fuzzy Systems, Budapest, Hungary, July 26–29 (2004) 757–760
13. Gao J., Liu B.: Fuzzy multilevel programming with a hybrid intelligent algorithm. Computers & Mathematics with Applications 49 (2005) 1539–1548
14. Gao J., Liu B.: Fuzzy dependent-chance multilevel programming with application to resource allocation problem. In: Proceedings of the 2005 IEEE International Conference on Fuzzy Systems, Reno, Nevada, May 22–25 (2005) 541–545
15. Gao J., Liu Y.: Stochastic Nash equilibrium with a numerical solution method. In: Wang J. et al. (eds.): Advances in Neural Networks—ISNN 2005. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 811–816
16. Hornik K., Stinchcombe M., White H.: Multilayer feedforward networks are universal approximators. Neural Networks 2 (1989) 359–366
17. Kwakernaak H.: Fuzzy random variables—I: Definitions and theorems. Information Sciences 15 (1978) 1–29
18. Kwakernaak H.: Fuzzy random variables—II: Algorithms and examples for the discrete case. Information Sciences 17 (1979) 253–278
19. Lee E.S., Shih H.S.: Fuzzy and Multi-level Decision Making. Springer-Verlag, London (2001)
20. Liu B.: Stackelberg-Nash equilibrium for multi-level programming with multiple followers using genetic algorithm. Computers & Mathematics with Applications 36 (1998) 79–89
21. Liu B.: Dependent-chance programming in fuzzy environments. Fuzzy Sets and Systems 109 (2000) 95–104
22. Liu B.: Fuzzy random chance-constrained programming. IEEE Transactions on Fuzzy Systems 9 (2001) 713–720
23. Liu B.: Fuzzy random dependent-chance programming. IEEE Transactions on Fuzzy Systems 9 (2001) 721–726
24. Liu B., Liu Y.: Expected value of fuzzy variable and fuzzy expected value models. IEEE Transactions on Fuzzy Systems 10 (2002) 445–450
25. Liu B.: A survey of entropy of fuzzy variables. Journal of Uncertain Systems 1 (2007) 1–10
26. Liu B.: Uncertainty Theory. 2nd edn. Springer-Verlag, Berlin (2007)
27. Liu Y., Gao J.: Convergence criteria and convergence relations for sequences of fuzzy random variables. Lecture Notes in Artificial Intelligence 3613 (2005) 321–331
28. Liu Y.: Convergent results about the use of fuzzy simulation in fuzzy optimization problems. IEEE Transactions on Fuzzy Systems 14 (2006) 295–304
29. Liu Y., Gao J.: The dependence of fuzzy variables with applications to fuzzy random optimization. International Journal of Uncertainty, Fuzziness & Knowledge-Based Systems, to be published
30. Patriksson M., Wynter L.: Stochastic mathematical programs with equilibrium constraints. Operations Research Letters 25 (1999) 159–167
31. Sahin K.H., Ciric A.R.: A dual temperature simulated annealing approach for solving bilevel programming problems. Computers and Chemical Engineering 23 (1998) 11–25
32. Shimizu K., Aiyoshi E.: A new computational method for Stackelberg and min-max problems by use of a penalty method. IEEE Transactions on Automatic Control AC-26 (1981) 460–466
33. Suh S., Kim T.: Solving nonlinear bilevel programming models of the equilibrium network design problem: A comparative review. Annals of Operations Research 34 (1992) 203–218
34. Vicente L., Calamai P.H.: Bilevel programming and multilevel programming: A bibliography review. Journal of Global Optimization 5 (1994)
35. Wen U.P.: Linear bilevel programming problems—A review. Journal of the Operational Research Society 42 (1991) 125–133
36. Yang H., Bell M.G.H.: Transport bilevel programming problems: recent methodological advances. Transportation Research Part B 35 (2001) 1–4
37. Zhao R., Liu B.: Renewal process with fuzzy interarrival times and rewards. International Journal of Uncertainty, Fuzziness & Knowledge-Based Systems 11 (2003) 573–586
38. Zhao R., Tang W.: Some properties of fuzzy random processes. IEEE Transactions on Fuzzy Systems 14 (2006) 173–179
Fuzzy Optimization Problems with Critical Value-at-Risk Criteria

Yan-Kui Liu 1,2, Zhi-Qiang Liu 2, and Ying Liu 1

1 College of Mathematics & Computer Science, Hebei University, Baoding 071002, Hebei, China
[email protected], [email protected]
2 School of Creative Media, City University of Hong Kong, Hong Kong, China

Abstract. Based on value-at-risk (VaR) criteria, this paper presents a new class of two-stage fuzzy programming models. Because fuzzy optimization problems often include fuzzy variables defined through continuous possibility distribution functions, they are inherently infinite-dimensional optimization problems that can rarely be solved directly. Thus, algorithms to solve such optimization problems must rely on intelligent computing as well as approximation schemes, which result in approximating finite-dimensional optimization problems. Motivated by this fact, we suggest an approximation method to evaluate critical VaR objective functions, and discuss the convergence of the approximation approach. Furthermore, we design a hybrid algorithm (HA) based on the approximation method, neural network (NN) and genetic algorithm (GA) to solve the proposed optimization problem, and provide a numerical example to test the effectiveness of the HA.
1
Introduction
It is known that production games [16] feature transferable utility and strong cooperative incentives; they are appealing in several aspects, such as that the characteristic function can be explicitly defined and is easy to compute. In stochastic decision systems, production games were extended to accommodate uncertainty about events not known ex ante, and planning then took the form of two-stage stochastic programming [17]. In fuzzy decision systems, based on possibility theory [2,14,18], fuzzy linear production programming games were presented by Nishizaki and Sakawa [15], but they belong to static production games. Fuzzy two-stage production games rely on the optimization model developed in [13] as well as the work in this paper. In the literature, two-stage and multistage stochastic programming problems have been studied extensively [5] and applied to many real-world decision problems, especially decision problems involving risk [4]. Our objective in this paper is to take credibility theory [7,8,9,10,11,12] as the theoretical foundation of fuzzy optimization [1,3,6,13], and to present a new class of two-stage fuzzy optimization problems with critical VaR criteria in the objective. In the proposed fuzzy optimization problem, infeasibility of first-stage decisions is accepted, but has
268
Y.-K. Liu, Z.-Q. Liu, and Y. Liu
to be compensated for afterward; hence second-stage or recourse actions are required. Because two-stage fuzzy optimization problems are inherently infinite-dimensional optimization problems that can rarely be solved directly, algorithms to solve such optimization problems must rely on intelligent computing and approximation schemes, which result in approximating finite-dimensional optimization problems. This fact motivates us to present an approximation approach to the critical VaR objective and to combine it with GA and NN to solve the proposed optimization problem. In the following section we formulate a new class of two-stage fuzzy optimization problems with critical VaR criteria in the objective. Section 3 discusses the issue of approximating the critical VaR function and deals with the convergence of the approximation method. In Section 4, we design an HA based on the approximation scheme to solve the proposed fuzzy optimization problems, and provide a numerical example to show the effectiveness of the HA. Finally, we draw conclusions in Section 5.
2 Problem Formulation
Consider the following fuzzy linear programming problem:

  min c^T x + q^T(γ)y
  subject to: T(γ)x + W(γ)y = h(γ), x ∈ X, y ∈ ℝ_+^{n2}.   (1)
We assume that all ingredients above have conformal dimensions, that X ⊂ ℝ^{n1} is a nonempty closed polyhedron, and that some components of q(γ), h(γ), T(γ) and W(γ) are fuzzy variables defined on a credibility space (Γ, P(Γ), Cr), where Γ is the universe, P(Γ) the power set of Γ, and Cr the credibility measure defined in [9]. Decision variables are divided into two groups: the first-stage variable x, to be fixed before observation of γ, and the second-stage variables y, to be fixed after observation of γ. Given x ∈ X and γ ∈ Γ, denote

  Q(x, γ) = min{q^T(γ)y | W(γ)y = h(γ) − T(γ)x, y ∈ ℝ_+^{n2}}.   (2)
According to linear programming theory, the function Q(x, γ) is real-valued almost surely with respect to γ provided that W(γ)(ℝ_+^{n2}) = ℝ^{m2} and {u ∈ ℝ^{m2} | W(γ)^T u ≤ q(γ)} ≠ ∅ almost surely with respect to γ, which will be assumed throughout the paper. With a preselected threshold φ0 ∈ ℝ, the excess credibility functional

  QC(x) = Cr{γ ∈ Γ | c^T x + Q(x, γ) > φ0}

measures the credibility of facing total fuzzy objective values exceeding φ0. For instance, if φ0 is a critical cost level, then the excess credibility is understood as the ruin credibility.
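To make the excess credibility concrete for a single fuzzy coefficient, one can use the self-dual identity Cr = (Pos + Nec)/2 from credibility theory [9]. The sketch below evaluates Cr{ξ > t} for a triangular fuzzy variable ξ = (a, b, c); the triangular shape and the function names are our illustrative assumptions, not part of the paper:

```python
def pos_le(t, a, b, c):
    """Pos{xi <= t} for a triangular fuzzy variable xi = (a, b, c)."""
    if t >= b:
        return 1.0
    if t <= a:
        return 0.0
    return (t - a) / (b - a)

def pos_gt(t, a, b, c):
    """Pos{xi > t} for a triangular fuzzy variable xi = (a, b, c)."""
    if t <= b:
        return 1.0
    if t >= c:
        return 0.0
    return (c - t) / (c - b)

def excess_credibility(t, a, b, c):
    """Cr{xi > t} = (Pos{xi > t} + Nec{xi > t}) / 2, with Nec{A} = 1 - Pos{A complement}."""
    return 0.5 * (pos_gt(t, a, b, c) + 1.0 - pos_le(t, a, b, c))
```

For example, for ξ = (0, 1, 2) the credibility of exceeding the mode is excess_credibility(1, 0, 1, 2) = 0.5, and it decays linearly to 0 at the right endpoint.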
Fuzzy Optimization Problems with Critical Value-at-Risk Criteria
269
However, the excess credibility does not quantify the extent to which the objective value exceeds the threshold. The latter can be achieved by another risk measure, the critical VaR. Denote by

  Φ(x, φ) = Cr{γ ∈ Γ | c^T x + Q(x, γ) ≤ φ}

the credibility distribution of the fuzzy variable c^T x + Q(x, γ). With a preselected credibility 0 < α < 1, the critical VaR at α is defined by

  Q_αVaR(x) = inf{φ | Φ(x, φ) ≥ α}.

As a consequence, a two-stage fuzzy program with a VaR objective reads

  min{Q_αVaR(x) : x ∈ X}.   (3)
Since we will discuss the approximation of problem (3) when the possibility distribution of ξ is continuous and approximated by a discrete one, we are interested in the properties of the α-VaR Q_αVaR as a function of x as well as of the distribution of ξ. Toward that end, it is convenient to introduce the induced credibility measure Ĉr = Cr ∘ ξ^{-1}, and to reformulate problem (3) as

  min{Q_αVaR(x, Ĉr) : x ∈ X},   (4)

where
  Q_αVaR(x, Ĉr) = inf{φ | Ĉr{ξ̂ ∈ Ξ | c^T x + Q(x, ξ̂) ≤ φ} ≥ α},
and Q(x, ξ̂) is the second-stage value function

  Q(x, ξ̂) = min{q^T(ξ̂)y | W(ξ̂)y = h(ξ̂) − T(ξ̂)x, y ∈ ℝ_+^{n2}},   (5)

where ξ̂ = (q^T(ξ̂), h^T(ξ̂), W_{1·}(ξ̂), …, W_{m2·}(ξ̂), T_{1·}(ξ̂), …, T_{m2·}(ξ̂))^T is the realization value of the fuzzy vector ξ, W_{i·} denotes the i-th row of the matrix W, and T_{i·} the i-th row of the matrix T.
3 Approximation Approach to VaR
To solve the proposed fuzzy optimization problem (4), it is required to calculate the following VaR at α repeatedly:

  U: x → Q_αVaR(x, Ĉr) = inf{φ | Ĉr{ξ̂ ∈ Ξ | c^T x + Q(x, ξ̂) ≤ φ} ≥ α},   (6)

where Ξ is the support of ξ described in Section 2. For simplicity, we assume in this section that the matrix W is fixed, i.e., W(ξ̂) ≡ W. Suppose that Ξ = ∏_{i=1}^{m2+n2+m2·n1} [a_i, b_i], with [a_i, b_i] the support of ξ_i, i = 1, 2, …, m2 + n2 + m2·n1, respectively. In the following, we adopt the approximation method proposed in [13] to approximate the possibility distribution of ξ by a sequence of possibility distributions of primitive fuzzy vectors ζ_n, n = 1, 2, …. The method can be described as follows.
For each integer n, define ζ_n = (ζ_{n,1}, ζ_{n,2}, …, ζ_{n,m2+n2+m2·n1})^T by

  ζ_n = h_n(ξ) = (h_{n,1}(ξ_1), h_{n,2}(ξ_2), …, h_{n,m2+n2+m2·n1}(ξ_{m2+n2+m2·n1}))^T,

where the fuzzy variables ζ_{n,i} = h_{n,i}(ξ_i), i = 1, 2, …, m2 + n2 + m2·n1, with

  h_{n,i}(u_i) = max{k_i/n | k_i ∈ Z, k_i/n ≤ u_i}, u_i ∈ [a_i, b_i],

and Z is the set of all integers. As a consequence, the possibility distribution of ζ_{n,i}, denoted by ν_{n,i}, is

  ν_{n,i}(k_i/n) = Pos{ζ_{n,i} = k_i/n} = Pos{k_i/n ≤ ξ_i < (k_i + 1)/n}

for k_i = [na_i], [na_i] + 1, …, K_i. By the definition of ζ_{n,i}, one has ξ_i(γ) − 1/n < ζ_{n,i}(γ) ≤ ξ_i(γ) for all γ ∈ Γ and i = 1, 2, …, m2 + n2 + m2·n1, which implies that the sequence {ζ_n} of discrete fuzzy vectors converges uniformly to the fuzzy vector ξ on Γ. In what follows, the sequence {ζ_n} of primitive fuzzy vectors is referred to as the discretization of the fuzzy vector ξ. For each fixed n, the fuzzy vector ζ_n takes K = K1·K2 ⋯ K_{m2+n2+m2·n1} values; denote them as

  ζ̂_n^k = (ζ̂_{n,1}^k, …, ζ̂_{n,m2+n2+m2·n1}^k), k = 1, …, K.
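The grid map h_{n,i} simply rounds each realization down to the nearest multiple of 1/n, which is exactly what drives the uniform convergence bound ξ_i(γ) − 1/n < ζ_{n,i}(γ) ≤ ξ_i(γ). A minimal sketch (the helper name is ours):

```python
import math

def h_n(u, n):
    """Largest grid point k/n (k an integer) with k/n <= u -- the discretization h_{n,i}."""
    return math.floor(n * u) / n
```

For example, h_n(0.37, 10) = 0.3, and the discretization error never reaches the grid width 1/n.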
We now replace the possibility distribution of ξ by that of ζ_n, and approximate Q_αVaR(x, Ĉr) by Q_αVaR(x, Ĉr_n), with Ĉr_n = Cr ∘ ζ_n^{-1}, provided n is sufficiently large. Toward that end, denote

  ν_k = ν_{n,1}(ζ̂_{n,1}^k) ∧ ν_{n,2}(ζ̂_{n,2}^k) ∧ ⋯ ∧ ν_{n,m2+n2+m2·n1}(ζ̂_{n,m2+n2+m2·n1}^k)

for k = 1, 2, …, K, where ν_{n,i} are the possibility distributions of ζ_{n,i}, i = 1, 2, …, m2 + n2 + m2·n1, respectively. For each integer k, we solve the second-stage linear programming problem (5) via the simplex method, and denote the optimal value by Q(x, ζ̂_n^k). Letting φ_k = c^T x + Q(x, ζ̂_n^k), the α-VaR Q_αVaR(x, Ĉr_n) can be computed by
(7)
where
1 (1 + max{νj | φj ≤ φk } − max{νj | φj > φk }). 2 The process to compute the α-VaR QαVaR (x, Cˆr) is summarized as ck =
(8)
Algorithm 1 (Approximation Algorithm)
Step 1. Generate K points ζ̂_n^k = (ζ̂_{n,1}^k, …, ζ̂_{n,m2+n2+m2·n1}^k) uniformly from the support Ξ of ξ for k = 1, 2, …, K.
Step 2. Solve the second-stage linear programming problem (5), denote the optimal value by Q(x, ζ̂_n^k), and set φ_k = c^T x + Q(x, ζ̂_n^k) for k = 1, 2, …, K.
Step 3. Set ν_k = ν_{n,1}(ζ̂_{n,1}^k) ∧ ν_{n,2}(ζ̂_{n,2}^k) ∧ ⋯ ∧ ν_{n,m2+n2+m2·n1}(ζ̂_{n,m2+n2+m2·n1}^k) for k = 1, 2, …, K.
Step 4. Compute c_k = Ĉr_n{c^T x + Q(x, ζ_n) ≤ φ_k} for k = 1, 2, …, K according to formula (8).
Step 5. Return U(x) via the estimation formula (7).
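Once Steps 1-2 have produced the scenario costs φ_k and their possibilities ν_k, Steps 3-5 reduce to a few lines. The hypothetical helper below (name and signature are ours, not the paper's) evaluates formula (8) and returns U(x) by formula (7):

```python
def credibility_var(phi, nu, alpha):
    """Formula (7): U(x) = min{phi_k | c_k >= alpha}, with c_k from formula (8)."""
    def cred(k):
        # formula (8): self-dual credibility that the cost does not exceed phi_k
        below = max(nu[j] for j in range(len(phi)) if phi[j] <= phi[k])
        above = max((nu[j] for j in range(len(phi)) if phi[j] > phi[k]), default=0.0)
        return 0.5 * (1.0 + below - above)
    return min(phi[k] for k in range(len(phi)) if cred(k) >= alpha)
```

With phi = [1, 2, 3] and nu = [0.5, 1.0, 0.3], the credibilities c_k are 0.25, 0.85 and 1.0, so the 0.5-VaR is 2 and the 0.9-VaR is 3.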
The convergence of Algorithm 1 is ensured by the following theorem. As a consequence, the α-VaR Q_αVaR(x, Ĉr) can be estimated by formula (7) provided that n is sufficiently large.

Theorem 1. Consider the two-stage fuzzy programming problem (4). Suppose W is fixed, ξ = q or (h, T) is a continuous fuzzy vector, and β ∈ (0, 1) is a prescribed confidence level. If ξ is a bounded fuzzy vector, and the sequence {ζ_n} of primitive fuzzy vectors is the discretization of ξ, then for any given x ∈ X, we have

  lim_{n→∞} Q_βVaR(x, Ĉr_n) = Q_βVaR(x, Ĉr)

provided that β is a continuity point of the function Q_αVaR(x, Ĉr) at α = β.

Proof. By the suppositions of Theorem 1 and the properties of Q(x, ξ), the proof is similar to that of [10, Theorem 2].
4 HAs and Numerical Example
In the following, we incorporate the approximation method, an NN, and a GA to produce an HA for solving the proposed fuzzy optimization problem. First, we generate a set of training data for Q_αVaR(x, Ĉr) by the approximation method. Then, using the generated input-output data, we train an NN by a fast BP algorithm to approximate Q_αVaR(x, Ĉr). We repeat the BP algorithm until the error over the training set is reduced to an acceptable value or the specified number of training epochs has been performed. After that, we use new data (not learned by the NN) to test the trained NN. If the test results are satisfactory, we stop the training process; otherwise, we continue to train the NN. After the NN is well trained, it is embedded into a GA to produce an HA. During the solution process, the output values of the trained NN represent the approximate values of Q_αVaR(x, Ĉr). Therefore, it is not necessary to compute Q_αVaR(x, Ĉr) by the approximation method during the solution process, so that much time can be saved. The process of the HA for solving the proposed fuzzy optimization problem is summarized as

Algorithm 2 (Hybrid Algorithm)
Step 1. Generate a set of input-output data for the critical VaR function U: x → Q_αVaR(x, Ĉr) by the proposed approximation method;
Step 2. Train an NN to approximate the critical VaR function U(x) with the generated input-output data;
Step 3. Initialize pop_size chromosomes at random;
Step 4. Update the chromosomes by crossover and mutation operations;
Step 5. Calculate the objective values of all chromosomes by the trained NN;
Step 6. Compute the fitness of each chromosome from the objective values;
Step 7. Select the chromosomes by spinning the roulette wheel;
Step 8. Repeat Step 4 to Step 7 for a given number of cycles;
Step 9. Report the best chromosome as the optimal solution.

We now give a numerical example to show the effectiveness of the designed HA.

Example 1. Consider the following two-stage fuzzy programming problem, with q and h containing fuzzy variables:

  min_x Q_{0.9VaR}(x)
  s.t. x1 + x2 + 2x3 ≤ 15
       2x1 − x2 + x3 ≤ 6
       −2x1 + 2x2 ≤ 8
       x1, x2, x3 ≥ 0,

where c^T x + Q(x, γ) = 3x1 + 2x2 − 4x3 + Q(x, γ),

  Q(x, γ) = min q1(γ)y1 + q2(γ)y2 + y3 + q4(γ)y4 + y5
  s.t. y1 + y2 − 3y4 − 2y5 = h1(γ) + x1 − x3
       18y1 − 8y2 + 6y3 = h2(γ) − x1 + 2x2 − x3
       −y1 − 9y2 + 14y3 + 8y5 = h3(γ) + x1 − x2
       yk ≥ 0, k = 1, 2, …, 5,

and q1, q2, q4, h1, h2, and h3 are mutually independent triangular fuzzy variables (7, 8, 9), (5, 6, 7), (9, 10, 11), (23, 24, 25), (16, 17, 18), and (20, 21, 22), respectively.

For any given feasible solution x, we use 10000 samples in the approximation method to estimate the 0.9-VaR Q_{0.9VaR}(x). Using this method, we first produce 3000 input-output data x_j → Q_{0.9VaR}(x_j), j = 1, …, 3000; then we use the data to train an NN to approximate the VaR function Q_{0.9VaR}(x) (3 input neurons representing the decision x, 10 hidden neurons, and 1 output neuron representing the value of Q_{0.9VaR}(x)). After the NN is well trained, it is embedded into a GA to produce an HA to search for the optimal solutions.
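Steps 3-9 of Algorithm 2 form a standard roulette-wheel GA. The sketch below illustrates that loop for a single scalar decision variable; the `objective` argument stands in for the trained NN's estimate of the VaR function, and the elitism line is an extra assumption of ours beyond the paper's description:

```python
import random

def roulette(fit, r):
    """Pick an index by roulette wheel: r is uniform on [0, sum(fit))."""
    acc = 0.0
    for idx, f in enumerate(fit):
        acc += f
        if r < acc:
            return idx
    return len(fit) - 1

def ga_minimize(objective, lo, hi, pop_size=30, pc=0.3, pm=0.2, cycles=50, seed=0):
    """Steps 3-9 of Algorithm 2 for a nonnegative scalar objective."""
    rng = random.Random(seed)
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]          # Step 3
    best = min(pop, key=objective)
    for _ in range(cycles):                                       # Step 8
        for i in range(0, pop_size - 1, 2):                       # Step 4: crossover
            if rng.random() < pc:
                t = rng.random()
                pop[i], pop[i + 1] = (t * pop[i] + (1 - t) * pop[i + 1],
                                      t * pop[i + 1] + (1 - t) * pop[i])
        for i in range(pop_size):                                 # Step 4: mutation
            if rng.random() < pm:
                pop[i] = min(hi, max(lo, pop[i] + rng.gauss(0.0, 0.1)))
        best = min([best] + pop, key=objective)
        fit = [1.0 / (1.0 + objective(x)) for x in pop]           # Steps 5-6
        total = sum(fit)
        pop = [pop[roulette(fit, rng.random() * total)]           # Step 7
               for _ in range(pop_size)]
        pop[0] = best  # elitism (our assumption), keeps the incumbent
    return best                                                   # Step 9
```

For instance, ga_minimize(lambda x: (x - 3.0) ** 2, 0.0, 10.0) should return a value near x = 3.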
To identify the influence of the parameters on solution quality, we compare the solutions obtained under careful variation of the GA parameters. The computational results are reported in Table 1, where the parameters of the GA include the population size pop_size, the crossover probability Pc, and the mutation probability
Table 1. Comparison of solutions of Example 1

pop_size  Pc   Pm   Optimal solution            Optimal value
30        0.3  0.2  (0.0000, 1.0000, 7.0000)    112.232580
30        0.3  0.1  (0.0000, 1.0000, 7.0000)    112.232517
30        0.2  0.2  (0.0000, 1.0000, 7.0000)    112.234515
30        0.1  0.3  (0.0000, 1.0000, 7.0000)    112.234459
20        0.1  0.3  (0.0000, 1.0000, 7.0000)    112.234479
20        0.3  0.2  (0.0000, 1.0000, 7.0000)    112.234522
20        0.3  0.1  (0.0000, 1.0000, 7.0000)    112.234615
20        0.2  0.2  (0.0000, 1.0000, 7.0000)    112.234518
Pm. From Table 1, we can see that the optimal solutions and the optimal objective values change little when different GA parameters are selected, which implies that the HA is robust to the parameter settings and effective for solving this fuzzy two-stage programming problem.
5 Conclusions
In this paper, we have formulated a novel class of two-stage fuzzy programming with recourse problems based on a VaR criterion. In order to compute the critical VaR objective, we presented an approximation approach for fuzzy variables with infinite supports, and discussed the convergence of the approximation scheme. Furthermore, we designed an HA, which combines the approximation approach, a GA and an NN, to solve the proposed fuzzy optimization problem, and provided a numerical example to show the effectiveness of the HA.

Acknowledgements. This work was partially supported by the National Natural Science Foundation of China under Grant No.70571021, the Natural Science Foundation of Hebei Province under Grant No.A2005000087, and the CityUHK SRG 7001794 & 7001679.
References

1. Chen, Y., Liu, Y.K., Chen, J.: Fuzzy Portfolio Selection Problems Based on Credibility Theory. In: Yeung, D.S., Liu, Z.Q., et al. (eds.): Advances in Machine Learning and Cybernetics. Lecture Notes in Artificial Intelligence, Vol. 3930, Springer-Verlag, Berlin Heidelberg (2006) 377-386
2. Dubois, D., Prade, H.: Possibility Theory. Plenum Press, New York (1988)
3. Gao, J., Liu, B.: Fuzzy Multilevel Programming with a Hybrid Intelligent Algorithm. Computers & Mathematics with Applications 49 (2005) 1539-1548
4. Hogan, A.J., Morris, J.G., Thompson, H.E.: Decision Problems under Risk and Chance Constrained Programming: Dilemmas in the Transition. Management Science 27 (1981) 698-716
5. Kibzun, A.I., Kan, Y.S.: Stochastic Programming Problems with Probability and Quantile Functions. Wiley, Chichester (1996)
6. Liu, B.: Theory and Practice of Uncertain Programming. Physica-Verlag, Heidelberg (2002)
7. Liu, B.: Uncertainty Theory: An Introduction to Its Axiomatic Foundations. Springer-Verlag, Berlin Heidelberg New York (2004)
8. Liu, B.: A Survey of Entropy of Fuzzy Variables. Journal of Uncertain Systems 1 (2007) 1-11
9. Liu, B., Liu, Y.K.: Expected Value of Fuzzy Variable and Fuzzy Expected Value Models. IEEE Trans. Fuzzy Syst. 10 (2002) 445-450
10. Liu, Y.K.: Convergent Results About the Use of Fuzzy Simulation in Fuzzy Optimization Problems. IEEE Trans. Fuzzy Syst. 14 (2006) 295-304
11. Liu, Y.K., Liu, B., Chen, Y.: The Infinite Dimensional Product Possibility Space and Its Applications. In: Huang, D.-S., Li, K., Irwin, G.W. (eds.): Computational Intelligence. Lecture Notes in Artificial Intelligence, Vol. 4114, Springer-Verlag, Berlin Heidelberg (2006) 984-989
12. Liu, Y.K., Wang, S.: Theory of Fuzzy Random Optimization. China Agricultural University Press, Beijing (2006)
13. Liu, Y.K.: Fuzzy Programming with Recourse. Int. J. Uncertainty Fuzziness Knowl.-Based Syst. 13 (2005) 381-413
14. Nahmias, S.: Fuzzy Variables. Fuzzy Sets Syst. 1 (1978) 97-101
15. Nishizaki, I., Sakawa, M.: On Computational Methods for Solutions of Multiobjective Linear Production Programming Games. European Journal of Operational Research 129 (2001) 386-413
16. Owen, G.: On the Core of Linear Production Games. Math. Programming 9 (1975) 358-370
17. Sandsmark, M.: Production Games under Uncertainty. Comput. Economics 14 (1999) 237-253
18. Zadeh, L.A.: Fuzzy Sets as a Basis for a Theory of Possibility. Fuzzy Sets Syst. 1 (1978) 3-28
Neural-Network-Driven Fuzzy Optimum Selection for Mechanism Schemes Yingkui Gu and Xuewen He School of Mechanical & Electronical Engineering Jiangxi University of Science and Technology Ganzhou, Jiangxi 341000, China
[email protected]

Abstract. Product conceptual design is an innovative activity that forms and optimizes the projects of products. Identification of the best conceptual design candidate is a crucial step, as design information is incomplete and design knowledge is minimal at the conceptual design stage. It is necessary to select the best scheme from the feasible alternatives through comparison and filtering. In this paper, the evaluation system of mechanism schemes is first established based on the performance analysis of the mechanism system and the opinions of experts. Then, the fuzzy optimum selection model for mechanism scheme evaluation is provided. Combining the fuzzy optimum selection model with neural network theory, a rational pattern for determining the topological structure of the network is provided, together with a weight-adjusted BP model of the neural network for the fuzzy optimum selection of mechanism schemes. Finally, an example is given to verify the effectiveness and feasibility of the proposed method.
1 Introduction

Mechanism scheme design is the core of mechanical product concept design. Concept design is a process of developing design candidates based on design requirements. At the conceptual design stage, a number of design candidates are usually generated which satisfy all design requirements. Therefore, identification of the best conceptual design candidate is a crucial step, as design information is incomplete and design knowledge is minimal at this stage. The evaluation and selection of schemes are important tasks in mechanism conceptual design; how to establish a reasonable evaluation system and an effective selection model are the key problems for designers to study. In recent years, many methods have been presented to evaluate mechanism schemes. In particular, recent advances in soft computing techniques, including fuzzy sets [1-8], neural networks [9-11] and genetic algorithms, provide new tools for developing intelligent systems capable of modeling uncertainty and learning under fuzzy and uncertain development environments. Applications of soft computing to mechanism scheme optimum selection have resulted in computerized systems.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 275–283, 2007. © Springer-Verlag Berlin Heidelberg 2007
276
Y. Gu and X. He
Chen, Cai and Song [12] introduced a case-based reasoning product conceptual design system, in which similar product cases can be evaluated based on design and manufacturing knowledge, so that the optimum solution of product conceptual design can be acquired. Jiang and Hsu [13] presented a manufacturability scheme evaluation decision model based on fuzzy logic and multiple attribute decision-making under the concurrent engineering environment. Huang, Li and Xue [14] used fuzzy synthetic evaluation to evaluate and select the optimal grinding machining scheme. Huang, Tian and Zuo [15] introduced an intelligent interactive multiobjective optimization method to evaluate reliability design schemes based on the physical programming theory proposed by Messac [16]. Sun, Kalenchuk, Xue and Gu [17] presented a method for design candidate evaluation and identification using neural network-based fuzzy reasoning. Xue and Dong [18] developed a fuzzy-based design function coding system to identify design candidates from design functions. Bahrami, Lynch and Dagli [19] used fuzzy associative memory, a two-layer feedforward neural network, to describe the relationships between customer needs and design candidates. Sun, Xie and Xue [20] presented a drive type decision system based on the one-against-one mode of support vector machines through identification of the characteristics and the type decisions. Huang, Bo and Chen [21] presented an integrated computational intelligence approach to generate and evaluate concept design schemes, where neural networks, fuzzy sets and genetic algorithms are used to evaluate and select the optimal design scheme. Although the methods proposed above are effective and feasible for evaluating mechanism schemes, they still have some disadvantages, such as calculation difficulty, strong subjectivity and low evaluation efficiency.
In this paper, a neural-network-driven fuzzy optimum selection method is introduced, based on the fuzzy optimum selection theory proposed by Chen [22-24], for solving the problems of modeling uncertainty and improving computational efficiency in the process of identifying mechanism schemes. The evaluation system of mechanism schemes is first established based on the performance analysis of the mechanism system and the opinions of experts. Then, the fuzzy optimum selection model for mechanism scheme evaluation is provided. Combining the fuzzy optimum selection model with neural network theory, a rational pattern for determining the topological structure of the network is provided, together with a weight-adjusted BP model of the neural network for the fuzzy optimum selection of mechanism schemes. Results show that the proposed method offers a new way to evaluate and select the optimum mechanism scheme from a scheme set.
2 The Fuzzy Optimum Selection of Mechanism Schemes

2.1 Establishment of the Evaluation Index System

A mechanism scheme is usually composed of several sub-systems. In the conceptual design stage, it is necessary to select the best scheme from the feasible alternatives through comparison and filtering. Therefore, a reasonable and effective evaluation index system should be established to evaluate and optimize the mechanism scheme set. Based on the performance analysis of the mechanism system and the opinions of experts, the evaluation index system of mechanism schemes is established as shown in Figure 1.
Fig. 1. The evaluation system of mechanism scheme
In Figure 1, U is the satisfaction degree. R1 is the basic function: R11 the kinematic precision, R12 the transmission precision. R2 is the working function: R21 the operation speed, R22 the adjustment, R23 the loading capacity. R3 is the dynamical function: R31 the maximal acceleration, R32 the noise, R33 the reliability, R34 the anti-abrasion. R4 is the economical performance: R41 the design cost, R42 the manufacturing cost, R43 the sensitivity to manufacturing errors, R44 the convenience of adjustment, R45 the energy consumption. R5 is the structure performance: R51 the dimension, R52 the weight, R53 the complexity of structure. The evaluation index system is an objective set that the mechanism scheme should attain; therefore, the system should have the characteristics of integrality, independency and quantifiability.

2.2 The Fuzzy Optimum Selection Model of Mechanism Schemes

It is assumed that there are n mechanism schemes satisfying the constraint conditions, each evaluated according to m evaluation objectives. Let x_ij be the eigenvalue of the i-th objective of the j-th scheme, and r_ij the relative membership degree of the objective eigenvalue x_ij. The objective eigenvalues fall into two categories:

(1) The larger, the better. With x_{i,max} = x_{i1} > x_{i2} > ⋯ > x_{in}, set r_ij = x_ij / x_{i,max}.

(2) The smaller, the better. With x_{i,min} = x_{i1} < x_{i2} < ⋯ < x_{in}, set r_ij = x_{i,min} / x_ij.
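The two normalizations above can be sketched in a few lines. Here the row maximum/minimum is taken directly rather than assuming the schemes are pre-sorted; the function name and signature are our own:

```python
def relative_membership(values, larger_better=True):
    """Relative membership degrees r_ij of one objective across the n schemes."""
    if larger_better:
        best = max(values)
        return [v / best for v in values]
    best = min(values)
    return [best / v for v in values]
```

For a benefit-type objective with eigenvalues [2, 4, 8] this gives [0.25, 0.5, 1.0]; for a cost-type objective the order reverses to [1.0, 0.5, 0.25].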
The relative membership degree matrix of the n mechanism schemes can be expressed as

  R = | r11  r12  …  r1n |
      | r21  r22  …  r2n |
      | …    …    …  …   |
      | rm1  rm2  …  rmn |  = (r_ij), i = 1, 2, …, m, j = 1, 2, …, n.   (1)
We know that the relative membership degree vector of the j-th scheme is r_j = (r_1j, r_2j, …, r_mj)^T. We define the relative membership degree vectors of the optimum scheme and of the bad scheme as (1, 1, …, 1)^T and (0, 0, …, 0)^T, respectively. The Hamming distance between the j-th scheme and the optimum scheme is

  d_jg = Σ_{i=1}^m w_ij (1 − r_ij) = 1 − Σ_{i=1}^m w_ij r_ij,   (2)
where w_ij is the weight of the i-th objective of the j-th scheme. For each scheme j, the weights must satisfy the constraint

  Σ_{i=1}^m w_ij = 1.   (3)
The Hamming distance between the j-th scheme and the bad scheme is

  d_jb = Σ_{i=1}^m w_ij (r_ij − 0) = Σ_{i=1}^m w_ij r_ij.   (4)
Let the relative membership degree of the j-th scheme to the optimum scheme be u_j, and that to the bad scheme be u_j^c; then

  u_j^c = 1 − u_j.   (5)

The weighted distance between the j-th scheme and the optimum scheme is

  D_jg = u_j d_jg,   (6)

and the weighted distance between the j-th scheme and the bad scheme is

  D_jb = u_j^c d_jb = (1 − u_j) d_jb.   (7)
In order to obtain the optimum value of the relative membership degree of the j-th scheme, the optimization criterion is established as follows [22]:

  min F(u_j) = D_jg² + D_jb²
             = u_j² (1 − Σ_{i=1}^m w_ij r_ij)² + (1 − u_j)² (Σ_{i=1}^m w_ij r_ij)².   (8)

Let dF(u_j)/du_j = 0.
The optimization model, expressed via the Hamming distance, is then [24]

  u_j = 1 / ( 1 + [ (1 − Σ_{i=1}^m w_ij r_ij) / (Σ_{i=1}^m w_ij r_ij) ]² ).   (9)
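Formula (9) is a one-liner once the weighted score s = Σ_i w_ij r_ij is computed; the sketch below (names are ours) assumes 0 < s ≤ 1, which holds when the weights sum to 1 and the r_ij are positive:

```python
def optimum_membership(w, r):
    """Formula (9): relative membership u_j of scheme j to the optimum scheme."""
    s = sum(wi * ri for wi, ri in zip(w, r))  # s = sum_i w_ij * r_ij, assumed in (0, 1]
    return 1.0 / (1.0 + ((1.0 - s) / s) ** 2)
```

For example, with equal weights and all r_ij = 0.8, s = 0.8 and u_j = 16/17 ≈ 0.941; a weighted score of 0.5 yields u_j = 0.5, the indifference point between the optimum and the bad scheme.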
3 BP-Neural-Network-Driven Fuzzy Optimum Selection Model

The evaluation method based on a neural network is an example-based method: the user only needs to offer enough samples for training the network, and the evaluation results are obtained from the trained network. Because a back-propagation neural network has the ability to learn by examples, it has been used in pattern matching, pattern classification and pattern recognition. Therefore, it can be used to establish the neural-network-driven fuzzy optimum selection model for mechanism schemes. A back-propagation (BP) neural network is a multi-layer network with an input layer, an output layer, and some hidden layers between them. Each layer has a number of processing units, called neurons. A neuron simply computes the sum of its weighted inputs, subtracts its threshold from the sum, and passes the result through its transfer function. One of the most important characteristics of BP neural networks is their ability to learn by examples. With proper training, the network can memorize the knowledge in the problem solving of a particular domain [25]. Back-propagation neural networks take their name from their training algorithm, known as error back-propagation or the generalized delta rule. The training of such a network starts with assigning random values to all the weights. An input is then presented to the network, and the output from each neuron in each layer is propagated forward through the entire network to yield an actual output. The error for each neuron in the output layer is computed as the difference between the actual output and its corresponding target output. This error is then propagated backwards through the entire network, and the weights are updated. The weights for a particular neuron are adjusted in direct proportion to the error in the units to which it is connected. In this way the error is reduced and the network learns.
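The forward-then-backward pass just described can be sketched for the simplest case: one input, one hidden layer of sigmoid units, one sigmoid output, trained by the generalized delta rule. This is a generic illustration of error back-propagation under our own choices of size, learning rate and dataset, not the paper's weight-adjusted model:

```python
import math
import random

def train_bp(samples, hidden=3, lr=0.5, epochs=2000, seed=1):
    """Minimal error back-propagation for a 1-input, 1-hidden-layer, 1-output net."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(hidden)]  # (weight, bias)
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden + 1)]  # hidden-to-output weights + bias
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    def forward(x):
        h = [sig(w * x + b) for w, b in w1]
        y = sig(sum(w2[k] * h[k] for k in range(hidden)) + w2[-1])
        return h, y
    for _ in range(epochs):
        for x, t in samples:
            h, y = forward(x)
            d_out = (y - t) * y * (1.0 - y)                      # output-layer delta
            for k in range(hidden):
                d_k = d_out * w2[k] * h[k] * (1.0 - h[k])        # back-propagated delta
                w1[k][0] -= lr * d_k * x
                w1[k][1] -= lr * d_k
                w2[k] -= lr * d_out * h[k]
            w2[-1] -= lr * d_out
    return lambda x: forward(x)[1]
```

Training on the two points (0, 0) and (1, 1), the learned mapping becomes increasing with a small residual error, illustrating the delta-rule weight updates described above.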
As shown in Figure 2, a three-layer BP neural network is selected to reflect the established fuzzy optimum selection model. The network has m input nodes, l hidden nodes and one output node. The number of input-layer nodes equals the number of evaluation objectives of the fuzzy optimum selection, and the input of the neural network is the relative membership degree of each objective. The output of the neural network is the relative membership degree of the evaluated scheme. In the input layer, the input and output of the i-th node are r_ij and u_ij respectively, where i = 1, 2, …, m, j = 1, 2, …, n. In the hidden layer, the input and output of the k-th node are I_kj and u_kj respectively. w_ik is the joint weight between the i-th node and the
Fig. 2. BP-neural-network-driven fuzzy optimum selection model
k-th node. There is only one node p in the output layer; its input and output are I_pj and u_pj respectively, and w_kp is the joint weight between the hidden layer and the output layer. The input and output of the network are listed in Table 1.

Table 1. The input and output of the fuzzy optimum selection BP neural network

- i-th node of the input layer: input r_ij; output u_ij = r_ij.
- k-th node of the hidden layer: input I_kj = Σ_{i=1}^m w_ik r_ij; output u_kj = 1 / (1 + [(Σ_{i=1}^m w_ik r_ij)^{-1} − 1]²); joint weights Σ_{i=1}^m w_ik = 1, w_ik ≥ 0.
- node p of the output layer: input I_pj = Σ_{k=1}^l w_kp u_kj; output u_pj = 1 / (1 + [(Σ_{k=1}^l w_kp u_kj)^{-1} − 1]²); joint weights Σ_{k=1}^l w_kp = 1, w_kp ≥ 0.
The actual output u_pj is the response of the fuzzy optimum selection BP neural network to the input r_ij. Let the expected output of the j-th scheme be M(u_pj); its square error is

  E_j = (1/2)(u_pj − M(u_pj))².   (10)
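The forward pass defined by Table 1 applies the same formula-(9) transfer to every hidden and output node. A minimal sketch (names ours; weighted sums assumed to lie in (0, 1], which the convexity constraints in Table 1 guarantee for inputs in (0, 1]):

```python
def transfer(s):
    """Transfer function of Table 1's hidden and output nodes (formula (9) form)."""
    return 1.0 / (1.0 + (1.0 / s - 1.0) ** 2)  # assumes 0 < s <= 1

def fuzzy_forward(r, w_ik, w_kp):
    """Forward pass: inputs r_ij, one weight row per hidden node (w_ik),
    output-layer weights w_kp; returns the scheme's relative membership u_pj."""
    u_hidden = [transfer(sum(w * x for w, x in zip(row, r))) for row in w_ik]
    return transfer(sum(w * u for w, u in zip(w_kp, u_hidden)))
```

For example, with all r_ij = 0.75 and uniform weights, each hidden node outputs 0.9 and the network output is 81/82 ≈ 0.9878.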
4 Case Study

To investigate the model developed above, here is an example of the optimum design scheme selection of the cutting paper machine. The design requirements are as
Fig. 3. The scheme set of the cutting paper machine: a) Scheme 1; b) Scheme 2; c) Scheme 3
follows. (1) The speed of cutting paper is constant. (2) The reliability is high. (3) The structure of the machine is simple and easy to design and manufacture. Through detailed analysis, three schemes are presented, as shown in Figure 3.

Applying the evaluation system proposed in Section 2 and the neural-network-driven fuzzy optimum model presented in Section 3, a three-layer BP neural network is established as shown in Figure 4. The network has 17 input nodes, 5 hidden nodes and one output node. The input of the neural network is the relative membership degree of each objective; the output is the relative membership degree of the evaluated scheme. The input values of each scheme are listed in Table 2, and the output values of the relative membership degree of each scheme are listed in Table 3. By comparison we can see that the first scheme has a higher relative membership degree than the other two schemes and is adopted as the optimum scheme of the cutting machine.

Fig. 4. A three-layer BP neural network model for the fuzzy optimum selection of the cutting paper machine schemes
Table 2. The input value of each scheme

Scheme  r11   r12   r21   r22   r23   r31   r32   r33   r34   r41   r42   r43   r44   r45   r51   r52   r53
j=1     1.0   0.75  0.75  0.75  0.75  0.75  1.0   0.5   0.5   1.0   0.5   0.75  0.75  0.75  0.75  0.75  0.75
j=2     0.75  0.75  0.75  0.75  0.75  0.75  0.5   0.5   0.75  1.0   0.75  0.75  0.75  0.5   0.75  0.75  0.5
j=3     0.75  0.75  0.75  0.75  0.75  0.75  0.5   0.75  0.75  1.0   0.75  0.75  0.75  0.75  0.5   0.75  0.75
Table 3. The output value of the relative membership degree of each scheme

Scheme  Relative membership degree  Order
j=1     0.8654                      1
j=2     0.7548                      3
j=3     0.8012                      2
5 Conclusions

The problem of mechanism scheme evaluation is a kind of expert decision problem that must be evaluated repeatedly, and its essential characteristics are fuzziness and uncertainty. The experience of experts has a very important influence on the evaluation result. The evaluation method proposed in this paper can describe the property values of the evaluation and the non-linear relationship among the evaluation results well. Applying the proposed method decreases the complexity and subjectivity of scheme evaluation and improves the rationality of the evaluation results. Neural-network-driven fuzzy optimum selection offers a new way for the evaluation of mechanism schemes.
Acknowledgment This research was partially supported by China Postdoctoral Science Foundation under Grant 20060391029.
References

1. Huang, H.Z., Zuo, M.J., Sun, Z.Q.: Bayesian Reliability Analysis for Fuzzy Lifetime Data. Fuzzy Sets and Systems 157 (2006) 1674-1686
2. Huang, H.Z., Wang, P., Zuo, M.J., Wu, W.D., Liu, C.S.: A Fuzzy Set Based Solution Method for Multiobjective Optimal Design Problem of Mechanical and Structural Systems Using Functional-Link Net. Neural Computing & Applications 15 (2006) 239-244
3. Huang, H.Z., Wu, W.D., Liu, C.S.: A Coordination Method for Fuzzy Multi-Objective Optimization of System Reliability. Journal of Intelligent and Fuzzy Systems 16 (2005) 213-220
4. Huang, H.Z., Li, H.B.: Perturbation Fuzzy Finite Element Method of Structural Analysis Based on Variational Principle. Engineering Applications of Artificial Intelligence 18 (2005) 83-91
5. Huang, H.Z., Tong, X., Zuo, M.J.: Posbist Fault Tree Analysis of Coherent Systems. Reliability Engineering and System Safety 84 (2004) 141-148
6. Huang, H.Z.: Fuzzy Multi-Objective Optimization Decision-Making of Reliability of Series System. Microelectronics and Reliability 37 (1997) 447-449
7. Huang, H.Z.: Reliability Analysis Method in the Presence of Fuzziness Attached to Operating Time. Microelectronics and Reliability 35 (1995) 1483-1487
8. Zhang, Z., Huang, H.Z., Yu, L.F.: Fuzzy Preference Based Interactive Fuzzy Physical Programming and Its Application in Multi-objective Optimization. Journal of Mechanical Science and Technology 20 (2006) 731-737
9. Xue, L.H., Huang, H.Z., Hu, J., Miao, Q., Ling, D.: RAOGA-based Fuzzy Neural Network Model of Design Evaluation. Lecture Notes in Artificial Intelligence 4114 (2006) 206-211
10. Huang, H.Z., Tian, Z.G.: Application of Neural Network to Interactive Physical Programming. Lecture Notes in Computer Science 3496 (2005) 725-730
11. Li, H.B., Huang, H.Z., Zhao, M.Y.: Finite Element Analysis of Structures Based on Linear Saturated System Model. Lecture Notes in Computer Science 3174 (2004) 820-825
12. Song, Y.Y., Cai, F.Z., Zhang, B.P.: One of Case-Based Reasoning Product Conceptual Design Systems. Journal of Tsinghua University 38 (1998) 5-8
13. Jiang, B., Hsu, C.H.: Development of a Fuzzy Decision Model for Manufacturability Evaluation. Journal of Intelligent Manufacturing 14 (2003) 169-181
14. Huang, H.Z., Li, Y.H., Xue, L.H.: A Comprehensive Evaluation Model for Assessments of Grinding Machining Quality. Key Engineering Materials 291-292 (2005) 157-162
15. Huang, H.Z., Tian, Z.G., Zuo, M.J.: Intelligent Interactive Multiobjective Optimization Method and Its Application to Reliability Optimization. IIE Transactions on Quality and Reliability 37 (2005) 983-993
16. Messac, A., Sukam, C.P., Melachrinoudis, E.: Mathematical and Pragmatic Perspectives of Physical Programming. AIAA Journal 39 (2001) 885-893
17. Sun, J., Kalenchuk, D.K., Xue, D., Gu, P.: Design Candidate Identification Using Neural Network-Based Fuzzy Reasoning. Robotics and Computer Integrated Manufacturing 16 (2000) 383-396
18. Xue, D., Dong, Z.: Coding and Clustering of Design and Manufacturing Features for Concurrent Design. Computers in Industry 34 (1997) 139-153
19. Bahrami, A., Lynch, M., Dagli, C.H.: Intelligent Design Retrieval and Packing System: Application of Neural Networks in Design and Manufacturing. International Journal of Production Research 33 (1995) 405-426
20. Sun, H.L., Xie, J.Y., Xue, Y.F.: Mechanical Drive Type Decision Model Based on Support Vector Machine. Journal of Shanghai Jiao Tong University 39 (2005) 975-978
21. Huang, H.Z., Bo, R.F., Chen, W.: An Integrated Computational Intelligence Approach to Product Concept Generation and Evaluation. Mechanism and Machine Theory 41 (2006) 567-583
22. Chen, S.Y.: Engineering Fuzzy Set Theory and Application. National Defence Industry Press, Beijing (1998)
23. Chen, S.Y., Nie, X.T., Zhu, W.B., Wang, G.L.: A Model of Fuzzy Optimization Neural Networks and Its Application. Advances in Water Science 10 (1999) 69-74
24. Chen, S.Y.: Multi-Objective Decision-Making Theory and Application of Neural Network with Fuzzy Optimum Selection. Journal of Dalian University of Technology 37 (1997) 693-698
25. Zhang, Y.F., Fuh, J.Y.H.: A Neural Network Approach for Early Cost Estimation of Packaging Products. Computers and Industrial Engineering 34 (1998) 433-450
Atrial Arrhythmias Detection Based on Neural Network Combining Fuzzy Classifiers

Rongrong Sun and Yuanyuan Wang

Department of Electronic Engineering, Fudan University, Shanghai, China
{041021082,yywang}@fudan.edu.cn
Abstract. Accurate detection of atrial arrhythmias is important for implantable devices that treat them. A novel method is proposed to identify sinus rhythm, atrial flutter and atrial fibrillation. Three different feature sets are first extracted, based on the frequency domain, the time-frequency domain and symbolic dynamics. Then a classifier with two sub-layers is proposed: three fuzzy classifiers form the first layer and perform the pre-classification task, each using one of the feature sets, and a multilayer perceptron neural network serves as the final classifier. The performance of the algorithm is evaluated on two databases, the MIT-BIH arrhythmia database and an endocardial electrogram database. A comparative assessment of the proposed classifier against the individual fuzzy classifiers shows that the algorithm improves the overall accuracy of atrial arrhythmia classification. Implementing this algorithm in implantable devices may provide accurate detection of atrial arrhythmias.
1 Introduction

Cardiac arrhythmias are alterations of cardiac rhythm that disrupt the normal synchronized contraction sequence of the heart and reduce pumping efficiency. Among them, atrial fibrillation (AF) is the most common arrhythmia and is associated with considerable morbidity and mortality [1]. Recently, automatic external defibrillators introduced for home use, as well as automatic implantable devices for treating atrial arrhythmias, have become more sophisticated in their ability to deliver several modes of therapy, such as antitachycardia pacing and defibrillation, depending on the specific rhythm. If a false positive (FP) occurs, for example when normal sinus rhythm is misinterpreted as AF, an unnecessary shock will be delivered, which can damage the heart and inconvenience the patient. It is therefore critical to accurately detect tachycardias that can potentially be terminated by pacing [2]. Several research groups have worked on this detection problem, and a number of detection and analysis techniques have evolved in the time domain [3-5], the frequency domain [6, 7], the time-frequency domain [8], and nonlinear dynamics and chaos theory [9]. However, most of these methods are based on a single feature,

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 284–292, 2007. © Springer-Verlag Berlin Heidelberg 2007
in which only one parameter is extracted to depict the signal; the feature is then compared directly with a threshold chosen to discriminate between arrhythmias. This may lead to a higher error rate. Other multi-feature algorithms improve the classification accuracy only to a limited extent [10], since their features are usually extracted from just one aspect of the signal. To overcome these problems, data fusion models are introduced, because they can exploit information from different sources [11]. In this study, a novel method which fuses different feature sets is proposed for atrial arrhythmia detection. Three feature sets are first extracted, based on the frequency domain, the time-frequency domain and the symbolic dynamics of the signals respectively. Then three parallel fuzzy clustering classifiers perform the pre-classification task, each using one of the feature sets as input. Finally, a multilayer perceptron (MLP) neural network combines the three parallel fuzzy classifiers to make the final decision.
2 Data Acquisition

Two databases of electrogram signals are studied in this paper: the MIT-BIH arrhythmia database and a canine endocardial database. From the MIT-BIH arrhythmia database, sinus rhythm (SR), atrial flutter (AFL), and atrial fibrillation (AF) recordings are selected, digitized at a sampling frequency of 360 Hz. The canine endocardial electrograms are obtained with an 8×8 electrode array (2 mm inter-electrode distance) sewn onto the atrial surface of six dogs. During SR, AFL, and AF, 20-second simultaneous recordings from each dog are digitized at a sampling frequency of 2000 Hz with 16-bit resolution.
All data are split into 2-second segments for the analysis. As an example, segments of SR, AFL, and AF signals from the MIT-BIH database are shown in Figure 1. The MIT-BIH database yields 150 segments each of SR, AFL, and AF, and the canine database yields 300 segments of each.
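The 2-second segmentation described above can be sketched as follows (an illustrative helper, not the authors' code; the function name and the discard-the-remainder policy are assumptions):

```python
def split_segments(signal, fs, seg_seconds=2.0):
    """Split a 1-D sampled signal into non-overlapping fixed-length segments.

    Trailing samples that do not fill a whole segment are discarded.
    """
    seg_len = int(fs * seg_seconds)
    n_segs = len(signal) // seg_len
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]

# A 20 s canine recording sampled at 2000 Hz yields ten 2 s segments
# of 4000 samples each:
recording = [0.0] * (20 * 2000)
segments = split_segments(recording, fs=2000)
```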
Fig. 1. A segment of AFL, SR, and AF signals in the MIT-BIH database
3 Feature Extraction

Most previous methods focus on a single feature of the electrogram signals, resulting in a low accuracy. In this study, three sets of features are extracted, based on the frequency domain, the time-frequency domain and symbolic dynamics respectively. These three feature sets serve as the input vectors for the three parallel fuzzy clustering classifiers.

3.1 Frequency-Domain Features

The first set of features comprises the coefficients of a 5th-order autoregressive (AR) model of the signal, which reflect the frequency-domain information of the signal.

3.2 Time-Frequency Domain Features

The second set of features is obtained from the time-frequency domain of the signal after the wavelet transformation. First, signals are transformed into the time-frequency domain using the wavelet decomposition on scales a = 1~5 with Daubechies-4 as the basic wavelet function, yielding the wavelet coefficient matrix of the signal. Since singular values are an inherent property of a matrix, the singular values of the wavelet coefficient matrix are taken as the features of the signal.
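The two feature sets above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the AR model is fitted here by ordinary least squares, and the wavelet coefficient matrix (Daubechies-4, scales 1 to 5) is assumed to come from a wavelet library such as PyWavelets, so a stand-in matrix is used in the test below:

```python
import numpy as np

def ar_coefficients(x, order=5):
    """Least-squares fit of an AR(p) model x[n] ~ sum_k a_k * x[n-k].

    The p fitted coefficients form the frequency-domain feature set.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Row n of the design matrix holds the p previous samples of x[n].
    X = np.array([[x[n - k] for k in range(1, order + 1)]
                  for n in range(order, N)])
    y = x[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def singular_value_features(coeff_matrix):
    """Singular values of a (scales x time) wavelet-coefficient matrix.

    The singular values are an inherent property of the matrix and serve
    as a compact time-frequency feature vector.
    """
    return np.linalg.svd(np.asarray(coeff_matrix, dtype=float),
                         compute_uv=False)
```

With a 5-scale decomposition the coefficient matrix has 5 rows, so the second feature set is a 5-element singular-value vector.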
3.3 Symbolic Dynamics Features

The traditional techniques of data analysis in the time and frequency domains are often not sufficient to characterize the complex dynamics of the electrocardiogram (ECG). In this study, symbolic dynamics are used to analyze the nonlinear dynamics of the ECG. The concept of symbolic dynamics is based on eliminating detailed information in order to keep the robust properties of the dynamics through a coarse-graining of the measurements [12]. The time series is transformed into a symbol sequence S_n over the alphabet Ω = {0, 1, 2} with Equation (1). Figure 2 presents examples of the transformation. The transformation is based on the mean value μ of each analyzed time series and on a non-dimensional parameter α that sets the ranges over which the symbols are defined:

        0  if b_n > (1 + α/2)μ,
S_n =   1  if (1 − α/2)μ < b_n ≤ (1 + α/2)μ,        (1)
        2  if b_n ≤ (1 − α/2)μ.

Here n = 1, 2, …, N, where N is the number of samples of the signal and b_n are the values of the time series.
In order to characterize the symbol strings obtained by transforming the time series into S_n, the probability distribution of words of length l = 3 is analyzed. Each word consists of three symbols, giving a total of 3^l = 27 possible word types; the number of overlapping symbols in consecutive words is one. The occurrence probability of each word type forms the third feature set.
Fig. 2. Description of the basic principle of symbolic dynamics, the symbol extraction from a time series and the construction of words
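The symbolization of Equation (1) and the word-distribution feature can be sketched as follows (an illustrative reading; the window step of length − overlap positions follows the one-overlapping-symbol description above, and a positive-valued series is assumed):

```python
from collections import Counter
from itertools import product

def symbolize(series, alpha=0.1):
    """Map a time series to symbols {0, 1, 2} around its mean, per Eq. (1)."""
    mu = sum(series) / len(series)
    hi, lo = (1 + alpha / 2) * mu, (1 - alpha / 2) * mu
    return [0 if b > hi else (1 if b > lo else 2) for b in series]

def word_distribution(symbols, length=3, overlap=1):
    """Occurrence probability of each of the 3**length word types.

    Consecutive words share `overlap` symbols, i.e. the window advances
    by length - overlap positions.
    """
    step = length - overlap
    words = [tuple(symbols[i:i + length])
             for i in range(0, len(symbols) - length + 1, step)]
    counts = Counter(words)
    total = len(words)
    return {w: counts[w] / total for w in product(range(3), repeat=length)}

symbols = symbolize([1.2, 1.0, 0.8, 1.0], alpha=0.1)   # -> [0, 1, 2, 1]
```

The 27 word probabilities form the third feature vector fed to the corresponding fuzzy classifier.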
4 Multi-parallel Fuzzy Clustering Classifiers

After feature extraction, three sets of features based on the frequency domain, the time-frequency domain and symbolic dynamics are obtained. They differ significantly in what they represent, which makes it difficult to accommodate them in a single classifier. Furthermore, the three feature sets would give a single classifier a high-dimensional input vector, increasing the computational complexity and possibly harming accuracy. Additionally, appropriately scaling the three sets of features could be a difficult task in itself. To overcome these problems, multiple classifiers based on the different feature sets are used; their outputs have similar properties (e.g., confidence values) which can be combined with relative ease. Here, three parallel fuzzy clustering classifiers are used, one per feature set, and each outputs a membership value for each class. Suppose there are N classifiers C_1, …, C_N and M classes S_1, …, S_M. For each feature set, the mean feature vector c_i = [c_i1, c_i2, …, c_in] of each class, estimated from the training data, is taken as the class center, 1 ≤ i ≤ M, where n is the dimensionality of the feature vector. x_j = [x_j1, x_j2, …, x_jn] represents a feature vector of the testing data set, 1 ≤ j ≤ p, where p is the number of testing samples. U_k ∈ R^{M×p} denotes the membership matrix of the k-th fuzzy clustering classifier; the element μ_ij of the matrix represents the membership value of
the feature vector x_j in the i-th class, as decided by the k-th classifier. μ_ij is calculated as follows:
μ_ij = (1 / ‖x_j − c_i‖²)^(1/(b−1)) / Σ_{k=1}^{M} (1 / ‖x_j − c_k‖²)^(1/(b−1)),   i = 1, 2, …, M,  j = 1, 2, …, P.   (2)
‖x_j − c_i‖ is the distance between the feature vector x_j and the center vector c_i of class i, and b is a parameter that controls the degree of fuzziness; here b = 2. The fuzzy clustering classifiers pre-classify the input vector x_j into all classes with different membership values, and μ_ij gives the degree of confidence that a fuzzy classifier associates with the proposition x_j ∈ S_i. So for each input feature vector, the output of each of the N classifiers can be completely represented by an M-dimensional vector V_i = (v_i1, v_i2, …, v_iM), 1 ≤ i ≤ N, where each component v_ij is a label associated with class S_j given by classifier C_i. After fuzzy clustering, the location of the input vector x_j in the feature space is characterized more precisely and is easier to interpret.
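Equation (2) with b = 2 can be sketched as follows (an illustrative implementation; the handling of a test vector that coincides exactly with a class center is an assumption):

```python
import numpy as np

def fuzzy_memberships(x, centers, b=2.0):
    """Membership of feature vector x in each class, per Eq. (2).

    centers: (M, n) array of per-class mean feature vectors estimated
    from training data. The memberships sum to 1 over the M classes.
    """
    d2 = np.sum((np.asarray(centers, dtype=float)
                 - np.asarray(x, dtype=float)) ** 2, axis=1)
    if np.any(d2 == 0):                 # x coincides with a center
        mu = (d2 == 0).astype(float)
        return mu / mu.sum()
    w = (1.0 / d2) ** (1.0 / (b - 1.0))
    return w / w.sum()
```

With b = 2 the exponent is 1, so the memberships are simply inversely proportional to the squared distances to the class centers.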
5 Classifier Combination Using MLP

The outputs of the individual fuzzy classifiers are not redundant; they can be combined into a multi-classifier decision that takes advantage of the strengths of the individual classifiers and diminishes their weaknesses on the same problem.
Fig. 3. The structure of the classifier using MLP neural network to combine classifiers
Here, a classifier combination method based on the MLP neural network is proposed. The MLP has no hidden layer, and the membership values V_1, …, V_N from the three parallel fuzzy clustering classifiers form its input vector; here N = 3. Z = (z_1, z_2, …, z_M) is its output, which is responsible for the final classification of atrial arrhythmias. Each z_i is a confidence value positively associated with the decision on class C_i: the higher the value of z_i, the higher the associated degree of confidence. The whole structure of the classifier is shown in Figure 3. The network is trained by back-propagation, minimizing the mean square error (MSE); the activation function is the sigmoid. The advantage of such a network is that each weight has an apparent meaning in terms of the role each classifier plays in the combination: the weight ω_ijk is the contribution to class S_k when classifier C_i assigns membership v_ij to class S_j. After the training procedure of the whole network with the training data is finished, the cluster centers of the parallel fuzzy clustering layer as well as the weights of the neural network are frozen and ready for use in the retrieval mode. For each input signal, the three sets of features are first extracted as the inputs of the three parallel fuzzy clustering classifiers, which generate the membership values. The membership value vector activates the MLP network, and the output of the network indicates the final membership of the input signal in the appropriate class of atrial arrhythmias. The signal is assigned to the class with the largest membership value.
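The hidden-layer-free combiner described above can be sketched as a single sigmoid layer trained by gradient descent on the MSE (an illustrative sketch, not the authors' code; the initialization and learning rate are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class CombinerMLP:
    """Single sigmoid layer (no hidden layer) that fuses the membership
    vectors V1..VN of the parallel fuzzy classifiers into confidences Z."""

    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, v):
        return sigmoid(v @ self.W + self.b)

    def train_step(self, v, target, lr=0.5):
        """One gradient-descent step on the mean square error."""
        z = self.forward(v)
        err = z - target
        grad = err * z * (1.0 - z)      # chain rule through the sigmoid
        self.W -= lr * np.outer(v, grad)
        self.b -= lr * grad
        return float(np.mean(err ** 2))
```

With N = 3 classifiers and M = 3 classes the input has 9 components, and the predicted arrhythmia is the class with the largest output z_i.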
In this paper, all analysis is performed on a PC with a Pentium IV 2.80 GHz CPU and 504 MB RAM using Matlab 7.1.
6 Experimental Results

One hundred episodes from each rhythm class in the MIT-BIH database are randomly selected as the initial training data of the algorithm, and the rest serve as testing data. For the canine database, 200 segments per class are used for training and 100 for testing. The sensitivity (SE), specificity (SP), and accuracy (AC) of the method for arrhythmia classification are evaluated on both databases. Each individual fuzzy clustering classifier is also used to classify the signals, and its results are compared with those obtained by the MLP that combines the classifiers. Tables 1-8 show the experimental results. In each table, the SR, AF, and AFL columns give the number of segments classified into each class.

Table 1. Performance of fuzzy clustering classifier based on frequency-domain features with MIT-BIH database

Actual type    SR    AF   AFL   Total   SE (%)   SP (%)   AC (%)
SR             30     1    19      50     60.0     34.0     42.7
AF             28    19     3      50     38.0     99.0     78.7
AFL            38     0    12      50     24.0     78.0     60.0

Table 2. Performance of fuzzy clustering classifier based on time-frequency domain features with MIT-BIH database

Actual type    SR    AF   AFL   Total   SE (%)   SP (%)   AC (%)
SR             35     0    15      50     70.0     92.0     84.7
AF              1     9    40      50     18.0     96.0     70.0
AFL             7     4    39      50     78.0     45.0     56.0

Table 3. Performance of fuzzy clustering classifier based on symbolic dynamics features with MIT-BIH database

Actual type    SR    AF   AFL   Total   SE (%)   SP (%)   AC (%)
SR             38     1    11      50     76.0     89.0     84.7
AF              7    41     2      50     82.0     93.0     89.3
AFL             4     6    40      50     80.0     87.0     84.7

Table 4. Performance of MLP combining classifiers with MIT-BIH database

Actual type    SR    AF   AFL   Total   SE (%)   SP (%)   AC (%)
SR             48     1     1      50     96.0     99.0     98.0
AF              0    50     0      50    100.0     97.0     98.0
AFL             1     2    47      50     94.0     99.0     97.3

Table 5. Performance of fuzzy clustering classifier based on frequency-domain features with canine database

Actual type    SR    AF   AFL   Total   SE (%)   SP (%)   AC (%)
SR             83     6    11     100     83.0     67.5     72.7
AF             34    14    52     100     14.0     91.0     65.3
AFL            31    12    57     100     57.0     68.5     64.7

Table 6. Performance of fuzzy clustering classifier based on time-frequency domain features with canine database

Actual type    SR    AF   AFL   Total   SE (%)   SP (%)   AC (%)
SR             66     4    30     100     66.0     77.5     73.7
AF             42    39    19     100     39.0     97.0     77.7
AFL             3     2    95     100     95.0     75.5     82.0

Table 7. Performance of fuzzy clustering classifier based on symbolic dynamics features with canine database

Actual type    SR    AF   AFL   Total   SE (%)   SP (%)   AC (%)
SR             78     6    16     100     78.0     65.5     69.7
AF             63    22    15     100     22.0     89.0     66.7
AFL             6    16    78     100     78.0     84.5     82.3

Table 8. Performance of MLP combining classifiers with canine database

Actual type    SR    AF   AFL   Total   SE (%)   SP (%)   AC (%)
SR             96     3     1     100     96.0     99.0     98.0
AF              2    97     1     100     97.0     98.5     98.0
AFL             0     0   100     100    100.0     99.0     99.3
As shown in Tables 1-8, the performance of each individual fuzzy clustering classifier demonstrates that the three sets of features are complementary. A comparative assessment of the proposed method against the individual fuzzy clustering classifiers shows that more reliable results are obtained with the MLP neural network that combines the classifiers for the classification of atrial arrhythmias.
7 Conclusion

The new algorithm for atrial arrhythmia classification applies three sets of features, based on the frequency domain, the time-frequency domain, and symbolic dynamics respectively, to characterize the signals. This paper therefore focuses on ways in which the information from different features can be combined to improve the classification accuracy; here, an MLP neural network is used to combine the classifiers. The algorithm is composed of two layers connected in cascade. The three parallel fuzzy clustering classifiers form the first layer, which uses the three feature sets respectively and performs the pre-classification task. The MLP neural network that combines the former classifiers forms the second layer and makes the final decision on the ECG signals. The fuzzy clustering layer first analyzes the distribution of the data and groups them into classes with different membership values. The neural network takes these membership values as its input vector and assigns the atrial arrhythmia to the appropriate class. This technique incorporates the fuzzy clustering method with back-propagation learning and combines their advantages. The two databases used for evaluation include not only ECG signals obtained with the standard 12-lead configuration on the human body surface but also endocardial electrograms obtained from the canine atrial surface, demonstrating the generalizability of the method in distinguishing various atrial arrhythmias across different types of databases. The algorithm can therefore provide accurate detection of atrial arrhythmias and can be implemented not only in automatic external defibrillators but also in automatic implantable devices.
Acknowledgement

This work was supported by the National Basic Research Program of China under Grant 2005CB724303, the Natural Science Foundation of China under Grant 30570488, and the Shanghai Science and Technology Plan, China under Grant 054119612.
References

1. Chugh, S.S., Blackshear, J.L., Shen, W.K., Stephen, C.H., Bernard, J.G.: Epidemiology and Natural History of Atrial Fibrillation: Clinical Implications. Journal of the American College of Cardiology 37 (2) (2001) 371-377
2. Wellens, H.J., Lau, C.P., Luderitz, B., Akhtar, M., Waldo, A.L., Camm, A.J., Timmermans, C., Tse, H.F., Jung, W., Jordaens, L., Ayers, G.: Atrioverter: An Implantable Device for the Treatment of Atrial Fibrillation. Circulation 98 (16) (1998) 1651-1656
3. Sih, H.J., Zipes, D.P., Berbari, E.J., Olgin, J.E.: A High-temporal Resolution Algorithm for Quantifying Organization During Atrial Fibrillation. IEEE Transactions on Biomedical Engineering 46 (4) (1999) 440-450
4. Narayan, S.M., Valmik, B.: Temporal and Spatial Phase Analyses of the Electrocardiogram Stratify Intra-atrial and Intra-ventricular Organization. IEEE Transactions on Biomedical Engineering 51 (10) (2004) 1749-1764
5. Faes, L., Nollo, G., Antolini, R.: A Method for Quantifying Atrial Fibrillation Organization Based on Wave-morphology Similarity. IEEE Transactions on Biomedical Engineering 49 (12) (2002) 1504-1513
6. Khadra, L., Al-Fahoum, A.S., Binajjaj, S.: A Quantitative Analysis Approach for Cardiac Arrhythmia Classification Using Higher Order Spectral Techniques. IEEE Transactions on Biomedical Engineering 52 (11) (2005) 1840-1845
7. Everett, T.H., Kok, L.C., Vaughn, R.H., Moorman, J.R., Haines, D.E.: Frequency Domain Algorithm for Quantifying Atrial Fibrillation Organization to Increase Defibrillation Efficacy. IEEE Transactions on Biomedical Engineering 48 (9) (2001) 969-978
8. Stridh, M., Sornmo, L., Meurling, C.J., Olsson, S.B.: Characterization of Atrial Fibrillation Using the Surface ECG: Time-dependent Spectral Properties. IEEE Transactions on Biomedical Engineering 48 (1) (2001) 19-27
9. Zhang, X.S., Zhu, Y.S., Thakor, N.V.: Detecting Ventricular Tachycardia and Fibrillation by Complex Measure. IEEE Transactions on Biomedical Engineering 46 (5) (1999) 548-555
10. Xu, W.C., Tse, H.F., Chan, F.H.Y., Fung, P.C.W., Lee, K.L.F., Lau, C.P.: New Bayesian Discriminator for Detection of Atrial Tachyarrhythmias. Circulation 105 (12) (2002) 1472-1479
11. Gupta, L., Chung, B., Srinath, M.D., Molfese, D.L., Kook, H.: Multichannel Fusion Models for the Parametric Classification of Differential Brain Activity. IEEE Transactions on Biomedical Engineering 52 (11) (2005) 1869-1881
12. Baumert, M., Walther, T., Hopfe, J., Stepan, H., Faber, R., Voss, A.: Joint Symbolic Dynamic Analysis of Beat-to-beat Interactions of Heart Rate and Systolic Blood Pressure in Normal Pregnancy. Medical & Biological Engineering and Computing 40 (2002) 241-245
A Neural-Fuzzy Pattern Recognition Algorithm Based Cutting Tool Condition Monitoring Procedure

Pan Fu¹ and A.D. Hope²

¹ Mechanical Engineering Faculty, Southwest JiaoTong University, Chengdu 610031, China
[email protected]
² Systems Engineering Faculty, Southampton Institute, Southampton SO14 OYN, U.K.
[email protected]

Abstract. An intelligent tool wear monitoring system for the metal cutting process is introduced in this paper. The system is equipped with four kinds of sensors, signal transforming and collecting apparatus, and a microcomputer. A knowledge-based intelligent pattern recognition algorithm has been developed: a fuzzy-driven neural network carries out the integration and fusion of multi-sensor information. The weighted approaching degree measures the differences between signal features accurately, and ANNs successfully recognize the tool wear states. The algorithm has strong learning and noise-suppression abilities, leading to successful tool wear classification under a range of machining conditions.
1 Introduction

Modern advanced machining systems in the "unmanned" factory must possess the ability to automatically change tools that have been subjected to wear or damage. This ensures machining accuracy and reduces production costs. Coupling various transducers with intelligent data processing techniques to deliver improved information on tool condition makes optimization and control of the machining process possible. Many tool wear sensing methods have been suggested, but only some of them are suitable for industrial application. The research of Lin, S.C. and Yang, R.J. [1] showed that both the normal cutting force coefficient and the friction coefficient can be represented as functions of tool wear. An approach was developed for in-process monitoring of tool wear in milling using frequency signatures of the cutting force [2]. An analytical method was developed for the use of three mutually perpendicular components of the cutting forces and vibration signature measurements [3]; the ensuing analyses in the time and frequency domains showed some components of the measured signals to correlate well with the accrued tool wear. A tool condition monitoring system was then established for cutting-tool-state classification [4]; the investigation concentrated on tool-state classification using a single wear indicator before progressing to two wear indicators. In another study, the input features were

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 293–300, 2007. © Springer-Verlag Berlin Heidelberg 2007
derived from measurements of acoustic emission during machining and from the topography of the machined surfaces [5]. Li, X. et al. showed that the frequency distribution of vibration changes as the tool wears, so the r.m.s. of the different measured frequency bands indicates the tool wear condition [6]. Tool breakage and wear conditions were monitored in real time from the measured spindle and feed motor currents, respectively [7]; models of the relationships between the current signals and the cutting parameters were established under different tool wear states. Many kinds of advanced sensor fusion and intelligent data processing techniques have been used to monitor tool condition. A new on-line fuzzy neural network (FNN) model with four parts was developed [8], with the functions of classifying tool wear using fuzzy logic, normalizing the inputs, and using a modified least-squares back-propagation neural network to estimate flank and crater wear. Parameters including forces, AE-rms, skew and kurtosis of force bands, as well as the total energy of forces, were employed as inputs. A new approach for online and indirect tool wear estimation in turning using neural networks was developed [9]. This technique uses a physical process model describing the influence of cutting conditions (such as the feed rate) on measured process parameters (here: cutting force signals) in order to separate signal changes caused by variable cutting conditions from signal changes caused by tool wear. Two methods using hidden Markov models, as well as several other methods that directly use force and power data, were used to establish the health of a drilling tool [10]. In order to increase the reliability of these methods, a decision fusion center algorithm (DFCA) was proposed which combines the outputs of the individual methods to make a global decision about the wear status of the drill.
Experimental results demonstrated the high effectiveness of the proposed monitoring methods and the DFCA. In this study, a unique neural-fuzzy pattern recognition algorithm was developed to accomplish multi-sensor information integration and tool wear state classification. It combines the strong interpretation power of fuzzy systems and the adaptation and structuring abilities of neural networks. The monitoring system that has been developed provided accurate and reliable tool wear classification results over a range of cutting conditions.
2 The Tool Condition Monitoring System

As shown in Fig. 1, the tool wear monitoring system is composed of four kinds of sensors, signal amplifying and collecting devices, and a microcomputer. Part of the condition monitoring experiments were carried out at the Advanced Manufacturing Lab of Southampton Institute, U.K., on a Cincinnati Milacron Sabre 500 (ERT) Vertical Machining Centre with computer numerical control. Sensors were installed around the workpiece being machined, and four kinds of signals were collected to reflect the tool wear state comprehensively. Tool condition monitoring is a pattern recognition process in which the characteristics of the tool to be monitored are compared with those of standard models. The process is composed of the following parts: determination of the membership functions of the signal features, calculation of fuzzy distances, learning, and tool wear classification.
[Figure 1: block diagram of the monitoring system. Components: AE sensor with 2101PA preamplifier, 2102 analogue module and ADC 200 digital oscilloscope; KISTLER 9257B dynamometer with KISTLER 5807A charge amplifier and EX205 extension board; accelerometer with charge amplifier; current sensor with low-pass filter; PC 226 A/D board and main computer.]

Fig. 1. The tool condition monitoring system
3 Feature Extraction

Features are extracted from the time domain and the frequency domain. Only those features that are really relevant to the tool wear state are eventually selected for the further pattern recognition, as follows: for the power consumption signal, the mean value; for the AE-RMS signal, the mean value, skew and kurtosis; for the cutting force, AE and vibration signals, the mean value, the standard deviation and the mean power in 10 frequency ranges. As an example, Figure 2 shows several features in the time and frequency domains (under cutting condition 1*). It can be seen that both the amplitude and the distribution pattern of these features change in a certain pattern along with the development of tool flank wear (VB).

[Figure 2 panels, each plotted for VB = 0 to 0.5 mm: (a) mean value of the power consumption signal (kW); (b) standard deviation of the vibration signal (g); (c) spectra of the AE signal (μbar, 40-400 kHz); (d) spectra of the cutting force (Fx) signal (N, 200-2000 Hz).]

Fig. 2. Some sensor signal features
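The time-domain part of the feature list above (mean, standard deviation, skew, kurtosis) and the band-power features can be sketched as follows (illustrative formulas, not the authors' code; the equal-width band split and the non-excess kurtosis convention are assumptions):

```python
import math
import numpy as np

def time_domain_features(x):
    """Mean, standard deviation, skew and kurtosis of one signal segment."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    skew = sum((v - mean) ** 3 for v in x) / (n * std ** 3)
    kurt = sum((v - mean) ** 4 for v in x) / (n * std ** 4)
    return mean, std, skew, kurt

def band_mean_power(x, n_bands=10):
    """Mean power of the magnitude spectrum in n_bands equal-width bands
    between 0 and the Nyquist frequency."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    return [float(band.mean()) for band in np.array_split(spec, n_bands)]
```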
4 The Similarity of Fuzzy Sets

The fuzzy approaching degree and the fuzzy distance can be used as quantitative indexes of the similarity of two fuzzy sets A and B. The features of the sensor signals of the tool condition monitoring system reflect the tool wear states. For the standard models (cutting tools with standard flank wear values), the j-th feature of the i-th model can be considered as a fuzzy set A_ij. Theoretical analysis and experimental results show that these features can be regarded as modified normal-distribution fuzzy sets.

4.1 Approaching Degree

Assume that F(X) is the fuzzy power set of a universal set X and that the map N: F(X) × F(X) → [0, 1] satisfies:

(a) ∀A ∈ F(X), N(A, A) = 1;
(b) ∀A, B ∈ F(X), N(A, B) = N(B, A);
(c) if A, B, C ∈ F(X) satisfy |A(x) − C(x)| ≥ |A(x) − B(x)| (∀x ∈ X), then N(A, C) ≤ N(A, B).

Then the map N is an approaching degree on F(X), and N(A, B) is called the approaching degree of A and B. It can be calculated by different methods; here the inner and outer products are used. Assume A, B ∈ F(X). Then

A • B = ∨{A(x) ∧ B(x) : x ∈ X}

is defined as the inner product of A and B, and

A ⊕ B = ∧{A(x) ∨ B(x) : x ∈ X}

is defined as the outer product of A and B. Finally, the approaching degree of A and B is taken as

N(A, B) = (A • B) ∧ (A ⊕ B)^c.   (1)
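For discrete membership vectors, Equation (1) can be sketched directly (an illustrative implementation of the inner/outer-product form):

```python
def approaching_degree(A, B):
    """N(A, B) = (A • B) ∧ (A ⊕ B)^c for discrete membership vectors.

    Inner product A • B: max of the pointwise min; outer product A ⊕ B:
    min of the pointwise max; ^c is the complement 1 - λ; ∧ is min.
    """
    inner = max(min(a, b) for a, b in zip(A, B))
    outer = min(max(a, b) for a, b in zip(A, B))
    return min(inner, 1.0 - outer)

# For a normal fuzzy set (sup A = 1, inf A = 0), N(A, A) = 1, matching
# property (a):
A = [0.0, 0.5, 1.0, 0.5, 0.0]
```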
4.2 Fuzzy Distance
If A ∈ F(X), when X = {x_1, x_2, ..., x_n} the membership values of A, (A(x_1), A(x_2), ..., A(x_n)), can be viewed as a point in n-dimensional Euclidean space, so the distance between two fuzzy sets can be defined like the distance in Euclidean space; when X = [a, b] and A(x) is a bounded function on [a, b], the distance between two fuzzy sets can be defined as follows. Suppose M_p: F(X) × F(X) → [0, +∞) (p is a positive real number), ∀(A, B) ∈ F(X) × F(X). When X = {x_1, x_2, ..., x_n},

M_p(A, B) = [ Σ_{i=1}^{n} |A(x_i) − B(x_i)|^p ]^{1/p} .   (2)

When X = [a, b],

M_p(A, B) = [ ∫_a^b |A(x) − B(x)|^p dx ]^{1/p} .   (3)
M_p is a fuzzy distance on F(X), and M_p(A, B) is the fuzzy distance between the fuzzy sets A and B. In general, p can take the value of 1.

4.3 Two Dimensional Weighted Approaching Degree

In the conventional fuzzy pattern recognition process, the approaching degree or fuzzy distance between corresponding features of the object to be recognized and the different models is first calculated. Combining these results determines the fuzzy similarity between the object and the different models: the object is classified to the model that has the highest approaching degree or shortest fuzzy distance with it. This process can be further improved by developing a method that assigns suitable weights to different features to reflect their specific influences in the pattern recognition process. The two fuzzy similarity measures can also be combined to describe the closeness of fuzzy sets more comprehensively. Approaching degree and fuzzy distance reflect the closeness of two fuzzy sets from different angles: for two intersecting membership functions, the approaching degree reflects the geometric position of the intersecting point, while the fuzzy distance shows the area of the intersecting space. The approaching degree and fuzzy distance of different sensor signal features also have varying importance in the practical pattern recognition process. In this study, artificial neural networks (ANNs) are employed to integrate approaching degree and fuzzy distance and assign them suitable weights, providing a two dimensional weighted approaching degree. This makes more accurate and reliable tool wear classification possible.
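A minimal sketch of the fuzzy distance of Eq. (2) on a finite universe, with p = 1 as suggested above; the membership vectors are illustrative.

```python
import numpy as np

# Minkowski-type fuzzy distance of Eq. (2) on a finite universe;
# p = 1 as suggested in the text. The membership vectors are illustrative.
def fuzzy_distance(a: np.ndarray, b: np.ndarray, p: float = 1.0) -> float:
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.1, 0.5, 0.9, 0.5, 0.1])   # feature membership of the object
b = np.array([0.0, 0.4, 1.0, 0.6, 0.2])   # feature membership of a model
print(fuzzy_distance(a, b))               # sum of absolute differences for p = 1
```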
5 Fuzzy Driven Neural Network

ANNs have the ability to classify inputs. The weights between neurons are adjusted automatically in the learning process to minimize the difference between the desired and actual outputs, and ANNs can continuously classify and also update classifications. In this study, an ANN is connected with the fuzzy logic system to establish a fuzzy driven neural network pattern recognition algorithm. Its principle is shown in the following
figure. Here a back propagation ANN is used to carry out multi-sensor information integration and tool wear classification. The approaching degree and fuzzy distance calculation results are the inputs of the ANN. The associated weights are updated as w_i(new) = w_i(old) + αδx_i, where α, δ and x_i are the learning constant, the associated error measure and the input to the i-th neuron, respectively. In this updating process, the ANN learns the patterns of the features corresponding to each tool wear state, so that in the practical machining process a feature pattern can be accurately classified to that of one of the models. In effect, the ANN combines approaching degree and fuzzy distance, assigns each feature a proper synthesized weight, and outputs two dimensional weighted approaching degrees. This makes the classification process more reliable.

[Figure: force, load, AE and vibration signals undergo time and frequency domain feature extraction, followed by fuzzy membership function calculation and fuzzy distance and fuzzy approaching degree calculation; the results form the training, test and inquiry inputs of the ANN (with encoder and decoder stages), whose inquiry output classifies the tool as new, normal or worn.]
Fig. 3. The fuzzy driven neural networks
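The weight update rule quoted above, w_i(new) = w_i(old) + αδx_i, can be sketched as follows; all numeric values are illustrative.

```python
import numpy as np

# Delta-rule update w_i(new) = w_i(old) + α·δ·x_i used by the back-propagation
# ANN of the text. α is the learning constant, δ the associated error measure,
# and x_i the input to the i-th neuron; all values here are illustrative.
def update_weights(w: np.ndarray, x: np.ndarray,
                   alpha: float, delta: float) -> np.ndarray:
    return w + alpha * delta * x

w = np.array([0.20, -0.10, 0.40])
x = np.array([1.0, 0.5, -1.0])     # e.g. approaching degrees / fuzzy distances
w_new = update_weights(w, x, alpha=0.1, delta=0.3)
print(w_new)                       # ≈ [0.23, -0.085, 0.37]
```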
6 Tool Wear State Classification

In the practical tool condition monitoring process, the tool with unknown wear value is the object, and it will be recognized as a "new tool", "normal tool" or "worn tool". The membership functions of all the features of the object are determined first. The approaching degree and fuzzy distance of the corresponding features of the standard models and the object to be recognized are then calculated and become the inquiry input of the ANN. A pre-trained ANN is then chosen to calculate the two dimensional weighted approaching degree. Finally, the tool wear state is classified to the model that has the highest weighted approaching degree with the tool being monitored. In a verifying experiment, fifteen tools with unknown flank wear values were used in milling operations. Fig. 4 shows the classification results under cutting condition
1*. It can be seen that all the tools were classified correctly with a confidence higher than 80%. Experiments under other cutting conditions showed similar results.
[Figure: bar chart of classification confidence (%) versus tool wear value for the fifteen test tools, grouped as new, normal and worn tools.]

Fig. 4. Tool wear states classification results
7 Conclusions

An intelligent tool condition monitoring system has been established, in which tool wear classification is realized by applying a fuzzy driven neural network based pattern recognition algorithm. On the basis of this investigation, the following conclusions can be made.
(1) Power consumption, vibration, AE and cutting force sensors can provide signals that describe tool condition comprehensively.
(2) Many features extracted from the time and frequency domains were found to be relevant to the changes of tool wear state. This makes accurate and reliable pattern recognition possible.
(3) The combination of the ANN and the fuzzy logic system integrates the strong learning and classification ability of the former and the superb flexibility of the latter in expressing the distribution characteristics of signal features with vague boundaries. This methodology indirectly solves the weight assignment problem of the conventional fuzzy pattern recognition system, giving it greater representative power, higher training speed and more robustness.
(4) The introduction of the two dimensional weighted approaching degree makes the pattern recognition process more reliable. The fuzzy driven neural network effectively fuses multi-sensor information and successfully recognizes the tool wear states.
(5) Armed with this advanced pattern recognition methodology, the established intelligent tool condition monitoring system is suitable for different machining conditions, robust to noise and tolerant to faults.
(6) Future work should focus on developing data processing methods that produce feature vectors describing tool condition more accurately, improving the fuzzy distance calculation methods and optimizing the ANN structure.

* Cutting condition 1 (for milling operation): cutting speed - 600 rev/min, feed rate - 1 mm/rev, cutting depth - 0.6 mm, workpiece material - EN1A, cutting inserts - Stellram SDHT1204 AE TN-42.
Research on Customer Classification in E-Supermarket by Using Modified Fuzzy Neural Networks Yu-An Tan1, Zuo Wang1, and Qi Luo2 1 Department of Computer Science and Engineering, Beijing Institute of Technology, 100081 Beijing, China
[email protected],
[email protected] 2 Department of Information & Technology, Central China Normal University, 430079, Wuhan, China
[email protected]

Abstract. With the development of network technology and E-commerce, more and more enterprises have accepted the management pattern of E-commerce. In order to meet the personalized needs of customers in an E-supermarket, customer classification based on their interests is a key technology for developing personalized E-commerce. It is therefore highly desirable to have a personalized system that extracts customer features effectively and analyzes customer interests. In this paper, we propose a new method based on a modified fuzzy neural network to group customers dynamically according to their Web access patterns. The results suggest that this clustering algorithm is effective and efficient. All in all, the proposed approach is a practical solution to turn more visitors into customers, improve customer loyalty, and strengthen the cross-selling ability of websites in E-commerce.

Keywords: customer classification, E-supermarket, modified fuzzy neural networks, personalized needs, Web access.
1 Introduction

With the development of network technology and E-commerce, more and more enterprises have transferred to the management pattern of E-commerce [1]. The management pattern of E-commerce can greatly save costs in the physical environment and bring convenience to customers, and people pay more and more attention to E-commerce day by day. Therefore, more and more enterprises have set up their own E-supermarket websites to sell commodities or issue information services. However, these websites find it difficult to attract customers' active participation: only 2%-4% of visitors purchase commodities on E-supermarket websites [2]. Investigation indicates that the personalized recommendation systems for selecting and purchasing commodities are imperfect, and the validity and accuracy of the commodities they provide are low. If E-supermarket websites want to turn more visitors into customers, improve customer loyalty and strengthen the cross-selling ability of

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 301–306, 2007. © Springer-Verlag Berlin Heidelberg 2007
websites, the idea of personalized design is needed: commodities and information services should be provided according to customers' needs. The key to personalized design is how to classify customers based on their interests. In this paper, we present a system model that dynamically groups customers according to their Web access and transactional data, which record the customers' behavior on the web site, for instance purchase records, purchase dates, amounts paid, etc. The proposed system model is developed on the basis of a modified fuzzy ART neural network and involves two sequential modules: (1) trace customers' behavior on the web site and generate customer profiles; (2) classify customers according to their customer profiles using the neural network.
2 System Model

The system model, which is applied to the E-supermarket in our experiment, is characterized in Fig. 1. The idea is that customer interests can be extracted by observing customer behavior, including the transaction records, the transaction times and the product pages a customer browsed. The results of the first module are organized in a hierarchical structure and used to generate customer profiles. Finally, the customer profiles are grouped into different teams using the modified fuzzy ART neural network. The system model includes three modules: customer behavior recording, customer profile generating and customer grouping.

[Figure: transaction records, transaction times and product pages feed the customer behavior recording module; a hierarchical structure feeds customer profile generating; the modified fuzzy ART neural network performs customer grouping.]

Fig. 1. System model
(1) Customer behavior recording. Customer behavior is divided into two types: transaction records and customer operations. Customer operations comprise browsing time, frequency and so on. According to our earlier research, the visiting duration of a product page is a good way to measure customer interest; hence, in this paper, each product page whose visiting time is longer than a threshold is analyzed.
(2) Customer profile generating. A tree structure is used to represent the customer profile. Customer preferences are organized in a hierarchical structure according to customer interests, as shown in Fig. 2.
(3) Customer grouping. Customers are grouped into different teams according to their profiles by using the adaptive neural network.
[Figure: the customer profile tree; level 1 is the user preference tree root, level 2 the classes, level 3 the subclasses, continuing down to subclasses at level n.]

Fig. 2. Structure of customer profile
3 Modified Fuzzy ART Network

The Fuzzy ART network is an unsupervised neural network with ART architecture that handles both continuous-valued and binary-valued vectors [3]. It is a pure winner-takes-all architecture, able to instantiate output nodes whenever necessary and to handle both binary and analog patterns. Using a vigilance parameter as a threshold of similarity, Fuzzy ART can determine when to form a new cluster. The algorithm uses unsupervised learning and a feedback network: it accepts an input vector and classifies it into one of a number of clusters depending upon which it best resembles, and the single recognition-layer node that fires indicates the classification decision. If the input vector does not match any stored pattern, a pattern like the input vector is created as a new category. Once a stored pattern is found that matches the input vector within a specified threshold (the vigilance ρ ∈ [0,1]), that pattern is adjusted to accommodate the new input vector. The adjective "fuzzy" derives from the functions the network uses, although it is not actually fuzzy. To perform data clustering, fuzzy ART instantiates the first cluster to coincide with the first input and allocates new groups when necessary (each output node represents a cluster). In this paper, we employ a modified Fuzzy ART proposed by Cinque et al. to solve some problems of the traditional Fuzzy ART. The choice function used in the algorithm is

choice(C_j, V_i) = |C_j ∧ V_i|² / (|C_j| |V_i|) = (Σ_{r=1}^n z_r)² / (Σ_{r=1}^n c_r · Σ_{r=1}^n v_r) .   (1)

It computes the compatibility between a cluster and an input, in order to find the cluster with the greatest compatibility. The input pattern V_i is an n-element vector (transposed), and C_j is
the weight vector of cluster j (both are n-dimensional vectors). "∧" is the fuzzy set intersection operator, defined by x ∧ y = min{x, y}, so that

X ∧ Y = (x_1 ∧ y_1, ..., x_n ∧ y_n) = (z_1, z_2, ..., z_n) .   (2)

The match function is

match(C*, V_i) = |C* ∧ V_i| / |C*| = (Σ_{r=1}^n z_r) / (Σ_{r=1}^n c*_r) .   (3)
This computes the similarity between the input and the selected cluster. The match process is passed if this value is greater than or equal to the vigilance parameter ρ ∈ [0,1]. Intuitively, ρ indicates how similar the input has to be to the selected cluster for it to be associated with the customer group the cluster represents. Consequently, a greater value of ρ implies smaller clusters, while a lower value means wider clusters. The adaptation function adjusts the selected cluster:

C*_new = β (C*_old ∧ V_i) + (1 − β) C*_old ,   (4)

where the learning parameter β ∈ [0,1] weights the new and old knowledge respectively. It is worth observing that this function is non-increasing, that is, C*_new ≤ C*_old. The energy values of all leaf nodes in a customer profile form an n-element vector representing a customer pattern. Each element of the vector represents a product category; if a certain product category is not included in the customer profile, the corresponding element is set to 0. Pre-processing is required to ensure the pattern values lie in [0, 1], as expected by fuzzy ART.
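One presentation step of the modified fuzzy ART cycle defined by Eqs. (1)–(4) — choice, match test against the vigilance ρ, then adaptation or allocation of a new cluster — can be sketched as follows. The parameter values and input patterns are illustrative, and the code reduces the algorithm to its core operations.

```python
import numpy as np

# One presentation of input v to the modified fuzzy ART:
# choice (Eq. 1) -> match test against vigilance rho (Eq. 3) -> adaptation
# (Eq. 4), or allocation of a new cluster. Values are illustrative.
def present(clusters, v, rho=0.85, beta=1.0):
    best, best_score = None, -1.0
    for j, c in enumerate(clusters):
        z = np.minimum(c, v)                        # fuzzy intersection, Eq. (2)
        score = z.sum() ** 2 / (c.sum() * v.sum())  # choice function, Eq. (1)
        if score > best_score:
            best, best_score = j, score
    if best is not None:
        c = clusters[best]
        if np.minimum(c, v).sum() / c.sum() >= rho:                # match, Eq. (3)
            clusters[best] = beta * np.minimum(c, v) + (1 - beta) * c  # Eq. (4)
            return best
    clusters.append(v.astype(float).copy())         # input founds a new cluster
    return len(clusters) - 1

clusters = []
print(present(clusters, np.array([0.9, 0.1, 0.0])))  # first input -> cluster 0
print(present(clusters, np.array([0.8, 0.2, 0.1])))  # similar input -> 0
print(present(clusters, np.array([0.0, 0.1, 0.9])))  # dissimilar input -> 1
```

With fast learning (β = 1) the winning cluster shrinks toward the intersection of its weight vector and the input, which is the non-increasing behavior noted above.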
4 Experiment

On the foundation of this research, and in combination with a cooperative project on a personalized service system for communities, we constructed an E-supermarket website to provide personalized recommendation. The experiment simulated the behavior of 15 customers on the E-supermarket over a 20-day period, and they were grouped into 5 teams. The experimental web site is organized in a 4-level hierarchy that consists of 5 classes and 40 subclasses, including 4878 commodities. As performance measures, we employed the following evaluation metrics [4]:

precision = TP / (TP + FP) ,   (5)
recall = TP / (TP + FN) ,   (6)
F1 = 2 · precision · recall / (precision + recall) ,   (7)

where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
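The standard precision/recall/F1 computations behind Eqs. (5)–(7) are straightforward; the TP/FP/FN counts below are illustrative, not experimental data from the paper.

```python
# Precision, recall and F1 as in Eqs. (5)-(7); the counts are illustrative.
def prf1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = prf1(tp=80, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))   # 0.889 0.8 0.842
```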
The experimental results were compared with SOM, k-means and the traditional fuzzy ART [5]. The traditional fuzzy ART was used in the fast learning setting (with β = 1) with α set to zero; values for the vigilance parameter ρ were found by trials. In the simulation of k-means, the parameter K representing the number of clusters is set to 7. For the SOM, we used a rectangular map with two training stages: the first was made in 750 steps, with 0.91 as the learning parameter and half the map as the neighborhood, and the second in 400 steps, with 0.016 as the learning parameter and three units as the neighborhood; the map size was chosen by experiments. In the proposed system, the decaying factor λ is set to 0.93, the aging factor ψ to 0.02, β to 1, and the vigilance parameter ρ to 0.89. With the growth of the vigilance parameter, the number of clusters increases too. Fig. 3 shows the increase in the number of clusters for vigilance parameter values ranging from 0.85 to 0.95.
Fig. 3. The increase in the number of clusters with increasing vigilance parameter
Fig. 4 illustrates the comparison of the algorithms mentioned before in terms of precision, recall and F1. The averages for the precision, recall and F1 measures using the SOM classifier are 80.7%, 75.3% and 78.9%, respectively. The averages using the traditional fuzzy ART classifier are 88.3%, 85.8% and 87%, respectively. And the averages for the precision, recall and F1 measures using the
Fig. 4. The comparison of SOM, Traditional ART, K-means and modified fuzzy ART algorithm
5 Conclusions

In summary, a new approach that uses a modified fuzzy neural network based on adaptive resonance theory to group customers dynamically based on their Web access patterns is proposed in this paper. The new method is applied to an E-supermarket to provide personalized service. The results show that this clustering algorithm is effective. Thus, we believe it can be a practical solution to turn more visitors into customers, improve customer loyalty, and strengthen the cross-selling ability of websites in E-commerce. We hope that this work can serve as a reference for interested readers.
References 1. Wu, Z.H.: Commercial Flexibility Service of Community Based on SOA. In the proceedings of the fourth Wuhan International Conference on E-Business, Wuhan: IEEE Press (2005) 467-471 2. Li, Y., Liu, L.: Comparison and Analysis on E-Commence Recommendation Method in China. System Engineering Theory and Application 24 (8) (2004) 96-98 3. Haykin, S.: Neural Networks: A Comprehensive Foundation. Canada: Prentice-Hall (2001) 4. Li, D., Cao, Y.D.: A New Weighted Text Filtering Method. In the proceedings of the International Conference on Natural Language Processing and Knowledge Engineering, Wuhan: IEEE Press (2005) 695-698 5. Yang, Z.: Net Flow Clustering Analysis Based on SOM Artificial Neural Network. Computer Engineering 32 (16) (2006) 103-105
Recurrent Fuzzy Neural Network Based System for Battery Charging

R.A. Aliev1, R.R. Aliev2, B.G. Guirimov1, and K. Uyar3

1 Azerbaijan State Oil Academy, 20 Azadlig avenue, Baku, Azerbaijan
[email protected] 2 Eastern Mediterranean University
[email protected] 3 Near East University
[email protected]

Abstract. Consumer demand for intelligent battery chargers is increasing as portable electronic applications continue to grow. Fast charging of battery packs is a problem which is difficult, and often expensive, to solve using conventional techniques, which only perform a linear approximation of the nonlinear behavior of a battery pack. Battery charging is a nonlinear electrochemical dynamic process and there is no exact mathematical model of the battery, so better techniques are needed when a higher degree of accuracy and minimum charging time are desired. In this paper we propose a soft computing approach based on fuzzy recurrent neural networks (FRNN) trained by genetic algorithms to control the battery charging process. This technique does not require a mathematical model of the battery pack, which is often difficult, if not impossible, to obtain. The nonlinear and uncertain dynamics of the battery pack are modeled by the recurrent fuzzy neural network, and on the basis of this FRNN model the fuzzy control rules of the battery charging control system are generated. Computational experiments show that the suggested approach gives the shortest charging time and the smallest Tend − Tstart compared with other intelligent battery charger works.
1 Introduction

There are several research works on the application of new technologies, namely fuzzy, neural, genetic and neuro-fuzzy approaches, to battery charging. Unlike conventional schemes using constant current or a few trip points, an intelligent charger monitors battery parameters continuously and alters the charge current as frequently as required to prevent overcharging, exceeding temperature limits, and exceeding safe charge current limits. This allows a high charge to be applied during the initial stages of charging; the charge current is appropriately reduced during the later stages based on the battery parameters. The authors in [1] implement three different approaches for controlling a complex electrochemical process using MATLAB. They compared the results of fuzzy and neuro-fuzzy systems with conventional PID control by simulating the formation (loading) of a battery. These systems were designed using absolute temperature (T) and temperature

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 307–316, 2007. © Springer-Verlag Berlin Heidelberg 2007
gradient (dT/dt) as inputs and current (I) as output. Although [1] explains the duration of charging, unfortunately it gives information neither about the type of battery nor about the level of temperature increase during charging. Paper [2] focuses on the design of a super fast battery charger based on NeuFuz technology. In this application a NiCd battery pack was used as the test vehicle and the measured values were T, voltage (U) and I. The results are a 5 °C difference between the ending and starting temperature (Tend − Tstart) and a charging time of 20 to 30 minutes; the charging time is long compared to other research. In [3] the authors give a Tend − Tstart result of 35 up to 60 °C, and their method increases the battery lifetime to 3000 charging cycles. The paper does not explain how long it takes to charge the battery, nor from how many cycles the method increased the lifetime to 3000; the Tend − Tstart result is also high compared with other research papers. Paper [4] considers a fuzzy controller for a rapid NiCd battery charger using an adaptive neuro-fuzzy inference system (ANFIS) model. The NiCd batteries were charged at rates between 8 and 0.05 times the charging rate (C) and for different durations; the two input variables identified to control the C-rate are T and dT/dt. The equivalent ANFIS architecture for the system under consideration was created in MATLAB. Although this work gives the best result with the least charging time, the ANFIS gives a high Tend − Tstart result of 50 °C. Paper [9] presents a genetic algorithm approach to optimize a fuzzy rule-based system for charging high power NiCd batteries.
Unfortunately, as mentioned in [5], little progress has been made in creating intelligent battery charging control systems that provide an optimal trade-off between charging time and battery overheating, and there is great potential to increase the efficiency of battery charging systems by using more effective technologies. In this paper we propose a soft computing approach based on fuzzy recurrent neural networks trained by genetic algorithms to control the battery charging process. This approach does not require a mathematical model of the battery pack, which is often difficult, if not impossible, to obtain. The work is distinguished by a fuzzy recurrent neural network modeling the nonlinearity and high degree of uncertainty of battery packs; this FRNN model allows the generation of a fuzzy rule base for the intelligent battery charging control system. The main advantage of the proposed intelligent control system is that it provides minimum charging time and minimum overheating as compared to existing methods. The rest of this paper is organized as follows. Section 2 describes the fuzzy recurrent neural network for battery modeling and its learning. Section 3 describes the soft computing based battery charging control system. In Section 4, simulations and experimental results are discussed. Section 5 concludes the paper.
2 Fuzzy Recurrent Neural Network and Its Learning

The structure of the proposed fuzzy recurrent neural network is presented in Fig. 1. The box elements represent memory cells that store the activation values of the neurons at the previous time step, which are fed back to the input at the next time step.
[Figure: the FRNN consists of an input layer 0, a hidden layer 1 and an output layer L; each neuron output y_i^l(t) is stored in a memory cell and fed back as y_i^l(t − 1) at the next time step.]

Fig. 1. The structure of FRNN
Here x_j^l(t) is the j-th fuzzy input to neuron i at layer l at time step t, y_i^l(t) is the computed output signal of the neuron at time step t, w_ij is the fuzzy weight of the connection to neuron i from neuron j located at the previous layer, θ_i is the fuzzy bias of neuron i, y_j^l(t − 1) is the activation of neuron j at time step (t − 1), and v_ij is the recurrent connection weight to neuron i from neuron j at the same layer. The activation F for a total input s to a neuron (Fig. 2) is calculated as:

F(s) = s / (1 + |s|) .   (1)
Fig. 2. The activation function F(s)
So, the output of neuron i at layer l is calculated as follows:

y_i^l(t) = (θ_i^l + Σ_j x_j^l(t) w_ij^l + Σ_j y_j^l(t − 1) v_ij^l) / (1 + |θ_i^l + Σ_j x_j^l(t) w_ij^l + Σ_j y_j^l(t − 1) v_ij^l|) .   (2)
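A crisp-valued sketch of the layer computation of Eq. (2) follows; the paper operates on fuzzy numbers, whereas here ordinary real numbers and random illustrative weights are used.

```python
import numpy as np

# Output of a recurrent layer per Eq. (2): total input s = bias + feedforward
# + recurrent terms, squashed by F(s) = s / (1 + |s|) from Eq. (1).
# Crisp numbers are used for clarity; the paper uses general fuzzy numbers.
def layer_output(x, y_prev, W, V, theta):
    s = theta + W @ x + V @ y_prev       # total input to each neuron
    return s / (1.0 + np.abs(s))         # activation F(s)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))              # feedforward weights w_ij (illustrative)
V = rng.normal(size=(3, 3))              # recurrent weights v_ij (illustrative)
theta = np.zeros(3)                      # biases θ_i
y = np.zeros(3)                          # activations at time step t − 1
for _ in range(5):                       # unroll a few time steps
    y = layer_output(np.array([0.5, -0.2]), y, W, V, theta)
print(y)                                 # outputs lie strictly in (−1, 1)
```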
All fuzzy signals, connection weights and biases are general fuzzy numbers that can be represented with any required precision as T(L0, L1, ..., Ln−1, Rn−1, Rn−2, ..., R0). Fig. 3 shows an example of such a fuzzy number for n = 4 (for n = 2 we get traditional trapezoid (L1 < R1) and triangle (L1 = R1) numbers). In case the original learning patterns are crisp, we need to sample the data into fuzzy terms, i.e. to fuzzify the learning patterns; the fuzzifiers can be created independently for specific problems. For learning of the FRNN we use a GA [7,8]. To apply the genetic algorithm based approach to FRNN training, all adjustable parameters, i.e. connection weights and biases, are coded as bitstrings. A combination of all weight and bias bitstrings composes a genome (sometimes called a chromosome) representing a potential solution to the problem. During the genetic evolution, a population consisting of a set of individuals or genomes (usually 50-100 genomes) undergoes a group of operations with selected genetic operators; crossover and mutation are the most often used operators. Applying the genetic operators generates many offspring (new individuals or genomes). When the bitstrings are decoded back to weights and biases, presenting different FRNN solutions, some may present good network solutions and some bad. Good genomes (i.e. those corresponding to good solutions) have more chances to stay within the population for upcoming generations, while bad genomes have more chances to be discarded during future selection processes. Whether a genome is good or bad is evaluated by a fitness function: an evaluator function (which can also be fuzzy) numerically assessing the quality of the genome and the solution it represents. In the case of neural network learning, the purpose is to minimize the network error performance index.

Thus, the selection of the best genomes from the population is done on the basis of the genome fitness value, which is calculated from the FRNN error performance index. Calculating the fitness value of a particular genome requires restoring the coded genome bits back to the fuzzy weight coefficients and biases of the FRNN; in other words, we need to get a phenotype from the genotype. The FRNN error performance index is calculated as follows:
Thus, the selection of best genomes from the population is done on the basis of the genome fitness value, which is calculated from the FRNN error performance index. The calculation of the fitness value of a particular genome require restoration of the coded genome bits back to fuzzy weight coefficients and biases of FRNN, in other words, we need to get a phenotype from the genotype. The FRNN error performance index can be calculated as follows:
Etot = ∑∑ D ( y pi , y des pi ) , p
(3)
i
where Etot is the total error performance index for all output neurons i and all learning data entries p. We shall assume Y is a finite universe Y={y1,y2,...,yn}; D is an error function such des
as the distance measure between two fuzzy sets, the desired y pi and the computed
y pi outputs. The efficient strategy is to consider the difference of all the points of the
Recurrent Fuzzy Neural Network Based System for Battery Charging
311
used general fuzzy number (Fig. 3). The considered distance measure is based on Hamming distance Δ j = y pij − y pij , des
D (T1 , T2 ) =
n
Δ j ∈ [0,1] : D = ∑ Δ j , j =1
i = n −1
i = n −1
i =0
i =0
∑
k i | LT 1i − LT 2i | +
∑k
i
| RT 1i − RT 2i | ,
where D (T1 , T2 ) is the distance measure between two fuzzy numbers
(4)
T1 ( y des pi ) and
T2 ( y pi ), 0≤k0≤k1... ≤kn-2≤kn-1 are some scaling coefficients. Once the total error performance index for a combination of weights has been calculated the fitness f of the corresponding genome is set as:
f =
1 . 1 + Etot
(5)
1
0.75
0.5
0.25
0 -5
0
5
10
15
Fig. 3. An example of an n-point fuzzy number
As can be seen, the fitness function value for a genome (coding a network solution) is based on a distance measure comparing two sets of fuzzy values. The scaling coefficients are included to add sensitivity to the high-membership areas of a fuzzy number. The GA-based training process is schematically shown in Fig. 4. The GA used here can be described as follows:

1. Prepare the genome structure according to the structure of the FRNN;
2. In case a good genome (an existing network solution) exists, put it into the population; otherwise generate a random network solution and put it into the population;
3. Generate PopSize − 1 new genomes at random and put them into the population;
4. Apply the genetic crossover operation to the PopSize genomes in the population;
5. Apply the mutation operation to the generated offspring;
6. Get the phenotype of, and rank (i.e. evaluate and assign fitness values to), all the offspring;
R.A. Aliev et al.
Fig. 4. GA based training of a FRNN network
7. Create a new population with the Nbest best parent genomes and the (PopSize − Nbest) best offspring;
8. Display the fitness value of the best genome; if the termination condition is met, go to Step 9; else go to Step 4;
9. Get the phenotype of the best genome in the population and store the network weights file;
10. Stop.
In the above algorithm, PopSize is the minimum population size and Nbest is the number of best parent genomes always kept in the newly generated population. The learning may be stopped once the process does not show any significant change in fitness value over many succeeding generations. In this case we can specify a new mutation (and maybe crossover) probability and continue the process. If the obtained total error performance index or the behavior of the obtained network is not satisfactory, we can restructure the network by adding new hidden neurons, or do better sampling (fuzzification) of the learning patterns.
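The elitist training loop of steps 1-10 can be sketched as below. This is a minimal illustration, not the authors' implementation: the genome layout, the crossover and mutation operators, and the `total_error` evaluation of the FRNN are placeholders.

```python
# Sketch of the elitist GA loop (steps 1-10). `total_error(genome)` abstracts
# the FRNN evaluation over all training patterns; its minimization maximizes
# the fitness 1/(1 + E_tot) of Eq. (5).
import random

def train(total_error, genome_len, pop_size=100, n_best=10,
          p_cross=0.25, p_mut=0.05, generations=200, seed_genome=None):
    def fitness(g):                         # Eq. (5)
        return 1.0 / (1.0 + total_error(g))
    def crossover(a, b):                    # single-point crossover sketch
        if genome_len > 1 and random.random() < p_cross:
            cut = random.randrange(1, genome_len)
            return a[:cut] + b[cut:]
        return a[:]
    def mutate(g):
        return [x + random.gauss(0, 0.1) if random.random() < p_mut else x
                for x in g]

    # Steps 2-3: seed with an existing solution (if any) plus random genomes.
    pop = [seed_genome[:]] if seed_genome else \
          [[random.uniform(-1, 1) for _ in range(genome_len)]]
    while len(pop) < pop_size:
        pop.append([random.uniform(-1, 1) for _ in range(genome_len)])

    for _ in range(generations):
        # Steps 4-6: crossover, mutation, and evaluation of the offspring.
        offspring = [mutate(crossover(random.choice(pop), random.choice(pop)))
                     for _ in range(pop_size)]
        # Step 7: elitism - keep the n_best parents plus the best offspring.
        parents = sorted(pop, key=fitness, reverse=True)[:n_best]
        offspring = sorted(offspring, key=fitness, reverse=True)[:pop_size - n_best]
        pop = parents + offspring
    return max(pop, key=fitness)            # Step 9: best genome
```

With the settings reported later in the paper (population 100, crossover probability 0.25, mutation probability 0.05, 10 elite parents), this corresponds to `train(..., pop_size=100, n_best=10, p_cross=0.25, p_mut=0.05)`.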
3 Description of the Soft Computing Based Battery Charging Control System

The purpose of the battery control system is to charge the whole battery pack, consisting of 6 batteries, to hold 9.6 V. The initial charge level is 1.37 V and the temperature is 21.6 °C.
Just after the battery reaches 1.6 V (the target for one battery: 1.6 V × 6 batteries = 9.6 V), it becomes overheated and we can observe a loss of charge due to chemical processes inside the battery. The purpose of the control is to charge the battery to hold 1.6 V in as short a time as possible while preventing the battery from overheating. We can apply different charging currents as the control input, with values ranging from 0 A to 6 A. The input signals of the suggested battery charging control system, T and U, are measured by temperature and voltage sensors. The outputs of the sensors are crisp current values of temperature and voltage. The other input signals of the battery charging controller are the first derivatives of U (dU/dt) and T (dT/dt). All these input signals U, dU/dt, T and dT/dt are fuzzified by fuzzifiers. The fuzzy knowledge base of the controller, generated in advance, is approximately implemented by the RFNN. Receiving the current fuzzy values of U, dU/dt, T and dT/dt, the controller performs fuzzy inference and determines the fuzzy value of the control signal. As only crisp control signals can be applied to the battery, the fuzzy control signal from the RFNN must be defuzzified. This signal is then applied to the battery [6]. As mentioned above, there is no exact mathematical model. Because of this, instead of designing a nonlinear differential equation model, we prefer to use a neuro-fuzzy genetic method to define a model. To design the charger, a FRNN is used to learn the behavior of the battery charging, to generate a set of fuzzy rules and membership functions, and then to transfer the acquired knowledge into a new fuzzy logic system. The system creates the voltage and temperature models. The FRNN designed for the battery charging controller had 4 inputs, 20 hidden neurons, and 1 output. The four inputs represent temperature (T), change of temperature (dT), voltage (V) and change of voltage (dV). The output of the controller is the current (I) applied for charging the battery.
All weights and biases of the FRNN are coded as 64-bit-long genes. The control rules used for learning of the battery control system are listed in Table 1.

Table 1. The control rules

T     dT    V     dV    I
LOW   LOW   LOW   LOW   HIGH
LOW   MED   LOW   LOW   HIGH
LOW   HIGH  LOW   LOW   HIGH
...   ...   ...   ...   ...
LOW   MED   MED   HIGH  HIGH
LOW   HIGH  MED   HIGH  HIGH
MED   HIGH  MED   HIGH  LOW
4 Simulation Results

The network was trained by the above fuzzy rules. Fig. 5 shows the graph of the charging process under the control of the learned FRNN. The GA learning was done with population size 100, probability of multi-point crossover 0.25, and probability of
Table 2. Comparison of different charging controllers

Charging controller          Time (sec)    Tend − Tstart (°C)
FRNN based (our approach)    860           2.85
FL [3]                       959           35-60
FG [9]                       900           9
ANFIS [4]                    1200-1800     50
NeuFuz [2]                   –             5
Fig. 5. Battery charging control process: voltage vs. time, temperature vs. time, and current vs. time
mutation 0.05. After the crossover and mutation operations [7], the 90 best offspring genomes plus the 10 best parent genomes make a new population of 100 genomes. The selection of the 100 best genomes is done on the basis of the genome fitness value [7]. The FRNN based control system allows a very quick and effective charge of the battery: the charging time is reduced from more than 2000 seconds (with an applied constant charge current of 2 A) to 860 seconds (or even less, if the temperature limit is set higher than 25 °C) with a dynamically changed (under the control of the RFNN) input current (Fig. 5). Also, the battery is protected from overheating, and a long utilization time of the battery can be ensured by adequately adjusting the fuzzy rules describing the desired charging process. The results of the FRNN based charging controller compared with other battery chargers are given in Table 2. The NiCd charger with GA based training of the FRNN gives a shorter charging time and a smaller Tend − Tstart than the other controllers.
5 Conclusions

This work proposes a Soft Computing approach based on a recurrent fuzzy neural network to control the battery charging process. The dynamics of the battery pack are described by recurrent fuzzy neural networks, on the basis of which the fuzzy control rules are generated. A genetic algorithm is used for tuning the fuzzy neural network. Computational experiments show that the suggested approach gives the shortest charging time and the smallest Tend − Tstart compared with other intelligent battery charger works. The approach is general and can be extended to design controllers for quickly charging different battery types.
References
1. Castillo, O., Melin, P.: Soft Computing for Control of Non-Linear Dynamical Systems. Springer, Germany (2001)
2. Ullah, M.Z., Dilip, S.: Method and Apparatus for Fast Battery Charging Using Neural Network Fuzzy Logic Based Control. IEEE Aerospace and Electronic Systems Magazine 11(6) (1996) 26-34
3. Ionescu, P.D., Moscalu, M., Mosclu, A.: Intelligent Charger with Fuzzy Logic. Int. Symp. on Signals, Circuits and Systems (2003)
4. Khosla, A., Kumar, S., Aggarwal, K.K.: Fuzzy Controller for Rapid Nickel-Cadmium Batteries Charger through Adaptive Neuro-Fuzzy Inference System (ANFIS) Architecture. 22nd International Conference of the North American Fuzzy Information Processing Society, NAFIPS (2003) 540-544
5. Diaz, J., Martin-Ramos, J.A., Pernia, A.M., Nuno, F., Linera, F.F.: Intelligent and Universal Fast Charger for Ni-Cd and Ni-MH Batteries in Portable Applications. IEEE Trans. on Industrial Electronics 51(4) (2004) 857-863
6. Jamshidi, M.: Large-Scale Systems: Modeling, Control and Fuzzy Logic. Prentice Hall, Englewood Cliffs, NJ (1996)
7. Aliev, R.A., Aliev, R.R.: Soft Computing and Its Applications. World Scientific, New Jersey (2001)
8. Jamshidi, M., Krohling, R.A., Coelho, dos S., Fleming, P.: Robust Control Design Using Genetic Algorithms. CRC Publishers, Boca Raton, FL (2003)
9. Surmann, H.: Genetic Optimization of a Fuzzy System for Charging Batteries. IEEE Trans. on Industrial Electronics 43(5) (1996) 541-548
Type-2 Fuzzy Neuro System Via Input-to-State-Stability Approach*

Ching-Hung Lee and Yu-Ching Lin

Department of Electrical Engineering, Yuan Ze University, Chung-li, Taoyuan 320, Taiwan
[email protected]

Abstract. This paper proposes a type-2 fuzzy neural network system (type-2 FNN) which combines the advantages of type-2 fuzzy logic systems (FLSs) and neural networks (NNs). To account for system uncertainties, we use type-2 FLSs to develop a type-2 FNN system. The previous results for type-1 FNN systems can be extended to the type-2 case. Furthermore, the corresponding learning algorithm is derived by an input-to-state-stability (ISS) approach. Nonlinear system identification is presented to illustrate the effectiveness of our approach.
1 Introduction

In recent years, intelligent systems, including fuzzy control, neural networks, genetic algorithms, etc., have been developed and applied widely, especially in the field of fuzzy neural networks (FNNs) [1-5]. In the literature [1-3], the FNN system has the properties of a parallel computation scheme, easy implementation, a fuzzy logic inference system, and parameter convergence. The fuzzy rules and the membership functions (MFs) can be designed and trained from linguistic information and numeric data. Thus, it is easy to design an FNN system that achieves a satisfactory level of accuracy by manipulating the network structure and the learning algorithm of the FNN. The concept of type-2 fuzzy sets was initially proposed by Zadeh as an extension of ordinary fuzzy sets (called type-1) [6]. Subsequently, Mendel and Karnik developed a complete theory of type-2 fuzzy logic systems (FLSs) [7-11]. These systems are characterized by IF-THEN rules; type-2 fuzzy rules are more complex than type-1 fuzzy rules because their antecedent and consequent sets are type-2 fuzzy sets [8-10]. In this paper, the so-called type-2 FNN is proposed, which is an extension of the FNN. Following the concepts of [7-10], the type-2 FNN system is used to handle uncertainty. The proposed type-2 FNN is a multilayered connectionist network for realizing a type-2 fuzzy inference system, and it can be constructed from a set of type-2 fuzzy rules. The type-2 FNN uses type-2 fuzzy linguistic processing in both the antecedent and the consequent part. The consequent part of the type-2 fuzzy rules determines the output through type-reduction and defuzzification. Based on the input-to-state-stability (ISS) approach, rigorous proofs are presented to guarantee the convergence of the type-2 FNN. *
This work was supported partially by the National Science Council, Taiwan, R.O.C. under NSC-94-2213-E-155-039.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 317–327, 2007. © Springer-Verlag Berlin Heidelberg 2007
C.-H. Lee and Y.-C. Lin
This paper is organized as follows. In Section 2, we briefly introduce the type-2 FNN system and the corresponding learning algorithm derived by the input-to-state-stability approach. Section 3 presents application results for nonlinear system identification to illustrate the effectiveness of our approach. Finally, the conclusions are summarized.
2 Type-2 Fuzzy Neural Network Systems

2.1 The System Structure

The FNN system is a type of fuzzy inference system in a neural network structure [1-5]. The construction of the type-2 FNN system is shown in Fig. 1. Obviously, the type-2 FNN is constructed by IF-THEN rules [1, 5]. The main difference is the replacement of the type-1 fuzzy sets with type-2 fuzzy ones. Herein, we first introduce the basic function of every node in each layer. In the following symbols, the subscript ij indicates the j-th term of the i-th input in O_{ij}^{(k)}, where j = 1, ..., l, and the superscript (k) means the k-th layer.
Fig. 1. The construction of the MISO type-2 FNN system

Fig. 2. Type-2 fuzzy MFs, (a) uncertain mean; (b) uncertain variance
Layer 1: Input Layer
For the i-th node of layer 1, the net input and the net output are represented as

O_i^{(1)} = w_i^{(1)} x_i^{(1)},  (1)

where the weights w_i^{(1)}, i = 1, ..., n, are set to unity.
Layer 2: Membership Layer
Each node performs a type-2 membership function (MF), i.e., an interval fuzzy set, as shown in Fig. 2. We describe the two kinds of layer-2 outputs respectively [8, 11]. For Case 1, Gaussian MFs with uncertain mean, as shown in Fig. 2(a),

O_{ij}^{(2)} = \exp\left[-\frac{1}{2}\frac{(O_i^{(1)} - m_{ij})^2}{\sigma_{ij}^2}\right] = \begin{cases} \bar{O}_{ij}^{(2)} & \text{as } m_{ij} = \bar{m}_{ij} \\ \underline{O}_{ij}^{(2)} & \text{as } m_{ij} = \underline{m}_{ij} \end{cases}  (2)

For Case 2, Gaussian MFs with uncertain variance, as shown in Fig. 2(b), we have

O_{ij}^{(2)} = \exp\left[-\frac{1}{2}\frac{(O_i^{(1)} - m_{ij})^2}{\sigma_{ij}^2}\right] = \begin{cases} \bar{O}_{ij}^{(2)} & \text{as } \sigma_{ij} = \bar{\sigma}_{ij} \\ \underline{O}_{ij}^{(2)} & \text{as } \sigma_{ij} = \underline{\sigma}_{ij} \end{cases}  (3)

where m_{ij} and \sigma_{ij} represent the center (or mean) and the width (or standard deviation), respectively. The type-2 MFs can be represented as an interval bounded by the upper MF \bar{\mu}_{\tilde{F}_{ij}} and the lower MF \underline{\mu}_{\tilde{F}_{ij}}, as shown in Fig. 2. Thus, the output O_{ij}^{(2)} is represented as the interval [\underline{O}_{ij}^{(2)}, \bar{O}_{ij}^{(2)}].
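The interval output of a layer-2 node for the uncertain-variance case (Case 2) can be sketched as follows. This is an illustrative fragment, not the paper's code; the function names and test values are placeholders.

```python
# Sketch of a layer-2 node: an interval type-2 Gaussian MF with uncertain
# variance returns the interval [lower, upper] of Eq. (3). The upper MF uses
# the larger width (upper sigma), the lower MF the smaller one.
import math

def gaussian(x, m, sigma):
    return math.exp(-0.5 * ((x - m) / sigma) ** 2)

def interval_mf_uncertain_variance(x, m, sigma_lo, sigma_hi):
    """Return (lower, upper) membership bounds for input x."""
    return gaussian(x, m, sigma_lo), gaussian(x, m, sigma_hi)

lo, hi = interval_mf_uncertain_variance(1.0, 0.0, 0.5, 1.0)
assert lo <= hi   # the output O_ij^(2) is an interval [lower, upper]
```

Away from the center, the wider Gaussian always dominates, so the interval ordering holds for any input; at the center both bounds equal 1.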
Layer 3: Rule Layer
This layer is used to implement the antecedent matching. Here, the operation is chosen as a simple PRODUCT operation. Therefore, for the j-th rule node,

O_j^{(3)} = \prod_{i=1}^{n} (w_{ij}^{(3)} O_{ij}^{(2)}) = \begin{cases} \underline{O}_j^{(3)} = \prod_{i=1}^{n} (w_{ij}^{(3)} \underline{O}_{ij}^{(2)}) \\ \bar{O}_j^{(3)} = \prod_{i=1}^{n} (w_{ij}^{(3)} \bar{O}_{ij}^{(2)}) \end{cases}  (4)

where the weights w_{ij}^{(3)} = 1. Thus, the output O_j^{(3)} is represented as the interval [\underline{O}_j^{(3)}, \bar{O}_j^{(3)}].

Layer 4: Output Layer
The links in this layer are used to implement the consequence matching, the type-reduction, and the linear combination [7, 9, 10]. Thus,

\hat{y} = O^{(4)} = \frac{O_R^{(4)} + O_L^{(4)}}{2}  (5)

where
O_R^{(4)} = \sum_{j=1}^{l} f_j^R w_j^{(4)} = \sum_{j=1}^{R} \underline{O}_j^{(3)} \bar{w}_j^{(4)} + \sum_{k=R+1}^{l} \bar{O}_k^{(3)} \bar{w}_k^{(4)}  (6)

and

O_L^{(4)} = \sum_{j=1}^{l} f_j^L w_j^{(4)} = \sum_{j=1}^{L} \bar{O}_j^{(3)} \underline{w}_j^{(4)} + \sum_{k=L+1}^{l} \underline{O}_k^{(3)} \underline{w}_k^{(4)}  (7)
In order to obtain the values O_L^{(4)} and O_R^{(4)}, the coefficients R and L must be found first. Assume that the pre-computed \bar{w}_j^{(4)} and \underline{w}_j^{(4)} are arranged in ascending order, i.e., \underline{w}_1^{(4)} \le \underline{w}_2^{(4)} \le ... \le \underline{w}_l^{(4)} and \bar{w}_1^{(4)} \le \bar{w}_2^{(4)} \le ... \le \bar{w}_l^{(4)} [7, 10]. Then,

R1: Compute O_R^{(4)} in (6) with the initial setting f_j^R = \frac{1}{2}(\underline{O}_j^{(3)} + \bar{O}_j^{(3)}) for j = 1, ..., l, and let y_r \equiv O_R^{(4)}.
R2: Find R (1 \le R \le l-1) such that \bar{w}_R^{(4)} \le y_r \le \bar{w}_{R+1}^{(4)}.
R3: Compute O_R^{(4)} in (6) with f_j^R = \underline{O}_j^{(3)} for j \le R and f_j^R = \bar{O}_j^{(3)} for j > R, and let y_r' = O_R^{(4)}.
R4: If y_r' \ne y_r, then go to step R5. If y_r' = y_r, then stop and set O_R^{(4)} = y_r'.
R5: Set y_r equal to y_r', and return to step R2.
Subsequently, the computation of O_L^{(4)} is similar to the above procedure [10]. Thus, the input/output representation of the type-2 FNN system with uncertain mean is

\hat{y}(\underline{m}_{ij}, \bar{m}_{ij}, \sigma_{ij}, \underline{w}_j^{(4)}, \bar{w}_j^{(4)}) = \frac{1}{2}\left[\sum_{j=1}^{R} \underline{O}_j^{(3)} \bar{w}_j^{(4)} + \sum_{k=R+1}^{l} \bar{O}_k^{(3)} \bar{w}_k^{(4)} + \sum_{j=1}^{L} \bar{O}_j^{(3)} \underline{w}_j^{(4)} + \sum_{k=L+1}^{l} \underline{O}_k^{(3)} \underline{w}_k^{(4)}\right].  (8)

The type-2 FNN with uncertain variance, as in Fig. 2(b), can be simplified as

\hat{y}(m_{ij}, \underline{\sigma}_{ij}, \bar{\sigma}_{ij}, w_j^{(4)}) = \frac{1}{2}\sum_{j=1}^{l} (\underline{O}_j^{(3)} + \bar{O}_j^{(3)}) w_j^{(4)}.  (9)
2.2 The Input-to-State-Stability Learning Algorithm
Input-to-state stability (ISS) is an elegant approach for analyzing stability besides the Lyapunov method [12]. For the case of Gaussian MFs with uncertain variance, the q-th output of the type-2 FNN can be expressed as

\hat{y}_q = \frac{1}{2}\sum_{j=1}^{l} w_{qj}\left[\prod_{i=1}^{n}\exp\left(-\frac{(x_i - m_{ij})^2}{2\underline{\sigma}_{ij}^2}\right) + \prod_{i=1}^{n}\exp\left(-\frac{(x_i - m_{ij})^2}{2\bar{\sigma}_{ij}^2}\right)\right].  (10)

The object of type-2 FNN modeling is to find the center values of \tilde{B}_{1j}, ..., \tilde{B}_{mj}, as well as the MFs \tilde{A}_{1j}, ..., \tilde{A}_{nj}, such that the output \hat{Y}(k) of the type-2 FNN (10) approximates the plant output. Let us define the identification error as

e(k) = \hat{Y}(k) - Y(k).  (11)
We will use the modeling error e(k) through the algorithm to train the type-2 FNN on-line such that \hat{Y}(k) can approximate Y(k), \forall k. According to the function approximation theories of fuzzy neural networks, the identified plant can be represented as

y_q = \frac{1}{2}\sum_{j=1}^{l} w_{qj}^*\left[\prod_{i=1}^{n}\exp\left(-\frac{(x_i - m_{ij}^*)^2}{2(\underline{\sigma}_{ij}^*)^2}\right) + \prod_{i=1}^{n}\exp\left(-\frac{(x_i - m_{ij}^*)^2}{2(\bar{\sigma}_{ij}^*)^2}\right)\right] - \Delta_q  (12)

where w_{qj}^*, m_{ij}^*, \underline{\sigma}_{ij}^*, and \bar{\sigma}_{ij}^* are the unknown parameters which minimize the unmodeled dynamics \Delta_q. In the case of four independent variables, a smooth function f has the Taylor formula

f(x_1, x_2, x_3, x_4) = \left[(x_1 - x_1^0)\frac{\partial}{\partial x_1} + (x_2 - x_2^0)\frac{\partial}{\partial x_2} + (x_3 - x_3^0)\frac{\partial}{\partial x_3} + (x_4 - x_4^0)\frac{\partial}{\partial x_4}\right] f + R_l  (13)

where R_l is the remainder of the Taylor formula. If we let [x_1\ x_2\ x_3\ x_4] = [w_{qj}\ m_{ij}\ \underline{\sigma}_{ij}\ \bar{\sigma}_{ij}] and [x_1^0\ x_2^0\ x_3^0\ x_4^0] = [w_{qj}^*\ m_{ij}^*\ \underline{\sigma}_{ij}^*\ \bar{\sigma}_{ij}^*], we have
y_q + \Delta_q = \hat{y}_q + \sum_{j=1}^{l}(w_{qj}^* - w_{qj})\frac{\underline{O}_j^{(3)} + \bar{O}_j^{(3)}}{2} + \sum_{j=1}^{l}\sum_{i=1}^{n}\frac{\partial \hat{y}_q}{\partial m_{ij}}(m_{ij}^* - m_{ij}) + \sum_{j=1}^{l}\sum_{i=1}^{n}\frac{\partial \hat{y}_q}{\partial \underline{\sigma}_{ij}}(\underline{\sigma}_{ij}^* - \underline{\sigma}_{ij}) + \sum_{j=1}^{l}\sum_{i=1}^{n}\frac{\partial \hat{y}_q}{\partial \bar{\sigma}_{ij}}(\bar{\sigma}_{ij}^* - \bar{\sigma}_{ij}) + R_1  (14)
where R_1 is the approximation error. Using the chain rule, we obtain

\frac{\partial \hat{y}_q}{\partial m_{ij}} = \frac{\partial \hat{y}_q}{\partial \underline{O}_j^{(3)}}\frac{\partial \underline{O}_j^{(3)}}{\partial m_{ij}} + \frac{\partial \hat{y}_q}{\partial \bar{O}_j^{(3)}}\frac{\partial \bar{O}_j^{(3)}}{\partial m_{ij}} = \frac{w_{qj}}{2}\left(\underline{O}_j^{(3)}\frac{x_i - m_{ij}}{\underline{\sigma}_{ij}^2} + \bar{O}_j^{(3)}\frac{x_i - m_{ij}}{\bar{\sigma}_{ij}^2}\right)  (15)

\frac{\partial \hat{y}_q}{\partial \underline{\sigma}_{ij}} = \frac{w_{qj}}{2}\underline{O}_j^{(3)}\frac{(x_i - m_{ij})^2}{\underline{\sigma}_{ij}^3}  (16)

\frac{\partial \hat{y}_q}{\partial \bar{\sigma}_{ij}} = \frac{w_{qj}}{2}\bar{O}_j^{(3)}\frac{(x_i - m_{ij})^2}{\bar{\sigma}_{ij}^3}.  (17)
Thus, (14) can be rewritten by substituting (15)-(17) for the partial derivatives, giving the fully expanded form of the error expansion:

y_q + \Delta_q = \hat{y}_q + \sum_{j=1}^{l}(w_{qj}^* - w_{qj})\frac{\underline{O}_j^{(3)} + \bar{O}_j^{(3)}}{2} + \sum_{j=1}^{l}\sum_{i=1}^{n}\frac{w_{qj}}{2}\left(\underline{O}_j^{(3)}\frac{x_i - m_{ij}}{\underline{\sigma}_{ij}^2} + \bar{O}_j^{(3)}\frac{x_i - m_{ij}}{\bar{\sigma}_{ij}^2}\right)(m_{ij}^* - m_{ij}) + \sum_{j=1}^{l}\sum_{i=1}^{n}\left[\frac{w_{qj}}{2}\underline{O}_j^{(3)}\frac{(x_i - m_{ij})^2}{\underline{\sigma}_{ij}^3}\right](\underline{\sigma}_{ij}^* - \underline{\sigma}_{ij}) + \sum_{j=1}^{l}\sum_{i=1}^{n}\left[\frac{w_{qj}}{2}\bar{O}_j^{(3)}\frac{(x_i - m_{ij})^2}{\bar{\sigma}_{ij}^3}\right](\bar{\sigma}_{ij}^* - \bar{\sigma}_{ij}) + R_1.  (18)

Rewriting (18) in matrix form, we define

W_q = [w_{q1} ... w_{ql}] \in \Re^{1 \times l}, \quad Z(k) = \left[\frac{\underline{O}_1^{(3)} + \bar{O}_1^{(3)}}{2} ... \frac{\underline{O}_l^{(3)} + \bar{O}_l^{(3)}}{2}\right]^T \in \Re^{l \times 1}, \quad \tilde{W}_q = W_q - W_q^*,

D_{ZRq} = \left[\frac{w_{q1}\underline{O}_1^{(3)}}{2} ... \frac{w_{ql}\underline{O}_l^{(3)}}{2}\right] \in \Re^{1 \times l}, \quad D_{ZLq} = \left[\frac{w_{q1}\bar{O}_1^{(3)}}{2} ... \frac{w_{ql}\bar{O}_l^{(3)}}{2}\right] \in \Re^{1 \times l},

C_R(k) \in \Re^{l \times n} with entries [C_R]_{ji} = \frac{x_i - m_{ij}}{\underline{\sigma}_{ij}^2}(m_{ij} - m_{ij}^*), \quad C_L(k) \in \Re^{l \times n} with entries [C_L]_{ji} = \frac{x_i - m_{ij}}{\bar{\sigma}_{ij}^2}(m_{ij} - m_{ij}^*),

B_R(k) \in \Re^{l \times n} with entries [B_R]_{ji} = \frac{(x_i - m_{ij})^2}{\underline{\sigma}_{ij}^3}(\underline{\sigma}_{ij} - \underline{\sigma}_{ij}^*), \quad B_L(k) \in \Re^{l \times n} with entries [B_L]_{ji} = \frac{(x_i - m_{ij})^2}{\bar{\sigma}_{ij}^3}(\bar{\sigma}_{ij} - \bar{\sigma}_{ij}^*).
The identification error is defined as e_q = \hat{y}_q - y_q; using (18) we have

e_q = \tilde{W}_q Z(k) + D_{ZRq} C_R(k) E + D_{ZLq} C_L(k) E + D_{ZRq} B_R(k) E + D_{ZLq} B_L(k) E + \Delta_q - R_{1q}  (19)

and, stacking the m outputs,

e(k) = \tilde{W}_k Z(k) + D_{ZR}(k) C_R(k) E + D_{ZL}(k) C_L(k) E + D_{ZR}(k) B_R(k) E + D_{ZL}(k) B_L(k) E + \varsigma(k)  (20)

where e(k) = [e_1 ... e_m]^T \in \Re^{m \times 1}, E = [1 ... 1]^T \in \Re^{n \times 1}, \tilde{W}_k \in \Re^{m \times l} with entries w_{qj} - w_{qj}^*, D_{ZR}(k), D_{ZL}(k) \in \Re^{m \times l} collect the rows D_{ZRq} and D_{ZLq} for q = 1, ..., m, \Delta = [\Delta_1 ... \Delta_m]^T \in \Re^{m \times 1}, R_1 = [R_{11} ... R_{1m}]^T \in \Re^{m \times 1}, and \varsigma(k) = \Delta - R_1.
Since the Gaussian function is bounded and the plant is BIBO stable, \Delta and R_1 in (19) are bounded. Therefore, \varsigma(k) in (20) is bounded. The following theorem gives a stable algorithm for the discrete-time type-2 FNN.

Theorem 1. If we use the type-2 FNN to identify a nonlinear plant, the following backpropagation algorithm makes the identification error e(k) bounded:

W_{k+1} = W_k - \eta_{ISS}\, e(k)\, [Z(k)]^T
m_{ij}(k+1) = m_{ij}(k) - \eta_{ISS}\,\frac{w_{qj}}{2}\left(\underline{O}_j^{(3)}\frac{x_i - m_{ij}}{\underline{\sigma}_{ij}^2} + \bar{O}_j^{(3)}\frac{x_i - m_{ij}}{\bar{\sigma}_{ij}^2}\right)(\hat{y}_q - y_q)
\underline{\sigma}_{ij}(k+1) = \underline{\sigma}_{ij}(k) - \eta_{ISS}\,\frac{w_{qj}}{2}\underline{O}_j^{(3)}\frac{(x_i - m_{ij})^2}{\underline{\sigma}_{ij}^3}(\hat{y}_q - y_q)
\bar{\sigma}_{ij}(k+1) = \bar{\sigma}_{ij}(k) - \eta_{ISS}\,\frac{w_{qj}}{2}\bar{O}_j^{(3)}\frac{(x_i - m_{ij})^2}{\bar{\sigma}_{ij}^3}(\hat{y}_q - y_q)  (21)

where

\eta_{ISS} = \frac{\eta_k}{1 + \|Z\|^2 + 2\|D_{ZR}\|^2 + 2\|D_{ZL}\|^2}, \quad 0 < \eta_k \le 1,  (22)

and \|\cdot\| denotes the 1-norm. In addition, the average identification error satisfies

J = \limsup_{T \to \infty} \frac{1}{T}\sum_{k=1}^{T} e^2(k) \le \frac{\eta_k}{\pi}\bar{\varsigma}^2  (23)

where \pi = \frac{\eta_k}{(1+\lambda)^2}, \lambda = \max_k(\|Z\|^2 + 2\|D_{ZR}\|^2 + 2\|D_{ZL}\|^2), and \bar{\varsigma} = \max_k \|\varsigma(k)\|. ■
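The ISS-normalized step of (21)-(22) can be sketched as below for the output weights; the gradient terms for m and the two widths would be updated analogously. This is an illustrative fragment under the stated assumptions (1-norm as in Theorem 1, list-of-lists matrices); all names are the sketch's own.

```python
# Sketch of the adaptive step (21)-(22): an ISS-normalized gradient update
# for the output-weight matrix W (m x l), given the error vector e (m x 1)
# and the rule activations Z (l x 1).
def norm1(M):
    """1-norm surrogate: sum of absolute entries of a list-of-lists matrix."""
    return sum(abs(x) for row in M for x in row)

def iss_rate(eta_k, Z, D_zr, D_zl):
    """Eq. (22): eta_ISS = eta_k / (1 + |Z|^2 + 2|D_ZR|^2 + 2|D_ZL|^2)."""
    assert 0.0 < eta_k <= 1.0
    z = sum(abs(v) for v in Z)
    return eta_k / (1.0 + z**2 + 2.0 * norm1(D_zr)**2 + 2.0 * norm1(D_zl)**2)

def update_weights(W, e, Z, eta_iss):
    """W(k+1) = W(k) - eta_ISS * e(k) Z(k)^T, elementwise for an m x l W."""
    return [[W[q][j] - eta_iss * e[q] * Z[j] for j in range(len(Z))]
            for q in range(len(e))]
```

The point of the normalization in (22) is that a single learning rate covers all parameters: the larger the current activations and weighted firing strengths, the smaller the effective step, which is what keeps the error dynamics ISS.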
3 Applications in Nonlinear System Identification

Example 1. Consider the BIBO nonlinear plant [3]

y(k+1) = u^3(k) + \frac{y(k)}{1 + y^2(k)}.  (24)

In training the type-2 FNN, we use 8 rules to construct the FNN on-line. The initial values of the parameters are chosen as

m_i = [-1\ \ -5/7\ \ -3/7\ \ -1/7\ \ 1/7\ \ 3/7\ \ 5/7\ \ 1], \quad \sigma_{ij} = 4/7, \quad \bar{\sigma}_{ij} = 6/7, \quad \underline{\sigma}_{ij} = 2/7, \quad w_j = 0.
In addition, the testing input signal u(k) given by the following equation is used to evaluate the identification results:

u(k) = \begin{cases} -0.7 + \mathrm{mod}(k,50)/40 & k \le 80 \\ \mathrm{rands}(1,1) & 80 < k \le 130 \\ 0.7 - \mathrm{mod}(k,180)/180 & 130 < k \le 250 \\ 0.6\cos(\pi k/50) & k > 250. \end{cases}  (25)
Note that the optimal learning rate will be invalid when the initial weight is w_j = 0. According to [5], we have the optimal learning rate

\eta_W^* = \min\left[1, \left(\frac{\partial \hat{y}}{\partial W}\right)^{-2}\right].  (26)
Herein, we have a different \eta_{ISS} for each \eta_k. We give ten different values, i.e., \eta_k = 0.1, 0.2, ..., 1. The simulation results are described in Table 1. We can easily find the best performance when \eta_k = 1. Then, we fix \eta_k = 1 and compare with \eta_W^*. The simulation is shown in Fig. 3(a). The dotted line is the plant's actual output, the dash-dotted line is the testing output using the type-2 FNN with learning rate \eta_{ISS} (RMSE = 3.7403 × 10^{-3}), and the solid line is the testing output using the type-2 FNN with the optimal learning rate \eta^* (RMSE = 3.1305 × 10^{-3}). Figure 3(b) shows the on-line identification performance using the type-1 FNN and the type-2 FNN. Both FNN systems use the optimal learning rate. The dotted line is the plant's actual output, the dash-dotted line is the testing output using the type-1 FNN (RMSE = 6.9241 × 10^{-3}), and the solid line is the testing output using the type-2 FNN (RMSE = 3.2047 × 10^{-3}).

Example 2. Consider the Duffing forced oscillator system [4, 5]

\dot{x} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} x + \begin{bmatrix} 0 \\ 1 \end{bmatrix}(f + u + d), \quad y = \begin{bmatrix} 1 & 0 \end{bmatrix} x,  (27)

where f = -c_1 x_1 - c_2 x_2 - x_1^3 + c_3\cos(c_4 t) and d denotes the external disturbance, assumed to be a square wave with amplitude ±0.5 and period 2\pi. Here, we give the
Table 1. Comparison of RMSE

\eta_k   RMSE (training)   RMSE (testing)
1        9.3919×10-3       3.7403×10-3
0.9      9.8335×10-3       4.7220×10-3
0.8      1.0339×10-2       7.1634×10-3
0.7      1.0289×10-2       6.5308×10-3
0.6      1.0506×10-2       6.1930×10-3
0.5      1.1036×10-2       6.9446×10-3
0.4      1.1539×10-2       8.2179×10-3
0.3      1.1988×10-2       9.9165×10-2
0.2      1.2495×10-2       1.1588×10-2
0.1      1.5122×10-2       1.2019×10-2
Fig. 3. Identification results of example 1, (a) type-2 FNN with \eta_{ISS} and \eta^*; (b) type-1 FNN with \eta^* and type-2 FNN with \eta^*
coefficients as C = [c_1\ c_2\ c_3\ c_4]^T = [1\ 0\ 12\ 1]^T. In training the type-2 FNN, we use 8 rules to construct the FNN on-line. The initial values of the parameters are chosen as

m_{ij} = \begin{bmatrix} -1 & -5/7 & -3/7 & -1/7 & 1/7 & 3/7 & 5/7 & 1 \\ -4 & -20/7 & -12/7 & -4/7 & 4/7 & 12/7 & 20/7 & 4 \\ -8 & -40/7 & -24/7 & -8/7 & 8/7 & 24/7 & 40/7 & 8 \end{bmatrix},

\sigma_i = [4/7\ \ 16/7\ \ 32/7]^T, \quad \bar{\sigma}_i = [6/7\ \ 24/7\ \ 48/7]^T, \quad \underline{\sigma}_i = [2/7\ \ 8/7\ \ 16/7]^T, \quad w_j = \mathrm{rands}(1,1).
Besides, the testing input signal u given by the following equation is used to evaluate the identification results:

u(t) = \begin{cases} \cos(\pi t/10) & t < 10\ \mathrm{sec} \\ 1 & 10\ \mathrm{sec} \le t < 20\ \mathrm{sec} \\ -1 & t \ge 20\ \mathrm{sec}. \end{cases}  (28)
In the same way, we again give ten different values, \eta_k = 0.1, 0.2, ..., 1, simulate each, and pick the best one. The simulation results are described in Table 2. We can easily find the best performance when \eta_k = 0.1. Then, we fix \eta_k = 0.1 and compare with \eta_W^* from the result of [5]. The simulation is shown in Fig. 4(a).

Table 2. Comparison of RMSE

\eta_k   RMSE (training)   RMSE (testing)
1        8.3333×10-3       9.2382×10-3
0.9      8.8069×10-3       1.1091×10-2
0.8      9.1803×10-3       1.1408×10-2
0.7      1.0604×10-2       1.1792×10-2
0.6      1.2220×10-2       1.2577×10-2
0.5      1.3241×10-2       1.2943×10-2
0.4      1.4374×10-2       1.4100×10-2
0.3      1.6746×10-2       1.5480×10-2
0.2      2.3208×10-2       1.7419×10-2
0.1      2.3521×10-2       1.7862×10-2
Fig. 4. Identification results, (a) type-2 FNN with \eta_{ISS} (\eta_k = 1) and \eta^*; (b) type-1 FNN with \eta^* and type-2 FNN with \eta^*
The dotted line is the plant's actual output, the dash-dotted line is the testing output using the type-2 FNN with learning rate \eta_{ISS} (RMSE = 9.954 × 10^{-3}), and the solid line is the testing output using the type-2 FNN with the optimal learning rate \eta^* (RMSE = 8.175 × 10^{-3}). Figure 4(b) shows the on-line identification performance using the type-1 FNN and the type-2 FNN. Both FNN systems use the optimal learning rate, as in [5]. The dotted line is the plant's actual output, the dash-dotted line is the testing output using the type-1 FNN (RMSE = 1.550 × 10^{-2}), and the solid line is the testing output using the type-2 FNN (RMSE = 8.175 × 10^{-3}). Herein, we choose the number of type-2 fuzzy rules to be half that of the type-1 FNN. Hence, we obtain the simulation results type-1 FNN (RMSE = 1.550 × 10^{-2}) and type-2 FNN (RMSE = 1.367 × 10^{-2}). We can see that even when we reduce the number of parameters in the type-2 FNN, we can still obtain better identification performance.
4 Conclusions

This paper has presented a type-2 FNN system and the corresponding adaptive learning algorithm derived by the ISS approach. In the ISS approach, we derive only one learning rate for all parameters of the type-2 FNN system, whereas following the results of the literature [5] we would have to calculate the optimal learning rate for each parameter of the type-2 FNN system. The simulations show the ability of the type-2 FNN system for nonlinear system identification with different approaches. Even when we reduce the number of parameters in the type-2 FNN, we can still obtain better performance with fewer total parameters than in the type-1 FNN. Several simulation results of nonlinear system identification were presented to verify the function mapping ability of the type-2 FNN system.
References
1. Chen, Y.C., Teng, C.C.: A Model Reference Control Structure Using a Fuzzy Neural Network. Fuzzy Sets and Systems 73 (1995) 291-312
2. Jang, J.S.R., Sun, C.T., Mizutani, E.: Neuro-fuzzy and Soft-computing. Prentice-Hall, Upper Saddle River, NJ (1997)
3. Lin, C.T., Lee, C.S.G.: Neural Fuzzy Systems. Prentice Hall, Englewood Cliffs (1996)
4. Lee, C.H., Lin, Y.C.: System Identification Using Type-2 Fuzzy Neural Network (Type-2 FNN) Systems. IEEE Conf. Computational Intelligence in Robotics and Automation, CIRA03, Japan (2003) 1264-1269
5. Lee, C.H.: Stabilization of Nonlinear Nonminimum Phase Systems: An Adaptive Parallel Approach Using Recurrent Fuzzy Neural Network. IEEE Trans. Systems, Man, and Cybernetics Part B 34(2) (2004) 1075-1088
6. Zadeh, L.A.: The Concept of a Linguistic Variable and Its Application to Approximate Reasoning. Information Sciences 8 (1975) 199-249
7. Liang, Q., Mendel, J.: Interval Type-2 Fuzzy Logic Systems: Theory and Design. IEEE Trans. Fuzzy Systems 8(5) (2000) 535-550
8. Karnik, N., Mendel, J., Liang, Q.: Type-2 Fuzzy Logic Systems. IEEE Trans. Fuzzy Systems 7(6) (1999) 643-658
9. Mendel, J., John, R.: Type-2 Fuzzy Sets Made Simple. IEEE Trans. Fuzzy Systems 10(2) (2002) 117-127
10. Mendel, J.M.: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice-Hall, NJ (2001)
11. Wang, C.H., Cheng, C.S., Lee, T.T.: Dynamical Optimal Training for Interval Type-2 Fuzzy Neural Network (T2FNN). IEEE Trans. Systems, Man, and Cybernetics Part B 34(3) (2004) 1462-1477
12. Grune, L.: Input-to-state Stability and Its Lyapunov Function Characterization. IEEE Trans. Automatic Control 47(9) (2002) 1499-1504
Fuzzy Neural Petri Nets*

Hua Xu¹, Yuan Wang¹,², and Peifa Jia¹,²

¹ State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing, 100084, P.R. China
² Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, P.R. China
{xuhua,yuanwang05,dcsjpf}@mail.tsinghua.edu.cn
Abstract. The fuzzy Petri net (FPN) is a powerful modeling tool for knowledge systems based on fuzzy production rules. But it lacks a learning mechanism, which is its main weakness when modeling uncertain knowledge systems. The fuzzy neural Petri net (FNPN) is proposed in this paper, in which fuzzy neuron components are introduced into the FPN as a sub-net model of the FNPN. For the neuron components in the FNPN, the back propagation (BP) learning algorithm of neural networks is introduced, and the parameters of the fuzzy production rules in the FNPN neurons can be learnt and trained by this means. At the same time, different neurons on different layers can be learnt and trained independently. The FNPN proposed in this paper is meaningful for Petri net models and fuzzy systems.
1 Introduction

Characterized as concurrent, asynchronous, distributed, parallel, nondeterministic, and/or stochastic [1, 2], Petri nets (PNs) have gained more and more applications in recent years. In order to model and analyze uncertain and fuzzy knowledge processing in intelligent systems or discrete event systems, fuzzy Petri nets [3, 4] have been proposed and have been an area of vigorous theoretical and experimental study, resulting in a number of formal models and practical findings, cf. fuzzy Petri nets (FPNs) [3, 5]. These models attempt to address the issue of partial firing of transitions and continuous marking of input and output places, and to relate such models to the reality of environments inherently associated with factors of uncertainty. However, when FPNs are used to model intelligent systems, e.g., expert systems and autonomous systems, they lack a powerful self-learning capability. Building on the powerful self-adaptability and self-learning of neural networks (NNs), the self-learning ability is extended into PNs. The learning-ability extensions of PNs most frequently found in the literature associate fuzzy firing with transitions. Generally speaking, two classes can be identified. The first one, usually called generalized fuzzy Petri nets, originated from the proposal found in [3]. The places and transitions of the net are represented as OR- and AND-type and DOMINANCE neurons, respectively. *
This work is jointly supported by the National Natural Science Foundation (Grant No. 60405011, 60575057) and the China Postdoctoral Science Fund (Grant No. 20040350078).
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 328–335, 2007. © Springer-Verlag Berlin Heidelberg 2007
The second approach, usually called BP based fuzzy Petri nets, originated from [6] and is based on the association of a fuzzy description with each transition firing. The fuzzy knowledge is refreshed on the basis of BP learning offline. The object of this work is to extend generalized fuzzy Petri nets by adding fuzzy neurons into the FPN. The fuzzy neurons not only possess self-adapting and self-learning ability but can also be regarded as independent fuzzy neuron components in the PN. The application areas of Petri nets being vigorously investigated involve knowledge representation and discovery [6, 7], robotics [8], process control [9], diagnostics [10], grid computation [11], and traffic control [12], to name a few highly representative domains. The research undertaken in this study directly extends the PN models developed and used therein by defining independent neuron components with self-learning ability. The paper is organized as follows. In Section 2, we review the existing models of Petri nets that have been developed in the framework of fuzzy sets and neural network concepts. Section 3 concentrates on the underlying formalism of the detailed model, where independent neuron components are defined in Petri nets. Section 4 concentrates on the issue of learning in independent neuron components. A robot track decision application example is discussed in Section 5. Finally, conclusions are covered in Section 6.
2 Fuzzy Petri Nets (FPN)

2.1 Petri Nets

A Petri net is a particular kind of directed graph, together with an initial state called the initial marking, M0. The underlying graph N of a Petri net is a directed, weighted, bipartite graph consisting of two kinds of nodes, called places and transitions, where arcs run either from a place to a transition or from a transition to a place. In graphical representation, places are drawn as circles and transitions as bars or boxes. Arcs are labeled with their weights (positive integers); a k-weighted arc can be interpreted as a set of k parallel arcs. A common Petri net can be defined as follows.

Definition 1: A Petri net is a 5-tuple PN = (P, T, A, W, M0), where
- P = {p1, p2, …, pn} is a finite set of places,
- T = {t1, t2, …, tn} is a finite set of transitions,
- A = (P × T) ∪ (T × P) is a set of arcs,
- W: A → {1, 2, 3, …} is a weight function,
- M0: P → {0, 1, 2, 3, …} is the initial marking,
- P ∩ T = Φ and P ∪ T ≠ Φ. ■
A transition t is said to be enabled if each input place p of t is marked with at least w(p,t) tokens, where w(p,t) is the weight of the arc from p to t. An enabled transition may or may not fire, depending on whether or not the corresponding event actually takes place.
H. Xu, Y. Wang, and P. Jia
A firing of an enabled transition t removes w(p,t) tokens from each input place p of t, where w(p,t) is the weight of the arc from p to t, and adds w(t,p) tokens to each output place p of t, where w(t,p) is the weight of the arc from t to p.

2.2 Fuzzy Petri Nets

On the basis of the common PN, a fuzzy Petri net can be defined as follows.

Definition 2: A fuzzy Petri net is a 9-tuple FPN = (P, T, D, A, M0, th, f, W, β), where
- P, T, A and M0 are similar to those in PN,
- D = {d1, d2, …, dn} is the finite set of propositions, each with a truth value in [0,1],
- th: T → [0,1] is the mapping from transitions to thresholds,
- f: T → [0,1] is the mapping from transitions to confidence level values,
- W: P → [0,1] is the mapping from the propositions represented by places to truth values; it represents the supporting level of each place (representing the corresponding proposition condition) for transition firing, with corresponding truth value set {ω1, ω2, …, ωn},
- β: P → D is the mapping from places to propositions,
- P ∩ T ∩ D = Φ and |P| = |D|. ■
In an FPN, the truth value of a place pi, pi ∈ P, is denoted by the weight W(pi) = ωi, ωi ∈ [0,1]. If W(pi) = ωi and β(pi) = di, this configuration states that the confidence level of the proposition di is ωi. A transition ti with only one input place pj ∈ I(ti) is enabled if ωj ≥ th(ti) = λi, where λi is the threshold value. If the transition ti fires, the truth value of its output place is ωj · μi, where μi is the confidence level value of ti. For instance, the following fuzzy production rule can be modeled and subsequently fired as shown in Fig. 1:

IF dj THEN dk (CF = μj), with threshold λj and input truth value ωj. (1)

The truth value ωj of the place pj and the confidence level value μj of the transition are aggregated through the algebraic product yk = ωj · μj, where yk is the truth value of the output place. For more details of FPN, reference [3] gives a comprehensive discussion.
Fig. 1. A Simple Typical FPN Model
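For illustration, the enabling and firing semantics of rule (1) can be sketched in a few lines (an editorial Python sketch, not part of the original paper; the threshold λ = 0.7 and confidence μ = 0.6 echo the initialization values used later in Section 5):

```python
def fire_rule(omega_j, lam_i, mu_i):
    """Fire the FPN rule 'IF dj THEN dk (CF = mu_i)' with threshold lam_i.

    Returns the truth value of the output place, or None when the
    transition is not enabled (input truth value below the threshold).
    """
    if omega_j < lam_i:       # enabling condition: omega_j >= th(t_i) = lam_i
        return None
    return omega_j * mu_i     # algebraic product y_k = omega_j * mu_i

print(fire_rule(0.9, 0.7, 0.6))   # enabled: prints 0.9 * 0.6
print(fire_rule(0.5, 0.7, 0.6))   # not enabled: prints None
```

Only when the antecedent truth value reaches the threshold does the transition fire and propagate the product of truth value and confidence level to the consequent place.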
3 Fuzzy Neural Petri Nets (FNPN)

Besides the basic parts, the building blocks of an FNPN also include neurons. A neuron is a coarse-grained subnet of an FNPN, which can also be regarded as basic
components similar to those in CTPN [13]. The neuron models describe the fuzzy information processing procedure in the form of FNPN sub-nets or components. A simple typical FNPN-based neuron is illustrated in Fig. 2. In Fig. 2, the input signals of a neuron and the threshold function are realized by the input places pi (i = 1,…,n) and the transitions tj (j = 1,…,n). The integration place Pj calculates the transfer function output according to the neuron transfer function. If the result is not less than the threshold, the threshold transition Tk fires.
Fig. 2. A Typical FNPN-based Neuron Model
A fuzzy neural Petri net can be defined as follows.

Definition 3: A fuzzy neural Petri net is an 11-tuple FNPN = (P, T, D, A, M0, Kp, Kt, th, f, W, β), where
- P, T, D, A, M0, th, f, W and β are similar to those in Definition 2,
- Kp is the state set of the hidden layer and output layer,
- Kt is the mapping from T to rule sets. ■

Fig. 3. Abstracting Neurons in FNPN (the top hierarchy contains abstract neuron components; mapping and unfolding relate it to the bottom hierarchy of input, hidden, and output layers)
FNPN can abstract away the realization details of neurons, as illustrated in Fig. 3. According to the modeling and analysis requirements, the neurons in different layers can be abstracted as FNPN-based neuron components. At the same time, the FNPN-based model simplifies the modeling and analysis of a hierarchical neuron model by abstracting neurons or sub-networks with independent self-learning ability. On the other hand, the abstract neurons with self-learning ability can also be unfolded into the whole model directly without changing the connection relations. The model with unfolded neuron FNPN subnets is complex, whereas the abstract model is a simple hierarchical one.
4 Learning in FNPN

Suppose the FNPN model to be trained is n-layered with b ending places pj, j = 1,…,b, and that r learning samples are used to train the FNPN model. The performance evaluation function is defined as

E = (1/2) Σ_{i=1}^{r} Σ_{j=1}^{b} (Mi(pj) − Mi′(pj))² ,  (2)
where Mi(pj) and Mi′(pj) represent the actual marking value and the expected one of the ending place pj, respectively. Suppose ti(n) is a transition on the nth layer, ti(n) ∈ Tn. The weights of the corresponding input arcs are ωi1(n), ωi2(n), …, ωim(n); its threshold is λi(n) and its confidence value is μi(n). If the place pj is one of the output places of the transition ti(n), it is obviously also an ending place. The BP-based learning algorithm [14] is used in FNPN:

δ(n) = dE / d(M(n)(pj)) ,  (3)

dE/dωix(n) = δ(n) × d(M(n)(pj)) / dωix(n) ,  x = 1, 2, …, m−1,  (4)

dE/dμi(n) = δ(n) × d(M(n)(pj)) / dμi(n) ,  (5)

dE/dλi(n) = δ(n) × d(M(n)(pj)) / dλi(n) .  (6)
According to the BP learning algorithm [14], the parameters δ(q), dE/dωix(q), dE/dμi(q) and dE/dλi(q) of the (n−1)th, …, 1st layers can be calculated, where x = 1, 2, …, m−1 and q = n−1, …, 1. The adjustment rules for the parameters of the transition ti(q) are obtained as follows:

ωix(q)(k+1) = ωix(q)(k) − η dE/dωix(q) ,  (7)

where x = 1, …, m−1, q = n, …, 1 and Σx ωix(q) = 1;

μi(q)(k+1) = μi(q)(k) − η dE/dμi(q) ,  (8)

λi(q)(k+1) = λi(q)(k) − η dE/dλi(q) .  (9)

In the above equations, η is the learning rate.
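The updates (7)-(9) amount to plain gradient descent per transition. The following editorial sketch illustrates one update step; the gradient values are hypothetical stand-ins for the derivatives (3)-(6), η = 0.03 echoes Section 5, and the renormalization enforcing Σx ωix = 1 is one possible choice, since the paper only states the constraint:

```python
# One update step for a transition's arc weights, confidence value and
# threshold, following rules (7)-(9). Gradient arguments are placeholders
# for the BP derivatives computed via (3)-(6).
def update_transition(omega, mu, lam, dE_dw, dE_dmu, dE_dlam, eta=0.03):
    omega = [w - eta * g for w, g in zip(omega, dE_dw)]   # rule (7)
    total = sum(omega)
    omega = [w / total for w in omega]   # keep sum(omega) = 1 (our choice;
                                         # the paper only states the constraint)
    mu = mu - eta * dE_dmu               # rule (8)
    lam = lam - eta * dE_dlam            # rule (9)
    return omega, mu, lam

omega, mu, lam = update_transition([0.3, 0.2, 0.5], 0.6, 0.7,
                                   [0.1, -0.1, 0.0], 0.05, -0.02)
print(sum(omega))   # stays 1 (up to float rounding) after renormalization
```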
5 A Simple Example

5.1 Rim Judgment Example for Arc Welding Robots

In shipbuilding, arc welding robots are commonly used to process steel plates into the required form. During processing, the control system of the arc welding robot needs to judge the rim of the processed steel plate in order to plan the processing track. However, the steel plate is always irregular in space, and only the coordinates of the points to be processed can be used to judge the rim. A self-adaptive algorithm is therefore required to complete the uncertain reasoning, so FNPN is used to model the rim judgment procedure. Suppose the coordinate of a measured point on the steel plate is (x, y, z). Since the steel plates are irregular, only the differences between neighboring point coordinates can be used to judge whether a point lies on the steel plate or not. The judgment decision model is constructed on the basis of FNPN in Fig. 4. As the model input, the coordinates are used to conduct fuzzy reasoning in two neurons (PN1 and PN2) with self-learning ability; the decision results are then obtained. If the output of p4 equals 1, the point is on the steel plate; otherwise, if that of p5 is 1, the point is outside. The corresponding FNPN model is illustrated in Fig. 4.
Fig. 4. The FNPN Model for Steel Plate Rim Judgment
For model initialization, the initial parameters are set as follows according to processing experiments: ω11 = 0.3, ω21 = 0.2, ω31 = 0.5; ω12 = 0.3, ω22 = 0.2, ω32 = 0.5; λ1 = 0.7, μ1 = 0.6; λ2 = 0.7, μ2 = 0.6; y11 = 0.5, y21 = 0.5; y12 = 0.5, y22 = 0.5. The FNPN model is trained on 100 groups of testing data, with b = 1000 and η = 0.03. After training, fuzzy reasoning is conducted with the trained FNPN. The reasoning results are listed in Table 1.
Table 1. The Actual Output and the Expected Output

No.   P4 Actual Output   P4 Expected Output   P5 Actual Output   P5 Expected Output
1     0.9898             1                    -0.0085            0
2     0.9898             1                    -0.0085            0
3     0.9896             1                    -0.0087            0
4     0.9896             1                    -0.0087            0
5     0.9896             1                    -0.0087            0
6     0.9896             1                    -0.0087            0
7     0.9896             1                    -0.0087            0
8     0.9898             1                    -0.0085            0
9     0.9896             1                    -0.0086            0
10    -0.0130            0                    0.9774             1
According to the reasoning results of FNPN in Table 1, the FNPN model has been demonstrated to be effective for modeling fuzzy knowledge-based systems in actual applications.

5.2 Application Analysis

In view of the preceding FNPN-based example, and compared with FPN and NN, FNPN manifests the following advantages:

- As an independent self-learning component, the neuron is introduced into the FPN. It can model complex systems that include several steps with independent self-learning NNs or neurons.
- In FNPN, complex neurons or NNs can also be abstracted as FNPN components when the system model needs to be analyzed at the system level. Abstraction and hierarchy is another outstanding feature of FNPN.
6 Conclusions

Fuzziness is a common phenomenon in knowledge-based expert systems, especially in systems with fuzzy production rules. FPN is a powerful tool for modeling fuzzy systems or uncertain discrete event systems. In order to extend FPN with self-learning capability, this paper proposes the fuzzy neural Petri net on the basis of FPN and neural networks. As a kind of FNPN component, neurons are introduced into FNPN. Neurons in FNPN are FNPN sub-nets with BP-based self-learning ability. The parameters of the fuzzy production rules in every neuron component can be trained in its own layer. Neurons are depicted in different layers, which is meaningful for representing FNPN models with multi-rank NNs. State analysis needs to be studied in the future. Xu [15] has proposed an extended State Graph to analyze the state changes of object/component-based models. With temporal fuzzy sets introduced into PN, the confidence level of transition firing (state changing) needs to be considered in the state analysis.
References

[1] Murata, T.: Petri Nets: Properties, Analysis and Applications. Proceedings of the IEEE 77 (1989) 541-580
[2] Peterson, J.L.: Petri Net Theory and the Modeling of Systems. Prentice-Hall, New York (1991)
[3] Pedrycz, W., Gomide, F.: A Generalized Fuzzy Petri Net Model. IEEE Trans. Fuzzy Systems 2 (1994) 295-301
[4] Pedrycz, W., Camargo, H.: Fuzzy Timed Petri Nets. Fuzzy Sets and Systems 140 (2003) 301-330
[5] Scarpelli, H., Gomide, F., Yager, R.: A Reasoning Algorithm for High Level Fuzzy Petri Nets. IEEE Trans. Fuzzy Systems 4 (1996) 282-293
[6] Manoj, T.V., Leena, J., Soney, R.B.: Knowledge Representation Using Fuzzy Petri Nets - Revisited. IEEE Trans. Knowledge and Data Engineering 10 (4) (1998) 666-667
[7] Jong, W., Shiau, Y., Horng, Y., Chen, H., Chen, S.: Temporal Knowledge Representation and Reasoning Techniques Using Time Petri Nets. IEEE Trans. Systems, Man and Cybernetics, Part B 29 (4) (1999) 541-545
[8] Zhao, G., Zheng, H., Wang, J., Li, T.: Petri-net-based Coordination Motion Control for Legged Robot. Proc. IEEE Int'l Conf. on Systems, Man and Cybernetics 1 (2003) 581-586
[9] Tang, R., Pang, G.K.H., Woo, S.S.: A Continuous Fuzzy Petri Net Tool for Intelligent Process Monitoring and Control. IEEE Trans. Control Systems Technology 3 (3) (1995) 318-329
[10] Szücs, A., Gerzson, M., Hangos, K.M.: An Intelligent Diagnostic System Based on Petri Nets. Computers & Chemical Engineering 22 (9) (1998) 1335-1344
[11] Han, Y., Jiang, C., Luo, X.: Resource Scheduling Model for Grid Computing Based on Sharing Synthesis of Petri Net. Proc. 9th Int'l Conf. on Computer Supported Cooperative Work in Design 1 (2005) 367-372
[12] Wang, J., Jin, C., Deng, Y.: Performance Analysis of Traffic Networks Based on Stochastic Timed Petri Net Models. Proc. 5th IEEE Int'l Conf. on Engineering of Complex Computer Systems (1999) 77-85
[13] Wang, J., Deng, Y., Zhou, M.: Compositional Time Petri Nets and Reduction Rules. IEEE Trans. Systems, Man and Cybernetics, Part B 30 (4) (2000) 562-572
[14] Gallant, S.: Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA (1993)
[15] Xu, H., Jia, P.: Timed Hierarchical Object-Oriented Petri Net - Part I: Basic Concepts and Reachability Analysis. Lecture Notes in Artificial Intelligence (Proceedings of RSKT 2006) 4062 (2006) 727-734
Hardware Design of an Adaptive Neuro-fuzzy Network with On-Chip Learning Capability Tzu-Ping Kao, Chun-Chang Yu, Ting-Yu Chen, and Jeen-Shing Wang Department of Electrical Engineering, National Cheng Kung University, Tainan 701, Taiwan, R.O.C
[email protected]

Abstract. This paper aims at the development of a digital circuit for an adaptive neuro-fuzzy network with on-chip learning capability. The on-chip learning capability was realized by a backpropagation learning circuit for optimizing the network parameters. To maximize the throughput of the circuit and minimize its required resources, we proposed to reuse the computational results in both the feedforward and backpropagation circuits. This leads to a simpler data flow and a reduction of resource consumption. To verify the effectiveness of the circuit, we implemented it on an FPGA development board and compared its performance with the neuro-fuzzy system written in a MATLAB® code. The experimental results show that the throughput of our neuro-fuzzy circuit significantly outperforms the NF network written in a MATLAB® code, with a satisfactory learning performance.
1 Introduction

A neural-fuzzy (NF) system is well known for its capability to solve complex applications. However, its high computational demand hinders it from many real-time applications. Realizing NF systems in hardware circuits is a good way to remove this hindrance, and how to design circuits that efficiently process the network computation and economically allocate hardware resources has become an important research topic. When designing a digital NF network circuit, there are several issues to consider: 1) how to implement nonlinear functions as linguistic term sets, 2) how to realize a complex parameter learning algorithm, and 3) how to design a highly efficient circuit. Several researchers have proposed different approaches to implement membership functions in digital circuits. For example, Ref. [1] developed a VLSI fuzzy logic processor with isosceles triangular functions in a digital circuit for controlling the idle speed of engines. A look-up table was proposed to substitute for the direct implementation of nonlinear functions in [2]. A look-up table is easy to realize in a hardware device; however, the precision of computational results using a look-up table may not be satisfactory if memory resources are limited. Regarding the second issue, learning capability is a desirable property in NF networks, but a learning algorithm is usually complicated for hardware realization. Three methods, offline learning, chip-in-the-loop learning, and on-chip learning, have been proposed

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 336–345, 2007. © Springer-Verlag Berlin Heidelberg 2007
for the hardware realization of the parameter learning algorithm [3]. On-chip learning is the most attractive for real-time applications because the parameter training can be performed in an online mode; however, the complexity of such a hardware design is higher than those of the other two methods. In [4], a real-time adaptive neural network was developed to provide instant updates for the parameters of the network under continuous operation. Ref. [5] proposed a logic-oriented neural network with a backpropagation algorithm; the drawback of this approach is that the network training is not stable, due to the quantization of the weights and neuron outputs into integer values. In order to improve the execution performance, Ref. [6] proposed an online backpropagation algorithm with a pipelined adaptation structure that separates the parameter learning into several stages so that the computation in each stage can be performed simultaneously. However, the efficiency of such pipeline operations is lower than that of pipeline scheduling using a dataflow graph. To develop a highly efficient neuro-fuzzy circuit, this study aimed at integrating the following three concepts into the hardware design: data sharing, optimal scheduling, and pipeline architecture. Each of these contributes to increasing the computational speed as well as the efficiency of NF circuits. For data sharing, the computational results of each layer in the feedforward circuit are stored in buffers to establish a database that can be used for the calculation of error gradients in the backpropagation algorithm. This idea not only simplifies the data flow of the whole algorithm but also reduces the resource requirements of the network. We used an integer linear programming approach [7] to obtain an optimal scheduling. Moreover, we investigated the resource consumption and performance of different pipeline architectures to increase the throughput and efficiency of the circuit.
The rest of this paper is organized as follows. In Section 2, we introduce our adaptive neuro-fuzzy network and the functionality of each layer. The hardware implementation of the NF network is presented in Section 3. Section 4 provides the hardware verification of the network as well as simulations using the NF circuit as a controller for a path-following problem. Finally, conclusions are summarized in the last section.
2 Adaptive Neuro-fuzzy Network

We realize the four-layer NF system shown in Fig. 1 as a circuit because of its structural
simplicity. The computation of the network includes two procedures: feedforward and backpropagation.

2.1 Feedforward Computation

The detailed functions of the nodes in each layer are as follows.

- Layer 1: The nodes in this layer only transmit input values to the second layer.
- Layer 2: Each node in this layer represents a membership function, which is an isosceles triangle-shaped function:

μij(2)(xi) = 1 − 2 |xi − aij| / bij ,  (1)

where xi is the input data of the ith input node, and aij and bij are the center and width of the isosceles triangle membership function. The index j indicates the labels of the membership nodes.

- Layer 3: The nodes in this layer represent fuzzy logic rules, and the function of each node is

μl(3) = Π_{i=1}^{n} μij(2) ,  j ∈ {μij(2) with connection to the lth node},  1 ≤ l ≤ m .  (2)

- Layer 4: The inference results of the previous layer are multiplied by specific weights and divided by the sum of the outputs of layer 3 as a defuzzification process:

y = μ(4) = Σ_{l=1}^{m} μl(3) wl / Σ_{l=1}^{m} μl(3) .  (3)
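The feedforward pass (1)-(3) can be sketched in software as follows (an editorial Python sketch; the example network with 2 inputs, 2 membership labels per input, and 4 rules is our own, not from the paper):

```python
def triangle_mf(x, a, b):
    """Eq. (1): isosceles triangular membership with center a and width b."""
    return 1.0 - 2.0 * abs(x - a) / b

def nf_forward(x, centers, widths, rules, weights):
    # Layer 2: membership degrees mu[i][j] for input i and label j
    mu = [[triangle_mf(xi, a, b) for a, b in zip(ca, cb)]
          for xi, ca, cb in zip(x, centers, widths)]
    # Layer 3: rule firing strengths as products over connected labels, eq. (2)
    strengths = []
    for rule in rules:                 # rule = one label index per input
        s = 1.0
        for i, j in enumerate(rule):
            s *= mu[i][j]
        strengths.append(s)
    # Layer 4: weighted-average defuzzification, eq. (3)
    acc = sum(strengths)
    return sum(s * w for s, w in zip(strengths, weights)) / acc

y = nf_forward(x=[0.2, 0.8],
               centers=[[0.0, 1.0], [0.0, 1.0]],
               widths=[[2.0, 2.0], [2.0, 2.0]],
               rules=[(0, 0), (0, 1), (1, 0), (1, 1)],
               weights=[0.0, 0.3, 0.7, 1.0])
print(round(y, 4))   # 0.38
```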
2.2 Backpropagation (BP) Learning Algorithm

Based on the architecture in Fig. 1, a BP algorithm is utilized to update the centers and widths of the membership functions and the weights of the output layer. First, we define the error function that we want to minimize:

E = (1/2)(y − yd)² ,  (4)

where y is the actual output and yd is the desired output. Substituting (1), (2) and (3) into (4), the error function can be expressed as

E = (1/2)( Σ_{l=1}^{m} μl(3)(x) wl / Σ_{l=1}^{m} μl(3)(x) − yd )² = (1/2)( Σ_{l=1}^{m} (Π_{i=1}^{n} μij(2)) wl / Σ_{j=1}^{M} (Π_{i=1}^{n} μij(2)) − yd )² .  (5)

The corresponding error signals for the adjustable parameters are derived as follows:

∂E/∂aij = (∂E/∂μl(3)) (∂μl(3)/∂μij(2)) (∂μij(2)/∂aij) = {[(y − yd) × (1/ACC)] × [Σl wl μl(3) − y Σl μl(3)] × (1/μij(2))} × (2/bij) × sign(xi − aij) ,  (6)

∂E/∂bij = (∂E/∂μl(3)) (∂μl(3)/∂μij(2)) (∂μij(2)/∂bij) = {[(y − yd) × (1/ACC)] × [Σl wl μl(3) − y Σl μl(3)] × (1/μij(2))} × (1/bij) × (1/μij(2) − 1) ,  (7)

∂E/∂wk = (∂E/∂y) (∂y/∂wk) = (y − yd) × (1/ACC) × μk(3) .  (8)

The parameter update rules are described as follows:

aij(t+1) = aij(t) − η ∂E/∂aij ,  (9)

bij(t+1) = bij(t) − η ∂E/∂bij ,  (10)

wij(t+1) = wij(t) − η ∂E/∂wij .  (11)
η is the learning rate.

Fig. 1. Structure of the neuro-fuzzy network (layer 1: inputs x1, …, xn; layer 2: triangular membership nodes; layer 3: rule nodes R1, …, RM; layer 4: outputs y1, …, yP with weights w11, …, wPM)
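To make the weight update concrete, the following editorial sketch applies eq. (8) and the descent rule (11) to the output weights for a fixed set of rule strengths (the rule strengths and target value are hypothetical; the center and width updates via (6), (7), (9), (10) would follow the same pattern):

```python
# One gradient step on the output weights of the network in Fig. 1.
def update_weights(strengths, weights, yd, eta=0.1):
    acc = sum(strengths)                      # ACC = sum of layer-3 outputs
    y = sum(s * w for s, w in zip(strengths, weights)) / acc
    err = y - yd                              # dE/dy for E = (y - yd)^2 / 2
    # Eq. (8): dE/dw_k = (y - yd) * (1/ACC) * mu_k^(3); eq. (11): descent step
    return [w - eta * err * s / acc for w, s in zip(weights, strengths)], y

weights = [0.0, 0.5, 1.0]
strengths = [0.2, 0.5, 0.3]
for _ in range(200):                          # iterate toward the target yd
    weights, y = update_weights(strengths, weights, yd=0.8)
print(round(y, 3))                            # close to the target 0.8
```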
3 Hardware Implementation of Adaptive Neuro-Fuzzy Networks

3.1 Circuit Architecture and Computational-Result Sharing

The hardware design of the NF network is divided into a datapath design phase and a control-path design phase. The datapath design includes a feedforward circuit design and a backpropagation circuit design. In order to reduce the computational complexity in the datapath and to simplify the control signals in the control path, we adopted two approaches in our design. First, we analyzed the calculation regularity of the NF network to decompose the circuit into several modules and to avoid redundant operations. Second, we accelerated the computational process by sharing the computational results that are required in both circuits. To identify which computational results can be shared, we analyzed the equations of the feedforward and backpropagation procedures to extract the mathematical terms that appear in both. That is, we store the computational results that are obtained in the feedforward circuit and will be used in the backpropagation circuit in specific memory locations. Such storage avoids a great amount of redundant computation. We partitioned the feedforward computation into three primary modules: the membership function module, the fuzzy inference engine, and the defuzzifier. Based on this partition, datapaths are scheduled to achieve as many concurrent executions as possible without
any violation of the restrictions of data dependency and resource sharing. We employed an integer linear programming approach to achieve optimal scheduling. After the datapath analysis, we divided the feedforward computation into three asynchronous parts, each realized by a synchronous fine-grain pipeline architecture to accelerate the computational speed of the circuit. The backpropagation algorithm, however, is realized by a synchronous pipeline circuit because of its continuous update process and resource limitations. During the data transfer, each module should process and transfer information containing data values, data indexes, and calculation flags from/to its previous/following module. In addition, the data communication follows handshaking logic to ensure the logical order of circuit events and to avoid race conditions. Based on these issues and on the optimal scheduling and allocation analysis, we propose a new control approach that integrates asynchronous and synchronous design methodologies: we construct synchronous circuits for the functional modules and design an asynchronous circuit for the communication between the three modules. The islands of synchronous units are connected by an asynchronous communication network, as illustrated in Fig. 2. We named this architecture a globally-asynchronous locally-synchronous circuit.
Fig. 2. Asynchronous communication approach with islands of synchronous units (handshaking via Req/Ack and Start/Done signals between functional modules F1, F2, F3 and registers R1, R2, R3)
3.2 Dataflow of Backpropagation Algorithm

The backpropagation learning algorithm can be expressed in the form shown in Fig. 3. In the equations, we use several labels to represent the buffers that store the computational results obtained in the feedforward circuit. These data storages enable the backpropagation circuit to calculate the error signals for the adjustable parameters efficiently, owing to the data sharing and the omission of repeated data computations. In addition, the data sharing leads to a simpler data flow and a reduction of resource consumption without any additional cost. Here, we provide an example to illustrate our idea of data sharing. From a circuit point of view, the term (Σl wl μl(3) − y Σl μl(3)) × (1/bij) × (1/μij(2)) in the update rule of aij is complicated to implement. In order to utilize the resources efficiently, some buffers are designed in the feedforward circuit to store the computational results that can be
shared in the backpropagation algorithm. According to the idea of data sharing, the terms 1/(bij × μij(2)), Σl wl μl(3), and Σl μl(3) calculated in the feedforward computation can be
stored in the buffers temporarily and retrieved to reduce the design complexity of the learning rules for tuning aij in the backpropagation circuit. Similarly, the other two update formulas in Fig. 3 can be simplified by the same idea as the first update formula. Note that the data dependency between these formulas changes because of the data sharing; this constraint should be considered in the scheduling optimization.

Fig. 3. Learning rules for sharing computational results (the error signals (6)-(8) rewritten in terms of buffered quantities such as err, ACC, rule_buf, Inv_b, ReuseRul and Reuse_tune_a)
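The buffering idea can be sketched in software as follows (an editorial Python sketch; the buffer names echo the labels in Fig. 3, but the counter and the example network are our own illustration). The feedforward pass writes its intermediate terms into a store, and the backpropagation step reads them back instead of recomputing them:

```python
class SharedBuffers:
    """Stores feedforward results for reuse in backpropagation."""
    def __init__(self):
        self.store = {}
        self.reads = 0

    def put(self, name, value):
        self.store[name] = value

    def get(self, name):
        self.reads += 1            # each read replaces a recomputation
        return self.store[name]

def feedforward(strengths, weights, buf):
    acc = sum(strengths)
    num = sum(w * s for w, s in zip(weights, strengths))
    buf.put("rule_buf", strengths)   # mu_l^(3) values
    buf.put("ACC", acc)              # sum of layer-3 outputs
    return num / acc

def backprop_weight_grads(y, yd, buf):
    # Eq. (8): dE/dw_k = (y - yd) * mu_k^(3) / ACC, using buffered values
    acc = buf.get("ACC")
    return [(y - yd) * s / acc for s in buf.get("rule_buf")]

buf = SharedBuffers()
y = feedforward([0.2, 0.5, 0.3], [0.0, 0.5, 1.0], buf)
grads = backprop_weight_grads(y, yd=0.8, buf=buf)
print(buf.reads)   # 2 reads served from the store instead of recomputation
```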
3.3 Pipeline Architecture of Backpropagation Circuit

In the backpropagation learning circuit, the datapath updating wl is designed as two pipeline stages with just one clock latency, so the throughput of this path equals the clock rate. Two choices are available for the update path of aij and bij: 1) a non-pipeline circuit with two multipliers, and 2) a structure with one pipeline latency. The first choice takes 70 control steps, while the second takes only 18 control steps but needs three additional multipliers. With the second choice the computation of the circuit is the fastest; however, the update of the weights wl takes 50 control steps, and the cost in control steps is determined by the weight-update procedure. That means the update procedure has to wait 32 control steps until the update of the weights finishes; hence, this result is not desirable. Fig. 4 shows the final pipeline scheduling: there are two pipeline latencies in the update datapaths of aij and bij, and the execution process takes 32 control steps.

Table 1. Performance Analysis for Different Pipeline Latencies

                 Backpropagation Learning Circuit
Parameters       wl            aij and bij
Pipeline         Yes           No            Yes            Yes
Latency          1             0             1              2
Multiplier       1             1             4              2
Execution Step   49+1=50       5×14=70       1×14+4=18      2×14+4=32
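The control-step counts for the aij/bij path follow a simple pattern across the 14 parameter updates per iteration; the following sketch is our own reading of the table's arithmetic (the constants 5, 14 and 4 are taken from the entries 5×14 and latency×14+4):

```python
# Control-step arithmetic behind the aij/bij columns of Table 1.
def execution_steps(pipelined, latency, n_updates=14):
    if not pipelined:
        return 5 * n_updates           # serial: 5 control steps per update
    return latency * n_updates + 4     # pipelined: pipeline fill/drain adds 4

print(execution_steps(False, 0))   # 70: non-pipelined path
print(execution_steps(True, 1))    # 18: one-latency pipeline, 4 multipliers
print(execution_steps(True, 2))    # 32: the chosen two-latency design
```

The two-latency design is chosen because 32 steps still fits under the 50-step weight-update path while needing only one extra multiplier.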
Fig. 4. The DFG of pipeline scheduling (latency = 1 for wl, latency = 2 for aij and bij)
Fig. 5. Block diagram of the modular NF structure (NF top-level: an FSM controller with main, ALU and handshaking FSMs; forward- and backward-path register files; a sharing memory for computational results; and an ALU with multiplication and division units)
Although the cost in execution steps is larger than 18, it is still smaller than that of the update procedure for the parameters wl. Furthermore, this structure requires only one additional multiplier to obtain a better circuit performance. Fig. 5 shows the modular block diagram of the NF circuit. The results of the performance analysis are provided in Table 1.
4 Hardware Simulations and Verification

The proposed architecture has been coded in Verilog using a register transfer level (RTL) model. Before the RTL code of the NF network circuit was synthesized, we used MATLAB® to establish a software simulation platform for function verification. This platform simulates the NF circuit as a controller that learns how to drive a vehicle to follow a planned trajectory. Fig. 6 illustrates the car-driving system, in which the NF circuit was implemented in an FPGA device and served as a forward controller, while the remaining blocks were simulated on a PC. In order to online
Fig. 6. Simulation platform of the car-driving system (the neural-fuzzy controller on the FPGA exchanges data through a communication interface with the PC-simulated trajectory planner, P controller and car system)

Table 2. Comparison of Hardware Execution and Software Simulation

System Output   System Output      Register       Error    Error in
(Decimal)       (Shifted Binary)   (Hardware)              Decimal
0.123194        2018               2024           -6       -0.00037
-37.9291        -621430            -621395        -35      -0.002136
28.8083         471995             472039         -44      -0.00269
27.9886         458565             458620         -55      -0.00336
40              655360             655355         5        0.000305
10.3094         168909             168931         -22      -0.00134
6.4178          105149             105201         -52      -0.00317
-8.8147         -144420            -144384        -36      -0.002197
9.7273          159372             159255         117      0.007141
36.1794         592763             592723         40       0.002441
Average:
11.28102        184828.1           213713.7       -8.8     -0.00054
train the NF parameters, we used a proportional controller as an auxiliary controller, not only to compensate for the insufficiency of the forward controller in achieving a satisfactory trajectory-following accuracy but also to provide an error signal for tuning the parameters of the NF circuit. The planned trajectory was generated by a path planning algorithm that is able to find the shortest path from a given initial location to a final destination and to avoid obstacles in a globally optimal manner. The training patterns of the NF circuit were obtained by discretizing the trajectory with a fixed sampling time. The learning objective of the NF circuit was to follow the planned trajectory with minimal error. The learning process was performed iteratively to tune the NF parameters and was stopped once the total mean square error reached a pre-specified criterion. This platform is used to verify the effectiveness of the NF network circuit and to compare the efficiency of the NF network implemented in a hardware device and
a software system. Because we used only 14 bits to represent the decimal values, the output values were sometimes not as accurate as those of the software simulation platform. Some examples of the numerical errors are given in Table 2 to show the difference between the software simulation and the hardware execution. The same generated trajectory inputs were used in both the RTL and MATLAB® simulations. In our empirical experience, this small error did not cause any instability during parameter learning or any degradation in learning performance. Fig. 7 illustrates the learning result of the NF controller implemented in an FPGA device driving the car to follow the path generated by the path planning algorithm. From the figure, we can see that the actual path is very close to the optimal path after several iterations of parameter learning. As shown in Table 3, the throughput rate of the NF network implemented in an FPGA is much higher than that of the software simulation. In particular, the performance of the backpropagation learning is excellent because of the effectiveness of our pipeline architecture and data sharing.
Fig. 7. Comparison of the learning results obtained by using the NF circuit in an FPGA device and the software written in MATLAB® code for the trajectory-following application

Table 3. Throughput of execution on MATLAB® and FPGA

           Feedforward Circuit            Backpropagation Circuit
           Throughput Rate                Throughput Rate
MATLAB®    0.438 KHz (period 2.28 ms)     0.1 KHz (period 9.18 ms)
FPGA       308.64 KHz (period 3.24 μs)    510.21 KHz (period 1.96 μs)
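The numerical gap between the FPGA and the MATLAB® results comes from the 14-bit fixed-point representation mentioned above. The effect can be illustrated with a small quantization sketch; the 10-bit fractional split used here is an assumption for illustration, since the paper does not specify the exact word format:

```python
def quantize(x, total_bits=14, frac_bits=10):
    """Round x onto a signed fixed-point grid with the given word format."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))        # most negative representable integer
    hi = (1 << (total_bits - 1)) - 1     # most positive representable integer
    q = max(lo, min(hi, round(x * scale)))
    return q / scale

x = 0.123456
xq = quantize(x)
# rounding error is bounded by half an LSB of the fractional part
assert abs(x - xq) <= 2 ** -(10 + 1)
```

Values outside the representable range saturate, and each in-range value is off by at most half a least-significant bit — the kind of small, bounded error Table 2 reports.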
Hardware Design of an Adaptive Neuro-fuzzy Network
345
5 Conclusion
This paper presents a digital hardware implementation of an adaptive neuro-fuzzy network with on-chip learning capability. We proposed a data-sharing scheme to reduce the complexity of the hardware implementation of the backpropagation learning algorithm. By avoiding repetition of the same computations, we believe that the consumption of hardware resources is greatly reduced while the throughput rate is increased significantly. Finally, we implemented the circuit in an FPGA device to serve as a controller for driving a car along a desired trajectory. The simulation results show that the throughput of our NF circuit significantly outperforms the NF network written in MATLAB® code while maintaining satisfactory learning performance.
References
[1] Jin, W.W., Jin, D.M., Zhang, X.: VLSI Design and Implementation of a Fuzzy Logic Controller for Engine Idle Speed. Proc. of 7th IEEE Int'l. Conf. on Solid-State and Integrated Circuits Technology 3 (2004) 2067-2070
[2] Marchesi, M., Orlandi, G., Piazza, F., Pollonara, L., Uncini, A.: Multi-layer Perceptrons with Discrete Weights. Int'l. Joint Conf. on Neural Networks 2 (1990) 623-630
[3] Reyneri, L.M.: Implementation Issues of Neuro-Fuzzy Hardware: Going Toward HW/SW Codesign. IEEE Trans. Neural Networks 14 (2003) 176-179
[4] Yi, Y., Vilathgamuwa, D.W., Rahman, M.A.: Implementation of an Artificial-Neural-Network-Based Real-Time Adaptive Controller for an Interior Permanent-Magnet Motor Drive. IEEE Trans. Industry Applications 39 (2003) 96-104
[5] Kamio, T., Tanaka, S., Morisue, M.: Backpropagation Algorithm for Logic Oriented Neural Networks. Proc. of the IEEE Int'l. Joint Conf. on Neural Networks 2 (2000) 123-128
[6] Girones, R.G., Salcedo, A.M.: Systolic Implementation of a Pipelined On-Line Backpropagation. Proc. of the 7th Int. Conf. on Microelectronics for Neural, Fuzzy and Bio-Inspired Systems (1999) 387-394
[7] Hwang, C.T., Lee, J.H., Hsu, Y.C.: A Formal Approach to the Scheduling Problem in High Level Synthesis. IEEE Trans. Computer-Aided Design 10 (1991) 464-475
Stock Prediction Using FCMAC-BYY
Jiacai Fu1, Kok Siong Lum2, Minh Nhut Nguyen2, and Juan Shi1
1 Research Centre of Automation, Heilongjiang Institute of Science and Technology, Harbin, China
2 School of Computer Engineering, Nanyang Technological University, Singapore 639798
Abstract. The increasing reliance on Computational Intelligence applications to predict stock market positions has resulted in numerous studies on financial forecasting and trading trend identification. Stock market price prediction applications need to adapt to new incoming data and to learn quickly, owing to the volatile nature of market movements. This paper analyses stock market price prediction based on a Fuzzy Cerebellar Model Articulation Controller – Bayesian Ying Yang (FCMAC-BYY) neural network. The model is motivated by the ancient Chinese Ying-Yang philosophy, which states that everything in the universe can be viewed as a product of a constant conflict between opposites, Ying and Yang; a perfect status is reached when Ying and Yang achieve harmony. Experiments on a set of real stock market data (Singapore Airlines Ltd – SIA) from the Singapore Exchange (SGX) and on the Ibex35 stock index show the effectiveness of FCMAC-BYY in universal approximation and prediction.
1 Introduction
Charting has long been the main analysis approach for stock market prediction, and many mathematical methods have been used to forecast stock market movements. Sornette and Zhou [1] proposed a mathematical method based on a theory of imitation between stock market investors and their herding behavior. However, due to the volatile nature of the stock market, its movement generally does not follow any mathematical formula, which limits the accuracy of mathematical prediction models. Neural networks have long been utilized for this purpose because of their excellent ability to depict and replicate complicated stock market patterns. The Cerebellar Model Articulation Controller (CMAC) is a type of associative memory neural network first proposed by Albus in 1975 [2]. CMAC imitates the human cerebellum, which allows it to learn fast and carry out local generalization efficiently. However, the associative memory nature of CMAC does not provide differential information between input and output, and it imposes excessive memory requirements [9]. Chiang and Lin [3] proposed a fuzzy CMAC (FCMAC) that introduces fuzzy sets as the input clusters of the neural network.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 346–351, 2007. © Springer-Verlag Berlin Heidelberg 2007
The differentiable property between the input
state and the actual output value can be obtained by computing the differentiation of the output with respect to its input. In addition, Bayesian Ying-Yang (BYY) learning [4] is applied in the fuzzification layer to improve the approximation of the input clusters; consequently, the FCMAC-BYY model was proposed in our previous work [5]. The motivation of this paper comes from the popularity of various tools for predicting stock market data. The remainder of this paper is structured as follows. Section 2 describes the FCMAC-BYY structure, while Section 3 illustrates experimental results on the benchmark dataset (Ibex35) as well as real-life data (SIA). Finally, Section 4 concludes the paper.
2 FCMAC-BYY Model
As shown in Figure 1, the FCMAC-BYY neural network has a five-layer hierarchical structure, namely the Input Layer, Fuzzification Layer, Association Layer, Post Association Layer and Output Layer.

Fig. 1. The structure of FCMAC-BYY
2.1 FCMAC-BYY Structure
The Input Layer is the layer where the input data is obtained from the retrieved information. Bayesian Ying-Yang fuzzification is performed on the input training dataset in the Fuzzification Layer in order to obtain fuzzy clusters. The Association Layer is the rule layer, where each association cell represents a fuzzy rule. A cell is activated only when all the inputs to it are fired. The Association Layer is then mapped to the Post Association Layer, where a cell is fired if any of its connected inputs is activated. Adapting the Credit Assigned-FCMAC [11] methodology, a variable named f_freq was added to each cell to count the number of times the cell has fired. Using this approach, cells that fire more frequently learn at a reduced rate. Prior to this change, the weights were updated according to the total
number of fired cells instead. The following formula is applied to update the weights in CA-FCMAC:

ω_j^(i) = ω_j^(i−1) + [α (Σ_{l=1}^{m} f(l) f(l)) / φ] (y_d − y_j)    (1)

where φ = Σ_{l=1}^{m} {Σ_{l=1}^{m} f(l) f(l)}, ω_j^(i) is the weight of the jth cell after i iterations, α is the learning rate, y_d and y_j are the desired and calculated outputs respectively, and f(l) returns the variable f_freq. Using the derived φ, Eq. (1) reduces the learning rate of a cell proportionally as its firing frequency increases. Finally, the defuzzification center of area (COA) method [6] is used to compute the output in the Output Layer.
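The credit-assignment idea behind Eq. (1) — cells that have fired often receive a smaller share of the correction — can be sketched as follows. The exact form of the f(l) term is hard to recover from the typeset equation, so this sketch uses an inverse-frequency credit 1/(f(l)+1) normalized over the fired cells; it is an illustration of the mechanism, not the authors' exact formula:

```python
import numpy as np

def ca_update(w, fired, f_freq, y_d, y_j, alpha=0.1):
    """Update the weights of the fired cells; frequently fired cells learn less.

    fired  : indices of the activated association cells
    f_freq : firing counts (the f_freq variable of each cell)
    """
    credit = 1.0 / (f_freq[fired] + 1.0)   # assumed credit term (hedged)
    credit /= credit.sum()                  # normalize over the fired cells
    w[fired] += alpha * credit * (y_d - y_j)
    f_freq[fired] += 1                      # record this firing
    return w
```

A cell with a large firing count gets a small normalized credit, mirroring the reduced learning rate described above.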
2.2 A Ying-Yang Approach to Fuzzification
In this research, BYY fuzzification is performed on the input data patterns to obtain input clusters. Treating both x and y as random processes, the joint distribution can be calculated by either of these two formulas:

p(x, y) = p(y | x) p(x)    (2)
q(x, y) = q(x | y) q(y)    (3)

The decomposition in Eq. (2) follows the Yang concept, with the visible domain p(x) regarded as a Yang space and the forward pathway p(y | x) as a Yang pathway. Similarly, Eq. (3) is regarded as a Ying space, with the backward pathway q(x | y) as a Ying pathway. Both equations should return the same result for the joint distribution; however, this is the case only when the solution is optimal. The backward/running (Ying) model and the forward/training (Yang) model can be computed using Eqs. (4) and (5), respectively:

q(x) = ∫ q(x | y) q(y) dy    (4)
p(y) = ∫ p(y | x) p(x) dx    (5)

Eq. (5) focuses on the mapping of the input data x into a cluster representation y via the forward propagation distribution p(y | x), while Eq. (4) focuses on the generation of the input data x from a cluster representation y via the backward propagation distribution q(x | y). Under the Ying-Yang harmony principle, the difference between the two Bayesian representations in Eqs. (2) and (3) is minimized. Thus, the trade-off between the forward/training model and the backward/running model is optimized: the input data are well mapped into the clusters, and at the same time the clusters cover the input data well. Eventually, Eqs. (2) and (3) produce the same result when Ying and Yang achieve harmony, and FCMAC-BYY then has the highest generalization ability. For further details, readers may refer to [5].
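For discrete x and y, the harmony condition can be checked numerically: building the Ying decomposition q(x|y)q(y) from the Yang joint p(y|x)p(x) reproduces the same joint distribution, so the divergence between the two representations vanishes. The toy numbers below (3 input bins, 2 clusters) are invented purely to illustrate the principle:

```python
import numpy as np

p_x = np.array([0.5, 0.3, 0.2])              # Yang space p(x)
p_y_given_x = np.array([[0.9, 0.1],          # Yang pathway p(y|x)
                        [0.2, 0.8],
                        [0.5, 0.5]])
yang = p_x[:, None] * p_y_given_x            # p(x,y) = p(y|x) p(x)

q_y = yang.sum(axis=0)                       # cluster marginal, as in Eq. (5)
q_x_given_y = yang / q_y                     # Ying pathway q(x|y)
ying = q_x_given_y * q_y                     # q(x,y) = q(x|y) q(y)

# At harmony the two Bayesian representations coincide, so the
# Kullback-Leibler divergence between them is zero.
kl = float(np.sum(yang * np.log(yang / ying)))
assert abs(kl) < 1e-12
```

When the two pathways are instead parameterized independently (as during learning), the divergence is positive, and minimizing it drives the model toward the harmonious state described above.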
3 Experimental Results
Two experiments were conducted using data obtained from the price values of the Ibex35 index and SIA stock. The results of the tests are as follows.

3.1 Ibex35 Index Data
The Ibex35 is a capitalization-weighted stock market index comprising the 35 most liquid Spanish stocks traded in the continuous market, and it is the benchmark index of the Bolsa de Madrid. The extensive Spanish Ibex35 daily stock price data [7] were chosen because the Ibex35 is a popular index that is also used as a benchmark for other prediction tools [10]. 1000 samples were chosen for training and 500 for testing. The dataset was run through various neural networks (multi-layer perceptron (MLP), conventional CMAC, FCMAC, CA-FCMAC and FCMAC-BYY) for comparison. The results are shown in Figure 2 and their performance in Table 1.
Fig. 2. Comparison Chart on Ibex35 index using various neural networks
The three-layer MLP used four neurons in its hidden layer, in a 4-4-1 layout. A Gaussian function was used for clustering the data in FCMAC. The MLP produced highly accurate results and has a low memory requirement. However, it is hard to determine the optimal number of hidden-layer neurons, and an MLP operates like a black box, with its computation hidden from the user. From Figure 2, CMAC prediction is not as accurate as that of FCMAC. Furthermore, FCMAC was able to produce its prediction using less memory. CA-FCMAC capitalized on the FCMAC structure to achieve these results in fewer computation cycles. Finally, FCMAC-BYY was able to produce similar results with an even lower memory requirement.
Table 1. Comparison Table

Model        MSE       Iterations   Memory used
MLP          0.00037   93           8
CMAC         0.00225   29           130321
FCMAC        0.00119   48           6336
CA-FCMAC     0.00092   43           6336
FCMAC-BYY    0.00090   38           4096
3.2 SIA Stock Data
The second test was conducted on SIA stock, which is listed on the Mainboard of the Singapore Exchange (SGX). Stock prices were collected from the SGX website [8] at 5-minute intervals from 15 September 2006 to 15 October 2006. The information was then parsed and analyzed using the FCMAC-BYY system. In total, 350 data samples were used as the training dataset and 150 as the testing dataset.
Fig. 3. FCMAC-BYY prediction of SIA stock
From Figure 3, it can be observed that FCMAC-BYY was able to closely follow the movement of the stock prices. A mean square error of 0.00104 was achieved using 625 cells. By using BYY to cluster the input dataset, fewer cells were needed and the memory requirement was thus reduced. Furthermore, the number of computation cycles required decreased while the accuracy of the prediction was not compromised. In all, FCMAC-BYY improved the overall efficiency of the prediction through its proficient BYY clustering methodology.
4 Conclusion
Stock prediction applications have come a long way and have shown great progress. This paper proposes an associative memory structure that contains two modules: a BYY input-space clustering module and an FCMAC neural network approximation system. BYY is based on the ancient Ying-Yang philosophy and aims to find clusters that represent the input data. The proposed FCMAC system includes a non-constant, differentiable Gaussian basis function to preserve derivative information, so that a gradient descent method can serve as the learning rule. Together, FCMAC-BYY has been used here successfully to model stock market movements and to produce accurate predictions. Its ability to adequately cluster the input patterns allows accurate prediction with lower memory usage. Experimental results indicate that FCMAC-BYY has a high learning speed while maintaining a low memory requirement. FCMAC-BYY was able to perform non-linear approximations and on-the-fly updates while keeping its memory requirement lower than conventional CMAC structures as well as the original FCMAC.
References
1. Sornette, D., Zhou, W.X.: The US 2000-2002 Market Descent: How Much Longer and Deeper? Taylor and Francis Journals Quantitative Finance 2 (2002) 468-481
2. Albus, J.S.: Data Storage in the Cerebellar Model Articulation Controller (CMAC). Transactions of the ASME, Dynamic Systems Measurement and Control 97 (1975) 228-233
3. Lin, C.S., Chiang, C.-T.: Learning Convergence of CMAC Technique. IEEE Trans. Neural Networks 8 (1997) 1281-1292
4. Xu, L.: Advances on BYY Harmony Learning: Information Theoretic Perspective, Generalized Projection Geometry, and Independent Factor Auto-Determination. IEEE Trans. Neural Networks 15 (2004) 885-902
5. Nguyen, M.N., Shi, D., Quek, C.: FCMAC-BYY: Fuzzy CMAC Using Bayesian Ying-Yang Learning. IEEE Trans. Syst. Man Cybern. B: Cybernetics 36 (2006) 1180-1190
6. Lee, E.S., Zhu, Q.: Fuzzy and Evidence Reasoning. Physica-Verlag (1995)
7. Spain Ibex35 historical daily closing stock price (Online). Available: Yahoo! Finance website. URL: http://finance.yahoo.com/q?s=%5EIBEX&d=t
8. SIA stock price value. Available: Singapore Exchange website. URL: http://www.ses.com.sg/
9. Hu, J., Pratt, F.: Self-Organizing CMAC Neural Networks and Adaptive Dynamic Control. IEEE International Conference on Intelligent Control, Cambridge, MA (1999) 15-17
10. Górriz, J.M., Puntonet, C.G., Salmerón, M., Lang, E.W.: Time Series Prediction Using ICA Algorithms. IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 8-10 September (2003), Lviv, Ukraine
11. Su, S.-F., Tao, T., Hung, T.-H.: Credit Assigned CMAC and Its Application to Online Learning Robust Controllers. IEEE Trans. Syst. Man Cybern. B: Cybernetics 33 (2003)
A Hybrid Rule Extraction Method Using Rough Sets and Neural Networks
Shufeng Wang, Gengfeng Wu, and Jianguo Pan
Department of Computer Science, Shanghai University, 149 Yanchang Road, Shanghai 200072, P.R. China
[email protected], [email protected], [email protected]

Abstract. Rough sets and neural networks are two techniques commonly applied to rule extraction from data tables. Integrating the advantages of the two approaches, this paper presents a Hybrid Rule Extraction Method (HREM) using rough sets and neural networks. In HREM, rule extraction is mainly done with rough sets, while neural networks serve only as a tool to reduce the decision table and filter its noise before the final knowledge (rule sets) is generated from the reduced decision table by rough sets. Therefore, HREM avoids the difficulty of extracting rules from a trained neural network and possesses the robustness that rough-sets-based approaches lack. The effectiveness of HREM is verified by comparing experimental results with traditional rough sets and neural network approaches.
1 Introduction
One important issue in data mining is classification, which has attracted great attention from researchers [11]. Rough sets and neural networks are two technologies frequently applied to data mining tasks [12, 13]. The common advantage of the two approaches is that they do not need any additional information about the data, such as probabilities in statistics or grades of membership in fuzzy set theory. Rough sets theory, introduced by Pawlak in 1982, is a mathematical tool to deal with vagueness and uncertainty of information. It has proved very effective in many practical applications. However, in rough sets theory the deterministic mechanism for the description of error is very simple; therefore, the rules generated by rough sets are often unstable and have low classification accuracy. Neural networks are considered among the most powerful classifiers for their low classification error rates and robustness to noise. But neural networks have two obvious shortcomings when applied to data mining problems. The first is that neural networks require a long time to train on the huge amounts of data in large databases. Secondly, neural networks lack explanation facilities for their knowledge: the knowledge of a neural network is buried in its structure and weights, and it is often difficult to extract rules from a trained network. The combination of rough sets and neural networks is very natural given their complementary features. One typical approach is to use rough sets as a pre-processing tool for the neural networks [1, 2].
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 352–361, 2007. © Springer-Verlag Berlin Heidelberg 2007
By eliminating the redundant data
from the database, rough sets methods can greatly accelerate network training and improve prediction accuracy. In [4], a rough sets method was also applied to generate rules from trained neural networks. In these hybrid systems, neural networks are the main knowledge bases, and rough sets are used only as a tool to speed up or simplify the process of using neural networks to mine knowledge from databases. In [3], a rule set, i.e., part of the knowledge, is first generated from a database by rough sets. Then a neural network is trained on data from the same database. In the prediction phase, a new object is first predicted by the rule set; if it does not match any of the rules, it is then fed into the neural network to get its result. Although this hybrid model can achieve high classification accuracy, part of the prediction knowledge is still hidden inside the neural network and is not comprehensible to the user. In this paper, from a new perspective, we develop a Hybrid Rule Extraction Method (HREM) using rough sets and neural networks to mine classification rules from large databases. Compared with previous research, our study makes the following contributions. (1) We reduce the attributes of the decision table in three steps. In the first step, irrelevant and redundant attributes are removed from the table by a rough sets approach without loss of any classification information. In the second step, a neural network is used to eliminate noisy attributes from the table while the desired classification accuracy is maintained. In the third step, the final knowledge, mainly represented as classification rules, is generated from the reduced decision table by rough sets. (2) In our HREM, neural networks are used only as a tool to reduce the decision table and filter its noise. The final classification rules are generated from the reduced decision table by rough sets, not from the trained neural networks.
2 Preliminaries
2.1 Binary Discernibility Matrix
Let T = (U, C ∪ D, V, f) be a decision table. In general, D can be transformed into a set that has only one element without changing the classification of U, that is, D = {d}. Every value of d corresponds to one equivalence class of U/ind(D), which is also called the class label of an object. A binary discernibility matrix represents the discernibility between pairs of objects in a decision table. Let M be the binary discernibility matrix of T; its element M((s, t), i) indicates the discernibility between two objects x_s and x_t with different class labels by a single condition attribute c_i, defined as follows:

M((s, t), i) = 1 if c_i(x_s) ≠ c_i(x_t), and 0 otherwise,

where 1 ≤ s < t ≤ m, d(x_s) ≠ d(x_t), and i ∈ {1, 2, ..., n}. It can be seen that M has n columns and at most m(m − 1)/2 rows. Each column of M represents a single condition attribute, and each row of M represents an object pair having different d values.
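A direct construction of the binary discernibility matrix follows the definition above; the small decision table used here is invented purely for illustration:

```python
import numpy as np

def discernibility_matrix(C, d):
    """Binary discernibility matrix: one row per object pair with different labels."""
    m, n = C.shape
    rows = []
    for s in range(m):
        for t in range(s + 1, m):
            if d[s] != d[t]:                    # only pairs with different class
                rows.append((C[s] != C[t]).astype(int))
    return np.array(rows)

C = np.array([[1, 0],                            # condition attribute values
              [1, 1],
              [0, 1]])
d = np.array([0, 1, 1])                          # class labels
M = discernibility_matrix(C, d)
# pairs (x1,x2) and (x1,x3) have different labels, so M has 2 rows and 2 columns
```

Each row records, attribute by attribute, whether that attribute discerns the pair — the bit-wise representation the RSAR algorithm of Sect. 3.2.1 operates on.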
2.2 Attribute Reduction by Rough Sets and Neural Networks
Attribute reduction is the process of finding an optimal subset of all attributes according to some criterion, so that the attribute subset is good enough to represent the classification relation of the data. Attributes deleted in attribute reduction can be classified into two categories. The first category contains irrelevant and redundant attributes that have no classification ability: an irrelevant attribute does not affect classification in any way, and a redundant attribute does not add anything new to classification. The second category contains noisy attributes. These attributes exhibit some classification ability, but this ability disturbs the mining of the true classification relation because of the effect of noise. In general, rough sets theory provides useful techniques to remove irrelevant and redundant attributes from a large database with many attributes. However, it is not so satisfactory for the reduction of noisy attributes, because the classification region defined by rough sets theory is relatively simple and rough-sets-based attribute reduction criteria lack effective validation methods. For example, the dependency γ and the information entropy H are the two most common attribute reduction measures in rough sets theory. When using them to measure attribute subsets, a subset with a high γ may contain noisy attributes and degrade the generalization of classification, while H may overestimate the noise and delete useful attributes. Neural networks have the ability to approximate any complex function and possess good robustness to noise. Therefore, we believe that the nonlinear mapping ability for classification relations and the cross-validation mechanism provided by neural networks give us a better chance to eliminate noisy attributes while preserving useful attributes. However, a neural network will take a long training time for attribute reduction when treating a large number of attributes.
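The effect that such network-based reduction exploits — connections feeding from a noisy, irrelevant input stay weak after training — can be shown with a minimal sketch. A plain logistic unit trained by gradient descent stands in for the full network here, and the data and weight-magnitude pruning rule are invented for illustration (the NNFS algorithm of [5] uses its own iterative node-removal procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # x0 is informative, x1 is pure noise
y = (X[:, 0] > 0).astype(float)

w, b = np.zeros(2), 0.0
for _ in range(500):                      # simple logistic-regression training
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y                             # gradient of the log-loss
    w -= 0.1 * X.T @ g / len(y)
    b -= 0.1 * g.mean()

# Prune the input whose connection weight is smallest in magnitude.
pruned = int(np.argmin(np.abs(w)))
```

After training, the weight on the informative input dominates, so the magnitude criterion removes the noisy attribute while classification ability is preserved.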
2.3 Rule Extraction by Rough Sets and Neural Networks
Extracting rules with neural networks is usually difficult because of the nonlinear and complicated nature of the data transformation conducted in the multiple hidden layers. Although neural network researchers have proposed many methods to discover symbolic rules from a trained neural network, these methods are still very complicated when the network is large. The algorithms for extracting rules from trained neural networks were summarized in [10]. Compared to neural network approaches, rule extraction by rough sets is relatively simple and straightforward, requiring no extra computational procedures before the rules are extracted.
3 Development of HREM
3.1 The Procedures of HREM
The HREM consists of three major phases:
1. Attribute reduction by rough sets. Using a rough sets approach, a reduct of the condition attributes of the decision table is obtained. A reduct table is then derived from the decision table by removing those attributes that are not in the reduct.
2. Further reduction of the decision table by neural networks. Through a neural network approach, noisy attributes are eliminated from the reduct. The reduct table is thus further reduced by removing the noisy attributes and those objects that cannot be classified accurately by the network.
3. Extraction of all rules in the decision table by rough sets. Applying a rough sets method, the final knowledge, a rule set, is generated from the reduced decision table.
Fig. 1 shows the procedures of HREM.
(Original DT (Decision Table) → attribute reduction by RS → Reduct DT → further reduction by NN → Reduced DT → rule generation by RS → rule set)

Fig. 1. The procedures of HREM
3.2 Algorithms in HREM
We first develop a Rough Set Attribute Reduction (RSAR) algorithm based on the binary discernibility matrix, which replaces complex set operations with simple bit-wise operations in the process of finding a reduct and provides a simpler and more intelligible measure of the importance of attributes. Even if the initial number of attributes is very large, this measure can effectively delete irrelevant and redundant attributes in a relatively short time. Secondly, we employ the neural network feature selection (NNFS) algorithm of [5] to further reduce the attributes in the reduct. In this approach, noisy input nodes (attributes) along with their connections are removed iteratively from the network without noticeably decreasing the network's classification ability. The approach is effective for a wide variety of classification problems, including both artificial and real-world datasets, as verified by many experiments. Making use of the robustness to noise and the generalization ability of the neural network method, attributes and objects polluted by noise can be removed from the decision table. Thirdly, we present an Extraction Algorithm of Approximate Sequence Decision Rules (EAASDR), which first extracts concise rules from the reduced table, removes the corresponding attribute values, and then extracts rules from the border region, until all rules have been extracted from the reduced table.
3.2.1 RSAR Algorithm
We assume that the content of the decision table is the only information source; our objective is to find a reduct with a minimal number of attributes. Based on the definition of the binary discernibility matrix, we propose our rough set attribute reduction (RSAR) algorithm to find a reduct of a decision table. RSAR is outlined below.

RSAR algorithm
Input: a decision table T = (U, C ∪ D), U = {u_1, u_2, ..., u_n}, C = {c_1, c_2, ..., c_m};
Output: a reduct of T, denoted Red;
1. Construct the binary discernibility matrix M of T;
2. Delete the rows of M that are all 0's; Red = ∅; /* delete pairs of inconsistent objects */
3. While (M ≠ ∅) {
   (1) select an attribute c_i of M with the highest discernibility degree (if several c_i share the highest discernibility degree, choose one of them randomly);
   (2) Red ← Red ∪ {c_i};
   (3) remove from M the rows that have a "1" in the c_i column;
   (4) remove the c_i column from M;
   } endwhile
/* the following steps remove redundant attributes from Red */
4. Suppose that Red = {r_1, r_2, ..., r_k} contains k attributes sorted by the order of entering Red, where r_k is the first attribute chosen into Red and r_1 is the last one;
5. Get the binary discernibility matrix MR of the decision table TR = (U, Red ∪ D);
6. Delete the rows of MR that are all 0's;
7. For i = 2 to k {
   remove the r_i column from MR;
   if (no row of MR is all 0's) then Red ← Red − {r_i};
   else put the r_i column back into MR;
   endif
   } endfor.

3.2.2 EAASDR Algorithm
A reduced table can be seen as a rule set in which each rule corresponds to one object of the table. The rule set can be generalized further by applying the rough set value reduction method. Unlike most value reduction methods, which neglect the border region rules among the classification capabilities of condition attributes, we first extract concise
rules from the reduced table, remove the values of those attributes, and then extract rules from the border region, until all rules have been extracted from the reduced table. The steps of our rough set rule generation algorithm, called EAASDR (an Extraction Algorithm of Approximate Sequence Decision Rules), are presented below.

EAASDR algorithm
Input: a decision table S = (U, C ∪ D);
Output: a rule set Rule;
1. Calculate the relative core CORE_D(C) of S = (U, C ∪ D); /* the relative core is obtained by calculating the significance σ_CD(C') of each condition attribute with respect to the decision attribute */
2. If CORE_D(C) ≠ ∅ then P_1 = CORE_D(C) and E = P_1; else P_1 = {c_1} and E = P_1; /* for every c ∈ C, calculate the dependency degree γ_[c](D) between c and D; the attribute c_1 with γ_[c_1](D) = max{γ_[c](D), c ∈ C} is selected as the initial attribute set */
3. Calculate the decision classification U/D = {Y_1, Y_2, ..., Y_d};
4. P = {P_1}; i = 1; U* = U; B = ∅; Rule = ∅;
5. Compute U*/IND(P_i) = {X_i1, X_i2, ..., X_ik};
6. B' = {X_k ∈ U*/IND(P_i) | X_k ⊆ Y_j, where Y_j ∈ U/D, j ∈ {1, 2, ..., d}}; Rule' = ∅; for every X_k ∈ B', Rule' = {des_{P_i}(X_k) → des_D(Y_j)}, where Y_j ∈ U/D and Y_j ⊇ X_k;
7. Rule = Rule ∪ Rule'; B = B ∪ B'; B* = ∪_{X ∈ B} X; if B* = U then goto Step 8, else {
   U* = U* − B*; Rule' = ∅; i = i + 1;
   for every c ∈ C − E, calculate the significance σ_({c}∪E)D({c}); if σ_({c_2}∪E)D({c_2}) = max{σ_({c}∪E)D({c}), c ∈ C − E} then P_i = P_{i−1} ∪ {c_2}, where P_i is an equivalence class of P; goto Step 5; }
8. B is the result of the dynamic classification and Rule is the set of decision rules.
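The greedy search of the RSAR algorithm (Sect. 3.2.1) can be sketched directly on a binary discernibility matrix. The 3×3 matrix below is invented for illustration, and the code is a reimplementation of the outlined steps, not the authors' own program:

```python
import numpy as np

def rsar_reduct(M0):
    """Greedy reduct from a binary discernibility matrix (object pairs x attributes)."""
    M0 = M0[M0.any(axis=1)]                 # step 2: drop all-zero rows
    M, red = M0.copy(), []
    while M.shape[0] > 0:                    # step 3: greedy attribute selection
        c = int(np.argmax(M.sum(axis=0)))    # highest discernibility degree
        red.append(c)
        M = M[M[:, c] == 0]                  # drop pairs already discerned by c
    for c in red[1:]:                        # steps 4-7: remove redundant attributes
        trial = [a for a in red if a != c]
        if M0[:, trial].any(axis=1).all():   # every pair still discerned?
            red = trial
    return red

M0 = np.array([[1, 0, 1],
               [0, 1, 1],
               [1, 1, 0]])
# attributes 0 and 1 together discern every object pair
```

The column sums play the role of the discernibility degree: each greedy pick covers as many undiscerned object pairs as possible, and the backward pass drops any attribute whose removal still leaves every pair discerned.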
3.3 General Algorithm
In summary, we give the general algorithm to generate rules from a decision table below.

The general algorithm
Input: a decision table T = (U, C ∪ D), U = {u_1, u_2, ..., u_n}, C = {c_1, c_2, ..., c_m};
Output: a rule set RULE;
1. Apply the RSAR algorithm to get a reduct of T, denoted Red;
2. Remove from T those attributes that are not in Red;
3. Apply the NNFS algorithm to obtain an important attribute subset IMP of Red. Let OBJ be the set of objects that were classified wrongly by the network;
4. Remove from T those attributes that are not in IMP and those objects in OBJ, and merge identical objects into one object;
5. Apply the EAASDR algorithm to extract the full rule set RULE from the reduced decision table.
It should be noted that identical objects are not merged after step 2, because the probability distribution of all objects is needed in the subsequent neural network training phase; identical objects correspond to the same rule in the rule generation phase, so they are merged in step 4.
4 Experiments and Results
We conducted a series of experiments to test our method. First, to compare with traditional methods, we applied our approach to eight data mining problems [9] and six standard datasets from the UCI repository that were used in [7] and [8], respectively. Secondly, to test our HREM under noisy conditions, we ran experiments on the MONK3 dataset with different levels of noise randomly added to the data. In this paper, rule set accuracy and rule set comprehensibility are used as the evaluation criteria for rule extraction approaches. The accuracy of a rule set is indicated by the accuracy of the rules generated on the testing set, while the comprehensibility of a rule set comprises two measures: the number of rules and the average number of conditions per rule.

4.1 Comparison Between HREM and NNFS
Ten classification problems were defined on datasets having nine attributes in [9]. We selected eight problems (all except function 8) with different complexities for our experiments. As in [9], the values of the attributes of each object were generated randomly and a perturbation factor of 5% was added. The class labels were determined according to the rules that define each function. For every experiment, 3000 objects were generated, among which 2000 objects were used as the training set and the other 1000 as the testing set. The attribute values were initially discretized and coded by the methods proposed in [7]; the nine attributes were thus transformed into 37 binary attributes. We ran 30 tests for each problem. Table 1 reports the results of the eight classification problems using our hybrid approach based on rough sets and neural networks (HREM). Experimental results from [7], obtained by an approach based on neural networks (NNFS), are compared for the same problems. The results show that HREM is comparable to NNFS in both accuracy and comprehensibility.
Moreover, the rule extraction time of HREM is much shorter than that of NNFS, for the reasons mentioned previously.
A Hybrid Rule Extraction Method Using Rough Sets and Neural Networks
Table 1. Comparison of performance of the rules generated by HREM and NNFS

F | Avg. accuracy (%)         | Avg. no. of rules         | Avg. no. of conditions
  | HREM         NNFS         | HREM         NNFS         | HREM        NNFS
1 | 99.92(0.48)  99.91(0.36)  | 2.14(0.58)   2.03(0.18)   | 2.01(0.58)  2.23(0.50)
2 | 99.01(0.87)  98.13(0.78)  | 6.78(1.56)   7.13(1.22)   | 4.20(0.78)  4.37(0.66)
3 | 98.81(1.37)  98.18(1.56)  | 7.60(0.81)   6.70(1.15)   | 2.25(0.04)  3.18(0.28)
4 | 93.09(1.92)  95.45(0.94)  | 10.90(1.83)  13.37(2.39)  | 2.79(0.27)  4.17(0.88)
5 | 98.16(0.78)  97.16(0.86)  | 22.89(9.78)  24.40(10.1)  | 4.95(1.20)  4.68(0.87)
6 | 90.89(0.23)  90.78(0.43)  | 12.45(3.56)  13.13(3.72)  | 4.20(0.98)  4.61(1.02)
7 | 91.82(0.57)  90.50(0.92)  | 5.13(2.34)   7.43(1.76)   | 1.71(0.47)  2.94(0.32)
9 | 91.58(0.98)  90.86(0.60)  | 10.21(1.96)  9.03(1.65)   | 3.21(0.40)  3.46(0.36)
4.2 Comparison Between HREM and RS
To compare with rough set based approaches (RS), we applied our approach to the six UCI datasets used in [8]. As in [8], we randomly separated each dataset into two parts: two thirds as the training set and the rest as the testing set. The continuous attributes were initially discretized using the equal width binning method. We ran 20 trials for each case and present the averages of the results in Table 2. The data in the RS columns were given in [8] and have no standard deviations. HREM outperforms RS in accuracy on all six datasets, and the rule sets of HREM are more concise than those of RS. This is because HREM can effectively filter the noise in the data, which makes the generated rules more accurate and simpler.

Table 2. Comparison of performance of the rules generated by HREM and RS

Data set   | Avg. accuracy (%)   | Avg. no. of rules | Avg. no. of conditions
           | HREM        RS      | HREM        RS    | HREM        RS
Australian | 85.70(0.39) 85.54   | 3.00(2.15)  6.7   | 1.34(0.62)  2.5
breast     | 94.82(0.77) 92.38   | 5.30(0.18)  7.8   | 1.91(0.15)  1.6
diabetes   | 73.92(2.07) 73.32   | 6.50(2.46)  6     | 4.35(3.56)  1.5
German     | 72.41(2.95) 70.48   | 4.35(3.56)  4.7   | 2.16(1.02)  1.4
glass      | 63.89(1.24) 60.42   | 22.5(2.58)  24.5  | 1.82(0.58)  2.2
iris       | 95.78(0.79) 95.10   | 3.15(1.78)  3.55  | 1.52(0.57)  1.29
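The equal width binning used above to discretize continuous attributes can be sketched as follows; the function name and the choice of bin count are illustrative assumptions.

```python
def equal_width_bins(values, k):
    """Discretize continuous values into k equal-width intervals,
    returning a bin index in 0..k-1 for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against constant attributes
    return [min(int((v - lo) / width), k - 1) for v in values]

bins = equal_width_bins([0.1, 0.4, 0.5, 0.9], 2)  # → [0, 0, 1, 1]
```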
4.3 Experiments Under Noisy Conditions
To demonstrate the robustness of our approach, the MONK3 dataset was selected for our experiments. The dataset contains 432 objects, each described by 6 attributes. All objects are classified into two classes by the following rule:
Class 1: (jacket_color = green and holding = sword) or (jacket_color ≠ blue and body_shape ≠ octagon);
Class 0: otherwise.
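The MONK3 class rule quoted above can be written directly as a predicate; the attribute-value strings are our own illustrative spelling of the nominal values.

```python
def monk3_class(jacket_color, holding, body_shape):
    """Class 1 iff (jacket_color = green and holding = sword)
    or (jacket_color != blue and body_shape != octagon)."""
    if (jacket_color == "green" and holding == "sword") or \
       (jacket_color != "blue" and body_shape != "octagon"):
        return 1
    return 0
```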
S. Wang, G. Wu, and J. Pan
We constructed three classification problems on the dataset by randomly adding 6, 12, and 18 noisy objects to the training objects, respectively. We turned an object into noise by changing its class label; that is, an object originally labeled "1" was relabeled "0". In every experiment the dataset was divided randomly into two equal parts: one used as the training set and the other as the testing set. Table 3 shows the results of the three problems; each was run 30 times. It can be seen that under the different noise levels the generated rule sets remained relatively stable, and that HREM can effectively filter the noise in the data while deleting relatively few objects (the number of objects deleted was never more than twice the number of true noisy objects). This guarantees that concise and accurate rules are generated.

Table 3. Result of robustness experiments on the MONK3 dataset with HREM

Data set  | Avg. accuracy (%) | Avg. no. of rules | Avg. no. of conditions
6 noises  | 97.84(1.19)       | 3.57(0.97)        | 1.34(0.14)
12 noises | 97.78(1.13)       | 4.30(2.45)        | 1.44(0.34)
18 noises | 95.56(6.09)       | 3.97(3.01)        | 1.40(0.42)
5 Conclusions
In this paper, we present a hybrid approach integrating rough sets and neural networks to mine classification rules from large datasets. Through the rough set approach, a decision table is first reduced by removing redundant attributes without losing any classification information; then a neural network is trained to delete noisy attributes from the table. Those objects that cannot be classified accurately by the network are also removed from the table. Finally, all classification rules are generated from the reduced decision table by rough sets. In addition, based on a binary discernibility matrix, a new algorithm, RSAR, for finding a reduct and a new algorithm, EAASDR, for generating all rules from a decision table were also proposed. HREM was applied to a series of classification problems including both artificial and real-world problems. The comparison experiments show that our approach generates more concise and more accurate rules than the traditional neural network based approach and the rough set based approach. The robustness experiments indicate that HREM works well under different noise levels.
References 1. Jelonek, J., Krawiec, K., Slowinski, R.: Rough Set Reduction of Attributes and Their Domains for Neural Networks. Computational Intelligence 11 (1995) 339-347 2. Swiniarski, R., Hargis, L.: Rough Set as a Front End of Neural-Networks Texture Classifiers. Neurocomputing 36 (2001) 85-102 3. Ahn, B., Cho, S., Kim, C.: The Integrated Methodology of Rough Set Theory and Artificial Neural Network for Business Failure Prediction. Expert Systems with Applications 18 (2000) 65-74
4. Yasdi, R.: Combining Rough Sets Learning and Neural Network Learning Method to Deal with Uncertain and Imprecise Information. Neurocomputing 7 (1995) 61-84 5. Setiono, R., Liu, H.: Neural-Network Feature Selector. IEEE Trans. Neural Networks 8 (1997) 554-662 6. Towell, G., Shavlik, J.W.: Interpretation of Artificial Neural Networks: Mapping Knowledge-based Neural Networks into Rules. In Advances in Neural Information Processing Systems 4, Moody, J.E., Hanson, S.J., Lippmann, R.P. eds., San Mateo, CA: Morgan Kaufmann (1992) 977-984 7. Lu, H., Setiono, R., Liu, H.: Effective Data Mining using Neural Networks. IEEE Trans. Knowledge and Data Engineering 8 (1996) 957-961 8. Chen, X., Zhu, S., Ji, Y.: Entropy based Uncertainty Measures for Classification Rules with Inconsistency Tolerance. In: Proc. IEEE Int. Conf. Systems, Man and Cybernetics (2000) 2816-2821 9. Agrawal, R., Imielinski, T., Swami, A.: Database Mining: A Performance Perspective. IEEE Trans. Knowledge and Data Engineering 5 (1993) 914-925 10. Andrews, R., Diederich, J., Tickle, A.B.: Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks. Knowledge-Based Systems 8 (1995) 373-389 11. Chen, M., Han, J., Yu, P.: Data Mining: An Overview from a Database Perspective. IEEE Trans. Knowledge and Data Engineering 8 (1996) 866-883.
A Novel Approach for Extraction of Fuzzy Rules Using the Neuro-fuzzy Network and Its Application in the Blending Process of Raw Slurry*
Rui Bai1, Tianyou Chai1,2, and Enjie Ma1
1 Key Laboratory of Integrated Automation of Process Industry, Ministry of Education, Northeastern University, Shenyang 110004, China
2 Research Center of Automation, Northeastern University, Shenyang 110004, China
Abstract. A novel approach is proposed to extract fuzzy rules from input-output data using a neuro-fuzzy network combined with an improved c-means clustering algorithm. Interpretability, one of the most important features of fuzzy systems, is obtained with this approach, and the number of fuzzy sets for each variable can be determined appropriately. Finally, the proposed approach is applied to the blending process of raw slurry in the alumina sintering production process. The fuzzy system, which is used to determine the set values of the material flow rates, is extracted from production-index-error / flow-rate-adjustment data. Application results show that the fuzzy system not only improves the quality of the raw slurry but also has good interpretability.
1 Introduction
Since fuzzy sets were proposed by L.A. Zadeh in 1965, fuzzy systems have been applied widely in many fields, including modeling, control, pattern recognition, fault diagnosis, and so on. One of the important design issues of fuzzy systems is how to construct a set of appropriate fuzzy rules. There are two major approaches: manual rule generation and automatic rule generation. Most reported fuzzy systems have resorted to a trial-and-error method for constructing fuzzy rules. This not only limits applications of fuzzy systems, but also forces system designers to spend much time on constructing and tuning the rules. Moreover, the manual approach becomes even more difficult when the required number of rules increases or domain knowledge is not easily available. To resolve these difficulties, several automatic approaches for extracting fuzzy rules from input-output data have recently been proposed, including the look-up table approach [1], data mining approaches [2,3], the GA approach [4], the clustering approach [5],
* This project is supported by the National Fundamental Research Program of China (Grant No. 2002CB312201), the State Key Program of the National Natural Science Foundation of China (Grant No. 60534010), the Funds for Creative Research Groups of China (Grant No. 60521003), and the Program for Changjiang Scholars and Innovative Research Team in University (Grant No. IRT0421).
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 362–370, 2007. © Springer-Verlag Berlin Heidelberg 2007
and the neural network approach [6,10]. However, all these approaches focus only on fitting the data with the highest possible accuracy, neglecting the interpretability of the obtained fuzzy systems, which is a primary advantage of fuzzy systems and the most prominent feature distinguishing fuzzy set theory from many other theories used in modeling and control. Another disadvantage of these approaches is that the number of fuzzy sets and fuzzy rules must be determined manually and beforehand. To improve the interpretability of fuzzy systems, similar fuzzy sets are merged based on similarity analysis [7, 8, 9, 13]. However, similar or incompatible fuzzy rules are not considered, which also decreases the interpretability of the fuzzy systems. It should be noted that most of the reported approaches can only realize T-S fuzzy rules, whose consequent is a constant or a linear combination of the inputs and which are more difficult to interpret linguistically than normal fuzzy rules whose consequents are fuzzy sets. To resolve this problem, normal fuzzy rules are extracted using neuro-fuzzy networks [9, 10]. However, in [9, 10] every weight of the output layer of the neuro-fuzzy network represents a fuzzy set of an output variable, which makes the fuzzy sets of the output variables excessive and reduces the interpretability of the fuzzy rules. To determine the number of fuzzy rules and fuzzy sets appropriately, Cho [10] used a hierarchically self-organizing learning (HSOL) algorithm to automatically determine the number of fuzzy rules; however, the number and initial parameters of the fuzzy sets are determined subjectively and randomly. Paiva et al. determined the number of fuzzy rules by applying clustering analysis to the input-output data [9, 11]. However, in these approaches the number of clusters equals the number of fuzzy rules, and also equals the number of fuzzy sets of the input or output variables, so the fuzzy rule base is not complete.
To resolve the above problems, this paper improves the c-means clustering algorithm, and a novel extraction approach for fuzzy rules is proposed using a neuro-fuzzy network combined with the improved c-means clustering algorithm. The interpretability of the fuzzy rules is increased, and the number of fuzzy sets for each variable can be determined appropriately. The proposed approach is applied to the raw slurry blending process in alumina production, where fuzzy control rules are extracted from production-index-error / flow-rate-adjustment data. Application results show that the fuzzy system not only improves the quality of the raw slurry, but also has good interpretability.
2 The Novel Approach for Extracting Fuzzy Rules from Input-Output Data
In this paper, the number and initial parameters of the fuzzy sets are determined using the improved c-means clustering, and the neuro-fuzzy network is used to train the parameters. After training, the weights of the output layer are clustered to determine the fuzzy sets of the output variables. Finally, fuzzy rules are extracted from the neuro-fuzzy
network, and regulation of the rules is carried out. The main steps of the proposed approach are as follows:
Step 1. Differing from [9, 11], clustering analysis is applied to every variable instead of to the whole input-output data, and the improved c-means algorithm proposed in this paper is used. The fuzzy sets of every input variable are determined from the clustering results, and the initial weights and the number of nodes of the neuro-fuzzy network are determined appropriately.
Step 2. A back-propagation algorithm with variable step size is adopted to train the neuro-fuzzy network.
Step 3. Improved c-means clustering analysis is applied to the output-layer weights. The membership functions of the output variables' fuzzy sets are determined from the clustering results, and the fuzzy rules are extracted from the trained neuro-fuzzy network. Differing from [9, 10], the number of fuzzy sets of the output variables is decreased.
Step 4. Regulation of the fuzzy rules is carried out, including merging similar fuzzy sets and deleting similar or incompatible fuzzy rules.
2.1 Determine the Initial Fuzzy Sets of Input Variables
Let us assume that the given input-output data set is

(x_{1,p}, ..., x_{m,p}; y_{1,p}, ..., y_{n,p}),  p = 1, ..., P,   (1)
where m is the number of input variables, n is the number of output variables, and there are P input-output data pairs. The traditional c-means algorithm has the disadvantage that the number of clusters and the initial cluster centers must be determined subjectively beforehand. To overcome this disadvantage, the improved c-means algorithm is proposed in this paper, and the initial fuzzy sets of the input variables are determined appropriately with it. All data of the ith input variable are clustered; the main steps of the improved c-means algorithm are as follows:
Step 1. The distance between x_{i,p} and the clusters is defined as

d_{ip,j} = [(x_{i,p} - c_{i,j})^2]^{1/2},  i = 1, ..., m,  p = 1, ..., P,  j = 1, ..., r_i,   (2)

where r_i is the number of existing clusters and c_{i,j} is the center of the jth cluster.
Step 2. Let k = 0, and select x_{i,1} as the first cluster, denoted W_{i,1}^k. Let c_{i,1}^k = x_{i,1} and r_i = 1.
Step 3. Compute the distance between x_{i,2} and W_{i,1}^k. If d_{i2,1} > T, a new cluster W_{i,2}^k is created with c_{i,2}^k = x_{i,2} and r_i = 2; otherwise, x_{i,2} is assigned to W_{i,1}^k.
Step 4. Assume there are r_i cluster centers, and compute the distances between x_{i,p} and the existing clusters. If the minimum of d_{ip,j} is greater than T, let r_i = r_i + 1 and select x_{i,p} as a new cluster, denoted W_{i,r_i}^k, with c_{i,r_i}^k = x_{i,p}.
Step 5. If all data of x_i have been assigned to clusters, the procedure ends; otherwise, return to Step 4. Finally, x_i is divided into r_i clusters, denoted W_{i,j}^k, j = 1, ..., r_i. The center of every cluster is then redefined as the average of all data in the cluster.
Steps 1 to 5 form the first phase of the improved c-means algorithm, in which the number and initial centers of the clusters are determined. Based on these results, x_i is clustered again using the traditional c-means algorithm in Steps 6 and 7, which form the second phase.
Step 6. Compute the distances between x_{i,p} and the existing clusters in turn. If x_{i,p} is closest to the lth cluster, x_{i,p} is assigned to the lth cluster, and the new cluster W_{i,j}^{k+1} comes into being. Consequently, x_i is divided into r_i clusters, denoted W_{i,j}^{k+1}, with cluster centers c_{i,j}^{k+1}.
Step 7. If c_{i,j}^{k+1} = c_{i,j}^k (j = 1, ..., r_i), the procedure ends; otherwise, let k = k + 1 and return to Step 6.
After Steps 1 to 7, x_i is divided into r_i clusters with centers c_{i,j}. Based on these results, the r_i fuzzy sets of input variable x_i are determined appropriately, and their membership functions are
μ_{A_{i,j}}(x_i) = exp[-(x_i - c_{i,j})^2 / σ_{i,j}^2],  j = 1, ..., r_i,   (3)

σ_{i,j} = (c_{i,j+1} - c_{i,j}) / 2.5,  j = 1, ..., r_i.   (4)
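The two-phase clustering of Steps 1-7 and the membership functions of Eqs. (3)-(4) can be sketched as follows for a one-dimensional attribute. This is a simplified sketch, not the authors' code: it folds the center-averaging of Step 5 into the second phase, and the threshold T and the sample data are illustrative.

```python
import math

def improved_c_means(xs, T, max_iter=100):
    """Phase 1: create a new cluster whenever a point is farther than T
    from every existing center. Phase 2: classic c-means reassignment
    until the centers stop moving."""
    centers = [xs[0]]
    for x in xs[1:]:
        if min(abs(x - c) for c in centers) > T:
            centers.append(x)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for x in xs:
            j = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            groups[j].append(x)
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new == centers:
            break  # centers unchanged: converged
        centers = new
    return sorted(centers)

def gaussian_mu(x, c, sigma):
    """Membership function of Eq. (3)."""
    return math.exp(-((x - c) ** 2) / sigma ** 2)

centers = improved_c_means([0.0, 0.1, 1.0, 1.1, 2.0], T=0.5)
sigmas = [(centers[j + 1] - centers[j]) / 2.5 for j in range(len(centers) - 1)]  # Eq. (4)
```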
2.2 Design and Train the Neuro-fuzzy Network
2.2.1 Determine the Structure and the Initial Weights of the Neuro-fuzzy Network
Fig. 1 shows the schematic diagram of the neuro-fuzzy network. The input layer has m nodes, which represent the m input variables in equation (1). There are Q nodes in the second layer, i.e., the fuzzification layer:

Q = Σ_{i=1}^{m} r_i.   (5)
[Figure: a five-layer network with input nodes x_1, ..., x_m, fuzzification nodes μ_{i,j}, inference nodes R_1, ..., R_N, normalization nodes, and output nodes y_1, ..., y_n connected through weights ω_{i,j}.]
Fig. 1. Neuro-fuzzy network
The activation function of every fuzzification neuron is the corresponding membership function, i.e., equations (3)-(4), so c_{i,j} and σ_{i,j} are selected as the initial parameters of the fuzzification layer. There are N nodes in the third layer, i.e., the inference layer. Every node in this layer represents a fuzzy rule, and the output of a node is the product of all its inputs:

N = Π_{i=1}^{m} r_i.   (6)
The normalization layer also has N nodes. The output layer has n nodes, which represent the n output variables in equation (1), and the output of every node is computed as

y_i = Σ_{j=1}^{N} ω_{ij} R_j,  i = 1, 2, ..., n,   (7)

where ω_{ij} is the weight between the ith node of the output layer and the jth node of the normalization layer. The nearest-neighbor principle is adopted to determine the initial ω_{ij}. For example, ω_{1,1} corresponds to c_{1,1}, c_{2,1}, ..., c_{m,1}; assuming the lth input-output pair in data set (1) is the closest to the vector {c_{1,1}, c_{2,1}, ..., c_{m,1}}, the first output value of the lth input-output pair is selected as the initial ω_{1,1}.
2.2.2 Training the Neuro-fuzzy Network
A back-propagation algorithm with variable step size is adopted to train the neuro-fuzzy network. The error function E is defined as

E = (1/2) Σ_{p=1}^{P} Σ_{i=1}^{n} (y_i - y_i^d)^2,   (8)

where y_i is the actual output of the neuro-fuzzy network and y_i^d is the desired output.
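A forward pass through the network of Fig. 1, combining the Gaussian memberships of Eq. (3), the rule products of the inference layer, normalization, and the weighted sum of Eq. (7), can be sketched as follows; the toy fuzzy sets and weights are illustrative assumptions.

```python
import math
from itertools import product

def forward(x, sets, weights):
    """One forward pass through the neuro-fuzzy network.
    sets[i] is a list of (c, sigma) pairs for input variable i;
    weights[o][j] is the output-layer weight for output o and rule j."""
    # fuzzification layer: Gaussian memberships, Eq. (3)
    mu = [[math.exp(-((xi - c) ** 2) / s ** 2) for (c, s) in si]
          for xi, si in zip(x, sets)]
    # inference layer: one rule per combination of fuzzy sets, Eq. (6)
    rules = [math.prod(m[j] for m, j in zip(mu, combo))
             for combo in product(*[range(len(si)) for si in sets])]
    # normalization layer
    total = sum(rules)
    norm = [r / total for r in rules]
    # output layer, Eq. (7)
    return [sum(w * r for w, r in zip(wo, norm)) for wo in weights]

sets = [[(0.0, 1.0), (1.0, 1.0)], [(0.0, 1.0)]]  # r_1 = 2, r_2 = 1 gives N = 2 rules
weights = [[0.0, 1.0]]                           # one output variable
y = forward([0.5, 0.0], sets, weights)
```

With the input halfway between the two centers of x_1, both rules fire equally, so the output is the average of the two rule weights, 0.5.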
The learning algorithm for updating c_{ij} is

c_{ij}(k+1) = c_{ij}(k) - α(k) ∂E/∂c_{ij}(k),   (9)

α(k) = 2^λ α(k-1),   (10)

λ = sgn[ ∂E/∂c_{ij}(k) × ∂E/∂c_{ij}(k-1) ],   (11)

where α(k) is the learning rate and λ is the step coefficient. Using the same algorithm, σ_{ij} and ω_{ij} can also be updated; see [12] for the detailed computation of ∂E/∂c_{ij}(k).
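The variable-step-size update of Eqs. (9)-(11) doubles the learning rate while successive gradients agree in sign and halves it when they disagree; a minimal sketch for a single parameter (the gradient values are made up for illustration):

```python
def adaptive_step(alpha_prev, grad_k, grad_prev):
    """Eqs. (10)-(11): lambda is the sign of the product of successive
    gradients; the rate is scaled by 2**lambda."""
    prod = grad_k * grad_prev
    lam = 1 if prod > 0 else (-1 if prod < 0 else 0)
    return (2 ** lam) * alpha_prev

# parameter update of Eq. (9) with the adaptive rate
c, alpha, grad_prev = 0.5, 0.1, 0.2
grad = 0.3
alpha = adaptive_step(alpha, grad, grad_prev)  # same sign: alpha doubles to 0.2
c = c - alpha * grad
```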
2.3 Extraction of the Fuzzy Rules
After the neuro-fuzzy network is trained, we can extract fuzzy rules from it. The main steps are as follows:
Step 1. Select the trained c_{ij} and σ_{ij} as the parameters of the membership functions of the input variables.
Step 2. Assume the trained ω_{ij} can be divided into q_i clusters using the improved c-means algorithm, with centers d_{i,j}. The output variable y_i is therefore divided into q_i fuzzy sets B_{i,j}, whose membership function centers are d_{i,j}.
Step 3. Every node in the inference layer represents a fuzzy rule. For example, the first node of the inference layer corresponds to the fuzzy sets {A_{1,1}, ..., A_{m,1}}, and the weights ω_{i,1} correspond to fuzzy sets B_{i,j}; the fuzzy rule that the first node represents is therefore:
Rule 1: If x_1 is A_{1,1} and x_2 is A_{2,1} and ... x_m is A_{m,1}, then y_1 is B_{1,j_1}, y_2 is B_{2,j_2}, ..., y_n is B_{n,j_n}.
Step 4. All fuzzy rules represented by the nodes are extracted.
2.4 Regulation of the Fuzzy Rules
2.4.1 Merge the Similar Fuzzy Sets
The fuzzy rules obtained above may contain redundant information in terms of similarity between fuzzy sets, and it is difficult to assign a qualitative linguistic term to
similar fuzzy sets. To increase the interpretability, the similarity measure is defined as

S_s(A, B) = [ Σ_{i=1}^{m} min(μ_A(x_i), μ_B(x_i)) ] / [ Σ_{i=1}^{m} max(μ_A(x_i), μ_B(x_i)) ].   (12)

If S_s > ξ_s, i.e., the two fuzzy sets A and B are very similar, they should be merged to create a new fuzzy set C, where ξ_s is a predefined threshold. The parameters of the fuzzy set C newly merged from A and B are defined as

c_C = (c_A + c_B) / 2,   σ_C = sqrt(σ_A^2 + σ_B^2) / 2.5.   (13)
2.4.2 Delete the Similar and Incompatible Fuzzy Rules
Consider two fuzzy rules:
R_i: If x_1 is A_{1,i} and ... x_m is A_{m,i}, then y_1 is B_{1,i}, ..., y_n is B_{n,i}.
R_j: If x_1 is A_{1,j} and ... x_m is A_{m,j}, then y_1 is B_{1,j}, ..., y_n is B_{n,j}.
The similarity measures of the antecedent part and the consequent part of the two rules are

S_{r_if}(R_i, R_j) = [ S_s(A_{1,i}, A_{1,j}) + ... + S_s(A_{m,i}, A_{m,j}) ] / m,   (14)

S_{r_then}(R_i, R_j) = [ S_s(B_{1,i}, B_{1,j}) + ... + S_s(B_{n,i}, B_{n,j}) ] / n.   (15)

If S_{r_if} > ξ_{r_if} and S_{r_then} > ξ_{r_then}, R_i and R_j are similar fuzzy rules. If S_{r_if} > ξ_{r_if} and S_{r_then} < ξ_{r_then}, R_i and R_j are incompatible fuzzy rules. If two fuzzy rules are similar or incompatible, one of them should be deleted.
3 Application in the Blending Process of Raw Slurry
In the alumina sintering production process, lime, ore, red slurry, and alkali are blended to form the raw slurry. In this blending process, the three most important quality indices of the raw slurry are determined by the flow rates of the four raw materials. The traditional manual mode of operation, in which operators adjust the flow rates based on the errors of the quality indices, cannot produce high-quality raw slurry. Therefore, a fuzzy system is proposed to replace manual operation, and its fuzzy rules are constructed using the approach proposed
in this paper. In this fuzzy system, the input variables are e1, e2, and e3, which represent the errors of the quality indices, and the output variables are Δx1, Δx2, Δx3, and Δx4, which represent the adjustments of lime, ore, red slurry, and alkali, respectively. The input-output data set was obtained from historical data and experience data:

(e_{1,p}, e_{2,p}, e_{3,p}; Δx_{1,p}, Δx_{2,p}, Δx_{3,p}, Δx_{4,p}),  p = 1, ..., 200.   (16)
At first, using the improved c-means algorithm, we obtain the initial fuzzy sets of e1, e2, and e3. The numbers of nodes in the layers of the neuro-fuzzy network are 3, 11, 45, 45, and 4, respectively. Using data set (16) to train the neuro-fuzzy network, we obtain the trained fuzzy sets. The initial and trained fuzzy sets of the input variables are shown in Table 1. Using the similarity measure, we find that the two fuzzy sets PM and PB of e1 are similar, and we replace them with the fuzzy set P of e1. After training and regulation, the final fuzzy sets of the input and output variables are shown in Table 2.

Table 1. The initial and final fuzzy sets of e1, e2 and e3

Variable | Initial fuzzy sets (c_i, σ_i) | Trained fuzzy sets (c_i, σ_i)
e1 | ZE(0.02,0.22), PM(0.56,0.16), PB(0.97,0.2) | ZE(0.01,0.17), PM(0.75,0.31), PB(0.82,0.22)
e2 | NB(-0.28,0.052), NS(-0.15,0.048), ZE(-0.03,0.06), PS(0.12,0.066), PB(0.285,0.06) | NB(-0.29,0.07), NS(-0.12,0.053), ZE(-0.01,0.045), PS(0.092,0.071), PB(0.28,0.056)
e3 | N(-0.12,0.06), ZE(0.03,0.042), P(0.135,0.06) | N(-0.14,0.05), ZE(0.01,0.05), P(0.139,0.04)
Table 2. The final fuzzy sets of variables
Variable | Final fuzzy sets (c_i, σ_i)
e1  | ZE(0.01,0.17), P(0.785,0.152)
e2  | NB(-0.29,0.07), NS(-0.12,0.053), ZE(-0.01,0.045), PS(0.092,0.071), PB(0.28,0.056)
e3  | N(-0.14,0.05), ZE(0.01,0.05), P(0.139,0.04)
Δx1 | NB(-9.21,156), NS(-5.3,2.13), ZE(0.02,2.1)
Δx2 | NB(-4.42,0.49), NM(-3.2,0.76), NS(-1.3,0.51), ZE(0.03,0.58), PS(1.42,0.65), PM(3.051,0.54), PB(4.41,0.5)
Δx3 | NB(-4.3,0.51), NM(-3.02,0.7), NS(-1.27,0.51), ZE(-0.01,0.604), PS(1.5,0.56), PM(2.98,0.608), PB(4.5,0.6)
Δx4 | NB(-5.92,1.17), NM(-3,0.588), NS(-1.53,0.616), ZE(0.01,0.59), PS(1.49,0.612), PM(3.02,0.045), PB(5.84,1.13)
At last, we obtain thirty fuzzy rules, for example:
Rule 1: If e1 is ZE and e2 is NB and e3 is NB, then Δx1 is ZE, Δx2 is NS, Δx3 is PS, and Δx4 is PM.
...
Rule 30: If e1 is P and e2 is PB and e3 is P, then Δx1 is NB, Δx2 is PB, Δx3 is NB, and Δx4 is NM.
The quality of the slurry is improved greatly when the fuzzy system replaces the operator.
4 Conclusions
This paper improves the c-means algorithm and uses a neuro-fuzzy network combined with the improved c-means algorithm to extract fuzzy rules from input-output data. The initial parameters and structure of the neuro-fuzzy network can be determined appropriately, and regulation of the fuzzy rules is carried out to increase their interpretability. The proposed approach is applied to construct a set of fuzzy rules for the raw-slurry blending process, and the results show its validity.
References 1. Wang, L.X., Mendel, J.M.: Generating fuzzy rules by learning from examples. IEEE Transactions on Fuzzy Systems 9 (2001) 426-442 2. Wang, Y.F., Chai, T.Y.: Mining fuzzy rules from data and its system implementation. Journal of System Engineering 20 497-503 3. Hu, Y.C., Chen, R.S.: Finding fuzzy classification rules using data mining techniques. Pattern Recognition Letters 24 (2003) 509-519 4. Wong, C.C., Lin, N.S.: Rule extraction for fuzzy modeling. Fuzzy Sets and Systems (1997) 23-30 5. Gomez-Skarmeta, A.F., Delgado, M., Vila, M.A.: About the use of fuzzy clustering techniques for fuzzy model identification. Fuzzy Sets and Systems (1999) 179-188 6. Xiong, X., Wang, D.X.: Effective data mining based fuzzy neural networks. Journal of Systems Engineering 15 32-37 7. Xing, Z.Y., Jia, L.M., et al.: A Case Study of Data-driven Interpretable Fuzzy Modeling. Acta Automatica Sinica 31 (2005) 815-824 8. Jin, Y.C., Sendhoff, B.: Extracting Interpretable Fuzzy Rules from RBF Networks. Neural Processing Letters 17 (2003) 149-164 9. Paiva, R.P.: Interpretability and learning in neuro-fuzzy systems. Fuzzy Sets and Systems (2004) 17-38 10. Cho, K.B., Wang, B.H.: Radial basis function based adaptive fuzzy systems and their applications to system identification and prediction. Fuzzy Sets and Systems (1996) 325-339 11. Oh, S.K., Pedrycz, W., Park, H.S.: Hybrid identification in fuzzy-neural networks. Fuzzy Sets and Systems (2003) 399-426 12. Sun, Z.Q.: Intelligent control theory and technology. Tsinghua University Press (1997) 13. Setnes, M., Babuska, R.: Similarity Measures in Fuzzy Rule Base Simplification. IEEE Transactions on Systems, Man, and Cybernetics, Part B 28 376-386
Neural Network Training Using Genetic Algorithm with a Novel Binary Encoding
Yong Liang1, Kwong-Sak Leung2, and Zong-Ben Xu3
1 Department of Computer Science and Ministry of Education National Key Laboratory on Embedded Systems, College of Engineering, Shantou University, Shantou, Guangdong, China
[email protected]
2 Department of Computer Science and Engineering, The Chinese University of Hong Kong, HK
[email protected]
3 School of Science, Xi'an Jiaotong University, Xi'an, Shaanxi, China
[email protected]
Abstract. Genetic algorithms (GAs) are widely used in the parameter training of Neural Networks (NN). In this paper, we investigate GAs based on our proposed novel genetic representation to train the parameters of NN. A splicing/decomposable (S/D) binary encoding is designed based on theoretical guidance and existing recommendations. Our theoretical and empirical investigations reveal that the S/D binary representation is more proper than other existing binary encodings for GA searching. Moreover, a new genotypic distance on the S/D binary space is equivalent to the Euclidean distance on the real-valued space during GA convergence. Therefore, GAs can reliably and predictably solve problems of bounded complexity, and methods that depend on the Euclidean distance for solving different kinds of optimization problems can be used directly on the S/D binary space. This investigation demonstrates that GAs based on our proposed binary representation can efficiently and effectively train the parameters of NNs.
1 Introduction
Most real-world problems can be encoded by different representations, but genetic algorithms (GAs) may not be able to solve the problems successfully based on their phenotypic representations unless some problem-specific genetic operators are used. In particular, GAs are widely used in the parameter training of Neural Networks (NN), and need to transform the NN's parameters from a real encoding into binary strings. Therefore, a proper genetic representation is necessary when using GAs on real-world problems [1], [8], [12]. A large number of theoretical and empirical investigations of genetic representations have been made over the last decades, and have shown that the behavior and performance of GAs is strongly influenced by the representation used. Originally, the schema theorem and the building block hypothesis were proposed by [1]
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 371–380, 2007. © Springer-Verlag Berlin Heidelberg 2007
and [3] to model how GAs process similarities between binary bitstrings. The most common binary representations are the binary, gray, and unary encodings. According to three aspects of representation theory (redundancy, scaled building blocks, and distance distortion), Rothlauf [9] studied the performance differences of GAs under different binary representations of a real encoding. Analysis of the unary encoding by the representation theory reveals that the encoding is redundant and does not represent phenotypes uniformly. Therefore, the performance of GAs with the unary encoding depends on the structure of the optimal solution. Unary GAs fail to solve integer one-max, deceptive trap, and BinInt problems [4] unless larger population sizes are used, because the optimal solutions are strongly underrepresented for these three types of problems. Thus, unary GAs perform much worse than GAs using the non-redundant binary or gray encoding [9]. The binary encoding uses exponentially scaled bits to represent phenotypes. Its genotype-phenotype mapping is one-to-one and encodes phenotypes without redundancy. However, for non-uniformly scaled binary strings and competing Building Blocks (BBs) in a high-dimensional phenotype space, noise from the competing BBs leads to a reduction in the performance of GAs. In addition, the binary encoding has the effect that the genotypes of some phenotypical neighbors are completely different; as a result, the locality of the binary representation is partially low, i.e., the Hamming cliff [10]. In the distance distortion theory, an encoding preserves the difficulty of a problem if it has perfect locality and does not modify the distances between individuals. The analysis reveals that the binary encoding changes the distances between individuals and therefore changes the complexity of the optimization problem. Thus, easy problems can become difficult, and vice versa.
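The Hamming cliff can be seen with a short sketch: the phenotypic neighbors 7 and 8 have completely different 4-bit binary codes, while their Gray codes differ in a single bit. The helper functions are our own illustration.

```python
def to_gray(n):
    """Convert an integer's binary representation to its Gray code."""
    return n ^ (n >> 1)

def hamming(a, b, bits=4):
    """Count differing bit positions between two codes."""
    return sum((a >> i & 1) != (b >> i & 1) for i in range(bits))

# phenotypic neighbors 7 (0111) and 8 (1000)
cliff_binary = hamming(7, 8)                   # all four bits differ
cliff_gray = hamming(to_gray(7), to_gray(8))   # 0100 vs 1100: one bit differs
```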
Binary-encoded GAs are thus not able to reliably solve problems when mapping the phenotypes to the genotypes. The non-redundant gray encoding [10] was designed to overcome the Hamming-cliff problem of the binary encoding. In the gray encoding, every neighbor of a phenotype is also a neighbor of the corresponding genotype. Therefore, the difficulty of a problem remains unchanged when using mutation-based search operators that only perform small steps in the search space. As a result, easy problems and problems of bounded difficulty are easier to solve with mutation-based search under the gray coding than under the binary encoding. Although the gray encoding has high locality, it still changes the distance correspondence between individuals that differ in more than one bit. When focusing on crossover-based search methods, analysis of the average fitness of the schemata reveals that the gray encoding preserves building block complexity less well than the binary encoding. Thus, a decrease in performance of gray-encoded GAs is unavoidable for some kinds of problems [2], [12]. Up to now, there is no well-established theory regarding the influence of representations on the performance of GAs. To help users with different tasks to find good representations, over the last few years some researchers have made recommendations based on the existing theories. For example, Goldberg [1] has
NN Training Using Genetic Algorithm with a Novel Binary Encoding
373
proposed two basic design principles for encodings: (i) the principle of minimal alphabets: the alphabet of the encoding should be as small as possible while still allowing a natural representation of solutions; and (ii) the principle of meaningful building blocks: the schemata should be short, of low order, and relatively unrelated to schemata over other fixed positions. The principle of minimal alphabets advises us to use a bit-string representation. Combining it with the principle of meaningful building blocks (BBs), we construct uniform-salient BBs, which comprise equally scaled and splicing/decomposable alleles.
This paper is organized as follows. Section 2 introduces a novel splicing/decomposable (S/D) binary representation and its genotypic distance. Section 3 describes a new genetic algorithm based on the S/D binary representation, the splicing/decomposable genetic algorithm (SDGA). Section 4 provides simulation results of SDGA for NN training and comparisons with other binary GAs. The conclusions are summarized in Section 5.
2 A Novel Splicing/Decomposable Binary Genetic Representation
Based on the above investigation results and recommendations, Leung et al. have proposed a new genetic representation that is proper for GA search [5], [13]. In this section, we first introduce a novel splicing/decomposable (S/D) binary encoding, then define the new genotypic distance for the S/D encoding, and finally give a theoretical analysis of the S/D encoding based on the three elements of genetic representation theory (redundancy, scaled BBs and distance distortion).

2.1 A Splicing/Decomposable Binary Encoding
In [5], Leung et al. have proposed a novel S/D binary encoding for real-valued variables. Assume the phenotypic domain \Phi_p of the n-dimensional problem can be specified by

    \Phi_p = [\alpha_1, \beta_1] \times [\alpha_2, \beta_2] \times \cdots \times [\alpha_n, \beta_n].    (1)

Given a binary string length l, the genotypic precision is h_i(l) = (\beta_i - \alpha_i)/2^{l/n}, i = 1, 2, \cdots, n. Any real-valued variable x = (x_1, x_2, \ldots, x_n) \in \Phi_p can be represented by a splicing/decomposable (S/D) binary string b = (b_1, b_2, \ldots, b_l); the genotype-phenotype mapping f_g is defined as

    x = (x_1, x_2, \cdots, x_n) = f_g(b) = \Bigl( \sum_{j=0}^{l/n} 2^{l/n-j}\, b_{j\times n+1},\; \sum_{j=0}^{l/n} 2^{l/n-j}\, b_{j\times n+2},\; \cdots,\; \sum_{j=0}^{l/n} 2^{l/n-j}\, b_{j\times n+n} \Bigr),    (2)-(3)

where

    \sum_{j=0}^{l/n} 2^{l/n-j}\, b_{j\times n+i} \;\le\; \frac{x_i - \alpha_i}{h_i(l)} \;<\; \sum_{j=0}^{l/n} 2^{l/n-j}\, b_{j\times n+i} + 1.    (4)
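The successive-bisection reading of Eqs. (1)-(4) can be sketched directly in code. The encoder below and the test point x = (0.55, 0.45) are our own illustration (chosen to land in the subregion labelled 100101 in Figure 1), not code from the paper:

```python
def sd_encode(x, bounds, l):
    """Encode a real vector x into an l-bit splicing/decomposable string
    by successive bisection of the phenotypic domain: one bit per
    dimension per refinement level, left half -> 0, right half -> 1."""
    n = len(x)
    lo = [a for a, b in bounds]
    hi = [b for a, b in bounds]
    bits = []
    for _ in range(l // n):            # one uniform-salient BB per level
        for i in range(n):
            mid = (lo[i] + hi[i]) / 2
            if x[i] < mid:
                bits.append(0); hi[i] = mid
            else:
                bits.append(1); lo[i] = mid
    return bits

# reproduces the worked example of Figure 1 on [0,1]^2 with l = 6
assert sd_encode([0.55, 0.45], [(0, 1), (0, 1)], 6) == [1, 0, 0, 1, 0, 1]
```

Each pass through the inner loop emits one building block BB_j, matching the BB-by-BB refinement described below.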
374
Y. Liang, K.-S. Leung, and Z.-B. Xu
Fig. 1. A graphical illustration of the splicing/decomposable representation scheme, where (b) is the refined bisection of the gray cell (10) in (a) (with mesh size O(1/2)), (c) is the refined bisection of the dark cell (1001) in (b) (with mesh size O(1/2^2)), and so forth
That is, the significance of each bit of the encoding can be clearly and uniquely interpreted (hence each BB of the encoded S/D binary string has a specific meaning). As shown in Figure 1, take \Phi_p = [0,1] \times [0,1] and the S/D binary string b = 100101 as an example (in this case l = 6, n = 2, and the genotypic precisions h_1(l) = h_2(l) = 1/8). Let us see how to identify the S/D binary string b and what each bit value of b means. In Figure 1-(a), the phenotypic domain \Phi_p is bisected into four \Phi_p^{1/2} (i.e., subregions with uniform size 1/2). According to the left-0 and right-1 correspondence rule in each coordinate direction, these four \Phi_p^{1/2} can be identified with (00), (01), (10) and (11). As the phenotype x lies in the subregion (10) (the gray square), its first building block (BB) should be BB_1 = 10. This gives the first two bits of the S/D binary string b. Likewise, in Figure 1-(b), \Phi_p is partitioned into 2^{2\times 2} \Phi_p^{1/4}, obtained by further bisecting each \Phi_p^{1/2} along each direction. In particular, this divides \Phi_p^{1/2} = (BB_1) into four \Phi_p^{1/4} that can be labelled (BB_1, 00), (BB_1, 01), (BB_1, 10) and (BB_1, 11). The phenotype x is in the (BB_1, 01)-subregion (the dark square), so its second BB should be BB_2 = 01 and the first four bits of its corresponding S/D binary string b are 1001.
In the same way, \Phi_p is partitioned into 2^{2\times 3} \Phi_p^{1/8} as shown in Figure 1-(c), with \Phi_p^{1/4} = (BB_1, BB_2) partitioned into four \Phi_p^{1/8} labelled (BB_1, BB_2, 00), (BB_1, BB_2, 01), (BB_1, BB_2, 10) and (BB_1, BB_2, 11). The phenotype x is found to be in (BB_1, BB_2, 01), which is identical with the S/D binary string b. This shows that for any three region partitions, b = (b_1, b_2, b_3, b_4, b_5, b_6), each bit value b_i can be interpreted geometrically as follows: b_1 = 0 (b_2 = 0) means the
phenotype x is in the left half along the x-coordinate direction (the y-coordinate direction) in the \Phi_p partition with 1/2-precision, and b_1 = 1 (b_2 = 1) means x is in the right half. Therefore, the first BB_1 = (b_1, b_2) determines the 1/2-precision location of x. If b_3 = 0 (b_4 = 0), it further indicates that when \Phi_p^{1/2} is refined into \Phi_p^{1/4}, x lies in the left half of \Phi_p^{1/2} in the x-direction (y-direction), and it lies in the right half if b_3 = 1 (b_4 = 1). Thus a more accurate geometric location (i.e., the 1/4-precision location) and a more refined BB_2 of x are obtained. Similarly we can interpret b_5 and b_6 and identify BB_3, which determines the 1/8-precision location of x. This interpretation holds for any high-resolution l-bit S/D binary encoding.
2.2 A New Genotypic Distance on the Splicing/Decomposable Binary Representation
For measuring the similarity of binary strings, the Hamming distance is widely used on the binary space. The Hamming distance counts how many bits differ between two binary strings, but it cannot account for the scaling of bits in non-uniformly scaled binary representations. Thus, the distance distortion between the genotypic and the phenotypic spaces makes phenotypically easy problems more difficult. Therefore, to ensure that GAs are able to reliably solve easy problems and problems of bounded complexity, the use of equivalent distances is recommended. For this purpose, we define a new genotypic distance on the S/D binary space to measure the similarity of S/D binary strings.

Definition 1. Suppose binary strings a and b belong to the S/D binary space \Phi_g. The genotypic distance \|a - b\|_g is defined as

    \|a - b\|_g = \sum_{i=1}^{n} \left| \sum_{j=0}^{l/n-1} \frac{a_{j\times n+i} - b_{j\times n+i}}{2^{j+1}} \right|,

where l and n denote the length of the S/D binary strings and the dimension of the real-encoded phenotypic space \Phi_p, respectively.

For any two S/D binary strings a, b \in \Phi_g, we can define the Euclidean distance of their corresponding phenotypes,

    \|a - b\|_p = \left( \sum_{i=1}^{n} \left( \sum_{j=0}^{l/n-1} \frac{a_{j\times n+i}}{2^{j+1}} - \sum_{j=0}^{l/n-1} \frac{b_{j\times n+i}}{2^{j+1}} \right)^2 \right)^{1/2},

as the phenotypic distance between the S/D binary strings a and b.

Theorem 1. The phenotypic distance \|\cdot\|_p and the genotypic distance \|\cdot\|_g are equivalent on the S/D binary space \Phi_g, because the inequality

    \|\cdot\|_p \;\le\; \|\cdot\|_g \;\le\; \sqrt{n} \times \|\cdot\|_p

holds on \Phi_g, where n is the dimension of the real-encoded phenotypic space \Phi_p.
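Theorem 1 is the standard l1/l2 norm equivalence applied to the vector of per-dimension differences. A quick numerical check of Definition 1 and the inequality (our own sketch, with arbitrary n and l):

```python
import math, random

def sd_distances(a, b, n):
    """Genotypic distance of Definition 1 and the phenotypic (Euclidean)
    distance defined above, for two S/D strings of equal length l."""
    l = len(a)
    per_dim = [sum((a[j*n + i] - b[j*n + i]) / 2**(j + 1) for j in range(l // n))
               for i in range(n)]
    dg = sum(abs(d) for d in per_dim)          # l1 norm of the differences
    dp = math.sqrt(sum(d * d for d in per_dim))  # l2 norm of the differences
    return dg, dp

# check  dp <= dg <= sqrt(n) * dp  on random S/D strings
random.seed(0)
n, l = 3, 12
for _ in range(1000):
    a = [random.randint(0, 1) for _ in range(l)]
    b = [random.randint(0, 1) for _ in range(l)]
    dg, dp = sd_distances(a, b, n)
    assert dp <= dg + 1e-12 and dg <= math.sqrt(n) * dp + 1e-12
```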
According to the distance distortion theory of genetic representation, using the new genotypic distance \|\cdot\|_g can guarantee that GAs reliably and predictably solve problems of bounded complexity.
2.3 Theoretical Analysis of the Splicing/Decomposable Binary Encoding
In our previous work [6], [7], we introduced a delicate feature of the S/D representation: the BB-significance-variable property. It is seen from the above interpretation that the first n bits of an encoding are responsible for the location of the n-dimensional phenotype x in a global way (with O(1/2)-precision); the next group of n bits is responsible for the location of x in a less global (call it 'local') way, with O(1/4)-precision, and so forth; the last group of n bits then locates x in an extremely local (call it 'microcosmic') way, with O(1/2^{l/n})-precision. Thus, as the encoding length l increases, the representation

    (b_1, b_2, \cdots, b_n,\; b_{n+1}, b_{n+2}, \cdots, b_{2n},\; \cdots,\; b_{l-n+1}, b_{l-n+2}, \cdots, b_l) = (BB_1, BB_2, \cdots, BB_{l/n})    (5)-(7)

provides a successively refined (from global, to local, to microcosmic) and increasingly accurate representation of the problem variables. Within each BB_i of the S/D binary string, which consists of the bits (b_{i\times n+1}, b_{i\times n+2}, \cdots, b_{(i+1)\times n}), i = 0, \cdots, l/n-1, the bits are uniformly scaled. We refer to this delicate feature by calling BB_i a uniform-salient BB (USBB). Furthermore, splicing different numbers of USBBs describes rough approximations of the problem solutions at different precisions. Hence both the intra-BB difficulty (within a building block) and the inter-BB difficulty (between building blocks) [1] of USBBs are low. The theoretical analysis reveals that GAs searching on USBBs can explore the high-quality bits faster than GAs on non-uniformly scaled BBs.
The S/D binary encoding is a redundancy-free representation, because using S/D binary strings to represent real values is a one-to-one genotype-phenotype mapping. The whole S/D binary string is constructed as a non-uniformly scaled sequence of USBBs; domino convergence of GAs occurs and the USBBs are solved sequentially from high to low scale. The BB-significance-variable and uniform-salient BB properties of the S/D binary representation embody much information useful to GA search. We exploit this information to design a new GA based on the S/D binary representation in the subsequent sections.
3 A New S/D Binary Genetic Algorithm (SDGA)
The above interpretation reveals that for non-uniformly scaled binary strings and competing building blocks (BBs) in the binary and gray encodings, the noise from the competing BBs leads to a reduction in the performance of GAs.
Input: N - population size, m - number of USBBs, g - number of generations to run;
Termination condition: population fully converged;
begin
  g := 0; m := 1;
  Initialize P_g; Evaluate P_g;
  while (not termination condition) do
    for t := 1 to N/2
      randomly select two individuals x_t^1 and x_t^2 from P_g;
      crossover and selection of x_t^1, x_t^2 into P_{g+1};
    end for
    mutation operation on P_{g+1};
    Evaluate P_{g+1};
    if (USBB_m fully converged) m := m + 1;
  end while
end

Fig. 2. Pseudocode for the SDGA algorithm
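The loop of Fig. 2 can be sketched as runnable Python. This is a heavily simplified illustration, not the authors' implementation: the fitness interface, the parameter values, and the window-convergence test are our assumptions.

```python
import random

def sdga(fitness, n, l, pop_size=40, p_mut=0.05, max_gen=300):
    """Simplified SDGA sketch: the crossover point is confined to the
    current convergence-window USBB (n bits wide); the window advances
    once that USBB has fully converged. `fitness` maps a bit list to a
    value to maximise."""
    pop = [[random.randint(0, 1) for _ in range(l)] for _ in range(pop_size)]
    window = 0                                    # index of the current USBB
    for _ in range(max_gen):
        nxt = []
        for _ in range(pop_size // 2):
            x1, x2 = random.sample(pop, 2)
            cut = window * n + random.randrange(1, n + 1)  # cut inside USBB
            c1, c2 = x1[:cut] + x2[cut:], x2[:cut] + x1[cut:]
            nxt.append(max(x1, c1, key=fitness))  # keep better of each pair
            nxt.append(max(x2, c2, key=fitness))
        for ind in nxt:                           # mutate not-yet-converged bits
            for k in range(window * n, l):
                if random.random() < p_mut:
                    ind[k] ^= 1
        pop = nxt
        usbb = [tuple(ind[window*n:(window+1)*n]) for ind in pop]
        if len(set(usbb)) == 1 and window < l // n - 1:
            window += 1                           # USBB converged: advance
    return max(pop, key=fitness)

# toy run on a one-max fitness
random.seed(1)
best = sdga(lambda b: sum(b), n=2, l=10)
```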
To avoid this problem, we propose a new splicing/decomposable GA (SDGA) based on the delicate properties of the S/D binary representation from our previous work [6], [7]. In the SDGA, genetic operators are applied sequentially from the high-scaled to the low-scaled USBBs. For two individuals x_1 and x_2 randomly selected from the current population, the crossover point is randomly set inside the convergence-window USBB and the crossover operator generates two children c_1, c_2. The parents x_1, x_2 and their children c_1, c_2 can be divided into two pairs {x_1, c_1} and {x_2, c_2}. In each pair {x_i, c_i} (i = 1, 2), the parent and child have the same low-scaled USBBs. The selection operator conserves the better one of each pair into the next generation, according to the fitness calculated from the whole S/D binary string for high accuracy. Thus, the bits contributing to high fitness in the convergence-window USBB are preserved, and the diversity at the low-scaled USBBs' side is maintained. Mutation operates on the convergence window and the not-yet-converged USBBs according to the mutation probability, to increase the diversity in the population. These low-salient USBBs will converge through GA search, which avoids genetic drift. The implementation outline of the SDGA is shown in Figure 2.
Identifying high-quality bits in the convergence-window USBB is faster than GA search on non-uniformly scaled BBs, and no genetic drift occurs. Thus, the population can efficiently converge to the high-quality BB at the position of the convergence-window USBB, which is a component of the overrepresented optimum of the problem. According to the theoretical results of Thierens [11], the overall convergence time complexity of the new GA with the S/D binary representation
is approximately of order O(l/\sqrt{n}), where l is the length of the S/D binary string and n is the dimension of the problem. This is much faster than working on the binary strings as a whole, where GAs have an approximate convergence time of order O(l). The gain is especially significant for high-dimensional problems.
4 Simulations and Comparisons
GAs in the NN area can be used for searching weight values, topology design, NN parameter settings, and for selection and ordering of input and output vectors for the training and testing sets. We focus only on weight searching by GAs. The structure of the NN is fixed and is not changed throughout all experiments. A feedforward NN is used with one hidden layer, 20 hidden neurons, the sigmoidal transfer function tansig(x) for the hidden neurons and the linear transfer function purelin(x) for the output neuron. We used the NN to approximate the nonlinear functions f_1-f_3:

    f_1(x) = (1 - x^2) e^{-x^2/2},   x \in [-2, 2];    (8)

    f_2(x) = \sum_{j=1}^{5} j \cos\{(j+1)x + j\},   x \in [-10, 10];    (9)

    f_3(x_1, x_2) = \frac{1}{1 + |(x_1 + i x_2)^6 - 1|},   x_1, x_2 \in [-2, 2].    (10)
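For reference, the three regression targets can be written directly in code, a plain transcription of Eqs. (8)-(10):

```python
import math

def f1(x):
    """Eq. (8): (1 - x^2) exp(-x^2 / 2), x in [-2, 2]."""
    return (1 - x**2) * math.exp(-x**2 / 2)

def f2(x):
    """Eq. (9): a multimodal trigonometric sum, x in [-10, 10]."""
    return sum(j * math.cos((j + 1) * x + j) for j in range(1, 6))

def f3(x1, x2):
    """Eq. (10): 1 / (1 + |z^6 - 1|) with z = x1 + i*x2, x1, x2 in [-2, 2]."""
    z = complex(x1, x2)
    return 1 / (1 + abs(z**6 - 1))
```

f_3 peaks (value 1) at the six complex roots of z^6 = 1, which makes it a convenient multimodal 2-D target.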
The standard GA (SGA) using the binary, gray, unary and S/D encodings, and the SDGA, were applied to NN training to compare their performance. We performed 50 runs and each run was stopped after 1000 generations. For fairness of comparison, we implemented the SGAs with the different binary encodings and the SDGA with the same parameter settings and the same initial population of 500 individuals, in which each variable is represented by a 20-bit binary string. For SGA, we used a one-point crossover operator (crossover probability = 0.8), a one-point mutation operator (mutation probability = 0.05) and a tournament selection operator without replacement of size two. All algorithms were implemented in the MATLAB environment. Figure 3 presents the results for the problems f_1-f_3. The plots show, for SGA with the different representations and for SDGA, the best fitness with respect to the generations. Table 1 summarizes the experimental results for all
Fig. 3. The comparison results for the problems f_1-f_3 (◦: SDGA; : unary SGA; ×: binary SGA; +: gray SGA; •: S/D encoding SGA)
the test problems f_1-f_3. The best fitness of each problem is calculated as the average of the fitness values when the GAs fully converged over the different runs. As described by Figure 3 and Table 1, SGA with the differently scaled binary representations, including the binary, gray and S/D encodings, suffers from domino convergence, genetic drift and noise from the BBs. Due to the redundancy problems of the unary encoding, which result in an underrepresentation of the optimal solution, the performance of SGA using the unary encoding is significantly worse than with the binary, gray and S/D encodings. SGA with the gray encoding performs worse than with the binary encoding for f_1. As expected, SGA using the S/D encoding performs better than SGA using the binary and gray encodings on all test problems, because in the S/D encoding the more salient bits are contiguous and construct short, highly fit BBs, which are easily identified by SGA. This reveals that the S/D encoding is proper for GA search. However, since the lower salient bits in the S/D binary string are randomly fixed by genetic drift and noise from the BBs, the performance of SGA with the S/D encoding is not significantly better than with the binary and gray encodings.
As shown in Figure 3, the convergence of SDGA is much faster than that of the other SGAs. This reveals that the performance of SDGA is significantly better than SGA with the different encodings, because no premature convergence or genetic drift occurs. On the other hand, GAs search the USBBs of the S/D binary encoding faster than non-uniformly scaled BBs, and the domino convergence, which occurs only over the non-uniform sequence of USBBs, is much weaker.

Table 1. Comparison of results of SGA with different binary representations and SDGA for the problems f_1-f_3 (numbers in parentheses are the standard deviations)

Best fitness | Unary SGA   | Binary SGA  | Gray SGA     | S/D SGA      | SDGA
f_1          | 0.51 (0.17) | 0.25 (0.13) | 0.33 (0.12)  | 0.14 (0.083) | 0.057 (0.029)
f_2          | 4.3 (1.6)   | 3.2 (1.6)   | 2.9 (1.8)    | 2.4 (0.95)   | 0.14 (0.052)
f_3          | 0.30 (0.19) | 0.21 (0.11) | 0.18 (0.086) | 0.17 (0.093) | 0.042 (0.034)

5 Conclusions
Genetic algorithms (GAs) are widely used in the parameter training of neural networks (NNs). In this paper, we investigate GAs based on our proposed novel genetic representation to train the parameters of NNs. A splicing/decomposable (S/D) binary encoding is designed based on theoretical guidance and existing recommendations. Our theoretical and empirical investigations reveal that the S/D binary representation is more suitable than other existing binary encodings for GA search. Moreover, the new genotypic distance on the S/D binary space is equivalent to the Euclidean distance on the real-valued space during GA convergence. Therefore, GAs can reliably and predictably solve problems of bounded complexity, and methods depending on the Euclidean distance for solving different kinds of optimization problems can be directly used on the
S/D binary space. This investigation demonstrates that GAs based on our proposed binary representation can efficiently and effectively train the parameters of NNs.
References

1. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA (1989)
2. Han, K.H., Kim, J.H.: Genetic Quantum Algorithm and Its Application to Combinatorial Optimization Problem. Proceedings of the Congress on Evolutionary Computation 1 (2000) 1354-1360
3. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI (1975)
4. Julstrom, B.A.: Redundant Genetic Encodings May Not Be Harmful. Proceedings of the Genetic and Evolutionary Computation Conference, Morgan Kaufmann, San Francisco, CA 1 (1999) 791
5. Liang, Y., Leung, K.S.: Evolution Strategies with Exclusion-based Selection Operators and a Fourier Series Auxiliary Function. Applied Mathematics and Computation 174 (2006) 1080-1109
6. Liang, Y., Leung, K.S., Lee, K.H.: A Splicing/Decomposable Encoding and Its Novel Operators for Genetic Algorithms. Proceedings of the ACM Genetic and Evolutionary Computation Conference (2006) 1225-1232
7. Liang, Y., Leung, K.S., Lee, K.H.: A Novel Binary Variable Representation for Genetic and Evolutionary Algorithms. Proceedings of the 2006 IEEE World Congress on Computational Intelligence (2006) 2551-2558
8. Liepins, G.E., Vose, M.D.: Representational Issues in Genetic Optimization. Journal of Experimental and Theoretical Artificial Intelligence 2 (1990) 101-115
9. Rothlauf, F.: Representations for Genetic and Evolutionary Algorithms. Physica-Verlag, Heidelberg, New York (2002)
10. Schaffer, J.D., Caruana, R.A., Eshelman, L.J., Das, R.: A Study of Control Parameters Affecting Online Performance of Genetic Algorithms for Function Optimization. Proceedings of the Third International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA (1989)
11. Thierens, D.: Analysis and Design of Genetic Algorithms. Katholieke Universiteit Leuven, Leuven, Belgium (1990)
12. Whitley, D.: Local Search and High Precision Gray Codes: Convergence Results and Neighborhoods. In: Martin, W., Spears, W. (eds.): Foundations of Genetic Algorithms 6. Morgan Kaufmann, San Francisco, CA (2000)
13. Xu, Z.B., Leung, K.S., Liang, Y., Leung, Y.: Efficiency Speed-up Strategies for Evolutionary Computation: Fundamentals and Fast-GAs. Applied Mathematics and Computation 142 (2003) 341-388
Adaptive Training of a Kernel-Based Representative and Discriminative Nonlinear Classifier

Benyong Liu, Jing Zhang, and Xiaowei Chen

College of Computer Science and Technology, Guizhou University, Huaxi 550025, Guiyang, China
[email protected], [email protected], [email protected]

Abstract. Adaptive training of a classifier is necessary when feature selection and sparse representation are considered. Previously, we proposed a kernel-based nonlinear classifier for simultaneous representation and discrimination of pattern features. Its batch training has a closed-form solution. In this paper we implement an adaptive training algorithm using an incremental learning procedure that exactly retains the generalization ability of batch training. It naturally yields a sparse representation. The feasibility of the presented methods is illustrated by experimental results on handwritten digit classification.
1 Introduction
Adaptive training of a classifier is necessary when feature selection and sparse representation are considered. Generally it is realized by incremental learning, a procedure that adaptively updates the parameters when a new datum arrives, without reexamining the old ones. Many incremental learning methods have been devised so far [1], [2]. Some of them improve computational efficiency at the cost of decreasing the generalization capability of batch learning. In this paper, we design an incremental learning procedure to adaptively train a previously proposed classifier named the Kernel-based Representative and Discriminative Nonlinear Classifier (KNRD) [3]. In our discussion, it is required that the incremental learning result equals exactly that of batch learning including the new datum, so that the same generalization ability is maintained [4]. Based on the procedure, a technique for reducing the training set to obtain a sparse KNRD is derived. The validity of the presented adaptive training procedure and set-reduction technique is demonstrated by experimental results on handwritten digit recognition.
The rest of this paper is organized as follows. Section 2 briefly reviews the previously proposed classifiers, a Kernel-based Nonlinear Representor (KNR) and a Kernel-based Nonlinear Discriminator (KND), and combines them into a
The related work is supported by the Key Project of Chinese Ministry of Education (No.105150) and the Foundation of ATR Key Lab (51483010305DZ0207). Thanks to Prof. H. Ogawa of Tokyo Institute of Technology for helpful discussions.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 381–390, 2007. c Springer-Verlag Berlin Heidelberg 2007
KNRD. Section 3 presents an incremental learning procedure that implements adaptive training on a KNRD classifier, and addresses a set-reduction technique to obtain a sparse KNRD. Experimental results on handwritten digit recognition are given in Section 4. Conclusions are drawn in Section 5 and the related proofs are put into three appendices.
2 KNRD: A Kernel-Based Representative and Discriminative Nonlinear Classifier
Our discussion is limited to finding an optimal approximation to a desirable decision function f_0(x). We assume that f_0 is defined on C^N, a complex N-dimensional vector space, and that it is an element of a reproducing kernel Hilbert space with kernel function k. Generally only M sampled values of f_0 are known beforehand, and they constitute a teacher vector y, where

    y = A f_0,    (1)

and A is the sampler. We also assume that y is an element of the M-dimensional space C^M. The study goal is to find a certain inverse operator X of A, so that

    f = X y    (2)

becomes an optimal approximation to f_0 [4].
When a classifier is designed for optimal representation of a target class c, we can minimize the distance between f and f_0 by deriving X from

    X_R^{(c)} = \arg\min_{X^{(c)}} \{ \mathrm{tr}[(I - X^{(c)} A^{(c)})(I - X^{(c)} A^{(c)})^*] \},    (3)

where R denotes representation and (c) means that the operators correspond to class c, while tr and * denote the trace and the adjoint of an operator, respectively. The solution is named a KNR [5]. On the other hand, if a classifier is designed to optimally discriminate class c from the other classes, it is required that the inverse operator X satisfies

    X_D^{(c)} = \arg\min_{X^{(c)}} \{ \mathrm{tr}(X^{(c)} Q (X^{(c)})^*) \},    (4)

where D denotes discrimination and Q is given by

    Q = \frac{1}{C-1} \sum_{i=1, i \neq c}^{C} Q^{(i)},    (5)

wherein C is the total number of classes and

    Q^{(i)} = \bar{y}^{(i)} \otimes y^{(i)},    (6)

with y^{(i)} the teacher vector of class i, \otimes the Neumann-Schatten product [6], and \bar{y}^{(i)} the complex conjugate of y^{(i)}. A solution to this criterion results in a KND [7].
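One concrete matrix realisation of Eqs. (5)-(6) (our own sketch: the outer product is used as the matrix form of the Neumann-Schatten product, and the helper name is ours) can be written as:

```python
import numpy as np

def build_Q(teachers, c):
    """Assemble Q for class c as in Eqs. (5)-(6): the average, over all
    classes i != c, of the outer product of the conjugated teacher vector
    with itself. `teachers` is a list of M-dimensional teacher vectors."""
    C, M = len(teachers), len(teachers[0])
    Q = np.zeros((M, M), dtype=complex)
    for i, yi in enumerate(teachers):
        if i != c:
            Q += np.outer(np.conj(yi), yi)   # Q^(i), Eq. (6)
    return Q / (C - 1)                       # Eq. (5)

# toy check: three classes with one-hot teacher vectors
ys = [np.array([1., 0., 0.]), np.array([0., 1., 0.]), np.array([0., 0., 1.])]
Q = build_Q(ys, c=0)
```

With one-hot teachers, Q for class 0 is diagonal with weight 1/2 on the two competing classes, which matches the averaging in Eq. (5).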
Using a parameter \lambda to control the balance between representation and discrimination, the above two criteria were combined into the so-called R-D criterion, to simultaneously represent and discriminate pattern features, as follows [3]:

    X_{RD}^{(c)} = \arg\min_{X^{(c)}} \{ \mathrm{tr}[(I - X^{(c)} A^{(c)})(I - X^{(c)} A^{(c)})^* + \lambda X^{(c)} Q (X^{(c)})^*] \}.    (7)

A solution to the R-D criterion results in the following kernel-based representative and discriminative nonlinear classifier (KNRD) (Ref. [3] may be consulted for more information):

    f(x) = \sum_{j=1}^{M} a_j k(x, x_j),    (8)

where \{x_j\}_{j=1}^{M} is the set of training feature vectors and k is the associated kernel function. The coefficients in the above representation have the following closed-form solution [3]:

    a = [a_1, a_2, \ldots, a_M]^T = (U^{(c)})^+ y,    (9)

where T denotes the transpose of a vector or a matrix and (U^{(c)})^+ is the Moore-Penrose pseudoinverse of U^{(c)}, with

    U^{(c)} = K^{(c)} + \lambda Q,    (10)

and K^{(c)} is the kernel matrix determined by k and the M training feature vectors of class c, as follows:

    K^{(c)} = \begin{bmatrix} k(x_1, x_1) & k(x_2, x_1) & \cdots & k(x_M, x_1) \\ k(x_1, x_2) & k(x_2, x_2) & \cdots & k(x_M, x_2) \\ \cdots & \cdots & \cdots & \cdots \\ k(x_1, x_M) & k(x_2, x_M) & \cdots & k(x_M, x_M) \end{bmatrix}.    (11)

3 Adaptive Training of a KNRD
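The batch training of Eqs. (9)-(11) is a few lines of linear algebra. The sketch below uses a Gaussian kernel with an assumed width gamma (the paper fixes the kernel width by experience), and Q = 0 for a pure-representation toy check:

```python
import numpy as np

def train_knrd(X, y, Q, lam=0.2, gamma=1.0):
    """Batch KNRD training per Eqs. (9)-(11): a = (K + lambda*Q)^+ y,
    with a Gaussian kernel of assumed width gamma."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                    # kernel matrix, Eq. (11)
    return np.linalg.pinv(K + lam * Q) @ y     # Moore-Penrose pseudoinverse

def knrd_decision(a, X, x, gamma=1.0):
    """Evaluate the decision function f(x) of Eq. (8)."""
    return a @ np.exp(-gamma * ((X - x) ** 2).sum(-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = np.array([1., 1., 0., 0., 0.])             # teacher vector for one class
a = train_knrd(X, y, Q=np.zeros((5, 5)))       # Q = 0: representation only
```

With Q = 0 and a full-rank kernel matrix, the classifier interpolates the teacher values at the training points, which is a quick sanity check of Eq. (9).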
Adaptive training of a classifier is necessary when feature selection and sparse representation are considered. This kind of work has been done for the KNR and KND classifiers [8], [9]. In this section, we design a similar algorithm for adaptive training of the KNRD.

3.1 Adaptive Training with Incremental Learning
In neural network training, the process of adaptively adjusting the trained result with a new sample is called incremental learning [1]. Many incremental learning methods have been devised so far [1], [2]. Some of them improve memory and computation efficiency at the cost of decreasing the generalization ability of batch learning. In our discussion, it is required that the incremental learning
result equals exactly that of batch learning including the novel sample, so that the same generalization ability is retained [4]. Although several new instances may become available at a later stage, in this paper we consider a relearning procedure that processes only one instance, i.e., one training feature vector per class, at a time.
Now we turn our focus to a variable number of training data. For clarity, subscripts are used to denote variation. For example, y_m^{(c)} denotes the actual output vector of class c after m instances are trained, a_{m+1}^{(c)} denotes the coefficient vector obtained after the (m+1)-th instance becomes available, etc.
For the KNRD of class c, the objective is to express a_{m+1}^{(c)} by a_m^{(c)} and y_{m+1}^{(c)}, the desirable output of class c at stage (m+1), for c = 1, 2, \ldots, C. We use the following m-dimensional vector and scalar to describe the traits of the desirable outputs of all classes other than class c:

    q_{m+1} = \frac{1}{C-1} \sum_{i=1, i \neq c}^{C} y_m^{(i)}\, y_{m+1}^{(i)},    (12)

    \sigma_{m+1} = \frac{1}{C-1} \sum_{i=1, i \neq c}^{C} y_{m+1}^{(i)}\, y_{m+1}^{(i)}.    (13)
Furthermore, we define the following m-dimensional vectors

    s_{m+1} = \bigl[ k(x_1^{(c)}, x_{m+1}^{(c)}),\; k(x_2^{(c)}, x_{m+1}^{(c)}),\; \ldots,\; k(x_m^{(c)}, x_{m+1}^{(c)}) \bigr]^T,    (14)

    t_{m+1} = s_{m+1} + \lambda q_{m+1},    (15)

    \tau_{m+1} = (U_m^{(c)})^+ t_{m+1},    (16)

and the scalars

    \alpha_{m+1} = k(x_{m+1}^{(c)}, x_{m+1}^{(c)}) + \lambda \sigma_{m+1} - \langle \tau_{m+1}, t_{m+1} \rangle,    (17)

    \beta_{m+1} = 1 + \langle \tau_{m+1}, \tau_{m+1} \rangle,    (18)

    \gamma_{m+1} = \bigl( y_{m+1}^{(c)} - \langle y_m^{(c)}, \tau_{m+1} \rangle \bigr) / \beta_{m+1},    (19)

where \langle \cdot, \cdot \rangle denotes the inner product of C^m. Then we have the following lemmas, whose proofs are put into Appendices A and B, respectively.

Lemma 1. For t_{m+1} in Eq. (15) and \alpha_{m+1} in Eq. (17), we have

    t_{m+1} \in \mathcal{R}(U_m^{(c)}),    (20)

and

    \alpha_{m+1} \ge 0.    (21)
Lemma 2. The operators (U_{m+1}^{(c)})^+ and (U_m^{(c)})^+ have the following relation:

(i) when \alpha_{m+1} = 0,

    (U_{m+1}^{(c)})^+ = \begin{bmatrix} T_{m+1} (U_m^{(c)})^+ T_{m+1} & \dfrac{T_{m+1} (U_m^{(c)})^+ \tau_{m+1}}{\beta_{m+1}} \\[6pt] \dfrac{\tau_{m+1}^T (U_m^{(c)})^+ T_{m+1}}{\beta_{m+1}} & \dfrac{\tau_{m+1}^T (U_m^{(c)})^+ \tau_{m+1}}{\beta_{m+1}^2} \end{bmatrix},    (22)

where

    T_{m+1} = I_m - \frac{\tau_{m+1} \otimes \tau_{m+1}}{\beta_{m+1}};    (23)

(ii) when \alpha_{m+1} > 0,

    (U_{m+1}^{(c)})^+ = \begin{bmatrix} (U_m^{(c)})^+ + \dfrac{\tau_{m+1} \otimes \tau_{m+1}}{\alpha_{m+1}} & \dfrac{-\tau_{m+1}}{\alpha_{m+1}} \\[6pt] \dfrac{-\tau_{m+1}^T}{\alpha_{m+1}} & \dfrac{1}{\alpha_{m+1}} \end{bmatrix}.    (24)
Lemma 2 shows that we can avoid directly calculating the Moore-Penrose pseudoinverse of U^{(c)}. In addition, Lemma 2 and Eq. (9) naturally lead us to the following theorem, whose proof is put into Appendix C.

Theorem 1. The coefficients of the KNRD of class c (c = 1, 2, \ldots, C) can be adaptively trained as follows:

(i) when \alpha_{m+1} = 0,

    a_{m+1}^{(c)} = \begin{bmatrix} a_m^{(c)} - \eta_{m+1} \tau_{m+1} + \gamma_{m+1} (U_m^{(c)})^+ \tau_{m+1} \\ \eta_{m+1} \end{bmatrix},    (25)

where

    \eta_{m+1} = \frac{\gamma_{m+1} \langle (U_m^{(c)})^+ \tau_{m+1}, \tau_{m+1} \rangle + \langle \tau_{m+1}, a_m^{(c)} \rangle}{\beta_{m+1}};    (26)

(ii) when \alpha_{m+1} > 0,

    a_{m+1}^{(c)} = \begin{bmatrix} a_m^{(c)} - \dfrac{\gamma_{m+1} \beta_{m+1}}{\alpha_{m+1}} \tau_{m+1} \\[6pt] \dfrac{\gamma_{m+1} \beta_{m+1}}{\alpha_{m+1}} \end{bmatrix}.    (27)
3.2 Sparse Representation of KNRD
Theorem 1 shows that the effect of every training sample on a KNRD can be evaluated during adaptive training, and hence samples of little importance can be discarded one by one if necessary. From this we obtain a technique for sparse representation of a KNRD, briefly discussed as follows.
For class c, we adopt the following distance to evaluate the importance of the novel training feature vector x_{m+1}^{(c)} at stage (m+1):

    \delta_{m+1}^{(c)} = \bigl| f_{m+1}^{(c)}(x_{m+1}^{(c)}) - f_m^{(c)}(x_{m+1}^{(c)}) \bigr|,    (28)

where |\cdot| denotes the absolute value of a number. If \delta_{m+1}^{(c)} is less than a predetermined threshold \epsilon, a positive number trading generalization ability for sparseness, then x_{m+1}^{(c)} will be discarded. Theorem 1 and Eqs. (8) and (28) yield

    \delta_{m+1}^{(c)} = \begin{cases} |\eta_{m+1}| \, \bigl| k_{m+1,m+1} + \sum_{j=1}^{m} \nu_{m+1}(j)\, k_{j,m+1} \bigr| & \text{if } \alpha_{m+1} = 0, \\[4pt] \bigl| \frac{\gamma_{m+1} \beta_{m+1}}{\alpha_{m+1}} \bigr| \, \bigl| k_{m+1,m+1} + \sum_{j=1}^{m} \tau_{m+1}(j)\, k_{j,m+1} \bigr| & \text{if } \alpha_{m+1} > 0, \end{cases}    (29)

where

    k_{m+1,m+1} = k(x_{m+1}^{(c)}, x_{m+1}^{(c)}),    (30)

    k_{j,m+1} = k(x_{m+1}^{(c)}, x_j^{(c)}),    (31)

\tau_{m+1}(j) is the j-th element of the vector \tau_{m+1}, and \nu_{m+1}(j) is that of the vector

    \nu_{m+1} = (U_m^{(c)})^+ \tau_{m+1} - \frac{\gamma_{m+1}}{\eta_{m+1}} \tau_{m+1}.    (32)

For the KNRD of class c, the above adaptive training procedure and the training-set reduction technique are summarized into the following algorithm.

Algorithm
1. Begin.
2. Decide on the reproducing kernel k(x, x') and the threshold \epsilon for data reduction.
3. Initialize to zero: m = 0, a_0^{(c)} = 0, and (U_0^{(c)})^+ = 0.
4. For the new training feature set \{x_{m+1}^{(i)}\}_{i=1}^{C}, decide on the corresponding desirable output values of the classifier, say y_{m+1}^{(c)} = 1 and y_{m+1}^{(i)} = 0 for i \neq c, and
   (a) Calculate the actual output vectors \{y_{m+1}^{(i)} = [f^{(c)}(x_1^{(i)}), \cdots, f^{(c)}(x_m^{(i)})]^T\}_{i=1}^{C} using a_m^{(c)} and Eq. (8), where M is substituted by m.
   (b) Calculate the vector q_{m+1} using Eq. (12) and the scalar \sigma_{m+1} using Eq. (13).
   (c) Calculate the vectors s_{m+1} and t_{m+1} using Eqs. (14) and (15), respectively.
   (d) Calculate the vector \tau_{m+1} using Eq. (16).
   (e) Calculate the scalars \alpha_{m+1}, \beta_{m+1}, and \gamma_{m+1} using Eqs. (17), (18), and (19), respectively.
   (f) Calculate the weight vector a_{m+1}^{(c)} using Eqs. (25) and (26), or Eq. (27).
   (g) Calculate \delta_{m+1}^{(c)} by Eq. (29). If \delta_{m+1}^{(c)} is less than \epsilon, then discard the new data set.
   (h) If there is still a new training feature set, calculate the operator (U_{m+1}^{(c)})^+ using Eqs. (22) and (23), or Eq. (24); substitute m by (m+1) and return to Step 4. Otherwise, let M = m + 1 and go to Step 5.
5. Output a_M^{(c)}.
6. End.

In the sequel, the feasibility of the above algorithm is demonstrated by experimental results on handwritten digit classification.
Adaptive Training of a KNRD
4 Experiments on Handwritten Digit Recognition
For convenient comparison, we experiment with the dataset used by Jain et al. [10] and in our previous works [7], [8], [9]. It provides features of handwritten digits ("0"-"9") extracted from a collection of Dutch utility maps. For each digit class, there are two hundred patterns. The dataset contains six feature sets, respectively consisting of

- 76 Fourier coefficients of the character shape,
- 216 profile correlations,
- 64 Karhunen-Loève coefficients,
- 240 pixel averages in 2 × 3 windows,
- 47 Zernike moments, and
- 6 morphological features.
We adopt the Gaussian kernel, with the kernel width, together with the value of λ, crudely estimated by experience [7]. The test set consists of the last one hundred feature vectors of each class and is fixed in the following experiments. For comparison with the results of other classifiers in Ref.[10], for each feature set we first consider training size 10 × 50 (ten classes and fifty patterns per class), randomly selected so that there is no intersection between the training set and the test set. Ten different runs are conducted, and the classification error rates averaged over classes and runs are listed in the second row of Tab.1, in which the values printed in bold denote the best ones among the results of our method and of the methods conducted in Ref.[10]. In Ref.[10], twelve methods were applied to this dataset; for the six feature sets, the best results (error rates in percent) are given by the Parzen classifier (17.1), the linear Bayes normal classifier (3.4), the Parzen classifier (3.7), the 1-NN rule, the k-NN rule, and the Parzen classifier (3.7), the linear Bayes normal classifier (18.0), and the linear Fisher discriminator (28.2), respectively. Our KNRD classifier performs the best for feature Sets 3 and 4, and nearly the best for feature Sets 1 and 5. That is, in comparison with the twelve classifiers conducted in [10] on the six feature sets, an adaptively trained KNRD performs almost the best.

The third row of Tab.1 lists the error rates of an experiment on the efficiency of the proposed technique in sparse representation. In this experiment, the predetermined positive thresholds for the six feature sets are respectively estimated

Table 1. Classification error rates of the adaptively trained KNRD (CER1) and the sparse KNRD (CER2), on training size 10 × 50, i.e., 500 feature vectors, where λ = 0.2

                          Set 1   Set 2   Set 3   Set 4   Set 5   Set 6
CER1                      17.3    7.8     3.6     3.7     18.6    69.1
CER2                      19.6    21.6    7.8     7.6     31.6    69.0
Threshold                 0.3     0.1     0.2     0.4     0.1     0.1
Remained feature vectors  434     398     401     375     406     431
B. Liu, J. Zhang, and X. Chen
in a manner similar to that of estimating the kernel widths. The listed number of remained feature vectors is an average over the ten runs and the ten classes. The results in Tab.1 show that training set reduction is obtained at the cost of an increased classification error rate. For example, an increase of more than 4.0 points in error rate is paid for an around 20.0% reduction of the training set (Set 3). Notice that the results in Tab.1 compare favorably with those of a quadratic support vector classifier (SVC), which yields error rates of 21.2, 5.1, 4.0, 6.0, 19.3, and 81.1, respectively, for the six feature sets; these results are in turn better than those of the linear SVC [10].
5 Conclusions
We designed an incremental learning procedure for the adaptive training of KNRD, a previously proposed kernel-based nonlinear classifier for simultaneous representation and discrimination of pattern features. The procedure reduces the computational load of batch training because it avoids directly calculating the Moore-Penrose generalized inverse of a matrix, and it results in a sparse representation of the KNRD. The validity of the presented methods was demonstrated by experimental results on handwritten digit classification.
References
1. Fu, L.M., Hsu, H.H., Principe, J.C.: Incremental Backpropagation Learning Network. IEEE Trans. Neural Networks 7 (1996) 757-761
2. Park, D.C., Sharkawi, A.E., Marks II, R.J.: An Adaptively Trained Neural Network. IEEE Trans. Neural Networks 2 (1991) 334-345
3. Liu, B.Y., Zhang, J.: Face Recognition Applying a Kernel-Based Representative and Discriminative Nonlinear Classifier to Eigenspectra. Proc. IEEE Int. Conf. Communications, Circuits and Systems, Hong Kong 2 (2005) 964-968
4. Vijayakumar, S., Ogawa, H.: A Functional Analytic Approach to Incremental Learning in Optimally Generalizing Neural Networks. Proc. IEEE Int. Conf. Neural Networks, Perth, Western Australia 2 (1995) 777-782
5. Liu, B.Y., Zhang, J.: Eigenspectra Versus Eigenfaces: Classification with a Kernel-Based Nonlinear Representor. LNCS 3610 (2005) 660-663
6. Schatten, R.: Norm Ideals of Completely Continuous Operators. Springer, Berlin (1970)
7. Liu, B.Y.: A Kernel-Based Nonlinear Discriminator with Closed-Form Solution. Proc. IEEE Int. Conf. Neural Network and Signal Processing, Nanjing, China 1 (2003) 41-44
8. Liu, B.Y., Zhang, J.: An Adaptively Trained Kernel-Based Nonlinear Representor for Handwritten Digit Classification. J. Electronics (China) 23 (2006) 379-383
9. Liu, B.Y.: Adaptive Training of a Kernel-Based Nonlinear Discriminator. Pattern Recognition 38 (2005) 2419-2425
10. Jain, A.K., Duin, R.P.W., Mao, J.C.: Statistical Pattern Recognition: A Review. IEEE Trans. Pattern Analysis and Machine Intelligence 22 (2000) 4-37
11. Albert, A.: Conditions for Positive and Nonnegative Definiteness in Terms of Pseudoinverses. SIAM J. Appl. Math. 17 (1969) 434-440
Appendix A: Proof of Lemma 1

From Eqs.(5), (10), and (11) we know that $U_{m+1}^{(c)}$ is positive semi-definite, that is, $U_{m+1}^{(c)} \ge 0$. Furthermore, Eqs.(5) and (10)-(15) yield

$$U_{m+1}^{(c)} = \begin{pmatrix} U_m^{(c)} & t_{m+1} \\ t_{m+1}^{T} & \mu_{m+1} \end{pmatrix}, \tag{33}$$

where

$$\mu_{m+1} = k(x_{m+1}^{(c)}, x_{m+1}^{(c)}) + \lambda\sigma_{m+1}. \tag{34}$$

From Theorems 1 and 2 of Ref.[11] we know that

$$U_m^{(c)}(U_m^{(c)})^{+}t_{m+1} = t_{m+1}, \tag{35}$$

$$\mu_{m+1} \ge \langle (U_m^{(c)})^{+}t_{m+1},\, t_{m+1}\rangle. \tag{36}$$

Eq.(35) is equivalent to Eq.(20) because $U_m^{(c)}(U_m^{(c)})^{+} = P_{R(U_m^{(c)})}$, the orthogonal projection operator onto the range of $U_m^{(c)}$, and Eq.(36) is equivalent to Eq.(21) because of Eqs.(34), (16), and (17).
Appendix B: Proof of Lemma 2

In order to prove Lemma 2, Theorem 3 of Ref.[11] is restated as the following proposition.

Proposition A. Suppose $U_m^{(c)} \ge 0$ and

$$U_{m+1}^{(c)} = \begin{pmatrix} U_m^{(c)} & t_{m+1} \\ t_{m+1}^{T} & \mu_{m+1} \end{pmatrix}. \tag{37}$$

Let $\tau_{m+1} = (U_m^{(c)})^{+}t_{m+1}$, $\alpha_{m+1} = \mu_{m+1} - t_{m+1}^{T}(U_m^{(c)})^{+}t_{m+1}$, $\beta_{m+1} = 1 + \|\tau_{m+1}\|^{2}$, and $T_{m+1} = I_m - \tau_{m+1}\tau_{m+1}^{T}/\beta_{m+1}$. Then

(i) $U_{m+1}^{(c)} \ge 0$ if and only if $U_m^{(c)}\tau_{m+1} = t_{m+1}$ and $\alpha_{m+1} \ge 0$.
(ii) In this case,

$$(U_{m+1}^{(c)})^{+} = \begin{cases} \begin{pmatrix} (U_m^{(c)})^{+} + \dfrac{\tau_{m+1}\tau_{m+1}^{T}}{\alpha_{m+1}} & -\dfrac{\tau_{m+1}}{\alpha_{m+1}} \\ -\dfrac{\tau_{m+1}^{T}}{\alpha_{m+1}} & \dfrac{1}{\alpha_{m+1}} \end{pmatrix} & \text{if } \alpha_{m+1} > 0,\\[12pt] \begin{pmatrix} T_{m+1}(U_m^{(c)})^{+}T_{m+1} & \dfrac{T_{m+1}(U_m^{(c)})^{+}\tau_{m+1}}{\beta_{m+1}} \\ \dfrac{\bigl(T_{m+1}(U_m^{(c)})^{+}\tau_{m+1}\bigr)^{T}}{\beta_{m+1}} & \dfrac{\tau_{m+1}^{T}(U_m^{(c)})^{+}\tau_{m+1}}{\beta_{m+1}^{2}} \end{pmatrix} & \text{if } \alpha_{m+1} = 0. \end{cases} \tag{38}$$
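The rank-one bordering update of Proposition A can be checked numerically against a directly computed pseudoinverse. Below is a small NumPy sketch of the $\alpha_{m+1} > 0$ branch of Eq.(38); the matrix sizes and values are illustrative, and the block U is taken full-rank so that α > 0 holds:

```python
import numpy as np

def bordered_pinv(U_pinv, t, mu):
    """Pseudoinverse of [[U, t], [t^T, mu]] via the alpha > 0 case of Eq.(38)."""
    tau = U_pinv @ t                      # tau_{m+1} = (U^{(c)}_m)^+ t_{m+1}
    alpha = mu - t @ tau                  # alpha_{m+1}, assumed > 0 here
    m = len(t)
    out = np.empty((m + 1, m + 1))
    out[:m, :m] = U_pinv + np.outer(tau, tau) / alpha
    out[:m, m] = -tau / alpha
    out[m, :m] = -tau / alpha
    out[m, m] = 1.0 / alpha
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
U = A @ A.T + np.eye(4)                   # full-rank PSD block
t = rng.standard_normal(4)
mu = t @ np.linalg.solve(U, t) + 1.0      # makes alpha exactly 1 > 0
U_big = np.block([[U, t.reshape(-1, 1)], [t.reshape(1, -1), np.array([[mu]])]])
assert np.allclose(bordered_pinv(np.linalg.pinv(U), t, mu),
                   np.linalg.pinv(U_big), atol=1e-8)
```

For the α = 0 case the second branch of Eq.(38) applies instead; either way the bordered pseudoinverse is obtained in O(m²) operations per new sample rather than recomputed from scratch.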
Proof of Lemma 2. Since $t_{m+1}^{T}(U_m^{(c)})^{+}t_{m+1} = t_{m+1}^{T}\tau_{m+1} = \langle t_{m+1}, \tau_{m+1}\rangle$, $\tau_{m+1}\tau_{m+1}^{T} = \tau_{m+1} \otimes \tau_{m+1}$, $\|\tau_{m+1}\|^{2} = \langle\tau_{m+1}, \tau_{m+1}\rangle$, and $\tau_{m+1}^{T}(U_m^{(c)})^{+}\tau_{m+1} = \langle (U_m^{(c)})^{+}\tau_{m+1}, \tau_{m+1}\rangle$, from Eqs.(5), (10), and (11) we know that $U_m^{(c)} \ge 0$. Furthermore, Eqs.(16) and (35) yield $U_m^{(c)}\tau_{m+1} = t_{m+1}$. Finally, Lemma 1 shows that $\alpha_{m+1} \ge 0$. These conditions, Eq.(33), and Proposition A lead us to Lemma 2.
Appendix C: Proof of Theorem 1
Notice that $y_{m+1}^{T} = \bigl(y_m^{T} \;\; (y_{m+1}^{(c)})^{T}\bigr)$; hence Eq.(9) and Lemma 2 directly yield Theorem 1.
Indirect Training of Grey-Box Models: Application to a Bioprocess

Francisco Cruz, Gonzalo Acuña, Francisco Cubillos, Vicente Moreno, and Danilo Bassi

Facultad de Ingeniería, Universidad de Santiago de Chile, USACH
Av. Libertador Bernardo O'Higgins 3363, Santiago, Chile
[email protected], [email protected]

Abstract. Grey-box neural models mix differential equations, which act as white boxes, and neural networks, used as black boxes. The purpose of the present work is to show the training of a grey-box model by means of indirect backpropagation and Levenberg-Marquardt in Matlab®, extending the black-box neural model in order to fit the discretized equations of the phenomenological model. The obtained grey-box model is tested as an estimator of a state variable of a biotechnological batch fermentation process on solid substrate, with good results.
1 Introduction

The determination of relevant variables or parameters to improve a complex process is a demanding and difficult task. This gives rise to the need to estimate variables that cannot be measured directly, which in turn requires a software sensor to estimate those variables that cannot be measured on line [1]. An additional problem arises when a model has parameters that vary in time, because a strategy must then be applied to identify such parameters on line and in real time [2]. A methodology used in these cases, especially in the field of chemical and biotechnological processes, is that of the so-called grey-box models [3]. These are models that include a limited phenomenological model complemented with parameters obtained by means of neural networks. The learning or training strategies used so far for grey-box neural models assume the existence of data for the parameters obtained by the neural model [4], but most of the time this is not possible. This paper proposes a training process that does not use learning data for the neural network part; instead, the error at the model's output is backpropagated through the phenomenological model, as will be detailed below. The creation of the proposed model, the training, and the simulations were all carried out using the Matlab development tool.
2 Grey-Box Models

Grey-box neural models are used for systems in which there is some a priori knowledge, i.e., some physical laws are known, but some parameters must be determined from the observed data.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 391-397, 2007.
© Springer-Verlag Berlin Heidelberg 2007
F. Cruz et al.
Acuña et al. [5] distinguish between two methods of training. The first corresponds to direct training (Fig. 1(a)), which uses the error originated at the output of the neural network to determine its weights. The second is indirect training (Fig. 1(b)), which uses the error originated at the model's output for the learning of the neural network. Indirect training can be carried out in two ways: one minimizes an objective function by means of a nonlinear optimization technique, and the other backpropagates the output error to the weights of the neural network, taking into account the discretized equations of the phenomenological model.
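The second variant can be illustrated on a toy grey-box with one model equation and a one-weight "network": the error is formed at the model output and the chain rule carries it back through the known model step to the network weight. Everything below is illustrative, not the paper's actual model:

```python
def model_step(x, mu, dt):
    """White box: one discretized growth-type equation x_{t+1} = x_t + mu*x_t*dt."""
    return x + mu * x * dt

def mu_net(x, w):
    """Black box: a trivial one-weight 'network' producing the parameter mu."""
    return w * x

def indirect_grad(x, target, w, dt):
    """Gradient of the squared output error w.r.t. the network weight,
    backpropagated through the white-box model step."""
    pred = model_step(x, mu_net(x, w), dt)
    err = pred - target
    # chain rule: dL/dw = 2*err * (d pred / d mu) * (d mu / d w) = 2*err * (x*dt) * x
    return 2.0 * err * (x * dt) * x

# finite-difference check of the indirect gradient
x, target, w, dt, eps = 2.0, 2.5, 0.3, 0.1, 1e-6
loss = lambda w_: (model_step(x, mu_net(x, w_), dt) - target) ** 2
g_fd = (loss(w + eps) - loss(w - eps)) / (2 * eps)
assert abs(indirect_grad(x, target, w, dt) - g_fd) < 1e-6
```

The same chaining, applied layer by layer with the gradients of Tables 1 and 2, is what the indirect backpropagation of Section 2 performs for the full network.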
Fig. 1. (a) Grey-box model with direct training. (b) Grey-box model with indirect training.
In this paper the second indirect training method is used, calculating the error at the output of the phenomenological model and backpropagating it from there to the model's neural or black-box part. The backpropagation process considers a network with m inputs and p outputs, n neurons in an intermediate layer, and d data for training. The computed gradients, depending on the activation and the transfer functions used, are shown in Table 1 and Table 2, respectively, where $w_{ij}^{k}$ is the weight of the connection from neuron i to neuron j in layer k, $A_{i}^{k}$ is the activation value of neuron i of layer k, and $Z_{i}^{k}$ is the transfer value of neuron i of layer k (the output of neuron i).

Table 1. Gradients depending on the activation functions used in the neurons

f                                      Sum                  Product
$\partial A_{c}^{k+1}/\partial Z_{j}^{k}$    $w_{jc}^{k+1}$       $\bigl(\prod_{q=1} w_{qc}^{k+1}\bigr)\cdot\bigl(\prod_{q=1,\,q\neq j} Z_{q}^{k}\bigr)$
$\partial A_{j}^{k}/\partial w_{ij}^{k}$     $Z_{i}^{k-1}$        $\bigl(\prod_{q=1,\,q\neq i} w_{qj}^{k}\bigr)\cdot\bigl(\prod_{q=1} Z_{q}^{k-1}\bigr)$

Table 2. Gradients depending on the transfer functions used in the neurons

g                                      sigmoid                          tanh                  inverse                identity
$\partial Z_{j}^{k}/\partial A_{j}^{k}$      $Z_{j}^{k}\cdot(1 - Z_{j}^{k})$      $1 - (Z_{j}^{k})^{2}$     $-Z_{j}^{k}\cdot Z_{j}^{k}$     1
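The product-column entries of Table 1 can be verified with a finite-difference check; a short sketch (the layer width, the chosen index j, and the random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(0.5, 1.5, 4)    # weights w_q feeding one product-activation neuron
Z = rng.uniform(0.5, 1.5, 4)    # incoming transfer values Z_q

def A(w, Z):
    """Product activation: A = prod_q (w_q * Z_q)."""
    return float(np.prod(w * Z))

j, eps = 2, 1e-6

# Table 1, product column: dA/dZ_j = (prod_q w_q) * (prod_{q != j} Z_q)
analytic_Z = np.prod(w) * np.prod(np.delete(Z, j))
Zp, Zm = Z.copy(), Z.copy()
Zp[j] += eps; Zm[j] -= eps
assert abs(analytic_Z - (A(w, Zp) - A(w, Zm)) / (2 * eps)) < 1e-5

# Table 1, product column: dA/dw_j = (prod_{q != j} w_q) * (prod_q Z_q)
analytic_w = np.prod(np.delete(w, j)) * np.prod(Z)
wp, wm = w.copy(), w.copy()
wp[j] += eps; wm[j] -= eps
assert abs(analytic_w - (A(wp, Z) - A(wm, Z)) / (2 * eps)) < 1e-5
```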
3 Biotechnological Process

In this paper a grey-box neural model is proposed for the simulation of a batch fermentation bioprocess on a solid substrate, corresponding to the production of gibberellic acid by the filamentous fungus Gibberella fujikuroi. A simplified model describes the evolution of the main variables [6]. This phenomenological model, based on mass conservation laws, considers 8 state variables: active biomass (X), measured biomass (Xmeasu), urea (U), intermediate nitrogen (NI), starch (S), gibberellic acid (GA3), carbon dioxide (CO2) and oxygen (O2). Only the last two variables can be measured directly on line. The model's equations, discretized by Euler's method and considering discrete times t and t+1, are the following:
$$X_{measu(t+1)} = X_{measu(t)} + \bigl(\mu \cdot X_{(t)}\bigr) \cdot \Delta t, \tag{1}$$

$$X_{(t+1)} = X_{(t)} + \bigl(\mu \cdot X_{(t)} - k_d \cdot X_{(t)}\bigr) \cdot \Delta t, \tag{2}$$

$$U_{(t+1)} = U_{(t)} + (-k) \cdot \Delta t, \tag{3}$$

$$N_{I(t+1)} = \begin{cases} N_{I(t)} + \Bigl(0.47\cdot k - \mu\cdot\dfrac{X_{(t)}}{Y_{X/N_I}}\Bigr)\cdot\Delta t, & \text{if } U \ge 0,\\[8pt] N_{I(t)} + \Bigl(-\mu\cdot\dfrac{X_{(t)}}{Y_{X/N_I}}\Bigr)\cdot\Delta t,\quad U_{(t)} = 0, & \text{if } U < 0, \end{cases} \tag{4}$$

$$S_{(t+1)} = S_{(t)} + \Bigl(-\dfrac{\mu\cdot X_{(t)}}{Y_{X/S}} - m_s\cdot X_{(t)}\Bigr)\cdot\Delta t, \tag{5}$$

$$GA3_{(t+1)} = GA3_{(t)} + \bigl(\beta\cdot X_{(t)} - k_p\cdot GA3_{(t)}\bigr)\cdot\Delta t, \tag{6}$$

$$CO2_{(t+1)} = CO2_{(t)} + \Bigl(\mu\cdot\dfrac{X_{(t)}}{Y_{X/CO2}} + m_{CO2}\cdot X_{(t)}\Bigr)\cdot\Delta t, \tag{7}$$

$$O2_{(t+1)} = O2_{(t)} + \Bigl(\mu\cdot\dfrac{X_{(t)}}{Y_{X/O2}} + m_{O2}\cdot X_{(t)}\Bigr)\cdot\Delta t. \tag{8}$$
The measured outputs are the following:

$$y_1 = CO2_{(t+1)}, \tag{9}$$

$$y_2 = O2_{(t+1)}. \tag{10}$$
On the other hand, the parameters that are difficult to obtain and that will be estimated by the model's neural part are μ and β, corresponding to the specific growth rate and the specific production rate of gibberellic acid, respectively. The remaining parameters were identified on the basis of specific practices and experimental conditions. Their values under controlled temperature and water activity conditions (T = 25 °C, Aw = 0.992) can be found in [6].
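For reference, Eqs.(1)-(8) amount to a single Euler update per time step. A sketch in Python follows; the parameter values used in the example call are placeholders, not the identified values of [6]:

```python
def euler_step(st, mu, beta, p, dt):
    """One Euler step of the discretized bioprocess model, Eqs.(1)-(8).

    st : dict with keys Xm, X, U, NI, S, GA3, CO2, O2 (current state)
    mu, beta : time-varying rates, estimated by the neural part of the model
    p : dict of the remaining, identified parameters
    """
    nxt = dict(st)
    nxt["Xm"] = st["Xm"] + mu * st["X"] * dt                                   # Eq.(1)
    nxt["X"] = st["X"] + (mu * st["X"] - p["kd"] * st["X"]) * dt               # Eq.(2)
    nxt["U"] = st["U"] + (-p["k"]) * dt                                        # Eq.(3)
    if st["U"] >= 0:                                                           # Eq.(4): switch on urea
        nxt["NI"] = st["NI"] + (0.47 * p["k"] - mu * st["X"] / p["YXNI"]) * dt
    else:
        nxt["NI"] = st["NI"] + (-mu * st["X"] / p["YXNI"]) * dt
        nxt["U"] = 0.0
    nxt["S"] = st["S"] + (-mu * st["X"] / p["YXS"] - p["ms"] * st["X"]) * dt   # Eq.(5)
    nxt["GA3"] = st["GA3"] + (beta * st["X"] - p["kp"] * st["GA3"]) * dt       # Eq.(6)
    nxt["CO2"] = st["CO2"] + (mu * st["X"] / p["YXCO2"] + p["mCO2"] * st["X"]) * dt  # Eq.(7)
    nxt["O2"] = st["O2"] + (mu * st["X"] / p["YXO2"] + p["mO2"] * st["X"]) * dt      # Eq.(8)
    return nxt

# illustrative call with made-up initial state and placeholder parameters
state0 = dict(Xm=0.1, X=0.1, U=1.0, NI=0.0, S=10.0, GA3=0.0, CO2=0.0, O2=0.0)
params = dict(kd=0.01, k=0.02, YXNI=0.5, YXS=0.5, ms=0.01, kp=0.01,
              YXCO2=0.8, YXO2=0.9, mCO2=0.02, mO2=0.02)
state1 = euler_step(state0, mu=0.1, beta=0.05, p=params, dt=0.1)
```

The measured outputs of Eqs.(9)-(10) are then simply the CO2 and O2 entries of the updated state.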
4 Proposed Solution

The proposed solution is a grey-box neural model whose phenomenological part can be described jointly with its black-box part by means of an extended neural network containing both the discretized equations of the phenomenological model and the time-varying parameters modeled by the black-box part (Fig. 2). This hybrid neural network has the capacity to fix weights in the training phase, so that it can act as a grey-box model. The weights in Fig. 2 that have a fixed value correspond to the model's phenomenological part. The weights for which no value is given correspond to the model's neural part. These weights were initially assigned pseudo-random values obtained by the initialization method of Nguyen & Widrow [7]. In Fig. 2 it can be seen that one of the weights corresponding to the white-box or phenomenological part is drawn as a dotted line. This line represents the switching phenomenon of the fourth state variable (NI) in the mathematical model: if the urea (U) is greater than or equal to zero, this weight has the indicated value; otherwise, if the urea (U) is less than zero, this weight has a value of zero. Therefore, the multilayer perceptron inserted in the model estimates the values of the two parameters that are difficult to obtain, and in turn these are mixed with the phenomenological part of the model, thereby producing its output. For the black-box neural part, the hyperbolic tangent was used as the transfer function in the intermediate layer and the identity function in the output layer, while for the phenomenological part the identity function was used as the transfer function. The activation function most commonly used was the sum of the inputs, except for the two neurons immediately after the output of the black-box neural part, for which a product was used as the activation function in order to follow the discretized phenomenological equations.
The training algorithm used corresponds to backpropagation with a Levenberg-Marquardt optimization method. As already stated, the algorithm has the capacity to modify only the weights that are indicated, leaving a group of fixed weights which represent the model's phenomenological part in the training phase. For the validation of the proposed grey-box neural model, quality indexes such as IA (Index of Agreement), RMS (Root Mean Square) and RSD (Relative Standard
Fig. 2. Grey-box model for the solid substrate fermentation process. Fixed weights represent the discretized phenomenological model. The black-box part that models the unknown time-varying parameters µ and β has variable weights. The dotted line represents a switch in the model of the state variable (NI).
Deviation) are calculated, and the values considered acceptable for these indexes are IA > 0.9, RMS

> r, the gene value is replaced by a new value. If the gene is binary, the operation inverts
J. Tian, M. Li, and F. Chen
the bit (if the original bit is 0, it is replaced by 1, and vice versa). If the gene is real-valued, it is replaced by a new value:

$$c'_{ijl} = c_{ijl} + r'_{ijl}\cdot(\bar{c}_{ijl} - c_{ijl}), \qquad \sigma'_{ijl} = \sigma_{ijl} + r'_{ijl}\cdot(\bar{\sigma}_{ijl} - \sigma_{ijl}),$$

where $\bar{c}_{ijl}$ and $c_{ijl}$ are the hidden center values in the previous two iterations, $\bar{\sigma}_{ijl}$ and $\sigma_{ijl}$ are the corresponding radius widths, and $r'_{ijl}$ is a random number uniformly distributed on [-1, 1]. In order to ensure the validity of mutation, a dynamic mutation rate is used, i.e., an individual whose fitness value is above the average level is treated with a lower probability, while one below the average level is treated with a higher probability.

3.5.2 Structure Mutation Operator
Since the probability of crossover and non-structure mutation is usually low, we try to introduce some additional flexibility by using the so-called structure mutation operator, which can add or prune some hidden node centers to get a different network. A binary value $r_b$ and a real number $r \in (0,1)$ are generated randomly for each chromosome $C_l$. If $r_b = 0$ and $p_{ad} > r$, all the genes below a randomly selected position are deleted, and the corresponding bits of its control vector are set to 0. If $r_b = 1$ and $p_{ad} > r$, a random number of non-zero vectors, $c_{ijl} = x_{j,\min} + t_{ij}\cdot\delta x_j$, are used to replace an equal number of rows whose control vector bits were formerly 0, taking into account that the total number of rows should not exceed D; the relevant control bits are then modified to 1. $t_{ij}$ is a constant integer between 1 and $m_n$, and the meanings of the other parameters are the same as described in Section 3.2.
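The structure mutation step can be sketched as operations on a candidate center matrix and its binary control vector. The NumPy sketch below is illustrative: the dimensions, parameter names, and the count of added rows are assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def structure_mutate(centers, control, p_ad, x_min, dx, mn):
    """Add or prune hidden-node centers as in Section 3.5.2 (sketch).

    centers : (D, n) array of candidate center rows
    control : (D,) binary vector selecting the active rows
    """
    centers, control = centers.copy(), control.copy()
    rb, r = rng.integers(0, 2), rng.random()
    if p_ad <= r:
        return centers, control              # no structural change this time
    if rb == 0:                              # prune: clear all genes below a random position
        pos = rng.integers(0, len(control))
        control[pos:] = 0
    else:                                    # add: activate some formerly-zero rows
        inactive = np.flatnonzero(control == 0)
        k = rng.integers(1, len(inactive) + 1) if len(inactive) else 0
        for i in inactive[:k]:
            t = rng.integers(1, mn + 1)      # c_ij = x_{j,min} + t_ij * delta_x_j
            centers[i] = x_min + t * dx
            control[i] = 1
    return centers, control
```

Because only existing rows are toggled, the total number of rows never exceeds D; with p_ad = 0.2 as in the experiments, a structural change is attempted in roughly one fifth of the mutations.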
4 Experimental Study

In order to evaluate the performance of GA-RBFNN, we have applied the proposed methods and conventional methods to eight UCI datasets. They are real-world problems with different numbers of available patterns (from 178 to 990), different numbers of classes (from 2 to 11), and different kinds of inputs. Each dataset was divided into three subsets: 50% of the patterns were used for learning, 25% for validation, and the remaining 25% for testing. There are two exceptions, the Sonar and Vowel problems, as the patterns of these two problems are prearranged in two subsets due to their specific features.

4.1 Experiment 1
The experiments were carried out with the aim of testing our GA-RBFNN model against some traditional training algorithms: the DRSC, K-means, the probabilistic neural network (PNN), and the K-nearest neighbor algorithm (KNN). These methods generate an RBFNN without a validation set, by joining the validation set to the training set. For each dataset, 30 runs of the algorithm were performed. The GA parameters were set as follows: the population size L was 50, and the number of generations G was 200. The probability of crossover pc was 0.5. The higher non-structure mutation rate pm1 was 0.4, the lower one pm2 was 0.2, and the structure mutation rate pad was 0.2. The average classification accuracies and the average numbers of hidden nodes over the 30 runs are shown in Table 1.
An Evolutionary RBFNN Learning Algorithm for Complex Classification Problems
Table 1. Comparison with other algorithms on eight UCI datasets. The accuracy values of evaluating the GA-RBFNN are omitted for simplicity in comparison, since the data were only divided into two subsets in the compared algorithms. The t-test compares the average testing accuracy of the GA-RBFNN with that of each used algorithm.

Method           Cancer  Glass   Heart   Iono    Pima    Sonar   Vowel   Wines    Ave     Nc
GA-RBFNN Train   0.9629  0.7318  0.8475  0.9368  0.7747  0.9369  0.8636  0.9790   0.8791  25.25
         Test    0.9688  0.6913  0.8172  0.9326  0.7625  0.7785  0.7371  0.9680   0.8320  -
DRSC     Train   0.9673  0.7988  0.8862  0.9191  0.7897  0.8147  0.7720  0.9753   0.8654  34.86
         Test    0.9671  0.6246  0.8015  0.9189  0.7372  0.7144  0.6700  0.9515   0.7982  -
         t-test  0.3882  6.043   1.276   4.109   3.821   5.641   14.01   2.843    -       -
K-means  Train   0.9641  0.7362  0.8424  0.8958  0.7609  0.7981  0.5285  0.9744   0.8125  30
         Test    0.9634  0.6610  0.8054  0.8970  0.7415  0.7295  0.4650  0.9667   0.7787  -
         t-test  0.9477  3.219   0.8420  4.904   3.012   4.443   39.89   0.8365   -       -
PNN      Train   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000    1.000   311.6
         Test    0.9494  0.6686  0.7309  0.9443  0.6950  0.5349  0.9515  0.9447   0.8024  -
         t-test  4.581   4.497   7.145   -0.6540 10.96   27.81   -30.40  4.952    -       -
KNN      Train   0.9736  0.7863  0.8653  0.8203  0.8073  0.7654  0.8924  0.9756   0.8608  311.6
         Test    0.9686  0.6566  0.8000  0.7871  0.7181  0.7192  0.7799  0.9697   0.7999  -
         t-test  0.0357  5.024   3.488   19.04   7.010   5.901   -1.238  -0.2480  -       -
Table 2. Results of previous works using the same datasets. For each paper, the result of the best method among the algorithms tested is recorded.

Datasets     GA-RBFNN  [11]¹    [17]¹    [18]¹
Cancer       0.9688    0.9580   0.9780   0.9490
Glass        0.6913    0.7050   0.7050   0.7000
Heart        0.8172    0.8370   0.8370   0.7890
Ionosphere   0.9326    0.8970   0.9310   0.9060
Pima         0.7625    0.7720   0.7430   0.7400
Sonar        0.7785    0.7850   0.8300   0.7650
Vowel        0.7371    0.8170   0.6520   0.7810
Wines        0.9680    -        0.9290   -

The remaining columns report results on only some of the datasets (values in column order as printed): [12]¹ 0.9470, 0.6710, 0.7400, 0.7920; [13]¹ 0.7095, 0.8296, 0.7660, 0.9657; [14]¹ 0.9620, 0.7620, 0.9370, 0.4830; [15]² 0.6837, 0.8817, 0.6872, 0.9444; [16]¹ 0.9650, 0.7510, 0.8030, 0.7560.

¹ k-fold cross-validation  ² Hold out
The comparison presented in Table 1 shows that GA-RBFNN yielded accuracies close to the best on most datasets, with the number of hidden nodes adjusted dynamically. The t-test results indicate that in most cases there are significant differences between the GA-RBFNN and the conventional algorithms at a confidence level of 95%. The GA-RBFNN improved the average testing accuracy by 4.23% and reduced the number of hidden nodes by 27.57% compared with the DRSC algorithm, which is used to determine the initial network of GA-RBFNN. Furthermore, K-means needs many trials to obtain a suitable number of hidden nodes, whereas the GA-RBFNN designs the network structure dynamically and needs only one run to obtain the optimal solution. PNN and KNN need a significantly larger number of hidden nodes, although they outperform GA-RBFNN on the Sonar and Vowel datasets. These results show that the GA-RBFNN algorithm is able to obtain a significantly higher accuracy and produce a smaller network structure than the compared methods.
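For reference, a pooled two-sample t statistic of the kind reported in Table 1 can be computed as below. The run-level accuracies used here are made up for illustration; the paper's per-run results are not listed:

```python
import math

def pooled_t(xs, ys):
    """Two-sample t statistic with pooled variance."""
    n, m = len(xs), len(ys)
    mx, my = sum(xs) / n, sum(ys) / m
    vx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    vy = sum((y - my) ** 2 for y in ys) / (m - 1)
    sp2 = ((n - 1) * vx + (m - 1) * vy) / (n + m - 2)   # pooled variance
    return (mx - my) / math.sqrt(sp2 * (1.0 / n + 1.0 / m))

# hypothetical per-run testing accuracies of two methods
a = [0.97, 0.96, 0.98, 0.97]
b = [0.93, 0.94, 0.92, 0.93]
t = pooled_t(a, b)   # large positive t: method a appears significantly better
```

With 30 runs per method as in the experiments, the statistic is compared against the critical value of the t distribution with n + m − 2 degrees of freedom at the chosen confidence level.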
Moreover, the proposed method is competitive when compared with other works on these datasets. Table 2 shows a summary of the results reported in papers devoted to other classification methods. Comparisons must be made cautiously, as the experimental setup differs in many papers. Some of the papers use tenfold cross-validation on some of the datasets and obtain a more optimistic estimation. However, we did not use tenfold cross-validation because it does not fit the three-subset sample partition. Table 2 shows that on the Cancer, Ionosphere, Pima and Wines datasets our algorithm achieves a performance that is better than or at least identical to all the results reported in the cited papers.

4.2 Experiment 2

In order to test the impact of the parameters upon the performance of the proposed method, we carried out another experiment by assigning different values to the genetic parameters. First, the effect of the probability of crossover, pc, was considered. It varied from 0.1 to 0.9, and the other parameters were assigned as follows: G=200, L=50, pm1=0.4, pm2=0.2 and pad=0.2. We performed ten runs of the algorithm for each pc, and the results in Table 3 are the average classification accuracies over the ten runs.

Table 3. Average testing accuracies for various probabilities of crossover pc
In order to test the impacts of the parameters upon the performance of the proposed method, we carried out another experiment by assigning different values to the genetic parameters. Firstly, the effect of the probability of crossover, pc, was considered. It varied from 0.1 to 0.9. And the other parameters were assigned as follows: G=200, L=50, pm1=0.4, pm2=0.2 and pad=0.2. We performed ten runs of the algorithm for each pc and the results in Table 3 are the average accuracies of classification over the ten runs. Table 3. Average testing accuracies for various probability of crossover pc pc 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Cancer 0.9691 0.9692 0.9683 0.9592 0.9703 0.9663 0.9654 0.968 0.9703
Glass 0.6973 0.6828 0.6893 0.6872 0.6908 0.7002 0.6879 0.6973 0.6748
Heart 0.8065 0.8123 0.8079 0.8168 0.8197 0.8050 0.8079 0.8094 0.8182
Datasets Iono Pima 0.9014 0.7632 0.8946 0.7613 0.9332 0.7548 0.9196 0.7587 0.9332 0.7626 0.9264 0.7574 0.9264 0.7515 0.9286 0.7632 0.9263 0.7450
Sonar 0.7769 0.7413 0.7567 0.7634 0.7769 0.7798 0.7807 0.7673 0.7750
Vowel 0.7029 0.7017 0.7094 0.7181 0.7311 0.7218 0.7203 0.7246 0.7181
Wines 0.9712 0.9826 0.9416 0.9485 0.9689 0.9507 0.9462 0.9348 0.9689
Ave
S*
Ratio
0.8236 0.8182 0.8202 0.8214 0.8317 0.8260 0.8233 0.8242 0.8246
0.0021 0.0043 0.0031 0.0021 0.0003 0.0014 0.0019 0.0026 0.0014
17.82 12.52 14.83 17.77 48.77 21.94 18.70 16.02 21.92
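The S* and Ratio columns of Table 3 can be reproduced from its accuracy entries: S* is the summed squared deviation from each dataset's best accuracy, and the tabulated Ratio matches Ave/√S*. A short check with the values transcribed from Table 3:

```python
import math

# Testing accuracies from Table 3: rows pc = 0.1 ... 0.9, columns the 8 datasets.
acc = [
    [0.9691, 0.6973, 0.8065, 0.9014, 0.7632, 0.7769, 0.7029, 0.9712],
    [0.9692, 0.6828, 0.8123, 0.8946, 0.7613, 0.7413, 0.7017, 0.9826],
    [0.9683, 0.6893, 0.8079, 0.9332, 0.7548, 0.7567, 0.7094, 0.9416],
    [0.9592, 0.6872, 0.8168, 0.9196, 0.7587, 0.7634, 0.7181, 0.9485],
    [0.9703, 0.6908, 0.8197, 0.9332, 0.7626, 0.7769, 0.7311, 0.9689],
    [0.9663, 0.7002, 0.8050, 0.9264, 0.7574, 0.7798, 0.7218, 0.9507],
    [0.9654, 0.6879, 0.8079, 0.9264, 0.7515, 0.7807, 0.7203, 0.9462],
    [0.9680, 0.6973, 0.8094, 0.9286, 0.7632, 0.7673, 0.7246, 0.9348],
    [0.9703, 0.6748, 0.8182, 0.9263, 0.7450, 0.7750, 0.7181, 0.9689],
]
max_j = [max(col) for col in zip(*acc)]                 # best accuracy per dataset
ave = [sum(row) / len(row) for row in acc]
s_star = [sum((a - m) ** 2 for a, m in zip(row, max_j)) for row in acc]
ratio = [av / math.sqrt(s) for av, s in zip(ave, s_star)]
```

Running this reproduces, e.g., Ave = 0.8317, S* = 0.0003, and Ratio ≈ 48.77 for pc = 0.5, confirming the best-ratio reading of the table.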
The Ave column in Table 3 is the average classification accuracy over the eight datasets for each pc. Due to the numerous experimental results, we introduce an additional test variable, S*, which denotes the deviation of the testing accuracies for each pc from the maximum accuracy of each dataset in Table 3. The last two columns of Table 3, S*_i and Ratio_i, are calculated as follows:

$$S^{*}_{i} = \sum_{j=1}^{Num_{set}} (Accu_{ij} - Max_{j})^{2}, \qquad Ratio_{i} = \frac{Ave_{i}}{\sqrt{S^{*}_{i}}}, \tag{6}$$

where $i = 1, 2, \ldots, Num_{p}$, $j = 1, 2, \ldots, Num_{set}$, $Accu_{ij}$ is the testing accuracy of the jth dataset for the ith value of pc, $Max_{j}$ is the maximum accuracy of the jth dataset, $Num_{p}$ is the number of different pc values, and $Num_{set}$ is the number of datasets. Note that it is more suitable and convincing to consider both Ave and Ratio than only
considering the former. As Table 3 shows, both the testing accuracy and the ratio reach their maximum when pc = 0.5. We also carried out an experiment to test the influence of the population size L, with values 10, 40, 70, 100, and 130, and pc = 0.5. The other parameters were assigned as above. The average classification accuracies over the ten runs are shown in Table 4. The meanings of the last three columns are defined as in Table 3.

Table 4. Average testing accuracies for various population sizes L
L     Cancer  Glass   Heart   Iono    Pima    Sonar   Vowel   Wines   Ave     S*      Ratio
10    0.9726  0.6790  0.8012  0.9305  0.7536  0.7619  0.6561  0.9495  0.8131  0.0073  9.537
40    0.9685  0.6941  0.8100  0.9418  0.7573  0.7792  0.7341  0.9677  0.8316  0.0001  99.32
70    0.9645  0.6621  0.8070  0.9316  0.7646  0.7590  0.7276  0.9405  0.8196  0.0024  16.76
100   0.9697  0.6809  0.8026  0.9271  0.7620  0.7590  0.7168  0.9404  0.8198  0.0019  18.74
130   0.9691  0.6734  0.8012  0.9316  0.7630  0.7677  0.7049  0.9586  0.8212  0.0017  19.96
Note that in Table 4, not only Ave but also Ratio reaches its peak when L = 40. On some of the problems, namely Cancer and Pima, enlarging the population size produces an improvement in the performance of the model that is not significant in view of the increased complexity of the model. A t-test has been conducted, and at a confidence level of 95% there are no significant differences in these cases. We can conclude that the GA-RBFNN does not perform better with bigger population sizes, which may lead to inbreeding without any improvement of the network performance.
5 Conclusions

A GA-RBFNN method for complex classification tasks has been presented, which adopts a matrix-form mixed encoding and specifically designed genetic operators to optimize the RBFNN parameters. The individual fitness is evaluated as a multi-objective optimization task, and the weights between the hidden layer and the output layer are computed by the pseudo-inverse algorithm. Experimental results over eight UCI datasets show that the GA-RBFNN can output a much simpler network structure with better generalization and prediction capability. To sum up, the GA-RBFNN is a quite competitive and powerful algorithm for complicated classification problems. Two directions are to be investigated in our further work. One is to combine some feature-selection algorithms with the proposed methods to improve the performance further, and the other is to introduce fuzzy techniques to increase the self-adaptation ability of the relative parameters.

Acknowledgments. The work was supported by the National Science Foundation of China (Grant No. 70171002, No. 70571057) and the Program for New Century Excellent Talents in Universities of China (NCET).
References
1. Zhu, Q., Cai, Y., Liu, L.: A Global Learning Algorithm for a RBF Network. Neural Networks 12 (1999) 527-540
2. Leonardis, A., Bischof, H.: An Efficient MDL-Based Construction of RBF Networks. Neural Networks 11 (1998) 963-973
3. Arifovic, J., Gencay, R.: Using Genetic Algorithms to Select Architecture of a Feedforward Artificial Neural Network. Physica A: Statistical Mechanics and its Applications 289 (2001) 574-594
4. Sarimveis, H., Alexandridis, A., et al.: A New Algorithm for Developing Dynamic Radial Basis Function Neural Network Models Based on Genetic Algorithms. Computers and Chemical Engineering 28 (2004) 209-217
5. Li, M.Q., Kou, J.S., et al.: The Basic Theories and Applications in GA. Science Press, Beijing (2002)
6. Berthold, M.R., Diamond, J.: Boosting the Performance of RBF Networks with Dynamic Decay Adjustment. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.): Advances in Neural Information Processing Systems 7. MIT Press, Denver, Colorado (1995) 512-528
7. Zhao, W.X., Wu, L.D.: RBFN Structure Determination Strategy Based on PLS and GAs. Journal of Software 13 (2002) 1450-1455
8. Burdsall, B., Christophe, G.-C.: GA-RBF: A Self-Optimizing RBF Network. In: Smith, G.D., et al. (eds.): Proceedings of the Third International Conference on Artificial Neural Networks and Genetic Algorithms (ICANNGA'97). Springer-Verlag, Norwich (1997) 348-351
9. Bosman, P.A.N., Thierens, D.: The Balance between Proximity and Diversity in Multiobjective Evolutionary Algorithms. IEEE Transactions on Evolutionary Computation 7 (2003) 174-188
10. Liu, Y., Yao, X., Higuchi, T.: Evolutionary Ensembles with Negative Correlation Learning. IEEE Transactions on Evolutionary Computation 4 (2000) 380-387
11. Frank, E., Wang, Y., Inglis, S., et al.: Using Model Trees for Classification. Machine Learning 32 (1998) 63-76
12. Cantú-Paz, E., Kamath, C.: Inducing Oblique Decision Trees with Evolutionary Algorithms. IEEE Trans. Evolutionary Computation 7 (2003) 54-68
13. Guo, G.D., Wang, H., Bell, D., et al.: KNN Model-Based Approach in Classification. In: Meersman, R., Tari, Z., Schmidt, D.C. (eds.): CoopIS, DOA, and ODBASE - OTM Confederated International Conferences. Lecture Notes in Computer Science 2888. Springer-Verlag, Berlin Heidelberg New York (2003) 986-996
14. Friedman, J., Hastie, T., Tibshirani, R.: Additive Logistic Regression: A Statistical View of Boosting. The Annals of Statistics 28 (2000) 337-407
15. Draghici, S.: The Constraint Based Decomposition (CBD) Training Architecture. Neural Networks 14 (2001) 527-550
16. Webb, G.I.: Multiboosting: A Technique for Combining Boosting and Wagging. Machine Learning 40 (2000) 159-196
17. Yang, J., Parekh, R., Honavar, V.: DistAI: An Inter-pattern Distance-based Constructive Learning Algorithm. Intelligent Data Analysis 3 (1999) 55-73
18. Frank, E., Witten, I.H.: Generating Accurate Rule Sets Without Global Optimization. In: Shavlik, J.W. (ed.): Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA (1998) 144-151
Stock Index Prediction Based on Adaptive Training and Pruning Algorithm

Jinyuan Shen¹, Huaiyu Fan¹, and Shengjiang Chang²

¹ School of Information Engineering, Zhengzhou University, Zhengzhou, China
² Institute of Modern Optics, Nankai University, Tianjin, China
[email protected]

Abstract. A tapped delay neural network (TDNN) with an adaptive learning and pruning algorithm is proposed to predict nonlinear stock index time series. The TDNN is trained by recursive least squares (RLS), in which the learning-rate parameter is chosen automatically, so the network converges quickly. The architecture of the trained network is then optimized with a pruning algorithm to reduce the computational complexity and enhance the network's generalization, and the pruned network is retrained to obtain optimum parameters. Finally, the test samples are predicted by the resulting network. Simulation and comparison show that this optimized network model not only greatly reduces the computational complexity but also improves the prediction precision. In our simulation, the computational complexity is reduced to 0.0556 of the original and the mean square error on the test samples reaches 8.7961 × 10⁻⁵.
1 Introduction
The stock indexes are influenced by many factors, so it is very difficult to forecast their changes with an accurate mathematical expression; forecasting the stock indexes accurately is a typical nonlinear dynamic problem. Due to its powerful nonlinear processing ability, the neural network has recently been applied widely to stock index forecasting, and some network models have made greater progress than traditional statistical methods [1]. Most research, however, concentrates on the back propagation (BP) algorithm [2], the radial basis function (RBF) method [3], genetic algorithms [4], the support vector machine (SVM) [5], and their improved variants. These models have shortcomings: it is well known that the BP algorithm cannot avoid local minima, and the number of neurons in its hidden layer is determined blindly; for an RBF network it is difficult to choose the best central vectors as well as their number; SVM suits small-sample situations, but how to choose the best kernel function remains an open question. The architecture and the learning algorithm of the neural network are crucial for predicting the stock indexes accurately, so choosing an optimal topological architecture is very important. A tapped delay neural network (TDNN) with an adaptive learning and
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 457–464, 2007. © Springer-Verlag Berlin Heidelberg 2007
pruning algorithm is adopted in this paper to predict the nonlinear stock index time series. First, an adaptive algorithm based on recursive least squares is employed to train the network. Second, the architecture of the primary neural network is optimized with a pruning algorithm that prunes the redundant neurons of the hidden layers and the input layer. The optimized network is then retrained to obtain optimum parameters, and finally the test samples are predicted by the resulting network. In our model the learning stepsize is determined automatically, so the network converges quickly. Pruning is widely used to remove redundant neurons in the hidden layer; however, how many delays are best for time series prediction — that is, how many input neurons are optimal for a TDNN forecasting the stock indexes — has not been studied. In this paper, by defining a new energy function, we apply the pruning algorithm not only to the hidden layer but also to the input layer. The computational complexity is therefore reduced and the generalization is greatly improved: the computational complexity is reduced to 0.0556 of the original and the mean square error on the test samples reaches 8.7961 × 10⁻⁵.
2 The Architecture of the Neural Network
A three-layer feed-forward neural network can approximate any continuous nonlinear function arbitrarily well. The sketch of a three-layer TDNN is shown in Fig. 1. The input of the TDNN is a delayed time series, and the single output neuron adopts a

Fig. 1. The model of tapped delay neural network
linear activation function. The functional relation between input and output is described by

\hat{x}(n) = \sum_i \omega_{2,i}\, f(\omega_{1,i}^{T} X(n-1)) - \theta,   (1)

where θ is the bias of the output neuron and X(n − 1) is the input vector at time n − 1:

X(n-1) = [x(n-1), x(n-2), \ldots, x(n-p), 1]^{T}.   (2)
N_n, which includes the threshold cell, is the total number of neurons in the n-th layer, and \omega_{n,ij} is the weight between the j-th neuron in the n-th layer and neuron (n+1, i):

\omega_n = [(\omega_{n,1})^{T}, (\omega_{n,2})^{T}, \ldots, (\omega_{n,N_{n+1}-1})^{T}], \quad \omega_{n,i} = [\omega_{n,i1}, \omega_{n,i2}, \ldots, \omega_{n,iN_n}],   (3)

and

\omega = [(\omega_1)^{T}, (\omega_2)^{T}].   (4)

3 Adaptive Training
The weight vector ω is regarded as the state of a stable nonlinear dynamic system in the TDNN model. Suppose the n-th training pattern is input to the TDNN model; the system should satisfy the following state equations:

\omega(n) = \omega(n+1) = \omega_0, \quad d(n) = h(\omega(n)) + e(n),   (5)
where d(n) is the desired output, h(ω(n)) is the output of the network, and e(n) is the modeling error. The error function is

\xi(n) = \sum_{j=1}^{n} \lambda^{n-j} |d(j) - h(\omega(n))|^{2},   (6)
where λ is a forgetting factor that satisfies 0 < λ < 1 and is close to 1. Expanding h(ω(n)) in a Taylor series at the point ω̂(n − 1), we obtain

h(\omega(n)) = h(\hat{\omega}(n-1)) + H(n)(\omega(n) - \hat{\omega}(n-1)) + \ldots,   (7)

where H(n) = \partial h(\omega)/\partial \omega |_{\omega = \hat{\omega}(n-1)}. With these state equations, according to identification theory [6], the recursion for the estimate ω̂(n) is obtained as follows:

\hat{\omega}(n) = \hat{\omega}(n-1) + K(n)(d(n) - h(\hat{\omega}(n-1))),   (8)
K(n) = \lambda^{-1} P(n-1) H(n) [I + \lambda^{-1} H^{T}(n) P(n-1) H(n)]^{-1},   (9)
P(n) = \lambda^{-1} P(n-1) - \lambda^{-1} K(n) H^{T}(n) P(n-1).   (10)

The estimate ω̂(n) minimizes the error function ξ(n); K(n) is the gain matrix and P(n) is the error covariance matrix of the recursive least squares algorithm. ω̂(0) and P(0) are determined from prior knowledge, or else

\hat{\omega}(0) = [0, 0, \ldots, 0]^{T}, \quad P(0) = \delta^{-1} I,   (11)

where δ is a small positive constant.
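The recursion of Eqs. (8)–(10) can be sketched as follows for the scalar-output case, where the Jacobian H(n) reduces to a gradient vector. This is an illustrative sketch, not the paper's implementation: the function name `rls_update`, the toy linear identification task, and the initialization constants are our own, and for the linear model the gradient H(n) is simply the input vector.

```python
import numpy as np

def rls_update(w, P, H, d, y, lam=0.999):
    """One RLS step (Eqs. 8-10) for a scalar-output model.

    w : current weight estimate, shape (m,)
    P : error covariance matrix, shape (m, m)
    H : gradient of the model output w.r.t. w, shape (m,)
    d : desired output;  y : actual model output h(w)
    lam : forgetting factor, 0 < lam < 1
    """
    H = H.reshape(-1, 1)
    PH = P @ H
    # Gain K(n) (Eq. 9); the bracketed term is a scalar for one output
    K = PH / (lam + (H.T @ PH).item())
    w_new = w + K.flatten() * (d - y)            # Eq. 8
    P_new = (P - K @ (H.T @ P)) / lam            # Eq. 10
    return w_new, P_new

# Toy usage: identify a linear map d = a.x with RLS
rng = np.random.default_rng(0)
a = np.array([2.0, -1.0, 0.5])
w = np.zeros(3)
P = (1.0 / 1e-4) * np.eye(3)     # P(0) = delta^{-1} I with small delta (Eq. 11)
for _ in range(200):
    x = rng.normal(size=3)
    w, P = rls_update(w, P, H=x, d=a @ x, y=w @ x)
print(np.round(w, 3))
```

For the TDNN itself, H(n) would be the gradient of the network output with respect to all weights, computed by back propagation at ω̂(n − 1).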
4 Pruning Algorithm
4.1 Pruning the Neurons of the Hidden Layer
One key problem is how to choose a suitable network scale. If the scale of the network is oversized, its generalization ability may be very bad; conversely, if the scale is too small, the network may converge very slowly or not at all. An effective way to solve this problem is to prune weights adaptively. After the n-th training sample is input, the energy function of the network is defined as

E = \frac{1}{2} \left[ \sum_{j=1}^{n} (d(j) - h(\omega, x(j)))^{T} (d(j) - h(\omega, x(j))) + \omega^{T} P(0)^{-1} \omega \right].   (12)
According to the pruning algorithm [7], suppose the initial covariance matrix is diagonal, P(0) = δ⁻¹I, where I is the identity matrix and δ > 0. The energy change of the network caused by a weight change Δω̂ is then

\Delta E = \frac{1}{2} \Delta\omega^{T} P(\infty)^{-1} \Delta\omega.   (13)

The importance of ω_j is calculated from

\Delta E_j = \frac{1}{2} [P(\infty)^{-1}]_{jj}\, \hat{\omega}(\infty)_j^{2},   (14)

where ω̂(∞) and P(∞) are the weights and covariance matrix of the converged network and [P(∞)⁻¹]_{jj} is the j-th diagonal element. Using these equations, the weight-pruning process is as follows:
(a) After training the network by the RLS algorithm, estimate the importance of all weights according to Eq. (14) and sort the weights from small to large by ΔE_j. Denoting the sorted order by [π_i], we have ΔE_{π_m} ≤ ΔE_{π_k} for m < k.
(b) Let [Δω]_{π_k} = [ω]_{π_k} for 1 ≤ k ≤ k′ and [Δω]_{π_k} = 0 for k > k′; the ΔE caused by pruning the weights ω_{π_1} through ω_{π_{k′}} can then be estimated by Eq. (13).
(c) If ΔE ≤ αE (0 < α < 1), let k′ = k′ + 1 and return to step (b); otherwise prune the weights ω_{π_1} through ω_{π_{k′−1}}.
It is worth pointing out that, for a three-layer feed-forward network with one output, if a weight between the hidden layer and the output layer is pruned, the hidden neuron attached to that weight is pruned as well, and then all input-layer weights connected to that neuron are pruned too. Pruning hidden neurons reduces the computational complexity: according to [8], if H₀ is the number of hidden neurons before pruning and H₁ the number remaining, the ratio of the computational complexity is (H₁/H₀)².
4.2 Pruning the Input Layer
How many delays are best for time series prediction — in other words, how many input neurons are optimal for a TDNN predicting the stock indexes — has not been studied. We define a new energy function and employ the pruning algorithm to remove the redundant neurons in the input layer as well. Suppose the weight between the i-th neuron in the input layer and the j-th neuron in the hidden layer is ω_{1,ij}. We define the energy function

E_{1,i} = \sum_{j=1}^{n} (\omega_{1,ij})^{2},   (15)

where n is the number of neurons in the hidden layer. We then obtain the vector E = [E_{1,1}, E_{1,2}, \ldots, E_{1,m}]^{T}, i = 1, 2, \ldots, m, where m is the number of neurons in the input layer, and sort the E_{1,i} from low to high. Let \Delta E_1 = \sum_{k=1}^{k'} E_{1,k} (1 ≤ k′ < m) and E_1 = \sum_{i=1}^{m} E_{1,i}. If \Delta E_1 / E_1 = β, then the front k′ neurons in the input layer are pruned.
5 Computer Simulation

5.1 Training
We use 650 daily data of the Shanghai Composite Index from Mar. 23, 2001 to Dec. 17, 2003 as samples: the first 300 data are the training samples, the 301st–500th are the retraining samples, and the remainder are the test samples. First we normalize the sample data:

\hat{X}_i = \frac{X_i - \min(X_i)}{\max(X_i) - \min(X_i)}.   (16)
The primary architecture of the tapped delay neural network is 12-15-1. The TDNN is trained by the RLS algorithm with initial parameter values ω̂(0) = [0, 0, …, 0]^T, P(0) = 60 × I, and λ = 0.999. Only 36 iterations are needed, i.e., the network converges very quickly. The prediction errors of TDNNs with different architectures are shown in Table 1. The mean square error for the training samples is 1.4142 × 10⁻⁴, where MSE = \sum_{n} e^{2}(n)/N.

5.2 Pruning the Hidden Layer
According to Eq. (14), all of the weights are ranked by their importance, and the unimportant weights at the front of the queue are pruned. Fig. 2 shows the relational curve between energy and weights. The front 130 weights are clearly unimportant, corresponding to E = 1268; 10 of these 130 weights lie between the hidden layer and the output layer. Therefore, the number of hidden neurons becomes 5 after these 130 weights are pruned, i.e., the network architecture turns into 12-5-1. The
Table 1. The comparison of the MSE of TDNN models

Samples                          Architecture  MSE
training samples                 12-15-1       1.4142 × 10⁻⁴
retraining samples, un-pruned    12-15-1       1.3898 × 10⁻⁴
retraining samples, pruned       12-9-1        1.2996 × 10⁻⁴
                                 12-6-1        1.2307 × 10⁻⁴
                                 12-5-1        1.2003 × 10⁻⁴
                                 12-4-1        1.3689 × 10⁻⁴
test samples, un-pruned          12-15-1       1.3307 × 10⁻⁴
test samples, pruned             12-9-1        1.1582 × 10⁻⁴
                                 12-6-1        1.0986 × 10⁻⁴
                                 12-5-1        9.7160 × 10⁻⁵
                                 12-4-1        1.0211 × 10⁻⁴
Fig. 2. The relational curve between energy and weights
computational complexity is reduced to (5/15)² = 0.1111. This indicates that the topological architecture of the network can be optimized effectively by the pruning algorithm. The prediction errors for different values of ΔE are compared in Table 1: the MSE is smallest when 10 neurons in the hidden layer are pruned, i.e., when 5 hidden neurons remain.

5.3 Pruning the Input Layer
As shown in Table 2, applying Eq. (15) with the ratio ΔE₁/E₁ and β = 0.2, we obtain the least mean square error, equal to 8.7961 × 10⁻⁵,

Table 2. The comparison of the different networks

Architecture of the network           Mean square error
12-15-1                               1.3307 × 10⁻⁴
6-5-1                                 9.5924 × 10⁻⁵
6-5-1 (obtained by pruning 12-5-1)    8.7961 × 10⁻⁵
for the test samples. The network architecture turns into 6-5-1 and the computational complexity is reduced to 0.1111/2 = 0.0556.

5.4 Retraining and Predicting

The 301st–500th data samples are used to retrain the final TDNN and obtain the optimum weights, and the test samples (the 501st–650th) are then forecast. To examine whether the network architecture is optimal, architectures other than the optimum 6-5-1 (i.e., with different numbers of hidden neurons) are also simulated. The results are shown in Table 1 and Table 2. The prediction curve of the test samples with the 6-5-1 TDNN is shown in Fig. 3.

Fig. 3. The prediction curve of the 6-5-1 TDNN (actual data vs. forecast data)
6 Conclusions
The simulation results show that the convergence rate is fast; the TDNN with the RLS learning algorithm can thus basically satisfy the requirements of on-line forecasting. The adaptive training and pruning reduce both the computational complexity and the VC dimension, which improves the network's generalization ability, so the network can predict the test samples more accurately. In addition, by presenting a new energy function, we prune the redundant neurons not only in the hidden layers but also in the input layer. This selects the useful input factors self-adaptively: we can not only reduce the computational complexity of the network but also extract the useful features from noisy inputs. Our extended pruning method can therefore serve as an effective method for preprocessing the input data.
Acknowledgment. This work is supported by the Outstanding Youth Fund of Henan Province (grant No. 512000400), the Henan Province Cultivation Project for University Innovation Talents, and the Project sponsored by SRF for ROCS, SEM.
References
1. Refenes, A.N., Zapranis, A., Francies, G.: Stock Performance Modeling Using Neural Networks: A Comparative Study with Regression Models. Neural Networks 5 (1994) 961-970
2. Chang, B.R., Tsai, S.F.: A Grey-Cumulative LMS Hybrid Predictor with Neural Network Based Weighting for Forecasting Non-Periodic Short-Term Time Series. IEEE International Conference on Systems, Man and Cybernetics 6 (2002) 5
3. Lee, R.S., Jade, T.: Stock Advisor: An Intelligent Agent Based Stock Prediction System Using Hybrid RBF Recurrent Network. IEEE Trans. Systems, Man and Cybernetics-A 34 (2004) 421-428
4. Grosan, C., Abraham, A.: Stock Market Modeling Using Genetic Programming Ensembles. Studies in Computational Intelligence 13 (2006) 131-146
5. Ince, H., Trafal, I.: Kernel Principal Component Analysis and Support Vector Machines for Stock Price Prediction. IEEE International Joint Conference on Neural Networks Proceedings 3 (2004) 2053-2058
6. Shah, S., Palmieri, F., Datum, M.: Optimal Filtering Algorithms for Fast Learning in Feedforward Neural Networks. Neural Networks 5 (1992) 779-787
7. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal Brain Damage. Advances in Neural Information Processing Systems 2 (1989) 598-605
8. Chen, S., Chang, S.J., Yuan, J.H.: Adaptive Training and Pruning for Neural Networks: Algorithms and Application. Acta Physica Sinica 50 (2001) 674-681
An Improved Algorithm for the Elman Neural Network by Adding a Modified Error Function

Zhang Zhiqiang¹, Tang Zheng¹, Tang GuoFeng¹, Catherine Vairappan¹, Wang XuGang², and Xiong RunQun³

¹ Faculty of Engineering, Toyama University, Gofuku 3190, Toyama-shi, 930-8555 Japan
² Institute of Software, Chinese Academy of Sciences, Beijing 100080, China
³ Key Lab of Computer Network and Information Integration, Southeast University, Nanjing 210096, China
[email protected]
Abstract. The Elman neural network has been widely used in various fields, ranging from a temporal version of the Exclusive-OR function to the discovery of syntactic categories in natural language data. However, one problem often associated with this type of network is the local minima problem, which usually occurs during learning. To solve this problem, we propose an error function that harmonizes the updates of the weights connected to the hidden layer and those connected to the output layer by adding one term to the conventional error function; this avoids the local minima problem caused by their disharmony. We applied this method to Boolean Series Prediction Question (BSPQ) problems to demonstrate its validity. The results show that the proposed method avoids the local minima problem, greatly accelerates convergence, and obtains good results on the prediction tasks.
1 Introduction
The Elman neural network (ENN) is one type of partial recurrent neural network, a class that also includes Jordan networks [1], [2]. The ENN consists of a two-layer back propagation network with an additional feedback connection from the output of the hidden layer to its input. The advantage of this feedback path is that it allows the ENN to recognize and generate temporal as well as spatial patterns: after training, interrelations between the current input and the internal states are processed to produce the output and to represent the relevant past information in the internal states [3], [4]. The ENN is a locally recurrent network, so when learning a problem it needs more hidden neurons than other methods actually require for a solution. Since the ENN uses back propagation (BP) to process the various signals, it has been shown to suffer from a sub-optimality problem [5], [6], [7]. To resolve this, many improved ENN algorithms have been suggested in the literature to increase the performance of the ENN with simple
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 465–473, 2007. © Springer-Verlag Berlin Heidelberg 2007
modifications [8], [9], [10]. One typical modification, proposed by Pham and Liu, is based on the idea of adding self-connection weights (fixed between 0.0 and 1.0 before the training process) to the context layer. The modifications of the ENN suggested in the literature mostly improve performance on certain kinds of problems, but it is not yet clear which network architecture is best suited to dynamic system identification or prediction [11]. At the same time, these methods tend to change or add elements or connections in the network, which increases the computational complexity. In this paper, we explain the neuron saturation in the hidden layer as an update disharmony between the weights connected to the hidden layer and those connected to the output layer. We then propose a modified error function for the ENN algorithm that avoids the local minima problem with fewer neuron units and less training time than the conventional ENN. Finally, simulation results are presented to substantiate the validity of the modified error function. Since a three-layered network is capable of forming an arbitrarily close approximation to any continuous nonlinear mapping [12], we use three layers for all training networks.
2 ENN's Structure
Fig. 1 shows the structure of a conventional ENN. After the hidden units are calculated, their values are used to compute the output of the network and are also all stored as "extra inputs" (called context units) to be used the next time the network is operated. Thus, the recurrent context provides a weighted sum of the previous values of the hidden units as input to the hidden units. As in the definition of the original ENN, the activations are copied from the hidden layer to the context layer on a one-for-one basis with a fixed weight of 1.0 (w = 1.0); the forward connections between the context units and the hidden units are trainable.
Fig. 1. Structure of ENN
Training such a network is not straightforward, since the output of the network depends on the current input and also on all previous inputs to the network. One approach used in machine learning is to unroll the process through time, as shown in Fig. 2.
Fig. 2 represents a long feed-forward network in which back propagation can calculate the derivatives of the error (at each output unit) by unrolling the network back to the beginning. When the input at the next time step t + 1 is presented, the context units contain values that are exactly the hidden unit values at time t; these context units thus provide the network with memory [6]. In our paper, the time element is updated through the iterative change of the connection weights, so we do not explicitly include a time variable in the computing formulas.

Fig. 2. Unrolling the ENN through time
3 Motivation
In the ENN, the sigmoid function is usually used:

f(x) = \frac{1}{1 + e^{-x}}.   (1)

The shape of the sigmoid function is shown in Fig. 3. Since we use the sigmoid function, the saturation problem is inevitable; this phenomenon is caused by the activation function [13]. The derivative of the sigmoid function is

f'(x) = f(x)(1 - f(x)).   (2)

In Fig. 3 we can see two extreme areas, A and B. Once the activity level of the whole hidden layer approaches these extreme areas (the outputs f(x) of all neurons are close to 1 or 0), f'(x) is almost 0. For the ENN, the change in weights is determined by the sigmoid derivative, which can thus be nearly 0. So for some training patterns, the weights connected to the hidden layer and those connected to the output layer are modified inharmoniously: all the hidden neurons' outputs are rapidly driven into the extreme areas before the output starts to approximate the desired value. The hidden layer then loses its sensitivity to the error, and a local minimum may occur. To overcome such a problem, the neuron outputs in the output layer and those in the hidden layer should be considered together during the iterative update
Fig. 3. Sigmoid function
procedure. Motivated by this, we add one term concerning the outputs of the hidden layer to the conventional error function [14]. In this way, the weights connected to the hidden layer and those connected to the output layer can be modified harmoniously.
4 Proposed Algorithm
For the conventional ENN algorithm, the error function is given by

E_A = \frac{1}{2} \sum_{p=1}^{P} \sum_{j=1}^{J} (t_{pj} - o_{pj})^{2},   (3)
where P is the number of training patterns and J is the number of neurons in the output layer; t_{pj} is the target value (desired output) of the j-th component of the outputs for pattern p, and o_{pj} is the output of the j-th neuron of the actual output layer. To minimize the error function E_A, the ENN algorithm uses the following delta rule as the back propagation algorithm:

\Delta w_{ji} = -\eta_A \frac{\partial E_A}{\partial w_{ji}},   (4)

where w_{ji} is the weight connecting neurons i and j and η_A is the learning rate. For the improved ENN algorithm, the modified error function is given by

E_{new} = E_A + E_B = \frac{1}{2} \sum_{p=1}^{P} \sum_{j=1}^{J} (t_{pj} - o_{pj})^{2} + \frac{1}{2} \sum_{p=1}^{P} \left( \sum_{j=1}^{J} (t_{pj} - o_{pj})^{2} \right) \times \left( \sum_{j=1}^{H} (y_{pj} - 0.5)^{2} \right).   (5)

The new error function consists of two terms: E_A is the conventional error function and E_B is the added term, where y_{pj} is the output of the j-th neuron in the hidden layer and H is the number of neurons in the hidden layer. Consider the quantity

\sum_{j=1}^{H} (y_{pj} - 0.5)^{2}.   (6)
Eq. (6) can be defined as the degree of saturation in the hidden layer for pattern p. The added term keeps the degree of saturation of the hidden layer small while E_A is large (the output layer has not yet approximated the desired signals); as the output layer approximates the desired signals, the effect of E_B diminishes and eventually becomes zero. Using the above error function as the objective function, we can rewrite the update rule for weight w_{ji} as

\Delta w_{ji} = -\eta_A \frac{\partial E_A}{\partial w_{ji}} - \eta_B \frac{\partial E_B}{\partial w_{ji}}.   (7)
For pattern p, the derivative ∂E_A/∂w_{ji} is computed in the same way as for the conventional error function, and ∂E_B/∂w_{ji} follows easily. For weights connected to the output layer:

\frac{\partial E_B^{p}}{\partial w_{ji}} = \frac{\partial E_A^{p}}{\partial w_{ji}} \sum_{j=1}^{H} (y_{pj} - 0.5)^{2}.   (8)

For weights connected to the hidden layer:

\frac{\partial E_B^{p}}{\partial w_{ji}} = \frac{\partial E_A^{p}}{\partial w_{ji}} \sum_{j=1}^{H} (y_{pj} - 0.5)^{2} + \sum_{j=1}^{J} (t_{pj} - o_{pj})^{2} (y_{pj} - 0.5) \frac{\partial y_{pj}}{\partial w_{ji}}.   (9)

Because y_{pj} = f(net_{pj}) and net_{pj} = \sum_i w_{ji} o_{pi},

\frac{\partial y_{pj}}{\partial w_{ji}} = \frac{\partial y_{pj}}{\partial net_{pj}} \frac{\partial net_{pj}}{\partial w_{ji}} = f'(net_{pj})\, o_{pi},   (10)

where o_{pi} is the i-th input for pattern p and net_{pj} is the net input to neuron j produced by the presentation of pattern p. To verify the effectiveness of the modified error function, we applied the algorithm to the BSPQ problems.
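The modified error of Eq. (5) is easy to evaluate batch-wise. Below is an illustrative sketch (not the paper's code; the function name and toy values are ours): `targets` and `outputs` are the P × J target and actual output matrices, and `hidden` is the P × H matrix of hidden-layer activations, so the per-pattern saturation term is exactly Eq. (6).

```python
import numpy as np

def new_error(targets, outputs, hidden):
    """Modified error E_new = E_A + E_B (Eq. 5).

    targets, outputs : shape (P, J);  hidden : shape (P, H)
    """
    sq = (targets - outputs) ** 2
    E_A = 0.5 * sq.sum()                              # Eq. 3
    sat = ((hidden - 0.5) ** 2).sum(axis=1)           # Eq. 6, per pattern
    E_B = 0.5 * (sq.sum(axis=1) * sat).sum()          # added term of Eq. 5
    return E_A + E_B

t = np.array([[1.0], [0.0]])          # two patterns, one output each
o = np.array([[0.8], [0.1]])
y = np.array([[0.9, 0.1],             # saturated hidden pattern
              [0.5, 0.5]])            # unsaturated hidden pattern
E = new_error(t, o, y)
print(round(float(E), 4))             # -> 0.0314
```

Note how the second pattern, whose hidden outputs sit at 0.5, contributes nothing to E_B: the penalty acts only while hidden units drift toward saturation and the output error is still large.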
5 Simulations
The Boolean Series Prediction Question (BSPQ) is a time sequence prediction problem, defined as follows [15]. Suppose that we want to train a network with an input P and target T as defined below:
P = 1 0 1 1 1 0 1 1
T = 0 0 0 1 1 0 0 1
Here T is defined to be 0 except when two consecutive 1's occur in P, in which case T is 1; we call this the "11" problem (one kind of BSPQ). Likewise, when "00" or "111" (two 0's or three 1's) is to be detected, the task is named the "00" or "111" problem. In this paper we define the prediction set P1 as the following random 20-bit sequence:
P1 = 1 1 1 0 1 0 0 0 1 0 1 1 0 1 1 1 0 0 1 1
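The target construction above can be sketched as a short helper (the function name is ours): t[i] is 1 exactly when the window of inputs ending at position i matches the given pattern.

```python
def bspq_targets(p, pattern="11"):
    """Build the BSPQ target sequence for input bits p: t[i] = 1 iff the
    last len(pattern) inputs ending at position i equal the pattern."""
    bits = "".join(str(b) for b in p)
    k = len(pattern)
    return [1 if i + 1 >= k and bits[i + 1 - k:i + 1] == pattern else 0
            for i in range(len(p))]

P = [1, 0, 1, 1, 1, 0, 1, 1]
print(bspq_targets(P, "11"))   # -> [0, 0, 0, 1, 1, 0, 0, 1], matching T above
```

The same helper with `pattern="111"` or `pattern="00"` produces the targets for the other two BSPQ variants.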
Table 1. Experiment results for the "11" question with 7 neurons in the hidden layer (1-7-1 network)

Method            Success rate (100 trials)   Iterations (average)   Average CPU time (s)
                  E=0.1    E=0.01             E=0.1    E=0.01        E=0.1    E=0.01
Conventional ENN  45%      35%                15365    33274         9.5      20.5
Improved ENN      100%     100%               322      636           0.18     0.29
To test the effectiveness of the proposed method, we compare its performance with that of the conventional ENN algorithm on a series of BSPQ problems including the "11", "111" and "00" problems. In our simulations, we use the modified back propagation algorithm with momentum 0.9. To keep the comparison fair, the learning rates η_A = η_B = 0.9 are used in all experiments, and the weights and thresholds are initialized randomly from (0.0, 1.0). Three aspects of training performance — success rate, iterations, and training time — are assessed for each algorithm. Simulations were implemented in Visual C++ 6.0 on a Pentium 4, 2.8 GHz (1 GB). A training run was deemed successful if the network's E fell below the given level (E = 0.1 or E = 0.01), where E is the sum-of-squares error over the full training set; at this error precision, every pattern in the training set is within a tolerance of 0.05 of each target element. The well-trained network was then used for the final prediction on the sequence P1 to test its prediction capacity. For all trials, 150 patterns were provided to balance the training set while ensuring reasonable running time for both algorithms. The upper limit on iterations was set to 50,000 for both algorithms.

5.1 The "11" Question
First, we deal with the "11" question and analyze the effect of the memory from the context layer on the network. From the compared results of the two algorithms in Table 1, we can see that the improved method not only succeeds 100% of the time but also reaches the convergence point quickly. The conventional ENN is able to predict the requested input test sequence P1, but its training success rate is low — only 45% when E is set to 0.1. Fig. 4 and Fig. 5 are the training error curves of the two algorithms with the same initialized weights for the 1-7-1 network when E is set to 0.1; comparing them shows that the improved ENN needs only 49 iterations to succeed, whereas the conventional ENN needs 1957 iterations to reach the desired goal. For the prediction set P1, the corresponding expected result is
T1 = 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1
Fig. 6 shows the prediction results for P1 of the "11" question with the improved ENN; the two lines represent the expected T1 line and the actual prediction result
Fig. 4. Training error curve of the conventional ENN algorithm 30
E 25
20 15 10 5
0 0
5
10
15
20
25
30
35
40
45
50
iterative
Fig. 5. Training error curve of the improved ENN algorithm
Fig. 6. Comparison of the expected output and prediction result with improved ENN
Table 2. Experiment results for the "111" question with different numbers of neurons in the hidden layer

Structure          Method            Success rate (100 trials)   Iterations (average)   Time (s)
                                     E=0.1    E=0.01             E=0.1    E=0.01        E=0.1   E=0.01
Network (1-10-1)   Conventional ENN  72%      70%                13545    24555         11.3    17.9
                   Improved ENN      99%      99%                2279     3698          1.6     3.1
Network (1-12-1)   Conventional ENN  80%      75%                9823     17090         7.9     14.1
                   Improved ENN      100%     97%                2001     3710          1.9     2.5
line. Apparently the tolerance of every pattern in P1 is very small. Based on these findings, it can be concluded that the improved ENN has sufficient capability to perform the prediction of the given task.

5.2 The "111" and "00" Questions
By changing the type of BSPQ, we can further verify the validity of the improved ENN algorithm. Table 2 compares the specific results of the conventional ENN and the improved ENN algorithm on the "111" problem: the improved ENN avoids the local minima problem with an almost 100% success rate and fewer iterations than the conventional ENN algorithm. Table 3 gives the corresponding comparison for the "00" BSPQ problem.

Table 3. Experiment results for the "00" BSPQ problem with different numbers of neurons in the hidden layer

Structure          Method            Success rate (100 trials)   Iterations (average)   Time (s)
                                     E=0.1    E=0.01             E=0.1    E=0.01        E=0.1   E=0.01
Network (1-7-1)    Conventional ENN  92%      81%                4177     9852          4.0     9.6
                   Improved ENN      99%      97%                1661     3589          1.7     3.8
Network (1-10-1)   Conventional ENN  95%      92%                13074    16333         11.3    16.5
                   Improved ENN      100%     100%               5205     6944          4.4     5.8

6 Conclusion
In this paper, we proposed a modified error function with two terms for the ENN algorithm. The modified error function harmonizes the updates of the weights connected to the hidden layer and those connected to the output layer, in order to avoid the local minima problem during learning. Moreover, the modified error function requires no significant additional computation and does not change the network topology. Finally, the algorithm was applied to the BSPQ problems, including the "11", "111" and "00" problems. The analysis of the results from the various BSPQ problems shows that the proposed algorithm is effective at escaping the local minima problem
with less time, and at obtaining good prediction results. However, further analysis on other types of problems and a more detailed discussion of parameter settings are still required. We will therefore continue studying improvements to the ENN.
Regularization Versus Dimension Reduction, Which Is Better?

Yunfei Jiang1 and Ping Guo1,2

1 Laboratory of Image Processing and Pattern Recognition, Beijing Normal University, Beijing 100875, China
2 School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
yunfeifei [email protected], [email protected]

Abstract. There exist two main solutions for the classification of high-dimensional data in small-sample-number settings. One is to classify the data directly in the high-dimensional space with regularization methods; the other is to reduce the data dimension first and then classify in the feature space. But which is actually better? In this paper, comparative studies of regularization and dimension reduction approaches are given on two typical sets of high-dimensional real-world data: Raman spectroscopy signals and stellar spectra data. Experimental results show that in most cases the dimension reduction methods can obtain acceptable classification results at a lower computational cost. When the number of training samples is insufficient and the distribution is seriously unbalanced, the performance of some regularization approaches is better than that of the dimension reduction ones, but the regularization methods cost more computation time.
1 Introduction
In the real world there are data, such as Raman spectroscopy and stellar spectra data, for which the number of variables (wavelengths) is much higher than the number of samples. When classification (recognition) tasks are applied, ill-posed problems arise. For such ill-posed problems there are mainly two solutions. One is to classify the data directly in the high-dimensional space with regularization methods [1]; the other is to classify them in a feature space after dimension reduction. Many approaches have been proposed to solve the ill-posed problem [1,2,3,4,5,6,7,8]. Among these methods, Regularized Discriminant Analysis (RDA), the Leave-One-Out Covariance matrix estimate (LOOC) and the Kullback-Leibler Information Measure based classifier (KLIM) are regularization methods. RDA [2] is based on Linear Discriminant Analysis (LDA) and adds the identity matrix as a regularization term to solve the problem in matrix estimation, while LOOC [3] brings in diagonal matrices to solve the singularity problem. The KLIM estimator
Corresponding author.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 474–482, 2007. © Springer-Verlag Berlin Heidelberg 2007
is derived by Guo and Lyu [4] based on the Kullback-Leibler information measure. Regularized Linear Discriminant Analysis (R-LDA), Kernel Direct Discriminant Analysis (KDDA) and Principal Component Analysis (PCA) are dimension reduction methods. R-LDA was proposed by Lu et al. [6]; it introduces a regularized Fisher's discriminant criterion and, by optimizing this criterion, addresses the small sample size problem. KDDA [7] can be seen as an enhanced kernel Direct Linear Discriminant Analysis (kernel D-LDA) method. PCA [8] is a linear transformation that maps the data to a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In this paper, comparative studies of regularization and dimension reduction approaches are given on two typical sets of high-dimensional real-world data: Raman spectroscopy signals and stellar spectra data. The correct classification rate (CCR) and time cost are used to evaluate the performance of each method. The rest of this paper is organized as follows. Section 2 reviews discriminant analysis. Section 3 introduces the regularization approaches. Section 4 discusses the dimension reduction approaches. Experiments are described in Section 5, and some discussions are given in Section 6. Finally, conclusions are given in the last section.
2 Discriminant Analysis
Discriminant analysis assigns an observation $x \in \mathbb{R}^N$ with unknown class membership to one of $k$ classes $C_1, \dots, C_k$ known a priori. There is a learning data set $A = \{(x_1, c_1), \dots, (x_n, c_n) \mid x_j \in \mathbb{R}^N \text{ and } c_j \in \{1, \dots, k\}\}$, where the vector $x_j$ contains $N$ explanatory variables and $c_j$ indicates the index of the class of $x_j$. The data set allows us to construct a decision rule which associates a new vector $x \in \mathbb{R}^N$ with one of the $k$ classes. The Bayes decision rule assigns the observation $x$ to the class $C_{j^*}$ which has the maximum a posteriori probability, which is equivalent, in view of the Bayes rule, to minimizing a cost function $d_j(x)$:

$$ j^* = \arg\min_j d_j(x), \qquad j = 1, 2, \dots, k, \qquad (1) $$

$$ d_j(x) = -2 \log(\pi_j f_j(x)), \qquad (2) $$

where $\pi_j$ is the prior probability of class $C_j$ and $f_j(x)$ denotes the class conditional density of $x$, for all $j = 1, \dots, k$. Some classical discriminant analysis methods can be obtained by combining additional assumptions with the Bayes decision rule. For instance, Quadratic Discriminant Analysis (QDA) [1,5] assumes that the class conditional density $f_j$ for the class $C_j$ is Gaussian $N(\hat{m}_j, \hat{\Sigma}_j)$, which leads to the discriminant function

$$ d_j(x) = (x - \hat{m}_j)^T \hat{\Sigma}_j^{-1} (x - \hat{m}_j) + \ln|\hat{\Sigma}_j| - 2 \ln \hat{\alpha}_j. \qquad (3) $$
Here $\hat{\alpha}_j$ is the prior probability, $\hat{m}_j$ is the mean vector, and $\hat{\Sigma}_j$ is the covariance matrix of the $j$-th class. If the prior probability $\hat{\alpha}_j$ is the same for all classes, the term $2 \ln \hat{\alpha}_j$ can be omitted and the discriminant function reduces to a simpler form. The parameters in the above equations can be estimated with the traditional maximum likelihood estimator:

$$ \hat{\alpha}_j = \frac{n_j}{N}, \qquad (4) $$

$$ \hat{m}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i, \qquad (5) $$

$$ \hat{\Sigma}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} (x_i - \hat{m}_j)(x_i - \hat{m}_j)^T. \qquad (6) $$

In practice, this method is penalized in high-dimensional spaces since it requires estimating many parameters. For small sample numbers it leads to the ill-posed problem: the parameter estimates can be highly unstable, giving rise to high variance in classification accuracy. By employing a method of regularization, one attempts to improve the estimates by biasing them away from their sample-based values towards values that are deemed to be more "physically plausible". For this reason, particular variants of QDA exist in order to regularize the estimation of $\hat{\Sigma}_j$. We can also assume that the covariance matrices are equal, i.e. $\hat{\Sigma}_j = \hat{\Sigma}$, which yields the framework of LDA [9]; this method makes linear separations between the classes.
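As a concrete illustration, the decision rule (1)-(3) with the maximum-likelihood estimates (4)-(6) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation; the function names and the synthetic two-class data in the usage example are our own.

```python
import numpy as np

def qda_fit(X, y, classes):
    """Maximum-likelihood estimates (4)-(6): priors, means, covariances."""
    N = len(y)
    params = {}
    for j in classes:
        Xj = X[y == j]
        nj = len(Xj)
        alpha = nj / N                    # Eq. (4)
        m = Xj.mean(axis=0)               # Eq. (5)
        D = Xj - m
        Sigma = D.T @ D / nj              # Eq. (6), biased ML estimate
        params[j] = (alpha, m, Sigma)
    return params

def qda_predict(x, params):
    """Assign x to the class minimizing the discriminant (3), per rule (1)."""
    best, best_d = None, np.inf
    for j, (alpha, m, Sigma) in params.items():
        diff = x - m
        _, logdet = np.linalg.slogdet(Sigma)
        d = diff @ np.linalg.solve(Sigma, diff) + logdet - 2 * np.log(alpha)
        if d < best_d:
            best, best_d = j, d
    return best
```

For example, fitting two well-separated Gaussian blobs and querying a point near either mean recovers the expected class label.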
3 Regularization Approaches
Regularization techniques have been highly successful in the solution of ill-posed and poorly-posed inverse problems. Methods such as RDA, LOOC and KLIM have been proposed; the crucial difference between them is the form of the covariance matrix estimation formula. We give a brief review of these methods below.

3.1 RDA
RDA is a regularization method proposed by Friedman [2]. RDA is designed for the small-sample-number case, where the covariance matrix in Eq. (3) takes the following form:

$$ \hat{\Sigma}_j(\lambda, \gamma) = (1 - \gamma)\, \hat{\Sigma}_j(\lambda) + \frac{\gamma}{d}\, \mathrm{Trace}[\hat{\Sigma}_j(\lambda)]\, I_d, \qquad (7) $$

with

$$ \hat{\Sigma}_j(\lambda) = \frac{(1 - \lambda)\, n_j \hat{\Sigma}_j + \lambda N \hat{\Sigma}}{(1 - \lambda)\, n_j + \lambda N}. \qquad (8) $$

The two parameters $\lambda$ and $\gamma$, which are restricted to the range 0 to 1, are regularization parameters selected by maximizing the leave-one-out correct classification rate (CCR). $\lambda$ controls the amount by which the $\hat{\Sigma}_j$ are shrunk towards $\hat{\Sigma}$, while $\gamma$ controls the shrinkage of the eigenvalues towards equality, since $\mathrm{Trace}[\hat{\Sigma}_j(\lambda)]/d$ equals the average of the eigenvalues of $\hat{\Sigma}_j(\lambda)$.
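The two-stage shrinkage (7)-(8) is straightforward to implement. The sketch below is an illustrative assumption of ours (not the paper's code): it takes the class and pooled covariance estimates as inputs and returns the regularized estimate.

```python
import numpy as np

def rda_covariance(Sigma_j, Sigma_pool, n_j, N, lam, gamma):
    """RDA covariance estimate, Eqs. (7)-(8).

    lam shrinks the class covariance towards the pooled covariance;
    gamma shrinks the eigenvalues towards their average.
    """
    d = Sigma_j.shape[0]
    # Eq. (8): weighted combination of class and pooled covariance
    Sigma_lam = (((1 - lam) * n_j * Sigma_j + lam * N * Sigma_pool)
                 / ((1 - lam) * n_j + lam * N))
    # Eq. (7): shrink eigenvalues towards equality
    return (1 - gamma) * Sigma_lam + gamma * np.trace(Sigma_lam) / d * np.eye(d)
```

Note the limiting cases: with lam = gamma = 0 the class covariance is returned unchanged, while gamma = 1 replaces it by a multiple of the identity with the same average eigenvalue.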
3.2 LOOC
Another covariance matrix estimation formula was proposed by Hoffbeck and Landgrebe [3]. They examine the diagonal sample covariance matrix, the diagonal common covariance matrix, and some pair-wise mixtures of those matrices. The proposed estimator has the following form:

$$ \hat{\Sigma}_j(\xi_j) = \xi_{j1}\, \mathrm{diag}(\hat{\Sigma}_j) + \xi_{j2}\, \hat{\Sigma}_j + \xi_{j3}\, \hat{\Sigma} + \xi_{j4}\, \mathrm{diag}(\hat{\Sigma}). \qquad (9) $$

The elements of the mixing parameter $\xi_j = [\xi_{j1}, \xi_{j2}, \xi_{j3}, \xi_{j4}]^T$ are required to sum to unity: $\sum_{l=1}^{4} \xi_{jl} = 1$. In order to reduce the computational cost, they only considered three cases: $(\xi_{j3}, \xi_{j4}) = 0$, $(\xi_{j1}, \xi_{j4}) = 0$, and $(\xi_{j1}, \xi_{j2}) = 0$. The covariance matrix estimator is called LOOC because the mixture parameter $\xi$ is optimized by the leave-one-out cross-validation method.

3.3 KLIM
The matrix estimation formula of KLIM is as follows:

$$ \hat{\Sigma}_j^{(1)}(h) = h I_d + \hat{\Sigma}_j, \qquad (10) $$

where $h$ is a regularization parameter and $I_d$ is the $d \times d$ identity matrix. This class of formula solves the matrix singularity problem in the high-dimensional setting. In fact, as long as $h$ is not too small, $\hat{\Sigma}_j^{-1}(h)$ exists with finite values and the estimated classification rate is stable.
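Eq. (10) amounts to adding a constant to the diagonal, which guarantees invertibility of an otherwise singular sample covariance. A minimal sketch (our own function name):

```python
import numpy as np

def klim_covariance(Sigma_j, h):
    """KLIM estimate, Eq. (10): add h to the diagonal to ensure invertibility."""
    return h * np.eye(Sigma_j.shape[0]) + Sigma_j

# Even a rank-deficient class covariance becomes full-rank for any h > 0:
Sigma_sing = np.ones((3, 3))          # rank 1, hence singular
Sigma_reg = klim_covariance(Sigma_sing, 0.1)
```

Since the eigenvalues of the regularized matrix are those of the original shifted by h, the inverse exists whenever h > 0.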
4 Dimension Reduction Approaches
Dimension reduction is another way to solve the ill-posed problem arising in the high-dimension, small-sample-number setting. R-LDA, KDDA and PCA are three common dimension reduction methods. R-LDA and KDDA can be considered variations of D-LDA. R-LDA introduces a regularized Fisher's discriminant criterion; the regularization helps to decrease the importance of the highly unstable eigenvectors, thereby reducing the overall variance. KDDA introduces a nonlinear mapping from the input space to an implicit high-dimensional feature space, where the nonlinear and complex distribution of patterns in the input space is "linearized" and "simplified" so that conventional LDA can be applied. PCA finds a $p$-dimensional subspace whose basis vectors correspond to the maximum-variance directions in the original data space. We give a brief review of R-LDA and KDDA below; since PCA is a well-known method, we do not describe it further in this paper.

4.1 R-LDA
The purpose of R-LDA [6] is to reduce the high variance of the eigenvalue estimates of the within-class scatter matrix at the expense of potentially increased bias. The regularized Fisher criterion can be expressed as follows:

$$ \Psi = \arg\max_{\Psi} \frac{|\Psi^T S_B \Psi|}{|\eta(\Psi^T S_B \Psi) + (\Psi^T S_W \Psi)|}, \qquad (11) $$
where $S_B$ is the between-class scatter matrix, $S_W$ is the within-class scatter matrix, and $0 \le \eta \le 1$ is a regularization parameter. Determine the set $U_m = [u_1, \dots, u_m]$ of eigenvectors of $S_B$ associated with the $m \le c - 1$ non-zero eigenvalues $\Lambda_B$. Define $H = U_m \Lambda_B^{-1/2}$, then compute the $M$ ($\le m$) eigenvectors $P_M = [p_1, \dots, p_M]$ of $H^T S_W H$ with the smallest eigenvalues $\Lambda_W$. Finally, combining the above results, we obtain $\Psi = H P_M (\eta I + \Lambda_W)^{-1/2}$, which is considered a set of optimal discriminant feature basis vectors.

4.2 KDDA
The KDDA method [7] implements an improved D-LDA in a high-dimensional feature space using a kernel approach. Let $\mathbb{R}^N$ be the input space, and assume that $A$ and $B$ represent the null spaces of the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$, respectively; the complement spaces of $A$ and $B$ can be written as $A' = \mathbb{R}^N - A$ and $B' = \mathbb{R}^N - B$. The optimal discriminant subspace sought by the KDDA algorithm is the intersection space $A' \cap B$, where $A'$ is found by diagonalizing the matrix $S_B$. The feature space $F$ is kept implicit by using kernel methods: dot products in $F$ are replaced with a kernel function in $\mathbb{R}^N$, so that the nonlinear mapping is performed implicitly.
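The kernel substitution just mentioned can be illustrated with a Gaussian kernel, which evaluates dot products in the implicit feature space F without ever forming the mapping. The particular kernel and its width are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2)),
    standing in for dot products in the implicit feature space F."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))
```

All the scatter-matrix computations of kernel D-LDA can then be expressed in terms of this Gram matrix rather than explicit feature vectors.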
5 Experiments
Two typical sets of real-world data, namely Raman spectroscopy and stellar spectra data, are used in our study. The Raman spectroscopy data set used in this work is the same as that in reference [10]. It consists of three classes of substances: acetic acid, ethanol and ethyl acetate. After preprocessing, all the data have been cut to 134 dimensions. There are 50 samples of acetic acid, 30 samples of ethanol and 290 samples of ethyl acetate, so there are 370 samples in total. The stellar spectrum data used in the experiments are from the Astronomical Data Center (ADC) [11]. They are drawn from a standard stellar library for evolutionary synthesis. The data set consists of 430 samples and can be divided

[Figure: intensity I versus wavelength (nm), over the range 0-1600 nm]
Fig. 1. The typical spectra lines of the three stellar types
into 3 classes. The numbers of samples in the three classes are 88, 131 and 211, respectively. Each spectrum has 1221 wavelength points covering the range from 9.1 to 160000 nm. The typical distribution of these spectrum lines in the range from 100 nm to 1600 nm is shown in Fig. 1. In the experiments, the data set is randomly partitioned into a training set and a testing set with no overlap between them. In the Raman data experiment, 15 samples are chosen randomly from each class and used as training samples to estimate the mean vectors and covariance matrices; the remaining 310 samples are used as test samples to verify the classification accuracy. In the stellar data experiment, 40 samples are randomly chosen from each class for training, and the remainder for testing. In this study, we first investigate the regularization methods, that is, classifying the data directly in the high-dimensional space. The other set of experiments applies the R-LDA, KDDA and PCA methods for dimension reduction, respectively; with the reduced-dimension data set, we choose QDA as the classifier to obtain the correct classification rate (CCR) in feature space. The results for the PCA method are obtained with the data reduced to 10 dimensions. All the experiments are repeated for 20 runs with different randomly partitioned data subsets, and all results reported in the tables of this paper are averages over the twenty runs. In the experiments, we recorded the CCR and time cost for each method. Table 1 shows the classification results obtained with the different approaches. It should be pointed out that the dimension of the raw stellar data is too high compared with its sample number, so it is unstable to compute the CCR directly in such a high-dimensional space. Thus we first reduce the dimension of the stellar data to 100 with the PCA method, and consider the result still a sufficiently high-dimensional space for the problem under investigation.
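The experimental protocol just described (a random per-class train/test split, PCA reduction, classification in the feature space, and CCR on the held-out samples) can be sketched as follows. This is our own illustrative sketch: for brevity a nearest-class-mean classifier stands in for QDA, and the data in the usage example are synthetic stand-ins for the Raman/stellar sets.

```python
import numpy as np

def pca_fit(X, p):
    """Return the mean and top-p principal directions of X."""
    mu = X.mean(axis=0)
    # principal directions via SVD of the centered data
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:p].T

def run_once(X, y, n_train_per_class, p, rng):
    """One random partition: PCA reduction, then nearest-class-mean CCR."""
    classes = np.unique(y)
    train_idx = np.concatenate(
        [rng.permutation(np.where(y == c)[0])[:n_train_per_class] for c in classes])
    test_mask = np.ones(len(y), dtype=bool)
    test_mask[train_idx] = False
    mu, W = pca_fit(X[train_idx], p)
    Z_tr = (X[train_idx] - mu) @ W
    Z_te = (X[test_mask] - mu) @ W
    means = np.stack([Z_tr[y[train_idx] == c].mean(axis=0) for c in classes])
    pred = classes[np.argmin(((Z_te[:, None, :] - means) ** 2).sum(-1), axis=1)]
    return (pred == y[test_mask]).mean()   # correct classification rate
```

Averaging `run_once` over 20 random partitions reproduces the kind of figures reported in the tables below, up to the choice of classifier.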
In the tables presented in this paper, the CCR is reported as a decimal fraction, and the notation N/A indicates that the covariance matrix is singular, in which case reliable results cannot be obtained.

Table 1. The classification results with different approaches

Data     Evaluation  RDA      LOOC     KLIM     R-LDA    KDDA     PCA
Raman    CCR         N/A      N/A      0.8448   0.6536   0.7374   0.7625
         Time (s)    99.399   39.782   178.57   0.2423   2.6132   0.4166
Stellar  CCR         0.9490   0.7786   0.9653   0.9677   0.9591   0.9531
         Time (s)    150.1    40.058   194.3    0.1672   2.6678   4.157
For further comparison, we use PCA to reduce the data to different dimensions before classification, still using the same QDA classifier to compute the CCR. The dimension of the two data sets is reduced to four levels: 40, 20, 10 and 2 dimensions, respectively. For convenience of comparison, Table 2 shows the classification results for the different dimensions of the Raman spectroscopy and stellar spectra data together.
6 Discussion
Table 1 depicts a quantitative comparison of the mean CCR and time cost obtained by direct classification with regularization approaches in the high-dimensional space and by classification after dimension reduction with a non-regularized classifier (QDA). As can be seen from the table, the time cost of classifying the high-dimensional data directly with regularization approaches is usually 20 to 1000 times higher than that of classification after dimension reduction. And in most cases, the CCR obtained by classification with dimension reduction approaches is acceptable compared to direct classification with regularization approaches. From Table 1 we can also find that when the training samples are insufficient and the distribution is seriously unbalanced, the ill-posed problem cannot be fully solved if the regularization parameters are too small, even with regularized classifiers. This phenomenon is very obvious for the Raman data, where the RDA and LOOC classifiers still encounter the covariance matrix singularity problem. Meanwhile, we also find that KLIM is a very effective regularization approach. For the Raman data, it obtains the best results among the three regularization approaches and the three dimension reduction approaches. If a data set is insufficient and seriously unbalanced in distribution, such as the Raman data, KLIM gives better CCR results than the dimension reduction approaches, but it costs more computation time than the other classifiers.

Table 2. The classification results for different dimensionality

Data     Evaluation  d=40    d=20     d=10     d=2
Raman    CCR         N/A     N/A      0.7625   0.6446
         Time (s)    6.358   4.2011   0.4166   0.3014
Stellar  CCR         N/A     0.9574   0.9531   0.8963
         Time (s)    39.2    13.913   4.157    2.2212
Can classification after dimension reduction always give more acceptable results than direct classification with regularization classifiers? The answer is not always positive. As illustrated in Table 2, the lower the reduced dimension, the less the computation time cost. We see that the results for the Raman data are worse than those for the stellar data, owing to insufficient training samples and a seriously unbalanced distribution. For the Raman data, even when we reduce the data to 20 dimensions, the ill-posed problem still exists, and the classification results are much worse than those of the stellar data. From the experiments we also find that the mean classification accuracy with principal components (PCs) is still acceptable even with only 2 PCs, although the accuracy degrades noticeably. When we reduce the data dimension to 2 PCs, for the stellar data, the CCR obtained with the QDA classifier is lower than the CCR obtained with RDA. We consider this to be because a reduction in the number of features leads to a loss of discriminant ability for some data sets. In order to cut down the computational time and obtain a satisfactory classification
accuracy at the same time, a careful choice of the dimension level to which the data are reduced is needed. However, how to select a suitable dimension level is still an open problem.
7 Conclusions
In this paper, we presented comparative studies of regularization and dimension reduction on real-world data sets under the same working conditions. From the results, we can draw some conclusions: (1) Dimension reduction approaches often give acceptable CCR results; meanwhile, they reduce the computational time cost and use less memory compared with direct classification in the high-dimensional space with regularization methods. (2) The choice of the dimension level to which the data should be reduced is very important. There exists an appropriate dimension level at which we can obtain satisfactory results with as little computational time and memory as possible; however, it is very difficult to choose such a proper dimension level. (3) If the chosen dimension is not sufficiently low, the ill-posed problem still cannot be avoided; if the reduced dimension is too low, discriminant ability is lost and the classification accuracy consequently degrades. (4) If the number of data samples is insufficient and the sample distribution is seriously unbalanced, as for Raman spectroscopy, some regularization approaches such as KLIM may be more effective than the dimension reduction approaches.
Acknowledgments. The research work described in this paper was fully supported by a grant from the National Natural Science Foundation of China (Project No. 60675011). The authors would like to thank Fei Xing and Ling Bai for their help with part of the experimental work.
References

1. Aeberhard, S., Coomans, D., De Vel, O.: Comparative Analysis of Statistical Pattern Recognition Methods in High Dimensional Settings. Pattern Recognition 27 (1994) 1065-1077
2. Friedman, J.H.: Regularized Discriminant Analysis. J. Amer. Statist. Assoc. 84 (1989) 165-175
3. Hoffbeck, J.P., Landgrebe, D.A.: Covariance Matrix Estimation and Classification with Limited Training Data. IEEE Trans. Pattern Analysis and Machine Intelligence 18 (1996) 763-767
4. Guo, P., Lyu, M.R.: Classification for High-Dimension Small-Sample Data Sets Based on Kullback-Leibler Information Measure. In: Arabnia, H.R. (ed.): Proceedings of the 2000 International Conference on Artificial Intelligence (2000) 1187-1193
5. Webb, A.R.: Statistical Pattern Recognition. Oxford University Press, London (1994)
6. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Regularization Studies of Linear Discriminant Analysis in Small Sample Size Scenarios with Application to Face Recognition. Pattern Recognition Letters 26 (2005) 181-191
7. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Face Recognition Using Kernel Direct Discriminant Analysis Algorithms. IEEE Trans. Neural Networks 14 (2003) 117-126
8. Jolliffe, I.T.: Principal Component Analysis. Springer-Verlag (1996)
9. Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7 (1936) 179-188
10. Guo, P., Lu, H.Q., Du, W.M.: Pattern Recognition for the Classification of Raman Spectroscopy Signals. Journal of Electronics and Information Technology 26 (2002) 789-793 (in Chinese)
11. Stellar Data: ADC website: http://adc.gsfc.nasa.gov/adc/sciencedata.html
Integrated Analytic Framework for Neural Network Construction

Kang Li1, Jian-Xun Peng1, Minrui Fei2, Xiaoou Li3, and Wen Yu4

1 School of Electronics, Electrical Engineering & Computer Science, Queen's University Belfast, Belfast BT9 5AH, UK
{K.Li,J.Peng}@qub.ac.uk
2 Shanghai Key Laboratory of Power Station Automation Technology, School of Mechatronics and Automation, Shanghai University, Shanghai 200072, China
3 Departamento de Computación, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
4 Departamento de Control Automático, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
Abstract. This paper investigates the construction of a wide class of single-hidden-layer neural networks (SLNNs) with or without tunable parameters in the hidden nodes. It is a challenging problem if both the parameter training and the determination of network size are considered simultaneously. Two alternative network construction methods are considered in this paper. Firstly, the discrete construction of SLNNs is introduced. The main objective is to select a subset of hidden nodes from a pool of candidates with parameters fixed 'a priori'. This is called discrete construction since there are no parameters in the hidden nodes that need to be trained. The second approach is called continuous construction, as all the adjustable network parameters are trained over the whole parameter space along the network construction process. In the second approach, there is no need to generate a pool of candidates, and the network grows node by node with the adjustable parameters optimized. The main contribution of this paper is to show that network construction can be done using the above two alternative approaches, and that these two approaches can be integrated within a unified analytic framework, leading to potentially significantly improved model performance and/or computational efficiency.
1 Introduction

The single-hidden layer neural network (SLNN) represents a large class of flexible and efficient structures with excellent approximating capabilities [1][2][12]. An SLNN is a linear combination of basis functions that are arbitrary (usually nonlinear) functions of the neural inputs. Depending on the type of basis functions used, various SLNNs have been proposed. Two general categories of SLNNs exist in the literature: 1) the first category includes SLNNs with no tunable parameters in the activation functions, such as Volterra and polynomial neural nets [13][14]; 2) the second category includes SLNNs with tunable parameters [3]-[7][11]; these include radial basis function neural networks (RBFNNs), probabilistic RBFNNs, MLPs

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 483–492, 2007. © Springer-Verlag Berlin Heidelberg 2007
with a single hidden layer, or heterogeneous neural networks whose activation functions are heterogeneous [15]. In some extreme cases, the tunable parameters in the activation functions are discontinuous [16]. There are two important issues in the application of SLNNs. One is network learning, i.e. optimizing the parameters; the other is determining the network structure, or the number of hidden nodes, based on the parsimonious principle. These two issues are closely coupled, and for many SLNNs the problem becomes a hard mixed-integer problem if the two issues are considered simultaneously. Given the two general categories of SLNNs, their constructions can be quite different. The construction of the first type of SLNN is to select a small subset of neural nodes from a candidate pool; it is naturally a discrete SLNN construction procedure, since there are no adjustable parameters in the hidden nodes. The main difficulty, however, is that the candidate pool can be extremely large, so the computational complexity of searching for the best subset is extremely high [7][9][12]. To alleviate the computational burden, forward subset selection methods have been claimed to be among the few efficient approaches. Forward subset selection algorithms, such as the orthogonal least squares (OLS) method [4][5][6] or the fast recursive algorithm (FRA) [7], select one basis function at a time from the candidates, namely the one that maximizes the reduction of the cost function, e.g. the sum of squared errors (SSE). This procedure is repeated until the desired number of, say n, basis functions has been selected. If n is unknown 'a priori', some selection criterion can be applied to stop the network construction, such as Akaike's information criterion (AIC) [8]. For the second category of SLNNs, the network construction is a very complex problem and can be quite time-consuming, since the adjustable parameters in the activation functions have to be optimized.
One solution is to convert it into a discrete neural network construction problem, i.e. generate a pool of candidate hidden neurons with the nonlinear parameters taking various discrete values, and then use subset selection methods to select the best hidden neurons. One of the latest developments in this area is the Extreme Learning Machine (ELM) concept proposed by Huang [10]. In ELM, the nonlinear parameters are assigned random values 'a priori' and the only set of parameters to be solved for is the linear output weights. This concept has been successfully applied to a wide range of problems; it is quite effective for less complicated problems and particularly useful for neural nets whose activation functions are discontinuous [16]. However, ELM has two potential disadvantages. Firstly, since the parameters in the activation functions are determined 'a priori', the candidate neuron pool is discrete in nature and may not necessarily contain the best neurons, i.e. those with the optimal parameters in the parameter space. Secondly, since a small network is usually desirable, the network construction must then select a subset of neurons from a large pool of candidates, as described above; this again can be computationally very expensive. Continuous construction of the second category of SLNNs is to optimize the tunable parameters over the whole parameter space along the network construction procedure. This is a very complicated process if both the determination of the network size
and parameter learning are considered simultaneously, since it is a hard mixed-integer problem. Although few analytic methods are available to efficiently and effectively address this problem as a whole, the two separate issues of parameter training and determination of the number of hidden nodes have been studied extensively in the literature. For example, for the training of different SLNNs such as RBF networks or MLPs, various supervised, unsupervised and hybrid methods have been studied over the last decades [2][11]. Guidelines and criteria have also been proposed for network growing and pruning [3]. This paper introduces two alternative approaches for the construction of SLNNs, namely discrete and continuous approaches. Each method has its advantages and disadvantages. It will then be shown that these two approaches can be integrated within one analytic framework, leading to potentially significantly reduced computational complexity and/or improved model performance. This paper is organized as follows. Section 2 briefly introduces the discrete construction of SLNNs. Section 3 shows that, after appropriate modification of the discrete construction method, a continuous construction method can be derived. Section 4 presents a simulation example to illustrate the two methods, and Section 5 concludes the paper.
2 Discrete Construction of SLNNs The main objective of discrete SLNN construction is to select a subset of hidden nodes from a pool of candidates using subset selection method. This is called discrete construction since the parameters need not to be trained. The corresponding SLNNs either have no adjustable parameters in the activation function of hidden nodes, or the parameters are assigned values ‘a priori’. This process can be described as follows. Suppose a set of M candidate basis functions {φ i (t ), i = 1,2," , M } , and a set of N samples are used for the construction and training of SLNNs, leading to the following full regression matrices
Φ = [φ_1, φ_2, …, φ_M],  φ_i = [φ_i(1), φ_i(2), …, φ_i(N)]^T, i = 1, 2, …, M. (1)
Now, consider the discrete construction of SLNNs, i.e. {φ_i(t), i = 1, 2, …, M} either have no tunable parameters, as in the Volterra series, or the tunable parameters in these basis functions are assigned values a priori. The main objective is then to select n significant basis functions, denoted p_1, p_2, …, p_n, which form a selected regression matrix

P = [p_1, p_2, …, p_n], (2)
producing the network output

y = Pθ + e, (3)
best fitting the data samples in the least-squares sense, i.e. the sum of squared errors (SSE) is minimized:

J(P) = min_{Φ_n ⊂ Φ, θ ∈ ℝ^n} {e^T e} = min_{Φ_n ⊂ Φ, θ ∈ ℝ^n} {(y − Φ_n θ)^T (y − Φ_n θ)}, (4)

where Φ_n is an N × n matrix composed of n columns of Φ, θ denotes the output weights, and the selected regression matrix is

P = [p_1, p_2, …, p_n]. (5)
If the selected regression matrix P is of full column rank, the least-squares estimate of the output weights in (4) is given by

θ = (P^T P)^{−1} P^T y. (6)
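Equation (6) can be checked numerically against a library least-squares solver; a minimal sketch (NumPy assumed; P and y are random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((100, 5))        # selected regression matrix, N=100, n=5
y = rng.standard_normal(100)             # target output

# Least-squares output weights, Eq. (6): theta = (P^T P)^{-1} P^T y,
# computed via the normal equations for a full-column-rank P.
theta = np.linalg.solve(P.T @ P, P.T @ y)

# Agrees with the library solver
theta_ref, *_ = np.linalg.lstsq(P, y, rcond=None)
print(np.allclose(theta, theta_ref))     # True
```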
Theoretically, each subset of n terms out of the M candidates forms a candidate neural network, giving M!/(n!(M − n)!) possible combinations. Obviously, obtaining the optimal subset is computationally prohibitive, or impossible, if M is very large; this is one aspect of the curse of dimensionality. To overcome this difficulty, forward stepwise model selection methods select basis functions one by one, with the cost function being maximally reduced at each step. A series of intermediate models is generated during the forward stepwise selection process. To formulate this process, the regression matrix of the kth intermediate network (with k basis functions selected) is denoted as

P_k = [p_1, p_2, …, p_k], k = 1, 2, …, n. (7)
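The size of this combinatorial search space is easy to appreciate numerically; a small illustration using Python's standard-library `math.comb`:

```python
from math import comb

# Number of distinct n-term subsets of M candidates: M!/(n!(M-n)!)
print(comb(20, 5))    # 15504 candidate networks already for M=20, n=5
print(comb(200, 10))  # roughly 2.2e16 -- exhaustive search is hopeless
```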
Obviously, the cost function (4) becomes

J(P_k) = y^T y − y^T P_k (P_k^T P_k)^{−1} P_k^T y = y^T (I − P_k (P_k^T P_k)^{−1} P_k^T) y. (8)
Now suppose one more basis function p_{k+1} is selected; the net decrease in the cost function is given by

ΔJ_{k+1}(p_{k+1}) = J(P_k) − J([P_k, p_{k+1}]), (9)
where P_{k+1} = [P_k, p_{k+1}] is the regression matrix of the (k+1)th intermediate model. In model selection, each selected term achieves the maximum contribution among all remaining candidates, i.e.

ΔJ_{k+1}(p_{k+1}) = max{ΔJ_{k+1}(φ), φ ∈ Φ, φ ≠ p_j, j = 1, …, k}. (10)
Since the number of candidate basis functions can be very large, an efficient algorithm is required to solve the optimization problem (10). To introduce the forward selection algorithm, a matrix series is defined:

R_k ≜ I − P_k (P_k^T P_k)^{−1} P_k^T for 0 < k ≤ n, and R_0 ≜ I. (11)
The matrix R_k is the residue matrix that appears directly in (8); it projects the output into the residue space. This residue matrix series has several interesting properties [7][9][11]. In particular, the following hold for R_k:

R_{k+1} = R_k − R_k p_{k+1} p_{k+1}^T R_k^T / (p_{k+1}^T R_k p_{k+1}), k = 0, 1, …, n − 1, (12)

R_k^T = R_k;  R_k R_k = R_k, k = 0, 1, …, n, (13)

R_i R_j = R_j R_i = R_i, i ≥ j; i, j = 0, 1, …, n, (14)

R_k φ = 0 if rank([p_1, …, p_k, φ]) = k;  R_k φ = φ^(k) ≠ 0 if rank([p_1, …, p_k, φ]) = k + 1, (15)

where φ^(k) = R_k φ. The net contribution of p_{k+1} to the cost function is then given by

ΔJ_{k+1}(p_{k+1}) = y^T (R_k − R_{k+1}) y. (16)
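The recursion (12) and the symmetry, idempotence, and ordering properties (13)-(14) are easy to verify numerically; an illustrative sketch with random regressors (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
p1 = rng.standard_normal(N)
p2 = rng.standard_normal(N)

def residue(P):
    """R = I - P (P^T P)^{-1} P^T, the projector onto the residue space (Eq. 11)."""
    return np.eye(len(P)) - P @ np.linalg.solve(P.T @ P, P.T)

R0 = np.eye(N)
R1 = residue(p1[:, None])
# Recursion (12): R_{k+1} = R_k - R_k p p^T R_k^T / (p^T R_k p)
R1_rec = R0 - (R0 @ p1)[:, None] @ (p1 @ R0)[None, :] / (p1 @ R0 @ p1)
R2 = residue(np.column_stack([p1, p2]))
R2_rec = R1 - (R1 @ p2)[:, None] @ (p2 @ R1)[None, :] / (p2 @ R1 @ p2)

print(np.allclose(R1, R1_rec), np.allclose(R2, R2_rec))   # recursion (12) holds
print(np.allclose(R1, R1 @ R1), np.allclose(R1, R1.T))    # Eq. (13): idempotent, symmetric
print(np.allclose(R2 @ R1, R2))                           # Eq. (14): R_i R_j = R_i, i >= j
```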
To further simplify (16), define

φ_i^(k) ≜ R_k φ_i, i = 1, …, M,  y^(k) ≜ R_k y, (17)

p_j^(k) ≜ R_k p_j, j = 1, …, n. (18)

Then it holds that

φ^(k+1) = R_{k+1} φ = φ^(k) − [(p_{k+1}^(k))^T φ^(k) / ((p_{k+1}^(k))^T p_{k+1}^(k))] p_{k+1}^(k), (19)

ΔJ_{k+1}(p_{k+1}) = [(y^(k))^T p_{k+1}^(k)]^2 / [(p_{k+1}^(k))^T p_{k+1}^(k)]. (20)
Equation (20) expresses the net contribution of a selected basis function to the cost function, based on which the discrete construction of SLNNs proceeds as follows.

Algorithm 1 (A1): Discrete construction of SLNNs
Step 1: Initialisation phase. Select N training samples and generate a candidate pool of hidden nodes, denoted T_pool, with corresponding full regression matrix Φ. For the first category of SLNNs, the candidates are all possible basis functions without tunable parameters. For the second category of SLNNs, the tunable parameters are assigned random values, as in the ELM method [10][16], or are assigned values according to a priori information. Define the stop criterion of network construction; this is usually a given minimum desired contribution of the basis functions, δ_E, or an information criterion such as AIC [8]. Set the count of selected basis functions k = 0.
Step 2: Selection phase. The candidate basis function φ_i, i = 1, …, M, in T_pool that satisfies the following criterion is selected:

p_{k+1}: max_{φ_i} ΔJ_{k+1}(φ_i) = [(y^(k))^T φ_i^(k)]^2 / [(φ_i^(k))^T φ_i^(k)], φ_i ∈ T_pool. (21)
Step 3: Check phase. Check if the network construction criterion is satisfied, e.g.

ΔJ_{k+1}(p_{k+1}) ≤ δ_E. (22)

If (22) is true then stop; otherwise, continue.
Step 4: Update phase. Add p_{k+1} to the network and remove it from T_pool. Update the intermediate variables according to (19), and let k = k + 1.
Step 5: Go to Step 2.
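A compact sketch of Algorithm 1 (NumPy assumed; the function name and defaults are illustrative). It maintains the deflated regressors of (17)-(19) in place of the residue matrices, so R_k is never formed explicitly:

```python
import numpy as np

def discrete_construction(Phi, y, delta_E=1e-3, n_max=None):
    """Forward selection of basis functions by maximal net contribution (21).

    Phi : (N, M) full candidate regression matrix; y : (N,) target output.
    Columns are deflated in place following Eqs. (17)-(19).
    Returns the indices of the selected candidates.
    """
    Phi = Phi.astype(float).copy()
    y = y.astype(float).copy()          # deflated target y^{(k)}
    pool = list(range(Phi.shape[1]))    # T_pool
    selected = []
    n_max = n_max or Phi.shape[1]
    while pool and len(selected) < n_max:
        # Net contribution of every remaining candidate, Eq. (21)
        num = (y @ Phi[:, pool]) ** 2
        den = np.einsum('ij,ij->j', Phi[:, pool], Phi[:, pool])
        dJ = num / np.maximum(den, 1e-12)
        best = int(np.argmax(dJ))
        if dJ[best] <= delta_E:         # stop criterion, Eq. (22)
            break
        j = pool.pop(best)
        selected.append(j)
        p = Phi[:, j].copy()            # p_{k+1}^{(k)}, already deflated
        ptp = p @ p
        # Deflate remaining candidates and target, Eqs. (17)-(19)
        Phi[:, pool] -= np.outer(p, (p @ Phi[:, pool]) / ptp)
        y -= (p @ y) / ptp * p
    return selected

# Illustrative use: y is an exact combination of candidates 3 and 6
rng = np.random.default_rng(2)
Phi = rng.standard_normal((100, 8))
y = 2.0 * Phi[:, 3] - 1.5 * Phi[:, 6]
sel = discrete_construction(Phi, y)
print(sorted(sel))                      # [3, 6]
```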
3 Continuous Construction of SLNNs

Continuous construction of the second category of SLNNs optimizes the tunable parameters over the whole parameter space during the network construction procedure. In the following, it will be shown that after an appropriate modification of the above discrete SLNN construction method, a continuous method can be derived. For continuous construction of SLNNs with tunable parameters, the basis functions {φ_i(t), i = 1, 2, …, M} in (1) can be redefined as φ_i(x(t), ω), where x(t) is the network input vector, ω is the tunable parameter vector defined in the continuous parameter space, and t is the time instant. The notation φ_i(x(t), ω) is further abbreviated as φ_i(ω). Obviously this representation covers a wider class of neural networks. The continuous construction of SLNNs optimizes the parameters ω in the continuous parameter space. In comparison with the discrete construction, it starts the construction with no candidate basis functions, so the computational complexity and memory requirements can be significantly reduced. For simplicity, it is supposed that there is only one type of basis function, such as the Gaussian function or the tangent sigmoid. The network construction procedure grows the network by adding basis functions one by one, each time maximizing the reduction of the cost function defined in (9). Based on (17)-(20), the net change of the cost function (9) due to adding one more basis function is a function of ω:

ΔJ_{k+1}(ω) = C²(ω) / D(ω), (23)
where

C(ω) = y^T R_k φ(ω) = Σ_{t=1}^{N} y^(k)(t) φ^(k)(x(t), ω),
D(ω) = [φ(ω)]^T R_k φ(ω) = Σ_{t=1}^{N} [φ^(k)(x(t), ω)]². (24)
The maximum contribution of adding one more basis function φ(ω) is identified as

ΔJ_{k+1}(ω_{k+1}) = max{ΔJ_{k+1}(ω), ω ∈ R^{n+1}}. (25)
Now (25) is an unconstrained continuous optimization problem, and a number of first-order and second-order search algorithms can be applied, such as Newton's method, the conjugate gradient method, etc.

Algorithm 2 (A2): Continuous construction of SLNNs
Step 1: Initialization. Let k = 0 and the cost function J = y^T y; define the stop criterion of network construction, usually a given minimum desired contribution of the basis functions, δ_E, or an information criterion such as AIC.
Step 2: Search for the optimum parameter ω_{k+1} for the (k+1)th hidden node using a conventional first-order or second-order search algorithm with the first- and second-order derivative information.
Step 3: Check phase. Check if the network construction criterion is satisfied, e.g.

ΔJ_{k+1}(ω_{k+1}) ≤ δ_E. (26)

If (26) is true then stop; otherwise, let k = k + 1 and continue.
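Step 2 can be sketched for Gaussian basis functions on noisy sin(x)/x data as in the simulation below. A coarse grid search stands in here for the first/second-order search the algorithm actually prescribes; the grid, data sizes, and noise level are illustrative assumptions:

```python
import numpy as np

def delta_J(c, w, x, y_k):
    """Net contribution (23): C^2(omega)/D(omega) of Eq. (24) for one
    Gaussian basis function with centre c and width w, omega = (c, w)."""
    phi = np.exp(-(x - c) ** 2 / w)
    C = y_k @ phi                     # C(omega), Eq. (24)
    D = phi @ phi                     # D(omega), Eq. (24)
    return C * C / max(D, 1e-12)

rng = np.random.default_rng(3)
x = np.linspace(-10, 10, 200)
y = np.sinc(x / np.pi) + 0.1 * rng.standard_normal(200)   # noisy sin(x)/x

# Search the continuous parameter space for the first hidden node
centres = np.linspace(-10, 10, 81)
widths = np.linspace(0.5, 50, 100)
scores = np.array([[delta_J(c, w, x, y) for w in widths] for c in centres])
i, j = np.unravel_index(scores.argmax(), scores.shape)
print(centres[i], widths[j])   # parameters omega_1 of the first neuron
```

Subsequent neurons would be found the same way after deflating y and φ with the selected basis functions, exactly as in the discrete case.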
4 Simulation

Consider the following nonlinear function, to be approximated using an RBF neural network:

f(x) = sin(x)/x, −10 ≤ x ≤ 10. (27)

400 data points were generated using y = f(x) + ξ, where x was uniformly distributed within the range [−10, 10] and the noise ξ ~ N(0, 0.2). The first 200 points were used for network construction and training, and the rest for validation.

Table 1. Test performance

                   SSE (training data)   NPE (%) (validation)   Running time (s)
Network size (m)      A1       A2           A1       A2            A1       A2
       1            20.39     8.78         71.00    47.60         0.344    0.093
       2            18.41     8.01         69.35    47.18         0.359    0.094
       3            14.29     7.31         60.69    44.97         0.391    0.235
       4            11.24     6.71         53.99    43.71         0.390    0.312
490
K. Li et al.
Fig. 1. Top: Equi-height contour of cost function with respect to centre and width for the 1st neuron; Bottom: input and output signals to be modelled using the first neuron
Fig. 2. Top: Equi-height contour of cost function with respect to centre and width for the 4th neuron; Bottom: input and output signals to be modelled using the 4th neuron
Both the discrete (A1) and continuous (A2) construction methods were used to produce the RBF networks. For discrete construction of the RBFNN, all 200 training data samples were used as the candidate centres, and the width was predetermined as σ² = 200 by a series of tests. Networks of sizes from 1 to 6 were produced. The final cost function (sum of squared errors) over the training data set and the running times of both algorithms are listed in Table 1 for comparison. The networks produced by the two algorithms were then tested on the validation data set; their normalized prediction errors (NPE) over the validation data set are also listed in Table 1. NPE in Table 1 is defined as
NPE = [Σ_{t=1}^{N} (ŷ(t) − y(t))² / Σ_{t=1}^{N} y²(t)]^{1/2} × 100%, (28)
where ŷ(t) denotes the network output. Figs. 1 and 2 illustrate the equi-height contours of the SSE cost function with respect to the centre and width of the first and the fourth hidden nodes in the RBF network. The y- and x-signals to be modelled by these two hidden neurons are also shown in the diagrams. It is obvious that the search space is quite complex, and pre-determined widths and centres for hidden nodes may not produce a good neural model. Fig. 2 and Table 1 also reveal that increasing the number of RBF hidden nodes beyond 4 has little impact on network performance, as the signals still to be modelled tend to be simply noise.
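The validation measure (28) is a one-liner; a sketch with a tiny hand-checkable example (NumPy assumed):

```python
import numpy as np

def npe(y_hat, y):
    """Normalized prediction error, Eq. (28), in percent."""
    return 100.0 * np.sqrt(np.sum((y_hat - y) ** 2) / np.sum(y ** 2))

y = np.array([1.0, 2.0, 2.0])
print(npe(np.array([1.0, 2.0, 1.0]), y))   # sqrt(1/9) * 100 = 33.33...
```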
5 Conclusion

An integrated framework has been proposed for the construction of a wide range of single-hidden-layer neural networks (SLNNs) with or without tunable parameters. First, a discrete SLNN construction method was introduced. After a proper modification, a continuous construction method was also derived. Each of the two alternative methods has its advantages and disadvantages. It has been shown that the two methods can be performed within one analytic framework.
References
1. Igelnik, B., Pao, Y.H.: Additional Perspectives of Feedforward Neural-nets and the Functional-link. IJCNN '93, Nagoya, Japan (1993)
2. Adeney, K.M., Korenberg, M.J.: Iterative Fast Orthogonal Search Algorithm for MDL-based Training of Generalized Single-layer Network. Neural Networks 13 (2000) 787-799
3. Huang, G.-B., Saratchandran, P., Sundararajan, N.: A Generalized Growing and Pruning RBF (GGAP-RBF) Neural Network for Function Approximation. IEEE Trans. Neural Networks 16 (2005) 57-67
4. Chen, S., Billings, S.A.: Neural Network for Nonlinear Dynamic System Modelling and Identification. International Journal of Control 56 (1992) 319-346
5. Zhu, Q.M., Billings, S.A.: Fast Orthogonal Identification of Nonlinear Stochastic Models and Radial Basis Function Neural Networks. Int. J. Control 64 (5) (1996) 871-886
6. Chen, S., Cowan, C.F.N., Grant, P.M.: Orthogonal Least Squares Learning Algorithm for Radial Basis Functions. IEEE Trans. Neural Networks 2 (1991) 302-309
7. Li, K., Peng, J., Irwin, G.W.: A Fast Nonlinear Model Identification Method. IEEE Trans. Automatic Control 50 (8) (2005) 1211-1216
8. Akaike, H.: A New Look at the Statistical Model Identification. J. R. Statist. Soc. Ser. B 36 (1974) 117-147
9. Li, K., Peng, J., Bai, E.-W.: A Two-stage Algorithm for Identification of Nonlinear Dynamic Systems. Automatica 42 (7) (2006) 1189-1197
10. Huang, G.B., Chen, L., Siew, C.K.: Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes. IEEE Trans. Neural Networks 17 (4) (2006) 879-892
11. Peng, J., Li, K., Huang, D.S.: A Hybrid Forward Algorithm for RBF Neural Network Construction. IEEE Trans. Neural Networks 17 (6) (2006) 1439-1451
12. Li, K., Peng, J., Fei, M.: Real-time Construction of Neural Networks. Artificial Neural Networks – ICANN 2006. Lecture Notes in Computer Science, Springer-Verlag GmbH, LNCS 4131 (2006) 140-149
13. Adeney, K.M., Korenberg, M.J.: On the Use of Separable Volterra Networks to Model Discrete-time Volterra Systems. IEEE Trans. Neural Networks 12 (1) (2001) 174-175
14. Nikolaev, N., Iba, H.: Learning Polynomial Feedforward Networks by Genetic Programming and Backpropagation. IEEE Trans. Neural Networks 14 (2) (2003) 337-350
15. Weingaertner, D., Tatai, V.K., Gudwin, R.R., Von Zuben, F.J.: Hierarchical Evolution of Heterogeneous Neural Networks. Proceedings of the 2002 Congress on Evolutionary Computation (CEC2002) 2 (2002) 1775-1780
16. Huang, G.B., Zhu, Q.Y., Mao, K.Z., Siew, C.K., Saratchandran, P., Sundararajan, N.: Can Threshold Networks be Trained Directly. IEEE Trans. Circuits and Systems-II: Express Briefs 53 (3) 187-191
A Novel Method of Constructing ANN

Xiangping Meng1, Quande Yuan2, Yuzhen Pi2, and Jianzhong Wang2

1 School of Electrical Engineering & Information Technology, Changchun Institute of Technology, 130012, Changchun, China
[email protected]
http://www.ccit.edu.cn
2 School of Information Engineering, Northeast Dianli University, 132012, Jilin, China
{yuanquande,piyuzhen}@gmail.com
[email protected]

Abstract. Artificial Neural Networks (ANNs) are powerful computational and modeling tools; however, there are still some limitations in ANNs. In this paper, we give a new method for constructing artificial neural networks, based on multi-agent theory and a reinforcement learning algorithm. All nodes in this new neural network are represented as agents, and these agents have learning ability via a reinforcement learning algorithm. The experimental results show this method is effective.
1 Introduction

Artificial Neural Networks (ANNs) are powerful computational tools that have found extensive acceptance in many disciplines for solving complex real-world problems. An ANN may be defined as a structure comprised of densely interconnected adaptive simple processing elements (called artificial neurons or nodes) that are capable of performing massively parallel computations for data processing and knowledge representation [1][2]. Although ANNs are drastic abstractions of their biological counterparts, the idea of ANNs is not to replicate the operation of biological systems but to make use of what is known about the functionality of biological networks for solving complex problems. ANNs have gained great success; however, there are still some limitations, such as:
1. Most ANNs are not really distributed, so their nodes or neurons cannot work in parallel.
2. Training time is long.
3. The number of nodes is limited by the capability of the computer.
To solve these problems we try to reconstruct the NN using multi-agent system theory.
Supported by the Key Project of the Ministry of Education of China for Science and Technology Research (ID: 206035).
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 493–499, 2007. c Springer-Verlag Berlin Heidelberg 2007
Multi-agent technology is a hot topic in recent research on artificial intelligence. The concept of an agent is a natural analogy of the real world. In this new network, the units, represented as agents, can run on different computers. We use a reinforcement learning algorithm as the learning rule of this new neural network.
2 Multi-Agent System and Reinforcement Learning

2.1 Multi-Agent System
Autonomous agents and multi-agent systems (MASs) are rapidly emerging as a powerful paradigm for designing and developing complex software systems. In fact, the architecture of a multi-agent system can be viewed as a computational organization. Surprisingly, there is no agreement on what an agent is: there is no universally accepted definition of the term agent. These problems have been discussed in [3] and other papers in detail. We agree that an agent should have the following characteristics:
1. Autonomy: the agent is capable of acting independently and exhibits control over its internal state.
2. Reactivity: the agent maintains an ongoing interaction with its environment and responds to changes that occur in it (in time for the response to be useful).
3. Pro-activity: the agent generates and attempts to achieve goals, and is not driven solely by events.
4. Social ability: the ability to interact with other agents (and possibly humans) via some kind of agent-communication language, and perhaps cooperate with others.
As an intelligent agent, an agent should have another important ability, namely learning, which helps the agent adapt to a dynamic environment. A multi-agent system contains a number of agents which interact through communication, are able to act in an environment, and are linked by other (organizational) relationships. A MAS can do more things than a single agent. From this perspective, every neuron in an artificial neural network can be viewed as an agent, which takes input and decides what to do next according to its own state and policy. Through the agents' interaction, the MAS produces an output. The supervisor then gives feedback to the output agents, and the output agents give feedback to the other agents.
2.2 Reinforcement Learning
Learning behaviors in a multi-agent environment is crucial for developing and adapting multi-agent systems. Reinforcement learning (RL) has been successful in finding optimal control policies for a single agent operating in a stationary environment, specifically a Markov decision process (MDP). RL finds optimal strategies only through trial and error; it can also be applied offline, as a pre-processing step during development, and then be continuously improved online after release. Stochastic games extend the single-agent Markov decision process to include multiple agents whose actions all impact the resulting rewards and next state. Stochastic games are a generalization of MDPs to multiple agents and can be used as a framework for investigating multi-agent learning. Reinforcement learning has opened the way for designing autonomous agents capable of acting in unknown environments by exploring different possible actions and their consequences. Q-learning is a standard reinforcement learning technique. In single-agent systems, Q-learning possesses a firm foundation in the theory of Markov decision processes. The basic idea behind Q-learning is to determine which actions, taken from which states, lead to rewards for the agent (however these are defined), which actions, from which states, lead to the states from which said rewards are available, and so on. The value of each action which could be taken in each state, i.e., its Q-value, is a time-discounted measure of the maximum reward available to the agent by following a path through state space of which the action in question is a part. A typical Q-learning model is shown in Fig. 1.
Fig. 1. A typical reinforcement learning model
Q-learning consists of iteratively computing the values of the action-value function, using the following update rule:

Q(s, a) ← (1 − α) Q(s, a) + α [r + β V(s′)], (1)
where β is a discount factor with 0 ≤ β < 1, and V(s′) = max_{a′} Q(s′, a′); β represents the relative importance of future against immediate rewards, and α is a positive step-size parameter. Q-learning will converge to a best response independently of the agents' behavior as long as the conditions for convergence are satisfied: if α decreases appropriately with time and each state-action pair is visited infinitely often in the limit, then the algorithm converges to a best response for all s ∈ S and a ∈ A(s) with probability one. However, in some multi-agent environments basic Q-learning is not enough for an agent: multi-agent environments are inherently non-stationary, since the other agents are free to change their behavior as they also learn and adapt. In research on multi-agent Q-learning, most studies adopt the framework of general-sum stochastic games. In multi-agent Q-learning, the Q-function of an agent is defined over states and joint action vectors a = (a_1, a_2, …, a_n), rather than state-action pairs. The agents start with arbitrary Q-values, and the Q-values are updated as follows:

Q^i_{t+1}(s, a) = (1 − α) Q^i_t(s, a) + α [r^i_t + β · V^i(s_{t+1})], (2)

where V^i(s_{t+1}) is the state value function, and

V^i(s_{t+1}) = max_{a_i ∈ A} f^i(Q^i_t(s_{t+1}, a)). (3)

In this generic formulation, the key elements are the learning policy, i.e. the selection method for the action a, and the determination of the value function V^i(s_{t+1}), with 0 ≤ α < 1. However, the number of agents in a MAS is usually very large, and it is difficult to obtain and maintain the information of all agents, because computing resources are limited. In fact, agents need not know all of the other agents' states and actions; interacting with their neighbors is enough.
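The single-agent update (1) can be sketched on a toy problem. The 5-state chain environment, the fully exploratory behaviour policy, and all parameter values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy 5-state chain: action 0 moves left, action 1 moves right;
# entering state 4 yields reward 1 and the agent resets to state 0.
n_states, n_actions = 5, 2
alpha, beta = 0.5, 0.9            # step size and discount factor of Eq. (1)
Q = np.zeros((n_states, n_actions))

rng = np.random.default_rng(0)
s = 0
for _ in range(20000):
    a = int(rng.integers(n_actions))              # fully exploratory behaviour
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    # Eq. (1): Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + beta V(s')],
    # with V(s') = max_{a'} Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + beta * Q[s_next].max())
    s = 0 if s_next == n_states - 1 else s_next

print(Q[:4].argmax(axis=1))   # greedy policy: move right in states 0-3
```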
3 Artificial Neural Network Based on MAS
As discussed above, we sometimes need a neural network with a huge number of neurons, which is difficult to achieve with traditional methods. The new method we propose is based on MAS theory: the neural network can be viewed as a multi-agent system, which maps input to output through agent interaction. There are four types of agents: input node agents, output node agents, hidden layer agents, and a manager agent. These agents can run not only on the same computer but also on different ones, so the number of agents is not limited. Every agent has learning ability via reinforcement learning. Now we can construct the ANN using MAS theory: each unit or node is an agent. We constructed a three-layer network. Firstly, we create a manager agent, which manages the other agents, recording agent types, the number of agents of each type, and agent locations and ids. The other agents can then find the node agents they need to link to. A simple new three-layer BP NN topological diagram is shown in Fig. 2. Every node agent has its own internal state, mapping its input to an output.
Fig. 2. Topological diagram of a new ANN
3.1 Manager Agent
The manager agent (MA) is a platform or container, and all other agents run on it. A MAS can have many manager agents, but only one main manager agent. The manager agents maintain the NN's global information, including the number of agents of each type and their locations and unique IDs. The MA provides services to the other agents. For example, when an input agent needs its neighboring output node agents, it sends a message to the manager agent, and the MA returns a list of agent IDs. The MA also handles agent registration: when a new agent is created, a message must be sent to the manager agent announcing that a node agent has been created, together with the node agent's ID, location, and other information. Before an agent dies, it likewise sends the manager agent a message, and the manager agent removes the corresponding record.
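The bookkeeping described above can be sketched as a plain in-memory registry. The class and method names are hypothetical, and the message-passing layer is omitted; the paper does not specify an implementation:

```python
class ManagerAgent:
    """Minimal sketch of the manager agent's registry (names illustrative)."""

    def __init__(self):
        self.registry = {}          # agent id -> (agent type, location)

    def register(self, agent_id, agent_type, location):
        """Record a newly created node agent's ID, type and location."""
        self.registry[agent_id] = (agent_type, location)

    def deregister(self, agent_id):
        """Remove the record of an agent that is about to die."""
        self.registry.pop(agent_id, None)

    def neighbours(self, agent_type):
        """Return the IDs of all agents of a given type (e.g. output nodes)."""
        return [i for i, (t, _) in self.registry.items() if t == agent_type]

ma = ManagerAgent()
ma.register('in-0', 'input', 'host-a')
ma.register('out-0', 'output', 'host-b')
print(ma.neighbours('output'))      # ['out-0']
```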
3.2 Unit Agents
The unit agents are classified into three types: the input agents, the hidden agents, and the output agents; the most important is the second type. Suppose there are m layers in the ANN, n_l denotes the number of units in layer l, and y_k^(l) is the output of agent k in layer l. Then

ȳ_k^(l) = w_k^(l) · y^(l−1) = Σ_{j=1}^{n_{l−1}} w_{kj}^(l) y_j^(l−1), (4)

y_k^(l) = f(ȳ_k^(l)), k = 1, 2, …, n_l, (5)

where w_k^(l) is the weight vector between layer l − 1 and layer l, and Y^(0) = X.
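Equations (4)-(5) are an ordinary layered forward pass; a minimal sketch (NumPy assumed; the logistic activation and layer sizes are illustrative choices):

```python
import numpy as np

def f(v):
    """Illustrative activation function for Eq. (5)."""
    return 1.0 / (1.0 + np.exp(-v))

def forward(weights, x):
    """Eqs. (4)-(5): y_k^(l) = f(w_k^(l) . y^(l-1)), with Y^(0) = X."""
    y = x
    for W in weights:          # W has shape (n_l, n_{l-1})
        y = f(W @ y)           # all units of layer l in one matrix product
    return y

rng = np.random.default_rng(4)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
out = forward(weights, np.array([0.5, -0.5]))
print(out)                     # single network output, strictly inside (0, 1)
```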
Given the supervised information, the weights between agents are modified to minimize E(w):

E(w) = (1/2) ||Y − Ŷ||² = (1/2) Σ_{k=1}^{n_m} (Y_k − Ŷ_k)². (6)

The unit agents' learning algorithm is given in Table 1.

Table 1. The process of reinforcement learning of an input agent
1. Initialize:
 (a) Select an initial learning rate α and discount factor β, and let t = 0;
 (b) Initialize the state S and action A respectively;
 (c) For all states s and actions a, initialize Q^i_0(s^(0), a_s^(1), a_s^(2), …, a_s^(n)), and let π^i_0(s^(0), a_s^(i)) = 1/n;
2. Repeat the following process (for each episode):
 (a) Get action a from the current state s using the policy π derived from Q;
 (b) Execute action a, observe the reward r and the new state s′;
 (c) Update Q^i_t using formula (1);
until Δ(s) < ε for all states.
4 Experiment and Results

We constructed a simple three-layer BP network and trained it to learn XOR. The experimental results are shown in Fig. 3. From this figure, we can see that the network quickly learns the correct classification.
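The XOR experiment can be reproduced in miniature with a plain three-layer backprop network minimizing (6). This stand-alone sketch uses ordinary full-batch gradient descent rather than the agent-based implementation of the paper; the hidden-layer size, learning rate (1.0, implicit), and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)       # XOR truth table

sig = lambda v: 1.0 / (1.0 + np.exp(-v))
W1, b1 = rng.standard_normal((2, 8)), np.zeros(8)     # 2 inputs -> 8 hidden
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)     # 8 hidden -> 1 output

for _ in range(20000):
    H = sig(X @ W1 + b1)                              # forward pass, Eqs. (4)-(5)
    O = sig(H @ W2 + b2)
    dO = (O - T) * O * (1 - O)                        # backprop of E(w), Eq. (6)
    dH = (dO @ W2.T) * H * (1 - H)
    W2 -= H.T @ dO;  b2 -= dO.sum(axis=0)
    W1 -= X.T @ dH;  b1 -= dH.sum(axis=0)

mse = float(((sig(sig(X @ W1 + b1) @ W2 + b2) - T) ** 2).mean())
print(mse)   # training error after convergence
```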
Fig. 3. The error rates vs training episodes
5 Conclusion
In this paper, we discussed MAS and reinforcement learning, and proposed a new method to construct artificial neural networks, inspired by agent theory. The results show this method is effective. However, there is still more work to do, such as:
1. The communication between node agents should be improved;
2. How many neighbor agents should a unit agent know?
We will continue working on these aspects, and more research will be done in future work.
References
1. Hecht-Nielsen, R.: Neurocomputing. Addison-Wesley, Reading, MA (1990)
2. Schalkoff, R.J.: Artificial Neural Networks. McGraw-Hill, New York (1997)
3. Jennings, N.R., Sycara, K.P., Wooldridge, M.: A Roadmap of Agent Research and Development. Journal of Autonomous Agents and Multi-Agent Systems 1 (1) (1998) 7-36
4. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 (2) (1996) 237-285
5. Littman, M.L.: Friend-or-foe: Q-learning in General-sum Games. Proceedings of the Eighteenth International Conference on Machine Learning (2001) 322-328
6. Bowling, M., Veloso, M.: Multiagent Learning Using a Variable Learning Rate. Artificial Intelligence 136 (2002) 215-250
7. Maarten, P.: A Study of Reinforcement Learning Techniques for Cooperative Multi-Agent Systems. Vrije Universiteit Brussel, Computational Modeling Lab, Department of Computer Science (2002-2003)
8. Watkins, C.J.C.H., Dayan, P.: Q-learning. Machine Learning 8 (3/4) (1992) 279-292
9. Littman, M.L.: Markov Games as a Framework for Multi-agent Reinforcement Learning. Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ (1994) 157-163
Topographic Infomax in a Neural Multigrid

James Kozloski, Guillermo Cecchi, Charles Peck, and A. Ravishankar Rao

IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
{kozloski,gcecchi,cpeck,ravirao}@us.ibm.com
Abstract. We introduce an information-maximizing neural network that employs only local learning rules, simple activation functions, and feedback in its functioning. The network consists of an input layer, an output layer that can be overcomplete, and a set of auxiliary layers comprising feed-forward, lateral, and feedback connections. The auxiliary layers implement a novel "neural multigrid," and each computes a Fourier mode of a key infomax learning vector. Initially, a partial multigrid computes only low-frequency modes of this learning vector, resulting in a spatially correlated topographic map. As higher-frequency modes of the learning vector are gradually added, an infomax solution emerges, maximizing the entropy of the output without disrupting the map's topographic order. When feed-forward and feedback connections to the neural multigrid are passed through a nonlinear activation function, infomax emerges in a phase-independent topographic map. Information rates estimated by Principal Components Analysis (PCA) are comparable to those of standard infomax, indicating that the neural multigrid successfully imposes a topographic order on the optimal infomax-derived bases.
1 Introduction
Topographic map formation requires an order embedding, by which a set of vectors X in some input space is mapped onto a set of vectors Y in some output space such that the ordering of vectors in Y, when embedded within some alternate lower-dimensional coordinate system (usually 2D), preserves as much as possible the partial ordering of vectors in X in the input space. An important additional objective of topographic map formation is that the volume defined by Y be maximized so as to avoid trivial mappings. This second objective ignores the ordering of inputs and outputs in their respective spaces and instead attempts to maximize the mutual information between X and Y, I(X; Y). We observe that order embedding need not necessarily impact information, as the ordering of outputs imposes no constraint on the volume that they define. One of the most influential approaches to topographic map formation, Kohonen's self-organizing map (SOM) [1], has its origins in another of Kohonen's algorithms, Learning Vector Quantization (LVQ) [2]. LVQ has density estimation as its stated goal: the number of input vectors assigned to each output vector should be equalized. It aims to accomplish this as an approximation of k-means clustering, which minimizes the mean squared error between input vectors and output prototypes. As such, neither algorithm guarantees the desired equalization
between inputs and output vectors for nonuniform input spaces. Modifications to the traditional LVQ algorithm improve density estimation, but require ongoing adjustments of still more parameters that govern the rates at which inputs are assigned to each output vector [3]. For SOM and its many variations, these modifications are complicated by a neighborhood function which constrains the assignment of inputs to output vectors and is shrunk during learning to create a smooth mapping (for example, see [4]). While LVQ and SOM support overcomplete bases (wherein the number of outputs exceeds the number of inputs), the problem of density estimation is not addressed by adding more output vectors. Information maximization is a well-characterized optimization that results in density estimation and equalization of output probabilities. Originally expressed for the case of multi-variate Gaussian inputs in the presence of noise [5], a significant extension of this solution accommodates input spaces of arbitrary shape for the noiseless condition [6]. The original derivations of infomax used critically sampled bases (wherein the number of outputs equals the number of inputs), either for notational simplicity [5] or by necessity [6]. While subsequent derivations of infomax [7] and related sparse-coding and probabilistic modeling strategies [8,9] incorporate overcomplete bases, none do so in the context of topographic mapping. Algorithms that maximize Shannon information rates subject to a topographic constraint [10] rely on non-locally computed learning rules and do not apply to arbitrary input spaces and the noiseless condition. Finally, because one of the main operational principles of infomax is to make outputs independent by eliminating redundancy, standard topographic mapping algorithms, which necessarily create dependencies between map neighbors, should seem incompatible with infomax. 
Here we present a network capable of performing the infomax optimization over an arbitrary input space (in our case, natural images) for the noiseless condition, using either critically sampled or overcomplete bases, while simultaneously creating a topographic map with either a phase-dependent or a phase-independent order embedding. Changes to learning rates or neighborhood sizes are not required for convergence. These capabilities derive from a novel neural multigrid, configured to estimate Fourier modes of the infomax anti-redundancy learning vector using feed-forward, lateral, and feedback connections. Section 2 introduces the infomax network from which ours derives [11], which we have termed a Linsker network. In subsequent sections we present, step by step, each of the modifications to the Linsker network necessary to achieve these capabilities. Section 3 shows how to implement a multilayer Linsker network with feedback, which generates topography but fails to achieve infomax. Section 4 refines the feedback network and introduces the neural multigrid to address low frequency redundancy reduction. Section 5 shows how an overcomplete basis finally achieves the infomax optimum in a neural multigrid. Section 6 demonstrates that a modified multigrid can achieve the infomax optimum for a critically sampled basis, provided a phase-independent order embedding is specified. Finally, Section 7 discusses the experimental results in the context of a new principle of information maximization, which we term topographic infomax.
J. Kozloski et al.

2 A Three-Stage Infomax Network
A Linsker network comprises three stages. Stage one selects a vector $\tilde{x}$ from the input ensemble, $\tilde{x} \in \tilde{X}$, and computes the input vector $x = q^{-1/2}(\tilde{x} - x_0)$, where $x_0$ is an input bias vector that continually adapts with each input according to $\Delta x_0 = \beta_{x_0}[\tilde{x} - x_0]$,¹ and $q^{-1/2}$ is the pre-computed whitening matrix, with $q = \langle(\tilde{x} - x_0)(\tilde{x} - x_0)^\top\rangle$. Pre-whitening of inputs is not required for infomax, but speeds convergence. For the results shown here, $\tilde{X}$ was a set of image segments drawn at random from natural images [12]. Stage two learns the input weight matrix $C$ and computes $u \equiv Cx$, where each element $u_i$ is the linear output of a stage two unit $i$ to a corresponding stage three unit. In addition, each stage two unit computes an element of the output vector $y$, such that $y_i = \sigma(u_i)$, where $\sigma(\cdot)$ denotes a nonlinear squashing function. We used the logistic transfer function for $\sigma(\cdot)$ as in [6]: $y = 1/(1 + e^{-(u + w_0)})$, where $w_0$ is an output bias vector that continually adapts with each input according to $\Delta w_0 = \beta_{w_0}[1 - 2y]$. The ensemble of all network output vectors is then $y \in Y$, and the objective is to maximize $I(X; Y)$. Because the network is deterministic, maximizing $I(X; Y)$ is equivalent to maximizing $H(Y)$, since $I(X; Y) = H(Y) - H(Y|X)$ and, for the noiseless condition, $H(Y|X) = 0$. Let us now consider stage three, whose outputs comprise a learning vector with the same dimension as $u$. When applied in stage two to learning $C$, this learning vector yields the anti-redundancy term $(C^\top)^{-1}$ of Bell and Sejnowski (i.e., the inverse of the transpose of the input weight matrix) [6,11]. Consider the chain rule $H(Y) = \sum_{i=1}^{n} H(\{y_i\}) - \sum_{i=1}^{n} I(\{y_i\}; \{y_{i-1}\}, \ldots, \{y_1\})$. Then, maximizing $H(Y)$ is achieved by constraining the entropy of the elements $y_i$ (as expressed by the first sum) while minimizing their redundancy (as expressed by the second).
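The first two stages just described can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the input ensemble is a random stand-in for the natural-image segments, and the array sizes are arbitrary; only the learning-rate values are taken from the paper's footnote.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the input ensemble (the paper uses natural-image
# segments); 1000 samples of dimension 16, both sizes arbitrary.
X_raw = rng.normal(size=(1000, 16))

# Pre-computed whitening matrix q^(-1/2), with q the input covariance.
q = np.cov(X_raw, rowvar=False) + 1e-6 * np.eye(16)
evals, evecs = np.linalg.eigh(q)
q_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

n_out = 16
C = rng.normal(scale=0.1, size=(n_out, 16))  # stage-two input weight matrix
x0 = np.zeros(16)                            # input bias vector
w0 = np.zeros(n_out)                         # output bias vector
beta_x0, beta_w0 = 0.0001, 0.0021            # learning rates from the paper

for x_tilde in X_raw:
    # Stage one: subtract the adaptive input bias and whiten.
    x = q_inv_sqrt @ (x_tilde - x0)
    x0 += beta_x0 * (x_tilde - x0)

    # Stage two: linear outputs u and logistically squashed outputs y.
    u = C @ x
    y = 1.0 / (1.0 + np.exp(-(u + w0)))
    w0 += beta_w0 * (1.0 - 2.0 * y)          # drives each mean output toward 1/2
```

The logistic squashing bounds each marginal output entropy, while the $w_0$ update centers each output at 1/2, as required before the stage-three anti-redundancy term can do its work.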
In a Linsker network, the constraint imposed by $\sigma(\cdot)$ and the learning vector produced by stage three perform these functions, and are sufficient to guide $C$ learning to maximize $H(Y)$. Thus, the learning vector produced by stage three is responsible for redundancy minimization. We refer to it (and its subsequent derivatives) as the anti-redundancy learning vector, hereafter denoted $\psi$. In a Linsker network, local learning rules and a linear activation function iterating over a single set of lateral connections compute $\psi_i$ for each unit $i$ of a single stage three layer. Units in this layer are connected by the weight matrix $\tilde{Q}$, whose elements undergo Hebbian learning such that $\tilde{Q} \to Q \equiv \langle uu^\top \rangle$. For a given input presentation, feed-forward and lateral connections modify elements of an auxiliary vector $v$ at each iteration $t$ according to $v_t = v_{t-1} + u - \alpha \tilde{Q} v_{t-1}$. To learn $\tilde{Q}$ locally, we set $v_0 = u$ and use the learning rule $\Delta \tilde{Q} = \beta_Q [v_0 v_0^\top - \tilde{Q}]$. Regardless of initial $v$, and assuming $\tilde{Q} = Q$ and the scalar $\alpha$ is chosen so that $v$ converges, Jacobi iteration yields $\alpha v_\infty = Q^{-1} u$. The constraint $0 < \alpha < 2/Q^+$ must be satisfied for $v$ to converge, where $Q^+$ is the largest eigenvalue of $Q$ [11]. In Linsker's network, $\alpha$ is computed by a heuristic [5]. We devised a dynamic, local computation of
¹ Note that all learning rates in the network are constant and denoted by $\beta$ with a subscript. For this and subsequent learning rates, we used $\beta_{x_0} = 0.0001$, $\beta_{w_0} = 0.0021$, $\beta_Q = 0.0007$, and $\beta_C = 0.0021$.
$Q^+$ based on power iteration, from which $\alpha$ can be computed precisely. Let $e$ represent an activity vector propagated through the lateral network $\tilde{Q}$ in the absence of the normal stage three forcing term, $v_{t-1} + u$, such that $e_t = -\alpha \tilde{Q} e_{t-1}$. Precalculating $\alpha = 1/\|e_t\|$ for each $t$ ensures $\|e\| \to Q^+$ and $\alpha \to 1/Q^+$, thus satisfying the convergence criterion of $v$. In practice, a finite number of iterations are sufficient to approximate $\alpha v_\infty$,² and therefore anti-redundancy learning for a given input weight $C_{ij}$ can depend on the locally computed element of the learning vector, $\psi_i = \alpha v_i$, and the local input to stage two, $x_j$. The final infomax learning rule for the network is then $\Delta C = \beta_C [(\psi + 1 - 2y) x^\top]$ [11]. Use of a standard Linsker network produces the expected infomax result reliably with a fixed number of Jacobi iterations and constant learning rates (Fig. 1A).
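The stage-three Jacobi iteration can be verified numerically with a short sketch. Here $Q$ is a random symmetric positive-definite stand-in for the learned lateral matrix, and $\alpha$ is computed directly from the largest eigenvalue rather than by the paper's local power iteration; the matrix size and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Symmetric positive-definite Q standing in for the learned lateral matrix.
A = rng.normal(size=(8, 8))
Q = A @ A.T + 8.0 * np.eye(8)
u = rng.normal(size=8)

# Convergence requires 0 < alpha < 2/Q+, with Q+ the largest eigenvalue of Q.
# Here alpha = 1/Q+ is computed directly rather than by local power iteration.
alpha = 1.0 / np.linalg.eigvalsh(Q).max()

v = u.copy()                     # v0 = u, as in the Hebbian learning rule
for _ in range(500):             # many iterations for a tight numerical check;
    v = v + u - alpha * (Q @ v)  # the paper reports ~4 suffice in practice

psi = alpha * v                  # anti-redundancy vector: alpha * v_inf = Q^(-1) u
```

At the fixed point, $v = v + u - \alpha Q v$ implies $\alpha Q v = u$, so $\alpha v_\infty = Q^{-1} u$, which is what the final check confirms.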
Fig. 1. A: Learned bases of Linsker network (standard infomax). This and all subsequent figures derive from training over a set of publicly available natural images (http://www.cis.hut.fi/projects/ica/data/); B: PCA of outputs u (eigenvalues of Q shown as a continuous function). Curves labeled a and b result from A and C, and show maximization of the output volume (successful infomax). Curves labeled c, d, and e result from multigrid infomax over low to high frequency modes, infomax from feedback, and multigrid infomax over low frequency modes only, and show failed infomax. C: Learned bases of overcomplete, topographic infomax.
3 Infomax from Feedback
Jacobi iteration in stage three of a Linsker network aims to solve the equation $\alpha Q v = u$. We observe that certain linear transformations of this equation likewise yield the infomax anti-redundancy term. For example, stage three might include a second layer, or grid (denoted $h_1$), in which a fixed, symmetric, full-rank weight
² For most problems, we found 4 Jacobi iterations to be sufficient.
matrix $S$ linearly transforms $u$, yielding a new auxiliary vector $u^{h_1} = Su = SCx$. As in a Linsker network, units in $h_1$ are connected by a lateral weight matrix, $\tilde{Q}^{h_1}$, that undergoes Hebbian learning such that $\tilde{Q}^{h_1} \to Q^{h_1} \equiv \langle u^{h_1} u^{h_1 \top} \rangle = SQS^\top$. Jacobi iteration in $h_1$ proceeds similarly as above, $v_t^{h_1} = v_{t-1}^{h_1} + u^{h_1} - \alpha^{h_1} \tilde{Q}^{h_1} v_{t-1}^{h_1}$. To recover the infomax anti-redundancy term $(C^\top)^{-1}$, a derivation similar to that of Linsker yields: $I = (Q^{h_1})^{-1} S Q S^\top$, $(S^\top)^{-1} = (Q^{h_1})^{-1} S C \langle xx^\top \rangle C^\top$, $(S^\top)^{-1} (C^\top)^{-1} = (Q^{h_1})^{-1} S C \langle xx^\top \rangle$, and $(C^\top)^{-1} = S^\top (Q^{h_1})^{-1} S C \langle xx^\top \rangle$. Hence, $(C^\top)^{-1} = \langle S^\top \alpha^{h_1} v_\infty^{h_1} x^\top \rangle$, and anti-redundancy learning for a given input weight now depends on an element of the feedback learning vector $\psi = S^\top \alpha^{h_1} v^{h_1}$ computed in stage three's input layer and the corresponding local input to stage two. Theoretically, the proposed feedback network is equivalent to a Linsker network, and should therefore not change the infomax optimization it performs. In practice, however, the choice of $S$ can influence the optimization dramatically, since Jacobi iteration is used to estimate $\alpha^{h_1} v_\infty^{h_1}$. If, for example, $S$ represents a low-pass convolution filter, and thus low frequency modes are emphasized in both $u^{h_1}$ and $Q^{h_1}$,³ Jacobi iteration in $h_1$ can fail to provide a solution that is accurate enough for infomax to succeed, given some fixed number of Jacobi iterations, since low frequency modes of the solution are notoriously slow to converge [13]. In fact, we observed a failure of this network to achieve infomax (Fig. 1B[d]). However, we consistently observed a stable topographic ordering of the outputs in the 2D coordinate system of the network after each failed infomax optimization, suggesting that errors derived from incomplete convergence of low frequency Fourier modes in $\psi$ are sufficient to generate a topographic map. Next we set out to eliminate these errors.
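The chain of identities above can be checked numerically. This sketch assumes square, invertible $S$ and $C$ and whitened inputs, so that $\langle xx^\top \rangle = I$ and $Q = CC^\top$; the matrix size and the diagonal shifts that guarantee invertibility are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6  # arbitrary size

# Square, invertible C and S (the diagonal shift keeps them well-conditioned);
# whitened inputs give <xx^T> = I, so Q = C <xx^T> C^T = C C^T.
C = rng.normal(size=(n, n)) + n * np.eye(n)
S = rng.normal(size=(n, n)) + n * np.eye(n)
Q = C @ C.T

Qh1 = S @ Q @ S.T  # lateral matrix of the second grid, Q^h1 = S Q S^T

# Feedback expression from the derivation: S^T (Q^h1)^(-1) S C <xx^T>.
anti_redundancy = S.T @ np.linalg.inv(Qh1) @ S @ C
```

The computed feedback expression matches Bell and Sejnowski's $(C^\top)^{-1}$ exactly, confirming that the second layer leaves the target of the optimization unchanged in the exact-solution limit.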
4 A Neural Multigrid
The second layer of the feedback infomax network described in the preceding section aims to solve the equation $\alpha^{h_1} Q^{h_1} v^{h_1} = u^{h_1}$ by means of Jacobi iteration. It fails to do so accurately because of the dominant low frequency modes of the problem, rendering $\psi$ unsuitable for anti-redundancy learning that depends on accurate solutions in these modes. Hence infomax fails. We explored using multigrid methods to better estimate this solution, since multigrid methods in general speed the convergence and accuracy of Jacobi iteration by decomposing it into a series of iterative computations performed sequentially over a set of grids, each solving different Fourier modes of the original problem [13]. Multigrid casts the problem into a series of smaller and smaller grids, such that low frequency modes in the original problem can converge quickly and accurately in the form of high frequency modes in a restricted problem. The multigrid method we implement here in a neural network is nested iteration, though the network design can easily accommodate other multigrid methods such as "V-cycle" and "Full Multigrid" [13]. Similar to the feedback network described in the previous section, Jacobi iteration in $h_1$ aims to solve $\alpha^{h_1} Q^{h_1} v^{h_1} = u^{h_1}$, but now only after
³ We used a low-pass, 2D Gaussian kernel with SD = 0.85.
$v^{h_1}$ is initialized with the result of a preceding series of nested iterative computations over a set of smaller grids, wherein lower frequency modes of the solution have already been computed. The set of grids, $h_k$, is enumerated by the set of wavelengths of the Fourier modes of the problem that each solves, for example $k \in \{1, 2, 4, \ldots\}$. The iterative computation performed in each grid is similar to that in a Linsker network, now denoted $v_t^{h_n} = v_{t-1}^{h_n} + u^{h_n} - \alpha^{h_n} \tilde{Q}^{h_n} v_{t-1}^{h_n}$, $n \in k$. As in traditional multigrid methods, we chose powers of two for the neural multigrid wavelengths, such that if the Linsker network and grid $h_1$ are $11 \times 11$ layers, grid $h_2$ is a $5 \times 5$ layer, and $h_4$ a $2 \times 2$ layer. Feed-forward connections initially propagate and restrict each $u^{h_n}$ to each lower dimensional grid $h_{2n}$, such that $u^{h_{2n}} = S^{h_n} u^{h_n} \; \forall n \in k$, where $S^{h_n}$ denotes the restriction operator (in our neural multigrid, a rectangular feed-forward weight matrix) from grid $h_n$ to $h_{2n}$. As in traditional multigrid methods, restriction here applies a stencil (in our neural multigrid, a neighborhood function), such that the restriction from grid $h_n$ to $h_{2n}$ is $u_{x',y'}^{h_{2n}} = \frac{1}{16}[4u_{x,y}^{h_n} + u_{x+1,y+1}^{h_n} + u_{x+1,y-1}^{h_n} + u_{x-1,y+1}^{h_n} + u_{x-1,y-1}^{h_n} + 2(u_{x,y+1}^{h_n} + u_{x,y-1}^{h_n} + u_{x+1,y}^{h_n} + u_{x-1,y}^{h_n})]$, where $(x, y)$ are coordinates in $h_n$, and $(x', y')$ are the corresponding transformed coordinates in $h_{2n}$. Jacobi iteration proceeds within coarse grids first, followed by finer grids. Feedback propagates and smoothly interpolates the result of a coarse grid iteration, $\alpha^{h_{2n}} v^{h_{2n}}$, to the next finer grid, where it replaces $v^{h_n}$ prior to Jacobi iteration within the finer grid: $v^{h_n} \leftarrow S^{h_n \top} \alpha^{h_{2n}} v^{h_{2n}}$. In this way, higher frequency mode iteration refines the solution provided by lower frequency mode iteration.
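The full-weighting stencil can be sketched as a small function. The offsets, zero-padded borders, and coarse-grid-size formula below are illustrative choices (the paper does not specify its boundary handling); they reproduce the paper's 11 → 5 → 2 grid sizes.

```python
import numpy as np

def restrict(u, offsets=(1, 1)):
    """Full-weighting restriction of a square fine grid to a coarser grid.

    Coarse point (x', y') sits over fine point (x, y) = (2x' + ox, 2y' + oy).
    Offsets, zero padding, and the coarse-size formula are assumptions.
    """
    ox, oy = offsets
    p = np.pad(u, 1)                      # zero padding at the borders
    n_coarse = (u.shape[0] - ox + 1) // 2
    out = np.zeros((n_coarse, n_coarse))
    for xc in range(n_coarse):
        for yc in range(n_coarse):
            x, y = 2 * xc + ox + 1, 2 * yc + oy + 1   # +1 accounts for the pad
            out[xc, yc] = (4 * p[x, y]
                           + p[x + 1, y + 1] + p[x + 1, y - 1]
                           + p[x - 1, y + 1] + p[x - 1, y - 1]
                           + 2 * (p[x, y + 1] + p[x, y - 1]
                                  + p[x + 1, y] + p[x - 1, y])) / 16.0
    return out
```

Because the stencil weights sum to one, a constant field restricted from an 11 × 11 grid yields the same constant on the resulting 5 × 5 grid.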
The process continues until $\alpha^{h_1} v^{h_1}$ is computed by iteration at the second layer of stage three, and finally $\psi$ is derived from feedback to stage three's input layer as described in the previous section. While restriction and interpolation of activity vectors in the neural multigrid are easily accomplished through feed-forward and feedback connections described by $S^{h_n}$ and $S^{h_n \top}$, how is $Q^{h_n}$ computed for each grid using only local learning rules? Consider the problem of restricting to a coarser grid the matrix $Q^{h_n}$, defined for all multigrid methods as $Q^{h_{2n}} = S^{h_n} Q^{h_n} S^{h_n \top}$ [13]. By substitution, $Q^{h_{2n}} = S^{h_n} \langle u^{h_n} u^{h_n \top} \rangle S^{h_n \top} = \langle S^{h_n} u^{h_n} u^{h_n \top} S^{h_n \top} \rangle = \langle u^{h_{2n}} u^{h_{2n} \top} \rangle$. Hence, any restricted matrix $Q^{h_{2n}}$ can be computed by Hebbian learning over a lateral weight matrix $\tilde{Q}^{h_{2n}}$ such that $\tilde{Q}^{h_{2n}} \to Q^{h_{2n}} \equiv \langle u^{h_{2n}} u^{h_{2n} \top} \rangle$.⁴ In the preceding experiments with feedback infomax networks, the failure to compute low frequency modes of the solution to $\alpha^{h_1} Q^{h_1} v^{h_1} = u^{h_1}$ was responsible for a topographic ordering (since infomax learning was unable to eliminate low frequency spatial redundancy in the map). Next, we hypothesized that limiting our solution to these same low frequency modes could have a comparable topographic influence while providing a means to complete the infomax optimization. Minimizing redundancy in low frequency Fourier modes is equivalent to minimizing redundancy between large spatial regions of the map. In the absence of competing anti-redundancy effects from other Fourier modes, redundancy within these large regions should therefore increase, as units learn
⁴ Any optimization over a matrix A requiring a solution to Ax = b can be implemented using a neural multigrid if, and only if, A is strictly a function of b.
based on a smooth, interpolated anti-redundancy learning vector derived from coarse grids only. In fact, the topographic effect was pronounced. Again, however, the network failed to achieve infomax (Fig. 1B[e]), now because high frequency modes of the solution were neglected by our partial multigrid. We therefore devised a network that gradually incorporates the iterative computations of each grid of the neural multigrid, from coarse grids to fine, into the computation of $\psi$. Initially, iteration proceeded only at the two coarsest grids, with $m$ inputs presented to the network in this configuration.⁵ Infomax learning proceeded with feedback vectors computed by interpolating the solutions from the coarsest grid through each multigrid layer, then feeding the result back to stage three's input layer, where finally $\psi$ was computed. Subsequent layers in the neural multigrid were activated one at a time at intervals of $m$ input presentations. The number of input presentations for which a grid $h_n$ had been active was $p^{h_n}$, and only when $p^{h_n} \ge m \; \forall n \in k$ were all Fourier modes of the solution to $\alpha^{h_1} Q^{h_1} v^{h_1} = u^{h_1}$ present in $\psi$, and the multigrid complete. Prior to this, $\psi$ was computed in the partial multigrid as a linear combination of the feedback vectors from the two finest active grids, $h_a$ and $h_b$, fed back through all intervening layers: $\psi = S^\top \prod_{n=1}^{a} S^{h_n \top} [\beta^{h_a} \alpha^{h_a} v^{h_a} + (1 - \beta^{h_a}) S^{h_b \top} \alpha^{h_b} v^{h_b}]$, where $\beta^{h_a} = p^{h_a}/m \in [0, 1]$. Gradual incorporation of each grid of the neural multigrid resulted in a topographic map, but failed to achieve infomax (Fig. 1B[c]), suggesting a more radical approach was required to recover an infomax solution.
5 Overcomplete Topographic Infomax
To recover an infomax solution within a topographic map, we first devised a network that uses a neural multigrid to compute an anti-redundancy learning vector based on pooled outputs from nine separate critically sampled bases (each initially embedded in an $11 \times 11$ grid as above). Each basis therefore represented a separate infomax problem, and each was then re-embedded within a single "overcomplete" 2D grid (now $33 \times 33$, denoted $h_{oc}$) as follows: $x \leftarrow r + 3x$, $y \leftarrow s + 3y$, where $(r, s)$ represents a unique pair of offsets applied to each $11 \times 11$ grid's corresponding coordinates in order to embed it within the $33 \times 33$ grid, $r \in [0, 2]$, $s \in [0, 2]$. The pooling of elements from each basis was achieved by restricting the output of $h_{oc}$ to the neural multigrid's first layer, such that $u^{h_1} = S_{oc} u^{h_{oc}}$, where $h_1$ is an $11 \times 11$ grid, and $S_{oc}$ is a rectangular feed-forward weight matrix.⁶ The computation of $\psi$ proceeded as above, with grids incorporated into the multigrid gradually, from coarse to fine. After the multigrid was completely active, a set of nine independent lateral networks within the overcomplete grid became active. Each lateral network included only connections between those
⁵ We used m = 2,000,000.
⁶ We scaled the previous low-pass, 2D Gaussian kernel proportionally to SD = 2.55 in order to create the new restriction matrix, $S_{oc}$.
units comprising a single critically sampled basis, and the overcomplete grid's lateral network thus comprised overlapping, periodic lateral connections. The results of iteration over each of these independent Linsker networks represented nine separate solutions to nine fully determined problems $Qv = u$. These solutions were gradually combined with the feedback learning vector from the multigrid as described in the previous section, and yielded an infomax solution wherein each critically sampled basis was co-embedded in a single topographic map (Fig. 1B[b], C).
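The re-embedding rule can be checked in a few lines: the nine offset copies of the 11 × 11 grid interleave to tile the 33 × 33 overcomplete grid exactly once, so the bases overlap topographically without colliding at any grid site.

```python
# Each of the nine 11x11 bases is re-embedded into the single 33x33
# overcomplete grid via x' = r + 3x, y' = s + 3y, with offsets r, s in {0, 1, 2}.
covered = set()
for r in range(3):
    for s in range(3):
        for x in range(11):
            for y in range(11):
                covered.add((r + 3 * x, s + 3 * y))
```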
6 Absolute Redundancy Reduction
Next, we aimed to generate a phase-independent order embedding in our topographic map using the neural multigrid. Drawing upon the work of Hyvärinen et al. [14], we reasoned that certain nonlinear transformations of the inputs to the neural multigrid might produce a topographic influence independent of phase selectivity in output units, as has been observed in primary visual cortex. Unlike Hyvärinen et al., we maintained the goal of maximizing mutual information between the input layer and the first layer of our multilayer network. The nonlinear transformation applied was the absolute value, such that $u^{h_1} = S|u|$. Given this transformation, feedback from the multigrid was no longer consistent with anti-redundancy learning at the input layer of stage three. We reasoned that the input weights to unit $i$, $C_{ij} \; \forall j$, should be adjusted to eliminate redundancy in multigrid units, which derives from pooled absolute activation levels over $i$. $C_{ij}$ must then be modified in a manner dependent on each unit $i$'s contribution to this redundancy, i.e., its absolute activation level. Hence, at each unit $i$, the multigrid feedback vector was multiplied by $\omega_i$ prior to its incorporation into $\psi$, where $\omega_i$ takes the value 1 if $u_i \ge 0$ and $-1$ otherwise. The computation of $\psi$ was modified during learning as above, with grids incorporated into the multigrid gradually, from coarse to fine. In the end, the network employed a linear combination of the multigrid feedback vector and the infomax anti-redundancy vector $\alpha v$, such that locally computed elements of $\psi$ were determined by $\psi_i = \beta \alpha v_i + (1 - \beta) \omega_i \sum_k S_{ik} \alpha^{h_1} v_k^{h_1}$.⁷ The results show that a phase-independent topographic map does result from nonlinearly mapping the problem $\alpha Q v = u$ onto a gradually expanded neural multigrid (Fig. 2A), and that infomax was readily achieved in these experiments (Fig. 2B[b]).
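The sign-gated combination just described can be sketched as follows. Every quantity here is a random placeholder for a value the network would compute; only the weighting $\beta = 0.25$ and the formula for $\psi$ come from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 121                              # units in stage three's input layer (11 x 11)

# Random placeholders for values the network computes during learning.
u = rng.normal(size=n)               # stage-two linear outputs
v = rng.normal(size=n)               # local anti-redundancy iterate
v_h1 = rng.normal(size=n)            # multigrid solution at grid h1
S = rng.normal(size=(n, n)) / n      # feedback weights S_ik (illustrative)
alpha, alpha_h1 = 0.1, 0.1           # step sizes (illustrative values)
beta = 0.25                          # weighting used in the paper

omega = np.where(u >= 0.0, 1.0, -1.0)  # sign of each unit's activation
psi = beta * alpha * v + (1.0 - beta) * omega * (S @ (alpha_h1 * v_h1))
```

Gating the multigrid feedback by $\omega_i$ makes each unit's anti-redundancy contribution depend only on the magnitude of its activation, which is what renders the resulting order embedding phase-independent.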
Interestingly, when multigrid inputs were transformed in this way, the constraints imposed by initial learning, biased topographically by the partial multigrid, were not severe enough to prevent infomax from emerging within the map, even for a critically sampled basis. Finally, we applied the nonlinear mapping to an overcomplete basis as above and found that phase-independent, overcomplete, topographic infomax was readily achieved (Fig. 2B[b], C).
⁷ We limited $\beta$ to 0.25 in order to maintain a smooth mapping, given that infomax emerged readily at this weighting and did not require $\beta \to 1$.
Fig. 2. A: Learned bases of phase-independent, topographic infomax; B: PCA of outputs u (eigenvalues of Q shown as a continuous function). Curves labeled a and b result from the Linsker network (standard infomax) and from phase-independent topographic infomax (A and C overlapping), and show maximization of the output volume (successful infomax). C: Learned bases of phase-independent, overcomplete, topographic infomax.
7 Discussion
The infomax algorithm implemented here in a modified multi-layered Linsker network maximizes I(X; Y), the mutual information between network inputs and outputs. Rather than doing so directly, however, the novel network configuration first uses a partial neural multigrid to induce spatial correlations in its output, then a complete multigrid to perform the infomax optimization. The manner in which the multigrid guides infomax to a specific topographically organized optimum constitutes a new principle of information maximization, which we refer to as topographic infomax. The problem solved first by the partial neural multigrid is that of eliminating redundancy in low frequency Fourier modes of the output. While solving this problem for these modes and no others, the network operates as if these modes contain all output redundancy, which they clearly do not. Redundancy between large regions of the output map is minimized, even though doing so increases redundancy in higher frequency modes (i.e., between individual units). In the completed multigrid, iteration in coarse grids precedes iteration in finer grids for any input, and thus higher frequency redundancy reduction is heavily constrained by any previous minimization of redundancy in lower frequency modes. For standard infomax, redundancy minimization can be achieved by redundancy reduction in any or all modes simultaneously. Topographic infomax instead aims to eliminate low frequency redundancy first and thus imposes a topographic order on the output map.
The constraints imposed by low frequency redundancy reduction can prevent infomax from emerging in the completed multigrid. Two approaches to relaxing these constraints and recovering the infomax solution have yielded topographic infomax: first, the use of an overcomplete basis, and second, the use of a phase-independent order embedding. We anticipate that many parallels between the network configuration employed here and those observed in biological structures such as primate primary visual cortex remain to be drawn, and that topographic infomax represents a mechanism by which these structures, constrained developmentally by local network connection topologies, can achieve quantities of mutual information between inputs and outputs comparable to what is achieved in more theoretical, fully-connected networks of equal dimension.
Acknowledgments. We thank Ralph Linsker and John Wagner for many helpful discussions.
References
1. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin (1997)
2. Kohonen, T.: Learning Vector Quantization. Neural Networks 1, Supplement 1 (1988) 303
3. Desieno, D.: Adding a Conscience to Competitive Learning. Proc. Int. Conf. on Neural Networks I (1988) 117-124
4. Bednar, J.A., Kelkar, A., Miikkulainen, R.: Scaling Self-Organizing Maps to Model Large Cortical Networks. Neuroinformatics 2 (2004) 275-302
5. Linsker, R.: Local Synaptic Learning Rules Suffice to Maximise Mutual Information in a Linear Network. Neural Computation 4 (1992) 691-702
6. Bell, A.J., Sejnowski, T.J.: An Information-Maximisation Approach to Blind Separation and Blind Deconvolution. Neural Computation 7 (1995) 1129-1159
7. Shriki, O., Sompolinsky, H., Lee, D.D.: An Information Maximization Approach to Overcomplete and Recurrent Representations. 12th Conference on Neural Information Processing Systems (2000) 87-93
8. Olshausen, B.A., Field, D.J.: Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1. Vision Research 37 (1996) 3311-3325
9. Lewicki, M.S., Sejnowski, T.J.: Learning Overcomplete Representations. Neural Computation 12 (2000) 337-365
10. Linsker, R.: How to Generate Ordered Maps by Maximizing the Mutual Information between Input and Output Signals. Neural Computation 1 (1989) 402-411
11. Linsker, R.: A Local Learning Rule that Enables Information Maximization for Arbitrary Input Distributions. Neural Computation 9 (1997) 1661-1665
12. Hyvärinen, A., Hoyer, P.O.: A Two-Layer Sparse Coding Model Learns Simple and Complex Cell Receptive Fields and Topography from Natural Images. Vision Research 41 (2001) 2413-2423
13. Briggs, W.L., Henson, V.E., McCormick, S.F.: A Multigrid Tutorial. Society for Industrial and Applied Mathematics, Philadelphia, PA
14. Hyvärinen, A., Hoyer, P.O., Inki, M.: Topographic Independent Component Analysis. Neural Computation 13 (2001) 1527-1528
Genetic Granular Neural Networks

Yan-Qing Zhang, Bo Jin
Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994, USA
[email protected], [email protected]

Yuchun Tang
Secure Computing Corporation, Alpharetta, GA 30022, USA
[email protected]

Abstract. To perform interval-valued granular reasoning efficiently and to optimize interval membership functions based on training data effectively, a new Genetic Granular Neural Network (GGNN) is designed. Simulation results have shown that the GGNN is able to extract useful fuzzy knowledge effectively and efficiently from training data and to achieve high training accuracy.
1 Introduction
Recently, granular computing techniques based on computational intelligence techniques and statistical methods have found various applications [2-9][13-15]. Type-2 fuzzy systems and interval-valued fuzzy systems have been investigated by extending type-1 fuzzy systems [1][10-12][16]. It is hard to define and optimize type-2 or interval-valued fuzzy membership functions subjectively and objectively. In other words, the first challenging problem is how to design an effective learning algorithm that can optimize type-2 or interval-valued fuzzy membership functions based on training data. Usually, type-2 fuzzy systems and interval-valued fuzzy systems can handle fuzziness better than type-1 fuzzy systems in terms of reliability and robustness. But type-2 fuzzy reasoning and interval-valued fuzzy reasoning take much longer than type-1 fuzzy reasoning. So the second challenging problem is how to speed up type-2 fuzzy reasoning and interval-valued fuzzy reasoning. In summary, the two long-term challenging problems are related to the effectiveness and efficiency of granular fuzzy systems, respectively. To solve the first (effectiveness) problem, learning methods are used to optimize type-2 or interval-valued fuzzy membership functions based on given training data. Liang and Mendel present a method to compute the input and antecedent operations for interval type-2 FLSs, introduce the upper and lower membership functions, and transform an interval type-2 fuzzy logic system into two type-1 fuzzy logic systems for membership function parameter adjustments [10]. To handle different opinions from different people, Qiu, Zhang and Zhao use a statistical linear regression method to construct low and high fuzzy membership functions for an interval fuzzy system [16]. Here, a new interval reasoning method using granular sets is designed to enable fast granular reasoning.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 510–515, 2007.
© Springer-Verlag Berlin Heidelberg 2007
2 Granular Neural Networks
The n-input-1-output GNN with m granular IF-THEN rules uses granular sets such that

IF $x_1$ is $A_1^k$ and ... and $x_n$ is $A_n^k$ THEN $y$ is $B^k$,   (1)

where $x_i$ and $y$ are input and output granular linguistic variables, respectively, and the granular linguistic values $A_i^k$ and $B^k$ are defined as follows:

$A_i^k = \int_R [\mu_{A_i^k}(x_i), \bar{\mu}_{A_i^k}(x_i)]/x_i$,   (2)

$\mu_{A_i^k}(x_i) = \exp[-((x_i - a_i^k)/\sigma_i^k)^2]$,   (3)

$\bar{\mu}_{A_i^k}(x_i) = \exp[-((x_i - a_i^k)/(\sigma_i^k + \xi_i^k))^2]$,   (4)

$B^k = \int_R [\mu_{B^k}(y), \bar{\mu}_{B^k}(y)]/y$,   (5)

$\mu_{B^k}(y) = \exp[-((y - b^k)/\eta^k)^2]$,   (6)

$\bar{\mu}_{B^k}(y) = \exp[-((y - b^k)/(\eta^k + \nu^k))^2]$,   (7)
where $a_i^k$ and $b^k$ are the centers of the membership functions of $x_i$ and $y$, respectively; $\sigma_i^k$ and $\sigma_i^k + \xi_i^k$ (with $\xi_i^k > 0$) are the widths of the lower-bound and upper-bound membership functions of $x_i$, respectively; and $\eta^k$ and $\eta^k + \nu^k$ (with $\nu^k > 0$) are the widths of the membership functions of $y$, for $i = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, m$. The functions of the granular neurons in different layers are described layer by layer as follows:

Layer 1: Input Layer. Input neurons $I_i$ on layer 1 have simple mapping functions

$O_i = x_i$,   (8)

where $i = 1, 2, \ldots, n$.

Layer 2: Compensation and Linear Combination Layer. In this layer, there are two types of granular neurons: (1) lower-bound and upper-bound compensatory neurons, denoted by $C^k$ and $\bar{C}^k$, respectively, and (2) lower-bound and upper-bound linear combination neurons, denoted by $L^k$ and $\bar{L}^k$, respectively, for $k = 1, 2, \ldots, m$. Compensatory neurons on layer 2 have compensatory mapping functions

$O_{C^k} = \big[\prod_{i=1}^{n} \mu_{A_i^k}(x_i)\big]^{1 - \gamma_k + \gamma_k/n}$,   (9)

$O_{\bar{C}^k} = \big[\prod_{i=1}^{n} \bar{\mu}_{A_i^k}(x_i)\big]^{1 - \gamma_k + \gamma_k/n}$,   (10)
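The lower/upper membership pair of eqs. (3)-(4) and the compensatory mapping of eqs. (9)-(10) can be sketched as small functions; the parameter values below are illustrative, not from the paper.

```python
import numpy as np

def mu_lower(x, a, sigma):
    # Lower-bound Gaussian membership, eq. (3).
    return np.exp(-(((x - a) / sigma) ** 2))

def mu_upper(x, a, sigma, xi):
    # Upper-bound membership, eq. (4): wider spread sigma + xi (xi > 0).
    return np.exp(-(((x - a) / (sigma + xi)) ** 2))

def compensatory(mus, gamma):
    # Compensatory mapping, eqs. (9)-(10): product raised to 1 - gamma + gamma/n.
    n = len(mus)
    return np.prod(mus) ** (1.0 - gamma + gamma / n)

# Illustrative parameter values for a single rule with n = 3 inputs.
x = np.array([0.3, -0.8, 1.1])
a = np.zeros(3)
sigma = np.ones(3)
xi = 0.5 * np.ones(3)

lo = mu_lower(x, a, sigma)
hi = mu_upper(x, a, sigma, xi)
```

Because the upper-bound membership uses the wider spread $\sigma_i^k + \xi_i^k$, it dominates the lower-bound membership pointwise, which is what makes the pair a valid interval.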
where $\gamma_k$ stands for the compensatory degree. Linear combination neurons on layer 2 have linear mapping functions

$O_{L^k} = b^k + \frac{\eta^k}{n} \sum_{i=1}^{n} \psi_i^k \frac{x_i - a_i^k}{\sigma_i^k}$,   (11)

$O_{\bar{L}^k} = b^k + \frac{\eta^k + \nu^k}{n} \sum_{i=1}^{n} \bar{\psi}_i^k \frac{x_i - a_i^k}{\sigma_i^k + \xi_i^k}$.   (12)

Layer 3: Normal Granular Reasoning Layer. Lower-bound and upper-bound granular reasoning neurons, denoted by $R^k$ and $\bar{R}^k$, respectively, on layer 3 have product mapping functions

$O_{R^k} = O_{C^k} O_{L^k}$,   (13)

$O_{\bar{R}^k} = O_{\bar{C}^k} O_{\bar{L}^k}$,   (14)
Layer 4: Interval Summation Layer. The lower-bound and upper-bound compensatory summation neurons, denoted by $CS$ and $\bar{CS}$, respectively, have mapping functions

$O_{CS} = \sum_{k=1}^{m} O_{C^k}$,   (15)

$O_{\bar{CS}} = \sum_{k=1}^{m} O_{\bar{C}^k}$.   (16)

The lower-bound and upper-bound granular reasoning summation neurons, denoted by $FRS$ and $\bar{FRS}$, respectively, have mapping functions

$O_{FRS} = \sum_{k=1}^{m} O_{R^k}$,   (17)

$O_{\bar{FRS}} = \sum_{k=1}^{m} O_{\bar{R}^k}$.   (18)
Layer 5: Hybrid Output Layer. Finally, an output neuron $OUT$ has the average mapping function

$O_{OUT} = \left[\frac{O_{FRS}}{O_{CS}} + \frac{O_{\bar{FRS}}}{O_{\bar{CS}}}\right]/2$.   (19)
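The five layers above compose into a single forward pass, sketched here under assumptions: the sizes and all parameter values are random placeholders, and a constant compensatory degree is used as in Section 3.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 3, 4                            # inputs and rules (illustrative sizes)

a = rng.normal(size=(m, n))            # centers a_i^k
b = rng.normal(size=m)                 # centers b^k
sig = 0.5 + rng.random(size=(m, n))    # lower-bound widths sigma_i^k
xi = rng.random(size=(m, n))           # width increments xi_i^k > 0
eta = 0.5 + rng.random(size=m)         # output widths eta^k
nu = rng.random(size=m)                # output width increments nu^k > 0
psi = rng.normal(size=(m, n))          # heuristic parameters psi_i^k
psb = rng.normal(size=(m, n))          # heuristic parameters psi-bar_i^k
gamma = 0.5                            # constant compensatory degree

def gnn_output(x):
    mu_lo = np.exp(-(((x - a) / sig) ** 2))           # eq. (3)
    mu_hi = np.exp(-(((x - a) / (sig + xi)) ** 2))    # eq. (4)
    expo = 1.0 - gamma + gamma / n
    OC = np.prod(mu_lo, axis=1) ** expo               # eq. (9)
    OCb = np.prod(mu_hi, axis=1) ** expo              # eq. (10)
    OL = b + eta / n * np.sum(psi * (x - a) / sig, axis=1)                 # eq. (11)
    OLb = b + (eta + nu) / n * np.sum(psb * (x - a) / (sig + xi), axis=1)  # eq. (12)
    OR, ORb = OC * OL, OCb * OLb                      # eqs. (13)-(14)
    # Layer-4 sums (15)-(18) and the layer-5 average (19).
    return 0.5 * (OR.sum() / OC.sum() + ORb.sum() / OCb.sum())

y = gnn_output(rng.normal(size=n))
```

With the linear-combination terms zeroed, the output reduces to two normalized weighted averages of the consequent centers $b^k$, so it stays within their range.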
For clarity, the lower-bound output $\underline{f}(x_1, \ldots, x_n)$ of the hybrid output layer of the GNN is given below:

$\underline{f}(x_1, \ldots, x_n) = \dfrac{\sum_{k=1}^{m} \big(b^k + \frac{\eta^k}{n} \sum_{i=1}^{n} \psi_i^k \frac{x_i - a_i^k}{\sigma_i^k}\big)\, g}{\sum_{k=1}^{m} \big[\prod_{i=1}^{n} \mu_{A_i^k}(x_i)\big]^{1 - \gamma_k + \gamma_k/n}}$,   (20)

where

$g = \big[\prod_{i=1}^{n} \mu_{A_i^k}(x_i)\big]^{1 - \gamma_k + \gamma_k/n}$,   (21)

and

$\bar{f}(x_1, \ldots, x_n) = \dfrac{\sum_{k=1}^{m} \big(b^k + \frac{\eta^k + \nu^k}{n} \sum_{i=1}^{n} \bar{\psi}_i^k \frac{x_i - a_i^k}{\sigma_i^k + \xi_i^k}\big)\, \bar{g}}{\sum_{k=1}^{m} \big[\prod_{i=1}^{n} \bar{\mu}_{A_i^k}(x_i)\big]^{1 - \gamma_k + \gamma_k/n}}$,   (22)

where

$\bar{g} = \big[\prod_{i=1}^{n} \bar{\mu}_{A_i^k}(x_i)\big]^{1 - \gamma_k + \gamma_k/n}$,   (23)

and the heuristic parameters $\psi_i^k$ and $\bar{\psi}_i^k$ are defined by

$\psi_i^k = \begin{cases} \upsilon_i^k & \text{for } x_i \le a_i^k \\ \omega_i^k & \text{for } x_i > a_i^k \end{cases}$,   (24)

$\bar{\psi}_i^k = \begin{cases} \bar{\upsilon}_i^k & \text{for } x_i \le a_i^k \\ \bar{\omega}_i^k & \text{for } x_i > a_i^k \end{cases}$.   (25)

Finally, the output function $f(x_1, \ldots, x_n)$ of the GNN is

$f(x_1, \ldots, x_n) = \dfrac{\underline{f}(x_1, \ldots, x_n) + \bar{f}(x_1, \ldots, x_n)}{2}$.   (26)
Interestingly, the output function $f(x_1, \ldots, x_n)$ of the GNN also contains a linear combination of the $x_i$ for $i = 1, 2, \ldots, n$, since the input and output membership functions are all the same kind of Gaussian function. In particular, if the input and output membership functions are different kinds of functions, such as triangular and Gaussian functions, the output $f(x_1, \ldots, x_n)$ of the GNN may contain a nonlinear combination of the $x_i$ for $i = 1, 2, \ldots, n$.
3 Genetic Granular Learning
Suppose we are given n-dimensional input data vectors $x^p = (x_1^p, x_2^p, \ldots, x_n^p)$ and 1-dimensional output data $y^p$ for $p = 1, 2, \ldots, N$. The energy function is defined by

$E^p = \frac{1}{2}\big[f(x_1^p, \ldots, x_n^p) - y^p\big]^2$.   (27)

For simplicity, let $E$ and $f^p$ denote $E^p$ and $f(x_1^p, \ldots, x_n^p)$, respectively.
A 3-phase evolutionary interval learning algorithm with constant compensatory rate $\gamma_k = a$ ($a \in [0, 1]$, for $k = 1, 2, \ldots, m$) is described below:

Step 1: Use the type-1 learning method to optimize the initial expected point-valued parameters of the GNN.
Step 2: Use genetic algorithms to optimize the initial interval-valued parameters.
Step 3: Use the compensatory interval learning algorithm to optimize the interval-valued parameters.
Step 4: Discover granular knowledge.

Once the learning procedure has been completed, all parameters of the GNN have been adjusted and optimized. As a result, all m granular rules have been discovered from the training data. Finally, the trained GNN can generate new values for new given input data.
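Step 2 can be illustrated with a deliberately minimal sketch: a truncation-selection genetic algorithm with elitism that evolves only the upper-bound width increments $\xi$ of a toy 1-input, 2-rule interval system to reduce the summed energy (27). The toy data, fitness, mutation scheme, and the choice of which parameters to evolve are all assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy 1-input, 2-rule interval system; everything except eq. (27) is assumed.
a = np.array([-1.0, 1.0])      # fixed antecedent centers
b = np.array([-1.0, 1.0])      # fixed consequent centers
sigma = 1.0                    # fixed lower-bound width
X = rng.uniform(-2.0, 2.0, size=40)
Y = np.tanh(X)                 # toy targets

def predict(x, xi):
    mu_lo = np.exp(-(((x - a) / sigma) ** 2))
    mu_hi = np.exp(-(((x - a) / (sigma + xi)) ** 2))
    f_lo = (mu_lo * b).sum() / mu_lo.sum()
    f_hi = (mu_hi * b).sum() / mu_hi.sum()
    return 0.5 * (f_lo + f_hi)

def energy(xi):
    # Summed per-sample energy, eq. (27).
    return 0.5 * sum((predict(x, xi) - y) ** 2 for x, y in zip(X, Y))

pop = [rng.uniform(0.01, 1.0, size=2) for _ in range(12)]
e_init = min(energy(p) for p in pop)
for _ in range(30):
    pop.sort(key=energy)
    parents = pop[:4]                              # truncation selection
    children = [np.abs(p + rng.normal(scale=0.1, size=2)) + 1e-3
                for p in parents for _ in range(2)]
    pop = parents + children                       # elitism keeps the best
best = min(pop, key=energy)
```

Because the parents survive each generation, the best energy in the population is non-increasing, so the evolved widths are never worse than the initial ones.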
4
Conclusions
To perform interval-valued granular reasoning efficiently and to optimize interval membership functions effectively from training data, a GGNN is designed. In the future, more effective and more efficient hybrid granular reasoning methods and learning algorithms will be investigated for complex applications such as bioinformatics, health, Web intelligence, and security.
Genetic Granular Neural Networks
A Multi-Level Probabilistic Neural Network Ning Zong and Xia Hong School of Systems Engineering, University of Reading, RG6 6AY, UK
[email protected]

Abstract. Based on the idea of an important cluster, a new multi-level probabilistic neural network (MLPNN) is introduced. The MLPNN uses an incremental constructive approach, i.e. it grows level by level. The construction algorithm of the MLPNN is proposed such that the classification accuracy monotonically increases, ensuring that the classification accuracy of the MLPNN is higher than or equal to that of the traditional PNN. Numerical examples are included to demonstrate the effectiveness of the proposed new approach.
1
Introduction
A popular neural network for classification is the probabilistic neural network (PNN) [1]. The PNN classifies a sample by comparing a set of probability density functions (pdf) of the sample conditioned on different classes, where the probability density functions are constructed using a Parzen window [2]. Research on the PNN has concentrated on model reduction using various approaches, e.g. forward selection [3] and clustering algorithms [4,5]. The motivation of this paper is to investigate the possibility of further improving the classification accuracy of the PNN. We attempt to identify input regions with poor classification accuracy from a PNN and emphasize each such region as an important cluster. A new multi-level probabilistic neural network (MLPNN) and the associated model construction algorithm are introduced based on the important cluster. The MLPNN uses an incremental constructive approach, i.e. it grows level by level. The classification accuracy over the training data set monotonically increases, ensuring that the classification accuracy of the MLPNN is higher than or equal to that of the traditional PNN. Two numerical examples are included to demonstrate the effectiveness of the proposed new approach. It is shown that the classification accuracy of the resultant MLPNN over the test data set also monotonically increases as the model level grows for a finite number of levels.
2
Probabilistic Neural Network and Important Cluster
The structure of the probabilistic neural network (PNN) is shown in Figure 1. The input layer receives a sample x composed of d features x_1, ..., x_d. In the hidden layer, there is one hidden unit per training sample.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 516–525, 2007. © Springer-Verlag Berlin Heidelberg 2007

Fig. 1. The structure of a PNN (input layer x_1, ..., x_d; hidden units x_{11}, ..., x_{N_M M} grouped by class; output layer forming \hat{y}_1(x), ..., \hat{y}_M(x) by summation, followed by an argmax unit that outputs the class k)

The hidden unit x_{ij} corresponds to the ith (i = 1, ..., N_j) training sample in the jth class, j = 1, ..., M. The output of the hidden unit x_{ij} with respect to x is expressed as

a_{ij}(x) = \frac{1}{(2\pi)^{d/2}\sigma^d} \exp\left\{ -\frac{(x - x_{ij})^T (x - x_{ij})}{2\sigma^2} \right\}   (1)
where σ denotes the smoothing parameter. In the output layer, there are M output units, one for each class C_j, j = 1, ..., M. The jth output is formed as

\hat{y}_j(x) = \frac{1}{N_j} \sum_{i=1}^{N_j} a_{ij}(x), \quad j = 1, \ldots, M.   (2)

The output layer classifies the sample x to the class C_k which satisfies

k = \arg\max_j \{ \hat{y}_j(x) \mid j = 1, \ldots, M \}.   (3)
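Equations (1)–(3) translate directly into code. The following minimal sketch (the class name and plain-Python style are ours) implements the Parzen-window class outputs and the argmax decision:

```python
import math

class PNN:
    """Probabilistic neural network of Eqs. (1)-(3): one hidden unit per
    training sample, one output unit per class (a Parzen-window estimate
    of the class-conditional density)."""

    def __init__(self, samples, labels, sigma=1.0):
        self.sigma = sigma
        self.classes = sorted(set(labels))
        self.by_class = {c: [x for x, y in zip(samples, labels) if y == c]
                         for c in self.classes}

    def _kernel(self, x, xij):
        """Hidden-unit output a_ij(x) of Eq. (1)."""
        d = len(x)
        sq = sum((a - b) ** 2 for a, b in zip(x, xij))
        norm = (2 * math.pi) ** (d / 2) * self.sigma ** d
        return math.exp(-sq / (2 * self.sigma ** 2)) / norm

    def output(self, x, c):
        """Class output y_j(x) of Eq. (2): average over class-c hidden units."""
        pts = self.by_class[c]
        return sum(self._kernel(x, p) for p in pts) / len(pts)

    def classify(self, x):
        """Decision rule of Eq. (3): argmax over class outputs."""
        return max(self.classes, key=lambda c: self.output(x, c))

pnn = PNN([(0.0, 0.0), (0.1, 0.2), (3.0, 3.0), (3.2, 2.9)],
          [0, 0, 1, 1], sigma=0.5)
```

A query near the origin is then assigned to class 0, and one near (3, 3) to class 1, e.g. `pnn.classify((0.05, 0.1))`.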
An important cluster is formed as a cluster, or "sub-region", containing some or all of the misclassified training samples of the conventional PNN constructed on the "whole region". In order to improve the classification accuracy of the conventional PNN, it is crucial that the classification accuracy over the important clusters is improved. Hence, in this study, we attempt to correct a misclassified training sample x by computing the discriminant functions based on only a small number of neurons around it, i.e. emphasizing the contributions of the neurons closer to it. An important cluster contains a smaller number of training samples. The classification accuracy over the important cluster may be improved by (i) constructing a new PNN using only the training samples in the important cluster as its neurons and then (ii) classifying the training samples in the important cluster using the new PNN.
3
The Structure of MLPNN
A MLPNN with L levels consists of K PNNs, denoted by PNN_k, k = 1, ..., K, each of which is constructed by using the training samples in cluster G_k^{(L_k)} ⊆ G_tr as its neurons, L_k ∈ {1, ..., L}. The superscript L_k denotes the index of the level and G_tr is some region that contains all the training samples. {G_1^{(L_1)}, G_2^{(L_2)}, ..., G_K^{(L_K)}} satisfies the following conditions:

1. L_1 = 1, 2 = L_2 ≤ L_3 ≤ ... ≤ L_{K−1} ≤ L_K = L.
2. G_i^{(L_i)} ∩ G_j^{(L_j)} = ∅ for any i ≠ j if L_i = L_j, where ∅ denotes the empty set.

By defining

G^{(l)} = \bigcup_{k=1}^{K} \chi(L_k = l)\, G_k^{(L_k)}, \quad l = 1, \ldots, L,   (4)
where χ(•) denotes an indicator function whose value is 1 when • is true and 0 otherwise, the lth level of the MLPNN is referred to as the collection of the PNNs in PNN_k, k ∈ {1, ..., K}, which correspond to G^{(l)}. The model structure of a MLPNN is depicted in Figure 2.

Fig. 2. The structure of a MLPNN (a "SWITCH" routes the input x, according to the clusters G_1^{(L_1)}, G_2^{(L_2)}, ..., G_K^{(L_K)}, to one of PNN_1, PNN_2, ..., PNN_K, whose class label among k_1, ..., k_K becomes the MLPNN output k)

The "SWITCH" decides which PNN in PNN_k, k = 1, ..., K, is used to classify a sample x by calculating

I = \arg\max_k \{ L_k \mid x \in G_k^{(L_k)} \}.   (5)

The class label output of the MLPNN for the input x is thus

k_{MLPNN} = k_I,   (6)
where k_I is the class label output of PNN_I given the sample x. Therefore, the MLPNN classifies a sample using the PNN corresponding to the cluster with the maximum level among all the clusters capturing this sample. In other words, if x ∈ G^{(L)}, it is classified by one of the PNNs in the Lth, i.e. top, level of the MLPNN. If x ∈ G^{(l−1)} \ G^{(l)}, where \ denotes the set minus operator, it is classified by one of the PNNs in the (l − 1)th level of the MLPNN, l = 2, ..., L. Figure 3 illustrates the clusters of a MLPNN with 3 levels.
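Eqs. (5)–(6) amount to a simple routing rule. The sketch below shows the "SWITCH" logic; representing clusters as (level, membership-predicate) pairs and PNNs as plain callables is our illustrative choice:

```python
def mlpnn_classify(x, clusters, pnns):
    """Route a sample per Eqs. (5)-(6): among all clusters G_k^(L_k) that
    capture x, pick the PNN whose cluster has the maximum level L_k.
    'clusters' is a list of (level, contains) pairs, where contains(x) tests
    cluster membership; 'pnns' is the parallel list of classifiers."""
    capturing = [k for k, (level, contains) in enumerate(clusters) if contains(x)]
    I = max(capturing, key=lambda k: clusters[k][0])   # Eq. (5)
    return pnns[I](x)                                  # Eq. (6)

# Toy setup: the level-1 cluster covers the whole input space (as G_1^(1)
# always does), while a level-2 important cluster covers x >= 0.5.
clusters = [(1, lambda x: True), (2, lambda x: x >= 0.5)]
pnns = [lambda x: "class A", lambda x: "class B"]
```

A sample in the level-2 cluster is handled by the level-2 PNN even though the level-1 cluster also captures it, which is exactly the "maximum level wins" behaviour described above.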
Fig. 3. The clusters of a MLPNN with 3 levels (first-level cluster G_1^{(1)}; second-level clusters G_2^{(2)} and G_3^{(2)}; third-level clusters G_4^{(3)} and G_5^{(3)})
4
The Learning Algorithm of MLPNN

4.1
The Construction Procedure of the MLPNN
The MLPNN is constructed by using an incremental learning approach, i.e. a new level of PNNs is constructed aiming at improving the classification accuracy of the top level of the MLPNN and is added to the MLPNN to form a new top level. The construction procedure of the MLPNN is as follows.

1. Construct the first level (or first top level) of the MLPNN by constructing a traditional PNN based on the training samples in G_tr. Set PNN_1 as the traditional PNN and G_1^{(1)} as G_tr.
2. Apply PNN_1 over G_1^{(1)} for classification. Form s important clusters Ĝ_k^{(1)} ⊆ G_1^{(1)}, k = 1, ..., s, by clustering all the misclassified training samples using a clustering algorithm. Test PNN_1 by counting the number of misclassified training samples in Ĝ_k^{(1)} as n_etr^{(k)}, k = 1, ..., s.
3. Construct P̃NN_k, k = 1, ..., s, whose neurons are the training samples in Ĝ_k^{(1)}. Apply P̃NN_k over Ĝ_k^{(1)} for classification. Test P̃NN_k by counting the number of misclassified training samples in Ĝ_k^{(1)} as ñ_etr^{(k)}, k = 1, ..., s.
4. Compare n_etr^{(k)} and ñ_etr^{(k)}, k = 1, ..., s; if ñ_etr^{(k)} < n_etr^{(k)}, mark Ĝ_k^{(1)} as "pass", otherwise delete Ĝ_k^{(1)} and P̃NN_k. Count the number of "pass" as n_p. If n_p > 0, set s as n_p and construct the second level of the MLPNN by adding s new PNNs, i.e. G_2^{(2)} = Ĝ_1^{(1)}, ..., G_{1+s}^{(2)} = Ĝ_s^{(1)}, PNN_2 = P̃NN_1, ..., PNN_{1+s} = P̃NN_s, to the MLPNN to form a new top level. (Note that for notational simplicity, the passed Ĝ_k^{(1)} with ñ_etr^{(k)} < n_etr^{(k)} and their corresponding P̃NN_k are still denoted as Ĝ_k^{(1)} and P̃NN_k, k = 1, ..., s, respectively.) Set l as 2 and K as 1 + s, and continue. If n_p = 0, return with the derived MLPNN with 1 level.
5. For each G_k^{(l)}, k = K − s + 1, ..., K: (1) Apply PNN_k over G_k^{(l)} for classification and form an important cluster Ĝ_k^{(l)} ⊆ G_k^{(l)} by clustering all the misclassified training samples. Test PNN_k by counting the number of misclassified training samples in Ĝ_k^{(l)} as n_etr^{(k)}. (2) Construct P̃NN_k whose neurons are the training samples in Ĝ_k^{(l)} and apply P̃NN_k over Ĝ_k^{(l)} for classification. Test P̃NN_k by counting the number of misclassified training samples in Ĝ_k^{(l)} as ñ_etr^{(k)}.
6. Compare n_etr^{(k)} and ñ_etr^{(k)}, k = K − s + 1, ..., K; if ñ_etr^{(k)} < n_etr^{(k)}, mark Ĝ_k^{(l)} as "pass", otherwise set Ĝ_k^{(l)} = G_k^{(l)} and P̃NN_k = PNN_k. Count the number of "pass" as n_p. If n_p > 0, construct the (l + 1)th level of the MLPNN by adding s new PNNs, i.e. G_{K+1}^{(l+1)} = Ĝ_{K−s+1}^{(l)}, ..., G_{K+s}^{(l+1)} = Ĝ_K^{(l)}, PNN_{K+1} = P̃NN_{K−s+1}, ..., PNN_{K+s} = P̃NN_K, to the MLPNN to form a new top level; set l = l + 1 and K = K + s, and go to step 5. If n_p = 0, return with the derived MLPNN with L = l levels.

The following theorem shows that the classification accuracy of the MLPNN over the training data set monotonically increases with the number of levels.

Theorem 1: Denote the MLE of the misclassification error rate of the MLPNN with l levels as P̂_e^{(l)}; then P̂_e^{(l)} < P̂_e^{(l−1)}.

Proof: see [6]. A MLPNN with 1 level is equivalent to the traditional PNN. It is shown in [6] that the classification performance of the MLPNN is higher than or equal to that of the traditional PNN. □
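The construction procedure above can be condensed into a short sketch. The helper interfaces (a `train_pnn` factory, a `find_clusters` routine, and an `errors()` method on each classifier) are our illustrative assumptions, not the thesis's actual implementation:

```python
def build_mlpnn(train, train_pnn, find_clusters, s):
    """Sketch of construction steps 1-6. 'train_pnn(samples)' returns a
    classifier exposing errors(samples); 'find_clusters' groups the
    misclassified samples of a top-level PNN into at most s important
    clusters. A level is a list of (cluster, pnn) pairs."""
    levels = [[(train, train_pnn(train))]]          # step 1: PNN_1 on G_tr
    while True:
        passed = []
        for cluster, pnn in levels[-1]:             # steps 2-3 / 5
            for sub in find_clusters(pnn, cluster, s):
                candidate = train_pnn(sub)          # new PNN on the cluster
                if candidate.errors(sub) < pnn.errors(sub):  # steps 4 / 6
                    passed.append((sub, candidate)) # mark "pass"
        if not passed:                              # n_p = 0: stop growing
            return levels
        levels.append(passed)                       # new top level

# Toy instantiation: pretend the odd samples are the misclassified ones,
# and that a PNN trained exactly on a cluster classifies it perfectly.
class ToyPNN:
    def __init__(self, samples):
        self.samples = list(samples)
    def errors(self, samples):
        return 0 if len(samples) == len(self.samples) else sum(x % 2 for x in samples)

def toy_clusters(pnn, cluster, s):
    mis = [x for x in cluster if x % 2]
    return [mis] if 0 < len(mis) < len(cluster) else []

model = build_mlpnn([1, 2, 3, 4], ToyPNN, toy_clusters, s=1)
```

On the toy data the loop adds one level (the cluster of odd samples) and then stops, because the second-level classifier leaves no improvable cluster, mirroring the n_p = 0 termination condition.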
4.2
Comparison with Other Approaches
The MLPNN shares some common characteristics with some other approaches. For example, the boosting [7] and the piecewise linear modelling (PLM) [8,9] also consist of a set of models. The main differences between the MLPNN and other approaches including the boosting [7] and the PLM [8,9] are as follows. 1. Models in the PLM are usually defined on a set of disjoint subsets of the training set. Models in the boosting are all defined on the whole training set. In the MLPNN, PNNs are defined on the important clusters which are disjoint when they are in the same level or overlapped when they are in the different levels. 2. Various approaches have been developed to construct the models in the PLM, such as building hyperplane using linear discriminant function [10], building
a subtree using a tree growing and pruning algorithm [8], and building a linear model using a linear system identification algorithm [9]. In the boosting, a new model is trained based on the whole training set, which is reweighted to deemphasize the training samples correctly classified by the existing models. In the MLPNN, new PNNs are constructed based on the important clusters, which are formed by clustering the misclassified training samples of the top level of the MLPNN.
3. For a sample, the boosting combines the outputs of all the models into a final output using the weighted majority vote, while the PLM and the MLPNN classify the sample according to its location, i.e. they find a subset or important cluster which captures the sample and apply the corresponding local model to produce an output.
4. There are also some connections between the MLPNN and the improved stochastic discrimination (SD) [11,12]. For example, the improved SD also forms an important cluster by clustering the misclassified training samples of the existing models. However, the improved SD trains a set of new models based on the important cluster using random sampling, while in the MLPNN, new PNNs are constructed based on the important clusters. Moreover, to determine whether a new model is kept or not, the improved SD applies discernibility and uniformity tests [13,14,15] while the MLPNN checks the classification accuracy.
5
Numerical Examples
In order to demonstrate the effectiveness of the proposed MLPNN, two examples are presented in this section. The numbers of misclassified training and test samples of the traditional PNN and those of the proposed MLPNN are compared to demonstrate the advantages of the latter.

Example 1: In this example, samples composed of 2 features x_1 and x_2 are uniformly distributed in some circular areas in a 2-dimensional space. A training set with 500 training samples and a test set with 500 test samples were generated. The training samples are plotted in the left subplot of Figure 4 and the test samples in the right subplot. Samples of class 1 are represented by "+" and those of class 2 by "·". A MLPNN was constructed by using the proposed algorithm. The number of important clusters per level s was chosen as 6. The value of the smoothing parameter σ was set as 1. The numbers of misclassified training samples n_etr^{(L)} and test samples n_ete^{(L)} of the constructed MLPNN, which are functions of the number of levels L, are plotted as the solid lines in the left and right subplots of Figure 5, respectively. It can be observed from Figure 5 that the classification accuracy of the constructed MLPNN monotonically increases as L grows. The constructed MLPNN terminated at 3 levels because, when L > 3, no newly constructed PNNs and corresponding important clusters were kept, and the learning procedure automatically stopped.
Fig. 4. Training samples (left) and test samples (right) in Example 1
Fig. 5. Number of misclassified training samples (left) and test samples (right) of MLPNN in Example 1. s = 6.
To investigate the effect of s on the classification accuracy of the MLPNN, we increased s to 15 and plotted the corresponding performance curves in Figure 6. It can be observed from Figure 6 that only on the training set does the classification accuracy of the constructed MLPNN monotonically increase as L grows; on the test set, the classification accuracy fails to increase after L reaches some point. One feasible explanation is that too big an s means too many small important clusters in the MLPNN. Hence information in the training set is overemphasized and the MLPNN may fit the noise of the training set, which usually impairs the model's generalization capability. However, the values of L and s can be determined empirically through the general approach of cross-validation. Because the traditional PNN is the first level of the MLPNN, it can be observed from Figure 5 and Figure 6 that the constructed MLPNNs have higher classification accuracy than the traditional PNNs.

Fig. 6. Number of misclassified training samples (left) and test samples (right) of MLPNN in Example 1. s = 15.

Fig. 7. Number of misclassified training samples (left) and test samples (right) of MLPNN in Example 2
Example 2: The BUPA liver disorders data set obtained from the repository at the University of California at Irvine [16] was used in this example. The data set contains 345 samples of 2 classes, with each sample having 6 features and 1 class label. The first 200 samples were selected as training samples and the remaining 145 samples were used as test samples. With a predetermined value σ = 50, a set of MLPNNs was trained, where the number of important clusters per level s was determined through cross-validation as 4. The simulation results are shown in Figure 7. It is seen that the MLPNN improves the classification accuracy until L = 3.
6
Conclusions
A new MLPNN has been introduced to improve the classification accuracy of the traditional PNN, based on the concept of an important cluster. The construction algorithm of the MLPNN has been introduced. Numerical examples have shown that the proposed MLPNN offers an improvement in classification accuracy over the conventional PNN.
References
1. Specht, D. F.: Probabilistic Neural Networks. Neural Networks 3 (1990) 109-118
2. Duda, R. O., Hart, P. E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
3. Mao, K. Z., Tan, K. C., Ser, W.: Probabilistic Neural-network Structure Determination for Pattern Classification. IEEE Transactions on Neural Networks 3 (2000) 1009-1016
4. Specht, D. F.: Enhancements to the Probabilistic Neural Networks. In: Proc. IEEE Int. Conf. Neural Networks, Baltimore, MD (1992) 761-768
5. Zaknich, A.: A Vector Quantization Reduction Method for the Probabilistic Neural Networks. In: Proc. IEEE Int. Conf. Neural Networks, Piscataway, NJ (1997)
6. Zong, N.: Data-based Models Design and Learning Algorithms for Pattern Recognition. PhD thesis, School of Systems Engineering, University of Reading, UK (2006)
7. Breiman, L.: Arcing Classifiers. Annals of Statistics 26 (1998) 801-849
8. Gelfand, S. B., Ravishankar, C. S., Delp, E. J.: Tree-structured Piecewise Linear Adaptive Equalization. IEEE Trans. on Communications 41 (1993) 70-82
9. Billings, S. A., Voon, W. S. F.: Piecewise Linear Identification of Nonlinear Systems. Int. J. Control 46 (1987) 215-235
10. Sklansky, J., Michelotti, L.: Locally Trained Piecewise Linear Classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence PAMI-2 (1980) 101-111
11. Zong, N., Hong, X.: On Improvement of Classification Accuracy for Stochastic Discrimination - Multi-class Classification. In: Proc. Int. Conf. on Computing, Communications and Control Technologies, CCCT'04 3 (2004) 109-114
12. Zong, N., Hong, X.: On Improvement of Classification Accuracy for Stochastic Discrimination. IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics 35 (2005) 142-149
13. Kleinberg, E. M.: Stochastic Discrimination. Annals of Mathematics and Artificial Intelligence 1 (1990) 207-239
14. Kleinberg, E. M.: An Overtraining-resistant Stochastic Modeling Method for Pattern Recognition. Annals of Statistics 24 (1996) 2319-2349
15. Kleinberg, E. M.: On the Algorithmic Implementation of Stochastic Discrimination. IEEE Trans. on Pattern Analysis and Machine Intelligence 22 (2000) 473-490
16. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders
An Artificial Immune Network Model Applied to Data Clustering and Classification Chenggong Zhang and Zhang Yi Computational Intelligence Laboratory School of Computer Science and Engineering University of Electronic Science and Technology of China Chengdu 610054, P.R. China {zcg,zhangyi}@uestc.edu.cn
Abstract. A novel tree structured artificial immune network is proposed. The trunk nodes and leaf nodes represent memory antibodies and non-memory antibodies, respectively. A link is set up between two antibodies immediately after one has been reproduced by the other. By introducing well-designed immune operators such as clonal selection, cooperation, suppression and topology updating, the network evolves from a single antibody into clusters that are well consistent with the local distribution and local density of the original antigens. The framework of the learning algorithm and several key steps are described. Experiments are carried out to demonstrate the learning process and classification accuracy of the proposed model.
1
Introduction
Over the past few years, the Artificial Immune Network (AIN) has emerged as a novel bio-inspired computational model that provides favorable characteristics for a variety of application areas. The AIN is inspired by the immune network theory proposed by Jerne in 1974 [1], which states that the immune system is composed of B cells and the interactions between them; the B cells receive antigenic stimulus and maintain interactions through mutual stimulation or suppression; thus the immune system acts as a self-regulatory mechanism that can recognize antigens and memorize the characteristics of such antigens even in the absence of their stimulation. Several artificial immune network models have been proposed based on Jerne's theory and applied to a variety of application areas [2,3,4,5,6].

In this paper we propose a novel AIN model - the Tree Structured Artificial Immune Network (TSAIN). By implementing novel immune operators on the antibody population, such as clonal selection, cooperation and suppression, the network evolves into clusters of controlled size that are well consistent with the local distribution and local density of the original antigens. Compared with former models [2,3], the network topology plays a more important role in our method. In fact, the topology grows along with the evolution of the antibody population. A topological link is set up between two antibodies immediately after one has been reproduced by the other. Hence there is no need to define a threshold like NAT in [2] to judge whether two antibodies should be connected. Another advantage of adopting the tree structure is that the mutual cooperation between antibodies provides the network with self-organizing capacity. The mutual suppression and topology updating make the topological structure consistent with the clusters in shape space. The parameters of the learning algorithm are time-varying, which simplifies the stopping criterion; the final convergence of the network is also ensured.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 526–533, 2007. © Springer-Verlag Berlin Heidelberg 2007
2
Tree Structured Artificial Immune Network
We first give an overview of the learning algorithm of our proposed model:

Algorithm 1: The learning algorithm of TSAIN
1: Initialize the antibody population with a single non-memory antibody;
2: gen = 0;
3: while gen++ < maximum generation do
4:   Randomly choose an antigen ag;
5:   Calculate aff_{r,ag} for each antibody r, where aff_{r,ag} = 1 / (1 + ||r − ag||);
6:   best = argmax_r (aff_{r,ag});
7:   if best is a non-memory antibody then
8:     best.stimulation++;
9:     if best.stimulation == clonal selection threshold then
10:      best goes through cloning, producing children antibodies OS;
11:      mutate each ab ∈ OS;
12:      set up topological links between best and each ab ∈ OS;
13:      convert best to a memory antibody;
14:    end if
15:  end if
16:  Antibody cooperation;
17:  Antibody suppression;
18:  Topology updating;
19: end while
20: Delete all non-memory antibodies;
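Lines 5–6 and 10–12 of Algorithm 1 can be sketched as follows. The offspring-size rule anticipates Eq. (1) of Section 2.1; the tuple representation of antibodies, the parameter values, and the floor rounding of the offspring size are our assumptions:

```python
import math
import random

def affinity(r, ag):
    """aff_{r,ag} = 1 / (1 + ||r - ag||), as in line 5 of Algorithm 1."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(r, ag)))
    return 1.0 / (1.0 + dist)

def offspring_size(aff, nt, mc):
    """|OS| per Eq. (1) of Section 2.1 (floor of the raw value assumed)."""
    raw = nt * (1 - aff) * (1 - mc) / (aff * (1 - nt)) + mc
    return max(1, int(raw))

def clone_and_mutate(best, aff, nt=0.9, mc=3, var=0.05, rng=random):
    """Lines 10-11 of Algorithm 1: clone 'best' and mutate each copy
    by adding var * N(0, 1) per coordinate, as in Eq. (2)."""
    n = offspring_size(aff, nt, mc)
    return [tuple(x + var * rng.gauss(0.0, 1.0) for x in best) for _ in range(n)]

ab = (0.2, 0.4)
ag = (0.25, 0.45)
aff = affinity(ab, ag)
children = clone_and_mutate(ab, aff, rng=random.Random(1))
```

Note how the size rule behaves at the boundary: for aff ≤ nt a single child is produced, and as aff approaches 1 the size approaches the cap mc, which is the stated intent of Eq. (1).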
2.1
Clonal Selection
The antibody population is divided into non-memory antibodies, represented by leaf nodes, and memory antibodies, represented by trunk nodes. The non-memory antibodies serve as candidates for memory antibodies and as the medium for relaying cooperation signals; the memory antibodies stand for the immune memory formed for antigens which have already been presented. Once an antigen arrives, the antibody with the highest affinity against that antigen is selected as best (see Algorithm 1) and increases its stimulation.
Further, if best is non-memory and its stimulation attains a certain threshold called sti, then it goes through the clonal selection process:

1. Generate children antibodies OS with the size calculated by:

|OS| = \max\left\{ 1, \frac{nt(1 - \mathrm{aff})(1 - mc)}{\mathrm{aff}(1 - nt)} + mc \right\},   (1)

where aff is the affinity of best against the current antigen, mc ≥ 1 is the predefined maximum size of OS, and nt ∈ [0, 1) is the predefined affinity threshold. If aff ≤ nt, the size of OS will be 1. Each newab_i ∈ OS is an identical copy of best.

2. Each newab_i ∈ OS goes through the mutation process:

newab_i = newab_i + var \cdot N(0, 1),   (2)
where N(0, 1) is the standard normal distribution and var ≪ 1 controls the intensity of the mutation.

3. Convert best to a memory antibody. It then enters a dormant phase in which its stimulation level will no longer be increased, and it will not reproduce children antibodies in future evolution. In other words, if we regard the chance of reproduction as a kind of resource, then when best has finished the reproduction, the resource it holds is bereaved and passed to its children.

By using the clonal selection process, the antibodies with higher affinity gradually increase their proportion in the whole population.

2.2
Antibody Cooperation
When the clonal selection has finished, the algorithm enters the cooperation phase, in which each antibody ab_i moves according to four factors: the position of the current antigen, its topological distance to best, the current learning rate, and the current neighborhood width. That is:

ab_i = ab_i + \lambda_{gen} \cdot e^{-\frac{d_i^2}{2\delta_{gen}^2}} \cdot (ag - ab_i),   (3)

where gen is the current generation number, λ_gen ≤ 1 is the current learning rate, d_i is the topological distance between ab_i and best, and δ_gen > 0 is the current neighborhood width that controls the influence zone of best. In each generation, λ_gen and δ_gen are determined by:

\lambda_{gen} = (\lambda_1 - \lambda_0) \cdot \left( \frac{gen}{G} \right)^k + \lambda_0,   (4)

\delta_{gen} = (\delta_1 - \delta_0) \cdot \left( \frac{gen}{G} \right)^k + \delta_0,   (5)

where 0 < λ_1 < λ_0 ≤ 1 and δ_0 > δ_1 > 0. G is the maximum generation number, and k > 0 is used to control the convergence rate of λ_gen and δ_gen.
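One cooperation step (Eqs. (3)–(5)) can be sketched as follows. The Gaussian neighborhood factor exp(−d_i²/(2δ_gen²)) is our reading of Eq. (3), and the chain topology used in the demo is an illustrative assumption:

```python
import math

def cooperate(antibodies, topo_dist, best_idx, ag, lam, delta):
    """One cooperation step per Eq. (3): every antibody moves toward the
    current antigen ag, with intensity decaying in its topological distance
    d_i to 'best'. topo_dist(i, j) returns the (unique) tree-path length
    between antibodies i and j."""
    updated = []
    for i, ab in enumerate(antibodies):
        d = topo_dist(i, best_idx)
        step = lam * math.exp(-d * d / (2.0 * delta * delta))
        updated.append(tuple(a + step * (g - a) for a, g in zip(ab, ag)))
    return updated

def schedule(v0, v1, gen, G, k):
    """Time-varying parameters of Eqs. (4)-(5): from v0 at gen = 0
    to v1 at gen = G, with rate shaped by the exponent k."""
    return (v1 - v0) * (gen / G) ** k + v0

abs_ = [(0.0, 0.0), (1.0, 1.0)]
dist = lambda i, j: abs(i - j)          # chain topology for the sketch
moved = cooperate(abs_, dist, best_idx=0, ag=(1.0, 0.0),
                  lam=schedule(0.5, 0.01, gen=0, G=100, k=0.2),
                  delta=schedule(30, 0.5, gen=0, G=100, k=0.2))
```

At gen = 0 the schedules return λ_0 and δ_0, so best itself moves half the way to the antigen while its neighbor moves almost as far; by gen = G the learning rate has decayed to λ_1, which is what ensures the final convergence mentioned in the introduction.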
From Eq. (3) we can see that all antibodies seek to approach the current antigen, namely moving in the same direction as best, and that the intensity of this movement decreases with their topological distance to best. In fact, the moving of the antibodies can be regarded as a form of reaction; hence we can say that the antibodies cooperatively react to the current antigen. This is the reason for calling this mechanism "cooperation". Notice that we adopt a tree structure as the network topology. Thus the topological distance between any two antibodies is definite, since there is exactly one path between any two antibodies. Consequently the cooperation intensity between antibodies is also definite.

2.3
Antibody Suppression
We implement a population controlling mechanism by using mutual suppression based on topological links. For any two antibodies ab_i and ab_j, if the suppression condition is satisfied, i.e. they do not have a lineal relationship and their affinity is larger than the suppression threshold st, then the one with the larger offspring size will be the winner. Let ab_i be the winner; then it imposes one of the following suppressing operators on ab_j:

1. Delete ab_j and all of its offspring, with probability 1 − p.
2. Remove the link between ab_j and its father and then create a link between ab_i and ab_j, with probability p.

We define p = gen/G. This means that in the initial phase, the suppression inclines to shape the network; in the ending phase, the suppression inclines to adjust the network topology in a non-reducing manner. Each antibody goes through the suppression until no pair of antibodies satisfies the suppression condition. Notice that after the second type of suppression, the network structure is still a tree. By using the suppression, the size of the sub-population in each cluster is kept under control. In each iteration, st is updated by:

st_{gen} = (st_1 - st_0) \cdot \left( \frac{gen}{G} \right)^k + st_0, \quad 0 < st_1 < st_0 < 1.   (6)

2.4
Topology Updating
When the suppression has finished, the topology updating is performed. In this phase, the links between antibodies with affinity smaller than ct are removed. By using this mechanism, there will be more independent branches in the tree structured network, which represent different clusters. In each iteration, ct is updated by:

ct_{gen} = (ct_1 - ct_0) \cdot \left( \frac{gen}{G} \right)^k + ct_0, \quad 0 < ct_0 < ct_1 < 1.   (7)
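The two threshold schedules (Eqs. (6)–(7)) share the same form, and the topology updating itself is a link filter. The set-of-index-pairs link representation below is an illustrative assumption, not the paper's data structure:

```python
def update_topology(links, affinity_of, ct):
    """Topology updating: keep only links whose endpoint affinity is at
    least ct, splitting the tree into independent branches (clusters).
    'links' is a set of index pairs; affinity_of(i, j) is their affinity."""
    return {(i, j) for (i, j) in links if affinity_of(i, j) >= ct}

def threshold(v0, v1, gen, G, k):
    """Shared schedule of Eqs. (6)-(7): the suppression threshold st is
    decreasing (st0 > st1), the cut threshold ct increasing (ct1 > ct0)."""
    return (v1 - v0) * (gen / G) ** k + v0

# Chain 0-1-2-3 with one weak link at the end (affinities are made up).
links = {(0, 1), (1, 2), (2, 3)}
aff = lambda i, j: 0.9 if abs(i - j) == 1 and i < 2 else 0.6
ct_early = threshold(0.54, 0.83, gen=0, G=100, k=0.2)   # ct0: cut little
ct_late = threshold(0.54, 0.83, gen=100, G=100, k=0.2)  # ct1: cut more
```

With the "Real problem" settings of Table 1 (ct_0 = 0.54, ct_1 = 0.83), the early network keeps all links, while near the final generation the weak link is removed and the tree splits into two branches, i.e. two clusters.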
Table 1. Parameter settings for the experiments

                 G      I  k    sti  mc  nt     λ0   λ1    δ0  δ1   st0    st1    ct0    ct1
Artificial data  24000  1  0.2  20   3   0.935  0.5  0.01  30  0.5  0.995  0.985  0.806  0.935
Real problem     4000   1  0.2  10   3   0.91   0.5  0.01  30  0.5  0.91   0.74   0.54   0.83
3
Simulations
3.1
Artificial Dataset
We first use a 2-dimensional artificial data set (Fig. 1(a)) to show the learning process of TSAIN. The original data set involves 3 clusters with different shapes in the unit square. Each cluster has 640 samples, produced by adding noise to standard curves. There are 80 additional noise samples independently distributed in the unit square. The parameter settings are listed in Table 1.
Fig. 1. The artificial data set and the network obtained through the learning process. (a) The artificial data set. (b) The final network.
Fig. 1(b) shows the final resultant network, in which there are in total 501 memory antibodies distributed in 10 clusters. Two of the 10 clusters observably lie in noise areas and contain 10 memory antibodies. These 501 memory antibodies are used to represent the original antigen population. Fig. 2 visually demonstrates the network in different generations. From it we can find that the network evolves from a single antibody to a tree structured network containing a number of antibodies whose positions are consistent with the local distribution and local density of the original antigen population.

3.2
Real Problem
The second experiment is based on Wisconsin Breast Cancer Database [7]. The original database contains 699 instances, each instance has 9 numeric-valued attributes. Since there are 16 instances that contain missing attribute values, we
An AIN Model Applied to Data Clustering and Classification

Fig. 2. The evolution process of antibody population and network topology (shown at iteration numbers 3000 to 24000)

Table 2. Comparative classification accuracy

(a) Our result

Time  Train accuracy(%)  Validation accuracy(%)
1     97.3               96.8
2     97.2               96.6
3     97.2               95.9
4     97.1               96.2
5     97.2               96.4
6     97.2               96.4

(b) Historical results

Method                            Reported accuracy(%)
C4.5 [8]                          94.74
RIAC [9]                          94.99
LDA [10]                          96.80
NEFCLASS [11]                     95.06
Optimized-LVQ [12]                96.70
Supervised fuzzy clustering [13]  95.57
only use the remaining 683 instances for our experiment. The instances are divided into 2 classes: class 0 (tested benign) contains 444 (65.0%) instances; class 1 (tested malignant) contains 239 (35.0%) instances. We apply 10-fold cross-validation 6 times. The attributes are normalized before the experiment. Table 1 lists the parameter settings used in the experiment. We use two separate antibody populations, each representing a cancer class (0 or 1). In each training process, both populations are evolved independently using their corresponding antigen populations. When both populations are obtained, the final resultant network is the combination of them. When an unseen antigen is presented, a best antibody is selected (see the definition of best in Algorithm 1),
and the antigen is classified as the class to which that best antibody belongs. Table 2(a) lists the final classification accuracy in each of the 6 runs. The overall average accuracy on the training set is 97.2%, and the overall average accuracy on the validation set is 96.4%. Table 2(b) lists results reported in earlier research on the same data set using 10-fold cross-validation. We can see that our model outperforms some former methods in terms of validation accuracy.
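The classification step described above can be sketched as follows, assuming the "best" antibody of Algorithm 1 is the one with highest affinity to the antigen, approximated here by smallest Euclidean distance:

```python
import numpy as np

def classify(antigen, memory_class0, memory_class1):
    """Assign the antigen to the class of the overall best-matching memory
    antibody. 'Best' is assumed to mean highest affinity, modeled here as
    smallest Euclidean distance to any antibody of that class."""
    d0 = np.linalg.norm(memory_class0 - antigen, axis=1).min()
    d1 = np.linalg.norm(memory_class1 - antigen, axis=1).min()
    return 0 if d0 <= d1 else 1

# toy check with two small hypothetical memory populations
m0 = np.array([[0.1, 0.1], [0.2, 0.2]])   # memory antibodies for class 0
m1 = np.array([[0.9, 0.9], [0.8, 0.8]])   # memory antibodies for class 1
print(classify(np.array([0.15, 0.12]), m0, m1))  # 0
print(classify(np.array([0.85, 0.90]), m0, m1))  # 1
```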
4 Conclusions

In this paper, we proposed a new artificial immune network model. The basic components of our model are antibodies and the topological links between them. With the help of clonal selection and cooperation, the network exhibits a self-organizing property. Through suppression, the antibodies compete for occupancy of the cluster areas. The introduction of topology updating keeps the network topology consistent with the distribution of clusters. Experimental results show that the learning algorithm has good learning capacity. In future work, more experiments on complicated data sets should be carried out.
References

1. Jerne, N.: Towards a Network Theory of the Immune System. Ann. Immunol. 125 (1974) 373–389
2. Timmis, J., Neal, M.: A Resource Limited Artificial Immune System for Data Analysis. Knowledge-Based Systems 14 (2001) 121–130
3. Castro, L.N.D., Zuben, F.J.V.: aiNet: An Artificial Immune Network for Data Analysis. Int. J. Computational Intelligence and Applications 1 (3) (2001)
4. Knight, T., Timmis, J.: A Multi-layered Immune Inspired Machine Learning Algorithm. In: Lotfi, A., Garibaldi, M. (eds.): Applications and Science in Soft Computing. Springer (2003) 195–202
5. Nasaroui, O., Gonzalez, F., Cardona, C., Rojas, C., Dasgupta, D.: A Scalable Artificial Immune System Model for Dynamic Unsupervised Learning. In: Cantú-Paz, E. et al. (eds.): Proceedings of GECCO 2003. Lecture Notes in Computer Science 2723, Springer-Verlag Berlin Heidelberg (2003) 219–230
6. Neal, M.: Meta-Stable Memory in an Artificial Immune Network. In: Timmis, J., Bentley, P., Hart, E. (eds.): Proceedings of ICARIS 2003. Lecture Notes in Computer Science 2787, Springer-Verlag Berlin Heidelberg (2003) 168–180
7. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. University of California, Irvine, Dept. of Information and Computer Sciences (1998)
8. Quinlan, J.R.: Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. 4 (1996) 77–90
9. Hamilton, H.J., Shan, N., Cercone, N.: RIAC: A Rule Induction Algorithm Based on Approximate Classification. Technical Report CS 96-06, University of Regina
10. Ster, B., Dobnikar, A.: Neural Networks in Medical Diagnosis: Comparison with Other Methods. In: Proceedings of the International Conference on Engineering Applications of Neural Networks (1996) 427–430
11. Nauck, D., Kruse, R.: Obtaining Interpretable Fuzzy Classification Rules from Medical Data. Artif. Intell. Med. 16 (1999) 149–169
12. Goodman, D.E., Boggess, L., Watkins, A.: Artificial Immune System Classification of Multiple-class Problems. In: Proceedings of the Artificial Neural Networks in Engineering (2002) 179–183
13. Abonyi, J., Szeifert, F.: Supervised Fuzzy Clustering for the Identification of Fuzzy Classifiers. Pattern Recognition Lett. 24 (2003) 2195–2207
Sparse Coding in Sparse Winner Networks

Janusz A. Starzyk1, Yinyin Liu1, and David Vogel2

1 School of Electrical Engineering & Computer Science, Ohio University, Athens, OH 45701
{starzyk,yliu}@bobcat.ent.ohiou.edu
2 Ross University School of Medicine, Commonwealth of Dominica
[email protected]

Abstract. This paper investigates a mechanism for reliable generation of sparse code in a sparsely connected, hierarchical, learning memory. Activity reduction is accomplished with local competitions that suppress activities of unselected neurons so that costly global competition is avoided. The learning ability and the memory characteristics of the proposed winner-take-all network and an oligarchy-take-all network are demonstrated using experimental results. The proposed models have the features of a learning memory essential to the development of machine intelligence.
1 Introduction

In this paper we describe a learning memory built as a hierarchical, self-organizing network in which many neurons activated at lower levels represent detailed features, while very few neurons activated at higher levels represent objects and concepts in the sensory pathway [1]. By recognizing the distinctive features of patterns in a sensory pathway, such a memory may be made to be efficient, fault-tolerant, and to a useful degree, invariant. Lower level features may be related to multiple objects represented at higher levels. Accordingly, the number of neurons increases up the hierarchy with the neurons at lower levels making divergent connections with those on higher levels [2]. This calls to mind the expansion in number of neurons along the human visual pathway (e.g., a million geniculate body neurons drive 200 million V1 neurons [3]). Self-organization is a critical aspect of the human brain in which learning occurs in an unsupervised way. Presentation of a pattern activates specific neurons in the sensory pathway. Gradually, neuronal activities are reduced at higher levels of the hierarchy, and sparse data representations, usually referred to as "sparse codes", are built. The idea of "sparse coding" emerged in several earlier works [4][5]. In recent years, various experimental and theoretical studies have supported the assumption that information in real brains is represented by a relatively small number of active neurons out of a large neuronal population [6][7][3]. In this paper, we implement the novel idea of performing pathway selections in sparse network structures. Self-organization and sparse coding are obtained by means

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 534–541, 2007. © Springer-Verlag Berlin Heidelberg 2007
of localized, winner-take-all (WTA) competitions and Hebbian learning. In addition, an oligarchy-take-all (OTA) concept and its mechanism are proposed that produce redundant, fault-tolerant information coding. This paper is organized as follows. In section 2, a winner network is described that produces sparse coding and activity reduction in the learning memory. In section 3, an OTA network is described that produces unsupervised, self-organizing learning with distributed information representations. Section 4 demonstrates the learning capabilities of the winner and the OTA networks using experimental results. Finally, our method of sparse coding in sparse structures is summarized in section 5.
2 The Winner Network

In the process of extracting information from data, we expect to predictably reduce neuronal activities at each level of a sensory pathway. Accordingly, a competition is required at each level. In unsupervised learning, we need to find the neuron in the network that best matches the input data. In neural networks, such a neuron is usually determined using a WTA network [8][9]. A WTA network is usually implemented as a competitive neural network in which inhibitory lateral links and recurrent links are utilized, as shown in Fig. 1. The outputs iteratively suppress each other's signal strength, and the neuron with maximum signal strength remains the only active neuron when the competition is done. For a large memory, with many neurons on the top level, a global WTA operation is complex, inaccurate, and costly. Moreover, average competition time increases as the likelihood of similar signal strengths increases in large WTA networks.
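A minimal sketch of such an iterative competitive WTA (MAXNET-style mutual inhibition); the inhibition constant alpha and the stopping rule are illustrative assumptions:

```python
import numpy as np

def maxnet_wta(strengths, alpha=0.05, iters=200):
    """Iterative WTA: each unit inhibits every other unit by a fraction
    alpha of its activity, until at most one unit stays positive.
    The update a_i(1 + alpha) - alpha * sum(a) preserves the ordering,
    so the initially strongest unit always wins."""
    a = np.asarray(strengths, dtype=float).copy()
    for _ in range(iters):
        a = np.maximum(0.0, a - alpha * (a.sum() - a))  # lateral inhibition
        if (a > 0).sum() <= 1:
            break
    return int(np.argmax(a))

print(maxnet_wta([0.3, 0.9, 0.5, 0.7]))  # 1
```

The iterative suppression is exactly what makes global WTA costly for large populations, motivating the local competitions introduced below.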
Fig. 1. WTA network as competitive neural network
The use of sparse connections between neurons can, at the same time, improve efficiency and reduce energy consumption. However, sparse connections between neurons on different hierarchical levels may fail to transmit enough information along the hierarchy for reliable feature extraction and pattern recognition. In a local network model for cognition, called an "R-net" [10][11], secondary neurons, with random connections to a fraction of primary neurons in other layers, effectively provide almost complete connectivity between primary neuron pairs. While R-nets provide large-capacity associative memories, they were not used for feature extraction and sparse coding in the original work. The R-net concept is expanded in this work by using secondary neurons to fully connect primary neurons on lower levels to primary neurons on higher levels through the secondary neurons of a sparsely connected network. The network has an increasing number of neurons on the higher levels, and all neurons on the same level have an equal number of input links from neurons on the lower level. The number of secondary levels between primary levels affects the overall network sparsity. More secondary levels can be used to increase the network sparsity. Such a sparsely connected network with secondary levels is defined as a winner network and illustrated in Fig. 2.
Fig. 2. Primary level and secondary level in winner network
The initial random input weights to each neuron are scaled to have a sum of squared weights equal to 1, which places them on the unit multidimensional sphere. Because a neuron becomes active when its input weight vector is similar to its input pattern, spreading the input weights uniformly on the unit-sphere increases the memory capacity of the winner network. Furthermore, the normalization of the weights maintains the overall input signal level so that the output signal strength of neurons, and accordingly the output of the network, will not be greatly affected by the number of input connections. In a feed-forward computation, each neuron combines its weighted inputs using a thresholded activation function. Only when the signal strength is higher than the activation threshold can the neuron send a signal to its post-synaptic neurons. Eventually, the neurons on the highest level will have different levels of activation, and the most strongly activated neuron (the global winner) is used to represent the input pattern. In this work, the competition to find the global winner is replaced by small-scale WTA circuits in local regions in the winner network as described next. In a sparsely connected network, each neuron on the lower level connects to a group of neurons on the next higher level. The winning neuron at this level is found by comparing neuronal activities. In Hebbian learning, weight adjustments reduce the plasticity of the winning neuron’s connections. Therefore, a local winner should not
only have the maximum response to the input, but also its connections should be flexible enough to be adjusted towards the input pattern, so that the local winner satisfies

s_winner^(level+1) = max_{i ∈ N_j^(level+1)} { Σ_{k ∈ N_j^(level)} w_jk · s_k^(level) · ρ_ji },   (1)

where N_i^(level+1) is the set of post-synaptic neurons on level (level+1) driven by a neuron i, N_j^(level) is the set of pre-synaptic neurons that project onto neuron j on level (level), and ρ_ji denotes the plasticity of the link between pre-synaptic neuron i and post-synaptic neuron j, as shown in Fig. 3(a).
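A simplified sketch of this local-winner selection; the data structures (one weight vector per candidate neuron and a single scalar plasticity per candidate) simplify the paper's link-wise quantities:

```python
import numpy as np

def local_winner(signals, weights, plasticity, candidates):
    """Pick the local winner among candidate post-synaptic neurons: the one
    maximizing the plasticity-weighted input response (cf. Eq. (1)).
    signals     : pre-synaptic signal strengths s_k
    weights[j]  : input weight vector w_j of post-synaptic neuron j
    plasticity[j]: plasticity of neuron j's links, simplified to a scalar."""
    responses = {j: plasticity[j] * float(np.dot(weights[j], signals))
                 for j in candidates}
    return max(responses, key=responses.get)

signals = np.array([0.2, 0.8, 0.5])
weights = {0: np.array([0.1, 0.9, 0.2]), 1: np.array([0.6, 0.1, 0.7])}
plasticity = {0: 1.0, 1: 0.5}
print(local_winner(signals, weights, plasticity, [0, 1]))  # 0
```

Neuron 0 wins here because its weights match the input more closely and its links are still fully plastic.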
Fig. 3. (a) Interconnection structure to determine a local winner, (b) The winner network
Such local competition can be easily implemented using a current-mode WTA circuit [12]. A local winner neuron, for example N_4^(level+1) in Fig. 3(a), will pass its signal strength to its pre-synaptic neuron N_4^(level), and all other post-synaptic branches connecting neuron N_4^(level) with the losing nodes will be logically cut off. Such local competition is performed first on the highest level. The signal strengths of neurons which win their corresponding local competitions propagate down to the lower levels, and the same procedure continues until the first input layer is reached. The global winning neuron on the top level depends on the results of all local competitions. Subsequently, the signal strength of the global winner is propagated down to all lower-level neurons which connect to the global winner. Most of the branches not connected to the global winner are logically cut off, while the branches of the global winner are kept active. All the branches that propagate the local winner signal down the hierarchy form the winner network, as shown in Fig. 3(b). Depending on the connectivity structure, one or more winner networks can be found. By properly choosing the connectivity structure, we may guarantee that all of the input neurons are in a single winner network, so that the output level contains a single winner. Let us use a 3-layer winner network (1 input level, 2 secondary levels and 1 output level) as an example. The network has 64 primary input neurons and 4096 output neurons, with 256 and 1024 secondary neurons, respectively. The number of active neurons in the top level decreases with increasing numbers of input connections. As shown in Fig. 4, when the number of input links to each neuron is more than 8, a single winner neuron in the top level is achieved.
Since the branches logically cut off during local competition do not contribute to post-synaptic neuronal activities, the synaptic strengths are recalculated only for branches in the winner network. As all the branches of the winner network are used, the signal strength of pathways to the global winner is not reduced. However, due to the logically disconnected branches, the signal strength of pathways to other output neurons is suppressed. As a result, an input pattern activates only some of the neurons in the winner network. The weights are adjusted using Hebbian learning only for links in winner networks, to reinforce the activation level of the global winner. After updating, the weights are scaled so that they remain spread on the unit sphere. In general, the winner network with secondary neurons and sparse connections builds sparse representations in three steps: sending data up through the hierarchy, finding the winner network and global winner by using local competitions, and training. The winner network finds the global winner efficiently without the iterations usually adopted in MAXNET [8][9]. It provides an effective and efficient solution to the problem of finding global winners in large networks. The advantages of sparse winner networks are significant for large memories.

Fig. 4. Effect of number of input connections to neurons (number of active neurons on the top level vs. number of input links to each neuron)
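The Hebbian update with unit-sphere rescaling described above might look like the following sketch; the learning rate eta is an assumed parameter:

```python
import numpy as np

def hebbian_update(w, x, winner_activity, eta=0.1):
    """Hebbian learning on a winner's input weights, followed by
    renormalization so the weight vector stays on the unit sphere
    (sum of squared weights equal to 1, as required by the network)."""
    w = w + eta * winner_activity * x   # reinforce links that carried the input
    return w / np.linalg.norm(w)

w = np.array([0.6, 0.8])                # already unit-length
x = np.array([1.0, 0.0])                # input pattern component
w_new = hebbian_update(w, x, winner_activity=1.0)
print(round(float(np.linalg.norm(w_new)), 6))  # 1.0
```

The renormalization step is what keeps the output signal strength independent of the number of input connections, as noted in section 2.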
3 Winner Network with Oligarchy-Take-All

Recognition using a single-neuron representation scheme in the winner network can easily fail because of noise, faults, variant views of the same object, or learning of other input patterns due to an overlap between activation pathways. In order to have distributed, redundant data representations, an OTA network is proposed in this work that uses a small group of neurons as input representations. In an OTA network, the winning neurons in the oligarchy are found directly in a feed-forward process instead of the 3-step procedure used in the winner network as described in section 2. Neurons in the 2nd layer combine weighted inputs and use a threshold activation function as in the winner network. Each neuron in the 2nd layer competes in a local competition. The projections onto losing nodes are logically cut off. The same Hebbian learning as is used in the winner network is carried out on the
logically connected links. Afterwards, the signal strengths of the 2nd level are recalculated considering only the effects of the active links. The procedure continues until the top level of the hierarchy is reached. Only active neurons on each level are able to send information up the hierarchy. The group of active neurons on the top level provides redundant, distributed coding of the input pattern. When similar patterns are presented, it is expected that similar groups of neurons will be activated. Similar input patterns can be recognized from the similarities of their highest-level representations.
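One level of this OTA feed-forward pass can be sketched as follows; the group layout, threshold value, and random weights are illustrative assumptions:

```python
import numpy as np

def ota_level(inputs, W, threshold=0.5, group_size=4):
    """One level of an OTA-style feed-forward pass (simplified sketch):
    neurons combine weighted inputs, compete in fixed local groups,
    and only local winners above the activation threshold stay active;
    losers are logically cut off (activity zeroed)."""
    s = W @ inputs                        # signal strengths of this level
    active = np.zeros_like(s)
    for start in range(0, len(s), group_size):
        g = slice(start, start + group_size)
        j = start + int(np.argmax(s[g]))  # local winner in this group
        if s[j] > threshold:
            active[j] = s[j]
    return active

rng = np.random.default_rng(1)
W = rng.uniform(0, 1, (8, 4))             # 8 neurons, 4 inputs
out = ota_level(rng.uniform(0, 1, 4), W)
print((out > 0).sum() <= 2)               # at most one winner per group -> True
```

Stacking such levels yields a small distributed group of top-level winners (the "oligarchy") rather than a single global winner.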
4 Experimental Results

The learning abilities of the proposed models were tested on the 3-layer network described in section 2. The weights of connections were randomly initialized within the range [-1, 1]. A set of handwritten digits from the benchmark database [13], containing data in the range [-1, 1], was used to train the winner network and OTA networks. All patterns have 8 by 8 grey pixel inputs, as shown in Fig. 5. The groups of active neurons in the OTA network for each digit are shown in Table 1. On average, each pattern activates 28.3 out of 4096 neurons on the top level, with a minimum of 26 neurons and a maximum of 34 neurons.
Fig. 5. Ten typical patterns for each digit

Table 1. Active neuron index in the OTA network for handwritten digit patterns

digit  Active neuron index in OTA network
0      72    91    365   371   1103  1198  1432  1639  …
1      237   291   377   730   887   1085  1193  1218  …
2      294   329   339   771   845   1163  1325  1382  …
3      109   122   237   350   353   564   690   758   …
4      188   199   219   276   307   535   800   1068  …
5      103   175   390   450   535   602   695   1008  …
6      68    282   350   369   423   523   538   798   …
7      237   761   784   1060  1193  1218  1402  1479  …
8      35    71    695   801   876   1028  1198  1206  …
9      184   235   237   271   277   329   759   812   …
The ability of the network to classify was tested by changing 5 randomly selected bits of each training pattern. Comparing the OTA neurons obtained during training with those activated by the variant patterns, we find that the OTA network successfully recognizes 100% of the variant patterns. It is expected that changing more bits of the original patterns will degrade recognition performance. However, the tolerance of the OTA network to such changes is expected to be better than that of the winner network.
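The perturbation used in this test can be sketched as follows, assuming ±1-coded patterns for simplicity (the benchmark patterns are actually grey values in [-1, 1]):

```python
import numpy as np

def flip_bits(pattern, n_bits, rng):
    """Return a copy of a +/-1 pattern with n_bits randomly selected
    positions sign-flipped (the perturbation used in the robustness test)."""
    p = pattern.copy()
    idx = rng.choice(len(p), size=n_bits, replace=False)
    p[idx] = -p[idx]
    return p

rng = np.random.default_rng(0)
pattern = np.ones(64)                    # an 8x8 pattern, flattened
variant = flip_bits(pattern, 5, rng)
print(int((variant != pattern).sum()))   # 5
```

Recognition is then scored by comparing the set of top-level neurons activated by the variant against the set stored during training.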
Fig. 6 compares the performances of the winner network and the OTA network for different numbers of changed bits in the training patterns, based on 10 Monte-Carlo trials. We note that increasing the number of changed bits quickly degrades the winner network's performance on this recognition task. When the number of changed bits is larger than 20, the recognition correctness stays around 10%. However, 10% is the accuracy level of random recognition for 10 digit patterns. This means that when the number of changed bits is over 20, the winner network is not able to make useful recognitions. As anticipated, the OTA network has much better fault tolerance and resists this degradation of recognition correctness.
Fig. 6. Recognition performance of the OTA network and the winner network (percentage of correct recognition vs. number of bits changed in the pattern)
5 Conclusions

This paper investigates a mechanism for reliably producing sparse coding in sparsely connected networks and building a high-capacity memory with redundant coding in sensory pathways. Activity reduction is accomplished with local rather than global competition, which reduces hardware requirements and the computational cost of self-organizing learning. High memory capacity is obtained by means of layers of secondary neurons with optimized numbers of interconnections. In the winner network, each pattern activates a dominant neuron as its representation. In the OTA network, a pattern triggers a distributed group of neurons. With OTA, information is redundantly coded so that recognition is more reliable and robust. The learning ability of the winner network is demonstrated using experimental results. The proposed models produce features of a learning memory that may prove essential for developing machine intelligence.
References

1. Starzyk, J.A., Liu, Y., He, H.: Challenges of Embodied Intelligence. In: Proc. Int. Conf. on Signals and Electronic Systems, ICSES'06, Lodz, Poland, Sep. 17-20 (2006)
2. Kandel, E.R., Schwartz, J.H., Jessell, T.M.: Principles of Neural Science. 4th edition, McGraw-Hill Medical (2000)
3. Anderson, J.: Learning in Sparsely Connected and Sparsely Coded System. Ersatz Brain Project working note (2005)
4. Barlow, H.B.: Single Units and Sensation: A Neuron Doctrine for Perceptual Psychology? Perception 1 (1972) 371–394
5. Amari, S.: Neural Representation of Information by Sparse Encoding. In: Brain Mechanisms of Perception and Memory from Neuron to Behavior. Oxford University Press (1993) 630–637
6. Földiak, P., Young, M.P.: Sparse Coding in the Primate Cortex. In: The Handbook of Brain Theory and Neural Networks. The MIT Press (1995) 895–898
7. Olshausen, B.A., Field, D.J.: Sparse Coding of Sensory Inputs. Current Opinion in Neurobiology 14 (2004) 481–487
8. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall (1999)
9. Zurada, J.M.: Introduction to Artificial Neural Systems. West Publishing Company (1992)
10. Vogel, D.D., Boos, W.: Sparsely Connected, Hebbian Networks with Strikingly Large Storage Capacities. Neural Networks 10(4) (1997) 671–682
11. Vogel, D.D.: A Neural Network Model of Memory and Higher Cognitive Functions. Int. J. Psychophysiol. 55(1) (2005) 3–21
12. Starzyk, J.A., Fang, X.: A CMOS Current Mode Winner-Take-All Circuit with Both Excitatory and Inhibitory Feedback. Electronics Letters 29(10) (1993) 908–910
13. LeCun, Y., Cortes, C.: The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
Multi-Valued Cellular Neural Networks and Its Application for Associative Memory

Zhong Zhang, Takuma Akiduki, Tetsuo Miyake, and Takashi Imamura

Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi 441-8580, Japan
[email protected]

Abstract. This paper discusses the design of multi-valued output functions of Cellular Neural Networks (CNNs) implementing associative memories. The output function of the CNNs is a piecewise linear function which consists of a saturation and non-saturation range. A new structure of the output function, called the "basic waveform", is defined. Saturation ranges with n levels are generated by adding n − 1 basic waveforms. Consequently, an associative memory of multi-valued patterns is created successfully, and computer experiment results show the validity of the proposed method. The results of this research can expand the range of applications of CNNs as associative memories.
1 Introduction

Cellular Neural Networks (CNNs), proposed by Chua and Yang in 1988 [1,2], are one type of interconnected neural network. CNNs consist of nonlinear elements called cells, and each cell is connected to its neighborhood cells. The state of each cell changes in parallel based on a differential equation and converges to an equilibrium state. Thus, CNNs can be designed to be associative memories through the dynamics of the cells [3], and they have been applied in various fields, such as character recognition, medical diagnosis, and machine failure detection systems [4,5,6]. The purpose of our study is to create an abnormality diagnosis system which detects anomalous behavior in man-machine systems by pattern classification, using the CNN as an associative memory. To realize this system, it is important to have a wide variety of diagnosis patterns. In order to improve the accuracy of processing results, two methods can be considered: the first increases the number of cells, and the second adds more output levels to each cell. However, the first method would decrease computational efficiency due to the expansion of the CNN's scale. On the other hand, the second method has the advantage that there is no need to expand the scale. The output of conventional CNNs has two or three levels. When CNNs conduct abnormality diagnosis, Kanagawa et al. classified the results of blood tests into three states and made them the patterns for diagnosis, which were either "NORMAL", "LIGHT EXCESS" or "HEAVY EXCESS". Other CNNs which have multi-valued output functions for image processing have also been proposed in the past [7], but their evaluation as an associative storage medium with arbitrary output levels has not been conducted yet.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 542–551, 2007. © Springer-Verlag Berlin Heidelberg 2007
In this paper, we discuss the design of multi-valued output functions of CNNs for associative memory. The output function of a CNN is a piecewise linear function which consists of a saturation and non-saturation range. We define a new structure of the output function and compose a basic waveform using the piecewise linear function. The basic waveform creates the multiple saturation ranges when added to itself. Hence we can compose the multi-valued output function by adding basic waveforms. Our method's effectiveness is evaluated by computer experiments using random patterns.
Fig. 1. A Cellular Neural Network and a corresponding example associative memory. Each cell is connected to its r-neighborhood cells.
2 Cellular Neural Networks

Figure 1 shows a Cellular Neural Network with its memory patterns. In this section, the definition of "cells" in the Cellular Neural Network and its dynamics are described first; second, the design method for CNNs as an associative memory is described.

2.1 Dynamics of Cells
We first consider the following CNN, which is composed of an M × N array of nonlinear elements called cells. The dynamics of cell C(i, j) in the ith row and jth column is expressed as follows:

ẋ_ij = −x_ij + T_ij ∗ y_ij + I_ij,  y_ij = sat(x_ij)  (1 ≤ i ≤ M, 1 ≤ j ≤ N),   (1)

where x_ij and y_ij represent the state variable and the output variable, respectively, while T_ij represents the matrix of coupling coefficients, I_ij represents the threshold, and ∗ is the composition operator. When the (i, j) cell is influenced
Fig. 2. Piecewise linear output characteristic
from neighborhood cells r units away (shown in Figure 1), T_ij ∗ y_ij is expressed as in the following equation:

T_ij ∗ y_ij = Σ_{k=−r}^{r} Σ_{l=−r}^{r} t_ij(k,l) y_{i+k,j+l}.   (2)

The output of each cell y_ij is given by the piecewise linear function of the state x_ij; when the output level is binary, the function is expressed as in the following equation:

y_ij = (1/2)(|x_ij + 1| − |x_ij − 1|).   (3)

This output function has two saturated levels (as depicted in Figure 2).

2.2 Design of the CNN for Associative Memory
When we express the differential equation of each cell given in Eq. (1) in vector notation, two-dimensional CNNs having M rows and N columns are represented by the following:

ẋ = −x + T y + I,  y = sat(x),   (4)

where m = MN and

x = (x_11, x_12, · · · , x_1N, x_21, x_22, · · · , x_MN)^T = (x_1, x_2, · · · , x_k, · · · , x_m)^T,
y = (y_11, y_12, · · · , y_1N, y_21, y_22, · · · , y_MN)^T = (y_1, y_2, · · · , y_k, · · · , y_m)^T,
I = (I_11, I_12, · · · , I_1N, I_21, I_22, · · · , I_MN)^T = (I_1, I_2, · · · , I_k, · · · , I_m)^T.
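The vector dynamics of Eq. (4) can be integrated numerically; a forward-Euler sketch with an assumed diagonal template (step size and example values are illustrative):

```python
import numpy as np

def sat(x):
    """Piecewise linear output of Eq. (3), applied element-wise."""
    return 0.5 * (np.abs(x + 1) - np.abs(x - 1))

def run_cnn(x0, T, I, dt=0.01, steps=5000):
    """Forward-Euler integration of x' = -x + T sat(x) + I (Eq. (4))."""
    x = x0.astype(float).copy()
    for _ in range(steps):
        x += dt * (-x + T @ sat(x) + I)
    return x

# tiny 2-cell example: with T = 2I the states converge to x = +/-2
T = np.array([[2.0, 0.0], [0.0, 2.0]])
I = np.zeros(2)
x = run_cnn(np.array([0.3, -0.2]), T, I)
print(np.sign(x).tolist())  # [1.0, -1.0]
```

Each cell is driven into the saturation range whose sign matches its initial state, illustrating how equilibria encode binary outputs.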
The matrix T = [T_ij] ∈ R^{m×m} is the template matrix composed of row vectors whose elements are zero when the corresponding cells have no connections. The state vector to be memorized by the CNN corresponds to a stable equilibrium point of the system of differential equations in Eq. (4). Here, Eq. (4), which is a system of interval linear equations, has a number of asymptotically stable equilibrium points. We can make the network memorize patterns by making the patterns correspond to the asymptotically stable equilibrium points. Following Liu and Michel [3], we are given q vectors α^1, α^2, · · · , α^q ∈ {x ∈ R^m : x_i = 1 or −1, i = 1, · · · , m} which are to be stored as reachable memory vectors for CNNs, and then assume vectors β^1, β^2, · · · , β^q such that:

β^i = kα^i,   (5)

where the real number k is an equilibrium point arrangement coefficient and the β^i (i = 1, · · · , q) are asymptotically stable equilibrium points in each cell. It is evident that the output vectors are the α^i. Therefore, the CNN designed to have α^1, α^2, · · · , α^q as memory vectors has a template T and a threshold vector I which satisfy the following Eq. (6) simultaneously:

−β^1 + T α^1 + I = 0,
−β^2 + T α^2 + I = 0,
...
−β^q + T α^q + I = 0.   (6)

Here we set the following matrices:

Y = (α^1 − α^q, α^2 − α^q, · · · , α^{q−1} − α^q),
Z = (β^1 − β^q, β^2 − β^q, · · · , β^{q−1} − β^q).

We have

Z = T Y,   (7)

I = β^q − T α^q.   (8)

Under Eq. (5), in order for the CNNs to have the α^i as memory vectors, it is necessary and sufficient to have a template matrix T and threshold vector I which satisfy Eqs. (7) and (8).
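A numerical sketch of this design procedure, assuming a fully connected template (no sparsity constraint) and solving Z = TY via the pseudo-inverse (minimum-norm least squares); note that asymptotic stability of the resulting equilibria is not checked here:

```python
import numpy as np

def design_cnn(alphas, k):
    """Compute template T and threshold I from memory vectors alpha^i
    via Eqs. (5), (7), (8): beta^i = k alpha^i, Z = T Y, I = beta^q - T alpha^q."""
    A = np.array(alphas, dtype=float).T     # columns are the alpha^i
    B = k * A                               # beta^i = k alpha^i   (Eq. 5)
    Y = A[:, :-1] - A[:, [-1]]              # columns alpha^i - alpha^q
    Z = B[:, :-1] - B[:, [-1]]              # columns beta^i - beta^q
    T = Z @ np.linalg.pinv(Y)               # solve Z = T Y        (Eq. 7)
    I = B[:, -1] - T @ A[:, -1]             # threshold            (Eq. 8)
    return T, I

alphas = [np.array([1, -1, 1]), np.array([-1, 1, 1])]
T, I = design_cnn(alphas, k=2.0)
# each memory pattern must satisfy -beta + T alpha + I = 0  (Eq. 6)
for a in alphas:
    print(np.allclose(T @ a + I, 2.0 * a))  # True
```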
3 Multivalued Function for the CNN
In this section, we propose a design method of the multi-valued output function for associative memory CNNs. We first introduce some notation which shows how to relate Eq.(3) to the multi-valued output function. The output function of Eq.(3) consists of a saturation and non-saturation range. We define the structure of the output function such that the length of the non-saturation range is L, the
546
Z. Zhang et al.
length of the saturation range is cL, and the saturated level is |y| = H, which is a positive integer (refer to Figure 3). Moreover, we assume equilibrium points at |xe| = kH. Here, Eq. (3) can be rewritten as follows:

y = (H/L)(|x + L/2| − |x − L/2|). (9)
Then, by the above definition, the equilibrium point arrangement coefficient is expressed as k = (L/2 + cL)/H. When H = 1, L = 2, c > 0, Eq. (9) is equal to Eq. (3). We will call the waveform of Figure 3 (a) a “basic waveform”. Next we give the theorem for designing the output function.

Theorem 1. Both L > 0 and c > 0 are necessary conditions for convergence to an equilibrium point.

Proof. We consider the cell model Eq. (1) with r = 0, I = 0. The cell behaves according to the following differential equation:

ẋ = −x + ky.
(10)
Fig. 3. Design procedure of the multivalued output function. (a) shows a basic waveform, (b) shows the multivalued output function which is formed from (a).
Multi-Valued CNNs and Its Application for Associative Memory    547

In the range |x| < L/2, the output value of a cell is y = (2H/L)x (refer to Figure 3 (a)). Eq. (10) is then expressed as:

ẋ = −x + k(2H/L)x. (11)
The solution of this equation is:

x(t) = x0 e^((2kH/L − 1)t), (12)
where x0 is the initial value at t = 0. The exponent in Eq. (12) must satisfy 2kH/L − 1 > 0 for the state to transit from the non-saturation range to the saturation range. Here, by the above definition, the equilibrium point arrangement coefficient is expressed as:

k = (c + 1/2)(L/H), (13)

Substituting Eq. (13) into the exponent gives 2kH/L − 1 = 2c, so the parameter condition c > 0 is obtained from Eqs. (12) and (13). In the range L/2 ≤ |x| ≤ kH, the output value of a cell is y = ±H. Then Eq. (10) becomes:

ẋ = −x ± kH,
(14)
The solution of this equation is:

x(t) = ±kH + (x0 ∓ kH)e^(−t). (15)

When t → ∞, Eq. (15) gives xe = ±kH, which requires L ≠ 0 in Eq. (13). The following conditions are thus derived:

L > 0 ∧ c > 0.
(16)
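To make the proof concrete, this short simulation (our illustration, with the toy values H = 1, L = 2, c = 1, so k = 3 by Eq. (13)) integrates the cell dynamics and confirms convergence from the non-saturation range to the equilibrium xe = kH:

```python
# Toy check of Theorem 1 (our illustration, not from the paper):
# with H = 1, L = 2, c = 1 the coefficient is k = (c + 1/2) L/H = 3,
# and a cell started inside the non-saturation range should escape it
# and settle at the equilibrium x_e = k*H = 3.

H, L, c = 1.0, 2.0, 1.0
k = (c + 0.5) * L / H                     # Eq. (13)

def sat(x):
    # Basic waveform, Eq. (9): linear for |x| < L/2, saturated at +/-H
    return (H / L) * (abs(x + L / 2) - abs(x - L / 2))

x, dt = 0.1, 0.01                         # initial state in the linear range
for _ in range(3000):
    x += dt * (-x + k * sat(x))           # Euler step of the cell dynamics, Eq. (10)

assert abs(x - k * H) < 1e-3              # settled at x_e = kH
```

With c = 0 the exponent 2c in Eq. (12) vanishes and the state never leaves the non-saturation range, which is exactly what the theorem rules out.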
Secondly, we give the method of constructing the multi-valued output function based on the basic waveform. The saturation ranges with n levels are generated by adding n − 1 basic waveforms. Therefore, the n-valued output function satn(·) is expressed as follows:

satn(x) = (H / ((n − 1)L)) Σi (−1)^i (|x + Ai| − |x − Ai|), (17)

where
Ai = Ai−1 + 2cL (i odd),  Ai = Ai−1 + L (i even).

Here, i and k are defined as follows: for n odd, i = 0, 1, . . . , n − 2, A0 = L/2, and k = (n − 1)(c + 1/2)(L/H); for n even, i = 1, 2, . . . , n − 1, A1 = cL, and k = (n − 1)(2c + 1)(L/(2H)). Figure 4 shows the output waveforms which result from Eq. (17). The results demonstrate the validity of the proposed method, because saturation ranges with n levels have been produced in the n-valued output function satn(·).
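The superposition of basic waveforms described above can be sketched as follows. This is our own construction rather than a transcription of Eq. (17): each of the n − 1 shifted copies contributes a step of height 2H/(n − 1), and centers spaced L + cL apart reproduce plateaus of width cL and transitions of width L.

```python
import numpy as np

# Sketch (our construction, consistent with the plateau/transition
# geometry in the text): build an n-level output function by summing
# n-1 shifted copies of the basic waveform of Eq. (9).

def basic(x, L, H):
    """Basic waveform, Eq. (9)."""
    return (H / L) * (np.abs(x + L / 2) - np.abs(x - L / 2))

def sat_n(x, n, L, H, c):
    step = H / (n - 1)                       # amplitude of each copy
    centers = (np.arange(n - 1) - (n - 2) / 2) * (L + c * L)
    return sum(basic(x - ctr, L, step) for ctr in centers)

# Five saturation levels {-H, -H/2, 0, H/2, H}, parameters as in Fig. 4
x = np.linspace(-8, 8, 321)
y = sat_n(x, 5, 0.5, 1.0, 1.0)
```

For n = 2 the sum reduces to a single basic waveform, recovering the two-level case; for n = 5 with L = 0.5, c = 1.0 the plateaus sit at 0, ±0.5 and ±1, matching the five levels shown in Figure 4 (d).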
Fig. 4. The output waveforms of the saturation function. (a), (b), (c), and (d) represent, respectively, sat2, sat3, sat4 and sat5. Here, the parameters of the multivalued function are set to L = 0.5, c = 1.0.
4
Computer Experimentation
In this section, a computer experiment is conducted using numerical software in order to show the effectiveness of the proposed method.

Fig. 5. The memory patterns for the computer experiment. These random patterns of 5 rows and 5 columns have elements in {−2, −1, 0, 1, 2}, and are used for creation of the associative memory.
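Patterns like those in Figure 5, and the noisy initial states used in the recall procedure of Section 4.1, can be generated as in this sketch (ours; the value of k is illustrative, and Eq. (18) and σ are taken from the text):

```python
import numpy as np

# Sketch (ours): 5x5 random memory patterns with elements in
# {-2, -1, 0, 1, 2} as in Figure 5, plus noisy initial states
# x0 = k * alpha + eps with eps ~ N(0, sigma^2), cf. Eq. (18).

rng = np.random.default_rng(7)
levels = np.array([-2, -1, 0, 1, 2])

patterns = rng.choice(levels, size=(8, 5, 5))    # P1..P8

k, sigma = 3.0, 1.0                              # k illustrative; sigma from the text
alpha = patterns[0].astype(float).ravel()        # one memory vector (m = 25)
x0 = k * alpha + rng.normal(0.0, sigma, size=alpha.size)   # noisy initial state
```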
4.1  Experimental Procedure
Fig. 6. Results of the computer experiments, plotting the mean recall rate (%) and mean recall time (steps) against parameter c: (a) the result when L = 0.1, (b) the result when L = 0.5, and (c) the result when L = 1.0.

For this memory recall experiment, the desired patterns to be memorized are fed into the CNN, which then recalls them by association. In this experiment, we use random patterns with five values as memory patterns, in order to generalize the results. To test recall, noise is added to the patterns shown in Figure 5, and the
resulting patterns are used as initial patterns. The initial patterns are represented as follows:

x0 = kαi + ε, (18)

where αi ∈ {x ∈ ℝm : xi = −H, −H/2, 0, H/2, or H, i = 1, · · · , m}, and ε ∈ ℝm is a noise vector drawn from the normal distribution N(0, σ²). These initial patterns are presented to the CNN, and the output is evaluated to see whether the memorized patterns are recalled correctly. The number of correct recalls is then converted into a recall probability, which is used as the CNN’s performance measure. The parameter L of the output function is set in turn to L = 0.1, L = 0.5, and L = 1.0, and parameter c is varied in steps of 0.5 over the range 0 to 10. Moreover, the noise level is held constant at σ = 1.0, and the experiments are repeated for 100 trials at each parameter combination (L, c). 4.2
Experimental Results
Figure 6 shows the results of the experiments. Each figure shows the relationship between the parameter c and both the recall probability and the recall time. The horizontal axis is parameter c and the vertical axes are the mean recall rate (the mean recall probability, %) and the mean recall time (measured in time steps). As can be seen in the experimental results, the recall rate increases as parameter c increases. The reason is that c is the parameter which determines the size of a convergence range; therefore, the mean recall rate improves as c increases. On the other hand, if the length L of the non-saturation range is short, convergence to the right equilibrium point becomes difficult because the distance between equilibrium points is small. Accordingly, as shown in Figure 6 (a), the mean recall rate is lower than in Figures 6 (b) and (c). Therefore, the lengths of the saturation range and the non-saturation range need to be set at a suitable ratio. Moreover, in order for each cell to converge to the equilibrium points, both c > 0 and L > 0 must hold.
5
Conclusions
In this paper, we proposed a novel design method for the multi-valued output function of CNNs used as associative memories, and conducted a computer experiment with five-valued random patterns. Memorization of the multi-valued patterns was successful, and the results showed the validity of our method. The method requires only two parameters, L and c, which must satisfy L > 0 and c > 0, because saturation and non-saturation ranges of nonzero length are required for allocating the equilibrium points. When noise is added to the initial pattern, these parameters affect the recall probability and recall time; therefore, their optimal values change according to the noise level. Future research will focus on creating multi-valued output functions with more than five values, and on evaluating their performance with the CNN. Moreover, we will apply the CNNs in an abnormality detection system.
References
1. Chua, L.O., Yang, L.: Cellular Neural Networks: Theory. IEEE Trans. Circuits Syst. 35(10) (1988) 1257-1272
2. Chua, L.O., Yang, L.: Cellular Neural Networks: Applications. IEEE Trans. Circuits Syst. 35(10) (1988) 1273-1290
3. Liu, D., Michel, A.N.: Cellular Neural Networks for Associative Memories. IEEE Trans. Circuits Syst. 40(2) (1993) 119-121
4. Zhang, Z., Namba, M., Kawabata, H.: Cellular Neural Networks and Its Application for Abnormal Detection. T.SICE 39(3) (2003) 209-217
5. Tetzlaff, R. (Ed.): Cellular Neural Networks and Their Applications. World Scientific (2002)
6. Kanagawa, A., Kawabata, H., Takahashi, H.: Cellular Neural Networks with Multiple-Valued Output and Its Application. IEICE Trans. E79-A(10) (1996) 1658-1663
7. Yokosawa, K., Nakaguchi, T., Tanji, Y., Tanaka, M.: Cellular Neural Networks with Output Function Having Multiple Constant Regions. IEEE Trans. Circuits Syst. 50(7) (2003) 847-857
Emergence of Topographic Cortical Maps in a Parameterless Local Competition Network A. Ravishankar Rao, Guillermo Cecchi, Charles Peck, and James Kozloski IBM T.J. Watson Research Center Yorktown Heights, NY 10598, USA
[email protected],
[email protected]

Abstract. A major research problem in the area of unsupervised learning is the understanding of neuronal selectivity and its role in the formation of cortical maps. Kohonen devised a self-organizing map algorithm to investigate this problem, which achieved partial success in replicating biological observations. However, a problem in using Kohonen’s approach is that it does not address the stability-plasticity dilemma, as the learning rate decreases monotonically. In this paper, we propose a solution to cortical map formation which tackles the stability-plasticity problem, where the map maintains stability while enabling plasticity in the presence of changing input statistics. We adapt the parameterless SOM (Berglund and Sitte 2006) and also modify Kohonen’s original approach to allow local competition in a larger cortex, where multiple winners can exist. The learning rate and neighborhood size of the modified Kohonen’s method are set automatically based on the error between the local winner’s weight vector and its input. We used input images consisting of lines of random orientation to train the system in an unsupervised manner. Our model shows large-scale topographic organization of orientation across the cortex, which compares favorably with cortical maps measured in visual area V1 in primates. Furthermore, we demonstrate the plasticity of this map by showing that the map reorganizes when the input statistics are changed.
1
Introduction
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 552–561, 2007.
© Springer-Verlag Berlin Heidelberg 2007

Emergence of Topographic Cortical Maps    553

A major research problem in the area of unsupervised learning in neural networks is the understanding of neuronal selectivity and the formation of cortical maps [2][pg. 293]. In the vertebrate brain, in areas such as the visual cortex, individual neurons have been found to be selective for different visual cues such as ocular dominance and orientation [10]. Furthermore, these selective neurons are arranged in an orderly 2D fashion known as a cortical map [2][pg. 293], and such maps have been observed and extensively studied in the primate cortex [4]. A natural question is to ask how such maps are formed, and what are the underlying computational processes at work. Understanding cortical map formation is a central problem in computational neuroscience, and impacts our ability to
understand processes operating across the entire brain. Since the visual cortex is the best studied, we will restrict our attention to visual phenomena, and the formation of orientation maps in particular [4]. We pose the following requirements that a computational model of cortical map formation should satisfy.

– The model should use biologically realistic visual inputs as stimuli rather than abstract variables representing features of interest such as orientation or ocular dominance. The cortical units in the model should learn their synaptic weights.
– The model should exhibit the formation of stable maps that resemble experimental measurements such as those made in [4]. Some features of observed cortical orientation maps include pinwheels, fractures and linear zones.
– The model should be able to address the stability-plasticity dilemma. Though the cortical maps are stable, they retain plasticity when the statistics of the input space are changed, such as through a change in cortical connectivity or input statistics [1,11]. This allows the cortical map to faithfully represent the external world.
– The model should involve as little parameterization as possible. This requirement allows the model to be widely applicable under different conditions, such as different input spaces and different sizes of the cortical maps.

Many computational theories have been developed to explain the formation of such cortical maps [9], especially the formation of orientation columns. However, no single model appears to satisfactorily meet the requirements described above. For instance, Carreira-Perpinan et al [5] use abstract variables for input, such as orientation and frequency, which are not derived from real images. Miikkulainen et al [3] describe a method for self-organization to obtain orientation columns in a simulated patch of cortex. However, their method requires significant parameterization and the use of a carefully applied schedule [7].
The main contribution of this paper is to demonstrate how Kohonen’s self-organizing map (SOM) algorithm can be modified to employ only local competition, and then combined with a recently published technique to eliminate the traditional parameterization required [8]. This combination is novel, and achieves the formation of realistic cortical orientation maps with inputs consisting of visual images of randomly oriented lines. Furthermore, the cortical map is plastic, as we demonstrate by changing the statistics of the input space multiple times, by varying the statistical distribution of orientations. If the input statistics are constant, the map converges to a stable representation, as defined by an error measure. This effectively addresses the stability-plasticity problem. Our model is computationally simple, and its behavior is intuitive and easy to understand and verify. For these reasons, it meets all the imposed requirements, and hence should prove to be a useful technique for practitioners in computational neuroscience.
554    A.R. Rao et al.

2  Background
Kohonen’s self-organizing map (SOM) has been widely used in a number of domains [6]. An area where it has had considerable impact is computational neuroscience, in the modeling of the formation of cortical maps [9,7]. The traditional Kohonen SOM requires the use of a schedule to gradually reduce the neighborhood size over which weight updates are applied, and to reduce the learning rate. This requires careful modification of these key parameters over the course of operation of the algorithm. For instance, Bednar has shown the formation of cortical orientation maps through the use of a rigid schedule [7]. Recently, Berglund and Sitte [8] presented a technique for automatically selecting the neighborhood size and learning rate based on a measure of the error of fit. Though they did not state it, it appears quite plausible that such a computation can be carried out by the cortex, as it is a local computation. All that is required is that the error between the weight vector (synaptic weights) and the input vector be computed. This allows the neuron to adjust its learning rate over time. The role of inputs is critical in the process of self-organization. Hubel et al [10] showed that rather than being genetically predetermined, the structure of cortical visual area V1 undergoes changes depending on the animal’s visual experience, especially during the critical period of development. Sharma et al [13] showed that rewiring the retinal output to the auditory cortex instead of the visual cortex resulted in the formation of orientation-selective columns in the auditory cortex. It is thus likely that the same self-organization process is taking place in different areas of the cortex. The nature of the cortical maps then becomes a function of the inputs received. In order to demonstrate this cortical plasticity, we have created a computational model that responds to changing input space statistics. Certain classes of inputs are sufficient to model V1.
For instance, Bednar [7] used input stimuli consisting of elongated Gaussian blobs. Other researchers have used natural images [12] as inputs to self-organizing algorithms. In this paper, we use sine-wave gratings of random orientation for the sake of simplicity, and to demonstrate the essential characteristics of our solution.
3
Experimental Methods
We model the visual pathway from the retina to the cortex as shown in Figure 1. The retina projects to the lateral geniculate nucleus (LGN), which in turn projects to the cortex. There are two channels in the LGN, which perform on-center and off-center processing of the visual input. The cortical units are interconnected through a lateral network which is responsible for spreading the weights of the winner. 3.1
Algorithm for Weight Updates
A significant contribution of this paper is to provide a natural extension of Kohonen’s algorithm to allow local competition in a larger cortex, such that multiple
Fig. 1. Illustrating the network connectivity. (A) The input units are arranged in a two-dimensional grid, and can be thought of as image intensity values. The cortical units also form a 2D grid. Each input unit projects via the LGN in a feedforward topographic manner to the cortical grid. (B) shows the lateral connectivity in the cortex.
winners are possible. In the traditional Kohonen algorithm, the output layer is fully connected, and all the output units receive the same input. There is only one global winner in this case. We have modified the algorithm such that there is limited connectivity between output units, and each output unit receives input from a restricted area of the retinal input. This allows the possibility of multiple winners in the output layer. Learning is driven by winners in local neighborhoods, determined by the extent of lateral connectivity. A simple Hebbian rule is used to update synaptic weights. The basic operation of the network is as follows. Let X1 denote the input vector from the on-center LGN channel and X2 the input vector from the off-center LGN channel to a cortical unit. Each cortical unit receives projections from only a restricted portion of the LGN. Let w1ij denote a synaptic weight, which represents the strength of the connection between the ith on-center LGN unit and the jth unit in the cortex. Similarly, w2ij represents weights between the off-center LGN and cortical units. The output yj of the jth cortical unit is given by

yj = Σ(i∈Lj) w1ij X1i + Σ(i∈Lj) w2ij X2i (1)
Here the cortical unit combines the responses from the two LGN channels, and Lj is the neighborhood of LGN units that project to this j th cortical unit. The next step is for each cortical unit to determine whether it is a winner within its local neighborhood. Let Nj denote the local cortical neighborhood of the j th cortical unit (which excludes the j th unit). Let m index the cortical units within Nj . Thus, unit j is a local winner if ∀m ∈ Nj , yj > ym
(2)
This is a local computation for a given cortical unit. Once the local winners are determined, their weights are updated to move them closer to the input vector. If cortical unit j is the winner, the update rule is w1ij ← w1ij + μ(X1i − w1ij )
(3)
where i indexes those input units that are connected to the cortical unit j, and μ is the learning rate. μ is typically set to a small value, so that the weights are incrementally updated over a large set of input presentations. A similar rule is used to learn w2ij . In addition, the weights of the cortical units within the neighborhood Nj , denoted by the index m, are also updated to move closer to their inputs, but with a weighting function f (d(j, m)), where d(j, m) is the distance from the unit m to the local winner j. This is given by w1im ← w1im + f (d(j, m))μ [X1i − w1im ]
(4)
Finally, the incident weights at each cortical unit are normalized. The cortical dynamics and learning are thus based on Kohonen’s algorithm. Typically, the size of the neighborhood Nj and the learning rate μ are gradually decreased according to a schedule such that the resulting solution is stable. However, this poses a problem in that the cortex cannot remain plastic, as the learning rate and neighborhood size for the weight updates may become very small over time. One of the novel contributions of this paper is to solve this stability-plasticity dilemma in cortical maps through an adaptation of the parameterless SOM technique of Berglund and Sitte [8]. Their formulation called for the adjustment of parameters based on a normalized error measure between the winner’s weight vector and the input vector. We modify their formulation as follows. First, since there can be multiple local winners in the cortex, we compute an average error measure. Second, we use temporal smoothing based on a trace learning mechanism. This ensures that the learning rate is varied smoothly. There appears to be biological support for trace learning, as pointed out by Wallis [14]. Let εn(i) denote the error measure at the ith cortical winner out of a total of Mn winners at the nth iteration:

εn(i) = ‖W1i − X1i‖ + ‖W2i − X2i‖, (5)

where ‖·‖ denotes the L2 norm. The average error measure is defined as follows:

ε(n) = (1/Mn) Σi εn(i). (6)

Let

r(n) = max(ε(n), r(n − 1)), (7)

where

r(0) = ε(0). (8)

The normalized average error measure is then defined to be

ε̃(n) = ε(n)/r(n). (9)

The time-averaged error measure η(n) is defined by the following trace equation:

η(n) = κ ε̃(n) + (1 − κ) η(n − 1). (10)

We used κ = 0.05 in our simulation. The learning rate μ and neighborhood size N were varied as follows:

μ = μ0 η(n);  N = N0 η(n), (11)
where μ0 = 0.05 and N0 = 15. The rationale behind Equation 11 is that the learning rate and neighborhood size decrease as the error between the winners’ weight vectors and input vectors decreases, which happens while a stable representation of an input space is being learnt. If the input space statistics change, the error increases, causing the learning rate and neighborhood size to increase. This allows learning of the new input space. This effectively solves the stability-plasticity dilemma. 3.2
Network Configuration
We used an input layer consisting of 30x30 retinal units. The images incident on this simulated retina consisted of sinusoidal gratings of random orientation and phase. The LGN was the same size as the retina. A radius of r = 9 was used to generate a topographic mapping from the LGN into the cortex. We modeled the cortex with an array consisting of 30x30 units. The intra-cortical connectivity was initialized with a radius of rCC = 15. For the weight updates, the function f was chosen to be a Gaussian that tapers to approximately zero at the boundary of the local neighborhood, i.e., at rCC. The learning rules in Section 3.1 were applied to learn the afferent weights. The learning rate μ was set as in Equation 11. The entire simulation consisted of 100,000 iterations. In order to test cortical plasticity, we varied the statistics of the input space as follows. For the first 33,000 iterations, we used sinusoidal gratings of random orientation. From iteration 33,000 to 66,000 we changed the inputs to be purely horizontal. Then from iteration 66,000 to 100,000 we changed the inputs back to gratings of random orientation. In order to contrast the behavior of the parameterless SOM, we also show the results of running the same learning algorithm with a modified Kohonen algorithm that allows local winners and follows a fixed schedule of exponentially decaying learning rates and neighborhood sizes.
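The parameterless update loop of Eqs. (5)-(11), combined with the local-winner rule of Eqs. (1)-(4), can be sketched as follows. This is our toy illustration, not the authors' code: the grid, input dimension and N0 are scaled down from the values above so the example runs quickly, and the single afferent channel and the exact Gaussian taper are our simplifications.

```python
import numpy as np

# Hedged sketch (ours) of the parameterless local-competition updates,
# Eqs. (5)-(11), on a toy grid.  The paper uses 30x30 grids, two LGN
# channels and N0 = 15; here everything is scaled down.

rng = np.random.default_rng(0)
SIZE, D = 10, 16                    # cortical grid SIZE x SIZE, afferent dim D
MU0, N0, KAPPA = 0.05, 4.0, 0.05    # mu0 as in the paper; N0 shrunk for the toy grid

W = rng.random((SIZE, SIZE, D))     # afferent weights, one channel for brevity
W /= np.linalg.norm(W, axis=2, keepdims=True)
eta, r = 1.0, 0.0                   # trace eta(n) and running maximum r(n)

def local_winners(y, radius):
    """Units whose activation is the unique maximum within `radius` (Eq. 2)."""
    wins = []
    for j in range(SIZE):
        for k in range(SIZE):
            patch = y[max(0, j - radius):j + radius + 1,
                      max(0, k - radius):k + radius + 1]
            if y[j, k] >= patch.max() and (patch == patch.max()).sum() == 1:
                wins.append((j, k))
    return wins

for n in range(100):
    x = rng.random(D)                         # toy input (stands in for the LGN)
    y = W @ x                                 # activations, cf. Eq. (1)
    winners = local_winners(y, radius=2)
    if not winners:
        continue
    # Eqs. (5)-(6): average weight-input error over the M_n local winners
    eps = float(np.mean([np.linalg.norm(W[j, k] - x) for j, k in winners]))
    r = max(r, eps)                           # Eqs. (7)-(8)
    eta = KAPPA * (eps / r) + (1 - KAPPA) * eta        # Eqs. (9)-(10)
    mu, rad = MU0 * eta, max(1, int(round(N0 * eta)))  # Eq. (11)
    for (j, k) in winners:                    # Eqs. (3)-(4) with Gaussian taper f
        for dj in range(-rad, rad + 1):
            for dk in range(-rad, rad + 1):
                jj, kk = j + dj, k + dk
                if 0 <= jj < SIZE and 0 <= kk < SIZE:
                    f = np.exp(-(dj ** 2 + dk ** 2) / (2.0 * rad ** 2))
                    W[jj, kk] += f * mu * (x - W[jj, kk])
                    W[jj, kk] /= np.linalg.norm(W[jj, kk])  # normalize afferents
```

Because η(0) = 1 and ε̃(n) ≤ 1 by construction, η stays in (0, 1], so the effective learning rate never exceeds μ0; it shrinks as the map stabilizes and grows again if the input statistics change.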
4  Experimental Results
We present the results in the form of a map of the receptive fields of cortical units. The receptive field is shown as a grayscale image of the weight matrix incident on each cortical unit. In order to save space, we show the weight matrices connecting only the on-center LGN channels to the cortex. (The weight matrices of the off-center LGN channels appear as inverse images of the on-center channels.) Figures 2–4 demonstrate that the modified parameterless SOM exhibits plasticity in accommodating changing input statistics, whereas a scheduled SOM is non-plastic.
Fig. 2. Map of receptive fields for each cortical unit. Only the on-center LGN channel is shown. This is the result after 33,000 presentations of sinusoidal gratings of random orientation. Note that the receptive fields show typical organization that is seen in biologically measured cortical orientation maps [4]. Features that are present in this map are pinwheels, fractures and linear zones. (A) Shows the modified parameterless SOM. (B) Shows the map with a traditional schedule.
Figure 5 shows how the error measure ε generally decreases as the iteration number increases. As can be seen, ε suddenly increases when the input statistics are changed. This causes an increase in the learning rate and the size of the neighborhood.1 The map eventually settles to a stable configuration at 100,000 iterations (when the simulation was terminated) as ε becomes small. Thus we have demonstrated stability of the cortical map through a decreasing error measure.
We note that the input disturbances are introduced before the maps have converged, as these two factors are independent of each other. In other words, the input disturbance does not have any knowledge of the configuration or stability of the map.
Fig. 3. Map of receptive fields for each cortical unit at 66,000 iterations. This shows the cortical map after the input statistics were changed at iteration number 33,000, such that only horizontal lines were presented. (A) With the modified parameterless SOM, the receptive fields of the cortical units are purely horizontal now, reflecting an adaptation to the input space. (B) However, the traditional SOM with a schedule fails to adapt to the new input space. Very few receptive fields have changed to represent horizontal lines.

Fig. 4. Map of receptive fields for each cortical unit at 100,000 iterations. The input statistics were changed again at 66,000 iterations to create lines of random orientation. (A) With the modified parameterless SOM, the receptive fields of the cortex now contain lines of all orientations in a characteristic pattern as observed in Figure 2. (B) The traditional SOM following a schedule continues to retain its original properties, and does not exhibit plasticity.
Fig. 5. A plot of the logarithm of the error measure, ln(ε), as a function of the number of iterations. The input statistics are changed twice, as indicated by the arrow marks.
5
Conclusions
In this paper, we developed a systematic approach to modeling cortical map formation in the visual cortex. We presented a solution that satisfies the following key requirements: self-organization is driven by visual image input; the cortical map converges to a stable representation, and yet exhibits plasticity to accommodate changes in input statistics. Furthermore, our computational approach is simple and involves minimal parameterization, which lends itself to easy experimentation. Our solution is based on modifying the traditional Kohonen SOM to use localized lateral connectivity that results in local winners, and to use the parameterless SOM [8] to solve the stability-plasticity problem. This combination of techniques is novel in the literature. We demonstrated the power of our solution by varying the input statistics multiple times. Each time, the cortical map exhibited the desired plasticity, and converged to a stable representation of the input space. The significance of this result is that it shows how a modified Kohonen SOM can be used to explain the dual phenomena of cortical map formation and cortical plasticity. By bringing together these two capabilities in a simple model, we pave the way for more complex models of cortical function involving multiple maps.
References 1. Carpenter, G.A., Grossberg, S.: The ART of Adaptive Pattern Recognition by a Self-organizing Neural Network. Computer 21(3) (1988) 77-88 2. Dayan, P., Abbott, L.F.: Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, Cambridge, MA (2001)
3. Miikkulainen, R., Bednar, J.A., Choe, Y., Sirosh, J.: Computational Maps in the Visual Cortex. Springer, Berlin (2005)
4. Obermayer, K., Blasdel, G.: Geometry of Orientation and Ocular Dominance Columns in Monkey Striate Cortex. J. Neuroscience 13 (1993) 4114-4129
5. Carreira-Perpinan, M.A., Lister, R.J., Goodhill, G.J.: A Computational Model for the Development of Multiple Maps in Primary Visual Cortex. Cerebral Cortex 15 (2005) 1222-1233
6. Kohonen, T.: The Self-organizing Map. Proceedings of the IEEE 78(9) (1990) 1464-1480
7. Bednar, J.A.: Learning to See: Genetic and Environmental Influences on Visual Development. PhD thesis, Department of Computer Sciences, The University of Texas at Austin (2002). Technical Report AI-TR-02-294
8. Berglund, E., Sitte, J.: The Parameterless Self-organizing Map Algorithm. IEEE Trans. Neural Networks 17(3) (2006) 305-316
9. Erwin, E., Obermayer, K., Schulten, K.: Models of Orientation and Ocular Dominance Columns in the Visual Cortex: A Critical Comparison. Neural Computation 7(3) (1995) 425-468
10. Hubel, D.H., Wiesel, T.N., LeVay, S.: Plasticity of Ocular Dominance Columns in Monkey Striate Cortex. Phil. Trans. R. Soc. Lond. B 278 (1977) 377-409
11. Buonomano, D.V., Merzenich, M.M.: Cortical Plasticity: From Synapses to Maps. Annual Review of Neuroscience 21 (1998) 149-186
12. Hyvärinen, A., Hoyer, P.O., Hurri, J.: Extensions of ICA as Models of Natural Images and Visual Processing. Nara, Japan (2003) 963-974
13. Sharma, J., Angelucci, A., Sur, M.: Induction of Visual Orientation Modules in Auditory Cortex. Nature 404 (2000) 841-847
14. Wallis, G.: Using Spatio-temporal Correlations to Learn Invariant Object Recognition. Neural Networks (1996) 1513-1519
Graph Matching Recombination for Evolving Neural Networks Ashique Mahmood, Sadia Sharmin, Debjanee Barua, and Md. Monirul Islam Bangladesh University of Engineering and Technology, Department of Computer Science and Engineering, Dhaka, Bangladesh
[email protected], {aumi buet,rakhee buet}@yahoo.com,
[email protected] http://www.buet.ac.bd/cse
Abstract. This paper presents a new evolutionary system using a genetic algorithm for evolving artificial neural networks (ANNs). Existing genetic algorithms (GAs) for evolving ANNs suffer from the permutation problem. Frequent and abrupt recombination in GAs also has a very detrimental effect on the quality of offspring. On the other hand, Evolutionary Programming (EP) does not use a recombination operator at all. The proposed algorithm introduces a recombination operator based on a graph matching technique to adapt the structure of ANNs dynamically and to avoid the permutation problem. The complete algorithm is designed to avoid frequent recombination and to reduce behavioral disruption between parents and offspring. The evolutionary system is implemented and applied to three medical diagnosis problems - breast cancer, diabetes and thyroid. The experimental results show that the system can dynamically evolve compact ANN structures while remaining competitive in performance.
1
Introduction
Stand-alone weight learning of artificial neural networks (ANNs) with fixed structures is not adequate for many real-world problems. The success of solving problems by ANNs largely depends on their structures. A fixed structure may not contain the optimal solution in its search space. Therefore, learning only the weights may result into a solution that is convergent to a local optimum. On the other hand, devising an algorithm that searches an optimal structure for a given problem is a very challenging task. Many consequent problems have arisen [1],[2] in constructing ANNs, many of which are yet unresolved. Thus designing an optimal structure remains a challenge to the ANN researchers for decades. Genetic algorithms (GAs) [3] and evolutionary programming (EPs) [4] both have been applied for evolving ANNs. GA-based approaches rely on dual representation [5], one (phenotypic) for applying training algorithm like Back Propagation (BP) for weight adaptation, another (genotypic) for structural evolution. This duality introduces a deceptive mapping problem between genotype and D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 562–568, 2007. c Springer-Verlag Berlin Heidelberg 2007
Graph Matching Recombination for Evolving Neural Networks
phenotype, namely the permutation problem, or many-to-one mapping problem [2]. This problem remains unresolved. Moreover, frequent and abrupt recombination between parents in a GA process has a very detrimental effect on the quality of offspring. Frequent recombination breaks ANN structures before they are trained to maturity, which is essential before evaluating them. Abrupt recombination between ANN structures drastically affects the performance gain already built up and makes it difficult to rebuild. Many recent efforts take EP approaches [6]. EP-based approaches do not use a dual representation, thus avoiding the permutation problem, and rely only on statistical perturbation as their sole reproductive operator. Since this statistical perturbation is mutational in nature, the direction of evolution extracted from the recombination operation in GA-based approaches is absent here. This paper presents a GA-based approach with a permutation-problem-free recombination operator for evolving ANNs. A genotypic representation and an accompanying recombination operator are introduced. To make the operator free of the permutation problem, it uses a graph matching technique between the two parent ANNs. Moreover, the GA process is designed to avoid the detrimental effect of frequent and abrupt recombination in traditional GA-based approaches.
2
Evolution of ANNs Using Graph Matching Recombination
The proposed algorithm is a GA-based approach for the evolution of ANNs. The encoding scheme and recombination operator are designed so that they avoid the permutation problem and the problems caused by frequent and abrupt recombination. The complete algorithm, though it uses a GA, does not recombine parents rigorously: ANNs that are learning at a good rate are not subject to recombination. This idea is taken from an EP approach, namely EPNet [6], to reduce the detrimental effect of restructuring immature ANNs through early recombination. The recombination operator itself also does not produce offspring that differ greatly from their parents. The algorithm is elaborated in the following.
Step 1. Generate an initial population of networks at random. Nodes and connections are generated uniformly at random within a certain range. Weights are also generated uniformly at random within a certain range.
Step 2. Train each network partially for a certain number of epochs using the back-propagation algorithm. After training, if the training error has not been significantly reduced, the network is marked 'failure'; otherwise it is marked 'success'.
Step 3. If the stopping criteria are met, stop the process and identify the best network. Otherwise, go to Step 4.
Step 4. Choose a network marked 'failure'. If no network marked 'failure' is left to choose, go to Step 2.
Step 5. Choose a network among the networks marked 'success' uniformly at random.
A. Mahmood et al.
Step 6. Set the first network (marked 'failure') as the weak parent (i.e., the one to be updated) and the second network (marked 'success') as the strong parent (the one that gives the update suggestions and towards which the first network is to be changed). Apply the encoding scheme, i.e., form AGs (defined later) for the ANNs, find the MG (defined later) from the AGs, and find a good maximal clique to identify their common subgraph. Use the clique in the recombination operator to generate an updated ANN. To perform the update, delete some unique nodes of the weak parent with a low probability, add some unique connections of the strong parent to the weak parent with a low probability, and delete a connection from their common part (mutation) with a very low probability.
Step 7. Partially train the resulting network for a certain number of epochs. If the resulting network performs better than its parent in terms of validation error, replace the parent with it; otherwise, discard the resulting network and keep the parent. Go to Step 4.
In this approach, an ANN (together with its topology and weights) is converted to a comparable graph structure, namely an attributed graph (AG) [7]. Using AGs, two different networks can be compared and their unique graph portions can be found. To compare two AGs, a match graph (MG) is formed [7]. An MG contains all the information about the similarity between the AGs. Each clique of the MG is a common subgraph of the AGs and hence a common subnet of the corresponding ANNs. Offspring are generated based on this subnet. The stopping criterion is met when two generations separated by a fixed number of generations both overfit.
2.1
ANN to AG Encoding
An AG is a graph in which vertices (and possibly edges) have attributes, and each vertex is associated with values of those attributes. The encoding is defined in Definition 1.
Definition 1. Convert an ANN to an AG < V, E, P > in such a way that:
– (V) Each connection of the ANN becomes a vertex of the AG.
– (E) If two connections of the ANN are incident on the same node, then there is an edge between the corresponding vertices of the AG. The edge is labeled with the layer number of the incident node.
– (P) Each vertex of the AG has two properties:
• the weight w associated with the corresponding connection of the ANN;
• a set S of ordered pairs < n, w >, each of which corresponds to the incidence of another connection with this connection; n is the layer number of the incident neuron and w is the weight of the adjacent connection.
An application of this encoding to an ANN is shown in Figure 1. Only the two hidden layers of the ANN are shown.
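A minimal sketch of the encoding in Definition 1 may help. The connection-list representation, node names and helper names below are illustrative assumptions, not from the paper:

```python
# Sketch of the ANN-to-AG encoding of Definition 1. The ANN is given as a
# list of connections (from_node, to_node, weight) plus a layer number for
# each node; all names here are illustrative assumptions.

def ann_to_ag(connections, layer_of):
    """Return (vertices, edges, props) of the attributed graph.

    vertices : one vertex per ANN connection (indexed by position)
    edges    : dict mapping (v1, v2) -> layer number of the shared node
    props    : per-vertex dict with the weight w and the set S of
               (layer, weight) pairs of adjacent connections
    """
    n = len(connections)
    edges = {}
    props = []
    for i, (a1, b1, w1) in enumerate(connections):
        S = set()
        for j, (a2, b2, w2) in enumerate(connections):
            if i == j:
                continue
            # connections incident on the same node -> AG edge (Definition 1, E)
            for node in {a1, b1} & {a2, b2}:
                S.add((layer_of[node], w2))
                if j > i:
                    edges[(i, j)] = layer_of[node]
        props.append({"w": w1, "S": S})
    return list(range(n)), edges, props

# Tiny example: nodes h1, h2 in hidden layer 1 and h3 in hidden layer 2.
conns = [("h1", "h3", 0.5), ("h2", "h3", -0.3)]
layers = {"h1": 1, "h2": 1, "h3": 2}
V, E, P = ann_to_ag(conns, layers)
# Both connections meet at h3 (layer 2), so the AG has one edge labeled 2.
```

On the two-connection toy ANN this yields a single AG edge labeled with layer number 2, since both connections are incident on the same layer-2 node.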
Fig. 1. An ANN (left) and its encoded AG (right) according to the scheme
2.2
AG to MG
A match graph (MG) is a graph formed on the basis of matches found between two different AGs [7]. The rules of matching depend on the particular definition used; one is given in Definition 2.
Definition 2. An MG is formed from two AGs (say, AG1 and AG2) with the following characteristics:
– the vertices of the MG are assignments between AG1 and AG2;
– an edge exists between two vertices of the MG if the corresponding assignments are compatible.
Assignments and compatibility of assignments are defined in Definitions 3 and 4.
Definition 3. A vertex v1 from AG1 and a vertex v2 from AG2 form an assignment if all of their attributes are similar.
Definition 4. One assignment a1 (and thus one vertex of the MG) between u1 from AG1 and v1 from AG2 is compatible with another assignment a2 between u2 from AG1 and v2 from AG2 if all the relationships between u1 and u2 in AG1 are compatible with the relationships between v1 and v2 in AG2.
For this problem instance, similarity and compatibility of relationships can be defined as in Definition 5.
Definition 5. Similarities and compatibility can be defined as:
Similarity of weight. Two weights of two vertices are similar if their absolute difference is below some threshold value.
Similarity of set S. Two S sets from two vertices are similar if any two ordered pairs from the sets are similar.
Similarity of ordered pairs. Two ordered pairs from two S sets are similar if their first values (layer numbers) are exactly the same and their second values (weights) are similar (similarity of weight).
Compatibility of edges. Two edges are compatible if they are labeled with the same (layer) number.
2.3
Clique
Finding the largest clique of a graph is NP-complete; an exhaustive search via backtracking provides the only exact solution. Here, a maximal clique is searched for using a simpler version of Qualex-MS [8], namely New-Best-In Weighted, a maximal-clique-finding approximation algorithm that finds a solution in polynomial time.
2.4
Recombination Step
Now that the similarity between the two ANNs has been found, the next step is to recombine them. To perform recombination, connections unique to AG1 (named the weak parent) are deleted from AG1, and connections unique to AG2 (the strong parent) are added to AG1. The resulting offspring is basically the ANN of AG1 with modifications directed towards AG2. Such modification can produce offspring lying on the structural loci from AG1 to AG2. If all offspring lie only on these loci, only limited classes of structures are allowed in the process, reducing the region of exploration. To overcome this tendency towards overcrowding, a connection from the similar portion is deleted from AG1 with a very low probability. This is a mutation step, which retains the diversity of the population.
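The recombination step can be sketched as follows. For brevity, the common part is found here by direct connection-id matching with a weight-similarity threshold instead of the full AG/MG/maximal-clique machinery, so all names, thresholds and probabilities are illustrative assumptions:

```python
import random

# Simplified sketch of the recombination step (Section 2.4). ANNs are
# represented as dicts {connection_id: weight}; the common part is found by
# direct id matching with a weight-similarity threshold instead of the full
# AG/MG/clique machinery, so this is an illustrative assumption.

def recombine(weak, strong, sim_thresh=0.2,
              p_del_unique=0.3, p_add=0.3, p_mut=0.05, rng=random):
    common = {c for c in weak if c in strong
              and abs(weak[c] - strong[c]) < sim_thresh}
    child = dict(weak)
    # delete some connections unique to the weak parent (low probability)
    for c in [c for c in weak if c not in common]:
        if rng.random() < p_del_unique:
            del child[c]
    # add some connections unique to the strong parent (low probability)
    for c in [c for c in strong if c not in common]:
        if rng.random() < p_add:
            child[c] = strong[c]
    # mutation: delete one common connection with very low probability
    if common and rng.random() < p_mut:
        del child[rng.choice(sorted(common))]
    return child

weak = {("i1", "h1"): 0.4, ("i2", "h1"): -0.9}
strong = {("i1", "h1"): 0.5, ("i1", "h2"): 0.7}
# Deterministic settings for the demo: drop every weak-unique connection,
# add every strong-unique connection, no mutation.
child = recombine(weak, strong, p_del_unique=1.0, p_add=1.0, p_mut=0.0)
```

With these deterministic settings the weak parent keeps its common connection, loses its unique one, and gains the strong parent's unique one, i.e. the offspring is moved towards the strong parent.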
3
Experimental Studies
Performance is evaluated on well-known benchmark problems: breast cancer, diabetes and thyroid. The datasets representing these problems were obtained from the UCI machine learning benchmark repository. The detailed descriptions of the datasets are available at ics.uci.edu (128.195.11) in the directory /pub/machine-learning-databases.
3.1
Experimental Setup
The experiment is standardized on the dataset, partitioning and benchmark rules according to Proben1 [9]. A population of 20 individuals is used. Each individual is a neural network with two hidden layers. The number of connections between the hidden layers is set uniformly at random to between 70% and 100% of that of the fully connected network. Initial weights are between −0.5 and 0.5. The number of training epochs is 100 for each run on each problem set. For each problem set, 10 runs are used to accumulate the results.
3.2
Results
Table 1 shows the accuracies for the problems over the training set, validation set and test set. Within each set, the first column is the error minimized by BP and the second column is the classification error rate.
Table 1. Mean, standard deviation, minimum and maximum values of training, validation and testing errors for different problems
Problem         Stat   Training Set         Validation Set       Test Set
                       error      rate      error      rate      error      rate
Breast Cancer   Mean   2.7667     0.0286    2.3837     0.0391    1.6512     0.0230
                SD     0.0662     0.0003    0.2342     0.0030    0.193      0.0019
                Min    2.0460     0.0200    1.8460     0.0229    1.14       0.0115
                Max    2.8960     0.0314    3.4340     0.0457    3.144      0.0345
Diabetes        Mean   13.9678    0.2063    16.1721    0.2170    16.9480    0.2458
                SD     0.7363     0.0182    0.3677     0.0101    0.6330     0.0169
                Min    10.23      0.1484    14.58      0.1927    15.5900    0.1875
                Max    21.92      0.3307    23.22      0.2552    23.0500    0.3646
Heart Disease   Mean   7.4503     0.0899    15.6445    0.1937    17.0669    0.2076
                SD     1.7097     0.0229    1.1991     0.0164    1.1399     0.0143
                Min    2.6750     0.0283    12.49      0.1391    14.06      0.1609
                Max    12.31      0.1783    20.99      0.2739    22.61      0.2652
Thyroid         Mean   1.0977     0.0049    1.8006     0.0079    1.8111     0.0087
                SD     0.6414     0.0021    0.5793     0.0019    0.6361     0.0023
                Min    0.3412     0.0014    0.9904     0.0033    0.9513     0.0039
                Max    7.7310     0.0186    7.8040     0.0189    7.9370     0.0244
The evolution of the structure can be observed from connections-vs-generations curves. Figure 2(a) shows the curve of the mean of the average number of connections vs. generations for the breast cancer problem; Figure 2(b) shows the same for the diabetes problem.
Fig. 2. Evolution of structure of ANNs for problems (a) breast cancer and (b) diabetes
The experimental results show comparatively good performance for the diabetes and thyroid problems, while results for the breast cancer problem are also competitive. The system also evolves structurally compact ANNs, which explains its good generalization capability.
4
Conclusions
Here, a particular evolutionary system is used to demonstrate the potential of the devised recombination operator, which was carefully designed to avoid the permutation problem and to line up matching blocks. The results show that this operator can dynamically adapt structures, which also validates its suitability as an operator. The recombination operator can also be experimented with by incorporating it into other evolutionary flows with different choices of 'success' and 'failure' marks, strong and weak parents, and other parameters. Its applicability to other processes makes it a good choice of operator for GA-based evolutionary approaches.
References
1. Storn, R., Price, K.: Differential Evolution - A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces. Technical Report TR-95-012, ICSI, March (1995) ftp.icsi.berkeley.edu
2. Hancock, P.J.B.: Genetic Algorithms and Permutation Problems: A Comparison of Recombination Operators for Neural Net Structure Specification. Proc. COGANN Workshop, Baltimore, MD (1992) 108-122
3. Holland, J.H.: Adaptation in Natural and Artificial Systems. Univ. Michigan Press, Ann Arbor, MI (1975)
4. Fogel, L., Owens, A., Walsh, M. (Eds.): Artificial Intelligence Through Simulated Evolution. Wiley, New York (1966)
5. Fogel, D.B.: Phenotypes, Genotypes, and Operators in Evolutionary Computation. Proc. 1995 IEEE Int. Conf. Evolutionary Computation, Piscataway, NJ (1995) 193-198
6. Yao, X., Liu, Y.: A New Evolutionary System for Evolving Artificial Neural Networks. IEEE Trans. Neural Networks 8(3) (1997) 694-713
7. Schalkoff, R.J.: Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, New York (1992)
8. Busygin, S.: A New Trust Region Technique for the Maximum Weight Clique Problem. Discrete Applied Mathematics 154(15) (International Symposium on Combinatorial Optimization CO'02) (2006) 2080-2096
9. Prechelt, L.: Proben1 - A Set of Neural Network Benchmark Problems and Benchmarking Rules. Fakultät für Informatik, Univ. Karlsruhe, Karlsruhe, Germany, Tech. Rep. 21/94, Sept. (1994)
Orthogonal Least Squares Based on QR Decomposition for Wavelet Networks
Min Han and Jia Yin
School of Electronic and Information Engineering, Dalian University of Technology, Dalian, 116023, China
[email protected]
Abstract. This paper proposes an orthogonal least squares algorithm based on QR decomposition (QR-OLS) for selecting the hidden-layer neurons of wavelet networks. The new algorithm divides the original matrix of candidate neurons into several parts to avoid comparisons among the poor ones, and uses QR decomposition to select the significant ones, avoiding a large amount of unnecessary calculation. The algorithm is applied to wavelet networks with the analysis of variance (ANOVA) expansion for one-step-ahead prediction of the Mackey-Glass delay-differential equation and of the annual sunspot data set, respectively. The results show that the QR-OLS algorithm relieves the heavy computational load and performs well.
1 Introduction
The idea of combining wavelets with neural networks has led to the development of wavelet networks (WNs), in which wavelets are introduced as activation functions. The wavelet analysis procedure is implemented with dilated and translated versions of a mother wavelet, which contain much redundant information. The calculation in WNs is therefore heavy and complicated in some cases, especially for high-dimensional models, and it is necessary to use an efficient method to select the hidden neurons in order to relieve the computational load. Several methods have been developed for selecting terms. Battiti [1] used mutual information to select the hidden neurons, Gomm et al. [2] proposed piecewise linearization based on Taylor decomposition, and Alonge et al. [3] applied a genetic algorithm for selecting the wavelet functions. The orthogonal least squares (OLS) algorithm was developed by Billings et al. [4]. However, these methods are time-consuming, and therefore more efficient approaches have been investigated. In the OLS algorithm, to select a correct hidden neuron, the vectors formed by the candidate neurons must be processed by orthogonalization methods, which becomes a heavy burden for the WNs. In the present paper, an orthogonal least squares algorithm based on QR decomposition (QR-OLS) is proposed, which divides the candidate neurons into sub-blocks to avoid comparisons among the poor neurons and uses a forward orthogonal least squares algorithm based on QR decomposition to select the hidden neurons of the WNs. The paper is organized as follows. Section 2 briefly reviews some preliminaries on WNs. The QR-OLS algorithm is described in Section 3.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 569–574, 2007.
© Springer-Verlag Berlin Heidelberg 2007
M. Han and J. Yin
In Section 4, two examples are simulated to illustrate the performance of the new algorithm. Finally, the conclusions of the paper are given in Section 5.
2 Wavelet Networks
The most popular wavelet decomposition restricts the dilation and translation parameters to dyadic lattices [4, 5], and the output of the WN can be expressed as

y = \sum_{j=j_0}^{J_0} \sum_{k \in K_j} c_{j,k} \psi_{j,k}(x) = \sum_{j=j_0}^{J_0} \sum_{k \in K_j} c_{j,k} \cdot 2^{-j/2} \psi(2^j \cdot x - k)   (1)
where x is the one-dimensional input of the network, y is the one-dimensional output of the network, c_{j,k} is the coefficient of the wavelet decomposition (the weight of the WN), j can be regarded as the dilation parameter, k can be regarded as the translation parameter, ψ_{j,k}(x) = 2^{-j/2} ψ(2^j · x − k), and ψ(·) is the wavelet function. In Eq. (1), j_0 is the coarsest resolution, J_0 is the finest resolution, and k ∈ K_j, where K_j is a subset of the integers that depends on the dilation parameter. According to Eq. (1), the structure of the WNs used in this paper is similar to that of radial basis function networks, but the activation functions are wavelet functions rather than radial basis functions. The WNs can be trained using least-squares methods. The result for the one-dimensional case described above can be extended to high dimensions [4,6]. First, the n-dimensional wavelet function can be expressed as
\psi^{[n]}_{j,k}(\mathbf{x}) = \psi^{[n]}_{j,k}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \psi_{j,k}(x_i),  \mathbf{x} = [x_1, x_2, \ldots, x_n],  i = 1, 2, \ldots, n   (2)
where the superscript [n] stands for the dimension of the wavelet and x is the multi-dimensional input of the WNs. The analysis of variance (ANOVA) decomposition [4] is then used to simplify the n-dimensional wavelet function. The main idea of ANOVA, in Eq. (3), is to decompose the high-dimensional function into lower-dimensional ones:

\psi^{[n]}_{j,k}(\mathbf{x}) = \sum_{1 \le l_1 \le n} \psi^{[1]}_{j,k}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le n} \psi^{[2]}_{j,k}(x_{l_1}, x_{l_2}) + \sum_{1 \le l_1 \le l_2 \le l_3 \le n} \psi^{[3]}_{j,k}(x_{l_1}, x_{l_2}, x_{l_3}) + \cdots + e   (3)
where e is the error of the ANOVA decomposition. Then Eq. (1) can be extended as

y = \sum_{1 \le l_1 \le n} f^{[1]}_{l_1}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le n} f^{[2]}_{l_1 l_2}(x_{l_1}, x_{l_2}) + \cdots + \sum_{1 \le l_1 \le l_2 \le \cdots \le l_i \le n} f^{[i]}_{l_1 l_2 \cdots l_i}(x_{l_1}, x_{l_2}, \ldots, x_{l_i}) + e

f^{[i]}_{l_1 l_2 \cdots l_i}(x_{l_1}, x_{l_2}, \ldots, x_{l_i}) = \sum_{j=j_i}^{J_i} \sum_{k \in K_j} \psi^{[i]}_{j,k}(x_{l_1}, x_{l_2}, \ldots, x_{l_i}),  i = 1, 2, \ldots, n;  1 \le l_1 \le \cdots \le l_i \le n   (4)
where j1, j2, j3 are the coarsest resolutions, and J1, J2, J3 are the finest resolutions.
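For the one-dimensional case, the output in Eq. (1) can be sketched as follows; the (unnormalized) Mexican hat mother wavelet and the resolution range are illustrative choices, not the paper's exact settings:

```python
import numpy as np

# Sketch of the 1-D wavelet network output in Eq. (1), using an unnormalized
# Mexican hat mother wavelet on a dyadic lattice. The resolution range and
# translation sets K_j are illustrative assumptions.

def mexican_hat(u):
    return (1.0 - u**2) * np.exp(-u**2 / 2.0)

def wn_output(x, coeffs, j0=0, J0=3):
    """y = sum_{j=j0}^{J0} sum_{k in K_j} c_{j,k} 2^{-j/2} psi(2^j x - k).

    coeffs: nested dict {j: {k: c_{j,k}}}, playing the role of the K_j sets.
    """
    y = 0.0
    for j in range(j0, J0 + 1):
        for k, c in coeffs.get(j, {}).items():
            y += c * 2.0 ** (-j / 2.0) * mexican_hat(2.0 ** j * x - k)
    return y

# One wavelet at resolution j = 0, translation k = 0, coefficient 1:
coeffs = {0: {0: 1.0}}
print(wn_output(0.0, coeffs))  # psi(0) = 1.0 for the unnormalized Mexican hat
```

Training then amounts to solving for the coefficients c_{j,k} by least squares, as described in the text.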
3 OLS Algorithm Based on QR Decomposition
The matrix P_all consists of the M_all vectors formed by the hidden neurons whose activation functions are the wavelet functions in Eq. (4). Several OLS algorithms have been
developed for selecting the vectors, i.e., the wavelet functions with parameters j and k, such as the classical Gram-Schmidt (CGS) algorithm, the modified Gram-Schmidt (MGS) algorithm [8, 9] and the Householder algorithm. The core of these algorithms is the method used to deal with P_all. In the CGS algorithm, P_all is decomposed as

P_all = W · A
(5)
where A is a unit upper triangular matrix and W is a matrix with M orthogonal columns w_1, w_2, …, w_M. The MGS algorithm performs basically the same operations as CGS, only in a different sequence; however, both methods are sensitive to round-off error. As for the Householder algorithm, a Householder transformation is applied for the orthogonalization procedure. The three methods share the drawback that, to select a hidden neuron, the vectors in P_all must be processed by orthogonalization methods, which is time-consuming and complicated. To avoid repeated decompositions of P_all, the new algorithm divides P_all into several sub-blocks, and QR decomposition is then applied to every sub-block, which can also avoid ill-conditioning. The algorithm proceeds as follows. First, a sub-block P with M (M < M_all) columns derived from P_all is selected. Assuming that P has full column rank, P can be decomposed as

P = Q · R
(6)
where Q = [Q1 Q2], R = [R1 0]^T, Q1 is an N × M matrix, Q2 is an N × (N − M) matrix, and R1 is a square matrix with M columns. Therefore, Eq. (6) can be simplified to

P = Q1 · R1
(7)
where Q1 = [q_1, q_2, …, q_M]. Then the model of the WN can be expressed as

Y = P · W + Ξ
(8)
where Y is the output of the WN, W is the weight matrix to be solved for, and Ξ is the sum of the model error and noise. According to Eq. (7), Eq. (8) becomes

Y = P · R1^{-1} · R1 · W + Ξ.
(9)
Define G = R1 · W = [g_1, g_2, …, g_M]^T. Then Eq. (9) can be rewritten as

Y = Q1 · G + Ξ.
(10)
If Ξ is ignored, G can be solved for by the least squares algorithm as

G = [Q1^T · Q1]^{-1} · Q1^T · Y,   g_i = (q_i^T · Y) / (q_i^T · q_i) = q_i^T · Y.
(11)
Assuming that Ξ is uncorrelated with the past output of the system, Eq. (10) can be expressed as

(1/N) Y^T Y = (1/N) \sum_{i=1}^{M} g_i^2 + (1/N) Ξ^T Ξ.
(12)
The parameter ERR_i is introduced, defined as

ERR_i = g_i^2 / (Y^T Y) = (q_i^T Y)^2 / (Y^T Y).
(13)
Several vectors qi (i=1, 2, ⋅ ⋅ ⋅ , M) with larger ERRi are selected as hypo-optimized neurons. The same procedure is applied to other sub-blocks, and then the optimized neurons are selected from the hypo-optimized ones with the same algorithm.
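The selection for one sub-block can be sketched as follows (NumPy; the sizes, number of kept neurons and synthetic data are illustrative assumptions, and the paper's ε threshold on ||q_i||, read off the diagonal of R1, is omitted for brevity):

```python
import numpy as np

# Sketch of the QR-OLS selection of Section 3: QR-decompose one sub-block P
# of candidate-neuron columns, compute g_i = q_i^T Y (Eq. (11)) and
# ERR_i = g_i^2 / (Y^T Y) (Eq. (13)), then keep the columns with the
# largest ERR as hypo-optimized neurons. Sizes are illustrative.

def qr_ols_select(P, Y, n_keep):
    Q1, R1 = np.linalg.qr(P)           # economy QR: P = Q1 R1 (Eq. (7))
    g = Q1.T @ Y                       # g_i = q_i^T Y
    err = g**2 / (Y @ Y)               # error-reduction ratios
    order = np.argsort(err)[::-1]      # most significant first
    return order[:n_keep], err

rng = np.random.default_rng(0)
N, M = 200, 10
P = rng.standard_normal((N, M))
true_w = np.zeros(M)
true_w[3] = 2.0                        # output driven by column 3 only
Y = P @ true_w + 0.01 * rng.standard_normal(N)
kept, err = qr_ols_select(P, Y, n_keep=2)
print(kept[0])                         # column 3 should rank first
```

The same routine is applied to every sub-block, and a final pass over the pooled hypo-optimized columns picks the overall most significant neurons, as in the text.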
4 Simulations
Two examples are provided to verify the performance of the QR-OLS algorithm. Both data sets are normalized to the operating domain of the wavelets rather than based on physical insight. In addition, the OLS algorithm based on CGS (CGS-OLS) is also used in the simulations for comparison with the proposed algorithm.
4.1 Mackey-Glass Delay-Differential Equation
This data set is generated by the Mackey-Glass delay-differential equation

dx(t)/dt = −0.1 x(t) + 0.2 x(t − τ) / (1 + x^{10}(t − τ))   (14)
where the time delay τ is chosen to be 30 in this example. Setting the initial condition x(t) = 0.9 for 0 ≤ t ≤ τ, a Runge-Kutta integration algorithm is applied to solve Eq. (14) with an integration step Δt = 0.01, and 1000 equi-spaced samples x(t), t = 1, 2, …, 1000, are extracted with a sampling interval of T = 0.06 time units. The data set is divided into two parts: 500 data points are used to train the WNs and the others are reserved for testing the network. The model can be expressed as

y(t) = \sum_{1 \le l_1 \le 6} f^{[1]}_{l_1}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le 6} f^{[2]}_{l_1 l_2}(x_{l_1}, x_{l_2}) + \sum_{1 \le l_1 \le l_2 \le l_3 \le 6} f^{[3]}_{l_1 l_2 l_3}(x_{l_1}, x_{l_2}, x_{l_3})   (15)
where x_{l_i} = x(t − l_i), i = 1, 2, 3, and 1 ≤ l_1 ≤ l_2 ≤ l_3 ≤ 6. The 1-D, 2-D and 3-D compactly supported Mexican hat wavelets are used in this example to approximate the uni-variate functions f^{[1]}_{l_1}, the bi-variate functions f^{[2]}_{l_1 l_2}, and the tri-variate functions f^{[3]}_{l_1 l_2 l_3}, respectively, with the coarsest resolutions j_1 = j_2 = j_3 = 0 and the finest resolutions J_1 = 3, J_2 = 1 and J_3 = 0. The value of M is 285, so P_all is divided into sub-blocks with 285 columns. To avoid ill-conditioning, a candidate neuron is eliminated if ||q_i|| in Eq. (11) is less than a predetermined threshold ε; ε is chosen to be 6 to guarantee good conditioning, based on the data set and the normalization used in the WNs. Then 14 hypo-optimized neurons are selected from every sub-block, and finally the 14 most significant neurons are selected from the hypo-optimized ones. Table 1 gives a comparison of the CPU time required to select the hidden neurons; the simulation is realized in Matlab. It can be seen that the algorithm is more efficient and accurate than the CGS-OLS algorithm.
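The data generation described above can be sketched as follows, using a fixed-step fourth-order Runge-Kutta scheme in which the delayed value is held constant over each step (an approximation; the paper does not specify its integration details):

```python
import numpy as np

# Sketch of the Mackey-Glass data generation of Eq. (14): tau = 30,
# x(t) = 0.9 on [0, tau], integration step 0.01, samples every 0.06 time
# units. The delay term is read from stored history and held constant over
# each RK4 step, which is an approximation.

def mackey_glass(n_samples=1000, tau=30.0, dt=0.01, sample_every=6):
    def f(x, x_tau):
        return -0.1 * x + 0.2 * x_tau / (1.0 + x_tau**10)

    delay_steps = int(round(tau / dt))
    hist = [0.9] * (delay_steps + 1)          # x(t) = 0.9 on [0, tau]
    samples = []
    while len(samples) < n_samples:
        x = hist[-1]
        x_tau = hist[-1 - delay_steps]        # delayed value, frozen per step
        k1 = f(x, x_tau)
        k2 = f(x + 0.5 * dt * k1, x_tau)
        k3 = f(x + 0.5 * dt * k2, x_tau)
        k4 = f(x + dt * k3, x_tau)
        hist.append(x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
        if (len(hist) - delay_steps - 1) % sample_every == 0:
            samples.append(hist[-1])
    return np.array(samples)

series = mackey_glass(1000)
train, test = series[:500], series[500:]      # split used in the paper
```

With these coefficients the trajectory stays positive and bounded, so the split gives 500 training and 500 test points as described.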
Table 1. Comparison of CGS-OLS and QR-OLS algorithms

Method     RMSE        TIME(s)
CGS-OLS    3.5×10⁻³    419.1400
QR-OLS     2.8×10⁻³    23.7500

Fig. 1. Prediction with QR-OLS algorithm for Mackey-Glass series (actual value vs. QR-OLS prediction; x-axis: sampling index 500–1000, y-axis: y)
4.2 The Sunspot Time Series
This example uses the Wolf sunspot data series, recording the annual sunspot indices from 1700 to 1999. The data set is separated into two parts: the training set consists of 270 data points and the test set consists of 30 data points. According to [6], y(t−1), y(t−2), and y(t−9) are selected as the most significant variables, and the model order is chosen to be 9. Then the model can be expressed as

y(t) = \sum_{1 \le l_1 \le 9} f^{[1]}_{l_1}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le 9} f^{[2]}_{l_1 l_2}(z_{l_1}, z_{l_2}) + f^{[3]}_{129}(x_1, x_2, x_9)   (16)
where x_{l_i} = y(t − l_i) for l_i = 1, 2, …, 9, z_{l_i} = y(t − l_i) for l_i = 1, 2, and z_3 = y(t − 9). The 1-D, 2-D and 3-D compactly supported Gaussian wavelets are used in this example to approximate the uni-variate functions f^{[1]}_{l_1}, the bi-variate functions f^{[2]}_{l_1 l_2}, and the tri-variate function f^{[3]}_{129}, respectively, with the coarsest resolutions j_1 = j_2 = j_3 = 0 and the finest resolutions J_1 = 2, J_2 = J_3 = 0. P_all is divided into sub-blocks with 160 columns, and ε in this example is chosen to be 0.085. The prediction results of the WNs compared with CGS-OLS are shown in Figure 2 and Table 2. The new algorithm can
Table 2. Comparison of CGS-OLS and QR-OLS algorithms

Method     RMSE
CGS-OLS    16.4899
QR-OLS     13.3800

Fig. 2. Predictions with CGS-OLS and QR-OLS algorithms for annual sunspot time series
capture the characteristics of the real-world chaotic system and achieves better performance. For a problem of this smaller size, however, the improvement in computation time is small.
5 Conclusions
In this paper, a new OLS algorithm based on QR decomposition for WNs is proposed. The QR-OLS algorithm avoids a great deal of calculation by dividing the original group of candidate neurons into several parts and using QR decomposition to select the significant ones. The more candidate neurons there are, the greater the advantage of the QR-OLS algorithm over the other OLS algorithms. The results obtained from the two examples, the Mackey-Glass delay-differential equation and the sunspot time series, demonstrate its effectiveness and accuracy.
Acknowledgements
This research is supported by the National Natural Science Foundation of China under Projects 60674073 and 60374064. All of these supports are appreciated.
References
[1] Battiti, R.: Using Mutual Information for Selecting Features in Supervised Neural Net Learning. IEEE Transactions on Neural Networks 5 (4) (1994) 537-550
[2] Gomm, J.B., Yu, D.L.: Order and Delay Selection for Neural Network Modelling by Identification of Linearized Models. International Journal of Systems Science 31 (10) (2000) 1273-1283
[3] Alonge, F., D'Ippolito, F., Raimondi, F.M.: System Identification via Optimized Wavelet-based Neural Networks. IEE Proc.-Control Theory Appl. 150 (2) (2003) 147-154
[4] Billings, S.A., Wei, H.L.: A New Class of Wavelet Networks for Nonlinear System Identification. IEEE Transactions on Neural Networks 16 (4) (2005) 862-874
[5] Cao, L.Y., Hong, Y.G., Fang, H.P., He, G.W.: Predicting Chaotic Time Series with Wavelet Networks. Physica D 85 (1995) 225-238
[6] Wei, H.L., Billings, S.A., Liu, J.: Term and Variable Selection for Nonlinear System Identification. Int. J. Control 77 (2004) 86-110
[7] Palmes, P.P., Hayasaka, T., Usui, S.: Mutation-Based Genetic Neural Network. IEEE Transactions on Neural Networks 16 (3) (2005) 587-600
[8] Dax, A.: A Modified Gram-Schmidt Algorithm with Iterative Orthogonalization and Column Pivoting. Linear Algebra and its Applications 310 (2005) 25-42
[9] Ling, F., Manolakis, D., Proakis, J.G.: A Recursive Modified Gram-Schmidt Algorithm for Least-Squares Estimation. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-34 (4) (1986) 829-836
Implementation of Multi-valued Logic Based on Bi-threshold Neural Networks
Qiuxiang Deng and Zhigang Zeng
School of Automation, Wuhan University of Technology, Wuhan, Hubei, 430070, China
qx [email protected], [email protected]
Abstract. The implementation of multi-valued logic with a three-layer feedforward neural network is proposed. The hidden layer is built from bi-threshold neurons rather than traditional simple threshold neurons. According to the results obtained in this paper, if a perceptron with a simple threshold neuron is used in the output layer, then a logical map from {0, 1, · · · , n} to {0, 1} can be obtained; if a linear neuron is used in the output layer, then a logical map from {0, 1, · · · , n} to {0, 1, · · · , m} can be obtained. The algorithm used to design the three-layer feedforward network improves on traditional two-valued digital logic. An example shows that the design procedure for the network is simple and effective.
1
Introduction
Recently, a great deal of attention has been paid to the implementation of digital logic based on neural networks, and numerous research methods and results have been proposed [1-15]. This is partly because neural networks possess many merits, such as the ability to learn, strong generality, low complexity, simple configuration and easy integration. In order to reduce the complexity and limitations of neural networks, the ideas of the Karnaugh map and minterm inhibition were introduced using a three-layer neural network in [1]. Based on the results in [1], a network was presented in [2] to lower complexity and limitations and to counteract light disturbances of digital voltages. However, all these results focus on logical maps from {0, 1} to {0, 1} or from {−1, 1} to {−1, 1}. In these three-layer feedforward neural networks, the implementation of the logical map uses only two values, since all neurons have a simple threshold and offer two states. Some studies suggest that neural networks can achieve not only typical two-valued logic, but also multi-valued logic and inexact logic such as fuzzy logic. Ojha presents a method for enumerating linear threshold functions (LTFs) of n-dimensional inputs in [3]. The linear threshold functions are the transfer functions of McCulloch-Pitts model neurons; the problem is to enumerate the distinct LTFs which can be computed by varying the weights and the bias [4-6]. A multi-threshold neuron to implement multi-valued logic is suggested in [7], based on parallel hyperplanes which divide the Euclidean space into regions. However, as the number of variables increases, the number
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 575–582, 2007.
© Springer-Verlag Berlin Heidelberg 2007
Q. Deng and Z. Zeng
of thresholds of a multi-threshold neuron also increases. The contribution of this paper is to put bi-threshold neurons into a three-layer feedforward neural network. The bi-threshold neurons are used in the hidden layer. A neural network with bi-threshold neurons and other neurons can reduce the total number of neurons and can implement an arbitrary logical map. The rest of this paper is organized as follows. In Section 2, the neurons, including the simple threshold neuron and the bi-threshold neuron, and their transfer functions are introduced. In Section 3, a construction of a three-layer network with bi-threshold neurons is proposed to implement arbitrary logic. Bi-threshold neurons are used as the neurons of the hidden layer. According to the obtained results, if a perceptron with a simple threshold neuron is used in the output layer, then a logical map from {0, 1, · · · , n} to {0, 1} can be derived; if a linear neuron is used in the output layer, then a logical map from {0, 1, · · · , n} to {0, 1, · · · , m} can be obtained. In Section 4, we give an example showing that the structure of this kind of neural network is simple, reliable and easy to use for implementing arbitrary logic. Finally, concluding remarks are given in Section 5.
2 Threshold Neuron
2.1 Simple Threshold Neuron
A simple threshold neuron, also called a perceptron, is the simplest neuron in neural networks. The model of this neuron is given in Fig. 1, where xi is the
Fig. 1. The model of a simple threshold neuron
Fig. 2. The transfer function of a simple threshold neuron
i-th input, wi is the weight connecting the outer inputs to this neuron, and θ is the single threshold of this neuron. Its transfer function is shown in Fig. 2. The output function of this neuron can be expressed as follows:

y = f(Σ_{i=1}^{n} w_i x_i − θ),  where f(u) = 1 if u > 0, and 0 otherwise.  (1)

2.2 Bi-threshold Neuron
A bi-threshold neuron is a linear combination of two simple threshold neurons. The "exclusive-OR" problem, which cannot be implemented by a single simple threshold neuron, can be solved with one bi-threshold neuron. The model of a bi-threshold neuron is given in Fig. 3, where xi is the i-th input, wi is the weight connecting the outer inputs to this neuron, θ1 is the lower threshold and θ2 is the upper threshold of this neuron. Its transfer function is shown in Fig. 4. If the weighted sum of the input signals lies between θ1 and θ2, the activation signal 1 is produced; otherwise, the restrained signal 0 is produced. The output function of this neuron can be expressed as follows:

y = f(Σ_{i=1}^{n} w_i x_i),  where f(u) = 1 if θ1 ≤ u ≤ θ2, and 0 otherwise.  (2)

Fig. 3. The model of a bi-threshold neuron

Fig. 4. The transfer function of a bi-threshold neuron
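To make the two transfer functions concrete, here is a minimal Python sketch of Eqs. (1) and (2); the function names are ours. It also checks the claim above that a single bi-threshold neuron solves exclusive-OR: weights (1, 1) and window [0.5, 1.5] accept exactly the inputs whose weighted sum is 1.

```python
def simple_threshold(x, w, theta):
    """Perceptron of Eq. (1): fires iff the weighted sum exceeds theta."""
    u = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1 if u > 0 else 0

def bi_threshold(x, w, theta1, theta2):
    """Bi-threshold neuron of Eq. (2): fires iff the weighted sum
    falls inside the window [theta1, theta2]."""
    u = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if theta1 <= u <= theta2 else 0

# XOR with a single bi-threshold neuron:
for a in (0, 1):
    for b in (0, 1):
        assert bi_threshold((a, b), (1, 1), 0.5, 1.5) == (a ^ b)
```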
3 Implementation of Multi-valued Logic

3.1 Implementation of the Logical Map of {0, 1, · · · , n} to {0, 1}
The construction of a neural network realizing the logical map of {0, 1, · · · , n} to {0, 1} is shown in Fig. 5, where x = (x1, x2, · · · , xn)^T is the input vector, w_{1ij} is the weight connecting the input neurons to the hidden neurons, h_j is the output of the j-th hidden neuron, and w_{2j} is the weight connecting the hidden neurons to the output neuron. The network has three layers; the neurons in the hidden layer are bi-threshold neurons with different lower and upper thresholds, and there is a single neuron in the output layer. In this section, in order to realize the logical map of {0, 1, · · · , n} to {0, 1}, a simple threshold neuron is used in the output layer. The output of this network satisfies

s_j = Σ_{i=1}^{n} w_{1ij} x_i,  h_j = f(s_j) = 1 if θ_{j1} ≤ s_j ≤ θ_{j2}, and 0 otherwise;
o = Σ_{j=1}^{L} w_{2j} h_j,  Y = 1 if o > 0, and 0 otherwise,  (3)

where s_j and o are the inputs of the j-th hidden neuron and of the output neuron, respectively, and θ_{j1} and θ_{j2} are the lower and upper thresholds of the j-th hidden neuron. The number of neurons in the hidden layer equals the number of samples whose output value is 1. The algorithm by which this three-layer network realizes the logical map of {0, 1, · · · , n} to {0, 1} can be summarized as follows:

– Initialization: Let the number of input neurons be the number of digital logical variables, let the number of hidden neurons be the number of samples whose output value is 1, and let the output layer have one simple threshold neuron whose threshold is zero.
Fig. 5. The structure of a neural network to implement the logical map of {0,1,. . . ,n} to {0,1}
– Sample learning: learning takes place only for the samples whose output value is 1. Let U be the set of samples whose function value is 1, and let x_r = [x_{r1}, x_{r2}, · · · , x_{rm}] ∈ U be the r-th such sample, with each component in {0, 1, · · · , n}. The weights and thresholds of the network are set as follows:

w_{1ij} = (n + 1)^i,  θ_{j1} = Σ_{i=1}^{m} (n + 1)^i x_i − 0.5,  θ_{j2} = Σ_{i=1}^{m} (n + 1)^i x_i + 0.5,  w_{2j} = 1,  T = 0.  (4)
– Repeat the sample learning until all samples whose function value is 1 have been trained. The learning algorithm then ends.

Lemma. In the hidden layer, only one neuron is in the excited state at any time, and the others are in the restrained state. The output of the network equals the function value of the corresponding sample.

Proof. Let x = x_r ∈ U, where U = [x_1, · · · , x_r, · · · , x_L], x_r = (x_{r1}, x_{r2}, · · · , x_{rm})^T and x_{rj} ∈ {0, 1, · · · , n}. As stated before, the number of input variables is m and each variable takes a value between 0 and n, so there are m neurons in the input layer and L neurons in the hidden layer. According to (3) and (4), if j = r, then

s_j = Σ_{i=1}^{m} w_{1ij} x_{ri} = Σ_{i=1}^{m} (n + 1)^i x_{ri} ∈ [θ_{j1}, θ_{j2}],  so h_j = 1.

If x is any other sample, then s_j does not belong to [θ_{j1}, θ_{j2}] and h_j = 0. Hence, by (3) and (4), the output of the whole network is 1. Thus, when the bi-threshold neurons are placed in the hidden layer, the input values can extend to {0, 1, · · · , n}. That is to say, the network overcomes the restriction of inputs fixed to two values such as {0, 1} or {−1, 1}.

3.2 Implementation of the Logical Map of {0, 1, · · · , n} to {0, 1, · · · , m}
The construction of the network realizing the logical map of {0, 1, · · · , n} to {0, 1, · · · , m} is also shown in Fig. 5, except that the output layer now contains a linear neuron rather than a simple threshold neuron. Its output function is Y = f(u) = u. In the sample learning, the weights and thresholds of the hidden layer are trained exactly as detailed above, so in the hidden layer only one neuron is in the excited state at any time while the others are restrained. The weights connecting the hidden layer and the output
layer, however, must be trained differently, since the output is no longer binary but multi-valued. The rule for these weights is w_{2j} = t_j, where t_j is the function value of the j-th input sample. The output of the entire network is then

Y = Σ_j w_{2j} h_j = 1 · t_j,  t_j ∈ {0, 1, · · · , m}.

Since t_j belongs to {0, 1, · · · , m} rather than {0, 1}, the output value can vary from 0 to m. Hence, this three-layer network realizes the logical map of {0, 1, · · · , n} to {0, 1, · · · , m}.
4 Application to Three-Valued Logic with a Three-Layer Neural Network
In the three-valued logical operation considered here, there are 3 input variables, each taking one of 3 values. Take the AND operation as an example; using the three-layer network, this operation can be described exactly. First, the truth table of this operation is shown in Table 1.

Table 1. Truth table of the three-valued AND operation on three variables
x1 x2 x3 out | x1 x2 x3 out | x1 x2 x3 out
 0  0  0  0  |  1  0  0  0  |  2  0  0  0
 0  0  1  0  |  1  0  1  0  |  2  0  1  0
 0  0  2  0  |  1  0  2  0  |  2  0  2  0
 0  1  0  0  |  1  1  0  0  |  2  1  0  0
 0  1  1  0  |  1  1  1  1  |  2  1  1  1
 0  1  2  0  |  1  1  2  1  |  2  1  2  1
 0  2  0  0  |  1  2  0  0  |  2  2  0  0
 0  2  1  0  |  1  2  1  1  |  2  2  1  1
 0  2  2  0  |  1  2  2  1  |  2  2  2  2
From the table, one can see that there are 27 samples, of which 8 items have a nonzero output value. As stated before, let the input layer have 3 neurons, the hidden layer 8 neurons, and the output layer one linear neuron. The construction of the network for this operation, based on the above algorithm, is shown in Fig. 6. According to the algorithm, w_{11j} = 3^0, w_{12j} = 3^1, w_{13j} = 3^2, for j = 1, 2, · · · , 8. The thresholds of the hidden neurons are listed in Table 2. The weights connecting the hidden layer and the output layer are trained as described in Section 3.2: w_{21} = w_{22} = · · · = w_{27} = 1 and w_{28} = 2. One can thus obtain the three-valued AND of three variables through this network. In addition, one can see that this construction of the network to implement
Fig. 6. The structure of the neural network implementing the three-valued logic

Table 2. The thresholds of the neurons in the hidden layer
hidden neuron j   lower threshold   upper threshold   sample (x1, x2, x3)   expected output
      1               12.5              13.5               (1, 1, 1)                1
      2               13.5              14.5               (1, 1, 2)                1
      3               15.5              16.5               (1, 2, 1)                1
      4               16.5              17.5               (1, 2, 2)                1
      5               21.5              22.5               (2, 1, 1)                1
      6               22.5              23.5               (2, 1, 2)                1
      7               24.5              25.5               (2, 2, 1)                1
      8               25.5              26.5               (2, 2, 2)                2
logic is simple and efficient. The algorithm for constructing the three-layer network extends to multi-valued logic problems with any number of variables.
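The whole Section 3 construction can be checked numerically. The sketch below (our own code, not the authors') builds the hidden layer from the 8 nonzero-output samples via Eq. (4) with n = 2, uses a linear output neuron with w_{2j} = t_j, and verifies that the network computes min(x1, x2, x3), the three-valued AND, on all 27 inputs. The weights follow Eq. (4) literally, so the numeric thresholds differ from those listed in Table 2, but the construction is the same.

```python
def ternary_and_net():
    """Build hidden-layer parameters per Eq. (4) for the ternary AND."""
    n = 2
    hidden = []
    for x1 in range(3):
        for x2 in range(3):
            for x3 in range(3):
                t = min(x1, x2, x3)          # desired function value
                if t == 0:
                    continue                 # only nonzero-output samples are stored
                w1 = [(n + 1) ** i for i in (1, 2, 3)]   # w_{1ij} = (n+1)^i
                s = w1[0] * x1 + w1[1] * x2 + w1[2] * x3
                hidden.append((w1, s - 0.5, s + 0.5, t)) # thresholds and w_{2j} = t
    return hidden

def evaluate(hidden, x):
    """Linear output neuron: Y = sum_j w_{2j} h_j."""
    y = 0
    for w1, t1, t2, t in hidden:
        s = sum(wi * xi for wi, xi in zip(w1, x))
        if t1 <= s <= t2:
            y += t
    return y

net = ternary_and_net()
assert len(net) == 8                         # the 8 nonzero samples of Table 1
for x1 in range(3):
    for x2 in range(3):
        for x3 in range(3):
            assert evaluate(net, (x1, x2, x3)) == min(x1, x2, x3)
```

The weighted sum s is a base-3 encoding of the input, so each bi-threshold window [s − 0.5, s + 0.5] captures exactly one input pattern.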
5 Concluding Remarks
After discussing the connection between digital logic and the bi-threshold neuron, this paper presented an algorithm for a three-layer feedforward neural network that resolves multi-valued logic. The algorithm overcomes the restriction of previous work on logic limited to two values. The resulting network is simple and efficient, and the results obtained in this paper can be widely used in the study of digital circuits, coding, and pattern recognition.
Acknowledgement

This work was supported by the Natural Science Foundation of China under Grant 60405002.
References

1. Donald, L.G., Anthony, N.M.: A Training Algorithm for Binary Feedforward Neural Networks. IEEE Trans. Neural Networks 2 (1992) 176-194
2. Ma, X.M., Hu, Z.P.: Design of Digital Logic Using Neural Network. Journal of Circuits and Systems 3 (1998) 51-58
3. Piyush, C.O.: Enumeration of Linear Threshold Functions from the Lattice of Hyperplane Intersections. IEEE Trans. Neural Networks 4 (2000) 839-850
4. Winder, R.O.: Enumeration of Seven-argument Threshold Functions. IEEE Trans. Electron. Comput. 14 (1965) 315-325
5. Zuev, Y.A.: Asymptotics of the Logarithm of the Number of Threshold Functions of the Algebra of Logic. Sov. Math. Dokl. 39 (1989) 512-513
6. Siu, K.Y., Roychowdhury, V., Kailath, T.: Discrete Neural Computation: A Theoretical Foundation. Prentice-Hall, Englewood Cliffs, NJ (1995)
7. Wang, S.J., Zhao, G.L., Liu, Y.Y.: Research on Logic Operation by Multi-threshold Neural Network. BIC-CA 2006 (2006) 335-342
8. Sun, C., Feng, C.: Global Robust Exponential Stability of Interval Neural Networks with Delays. Neural Processing Letters 17 (2003) 107-115
9. Sun, C., Feng, C.: On Robust Exponential Periodicity of Interval Neural Networks with Delays. Neural Processing Letters 20 (2004) 53-61
10. Huang, D.S., Ip, H.H.S., Law, K.C.K., Chi, Z.R.: Zeroing Polynomials Using Modified Constrained Neural Network Approach. IEEE Trans. Neural Networks 3 (2005) 721-732
11. Huang, D.S., Ip, H.H.S., Chi, Z.R.: A Neural Root Finder of Polynomials Based on Root Moments. Neural Computation 8 (2004) 1721-1762
12. Huang, D.S.: A Constructive Approach for Finding Arbitrary Roots of Polynomials by Neural Networks. IEEE Trans. Neural Networks 2 (2004) 477-491
13. Xu, B.J., Liu, X.Z., Liao, X.X.: Global Exponential Stability of High Order Hopfield Type Neural Networks. Applied Mathematics and Computation 1 (2006) 98-116
14. Liu, M.Q.: Global Exponential Stability Analysis for Neutral Delay-differential Systems: An LMI Approach. International Journal of Systems Science 11 (2006) 777-783
15. Liu, M.Q.: Dynamic Output Feedback Stabilization for Nonlinear Systems Based on Standard Neural Network Models. International Journal of Neural Systems 4 (2006) 305-317
Iteratively Reweighted Fitting for Reduced Multivariate Polynomial Model

Wangmeng Zuo 1, Kuanquan Wang 1, David Zhang 2, and Feng Yue 1

1 School of Computer Science and Technology, Harbin Institute of Technology, 150001 Harbin, China
[email protected], [email protected]
2 Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong
Abstract. Recently a class of reduced multivariate polynomial models (RM) has been proposed that performs well in classification tasks involving few features and many training data. The RM method, however, adopts a ridge least-square estimator, overlooking the fact that least-square error usually does not correspond to minimum classification error. In this paper, we propose an iteratively reweighted regression method and two novel weight functions for fitting the RM model (IRF-RM). The IRF-RM method iteratively increases the weights of samples prone to misclassification and decreases the weights of samples far from the decision boundary, making the IRF-RM model more suitable for efficient pattern classification. A number of benchmark data sets are used to evaluate the IRF-RM method. Experimental results indicate that IRF-RM achieves a higher or comparable classification accuracy compared with RM and several state-of-the-art classification approaches.
1 Introduction

Pattern classification, which assigns a class label to an unseen instance from a set of attributes describing that instance, plays a key role in many applications such as image retrieval, medical diagnosis, and bioinformatics. Over the years, various classification methods have been proposed. These approaches can be grouped into two major categories: generative learning approaches and discriminative learning approaches. In a generative learning approach, a generative model is learned from the training data and is then used to predict the class label of an unknown instance using the Bayes rule. In a discriminative learning approach, a decision function or decision boundary is learned by optimizing some performance criterion, such as classification accuracy or generalization. Unlike generative learning, discriminative learning makes no assumptions about the distributions of samples, but instead attempts to directly compute the sample-class mapping. These approaches have been successful in many application tasks. For example, the performance of support vector machines (SVMs) in handwritten character recognition [11] is state-of-the-art. To ensure that it is able to represent decision boundaries of any shape, the decision function in a discriminative learning framework should be nonlinear, yet the demand
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 583–592, 2007. © Springer-Verlag Berlin Heidelberg 2007
for good generalization implies that the decision function should be restricted in its complexity. One approach to this dilemma is to map the original data to a high-dimensional feature space and then use a simple (e.g., linear) decision function on that space. This class of algorithms includes the SVM, the Φ-Machine, etc. A natural choice for use with discriminative learning models is the multivariate polynomial model (MPM). The MPM first transforms the data into a high-dimensional polynomial feature space; a linear regression method is then applied in the transformed feature domain to fit an optimal polynomial model [5]. Although an MPM can represent a nonlinear decision function, the number of feature dimensions increases exponentially with the growth of the rank r, requiring a great amount of data to ensure the polynomial model is not over-fitted. Recently a class of reduced multivariate polynomial models (RM) has been proposed for pattern classification [10]. Unlike the MPM, the RM model transforms the data into a reduced polynomial feature space whose dimensionality is significantly less than that of the full multivariate polynomial model. RM has performed well in classification tasks that involve few features and many training data. In the neural network community, polynomial feedforward neural networks (PFNN) have also been used to generate high-order multivariate polynomial mappings; neural learning or evolutionary computation approaches have been proposed to learn the network architecture, including the sampled polynomial terms and the corresponding coefficients [13]. One disadvantage of RM is that it adopts a least-square estimator and neglects the fact that the outputs are intrinsically discrete response variables. Least-square regression assumes that the predicted output has a Gaussian distribution centered on the theoretical output. However, this assumption is generally not tenable.
The samples close to the boundary play a more important role in estimating the decision function. In this paper we propose an iteratively reweighted fitting method for the RM model (IRF-RM) that is more suitable for classification tasks. Iteratively reweighted fitting is an efficient method for solving nonlinear optimization problems and has been widely applied to train SVMs [7] and neural networks. We use 42 data sets to evaluate the classification performance of IRF-RM. To provide sound comparisons of classifiers over multiple data sets, we adopt the Wilcoxon signed-ranks test for comparing two classifiers and the Friedman test for comparing multiple classifiers [3]. IRF-RM achieves a higher or comparable classification accuracy compared with several state-of-the-art classifiers.
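The reweighted-fitting idea described above alternates a weighted ridge least-squares fit with a recomputation of per-sample weights from the current predictions. The following Python sketch illustrates the scheme; the weight rule here is a hypothetical stand-in (up-weight misclassified samples, down-weight samples far from the boundary), not the paper's two weight functions, and the small Gaussian-elimination helper is ours.

```python
def solve(A, b):
    """Gauss-Jordan elimination for small dense linear systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def weighted_ridge(X, y, w, lam=1e-3):
    """Solve (X^T W X + lam I) beta = X^T W y."""
    d = len(X[0])
    A = [[sum(w[k] * X[k][i] * X[k][j] for k in range(len(X)))
          + (lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    b = [sum(w[k] * X[k][i] * y[k] for k in range(len(X))) for i in range(d)]
    return solve(A, b)

def irf(X, y, iters=5):
    """Iteratively reweighted fitting with an illustrative weight rule."""
    w = [1.0] * len(X)                       # start from the plain ridge fit
    beta = weighted_ridge(X, y, w)
    for _ in range(iters):
        preds = [sum(b * v for b, v in zip(beta, xk)) for xk in X]
        # up-weight samples on the wrong side of 0.5, down-weight easy ones
        w = [2.0 if (p - 0.5) * (t - 0.5) < 0 else 1.0 / (1.0 + abs(p - 0.5))
             for p, t in zip(preds, y)]
        beta = weighted_ridge(X, y, w)
    return beta

# A tiny separable example (bias feature plus one attribute):
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.0, 0.0, 1.0, 1.0]
beta = irf(X, y)
preds = [sum(b * v for b, v in zip(beta, xk)) for xk in X]
assert all((p - 0.5) * (t - 0.5) > 0 for p, t in zip(preds, y))
```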
2 The Reduced Multivariate Polynomial Model

2.1 The Multivariate Polynomial Model

The multivariate polynomial model provides an efficient way to describe complex nonlinear relationships between instances and class labels. Assume that we are given a set of training data X = [x_1, · · · , x_m]^T, x_i ∈ R^N, with corresponding labels y = [y_1, · · · , y_m]^T, y_i ∈ {0, 1}. For this two-class problem, the goal of an MPM is to fit an r-rank polynomial model
g(x) = b + Σ_{i=1}^{K−1} w_i x_1^{r_1} x_2^{r_2} · · · x_N^{r_N},

... ≥ 0 [12], then we get C = (1/M) Σ_{k=1}^{M} (x_{ek} x_{ek}^T + x_{ok} x_{ok}^T) = C_e + C_o. It is evident that the eigenvalue decomposition of C is equivalent to the eigenvalue decompositions of C_e and C_o. As a result, x_k can be reconstructed linearly from the eigenvectors of C_e and C_o.
3 Symmetry Based 2DPCA Algorithm
In 2DPCA, we seek a projection vector V, an n-dimensional unitary column vector. An image X is projected onto V by the linear transformation

Y = XV.  (1)

Thus an m-dimensional projected vector Y, called the projected feature vector of the image X, is obtained. How can one determine a good projection
M. Ding et al.
vector V? According to the proposition of Yang [7], the total scatter of the projected samples can be used to measure the discriminatory power of the projection vector V, and it can be characterized by the trace of the covariance matrix of the projected feature vectors. Thus, the following criterion is adopted:

J(V) = tr(S_v),  (2)

where S_v denotes the covariance matrix of the projected feature vectors of the training samples and tr(S_v) denotes its trace, with tr(S_v) = V^T {E[(X − EX)^T (X − EX)]} V. Let us define the matrix

C = E[(X − EX)^T (X − EX)],  (3)

where C is called the image covariance matrix; C can be computed directly from the training image samples. We now describe the S2DPCA algorithm. Suppose there are N training image samples in total, the i-th training image is denoted by an m × n matrix X_i (i = 1, 2, · · · , N), and the average image of all training samples is denoted by X̄. Then C can be evaluated by

C = (1/N) Σ_{i=1}^{N} (X_i − X̄)^T (X_i − X̄).  (4)
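Eq. (4) operates on image matrices directly, without vectorizing them. A small pure-Python sketch (our own code) of accumulating the n × n image covariance matrix:

```python
def image_covariance(images):
    """C = (1/N) sum_i (X_i - mean)^T (X_i - mean) for m x n image matrices."""
    N = len(images)
    m, n = len(images[0]), len(images[0][0])
    mean = [[sum(img[r][c] for img in images) / N for c in range(n)]
            for r in range(m)]
    C = [[0.0] * n for _ in range(n)]
    for img in images:
        D = [[img[r][c] - mean[r][c] for c in range(n)] for r in range(m)]
        for i in range(n):                   # accumulate D^T D
            for j in range(n):
                C[i][j] += sum(D[r][i] * D[r][j] for r in range(m))
    return [[C[i][j] / N for j in range(n)] for i in range(n)]

# Two tiny 2 x 2 "images": the result is a symmetric 2 x 2 matrix.
imgs = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
C = image_covariance(imgs)
assert C[0][1] == C[1][0]
```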
According to the odd-even decomposition theory, X_i can be decomposed as X_i = X_ei + X_oi, with X_ei = (X_i + X_mi)/2 the even symmetrical image and X_oi = (X_i − X_mi)/2 the odd symmetrical image, where X_mi is the mirror image of X_i. Let X̄_e denote the average of all even symmetrical images and X̄_o the average of all odd symmetrical images. Then C can also be evaluated by

C = C_e + C_o,  (5)

where

C_e = (1/N) Σ_{i=1}^{N} (X_ei − X̄_e)^T (X_ei − X̄_e),  (6)

and

C_o = (1/N) Σ_{i=1}^{N} (X_oi − X̄_o)^T (X_oi − X̄_o).  (7)

So the eigenvalue decomposition of C is equivalent to the eigenvalue decomposition of the right side of equation (5), i.e., of C_e and C_o. Thus, the criterion in (2) can be expressed as

J(V) = V^T C V.  (8)
Symmetry Based Two-Dimensional Principal Component Analysis
The optimal projection axis V_opt is the unitary vector that maximizes J(V), i.e., the eigenvector of C corresponding to the largest eigenvalue. From [6], the odd and even symmetrical principal components hold different energy in facial images. For face recognition under restricted environments (for example, when viewing angle and illumination do not vary significantly), the symmetry of faces overwhelms their asymmetry, though the asymmetry is quite valuable in some other fields such as thermal imaging of faces. Thus, the even symmetrical components carry more energy than the odd symmetrical components, which means they are more important. Of course, this does not mean the odd symmetrical components should be discarded completely, because some of them also contain information important for face recognition. In conclusion, both the odd and the even symmetrical components should be utilized in recognition; at the same time, the even symmetrical components should be reinforced while the odd symmetrical components are suppressed to some extent. In fact, if the components with more energy (greater variance) are selected, the even symmetrical components will be selected naturally because they hold more energy. For feature selection in S2DPCA, we adopt a strategy similar to SPCA: order the eigenvectors by eigenvalue, then select the eigenvectors corresponding to the larger eigenvalues. Since the variance (eigenvalue) of the even symmetrical components V_e is larger than that of the corresponding odd components V_o, it is natural to consider the even symmetrical components first, and then the odd symmetrical components if necessary. Based on this description, the S2DPCA algorithm can be summarized as follows.

(1) Generate the mirror images X_mk from the training images X_k. Then decompose X_k into the even symmetrical images X_ek = (X_k + X_mk)/2 and the odd symmetrical images X_ok = (X_k − X_mk)/2.
(2) Evaluate C_e and C_o by equations (6) and (7), respectively; then compute the eigenvectors V_e and V_o by eigenvalue decomposition of C_e and C_o.
(3) Order the eigenvectors V_e, V_o according to their eigenvalues, then select those with the greater eigenvalues to form the feature transformation matrix V = (V_e, V_o).
(4) Extract the principal components of a test sample X by Y = XV.
4 The Experiments and Analysis
In order to compare the performance of S2DPCA with that of SPCA and 2DPCA, we perform experiments on two popular face databases: the CBCL database [9] for binary classification and the ORL database [10] for multi-category classification. The first experiment tests performance and execution time; the second compares the top recognition rates of these
algorithms described above, as well as the selection of even and odd symmetrical features. In all experiments, the image data are first preprocessed (histogram equalization, normalization to zero mean, and linear clipping to [−1, 1]). Then S2DPCA, 2DPCA and SPCA are used to extract features from the images, and finally a linear support vector machine [8] is used for classification.

4.1 Experiments on CBCL Database for Binary Classification
To demonstrate the performance of S2DPCA, we conduct face/non-face classification experiments on the CBCL database, whose images are normalized to 19 × 19 with 256 gray levels. We compare S2DPCA with SPCA and 2DPCA. The experiments were performed with 400 training images (200 face and 200 non-face) and 400 test images (200 face and 200 non-face), with no overlap between training and test samples.

Table 1. Comparison of the recognition rates (%) of SPCA, 2DPCA and S2DPCA on the CBCL database (the number of features is given in parentheses)

Training samples/class       1            2            3            4            5
SPCA                     79.5 (40)    85.8 (47)    91.4 (64)    93.5 (57)    94.1 (48)
2DPCA                    75.6 (19×3)  84.5 (19×4)  89.8 (19×5)  93.2 (19×5)  95.0 (19×4)
S2DPCA                   79.8 (19×3)  88.7 (19×4)  92.9 (19×5)  94.4 (19×5)  96.5 (19×4)
Table 2. Comparison of the average feature-extraction time using SPCA, 2DPCA and S2DPCA

Algorithm                          SPCA    2DPCA   S2DPCA
Time of feature extraction (s)    10.45     2.19     5.23
Recognition rates for SPCA, 2DPCA and S2DPCA are shown in Table 1, and average feature-extraction times are compared in Table 2. From the CBCL binary-classification experiment, it is evident that the classification performance of S2DPCA is higher than that of SPCA or 2DPCA. In Table 1, the lowest recognition error rate is reduced by 40.68%, from 5.9% to 3.5%, when SPCA is replaced by S2DPCA. Moreover, introducing symmetry also raises the accuracy from 95.0% to 96.5% in comparison with 2DPCA, reducing the error rate by 30%. According to Table 2, however, the proposed method increases the computational burden relative to 2DPCA, since it must compute the covariance matrices of both the even and the odd symmetrical images; this is its main disadvantage compared with 2DPCA.
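The relative error-rate reductions quoted above follow from the accuracies in Table 1 (error rate = 100% minus accuracy); a quick numeric check:

```python
def rel_reduction(err_before, err_after):
    """Relative reduction of the error rate."""
    return (err_before - err_after) / err_before

# SPCA 94.1% -> S2DPCA 96.5% accuracy: error 5.9% -> 3.5%
assert abs(rel_reduction(5.9, 3.5) - 0.4068) < 1e-3
# 2DPCA 95.0% -> S2DPCA 96.5% accuracy: error 5.0% -> 3.5%
assert abs(rel_reduction(5.0, 3.5) - 0.30) < 1e-12
```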
4.2 Experiment on ORL Face Database for Multi-category Classification
The ORL database (Olivetti Research Laboratory database) consists of 400 face images from 40 different people, with 10 images per person. The images of one person differ in shooting time, expression (open eyes, closed eyes, smiling, etc.) or details (with or without glasses, etc.). The images of one person in the ORL database are shown in Fig. 1. The characteristics of the ORL database are: each person has an equal number of face images; there is rich variance in expression, viewing angle and facial details, but little variance in illumination.
Fig. 1. Samples from the ORL face database

Table 3. Comparison of the recognition rates (%) of SPCA, 2DPCA and S2DPCA on the ORL database (the number of features is given in parentheses)

Training samples/class       1             2             3             4             5
SPCA                     78.5 (39)     86.8 (45)     89.4 (66)     90.3 (50)     92.1 (46)
2DPCA                    75.6 (112×4)  85.2 (112×4)  89.2 (112×5)  92.3 (112×4)  94.5 (112×3)
S2DPCA                   79.1 (112×4)  89.6 (112×4)  91.5 (112×4)  94.6 (112×4)  96.4 (112×3)
Table 4. Comparison of the Number of Even Symmetrical Features (NESFs) and the Number of Odd Symmetrical Features (NOSFs) extracted by S2DPCA versus SPCA on the ORL face database, when the total number of features (Sum of NFs) is 40

Algorithm   NESFs   NOSFs   Sum of NFs
SPCA          32       8        40
S2DPCA        34       6        40
The original face images in the ORL database are all sized 92 × 112 with 256 gray levels; the gray scale was linearly normalized to lie within [−1, 1]. We select samples randomly from the ORL database. For each person, the number of training samples equals the number of test samples, with no overlap between them. Specifically, for each person, 5 images are selected randomly from the 10 available; the resulting 200 images are used as training samples and the remaining 200 as test samples. The top recognition rates using SPCA, 2DPCA and S2DPCA are presented in Table 3. From Table 3 it is again evident that the classification performance of S2DPCA is superior to that of SPCA or 2DPCA. In addition, Table 4 shows that each eigenface is both symmetrical and asymmetrical to
Table 5. Comparison of feature-extraction time with 1 training sample per class

Algorithm   SPCA    2DPCA   S2DPCA
Time (s)    14.42   10.76    12.36
some extent, but most eigenfaces look more like symmetrical images and few have strongly asymmetrical parts. Table 5 compares the feature-extraction time with 1 training sample per class. From Table 5 we find that S2DPCA needs less time than SPCA, because S2DPCA extracts features directly from image matrices rather than from vectors, which are usually of very high dimensionality. At the same time, S2DPCA obtains a higher recognition rate than 2DPCA because it utilizes the symmetry of facial images. Thus, S2DPCA combines the advantages of SPCA and 2DPCA.
5 Conclusions

The S2DPCA algorithm presented in this paper combines the advantages of 2DPCA and SPCA: it considers the symmetry of the human face, and it avoids the computational cost PCA incurs by transforming image matrices into vectors. At the same time, from each image S2DPCA obtains two images, an even symmetrical image and an odd symmetrical image, by even-odd decomposition, so the number of samples is doubled, which is valuable when few samples are available. Thus, S2DPCA is competitive with or superior to 2DPCA and SPCA, especially with few samples. A weakness of S2DPCA is that it must compute the covariance matrices of the even and the odd symmetrical images, which increases the computational cost; how to reduce this cost is left for future work.
Acknowledgement

We would like to thank all the anonymous reviewers for their valuable comments. This work was fully supported by the Chongqing Educational Committee Foundation under Grant No. KJ060711.
References

1. Turk, M., Pentland, A.: Eigenfaces for Recognition. J. Cogn. Neurosci. 3 (1991) 71-86
2. Xu, L., Yuille, A.L.: Robust Principal Component Analysis by Self-Organizing Rules Based on Statistical Physics Approach. IEEE Trans. Neural Networks 6 (1995) 131-143
3. Schölkopf, B., Smola, A.J., Müller, K.R.: Kernel Principal Component Analysis. In: Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (eds.) ICANN 97, Lausanne, Switzerland. Lecture Notes in Computer Science 1327, Springer, Berlin (1997) 583-588
4. Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10 (1998) 1299-1319
5. Kim, K.I., Jung, K., Kim, H.J.: Face Recognition Using Kernel Principal Component Analysis. IEEE Signal Processing Letters 9 (2002) 40-42
6. Yang, Q., Ding, X.: Symmetrical PCA in Face Recognition. IEEE ICIP 2002 Proceedings II (2002) 97-100
7. Yang, J., Zhang, D., Frangi, A.F., et al.: Two-dimensional PCA: A New Approach to Appearance-based Face Representation and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 131-137
8. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)
9. CBCL database, available from: http://www.ai.mit.edu/projects/cbcl/software-datasets
10. ORL database, available from: http://www.uk.research.att.com:pub/data/att_faces.tar.Z
A Method Based on ICA and SVM/GMM for Mixed Acoustic Objects Recognition

Yaobo Li 1, Zhiliang Ren 1, Gong Chen 2, and Changcun Sun 1

1 Dept. of Weaponry Eng., Naval Univ. of Engineering, Wuhan 430033, China
{Lyb358,rzl503}@163.com
2 Dept. of Electronic Information Eng., Institute of Communications Engineering of PLA University, Nanjing 210007, China
[email protected]

Abstract. Using independent component analysis (ICA) to realize blind separation of mixed acoustic objects, a recognition method based on support vector machines/Gaussian mixture models (SVM/GMM) is proposed, with linear prediction coefficient (LPC) features. It is shown that LPC features are consistently better than wavelet energy features and that ICA is an efficient algorithm for estimating the unknown signal level. The method uses the output of the GMM to adjust the probabilistic output of the SVM. The validity of the ICA and SVM/GMM model is verified via examples in a mixed acoustic objects recognition system.
1 Introduction

The correct recognition rate of a field acoustic recognition system can reach a satisfactory level; however, performance degrades rapidly when the system inputs are contaminated with noise. On a battlefield, the system inputs mix several kinds of acoustic objects (such as tanks and helicopters); if the mixed objects can be separated before recognition, performance can be greatly enhanced. Besides, separating mixed acoustic objects makes it possible to decide the number and locations of objects, and thus to estimate and analyze new acoustic objects. The counter-tank and counter-helicopter warfare of the future has attracted more and more attention. In the past, much effort has been devoted to methods for eliminating background noise. Traditional filtering and wavelet-transform methods are the most common techniques for eliminating stationary noise from degraded acoustic objects; in fact, since the spectrum of the noise overlaps with the spectrum of the acoustic objects, their ability to raise the signal-to-noise ratio is very limited and they can still distort the acoustic objects. In this work, we separate acoustic objects using independent component analysis (ICA). ICA provides a linear representation that minimizes the statistical dependencies among its components, based on higher-order statistics of the data. Dependencies among higher-order features can be removed by isolating independent components, and the ability of ICA to handle higher-order statistics in addition to second-order statistics is useful in achieving an effective separation of the feature space for given data [1][2]. This paper presents a new method that combines a discriminative model and a generative model to exploit the strengths of both. The support vector machine has
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1056–1064, 2007. © Springer-Verlag Berlin Heidelberg 2007
A Method Based on ICA and SVM/GMM for Mixed Acoustic Objects Recognition
recently proven to be a good discriminative classifier for many kinds of pattern recognition applications, with advantages in training and performance compared to artificial neural networks. Gaussian mixture models have become the dominant approach for modeling in object recognition applications because a GMM represents the statistical characteristics of objects properly. In this paper, we address acoustic target recognition based on ICA and SVM/GMM. With ICA realizing the blind separation of mixed acoustic targets, a recognition method based on SVM/GMM is proposed using LPC features. The SVM and the GMM are combined to exploit the abilities of both models: the GMM's output is used to adjust the probabilistic output of the SVM. In the training phase, the SVM and the GMM are trained independently; in the testing phase, the probability outputs of the GMM adjust the posterior probability output of the SVM, and likelihood scoring is performed with the new hybrid model.
2 LPC-Based Feature Extraction We choose tanks as the acoustical objects here. Over the observation time, tank objects can be modeled by the autoregressive (AR) model [4][5]:

x(n) = ξ(n) − Σ_{i=1}^{p} a_i x(n − i),   (1)

where a_i are the linear prediction coefficients (LPC), p is the order of the AR model, and ξ(n) is the input excitation. LPC analysis essentially attempts to find an optimal fit to the envelope of the acoustical spectrum from a given sequence of object signal samples. The AR spectrum estimate is expressed by

P(f) = σ_ξ² Δt / |1 + Σ_{i=1}^{p} a_i exp(−j2πf iΔt)|²,   (2)

where Δt is the sampling interval and σ_ξ² is the variance of the excitation. The AR spectrum is a high-resolution method for power spectrum estimation; the power spectrum of the tank signals reflects the distribution of signal energy over frequency.
Fig. 1. Spectrum response of tank objects
1058
Y. Li et al.
From (2), the AR spectrum is tightly related to the LPC, so it is reasonable to extract this feature from tank objects. The frequency response obtained with a 20th-order LPC analysis is shown in Fig. 1.
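The coefficients a_i of Eq. (1) are typically estimated from the signal's autocorrelation with the Levinson-Durbin recursion; the paper does not specify its estimation procedure, so the following is an illustrative sketch (function names are ours), using the sign convention of Eq. (1), i.e. x(n) + Σ a_i x(n−i) = ξ(n):

```python
import numpy as np

def lpc(signal, order):
    """Estimate the LPC coefficients a_1..a_p of the AR model of Eq. (1),
    x(n) + a_1 x(n-1) + ... + a_p x(n-p) = xi(n),
    via the autocorrelation method and the Levinson-Durbin recursion."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    # biased autocorrelation estimates r(0)..r(p)
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                     # prediction error power
    for i in range(1, order + 1):
        # reflection coefficient from the current residual correlation
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]                   # the p coefficients a_1..a_p
```

The returned coefficients can be plugged into Eq. (2) to evaluate the AR power spectrum at any frequency.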
3 ICA Algorithm In independent component analysis (ICA), the measured samples (here multidimensional tank signals) are assumed to be linear mixtures of some underlying sources. The goal of ICA is to find how the measured signals are formed from the underlying signals, assuming the sources are statistically as independent as possible. ICA is a linear model that describes the mixing of the source components S = {s₁(t), s₂(t), ···, s_m(t)}ᵀ by a mixing matrix A to produce the mixtures X = {x₁(t), x₂(t), ···, x_n(t)}ᵀ. In matrix form, defining S and X to be the column vectors associated with the source and mixed signals, we have [6]:

X = AS,   (3)

where A is the unknown mixing matrix and n ≥ m. The purpose of ICA is to separate the source signals from X:

Ŝ = A⁻¹X = WᵀX,   (4)

where Ŝ is an estimate of the sources and Wᵀ is a regularized estimate of the inverse of A. 3.1 Fast ICA
We apply the Fast ICA algorithm for feature extraction. The algorithm controls convergence with a generating function so as to estimate the non-Gaussian independent components. The kurtosis is defined as the fourth coefficient of the cumulant generating function:

k(v) = E{v⁴} − 3(E{v²})².   (5)

For whitened data, the kurtosis of WᵀX is expressed by

k(WᵀX) = E{(WᵀX)⁴} − 3[E{(WᵀX)²}]² = E{(WᵀX)⁴} − 3‖W‖⁴.   (6)

The generating function is

J(W) = E{(WᵀX)⁴} − 3‖W‖⁴ + F(‖W‖²),   (7)

where F is a cost function. Considering the convergence rate, this leads to the fixed-point update

W ← E{X(WᵀX)³} − 3W.   (8)

3.2 Simulation
The experiment is performed with tank signals s₁(t), s₂(t), s₃(t), s₄(t), s₅(t), s₆(t). The sampling frequency is 8000 Hz and the sample length is 3000. A mixing matrix is then applied to obtain the observation signals, and the ICA result shows that approximate source signals are separated. Fig. 2 (a) shows the normalized observation signals x₁(t), x₂(t), x₃(t), x₄(t), x₅(t), x₆(t), x₇(t) before
Fig. 2. Normalized observation signals (a), source signals (b), and separated signals (c)
mixed. Fig. 2 (b) shows the normalized source signals s₁(t), s₂(t), s₃(t), s₄(t), s₅(t), s₆(t), which correspond to the individual tanks. Fig. 2 (c) shows the normalized signals y₁(t), y₂(t), y₃(t), y₄(t), y₅(t), y₆(t) obtained with ICA. Being a statistical method, ICA can only recover signals that approximate the sources: in Fig. 2 the peak-to-peak values and the ordering of the separated signals differ from those of the source signals, but the waveforms are consistent.
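The separation just described — centering/whitening followed by the fixed-point update of Eq. (8) with deflation (standard FastICA preprocessing steps the text assumes but does not spell out) — can be sketched as follows; all function names are ours:

```python
import numpy as np

def whiten(X):
    """Center and whiten the mixtures so that E{zz^T} = I."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / Xc.shape[1]
    d, E = np.linalg.eigh(cov)
    return (E @ np.diag(1.0 / np.sqrt(d)) @ E.T) @ Xc

def fastica_kurtosis(X, n_iter=200, seed=0):
    """One-unit kurtosis-based FastICA with deflation; the inner step is
    the fixed point of Eq. (8): w <- E{z (w^T z)^3} - 3w, renormalized."""
    Z = whiten(X)
    m = Z.shape[0]
    rng = np.random.default_rng(seed)
    W = np.zeros((m, m))
    for i in range(m):
        w = rng.standard_normal(m)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            w = (Z * (w @ Z) ** 3).mean(axis=1) - 3.0 * w
            w -= W[:i].T @ (W[:i] @ w)   # deflation: stay orthogonal
            w /= np.linalg.norm(w)
        W[i] = w
    return W @ Z                          # estimated source signals
```

As in the simulation, the recovered signals match the sources only up to sign, scale and ordering.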
4 SVM/GMM Based Recognition There are two major kinds of models for acoustic object recognition: discriminative models such as the SVM and generative models such as the GMM. Either kind can be used to build models for object recognition tasks. Discriminative models make full use of the discriminative information between different classes, while generative models use the statistical information within each class. In a word, generative models use intra-class information and discriminative models use inter-class information. Since each kind of model has its own advantages, each also has the disadvantage of not using the other kind of information. 4.1 GMM Based Modeling
A GMM can be regarded as an HMM with a single state. The Gaussian mixture density is defined by a weighted sum of M component densities:

P(x_t) = Σ_{i=1}^{M} P_i b_i(x_t),   (9)

where

b_i(x_t) = N(x_t, μ_i, Σ_i) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) exp(−(1/2)(x_t − μ_i)ᵀ Σ_i⁻¹ (x_t − μ_i)),  t = 1, 2, ···, T,   (10)

and P_i, μ_i and Σ_i are the weight, mean and covariance of the i-th component, respectively. The mixture weights satisfy the constraint Σ_{i=1}^{M} P_i = 1. The GMM reflects the intra-class information.
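Eqs. (9)-(10) can be transcribed directly (an illustrative sketch; names are ours):

```python
import numpy as np

def gmm_density(x, weights, means, covs):
    """Gaussian mixture density of Eqs. (9)-(10).
    x: (d,) point; weights: (M,); means: (M, d); covs: (M, d, d)."""
    total = 0.0
    d = len(x)
    for P_i, mu, S in zip(weights, means, covs):
        diff = x - mu
        # component density b_i(x) of Eq. (10)
        norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(S)))
        total += P_i * norm * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))
    return total
```

The weights passed in must sum to one, matching the constraint on the P_i.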
4.2 SVM Based Modeling
Let x be a data point and y its target class. The output of the standard SVM is

y = sign(f(x)),   (11)

where f(x) = ωᵀx + b. One method of producing probabilistic outputs, proposed by Wahba, uses a logistic link function. For an SVM there are two outputs corresponding to the two classes, so the posterior probability outputs of the SVM are

P(C₊₁|x) = 1 / (1 + e^{−f(x)}),  P(C₋₁|x) = 1 / (1 + e^{f(x)}),   (12)

where f(x) can be viewed as the distance from x to the support vectors. That is, the posterior probability outputs of the SVM are based on the distances between test vectors and support vectors, so the outputs reflect the inter-class information. Our method combines the SVM and the GMM to utilize the abilities of both models, using the GMM's output to adjust the probabilistic output of the SVM. The GMM is embedded into the SVM to adjust the acoustic object probabilistic output as follows:

P(C_i|x) = Π_{k=1}^{K} ( P_SVM(C_i|x_k) P_GMM(x_k|C_i) / Σ_{j=1}^{N} P_SVM(C_j|x_k) P_GMM(x_k|C_j) ),   (13)

where the probabilistic output of the i-th GMM is P_GMM(x|C_i) = Σ_{m=1}^{M} c_im N(x, μ_im, Σ_im), with

N(x, μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp[−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)],   (14)

and c_im, μ_im and Σ_im are the weight, mean and covariance of the m-th component. P_SVM(C_i|x) is the probabilistic output of the i-th SVM. The complete GMM object model is parameterized by the mean vectors, covariance matrices and mixture weights of all component densities; these parameters are collectively denoted λ = {P_i, μ_i, Σ_i | i = 1, 2, ···, M}.
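Eqs. (12)-(13) can be sketched as follows (illustrative; names are ours; accumulating the product of Eq. (13) in the log domain is our addition, to avoid numerical underflow over long frame sequences):

```python
import numpy as np

def svm_posteriors(f):
    """Two-class sigmoid posteriors from the SVM decision value f(x), Eq. (12)."""
    p_pos = 1.0 / (1.0 + np.exp(-f))
    return p_pos, 1.0 - p_pos

def fuse_svm_gmm(p_svm, p_gmm):
    """Adjust the SVM posteriors with the GMM likelihoods over K frames, Eq. (13).
    p_svm: (K, N) array of P_SVM(C_j | x_k); p_gmm: (K, N) of P_GMM(x_k | C_j).
    Returns the log of the fused per-class scores."""
    joint = p_svm * p_gmm                          # numerator terms of Eq. (13)
    frame_post = joint / joint.sum(axis=1, keepdims=True)
    return np.log(frame_post).sum(axis=0)          # log of the product over frames
```

The class decision is the argmax of the returned scores, since the logarithm preserves ordering.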
5 Acoustic Objects Recognition In this paper, with ICA realizing the blind separation of the mixed acoustic objects, the features of the clean tank signals are derived using 10th-order LPC analysis on a 20

Fig. 3. The acoustic recognition system (block diagram: separating targets based on ICA; LPC-based feature extraction of the test and training acoustic signals; k-means; SVM/GMM recognition)
milliseconds frame every 5 milliseconds. In order to construct a small training data set, the training data for each object were converged to one centroid using the k-means clustering algorithm. The acoustic recognition system based on ICA and SVM/GMM

Table 1. (a) Performance comparison of acoustic source and separated objects

                                  Gun    Machine gun  Artillery  Vehicle  Tank   Heli
Performance (Separating targets)  100%   100%         98%        87%      100%   100%
Performance (Source targets)      91%    94%          91%        81%      100%   98%

(b) Performance comparison of tank source and separated objects

                                  Tank A  Tank B  Tank C  Tank D  Tank E  Tank F
Performance (Separating targets)  99%     100%    93%     88%     95%     100%
Performance (Source targets)      93%     93%     90%     88%     90%     96%
Table 2. (a) Performance of acoustic objects for various orders of LPC (number of frames L = 30)

Order of LPC            Gun    Machine gun  Artillery  Vehicle  Tank   Heli
5    GMM                95%    92%          20%        62%      69%    98%
     SVM                100%   100%         26%        80%      74%    61%
     SVM-GMM            99%    93%          27%        78%      100%   99%
10   GMM                88%    100%         38%        87%      55%    97%
     SVM                100%   100%         95%        87%      71%    77%
     SVM-GMM            100%   100%         93%        88%      100%   100%
20   GMM                84%    100%         79%        84%      61%    96%
     SVM                100%   100%         90%        87%      84%    91%
     SVM-GMM            100%   100%         98%        88%      100%   100%

(b) Performance of tanks for various orders of LPC (number of frames L = 30)

Order of LPC            Tank A  Tank B  Tank C  Tank D  Tank E  Tank F
5    GMM                95%     92%     20%     62%     69%     98%
     SVM                100%    100%    26%     80%     74%     61%
     SVM-GMM            99%     93%     27%     78%     100%    99%
10   GMM                88%     100%    38%     87%     55%     97%
     SVM                100%    100%    95%     87%     71%     77%
     SVM-GMM            100%    100%    93%     88%     100%    100%
20   GMM                84%     100%    79%     84%     61%     96%
     SVM                100%    100%    90%     87%     84%     91%
     SVM-GMM            100%    100%    98%     88%     100%    100%
Table 3. (a) Performance of acoustic objects for various numbers of frames (order of LPC P = 10)

Number of frames        Gun    Machine gun  Artillery  Vehicle  Tank   Heli
10   GMM                83%    91%          58%        79%      54%    96%
     SVM                99%    91%          90%        77%      71%    71%
     SVM-GMM            94%    95%          96%        83%      98%    96%
20   GMM                88%    100%         38%        87%      55%    97%
     SVM                100%   100%         95%        87%      71%    77%
     SVM-GMM            100%   100%         93%        88%      100%   100%
30   GMM                85%    100%         75%        88%      75%    100%
     SVM                100%   100%         93%        86%      100%   99%
     SVM-GMM            100%   100%         98%        87%      100%   100%
(b) Performance of tanks for various numbers of frames (order of LPC P = 10)

Number of frames        Tank A  Tank B  Tank C  Tank D  Tank E  Tank F
10   GMM                43%     76%     41%     41%     94%     91%
     SVM                86%     90%     78%     18%     76%     71%
     SVM-GMM            67%     96%     86%     22%     98%     97%
20   GMM                33%     100%    38%     100%    100%    99%
     SVM                100%    100%    82%     50%     100%    94%
     SVM-GMM            98%     100%    85%     81%     100%    99%
30   GMM                85%     100%    73%     100%    100%    100%
     SVM                100%    100%    84%     33%     100%    92%
     SVM-GMM            99%     100%    93%     88%     100%    100%
Table 4. (a) Performance of acoustic objects for different features (order of LPC P = 10, number of frames L = 30)

Feature                         Gun    Machine gun  Artillery  Vehicle  Tank   Heli
Wavelet packet   GMM            98%    100%         25%        50%      99%    87%
energy           SVM            72%    79%          14%        16%      86%    58%
                 SVM-GMM        72%    87%          20%        14%      85%    65%
LPC              GMM            85%    100%         75%        88%      75%    100%
                 SVM            100%   100%         93%        86%      100%   99%
                 SVM-GMM        100%   100%         98%        87%      100%   100%
(b) Performance of tanks for different features (order of LPC P = 10, number of frames L = 30)

Feature                         Tank A  Tank B  Tank C  Tank D  Tank E  Tank F
Wavelet packet   GMM            100%    100%    100%    100%    37%     100%
energy           SVM            37%     89%     33%     100%    11%     90%
                 SVM-GMM        100%    100%    58%     99%     53%     99%
LPC              GMM            85%     100%    73%     100%    100%    100%
                 SVM            100%    100%    64%     33%     100%    92%
                 SVM-GMM        99%     100%    93%     88%     100%    100%
Table 5. (a) Performance of acoustic objects for different SNR (order of LPC P = 10, number of frames L = 30)

SNR                     Gun    Machine gun  Artillery  Vehicle  Tank   Heli
40dB  GMM               85%    100%         64%        87%      76%    100%
      SVM               100%   100%         93%        86%      96%    99%
      SVM-GMM           100%   100%         91%        87%      100%   100%
20dB  GMM               87%    100%         48%        0%       70%    93%
      SVM               100%   100%         75%        0%       96%    90%
      SVM-GMM           100%   100%         80%        0%       100%   97%
(b) Performance of tanks for different SNR (order of LPC P = 10, number of frames L = 30)

SNR                     Tank A  Tank B  Tank C  Tank D  Tank E  Tank F
40dB  GMM               86%     100%    73%     100%    100%    100%
      SVM               99%     100%    64%     33%     100%    92%
      SVM-GMM           99%     100%    93%     88%     100%    100%
20dB  GMM               80%     100%    64%     100%    100%    99%
      SVM               95%     100%    52%     52%     100%    90%
      SVM-GMM           99%     100%    93%     88%     100%    99%
are shown in Fig. 3. The number of probability density functions is 8. The kernel function of the SVM is K(x_i, x) = exp(−‖x − x_i‖² / σ²).

Table 1 shows the performance under different conditions. From Table 1, it can be seen that by incorporating ICA before the SVM/GMM input, the recognition and verification performance changes only slightly. Tables 2-4 show the performance of SVM/GMM on clean tank signals. Table 5 shows the accuracy averaged over 100 test experiments in the presence of field wind. From the above tables it can be concluded that: 1) Performance varies with the order of LPC when the number of frames is held constant, and likewise varies with the number of frames when the LPC order is held constant. 2) In Table 2, the average performances based on the GMM are 87.17% and 93%, and those based on the SVM are 96.33% and 84.83%, while the average performances reach 97.5% and 96.67% with SVM/GMM. Because the GMM reflects the similarity within a class and the SVM finds the differences between classes, the average performance with the GMM is 9.16% lower than with the SVM for the acoustic objects; conversely, for the tank objects the GMM performs 8.17% better than the SVM. SVM/GMM combines the advantages of both: its performance is 10.33% and 3.67% higher than the GMM, and 1.17% and 11.84% higher than the SVM, for the acoustic and tank objects respectively. 3) Under noisy conditions, SVM/GMM shows robust performance.
6 Conclusion A novel recognition system is proposed that integrates the advantages of ICA and SVM/GMM. Linear prediction coefficients are used to characterize the tank objects. The recognition system achieves encouraging results for different training and testing configurations. The conclusions can be drawn as follows: 1) The performance of the LPC-based method is consistently better than that of the wavelet packet energy feature (Table 4). 2) An efficient ICA-based algorithm is employed to estimate the unknown signals; the separated objects are similar to the source objects (Table 1). 3) The proposed SVM/GMM approach identifies field tanks more accurately than either the SVM or the GMM alone. Further investigations are being carried out to compare the performance of other algorithms for the recognition of field acoustic objects.
References 1. Amari, S., Cardoso, J.: Blind Source Separation - Semiparametric Statistical Approach. IEEE Trans. on Signal Processing 45 (11) (1997) 2692-2700 2. Amari, S., et al.: A New Learning Algorithm for Blind Separation of Sources. Advances in Neural Information Processing Systems 8. MIT Press, Cambridge (1996) 757-763 3. Wang, B., Qu, D., Peng, X.: Practical Speech Recognition Foundation. National Defense Industry Press, Beijing (2001) 4. Marple, S.L.: A New Autoregressive Spectrum Analysis Algorithm. IEEE Transactions on ASSP 28 (4) (1980) 441-454 5. Ji, X., Shi, W., Zhang, G.: An Effective Algorithm of Feature Extraction and Its Application. Journal of Shanghai Jiaotong University 37 (11) (2003) 6. Bell, A., Sejnowski, T.: An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7 (1995) 1129-1159 7. Reynolds, D.A.: Speaker Identification and Verification Using Gaussian Mixture Speaker Models. Speech Communication 17 (1995) 91-108 8. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning 20 (1995) 273-297 9. Rabiner, L., Wilpon, J.G., Juang, B.H.: A Segmental K-Means Training Procedure for Connected Word Recognition. AT&T Tech. J. 65 (3) (1986) 21-31
ICA Based Super-Resolution Face Hallucination and Recognition Hua Yan, Ju Liu, Jiande Sun, and Xinghua Sun School of Information Science and Engineering, Shandong University Jinan, 250100, Shandong, China {yhzhjg,juliu,jd_sun}@sdu.edu.cn,
[email protected] Abstract. In this paper, we propose a new super-resolution face hallucination and recognition method based on Independent Component Analysis (ICA). Firstly, ICA is used to build a linear mixing relationship between a high-resolution (HR) face image and independent HR source face images. The linear mixing coefficients are retained, so that the corresponding low-resolution (LR) face image is represented by a linear mixture of the down-sampled source face images. Once the source face images have been obtained by training on a set of HR face images, unconstrained least squares is used to obtain the mixing coefficients of a LR image for hallucination and recognition. Experiments show that the accuracy of face recognition is insensitive to image size and to the number of HR source face images when the image size is larger than 8×8, and that the resolution and quality of the hallucinated face image are greatly enhanced over the LR one, which is very helpful for human recognition.
1 Introduction The faces of interest often appear very small in surveillance imagery because of the relatively large distance between the cameras and the scene. Face resolution is an important factor in recognition performance, so resolution enhancement techniques are generally needed. Super-resolution (SR) techniques in computer vision aim to enhance resolution, that is, to infer the lost high-resolution (HR) image from the low-resolution (LR) ones. In general, there are two classes of SR techniques: reconstruction-based (from the input images only) and learning-based (from other training images). Of particular interest is face hallucination, i.e., learning a high-resolution face image from a low-resolution one. Face hallucination is a term coined by Baker and Kanade [1], which implies that the high-frequency part of the face image must be purely fabricated from a parent structure by recognizing the local features from the training set. Liu et al. [2] developed a two-step statistical modeling algorithm combining global and local parametric models, based on PCA and a nonparametric Markov network, for hallucinating faces. The two-step algorithm was improved by Li et al. [3]. In [4], PCA was also used to fit the input LR face image as a linear combination of the LR face images in the training set. D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1065–1071, 2007. © Springer-Verlag Berlin Heidelberg 2007
H. Yan et al.
Replacing the LR training images with HR ones, while retaining the same combination coefficients, renders the HR image. In this paper, we propose a new SR face hallucination and recognition method based on Independent Component Analysis (ICA). Firstly, the ICA model is used to represent an HR face image by linearly mixing some independent HR source face images. After the HR source face images are down-sampled to obtain the corresponding down-sampled source face images, and the same mixing coefficients are retained, a linear relation between LR face images and down-sampled source face images is built. Thus, once the source face images have been trained from a set of HR face images by ICA, the unconstrained least squares method is used to obtain the mixing coefficients of a LR face image. Finally, face hallucination and recognition are achieved by retaining the mixing coefficients and combining the trained source face images. Experiments show that when the image size is larger than 8×8, the accuracy of face recognition is insensitive to image size and to the number of source face images, and that the hallucinated face images closely approximate the original HR face images and are very helpful for recognition by a human being.
2 Image-Domain Independent Component Analysis (ICA) Independent component analysis (ICA) can derive features which best represent the image via a set of source signals which are as statistically independent as possible. The main assumption behind ICA is that some observation signals X = [x₁, x₂, …, x_m]ᵀ may be modeled as the linear mixture of the statistically independent source signals S = [s₁, s₂, …, s_n]ᵀ by an unknown m × n dimensional mixing matrix A = [a₁, a₂, …, a_n], where x_i and s_i denote the column vectors into which the observation signal and source signal are ordered in lexicographical notation, respectively, and a_i denotes the mixing column vector which contains the coordinates of observation signal x_i with respect to the source signals S:

X = AS.   (1)

Usually, ICA can be performed on images under two different architectures [5]. In Architecture I, images are treated as random variables and pixels as outcomes, whereas in Architecture II, pixels are treated as random variables and images as outcomes. In this paper, Architecture I is chosen, and the image synthesis model based on ICA is shown in Fig. 1. In Architecture I, the data matrix is organized so that the images are in rows and the pixels are in columns, and each image has zero mean. In this approach, ICA finds a matrix W such that the rows of S = WX are as statistically independent as possible. The rows of S are then used as source face images to represent faces. Face image representations consist of the coordinates a of these images with respect to the source face images defined by the rows of S, as shown in Fig. 2. These coordinates are contained in the mixing matrix.
Fig. 1. Image synthesis model for Architecture I
Fig. 2. Independent source image representation consists of the coefficients a = [a₁, a₂, …, a_n]ᵀ for the linear combination of independent source images s_i
3 ICA-Based Super-Resolution Face Hallucination 3.1 Mathematic Model from HR to LR Image

Let I_H and I_L denote the HR and LR face images respectively. If I_L is d times smaller than I_H in both the horizontal and vertical directions, I_L is computed by

I_L(m, n) = (1/d²) Σ_{i=0}^{d−1} Σ_{j=0}^{d−1} I_H(dm + i, dn + j),   (2)

where d, which is always an integer, denotes the down-sampling factor in the horizontal and vertical directions. Equation (2) combines a smoothing step and a down-sampling step, which is more consistent with image formation as integration over the pixel [1]. To simplify the notation, let I_H and I_L be lexicographically ordered into row vectors; then (2) can be rewritten in matrix-vector form:

I_L = I_H D,   (3)

where I_L denotes an N-dimensional row vector of the LR image, I_H denotes a d²N-dimensional row vector of the HR image, and D denotes a d²N × N down-sampling matrix.
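Eqs. (2)-(3) can be sketched as follows (an illustrative sketch; names are ours):

```python
import numpy as np

def downsample(I_H, d):
    """Block-average each d x d block of the HR image -- Eq. (2)."""
    h, w = I_H.shape
    return I_H.reshape(h // d, d, w // d, d).mean(axis=(1, 3))

def downsample_matrix(h, w, d):
    """Build the d^2*N x N matrix D of Eq. (3), with N = (h//d)*(w//d),
    acting on lexicographically ordered row vectors: I_L = I_H D."""
    H, W = h // d, w // d
    D = np.zeros((h * w, H * W))
    for r in range(h):
        for c in range(w):
            # HR pixel (r, c) contributes weight 1/d^2 to its block
            D[r * w + c, (r // d) * W + (c // d)] = 1.0 / d**2
    return D
```

Both routes give the same result; the matrix form is what makes the linear algebra of the following sections possible.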
3.2 ICA Model of HR and LR Image We apply the ICA model under Architecture I to fit the HR face image I_H as a linear combination of n source face images:

I_H = a₁s_H¹ + a₂s_H² + ⋯ + a_n s_Hⁿ + M_H = a_Hᵀ S_H + M_H,   (4)
where S_H = [s_H¹, s_H², …, s_Hⁿ]ᵀ is the statistically independent source face image set in the HR space, M_H denotes the mean face image of the HR face images, and a₁, …, a_n are the mixing coefficients, which form a mixing vector a_H = [a₁, a₂, …, a_n]ᵀ. Thus, Equation (4) can be rewritten as

I_L = (a₁s_H¹ + a₂s_H² + ⋯ + a_n s_Hⁿ)D + M_H D.   (5)

Since a₁, …, a_n are the coefficients that denote the coordinates of the HR face image with respect to the source face images,

I_L = a₁s_H¹D + a₂s_H²D + ⋯ + a_n s_HⁿD + M_H D.   (6)

We can set up the correspondence between each HR source image and its down-sampled version, as well as between the HR and LR mean face images:

s_L^i = s_H^i D,  M_L = M_H D,   (7)

where s_L^i denotes the down-sampled source face image and M_L denotes the LR mean face image. Then

I_L = a₁s_L¹ + a₂s_L² + ⋯ + a_n s_Lⁿ + M_L = a_Hᵀ S_L + M_L.   (8)

Equation (8) describes the linear relation between the LR face image and the down-sampled source face images.
3.3 Face Hallucination and Recognition In the proposed method, we first remove the mean face image from the HR face image set. Then the zero-mean training HR face images are used to obtain a set of independent source face images s_H¹, s_H², …, s_Hⁿ and the mixing matrix A in the HR space by FastICA [6]. The source face images are then down-sampled according to Equation (7). Finally, s_H¹, …, s_Hⁿ, s_L¹, …, s_Lⁿ and A are stored for face hallucination and recognition. For hallucinating an HR face image from a LR one, we seek a mixing vector that minimizes the following cost function:

J = ‖I_L − M_L − a_Hᵀ S_L‖.   (9)

An unconstrained least squares method can be used to solve this problem:

a_Hᵀ = I_L S_Lᵀ (S_L S_Lᵀ)⁻¹.   (10)
Finally, we retain a_H to linearly combine the corresponding HR source images, and the mean face image corresponding to the LR input is added to render the face hallucination:

Î_H = a_Hᵀ S_H + M_L Dᵀ.   (11)

Face recognition can also be carried out with the following criterion:

c = a_Hᵀ a_i / (‖a_H‖ · ‖a_i‖),   (12)

where a_i denotes the mixing vector which contains the coordinates of training face image I_H^i with respect to the source face images S_H. The framework of face hallucination and recognition is shown in Fig. 3.
Fig. 3. Face hallucination and recognition
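Eqs. (10)-(12) reduce to a few lines of linear algebra. A sketch (illustrative; names are ours, and the mean-image terms of Eqs. (9) and (11) are omitted for brevity):

```python
import numpy as np

def mixing_coeffs(I_L, S_L):
    """Unconstrained least squares of Eq. (10):
    a_H^T = I_L S_L^T (S_L S_L^T)^(-1), with I_L: (N,), S_L: (n, N)."""
    return np.linalg.solve(S_L @ S_L.T, S_L @ I_L)

def hallucinate(a_H, S_H):
    """Eq. (11) without the mean term: recombine the HR source images."""
    return a_H @ S_H

def similarity(a_H, a_i):
    """Recognition criterion of Eq. (12): cosine of the two mixing vectors."""
    return (a_H @ a_i) / (np.linalg.norm(a_H) * np.linalg.norm(a_i))
```

Because S_L S_Lᵀ is symmetric, solving the n × n system is equivalent to forming the explicit inverse of Eq. (10) but is numerically preferable.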
4 Experiments and Results Our experiments are conducted with 288 HR face images from the NUST 603 face database. The HR face images, which are aligned to be 32 × 32, include 96 individuals, with three face images from different sessions for each individual. The face images are blurred by averaging neighboring pixels and down-sampled to low-resolution images. The down-sampling factor d is set to 2 or 4 to obtain the corresponding LR images of size 16 × 16 or 8 × 8. All HR and LR face images are ordered lexicographically into row vectors. First, we study the recognition performance using the original HR face images and the hallucinated ones. 192 HR face images from the 96 individuals, with two different sessions each, are selected for training, and the other 96 HR images and the corresponding LR ones for testing. Equation (12) is used for similarity measurement. Comparing the hallucinated HR face images with the original HR ones, the recognition accuracies over different down-sampling factors and numbers of source images are shown in Table 1. It can be seen that when the LR face image is as small as 8 × 8, the recognition accuracy for the hallucinated HR face images drops about 10% compared with the original HR images, whereas in the other cases the recognition accuracy changes only slightly; that is, when the image size is larger than 8×8, the accuracy of face recognition is insensitive to image size and to the number of source face images. Next, the hallucination experiment is conducted on a data set containing 192 individuals with two face images for each individual. Using the "leave-one-out" methodology, at each time one individual, who can be recognized in the recognition procedure, is
selected for testing, and the remaining are used for training. In terms of the recognition results, we select 100 source face images for a down-sampling factor of 2, and 30 for a factor of 4, to hallucinate the HR face images. Some hallucination results are shown in Fig. 4 and Fig. 5. Compared with the input LR face images and the Cubic B-spline interpolation results, the hallucinated face images have much clearer detail features. They closely approximate the original high-resolution images and are very useful for human recognition.

Table 1. The result of face recognition

Number of         Size of LR face image 16 × 16      Size of LR face image 8 × 8
source images     Original      Hallucinated         Original      Hallucinated
100               80/96         80/96                ---           ---
60                79/96         81/96                79/96         23/96
30                77/96         78/96                77/96         68/96
15                71/96         72/96                71/96         61/96
Fig. 4. Face hallucination of three groups for down-sampling factor of 2. From left to right for each group: original HR image, hallucinated HR image, Cubic B-spline interpolation, LR image.
Fig. 5. Face hallucination of three groups for down-sampling factor of 4. From left to right for each group: original HR image, hallucinated HR image, Cubic B-spline interpolation, LR image.
5 Conclusion In this paper, we propose a new SR face hallucination and recognition method based on ICA. The ICA model is first used to build a linear relationship between an HR face image and independent HR source face images. Then the HR source face images are down-sampled to get the corresponding down-sampled source face images and the linear mixing coefficients are retained, so that the corresponding LR image is represented as a linear mixture of the down-sampled source face images. Thus, when a LR face image is given and the source face images have been trained, the unconstrained least
squares method is utilized to obtain the mixing coefficients. Finally, the mixing coefficients are retained and face hallucination and recognition are carried out. Experiments show that the ICA-based hallucinated face images closely approximate the original HR face images and are very helpful for human recognition, and that when the image size is larger than 8×8, the accuracy of face recognition is insensitive to image size and to the number of source face images.
Acknowledgement The work is supported by Program for New Century Excellent Talents in University (NCET-05-0582), Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20050422017) and the Project sponsored by SRF for ROCS, SEM ([2005]55). The corresponding author is Ju Liu (
[email protected]).
References 1. Baker, S., Kanade, T.: Hallucinating Faces. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, (2000) 83-88. 2. Liu, C., Shum H.Y., Zhang, C.S.: A Two-Step Approach to Hallucinating Faces: Global Parametric Model and Local Nonparametric Model. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1 (2001) 192-198. 3. Li, Y., Lin, X.Y.: An Improved Two-Step Approach to Hallucinating Faces, In Proceedings Third International Conference on Image and Graphics (2004) 298-301. 4. Wang, X.G., Tang, X.O.: Hallucinating Face by Eigentransformation. IEEE Transactions on Systems, Man and Cybernetics, Part C 35 (3) (2005) 425-434. 5. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face Recognition by Independent Component Analysis. IEEE Transactions on Neural Networks 13 (6) (2002) 1450-1464. 6. http://www.cis.hut.fi/projects/ica/fastica/
Principal Component Analysis Based Probability Neural Network Optimization Jie Xing1, Deyun Xiao1, and Jiaxiang Yu1,2 1
Department of Automation, Tsinghua University, Beijing, 100084, China
[email protected] 2 Department of Shipboard Weaponry, Dalian Naval Academy, Dalian, 116018, China
[email protected] Abstract. The topological structure of a Probability Neural Network (PNN) is usually complex when it is trained with large-scale, highly redundant training samples. To address this problem, the PNN is analyzed and simplified using probability calculation and the multiplication formula. First, the input data of the training samples are statistically analyzed using Principal Component Analysis (PCA), and the PNN topological structure is optimized based on the statistical results. Subsequently, a complete learning algorithm is provided to avoid setting the smoothing parameters by hand, and a Simulated Annealing (SA) coefficient is introduced to increase learning speed and stability. Finally, the optimized PNN is applied to a real problem. The test results validate that the optimized PNN has a simpler structure and higher efficiency than the typical PNN in applications with large-scale, highly redundant training samples.
1 Introduction The Probability Neural Network (PNN) is a four-layer feed-forward neural network based on the radial basis function neural network; Specht developed it from Bayesian decision theory [1]. The PNN constructs an estimate of the probability density functions, according to the Parzen-windows method, by summing the outputs of radial basis function neurons [2]. The PNN has a compact mathematical theory and a clear structure. Supplied with enough training samples of each class, the PNN can be used directly without a training course and obtains ideal results in general pattern classification applications. But the PNN is usually complex when it is supplied with large-scale, highly redundant training samples. As a solution, Principal Component Analysis (PCA) is introduced into PNN training. PCA is an important multi-variable statistical analysis technique, which uses statistical analysis to extract the principal components and indicate the linear correlation between process variables. In recent years, there have been mainly two research directions concerning the combination of PCA and neural networks. On the one hand, PCA is used as an input pretreatment of the neural network, reducing the input dimension; with fewer input variables, the neural network has a simpler structure, less computational complexity to some extent, and higher pattern classification efficiency [3,4]. On the other hand, neural networks are used to perform nonlinear PCA, which simplifies the statistical computation and improves the robustness of
Principal Component Analysis Based Probability Neural Network Optimization
PCA [5,6]. In this paper, PCA is employed not only for principal component extraction, but also for eliminating the correlation between process variables. The statistical results of PCA are used to guide PNN construction: the pattern neurons of PNN are combined and optimized based on the eigenvalues obtained in PCA, which decreases the structural complexity of PNN in applications with large-scale, highly redundant training samples. This paper is organized as follows. In Section 2, PNN is analyzed in terms of probability computation, and a multiplication-formula-based strategy for optimizing the PNN structure is proposed. Subsequently, a PCA based structure optimization strategy for PNN and the input data pretreatment are described in Section 3. The computation courses and the learning algorithm with a Simulated Annealing (SA) coefficient of the optimized PNN are explained in Section 4. In Section 5, the optimized PNN is applied to the prediction of Anode Effect (AE) in an aluminium reduction cell, and the results are compared with those of typical PNN. Finally, conclusions are drawn in Section 6.
2 PNN Structure Simplification Based on Multiplication Formula

The motivation of PNN is the family of probability density function estimators developed by Parzen, which asymptotically approach the underlying parent density provided the random samples are known. The particular estimator is

f_A(x) = (1 / ((2π)^{p/2} σ^p S_A)) Σ_{i=1}^{S_A} exp[−(x − x_{ai})^T (x − x_{ai}) / (2σ²)],  (1)
where x_{ai} is the i-th sample of pattern A, S_A is the number of samples of pattern A, and σ is the variance, called the "smoothing parameter" here, which determines the width of the Gaussian function whose mean is set by the corresponding training sample. With enough classified samples, this posterior probability density function approaches the prior one smoothly and continuously [1]. Eq. (1) can be expressed as a network computation, which is the PNN shown in Fig. 1. This PNN has an M-dimensional input and an N-dimensional output, which means N patterns need to be recognized. S_1, S_2, …, S_N are the numbers of training samples belonging to the N patterns, respectively. PNN has four feed-forward layers: the input layer, pattern layer, summation layer and output layer. The connection weights from the summation layer to the output layer are the proportions of the corresponding pattern's sample number to the total number of samples; all other connection weights are unit 1. In PNN, each pattern neuron represents a training sample. It computes the similarity between the input and the corresponding sample, that is, the probability of the input being the corresponding sample. Summation neurons sum the outputs of all pattern neurons belonging to the same pattern. The probability that the input belongs to a certain pattern is the product of the corresponding connection weight and the summation output. Generally, when applied to problems without large numbers of training samples, PNN obtains ideal classification results.
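As a concrete illustration, Eq. (1) and the output-layer weighting described above can be sketched in plain Python (a hypothetical minimal implementation; the function names and the toy two-class data are ours, not the paper's):

```python
import math

def parzen_density(x, samples, sigma):
    """Eq. (1): Parzen-window density estimate for one pattern class."""
    p = len(x)
    norm = 1.0 / ((2 * math.pi) ** (p / 2) * sigma ** p * len(samples))
    total = 0.0
    for xa in samples:
        sq_dist = sum((xi - xai) ** 2 for xi, xai in zip(x, xa))
        total += math.exp(-sq_dist / (2 * sigma ** 2))
    return norm * total

def pnn_classify(x, classes, sigma):
    """Typical PNN: one pattern neuron per training sample; the output
    layer weights each class density by its sample proportion."""
    n_total = sum(len(s) for s in classes.values())
    scores = {label: (len(s) / n_total) * parzen_density(x, s, sigma)
              for label, s in classes.items()}
    return max(scores, key=scores.get)

classes = {"A": [(0.0, 0.0), (0.2, 0.1)], "B": [(1.0, 1.0), (0.9, 1.1)]}
print(pnn_classify((0.1, 0.0), classes, sigma=0.5))  # → A
```

Note that, exactly as the text observes, every training sample contributes one Gaussian term, so the cost of a single classification grows with the size of the training set.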
J. Xing, D. Xiao, and J. Yu

Fig. 1. Topological structure of PNN (input layer, pattern layer, summation layer with output weights w_n = S_n / ΣS, and output layer). Each pattern neuron represents a training sample.
But a shortcoming of PNN cannot be ignored: the number of pattern neurons equals the number of training samples. When PNN is applied to a problem with large-scale, highly redundant training samples, the number of pattern neurons becomes very large, the structure complex, the computation slow, and the classification efficiency low. In real applications, characteristic features are sometimes extracted to reduce the number and redundancy of training samples and thus simplify the PNN structure [7,8]. But deciding which characteristics to extract, and how to extract them, requires much experience with the specific object, which limits the application field of PNN. In this paper, a general PNN structure optimization strategy is proposed. The motivation is the multiplication formula in probability computation
P(ABC) = P (C|AB) P (B|A) P (A),
(2)
When events A and B are independent of each other, Eq. (2) can be simplified as P(ABC) = P(C|AB) P(B) P(A).
(3)
Assuming the input variables are independent of each other, the structure of PNN can be simplified under the guidance of Eq. (3), as shown in Fig. 2. Each pattern neuron of a typical PNN represents a training sample, while each pattern neuron of the simplified PNN represents a character state of the corresponding input variable. Multiplication neurons compute the product of the outputs of the corresponding pattern neurons. The probability of the event represented by a multiplication neuron is the product of the multiplication output and the output weight, which is a conditional probability. For example, if the connection weight w11 = P(Z1|X11X21…XM1), then P(Z1) = w11 P(X11) P(X21) … P(XM1). Output neurons sum the products of the outputs of the multiplication neurons and their corresponding weights, which yields the probability that the input belongs to the corresponding pattern. In the simplified PNN, the connection weights from the multiplication layer to the summation layer are the conditional probabilities; all other connection weights are 1.
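The multiplication-layer computation described above can be sketched as follows (a hypothetical illustration; the state probabilities and the conditional weight are made-up numbers):

```python
# Sketch of one multiplication neuron of the simplified PNN (Fig. 2).
# Under the independence assumption of Eq. (3):
#   P(Z) = w * P(X1) * P(X2) * ... * P(XM),  where w = P(Z|X1 X2 ... XM).

def multiplication_neuron(state_probs, cond_weight):
    """Product of the pattern-neuron outputs, scaled by the conditional
    probability carried on the multiplication-to-summation connection."""
    prod = 1.0
    for p in state_probs:
        prod *= p
    return cond_weight * prod

# Two input variables with independent state probabilities 0.9 and 0.8,
# and an (assumed) conditional weight of 0.95:
p_z1 = multiplication_neuron([0.9, 0.8], cond_weight=0.95)
print(round(p_z1, 3))  # 0.9 * 0.8 * 0.95 = 0.684
```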
Fig. 2. Topological structure of the PNN simplified by using the probability multiplication formula (input layer, pattern layer, multiplication layer, and summation layer; e.g., w11 = P(Z1|X11X21…XM1)). Each pattern neuron represents a character state of the corresponding input variable.
3 PCA Based PNN Structure Optimization

Only when the events are independent can Eq. (2) be simplified into Eq. (3). So the key precondition of PNN structure simplification by the multiplication formula is that the input variables are independent. But this is rarely the case in real problems: the correlation between input variables needs to be eliminated by some pretreatment, and PCA is the simplest and most efficient. Furthermore, the number of pattern neurons corresponding to different inputs can be calculated under the guidance of the eigenvalues obtained in PCA. The structure design of PNN then follows computation rules, which decreases structural uncertainty by avoiding artificial settings. Suppose there are S training samples, with M-dimensional input and N-dimensional output. The original training sample set is (x_1, x_2, …, x_S ∈ R^M, y_1, y_2, …, y_S ∈ R^N). PCA is applied to the input data of the training samples. The eigenvalues are λ_1, λ_2, …, λ_M (λ_1 ≥ λ_2 ≥ … ≥ λ_M ≥ 0), with corresponding eigenvectors l_1, l_2, …, l_M. The first P eigenvalues are selected such that
η_P = (Σ_{k=1}^{P} λ_k) / (Σ_{k=1}^{M} λ_k) ≥ 80%.
The eigenvectors l_1, l_2, …, l_P corresponding to the first P eigenvalues are combined into the transform matrix L_P = [l_1′; l_2′; …; l_P′] ∈ R^{P×M}.
The new P-dimensional vectors p_i = L_P x_i (i = 1, 2, …, S) contain the first P principal components. By using the new sample set (p_1, p_2, …, p_S ∈ R^P, y_1, y_2, …, y_S ∈ R^N) as the training sample set for the neural network, the correlation between input variables is eliminated, so PNN can be simplified based on the multiplication formula. Meanwhile,
P < M, which means PCA reduces the PNN input dimension; the smaller input dimension decreases the computational and structural complexity. The pattern neurons of the simplified PNN are grouped according to the first P eigenvalues obtained in PCA. In PCA terms, a larger eigenvalue means a larger variance of the corresponding component variable, and hence more information contained in it. Therefore, the input variable with the larger eigenvalue should be assigned more pattern neurons, to increase the classification accuracy of PNN. In the simplified PNN, assuming the total number of pattern neurons is Q, the numbers of pattern neurons corresponding to the input variables p_1, p_2, …, p_P are
q_i = round(Q λ_i / Σ_{k=1}^{P} λ_k), i = 1, 2, …, P,
where round( ) rounds its argument to the nearest integer. The role of PCA in PNN structure optimization is shown in Fig. 3. On the one hand, the original input to PNN is pretreated using the transform matrix obtained in PCA: the correlation between input variables is eliminated so that PNN can be simplified based on the multiplication formula, and the input dimension is reduced without losing much information. On the other hand, the number of pattern neurons corresponding to each input variable is set according to the numerical values of the eigenvalues. The structure design of PNN is thus optimized with definite computation rules, and no longer depends on artificial experience or large-scale tests.
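The two design rules above, selecting P so that η_P ≥ 80% and then allocating the Q pattern neurons in proportion to the retained eigenvalues, can be sketched as follows (a hypothetical illustration with made-up eigenvalues):

```python
# Sketch of the PCA-based structure design rules of this section.
# Function names and the example eigenvalues are ours, not the paper's.

def select_components(eigvals, threshold=0.80):
    """Smallest P such that the first P eigenvalues cover >= threshold
    of the total variance (the eta_P >= 80% rule)."""
    total = sum(eigvals)
    acc = 0.0
    for P, lam in enumerate(sorted(eigvals, reverse=True), start=1):
        acc += lam
        if acc / total >= threshold:
            return P
    return len(eigvals)

def allocate_pattern_neurons(eigvals, P, Q):
    """q_i = round(Q * lambda_i / sum of the first P eigenvalues)."""
    top = sorted(eigvals, reverse=True)[:P]
    s = sum(top)
    return [round(Q * lam / s) for lam in top]

eigvals = [5.0, 3.0, 1.0, 0.5, 0.5]
P = select_components(eigvals)                 # 5 + 3 = 8 of 10 -> 80%
print(P, allocate_pattern_neurons(eigvals, P, Q=16))  # 2 [10, 6]
```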
4 Computation and Learning Algorithm of the Optimized PNN

The PCA based optimized PNN has two computation courses, pretreatment and network computation, as shown in Fig. 3. Pretreatment performs input dimension reduction and correlation elimination by using the transform matrix L_P; network computation is the basic feed-forward neural network computation. The pretreatment computation is

p = L_P x,

where x is the input vector, L_P is the transform matrix, and p is the result of PCA. Layer 1 transfers the input values to the pattern neurons directly. The function of a pattern neuron in layer 2 is the Gaussian function

z_j^{(2)} = (1 / sqrt(2π σ_j²)) exp(−(u_j^{(2)} − μ_j)² / σ_j²),

where u_j^{(2)} and z_j^{(2)} are the input and output of the j-th neuron in layer 2, and μ_j and σ_j are the mean and variance of the Gaussian function. Layer 3 is the multiplication computation

z_k^{(3)} = Π_{l=1}^{L_k} u_{kl}^{(3)},
Fig. 3. Input pretreatment (the PCA transform L_P maps the original inputs x_1, …, x_M to p_1, …, p_P) and structure optimization of PNN by using PCA
where u_{kl}^{(3)} and z_k^{(3)} are the input and output of the k-th neuron in layer 3, and L_k is the number of inputs to the k-th neuron in layer 3. Layer 4 is the summation computation

y_n = z_n^{(4)} = Σ_{k=1}^{K} w_{nk} z_k^{(3)},

where y_n = z_n^{(4)} is the output of the n-th neuron in layer 4, which is the n-th output of PNN; K is the number of neurons in layer 3; and w_{nk} is the connection weight from the k-th neuron in layer 3 to the n-th neuron in layer 4. Connections from layer 3 to layer 4 are full connections with the corresponding weights. A typical PNN has no learning course; its parameters are set based on artificial experience, which brings some uncertainty to PNN. Aiming at this problem, a learning algorithm is introduced, whose objective is to minimize

E = (1/2) Σ_{s=1}^{S} Σ_{n=1}^{N} (y_n(s) − ŷ_n(s))²,  (4)
where ŷ_n(s) is the expected output. In the optimized PNN, the parameters to be adjusted are w_{nk}, the connection weights from layer 3 to layer 4, and μ_j and σ_j, the center value and smoothing parameter of the pattern neurons in layer 2. For simplicity and without loss of generality, a gradient-descent learning algorithm is employed here. The update rules of w_{nk} are

∂E/∂w_{nk} = Σ_{s=1}^{S} Σ_{n=1}^{N} (y_n(s) − ŷ_n(s)) z_k^{(3)}(s),

w_{nk} = w_{nk} − exp(E/t) η ∂E/∂w_{nk},
where η is the learning-rate coefficient and exp(E/t) is the SA coefficient, with E the mean square error of Eq. (4) and t the learning step. Obviously, the larger E is, the larger exp(E/t) is and the larger the update of w_{nk}; and the larger t is, the smaller exp(E/t) is and the smaller the update of w_{nk}. Hence exp(E/t) makes the learning course more efficient in the early period and more stable in the later period. It is called the SA coefficient because of its origin in, and similarity with, the position-modification probability of the SA optimization algorithm. The update rules of μ_j and σ_j are
∂zk(3) ( s ) 2 = δ kj zk(3) ( s ) 2 (u (2) j − μ j ), ∂μ j σj ∂zk(3) ( s ) 2 2 2 = δ kj zk(3) ( s ) 3 ((u (2) j − μ j ) − σ j ), ∂σ j σj
μ j = μ j −η
∂E E S K ∂E ∂zk(3) ( s ) = μ j − exp( )η ∑∑ (3) , ∂μ j t s =1 k =1 ∂z k ( s ) ∂μ j
σ j = σ j −η
∂E E S K ∂E ∂zk(3) ( s ) = σ j − exp( )η ∑∑ (3) , ∂σ j t s =1 k =1 ∂zk ( s ) ∂σ j
where δ kj =1 if the j-th neuron in layer 2 is connected to the k-th layer in layer 3; otherwise, δ kj is zero.
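The update scheme with the SA coefficient exp(E/t) can be sketched as follows (an illustrative toy loop; the error schedule, gradient, and learning rate are invented for demonstration, whereas in the real algorithm E and the gradient would come from Eq. (4)):

```python
import math

# Sketch of one parameter update with the simulated-annealing coefficient
# exp(E/t): a large error E amplifies the step, a large step count t
# damps it. All numeric values below are made up for demonstration.

def sa_update(w, grad, eta, E, t):
    """w <- w - exp(E/t) * eta * dE/dw."""
    return w - math.exp(E / t) * eta * grad

w = 0.5
for t in range(1, 6):              # learning steps
    E = 0.2 / t                    # pretend the error shrinks over time
    w = sa_update(w, grad=0.1, eta=0.05, E=E, t=t)
print(round(w, 4))
```

Because E/t shrinks on both counts as training proceeds, exp(E/t) approaches 1 and the rule degenerates gracefully into plain gradient descent.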
5 Prediction of AE Based on the Optimized PNN

AE is the most important operational fault of the aluminium reduction cell, the key equipment of the aluminium-making industry. Due to the complex working states and field conditions, the aluminium reduction cell is a multi-variable coupled, time-varying, long-time-delay, large nonlinear system, which makes neural networks feasible for AE prediction. In aluminium making, the minute is the basic period of online data recording (e.g., of cell voltage), but several days or longer is the basic period between AE occurrences, so the samples for AE prediction are highly redundant. If a typical PNN is used, the number of pattern neurons will be very large, the structure of PNN complex, and the computation speed and classification efficiency low. The PCA based optimized PNN is used to predict AE in an aluminium reduction cell. The training sample set is the data from a cell over an 8-hour period in which a normal AE happens. The input consists of online and offline data from the aluminium reduction cell, such as inter-electrode gap, series current, cell voltage, cell temperature, and so on. The expected output is the probability of AE decided by the artificial experience of field workers. After the PCA based structure design, the optimized PNN has only a 2-dimensional input, much less than the original 8-dimensional input. The size of the optimized PNN is
Table 1. Number of neurons and connections of the typical PNN and the optimized PNN

              Traditional PNN   Optimized PNN
Samples       958               958
Neurons       968               28
Connections   8713              58
compared with that of the typical PNN, as shown in Table 1. It shows that PCA can optimize the structure of PNN to a great extent in applications with large-scale, highly redundant training samples. The optimized PNN is tested using data from the same aluminium reduction cell over an 8-hour period in which an unwanted AE happens; the results are shown in Fig. 4. Fig. 4a shows that the cell operates normally at first, then undergoes the sudden fault AE at about the 303rd minute: the cell voltage increases rapidly from the normal 4.2 V to about 40 V, and decreases rapidly back to the normal range after the AE is extinguished about 4 to 5 minutes later. Fig. 4b shows that the PNN-predicted probabilities of AE increase rapidly about 30 minutes before the real AE happens, and reach the alert range about 10 minutes before. These prediction results show that PNN can predict AE in a timely way, which is helpful for economical and safe aluminium production. Furthermore, the PCA based optimized PNN obtains prediction results similar to those of the typical one, with a much simpler network structure.
Fig. 4. Anode effect prediction based on the optimized PNN
6 Conclusion

In this paper, PNN was analyzed in terms of probability computation, and a structure optimization strategy based on the multiplication formula was proposed. First, the input training samples are analyzed by using PCA: the correlation between input variables is eliminated, so that PNN can be optimized by using the multiplication formula, and the input dimension is reduced, which decreases the PNN computation. Subsequently, according to the eigenvalues obtained in PCA, different numbers of pattern neurons are distributed to the different input variables with
different numbers. Together with the pretreatment, this optimizes the PNN structure. Then, the computation courses of the optimized PNN were described. The learning algorithm decreases parameter uncertainty by avoiding artificial settings, and the SA coefficient is introduced to increase the efficiency and stability of the learning course. Eventually, the optimized PNN was used to predict AE in an aluminium reduction cell, which validated the simplicity, efficiency and reliability of the optimized PNN in applications with large-scale, highly redundant training samples.
Acknowledgment

This work is sponsored by the National High Technology Research and Development Program of China (863 Program) under Grants No. 2002AA412510 and No. 2002AA412420. The financial aid to this research is gratefully acknowledged.
References

1. Specht, D.F.: Probabilistic Neural Networks for Classification, Mapping, or Associative Memory. IEEE International Conference on Neural Networks 1 (1988) 525-532
2. Labonte, G.: On the Efficiency of OLS Reduced Probabilistic Neural Networks for Aircraft-Flare Discrimination. Proceedings of the International Joint Conference on Neural Networks 3 (2003) 2306-2311
3. Zhang, Y.C., Peng, L.H., Yao, D.Y., et al.: Principal Component Analysis Method for Two-Phase Flow Concentration Measurements. Journal of Tsinghua University (Science and Technology) 43 (2003) 400-401, 405
4. Oh, B.J.: Face Recognition by Using Neural Network Classifiers based on PCA and LDA. 2005 IEEE International Conference on Systems, Man and Cybernetics 2 (2005) 1699-1703
5. Wang, S., Xia, S.W.: Self-Organizing Algorithm of Robust PCA based on Single-Layer NN. Journal of Tsinghua University (Science and Technology) 37 (1997) 121-124
6. Kong, W., Yang, J.: Applications of Nonlinear PCA based on Neural Network in Prediction of Melt Index. Computer Simulation 20 (2003) 65-67
7. Albano, M., Caldon, R., Turri, R.: Voltage Sag Analysis on Three Phase System Using Wavelet Transform and Probabilistic Neural Network. Universities Power Engineering Conference 3 (2004) 948-952
8. Chen, C.H., Chu, C.T.: Low Complexity Iris Recognition based on Wavelet Probabilistic Neural Networks. Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05) 3 (2005) 1930-1935
A Multi-scale Dynamically Growing Hierarchical Self-organizing Map for Brain MRI Image Segmentation Jingdan Zhang and Dao-Qing Dai Center for Computer Vision and Department of Mathematics, Sun Yat-Sen (Zhongshan) University, Guangzhou, 510275 China
[email protected],
[email protected] Abstract. With Kohonen’s self-organizing map based brain MRI image segmentation, there are still some regions which are not partitioned accurately, particularly in the transitional regions of gray matter and white matter, or cerebrospinal fluid and gray matter. In this paper, we propose a dynamically growing hierarchical self-organizing map integrated with a multi-scale feature vector to overcome the problem mentioned above, which uses the spatial relationships between image pixels and using multi-scale processing method to reduce the noise effect and the classification ambiguity. The efficacy of our approach is validated by extensive experiments using both simulated and real MRI images.
1 Introduction
In recent years, various imaging modalities have become available for acquiring complementary information on different aspects of anatomy. Because of the advantages of MRI over other diagnostic imaging [1], the majority of research in medical image segmentation pertains to its use for MRI images. Automatic segmentation of brain MRI images into the three main tissue types, white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF), is a topic of great importance and much research. It is known that volumetric analysis of different parts of the brain is useful in assessing the progress or remission of various diseases, such as Alzheimer's disease, epilepsy, sclerosis, and schizophrenia [2]. Many methods are now available for MRI image segmentation [2], and clustering methods are naturally applied to it [2], [3]. However, uncertainty is widely present in MRI image data because of the noise and blur in acquisition and the partial volume effects originating from the low sensor resolution. Therefore, neural-network-based segmentation can be used to overcome these adversities [4], [5]. Among these neural network techniques, Kohonen's self-organizing map (SOM) is the most used in MRI segmentation [6]. But SOM has certain fundamental limitations in the context of image segmentation, especially for brain MRI images, because most brain MRI images
Corresponding author.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1081–1089, 2007. c Springer-Verlag Berlin Heidelberg 2007
present overlapping grey-scale intensities for different tissues, particularly in the transitional regions between GM and WM, or between CSF and GM. Several improved SOM algorithms and SOM-related algorithms have been proposed in recent years to overcome these drawbacks; hierarchical SOM is one variation of SOM [7], [8]. In this paper we address the segmentation problem in the context of isolating the brain tissues in MRI images. Kohonen's self-organizing feature map is exploited as a competitive-learning clustering algorithm in our work. However, there are still some regions that are not partitioned accurately, particularly in the transitional regions between GM and WM, or between CSF and GM. Therefore, a dynamically growing hierarchical SOM (DGHSOM) is proposed in our work to overcome this problem. Moreover, for image data there is strong spatial correlation between neighboring pixels. To produce meaningful segmentation, we integrate a multi-scale feature vector with DGHSOM, called MDGHSOM, in which we consider the spatial relationships between image pixels and use a multi-scale processing method to reduce the noise effect and the classification ambiguity. The efficacy of our approach is validated by extensive experiments using both simulated and real MRI images. The rest of this paper is organized as follows. MDGHSOM for MRI image segmentation is proposed in Section 2. Experimental results are presented in Section 3, and we conclude this paper in Section 4.
2 SOM and the Proposed MDGHSOM

2.1 Kohonen's Self-Organizing Map (SOM)
SOM consists of an input layer and a single output layer of M neurons, which usually form a two-dimensional array. The training of SOM is usually performed using the Kohonen algorithm [9]. Each neuron i has a d-dimensional feature vector w_i = [w_{i1}, …, w_{id}]. At each training step t, a sample data vector x(t) is randomly chosen from the training set, and the distances between x(t) and all feature vectors are computed. The winning neuron, denoted by c, is the neuron with the feature vector closest to x(t):

c = arg min_i ||x(t) − w_i||, i ∈ {1, …, M}.  (1)

The set of neighboring nodes of the winning node is denoted by N_c, and the neighboring radius of the winning neuron decreases with time. We define N_t(c, i) as the neighborhood function around the winning neuron c at time t. The neighborhood function is a non-increasing function of time t and of the distance of neuron i from the winning neuron c in the output layer. The function can be taken as N_t(c, i) = exp(−||r_i − r_c||² / 2N_c²(t)), where r_i is the coordinate of neuron i on the output layer and N_c(t) is a width parameter. The weight-updating rule of the sequential SOM algorithm can be written as

w_i(t + 1) = w_i(t) + α(t) N_t(c, i) [x(t) − w_i(t)], ∀ i ∈ N_c.  (2)
Both the learning rate α(t) and the neighborhood width N_c(t) decrease monotonically with time.
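The training loop of Eqs. (1)-(2) can be sketched in plain Python (a minimal illustration; the one-dimensional output grid, decay schedules, and toy data are our own assumptions, not the paper's):

```python
import math
import random

def train_som(data, n_neurons, steps, seed=0):
    """Sequential SOM training on a 1-D output grid."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for t in range(steps):
        x = rng.choice(data)
        # Eq. (1): winner c has the feature vector closest to x
        c = min(range(n_neurons),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(x, weights[i])))
        alpha = 0.5 * (1.0 - t / steps)          # decaying learning rate
        width = 0.05 + 2.0 * (1.0 - t / steps)   # decaying radius
        for i in range(n_neurons):
            d2 = (i - c) ** 2                    # grid distance to winner
            h = math.exp(-d2 / (2.0 * width ** 2))  # neighborhood function
            # Eq. (2): move w_i toward x, scaled by alpha and h
            weights[i] = [w + alpha * h * (xj - w)
                          for w, xj in zip(weights[i], x)]
    return weights

data = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.9)]
weights = train_som(data, n_neurons=2, steps=300)
print([[round(w, 2) for w in ws] for ws in weights])
```

Each update is a convex move toward the chosen sample, so the weight vectors stay inside the data's bounding box while the shrinking radius gradually decouples the neurons.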
2.2 A Multi-scale Feature Vector
In this section, we first propose a multi-scale adaptive spatial feature vector as the input vector, considering the spatial relationships between image pixels and a multi-scale processing method to reduce the noise effect and the classification ambiguity. Moreover, we also take the combination of local and global information into account. The intensity of the input pixel (Intensity) is very important for clustering, but if only the intensity information is considered, some local details are neglected, particularly in the transitional regions. Thus, the gradient of the input pixel in its 3 × 3 neighborhood is computed as one element of the input vector, Gradient. If Gradient is small enough, the input pixel is homogeneous with its neighbors and may be an interior pixel of some tissue; if Gradient exceeds a given threshold, the input pixel could be in a transitional region. To obtain a precise clustering result, multi-scale information is considered: the mean values (mean3, mean5) and variances (variance3, variance5) of the input pixel's neighborhoods of size 3 × 3 and 5 × 5 are computed respectively. Comparing mean3 and mean5 with Intensity, the variation tendency of the local region can be obtained; moreover, variance3 and variance5 give more local information about the attributes of the input pixel. For these reasons, we construct the input vector as (Intensity, Gradient, mean3, variance3, mean5, variance5). Different elements of the input vector have different importance in each layer, and we assign them different weights.
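A sketch of this six-element feature vector for a single pixel follows (a hypothetical implementation; the paper does not specify its gradient operator, so a simple maximum-absolute-difference proxy over the 3 × 3 neighborhood is used here, and the toy image is ours):

```python
# Sketch of the (Intensity, Gradient, mean3, variance3, mean5, variance5)
# feature vector of Sect. 2.2 for one pixel of a grey-scale image.

def neighborhood(img, r, c, size):
    """Pixel values in a size x size window, clipped at the image border."""
    half = size // 2
    return [img[i][j]
            for i in range(r - half, r + half + 1)
            for j in range(c - half, c + half + 1)
            if 0 <= i < len(img) and 0 <= j < len(img[0])]

def stats(vals):
    m = sum(vals) / len(vals)
    v = sum((x - m) ** 2 for x in vals) / len(vals)
    return m, v

def feature_vector(img, r, c):
    intensity = img[r][c]
    n3 = neighborhood(img, r, c, 3)
    # assumed gradient proxy: largest absolute difference to a 3x3 neighbor
    gradient = max(abs(x - intensity) for x in n3)
    mean3, var3 = stats(n3)
    mean5, var5 = stats(neighborhood(img, r, c, 5))
    return (intensity, gradient, mean3, var3, mean5, var5)

img = [[10, 10, 10, 90, 90]] * 5     # a sharp vertical tissue boundary
print(feature_vector(img, 2, 2))     # pixel just left of the transition
```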
2.3 MDGHSOM
SOM has certain fundamental limitations in the context of image segmentation, especially for brain MRI images, because most brain MRI images present overlapping grey-scale intensities for different tissues, particularly in the transitional regions. A dynamically growing SOM with two hierarchies is therefore integrated with the multi-scale feature vector to solve this problem; we call it MDGHSOM. A neuron at the higher level can dynamically generate a child SOM at the lower level according to the higher-level neuron's weights. This is like GHSOM, where the network grows hierarchically under some conditions, but MDGHSOM does not grow neurons horizontally, because we want to simplify the growing process and train the network faster. Note that the number of neurons at the second level of MDGHSOM is adaptively determined, whereas that of HSOM is predefined. At each layer, each neuron i has a six-dimensional weight vector w_i^n = [w_{i1}^n, …, w_{i6}^n] corresponding to the input vector x_p = [x_{p1}, …, x_{p6}], where w_{i1}^n, w_{i2}^n, w_{i3}^n, w_{i4}^n, w_{i5}^n, w_{i6}^n denote the intensity centroid, gradient centroid, 3 × 3 neighborhood mean centroid, 3 × 3 neighborhood variance centroid, 5 × 5 neighborhood mean centroid
and 5 × 5 neighborhood variance centroid of the pixels clustered into the i-th neuron of the n-th layer SOM, respectively. The MDGHSOM algorithm is summarized as follows:
1. Initialization: Set level n = 1 and the weights at the first level w_i^1 = [w_{i1}^1, w_{i2}^1, w_{i3}^1, w_{i4}^1, w_{i5}^1, w_{i6}^1]. The SOM in the first layer is trained with all data by invoking the function TrainSOM(x_p, n, W).
2. Recursive Loop: GenerateSOM(x_p, n, W).
If w_{i2}^1 is larger than g0 (a constant threshold chosen according to the experimental results; it is set to 64 in our experiment), the neighborhood gradient of the pixels classified into neuron i is too large, and these pixels may lie in the transitional region between different tissues. Of course, a large gradient value could also be caused by noise pixels. To distinguish these cases, the values of |w_{i4}^1 − w_{i1}^1| and |w_{i6}^1 − w_{i1}^1| are also considered. If |w_{i4}^1 − w_{i1}^1| < m0 and |w_{i6}^1 − w_{i1}^1| < m1 (m0 and m1 are constant thresholds, set to 6 and 8 respectively in our experiment), we conclude that the intensity of the pixels classified into neuron i is similar to the mean values of their 3 × 3 and 5 × 5 neighborhoods, and the large gradient value is caused by noise pixels. Otherwise, the pixels clustered into this neuron differ from their neighbors, and they may be transitional pixels between different tissues. To obtain an accurate segmentation result, a child SOM with two neurons (because there are only two classes of pixels in a transitional region) is generated in the second layer to partition these transitional pixels again. Based on the above analysis, we give the function for generating a SOM:
Function GenerateSOM(x_p, n, W): If w_{i2}^1 > g0, |w_{i4}^1 − w_{i1}^1| > m0, and |w_{i6}^1 − w_{i1}^1| > m1, neuron i of the first layer spawns two child neurons representing a child SOM in the second layer; then the child SOM is trained by TrainSOM(x_p, n, W).
Function TrainSOM(x_p, n, W): Train the SOM at the n-th level with the input vector x_p, and update the weights of the neurons. Each SOM is trained with the original SOM algorithm, and the child SOMs are trained with the data associated with their mother neurons. MDGHSOM completes the training of the SOMs at the first level, and then proceeds to train the SOMs at the second level. Fig. 1 shows the architecture of MDGHSOM.
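The spawning test of GenerateSOM can be sketched as follows (a hypothetical reading: the subscripts w_i4 and w_i6 in the text are interpreted as the 3 × 3 and 5 × 5 neighborhood-mean centroids, consistent with the surrounding explanation; the thresholds follow the quoted values g0 = 64, m0 = 6, m1 = 8, and the example weight vectors are invented):

```python
# Sketch of the GenerateSOM spawning condition: a first-layer neuron whose
# gradient centroid exceeds g0 AND whose intensity centroid differs from
# both neighborhood-mean centroids spawns a two-neuron child SOM.

G0, M0, M1 = 64, 6, 8

def should_spawn(w):
    """w = (intensity, gradient, mean3, var3, mean5, var5) centroids."""
    return (w[1] > G0                    # large neighborhood gradient
            and abs(w[2] - w[0]) > M0    # intensity far from 3x3 mean
            and abs(w[4] - w[0]) > M1)   # intensity far from 5x5 mean

# Noise-like neuron: big gradient, but intensity close to both means.
print(should_spawn((100, 90, 103, 5.0, 104, 6.0)))    # False
# Transitional neuron: big gradient and intensity far from both means.
print(should_spawn((100, 90, 120, 40.0, 125, 50.0)))  # True
```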
3 Experimental Results
The number of tissue classes in the segmentation is set to three, which corresponds to CSF, GM and WM. Background pixels are ignored in the computation. Extra-cranial tissues are removed from all images prior to segmentation. For all segmentation experiments, the number of training steps T is set to 25 in the
Fig. 1. The architecture of MDGHSOM that grows neurons hierarchically when needed
first layer, and to 50 in the second layer. The proposed algorithm was implemented in C and tested on both simulated MRI images obtained from the BrainWeb Simulated Brain Database1 and real MRI data obtained from the Internet Brain Segmentation Repository (IBSR)2.
3.1 Results Analysis and Comparison
To illustrate the MDGHSOM approach, we compare the segmentation results of FCM segmentation, SOM clustering and our method with the ground truth in Fig. 2c, d, e and f. The original image, simulated from a normal brain phantom with 5% noise level, is shown in Fig. 2a, with its processed result after wavelet-based de-noising [10] in Fig. 2b. All the segmentation results are obtained after the original image is de-noised. The FCM segmentation result (Fig. 2c) and the SOM clustering result (Fig. 2d) are partitioned inaccurately, particularly in the transitional regions between GM and WM, or between CSF and GM. The result of our proposed MDGHSOM approach (Fig. 2e), which takes both local and global information into account, clearly outperforms the results mentioned above (Fig. 2c and d). Fig. 2f is the ground truth of this slice. Fig. 2g, h, i, j, k and l are zoomed-in parts of Fig. 2a, b, c, d, e and f respectively, and the efficiency of our method can be clearly observed in the circle and square regions. In our experiments, three different indices (false positive ratio γfp, false negative ratio γfn, and similarity index ρ [11]) are exploited for each of the three brain tissues as quantitative measures to compare FCM segmentation, SOM clustering and our method with the ground truth. For a given brain tissue i, i = 1, 2, 3 for CSF, GM and WM respectively, suppose that Ai and Bi represent the sets of pixels labeled as i by the ground truth and by our method, respectively, and |Ai| denotes the number of pixels in Ai. The false positive ratio is defined as γfp = (|Bi| − |Ai ∩ Bi|)/|Ai|. Likewise, the false negative ratio is defined as γfn = (|Ai| − |Ai ∩ Bi|)/|Ai|. The
1 http://www.bic.mni.mcgill.ca/brainweb
2 http://www.cma.mgh.harvard.edu/ibsr
Fig. 2. (a) Original image simulated from MRI brain phantom with 5% noise level, and its processed version with (b) wavelet-based de-noising. (c) FCM segmentation. (d) SOM clustering result. (e) MDGHSOM clustering result. (f) The ground truth of (a). (g), (h), (i), (j), (k) and (l) are parts of images after (a), (b), (c), (d), (e) and (f) zoomed in respectively.

Table 1. Comparing FCM clustering, SOM clustering, and our method with the ground truth

         FCM                  SOM                  MDGHSOM
      γfp   γfn   ρ        γfp   γfn   ρ        γfp   γfn   ρ
WM    19.3  4.13  93.92    4.11  8.24  94.47    2.61  6.42  96.30
GM    9.66  9.29  90.08    15.7  2.06  91.19    9.77  5.29  93.09
CSF   5.46  17.3  88.28    6.65  15.2  89.30    2.96  8.22  94.26
similarity index ρ is an intuitive and plain index describing the matching area between Ai and Bi, defined as ρ = 2|Ai ∩ Bi|/(|Ai| + |Bi|). The comparison results obtained with these indices are shown in Table 1. Our scheme produces a robust and precise segmentation.
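The three indices can be sketched over pixel-label sets (a hypothetical illustration with made-up sets; note that ρ coincides with the Dice coefficient):

```python
# Sketch of the three quantitative indices used in Table 1, written over
# pixel-label sets (A = ground-truth pixels of one tissue, B = prediction).

def indices(A, B):
    inter = len(A & B)
    fp = (len(B) - inter) / len(A)        # false positive ratio
    fn = (len(A) - inter) / len(A)        # false negative ratio
    rho = 2 * inter / (len(A) + len(B))   # similarity (Dice) index
    return fp, fn, rho

A = {1, 2, 3, 4}          # ground-truth pixel ids of one tissue
B = {2, 3, 4, 5, 6}       # predicted pixel ids
fp, fn, rho = indices(A, B)
print(fp, fn, round(rho, 3))  # 0.5 0.25 0.667
```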
3.2 Quantitative Validation
To quantitatively validate our method, test images with known ground truth are required. For this purpose, we use the simulated MRI images from the BrainWeb Simulated Brain Database with T1-weighted sequences, slice thickness of 1 mm, and noise levels of 3%, 5%, 7% and 9%. The skull, fat, and unnecessary background are first removed with the guidance of the ground truth. Fig. 3a, d, g and j are simulated MRI images with noise levels of 3%, 5%, 7% and 9% respectively. The corresponding segmentation results processed using
A Multi-scale Dynamically Growing Hierarchical Self-organizing Map
Fig. 3. Segmentation of simulated images from MRI brain phantom. (a), (d), (g) and (j) are images with noise levels of 3%, 5%, 7% and 9% respectively, with (b), (e), (h) and (k) their corresponding segmentation results using our proposed approach. (c), (f), (i) and (l) are the ground truth of (a), (d), (g) and (j) respectively.
Fig. 4. Validation results for different noise levels: similarity index of WM, GM and CSF versus noise level
our approach on the de-noised original images [10] are shown in Fig. 3b, e, h and k, with their ground truth in Fig. 3c, f, i and l respectively. The similarity index ρ [11] is used for each of the three brain tissues as a quantitative measure to validate the accuracy of our method. The validation results are shown in Fig. 4. A similarity index ρ > 70% indicates an excellent similarity [11]. In our experiments, the similarity indices ρ of all the tissues are
larger than 90% even under the worst condition (9% noise level), which indicates an excellent agreement between our segmentation results and the ground truth.

3.3 Performance on Real MRI Data
To validate the efficiency of our proposed approach, we also test it on real MRI data obtained from the Internet Brain Segmentation Repository (IBSR). Extracranial tissues are removed from all images prior to segmentation using the method proposed in [12]. Fig. 5a shows one slice of a real T1-weighted MRI image. Fig. 5b is the FCM segmentation result, and Fig. 5c shows the clustering result using SOM. The result of our proposed method is shown in Fig. 5d. Visual inspection shows that our approach produces better segmentation than the others, especially in the transitional regions between gray matter and white matter, and between cerebrospinal fluid and gray matter, such as the circled regions.
Fig. 5. Segmentation of real MRI image. (a) Original image. (b) FCM segmentation. (c) SOM clustering result. (d) MDGHSOM clustering result.
4 Conclusion
In this paper we address the segmentation problem in the context of isolating the brain tissues in MRI images. SOM is employed as a competitive learning clustering algorithm in our work. However, the transitional regions between tissues in MRI images are not clearly defined and their membership is intrinsically vague. Therefore, a dynamically growing hierarchical SOM is proposed in this paper to overcome this problem. Moreover, for image data there is strong spatial correlation between neighboring pixels, so a multi-scale feature vector is integrated with DGHSOM, called MDGHSOM, in which we consider the spatial relationships between image pixels and a multi-scale processing method to reduce the noise effect and the classification ambiguity. The efficacy of our approach is validated by extensive experiments using both simulated and real MRI images.
Acknowledgments. This project is supported in part by the NSF of China (60575004, 10231040), the NSF of Guangdong (05101817), and the Ministry of Education of China (NCET-04-0791).
References

1. Wells, W.M., Grimson, W.E.L., Kikinis, R., Arrdrige, S.R.: Adaptive Segmentation of MRI Data. IEEE Trans. Med. Imaging 15 (1996) 429-442
2. Pham, D.L., Xu, C.Y., Prince, J.L.: Current Methods in Medical Image Segmentation. Ann. Rev. Biomed. Eng. 2 (2000) 315-337
3. Liew, A.W.C., Yan, H.: An Adaptive Spatial Fuzzy Clustering Algorithm for 3-D MR Image Segmentation. IEEE Trans. Med. Imaging 22 (2003) 1063-1074
4. Reddick, W.E., Glass, J.O., Cook, E.N., Elkin, T.D., Deaton, R.: Automated Segmentation and Classification of Multispectral Magnetic Resonance Images of Brain using Artificial Neural Networks. IEEE Trans. Med. Imaging 16 (1997) 911-918
5. Ozkan, M., Dawant, B.M., Maciunas, R.J.: Neural-Network-based Segmentation of Multi-Modal Medical Images: A Comparative and Prospective Study. IEEE Trans. Med. Imaging 12 (1993) 534-544
6. Chuang, K.H., Chiu, M.J., Lin, C.C., Chen, J.H.: Model-Free Functional MRI Analysis using Kohonen Clustering Neural Network and Fuzzy C-Means. IEEE Trans. Med. Imaging 18 (1999) 1117-1128
7. Marsland, S., Shapiro, J., Nehmzow, U.: A Self-Organizing Network that Grows when Required. Neural Networks 15 (2002) 1041-1058
8. Si, J., Lin, S., Vuong, M.A.: Dynamic Topology Representing Networks. Neural Networks 13 (2000) 617-627
9. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, New York (1995)
10. Pizurica, A., Philips, W., Lemahieu, I., Acheroy, M.: A Versatile Wavelet Domain Noise Filtration Technique for Medical Imaging. IEEE Trans. Med. Imaging 22 (2003) 323-331
11. Zijdenbos, A., Dawant, B.: Brain Segmentation and White Matter Lesion Detection in MR Images. Crit. Rev. Biomed. Eng. 22 (1994) 401-465
12. Kong, J., Zhang, J.D., Lu, Y.H., Wang, J.Z., Zhou, Y.J.: A Novel Approach for Adaptive Unsupervised Segmentation of MRI Brain Images. MICAI 2005, LNCS 3789 (2005) 918-927
A Study on How to Classify the Security Rating of Medical Information Using Neural Network*

Jaegu Song and Seoksoo Kim

Hannam University, Department of Multimedia Engineering, Postfach 306 791, 133 Ojeong-Dong, Daedeok-Gu, Daejeon, Korea
{Song} [email protected], {Kim} [email protected]

Abstract. To provide intelligent medical services, it is necessary to understand the situation information generated in a hospital. Infrastructure technologies are needed that can classify and control this information with clear standards for processing situation data, rather than merely collecting conceptual information. As a study toward grasping the information generated in medical situations more clearly, this paper analyzes the properties of the data using a neural network and applies security ratings to the information, so that a system is established that provides each user with analyzed medical information appropriate to a designated rating. This will be an effective measure both to enhance the effectiveness of medical devices and backup data already introduced, and to understand the various medical data that will be generated by medical devices introduced in the future.
1 Introduction

The development of mobile communication and medical technologies provides many ways to address the lack of clinical facilities in an aging society. Combined with ubiquitous systems, intelligent sensors, remote clinics, and similar technologies, these are evolving into more developed systems. The aim of current studies is to make it possible to check conditions that need medical treatment in complicated situations. Such concepts are referred to as pervasive computing, disappearing computing, and invisible computing. Pervasive computing technologies are being applied more and more to everyday objects such as electronic products, bottles, chamber pots, mirrors, and medicine containers equipped with microprocessors, through telecommunication and cooperation [1]. However, to provide these intelligent medical services, it is necessary to understand the situation information generated in a hospital. In addition, infrastructure technologies are needed that can classify and control this information with clear standards for processing situation data, rather than merely collecting conceptual information. To prepare for the growing needs of medical information and its applications, it is also necessary to identify the types of sensors and medical situations and to embody the classification ratings. This will be an effective measure both to enhance the effectiveness of medical devices and backup data already introduced, and to understand the various medical data that will be generated by medical devices introduced in the future.

*
This work was supported by a grant from the Security Engineering Research Center of the Ministry of Commerce, Industry and Energy.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1090–1096, 2007. © Springer-Verlag Berlin Heidelberg 2007
A Study on How to Classify the Security Rating
1091
As a study toward grasping the information generated in medical situations more clearly, this paper analyzes the properties of the data using a neural network and applies security ratings to the information, so that a system is established that provides each user with analyzed medical information appropriate to a designated rating. Chapter 2, on related research, explains situation information and how to classify it using a neural network. Chapter 3 proposes the method to classify medical information security ratings using a neural network. Chapter 4 carries out a simple test using the classification system and analyzes its effectiveness. Finally, Chapter 5 gives the conclusion and future study subjects.
2 Related Research

2.1 Situation Information

Situation information is roughly defined in three ways. 1. Information indicating the individual situation of a user, place, or object, required in the interaction between the user and an applied service [2]. 2. Data needed for location information, such as where an access request occurred and where the accessing object exists; for time information, such as when an access request occurred and at what intervals; and for certain actions [3]. 3. Location information such as site and domain for access control [4]. Situation information is thus defined from different perspectives; this paper adopts the first definition. There are two main ways to classify situation information. One is RBAC (Role-Based Access Control), which groups users by name and assigns roles according to each individual's responsibility and authority, thereby controlling the use of resources. It develops security policy according to the organization and is effective for security control [5]. The other is access control using situation information. Its representative methods are GRBAC (Generalized Role-Based Access Control), which enables time-based access control that role-based access control could not provide, and xoRBAC, which addresses the limitations of role-based access control by checking previously defined conditions [6]. This paper classifies situation information using role-based access control, which can easily be applied to the medical information of hospitals.

2.2 How to Apply Neural Network

The most widely used neural network model is the multilayer perceptron (MLP), a typical static neural network used for recognition by supervised learning, classification, function approximation, etc. A multilayer perceptron is a layer-structured neural network with more than one middle layer (hidden layer) between the input and output layers.
As shown in Figure 1, a neural network generally consists of an input layer, a hidden layer, and an output layer. When the input data x is presented, it is processed by the defined operations in the hidden layer, and the output Y is produced. Each layer learns this process so that the output layer converges to the desired result value.
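The input-hidden-output flow just described can be sketched as a minimal forward pass; the layer sizes, random weights, and sigmoid activation below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative layer sizes: 4 inputs, 5 hidden units, 3 outputs
W1 = rng.normal(size=(4, 5))   # input -> hidden weights
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 3))   # hidden -> output weights
b2 = np.zeros(3)

def forward(x):
    h = sigmoid(x @ W1 + b1)   # hidden layer activations
    y = sigmoid(h @ W2 + b2)   # output layer Y
    return y

y = forward(np.array([0.5, -1.0, 0.3, 0.8]))
```

Training (by supervised learning, as the text notes) would adjust W1, b1, W2, b2 so that the output approaches the desired result value.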
1092
J. Song and S. Kim
Fig. 1. General structure of a neural network (input, hidden, and output layers)
However, this neural network method only learns and outputs results corresponding to the given input/output patterns, so it has the problem of producing similar outputs for unknown input patterns. As most information generated in real life is related to time, it is necessary to renew past information into up-to-date information through circular (recurrent) linkage in a dynamic representation. Considering this characteristic, a circular neural network should be applied for autonomous dynamic representation. Figure 2 shows Jordan's SRN model, a circular neural network that can represent the temporal transitions of input patterns and context-dependent relationships. For real-time renewal of information on medical treatment, this paper applies the circular neural network method [7].
Fig. 2. Jordan’s SRN
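A single step of a Jordan-style network can be sketched as follows: the previous output, held in context units, is fed back into the hidden layer together with the new input (the dimensions, tanh activation, and random weights are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_hid, n_out = 3, 4, 2
W_in = rng.normal(size=(n_in, n_hid))    # input -> hidden
W_ctx = rng.normal(size=(n_out, n_hid))  # context (previous output) -> hidden
W_out = rng.normal(size=(n_hid, n_out))  # hidden -> output

def step(x, context):
    """One Jordan-SRN step: the hidden layer sees both the new input
    and the previous output held in the context units."""
    h = np.tanh(x @ W_in + context @ W_ctx)
    y = np.tanh(h @ W_out)
    return y, y  # new output, and a copy of it as the new context

context = np.zeros(n_out)
for x in np.eye(3):        # a toy 3-step input sequence
    y, context = step(x, context)
```

Because the context carries the previous output forward, the same input pattern can yield different outputs at different points in a sequence, which is the temporal dependence the text requires.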
3 Medical Information Security Ratings Classification Method Using Neural Network

3.1 Assumptions

For the system design and composition, this paper assumes that the various medical information consists of simple information as below:

- Sensor information (location information, body temperature, heart beat)
- Medical information (medical record, medication details, written opinion, prescription details, individual information, medical charge)
- Situation information users (doctor, nurse, assistant nurse, patient, guardian)
- Security ratings (divided into 3 stages to restrict access to information needed for medical acts and information for patients and guardians: 1: basic rating, 2: medical common rating, 3: medical security rating)

3.2 Designing the Medical Information Security Ratings Classification Method Using Neural Network

In this paper, medical situation information is provided to a final information identifier through the neural network structure shown in Figure 3. In the hidden layer, each piece of information is classified into sensor information or common medical information; then a security rating is applied to the classified information, which is stored to be identified by the final information identifier. To process information that changes dynamically over time, the outputs of the hidden layer and the output layer are fed back to the input layer, forming a circular linkage.
Fig. 3. Neural network structure for medical situation information classification
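The classify-then-gate idea of Fig. 3 can be illustrated with a simplified rule-based stand-in for the learned classifier; the category names, rating assignments, and role clearances below are hypothetical examples, not the paper's learned model:

```python
# Hypothetical stand-in for the learned classifier: map each item of
# situation information to a security rating (1 = basic, 2 = medical
# common, 3 = medical security), then gate access by the requesting
# role, in the spirit of role-based access control.
RATING = {
    "location": 1, "body_temperature": 1, "heart_beat": 1,     # sensor data
    "prescription": 2, "medical_record": 3, "personal_info": 3  # medical data
}
CLEARANCE = {  # hypothetical role clearances
    "patient": 1, "nurse": 2, "doctor": 3,
}

def accessible(role, item):
    """A role may read an item only if its clearance covers the rating."""
    return CLEARANCE[role] >= RATING[item]

allowed = [item for item in RATING if accessible("nurse", item)]
```

In the proposed system the RATING mapping is not a fixed table but is learned and refreshed by the neural network as new situation information arrives.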
The neural network with circular linkage shown in Figure 3 reflects the state of the past network when determining the state of the current network, so dynamic information provision is available as the medical information changes. In other words, as repeated information acquired from ongoing medical acts is input in real time, the states of the medical acts can be identified in real time and overlapping treatments can be prevented. Figure 4 shows the specific application of the hidden layer part of the neural network structure for medical situation information classification. The process is as follows:

1. Medical data values are input.
2. The characteristics of the information are extracted. The information from hospitals is classified into sensor data and medical data according to the characteristics and origin of the information. Sensor data and medical data provide the basic category value for the place where each piece of information was generated, and patterns are extracted from the characteristics of the information by learning whether it can be distinguished consistently with the classified information and existing data classification methods. The extracted pattern values are provided as additional standards, contributing to more exact data analysis later.
3. Considering the information extracted in step 2, a security rating is applied to each piece of information. Security ratings are applied according to an information guideline adjusted to the hospital's medical acts; the system learns the information classified along with its security ratings, stores the security-application patterns, and feeds the learned values back into the security guideline, contributing to more exact analysis later.
4. The characteristics of the medical information and the security-rated data are stored, completing the preparation for information provision.
5. If the latest data of the information-characteristic extraction, the security-rating application, or the medical situation information database change over time, they are returned to the initial information in a circular manner to renew the past information.
Fig. 4. Medical information security ratings classification processing using neural network
To train the neural network, medical information is defined by representative data properties or classification standards, and the output results are learned repeatedly. Medical data are used as the input of the neural network after many input data are extracted. All data are renewed at each neuron through the feedback linkage, so the design enables dynamic situation information analysis and classification.
4 Test and Analysis

The medical situation information classification method using the neural network proposed in this paper was evaluated in a simple test using the sensor information of location, body temperature, and heart beat, input through sensors, together with information input as diagnoses. All information used a static, text-based pattern, and we examined whether the result provides proper situation information to the final medical agent who receives the data and analysis information needed for practical
medical procedures. Figure 5 shows the result of the test, in which each medical datum was input at 1-second intervals for 100 seconds, repeated 4 times. As a test result, both the learning and the information with applied security ratings showed a higher error rate in the medical-information characteristic extraction process, because classifying 400 pieces of situation information into 8 classes is relatively more difficult than the security classification into 3 classes. The test showed that the error rates of both the learned result properties and the security-rated medical information decrease gradually. However, the identification rate on medical data such as complex medical information or unclassified terminology was very low. Additional settings on the data of the classification standard can solve this problem at its root.
Fig. 5. Neural network learning error rate
5 Conclusion

In order to specify the standards for classifying medical situation information, this paper applied rating standards based on medical information properties and security using a neural network. As a method to distinguish special situations in medical information, Role-Based Access Control was used for situation information classification. To divide and provide real-time medical information, a circular neural network was used for information classification. This can serve the dynamic demands of the latest medical acts, considering time information, when practical information is applied. As a result of a simple test with limited information, this study developed an applicable system by reducing the error rate through automatic classification of medical acts and repeated learning of the process. As a method to dynamically classify situation information generated in future ubiquitous environments, this study can be used for the integration of similar situation information and the improvement of system performance. In the future, there should be a study on a system that performs autonomous learning considering the frequency of persistently occurring
information, as a measure for data not present in the learning data, while recognizing each item of hospital situation information as an object. Such a study may handle situation information more flexibly.
References

1. Weiser, M.: The Computer for the Twenty-First Century. Scientific American (1991) 94-101
2. Dey, A.K., Abowd, G.D.: Understanding and Using Context. Personal and Ubiquitous Computing Journal 5(1) (2001) 4-7
3. Georgiadis, C.K., Mavridis, I., Pangalos, G., Thomas, R.K.: Flexible Team-based Access Control Using Contexts. In: ACM Symposium on Access Control Models and Technologies (SACMAT 2001) (2001) 21-30
4. Wilikens, M., Feriti, S., Sanna, A., Masera, M.: A Context-related Authorization and Access Control Method Based on RBAC: A Case Study from the Health Care Domain. In: 7th ACM Symposium on Access Control Models and Technologies (SACMAT 2002) (2002) 117-124
5. Sandhu, R.S., Coyne, E.J., Feinstein, H.L., Youman, C.E.: Role-Based Access Control Models. IEEE Computer 29(2) (1996) 38-47
6. Neumann, G., Strembeck, M.: An Approach to Engineer and Enforce Context Constraints in an RBAC Environment. In: 8th ACM Symposium on Access Control Models and Technologies (SACMAT 2003), Como, Italy (2003) 65-79
7. Jordan, M.: Serial Order: A Parallel Distributed Processing Approach (ICS Tech. Rep. No. 8604). La Jolla, CA: University of California, San Diego, Department of Cognitive Science (1986)
Detecting Biomarkers for Major Adverse Cardiac Events Using SVM with PLS Feature Selection and Extraction Zheng Yin1, Xiaobo Zhou2,3, Honghui Wang4, Youxian Sun1, and Stephen T. C. Wong2,3 1
National Laboratory of Industrial Control Technology, Institute of Industrial Process Control, Zhejiang University, Hangzhou 310027, P.R. China
[email protected],
[email protected] 2 Functional and Molecular Imaging Center, Brigham and Women’s Hospital Boston, MA 02121 USA 3 HCNR Center for Bioinformatics, Harvard Medical School Boston, MA 02115, USA {xiaobo_zhou,stephen_wong}@hms.harvard.edu 4 Radiology and Imaging Sciences, Clinical Center, National Institutes of Health Bethesda, MD 20892 USA
Abstract. Detection of biomarkers capable of predicting a patient's risk of major adverse cardiac events (MACE) is of clinical significance. Due to the high dynamic range of protein concentrations in human blood, applying proteomics techniques for protein profiling can generate large arrays of data for the development of optimized clinical biomarker panels. The objective of this study is to discover an optimized subset of fewer than ten biomarkers for predicting the risk of MACE. In this paper, we connect a linear SVM with PLS feature selection and extraction. A simplified PLS algorithm selects a subset of biomarkers and extracts latent variables, and the prediction performance of the linear SVM is dramatically improved. The proposed method is compared with a widely used PLS-Logistic Discriminant solution and several other reported methods on the MACE prediction experiments.
1 Introduction

The search for biomarker panels for MACE prediction was motivated by the original work in [1], which reported that assays of MPO (myeloperoxidase) levels in blood samples from 604 patients supply accuracy over 60% in predicting the risk of MACE in the ensuing 30- and 180-day periods after presenting in the emergency room with chest pain [1]. MPO has been accepted as a biomarker for MACE, with a measurement kit approved by the FDA (http://www.fda.gov/cdrh/reviews/K050029.pdf); meanwhile, the detection of better biomarker groups to assist MPO has been inspired with the help of Mass Spectrometry (MS). In our study, the same plasma samples as in [1] are used to search for a biomarker set containing fewer than 6 proteins that supplies better prediction accuracy.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1097–1106, 2007. © Springer-Verlag Berlin Heidelberg 2007
1098
Z. Yin et al.
The samples belong to 3 categories, and we focus on classifying the MACE group against the Control group. The MACE group contains samples from patients with chest pain and consistently negative Troponin T who had MACE in the next 30 or 180 days, while the Control group contains patients with chest pain and consistently negative Troponin T who lived the next 5 years without any major cardiac events or death. As described in [2], the plasma samples were fractionated into 6 fractions. SELDI MS profiles from fractions 1, 3, 4, 5 and 6 were acquired with Ciphergen's PBSIIc using IMAC and CM10 ProteinChips. Pre-processing and peak detection were done following [3], with the S/N ratio threshold set at 1.5. The feature selection and classification experiment is done on the peak map of the protein profile obtained from fraction 4 using IMAC protein chips (denoted IMAC-F4). The number of peaks (features) in the peak map is 200. The algorithms developed will be applied to other peak maps (made using different S/N ratio thresholds) and profiles (obtained from different fractions or using different protein chips). Biomarker selection is intrinsically linked to the feature selection techniques of machine learning, and various feature selection and classification methods are applied in this project. The dataset involved has great overlap, i.e., samples from different categories have resembling feature values and are hard to classify. Moreover, the sample number is lower than the feature number in this dataset. Given the highly overlapped nature of the dataset, kernel methods can be used to detect the intrinsic relations between features [4, 5], a kind of improved genetic algorithm has also been proposed [2], and filter approaches can supply a guideline for dimension reduction with relatively low computational cost [6]. However, the linear SVM classifier performs poorly when connected to traditional feature selection methods on this dataset.
A baseline solution featuring PLS-Logistic Discriminant Analysis [7] greatly outperforms the linear SVM both in accuracy and computational cost. For the linear SVM to form an effective biomarker detection method, it must be connected to a method that can determine a feature ranking and form a less overlapped feature space in a relatively short time. Our solution connects a simplified Partial Least-Squares (PLS) method [8], designed for binary classification, to the linear SVM. PLS has proven useful in situations where the feature number is much greater than the sample number [5]; the reduced feature space obtained from PLS can be connected to classifiers such as LDA [9], logistic analysis, QDA [7], and SVM [4, 5]. Moreover, the component axes supplied by PLS have been proven to be a valid feature ranking criterion [9], which means PLS intrinsically combines feature selection and feature extraction. This method selects the most informative features and forms a less overlapped feature space by forming latent variables according to the correlation between feature values and class labels. The classification performance of both the linear SVM and the baseline method is dramatically improved when applied in the feature space selected by PLS, and the linear SVM outperforms the baseline method in both sensitivity and specificity with less computational cost; moreover, it shows the potential to help determine the component number used in the PLS realization.
Detecting Biomarkers for MACE Using SVM
1099
This paper is organized as follows: the methods involved in feature selection, extraction, and classification are described in Section 2, Section 3 is devoted to the experimental results, and a conclusion is made at last.

2 Methods

Let Y = [y_1, …, y_n]^T denote the class labels of n training samples, where y_i ∈ {−1, 1}, i = 1, …, n; −1 denotes the Control group and +1 denotes the MACE group. The protein peak values are summarized in an n × L matrix X, where each row of X represents one of the n samples; for the ith row, x_i = [x_{i1}, x_{i2}, …, x_{iL}], i = 1, …, n, where x_{ij} is the value of the jth peak for the ith sample. Our method consists of the steps of feature selection, PLS feature extraction, and SVM classification. Through cross-validation on randomly partitioned training sets, a feature ranking is obtained using the coefficients of the first PLS axis; the simplified PLS utilizes the top features of this ranking to form a latent feature space, and the linear SVM is applied on this space to supply prognostic values.

2.1 Feature Selection Using the First PLS Axis
Biologists often want statisticians to supply an interpretable model simply answering which proteins can be used for diagnosis. Feature extraction methods, on the other hand, take advantage of all available features to give a reduced feature space; the new components often give information or hints about the interactions and correlations among features, rather than ranking single genes as in [10, 11]. Feature extraction is often questioned for losing information about single features and supplying models with poor interpretability. However, when trying to find optimal feature subsets [12], feature selection methods may suffer from overfitting and may also be difficult to interpret and implement, because they are based on computationally intensive iterative techniques such as genetic algorithms. There is literature establishing the relationship between PLS, a feature extraction method, and feature selection. In our work, the features are ranked according to the squares of their weights in the first PLS axis, denoted w_{1j}^2. The equivalence between w_{1j}^2 and the widely used feature ranking criterion BSS_j/WSS_j, as well as F-test scores, has been proven in [9]. Thus, feature
selection and extraction can be handled together using PLS. 2.2 Simplified PLS for Feature Selection and Extraction
The partial least squares (PLS) method [13, 14] has been a popular regression, discrimination, and classification technique in its domain of origin, chemometrics. It creates score vectors (components, latent vectors) using the existing correlations between different sets of variables while keeping most of the variance.
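With a single binary label vector, the first PLS axis reduces to the normalized cross-covariance X^T Y (after standardization), and features can be ranked by their squared weights, the w_{1j}^2 criterion of Sect. 2.1. A sketch on synthetic data (all data here is illustrative, not the IMAC-F4 peaks):

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 40, 10
Y = np.repeat([-1.0, 1.0], n // 2)  # binary class labels
X = rng.normal(size=(n, L))
X[:, 0] += Y                        # make feature 0 informative

# standardize X columns and Y, as in the PLS preprocessing step
Xs = (X - X.mean(0)) / X.std(0)
Ys = (Y - Y.mean()) / Y.std()

c = Xs.T @ Ys                       # cross-covariance X^T Y
w1 = c / np.linalg.norm(c)          # first PLS axis direction
rank = np.argsort(w1 ** 2)[::-1]    # features ranked by w_1j^2
```

The informative feature receives the largest squared weight, so the top entries of `rank` are the candidate biomarkers.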
Typically, PLS regression deals with multivariate matrices X and F, recursively extracting components from both matrices according to the correlations between the components from X and F. In the binary classification scenario defined earlier, Y is a single-variable vector consisting of binary class labels, so we can use Y all across the PLS calculation rather than extract components from it in each iteration. We standardize both X and Y to zero mean and unit variance for PLS processing. The procedure for extracting altogether m PLS components from X starts by seeking a first PLS axis direction w_1, and outputs an n × m score matrix T at last.
w_1 is typically calculated as the eigenvector corresponding to the largest eigenvalue (denoted θ_1^2) of the matrix X^T Y Y^T X, which reflects the correlation of X and Y. According to [8], with a single vector Y consisting of binary labels, we have

    θ_1^2 = ||X_0^T Y_0||^2,    w_1 = (1/θ_1) X_0^T Y_0 = X^T Y / ||X^T Y||.

The square of each element in w_1 can be utilized as the feature selection criterion. Then for each h = 1, …, m, with X_0 = X and Y_0 = Y_1 = … = Y_h:
    w_h = X_{h−1}^T Y / ||X_{h−1}^T Y||,    t_h = X_{h−1} w_h,    p_h = X_{h−1}^T t_h / ||t_h||^2,    X_h = X_{h−1} − t_h p_h^T,
where t_h is the hth component (axis direction) extracted, i.e., the hth column of the score matrix T, p_h is the loading of X on t_h, and X_h is the residual. Each component t_h is a linear combination of the columns of X:

    t_h = X_{h−1} w_h = X ∏_{j=1}^{h−1} (I − w_j p_j^T) w_h,

thus, writing w_h^* = ∏_{j=1}^{h−1} (I − w_j p_j^T) w_h, we have t_h = X w_h^*.
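The recursion above translates directly into code; a sketch of extracting m score vectors, deflating X while reusing the binary label vector Y throughout (synthetic data for illustration):

```python
import numpy as np

def simple_pls(X, Y, m):
    """Simplified PLS for binary classification: extract m score
    vectors from X, reusing the single label vector Y in every pass."""
    Xh = X.copy()
    T = np.zeros((X.shape[0], m))
    for h in range(m):
        w = Xh.T @ Y
        w /= np.linalg.norm(w)       # w_h = X_{h-1}^T Y / ||X_{h-1}^T Y||
        t = Xh @ w                   # t_h = X_{h-1} w_h
        p = Xh.T @ t / (t @ t)       # p_h = X_{h-1}^T t_h / ||t_h||^2
        Xh = Xh - np.outer(t, p)     # deflate: X_h = X_{h-1} - t_h p_h^T
        T[:, h] = t
    return T

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 8))
Y = np.repeat([-1.0, 1.0], 10)
T = simple_pls(X - X.mean(0), Y, m=3)  # n x m score matrix
```

A useful property of this deflation scheme is that the extracted score vectors are mutually orthogonal, so the latent feature space handed to the classifier is well conditioned.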
Typical PLS regression applies LR regression of Y on t_h to form a predicted Ỹ. The loadings of Ỹ on t_h may also be determined using discriminant analysis approaches. In this paper, the score matrix T serves as the input of the logistic discriminant or SVM to generate the estimated class label. An advantage of PLS feature extraction is the possibility to visualize the data by graphical representation. Later, the structure of the IMAC-F4 dataset will be shown by plotting the first two PLS components using different colors for each class. There is no widely accepted procedure to determine the component number m. Methods based on cross-validation on the training set are proposed in [8], and boosting is applied to improve classification accuracy while seeking m in [9]. With improved classification performance, the linear SVM can be involved in the selection of m. A leave-one-out (LOO) test is implemented on each training set with different m, and the results given by the SVM are used to determine the m used in classifying the testing set.

2.3 Linear SVM for Classification
Over the past several years, the SVM model has been intensively investigated and applied as a highly reliable and flexible classifier in various scenarios. With X and Y defined as before, the optimal hyperplane $f(x) = \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b$ is used to classify each test sample x by the sign of f(x); here K is the kernel function mapping x into a higher-dimensional space, and $\alpha_i$ and b are found by the SVM algorithm [15]. Using the common inner product $\langle x_i, x \rangle$ as the kernel function yields a linear SVM. The linear SVM was designed to find the hyperplane whose minimal distance to the training samples is maximized [15]. The position of the SVM decision boundary is determined by the support vectors, i.e. the samples on the edge of each category, those with Lagrange multiplier $\alpha_i > 0$. If samples from different categories closely resemble each other, the number of support vectors becomes large and the SVM performs badly. The penalty coefficient C handles overlap in the training set through the limitation $0 < \alpha_i \le C$: a lower value of C tolerates more overlap between the classes, while a higher value carries a higher risk of overfitting (high classification accuracy on the training set but low accuracy on the test set). Training an SVM involves solving a quadratic programming problem, so using SVMs, especially those with non-linear kernels, inside wrapper methods brings extra computational cost [16]. A linear SVM, however, has fewer coefficients to adjust, less risk of overfitting, and runs faster than its non-linear counterparts.
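To make the soft-margin trade-off concrete, here is a minimal primal (sub)gradient sketch of a linear SVM. It illustrates the hinge-loss/C interplay only; it is not the QP dual solver cited in [15], and the toy data, learning rate and epoch count in the check are assumptions.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on ||w||^2/2 + C * mean(max(0, 1 - y_i (w.x_i + b))).
    Labels y must be in {-1, +1}; larger C punishes margin violations harder."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1            # current margin violators
        gw = w - (C / n) * (y[viol] @ X[viol])
        gb = -(C / n) * y[viol].sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

def svm_predict(X, w, b):
    return np.sign(X @ w + b)
```

On separable data with few violators, the resulting decision function f(x) = w·x + b coincides with the linear-kernel expansion above, since w is a combination of the (support) training vectors.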
3 Experimental Results

3.1 Challenges from the Dataset

The nature of the whole dataset is shown in Fig. 1 using the first two latent variables extracted from all 200 features. The two classes are highly overlapped.
Z. Yin et al.

[Figure 1: scatter plot of the Control and MACE samples; x-axis: first latent variable (about -25 to 10), y-axis: second latent variable (about -8 to 10); the two classes overlap heavily.]

Fig. 1. Visualization of IMAC-F4 samples using the first two components extracted from all 200 features
The performance of a classification method based on PLS feature extraction and logistic discrimination proposed in [7] serves as a baseline (denoted "baseline" in the tables below). This method is accepted as an effective linear solution for tumor diagnosis with DNA microarray data. Table 1 shows the performance of the classifiers in a leave-one-out (LOO) test across the dataset without feature selection. In each LOO iteration, one sample is left out to test the classifier trained on the other 119 samples. Without feature selection or extraction, the SVM struggles on the highly overlapped dataset: more than 90% of the training samples are chosen as support vectors, and complicated but brittle classifiers are obtained. During the leave-one-out test over the whole dataset, very few samples are classified into the Control group, which causes a catastrophic specificity of less than 5%. In fact, the average number of support vectors during the LOO is more than 100, and the linear SVM simply makes the same decision ("MACE") in most iterations.

Table 1. The performance of two classifiers without feature selection
Classifier            Features  Components  Accuracy  Sensitivity  Specificity
Baseline              200       1           0.5167    0.5833       0.4500
                                2           0.5833    0.5500       0.6167
                                3           0.6083    0.6000       0.6167
                                4           0.5750    0.5500       0.6000
                                5           0.5833    0.6167       0.5500
                                6           0.6000    0.6167       0.5833
Linear SVM (C=500)    200       No PLS      0.4750    0.9167*      0.0333*

*Only 7 samples are classified into the Control group during the leave-one-out test.
Feature extraction from all 200 features by PLS ensures an accuracy of around 50%-60% for the baseline method; however, its performance varies greatly as the component number ranges from 1 to 6. According to [9], performance decreases as unnecessary components are added. On the other hand, it is reported that common classifiers achieve an accuracy of around 62% using the top 5 features given by a t-test, and the accuracy of a wrapper method based on a standard GA is around 69% with its top 5 features [2].

3.2 Feature Selection

It can be seen from Table 1 that feature extraction based on all the features gives an even worse accuracy than the simple t-test feature-ranking criterion [2]; worse still, a model involving 200 features is too complex to be accepted for the development of an immunoassay. Feature selection is necessary for better classification performance.
In our study, we rank features by the average of $w_{1j}^2$ across 500 training sets randomly drawn from the dataset, each containing 70% of the observations; $w_{1j}$ is the first PLS weight calculated for feature j. The whole selection procedure takes less than 2 minutes, which is faster than most wrapper methods based on genetic algorithms. The top 4 features in this ranking are finally selected for PLS processing and classification, as listed in Table 2:

Table 2. The top 4 features selected using the average $w_{1j}^2$ criterion

No.   m/z (Da)   Avg. score $w_{1j}^2$
19    2639.04    0.1563
47    17543      0.1451
106   5230.4     0.1361
55    6516.3     0.1340
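The ranking criterion just described, i.e. the squared first PLS weight averaged over random 70% subsamples, can be sketched as follows. This is an illustrative implementation; the synthetic data in the check and the function name are assumptions.

```python
import numpy as np

def rank_by_w1_squared(X, Y, n_splits=500, train_frac=0.7, seed=0):
    """Average w_{1j}^2 over random training subsets, where
    w_1 = X^T Y / ||X^T Y|| is the first PLS weight on the centered subset.
    Returns feature indices sorted from highest to lowest average score."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = int(round(train_frac * n))
    avg = np.zeros(p)
    for _ in range(n_splits):
        idx = rng.choice(n, size=k, replace=False)
        Xt = X[idx] - X[idx].mean(axis=0)
        Yt = Y[idx] - Y[idx].mean()
        w1 = Xt.T @ Yt
        w1 = w1 / np.linalg.norm(w1)
        avg += w1 ** 2
    avg /= n_splits
    return np.argsort(avg)[::-1], avg
```

Since $\|w_1\| = 1$ on every split, the per-feature scores sum to 1 across all features, consistent with the magnitudes in the Avg. Score column of Table 2.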
During the whole feature selection procedure, these features always occupy the top 4 positions. The first column of Table 2 is the column number of the selected feature in the dataset, while the second column shows the m/z (mass over electric charge) value, which identifies the related protein.

3.3 LOO Results on the Extracted Feature Space

The simplified PLS is applied to the shrunk dataset formed by the 4 selected features, and the first 1-4 latent variables are extracted to form the new latent space. Different classifiers are applied to these feature spaces, and their performance in the LOO test across all 120 samples is recorded in Table 3. It seems harder for the linear SVM to handle the dataset with only 4 features left. However, when the PLS scores serve as the input, a less overlapped feature space is obtained and the performance of the linear SVM
becomes comparable with the baseline and gives better maxima of accuracy (68.33%), sensitivity (66.7%) and specificity (70%). It can be seen from the table that the performance of the linear SVM is more vulnerable to variation of the component number. As the sample number is not large, variation of C has little influence on SVM performance.

Table 3. The performance of classifiers in the 4-feature space
Classifier            Features  Components  Accuracy  Sensitivity  Specificity
Baseline              4         1           0.6667    0.6333       0.7000
                                2           0.6583    0.6333       0.6833
                                3           0.6583    0.6333       0.6833
                                4           0.6583    0.6333       0.6833
Linear SVM (C=500)    4         No PLS      0.4917    0.9833*      0
Linear SVM (C=100)    4         1           0.6583    0.6500       0.6667
                                2           0.6583    0.6667       0.6500
                                3           0.6417    0.6000       0.6833
                                4           0.6083    0.5500       0.6667
Linear SVM (C=500)    4         1           0.6667    0.6667       0.6667
                                2           0.6833    0.6667       0.7000
                                3           0.6250    0.5667       0.6833
                                4           0.6083    0.5500       0.6667
Linear SVM (C=1000)   4         1           0.6583    0.6500       0.6667
                                2           0.6833    0.6667       0.7000
                                3           0.6250    0.5667       0.5333
                                4           0.5917    0.5333       0.6500

*Only 1 sample is classified into the Control group, and it should actually be a MACE sample.
3.4 70%-Cross-Validation Results

The performance of the linear SVM in the 4-feature space is further validated using 70% cross-validation on the IMAC-F4 dataset. For both the 200-feature dataset and the 4-feature dataset, 500 partitions into a training set containing 70% of the 120 observations and a test set with the remaining 30% are generated. For each partition, the baseline method is applied with the extracted component number ranging from 1 to 6 for the 200-feature space and from 1 to 4 for the 4-feature space, and the linear SVM is applied without PLS feature extraction in the 200-feature space. The linear SVM can also be used to refine the PLS solution, e.g. the determination of the component number. A common approach is to apply cross-validation on the training sets and compare the mean error rate (MER): when the MER begins to increase as new components are added, feature extraction should stop [9]. Here the linear SVM supplies the mean error rate: for each training set, LOO tests are applied with m varying from 1 to 4, and the MERs are given by the linear SVM. In over 90% of the iterations, the MER reaches its minimum at m = 2.
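The component-selection loop can be sketched end to end. As an assumption, a nearest-centroid rule stands in for the SVM so the example stays dependency-free (the paper uses the SVM's LOO error), and the synthetic data in the check are illustrative.

```python
import numpy as np

def pls_scores_train_test(Xtr, Ytr, Xte, m):
    """PLS scores for training and test rows, using the deflation of Sec. 2.2."""
    mu = Xtr.mean(axis=0)
    Xc, Xe = Xtr - mu, Xte - mu
    Yc = Ytr - Ytr.mean()
    Ttr, Tte = [], []
    for _ in range(m):
        w = Xc.T @ Yc
        w = w / np.linalg.norm(w)
        t = Xc @ w
        p = Xc.T @ t / (t @ t)
        Ttr.append(t)
        Tte.append(Xe @ w)
        Xc = Xc - np.outer(t, p)       # deflate train and test alike
        Xe = Xe - np.outer(Xe @ w, p)
    return np.column_stack(Ttr), np.column_stack(Tte)

def loo_mer(X, Y, m):
    """Leave-one-out mean error rate of a nearest-centroid rule on m PLS scores."""
    n, errs = len(Y), 0
    for i in range(n):
        tr = np.arange(n) != i
        Ttr, Tte = pls_scores_train_test(X[tr], Y[tr], X[i:i + 1], m)
        c_pos = Ttr[Y[tr] > 0].mean(axis=0)
        c_neg = Ttr[Y[tr] < 0].mean(axis=0)
        pred = 1.0 if np.linalg.norm(Tte[0] - c_pos) < np.linalg.norm(Tte[0] - c_neg) else -1.0
        errs += pred != Y[i]
    return errs / n

def select_m(X, Y, max_m=4):
    """Pick the component number with the lowest LOO mean error rate."""
    return min(range(1, max_m + 1), key=lambda m: loo_mer(X, Y, m))
```

In the paper's setup, this selection would run once per training partition, and the chosen m is then used when classifying the held-out 30%.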
The performance of the classifiers under 70% cross-validation is summarized in Table 4. Using the top 4 features selected by the first-PLS-axis weight, the baseline method achieves its best accuracy of 71.06% with m = 1, while the linear SVM has an average accuracy of 68.86% with m = 2, also comparable with the performance of classifiers using the top 5 features selected by a standard GA [2].

Table 4. The performance of classifiers in 70% cross-validation (C=500 for SVMs)
Classifier    Features  Components  Testing accuracy (avg. over 500 partitions)
Baseline      200       1           0.5278
                        2           0.6111
                        3           0.6111
                        4           0.4722
                        5           0.4722
                        6           0.3278
Baseline      4         1           0.7106
                        2           0.6886
                        3           0.6667
                        4           0.6667
Linear SVM    200       No PLS      0.4722
Linear SVM    4         2           0.6886
4 Conclusion

In this paper, the PLS-linear SVM method is applied to the MACE biomarker detection dataset. A simplified PLS designed for binary classification helps the linear SVM out of the curse of the overlapped dataset and gives improved prognostic values; the PLS coefficient serves as a feature-ranking criterion and handles feature selection and extraction together in a very short time. The linear SVM is also involved in the selection of the component number used in PLS, further improving performance. This solution is an efficient alternative to MACE biomarker detection based on evolutionary methods such as genetic algorithms, and it also outperforms filter methods based on statistical scores in classification accuracy.
Acknowledgements

This research is supported by the Chinese NSF under grants No. 60574019 and No. 60474045, the Key Technology R&D Program of Zhejiang Province No. 2005C21087, and the Academician Foundation of Zhejiang Province No. 2005A1001-13, and is also funded by the HCNR Center for Bioinformatics Research Grant, HMS (STCW).
References

1. Brennan, M.L., Penn, M.S., Lente, F.V., Nambi, V., Shishehbor, M.H., Aviles, R.J., Goormastic, M., Pepoy, M.L., McErlean, E.S., Topol, E.J., Nissen, S.E., Hazen, S.L.: Prognostic Value of Myeloperoxidase in Patients with Chest Pain. The New England J. Med. 349 (2003) 1595-1604
2. Zhou, X., Wang, H., Wang, J., Hoehn, G., Azok, J., Brennan, M.L., Hazen, S.L., Li, K., Wong, S.T.C.: Biomarker Discovery for Risk Stratification of Cardiovascular Events Using an Improved Genetic Algorithm. Proc. 2006 IEEE/NLM Int. Symposium on Life Science and Multimodality, Washington, D.C.
3. Morris, J.S., Coombes, K.R., Koomen, J., Baggerly, K.A., Kobayashi, R.: Feature Extraction and Quantification for Mass Spectrometry in Biomedical Applications Using the Mean Spectrum. Bioinformatics 21 (2005) 1764-1775
4. Tenenhaus, A., Giron, A., Saporta, G., Fertil, B.: Kernel Logistic PLS: A New Tool for Complex Classification. Proc. 2005 ASMDA Applied Stochastic Models and Data Analysis, Brest, France (http://asmda2005.enst-bretagne.fr/IMG/pdf/proceedings/441.pdf)
5. Rosipal, R., Trejo, L.J., Matthews, B.: Kernel PLS-SVC for Linear and Nonlinear Classification. Proc. 2003 ICML, the Twentieth Int. Conf. on Machine Learning, Washington, D.C. (http://www.ofai.at/~roman.rosipal/Papers/icml03.pdf)
6. Peng, H., Long, F., Ding, C.: Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence 27 (2005) 1226-1238
7. Nguyen, D., Rocke, D.: Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data. Bioinformatics 18 (2002) 39-50
8. Wang, H.: Partial Least-Squares Regression: Method and Applications (in Chinese). National Defense Industry Press, Beijing (1999)
9. Boulesteix, A.-L.: PLS Dimension Reduction for Classification with Microarray Data. Statistical Applications in Genetics and Molecular Biology 3 (2004) Article 33
10. Dudoit, S., Shaffer, J.P., Boldrick, J.C.: Multiple Hypothesis Testing in Microarray Experiments. Statistical Science 18 (2003) 71-103
11. Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., Yakhini, Z.: Tissue Classification with Gene Expression Profiles. J. of Computational Biology 7 (2000) 559-584
12. Bo, T.H., Jonassen, I.: New Feature Subset Selection Procedures for Classification of Expression Profiles. Genome Biology 3 (2002) R17
13. Wold, S.: Soft Modelling by Latent Variables: the Nonlinear Iterative Partial Least Squares Approach. In: Gani, J. (ed.): Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett. Academic Press, London (1975) 520-540
14. Wold, S., Ruhe, A., Wold, H., Dunn III, W.J.: The Collinearity Problem in Linear Regression: The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM J. Sci. Stat. Comput. 5 (1984) 735-743
15. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1998)
16. Mao, Y., Zhou, X., Pi, D., Wong, S.T.C., Sun, Y.: Parameters Selection in Gene Selection Using Gaussian Kernel Support Vector Machines by Genetic Algorithm. J. of Zhejiang University SCIENCE 6B(10) (2005) 961-973
Hybrid Systems and Artificial Immune Systems: Performances and Applications to Biomedical Research Vitoantonio Bevilacqua, Cosimo G. de Musso, Filippo Menolascina, Giuseppe Mastronardi, and Antonio Pedone Department of Electronics and Electrical Engineering, Polytechnic of Bari, Via E. Orabona, 4 70125 – Bari, Italy
[email protected]

Abstract. In this paper we propose a comparative study of Artificial Neural Networks (ANN) and Artificial Immune Systems. Artificial Immune Systems (AIS) represent a novel paradigm in the field of computational intelligence, based on the mechanisms that allow vertebrate immune systems to face attacks from foreign agents (called antigens). Several similarities as well as differences have been shown by Dasgupta in [1]. Here we present a comparative study of these two approaches, considering evolutions of the concepts of ANN and AIS: respectively, hybrid neural systems, and Artificial Immune Recognition Systems (AIRS) and aiNet. We tried to establish a comparison among these three methods using a well-known dataset, namely the Wisconsin Breast Cancer Database. We observed interesting trends in the systems' performances and capabilities. Peculiarities of these systems have been analyzed, and possible strength points and ideal contexts of application are suggested. These and other considerations are addressed in the rest of this manuscript.
1 Introduction

The nervous system and the immune system are probably the most complex systems in vertebrates. Both have been shown to be necessary components for adaptability, and thus survival, in the environment. Learning, memory and associative retrieval are the keywords for these systems, and researchers have focused their interest on these aspects in order to replicate such behaviors in artificial systems. Artificial Neural Networks grew in this context and are nowadays among the most useful and powerful tools for data classification, clustering and prediction. Starting from the '40s, several other bio-inspired models have been successfully proposed, including Genetic Algorithms (GA) [2] and Swarm Intelligence [3]. Artificial Immune Systems (AIS) [4] followed this scientific trend. Proposed for the first time by Farmer et al. in 1986 [5], the AIS field of research underwent a noticeable boost in the mid '90s with the research carried out by Dasgupta [1][6] and then with the pioneering work of de Castro, Timmis [7] and Hunt [8]. AIS are enjoying a remarkable spread in the scientific community because of their flexibility and potential. Someone could

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1107–1114, 2007. © Springer-Verlag Berlin Heidelberg 2007
argue that there is great emphasis on these systems, and it is not by chance. AIS have been shown to be a good choice in several fields of application and are among the most versatile tools researchers can use. Moreover, in their most common implementations, they are characterized by low computational costs, a critical aspect in delicate contexts such as real-time computing. Just like Artificial Neural Networks, Artificial Immune Systems have seen different implementations and optimizations through the years, including supervised and unsupervised flavors as well as variants based on resource competition and negative selection. Hybrid Neural Systems are probably among the most famous evolutions of Neural Networks: combinations of Neural Networks with other intelligent systems that are able to conjugate the strength points of all of the constituent systems. A Neural-Genetic approach to the problem of breast cancer classification has been described in [9]. However, different approaches have been shown to perform quite well on the same problem, and in this paper we focus on the comparison of novel immune system models and advanced hybrid neural systems. The common platform selected for this comparative study is the Wisconsin Breast Cancer Database (WBCD). Composed of 699 cases, each defined by 11 fields, this dataset collects breast cancer cases observed by W.H. Wolberg in the late '80s [5]. Sixteen cases lack one parameter. Database entries are characterized by the following structure: (ID, Parameters, Diagnosis), where ID is the primary key and the parameter fields contain the numerical values of the ten indicators; the last field contains the medical diagnosis associated with the case, a binary value representing malignant/benign tumor.
The first ten indicators are extracted by analyzing images obtained through Fine Needle Aspiration (FNA), a fast and easily repeatable breast biopsy exhaustively described in [10]. The systems selected for this comparison are a hybrid-neural approach and, for AIS, the Artificial Immune Recognition System (AIRS, [11]) and aiNet [12]. All of these systems are described in the following sections of this paper. The comparison among the presented solutions has been assessed using the global accuracy metric. Interesting trends have been observed and reported; they mainly concern the ability of Artificial Immune Systems to perform better under certain conditions, and the algorithms' computational costs. These and other peculiarities of the systems under investigation are addressed in the following paragraphs, which are organized as follows: first an overview of the benchmark dataset is given with preliminary statistical analysis, then the Neural-Genetic approach is explored; descriptions of the AIRS and aiNet implementations follow. "Results" collects the results of the three systems, and the "Conclusions and Further Works" paragraph ends the manuscript with considerations and interpretations of the results, giving further cues for research in this field.
2 WBCD Dataset and Preliminary Data Analysis

The WBCD dataset, as described above, is composed of 699 cases, each defined by 11 parameters. The classification process can be cast as a function-analysis problem:

$x_{11} = f(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10})$

with:
x1 = Radius
x2 = Area
x3 = Perimeter
x4 = Texture
x5 = Smoothness
x6 = Compactness
x7 = Concavity
x8 = Concave points
x9 = Symmetry
x10 = Fractal dimension

The attention is then focused on the analysis of the multidimensional space defined by the n-tuple associated with each case. Several statistical analyses were carried out to improve knowledge of this space and create the set of information to be used in the following development steps. PCA, PFA and ICA were carried out as preliminaries in order to obtain an adequate model; these results have been described previously [9].
3 Hybrid Neural-Genetic Systems

IDEST (Intelligent Diagnosis Expert System) is a genetically optimized neural system for integrated diagnosis of breast cancer. This system is mainly based on a feed-forward Artificial Neural Network whose topology has been optimized using the approach illustrated in [13]. An ANN set up on the results obtained in previous steps was trained using the back-propagation algorithm, in the variant that updates weight and bias values according to gradient descent with momentum. The starting learning rate chosen was 0.3. This choice avoided the occurrence of step-back phenomena in the learning phase and gave the network sufficient energy to escape local minima; all this resulted in the ANN's good aptitude for convergence. The stop criterion was an SSE of 7E-3 or a limit of 50000 training epochs. The system completed the training phase in 13000 epochs, thus reaching the SSE target. The relatively contained number of epochs needed to accomplish the training step confirms the correctness of the results obtained via linear and non-linear analysis and, in particular, the accuracy of the GA search. System validation was carried out by submitting to the network the 228 cases of the validation set and counting the misclassified ones. The results returned by this analysis proved quite good: no misleading prediction was made on the 228 cases analyzed. Comparison of the error distributions in the training and validation sets shows low variability and confirms high precision on both classes. The highest obtainable system accuracy was thus reached: it is evidently an indicative
result, but the potential of similar systems now seems supported by more concrete elements. In an a priori analysis we tried to estimate the impact of the most significant choices on the accuracy of the ANN's predictions. In this phase we observed how particular decisions contributed to the achievement of such a result. Leaving the phases of the process described so far unchanged, but employing training and validation sets assembled while ignoring the results of the linear and non-linear analysis (i.e. using sets obtained by simply dividing the original dataset into two subsets), an error of 4 cases out of the 228 is observed on the validation set. This corresponds to an accuracy of 98.6%, a competitive value indeed, which shows the importance of the observations, mostly regarding the intrinsic variance of the cases. Error oscillations, furthermore, become more evident for the cases characterized by high variance. Another interesting observation can be made by leaving the described process unchanged but eliminating the hybrid ranking inspired by the elitist method typical of evolutionary algorithms. Suppressing this step incurs an error on the validation set of approximately 0.4-0.9%. The results obtained highlight the contribution to learning accuracy generated by the choices made in the data pre-processing phase [13]. The adopted devices yielded a system capable of taking full advantage of the peculiar characteristics of the datasets and of the distribution of information within them.
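The training scheme described here (batch back-propagation with momentum, learning rate 0.3, an SSE stop target) can be sketched for a toy network. The topology and the data in the check are illustrative stand-ins, not the GA-optimised IDEST network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, hidden=4, lr=0.3, momentum=0.9,
              sse_target=7e-3, max_epochs=50000, seed=0):
    """Batch gradient descent with momentum on a one-hidden-layer sigmoid net,
    stopping at the SSE target or at the epoch limit, as described for IDEST."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1)); b2 = np.zeros(1)
    vW1, vb1 = np.zeros_like(W1), np.zeros_like(b1)
    vW2, vb2 = np.zeros_like(W2), np.zeros_like(b2)
    t = y.reshape(-1, 1)
    for epoch in range(max_epochs):
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        err = out - t
        sse = float((err ** 2).sum())
        if sse <= sse_target:            # SSE stop criterion
            break
        d_out = 2.0 * err * out * (1.0 - out)     # backprop through SSE
        d_h = (d_out @ W2.T) * h * (1.0 - h)
        for v, g in ((vW2, h.T @ d_out), (vb2, d_out.sum(0)),
                     (vW1, X.T @ d_h), (vb1, d_h.sum(0))):
            v *= momentum                # momentum keeps the previous direction
            v -= lr * g
        W1 += vW1; b1 += vb1; W2 += vW2; b2 += vb2
    return (W1, b1, W2, b2), sse, epoch
```

The momentum term is what supplies the "energy to escape local minima" mentioned in the text: the velocity accumulated over past gradients carries the weights through small flat regions.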
4 Artificial Immune Recognition System

AIRS (Artificial Immune Recognition System) is a supervised learning algorithm inspired by the function of the biological immune system, designed to solve problems such as intrusion detection, data clustering and classification. For this work we employed AIRS1, described in [14], with the goal of developing a binary classification system. In the AIRS environment, the feature vector to be recognized is represented by an antigen, while the recognizers are a pool of antibodies, called memory cells. All the memory cells are created during the training stage and are representative of the training data. The lifecycle [16] of the AIRS system is represented in Fig. 1.
Fig. 1. An overview of the AIRS algorithm
Classification

To classify an unknown antigen, the affinity between this feature vector and all memory cells is calculated; the class of the best-matching memory cell becomes the class of the presented antigen. The parameter set used for this algorithm is given below:

Seed: 1
Clonal rate: 10.0
Hyper clonal rate: 2.0
Mutation rate: 0.1
Total resources: 650
Affinity threshold scalar: 0.1
Stimulation threshold: 0.99

Seed is the number of antigens selected to seed the initial pool of memory cells.
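The classification step just stated (take the class of the best-matching memory cell) is essentially a nearest-neighbour lookup over the evolved memory pool. A sketch follows, with a hand-made memory pool standing in for the one AIRS training would produce; the inverse-Euclidean-distance affinity is an assumption, common for real-valued data.

```python
import numpy as np

def airs_classify(antigen, memory_cells, cell_classes):
    """Assign the class of the memory cell with the highest affinity to the
    antigen; smaller Euclidean distance means higher affinity here."""
    dists = np.linalg.norm(memory_cells - antigen, axis=1)
    return cell_classes[int(np.argmin(dists))]
```

For example, with two memory cells at the class centers, a new antigen is simply labelled by whichever center it lies nearest to.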
Fig. 2. Java application GUI
5 aiNet

aiNet (Artificial Immune Network) was proposed to the scientific community by de Castro and Von Zuben in 2001 [12]. aiNet combines compression techniques with applications of graph theory, yielding an unsupervised classifier. The centroids extracted by the algorithm are analyzed with a Minimal Spanning Tree (MST) whose cost function is the distance among the centroids. aiNet performs well as a filter for datasets of high dimensionality, describing the distribution of the data. The cells of the immune net are represented in a space of the same dimension as the input data. The dimension of the net, that is the number of cells that compose it, is defined by a mechanism inspired by the dynamics of the biological immune system. aiNet is based on two principles of the biological immune system: Clonal Selection and Immune Network Theory [18]. Clonal Selection defines how the system reacts to antigen invasion: when an antigen is recognized by the system, a subset of the antibodies that recognized it undergoes cloning and mutation, thereby
introducing diversity into the population and adapting it to the invaders. This principle allows repetitive patterns to be extracted from a dataset, because all antigens with a particular sequence of values will be recognized by the same antibody. The cells that compose the biological immune system also interact with each other in the total absence of external stimuli. This gave rise to the idea of a model of such interactions, with a communication net connecting the various elements. In the biological system, chemical messages are exchanged that determine an antibody's survival or death. In the computational model, the interaction between two antibodies is given by their relative distance; moreover, nearby antibodies would recognize similar antigens, so redundant ones are suppressed, thinning the net. The model of de Castro and Von Zuben is based on these two principles: pattern recognition drives the cloning, mutation and selection tasks according to the principle of Clonal Selection, while the recognition of elements of the same net determines network suppression, eliminating redundancy. The proposed stop criterion is a maximum number of iterations/generations. These properties turn out to be important in the analysis of biomedical datasets, where much information is available for each patient but small sample sizes are usual. At the end of each run of aiNet, the extracted antibodies constitute the system's internal representation of the spatial distribution of antigens. In aiNet the affinity between antibody (Ab) and antigen (Ag) is given by their relative distance: Euclidean distance is very common in this context, especially for real-valued data, while Hamming distance is preferred for binary strings. Classification is thus unsupervised, a relevant property in the analysis of biomedical data, where information extraction most often tends to be explorative rather than confirmative.
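The network-suppression idea, namely that nearby antibodies recognize similar antigens and are therefore redundant, can be sketched as a greedy pruning pass. This is illustrative only: real aiNet interleaves suppression with clonal selection and mutation, and the threshold value in the check is an assumption.

```python
import numpy as np

def network_suppression(antibodies, sigma_s):
    """Keep one representative of every group of antibodies lying closer than
    the suppression threshold sigma_s (Euclidean affinity, as in the text)."""
    kept = []
    for ab in antibodies:
        # an antibody survives only if no kept antibody is within sigma_s
        if all(np.linalg.norm(ab - k) >= sigma_s for k in kept):
            kept.append(ab)
    return np.array(kept)
```

Raising sigma_s shrinks the net (more aggressive compression), while lowering it keeps a finer-grained description of the antigen distribution.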
6 Results

The aim of this paper is the comparison of three different approaches to the same problem. Two of these techniques, IDEST and AIRS, are supervised, while aiNet is an unsupervised learning algorithm. Results are reported in the table below:

Table 1. For each technique, the percentage of cases correctly classified
Technique                       Training   Validation
IDEST                           100.00%    100.00%
AIRS (reduced training items)   99.00%     94.00%
AIRS                            98.50%     100.00%
aiNet (unsupervised)            95.00% over all items
IDEST uses an ANN (Artificial Neural Network) to classify and recognize the data. The learning phase of this system is slower than the AIRS learning process, but it achieves better results on both the training and the validation set. The ANN explores the feature space better than AIRS does, because the latter is a single-shot algorithm and for this reason is faster. Using the same training and validation sets (502 and 185 items) for AIRS, this algorithm shows a rate of correctly recognized cases that is greater for the validation set than for the training set. It could be argued
that this is because the training set is more than double the size of the validation set (see Tab. 3). With a 200-entry training set, we found the accuracy to be better on the training set. In conclusion, with a reduced learning set AIRS maintains good results where IDEST starts to fail. For classification, IDEST needs to train on many cases, because a neural network must process a minimum set of items to learn its space, while AIRS needs to process only a reduced data set, being an extension of AIS (Artificial Immune Systems). From these investigations we consider that AIRS is preferable to a system based on an ANN (such as IDEST) when an on-line learning process must be implemented, or when the number of cases is not large enough to train a neural network. Finally, we used an unsupervised system, aiNet, to understand how the cases are distributed in the space; with this technique we computed two clusters that correctly classify 95% of all items.
7 Conclusions and Further Works

In this paper we compared different, evolved approaches from different fields of research: ANN-derived and Artificial Immune System-derived approaches have been compared with each other. Although only small differences in accuracy have been observed, IDEST being the top-performing algorithm, some observations should be made. As shown in Tab. 2, the computational costs of the three algorithms are markedly different.

Table 2. Time (in seconds) needed to complete the training phase
              IDEST   AIRS   AINET
Training set  1223    96     135
Execution times were computed by flooring the mean of 100 runs of each system. It is evident that the Neural-Genetic system is more accurate but also the most time consuming. On the other hand, AIRS and aiNet require far smaller computational resources, at the cost of a slight degradation in accuracy. It seems, however, that well-planned fine tuning of the parameters of AIS-based systems can lead to a remarkable improvement of the results. For their characteristics in terms of computational time and accuracy, immune-based approaches seem to be a good alternative to well-established paradigms. Questions about sensitivity analysis of the parameters and optimal feature sets are currently being investigated. Some interesting behaviours of immune-based systems are also under investigation: they mainly refer to the ability of such systems, under certain conditions, to outperform classical approaches (like ANN or SVM). We are trying to model these behaviours (reported in Tab. 1, columns 3 and 4) and to understand the inner mechanisms that lead to them; this aspect could turn out to be a strength of top-performing AIS-based systems in the biomedical field, since maintaining a low sample size helps contain the experimental costs of the research pipeline.
References

1. Dasgupta, D.: Artificial Neural Networks and Artificial Immune Systems: Similarities and Differences. Proc. of the IEEE SMC 1 (1997) 873-878
2. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975)
3. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence: From Natural to Artificial Systems (1999)
4. de Castro, L.N., Timmis, J.I.: Artificial Immune Systems: A New Computational Intelligence Approach. Springer-Verlag, London (2002) 357 p.
5. Farmer, J.D., Packard, N., Perelson, A.: The Immune System, Adaptation and Machine Learning. Physica D 22 (1986) 187-204
6. Dasgupta, D.: Artificial Immune Systems and Their Applications. Springer-Verlag, Berlin (1999)
7. de Castro, L., Timmis, J.: Artificial Immune Systems: A New Computational Intelligence Approach (2001)
8. Timmis, J., Neal, M., Hunt, J.: An Artificial Immune System for Data Analysis. Biosystems 55 (2000) 143-150
9. Bevilacqua, V., Mastronardi, G., Menolascina, F.: Intelligent Information Structure Investigation in Biomedical Databases: The Breast Cancer Diagnosis Problem. ISC 2005
10. Wolberg, W.H., Street, W.N., Heisey, D.M., Mangasarian, O.L.: Computer-Derived Nuclear Features Distinguish Malignant from Benign Breast Cytology. Cancer Cytopathology 81 (1997) 172-179
11. Watkins, A., Timmis, J., Boggess, L.: Artificial Immune Recognition System (AIRS): An Immune-Inspired Supervised Machine Learning Algorithm. Genetic Programming and Evolvable Machines 5(3) (2004) 291-317
12. de Castro, L.N., Von Zuben, F.J.: aiNet: An Artificial Immune Network for Data Analysis. In: Abbass, H.A., Sarker, R.A., Newton, C.S. (eds.): Data Mining: A Heuristic Approach. Idea Group Publishing, USA (2001) Chapter XII, 231-259
13. Bevilacqua, V., Mastronardi, G., Menolascina, F., Pannarale, P., Pedone, A.: A Novel Multi-Objective Genetic Algorithm Approach to Artificial Neural Network Topology Optimisation: The Breast Cancer Classification Problem. IJCNN 2006
14. Watkins, A.: AIRS: A Resource Limited Artificial Immune Classifier. Department of Computer Science, Mississippi State University (2001)
15. Timmis, J., Neal, M., et al.: An Artificial Immune System for Data Analysis. BioSystems 55(1/3) (2000) 143-150
16. Brownlee, J.: Artificial Immune Recognition System (AIRS): A Review and Analysis. Centre for Intelligent Systems and Complex Processes (CISCP), Swinburne University of Technology (SUT) (2005)
17. Jerne, N.K.: Towards a Network Theory of the Immune System. Annals of Immunology (1973)
18. de Castro, L.N.: http://www.dca.fee.unicamp.br/~lnunes/
NeuroOracle: Integration of Neural Networks into an Object-Relational Database System Erich Schikuta and Paul Glantschnig Research Lab on Computational Technologies and Applications, Institute of Knowledge and Business Engineering, Faculty of Computer Science, University of Vienna, Rathausstraße 19/9, A-1010 Vienna, Austria
[email protected]
Abstract. Many different approaches to the modeling of neural networks have been presented in the literature (e.g. [4]). Generally, the object-oriented approach has proven the most appropriate. It provides a concise but comprehensive framework for the design of neural networks in terms of their static and dynamic components, i.e. the information structure and its methods in the object-oriented notion. This paper presents a framework for the conceptual and physical integration of neural networks into object-relational database systems. The static components comprise the structural parts of a neural network, such as neurons and connections, and higher topological structures such as layers, blocks and network systems. The dynamic components are the behavioral characteristics, such as the creation, training and evaluation of the network. Finally, the implementation of the new NeuroOracle system based on the proposed framework is presented.
1
Introduction
Object-oriented database management systems (OO-DBMS) have proven very useful for handling and administering complex objects. We believe that the object-oriented approach is the most comfortable and natural design model for neural networks [4]. In the context of object-oriented database systems, neural networks are generally treated as complex objects. These systems have proven very valuable at handling and administering such objects in different areas, such as computer-aided design, geographic databases, administration of component structures, etc. It is our objective to treat neural networks as conventional data in the database system. From the logical point of view a neural network is a complex data value and can be stored as a normal data object. The usage of a database system as an environment for neural networks provides both quantitative and qualitative advantages. – Quantitative Advantages. Modern database systems administer objects efficiently. This is provided by a 'smart' internal level of the system, which exploits well-studied and well-known data structures, access
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1115–1124, 2007. © Springer-Verlag Berlin Heidelberg 2007
paths, etc. A whole range of further concepts is inherent to these systems, such as models for transaction handling, recovery, multi-user capability, concurrent access, etc. This puts an unchallenged platform in speed and security for the definition and manipulation of large data sets at the user's disposal. – Qualitative Advantages. The user has powerful tools and models at hand, such as data definition and manipulation languages, report generators or transaction processing. These tools provide a unified framework for handling both neural networks and the input/output data streams of these networks. A homogeneous and comprehensive user interface is provided, which spares the user awkward workarounds for analyzing the data of the database with a separate network simulator system. A further important aspect (which is beyond the scope of this paper) is the usage of neural networks as part of the rule component of a knowledge-based database system [7]. Neural networks inherently represent knowledge by the processing in their nodes [6]. Trained neural networks are similar to rules in the conventional symbolic sense. A very promising approach is therefore the embedding of neural networks directly into the generalized knowledge framework of a knowledge-based database system. 1.1
Integration of a Neural Network Simulator into an OO-DBMS
To solve the integration problem, the approach described in [8] is used. In this approach, the embedding of neural networks into database systems, the opposite direction to conventional approaches is followed: the neural networks are moved into the database system, instead of moving the data to the neural network simulators. In this paper we present the underlying data model of the NeuroOracle system, an artificial neural network simulation system integrated into an object-relational database system, namely the Oracle database management system. 1.2
Goals
The goals and basic requirements of the NeuroOracle system were specified as follows.
– The system is a neural network simulator: basic functions such as creating, modifying, deleting, training and evaluating neural networks should be supported.
– The system provides the main neural network paradigms, which can be classified into three groups: backpropagation networks, self-organizing maps and recurrent networks.
– The system is extensible: new neural networks, training and evaluation algorithms, and even new neural network paradigms can be added to the system.
– The training and evaluation algorithms should be implemented in a high-level programming language.
– The system is implemented within an object-relational database management system (DBMS).
1.3
Object-Relational Database Design
Why store complex data, such as the data of a neural network, in a flat relational database schema when there is an object-relational approach? Since version 9i, Oracle supports the definition of object types. These object types are user-defined types that make it possible to model real-world entities as objects in the database. In general, the object-type model is similar to the class mechanism found in C++ and Java. Like classes, objects make it easier to model complex entities and logic, and the reusability of objects makes it possible to develop database applications faster and more efficiently. By natively supporting object types in the database, an object-relational DBMS enables application developers to directly access the data structures used by their applications. No mapping layer is required between client-side objects and the relational database columns and tables that contain the data. Object abstraction and the encapsulation of object behaviors also make applications easier to understand and maintain; object types thus provide much extensibility. For these reasons it is very practical to use this approach for the NeuroOracle system: with many self-defined object types a complex structure can be built inside the database, which is very generic and extensible.
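As a minimal illustration of the object-type mechanism, a neuron could be modeled as follows (a hedged sketch in the spirit of the NeuroOracle schema; the type and attribute names are our own simplified examples, not the system's actual DDL):

```sql
-- Illustrative only: a simplified object type and a collection type for it.
CREATE TYPE neuron_t AS OBJECT (
  neuronid     VARCHAR2(10),
  neuronname   VARCHAR2(30),
  factivation  VARCHAR2(30),   -- activation function identifier
  foutput      VARCHAR2(30)    -- output function identifier
);
/
CREATE TYPE neuron_tab_t AS TABLE OF neuron_t;
/
```

Such collection types can then appear directly as column types of ordinary tables, which is exactly the mechanism NeuroOracle exploits for its nested tables.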
2
The NeuroOracle Database Structure
The object-oriented model has proven extremely appropriate for the specification and modeling of neural network systems, and object-oriented database systems have proven very valuable for handling and managing complex objects. One defining property of object-oriented design is the hierarchy of types. A type comprises a set of objects which share common functions. Generalization and specialization define a hierarchical type structure, which organizes the individual types. Functions defined on a super-type are inherited by all of its sub-types along the type hierarchy. However, many state-of-the-art and widely used database systems provide only a relational data model framework and do not yet support the object-oriented paradigm. The relational approach [1] is declarative and value-oriented. Operations on relations are expressed by simple and declarative languages delivering their results as new relations. Today the relational approach is the model of choice in the community and is provided by nearly all available database systems. The neural network is represented by values in the database system; the semantic information is expressed by relationships between these values. 2.1
Object Model
The neural network’s stored data within the NeuroOracle system can be divided into two groups: Static components and Dynamic components.
Fig. 1. NeuroObject's object model
Static Components. The static neural network components comprise all information stored in relations, such as neural-network-specific parameters, links, training objects, evaluation objects, etc., corresponding to the entity-relationship diagram shown in Figure 1. The neural network object is a sub-type of the general object type of the database system. Sub-types can be classified into specialized neural network types according to their network paradigm. The network paradigm is defined by a specialization, a sub-type of the neural network type. This sub-type (which inherits all characteristics of its super-type) provides the specific and necessary attributes dependent on the network paradigm. Combined with the definition of the paradigm are the dynamics (the dynamic behavior) of the network. This approach is reflected in the Unified Modeling Language (UML) diagram of NeuroOracle's object model in Figure 1. A UML diagram is useful for describing the conceptual schema of the 'reality' in focus, and the transformation of such a model to a database realization is straightforward. Rectangles represent entity sets, circles attributes, and connections relationships between entities. For an in-depth explanation see [2]. All data of a specific net is stored inside the neuralnet_tab database table and its nested tables. As shown in Figure 1, neuralnet_tab contains four nested tables: trainings, evaluations, connections and layers, the last of which in turn contains the nested table neurons. One part of the neural network's structure — the type and number of layers, the number of neurons of a specific layer, their activation and output function types and their BIAS information — is mapped in the nested tables layers and neurons. The other part, the connections between the neurons, is stored within the nested table connections. The remaining two nested tables, trainings and evaluations, are used for storing training and evaluation results.
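The multi-level nesting just described can be declared in Oracle DDL roughly as follows (an illustrative sketch assuming collection types such as layer_tab_t — whose elements contain a nested neurons collection — and conn_tab_t were created beforehand with CREATE TYPE; the exact NeuroOracle DDL may differ):

```sql
-- Illustrative only: a table with nested tables as in Figure 1.
CREATE TABLE neuralnet_tab (
  id          NUMBER,
  netid       VARCHAR2(30),
  layers      layer_tab_t,
  connections conn_tab_t
)
NESTED TABLE layers STORE AS layers_nt
  (NESTED TABLE neurons STORE AS neurons_nt)
NESTED TABLE connections STORE AS connections_nt;
```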
The structure of neural networks — more precisely, the layers, the neuron units of the layers, their activation and output functions and the referenced neural network type — constitutes the static components, while the connection weights between the neurons and/or the BIAS value of each neuron, if activated, are dynamic information and are usually adapted during the neural network's learning phase. Dynamic Components. The dynamic components of the neural network object are the typical operations on a neural network, the training and evaluation phases. The algorithms for these phases are realized by routines coded in the internal database application code (Oracle's PL/SQL procedures). Furthermore, these routines maintain certain consistency assertions on the static components after execution of specific phases, such as the insertion of link weights after a training or of results after an evaluation phase. 2.2
Datastream Concept
The functional data stream definition allows the data sets to be specified in a comfortable way. It is not necessary to specify the data values explicitly, but the
data streams can be described by SQL statements. In the database component of the NeuroOracle system, the well-known apparatus of the SQL data manipulation language ([5]) is at hand. Thus the same tool can be used both for administration and for analysis of the stored information. It is therefore easily possible to use 'real-world' data sets as training sets for neural networks and to analyze other (or the same) data with trained networks. 2.3
Extensibility
An important aspect is the extensibility of the system. This is achieved by a modularized paradigm approach. New modules have to follow a specific programming style so that they can be integrated easily into the NeuroOracle environment. Users thus have the possibility to shape the system to their needs by changing existing paradigms or adding new ones. All these implementations can be done without leaving the comfortable database environment.
3
Interaction of the NeuroOracle’s Components
For a better understanding of the system, a case study of a simple neural network explains the mapping between the neural network structure and the database tables, as well as the interaction of the defined database components. Before a new neural network can be added to the system, the table nntype_tab must contain at least one neural network type, as every neural network must refer to one specific network type. Once at least one network type is defined, a new neural network can be inserted into the neuralnet_tab table and its nested tables. This can be done by using a simple constructor method:
INSERT INTO neuralnet_tab VALUES ( neuralnet(1,'xor1',struct(2,3,1), feedforward()));
Figure 2 shows how the neural network's data is stored within the neuralnet_tab table and its nested tables layers and connections. The nested tables trainings and evaluations do not contain any data before a training or evaluation session is performed. The attribute nettype refers to a specific neural network type of the nntype_tab table. The next task is to perform a training process, in which the dynamic components are adapted. By adding new training parameters and specific training data into the nested table trainings, a database trigger is fired, which in turn starts the self-implemented training algorithm written in the PL/SQL programming language. The training algorithm then returns a set of results and stores them in the nested table trainings. A new evaluation session is created equivalently to the training process. Lacking trigger methods on nested tables, an automatic training and/or evaluation of newly inserted data into these nested tables is not
Fig. 2. Mapping of neural network data (the neuralnet_tab row for net 'xor1' and its nested tables layers, neurons and connections)
possible. For that reason two "dummy" tables with trigger constraints have been designed. The following SQL statement creates a new training session; the data is inserted into the train_session table:
INSERT INTO train_session VALUES ( doTraining('xor1-training1', 'xor1', '0;0;0;1;1;0;1;1', '0;1;1;0', backprop(400,0.5,0.01) ));
The inserted row is shown in Figure 3.
Fig. 3. Training data inserted into nested table
After insertion of the training data a database trigger is fired, which calls the proper training algorithm depending on the algorithm type inserted as a training parameter into the database table. In this case the backpropagation algorithm function is used. The training parameters and the result of the training algorithm function are then inserted into the nested table trainings of the neuralnet_tab table. This is shown in Figure 4. The training result object stored in the trainresult column is of type backproptrainresult. It contains a nested table resultdetail that stores the total net error of specified epochs. Furthermore, the database trigger updates the dynamic components of the neural network.
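The trigger mechanism described above might look roughly as follows (a sketch only; the trigger name, column names and the procedure run_training are our placeholders, not NeuroOracle's actual PL/SQL code):

```sql
-- Illustrative only: an AFTER INSERT trigger on the dummy table dispatches
-- to the PL/SQL training routine, which stores its results in the nested
-- table trainings of neuralnet_tab.
CREATE OR REPLACE TRIGGER train_session_trg
AFTER INSERT ON train_session
FOR EACH ROW
BEGIN
  run_training(:NEW.netid, :NEW.trainparam);  -- e.g. backpropagation
END;
/
```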
Fig. 4. Training results inserted by training trigger
Figure 5 shows the graphical mapping of the neural network after the performed training session.
Fig. 5. Graphical sketch of the trained network (input layer: neurons 1.1, 1.2; hidden layer: neurons 2.1–2.3; output layer: neuron 3.1; with the trained connection weights)
As already mentioned, the evaluation process works equivalently. New evaluation parameters are inserted into the other dummy table, eval_session. Again, a database trigger is fired and the result of the evaluation process is inserted into the nested table evaluations. The following SQL statement gives an example of how to create a new evaluation with the doEvaluation object. A name for the new evaluation, the name of the existing neural network, an input that should be evaluated and the proper net training algorithm type have to be set. The object type of the algorithm is necessary to determine the right evaluation algorithm for the neural network.
INSERT INTO eval_session VALUES (doEvaluation('eval1','xor1','0;1', backprop() ));
The newly created evaluation is stored in the nested table evaluations inside the neuralnet_tab table. The next SQL statement accesses this table and its entries.
SELECT e.* FROM neuralnet_tab n, TABLE(n.evaluations) e WHERE n.netid='xor1';
The results of this query are shown in Figure 6.
Fig. 6. Evaluation table entries
4
Implementation Issues
The NeuroOracle system was developed and tested with Oracle 10g. The NeuroOracle software package, together with comprehensive documentation, can be downloaded from [3]. A personal edition of the Oracle DBMS can be downloaded from the official Oracle website. Once the Oracle database is running, the NeuroOracle system can be installed in virtually any user schema by running the provided script createNeuroOracle.sql in SQL*Plus. We recommend executing the script createUser.sql first in order to create the default Oracle user for the NeuroOracle system. Executing the createNeuroOracle.sql script creates the necessary database schema as described above; it also inserts three neural network types into the nntype_tab table by default: feed-forward networks with typeid 1, recurrent networks with typeid 2, and self-organizing maps with typeid 3. For the usage of the neural network simulator no new network type is necessary; the three predefined neural network types are sufficient for most applications. New neural network types can, however, easily be created by inserting into the nntype_tab table. These entries do not impose any restrictions on the structure of the networks inserted into the neuralnet_tab table or on the training algorithm used; they simply provide the users with a better overview of the existing networks.
5
Visions
The NeuroOracle database system provides a fundamental basis, the application logic, on which front-end applications can be built quite easily. Because of the object-relational approach, application developers do not need to create a mapping between the data stored in the database tables and the application's data structures. Due to the different kinds of interfaces provided by Oracle, such as Java, PL/SQL, Pro*C/C++, OCI or OLE, there are many possibilities for building a user front-end for NeuroOracle. For example, a web-based application could be built using JSP, PL/SQL or a Java applet, or a standalone application could be built using the OCI interface. The NeuroOracle system can thus be extended by a user interface making the system much easier to use, and new network paradigms, training and evaluation algorithms can be added. Furthermore, the NeuroOracle system could be deployed on an Oracle Real Application Cluster to boost performance, and may in the future be enabled for grid computing within the Oracle 10g database.
Known Problems. At the time of writing, no referential constraints could be defined on nested table columns. Therefore the two dummy tables, train_session and eval_session, were needed for triggering newly defined training or evaluation objects. It was also not possible to associate the neurons of a net with the nested connections table, with the result that there could be connections between neurons that do not exist. As a consequence, applications built on NeuroOracle should ensure data integrity among the NeuroOracle objects.
6
Conclusions and Future Research
We presented in this paper an object-relational model for the embedding of neural networks into database systems. Based on this framework, the NeuroOracle system was developed: an extensible, comfortable and powerful neural network tool embedded into the object-relational Oracle database system. This approach provides the user with a homogeneous and natural environment for the administration and handling of neural networks.
References
1. Codd, E.: A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13 (1970) 377–387
2. Date, C.: An Introduction to Database Systems. Addison-Wesley (1986)
3. Glantschnig, P., Schikuta, E.: NeuroOracle Package. http://www.cs.univie.ac.at/template.php?tpl=shared/studProjES.tpl, November (2005)
4. Heileman, G. et al.: A General Framework for Concurrent Simulation of Neural Network Models. IEEE Trans. Software Engineering 18 (1992) 551–562
5. Melton, J., Simon, A.: Understanding the New SQL: A Complete Guide. Morgan Kaufmann Publishers (1993)
6. Pao, Y.H., Sobajic, D.: Neural Networks and Knowledge Engineering. IEEE Trans. Knowledge and Data Engineering 3 (1991) 185–192
7. Schikuta, E.: The Role of Neural Networks in Knowledge Based Systems. Int. Symp. on Nonlinear Theory and Its Applications, Hawaii, IEICE (1993)
8. Schikuta, E.: NeuDB'95: An SQL Based Neural Network Environment. In: Amari, S. et al. (eds.): Progress in Neural Information Processing, Proc. Int. Conf. on Neural Information Processing (ICONIP'96), Hong Kong, Springer-Verlag, Singapore (1996) 1033–1038
Discrimination of Coronary Microcirculatory Dysfunction Based on Generalized Relevance LVQ Qi Zhang1, Yuanyuan Wang1, Weiqi Wang1, Jianying Ma2, Juying Qian2, and Junbo Ge2 1
Department of Electronic Engineering, Fudan University, Shanghai 200433, P.R. China {051021084, yywang}@fudan.edu.cn 2 Department of Cardiology, Zhongshan Hospital of Fudan University, Shanghai 200032, P.R. China
[email protected]
Abstract. There are few effective methods for accurately discriminating the coronary microcirculatory dysfunction from the normal coronary microcirculation. Rather than traditional approaches that consider only a single hemodynamic parameter, a novel scheme based on generalized relevance learning vector quantization (GRLVQ) using multiple parameters (features) is proposed. Naturally integrating the tasks of feature selection and classification, this scheme applies GRLVQ cyclically to gradually prune the unimportant features according to their weighting factors. In each cycle, prototypes are generated for classification and the classification accuracy is obtained. Finally, the feature subset with the highest classification accuracy is selected and the corresponding classifier is also obtained. This approach not only simplifies the classifier but also enhances the classification performance. The method is verified on physiological data collected from animals and proved to be superior to the traditional single-parameter method.
1 Introduction
Coronary microcirculation is suspected of being involved in a large number of cardiovascular diseases [1]. Therefore the task of effectively classifying the coronary microcirculatory function, namely discriminating the coronary microcirculatory dysfunction from the normal coronary microcirculation, has great significance in medical diagnosis. The traditional research orientation has been to find a single clinical hemodynamic parameter for the classification, such as the coronary flow reserve (CFR) and the coronary resistance reserve (CRR) [2-4]. However, the discrimination performance of any single parameter is not satisfactory [2-4]. On the one hand, because a single hemodynamic parameter can hardly achieve exact classification, a new classifier scheme using multiple parameters is taken into account. On the other hand, integrating multiple sources of information, the physiological data often lead to overfull parameters, which means a high dimensionality of features and increases the complexity of the classifier; in addition, the classification performance with so many features is not always excellent. Thus a
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1125–1132, 2007. © Springer-Verlag Berlin Heidelberg 2007
method automatically determining the intrinsic features and pruning the unimportant ones is needed. It can be seen from the above that an appropriate multiple-parameter method, with both the function of intrinsic feature determination and the ability of powerful classification, is required for the classification of the coronary microcirculatory function. As commonly understood, this can be divided into two successive tasks. The first task is feature selection, which selects a subset of the original features by an evaluation criterion, such as distance measures or information measures [5]; the following task is classifier design, namely designing a classifier with minimal error rate. Usually these two tasks are not integrated: the feature selection does not directly yield a classification result, and the classifier cannot reduce its input dimensions itself. This increases the complexity of the whole classification system. A neural network called generalized relevance learning vector quantization (GRLVQ) caught our attention for its potential to naturally integrate feature selection and classification. As a modification of learning vector quantization (LVQ), GRLVQ retains the merit of LVQ, namely a simple and accurate classifier based on prototypes. Furthermore, GRLVQ introduces weighting factors on the data dimensions which are adapted automatically, so that the classification error rate becomes small and the intrinsic data dimensions can be explored [6]. In this paper, a novel and efficient scheme based on GRLVQ is proposed to effectively classify the microcirculatory function. GRLVQ is first utilized to rank all of the hemodynamic parameters (features) extracted from the physiological data and to prune the obviously unimportant ones.
Then, for a subtle feature reduction, GRLVQ is applied cyclically to rank the residual features and prune the most irrelevant one, until the number of features is reduced to zero. In each cycle the prototypes are also generated; they are used for classification by means of the nearest-neighbor rule, and the corresponding classification accuracy is obtained. Finally, the feature subset with the highest classification accuracy is determined. The feature selection and classification are thus both achieved. This method is compared to the traditional single-parameter method via animal experiments and proved to be more effective.
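The cyclic pruning scheme just outlined can be sketched generically as backward elimination driven by relevance weights (illustrative code, not the authors' implementation; train_and_rank stands in for a full GRLVQ training run returning an accuracy and per-feature weighting factors):

```python
from typing import Callable, List, Tuple

def prune_features(features: List[str],
                   train_and_rank: Callable[[List[str]], Tuple[float, List[float]]]
                   ) -> Tuple[List[str], float]:
    """Backward elimination driven by GRLVQ-style relevance weights:
    train on the current subset, drop the least relevant feature,
    and remember the subset with the best accuracy seen so far."""
    best_subset: List[str] = list(features)
    best_acc = float("-inf")
    subset = list(features)
    while subset:
        acc, weights = train_and_rank(subset)
        if acc > best_acc:
            best_acc, best_subset = acc, list(subset)
        subset.pop(weights.index(min(weights)))  # prune the weakest feature
    return best_subset, best_acc
```

The loop runs one training per cycle, so a full search over n features costs n training runs rather than 2^n subset evaluations.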
2 Data Acquisition and Feature Extraction
The coronary microcirculatory dysfunction can be artificially caused by injecting different amounts of microspheres into animals to produce different degrees of intracoronary microembolization [7], so disease cases can be obtained from animal experiments more conveniently than from humans. Here we acquired 136 cases from 25 pigs. All the physiological data were collected at the Department of Cardiology, Zhongshan Hospital, Shanghai, via catheterization techniques including intravascular ultrasound (IVUS) imaging, the intracoronary Doppler technique and intracoronary pressure measurement, respectively yielding IVUS video images, Doppler signals and pressure signals [7]. Subsequently, plentiful hemodynamic parameters were extracted from the original physiological data using several methods of medical signal processing. Here 21 features were extracted, such as the aforementioned parameters CFR and CRR, the
frequency-domain parameters of the coronary artery pressure (Pf(k), k = 0, 1, 2, 3), the coronary blood flow volume (Qf(k)), the coronary artery impedance (Rf(k)), etc. It should be noted that, in our former research, CRR was extended from a single parameter to a set of parameters CRRf(k) (where CRR = CRRf(0)), which carry more hemodynamic information about the coronary microcirculation [7].
3 Feature Selection and Classification
In this section, we first introduce the GRLVQ algorithm, then explain the functions of the weighting factors and prototypes, and finally describe our novel scheme for feature selection and classification.
3.1 The GRLVQ Algorithm
Let D = {(x^i, y^i) ∈ R^n × {1,…, C} | i = 1,…, m} be a training data set with n-dimensional elements x^i = (x^i_1,…, x^i_n) to be classified and C classes. A set W = {w^1,…, w^M} of prototypes in data space with class labels c_k (k = 1,…, M) is used for data representation, where w^k = (w^k_1,…, w^k_n) ∈ R^n, and c_k ∈ {1,…, C}. The general algorithm of GRLVQ consists in minimizing the classification cost function [6]
E_GRLVQ = \sum_{i=1}^{m} f(\mu_\lambda(x^i)) .   (1)

Choose f as the sigmoid function f(x) = sgd(x) = 1/(1 + exp(−x)) ∈ (0, 1), and μ_λ(x^i) = (d_λ^+ − d_λ^−)/(d_λ^+ + d_λ^−). Here d_λ^+ is the squared distance on a certain metric between the data point x^i and the nearest correctly classified prototype, say w^+; similarly, d_λ^− is the distance between x^i and the nearest wrongly classified prototype, say w^−. Rather than the Euclidean metric, which considers the input dimensions as equally scaled and equally important, the metric of distance in GRLVQ is a non-uniform metric

d_λ = ‖x^i − w‖_λ^2 = \sum_{j=1}^{n} λ_j (x^i_j − w_j)^2 .   (2)
Eq. (2) introduces input weighting factors λ = (λ_1, …, λ_n), with λ_j ≥ 0, j = 1, …, n, and ∑_j λ_j = 1, in order to allow a different scaling of the input dimensions. Via stochastic gradient descent, the partial derivatives of E_GRLVQ yield the update formulas for w⁺, w⁻ and λ [8]:

Δw⁺ = η⁺ · [ sgd′(μ_λ(x^i)) d_λ⁻ / (d_λ⁺ + d_λ⁻)² ] · Λ(x^i − w⁺) ,   (3)

Δw⁻ = −η⁻ · [ sgd′(μ_λ(x^i)) d_λ⁺ / (d_λ⁺ + d_λ⁻)² ] · Λ(x^i − w⁻) ,   (4)

Δλ_j = −η · sgd′(μ_λ(x^i)) · [ d_λ⁻ / (d_λ⁺ + d_λ⁻)² · (x^i_j − w⁺_j)² − d_λ⁺ / (d_λ⁺ + d_λ⁻)² · (x^i_j − w⁻_j)² ] .   (5)
Q. Zhang et al.
Here η⁺, η⁻, η ∈ (0, 1) are learning rates, sgd′ is the derivative of the sigmoid function, and Λ is the diagonal matrix with entries λ_1, …, λ_n. After each update of λ, the normalization ∑_j λ_j = 1 is re-imposed so as to avoid numerical instabilities.

3.2 Functions of Weighting Factors and Prototypes
In GRLVQ, the weighting factor λ constitutes an adaptive metric for the input dimensions, determined automatically via the stochastic gradient descent of Eq. (5). When the algorithm converges, the final λ ranks the input dimensions: the larger λ_j is, the more important the corresponding feature and the greater its contribution to the classification. According to this ranking, unimportant features can be pruned. As a prototype-based algorithm, GRLVQ adapts the prototypes w^k to represent their classes as accurately as possible by minimizing the cost function E_GRLVQ [9]. The prototypes are then used to design the classifier. In this paper, an element x is assigned to the class c_x by the nearest-neighbor principle:

c_x = c_K ,  where  K = arg min_k ‖x − w^k‖²_λ .   (6)
That is, x is assigned to the class of its nearest prototype. Here the distance metric is again the scaled adaptive one, replacing the traditional Euclidean metric.

3.3 GRLVQ-Based Feature Selection and Classification
Making full use of the weighting factors and prototypes, a novel scheme that naturally integrates feature selection and classification is described as follows. In this paper, the feature selection follows a top-down strategy: unimportant features are gradually removed from the whole feature set, and an optimal feature subset is sought in this course. Two problems must be solved: first, what is the principle of feature reduction; second, what is the evaluation criterion for the optimal feature subset? For the first problem, we propose two principles that utilize the weighting factors, depending on the current feature dimensionality n. When n is high, some features may carry much interfering noise. Since the uniform scaling is λ_j = 1/n, we define the feature reduction threshold as
λ_thresh = 1/(γ · n) .   (7)
If λ_j < λ_thresh, the corresponding feature can be pruned. Here γ > 1 is a threshold controlling factor; the smaller γ is, the more dimensions are reduced. By this threshold method, several obviously unimportant or interfering features can be pruned at once, which enhances the efficiency of feature reduction. When n is low, for a subtler feature reduction, we prune just one feature at a time. Let

J = arg min_j (λ_j) ,   (8)

then the Jth feature is pruned.
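The two pruning rules of Eqs. (7) and (8) can be sketched as below. This is a minimal illustration, not the authors' code; the function names (`coarse_prune`, `fine_prune`) and the NumPy representation of λ are our assumptions.

```python
import numpy as np

def coarse_prune(lam, gamma=5.0):
    """Coarse reduction (Eq. 7): keep only features whose weighting
    factor reaches the threshold 1/(gamma * n)."""
    lam = np.asarray(lam, dtype=float)
    thresh = 1.0 / (gamma * lam.size)
    return np.flatnonzero(lam >= thresh)      # indices of retained features

def fine_prune(lam):
    """Subtle reduction (Eq. 8): drop only the single least
    important feature, index J = argmin_j lambda_j."""
    lam = np.asarray(lam, dtype=float)
    j = int(np.argmin(lam))
    keep = [i for i in range(lam.size) if i != j]
    return keep, j
```

With γ = 5 this reproduces the λ_thresh = 1/(5n) setting used later in Section 4.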
For the second problem, we directly use the classification accuracy R_A (see Eq. (9)) as the evaluation criterion for the optimal feature subset, unlike conventional criteria such as distance measures or information measures [5]:

R_A = m_t⁺ / m_t ,   (9)

where m_t is the number of elements in the test set and m_t⁺ is the number of correctly classified elements in the test set. With these problems solved, feature selection and classification can be naturally integrated in the following procedure:

1. With all n features, use Eqs. (3), (4) to get w^k and Eq. (5) to get λ. Use Eq. (6) to design the classifier, and Eq. (9) to obtain the current classification accuracy R_A(n).
2. Use Eq. (7) to prune several obviously unimportant or interfering features, reducing the feature dimensionality from n to n_r.
3. With the reduced n_r features, use Eqs. (3), (4) to get w^k and Eq. (5) to get λ. Use Eq. (6) to design the classifier, and Eq. (9) to obtain the current classification accuracy R_A(n_r).
4. Prune the least important feature according to Eq. (8). Let n_r = n_r − 1.
5. If n_r = 0, proceed to step 6; otherwise return to step 3.
6. Find the maximum of R_A; the corresponding n_r features are the finally selected ones, most useful and essential for the classification, and the corresponding classifier is the optimal classifier of this course.

The above procedure comprises three stages: the first stage (steps 1-2) performs coarse feature reduction; the second stage (steps 3-5) performs subtle feature reduction; and the third stage (step 6) performs the final feature selection and classification.
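The core GRLVQ pieces used by this procedure can be sketched compactly. The following is a hedged illustration of one stochastic GRLVQ update (Eqs. (3)-(5)) and the nearest-prototype rule (Eq. (6)); all function names are ours, and details such as learning-rate schedules and prototype initialization are simplified relative to the paper.

```python
import numpy as np

def sgd(x):
    """Sigmoid f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def weighted_sq_dist(x, w, lam):
    """Scaled squared distance of Eq. (2)."""
    return float(np.sum(lam * (x - w) ** 2))

def grlvq_step(x, y, W, labels, lam, eta_w=0.1, eta_lam=0.01):
    """One stochastic GRLVQ update for sample (x, y), per Eqs. (3)-(5)."""
    d = np.array([weighted_sq_dist(x, w, lam) for w in W])
    correct = labels == y
    kp = np.flatnonzero(correct)[np.argmin(d[correct])]    # nearest correct prototype w+
    km = np.flatnonzero(~correct)[np.argmin(d[~correct])]  # nearest wrong prototype w-
    dp, dm = d[kp], d[km]
    mu = (dp - dm) / (dp + dm)
    g = sgd(mu) * (1.0 - sgd(mu))                          # sgd'(mu)
    denom = (dp + dm) ** 2
    dlam = -eta_lam * g * (dm / denom * (x - W[kp]) ** 2
                           - dp / denom * (x - W[km]) ** 2)  # Eq. (5)
    W[kp] += eta_w * g * dm / denom * lam * (x - W[kp])      # Eq. (3), Lambda = diag(lam)
    W[km] -= eta_w * g * dp / denom * lam * (x - W[km])      # Eq. (4)
    lam = np.clip(lam + dlam, 0.0, None)
    return W, lam / lam.sum()                                # re-normalize: sum(lam) = 1

def classify(x, W, labels, lam):
    """Nearest-prototype rule of Eq. (6)."""
    d = [weighted_sq_dist(x, w, lam) for w in W]
    return labels[int(np.argmin(d))]
```

Iterating `grlvq_step` over the training set, then pruning via the thresholds above and retraining, reproduces the three-stage procedure in outline.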
4 Experiments and Results

The presented GRLVQ-based scheme is verified on hemodynamic data collected from animals, in comparison with the traditional single-parameter method. Here the number of classes is C = 2 and the total number of cases is m = 136, comprising m_1 = 45 cases with normal coronary microcirculation and m_2 = 91 cases with coronary microcirculatory dysfunction. 21 hemodynamic parameters were extracted, including CFR, CRRf(k), Pf(k), Qf(k) and Rf(k) (k = 0, 1, 2, 3), so the feature dimensionality is n = 21. For each experiment, the data are randomly divided in half, one part as the training set (m_1r = 23, m_2r = 46) and the other as the test set (m_1t = 22, m_2t = 45). We first investigate the classification performance of each single parameter with a simple threshold method, choosing the threshold p_thresh = (m_1r μ_1 + m_2r μ_2)/(m_1r + m_2r), where μ_i, i = 1, 2, is the mean of the parameter over the training set for class i [10]. After training and testing each parameter 50 times, the parameter CRRf(0) is found to have the best classification performance, but its accuracy is actually not very high, and its variance is large. The 6 parameters with the highest accuracy on the test set are listed in Table 1.
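The baseline single-parameter threshold rule of [10] can be illustrated as below. The helper names and the class-labeling convention are our assumptions; the direction of the threshold test depends on the data.

```python
import numpy as np

def train_threshold(x1, x2):
    """Class-size-weighted mean threshold:
    p = (m1*mu1 + m2*mu2) / (m1 + m2)."""
    m1, m2 = len(x1), len(x2)
    return (m1 * np.mean(x1) + m2 * np.mean(x2)) / (m1 + m2)

def predict(x, p, class1_below=True):
    """Assign class 1 to values on one side of the threshold
    (which side is class 1 is chosen from the training means)."""
    x = np.asarray(x, dtype=float)
    return np.where((x < p) == class1_below, 1, 2)
```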
We then investigate our GRLVQ-based scheme. Set the numbers of prototypes to M_1 = 2 and M_2 = 5, and choose constant learning rates η⁺ = η⁻ = 0.1, η = 0.01. At every stage of the feature reduction, the prototypes w^k and the weighting factors λ are computed 50 times to obtain more reliable results. The average weighting factors of the 21 parameters are first obtained for coarse feature reduction:

λ = (0.3778, 0.3105, 0.0626, 0.0576, 0.0454, 0.0271, …),

where the negligible factors have been omitted. With λ_thresh = 1/(5n) = 0.0095, the 10 least important parameters are pruned; the other 11 parameters are retained, including CRRf(0), CFR, CRRf(3), CRRf(2), CRRf(1), Qf(1) (listed according to their rankings from high

Table 1. The mean and standard deviation of classification accuracy on the test set with the single-parameter method
Parameter   Accuracy
CRRf(0)     0.8499 ± 0.0384
CFR         0.8215 ± 0.0368
CRRf(3)     0.7033 ± 0.0390
CRRf(2)     0.6961 ± 0.0339
Qf(0)       0.6791 ± 0.0644
Qf(3)       0.6693 ± 0.0479
Table 2. The mean and standard deviation of classification accuracy on the test set with the GRLVQ-based scheme. Only the accuracies at feature dimensionalities 21, 11, 8, 5, 3 and 1 are listed.
Feature Dimensionality   Accuracy
21                       0.8675 ± 0.0411
11                       0.8815 ± 0.0489
8                        0.8764 ± 0.0296
5                        0.8833 ± 0.0320
3                        0.8985 ± 0.0205
1                        0.8648 ± 0.0418
Fig. 1. The classification accuracy on the test set with the GRLVQ-based scheme. The accuracy varies with the feature dimensionality.
to low). It is observed that the importance of a single parameter is not equivalent to the importance of the same parameter working in cooperation with the others. Subsequently, the subtle feature reduction prunes the remaining 11 features one by one, recording the current classification accuracy R_A at each step. Finally, the maximum of R_A is found to be 0.8985 (see Table 2 and Fig. 1), and the corresponding 3 features are determined as the ultimately selected ones, namely CFR, CRRf(0) and CRRf(3), in that order. The final average weighting factors for these 3 parameters are

λ = (0.3589, 0.3373, 0.3038) .

This indicates that CRRf(3) strongly increases its contribution to the classification. As Tables 1 and 2 show, the presented GRLVQ-based scheme increases the classification accuracy on the test set by 4.86% compared with the single-parameter method. It also decreases the variance of the accuracy, i.e., it enhances the stability of the classification.
5 Conclusions

In this paper, a novel scheme based on GRLVQ is proposed for discriminating coronary microcirculatory dysfunction from normal microcirculation. Unlike the traditional single-parameter method, this GRLVQ-based scheme uses multiple parameters to enhance the classification performance. Taking full advantage of the weighting factors and prototypes in GRLVQ, the scheme naturally integrates feature selection and classification, which simplifies the classifier and also improves the classification accuracy. On physiological data from animals, the presented scheme is verified to be more effective than the traditional method. Since human disease cases are not yet sufficient, we did not test our scheme on human hemodynamic data; we expect to accumulate human cases and investigate the performance of the scheme for clinical application in the future.
Acknowledgement

This work was supported by the National Basic Research Program of China (No. 2006CB705700), the Natural Science Foundation of China (No. 30570488) and the Shanghai Science and Technology Plan (No. 054119612).
References

1. L'Abbate, A., Sambuceti, G., Haunso, S., Schneider-Eicke, J.: Methods for Evaluating Coronary Microvasculature in Humans. Eur. Heart J. 20 (1999) 1300-1313
2. Kern, M.J., Lerman, A., Bech, J., Bruyne, B.D., et al.: Physiological Assessment of Coronary Artery Disease in the Cardiac Catheterization Laboratory. Circulation 114 (2006) 1321-1341
3. McGinn, A.L., White, C.W., Wilson, R.F.: Interstudy Variability of Coronary Flow Reserve. Influence of Heart Rate, Arterial Pressure, and Ventricular Preload. Circulation 81 (1990) 1319-1330
4. Vassalli, G., Hess, O.M.: Measurement of Coronary Flow Reserve and Its Role in Patient Care. Basic Research in Cardiology 93 (1998) 339-353
5. Liu, H., Yu, L.: Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Trans. Knowledge and Data Eng. 17 (2005) 491-502
6. Hammer, B., Villmann, T.: Generalized Relevance Learning Vector Quantization. Neural Networks 15 (2002) 1059-1068
7. Luo, Z., Wang, Y., Wang, W., et al.: Coronary Artery Impedance Estimation Based on the Intravascular Ultrasound Technique and Its Experimental Studies. Acta Acustica 30 (2005) 15-20
8. Villmann, T., Schleif, F., Hammer, B.: Supervised Neural Gas and Relevance Learning in Learning Vector Quantization. Proc. of the Workshop on Self-Organizing Networks (WSOM) (2003) 47-52
9. Strickert, M., Seiffert, U., Sreenivasulu, N.: Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Analysis. Neurocomputing 69 (2006) 651-659
10. Bian, Z., Zhang, X., et al.: Pattern Recognition. 2nd edn. Press of Tsinghua University, Beijing (2000) 87-90
Multiple Signal Classification Based on Genetic Algorithm for MEG Sources Localization*

Chenwei Jiang1, Jieming Ma1, Bin Wang1,2, and Liming Zhang1,2

1 Department of Electronics Engineering, Fudan University, Shanghai 200433, China
{0272015,042021026,wangbin,lmzhang}@fudan.edu.cn
2 The Research Center for Brain Science, Fudan University, Shanghai 200433, China
Abstract. How to locate neural activation sources effectively and precisely from magnetoencephalographic (MEG) recordings is a critical issue for clinical neurology and brain function research. The multiple signal classification (MUSIC) algorithm and the recursive MUSIC algorithm are widely used to locate multiple dipolar sources from MEG data. The drawback of these algorithms is that they run very slowly when scanning a three-dimensional head volume globally. To solve this problem, a novel MEG source localization scheme based on a genetic algorithm (GA) is proposed. First, this scheme uses the global-optimization property of the GA to estimate the rough source location. Then, combined with grids over a small area, accurate dipolar source localization is performed. Furthermore, we introduce adaptive crossover and mutation probabilities, a two-point crossover operator, and periodical substitution and niche strategies to overcome the GA's occasional tendency to fall into local optima. Experimental results show that the proposed scheme greatly improves the speed of source localization with satisfactory accuracy.
1 Introduction

Magnetoencephalography is a noninvasive brain-measuring technique capable of estimating neural current locations with millisecond-level temporal definition. In contrast with other brain imaging techniques, e.g., MRI, CT, SPECT and PET, the temporal resolution of MEG is far superior. How to use MEG signals to locate the neural current sources is therefore an essential issue for understanding both the spatial and the temporal behavior of the brain. Multiple signal classification (MUSIC) [1] and recursive multiple signal classification (R-MUSIC) [2] are two widely used methods for MEG source localization. Both methods search for the MEG sources by scanning every grid point, which is quite time-consuming. For example, if the head is modeled as a sphere centered at the origin of the Cartesian coordinate system with a radius of 9 cm, the computation must be repeated 729,000 times to locate one dipole with one-millimeter precision in one quadrant. To overcome this problem, a scheme*
This research was supported by the grant from the National Natural Science Foundation of China (No. 30370392 and No.60672116).
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1133–1139, 2007. © Springer-Verlag Berlin Heidelberg 2007
based on a genetic algorithm (GA) is proposed here to locate current dipoles quickly and precisely. In our scheme, we present a two-step grid-scanning procedure. First, we divide the whole three-dimensional space into large grids for a coarse scan and, by means of the GA, pick out the grid where the objective function attains its optimum as the next step's scanning range. Second, we scan the selected area with small grids to locate the MEG source positions more precisely. Furthermore, we introduce adaptive crossover and mutation probabilities, a two-point crossover operator, and periodical gene substitution and niche strategies into the GA to overcome its occasional tendency to fall into local optima. We should point out that applying a two-step grid-scanning procedure to overcome the time-consumption problem constitutes the first original contribution of our research, and applying the GA to the large grids of the coarse scan to avoid occasional local optima constitutes the second. Experimental results show that the proposed scheme greatly improves the speed of source localization with satisfactory accuracy. The remainder of this paper is organized as follows. We briefly review the inverse localization methods in Section 2 and describe the proposed scheme in Section 3. In Section 4, we present simulated results showing the performance of the proposed scheme. The conclusion is given in Section 5.
2 Inverse Localization Methods

In this section, a least-squares method is applied to the MEG source localization problem. First, we present the least-squares cost

F = ‖B − B′‖ ,   (1)

where B is the magnetic field detected by Superconducting Quantum Interference Device (SQUID) sensors outside the head, and B′ is the model magnetic field derived from the parameter P{x, y, z} through the equation proposed in [1]

B = [ G(p_1) ⋯ G(p_K) ] Q^T ,   (2)

where Q^T = [Q_1^T, …, Q_i^T, …, Q_K^T]^T and Q_i^T is the moment of the i-th dipole. Using the SVD to decompose Q_i^T, we obtain Q_i^T = u_i σ_i v_i^T, so that B can be expressed as

B_i = G(p_i) Q_i^T = G(p_i) u_i σ_i v_i^T = a(p_i, θ_i) s_i^T ,   (3)

where p_i represents the location of the dipole, θ_i its orientation, and s_i^T its current strength. Decomposing B_i as described in [2] yields subspace correlations between the subspaces spanned by A and Φ_S. We designate a function computing the subspace correlations as

{ c_1, c_2, …, c_k } = subcorr{ A, Φ_S } ,   (4)
which returns the ordered set { c_1, c_2, …, c_k } of subspace correlations. Now we simplify Eq. (1) directly into finding a set of parameters P{x, y, z} that maximizes the sum of { c_1, c_2, …, c_k }. In [2], the enumerative method is adopted: to obtain high computational precision, a sufficiently dense grid must be designed over the volume of the head, but calculating at each grid point is quite time-consuming.
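As in the R-MUSIC literature, the subspace correlations of Eq. (4) can be computed as the singular values of the product of orthonormal bases of the two subspaces. A minimal sketch follows; the function name `subcorr` and the use of QR for orthonormalization are our choices, not necessarily the authors' implementation.

```python
import numpy as np

def subcorr(A, Phi_S):
    """Ordered subspace correlations {c1 >= c2 >= ...} between
    span(A) and span(Phi_S): the singular values of Ua^T Us,
    where Ua and Us are orthonormal bases of the two subspaces."""
    Ua, _ = np.linalg.qr(A)        # orthonormal basis of span(A)
    Us, _ = np.linalg.qr(Phi_S)    # orthonormal basis of span(Phi_S)
    return np.linalg.svd(Ua.T @ Us, compute_uv=False)
```

A correlation of 1 means the column spaces intersect; 0 means they are orthogonal.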
3 The Proposed Scheme

3.1 Method Based on GA

The least-squares method mentioned above requires nonlinear multidimensional searches to find the unknown parameter P{x, y, z}. The grid-scanning method achieves high definition but is quite time-consuming, which leads to our proposed "two-step" grid-scanning approach: a GA-based MUSIC method for locating the MEG sources. Before performing the refined grid scan, we first use the GA, taking advantage of its excellent global searching ability and high convergence speed, in a coarse grid scan to find the area where the least-squares cost attains its optimum; this area becomes the range of the next scanning step. We then use refined grid scanning to locate the MEG sources more precisely within it. The second, refined step assures that the final results have the same definition as the ordinary grid-scanning method, while the first, GA-based step guarantees the global optimum. As a result, the whole procedure greatly improves the speed of source localization with satisfactory accuracy. Here, the GA is used to find the parameter P{x, y, z} that maximizes the sum of { c_1, c_2, …, c_k }. The procedure of performing the GA is as follows:

Step 1. Generate the initial population of N chromosomes.
Step 2. Calculate the objective function and obtain each chromosome's fitness P_i.
Step 3. Find the best chromosome individual N_max.
Step 4. Judge whether the best answer's fitness satisfies P_i > P_req. If so, go to Step 7.
Step 5. Apply the GA operators: roulette-wheel selection, one-point crossover and mutation.
Step 6. If this is not the last generation, go to Step 2.
Step 7. End the GA operation and begin the grid scan.

From the simulation results, we found that premature convergence occasionally occurred and that the GA-based MUSIC approach sometimes converged to a local minimum. To solve this problem, we improve the approach with adaptive crossover and mutation probabilities, a two-point crossover operator, and periodical substitution and niche strategies.
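Steps 1-7 amount to a standard binary GA loop. The sketch below is generic, not the authors' implementation: the fitness function is passed in (a toy objective in the usage example, standing in for the MUSIC cost), and operator details are simplified.

```python
import random

def run_ga(fitness, n_bits=16, pop_size=30, generations=60,
           p_req=None, pc=0.4, pm=0.05, seed=0):
    """Generic binary GA following Steps 1-7: roulette-wheel
    selection, one-point crossover, bit-flip mutation; stops early
    if the best fitness exceeds p_req. Fitness must be positive."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)                       # Step 3
    for _ in range(generations):
        fits = [fitness(c) for c in pop]               # Step 2
        best = max([best] + pop, key=fitness)
        if p_req is not None and fitness(best) > p_req:
            break                                      # Step 4: good enough
        total = sum(fits)
        def roulette():
            r, acc = rng.uniform(0, total), 0.0
            for c, f in zip(pop, fits):
                acc += f
                if acc >= r:
                    return c
            return pop[-1]
        nxt = []
        while len(nxt) < pop_size:                     # Step 5: GA operators
            a, b = roulette()[:], roulette()[:]
            if rng.random() < pc:                      # one-point crossover
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for c in (a, b):
                for i in range(n_bits):
                    if rng.random() < pm:
                        c[i] ^= 1                      # mutation
            nxt += [a, b]
        pop = nxt[:pop_size]                           # Step 6: next generation
    return best                                        # Step 7
```

In the actual scheme, each chromosome would decode to a candidate P{x, y, z} and the fitness would be the sum of subspace correlations.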
3.2 Improved GA Method

3.2.1 Adaptive Crossover and Mutation Probability
We propose a constant k, tuned by experiment, to represent the maturity of the GA; by this means the crossover and mutation probabilities become adaptive. Here f_max is the maximum fitness of the population and f_avg its average fitness. When (f_max − f_avg)/f_avg > k, we consider the GA to be in the former stage of the evolution, where the fitness of the population has high diversity, so a large crossover probability and a small mutation probability can be used to accelerate convergence. When (f_max − f_avg)/f_avg ≤ k, the GA is in the later stage of the evolution and the fitness of the population has low diversity, so a small crossover probability and a large mutation probability widen the searching range and help avoid falling into a local optimum. In the former stage of the evolution, p_c and p_m are given by

P_c = P_c0 · exp( ((f_max − f_avg)/f_avg − k) / ((f_max − f_avg)/f_avg) ) ,   (5)

P_m = P_m0 · exp( −((f_max − f_avg)/f_avg − k) / ((f_max − f_avg)/f_avg) ) .   (6)

Here P_c0 and P_m0 are the initial values of p_c and p_m. In the later stage of the evolution, p_c and p_m are given by

P_c = P_c0 · exp( ((f_max − f_avg)/f_avg − k) / k ) ,   (7)

P_m = P_m0 · exp( −((f_max − f_avg)/f_avg − k) / k ) .   (8)
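Since the two stages of Eqs. (5)-(8) differ only in the denominator of the exponent, they can be folded into one helper. A minimal sketch (the function name is ours):

```python
import math

def adaptive_rates(f_max, f_avg, k, pc0=0.4, pm0=0.05):
    """Adaptive crossover/mutation probabilities of Eqs. (5)-(8).
    d = (f_max - f_avg)/f_avg measures diversity; the former stage
    applies when d > k, the later stage when d <= k."""
    d = (f_max - f_avg) / f_avg
    scale = d if d > k else k          # Eqs. (5)-(6) vs. Eqs. (7)-(8)
    pc = pc0 * math.exp((d - k) / scale)
    pm = pm0 * math.exp(-(d - k) / scale)
    return pc, pm
```

At the stage boundary d = k the rates equal their initial values; above it p_c grows and p_m shrinks, and below it the opposite holds.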
3.2.2 Two-Point Crossover Operator
When the GA falls into premature convergence, the one-point crossover operator becomes invalid: a gene remains the same as before after the crossover operation. In the two-point crossover operator, we pick out two points in the chromosome randomly; the genes between the two points are interchanged while the rest are not, which can help the GA escape from a local optimum when premature convergence happens.

3.2.3 Periodical Substitution Strategy
To avoid converging to a local minimum, we also present a periodical substitution strategy. Every ten generations, we insert 10% new random chromosomes to substitute the old ones whose fitness is smallest. This approach increases the population diversity.
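The two operators of Sections 3.2.2-3.2.3 can be sketched as follows. The bit-list chromosome representation and the function names are our assumptions.

```python
import random

def two_point_crossover(a, b, rng=random):
    """Section 3.2.2: swap the genes between two randomly chosen
    cut points; the genes outside the cut are left untouched."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def periodical_substitution(pop, fitness, fraction=0.1, rng=random):
    """Section 3.2.3: replace the worst `fraction` of the population
    (10% every ten generations in the paper) with fresh random
    chromosomes to increase diversity."""
    n_new = max(1, int(len(pop) * fraction))
    pop = sorted(pop, key=fitness, reverse=True)       # best first
    n_bits = len(pop[0])
    pop[-n_new:] = [[rng.randint(0, 1) for _ in range(n_bits)]
                    for _ in range(n_new)]
    return pop
```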
3.2.4 Niche Strategy
The main concept of the niche strategy is that there can be only one chromosome within a range L. The niche strategy therefore maintains the population diversity, keeping every chromosome at a distance from the others so that the fitness is distributed over the whole three-dimensional head volume. Taking advantage of the niche strategy, we can avoid local optima effectively and enhance the searching ability of the whole algorithm. The operating procedure of the niche strategy is as follows:

Step 1. The total of M chromosomes in the population are linearly ordered by their fitness magnitudes, and the first N are memorized.
Step 2. Apply the GA operators to generate the M chromosomes of the next generation.
Step 3. Arrange the total of M + N chromosomes (the M new chromosomes plus the N memorized ones) in sequence according to their fitness magnitudes. The distance between every two chromosomes is computed as

‖X_i − X_j‖ = [ ∑_{k=1}^{M} (x_ik − x_jk)² ]^{1/2} ,  i = 1, 2, …, M+N−1;  j = i+1, …, M+N .   (9)

When ‖X_i − X_j‖ < L, we compare the fitness of these two chromosomes, and the smaller one is reset as

F_min(x_i, x_j) = F_min(x_i, x_j) × 10⁻³ .   (10)
Step 4. Rearrange the new M chromosomes and the memorized N chromosomes in sequence by fitness magnitude, and memorize the first M of them.
Step 5. Go to the next generation of the GA.
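The niche penalty of Eqs. (9)-(10) can be sketched as a pairwise pass over the population. The function name and the use of real-valued gene vectors are assumptions for illustration.

```python
import math

def niche_penalty(pop, fitness, L):
    """Niche strategy (Sec. 3.2.4): whenever two chromosomes are
    closer than L (Eq. 9), scale the worse one's fitness by 1e-3
    (Eq. 10), leaving at most one strong chromosome per niche."""
    fits = [fitness(c) for c in pop]
    for i in range(len(pop) - 1):
        for j in range(i + 1, len(pop)):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(pop[i], pop[j])))
            if dist < L:
                k = i if fits[i] < fits[j] else j   # penalize the worse one
                fits[k] *= 1e-3
    return fits
```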
4 Experimental Results

Simulation experiments are used to evaluate the performance of the proposed method. The program is implemented in Matlab 7 on a PC with a Pentium(R) 4 1.7 GHz CPU. A standard arrangement of 37 radial SQUID sensors is used: one sensor at θ = 0, a ring of six sensors at θ = π/8, φ = kπ/3, k = 0, …, 5, a ring of twelve sensors at θ = π/4, φ = kπ/6, k = 0, …, 11, and a ring of eighteen sensors at θ = 3π/8, φ = kπ/9, k = 0, …, 17. They are distributed on the upper region of a 9 cm single-shell sphere as shown in Fig. 1. In the simulation, a complete MEG model comprising the model of primary current sources, the dipole-in-a-sphere head model, magnetoconductivity, etc. [3]-[6] is used. The dipoles are assumed to have fixed locations and orientations, whereas the current strengths change in time according to a parametric model. Based on this model, we obtain a 37 × 500 simulated spatio-temporal MEG data set. Both the R-MUSIC method and the proposed method are used in the experiment to locate the MEG sources. A 24-bit binary chromosome represents a dipole position; every 8 bits represent one coordinate of one quadrant.
Fig. 1. The location distribution of the 37 SQUID sensors

Table 1. Localization results by the R-MUSIC method (3 dipoles)
Data    True Locations (cm)            Estimated Locations (cm)
        X       Y       Z              X       Y       Z
Data 1  0.0038  0.0224  1.2285        0.1000  0.1000  1.3000
        0.4110  0.6985  1.5985        0.4000  0.7000  1.6000
        0.8934  2.0409  0.4404        0.9000  2.0000  0.5000
T (s): 11142.125
Table 2. Localization results by the proposed method (3 dipoles)
Data    True Locations (cm)            Estimated Locations (cm)
        X       Y       Z              X       Y       Z
Data 1  0.0038  0.0224  1.2285        0.0000  0.0000  1.2000
        0.4110  0.6985  1.5985        0.4000  0.7000  1.6000
        0.8934  2.0409  0.4404        0.9000  2.0000  0.4000
T (s): 475.265
Table 3. Comparison between R-MUSIC method and the proposed method (3 dipoles)
Method               Generation Number   Population Number   Average Accuracy (cm)   Average Time (s)
R-MUSIC              /                   /                   0.0384                  11115
The proposed method  200                 100                 0.0309                  475
We pick 10 groups of the MEG data of 3 dipoles randomly for the simulation experiment; one of them is listed in Tables 1 and 2. The chromosome population is set to 100 and the generation number to 200. The initial crossover probability p_c is set
as 0.4 and the mutation probability p_m to 0.05. From the results in Tables 1 and 2, we can see that our method has the same precision as the R-MUSIC method. From the comparison between the R-MUSIC method and the proposed method in Table 3, we can see that the proposed method is much faster in locating the MEG sources: it yields the same results as R-MUSIC, but the speed is greatly improved, with an average execution time about 1/23 of R-MUSIC's. Furthermore, as the number of sources increases, the proposed method gains an even greater advantage over the R-MUSIC method.
5 Conclusions

In this paper, we proposed a MEG source localization scheme based on a GA. The simulation results show that the source localization can be sped up greatly; further, combined with grids in small areas, more accurate results can be obtained. Precise and fast GA-based localization of MEG sources will contribute to its further applications.
References

1. Mosher, J.C., Lewis, P.S., Leahy, R.M.: Multiple Dipole Modeling and Localization from Spatio-temporal MEG Data. IEEE Transactions on Biomedical Engineering 39 (6) (1992) 541-557
2. Mosher, J.C., Leahy, R.M.: Recursive MUSIC: A Framework for EEG and MEG Source Localization. IEEE Transactions on Biomedical Engineering 45 (11) (1998) 1342-1354
3. Cuffin, B.N.: Effects of Head Shape on EEG's and MEG's. IEEE Transactions on Biomedical Engineering 37 (1) (1990) 44-52
4. Crouzeix, A., Yvert, B., Bertrand, O., Pernier, J.: An Evaluation of Dipole Reconstruction Accuracy with Spherical and Realistic Head Models in MEG. Clinical Neurophysiology 110 (12) (1999) 2176-2188
5. Mosher, J.C., Leahy, R.M., Lewis, P.S.: EEG and MEG: Forward Solutions for Inverse Methods. IEEE Transactions on Biomedical Engineering 46 (3) (1999) 245-259
6. Aleksandar, D., Arye, N.: Estimating Evoked Dipole Responses in Unknown Spatially Correlated Noise with EEG MEG Arrays. IEEE Transactions on Signal Processing 48 (1) (2000)
Registration of 3D FMT and CT Images of Mouse Via Affine Transformation with Bayesian Iterative Closest Points*

Xia Zheng1, Xiaobo Zhou2,3, Youxian Sun1, and Stephen T.C. Wong2,3

1 Zhejiang University, National Laboratory of Industrial Control Technology, Hangzhou 310027, P.R. China
[email protected]
2 HCNR-CBI, Harvard Medical School and Brigham and Women's Hospital, Boston, MA 02215, USA
3 Functional Molecular Imaging Center, Brigham and Women's Hospital, MA 02115, USA
Abstract. It is difficult to directly co-register the 3D FMT (fluorescence molecular tomography) image of a small tumor in a mouse, whose maximal diameter is only a few mm, with the much larger CT image of the entire animal, which spans about ten cm. This paper proposes a new method that first registers the 2D flat image with the projected CT image, to facilitate the registration between small 3D FMT images and large CT images. A novel algorithm, Bayesian Iterative Closest Point (BICP), is introduced and validated for 2D affine registration. The visualization of the alignment of the 3D FMT and CT images through 2D registration shows promising results that would lead to automated 3D registration.
1 Introduction

Mouse models of human cancer have dramatically improved over the past decade with the development of molecular imaging techniques, which characterize and measure molecular events in living animals with high sensitivity and spatial resolution [1]. Imaging these molecular events is achieved by using innovative imaging agents, which include "smart sensor probes" that can be activated upon interaction with their biological targets [2]. The progress of molecular imaging has been accelerated by the development of dedicated small-animal imaging equipment for microcomputed tomography (CT) and optical imaging. Optical imaging has seen exciting developments in recent years, such as optical tomography, which, in contrast to reflectance imaging, is not surface-weighted [3, 4]. Fluorescence molecular tomography (FMT) is one such technique, capable of resolving molecular functions in deep tissues by reconstructing the in vivo distribution of intravenously injected far-red and near-infrared fluorescent probes [5]. 3D FMT images, however, contain only tumor information in the mouse and carry little anatomical information. Thus, we need to align and fuse FMT images with full-animal CT or MR images in order to reveal fine anatomical structures [6].

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1140–1149, 2007. © Springer-Verlag Berlin Heidelberg 2007
Images acquired from these modalities have different intensities and spatial resolutions in all three directions. To solve the registration problem of multimodal images, maximization of the mutual information of voxel intensities has been proposed and demonstrated to be a powerful method, allowing fully automated, robust and accurate registration of multimodal images in a variety of applications without the need for segmentation or other preprocessing of the images [7]. As for the registration of 3D FMT and CT images of the mouse, 3D FMT reveals only specific tumors, with little anatomical information; a tumor's maximum diameter is only a couple of mm, much smaller than the CT image of the whole animal, which is about ten cm long. This makes it challenging to align both images directly, efficiently and precisely, and more information is needed to facilitate the registration. Fortunately, 2D flat images (photographs) and 3D FMT images can be acquired in series using the planar fluorescence reflectance imaging/FMT system without moving the subject [8]. Therefore, their spatial relationship is known after acquisition, and FMT images can be superimposed onto the surface anatomical features to create a rough context. Our observation is that 2D flat images can be employed to bridge the gap between small 3D FMT images and large CT images of the same animal. Mutual information represents a leading technique for registering multimodal images, but it requires that the reference and moving images have somewhat similar features, either identical or at least statistically dependent [9]. Unfortunately, the flat image and the projected 3D CT image have no such similarity. Another registration approach is feature-based matching, which is typically applied when the local structural information is more significant than the information carried by the image intensities.
It allows registering images of completely different natures and can handle complex between-image distortions. Hence, feature-based registration methods are employed here. The features of the two images are represented by their segmentation results: two point sets, which are registered by our novel Bayesian Iterative Closest Point (BICP) algorithm. The final registration results demonstrate that our method is effective in automatically aligning the flat image and the projected 3D CT image. The remainder of this paper is organized as follows. The registration problem is stated and formulated in Section 2, and the proposed parameter-estimation algorithm is presented in Section 3. Simulation and experimental results are shown in Section 4. Finally, Section 5 concludes the paper.
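For reference, the mutual-information criterion of [7] mentioned above is usually estimated from a joint intensity histogram. The following is a minimal sketch of that estimate (our own illustration, not the implementation evaluated in [7]; the bin count is an arbitrary choice):

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Mutual information of the intensities of two equally sized images.

    A simple joint-histogram estimate; `a` and `b` are 2D numpy arrays.
    """
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()                 # joint probability
    px = pxy.sum(axis=1, keepdims=True)     # marginal of a
    py = pxy.sum(axis=0, keepdims=True)     # marginal of b
    nz = pxy > 0                            # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

Maximizing this quantity over transformation parameters is what intensity-based multimodal registration does; it fails here precisely because the flat image and the projected CT image share no statistical intensity relationship.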
2 Problem Statement and Formulation

Since 3D FMT images contain only tumor information and carry little anatomical information, it is difficult to co-register 3D FMT and 3D CT images directly. While 3D FMT images are acquired, flat images can be acquired and matched with the 3D FMT images to give an anatomical context of the mouse in 2D. If we register the 2D flat images with the 3D CT images, we can roughly align the 3D FMT and CT images in 2D. The resulting transformation can then be employed as a starting point for further alignment in all three dimensions. A 2D flat image is basically a photograph, i.e., a surface projection image, of the mouse. We project the 3D CT images to obtain a 2D projection image that is similar to
X. Zheng et al.
the flat image. Only the boundary information in the two 2D images is common and can be used for registration. The registration procedure is described here in detail.

The first step is to segment the 2D flat image. The boundary of the mouse, denoted by $C_r$, is obtained by segmenting the 2D flat image with the gradient vector flow (GVF) snake method proposed by Xu and Prince [13].

The second step is to project the 3D CT image to obtain a 2D CT image corresponding to the flat image. The projected image is denoted by $P(\theta; I)$, where $I$ is the 3D CT image, $\theta = (\theta_1, \theta_2)$ are the adjustable projection parameters (two rotation angles about axes in the coronal plane of the CT images), and $P(\theta; I)$ is the projection operation on the 3D CT image $I$. A proper $\theta$ makes the projected CT image as similar as possible to the flat image, as exemplified in Fig. 1: Fig. 1(b) resembles the 2D flat image more than Fig. 1(c). After projection, the CT image $P(\theta; I)$ is segmented to obtain the boundary of the mouse for registration, denoted by $C_m(\theta)$.
Fig. 1. (a) 2D flat image, (b) projection of the 3D CT image with one θ, (c) projection of the 3D CT image with another θ
The third step is to register the two 2D images via the two point sets $C_r$ and $C_m(\theta)$. We call them the reference and moving point sets, with elements denoted by $C_r = \{r_i \mid i = 1, \ldots, N_r\}$ and $C_m(\theta) = \{m(\theta)_j \mid j = 1, \ldots, N_m\}$, where $r_i$ and $m(\theta)_j$ are pixel-position vectors and $N_m < N_r$. We assume that the relationship between the two point sets is a 2D affine transformation comprising six parameters. The task of the registration is to determine the transformation parameters $\lambda = [\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5, \lambda_6]$ and the previously introduced adjustable projection parameters $\theta$ which best align the moving point set with the reference point set.
Iteration is used to optimize our problem. At each iteration, the 3D CT image must be automatically projected according to the adjustable projection parameter $\theta$. This poses another problem: automatically segmenting the projected CT image during iteration. To obtain a good segmentation result, we initialize the contour position as close as possible to the true boundary: morphological opening and closing are performed on the binary version of the projected CT image, and the boundary of the resulting binary image is used as the starting position of the GVF snake model. After segmentation of the images, we obtain two point sets and register them. What makes the problem difficult is that the correspondences between the point sets are unknown a priori. A popular approach is the class of algorithms based on the Iterative Closest Point (ICP) technique introduced by Besl [14]. We employ a $k$-dimensional ($k$-d) tree to find, for each of the $N_m$ moving points after affine transformation, the closest reference point as its correspondence; these points are denoted by $C_c = \{c_j \mid j = 1, \ldots, N_m\}$. We can now formulate our problem as

$$Y = D\lambda + n_t. \quad (1)$$

That is,

$$\begin{bmatrix} m_{1,1} & m_{1,2} \\ m_{2,1} & m_{2,2} \\ \vdots & \vdots \\ m_{N_m,1} & m_{N_m,2} \end{bmatrix} = \begin{bmatrix} 1 & c_{1,1} & c_{1,2} \\ 1 & c_{2,1} & c_{2,2} \\ \vdots & \vdots & \vdots \\ 1 & c_{N_m,1} & c_{N_m,2} \end{bmatrix} \begin{bmatrix} \lambda_5 & \lambda_6 \\ \lambda_1 & \lambda_3 \\ \lambda_2 & \lambda_4 \end{bmatrix} + [\, n_1 \;\; n_2 \,], \quad (2)$$

where $m_{i,1}$ and $m_{i,2}$ denote the two components of a moving point $m_i$, $\boldsymbol{\lambda}_1 = [\lambda_5, \lambda_1, \lambda_2]^T$, $\boldsymbol{\lambda}_2 = [\lambda_6, \lambda_3, \lambda_4]^T$, and the two noise components are assumed anisotropic normally distributed:

$$n_t \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix} \right). \quad (3)$$

Here $\lambda = [\boldsymbol{\lambda}_1, \boldsymbol{\lambda}_2]$, $\sigma = [\sigma_1, \sigma_2]$, and $Y = [Y_1 \; Y_2]$ is the matrix on the left-hand side of (2), where $Y_i$ is the $i$th column of $Y$.
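For fixed correspondences, the linear model (1)-(2) has a closed-form least-squares (maximum-likelihood) solution. A sketch of this non-Bayesian baseline (function name ours):

```python
import numpy as np

def affine_lstsq(moving, corr):
    """Least-squares estimate of the six affine parameters in Y = D*lambda.

    `moving` (N_m x 2) holds the moving points m_j; `corr` (N_m x 2) holds
    the correspondence points c_j. Returns the 3x2 coefficient matrix whose
    columns are [lambda5, lambda1, lambda2] and [lambda6, lambda3, lambda4],
    matching the layout of Eq. (2).
    """
    D = np.hstack([np.ones((len(corr), 1)), corr])   # design matrix of Eq. (2)
    coeffs, *_ = np.linalg.lstsq(D, moving, rcond=None)
    return coeffs
```

The Bayesian treatment of the next section replaces this point estimate with posterior sampling, which is what makes the method robust to degenerate correspondence sets.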
3 Parameter Estimation

3.1 Prior Distributions

The traditional ICP method is widely used in point-set registration, but it is sensitive to noise and outliers. We found that when traditional ICP failed, there were too many identical points in the correspondence point set $C_c$. The contribution of our
BICP is to penalize this situation by introducing a hyper-parameter $\delta^2$. We follow a Bayesian approach to estimate the parameters [15]. The overall parameter space is $\Theta = \lambda \times \sigma \times \delta$, where $\delta$ is a hyper-parameter explained at the end of this section. Given the two point sets, our objective is to estimate $\Theta$. The Bayesian inference of $\Theta$ is based on the joint posterior distribution $p(\lambda, \sigma^2, \delta^2 \mid Y, C_c)$. The joint distribution of all variables is

$$p(\lambda, \sigma^2, \delta^2, Y \mid C_c) = p(Y \mid \lambda, \sigma^2, \delta^2, C_c)\, p(\lambda \mid \sigma^2, \delta^2, C_c)\, p(\sigma^2 \mid \delta^2, C_c)\, p(\delta^2 \mid C_c). \quad (4)$$

Under the assumption of independent moving points given $(\lambda, \sigma^2, \delta^2)$, we have

$$p(Y \mid \lambda, \sigma^2, \delta^2, C_c) = \prod_{i=1}^{2} p(Y_i \mid \boldsymbol{\lambda}_i, \sigma^2, \delta^2, C_c) = \prod_{i=1}^{2} (2\pi\sigma_i^2)^{-N_m/2} \exp\!\Big(-\frac{1}{2\sigma_i^2}(Y_i - D\boldsymbol{\lambda}_i)^T (Y_i - D\boldsymbol{\lambda}_i)\Big). \quad (5)$$

We assume the following structure for the prior distribution:

$$p(\lambda, \sigma^2, \delta^2) = p(\lambda_{1:2} \mid \sigma^2, \delta^2)\, p(\sigma^2 \mid \delta^2)\, p(\delta^2) = p(\lambda_{1:2} \mid \sigma^2, \delta^2)\, p(\sigma^2)\, p(\delta^2), \quad (6)$$

where $p(\sigma^2) = \prod_{i=1}^{2} p(\sigma_i^2)$ and each $\sigma_i^2$ is distributed according to the conjugate inverse-Gamma prior $\sigma_i^2 \sim \mathrm{IG}(0, 0) = 1/\sigma_i^2$. Given $(\sigma^2, \delta^2)$, we introduce the prior

$$p(\lambda \mid \sigma^2, \delta^2) = \prod_{i=1}^{2} |2\pi\sigma_i^2 \Sigma_i|^{-1/2} \exp\!\Big(-\frac{1}{2\sigma_i^2} \boldsymbol{\lambda}_i^T \Sigma_i^{-1} \boldsymbol{\lambda}_i\Big), \quad (7)$$

where $\Sigma_i^{-1} = \delta_i^{-2} D^T D$. Conditional upon $(\sigma^2, \delta^2)$, the coefficients $\boldsymbol{\lambda}_i$ are thus zero-mean Gaussian with covariance $\sigma_i^2 \Sigma_i$. As mentioned before, $C_c$, selected from the reference point set, is the correspondence point set of the moving point set; we use the iterative-closest-point method to determine $C_c$. It is therefore possible that the correspondence points of several different moving points are identical due to noise, which makes the iteration converge to a local optimum. In the extreme case, all the moving points correspond to a single point, and the determinant of $\Sigma_i^{-1}$ tends to zero. To penalize this situation we introduce the term $\delta^2 \in (\mathbb{R}^+)^2$ with $p(\delta^2) = \prod_{i=1}^{2} p(\delta_i^2)$, where $\delta_i^2 \sim \mathrm{IG}(\alpha_{\delta^2}, \beta_{\delta^2})$.
3.2 Formulation for Sampling

According to Bayes' theorem,

$$p(\lambda, \sigma^2, \delta^2 \mid Y, C_c) \propto p(Y \mid \lambda, \sigma^2, \delta^2, C_c)\, p(\lambda, \sigma^2, \delta^2)$$
$$\propto \prod_{i=1}^{2} (2\pi\sigma_i^2)^{-N_m/2} \exp\!\Big(-\frac{1}{2\sigma_i^2}(Y_i - D\boldsymbol{\lambda}_i)^T (Y_i - D\boldsymbol{\lambda}_i)\Big) \times \Big[\prod_{i=1}^{2} |2\pi\sigma_i^2 \Sigma_i|^{-1/2} \exp\!\Big(-\frac{1}{2\sigma_i^2}\boldsymbol{\lambda}_i^T \Sigma_i^{-1} \boldsymbol{\lambda}_i\Big)\Big] \Big[\prod_{i=1}^{2} (\sigma_i^2)^{-1}\Big] \times \Big[\prod_{i=1}^{2} (\delta_i^2)^{-(\alpha_{\delta^2}+1)} \exp\!\Big(-\frac{\beta_{\delta^2}}{\delta_i^2}\Big)\Big]. \quad (8)$$

Multiplying out the exponential terms, we obtain

$$p(\lambda, \sigma^2, \delta^2 \mid Y, C_c) \propto \Big[\prod_{i=1}^{2} (2\pi\sigma_i^2)^{-N_m/2-1} \exp\!\Big(-\frac{1}{2\sigma_i^2} Y_i^T K_i Y_i\Big)\Big] \times \Big[\prod_{i=1}^{2} |2\pi\sigma_i^2 \Sigma_i|^{-1/2} \exp\!\Big(-\frac{1}{2\sigma_i^2}(\boldsymbol{\lambda}_i - h_i)^T M_i^{-1} (\boldsymbol{\lambda}_i - h_i)\Big)\Big] \times \Big[\prod_{i=1}^{2} (\delta_i^2)^{-(\alpha_{\delta^2}+1)} \exp\!\Big(-\frac{\beta_{\delta^2}}{\delta_i^2}\Big)\Big], \quad (9)$$

where

$$M_i^{-1} = D^T D + \Sigma_i^{-1}, \qquad h_i = M_i D^T Y_i, \qquad K_i = I_{N_m} - D M_i D^T. \quad (10)$$

We recall that $p(\lambda, \sigma^2, \delta^2 \mid Y, C_c) = p(\lambda_{1:2} \mid \sigma^2, \delta^2, Y, C_c)\, p(\sigma^2 \mid \delta^2, Y, C_c)\, p(\delta^2 \mid Y, C_c)$. It follows that for $i = 1, 2$, $\sigma_i^2$, $\boldsymbol{\lambda}_i$ and $\delta_i^2$ are distributed according to

$$\sigma_i^2 \mid (\delta^2, Y, C_c) \sim \mathrm{IG}\Big(\frac{N_m}{2} + 1,\; \frac{Y_i^T K_i Y_i}{2}\Big), \quad (11)$$

$$\boldsymbol{\lambda}_i \mid (\sigma^2, \delta^2, Y, C_c) \sim N(h_i,\; \sigma_i^2 M_i), \quad (12)$$

and, as follows directly from (8),

$$\delta_i^2 \mid (\lambda, \sigma^2, Y, C_c) \sim \mathrm{IG}\Big(\alpha_{\delta^2} + \frac{3}{2},\; \beta_{\delta^2} + \frac{1}{2\sigma_i^2} \boldsymbol{\lambda}_i^T D^T D \boldsymbol{\lambda}_i\Big). \quad (13)$$

We can then sample the parameters according to Eqs. (11)-(13) to estimate them.

3.3 Estimation of the Projection Parameter θ
In our case we have an additional adjustable projection parameter θ, which gives rise to many local minima. We therefore turn to global optimization methods, among which
Differential Evolution (DE) is a simple and efficient adaptive scheme for global optimization over continuous spaces [11]. We combine DE and BICP to estimate (λ, θ): for every specified θ, the two point sets are extracted, BICP is used to register the 2D images, and the resulting registration error is taken as the value of the DE objective function for that θ.

3.4 The Initial Position
Another problem we must consider in BICP is estimating the global initial value $\lambda^{(0)}$, one component of which is the translation parameters. Since our algorithm is based on iterative calculation, the initial alignment of $\{m(\theta)_j\}_{j=1}^{N_m}$ and $\{r_i\}_{i=1}^{N_r}$ affects its convergence rate and precision. Registering the geometric centers of the two data sets resolves the initial shift:

$$[\lambda_5^{(0)}, \lambda_6^{(0)}] = \frac{1}{N_r}\sum_{i=1}^{N_r} r_i - \frac{1}{N_m}\sum_{j=1}^{N_m} m(\theta)_j. \quad (14)$$

The other initial affine parameters $[\lambda_1^{(0)}, \lambda_2^{(0)}, \lambda_3^{(0)}, \lambda_4^{(0)}]$ are set to $[1\; 0\; 0\; 1]$. We summarize the parameter-estimation procedure as follows:

• Run the DE method over θ, with the BICP method as its objective function.
• Use the BICP method to register the 2D images for each specified θ:
  1. Initialization. Set $\lambda^{(0)}$, $\delta_i^{2(0)}$, $\sigma^{(0)}$, and $i = 1$.
  2. Iteration $i$:
     a. Find the correspondences for the current $\lambda$ using the iterative-closest-point method.
     b. Sample $\sigma$, $\lambda$, $\delta$ from Eqs. (11)-(13).
  3. Set $i = i + 1$ and go to step 2.
4 Experimental Results

In the experiments below we set $\alpha_{\delta^2} = 2$ and $\beta_{\delta^2} = 20$.
4.1 Experiment with Simulated Image Data

First we confirm the effectiveness of the proposed BICP for affine registration. The simulated data are generated as follows:
a. Segment the flat image to obtain a reference point set $C_r$.
b. Select a part of the flat image and segment it to create a second point set.
c. Apply an affine transformation ($\lambda$ = [0.8 0 0 1.1 0 0.1]) to this point set and add zero-mean normally distributed noise (standard deviation 0.01) to the result, giving a moving point set $C_{flat\_m}$.
d. Register $C_{flat\_m}$ to $C_r$ with our proposed method.
The error of a registration is defined as its Euclidean distance from the optimal match. Here the registration error of our method is 1.32 pixels, compared with 1.34 for traditional ICP. We then verify our algorithm by registering a synthetic flat image with the 3D CT image. The synthetic flat image is generated as follows:
a. Project a 3D CT image after rotation about two axes according to the adjustable projection parameters θ; here we set θ to reasonable values, (20, 13).
b. Transform the projected 2D image with the affine transformation parameters λ = [1.2 0.1 0.2 0.95 -3 4].
c. Add zero-mean normally distributed noise (standard deviation η) to the transformed 2D image to obtain a "practical" simulated flat image.
Following these steps, the synthetic flat image generated with η = 0.05 is illustrated in Fig. 2(b).
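Step c of the point-set protocol in Section 4.1 can be sketched as below; the placement of the six parameters into the 2×2 matrix and translation vector is our assumption about the paper's ordering:

```python
import numpy as np

def make_moving_set(ref, lam=(0.8, 0.0, 0.0, 1.1, 0.0, 0.1), noise=0.01, seed=0):
    """Generate a simulated moving point set: apply the affine
    [[l1, l2], [l3, l4]] plus translation [l5, l6] to the reference
    points `ref` (N x 2) and add zero-mean Gaussian noise (std `noise`)."""
    l1, l2, l3, l4, l5, l6 = lam
    A = np.array([[l1, l2], [l3, l4]])
    rng = np.random.default_rng(seed)
    return ref @ A.T + np.array([l5, l6]) + rng.normal(0.0, noise, ref.shape)
```

The resulting set plays the role of $C_{flat\_m}$, to be registered back to the reference set.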
Fig. 2. (a) Projected image, (b) simulated flat image, and (c) registration result
Additionally, to show the performance at different noise levels, our algorithm is applied to the synthetic flat image and the 3D CT image with different values of η. The final results are shown in Table 1.

Table 1. Registration results at different noise levels

η      λ = [λ1, λ2, λ3, λ4, λ5, λ6]                    θ             Error (pixel)
0.15   [1.2421 0.1065 0.2357 0.9660 -3.8202 4.8272]    [22.9 14.3]   0.5058
0.10   [1.2225 0.1065 0.2227 0.9360 -3.8322 4.5470]    [22.6 13.4]   0.4274
0.05   [1.2121 0.0954 0.1964 0.9488 -3.6229 4.2732]    [21.4 13.2]   0.3854
4.2 Experiment with Real Data
In this experiment, the size of the real flat image is 231 × 341 pixels, with a spacing of 0.0185 × 0.0185 cm. For each mouse, we have three 3D CT image sets that cover the
entire animal with overlap. The 3D CT image sets consist of head, thorax, and pelvis, all high-resolution images. We use only the image set of the central part, which is our region of interest (ROI). The resolution of the thorax 3D CT image is 512 × 512 × 512 voxels, and its voxel spacing is 0.0072 × 0.0072 × 0.0072 cm. The real data are registered in physical coordinates. The final parameters of our algorithm applied to the real data are λ = [1.1959 0.0655 0.6935 -0.0758 1.0636 0.28790] and θ = [-7.1 13.5]. We demonstrate the registration result in Fig. 3: the blue line is the segmentation of the flat image, and the red line is the contour of the projected 3D CT image after rotation according to θ.
Fig. 3. Image registration result
Finally, we fuse the 3D FMT and CT images in the x-y plane using the above preliminary 2D registration result, as shown in Fig. 4.
Fig. 4. (a) Projection and affine transformation of the 3D CT image; (b) projection of the 3D FMT image; (c) fusion
5 Conclusions

This work has contributed to the registration of the flat image and the projected 3D CT image of the mouse, reducing the gap between the 3D FMT and 3D CT images of the animal. A novel algorithm combining DE and BICP is proposed to solve this multi-modality image registration problem. We first validated the new registration method on simulated animal images and then applied it to experimentally obtained real data. Future work will investigate the alignment of the 3D FMT and CT images in all three dimensions based on the 2D registration result.
Acknowledgement

The authors would like to acknowledge the excellent collaboration with their molecular imaging collaborators in this research effort, in particular Dr. Joshua M. Dunham in Ntziachristos's lab. The research of Xiaobo Zhou is supported by the HCNR Center for Bioinformatics Research Grant, Harvard Medical School. This work is supported by the Academician Foundation of Zhejiang Province (No. 2005A1001-13).
References

[1] Jan, G., David, G.K., Stephen, D.W., Carla, F.B.K., Philip, M.S., Vasilis, N., Tyler, J., Ralph, W.: Use of Gene Expression Profiling to Direct in Vivo Molecular Imaging of Lung Cancer. PNAS 102 (40) (2005)
[2] Weissleder, R.: Molecular Imaging: Exploring the Next Frontier (Review). Radiology 212 (1999) 609-614
[3] Tung, C.H., Mahmood, U., Bredow, S., et al.: In Vivo Imaging of Proteolytic Enzyme Activity Using a Novel Molecular Reporter. Cancer Res. 60 (2000) 4953-4958
[4] Ntziachristos, V., Tung, C.H., Bremer, C., Weissleder, R.: Fluorescence Molecular Tomography Resolves Protease Activity in Vivo. Nat Med 8 (7) (2002) 757-760
[5] Ntziachristos, V., Bremer, C., Graves, E.E., Ripoll, J., Weissleder, R.: In Vivo Tomographic Imaging of Near-infrared Fluorescent Probes. Mol Imaging 1 (2002) 82-88
[6] Ntziachristos, V., Ripoll, J., Wang, L.V., Weissleder, R.: Looking and Listening to Light: the Evolution of Whole-body Photonic Imaging. Nat Biotechnol 23 (2005) 313-320
[7] Frederik, M., Dirk, V., Paul, S.: Comparative Evaluation of Multiresolution Optimization Strategies for Multimodality Image Registration by Maximization of Mutual Information. Medical Image Analysis 3 (4) (1999) 373-386
[8] Graves, E.E., et al.: A Submillimeter Resolution Fluorescence Molecular Imaging System for Small Animal Imaging. Med. Phys. 30 (3) (2003) 901-911
[9] Barbara, Z., Jan, F.: Image Registration Methods: A Survey. Image and Vision Computing 21 (2003) 977-1000
[10] Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C++. Second ed., Cambridge University Press, Cambridge (2002)
[11] Rainer, S., Kenneth, P.: Differential Evolution - A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces. ICSI Technical Report TR-95-012 (1995)
[12] Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models. Int. J. Computer Vision 1 (4) (1987) 321-331
[13] Xu, C., Prince, J.L.: Snakes, Shapes, and Gradient Vector Flow. IEEE Transactions on Image Processing 7 (3) (1998) 359-369
[14] Paul, J.B., Neil, D.M.: A Method for Registration of 3D Shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (2) (1992) 239-255
[15] Andrieu, C., Freitas, J., Doucet, A.: Robust Full Bayesian Learning for Neural Networks. http://www.cs.berkeley.edu/jfgf/software.html (1999)
Automatic Diagnosis of Foot Plant Pathologies: A Neural Networks Approach

Marco Mora¹, Mary Carmen Jarur¹, Daniel Sbarbaro², and Leopoldo Pavesi¹

¹ Department of Computer Science, Catholic University of Maule, Casilla 617, Talca, Chile
{mora,mjarur,lpavesi}@spock.ucm.cl
http://www.ganimides.ucm.cl/mmora/
² Department of Electrical Engineering, University of Concepcion, Casilla 160-C, Concepcion, Chile
[email protected]

Abstract. Some foot plant pathologies, such as cave and flat foot, are normally detected by a human expert from footprint images. Nevertheless, the lack of trained personnel to carry out such massive first-screening efforts precludes the routine diagnosis of the above-mentioned pathologies. In this work an innovative automatic system for diagnosing foot plant pathologies, based on neural networks (NN), is presented. We propose the use of principal components analysis to reduce the number of inputs to the NN, thereby increasing the efficiency of the training algorithm. The results achieved with this system demonstrate the feasibility of automatic diagnosis systems based on the footprint image. Such systems are of great value, especially in remote areas, and are also suited to massive first-screening health campaigns.
1 Introduction

When the foot is planted, not all of the sole is in contact with the ground; the footprint is the surface of the foot plant in contact with the ground. Cave foot and flat foot are pathologies that appear in children around the age of three. If these foot malformations are not detected and treated in time, they become worse during adulthood, producing several disturbances, pain and posture-related disorders [12]. The characteristic form and zones of the footprint are shown in Figure 1a. Zones 1, 2 and 3 correspond to regions in contact with the surface when the foot is planted; these are called the anterior heel, posterior heel and isthmus, respectively. Zone 4 is not part of the contact surface and is called the footprint vault [12]. A simple method to obtain footprints is to step directly with the inked foot onto a paper on the floor. After obtaining the footprints, an expert analyzes them and assesses whether they present pathologies. Usually, in the diagnosis of these pathologies, an instrument known as a podoscope is used to capture the footprints. A simple digital version of the podoscope based on a scanner has been proposed in [7]. Another basic instrument to obtain footprints is the pedobarograph [1]. Modern variants of the pedobarograph are proposed in [8,10].

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1150–1158, 2007. © Springer-Verlag Berlin Heidelberg 2007
Automatic Diagnosis of Foot Plant Pathologies: ANN Approach
1151
Our goal is to develop a system for massive screening and early detection of pathologies such as flat foot and cave foot. For that reason, a simple, inexpensive, and easy-to-use system is needed. Considering the simplicity of a podoscope, we have developed a digital version based on a simple color digital camera. The instrument consists of a robust metallic structure with adjustable height and transparent glass in its upper part. The patient stands on the glass and the footprint image is obtained with a digital color camera in the interior of the structure. For adequate lighting, white bulbs are used. The system includes regulation of the light intensity of the bulbs, which allows the amount of light to be adjusted for capturing images under different lighting conditions. With our digital podoscope we have built a database with more than 230 foot plant optical images to support our research. These images were classified by an expert¹. Of the total sample, 12.7% are flat feet, 61.6% are normal feet and 25.7% are cave feet. In related work, a non-automatic method to segment footprints based on a simple sequence of traditional digital image processing techniques has been proposed in [2], and a method to segment footprints in optical color images of the sole using neural networks is proposed in [6]. This paper describes the development of an automatic method to diagnose foot plant pathologies. Firstly, we propose an original representation for the segmented footprint patterns. Secondly, to reduce the pattern dimensionality, we apply a principal component transform. Finally, we formulate the diagnosis of foot plant pathologies as a pattern recognition problem, adopting a neural network for footprint classification. This paper is organized as follows. Section 2 introduces the foot plant pathologies and their diagnosis using neural networks. Section 3 describes the footprint representation and feature extraction. Section 4 presents the training of the neural network classifier. Section 5 presents the validation of the classifier. Finally, Section 6 gives some conclusions and future studies.
2 Foot Plant Pathologies and Neural Network Diagnosis
It is possible to classify a foot by its footprint form and dimensions as normal, flat or cave. Figure 1b shows an image of a flat foot, Figure 1c a normal foot, and Figure 1d a cave foot. Currently, an expert decides whether a patient has a normal, cave or flat foot by a manual exam called a photopodogram: a chemical photo of the part of the foot supporting the load. The expert determines the positions of two distances, measures them, calculates their ratio and classifies the foot. Even though the criteria for classifying footprints seem very simple, a classifier based on neural networks (NN) offers the following advantages compared with more traditional approaches: (1) it is not simple to develop an algorithm to determine

¹ The authors of this study acknowledge Mr. Eduardo Achu, specialist in Kinesiology, Department of Kinesiology, Catholic University of Maule, Talca, Chile, for his participation as an expert in the classification of the database images.
M. Mora et al.
Fig. 1. Images of the sole: (a) zones, (b) flat foot, (c) normal foot, (d) cave foot
with precision the right positions at which to measure the distances, and (2) it can be trained to recognize other pathologies or to improve its performance as more cases become available. The multilayer perceptron (MLP) and the backpropagation (BP) training algorithm [11] have been successfully used in classification and function approximation. An important characteristic of the MLP is its capacity to classify patterns grouped in classes that are not linearly separable. Besides, there are powerful tools, such as the Levenberg-Marquardt optimization algorithm [3] and a Bayesian approach for defining the regularization parameters [5], which enable efficient training of MLPs. Even though this universal framework for building classifiers exists, as we will illustrate in this work, simple preprocessing can lead to smaller network structures without compromising performance.
3 Footprint Representation and Characteristics Extraction
Prior to classification, the footprint is isolated from the rest of the components of the sole image using the method proposed in [6]. Figures 2a, 2b and 2c show the segmentation of a flat foot, a normal foot, and a cave foot, respectively.
Fig. 2. Segmentation of footprints without toes: (a) flat foot, (b) normal foot, (c) cave foot
After segmentation, the footprint is represented by a vector containing the width in pixels of the segmented footprint (without toes) for each column in the horizontal direction. Because every image yields a width vector of different length, the vectors were resampled to a common length, and the value of each element was normalized to the range 0 to 1. Figures 3a, 3b and 3c show the normalized vectors of a flat, a normal and a cave foot.
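The width-vector representation can be sketched as below. The exact resampling and normalization scheme is our assumption; the paper specifies only that vectors are brought to a common length with values in the range 0 to 1:

```python
import numpy as np

def width_vector(mask, length=100):
    """Footprint width signature: count foreground pixels in each image
    column, resample to a fixed length, and scale values to (0, 1].
    `mask` is a binary 2D array of the segmented footprint (toes removed)."""
    widths = mask.sum(axis=0).astype(float)
    widths = widths[widths > 0]                  # keep footprint columns only
    x_old = np.linspace(0.0, 1.0, len(widths))
    x_new = np.linspace(0.0, 1.0, length)
    resampled = np.interp(x_new, x_old, widths)  # common length
    return resampled / resampled.max()           # values in (0, 1]
```

Each segmented footprint image is thus reduced to one fixed-length signature before the dimensionality reduction described next.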
Fig. 3. Representation of the footprint without toes: (a) flat footprint vector, (b) normal footprint vector, (c) cave footprint vector
As a method to reduce the dimensionality of the inputs to the classifier, principal components analysis was used [4]. Given an eigenvalue $\lambda_i$ of the covariance matrix of the width-vector set, the percentage contribution $\gamma_i$ and the accumulated percentage contribution $APC_i$ are calculated by the following expressions:

$$\gamma_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}, \quad (1)$$

$$APC_i = \sum_{j=1}^{i} \gamma_j. \quad (2)$$
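Eqs. (1)-(2) amount to an eigenvalue analysis of the covariance of the width vectors; a sketch (function name ours, contributions returned in percent to match Table 1):

```python
import numpy as np

def pca_contributions(X):
    """Eigenvalues of the covariance of the width vectors (rows of X),
    with percentage contributions gamma_i (Eq. 1, in percent) and the
    accumulated contributions APC_i (Eq. 2)."""
    cov = np.cov(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # sort descending
    gamma = 100.0 * eigvals / eigvals.sum()   # Eq. (1), as percentages
    apc = np.cumsum(gamma)                    # Eq. (2)
    return eigvals, gamma, apc
```

The number of retained components is then chosen where the accumulated contribution levels off, which is the criterion applied to Table 1 below.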
Table 1 shows the value, percentage contribution and accumulated percentage contribution of the first nine eigenvalues. Note that from the 8th eigenvalue onward the contribution is close to zero, so it is enough to represent the width vector with the first seven principal components. Figure 4 shows a normalized width vector (rugged red signal) and the approximation obtained from the first seven principal components (smooth blue signal) for the three classes.

Table 1. Contribution of the first 9 eigenvalues

                          λ1     λ2     λ3     λ4     λ5     λ6      λ7     λ8     λ9
Value                     0.991  0.160  0.139  0.078  0.055  0.0352  0.020  0.014  0.010
Percentage contribution   63.44  10.25  8.95   5.01   3.57   2.25    1.31   0.94   0.65
Accumulated contribution  63.44  73.7   82.65  87.67  91.24  93.50   94.82  95.76  96.42

4 Training of the Neural Network Classifier
A preliminary analysis of the segmented footprints showed very little presence of limit patterns among classes: flat feet that are almost normal, normal feet that are almost flat, normal feet that are almost cave, and cave feet that are almost normal. Thus, the training set was enhanced with 4 synthetic patterns for each of the limit
Fig. 4. Principal components approximation: (a) flat class, (b) normal class, (c) cave class
cases. Thus the training set has a total of 199 images: 12.5% corresponding to flat feet, 63% to normal feet and 24.5% to cave feet. To build the training set, the first seven principal components were calculated for all the width vectors. For classification of the foot as flat, normal or cave, an MLP trained with Bayesian-regularization backpropagation was used. The structure of the NN is:
– Number of inputs: 7, one for each principal component.
– Number of outputs: 1. It takes the value 1 if the foot is flat, 0 if the foot is normal, and −1 if the foot is cave.
To determine the number of neurons in the hidden layer, the procedure described in [3] was followed. Batch learning was adopted and the initial network weights were generated by the Nguyen-Widrow method [9], since it increases the convergence speed of the training algorithm. The details of this procedure are shown in Table 2, where NNCO is the number of neurons in the hidden layer, SSE is the sum of squared errors and SSW is the sum of squared weights. From Table 2 it can be seen that from 4 neurons in the hidden layer onward, the SSE, SSW and effective parameters stay practically constant. As a result, 4 neurons are used in the hidden layer. In Figure 5 it can be observed that the SSE, SSW and the effective parameters of the network are relatively constant over several iterations, which means that the training process was carried out appropriately. Figure 6 shows the training error for each pattern of the training set. Note that the classification errors are not very small values; this behavior ensures that the network has not memorized the training set and will generalize well.

Table 2. Determining the number of neurons in the hidden layer

NNCO  Epochs    SSE             SSW    Effective parameters  Total parameters
1     114/1000  22.0396/0.001   23.38  8.49e+000             10
2     51/1000   12.4639/0.001   9.854  1.63e+001             19
3     83/1000   12.3316/0.001   9.661  1.97e+001             28
4     142/1000  11.3624/0.001   13.00  2.61e+001             37
5     406/1000  11.3263/0.001   13.39  2.87e+001             46
6     227/1000  11.3672/0.001   12.92  2.62e+001             55
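A toy version of this classifier training can be sketched as follows. The paper trains the MLP with Bayesian-regularization backpropagation, which adapts the weight penalty automatically; substituting a fixed weight-decay term and plain batch gradient descent, as below, is a deliberate simplification for illustration:

```python
import numpy as np

def train_mlp(X, y, hidden=4, lr=0.1, decay=1e-3, epochs=5000, seed=0):
    """Tiny MLP (tanh hidden layer, linear output) trained by batch
    gradient descent on the SSE plus a fixed weight-decay penalty.
    Returns a prediction function mapping inputs to scalar outputs
    near 1 (flat), 0 (normal) or -1 (cave)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)               # hidden activations
        err = (H @ W2 + b2).ravel() - y        # residuals of linear output
        gout = err[:, None] / n                # mean SSE gradient
        gH = gout @ W2.T * (1.0 - H ** 2)      # backprop through tanh
        W2 -= lr * (H.T @ gout + decay * W2); b2 -= lr * gout.sum(axis=0)
        W1 -= lr * (X.T @ gH + decay * W1);  b1 -= lr * gH.sum(axis=0)
    return lambda Xn: (np.tanh(Xn @ W1 + b1) @ W2 + b2).ravel()
```

The hidden-layer size of 4 mirrors the choice justified by Table 2; the learning rate, epoch count and decay value here are arbitrary.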
Fig. 5. Evolution of the training process for 4 neurons in the hidden layer (142 epochs; final SSE = 11.3624, SSW = 13.003, effective number of parameters = 26.0911). Top: SSE evolution. Center: SSW evolution. Bottom: effective-parameters evolution.
Fig. 6. Classification error of the training set
5 Validation of the Neural Network Classifier

The validation set contains 38 new real footprint images classified by the expert: 13.1% correspond to flat feet, 55.3% to normal feet and 31.6% to cave feet. For each footprint in the validation set, the normalized width vector was calculated from the binary image of the segmented footprint; then, after principal component decomposition, only the first 7 components were presented to the trained NN. Figure 7 shows the results of the classification: the outputs of the network and the targets are represented by circles and crosses, respectively, and the classification error is represented by a black continuous line. The results are very good, considering that the classification was correct for the complete set.
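At validation time, each new width vector must be projected onto the principal axes stored from the training set before being fed to the network. A sketch of this transform (names ours):

```python
import numpy as np

def to_features(width_vec, mean, components):
    """Project a normalized width vector onto the stored principal axes.

    `mean` is the training-set mean width vector and `components` is the
    7 x d matrix of leading eigenvectors of the training covariance;
    the result is the 7-element input expected by the trained network."""
    return components @ (width_vec - mean)
```

Using the training-set mean and eigenvectors (rather than recomputing them per sample) keeps the validation features in the same coordinate system the network was trained on.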
Fig. 7. Classification error of the validation set and output/target of the net
Fig. 8. Foot plants and their segmented footprints

Fig. 9. Footprint vectors of the segmented footprints

Table 3. Classification of the validation set

ID footprint  Net output  Target  Error    Classification of net  Classification of expert
(a)           1.0457      1       -0.0457  Flat                   Flat
(b)           1.0355      1       -0.0355  Flat                   Flat
(c)           0.8958      1       0.1042   Flat                   Flat
(d)           0.0126      0       -0.0126  Normal                 Normal
(e)           -0.0395     0       0.0395   Normal                 Normal
(f)           0.0080      0       -0.0080  Normal                 Normal
(g)           -0.9992     -1      -0.0008  Cave                   Cave
(h)           -1.0010     -1      0.0010   Cave                   Cave
(i)           -0.9991     -1      -0.0009  Cave                   Cave
In order to illustrate the whole process, we have chosen 9 patterns from the validation set. Figure 8 shows the foot images and their segmented footprints: Figs. 8a-c, 8d-f, and 8g-i correspond to flat, normal, and cave feet, respectively. Figure 9 shows the footprint vectors of these images.
Automatic Diagnosis of Foot Plant Pathologies: ANN Approach
1157
Table 3 shows the quantitative classification results for the 9 examples selected from the validation set; as can be seen, the network classification and the expert classification agree completely.
6 Final Remarks and Future Studies
This work has presented a method to detect footprint pathologies based on neural networks and principal component analysis. It provides a robust solution to a real-world problem and contributes to automating a process that is currently carried out by a human expert. The training process was enhanced by adding synthetic border patterns, and the footprint was represented by a width vector processed with principal component analysis. Using an MLP trained by a Bayesian approach, all patterns of the validation set were correctly classified. The encouraging results of this study demonstrate the feasibility of implementing a system for early, automatic, and massive diagnosis of the pathologies analyzed here. In addition, this study lays down the foundation for incorporating new foot pathologies that can be diagnosed from the footprint. Building on the experience obtained from this study, our interest is now centered on real-time footprint segmentation for the monitoring, analysis, and detection of walking disorders. Additionally, during the research a database containing a large number of images from different patients was generated. These images have already been classified by an expert with respect to pathologies such as flat and cave foot. The authors are willing to make these images available to the entire research community, so that more knowledge and experience can be gained on a cooperative basis.
References
1. Chodera, J.: Pedobarograph - Apparatus for Visual Display of Pressure Between Contacting Surfaces of Irregular Shape. CZS Patent 104 514 30d (1960)
2. Chu, W., Lee, S., Chu, W., Wang, T., Lee, M.: The Use of Arch Index to Characterize Arch Height: A Digital Image Processing Approach. IEEE Transactions on Biomedical Engineering 42 (1995)
3. Foresee, D., Hagan, M.: Gauss-Newton Approximation to Bayesian Learning. Proceedings of the International Joint Conference on Neural Networks (1997)
4. Jollife, I.: Principal Component Analysis. Springer-Verlag (1986)
5. Mackay, D.: Bayesian Interpolation. Neural Computation 4 (1992)
6. Mora, M., Sbarbaro, D.: A Robust Footprint Detection Using Color Images and Neural Networks. Proceedings of CIARP 2005, Lecture Notes in Computer Science 3773 (2005) 311-318
7. Morsy, A., Hosny, A.: A New System for the Assessment of Diabetic Foot Plantar Pressure. Proceedings of the 26th Annual International Conference of the IEEE EMBS (2004) 1376-1379
8. Nakajima, K., Mizukami, Y., Tanaka, K.: Footprint-Based Personal Recognition. IEEE Transactions on Biomedical Engineering 47 (2000)
9. Nguyen, D., Widrow, B.: Improving the Learning Speed of 2-Layer Neural Networks by Choosing Initial Values of the Adaptive Weights. Proceedings of the IJCNN 3 (1990) 21-26
10. Patil, K., Bhat, V., Bhatia, M., Narayanamurthy, V., Parivalan, R.: New Online Methods for Analysis of Foot Pressures in Diabetic Neuropathy. Frontiers Med. Biol. Engg. 9 (1999) 49-62
11. Rumelhart, D., McClelland, J., and the PDP Group: Explorations in Parallel Distributed Processing. The MIT Press, Vols. 1 and 2 (1986)
12. Valenti, V.: Orthotic Treatment of Walk Alterations. Panamerican Medicine (in Spanish) (1979)
Phase Transitions Caused by Threshold in Random Neural Network and Its Medical Applications

Guangcheng Xi and Jianxin Chen

Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China
{guangcheng.xi,jianxin.chen}@ia.ac.cn
Abstract. In this paper, we detect threshold-driven phase transitions in the homogeneous random neural network. When the neurons are arranged in one dimension, the critical threshold is two, while in the two-dimensional counterpart the critical threshold is four. We show that the random neural network is a specific case of abstract neural automata, and we conclude that phase transitions in the random neural network can produce thought in the human brain. We apply the network successfully to interpret the relation between diseases and syndromes in Traditional Chinese Medicine.
1 Introduction
The human brain consists of nearly 10^11 neurons; the number of neurons is approximately infinite. Neural networks present a powerful strategy for disclosing the mechanism of the brain. However, most neural networks, implemented in either software or hardware, are composed of finitely many neurons. In view of this, we have presented a more complete network, the Abstract Neural Automaton (ANA) [1], which is composed of infinitely many neurons. Philosophical concepts are thought to be the highest product of the brain, and phase transitions have a prominent role to play in modeling the brain, particularly in the process of thought. Hoshino et al. showed that self-organized phase transitions contribute to the information processing mechanism of the brain [5]. Phase transitions have also been found in simple learning in a one-layer perceptron [4]. We have shown phase transitions in the brain: transitions of concepts produce thought. Moreover, any concept can be uniquely expressed by basic concepts, which are considered as the set of extreme points of limit Gibbs measures of ANA [2]. In this paper, we detect the phase transitions driven by the threshold of the neurons in the homogeneous random neural network, whose limit distribution is the Poisson
The work was supported by the National Basic Research Program of China (973 Program) under Grant No. 2003CB517106 and by NSFC Project under Grant No. 60621001.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1159-1167, 2007. © Springer-Verlag Berlin Heidelberg 2007
1160
G. Xi and J. Chen
distribution, a specific kind of Gibbs distribution. The phase transitions occur in two situations: in the first, the neurons are arranged along a line, i.e., the dimension is one; in the second, the neurons are arranged on a grid, i.e., the dimension is two. We find that the critical threshold of the phase transitions is two in the line case and four in the grid case.
2 Phase Transitions Driven by Threshold in Random Neural Network
Random neural networks (RNN) [6],[7] are composed of finitely many interconnected neurons that operate as threshold automata in an asynchronous manner. Each neuron is either firing or quiet, and emits an output line that may branch after leaving the neuron; every branch terminates as an excitatory or inhibitory input connection to another neuron. Any number of input connections may terminate at a neuron. Every neuron has the same threshold Z and fires if and only if the number of excitatory inputs is not less than Z and there is no inhibitory input to the neuron. An input connection exists from a neuron to any one of the others with a fixed probability u; an input connection is excitatory with probability v and inhibitory with probability 1 - v. Any change of state of a neuron incurs a response delay; response delays are independent identically distributed random variables following a negative exponential distribution with rate r. The probability that a neuron transits from the quiet state to the firing state, given that the number of firing neurons is i, is denoted g_i; accordingly, the probability that a neuron turns from firing to quiet given that i neurons are firing is denoted h_i. According to the rule introduced above, if i is less than the threshold Z, then g_i = 0 and h_i = 1; if i is not less than Z, i.e., i >= Z, then

g_i = Σ_{k=Z}^{i} C(i, k) u^k (1 - u)^{i-k} v^k    (1)
h_i = 1 - g_i    (2)
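For concreteness, Eqs. (1)-(2) can be evaluated directly; the sketch below is our own illustration (the function names are ours, not from the paper):

```python
from math import comb

def g(i, Z, u, v):
    """Eq. (1): probability that a quiet neuron fires, given i firing neurons."""
    if i < Z:
        return 0.0          # below threshold the neuron cannot fire
    return sum(comb(i, k) * u**k * (1 - u)**(i - k) * v**k
               for k in range(Z, i + 1))

def h(i, Z, u, v):
    """Eq. (2): probability that a firing neuron turns quiet."""
    return 1.0 - g(i, Z, u, v)
```

With v = 1 every connected input is excitatory, so g_i reduces to the binomial tail probability that at least Z of the i possible connections are present.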
We consider two state spaces of the RNN to discuss the relation between the thresholds of the neurons and phase transitions: one is the rectangular-grid state space, for which the corresponding random neural network is called the two-dimension network; the other is the line state space, giving the one-dimension network.

2.1 Two Dimension Network
Suppose a random neural network is assembled from N neurons. Besides the self-organization of these neurons according to the rule described above, the operation of the network is also affected by M external inputs, whose states can be firing or quiet just like those of the ordinary neurons but whose operation is independent of them. The times each input sojourns in the quiet (firing) state are independent identically distributed random variables following a negative exponential distribution with
Phase Transitions Caused by Threshold in Random Neural Network
1161
rate m (m1). Although these M inputs behave independently of the ordinary neurons, the probability that an input connects to any one of the ordinary neurons is u; an input connection is excitatory with probability v and inhibitory with probability 1 - v. The behavior of the system can then be described by a Markov process {Q(t), W(t), t >= 0}, where Q(t) is the number of firing neurons and W(t) is the number of firing inputs of the network at time t, with the state space {(i, j), 0 <= i <= N, 0 <= j <= M}, which can be represented by a rectangular grid. The state transition rates are shown in Fig. 1.
Fig. 1. Transition rates for the Markov process with rectangular-grid state space. The transition rates are represented by eight symbols: A is (M - j)m, B is (j + 1)m1, C is i r h_{i+j-1}, D is (N - i + 1) r h_{i+j}, E is (i + 1) r h_{i+j}, F is (N - i) r g_{i+j}, G is (M - j + 1)m, H is j m1.
If the number of inputs M is chosen less than the threshold Z, then the set {(0, j), 0 <= j <= M} is a closed state set, since the network would always be quiet: a state in the set cannot reach any state outside it and vice versa. So M is chosen so that M >= Z; the Markov process is then a finite-state irreducible Markov process, for which there exists a steady-state probability distribution denoted P = (P_{i,j}), where P_{i,j} = lim_{t->infinity} P(Q(t) = i, W(t) = j). P_{i,j} must satisfy equations of the general form

[(M - j)m + j m1 + i r h_{i+j-1} + (N - i) r g_{i+j}] P_{i,j} = (M - j + 1) m P_{i,j-1} + (j + 1) m1 P_{i,j+1} + (N - i + 1) r g_{i+j-1} P_{i-1,j} + (i + 1) r h_{i+j} P_{i+1,j}    (3)
It is noted that when the general form is applied to states on the border of the rectangular-grid state space, some terms are absent as appropriate. In addition to these (N + 1)(M + 1) equations, the limit probabilities P_{i,j} must meet the normalizing condition Σ_{i=0}^{N} Σ_{j=0}^{M} P_{i,j} = 1. So the steady-state
probability distribution P can easily be obtained numerically. The parameters of the system are chosen as N = 100, M = 10, u = 0.1, v = 1, m = m1 = 1, r = 5, and the threshold Z is varied from zero to ten. We find that for thresholds Z < 4 the corresponding distributions P are nearly the same; for Z > 4 the distributions also resemble each other, but they are totally different from those of the case Z < 4. The probability distribution at Z = 4 is a critical case. Namely, the network has at least two
phases, and Z = 4 is the critical point at which phase transitions occur. We take the mean firing rate of the network as a measure of the phase transition. Fig. 2 shows the variation of the network's mean firing rate versus the threshold, from which we can clearly see that the mean firing rate changes sharply around the critical threshold Z = 4.
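The balance equations (3), together with the normalization, form a linear system that can be solved numerically. The following sketch is our own construction of the generator matrix from the rates of Fig. 1, not the authors' code, and it uses a smaller N than the paper's to stay light:

```python
import numpy as np
from math import comb

def g(i, Z, u, v):
    # Eq. (1)
    if i < Z:
        return 0.0
    return sum(comb(i, k) * u**k * (1 - u)**(i - k) * v**k for k in range(Z, i + 1))

def steady_state(N, M, Z, u, v, m, m1, r):
    """Stationary distribution P[i, j] of the grid Markov process of Fig. 1."""
    h = lambda n: 1.0 - g(n, Z, u, v)
    S = (N + 1) * (M + 1)
    idx = lambda i, j: i * (M + 1) + j
    Q = np.zeros((S, S))                                   # generator matrix
    for i in range(N + 1):
        for j in range(M + 1):
            s = idx(i, j)
            if j < M:
                Q[s, idx(i, j + 1)] = (M - j) * m          # a quiet input fires
            if j > 0:
                Q[s, idx(i, j - 1)] = j * m1               # a firing input quiets
            if i < N:
                Q[s, idx(i + 1, j)] = (N - i) * r * g(i + j, Z, u, v)  # a neuron fires
            if i > 0:
                Q[s, idx(i - 1, j)] = i * r * h(i + j - 1)             # a neuron quiets
            Q[s, s] = -Q[s].sum()
    # Solve pi Q = 0 with sum(pi) = 1 via least squares on the augmented system.
    A = np.vstack([Q.T, np.ones(S)])
    b = np.zeros(S + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi.reshape(N + 1, M + 1)

def mean_firing_rate(P):
    N = P.shape[0] - 1
    return sum(i * P[i].sum() for i in range(N + 1)) / N

P = steady_state(N=30, M=6, Z=4, u=0.1, v=1.0, m=1.0, m1=1.0, r=5.0)
```

Sweeping Z and plotting mean_firing_rate(P) reproduces the kind of threshold dependence shown in Fig. 2, with high firing below the critical threshold and low firing above it.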
Fig. 2. Mean firing rate versus the threshold
2.2 One Dimension Network
We further discuss the relation between phase transition and threshold under the condition that the state space of the network is represented by a line, namely a one-dimension state space. The neural network is composed of N neurons as before. The running of the network is affected by the arrival of stimulating pulses generated by an external source in the environment. The stimulating pulses arrive at each neuron according to a Poisson process with rate b; each arriving pulse connects to the neuron with probability u; a connected pulse is excitatory with probability v and inhibitory with probability 1 - v. A quiet neuron fires immediately if it is connected to an excitatory arriving pulse; a firing neuron turns quiet instantaneously if it is connected to an inhibitory arriving pulse. Otherwise, the network operates in the self-organized manner described above. The behavior of the neural network can also be described by a Markov process {Q(t), t >= 0} with state space {i, 0 <= i <= N} represented by a line, where Q(t) is the number of firing neurons at time t. Transitions between the states rely on the self-organization of the network and the arriving pulses; Fig. 3 displays the transition rates for a state i. The Markov process is also a finite-state irreducible Markov process, so there exists a steady-state probability distribution P = [P_0, P_1, ..., P_N], which must satisfy the following three equations:

N(buv + r g_0) P_0 = [bu(1 - v) + r h_0] P_1    (4)
{i[bu(1 - v) + r h_{i-1}] + (N - i)(buv + r g_i)} P_i = (N - i + 1)(buv + r g_{i-1}) P_{i-1} + (i + 1)[bu(1 - v) + r h_i] P_{i+1},  for 1 <= i <= N - 1    (5)

N[bu(1 - v) + r h_{N-1}] P_N = (buv + r g_{N-1}) P_{N-1}    (6)
The solution to these N + 1 equations is obtained analytically:

P_i = C(N, i) P_0 Π_{k=0}^{i-1} (buv + r g_k) / [bu(1 - v) + r h_k]    (7)

P_0 is calculated from the normalizing condition Σ_{i=0}^{N} P_i = 1:

P_0 = [1 + Σ_{i=1}^{N} C(N, i) Π_{k=0}^{i-1} (buv + r g_k) / (bu(1 - v) + r h_k)]^{-1}    (8)
In this situation, we find that the critical threshold is two. That is, if the thresholds of the neurons are less than two, the corresponding steady probability distributions are of the same kind; likewise, the distributions for thresholds larger than two are approximately identical to one another; but the distributions for Z < 2 are totally distinct from those for Z > 2. This indicates that when the threshold passes from a value less than two to a value larger than two, or vice versa, the limit distribution varies significantly. This is again a well-defined phase transition, which is likewise measured by the mean firing rate of the network, as shown in Fig. 4. A simple mathematical derivation can show that the steady probability distribution of the random neural network follows a Poisson distribution. Therefore the random neural network is a Poisson neural network, a specific kind of Markov neural network, and hence Gibbs, according to O.K. Kozlov's theory of the Gibbs description of point random fields [9]. We can thus say that the random neural network is a specific kind of abstract neural automaton. It is of primary significance to point out that the limit states of the random neural network are not unique: from the above empirical analysis it is evident that the network has at least two limit distributions, which transit to each other as the threshold changes. This is homologous with the ANA's variability of structure.
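The closed form (7)-(8) is cheap to evaluate. The following sketch is our own code, using the parameters of Fig. 4; note that with v = 1 the case Z = 0 degenerates (h_k = 0 makes the firing state absorbing), so thresholds start at Z = 1 here:

```python
from math import comb

def g(i, Z, u, v):
    # Eq. (1)
    if i < Z:
        return 0.0
    return sum(comb(i, k) * u**k * (1 - u)**(i - k) * v**k
               for k in range(Z, i + 1))

def one_dim_distribution(N, Z, u, v, b, r):
    """Steady-state distribution [P_0, ..., P_N] from Eqs. (7)-(8)."""
    h = lambda k: 1.0 - g(k, Z, u, v)
    w = [1.0]                               # unnormalized P_i / P_0
    for i in range(1, N + 1):
        ratio = 1.0
        for k in range(i):
            ratio *= (b * u * v + r * g(k, Z, u, v)) / (b * u * (1 - v) + r * h(k))
        w.append(comb(N, i) * ratio)
    total = sum(w)                          # this is 1/P_0, as in Eq. (8)
    return [x / total for x in w]

def mean_firing_rate(P):
    return sum(i * p for i, p in enumerate(P)) / (len(P) - 1)

# Parameters of Fig. 4: N = 19, u = 0.1, v = 1, b = 1, r = 1
P_low = one_dim_distribution(19, 1, 0.1, 1.0, 1.0, 1.0)    # Z below the critical value 2
P_high = one_dim_distribution(19, 3, 0.1, 1.0, 1.0, 1.0)   # Z above the critical value 2
```

Comparing the mean firing rates of P_low and P_high reproduces the sharp drop across the critical threshold that Fig. 4 reports.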
Fig. 3. State transitions for the Markov process with one-dimension state space. A is (N - i + 1)(buv + r g_{i-1}), B is i[bu(1 - v) + r h_{i-1}], C is (N - i)(buv + r g_i), D is (i + 1)[bu(1 - v) + r h_i].
Fig. 4. The phase transitions in the one-dimension neural network. The parameters are chosen as: N = 19, u = 0.1, v = 1, b = 1, r = 1, 0 <= Z <= 3.
3 Medical Application
As a specific kind of abstract neural automaton, the random neural network has medical applications, especially in Traditional Chinese Medicine (TCM), which is considered a classical treasure of China and is on its way to standardization [10],[11]. However, TCM is seen as somewhat occult; even expert TCM practitioners cannot explicitly explain, at the molecular level, how they diagnose. So many people, particularly in the West, do not regard TCM as being as scientific as Western medicine. Indeed, the distillation of TCM is "Bian Zheng Lun Zhi": TCM experts first identify and determine which Zheng (called a syndrome) a patient has, based on information gathered from watching, snuffing, inquiring, and feeling the pulse (the four procedures are denoted Si Zhen), and then they prescribe. The syndrome is key in the system of Bian Zheng Lun Zhi, and study of the syndrome is the core of the study of the basic theory of TCM. Here, we formally declare that syndromes are philosophical concepts that exist in the brain of TCM experts, and that the phase transitions between them lead to the thought of the brain. In fact, a syndrome is a diagnostic concept produced by mapping symptoms (the Si Zhen information) into the brain of TCM practitioners. A syndrome has a close relation with disease; for example, a patient who suffers coronary heart disease often suffers blood stasis syndrome, which is characterized by some key symptoms [8]. As shown in Table 1, blood stasis syndrome exists in three complex diseases: coronary heart disease, diabetes (sugar diabetes), and cerebral infarction. However, the symptoms that correspond to each disease's syndrome differ immensely from one another, so each disease's syndrome is called a subtype of the blood stasis syndrome. We apply the one-dimensional random neural network with phase transitions to interpret syndrome in TCM. We stated above that a syndrome is a concept
Fig. 5. The x-axis shows the states, which represent symptoms; the y-axis shows the corresponding limit probability of each state. Each probability distribution is a disease's syndrome subtype.
of the brain: it is abstract, while a symptom is concrete. In this application, a symptom is represented by a state of the Markov process that describes the behavior of the network, while a syndrome is manifested by the limit probability distribution of the Markov process. Since the number of phases of the neural network is two, each
Table 1. Symptoms of the three diseases' syndrome subtypes. The coronary heart disease (CHD) subtype has 12 symptoms, diabetes (SD) 11, and cerebral infarction (CI) 15. The first and second diseases have 4 symptoms in common, the first and the last have 5 in common, and the second and the last have 4.

Disease | Symptoms corresponding to blood stasis syndrome
CHD     | Angina; Palpitation; Dyspnea; Lassitude; Dark lips; Squamous and dry skin; Dark eye orbit; Dysphoria with feverish sensation in the chest, palms and soles; Dark purple tongue marked with ecchymosis; Petechia on the tongue; Engorged sublingual veins; Wiry pulse
SD      | Polyoresia; Emaciate; Gender; Smoking; Frequent nocturia; Dark lips; Palpitation; Darkish complexion; Dark purple tongue marked with ecchymosis; Engorged sublingual veins; Unsmooth pulse
CI      | Hemiplegia; Headache; Vertigo; Pain in nape; Lethargy; Age; Profession; Sign of palate mucous; Squamous and dry skin; Emaciate; Dark purple tongue marked with ecchymosis; Petechia on the tongue; Engorged sublingual veins; Unsmooth pulse; Wiry pulse
time two syndrome subtypes can be expressed at once. We align all symptoms of the two subtypes and assign each of them a state from the state space of the Markov process. Take the symptoms of sugar diabetes and cerebral infarction as an example: the two subtypes have 4 symptoms in common; the former has 11 symptoms and the latter 15, so the total is 22, which is taken as the number of neurons. The other parameters of the network are set as: connection probability u = 0.1; excitatory probability v = 1; arrival rate b = 1; response delay rate r = 1. The limit probability distribution is depicted in Fig. 5. The symptoms are placed on the x-axis: the first 15 symptoms belong to cerebral infarction's syndrome subtype, the latter 11 belong to diabetes's subtype, and the middle 4 are the overlapping symptoms, namely Emaciate, Dark purple tongue marked with ecchymosis, Engorged sublingual veins, and Unsmooth pulse. It should be noted that as long as the four overlapping symptoms are represented by states 12, 13, 14, 15 (regardless of order), the network can successfully interpret the syndrome. From Fig. 5 we can see that when Z = 1, the probabilities of the latter 11 states are obviously larger than those of the former, which approximate 0; this indicates that the distribution at Z = 1 can represent diabetes's syndrome subtype. At Z = 2, the other phase, the probabilities of the first 15 states are significantly larger than those of the latter, which nearly vanish. So we can conclude that the random neural network can successfully interpret disease and syndrome in TCM. The other combinations of the three diseases can easily be separated in the same way.
4 Conclusions
This contribution is devoted to detecting phase transitions driven by the threshold of neurons in the one-dimension random neural network and its two-dimension counterpart, respectively. In the former case the critical threshold is two, while in the latter it is four. The random neural network, whose limit
distribution is Poisson, is shown to be a special abstract neural automaton, the limit configuration of which is a Gibbs distribution. Finally, the one-dimension network is applied to interpret the blood stasis syndrome of Traditional Chinese Medicine; the successful result suggests that the syndrome of TCM can be considered as a science.
References
1. Xi, G.C.: Abstract Neural Automata. Kybernetes: The International Journal of Systems and Cybernetics 27 (1998) 81-86
2. Xi, G.C.: Variability of Structure of Abstract Neural Automata and the Ability of Thought. Kybernetes: The International Journal of Systems and Cybernetics 30 (2003) 1549-1554
3. Hoshino, O., Kashimori, Y., Kambara, T.: Self-Organized Phase Transitions in Neural Networks as a Neural Mechanism of Information Processing. PNAS 93 (1996) 3303-3307
4. Hertz, J., Krogh, A., Thorbergsson, G.: Phase Transitions in Simple Learning. J. Phys. A: Math. Gen. 22 (1989) 2133-2150
5. Hoshino, O., Kashimori, Y., Kambara, T.: Self-Organized Phase Transitions in Neural Networks as a Neural Mechanism of Information Processing. PNAS 93 (1996) 3303-3307
6. Gelenbe, E., Stafylopatis, A.: Global Behavior of Homogeneous Random Neural Systems. Appl. Math. Modelling 15 (1991) 534-541
7. Jo, S., Yin, J., Mao, Z.H.: Random Neural Networks with State-Dependent Firing Neurons. IEEE Transactions on Neural Networks 16 (2005) 980-983
8. Yao, K.W.: Quantitative Diagnosis of Blood Stasis Syndrome and Research on Combination of Syndrome with Disease. Doctoral Thesis, Chinese Academy of Traditional Chinese Medicine (2004)
9. Kozlov, O.K.: Gibbsian Description of Point Random Fields. Theory Probab. Appl. 21 (1976) 339-355
10. Xue, T.H., Roy, R.: Studying Traditional Chinese Medicine. Science 300 (2003) 740-741
11. Normile, D.: The New Face of Traditional Chinese Medicine. Science 299 (2003) 188-190
Multiresolution of Clinical EEG Recordings Based on Wavelet Packet Analysis

Lisha Sun1, Guoliang Chang2, and Patch J. Beadle2

1 Key Lab of Intel. Manuf. Tech. of State Education Ministry, College of Engineering, Shantou University, Guangdong 515063, China
[email protected]
2 School of System Engineering, The University of Portsmouth, Portsmouth, U.K.
Abstract. A method for extracting specified rhythms of the clinical electroencephalogram (EEG) is proposed, based on wavelet packet decomposition. Exploiting its ability to resolve a signal accurately into desired time-frequency components, EEG signals are preprocessed and decomposed into a series of rhythms for many clinical applications, and specified dynamic EEG rhythms can be accurately filtered with a designed wavelet structure. In addition, we present a wavelet packet entropy method for processing EEG signals: both the relative wavelet packet energy and the wavelet packet entropy are presented as quantitative parameters to measure the complexity of the EEG signal. Several experiments with real EEG signals are carried out to show that the proposed method excels the common discrete wavelet decomposition. The presented procedure can isolate specific EEG rhythms accurately and can also be regarded as an efficient method for analyzing non-stationary signals in practice.
1 Introduction

The EEG carries important information about the potentials in the cortex, or on the surface of the scalp, generated by the physiological activities of the brain. The EEG has become a common and effective tool for clinical analysis, since detecting changes in EEG signals is critical for understanding brain functions and for many applications in neuroscience. Various clinical measurement tools are in wide use, but the EEG, as a non-destructive testing method, still plays a key role in the diagnosis of the brain and in brain functional analysis. Recently, many kinds of signal processing techniques have been proposed for studying dynamic EEG signals and the corresponding functions. Power spectral analysis via the Fourier transform has been widely used for quantitative analysis of EEGs, but spectral analysis is appropriate only for investigating stationary signals of simple dynamics consisting of a linear superposition of a few independent, strong, non-evolving periodicities [1,2]. In other words, traditional spectral analysis has severe drawbacks for analyzing practical EEG signals, owing to the transient periodicities and non-stationarity of practical EEGs, such as records corresponding to sleep stages, epileptic transients, and changes in the physiological state of the patient. Furthermore, evoked potentials (EPs) reflect event-related non-stationary records [3,4]. With the

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1168-1176, 2007. © Springer-Verlag Berlin Heidelberg 2007
Multiresolution of Clinical EEG Recordings Based on Wavelet Packet Analysis
1169
development of modern signal processing techniques, analysis of the transient characteristics of EEG recordings has become widely accepted, for example for analyzing some types of artifacts and epileptogenic transients in EEG signals. To study the time-dependent spectrum of non-stationary EEG signals, several approaches have been developed, including the short-time Fourier transform (STFT), the Wigner-Ville representation, and time-varying parametric models. The STFT assumes stationarity of the signal within a temporal window so as to match the time-frequency resolution chosen for the spectral analysis; its main problem is the fixed time-frequency resolution trade-off that results from windowing the signal. The Wigner-Ville distribution (WVD) is another common procedure for time-varying spectral analysis, but its most significant drawback is the existence of cross-terms when analyzing multi-component signals. As a parametric model approach, the time-varying AR model is used for transient spectral estimation of non-stationary signals; the significant limitation of time-varying parametric models is the difficulty of establishing the model properly for different practical signals, such as the selection of the model order and the basis functions. With a variable window length, the wavelet transform (WT) can overcome some of the drawbacks indicated above and provides time-frequency filtering capabilities. Wavelet analysis has been applied with considerable success to non-stationary signals, but many problems associated with clinical EEG application and feature extraction still need to be investigated further. Moreover, automatic methods generally do not stand comparison with traditional visual EEG analysis by trained physicians [5-8]. For this purpose, the aim of this paper is to present a new method for the effective detection of transients in EEG signals based on the wavelet transform.
We mainly investigate the time-frequency characteristics of the different spontaneous brain rhythms and new techniques for extracting the time-varying rhythms of the signals by employing wavelet packet decomposition. In addition, the multi-channel time-varying rhythms are applied to reconstruct the Dynamic Topographic Brain Mapping (DTBM), which enables physicians to understand the changes of multi-channel brain activities in a specific rhythm of the EEG recordings. Finally, the wavelet packet entropy is presented as a quantitative parameter to measure the complexity of the EEG signal.
2 Proposed Approach

2.1 Wavelet Transform

The wavelet transform is an effective method for the time-frequency analysis of non-stationary signals. It can decompose a temporal signal into a summation of time-domain basis functions of various frequency resolutions. The wavelet is a smooth and quickly vanishing oscillating function with good localization in both frequency and time [9,10]. Generally, a wavelet ψ(t) is a function of zero average:

∫_{-∞}^{+∞} ψ(t) dt = 0    (1)
1170
L. Sun, G. Chang, and P.J. Beadle
which is dilated with a scale parameter a and translated by b:

ψ_{a,b}(t) = (1/√a) ψ((t - b)/a)    (2)

where a, b ∈ R, a ≠ 0. The wavelet transform of a signal f(t) at scale a and position b is computed by correlating f(t) with the wavelet function:

Wf(a, b) = ∫_{-∞}^{+∞} f(t) (1/√a) ψ*((t - b)/a) dt = ⟨f, ψ_{a,b}⟩    (3)

If the wavelet satisfies the admissibility condition

c_ψ = ∫_{-∞}^{+∞} |ψ̂(ω)|² / |ω| dω < ∞    (4)

where ψ̂(ω) is the Fourier transform of ψ(t), then f(t) can be reconstructed by the following relationship:

f(t) = (1/c_ψ) ∫_{-∞}^{+∞} ∫_{-∞}^{+∞} Wf(a, b) ψ_{a,b}(t) (1/a²) da db    (5)
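Equation (3) can be evaluated numerically by correlating samples of f(t) with discretized, scaled copies of the wavelet. The sketch below is our own illustration, using the real-valued Mexican-hat wavelet as an assumed choice of ψ(t):

```python
import numpy as np

def mexican_hat(t):
    """Mexican-hat (Ricker) wavelet: a smooth, quickly vanishing, zero-average psi(t)."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt(f, scales, dt=1.0):
    """Discretization of Eq. (3): W[a, b] = sum_t f(t) psi((t - b)/a) / sqrt(a) * dt."""
    f = np.asarray(f, dtype=float)
    t = np.arange(len(f)) * dt
    W = np.empty((len(scales), len(f)))
    for si, a in enumerate(scales):
        for bi, b in enumerate(t):
            psi = mexican_hat((t - b) / a) / np.sqrt(a)
            W[si, bi] = np.sum(f * psi) * dt      # psi is real, so psi* = psi
    return W
```

Each row of W is the signal "seen" at one scale; larger scales respond to slower rhythms, which is the property exploited below for isolating EEG bands.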
Wavelet analysis allows a simultaneous and varying time-frequency resolution, which leads to a multiresolution representation of non-stationary physical signals.

2.2 Wavelet Packet Transform
The main problem of the WT is that the frequency resolution is poor in the high-frequency region. In many applications, the wavelet transform may not generate a spectral resolution fine enough to meet the requirements of the problem. The use of wavelet packets is a generalization of wavelets in which each octave frequency band of the wavelet spectrum is further subdivided into finer frequency bands by applying the two-scale relations repeatedly. The wavelet packet functions can be obtained by [11,12]:

ψ_{j+1}^{2i-1}(t) = √2 Σ_{k=-∞}^{∞} h(k) ψ_j^i(2t - k)    (6)

ψ_{j+1}^{2i}(t) = √2 Σ_{k=-∞}^{∞} g(k) ψ_j^i(2t - k)    (7)

The first wavelet ψ(t) denotes the so-called mother wavelet function; h(k) and g(k) represent the quadrature mirror filters associated with the scaling function and the mother wavelet function. The recursive relations between level j and level j + 1 are given as

f_{j+1}^{2i-1}(t) = Σ_{k=-∞}^{∞} h(k) f_j^i(2t - k)    (8)

f_{j+1}^{2i}(t) = Σ_{k=-∞}^{∞} g(k) f_j^i(2t - k)    (9)

The wavelet coefficients c_j^k can be obtained from

c_j^k = ∫_{-∞}^{∞} f(t) ψ_j^i(t) dt    (10)
Each wavelet packet subspace can be viewed as the output of a filter turned to a particular basis. Thus, a signal can be decomposed into many wavelet packet components. A signal may also be represented by a selected set of wavelet packets for a given level of resolution. Different combination of wavelet packet should be chosen for specific purpose. Signal f (t )
Fig. 1. The wavelet packet decomposition of a time-domain signal
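The two-scale relations (6)-(9) and the decomposition tree of Fig. 1 can be sketched numerically. The snippet below uses Haar quadrature mirror filters for simplicity (an assumption; the paper uses Daubechies wavelets):

```python
import numpy as np

# Haar quadrature mirror filters (an illustrative assumption).
h = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass (scaling function)
g = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass (mother wavelet)

def wp_split(x):
    """One application of the two-scale relations (6)-(7):
    filter with h and g, then downsample by 2."""
    lo = np.convolve(x, h)[1::2]
    hi = np.convolve(x, g)[1::2]
    return lo, hi

def wp_decompose(x, levels):
    """Full wavelet packet tree: 2**levels leaf components at level j."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nodes = [part for node in nodes for part in wp_split(node)]
    return nodes

leaves = wp_decompose(np.sin(np.linspace(0, 8 * np.pi, 256)), 3)
print(len(leaves))  # 8 components, as at level j = 3 in Fig. 1
```

Because the Haar filters are orthonormal, the total energy of the leaves equals the energy of the input signal, which is the property exploited in Section 2.3.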
2.3 Wavelet Packet Component Energy
Wavelet packet node energy is more robust in representing a signal than using the wavelet packet coefficients directly [13,14]. Define the signal energy as

$$E_f = \int_{-\infty}^{\infty} f^2(t)\,dt \qquad (11)$$
The wavelet packet component energy $E_{f_j^i}$ can be defined as the energy stored in the component signal $f_j^i(t)$:

$$E_{f_j^i} = \int_{-\infty}^{\infty} \left|f_j^i(t)\right|^2 dt \qquad (12)$$
The total energy of the signal can be decomposed into a summation of wavelet packet component energies corresponding to different frequency bands, obtained by

$$E_{tot} = E_f = \sum_{i=1}^{2^j} E_{f_j^i} \qquad (13)$$
1172
L. Sun, G. Chang, and P.J. Beadle
In order to analyze a specific frequency region, an optimal tree structure should be selected. For example, as shown in Fig. 1, the signal can be covered by $f_1^2(t)$, $f_2^1(t)$ and $f_2^2(t)$, or by $f_1^2(t)$, $f_2^2(t)$, $f_3^1(t)$ and $f_3^2(t)$. Defining the energy of each sub-band as $E_l$, the normalized relative wavelet packet energy can be given as

$$P_l = \frac{E_l}{E_{tot}} \qquad (14)$$
$P_l$ denotes the energy distribution over the wavelet packets. It is sensitive to energy changes in the signal components and represents the energy relation among the wavelet packets.

2.4 Wavelet Packet Entropy
As discussed, the Shannon entropy provides a way of measuring the amount of disorder in a system and can be regarded as a measure of uncertainty about the information content of the system. Thus, following the definition of entropy given by Shannon, the wavelet packet entropy is defined as
$$S_{wp} = -\sum_l p_l \ln p_l \qquad (15)$$
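A minimal sketch of Eqs. (11)-(15) in Python, computing component energies, relative energies and the wavelet packet entropy from a list of component signals (the convention 0 ln 0 = 0 is assumed):

```python
import numpy as np

def wp_energies(components):
    """Eqs. (12)-(13): discrete energy of each wavelet packet component
    and the total energy E_tot of the signal."""
    E = np.array([np.sum(np.asarray(c, float)**2) for c in components])
    return E, float(E.sum())

def wp_entropy(E):
    """Eqs. (14)-(15): relative energies P_l = E_l / E_tot and the
    wavelet packet entropy S_wp = -sum_l p_l ln p_l (0 ln 0 taken as 0)."""
    p = np.asarray(E, float) / np.sum(E)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) + 0.0)

# Mono-frequency-like case (all energy in one band) -> entropy 0;
# uniform energy over 8 bands -> the maximum value, ln 8.
print(wp_entropy([1.0, 0.0, 0.0, 0.0]))   # 0.0
print(round(wp_entropy(np.ones(8)), 4))   # 2.0794
```

This reproduces exactly the two limiting cases discussed next: a mono-frequency signal gives (near-)zero entropy, while a maximally disordered signal gives the maximum entropy.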
If the signal is a mono-frequency signal, all of its energy lies within one frequency band and the energy of every other frequency band is nearly zero. As a result, the relative wavelet packet energies will be 1, 0, 0, ..., which leads to a zero or very low wavelet packet entropy. On the other hand, for a very disordered signal such as a random process, the energy is distributed over every frequency band; the relative wavelet packet energies will be almost equal, leading to a maximum value of the wavelet packet entropy.

The wavelet packet decomposition enables us to choose the best combination of components for representing the EEG rhythms. A particular choice of tree structure, referred to as a "wavelet packet decomposition", is used as a time-varying filter with 4 different filter banks corresponding to the 4 types of time-varying EEG rhythms. For instance, a six-level decomposition with a Daubechies wavelet function is applied to detect the basic rhythms of EEG signals. The lowest frequency resolution can be estimated as

$$\Delta f = \frac{1}{2^6}\cdot\frac{f_s}{2} = 0.7812\ \mathrm{Hz} \qquad (16)$$

The 4 common EEG rhythms, namely the $\beta$ rhythm (13.28-30.47 Hz), $\alpha$ rhythm (7.812-13.28 Hz), $\theta$ rhythm (3.906-7.812 Hz) and $\delta$ rhythm (0.7812-3.906 Hz), can thus be extracted. Several experiments were carried out to demonstrate the time-varying filtering characteristics of the specified rhythms. Some clinical EEG signals are investigated through the wavelet packet transform to show the transients of the rhythms and the satisfactory filtering characteristics of the 4 kinds of rhythms.
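The arithmetic of Eq. (16), and the mapping from level-6 sub-bands to the four rhythms, can be checked in a few lines. The band indices below are our own reading of the quoted rhythm limits, given as an illustration:

```python
# Frequency resolution of Eq. (16) for a six-level decomposition.
fs = 100.0               # sampling frequency used in Sect. 3 (Hz)
levels = 6
df = (fs / 2) / 2**levels
print(df)                # 0.78125 Hz, i.e. the 0.7812 Hz of Eq. (16)

# Sub-band index ranges (multiples of df) consistent with the rhythm
# limits quoted in the text (an assumption about the tree ordering).
rhythms = {"delta": (1, 5), "theta": (5, 10), "alpha": (10, 17), "beta": (17, 39)}
for name, (lo, hi) in rhythms.items():
    print(name, lo * df, "-", hi * df, "Hz")
```

Multiplying the band indices by `df` reproduces the delta (0.78125-3.90625 Hz) through beta (13.28125-30.46875 Hz) limits quoted above, up to the rounding used in the paper.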
In order to study the time-varying characteristics of the different rhythms of multi-channel EEG signals for visual analysis, we develop a novel approach that constructs a dynamic EEG topography, which is very helpful for physicians investigating the changes of multi-channel brain activity in a specific rhythm. The time-varying energy of the specified rhythm of the EEG signal is defined as
$$E^{(i)}(t) = \left|x^{(i)}(t)\right|^2 \qquad (17)$$
The time-varying EEG topography over the 14 channels of brain activity in a specific rhythm is thus obtained, and the specific rhythm can be displayed simultaneously for any short time period of interest. For example, the alpha rhythm transient reflects the main changes of brain electrical activity in a normal person. The time-varying energies of event-related brain rhythms may be tracked by observing the temporal variations of the squares of the wavelet packet coefficients.
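Eq. (17) can be sketched as follows; the moving-average smoothing window is an illustrative addition, since the paper tracks the squared wavelet packet coefficients directly:

```python
import numpy as np

def time_varying_energy(x, win=8):
    """Eq. (17): squared amplitude of a reconstructed rhythm x^(i)(t),
    smoothed with a short moving average (the window length `win` is an
    assumption made for this sketch)."""
    e = np.asarray(x, float) ** 2
    kernel = np.ones(win) / win
    return np.convolve(e, kernel, mode="same")

# A unit-amplitude rhythm has smoothed energy 1 away from the edges.
env = time_varying_energy(np.ones(100))
print(env[50])  # 1.0
```

Evaluating this per channel for the alpha component, and interpolating the 14 channel values over an elliptical head model, yields the topographic maps of Fig. 2(b).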
Fig. 2. (a) Four kinds of time-varying EEG rhythms at the C1 channel. (b) The time-varying contour mapping of the alpha rhythm in (a).
3 Results and Discussion

In this section, real EEG signals from normal subjects were digitally collected with a standard commercial electroencephalograph (Model EEG 4400A, Nihon Kohden Corporation). 14-channel EEGs were acquired at a sampling frequency of 100 Hz with the international 10-20 system, recorded at the scalp locations Fp1, Fp2, F3, F4, C3, C4, P3, P4, O1, O2, F7, F8, T5, T6 [13].

Fig. 2(a) shows the 4 time-varying EEG rhythms of normal EEG data at channel C1, obtained by a 6-level Daubechies wavelet packet decomposition. The transient rhythms are clearly exhibited in the experimental results. It can be seen that the alpha rhythm becomes the main rhythm when the subject is at rest with eyes closed. Fig. 2(b) shows the contour mapping of the time-varying BEAM of the alpha rhythm, displayed as a 2D surface by representing the head's surface as an elliptical area. A two-dimensional interpolation method was applied in this paper to reconstruct the overall 14-channel EEGs from the α rhythm. Thus, the α rhythm's time-varying characteristics over all channels are clearly shown in Fig. 2(b). In addition, the time-varying contour mapping makes it possible to compute scalp spectral density maps of the active areas of the different brain rhythms in the cortex.

The analysis above shows that the alpha rhythm in the EEG is enhanced when the subject closes his eyes, which represents a different brain state compared to the eyes-open status. To compare the two kinds of EEGs, two segments of EEG signal were chosen: the first is a 2-second EEG signal with eyes open, and the other is a 2-second segment with the subject's eyes closed. The wavelet packet decomposition was applied to both EEGs to extract the rhythms. As shown in Fig. 3, the four kinds of rhythms were comparable to each other when the subject opened his eyes.
However, the alpha rhythm was enhanced and became the
Fig. 3. The time-varying wavelet packet entropy of EEG signals with eyes open (solid line) and eyes closed (dotted line)
dominant rhythm when the subject closed his eyes. These primary experiments demonstrate that the proposed method can effectively detect the specified transient EEG rhythms with high time-frequency resolution. The parameters of the wavelet packet decomposition corresponding to the rhythms were successfully used to form the contour mapping, which has important clinical significance for EEG physicians. We also tested the time-dependent wavelet packet entropy, shown in Fig. 3 for both the eyes-closed and eyes-open conditions. Fig. 3 also reflects the temporal average wavelet packet entropy, which can be viewed as a quantitative parameter of the system's complexity. It can be seen from the experiments that brain activity in the eyes-closed state is less disordered than in the eyes-open state.
4 Conclusions

This paper presents a nonlinear method based on wavelet packet analysis for rhythm decomposition and entropy study of EEG signals and ERPs. The wavelet packet transform is applied to design filters with different frequency characteristics in order to extract different kinds of dynamic EEG rhythms. The relative wavelet packet energy and the wavelet packet entropy are calculated: the relative energy provides information about the energy distribution among the different rhythms, while the wavelet packet entropy measures the degree of order/disorder of the clinical EEG signal. The proposed wavelet packet decomposition can be used to isolate specific EEG and ERP rhythms more accurately for practical applications, and it is more flexible and accurate for designing specific filter banks because it better matches the time-frequency characteristics of the EEG signal when extracting different EEG rhythms. Finally, our method can also be used as a new way of analyzing other kinds of medical signals.
Acknowledgement

This work is supported by the Natural Science Foundation of China (60271023 and 60571066) and the Natural Science Foundation of Guangdong.
References

1. Pardey, J., Roberts, S., Tarassenko, L.: A Review of Parametric Modeling Techniques for EEG Analysis. Med. Eng. Phys. 8(1) (1996) 2-11
2. Jung, T.P., et al.: Estimating Alertness from the EEG Power Spectrum. IEEE Transactions on Biomedical Engineering 44(1) (1997) 60-69
3. D'Attellis, C.E., et al.: Detection of Epileptic Events in Electroencephalograms Using Wavelet Analysis. Annals of Biomedical Engineering 25 (1997) 286-293
4. Blanco, S., et al.: Time-Frequency Analysis of Electroencephalogram Series. II. Gabor and Wavelet Transforms. Physical Review E 54(6) (1996) 6661-6672
5. Thakor, N.V., et al.: Multiresolution Wavelet Analysis of Evoked Potentials. IEEE Transactions on Biomedical Engineering 40(11) (1993) 1085-1093
6. Clark, I., et al.: Multiresolution Decomposition of Non-Stationary EEG Signals: A Preliminary Study. Comput. Bio. Med. 25(4) (1995) 373-382
7. Unser, M., Aldroubi, A.: A Review of Wavelets in Biomedical Applications. Proceedings of the IEEE 84(4) (1996) 626-638
8. Blinowska, K.J., Durka, P.J.: Application of Wavelet Transform and Matching Pursuit to the Time-Varying EEG Signals. Proc. of Conf. Artif. Neural Networks in Eng. (1994) 535-540
9. Schiff, S.J., et al.: Fast Wavelet Transformation of EEG. Electroencephalography and Clinical Neurophysiology 91(6) (1994) 442-455
10. Tseng, S., et al.: Evaluation of Parametric Methods in EEG Signal Analysis. Med. Eng. Phys. 17(1) (1995) 71-78
11. Pesquet, J., Krim, H., Carfantan, H.: Time-Invariant Orthonormal Wavelet Representations. IEEE Transactions on Signal Processing 44(8) (1996) 1964-1970
12. Daubechies, I.: Orthonormal Bases of Compactly Supported Wavelets. Communications on Pure and Applied Mathematics XLI (1988) 909-996
13. Quiroga, R., et al.: Wavelet Entropy in Event-Related Potentials: A New Method Shows Ordering of EEG Oscillations. Biological Cybernetics 84 (2001) 291-299
14. Sun, Z.: Continuous Condition Assessment for Bridges Based on Wavelet Packets Decomposition. Proceedings of SPIE - The International Society for Optical Engineering 4337 (2001) 357-367
Comparing Analytical Decision Support Models Through Boolean Rule Extraction: A Case Study of Ovarian Tumour Malignancy

M.S.H. Aung1, P.J.G. Lisboa1, T.A. Etchells1, A.C. Testa2, B. Van Calster3, S. Van Huffel3, L. Valentin4, and D. Timmerman5

1 School of Computing and Mathematical Sciences, Liverpool John Moores University, UK {M.S.Aung,P.J.Lisboa,T.A.Etchells}@ljmu.ac.uk
2 Istituto di Clinica Ostetrica e Ginecologica, Università Cattolica del Sacro Cuore, Rome, Italy
3 Dept. of Electrical Engineering, ESAT-SCD, Katholieke Universiteit Leuven, Belgium {ben.vancalster,sabine.vanhuffel}@esat.kuleuven.be
4 Dept. of Obstetrics and Gynaecology, University Hospital Malmö, Sweden
5 Dept. of Obstetrics and Gynaecology, University Hospitals, Katholieke Universiteit Leuven, Belgium
[email protected]

Abstract. The relative performances of different classifiers applied to the same data are typically analyzed using the Receiver Operating Characteristic (ROC) framework. This paper proposes a further analysis by explaining the operation of classifiers with low-order Boolean rules fitted to the predicted response surfaces using the Orthogonal Search-based Rule Extraction algorithm (OSRE). Four classifiers of malignant or benign ovarian tumours are considered: two logistic regression models and two Multi-Layer Perceptrons with Automatic Relevance Determination (MLP-ARD), each applied to a specific alternative covariate subset. While all models have comparable classification rates by Area Under the ROC curve (AUC), the classification varies for individual cases, and so do the resulting explanatory rules. Two sets of clinically plausible rules are obtained which account for over one half of the malignancy cases with near-perfect specificity. These rules are simple and explicit, and can be prospectively validated in future studies.
1 Introduction

The logic behind the behaviour of parametric classification models is often described with reference to decision boundaries within the data space of N dimensions, where N is the number of input attributes or variables. However, the often complex morphology of the boundaries makes it difficult to describe them explicitly. For this reason, parametric classification models are often treated as black boxes and evaluated only as such, typically using the Area Under the Receiver Operating Characteristic curve (AUC). This study shows how deeper insight can be gained into the decision boundaries if they are approximated by axis-parallel hyper-cubes in the data space. The axis-parallel morphology corresponds to a Boolean rule specifying the limits on each variable in the

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1177-1186, 2007. © Springer-Verlag Berlin Heidelberg 2007
1178
M.S.H. Aung et al.
hyper-cube. Consequently, these statements explain how a model classifies a data point into a particular class. While two models may have similar classification performance, the extracted rules will show whether or not the models are functionally similar. The rule extraction method used for this analysis is the Orthogonal Search-based Rule Extraction algorithm (OSRE) [5]. This is a fast and principled approach for fitting hyper-cubes to the decision boundaries of selected models in order to yield low-order, mutually overlapping Boolean explanatory rules. The method has been used in a variety of applications [2] [4], including the medical domain. The explicit rules generated by the OSRE framework are especially important in safety-critical domains such as decision support for clinical medicine, as they enable direct validation against expert knowledge. This study is focused on a prospectively acquired dataset for ovarian cancer kindly provided by the International Ovarian Tumour Analysis group (IOTA) as part of collaborative research funded by the FP6 Network of Excellence BIOPATTERN (www.biopattern.org). IOTA is a network of 9 clinical and academic centres from Belgium, Sweden, Italy, France and the United Kingdom that collected the data used in this study, known as the IOTA Phase I dataset. The group applied strict data acquisition protocols [6] specifically for the purpose of parametric modeling. The data consist of demographic variables, clinical signs and measurements derived from Doppler ultrasound scans for a cohort of 1066 patients diagnosed with either a benign or a malignant ovarian tumour, with over 40 explanatory variables measured for each patient. The database has been the object of several studies undertaken to produce classification models for diagnosing malignancy [1] [3] [6].
Four models with an AUC of over 0.9 [3] [6] were selected for further investigation with rule extraction. The first two, labeled 'M1' and 'M2', are logistic regression models; the latter two, labeled 'M3' and 'M4', are neural network classifiers, both based on a Bayesian Multi-Layer Perceptron method.
2 Orthogonal Search Based Rule Extraction

Orthogonal Search-based Rule Extraction (OSRE) is a framework for automatic rule extraction and pruning which efficiently extracts low-order rules from the smooth decision surfaces created by analytical classifiers. The OSRE framework uses the data that constructed the smooth model to search, in orthogonal directions, for hyper-cubes containing the regions of the data space for which the model prediction is in-class (Fig. 1). The hyper-boxes that capture in-class data are converted into conjunctive rules expressed by the boundary values of each covariate. As the algorithm is applied to each data item that the model predicts to be in-class, as many rules are produced as there are predicted in-class data, giving the rule set Rn. This large number of rules is then automatically pruned according to predefined performance criteria, intended typically to maximize the proportion of data explained or the positive predictive value (PPV) of the rule set [5].
Fig. 1. An example of creating a hyper-cube: after orthogonally searching the data space from a sample point, the size of the cube is determined by the lengths of the search spans limited by the decision surface from the model or the extreme of the data space
All the rules are expressed in conjunctive form. For example, the hyper-cube in Fig. 1 can be stated as:

$$r_1 = (1 \le a_1 \le 6) \wedge (1 \le a_2 \le 4) \wedge (3 \le a_3 \le 6), \qquad (1)$$
where a1, a2 and a3 are input variables. The initial rule set is first reduced by removing any repeated rules. In some cases this single reduction strategy can reduce Rn to a very small set of conjunctive rules. However, if the data contain continuous variables then, in practice, further rule reduction techniques may be necessary. The next phase of refinement is a minimum specificity filter: rules whose specificity is less than some predefined value are removed from the list. In clinical diagnostic applications we are only interested in rules with a very high specificity, normally not accepting rules whose specificity is less than 0.9. After this, a minimum sensitivity filter is applied to give a measure of the coverage of each rule, i.e., how much of the in-class data the rule covers. Sensitivity and specificity values can then be calculated for the disjunction of all of the remaining rules. It is important to note that the rules can be mutually overlapping. The sensitivity and specificity values of the individual conjunctive rules, or of the combined set of disjunctive rules, can be represented as steps in a Receiver Operating Characteristic (ROC) plot. The final goal of the rule refinement process is to achieve a rule set whose global ROC point is closest to the point with unit sensitivity and specificity.
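A hedged sketch of how a conjunctive rule such as Eq. (1) can be evaluated against data, together with the sensitivity and specificity used by the filters above. The rule bounds are those of the example; the data matrix and labels are made up for illustration:

```python
import numpy as np

# A conjunctive rule in the form of Eq. (1): per-variable bounds.
rule = {"a1": (1, 6), "a2": (1, 4), "a3": (3, 6)}

def rule_mask(X, rule, names):
    """True where a sample satisfies every bound of the conjunctive rule."""
    m = np.ones(len(X), dtype=bool)
    for var, (lo, hi) in rule.items():
        col = X[:, names.index(var)]
        m &= (lo <= col) & (col <= hi)
    return m

def sens_spec(mask, y):
    """Sensitivity and specificity of a rule (or of a disjunction of
    rules, obtained by OR-ing their masks), as used by the OSRE filters."""
    sens = float(np.mean(mask[y == 1]))
    spec = float(np.mean(~mask[y == 0]))
    return sens, spec

names = ["a1", "a2", "a3"]
X = np.array([[2, 3, 4], [7, 3, 4], [5, 1, 3], [0, 0, 0]])  # toy data
y = np.array([1, 1, 1, 0])                                   # 1 = in-class
sens, spec = sens_spec(rule_mask(X, rule, names), y)
print(round(sens, 3), spec)  # 0.667 1.0
```

Each rule then contributes one (1 - specificity, sensitivity) point, and the pruning described above keeps the rule set whose combined point is nearest (0, 1) in the ROC plane.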
3 Data and Classification Models

The International Ovarian Tumour Analysis Group Phase I dataset consists of 1066 patients with adnexal tumours. The data have been prospectively collected according to a standardized protocol at the nine IOTA group centres [6], with much attention given to uniformity of protocols to ensure integrity and consistency. The dataset consists of over 40 clinical, biochemical, and ultrasound variables that have potential for malignancy diagnosis. The clinical variables included family history, age, and any previous hormonal therapy. Also included were variables derived from transvaginal
scans and transabdominal ultrasonography. The presence or absence of pain during the scan was also recorded. The target vector for the analytical models is the outcome of the tumour (benign or malignant).

3.1 Description of Multi-centre Models
Four classification models from previous multi-centre studies that have demonstrated good classification rates are selected for rule extraction: two logistic regression models described in Timmerman et al. [6] and two Multi-Layer Perceptrons with Automatic Relevance Determination described in Van Calster et al. [3]. Both studies aimed to produce models for the diagnosis of malignant ovarian tumours trained on the IOTA Phase I dataset. The variable selection methods and parameter determination algorithms are described in detail in these studies.

M1: Logistic Regression 1 with twelve variables selected. Timmerman et al. [6] describe a linear logistic regression model with a 12-variable subset of the set of 40 variables (Table 1), selected by stepwise multivariate logistic regression. The AUC value of M1 on a validation set is 0.94 (SE = 0.017).

M2: Logistic Regression 2 with a six-variable subset. This model is a simplified version of M1 using six variables, also presented in Timmerman et al. [6]; these variables were the first to be entered into the stepwise selection process of the M1 set. The AUC value of M2 on a validation set is 0.92 (SE = 0.018). The six variables are a subset of those in Table 1, namely Age, Ascites, PapFlow, MaxSolid, WallRegularity and Shadows.

M3: Multi-Layer Perceptron with Automatic Relevance Determination 1 with eleven variables selected. Described in Van Calster et al. [3], this model uses the connectionist Multi-Layer Perceptron neural network for prediction. The network utilises a Bayesian framework

Table 1. Variables used in Logistic Regression 1 (M1)

Variable Name    Description                           Variable Type    Univariate P Value
PerHistOvCa      Personal History of Ovarian Cancer    Binary           0.0096
HormTherapy      Current Hormonal Therapy              Binary           0.0477
Age              Age of patient in years               Continuous
$> \bar{\alpha}$, add the class label $C_i$ to the class set $SC_{q_1,q_2,\ldots,q_d}$ in decreasing order.

Step 5: For each $C_i \in SC_{q_1,q_2,\ldots,q_d}$, compute the corresponding certainty grade as follows and add it to the antecedent fuzzy set $SCF_{q_1,q_2,\ldots,q_d}$:

$$CF_i = \left(\alpha_{C_i} - \bar{\alpha}\right) \Big/ \sum_{k=1}^{M} \alpha_{C_k}, \qquad (5)$$

where $\bar{\alpha} = \sum_{k=1,\,k\neq i}^{M} \alpha_{C_k} \big/ (M-1)$.
From the above method, we can come to some conclusions:

(a) If each pattern belongs to only one class in the fuzzy subspace $A_{q_1}^1 * A_{q_2}^2 * \cdots * A_{q_d}^d$, then $\exists\,\alpha_{C_X} \neq 0$ with $\forall\,\alpha_{C_i} = 0\ (i \neq X)$, so that $SC_{q_1,q_2,\ldots,q_d} = \{C_X\}$ and $SCF_{q_1,q_2,\ldots,q_d} = \{1\}$.

(b) If there is no pattern in the fuzzy subspace $A_{q_1}^1 * A_{q_2}^2 * \cdots * A_{q_d}^d$, then $\forall\,\alpha_{C_i} = 0\ (i = 1, 2, \ldots, M)$ and $SC_{q_1,q_2,\ldots,q_d} = SCF_{q_1,q_2,\ldots,q_d} = NULL$.

(c) If a pattern belongs to more than one class, then $\exists\,\alpha_{C_i} \neq 0$ for several $i$, and there are several elements in $SC_{q_1,q_2,\ldots,q_d}$ and $SCF_{q_1,q_2,\ldots,q_d}$.

Fuzzy Reasoning. Let us assume that we have L valid fuzzy rules generated from the given training patterns. A pattern $X = (x_1, x_2, \ldots, x_p)$ is classified by the
A Method of X-Ray Image Recognition
1237
fuzzy classification system and labelled with the possible class set $S_X$. The process of fuzzy classification proceeds as follows:

Step 1: For each class $C_i\ (i = 1, 2, \ldots, M)$, compute the membership value $\beta_{C_i} = \max\{\mu_{q_1}^1(x_1)\,\mu_{q_2}^2(x_2)\cdots\mu_{q_d}^d(x_d)\,CF_i \mid C_i \in SC_{q_1,q_2,\ldots,q_d} \cap CF_i \in SCF_{q_1,q_2,\ldots,q_d} \cap R_{q_1,q_2,\ldots,q_d} \in S\}$.
Step 2: Sort $\beta_{C_1}, \beta_{C_2}, \ldots, \beta_{C_M}$ in decreasing order.
Step 3: If $\forall\,\beta_{C_i} = 0$, the pattern $X = (x_1, x_2, \ldots, x_p)$ belongs to an unknown class.
Step 4: For $\beta_{C_i} \neq 0$ and $\beta_{C_i} > \bar{\beta}$, add the class label $C_i$ to the class set $S_X$ in decreasing order.
Step 5: Finally, the possible classes of pattern $X = (x_1, x_2, \ldots, x_p)$ are in $S_X$.

3.2 Parallel Neural Networks
Neural networks are a non-parametric alternative for image recognition [11]. The most popular multi-layer neural networks use the backpropagation (BP) algorithm for training. In this study, a fully connected NN with p input neurons, 1 output neuron and p/2 hidden neurons has been simulated.

Training Process. Each NN classifier aims at only one class, so the output layer consists of a single neuron. The number of input neurons and the input features differ across classes; the input features are extracted with the weighted feature extraction (WFE) method above. The BP algorithm is then performed in each neural network separately and repeated until the sum of squared errors becomes equal to or smaller than a certain value. The training rule adjusts the weights in order to move the network output closer to the targets. A variable learning rate has been used in our research, since the performance of the learning algorithm can be improved by allowing the learning rate to change during training. First, the initial network output and error are calculated. At each epoch, new weights are calculated using the current learning rate, and the new output and error are then computed. If the new error exceeds the old error, the new weights are discarded and the learning rate is decreased; otherwise, the new weights are kept, and if the new error is less than the old error, the learning rate is increased.

Recognition Process. When a test pattern is input to the parallel NN, some NNs have already been excluded based on the result of the fuzzy classifiers; only the classes whose membership exceeds the threshold value need their NN to decide whether the test pattern belongs to that class or not. Recognition in all the NNs proceeds at the same time. This parallel NN is easy to extend: to add new classes, one only needs to add the corresponding NN classifiers.
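The variable-learning-rate rule described above can be sketched as follows; the growth and shrink factors and the toy quadratic error surface are assumptions made for illustration:

```python
import numpy as np

def train_adaptive(loss, grad, w, lr=0.1, up=1.05, down=0.7, epochs=100):
    """Sketch of the variable-learning-rate rule: a step is kept only if
    the error does not increase; otherwise it is discarded and the rate
    shrinks. Accepted steps grow the rate. (Factors are assumptions.)"""
    err = loss(w)
    for _ in range(epochs):
        w_new = w - lr * grad(w)
        err_new = loss(w_new)
        if err_new > err:
            lr *= down            # discard new weights, decrease rate
        else:
            w, err = w_new, err_new
            lr *= up              # keep new weights, increase rate
    return w, err

# Toy quadratic in place of the network's sum-of-squared-error surface.
w, err = train_adaptive(lambda w: float(np.sum(w**2)),
                        lambda w: 2 * w, np.array([3.0, -2.0]))
print(err < 1e-6)  # True
```

Because bad steps are discarded, the error is non-increasing by construction, which is the property that makes the schedule safe to apply to the BP updates.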
4 Experiments and Discussion

In this work, experiments have been carried out on X-ray images. Three fuzzy classifiers and eleven parallel neural network classifiers were constructed from 110 training patterns during the training process, and 200 non-sample patterns were tested.
1238
D. Liu and Z. Wang
We adopt the recognition rate and the falseness rate to assess the classifiers. The falseness rate consists of three parts: the loss rate, meaning a hazardous article is considered a normal one; the misinformation rate, meaning a normal article is considered a hazardous one; and the mistake rate, meaning one type of hazardous article is considered as another type. The 6 features for the fuzzy classifiers are extracted by the smallest-information-entropy method of [10], and the 24 features for the neural network classifiers are extracted by the method in Section 2.3. To compare feature extraction methods, we also show the results of using all 52 features and of using the 34 features other than those used in the fuzzy classifiers. Table 1 gives the recognition results on sample patterns for the fuzzy classifiers, the neural network classifiers, and the combined fuzzy and neural network classifiers. Table 2 gives the recognition results on non-sample patterns for the different classifiers. From the experimental results, we can draw the following conclusions: (a) the fuzzy classifiers have an accuracy of 100% for both sample and non-sample patterns; (b) the NN and fuzzy-NN classifiers have an accuracy of 100% for sample patterns; (c) the feature extraction method is effective, and the recognition rate for non-sample patterns can reach 95%.

Table 1. The Recognition Results of sample
Table 2. The Recognition Results of non-sample
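The recognition and falseness rates reported in Tables 1 and 2 can be computed from (true, predicted) label pairs; the labels below are made up for illustration:

```python
def falseness_rates(records):
    """Split the falseness rate into the three parts defined above:
    loss (hazard -> normal), misinformation (normal -> hazard) and
    mistake (one hazard type -> another). `records` holds
    (true_label, predicted_label) pairs; "normal" marks safe items."""
    n = len(records)
    loss = sum(t != "normal" and p == "normal" for t, p in records) / n
    misinfo = sum(t == "normal" and p != "normal" for t, p in records) / n
    mistake = sum(t != "normal" and p != "normal" and t != p
                  for t, p in records) / n
    correct = sum(t == p for t, p in records) / n
    return {"recognition": correct, "loss": loss,
            "misinformation": misinfo, "mistake": mistake}

r = falseness_rates([("knife", "knife"), ("gun", "normal"),
                     ("normal", "gun"), ("knife", "gun")])
print(r)
```

The four rates always sum to 1, since every prediction is either correct or falls into exactly one of the three falseness categories.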
5 Conclusions
In this paper, we proposed a fuzzy-neural system approach for X-ray image recognition. In the fuzzy rule method, a pattern may belong non-exclusively to several classes with different degrees. A neural network classifier is built for each class and used, on the basis of the fuzzy rules, to decide whether a pattern really belongs to that class; the classifiers are then combined to obtain the recognition result. The experimental results show that the new method performs well.
References 1. Singh, S., Singh, M.: Explosives Detection Systems (EDS) for Aviation Security: A Review. Signal Processing 83 (2003) 31-55 2. Bjorkholm, P., Wang, T.R.: Contraband Detection using X-rays with Computer Assisted Image Analysis. Proceedings of the Symposium on Contraband and Cargo Inspection Technology (1992) 111-115 3. Cable, A.P.: Some Aspects of the Use of Intelligent Systems Engineering in the Design of Airport Security Programmes. Proceedings of the First International Conference on Intelligent Systems Engineering, Edinburgh (1992) 77-85 4. Liu, W.: Automatic Detection of Elongated Objects in X-ray Images of Luggage. Masters Thesis, Department of Electrical and Computer Engineering, Virginia Tech. and State University, Blacksburg, VA, (1997) 5. Lu, Q.: The Utility of X-ray Dual-Energy Transmission and Scatter Technologies for Illicit Material Detection. Ph.D. Thesis, Department of Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA (1999) 6. Singh, M., Singh, S.: Image Segmentation Optimization for X-ray Images of Airline Luggage. CIHSPS2004-IEEE International Conference on Computational Intelligence for Homeland Security and Personal Safety, Venice, Italy (2004) 7. Singh, M, Singh, S., Partridge, D.: A Knowledge-Based Framework for Image Enhancement in Aviation Security. IEEE Trans. Systems, Man and Cybernetics B 34 (2004) 2354-2365 8. Liu, D.M.: A Simple and Effective Enhancement Algorithm to X-ray Image. Application Research of Computers (2007) in press 9. Wang, L.L.: Structural X-ray Image Segmentation for Threat Detection by Attribute Relational Graph Matching. International conference on neural networks and brain (2005) 1206-1210 10. Nakashima, T., Nakai, G., Ishibuchi, H.: Improving the Performance of Fuzzy Classification Systems by Membership Function Learning and Feature Selection. IEEE International Conference on Fuzzy System, May 12-17, 1 (2002) 488-493 11. Krzysztof J. 
Cios: Image Recognition Neural Network - IRNN. Neurocomputing (1995) 159-185
Detection of Basal Cell Carcinoma Based on Gaussian Prototype Fitting of Confocal Raman Spectra

Seong-Joon Baek1, Aaron Park1, Sangki Kang2, Yonggwan Won1, Jin Young Kim1, and Seung You Na1

1 The School of Electronics and Computer Engineering, Chonnam National University, Gwangju, South Korea, 500-757
2 Telecommunication R&D Center, Samsung Electronics Co., LTD., South Korea, 426-791
Abstract. Confocal Raman spectroscopy is known to have strong potential for noninvasive dermatological diagnosis of skin cancer. According to previous work, various well-known methods, including the maximum a posteriori probability classifier (MAP), a linear classifier using the minimum squared error criterion (MSE), and a multi-layer perceptron network classifier (MLP), showed competitive results for basal cell carcinoma (BCC) detection. The experimental results are hard to interpret, however, since the classifiers use global features obtained by principal component analysis (PCA). In this paper, we propose a method that can identify which regions of the spectra are discriminating for BCC detection. For this purpose, 5 and 7 Gaussian prototypes were built, located at the typical peak positions of BCC and normal (NOR) tissue spectra respectively. Every spectrum is approximated by a linear combination of the Gaussian prototypes. A decision tree is then applied to identify which prototypes are important for the detection of BCC. Among the 12 prototypes, 5 discriminating prototypes were selected and the associated weights were used as the input feature vector. In experiments involving 216 confocal Raman spectra, support vector machines (SVM) gave 97.4% sensitivity, which confirms that the peak regions corresponding to the selected features are significant for BCC detection and that the proposed fitting method is effective.
1 Introduction
Skin cancer is one of the most common cancers in the world. Recently, the incidence of skin cancer has dramatically increased due to the excessive exposure of skin to UV radiation caused by ozone layer depletion, environmental contamination, and so on. If detected early, skin cancer has a cure rate of 100%. Unfortunately, early detection is difficult because diagnosis is still based on morphological inspection by a pathologist [1].
This work was supported by grant No. RTI-04-03-03 from the Regional Technology Innovation Program of the Ministry of Commerce, Industry and Energy(MOCIE) of Korea.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1240–1247, 2007. c Springer-Verlag Berlin Heidelberg 2007
Detection of Basal Cell Carcinoma Based on Gaussian Prototype Fitting
There are two common skin cancers: basal cell carcinoma (BCC) and squamous cell carcinoma (SCC). Of the two, BCC is the most common skin neoplasm and is difficult to distinguish from the surrounding noncancerous tissue, so clinical dermatologists have called for accurate detection of BCC. The routine diagnostic technique for BCC is pathological examination of biopsy samples. It involves rather complex treatments and relies upon a subjective judgment, which depends on the level of experience of the individual pathologist and can lead to excessive biopsy of tissue. Thus, a fast and accurate diagnostic technique for the initial screening and selection of lesions for further biopsy is needed [2]. To address this problem, some researchers carried out BCC detection using Fourier transform (FT) Raman spectroscopy [3, 4]. However, FT Raman spectra from a long-wavelength excitation laser give a poor signal-to-noise ratio and suffer from background noise, so complicated statistical treatments are required to eliminate the noise. Recently, a direct observation method based on the confocal Raman technique was presented for the dermatological diagnosis of BCC using a shorter-wavelength laser [2]. According to that study, confocal Raman spectra provide promising results for the detection of precancerous and noncancerous lesions without special treatments. Based on this result, we investigated various classification methods, including MAP, probabilistic neural networks, k-nearest neighbor, and MLP classification, for BCC detection [5]. In those experiments, MAP and MLP gave a classification error rate of about 4-5%. An ambiguous category was also introduced to enable perfect classification; the experimental results showed that perfect classification was possible with a reasonable number of ambiguous data, i.e., about 8% of the test patterns in the case of MSE [6].
However, the experimental results are hard to interpret and it is difficult to identify which protein bands are the most discriminating, since the classifiers use global features obtained by PCA. Hence we investigated a method that can identify which regions of the spectra are significant for the detection of BCC. Inspecting the peak distributions of BCC and NOR spectra, we built 5 and 7 Gaussian prototypes located on the typical peak positions of BCC and NOR spectra respectively. Every spectrum was then approximated by a weighted sum of the Gaussian prototypes, and the weights were used as a feature vector. A decision tree was applied to identify which prototypes are important for the detection of BCC. Five of the 12 prototypes were selected and the associated weights were used as an input feature vector. Experiments were carried out to show that the 5 peak regions associated with the selected prototypes are significant bands for BCC detection and that the proposed fitting method is effective.
2 Sample Preparation and Preprocessing
The tissue samples were prepared with conventional treatment. Details for the biological and chemical processes are described in [2]. A skin biopsy and spectral measurements were carried out in the perpendicular direction from the epidermis
S.-J. Baek et al.
to the dermis. Confocal Raman spectra of BCC tissues were measured at different spots with an interval of 30-40 μm. In this way, 216 Raman spectra were collected from 10 patients. After measurement, the Raman spectra were clipped to Raman shifts from 1750 to 1000 cm−1, a region known to contain all important protein bands, e.g., the amide III, lipid and protein, amide I, and phospholipid and nucleic acid modes. The spectra were then normalized so that they fall in the interval [0,1]. To build Gaussian prototypes located on the peak positions, we first searched for peaks of the BCC and NOR spectra. Before peak picking, the spectra were smoothed by a moving average with a span of 25 cm−1. We marked all peak positions and obtained the peak distributions. The smoothed peak distributions are plotted in Fig. 1 and Fig. 2: Fig. 1 shows the 5 dominant peak positions of BCC and Fig. 2 shows the 7 dominant peak positions of NOR.
Fig. 1. Peak distribution of BCC spectra. Straight lines indicate the selected peaks.
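The preprocessing just described (clip to 1000-1750 cm−1, normalize to [0, 1], smooth with a 25 cm−1 moving average, then pick peaks) can be sketched as follows. The array names and the simple local-maximum peak picker are illustrative assumptions, not the authors' code:

```python
import numpy as np

def preprocess(shift, intensity, lo=1000.0, hi=1750.0, span=25.0):
    """Clip a Raman spectrum to [lo, hi] cm^-1, scale it to [0, 1],
    and smooth it with a moving average of width `span` cm^-1."""
    keep = (shift >= lo) & (shift <= hi)
    x, y = shift[keep], intensity[keep]
    y = (y - y.min()) / (y.max() - y.min())      # normalize to [0, 1]
    step = abs(np.median(np.diff(x)))            # sampling interval, cm^-1
    w = max(1, int(round(span / step)))          # window length in samples
    y_smooth = np.convolve(y, np.ones(w) / w, mode="same")
    return x, y, y_smooth

def pick_peaks(y):
    """Indices of simple local maxima (used to build the peak histograms)."""
    return [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] >= y[i + 1]]
```

Running the peak picker over all smoothed spectra of one class and histogramming the resulting shift positions yields peak distributions like those in Fig. 1 and Fig. 2.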
After determining the dominant peak positions, we computed the average width, i.e., the standard deviation of the Gaussian prototypes located on those peak positions. All peak positions and widths of the Gaussian prototypes are listed in Table 1. Given center μi and width σi, the i-th Gaussian prototype is

gi(x) = (1 / (√(2π) σi)) exp(−(1/2) ((x − μi)/σi)²).

Every spectrum is approximated by a linear combination of the Gaussian prototypes. Let the i-th BCC and NOR Gaussian prototypes be gi^b and gi^n. A spectrum y(x) is approximated by the following equations:
y(x) ≈ fb(x) = Σ_{i=1..5} wi^b gi^b(x),   w^b = arg min ||y(x) − fb(x)||²,
y(x) ≈ fn(x) = Σ_{i=1..7} wi^n gi^n(x),   w^n = arg min ||y(x) − fn(x)||²,
Fig. 2. Peak distribution of NOR spectra. Straight lines indicate the selected peaks.

Table 1. Gaussian prototypes for BCC and NOR spectra

        BCC                           NOR
Weight  w1b  w2b  w3b  w4b  w5b      w1n  w2n  w3n  w4n  w5n  w6n  w7n
Center  1098 1340 1410 1457 1588     1090 1255 1317 1450 1535 1595 1658
Width   55   60   35   30   55       40   40   40   25   25   25   25
where w^b and w^n are the weight vectors. The two weight vectors are concatenated to give a feature vector; its dimension is therefore 12. Two approximation examples are shown in Fig. 3 and Fig. 4. Figure 3 shows a BCC spectrum fitted by the gi^b, while Fig. 4 shows a NOR spectrum fitted by the gi^n. The major peaks in the original spectrum are preserved in the fitted spectrum. Thus we can say that Gaussian prototypes located on the typical peak regions can successfully approximate the main characteristics of BCC and NOR spectra.
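Since the prototypes are fixed and only the weights are free, the arg-min fit above is an ordinary linear least-squares problem. A minimal sketch, using the BCC centers and widths from Table 1 and assuming the spectrum is sampled on a NumPy shift axis:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Centers and widths of the BCC prototypes, from Table 1.
BCC_CENTERS = [1098, 1340, 1410, 1457, 1588]
BCC_WIDTHS = [55, 60, 35, 30, 55]

def fit_weights(x, y, centers, widths):
    """Least-squares weights w minimizing ||y - G w||^2, where the columns
    of G are the Gaussian prototypes evaluated on the shift axis x."""
    G = np.column_stack([gaussian(x, m, s) for m, s in zip(centers, widths)])
    w, *_ = np.linalg.lstsq(G, y, rcond=None)
    return w, G @ w  # weights and the fitted spectrum
```

The NOR fit is identical with the seven NOR prototypes; concatenating the two weight vectors gives the 12-dimensional feature vector.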
3 Classification Methods and Experimental Results
A decision tree classifies a pattern through a sequence of questions, in which the next question asked depends upon the answer to the current question. Such a sequence of questions forms nodes connected by successive links or branches downward to other nodes. Each node selects the feature for which the data in the descendant nodes are as pure as possible. With noisy data, however, this decision scheme is not always appropriate, so we used the decision tree as a feature selector rather than as a classifier. Using leave-one-out experiments with the decision tree, we obtained 3 prominent weights {w4b, w6n, w7n}. Two additional weights {w2b, w3n} were selected by visual inspection and experiments. These 5 weights are normalized to fall in [-1,1] and used
Fig. 3. A BCC spectrum fitted by Gaussian prototypes
Fig. 4. A NOR spectrum fitted by Gaussian prototypes
as a feature vector in the subsequent classification. Table 2 shows the protein bands corresponding to these weights. The MLP is a powerful and flexible classifier, since it can adapt to arbitrarily complex posterior probability functions [7]. Each layer has several processing units, called nodes or neurons, which are generally nonlinear units, except for the input nodes, which are simple bypass buffers. The unit operation is characterized by the equation ok = f(netk), where the input to the unit and the activation function are given by

netk = Σ_i wik oi + biask,   f(netk) = 2 / (1 + e^(−2·netk)) − 1.

Since there are only two distinct classes, one output unit was used. MLP models were trained to output -1 for the NOR class and +1 for the BCC class using the back-propagation algorithm. The performance of the MLP varies with the initial conditions; thus the experiments were carried out 20 times and the
Table 2. Protein bands corresponding to the 5 coefficients

Weight  Raman frequency (cm−1)  Vibrational description
w7n     1658                    Amide I mode
w6n     1588                    Amide I mode
w4b     1457                    Lipid and protein mode
w2b     1340                    Amide III mode
w3n     1317                    Amide III mode
results were averaged. At classification time, the output value is hard-limited to give a classification result. The SVM is also a powerful methodology for solving nonlinear classification problems. In the simplest pattern recognition tasks, an SVM uses a linear separating hyperplane to create a classifier with a maximal margin. When the given classes cannot be linearly separated in the original input space, the SVM first nonlinearly transforms the original features into a higher-dimensional feature space. This transformation can be achieved by various nonlinear mappings: polynomial, sigmoid, and radial basis function (RBF). After the nonlinear transformation, a linear optimal separating hyperplane can easily be found. The resulting hyperplane is optimal in the sense of being a maximal-margin classifier with respect to the training data [8]. In the experiments, we used the Gaussian RBF kernel

K(xi, xj) = exp(−||xi − xj||² / σ²),

where σ² is set to 20. Least squares SVM is used as the optimization method, with the regularization parameter set to 1 [9]. The 216 spectra were divided into two groups: a training set and a test set. Specifically, the data from 9 patients were used as the training set and the data from the one remaining patient were used as the test set. Once classification is complete, the data from another patient are removed from the training set and used as new test data, while the previous test data rejoin the training set. In this way, the data from every patient were used as a test set. The average numbers of BCC and NOR spectra are 8 and 14 in the test set and 68 and 126 in the training set, respectively. The classification results with the whole 12 weights are summarized in Table 3. The number of hidden units in the MLP was set to 6. In the table, we can see that the average sensitivity is about 94.1% while the average specificity is about 95.4%. The results are comparable to those in [5] and convince us that the Gaussian prototype fitting gives discriminating features. The experimental results with the 5 selected weights are summarized in Table 4, where the number of hidden units in the MLP was set to 5. According to the results, the average sensitivity is about 96.8% and the average specificity is about 98.3%. The increased classification rates indicate that the selected weights are the most discriminating and that the Gaussian prototype fitting method is very effective. In terms of biomarker discovery, the selected weights can be considered as potential
Table 3. Classification results with the whole 12 weights (%)

Decision of a pathologist   MLP: BCC   MLP: NOR   SVM: BCC   SVM: NOR
BCC                         94.7       5.3        93.4       6.6
NOR                         5.0        95.0       4.3        95.7

Table 4. Classification results with the 5 selected weights (%)

Decision of a pathologist   MLP: BCC   MLP: NOR   SVM: BCC   SVM: NOR
BCC                         96.1       3.9        97.4       2.6
NOR                         2.1        97.9       1.4        98.6
markers with which the abnormal cases (BCC) can be distinguished from the normal cases.
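The evaluation protocol above is leave-one-patient-out cross-validation. A sketch of the rotation, with the σ² = 20 RBF kernel from the paper but a simple kernel nearest-neighbor stand-in for the least-squares SVM (the `patients` array and the `classify` interface are assumptions made for illustration):

```python
import numpy as np

def rbf_kernel(a, b, sigma2=20.0):
    """Gaussian RBF kernel K(a, b) = exp(-||a - b||^2 / sigma^2)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def leave_one_patient_out(X, y, patients, classify):
    """Rotate every patient into the test set once; `classify(Xtr, ytr, Xte)`
    returns predicted labels. Returns overall accuracy."""
    correct = total = 0
    for p in np.unique(patients):
        te = patients == p
        pred = classify(X[~te], y[~te], X[te])
        correct += (pred == y[te]).sum()
        total += te.sum()
    return correct / total

def knn1_rbf(Xtr, ytr, Xte):
    """Stand-in classifier (the paper uses a least-squares SVM): 1-NN under
    the kernel-induced similarity, i.e. pick the training point with max K."""
    K = rbf_kernel(Xte, Xtr)
    return ytr[K.argmax(axis=1)]
```

Because test patients never contribute training spectra, the scheme avoids the optimistic bias of splitting spectra from the same patient across training and test sets.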
4 Conclusion
In this paper, we proposed a method that can identify which bands of the spectra are significant for the detection of BCC. To this end, we built Gaussian prototypes located on the typical peak positions of BCC and NOR spectra. Every spectrum is then approximated by a linear combination of the Gaussian prototypes. Using a decision tree and visual inspection, we selected the 5 most significant of the 12 prototypes, and the 5 weights associated with them are used as an input feature vector. Experiments involving 216 confocal Raman spectra showed a sensitivity of 97.4% in the case of the SVM, which confirms that the peak regions corresponding to the selected weights are the most discriminating and that the proposed Gaussian fitting method is very effective.
Acknowledgments. The authors are grateful to Prof. Jaebum Choo, Hanyang University, Ansan, Korea, for providing the precious data.
References
[1] Nijssen, A., Schut, T.C.B., Heule, F., Caspers, P.J., Hayes, D.P., Neumann, M.H., Puppels, G.J.: Discriminating Basal Cell Carcinoma from its Surrounding Tissue by Raman Spectroscopy. Journal of Investigative Dermatology 119 (2002) 64-69
[2] Choi, J., Choo, J., Chung, H., Gweon, D.G., Park, J., Kim, H.J., Park, S., Oh, C.H.: Direct Observation of Spectral Differences between Normal and Basal Cell Carcinoma (BCC) Tissues using Confocal Raman Microscopy. Biopolymers 77 (2005) 264-272
[3] Sigurdsson, S., Philipsen, P.A., Hansen, L.K., Larsen, J., Gniadecka, M., Wulf, H.C.: Detection of Skin Cancer by Classification of Raman Spectra. IEEE Trans. on Biomedical Engineering 51 (2004) 1784-1793
[4] Nunes, L.O., Martin, A.A., Silveira Jr., L., Zampieri, M., Munin, E.: Biochemical Changes between Normal and BCC Tissue: a FT-Raman Study. Proceedings of the SPIE 4955 (2003) 546-553
[5] Baek, S.J., Park, A., Kim, J.Y., Na, S.Y., Won, Y., Choo, J.: Detection of Basal Cell Carcinoma by Automatic Classification of Confocal Raman Spectra. LNBI 4115 (2006) 402-411
[6] Baek, S.J., Park, A., Kim, D., Hong, S.H., Kim, D.K., Lee, B.H.: Screening of Basal Cell Carcinoma by Automatic Classifiers with an Ambiguous Category. LNCIS 345 (2006) 488-496
[7] Gniadecka, M., Wulf, H., Mortensen, N., Nielsen, O., Christensen, D.: Diagnosis of Basal Cell Carcinoma by Raman Spectra. Journal of Raman Spectroscopy 28 (1997) 125-129
[8] Kecman, V.: Learning and Soft Computing. The MIT Press (2001)
[9] Suykens, J.A.K., Gestel, T.V., Brabanter, J.D., Moor, B.D., Vandewalle, J.: Least Squares Support Vector Machines. World Scientific, Singapore (2002)
Prediction of Helix, Strand Segments from Primary Protein Sequences by a Set of Neural Networks. Zhuo Song, Ning Zhang, Zhuo Yang, and Tao Zhang*. Key Lab of Bioactive Materials, Ministry of Education, and College of Life Science, Nankai University, Tianjin 300071, PR China
[email protected]
Abstract. In prediction of the secondary structure of proteins there are always some suspected segments. These suspected segments confuse people and lower the accuracy of prediction methods. To deal with this problem, a set of neural networks (NNs) is built based on helix, strand and coil segments selected from the PDB. The test performance of these NNs on the training data is, unsurprisingly, excellent. However, the prediction on test data is not good enough, because the training data lack representativeness. The results support the fact that close neighbor vectors have similar NN outputs. One can improve the representativeness of training data without enlarging the data scale by selecting fewer data from dense regions and more from sparse regions, provided the distribution of the sample data is known.
1 Introduction
Prediction of 3D structure from a primary protein sequence is an attractive problem in bioinformatics. If we accept that the secondary structures of a protein determine its 3D structure, prediction of secondary structure becomes an important issue. Since the late 1970s many computational methods have been developed. The data they use include primary amino acid (AA) sequences [1], 3-51 neighboring AAs through a moving window [2], chemical properties of AAs [3], and alignments of sequences in structure-known protein databases that match the query sequences [4]. However, the highest accuracy of these methods is less than eighty percent. Generally, there are 8 types of secondary structure in proteins: H (alpha-helix), G (3-10 helix), I (5-helix/π-helix), B (residue in isolated beta-bridge), E (extended strand), T (hydrogen-bonded turn), S (bend), and "_" (any other structure). These secondary structures can be simplified into three groups: helix (H: "H" and "G"), strand (E: "E" and "B") and coil (C: all remaining types) [5]. Using the symbols H, E and C, primary protein sequences can be transferred to their corresponding secondary structure sequences, of the form …CCHHHHCCEEEEECCC…. Based on existing knowledge and prediction methods, however, there are always some segments which look like either (or both) H or (and) E. When a segment is *
Corresponding author.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1248–1253, 2007. © Springer-Verlag Berlin Heidelberg 2007
supposed to be one kind of secondary structure but it is doubtful whether it is H or E, the segment becomes suspected. To determine the types of suspected segments, a novel method based on neural networks has been developed.
2 Method
Architecture of the novel method. Helix, strand and coil segments are selected from a structure-known protein sequence database. Their two ends are treated in different ways according to the types they belong to. For each protein segment, a 40-dimensional vector is used to describe the composition proportion and the position order of the 20 kinds of amino acid residues in the segment. The ratio of principal structure content in a segment is its NN output value. To use the learning ability of NNs and limit their disadvantages, a set of six NNs is designed. For example, the NN of "H" is trained only by vectors of H whose output values are not less than 0.8; the NN of "HE" is trained by both H and E, in which E's value is defined as 0; the NN of "HEC" is trained by H, E and C, in which E's value is 0 and C's value is 0.5. If the three NNs work well, they should act in the same way: when helix vectors are input, they should return values close to 1. Besides, for the NN of "HE", when E vectors are input the returned value should be close to 0. Fig. 1 shows the detailed architecture of the method.
Fig. 1. A detailed architecture of this prediction method
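The 40-dimensional segment descriptor (composition proportion plus position order of the 20 amino acids) can be sketched as follows. The paper does not spell out the exact position-order encoding, so the normalized mean position used here is an assumption for illustration:

```python
AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def segment_vector(seq):
    """40-D descriptor of a secondary-structure segment: for each amino acid,
    its composition proportion (20 values) plus -- as one plausible reading of
    'position order', which the text leaves unspecified -- its mean position
    in the segment, normalized to [0, 1] (20 values)."""
    n = len(seq)
    comp, pos = [], []
    for aa in AA:
        idx = [i for i, c in enumerate(seq) if c == aa]
        comp.append(len(idx) / n)
        pos.append(sum(idx) / (len(idx) * max(n - 1, 1)) if idx else 0.0)
    return comp + pos
```

Each segment, whatever its length, is thus mapped to a fixed-length input vector for the six NNs.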
Dataset
Selection of secondary structure segments from proteins. Protein data are selected from the ASTRAL SCOP 1.69 genetic domain sequence subsets based on PDB SEQRES records, with less than 40% identity to each other (http://astral.berkeley.edu/) [6]. Domain segments in the above subsets are removed and
only intact protein chains are selected, which belong to different structural classes (all alpha, all beta, alpha/beta, alpha+beta, etc.). The position information of all secondary structures in the proteins is obtained from their PDB files (*.pdb). According to the PDB Contents Guide Ver. 2.1 (http://www.rcsb.org/pdb/docs/format/pdbguide2.2/part_2.html), proteins whose PDB files have format faults are removed too. Under these conditions, 1820 proteins were selected and all their secondary structure segments were picked out.
Define the output value of secondary structure segments used in NNs. In practical applications, the positions of the two ends of long secondary structures cannot be determined exactly. To partly solve this problem, every H or E segment used contains one or two more AA residues on both sides if the added residues belong to coil. To ensure that the principal structure content of these segments is more than 0.8, H and E are treated as follows: L = length of segment; if 8 … H (H>=0.8), HE (H>=0.8; E=0), HEC (H>=0.8; E=0; C=0.5), E (E>=0.8), EH (E>=0.8; H=0), EHC (E>=0.8; H=0; C=0.5) are used as training sets. The segments of H, E and C come from 9403 helix, 2999 strand and 2069 coil segments respectively. According to the Euclid distance of the 40D vectors, vectors in the 40-dimensional space whose distances are not less than 2.5 are selected. In the training datasets, for example in HE, the distance from every vector in H to E is not less than 2.5; for HEC, the H and E data are the same as in HE and the distance from every vector in C to both H and E is not less than 2.5. Finally, 2882 H, 1064 E and 101 C were picked out to constitute the training data of the six NNs. So H is a subset of HE, HE is a subset of HEC, E is a subset of EH, and EH is a subset of EHC. Segments of 6521 H, 1935 E and 1968 C remain as test data. The training dataset is too large to do jackknife tests.
Test method using sets of NNs. The test rules for H and E are based on two types of 4-NN combinations.
One is H, HE, HEC and EH; the other is E, EH, EHC and HE. If H segments are input, the output values of the NNs of H, HE and HEC should be basically equal, and the returned value of EH should be near 0. Similarly, if E segments are input, the values of E, EH and EHC should be basically equal and the output of HE should be near 0. The degree of equality is calibrated by using the training data to test the trained NNs. The mean value plus 3 times the standard deviation is used to eliminate calculation errors.
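Reading the ± entries of Table 2 below as mean ± standard deviation, the H rule can be sketched as a set of mean + 3σ tolerance checks on the four network outputs (the function and threshold names here are illustrative):

```python
# Upper bounds (mean + 3*std) on the absolute differences, taken from Table 2.
H_RULE = {"H-HE": 0.0128 + 3 * 0.0264, "HE-HEC": 0.0125 + 3 * 0.0292,
          "HEC-H": 0.0114 + 3 * 0.0355, "EH": 0.0003 + 3 * 0.0089}

def is_helix(out_H, out_HE, out_HEC, out_EH, rule=H_RULE):
    """Apply the 'H rule': the three helix-trained NNs must agree to within
    the tabulated tolerances, and the opposing EH network must return ~0."""
    return (abs(out_H - out_HE) <= rule["H-HE"]
            and abs(out_HE - out_HEC) <= rule["HE-HEC"]
            and abs(out_HEC - out_H) <= rule["HEC-H"]
            and abs(out_EH) <= rule["EH"])
```

The E rule is the symmetric check on the outputs of E, EH, EHC and HE with the E-rule row of Table 2.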
3 Results and Discussion

3.1 The Accuracy of Trained NNs (Shown in Table 1)

Table 1. The accuracy of the six trained NNs

                      H        HE       HEC      E        EH       EHC
Average square error  0.00015  0.00033  0.00158  0.00039  0.00116  0.00279
Average accuracy      98.8%    98.2%    96%      98%      96.6%    94.7%
3.2 Use Training Data to Test the NNs and Determine the Rules of H and E
For the training data, the absolute values of the differences between groups were calculated under the trained NNs: H and HE, HE and HEC, HEC and H, E and EH, EH and EHC, EHC and E. The values from EH and HE were also calculated. The means plus or minus 3 times the standard deviation were used to create the rules for determining H and E segments (shown in Table 2).

Table 2. Rules for determining H and E

H rule:  H-HE 0.0128 ± 0.0264   HE-HEC 0.0125 ± 0.0292   HEC-H 0.0114 ± 0.0355   EH 0.0003 ± 0.0089
E rule:  E-EH 0.0205 ± 0.0262   EH-EHC 0.0048 ± 0.021    EHC-E 0.041 ± 0.1889    HE 0.0009 ± 0.0172

3.3 Training Data and Test Data Under the Rules of H and E
Table 3 shows the accuracy of the data under the rules of H and E. The results show that neither the 'H rule' nor the 'E rule' can pick out all the real H or E segments, and the H and E segments they select are not always true. This is partly because the selected training data cannot represent all the characteristics of H and E. NNs cannot know what they have not learned, so in the distance space there must be some singular vectors that are not contained in the training data. This could be partly solved by enlarging the training dataset, but then the time and accuracy of training would become a problem. To resolve this dilemma, it is important to know whether, when a vector belongs to H or E, its close neighbor vectors belong to H or E as well. If this is true, and if the distribution of the data in a certain space is known, then by selecting fewer data in dense regions and more data in sparse regions, the representativeness of the data can be improved without enlarging the dataset. The following evidence supports this conjecture.

Table 3. The accuracy of data under the rules of H and E

                        H rule          E rule
Training helix (2882)   2764 (95.9%)    0 (0%)
Training strand (1064)  0 (0%)          999 (93.9%)
Test helix (6521)       1360 (20.9%)    744 (11.4%)
Test strand (1935)      105 (5.4%)      547 (28.3%)
Table 4. Average Euclid distance between two groups

                           Training H              Training E
Test H_H rule (N=1360)     1.54 ± 0.28 (HH/H)      /
Test H_H rule_No (N=5161)  1.98 ± 0.34 (HH_No/H)   /
Test H_E rule (N=744)      2.36 ± 0.24 (HE/H)      1.43 ± 0.20 (HE/E)
Test E_E rule (N=547)      /                       1.37 ± 0.18 (EE/E)
Test E_E rule_No (N=1388)  /                       1.76 ± 0.32 (EE_No/E)
Test E_H rule (N=105)      1.46 ± 0.22 (EH/H)      2.34 ± 0.31 (EH/E)

"Test H_H rule" means segments selected from test H by the rule of H; "Test H_H rule_No" means segments not selected from test H by the rule of H; the others have similar meanings. The smaller number in each row is shown in bold.
3.4 Euclid Distance of Data
To support the above conjecture, the Euclid distance of a vector to a group of vectors should be examined. The distance of a vector to a group of vectors is defined as the minimum distance from the vector to all vectors in the group. For example, if the conjecture
Fig. 2. Average Euclid distances of two groups of data. **P < 0.01.

Definition 3. The support of a pattern X in dataset D, denoted supp_D(X), is defined as supp_D(X) = count_D(X)/|D|, where count_D(X) is the number of elements in D matching the pattern X.

Definition 4. The growth rate of a pattern X from D1 to D2, denoted GrowthRate(X), is defined as

GrowthRate(X) =
  0,                          if supp_D1(X) = 0 and supp_D2(X) = 0,
  ∞,                          if supp_D1(X) = 0 and supp_D2(X) ≠ 0,
  supp_D2(X) / supp_D1(X),    otherwise.
Definition 5. Given a growth-rate threshold ρ > 1, a pattern X is said to be a ρ-emerging pattern (ρ-EP, or simply EP) from D1 to D2 if GrowthRate(X) ≥ ρ.
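Definitions 3-5 translate directly into code. A minimal sketch, with instances represented as attribute → item dictionaries (a representation assumed here for illustration):

```python
import math

def support(pattern, dataset):
    """supp_D(X): fraction of instances in D matching every item of X."""
    hits = sum(all(inst.get(a) == v for a, v in pattern.items())
               for inst in dataset)
    return hits / len(dataset)

def growth_rate(pattern, d1, d2):
    """GrowthRate(X) from d1 to d2, per Definition 4."""
    s1, s2 = support(pattern, d1), support(pattern, d2)
    if s1 == 0:
        return 0.0 if s2 == 0 else math.inf
    return s2 / s1

def is_ep(pattern, d1, d2, rho):
    """X is a rho-emerging pattern from d1 to d2 if GrowthRate(X) >= rho."""
    return growth_rate(pattern, d1, d2) >= rho
```

In the gene-expression setting, the items are discretized expression values (e.g. "gene g below / above its cut point") and D1, D2 are the ALL and AML sample sets.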
H. Wang et al.

3 The ALL/AML Dataset
The human acute leukemia (ALL/AML) gene expression profiles consist of 47 ALL samples and 25 AML samples. Each sample contains 7129 genes. The whole dataset is separated into training samples and test samples. The training samples include 27 ALL samples and 11 AML samples, and the test samples include 20 ALL samples and 14 AML samples.
4 Emerging Patterns Advanced

4.1 Select Discriminatory Genes and Corresponding Cut Points
An entropy-based discretization method [7] uses entropy to select the important features. This method can remove many noisy features and produce items from the remaining features. Let T partition the set S of samples into the subsets S1 and S2, and let P(Ci, Sj) be the frequency of samples belonging to class Ci in Sj. The class entropy of a subset Sj, j = 1, 2, is defined as

Ent(Sj) = − Σ_{i=1..k} P(Ci, Sj) log P(Ci, Sj).    (1)

Suppose the subsets S1 and S2 are induced by partitioning a feature A at cut point T. Then the CIE (class information entropy) of the partition, denoted E(A, T; S), is given by

E(A, T; S) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2).    (2)

The Minimal Description Length Principle is used to stop the process of finding a gene's cut points: recursive partitioning within a set of values S stops iff Gain(A, T; S) < log2(N − 1)/N + Δ(A, T; S)/N, where N = |S|, Gain(A, T; S) = Ent(S) − E(A, T; S), and Δ(A, T; S) is the correction term defined in [7].
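Equations (1) and (2) and the search for the best cut point can be sketched as follows (using the usual boundary-midpoint candidate cuts, assumed here):

```python
from collections import Counter
import math

def ent(labels):
    """Class entropy Ent(S) of a sequence of class labels (Eq. 1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def cie(values, labels, cut):
    """Class information entropy E(A, T; S) of splitting the feature
    values at cut point T (Eq. 2)."""
    s1 = [l for v, l in zip(values, labels) if v <= cut]
    s2 = [l for v, l in zip(values, labels) if v > cut]
    n = len(labels)
    return len(s1) / n * ent(s1) + len(s2) / n * ent(s2)

def best_cut(values, labels):
    """Candidate cut (midpoint between adjacent values) with minimal CIE."""
    xs = sorted(set(values))
    cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(cuts, key=lambda t: cie(values, labels, t))
```

Ranking genes by the CIE of their best cut and keeping the 25 smallest gives the discriminatory genes used in the experiments of Sect. 6.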
Trip_DP = {<Y, supp, class> | Y ∈ EPA_DP, supp = supp_DP(Y), class = P}
Trip_DN = {<Y, supp, class> | Y ∈ EPA_DN, supp = supp_DN(Y), class = N}
{Merge all EPAs into EPA_Sort}
EPA_Sort = {Xi | Xi ∈ Trip_DP ∪ Trip_DN, Xi.supp ≥ Xi+1.supp}
{Initialize variables}
Score(T)_P = 0, Score(T)_N = 0, count = 0, i = 1
{Find the K nearest neighbors of T; Score(T)_P and Score(T)_N are the likelihoods that T belongs to class P or N respectively}
while count < K do
  if T matches Xi.EPA (Xi ∈ EPA_Sort) then
    count = count + 1
    if Xi.class == P then
      Score(T)_P = Score(T)_P + 1
    else
      Score(T)_N = Score(T)_N + 1
    end if
  end if
  i = i + 1  {advance to the next EPA whether or not it matched}
end while
{Predict the class of T}
if Score(T)_P > Score(T)_N then
  T is of class P
else
  T is of class N
end if

EPA-KNN takes its cue from KNN: the K nearest neighbors of T are the first K EPAs matched by T, and the classes of these K EPAs determine the class of T. When the class of T is predicted, the likelihood of one class is always strictly greater than that of the other; since K is odd, the two likelihoods can never be equal, which avoids misclassifications caused by ties in the classification algorithm.
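The EPA-KNN loop above can be sketched in a few lines; patterns and instances are represented as item sets, and matching is the subset test (a representation assumed for illustration):

```python
def epa_knn(test_items, epa_sort, k=19):
    """EPA-KNN sketch: the first k EPAs (sorted by support, descending)
    that the test instance matches vote with their class labels.
    epa_sort: list of (itemset, supp, cls) triples with supp non-increasing."""
    score = {"P": 0, "N": 0}
    count = 0
    for items, supp, cls in epa_sort:
        if items <= test_items:  # T matches the EPA (subset test)
            score[cls] += 1
            count += 1
            if count == k:
                break
    return "P" if score["P"] > score["N"] else "N"
```

With odd k, the two vote counts can never tie once k matches are found, mirroring the argument made in the text.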
6 Experimental Results
We carried out our experiments in two settings: (1) leave-one-out cross-validation (LOOCV) on the total samples, and (2) an independent test (IT) classifying the test samples based on the training samples. The K of EPA-KNN is 19 in both settings, and each time we select the 25 discriminatory genes with the smallest CIE and the corresponding cut points to discover EPAs. The experiment was repeated 15 times independently, and the average number of misclassifications is used to measure performance. To test whether the Bayes estimation and RCP are effective in terms of overall performance, we deleted the Bayes estimation and RCP from EPA-KNN; the remainder of the EPA-KNN algorithm is called EP-KNN. Table 4 presents the performances of EPA-KNN and EP-KNN, and makes a comparison with other algorithms that are good at gene classification. Compared with the other methods, our method achieves competitive performance.

Table 4. Comparison of experimental results

                      Gene number   Average number of misclassifications
                                    LOOCV        IT
EPA-KNN (EPA)         25            0.73         1.07
EP-KNN (EP)           25            1.0          2.0
Neural Networks [8]   10            1.0          2.0
SVM [5][8]            25-1000       0            2.0-4.0

From Table 4, we can see that the Bayes estimation and RCP improve the selection of discriminatory genes and corresponding cut points, and are more effective
A Novel EPA-KNN Gene Classification Algorithm
especially for IT experiments, due to the smaller training samples. Compared with neural networks and SVM, our method achieves better performance and also extracts biological gene rules, which are helpful for studying cancer pathology and for drug discovery.
7 Conclusion
In this paper, we propose a novel EPA-KNN gene classification algorithm. In the process of producing EPAs, Bayes estimation is applied to improve the reliability of the entropy for small samples by adding virtual samples, and RCP is used to strengthen generalization to unknown test samples. A new EPA-KNN classifier is then proposed. Experiments show that EPA-KNN can discover interpretable biological rules and improve the cancer recognition rate efficiently. Future research will include applying EPs to the parallel classification of multi-class data.
References
1. Kuramochi, M., Karypis, G.: Gene Classification Using Expression Profiles: A Feasibility Study. Int. J. Artificial Intelligence Tools 14 (2005) 641-660
2. Lu, X.G., Lin, Y.P., Yang, X.L., et al.: Using Most Similarity Tree Based Clustering to Select the Top Most Discriminating Genes for Cancer Detection. Proceedings of the Eighth International Conference on Artificial Intelligence and Soft Computing, Zakopane, Poland (2006) 931-940
3. Jiang, D.X., Tang, C., Zhang, A.D.: Cluster Analysis for Gene Expression Data: A Survey. IEEE Trans. Knowledge and Data Engineering 16 (2004) 1370-1386
4. Khan, J., Wei, J.S., Ringner, M., et al.: Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks. Nature Medicine 7 (2001) 673-679
5. Furey, T.S., Cristianini, N., Duffy, N., et al.: Support Vector Machine Classification and Validation of Cancer Tissue Samples Using Microarray Expression Data. Bioinformatics 16 (2000) 906-914
6. Dong, G.Z., Li, J.Y.: Mining Border Descriptions of Emerging Patterns from Dataset Pairs. Knowledge and Information Systems 8 (2005) 178-202
7. Li, J.Y., Wong, L.: Emerging Patterns and Gene Expression Data. Proceedings of the 12th Workshop on Genome Informatics, Tokyo, Japan (2001) 3-13
8. Tan, A.H., Pan, H.: Predictive Neural Networks for Gene Expression Data Analysis. Neural Networks 18 (2005) 297-306
9. Dong, G.Z., Li, J.Y.: Efficient Mining of Emerging Patterns: Discovering Trends and Differences. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, San Diego, USA (1999) 43-52
A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy Shuxue Zou, Yanxin Huang, Yan Wang, Chengquan Hu, Yanchun Liang, and Chunguang Zhou College of Computer Science and Technology, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, 130012, China {Shuxue Zou,sandror}@163.com, {Chunguang Zhou,cgzhou}@jlu.edu.cn
Abstract. Detecting the boundaries of protein domains is an important and challenging problem in experimental and computational structural biology. In this paper, domain detection is first cast as an imbalanced data learning problem. A novel undersampling method using distance-based maximal entropy in the feature space of SVMs is proposed. On multiple sequence alignments derived from a database search, multiple measures are defined to quantify the domain information content of each position along the sequence. The overall accuracy is about 87%, together with high sensitivity and specificity. Simulation results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in machine learning systems on general imbalanced datasets.
1 Introduction
Domain structure is one of the structural levels of a protein and is considered the fundamental unit of protein structure, folding, function, evolution and design. Detecting the domain structure of a protein is a challenging problem. Existing methods for domain detection include those that rely on expert knowledge of known protein structures to identify domains, such as CATH [1] and SCOP [2], and those that try to infer domain boundaries from the three-dimensional structure of proteins, such as PDP [3] and DALI [4]. However, structural information is available for only a small portion of the protein space, and with the current rapid growth in the number of sequences with unknown structures, it is important not only to accurately define protein structural domains but also to predict domain boundaries from the amino-acid sequence alone. A few methods based only on protein sequence information have been proposed; they use similarity searches and multiple alignments to delineate domain boundaries, e.g., Domainer [5] and DOMO [6]. In this paper, on multiple sequence alignments derived from a database search, multiple measures are defined to quantify the domain information content of each position along the sequence. We observe that boundary positions are far less numerous than core-domain positions and, for the first time, treat domain detection as an imbalanced data learning problem.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1264–1272, 2007. © Springer-Verlag Berlin Heidelberg 2007
Learning classification from imbalanced datasets is an important topic. Support vector machines (SVMs) have been very successful in application areas ranging from image retrieval [7] to text classification [8]. Nevertheless, when faced with imbalanced datasets the performance of SVMs drops significantly [9]. In the remainder of this paper, the negatives, i.e., the core-domain positions, are always taken to be the majority class, and the positives, i.e., the boundary positions, are the minority class. We propose a novel undersampling method based on distance-based maximal entropy in the feature space of SVMs. The negatives that have the maximal entropy value with respect to their counterpart positives are selected by the undersampling; in this way, the input data are no longer imbalanced. Thus the learned hyperplane is further away from the positive class, which compensates for the skew associated with imbalanced datasets. The experimental results demonstrate that the method is useful not only for predicting the complete 3D structure of a protein but also for machine learning systems on general imbalanced datasets.
2 Datasets and Feature Extraction Given a query sequence, our algorithm starts by searching the protein sequence database and generating a multiple alignment of all significant hits. The columns of the multiple alignment are analyzed using a variety of sources to define scores that reflect the domain-information-content of alignment columns. Information theory based principles are employed to maximize the information content. An overview of our method is depicted in Figure 1.
Fig. 1. Overview of the domain prediction method
2.1 Datasets
Version 1.65 of the SCOP database is employed in this paper, which includes 20,619 proteins and 54,745 chains. The datasets are selected according to statistical results on single-domain proteins and on proteins with two or more domains, as well as with consideration of protein homology.

2.2 Domain Definitions
For each protein chain we define the domain positions to be the positions that are at least x residues away from a domain boundary. Domain boundaries are obtained from SCOP definitions: for a SCOP definition of the form (start1; end1), ..., (startn; endn), the domain boundaries are set to (end_i + start_{i+1})/2. All positions that are within x residues of a domain boundary are considered boundary positions. Clearly, the domain positions far outnumber the boundary ones, so protein domain boundary detection is an instance of the class imbalance problem.

Fig. 2. Definition of the domain and boundary positions: given the SCOP domain definitions, positions within x residues of a domain boundary are labeled boundary positions, and all remaining positions are domain positions
2.3 Feature Extraction
First, each protein in the selected dataset is aligned against sequence databases. To quantify the likelihood that a sequence position is part of a domain or at the boundary of a domain, we adopt six measures [10] based on the multiple alignments that reflect structural properties of proteins. Information theory based principles are employed to maximize the information content. For each column, amino-acid entropy and class entropy are defined to measure sequence conservation. Four further features are extracted from the columns over a window. All appearances of a domain in database sequences will maintain the domain's integrity, which can be measured by quantifying the consistency and correlation, both asymmetric and symmetric, of neighboring columns in an alignment. Regions of substantial structural flexibility in a protein often correspond to domain boundaries; in a multiple alignment of related sequences, positions with indels with respect to the seed sequence indicate such regions.
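As one concrete illustration of a column-level conservation measure, the amino-acid entropy of an alignment column can be sketched as below. This is a minimal sketch, not the exact formulation of [10]; the toy alignment and the gap-handling convention are illustrative assumptions.

```python
import math
from collections import Counter

def column_entropy(column, gap_char="-"):
    """Shannon entropy (bits) of one multiple-alignment column.

    Low entropy indicates a conserved column; conserved stretches tend to
    lie inside domains, while boundary regions are typically more variable.
    """
    residues = [c for c in column if c != gap_char]  # ignore gaps (one convention)
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

# Toy 4-sequence alignment, one string per column:
columns = ["AAAA", "AAAC", "ACDE"]
profile = [column_entropy(col) for col in columns]  # [0.0, ~0.81, 2.0]
```

Sliding a window over such a per-column profile yields position-wise features of the kind described above.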
In addition, we adopt a method [11] to predict domain boundaries from the protein sequence alone. This simple physical approach is based on the fact that a protein's unique three-dimensional structure results from the balance between the gain of attractive native interactions and the loss of conformational entropy. Treating the conformational entropy as the number of degrees of freedom of the angles φ, ψ and χ for each amino acid along the chain, the method predicts domain boundaries by finding the extreme values of a latent entropy profile.
3 SVMs and Strategies for the Imbalanced Classification
Support vector machines (SVMs) are statistical learning techniques for training classifiers based on polynomial functions, radial basis functions, neural networks, splines or other functions. Without loss of generality, we choose SVMs coupled with the RBF kernel widely used in pattern recognition. Given a set of labeled instances $X_{\mathrm{train}} = \{x_i, y_i\}_{i=1}^{n}$ and a kernel function $K$, the SVM finds the optimal $\alpha_i$ for each $x_i$ to maximize the margin $\gamma$ between the hyperplane and the closest instances to it. The class prediction for a new test instance $x$ is made through

$$\operatorname{sign}\big(f(x)\big), \qquad f(x)=\sum_{i=1}^{n} y_i \alpha_i K(x,x_i)+b, \qquad K(x,x_i)=\exp\!\left(-\frac{\|x-x_i\|^2}{2\sigma^2}\right), \tag{1}$$
where $b$ is the threshold. The 1-norm soft-margin SVM minimizes the primal Lagrangian

$$L_p = \frac{\|w\|^2}{2} + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\big[y_i(w\cdot x_i + b) - 1 + \xi_i\big] - \sum_{i=1}^{n} r_i \xi_i, \tag{2}$$

where $\alpha_i \ge 0$ and $r_i \ge 0$ [12]. The penalty constant $C$ represents the trade-off between the empirical error $\xi$ and the margin. In order to meet the Karush-Kuhn-Tucker (KKT) conditions, the value of $\alpha_i$ must satisfy

$$0 \le \alpha_i \le C, \quad \text{and} \quad \sum_{i=1}^{n}\alpha_i y_i = 0. \tag{3}$$
In [13], Akbani et al. analyze three causes of performance loss with imbalanced data: first, positive points lie further from the ideal boundary; second, the weakness of soft margins; and last, the imbalanced support-vector ratio.

3.1 Why SVMs Fail on Imbalanced Classification
Veropoulos [14] suggests using different penalty factors $C^{+}$ and $C^{-}$ for the positive and negative classes, reflecting their importance during training. Therefore, the $L_p$ formulation has two loss functions for the two types of errors.
$$L_p = \frac{\|w\|^2}{2} + C^{+}\!\!\sum_{\{i \mid y_i=+1\}}\!\!\xi_i^k + C^{-}\!\!\sum_{\{j \mid y_j=-1\}}\!\!\xi_j^k - \sum_{i=1}^{n}\alpha_i\big[y_i(w\cdot x_i + b) - 1 + \xi_i\big] - \sum_{i=1}^{n}\mu_i\xi_i. \tag{4}$$

If the SVM algorithm uses an $L^1$ norm ($k = 1$) for the losses, its dual formulation gives the same Lagrangian as in the original soft-margin SVM, but with different constraints on $\alpha_i$ as follows:

$$0 \le \alpha_i \le C^{+} \ \text{if } y_i = +1, \quad \text{and} \quad 0 \le \alpha_i \le C^{-} \ \text{if } y_i = -1. \tag{5}$$
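The biased-penalty idea of Veropoulos's formulation corresponds to per-class penalty weights, which scikit-learn exposes through the `class_weight` argument of `SVC` (it scales C per class). The sketch below is illustrative only: the toy data, the 10x minority weight, and the use of scikit-learn are assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_neg = rng.randn(200, 2)                 # majority class around the origin
X_pos = rng.randn(20, 2) + [1.5, 1.5]     # overlapping minority class
X = np.vstack([X_neg, X_pos])
y = np.hstack([-np.ones(200), np.ones(20)])

# Plain soft-margin SVM: one C for both classes.
plain = SVC(kernel="rbf", C=1.0).fit(X, y)

# Biased penalty: effectively C+ = 10 * C for the minority class.
biased = SVC(kernel="rbf", C=1.0, class_weight={1: 10.0}).fit(X, y)

recall_plain = float((plain.predict(X_pos) == 1).mean())
recall_biased = float((biased.predict(X_pos) == 1).mean())
```

The biased machine typically recovers more minority points, though, as discussed next, raising C+ alone is limited by the KKT constraints.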
It turns out that this biased-penalty method does not help SVMs as much as expected. From the KKT conditions (Eq. (3)), we can see that C imposes only an upper bound on α i , not a lower bound. Increasing C does not necessarily affect α i . Moreover, the constraint in Eq. (3) imposes equal total influence from the positive and negative support vectors. The increases in some α i at the positive side will inadvertently increase some α i at the negative side to satisfy the constraint. These constraints can make the increase of C+ on minority instances ineffective. 3.2 Strategies for the Imbalanced Classification
A number of solutions to the class-imbalance problem have been proposed, at both the data and the algorithmic level. As analyzed above, when faced with imbalanced datasets, the performance of SVMs cannot be raised merely by tuning the parameters [9]. At the data level [15], the solutions include many different forms of resampling, such as random oversampling, random undersampling, directed oversampling, directed undersampling, oversampling with informed generation of novel samples, and combinations of these techniques. In the case of undersampling, examples from the majority class are removed; the removed examples can be randomly selected, near-miss examples, or examples that are far from the minority-class examples. In this paper, a novel undersampling method using distance-based maximal entropy in the feature space of SVMs is proposed. The unique learning mechanism of SVMs makes them an interesting candidate for dealing with imbalanced datasets, since SVMs take into account only those data that are close to the boundary, i.e., the support vectors, when building the model. More importantly, as a kernel-based method, SVM classification is defined in the feature space, and so is our undersampling preprocessing.

Let the data set be $X = \{x_1, x_2, \ldots, x_{M+N}\}$, where $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,s})$, $i = 1, 2, \ldots, M+N$, and $s$ is the number of features. Here $N$ is the number of positives and $M$ is the number of negatives in the imbalanced SVM classification, with $M \gg N$. The entropy of $x_i$ is defined as

$$E_i = -\sum_{j=1}^{M}\Big(S(x_i, x_j)\log_2 S(x_i, x_j) + \big(1 - S(x_i, x_j)\big)\log_2\big(1 - S(x_i, x_j)\big)\Big), \tag{6}$$
where $S(x_i, x_j) = e^{-\alpha D(x_i, x_j)}$ is the similarity between $x_i$ and $x_j$, and $\alpha$ is the curvature of the function. $D(x_i, x_j)$ is the Euclidean distance between $x_i$ and $x_j$ in the feature space of the SVM. Suppose that $u_i$ is the projection of the input vector $x_i$ into the feature space, and define $\phi(x_i)\cdot\phi(x_j) = k(x_i, x_j)$; then the Euclidean distance in the feature space is

$$D^2(u_i, u_j) = \|\phi(x_i) - \phi(x_j)\|^2 = \phi^2(x_i) - 2\phi(x_i)\phi(x_j) + \phi^2(x_j) = k(x_i, x_i) - 2k(x_i, x_j) + k(x_j, x_j). \tag{7}$$

The value of $k(x_i, x_j)$ is obtained from Eq. (1); importantly, the parameter in $k(x_i, x_j)$ has to coincide with that of the SVM. $S(x_i, x_j)$ varies in the range $[0, 1]$, and $-\big(S(x_i, x_j)\log_2 S(x_i, x_j) + (1 - S(x_i, x_j))\log_2(1 - S(x_i, x_j))\big)$ tends to its maximal value 1 as $S(x_i, x_j) \to 0.5$, and to its minimal value 0 as $S(x_i, x_j) \to 0$ or $S(x_i, x_j) \to 1$. Accordingly, $\alpha_i = -\ln(0.5)/\bar{D}_i$ is estimated, where $\bar{D}_i$ is the mean distance between a given positive and all the negatives. Therefore, negatives that are very close to or very distant from a given positive are not sampled: negatives too close to the learned hyperplane may skew it, while negatives far away from it cannot become support vectors and contribute nothing to training. Negatives separated by a distance close to $\bar{D}_i$ contribute the most. The negatives that have the maximal entropy value with respect to their counterpart positives are selected by the undersampling; in this way, the input data are no longer imbalanced, and the learned hyperplane is further away from the positive class. This compensates for the skew associated with imbalanced datasets, which pushes the hyperplane closer to the positive class. The undersampling algorithm using distance-based maximal entropy is as follows:

① Compute the distance from each positive to all of the negatives with Eq. (7) and obtain $\bar{D}_i$;
② Compute the corresponding entropy of each positive according to Eq. (6);
③ Sort the entropies and non-repetitively choose the negatives with the maximal entropy.
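The three steps above can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the authors' code: the RBF kernel follows Eq. (1), the feature-space distance follows Eq. (7), alpha is set per positive so that the similarity equals 0.5 at the mean distance, and the toy data are invented for demonstration.

```python
import numpy as np

def rbf_kernel(a, b, sigma2):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), as in Eq. (1)."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma2))

def feature_space_distance(a, b, sigma2):
    """Eq. (7): D^2 = k(a,a) - 2 k(a,b) + k(b,b); for the RBF kernel k(a,a) = 1."""
    return np.sqrt(max(2.0 - 2.0 * rbf_kernel(a, b, sigma2), 0.0))

def entropy_undersample(positives, negatives, sigma2):
    """Distance-based maximal-entropy undersampling (steps 1-3 above).

    For each positive, alpha is chosen so that S = exp(-alpha * D) equals 0.5
    at the mean distance, where the binary entropy peaks; the not-yet-chosen
    negative with the highest entropy is then kept.
    """
    chosen = []
    for p in positives:
        d = np.array([feature_space_distance(p, n, sigma2) for n in negatives])
        alpha = -np.log(0.5) / d.mean()            # alpha_i = -ln(0.5) / D_bar_i
        s = np.clip(np.exp(-alpha * d), 1e-12, 1 - 1e-12)
        ent = -(s * np.log2(s) + (1.0 - s) * np.log2(1.0 - s))
        for j in np.argsort(-ent):                 # highest entropy first
            if j not in chosen:                    # choose non-repetitively
                chosen.append(int(j))
                break
    return negatives[chosen]

rng = np.random.RandomState(1)
neg = rng.randn(50, 3)          # majority class (e.g., core-domain positions)
pos = rng.randn(5, 3) + 2.0     # minority class (e.g., boundary positions)
sampled = entropy_undersample(pos, neg, sigma2=1.0)   # one negative per positive
```

The balanced training set is then the positives plus the selected negatives, and `sigma2` must match the kernel width used by the SVM.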
4 Analysis of Results
In learning from extremely imbalanced data, the overall classification accuracy ((true positives + true negatives) / total number of samples) is often not an appropriate measure of performance: a trivial classifier that predicts every case as the majority class can still achieve very high accuracy. The medical community, and increasingly the machine learning community, use two metrics, sensitivity and specificity, when evaluating the performance of various tests. Sensitivity is the accuracy on the positive instances (true positives / (true positives + false negatives)), while specificity is the accuracy on the negative instances (true negatives / (true negatives + false positives)).
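These metrics can be computed directly from the confusion counts; a minimal sketch (the "always negative" classifier is a toy illustration of why accuracy alone misleads):

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and overall accuracy for labels in {+1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # accuracy on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0  # accuracy on negatives
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy

# A trivial "always negative" classifier on 90:10 imbalanced labels:
y_true = [-1] * 90 + [1] * 10
y_pred = [-1] * 100
se, sp, acc = confusion_metrics(y_true, y_pred)
# acc = 0.90 looks respectable, but se = 0.0 exposes the useless classifier.
```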
To find the best pair of $C$ and $\sigma^2$, the most important parameters of both the SVM and the undersampling preprocessing, a grid search over given ranges using cross-validation is employed. Since only a limited number of instances could be obtained by the time-consuming alignment running on a PC, we adopt k-fold cross-validation (k = 5 in this study) for an unbiased evaluation, a common model-selection technique in the machine learning community; all examples in the dataset are eventually used for both training and testing. The grid search for the SVM parameters was iterated over $C \in \{2^0, 2^1, \ldots, 2^9, 2^{10}\}$ and $\sigma^2 \in \{2^{-4}, 2^{-2}, \ldots, 2^7, 2^8\}$. In our experiments we compared the performance of our classifier with that of regular SVMs; partial experimental results are shown in Table 1, which reports the sensitivity (Se), specificity (Sp), time consumed, and accuracy of each algorithm. "Native SVMs" stands for using the imbalanced data directly as input, compared with the balancing preprocessing proposed in this paper.
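The grid search with cross-validation can be sketched as below. Assumptions: scikit-learn is used for brevity, the toy data are invented, and the paper's $\sigma^2$ is mapped to scikit-learn's parameterization via gamma = 1/(2 sigma^2).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(120, 6)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy, roughly balanced labels

# sigma^2 grid mapped to scikit-learn's gamma = 1 / (2 sigma^2)
sigma2_grid = 2.0 ** np.arange(-4, 9)
param_grid = {
    "C": 2.0 ** np.arange(0, 11),
    "gamma": 1.0 / (2.0 * sigma2_grid),
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold CV
search.fit(X, y)
best_C = search.best_params_["C"]
best_gamma = search.best_params_["gamma"]
```

The best pair is the one with the highest mean cross-validated score over the folds.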
Table 1. Experimental results comparing regular SVMs with the proposed preprocessing

σ²     C     | Native SVMs: Se / Sp / Time / Accuracy | With preprocessing: Se / Sp / Time / Accuracy
0.25   1024  | 0.0893 / 0.9964 / 1182 / 85.82%        | 0.761 / 0.732 / 187 / 64.39%
0.5    2     | 0.0216 / 0.9998 / 128  / 85.16%        | 0.531 / 0.842 / 79  / 69.75%
1      512   | 0.0649 / 0.9979 / 449  / 85.71%        | 0.859 / 0.797 / 146 / 78.22%
2      1     | 0      / 1      / 133  / 84.88%        | 0.625 / 0.734 / 66  / 80.63%
4      8     | 0.0059 / 1      / 136  / 84.95%        | 0.787 / 0.532 / 81  / 86.78%
8      64    | 0.0287 / 0.9996 / 135  / 85.21%        | 0.953 / 0.901 / 94  / 80.25%
16     16    | 0      / 1      / 135  / 84.83%        | 0.944 / 0.996 / 74  / 87.46%
32     128   | 0.0202 / 0.9998 / 131  / 85.13%        | 0.895 / 0.926 / 111 / 76.87%
64     32    | 0      / 1      / 130  / 84.77%        | 0.875 / 0.892 / 86  / 88.14%
128    256   | 0.0071 / 1      / 127  / 84.93%        | 0.665 / 0.584 / 99  / 74.76%
256    4     | 0      / 1      / 135  / 84.85%        | 0.807 / 0.884 / 62  / 83.34%
Note that regular SVMs have almost perfect specificity but poor sensitivity, because they tend to classify everything as negative; for the same reason, their performance is insensitive to variations in the parameters $C$ and $\sigma^2$. Some unfit pairs of $C$ and $\sigma^2$ give lower accuracy values, because in our classifier all of the data are tested, rather than only the training set with equal numbers of negatives and positives. For example, with $C = 1024$ and $\sigma^2 = 0.25$, the accuracy is the lowest, as the generalization performance of the classifier decreases with the highest $C$. It is clear that the preprocessing not only reduces the size of the input data, which decreases the training time, but also improves both the sensitivity and the specificity. The results show that the proposed preprocessing method of balancing the input improves the classification performance.
5 Summary and Outlook In this paper we presented a promising method for detecting the domain structure of a protein from sequence information alone. Besides the conformational entropy of the
seed sequence is also considered, and information theory principles are used to optimize the scores. Notably, we are the first to cast protein domain detection as an imbalanced data classification problem, for which we propose a novel undersampling method using distance-based maximal entropy in the feature space of SVMs. Finally, support vector machines with the RBF kernel are trained, and a grid search using cross-validation is employed to identify the optimal $C$ and $\sigma^2$. The method predicts protein domains from sequence with an accuracy of about 87%, which promises a significant improvement for the study of biological macromolecular structures in bioinformatics. To further validate our method, the datasets will be enlarged in the near future, and the extracted features should be analyzed and compared with respect to their biological importance.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (60433020, 60673099, 60673023) and the "985" Project of Jilin University.
References
1. Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thornton, J.M.: CATH - a Hierarchic Classification of Protein Domain Structures. Structure 5 (1997) 1093-1108
2. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J. Mol. Biol. 247 (1995) 536-540
3. Alexandrov, N., Shindyalov, I.: PDP: Protein Domain Parser. Bioinformatics 19 (3) (2003) 429-430
4. Holm, L., Sander, C.: Mapping the Protein Universe. Science 273 (1996) 595-602
5. Sonnhammer, E.L., Kahn, D.: Modular Arrangement of Proteins as Inferred from Analysis of Homology. Protein Sci. 3 (1994) 482-492
6. Gracy, J., Argos, P.: Automated Protein Sequence Database Classification. I. Integration of Compositional Similarity Search, Local Similarity Search and Multiple Sequence Alignment. Bioinformatics 14 (2) (1998) 164-187
7. Tong, S., Chang, E.: Support Vector Machine Active Learning for Image Retrieval. Proceedings of ACM International Conference on Multimedia (2001) 107-118
8. Joachims, T.: Text Categorization with SVM: Learning with Many Relevant Features. Proceedings of ECML-98, 10th European Conference on Machine Learning (1998)
9. Wu, G., Chang, E.: Class-Boundary Alignment for Imbalanced Dataset Learning. ICML 2003 Workshop on Learning from Imbalanced Data Sets II, Washington, DC (2003)
10. Nagarajan, N., Yona, G.: Automatic Prediction of Protein Domains from Sequence Information Using a Hybrid Learning System. Bioinformatics 1 (2004) 1-27
11. Galzitskaya, O.V., Melnik, B.S.: Prediction of Protein Domain Boundaries from Sequence Alone. Protein Science 12 (2003) 696-701
12. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK (2000)
13. Akbani, R., Kwek, S., Japkowicz, N.: Applying Support Vector Machines to Imbalanced Datasets. Proceedings of the 15th European Conference on Machine Learning (ECML) (2004) 39-50
14. Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the Sensitivity of Support Vector Machines. Proceedings of the International Joint Conference on Artificial Intelligence (1999) 55-60
15. Kotsiantis, S., Kanellopoulos, D., Pintelas, P.: Handling Imbalanced Datasets: A Review. GESTS International Transactions on Computer Science and Engineering 30 (1) (2006) 25-36
The Effect of Recording Reference on EEG: Phase Synchrony and Coherence
Sanqing Hu¹, Matt Stead¹, Andrew B. Gardner², and Gregory A. Worrell¹
¹ Department of Neurology, Division of Epilepsy and Electroencephalography, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA
[email protected]
² BioQuantix Corp., Atlanta, GA 30363
[email protected]

Abstract. In [1], we developed two methods to automatically identify the contribution of the recording reference signal in multi-channel intracranial electroencephalography (iEEG) recordings. In this study, we subtract the reference contribution from the iEEG to obtain corrected iEEG. We then investigate three commonly used iEEG metrics: spectral power, phase synchrony, and magnitude squared coherence (MSC), for common referential iEEG, corrected iEEG and bipolar montage iEEG. We find significant differences among the three iEEG metrics and are able to determine the contribution of the recording reference to each metric. Generally, reference signals with smaller amplitude yield lower phase synchrony, while reference signals with larger amplitude increase phase synchrony. Reference signals with spectral peaks increase coherence, whereas a reference signal with low power may have no significant impact on the calculated coherence. Bipolar EEG usually yields small phase synchrony or MSC values and may obscure the actual phase synchrony or MSC values between two local sources.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1273–1280, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction
The temporal resolution of electroencephalography (EEG) has led to its widespread use by clinicians and scientists investigating physiologic and pathologic brain function. In particular, the use of EEG in studies of neuronal assemblies and their oscillations has received wide attention [2]-[15]. It has been noted that certain types of assemblies are characterized by the synchronous activity of their constituent neurons, and different EEG frequency components also reveal synchronies relating to different perceptual, motor or cognitive states [2]-[6]. In the literature, there are three common ways to analyze neuronal assemblies: spectral power, phase synchrony (such as mean phase coherence), and coherence (such as MSC). Spectral power measures the consequence of synchronous activity rather than the synchrony itself, and is thus an indirect index of neural synchrony: weakly synchronized activity leads to destructive interference and lower measurable power at a given frequency [7]. Phase synchrony is a more direct index of neural synchrony and is defined as a phase locking value, ranging from 0 (no synchronization) to 1 (perfect synchronization). Spectral power and phase synchrony have been used extensively to assess neural synchrony in human electrophysiological studies
[7] and [8]. Coherence identifies the synchrony of neuronal assemblies as a function of the correlation of EEG frequency components. Neurons in an assembly are presumably located in some proximity to the recording electrodes and exhibit oscillatory activity with common spectral properties. The typical finding is that in a given perceptual, cognitive, or motor task, EEG coherence increases (or decreases) in a certain band of EEG frequency spectrum [9]. It is well known that reference electrode is indispensable to recording multi-channel EEGs, but the fact that reference signal may heavily contaminate EEG recordings and have a significant effect on various EEG metrics is often ignored. Unfortunately, in all above cited references neuronal synchronization has been studied only based on either common referential EEG recordings or common reference-free EEG recordings such as bipolar EEG, average common reference EEG and Laplacian EEG. The potential confounding effect with common referential EEG recordings for spectral power analysis [7], coherence analysis [9] and [10], and phase synchrony analysis [7] and [11] is well established. The bipolar EEG, obtained by subtracting the potentials of two nearby electrodes, will remove all signals common to the two channels, including the common reference. However, one must realize that a given bipolar montage will completely miss dipoles with certain locations and tangential orientations, and not all signals common to the two electrodes are from the reference. Caution against the use of bipolar EEG for coherence analysis was given [12] and [13]. Although the average reference EEG and Laplacian EEG are also reference-free, caution against the use of them for synchronization analysis was given [14]. 
In [1] we proposed two methods to extract the scalp reference signal from clinical multi-channel iEEG recordings based on independent component analysis [15] and stated why the obtained signal is a “good” estimation of the real reference signal.
2 Methods
One patient undergoing evaluation for epilepsy surgery with intracranial electrodes was investigated. This patient had a subdural grid of electrodes covering the left anterior temporal neocortex. The grid consisted of a silastic sheet embedded with 32 platinum-iridium alloy electrodes of 2.3 mm diameter, separated by 10 mm in a 4 × 8 array. The iEEG was acquired using a stainless steel scalp suture placed in the vertex region of the scalp, midline between the Cz and Fz electrode positions (international 10-20 system). The data were recorded differentially, with the common reference being the scalp suture electrode. The scalp suture electrode is relatively isolated from the intracranial electrodes by the intervening layers of cerebrospinal fluid, bone, muscle, and scalp. These layers serve to distribute signal in such a way that approximately 7 cm² of coordinated cortical activity is generally required to produce a clear deflection detectable on the scalp. In practice, the reference electrode serves primarily to reject common-mode potentials generated by muscular contraction and body movement, which are conducted to the intracranial vault. Unfortunately, it also introduces artifacts unique to the scalp site. The data were acquired with a DC-capable Neuralynx electrophysiology system and digitized at 32556 Hz. For the analysis in this paper, the data were decimated offline
by first low-pass filtering to 400 Hz using a forward and reverse digital filter (Matlab "filtfilt" procedure) to avoid phase shift and then downsampling to a frequency of 2000 Hz. The time-series measures estimated were the power spectral density (PSD), phase synchrony (mean phase coherence in this paper), and MSC. The PSD was estimated using Welch's method with a 512-sample Hanning window and 256-sample overlap. The mean phase coherence [16] is defined as

$$R = \left|\frac{1}{N}\sum_{j=1}^{N} e^{\,i\,[\phi_x(t_j) - \phi_y(t_j)]}\right|,$$

where $\phi_x(t)$ and $\phi_y(t)$ denote the phase variables of two oscillating signals $x(t)$ and $y(t)$, and $N = 20000$ is the window sample size, with half-overlapping windows. To calculate phase synchrony, we prefiltered the iEEG to the frequency band 1 Hz~70 Hz. The MSC was estimated using a 512-sample Hanning window with 256-sample overlap.
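Both measures can be estimated with standard tools; a minimal sketch assuming NumPy/SciPy (the 10 Hz test signals are synthetic, not the patient data): mean phase coherence from Hilbert-transform instantaneous phases, and MSC via Welch-style averaging with a 512-sample Hann window and 256-sample overlap.

```python
import numpy as np
from scipy.signal import hilbert, coherence

fs = 2000.0                          # sampling rate of the decimated iEEG
t = np.arange(0, 10, 1 / fs)
rng = np.random.RandomState(0)

# Two synthetic 10 Hz oscillations with a fixed phase lag plus independent noise.
x = np.sin(2 * np.pi * 10 * t) + 0.3 * rng.randn(t.size)
y = np.sin(2 * np.pi * 10 * t + 0.8) + 0.3 * rng.randn(t.size)

# Mean phase coherence R = | (1/N) sum_j exp(i [phi_x(t_j) - phi_y(t_j)]) |
phi_x = np.angle(hilbert(x))
phi_y = np.angle(hilbert(y))
R = float(np.abs(np.mean(np.exp(1j * (phi_x - phi_y)))))

# MSC with a 512-sample Hann window and 256-sample (half) overlap.
f, msc = coherence(x, y, fs=fs, window="hann", nperseg=512, noverlap=256)
msc_near_10hz = float(msc[np.argmin(np.abs(f - 10.0))])
```

For the phase-locked test signals, both R and the MSC near 10 Hz come out high, illustrating how the two measures respond to a shared oscillation.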
3 Results The patient underwent iEEG monitoring using 32-contact subdural grid electrodes with a scalp reference electrode at sample frequency of 2000Hz. Four adjacent channel iEEG recordings (Grid2, Grid3, Grid4 and Grid5) are plotted in Figure 1A where only 20 seconds of 200 seconds are shown representatively. The reference signal in Figure 1A was calculated based on the second method [1] for the whole time period (200 seconds) and for the whole 32 channels where the rest of channels are omitted here. It should be pointed out that the whole data are artifact-free as seen in Figure 1A. In Figure 1A, the second four channels (Grid2∼Grid5) are corrected iEEG of the first four channels (Grid2∼Grid5). The corrected iEEG was obtained by subtracting the calculated reference signal from the original referential iEEG. One can see that the basic patterns of the original referential iEEG are preserved in the corresponding corrected iEEG. Two bipolar montage iEEG (Grid2−Grid3 and Grid4−Grid5) are also plotted in Figure 1A. To show the influence of the reference signal on iEEG recordings, Figure 1B∼1D describe the PSD for the reference signal, original referential Grid3, and corrected Grid3. One can easily see that the reference signal has activity of frequency near 60Hz over the whole time period in Figure 1B. This contribution was verified to come from the line voltage. This activity can be seen clearly for the referential Grid3 near 60Hz over the whole time period in Figure 1C and is removed in corrected Grid3 in Figure 1D. Figure 1E shows the spectral power for the reference signal (the dash-dot line), Grid2 and Grid3 where the solid lines correspond to the referential Grid2 and Grid3, the dashed lines correspond to the corrected Grid2 and Grid3, and the dotted lines correspond to the bipolar iEEG (Grid2−Grid3). 
It is easy to see that the referential iEEG (Grid2 and Grid3) and the reference signal have peaks near 60Hz and that the bipolar iEEG (Grid2−Grid3), the corrected iEEG Grid2 and Grid3 have no peak near 60Hz. This further verifies that the reference signal is removed out from the referential iEEG completely. In Figure 1F and 1G two measures: phase synchrony and MSC are analyzed where the solid lines correspond to the referential iEEG, the dashed lines correspond to the
corrected iEEG, the dotted lines correspond to the bipolar iEEG. To analyze phase synchrony between different channels, we filtered the referential iEEG and corrected iEEG to the frequency band 1Hz∼70Hz. From Figure 1F, one can see that phase synchrony values for Grid2*Grid4 fall into the interval [0.25, 0.5] and are minimally different for the referential iEEG and corrected iEEG. Phase synchrony values for Grid2*Grid5 fall into the interval [0.15, 0.35] and are minimally different for the referential iEEG and corrected iEEG. However, phase synchrony values for Grid3*Grid4 or Grid3*Grid5 are significantly different for the referential iEEG and corrected iEEG where for instance, phase synchrony values for the corrected Grid3*Grid4 are all above 0.8 and phase synchrony values for the referential Grid3*Grid4 lie in the interval [0.4, 0.6]. It is notable that all phase synchrony values for the bipolar iEEG (Grid2−Grid3)*(Grid4−Grid5) are small and all less than 0.2. Another interesting observation is that phase synchrony values of the corrected iEEG is mostly larger than that of the referential iEEG, e.g. Grid3*Grid4, because the amplitude of the reference signal is mostly smaller than that of the referential or corrected iEEG as seen in Figure 1A for this patient. Hence, the amplitude of the reference signal plays an important role in phase synchrony. From Figure 1G, one can see that MSC values for Grid2*Grid4 or Grid2*Grid5 are less than 0.3 from 7Hz to 70Hz for both the referential iEEG and corrected iEEG. MSC values for Grid3*Grid4 or Grid3*Grid5 are higher from 7Hz to 70Hz for the corrected iEEG compared with the long down trends before 55Hz and the big peaks after 55Hz for the referential iEEG. There are peaks near 60Hz for all the referential iEEG and no peak 60Hz for the corrected iEEG, which further confirms that the peaks come from the common reference signal. 
Moreover, comparing the four peaks, one can see that the peaks for Grid3*Grid4 and Grid3*Grid5 are much larger than those for Grid2*Grid4 and Grid2*Grid5, because of the following fact: the power of the reference signal is larger than that of the corrected Grid3 near 60Hz, as shown in Figure 1E, so the reference signal plays a dominant role and leads to large peak amplitudes. On the contrary, the power of the reference signal is smaller than that of the corrected Grid2 near 60Hz, as shown in Figure 1E, so the reference signal plays a weak role and leads to small peak amplitudes. It is notable that all MSC values change little from 0Hz to 7Hz for the referential iEEG and corrected iEEG because of the lower power of the reference signal there, which can be understood from the comparison of the reference, Grid2 and Grid3 in Figure 1E. It is also interesting to note that all MSC values for the bipolar (Grid2−Grid3)*(Grid4−Grid5) from 7Hz to 70Hz are very close to zero. The higher MSC values for the bipolar (Grid2−Grid3)*(Grid4−Grid5) from 0Hz to 7Hz may come from the higher MSC values for Grid3*Grid4. Thus, in this case, it is obviously wrong to use MSC values for the bipolar (Grid2−Grid3)*(Grid4−Grid5) to reflect MSC values for Grid3*Grid4 or Grid3*Grid5 from 7Hz to 70Hz. Hence, we draw the following conclusions: i) the reference signal may change the observed phase synchrony or MSC values and thus lead to an incorrect interpretation of the EEG even if the raw data are very clean, that is, artifact-free; ii) the commonly used bipolar iEEG usually leads to small phase synchrony or MSC values and cannot reflect large phase synchrony or MSC values between two local real sources; iii) the amplitude of the reference signal plays an important role
The Effect of Recording Reference on EEG: Phase Synchrony and Coherence
[Fig. 1 appears here. Panel A shows the referential iEEG, corrected iEEG, and reference traces for Patient 1; panels B–D show the time–frequency power of the reference signal, the referential Grid3, and the corrected Grid3; panel E shows the power spectra of Grid2, Grid3, the reference, and the bipolar Grid2−Grid3; panels F and G show the mean phase coherence and the magnitude squared coherence. The full caption of Fig. 1 is given in the text below.]
in phase synchrony at one time point; iv) the power of the reference signal plays an important role in MSC at a given frequency.
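The two measures compared above, and the way a common reference inflates them, can be sketched numerically. The following toy demo is our own illustration, not the authors' code: the signal frequencies, amplitudes, and the simple broadband Hilbert-phase estimate are all invented for the example. Two independent oscillatory "local sources" are recorded against a strong common reference; the mean phase coherence (phase-locking value) and the MSC at the reference frequency both come out larger for the referential recordings than for the true sources.

```python
import numpy as np
from scipy.signal import coherence, hilbert

def mean_phase_coherence(x, y):
    """Phase-locking value R = |<exp(i(phi_x - phi_y))>| via the Hilbert transform."""
    dphi = np.angle(hilbert(x)) - np.angle(hilbert(y))
    return float(np.abs(np.mean(np.exp(1j * dphi))))

fs = 200.0
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(0)

# Two independent local sources (stand-ins for two grid contacts).
s1 = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(t.size)
s2 = np.sin(2 * np.pi * 17 * t) + 0.5 * rng.standard_normal(t.size)
# A strong common reference, e.g. picked up at the scalp reference electrode.
ref = 2.0 * np.sin(2 * np.pi * 6 * t)
x1, x2 = s1 - ref, s2 - ref            # referential recordings share -ref

R_true = mean_phase_coherence(s1, s2)   # low: sources are independent
R_ref = mean_phase_coherence(x1, x2)    # inflated by the shared reference
f, C_true = coherence(s1, s2, fs=fs, nperseg=512)
_, C_ref = coherence(x1, x2, fs=fs, nperseg=512)
k = int(np.argmin(np.abs(f - 6.0)))     # frequency bin nearest the reference rhythm
```

Inspecting `R_ref` against `R_true`, and `C_ref[k]` against `C_true[k]`, reproduces qualitatively the conclusions i) and iv) above: the shared reference, not the sources, carries the apparent synchrony.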
4 Discussion and Conclusions In this study, we mainly discuss the phase synchrony and coherence of common referential, corrected, and bipolar iEEGs. It is well known that a reference electrode is indispensable for recording multi-channel EEG, but the reference signal may heavily contaminate the EEG recordings. It is therefore very desirable to determine what the reference signal is. In our recent paper [1] we proposed two methods to extract the scalp reference signal from multiple iEEG channel recordings based on independent component analysis and the following assumption: the reference signal from the scalp reference electrode can be treated as independent from all of the sources recorded at each intracranial electrode. This assumption is basically true because the scalp reference electrode is relatively isolated from the intracranial electrodes by the three intervening layers of cerebrospinal fluid, bone, and scalp. This assumption was supported by simulation results from clinical EEG data [1]. In this study, we applied the second method to clinical iEEG data from one patient to find the reference signal. After removing the reference signal for the patient, we obtained the corrected iEEG and compared the phase synchrony and coherence of the referential iEEG, corrected iEEG, and bipolar montage iEEG. We found significant differences in these observed values among the three iEEGs in many cases. Here the iEEG data from the patient were largely artifact-free. To show why the obtained reference signal is a “good” estimate of the real reference signal, we compared the spectral power of the referential and corrected iEEG and of the reference signal in the time–frequency domain, and found that the high-frequency activity near 60 Hz present in the reference signal and referential iEEG (Figures 1B and 1C) is removed from the corrected iEEG (Figure 1D). This shows that the reference signal is almost completely extracted from the referential iEEG.
This fact is further verified by the peaks in the spectral power of the referential iEEG that are removed from the corrected iEEG (Figure 1E). The reference signal may have an important influence on the phase synchrony and MSC values of EEG in the following two ways: i) the amplitude of the reference signal plays an important role in phase synchrony at one time point. More precisely, a reference signal with smaller amplitude may decrease phase synchrony values; on the contrary, a reference signal with larger amplitude may increase phase synchrony values. ii) The power of the reference signal plays an important role in MSC values at a given frequency. A reference signal with larger power may increase MSC values and may lead to larger peak amplitudes (see, e.g., the referential and corrected Grid3*Grid5 near 60 Hz in Figure 1G). A reference signal with lower power may cause little change in MSC values (see, e.g., the referential and corrected Grid3*Grid4 from 0 Hz to 10 Hz in Figure 1G). Hence, the reference signal may change the observed phase synchrony or MSC values significantly and thus lead to an incorrect interpretation of the EEG even if the raw data are artifact-free. The commonly used bipolar EEG can remove the common reference. The bipolar EEG is very useful for removing artifacts when these artifacts come from the reference electrode. However, one should note that bipolar EEG also removes all signals
common to the two channels, and not all signals common to the two electrodes come from the reference or from artifacts. Hence, a given bipolar montage will completely miss dipoles with certain locations and tangential orientations. From our simulation results, bipolar EEG usually leads to small phase synchrony or MSC values and will underestimate the real phase synchrony or MSC values between two different channels due to local sources (see, e.g., Figure 1F). Hence, bipolar EEG may distort phase synchrony and coherence and lead to misinterpretation of the EEG. Fig. 1. A) A 20-second sample of four-channel iEEG recorded from the subdural grid electrodes, where the first four traces (Grid2−Grid5) are the referential iEEG and the latter four are the corrected iEEG. The reference signal is calculated using the second method of our recent work. Grid2−Grid3 and Grid4−Grid5 are two bipolar iEEG channels. B) The PSD of the reference signal. C) The PSD of the referential Grid3. D) The PSD of the corrected Grid3. E) The spectral power for the referential (solid line) and corrected (dashed line) Grid2 and Grid3, the reference signal (dash-dot line), and the bipolar Grid2−Grid3 (dotted line). F) The mean phase coherence for the referential (solid line) and corrected (dashed line) Grid2*Grid4, Grid2*Grid5, Grid3*Grid4, Grid3*Grid5, and the bipolar (Grid2−Grid3)*(Grid4−Grid5) (dotted line). G) The magnitude squared coherence for the referential (solid line) and corrected (dashed line) Grid2*Grid4, Grid2*Grid5, Grid3*Grid4, Grid3*Grid5, and the bipolar (Grid2−Grid3)*(Grid4−Grid5) (dotted line).
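The montage arithmetic behind this caveat is easy to verify numerically. In the toy sketch below (our own illustration; the signal names and frequencies are invented), the bipolar difference cancels the reference exactly, but it equally cancels any source common to the two contacts, so synchrony carried by that common source becomes invisible:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(0, 10, 0.005)
ref = np.sin(2 * np.pi * 6 * t)        # scalp reference signal
common = np.sin(2 * np.pi * 12 * t)    # a source seen by BOTH intracranial contacts
local1 = rng.standard_normal(t.size)   # activity unique to contact 1
local2 = rng.standard_normal(t.size)   # activity unique to contact 2

v1 = (local1 + common) - ref           # referential recordings
v2 = (local2 + common) - ref
bipolar = v1 - v2                      # reference AND the common source both vanish
```

The bipolar channel equals `local1 - local2`: clean of the reference, but blind to everything the two contacts share.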
Acknowledgment This work was supported by NIH 5K23NS047495.
References
1. Hu, S., Stead, M., Worrell, G.A.: Automatic Identification and Removal of Scalp Reference Signal for Intracranial EEGs Based on Independent Component Analysis. IEEE Transactions on Biomedical Engineering (in press)
2. Basar, E., Basar-Eroglu, C., Karakas, S., Schurmann, M.: Gamma, Alpha, Delta and Theta Oscillations Govern Cognitive Processes. Int. J. Psychophysiol. 39 (2001) 241-248
3. Fuentemilla, L., Marco-Pallares, J., Grau, C.: Modulation of Spectral Power and of Phase Resetting of EEG Contributes Differentially to the Generation of Auditory Event-related Potentials. NeuroImage 30 (2006) 909-916
4. Knyazev, G.G., Savostyanov, A.N., Levin, E.A.: Alpha Synchronization and Anxiety: Implications for Inhibition vs. Alertness Hypotheses. Int. J. Psychophysiol. 59 (2006) 151-158
5. Sritharan, A., Line, P., Sergejew, A., Silberstein, R., Egan, G., Copolov, D.: EEG Coherence Measures during Auditory Hallucinations in Schizophrenia. Psychiatry Res. 136 (2005) 189-200
6. Yeragani, V.K., Cashmere, D., Miewald, J., Tancer, M., Keshavan, M.S.: Decreased Coherence in Higher Frequency Ranges (Beta and Gamma) between Central and Frontal EEG in Patients with Schizophrenia: A Preliminary Report. Psychiatry Res. 141 (2005) 53-60
7. Trujillo, L.T., Peterson, M.A., Kaszniak, A.W., Allen, J.J.: EEG Phase Synchrony Differences across Visual Perception Conditions May Depend on Recording and Analysis Methods. Clin. Neurophysiol. 116 (2005) 172-189
8. Palva, J.M., Palva, S., Kaila, K.: Phase Synchrony among Neuronal Oscillations in the Human Cortex. J. Neurosci. 25 (2005) 3962-3972
9. Duckrow, R.B., Zaveri, H.P.: Coherence of the Electroencephalogram during the First Sleep Cycle. Clin. Neurophysiol. 116 (2005) 1088-1095
10. Nunez, P.L.: Electric Fields of the Brain: The Neurophysics of EEG. Oxford University Press, New York, NY (1981)
11. Guevara, R., Velazquez, J.L., Nenadovic, V., Wennberg, R., Senjanovic, G., Dominguez, L.G.: Phase Synchronization Measurements Using Electroencephalographic Recordings: What Can We Really Say about Neuronal Synchrony? Neuroinformatics 3 (2005) 301-314
12. Zaveri, H.P., Duckrow, R.B., Spencer, S.S.: The Effect of a Scalp Reference Signal on Coherence Measurements of Intracranial Electroencephalograms. Clin. Neurophysiol. 111 (2000) 1293-1299
13. Zaveri, H.P., Duckrow, R.B., Spencer, S.S.: On the Use of Bipolar Montages for Time-series Analysis of Intracranial Electroencephalograms. Clin. Neurophysiol. 117 (2006) 2102-2108
14. Schiff, S.J.: Dangerous Phase. Neuroinformatics 3 (2006) 315-318
15. Hyvarinen, A., Oja, E.: Independent Component Analysis: Algorithms and Applications. Neural Networks 13 (2000) 411-430
16. Mormann, F., Kreuz, T., Rieke, C., Andrzejak, R.G., Kraskov, A., David, P., Elger, C.E., Lehnertz, K.: On the Predictability of Epileptic Seizures. Clin. Neurophysiol. 116 (2005) 569-587
Biological Inspired Global Descriptor for Shape Matching* Yan Li, Siwei Luo, and Qi Zou Department of Computer Science Beijing Jiao Tong University Postcode 100044 Beijing China
[email protected],
[email protected],
[email protected] Abstract. Shape description is the precondition for shape matching and retrieval. The most robust and stable primitives for describing shapes are global topological properties, but obtaining global topological properties is still an obstacle in computer vision. Motivated by the difference sensitivity of short-range connections in biological vision, we present a novel global descriptor to describe the entire topology of a simple closed 2D shape in this paper. We employ two novel strategies: the zigzag rule, which approximates a shape by an elaborate polygonal curve, and a cost function that combines global configurations as well as local information of the line stimuli as penalty terms. With these two key steps the descriptor is robust to translation, scaling and rotation. Experimental results show that the model gains good performance in matching and retrieval of silhouettes. Even for images with occlusion the results are excellent and reasonable.
1 Introduction A problem of both theoretical and practical importance in shape matching and retrieval is how to describe a shape. As is well known, simple cells in the V1 cortex extract local features, such as position and orientation; however, how can global information, such as the correlation between local features and topology, be represented? An important finding of the 1960s is that the majority of neurons function as edge detectors: they react strongly to an edge or a line of a given orientation in a given position of the visual field [1], and the visual neural system manages information through the sheer number of neurons rather than the complexity of their individual functions [2, 3]. The neuron synapses form a network by which the features of visual objects can be detected and organized in less than a millisecond [4, 5]. It is a swift synchronous process over distributed information, and accordingly it makes a great contribution to complex visual tasks such as grouping and matching. The synchronous action is completed by short-range and long-range connections [6]. Here, we focus on the contribution of short-range connections to planar closed curve description. *
Supported by the National Natural Science Foundation of China under Grant Nos. 60373029 and the National Research Foundation for the Doctoral Program of Higher Education of China under Grant Nos. 20050004001.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1281–1290, 2007. © Springer-Verlag Berlin Heidelberg 2007
Motivated by neurophysiology, we build a network model to simulate the aggregation of simple cells in the V1 cortex. The optimally responding neuron records the relative position, orientation and frequency of a specific line stimulus. In this way, a meaningful stimulus turns into a connected network constructed by simple cells in the primary visual cortex. Each node, corresponding to a single simple cell in the network, records the local primary visual features, and the short-range connections record the differences among the local features. Our main hypothesis is that the topological configuration features of line stimuli can be recorded in the topology of the neural network. The basic idea of our method is very simple. First, we approximate a planar closed shape by a polygon based on our zigzag rule, at the same time obtaining the position, orientation and length of each edge, as simple cells do. Second, we use these local features to build our descriptor, in which the statistical distributions of relative distances and orientation differences are used. Third, the matching is completed with a cost function made up of three items: the similarity distances among descriptors, a penalty for the difference in node number, and a local-similarity restriction.
2 Related Work The global descriptor (GD) developed in this paper is closely related to two areas of current work. The first area concerns how to describe a shape. Some approaches use interest points and their luminance information to represent object images, such as [7, 8, 9, 10]. However, these approaches are not appropriate for shapes, because shapes carry configuration information that cannot be captured by a few points. In other words, global-attribute information is much more important than local-attribute information in shape representation. A great deal of research has been done on contour images. Since a contour can be represented as a single closed curve, there are many methods based on a parameterized perimeter. Arkin et al [11] use the distance between the turning functions of two polygons. Latecki et al [12] improved this work by computing the similarity between corresponding visual parts. Mokhtarian et al [13, 14] use the CSS (curvature scale-space) image to represent the shapes of object contours. The above two approaches perform well [15]; however, they fail to account for the global configuration. S. Belongie et al [16] represent the configuration of a contour by the global shape context, which captures the distribution of the remaining points relative to a reference point, and [17] extended the shape context into a robust method for capturing spatial relationships. In [18], Mortensen et al combined SIFT and the shape context, which gained a more accurate result in point matching. Our work is an improvement of [16]. The second area concerns the correspondence problem. For shape matching, the purpose should not only be to minimize the total similarity cost between feature pairs, but also to preserve the spatial relationships among corresponding features. In [19], C. Scott and R. Nowak extend the standard assignment problem [20] so as to preserve the order of the point correspondence.
By enforcing the order-preserving constraint, the accuracy and geometric fidelity of the point matching are increased, leading in turn to improved shape similarity assessment.
3 Simple Shape Evolution Based on Zigzag Rule Since contours of objects in digital images are often distorted, an obvious way to neglect the distortions is to approximate the shape by a polygonal curve that preserves sufficient configuration information of the original. We treat a contour as a finite point set first, and then group and merge the points into segments. In Latecki et al [21], a discrete curve evolution process was generated with a stop parameter for the iterations. As a key contribution, we propose a novel evolution, the simple shape evolution, which requires neither a parameter nor iterations. Our purpose is to use as few segments as possible while representing the configuration of the contour as elaborately as we can. First, we obtain the contour using code available from M. Dow et al [22], which provides a set of points P = {p_1, p_2, ..., p_n}, p_i = ((p_i)_x, (p_i)_y) ∈ R^2, in which (p_i)_x and (p_i)_y are the x- and y-coordinates of the point, and the sequence number i is the order in which the points constitute the contour. Second, we group points p_i, p_{i+1}, ..., p_{i+d} ∈ P into first-order segments l_j (abbreviated FS below) if their x-coordinates or y-coordinates are the same or proportionate. We define θ_j ∈ {−3π/4, −π/2, −π/4, 0, π/4, π/2, 3π/4, π} as the direction of the FS. This step is an interim process that forms an FS set L = {l_1, l_2, ..., l_m}, with each segment l_j having length d_j and direction θ_j. The third step is the key step: merging several FSs into a final segment. Our merging rule checks whether sequential segments l_j, l_{j+1}, ..., l_{j+t} ∈ L have the same directions alternately. If they do, we merge them into a segment s_k. Since the figure of s_k looks like a zigzag, we call the rule the zigzag rule.
Zigzag rule. Sequential segments l_j, l_{j+1}, ..., l_{j+t} ∈ L can be merged into a segment s_k for approximation if they have the same directions alternately. The direction θ_k of s_k is computed by the definition of slope. Fig. 1 shows a sample merging under these conditions.
Fig. 1. Sample merging. The entire contour and the approximate polygon are on the left. The rectangle in the lower right corner of the left image is zoomed in on the right image. The points are the original contour, and the solid line is the merged segment, which can be represented by its starting point and ending point; the two points are surrounded by rectangles.
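The evolution above can be sketched in code. The fragment below is our own rough reconstruction, not the authors' implementation: step directions are quantized to the eight grid directions, consecutive equal-direction steps form first-order segments, and runs whose directions alternate (we require at least an A-B-A pattern, an assumption the paper leaves implicit) are merged into one polygon edge.

```python
import numpy as np

def first_order_segments(points):
    """Group consecutive contour steps with equal (quantized) direction into
    first-order segments; returns (first step, one past last step, direction)."""
    steps = np.diff(points, axis=0)
    dirs = np.arctan2(steps[:, 1], steps[:, 0])
    dirs = np.round(dirs / (np.pi / 4)) * (np.pi / 4)   # quantize to 8 grid directions
    segments, start = [], 0
    for i in range(1, len(dirs)):
        if dirs[i] != dirs[start]:
            segments.append((start, i, dirs[start]))
            start = i
    segments.append((start, len(dirs), dirs[start]))
    return segments

def zigzag_merge(segments):
    """Merge maximal runs whose directions alternate (A-B-A-B...) into single
    polygon edges; segments that do not zigzag are kept as edges themselves."""
    merged, i = [], 0
    while i < len(segments):
        j = i + 1
        # extend the run while every segment repeats the direction two back
        while j + 1 <= len(segments) - 1 and segments[j + 1][2] == segments[j - 1][2]:
            j += 1
        if j - i >= 2:                                   # at least an A-B-A zigzag
            merged.append((segments[i][0], segments[j][1]))
            i = j + 1
        else:
            merged.append((segments[i][0], segments[i][1]))
            i += 1
    return merged
```

On a pixel staircase (right, up, right, up, right) the five unit steps collapse into one edge, while a genuine corner (two rights, then two ups) is kept as two edges.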
4 Generating Shape Descriptor As mentioned in Section 1, we build a model to simulate the network of V1 cortex simple cells. Because the network is fully connected, the differences and their corresponding local relations are stored in the network topology. In our model, we define the global descriptor to represent this topology network, denoting a single simple cell together with its topology with all other cells by s_k. We use the histogram of binary relations of relative differences of positions and orientations, (h_ij^geodis, h_ij^ori), to represent the statistical rules. Note that h_ij^geodis is not the Euclidean distance but the geodesic distance, computed along the contour. From the results in [17], we can conclude that the geodesic distance, which is more accurate than the inner-distance, is more effective at representing the position of a line stimulus, especially in non-convex shapes such as articulated shapes. The value field of h_ij^geodis is grouped into m intervals and that of h_ij^ori into n intervals. Their combined distribution of (h_ij^geodis, h_ij^ori) generates m × n intervals, and any (h_ij^geodis, h_ij^ori) falls into exactly one of them. We compute the histogram h_ij = (h_ij^geodis, h_ij^ori) of relative differences of positions and orientations between cell s_k and the remaining k − 1 cells on the shape:

descriptor(s_k) = { h_ij(s_k) / Σ_{i=1}^{m} Σ_{j=1}^{n} h_ij(s_k) : i = 1, 2, ..., m; j = 1, 2, ..., n },    (1)

h_ij(s_k) = #{ q ≠ s_k : h_ij^geodis ∈ interval^geodis(i) and h_ij^ori ∈ interval^ori(j) }.    (2)
This normalized histogram is defined to be the shape descriptor; an example is shown in Fig. 2.
Fig. 2. Sample of the global descriptor. The approximate polygon of the contour is shown on the left, in which the bold line is s_k. Its descriptor is on the right. The x-coordinate shows the 10 intervals of h_ij^geodis; within each interval, the different colors indicate the 4 intervals of h_ij^ori, and the y-coordinate shows the count in each bin.
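Eqs. (1)-(2) can be sketched compactly. The code below is our own reconstruction: the edge-midpoint convention for geodesic distance, the bin ranges, and the undirected-orientation folding modulo π are assumptions, not details fixed by the paper.

```python
import numpy as np

def global_descriptor(lengths, thetas, k, m=10, n=4):
    """Normalized m x n joint histogram of (geodesic distance, orientation
    difference) from polygon edge k to every other edge, as in Eqs. (1)-(2)."""
    lengths = np.asarray(lengths, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(lengths)))
    mid = cum[:-1] + lengths / 2.0                  # edge midpoints along the contour
    total = cum[-1]
    d = np.abs(mid - mid[k])
    d = np.minimum(d, total - d)                    # shorter way around the closed contour
    dtheta = np.abs(thetas - thetas[k]) % np.pi     # undirected line orientations
    mask = np.arange(len(lengths)) != k             # q != s_k in Eq. (2)
    H, _, _ = np.histogram2d(d[mask], dtheta[mask], bins=[m, n],
                             range=[[0.0, total / 2.0], [0.0, np.pi]])
    return H / H.sum()                              # normalization of Eq. (1)
```

For a unit square (four edges of length 1 with directions 0, π/2, π, 3π/2), the descriptor of edge 0 is a 10 × 4 histogram whose entries sum to one.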
4.1 Extension of Shape Context
To extend the shape context defined in [16], we redefine the bins with the differences of positions and orientations of the line stimuli. From a biological view, in
the primary visual cortex a line stimulus is more reasonable than a point for shape representation. As a result, we use line segments to represent the shape and replace the polar angle θ in the shape context with the difference of line stimulus orientations. At the same time, the Euclidean distance is replaced by the geodesic distance, which is more sensitive to shape transformations. In the following, the shape context is called SC and our descriptor GD.
5 Measuring the Shape Similarity Including the distances among descriptors, the penalty from the difference of node number, and the restriction of local similarity, our cost function is made up of three items, D, Num, and LS. Considering descriptor(s_1k) for node s_1k on one contour, and another node s_2l on the other contour with descriptor(s_2l), we define the distance using the χ² test statistic to denote the difference between the two contours in (3):

d_kl ≡ || descriptor(s_1k) − descriptor(s_2l) || = (1/2) Σ_{i=1}^{m} Σ_{j=1}^{n} [h_ij(s_1k) − h_ij(s_2l)]² / [h_ij(s_1k) + h_ij(s_2l)].    (3)
Given the set of distances between an M-node shape and an N-node shape, where M ≥ N, we can obtain the minimum total cost, D = min_{π∈Π} Σ_{k=1}^{M} d_{k,π(k)}. Here, π(k) is a permutation of minimum matching cost. This is a weighted bipartite matching problem: the input is a cost matrix with entries d_kl, and the result is a minimum permutation π(k) together with the minimum cost D. In our experiments, we use the efficient algorithm of [22]. The parameters will be discussed in the next section. Unlike SC, which can normalize a shape to a fixed number of points, we wish to maintain as much structural information as possible; since the node number on each contour is not always the same, a "dummy" system is necessary. For two shapes with M and N nodes, where M ≥ N, we add dummy nodes to the second shape with an average matching cost of ε_d = (Σ_{k=1}^{M} Σ_{l=1}^{N} d_kl) / (M·N). In this case, a node will be matched to a "dummy" whenever no real match is available on the second shape. Our dummy nodes are not meant to handle outliers but to punish the node number difference between the query shape and the reference shape. So the second item of the cost function is

Num = k_Num (M − N) ε_d,  (k_Num ≥ 0),    (4)

where k_Num is a punishment coefficient. The third item of our cost function concerns local similarity. For a specific line stimulus in a bi-level contour, the length of the line stimulus is the local information we are concerned with. In the query and reference shapes, a pair of corresponding line stimuli should have similar lengths. With this consideration, we define the third item as

LS = k_LS | length(s_1k) − length(s_2π(k)) |,  (k_LS ≥ 0),    (5)

where length(·) is the length of a line stimulus and k_LS is a punishment coefficient.
Our final cost function is

cost = min(D + Num + LS) = min_{π∈Π} ( Σ_{k=1}^{M} d_{k,π(k)} ) + k_Num (M − N) ε_d + k_LS | length(s_1k) − length(s_2π(k)) |.    (6)

In this cost function, the second item is determined once the cost matrix is computed, and the third item is added as an appendix to the best permutation.
6 Matching In this section we illustrate some aspects of the similarity measure and shape descriptor by computing some contours using the method presented above. The images are downloaded from the Kimia data set. Our approach is simple and easy to apply, and at every step it matches the goal very well. In the shape evolution stage, we use 50 images to compute the percentage of segment reduction of our method. Each image is normalized to 155 × 155 pixels, so that each shape consists of 220 to 600 points. After evolution, the number of final segments is about 18 to 60, and the percentage of reduction is about 91.3%. 6.1 Translation, Rotation and Scaling Invariance

A matching approach should be invariant under translation, scaling and rotation, and here we evaluate our shape descriptor matching by these criteria. Invariance to translation is inherent to our descriptor, since all the measurements are computed from points taken on the contour. The final computation of the descriptor uses relative distances, avoiding the effect of absolute numerical values. In our experiments, scaling is tested with 50 basic shapes and 6 shapes derived from each basic shape by scaling with factors 1.3, 1.2, 1.1, 0.9, 0.8 and 0.7. If the scaling factor deviates by more than 30%, invariance decreases a bit, probably because of discretization error. Fig. 3 shows the scaling sample.
Fig. 3. Sample of scaling invariance. The values under the images are the scaling factors (−30% to +30%).
Fig. 4. Sample of rotation invariance. The values under the images are the rotation angles (90°, 180°, 270°).
As to rotation, our shape descriptor guarantees almost complete invariance when the rotated angle is kπ/2 (k = 0, 1, 2, 3). Fig. 4 shows the rotation sample. For the shape similarity measures, the average cost per segment for scaling and rotation is computed. With the GD normalized to [0, 1], the biggest distance between two GDs is 20. The results are shown in Table 1. We can conclude from the data that as the scaling factor deviates further from 1, the cost grows. For rotation, we achieve almost complete rotation invariance when the orientation difference is 90°, and a somewhat larger distance for other rotation angles.

Table 1. Average cost for scaling and rotation

  Scaling    +30%    +20%    +10%    -10%    -20%    -30%
  Cost       3.05    2.84    2.68    2.87    3.04    3.35

  Rotation   90°     180°    270°
  Cost       1.94    1.17    1.78
6.2 Retrieval on Kimia Database
To show the performance of the global descriptor in matching and retrieval, we test it on the Kimia data set. This test contains 25 images from 6 categories, shown in Fig. 5. It has been tested by [21]. In our experiments, the last image is replaced by a similar one because the original one is not appropriate for extracting a simple closed contour. The parameters used in our experiments are as follows: L in COPAP [22] is 85% of min(M, N), and

k_Num = 0.9,  k_LS = 0.1 · ε_d / (length(s_1k) + length(s_2π(k))).    (10)

These values show that, for our matching, the punishment for the node number difference is much bigger than that for local similarity; the node number equality of two shapes is more important. The retrieval result is summarized as the number of 1st, 2nd and 3rd closest matches that fall into the correct category. Because we choose line stimuli rather than points to build our descriptor, discretization affects our results a bit. The performance of our descriptor is not the best, but the result shows that it can represent the shape to a great extent. Compared with [2] and [21], the computation of our method is much easier and swifter while achieving comparable performance. From a biological view, although using points may give better results, line stimuli are more reasonable for shape representation in the primary visual cortex.
Fig. 5. Kimia dataset: This dataset contains 25 instances of 6 categories
Table 2. Retrieval result on the Kimia dataset [19] (Fig. 5)

  Method          Top 1    Top 2    Top 3
  Sharvit [19]    23/25    21/25    20/25
  G and W [7]     25/25    21/25    19/25
  Belongie [2]    25/25    24/25    22/25
  IDSC+DP [21]    25/25    24/25    25/25
  GD              24/25    22/25    20/25
6.3 Matching with Occlusion
To show the performance of the global descriptor under occlusion, we test it on the Kimia data set as well. This test contains 15 basic images and 3 shapes derived from each basic shape by rotation with a π/2 orientation difference and at most 30% occlusion. See Fig. 6 for an example. In our experiments, the parameters are the same as in Section 6.2. The retrieval result is summarized as the number of 10%-occlusion, 20%-occlusion and 30%-occlusion closest matches that fall into the shapes derived from the same basic one. The performance of our descriptor is 14/15, 12/15 and 10/15, respectively. We provide good robustness to rotation and to occlusion of less than 20%.

Fig. 6. Sample of occlusion. In each category, the first image is the basic image, and the other three are derived from it with different occlusions and rotations.
6.4 An Extreme Example
In this section we test the GD under extreme occlusion, from 10% to 80%. We use the same 15 images as in Section 6.3 and occlude them along the x-axis, the y-axis, and the main axis of the shape. See Fig. 7 for examples. A shape with many concavities and convexities is called a complex shape, such as Basic A; on the contrary, the other shapes are called simple shapes, such as Basic B. In this test, we rank the occluded shapes by their similarity to the basic one in the same category. The result is in Fig. 8. At the same time, we rank them by human eye; the results are Eye A and Eye B. Fig. 9 shows the relation between the contour occlusion percentage and the reciprocal of similarity. Based on repeated experiments on the 15 images, we find that when the occlusion is small the results of the GD and the human eye are exactly the same. This means our method reasonably simulates human vision to some extent. As the occlusion grows, the human eye becomes confused; these cases are labeled with underlined numbers, just like the inflexion points in Fig. 9. In Fig. 9 we can see that the L-curve of a simple contour has an obvious inflexion point, meaning that however the occlusion percentage varies, the possibility of recognition shows no big difference. On the contrary, the L-curve of a complex contour has no obvious inflexion point. This may be because complex contours have many features, such as concavities and convexities, so that several missing features will not affect the entire contour. On the contrary, a simple contour has few features, and one or two missing may result in a bad recognition result, so the inflexion is much more obvious.
[Fig. 7 appears here: basic shapes and their occluded variants 1–15.]

Fig. 7. Sample of extreme occlusion along the x-axis, the y-axis, as well as the main axis of the shape
[Fig. 8 appears here: the rankings of the occluded variants 1–15 produced by the GD (rows Basic A and Basic B) and by human observers (rows Eye A and Eye B).]

Fig. 8. Rank results in extreme occlusion
Fig. 9. The relation of contour occlusion percentage and reciprocal of similarity
7 Conclusion and Future Work We have presented a new biologically inspired approach to shape matching and retrieval. We employ two novel strategies: the zigzag rule, which approximates a shape by an elaborate polygonal curve, and a cost function that combines global configurations as well as local information of the line stimuli as penalty terms. Combining these with the shape descriptor reflecting the entire topology of the contour, our matching model is robust under translation, scaling and rotation. Compared with approaches based on local luminance features, our matching gives much more accurate results, and compared with other contour-based matching methods, our model is much easier and more efficient. Since the segmentation is based on the pixel level, invariance under arbitrary rotation angles and arbitrary scaling factors is not always the same. Future improvements include making the descriptor more robust to scaling and arbitrary rotation, and extending the model to more complex shapes. The great significance of the GD is in applying biologically inspired elicitation to computer vision, and it has led to at least comparable results with current methods. We believe the
established model will be constantly improved by adding new biological and cognitive details.
References
1. Hubel, D.H., Wiesel, T.N.: Receptive Fields, Binocular Interaction, and Functional Architecture in the Cat's Visual Cortex. J. Physiology (London) 160 (1962) 106-154
2. Wolfram, S.: Universality and Complexity in Cellular Automata. Physica D 10 (1) (1984) 1-35
3. Jackson, E.: Perspectives of Nonlinear Dynamics. Cambridge University Press, New York 2 (1990) 454-504
4. Riehle, A., Grün, S., Diesmann, M., et al.: Spike Synchronization and Rate Modulation Differentially Involved in Motor Cortical Function. Science 278 (1997) 1950-1953
5. Mainen, Z.F., Sejnowski, T.J.: Reliability of Spike Timing in Neocortical Neurons. Science 268 (1995) 1503-1506
6. Amir, Y., Harel, M.: Cortical Hierarchy Reflected in the Organization of Intrinsic Connections in Macaque Visual Cortex. J. Comp. Neurol. 334 (1) (1993) 19-46
7. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: Fourth Alvey Vision Conf. (1988) 147-151
8. Schmid, C., Mohr, R.: Local Grayvalue Invariants for Image Retrieval. IEEE Trans. PAMI 19 (1997) 530-534
9. Lowe, D.G.: Object Recognition from Local Scale-Invariant Features. In: ICCV (1999) 1150-1157
10. Khotanzad, A., Hong, Y.H.: Invariant Image Recognition by Zernike Moments. IEEE Trans. PAMI 12 (1990) 489-497
11. Arkin, E.M., Chew, L.P., Huttenlocher, D.P., Kedem, K., Mitchell, J.S.B.: An Efficiently Computable Metric for Comparing Polygonal Shapes. IEEE Trans. PAMI 13 (1991) 209-216
12. Latecki, L.J., Lakämper, R.: Shape Similarity Measure Based on Correspondence of Visual Parts. IEEE Trans. PAMI 22 (10) (2000) 1185-1190
13. Mokhtarian, F., Abbasi, S., Kittler, J.: Efficient and Robust Retrieval by Shape Content through Curvature Scale Space. In: Smeulders, A.W.M., Jain, R. (eds.): Image Databases and Multi-media Search. World Scientific (1997) 51-58
14. Mokhtarian, F., Mackworth, A.K.: A Theory of Multiscale Curvature-based Shape Representation for Planar Curves. IEEE Trans. PAMI 14 (1992) 789-805
15. Latecki, L.J., Lakamper, R., Eckhardt, U.: Shape Descriptors for Non-Rigid Shapes with a Single Closed Contour. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (2000) 424-429
16. Belongie, S., Malik, J., Puzicha, J.: Shape Matching and Object Recognition Using Shape Contexts. IEEE Trans. PAMI 24 (4) (2002) 509-522
17. Ling, H., Jacobs, D.: Using the Inner-distance for Classification of Articulated Shapes. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Diego, CA (2005)
18. Mortensen, E.N., Hongli, D., Shapiro, L.: A SIFT Descriptor with Global Context. In: CVPR 1 (2005) 184-190
19. Scott, C., Nowak, R.: Robust Contour Matching via the Order Preserving Assignment Problem. Accepted for publication in IEEE Transactions on Image Processing
20. Jonker, R., Volgenant, A.: A Shortest Augmenting Path Algorithm for Dense and Sparse Linear Assignment Problems. Computing 38 (1987) 325-340
21. Latecki, L.J., Lakamper, R.: Convexity Rule for Shape Decomposition Based on Discrete Contour Evolution. Computer Vision and Image Understanding 73 (3) (1999) 441-454
22. Dow, M., Nunnally, R.: http://lcni.uoregon.edu/~mark/SS_Edges/SS_Edges.html
Fuzzy Support Vector Machine for EMG Pattern Recognition and Myoelectrical Prosthesis Control
Lingling Chen, Peng Yang, Xiaoyun Xu, Xin Guo, and Xueping Zhang
School of Electrical Engineering and Automation, Hebei University of Technology, 300130 Tianjin, China
[email protected]
Abstract. Toward optional control of a trans-femoral prosthesis with a natural gait, an ongoing investigation of a lower limb prosthesis model with myoelectrical control is presented. In this research, surface electromyographic signals of the lower limb are extracted to serve as switch signals and are translated into movement information. Considering each muscle's different physiological tendency, a fuzzy support vector regression method is applied to establish an intelligent black box that interprets the physiological signals into accurate knee joint angle information. It achieves comparable or better performance than other methods, and provides a more natural gait to the prosthesis user.
1 Introduction
Electromyography (EMG) detects the bioelectrical signals generated by muscles during contraction. Aside from the traditional uses of detecting neuromuscular disease and muscle weakness, and of modeling muscle movements, applications of EMG have increased tremendously, especially in the field of prosthetics and exoskeletons. Because the technology allows electrical detection of muscle contractions, these signals can be used as inputs to control artificial limbs, or even to help ease a heavy load by acting in accord with the human body. In this sense, surface EMG (SEMG) sensor technology has grown and will probably grow even further in the future. After a limb is amputated, the brain continues to send signals to the remainder of the limb. EMG signals that are intended for the movement of the missing limb can potentially be interpreted and used to control a prosthesis [1]. Human locomotion is a complex biological process controlled by the cerebrum and nerve centers. Although this type of approach has proved successful for arm movement control, it has mostly focused on upper limb prosthesis control. Walking, although seemingly stereotyped, is highly complex, as it integrates equilibrium constraints and forward propulsion in a multi-joint system. In the swing phase, the gait of the prosthesis must be kept symmetric to that of the able-side limb to the greatest possible extent. So the crus prosthesis should have the same
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1291–1298, 2007. © Springer-Verlag Berlin Heidelberg 2007
acceleration and deceleration course as the able-side leg in the swing phase. These movements are controlled by the muscles of the thigh and crus. Contractions of those muscles produce the knee moment that drives the knee joint, making the knee moment change according to a certain rule, controlling the flexion angle of the knee joint, and accelerating the crus forward in the initial swing phase [2]. For the lower limb prosthesis, optionally adjusting the flexion angle of the knee joint is the key to optional movement control of the prosthesis. On the other hand, another essential requirement for the prosthesis is safety and reliability: it must avoid "giving way" and adapt to different kinds of emergency. The support vector machine (SVM) is based on statistical learning theory [3] [4] and can be used for pattern classification or regression, such as object recognition, speech recognition, and isolated handwritten digit recognition [5] [6]. Therefore, we applied the SVM to classify the on-off state of the artificial knee joint, and the support vector regression (SVR) method is applied to interpret the physiological signals to give accurate information on the position and movement of the knee joint. It can achieve comparable or better performance than other methods, and provide a more natural gait to the prosthesis user. In this study, a lower limb prosthesis model with myoelectrical control is presented. This approach can avoid giving way of the prosthesis, and a fuzzy SVR (FSVR) model is established to predict the knee joint angle from SEMG signals.
2 Myoelectrical Prosthesis
The controlled lower limb prosthesis will have special features, some of which are not available, or not well tuned, in conventional prostheses. These features include controlled knee flexion at early stance, high stability during weight bearing with a single-axis knee, controlled knee release at late stance, controlled heel rise at early swing, damped full knee extension at late swing, a damped knee during stair descent and sitting down, and adaptation to gait speed [7]. In order to address these problems, the structure of the control system is described by the block diagram shown in Fig. 1.
[Figure 1: block diagram linking the stump and able-side leg sensors and their EMG signals, a classifier and gait-state identification, the on-off signal, knee joint moment, self-lock control and electric control, and the prosthesis with its gait-control physical quantities.]
Fig. 1. The prosthesis model's main control source is the SEMG signal
2.1 Prosthesis Model with Myoelectrical Control
Because amputees' conditions differ, controlling the prosthesis with the stump's SEMG signal is very complex. So the SEMG signal sampled from the able-side leg is mostly used as the control signal. As the essential control requirement of the prosthesis is real-time operation, the method of processing the raw SEMG signal must be simple, fast, and effective. Therefore the linear envelope of the rectified signal is the input of the SVR model. This model can be regarded as a nonlinear function estimator: it accomplishes the nonlinear mapping from physiological signals to the position and movement of the knee joint, and makes an optimal estimate of the knee joint angle.

2.2 Giving Way
The main control of the prosthesis is the control of the knee joint, in both the stance and swing periods. The most important demand of a lower limb prosthesis is avoiding "giving way" and falls, namely the knee joint flexing suddenly while the prosthesis is bearing weight (a slight knee flex is allowed in the initial stage of standing, but it must be extended subsequently). For instance, the C-Leg uses a sensor under the tiptoe: when the tiptoe bears enough force, the self-locking knee joint opens and the amputee can walk. Here, the SEMG signal sampled from the stump acts as the switching control signal and fulfills the conversion from standing to walking. It can clearly distinguish the transition from standing to walking, and after classification it can serve as the on-off signal.
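As a rough, self-contained illustration of the standing-to-walking switch described above, the sketch below uses a simple amplitude threshold on a synthetic SEMG envelope as a stand-in for the classifier; the threshold value and all data are illustrative assumptions, not the paper's parameters.

```python
# Sketch: using the stump-SEMG envelope as an on-off (standing -> walking) switch.
# A plain amplitude threshold stands in for the SVM classifier described later;
# the threshold and the synthetic data are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
standing = np.abs(rng.normal(0.05, 0.02, 300))   # low muscle activity
walking = np.abs(rng.normal(0.60, 0.15, 300))    # strong bursts during gait
envelope = np.concatenate([standing, walking])

THRESHOLD = 0.3                                  # assumed switching level
unlock = envelope > THRESHOLD                    # True -> release the self-lock
switch_idx = int(np.argmax(unlock))              # first sample classified "walking"
```

In a real controller this decision would come from the trained classifier rather than a fixed threshold, but the on-off role of the signal is the same.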
3 Gait-State Identification
The support vector machine represents a new approach to pattern classification that has attracted a great deal of interest in machine learning. It has succeeded in solving many pattern recognition problems and has performed better than many nonlinear classifiers. One solution is the extraction of the statistical information enclosed in a black-box model. As neural networks encrypt the model into a complex, nonlinear mathematical formula, they are not easy to interpret. In contrast to the back-propagation network (BP-NN), the SVM may be more appropriate for this purpose, given that the support vectors represent the critical samples for the classification task. The approach here is to develop an intelligent black box that can take the physiological signals and interpret them to give accurate information on the knee joint angle.

3.1 Support Vector Regression
Given a set of data points \(\{(x_1, z_1), \dots, (x_l, z_l)\}\), such that \(x_i \in \mathbb{R}^n\) is an input and \(z_i \in \mathbb{R}\) is a target output, the standard form of support vector regression is

\[ \min_{w,\,b,\,\xi,\,\xi^*,\,\varepsilon} \; \frac{1}{2} w^T w + C\Big(\nu\varepsilon + \frac{1}{l}\sum_{i=1}^{l}(\xi_i + \xi_i^*)\Big). \quad (1) \]
Subject to

\[ (w^T \phi(x_i) + b) - z_i \le \varepsilon + \xi_i, \quad z_i - (w^T \phi(x_i) + b) \le \varepsilon + \xi_i^*, \quad (2) \]

\[ \xi_i, \xi_i^* \ge 0, \; i = 1, \dots, l, \quad \varepsilon \ge 0. \quad (3) \]

The dual is:

\[ \min_{\alpha,\,\alpha^*} \; \frac{1}{2}(\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + z^T(\alpha - \alpha^*), \quad e^T(\alpha - \alpha^*) = 0, \quad (4) \]

\[ e^T(\alpha + \alpha^*) \le C\nu, \quad 0 \le \alpha_i, \alpha_i^* \le C/l, \; i = 1, \dots, l, \quad (5) \]

where \(Q_{ij} = K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)\), \(e\) is the vector of all ones, \(C > 0\) is the upper bound, and \(K(x_i, x_j)\) is the kernel. The training vectors \(x_i\) are mapped into a higher dimensional space by the function \(\phi\). The decision function is

\[ f(x) = \sum_{i=1}^{l} (-\alpha_i + \alpha_i^*) K(x_i, x) + b. \quad (6) \]
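The formulation above is normally solved in the dual. As a rough, self-contained illustration of the \(\varepsilon\)-insensitive fitting idea of Eqs. (1)-(3) only, the sketch below trains a linear SVR by subgradient descent; all data, names, and constants are synthetic stand-ins for SEMG envelope features and knee angles, not the paper's solver or data.

```python
# Sketch: a linear SVR trained by subgradient descent on the eps-insensitive
# loss of Eqs. (1)-(3) (primal form, linear kernel). This is an illustrative
# stand-in, not the dual QP solver of Eqs. (4)-(6).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 4))        # stand-in envelope features
true_w = np.array([20.0, -5.0, 3.0, 0.0])
z = X @ true_w + rng.normal(0, 0.5, 200)     # stand-in knee angle (degrees)

w, b = np.zeros(4), 0.0
C, eps, lr = 100.0, 0.5, 1e-3
for _ in range(3000):
    r = X @ w + b - z                        # residuals
    # subgradient of the eps-insensitive loss |r|_eps with respect to r
    g = np.where(r > eps, 1.0, np.where(r < -eps, -1.0, 0.0))
    w -= lr * (w + C * X.T @ g / len(z))     # grad of (1/2)||w||^2 + C*mean loss
    b -= lr * C * g.mean()

rmse = float(np.sqrt(np.mean((X @ w + b - z) ** 2)))
```

Residuals inside the \(\varepsilon\)-tube contribute no loss gradient, which is exactly what makes only the boundary samples (the support vectors) shape the solution.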
3.2 Fuzzy Support Vector Regression
Each muscle makes a different contribution to predicting the knee joint angle. For the standard SVR, however, all input vectors have the same influence on the model. Therefore, we divide the whole forecasting process into several subsystems according to the different muscles. A series of sub-models is applied to estimate the output of every subsystem, and the knee joint angle is the weighted sum of every subsystem's output. Using the FSVR to establish this multi-model estimation (Fig. 2) improves the modelling effect, heightens the accuracy of the model, and improves its predictive and generalization ability [9]. Utilizing fuzzy logic theory, the contribution of each muscle can be taken as the fuzzy membership. To do this, we define a one-dimensional membership function \(p_i\). The decision values range from 0 to +1, where values closer to +1 indicate muscles that contribute most to the movement of the knee joint, and values closer to 0 indicate lesser contributions.
[Figure 2: each muscle channel of SEMG signals feeds the FSVR model; an expert system supplies \(p_i\), and the output \(y = \sum_{i=1}^{M} f_i(x, p_i)\) gives the angle of the knee joint.]
Fig. 2. Translation of the SEMG signals recorded from several muscles into the knee joint angle by the FSVR model. The choice of the fuzzy membership depends upon the expert system and the previous knee joint angle.
\(p_i\) is an alterable parameter, and it changes with the variation of the knee joint angle, where \(i\) is the corresponding muscle and \(M\) is the total number of muscles. The process output is the weighted sum of the \(M\) sub-models' outputs:

\[ y_j = \sum_{i=1}^{M} p_i f_i(x_j), \quad j = 1, 2, \dots, l, \quad (7) \]
where \(\sum_{i=1}^{M} p_i = 1\), \(0 \le p_i \le 1\), \(i = 1, 2, \dots, M\). Then we can convert the modeling problem into solving

\[ \min_{p_i,\,f_i} \; \sum_{j=1}^{l} \Big(y_j - \sum_{i=1}^{M} p_i f_i(x_j)\Big)^2. \quad (8) \]
Similar to the standard SVR algorithms, the linear loss function \(\sum_{j=1}^{l}(\xi_j + \xi_j^*)\) is applied in place of the quadratic loss function \(\sum_{j=1}^{l}(\xi_j^2 + (\xi_j^*)^2)\). By using \(f_i(x_j) = \omega_i^T \varphi(x_j) + b\) and the \(\varepsilon\)-insensitive loss function, equation (8) is equivalent to

\[ \min_{\omega_i,\,b,\,\xi,\,\xi^*} \; J = \frac{1}{2}\sum_{i=1}^{M} \omega_i^T \omega_i + C \sum_{j=1}^{l} (\xi_j + \xi_j^*), \quad (9) \]

subject to

\[ y_j - \sum_{i=1}^{M} p_i(\omega_i^T \varphi(x_j) + b) \le \varepsilon + \xi_j, \quad \sum_{i=1}^{M} p_i(\omega_i^T \varphi(x_j) + b) - y_j \le \varepsilon + \xi_j^*, \quad (10) \]

\[ \xi_j, \xi_j^* \ge 0, \; j = 1, 2, \dots, l. \quad (11) \]
To solve this optimization problem, we construct the Lagrangian

\[ L = \frac{1}{2}\sum_{i=1}^{M} \omega_i^T \omega_i + C \sum_{j=1}^{l}(\xi_j + \xi_j^*) - \sum_{j=1}^{l} \alpha_j \Big(\varepsilon + \xi_j - y_j + \sum_{i=1}^{M} p_i(\omega_i^T \varphi(x_j) + b)\Big) - \sum_{j=1}^{l} \alpha_j^* \Big(\varepsilon + \xi_j^* + y_j - \sum_{i=1}^{M} p_i(\omega_i^T \varphi(x_j) + b)\Big) - \sum_{j=1}^{l} (\eta_j \xi_j + \eta_j^* \xi_j^*). \quad (12) \]
The output of estimation sub-model \(i\) is

\[ f_i(x, p_i) = \sum_{j=1}^{l} p_i (\alpha_j - \alpha_j^*) k(x_j, x) + b. \quad (13) \]

The output of the model is

\[ y = \sum_{i=1}^{M} f_i(x, p_i). \quad (14) \]
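The combination rule of Eqs. (7) and (14) can be sketched as follows. The sketch is an illustration under stated assumptions: each per-muscle sub-model is replaced by an RBF kernel ridge regressor (a stand-in for the per-channel SVR of Eq. (13)), the memberships are fixed by hand, and the data are synthetic.

```python
# Sketch of the FSVR blending of Eqs. (7)/(13)-(14): one sub-model per muscle
# channel, outputs combined by fuzzy memberships p_i with sum(p_i) = 1.
# Kernel ridge regression stands in for each per-channel SVR sub-model.
import numpy as np

def rbf(a, b, nu=0.4):
    # Gram matrix of exp(-nu * ||a - b||^2) between two sample sets
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-nu * np.sum(d * d, axis=2))

rng = np.random.default_rng(1)
M, l = 4, 120                                # muscles, training samples
X = rng.uniform(0, 1, size=(M, l, 1))        # per-muscle envelope feature
angle = 40 * X[0, :, 0] + 10 * X[1, :, 0]    # toy knee-angle target (degrees)

p = np.array([0.5, 0.3, 0.1, 0.1])           # assumed memberships, sum to 1

# Fit each sub-model f_i on its own channel against the common target.
coefs = []
for i in range(M):
    K = rbf(X[i], X[i])
    coefs.append(np.linalg.solve(K + 1e-3 * np.eye(l), angle))  # ridge-regularized

def predict(Xnew):
    # y = sum_i p_i * f_i(x)  -- Eqs. (7)/(14)
    return sum(p[i] * rbf(Xnew[i], X[i]) @ coefs[i] for i in range(M))

pred = predict(X)
```

In the paper the memberships come from the expert system and the previous knee angle rather than being fixed, but the weighted-sum structure is the same.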
4 Experimental Results
The SEMG signal was recorded with the Infiniti system, a biofeedback and physiological monitoring system. It can record the SEMG signal synchronized with the knee joint angle in order to study temporal relationships. The SEMG signals were captured from four muscles of the lower limb: rectus femoris, vastus lateralis, biceps femoris, and tensor fasciae latae. The corresponding knee joint angles were also recorded. Five subjects were asked to walk at 40 steps per minute. The raw SEMG signals are depicted in Fig. 3 over 4 gait cycles.
Fig. 3. The SEMG signals (μV) from the rectus femoris, vastus lateralis, biceps femoris, and tensor fasciae latae, and the corresponding knee joint angle, were recorded. The EMG data are preprocessed into signal features, and the signal envelopes are used as input to the SVM.
In this study, the root mean square (RMS) of the rectified EMG signal with a time constant of 25 ms is applied to minimize the non-reproducible part of the signal and to outline the mean trend of signal development [8]. Signal normalization with a reference contraction is applied to overcome the "uncertain" character of microvolt-scaled parameters. Based on a threshold standard that defines when a muscle is "on", the on/off timing pattern of each muscle in the gait cycle can be applied to control the opening of the prosthesis. Simulation results show that the SEMG signal and the knee joint angle are strongly related, and this algorithm obtains good forecast results. Models were established by BP-NN, SVR, and FSVR respectively. In order to compare the prediction performance, both SVR and FSVR applied the RBF (radial basis function) kernel function

\[ K(x_i, x_j) = \exp(-\nu \|x_i - x_j\|^2). \quad (15) \]
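As a concrete illustration of the 25 ms RMS smoothing described above, here is a minimal sketch; the 1 kHz sampling rate and the synthetic signal are assumptions for illustration, since the paper does not state a rate.

```python
# Sketch: RMS envelope of a rectified EMG signal with a 25 ms moving window.
# The 1 kHz sampling rate is an assumption; the data are synthetic noise
# modulated to imitate EMG bursts.
import numpy as np

def rms_envelope(emg, fs=1000, win_ms=25):
    n = max(1, int(fs * win_ms / 1000))      # samples per 25 ms window
    sq = np.abs(emg) ** 2                    # rectify, then square
    kernel = np.ones(n) / n                  # moving average of the squares
    return np.sqrt(np.convolve(sq, kernel, mode="same"))

rng = np.random.default_rng(0)
raw = rng.normal(0, 1, 2000) * (1 + np.sin(np.linspace(0, 6 * np.pi, 2000)))
env = rms_envelope(raw)                      # smooth envelope fed to the model
```

This envelope (after normalization by a reference contraction) is what the text describes as the input to the regression model.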
[Figure 4: real versus predicted knee joint angle (degrees) against sample number (0-5000), for the SVR model (left panel) and the FSVR model (right panel).]
Fig. 4. The prediction results and errors of the knee joint angle are shown for SVR and FSVR, respectively
Table 1. Comparison of BP-NN, SVR, and FSVR

Error                             BP-NN     SVR       FSVR
Root Mean Square Error / Degree    9.954     6.1452    3.457
Maximum Positive Error / Degree   30.878    18.565    14.154
Maximum Negative Error / Degree  -26.489   -15.465   -12.154
Here the kernel parameter is ν = 0.4 and the regularization parameter is C = 1.5. The prediction results of the SVR and FSVR methods are depicted in Fig. 4, and their comparison is shown in Table 1. FSVR gives better prediction results than the other methods, and its root mean square error and maximum errors are obviously smaller. For the application of knee joint angle prediction, it is superior in terms of smoothness of the curve and generalization ability. However, the predicted values still need modification. Since the SEMG signals are influenced by many uncertain factors, the prediction should be modified according to experience: if there is an abrupt spike in the angle curve, that prediction is not reliable and can be replaced by the average of the values before and after it.
5 Conclusion
This myoelectric control system can satisfy the optional control and safety needs. The prime advantage of a myoelectrical prosthesis is using the muscle system to control the prosthesis by physiological means. This kind of prosthesis can adapt to external factors, the location of the prosthesis, and the shift of the body. However, there are still many problems in practical application, such as real-time control. We must speed up the algorithm to satisfy the prosthesis user's normal gait. At the same time, the ingenious combination of EMG signals and physical signals is also an important problem.
By applying the FSVR algorithm to construct the angle estimator, we not only obtain satisfactory approximation and generalization properties, but also achieve performance superior to the BP-NN and SVR modeling methods, since the approach sufficiently accounts for every muscle's physiological tendency. More attention must be paid, however, to the selection of the membership function. Attention should also be paid to the model selection of the SVM: with more suitable model selection, a better result will be reached after training.
Acknowledgments. This work was supported by the National Natural Science Foundation of China (60575009). The Research Institute of Prosthetics and Orthotics of the Ministry of Civil Affairs of P. R. China provided great help with the experiments.
References
1. Mordaunt, P., Zalzala, A.S.M.: Towards an Evolutionary Neural Network for Gait Analysis. IEEE (2002) 1922–1927
2. Farina, D., Merletti, R., Nazzaro, M.: Effect of Joint Angle on EMG Variables in Leg and Thigh Muscles. IEEE Trans. Engineering in Medicine and Biology 20 (6) (2001) 62–71
3. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121–167
4. Cortes, C., Vapnik, V.N.: Support Vector Networks. Machine Learning 20 (1995) 273–297
5. Tanaka, K., Kimuro, Y.: Motion Sequence Scheme for Detecting Mobile Robots in an Office Environment. Computational Intelligence in Robotics and Automation 1 (2003) 145–150
6. Osowski, S., Hoai, L.T., Markiewicz, T.: Support Vector Machine based Expert System for Reliable Heartbeat Recognition. IEEE Trans. Biomedical Engineering 51 (4) (2004) 582–589
7. Cheron, G., Leurs, F., Bengoetxea, A., Draye, J.P., Destre, M., Dan, B.: A Dynamic Recurrent Neural Network for Multiple Muscles Electromyographic Mapping to Elevation Angles of the Lower Limb in Human Locomotion. Journal of Neuroscience Methods 129 (2003) 95–104
8. Ferdjallah, M., Myers, K., Starsky, A.: Dynamic Electromyography. Proc. Pediatric Gait Conference (2000) 99–108
9. Feng, R., Shen, W., Zhang, Y., Shao, H.: Multiple Modeling Approach using Fuzzy Support Vector Machines. Control and Decision 18 (6) (2003) 646–650
Classification of Obstructive Sleep Apnea by Neural Networks
Zhongyu Pang1, Derong Liu1, and Stephen R. Lloyd2
1 Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607-7053, USA
[email protected], [email protected]
2 Center for Narcolepsy, Sleep, and Health Research, College of Nursing, University of Illinois at Chicago, Chicago, IL 60612-7350, USA
[email protected]
Abstract. The electroencephalogram (EEG) is a common tool to explore brain activities ranging from concentrated cognitive effort to sleepiness. Regarding sleepiness, pupil behavior can provide some information about alertness, and sleepiness is also reflected in EEG energy; specifically, intrusion of EEG theta wave activity into the beta activity of active wakefulness has been interpreted as ensuing sleepiness. This paper develops an innovative signal classification method that is capable of differentiating subjects with the sleep disorder of obstructive sleep apnea (OSA), which causes excessive daytime sleepiness (EDS), from normal control subjects who do not have a sleep disorder. The theta energy ratios are calculated from 2-second sliding windows by the Fourier transform. A modified ART2 artificial neural network is utilized to identify subjects with OSA from a combined group of subjects including healthy controls. This grouping from the neural network is then compared with the actual diagnostic classification of subjects as OSA or healthy controls, and it is found to be 91% accurate in differentiating between the two groups.
1 Introduction

Sleep plays an important role in the history of neuroscience and in the lives of human beings. Excessive daytime sleepiness caused by sleep apnea can have a disruptive, embarrassing, or even dangerous impact on daily living activities. Symptoms of OSA include loud and irregular snoring, restless sleep, and daytime sleepiness [17]. In addition, episodes of sleep apnea are often associated with oxygen desaturation, and repeated episodes of this desaturation can eventually lead to additional medical complications. Clinically [17], sleep apnea can be diagnosed when those symptoms are present and an all-night sleep study reveals the presence of at least 5 episodes of apnea and/or hypopnea per hour of sleep. An apnea episode occurs when airflow is decreased by at least 50% and lasts for more than 10 seconds. A hypopnea episode occurs when airflow is reduced by at least 30%, the blood oxygen level is reduced by at least 4%, and the airflow reduction lasts for more than 10 seconds [23].
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1299–1308, 2007. © Springer-Verlag Berlin Heidelberg 2007
There are two classes of detection methods that have been applied to subjects with OSA: one is based on breathing signals, while the other is based on electromyogram (EMG) signals. Several research papers [9,10,19,21,22] have examined methods belonging to the first class. Assessment of the respiratory signal is a key step in detecting OSA. Respiratory impedance can be determined by the forced oscillation technique (FOT) [10], which can be considered a proper noninvasive method to diagnose sleep apnea. FOT relies on applying an oscillatory pressure signal to the respiratory system and determines the respiratory impedance using nasal pressure and airflow signals. Currently, FOT is a promising noninvasive method for measuring respiratory impedance [9,19]. Using FOT during sleep, Yen et al. [22] estimated airway impedance with high specificity and reliability; they then used an artificial neural network to classify people with and without hypopnea/apnea based on this respiratory signal. Gumery et al. [12] developed a device to measure the surface EMG time-latency reflex of the genioglossus muscle stimulated by time- and amplitude-calibrated negative pharyngeal pressure drops. Then, based on a Berkner transform, they built a multi-scale detector, and further tested those detectors in terms of accuracy and robustness using signals acquired from apneic patients and healthy controls. EEG measurement can be used to detect brain activity, since different mental activities produce different EEG patterns. Millán et al. [18] presented a neural classifier to recognize mental tasks, obtaining about 70% correct recognition. Researchers have found that fluctuations in wakefulness can be examined with EEG measurements from active subjects with eyes open and engaged in their usual awake activities [1,2,13]. In these situations, intrusions of theta activity into the beta activity of active wakefulness have been interpreted as ensuing sleepiness.
Subjects with OSA and healthy controls may have different alertness levels under the same conditions, and the pupil response patterns of subjects with and without sleep disorders differ. In this paper, we develop a novel method to detect subjects with the sleep disorder of OSA based on EEG. We compare subjects with sleep disorders to healthy controls and find that they have different theta wave response patterns under the same situation. A significant difference between these subjects can be used for classification by artificial neural networks, specifically modified ART2 neural networks. We tested our algorithm using one set of subjects with OSA and healthy controls. This methodology may eventually lead to new diagnostic methods for OSA. This paper is organized as follows. In Section 2, we present the subjects and the experimental data collection. In Section 3, data preprocessing is described and our method for detecting excessive daytime sleepiness associated with sleep disorders is developed. In Section 4, simulation results are given. In Section 5, conclusions are presented and future perspectives are discussed.
2 Subjects and Experimental Data

Data from 5 untreated OSA (obstructive sleep apnea) subjects and 6 healthy controls were collected approximately 12 h after their mid-sleep period to maximize the probability of sleepiness occurring. This mid-afternoon increase in somnolence, commonly believed to be a post-prandial phenomenon, has been shown to be unrelated to food intake [5]. Data collection was performed at the Center for Narcolepsy Research at the University of Illinois at Chicago. The alertness level testing, conducted with a pupillometry system built at the Mayo Clinic [16], consists of 1 minute of recording of pupil diameter in the light followed by 14 minutes in a quiet, dark room. The analog pupil diameter data are digitized at the rate of 256 Hz using an A/D converter and saved to a PC in a binary format [16]. The EEG filters were set at 0.3 Hz (high pass) and 30 Hz (low pass); the EMG filters were set at 10 and 100 Hz, respectively. EEG/PSG (polysomnography) data were also digitized at 256 Hz and stored with the pupillometry data on a PC.
3 Method for Detecting Sleep Disorder Based on ART2 Neural Networks

3.1 Data Pre-processing

A window of 2 seconds, which is a common technique in this field, is used to process the EEG data. Although the measurement records include pupil diameter and EEG, data from the first 3 minutes of recording were eliminated from analysis, because the pupil dilates and oscillates when the lights are extinguished and can take 2-3 minutes of darkness to adapt and reach a larger stable diameter [15]. Pupil diameter can provide useful information about excessive daytime sleepiness, and some researchers focus their work on pupil size only. According to the definition of the theta wave in EEG, its main rhythm is between 4 Hz and 8 Hz. When people are awake, the main rhythm in the EEG is the beta and/or alpha wave; as a person becomes sleepy, the theta wave becomes the main rhythm. Therefore, theta wave activity can be considered an indicator of sleepiness, and the amount of theta wave activity has been shown to increase during episodes of decreasing alertness. Accordingly, theta energy was calculated over 2-second windows from the original data of three EEG channels, C3/A2, O1/A2, and P3/O1. The calculation was realized by the Fourier transform for subjects with OSA and the accompanying controls. This procedure provides insight into the pattern of theta wave energy. Noise always exists in EEG recordings, so for a good representation of theta energy, the average energy is computed, which indicates the amount of theta wave present. Theta energy ratios were then obtained relative to the mean power of the 4th minute; see Fig. 1 for the theta energy ratios of an OSA subject and a healthy control. In order to recognize the general change of the theta ratio, we use a regression method to capture its changes with time under the specific circumstances.
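The theta-energy computation over 2-second windows can be sketched as follows; the 4-8 Hz band and 256 Hz sampling rate come from the text, but the signals below are synthetic sinusoids, not the study's recordings.

```python
# Sketch: theta-band (4-8 Hz) energy of a 2-second EEG window sampled at
# 256 Hz, computed with the Fourier transform as described above.
import numpy as np

FS = 256
WIN = 2 * FS                                     # 2-second window, 512 samples

def theta_energy(window):
    spec = np.abs(np.fft.rfft(window)) ** 2      # power spectrum of the window
    freqs = np.fft.rfftfreq(len(window), d=1.0 / FS)
    band = (freqs >= 4.0) & (freqs < 8.0)        # theta band
    return float(spec[band].sum())

t = np.arange(WIN) / FS
theta_rich = np.sin(2 * np.pi * 6 * t)           # 6 Hz: inside the theta band
beta_only = 0.2 * np.sin(2 * np.pi * 20 * t)     # 20 Hz: outside the theta band
```

Dividing each window's theta energy by the mean power of a reference minute, as the text does with the 4th minute, turns these absolute energies into the theta energy ratios plotted in Fig. 1.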
The regression analysis is based on the method of Chatterjee and Hadi [6], expressed by

\[ Y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I), \quad (1) \]

where \(Y\) is the dependent variable (output), \(X\) is the independent variable (input data), and \(\epsilon\) is the error. Solving (1) for \(\beta\) in the least-squares sense gives the predicted data.
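A minimal least-squares sketch of Eq. (1), with stand-in data rather than the study's theta-ratio series:

```python
# Sketch: the linear regression model Y = X*beta + eps of Eq. (1), solved by
# ordinary least squares with numpy; data are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)
X = np.column_stack([np.ones_like(t), t])        # intercept + time regressor
beta_true = np.array([0.8, 0.01])                # assumed trend parameters
Y = X @ beta_true + rng.normal(0, 0.05, t.size)  # eps ~ N(0, sigma^2 I)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
trend = X @ beta_hat                             # fitted (smoothed) ratio curve
```

Fitting such a line over successive groups of points is what produces the 20- and 30-point regression curves compared later in Fig. 3.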
[Figure 1: amplitude of theta energy (0-4) versus time for an OSA subject (top panel) and a healthy control (bottom panel).]
Fig. 1. Theta energy ratios of a subject with OSA and a healthy control
3.2 ART2 Neural Networks
Adaptive resonance theory 2 (ART2) neural networks [4] were designed for both analog and digital inputs in 1987. ART2 has been widely used to identify patterns in various fields; e.g., Suzuki [20] used neural networks based on ART2 to recognize QRS waves in the electrocardiogram (ECG). The present paper is based on ART2 neural networks with modified learning functions to adapt to the input patterns. An ART2 neural network [4,8] consists of two subsystems: an attentional subsystem and an orienting subsystem. The attentional subsystem has two layers, F1 and F2, and F1 is made up of three sub-layers. Three sub-layers of F1 are necessary for analog input patterns, since the differences between possible signals with particular patterns may be much smaller for analog inputs
than for binary inputs, which are used to represent features of signals. Most of the equations that we use are the same as those in the original paper [4]. When the resonant conditions existing in the network are below the threshold set by the vigilance parameter, the memory will be activated, and the long-term memory (top-down and bottom-up weights) adaptive process described next will also be activated. The following equations describe the update relationship between the third layer of F1, for an input signal, and the activated category layer F2.

Bottom-up long-term memory trace (F1 → F2):

\[ \frac{d}{dt} z_{ij} = f_{cn}(y_j)\,[p_i - z_{ij}], \quad (2) \]

where \(p_i\) is the \(i\)th output of the third layer of F1, \(y_j\) is the output of the \(j\)th activated category, and \(f_{cn}(y_j)\) is a function given in [4].

Top-down long-term memory trace (F2 → F1):

\[ \frac{d}{dt} z_{ji} = f_{cn}(y_j)\,[p_i - z_{ji}]. \quad (3) \]

When the resonant conditions existing in the network exceed the threshold set by the vigilance parameter, we modify the above update equations in order to avoid forgetting all the information obtained before. The memory is updated with the average value of all long-term memory (LTM) traces associated with the same winner, while each individual input still gets its own LTM by equations (2) and (3). The memory update is described as follows.

Bottom-up long-term memory trace (F1 → F2):

\[ z_{ij} \Leftarrow \frac{n-1}{n} z_{ij} + \frac{1}{n} z_{ij}. \quad (4) \]

Top-down long-term memory trace (F2 → F1):

\[ z_{ji} \Leftarrow \frac{n-1}{n} z_{ji} + \frac{1}{n} z_{ji}. \quad (5) \]
In (4) and (5), \(n\) is the number of subjects associated with winner \(j\); \(\frac{n-1}{n} z_{ij}\) represents the previous weights, and \(\frac{1}{n} z_{ij}\) is the new weight contributed by a new input. According to the original paper, there is a clear procedure along the feedforward and feedback paths in the layers of F1 to calculate all the signals in F1, but the procedure for obtaining a reset signal is not definite. Here, the reset signal after the third layer of F1 is updated with the feedback signal from layer F2.

3.3 Our System and Parameter Selection

Since the vigilance parameter ρ decides the level of similarity between input signals in the same category, more categories will be obtained when ρ is large and the other parameters are fixed, e.g., when ρ is close to 1. The order of input signals has certain effects on the final classification results, because the original ART2 algorithm has a forgetting property. To mitigate this problem, a large ρ is chosen so that only signals similar
enough will be grouped together. Based on equations (4) and (5), a mean signal can be obtained for each group. After that, another ART2 network with a different set of parameters is used to classify the grouped signals from the first ART2. The parameter choices for our method are based on the original ART2 papers [3,4], where the relationships among the parameters have been derived and their limits have been set.
[Figure 2: three ART2 networks in series: ART2 (I) with parameters a1, b1, c1, d1, e1, ρ1; ART2 (II) with a2, b2, c2, d2, e2, ρ2; ART2 (III) with a3, b3, c3, d3, e3, ρ3.]
Fig. 2. Architecture of three ART2 series
The vigilance parameter has a critical effect on the classification results. A larger vigilance value tends to separate the inputs into more categories; in this case, only very similar subjects can be grouped together. On the other hand, if its value is too small, most inputs go into one group. We therefore use three networks in hierarchy, as in Fig. 2, for subjects with OSA, in order to avoid missing some subjects due to the choice of the vigilance parameter and to obtain more precise classifications. However, we do not follow the traditional procedure for the third ART2 in our system; instead, we separate all subjects into two groups based on their similarity parameters, because the larger the similarity parameter is, the closer the subjects are to each other.
4 Results

A total of 11 subjects are used to test our neural network algorithm, including 6 healthy controls and 5 subjects with obstructive sleep apnea. The difference between the theta-wave energy ratio distributions of a healthy subject and a subject with OSA is not obvious; it is not possible to distinguish them directly based on energy ratios, so we use a linear regression method to process them further. Fig. 3 shows some results of regression with different numbers of points. The two top panels in Fig. 3 show regression results of a healthy subject with 20 and 30 points, and the bottom ones are for a subject with OSA with 20 and 30 regression points. The energy ratio of the theta wave is computed from the original data with 2-second sliding windows, in which artifacts such as eye blinking have been removed. The regression procedure makes the change of the theta energy ratio more obvious.
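The regression step described above can be sketched as fitting a straight line to each consecutive block of theta energy ratio samples. The exact windowing scheme is not spelled out in the paper, so the non-overlapping block layout below is an assumption:

```python
import numpy as np

def piecewise_linear_fit(ratios, n_points):
    """Replace each consecutive block of n_points theta energy ratios
    with its fitted straight line, smoothing the series while
    preserving its trend."""
    ratios = np.asarray(ratios, dtype=float)
    fitted = np.empty_like(ratios)
    for start in range(0, len(ratios), n_points):
        seg = ratios[start:start + n_points]
        t = np.arange(len(seg))
        slope, intercept = np.polyfit(t, seg, 1)
        fitted[start:start + n_points] = slope * t + intercept
    return fitted
```

Longer blocks (more regression points) give flatter, more stable curves, matching the observation later in this section that more points require a higher vigilance parameter.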
Classification of Obstructive Sleep Apnea by Neural Networks
1305
[Figure: four panels of regressed theta energy ratio (vertical axis, 0 to 4) versus time (horizontal axis, 0 to 350)]
Fig. 3. Regression results of a healthy subject and an OSA subject. The first two figures are for a healthy subject with regression points 20 and 30, respectively. The bottom two figures are for an OSA subject with regression points of 20 and 30, respectively.
A single traditional ART2 neural network cannot identify all subjects, and more subjects go into the wrong group, since some subjects are grouped into more
[Figure: correct classification rate (vertical axis, 0.5 to 1) versus number of regression points (horizontal axis, 10 to 40)]
Fig. 4. Performance of one ART2 and three ART2 series neural networks. Solid line represents performance of our system with three ART2 neural networks; dashed line represents performance of one traditional ART2.
categories or missed as noise under a fixed vigilance parameter ρ. Therefore, we use three ART2 networks in series. If the parameter ρ is set slightly higher, more subjects will not be classified; if it is set slightly lower, most subjects go into the same group, and the precision of the classification is reduced. We find that regression with different numbers of data points has an effect on the results: the more points used in the regression, the higher the vigilance parameter ρ has to be set in order to get good classification results, because more points make the curve flatter and more stable. To compare one ART2 with the three ART2 series, we plot their performance in Fig. 4. The vertical axis is the percentage of successful classification, while the horizontal axis is the number of points used in the regression. From Fig. 4, we find that the three ART2 series achieves much better results than a single ART2 under the optimal vigilance parameter ρ. The reason lies in the architecture of ART2, which reflects the similarity of signals: a single vigilance parameter ρ may be proper for two particular inputs, but when the number of inputs is large, one fixed vigilance parameter is no longer appropriate. The three ART2 series allows us to choose different parameters at different stages. The best performance is obtained from inputs with regression of 30 points. In this case, the first ART2 obtains 8 categories, the second ART2 gets 3 categories, and the third ART2 has 2 categories. The system correctly identifies 10 of the 11 subjects; it misses one subject with OSA. After examining that subject's data, we noted a change of pupil diameter different from the other subjects: normally the pupil diameter becomes smaller with time during data collection, but this subject shows the opposite change, with the pupil diameter becoming largest in the last 2 minutes. The simultaneous EEG should have a similar response.
Therefore, such a change might have affected the final classification result.
The following example is from our simulation for regression of 30 points. Three ART2 neural networks are used, and the 11 inputs come from the 11 subjects. After the first ART2 network, the 11 inputs fall into 8 categories, since some very similar inputs group together. We take the average of the inputs in the same group to get 8 inputs for the second ART2. Three categories are obtained after the second ART2, including 2 big groups and 1 small group. The same averaging strategy is applied to these three groups to get three inputs for the third ART2. Finally, two groups are reached after the third ART2. We check the status of each subject in the two groups to get the percentage of correct classification.
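The averaging between stages in the procedure just described (11 inputs to 8 categories, then 3, then 2) collapses each category to the mean of its members before the next ART2. A sketch of that step alone, with the clustering itself left to ART2 (function name ours):

```python
import numpy as np

def average_groups(inputs, labels):
    """Collapse each category to the mean of its members, producing
    the reduced set of inputs for the next ART2 stage."""
    inputs = np.asarray(inputs, dtype=float)
    labels = np.asarray(labels)
    return np.array([inputs[labels == c].mean(axis=0)
                     for c in np.unique(labels)])
```

Applied twice, this turns the 8 first-stage categories into 8 averaged inputs, and the 3 second-stage categories into 3 averaged inputs.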
5 Conclusion

We have shown that ART2 neural networks in series can successfully classify subjects with/without OSA, based on the idea that patients with a sleep disorder have a different level of wakefulness from healthy people. Some studies of sleepiness in subjects with OSA have found that participants were unaware of the extent of their sleepiness under the same circumstances [7,11]. Patterns of the theta energy ratio in the EEG can reflect the difference between sleep disorder patients and healthy people, since there is good evidence that rising theta EEG activity is a sign of increasing sleepiness [14]. A series of ART2 neural networks is necessary to get a precise classification result, in order to eliminate effects of input ordering and to group more similar subjects together. Our experiment shows that a hierarchical system of ART2 neural networks improves the precision of classification over that of a single ART2 network, achieving 91% correct classification between subjects with OSA and controls. The reason is that our system has more flexibility to adapt to input patterns.

Acknowledgments. The data collection for this research was supported by NIH grant R15 NR04030 and by Mr. J. A. Piscopo.
References

1. Akerstedt, T., Gillberg, M.: Subjective and Objective Sleepiness in the Active Individual. International Journal of Neuroscience 52 (1990) 29–37
2. Broughton, R., Yan, H., Boucher, B.: Effects of One Night of Sleep Deprivation on Quantified EEG Measures. Journal of Sleep Research 7 suppl. 2 (1998) 32
3. Carpenter, G.A., Grossberg, S.: A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine. Computer Vision, Graphics, and Image Processing 37 (1987) 54–115
4. Carpenter, G.A., Grossberg, S.: ART 2: Self-Organization of Stable Category Recognition Codes for Analog Input Patterns. Applied Optics 26(23) (1987) 4919–4930
5. Carskadon, M.A., Dement, W.C.: Multiple Sleep Latency Test During the Constant Routine. Sleep 15 (1992) 396–399
6. Chatterjee, S., Hadi, A.S.: Influential Observations, High Leverage Points, and Outliers in Linear Regression. Statistical Science 1 (1986) 379–393
7. Chervin, R.D., Guilleminault, C.: Obstructive Sleep Apnea and Related Disorders. Neurologic Clinics 14 (1996) 583–609
8. Davenport, M.P., Titus, A.H.: Multilevel Category Structure in the ART-2 Network. IEEE Transactions on Neural Networks 15(1) (2004) 145–158
9. Davis, K.A., Luchten, K.R.: Respiratory Impedance Spectral Estimation for Digitally Created Random Noise. Annals of Biomedical Engineering (1991) 179–195
10. DuBois, A.B., Brody, A.W., Lewis, D.H., Burgess, B.F.: Oscillation Mechanics of Lungs and Chest in Man. Journal of Applied Physiology 8 (1956) 587–594
11. Engleman, H.M., Douglas, W.S.: Under Reporting of Sleepiness and Driving Impairment in Patients with Sleep Apnea/Hypopnea Syndrome. Journal of Sleep Research 6 (1997) 272–275
12. Gunery, P.Y., Roux-Buisson, H., Meignen, S., Comyn, F.L., Dematteis, M., Wuyam, B., Pepin, J.L., Levy, P.: An Adaptive Detector of Genioglossus EMG Reflex Using Berkner Transform for Time Latency Measurement in OSA Pathophysiological Studies. IEEE Transactions on Biomedical Engineering 52(8) (2005) 1382–1389
13. Hasan, J.: Past and Future of Computer-Assisted Sleep Analysis and Drowsiness Assessment. Journal of Clinical Neurophysiology 13 (1996) 295–313
14. Horne, J.A., Reyner, L.A.: Driver Sleepiness. Journal of Sleep Research 4 (1995) 23–29
15. Kollarits, C.R., Kollarits, F.J., Schuette, W.H.: The Pupil Dark Response in Normal Volunteers. Current Eye Research 2(4) (1982) 255–259
16. McLaren, J.W., Fjerstad, W.H., Ness, A.B., Graham, M.D., Brubaker, R.F.: New Video Pupillometer. Optical Engineering 34(3) (1995) 676–682
17. Merritt, S.L., Schnyders, H.C., Patel, M., Basner, R.C., O'Neill, W.: Pupil Staging and EEG Measurement of Sleepiness. International Journal of Psychophysiology 52 (2004) 97–112
18. Millán, J.R., Mouriño, J., Franzé, M., Cincotti, F., Varsta, M., Heikkonen, J., Babiloni, F.: A Local Neural Classifier for the Recognition of EEG Patterns Associated to Mental Tasks. IEEE Transactions on Neural Networks 13 (2002) 678–686
19. Roth, P.R.: Effective Measurements Using Digital Signal Analysis. IEEE Spectrum (1971) 62–70
20. Suzuki, Y.: Self-Organizing QRS-wave Recognition in ECG Using Neural Networks. IEEE Transactions on Neural Networks 6 (1995) 1469–1477
21. Varady, P., Micsik, T., Benedek, S., Benyo, Z.: A Novel Method for the Detection of Apnea and Hypopnea Events in Respiration Signals. IEEE Transactions on Biomedical Engineering 49(9) (2002) 936–942
22. Yen, F.C., Behbehani, K., Lucas, E.A., Burk, J.R., Axe, J.P.: A Noninvasive Technique for Detecting Obstructive and Central Sleep Apnea. IEEE Transactions on Biomedical Engineering 44(12) (1997) 1262–1268
23. American Academy of Sleep Medicine: Sleep Related Breathing Disorders in Adults: Recommendations for Syndrome Definition and Measurement Techniques in Clinical Research. Sleep 22(5) (1999) 667–689
Author Index
Abiyev, Rahib H. II-241 Acu˜ na, Gonzalo I-311, I-1255, II-391 Afzulpurkar, Nitin V. III-252 Ahmad, Khurshid II-938 Ahn, Tae-Chon II-186 Ai, Lingmei II-1202 Akiduki, Takuma II-542 Al-Jumeily, Dhiya II-921 Al-shanableh, Tayseer II-241 Aliev, R.A. II-307 Aliev, R.R. II-307 Almeida, An´ıbal T. de I-138, III-73 Amari, Shun-ichi I-935 Anitha, R. I-546 Ara´ ujo, Ricardo de A. II-602 Aung, M.S.H. II-1177 Bae, Hyeon III-641 Bae, JeMin I-1221 Baek, Gyeongdong III-641 Baek, Seong-Joon II-1240 Bai, Qiuguo III-1107 Bai, Rui II-362 Bai, Xuerui I-349 Bambang, Riyanto T. I-54 Bao, Zheng I-1303 Barua, Debjanee II-562 Bassi, Danilo II-391 Bastari, Alessandro III-783 Baten, A.K.M.A. II-1221 Bevilacqua, Vitoantonio II-1107 Bi, Jing I-609 Bin, Deng III-981 Bin, Liu III-981 Boumaiza, Slim I-582 Cai, ManJun I-148 Cai, W.C. I-786 Cai, Wenchuan I-70 Cai, Zixing I-743 Caiyun, Chen III-657, III-803 Calster, B. Van II-1177 Canu, St´ephane III-486 Cao, Fenwen II-810 Cao, Jinde I-941, I-958, I-1025
Cao, Shujuan II-680 Carpenter, Gail A. I-1094 Carvajal, Karina I-1255 Cecchi, Guillermo II-500, II-552 Cecchi, Stefania III-731, III-783 Celikoglu, Hilmi Berk I-562 Chacon M., Mario I. III-884 Chai, Lin I-222 Chai, Tianyou II-362 Chai, Yu-Mei I-1162 Chandra Sekhar, C. I-546 Chang, Bao Rong III-357 Chang, Bill II-1221 Chang, Guoliang II-1168 Chang, Hyung Jin III-506 Chang, Shengjiang II-457 Chang, T.K. II-432 Chang, Y.P. III-580 Chang, Zhengwei III-1015 Chao, Kuei-Hsiang III-1145 Che, Haijun I-480 Chen, Boshan III-123 Chen, Chaochao I-824 Chen, Dingguo I-183, I-193 Chen, Feng I-473, I-1303 Chen, Fuzan II-448 Chen, Gong II-1056 Chen, Huahong I-1069 Chen, Huawei I-1069 Chen, Hung-Cheng III-26 Chen, Jianxin I-1274, II-1159 Chen, Jie II-810 Chen, Jing I-1274 Chen, Jinhuan III-164 Chen, Joseph III-1165 Chen, Juan II-224 Chen, Lanfeng I-267 Chen, Le I-138 Chen, Li-Chao II-656 Chen, Lingling II-1291 Chen, Min-You I-528 Chen, Mou I-112 Chen, Mu-Song III-998 Chen, Ping III-426
Chen, Po-Hung III-26, III-1120 Chen, Qihong I-64 Chen, Shuzhen III-454 Chen, Tianping I-994, I-1034 Chen, Ting-Yu II-336 Chen, Wanming I-843 Chen, Weisheng I-158 Chen, Wen-hua I-112 Chen, Xiaowei II-381 Chen, Xin I-813 Chen, Xiyuan III-41 Chen, Ya zhu III-967 Chen, Yen-wei II-979 Chen, Ying III-311, III-973 Chen, Yong I-1144, II-772 Chen, Yuehui I-473, II-1211 Chen, Yunping III-1176 Chen, Zhi-Guo II-994, III-774 Chen, Zhimei I-102 Chen, Zhimin III-204 Chen, Zhong I-776, III-914 Chen, Zhongsong III-73 Cheng, Gang I-231 Cheng, Jian II-120 Cheng, Zunshui I-1025 Chi, Qinglei I-29 Chi, Zheru I-626 Chiu, Ming-Hui I-38 Cho, Jae-Hyun III-923 Cho, Sungzoon II-880 Choi, Jeoung-Nae III-225 Choi, Jin Young III-506 Choi, Seongjin I-602 Choi, Yue Soon III-1114 Chu, Shu-Chuan II-905 Chun-Guang, Zhou III-448 Chung, Chung-Yu II-785 Chung, TaeChoong I-704 Cichocki, Andrzej II-1032, III-793 Cruz, Francisco II-391 Cruz-Meza, Mar´ıa Elena III-828 Cuadros-Vargas, Ernesto II-620 Cubillos, Francisco A. I-311, II-391 Cui, Baotong I-935 Cui, Baoxia II-160 Cui, Peiling III-597 da Silva Soares, Anderson Dai, Dao-Qing II-1081 Dai, Jing III-607
III-1024
Dai, Ruwei I-1280 Dai, Shaosheng II-640 Dai, Xianzhong II-196, III-1138 Dakuo, He III-330 Davies, Anthony II-938 Dell’Orco, Mauro I-562 Deng, Fang’an I-796 Deng, Qiuxiang II-575 Deng, Shaojiang II-724 Dengsheng, Zhu III-1043 Dillon, Tharam S. II-965 Ding, Gang III-66 Ding, Mingli I-667, III-721 Ding, Mingwei II-956 Ding, Mingyong II-1048 Ding, Xiao-qing III-1033 Ding, Xiaoshuai III-117 Ding, Xiaoyan II-40 Dong, Jiyang I-776, III-914 Dou, Fuping I-480 Du, Ji-Xiang I-1153, II-793, II-819 Du, Junping III-80 Du, Lan I-1303 Du, Wei I-652 Du, Xin I-714 Du, Xin-Wei III-1130 Du, Yina III-9 Du, Zhi-gang I-465 Duan, Hua III-812 Duan, Lijuan II-851 Duan, Yong II-160 Duan, Zhemin III-943 Duan, Zhuohua I-743 El-Bakry, Hazem M. III-764 Etchells, T.A. II-1177 Fan, Fuling III-416 Fan, Huaiyu II-457 Fan, Liping II-1042 Fan, Shao-hui II-994 Fan, Yi-Zheng I-572 Fan, Youping III-1176 Fan, Yushun I-609 Fang, Binxing I-1286 Fang, Jiancheng III-597 Fang, Shengle III-292 Fang, Zhongjie III-237 Fei, Minrui II-483 Fei, Shumin I-81, I-222
Author Index Feng, Chunbo III-261 Feng, Deng-Chao III-869 Feng, Hailiang III-933 Feng, Jian III-715 Feng, Xiaoyi II-135 Feng, Yong II-947 Feng, Yue I-424 Ferreira, Tiago A.E. II-602 Florez-Choque, Omar II-620 Freeman, Walter J. I-685 Fu, Chaojin III-123 Fu, Jiacai II-346 Fu, Jun I-685 Fu, Lihua I-632 Fu, Mingang III-204 Fu, Pan II-293 Fuli, Wang III-330 Fyfe, Colin I-397 Gan, Woonseng I-176 Gao, Chao III-35 Gao, Cunchen I-910 Gao, Jian II-640 Gao, Jinwu II-257 Gao, Junbin II-680 Gao, Liang III-204 Gao, Liqun II-931, III-846 Gao, Ming I-935 Gao, Shaoxia III-35 Gao, Song II-424 Gao, Wen II-851 Gao, Zengan III-741 Gao, Zhi-Wei I-519 Gao, Zhifeng I-875 Gardner, Andrew B. II-1273 Gasso, Gilles III-486 Ge, Baoming I-138, III-73 Ge, Junbo II-1125 Geng, Guanggang I-1280 Ghannouchi, Fadhel M. I-582 Glantschnig, Paul II-1115 Gong, Shenguang III-672 Grossberg, Stephen I-1094 Gu, Hong II-1 Gu, Ying-kui I-553, II-275 Guan, Peng I-449, II-671 Guan, Zhi-Hong II-8, II-113 Guirimov, B.G. II-307 Guo, Chengan III-461
Guo, Chenlei I-723 Guo, Lei I-93, I-1054 Guo, Li I-1286, II-931, III-846 Guo, Ling III-434 Guo, Peng III-633, III-950 Guo, Ping II-474 Guo, Qi I-904 Guo, Wensheng III-80 Guo, Xin II-1291
Hadzic, Fedja II-965 Haifeng, Sang III-330 Halgamuge, Saman K II-801, II-1221, III-1087 Hamaguchi, Kosuke I-926 Han, Feng II-740 Han, Fengqing I-1104 Han, Jianda III-589 Han, Jiu-qiang II-646 Han, Min II-569 Han, Mun-Sung I-1318 Han, Pu III-545 Han, Risheng II-705 Han, SeungSoo III-246 Hao, Yuelong I-102 Hao, Zhifeng I-8 Hardy, David II-801 He, Fen III-973 He, Guoping III-441, III-812 He, Haibo I-413, I-441 He, Huafeng I-203 He, Lihong I-267 He, Naishuai II-772 He, Qing III-336 He, Tingting I-632 He, Xin III-434 He, Xuewen II-275 He, Yigang III-570, III-860, III-1006 He, Zhaoshui II-1032 He, Zhenya III-374 Heng, Yue III-561 Hirasawa, Kotaro I-403 Hoang, Minh-Tuan T. I-1077 Hong, Chin-Ming I-45 Hong, SangJeen III-246 Hong, Xia II-516, II-699 Hope, A.D. II-293 Hou, Weizhen III-812 Hou, Xia I-1247 Hou, Zeng-Guang II-438
Hsu, Arthur III-1087 Hsu, Chia-Chang III-1145 Hu, Chengquan I-652, II-1264 Hu, Dewen I-1061 Hu, Haifeng II-630 Hu, Jing II-985 Hu, Jinglu I-403 Hu, Jingtao III-277 Hu, Meng I-685 Hu, Ruifen I-685 Hu, Sanqing II-1273 Hu, Shiqiang III-950 Hu, Shou-Song I-1247 Hu, Wei III-277 Hu, Xiaolin III-194 Hu, Xuelei I-1211 Hu, Yun-an II-47 Huaguang, Zhang III-561 Huang, Benxiong I-1336, III-626 Huang, D. II-1002 Huang, Dexian III-219 Huang, Fu-Kuo III-57 Huang, Hong III-933 Huang, Hong-Zhong III-267 Huang, Jikun III-1058 Huang, Kai I-1183 Huang, Liangli III-407 Huang, Liyu II-1202 Huang, Peng II-593 Huang, Qingbao III-1097 Huang, Tingwen II-24 Huang, Xinsheng III-853 Huang, Xiyue II-772, III-553 Huang, Yanxin II-1264 Huang, Yuancan III-320 Huang, Yuchun I-1336, III-626 Huang, Zailu I-1336 Huang, Zhen I-733, I-824 Huffel, S. Van II-1177 Huo, Linsheng III-1182 Hussain, Abir Jaafar II-921 Huynh, Hieu T. I-1077 Hwang, Chi-Pan III-998 Hwang, Seongseob II-880 Imamura, Takashi II-542 Irwin, George W. I-496 Isahara, Hitoshi I-1310 Islam, Md. Monirul II-562 Iwamura, Kakuzo II-257
Jarur, Mary Carmen II-1150 Je, Sung-Kwan III-923 Ji, Geng I-166 Ji, Guori III-545 Jia, Hongping I-257, I-642 Jia, Huading III-497 Jia, Peifa I-852, II-328 Jia, Yunde II-896 Jia, Zhao-Hong I-572 Jian, Feng III-561 Jian, Jigui II-143 Jian, Shu III-147 Jian-yu, Wang III-448 Jiang, Chang-sheng I-112 Jiang, Chenwei II-1133 Jiang, Haijun I-1008 Jiang, Minghui I-952, III-292 Jiang, Nan III-1 Jiang, Tiejun III-350 Jiang, Yunfei II-474 Jiang, Zhe III-589 Jiao, Li-cheng II-120 Jin, Bo II-510 Jin, Huang II-151 Jin, Xuexiang II-1022 Jin, Yihui III-219, III-1058 Jin-xin, Tian III-49 Jing, Chunguo III-1107 Jing, Zhongliang II-705 Jinhai, Liu III-561 JiuFen, Zhao III-834 Jo, Taeho I-1201, II-871 Jos´e Coelho, Clarimar III-1024 Ju, Chunhua III-392 Ju, Liang I-920, I-1054 Ju, Minseong III-140 Jun, Ma I-512 Jun, Yang III-981 Junfeng, Xu III-17 Jung, Byung-Wook III-641 Jung, Young-Giu I-1318 Kanae, Shunshoku I-275, II-1194 Kang, Jingli III-157 Kang, Mei I-257 Kang, Min-Jae I-1015 Kang, Sangki II-1240 Kang, Y. III-580 Kao, Tzu-Ping II-336 Kao, Yonggui I-910
Author Index Kaynak, Okyay I-14 Ke, Hai-Sen I-285 Kelleher, Dermot II-938 Kil, Rhee Man I-1117, I-1318 Kim, Byungwhan I-602 Kim, Dae Young III-368 Kim, Dongjun II-1187 Kim, DongSeop III-246 Kim, Ho-Chan I-1015 Kim, Ho-Joon II-715 Kim, HyunKi II-206 Kim, Jin Young II-1240 Kim, Kwang-Baek II-756, III-923 Kim, Kyeongseop II-1187 Kim, Pyo Jae III-506 Kim, Seoksoo II-1090, III-140 Kim, Sungshin III-641 Kim, Tai-hoon III-140 Kim, Woo-Soon III-1114 Kim, Yong-Kab III-1114 Kim, Yountae III-641 Ko, Hee-Sang I-1015 Konako˘ glu, Ekrem I-14 Koo, Imhoi I-1117 Kozloski, James II-500, II-552 Kumar, R. Pradeep II-1012 Kurnaz, Sefer I-14 Lai, Pei Ling I-397 Lee, Ching-Hung I-38, II-317 Lee, Geehyuk II-104 Lee, InTae II-206 Lee, Jeongwhan II-1187 Lee, Jin-Young III-923 Lee, Joseph S. II-715 Lee, Junghoon I-1015 Lee, Malrey I-1201, II-871 Lee, Seok-Lae I-1045 Lee, SeungGwan I-704 Lee, Shie-Jue III-515 Lee, SungJoon III-246 Lee, Tsai-Sheng I-694 Lee, Yang Weon III-1192 Leu, Yih-Guang I-45 Leung, Kwong-Sak II-371 Li, Ang II-689 Li, Bin I-767, I-1087 Li, Chuandong II-24 Li, Chun-hua III-382 Li, Demin III-695
Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li,
Guang I-685 Haibin I-994 Hailong III-9 Haisheng II-414 Hongnan III-1182 Hongru III-9 Ji III-686 Jianwei III-933 Jing II-47, II-656 Jiuxian II-889, III-392 Jun I-676 Jun-Bao II-905 Kang I-496, II-483 Li I-132 Liming III-407 Meng II-842, III-1077 Minqiang II-448 Ping II-33 Qing II-251 Qingdu II-96 Qingguo II-424 Qiudan I-1280 San-ping III-382 Shaoyuan I-505 Shutao III-407 Tao I-81, I-93, I-374, II-8 Weidong III-147 Weimin III-988 Xiao-Li I-87 Xiaodong I-176 Xiaomei II-170 Xiaoou I-487, II-483 Xiuxiu I-796 Xue III-741 Yan II-1281 Yang I-1286 Yangmin I-757, I-813 Yanwen II-842 Yaobo II-1056 Yi II-612 Yibin I-1087 Yinghong I-424 Yong-Wei III-633 Yongming I-1 Yongwei III-950 Yuan III-1130 Yue I-414 Yufei I-424 Yunxia III-758 Zhengxue III-117
Li, Zhiquan III-311 Lian, Qiusheng III-454 Lian, Shiguo II-79 Liang, Dong I-572 Liang, Hua I-618, I-920, III-399 Liang, Huawei I-843 Liang, Jinling II-33 Liang, Rui II-257 Liang, Yanchun I-8, I-652, II-1264 Liang, Yong II-371 Liao, Longtao I-505 Liao, Wudai I-897, III-164 Liao, X.H. I-70, I-786 Liao, Xiaofeng I-1104, II-724 Liao, Xiaoxin I-897, II-143, III-164, III-292 Lim, Jun-Seok II-398, III-678 Lin, Chuan Ku III-998 Lin, Hong-Dar II-785 Lin, ShiehShing III-231 Lin, Sida I-968 Lin, Xiaofeng III-1097 Lin, Yaping II-1254 Lin, Yu-Ching II-317 Lin, Yunsong II-1048 Lin, Zhiling I-380 Ling, Zhuang III-213 Linhui, Cai II-151 Lisboa, P.J.G II-1177 Liu, Benyong II-381 Liu, Bin III-1107 Liu, Bo III-219, III-1058 Liu, Derong I-387, II-1299 Liu, Di-Chen III-1130 Liu, Dianting II-740 Liu, Dongmei II-1231 Liu, Fei III-1067 Liu, Guangjun II-251 Liu, Guohai I-257, I-642 Liu, Hongwei I-1303 Liu, Hongwu III-686 Liu, Ji-Zhen II-179 Liu, Jilin I-714 Liu, Jin III-751 Liu, JinCun I-148 Liu, Jinguo I-767 Liu, Ju II-1065 Liu, Jun II-772 Liu, Lu III-1176 Liu, Meiqin I-968
Liu, Meirong III-570 Liu, Peipei I-1069 Liu, Qiuge III-336 Liu, Shuhui I-480 Liu, Taijun I-582 Liu, Wen II-57 Liu, Wenhui III-721 Liu, Xiang-Jie II-179 Liu, Xiaohe I-176 Liu, Xiaohua III-751 Liu, Xiaomao II-680 Liu, Yan-Kui II-267 Liu, Ying II-267, III-1058 Liu, Yinyin II-534, II-956 Liu, Yongguo III-237 Liu, Yun III-1155 Liu, Yunfeng I-203 Liu, Zengrong II-16 Liu, Zhi-Qiang II-267 Liu, Zhongxuan II-79 Lloyd, Stephen R. II-1299 L¨ ofberg, Johan I-424 Long, Aideen II-938 Long, Fei I-292 Long, Jinling I-1110 Long, Ying III-1006 Loosli, Ga¨elle III-486 L´ opez-Y´ an ˜ez, Itzam´ a II-835, III-828 Lu, Bao-Liang I-1310, III-525 Lu, Bin II-224 Lu, Congde II-1048 Lu, Hong tao III-967 Lu, Huiling I-796 Lu, Wenlian I-1034 Lu, Xiaoqing I-986, I-1193 Lu, Xinguo II-1254 Lu, Yinghua II-842 Lu, Zhiwu I-986, I-1193 Luan, Xiaoli III-1067 Lum, Kok Siong II-346 Luo, Qi I-170, II-301 Luo, Siwei II-1281 Luo, Wen III-434 Luo, Yan III-302 Luo, Yirong I-455 Lv, Guofang I-618 Ma, Chengwei III-973 Ma, Enjie II-362 Ma, Fumin I-658
Author Index Ma, Honglian III-461 Ma, Jiachen III-840, III-877 Ma, Jianying II-1125 Ma, Jieming II-1133 Ma, Jinwen I-1183, I-1227 Ma, Liyong III-840, III-877 Ma, Shugen I-767 Ma, Xiaohong II-40, III-751 Ma, Xiaolong I-434 Ma, Xiaomin III-1 Ma, Yufeng III-672 Ma, Zezhong III-933 Ma, Zhiqiang III-1077 Mahmood, Ashique II-562 Majewski, Maciej III-1049 Mamedov, Fakhreddin II-241 Mao, Bing-yi I-867, III-454 Marwala, Tshilidzi I-1237, I-1293 Mastorakis, Nikos III-764 Mastronardi, Giuseppe II-1107 Matsuka, Toshihiko I-1135 May, Gary S. III-246 Mei, Tao I-843 Mei, Xue II-889 Mei, Xuehui I-1008 Men, Jiguan III-1176 Meng, Hongling II-88, III-821 Meng, Max Q.-H. I-843 Meng, Xiangping II-493 Menolascina, Filippo II-1107 Miao, Jun II-851 Min, Lequan III-147 Mingzeng, Dai III-663 Miyake, Tetsuo II-542 Mohler, Ronald R. I-183 Mora, Marco II-1150 Moreno, Vicente II-391 Moreno-Armendariz, Marco I-487 Musso, Cosimo G. de II-1107 Na, Seung You II-1240 Nagabhushan, P. II-1012 Nai-peng, Hu III-49 Nan, Dong I-1110 Nan, Lu III-448 Naval Jr., Prospero C. III-174 Navalertporn, Thitipong III-252 Nelwamondo, Fulufhelo V. I-1293 Ng, S.C. II-664 Ngo, Anh Vien I-704
Nguyen, Hoang Viet I-704 Nguyen, Minh Nhut II-346 Ni, Junchao I-158 Nian, Xiaoling I-1069 Nie, Xiaobing I-958 Nie, Yalin II-1254 Niu, Lin I-465 Oh, Sung-Kwun II-186, II-206, III-225 Ong, Yew-Soon I-1327 Ortiz, Floriberto I-487 Ou, Fan II-740 Ou, Zongying II-740 Pan, Jeng-Shyang II-905 Pan, Jianguo II-352 Pan, Li-Hu II-656 Pan, Quan II-424 Pandey, A. III-246 Pang, Zhongyu II-1299 Park, Aaron II-1240 Park, Cheol-Sun III-368 Park, Cheonshu II-730 Park, Choong-shik II-756 Park, Dong-Chul III-105, III-111 Park, Jong Goo III-1114 Park, Sang Kyoon I-1318 Park, Yongsu I-1045 Pavesi, Leopoldo II-1150 Peck, Charles II-500, II-552 Pedone, Antonio II-1107 Pedrycz, Witold II-206 Pei, Wenjiang III-374 Peng, Daogang I-302 Peng, Jian-Xun II-483 Peng, Jinzhu I-592, I-804 Peng, Yulou III-860 Pi, Yuzhen II-493 Pian, Zhaoyu II-931, III-846 Piazza, Francesco III-731, III-783 Ping, Ling III-448 Pizzileo, Barbara I-496 Pu, Xiaorong III-237 Qi, Juntong III-589 Qian, Jian-sheng II-120 Qian, Juying II-1125 Qian, Yi II-689 Qianhong, Lu III-981 Qiao, Chen III-131 Qiao, Qingli II-72
Qiao, Xiao-Jun III-869 Qing, Laiyun II-851 Qingzhen, Li III-834 Qiong, Bao I-536 Qiu, Jianlong I-1025 Qiu, Jiqing I-871 Qiu, Zhiyong III-914 Qu, Di III-117 Quan, Gan II-151 Quan, Jin I-64 Rao, A. Ravishankar II-500, II-552 Ren, Dianbo I-890 Ren, Guanghui II-765, III-651 Ren, Quansheng II-88, III-821 Ren, Shi-jin II-216 Ren, Zhen II-79 Ren, Zhiliang II-1056 Rivas P., Pablo III-884 Roh, Seok-Beom II-186 Rohatgi, A. III-246 Rom´ an-God´ınez, Israel II-835 Ronghua, Li III-657 Rosa, Jo˜ ao Lu´ı Garcia II-825 Rossini, Michele III-731 Rubio, Jose de Jesus I-1173 Ryu, Joung Woo II-730 Sakamoto, Yasuaki I-1135 S´ anchez-Garfias, Flavio Arturo III-828 Sasakawa, Takafumi I-403 Savage, Mandara III-1165 Sbarbaro, Daniel II-1150 Schikuta, Erich II-1115 Senaratne, Rajinda II-801 Seo, Ki-Sung III-225 Seredynski, Franciszek III-85 Shang, Li II-810 Shang, Yan III-454 Shao, Xinyu III-204 Sharmin, Sadia II-562 Shen, Jinyuan II-457 Shen, Lincheng I-1061 Shen, Yanjun I-904 Shen, Yehu I-714 Shen, Yi I-952, III-292, III-840, III-877 Shen, Yue I-257, I-642 Sheng, Li I-935 Shi, Haoshan III-943 Shi, Juan II-346 Shi, Yanhui I-968
Shi, Zhongzhi III-336 Shiguang, Luo III-803 Shin, Jung-Pil III-641 Shin, Sung Hwan III-111 Shunlan, Liu III-663 Skaruz, Jaroslaw III-85 Sohn, Joo-Chan II-730 Song, Chunning III-1097 Song, David Y. I-70 Song, Dong Sung III-506 Song, Jaegu II-1090 Song, Jinya I-618, III-479 Song, Joo-Seok I-1045 Song, Kai I-671, III-721 Song, Qiankun I-977 Song, Shaojian III-1097 Song, Wang-Cheol I-1015 Song, Xiao xiao III-1097 Song, Xuelei II-746 Song, Y.D. I-786 Song, Yong I-1087 Song, Young-Soo III-105 Song, Yue-Hua III-426 Song, Zhuo II-1248 Sousa, Robson P. de II-602 Squartini, Stefano III-731, III-7783 Starzyk, Janusz A. I-413, I-441, II-534, II-956 Stead, Matt II-1273 Stuart, Keith Douglas III-1049 Sun, Bojiao I-1346 Sun, Changcun II-1056 Sun, Changyin I-618, I-920, III-479 Sun, Fangxun I-652 Sun, Fuchun I-132 Sun, Haiqin I-319 Sun, Jiande II-1065 Sun, Lei I-843 Sun, Lisha II-1168 Sun, Pei-Gang II-234 Sun, Qiuye III-607 Sun, Rongrong II-284 Sun, Shixin III-497 Sun, Xinghua II-1065 Sun, Youxian II-1097, II-1140 Sun, Z. I-786 Sung, KoengMo II-398 Tan, Ah-Hwee I-1094 Tan, Cheng III-1176
Author Index Tan, Hongli III-853 Tan, Min II-438, III-1155 Tan, Yanghong III-570 Tan, Ying III-705 Tan, Yu-An II-301 Tang, Guiji III-545 Tang, GuoFeng II-465 Tang, Jun I-572 Tang, Lixin II-63 Tang, Songyuan II-979 Tang, Wansheng III-157 Tang, Yuchun II-510 Tang, Zheng II-465 Tao, Liu I-512 Tao, Ye III-267 Testa, A.C. II-1177 Tian, Fengzhan II-414 Tian, GuangJun I-148 Tian, Jin II-448 Tian, Xingbin I-733 Tian, Yudong I-213 Tie, Ming I-609 Timmerman, D. II-1177 Tong, Ling II-1048 Tong, Shaocheng I-1 Tong, Weiming II-746 Tsai, Hsiu Fen III-357 Tsai, Hung-Hsu III-904 Tsao, Teng-Fa I-694 Tu, Zhi-Shou I-358 Uchiyama, Masao Uyar, K. II-307
I-1310
Vairappan, Catherine II-465 Valentin, L. II-1177 Vanderaa, Bill II-801 Veludo de Paiva, Maria Stela III-1024 Vilakazi, Christina B. I-1237 Vo, Nguyen H. I-1077 Vogel, David II-534 Volkov, Yuri II-938 Wada, Kiyoshi I-275, II-1194 Wan, Yuanyuan II-819 Wang, Baoxian II-143 Wang, Bin II-1133 Wang, Bolin III-399 Wang, C.C. III-580 Wang, Chunheng I-1280 Wang, Dacheng I-1104
Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang, Wang,
Dongyun I-897 Fu-sheng I-241, I-249 Fuli I-380 Fuliang I-257 Gaofeng I-632 Grace S. III-57 Guang-Jiang III-774 Guoqiang II-740 Haijun II-1254 Haila II-79 Hong I-93 Hongbo I-733, I-824 Honghui II-1097 Hongrui I-749 Hongwei II-1 Jiacun III-695 Jiahai III-184 Jian III-821 Jianzhong II-493 Jingming I-1087 Jue II-1202 Jun III-95, III-194 Jun-Song I-519 Kai II-406 Ke I-834 Kuanquan II-583 Kun II-931, III-846 Lan III-416 Le I-1183 Lei I-29, III-497 Liangliang I-1227 Ling III-219, III-1058 Lipo II-57 Meng-Hui III-1145 Nian I-572 Qi III-721 Qin II-689 Qingren II-406 Quandi I-528 Rubin I-1127 Ruijie III-80 Sheng-jin III-1033 Shiwei I-122 Shoujue III-616 Shoulin I-920 Shufeng II-352 Shuqin I-652 Shuzong III-350 Tian-Zhen II-985 Tianmiao III-535
Wang, Wei I-834, III-535 Wang, Weiqi II-1125 Wang, Xiang-ting II-120 Wang, Xiaohong I-952 Wang, Xiaohua III-860, III-1006 Wang, Xihuai I-658 Wang, Xin II-196 Wang, Xiuhong II-72 Wang, Xiumei I-652 Wang, Xiuqing III-1155 Wang, Xiuxiu II-612 Wang, Xuelian II-1202 Wang, Xuexia II-765 Wang, XuGang II-465 Wang, Yan I-652, II-1264 Wang, Yaonan I-592, I-804, III-469 Wang, Yen-Nien I-694 Wang, Yong I-22 Wang, Yongtian II-979 Wang, Yuan I-852, II-328 Wang, Yuan-Yuan II-284, II-819, II-1125, III-426 Wang, Yuechao I-767 Wang, Zeng-Fu I-1153 Wang, Zhancheng III-988 Wang, Zhaoxia II-1231 Wang, Zhen-Yu III-633 Wang, Zhihai II-414 Wang, Zhiliang II-251 Wang, Zuo II-301 Wei, Guoliang III-695 Wei, Hongxing III-535 Wei, Miaomiao II-612 Wei, Qinglai I-387 Wei, Ruxiang III-350 Wei, Wei I-292 Wei, Xunkai I-424 Wei, Yan Hao III-998 Weiqi, Yuan III-330 Wen, Bangchun I-29 Wen, Cheng-Lin I-319, II-994, II-985, III-774 Wen, Chuan-Bo III-774 Wen, Lei III-284 Wen, Shu-Huan I-863 Wen, Yi-Min III-525 Weng, Liguo I-74 Weng, Shilie I-213 Wenjun, Zhang I-512 Wickramarachchi, Nalin II-1221
Won, Yonggwan I-1077, II-1240 Wong, Hau-San III-894 Wong, Stephen T.C. II-1097, II-1140 Woo, Dong-Min III-105 Woo, Seungjin II-1187 Woo, Young Woon II-756 Worrell, Gregory A. II-1273 Wu, Aiguo I-380 Wu, Bao-Gui III-267 Wu, Gengfeng II-352 Wu, Jianbing I-642 Wu, Jianhua I-267, II-931, III-846 Wu, Kai-gui II-947 Wu, Ke I-1310 Wu, Lingyao I-1054 Wu, Qiang I-473 Wu, Qing-xian I-112 Wu, Qingming I-231 Wu, Si I-926 Wu, TiHua I-148 Wu, Wei I-1110, III-117 Wu, Xianyong II-8, II-113 Wu, Xingxing II-170 Wu, You-shou III-1033 Wu, Yunfeng II-664 Wu, Yunhua III-1058 Wu, Zhengping II-113 Wu, Zhilu II-765, III-651 Xi, Guangcheng I-1274, II-1159 Xia, Jianjin I-64 Xia, Liangzheng II-889, III-392 Xia, Siyu III-392 Xia, Youshen III-95 Xian-Lun, Tang III-213 Xiang, C. II-1002 Xiang, Changcheng III-553 Xiang, Hongjun I-941 Xiang, Lan II-16 Xiang, Yanping I-368 Xiao, Deyun II-1072 Xiao, Gang II-705 Xiao, Jianmei I-658 Xiao, Jinzhuang I-749 Xiao, Min I-958 Xiao, Qinkun II-424 Xiaoli, Li III-1043 Xiaoyan, Ma III-981 Xie, Haibin I-1061 Xie, Hongmei II-135
Author Index Xing, Guangzhong III-1107 Xing, Jie II-1072 Xing, Yanwei I-1274 Xingsheng, Gu I-536 Xiong, Guangze III-1015 Xiong, Min III-1176 Xiong, RunQun II-465 Xiong, Zhong-yang II-947 Xu, Chi I-626 Xu, De III-1155 Xu, Dongpo III-117 Xu, Guoqing III-988 Xu, Hong I-285 Xu, Hua I-852, II-328 Xu, Huiling I-319 Xu, Jian III-1033 Xu, Jianguo I-807 Xu, Jing I-358 Xu, Jiu-Qiang II-234 Xu, Ning-Shou I-519 Xu, Qingsong I-757 Xu, Qinzhen III-374 Xu, Shuang II-896 Xu, Shuhua I-1336, III-626 Xu, Shuxiang I-1265 Xu, Xiaoyun II-1291 Xu, Xin I-455 Xu, Xinhe I-267, II-160 Xu, Xinzheng II-913 Xu, Xu I-8 Xu, Yang II-1042 Xu, Yangsheng III-988 Xu, Yulin III-164 Xu, Zong-Ben II-371, III-131 Xue, Xiaoping I-879 Xue, Xin III-441 Xurong, Zhang III-834 Yan, Gangfeng I-968 Yan, Hua II-1065 Yan, Jianjun III-959 Yan, Qingxu III-616 Y´ an ˜ez-M´ arquez, Cornelio II-835, III-828 Yang, Guowei III-616 Yang, Hongjiu I-871 Yang, Hyun-Seung II-715 Yang, Hong-yong I-241, I-249 Yang, Jiaben I-183, I-193 Yang, Jingming I-480
Yang, Jiyun II-724
Yang, Jun I-158
Yang, Kuihe III-342
Yang, Lei II-646
Yang, Luxi III-374
Yang, Ming II-842
Yang, Peng II-1291
Yang, Ping I-302
Yang, Wei III-967
Yang, Xiao-Song II-96
Yang, Xiaogang I-203
Yang, Xiaowei I-8
Yang, Yingyu I-1211
Yang, Yixian III-1
Yang, Yongming I-528
Yang, Yongqing II-33
Yang, Zhao-Xuan III-869
Yang, Zhen-Yu I-553
Yang, Zhi II-630
Yang, Zhi-Wu I-1162
Yang, Zhuo II-1248
Yang, Zi-Jiang I-275, II-1194
Yang, Zuyuan III-553
Yanxin, Zhang III-17
Yao, Danya II-1022
Ye, Bin II-656
Ye, Chun-xiao II-947
Ye, Mao III-741
Ye, Meiying II-127
Ye, Yan I-582
Ye, Zhiyuan I-986, I-1193
Yeh, Chi-Yuan III-515
Yi, Gwan-Su II-104
Yi, Jianqiang I-349, I-358, I-368, I-374, I-1274
Yi, Tinghua III-1182
Yi, Yang I-93
Yi, Zhang I-1001, II-526, III-758
Yin, Fuliang III-751
Yin, Jia II-569
Yin, Yixin II-251
Yin, Zhen-Yu II-234
Yin, Zheng II-1097
Yin-Guo, Li III-213
Ying, Gao III-17
Yongjun, Shen I-536
Yu, Changrui III-302
Yu, Chun-Chang II-336
Yu, D.L. II-432
Yu, D.W. II-432
Yu, Ding-Li I-122, I-339
Yu, Haocheng II-170
Yu, Hongshan I-592, I-804
Yu, Jian II-414
Yu, Jiaxiang II-1072
Yu, Jin-Hua III-426
Yu, Jinxia I-743
Yu, Miao II-724
Yu, Wen I-487, I-1173, II-483
Yu, Wen-Sheng I-358
Yu, Xiao-Fang III-633
Yu, Yaoliang I-449, II-671
Yu, Zhiwen III-894
Yuan, Chongtao III-461
Yuan, Dong-Feng I-22
Yuan, Hejin I-796
Yuan, Quande II-493
Yuan, Xiaofang III-469
Yuan, Xudong III-35
Yue, Dongxue III-853
Yue, Feng II-583
Yue, Heng III-715
Yue, Hong I-329
Yue, Shihong II-612
Yusiong, John Paul T. III-174
Zang, Qiang III-1138
Zdunek, Rafal III-793
Zeng, Qingtian III-812
Zeng, Wenhua II-913
Zeng, Xiaoyun III-1077
Zeng, Zhigang II-575
Zhai, Chuan-Min II-793, II-819
Zhai, Yu-Jia I-339
Zhai, Yuzheng III-1087
Zhang, Biyin II-861
Zhang, Bo II-40
Zhang, Chao III-545
Zhang, Chenggong II-526
Zhang, Daibing I-1061
Zhang, Daoqiang II-778
Zhang, Dapeng I-380
Zhang, David II-583
Zhang, Guo-Jun I-1153
Zhang, Hao I-302
Zhang, Huaguang I-387, III-715
Zhang, Jianhai I-968
Zhang, Jinfang I-329
Zhang, Jing II-381
Zhang, Jingdan II-1081
Zhang, Jinggang I-102
Zhang, Jinhui I-871
Zhang, Jiqi III-894
Zhang, Jiye I-890
Zhang, Jun II-680
Zhang, Jun-Feng I-1247
Zhang, Junxiong III-973
Zhang, Junying I-776
Zhang, Kanjian III-261
Zhang, Keyue I-890
Zhang, Kun II-861
Zhang, Lei I-1001, III-1077
Zhang, Lijing I-910
Zhang, Liming I-449, I-723, II-671, II-1133
Zhang, M.J. I-70, I-786
Zhang, Meng I-632
Zhang, Ming I-1265
Zhang, Ning II-1248
Zhang, Pan I-1144
Zhang, Pinzheng III-374
Zhang, Qi II-1125
Zhang, Qian III-416
Zhang, Qiang I-231
Zhang, Qizhi I-176
Zhang, Shaohong III-894
Zhang, Si-ying I-241, I-249
Zhang, Su III-967
Zhang, Tao II-1248
Zhang, Tengfei I-658
Zhang, Tianping I-81
Zhang, Tianqi II-640
Zhang, Tianxu II-861
Zhang, Tieyan III-715
Zhang, Wei II-656
Zhang, Xi-Yuan II-234
Zhang, Xiao-Dan II-234
Zhang, Xiao-guang II-216
Zhang, Xing-gan II-216
Zhang, XueJian I-148
Zhang, Xueping II-1291
Zhang, Xueqin II-1211
Zhang, Yan-Qing II-510
Zhang, Yanning I-796
Zhang, Yanyan II-63
Zhang, Yaoyao I-968
Zhang, Yi II-1022
Zhang, Ying-Jun II-656
Zhang, Yongqian III-1155
Zhang, You-Peng I-676
Zhang, Yu II-810
Zhang, Yu-sen III-382
Zhang, Yuxiao I-8
Zhang, Zhao III-967
Zhang, Zhaozhi III-1
Zhang, Zhikang I-1127
Zhang, Zhiqiang II-465
Zhang, Zhong II-542
Zhao, Bing III-147
Zhao, Chunyu I-29
Zhao, Dongbin I-349, I-368, I-374, I-1274
Zhao, Fan II-216
Zhao, Feng III-382
Zhao, Gang III-553
Zhao, Hai II-234
Zhao, Hongming III-535
Zhao, Jian-guo I-465
Zhao, Jianye II-88, III-821
Zhao, Lingling III-342
Zhao, Nan III-651
Zhao, Shuying I-267
Zhao, Xingang III-589
Zhao, Yaou II-1211
Zhao, Yaqin II-765, III-651
Zhao, Youdong II-896
Zhao, Zeng-Shun II-438
Zhao, Zhiqiang II-810
Zhao, Zuopeng II-913
Zheng, Chaoxin II-938
Zheng, Hongying II-724
Zheng, Huiru I-403
Zheng, Jianrong III-959
Zheng, Xia II-1140
Zhiping, Yu I-512
Zhon, Hong-Jian I-45
Zhong, Jiang II-947
Zhong, Shisheng III-66
Zhong, Ying-Ji I-22
Zhongsheng, Hou III-17
Zhou, Chunguang I-652, II-842, II-1264, III-1077
Zhou, Donghua I-1346
Zhou, Huawei I-257, I-642
Zhou, Jianting I-977
Zhou, Jie III-695
Zhou, Jin II-16
Zhou, Qingdong I-667
Zhou, Tao I-796
Zhou, Wei III-943
Zhou, Xianzhong III-434
Zhou, Xiaobo II-1097, II-1140
Zhou, Xin III-943
Zhu, Chongjun III-123
Zhu, Xun-lin I-241, I-249
Zhu, Jie II-593
Zhu, Jie (James) III-1165
Zhu, Lin I-904
Zhu, Qiguang I-749, III-311
Zhu, Qing I-81
Zhu, Si-Yuan II-234
Zhu, Xilin II-170
Zhuang, Yan I-834
Zimmerman S., Alejandro III-884
Zong, Chi I-231
Zong, Ning II-516, II-699
Zou, An-Min II-438
Zou, Qi II-1281
Zou, Shuxue II-1264
Zuo, Bin II-47
Zuo, Wangmeng II-583
Zurada, Jacek M. I-1015
Zuyuan, Yang III-803