Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
4984
Masumi Ishikawa Kenji Doya Hiroyuki Miyamoto Takeshi Yamakawa (Eds.)
Neural Information Processing 14th International Conference, ICONIP 2007 Kitakyushu, Japan, November 13-16, 2007 Revised Selected Papers, Part I
Volume Editors

Masumi Ishikawa
Hiroyuki Miyamoto
Takeshi Yamakawa
Kyushu Institute of Technology
Department of Brain Science and Engineering
2-4 Hibikino, Wakamatsu, Kitakyushu 808-0196, Japan
E-mail: {ishikawa, miyamo, yamakawa}@brain.kyutech.ac.jp

Kenji Doya
Okinawa Institute of Science and Technology
Initial Research Project
12-22 Suzaki, Uruma, Okinawa 904-2234, Japan
E-mail:
[email protected]
Library of Congress Control Number: Applied for
CR Subject Classification (1998): F.1, I.2, I.5, I.4, G.3, J.3, C.2.1, C.1.3, C.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-69154-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-69154-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2008 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12282845 06/3180 543210
Preface
This two-volume set comprises the post-conference proceedings of the 14th International Conference on Neural Information Processing (ICONIP 2007) held in Kitakyushu, Japan, during November 13–16, 2007. The Asia Pacific Neural Network Assembly (APNNA) was founded in 1993. The first ICONIP was held in 1994 in Seoul, Korea, sponsored by APNNA in collaboration with regional organizations. Since then, ICONIP has consistently provided prestigious opportunities for presenting and exchanging ideas on neural networks and related fields. Research fields covered by ICONIP have now expanded to include such fields as bioinformatics, brain-machine interfaces, robotics, and computational intelligence.

We had 288 ordinary paper submissions and 3 special organized session proposals. Although the quality of the submitted papers was exceptionally high on average, only 60% of them were accepted after rigorous reviews, each paper being reviewed by three reviewers. Concerning the special organized session proposals, two out of three were accepted. In addition to ordinary submitted papers, we invited 15 special organized sessions organized by leading researchers in emerging fields to promote the future expansion of neural information processing.

ICONIP 2007 was held at the newly established Kitakyushu Science and Research Park in Kitakyushu, Japan. Its theme was “Towards an Integrated Approach to the Brain—Brain-Inspired Engineering and Brain Science,” which emphasizes the need for cross-disciplinary approaches to understanding brain functions and utilizing that knowledge for contributions to society. It was jointly sponsored by APNNA, the Japanese Neural Network Society (JNNS), and the 21st Century COE Program at the Kyushu Institute of Technology.

ICONIP 2007 comprised 1 keynote speech, 5 plenary talks, 4 tutorials, 41 oral sessions, 3 poster sessions, 4 demonstrations, and social events such as the Banquet and International Music Festival. In all, 382 researchers registered, and 355 participants joined the conference from 29 countries. Each tutorial had about 60 participants on average. Five best paper awards and five student best paper awards were granted to encourage outstanding researchers. To minimize the number of researchers who could not present their excellent work at the conference due to financial problems, we provided travel and accommodation support of up to JPY 150,000 to six researchers and of up to JPY 100,000 to eight students.

ICONIP 2007 was jointly held with the 4th BrainIT 2007 organized by the 21st Century COE Program, “World of Brain Computing Interwoven out of Animals and Robots,” with the support of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) and the Japan Society for the Promotion of Science (JSPS).
We would like to thank Mitsuo Kawato for his superb Keynote Speech, and Rajesh P.N. Rao, Frédéric Kaplan, Shin Ishii, Andrew Y. Ng, and Yoshiyuki Kabashima for their stimulating plenary talks. We would also like to thank Sven Buchholz, Eckhard Hitzer, Kanta Tachibana, Jung Wang, Nikhil R. Pal, and Tetsuo Furukawa for their enlightening tutorial lectures. We would like to express our deepest appreciation to all the participants for making the conference really attractive and fruitful through lively discussions, which we believe would tremendously contribute to the future development of neural information processing. We also wish to acknowledge the contributions by all the Committee members for their devoted work, especially Katsumi Tateno for his dedication as Secretary. Last but not least, we want to give special thanks to Irwin King and his students, Kam Tong Chan and Yi Ling Wong, for providing the submission and reviewing system, Etsuko Futagoishi for hard secretarial work, Satoshi Sonoh and Shunsuke Sakaguchi for maintaining our conference server, and many secretaries and graduate students at our department for their diligent work in running the conference.
January 2008
Masumi Ishikawa Kenji Doya Hiroyuki Miyamoto Takeshi Yamakawa
Organization
Conference Committee
General Chair: Takeshi Yamakawa (Kyushu Institute of Technology, Japan)
Organizing Committee Chair: Shiro Usui (RIKEN, Japan)
Steering Committee Chair: Takeshi Yamakawa (Kyushu Institute of Technology, Japan)
Program Co-chairs: Masumi Ishikawa (Kyushu Institute of Technology, Japan), Kenji Doya (OIST, Japan)
Tutorials Chair: Hirokazu Yokoi (Kyushu Institute of Technology, Japan)
Exhibitions Chair: Masahiro Nagamatsu (Kyushu Institute of Technology, Japan)
Publications Chair: Hiroyuki Miyamoto (Kyushu Institute of Technology, Japan)
Publicity Chair: Hideki Nakagawa (Kyushu Institute of Technology, Japan)
Local Arrangements Chair: Satoru Ishizuka (Kyushu Institute of Technology, Japan)
Web Master: Tsutomu Miki (Kyushu Institute of Technology, Japan)
Secretary: Katsumi Tateno (Kyushu Institute of Technology, Japan)
Steering Committee Takeshi Yamakawa, Masumi Ishikawa, Hirokazu Yokoi, Masahiro Nagamatsu, Hiroyuki Miyamoto, Hideki Nakagawa, Satoru Ishizuka, Tsutomu Miki, Katsumi Tateno
Program Committee
Masumi Ishikawa, Kenji Doya

Track Co-chairs
Track 1: Masato Okada (Tokyo Univ.), Yoko Yamaguchi (RIKEN), Si Wu (Sussex Univ.) Track 2: Koji Kurata (Univ. of Ryukyus), Kazushi Ikeda (Kyoto Univ.), Liqing Zhang (Shanghai Jiaotong Univ.)
Track 3: Yuzo Hirai (Tsukuba Univ.), Yasuharu Koike (Tokyo Institute of Tech.), J.H. Kim (Handong Global Univ., Korea) Track 4: Akira Iwata (Nagoya Institute of Tech.), Noboru Ohnishi (Nagoya Univ.), SeYoung Oh (Postech, Korea) Track 5: Hideki Asoh (AIST), Shin Ishii (Kyoto Univ.), Sung-Bae Cho (Yonsei Univ., Korea)
Advisory Board Shun-ichi Amari (Japan), Sung-Yang Bang (Korea), You-Shou Wu (China), Lei Xu (Hong Kong), Nikola Kasabov (New Zealand), Kunihiko Fukushima (Japan), Tom D. Gedeon (Australia), Soo-Young Lee (Korea), Yixin Zhong (China), Lipo Wang (Singapore), Nikhil R. Pal (India), Chin-Teng Lin (Taiwan), Laiwan Chan (Hong Kong), Jun Wang (Hong Kong), Shuji Yoshizawa (Japan), Minoru Tsukada (Japan), Takashi Nagano (Japan), Shozo Yasui (Japan)
Referees S. Akaho P. Andras T. Aonishi T. Aoyagi T. Asai H. Asoh J. Babic R. Surampudi Bapi A. Kardec Barros J. Cao H. Cateau J-Y. Chang S-B. Cho S. Choi I.F. Chung A.S. Cichocki M. Diesmann K. Doya P. Erdi H. Fujii N. Fukumura W-k. Fung T. Furuhashi A. Garcez T.D. Gedeon
S. Gruen K. Hagiwara M. Hagiwara K. Hamaguchi R.P. Hasegawa H. Hikawa Y. Hirai K. Horio K. Ikeda F. Ishida S. Ishii M. Ishikawa A. Iwata K. Iwata H. Kadone Y. Kamitani N. Kasabov M. Kawamoto C. Kim E. Kim K-J. Kim S. Kimura A. Koenig Y. Koike T. Kondo
S. Koyama J.L. Krichmar H. Kudo T. Kurita S. Kurogi M. Lee J. Liu B-L. Lu N. Masuda N. Matsumoto B. McKay K. Meier H. Miyamoto Y. Miyawaki H. Mochiyama C. Molter T. Morie K. Morita M. Morita Y. Morita N. Murata H. Nakahara Y. Nakamura S. Nakauchi K. Nakayama
K. Niki J. Nishii I. Nishikawa S. Oba T. Ogata S-Y. Oh N. Ohnishi M. Okada H. Okamoto T. Omori T. Omori R. Osu N. R. Pal P. S. Pang G-T. Park J. Peters S. Phillips
Y. Sakaguchi K. Sakai Y. Sakai Y. Sakumura K. Samejima M. Sato N. Sato R. Setiono T. Shibata H. Shouno M. Small M. Sugiyama I. Hong Suh J. Suzuki T. Takenouchi Y. Tanaka I. Tetsunari
N. Ueda S. Usui Y. Wada H. Wagatsuma L. Wang K. Watanabe J. Wu Q. Xiao Y. Yamaguchi K. Yamauchi Z. Yi J. Yoshimoto B.M. Yu B-T. Zhang L. Zhang L. Zhang
Sponsoring Institutions Asia Pacific Neural Network Assembly (APNNA) Japanese Neural Network Society (JNNS) 21st Century COE Program, Kyushu Institute of Technology
Cosponsors RIKEN Brain Science Institute Advanced Telecommunications Research Institute International (ATR) Japan Society for Fuzzy Theory and Intelligent Informatics (SOFT) IEEE CIS Japan Chapter Fuzzy Logic Systems Institute (FLSI)
Table of Contents – Part I
Computational Neuroscience

A Retinal Circuit Model Accounting for Functions of Amacrine Cells . . . Murat Saglam, Yuki Hayashida, and Nobuki Murayama
1
Global Bifurcation Analysis of a Pyramidal Cell Model of the Primary Visual Cortex: Towards a Construction of Physiologically Plausible Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tatsuya Ishiki, Satoshi Tanaka, Makoto Osanai, Shinji Doi, Sadatoshi Kumagai, and Tetsuya Yagi
7
Representation of Medial Axis from Synchronous Firing of Border-Ownership Selective Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Hatori and Ko Sakai
18
Neural Mechanism for Extracting Object Features Critical for Visual Categorization Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mitsuya Soga and Yoshiki Kashimori
27
An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jordan H. Boyle, John Bryden, and Netta Cohen
37
Applying the String Method to Extract Bursting Information from Microelectrode Recordings in Subthalamic Nucleus and Substantia Nigra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pei-Kuang Chao, Hsiao-Lung Chan, Tony Wu, Ming-An Lin, and Shih-Tseng Lee
48
Population Coding of Song Element Sequence in the Songbird Brain Nucleus HVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Nishikawa, Masato Okada, and Kazuo Okanoya
54
Spontaneous Voltage Transients in Mammalian Retinal Ganglion Cells Dissociated by Vibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tamami Motomura, Yuki Hayashida, and Nobuki Murayama
64
Region-Based Encoding Method Using Multi-dimensional Gaussians for Networks of Spiking Neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lakshmi Narayana Panuku and C. Chandra Sekhar
73
Firing Pattern Estimation of Biological Neuron Models by Adaptive Observer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kouichi Mitsunaga, Yusuke Totoki, and Takami Matsuo
83
Thouless-Anderson-Palmer Equation for Associative Memory Neural Network Models with Fluctuating Couplings . . . . . . . . . . . . . . . . . . . . . . . . Akihisa Ichiki and Masatoshi Shiino Spike-Timing Dependent Plasticity in Recurrently Connected Networks with Fixed External Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthieu Gilson, David B. Grayden, J. Leo van Hemmen, Doreen A. Thomas, and Anthony N. Burkitt A Comparative Study of Synchrony Measures for the Early Detection of Alzheimer’s Disease Based on EEG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Justin Dauwels, Fran¸cois Vialatte, and Andrzej Cichocki Reproducibility Analysis of Event-Related fMRI Experiments Using Laguerre Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hong-Ren Su, Michelle Liou, Philip E. Cheng, John A.D. Aston, and Shang-Hong Lai
93
102
112
126
The Effects of Theta Burst Transcranial Magnetic Stimulation over the Human Primary Motor and Sensory Cortices on Cortico-Muscular Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Murat Saglam, Kaoru Matsunaga, Yuki Hayashida, Nobuki Murayama, and Ryoji Nakanishi
135
Interactions between Spike-Timing-Dependent Plasticity and Phase Response Curve Lead to Wireless Clustering . . . . . . . . . . . . . . . . . . . . . . . . Hideyuki Câteau, Katsunori Kitano, and Tomoki Fukai
142
A Computational Model of Formation of Grid Field and Theta Phase Precession in the Entorhinal Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoko Yamaguchi, Colin Molter, Wu Zhihua, Harshavardhan A. Agashe, and Hiroaki Wagatsuma Working Memory Dynamics in a Flip-Flop Oscillations Network Model with Milnor Attractor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . David Colliaux, Yoko Yamaguchi, Colin Molter, and Hiroaki Wagatsuma
151
160
Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization of Quasi-Attractors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hiroshi Fujii, Kazuyuki Aihara, and Ichiro Tsuda
170
Tracking a Moving Target Using Chaotic Dynamics in a Recurrent Neural Network Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yongtao Li and Shigetoshi Nara
179
A Generalised Entropy Based Associative Model . . . . . . . . . . . . . . . . . . . . . Masahiro Nakagawa
189
The Detection of an Approaching Sound Source Using Pulsed Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kaname Iwasa, Takeshi Fujisumi, Mauricio Kugler, Susumu Kuroyanagi, Akira Iwata, Mikio Danno, and Masahiro Miyaji
199
Sensitivity and Uniformity in Detecting Motion Artifacts . . . . . . . . . . . . . Wen-Chuang Chou, Michelle Liou, and Hong-Ren Su
209
A Ring Model for the Development of Simple Cells in the Visual Cortex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takashi Hamada and Kazuhiro Okada
219
Learning and Memory

Practical Recurrent Learning (PRL) in the Discrete Time Domain . . . . . Mohamad Faizal Bin Samsudin, Takeshi Hirose, and Katsunari Shibata
228
Learning of Bayesian Discriminant Functions by a Layered Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshifusa Ito, Cidambi Srinivasan, and Hiroyuki Izumi
238
RNN with a Recurrent Output Layer for Learning of Naturalness . . . . . . Ján Dolinský and Hideyuki Takagi
248
Using Generalization Error Bounds to Train the Set Covering Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zakria Hussain and John Shawe-Taylor
258
Model of Cue Extraction from Distractors by Active Recall . . . . . . . . . . . . Adam Ponzi
269
PLS Mixture Model for Online Dimension Reduction . . . . . . . . . . . . . . . . . Jiro Hayami and Koichiro Yamauchi
279
Analysis on Bidirectional Associative Memories with Multiplicative Weight Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chi Sing Leung, Pui Fai Sum, and Tien-Tsin Wong
289
Fuzzy ARTMAP with Explicit and Implicit Weights . . . . . . . . . . . . . . . . . . Takeshi Kamio, Kenji Mori, Kunihiko Mitsubori, Chang-Jun Ahn, Hisato Fujisaka, and Kazuhisa Haeiwa Neural Network Model of Forward Shift of CA1 Place Fields Towards Reward Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Adam Ponzi
299
309
Neural Network Models

A New Constructive Algorithm for Designing and Training Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Md. Abdus Sattar, Md. Monirul Islam, and Kazuyuki Murase
317
Effective Learning with Heterogeneous Neural Networks . . . . . . . . . . . . Lluís A. Belanche-Muñoz
328
Pattern-Based Reasoning System Using Self-incremental Neural Network for Propositional Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akihito Sudo, Manabu Tsuboyama, Chenli Zhang, Akihiro Sato, and Osamu Hasegawa
338
Effect of Spatial Attention in Early Vision for the Modulation of the Perception of Border-Ownership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nobuhiko Wagatsuma, Ryohei Shimizu, and Ko Sakai
348
Effectiveness of Scale Free Network to the Performance Improvement of a Morphological Associative Memory without a Kernel Image . . . . . . . Takashi Saeki and Tsutomu Miki
358
Intensity Gradient Self-organizing Map for Cerebral Cortex Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cheng-Hung Chuang, Jiun-Wei Liou, Philip E. Cheng, Michelle Liou, and Cheng-Yuan Liou Feature Subset Selection Using Constructive Neural Nets with Minimal Computation by Measuring Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . Md. Monirul Kabir, Md. Shahjahan, and Kazuyuki Murase Dynamic Link Matching between Feature Columns for Different Scale and Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuomi D. Sato, Christian Wolff, Philipp Wolfrum, and Christoph von der Malsburg
365
374
385
Perturbational Neural Networks for Incremental Learning in Virtual Learning System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eiichi Inohira, Hiromasa Oonishi, and Hirokazu Yokoi
395
Bifurcations of Renormalization Dynamics in Self-organizing Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peter Tiňo
405
Variable Selection for Multivariate Time Series Prediction with Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Han and Ru Wei
415
Ordering Process of Self-Organizing Maps Improved by Asymmetric Neighborhood Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takaaki Aoki, Kaiichiro Ota, Koji Kurata, and Toshio Aoyagi
426
A Characterization of Simple Recurrent Neural Networks with Two Hidden Units as a Language Recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Azusa Iwata, Yoshihisa Shinozawa, and Akito Sakurai
436
Supervised/Unsupervised/Reinforcement Learning

Unbiased Likelihood Backpropagation Learning . . . . . . . . . . . . . . . . . . . . . . Masashi Sekino and Katsumi Nitta
446
The Local True Weight Decay Recursive Least Square Algorithm . . . . . . Chi Sing Leung, Kwok-Wo Wong, and Yong Xu
456
Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Keisuke Yamazaki and Sumio Watanabe
466
Using Image Stimuli to Drive fMRI Analysis . . . . . . . . . . . . . . . . . . . . . . David R. Hardoon, Janaina Mourão-Miranda, Michael Brammer, and John Shawe-Taylor
477
Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima
487
Convergence Behavior of Competitive Repetition-Suppression Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Davide Bacciu and Antonina Starita
497
Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma
507
An Automatic Speaker Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . P. Chakraborty, F. Ahmed, Md. Monirul Kabir, Md. Shahjahan, and Kazuyuki Murase Modified Modulated Hebb-Oja Learning Rule: A Method for Biologically Plausible Principal Component Analysis . . . . . . . . . . . . . . . . . Marko Jankovic, Pablo Martinez, Zhe Chen, and Andrzej Cichocki
517
527
Statistical Learning Algorithms

Orthogonal Shrinkage Methods for Nonparametric Regression under Gaussian Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Katsuyuki Hagiwara
537
A Subspace Method Based on Data Generation Model with Class Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Minkook Cho, Dongwoo Yoon, and Hyeyoung Park
547
Hierarchical Feature Extraction for Compact Representation and Classification of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Markus Schubert and Jens Kohlmorgen
556
Principal Component Analysis for Sparse High-Dimensional Data . . . . . . Tapani Raiko, Alexander Ilin, and Juha Karhunen
566
Hierarchical Bayesian Inference of Brain Activity . . . . . . . . . . . . . . . . . . . . . Masa-aki Sato and Taku Yoshioka
576
Neural Decoding of Movements: From Linear to Nonlinear Trajectory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Byron M. Yu, John P. Cunningham, Krishna V. Shenoy, and Maneesh Sahani
586
Estimating Internal Variables of a Decision Maker’s Brain: A Model-Based Approach for Neuroscience . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuyuki Samejima and Kenji Doya
596
Visual Tracking Achieved by Adaptive Sampling from Hierarchical and Parallel Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomohiro Shibata, Takashi Bando, and Shin Ishii
604
Bayesian System Identification of Molecular Cascades . . . . . . . . . . . . . . . . Junichiro Yoshimoto and Kenji Doya
614
Use of Circle-Segments as a Data Visualization Technique for Feature Selection in Pattern Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shir Li Wang, Chen Change Loy, Chee Peng Lim, Weng Kin Lai, and Kay Sin Tan
625
Extraction of Approximate Independent Components from Large Natural Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshitatsu Matsuda and Kazunori Yamaguchi
635
Local Coordinates Alignment and Its Linearization . . . . . . . . . . . . . . . . . . . Tianhao Zhang, Xuelong Li, Dacheng Tao, and Jie Yang
643
Walking Appearance Manifolds without Falling Off . . . . . . . . . . . . . . . . . Nils Einecke, Julian Eggert, Sven Hellbach, and Edgar Körner
653
Inverse-Halftoning for Error Diffusion Based on Statistical Mechanics of the Spin System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yohei Saika
663
Optimization Algorithms

Chaotic Motif Sampler for Motif Discovery Using Statistical Values of Spike Time-Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takafumi Matsuura and Tohru Ikeguchi
673
A Thermodynamical Search Algorithm for Feature Subset Selection . . . . Félix F. González and Lluís A. Belanche
683
Solvable Performances of Optimization Neural Networks with Chaotic Noise and Stochastic Noise with Negative Autocorrelation . . . . . . . . . . . . . Mikio Hasegawa and Ken Umeno
693
Solving the k-Winners-Take-All Problem and the Oligopoly Cournot-Nash Equilibrium Problem Using the General Projection Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaolin Hu and Jun Wang Optimization of Parametric Companding Function for an Efficient Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shin-ichi Maeda and Shin Ishii A Modified Soft-Shape-Context ICP Registration System of 3-D Point Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiann-Der Lee, Chung-Hsien Huang, Li-Chang Liu, Shih-Sen Hsieh, Shuen-Ping Wang, and Shin-Tseng Lee Solution Method Using Correlated Noise for TSP . . . . . . . . . . . . . . . . . . . . Atsuko Goto and Masaki Kawamura
703
713
723
733
Novel Algorithms

Bayesian Collaborative Predictors for General User Modeling Tasks . . . . Jun-ichiro Hirayama, Masashi Nakatomi, Takashi Takenouchi, and Shin Ishii
742
Discovery of Linear Non-Gaussian Acyclic Models in the Presence of Latent Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shohei Shimizu and Aapo Hyvärinen
752
Efficient Incremental Learning Using Self-Organizing Neural Grove . . . . . Hirotaka Inoue and Hiroyuki Narihisa
762
Design of an Unsupervised Weight Parameter Estimation Method in Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masato Uchida, Yousuke Maehara, and Hiroyuki Shioya
771
Sparse Super Symmetric Tensor Factorization . . . . . . . . . . . . . . . . . . . . . . . Andrzej Cichocki, Marko Jankovic, Rafal Zdunek, and Shun-ichi Amari
781
Probabilistic Tensor Analysis with Akaike and Bayesian Information Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dacheng Tao, Jimeng Sun, Xindong Wu, Xuelong Li, Jialie Shen, Stephen J. Maybank, and Christos Faloutsos Decomposing EEG Data into Space-Time-Frequency Components Using Parallel Factor Analysis and Its Relation with Cerebral Blood Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fumikazu Miwakeichi, Pedro A. Valdes-Sosa, Eduardo Aubert-Vazquez, Jorge Bosch Bayard, Jobu Watanabe, Hiroaki Mizuhara, and Yoko Yamaguchi
791
802
Flexible Component Analysis for Sparse, Smooth, Nonnegative Coding or Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrzej Cichocki, Anh Huy Phan, Rafal Zdunek, and Li-Qing Zhang
811
Appearance Models for Medical Volumes with Few Samples by Generalized 3D-PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rui Xu and Yen-Wei Chen
821
Head Pose Estimation Based on Tensor Factorization . . . . . . . . . . . . . . . . . Wenlu Yang, Liqing Zhang, and Wenjun Zhu
831
Kernel Maximum a Posteriori Classification with Error Bound Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zenglin Xu, Kaizhu Huang, Jianke Zhu, Irwin King, and Michael R. Lyu Comparison of Local Higher-Order Moment Kernel and Conventional Kernels in SVM for Texture Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . Keisuke Kameyama
841
851
Pattern Discovery for High-Dimensional Binary Datasets . . . . . . . . . . . . . Václav Snášel, Pavel Moravec, Dušan Húsek, Alexander Frolov, Hana Řezanková, and Pavel Polyakov
861
Expand-and-Reduce Algorithm of Particle Swarm Optimization . . . . . . . . Eiji Miyagawa and Toshimichi Saito
873
Nonlinear Pattern Identification by Multi-layered GMDH-Type Neural Network Self-selecting Optimum Neural Network Architecture . . . . . . . . . Tadashi Kondo
882
Motor Control and Vision

Coordinated Control of Reaching and Grasping During Prehension Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masazumi Katayama and Hirokazu Katayama
892
Computer Simulation of Vestibuloocular Reflex Motor Learning Using a Realistic Cerebellar Cortical Neuronal Network Model . . . . . . . . . . . . . . Kayichiro Inagaki, Yutaka Hirata, Pablo M. Blazquez, and Stephen M. Highstein Reflex Contributions to the Directional Tuning of Arm Stiffness . . . . . . . . Gary Liaw, David W. Franklin, Etienne Burdet, Abdelhamid Kadi-allah, and Mitsuo Kawato
902
913
Analysis of Variability of Human Reaching Movements Based on the Similarity Preservation of Arm Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . Takashi Oyama, Yoji Uno, and Shigeyuki Hosoe
923
Directional Properties of Human Hand Force Perception in the Maintenance of Arm Posture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshiyuki Tanaka and Toshio Tsuji
933
Computational Understanding and Modeling of Filling-In Process at the Blind Spot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shunji Satoh and Shiro Usui
943
Biologically Motivated Face Selective Attention Model . . . . . . . . . . . . . . . . Woong-Jae Won, Young-Min Jang, Sang-Woo Ban, and Minho Lee
953
Multi-dimensional Histogram-Based Image Segmentation . . . . . . . . . . . . . . Daniel Weiler and Julian Eggert
963
A Framework for Multi-view Gender Classification . . . . . . . . . . . . . . . . . . . Jing Li and Bao-Liang Lu
973
Japanese Hand Sign Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hirotada Fujimura, Yuuichi Sakai, and Hiroomi Hikawa
983
An Image Warping Method for Temporal Subtraction Images Employing Smoothing of Shift Vectors on MDCT Images . . . . . . . . . . . . . Yoshinori Itai, Hyoungseop Kim, Seiji Ishikawa, Shigehiko Katsuragawa, Takayuki Ishida, Ikuo Kawashita, Kazuo Awai, and Kunio Doi
993
Conflicting Visual and Proprioceptive Reflex Responses During Reaching Movements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1002 David W. Franklin, Udell So, Rieko Osu, and Mitsuo Kawato An Involuntary Muscular Response Induced by Perceived Visual Errors in Hand Position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1012 David W. Franklin, Udell So, Rieko Osu, and Mitsuo Kawato Independence of Perception and Action for Grasping Positions . . . . . . . . . 1021 Takahiro Fujita, Yoshinobu Maeda, and Masazumi Katayama
Handwritten Character Distinction Method Inspired by Human Vision Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1031 Jumpei Koyama, Masahiro Kato, and Akira Hirose Recent Advances in the Neocognitron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1041 Kunihiko Fukushima Engineering-Approach Accelerates Computational Understanding of V1–V2 Neural Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1051 Shunji Satoh and Shiro Usui Recent Studies Around the Neocognitron . . . . . . . . . . . . . . . . . . . . . . . . . . . 1061 Hayaru Shouno Toward Human Arm Attention and Recognition . . . . . . . . . . . . . . . . . . . . . 1071 Takeharu Yoshizuka, Masaki Shimizu, and Hiroyuki Miyamoto Projection-Field-Type VLSI Convolutional Neural Networks Using Merged/Mixed Analog-Digital Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081 Osamu Nomura and Takashi Morie Optimality of Reaching Movements Based on Energetic Cost under the Influence of Signal-Dependent Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1091 Yoshiaki Taniai and Jun Nishii Influence of Neural Delay in Sensorimotor Systems on the Control Performance and Mechanism in Bicycle Riding . . . . . . . . . . . . . . . . . . . . . . . 1100 Yusuke Azuma and Akira Hirose Global Localization for the Mobile Robot Based on Natural Number Recognition in Corridor Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1110 Su-Yong An, Jeong-Gwan Kang, Se-Young Oh, and Doo San Baek A System Model for Real-Time Sensorimotor Processing in Brain . . . . . . 1120 Yutaka Sakaguchi Perception of Two-Stroke Apparent Motion and Real Motion . . . . . . . . . . 1130 Qi Zhang and Ken Mogi Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1141
A Retinal Circuit Model Accounting for Functions of Amacrine Cells Murat Saglam, Yuki Hayashida, and Nobuki Murayama Graduate School of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Kumamoto 860-8555, Japan
[email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp
Abstract. In previous experimental studies on vertebrates, high-level processes of vision such as object segregation and spatio-temporal pattern adaptation were found to begin at the retinal stage. In those visual functions, diverse subtypes of amacrine cells are believed to play essential roles by processing the excitatory and inhibitory signals laterally over a wide region of the retina to shape the ganglion cell responses. Previously, a simple "linear-nonlinear" model was proposed to explain a specific function of the retina and could capture the spiking behavior of the retinal output, although each class of retinal neurons was largely omitted from it. Here, we present a spatio-temporal computational model based on the response function of each class of retinal neurons and their anatomical intercellular connections. This model is not only capable of reproducing the filtering properties of the outer retina but also realizes higher-order inner retinal functions such as the object segregation mechanism mediated by wide-field amacrine cells. Keywords: Retina, Amacrine Cells, Model, Visual Function.
1 Introduction

The vertebrate retina is far more than a passive visual receptor. It has been reported that many high-level vision tasks begin in the retinal circuits, although they are generally believed to be performed in the visual cortices of the brain [1]. One important task among these is the discrimination of the actual motion of an object from the global motion across the retina. Even for a perfectly stationary scene, eye movements cause retinal image drifts that prevent the retinal circuit from receiving a stationary global input at the background [2]. To handle this problem, retinal circuits are able to distinguish object motions better when their patterns differ from the background motion. The synaptic configuration of diverse types of retinal cells plays an essential role in this function. It was reported that wide-field polyaxonal amacrine cells can drive an inhibitory process between the surround and the object regions (receptive field) on the retina [1, 2, 4, 5, 6]. Those wide-field amacrine cells are known to use inhibitory neurotransmitters such as glycine or GABA [7, 8]. A previous study reported that glycine-mediated wide-field inhibition exists in the salamander retina and proposed a simple “linear-nonlinear” model, consisting of a temporal filter and a threshold function [2]. However, that model does not include the details of any retinal neurons
accounting for that inhibitory mechanism, although it is capable of predicting the spiking behavior of the retinal output for certain input patterns. On the other hand, temporal models that include the behavior of each class of retinal neurons do exist in the literature [9, 10]. Even though those models provide high temporal resolution, they lack the spatial dimension of retinal processing. Here, we present a spatio-temporal computational model that realizes wide-field inhibition between the object and surround regions via wide-field transient on/off amacrine cells. The model considers the responses of all major retinal neurons in detail.
2 Retinal Model

The retina model parcels the stimulating input into a line of spatio-temporal computational units. Each unit consists of main retinal elements that convey information forward (photoreceptors, bipolar and ganglion cells) and laterally (horizontal and amacrine cells). Figure 1 illustrates the organization and the synaptic connections of the retinal neurons.
Fig. 1. Three computational units of the model are depicted. Each unit includes: PR: Photoreceptor, HC: Horizontal Cell, onBC: On Bipolar Cell, offBC: Off Bipolar Cell, onAC: Sustained On Amacrine Cell, offAC: Sustained Off Amacrine Cell, on/offAC: Fast-transient Wide-field On/Off Amacrine Cell, GC: On/Off Ganglion Cell. Excitatory/Inhibitory synaptic connections are represented with black/white arrowheads, respectively. Gap junctions within neighboring HCs are indicated by dotted horizontal lines. Wide-field connections are realized between wide-field on/offAC and GCs, double-s-symbols point to distant connections within those neurons.
Each neuron's membrane dynamics is governed by a differential equation (Eq. 1), which is adjusted from push-pull shunting models of retinal neurons [9]:

$$\frac{dv_c(t)}{dt} = -A\,v_c(t) + \big[B - v_c(t)\big]\,e(t) - \big[D + v_c(t)\big]\,i(t) + \sum_{k=1}^{n} W_{ck}\,v_k(t) \tag{1}$$
Here v_c(t) stands for the membrane potential of the neuron of interest. A represents the rate of passive membrane decay toward the resting potential in the dark. B and D are the saturation levels for the excitatory input e(t) and the inhibitory input i(t), respectively. These excitatory/inhibitory inputs correspond to the synaptic connections (solid lines in Fig. 1) from different neurons in a computational unit. v_k(t) is the membrane potential of a neuron belonging to another unit that makes a synapse or gap junction onto the neuron of interest, v_c(t). The efficiency of that link is determined by a weight parameter, W_ck. In the current model, spatial connectivity is present within horizontal cells as gap junctions (dashed lines in Fig. 1) and between on/off amacrine cells and ganglion cells as a wide-field inhibitory process (thin solid lines in Fig. 1). For the other neurons, W_ck is fixed to zero, since we ignore lateral spatial connections among them. A compressive nonlinearity (Eq. 2) is cascaded prior to the photoreceptor input stage in order to account for the limited dynamic range of the neural elements. Therefore the photoreceptor is fed by a hyperpolarizing input, r(t), representing the compressed form of the light intensity, f(t).
$$r(t) = G\left(\frac{f(t)}{f(t) + I}\right)^{n} \tag{2}$$
Here G denotes the saturation level of the hyperpolarizing input to the photoreceptor, I represents the light intensity yielding the half-maximum response, and n is a real constant. Although the ganglion cell receptive field size is diverse among different animals, we defined each unit to correspond to 500μm, which is in good accordance with experiments on the salamander [2]. Thirty-two computational units are interconnected as a line, and the W_ck values are determined as a function of the distance between computational units. The parameter set given in [9] was calibrated to reproduce the temporal dynamics of all neuron classes, and the spatial parameters of the model were selected to match the spatial ganglion cell response profile given in [2]. All differential equations in the model are solved sequentially using the fixed-step (1ms) Bogacki-Shampine solver of the MATLAB/SIMULINK software package (The MathWorks, Inc., Natick, MA).
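To make the structure of one computational unit concrete, the following Python sketch implements Eqs. (1) and (2) with a simple forward-Euler update. It is only an illustration of the equations above: the parameter values, function names, and the Euler scheme are placeholder assumptions of ours (the actual model uses the calibrated parameter set of [9] and a fixed-step Bogacki-Shampine solver).

```python
import numpy as np

# Illustrative parameters (placeholders, not the calibrated values of the paper).
A, B, D = 0.1, 1.0, 1.0            # decay rate and excitatory/inhibitory saturation levels
G, I_half, n_exp = 1.0, 0.5, 1.0   # compressive nonlinearity constants of Eq. (2)
dt = 1.0                           # fixed step of 1 ms, as in the paper

def compress(f):
    """Eq. (2): compressed hyperpolarizing drive to the photoreceptor."""
    return G * (f / (f + I_half)) ** n_exp

def shunting_step(v_c, e, i, v_neighbors, w):
    """One forward-Euler step of the push-pull shunting equation, Eq. (1).

    v_c         : membrane potential of the neuron of interest
    e, i        : excitatory and inhibitory synaptic drives at this time step
    v_neighbors : membrane potentials of coupled neurons (gap junctions / wide-field links)
    w           : corresponding coupling weights W_ck
    """
    dv = (-A * v_c
          + (B - v_c) * e
          - (D + v_c) * i
          + np.dot(w, v_neighbors))
    return v_c + dt * dv

# Example drive: a 150 ms flash of light converted into the photoreceptor input
t = np.arange(0, 500, dt)
f = np.where((t >= 100) & (t < 250), 1.0, 0.0)   # light intensity over time
r = compress(f)                                   # input fed to the photoreceptor stage
```

In the full model, an update of this kind would be evaluated for every neuron class in each of the 32 units at every 1 ms step.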
3 Results

First, we confirmed that the responses of each neuron agree with the physiological observations [9]. Figure 2 illustrates the responses of all neurons to a 150ms-long flash of light stimulating the whole model retina. In the outer retina, the photoreceptor responds with a transient hyperpolarization followed by a less steep level and returns to the resting potential with a small overshoot. Essentially, the horizontal cell shows a smoothed form of the photoreceptor response due to its low-pass filtering property. On- and off-bipolar cells are depolarized at the onset and offset of the flash light, respectively. Since
Fig. 2. Responses of retinal neurons (labeled as fig.1) to 150ms-long full field light flash. Dashed horizontal lines indicate dark responses(resting potentials) of each neuron. Note that the timings of on/off responses of wide-field ACs and GC spike generating potentials (GC gen. pot.) match each other. This phenomenon drives the wide-field inhibition.
Fig. 3. GC spike generating potential responses (top row) of the center unit at incoherent (left column) and coherent (right column) stimulation case. Stimulation timing and position are depicted on the x and y axes, respectively (bottom row, white bars indicate the light onset). Under coherent stimulation condition, off responses are significantly inhibited and on responses all disappeared.
those cells build a negative feedback loop with the sustained on- and off-amacrine cells, their responses are more transient than those of photoreceptors and horizontal cells, as expected. Eventually, bipolar cells transmit excitatory inputs, and wide-field transient on/off amacrine cells
convey inhibitory inputs to ganglion cells. Significant inhibition at the ganglion cell level only occurs when the wide-field amacrine cell signal coincides with the excitatory input. Figure 3 demonstrates how the inhibitory process differs when the peripheral (surround) and object regions are stimulated coherently or incoherently. In both cases, the object region covers 3 units (750μm radius) and is stimulated identically. When the surround is stimulated incoherently, the depolarized peaks of the wide-field amacrine cells do not coincide with the ganglion cell peaks, so that spike generating potentials are evident. However, when the surround region is stimulated coherently, inhibition from the amacrine cells cancels out large portions of the ganglion cell depolarizations. This leads to maximum inhibition of the spike generating potentials of the ganglion cells (Fig. 3, right column).
4 Discussion

In the current study, we realized the basic mechanism of an important retinal task, namely discriminating a moving object from a moving background image. The coherent stimulation (Fig. 2) can be linked to the global motion of the retinal image that takes place when the eye moves. However, when there is a moving object in the
Fig. 4. Relative generating potential response of GC as a function of object size. Dashed line represents the model response with the original parameter set. Triangle and square markers indicate data points for ‘wide-field AC blocked’ and ‘control’ cases, respectively. Maximum GC response is observed when the object radius is 250μm (1 unit stimulation, 2nd data point). For the sake of symmetry 3rd data point represents 3 unit stimulation (750μm radius as in Fig.3), similarly each interval after 2nd data point corresponds to 500μm increment in the radius of object. As the object starts to invade the surround region, GC response decreases. When the weight of the interconnections among wide-field on/off ACs are set to zero (STR application), inhibition process is partially disabled (solid line).
scene, its image would be reflected on the receptive field as a stimulation pattern different from the global pattern (incoherent stimulation, Fig. 3). Experimental results revealed that blocking glycine-mediated inhibition by strychnine (STR) disables the wide-field process [2]. Therefore, this glycinergic mechanism could be attributed to wide-field amacrine cells [7]. In our model, STR application can be realized by turning off the synaptic weight parameters between wide-field amacrine cells and ganglion cells. Figure 4 demonstrates how STR application can affect the ganglion cell response. As the object invades the background, the ganglion cell response is expected to be inhibited, as in the control case; however, STR prevents this phenomenon from occurring. This behavior of the model is in very good accordance with the experimental results in [2]. Note that the model is flexible enough to be fitted to another wide-field inhibitory process, such as a GABAergic mechanism [8]. Spike generation of the ganglion cells is not implemented in the current model, in order to highlight the role of the wide-field amacrine cells only. A specific spike generator can be cascaded to the model to reproduce spike responses and highlight further retinal features. Since the model covers the on/off pathways and all major retinal neurons, it can be flexibly adjusted to reproduce other functions. Although we reduced the retina to a line of spatio-temporal computational units, the model was able to reproduce a retinal mechanism. This reduction can be bypassed, and more precise results can be achieved, by creating a 2-D mesh of spatio-temporal computational units.
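As a minimal illustration of this idea, the sketch below emulates "STR application" by zeroing the inhibitory weights from the wide-field amacrine cells to the ganglion cells. The weight matrix, its exponential spatial profile, and the decay constant are hypothetical placeholders, not the calibrated values of the model.

```python
import numpy as np

n_units = 32
# Hypothetical distance-dependent inhibitory weights from wide-field on/off ACs to GCs.
dist = np.abs(np.subtract.outer(np.arange(n_units), np.arange(n_units)))
W_ac_to_gc = np.exp(-dist / 8.0)           # illustrative spatial profile only

def apply_strychnine(W):
    """Emulate STR application: silence the wide-field AC -> GC synapses."""
    return np.zeros_like(W)

W_control = W_ac_to_gc                      # control condition
W_str = apply_strychnine(W_ac_to_gc)        # wide-field inhibition disabled
```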
References

1. Masland, R.H.: Vision: The retina's fancy tricks. Nature 423(6938), 387–388 (2003)
2. Olveczky, B.P., Baccus, S.A., Meister, M.: Segregation of object and background motion in the retina. Nature 423(6938), 401–408 (2003)
3. Volgyi, B., Xin, D., Amarillo, Y., Bloomfield, S.A.: Morphology and physiology of the polyaxonal amacrine cells in the rabbit retina. J. Comp. Neurol. 440(1), 109–125 (2001)
4. Lin, B., Masland, R.H.: Populations of wide-field amacrine cells in the mouse retina. J. Comp. Neurol. 499(5), 797–809 (2006)
5. Solomon, S.G., Lee, B.B., Sun, H.: Suppressive surrounds and contrast gain in magnocellular pathway retinal ganglion cells of macaque. J. Neurosci. 26(34), 8715–8726 (2006)
6. van Wyk, M., Taylor, W.R., Vaney, D.: Local edge detectors: a substrate for fine spatial vision at low temporal frequencies in rabbit retina. J. Neurosci. 26(51), 250–263 (2006)
7. Hennig, M.H., Funke, K., Worgotter, F.: The influence of different retinal subcircuits on the nonlinearity of ganglion cell behavior. J. Neurosci. 22(19), 8726–8738 (2002)
8. Lukasiewicz, P.D.: Synaptic mechanisms that shape visual signaling at the inner retina. Prog. Brain Res. 147, 205–218 (2005)
9. Thiel, A., Greschner, M., Ammermuller, J.: The temporal structure of transient ON/OFF ganglion cell responses and its relation to intra-retinal processing. J. Comput. Neurosci. 21(2), 131–151 (2006)
10. Gaudiano, P.: Simulations of X and Y retinal ganglion cell behavior with a nonlinear push-pull model of spatiotemporal retinal processing. Vision Res. 34(13), 1767–1784 (1994)
Global Bifurcation Analysis of a Pyramidal Cell Model of the Primary Visual Cortex: Towards a Construction of Physiologically Plausible Model Tatsuya Ishiki, Satoshi Tanaka, Makoto Osanai, Shinji Doi, Sadatoshi Kumagai, and Tetsuya Yagi Division of Electrical, Electronic and Information Engineering, Graduate School of Engineering, Osaka University, Yamada-Oka 2-1, Suita, Osaka, Japan
[email protected]

Abstract. Many mathematical models of different neurons have been proposed so far; however, the way of modeling Ca2+ regulation mechanisms has not been established yet. Therefore, we try to construct a physiologically plausible model which contains many regulating systems of the intracellular Ca2+, such as Ca2+ buffering, the Na+/Ca2+ exchanger, and the Ca2+ pump current. In this paper, we seek plausible values of the parameters by analyzing the global bifurcation structure of our provisional model.
1 Introduction
Complex information processing in the brain is regulated by the electrical activity of neurons. Neurons transmit electrical signals called action potentials to each other for information processing. An action potential is a spiking or bursting event and plays an important role in the information processing of the brain. In the visual system, visual signals from the retina are processed by neurons in the primary visual cortex. There are several types of neurons in the visual cortex, and pyramidal cells make up roughly 80% of the neurons of the cortex. Pyramidal cells are connected to each other and form a complex neuronal circuit. Previous physiological and anatomical studies [1] revealed the fundamental structure of this circuit. However, it is not completely understood how visual signals propagate and function in the neuronal circuit of the visual cortex. In order to investigate the neuronal circuit, not only physiological experiments but also simulations using a mathematical model of the neuron are necessary. Many mathematical models of neurons have been proposed so far [2]. Though there are various models of neurons, the way of modeling the regulating system of the intracellular calcium ions (Ca2+) has not been established yet. The regulating system of the intracellular Ca2+ is a very important element because the intracellular Ca2+ plays crucial roles in cellular processes such as hormone and neurotransmitter release, gene transcription, and regulation of synaptic plasticity. Therefore, it is important to establish a way of modeling the regulating system of the intracellular Ca2+.
In this paper, we try to construct a model of pyramidal cells based on previous physiological experimental data, focusing especially on the regulating systems of the intracellular Ca2+, such as Ca2+ buffering, the Na+/Ca2+ exchanger, and the Ca2+ pump current. In order to estimate the values of the parameters that cannot be determined by physiological experiments alone, we analyze the global bifurcation structure based on a slow/fast decomposition of the model. We thus demonstrate the usefulness of such nonlinear analyses not only for the analysis of an established model but also for the construction of a model.
2 Cell Model
The well-known Hodgkin-Huxley (HH) equations [3] describe the temporal variation of the membrane potential of neuronal cells. Though there are many neuron models based on the HH equations, the way of modeling the regulating system of the intracellular Ca2+ has not been established yet. Thus, we construct a pyramidal cell model by using several sets of physiological experimental data [4]-[11]. The model includes a Ca2+ buffer, a Ca2+ pump, and a Na+/Ca2+ exchanger in order to describe the regulating system of the intracellular Ca2+ appropriately. The model also includes seven ionic currents through the ionic channels. The equations of the pyramidal cell model are as follows:

$$-C \frac{dV}{dt} = I_{\mathrm{total}} - I_{\mathrm{ext}}, \tag{1a}$$
$$\frac{dy}{dt} = \frac{1}{\tau_y}\,(y_\infty - y), \quad (y = M1, \dots, M6,\ H1, \dots, H6), \tag{1b}$$
$$\frac{d[\mathrm{Ca}^{2+}]}{dt} = \frac{-S \cdot I_{\mathrm{Catotal}}}{2 F} + k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}], \tag{1c}$$
$$\frac{d[\mathrm{Buf}]}{dt} = k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}], \tag{1d}$$
$$\frac{d[\mathrm{CaBuf}]}{dt} = -\big(k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}]\big), \tag{1e}$$
where V is the membrane potential, C is the membrane capacitance, Itotal is the sum of all currents through the ionic channels and Na+ /Ca2+ exchanger, and Iext is the current injected to the cell externally. The variable y denotes gating variables (M 1, · · · , M 6, H1, · · · , H6) of the ionic channels, y∞ is the steady state function of y, and τy is a time constant. [Ca2+ ] denotes the intracellular Ca2+ concentration, ICatotal is the sum of all Ca2+ ionic currents, S is the surface to volume ratio, and F is the Faraday constant. [Buf] and [CaBuf] are the concentrations of the unbound and the bound buffers, and k− and k+ are the reverse and forward rate constants of the binding reaction, respectively. Details of all equations and parameter values of this model can be found in Appendix. First we show the simulation results when a certain external stimulus current is injected to the cell model (1). Figure 1 shows an action potential waveform, and a change of [Ca2+ ] when a narrow external stimulus current (length 1ms,
Fig. 1. Waveforms of [A] the membrane potential and [B] [Ca2+ ] in the case of a 1ms narrow pulse injection [A]
[B]
V (mV)
20 0
-20 -40 -60 -80
4000
5000
t (ms)
6000
7000
Fig. 2. [A] A waveform of the membrane potential under a long pulse injection. [B] A typical waveform of the membrane potential of pyramidal cell in physiological experiments [12].
density 40μA/cm2) is injected at t = 4000 ms. The waveforms of the membrane potential and [Ca2+] do not differ qualitatively from the physiological experimental data [1]. In contrast, as shown in Fig. 2A, when a long pulse (length 1000ms, density 40μA/cm2) is injected, the membrane potential returns to a resting state after a single action potential is generated. Although the membrane potential spikes continuously in the physiological experiment (Fig. 2B), the membrane potential of the model does not show such behavior. In general, it is well known that the membrane potential of a pyramidal cell is at rest in the absence of an external stimulus and is spiking or bursting when an external stimulus is added. The aim of this paper is a reconstruction, or a parameter tuning, of the model (1) that can reproduce such behavior of the membrane potential.
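As a rough illustration of how the Ca2+-handling part of the model, Eqs. (1c)-(1e), could be integrated numerically, the following Python sketch advances [Ca2+], [Buf], and [CaBuf] with a forward-Euler step for a given total Ca2+ current. The rate constants, surface-to-volume ratio, step size, and initial conditions are placeholder assumptions (the paper's values are listed in its appendix), and the membrane and gating equations (1a)-(1b) are omitted here.

```python
import numpy as np

# Illustrative constants (placeholders; not the values of the paper's appendix).
F = 96485.0                     # Faraday constant (C/mol)
S = 1.0                         # surface-to-volume ratio, placeholder
k_plus, k_minus = 500.0, 0.5    # buffer binding/unbinding rates, placeholders
dt = 0.01                       # time step (ms)

def ca_buffer_step(Ca, Buf, CaBuf, I_Ca_total):
    """One Euler step of Eqs. (1c)-(1e): Ca2+ influx and buffering only."""
    binding = k_plus * Ca * Buf - k_minus * CaBuf    # net flux Ca2+ + Buf -> CaBuf
    dCa = -S * I_Ca_total / (2.0 * F) - binding
    dBuf = -binding
    dCaBuf = binding
    return Ca + dt * dCa, Buf + dt * dBuf, CaBuf + dt * dCaBuf

# Example: a constant inward Ca2+ current applied for 1 ms
Ca, Buf, CaBuf = 1e-4, 0.05, 0.0            # mM, placeholder initial conditions
for _ in range(100):
    Ca, Buf, CaBuf = ca_buffer_step(Ca, Buf, CaBuf, I_Ca_total=-1.0)
```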
3 Bifurcation Analysis
The characteristics of the membrane potential vary with the change of the value of some parameters, therefore we investigate the bifurcation structure of the model (1) to estimate the values of the parameters. For the bifurcation analysis in this paper, we used the bifurcation analysis software AUTO [13].
Fig. 3. One-parameter bifurcation diagram on the parameter Iext . The solid curve denotes stable equilibria of eq. (1).
3.1 The External Stimulation Current Iext
We analyze the bifurcation structure of the model to understand why continuous spiking of the membrane potential is not generated when a long pulse is injected. In order to investigate whether spiking is generated or not when Iext is increased, we vary the external stimulation current Iext as a bifurcation parameter. We show the one-parameter bifurcation diagram on the parameter Iext (Fig. 3), in which the solid curve denotes the membrane potential at stable equilibria of eq. (1). The one-parameter bifurcation diagram shows the dependence of the membrane potential on the parameter Iext. There is no bifurcation point in Fig. 3. Therefore the stability of the equilibrium point does not change, and thus the membrane potential of the model remains at rest even if Iext is increased. This result means that we cannot reproduce the physiological experimental result in Fig. 2B by varying Iext; thus we have to reconsider the parameter values of the model which cannot be determined by physiological experiments only.
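The sweep behind such a one-parameter diagram can be sketched as follows: for each value of Iext, the equilibrium branch is continued by root-finding and its stability is judged from the eigenvalues of the Jacobian. The Python code below illustrates the procedure on a generic two-variable excitable system, not on the pyramidal cell model itself; dedicated continuation software such as AUTO [13] additionally detects and classifies the bifurcation points.

```python
import numpy as np
from scipy.optimize import fsolve

# Stand-in two-variable excitable model (NOT the pyramidal cell model); it only
# illustrates the sweep-and-check-stability procedure behind a diagram like Fig. 3.
def rhs(x, I_ext):
    v, w = x
    return np.array([v - v**3 / 3.0 - w + I_ext,
                     0.08 * (v + 0.7 - 0.8 * w)])

def jacobian(x, I_ext, eps=1e-6):
    """Central-difference Jacobian of rhs at x."""
    J = np.zeros((2, 2))
    for j in range(2):
        dx = np.zeros(2); dx[j] = eps
        J[:, j] = (rhs(x + dx, I_ext) - rhs(x - dx, I_ext)) / (2 * eps)
    return J

x0 = np.array([-1.0, -0.5])
for I_ext in np.linspace(0.0, 2.0, 41):
    x_eq = fsolve(rhs, x0, args=(I_ext,))          # continue the equilibrium branch
    eigs = np.linalg.eigvals(jacobian(x_eq, I_ext))
    stable = np.all(np.real(eigs) < 0)             # stable if all eigenvalues have Re < 0
    print(f"I_ext = {I_ext:5.2f}  V_eq = {x_eq[0]:7.3f}  stable = {stable}")
    x0 = x_eq                                      # warm start for the next parameter value
```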
3.2 The Maximum Conductance of the Ca2+-Dependent Potassium Channel GKCa
The current through the Ca2+-dependent potassium channel is involved in the generation of spiking or bursting of the membrane potential. Therefore, we select the maximum conductance of the Ca2+-dependent potassium channel, GKCa, as a bifurcation parameter, and show the one-parameter bifurcation diagram when Iext = 10 (Fig. 4A). There are two saddle-node bifurcation points (SN1, SN2), three Hopf bifurcation points (HB1–HB3) and two torus bifurcation points (TR1, TR2). An unstable periodic solution bifurcating from HB1 changes its stability at the two torus bifurcation points and merges into the equilibrium point at HB3. Only in the range between HB1 and HB3 can the membrane potential oscillate. In order to investigate the dependence of the oscillatory range on the value of Iext, we show the two-parameter bifurcation diagram (Fig. 4B), in which the horizontal and vertical axes denote Iext and GKCa, respectively. The two-parameter bifurcation diagram shows the loci where a specific bifurcation occurs.
Fig. 4. [A] One-parameter bifurcation diagram on the parameter GKCa . The solid and broken curves show stable and unstable equilibria, respectively. The symbols • and ◦ denote the maximum value of V of the stable and unstable periodic solutions, respectively. [B] Two-parameter bifurcation diagram in the (Iext , GKCa )-plane.
In the diagram, the gray colored area separated by the HB and SN bifurcation curves corresponds to the range between HB1 and HB3 in Fig. 4A where the periodic solutions appear. As Iext increases, the gray colored area shrinks gradually and disappears near Iext = 25. This result means that the membrane potential of the model cannot show any oscillations (spontaneous spiking) for large values of Iext, no matter how the value of the parameter GKCa is changed.
3.3 The Maximum Pumping Rate of the Ca2+ Pump Apump
The Ca2+ pump plays an important role in the regulation of the intracellular Ca2+. We also investigate the effect on the membrane potential of varying the value of Apump, the maximum pumping rate of the Ca2+ pump. Figure 5A is the one-parameter bifurcation diagram when Iext = 10. There are two Hopf bifurcation points (HB1, HB2) and four double-cycle bifurcation points (DC1–DC4). A stable periodic solution generated at HB1 changes its stability at the four double-cycle bifurcation points and merges into the equilibrium point at HB2. Similarly to the case of GKCa, we show the two-parameter bifurcation diagram in the plane of the two parameters Iext and Apump (Fig. 5B) in order to examine the dependence of the oscillatory range between HB1 and HB2 on Iext. In Fig. 5B, the gray colored area, where the membrane potential oscillates, shrinks and disappears as Iext increases. The result shows that the membrane potential of the model cannot show any oscillations (spontaneous spiking) for large values of Iext even if the value of Apump is changed, just as in the case of GKCa.
3.4 Slow/Fast Decomposition Analysis
In this section, in order to investigate the dynamics of our pyramidal cell model in more detail, we use the slow/fast decomposition analysis [14].
Fig. 5. [A] One-parameter bifurcation diagram on the parameter Apump. [B] Two-parameter bifurcation diagram in the (Iext, Apump)-plane.
A system with multiple time scales can be written generally as

$$\frac{dx}{dt} = f(x, y), \quad x \in \mathbb{R}^n, \; y \in \mathbb{R}^m, \qquad (2a)$$
$$\frac{dy}{dt} = \varepsilon\, g(x, y), \quad \varepsilon \ll 1. \qquad (2b)$$
Equation (2b) is called the slow subsystem, since the value of y changes slowly, while equation (2a) is the fast subsystem. The whole system (2) is called the full system. So-called slow/fast analysis divides the full system into the slow and fast subsystems. In the fast subsystem (2a), the slow variable y is treated as a constant, that is, as a parameter. The variable x changes much more quickly than y, and thus x is considered to stay close to an attractor (stable equilibrium point, limit cycle, etc.) of the fast subsystem for a fixed value of y. The variable y changes slowly with a velocity g(x, y), in which x is considered to lie in the neighborhood of that attractor. The attractor of the fast subsystem may change as y is varied; analyzing the dependence of the attractor on the parameter y is a bifurcation problem. Thus the slow/fast analysis reduces the analysis of the full system to the bifurcation problem of the fast subsystem with a slowly varying bifurcation parameter. In the case of the pyramidal cell model (1), the slow/fast analysis can be made under the assumption that the intracellular Ca2+ concentration [Ca2+] changes more slowly than the other variables. Thus, we consider [Ca2+] as a bifurcation parameter and eq. (1c) as the slow subsystem, and all the other equations, eqs. (1a,b,d,e), as the fast subsystem. We show the bifurcation diagram of the fast subsystem obtained by varying the value of [Ca2+] as a parameter (Fig. 6). The figure shows the stable and unstable equilibria of the fast subsystem with Iext = 0 (thick solid and broken curves, respectively), and the nullcline of the slow subsystem (thin curve). The point at the intersection of the equilibrium curve of the fast subsystem with the nullcline of the slow subsystem is the equilibrium point of the full system. The stability of the full system is determined by whether the intersection point lies on the stable or the unstable branch of the equilibrium curve of the fast subsystem. Therefore, the full system is stable in the case shown in
Fig. 6. Bifurcation diagram of the fast subsystem (Iext = 0) with Ca2+ as a bifurcation parameter and the slow-nullcline of the slow subsystem
Fig. 6. In addition, when Iext is increased, the bifurcation diagram (equilibrium curve) of the fast subsystem shifts upward, the full system remains stable, and no oscillation appears, as will be shown in Fig. 7. By changing a parameter of the slow subsystem, the shape of the slow-nullcline changes and the intersection point is also shifted. First, we select some parameters of the slow subsystem. Because the Ca2+ pump is included only in the slow subsystem, we select Apump and the dissociation constant Kpump, which are both parameters of the Ca2+ pump, as the parameters of the slow subsystem. Second, we change the values of Apump and Kpump in order to change the shape of the nullcline. Figure 7A shows the slow-nullclines (thin solid or broken curves) for varying Apump, together with the equilibria of the fast subsystem (thick solid and broken curves) for Iext = 0, 20 and 40. With the increase of Apump, the nullcline of the slow subsystem shifts upward, and the intersection point of the equilibrium curve of the fast subsystem (Iext = 0) with the slow-nullcline then lies at an unstable equilibrium. Therefore, the membrane potential of the full system is spiking when Iext = 0. Figure 7B is the diagram corresponding to Fig. 7A when the value of Kpump is varied instead of Apump. A change of the Kpump value does not alter the shape of the slow-nullcline much, so the intersection point of the equilibrium curve of the fast subsystem with the nullcline stays at a stable equilibrium and the full system remains at a stable resting state. Next, in Fig. 8, we show an example of spontaneous spiking induced by an increase of Apump (Apump = 20). The gray colored orbit in Fig. 8A is the projected trajectory of the oscillatory membrane potential, and the waveform is shown in Fig. 8B. Because the equilibrium curve of the fast subsystem intersects the slow-nullcline at an unstable equilibrium, the membrane potential oscillates even though Iext = 0. The projected trajectory of the full system follows the stable equilibrium of the fast subsystem (the lower branch of the thick curve) for a long time, and this prolongs the inter-spike interval. After the trajectory passes through the intersection point, the membrane potential makes a spike. When the trajectory passes through the intersection point, it winds around the intersection point. This winding is possibly caused by complicated nonlinear
Fig. 7. Variation of the equilibria of the fast subsystem and the nullcline of the slow subsystem under changes of the parameters of the slow subsystem: [A] Apump, [B] Kpump
Fig. 8. [A] An oscillatory trajectory of the full system (gray curve) with bifurcation diagram of the fast subsystem (Iext = 0, thick solid and broken curve) and the nullcline of the slow subsystem (Apump = 20, thin curve), [B] The oscillatory waveform of the membrane potential
dynamics [14] and produces the subthreshold oscillation of the membrane potential just before the spike in Fig. 8B. However, this subthreshold oscillation is not observed in physiological experiments on pyramidal cells (Fig. 2B).
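The logic of this slow/fast analysis can be illustrated with a short numerical sketch: freeze the slow variable, locate the equilibria of the fast subsystem as its value is swept, and compare them with the slow nullcline. The system below is a generic cubic placeholder, not the pyramidal cell model, and serves only to show the procedure of Sec. 3.4.

```python
import numpy as np

# Slow/fast decomposition on a generic two-variable system: the slow variable
# y is frozen and used as a bifurcation parameter of the fast subsystem
# dx/dt = f(x, y); its equilibria are then compared with the slow nullcline
# g(x, y) = 0.  f and g are placeholders, not the pyramidal-cell equations.

def f(x, y):               # fast subsystem (y frozen)
    return x - x**3 - y

def g(x, y):               # slow subsystem; g = 0 is the slow nullcline
    return x - 0.5 * y

xs = np.linspace(-2.0, 2.0, 4001)
for y in np.linspace(-0.4, 0.4, 5):
    fx = f(xs, y)
    roots = xs[np.where(np.diff(np.sign(fx)) != 0)[0]]   # fast equilibria
    stable = [(1.0 - 3.0 * r**2) < 0.0 for r in roots]   # df/dx < 0 -> stable
    print(f"y={y:5.2f}  fast equilibria={np.round(roots, 2)}  stable={stable}")

# Full-system equilibrium: the simultaneous solution of f = 0 and g = 0 sits at
# (x, y) = (0, 0), on the unstable middle branch, so the full system cannot
# settle there and instead oscillates -- the situation discussed for Fig. 8.
```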
4 Conclusion
In this research, we constructed a model of pyramidal cells in the visual cortex focusing on Ca2+ regulation mechanisms, and analyzed the global bifurcation structure of the model in order to find physiologically plausible values of its parameters. We analyzed the global bifurcation structure of the model using the maximum conductance of the Ca2+-dependent potassium channel (GKCa) and the maximum pumping rate of the Ca2+ pump (Apump) as bifurcation parameters. The two-parameter bifurcation diagrams showed that the range in which spontaneous spiking occurs shrinks as the external stimulation current Iext
increases. Therefore, the membrane potential of the model cannot oscillate for large values of Iext even if the values of both GKCa and Apump are changed. We also investigated the effect of Apump and the dissociation constant Kpump on the nullcline of the slow subsystem based on the slow/fast decomposition analysis. If Apump is increased, the membrane potential spikes when Iext = 0, because the nullcline shifts upward and the full system becomes unstable. When Kpump is varied, the membrane potential keeps a resting state because the full system remains stable. Unfortunately, the expected behavior was not obtained by changing the values of the parameters considered in this paper. We have, however, demonstrated the usefulness of nonlinear analyses such as the bifurcation and slow/fast analyses for examining parameter values and constructing a physiological model. A more detailed study using other parameters, toward the construction of an appropriate model, is left as future work.
References
1. Osanai, M., Takeno, Y., Hasui, R., Yagi, T.: Electrophysiological and optical studies on the signal propagation in visual cortex slices. In: Proc. of 2005 Annu. Conf. of Jpn. Neural Network Soc., pp. 89–90 (2005)
2. Herz, A.V.M., Gollisch, T., Machens, C.K., Jaeger, D.: Modeling single-neuron dynamics and computations: a balance of detail and abstraction. Science 314, 80–85 (2006)
3. Hodgkin, A.L., Huxley, A.F.: A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.) 117, 500–544 (1952)
4. Brown, A.M., Schwindt, P.C., Crill, W.E.: Voltage dependence and activation kinetics of pharmacologically defined components of the high-threshold calcium current in rat neocortical neurons. J. Neurophysiol. 70, 1516–1529 (1993)
5. Peterson, B.Z., DeMaria, C.D., Yue, D.T.: Calmodulin is the Ca2+ sensor for Ca2+-dependent inactivation of L-type calcium channels. Neuron 22, 549–558 (1999)
6. Cummins, T.R., Xia, Y., Haddad, G.G.: Functional properties of rat and human neocortical voltage-sensitive sodium currents. J. Neurophysiol. 71, 1052–1064 (1994)
7. Korngreen, A., Sakmann, B.: Voltage-gated K+ channels in layer 5 neocortical pyramidal neurones from young rats: subtypes and gradients. J. Physiol. 525, 621–639 (2000)
8. Kang, J., Huguenard, J.R., Prince, D.A.: Development of BK channels in neocortical pyramidal neurons. J. Neurophysiol. 76, 188–198 (1996)
9. Hayashida, Y., Yagi, T.: On the interaction between voltage-gated conductances and Ca2+ regulation mechanisms in retinal horizontal cells. J. Neurophysiol. 87, 172–182 (2002)
10. Naraghi, M., Neher, E.: Linearized buffered Ca2+ diffusion in microdomains and its implications for calculation of [Ca2+] at the mouth of a calcium channel. J. Neurosci. 17, 6961–6973 (1997)
11. Noble, D.: Influence of Na/Ca exchanger stoichiometry on model cardiac action potentials. Ann. N.Y. Acad. Sci. 976, 133–136 (2002)
12. Yuan, W., Burkhalter, A., Nerbonne, J.M.: Functional role of the fast transient outward K+ current IA in pyramidal neurons in (rat) primary visual cortex. J. Neurosci. 25, 9185–9194 (2005)
13. Doedel, E.J., Champneys, A.R., Fairgrieve, T.F., Kuznetsov, Y.A., Sandstede, B., Wang, X.: Continuation and bifurcation software for ordinary differential equations (with HomCont). Technical Report, Concordia University (1997)
14. Doi, S., Kumagai, S.: Generation of very slow neuronal rhythms and chaos near the Hopf bifurcation in single neuron models. J. Comput. Neurosci. 19, 325–356 (2005)
Appendix

$$I_{\rm total} = I_{\rm Na} + I_{\rm Ks} + I_{\rm Kf} + I_{\rm KCa} + I_{\rm CaL} + I_{\rm Ca} + I_{\rm leak} + I_{\rm ex}$$
$$I_{\rm Catotal} = I_{\rm CaL} + I_{\rm Ca} - 2I_{\rm ex} + I_{\rm pump}$$

$$I_{\rm Na} = G_{\rm Na}\, M1\, H1\, (V - E_{\rm Na}), \quad G_{\rm Na} = 13.0\ {\rm mS/cm^2}, \; E_{\rm Na} = 35.0\ {\rm mV}$$
$$\tau_{M1}(V) = 1 \Big/ \left( 0.182\,\frac{V + 29.5}{1 - \exp[-(V + 29.5)/6.7]} - 0.124\,\frac{V + 29.5}{1 - \exp[(V + 29.5)/6.7]} \right)$$
$$\tau_{H1}(V) = 0.5 + 1 \Big/ \left( \exp\!\left[-\frac{V + 124.955511}{19.76147}\right] + \exp\!\left[-\frac{V + 10.07413}{20.03406}\right] \right)$$

$$I_{\rm Ks} = G_{\rm Ks}\, M2\, H2\, (V - E_{\rm Ks}), \quad G_{\rm Ks} = 0.66\ {\rm mS/cm^2}, \; E_{\rm Ks} = -103.0\ {\rm mV}$$
$$\tau_{M2}(V) = \begin{cases} 1.25 + 115.0\,\exp[0.026V], & V < -50\ {\rm mV} \\ 1.25 + 13.0\,\exp[-0.026V], & V \ge -50\ {\rm mV} \end{cases}$$
$$\tau_{H2}(V) = 360.0 + \bigl(1010.0 + 24.0\,(V + 55.0)\bigr)\,\exp\!\left[-\left(\frac{V + 75.0}{48.0}\right)^{2}\right]$$

$$I_{\rm Kf} = G_{\rm Kf}\, M3\, H3\, (V - E_{\rm Kf}), \quad G_{\rm Kf} = 0.27\ {\rm mS/cm^2}, \; E_{\rm Kf} = E_{\rm Ks}$$
$$\tau_{M3}(V) = 0.34 + 0.92\,\exp\!\left[-\left(\frac{V + 71.0}{59.0}\right)^{2}\right], \qquad \tau_{H3}(V) = 8.0 + 49.0\,\exp\!\left[-\left(\frac{V + 37.0}{23.0}\right)^{2}\right]$$

$$I_{\rm KCa} = G_{\rm KCa}\, M4\, (V - E_{\rm KCa}), \quad G_{\rm KCa} = 12.5\ {\rm mS/cm^2}, \; E_{\rm KCa} = E_{\rm Ks}$$
$$M4_{\infty}(V, [{\rm Ca^{2+}}]) = \frac{[{\rm Ca^{2+}}]}{[{\rm Ca^{2+}}] + K_h} \cdot \frac{1}{1 + \exp[-(V + 12.7)/26.2]}, \quad K_h = 0.15\ \mu{\rm M}$$
$$\tau_{M4}(V) = \begin{cases} 1.25 + 1.12\,\exp[(V + 92.0)/41.9], & V < 40\ {\rm mV} \\ 27.0, & V \ge 40\ {\rm mV} \end{cases}$$

$$I_{\rm CaL} = P_{\rm CaL}\, M5\, H5\, \frac{(2F)^{2}}{RT}\, V\, \frac{[{\rm Ca^{2+}}]\exp[2VF/RT] - [{\rm Ca^{2+}}]_o}{\exp[2VF/RT] - 1}$$
$$P_{\rm CaL} = 0.225\ {\rm cm/ms}, \quad H5_{\infty} = \frac{K_{\rm Ca}^{4}}{K_{\rm Ca}^{4} + [{\rm Ca^{2+}}]^{4}}, \quad K_{\rm Ca} = 4.0\ \mu{\rm M}$$
$$\tau_{M5}(V) = \frac{2.5}{\exp[-0.031(V + 37.1)] + \exp[0.031(V + 37.1)]}, \qquad \tau_{H5} = 2000.0$$

$$I_{\rm Ca} = P_{\rm Ca}\, M6\, H6\, \frac{(2F)^{2}}{RT}\, V\, \frac{[{\rm Ca^{2+}}]\exp[2VF/RT] - [{\rm Ca^{2+}}]_o}{\exp[2VF/RT] - 1}$$
$$P_{\rm Ca} = 0.155\ {\rm cm/ms}, \quad \tau_{M6}(V) = \frac{2.5}{\exp[-0.031(V + 37.1)] + \exp[0.031(V + 37.1)]}, \quad \tau_{H6} = 2000.0$$

$$I_{\rm leak} = V/30.0$$
$$I_{\rm ex} = k\left( [{\rm Na^+}]_i^{3}\,[{\rm Ca^{2+}}]_o\,\exp\!\left[s\,\frac{VF}{RT}\right] - [{\rm Na^+}]_o^{3}\,[{\rm Ca^{2+}}]\,\exp\!\left[-(1 - s)\,\frac{VF}{RT}\right] \right)$$
$$k = 6.0 \times 10^{-5}\ \mu{\rm A/cm^2/mM^4}, \quad s = 0.5$$
$$I_{\rm pump} = \frac{2\,F\,A_{\rm pump}\,[{\rm Ca^{2+}}]}{[{\rm Ca^{2+}}] + K_{\rm pump}}, \quad A_{\rm pump} = 5.0\ {\rm pmol/s/cm^2}, \; K_{\rm pump} = 0.4\ \mu{\rm M}$$

$$Mi_{\infty}(V) = \frac{1}{1 + \exp[-(V - \alpha_{Mi})/\beta_{Mi}]}, \quad i = 1, 2, 3, 5, 6$$
$$Hj_{\infty}(V) = \frac{1}{1 + \exp[(V - \alpha_{Hj})/\beta_{Hj}]}, \quad j = 1, 2, 3, 6$$

i              1       2       3       5        6
alpha_Mi (mV)  -29.5   -3.0    -3.0    -18.75   18.75
beta_Mi        6.7     10.0    10.0    7.0      7.0

j              1       2       3       6
alpha_Hj (mV)  -65.8   -51.0   -66.0   -12.6
beta_Hj        7.1     12.0    10.0    18.9
C = 1.0(μF/cm2 ), S = 3.75(/cm), k− = 5.0(/ms), k+ = 500.0(/mM · ms) [Ca2+ ]o = 2.5(mM), [Na+ ]i = 7.0(mM), [Na+ ]o = 150.0(mM)
T and R denote the absolute temperature and the gas constant, respectively. In all ionic currents of the model, the exponents of the gating variables (M1, ..., M6, H1, ..., H6) are set to one in order to simplify the equations. The leak current is assumed to have no ion selectivity and to follow Ohm's law; its reversal potential is therefore set to 0 mV, though this may be unusual.
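As a quick numerical check of these expressions, the sketch below transcribes a few of them into Python (the steady-state gating functions with the alpha/beta table, the Ca2+-dependent potassium current, and the pump current). The Faraday constant and the treatment of [Ca2+] in μM are assumptions made here for illustration.

```python
import numpy as np

# Direct transcription of a few Appendix expressions.  Units follow the
# Appendix (V in mV, [Ca2+] in uM, conductances in mS/cm^2); F = 96485 C/mol
# is assumed, since its value is not listed in the text.

alpha_M = {1: -29.5, 2: -3.0, 3: -3.0, 5: -18.75, 6: 18.75}
beta_M  = {1: 6.7, 2: 10.0, 3: 10.0, 5: 7.0, 6: 7.0}
alpha_H = {1: -65.8, 2: -51.0, 3: -66.0, 6: -12.6}
beta_H  = {1: 7.1, 2: 12.0, 3: 10.0, 6: 18.9}

def M_inf(i, V):                 # Mi_inf = 1/(1 + exp(-(V - alpha)/beta))
    return 1.0 / (1.0 + np.exp(-(V - alpha_M[i]) / beta_M[i]))

def H_inf(j, V):                 # Hj_inf = 1/(1 + exp((V - alpha)/beta))
    return 1.0 / (1.0 + np.exp((V - alpha_H[j]) / beta_H[j]))

def M4_inf(V, Ca, Kh=0.15):      # Ca2+- and voltage-dependent activation
    return (Ca / (Ca + Kh)) / (1.0 + np.exp(-(V + 12.7) / 26.2))

def I_KCa(V, Ca, G_KCa=12.5, E_KCa=-103.0):
    return G_KCa * M4_inf(V, Ca) * (V - E_KCa)

def I_pump(Ca, A_pump=5.0, K_pump=0.4, F=96485.0):
    return 2.0 * F * A_pump * Ca / (Ca + K_pump)

V, Ca = -60.0, 0.1               # membrane potential (mV), [Ca2+] (uM)
print("M1_inf =", M_inf(1, V), " H1_inf =", H_inf(1, V))
print("I_KCa  =", I_KCa(V, Ca), "  I_pump =", I_pump(Ca))
```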
Representation of Medial Axis from Synchronous Firing of Border-Ownership Selective Cells Yasuhiro Hatori and Ko Sakai Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573 Japan
[email protected] [email protected] http://www.cvs.cs.tsukuba.ac.jp/
Abstract. The representation of object shape in the visual system is one of the most crucial questions in brain science. Although we perceive figure shape correctly and quickly, without any effort, the underlying cortical mechanism is largely unknown. Physiological experiments with macaques indicate the possibility that the brain represents a surface with a Medial Axis (MA) representation. To examine whether early visual areas could provide a basis for MA representation, we constructed a physiologically realistic computational model of the early visual cortex, and examined what constraint is necessary for the representation of MA. Our simulation results showed that simultaneous firing of Border-Ownership (BO) selective cells at the stimulus onset is a crucial constraint for MA representation. Keywords: Shape, representation, perception, vision, neuroscience.
1 Introduction
Segregation of figure from ground might be the first step in the cortex toward the recognition of shape and object. Recent physiological studies have shown that around 60% of neurons in cortical areas V2 and V4 are selective to Border Ownership (BO), which tells which side of a contour owns the border, or the direction of figure; even about 20% of V1 neurons showed BO selectivity [1]. These reports also give an insightful idea on the coding of shape in early- to intermediate-level vision. The coding of shape is a major question in neuroscience as well as in robot vision. Specifically, it is of great interest how the visual information in early visual areas is processed to form the representation of shape. Physiological studies in monkeys [2] suggest that shape is coded by a medial axis (MA) representation in early visual areas. The MA representation is a method that codes a surface by a set of circles inscribed along the contour of the surface. An arbitrary shape can be reproduced from the centers of the circles and their diameters. We examined whether neural circuits in early- to intermediate-level visual areas could provide a basis for MA representation. We propose that the synchronized responses of BO-selective neurons could evoke the representation of MA. The physiological study [2] showed that V1 neurons responded to figure shape around 40 ms after the stimulus onset, while the latency of the cells that responded to MA was about 200 ms after the onset. A physiological study on
BO-selective neurons reported that their latency is around 70 ms. These results give rise to the proposal that the neurons detect contours first, then BO is determined from the detected local contrast, and finally the MA representation is constructed. It should be noted that BO can be determined from the local contrast surrounding the classical receptive field (CRF), so a single neuron with surround modulation would be sufficient to yield BO selectivity [3]. To examine the proposal, we constructed a physiologically realistic, firing model of Border-Ownership (BO) selective neurons. We assume that the onset of a stimulus with high contrast evokes simultaneous, strong responses of the neurons. These fast responses will propagate retinotopically, and thus the neurons at the MA (equidistant from the contour) will be activated. Our simulation results show that, because of the synchronization, even relatively small facilitation from the propagated signals yields firing of the model cells at the MA, indicating that simultaneous firing of BO-selective cells at the stimulus onset could enable the representation of MA.
2 The Proposed Model
The model is comprised of three stages: (1) a contrast detection stage, (2) a BO detection stage, and (3) an MA detection stage. The first stage extracts luminance contrast, similarly to V1 simple cells. The model cells in the second stage mimic BO-selective neurons, which determine the direction of BO with respect to the border at the CRF based on the modulation from surrounding contrast up to 5 deg in visual angle from the CRF. The third stage spatially pools the responses from the second stage to test the MA representation arising from simultaneous firing of BO cells. A schematic diagram of the model is given in Figure 1. The following sections describe the functions of each stage.
2.1 Model Neurons
To enable simulations with precise spatiotemporal properties, we implemented single-compartment firing neurons and their synaptic connections on the NEURON simulator [4]. The cell body of the model cell is approximated by a sphere. We set the radius and the membrane resistance to 50 μm and 34.5 Ωcm, respectively. The model neurons compute the membrane potential following the Hodgkin-Huxley equations [5] with the constant parameters shown in Table 1. We used a biophysically realistic spiking neuron model because we need to examine the exact timing of the firing of BO-selective cells as well as the propagation of the signals, which cannot be captured by an integrate-and-fire neuron model or other abstract models.
2.2 Contrast Detection Stage
The model cells in the first stage have response properties similar to those of V1 cells, including contrast detection, dynamic contrast normalization and a static compressive nonlinearity. The model cells in this stage detect luminance contrast with oriented Gabor filters of distinct orientations. We limited ourselves to four orientations for the sake of simplicity: vertical (0 and 180 deg) and horizontal (90 and 270 deg). The Gabor filter is defined as follows:
$$G_{\theta}(x, y) = \cos\bigl(2\pi\omega\,(x\sin\theta + y\cos\theta)\bigr) \times {\rm gaussian}(x, y, \mu_x, \mu_y, \sigma_x, \sigma_y), \qquad (1)$$
Fig. 1. A schematic illustration of the proposed model. Luminance contrast is detected in the first stage which is then processed to determine the direction of BO by surrounding modulation. F and S represent excitatory and inhibitory regions, respectively, for the surrounding modulation. The last stage detects MA based on the propagations from BO model cells.
where x and y represent spatial location, μx and μy represent the center of the Gaussian, σx and σy represent the standard deviations of the Gaussian, and θ and ω represent orientation and spatial frequency, respectively. We take a convolution of an input image
with the Gabor filters, with dynamic contrast normalization [6] including a static, compressive nonlinear function. For the purpose of efficient computation, the responses of the vertical pathways (0 and 180 deg) are integrated to form the vertical orientation, as are the horizontal pathways (90 and 270 deg), which is convenient for the computation of iso-orientation suppression and cross-orientation facilitation in the next stage:

$$O_{\theta}(x, y) = (I * G_{\theta})(x, y), \qquad (2)$$
$$O^{1}_{\rm iso}(x, y) = O_{0}(x, y) + O_{180}(x, y), \qquad (3)$$
$$O^{1}_{\rm cross}(x, y) = O_{90}(x, y) + O_{270}(x, y), \qquad (4)$$
where I represents the input image, Oθ(x, y) represents the output of the convolution (*), and O1iso (O1cross) represents the integrated responses of the vertical (horizontal) pathway.

Table 1. The constant values for the model cells used in the simulations

Parameter   Value
Cm          1 (μF/cm2)
ENa         50 (mV)
EK          -77 (mV)
El          -54.3 (mV)
gNa         0.120 (S/cm2)
gK          0.036 (S/cm2)
gl          0.0003 (S/cm2)
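A minimal sketch of this contrast-detection stage is given below: Gabor kernels at the four directions are convolved with an input image, and opposite directions are pooled into the vertical and horizontal pathways of eqs. (3) and (4). The kernel size and the omission of the contrast-normalization step are simplifications made here.

```python
import numpy as np
from scipy.signal import convolve2d

# Sketch of the contrast-detection stage (eqs. 1-4): Gabor kernels at the four
# directions, convolution with the input image, and pooling of opposite
# directions into the vertical and horizontal pathways.

def gabor(theta, omega=0.1, sigma=1.0, size=9):
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r, indexing="ij")
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * omega * (x * np.sin(theta) + y * np.cos(theta)))
    return carrier * envelope

image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0                      # a bright square on a dark ground

responses = {deg: convolve2d(image, gabor(np.radians(deg)), mode="same")
             for deg in (0, 90, 180, 270)}
O_iso = responses[0] + responses[180]        # vertical pathway, eq. (3)
O_cross = responses[90] + responses[270]     # horizontal pathway, eq. (4)
print("peak vertical response  :", O_iso.max())
print("peak horizontal response:", O_cross.max())
```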
2.3 BO Detection Stage
The second stage models the surround modulation reported in early- to intermediate-level vision for the determination of BO [3]. The model cells integrate surrounding contrast information up to 5 deg in visual angle from the CRF center. We modeled the surrounding region with two Gaussians, one for inhibition and the other for facilitation, that are located asymmetrically with respect to the CRF center. If a part of the contour of an object is projected onto the excitatory (or inhibitory) region, the contrast information of the contour detected by the first stage is transmitted via a pulse to generate an EPSP (or IPSP) in the BO-selective model cell. In other words, the projection of a figure within the excitatory region facilitates the response of the BO model cell. Conversely, if a figure is projected onto the inhibitory region, the response of the model cell is suppressed. Therefore, surrounding contrast signals from the excitatory and inhibitory regions modulate the activity of BO model cells depending on the direction of figure. In this way, we implemented BO model cells based on surround modulation. Note that Jones and her colleagues reported an orientation dependency of the surround modulation in monkeys' V1 cells [7]: suppression is limited to orientations similar to the preferred orientation of the CRF (iso-orientation suppression), while facilitation is dominant for other orientations (cross-orientation facilitation). We implemented this orientation dependency for the surround modulation.
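The asymmetric excitatory/inhibitory surround can be sketched as two offset Gaussian weighting fields applied to the contrast map; the sign of the pooled signal then indicates the preferred figure side. The offsets, widths and gains below are illustrative choices, not the parameters used in the paper.

```python
import numpy as np

# Toy surround modulation for one BO model cell: contrast falling in an
# excitatory Gaussian region on one side of the CRF facilitates the cell,
# contrast in an inhibitory region on the other side suppresses it.

def gauss2d(x, y, mu_x, mu_y, sigma):
    return np.exp(-((x - mu_x) ** 2 + (y - mu_y) ** 2) / (2.0 * sigma ** 2))

n = 41
xs, ys = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
cx = cy = n // 2                                   # CRF center

facilitation = gauss2d(xs, ys, cx, cy - 8, 5.0)    # figure-side region
suppression  = gauss2d(xs, ys, cx, cy + 8, 5.0)    # ground-side region

contrast = np.zeros((n, n))
contrast[:, cy] = 1.0                              # a vertical border through the CRF
contrast[:, :cy] += 0.3                            # weak contrast on the figure side

modulation = np.sum(contrast * (facilitation - suppression))
print("net surround signal (positive -> preferred BO side):", round(modulation, 3))
```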
Taking into account the EPSP and IPSP from the surrounds, we compute the membrane potential of a BO-selective model cell at time t as follows:

$$O^{2}(x_1, y_1, t) = {\rm input}(x_1, y_1) + c \sum_{x, y} \bigl\{ E_{\rm iso}(x, y, t - d_{x_1,y_1}(x, y)) + E_{\rm cross}(x, y, t - d_{x_1,y_1}(x, y)) \bigr\} \qquad (5)$$
where x1 and y1 represent the spatial position of the BO-selective cell, input(x1, y1) represents the output of the first stage (O1iso or O1cross), c represents the weight of the synaptic connection, and Eiso(x, y, t − dx1,y1(x, y)) represents the EPSP (or IPSP) triggered by the pulse generated at t − dx1,y1(x, y). Ecross(x, y, t − dx1,y1(x, y)) is defined in the same way except for the input orientation. dx1,y1(x, y) is the time delay, proportional to the distance between the BO-selective cell at (x1, y1) and the connected cell at (x, y). We define Eiso(x, y, t − dx1,y1(x, y)) and dx1,y1(x, y) as:
$$E_{\rm iso}\bigl(x, y, t - d_{x_1,y_1}(x, y)\bigr) = {\rm gaussian}(x, y, \mu_{x_1}, \mu_{y_1}, \sigma_{x_1}, \sigma_{y_1}) \times \exp\!\left( \frac{-\bigl(t - d_{x_1,y_1}(x, y)\bigr)}{\tau} \right) \times (v - e), \qquad (6)$$
$$d_{x_1,y_1}(x, y) = c_{\rm time}\, \sqrt{(x_1 - x)^{2} + (y_1 - y)^{2}}, \qquad (7)$$
where τ represents the time constant, v the membrane potential, e the reversal potential, and ctime a constant that converts distance to time. We set c, τ, e and ctime to 0.6 (or 1.0), 10 ms, 0 mV and 0.2 ms/μm, respectively. Ecross(x, y, t − dx1,y1(x, y)) is calculated similarly to eq. (6).
2.4 MA Detection Stage
The third stage integrates BO information from the second stage to detect the MA. An MA model cell has a single excitatory surrounding region that is represented by a Gaussian. The membrane potential of an MA model cell is given by:

$$O^{3}(x_2, y_2, t) = c_{\rm medial} \sum_{(x, y) \in (x_1, y_1)} {\rm gaussian}(x, y, \mu_{x_2}, \mu_{y_2}, \sigma_{x_2}, \sigma_{y_2}) \times O^{2}\bigl(x, y, t - d_{x_2,y_2}(x, y)\bigr), \qquad (8)$$
where x2 and y2 represent the spatial position of the MA model cell, μx2 and μy2 represent the center of the Gaussian, cmedial is a constant that we set to 1.8 (or 10.0), and dx2,y2(x, y) is calculated similarly to eq. (7). Note that MA model cells receive EPSPs only from BO model cells. When a BO model cell is activated, it transmits a pulse to MA model cells, with magnitude and time delay depending on the distance between the two. If an MA model cell is located equidistant from some parts of the contours, the pulses from the BO model cells on those contours reach the MA cell at the same time and evoke a strong EPSP that generates a spike. In contrast, MA model cells that are not equidistant from the contours will not be driven to spike. Therefore, the model cells located equidistant from the contours are activated by simultaneous input from BO model cells, and this neural population represents the medial axis of the object.
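The coincidence principle behind eq. (8) can be illustrated with a toy computation: BO cells along the contour of a square all fire at stimulus onset, their pulses reach a candidate MA cell with delays proportional to distance (eq. 7), and only cells roughly equidistant from the contour receive a tight burst of synchronous input. The synchrony score and the choice of the eight nearest contour points are ad hoc simplifications.

```python
import numpy as np

# Toy illustration of MA detection by coincidence: BO cells on the contour of
# a square fire at t = 0, and their pulses reach a candidate MA cell with a
# delay proportional to distance.  A cell equidistant from the nearby contour
# receives nearly simultaneous pulses; we score this by the inverse spread of
# the earliest arrival times.

ctime = 0.2                         # delay per unit distance (cf. 0.2 ms/um)
side = 21
contour = [(0, j) for j in range(side)] + [(side - 1, j) for j in range(side)] + \
          [(i, 0) for i in range(1, side - 1)] + [(i, side - 1) for i in range(1, side - 1)]
contour = np.array(contour, dtype=float)

def synchrony(px, py, n_nearest=8):
    delays = ctime * np.hypot(contour[:, 0] - px, contour[:, 1] - py)
    nearest = np.sort(delays)[:n_nearest]      # pulses that dominate the EPSP
    return 1.0 / (1e-6 + nearest.std())

center = (side - 1) / 2.0
print("synchrony at the square's center:", round(synchrony(center, center), 1))
print("synchrony off the medial axis   :", round(synchrony(center + 4.0, center), 1))
```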
3 Simulation Results
We carried out simulations of the model to test whether it shows the representation of MA. As typical examples, the results for three types of stimuli are
shown here: a square, a C-shaped figure, and a natural image of an eagle [8], as shown in Fig. 2. Note that the model cells described in the previous sections are distributed retinotopically to form layers of 138 × 138 cells each.
Fig. 2. Three examples of stimuli used for the simulations. A square (A), a C-shaped figure (B), and a natural image of an eagle from Berkeley Segmentation Dataset [8] (C).
3.1 A Single Square
First, we tested the model with a single square similar to that used in the corresponding physiological experiment [2], as shown in Fig. 2(A). Although we carried out the simulations retinotopically in 2D, for the purpose of graphical presentation the responses along the horizontal cross section of the stimulus indicated in Fig. 3(A) are shown in Fig. 3(B). The figure exhibits the firing rates of two types of model cells: BO model cells responding to contours (solid lines at the horizontal positions of -1 and 1) and MA model cells (dotted lines at the horizontal position of 0). We observe a clear peak corresponding to MA at the center, similar to the results of the physiological experiments by Lee et al. [2]. Although we show the responses along the horizontal cross section, the responses along a vertical cross section are identical. This result suggests that simultaneous firing of BO cells is capable of generating MA without any other particular constraints.
Fig. 3. The simulation results for a square. (A) We show the responses of the cells located along the dashed line (the number of neurons is 138). Horizontal positions -1 and 1 represent the places on the vertical edges of the square. Zero represents the center of the square. (B) The responses of the model cells in firing rate along the cross-section as a function of the horizontal location. A clear peak at the center corresponding to MA is observed.
3.2 C-Shape
We tested the model with a C-shaped figure, which has been suggested to be a difficult shape for the determination of BO. Fig. 4 shows the simulation result for the C-shape. The responses along the horizontal and vertical cross sections indicated in Fig. 4(A) are shown in Fig. 4(B) and (C), respectively, with the same conventions as in Fig. 3. The figure exhibits the firing rates of two types of model cells: BO model cells responding to contours (solid lines) and MA model cells (dotted lines). MA representation predicts a strong peak at 0 along the horizontal cross section, and distributed strong responses along the vertical cross section. Although we observe the maximum responses at the centers corresponding to MA, the peak along the horizontal was not significant, and the distribution along the vertical was peaky and uneven. It appears that MA cells cannot properly integrate the signals propagated from BO cells. This distributed MA response comes from the complicated propagation of BO signals from the concave shape. Furthermore, as has been suggested [3], the C-shaped figure is a challenging shape for the determination of BO; therefore, the BO signals are not clear before the propagation begins. We note that the model is nevertheless capable of providing a basis for MA representation for a complicated figure.
Fig. 4. The simulation results for the C-shaped figure. (A) Positions of the analysis along horizontal and vertical cross-sections as indicated by dashed line. The responses of the model cells in firing rate along the horizontal cross section (B), and that along the vertical cross section (C). Solid and dotted lines indicate the responses of BO and MA model cells, respectively. Although the maximum responses are observed at the centers, the responses are distributed.
3.3 Natural Images The model has shown its ability to extract a basis for MA representation for not only simple but also difficult shapes. To further examine the model for arbitrary shapes, we
tested the model with natural images taken from the Berkeley Segmentation Dataset [8]. Fig. 2(C) shows an example, an eagle perched on a tree branch. Because we are interested in the representation of shape, we extracted its shape by binarizing the gray scale, as shown in Fig. 5(A). The simulation results for the BO and MA model cells are shown in Fig. 5(B). We plotted the responses along the horizontal cross section indicated in Fig. 5(A). Although the shape is much more complicated, detailed and asymmetric, the results are very similar to those for a square shown in Fig. 3(B). The BO model cells responded to the contours (horizontal positions -1 and 1), and MA model cells exhibited a strong peak at the center (horizontal position 0). This result indicates that the model detects MA for figures with arbitrary shape. The aim of the simulations with natural images is to test a variety of stimulus shapes and configurations that are possible in natural scenes. Further simulations with a number of natural images are expected, specifically with images including occlusion, multiple objects and ambiguous figures.
Fig. 5. An example of simulation results for natural images. (A) The binary image of an eagle together with the horizontal cross-section used for the graphical presentation of the results. (B) The responses of the model cells in firing rate along the cross-section as a function of the horizontal location. Horizontal positions -1 and 1 represent the places on the vertical edges of the eagle. Zero represents the center of the bird. A clear peak at the center corresponding to MA is observed. Although the shape is much more complicated, detailed and asymmetric, the results are very similar to those for a square shown in Fig. 3(B), indicating the robustness of the model.
4 Conclusion
We studied whether early visual areas could provide a basis for MA representation, and specifically what constraint is necessary for the representation of MA. Our results showed that simultaneous firing of BO-selective neurons is crucial for MA representation. We implemented physiologically realistic firing model neurons with connections from BO model cells to MA model cells. If a stimulus is presented at once so that BO cells along the contours fire simultaneously, and if an MA cell is located equidistant from some of the contours of the stimulus, then the MA cell fires because the synchronous signals from the BO cells give rise to a strong EPSP. We showed three typical examples of the simulation results: a simple square, a difficult C-shaped figure, and a natural image of an eagle. The simulation results showed that the model provides a basis for MA representation for all three types of stimuli. These
results suggest that the simultaneous firing of BO cells is essential for the MA representation in early visual areas.
Acknowledgment We thank Dr. Haruka Nishimura for her insightful comments and Mr. Satoshi Watanabe for his help in simulations. This work was supported by Grant-in-aid for Scientific Research from the Brain Science Foundation, the Okawa Foundation, JSPS (19530648), and MEXT of Japan (19024011).
References
1. Zhou, H., Friedman, H.S., von der Heydt, R.: Coding of Border Ownership in Monkey Visual Cortex. The Journal of Neuroscience 20, 6594–6611 (2000)
2. Lee, T.S., Mumford, D., Romero, R., Lamme, V.A.F.: The role of the primary visual cortex in higher level vision. Vision Research 38, 2429–2454 (1998)
3. Sakai, K., Nishimura, H.: Surrounding Suppression and Facilitation in the Determination of Border Ownership. The Journal of Cognitive Neuroscience 18, 562–579 (2006)
4. NEURON: http://www.neuron.yale.edu/neuron/
5. Johnston, D., Wu, S.: Foundations of Cellular Neurophysiology. MIT Press, Cambridge (1999)
6. Carandini, M., Heeger, D.J., Movshon, J.A.: Linearity and Normalization in Simple Cells of the Macaque Primary Visual Cortex. The Journal of Neuroscience 17, 8621–8644 (1997)
7. Jones, H.E., Wang, W., Sillito, A.M.: Spatial Organization and Magnitude of Orientation Contrast Interaction in Primate V1. Journal of Neurophysiology 88, 2796–2808 (2002)
8. The Berkeley Segmentation Dataset: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
9. Engel, A.K., Fries, P., Singer, W.: Dynamic predictions: oscillations and synchrony in top-down processing. Nature Reviews Neuroscience 2, 704–716 (2001)
Neural Mechanism for Extracting Object Features Critical for Visual Categorization Task
Mitsuya Soga1 and Yoshiki Kashimori1,2
1 Dept. of Information Network Science, Graduate School of Information Systems, Univ. of Electro-Communications, Chofu, Tokyo 182-8585, Japan
2 Dept. of Applied Physics and Chemistry, Univ. of Electro-Communications, Chofu, Tokyo 182-8585, Japan
Abstract. The ability to group visual stimuli into meaningful categories is a fundamental cognitive process. Several experiments have been made to investigate the neural mechanism of visual categorization. Although there is experimental evidence that prefrontal cortex (PFC) and inferior temporal (IT) cortex neurons respond sensitively in categorization tasks, little is known about the functional role of the interaction between PFC and IT in such tasks. To address this issue, we propose a functional model of the visual system and investigate the neural mechanism for the categorization task of line drawings of faces. We show here that IT represents the similarity of face images based on the information of the resolution maps of early visual stages. We also show that PFC neurons bind the information about the parts and locations of the face image, and that PFC then generates a working memory state in which only the information about face features relevant to the categorization task is sustained.
1 Introduction
Visual categorization is fundamental to the behavior of higher primates. Our raw perceptions would be useless without our classification of items such as animals and food. The visual system has the ability to categorize visual stimuli, that is, the ability to react similarly to stimuli even when they are physically distinct, and to react differently to stimuli that may be similar. How does the brain group stimuli into meaningful categories? Several experiments have been made to investigate the neural mechanism of visual categorization. Freedman et al. [1] examined the responses of neurons in the prefrontal cortex (PFC) of monkeys trained to categorize computer-generated animal forms as either "doglike" or "catlike". They reported that many PFC neurons responded selectively to the different types of visual stimuli belonging to either the cat or the dog category. Sigala and Logothetis [2] recorded from inferior temporal (IT) cortex after monkeys learned a categorization task, and found that the selectivity of IT neurons was significantly increased for features critical for the task. The numerous reciprocal connections between PFC and IT could allow the necessary interactions to select the best diagnostic features of stimuli [3]. However, little is known about the role of the interaction between IT
and PFC in a categorization task that yields category boundaries tied to behavioral consequences. To address this issue, we propose a functional model of the visual system in which the categorization task is achieved based on the functional roles of IT and PFC. The functional role of IT is to represent features of object parts, based on different resolution maps in the early visual system such as V1 and V4. In IT, visual stimuli are categorized by similarity based on the features of object parts. The posterior parietal cortex (PP) encodes the location of the object part to which attention is paid. The PFC neurons combine the information about the features and locations of object parts, and generate a working memory of the object information relevant to the categorization task. The synaptic connections between IT and PFC are learned so as to achieve the categorization task. The feedback signals from PFC to IT enhance the sensitivity of the IT neurons that respond to the features of object parts critical for the categorization task, thereby enabling the visual system to perform task-dependent categorization quickly and reliably. In the present study, we present a neural network model that categorizes visual objects depending on the categorization task. We investigated the neural mechanism of the categorization task on line drawings of faces used by Sigala and Logothetis [2]. Using this model we show that IT represents the similarity of face images based on the information of the resolution maps in V1 and V4. We also show that PFC generates a working memory state in which only the information about face features relevant to the categorization task is sustained.
2 Model
To investigate the neural mechanism of visual categorization, we built a neural network model of the form-perception pathway from the retina to the prefrontal cortex (PFC). The model consists of eight neural networks corresponding to the retina, lateral geniculate nucleus (LGN), V1, V4, inferior temporal cortex (IT), posterior parietal cortex (PP), PFC, and premotor area, which are involved in the ventral and dorsal pathways [4,5]. The network structure of our model is illustrated in Fig. 1.
2.1 Model of Retina and LGN
The retinal network is an input layer, onto which the object image is projected. The retina has a two-dimensional lattice structure that contains an NR × NR pixel scene. The LGN network consists of three different types of neurons with respect to the spatial resolution of contrast detection: fine-tuned neurons with high spatial frequency (LGNF), middle-tuned neurons with middle spatial frequency (LGNM), and broad-tuned neurons with low spatial frequency (LGNB). The output of the LGNX neuron (X = B, M, F) at site (i, j) is given by

$$I_{\rm LGN}(i, j; X) = \sum_{i_R} \sum_{j_R} I_R(i_R, j_R)\, M(i, j; X), \qquad (1)$$
Fig. 1. The structure of our model. The model is composed of eight modules structured such that they resemble ventral and dorsal pathway of the visual cortex, retina, lateral geniculate nucleus (LGN), primary visual cortex (V1), V4, inferior temporal cortex (IT), posterior parietal (PP), prefrontal cortex (PFC), and premotor area. a1 ∼ a3 mean dynamical attractors of visual working memory.
$$M(i, j; X) = A \exp\!\left[ -\frac{(i - i_R)^2 + (j - j_R)^2}{\sigma_{1X}^2} \right] - B \exp\!\left[ -\frac{(i - i_R)^2 + (j - j_R)^2}{\sigma_{2X}^2} \right], \qquad (2)$$

where IR(iR, jR) is the gray-scale intensity of the pixel at retinal site (iR, jR), and the function M(i, j; X) is a Mexican-hat-like function that represents the convergence of the retinal inputs through ON-center/OFF-surround connections between the retina and LGN. The parameter values were set to A = 1.0, B = 1.0, σ1X = 1.0, and σ2X = 2.0.
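A small sketch of this retina-to-LGN stage is given below: each LGN unit pools retinal pixels through the Mexican-hat (difference-of-Gaussians) kernel of eq. (2). The test image, kernel truncation radius and the use of a single resolution channel are choices made here for illustration.

```python
import numpy as np

# Sketch of the retina-to-LGN stage (eqs. 1-2): each LGN neuron pools retinal
# pixels through a Mexican-hat (difference-of-Gaussians) kernel.

A, B = 1.0, 1.0
sigma1, sigma2 = 1.0, 2.0          # values for one resolution channel

def mexican_hat(di, dj):
    r2 = di**2 + dj**2
    return A * np.exp(-r2 / sigma1**2) - B * np.exp(-r2 / sigma2**2)

def lgn_output(image, radius=6):
    n = image.shape[0]
    out = np.zeros_like(image, dtype=float)
    for i in range(n):
        for j in range(n):
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ir, jr = i + di, j + dj
                    if 0 <= ir < n and 0 <= jr < n:
                        out[i, j] += image[ir, jr] * mexican_hat(di, dj)
    return out

img = np.zeros((24, 24))
img[6:18, 6:18] = 1.0                       # bright square stimulus
resp = lgn_output(img)
print("edge response    :", resp[6, 12])    # strong contrast signal on the border
print("interior response:", resp[12, 12])   # weak inside the uniform region
```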
2.2 Model of V1 Network
The neurons of the V1 network have the ability to detect elemental features of the object image, such as the orientation and edges of a bar. The V1 network consists of three different types of networks with high, middle, and broad spatial resolutions, V1F, V1M, and V1B, each of which receives the outputs of LGNF, LGNM, and LGNB, respectively. The V1X network (X = B, M, F) contains MX × MX hypercolumns, each of which contains LX orientation columns. The neurons in V1X (X = B, M, F) have receptive fields performing a Gabor transform. The output of the V1X neuron at site (i, j) is given by

$$I_{V1}(i, j, \theta; X) = \sum_{p} \sum_{q} I_{\rm LGN}(p, q; X)\, G(p, q, \theta; X), \qquad (3)$$
$$G(p, q, \theta; X) = \frac{1}{2\pi \sigma_{Gx} \sigma_{Gy}} \exp\!\left[ -\frac{1}{2}\left( \frac{(i - p)^2}{\sigma_{Gx}^2} + \frac{(j - q)^2}{\sigma_{Gy}^2} \right) \right] \times \sin\!\left( 2\pi f_x\, p \cos\theta + 2\pi f_y\, q \sin\theta + \frac{\pi}{2} \right), \qquad (4)$$
where fx and fy are the spatial frequencies along the x- and y-coordinates, respectively. The parameter values were σGx = 1, σGy = 1, fx = 0.1 Hz, and fy = 0.1 Hz.
2.3 Model of V4 Network
The V4 network consists of three different networks with high, middle, and low spatial resolutions, which receive convergent inputs from the cell assemblies with the same tuning in V1F, V1M, and V1B, respectively. The convergence of the outputs of V1 neurons enables V4 neurons to respond specifically to combinations of elemental features, such as a cross or a triangle, represented on the V4 network.
2.4 Model of PP
The posterior parietal (PP) network consists of NPP × NPP neurons, each of which corresponds to the spatial position of a pixel of the retinal image. The functions of the PP network are to represent the spatial position of a whole object and the spatial arrangement of its parts in the retinotopic coordinate system, and to mediate the location of the object part to which attention is paid.
2.5 Model of IT
The IT network consists of three subnetworks, each of which receives the outputs of the V4F, V4M, and V4B maps, respectively. Each subnetwork has neurons tuned to various features of the object parts, depending on the resolutions of the V4 maps and on the location of the object parts to which attention is directed. The first subnetwork detects a broad outline of the whole object, the second subnetwork detects elemental figures that represent elemental outlines of the object, and the third subnetwork represents the information of the object parts based on the fine resolution of the V4F map. Each subnetwork was built with Kohonen's self-organized map model. The elemental figures in the second subnetwork may play an important role in extracting similarity between the outlines of objects.
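Each of these subnetworks is stated to be built with Kohonen's self-organized map; a minimal generic version of that learning rule is sketched below. The map size, input dimensionality and learning schedule are arbitrary illustrative choices, not the ones used for the IT model.

```python
import numpy as np

# Minimal Kohonen self-organized map of the kind referred to for the IT (and
# V4) subnetworks: prototype vectors on a small grid are pulled toward input
# feature vectors, with a neighborhood that shrinks over training.

rng = np.random.default_rng(0)
map_h, map_w, dim = 8, 8, 4                      # e.g. 4 face-feature values
weights = rng.random((map_h, map_w, dim))
grid_i, grid_j = np.meshgrid(np.arange(map_h), np.arange(map_w), indexing="ij")

def train(samples, epochs=200, lr0=0.5, sigma0=3.0):
    for t in range(epochs):
        lr = lr0 * np.exp(-t / epochs)
        sigma = sigma0 * np.exp(-t / epochs)
        for x in samples:
            dists = np.linalg.norm(weights - x, axis=2)
            bi, bj = np.unravel_index(np.argmin(dists), dists.shape)   # winner
            h = np.exp(-((grid_i - bi) ** 2 + (grid_j - bj) ** 2) / (2 * sigma**2))
            weights[...] += lr * h[..., None] * (x - weights)

samples = rng.random((20, dim))
train(samples)
x = samples[0]
winner = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)), (map_h, map_w))
print("winner unit for the first sample:", winner)
```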
2.6 Model for Working Memory in PFC
The PFC memorizes the information about the spatial positions of the object parts as dynamical attractors. The functional role of the PFC is to reproduce a complete form of the object by binding the information about the object parts memorized in the second and third subnetworks of IT with their spatial arrangements represented by the PP network. The PFC network model was built on the dynamical map model [6,7]. The network model consists of three types of neurons: M, R, and Q neurons. The Q neuron is connected to the M neuron with an inhibitory
Neural Mechanism for Extracting Object Features
31
synapse. M neurons are interconnected to each other with excitatory and inhibitory synaptic connections. M neuron layer of the present model corresponds to an associative neural network. M neurons receive inputs from three IT subnetworks. Dynamical evolutions of membrane potentials of the neurons, M, R, and Q, are described by
$$\tau_m \frac{du_{mi}}{dt} = -u_{mi} + \sum_{j} \sum_{\tau_{ij}=0}^{\tau_{\max}} W_{mm,ij}(t, T_{ij})\, V_{mj}(t - \tau_{ij}) + W_{mq} U_{qi} + W_{mr} U_{ri} + \sum_{k} W_{IT,ik} X_k^{IT} + \sum_{l} W_{PP,il} X_l^{PP}, \qquad (5)$$
$$\tau_{qi} \frac{du_{qi}}{dt} = -u_{qi} + W_{qm} V_{mi}, \qquad (6)$$
$$\tau_{ri} \frac{du_{ri}}{dt} = -u_{ri} + W_{rm} V_{mi}, \qquad (7)$$
where um,i , uq,i , and ur,i are the membrane potentials of ith M neuron, ith Q neuron, and ith R neuron, respectively. τmi , τqi , τri are the relaxation times of these membrane potentials. τij is the delay time of the signal propagation from jth M neuron to ith one, and τmax is the maximum delay time. The time delay plays an important role in the stabilization of temporal sequence of firing pattern. Wmm,ij (t, Tij ) is the strength of the axo-dendric synaptic connection from jth M neuron to ith M neuron whose propagation delay time is τij . Wmq (t), Wmr (t), and Wqm (t), and Wrm (t) are the strength of the dendro-dendritic synaptic connection from Q neuron to M neuron, from R neuron to M neuron, from M neuron to Q neuron, and from M neuron to R neuron, respectively. Vm is the output of ith M neuron, Uqi and Uri are dendritic outputs of ith Q neuron and ith R neuron, respectively. Outputs of Q and R neurons are given by sigmoidal functions of uqi and uri , respectively. M neuron has a spike output, the firing probability of which is determined by a sigmoidal function of umi . WIT,ik and WP P,il are the synaptic strength from kth IT neuron to ith M neuron and that from lth PP neuron to ith M neuron, respectively. XkIT and XlP P are the output of k th IT neuron and lth PP neuron, respectively. The parameters were set to be τm = 3ms, τqi = 1ms, τri = 1ms, Wmq = 1, Wmr = 1, Wqm = 1, Wqr = −10, and τm = 38ms. Dynamical evolution of the synaptic connections are described by τw
$$\tau_w \frac{dW_{mm,ij}(t, T_{ij})}{dt} = -W_{mm,ij}(t, T_{ij}) + \lambda V_{mi}(t)\, V_{mj}(t - T_{ij}), \qquad (8)$$
where τw is a time constant and λ is a learning rate. The parameter values were τw = 1200 ms and λ = 28. The PFC connects reciprocally with the IT, PP, and premotor networks. The synaptic connections between PFC and IT and those between PFC and PP are learned by a Hebbian learning rule.
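A much-simplified sketch of the dynamics in eqs. (5)-(8) is given below: leaky membrane potentials of M neurons driven by delayed recurrent input, with the delayed Hebbian weight update of eq. (8). The Q/R dendro-dendritic interactions and the IT/PP afferents are replaced by a transient external drive, and a sigmoid stands in for the spiking output, so this is only an illustration of the update structure.

```python
import numpy as np
from collections import deque

# Simplified sketch of eqs. (5)-(8): leaky M-neuron potentials with delayed
# recurrent input and a delayed Hebbian weight update.  Q/R interactions and
# IT/PP afferents are replaced by a transient external drive.

rng = np.random.default_rng(1)
N, T_delay, dt = 10, 5, 1.0            # neurons, delay (steps), time step (ms)
tau_m, tau_w, lam = 3.0, 1200.0, 28.0  # time constants and learning rate from the text
W = 0.1 * rng.random((N, N))           # recurrent weights W_mm
u = np.zeros(N)                        # membrane potentials u_m
past = deque([np.zeros(N)] * T_delay, maxlen=T_delay)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(500):
    V_delayed = past[0]                            # V_m(t - tau_ij)
    ext = np.where(np.arange(N) < 3, 1.0, 0.0) if step < 100 else np.zeros(N)
    u += dt * (-u + W @ V_delayed + ext) / tau_m   # eq. (5), simplified
    V = sigmoid(4.0 * (u - 0.5))
    W += dt * (-W + lam * np.outer(V, V_delayed)) / tau_w   # eq. (8)
    past.append(V)                                 # update the delay line

print("activity of the stimulated group after input offset:", V[:3].mean())
print("activity of the rest of the network:", V[3:].mean())
```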
32
2.7
M. Soga and Y. Kashimori
Model of Premotor Cortex
The model of premotor area consists of neurons whose firing correspond to action relevant to the categorization task, that is, pressing right or left lever.The details of mathematical description are described in Refs. [8-10].
3 3.1
Neural Mechanism for Extracting Diagnostic Features in Visual Categorization Task Categorization Task
We used the line drawings of faces used by Sigala and Logothetis [2] to investigate the neural mechanism of categorization task. The face images consist of four varying features, eye height, eye separation, nose length, and mouse height. The monkeys were trained to categorize the face stimuli depending on two diagnostic features, eye height and eye separation. The two diagnostic features allowed separation between classes along a linear category boundary as shown in Fig.2b. The face stimuli were not linearly separable by using the other two, non-diagnostic features, or nose length and mouth height. On each trial, the monkeys saw one face stimulus and then pressed one of two levers to indicate category. Thereafter, they received a reward only if they chose correct category. After the training, the monkeys were able to categorize various face stimuli based on the two diagnostic features. In our simulation, we used four training stimuli shown in Fig.2a and test stimuli with varying four features.
Fig. 2. a) The training stimulus set consisted of line drawing of faces with four varying features: eye separation, eye height, nose length and mouth height. b) In the categorization task, the monkyes were presented with one stimulus at a time. The two categories were linearly separable along the line. The test stimuli are illustrated by the marks, ‘x‘ and ‘o‘. See Ref.2 for details of the task.
Neural Mechanism for Extracting Object Features
3.2
33
Neural Mechanism for Accomplishing Visual Categorization Task
Neural mechanism for accomplishing visual categorization task is illustrated in Fig.3. Object features are encoded by hierarchical processing at each stage of ventral visual pathway from retina to V4. The IT neurons encode the information of object parts such as eyes, nose, and mouth. The PP neurons encode the location of object parts to which attention should be directed. The information of object part and its location are combined by a dynamical attractor in PFC. The output of PFC is sent to two types of premotor neurons, each firing of which leads to pressing of the right or left lever. When monkeys exhibit the relevant behavior for the task and then receive a reward, the attractor in PFC, which represents the object information relevant to the task, is gradually stabilized by the facilitation of learning across the PFC neurons and that between PFC and premotor neurons. On the other hand, when monkeys exhibit the irrelevant behavior for the task and then receive no reward, the attractor associated with the irrelevant behavior is destabilized, and then eliminated in the PFC network. As a result, the PFC retains only the object information relevant to the categorization task, as working memory. The feedback from PFC to IT and PP makes the responses of IT and PP neurons strengthened, thereby enabling the visual system to rapidly and accurately discriminate between object features belonging to different categories. When monkey pays attention to a local region of face stimulus, the attention signal from other brain area such as prefrontal eye field increases the activity of PP neurons encoding the location of face part to which attention is directed. The PP neurons send their outputs back to V4, and thereby increasing the activity of V4 neurons encoding the feature of the face parts which the attention is paid, because V4 has the same retinotopic map as PP. Thus the attention to PP allows V4 to send IT only the information of face part to which attention is directed,
Fig. 3. Neural mechanism for accomplishing visual categorization task. The information about object part and it’s location are combined to generate working memory, indicated by α and β. The solid and dashed lines indicate the formation and elimination of synaptic connection, respectively.
34
M. Soga and Y. Kashimori
leading to generation of IT neurons encoding face parts. Furthermore, as the training proceeds, the attention relevant to the task is fixed by the learning of synaptic connections between PFC and PP, allowing monkey to perform quickly the visual task.
4 4.1
Results Information Processing of Visual Images in Early Visual Areas
Figure 4 shows the responses of neurons in early visual areas, LGN, V1, and V4, to the training stimulus, face 1 shown in Fig. 2a. The visual information is processed in a hierarchical manner, because the neurons involved in the pathway from LGN to V4 have progressively larger receptive fields and prefer more complex stimuli. At the first stage, the contrast of the stimulus is encoded by ON center-OFF surrounding receptive field of LGN neurons, as shown in Fig. 4b. Then, the V1 neurons, receiving the outputs of LGN neurons, encode the information of directional features of short bars contained in the drawing face, as shown in Fig. 4c. The V4 network was made by Kohonen’s self-organized map so that the V4 neurons could respond to more complex features of the stimulus. Figure 4d shows that the V4 neurons respond to the more complex features such as eyes, nose, and mouth.
Fig. 4. Responses of neurons in LGN, V1, and V4. The magnitude of neuronal responses in these areas is illustrated with a gray scale, in which the response magnitude is increased with increase of gray color. a) Face stimulus. The image is 90 x 90 pixel scale. b) Response of LGN neurons. c) Responses of V1 neurons tuned to four kinds of directions. d) Responses of V4 neurons. The kth neurons (k=1-3) encode the stimulus features such as eyes, nose, and mouth, respectively.
4.2
Information Processing of Visual Images in IT Cortex
Figure 5a shows the ability of IT neurons encoding eye separation and eye height to categorize test stimuli of faces. The test stimuli with varying the two features were categorized by the four ITC neurons learned by the four training stimuli shown in Fig.2a, suggesting that the ITC neurons are capable for separating test stimuli into some categories, based on similarity to the features of face parts.
Neural Mechanism for Extracting Object Features
35
Fig. 5. a) Ability of IT neurons to categorize face stimuli for two diagnostic features. The four IT neurons were made by using the four kinds of training stimuli shown in Fig. 2a, whose features are represented by four kinds of symbols (circle, square, triangle, cross). The test stimuli, represented by small symbols, are categorized by the four IT neurons. The kind of small symbols means the symbol of IT neuron that categorizes the test stimulus. The solid lines mean the boundary lines of the categories. b) Temporal variation of dynamic state of the PFC network during the categorization task. The attractors representing the diagnostic features are denoted by α ∼ δ, and the attractors representing non-diagnostic feature is denoted by . A mark on the row corresponding to α ∼ indicates that the network activity stays in the attractor. The visual stimulus of face 1 was applied to the retina at 300m ∼ 500ms.
Similarly, the IT neurons encoding nose length and mouth height separated test stimuli into other categories on the basis of the similarity of those two features. However, the classification in IT is not task-dependent; it is made on the basis of the similarity of face features.
4.3 Mechanism for Generating Working Memory Attractor in PFC
The PFC combines the information about face features with the information about the locations of the face parts to which attention is directed, and then forms memory attractors for this information. Figure 5b shows the temporal variation of the memory attractors in PFC. The information about face parts with the two diagnostic features is represented by attractors X (X = α, β, γ, δ), where X represents the information about the eye separation and eye height of the four training stimuli and the location around the eyes. The attractors X are dynamically linked in the PFC. As shown in Fig. 5b, the information about face parts with the diagnostic features is memorized as working memory α ∼ δ, because the synaptic connections between PFC and the premotor area are strengthened by a reward signal given for the choice of the correct category. On the other hand, the information about face parts with non-diagnostic features is not memorized as a stable attractor, as shown by ε in Fig. 5b, because the information about non-diagnostic features does
not lead to correct categorization behavior. Thus, the PFC can retain only the information required for the categorization task, as working memory.
5 Concluding Remarks
In the present study, we have shown that IT represents the similarity of face images based on the resolution maps of V1 and V4, and that PFC generates a working memory state in which the information about face features relevant to the categorization task is sustained. The feedback from PFC to IT and PP may play an important role in extracting the diagnostic features critical for the categorization task. The feedback from PFC increases the sensitivity of the IT and PP neurons that encode the task-relevant object feature and location, respectively. This allows the visual system to perform the categorization task rapidly and accurately. It remains to be seen how the feedback from PFC to IT and PP shapes the functional connections across the three visual areas.
An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion Jordan H. Boyle, John Bryden, and Netta Cohen School of Computing, University of Leeds, Leeds LS2 9JT, United Kingdom
Abstract. One of the most tractable organisms for the study of nervous systems is the nematode Caenorhabditis elegans, whose locomotion in particular has been the subject of a number of models. In this paper we present a first integrated neuro-mechanical model of forward locomotion. We find that a previous neural model is robust to the addition of a body with mechanical properties, and that the integrated model produces oscillations with a more realistic frequency and waveform than the neural model alone. We conclude that the body and environment are likely to be important components of the worm’s locomotion subsystem.
1 Introduction
The ultimate aim of neuroscience is to unravel and completely understand the links between animal behaviour, its neural control and the underlying molecular and genetic computation at the cellular and sub-cellular levels. This daunting challenge sets a distant goal post in the study of the vast majority of animals, but work on one animal in particular, the nematode Caenorhabditis elegans, is leading the way. This tiny worm has only 302 neurons and yet is capable of generating an impressive wealth of sensory-motor behaviours. With the first fully sequenced animal genome [1], a nearly complete wiring diagram of the nervous circuit [2], and hundreds of well characterised mutant strains, the link between genetics and behaviour never seemed more tractable. To date, a number of models have been constructed of subcircuits within the C. elegans nervous system, including sensory circuits for thermotaxis and chemotaxis [3,4], reflex control such as tap withdrawal [5], reversals (from forward to backward motion and vice versa) [6] and head swing motion [7]. Locomotion, like the overwhelming majority of known motor activity in animals, relies on the rhythmic contraction of muscles, which are controlled or regulated by neural networks. This system consists of a circuit in the head (generally postulated to initiate motion and determine direction) and an additional subcircuit along the ventral cord (responsible for propagating and sustaining undulations, and potentially generating them as well). Models of C. elegans locomotion have tended to focus on forward locomotion, and in particular, on the ability of the worm to generate and propagate undulations down its length [8,9,10,11,12]. These models have tended to study either the mechanics of locomotion [8] or the forward locomotion neural circuit [9,10,11,12]. In this paper we present simulations of an
integrated model of the neural control of forward locomotion [12] with a minimal model of muscle actuation and a mechanical model of a body, embedded in a minimal environment. The main questions we address are (i) whether the disembodied neural model is robust to the addition of a body with mechanical properties; and (ii) how the addition of mechanical properties alters the output from the motor neurons. In particular, models of the isolated neural circuit for locomotion suffer from a common limitation: the inability to reproduce undulations with frequencies that match the observed behaviour of the crawling worm. To address this question, we have limited our integrated model to a short section of the worm, rather than modelling the entire body. We find that the addition of a mechanical framework to the neural control model of Ref. [12] leads to robust oscillations, with significantly smoother waveforms and reduced oscillation frequencies, matching observations of the worm.
2 Background
2.1 C. elegans Locomotion
Forward locomotion is achieved by propagating sinusoidal undulations along the body from head to tail. When moving on a firm substrate (e.g. agarose) the worm lies on its side, with the ventral and dorsal muscles at any longitudinal level contracting in anti-phase. With the exception of the head and neck, the worm is only capable of bending in the dorso-ventral plane. Like all nematode worms, C. elegans lacks any form of rigid skeleton. Its roughly cylindrical body has a diameter of ∼ 80 μm and a length of ∼ 1 mm. It has an elastic cuticle containing (along with its intestine and gonad) pressurised fluid, which maintains the body shape while remaining flexible. This structure is referred to as a hydrostatic skeleton. The body wall muscles responsible for locomotion are anchored to the inside of the cuticle.
2.2 The Neural Model
The neural model used here is based on the work of Bryden and Cohen [11,12,13]. Specifically, we use the model (equations and parameters) presented in [12] which is itself an extension of Refs. [11,13]. The model simplifies the neuronal wiring diagram of the worm [2,14] into a minimal neural circuit for forward locomotion. This reduced model contains a set of repeating units, (one “tail” and ten “body” units) where each unit consists of one dorsal motor neuron (of class DB) and one ventral motor neuron (of class VB). A single command interneuron (representing a pair of interneurons of class AVB in the biological worm) provides the “on” signal to the forward locomotion circuit and is electrically coupled (via gap junctions) to all motor neurons of classes DB and VB. In the model, motor neurons also have sensory function, integrating inputs from stretch-receptors, or mechano-sensitive ion channels, that encode each unit’s bending angle. Motor neurons receive both local and – with the exception of the tail – proximate sensory input, with proximate input received from the adjacent posterior unit.
Fig. 1. A: Schematic diagram of the physical model illustrating nomenclature (see Appendix B for details). B: The neural model, with only two units (one body, one tail). AVB is electrically coupled to each of the motor neurons via gap junctions (resistor symbols).
The sensory-motor loop for each unit gives rise to local oscillations which phase lock with adjacent units. Equations and parameters for the neural model are set out in Appendix A. This neural-only model uses a minimal physical framework to translate neuronal output to bending. Fig. 1B shows the neural model with only two units (a tail and one body unit), as modelled in this paper. In the following section, we outline a more realistic physical model of the body of the worm.
3 Physical Model
Our physical model is an adaptation of Ref. [8], a 2-D model consisting of two rows of N points (representing the dorsal and ventral sides of the worm). Each point is acted on by the opposing forces of the elastic cuticle and pressure, as well as muscle force and drag (often loosely referred to as friction or surface friction [8]). We modify this model by introducing simplifications to reduce simulation time, in part by allowing us to use a longer time step. Fig. 1A illustrates the model's structure. The worm is represented by a number of rigid beams, each connected to the adjacent beams by four springs. Two horizontal (h) springs connect points on the same side of adjacent beams and resist both elongation and compression. Two diagonal (d) springs connect the dorsal side of the ith beam to the ventral side of the (i+1)st, and vice versa. These springs strongly resist compression and have an effect analogous to that of pressure, in that they help to maintain a reasonably constant area in each unit. The model was implemented in C++, using a 4th-order Runge-Kutta method for numerical integration, with a time step of 0.1 ms (the original model [8] required a time step of 0.001 ms with the same integration method). Equations and parameters
of the physical model are given in Appendix B. The steps taken to interface the physical and neuronal models are described in Appendix C.
4 Results
Using our integrated model we first simulated a single unit (the tail), and then implemented two phase-lagged units (adding a body unit). In what follows, we present these results, as compared to those of the neural model alone.
4.1 Single Oscillating Segment
The neural model alone produces robust oscillations in unit bending angle (θi ) with a roughly square waveform, as shown in Fig. 2A. The model unit oscillates at about 3.5 Hz, as compared to frequencies of about 0.5 Hz observed for C. elegans forward locomotion on an agarose substrate. It has not been possible to find parameters within reasonable electrophysiological bounds for the neural model that would slow the oscillations to the desired time scales [12]. Oscillations of the integrated neuro-mechanical model of a single unit are shown in Fig. 2B. All but four parameters of the neuronal model remain unchanged from Ref. [12]. However, parameters used for the actuation step caused a slight asymmetry in the oscillations when integrated with a physical model, and were therefore modified. As can be seen from the traces in the figure, the frequency of oscillation in the integrated model is about 0.5 Hz for typical agarose drag [8], and the waveform has a smooth, almost sinusoidal shape. Faster (and slower) oscillations are possible for lower (higher) values of drag. Fig. 2C shows a plot of oscillation frequencies as a function of drag for the integrated model.
Fig. 2. Oscillations of, A, the original neural model [12] and, B, the integrated model (with drag of 80 × 10−6 kg.s−1 ). Note the different time scales. C: Oscillation frequency as a function of drag. The zero frequency point indicates that the unit can no longer oscillate.
4.2 Two Phase-Lagged Segments
Parameters of the neural model are given in Table A-1 for the tail unit and in Table A-2 for the body unit. Fig. 3 compares bending waveforms recorded from a living worm (Fig. 3A), simulated by the neural model (Fig. 3B) and simulated by the integrated model (Fig. 3C).
Fig. 3. Phase-lagged oscillation of two units. A: Bending angles extracted from a recording of a forward-locomoting worm on an agarose substrate. The traces are of two points along the worm (near the middle and 1/12 of a body length apart). B: Simulation of two coupled units in the neural model. C: Simulation of the integrated model. Take note of the faster oscillations in subplot B.
5 Discussion
C. elegans is amenable to manipulations at the genetic, molecular and neuronal levels but with such rich behaviour being produced by a system with so few components, it can often be difficult to determine the pathways of cause and effect. Mathematical and simulation models of the locomotion therefore provide an essential contribution to the understanding of C. elegans neurobiology and motor control. The inclusion of a realistic embodiment is particularly relevant to a model of C. elegans locomotion. Sensory feedback is important to the locomotion of all animals. However, in C. elegans, the postulated existence of stretch receptor inputs along the body (unpublished communication, L. Eberly and R. Russel, reported in [2]) would provide direct information about body posture to the motor neurons themselves. Thus, the neural control is likely to be tightly coupled to the shape the worm takes as it locomotes. Modelling the body physics is therefore particularly important in this organism. Here we have presented the first steps in the implementation of such an integrated model, using biologically plausible parameters for both the neural and mechanical components. One interesting effect is the smoothing of the waveform from a square-like waveform in the isolated neural model to a nearly sinusoidal waveform in the integrated model. The smoothing can be attributed to the body’s resistance to bending (modelled as a set of springs), which increases with the bending angle.
By contrast, in the original neural model, the rate of bending depends only on the neural output. The work presented here would naturally lead to an integrated neuromechanical model of locomotion for an entire worm. The next step toward this goal, extending the neural circuit to the entire ventral cord (and the corresponding motor system) is currently underway. The physical model introduces long range interactions between units via the body and environment. In a real worm, as in the physical model, for bending to occur at some point along the worm, local muscles must contract. However, such contractions also apply physical forces to adjacent units, and so on up and down the worm, giving rise to a significant persistence length. For this reason the extension of the neuro-mechanical model from two to three (or more) units will not be automatic and will require parameter changes to model an operable balance between the effects of the muscle and body properties. In fact, the worm’s physical properties (and, in particular, the existence of long range physical interactions along it) could set new constraints on the neural model, or could even be exploited by the worm to achieve more effective locomotion. Either way, the physics of the worm’s locomotion is likely to offer important insights that could not be gleaned from a model of the isolated neural subcircuit. We have shown that a neural model developed with only the most rudimentary physical framework can continue to function with a more realistic embodiment. Indeed, both the waveform and frequency have been improved beyond what was possible for the isolated neural model. We conclude that the body and environment are likely to be important components of the subsystem that generates locomotion in the worm.
Acknowledgement This work was funded by the EPSRC, grant EP/C011961. NC was funded by the EPSRC, grant EP/C011953. Thanks to Stefano Berri for movies of worms and behavioural data.
References
1. C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282, 2012–2018 (1998)
2. White, J.G., Southgate, E., Thomson, J.N., Brenner, S.: The structure of the nervous system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society of London, Series B 314, 1–340 (1986)
3. Ferrée, T.C., Marcotte, B.A., Lockery, S.R.: Neural network models of chemotaxis in the nematode Caenorhabditis elegans. Advances in Neural Information Processing Systems 9, 55–61 (1997)
4. Ferrée, T.C., Lockery, S.R.: Chemotaxis control by linear recurrent networks. Journal of Computational Neuroscience: Trends in Research, 373–377 (1998)
5. Wicks, S.R., Roehrig, C.J., Rankin, C.H.: A Dynamic Network Simulation of the Nematode Tap Withdrawal Circuit: Predictions Concerning Synaptic Function Using Behavioral Criteria. Journal of Neuroscience 16, 4017–4031 (1996)
6. Tsalik, E.L., Hobert, O.: Functional mapping of neurons that control locomotory behavior in Caenorhabditis elegans. Journal of Neurobiology 56, 178–197 (2003)
7. Sakata, K., Shingai, R.: Neural network model to generate head swing in locomotion of Caenorhabditis elegans. Network: Computation in Neural Systems 15, 199–216 (2004)
8. Niebur, E., Erdős, P.: Theory of the locomotion of nematodes. Biophysical Journal 60, 1132–1146 (1991)
9. Niebur, E., Erdős, P.: Theory of the Locomotion of Nematodes: Control of the Somatic Motor Neurons by Interneurons. Mathematical Biosciences 118, 51–82 (1993)
10. Niebur, E., Erdős, P.: Modeling Locomotion and Its Neural Control in Nematodes. Comments on Theoretical Biology 3(2), 109–139 (1993)
11. Bryden, J.A., Cohen, N.: A simulation model of the locomotion controllers for the nematode Caenorhabditis elegans. In: Schaal, S., Ijspeert, A.J., Billard, A., Vijayakumar, S., Hallam, J., Meyer, J.A. (eds.) Proceedings of the Eighth International Conference on the Simulation of Adaptive Behavior, pp. 183–192. MIT Press/Bradford Books (2004)
12. Bryden, J.A., Cohen, N.: Neural control of C. elegans forward locomotion: The role of sensory feedback (submitted 2007)
13. Bryden, J.A.: A simulation model of the locomotion controllers for the nematode Caenorhabditis elegans. Master's thesis, University of Leeds (2003)
14. Chen, B.L., Hall, D.H., Chklovskii, D.B.: Wiring optimization can relate neuronal structure and function. Proceedings of the National Academy of Sciences USA 103, 4723–4728 (2006)
Appendix A: Neural Model
Neurons are assumed to have graded potentials [11,12,13]. In particular, motor neurons (VB and DB) are modelled by leaky integrators with a transmembrane potential V(t) following
C dV/dt = −G(V − E_rev) − I^shape + I^AVB ,   (A-1)
where C is the cell's membrane capacitance, E_rev is the cell's effective reversal potential, and G is the total effective membrane conductance. The sensory input I^shape = Σ_{j=1}^{n} (V − E_j^stretch) G_j^stretch σ_j^stretch(θ_j) is the stretch-receptor input from the shape of the body, where E_j^stretch is the reversal potential of the ion channels, θ_j is the bending angle of unit j, and σ_j^stretch is a sigmoid response function of the stretch receptors to the local bending. The stretch-receptor activation function is given by σ^stretch(θ) = 1/[1 + exp(−(θ − θ_0)/δθ)], where the steepness parameter δθ and the threshold θ_0 are constants. The command input current I^AVB = G^AVB (V_AVB − V) models gap-junctional coupling with AVB (with coupling strength G^AVB and AVB voltage V_AVB). Note that in the model, AVB is assumed to have a sufficiently high capacitance, so that the gap-junctional currents have a negligible effect on its membrane potential.
Segment bending in this model is given as a summation of an output function from each of the two neurons:
dθ/dt = σ^out_VB(V) − σ^out_DB(V) ,   (A-2)
where σ^out(V) = ω_max/[1 + exp(−(V − V_0)/δV)] with constants ω_max, δV and V_0. Note that dorsal and ventral muscles contribute to bending in opposite directions (with θ and −θ denoting ventral and dorsal bending, respectively).
Table A-1. Parameters for a self-oscillating tail unit (as in Ref. [12]):
E_rev = −60 mV; V_AVB = −30.7 mV; C = 5 pF;
G_VB = 19.07 pS; G_DB = 17.58 pS; G^AVB_VB = 35.37 pS; G^AVB_DB = 13.78 pS;
G^stretch_VB = 98.55 pS; G^stretch_DB = 67.55 pS; E^stretch = 60 mV;
θ_0,VB = −18.68°; θ_0,DB = −19.46°; δθ_VB = 0.1373°; δθ_DB = 0.4186°;
ω_max,VB = 6987°/sec; ω_max,DB = 9951°/sec; V_0,VB = 22.8 mV; V_0,DB = 25.0 mV;
δV_VB = 0.2888 mV/sec; δV_DB = 0.0826 mV/sec
Table A-2. Parameters for body units and tail-body interactions as in Ref. [12]. All body-unit parameters that are not included here are the same as for the tail unit:
G_VB = 26.09 pS; G_DB = 25.76 pS; G^stretch_VB = 16.77 pS; G^stretch_DB = 18.24 pS;
E^stretch = 60 mV; θ_0,VB = −19.14°; θ_0,DB = −13.26°; δθ_VB = 1.589°/sec; δθ_DB = 1.413°/sec
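As a cross-check on how Eqs. (A-1) and (A-2) fit together, the sketch below integrates a single self-oscillating tail unit with forward Euler, using the Table A-1 values as reconstructed above. The initial conditions, the 0.1 ms step, and the reading of δV in millivolts are assumptions; the published simulations may have used different numerical details.

```python
import numpy as np

# Table A-1 parameters (tail unit), SI units except angles in degrees.
C = 5e-12
E_rev, V_AVB, E_str = -60e-3, -30.7e-3, 60e-3
G     = {"VB": 19.07e-12, "DB": 17.58e-12}
G_AVB = {"VB": 35.37e-12, "DB": 13.78e-12}
G_str = {"VB": 98.55e-12, "DB": 67.55e-12}
theta0 = {"VB": -18.68, "DB": -19.46}
dtheta = {"VB": 0.1373, "DB": 0.4186}
w_max  = {"VB": 6987.0, "DB": 9951.0}          # deg/s
V0     = {"VB": 22.8e-3, "DB": 25.0e-3}
dV     = {"VB": 0.2888e-3, "DB": 0.0826e-3}    # 0.2888 mV and 0.0826 mV, expressed in volts

def sig_stretch(theta, k):
    return 1.0 / (1.0 + np.exp(-(theta - theta0[k]) / dtheta[k]))

def sig_out(v, k):
    return w_max[k] / (1.0 + np.exp(-(v - V0[k]) / dV[k]))

def simulate_tail_unit(T=2.0, dt=1e-4):
    V = {"VB": E_rev, "DB": E_rev}              # assumed initial potentials
    theta, trace = 0.0, []
    for _ in range(int(T / dt)):
        for k in ("VB", "DB"):
            I_shape = (V[k] - E_str) * G_str[k] * sig_stretch(theta, k)  # local term only (tail unit)
            I_AVB = G_AVB[k] * (V_AVB - V[k])
            V[k] += dt * (-G[k] * (V[k] - E_rev) - I_shape + I_AVB) / C  # eq. (A-1)
        theta += dt * (sig_out(V["VB"], "VB") - sig_out(V["DB"], "DB"))  # eq. (A-2)
        trace.append(theta)
    return np.array(trace)

bending_angle = simulate_tail_unit()
```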
Appendix B: Physical Model
The physical model consists of N rigid beams which form the boundaries between the N − 1 units. The ith beam can be described in one of two ways: either by the (x, y) coordinates of the centre of mass (CoM_i in Fig. 1) and angle φ_i, or by the (x, y) coordinates of its two end points (P_i^D and P_i^V in Fig. 1). Each formulation has its own advantages and is used where appropriate.
B.1 Spring Forces
The rigid beams are connected to each of their neighbours by two horizontal (h) springs and two diagonal (d) springs, directed along the vectors
Δ^h_k,i = P^k_{i+1} − P^k_i for k = D, V
Δ^d_m,i = P^k_{i+1} − P^l_i for k = D, V; l = V, D; and m = 1, 2 ,   (B-1)
for i = 1 : N − 1, where P_i^k = (x_i^k, y_i^k) are the coordinates of the ends of the ith beam. The spring forces F_(s) depend on the lengths of these vectors, Δ^j_k,i = |Δ^j_k,i|, and are collinear with them. The magnitudes of the horizontal and diagonal spring forces are the piecewise linear functions
F^h_(s)(Δ) = κ^h_S2(Δ − L^h_2) + κ^h_S1(L^h_2 − L^h_0)  if Δ > L^h_2 ;
  κ^h_S1(Δ − L^h_0)  if L^h_2 > Δ > L^h_0 ;
  κ^h_C2(Δ − L^h_1) + κ^h_C1(L^h_1 − L^h_0)  if Δ < L^h_1 ;
  κ^h_C1(Δ − L^h_0)  otherwise ,   (B-2)
F^d_(s)(Δ) = κ^d_C2(Δ − L^d_1) + κ^d_C1(L^d_1 − L^d_0)  if Δ < L^d_1 ;
  κ^d_C1(Δ − L^d_0)  if L^d_1 < Δ < L^d_0 ;
  0  otherwise ,   (B-3)
where the spring (κ) and length (L) constants are given in Table B-1.
Table B-1. Parameters of the physical model. Note that the values for θ_0 and θ'_0 differ from Ref. [12] and Table A-1:
D = 80 μm; L^h_0 = 50 μm; L^h_1 = 0.5 L^h_0; L^h_2 = 1.5 L^h_1; L^d_0 = sqrt((L^h_0)^2 + D^2); L^d_1 = 0.95 L^d_0;
κ^h_S1 = 20 μN·m^−1; κ^h_S2 = 10 κ^h_S1; κ^h_C1 = 0.5 κ^h_S1; κ^h_C2 = 10 κ^h_C1; κ^d_C1 = 50 κ^h_S1; κ^d_C2 = 10 κ^d_C1;
f_muscle = 0.005 L^h_0 κ^h_C1; c_∥ = c_⊥ = 80 × 10^−6 kg·s^−1;
θ_0,VB = −29.68°; θ_0,DB = −8.46°; θ'_0,VB = −22.14°; θ'_0,DB = −10.26°
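A direct transcription of the piecewise spring laws (B-2)-(B-3) is given below, with the constants passed in explicitly. The function bodies follow the reconstruction above; in the usage example, L^h_2 is taken as 1.5 L^h_0 so that the stretch branch lies above the rest length (the printed value reads 1.5 L^h_1, which appears to be an extraction ambiguity), and that choice is an assumption.

```python
import numpy as np

def spring_force_h(delta, L0, L1, L2, kS1, kS2, kC1, kC2):
    """Horizontal spring force magnitude, piecewise linear as in eq. (B-2)."""
    if delta > L2:                          # strong resistance to large stretch
        return kS2 * (delta - L2) + kS1 * (L2 - L0)
    if delta > L0:                          # gentle resistance to moderate stretch
        return kS1 * (delta - L0)
    if delta < L1:                          # strong resistance to large compression
        return kC2 * (delta - L1) + kC1 * (L1 - L0)
    return kC1 * (delta - L0)               # gentle resistance otherwise

def spring_force_d(delta, L0, L1, kC1, kC2):
    """Diagonal spring force magnitude, eq. (B-3): resists compression only."""
    if delta < L1:
        return kC2 * (delta - L1) + kC1 * (L1 - L0)
    if delta < L0:
        return kC1 * (delta - L0)
    return 0.0

# Example with Table B-1 constants (SI units).
D, L_h0 = 80e-6, 50e-6
L_h1, L_h2 = 0.5 * L_h0, 1.5 * L_h0          # L_h2: assumed, see note above
k_hS1 = 20e-6                                # 20 uN/m
f_h = spring_force_h(60e-6, L_h0, L_h1, L_h2, k_hS1, 10 * k_hS1, 0.5 * k_hS1, 5 * k_hS1)
L_d0 = np.hypot(L_h0, D)
k_dC1 = 50 * k_hS1
f_d = spring_force_d(0.9 * L_d0, L_d0, 0.95 * L_d0, k_dC1, 10 * k_dC1)
```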
B.2 Muscle Forces
Muscle forces F_(m) are directed along the horizontal vectors Δ^h_k,i, with magnitude
F_(m)k,i = f_muscle A_k,i for k = D, V and i = 1 : N − 1 ,   (B-4)
where f_muscle is a constant (see Table B-1) and A_k,i are scalar activation functions for the dorsal and ventral muscles, determined by
(A_D,i, A_V,i) = (θ_i(t), 0) if θ_i(t) ≥ 0 ;  (0, −θ_i(t)) if θ_i(t) < 0 ,   (B-5)
where θ_i(t) = ∫_0^t (dθ_i/dt) dt is the integral over the output of the neural model.
B.3 Total Point Force
With the exception of points on the outer beams, each point i is subject to forces F D,i and F V,i , given by differences of the spring and muscle forces from the corresponding units (i and i − 1):
F_D,i = (F^h_(s)D,i − F^h_(s)D,i−1) + (F^d_(s)1,i − F^d_(s)2,i−1) + (F_(m)D,i − F_(m)D,i−1)
F_V,i = (F^h_(s)V,i − F^h_(s)V,i−1) + (F^d_(s)2,i − F^d_(s)1,i−1) + (F_(m)V,i − F_(m)V,i−1) .   (B-6)
Since the first beam has no anterior body parts, and the last beam has no posterior body parts, all terms with i = 0 or i = N are taken as zero.
B.4 Equations of Motion
Motion of the beams is calculated from the total force acting on each of the 2N points. Since the points P_i^D and P_i^V are connected by a rigid beam, it is convenient to convert F_(t)k,i to a force and a torque acting on the beam's centre of mass. Rotation by φ_i converts the coordinate system of F_(t)k,i = (F^x_(t)k,i, F^y_(t)k,i) to a new system F_(t)k,i = (F^⊥_(t)k,i, F^∥_(t)k,i), with axes perpendicular to (⊥) and parallel with (∥) the beam:
F^⊥_(t)k,i = F^x_(t)k,i cos(φ_i) + F^y_(t)k,i sin(φ_i)
F^∥_(t)k,i = F^y_(t)k,i cos(φ_i) − F^x_(t)k,i sin(φ_i) .   (B-7)
The parallel components are summed and applied to CoM_i, resulting in pure translation. The perpendicular components are separated into odd and even parts (giving rise to a torque and a force, respectively) by
F^⊥,even_i = (F^⊥_(t)D,i + F^⊥_(t)V,i)/2
F^⊥,odd_i = (F^⊥_(t)D,i − F^⊥_(t)V,i)/2 .   (B-8)
As in Ref. [8] we disregard inertia, but include Stokes' drag. Also following Ref. [8], we allow for different drag constants in the parallel and perpendicular directions, given by c_∥ and c_⊥ respectively. The motion of CoM_i is therefore
V^∥_(CoM),i = (1/c_∥)(F^∥_(t)D,i + F^∥_(t)V,i)
V^⊥_(CoM),i = (1/c_⊥)(2 F^⊥,even_i)
ω_(CoM),i = (1/(r c_⊥))(2 F^⊥,odd_i) ,   (B-9)
x ⊥ V(CoM),i = V(CoM),i cos(φi ) − V(CoM),i sin(φi )
y ⊥ V(CoM),i = V(CoM),i cos(φi ) + V(CoM),i sin(φi ) .
(B-10)
An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion
47
Appendix C: Integrating the Neural and Physical Model In the neural model, the output dθi (t)/dt specifies the bending angles θi (t) for each unit. In the integrated model, θi (t) are taken as the input to the muscles. Muscle ouputs (or contraction) are given by unit lengths. The bending angle αi is then estimated from the dorsal and ventral unit lengths by αi = 36.2
h |Δh D,i | − |ΔV,i |
Lh0
,
(C-1)
where Lh0 is the resting unit length. (For simplicity, we have denoted the bending angles of both the neural and integrated models by θ in the Figures).
Applying the String Method to Extract Bursting Information from Microelectrode Recordings in Subthalamic Nucleus and Substantia Nigra Pei-Kuang Chao1, Hsiao-Lung Chan1,4, Tony Wu2,4, Ming-An Lin1, and Shih-Tseng Lee3,4 1
Department of Electrical Engineering, Chang Gung University 2 Department of Neurology, Chang Gung Memorial Hospital 3 Department of Neurosurgery, Chang Gung Memorial Hospital 4 Center of Medical Augmented Virtual Reality, Chang Gung Memorial Hospital, 259 Wen-Hua First Road, Gui-Shan, 333, Taoyuan, Taiwan
[email protected] Abstract. This paper proposes that bursting characteristics can be effective parameters in classifying and identifying neural activities from subthalamic nucleus (STN) and substantia nigra (SNr). The string method was performed to quantify bursting patterns in microelectrode recordings into indexes. Interspike-interval (ISI) was used as one of the independent variables to examine effectiveness and consistency of the method. The results show consistent findings about bursting patterns in STN and SNr data across all ISI constraints. Neurons in STN tend to release a larger number of bursts with fewer spikes in the bursts. Neurons in SNr produce a smaller number of bursts with more spikes in the bursts. According to our statistical evaluation, 50 and 80 ms are suggested as the optimal ISI constraint to classify STN and SNr’s bursting patterns by the string method. Keywords: Subthalamic nucleus, substantia nigra, inter-spike-interval, burst, microelectrode.
1 Introduction
Subthalamic nucleus (STN) is frequently the target used to study and to treat Parkinson's disease [1, 2]. Placing a microelectrode to record neural activities in deep brain nuclei provides useful information for localization during deep brain stimulation (DBS) neurosurgery. DBS has been approved by the FDA since 1998 [3]. The surgery implants a stimulator in deep brain nuclei, usually the STN, to alleviate Parkinson's symptoms such as tremor and rigidity. To search for the STN during the operation, a microelectrode probe is often used to acquire neural signals from outer areas down to the specific target. With the assistance of imaging techniques, microelectrode signals from different depths are read and recorded. Then, an important step in determining the STN location is to distinguish signals of the STN from its nearby areas, e.g. the substantia nigra (SNr), which is slightly ventral and medial to the STN. Therefore, characterizing and quantifying the firing patterns of STN and
SNr are essential. Firing rate, defined as the number of neural spikes within a period, is the most common variable used for describing neural activities. However, STN and SNr have broad and largely overlapping ranges of firing rate [4] (although SNr has a slightly higher mean firing rate than STN). This makes it difficult to rely on firing rate to target the STN. Bursting patterns may provide a better way to separate signals from different nuclei. Bursting, defined as clusters of high-frequency spikes released by a neuron, is believed to store important neural information. To establish long-term responses, central synapses usually require groups of action potentials (bursts) [5,6]. Exploring bursting information in neural activities has recently become fundamental in Parkinson's studies [1,2]. It has also been observed that spike trains are more regular in SNr than in STN signals [7]. However, the regularity of firing or the grouping of spikes in STN and SNr potentials has not been investigated thoroughly. This study aims to extract bursting information from STN and SNr. A quantifying method for bursting, the string method, is applied. The string method quantifies bursting information relying on the inter-spike interval (ISI) and spike number [8]. Although other methods for quantifying bursts exist [9,10], the string method is the one that can provide information about which spikes contribute to the detected bursts. In addition, because various ISIs have been used in research [8,9] to define bursts, this study also evaluates the effect of ISI constraints on discriminating STN and SNr signals.
2 Method
The neuronal data used in this study were acquired during DBS neurosurgery at Chang Gung Memorial Hospital. With the assistance of imaging localization systems [11], trials (10 s each) of microelectrode recordings were collected at a sampling
Fig. 1. MRI images from one patient: a. In the sagittal plane – the elevation angle of the probe (yellow line) from the inter-commissural line (green line) was around 50 to 75°; b. In the frontal plane – the angle of the probe (yellow lines) from the midline (green line) was about 8 to 18° to right or left.
rate of 24,000 Hz. Based on several observations, e.g. magnetic resonance imaging (MRI), computed tomography (CT), motion/perception-related responses, and probe location according to a stereotactic system, experienced neurologists diagnosed 18 trials as neural signals from the STN, and the other 23 trials as from the SNr. Trials which were collected outside the STN and SNr and/or could not be unambiguously assigned to the STN or SNr were excluded. In this paper, the data are from 3 Parkinson's patients (2 females, 1 male, age = 73.3±8.3 y/o) who received DBS treatment. Due to individual differences among patients, e.g. in head size, the depth of the STN from the scalp was found to vary between 15 and 20 cm. During surgery, the elevation angle of the probe from the inter-commissural line was around 50 to 75° (Fig. 1a) and the angle between the probe and the midline was about 8 to 18° toward either right or left (Fig. 1b).
2.1 Spike Detection
Each trial of microelectrode recordings includes two types of signals: spikes and background signals. The background signals are interference from nearby neural areas or the environment. Because the background signals can be modeled as a Gaussian distribution, signals which are 3 standard deviations (SD) above or below the mean can be treated as non-background signals, or spikes. Therefore, a threshold at the level of the mean plus 3 SD is applied in this study to detect spikes (Fig. 2).
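A minimal version of this thresholding step is sketched below. The blanking window that keeps a single spike from being counted once per sample is our assumption; the paper does not state how consecutive supra-threshold samples were grouped.

```python
import numpy as np

def detect_spikes(signal, fs=24000, blank_ms=1.0):
    """Return spike times (s) where the signal exceeds mean + 3*SD.
    Supra-threshold samples closer together than blank_ms are merged."""
    threshold = signal.mean() + 3.0 * signal.std()
    above = np.flatnonzero(signal > threshold)
    blank = int(blank_ms * 1e-3 * fs)
    spike_times, last = [], -blank - 1
    for idx in above:
        if idx - last > blank:
            spike_times.append(idx / fs)
        last = idx
    return np.array(spike_times), threshold
```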
Fig. 2. A segment of microelectrode recording – the red horizontal line is threshold; the green stars indicate the found spikes; most signals around the baseline are background signals
2.2 The String Method
Every detected spike was plotted as a circle in a figure of spike sequential number versus spike occurrence time (Fig. 3). The spike sequential number starts at 1 for the first spike in a trial. Spikes which are close to each other are labeled as strings [8] and defined as bursts. Two parameters were controlled to determine bursts: (1) the minimum number of spikes to form a burst was 5; (2) the maximum ISI between adjacent spikes in a burst was set to 20 ms, 50 ms, 80 ms, and 110 ms separately, to find an optimal condition for distinguishing STN and SNr bursting patterns.
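The string criterion itself (adjacent inter-spike intervals all below a chosen maximum, and at least five spikes per string) reduces to a short scan over the ISIs. The sketch below also derives the FR, NB, and SB measures defined in Sect. 2.3; the variable names and the toy data are ours, not the authors'.

```python
import numpy as np

def find_bursts(spike_times, max_isi=0.050, min_spikes=5):
    """Group spikes into 'strings' (bursts): runs whose adjacent inter-spike
    intervals are all <= max_isi, keeping only runs of >= min_spikes spikes."""
    bursts, current = [], [0]
    for i in range(1, len(spike_times)):
        if spike_times[i] - spike_times[i - 1] <= max_isi:
            current.append(i)
        else:
            if len(current) >= min_spikes:
                bursts.append(current)
            current = [i]
    if len(spike_times) > 0 and len(current) >= min_spikes:
        bursts.append(current)
    return bursts

def burst_measures(spike_times, duration=10.0, **kwargs):
    bursts = find_bursts(spike_times, **kwargs)
    FR = len(spike_times) / duration                                  # spikes per second
    NB = len(bursts)                                                  # number of bursts
    SB = float(np.mean([len(b) for b in bursts])) if bursts else 0.0  # mean spikes per burst
    return FR, NB, SB

# Example: evaluate one trial at each of the four ISI constraints used in the paper.
spikes = np.sort(np.random.default_rng(0).uniform(0.0, 10.0, 500))
for isi in (0.020, 0.050, 0.080, 0.110):
    print(isi, burst_measures(spikes, max_isi=isi))
```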
Fig. 3. A segment of a strings plot – each blue circle is a spike; the red triangles mark the starting spikes of bursts; the black triangles mark the ending spikes of bursts
2.3 Dependent Variables
Three dependent variables were computed: (1) Firing rate (FR) was calculated as the total spike number divided by the trial duration (10 s). (2) The number of bursts (NB) was determined by the string method as the total burst count in a trial. (3) The average number of spikes in each burst (SB) was also computed for every trial.
2.4 Statistical Analysis
An independent-samples t-test was applied to test the firing rate difference between STN and SNr signals. MANOVA was performed to evaluate NB and SB separately among the different ISI constraints (α = .05).
3 Results
The signals from STN and SNr showed similar firing rates but different bursting patterns. There is no significant difference between STN and SNr in firing rate (STN: 57.0±22.1; SNr: 68.8±23.5) (p>.05). The results for NB and SB are listed in Table 1 and Table 2. In NB, SNr has significantly fewer bursts than STN when the ISI setting is 50 ms and 80 ms (p

1.0, 65/104 cells), and 28% of neurons were more responsive to BOS than to OREV (d′(RS_BOS/RS_OREV) > 1.0, 29/104 cells). These results are consistent with past studies [4]. The mean of d′(RS_BOS/RS_REV) was 1.25, and the mean of d′(RS_BOS/RS_OREV) was 0.63. These results indicate that BOS-selective neurons in HVC are highly variable, especially in terms of their sequential response properties.
Fig. 1. An example of the auditory response to song elements (A) and element pairs (B) in a single unit from HVC
Fig. 2. Song transition matrices of self-generated songs (first row) and sequential response distribution matrices of each single unit (second to seventh rows)
3.2 Responses to Song Element Pair Stimuli
To investigate the neural selectivity to song element sequences, we recorded neural responses to all possible element pair stimuli. Because the playback of these stimuli is extremely time-consuming, we could only maintain 34% of the recorded single units stable throughout the entire presentation (35/104 cells, 12/23 birds). In total, 70% of the stable single units were BOS-selective (d′(RS_BOS/RS_REV) > 1.0, 27/35 cells, 12/12 birds). Thereafter, we focused on these data. A typical example of neural responses to each song element is shown in Fig. 1(A). The neuron responded to a single element A or C with single phasic activity, but it did not respond to element B. It responded to element D with double phasic activity. These results indicate that the neuron has various response properties even during single element presentation. In addition, the neuron exhibited more complex response properties during the presentation of element pairs (Fig. 1(B)). The neuron responded more strongly to most of the element pairs when the second element was A or C, compared to the single presentation of each element. However, the response was weaker when the first and second elements were the same. When the second element was B, no differences were observed between single and paired stimuli. When the second element was D, we measured single
phasic responses, and a strong response to BD. These response properties were not correlated with the element-to-element transition probabilities in the song structure. The dotted boxes indicate the sequences included in the BOS. However, the neuron responded only weakly to some sequences that were included in the BOS (black arrows). In contrast, the neuron responded strongly to other sequences that were not included in the BOS (white arrows). Thus, the neuron had broad response properties to song element pairs beyond the structure of the self-generated song. To quantitatively evaluate sequential response properties, we calculated the response strength measure d′(RS_S/RS_Baseline) for each element pair stimulus S. The sequential response distributions were created for each neuron in two individuals with more than five well-identified single units. Song transition matrices and sequential response distributions are shown in Fig. 2. The response distributions were not correlated with the associated song transition matrices. However, each HVC neuron in the same individual had broad but different response distribution properties. This tendency was consistent among individuals. This result indicates that the song element sequence is encoded at the population level, by broadly but differentially selective HVC neurons.
3.3 Population Dynamics Analysis
To analyze the information coding of song element sequences at the population level, we calculated the time course of population activity vectors, where each vector is the set of instantaneous mean firing rates of the neurons in a 50 ms time window. Snapshots of population responses to stimuli are shown in the eight panels of Fig. 3A (n = 6, bird 2 of Fig. 2). Each point in a panel represents the population vector for one stimulus in the MDS space. The ellipses in the upper four panels indicate the groups of vectors whose stimuli have the same first element, while the ellipses in the lower four panels indicate the groups of vectors whose stimuli have the same second element. Note that the population activity vectors in the upper four panels are identical to those in the bottom four panels; only the ellipses differ. Before the stimulus presentation ([-155 ms: -105 ms], upper and lower panels), only spontaneous activity was observed around the origin. After the first element presentation ([50 ms: 100 ms], upper panel), the groups with the same first elements split apart. After the second element presentation ([131 ms: 181 ms], lower panel), the groups with the same second elements were still largely overlapping. In the next section, we show that confounded information, which represents the relation between the first and second elements, increased significantly at this time. After sufficient time ([480 ms: 530 ms], upper and lower panels), the neurons returned to spontaneous activity. These results indicate that the population responses to the first and second elements are drastically different. Subsequently, we show that this overlap is derived from the information in the song element sequence.
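For readers unfamiliar with the projection used in Fig. 3A, the following sketch builds the population vectors (one vector of mean firing rates per stimulus, computed in a 50 ms window) and embeds them with classical multidimensional scaling, in the spirit of Refs. [10,11]. The array shapes, the Poisson toy data, and the choice of a two-dimensional embedding are assumptions, not taken from the paper.

```python
import numpy as np

def population_vectors(spike_counts, window_s=0.05):
    """spike_counts: (n_stimuli, n_neurons) counts in one 50 ms window;
    returns the mean firing rate (Hz) of each neuron for each stimulus."""
    return spike_counts / window_s

def classical_mds(X, n_dims=2):
    """Classical (Torgerson) MDS on Euclidean distances between the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ d2 @ J                                       # double centring
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_dims]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Toy usage: 16 element-pair stimuli recorded from 6 neurons.
counts = np.random.default_rng(0).poisson(3.0, size=(16, 6))
embedding = classical_mds(population_vectors(counts))   # (16, 2) points as in Fig. 3A
```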
3.4 Information-Theoretic Analysis
To determine the origin of the overlap in the population response, we calculated the time course of mutual information between the stimulus and neural activity.
Fig. 3. Responses of HVC neurons at the population level (A) and encoded information (B)
The mutual information for first elements I(S1 ; R), the second elements I(S2 ; R), and that of element pairs I(S1 , S2 ; R) was calculated within each time window; the window was shifted to analyze the temporal dynamics of information coding (left upper 3 graphs in Fig. 3B). Narrow lines in each graph indicate the cumulative trace of mutual information in each neuron. The thick line is the cumulative trace of all neurons in the individual. The bottom-left graph in Fig. 3B shows the probability of stimulus presentation. After the presentation of the first elements, mutual information for the first elements increased, showing a statistically significant peak (P < 0.001). After the presentation of the second elements, mutual information for the second elements significantly increased (P < 0.001). At the
same time, mutual information for element pairs also showed a significant peak (P < 0.001). Intuitively, the information for element pairs I(S1, S2; R) would consist of the information for the first elements I(S1; R) and the second elements I(S2; R). However, computing I(S1, S2; R) − I(S1; R) − I(S2; R) in each time window yields a statistically significant peak after the presentation of element pairs (P < 0.001; fourth graph from the left in Fig. 3B). This difference C represents the conditional mutual information between the first and second elements given the neural response, otherwise known as confounded information [13]. Therefore, confounded information represents the relationship between the first and second elements encoded in the neural responses. The I(S1; R) peak occurred at the same time that the groups of population vectors with the same first elements were splitting ([50 ms: 100 ms]). The peaks for I(S2; R), I(S1, S2; R), and C occurred during the same period in which the groups with the same second elements were still largely overlapping ([131 ms: 181 ms]). This indicates that the sequential information causes the overlap in the population response. In the population dynamics analysis, we cannot combine the data from different birds because each bird has a different number and different types of song elements. However, in the mutual information analysis, we can combine and average the data from different birds. The five graphs on the right in Fig. 3B show the time courses of I(S1; R), I(S2; R), I(S1, S2; R), C, and the stimulus presentation probability, calculated from all stable single units with BOS selectivity (n = 27, 12 birds). The combined mutual information for the first elements was very similar to that from one bird, showing a significant peak after the presentation of the first elements (P < 0.001). Mutual information for the second elements, for element pairs, and the confounded information also had significant peaks after the presentation of the second elements (P < 0.001). These results show that the song element sequence is encoded in a neural ensemble in HVC by population coding.
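Written out explicitly, the decomposition above uses plug-in mutual-information estimates over discretized responses, with the confounded term C = I(S1, S2; R) − I(S1; R) − I(S2; R) (cf. [13]). The sketch below omits the bias corrections and significance tests that such an analysis requires in practice; the binning of responses and the toy data are assumptions.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete label sequences."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    for i, j in zip(xi, yi):
        joint[i, j] += 1.0
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def confounded_information(s1, s2, r):
    """C = I(S1,S2;R) - I(S1;R) - I(S2;R) for one time window."""
    pair = [f"{a}|{b}" for a, b in zip(s1, s2)]       # joint first/second-element label
    return (mutual_information(pair, r)
            - mutual_information(s1, r)
            - mutual_information(s2, r))

# Toy usage: 200 trials, 4 first x 4 second elements, responses binned into 8 levels.
rng = np.random.default_rng(0)
s1, s2 = rng.integers(0, 4, 200), rng.integers(0, 4, 200)
r = np.clip(s1 + s2 + rng.integers(0, 2, 200), 0, 7)
print(confounded_information(s1, s2, r))
```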
4 Conclusion
In this study, we recorded auditory responses to all possible element pair stimuli from the Bengalese finch HVC. By determining the sequential response distributions for each neuron, we showed that each neuron in HVC has broad but differential response properties to song element sequences. The population dynamics analysis revealed that population activity vectors overlap after the presentation of element pairs. Using mutual information analysis, we demonstrated that this overlap in the population response is due to confounded information, namely, the sequential information of song elements. These results indicate that the song element sequence is encoded into the HVC microcircuit at the population level. Song element sequences are encoded in a neural ensemble with broad and differentially selective neuronal populations, rather than the chain-like model of differential TCS neurons.
Acknowledgment This study was partially supported by the RIKEN Brain Science Institute, and by a Grant-in-Aid for young scientists (B) No. 18700303 from the Japanese Ministry of Education, Culture, Sports, Science, and Technology.
References
1. Okanoya, K.: The Bengalese finch: a window on the behavioral neurobiology of birdsong syntax. Ann. N.Y. Acad. Sci. 1016, 724–735 (2004)
2. Doupe, A.J., Kuhl, P.K.: Birdsong and human speech: common themes and mechanisms. Annu. Rev. Neurosci. 22, 567–631 (1999)
3. Margoliash, D., Fortune, E.S.: Temporal and harmonic combination-selective neurons in the zebra finch's HVc. J. Neurosci. 12, 4309–4326 (1992)
4. Lewicki, M.S., Arthur, B.J.: Hierarchical organization of auditory temporal context sensitivity. J. Neurosci. 16, 6987–6998 (1996)
5. Drew, P.J., Abbott, L.F.: Model of song selectivity and sequence generation in area HVc of the songbird. J. Neurophysiol. 89, 2697–2706 (2003)
6. Deneve, S., Latham, P.E., Pouget, A.: Reading population codes: a neural implementation of ideal observers. Nat. Neurosci. 2, 740–745 (2001)
7. Pouget, A., Dayan, P., Zemel, R.: Information processing with population codes. Nat. Rev. Neurosci. 1, 125–132 (2000)
8. Green, D., Swets, J.: Signal Detection Theory and Psychophysics. Wiley, New York (1966)
9. Theunissen, F.E., Doupe, A.J.: Temporal and spectral sensitivity of complex auditory neurons in the nucleus HVc of male zebra finches. J. Neurosci. 18, 3786–3802 (1998)
10. Matsumoto, N., Okada, M., Sugase-Miyamoto, Y., Yamane, S., Kawano, K.: Population dynamics of face-responsive neurons in the inferior temporal cortex. Cerebr. Cort. 15, 1103–1112 (2005)
11. Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–328 (1966)
12. Sugase, Y., Yamane, S., Ueno, S., Kawano, K.: Global and fine information coded by single neurons in the temporal visual cortex. Nature 400, 869–873 (1999)
13. Reich, D.S., Mechler, F., Victor, J.D.: Formal and attribute-specific information in primary visual cortex. J. Neurophysiol. 85, 305–318 (2001)
Spontaneous Voltage Transients in Mammalian Retinal Ganglion Cells Dissociated by Vibration Tamami Motomura, Yuki Hayashida, and Nobuki Murayama Graduate school of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Kumamoto 860-8555, Japan
[email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp
Abstract. We recently developed a new method to dissociate neurons from mammalian retinae by utilizing low-Ca2+ tissue incubation and the vibrodissociation technique, without the use of enzymes. Retinal ganglion cell somata dissociated by this method showed spontaneous voltage transients (sVT) with a fast rise and a slower decay. In this study, we analyzed the characteristics of these sVT in cells under the perforated-patch whole-cell configuration, as well as in a single-compartment cell model. The sVT varied in amplitude in a quantal manner, and reversed in polarity around −80 mV in a normal physiological saline. The reversal potential of the sVT shifted with the K+ equilibrium potential, indicating the involvement of a K+ conductance. Based on the model, the conductance changes responsible for producing the sVT were largely independent of the membrane potential below −50 mV. These results could suggest the presence of isolated, inhibitory presynaptic terminals attached to the ganglion cell somata. Keywords: Neuronal computation, dissociated cells, retina, patch-clamp, neuron model.
1 Introduction
Elucidating the functional role of single neurons in neural information processing is intricate, because neuronal computation itself is highly nonlinear and adaptive, and depends on combinations of many parameters, e.g. the ionic conductances, the intracellular signaling, their subcellular distributions, and the cell morphology. Furthermore, interactions with surrounding neurons/glia can alter those factors, and thereby hinder us from examining some of them separately. This can be overcome by pharmacologically or physically isolating neurons from the circuits. One can use pharmacological agents that block synaptic signal transmission in situ, although it is hard to know whether or not such agents have unintended side effects. Alternatively, one can dissociate neural tissue into single neurons by means of enzymatic digestion and mechanical trituration. The dissociated single neurons often lose their fine neurites and their synaptic contacts with other cells during the dissociation procedure, and thus are useful for examining the properties of ionic conductances at known membrane potentials [3]. Unfortunately, however,
several studies have demonstrated that proteolytic enzymes employed for the cell dissociations can distort the amplitude, kinetics, localization, and pharmacological properties of ionic currents, e.g. [2]. These observations lead to attempts to isolate neurons by enzyme-free, mechanical means. Recently, we developed a new protocol for dissociating single neurons from specific layers of mammalian retinae without use of any proteolytic enzymes [12], but with a combination of the low-Ca2+ tissue incubation [9] and the vibrodissociation technique [15] which has been applied to the slices of brains and spinal cords [1]. The somata of ganglion cells dissociated by our method showed spontaneous voltage transients (sVT) with fast rise and slower decay in the time course [8]. To our knowledge, such sVT have never been reported in previous studies on the retinal ganglion cells dissociated with or without enzyme [9]. Therefore, in this study, we analyzed characteristics of these sVT in the cells under perforated-patch whole-cell configuration, as well as in a single compartment cell model. The present results could suggest the presence of inhibitory presynaptic terminals attaching to the ganglion cell somata we recorded from, as demonstrated in previous studies on the vibrodissociated neurons of brains and spinal cords [1]. If this is the case, the retinal neurons dissociated by our method would be advantageous to investigate the mechanisms of transmitter release in single tiny synaptic boutons even under the isolation from the axons and neurites.
2 Methods All animal care and experimental procedures in this study were approved by the committee for animal researches of Kumamoto University. 2.1 Cell Dissociation The neural retinas were isolated from two freshly enucleated eyes of Wistar rats (P7-P25), cut into 2-4 pieces each, and briefly kept in chilled extracellular “bath” solution. This solution contained (in mM): 140 NaCl, 3.5 KCl, 1 MgCl2, 2.5 CaCl2, 10 D-glucose, 5 HEPES. The pH was adjusted to 7.3 with NaOH. A retinal piece was then placed with photoreceptor-side down in a culture dish, covered with 0.4 ml of chilled, low-Ca2+ solution and incubated for 3-5 min. This low-Ca2+ solution contained (in mM): 140 sucrose, 2.5 KCl, 70 CsOH, 20 NaOH, 1 NaH2PO4, 15 CaCl2, 20 EDTA, 11 D-glucose, 15 HEPES. The estimated free Ca2+ concentration was 100–200 nM. The pH was adjusted to 7.2 with HCl. After the incubation, the fireblunted glass pipette horizontally vibrating in amplitude of 0.2-0.5 mm at 100 Hz was applied to the flattened surface of retina throughout under visual control with the microscope, so that the cells were dissociated from the ganglion cell layer, but least from the inner and outer nuclear layers. After removing the remaining retinal tissue, the culture dish was filled with the bath solution, and left on a vibration-isolation table for allowing cells to settle down for 15-40 min. The bath solution was replaced by a fresh aliquot supplemented with 1 mg/ml bovine serum albumin and the dissociated cells were maintained at room temperature (20-25 oC) for 2-18 hrs prior to the electrophysiological recordings described below. The ganglion cells were identified
based on the size criteria [6]. Nearly all those cells we made recordings from in voltage-/current-clamp showed the large amplitude of voltage-gated Na+ current and/or of action potentials (see Fig. 1A-B), verifying that they were ganglion cells [4]. 2.2 Electrophysiology Since the previous studies demonstrated that the membrane conductances of retinal ganglion cells can be modulated by the intracellular messengers, e.g. Zn2+ [13] and cAMP [7], all recordings presented here were performed in perforated-patch wholecell mode [9] to maintain cytoplasmic integrity. Patch electrodes were pulled from borosilicate glass capillaries to tip resistances of approximately 4-8 MΩ. The tip of the electrodes were filled with a recording “electrode” solution that contained (in mM): 110 K-D-gluconic acid, 15 KCl, 15 NaOH, 2.6 MgCl2, 0.34 CaCl2, 1 EGTA, 10 HEPES. The pH was adjusted to 7.2 with methanesulfonic acid. The shank of the electrodes were filled with this solution after the addition of amphotericin B as the perforating agent (260 μg/ml, with 400 μg/ml Pluronic F-127). The recordings were made after the series resistance in perforated-patch configuration reached a stable value (typically 20-40 MΩ, ranging 10-100 MΩ). In the fast current-clamp mode of the amplifier (EPC-10, Heka), the voltage monitor output was analog-filtered by the built-in Bessel filters (3-pole 10–30 kHz followed by 4-pole 2-5 kHz) and digitally sampled (5–20 kHz). The voltage drop across the series resistance was compensated by the built-in circuitry. The recording bath was grounded via an agar bridge, and the bath solution was continuously superfused over each cell recorded from, at a constant flow rate (0.4 ml/min). The volume of solution in the recording chamber was kept at about 2 ml. To apply a high-K+ solution (Fig. 2B), 8 mM NaCl in the bath solution was replaced by the equimolar KCl. An enzyme solution was made by supplementing 0.25 mg/ml papain and 2.5 mM L-cystein in the bath solution. All experiments were performed at room temperature.
3 Results Perforated-patch whole-cell recordings were made from the somata of ganglion cells dissociated by our recently developed protocol (see Methods), which offered us quantitative measurements of the intrinsic membrane properties with the least distortion due to the proteolysis by enzymes [2]. Conversely, since these cells were never exposed to any enzyme, they were useful in examining the effects of the enzymes utilized for the cell dissociation in previous studies. In fact, spike firing of the ganglion cells in response to constant current injection via the patch electrode (30-pA step in the positive direction) was irreversibly altered when the enzyme solution was superfused over those cells (n=3): 1) The resting potential depolarized by 5-20 mV and the spike firing diminished during the enzyme application; 2) When the enzyme was washed out from the recording chamber, the resting potential gradually hyperpolarized near the original level and the spike firing returned in some way;
Fig. 1. Spontaneous voltage transients (sVT) observed in the dissociated retinal ganglion cells. A: Microphotograph of the cell recorded from. Note that the soma is larger than 15 μm in diameter. B: Membrane potential changes in response to step-wise constant current injections. Four traces are superimposed. The injected current was 10 pA in the negative direction and 10, 20, and 30 pA in the positive direction. C: Spontaneous hyperpolarizations under the current-clamp. The recordings were made for 50 sec in three different episodes with breaks of 12 sec between the first and second, and of 6 sec between the second and third. A constant current (2 pA in the positive direction) was injected to hold the membrane potential at around –70 mV (dashed gray line). Inset: Examples of sVT on an expanded time scale. Five events are recognized. Three of them are similar in their amplitude and time course, and the other two have roughly half and a quarter of the largest amplitude.
3) After ~20 min of the washing out of enzyme, the spike firing reached a steady state at which the interval between the first and second spike firings in response to the current step was shorter than that before the enzyme application, by 40 ± 11 % (mean ± S.E.) (not shown, [12]). These results suggest that, in previous studies on isolated retinal ganglion cells, some of the ionic channels could be significantly distorted during the dissociation procedure because of the use of proteolytic enzymes. Moreover, we found spontaneous voltage transients (sVT) with a fast rise and slower decay in the retinal ganglion cell somata dissociated by our method [8]. Fig. 1C shows an example of sVT recorded from the cell shown in Fig. 1A. As shown in the figure, transient hyperpolarizations spontaneously appeared under a constant current injection. Most of these hyperpolarizations are similar in amplitude and time course at a certain membrane potential (−70 mV here), and in some, the peak amplitude of the hyperpolarizations was roughly half or a quarter (or one-eighth, in other cells) of the largest one (Inset). Such sVT appeared in 10-20 % of the cells we made recordings from, and could be observed in particular cells as long as we kept the recordings (0.5-2 hrs). When the enzyme solution was superfused over one of those cells, the sVT disappeared completely, and then were not seen again even after 20 min of the washing out of enzyme.
Fig. 2. Reversal potential of sVT. A, B: The sVT recorded in the saline containing extracellular K+ of 3.5 mM (A) and 11.5 mM (B). The basal membrane potential (indicated by arrows) was varied by injecting holding currents ranging between –8 and +8 pA in A and between –8 and +12 pA in B. C: Plots of the peak amplitude versus basal potential. Only the events having the largest amplitude (see Fig. 1C) are taken into account. Note that the amplitudes of depolarizations were plotted as negative values, and vice versa. The filled circles and open circles represent the data for 3.5-mM K+ and 11.5-mM K+, respectively.
As shown in Fig. 1C, the sVT were all recorded as hyperpolarizations when the cell was held at approximately −70 mV. Thus, the reversal potential of the ionic current producing sVT should be below this voltage. In Fig. 2, the reversal potential for sVT was measured by holding the basal membrane potential at different levels under the current-clamp, c.f. [5]. As expected, the polarity of sVT reversed around −80 mV when the basal membrane potential was varied from about −100 to −40 mV (Fig. 2A). Based on the ionic compositions in the bath and electrode solutions used in this recording, the equilibrium potential of K+ (EK) was estimated to be about −90 mV, and close to the reversal potential for sVT. When the EK was shifted by +30 mV, i.e. from about −90 to −60 mV by applying the high-K+ solution (see Method), the polarity of sVT reversed between −54 and −39 mV of the basal membrane potential (Fig. 2B). Fig. 2C plots the peak amplitude of sVT versus the basal membrane potential. The linear regressions on these plots (gray lines) crossed the abscissa (dashed line) at approximately –76 mV and –48 mV for 3.5-mM K+ and 11.5-mM K+, respectively, showing the shift of reversal potential parallel to the EK shift. Similar results were obtained in other two cells. These results indicate that the ionic conductance responsible for producing sVT is permeable to, at least, K+. In the present experiments, we made recordings from the cells without neurites or with neurites no longer than 10 μm or so. Therefore, those cells can be modeled as a single compartment shown in Fig. 3A. In this model, the unknown conductance responsible for producing sVT and the reversal potential are represented by “gx” and “Ex”, respectively. The membrane properties intrinsic to the cell are represented by membrane capacitance Cm, nonlinear conductance gm, and the apparent reversal potential Em. Here, Cm was measured with the capacitance compensation circuitry of
Fig. 3. Conductance changes during sVT. A: Single compartment model of the isolated somata. ICm, the current through Cm. Igm, the current through gm. Igx, the current through gx. B: Voltage responses to the current steps injected. The amplitude of the current steps (Iinj) was varied from –10 to +10 pA in 5-pA increments and the corresponding voltage changes (Vm), from the bottom (black) to the top (light gray), were recorded. C: Plots of the membrane potential versus the amplitude of the current steps. The voltage was measured at the time points indicated by the marks in B (circle, square, triangle, rhombus, and hexagon). The solid line shows the best fit to the plots with a single exponential function. D: Voltage-dependency of gm calculated from the plots in C. The derivative of current with respect to the membrane potential gave the slope conductance gm, which could be approximated by a hyperbolic function, gm = Iα / (Vα – Vm), where Vm

2) GRFs are used. The center of the ith GRF is set to μi = Imin + ((2i − 3)/2)((Imax − Imin)/(n − 2)). One GRF is placed outside the range at each of the two ends. All the GRFs encoding an input variable will have the same width. The width of a GRF is set to σ = (1/γ)((Imax − Imin)/(n − 2)), where γ controls the extent of overlap between the GRFs. For an input variable with value a, the activation value of the ith GRF with center μi and width σ is given by

f_i(a) = \exp\left(-\frac{(a - \mu_i)^2}{2\sigma^2}\right)   (3)

The firing time of the neuron associated with this GRF is inversely proportional to fi(a). For a highly stimulated GRF, with a value of fi(a) close to 1.0, the firing time t = 0 milliseconds is assigned. When the activation value of the
GRF is small, the firing time t is high indicating that the neuron fires later. In our experiments, the firing time of an input neuron is chosen to be in the range 0 to 9 milliseconds. While converting the activation values of GRFs into firing times, a coding threshold is imposed on the activation value. A GRF that gives an activation value less than the coding threshold will be marked as not-firing (NF), and the corresponding input neuron will not contribute to the membrane potential of the post-synaptic neuron. The range-based encoding method is illustrated in Fig. 1(c). For multi-variate data, each variable is encoded separately, effectively not using the correlation present among the variables. In this encoding method, 1-D GRFs are uniformly placed along an input dimension, without considering the distribution of the data. Hence, when the data is sparse some GRFs, placed in the regions where data is not present, are not effectively used. This results in a high neuron count and computational cost. The widths of the GRFs are derived without using the knowledge of the data distribution, except for the range of values that the variables take. Taking one GRF along an input dimension and quantizing its activation value results in the formation of intervals within the range of values of that variable, such that one or more intervals are mapped onto a particular quantization level. When 2-D data is encoded by taking an array of 1-D GRFs along each input dimension, the input space is quantized into rectangular grids such that all the input patterns falling into a particular rectangular grid have the same vector of quantization levels, and hence the same encoded time vector. Additionally, one or more rectangular grids may have the same encoded time vector. For multivariate data, the input space is divided into hypercuboids. To demonstrate this, the single-ring data (shown in Fig. 2(a)) is encoded by placing 5 GRFs along each dimension, dividing the input space into grids as shown in Fig. 2(b). A 10-2 MDSNN, having 10 neurons in the input layer and 2 neurons in the output layer, is trained to cluster this data. The space of data points as represented by the output layer neurons is shown in Fig. 2(c). The cluster boundary is observed to be a combination of linear segments defined by the rectangular grid boundaries formed by encoding. The shape of the boundary formed by the MDSNN is significantly different from the desired circle-shaped boundary between the two clusters in the single-ring data. Increasing the number of GRFs used for encoding each dimension may give a boundary that is a combination of smaller linear segments, at the expense of high neuron count. However, this may not result in proper clustering of the data, as the choice of number of GRFs is observed to be crucial in the range-based encoding method. When the range-based encoding method is used along with the varyingthreshold method and the multi-stage learning method [15] to cluster complex data sets such as the double-ring data and the spiral data, it is observed that proper subclusters are not formed by the neurons in the hidden layer. As each dimension is encoded separately, the spatially disjoint subsets of data points, as shown by the marked regions in Fig. 3, that have similar encoding along a particular dimension are found to be represented by a single neuron in the hidden
Fig. 2. Clustering the single-ring data encoded using the range-based encoding method: (a) The single-ring data, (b) data space quantization due to the range-based encoding, and (c) data space representation by the output neurons
Fig. 3. Improper subclusters formed when the data is encoded with the range-based encoding method for (a) the double-ring data and (b) the spiral data
layer. This binding is observed to form during the initial iterations of learning when the firing thresholds of neurons are low. The established binding cannot be unlearnt in the subsequent iterations, leading to improper clustering at the output layer. An encoding method that overcomes the above discussed limitations is proposed in the next section.
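Before turning to the proposed method, the sketch below illustrates the range-based encoding just described for a single input variable: n uniformly spaced 1-D GRFs, a common width controlled by γ, and firing times in the 0-9 ms range. The linear mapping from activation value to firing time and the coding-threshold value are illustrative assumptions, not the exact settings used in the experiments reported in this paper.

```python
import numpy as np

def range_based_encode(a, i_min, i_max, n=5, gamma=1.5, t_max=9.0, coding_threshold=0.1):
    """Encode a scalar input a into n firing times using 1-D Gaussian receptive fields.

    Returns firing times in [0, t_max] ms; np.nan marks a not-firing (NF) neuron.
    """
    i = np.arange(1, n + 1)
    centers = i_min + (2 * i - 3) / 2.0 * (i_max - i_min) / (n - 2)   # one GRF beyond each end
    sigma = (1.0 / gamma) * (i_max - i_min) / (n - 2)                 # common width
    activation = np.exp(-((a - centers) ** 2) / (2.0 * sigma ** 2))
    times = t_max * (1.0 - activation)             # high activation -> early firing
    times[activation < coding_threshold] = np.nan  # below the coding threshold -> NF
    return times

print(range_based_encode(0.3, i_min=-1.0, i_max=1.0))
```

Each dimension of a multi-variate pattern is encoded independently in this way, which is exactly what produces the rectangular-grid quantization of the input space discussed above.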
4
Region-Based Encoding Using Multi-dimensional Gaussian Receptive Fields
Using multi-dimensional GRFs for encoding helps in capturing the correlation present in the data. One approach would be to uniformly place the multidimensional GRFs covering the whole range of the input data space. However, this results in an exponential increase in the number of neurons in the input layer with the dimensionality of the data. To circumvent this, we propose a region-based encoding method that places the multi-dimensional GRFs only in the data-inhabited regions, i.e., the regions where data is present. The mean vectors and the covariance matrices of these GRFs are computed from the data in the regions, thus capturing the correlation present in the data. To identify the data-inhabited regions in the input space, first the k-means clustering is performed on the data to be clustered, with the value of k being larger than the number of actual clusters. On each of the regions, identified using the k-means clustering method, a multi-dimensional GRF is placed by computing the mean vector and the covariance matrix from the data in that region. Response of the ith GRF for a multi-variate input pattern a is computed as,
f_i(\mathbf{a}) = \exp\left(-\frac{1}{2}(\mathbf{a} - \mu_i)^{t} \Sigma_i^{-1} (\mathbf{a} - \mu_i)\right),   (4)
where μi and Σi are the mean vector and the covariance matrix of the ith GRF respectively, and fi(a) is the activation value of that GRF. As discussed in Section 3, these activation values are translated into firing times in the range 0 to 9 milliseconds and the non-optimally stimulated input neurons are marked as NF. By deriving the covariance for a GRF from the data, the region-based encoding method captures the correlation present in the data. The regions identified by k-means clustering and the data space quantization resulting from this encoding method, for the single-ring data used in Section 3, are shown in Fig 4(a) and 4(b) respectively. The boundary given by the MDSNN with the region-based encoding method is shown in Fig. 4(c). This boundary is more like the desired circle-shaped boundary, as against the combination of linear segments observed with the range-based encoding method (see Fig. 2(c)).
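The essential steps of the region-based encoding — k-means to find the data-inhabited regions, one multi-dimensional GRF per region with mean and covariance estimated from that region's data, and Eq. (4) for the activation values — can be sketched as follows. The scikit-learn KMeans class is assumed here for the clustering step (any k-means implementation would do), and the firing-time mapping and coding threshold are the same illustrative choices as in the range-based sketch above.

```python
import numpy as np
from sklearn.cluster import KMeans

def region_based_encoder(X, k=8, t_max=9.0, coding_threshold=0.1, seed=0):
    """Fit k multi-dimensional GRFs on the data-inhabited regions of X (n_samples x n_dims)
    and return a function that encodes a pattern into k firing times."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    means, inv_covs = [], []
    for c in range(k):
        pts = X[labels == c]
        means.append(pts.mean(axis=0))
        cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(X.shape[1])   # small ridge for safety
        inv_covs.append(np.linalg.inv(cov))

    def encode(a):
        times = np.empty(k)
        for c in range(k):
            d = a - means[c]
            f = np.exp(-0.5 * d @ inv_covs[c] @ d)          # Eq. (4)
            times[c] = np.nan if f < coding_threshold else t_max * (1.0 - f)
        return times

    return encode

# Example: single-ring-like 2-D data encoded with k = 8 region-based GRFs.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, 500)
ring = np.c_[np.cos(angles), np.sin(angles)] + 0.05 * rng.standard_normal((500, 2))
encode = region_based_encoder(ring, k=8)
print(encode(ring[0]))
```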
Fig. 4. (a) Regions identified by k-means clustering with k = 8, (b) data space quantization due to the region-based encoding and (c) data space representation by the neurons in the output layer
Next, we study the performance of the proposed region-based encoding method in clustering complex 2-D and 3-D data sets. For the double-ring data, the k-means clustering is performed with k = 20 and the resulting regions are shown in Fig 5(a). Over each of the regions, a 2-D GRF is placed to encode the data. A 20-8-2 MDSNN is trained using the multi-stage learning method, discussed in Section 2. It is observed that out of 8 neurons in the hidden layer 3 neurons do not win for any of the training examples and the data is represented by the remaining 5 neurons as shown in Fig 5(b). These 5 neurons provide the input in the second stage of learning to form the final clusters (Fig 5(c)). The resulting cluster boundaries are seen to follow the data distribution as shown in Fig 5(d). Similarly, the spiral data is encoded using 40 2-D GRFs. The regions of data identified using the k-means clustering method are shown in Fig 6(a). A 40-20-2 MDSNN is trained to cluster the spiral data. As shown in Fig 6(b), 14 subclusters are formed in the hidden layer that are combined in the next layer to form the final clusters as shown in Fig 6(c). The region-based encoding method helps in proper subcluster formation at the hidden layer (Fig 5(b) and Fig 6(b)), as against the range-based encoding method (Fig 3). The proposed method is also used to cluster 3-D data sets, namely, the interlocking donuts data and the 3-D ring data. The interlocking donuts data is
Fig. 5. Clustering the double-ring data: (a) Regions identified by k-means clustering with k = 20, (b) subclusters formed at the hidden layer, (c) clusters formed at the output layer and (d) data space representation by the neurons in the output layer
Fig. 6. Clustering the spiral data: (a) Regions identified by k-means clustering with k = 40, (b) subclusters formed at the hidden layer, (c) clusters formed at the output layer and (d) data space representation by the neurons in the output layer
Fig. 7. Clustering of the interlocking donuts data and the 3-D ring data: (a) Regions identified by k-means clustering on the interlocking donuts data with k = 10 and (b) clusters formed at the output layer. (c) Regions identified by k-means clustering on the 3-D ring data with k = 5 and (d) clusters formed at the output layer.
encoded with 10 3-D GRFs and a 10-2 MDSNN is trained to cluster this data. The k-means clustering results and the final clusters formed by the MDSNN are shown in Fig 7(a) and (b) respectively. The clustering results for the 3-D ring data with the proposed encoding method are shown in Fig 7(c) and 7(d). For comparison, the performance of the range-based encoding method and the region-based encoding method, for different data sets, is presented in Table 1. It is observed that the region-based encoding method outperforms the range-based encoding method for clustering complex data sets like the double-ring data and the spiral data. For the cases where both methods give the same or almost the same performance, the number of neurons used in the input layer is given in the parentheses. It is observed that the region-based encoding method always maintains a low neuron count, thereby reducing the computational cost. The difference between the neuron counts for the two methods may look small for these 2-D and 3-D data sets. However, as the dimensionality of the data increases, this
Table 1. Comparison of the performance (in %) of MDSNNs using the range-based encoding method and the region-based encoding method for clustering. The numbers in parentheses give the number of neurons in the input layer.

Data set                      Range-based encoding   Region-based encoding
Double-ring data              74.82                  100.00
Spiral data                   66.18                  100.00
Single-ring data              100.00 (10)            100.00 (8)
Interlocking cluster data     99.30 (24)             100.00 (6)
3-D ring data                 100.00 (15)            100.00 (5)
Interlocking donuts data      97.13 (21)             100.00 (10)
difference can be significant. From these results, it is evident that the proposed encoding method scales well to higher dimensional data clustering problems, while keeping a low count of neurons. Additionally, and more importantly, the nonlinear cluster boundaries given by the region-based encoding method follow the distribution of the data or shapes of the clusters.
5
Conclusions
In this paper, we have proposed a new encoding method using multi-dimensional GRFs for MDSNNs. We have demonstrated that the proposed encoding method effectively uses the correlation present in the data and positions the GRFs in the data-inhabited regions. We have also shown that the proposed method results in a low neuron count as opposed to the encoding method proposed in [14] and the simple approach of placing multi-dimensional GRFs covering the data space. This in turn results in a low computational cost for clustering. With the encoding method proposed in [14], the cluster boundaries obtained for clustering nonlinearly separable data are observed to be combinations of linear segments and the MDSNN failed to cluster the double-ring data and the spiral data. We have experimentally shown that with the proposed encoding method, the MDSNNs could cluster complex data like the double-ring data and the spiral data, while giving smooth nonlinear boundaries that follow the data distribution. In the existing range-based encoding method, when the data consists of clusters with different scales, i.e., narrow and wider clusters, GRFs with different widths are used. This technique is called multi-scale encoding. However, in the region-based encoding method the widths of the multi-dimensional GRFs are automatically computed from the data-inhabited regions. The widths of these GRFs can be different. In the proposed method, for clustering the 2-D and 3-D data, the value of k is decided empirically and the formation of subclusters at the hidden layer is verified visually. However, for higher dimensional data, it is necessary to ensure the formation of subclusters automatically.
References 1. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Englewood Cliffs (1998) 2. Kumar, S.: Neural Networks: A Classroom Approach. Tata McGraw-Hill, New Delhi (2004) 3. Maass, W.: Networks of Spiking Neurons: The Third Generation of Neural Network Models. Trans. Soc. Comput. Simul. Int. 14(4), 1659–1671 (1997) 4. Bi, Q., Poo, M.: Precise Spike Timing Determines the Direction and Extent of Synaptic Modifications in Cultured Hippocampal Neurons. Neuroscience 18, 10464–10472 (1998) 5. Maass, W., Bishop, C.M.: Pulsed Neural Networks. MIT-Press, London (1999) 6. Gerstner, W., Kistler, W.M.: Spiking Neuron Models. Cambridge University Press, Cambridge (2002) 7. Maass, W.: Fast Sigmoidal Networks via Spiking Neurons. Neural Computation 9, 279–304 (1997) 8. Verstraeten, D., Schrauwen, B., Stroobandt, D., Campenhout, J.V.: Isolated Word Recognition with the Liquid State Machine: A Case Study. Information Processing Letters 95(6), 521–528 (2005) 9. Bohte, S.M., Kok, J.N., Poutre, H.L.: Spike-Prop: Error-backpropagation in Temporally Encoded Networks of Spiking Neurons. Neural Computation 48, 17–37 (2002) 10. Natschlager, T., Ruf, B.: Spatial and Temporal Pattern Analysis via Spiking Neurons. Network: Comp. Neural Systems 9, 319–332 (1998) 11. Ruf, B., Schmitt, M.: Unsupervised Learning in Networks of Spiking Neurons using Temporal Coding. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 361–366. Springer, Heidelberg (1997) 12. Hopfield, J.J.: Pattern Recognition Computation using Action Potential Timing for Stimulus Representations. Nature 376, 33–36 (1995) 13. Gerstner, W., Kempter, R., Van Hemmen, J.L., Wagner, H.: A Neuronal Learning Rule for Sub-millisecond Temporal Coding. Nature 383, 76–78 (1996) 14. Bohte, S.M., Poutre, H.L., Kok, J.N.: Unsupervised Clustering with Spiking Neurons by Sparse Temporal Coding and Multilayer RBF Networks. IEEE Transactions on Neural Networks 13, 426–435 (2002) 15. Panuku, L.N., Sekhar, C.C.: Clustering of Nonlinearly Separable Data using Spiking Neural Networks. In: de S´ a, J.M., Alexandre, L.A., Duch, W., Mandic, D. (eds.) ICANN 2007. LNCS, vol. 4668, Springer, Heidelberg (2007)
Firing Pattern Estimation of Biological Neuron Models by Adaptive Observer
Kouichi Mitsunaga¹, Yusuke Totoki², and Takami Matsuo²
¹ Control Engineering Department, Oita Institute of Technology, Oita, Japan
² Department of Architecture and Mechatronics, Oita University, 700 Dannoharu, Oita, 870-1192, Japan
Abstract. In this paper, we present three adaptive observers with the membrane potential measurement under the assumption that some of parameters in HR neuron are known. Using the Strictly Positive Realness and Yu’s stability criterion, we can show the asymptotic stability of the error systems. The estimators allow us to recover the internal states and to distinguish the firing patterns with early-time dynamic behaviors.
1
Introduction
In traditional artificial neural networks, the neuron behavior is described only in terms of firing rate, while most real neurons, commonly known as spiking neurons, transmit information by pulses, also called action potentials or spikes. Model studies of neuronal synchronization can be separated into those where models of the integrate-and-fire type are used and those where conductance-based spiking and bursting models are employed [1]. Bursting occurs when neuron activity alternates, on a slow time scale, between a quiescent state and fast repetitive spiking. In any study of neural network dynamics, there are two crucial issues: 1) what model describes the spiking dynamics of each neuron and 2) how the neurons are connected [3]. Izhikevich considered the first issue and compared various models of spiking neurons. He reviewed 20 types of real (cortical) neuron responses, considering the injection of simple dc pulses, such as tonic spiking, phasic spiking, tonic bursting, and phasic bursting. Throughout his simulations, he suggested that if the goal is to study how the neuronal behavior depends on measurable physiological parameters, such as the maximal conductance, steady-state (in)activation functions and time constants, then the Hodgkin-Huxley type model is the best. However, its computational cost is the highest of all models. He also pointed out that the Hindmarsh-Rose (HR) model is computationally simple and capable of producing rich firing patterns exhibited by real biological neurons. Although the HR model is a computational model of neuronal bursting using three coupled first-order differential equations [5,6], it can generate tonic spiking, phasic spiking, and so on, for different parameters in the model equations. Carroll showed by simulation that additive noise shifts the neuron model into the two-frequency region (i.e. bursting) and that the slow part of the
responses allows the model to be robust to added noise [7]. The parameters in the model equations are important in determining the dynamic behaviors of the neuron [12]. From the measurement-theoretical point of view, it is important to estimate the states and parameters using measurement data, because extracellular recordings are a common practice in neuro-physiology and often represent the only way to measure the electrical activity of neurons [8]. Tokuda et al. applied an adaptive observer to estimate the parameters of the HR neuron by using membrane potential data recorded from a single lateral pyloric neuron synaptically isolated from other neurons [13]. However, their observer cannot guarantee the asymptotic stability of the error system. Steur [14] pointed out that the HR equations cannot be transformed into the adaptive observer canonical form and that it is not possible to make use of the adaptive observer proposed by Marino [10]. He simplified the three-dimensional HR equations and wrote them as a one-dimensional system with an exogenous signal using the contracting and wandering dynamics technique. His adaptive observer with a first-order differential equation cannot estimate the internal states of HR neurons. We have recently presented adaptive observers with full-state measurement and with the membrane potential measurement [15]. However, the estimates of the states by the observer with output measurement are not enough to recover the immeasurable internal states. In this paper, we present three adaptive observers with the membrane potential measurement under the assumption that some of the parameters in the HR neuron are known. Using the Kalman-Yakubovich lemma, we can show the asymptotic stability of the error systems based on the standard adaptive control theory [11]. The estimators allow us to recover the internal states and to distinguish the firing patterns with early-time dynamic behaviors. The MATLAB simulations demonstrate the estimation performance of the proposed adaptive observers.
2
Review of Real (Cortical) Neuron Responses
There are many types of cortical neuron responses. Izhikevich reviewed 20 of the most prominent features of biological spiking neurons, considering the injection of simple dc pulses [3]. Typical responses are classified as follows [4]: – Tonic Spiking (TS): The neuron fires a spike train as long as the input current is on. This kind of behavior can be observed in three types of cortical neurons: regular spiking excitatory neurons (RS), low-threshold spiking neurons (LTS), and fast spiking inhibitory neurons (FS). – Phasic Spiking (PS): The neuron fires only a single spike at the onset of the input. – Tonic Bursting: The neuron fires periodic bursts of spikes when stimulated. This behavior may be found in chattering neurons in cat neocortex. – Phasic Bursting (PB): The neuron fires only a single burst at the onset of the input.
– Mixed Mode (Bursting Then Spiking) (MM): The neuron fires a phasic burst at the onset of stimulation and then switch to the tonic spiking mode. The intrinsically bursting excitatory neurons in mammalian neocortex may exhibit this behavior. – Spike Frequency Adaptation (SFA): The neuron fires tonic spikes with decreasing frequency. RS neurons usually exhibit adaptation of the interspike intervals, when these intervals increase until a steady state of periodic firing is reached, while FS neurons show no adaptation.
3
Single Model of HR Neuron
The Hindmarsh-Rose (HR) model is computationally simple and capable of producing rich firing patterns exhibited by real biological neurons.
3.1 Dynamical Equations
The single model of the HR neuron [1,5,6] is given by

\dot{x} = a x^2 - x^3 - y - z + I
\dot{y} = (a + \alpha) x^2 - y
\dot{z} = \mu (b x + c - z)

where x represents the membrane potential, y and z are associated with fast and slow currents, respectively, I is an applied current, and a, α, μ, b and c are constant parameters. We rewrite the single HR neuron in a vectorized form:

(S_0): \dot{w} = h(w) + \Xi(x, z)\,\theta

where

w = \begin{pmatrix} x \\ y \\ z \end{pmatrix}, \quad
h(w) = \begin{pmatrix} -(x^3 + y + z) \\ -y \\ 0 \end{pmatrix}, \quad
\Xi(x, z) = \begin{pmatrix} x^2 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & x^2 & 0 & 0 & 0 \\ 0 & 0 & 0 & x & 1 & -z \end{pmatrix},

\theta = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, \theta_6)^T = (a, I, a + \alpha, \mu b, \mu c, \mu)^T.

3.2 Numerical Examples
The HR model shows a large variety of behaviors with respect to the parameter values in the differential equations [12]. Thus, we can characterize the dynamic behaviors with respect to different values of the parameters. We focus on the parameters a and I. The parameter a is an internal parameter in the single neuron and I is an external depolarizing current. For the fixed I = 0.05, the HR model shows a tonic bursting with a ∈ [1.8, 2.85] and a tonic spiking with a ≥ 2.9. On the other hand, for the fixed a = 2.8, the HR model shows a tonic bursting with I ∈ [0, 0.18] and a tonic spiking with I ∈ [0.2, 5].
Fig. 1. The response of x in the tonic bursting
Fig. 2. 3-D surface of x, y, z in the tonic bursting
Fig. 3. The response of x in the tonic spiking
Fig. 4. 3-D surface of x, y, z in the tonic spiking
Fig. 5. The response of x1 in the intrinsic bursting neuron
Fig. 6. 3-D surface of x, y, z in the intrinsic bursting neuron
The parameters of the HR model in the tonic bursting (TB) are given by a = 2.8, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. Figure 1 shows the response of x. Figure 2 shows the 3 dimensional surface of x, y, z. We call this neuron the intrinsic bursting neuron(IBN).
The parameters of the HR model in the tonic spiking (TS) are given by a = 3.0, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. The difference between the tonic bursting and the tonic spiking is only the value of the parameter a. Figure 3 shows the responses of x. Figure 4 shows the 3 dimensional surface of x, y, z. We also call this neuron the intrinsic spiking neuron(ISN). When the external current change from I = 0.05 to I = 0.2, the IBN shows a tonic spiking. Figure 5 shows the responses of x. Figure 6 shows the 3 dimensional surface of x, y, z.
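A minimal forward-Euler integration of the HR equations reproduces these behaviours; the step size, run length and initial condition below are illustrative assumptions, and a = 2.8 / 3.0 correspond to the IBN and ISN parameter sets quoted above.

```python
import numpy as np

def simulate_hr(a=2.8, alpha=1.6, b=9.0, c=5.0, mu=0.001, I=0.05,
                w0=(0.0, 0.0, 0.0), dt=0.01, t_end=1000.0):
    """Forward-Euler integration of the single Hindmarsh-Rose model."""
    x, y, z = w0
    traj = np.empty((int(t_end / dt), 3))
    for k in range(traj.shape[0]):
        dx = a * x**2 - x**3 - y - z + I
        dy = (a + alpha) * x**2 - y
        dz = mu * (b * x + c - z)
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        traj[k] = (x, y, z)
    return traj

ibn = simulate_hr(a=2.8)                 # tonic bursting, cf. Figs. 1-2
isn = simulate_hr(a=3.0)                 # tonic spiking, cf. Figs. 3-4
ibn_driven = simulate_hr(a=2.8, I=0.2)   # IBN switches to tonic spiking, cf. Figs. 5-6
```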
4 Synaptically Coupled Model of HR Neuron
4.1 Dynamical Equations
Consider the following synaptically coupled two HR neurons [1]:

\dot{x}_1 = a_1 x_1^2 - x_1^3 - y_1 - z_1 - g_s (x_1 - V_{s1}) \Gamma(x_2)
\dot{y}_1 = (a_1 + \alpha_1) x_1^2 - y_1, \quad \dot{z}_1 = \mu_1 (b_1 x_1 + c_1 - z_1)
\dot{x}_2 = a_2 x_2^2 - x_2^3 - y_2 - z_2 - g_s (x_2 - V_{s2}) \Gamma(x_1)
\dot{y}_2 = (a_2 + \alpha_2) x_2^2 - y_2, \quad \dot{z}_2 = \mu_2 (b_2 x_2 + c_2 - z_2)

where Γ(x) is the sigmoid function given by

\Gamma(x) = \frac{1}{1 + \exp(-\lambda (x - \theta_s))}.

4.2 Numerical Examples
Consider the IBN neuron with a = 2.8 and the ISN neuron with a = 10.8 whose other parameters are as follows: αi = 1.6, ci = 5, bi = 9, μi = 0.001, Vsi = 2, θs = −0.25, λ = 10. Figures 7,8 show the responses of the membrane potentials in the coupling of IBN neuron and ISN neuron with the coupling strength gs = 0.05, respectively. Each neuron behaves as an intrinsic single neuron. As increasing the coupling strength, however, the IBN neuron shows a chaotic behavior. Figures 9,10 show the responses of the membrane potentials in the coupling of IBN neuron and ISN neuron with the coupling strength gs = 1, respectively. Figure 11 shows the response of the membrane potentials in the coupling of two same IBNs with
Fig. 7. The response of x1 of the IBN
Fig. 8. The response of x2 of the ISN
Fig. 9. The response of x1 of the IBN
Fig. 10. The response of x2 of the ISN
Fig. 11. The response of x1 of the IBN-IBN coupling with gs = 0.05
Fig. 12. The response of x1 of the IBN-IBN coupling with gs = 1
the coupling strength gs = 0.05. In this case, two IBNs synchronize as bursting neurons. Figure 12 shows the response of the membrane potentials in the coupling of two same IBNs with the coupling strength gs = 1. Two IBNs synchronize as spiking neurons.
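The coupled pair of Sect. 4 can be simulated in the same way, using the sigmoidal coupling Γ(x) with Vsi = 2, θs = −0.25 and λ = 10 as quoted above; a smaller integration step is used here because the ISN parameter a2 = 10.8 makes the fast subsystem stiffer. Initial conditions and step size are illustrative assumptions.

```python
import numpy as np

def gamma_sigmoid(x, lam=10.0, theta_s=-0.25):
    """Sigmoidal synaptic activation Gamma(x)."""
    return 1.0 / (1.0 + np.exp(-lam * (x - theta_s)))

def simulate_coupled_hr(a1=2.8, a2=10.8, gs=0.05, alpha=1.6, b=9.0, c=5.0,
                        mu=0.001, vs=2.0, dt=0.001, t_end=1000.0):
    """Forward-Euler integration of two synaptically coupled HR neurons."""
    x1 = y1 = z1 = 0.0
    x2, y2, z2 = 0.1, 0.0, 0.0                  # slight asymmetry in the initial state
    out = np.empty((int(t_end / dt), 2))
    for k in range(out.shape[0]):
        dx1 = a1 * x1**2 - x1**3 - y1 - z1 - gs * (x1 - vs) * gamma_sigmoid(x2)
        dy1 = (a1 + alpha) * x1**2 - y1
        dz1 = mu * (b * x1 + c - z1)
        dx2 = a2 * x2**2 - x2**3 - y2 - z2 - gs * (x2 - vs) * gamma_sigmoid(x1)
        dy2 = (a2 + alpha) * x2**2 - y2
        dz2 = mu * (b * x2 + c - z2)
        x1, y1, z1 = x1 + dt * dx1, y1 + dt * dy1, z1 + dt * dz1
        x2, y2, z2 = x2 + dt * dx2, y2 + dt * dy2, z2 + dt * dz2
        out[k] = (x1, x2)
    return out

weak = simulate_coupled_hr(gs=0.05)               # IBN-ISN, weak coupling, cf. Figs. 7-8
strong_ibn = simulate_coupled_hr(a2=2.8, gs=1.0)  # IBN-IBN, strong coupling, cf. Fig. 12
```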
5
Adaptive Observer with Full States
We present the parameter estimation problem to distinguish the firing patterns by using early-time dynamic behaviors. In this section, assuming that the full states are measurable, we present an adaptive observer to estimate all parameters in the single HR neuron.
5.1 Construction of Adaptive Observer
We present an adaptive observer as

(O_0): \dot{\hat{w}} = W (\hat{w} - w) + h(w) + \Xi \hat{\theta}

where \hat{w} = [\hat{x}, \hat{y}, \hat{z}]^T is an estimate of the states, \hat{\theta} is an estimate of the unknown parameters, and W is selected as a stable matrix. Using the standard adaptive control theory [11], the parameter update law is given by

\dot{\hat{\theta}} = \Gamma \Xi^T P (w - \hat{w}),

where P is a positive definite solution of the following Lyapunov equation for a positive definite matrix Q:

W^T P + P W = -Q.

5.2 Numerical Examples
We will show the simulation results of the single IBN case. The parameters in the tonic bursting are given by a = 2.8, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. The parameters of the adaptive observers are selected as W = −10 I3, Γ = diag{100, 50, 300}. Figure 13 shows the estimation behavior of â (solid line) and Î (dotted line). The estimates â and Î converge to the true values of a and I.
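The sketch below implements the full-state observer (O0) and the update law above against a simulated IBN. The decomposition of the dynamics into h(w) and Ξ(x, z) follows the vectorized form (S0) as reconstructed in Sect. 3.1, and the adaptation gains, initial conditions and run length are illustrative assumptions (the 6 × 6 gain matrix used here is not the one quoted in the text).

```python
import numpy as np

def hr_split(w, theta):
    """HR dynamics written as h(w) + Xi(x, z) theta, cf. (S0)."""
    x, y, z = w
    h = np.array([-(x**3 + y + z), -y, 0.0])
    Xi = np.array([[x**2, 1.0, 0.0,  0.0, 0.0,  0.0],
                   [0.0,  0.0, x**2, 0.0, 0.0,  0.0],
                   [0.0,  0.0, 0.0,  x,   1.0, -z]])
    return h + Xi @ theta, h, Xi

def observer_O0(theta_true, dt=1e-3, t_end=200.0):
    """Adaptive observer (O0) with full state measurement."""
    W = -10.0 * np.eye(3)                                    # stable observer matrix
    P = np.eye(3)                                            # solves W^T P + P W = -Q with Q = 20 I
    Gamma = np.diag([100.0, 50.0, 100.0, 10.0, 10.0, 10.0])  # illustrative adaptation gains
    w = np.zeros(3)                                          # true state
    w_hat = np.array([0.5, 0.5, -0.5])                       # observer state
    theta_hat = np.zeros(6)
    for _ in range(int(t_end / dt)):
        dw, h, Xi = hr_split(w, theta_true)
        dw_hat = W @ (w_hat - w) + h + Xi @ theta_hat
        dtheta = Gamma @ Xi.T @ P @ (w - w_hat)
        w, w_hat = w + dt * dw, w_hat + dt * dw_hat
        theta_hat = theta_hat + dt * dtheta
    return theta_hat

# theta = (a, I, a+alpha, mu*b, mu*c, mu) for the tonic-bursting IBN
theta_ibn = np.array([2.8, 0.05, 4.4, 0.009, 0.005, 0.001])
print(observer_O0(theta_ibn))   # the leading entries (a, I, a+alpha) adapt quickly;
                                # the slow z-subsystem parameters need much longer runs
```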
6
Adaptive Observer with a Partial State
We assume that the membrane potential x is available, but the others are immeasurable. In this case, we consider the following problems:
– Estimate y and z using the available signal x;
– Estimate the parameter a or I to distinguish the firing patterns by using early-time dynamic behaviors.
6.1
Construction of Adaptive Observer
The parameters a and I are key parameters that determine the firing pattern. The HR model can be rewritten in the following three forms [9]:

(S_1): \dot{w} = A w + h_1(x) + b_1 (x^2 a)   (1)
(S_2): \dot{w} = A w + h_2(x) + b_2 (I)   (2)
(S_3): \dot{w} = A w + h_3(x) + b_2 (\theta^T \xi)   (3)

where

A = \begin{pmatrix} 0 & -1 & -1 \\ 0 & -1 & 0 \\ \mu b & 0 & -\mu \end{pmatrix}, \quad
h_1 = \begin{pmatrix} -x^3 + I \\ \alpha x^2 \\ \mu c \end{pmatrix}, \quad
b_1 = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \quad
b_2 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix},

h_2 = \begin{pmatrix} -x^3 + a x^2 \\ (a + \alpha) x^2 \\ \mu c \end{pmatrix}, \quad
h_3 = \begin{pmatrix} -x^3 \\ \delta x^2 \\ \mu c \end{pmatrix}, \quad
\theta = \begin{pmatrix} a \\ I \end{pmatrix}, \quad
\xi = \begin{pmatrix} x^2 \\ 1 \end{pmatrix}.

In (S_1) and (S_2), the unknown parameters are assumed to be a and I, respectively. In (S_3), we assume that the parameter δ = a + α is known, and a and I are unknown. Since the measurable signal is x, the output equation is given by

x = c w = (1 \ 0 \ 0) w.

We present adaptive observers that estimate the parameter for each system (S_i), i = 1, 2, 3, as follows:

(O_1): \dot{\hat{w}}_1 = A \hat{w}_1 + h_1(x) + b_1 (x^2 \hat{a}) + g (x - \hat{x})   (4)
(O_2): \dot{\hat{w}}_2 = A \hat{w}_2 + h_2(x) + b_2 (\hat{I}) + g (x - \hat{x})   (5)
(O_3): \dot{\hat{w}}_3 = A \hat{w}_3 + h_3(x) + b_2 (\hat{\theta}^T \xi) + g (x - \hat{x})   (6)

where g is selected such that A − gc is stable. Since (A, b_1, c) and (A, b_2, c) are strictly positive real, the parameter estimation laws are given as

\dot{\hat{a}} = \gamma_1 x^2 (x - \hat{x}), \quad \dot{\hat{I}} = \gamma_2 (x - \hat{x}).   (7)
Using the Kalman-Yakubovich (KY) lemma, we can show the asymptotic stability of the error system based on the standard adaptive control theory [11].
6.2 Numerical Examples
We will show the simulation results of the single IBN case. The parameters in the tonic spiking are the same as in the previous simulation. Figures 14 and 15 show the estimated parameters by the adaptive observers (O1) and (O2), respectively. Figures 16 and 17 show the responses of y (solid line) and its estimate ŷ (dotted line) by the adaptive observer (O1) for t ≤ 500 and for t ≤ 20, respectively. Figure 18 shows the responses of z (solid line) and its estimate ẑ (dotted line) by the adaptive observer (O1). The simulation results of the other cases are omitted. The states and parameters can be asymptotically estimated.
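A corresponding sketch of the output-feedback observer (O1): only x is fed to the observer, which estimates the state and the parameter a through Eq. (7). The output-injection gain g (chosen so that A − gc has eigenvalues −10, −1, −μ) and the adaptation gain γ1 are illustrative choices, since the gains used in the paper's simulations are not given in this excerpt.

```python
import numpy as np

alpha, b, c_par, mu, I = 1.6, 9.0, 5.0, 0.001, 0.05   # known parameters of (S1)
A = np.array([[0.0, -1.0, -1.0],
              [0.0, -1.0,  0.0],
              [mu * b, 0.0, -mu]])
b1 = np.array([1.0, 1.0, 0.0])
g = np.array([10.0, 0.0, mu * b])   # illustrative gain: A - g c has eigenvalues -10, -1, -mu
gamma1 = 50.0                       # illustrative adaptation gain

def h1(x):
    return np.array([-x**3 + I, alpha * x**2, mu * c_par])

def observer_O1(a_true=2.8, dt=1e-3, t_end=200.0):
    """Adaptive observer (O1): estimates a and the internal states from x only."""
    w = np.zeros(3)                      # true HR state
    w_hat = np.array([0.5, 0.5, -0.5])   # observer state
    a_hat = 1.0
    for _ in range(int(t_end / dt)):
        x = w[0]                                      # measured membrane potential
        dw = A @ w + h1(x) + b1 * (x**2 * a_true)     # true dynamics in the (S1) form
        e = x - w_hat[0]
        dw_hat = A @ w_hat + h1(x) + b1 * (x**2 * a_hat) + g * e
        w, w_hat = w + dt * dw, w_hat + dt * dw_hat
        a_hat += dt * gamma1 * x**2 * e               # Eq. (7)
    return a_hat, w_hat

a_hat, w_hat = observer_O1()
print(a_hat)   # should approach the true value 2.8, cf. Fig. 14
```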
Fig. 13. â (solid line) and Î (dotted line) in the adaptive observer (O0) with full states
Fig. 14. â (solid line) in the adaptive observer (O1) with x
Fig. 15. Î (solid line) in the adaptive observer (O2) with x
Fig. 16. y (solid line) and ŷ (dotted line) in the adaptive observer (O1) (t ≤ 500)
Fig. 17. y (solid line) and ŷ (dotted line) in the adaptive observer (O1) (t ≤ 20)
Fig. 18. z (solid line) and ẑ (dotted line) in the adaptive observer (O1) (t ≤ 20)
7
Conclusion
We presented estimators of the parameters of the HR model using the adaptive observer technique with the output measurement data such as the membrane potential. The proposed observers allow us to distinguish the firing pattern in early time and to recover the immeasurable internal states.
References 1. Belykh, I., de Lange, E., Hasler, M.: Synchronization of Bursting Neurons: What Matters in the Network Topology. Phys. Rev. Lett. 94, 101–188 (2005) 2. Izhikevich, E.M.: Simple Model of Spiking Neurons. IEEE Trans on Neural Networks 14(6), 1569–1572 (2003) 3. Izhikevich, E.M.: Which model to use for cortical spiking neurons? IEEE Trans on Neural Networks 15(5), 1063–1070 (2004) 4. Watts, L.: A Tour of NeuraLOG and Spike - Tools for Simulating Networks of Spiking Neurons (1993), http://www.lloydwatts.com/SpikeBrochure.pdf 5. Hindmarsh, J.L., Rose, R.M.: A model of the nerve impulse using two first order differential equations. Nature 296, 162–164 (1982) 6. Hindmarsh, J.L., Rose, R.M.: A model of neuronal bursting using three coupled first order differential equations. Proc. R. Soc. Lond. B. 221, 87–102 (1984) 7. Carroll, T.L.: Chaotic systems that are robust to added noise, CHAOS, 15, 013901 (2005) 8. Meunier, N., Narion-Poll, R., Lansky, P., Rospars, J.O.: Estimation of the Individual Firing Frequencies of Two Neurons Recorded with a Single Electrode. Chem. Senses 28, 671–679 (2003) 9. Yu, H., Liu, Y.: Chaotic synchronization based on stability criterion of linear systems. Physics Letters A 314, 292–298 (2003) 10. Marino, R.: Adaptive Observers for Single Output Nonlinear Systems. IEEE Trans. on Automatic Control 35(9), 1054–1058 (1990) 11. Narendra, K.S., Annaswamy, A.M.: Stable Adaptive Systems. Prentice Hall Inc., Englewood Cliffs (1989) 12. Arena, P., Fortuna, L., Frasca, M., Rosa, M.L.: Locally active Hindmarsh-Rose neurons. Chaos, Soliton and Fractals 27, 405–412 (2006) 13. Tokuda, I., Parlitz, U., Illing, L., Kennel, M., Abarbanel, H.: Parameter estimation for neuron models. In: Proc. of the 7th Experimental Chaos Conference (2002), http://www.physik3.gwdg.de/∼ ulli/pdf/TPIKA02 pre.pdf 14. Steur, E.: Parameter Estimation in Hindmarsh-Rose Neurons (2006), http://alexandria.tue.nl/repository/books/626834.pdf 15. Fujikawa, H., Mitsunaga, K., Suemitsu, H., Matsuo, T.: Parameter Estimation of Biological Neuron Models with Bursting and Spiking. In: Proc. of SICE-ICASE International Joint Conference 2006 CD-ROM, pp. 4487–4492 (2006)
Thouless-Anderson-Palmer Equation for Associative Memory Neural Network Models with Fluctuating Couplings
Akihisa Ichiki and Masatoshi Shiino
Department of Applied Physics, Faculty of Science, Tokyo Institute of Technology, 2-12-2 Ohokayama Meguro-ku Tokyo, Japan
Abstract. We derive Thouless-Anderson-Palmer (TAP) equations and order parameter equations for stochastic analog neural network models with fluctuating synaptic couplings. Such systems with finite number of neurons originally have no energy concept. Thus they defy the use of the replica method or the cavity method, which require the energy concept. However for some realizations of synaptic noise, the systems have the effective Hamiltonian and the cavity method becomes applicable to derive the TAP equations.
1
Introduction
The replica method [1] for random spin systems has been successfully employed in neural network models of associative memory to obtain the order parameters and the storage capacity [2], and the cavity method [3] has been employed to derive the Thouless-Anderson-Palmer (TAP) equations [4,5]. However, these techniques require the energy concept. On the other hand, various types of neural network models which have no energy concept, such as networks with temporally fluctuating synaptic couplings, may exist. The alternative approach to the replica method for deriving the order parameter equations, called the self-consistent signal-to-noise analysis (SCSNA), is closely related to the cavity concept in the case where networks have a free energy [6,7]. An advantage of applying the SCSNA to neural networks is that the energy concept is not required to derive the order parameter equations once the TAP equations are obtained. The SCSNA, which was originally proposed for deriving a set of order parameter equations for deterministic analog neural networks, becomes applicable to stochastic networks by noting that the TAP equations define the deterministic networks. Furthermore, the coefficients of the Onsager reaction terms characteristic to the TAP equations, which determine the form of the transfer functions in analog networks, are self-consistently obtained through the concept shared by the cavity method and the SCSNA. Thus the TAP equations as well as the order parameter equations are derived self-consistently by the hybrid use of the cavity method and the SCSNA in the case where the energy concept exists. However, the networks with synaptic noise, which have no energy concept, defy the use of the cavity method to derive the TAP equations. On the other hand, as in [8], the network with a specific
involvement of synaptic noise can be analyzed by the cavity method to derive the TAP equation in the thermodynamic limit, since the energy concept appears as an effective Hamiltonian in this limit. It is natural to consider such neural network models with fluctuating synaptic couplings, since the synaptic couplings in real biological systems are updated by learning rules and the time-sequence of the synaptic couplings may be stochastic under the influence of noisy external stimuli. Thus the study on such networks is required to understand the retrieval process of realistic networks. The aim of this paper is two-fold: (i) we will investigate for which realizations of synaptic noise the networks have the energy concept, so that the cavity method can be applied to derive the TAP equations; (ii) we will show the TAP equations for the networks when the concept of the effective Hamiltonian appears. This paper is organized as follows: in the next section, we will briefly review how the energy concept appears in the network with synaptic noise and derive the TAP equations and the order parameter equations by using the cavity method and the SCSNA [8]. Once the effective Hamiltonian is found, the replica method is also applicable to derive the order parameter equations. However, in the present paper, to make clear the relationship between the TAP equations and the order parameter equations, we do not use the replica trick. In section 3, we will investigate the cases where the energy concept appears in the networks with synaptic noise. We will see that the TAP equations and the order parameter equations for some models can be derived in the framework mentioned in section 2. We will also mention that some difficulties in deriving the TAP equations arise in some models with other involvements of synaptic noise. In the last section, we will make discussions on the structure of the TAP equations for the network with temporally fluctuating synaptic noise and conclude this paper.
2
Brief Review on Effective Hamiltonian, TAP Equations and Order Parameter Equations
In this section, we briefly review how the cavity method becomes applicable to the network with fluctuating synaptic couplings [8]. Then we derive the TAP equations and the order parameter equations self-consistently in the framework of the SCSNA. We deal with the following stochastic analog neural network of N neurons with temporally fluctuating synaptic noise (multiplicative noise):

\dot{x}_i = -\phi'(x_i) + \sum_{j (\neq i)} J_{ij}(t) x_j + \eta_i(t),   (1)

\langle \eta_i(t) \eta_j(t') \rangle = 2 D \delta_{ij} \delta(t - t'),   (2)
where x_i (i = 1, ..., N) represents a state of the neuron at site i taking a continuous value, φ(x_i) is a potential of an arbitrary form which determines the probability distribution of x_i in the case without the input \sum_{j (\neq i)} J_{ij} x_j, η_i the Langevin white noise with its noise intensity 2D and J_{ij}(t) the synaptic coupling.
We note here that, in the case of associative memory neural network, the synaptic coupling Jij is usually defined by the well-known Hebb learning rule. However, in the present paper, we will deal with the coupling Jij fluctuating around the Hebb rule with a white noise: Jij (t) = J¯ij + ij (t), ˜ 2D ij (t)kl (t ) = δik δjl δ(t − t ), N
(3) (4)
where \bar{J}_{ij} is defined by the usual Hebb learning rule \bar{J}_{ij} \equiv \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu with p = αN the number of patterns embedded in the network, \xi_i^\mu = ±1 is the μth embedded pattern at neuron i, and \epsilon_{ij}(t) denotes the synaptic noise independent of η_i(t), which we assume in the present model as a white noise with its intensity 2\tilde{D}/N. Using the Ito integral, we obtain the Fokker-Planck equation corresponding to the Langevin equation (1) as

\frac{\partial P(t, x)}{\partial t} = -\sum_{i=1}^{N} \frac{\partial}{\partial x_i} \left\{ -\phi'(x_i) + \sum_{j (\neq i)} \bar{J}_{ij} x_j - \left( D + \tilde{D} \hat{q} \right) \frac{\partial}{\partial x_i} \right\} P(t, x),   (5)

where \hat{q} \equiv \frac{1}{N} \sum_{j (\neq i)} x_j^2. Since the self-averaging property holds in the thermodynamic limit N → ∞, \hat{q} is identified as

\hat{q} = \frac{1}{N} \sum_{i=1}^{N} x_i^2.   (6)
The order parameter \hat{q} is obtained self-consistently in our framework as seen below. Supposing \hat{q} is given, one can easily find the equilibrium probability density for the Fokker-Planck equation (5) as

P_N(x) = Z^{-1} \exp\left[ -\beta_{\mathrm{eff}} \left( \sum_{i=1}^{N} \phi(x_i) - \sum_{i<j} \bar{J}_{ij} x_i x_j \right) \right],   (7)

where Z denotes the normalization constant and

\beta_{\mathrm{eff}}^{-1} \equiv D + \tilde{D} \hat{q}   (8)

plays the role of the effective temperature of the network. The temperature of the system is modified as a consequence of the multiplicative noise and it depends on the order parameter \hat{q}. Notice here that the equilibrium distribution of the system becomes the Gibbs distribution in the thermodynamic limit N → ∞. The equilibrium solution for equation (5) in the finite N-body system indeed differs from the probability density (7). However, since \frac{1}{N}\sum_{j (\neq i)} x_j^2 - \frac{1}{N}\sum_{j} x_j^2 = O(1/\sqrt{N}), the difference between the probability densities in the finite N-body system P_N and in the thermodynamic limit P_{N\to\infty} is P_{N\to\infty} - P_N = O(1/\sqrt{N}).
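These quantities are easy to probe numerically. The sketch below integrates Eq. (1) by the Euler–Maruyama scheme with couplings fluctuating around the Hebb rule as in Eqs. (3)-(4), and then evaluates the empirical q̂ of Eq. (6) and the effective inverse temperature of Eq. (8). The confining potential φ(x) = x²/2 + x⁴/4, the network size and the noise intensities are purely illustrative assumptions (the text leaves φ arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_network(N=100, alpha=0.05, D=0.1, D_tilde=0.05, dt=1e-3, t_end=10.0):
    """Euler-Maruyama integration of Eq. (1) with synaptic couplings fluctuating
    around the Hebbian values, cf. Eqs. (3)-(4)."""
    p = max(1, int(alpha * N))
    xi = rng.choice([-1.0, 1.0], size=(p, N))            # embedded patterns
    J_bar = xi.T @ xi / N
    np.fill_diagonal(J_bar, 0.0)                         # no self-coupling
    x = xi[0].copy()                                     # start near the first pattern
    for _ in range(int(t_end / dt)):
        drift = -(x + x**3) + J_bar @ x                  # phi'(x) = x + x^3 (illustrative)
        # eps_ij integrated over dt has standard deviation sqrt(2*D_tilde*dt/N) (Ito)
        eps = np.sqrt(2.0 * D_tilde * dt / N) * rng.standard_normal((N, N))
        np.fill_diagonal(eps, 0.0)
        x = x + dt * drift + eps @ x + np.sqrt(2.0 * D * dt) * rng.standard_normal(N)
    q_hat = np.mean(x**2)                                # Eq. (6)
    beta_eff = 1.0 / (D + D_tilde * q_hat)               # Eq. (8)
    return q_hat, beta_eff

print(simulate_network())
```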
Thus one can conclude that the equilibrium density for equation (5) converges to the probability density (7) in the thermodynamic limit N → ∞. Since we have explicitly written down the equilibrium probability density (7) as a Gibbsian form, one can define the effective Hamiltonian of the (sufficiently large) N-body system as

H_N \equiv \sum_{i=1}^{N} \phi(x_i) - \sum_{i<j} \bar{J}_{ij} x_i x_j.   (9)
Since we have found the effective Hamiltonian and the effective temperature, one can apply the usual cavity method [3] to this system and derive the TAP equation. According to the cavity method, we divide the Hamiltonian of the N-body system (9) into that of the (N−1)-body system and the part involving the state of the ith neuron as H_N = \phi(x_i) - h_i x_i + H_{N-1}, where h_i \equiv \sum_{j (\neq i)} \bar{J}_{ij} x_j is the local field at site i and the Hamiltonian of the (N−1)-body system is given as H_{N-1} \equiv \sum_{j (\neq i)} \phi(x_j) - \sum_{j<k (j,k \neq i)} \bar{J}_{jk} x_j x_k.

… (u > 0) and x = P for the potentiation part (u < 0), cf. [2, Sec. 5] for details and the parameter values.
structures) and it can be linked to the variability observed in physiological data; the other is due to the additional impact of the background activity upon the generation of the spike output of each neuron (intrinsic stochasticity modelled by the Poisson process).
3 Theoretical Analysis
3.1 Characterisation of the Neural Activity
We consider a network of N Poisson neurons (referred to as internal neurons) stimulated with M Poisson pulse trains (or external inputs), as shown in Fig. 2. The activity of the neural network can be described using the firing rates and the pairwise correlations, plus the weights. In the terminology of the theory of dynamical systems, these variables characterise the “state” of the network activity at each time t and their evolution is referred to as neural dynamics. Similar to [7,2], we consider the time-averaged firing rates ν_i(t) (for the internal neuron indexed by i; T is a given time period)

\nu_i(t) \triangleq \frac{1}{T} \int_{t-T}^{t} \langle S_i(t') \rangle \, dt'   (3)

where S_i(t) is the spike-time series of the ith neuron and the brackets ⟨...⟩ denote the ensemble averaging. Likewise for the correlation coefficients (time-averaged correlation function convoluted with the STDP window function W): D^W_{ik}(t) between the ith internal neuron and the kth external input,

D^W_{ik}(t) \triangleq \frac{1}{T} \int_{t-T}^{t} \int_{-\infty}^{+\infty} W(u) \, \langle S_i(t') \hat{S}_k(t'+u) \rangle \, dt' \, du   (4)

and Q^W_{ij}(t) between the ith and jth internal neurons (with S_i and S_j) [2, Sec. 3].
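In practice these time-averaged quantities can be estimated directly from recorded spike times. The sketch below does so for a pair of pulse trains with a simple exponential STDP window; the window shape and constants are illustrative assumptions (the actual W used in this work is the one defined in [2]), chosen so that u < 0 gives potentiation and u > 0 depression, consistent with the window convention quoted in the Fig. 1 caption.

```python
import numpy as np

def stdp_window(u, A_p=1.0, A_d=-0.6, tau_p=17.0, tau_d=34.0):
    """Illustrative exponential STDP window W(u): potentiation for u < 0, depression for u > 0."""
    return np.where(u < 0.0, A_p * np.exp(u / tau_p), A_d * np.exp(-u / tau_d))

def time_averaged_rate(spikes, t, T):
    """Time-averaged firing rate over [t - T, t], cf. Eq. 3."""
    spikes = np.asarray(spikes)
    return np.count_nonzero((spikes > t - T) & (spikes <= t)) / T

def correlation_coefficient(spikes_i, spikes_k, t, T, u_max=200.0):
    """Estimate D^W_ik(t) of Eq. 4 from one realisation: sum W(u) over all spike pairs
    with u = t_k - t_i and |u| <= u_max, for spikes of neuron i in [t - T, t]."""
    si = np.asarray(spikes_i)
    si = si[(si > t - T) & (si <= t)]
    sk = np.asarray(spikes_k)
    total = 0.0
    for s in si:
        u = sk[np.abs(sk - s) <= u_max] - s
        total += np.sum(stdp_window(u))
    return total / T

rng = np.random.default_rng(2)
train_k = np.sort(rng.uniform(0.0, 1000.0, 30))        # external input spike times (ms)
train_i = np.sort(train_k + rng.normal(3.0, 1.0, 30))  # internal neuron fires ~3 ms later
print(time_averaged_rate(train_i, t=1000.0, T=1000.0))
print(correlation_coefficient(train_i, train_k, t=1000.0, T=1000.0))
```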
Fig. 2. Presentation of the network and the notation. The internal neurons (within the network) are indexed by i ∈ [1..N ], and their output pulse trains are denoted Si (t) (it can be understood as a sum of “Dirac functions” at each spiking time [2, Sec. 2]). Likewise for the external input pulse trains Sˆk (t) (k ∈ [1..M ]). The time-averaged firing rates are denoted by νi (t), cf. Eq. 3; the correlation coefficients within the network by Qij (t) (resp. Dik (t) between a neuron in the network and an external input; and ˆ kl (t) between two external inputs), cf. Eq. 4. The weight of the connection from the Q kth external input onto the ith internal neuron is denoted Kik (t) (resp. Jij (t) from the j th internal neuron onto the ith internal neuron).
The stimulation parameters (which represent the “information” carried in the external inputs) are determined by the time-averaged input spiking rates (\hat{\nu}_k(t), defined similarly to ν_i(t)) and their correlation coefficients (\hat{Q}^W_{kl}(t), defined similarly to D^W_{ik}(t) and Q^W_{ij}(t)) [2, Sec. 3]. In this paper, we only consider stimulation parameters that are constant in time.
3.2 Learning Equations
Learning equations can be derived from Eq. 2 as in Kempter et al. [7]. This requires the assumption that the internal pulse trains are statistically independent (this can be considered valid when the number of recurrent connections is large enough) and a small learning rate η. This leads to the matrix equation Eq. 10 for the weights between internal neurons (resp. to Eq. 9 for the input weights K).
3.3 Activation Dynamics
In order to study the evolution of the weights described by Eqs. 9 and 10, we need to evaluate the neuron time-average firing rates (the vector ν(t)) and their time-average correlation coefficients (the matrices DW (t) and QW (t)). Similar to Kempter et al. [7] and Burkitt et al. [2], we approximate the instantaneous firing rate Si (t) of the ith Poisson neuron by its expected inhomogeneous Poisson parameter ρ(t) (cf. Eq. 1) and we neglect the impact of the short-time dynamics (the synaptic response kernel and the synaptic delays dˆik and dij ) by using time averaged variables (over a “long” period T ). We require that the
learning occurs slowly compared to the activation mechanisms (cf. the neuron and synapse models) so that T is large compared to the time scale of these mechanisms but small compared to the inverse of the learning parameter η⁻¹. This leads to the consistency matrix equations of the firing rates (Eq. 6) and of the correlation coefficients (Eqs. 7 and 8). See Burkitt et al. [2, Sec. 3] for details of the derivation. Note that the consistency equations of the correlation coefficients as defined in [2, Sec. 3 and 4] have been reformulated to express the usual covariance using the assumption that the correlations are quasi-constant in time [5] (this implies that \hat{Q}^{VT} [2, Sec. 3] is actually equal to \hat{Q}^W). Eqs. 7 and 8 express the impact of the connectivity (through the term [I − J]⁻¹K) on the internal firing rates and the cross covariances in terms of the input covariance \hat{C}^W,

\hat{C}^W \triangleq \hat{Q}^W - \bar{W} \hat{\nu} \hat{\nu}^T,   (5)

where \bar{W} \triangleq \int W(u)\, du evaluates the balance between the potentiation and the depression of our STDP rule. These equations are obtained by combining the equations in [2] with the firing-rate consistency equation Eq. 6.
3.4 Network Dynamical System
Putting everything together, the network dynamics is described by
\nu = [I - J]^{-1} (\nu_0 E + K \hat{\nu})   (6)

D^W - \bar{W} \nu \hat{\nu}^T = [I - J]^{-1} K (\hat{Q}^W - \bar{W} \hat{\nu} \hat{\nu}^T)   (7)

Q^W - \bar{W} \nu \nu^T = [I - J]^{-1} K (\hat{Q}^W - \bar{W} \hat{\nu} \hat{\nu}^T) K^T [I - J]^{-T}   (8)

\frac{dK}{dt} = \Phi_K \left( w^{\mathrm{in}} E \hat{\nu}^T + w^{\mathrm{out}} \nu \hat{E}^T + D^W \right)   (9)

\frac{dJ}{dt} = \Phi_J \left( w^{\mathrm{in}} E \nu^T + w^{\mathrm{out}} \nu E^T + Q^W \right),   (10)

where E is the unit vector of N elements (likewise for Ê with M elements); Φ_J is a projector on the space of N × N matrices to which J belongs [2, Sec. 3]. The effect of Φ_J is to nullify the coefficients corresponding to a missing connection in the network, viz. all the diagonal terms because the self-connection of a neuron onto itself is forbidden. More generally, such a projection operator can account for any network connectivity [2, Sec. 3]. Note that time has been rescaled in order to remove η from these equations, and for simplicity of notation the dependence on time t will be omitted in the rest of this paper. The matrix [I − J(t)] is assumed invertible at all times (the contrary would imply a diverging behaviour of the firing rates [2, Sec. 4]).
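The resulting dynamical system is straightforward to integrate numerically. The sketch below does this for the fixed-input-weight case studied next (K constant, J plastic): Eq. 6 gives the rates, Eqs. 5 and 8 the correlation term, and Eq. 10 advances J, with Φ_J implemented by zeroing the diagonal. All sizes, rates, STDP constants and the weight bounds used to keep [I − J] invertible are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_recurrent_stdp(N=30, M=20, steps=10000, dt=1e-4,
                            nu0=5.0, w_in=1.0, w_out=-0.9, W_bar=-0.002):
    """Euler integration of Eqs. 6, 8 and 10 with fixed input weights K."""
    nu_hat = np.full(M, 10.0)                       # external input rates
    C_hat = np.zeros((M, M))
    C_hat[:M // 2, :M // 2] = 0.5                   # correlated input pool #1, cf. Fig. 3
    K = np.full((N, M), 0.1)                        # fixed input weights
    J = 0.01 * (1.0 + 0.1 * rng.standard_normal((N, N)))
    np.fill_diagonal(J, 0.0)
    E = np.ones(N)
    for _ in range(steps):
        inv = np.linalg.inv(np.eye(N) - J)
        nu = inv @ (nu0 * E + K @ nu_hat)                                  # Eq. 6
        QW = W_bar * np.outer(nu, nu) + inv @ K @ C_hat @ K.T @ inv.T      # Eqs. 5 and 8
        dJ = w_in * np.outer(E, nu) + w_out * np.outer(nu, E) + QW         # Eq. 10
        np.fill_diagonal(dJ, 0.0)                                          # projector Phi_J
        J = np.clip(J + dt * dJ, 0.0, 0.03)   # bounds keep the spectral radius of J below 1
        np.fill_diagonal(J, 0.0)
    return nu, J

nu, J = simulate_recurrent_stdp()
print(nu.mean(), J.mean())   # with these illustrative numbers the means drift toward the
                             # homeostatic regime discussed in Sect. 4
```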
4
Recurrent Network with Fixed Input Weights
We now examine the case of a recurrently connected network with fixed input weights K and learning on the recurrent weights J. Only Eqs. 6, 8 and 10 remain,
which simplifies the analysis. In the case of full recurrent connectivity, Φ_J in Eq. 10 only nullifies the diagonal terms of the square matrix in its argument.
4.1 Analytical Predictions
Homeostatic equilibrium. Similarly to [7,2], we derive the scalar equations of the mean firing rate νav N −1 i νi and weight Jav (N (N − 1))−1 i=j Jij . It consists of neglecting the inhomogeneities of the firing rates and of the weights over the network, as well as of the connectivity, and we obtain ν0 + (K νˆ)av 1 − (N − 1)Jav
in +C ν2 , = w + wout νav + W av
νav = J˙av
(11)
is defined as where (K νˆ)av denotes the mean of the matrix product K νˆ and C ˆW T (K C K )av . C 2 [ν0 + (K νˆ)av ]
(12)
Eqs. 11 are thus the same equations as for the case with no input [2, Sec. 5], with ν_0 and W̃ replaced by, respectively, ν_0 + (Kν̂)_av and W̃ + C̃. This qualitatively implies the same dynamical behaviour: provided that W̃ + C̃ < 0, the network exhibits a homeostatic equilibrium (when it is realisable; in particular it requires μ > 0), and the means of the firing rates and of the weights converge towards
\[ \nu_{av}^{*} = \mu \;\triangleq\; -\frac{w^{\mathrm{in}} + w^{\mathrm{out}}}{\tilde{W} + \tilde{C}}, \qquad J_{av}^{*} = \frac{\mu - \nu_0 - (K\hat{\nu})_{av}}{(N-1)\,\mu}. \tag{13} \]
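As a rough illustration (not part of the original analysis), the following sketch Euler-integrates the averaged dynamics of Eq. 11 and compares the result with the equilibrium of Eq. 13; all numerical values are invented for the example and chosen so that W̃ + C̃ < 0 and the equilibrium is realisable (μ > 0, J*_av > 0).

```python
# Minimal sketch of the averaged dynamics (Eq. 11); parameter values are illustrative.
N = 30
nu0, K_nu_av = 5.0, 5.0           # spontaneous rate and mean input drive (K nu^)_av
w_in, w_out = 1e-4, 1e-4          # rate-based terms of the learning rule
W_plus_C = -1e-5                  # W~ + C~ (negative, as required for stability)

mu = -(w_in + w_out) / W_plus_C                        # Eq. 13: nu*_av
J_star = (mu - nu0 - K_nu_av) / ((N - 1) * mu)         # Eq. 13: J*_av

J_av, dt = 0.0, 0.1
for _ in range(20000):
    nu_av = (nu0 + K_nu_av) / (1.0 - (N - 1) * J_av)           # Eq. 11, firing rate
    J_av += dt * ((w_in + w_out) * nu_av + W_plus_C * nu_av**2)  # Eq. 11, weight drift

print(f"nu_av -> {nu_av:.2f}   (predicted mu    = {mu:.2f})")
print(f"J_av  -> {J_av:.4f} (predicted J*_av = {J_star:.4f})")
```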
If the input correlations are time-invariant functions, positive (i.e., the inputs are more likely to fire in a synchronous way) and homogeneous among the correlated input pool, it follows that C̃ is of the same sign as W̃. Therefore, the condition for stability reverts to W̃ < 0, as in [7,2].
Structural dynamics. For the particular case of uncorrelated inputs (or, more generally, when K Ĉ^W K^T = 0), the fixed-point structure of the dynamical system is qualitatively the same as for the case of no external input [2, Sec. 5]: a homogeneous distribution of the firing rates and a continuous manifold of fixed points for the internal weights. In the case of "spatial" inhomogeneities over the input correlations, the network dynamics shows a different evolution. To illustrate this and to compare it with the case of feed-forward connections, we consider the network configuration described in Fig. 3 and inspired by [7], where one input pool has correlated sources while the other pool has uncorrelated sources.
Fig. 3. Architecture of the simulated network. The inputs are divided into two subpools, each feeding half of the internal network with means K1 and K2 for the input weights from each subpool. Similarly to the recurrent weights J and the firing rates νˆ and ν, the inhomogeneities are neglected within each subpool and they are assumed to be all identical to their mean. The weights and delays are initially set with 10% random variation around a mean.
In general, the equations that determine the fixed points cannot be solved exactly. Yet, we can reduce the dimensionality by neglecting the variance within each subpool and make approximations to evaluate the asymptotic distribution of the firing rates, which turns out to be bimodal.
4.2 Simulation Protocol and Results
We simulated a network of Poisson neurons as described in Fig. 3 with random initial recurrent weights (uniformly distributed in a given interval, as are all the synaptic delays). An issue with such simulations is to maintain positive internal weights during their homeostatic convergence, because they individually diverge. Thus, not all equilibria are realisable, depending on the initial distributions of K and J and the weight bounds [7,2]. See [2, Sec. 5] for details about the simulation parameters. An interesting first case to test the analytical predictions consists of two pools of uncorrelated inputs, each feeding half of the network with distinct weights (K₁ν̂₁ ≠ K₂ν̂₂ and Ĉ^W = 0). Each half of the internal network thus has distinct firing rates initially and, as predicted, the outcome is a convergence of these firing rates towards a uniform value (similar to the case of no external input [2, Sec. 5]). In the case of a fully connected network (for both K and J, according to Fig. 3) stimulated by one correlated input pool (short-time correlation inspired by [4], so that Ĉ^W ≠ 0) and one uncorrelated pool, both the internal firing rates and the internal weights also exhibit a homeostatic equilibrium. As shown in Fig. 4, the means over the network (thick solid lines) converge towards the predicted equilibrium values (dashed lines). Furthermore, the individual firing rates tend to stabilise and their distribution remains bimodal (the subpool #1, excited by correlated inputs, eventually fires at a lower rate when W̃ < 0). The recurrent weights individually diverge, similarly to [2, Sec. 5], and reorganise so that the outgoing weights from the subpool #1 (see the means over each weight subpool,
Fig. 4. Evolution of the firing rates (left) and of the recurrent weights (right) for N = 30 fully-connected Poisson neurons (cf. Fig.3) with short-time correlated inputs. The outcome is a quasi bimodal distribution of the firing rates (the grey bundle, the mean is the thick solid line) around the predicted homeostatic equilibrium (dashed line). The subgroup #1 that receives correlated inputs is more excited initially but fires at a lower rate at the end of the simulation (cf. the two thin black solid lines which represent the mean over each subpool). The internal weights individually diverge, while their mean (thick line) converges towards the predicted equilibrium value (dashed line). They reorganise themselves so that the weights outgoing from the subpool #2 (that receives uncorrelated inputs) become silent, while the ones from #1 are strengthened. Note that the homeostatic equilibrium is preserved even when some weights saturate.
Fig. 5. Evolution of the firing rates for a partially connected network of N = 75 neurons. Both the K and the J have 40% probability of connection with the same setup as the network in Fig. 4 (to preserve the total input synaptic strength). The mean firing rate (very thick line) still converges towards the predicted equilibrium value (dashed line) and the two subgroups (grey bundles, each mean represented by a thin black solid line) get separated similarly to the case of full connectivity. The internal weights (not shown) exhibit similar dynamics as for the case of full connectivity.
J11 and J21 in Fig. 4) are strengthened while the other ones almost become silent (see J12 and J22 ). In other words, the subpool that receives correlated inputs takes the upper hand in the recurrent architecture.
Fig. 6. Simulation of networks of IF neurons with partial random connectivity of 50%. The network qualitatively exhibits the expected behaviour in the case of uncorrelated inputs (left) and inputs with one correlated pool and one uncorrelated pool (right).
In the case of partial connectivity for both K and J, the behaviour of the individual firing rates (cf. Fig. 5) still follows the predictions, but they are more dispersed, their convergence is slower, and the bimodal distribution is not always observed as clearly as in the case of full connectivity (in Fig. 5 the means of the two internal neuron subpools nevertheless remain clearly separated). The homeostatic equilibrium of the internal weights also holds, and they individually diverge. The partial connectivity needs to be rich enough for the predictions of the mean variables to be sufficiently accurate. First results with IF neurons suggest that the analytical predictions remain valid even if the activation mechanisms are more complex (here with a connectivity of 50%, cf. Fig. 6).
5 Discussion and Future Work
The analytical results presented here are preliminary, and further investigation is needed to gain a better understanding of the interplay between the input correlation structure and STDP. Nevertheless, our results illustrate two points: STDP induces stable activity in recurrent architectures similar to that observed for feed-forward ones (homeostatic regulation of the network activity under the condition W̃ < 0); and the qualitative structure of the internal firing rates is mainly determined by the input correlation structure. Namely, a "poor" correlation structure (uncorrelated or delta-correlated inputs, so that Ĉ^W = 0) induces a homogenisation of the firing activity. Finally, partial connectivity impacts upon the structure of the internal firing rates, but such networks still exhibit behaviour similar to fully connected ones. Preliminary results involving more complex patterns of K Q̂^W K^T suggest a richer interplay between the input correlation structure and the equilibrium distribution of the network firing rates and weights. Such cases are under investigation and may constitute more "interesting" dynamic behaviour of the network from a cognitive-modelling point of view, namely through the
relationship between the attractors of the network activity and the input structure. The case of learning for both the input connections and the recurrent ones will also form part of a future study. Comparison with IF neurons suggests that the impact of the neuron activation mechanisms on the weight dynamics may not be significant. It can be linked to the separation of the time scales between them in the case of slow learning. Note that with our approximations the IF neurons are assumed to be in “linear input-output regime” (no bursting for instance).
Acknowledgments The authors thank Iven Mareels, Chris Trengove, Sean Byrnes and Hamish Meffin for useful discussions that introduced significant improvements. MG is funded by two scholarships from The University of Melbourne and NICTA. ANB and DBG acknowledge funding from the Australian Research Council (ARC Discovery Projects #DP0453205 and #DP0664271) and The Bionic Ear Institute.
References 1. Bi, G.Q., Poo, M.M.: Synaptic modification by correlated activity: Hebb’s postulate revisited. Annual Review of Neuroscience 24, 139–166 (2001) 2. Burkitt, A.N., Gilson, M., van Hemmen, J.L.: Spike-timing-dependent plasticity for neurons with recurrent connections. Biological Cybernetics 96, 533–546 (2007) 3. Gerstner, W., Kempter, R., van Hemmen, J.L., Wagner, H.: A neuronal learning rule for sub-millisecond temporal coding. Nature 383, 76–78 (1996) 4. Gutig, R., Aharonov, R., Rotter, S., Sompolinsky, H.: Learning input correlations through nonlinear temporally asymmetric hebbian plasticity. Journal of Neuroscience 23, 3697–3714 (2003) 5. Hawkes, A.G.: Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society Series B-Statistical Methodology 33, 438–443 (1971) 6. Hebb, D.O.: The organization of behavior: a neuropsychological theory. Wiley, Chichester (1949) 7. Kempter, R., Gerstner, W., van Hemmen, J.L.: Hebbian learning and spiking neurons. Physical Review E 59, 4498–4514 (1999) 8. van Rossum, M.C.W., Bi, G.Q., Turrigiano, G.G.: Stable hebbian learning from spike timing-dependent plasticity. Journal of Neuroscience 20, 8812–8821 (2000)
A Comparative Study of Synchrony Measures for the Early Detection of Alzheimer's Disease Based on EEG Justin Dauwels, François Vialatte, and Andrzej Cichocki RIKEN Brain Science Institute, Saitama, Japan
[email protected], {fvialatte,cia}@brain.riken.jp
Abstract. It has repeatedly been reported in the medical literature that the EEG signals of Alzheimer's disease (AD) patients are less synchronous than those of age-matched control patients. This phenomenon, however, does not at present allow AD to be reliably predicted at an early stage, so-called mild cognitive impairment (MCI), due to the large variability among patients. In recent years, many novel techniques to quantify EEG synchrony have been developed; some of them are believed to be more sensitive to abnormalities in EEG synchrony than traditional measures such as the cross-correlation coefficient. In this paper, a wide variety of synchrony measures is investigated in the context of AD detection, including the cross-correlation coefficient, the mean-square and phase coherence function, Granger causality, the recently proposed corr-entropy coefficient and two novel extensions, phase synchrony indices derived from the Hilbert transform and time-frequency maps, information-theoretic divergence measures in the time domain and time-frequency domain, state space based measures (in particular, non-linear interdependence measures and the S-estimator), and finally, the recently proposed stochastic-event synchrony measures. For the data set at hand, only two synchrony measures are able to convincingly distinguish MCI patients from age-matched control patients (p < 0.005), i.e., Granger causality (in particular, the full-frequency directed transfer function) and stochastic event synchrony (in particular, the fraction of non-coincident activity). Combining those two measures with additional features may eventually yield a reliable diagnostic tool for MCI and AD.
1 Introduction
Many studies have shown that the EEG signals of AD patients are generally less coherent than those of age-matched control patients (see [1] for an in-depth review). It is noteworthy, however, that this effect is not always easily detectable: there tends to be a large variability among AD patients. This is especially the case for patients in the pre-symptomatic phase, commonly referred to as Mild Cognitive Impairment (MCI), during which neuronal degeneration occurs prior to the appearance of clinical symptoms. On the other hand, it is crucial to predict AD at
an early stage: medications that aim at delaying the effects of AD (and hence intend to improve the quality of life of AD patients) are most effective if applied in the pre-symptomatic phase. In recent years, a large variety of measures has been proposed to quantify EEG synchrony (we refer to [2]–[5] for recent reviews on EEG synchrony measures); some of those measures are believed to be more sensitive to perturbations in EEG synchrony than classical indices such as the cross-correlation coefficient or the coherence function. In this paper, we systematically investigate the state of the art in measuring EEG synchrony, with special focus on the detection of AD in its early stages. (A related study has been presented in [6,7] in the context of epilepsy.) We consider various synchrony measures stemming from a wide spectrum of disciplines, such as physics, information theory, statistics, and signal processing. Our aim is to investigate which measures are the most suitable for detecting the effect of synchrony perturbations in MCI and AD patients; we also wish to better understand which aspects of synchrony are captured by the different measures, and how the measures are related to each other. This paper is structured as follows. In Section 2 we review the synchrony measures considered in this paper. In Section 3 those measures are applied to EEG data, in particular for the purpose of detecting MCI; we describe the EEG data set, elaborate on various implementation issues, and present our results. At the end of the paper, we briefly relate our results to earlier work and speculate about the neurophysiological interpretation of our results.
2 Synchrony Measures
We briefly review the various families of synchrony measures investigated in this paper: the cross-correlation coefficient and its analogues in the frequency and time-frequency domains, Granger causality, phase synchrony, state space based synchrony, information-theoretic interdependence measures, and finally, stochastic-event synchrony measures, which we developed in recent work.
2.1 Cross-Correlation Coefficient
The cross-correlation coefficient r is perhaps one of the most well-known measures for (linear) interdependence between two signals x and y. If x and y are not linearly correlated, r is close to zero; on the other hand, if both signals are identical, then r = 1 [8].
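For illustration, a minimal computation of r on synthetic signals (the data and variable names below are placeholders, not EEG from the study):

```python
import numpy as np

# Pearson cross-correlation coefficient r between two signals x and y.
rng = np.random.default_rng(0)
x = rng.standard_normal(4000)                  # e.g. 20 s at 200 Hz
y = 0.6 * x + 0.8 * rng.standard_normal(4000)  # partially correlated signal

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.2f}")   # ~0 for unrelated signals, 1 for identical signals
```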
2.2 Coherence
The coherence function quantifies linear correlations in frequency domain. One distinguishes the magnitude square coherence function c(f ) and the phase coherence function φ(f ) [8].
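A minimal estimate of the magnitude squared coherence using Welch's method in SciPy is sketched below; the sampling rate, window length and synthetic signals are illustrative choices, not those of the study.

```python
import numpy as np
from scipy.signal import coherence

fs = 200.0                                     # sampling rate (Hz), assumed
rng = np.random.default_rng(1)
t = np.arange(0, 20, 1 / fs)
common = np.sin(2 * np.pi * 10 * t)            # shared 10 Hz component
x = common + 0.5 * rng.standard_normal(t.size)
y = common + 0.5 * rng.standard_normal(t.size)

f, c = coherence(x, y, fs=fs, nperseg=512)     # magnitude squared coherence c(f)
print(f"c(f) peaks at {f[np.argmax(c)]:.1f} Hz with value {c.max():.2f}")
```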
2.3 Corr-Entropy Coefficient
The corr-entropy coefficient rE is a recently proposed [9] non-linear extension of the correlation coefficient r; it is close to zero if x and y are independent (which is stronger than being uncorrelated).
2.4 Coh-Entropy and Wav-Entropy Coefficient
One can define a non-linear magnitude square coherence function, which we will refer to as the "coh-entropy" coefficient cE(f); it is an extension of the corr-entropy coefficient to the frequency domain. The corr-entropy coefficient rE can also be extended to the time-frequency domain, by replacing the signals x and y in the definition of rE by their time-frequency ("wavelet") transforms. In this paper, we use the complex Morlet wavelet, which is known to be well-suited for EEG signals [10]. The resulting measure is called the "wav-entropy" coefficient wE(f). (To our knowledge, both cE(f) and wE(f) are novel.)
2.5 Granger Causality
Granger causality¹ refers to a family of synchrony measures that are derived from linear stochastic models of time series; like the above linear interdependence measures, they quantify to which extent different signals are linearly interdependent. Whereas the above linear interdependence measures are bivariate, i.e., they can only be applied to pairs of signals, Granger causality measures are multivariate: they can be applied to multiple signals simultaneously. Suppose that we are given n signals X1(k), X2(k), . . . , Xn(k), each stemming from a different channel. We consider the multivariate autoregressive (MVAR) model:
\[ X(k) = \sum_{j=1}^{p} A(j)\, X(k-j) + E(k), \tag{1} \]
where X(k) = (X1(k), X2(k), . . . , Xn(k))^T, p is the model order, the model coefficients A(j) are n × n matrices, and E(k) is a zero-mean Gaussian random vector of size n. In words: each signal Xi(k) is assumed to depend linearly on its own p past values and the p past values of the other signals Xj(k). The deviation between X(k) and this linear dependence is modeled by the noise component E(k). Model (1) can also be cast in the form:
\[ E(k) = \sum_{j=0}^{p} \tilde{A}(j)\, X(k-j), \tag{2} \]
¹ The Granger causality measures we consider here are implemented in the BioSig library, available from http://biosig.sourceforge.net/
where Ã(0) = I (identity matrix) and Ã(j) = −A(j) for j > 0. One can transform (2) into the frequency domain (by applying the z-transform and substituting z = e^{−2πif∆t}, where 1/∆t is the sampling rate):
\[ X(f) = \tilde{A}^{-1}(f)\, E(f) = H(f)\, E(f). \tag{3} \]
The power spectrum matrix of the signal X(k) is determined as
\[ S(f) = X(f)\, X(f)^{*} = H(f)\, V\, H^{*}(f), \tag{4} \]
where V stands for the covariance matrix of E(k). The Granger causality measures are defined in terms of the coefficients of the matrices A, H, and S. Due to space limitations, only a short description of these methods is provided here; additional information can be found in the existing literature (e.g., [4]). From these coefficients, two symmetric measures can be defined:
– Granger coherence |Kij(f)| ∈ [0, 1] describes the amount of in-phase components in signals i and j at the frequency f.
– Partial coherence (PC) |Cij(f)| ∈ [0, 1] describes the amount of in-phase components in signals i and j at the frequency f when the influence (i.e., linear dependence) of the other signals is statistically removed.
The following asymmetric ("directed") Granger causality measures capture causal relations:
– Directed transfer function (DTF) γ²ij(f) quantifies the fraction of inflow to channel i stemming from channel j.
– Full-frequency directed transfer function (ffDTF)
\[ F_{ij}^{2}(f) = \frac{|H_{ij}(f)|^{2}}{\sum_{f}\sum_{j=1}^{m}|H_{ij}(f)|^{2}} \in [0,1], \tag{5} \]
is a variation of γ²ij(f) with a global normalization in frequency.
– Partial directed coherence (PDC) |Pij(f)| ∈ [0, 1] represents the fraction of outflow from channel j to channel i.
– Direct directed transfer function (dDTF) χ²ij(f) = F²ij(f) C²ij(f) is non-zero if the connection between channels i and j is causal (non-zero F²ij(f)) and direct (non-zero C²ij(f)).
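The study itself relies on the BioSig implementation of these measures (see footnote 1); purely as an illustration of the pipeline, the sketch below fits the MVAR model (1) with statsmodels, forms H(f) as in (3), and normalizes |H_ij(f)|² into DTF and ffDTF values. The data, model order and frequency grid are placeholders.

```python
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(2)
data = rng.standard_normal((4000, 5))           # n = 5 channels, e.g. 20 s at 200 Hz
fs, p = 200.0, 5

res = VAR(data).fit(p)                          # res.coefs has shape (p, n, n)
A = res.coefs
n = A.shape[1]
freqs = np.linspace(1.0, 45.0, 100)

H = np.empty((freqs.size, n, n), dtype=complex)
for k, f in enumerate(freqs):
    A_tilde = np.eye(n, dtype=complex)
    for j in range(1, p + 1):                   # A~(f) = I - sum_j A(j) exp(-2 pi i f j / fs)
        A_tilde -= A[j - 1] * np.exp(-2j * np.pi * f * j / fs)
    H[k] = np.linalg.inv(A_tilde)               # H(f) as in Eq. (3)

H2 = np.abs(H) ** 2
dtf = H2 / H2.sum(axis=2, keepdims=True)        # DTF: per-frequency row normalization
ffdtf = H2 / H2.sum(axis=(0, 2), keepdims=True) # ffDTF: normalization over f and j (Eq. 5)
print(dtf.shape, ffdtf.shape)                   # (n_freqs, n, n)
```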
2.6 Phase Synchrony
Phase synchrony refers to the interdependence between the instantaneous phases φx and φy of two signals x and y; the instantaneous phases may be strongly synchronized even when the amplitudes of x and y are statistically independent. The instantaneous phase φx of a signal x may be extracted as [11]:
\[ \phi_{x}^{H}(k) = \arg\left[\, x(k) + i\,\tilde{x}(k) \,\right], \tag{6} \]
where x̃ is the Hilbert transform of x. Alternatively, one can derive the instantaneous phase from the time-frequency transform X(k, f) of x:
\[ \phi_{x}^{W}(k,f) = \arg\left[\, X(k,f) \,\right]. \tag{7} \]
The phase φ^W_x(k, f) depends on the center frequency f of the applied wavelet. By appropriately scaling the wavelet, the instantaneous phase may be computed in the frequency range of interest. The phase synchrony index γ for two instantaneous phases φx and φy is defined as [11]:
\[ \gamma = \left| \left\langle e^{\,i(n\phi_x - m\phi_y)} \right\rangle \right| \in [0,1], \tag{8} \]
where n and m are integers (usually n = 1 = m). We will use the notation γH and γW to indicate whether the instantaneous phases are computed by the Hilbert transform or the time-frequency transform, respectively. In this paper, we also consider two additional phase synchrony indices, i.e., the evolution map approach (EMA) and the instantaneous period approach (IPA) [12]. Due to space constraints, we will not describe those measures here; instead we refer the reader to [12]²; additional information about phase synchrony can be found in [6].
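A minimal sketch of the Hilbert-based index (Eqs. 6 and 8, with n = m = 1) is given below; it assumes the signals have already been band-pass filtered to the band of interest, and the synthetic data are placeholders.

```python
import numpy as np
from scipy.signal import hilbert

# Hilbert-based instantaneous phase (Eq. 6) and phase synchrony index (Eq. 8).
def phase_locking(x, y):
    phi_x = np.angle(hilbert(x))      # arg[x + i x~]
    phi_y = np.angle(hilbert(y))
    return np.abs(np.mean(np.exp(1j * (phi_x - phi_y))))

rng = np.random.default_rng(3)
t = np.arange(0, 20, 1 / 200.0)
x = np.sin(2 * np.pi * 10 * t) + 0.3 * rng.standard_normal(t.size)
y = np.sin(2 * np.pi * 10 * t + 0.8) + 0.3 * rng.standard_normal(t.size)
print(f"gamma_H = {phase_locking(x, y):.2f}")   # close to 1 for phase-locked signals
```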
2.7 State Space Based Synchrony
State space based synchrony (or "generalized synchronization") evaluates synchrony by analyzing the interdependence between the signals in a reconstructed state-space domain (see, e.g., [7]). The central hypothesis behind this approach is that the signals at hand are generated by some (unknown) deterministic, potentially high-dimensional, non-linear dynamical system. In order to reconstruct such a system from a signal x, one considers delay vectors X(k) = (x(k), x(k − τ), . . . , x(k − (m − 1)τ))^T, where m is the embedding dimension and τ denotes the time lag. If τ and m are appropriately chosen, and the signals are indeed generated by a deterministic dynamical system (to a good approximation), the delay vectors lie on a smooth manifold ("mapping") in R^m, apart from small stochastic fluctuations. The S-estimator [13], here denoted by Sest, is a state space based measure obtained by applying principal component analysis (PCA) to delay vectors³. We also considered three measures of nonlinear interdependence, S^k, H^k, and N^k (see [6] for details⁴).
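The sketch below shows the delay embedding and an S-estimator-style index computed from normalized PCA eigenvalues; it follows the general recipe of [13], but the windowing and normalization details of the S Toolbox (footnote 3) are simplified, and the embedding parameters are placeholders.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Delay vectors X(k) = (x(k), x(k - tau), ..., x(k - (m - 1) tau))."""
    rows = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau: i * tau + rows] for i in range(m)])

def s_estimator(x, y, m=10, tau=6):
    E = np.hstack([delay_embed(x, m, tau), delay_embed(y, m, tau)])
    M = E.shape[1]
    lam = np.linalg.eigvalsh(np.cov(E, rowvar=False))
    lam = np.clip(lam / lam.sum(), 1e-12, None)            # normalized eigenvalues
    return 1.0 + np.sum(lam * np.log(lam)) / np.log(M)      # 1 - normalized entropy

rng = np.random.default_rng(4)
x = np.sin(2 * np.pi * 10 * np.arange(4000) / 200.0) + 0.2 * rng.standard_normal(4000)
print(f"S_est = {s_estimator(x, x):.2f}  (identical signals -> high synchronization)")
```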
2.8 Information-Theoretic Measures
Several interdependence measures have been proposed that have their roots in information theory. Mutual Information I is perhaps the most well-known
² Program code is available at www.agnld.uni-potsdam.de/%7Emros/dircnew.m
³ We used the S Toolbox downloadable from http://aperest.epfl.ch/docs/software.htm
⁴ Software is available from http://www.vis.caltech.edu/~rodri/software.htm
117
information-theoretic interdependence measure; it quantifies the amount of information the random variable Y contains about random variable X (and vice versa); it is always positive, and it vanishes when X and Y are statistically independent. Recently, a sophisticated and effective technique to compute mutual information between time series was proposed [14]; we will use that method in this paper5 . The method of [14] computes mutual information in time-domain; alternatively, this quantity may also be determined in time-frequency domain (denoted by IW ), more specifically, from normalized spectrograms [15,16] (see also [17,18]). We will also consider several information-theoretic measures that quantify the dissimilarity (or “distance”) between two random variables (or signals). In contrast to the previously mentioned measures, those divergence measures vanish if the random variables (or signals) are identical ; moreover, they are not necessarily symmetric, and therefore, they can not be considered as distance measures in the strict sense. Divergences may be computed in time domain and timefrequency domain; in this paper, we will only compute the divergence measures in time-frequency domain, since the computation in time domain is far more involved. We consider the Kullback-Leibler divergence K, the R´enyi divergence Dα , the Jensen-Shannon divergence J, and the Jensen-R´enyi divergence Jα . Due to space constraints, we will not review those divergence measures here; we refer the interested reader to [15,16]. 2.9
Stochastic Event Synchrony (SES)
Stochastic event synchrony, an interdependence measure we developed in earlier work [19], describes the similarity between the time-frequency transforms of two signals x and y. As a first step, the time-frequency transform of each signal is approximated as a sum of (half-ellipsoid) basis functions, referred to as “bumps” (see Fig. 1 and [20]). The resulting bump models, representing the most prominent oscillatory activity, are then aligned (see Fig. 2): bumps in one timefrequency map may not be present in the other map (“non-coincident bumps”); other bumps are present in both maps (“coincident bumps”), but appear at slightly different positions on the maps. The black lines in Fig. 2 connect the centers of coincident bumps, and hence, visualize the offset in position between pairs of coincident bumps. Stochastic event synchrony consists of five parameters that quantify the alignment of two bump models: – ρ: fraction of non-coincident bumps, – δt and δf : average time and frequency offset respectively between coincident bumps, – st and sf : variance of the time and frequency offset respectively between coincident bumps. The alignment of the two bump models (cf. Fig. 2 (right)) is obtained by iterative max-product message passing on a graphical model; the five SES parameters are determined from the resulting alignment by maximum a posteriori (MAP) 5
The program code (in C) is available at www.klab.caltech.edu/~kraskov/MILCA/
118
J. Dauwels, F. Vialatte, and A. Cichocki
Fig. 1. Bump modeling Coincident bumps (ρ = 27%) 30
25
Fig. 2. Coincident and non-coincident activity (“bumps”); (left) bump models of two signals; (right) coincident bumps; the black lines connect the centers of coincident bumps
estimation [19]. The parameters ρ and st are the most relevant for the present study, since they quantify the synchrony between bump models (and hence, the original time-frequency maps); low ρ and st implies that the two time-frequency maps at hand are well synchronized.
3 Detection of EEG Synchrony Abnormalities in MCI Patients
In the following section, we describe the EEG data we analyzed. In Section 3.2 we address certain technical issues related to the synchrony measures, and in Section 3.3, we present and discuss our results. 3.1
EEG Data
The EEG data6 analyzed here have been analyzed in previous studies concerning early detection of AD [21]–[25]. They consist of rest eyes-closed EEG data 6
We are grateful to Prof. T. Musha for providing us the EEG data.
recorded from 21 sites on the scalp based on the 10–20 system. The sampling frequency was 200 Hz, and the signals were band pass filtered between 4 and 30Hz. The subjects comprised two study groups. The first consisted of a group of 25 patients who had complained of memory problems. These subjects were then diagnosed as suffering from MCI and subsequently developed mild AD. The criteria for inclusion into the MCI group were a mini mental state exam (MMSE) score above 24, the average score in the MCI group was 26 (SD of 1.8). The other group was a control set consisting of 56 age-matched, healthy subjects who had no memory or other cognitive impairments. The average MMSE of this control group was 28.5 (SD of 1.6). The ages of the two groups were 71.9 ± 10.2 and 71.7 ± 8.3, respectively. Pre-selection was conducted to ensure that the data were of a high quality, as determined by the presence of at least 20 sec. of artifact free data. Based on this requirement, the number of subjects in the two groups described above was reduced to 22 MCI patients and 38 control subjects. 3.2
Methods
In order to reduce the computational complexity, we aggregated the EEG signals into 5 zones (see Fig. 3); we computed the synchrony measures (except the S-estimator) from the averages of each zone. For all those measures except SES, we used the arithmetic average; in the case of SES, the bump models obtained from the 21 electrodes were clustered into 5 zones by means of the aggregation algorithm described in [20]. We evaluated the S-estimator between each pair of zones by applying PCA to the state space embedded EEG signals of both zones. We divided the EEG signals in segments of equal length L, and computed the synchrony measures by averaging over those segments. Since spontaneous EEG is usually highly non-stationary, and most synchrony measures are strictly speaking only applicable to stationary signals, the length L should be sufficiently small; on the other hand, in order to obtain reliable measures for synchrony, the length should be chosen sufficiently large. Consequently, it is not a priori clear how to choose the length L, and therefore, we decided to test several values, i.e., L = 1s, 5s, and 20s. In the case of Granger causality measures, one needs to specify the model order p. Similarly, for mutual information (in time domain) and the state space based measures, the embedding dimension m and the time lag τ needs to be chosen; the phase synchrony indices IPA and EMA involve a time delay τ . Since it is not obvious which parameter values amount to the best performance for detecting AD, we have tried a range of parameter settings, i.e., p = 1, 2,. . . , 10, and m = 1, 2,. . . , 10; the time delay was in each case set to τ = 1/30s, which is the period of the fastest oscillations in the EEG signals at hand. 3.3
Results and Discussion
Our main results are summarized in Table 1, which shows the sensitivity of the synchrony measures for detecting MCI. Due to space constraints, the table only shows results for global synchrony, i.e., the synchrony measures were averaged
Fig. 3. The 21 electrodes used for EEG recording, distributed according to the 10– 20 international placement system [8]. The clustering into 5 zones is indicated by the colors and dashed lines (1 = frontal, 2 = left temporal, 3 = central, 4 = right temporal and 5 = occipital).
over all pairs of zones. (Results for local synchrony and individual frequency bands will be presented in a longer report, including a detailed description of the influence of various parameters such as model order and embedding dimension on the sensitivity.) The p-values, obtained by the Mann-Whitney test, strictly speaking need to be Bonferroni corrected: since we consider many different measures simultaneously, it is likely that a few of those measures have small p-values merely due to stochastic fluctuations (and not due to a systematic difference between MCI and control patients). In the most conservative Bonferroni correction, the significance level is divided by the number of synchrony measures (equivalently, the p-values are multiplied by it). From the table, it can be seen that only a few measures evince significant differences in EEG synchrony between MCI and control patients: full-frequency DTF and ρ are the most sensitive (for the data set at hand), and their p-values remain significant (pcorr < 0.05) after Bonferroni correction. In other words, the effect of MCI and AD on EEG synchrony can be detected, as was reported earlier in the literature; we will expand on this issue in the following section. In order to gain more insight into the relation between the different measures, we calculated the correlation between them (see Fig. 5; red and blue indicate strong correlation and anti-correlation respectively). From this figure, it becomes strikingly clear that the majority of measures are strongly correlated (or anti-correlated) with each other; in other words, the measures can easily be classified into different families. In addition, many measures are strongly (anti-)correlated with the classical cross-correlation coefficient r, the most basic measure; as a result, they do not provide much additional information regarding EEG synchrony. Measures that are only weakly correlated with the cross-correlation coefficient include the phase synchrony indices, Granger causality measures, and stochastic-event synchrony measures; interestingly, those three families of synchrony measures are mutually uncorrelated, and as a consequence, they each seem to capture a specific kind of interdependence.
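The statistical procedure can be sketched as follows: one Mann-Whitney test per synchrony measure, followed by a Bonferroni correction over the number of measures. The values below are synthetic placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(6)
n_measures, n_mci, n_ctr = 35, 22, 38
p_values = []
for _ in range(n_measures):
    mci = rng.normal(0.30, 0.05, n_mci)      # e.g. a synchrony measure in the MCI group
    ctr = rng.normal(0.25, 0.05, n_ctr)      # same measure in the control group
    p_values.append(mannwhitneyu(mci, ctr, alternative="two-sided").pvalue)

p_bonf = np.minimum(np.array(p_values) * n_measures, 1.0)   # Bonferroni-corrected p-values
print(f"{np.sum(p_bonf < 0.05)} of {n_measures} measures significant after correction")
```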
Table 1. Sensitivity of synchrony measures for early prediction of AD (p-values for the Mann-Whitney test; * and ** indicate p < 0.05 and p < 0.005 respectively)

Measure:    Cross-correlation | Coherence | Phase Coherence | Corr-entropy | Wave-entropy
p-value:    0.028*            | 0.060     | 0.72            | 0.27         | 0.012*
References: [8], [9]

Measure:    Granger coherence | Partial Coherence | PDC  | DTF  | ffDTF    | dDTF
p-value:    0.15              | 0.16              | 0.60 | 0.34 | 0.0012** | 0.030*
References: [4]

Measure:    Kullback-Leibler | Rényi | Jensen-Shannon | Jensen-Rényi | IW    | I
p-value:    0.072            | 0.076 | 0.084          | 0.12         | 0.060 | 0.080
References: [15], [14]

Measure:    N^k    | S^k  | H^k   | S-estimator
p-value:    0.032* | 0.29 | 0.090 | 0.33
References: [6], [13]

Measure:    Hilbert Phase | Wavelet Phase | Evolution Map | Instantaneous Period
p-value:    0.15          | 0.082         | 0.020*        | 0.072
References: [6], [12]

Measure:    s_t  | ρ
p-value:    0.92 | 0.00029**
In Fig. 4, we combine the two most sensitive synchrony measures (for the data set at hand), i.e., full-frequency DTF and ρ. In this figure, the MCI patients are fairly well distinguishable from the control patients. As such, the separation is not sufficiently strong to yield reliable early prediction of AD. For this purpose, the two features need to be combined with complementary features, for example, derived from the slowing effect of AD on EEG, or perhaps from different modalities such as PET, MRI, DTI, or biochemical indicators. On the other hand, we remind the reader of the fact that in the data set at hand, patients did not carry out any specific task; moreover, the recordings were short (only 20s). It is plausible that the sensitivity of EEG synchrony could be further improved by increasing the length of the recordings and by recording the EEG before, while, and after patients carry out specific tasks, e.g., working memory tasks.
Fig. 4. ρ vs. ffDTF
Fig. 5. Correlation between the synchrony measures
4 Conclusions
In previous studies, brain dynamics in AD and MCI patients were mainly investigated using coherence (cf. Section 2.2) or state space based measures of synchrony (cf. Section 2.7). During working memory tasks, coherence shows significant effects in AD and MCI groups [26] [27]; in the resting condition, however, coherence does not show such differences in low frequencies (below 30 Hz), either between AD patients and controls [28] or between MCI patients and controls [27]. These results are consistent with our observations. In the gamma range, coherence seems to decrease with AD [29]; we did not investigate this frequency range, however, since the EEG signals analyzed here were band-pass filtered between 4 and 30 Hz. Synchronization likelihood, a state space based synchronization measure similar to the non-linear interdependence measures S^k, H^k, and N^k (cf. Section 2.7), is believed to be more sensitive than coherence for detecting changes in AD patients [28]. Using state space based synchrony methods, significant differences were found between AD patients and controls in resting conditions [28] [30] [32] [33]. State space based synchrony failed to retrieve significant differences between MCI patients and control subjects at the global level [32] [33], but significant effects were observed locally: fronto-parietal electrode synchronization likelihood progressively decreased through the MCI and mild AD groups [30]. We report here a lower p-value for the state space based synchrony measure N^k (p = 0.032) than for coherence (p = 0.06); those low p-values, however, would not be statistically significant after Bonferroni correction.
By means of Global Field Synchronization, a phase synchrony measure similar to the ones we considered in this paper, Koenig et al. [31] observed a general decrease of synchronization in correlation with cognitive decline and AD. In our study, we analyzed five different phase synchrony measures: Hilbert and wavelet based phase synchrony, phase coherence, evolution map approach (EMA), and instantaneous period approach (IPA). The p-value of the latter is low (p=0.020), in agreement with the results of [31], but it would be non-significant after Bonferroni correction. The strongest observed effect is a significantly higher degree of local asynchronous activity (ρ) in MCI patients, more specifically, a high number of noncoincident, asynchronous oscillatory events (p = 0.00029). Interestingly, we did not observe a significant effect on the timing jitter st of the coincident events (p = 0.92). In other words, our results seem to indicate that there is significantly more non-coincident background activity, while the coincident activity remains well synchronized. On the one hand, this observation is in agreement with previous studies that report a general decrease of neural synchrony in MCI and AD patients; on the other hand, it goes beyond previous results, since it yields a more subtle description of EEG synchrony in MCI and AD patients: it suggests that the loss of coherence is mostly due to an increase of (local) non-coincident background activity, whereas the locked (coincident) activity remains equally well synchronized. In future work, we will verify this conjecture by means of other data sets.
References 1. Jong, J.: EEG Dynamics in Patients with Alzheimer’s Disease. Clinical Neurophysiology 115, 1490–1505 (2004) 2. Pereda, E., Quiroga, R.Q., Bhattacharya, J.: Nonlinear Multivariate Analysis of Neurophsyiological Signals. Progress in Neurobiology 77, 1–37 (2005) 3. Breakspear, M.: Dynamic Connectivity in Neural Systems: Theoretical and Empirical Considerations. Neuroinformatics 2(2) (2004) 4. Kami´ nski, M., Liang, H.: Causal Influence: Advances in Neurosignal Analysis. Critical Review in Biomedical Engineering 33(4), 347–430 (2005) 5. Stam, C.J.: Nonlinear Dynamical Analysis of EEG and MEG: Review of an Emerging Field. Clinical Neurophysiology 116, 2266–2301 (2005) 6. Quiroga, R.Q., Kraskov, A., Kreuz, T., Grassberger, P.: Performance of Different Synchronization Measures in Real Data: A Case Study on EEG Signals. Physical Review E 65 (2002) 7. Sakkalis, V., Giurc˘ aneacu, C.D., Xanthopoulos, P., Zervakis, M., Tsiaras, V.: Assessment of Linear and Non-Linear EEG Synchronization Measures for Evaluating Mild Epileptic Signal Patterns. In: Proc. of ITAB 2006, Ioannina-Epirus, Greece, October 26–28 (2006) 8. Nunez, P., Srinivasan, R.: Electric Fields of the Brain: The Neurophysics of EEG. Oxford University Press, Oxford (2006) 9. Xu, J.-W., Bakardjian, H., Cichocki, A., Principe, J.C.: EEG Synchronization Measure: a Reproducing Kernel Hilbert Space Approach. IEEE Transactions on Biomedical Engineering Letters (submitted to, September 2006)
10. Herrmann, C.S., Grigutsch, M., Busch, N.A.: EEG Oscillations and Wavelet Analysis. In: Handy, T. (ed.) Event-Related Potentials: a Methods Handbook, pp. 229– 259. MIT Press, Cambridge (2005) 11. Lachaux, J.-P., Rodriguez, E., Martinerie, J., Varela, F.J.: Measuring Phase Synchrony in Brain Signals. Human Brain Mapping 8, 194–208 (1999) 12. Rosenblum, M.G., Cimponeriu, L., Bezerianos, A., Patzak, A., Mrowka, R.: Identification of Coupling Direction: Application to Cardiorespiratory Interaction. Physical Review E, 65 041909 (2002) 13. Carmeli, C., Knyazeva, M.G., Innocenti, G.M., De Feo, O.: Assessment of EEG Synchronization Based on State-Space Analysis. Neuroimage 25, 339–354 (2005) 14. Kraskov, A., St¨ ogbauer, H., Grassberger, P.: Estimating Mutual Information. Phys. Rev. E 69(6), 66138 (2004) 15. Aviyente, S.: A Measure of Mutual Information on the Time-Frequency Plane. In: Proc. of ICASSP 2005, Philadelphia, PA, USA, March 18–23, vol. 4, pp. 481–484 (2005) 16. Aviyente, S.: Information-Theoretic Signal Processing on the Time-Frequency Plane and Applications. In: Proc. of EUSIPCO 2005, Antalya, Turkey, September 4–8 (2005) 17. Quiroga, Q.R., Rosso, O., Basar, E.: Wavelet-Entropy: A Measure of Order in Evoked Potentials. Electr. Clin. Neurophysiol (Suppl.) 49, 298–302 (1999) 18. Blanco, S., Quiroga, R.Q., Rosso, O., Kochen, S.: Time-Frequency Analysis of EEG Series. Physical Review E 51, 2624 (1995) 19. Dauwels, J., Vialatte, F., Cichocki, A.: A Novel Measure for Synchrony and Its Application to Neural Signals. In: Honolulu, H.U. (ed.) Proc. IEEE Int. Conf. on Acoustics and Signal Processing (ICASSP), Honolulu, Hawai’i, April 15–20 (2007) 20. Vialatte, F., Martin, C., Dubois, R., Haddad, J., Quenet, B., Gervais, R., Dreyfus, G.: A Machine Learning Approach to the Analysis of Time-Frequency Maps, and Its Application to Neural Dynamics. Neural Networks 20, 194–209 (2007) 21. Chapman, R., et al.: Brain Event-Related Potentials: Diagnosing Early-Stage Alzheimer’s Disease. Neurobiol. Aging 28, 194–201 (2007) 22. Cichocki, A., et al.: EEG Filtering Based on Blind Source Separation (BSS) for Early Detection of Alzheimer’s Disease. Clin. Neurophys. 116, 729–737 (2005) 23. Hogan, M., et al.: Memory-Related EEG Power and Coherence Reductions in Mild Alzheimer’s Disease. Int. J. Psychophysiol. 49 (2003) 24. Musha, T., et al.: A New EEG Method for Estimating Cortical Neuronal Impairment that is Sensitive to Early Stage Alzheimer’s Disease. Clin. Neurophys. 113, 1052–1058 (2002) 25. Vialatte, F., et al.: Blind Source Separation and Sparse Bump Modelling of TimeFrequency Representation of EEG Signals: New Tools for Early Detection of Alzheimer’s Disease. In: IEEE Workshop on Machine Learning for Signal Processing, pp. 27–32 (2005) 26. Hogan, M.J., Swanwick, G.R., Kaiser, J., Rowan, M., Lawlor, B.: Memory-Related EEG Power and Coherence Reductions in Mild Alzheimer’s Disease. Int. J. Psychophysiol. 49(2), 147–163 (2003) 27. Jiang, Z.Y.: Study on EEG Power and Coherence in Patients with Mild Cognitive Impairment During Working Memory Task. J. Zhejiang Univ. Sci. B 6(12), 1213– 1219 (2005) 28. Stam, C.J., van Cappellen van Walsum, A.M., Pijnenburg, Y.A., Berendse, H.W., de Munck, J.C., Scheltens, P., van Dijk, B.W.: Generalized Synchronization of MEG Recordings in Alzheimer’s Disease: Evidence for Involvement of the Gamma Band. J. Clin. Neurophysiol. 19(6), 562–574 (2002)
29. Herrmann, C.S., Demiralp, T.: Human EEG Gamma Oscillations in Neuropsychiatric Disorders. Clinical Neurophysiology 116, 2719–2733 (2005) 30. Babiloni, C., Ferri, R., Binetti, G., Cassarino, A., Forno, G.D., Ercolani, M., Ferreri, F., Frisoni, G.B., Lanuzza, B., Miniussi, C., Nobili, F., Rodriguez, G., Rundo, F., Stam, C.J., Musha, T., Vecchio, F., Rossini, P.M.: Fronto-Parietal Coupling of Brain Rhythms in Mild Cognitive Impairment: A Multicentric EEG Study. Brain Res. Bull. 69(1), 63–73 (2006) 31. Koenig, T., Prichep, L., Dierks, T., Hubl, D., Wahlund, L.O., John, E.R., Jelic, V.: Decreased EEG Synchronization in Alzheimer’s Disease and Mild Cognitive Impairment. Neurobiol. Aging 26(2), 165–171 (2005) 32. Pijnenburg, Y.A., Made, Y.v., van Cappellen, A.M., van Walsum, Knol, D.L., Scheltens, P., Stam, C.J.: EEG Synchronization Likelihood in Mild Cognitive Impairment and Alzheimer’s Disease During a Working Memory Task. Clin. Neurophysiol. 115(6), 1332–1339 (2004) 33. Yagyu, T., Wackermann, J., Shigeta, M., Jelic, V., Kinoshita, T., Kochi, K., Julin, P., Almkvist, O., Wahlund, L.O., Kondakor, I., Lehmann, D.: Global dimensional complexity of multichannel EEG in mild Alzheimer’s disease and age-matched cohorts. Dement Geriatr Cogn Disord 8(6), 343–347 (1997)
Reproducibility Analysis of Event-Related fMRI Experiments Using Laguerre Polynomials Hong-Ren Su1,2, Michelle Liou2,*, Philip E. Cheng2, John A.D. Aston2, and Shang-Hong Lai1 1
Dept. of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 2 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
[email protected] Abstract. In this study, we introduce the use of orthogonal causal Laguerre polynomials for analyzing data collected in event-related functional magnetic resonance imaging (fMRI) experiments. This particular family of polynomials has been widely used in the system identification literature and recommended for modeling impulse functions in BOLD-based fMRI experiments. In empirical studies, we applied Laguerre polynomials to analyze data collected in an eventrelated fMRI study conducted by Scott et al. (2001). The experimental study investigated neural mechanisms of visual attention in a change-detection task. By specifying a few meaningful Laguerre polynomials in the design matrix of a random effect model, we clearly found brain regions associated with trial onset and visual search. The results are consistent with the original findings in Scott et al. (2001). In addition, we found the brain regions related to the mask presence in the parahippocampal, superior frontal gyrus and inferior parietal lobule. Both positive and negative responses were also found in the lingual gyrus, cuneus and precuneus. Keywords: Reproducibility analysis, Event-related fMRI.
1 Introduction We previously proposed a methodology for assessing reproducibility evidence in fMRI studies using an on-and-off paradigm without necessarily conducting replicated experiments, and suggested interpreting SPMs in conjunction with reproducibility evidence (Liou et al., 2003; 2006). Empirical studies have shown that the method is robust to the specification of hemodynamic response functions (HRFs). Recently, BOLD-based event-related fMRI experiments have been widely used as an advanced alternative to the on-and-off design for studies on human brain functions. In event-related fMRI experiments, the duration of stimulus presentation is generally longer and there are no obvious contrasts between the experimental and control conditions to be used in data analyses. In order to detect possible brain activations during stimulus presentation and task performance, a variety of event-related HRFs have been proposed in the literature. In this study, we introduce the use of orthogonal causal
* Corresponding author.
Laguerre polynomials for modeling response functions. This particular family of polynomials has been widely used in the system identification literature and was recommended for modeling impulse functions in fMRI experiments (Saha et al., 2004). In the empirical study, we applied Laguerre polynomials to analyze data in the study by Scott et al. (2001). The dataset was published by the US fMRI Data Center and is available for public access. The original experiment involved 10 human subjects and investigated brain functions associated with a change-detection task. In the experimental task, subjects look attentively at two versions of the same picture in alternation, separated by a brief mask interval. The experiment additionally analyzed behavioral responses that subjects detected something changing between pictures and pressed a button with hands. In our reproducibility analysis, a few meaningful Laguerre polynomials matching the experimental design were inserted into a random effect model and reproducibility analyses were conducted based on the selected polynomials. In the analyses, we successfully located brain regions associated with the visual change-detection task similar to those found in Scott et al.. Additionally, we found other interesting brain regions that were not included in the previous study.
2 Method In this section, we will briefly describe the method for investigating the reproducibility evidence in fMRI experiments, and outline the family of Laguerre polynomials including those used in our empirical study. 2.1 Reproducibility Analysis In the SPM generalized linear model, the fMRI responses in the ith run can be expressed as
\[ y_i = X_i \beta_i + e_i, \tag{1} \]
where y_i is the vector of image intensity after pre-whitening, X_i is the design matrix, and β_i is the vector containing the unknown regression parameters. In the random effect model, the regression parameters β_i are additionally assumed to be drawn from a multivariate Gaussian distribution with common mean μ and variance Ω. The empirical Bayes estimate of β_i in the random effect model shrinks all estimates toward the mean μ, with greater shrinkage for noisy runs. In fMRI studies, the true status of each voxel is unknown, but can be estimated using the t-values (i.e., standardized β estimates) within individual runs derived from the random effect model along with the maximum likelihood estimation method. By specifying a mixed multinomial model, the receiver-operating characteristic (ROC) curve can be estimated using the maximum likelihood estimation method and the t-values of all image voxels. The curve is simply a bivariate plot of sensitivity versus the false alarm rate. The threshold (or operational point) on the ROC curve for classifying voxels into the active/inactive status was found by maximizing the kappa value. We follow the
same definition in Liou et al. (2006) to categorize voxels according to reproducibility, that is, a voxel is strongly reproducible if its active status remains the same in at least 90% of the runs, moderately reproducible in 70-90% of the runs, weakly reproducible in 50-70% of the runs, and otherwise not reproducible. The brain activation maps are constructed on the basis of strongly reproducible voxels, but include voxels that are moderately reproducible and spatially proximal to those strongly reproducible voxels. 2.2 Laguerre Polynomials The Laguerre polynomials can be used for detecting experimental responses. This family of polynomials can be specified as follows:
\[ h(t) = \sum_{i=1}^{L} f_i\, g_i^{a}(t), \tag{2} \]
where h(t) gives the design coefficients to be input into X_i in (1); L is the order of the Laguerre expansion; f_i is the coefficient of the i-th basis function, and g_i^a(t) is the inverse Z-transform of the i-th Laguerre polynomial, given by
\[ g_i^{a}(t) = Z^{-1}\!\left[ \frac{z^{-1}}{1 - a z^{-1}} \left( \frac{z^{-1} - a}{1 - a z^{-1}} \right)^{i-1} \right] = Z^{-1}\!\left[ \tilde{g}_i^{a}(z) \right], \tag{3} \]
where a is a time constant. As an illustration, Fig. 1 gives the response coefficients corresponding to L=2 and L =3.
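As an illustration of how such a basis can be generated, the sketch below computes the impulse responses of the transfer functions in (3) by direct filtering; the time constant a, the order L and the length are illustrative choices only.

```python
import numpy as np
from scipy.signal import lfilter

def laguerre_basis(L, a, length):
    """Impulse responses of z^-1/(1-a z^-1) * ((z^-1 - a)/(1 - a z^-1))^(i-1), i = 1..L."""
    basis = []
    for i in range(1, L + 1):
        num = np.array([0.0, 1.0])                       # z^-1
        den = np.array([1.0, -a])                        # 1 - a z^-1
        for _ in range(i - 1):
            num = np.convolve(num, np.array([-a, 1.0]))  # extra (z^-1 - a) factor
            den = np.convolve(den, np.array([1.0, -a]))  # extra (1 - a z^-1) factor
        impulse = np.zeros(length)
        impulse[0] = 1.0
        basis.append(lfilter(num, den, impulse))         # g_i^a(t)
    return np.column_stack(basis)

G = laguerre_basis(L=3, a=0.8, length=40)                # h(t) = G @ f for coefficients f
print(G.shape)                                           # (40, 3)
```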
Fig. 1. The boxcar functions of experimental conditions in the Scott et al. study are depicted in (a), and the Laguerre polynomials h(t) with L=2 and L =3 are depicted in (b)
3 Event-Related fMRI Experiments We here introduce the experimental design behind the fMRI dataset used in the empirical study, and select the design matrix suitable for the study.
3.1 Experimental Design In our empirical study, the dataset contains functional MR images of 10 subjects who went through 10-12 experimental runs, with 10 stimulus trials in each run. Experimental runs involved the change-detection task in which two images within a pair differed in either the presence/absence of the position of a single object or the color of the object. The two images were presented alternatively for 40 sec. In the first 30 sec, each image was presented for 300 msec followed by a 100 msec mask. However, the mask was removed in the last 10 sec. Subjects pressed a button when detecting something changing between the pair of images. The experimental images and stimulus duration are shown in Fig. 2.
Fig. 2. The experimental images and stimulus duration in the Scott et al. study
3.2 Experimental Design Matrix We used the Laguerre polynomials in Fig. 1 for specifying the design matrix in (1) instead of the theoretical HRFs. According to the original experimental design, there are two contrasts of interest in the Scott et al. study. The first is the response after the task onset within the 40 secs trial, and the second is the difference between stimulus presentations with and without the mask, that is, responses during the image presentation with the mask (0 ~ 30 sec.) and without the mask (30 ~ 40 sec.) in Fig. 2. The boxcar functions in Fig. 1 can also be specified in the design matrix as was suggested in the study by Liou et al. (2006) for the on-and-off design. However, the two boxcar functions are not orthogonal to each other and carry redundant information on experimental effects. In the event-related fMRI experiment, the duration of stimulus presentation is always longer than that in the on-and-off design. The theoretical HRFs vanish during the stimulus presentation. There might be brain regions continuously responding to the stimulus. The Laguerre polynomials are orthogonal and offer possibilities for examining all kinds of experimental effects. We
might also consider Laguerre polynomials in Fig. 1 as a smoothed version of the boxcar functions.
4 Results In the Scott et al. study, a response-contingent event-related analysis technique was used in the data analyses, and the original results showed brain regions associated with different processing components in the visual change-detection task. For instance, the lingual gyrus, cuneus, precentral gyrus, and medial frontal gyrus showed activations associated with the task onset. And the pattern of activation in dorsal and ventral visual pathways was temporally associated with the duration of visual search. Finally parietal and frontal regions showed systematic deactivations during task performance. In the reproducibility analysis with Laguerre polynomials, we found the similar activation regions associated with the task onset, visual search and deactivations. In addition, we found activation regions in the parahippocampal, superior frontal gyrus, supramarginal gyrus and inferior parietal lobule. Both positive and negative responses were also found in the lingual gyrus, cuneus and precuneus which are also reproducible across all subjects; this finding is consistent with our previous data analyses of fMRI studies involving object recognition and word/pseudoword reading (Liou et al., 2006). Table 1 lists a few activation regions in the change-detection task for the 10 subjects. Table 1. The activation regions in the change-detection task. The plus sign indicates the positive response and minus sign indicates the negative response.
Subjects Lingual gyrus Precuneus Cuneus Posterior cingulate Medial frontal gyrus Parahippocampal gyrus Superior frontal gyrus Supramarginal gyrus
1 +/+/+/+/+/-
2 +/+/+/-
+ + +
3 +/+/+/+/+/+/+ +
4 +/+/+/+/-
5 +/+/+/+/+/+ +
6 +/+/+/+ +/-
7 +/+/+/+/+/+/-
8 +/+/+/+/+/+/-
9 +/+/+/-
10 +/+/+/+/+/-
+ +
In the table, there are 4 subjects showing activations in the superior frontal gyrus and supramarginal gyrus in the change-detection task. The two regions have been referred to in fMRI studies on language process (e.g., the study on word and pseudoword reading). The 4 subjects, on average, had longer reaction time in the change-detection task, that is, a delay of pressing the button until the image presentation without the mask (30-40 sec.). Fig. 3 shows the brain activation regions for Subjects 5 and 7 in the Scott et al. study. Subject 5 involved the superior frontal gyrus and supramarginal gyrus and had the longest reaction time compared with other subjects in the experiment. On the other hand, Subject 7 had relatively shorter reaction time and showed no activations in the two regions.
Fig. 3. Brain activation regions for Subjects 5 and 7 in the Scott et al. study
5 Discussion The reproducibility evidence suggests that the 10 subjects consistently show a pattern of increased/decreased responses in the lingual gyrus, cuneus, and precuneus. Similar observations were also found in our empirical studies on other datasets published by the fMRIDC. In the fMRI literature, the precuneus, posterior cingulate and medial prefrontal cortex are known to form the default network in the resting state and to show decreased activities in a variety of cognitive tasks. The physiological mechanisms behind the decreased responses are still under investigation. However, discussions of the network have focused on the decreased activities. We would suggest considering both positive and negative responses when interpreting the default network. With the method of reproducibility analysis, we can clearly classify brain regions that show consistent responses across subjects and those that show individual patterns and inconsistencies across subjects (see results in Table 1). Higher mental functions are individual, and their localization in specific brain regions can be established only with some probability. Accordingly, the higher mental functions are connected with speech, that is, external or internal speech organizing personal behavior. Subjects differ from each other as a result of using different speech designs when making decisions in performing experimental tasks. Change of functional localization is an additional characteristic of a subject's psychological traits. The proposed methodology would assist researchers in identifying those brain regions that are specific to individual speech designs and those that are consistent across subjects. Acknowledgments. The authors are indebted to the fMRIDC at Dartmouth College for supporting the datasets analyzed in this study. This research was supported by the grant NSC 94-2413-H-001-001 from the National Science Council (Taiwan).
References 1. Liou, M., Su, H.-R., Lee, J.-D., Aston, J.A.D., Tsai, A.C., Cheng, P.E.: A method for generating reproducible evidence in fMRI studies. NeuroImage 29, 383–395 (2006) 2. Huettel, S.A., Guzeldere, G., McCarthy, G.: Dissociating neural mechanisms of visual attention in change detection using functional MRI. Journal of Cognitive Neuroscience 13(7), 1006–1018 (2001) 3. Liou, M., Su, H.-R., Lee, J.-D., Cheng, P.E., Huang, C.-C., Tsai, C.-H.: Bridging Functional MR Images and Scientific Inference: Reproducibility Maps. Journal of cognitive Neuroscience 15(7), 935–945 (2003) 4. Saha, S., Long, C.J., Brown, E., Aminoff, E., Bar, M., Solo, V.: Hemodynamic transfer function estimation with Laguerre polynomials and confidence intervals construction from functional magnetic resonance imaging (FMRI) data. IEEE ICASSP 3, 109–112 (2004) 5. Andrews, G.E., Askey, R., Roy, R.: Laguerre Polynomials. In: §6.2 in Special Functions, pp. 282–293. Cambridge University Press, Cambridge (1999) 6. Arfken, G.: Laguerre Functions. In: §13.2 in Mathematical Methods for Physicists, 3rd ed., Orlando, FL, pp. 721–731. Academic Press, London (1985)
The Effects of Theta Burst Transcranial Magnetic Stimulation over the Human Primary Motor and Sensory Cortices on Cortico-Muscular Coherence Murat Saglam1, Kaoru Matsunaga2, Yuki Hayashida1, Nobuki Murayama1, and Ryoji Nakanishi2 1
Graduate School of Science and Technology, Kumamoto University, Japan 2 Department of Neurology, Kumamoto Kinoh Hospital, Japan
[email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp
Abstract. Recent studies proposed that a new paradigm of repetitive transcranial magnetic stimulation (rTMS), "theta burst stimulation" (TBS), applied to the primary motor cortex (M1) or sensory cortex (S1) can influence cortical excitability in humans. In particular, it has been shown that TBS can induce long-lasting effects with stimulation durations shorter than those of conventional rTMS. However, in those studies, the effects of TBS over M1 or S1 were assessed only by means of motor- and/or somatosensory-evoked potentials. Here we asked how the coherence between electromyographic (EMG) and electroencephalographic (EEG) signals during isometric contraction of the first dorsal interosseous muscle is modified by TBS. The coherence magnitude, localized at the C3 scalp site and in the 13-30 Hz band, significantly decreased 30-60 minutes after TBS over M1, but not after TBS over S1, and recovered to the original level within 90-120 minutes. These findings indicate that TBS over M1 can suppress corticomuscular synchronization. Keywords: Theta Burst Transcranial Magnetic Stimulation, Coherence, Electroencephalogram, Electromyogram, Motor Cortex.
1 Introduction
Previous studies have demonstrated dense functional and anatomical projections in the motor cortex, building a global network that realizes the communication between the brain and peripheral muscles via the motor pathway [1, 2]. The quality of the communication is thought to depend strongly on the efficacy of the synaptic transmission between cortical units. In the past few decades, repetitive transcranial magnetic stimulation (rTMS) has been considered a promising method to modify cortical circuitry by inducing the phenomena of long-term potentiation (LTP) and depression (LTD) of synaptic connections in human subjects [3]. Furthermore, a recently developed rTMS paradigm, called "theta burst stimulation" (TBS), requires fewer stimulation pulses and even offers longer aftereffects than conventional rTMS protocols do [4]. Previously, the efficacy of TBS has
been assessed by means of signal transmission from cortex to muscle or from muscle to cortex, by measuring motor-evoked-potential (MEP) or somatosensory-evoked-potential (SEP), respectively. It was shown that TBS applied over the surface of sensory cortex (S1) as well as primary motor cortex (M1) could modify the amplitude of SEP (recorded from the S1 scalp site) lasting for tens of minutes after the TBS [5]. On the other hand, the amplitude of MEP was not modified by the TBS applied over S1, while the MEP amplitude was significantly decreased by the TBS applied over M1 [4, 5]. In the present study, we examined the effects of TBS applied over either M1 or S1 on the functional coupling between cortex and muscle by measuring the coherence between electroencephalographic (EEG) and electromyographic (EMG) signals during voluntary isometric contraction of the first dorsal interosseous (FDI) muscle.
2 Methods
2.1 Subjects
Seven subjects among the whole set of recruited participants (approximately twenty) showed significant coherence, and only those subjects participated in the TBS experiments. Experiments on M1 and S1 were performed on different days, and subjects did not report any side effects during or after the experiments.
2.2 Determination of M1 and S1 Location
The optimal location of the stimulating coil was determined by searching for the largest MEP response (from the contralateral FDI muscle) elicited by single-pulse TMS while
Fig. 1. EEG-EMG signals recorded before and after the application of TBS, as depicted in the experimental time line. Subjects were asked to contract four times in each recording set. The stimulation location and intensity were determined after the pre30 session. The pre0 recording was done to confirm that the search procedure did not produce any conditioning. The TBS paradigm is illustrated in the inset at the upper right.
moving the TMS coil in 1 cm steps around the presumed position of M1. Stimulation was applied with a High Power Magstim 200 machine and a figure-of-8 coil with a mean loop diameter of 70 mm (Magstim Co., Whitland, Dyfed, UK). The coil was placed tangentially to the scalp with the handle pointing backwards and laterally at a 45° angle away from the midline. Based on previous reports, S1 is assumed to be 2 cm posterior to the M1 site [5].
2.3 Theta Burst Stimulation
A continuous TBS (cTBS) paradigm of 600 pulses was applied to the M1 and S1 locations. cTBS consists of 50 Hz triplets of pulses repeating every 0.2 s (5 Hz) for 40 s [4]. The intensity of each pulse was set to 80% of the active motor threshold (AMT), which is defined as the minimum stimulation intensity that could evoke an MEP of no less than 200 μV during slight tonic contraction.
2.4 EEG and EMG Recording
EEG signals were recorded based on the international 10-20 scalp electrode placement method (19 electrodes) with an earlobe reference. The EMG signal, during isometric hand contraction at 15% of the maximum level, was recorded from the FDI muscle of the right hand with reference to the metacarpal bone of the index finger. EEG and EMG signals were recorded with a 1000 Hz sampling frequency and passbands of 0.5-200 Hz and 5-300 Hz, respectively. Each recording set consists of 4 one-minute-long recordings with 30 s rest intervals. To assess the TBS effect with respect to time, a set was performed 30 minutes before (pre30), just before (pre0), and 0, 30, 60, 90 and 120 minutes after the delivery of TBS. Stimulation location and intensity were determined between the pre30 and pre0 recordings (Fig. 1).
2.5 Data Analysis
The coherence function is the squared magnitude of the cross-spectrum of a signal pair divided by the product of their power spectra. Therefore, cross- and power-spectra between the EMG and the 19 EEG channels were calculated. A fast Fourier transform, with an epoch size of 1024 resulting in a frequency resolution of 0.98 Hz, was used to convert the signals into the frequency domain. The current source density (CSD) reference method was utilized in order to achieve spatially sharpened EEG signals [6]. The coherency between EEG and EMG signals was obtained using the expression below:
$$0 \le \kappa_{xy}^{2}(f) = \frac{|S_{xy}(f)|^{2}}{S_{xx}(f)\,S_{yy}(f)} \le 1 \qquad (1)$$
where $S_{xy}(f)$ represents the cross-spectral density function, and $S_{xx}(f)$ and $S_{yy}(f)$ stand for the auto-spectral densities of the signals x and y, respectively. Since coherence is a normalized measure of the correlation between a signal pair, $\kappa_{xy}^{2}(f)=1$ represents a perfect linear dependence and $\kappa_{xy}^{2}(f)=0$ indicates a lack of linear dependence between the signals. Coherence values $\kappa_{xy}^{2}(f) > 0$ are assumed to be statistically significant only if they are above the 99% confidence limit, which is estimated by:
$$CL(\alpha\%) = 1 - \left(1 - \frac{\alpha}{100}\right)^{\frac{1}{n-1}} \qquad (2)$$
where n is the number of epochs used for the cross- and power-spectra calculations.
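As a rough illustration of the analysis in Sect. 2.5 (not the authors' code), the sketch below estimates the magnitude-squared coherence of Eq. (1) from epoched signals and the confidence limit of Eq. (2). The sampling rate, epoch handling, and function names are assumptions; with four 1-minute recordings at 1 kHz and 1024-sample epochs, the 99% limit comes out near 0.02, consistent with the value quoted in the Results.

```python
import numpy as np

def coherence(x, y, epoch_len=1024, fs=1000.0):
    """Magnitude-squared coherence (Eq. 1), averaging cross- and power-spectra
    over non-overlapping epochs."""
    n_epochs = min(len(x), len(y)) // epoch_len
    X = np.fft.rfft(x[:n_epochs * epoch_len].reshape(n_epochs, epoch_len), axis=1)
    Y = np.fft.rfft(y[:n_epochs * epoch_len].reshape(n_epochs, epoch_len), axis=1)
    Sxy = np.mean(X * np.conj(Y), axis=0)
    Sxx = np.mean(np.abs(X) ** 2, axis=0)
    Syy = np.mean(np.abs(Y) ** 2, axis=0)
    kappa2 = np.abs(Sxy) ** 2 / (Sxx * Syy)
    freqs = np.fft.rfftfreq(epoch_len, d=1.0 / fs)
    return freqs, kappa2, n_epochs

def confidence_limit(alpha_percent, n_epochs):
    """Confidence limit of Eq. (2): coherence above this value is significant."""
    return 1.0 - (1.0 - alpha_percent / 100.0) ** (1.0 / (n_epochs - 1))
```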
3 Results
First we confirmed that the EEG-EMG coherence values for all (n=7) subjects lie above the 99% significance level (coh ≈ 0.02) and within the beta (13-30 Hz) frequency
Table 1. Mean ± standard error of the mean (SEM) of beta-band EEG (C3)-EMG coherence values and peak frequencies for the TBS-over-M1 and TBS-over-S1 experiments (n=7)
                           Pre30 (min)   Pre0 (min)    0 (min)       30 (min)      60 (min)      90 (min)      120 (min)
Coherence Magnitude, M1    0.061±0.008   0.051±0.014   0.053±0.014   0.03±0.007    0.031±0.009   0.057±0.015   0.067±0.016
Coherence Magnitude, S1    0.059±0.021   0.068±0.026   0.058±0.027   0.061±0.023   0.052±0.014   0.034±0.006   0.040±0.016
Peak Frequency (Hz), M1    20.50±1.92    21.80±1.98    22.48±2.04    23.73±1.84    18.57±1.2     20.75±1.62    18.23±1.82
Peak Frequency (Hz), S1    21.17±1.79    21.80±1.97    23.08±1.82    23.10±2.15    18.22±2.13    21.17±1.08    19.52±3.21
Fig. 2. Coherence spectra between EEG and EMG signals in the pre30 session. A, Coherence spectra between the 19 EEG channels and EMG (FDI) for all subjects (n=7), superimposed and arranged topographically according to the approximate locations of the electrodes on the scalp. Each electrode is labeled with respect to its location (Fp: frontal pole, F: frontal, T: temporal, C: central, P: parietal, O: occipital). B, Expanded view of the coherence spectra between EEG (C3 scalp site) and EMG (FDI) for all subjects (n=7). Each line style specifies a different subject's coherence spectrum. Only coherence values above the 99% significance level (indicated by the solid horizontal line) are highlighted.
band at the C3 scalp site. Maximum coherence levels were observed at the C3 (n=6) and F3 (n=1) scalp sites, whereas no significant coherence was observed at other locations (Figure 2). These results are in good agreement with previous studies on EEG-EMG coherence during isometric contraction [7, 8]. Table 1 shows the average absolute coherence values and peak frequencies for all trials before and after the application of TBS. Figure 3 demonstrates the normalized EEG (C3)-EMG coherence values averaged over all subjects. Coherence values obtained before the TBS were taken as the control and set to 100%. The average beta-band coherence was suppressed to 56.2% after 30 minutes and 54.5% after 60 minutes, with statistical significance (p

Potentiation occurs for a positive timing difference, Δt > 0, and depression for a negative timing difference, Δt < 0. Many have argued that the asymmetry of this rule produces a one-way coupling (see e.g. [4]). Such arguments would be valid if Δt represented the time difference between post- and presynaptic spikes. However, most experimental studies [3] actually define Δt as the time difference between a postsynaptic spike and the onset or peak of the somatic excitatory postsynaptic potential (EPSP) induced by a presynaptic spike: $\Delta t = t_{\text{post spike}} - t_{\text{EPSP by pre}}$. Hence the above argument does not apply. A somatic EPSP should lag behind a presynaptic spike by a few msec. Therefore, if two neurons fire in exact synchrony (Fig. 1g), Δt is negative [12] for both directions, thereby weakening the connections bidirectionally. Now, how does this mechanism convert initial asynchronous firing to clustered synchronous firing (Fig. 1b)? Initial asynchronous firing (Fig. 1a) is represented as firing phases evenly spread around the circle (Fig. 1h, left). The firing remains asynchronous without STDP. However, with the phases of many neurons squeezed onto the circle, any single neuron must have neighboring neurons that unwillingly fire synchronously with it (Fig. 1h). Among these neurons, the above-mentioned mechanism weakens the connections bidirectionally. As their synaptic connections weaken, mutual repulsion is also weakened. This then further synchronizes their firing. This positive feedback mechanism develops wireless clusters (Fig. 1h). Although this mechanism qualitatively explains how the clustering happens, the quantitative question of how many clusters are formed requires further consideration. We will later see that a stability analysis tells us the possible number of clusters. In contrast to the vanishing intra-cluster connections, the inter-cluster connections survive and can be unidirectional (Fig. 1d), which defines a cyclic network topology such as that shown in Fig. 1f, upper. Let us ask how we can change this 3-cycle topology. We find that one of the recently observed higher-order rules of STDP [15,16] increases the number of clusters (Fig. 1e). The higher-order rule shown in [16] implies a gross reduction in the LTD effect because LTP overrides the immediately preceding LTD, whereas LTD only partly cancels the immediately preceding LTP. The weakened LTD effect is likely to increase the total number of potentiated synapses, which is consistent with the increased ratio of black areas in Fig. 1e compared to Fig. 1d. In contrast to such cluster-wise synchrony observed with Model A neurons, Model B neurons that favor synchrony (a = 0.02, b = 0.2, c = −50, d = 40) self-organize into the globally synchronous state with or without STDP (Fig. 2).
Due to the global synchrony, mutual synaptic connections are largely lost, and each neuron ends up being driven by the external input individually, having little sense of being present as a population. The global synchrony gives too strong
Fig. 2. Global synchrony observed with Model B neurons that favor synchrony. A raster plot (a) and connection matrix (b) of fifty Model B neurons. The neurons were aligned with the connection-based method: neurons are defined to belong to the same cluster whenever their mutual connections are small enough.
an impact and also has minimal coding capacity because all the neurons behave identically, and it appears to bear more similarity to pathological activity such as seizure in the brain than to meaningful information processing. By contrast, the clustered synchrony arising in the network of Model A neurons appears functionally useful. Generally in the brain, the unitary EPSP amplitude (∼0.5 mV) is designed to be much smaller than the voltage rise needed to elicit firing (∼15 mV). Therefore, single-neuron activity alone cannot cause other neurons to respond. Hence, it is difficult to regard single-neuron activity as a carrier of information transferred back and forth in the brain. In contrast, the self-organized assembly of tens of Model A neurons (Fig. 1d) looks like an ideal candidate for a carrier of information in the brain because its impact on other neurons is strong enough to elicit responses. Additionally, a cluster can reliably code timing information. The PRC, Z(2πt/T), representing the amount of advance/delay of the next firing time in response to an input at time t in the firing interval [0, T], has mostly been used to decide whether a coupled pair of neurons or oscillators tend to synchronize or desynchronize under the assumption that the connection strengths between the neurons are equal and unchanged. Specifically, suppose that a pair of neurons are mutually connected and a spike of one neuron introduces a current with the waveform EPSC(t) in the innervated neuron after a transmission delay of τd. The effective PRC, defined as
$$\Gamma_{-}(\theta) = \frac{1}{T}\int_{0}^{T} Z(2\pi t'/T)\, EPSC\!\left(t' - \tau_d - \frac{T\theta}{2\pi}\right) dt',$$
is known to decide their synchrony tendency. If the slope of Γ−(θ) at θ = 0 is positive (negative), the two neurons are desynchronized (synchronized). This synchrony condition is inherited by a population of neurons coupled in an all-to-all or random manner as long as the connection strengths remain unchanged. Theoretically calculated Γ−(θ)s for Models A and B (Fig. 3a,b) explain that the all-to-all network of Model A (B) neurons exhibits global asynchrony (synchrony). Note that both Model A and Model B neurons belong to type II [19], so that both model neurons favor synchrony if they are delta-coupled with no synaptic delay. After STDP is switched on, the network consisting of Model A neurons self-organizes into the 3-cycle circuit (Fig. 1d) with the successive phase difference of the clustered activity being Δsucθ = 2π/3. Stability analysis shows that the slope
Fig. 3. Effective PRCs and schema of the triad mechanism. The effective PRCs of Models A and B were calculated with the adjoint method [19] and are shown in (a) and (b). The slope at θ = 0 is positive for (a) but negative for (b), although this is hardly recognizable at this resolution. The slope at θ = 2π/3 is negative for (a) but positive for (b). The dashed lines represent θ = 2π/3 and θ = 2π/4.
of Γ−(θ) not at the origin but at θ = 2π − Δsucθ now determines the stability of the 3-cycle activity: the 3-cycle activity is stable if the slope of Γ− at 2π − Δsucθ is negative, i.e. Γ′−(2π − Δsucθ) < 0. Fig. 3a tells us that the 3-cycle activity shown in Fig. 1c is stable. The stable cyclic activity is achieved through the following synergetic process: (1) the PRC determines the preferred network activity (e.g. asynchronous or synchronous), (2) the network activity determines how STDP works, and STDP modifies the network structure (e.g. from all-to-all to cyclic), and (3) the network structure determines how the PRC is read out (e.g. at θ = 0 or θ = 2π − Δsucθ), closing the loop. Generally, we can show that the n-cycle activity whose successive phase difference equals Δsucθ = 2π/n is stable if Γ′−(2π − Δsucθ) < 0. PRCs of biologically plausible neuron models or real neurons [20] tend to have a negative slope in a later phase of the firing interval and converge to zero at θ = 2π because the membrane potential starts the regenerative depolarization and becomes insensitive to any synaptic input. The corresponding effective PRCs inherit this negative slope in the later phase, and it tends to stabilize the n-cycle activity for some n.
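The effective PRC and the n-cycle stability criterion above can also be evaluated numerically. The sketch below is a minimal illustration under our own assumptions (a tabulated PRC Z and an EPSC waveform treated as periodic over one firing interval, and simple finite differences for the slope); it is not the authors' code.

```python
import numpy as np

def effective_prc(Z, epsc, T, tau_d, n_theta=360, n_t=512):
    """Gamma_-(theta) = (1/T) * int_0^T Z(2*pi*t'/T) * EPSC(t' - tau_d - T*theta/(2*pi)) dt'.
    Z and epsc are callables; the EPSC argument is wrapped modulo T (an assumption)."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    ts = np.linspace(0.0, T, n_t, endpoint=False)
    dt = T / n_t
    gamma = np.empty(n_theta)
    for i, th in enumerate(thetas):
        lag = (ts - tau_d - T * th / (2.0 * np.pi)) % T
        gamma[i] = np.sum(Z(2.0 * np.pi * ts / T) * epsc(lag)) * dt / T
    return thetas, gamma

def n_cycle_is_stable(thetas, gamma, n):
    """The n-cycle with successive phase lag 2*pi/n is stable when the slope of
    Gamma_- at theta = 2*pi - 2*pi/n is negative."""
    slope = np.gradient(gamma, thetas)
    target = 2.0 * np.pi * (1.0 - 1.0 / n)
    return slope[np.argmin(np.abs(thetas - target))] < 0.0
```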
3 Self-organization of Hodgkin-Huxley Type Neurons
Next we see that the self-organized cyclic activity with wireless clustering is also observed in a biologically realistic setting. Our simulations, as described in [18], with 200 excitatory and 50 inhibitory neurons modeled with the Hodgkin-Huxley (HH) formalism exhibit the 3-cyclic activity with wireless clustering (Fig. 4a,b). The setup here is biologically realistic in that (1) HH-type neurons are used, (2) a physiologically known percentage of inhibitory neurons with non-plastic synapses is included, and (3) neurons fire with high irregularity due to large noise in the background input, unlike the well-regulated firing shown in Fig. 1c. Interestingly, the effective PRC (Fig. 4c) of the HH-type neuron shares important features with that of Model A: the positive initial slope implying a preference for asynchrony and a negative later slope stabilizing the 3-cycle activity.
Fig. 4. Conductance-based model also develops the wireless clustering. (a) A raster plot of 200 HH-type excitatory neurons showing 3-cycle activity. (b) The corresponding connection matrix showing the wirelessness. (c) Effective PRC or Γ− (θ) of the conductance-based model.
Generally, a technical difficulty in HH simulations is their massive computational demand due to the complexity of the system. That difficulty has hindered theoretical analysis and has left such studies largely experimental. In particular, we previously tried hard, in vain, to understand why we never observed 4-cycle or longer activity. However, the analytic argument we developed here with the simplified model gives clear insight into the biologically plausible but complex system. Comparison of Fig. 3a and Fig. 4c reveals that the negative slope of Γ−(θ) of the HH model is located further to the left than that of Model A, indicating less stability of long cycles in the HH simulations. With the larger amount of noise in the HH simulations in mind, it is now understood that 4-cycles and longer can easily be destabilized in the HH simulations. Thus, our analysis developed with the simplified system serves as a useful tool for understanding a biologically realistic but complex system. There is, however, an interesting difference between the Model A and HH simulations. Although intra-cluster wirelessness is a fairly good first approximation in the HH model simulations (Fig. 4b), it is not as exact as in the Model A simulations (Fig. 1d,e). Interestingly, elimination of the residual intra-cluster connections destroys the cyclic activity, suggesting a supportive role for the tiny residual intra-cluster connections.
4 Discussion
In the previous simulation study [17] using the LIF model, cyclic activity was observed to propagate only at the theoretical speed limit: it takes only τd from one cluster to the next, requiring zero membrane integration time. To understand
why this was the case, we first recall that the effective PRC needs a negative slope at 2π − Δsucθ to stabilize the cyclic activity. However, the slope of the PRC of an LIF model, $Z(\theta) = c\,\exp\!\left(\frac{T}{\tau_m}\frac{\theta}{2\pi}\right)$, is always positive except at the end point, where $Z(2\pi - 0) = c\,\exp(T/\tau_m)$ and $Z(2\pi + 0) = c$, implying $Z'(2\pi) = -\infty$. This infinitely sharp negative slope of the PRC at θ = 2π is rounded and displaced to 2π − 2πτd/T in Γ−(θ) (see its definition). Since this is the only place where Γ−(θ) has a negative slope, the cyclic activity is stable only if Δsucθ = 2πτd/T, implying propagation at the theoretical speed limit. We demonstrated an intimate interplay between PRC and STDP using the Izhikevich neuron model as well as the HH-type model. The present study complements previous studies using the phase oscillator [11,14], where its mathematical tractability was exploited to analytically investigate the stability of global phase/frequency synchrony. The self-organization, or unsupervised learning, by STDP studied here complements the supervised learning studied in [22]. The propagation of synchronous firing and the temporal evolution of synaptic strength under STDP are known to be analyzable semi-analytically with the Fokker-Planck equation [5,6,8,9,21]. It is an interesting future direction to see how the Fokker-Planck equation can be used to understand the interplay between PRC and STDP.
Acknowledgement The present authors thank Dr. T. Takewaka at RIKEN BSI for offering the code to calculate the PRC.
References 1. Kuramoto, Y.: Chemical oscillations,waves,and turbulence. Springer, Berlin (1984) 2. Ermentrout, G., Kopell, N.: SIAM J. Math.Anal. 15, 215 (1984) 3. Markram, H., et al.: Science 275, 213 (1997); Bell, C.C., et al.: Nature 387, 278 (1997); Magee, J.C., Johnston, D.: Science 275, 209 (1997); Bi, G.-Q., Poo, M.-M.: J. Neurosci. 18, 10464 (1998); Feldman, D. E., Neuron 27, 45 (2000); Nishiyama, M., et al.: Nature 408, 584 (2000) 4. Song, S., et al.: Nat. Neurosci. 3, 919 (2000) 5. van Rossum, M.C., Turrigiano, G.G., Nelson, S.B.: J. Neurosci. 22,1956 (2000) 6. Rubin, J., et al.: Phys. Rev. Lett. 86, 364 (2001) 7. Abbott, L.F., Nelson, S.B.: Nat. Neurosci. 3, 1178 (2000) 8. Gerstner, W., Kistler, W.M.: Spiking neuron model. Cambridge University Press, Cambridge (2002) 9. Cˆ ateau, H., Fukai, T.: Neural Comput., 15, 597 (2003) 10. Izhikevich, E.M.: IEEE Trans. Neural Netw. 15, 1063 (2004) 11. Karbowski, J.J., Ermentrout, G.B.: Phys. Rev. E. 65, 031902 (2002) 12. Nowotny, T., et al.: J. Neurosci. 23, 9776 (2003) 13. Zhigulin, V.P., et al.: Phys. Rev. E, 67, 021901 (2003) 14. Masuda, N., Kori, H.: J. Comp. Neurosci, 22, 327 (2007) 15. Froemke, R.C., Dan, Y.: Nature, 416, 433 (2002)
16. Wang, H.-X., et al.: Nat. Neurosci. 8, 187 (2005) 17. Levy, N., et al.: Neural Netw. 14, 815 (2001) 18. Kitano, K., Câteau, H., Fukai, T.: Neuroreport 13, 795 (2002) 19. Ermentrout, G.B.: Neural Comput. 8, 979 (1996) 20. Reyes, A.D., Fetz, E.E.: J. Neurophysiol. 69, 1673 (1993); Reyes, A.D., Fetz, E.E.: J. Neurophysiol. 69, 1661 (1993); Oprisan, S.A., Prinz, A.A., Canavier, C.C.: Biophys. J. 87, 2283 (2004); Netoff, T.I., et al.: J. Neurophysiol. 93, 1197 (2005); Lengyel, M., et al.: Nat. Neurosci. 8, 1667 (2005); Galan, R.F., Ermentrout, G.B., Urban, N.N.: Phys. Rev. Lett. 94, 158101 (2005); Preyer, A.J., Butera, R.J.: Phys. Rev. Lett. 95, 13810 (2005); Goldberg, J.A., Deister, C.A., Wilson, C.J.: J. Neurophysiol. 97, 208 (2007); Tateno, T., Robinson, H.P.: Biophys. J. 92, 683 (2007); Mancilla, J.G., et al.: J. Neurosci. 27, 2058 (2007); Tsubo, Y., et al.: Eur. J. Neurosci. 25, 3429 (2007) 21. Câteau, H., Reyes, A.D.: Phys. Rev. Lett. 96, 058101, and references therein (2006) 22. Lengyel, M., et al.: Nat. Neurosci. 8, 1677 (2005)
A Computational Model of Formation of Grid Field and Theta Phase Precession in the Entorhinal Cells Yoko Yamaguchi1, Colin Molter1, Wu Zhihua1,2, Harshavardhan A. Agashe1, and Hiroaki Wagatsuma1 1
Lab For Dynamic of Emergent Intelligence, RIKEN Brain Science Institute, Wako, Saitama, Japan 2 Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
[email protected] Abstract. This paper proposes a computational model of spatio-temporal property formation in the entorhinal neurons recently known as “grid cells”. The model consists of module structures for local path integration, multiple sensory integration and for theta phase coding of grid fields. Theta phase precession naturally encodes the spatial information in theta phase. The proposed module structures have good agreement with head direction cells and grid cells in the entorhinal cortex. The functional role of theta phase coding in the entorhinal cortex for cognitive map formation in the hippocampus is discussed. Keywords: Cognitive map, hippocampus, temporal coding, theta rhythm, grid cell.
1 Introduction
In rodents, it is well known that a hippocampal neuron increases its firing rate at some specific position in an environment [1]. These neurons are called place cells and are considered to provide the neural representation of a cognitive map. Recently it was found that entorhinal neurons, which give major inputs to the hippocampus, fire at positions distributed in the form of a triangular-grid-like pattern in the environment [2]. They are called "grid cells" and their spatial firing preference is termed "grid fields". Interestingly, temporal coding of spatial information, "theta phase precession", initially found in hippocampal place cells, was also observed in grid cells in the superficial layer of the entorhinal cortex [3], as shown in Figs. 1-3. A sequence of neural firing is locked to the theta rhythm (4-12 Hz) of the local field potential (LFP) during spatial exploration. As a step toward understanding cognitive map formation in the rat hippocampus, the mechanism that forms the grid field and also the mechanism of phase precession formation in grid cells must be clarified. Here we propose a model of neural computation to create grid cells based on known properties of entorhinal neurons, including "head direction cells", which fire
when the animal's head points in some specific direction in the environment. We demonstrate that theta phase precession in the entorhinal cortex naturally emerges as a consequence of the grid cell formation mechanism.
Fig. 1. Theta phase precession observed in rat hippocampal place cells. When the rat traverses a place field, the spike timing of the place cell gradually advances relative to the local field potential (LFP) theta rhythm. In a running sequence through place fields A-B-C, the spike sequence in the order A-B-C emerges in each theta cycle. The spike sequence repeatedly encoded in theta phase is considered to lead to robust on-line memory formation of the running experience through asymmetric synaptic plasticity in the hippocampus.
Fig. 2. Network structure of the hippocampal formation (DG, CA3, CA1, the entorhinal cortex (EC deeper layer and EC superficial layer)) and cortical areas giving multimodal input. Theta phase precession was initially found in the hippocampus and was also found in the EC superficial layer. The EC superficial layer can be considered as an origin of theta phase precession.
Fig. 3. Top) A grid field in an entorhinal grid cell (left) and a place field (right) in a hippocampal place cell. Bottom) Theta phase precession observed in the EC grid cell and in the hippocampal place cell (traces: HC place cell, EC grid cell, EC LFP theta, plotted against time).
2 Model
The firing rate of the ith grid cell at a location (x, y) in a given environment increases under the condition given by the relation:
$$x = \alpha_i + nA_i\cos\phi_i + mA_i\cos(\phi_i + \pi/3), \quad y = \beta_i + nA_i\sin\phi_i + mA_i\sin(\phi_i + \pi/3), \quad \text{with } n, m = \text{integer} + r, \qquad (1)$$
where φi, Ai and (αi, βi) denote an angle characterizing the grid orientation, the distance between nearby vertices, and the spatial phase of the grid field in the environment, respectively. The parameter r is less than 1.0 and gives the relative size of a field with a high firing rate.
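For concreteness, the condition of Eq. (1) can be tested numerically for a given position. The snippet below is a sketch under our reading of Eq. (1), namely that a location belongs to a field when the oblique-basis coordinates n and m lie within r of an integer; the function name and that tolerance interpretation are our assumptions.

```python
import numpy as np

def in_grid_field(x, y, alpha, beta, A, phi, r):
    """Check Eq. (1): express the displacement from (alpha, beta) in the oblique
    basis spanned by the directions phi and phi + pi/3, in units of the vertex
    spacing A, and require both coordinates to lie within r of an integer."""
    basis = np.array([[np.cos(phi), np.cos(phi + np.pi / 3)],
                      [np.sin(phi), np.sin(phi + np.pi / 3)]])
    n, m = np.linalg.solve(basis, np.array([x - alpha, y - beta])) / A
    dist = lambda u: abs(u - round(u))       # distance to the nearest integer
    return dist(n) < r and dist(m) < r

# Sweeping (x, y) over an arena with this predicate yields a triangular lattice
# of firing fields with vertex spacing A and orientation phi.
```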
Fig. 4. Illustration of a hypercolumn structure for grid field computation in the hypothesized entorhinal cortex. The bottom layer consists of local path integration modules with a hexagonal direction system. The middle layer associates the output of local path integration with visual cues in a given environment. The top layer consists of triplets of grid cells whose grid fields have a common orientation, a common spatial scale, and complementary spatial phases. Phase precession is generated in the grid cell at each grid field.
The computational goal in creating a grid field is to find the region with n, m = integer + r. We hypothesize that the deeper layer of the entorhinal cortex works as a set of local path integration systems using head direction and running velocity. The local path integration results in a variable with slow, gradual change that forms a grid field. This change can cause the gradual phase shift of theta phase precession, in accordance with the phenomenological model of theta phase precession by Yamaguchi et al. [4]. The schematic structure of the hypothesized entorhinal cortex and the multimodal sensory system is illustrated in Fig. 4. The entorhinal layer includes head direction cells in the deeper layer and grid cells in the superficial layer. The cells with theta phase precession can be considered as stellate cells. The set of modules along the vertical direction forms a kind of functional column with a direction preference. These columns form a hypercolumnar structure with a set of directions. Mechanisms in the individual modules are explained below.
2.1 A Module of Local Path Integrator
The local path integration module consists of six units. During the animal's locomotion with a given head direction and velocity, each unit integrates the running distance along its direction with an angle-dependent coefficient. The units have preferred vector directions distributed at π/3 intervals, as shown in Fig. 5. Computation of the animal's displacement along the given directions in this module is illustrated in Fig. 6. The maximum integration length of the distance in each direction is assumed to be common within a module, corresponding to the distance between nearby vertices of the subsequently formed grid field. This computation gives (n, m) in Eq. (1). These systems are distributed in the deeper layer of the entorhinal cortex, in agreement with observations of head direction cells. Different modules have different vector
directions and form a hypercolumn set covering all running directions, or the entire orientation range of the resultant grid field. The entorhinal cortex is considered to include multiple hypercolumns with different spatial scales. They are considered to work in parallel, possibly to give stability over a global space by compensating for the accumulation of local errors.
Fig. 5. Left) A module of the local path integrator with hexagonal direction vectors and a common vector size. Right) Activity coefficient of each vector unit. A set of these vector units, giving a displacement distance measure, computes the animal's motion in a given head direction.
Fig. 6. Illustration of the computation of local path integration in a module. Animal locomotion in a given head direction is computed by a pair of vector units among the six vectors to give a position measure.
2.2 Grid Field Formation with Visual Cues
The computational results of local path integration are projected to the next module in the superficial layer of the entorhinal cortex, which receives multiple sensory inputs in a given environment. The association of path integration and visual cues determines the relative location of the path integration measure (αi, βi) in Eq. (1) in the module. Further interaction within a set of three cells, as shown in Fig. 7, can make the parameters (αi, βi) robust. A possible interaction among these cells is mutual inhibition, giving a complementary distribution of the three grid fields.
2.3 Theta Phase Precession in the Grid Field
The input of the parameters (n, m) and (αi, βi) to a cell in the next module, at the top of the column, can cause theta phase precession. It is obtained by the fundamental mechanism of theta phase generation proposed by Yamaguchi et al. [4][5]. The mechanism requires a gradual increase of the natural frequency in a cell with oscillatory activity. Here we find that the top module consists of stellate cells with intrinsic theta oscillation. The increase in natural frequency is expected to emerge from the path integration input at each vertex of a grid field.
Fig. 7. A triplet of grid fields with the same local path integration system and different spatial phases can generate a mostly uniform spatial representation in which a single grid cell fires at every location. The uniformity can help robust assignment of environmental spatial phases with the help of environmental sensory cues. The association is processed in the middle part of each column in the entorhinal cortex. Projection of each cell's output to the module at the top generates a grid field with theta phase precession, as explained in the text.
3 Mathematical Formulation
A simple mathematical formulation of the above model is given phenomenologically below. The locomotion of the animal is represented by the current displacement (R, φc), computed from the head direction φH and the running velocity. An elementary vector at a column of the local path integration system has a vector angle φi and length A. The output of the ith vector system I is given by
$$I(\phi_i) = \begin{cases} 1 & \text{if } |\phi_i - \phi_H| < \pi/2 \text{ and } -r < S(\phi_i) < r, \\ 0 & \text{otherwise}, \end{cases} \qquad (2)$$
with $S(\phi_i) = R\cos(\phi_i - \phi_c) \ (S \bmod A)$.
where r and A respectively represent the field radius and the distance between neighboring grid vertices. The output Di of the path integration module to the middle layer is given by
$$D_i = \prod_k I(\phi_i + k\pi/3). \qquad (3)$$
Through association with visual cues, the spatial phase of the grid is determined. (Details are not shown here.) The terms in Eqs. (2)-(3) from the middle layer to the top layer give on-off regulation and also a parameter with a gradual increase within a grid field. The dynamics of the membrane potential Gi of the cell at the top layer is given by
$$\frac{d}{dt}G_i = f(G_i, t) + aD_iS(\phi_i) + I_{theta}, \qquad (4)$$
where f is a function of time-dependent ionic currents and a is a constant. The last term Itheta denotes a sinusoidal current representing the theta oscillation of inhibitory neurons. With proper dynamics of f, the second term on the right-hand side gives activation of the grid cell oscillation and a gradual increase in its natural frequency. According to our former results using a phenomenological model [5], the last theta-current term leads to phase locking of grid cells with a gradual phase shift. This realizes a cell with a grid field and theta phase precession. One can test Eq. (4) by applying several types of equations, including a simple reduced model and biophysical models of hippocampal or entorhinal cells. An example of computer experiments is given in the following section.
4 Computer Simulation of Theta Phase Precession
The mechanism of theta phase precession was phenomenologically proposed by Yamaguchi et al. [4][5] as the coupling of two oscillations. One is the LFP theta oscillation with the constant frequency of the theta rhythm. The other is a sustained oscillation with a gradual increase in natural frequency. The sustained oscillation in the presence of LFP theta exhibits a gradual phase shift as quasi-steady states of phase locking. The simulation using a hippocampal pyramidal cell [6] is shown in Fig. 8. It is clearly seen that LFP theta instantaneously captures the oscillation with gradually increasing natural frequency into a quasi-stable phase at each theta cycle to give a gradual phase shift. This phase shift is robust against perturbations as a consequence of phase locking in nonlinear oscillations. A simulation with a model of an entorhinal stellate cell [7] was also carried out. We obtained similar phase precession with the stellate cell model. One important property of stellate cells is the presence of subthreshold oscillations, while synchronization of this oscillation can be reduced to the simple behavior of the phase model. Thus, the mechanism of the phenomenological model [5] is found to provide a comprehensive description of phase locking in complex biophysical neuron models.
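The mechanism described in this section can also be reproduced with a very reduced phase model: one oscillator locked to LFP theta while its natural frequency slowly rises. The sketch below is such a toy illustration, not the pyramidal or stellate cell simulations of Fig. 8; the coupling strength, the 30% frequency ramp, and the time constants are arbitrary assumptions.

```python
import numpy as np

def phase_precession_demo(f_theta=8.0, k=20.0, t_max=1.0, dt=1e-4):
    """Phase psi of a unit pulled toward LFP theta (coupling k) while its own
    natural frequency ramps up from f_theta to about 1.3*f_theta; the locked
    phase relative to theta then advances gradually, i.e. phase precession."""
    n = int(t_max / dt)
    t = np.arange(n) * dt
    f_unit = f_theta * (1.0 + 0.3 * t / t_max)      # gradually increasing natural frequency
    theta = 2.0 * np.pi * f_theta * t               # LFP theta at constant frequency
    psi = np.zeros(n)
    for i in range(1, n):
        dpsi = 2.0 * np.pi * f_unit[i - 1] + k * np.sin(theta[i - 1] - psi[i - 1])
        psi[i] = psi[i - 1] + dpsi * dt
    return t, (psi - theta) % (2.0 * np.pi)         # relative phase grows over cycles
```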
Fig. 8. Computer experiment of theta phase precession using a hippocampal pyramidal neuron model [6]. (a) Bottom: Input current with gradual increase. Top: Resultant sustained oscillation with a gradual increase in the natural frequency of the membrane potential. (b) In the presence of LFP theta (middle), the neuronal activity exhibits theta phase precession.
5 Discussions and Conclusion
We elucidated a computational model of grid cells in the entorhinal cortex to investigate how temporal coding works for spatial representation in the brain. A computational model of grid field formation was proposed based on local path integration. This assumption was found to give theta phase precession within the grid field. This computational mechanism does not need the assumption of learning over repeated trials in a novel environment, but enables instantaneous spatial representation. Furthermore, this model is in good agreement with experimental observations of head direction cells and grid cells. The networks proposed in the model predict local interaction networks in the entorhinal cortex and also head direction systems distributed over many areas. Although the computation of place cells based on grid cells is beyond this paper, the emergence of theta phase precession in the entorhinal cortex can be used for place cell formation and also for instantaneous memory formation in the hippocampus [8]. These computational model studies, with a space-time structure for environmental space representation, highlight the temporal coding over distributed areas used in real-time processing of spatial information in an ever-changing environment.
References 1. O’Keefe, J., Nadel, L.: The hippocampus as a cognitive map. Clarendon Press, Oxford (1978) 2. Fyhn, M., Molden, S., Witter, M., Moser, E.I., Moser, M.B.: Spatial representation in the entorhinal cortex. Sience 305, 1258–1264 (2004) 3. Hafting, T., Fyhn, M., Moser, M.B., Moser, E.I.: Phase precession and phase locking in entorhinal grid cells. Program No. 68.8, Neuroscience Meeting Planner. Atlanta, GA: Society for Neuroscience (2006.) Online (2006) 4. Yamaguchi, Y., Sato, N., Wagatsuma, H., Wu, Z., Molter, C., Aota, Y.: A unified view of theta-phase coding in the entorhinal-hippocampal system. Current Opinion in Neurobiology 17, 197–204 (2007) 5. Yamaguchi, Y., McNaughton, B.L.: Nonlinear dynamics generating theta phase precession in hippocampal closed circuit and generation of episodic memory. In: Usui, S., Omori, T. (eds.) The Fifth International Conference on Neural Information Processing (ICONIP 1998) and The 1998 Annual Conference of the Japanese Neural Network Society (JNNS 1998), Kitakyushu, Japan. Burke, VA, vol. 2, pp. 781–784. IOS Press, Amsterdam (1998) 6. Pinsky, P.F., Rinzel, J.: Intrinsic and network rhythmogenesis in a reduced traub model for CA3 neurons. Journal of Computational Neuroscience 1, 39–60 (1994) 7. Fransén, E., Alonso, A.A., Dickson, C.T., Magistretti, J., Hasselmo, M.E.: Ionic mechanisms in the generation of subthreshold oscillations and action potential clustering in entorhinal layer II stellate neurons 14(3), 368–384 (2004) 8. Molter, C., Yamaguchi, Y.: Theta phase precession for spatial representation and memory formation. In: The 1st International Conference on Cognitive Neurodynamics (ICCN 2007), Shanghai, 2-09-0002 (2007)
Working Memory Dynamics in a Flip-Flop Oscillations Network Model with Milnor Attractor David Colliaux1,2 , Yoko Yamaguchi1 , Colin Molter1 , and Hiroaki Wagatsuma1 1
Lab for Dynamics of Emergent Intelligence, RIKEN BSI, Wako, Saitama, Japan 2 Ecole Polytechnique (CREA), 75005 Paris, France
[email protected] Abstract. A phenomenological model is developed where complex dynamics are the correlate of spatio-temporal memories. If resting is not a classical fixed point attractor but a Milnor attractor, multiple oscillations appear in the dynamics of a coupled system. This model can be helpful for describing brain activity in terms of well classified dynamics and for implementing human-like real-time computation.
1 Introduction
Neuronal collective activities of the brain are widely characterized by oscillations in humans and animals [1][2]. Among the various frequency bands, distant synchronization of theta rhythms (4-8 Hz oscillations defined in human EEG) has recently been found to relate to working memory, a short-term memory for central execution, in human scalp EEG [3][4] and in neural firing in monkeys [5][6]. For long-term memory, information coding is mediated by synaptic plasticity, whereas short-term memory is stored in neural activities [7]. Recent neuroscience has reported various types of persistent activities of single neurons and of populations of neurons as possible mechanisms of working memory. Among those, bistable states of the membrane potential, up- and down-states, and flip-flop transitions between them were measured in a number of cortical and subcortical neurons. The up-state, characterized by frequent firing, shows stability for seconds or more due to network interactions [8]. However, it is little known whether flip-flop transitions and distant synchronization work together, or what kinds of processing are enabled by a flip-flop oscillation network. An associative memory network with flip-flop change was proposed for working memory from the classical rate-coding view [9], while further consideration of the dynamical linking property based on firing oscillations, such as the synchronization of theta rhythms referred to above, is likely essential for the elucidation of multiple attractor systems. Besides, Milnor extended the concept of attractors to invariant sets with Lyapunov instability, which has been of interest in physical, chemical and biological systems. It might allow high freedom in spontaneous switching among semi-stable states [12]. In this paper, we propose a model of oscillation associative memory with flip-flop change for working memory. We found that
the Milnor attractor condition is satisfied in the resting state of the model. We will first study how the Milnor attractor appears and will then show possible behaviors of coupled units in the Milnor attractor condition.
2 A Network Model
2.1 Structure
In order to realize up- and down-states, where the up-state is associated with oscillation, phenomenological models are combined. Traditionally, associative memory networks are described by state variables representing the membrane potentials {Si} [9]. Oscillation is assumed to appear in the up-state as an internal process, described by a phase variable φi for the ith unit. The oscillation dynamics is simply given by a phase model with a resting state and periodic motion [10,11]. cos(φi) stands for an oscillation current in the dynamics of the membrane potential.
2.2 Mathematical Formulation of the Model
The flip-flop oscillations network of N units is described by the set of state variables $\{S_i, \phi_i\} \in \mathbb{R}^N \times [0, 2\pi[^N$ ($i \in [1, N]$). The dynamics of Si and φi is given by the following equations:
$$\frac{dS_i}{dt} = -S_i + \sum_j W_{ij} R(S_j) + \sigma(\cos(\phi_i) - \cos(\phi_0)) + I_\pm, \qquad \frac{d\phi_i}{dt} = \omega + (\beta - \rho S_i)\sin(\phi_i) \qquad (1)$$
with $R(x) = \frac{1}{2}(\tanh(10(x - 0.5)) + 1)$, $\phi_0 = \arcsin(-\omega/\beta)$ and $\cos(\phi_0) < 0$. R is the spike density of the units, and the input I± will be taken as positive (I+) or negative (I−) pulses (50 time steps), so that we can focus on the persistent activity of the units after a phasic input. ω and β are respectively the frequency and the stabilization coefficient of the internal oscillation. ρ and σ represent the mutual feedback between the internal oscillation and the membrane potential. Wij are the connection weights describing the strength of coupling between units i and j. φ0 is known to be a stable fixed point of the equation for φ, and 0 a fixed point of the S equation.
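A direct numerical integration of Eq. (1) is straightforward. The sketch below uses ω = 1, β = 1.2, ρ = 1 and a near-critical σ as quoted in Sect. 3.1, while the Euler step, the pulse timing/amplitude, and the function names are our own assumptions; φ0 is taken on the branch with cos(φ0) < 0, as specified above.

```python
import numpy as np

def R(x):
    """Spike density of a unit."""
    return 0.5 * (np.tanh(10.0 * (x - 0.5)) + 1.0)

def simulate(W, T=500.0, dt=0.05, omega=1.0, beta=1.2, rho=1.0, sigma=0.96,
             pulse_to=(0,), pulse_window=(50.0, 55.0), pulse_amp=1.0):
    """Euler integration of the flip-flop oscillations network of Eq. (1)."""
    N = W.shape[0]
    phi0 = np.pi + np.arcsin(omega / beta)   # sin(phi0) = -omega/beta, cos(phi0) < 0
    S, phi = np.zeros(N), np.full(N, phi0)
    n_steps = int(T / dt)
    S_hist = np.empty((n_steps, N))
    for k in range(n_steps):
        t = k * dt
        I = np.zeros(N)
        if pulse_window[0] <= t < pulse_window[1]:
            I[list(pulse_to)] = pulse_amp    # transient positive pulse I+
        dS = -S + W @ R(S) + sigma * (np.cos(phi) - np.cos(phi0)) + I
        dphi = omega + (beta - rho * S) * np.sin(phi)
        S, phi = S + dt * dS, phi + dt * dphi
        S_hist[k] = S
    return S_hist

# Two symmetrically coupled units (here W = 0.7 off-diagonal, an illustrative value)
# show the kind of up-state oscillations discussed in Sect. 4 after a pulse to unit 1.
traj = simulate(np.array([[0.0, 0.7], [0.7, 0.0]]), pulse_to=(0,))
```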
3 An Isolated Unit
3.1 Resting State
The resting state is the stable equilibrium when I = 0 for a single unit. We assume ω < β so that M0 = (0, φ0) is the fixed point of the system. To study the linear stability of this fixed point, we write the stability matrix around M0:
$$DF|_{M_0} = \begin{pmatrix} -1 & -\sigma\sin(\phi_0) \\ -\rho\sin(\phi_0) & \beta\cos(\phi_0) \end{pmatrix} \qquad (2)$$
The sign of the eigenvalues of DF|M0, and thus the stability of M0, depends only on μ = ρσ. With our choice of ω = 1 and β = 1.2, μc ≈ 0.96. If μ < μc, M0 is a stable fixed point and there is another fixed point M1 = (S1, φ1) with φ1 < φ0 which is unstable. If μ > μc, M0 is unstable and M1 is stable with φ1 > φ0. The fixed points exchange stability as the bifurcation parameter μ increases (transcritical bifurcation). The simplified system written in the eigenvector coordinates (X1, X2) of the matrix DF|M0 gives a clear illustration of the bifurcation:
$$\frac{dx_1}{dt} = ax_1^2 + \lambda_1 x_1, \qquad \frac{dx_2}{dt} = \lambda_2 x_2 \qquad (3)$$
Here a = 0 is equivalent to μ = μc, and in this condition there is a positive-measure basin of attraction but some directions are unstable. The resting state M0 is not a classical fixed point attractor because it does not attract all trajectories from an open neighborhood, but it is still an attractor if we consider Milnor's extended definition of attractors. The phase plane (S, φ) in Fig. 1 shows that for μ close to the critical value, the nullclines cross twice, staying close to each other in between. That narrow channel makes the configuration indistinguishable from a Milnor attractor in computer experiments.
Fig. 1. Top: Phase space (S, φ) with vector field and nullclines of the system. The dashed domain in B shows that M0 has a positive-measure basin of attraction when μ = μc. Bottom: Fixed points with their stable and unstable directions for the equivalent simplified system. A: μ < μc. B: μ = μc. C: μ > μc.
Since we showed μ is the crucial parameter for the stability of the resting state, we can now consider ρ = 1 and study the dynamics according to σ with a close look near the critical regime (σ = μc ).
3.2 Constant Input Can Give Oscillations
Under constant input there are two possible dynamics: fixed point and limit cycle. If
$$\frac{\omega}{\beta - S} < 1 \qquad (4)$$
there is a stable fixed point (S1, φ1) with φ1 the solution of
$$\omega + (\beta - \sigma(\cos(\phi_1) - \cos(\phi_0)) - I)\sin(\phi_1) = 0, \qquad S_1 = \sigma(\cos\phi_1 - \cos\phi_0) + I \qquad (5)$$
If condition (4) is not satisfied, the φ equation in (1) will give rise to oscillatory dynamics. Identifying S with its temporal average, $\frac{d\phi}{dt} = \omega + \Gamma\sin(\phi)$ with $\Gamma = \beta - S$ will be periodic with period $\int_0^{2\pi} \frac{d\phi}{\omega + (\beta - S)\sin(\phi)}$. This approximation gives an oscillation at frequency $\omega' = \sqrt{\omega^2 - (\beta - S)^2}$, which is qualitatively in good agreement with computer experiments (Fig. 2).
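The period expression above is easy to verify numerically; the short check below integrates dφ/dt = ω + Γ sin φ for one cycle and compares the measured period with 2π/√(ω² − Γ²). The particular values of ω and Γ are illustrative.

```python
import numpy as np

omega, Gamma, dt = 1.0, 0.6, 1e-4
phi, t = 0.0, 0.0
while phi < 2.0 * np.pi:                 # integrate one full cycle of phi
    phi += (omega + Gamma * np.sin(phi)) * dt
    t += dt
print(t, 2.0 * np.pi / np.sqrt(omega**2 - Gamma**2))   # the two values nearly agree
```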
1.2 S minimum S maximum Frequency (theoretical) Frequency
5
1
4
0.8
3 f
S
0.6 2 0.4 1 0.2 0 0 -1 -0.4
-0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
I
Fig. 2. For each value of constant current I, maximum and minimum values of S1 are plotted. Dominant frequency of S1 obtained by FFT is compared to the theoretical value when S is identified with its temporal average: Frequency VS Frequency (theoretical).
If we inject an oscillatory input into the system, S oscillates at the same frequency provided the input frequency is low. For higher frequencies, S cannot follow the input and shows complex oscillatory dynamics with multiple frequencies.
164
4
D. Colliaux et al.
Two Coupled Units
For two coupled units, flip-flop of oscillations is observed under various conditions. We will analyze the case μ = 0 and flip-flop properties under various strengths of connection weights, assuming symmetrical connections (W12 = W2,1 = W ). 4.1
Influence of the Feedback Loop
In equation 1, ρ and σ implement a feedback loop representing mutual influence of φ and S for each unit. The Case µ = 0. In the case σ = 0 or ρ = 0, φ remains constant φ = φ0 : the system is then a classical recurrent network. This model was used to provide associative memory network storing patterns in fixed point attractors [9]. For small coupling strength, the resting state is a fixed point. For strong coupling strength, two more fixed points appear, one unstable, corresponding to threshold, and one stable, providing memory storage. After a transient positive input I+ above threshold, the coupled system will be in up-state. A transient negative input I− can bring it back to resting state. For a small perturbation (σ 1 and ρ = 1), the active state is a small up-state oscillation but associative memory properties (storage, completion) are preserved. Growing Oscillations. The up-state oscillation in the membrane potential dynamics triggered by giving an I+ pulse to unit 1 grows when σ increases and saturates to an up-state fixed point for strong feedback. Interestingly, for a range of feedback strength values near μc , S returns transiently near the Milnor attractor resting state. Projection of the trajectories of the 4-dimensional system on a 2-dimensional plane section P illustrates these complex dynamics Fig. 3. A cycle would intersect this plane in two points. For each σ value, we consider S1 for these intersection points. For a range between 0.91 and 1.05 with our choice of parameters, there are much more than two intersection points M*, suggesting chaotic dynamics. 4.2
Influence of the Coupling Strength
The dynamics of two coupled units can be a fixed point attractor, as in the resting state (I = 0), or down-state or up-state oscillation (depending on the coupling strength), after a transient input. Near critical value of the feedback loop, in addition to these, more complex dynamics occur for intermediate coupling strength.
Working Memory Dynamics in a Flip-Flop Oscillations Network Model
165
Fig. 3. A: Influence of the feedback loop- Bifurcation diagram according to σ (Top). S1 coordinates of the intersecting points of the trajectory with a plane section P according to σ(Bottom). B: Influence of the coupling strengh - S1 maximum and minimum values and average phase difference (φ1 − φ2 ) according to W (Top). S1 coordinates of the intersecting points of the trajectory with a plane section P according to W (Bottom).
Down-state Oscillation. For small coupling strength, the system periodically visits the resting state for a long time and goes briefly to up-state. The frequency of this oscillation increases with coupling strength. The two units are anti-phase (when Si takes maximum value, Sj takes minimum value) Fig. 4 (Bottom). Up-state Oscillation. For strong coupling strength, a transient input to unit 1 leads to an up-state oscillation Fig. 4 (Top). The two units are perfectly in-phase at W = 0.75 and phase difference stays small for stronger coupling strength. Chaotic Dynamics. For intermediate coupling strength, an intermediate cycle is observed and more complex dynamics occur for a small range (0.58 < W < 0.78
166
D. Colliaux et al.
Fig. 4. Si temporal evolution, (S1 , S2 ) phase plane and (Si , φi ) cylinder space. Top: Up-state oscillation for strong coupling. Middle: Multiple frequency oscillation for intermediate coupling. Bottom: Down-state oscillation for weak coupling.
with our parameters) before full synchronization characterized by φ1 − φ2 = 0. The trajectory can have many intersection points with P and S ∗ in Fig. 3 shows multiple roads to chaos through period doubling.
Working Memory Dynamics in a Flip-Flop Oscillations Network Model
5
167
Application to Slow Selection of a Memorized Pattern
5.1
A Small Network
The network is a set N of five units consisting in a subset N1 of three units A,B and C and another N2 of two units D and E. In the set N, units have symmetrical all-to-all weak connections (WN = 0.01) and in each subset units have symmetrical all-to-all strong connections (WNi = 0.1 ∗ M ) with M a global parameter slowly varying in time between 1 and 10. These subsets could represent two objects stored in the weight matrix. 5.2
Memory Retrieval and Response Selection
S
We consider a transient structured input into the network. For constant M, a partial or complete stimulation of a subset Ni can elicit retrieval and completion of the subset in an up-state as would do a classical auto-associative memory network.
2 1.5 1 0.5 0
A 0
20000
40000
60000
80000
100000
120000
140000
80000
100000
120000
140000
80000
100000
120000
140000
80000
100000
120000
140000
80000
100000
120000
140000
S
t 2 1.5 1 0.5 0
B 0
20000
40000
60000
S
t 2 1.5 1 0.5 0
C 0
20000
40000
60000
S
t 2 1.5 1 0.5 0
D 0
20000
40000
60000
S
t 2 1.5 1 0.5 0
E 0
20000
40000
60000 t
Fig. 5. Slow activation of a robust synchronous up-state in N1 during slow increase of M
In the Milnor attractor condition more complex retrieval can be achieved when M is slowly increased. As an illustration, we consider transient stimulation of units A and B from N1 and unit E from N2 Fig. 5. N2 units show anti-phase
168
D. Colliaux et al.
oscillations with increasing frequency. N1 units first show synchronous downstate oscillations with long stays near the Milnor attractor and gradually go toward sustained up-state oscillations. In this example, the selection of N1 in up-state is very slow and synchrony between units plays an important role.
6
Conclusion
We demonstrated that, in cylinder space, a Milnor attractor appears at a critical condition through forward and reverse saddle-node bifurcations. Near the critical condition, the pair of saddle and node constructs a pseudo-attractor, which can serves for observation of Milnor attractor-like properties in computer experiments. Semi-stability of the Milnor attractor in this model seems to be associated with the variety of oscillations and chaotic dynamics through period doubling roads. We demonstrated that an oscillations network provides a variety of working memory encoding in dynamical states under the presence of a Milnor attractor. Applications of oscillatory dynamics have been compared to classical autoassociative memory models. The importance of Milnor attractors was proposed in the analysis of coupled map lattices in high dimension [11] and for chaotic itinerancy in the brain [13]. The functional significance of flip-flop oscillations networks with the above dynamical complexity is of interest for further analysis of integrative brain dynamics.
References 1. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The brainweb: Phase synchronization and large-scale integration. Nature Reviews Neuroscience (2001) 2. Buzsaki, G., Draguhn, A.: Neuronal oscillations in cortical networks. Science (2004) 3. Onton, J., Delorme, A., Makeig, S.: Frontal midline EEG dynamics during working memory. NeuroImage (2005) 4. Mizuhara, H., Yamaguchi, Y.: Human cortical circuits for central executive function emerge by theta phase synchronization. NeuroImage (2004) 5. Rainer, G., Lee, H., Simpson, G.V., Logothetis, N.K.: Working-memory related theta (4-7Hz) frequency oscillations observed in monkey extrastriate visual cortex. Neurocomputing (2004) 6. Tsujimoto, T., Shimazu, H., Isomura, Y., Sasaki, K.: Prefrontal theta oscillations associated with hand movements triggered by warning and imperative stimuli in the monkey. Neuroscience Letters (2003) 7. Goldman-Rakic, P.S.: Cellular basis of working memory. Neuron (1995) 8. McCormick, D.A.: Neuronal Networks: Flip-Flops in the Brain. Current Biology (2005) 9. Durstewitz, D., Seamans, J.K., Sejnowski, T.J.: Neurocomputational models of working memory. Nature Neuroscience (2000) 10. Yamaguchi, Y.: A Theory of hippocampal memory based on theta phase precession. Biological Cybernetics (2003)
11. Kaneko, K.: Dominance of Milnor attractors in Globally Coupled Dynamical Systems with more than 7 ± 2 degrees of freedom (retitled from 'Magic Number 7 ± 2 in Globally Coupled Dynamical Systems'). Physical Review Letters (2002) 12. Fujii, H., Tsuda, I.: Interneurons: their cognitive roles - A perspective from dynamical systems view. Development and Learning (2005) 13. Tsuda, I.: Towards an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behavioural and Brain Sciences (2001)
Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization of Quasi-Attractors Hiroshi Fujii1,2, Kazuyuki Aihara2,3, and Ichiro Tsuda4,5 1
Department of Information and Communication Sciences, Kyoto Sangyo University, Kyoto 603-8555, Japan
[email protected] 2 Institute of Industrial Science, the University of Tokyo, Tokyo 153-8505
[email protected] 3 ERATO, Japan Science and Technology Agency, Tokyo 151-0065, Japan 4 Research Institute for Electronic Science, Hokkaido University, Sapporo 060-0812, Japan
[email protected] 5 Center of Excellence COE in Mathematics, Department of Mathematics, Hokkaido University, Sapporo 060-0810, Japan
Abstract. A new hypothesis on a possible role of the corticopetal acetylcholine (ACh) is provided from a dynamical systems standpoint. The corticopetal ACh helps to transiently organize global (inter- and intra-cortical) quasi-attractors via gamma-range synchrony when this is behaviorally needed, as in top-down attention and expectation.
1 Introduction
1.1 Corticopetal Acetylcholine
Acetylcholine (ACh) is the first substance identified as a neurotransmitter, by Otto Loewi [19]. Although it is increasingly recognized that ACh plays a critical role not only in arousal and sleep but also in higher cognitive functions such as attention, conscious flow, and so on, the question of how ACh works in those cognitive processes remains a mystery [11]. The corticopetal ACh, originating in the nucleus basalis of Meynert (NBM), a part of the basal forebrain (BF), is the primary source of cortical ACh, and the major target of BF projections is the cortex [21]. Behavioral studies, and those using immunotoxins as well, provide consistent evidence of the role of ACh in top-down attention. A blockage of NBM ACh, whether disease-related or drug-induced, causes a severe loss of attention: selective attention, sustained attention, and divided attention, together with the shift of attention. ACh also concerns conscious flow (Perry & Perry [24]). Continual death of cholinergic neurons in the NBM causes Lewy Body Dementia (LBD), one of the most salient symptoms of which is complex visual hallucination (CVH) [1].1
Perry and Perry [24] noted those hallucinatory LBD patients who see: "integrated images of people or animals which appear real at the time", "insects on walls", or "bicycles outside the fourth storey window". Images are generally vivid and colored, and continue for a few minutes (neither seconds nor hours). It is to be noted that "many of those experiences are enhanced with eyes closed and relieved by visual input", and "nicotinic antagonists, such as mecamylamine, are not reported to induce hallucinations".
1.2 Attentions, Cortical State Transitions and Cholinergic Control System from NBM Top-down flow of signals which accompanies attentions, expectation and so on may cause state transitions in the “down stream” cortices. Fries et al. [7] reported an increase of synchrony in high-gamma range in accordance of selective attention. (See, also Jones [15], Buschman et al. [2].) Metherate et al. [22] stimulated NBM in in vivo preparations of auditory cortex. Of particular interest in their observations is that NBM ACh produced a change in subthreshold membrane potential fluctuations from large-amplitude, slow (1-5 Hz) oscillations to low-amplitude, fast (20-40 Hz) (i.e., gamma) oscillations. A shift of spike discharge pattern from phasic to tonic was also observed.2 They pointed out that in view of the wide spread projections of NBM neurons, larger intercortical networks could also be modified. Together with Fries et al. data, it is suggested that NBM cholinergic projections may induce a state transition as a shift of frequency, and change of discharge pattern in neocortical neurons. This may be consistent with the observation made by Kay et al. [16]. During perceptual processing in the olfactory-limbic axis, a cascade of brain events at the successive stages of the task as “expectation” and/or “attention” was observed. ‘Local’ transitions of the olfactory structures indicated by modulations of EEG signals as gamma amplitude, periodicity, and coherence were reported to exist. Kay et al. also observed that the local dynamics transiently falls into attractor-like states. Such ‘local’ transitions of states are generally postulated to be triggered by ‘topdown’ glutamatergic spike volleys from “upper stream” organizations. However, such brain events of state transitions with a change in synchrony could be a result of collaboration of descending glutamatergic spike volleys and ascending ACh afferents from NBM. (See, also [26]).3
2 Neural Correlate of Conscious Percepts and the Role of the Corticopetal ACh
2.1 Neural Correlates of Conscious Percepts and Transient Synchrony
The corticopetal ACh pathway might be the critical control system which may trigger various kinds of attention, receiving convergent inputs from other sensory and association areas, as Sarter et al. [25], [26] argued. In order to discuss the role of the
NBM contains cholinergic neurons and non-cholinergic neurons. GABAergic neurons are at least twice more numerous than cholinergic neurons [15] The Metherate et al. observations (above) may be the result of collective functioning of both the cholinergic and GABAergic projections. Wenk [31] argued another possibility that the NBM ACh projections on the reticular thalamic nucleus might cause the cortical state change. 3 Triangular Attentional Pathway: The above arguments may better be complemented by the triangular amplification circuitry, a pathway consisting of parietal cortex → prefrontal cortex → NBM → sensory cortex [10]. This may constitute the cholinergic control system from NBM, i.e., the top-down attentional pathway for cholinergic modulations of responses in cortical sensory areas.
corticopetal ACh related to attentions, we first begin with the question: What is the neural correlate of conscious percepts? The recent experiments by Kenet et al. [17] show the possibility that in a background (or spontaneous) state where no external stimuli exist, the visual cortex fluctuates between specific internal states. This might mean that cortical circuitry has a number of pre-existing and intrinsic internal states which represent features, and the cortex fluctuates between such multiple intrinsic states even when no external inputs exist. (See, also Treisman et al. [27].) Attention and Dynamical Binding Through Synchrony In order that perception of an object makes sense, its features, or fragmentary subassemblies, must be bound as a unity. How does the “binding” of those local fragmentary dynamics into a global dynamics is done? A widely accepted view is that top-down signals as expectation, attentions, and so on may play the role of such an integration, which is mediated by (possibly gamma) synchronization among the concerned assemblies representing stimuli or events. We postulate that such a process is a basis of global intra- and inter-cortical conjunctions of brain activities. See, Varela et al. [30],Womelsdorf et al. [32]. See, also Dehaene and Changeux [5].) The neural correlate of conscious percepts is a globally integrated state of related brain networks mediated by synchrony over gamma and other frequency bands. In mathematical terms, such transient processes of synchrony of global, but between selected groups of neurons may be described as a transitory state of approaching a global attractor. We note that such a “transitory” state may be conceptualized as an attractor ruin (Tsuda [28], Fujii et al. [8]), in which there are orbits approaching and stay there for a while, but may simultaneously possess repelling orbits from itself. The proper Milnor attractor and its perturbed structures can be a specific representation of attractor ruins [8]. However, since the concept of attractor ruins may include a wider class of non-classical attractors than the Milnor attractor, we may use the term “attractor ruins” in this paper to include possible but unknown classes of ruins. 2.2 Role of the Corticopetal ACh: A Working Hypothesis How can top-down attentions, expectation, etc. contribute to conscious perception with the aid of ACh? Assuming the arguments in the preceding section, this question could be translated into: “How do the corticopetal ACh projections work for the emergence of globally organized attractor ruins via transient synchrony?” We summarize our tentative proposition as a Working Hypothesis in the following. Working Hypothesis: The role of the corticopetal ACh accompanied with top-down contextual signals as attentions and so on is the mediator for dynamically organizing quasi-attractors, which are required in conscious perception, or in executing actions. ACh “integrates” multiple “floating subnetworks” into “a transient synchrony group” in the gamma frequency range. Such a transiently emerging synchrony group can be regarded as an attractor ruin in the dynamical systems-theoretic sense.
3 Do the Existing Experimental Data Support the Working Hypothesis? 3.1 Introductory Remarks: Transient Synchronization by Virtue of Pre- and Post-synaptic ACh Modulations ACh may have both pre-synaptic and post-synaptic effects on individual neurons in the cortex. First, top-down glutamatergic spike volleys flow into cortical layers, which might convey contextual information on stimuli. The corticopetal ACh arrives concomitantly with the glutamatergic volleys. If ACh release modulates synaptic connectivity between cortical neurons even in an effective sense by virtue of “pre-synaptic modulations”, metamorphosis of the attractor landscape4 should be inevitable. Post-synaptic influences of the corticopetal ACh on individual neurons – either inhibitory or excitatory, might cause deep effects on their firing behavior, and might induce a state transition with a collective gamma oscillation. A consequence of the three effects together might trigger a specific group of networks to oscillate in gamma frequency with phase synchrony. These are all speculative stories based on experimental evidence. We need at least to examine the present status on the experimental data concerning the cortical ACh influence on individual neurons. This may be the place to add a comment on the specificity of the corticopetal ACh projections on the cortex, which may be a point of arguments. It is reported that the cholinergic afferents specialized synaptic connections with post synaptic targets, rather than releasing ACh non-specifically (Turrini et al..[29].) 3.2 Controversy on Experimental Data Let us review quickly the existing experimental data. As noted before, “there exist little consensus among researchers for more than a half century” [11]. The following is not intended to give a complete review, but to give a preliminary knowledge which may be of help to understand the succeeding discussions. ACh have two groups of receptors, one is the muscarinic receptors, mAChRs with 5 subtypes, and the other is nicotinic receptors, nAChRs with 17 subtypes. The nAChR is a relatively simple cationic (Na+ and Ca2+) channel, the opening of which leads to a rapid depolarization followed by desensitization. Most of mAChRs activation exhibits the slower onset and longer lasting G-protein coupled second messenger generation. Here the primary interest is in mAChRs. 5 The functions of mAChRs are reported to be two-fold: one is pre-synaptic, and the other is postsynaptic modulations. 4
“Attractor landscape” is usually used for potential systems. Here, we use it to mean the landscape of “basins” (absorbing regions), of classical and non-classical attractors as attractor ruins. 5 The nicotinic receptors, nAChRs may work as a disinhibition system to layer 2/3 pyramidal neurons [4]), the exact function of which is not known yet.
Post-synaptic Modulations
The results of traditional studies may be divided into two opposing groups of data. The majority view is that mAChRs function as an excitatory transmitter for post-synaptic neurons (see, e.g., McCormick [20]), while there are minority data that claim inhibitory functioning. The latter, however, has been considered to be a consequence of ACh excitation of interneurons, which in turn may inhibit post-synaptic pyramidal neurons (PYR). Recently, Gulledge et al. [11], [12] stated that transient mAChR activation generates strong and direct transient inhibition of neocortical PYR. The underlying ionic process is the induction of calcium release from internal stores, and subsequent activation of small-conductance (SK-type) calcium-activated potassium channels. The authors claim that the traditional data do not describe the actions of transient mAChR activation, as is likely to happen during synaptic release of ACh, for the following reasons.
1. In vivo ACh concentration: Previous studies typically used high concentrations of muscarinic agonists (1–100 mM). Extracellular concentrations of ACh in the cortex are at least one order of magnitude lower than those required to depolarize PYR in vitro.
2. Phasic (transient) application vs. bath application: Most data depended on experiments with bath applications, which may correspond to prolonged, tonic mAChR stimulation. The ACh release accompanying attention, etc., would better correspond to a transient puff application as in the authors' experiment.
The specificity of ACh afferents on postsynaptic targets was already noted [29].
Pre-synaptic Modulations
Experimental works on pre-synaptic modulations are mostly based on ACh bath applications, and the modulation data were measured in terms of local field potentials (LFP). Most results claimed pathway specificity of the modulations. Typically, it was concluded that muscarinic modulation can strongly suppress intracortical (IC) synaptic activity while exerting less suppression on, or actually enhancing, thalamocortical (TC) inputs [14]. Gil et al. [9] reported that 5 μM muscarine decreases both IC and TC pathway transmission, and that those data reflect presynaptic effects, since membrane potential and input resistance were unchanged. Recently, Kuczewski et al. [18] studied the same problem, and obtained results different from the previous ones: low ACh (less than 100 μM) shows facilitation, and as the ACh concentration goes higher, the result is depression. This is true for both the IC and TC pathways, i.e., for layer 2/3 and layer 4.
3.3 Possible Scenarios
The lack of consistent experimental data makes our job complicated. The situation might be compared to playing a jigsaw puzzle with many pieces missing and some pieces mingled in from other jigsaw pictures. What we can do at this moment may be to propose possible alternative scenarios for the role of the corticopetal ACh.
The following is the list of prerequisites and evidence on which our arguments should be based.
1. Two modulations may occur simultaneously inside the 6 layers of the cortex. The firing characteristics of individual neurons and the strength of synaptic connections may change dynamically, either as post-synaptic or pre-synaptic modulations. Virtually no models, to our knowledge, have been proposed which take the net effects of the two modulations into account.
2. The interaction of ACh with top-down glutamatergic spike volleys should be considered. The majority of neurons alter in response to concomitant exposure to both acetylcholine and glutamate (Perry & Perry [24]).
3. As a post-synaptic influence, ACh release may change the firing regime of neurons, and induce gamma oscillation [6], [23], [31].
As to pre-synaptic modulations, the details of the synaptic processes appear to be largely unknown. The significance of experimental studies cannot be overemphasized. Now let us try to draw a sketch of possible scenarios for the role of the corticopetal ACh. Here we may put three cornerstones for the models:
1. Who triggers the gamma oscillation?
2. Who (and how) modulates the effective connectivity?
3. What is the mechanism of phase synchrony, and what is its role?
Scenario I The basic idea is that in the default, low level state of ACh, globally organized attractors do not virtually exist, and may take the form of floating fragmentary dynamics. Then, ACh release may help to strengthen the synaptic connections presynaptically. One of the roles of post-synaptic modulation is to start up the gamma oscillation. (Here the influence of GABAergic projections from NBM might play a role.) Another, but important role will be stated later. Scenario II The effective modulation of synaptic connectivity might be carried by, rather than the pre-synaptic modulation, the phase synchrony of the gamma oscillation itself which is triggered by the post-synaptic modulation. Such a mechanism for the change of synaptic connectivity, and resulting binding of fragmentary groups of neurons was proposed by Womelsdorf et al. [32]. They claimed that the mutual influence among neuronal groups depends on the phase relation between rhythmic activities within the groups. Phase relations supporting interactions among the groups preceded those interactions by a few milliseconds, consistent with a mechanistic role. See, also Buzsaki [3]. For the case of Scenario II, the role of the post-synaptic modulation is to start up the gamma oscillation, and the reset of its phase, as Gulledge and Stuart [11] suggested. The transient hyper-polarization may play the role of referee to start up the oscillation among the related groups in unison. The Scenario I makes the postsynaptic modulation carry the two roles of starting up the gamma oscillation, and
resetting its phase. To the pre-synaptic modulation, a bigger role is assigned: the realization of attractors by virtue of synaptic strength modulation.
4 Concluding Discussions
The critical role of the corticopetal ACh in cognitive functions, together with its relation to some disease-related symptoms such as complex visual hallucinations in DLB, and its apparent involvement in neocortical state changes, has motivated us to study the functional role(s) of the corticopetal ACh from dynamical systems standpoints. Cognitive functions are phenomena carried by brain dynamics. We hope that understanding the cognitive dynamics with the dynamical systems language will open new theoretical horizons. It is of some help to consider the conceptual difference of the two "forces" which flow into the 6 layers of the neocortex. Glutamate spike volleys could be, if viewed as an event in a dynamical system, an external force, which may kick the orbit to another orbit, and may sometimes kick it out of the "basin" of the present attractor beyond the border, the separatrix. In contrast to this situation, ACh projections, though transient, could be regarded as a slow parameter working as a bifurcation parameter that modifies the landscape itself. What we are looking at in the preceding arguments is that the two phenomena happen concomitantly in the 6 layers of the cortex. Hasselmo and McGaughy [13] emphasized the ACh role in memory as: "high acetylcholine sets circuit dynamics for attention and encoding; low acetylcholine sets dynamics for consolidation", which is based on some experimental data on selective pre-synaptic depression and facilitation. However, in view of the potential role of attention in local bindings or global integrations, we may pose alternative (but not necessarily exclusive) scenarios on the ACh function as temporarily modifying the quasi-attractor landscape, in collaboration with glutamatergic spike volleys. Rather, we speculate that the process of memorization itself would be realized through such a dynamic formation of attractor ruins, for which mAChR may play a role.
Acknowledgements The first author (HF) was supported by a Grant-in-Aid for Scientific Research (C), No. 19500259, from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government. The second author (KA) was partially supported by Grant-in-Aid for Scientific Research on Priority Areas 17022012 from the Ministry of Education, Culture, Sports, Science, and Technology, the Japanese Government. The third author (IT) was partially supported by a Grant-in-Aid for Scientific Research on Priority Areas, No. 18019002 and No. 18047001, a Grant-inAid for Scientific Research (B), No. 18340021, Grant-in-Aid for Exploratory Research, No. 17650056, a Grant-in-Aid for Scientific Research (C), No. 16500188, and the 21st Century COE Program, Mathematics of Nonlinear Structures via Singularities.
References 1. Behrendt, R.-P., Young, C.: Hallucinations in schizophrenia, sensory impairment, and brain disease: A unifying model. Behav. Brain Sci. 27, 771–787 (2004) 2. Buschman, T.J., Miller, E.K.: Top-down Versus Bottom-Up Control of Attention in the Prefrontal and Posterior Parietal Cortices. Science 315, 1860–1862 (2007) 3. Buzsaki, G.: Rhythms of the Brain. Oxford University Press, Oxford (2006) 4. Christophe, E., Roebuck, A., Staiger, J.F., Lavery, D.J., Charpak, S., Audinat, E.: Two Types of Nicotinic Receptors Mediate an Excitation of Neocortical Layer I Interneurons. J. Neurophysiol. 88, 1318–1327 (2002) 5. Dehaene, S., Changeux, J.-P.: Ongoing Spontaneous Activity Controls Access to Consciousness: A Neuronal Model for Inattentional Blindness. PLoS Biology 3, 910–927 (2005) 6. Detari, L.: Tonic and phasic influence of basal forebrain unit activity on the cortical EEG. Behav. Brain. Res. 115, 159–170 (2000) 7. Fries, P., Reynolds, J.H., Rorie, A.E., Desimone, R.: Modulation of Oscillatory Neuronal Synchronization by Selective Visual Attention. Science 291, 1560–1563 (2001) 8. Fujii, H., Aihara, K., Tsuda, I.: Functional Relevance of ‘Excitatory’ GABA Actionsin Cortical Interneurons: A Dynamical Systems Approach. J. Integrative Neurosci. 3, 183– 205 (2004) 9. Gil, Z., Connors, B.W., Yael Amitai, Y.: Differential Regulation of Neocortical Synapses by Neuromodulators and Activity. Neuron 19, 679–686 (1997) 10. Golmayo, L., Nunez, A., Zaborsky, L.: Electrophysiological Evidence for the Existence of a Posterior Cortical-Prefrontal-Basal Forebrain Circuitry in Modulating Sensory Responses in Visual and Somatyosensory Rat Cortical Areas. Neuroscience 119, 597–609 (2003) 11. Gulledge, A.T., Stuart, G.J.: Cholinergic Inhibition of Neocortical Pyramidal Neurons. J. Neurosci 25, 10308–10320 (2005) 12. Gulledge, A.T., Susanna, S.B., Kawaguchi, Y., Stuart, G.J.: Heterogeneity of phasic signaling in neocortical neurons. J. Neurophysiol. 97, 2215–2229 (2007) 13. Hasselmo, M.E., McGaughy, J.: High acetylcholine sets circuit dynamics for attention and encoding; Low acetylcholine sets dynamics for consolidation. Prog. Brain Res. 145, 207– 231 (2004) 14. Hsieh, C.Y., Cruikshank, S.J., Metherate, R.: Differential modulation of auditory thalamocortical and intracortical synaptic transmission by cholinergic agonist. Brain Res 880, 51–64 (2000) 15. Jones, B.E., Muhlethaler, M.: Cholinergic and GABAergic neurons of the basal forebrain: role in cortical activation. In: Lydic, R., Baghdoyan, H.A. (eds.) Handbook of Behavioral State Control, pp. 213–233. CRC Press, London (1999) 16. Kay, L.M., Lancaster, L.R., Freeman, W.J.: Reafference and attractors in the olfactory system during odor recognition. Int. J. Neural Systems 4, 489–495 (1996) 17. Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., Arieli, A.: Nerve cell activity when eyes are shut reveals internal views of the world. Nature 425, 954–956 (2003) 18. Kuczewski, N., Aztiria, E., Gautam, D., Wess, J., Domenici, L.: Acetylcholine modulates cortical synaptic transmission via different muscarinic receptors, as studied with receptor knockout mice. J. Physiol. 566.3, 907–919 (2005) 19. Loewi, O.: Ueber humorale Uebertragbarkeit der Herznervenwirkung. Pflueger’s Archiven Gesamte Physiologie 189, 239–242 (1921) 20. McCormick, D.A., Prince, D.A.: Mechanisms of action of acetylcholine in the guinea-pig cerebral cortex in vitro. J. Physiol. 375, 169–194 (1986)
21. Mesulam, M.M., Mufson, E.J., Levey, A.I., Wainer, B.H.: Cholinergic innervation of cortex by the basal forebrain: cytochemistry and cortical connections of the septal area, diagonal band nuclei, nucleus basalis (substantia innominata), and hypothalamus in the rhesus monkey. J. Comp. Neurol. 214, 170–197 (1983) 22. Metherate, R., Charles, L., Cox, C.L., Ashe, J.H.: Cellular Bases of Neocortical Activation: Modulation of Neural Oscillations by the Nucleus Basalis and Endogenous Acetylcholine. J. Neurosci. 72, 4701–4711 (1992) 23. Niebur, E., Hsiao, S.S., Johnson, K.O.: Synchrony: a neuronal mechanism for attentional selection? Curr. Opin. Neurobiol. 12, 190–194 (2002) 24. Perry, E.K., Perry, R.H.: Acetylcholine and Hallucinations: Disease-Related Compared to Drug-Induced Alterations in Human Consciousness. Brain Cognit. 28, 240–258 (1995) 25. Sarter, M., Gehring, W.J., Kozak, R.: More attention should be paid: The neurobiology of attentional effort. Brain Res. Rev. 51, 145–160 (2006) 26. Sarter, M., Parikh, V.: Choline Transporters, Cholinergic Transmission and Cognition. Nature Reviews Neurosci. 6, 48–56 (2005) 27. Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cognit. Psychol. 12, 97–136 (1980) 28. Tsuda, I.: Chaotic Itinerancy as a Dynamical Basis of Hermeneutics of Brain and Mind. World Future 32, 167–185 (1991) 29. Turrini, P., Casu, M.A., Wong, T.P., De Koninck, Y., Ribeiro-da-Silva, A., And Cuello, A.C.: Cholinergic nerve terminals establish classical synapses in the rat cerebral cortex: synaptic pattern and age—related atrophy. Neiroscience 105, 277–285 (2001) 30. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The Brainweb: Phase synchronization and large-scale integration. Nature Rev. Neurosci. 2, 229–239 (2001) 31. Wenk, G.L.: The Nucleus Basalis Magnocellularis Cholinergic System: One Hundred Years of Progress. Neurobiol. Learn. Mem. 67, 85–95 (1997) 32. Womelsdorf, T., Schoffelen, J.M., Oostenveld, R., Singer, W., Desimone, R., Engel, A.K., Fries, P.: Modulation of neuronal interactions through neuronal synchronization. Science 316, 1578–1579 (2007)
Tracking a Moving Target Using Chaotic Dynamics in a Recurrent Neural Network Model Yongtao Li and Shigetoshi Nara Graduate School of Natural Science and Technology, Okayama University, 3-1-1 Tsushima-naka, Okayama 700-8530, Japan
[email protected]
Abstract. Chaotic dynamics introduced in a recurrent neural network model is applied to controlling a tracker so that it tracks a moving target in two-dimensional space, which is set as an ill-posed problem. The motion increments of the tracker are determined by a group of motion functions calculated in real time from the firing states of the neurons in the network. Several groups of cyclic memory attractors that correspond to several simple motions of the tracker in two-dimensional space are embedded. Chaotic dynamics enables the tracker to perform various motions. Adaptive real-time switching of a control parameter causes chaotic itinerancy and enables the tracker to track a moving target successfully. The performance of tracking is evaluated by calculating the success rate over 100 trials. Simulation results show that chaotic dynamics is useful for tracking a moving target. To understand this further, the dynamical structure of the chaotic dynamics is investigated from a dynamical viewpoint. Keywords: Chaotic dynamics, tracking, moving target, neural network.
1 Introduction
Biological systems have become a hot research topic around the world because of their excellent functions, not only in information processing but also in well-regulated functioning and controlling, which work quite adaptively in various environments. However, our understanding of the mechanisms of biological systems, including brains, is still poor despite many efforts of researchers, because the enormous complexity originating from the dynamics in such systems is very difficult to understand and describe using the conventional methodologies based on reductionism, that is, decomposing a system into parts or elements. The conventional reductionism more or less falls into two difficulties: one is "combinatorial explosion" and the other is "divergence of algorithmic complexity". These difficulties are not yet solved. On the other hand, a dynamical viewpoint for understanding the mechanism seems to be a plausible method. In particular, chaotic dynamics experimentally observed in biological systems including brains [1,2] has suggested a viewpoint that chaotic dynamics would play important roles in the complex functioning and controlling of biological systems including brains. From this viewpoint, many dynamical models have been constructed for approaching the mechanisms by means of large-scale simulation or heuristic methods. Artificial neural networks in which chaotic dynamics can be introduced have been attracting great interest, and the relation between chaos and functions has been discussed
[9,10,11,12]. As one of those works, Nara and Davis found that chaotic dynamics can occur in a recurrent neural network model(RNNM) consisting of binary neurons [3], and they investigated the functional aspects of chaos by applying it to solving a memory search task with an ill-posed context[7]. To show the potential of chaos in controlling, chaotic dynamics was applied to solving two-dimensional mazes, which are set as ill-posed problems[8]. Two important points were proposed. One is a simple coding method translating the neural states into motion increments , the other is a simple control algorithm, switching a system parameter adaptively to produce constrained chaos. The conclusions show that constrained chaos behaviours can give better performance to solving a two-dimensional maze than that of random walk. In this paper, we develop the idea and apply chaotic dynamics to tracking a moving target, which is set as another ill-posed problem. Let us state about a model of tracking a moving target. An tracker is assumed to move in two-dimensional space and track a moving target along a certain trajectory by employing chaotic dynamics. the tracker is assumed to move with discrete time steps. The state pattern is transform into the tracker’s motion by the coding of motion functions, which will be given in a later section. In addition, several limit cycle attractors, which are regarded as the prototypical simple motions, are embedded in the network. By the coding of motion function, each cycle corresponds to a monotonic motion in twodimensional space. If the state pattern converges into a prototypical attractor, the tracker moves in a monotonic direction. Introducing chaotic dynamics into the network generated non-period state pattern, which is transformed into chaotic motion of the tracker by motion functions. Adaptive switching of a system parameter by a simple evaluation between chaotic dynamics and attractor’s dynamics in the network results in complex motions of the tracker in various environments. Considering this point, a simple control algorithm is proposed for tracking a moving target. In actual simulation, the present method using chaotic dynamics gives novel performance. To understand the mechanism of better performance, dynamical structure of chaotic dynamics is investigated from statistical data.
2 Memory Attractors and Motion Functions Our study works with a fully interconnected recurrent neural network consisting of N binary neurons. Its updating rule is defined by Si (t + 1) = sgn Wi j S j (t) (1) j∈Gi (r)
sgn(u) = • • • •
+1 u ≥ 0; −1 u < 0.
S i (t) = ±1(i = 1 ∼ N): the firing state of a neuron specified by index i at time t. Wi j : connection weight from the neuron S j to the neuron S i (Wii is taken to be 0) r: fan-in number for the neuron S i , named as connectivity,(0 < r < N). Gi (r): a spatial configuration set of connectivity r.
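The update rule of eq. (1) can be sketched as follows. How the configuration sets $G_i(r)$ are chosen when $r < N-1$ is not specified in this excerpt, so drawing a random fan-in subset for every neuron is an assumption made only for illustration, and all names here are ours.

```python
import numpy as np

def random_fanin(N, r, rng):
    """One random spatial configuration set G_i(r) per neuron (no self-input)."""
    return [rng.choice(np.delete(np.arange(N), i), size=r, replace=False)
            for i in range(N)]

def update_state(S, W, G):
    """Synchronous update of eq. (1): S_i(t+1) = sgn(sum_{j in G_i(r)} W_ij S_j(t))."""
    u = np.array([W[i, G[i]] @ S[G[i]] for i in range(len(S))])
    return np.where(u >= 0, 1, -1)
```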
At a certain time $t$, the state of the neurons in the network can be represented as an $N$-dimensional state vector $\mathbf{S}(t)$, called the state pattern. The time development of the state pattern $\mathbf{S}(t)$ depends on the connection weight matrix $\{W_{ij}\}$ and the connectivity $r$; therefore, in our study, $W_{ij}$ are determined in the case of full connectivity $r = N-1$ by a kind of orthogonalized learning method [7] and taken as follows:

$W_{ij} = \sum_{\mu=1}^{L} \sum_{\lambda=1}^{K} (\xi^{\lambda+1}_{\mu})_i \cdot (\xi^{\lambda\dagger}_{\mu})_j, \qquad (2)$

where $\{\xi^{\lambda}_{\mu} \mid \lambda = 1 \ldots K,\ \mu = 1 \ldots L\}$ is the attractor pattern set, $K$ is the number of memory patterns included in a cycle and $L$ is the number of memory cycles. $\xi^{\lambda\dagger}_{\mu}$ is the conjugate vector of $\xi^{\lambda}_{\mu}$, which satisfies $\xi^{\lambda\dagger}_{\mu} \cdot \xi^{\lambda'}_{\mu'} = \delta_{\mu\mu'} \cdot \delta_{\lambda\lambda'}$, where $\delta$ is Kronecker's delta. This method was confirmed to be effective to avoid spurious attractors [3,4,5,6,7,8]. Biological data show that neurons in the brain cause various motions of muscles in the body with a quite large redundancy. Therefore, the network consisting of $N$ neurons is used to realize two-dimensional motion control of a tracker. We confirmed that chaotic dynamics introduced in the network does not sensitively depend on the size of the neuron number [7]. In our actual computer simulation, $N = 400$. Suppose that a tracker moves from the position $(p_x(t), p_y(t))$ to $(p_x(t+1), p_y(t+1))$ with a set of motion increments $(f_x(t), f_y(t))$. The state pattern $\mathbf{S}(t)$ at time $t$ is a 400-dimensional vector, and we transform it to two-dimensional motion increments by the coding of motion functions $(f_x(\mathbf{S}(t)), f_y(\mathbf{S}(t)))$. In two-dimensional space, the actual motion of the tracker is given by

$\begin{pmatrix} p_x(t+1) \\ p_y(t+1) \end{pmatrix} = \begin{pmatrix} p_x(t) \\ p_y(t) \end{pmatrix} + \begin{pmatrix} f_x(\mathbf{S}(t)) \\ f_y(\mathbf{S}(t)) \end{pmatrix} = \begin{pmatrix} p_x(t) \\ p_y(t) \end{pmatrix} + \frac{4}{N} \begin{pmatrix} \mathbf{A} \cdot \mathbf{C} \\ \mathbf{B} \cdot \mathbf{D} \end{pmatrix}, \qquad (3)$

where $\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}$ are four independent $N/4$-dimensional sub-space vectors of the state pattern $\mathbf{S}(t)$. Therefore, after the inner product between two independent sub-space vectors is normalized by $4/N$, the motion functions range from $-1$ to $+1$. In our actual simulations, two-dimensional space is digitized with a resolution of 0.02 due to the binary neuron state $\pm 1$ and $N = 400$. Now, let us consider the construction of memory attractors corresponding to prototypical simple motions. We take 24 attractor patterns consisting of ($L = 4$ cycles) $\times$ ($K = 6$ patterns per cycle). Each cycle corresponds to one prototypical simple motion. We take four types of motion in which the tracker moves toward $(+1,+1)$, $(-1,+1)$, $(-1,-1)$, $(+1,-1)$ in two-dimensional space. Each attractor pattern consists of four random sub-space vectors $\mathbf{A}, \mathbf{B}, \mathbf{C}$ and $\mathbf{D}$, where $\mathbf{C} = \mathbf{A}$ or $-\mathbf{A}$, and $\mathbf{D} = \mathbf{B}$ or $-\mathbf{B}$, so only $\mathbf{A}$ and $\mathbf{B}$ are independent random patterns. From the law of large numbers, the memory patterns are almost orthogonal to each other. Furthermore, in determining $\{W_{ij}\}$, the orthogonalized learning method was employed; therefore, the memory patterns are orthogonalized to each other. The corresponding relations between memory attractors and prototypical simple motions are as follows:

$(f_x(\xi^{\lambda}_1), f_y(\xi^{\lambda}_1)) = (+1,+1), \quad (f_x(\xi^{\lambda}_2), f_y(\xi^{\lambda}_2)) = (-1,+1),$
$(f_x(\xi^{\lambda}_3), f_y(\xi^{\lambda}_3)) = (-1,-1), \quad (f_x(\xi^{\lambda}_4), f_y(\xi^{\lambda}_4)) = (+1,-1).$
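A minimal sketch of the motion-function coding of eq. (3): the 400-dimensional state pattern is split into the four sub-space vectors A, B, C, D, and the normalised inner products A·C and B·D give the motion increments. Treating the four quarters as consecutive blocks of the state vector is an assumption made for illustration.

```python
import numpy as np

def motion_increments(S):
    """(f_x, f_y) of eq. (3) from a +/-1 state pattern S of length N = 400."""
    N = len(S)
    A, B, C, D = np.split(np.asarray(S, dtype=float), 4)
    return 4.0 / N * (A @ C), 4.0 / N * (B @ D)   # each lies in [-1, 1]

def move(position, S):
    fx, fy = motion_increments(S)
    return (position[0] + fx, position[1] + fy)
```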
3 Introducing Chaotic Dynamics in RNNM
Now let us state the effects of the connectivity r. In the case of full connectivity r = N − 1, the network can function as a conventional associative memory. If the state pattern S(t) is one of, or near one of, the memory patterns $\xi^{\lambda}_{\mu}$, the output sequence S(t + kK) (k = 1, 2, 3, ...) will finally converge to the memory pattern $\xi^{\lambda}_{\mu}$. In other words, for each memory pattern there is a set of state patterns, called the memory basin $B^{\lambda}_{\mu}$. If S(t) is in the memory basin $B^{\lambda}_{\mu}$, then the output sequence S(t + kK) (k = 1, 2, 3, ...) will converge to the memory pattern $\xi^{\lambda}_{\mu}$. It is quite difficult to estimate the basin volume accurately because of the enormous amount of calculation over the whole set of state patterns in the N-dimensional state space. Therefore, a statistical method is applied to estimate the approximate basin volume. First, a sufficiently large number of state patterns are sampled in the state space. Second, each sample is taken as an initial pattern and updated with full connectivity. Third, statistics are taken of which memory attractor $\lim_{k\to\infty} S(kK)$ of each sample converges into. The distribution of these statistics over the whole set of samples is regarded as the approximate basin volume for each memory attractor (see Fig. 1). The basin volume shows that on average almost all initial state patterns converge into one of the memory attractors and that spurious attractors are rare.
Fig. 1. Basin volume: The horizontal axis represents memory pattern number(1-24). Basin 25 corresponds to samples that converged into cyclic outputs with a period of six steps but not any one memory attractor. Basin 26 corresponds to samples excluded from any other case(1-25). The vertical axis represents the ratios between the corresponding samples and the whole samples.
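The statistical basin-volume estimate described above can be sketched as follows; the update map, the list of embedded patterns, and the settling length are placeholders, and the extra classes of Fig. 1 (period-six outputs matching no single attractor, and all remaining cases) are collapsed into one "no match" slot for brevity.

```python
import numpy as np

def estimate_basin_volumes(update, memories, n_samples, n_settle, rng, N=400):
    """Sample random initial patterns, iterate the full-connectivity map
    `update`, and count which embedded pattern (if any) each trajectory reaches."""
    counts = np.zeros(len(memories) + 1, dtype=int)   # last slot: no match
    for _ in range(n_samples):
        S = rng.choice([-1, 1], size=N)
        for _ in range(n_settle):
            S = update(S)
        hits = [m for m, xi in enumerate(memories) if np.array_equal(S, xi)]
        counts[hits[0] if hits else -1] += 1
    return counts / n_samples
```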
Next, we continue to decrease the connectivity r. When r is large enough, r ≈ N, the memory attractors are stable. As r becomes smaller and smaller, more and more state patterns gradually fail to converge into a certain memory pattern even though the network is updated for a long time; that is, the attractors become unstable. Finally, when r becomes quite small, the state pattern becomes a non-periodic output; that is, non-periodic dynamics occurs in the state space. In our previous papers, we confirmed that the non-periodic dynamics in the network is chaotic wandering. In order to investigate the dynamical structure, we calculated basin visiting measures, and they suggest that the trajectory can pass through the
whole N-dimensional state space; that is, the cyclic memory attractors are ruined due to the quite small connectivity [3,4,5,6,7].
4 Motion Control and Tracking Algorithm
When the connectivity r is sufficiently large, a random initial pattern converges into one of the four limit cycle attractors as time evolves. By the coding transformation of the motion functions, the corresponding motion of the tracker in two-dimensional space becomes monotonic (see Fig. 2). On the other hand, when the connectivity r is quite small, chaotic dynamics occurs in the state space and, correspondingly, the tracker moves chaotically (see Fig. 3). If the updating of the state pattern in the chaotic regime is replaced by a random 400-bit pattern generator, the tracker shows a random walk (see Fig. 4). Obviously, chaotic motion is different from a random walk and has a certain dynamical structure.
Fig. 2. Monotonic motion (r = 399)
Fig. 3. Chaotic walk (r = 30, 500 steps)
Fig. 4. Random walk (500 steps)
Therefore, when the network evolves, monotonic motion and chaotic motion can be switched by switching the connectivity r. Based on this idea, we proposed a simple algorithm to track a moving target, shown in Fig. 5. First, a tracker is assumed to be tracking a target that is moving along a certain trajectory in two-dimensional space, and the tracker can obtain rough directional information D1(t) about the moving target, which is called the global target direction. At a certain time t, the present position of the tracker is assumed to be at the point (px(t), py(t)). This point is taken as the origin, and two-dimensional space can be divided into four quadrants. If the target is moving in the nth quadrant, D1(t) = n (n = 1, 2, 3, 4). Next, we also suppose that the tracker can know another piece of directional information, D2(t) = m (m = 1, 2, 3, 4), which is called the global motion direction. It means that the tracker has moved toward the mth quadrant from time t − 1 to t, that is, in the previous step. The global target direction D1(t) and the global motion direction D2(t) are time-dependent variables. If this information is taken as feedback to the network in real time, the connectivity r also becomes a time-dependent variable r(t) and is determined by D1(t) and D2(t). In Fig. 5, RL is a sufficiently large connectivity and RS is a quite small connectivity that can lead to chaotic dynamics in the neural network. Adaptive switching of the connectivity is the core idea of the algorithm. When the synaptic connectivity r(t) is determined by comparing the two directions D1(t−1) and D2(t−1), the motion increments of the tracker are calculated from the state pattern of the network updated with r(t). The new motion
Fig. 5. Control algorithm of tracking a moving target: By judging whether global target direction D1 (t) coincides with global motion direction D2 (t) or not, adaptive switching of connectivity r between RS and RL results in chaotic dynamics or attractor’s dynamics in state space. Correspondingly, the tracker is adaptively tracking a moving target in two-dimensional space.
causes the next D1 (t) and D2 (t), and produces the next connectivity r(t + 1). By repeating this process, the synaptic connectivity r(t) is adaptively switching between RL and RS , the tracker is alternatively implementing monotonic motion and chaotic motion in two-dimensional space.
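The switching rule of Fig. 5 can be sketched as below, reusing the helper functions from the earlier sketches. The mapping is an assumption consistent with the maze-solving study [8]: when the global motion direction D2 agrees with the global target direction D1, the large connectivity RL (attractor dynamics) is kept; otherwise the small connectivity RS (chaotic dynamics) is used. How the reduced fan-in sets are redrawn at each step is likewise an assumption.

```python
import numpy as np

def quadrant(dx, dy):
    """Quadrant 1..4 of a two-dimensional displacement."""
    if dx >= 0 and dy >= 0: return 1
    if dx < 0 and dy >= 0:  return 2
    if dx < 0 and dy < 0:   return 3
    return 4

def tracking_step(S, pos, prev_pos, target, W, R_L, R_S, rng, N=400):
    D1 = quadrant(target[0] - pos[0], target[1] - pos[1])      # global target direction
    D2 = quadrant(pos[0] - prev_pos[0], pos[1] - prev_pos[1])  # global motion direction
    r = R_L if D1 == D2 else R_S          # adaptive switching of connectivity
    G = random_fanin(N, r, rng)
    S = update_state(S, W, G)
    fx, fy = motion_increments(S)
    return S, (pos[0] + fx, pos[1] + fy)
```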
5 Simulation Results
In order to confirm that this control algorithm is useful for tracking a moving target, the moving target must be specified. First, we have taken nine kinds of trajectories along which the target moves, shown in Fig. 6, including one circular trajectory and eight linear trajectories. Suppose that the initial position of the tracker is the origin (0, 0) of two-dimensional space. The distance L between the initial position of the tracker and that of the target is a constant value. Therefore, at the beginning of tracking, the tracker is at the center of the circular trajectory, and the other eight linear trajectories are tangential to the circular trajectory at a certain angle α, where the angle is measured from the x axis. The tangential angle is α = nπ/4 (n = 1, 2, ..., 8), so we number the eight linear trajectories as LTn and the circular trajectory as LT0.
Fig. 6. Trajectories of moving target: Arrow represents the moving direction of the target. Solid point means the position at time t=0.
Fig. 7. An example of tracking a target that is moving along a circular trajectory with the simple algorithm. The tracker captured the moving target at the intersection point.
Next, let us consider the velocity of the target. In the computer simulation, the tracker moves one step per discrete time step; at the same time, the target also moves one step with a certain step length SL that represents the velocity of the target. The motion increments of the tracker range from -1 to 1, so the step length SL is taken with an interval of 0.01 from 0.01 to 1, giving up to 100 different velocities. Because velocity is a relative quantity, SL = 0.01 is a slower target velocity and SL = 1 is a faster target velocity relative to the tracker. Now, let us look at a simulation of tracking a moving target using the algorithm proposed above, shown in Fig. 7. When a target is moving along a circular trajectory at a certain velocity, the tracker captured the target at a certain point of the circular trajectory, which is a successful capture for a circular trajectory.
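For concreteness, the target trajectories and step length SL can be parametrised as in the following sketch; the circle radius L0 and the exact parametrisation are illustrative assumptions, since only the tangency construction and the angles α = nπ/4 are stated above.

```python
import numpy as np

def target_position(n, t, SL, L0=10.0):
    """Target position on trajectory LT_n after t steps of length SL.
    LT_0: circle of radius L0 around the tracker's start; LT_1..LT_8: lines
    leaving the circle tangentially at angle alpha = n*pi/4."""
    if n == 0:
        phi = SL * t / L0                      # arc length SL per step
        return L0 * np.cos(phi), L0 * np.sin(phi)
    alpha = n * np.pi / 4
    px, py = L0 * np.cos(alpha), L0 * np.sin(alpha)   # tangent point
    dx, dy = -np.sin(alpha), np.cos(alpha)            # unit tangent direction
    return px + SL * t * dx, py + SL * t * dy
```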
6 Performance Evaluation
To show the performance of tracking a moving target, we have evaluated the success rate of tracking a moving target that moves along one of the nine trajectories, over 100 initial state patterns. In the tracking process, the tracker sufficiently approaching the target within a certain tolerance during 600 steps is regarded as a successful trial. The rate of successful trials is called the success rate. However, even for the same target trajectory, the performance of tracking depends not only on the synaptic connectivity r but also on the target velocity, or target step length SL. Therefore, when we evaluate the success rate of tracking, a pair of parameters is taken: one connectivity r (1 ≤ r ≤ 60) and one target velocity SL (0.01 ≤ SL ≤ 1.0). Because we take 100 different target velocities with the same interval 0.01, we have 60 × 100 pairs of parameters. We have evaluated the success rate of tracking a moving target along different trajectories. Two examples are shown in Fig. 8(a) and (b). By comparing Fig. 8(a) and (b), we see that tracking a moving target on a circular trajectory gives better performance than on a linear trajectory. However, for some linear trajectories, quite excellent performance was observed. On the other hand, the success rate depends strongly on the connectivity r and the target velocity SL even if the same target trajectory is set. In order to observe the performance clearly, we have taken the data
Fig. 8. Success rate of tracking a moving target along (a) a circular trajectory and (b) a linear trajectory. The positive orientation obeys the right-hand rule. The vertical axis represents the success rate, and the two axes in the horizontal plane represent the connectivity r and the target velocity SL, respectively.
(a) r = 16: downward tendency. (b) r = 51: upward tendency.
Fig. 9. Success rates drawn from Fig. 8(a): We take the data for a given connectivity and show them in a two-dimensional diagram. The horizontal axis represents the target velocity from 0.01 to 1.0, and the vertical axis represents the success rate.
of certain connectivities from Fig.8(a), and plot them in two-dimensional coordinates, shown as Fig.9. Comparing these figures, we can see a novel performance, when the target velocity becomes faster, the success rate has a upward tendency, such as r = 51. In other words, when the chaotic dynamics is not too strong, it seems useful to tracking a faster target.
7 Discussion
In order to show the relation between the above cases and chaotic dynamics, we have investigated the dynamical structure of the chaotic dynamics from a dynamical viewpoint. For a quite small connectivity (from 1 to 60), the network performs chaotic wandering for a long time from a random initial state pattern. During this history, we have taken statistics of the continuously staying time in a certain basin [8] and evaluated the distribution $p(l,\mu)$, which is defined by

$p(l,\mu) = \#\{\, l \mid S(t) \in \beta_\mu \ \text{for}\ \tau \le t \le \tau + l,\ S(\tau-1) \notin \beta_\mu,\ S(\tau+l+1) \notin \beta_\mu \,\}, \quad \mu \in [1, L], \qquad (4)$

$\beta_\mu = \bigcup_{\lambda=1}^{K} B^{\lambda}_{\mu}, \qquad (5)$

$T = \sum_{l} l\, p(l,\mu), \qquad (6)$
where l is the length of the continuously staying time steps in each attractor basin, and p(l, μ) represents the distribution of continuously staying l steps in attractor basin μ within T steps. In our actual simulation, T = 10^5. For the different connectivities r = 15 and r = 50, the distributions p(l, μ) are shown in Fig. 10(a) and Fig. 10(b). In these figures, different basins are marked with different symbols. From the results, we see that the continuously staying time l becomes longer and longer with the increase of the connectivity r. Referring to the novel behaviours discussed in the previous section, let us try to consider the reason.
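Given a time series of basin labels recorded during the chaotic wandering (determining which basin βμ, if any, contains S(t) uses the convergence test of Sect. 3 and is abstracted away here), the run-length distribution p(l, μ) of eqs. (4)-(6) can be computed as in this sketch.

```python
from collections import Counter

def staying_time_distribution(basin_labels, L=4):
    """p(l, mu): for each basin mu (labels 0..L-1; -1 = in no basin),
    count how often the trajectory stays there for exactly l consecutive steps."""
    p = [Counter() for _ in range(L)]
    run_label, run_len = None, 0
    for lab in list(basin_labels) + [None]:     # sentinel flushes the last run
        if lab == run_label:
            run_len += 1
            continue
        if run_label is not None and run_label >= 0:
            p[run_label][run_len] += 1
        run_label, run_len = lab, 1
    return p
```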
(a) r = 15: shorter. (b) r = 50: longer.
Fig. 10. The log plot of the frequency distribution of continuously staying time l: The horizontal axis represents continuously staying time steps l in a certain basin μ during long time chaotic wandering, and the vertical axis represents the accumulative number p(l, μ) of the same staying time steps l in a certain basin μ. continuously staying time steps l becomes long with the increase of connectivity r.
First, in the case of a slower target velocity, a decreasing success rate with increasing connectivity r is observed for both the circular target trajectory and the linear ones. This point shows that chaotic dynamics localized in a certain basin for too long is not good for tracking a slower target. Second, in the case of a faster target velocity, it seems useful for tracking a faster target when the chaotic dynamics is not too strong. Computer simulations show that, when the target moves quickly, the action of the tracker is always chaotic so as to track the target. From past experiments, we know that the motion increments of chaotic motion are very short. Therefore, shorter motion increments and a faster target velocity result in bad tracking performance. However, when the continuously staying time l in a certain basin becomes longer, the tracker can move toward a certain direction for l steps. This would be useful for the tracker to track the faster target. Therefore, when the connectivity becomes a little larger (r = 50 or so), the success rate rises with the increase of target velocity, as in the case shown in Fig. 9. As an issue for future study, the functional aspect of chaotic dynamics still has context dependence.
8 Summary
We proposed a simple method for tracking a moving target using chaotic dynamics in a recurrent neural network model. Although chaotic dynamics cannot always solve all complex problems with better performance, better results were often observed when using chaotic dynamics to solve certain ill-posed problems, such as tracking a moving target and solving mazes [8]. From the results of the computer simulation, we can state the following points.
• A simple method for tracking a moving target was proposed.
• Chaotic dynamics is quite efficient for tracking a target that is moving along a circular trajectory.
• The performance of tracking a moving target along a linear trajectory is not better than that along a circular trajectory; however, for some linear trajectories, excellent performance was observed.
• The length of the continuously staying time steps becomes long with the increase of the synaptic connectivity r that can lead to chaotic dynamics in the network.
• A longer continuous staying time in a certain basin seems useful for tracking a faster target.
References 1. Babloyantz, A., Destexhe, A.: Low-dimensional chaos in an instance of epilepsy. Proc. Natl. Acad. Sci. USA. 83, 3513–3517 (1986) 2. Skarda, C.A., Freeman, W.J.: How brains make chaos in order to make sense of the world. Behav. Brain. Sci. 10, 161–195 (1987) 3. Nara, S., Davis, P.: Chaotic wandering and search in a cycle memory neural network. Prog. Theor. Phys. 88, 845–855 (1992) 4. Nara, S., Davis, P., Kawachi, M., Totuji, H.: Memory search using complex dynamics in a recurrent neural network model. Neural Networks 6, 963–973 (1993) 5. Nara, S., Davis, P., Kawachi, M., Totuji, H.: Chaotic memory dynamics in a recurrent neural network with cycle memories embedded by pseudo-inverse method. Int. J. Bifurcation and Chaos Appl. Sci. Eng. 5, 1205–1212 (1995) 6. Nara, S., Davis, P.: Learning feature constraints in a chaotic neural memory. Phys. Rev. E 55, 826–830 (1997) 7. Nara, S.: Can potentially useful dynamics to solve complex problems emerge from constrained chaos and/or chaotic itinerancy? Chaos. 13(3), 1110–1121 (2003) 8. Suemitsu, Y., Nara, S.: A solution for two-dimensional mazes with use of chaotic dynamics in a recurrent neural network model. Neural Comput. 16(9), 1943–1957 (2004) 9. Tsuda, I.: Chaotic itinerancy as a dynamical basis of Hermeneutics in brain and mind. World Futures 32, 167–184 (1991) 10. Tsuda, I.: Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behav Brain Sci. 24(5), 793–847 (2001) 11. Kaneko, K., Tsuda, I.: Chaotic Itinerancy. Chaos 13(3), 926–936 (2003) 12. Aihara, K., Takabe, T., Toyoda, M.: Chaotic Neural Networks. Phys. Lett. A 114, 333–340 (1990)
A Generalised Entropy Based Associative Model Masahiro Nakagawa Nagaoka University of Technology, Kamitomioka 1603-1, Nagaoka, Niigata 940-2188, Japan
[email protected]
Abstract. In this paper, a generalised entropy based associative memory model will be proposed and applied to memory retrieval with analogue embedded vectors instead of binary ones, in order to compare it with the conventional autoassociative model with a quadratic Lyapunov functional. In the present approach, the updating dynamics will be constructed on the basis of an entropy minimization strategy, which may be reduced asymptotically to the autocorrelation dynamics as a special case. From numerical results, it will be found that the presently proposed novel approach realizes a larger memory capacity, even for analogue memory retrieval, in comparison with the autocorrelation model based on dynamics such as the associatron, owing to the higher-order correlations involved in the proposed dynamics. Keywords: Entropy, Associative Memory, Analogue Memory Retrieval.
1 Introduction
During the past quarter century, numerous autoassociative models have been extensively investigated on the basis of the autocorrelation dynamics. Since the proposals of the retrieval models by Anderson [1], Kohonen [2], and Nakano [3], some works related to such an autoassociation model of inter-connected neurons through an autocorrelation matrix were theoretically analyzed by Amari [4], Amit et al. [5] and Gardner [6]. So far it has been well appreciated that the storage capacity of the autocorrelation model, or the number of stored pattern vectors L to be completely associated vs the number of neurons N, which is called the relative storage capacity or loading rate and denoted as c = L/N, is estimated as c ~ 0.14 at most for the autocorrelation learning model with the activation function taken as the signum one (sgn(x) for abbreviation) [7,8]. In contrast to the abovementioned models with monotonous activation functions, neuro-dynamics with a nonmonotonous mapping was recently proposed by Morita [9], Yanai and Amari [10], and Shiino and Fukai [11]. They reported that the nonmonotonous mapping in a neurodynamics possesses a remarkable advantage in the storage capacity, c ~ 0.27, superior to the conventional association models with monotonous mappings, e.g. the signum or sigmoidal function. In the present paper, we shall propose a novel approach based on the entropy defined in terms of the overlaps, which are defined by the inner products between the
state vector and the analogue embedded vectors instead of the previously investigated binary ones [1-16,25].
2 Theory

Let us consider an associative model with embedded analogue vectors $e_i^{(r)}$ $(1 \le i \le N,\ 1 \le r \le L)$, where $N$ and $L$ are the number of neurons and the number of embedded vectors, respectively. The states of the neural network are characterized in terms of the output vector $s_i$ $(1 \le i \le N)$ and the internal states $u_i$ $(1 \le i \le N)$, which are related to each other by

$$s_i = f(u_i) \quad (1 \le i \le N), \qquad (1)$$

where $f(\cdot)$ is the activation function of the neuron. We then introduce the following entropy $I$, which is related to the overlaps:

$$I = -\frac{1}{2} \sum_{r=1}^{L} \left(m^{(r)}\right)^2 \log \left(m^{(r)}\right)^2, \qquad (2)$$

where the overlaps $m^{(r)}$ $(r = 1, 2, \ldots, L)$ are defined by

$$m^{(r)} = \sum_{i=1}^{N} e_i^{\dagger(r)} s_i; \qquad (3)$$

here the covariant vector $e_i^{\dagger(r)}$ is defined through the following orthogonality relation,

$$\sum_{i=1}^{N} e_i^{\dagger(r)} e_i^{(s)} = \delta_{rs} \quad (1 \le r, s \le L), \qquad (4)$$

$$e_i^{\dagger(r)} = \sum_{r'=1}^{L} a_{rr'}\, e_i^{(r')}, \qquad (4a)$$

$$a_{rr'} = \left(\Lambda^{-1}\right)_{rr'}, \qquad (4b)$$

$$\Lambda_{rr'} = \sum_{i=1}^{N} e_i^{(r)} e_i^{(r')}. \qquad (4c)$$
The entropy defined by eq.(2) can be minimized under the following conditions:

$$m^{(r)} = \delta_{rs} \quad (1 \le r, s \le L), \qquad (5a)$$

$$\sum_{r=1}^{L} \left(m^{(r)}\right)^2 = 1. \qquad (5b)$$
That is, regarding $(m^{(r)})^2$ $(1 \le r \le L)$ as a probability distribution in eq.(2), a target pattern may be retrieved by minimizing the entropy $I$ with respect to $m^{(r)}$, or equivalently the state vector $s_i$, until eqs.(5a) and (5b) are satisfied. The entropy may therefore be regarded as the functional to be minimized during the retrieval process of the auto-association model, in place of the conventional quadratic energy functional $E$, i.e.

$$E = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\, s_i^{\dagger} s_j, \qquad (6a)$$

where $s_i^{\dagger}$ is the covariant vector defined by

$$s_i^{\dagger} = \sum_{r=1}^{L} \sum_{j=1}^{N} e_i^{\dagger(r)} e_j^{\dagger(r)} s_j, \qquad (6b)$$

and the connection matrix $w_{ij}$ is defined in terms of

$$w_{ij} = \sum_{r=1}^{L} e_i^{(r)} e_j^{\dagger(r)}. \qquad (6c)$$
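For concreteness, the overlaps of eq.(3), the covariant vectors of eqs.(4a)-(4c) and the connection matrix of eq.(6c) can be computed directly from the embedded vectors. The following NumPy sketch (variable names are ours, not the paper's) assumes that the L embedded vectors are stored as the rows of a matrix:

```python
import numpy as np

def covariant_vectors(E):
    """E: L x N matrix whose rows are the embedded vectors e^(r).
    Returns the L x N matrix of covariant vectors e^+(r), eqs.(4a)-(4c)."""
    Lam = E @ E.T                 # Gram matrix Lambda_{rr'}, eq.(4c)
    A = np.linalg.inv(Lam)        # a_{rr'} = (Lambda^{-1})_{rr'}, eq.(4b)
    return A @ E                  # e^+(r) = sum_r' a_{rr'} e^(r'), eq.(4a)

def overlaps(E_dag, s):
    """Overlaps m^(r) of eq.(3) for a state vector s."""
    return E_dag @ s

def connection_matrix(E, E_dag):
    """w_ij = sum_r e_i^(r) e_j^+(r), eq.(6c)."""
    return E.T @ E_dag
```

As a quick consistency check, `covariant_vectors(E) @ E.T` should be close to the identity matrix, which is exactly the orthogonality relation of eq.(4).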
According to the steepest-descent approach in the discrete-time model, the updating rule of the internal states $u_i$ $(1 \le i \le N)$ may be defined by

$$u_i(t+1) = -\eta\, \frac{\partial I}{\partial s_i^{\dagger}} \quad (1 \le i \le N), \qquad (7)$$
where $\eta\ (>0)$ is a coefficient. Substituting eqs.(2) and (3) into eq.(7), and noting the following relation with the aid of eq.(6b),

$$m^{(r)} = \sum_{i=1}^{N} e_i^{\dagger(r)} s_i = \sum_{i=1}^{N} e_i^{(r)} s_i^{\dagger}, \qquad (8)$$
one may readily derive the following relation.
$$u_i(t+1) = -\eta\, \frac{\partial I}{\partial s_i^{\dagger}}
 = +\frac{\eta}{2}\, \frac{\partial}{\partial s_i^{\dagger}} \sum_{r=1}^{L} \left(m^{(r)}\right)^2 \log \left(m^{(r)}\right)^2
 = \eta \sum_{r=1}^{L} e_i^{(r)} \left( \sum_{j=1}^{N} e_j^{(r)} s_j^{\dagger}(t) \right) \left[ 1 + \log \left( \sum_{j=1}^{N} e_j^{(r)} s_j^{\dagger}(t) \right)^2 \right]
 = \eta \sum_{r=1}^{L} e_i^{(r)} m^{(r)}(t) \left[ 1 + \log \left( m^{(r)}(t) \right)^2 \right]. \qquad (9)$$
Generalizing the above dynamics somewhat, in order to combine the quadratic approach $(\alpha \to 0)$ and the present entropy-based one $(\alpha \to 1)$, we propose the following dynamic rule, in a somewhat ad-hoc manner, for the internal states:

$$u_i(t+1) = \eta \sum_{r=1}^{L} e_i^{(r)} \left( \sum_{j=1}^{N} e_j^{\dagger(r)} s_j(t) \right) \left[ 1 + \log\!\left( 1 - \alpha + \alpha \left( \sum_{j=1}^{N} e_j^{\dagger(r)} s_j(t) \right)^2 \right) \right]
 = \eta \sum_{r=1}^{L} e_i^{(r)} m^{(r)}(t) \left[ 1 + \log\!\left( 1 - \alpha + \alpha \left( m^{(r)}(t) \right)^2 \right) \right]. \qquad (10)$$
In practice, in the limit $\alpha \to 0$, the above dynamics reduces to the autocorrelation dynamics:

$$u_i(t+1) = \lim_{\alpha \to 0}\, \eta \sum_{r=1}^{L} e_i^{(r)} m^{(r)}(t) \left[ 1 + \log\!\left( 1 - \alpha + \alpha \left( m^{(r)}(t) \right)^2 \right) \right]
 = \eta \sum_{r=1}^{L} e_i^{(r)} \sum_{j=1}^{N} e_j^{\dagger(r)} s_j(t) = \eta \sum_{j=1}^{N} w_{ij}\, s_j(t). \qquad (11)$$
On the other hand, eq.(10) reduces to eq.(9) in the case $\alpha \to 1$. Therefore one may control the dynamics between the autocorrelation approach $(\alpha \to 0)$ and the entropy-based approach $(\alpha \to 1)$.
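As an illustration of eq.(10) combined with the activation of eqs.(1) and (13), a single synchronous update step might be sketched as follows (a minimal sketch; the small constant `eps` is our numerical safeguard against log(0), and all names are ours):

```python
import numpy as np

def piecewise_linear(u):
    """Activation of eq.(13): f(u) = u for |u| <= 1, sgn(u) otherwise."""
    return np.clip(u, -1.0, 1.0)

def update_step(s, E, E_dag, alpha, eta=1.0, eps=1e-12):
    """One synchronous update of eq.(10) followed by eqs.(1)/(13).
    s: current state (N,); E: embedded vectors (L, N); E_dag: covariant vectors (L, N)."""
    m = E_dag @ s                                     # overlaps m^(r), eq.(3)
    gain = 1.0 + np.log(1.0 - alpha + alpha * m**2 + eps)
    u_next = eta * (E.T @ (m * gain))                 # internal states, eq.(10)
    return piecewise_linear(u_next)                   # output states, eq.(1) with eq.(13)
```

With alpha = 0 the bracket reduces to 1 and the rule coincides with the associatron update of eq.(11); with alpha = 1 it realizes the entropy-based dynamics of eq.(9).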
3 Numerical Results

The embedded vectors are set to random vectors as follows:

$$e_i^{(r)} = z_i^{(r)} \quad (1 \le i \le N,\ 1 \le r \le L), \qquad (12)$$

where $z_i^{(r)}$ $(1 \le i \le N,\ 1 \le r \le L)$ are zero-mean pseudo-random numbers between $-1$ and $+1$. For simplicity, the activation function of eq.(1) is assumed to be a piecewise linear function, instead of the signum form used previously for binary embedded vectors [25], and is set to

$$s_i = f(u_i) = \frac{1 + \mathrm{sgn}(1 - |u_i|)}{2}\, u_i + \mathrm{sgn}(u_i)\, \frac{1 - \mathrm{sgn}(1 - |u_i|)}{2}, \qquad (13)$$

where $\mathrm{sgn}(\cdot)$ denotes the signum function defined by

$$\mathrm{sgn}(x) = \begin{cases} +1 & (x \ge 0) \\ -1 & (x < 0) \end{cases}. \qquad (14)$$
The initial vector $s_i(0)$ $(1 \le i \le N)$ is set to

$$s_i(0) = \begin{cases} -e_i^{(s)} & (1 \le i \le H_d) \\ +e_i^{(s)} & (H_d + 1 \le i \le N) \end{cases}, \qquad (15)$$

where $e_i^{(s)}$ is a target pattern to be retrieved and $H_d$ is the Hamming distance between the initial vector $s_i(0)$ and the target vector $e_i^{(s)}$. The retrieval is successful if the overlap

$$m^{(s)}(t) = \sum_{i=1}^{N} e_i^{\dagger(s)} s_i(t) \qquad (16)$$

reaches $\pm 1$ for $t \gg 1$, in which case the system is in a steady state such that
$$s_i(t+1) = s_i(t), \qquad (17a)$$

$$u_i(t+1) = u_i(t). \qquad (17b)$$
To assess the retrieval ability of the present model, the success rate $S_r$ is defined as the fraction of successful retrievals over 1000 trials with different embedded vector sets $e_i^{(r)}$ $(1 \le i \le N,\ 1 \le r \le L)$. To shift from the autocorrelation dynamics just after the initial state $(t \sim 1)$ to the entropy-based dynamics $(t \sim T_{max})$, the parameter $\alpha$ in eq.(10) is simply controlled by

$$\alpha = \frac{t}{T_{max}}\, \alpha_{max} \quad (0 \le t \le T_{max}), \qquad (18)$$
where $T_{max}$ and $\alpha_{max}$ are the maximum number of updating iterations according to eq.(10) and the maximum value of $\alpha$, respectively. Choosing $N = 200$, $\eta = 1$, $T_{max} = 25$, $L/N = 0.5$ and $\alpha_{max} = 1$, we first present an example of the dynamics of the overlaps in Figs.1(a) and (b) (entropy-based approach). Therein the cross symbols (×) and the open circles (o) represent the success of retrieval, in which eqs.(5a) and (5b) are satisfied, and the entropy defined by eq.(2), respectively, for a retrieval process. In addition, the time dependence of the parameter $\alpha/\alpha_{max}$ defined by eq.(18) is depicted as dots. In Fig. 1, after a transient state, it is confirmed that the complete association corresponding to eqs.(5a) and (5b) is achieved. The dependence of the success rate $S_r$ on the loading rate $L/N$ is then depicted in Figs.2(a) and (b) for $H_d/N = 0.3$ and $N = 100$, for the entropy approach and the associatron, respectively. From these results, one may confirm the larger memory capacity of the presently proposed model defined by eq.(10) in comparison with the conventional autoassociation model defined by eq.(11).
Fig. 1. The time dependence of the overlaps $m^{(r)}$ of the present entropy-based model defined by eq.(10): (a) $H_d/N = 0.1$; (b) $H_d/N = 0.3$.
Fig. 2. The dependence of the success rate on the loading rate $L/N$ for (a) the entropy-based model defined by eq.(10) and (b) the conventional associatron model defined by eq.(11). The Hamming distance is set to $H_d/N = 0.3$.
In practice, it is found that the present approach achieves a high memory capacity, beyond the conventional autocorrelation strategy, even for analogue embedded vectors, as in the previously studied binary case [15,16,25].
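A complete retrieval trial of the kind used for Figs. 1 and 2 might then be sketched as follows, using the `covariant_vectors` and `update_step` helpers sketched above (the α schedule follows eq.(18); everything else, including the success test, is our simplification):

```python
import numpy as np

def retrieval_trial(E, target_idx, Hd, T_max=25, alpha_max=1.0, eta=1.0):
    """Single retrieval run: start Hd components away from a target pattern
    and ramp alpha from 0 (associatron) to alpha_max (entropy dynamics)."""
    E_dag = covariant_vectors(E)
    s = E[target_idx].copy()
    s[:Hd] *= -1.0                       # initial state, eq.(15)
    for t in range(T_max + 1):
        alpha = alpha_max * t / T_max    # schedule of eq.(18)
        s = update_step(s, E, E_dag, alpha, eta)
    return E_dag[target_idx] @ s         # final overlap m^(s), eq.(16)
```

A trial counts as successful when this overlap settles at ±1, cf. eqs.(16)-(17b); repeating the trial over many random pattern sets gives the success rate $S_r$.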
4 Concluding Remarks

In the present paper, we have proposed an entropy-based association model in place of the conventional autocorrelation dynamics. From numerical results, it was found that a large memory capacity can be achieved on the basis of the entropy approach. This advantage of the association property of the present model is considered to result from the fact that the present dynamics for updating the internal state, eq.(10), ensures that the entropy of eq.(2) is minimized under the conditions of eqs.(5a) and (5b), which correspond to the successful retrieval of a target pattern. In other words, the higher-order correlations in the presently proposed dynamics, eq.(10), which were ignored in the conventional approaches [1-11], were found to play an important role in improving the memory capacity, or retrieval ability. To conclude this work, we show in Fig. 3 the dependence of the storage capacity, defined as the area under the success-rate curves of the kind shown in Fig. 2, on the Hamming distance, for the analogue embedded vectors (Ana) as well as the previously studied binary ones (Bin). In addition, OL and CL denote the orthogonal learning model and the autocorrelation learning model, respectively. Therein one may again see the great advantage of the present model, based on the entropy functional to be minimized, over the conventional quadratic form [12,13], even for analogue embedded vectors. In fact, the present model realizes a considerably larger storage capacity than the associatron for Hamming distances up to $H_d/N \sim 0.5$.
Fig. 3. The dependence of the storage capacity on the Hamming distance. Here symbols a, m and n are for the entropy based approach with eq. (10) as well as the orthogonal learning (OL) and the autocorrelation learning (CL) [16,17], in which Ana and Bin imply the analogue embedded vectors and the binary ones, respectively. In addition we presented the associatron in symbols s with the orthogonal learning [13], and the associatron in symbols t with orthogonal learning under the condition wii = 0 [12], respectively.
The memory retrievals of the associatron, based on quadratic Lyapunov functionals to be minimized, become troublesome near $H_d/N = 0.5$, as seen in Fig. 3, since the directional cosine between the initial vector and a target pattern eventually vanishes there. Remarkably, even in such a case, the present model attains a remarkably large memory capacity because of the higher-order correlations involved in eq.(10), as expected from Figs. 1 and 2, for analogue vectors as well as for the binary ones previously investigated [15,16,25]. As a future problem, it seems worthwhile to introduce chaotic dynamics into the present model by adopting a periodic activation function, such as a sinusoidal one, as a nonmonotonic activation function [14]. The entropy-based approach [15] with chaotic dynamics [14] is now in progress and will be reported elsewhere, together with the synergetic models [17-24], in the near future.
References
1. Anderson, J.A.: A Simple Neural Network Generating Interactive Memory. Mathematical Biosciences 14, 197–220 (1972)
2. Kohonen, T.: Correlation Matrix Memories. IEEE Transactions on Computers C-21, 353–359 (1972)
3. Nakano, K.: Associatron - a Model of Associative Memory. IEEE Trans. SMC-2, 381–388 (1972)
4. Amari, S.: Neural Theory of Association and Concept Formation. Biological Cybernetics 26, 175–185 (1977)
5. Amit, D.J., Gutfreund, H., Sompolinsky, H.: Storing Infinite Numbers of Patterns in a Spin-glass Model of Neural Networks. Physical Review Letters 55, 1530–1533 (1985)
6. Gardner, E.: Structure of Metastable States in the Hopfield Model. Journal of Physics A19, L1047–L1052 (1986)
7. Kohonen, T., Ruohonen, M.M.: Representation of Associated Pairs by Matrix Operators. IEEE Transactions on Computers C-22, 701–702 (1973)
8. Amari, S., Maginu, K.: Statistical Neurodynamics of Associative Memory. Neural Networks 1, 63–73 (1988)
9. Morita, M.: Associative Memory with Nonmonotone Dynamics. Neural Networks 6, 115–126 (1993)
10. Yanai, H.-F., Amari, S.: Auto-associative Memory with Two-stage Dynamics of Nonmonotonic Neurons. IEEE Transactions on Neural Networks 7, 803–815 (1996)
11. Shiino, M., Fukai, T.: Self-consistent Signal-to-noise Analysis of the Statistical Behaviour of Analogue Neural Networks and Enhancement of the Storage Capacity. Phys. Rev. E 48, 867 (1993)
12. Kanter, I., Sompolinsky, H.: Associative Recall of Memory without Errors. Phys. Rev. A 35, 380–392 (1987)
13. Personnaz, L., Guyon, I., Dreyfus, D.: Information Storage and Retrieval in Spin-Glass like Neural Networks. J. Phys. (Paris) Lett. 46, L-359 (1985)
14. Nakagawa, M.: Chaos and Fractals in Engineering, p. 944. World Scientific, Singapore (1999)
15. Nakagawa, M.: Autoassociation Model based on Entropy Functionals. In: Proc. of NOLTA 2006, pp. 627–630 (2006)
16. Nakagawa, M.: Entropy based Associative Model. IEICE Trans. Fundamentals E89-A(4), 895–901 (2006)
17. Fuchs, A., Haken, H.: Pattern Recognition and Associative Memory as Dynamical Processes in a Synergetic System I. Biological Cybernetics 60, 17–22 (1988)
18. Fuchs, A., Haken, H.: Pattern Recognition and Associative Memory as Dynamical Processes in a Synergetic System II. Biological Cybernetics 60, 107–109 (1988)
19. Fuchs, A., Haken, H.: Dynamic Patterns in Complex Systems. In: Kelso, J.A.S., Mandell, A.J., Shlesinger, M.F. (eds.), World Scientific, Singapore (1988)
20. Haken, H.: Synergetic Computers and Cognition. Springer, Heidelberg (1991)
21. Nakagawa, M.: A Study of Association Model based on Synergetics. In: Proceedings of the International Joint Conference on Neural Networks 1993, Nagoya, Japan, pp. 2367–2370 (1993)
22. Nakagawa, M.: A Synergetic Neural Network. IEICE Fundamentals E78-A, 412–423 (1995)
23. Nakagawa, M.: A Synergetic Neural Network with Crosscorrelation Dynamics. IEICE Fundamentals E80-A, 881–893 (1997)
24. Nakagawa, M.: A Circularly Connected Synergetic Neural Network. IEICE Fundamentals E83-A, 881–893 (2000)
25. Nakagawa, M.: Entropy based Associative Model. In: Proceedings of ICONIP 2006, pp. 397–406. Springer, Heidelberg (2006)
The Detection of an Approaching Sound Source Using Pulsed Neural Network
Kaname Iwasa1, Takeshi Fujisumi1, Mauricio Kugler1, Susumu Kuroyanagi1, Akira Iwata1, Mikio Danno2, and Masahiro Miyaji3
1 Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, 466-8555, Japan
[email protected]
2 Toyota InfoTechnology Center, Co., Ltd, 6-6-20 Akasaka, Minato-ku, Tokyo, 107-0052, Japan
3 Toyota Motor Corporation, 1 Toyota-cho, Toyota, Aichi, 471-8572, Japan
Abstract. Current automobiles' safety systems based on video cameras and movement sensors fail when objects are out of the line of sight. This paper proposes a system based on pulsed neural networks able to detect whether a sound source is approaching a microphone or moving away from it. The system, based on PN models, compares the sound level difference between consecutive instants of time in order to determine the source's relative movement. Moreover, the combined level difference information of all frequency channels permits identification of the type of the sound source. Experimental results show that, for three different vehicle sounds, the relative movement and the sound source type could be successfully identified.
1
Introduction
Driving safety is one of the major concerns of the automotive industry nowadays. Video cameras and movement sensors are used in order to improve the driver's perception of the environment surrounding the automobile [1][2]. These methods perform well when detecting objects (e.g., cars, bicycles, and people) that are in the line of sight of the sensor, but fail in case of obstruction or dead angles. Moreover, the use of multiple cameras or sensors for handling dead angles increases the size and cost of the safety system. Humans, in contrast, are able to perceive people and vehicles around them from the information provided by the auditory system [3]. If this ability could be reproduced by artificial devices, complementary safety systems for automobiles would emerge. Because of diffraction, sound waves can travel around objects and be detected even when the source is not in direct line of sight. A possible approach for processing temporal data is the use of Pulsed Neuron (PN) models [4]. This type of neuron deals with input signals in the form of pulse trains, using an internal membrane potential as a reference for generating pulses on its output. PN models can directly deal with temporal data and can be efficiently implemented in hardware due to their simple structure. Furthermore,
high processing speeds can be achieved, as PN-model-based methods are usually highly parallelizable. A sound localization system based on pulsed neural networks has already been proposed in [5], and a sound source identification system, with a corresponding implementation on FPGA, was introduced in [6]. This paper focuses specifically on the relative moving direction of a sound-emitting object, and proposes a method to detect whether a sound source is approaching a microphone or moving away from it. The system, based on PN models, compares the sound level difference between consecutive instants of time in order to determine the relative movement. Moreover, the proposed method also identifies the type of the sound source by using a PN-model-based competitive learning pulsed neural network to process the spectral information.
2
Pulsed Neuron Model
When processing time series data (e.g., sound), it is important to consider the time relation and to have computationally inexpensive calculation procedures to enable real-time processing. For these reasons, a PN model is used in this research.
Fig. 1. Pulsed neuron model
Figure 1 shows the structure of the PN model. When an input pulse $IN_k(t)$ reaches the $k$th synapse, the local membrane potential $p_k(t)$ is increased by the value of the weight $w_k$. The local membrane potentials decay exponentially over time with a time constant $\tau_k$. The neuron's output $o(t)$ is given by

$$o(t) = H(I(t) - \theta), \qquad (1)$$

$$I(t) = \sum_{k=1}^{n} p_k(t), \qquad (2)$$

$$p_k(t) = w_k\, IN_k(t) + p_k(t-1)\, e^{-t/\tau_k}, \qquad (3)$$
where n is the total number of inputs, I(t) is the inner potential, θ is the threshold and H(·) is the unit step function. The PN model also has a refractory period tndti , during which the neuron is unable to fire, independently of the membrane potential.
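As a concrete illustration of eqs.(1)-(3), a discrete-time PN unit might be sketched as follows. This is a minimal sketch: the class and attribute names are ours, and a constant per-step decay factor is used as a simplification of the exponential decay term.

```python
import numpy as np

class PulsedNeuron:
    """Discrete-time pulsed neuron after eqs.(1)-(3), with a refractory period."""
    def __init__(self, weights, tau, theta, refractory=2):
        self.w = np.asarray(weights, dtype=float)
        self.decay = np.exp(-1.0 / tau)    # per-step decay (our simplification)
        self.theta = theta
        self.refractory = refractory       # refractory period in time steps
        self.p = np.zeros_like(self.w)     # local membrane potentials p_k(t)
        self.cooldown = 0

    def step(self, in_pulses):
        """in_pulses: 0/1 vector of input pulses IN_k(t); returns output pulse o(t)."""
        self.p = self.w * np.asarray(in_pulses, dtype=float) + self.p * self.decay
        inner = self.p.sum()               # inner potential I(t), eq.(2)
        if self.cooldown > 0:              # neuron cannot fire during refractory period
            self.cooldown -= 1
            return 0
        if inner >= self.theta:            # unit step H(I(t) - theta), eq.(1)
            self.cooldown = self.refractory
            return 1
        return 0
```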
3
The Proposed System
The basic structure of the proposed system is shown in Fig.2. This system consists of three main blocks: the frequency-pulse converter, the level difference extractor and the sound source classifier, of which the last two are based on PN models. The relative movement (approaching or moving away) of the sound source is determined by the sound level variation. The system compares a signal level x(t) from a microphone with the level at a previous time, x(t−Δt). If x(t) > x(t−Δt), the sound source is getting closer to the microphone; if x(t) < x(t−Δt), it is moving away. After the level difference has been extracted, the outputs of the level difference extractors contain the spectral pattern of the input sound, which is then used for recognizing the type of the source.

3.1 Filtering and Frequency-Pulse Converter
Initially, the input signal must be pre-processed and converted to a train of pulses. A bank of 4th-order band-pass filters decomposes the signal into 13 frequency channels equally spaced on a logarithmic scale from 500 Hz to 2 kHz. Each frequency channel is modified by the non-linear function shown in Eq.(4), and the resulting signal's envelope is extracted by a 400 Hz low-pass filter.

Fig. 2. The structure of the recognition system
Finally, each output signal is independently converted to a pulse train whose rate is proportional to the amplitude of the signal.

$$F(t) = \begin{cases} x(t)^{1/3} & x(t) \ge 0 \\ \dfrac{1}{4}\, x(t)^{1/3} & x(t) < 0 \end{cases} \qquad (4)$$

3.2 Level Difference Extractor
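As a rough illustration of this front end, the filter bank, the compression of Eq.(4), the envelope extraction and a rate-coded pulse conversion might be put together as follows. This is only a sketch: the filter design details, the stochastic pulse-generation scheme, the `pulse_rate` scaling and all names are our assumptions, not the paper's.

```python
import numpy as np
from scipy.signal import butter, lfilter

def front_end(x, fs, n_channels=13, f_lo=500.0, f_hi=2000.0, pulse_rate=1000.0):
    """Band-pass filter bank (log-spaced), Eq.(4) compression, 400 Hz envelope
    extraction and conversion of each channel to a pulse train."""
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
    b_env, a_env = butter(2, 400.0, btype='low', fs=fs)       # envelope low-pass
    pulses = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo, hi], btype='band', fs=fs)       # 4th-order band-pass
        y = lfilter(b, a, x)
        y = np.where(y >= 0, np.cbrt(y), 0.25 * np.cbrt(y))   # Eq.(4) as reconstructed
        env = np.clip(lfilter(b_env, a_env, np.abs(y)), 0, None)
        p = np.random.rand(len(env)) < env * pulse_rate / fs  # rate-coded pulses
        pulses.append(p.astype(int))
    return np.array(pulses)
```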
Each pulse train generated by the frequency-pulse converter is input to a Level Difference Extractor (LDE) independently. The LDE, shown in Fig. 3, is composed of two parts, the Lateral Superior Olive (LSO) model and the Level Mapping Two (LM2) model [7]. In the LSO model and the LM2 model, each neuron works according to Eq.(3). The LSO is responsible for the level difference extraction itself, while the LM2 extracts the envelope of the complex firing pattern. The pulse train corresponding to each frequency channel is input to an LSO model. The inner potential $I^{LSO}_{i,f}(t)$ of the $i$th LSO neuron of the $f$th channel is calculated as follows:
pN i,f (t)
=
N wi,f xf (t)
+
pN i,f (t
(5) − 1)e
t LSO
−τ
B B pB i,f (t) = wi,f xf (t − Δt) + pi,f (t − 1)e
−τ t LSO
(6) (7)
where $\tau_{LSO}$ is the time constant of the LSO neuron and the weights $w^{N}_{i,f}$ and $w^{B}_{i,f}$ are defined as

$$w^{N}_{i,f} = \begin{cases} 0.0 & i = 0 \\ 1.0 & i > 0 \end{cases}, \qquad w^{B}_{i,f} = \begin{cases} 0.0 & i = 0 \\ 1.0 & i < 0 \end{cases}$$

$E_{min} > E_{p\text{-}min}$. That is, when the last HN is added, the minimum training error $E_{min}$ should be larger than the previously achieved minimum training error $E_{p\text{-}min}$.

2.1.2 Stopping Criterion for Backward Elimination (BE)

2.1.2.1 Calculation of Error Reduction Rate (ERR)
During the BE process, we need to calculate the ERR as follows:
$$ERR = -\frac{E'_{min} - E_{min}}{E_{min}}$$

Here, $E'_{min}$ is the minimum training error during backward elimination.
2.1.2.2 Stopping Criterion for Double Feature Deletion
During the course of training, irrelevant features are deleted two at a time in BE; the SC is adopted as a condition on ERR and CA'. Here, th refers to a threshold value equal to 0.05, and CA' and CA refer to the classification accuracies during BE and before BE, respectively.
2.1.2.3 Stopping Criterion for Single Feature Deletion
During the course of training, irrelevant features are deleted one at a time in BE; the SC is adopted as a condition on ERR and CA'. Here, th refers to a threshold value equal to 0.08.

2.2 Measurement of Contribution
Finding the contribution of the input attributes accurately ensures the effectiveness of CFSS. Therefore, during the one-by-one removal training, we measure
the minimum training error $E^i_{min}$ obtained after removing the $i$th feature. The contribution of each feature is then calculated as follows. The training error difference for removing the $i$th feature is

$$E^i_{diff} = E^i_{min} - E_{min},$$

where $E_{min}$ is the minimum training error using all features. The average training error difference over all feature removals is

$$E_{avg} = \frac{1}{N} \sum_{i=1}^{N} E^i_{diff},$$

where $N$ is the number of features. The percentage contribution of the $i$th feature is then

$$Con_i = 100 \cdot \frac{E_{avg} - E^i_{diff}}{E_{avg}}.$$

Now we can decide the rank of the features according to $Con_{max} = \max_i(Con_i)$. The worst-ranking features are treated as irrelevant to the network, as they provide comparatively little contribution.
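A minimal sketch of this contribution measure follows (the removal-training errors are assumed to be given, e.g. by whatever training routine the network uses; names are ours):

```python
import numpy as np

def feature_contributions(E_min, E_min_without):
    """Percentage contribution of each feature from one-by-one removal training.
    E_min: minimum training error with all features.
    E_min_without: array where E_min_without[i] is the minimum error after removing feature i."""
    E_diff = np.asarray(E_min_without) - E_min     # E^i_diff
    E_avg = E_diff.mean()                          # average training error difference
    con = 100.0 * (E_avg - E_diff) / E_avg         # Con_i
    rank = np.argsort(con)                         # worst-ranked (least contributing) first
    return con, rank
```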
2.3 Calculation of Network Connection
The number of connections $C$ of the final network architecture can be calculated in terms of the number of remaining input attributes $x$, the number of remaining hidden units $h$, and the number of outputs $o$ by $C = (x \times h) + (h \times o) + h + o$.

2.3.1 Reduction of Network Connection (RNC)
Here we estimate how much the number of connections has been reduced due to feature selection. We first calculate the total number of connections of the network before and after BE, $C_{before}$ and $C_{after}$, respectively, and then estimate RNC as

$$RNC = 100 \cdot \frac{C_{before} - C_{after}}{C_{before}}.$$
2.3.2 Increment of Network Accuracy (INA)
Along with the reduction of network connections, we estimate INA, which shows how much the network accuracy is improved. We measure the classification accuracy before and after BE, $CA_{before}$ and $CA_{after}$, respectively, and estimate INA as

$$INA = 100 \cdot \frac{CA_{after} - CA_{before}}{CA_{before}}.$$
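These book-keeping formulas can be stated directly in a few lines (names ours); for instance, the connection count reproduces the 18.12 average connections reported for BCR after FS in Table 1 from x = 3.2, h = 2.6, o = 2:

```python
def connections(x, h, o):
    """C = x*h + h*o + h + o (Sec. 2.3)."""
    return x * h + h * o + h + o

def rnc(c_before, c_after):
    """Reduction of network connections, Sec. 2.3.1 (%)."""
    return 100.0 * (c_before - c_after) / c_before

def ina(ca_before, ca_after):
    """Increment of network accuracy, Sec. 2.3.2 (%)."""
    return 100.0 * (ca_after - ca_before) / ca_before

print(connections(3.2, 2.6, 2))   # 18.12, as in Table 1 for BCR after FS
```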
2.4 The Algorithm
In this framework, CFSS comprises three main steps, summarized in Fig. 1: (a) architecture determination of the NN in a constructive fashion (steps 1-2), (b) measurement of the contribution of each feature (step 3), and (c) subset generation (steps 4-8). These steps depend on each other, and the entire process is accomplished one step after another according to particular criteria. The details of each step are as follows.

Step 1) Create a minimal NN architecture. Initially it has three layers, i.e. an input layer, an output layer, and a hidden layer with only one neuron. The numbers of input and output neurons are equal to the numbers of inputs and outputs of the problem. Randomly initialize the connection weights between the input and hidden layers and between the hidden and output layers within the range [-1.0, +1.0].

Step 2) Train the network using the BPL algorithm and try to achieve a minimum training error, Emin. Then add an HN, retrain the network from the beginning, and check the SC according to subsection 2.1.1. When it is satisfied, validate the NN on the test set to calculate the classification accuracy, CA, and go to step 3. Otherwise, continue the HN selection process.

Step 3) To find the contribution of the features, perform one-by-one removal training of the network: delete each ith feature in turn and save the individual minimum training error, E^i_min. The rank of all features can then be determined in accordance with subsection 2.2. Then go to Step 4.
Fig. 1. Overall flowchart of CFSS. Here, NN, HN and SC refer to neural network, hidden neuron and stopping criterion, respectively.
Step 4) This stage is the first step of BE for generating the feature subset. Initially, we attempt to delete the two worst-ranked features at a time to accelerate the process. Calculate the minimum training error E'min during training. Validate the remaining features using the test set and calculate the classification accuracy CA'. After that, if the SC mentioned in subsection 2.1.2.2 is satisfied, continue. Otherwise, go to step 5.

Step 5) Perform backtracking. That is, the lastly deleted pair (or single) of features is restored here together with all its associated components.

Step 6) If the single-deletion process has not yet been completed, attempt to delete features one at a time using step 4 to filter out the unnecessary ones; otherwise go to step 7. If the SC mentioned in subsection 2.1.2.3 is satisfied, continue. Otherwise, go to step 5 and then step 7.
Step 7) Before selecting the final subset, we again check whether any irrelevant feature remains among the existing features. The following steps are taken to accomplish this task. i) Delete the existing features from the network one at a time, in accordance with the worst rank, and then retrain. ii) Validate the remaining features using the test set. For the next stage, save the classification accuracy CA'', the responsible deleted feature, and all its components. iii) If DF CA. v) If better CA'' are available, identify the highest-ranked feature among them; otherwise, stop. Delete the highest-ranked feature together with the corresponding worst-ranked ones from the subset obtained at step 7-i. After that, recall the components of the current network obtained at step 7-ii, and stop.

Step 8) Finally, we obtain the relevant feature subset with a compact network.
3 Experimental Analysis

This section evaluates CFSS's performance on four well-known benchmark problems, namely the Breast cancer (BCR), Diabetes (DBT), Glass (GLS), and Iris (IRS) problems, obtained from [13]. The numbers of input attributes and output classes are 9 and 2 for BCR, 8 and 2 for DBT, 9 and 6 for GLS, and 4 and 3 for IRS, respectively. All datasets are partitioned into three sets: a training set, a validation set, and a test set. The first 50% is used as the training set for training the network, and the second 25% as the validation set to check the condition of training and to stop it when the overall training error goes up. The last 25% is used as the test set to evaluate the generalization performance of the network and is not seen by the network during the whole training process. In all experiments, two bias nodes with a fixed input of +1 are connected to the hidden layer and the output layer. The learning rate η is set between [0.1, 0.3] and the weights are initialized to random values between [-1.0, +1.0]. Each experiment was carried out 10 times and the presented results are the averages of these 10 runs. All experiments were run on a Pentium-IV 3 GHz desktop personal computer.

3.1 Experimental Results
Table 1 shows the average results of CFSS, reporting the number of selected features, classification accuracies and related quantities for the BCR, DBT, GLS, and IRS problems before and after the feature selection process. For clarity, in this table the training error is measured as in Section 2.5 of [12], and the classification accuracy CA is the ratio of correctly classified examples to the total number of examples in the particular data set.
Fig. 2. Contribution of attributes in the breast cancer problem for a single run.

Fig. 3. Performance of the network according to the size of the subset in the diabetes problem for a single run.

Table 1. Results for the four problems (BCR, DBT, GLS, IRS). Numbers in () are the standard deviations. Here, TERR, CA, FS, HN, ITN, and CTN refer to training error, classification accuracy, feature selection, hidden neuron, iteration, and connection, respectively.

Name | Before FS: Feature # / TERR (%) / CA (%)        | After FS: Feature # / TERR (%) / CA (%)          | HN # | ITN #  | CTN #
BCR  | 9 (0.00) / 3.71 (0.002) / 98.34 (0.31)          | 3.2 (0.56) / 3.76 (0.003) / 98.57 (0.52)         | 2.60 | 249.4  | 18.12
DBT  | 8 (0.00) / 16.69 (0.006) / 75.04 (2.06)         | 2.5 (0.52) / 14.37 (0.024) / 76.13 (1.74)        | 2.66 | 271.3  | 16.63
GLS  | 9 (0.00) / 26.44 (0.026) / 64.71 (1.70)         | 4.1 (1.19) / 26.56 (0.028) / 66.62 (2.18)        | 3.3  | 1480.9 | 42.63
IRS  | 4 (0.00) / 4.69 (0.013) / 97.29 (0.00)          | 1.3 (0.48) / 8.97 (0.022) / 97.29 (0.00)         | 3.2  | 2317.1 | 19.96

Table 2. Computational time for the BCR, DBT, GLS, and IRS problems

Name | Computational Time (Second)
BCR  | 22.10
DBT  | 30.07
GLS  | 22.30
IRS  | 7.90

Table 3. Connection reduction and accuracy increase for the BCR, DBT, GLS, and IRS problems

Name | Connection Decrease (%) | Accuracy Increase (%)
BCR  | 45.42 | 0.23
DBT  | 46.80 | 1.45
GLS  | 27.5  | 2.95
IRS  | 30.2  | 0.00
In each experiment, CFSS generates a subset containing the minimum number of relevant features available. Thereupon, CFSS produces a better CA with a smaller number of hidden neurons (HN). Table 1 shows, for example, that for BCR the network easily selects a minimal relevant feature subset, i.e. 3.2 features on average, yielding 98.57% CA while using only 2.6 hidden neurons to build the minimal network structure. The results for the remaining problems, DBT, GLS, and IRS, are broadly similar to BCR, except that for the IRS problem the network cannot produce better accuracy after feature selection; rather, it generates a subset of 1.3 relevant features out of the 4 attributes, which is sufficient to exhibit the same network performance. In addition, Fig. 2 shows the arrangement of attributes during training according to their contribution for BCR; CFSS can easily delete the unnecessary ones and generate the subset {6,7,1}. In contrast, Fig. 3 shows the relationship between classification accuracy and subset size for DBT. We calculated the network connections of the pruned network at the final stage according to Section 2.3, and the results are shown in Table 1. The computational time for completing the entire FSS process is given in Table 2. We also estimated the decrease in network connections and the corresponding increase in accuracy due to FSS according to Sections 2.3.1 and 2.3.2, shown in Table 3, where the relation between the reduction of network connections and the increase in accuracy can be seen.

3.2 Comparison with Other Methods
In this section, we compare the results of CFSS with those obtained by the other methods, NNFS and ICFS, reported in [4] and [6], respectively. The results are summarized in Tables 4-7. Care should be taken in this comparison, because different techniques are involved in those feature selection methods.

Table 4. Comparison of the number of relevant features for the BCR, DBT, GLS, and IRS data sets

Name | CFSS | NNFS | ICFS (M 1) | ICFS (M 2)
BCR  | 3.2  | 2.70 | 5 | 5
DBT  | 2.5  | 2.03 | 2 | 3
GLS  | 4.1  | -    | 5 | 4
IRS  | 1.3  | -    | - | -

Table 5. Comparison of the average testing CA (%) for the BCR, DBT, GLS, and IRS data sets

Name | CFSS  | NNFS  | ICFS (M 1) | ICFS (M 2)
BCR  | 98.57 | 94.10 | 98.25 | 98.25
DBT  | 76.13 | 74.30 | 78.88 | 78.70
GLS  | 66.62 | -     | 63.77 | 66.61
IRS  | 97.29 | -     | -     | -

Table 6. Comparison of the average number of hidden neurons for the BCR, DBT, GLS, and IRS data sets

Name | CFSS | NNFS | ICFS (M 1) | ICFS (M 2)
BCR  | 2.60 | 12   | 33.55 | 42.05
DBT  | 2.66 | 12   | 8.15  | 21.45
GLS  | 3.3  | -    | 62.5  | 53.95
IRS  | 3.2  | -    | -     | -

Table 7. Comparison of the average number of connections for the BCR, DBT, GLS, and IRS data sets

Name | CFSS  | NNFS  | ICFS (M 1) | ICFS (M 2)
BCR  | 18.12 | 70.4  | 270.4 | 338.5
DBT  | 16.63 | 62.36 | 42.75 | 130.7
GLS  | 42.63 | -     | 756   | 599.4
IRS  | 19.96 | -     | -     | -
Table 4 shows the discrimination capability of CFSS, in which the dimensionality of the input layer is reduced for the four above-mentioned problems. In the case of BCR, the result of CFSS is clearly better than the two methods of ICFS, though not better than NNFS. The result for DBT is average compared with the others. For GLS, the result is comparable to or better than the two methods of ICFS.
The comparison of the average testing CA for all problems is shown in Table 5. The results for BCR and GLS are better than NNFS and ICFS, and that for DBT is better than NNFS. The most important aspect of our proposed method, however, is the smaller number of connections in the final network, due to the smaller number of HNs in the NN architecture. The results are shown in Tables 6 and 7, where the numbers of HNs and connections of CFSS are much lower than those of the other three methods. Note that there are some missing data in Tables 4-7, since the IRS problem was not tested by NNFS and ICFS, and the GLS problem was not tested by NNFS.
4 Discussion

This paper presents a new combinatorial method for feature selection that generates a subset with minimal computation, owing to the minimal size of the hidden layer as well as of the input layer. A constructive technique is used to achieve a compact hidden layer, and a straightforward contribution measurement leads to a reduced input layer with better performance. Moreover, the composite combination of BE by means of double and single elimination, backtracking, and validation helps CFSS to generate subsets proficiently. The results in Table 1 show that CFSS generates subsets with a small number of relevant features while producing better performance on the four benchmark problems. The results for relevant subset generation and generalization performance are better than or comparable to the other three methods, as shown in Tables 4 and 5.

For a long time, wrapper approaches to FSS have been overlooked because of their huge computational cost. In CFSS, the computational cost is much lower. As seen in Table 3, due to FSS, 37.48% of the computational cost is reduced with an advantage of 1.18% in network accuracy for the four problems on average. The computational times needed to complete the entire process in CFSS are shown in Table 2 for the different problems. We believe these values are sufficiently low, especially for the clinical field: the system can deliver the diagnostic result of a patient to the doctor within a minute. Although an exact comparison is difficult, other methods such as NNFS and ICFS may take 4-10 times longer, since their numbers of hidden neurons and connections are much larger, as seen in Tables 6 and 7, respectively. CFSS thus provides minimal computation in feature subset selection. In addition, during subset generation, the generated subset is validated at each step. The reason is that during BE we build a composite SC, which eventually finds the local minimum where the network training should be stopped. Owing to this criterion, the network produces significant performance and there is no need to validate the generated subset at the very end. Furthermore, the further check for irrelevant features in BE brings completeness to CFSS.

In this study we applied CFSS to datasets with a smaller number of features, up to 9. To obtain more relevant tests for real tasks, we intend to apply CFSS to datasets with a larger number of features in the future. The issue of extracting rules from an NN is always in demand in order to interpret the knowledge of how it works; for this, a compact NN is desirable. As CFSS can help fulfill this requirement, rule extraction from the NN is a further task for the future.
5 Conclusion

This paper presents a new approach to feature subset selection based on the contribution of input attributes in an NN. The combination of constructive training, contribution measurement, and backward elimination carries the success of CFSS. Initially, a basic constructive algorithm is used to determine a minimal and optimal structure of the NN. In the latter part, one-by-one removal of input attributes is adopted, which is not computationally expensive. Finally, backward elimination with new stopping criteria is used to generate the relevant feature subset efficiently. To evaluate CFSS, we tested it on four real-world problems: the breast cancer, diabetes, glass and iris problems. Experimental results confirmed that CFSS has a strong capability for feature selection; it can remove the irrelevant features from the network and generates a feature subset while producing a compact network with minimal computational cost.
Acknowledgements Supported by grants to KM from the Japanese Society for Promotion of Sciences, the Yazaki Memorial Foundation for Science and Technology, and the University of Fukui.
References
1. Liu, H., Tu, L.: Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering 17(4), 491–502 (2005)
2. Dash, M., Liu, H.: Feature Selection for Classification. Intelligent Data Analysis - An International Journal 1(3), 131–156 (1997)
3. Kohavi, R., John, G.H.: Wrappers for Feature Subset Selection. Artificial Intelligence 97, 273–324 (1997)
4. Setiono, R., Liu, H.: Neural Network Feature Selector. IEEE Transactions on Neural Networks 8 (1997)
5. Milne, L.: Feature Selection using Neural Networks with Contribution Measures. In: 8th Australian Joint Conference on Artificial Intelligence, Canberra, November 27 (1995)
6. Guan, S., Liu, J., Qi, Y.: An Incremental Approach to Contribution-based Feature Selection. Journal of Intelligence Systems 13(1) (2004)
7. Schuschel, D., Hsu, C.: A Weight Analysis-based Wrapper Approach to Neural Nets Feature Subset Selection. In: Tools with Artificial Intelligence: Proceedings of the 10th IEEE International Conference (1998)
8. Hsu, C., Huang, H., Schuschel, D.: The ANNIGMA-Wrapper Approach to Fast Feature Selection for Neural Nets. IEEE Trans. on Systems, Man, and Cybernetics - Part B: Cybernetics 32(2), 207–212 (2002)
9. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to Instability Problems with Sequential Wrapper-based Approaches to Feature Selection. Journal of Machine Learning Research (2002)
10. Phatak, D.S., Koren, I.: Connectivity and Performance Tradeoffs in the Cascade Correlation Learning Architecture. Technical Report TR-92-CSE-27, ECE Department, UMASS, Amherst (1994)
11. Rumelhart, D.E., McClelland, J.: Parallel Distributed Processing. MIT Press, Cambridge (1986)
12. Prechelt, L.: PROBEN1 - A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21/94, Faculty of Informatics, University of Karlsruhe, Germany (1994)
13. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. Dept. of Information and Computer Sciences, University of California, Irvine (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
Dynamic Link Matching between Feature Columns for Different Scale and Orientation
Yasuomi D. Sato1, Christian Wolff1, Philipp Wolfrum1, and Christoph von der Malsburg1,2
1 Frankfurt Institute for Advanced Studies (FIAS), Johann Wolfgang Goethe University, Max-von-Laue-Str. 1, 60438, Frankfurt am Main, Germany
2 Computer Science Department, University of Southern California, LA, 90089-2520, USA
Abstract. Object recognition in the presence of changing scale and orientation requires mechanisms to deal with the corresponding feature transformations. Using Gabor wavelets as example, we approach this problem in a correspondence-based setting. We present a mechanism for finding feature-to-feature matches between corresponding points in pairs of images taken at different scale and/or orientation (leaving out for the moment the problem of simultaneously finding point correspondences). The mechanism is based on a macro-columnar cortical model and dynamic links. We present tests of the ability of finding the correct feature transformation in spite of added noise.
1
Introduction
When trying to set two images of the same object or scene into correspondence with each other, so that they can be compared in terms of similarity, it is necessary to find point-to-point correspondences in the presence of changes in scale or orientation (see Fig. 1). It is also necessary to transform local features (unless one chooses to work with features that are invariant to scale and orientation, accepting the reduced information content of such features). Correspondence-based object recognition systems [1,2,3,4] have so far mainly addressed the issue of finding point-to-point correspondences, leaving local features unchanged in the process [5,6]. In this paper, we propose a system that can not only transform features for comparison purposes, but also recognize the transformation parameters that best match two sets of local features, each one taken from one point in an image. Our eventual aim will be to find point correspondences and feature correspondences simultaneously in one homogeneous dynamic link matching system, but we here take point correspondences as given for the time being. Both theoretical [7] and experimental [8] investigations suggest 2D-Gabor-based wavelets as features used in visual cortex. These are best sampled in a log-polar manner [11]. This representation has been shown to be particularly useful for face recognition [12,13], and due to its inherent symmetry it is highly appropriate for implementing a transformation system for scale and
Fig. 1. When images of an object differ in (a) scale or (b) orientation, correspondence between them can only be established if also the features extracted at a given point (e.g., at the dot in the middle of the images) are transformed during comparison
orientation. The work described here is based entirely on the macro-columnar model of [14]. In computer simulations we demonstrate feature transformation and transformation parameter recognition.
2
Concept of Scale and Rotation Invariance
Let there be two images, called I and M (for image and model). The Gabor-based wavelet transform $\bar{J}^L$ ($L = I$ or $M$) has the form of a vector, called a jet, whose components are defined as convolutions of the image with a family of Gabor functions:

$$\bar{J}^L_{k,l}(\mathbf{x}_0, a, \theta) = \int_{\mathbb{R}^2} I(\mathbf{x})\, \frac{1}{a^2}\, \psi\!\left( \frac{1}{a}\, Q(\theta)(\mathbf{x}_0 - \mathbf{x}) \right) d^2x, \qquad (1)$$

$$\psi(\mathbf{x}) = \frac{1}{D^2}\, \exp\!\left( -\frac{1}{2D^2} |\mathbf{x}|^2 \right) \left[ \exp(i\, \mathbf{x}^T \cdot \mathbf{e}_1) - \exp\!\left( -\frac{D^2}{2} \right) \right], \qquad (2)$$

$$Q(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}. \qquad (3)$$

Here $a$ simultaneously represents the spatial frequency and controls the width of the Gaussian envelope window, and $D$ represents the standard deviation of the Gaussian. The individual components of the jet are indexed by $n$ possible orientations and $(m + m_1 + m_2)$ possible spatial frequencies, namely $\theta = \pi k / n$ ($k \in \{0, 1, \ldots, n-1\}$) and $a = a_0^l$ ($0 < a_0 < 1$, $l \in \{0, 1, \ldots, m + m_1 + m_2 - 1\}$). The parameters $m_1$ and $m_2$ fix the number of scale steps by which the image can be scaled up or down, respectively, and $m\ (> 1)$ is the number of scales in the jets of the model domain. The Gabor jet $\bar{J}^L$ is defined as the set $\{\bar{J}^L_{k,l}\}$ of $N_L$ Gabor wavelet components extracted at the position $\mathbf{x}_0$ of the image. In order to test the ability to find the correct feature transformation described below, we add noise of strength $\sigma_1$ to the Gabor wavelet components, $\tilde{J}^L_{k,l} = \bar{J}^L_{k,l}\, (1 + \sigma_1 S_{ran})$, where $S_{ran}$ are random numbers between 0 and 1. Instead of the Gabor jet $\bar{J}^L$ itself, we employ the sum-normalized $\tilde{J}^L_{k,l}$.
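A minimal NumPy sketch of such a jet follows, sampling eqs.(1)-(3) on a pixel grid. The kernel-size heuristic, border handling, normalization and all names are our simplifications, and the phase/sign convention of the argument is not meant to be definitive:

```python
import numpy as np

def gabor_kernel(size, a, theta, D=2.0):
    """Sampled kernel (1/a^2) psi(Q(theta) x / a) of eqs.(1)-(3)."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    u = ( np.cos(theta) * xs + np.sin(theta) * ys) / a   # rotated, scaled coordinates
    v = (-np.sin(theta) * xs + np.cos(theta) * ys) / a
    envelope = np.exp(-(u**2 + v**2) / (2 * D**2)) / D**2
    carrier = np.exp(1j * u) - np.exp(-D**2 / 2)         # carrier along e_1 minus DC term
    return envelope * carrier / a**2

def jet(image, x0, y0, scales, orientations):
    """Gabor jet at (x0, y0): one complex coefficient per (orientation, scale).
    Assumes (x0, y0) lies far enough from the image border."""
    size = 2 * int(4 * max(scales)) + 1
    half = size // 2
    patch = image[y0 - half:y0 + half + 1, x0 - half:x0 + half + 1]
    return np.array([[np.sum(patch * gabor_kernel(size, a, th))
                      for a in scales] for th in orientations])
```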
Fig. 2. Comparison of Gabor jets in the model domain M and the image domain I (left and right, resp., in the two part-figures). Jets are visualized in log-polar coordinates: The orientation θ of jets is arranged circularly while the spatial frequency l is set radially. The jet J M of the model domain M is shown right while the jet J I from the transformed image in the image domain I is shown left. Arrows indicate which components of J I and J M are to be compared. (a) comparison of scaled jets and (b) comparison of rotated jets.
Jets can be visualized in angular-eccentricity coordinates, in which the orientation and spatial frequency of a jet are arranged circularly and radially (as shown in Fig.2). The components of jets that are taken from corresponding points in images I and M may be arranged according to orientation (rows) and scale (columns):

$$J^I = \begin{pmatrix} 0 \cdots 0 & J^I_{0,m_1} \cdots J^I_{0,m_1-1+m} & 0 \cdots 0 \\ \vdots & \vdots & \vdots \\ 0 \cdots 0 & J^I_{n-1,m_1} \cdots J^I_{n-1,m_1-1+m} & 0 \cdots 0 \end{pmatrix}, \qquad (4)$$

$$J^M = \begin{pmatrix} J^M_{0,m_1} \cdots J^M_{0,m_1-1+m} \\ \vdots \\ J^M_{n-1,m_1} \cdots J^M_{n-1,m_1-1+m} \end{pmatrix}. \qquad (5)$$
Let us assume the two images M and I to be compared have the exact same structure, apart from being transformed relative to each other, and that the transformation conforms to the sample grid of jet components. Then there will be pair-wise identities between components in Eqs.(4) and (5). If the jet in I is scaled relative to the jet in M, the non-zero components of J^I are shifted along the horizontal (or, in Fig.2(a), radial) coordinate. If the image I, and correspondingly the jet J^I, is rotated, then the jet components are circularly permuted along the vertical axis in Eq.(4) (see Fig.2(b)). When comparing scaled and rotated jets, the non-zero components of J^I are shifted along both axes simultaneously. There are model jet components of m different scales; to allow for m1 steps of scaling the image I down and m2 steps of scaling it up, the jet in Eq.(4) is padded on the left and right with m1 and m2 columns of zeros, respectively.
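The padding and the two kinds of shifts can be made concrete in a few lines (a sketch; function names are ours):

```python
import numpy as np

def pad_image_jet(J, m1, m2):
    """Embed an n x m jet into the n x (m + m1 + m2) frame of eq.(4)."""
    n, m = J.shape
    return np.concatenate([np.zeros((n, m1)), J, np.zeros((n, m2))], axis=1)

def transform_jet(J_padded, o, s):
    """Relative rotation o (cyclic row permutation) and scale change s (column shift).
    For |s| within the zero padding, the wrap-around only moves zeros."""
    return np.roll(np.roll(J_padded, o, axis=0), s, axis=1)
```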
Fig. 3. A dynamic link matching network model of a pair of single macro-columns. In the I and M domains, the network consists of, respectively, NI and NM feature units (mini-columns). The units in the I and M domains represent components of the jets J I and J M . The Control domain C controls interconnections between the I and M macrocolumns, which are all-to-all at the beginning of a ν cycle. By comparing the two jets and through competition of control units in C, the initial all-to-all interconnections are dynamically reduced to a one-to-one interconnection.
3
Modelling a Columnar Network
In analogy to [14], we set up a dynamical system of variables, both for jet components and for all possible correspondences between jet components. These variables are interpreted as the activity of cortical mini-columns (here called "units", in this case "feature units"). The set of units corresponding to a jet, either in I or in M, forms a cortical macro-column (or short "column"), and the units inhibit each other, the strength of inhibition being controlled by a parameter ν, which cyclically ramps up and falls down. In addition, there is a column of control units, forming the control domain C. Each control unit stands for one pair of possible values of relative scale and orientation between I and M and can evaluate the similarity of the activity of feature units under the corresponding transformation. The whole network is schematically shown in Fig.3. Each domain, or column in our simple case, consists of a set of unit activities $\{P^L_1, \ldots, P^L_{N_L}\}$ in the respective domain ($L = I$ or $M$). $N_I = (m + m_1 + m_2) \times n$ and $N_M = m \times n$ represent respectively the total number of units in the I and M domains. The control unit activations are $\mathbf{P}^C = (P^C_1, \ldots, P^C_{N_C})^T$, where $N_C = (m_1 + m_2 + 1) \times n$. The equation system for the activations in the domains is given by

$$\frac{dP^L_\alpha}{dt} = f_\alpha(\mathbf{P}^L) + \kappa_L C_E J^L_\alpha + \kappa_L (1 - C_E)\, E^{LL'}_\alpha + \xi_\alpha(t), \quad (\alpha = 1, \ldots, N_L \text{ and } L' = \{M, I\} \setminus L), \qquad (6)$$

$$\frac{dP^C_\gamma}{dt} = f_\gamma(\mathbf{P}^C) + \kappa_C F_\gamma + \xi_\gamma(t), \quad (\gamma = 1, \ldots, N_C), \qquad (7)$$
Fig. 4. Time course of all activities in the domains I, M and C over a ν cycle for two jets without relative rotation and scale, for a system with size m = 5, n = 8, m1 = 0 and m2 = 4. The activity of units in the M and I columns is shown within the bars on the right and the bottom of the panels, respectively, each of the eight subblocks corresponding to one orientation, with individual components corresponding to different scales. The indices k and l at the top label the orientation and scale of the I jet, respectively. The large matrix C contains, in the top version, as non-zero (white) entries all allowed component-to-component correspondences between the two jets (the empty, black, entries would correspond to forbidden relative scales outside the range set by m1 and m2 ). The matrix consists of 8 × 8 blocks, corresponding to the 8 × 8 possible correspondences between jet orientations, and each of these sub-matrices contains m×(m+m1 +m2 ) entries to make room for all possible scale correspondences. Each control unit controls (and evaluates) a whole jet to jet correspondence with the help of its entries in this matrix. The matrix in the lowest panel has active entries corresponding to just one control unit (for identical scale and orientation). Active units are white, silent units black. Unit activity is shown for three moments within one ν cycle. At t = 3.4 ms, the system has already singled out the correct scale, but is still totally undecided as to relative orientation. System parameters are κI = κM = 2, κC = 5, CE = 0.99, σ = 0.015 and σ1 = 0.1.
where $\xi_\alpha(t)$ and $\xi_\gamma(t)$ are Gaussian noise of strength $\sigma^2$. $C_E \in [0, 1]$ scales the ratio between the Gabor jet $J^L_\alpha$ and the total input $E^{LL'}_\alpha$ from units in the other domain $L'$, and $\kappa_L$ controls the strength of the projection from $L'$ to $L$. $\kappa_C$ controls the strength of the influence of the similarity $F_\gamma$ on the control units. Here we explain the other terms in the above equations. The function $f_\alpha(\cdot)$ is

$$f_\alpha(\mathbf{P}^L) = a_1 P^L_\alpha \left( P^L_\alpha - \nu(t) \max_{\beta = 1, \ldots, N_L} \{P^L_\beta\} - (P^L_\alpha)^2 \right), \qquad (8)$$
where a1 = 25. The function ν(t) describes a cycle in time during which the feedback is periodically modulated:
$$\nu(t) = \begin{cases} (\nu_{max} - \nu_{min})\, \dfrac{t - T_k}{T_1} + \nu_{min}, & (T_k \le t < T_1 + T_k) \\[4pt] \nu_{max}, & (T_1 + T_k \le t < T_1 + T_k + T_{relax}) \end{cases} \qquad (9)$$
=
NL NC
β ˜ L PβC Cαα Pα ,
(10)
β ˜I P˜γM Cγγ Pγ .
(11)
β=1 α =1
Fβ =
NM NI γ=1
γ =1
Here NC = n×(m1 +m2 +1) is the number of permitted transformations between the two domains (n different orientations and m1 + m2 + 1 possible scales). The β expression Cαα designates a family of matrices, rotating and scaling one jet into another. If we write β = (o, s) so that o identifies the rotation and s the scaling, then we can write the C matrices as tensor product: C β = Ao ⊗ B s , where Ao is an n × n matrix with a wrap-around shifted diagonal of entries 1 and zeros otherwise, implementing rotation o of jet components at any orientation, and B s is an m × (m + m1 + m2 ) matrix with a shifted diagonal of entries 1, and zeros otherwise, implementing an up or down shift of jet components of any scale. The C matrix implementing identity looks like the lowest panel in Fig. 4. The function E of Eq. (10), transfers feature unit activity from one domain to the other, and is a weighted superposition of all possible transformations, weighted with the activity of the control units. The function of F in Eq. (11) evaluates the similarity of the feature unit activity in one domain with that in the other under mutual transformation.
Through the columnar dynamics of Eq. (6), the relative strengths of jet components are expressed in the history of feature unit activities (the strongest jet component, for instance, letting its unit stay on longest) so that the sum of products in Eq. (11), integrated over time in Eq. (7), indeed expresses jet similarities. The time course of all unit activities in our network model during a ν cycle is demonstrated in Fig.4. In this figure, the non-zero components of the jets used in the I and M domains are identical without any rotation and scale. At the beginning of the ν cycle, all units in the I, M and C domains are active, as indicated in Fig.4 by the white jet-bars on the right and the bottom, and indicated by all admissible correspondences in the matrix being on. The activity state of the units in each domain is gradually reduced following Eqs.(6) – (11). In the intermediate state at t = 3.4 ms, all control units for the wrong scale have switched off, but the control units for all orientations are still on. At t = 12.0 ms, finally, only one control unit has survived, the one that correctly identifies the identity map. This state remains stable during the rest of the ν cycle.
4
Results
In this Section, we describe tests performed on the functionality of our macrocolumnar model by studying the robustness of matching to noise introduced into the image jets (parameter σ1 ) and the dynamics of the units, Eqs. (6) and (7) (parameter σ) and robustness to the admixture of input from the other domain, Eq. (6), due to non-vanishing (1 − CE ). A correct match is one in which the surviving control unit corresponds to the correct transformation pair which the I jet differs from the M jet. Jet noise is to model differences between jets actually extracted from natural images. Dynamic noise is to reflect signal fluctuations to be expected in units (minicolumns) composed of spiking neurons, and a nonzero admixture parameter may be relevant in a system in which transfer of jet information between the domains is desired. For both of the experiments described below, we used Gabor jets extracted from the center of real facial images. The transformed jets used for matching were extracted from images that had been scaled and rotated relative to the center position by exactly the same factors and angles that are also used in the Gabor transform. This ensures that there exists a perfect match between any two jets produced from two different rotated and scaled versions of the same face. The first experiment uses a single facial image. This experiment investigates the influence of noise in the units (σ) and of the parameter CE . Using a pair of the same jets, we obtain temporal averages of the correct matching over 10 ν cycles for each size (n = 4, 5, . . . , 9, 10, m = 4, m1 = 0 and m2 = m − 1) of the macro-column. From these temporal averages, we calculate √ the sampling average over all n. The standard error is estimated using [σd / N1 ] where σd is standard deviation of the sampling average. N1 is the sample size. Fig.5(a) shows results for κI = κM = 1 and κC = 5. For strong noise (σ = 0.015 to 0.02), the correct matching probabilities for CE = 0.98 to CE = 0.96
Fig. 5. Probability of correct match between units in the I and M domains, plotted against the dynamic noise σ for CE = 1.0, 0.98, 0.96, 0.95 and 0.94. (a) κI = κM = 1 and κC = 5. (b) κI = 2.3, κM = 5 and κC = 5.
Fig. 6. Probability of correct matching, plotted against the jet noise σ1, for comparisons of two identical faces. 6 facial images are used. Our system is set with m = 4, m1 = m2 = m − 1, n = 8, κI = κM = 6.5, κC = 5.0, CE = 0.98 and σ = 0.0.
are higher than those for CE = 1. Interestingly, for these low CE values, matching gets worse for weaker noise, collapsing in the case of σ = 0. This effect requires further investigation. However, for κI = 2.3 and κM = 5, we have found that the correct matching probability takes higher values than in the case of Fig. 5(a). In particular, even with the stronger noise σ = 0.015, our system demonstrates perfect matching for CE = 0.98, independently of n. Next, we investigated the robustness of the matching against the second type of noise (σ1), using 6 different faces. Since the original image is resized and/or rotated with m1 + m2 + 1 = 7 scales and n = 8 orientations, a total of 56 images could be employed in the I domain. Here our system is set with κI = κM = 6.5, κC = 5.0 and CE = 0.98. For each face image, we take temporal averages of the correct matching probability for each size and orientation of the image, in a similar manner as described above. Averages and standard errors of these temporal averages over the 6 facial images are plotted against σ1 in Fig. 6. The result of this experiment is independent of σ as long as σ < 0.02. As Fig. 6 shows, as the random noise in the jets is increased, the correct matching probability either decreases smoothly (for some faces) or drops abruptly to around 0 at a certain σ1 (for others). We have also obtained
perfect matching, independent of σ1, when using the same but rotated or resized face. The most interesting point we would like to stress is that the probability stays above 87.74% for σ < 0.02 and σ1 < 0.08 (see Fig. 6). We can therefore say that the network model has a high capability for scale- and orientation-invariant recognition, largely independent of the facial type.
5 Discussion and Conclusion
The main purpose of this communication is to convey a simple concept. Much more work needs to be done on the way to practical applications, for instance, more experiments with features extracted from independently taken images of different scale and orientation to better bring out the strengths and weaknesses of the approach. In addition to comparing and transforming local packages of feature values (our jets), it will be necessary to also handle spatial maps, that is, sets of point-to-point correspondences, a task we will approach next. Real applications involve, of course, a continuous range of transformation parameters, whereas we here had admitted only transformations from the same sample grid used for defining the family of wavelets in a jet. We hope to address this problem by working with continuous superpositions of transformation matrices for the neighboring transformation parameter values that straddle the correct value. Our approach may be seen as using brute force, as it requires many separate control units to sample the space of transformation parameters with enough density. However, the same set of control units can be used in a whole region of visual space, and in addition to that, for controlling point-to-point maps between regions, as spatial maps and feature maps stand in one-to-one correspondence to each other. The number of control units can be reduced even further if transformations are performed in sequence, by consecutive, separate layers of dynamic links. Thus, the transformation from an arbitrary segment of primary visual cortex to an invariant window could be done in the sequence translation – scaling – rotation. In this case, the number of required control units would be the sum and not the product of the number of samples for each individual transformation. A disadvantage of that approach might be added difficulties in finding correspondences between the domains. Further work is required to elucidate these issues.
Acknowledgements This work was supported by the EU project Daisy, FP6-2005-015803 and by the Hertie Foundation. We would like to thank C. Weber for help with preparing the manuscript.
Perturbational Neural Networks for Incremental Learning in Virtual Learning System
Eiichi Inohira1, Hiromasa Oonishi2, and Hirokazu Yokoi1
1 Kyushu Institute of Technology, Hibikino 2-4, 808-0196 Kitakyushu, Japan {inohira,yokoi}@life.kyutech.ac.jp
2 Mitsubishi Heavy Industries, Ltd., Japan
Abstract. This paper presents a new type of neural network, the perturbational neural network, to realize incremental learning in autonomous humanoid robots. In our previous work, a virtual learning system was provided to realize the exploration of plausible behavior in a robot's brain. Neural networks can generate plausible behavior in an unknown environment without time-consuming exploration. Although an autonomous robot should grow step by step, conventional neural networks forget prior learning when trained with a new dataset. The proposed neural network features adding the outputs of a sub neural network to the weights and thresholds of a main neural network. Incremental learning and high generalization capability are realized by slightly changing the mapping of the main neural network. We show that the proposed neural networks realize incremental learning without forgetting through numerical experiments with a two-dimensional stair-climbing bipedal robot.
1 Introduction
Recently, humanoid robots such as ASIMO [1] have developed dramatically in terms of hardware and show promise for working just like a human. Although many researchers have studied artificial intelligence for a long time, humanoid robots do not yet have enough autonomy to work without experts. Humanoid robots should accomplish missions by themselves rather than have experts give them solutions such as models, algorithms, and programs. Researchers have therefore tried to realize robots that learn through trial and error. Such studies use so-called soft computing techniques such as reinforcement learning [2] and central pattern generators (CPG) [3]. Learning by a robot saves the expert's work to some degree but takes much time; these techniques are less efficient than humans. Humans act instantly depending on the situation by using imagination and experience, and even a failure serves as experience. In particular, humans know the characteristics of the environment and of their own behavior, and simulate trial and error in their brains to explore plausible behavior. In our previous work, Yamashita et al. [4] proposed a bipedal walking control system with a virtual environment based on the motion control mechanism of primates. In that study, exploring plausible behavior by trial and error is carried
out in a robot's brain. However, the problem is that exploring takes much time. Kawano et al. [5] introduced learning of the relation between environmental data and optimized behavior into the previous work to save time. Generating behavior becomes fast because a trained neural network (NN) immediately outputs plausible behavior due to its generalization capability, i.e., the ability to produce a desired output for an unknown input. However, a problem in this previous work is that conventional neural networks cannot realize incremental learning: a trained NN forgets prior learning when trained with a new dataset. Realizing incremental learning is indispensable for learning various behaviors step by step. We present a new type of NN, the perturbational neural network (PNN), to achieve incremental learning. A PNN consists of a main NN and a sub NN. The main NN learns with a representative dataset and never changes after that. The sub NN learns with new datasets so that the output error of the main NN for those datasets is canceled. The sub NN is connected to the main NN as additional terms in the weights and thresholds of the main NN. We expect that these connections through weights and thresholds only slightly change the characteristics of the main NN, i.e., that a perturbation is applied to the mapping of the main NN, and that incremental learning therefore does not sacrifice its generalization capability.
2 A Virtual Learning System for a Biped Robot
We assume that a robot tries to achieve a task in an unknown environment. Virtual learning is defined as learning from virtual experience in a virtual environment in the robot's brain. The virtual environment is generated from sensory information about the real environment around the robot and is used to explore behavior fit for a task without real action. Exploring such behavior in the virtual environment is regarded as virtual experience gained by trial and error or ingenuity. Virtual learning memorizes the relation between environment and behavior under a particular task. The benefits of virtual learning are (1) to reduce the risk of the robot failing in the real world, e.g., by falling or colliding, and (2) to enable the robot to act immediately and achieve learned tasks in similar environments. The former concerns the robot's safety and physical wear; the latter saves work and time in achieving a task. We presented a virtual learning system for a biped robot in [4,5]. This virtual learning system has a virtual environment and NNs for learning, as shown in Fig. 1. We assume that a central pattern generator (CPG) generates the motion of the biped robot. The CPG of a biped robot consists of neural oscillators corresponding to its joints, so behavior is described by CPG parameters rather than joint angles. The CPG parameters for controlling the robot are called the CPG input. Optimized CPG input for a plausible behavior is obtained by exploring in the virtual environment. Virtual learning is realized by NNs. A NN learns a mapping from environmental data to optimized CPG input, i.e., a mapping from the real environment to the robot's behavior. NNs have generalization capability, which means that a desired output can be generated for an unknown input. When a NN has learned a mapping
Fig. 1. Virtual learning system for a biped robot
with representative data, it can generate the desired CPG input for an unknown environment, and the biped robot then achieves the task. It takes much time to train the NN, but the NN is fast at generating output. The CPG input is first generated by the NN and then checked in the virtual environment. When the CPG input generated by the NN achieves the task, it is sent to the CPG. Otherwise, the CPG input obtained by exploring, which takes time, is used for controlling the robot, and the NN learns from this CPG input to correct its error. The combination of NN and exploring realizes autonomous learning, and each covers the other's shortcomings.
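The control flow just described can be summarized in a short sketch; all names below (nn_predict, evaluate, explore, and so on) are illustrative placeholders rather than the authors' implementation.

```python
def virtual_learning_step(nn_predict, nn_train, evaluate, explore, env_data):
    """One cycle of the virtual learning system (illustrative sketch).

    nn_predict(env_data) -> cpg_input : fast proposal by the trained NN
    evaluate(env_data, cpg_input) -> bool : check in the virtual environment
    explore(env_data) -> cpg_input : slow trial-and-error search (e.g., by GA)
    nn_train(env_data, cpg_input) : correct the NN with the explored solution
    """
    cpg_input = nn_predict(env_data)
    if not evaluate(env_data, cpg_input):
        cpg_input = explore(env_data)      # time-consuming exploration in the virtual environment
        nn_train(env_data, cpg_input)      # NN learns to correct its error
    return cpg_input                       # sent to the robot's CPG
```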
3 Neural Networks for Incremental Learning
A key component of our virtual learning system is the NN. A virtual learning system should develop with experience in various environments and tasks. However, a multilayer perceptron (MLP) with the back-propagation (BP) learning algorithm, which is widely used in many applications, has the problem that it forgets prior learning when trained with a new dataset, because its connection weights are overwritten. If all training datasets are stored in memory, prior learning can be reflected in an MLP by training with all of them, but training with such a large volume of data takes a huge amount of time and memory. Of course, humans can learn something new while keeping old experiences. A virtual learning system needs to accumulate virtual experiences step by step.
3.1 Perturbational Neural Networks
The basic idea of the PNN is that the NN learns a new dataset by adding new parameters to the constant weights and thresholds obtained after training, without overwriting them. A PNN consists of a main NN and a sub NN, as shown in Fig. 2. The main NN learns with representative datasets and is constant after this learning. The sub NN learns with new datasets and generates new terms in the weights and thresholds of the main
Fig. 2. A perturbational neural network
Fig. 3. A neuron in main NN
NN, i.e., Δw and Δh. A neuron of the PNN is shown in Fig. 3. The input-output relation of a conventional neuron is given by

z = f( ∑_i wi xi − h ),   (1)

where z denotes the output, wi the weights, xi the inputs, and f(·) the activation function. A neuron of the PNN is instead given by

z = f( ∑_i (wi + Δwi) xi − (h + Δh) ),   (2)
where Δwi and Δh are outputs of the sub NN. Training of the PNN is divided into two phases. First, the main NN learns with a representative dataset. For instance, assume that the representative dataset consists of environmental data (EA, EB, EC) and CPG parameters
Fig. 4. Training of NN with representative dataset
Fig. 5. Training of NN with new dataset
(PA, PB, PC), as shown in Fig. 4. Training of the main NN is the same as that of a NN with BP. At this time, the sub NN learns so that Δw and Δh equal zero; thus, for the representative dataset, the sub NN has no effect on the main NN. Next, the PNN learns with a new dataset. For instance, assume that the new dataset consists of environmental data ED and CPG parameters PD, as shown in Fig. 5. The main NN does not learn with the new dataset, but reinforcement signals are passed to the sub NN through the main NN. An output error of the main NN for ED exists because ED is unknown to the main NN. The sub NN learns with the new dataset so that this error is canceled.
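A minimal numerical sketch of a perturbational layer following Eq. (2) is given below; the array sizes and the fact that ΔW and Δh are plain arrays (rather than outputs of an actual sub NN) are illustrative simplifications.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pnn_layer(x, W, h, dW, dh):
    """Perturbational layer, Eq. (2): frozen main weights W and thresholds h,
    plus sub-NN outputs dW, dh added as perturbations."""
    return sigmoid((W + dW) @ x - (h + dh))

rng = np.random.default_rng(0)
x = rng.normal(size=2)                                   # environmental data (illustrative size)
W, h = rng.normal(size=(30, 2)), rng.normal(size=30)     # main NN, fixed after initial training
dW, dh = np.zeros((30, 2)), np.zeros(30)                 # sub-NN outputs; zero for representative data

y0 = pnn_layer(x, W, h, dW, dh)        # behaves exactly like the main NN
dW += 0.01 * rng.normal(size=(30, 2))  # a new dataset only slightly perturbs the mapping
y1 = pnn_layer(x, W, h, dW, dh)
```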
3.2 Related Work
Some authors [6,7,8] have proposed other methods in which the Mixture of Experts (MoE) architecture is applied to incremental learning of neural networks. The MoE architecture is based on a divide-and-conquer approach: a complex mapping that a single neural network would have to learn is divided into simple mappings that individual neural networks can learn easily. A PNN, in contrast, learns a complex mapping by connecting sub NNs to a main NN. A PNN is expected to have global generalization capability because it does not divide the mapping into local mappings. In the MoE architecture, each expert neural network is expected to have local generalization capability, but the system as a whole may not have global generalization capability because this is not addressed. Therefore, when generalization capability is the focus, a PNN should be used.
A PNN has the problem that it needs a large number of connections from the sub NNs to the main NN, which makes it very inefficient in resources. Although we have not yet studied the efficiency of a PNN, we expect that there is room to reduce the number of connections; this is left as future work.
4 Numerical Experiments
4.1 Setup
We evaluate the generalization capability of the proposed NN for incremental learning through numerical experiments with a robot climbing stairs. Simplified experiments are performed because we focus on the NN for virtual learning.
Fig. 6. A two-dimensional five-link biped robot model
Fig. 7. A CPG for a biped robot
A two-dimensional five-link model of a bipedal robot is used, as shown in Fig. 6. The five joint angles defined in Fig. 6 are controlled by five neural oscillators, one per joint angle. The length of each link is 100 cm. The task of the bipedal robot is to climb five stairs. The height and depth of the stairs are used for
Table 1. Dimensions of the representative stairs
Stairs  Height [cm]  Depth [cm]
A       30           60
B       10           70
C       30           100
D       10           90
Fig. 8. A PNN used for experiments
environmental data. The robot considers only kinematics in the virtual environment and ignores dynamics. The CPG shown in Fig. 7 is used for controlling the bipedal robot. We used Matsuoka's neural oscillator model [9] in the CPG (a minimal sketch of such an oscillator is given at the end of this subsection). In this study, 16 connection weights w and five constant external inputs u0 are defined as the CPG input for controlling the bipedal robot. The CPG input for climbing stairs is obtained by exploring the parameter space with a GA and is optimized for each environment. The internal parameters of the CPG are also given by the GA and are kept constant in all environments, because it would take too much time to explore all CPG parameters including the internal ones; the internal parameters are optimized for walking on a horizontal plane. The four pairs shown in Table 1 are defined as the representative data for NN training. The inputs and outputs of the NNs then number 2 and 21, respectively. The following targets are compared to evaluate their generalization capability:
– CPG whose parameters are optimized for each of the three representative environments, i.e., stairs A, B, and C
– MLP trained for stairs A, B, and C (MLP-I)
– MLP trained for stairs D after being trained for stairs A, B, and C (MLP-II)
– PNN trained in the same way as the above MLP
These targets are optimized and trained for one to four kinds of stairs. The generalization capability of each target is measured by the number of conditions that differ from the representative stairs and in which the biped robot can climb five
stairs successfully. Stairs’ height ranges from 4 cm to 46 cm and width from 40 cm to 110 cm. MLPs has 30 neurons in a hidden layer. Initial weights of MLPs are given by uniform random numbers ranging between ±0.3. PNN used in this paper has two sub NNs as shown in Fig.8. Main NN in PNN is the same as the above MLP. Sub NN for hidden layer has 100 neurons and 90 outputs. Sub NN for output layer has 600 neurons and 561 outputs. All initial weights in PNN are given in the same way as the MLPs. One learning cycle is defined as stairs A, B, and C are given to MLP or PNN sequentially. MLP-I is trained for the three kinds of stairs until 10000 learning cycles where sum of squared error is much less than 10−7 . MLP-II is trained for stairs D until 1000 learning cycles after trained for the three kinds of stairs. As mentioned below, although the number of learning cycles for stairs D is small, MLP-II forgets stair A, B, and C. Training condition for Main NN in PNN is the same as MLP-I. Incremental learning of PNN for stairs D means that the two sub NNs are trained for stairs D while the main NN is constant. These sub NNs are trained until 700 learning cycles. Learning rate of NNs is optimized through preliminary experiments because their performance heavily depends on it. We used learning rate minimizing sum of squared error under a certain number of learning cycles for each of NNs. Learning rate used in MLP-I and sub NN of PNN is 0.30 and 0.12 respectively. 4.2
4.2 Experimental Results
In Fig. 9, the mark x denotes a condition in which the robot successfully climbs the five stairs, and a circle denotes a condition given as representative data. Fig. 9 (a) to (c) show that successful conditions spread around the representative stairs to some degree. This means that the CPG has a certain generalization capability by itself, as already known. However, the generalization capability of the CPG alone is not large enough to cover stairs A, B, and C simultaneously. On the other hand, Fig. 9 (d) shows that MLP-I covers conditions intermediate between stairs A, B, and C; the effect of virtual learning with NNs is clear. Fig. 9 (e) and (f) concern incremental learning. Fig. 9 (e) shows that the PNN is successful in incremental learning. Moreover, focusing on conditions near stairs C, the generalization capability of the PNN is larger than that of MLP-I. Fig. 9 (f) shows that MLP-II forgot the preceding learning on stairs A, B, and C and fails in incremental learning. It is known that incremental learning of an MLP with BP fails because the connection weights are overwritten by training with a new dataset. In the PNN, the main NN is constant after training with the initial dataset, so the main NN does not forget the initial dataset; incremental learning is realized by the sub NNs. The open issues for the sub NNs are the effects of adjusting the connection weights, i.e., whether incremental learning succeeds and whether performance on the initial dataset decreases. From the experimental results, we showed that the PNN realized incremental learning and increased its generalization capability through incremental learning.
Fig. 9. A comparison of the generalization capability of CPG, MLP, and PNN in stair climbing, plotted as stair height [cm] versus depth [cm]: (a) CPG without NN for data A; (b) CPG without NN for data B; (c) CPG without NN for data C; (d) trained NN with data A, B, and C; (e) trained proposed NN with data D after the three data; (f) trained conventional NN with data D after the three data.
5 Conclusions
We proposed a new type of NN for incremental learning in a virtual learning system. The proposed NN features adjusting weights and thresholds externally so as to slightly change the mapping of a trained NN. This paper demonstrated numerical experiments with a two-dimensional five-link biped robot and a stair-climbing task. We showed that the PNN is successful in incremental learning and has generalization capability to some degree. This study was limited to verifying our approach. In future work, we will study the PNN with as much data as an actual robot needs and compare it with related work quantitatively.
References 1. Hirai, K., Hirose, M., Haikawa, Y., Takenaka, T.: The development of honda humanoid robot. In: Proc. IEEE ICRA, vol. 2, pp. 1321–1326 (1998) 2. Mahadevan, S., Connell, J.: Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence 55, 311–365 (1992) 3. Kuniyoshi, Y., Sangawa, S.: Early motor development from partially ordered neuralbody dynamics: experiments with a cortico-spinal-musculo-skeletal model. Biological Cybernetics 95, 589–605 (2006) 4. Yamashita, I., Yokoi, H.: Control of a biped robot by using several virtual environments. In: Proceedings of the 22nd Annual Conference of the Robotics Society of Japan 1K25 (in Japanese) (2004) 5. Kawano, T., Yamashita, I., Yokoi, H.: Control of the bipedal robot generating the target by the simulation in virtual space (in Japanese). IEICE Technical Report 104, 65–69 (2004) 6. Haruno, M., Wolpert, D.M., Kawato, M.: MOSAIC model for sensorimotor learning and control. Neural Computation 13, 2201–2220 (2001) 7. Schaal, S., Atkeson, C.G.: Constructive incremental learning from only local information. Neural Computation 10, 2047–2084 (1998) 8. Yamauchi, K., Hayami, J.: Incremental learning and model selection for radial basis function network through sleep. IEICE TRANSACTIONS on Information and Systems E90-D, 722–735 (2007) 9. Matsuoka, K.: Sustained oscillations generated by mutually inhibiting neurons with adaptation. Biological Cybernetics 52, 367–376 (1985)
Bifurcations of Renormalization Dynamics in Self-organizing Neural Networks
Peter Tiňo
University of Birmingham, Birmingham, UK
[email protected] Abstract. Self-organizing neural networks (SONN) driven by softmax weight renormalization are capable of finding high quality solutions of difficult assignment optimization problems. The renormalization is shaped by a temperature parameter - as the system cools down the assignment weights become increasingly crisp. It has been reported that SONN search process can exhibit complex adaptation patterns as the system cools down. Moreover, there exists a critical temperature setting at which SONN is capable of powerful intermittent search through a multitude of high quality solutions represented as meta-stable states. To shed light on such observed phenomena, we present a detailed bifurcation study of the renormalization process. As SONN cools down, new renormalization equilibria emerge in a strong structure leading to a complex skeleton of saddle type equilibria surrounding an unstable maximum entropy point, with decision enforcing “one-hot” stable equilibria. This, in synergy with the SONN input driving process, can lead to sensitivity to annealing schedules and adaptation dynamics exhibiting signatures of complex dynamical behavior. We also show that (as hypothesized in earlier studies) the intermittent search by SONN can occur only at temperatures close to the first (symmetry breaking) bifurcation temperature.
1 Introduction
For almost three decades there has been an energetic research activity on application of neural computation techniques in solving difficult combinatorial optimization problems. Self-organizing neural network (SONN) [1] constitutes an example of a successful neural-based methodology for solving 0-1 assignment problems. SONN has been successfully applied in a wide variety of applications, from assembly line sequencing to frequency assignment in mobile communications. As in most self-organizing systems, dynamics of SONN adaptation is driven by a synergy of cooperation and competition. In the competition phase, for each item to be assigned, the best candidate for the assignment is selected and the corresponding assignment weight is increased. In the cooperation phase, the assignment weights of other candidates that were likely to be selected, but were not quite as strong as the selected one, get increased as well, albeit to a lesser M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 405–414, 2008. c Springer-Verlag Berlin Heidelberg 2008
degree. The assignment weights need to be positive and sum to 1. Therefore, after each SONN adaptation phase, the assignment weights need to be renormalized back onto the standard simplex, e.g., via the softmax function [2]. When endowed with a physics-based Boltzmann distribution interpretation, the softmax function contains a temperature parameter T > 0. As the system cools down, the assignments become increasingly crisp. In the original setting SONN is annealed so that a single high quality solution to an assignment problem is found. Yet, renormalization onto the standard simplex is a double-edged sword. On the one hand, SONNs with assignment weight renormalization have empirically shown sensitivity to annealing schedules; on the other hand, the quality of solutions could be greatly improved [3]. Interestingly enough, it has been reported recently [4] that there exists a critical temperature T∗ at which SONN is capable of powerful intermittent search through a multitude of high quality solutions represented as meta-stable states of the SONN adaptation dynamics. It is hypothesised that the critical temperature may be closely related to the symmetry breaking bifurcation of equilibria in the autonomous softmax dynamics. At present there is still no theory regarding the dynamics of SONN adaptation driven by the softmax renormalization. Consequently, the processes of crystallising a solution in an annealed version of SONN, or of sampling the solution space in the intermittent search regime, are far from being understood. The first steps towards a theoretical underpinning of SONN adaptation driven by softmax renormalization were taken in [5,4,6]. For example, in [5] SONN is treated as a dynamical system with a bifurcation parameter T; the cooperation phase was not included in the model. The renormalization process was empirically shown to result in complicated bifurcation patterns revealing a complex nature of the search process inside SONN as the system gets annealed. More recently, Kwok and Smith [4] suggested studying SONN adaptation dynamics by concentrating on the autonomous renormalization process, since it is this process that underpins the search dynamics in the SONN. In [6] we initiated a rigorous study of equilibria of the autonomous renormalization process. Based on dynamical properties of the autonomous renormalization, we found analytical approximations to the critical temperature T∗ as a function of SONN size. In this paper we complement [6] by reporting a detailed bifurcation study of the renormalization process and give a precise characterization and the stability types of equilibria as they emerge during the annealing process. An interesting and intricate equilibria structure emerges as the system cools down, explaining the empirically observed complexity of SONN adaptation during intermediate stages of the annealing process. The analysis also clarifies why the intermittent search by SONN occurs near the first (symmetry breaking) bifurcation temperature of the renormalization step, as was experimentally verified in [4,6]. Due to space limitations, we cannot fully prove the statements presented in this study. Detailed proofs can be found in [7] and will be published elsewhere.
2 Self-organizing Neural Network and Iterative Softmax
First, we briefly introduce the Self-Organizing Neural Network (SONN) endowed with weight renormalization for solving assignment optimization problems (see e.g. [4]). Consider a finite set of input elements (neurons) i ∈ I = {1, 2, ..., M} that need to be assigned to outputs (output neurons) j ∈ J = {1, 2, ..., N}, so that some global cost of an assignment A : I → J is minimized. The partial cost of assigning i ∈ I to j ∈ J is denoted by V(i, j). The "strength" of assigning i to j is represented by the "assignment weight" wi,j ∈ (0, 1). The SONN algorithm can be summarized as follows. The connection weights wi,j, i ∈ I, j ∈ J, are first initialized to small random values. Then, repeatedly, an output item jc ∈ J is chosen and the partial costs V(i, jc) incurred by assigning all possible input elements i ∈ I to jc are calculated in order to select the "winner" input element (neuron) i(jc) ∈ I that minimizes V(i, jc). The "neighborhood" BL(i(jc)) of size L of the winner node i(jc) consists of the L nodes i ≠ i(jc) that yield the smallest partial costs V(i, jc). Weights from nodes i ∈ BL(i(jc)) to jc get strengthened:

wi,jc ← wi,jc + η(i)(1 − wi,jc),   i ∈ BL(i(jc)),   (1)
where η(i) is proportional to the quality of the assignment i → jc, as measured by V(i, jc). The weights1 wi = (wi,1, wi,2, ..., wi,N)ᵀ of each input node i ∈ I are then renormalized using the softmax

wi,j ← exp(wi,j/T) / ∑_{k=1}^{N} exp(wi,k/T).   (2)
We will refer to a SONN for solving an (M, N)-assignment problem as an (M, N)-SONN. As mentioned earlier, following [4,6] we strive to understand the search dynamics inside SONN by analyzing the autonomous dynamics of the renormalization update step (2) of the SONN algorithm. The weight vector wi of each of the M neurons in an (M, N)-SONN lives in the standard (N − 1)-simplex,

SN−1 = { w = (w1, w2, ..., wN) ∈ RN | wi ≥ 0, i = 1, 2, ..., N, and ∑_{i=1}^{N} wi = 1 }.
Given a value of the temperature parameter T > 0, the softmax renormalization step in SONN adaptation transforms the weight vector of each unit as follows:

w → F(w; T) = (F1(w; T), F2(w; T), ..., FN(w; T))ᵀ,   (3)

where

Fi(w; T) = exp(wi/T) / Z(w; T),   i = 1, 2, ..., N,   (4)

1 Here ᵀ denotes the transpose operator.
and Z(w; T) = ∑_{k=1}^{N} exp(wk/T) is the normalization factor. Formally, F maps RN to the interior of SN−1. Linearization of F around a point w in the interior of SN−1 is given by the Jacobian J(w; T):

J(w; T)i,j = (1/T) [ δi,j Fi(w; T) − Fi(w; T) Fj(w; T) ],   i, j = 1, 2, ..., N,   (5)
where δi,j = 1 iff i = j and δi,j = 0 otherwise. The softmax map F induces on the interior of SN−1 a discrete-time dynamics known as Iterative Softmax (ISM):

w(t + 1) = F(w(t); T).   (6)
The renormalization step in an (M, N )-SONN adaptation involves M separate renormalizations of weight vectors of all of the M SONN units. For each temperature setting T , the structure of equilibria in the i-th system, wi (t + 1) = F(wi (t); T ), gets copied in all the other M − 1 systems. Using this symmetry, it is sufficient to concentrate on a single ISM (6). Note that the weights of different units are coupled by the SONN adaptation step (1). We will study systems for N ≥ 2.
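The autonomous ISM dynamics (3)-(6) is easy to experiment with numerically; the following sketch (not the author's code) iterates the map from a random point of the simplex at a high and at a low temperature.

```python
import numpy as np

def ism_step(w, T):
    """One Iterative Softmax update, Eq. (6): w(t+1) = F(w(t); T)."""
    e = np.exp(w / T)
    return e / e.sum()

def iterate(w0, T, steps=500):
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = ism_step(w, T)
    return w

N = 10
rng = np.random.default_rng(1)
w0 = rng.dirichlet(np.ones(N))     # random point in the interior of the simplex
print(iterate(w0, T=0.6))          # high temperature: converges to the maximum entropy point 1/N
print(iterate(w0, T=0.05))         # low temperature: approaches a "one-hot"-like stable equilibrium
```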
3 Equilibria of SONN Renormalization Step
We first introduce basic concepts and notation that will be used throughout the paper. An (r − 1)-simplex is the convex hull of a set of r affinely independent points in Rm, m ≥ r − 1. A special case is the standard (N − 1)-simplex SN−1. The convex hull of any nonempty subset of n vertices of an (r − 1)-simplex Δ, n ≤ r, is called an (n − 1)-face of Δ. There are r!/(n!(r − n)!) distinct (n − 1)-faces of Δ and each (n − 1)-face is an (n − 1)-simplex. Given a set of n vertices w1, w2, ..., wn ∈ Rm defining an (n − 1)-simplex Δ in Rm, the central point

w(Δ) = (1/n) ∑_{i=1}^{n} wi   (7)
is called the maximum entropy point of Δ. We will denote the set of all (n − 1)-faces of the standard (N − 1)-simplex SN−1 by PN,n. The set of their maximum entropy points is denoted by QN,n, i.e.

QN,n = { w(Δ) | Δ ∈ PN,n }.   (8)
The n-dimensional column vectors of 1's and 0's are denoted by 1n and 0n, respectively. Note that wN,n = (1/n)(1n, 0N−n) ∈ QN,n. In addition, all the other elements of QN,n can be obtained by simply permuting coordinates of wN,n. Due to this symmetry, we will be able to develop most of the material using wN,n only and then transfer the results to permutations of wN,n. The maximum entropy point wN,N = (1/N) 1N of the standard (N − 1)-simplex SN−1 will be denoted
simply by w. To simplify the notation we will use w to denote both the maximum entropy point of SN−1 and the vector w − 0N. We showed in [6] that w is a fixed point of ISM (6) for any temperature setting T, and that all the other fixed points w = (w1, w2, ..., wN) of ISM have exactly two different coordinate values, wi ∈ {γ1, γ2}, such that 1/N < γ1 < 1/N1 and 0 < γ2 < 1/N, where N1 is the number of coordinates γ1 larger than 1/N. Since w lies in the interior of SN−1, we have

γ2 = (1 − N1 γ1) / (N − N1).   (9)

The number of coordinates γ2 smaller than 1/N is denoted by N2; obviously, N2 = N − N1. If w = (γ1 1N1, γ2 1N2) is a fixed point of ISM (6), so are all its N!/(N1!(N − N1)!) distinct permutations. We collect w and its permutations in the set

EN,N1(γ1) = { v | v is a permutation of (γ1 1N1, ((1 − N1 γ1)/(N − N1)) 1N−N1) }.   (10)

The fixed points in EN,N1(γ1) exist if and only if the temperature parameter T is set to [6]
TN,N1(γ1) = (N γ1 − 1) / [ −(N − N1) · ln( 1 − (N γ1 − 1)/((N − N1) γ1) ) ].   (11)
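Equation (11) can be evaluated directly; the sketch below computes the existence temperature over the admissible range of γ1 and estimates the bifurcation temperature TE(N, N1) as its maximum (cf. Fig. 2). The grid search is only an illustration of the definition, not the derivation used in the paper.

```python
import numpy as np

def T_existence(N, N1, gamma1):
    """Eq. (11): temperature at which fixed points in E_{N,N1}(gamma1) exist."""
    ratio = (N * gamma1 - 1.0) / ((N - N1) * gamma1)
    return (N * gamma1 - 1.0) / (-(N - N1) * np.log(1.0 - ratio))

def T_E(N, N1, samples=100000):
    """Bifurcation temperature TE(N, N1): maximum of T_{N,N1} over gamma1 in (1/N, 1/N1)."""
    g = np.linspace(1.0 / N + 1e-9, 1.0 / N1 - 1e-9, samples)
    return T_existence(N, N1, g).max()

for N1 in (1, 2, 3, 4):
    print(N1, T_E(10, N1))   # the branches of Fig. 2 appear below these temperatures
```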
We will show that as the system cools down, an increasing number of equilibria emerge in a strong structure. Let w, v ∈ SN−1 be two points on the standard simplex. The line from w to v is parametrized as

(τ; w, v) = w + τ · (v − w),   τ ∈ [0, 1].   (12)
Theorem 1. All equilibria of ISM (6) lie on lines connecting the maximum entropy point w of SN−1 with the maximum entropy points of its faces. In particular, for 0 < N1 < N and γ1 ∈ (1/N, 1/N1), all fixed points from EN,N1(γ1) lie on the lines connecting w with the points of QN,N1.

Sketch of the Proof: Consider the maximum entropy point wN,N1 = (1/N1)(1N1, 0N2) of an (N1 − 1)-face of SN−1. Then w(γ1) = (γ1 1N1, γ2 1N2) lies on the line (τ; w, wN,N1) for the parameter setting τ = 1 − N γ2. Q.E.D.

The result is illustrated in Figure 1. As the (M, N)-SONN cools down, the ISM equilibria emerge on lines connecting w with the maximum entropy points of faces of SN−1 of increasing dimensionality. Moreover, on each such line there can be at most two ISM equilibria.

Theorem 2. For N1 < N/2, there exists a temperature TE(N, N1) > 1/N such that for T ∈ (0, TE(N, N1)], ISM fixed points in EN,N1(γ1) exist for some γ1 ∈
Fig. 1. Positions of equilibria of SONN renormalization illustrated for the case of 4-dimensional weight vectors w. ISM is operating on the standard 3-simplex S3 and its equilibria can only be found on the lines connecting the maximum entropy point w (filled circle) with maximum entropy points of its faces. Triangles, squares and diamonds represent maximum entropy points of 0-faces (vertices), 1-faces (edges) and 2-faces (facets), respectively.
(1/N, 1/N1), and no ISM fixed points in EN,N1(γ1), for any γ1 ∈ (1/N, 1/N1), can exist at temperatures T > TE(N, N1). For each temperature T ∈ (1/N, TE(N, N1)), there are two coordinate values γ1−(T) and γ1+(T), 1/N < γ1−(T) < γ1+(T) < 1/N1, such that ISM fixed points in both EN,N1(γ1−(T)) and EN,N1(γ1+(T)) exist at temperature T. Furthermore, as the temperature decreases, γ1−(T) decreases towards 1/N, while γ1+(T) increases towards 1/N1. For temperatures T ∈ (0, 1/N], there is exactly one γ1(T) ∈ (1/N, 1/N1) such that ISM equilibria in EN,N1(γ1(T)) exist at temperature T.

Sketch of the Proof: The temperature function TN,N1(γ1) of (11) is concave and can be continuously extended to [1/N, 1/N1) with TN,N1(1/N) = 1/N and lim_{γ1→1/N1} TN,N1(γ1) = 0 < 1/N. The slope of TN,N1(γ1) at 1/N is positive for N1 < N/2. Q.E.D.

Theorem 3. The bifurcation temperature TE(N, N1) is decreasing with the increasing number N1 of equilibrium coordinates larger than 1/N.

Sketch of the Proof: It can be shown that for any feasible value of γ1 > 1/N, if there are two fixed points w ∈ EN,N1(γ1) and w′ ∈ EN,N1′(γ1) of ISM such that N1 < N1′, then w exists at a higher temperature than w′ does. For a given N1 < N/2, the bifurcation temperature TE(N, N1) corresponds to the maximum of TN,N1(γ1) on γ1 ∈ (1/N, 1/N1). It follows that N1 < N1′ implies TE(N, N1) > TE(N, N1′). Q.E.D.
Theorem 4. If N/2 ≤ N1 < N, then for each temperature T ∈ (0, 1/N) there is exactly one coordinate value γ1(T) ∈ (1/N, 1/N1) such that ISM fixed points in EN,N1(γ1(T)) exist at temperature T. No ISM fixed points in EN,N1(γ1), for any γ1 ∈ (1/N, 1/N1), can exist for temperatures T > 1/N. As the temperature decreases, γ1(T) increases towards 1/N1.

Sketch of the Proof: Similar to the proof of Theorem 2, but this time the slope of TN,N1(γ1) at 1/N is not positive. Q.E.D.

Let us now summarize the process of creation of new ISM equilibria as the (M, N)-SONN cools down. For temperatures T > 1/2, the ISM has exactly one equilibrium, the maximum entropy point w of SN−1 [6]. As the temperature is lowered and hits the first bifurcation point, TE(N, 1), new fixed points of ISM emerge, one on each of the lines connecting w with the vertices of SN−1 (the points of QN,1). As the temperature decreases further, on each line the single fixed point splits into two fixed points: one moves towards w, the other moves towards the corresponding vertex in QN,1. When the temperature reaches the second bifurcation point, TE(N, 2), new fixed points of ISM emerge, one on each of the lines connecting w with the maximum entropy points (midpoints) of the edges of SN−1 (the points of QN,2). Again, as the temperature continues decreasing, on each line the single fixed point splits into two: one moves towards w, the other moves towards the corresponding midpoint in QN,2. The process continues until the last bifurcation temperature TE(N, N1) is reached, where N1 is the largest natural number smaller than N/2. At TE(N, N1), new fixed points of ISM emerge on the lines connecting w with the maximum entropy points of the (N1 − 1)-faces of SN−1 (the points of QN,N1), and as the temperature continues decreasing each such fixed point again splits into two, one moving towards w and the other towards the corresponding maximum entropy point in QN,N1. At temperatures below 1/N, only the fixed points moving towards the maximum entropy points of faces of SN−1 exist. In the low temperature regime, 0 < T < 1/N, a fixed point occurs on every line connecting w with a point of QN,N1, N1 = ⌈N/2⌉, ⌈N/2⌉ + 1, ..., N − 1; here ⌈x⌉ denotes the smallest integer y such that y ≥ x. As the temperature decreases, these fixed points move towards the corresponding maximum entropy points of the (N1 − 1)-faces of SN−1. The process of creation of new fixed points and their flow as the temperature cools down is demonstrated in Figure 2 for an ISM operating on the 9-simplex S9 (N = 10): against each temperature setting T we plot the values of the larger coordinate γ1 > 1/N = 0.1 of the fixed points existing at T. The behavior of ISM in the neighborhood of an equilibrium w is given by the structure of stable and unstable manifolds of the linearized system at w, outlined in the next section.
Fig. 2. Demonstration of the process of creation of new ISM fixed points and their flow as the system temperature cools down. Here N = 10, i.e. the ISM operates on the standard 9-simplex S9. Against each temperature setting T, the values of the larger coordinate γ1 > 1/N = 0.1 of the fixed points existing at T are plotted (one branch for each of N1 = 1, 2, 3, 4, appearing at the bifurcation temperatures TE(10, N1)). The horizontal bold line corresponds to the maximum entropy point w = (1/10) 110.
4 Stability Analysis of Renormalization Equilibria
The maximum entropy point w is not only a fixed point of ISM (6); regarded as a vector w − 0N, it is also an eigenvector of the Jacobian J(w; T) at any point w in the interior of SN−1, with eigenvalue λ = 0. This simply reflects the fact that ISM renormalization acts on the standard simplex SN−1, which is a subset of an (N − 1)-dimensional hyperplane with normal vector 1N. We have already seen that w plays a special role in the ISM equilibria structure: all equilibria lie on lines going from w towards maximum entropy points of faces of SN−1. The lines themselves are of special interest, since we will show that these lines are invariant manifolds of the ISM renormalization and their directional vectors are eigenvectors of ISM Jacobians at the fixed points located on them.

Theorem 5. Consider ISM (6) and 1 ≤ N1 < N. Then, for each maximum entropy point of an (N1 − 1)-face of SN−1 (i.e., each point of QN,N1), the line segment with τ ∈ [0, 1) connecting the maximum entropy point w with that point is an invariant set under the ISM dynamics.

Sketch of the Proof: The result follows from plugging the parametrization (12) into (3) and realizing (after some manipulation) that for each τ ∈ [0, 1) there exists a parameter setting τ1 ∈ [0, 1) such that the image of the point with parameter τ is the point with parameter τ1. Q.E.D.
The proofs of the next two theorems are rather involved and we refer the interested reader to [7].

Theorem 6. Let w ∈ EN,N1(γ1) be a fixed point of ISM (6). Then w∗ = w − w̄, where w̄ denotes the maximum entropy point of SN−1, is an eigenvector of the Jacobian J(w; TN,N1(γ1)) with the corresponding eigenvalue λ∗, where
1. if N/2 ≤ N1 ≤ N − 1, then 0 < λ∗ < 1;
2. if 1 ≤ N1 < N/2 and 1/N < γ1 < 1/(2N1), then λ∗ > 1;
3. if 1 ≤ N1 < N/2, then there exists γ̄1 ∈ (1/(2N1), 1/N1) such that for all ISM fixed points w ∈ EN,N1(γ1) with γ1 ∈ (γ̄1, 1/N1), 0 < λ∗ < 1.

We have established that for an ISM equilibrium w, both w̄ and w∗ = w − w̄ are eigenvectors of the ISM Jacobian at w. The stability types of the remaining N − 2 eigendirections are characterized in the next theorem.

Theorem 7. Consider an ISM fixed point w ∈ EN,N1(γ1) for some 1 ≤ N1 < N and 1/N < γ1 < 1/N1. Then there are N − N1 − 1 and N1 − 1 eigenvectors of the Jacobian J(w; TN,N1(γ1)) of ISM at w with the same associated eigenvalue 0 < λ− < 1 and λ+ > 1, respectively.
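The eigenvalue structure asserted in Theorems 6 and 7 can be verified numerically for any particular fixed point; the sketch below constructs a fixed point from Eqs. (9) and (11) and computes the spectrum of the Jacobian (5).

```python
import numpy as np

def fixed_point(N, N1, gamma1):
    gamma2 = (1.0 - N1 * gamma1) / (N - N1)            # Eq. (9)
    return np.concatenate([np.full(N1, gamma1), np.full(N - N1, gamma2)])

def temperature(N, N1, gamma1):                         # Eq. (11)
    ratio = (N * gamma1 - 1.0) / ((N - N1) * gamma1)
    return (N * gamma1 - 1.0) / (-(N - N1) * np.log(1.0 - ratio))

def jacobian(w, T):                                      # Eq. (5)
    F = np.exp(w / T); F /= F.sum()
    return (np.diag(F) - np.outer(F, F)) / T

N, N1, g1 = 10, 2, 0.35
w = fixed_point(N, N1, g1)
T = temperature(N, N1, g1)
eig = np.sort(np.linalg.eigvalsh(jacobian(w, T)))
print(eig)   # 0 (simplex-normal direction), lambda- in (0,1) with multiplicity N-N1-1,
             # lambda+ > 1 with multiplicity N1-1, and lambda* as in Theorem 6
```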
5 Discussion – SONN Adaptation Dynamics
In the intermittent search regime by SONN [4], the search is driven by pulling promising solutions temporarily to the vicinity of the 0-1 “one-hot” assignment values - vertices of SN −1 (0-dimensional faces of the standard simplex SN −1 ). The critical temperature for intermittent search should correspond to the case where the attractive forces already exist in the form of attractive equilibria near the “one-hot” assignment suggestions (vertices of SN −1 ), but the convergence rates towards such equilibria should be sufficiently weak so that the intermittent character of the search is not destroyed. This occurs at temperatures lower than, but close to the first bifurcation temperature TE (N, 1) (for more details, see [7]). In [4] it is hypothesised that there is a strong link between the critical temperature for intermittent search by SONN and bifurcation temperatures of the autonomous ISM. In [6] we hypothesised (in accordance with [4]) that even though there are many potential ISM equilibria, the critical bifurcation points are related only to equilibria near the vertices of SN −1 , as only those could be guaranteed by the theory of [6] (stability bounds) to be stable, even though the theory did not prevent the other equilibria from being stable. In this study, we have rigorously shown that the stable equilibria can in fact exist only near the vertices of SN −1 , on the lines connecting w with the vertices. Only when N1 = 1, there are no expansive eigendirections of the local Jacobian with λ+ > 1. As the SONN system cools down, more and more ISM equilibria emerge on the lines connecting the maximum entropy point w of the standard simplex SN −1 with the maximum entropy points of its faces of increasing dimensionality. With decreasing temperature, the dimensionality of stable and unstable manifolds
of linearized ISM at emerging equilibria decreases and increases, respectively. At lower temperatures, this creates a peculiar pattern of saddle type equilibria surrounding the unstable maximum entropy point w, with decision enforcing “one-hot” stable equilibria located near vertices of SN −1 . Trajectory towards the solution as the SONN system anneals is shaped by the complex skeleton of saddle type equilibria with stable/unstable manifolds of varying dimensionalities and can therefore, in synergy with the input driving process, exhibit signatures of a very complex dynamical behavior, as reported e.g. in [5]. Once the temperature is sufficiently low, the attraction rates of stable equilibria near the vertices of SN −1 are so high that the found solution is virtually pinned down by the system. Even though the present study clarifies the prominent role of the first (symmetry breaking) bifurcation temperature TE (N, 1) in obtaining the SONN intermittent search regime and helps to understand the origin of complex SONN adaptation patterns in the annealing regime, many interesting open questions remain. For example, no theory as yet exists of the role of abstract neighborhood BL (i(jc )) of the winner node i(jc ) in the cooperative phase of SONN adaptation. We conclude by noting that it may be possible to apply the theory of ISM in other assignment optimization systems that incorporate the softmax assignment weight renormalization e.g. [8,9].
References 1. Smith, K., Palaniswami, M., Krishnamoorthy, M.: Neural techniques for combinatorial optimization with applications. IEEE Transactions on Neural Networks 9, 1301–1318 (1998) 2. Guerrero, F., Lozano, S., Smith, K., Canca, D., Kwok, T.: Manufacturing cell formation using a new self-organizing neural network. Computers & Industrial Engineering 42, 377–382 (2002) 3. Kwok, T., Smith, K.: Improving the optimisation properties of a self-organising neural network with weight normalisation. In: Proceedings of the ICSC Symposia on Intelligent Systems and Applications (ISA 2000), Paper No.1513-285 (2000) 4. Kwok, T., Smith, K.: Optimization via intermittency with a self-organizing neural network. Neural Computation 17, 2454–2481 (2005) 5. Kwok, T., Smith, K.: A noisy self-organizing neural network with bifurcation dynamics for combinatorial optimization. IEEE Transactions on Neural Networks 15, 84–88 (2004) 6. Tiˇ no, P.: Equilibria of iterative softmax and critical temperatures for intermittent search in self-organizing neural networks. Neural Computation 19, 1056–1081 (2007) 7. Tiˇ no, P.: Bifurcation structure of equilibria of adaptation dynamics in selforganizing neural networks. Technical Report CSRP-07-12, University of Birmingham, School of Computer Science (2007), http://www.cs.bham.ac.uk/∼ pxt/PAPERS/ism.bifurc.tr.pdf 8. Gold, S., Rangarajan, A.: Softmax to softassign: Neural network algorithms for combinatorial optimization. Journal of Artificial Neural Networks 2, 381–399 (1996) 9. Rangarajan, A.: Self-annealing and self-annihilation: unifying deterministic annealing and relaxation labeling. Pattern Recognition 33, 635–649 (2000)
Variable Selection for Multivariate Time Series Prediction with Neural Networks
Min Han and Ru Wei
School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116023, China
[email protected]
Abstract. This paper proposes a variable selection algorithm based on neural networks for multivariate time series prediction. A sensitivity analysis of the neural network error function with respect to the inputs is developed to quantify the saliency of each input variable. The input nodes with low sensitivity are then pruned along with their connections, which amounts to deleting the corresponding redundant variables. The proposed algorithm is tested on both computer-generated time series and practical observations. Experimental results show that the proposed algorithm outperforms other variable selection methods by achieving a more significant reduction in the training data size and higher prediction accuracy.
Keywords: Variable selection, neural network pruning, sensitivity, multivariate prediction.
1 Introduction
Nonlinear and chaotic time series prediction is a practical technique which can be used for studying the characteristics of complicated dynamics from measurements. Usually, multiple variables are required, since the output may depend not only on its own previous values but also on the past values of other variables. However, we cannot be sure that all of the variables are equally important; some of them may be redundant or even irrelevant. If these unnecessary input variables are included in the prediction model, the parameter estimation process will be more difficult, and the overall results may be poorer than if only the required inputs are used [1]. Variable selection is the problem of discarding the redundant variables, which reduces the number of input variables and the complexity of the prediction model. A number of variable selection methods based on statistical or heuristic tools have been proposed, such as Principal Component Analysis (PCA) and Discriminant Analysis. These techniques attempt to reduce the dimensionality of the data by creating new variables that are linear combinations of the original ones. The major difficulty comes from the separation of the variable selection process and the prediction process. Therefore, variable selection using neural networks is attractive, since one can globally adapt the variable selector together with the predictor. Variable selection with a neural network can be seen as a special case of architecture pruning [2], where the pruning of input nodes is equivalent to removing the corresponding
variables from the original data set. One approach to pruning is to estimate the sensitivity of the output to the exclusion of each unit. There are several ways to perform sensitivity analysis with neural networks. Most of them are weight-based [3], relying on the idea that weights connected to important variables attain large absolute values while weights connected to unimportant variables attain values near zero. However, smaller weights usually result in smaller inputs to neurons and larger sigmoid derivatives in general, which will increase the output sensitivity to the input. Mozer and Smolensky [4] introduced a method which estimates which units are least important and can be deleted during training. Gevrey et al. [5] compute the partial derivatives of the neural network output with respect to the input neurons and compare the performance of several different methods for evaluating the relative contribution of the input variables. This paper concentrates on a neural-network-based variable selection algorithm as the tool to determine which variables are to be discarded. A simple sensitivity criterion of the neural network error function with respect to each input is developed to quantify the saliency of each input variable. The input nodes are then arrayed in decreasing order of sensitivity, so that the neural network can be pruned efficiently by discarding the last items, which have low sensitivity (a schematic sketch of this ranking step is given at the end of this section). The variable selection algorithm is then applied to both computer-generated data and practical observations and is compared with the PCA variable reduction method. The rest of this paper is organized as follows. Section 2 reviews the basic concepts of multivariate time series prediction and a statistical variable selection method. Section 3 explains the sensitivity analysis with neural networks in detail. Section 4 presents two simulation results. The work is finally concluded in Section 5.
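To fix ideas, the following sketch ranks inputs by a finite-difference estimate of the loss sensitivity and would prune the lowest-ranked ones; the analytical sensitivity criterion actually used is developed in Section 3, so this stand-in is only illustrative.

```python
import numpy as np

def input_saliency(predict, X, y, eps=1e-3):
    """Rank input variables by the sensitivity of the squared-error loss to each input
    (finite-difference stand-in for the analytical criterion of Section 3)."""
    base = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, i] += eps
        scores[i] = abs(np.mean((predict(Xp) - y) ** 2) - base) / eps
    return np.argsort(scores)[::-1]   # most salient inputs first

# usage sketch: keep only the top-R columns of X, then retrain the predictor on the reduced inputs
```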
2 Modeling Multivariate Chaotic Time Series

The basic idea of chaotic time series analysis is that a complex system can be described by a strange attractor in its phase space. Therefore, the reconstruction of an equivalent state space is usually the first step in chaotic time series prediction.

2.1 Multivariate Phase Space Reconstruction

Phase space reconstruction from observations can be accomplished by choosing a suitable embedding dimension and time delay. Given an M-dimensional time series {X_i, i = 1, 2, …, M}, where X_i = [x_i(1), x_i(2), …, x_i(N)]^T and N is the length of each scalar time series, the reconstructed phase space can be formed, as in the univariate case (M = 1), as [6]:

X(t) = [x_1(t), x_1(t − τ_1), …, x_1(t − (d_1 − 1)τ_1), …, x_M(t), x_M(t − τ_M), …, x_M(t − (d_M − 1)τ_M)]   (1)

where t = L, L+1, …, N, L = max_{1≤i≤M}(d_i − 1)·τ_i + 1, and τ_i and d_i (i = 1, 2, …, M) are the time delays and embedding dimensions of each time series, respectively. The delay time τ_i can be calculated with the mutual information method and the embedding dimension d_i with the false nearest neighbor method.
According to Takens' embedding theorem, if D = Σ_{i=1}^{M} d_i is large enough, there exists a mapping F such that X(t+1) = F{X(t)}. The evolution X(t) → X(t+1) then reflects the evolution of the original dynamical system. The problem is to find an appropriate expression for the nonlinear mapping F. Up to the present, many chaotic time series prediction models have been developed. Neural networks have been widely used because of their universal approximation capabilities.

2.2 Neural Network Model
A multilayer perceptron (MLP) trained with the back-propagation (BP) algorithm is used as a nonlinear predictor for multivariate chaotic time series prediction. The MLP is trained in a supervised manner to minimize the mean square error between the computed output of the neural network and the desired output. The network usually consists of three layers: an input layer, one or more hidden layers and an output layer. Consider a three-layer MLP with one hidden layer. The D-dimensional delayed time series X(t) is used as the input of the network to generate the network output X(t+1). The neural network can then be expressed as follows:

o_j = f( Σ_{i=0}^{N_I} x_i w_{ij}^{(I)} )   (2)

y_k = Σ_{j=1}^{N_H} w_{jk}^{(O)} o_j   (3)

where [x_1, x_2, …, x_{N_I}] = X(t) denotes the input signal, N_I is the number of input signals to the neural network, w_{ij}^{(I)} is the weight from the ith input neuron to the jth hidden neuron, o_j is the output of the jth hidden neuron, N_H is the number of neurons in the hidden layer, [y_1, y_2, …, y_{N_O}] = X(t+1) is the output, N_O is the number of output neurons, and w_{jk}^{(O)} is the weight from the jth hidden neuron to the kth output neuron. The activation function f(·) is the sigmoid function given by
f(x) = 1 / (1 + exp(−x))   (4)
The error function of the network is usually defined as the sum of squared errors

E = Σ_{t=1}^{N} Σ_{k=1}^{N_O} [y_k(t) − p_k(t)]²   (5)

where p_k(t) is the desired output for unit k and N is the length of the training sample.
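For concreteness, the forward pass of Eqs. (2)–(4) and the error of Eq. (5) can be sketched as below. This is an illustrative NumPy sketch, not the authors' code; the array names (W_in, W_out) are ours, and the bias input x_0 of Eq. (2) is omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    # Eq. (4): logistic activation of the hidden layer
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W_in, W_out):
    """One forward pass of the three-layer MLP of Eqs. (2)-(3).

    x     : input vector of length N_I (the delayed time series X(t))
    W_in  : (N_I, N_H) input-to-hidden weights w_ij^(I)
    W_out : (N_H, N_O) hidden-to-output weights w_jk^(O)
    """
    o = sigmoid(x @ W_in)   # hidden activations o_j, Eq. (2)
    y = o @ W_out           # linear output y_k, Eq. (3)
    return y, o

def sse(Y, P):
    # Eq. (5): sum of squared errors over the whole training set
    return np.sum((Y - P) ** 2)
```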
2.3 Statistical Variable Selection Method
For multivariate time series, the dimension of the reconstructed phase space is usually very high, and the increase in the number of input variables leads to a highly complex prediction model. Therefore, in many practical applications, variable selection is needed to reduce the dimensionality of the input data. The aim of variable selection in this paper is to select a subset of R inputs that retains most of the important features of the original input set; the remaining D − R irrelevant inputs are discarded. Principal Component Analysis (PCA) is a traditional technique for variable selection [7]. PCA reduces the dimensionality by first decomposing the normalized input matrix X with the singular value decomposition (SVD)

X = U Σ V^T   (6)

where Σ = diag[s_1 s_2 … s_p 0 … 0], s_1 ≥ s_2 ≥ … ≥ s_p are the first p singular values of X arranged in decreasing order, and U and V are both orthogonal matrices. The first k singular values are preserved as the principal components, and the final input is obtained as

Z = Ũ^T X   (7)

where Ũ consists of the first k columns of U (the k leading left singular vectors). PCA is an efficient method to reduce the input dimension. However, we cannot be sure that the discarded factors have no influence on the prediction output, because the variable selection and prediction processes are carried out separately. A neural network selector is a good choice for combining the selection process and the prediction process.
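A minimal sketch of the SVD-based reduction of Eqs. (6)–(7) might look as follows; it is illustrative only, and the normalization of X and the choice of k are left to the caller.

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce the D-dimensional input matrix X (D x T) to k components.

    Follows Eqs. (6)-(7): X = U S V^T, then Z = U_k^T X,
    where U_k collects the k leading left singular vectors.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # singular values in decreasing order
    U_k = U[:, :k]            # k leading components
    Z = U_k.T @ X             # reduced input, Eq. (7)
    return Z, U_k, s
```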
3 Sensitivity Analysis with Neural Networks

Variable selection with neural networks can be achieved by pruning the input nodes of a neural network model based on some saliency measure, aiming to remove the less relevant variables. The significance of a variable can be defined as the error when the unit is removed minus the error when it is left in place:
S_i = E_{without unit i} − E_{with unit i} = E(x_i = 0) − E(x_i = x_i)   (8)

where E is the error function defined in Eq. (5). After the neural network has been trained, a brute-force pruning method for every input is to set that input to zero and evaluate the change in the error: if it increases too much, the input is restored, otherwise it is removed. In principle, this can be done by training the network under all possible subsets of the input set. However, this exhaustive search is computationally infeasible and can be very slow for large networks. This paper uses the same idea as Mozer and Smolensky [4] and approximates the sensitivity by introducing a gating term α_i for each unit such that

o_j = f( Σ_i w_{ij} α_i o_i )   (9)

where o_j is the activity of unit j and w_{ij} is the weight from unit i to unit j.
The gating terms α are shown in Fig. 1, where α_i^I, i = 1, 2, …, N_I is the gating term of the ith input neuron and α_j^H, j = 1, 2, …, N_H is the gating term of the jth hidden neuron.
Fig. 1. The gating term for each unit
The gating term α is merely a notational convenience rather than a parameter that must be implemented in the network. If α = 0, the unit has no influence on the network; if α = 1, the unit behaves normally. The importance of a unit is then approximated by the derivative

S_i = − ∂E/∂α_i |_{α_i = 1}   (10)

Using the standard error back-propagation algorithm, this derivative can be expressed in terms of the network weights as follows:

S_j^H = − ∂E/∂α_j^H = − (∂E/∂y_k)·(∂y_k/∂α_j^H) = Σ_{t=1}^{N} Σ_{k=1}^{N_O} [ (p_k(t) − y_k(t)) Σ_{j=1}^{N_H} w_{jk}^{(O)} o_j ]   (11)

S_i^I = − ∂E/∂α_i^I = − (∂E/∂y_k)·(∂y_k/∂α_i^I) = Σ_{t=1}^{N} Σ_{k=1}^{N_O} [ (p_k(t) − y_k(t)) Σ_{j=1}^{N_H} w_{jk}^{(O)} o_j (1 − o_j) w_{ij}^{(I)} x_i(t) ]   (12)
where S_i^I is the sensitivity of the ith input neuron and S_j^H is the sensitivity of the jth hidden neuron. Thus the algorithm can prune the input nodes as well as the hidden nodes according to their sensitivities during training. However, the sensitivity fluctuates strongly when it is calculated directly from Eq. (11) and Eq. (12), because of the approximation in Eq. (10); occasionally an input may be deleted incorrectly. In order to reduce the dimensionality of the input vectors reliably, the sensitivity needs to be evaluated over the entire training set. This paper considers several ways to define the overall sensitivity:

(1) The mean square average sensitivity:

S_{i,avg} = (1/T) Σ_{t=1}^{T} S_i(t)²   (13)

where T is the number of data in the training set.
(2) The absolute value average sensitivity:

S_{i,abs} = (1/T) Σ_{t=1}^{T} |S_i(t)|   (14)

(3) The maximum absolute sensitivity:

S_{i,max} = max_{1≤t≤T} |S_i(t)|   (15)
Any of the sensitivity measures in Eqs. (13)–(15) can provide a useful criterion to determine which inputs are to be deleted. For succinctness, this paper uses the mean square average sensitivity as an example. An input with a low sensitivity has little or no influence on the prediction accuracy and can therefore be removed. In order to obtain a more efficient criterion for pruning inputs, the sensitivity is normalized. Define the absolute sum of the sensitivities over all input nodes

S = Σ_{i=1}^{N_I} S_i   (16)

Then the normalized sensitivity of each unit is defined as

Ŝ_i = S_i / S   (17)

where the normalized value Ŝ_i lies in [0, 1]. The input variables are then arranged in decreasing sensitivity order:

Ŝ_1 ≥ Ŝ_2 ≥ … ≥ Ŝ_{N_I}   (18)

The larger values of Ŝ_i, i = 1, 2, …, N_I indicate the important variables. Define the sum of the first k terms of the sensitivity, η_k, as

η_k = Σ_{j=1}^{k} Ŝ_j   (19)

where k = 1, 2, …, N_I. Choosing a threshold value 0 < η_0 < 1, if η_k > η_0 the first k inputs are preserved as the principal components and the remaining inputs with low sensitivity are removed. The number of variables retained increases as the threshold η_0 increases.
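The pruning rule of Eqs. (16)–(19) amounts to normalizing the per-input sensitivities, sorting them, and keeping the smallest prefix whose cumulative sum exceeds η_0. A hedged sketch (function and variable names are ours):

```python
import numpy as np

def select_inputs(S, eta0=0.98):
    """Select input indices by the normalized-sensitivity criterion.

    S    : array of length N_I with the overall sensitivity S_i of each input
           (e.g. the mean square average sensitivity of Eq. (13))
    eta0 : threshold 0 < eta0 < 1 of Eq. (19)
    Returns the indices of the inputs to keep, in decreasing sensitivity order.
    """
    S_hat = np.abs(S) / np.sum(np.abs(S))     # Eqs. (16)-(17)
    order = np.argsort(S_hat)[::-1]           # decreasing order, Eq. (18)
    eta = np.cumsum(S_hat[order])             # partial sums eta_k, Eq. (19)
    k = int(np.searchsorted(eta, eta0)) + 1   # smallest k with eta_k > eta0
    return order[:k]
```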
4 Simulations

In this section, two simulations are carried out on both computer-generated data and practical observations to demonstrate the performance of the variable selection method proposed in this paper. The simulation results are then compared with the
PCA method. The prediction performance is evaluated by two error criteria [8]: the root mean square error E_RMSE and the prediction accuracy E_PA:

E_RMSE = ( (1/(N−1)) Σ_{t=1}^{N} [P(t) − O(t)]² )^{1/2}   (20)

E_PA = Σ_{t=1}^{N} [(P(t) − P_m)(O(t) − O_m)] / ((N−1) σ_P σ_O)   (21)

where O(t) is the target value, P(t) is the predicted value, O_m and σ_O are the mean value and standard deviation of O(t), and P_m and σ_P are the mean value and standard deviation of P(t), respectively. E_RMSE reflects the absolute deviation between the predicted and observed values, while E_PA is the correlation coefficient between the observed and predicted values. In the ideal situation, if there are no errors in prediction, E_RMSE = 0 and E_PA = 1.

4.1 Prediction of the Lorenz Time Series
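The two criteria of Eqs. (20)–(21) can be computed directly; the following is a small illustrative sketch (not the authors' code), using sample statistics with the (N−1) normalization stated above.

```python
import numpy as np

def rmse(P, O):
    # Eq. (20): root mean square error with the (N-1) normalization used here
    return np.sqrt(np.sum((P - O) ** 2) / (len(O) - 1))

def prediction_accuracy(P, O):
    # Eq. (21): correlation coefficient between predicted and observed values
    Pm, Om = P.mean(), O.mean()
    return np.sum((P - Pm) * (O - Om)) / ((len(O) - 1) * P.std(ddof=1) * O.std(ddof=1))
```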
The first data set is derived from the Lorenz system, given by three differential equations:

dx(t)/dt = a(−x(t) + y(t))
dy(t)/dt = b x(t) − y(t) − x(t) z(t)   (22)
dz(t)/dt = x(t) y(t) − c z(t)

where the typical values of the coefficients are a = 10, b = 28, c = 8/3 and the initial values are x(0) = 12, y(0) = 2, z(0) = 9. 1500 points of x(t), y(t) and z(t), obtained with the fourth-order Runge–Kutta method, are used as the training sample and 500 points as the testing sample. In order to extract the dynamics of this system and predict x(t+1), the parameters for the phase-space reconstruction are chosen as τ_x = τ_y = τ_z = 3 and m_x = m_y = m_z = 9. Thus an MLP neural network with 27 input nodes, one hidden layer of 20 neurons and one output node is considered, and the back-propagation training algorithm is used. After the training of the MLP is stopped, sensitivity analysis is carried out to evaluate the contribution of each input variable to the error function of the network. The trajectories of the sensitivity through training for each input are shown in Fig. 2. It can be seen that the sensitivity fluctuates through training and finally converges when the weights and error become steady. The normalized sensitivity measures of Eq. (17) are then calculated. A threshold η_0 = 0.98 is chosen to determine which inputs are discarded; the input dimension of the neural network is thereby reduced to 11. The original input matrix is replaced by the reduced input matrix and the structure of the neural network is simplified. The prediction performance over the testing samples with the reduced inputs is shown in Fig. 4.
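A rough reconstruction of this data-generation step is sketched below. The integration step size and the 0-based indexing of the embedding are our assumptions, and the coefficient naming follows Eq. (22); it is not the authors' code.

```python
import numpy as np

def lorenz_rk4(n_points, dt=0.01, a=10.0, b=28.0, c=8.0 / 3.0, x0=(12.0, 2.0, 9.0)):
    """Integrate Eq. (22) with the fourth-order Runge-Kutta method."""
    def f(s):
        x, y, z = s
        return np.array([a * (-x + y), b * x - y - x * z, x * y - c * z])

    s = np.array(x0, dtype=float)
    out = np.empty((n_points, 3))
    for i in range(n_points):
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        s = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        out[i] = s
    return out

def delay_embed(series, d=9, tau=3):
    """Build the reconstructed state vectors X(t) of Eq. (1) from an (N, M) array."""
    N, M = series.shape
    L = (d - 1) * tau
    t = np.arange(L, N)
    cols = [series[t - j * tau, m] for m in range(M) for j in range(d)]
    return np.column_stack(cols)   # shape (N - L, M * d), here (., 27) for M = 3, d = 9
```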
Fig. 2. The trajectories of the input sensitivity through training

Fig. 3. The normalized sensitivity for each input node
Fig. 4. The observed and predicted values of the Lorenz x(t) time series
The solid line in Fig. 4 represents the observed values and the dashed line the predicted values. It can be seen from Fig. 4 that the chaotic behavior of the x(t) time series is well predicted and the errors between the observed and predicted values are small. The prediction performance is summarized in Table 1 and compared with the PCA variable reduction method.

Table 1. Prediction performance of the x(t) time series

              With All Variables   PCA Selection   NN Selection
Input Nodes   27                   11              11
ERMSE         0.1278               0.1979          0.0630
EPA           0.9998               0.9997          1.0000
The prediction performances in Table 1 are comparable for the neural-network-based variable selection method and the PCA method, while the algorithm proposed in this paper obtains the best prediction accuracy.
4.2 Prediction of the Rainfall Time Series
Rainfall is an important variable in hydrological systems, and the chaotic characteristics of rainfall time series have been demonstrated in many papers [9]. In this section, the simulation is carried out on the monthly rainfall time series of the city of Dalian, China, over a period of 660 months (from 1951 to 2005). Since rainfall may be influenced by many factors, five other time series are also considered: temperature, air pressure, humidity, wind speed and sunlight. The method again follows Takens' theorem to reconstruct the embedding phase space, with dimensions and delay times m_1 = m_2 = m_3 = m_4 = m_5 = m_6 = 9 and τ_1 = τ_2 = τ_3 = τ_4 = τ_5 = τ_6 = 3. The input of the neural network then contains L = 660 − (9 − 1) × 3 = 636 data points. In the experiments, this data set is divided into a training set composed of the first 436 points and a testing set containing the remaining 200 points. The neural network used here contains 54 input nodes, 20 hidden nodes and 1 output node. The threshold is again chosen as η_0 = 0.98. The trajectories of the input sensitivity and the normalized sensitivity for each input are shown in Fig. 5 and Fig. 6, respectively. According to the sensitivity values, 34 input nodes are retained.
Fig. 5. The trajectories of the input sensitivity through training

Fig. 6. The normalized sensitivity for each input node
The observed and predicted values of the rainfall time series are shown in Fig. 7, which shows high prediction accuracy. It can be seen from the figure that the chaotic behavior of the rainfall time series is well predicted and the errors between the observed and predicted values are small. The corresponding values of ERMSE and EPA are shown in Table 2. Both the figures and the error evaluation criteria indicate that the result for the multivariate chaotic time series using the neural-network-based variable selection is much better than the results with all variables and with the PCA method. It can be concluded from the two simulations that the variable selection algorithm using neural networks is able to capture the dynamics of both computer-generated and practical time series accurately and gives high prediction accuracy.
Fig. 7. The observed and predicted values of the rainfall time series

Table 2. Prediction performance of the rainfall time series

              With All Variables   PCA Selection   NN Selection
Input Nodes   54                   43              31
ERMSE         22.2189              21.0756         18.1435
EPA           0.9217               0.9286          0.9529
5 Conclusions

This paper studies a variable selection algorithm that uses the sensitivity for pruning input nodes in a neural network model. A simple and effective criterion for identifying the input nodes to be removed is derived, which does not require a high computational cost and proves to work well in practice. The validity of the method was examined on a multivariate prediction problem, and a comparison was made with other variable selection methods. The experimental results encourage the application of the proposed method to complex tasks that need to identify significant input variables.

Acknowledgements. This research is supported by project 60674073 of the National Natural Science Foundation of China, project 2006CB403405 of the National Basic Research Program of China (973 Program) and project 2006BAB14B05 of the National Key Technology R&D Program of China. All of these supports are appreciated.
References

[1] Verikas, B.M.: Feature selection with neural networks. Pattern Recognition Letters 23, 1323–1335 (2002)
[2] Castellano, G., Fanelli, A.M.: Variable selection using neural network models. Neurocomputing 31, 1–13 (2000)
[3] Castellano, G., Fanelli, A.M., Pelillo, M.: An iterative method for pruning feed-forward neural networks. IEEE Trans. Neural Networks 8(3), 519–531 (1997)
[4] Mozer, M.C., Smolensky, P.: Skeletonization: a technique for trimming the fat from a network via a relevance assessment. NIPS 1, 107–115 (1989)
[5] Gevrey, M., Dimopoulos, I., Lek, S.: Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecol. Model. 160, 249–264 (2003)
[6] Cao, L.Y., Mees, A., Judd, K.: Dynamics from multivariate time series. Physica D 121, 75–88 (1998)
[7] Han, M., Fan, M., Xi, J.: Study of Nonlinear Multivariate Time Series Prediction Based on Neural Networks. In: Wang, J., Liao, X.-F., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3497, pp. 618–623. Springer, Heidelberg (2005)
[8] Chen, J.L., Islam, S., Biswas, P.: Nonlinear dynamics of hourly ozone concentrations: nonparametric short term prediction. Atmospheric Environment 32(11), 1839–1848 (1998)
[9] Liu, D.L., Scott, B.J.: Estimation of solar radiation in Australia from rainfall and temperature observations. Agricultural and Forest Meteorology 106(1), 41–59 (2001)
Ordering Process of Self-Organizing Maps Improved by Asymmetric Neighborhood Function

Takaaki Aoki¹, Kaiichiro Ota², Koji Kurata³, and Toshio Aoyagi¹,²

¹ CREST, JST, Kyoto 606-8501, Japan
² Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
³ Faculty of Engineering, University of the Ryukyus, Okinawa 903-0213, Japan
[email protected]
Abstract. The Self-Organizing Map (SOM) is an unsupervised learning method based on neural computation, which has recently found wide applications. However, the learning process sometimes has multiple stable states, in which the map is trapped in an undesirable disordered state that includes topological defects. These topological defects critically aggravate the performance of the SOM. In order to overcome this problem, we propose introducing an asymmetric neighborhood function for the SOM algorithm. Compared with the conventional symmetric one, the asymmetric neighborhood function accelerates the ordering process even in the presence of a defect. However, this asymmetry tends to generate a distorted map. This can be suppressed by an improved method of the asymmetric neighborhood function. In the case of the one-dimensional SOM, the number of steps required for perfect ordering is numerically shown to be reduced from O(N³) to O(N²). Keywords: Self-Organizing Map, Asymmetric Neighborhood Function, Fast ordering.
1 Introduction

The Self-Organizing Map (SOM) is an unsupervised learning method, a type of nonlinear principal component analysis [1]. Historically, it was proposed as a simplified neural network model having some essential properties for reproducing the topographic representations observed in the brain [2,3,4,5]. The SOM algorithm can be used to construct an ordered mapping from input stimulus data onto a two-dimensional array of neurons, according to the topological relationships between various characteristics of the stimuli. This implies that the SOM algorithm is capable of extracting the essential information from complicated data. From the viewpoint of applied information processing, the SOM algorithm can be regarded as a generalized, nonlinear type of principal component analysis and has proven valuable in the fields of visualization, compression and data mining. Based on a biologically simple learning rule, this algorithm behaves as an unsupervised learning method and provides robust performance without delicate tuning of the learning conditions.
Fig. 1. A: An example of a topological defect in a two-dimensional array of SOM with a uniform rectangular input space. The triangle indicates the conflicting point in the feature map. B: Another example of a topological defect, in a one-dimensional array with scalar input data. The triangles again indicate the conflicting points.
However, there is a serious problem of multi-stability or meta-stability in the learning process [6,7,8]. When the learning process is trapped in one of these states, the map appears, for practical purposes, to have converged to its final state. However, some of these states are undesirable as solutions of the learning procedure; typically, the map has topological defects, as shown in Fig. 1A. The map in Fig. 1A is twisted, with a topological defect at the center. In this situation the two-dimensional array of the SOM should be arranged over the square space, because the input data are taken uniformly from that square space. This topological defect, however, is a global conflict which is difficult to remove by local modulations of the reference vectors of the units. It therefore requires an enormous number of learning steps to rectify the topological defect. Thus, the existence of a topological defect critically aggravates the performance of the SOM algorithm. To avoid the emergence of topological defects, several conventional and empirical methods have been used. However, it is more desirable that the SOM algorithm work well without tuning any model parameters, even when a topological defect has emerged. We therefore consider a simple method which enables an effective ordering procedure of the SOM in the presence of a topological defect: we propose an asymmetric neighborhood function which effectively removes the topological defect [9]. In the process of removing a topological defect, the conflicting point must be moved out toward the boundary of the array, where it vanishes. Therefore, the movement of the defect is essential for the efficiency of the ordering process. With the original symmetric neighborhood, the movement of the defect is similar to a random walk, whose efficiency is poor. Introducing asymmetry into the neighborhood function makes the movement behave like a drift, which enables faster ordering. For this reason, in this paper we investigate the effect of an asymmetric neighborhood function on the performance of the SOM algorithm for one-dimensional and two-dimensional SOMs.
2 Methods

2.1 SOM
The SOM constructs a mapping from the input data space to an array of nodes, which we call the 'feature map'. To each node i, a parametric 'reference vector' m_i is assigned. Through SOM learning, these reference vectors are rearranged according to the following iterative procedure. An input vector x(t) is presented at each time step t, and the best matching unit, whose reference vector is closest to the given input vector x(t), is chosen. The best matching unit c, called the 'winner', is given by c = arg min_i ||x(t) − m_i||. In other words, the data point x(t) in the input data space is mapped onto the node c whose reference vector is closest to x(t). In SOM learning, the update rule for the reference vectors is given by

m_i(t+1) = m_i(t) + α · h(r_ic) [x(t) − m_i(t)],   r_ic ≡ ||r_i − r_c||   (1)

where α, the learning rate, is a small constant. The function h(r) is called the 'neighborhood function', and r_ic is the distance from the position r_c of the winner node c to the position r_i of node i on the array of units. A widely used neighborhood function is the Gaussian function defined by h(r_ic) = exp(−r_ic²/(2σ²)). We expect an ordered mapping after iterating the above procedure a sufficient number of times.
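A single learning step of Eq. (1) with the Gaussian neighborhood can be sketched as below. This is an illustrative sketch only; array shapes and parameter names are ours.

```python
import numpy as np

def som_step(m, pos, x, alpha=0.05, sigma=50.0):
    """One update of the reference vectors m (N x dim) for one input x.

    pos : (N, p) array with the position r_i of each node on the array
    """
    c = np.argmin(np.linalg.norm(x - m, axis=1))   # winner: closest reference vector
    d = np.linalg.norm(pos - pos[c], axis=1)       # distances r_ic on the array
    h = np.exp(-d ** 2 / (2.0 * sigma ** 2))       # Gaussian neighborhood h(r_ic)
    m += alpha * h[:, None] * (x - m)              # Eq. (1)
    return m, c
```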
2.2 Asymmetric Neighborhood Function
We now introduce a method to transform any given symmetric neighborhood function into an asymmetric one (Fig. 2A). Let us define an asymmetry parameter β (β ≥ 1), representing the degree of asymmetry, and a unit vector k indicating the direction of asymmetry. If a unit i is located in the positive direction of k, the component of the distance from the winner to the unit that is parallel to k is scaled by 1/β. If a unit i is located in the negative direction of k, the parallel component of the distance is scaled by β. Hence, the asymmetric function h_β(r), transformed from its symmetric counterpart h(r), is described by

h_β(r_ic) = (2/(β + 1/β)) · h(r̃_ic),
r̃_ic = √((r_∥/β)² + r_⊥²)   if r_ic · k ≥ 0,
r̃_ic = √((β r_∥)² + r_⊥²)   if r_ic · k < 0,   (2)

where r̃_ic is the scaled distance from the winner, r_∥ is the component of r_ic parallel to k, and r_⊥ denotes the remaining components perpendicular to k. In addition, in order to single out the effect of asymmetry, the overall area of the neighborhood function, ∫_{−∞}^{∞} h(r) dr, is preserved under this transformation. In the special case of asymmetry parameter β = 1, h_β(r) is equal to the original symmetric function h(r). Figure 2B displays an example of an asymmetric Gaussian neighborhood function for the two-dimensional SOM array.
Fig. 2. A: Method of generating an asymmetric neighborhood function by scaling the distance r_ic asymmetrically. The degree of asymmetry is parameterized by β. The distance of a node in the positive direction of the asymmetry vector k is scaled by 1/β; the distance in the negative direction is scaled by β. The asymmetric function is therefore h_β(r_ic) = (2/(β + 1/β)) h(r̃_ic), where r̃_ic is the scaled distance of node i from the winner c. B: An example of an asymmetric Gaussian function. C: An illustration of the improved algorithm for the asymmetric neighborhood function.
Next, we introduce an improved algorithm for the asymmetric neighborhood function. The asymmetry of the neighborhood function tends to distort the feature map, so that the density of units no longer represents the probability density of the input data. Two additional procedures are therefore introduced. The first procedure is an inversion of the direction of the asymmetric neighborhood function. As illustrated in Fig. 2C, the direction of the asymmetry is reversed after every time interval T, which is expected to average out the distortion in the feature map. Note that the interval T should be set to a value larger than the typical ordering time for the asymmetric neighborhood function. The second procedure decreases the degree of asymmetry of the neighborhood function: when β = 1, the neighborhood function equals the original symmetric function, so β is decreased toward 1 over the time steps, as illustrated in Fig. 2C. In our numerical simulations, we adopt a linearly decreasing function. A sketch of Eq. (2) together with these two procedures is given below.
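The following is a hedged sketch of our reading of Eq. (2) and of the two procedures (flipping k every T steps and annealing β toward 1); the annealing horizon t_anneal and the default σ are assumptions, not values taken from the paper.

```python
import numpy as np

def asymmetric_h(r_ic, k, beta, sigma=50.0):
    """Asymmetric neighborhood of Eq. (2), built from a Gaussian h.

    r_ic : (N, p) array of vectors r_i - r_c from the winner to each node
    k    : unit vector giving the direction of asymmetry
    beta : degree of asymmetry (beta >= 1)
    """
    par = r_ic @ k                                           # component parallel to k
    perp2 = np.maximum(np.sum(r_ic ** 2, axis=1) - par ** 2, 0.0)
    scale = np.where(par >= 0.0, 1.0 / beta, beta)           # shrink ahead of k, stretch behind
    r_tilde = np.sqrt((scale * par) ** 2 + perp2)            # scaled distance
    h = np.exp(-r_tilde ** 2 / (2.0 * sigma ** 2))
    return 2.0 / (beta + 1.0 / beta) * h                     # area-preserving prefactor

def schedule(t, k0, T=3000, beta0=1.5, t_anneal=20000):
    """Improved algorithm: flip k every T steps and decay beta linearly to 1."""
    k = k0 if (t // T) % 2 == 0 else -k0
    beta = max(1.0, beta0 - (beta0 - 1.0) * t / t_anneal)
    return k, beta
```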
2.3 Numerical Simulations
In the following sections, we tested learning procedures of SOM with sample data to examine the performance of the ordering process. The sample data is
generated from a random variable with a uniform distribution. In the case of the one-dimensional SOM, the distribution is uniform on the range [0, 1]. Here we use the Gaussian function as the original symmetric neighborhood function. The model parameters in SOM learning were as follows: the total number of units N = 1000, the learning rate α = 0.05 (constant), and the neighborhood radius σ = 50. The asymmetry parameter is β = 1.5 and the asymmetric direction k is set to the positive direction of the array. The interval T for flipping the asymmetric direction is 3000. In the case of the two-dimensional SOM (2D → 2D map), the input data are taken uniformly from the square space [0, 1] × [0, 1]. The model parameters are the same as in the one-dimensional SOM, except that the total number of units is N = 900 (30 × 30) and σ = 5. The asymmetric direction k is taken in the direction (1, 0), which can be chosen arbitrarily. We also confirmed in the numerical simulations that the results hold for other model parameters and other forms of the neighborhood function.
2.4 Topological Order and Distortion of the Feature Map
To examine the ordering process of the SOM, let us consider two measures which characterize the properties of the feature map. One is the 'topological order' η, which quantifies the order of the reference vectors in the SOM array. The units of the SOM array should be arranged according to their reference vectors m_i. In the presence of a topological defect, most of the units satisfy the local ordering; however, the topological defect violates the global ordering, and the feature map is divided into fragments of ordered domains within which the units satisfy the order condition. Therefore, the topological order η can be defined as the ratio of the maximum domain size to the total number of units N:

η ≡ max_l N_l / N   (3)

where N_l is the size of domain l. In the case of the one-dimensional SOM, the order condition for the reference vectors is m_{i−1} ≤ m_i ≤ m_{i+1} or m_{i−1} ≥ m_i ≥ m_{i+1}. In the case of the two-dimensional SOM described in the previous section, the order condition is defined explicitly with the vector product a_(i,i) ≡ (m_(i+1,i) − m_(i,i)) × (m_(i,i+1) − m_(i,i)). Within an ordered domain, the vector products a_(i,i) of the units have the same sign, because the reference vectors are arranged in the sample space in the order of the unit positions. The other measure is the 'distortion' χ, which measures the distortion of the feature map. The asymmetry of the neighborhood function tends to distort the distribution of the reference vectors, which then differs from the correct probability density of the input vectors. For example, when the probability density of the input vectors is uniform, a non-uniform distribution of reference vectors is formed with the asymmetric neighborhood function. Hence, to measure the non-uniformity of the distribution of the reference vectors, we define the distortion χ as the coefficient of variation of the size distribution of the units' Voronoi tessellation cells:

χ = √(Var(Δ_i)) / E(Δ_i)   (4)
where Δ_i is the size of the Voronoi cell of unit i. To eliminate the boundary effect of the SOM algorithm, the Voronoi cells on the edges of the array are excluded. When the reference vectors are distributed uniformly, the distortion χ converges to 0. Note that the evaluation of the Voronoi cells in the two-dimensional SOM is time-consuming, so we approximate the size of a Voronoi cell by the magnitude of the vector product a_(i,i). If the feature map is formed uniformly, this approximate value also converges to 0.
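For the one-dimensional map, the two measures of Eqs. (3)–(4) can be evaluated as sketched below; the domain-splitting logic and the half-gap Voronoi cells are our interpretation of the definitions above, not the authors' code.

```python
import numpy as np

def topological_order_1d(m):
    """Eq. (3): size of the largest monotone run of reference vectors divided by N."""
    sign = np.sign(np.diff(m))
    breaks = np.flatnonzero(sign[1:] != sign[:-1])        # boundaries between ordered domains
    run_lengths = np.diff(np.concatenate(([0], breaks + 1, [len(sign)])))
    return (run_lengths.max() + 1) / len(m)

def distortion_1d(m):
    """Eq. (4): coefficient of variation of the Voronoi cell sizes (edge cells excluded)."""
    m_sorted = np.sort(m)
    cells = 0.5 * (m_sorted[2:] - m_sorted[:-2])          # half-gap cells of interior units
    return np.std(cells) / np.mean(cells)
```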
3 Results

3.1 One-Dimensional Case
In this section, we investigate the ordering process of SOM learning in the presence of a topological defect for the symmetric, asymmetric, and improved asymmetric neighborhood functions. For this purpose, we use an initial condition in which a single topological defect appears at the center of the array. Because the density of the input vectors is uniform, the desirable feature map is a linear arrangement of the SOM nodes. Figure 3A shows a typical time development of the reference vectors m_i. In the case of the symmetric neighborhood function, a single defect remains around the center of the array even after 10000 steps. In contrast, in the case of the asymmetric one, this defect moves out to the right so that the reference vectors become ordered within 3000 steps. This phenomenon can also be confirmed in Fig. 3B, which shows the time dependence of the topological order η. In the case of the asymmetric neighborhood function, η rapidly converges to 1 (the completely ordered state) within 3000 steps, whereas the process of eliminating the last defect takes a large number of steps (∼18000) for the symmetric one. On the other hand, a problem arises in the feature map obtained with the asymmetric neighborhood function: after 10000 steps, the distribution of the reference vectors develops an unusual bias (Fig. 3A). Figure 3C shows the time dependence of the distortion χ during learning. In the case of the symmetric neighborhood function, χ eventually converges to almost 0, which indicates that the feature map has an almost uniform size distribution of Voronoi cells. In contrast, in the case of the asymmetric one, χ converges to a finite value (≠ 0). Although the asymmetric neighborhood function accelerates the ordering process of SOM learning, the resulting map becomes distorted, which makes it unusable for applications. Therefore, the improved asymmetric method is introduced, as described in the Methods section. With this improved algorithm, χ converges to almost 0, just as for the symmetric function (Fig. 3C). Furthermore, as shown in Fig. 3B, the improved algorithm preserves the fast ordering. Therefore, by utilizing the improved algorithm with the asymmetric neighborhood function, we obtain the full benefit of both fast ordering and an undistorted feature map. To quantify the performance of the ordering process, let us define the 'ordering time' as the time at which η reaches 1. Figure 4A shows the ordering time as a function of the total number of units N for both the improved asymmetric and the symmetric neighborhood functions.
Fig. 3. The asymmetric neighborhood function enhances the ordering process of the SOM. A: A typical time development of the reference vectors m_i for the symmetric, asymmetric, and improved asymmetric neighborhood functions. B: The time dependence of the topological order η. The standard deviations are denoted by error bars, which cannot be seen because they are smaller than the symbol size. C: The time dependence of the distortion χ.
It is found that the ordering time scales roughly as N³ and N² for the symmetric and improved asymmetric neighborhood functions, respectively. For a detailed discussion of the reduction of the ordering time, refer to Aoki and Aoyagi [9]. Figure 4B shows the dependence of the ordering time on the width of the neighborhood function, indicating that the ordering time is proportional to (N/σ)^k, with k = 2 for the asymmetric and k = 3 for the symmetric neighborhood function. This result implies that combined use of the asymmetric method and an annealing method for the width of the neighborhood function is even more effective.
3.2 Two-Dimensional Case
In this section, we investigate the effect of the asymmetric neighborhood function for the two-dimensional SOM (2D → 2D map). Figure 5 shows that a similarly fast ordering process can be realized with an asymmetric neighborhood function in the two-dimensional SOM.
Fig. 4. Ordering time as a function of the total number of units N (left panel) and of the scaled number of units N/σ (right panel). The fitting function is Const. · N^γ; the fitted exponents are N^{2.989±0.002} for the symmetric and N^{1.917±0.005} for the improved asymmetric neighborhood function.
Fig. 5. A: A typical time development of the reference vectors in a two-dimensional array of SOM for the symmetric, asymmetric and improved asymmetric neighborhood functions. B: The time dependence of the topological order η. C: The time dependence of the distortion χ.
Fig. 6. Distribution of ordering times when the initial reference vectors are generated randomly. The white bin at the right of each graph indicates the population of failed trials which did not converge to the perfectly ordered state within 50000 steps.
The initial state has a global topological defect, in which the map is twisted at the center. In this situation, the conventional symmetric neighborhood function has trouble correcting the twisted map: because of its local stability, the topological defect is never corrected, even after a huge number of learning iterations. The asymmetric neighborhood function is effective in overcoming such a topological defect, as in the one-dimensional SOM. However, the same map-distortion problem occurs. By using the improved asymmetric neighborhood function, the feature map converges to the completely ordered map in much less time and without any distortion. In the previous simulations, we considered the simple situation in which a single defect exists around the center of the feature map as the initial condition, in order to investigate the ordering process in the presence of a topological defect. However, when the initial reference vectors are set randomly, the total number of topological defects appearing in the map is not generally equal to one. It is therefore necessary to consider the statistical distribution of the ordering time, because the total number of topological defects and the convergence process generally depend on the initial conditions. Figure 6 shows the distribution of the ordering time when the initial reference vectors are randomly selected from the uniform distribution on [0, 1] × [0, 1]. In the case of the symmetric neighborhood function, some trials could not converge to the ordered state, being trapped in undesirable meta-stable states in which the topological defects are never rectified. Therefore, although a fast ordering process is observed in some successful cases (lucky initial conditions), the map formed with the symmetric function depends strongly on the initial conditions. In contrast, for the improved asymmetric neighborhood function, the distribution of the ordering time has a single sharp peak and a successful feature map is constructed stably, without any tuning of the initial conditions.
4 Conclusion

In this paper, we discussed the learning process of the self-organizing map, especially in the presence of a topological defect. Interestingly, even in the presence
of the defect, we found that the asymmetry of the neighborhood function enables the system to accelerate the learning process. Compared with the conventional symmetric function, the convergence time of the learning process can be roughly reduced from O(N³) to O(N²) in the one-dimensional SOM (N is the total number of units). Furthermore, this acceleration with the asymmetric neighborhood function is also effective in the case of the two-dimensional SOM (2D → 2D map). In contrast, the conventional symmetric function cannot rectify a twisted feature map even with an enormous number of iteration steps, due to its local stability. These results suggest that the proposed method can be effective for more general cases of the SOM, which is the subject of future study.
Acknowledgments. This work was supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science, Sports, and Culture of Japan: Grant numbers 18047014, 18019019 and 18300079.
References

1. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43(1), 59–69 (1982)
2. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol.-London 160(1), 106–154 (1962)
3. Hubel, D.H., Wiesel, T.N.: Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol. 158(3), 267–294 (1974)
4. von der Malsburg, C.: Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14(2), 85–100 (1973)
5. Takeuchi, A., Amari, S.: Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybern. 35(2), 63–72 (1979)
6. Erwin, E., Obermayer, K., Schulten, K.: Self-organizing maps: stationary states, metastability and convergence rate. Biol. Cybern. 67(1), 35–45 (1992)
7. Geszti, T., Csabai, I., Czakó, F., Szakács, T., Serneels, R., Vattay, G.: Dynamics of the Kohonen map. In: Statistical Mechanics of Neural Networks: Proceedings of the XIth Sitges Conference, pp. 341–349. Springer, New York (1990)
8. Der, R., Herrmann, M., Villmann, T.: Time behavior of topological ordering in self-organizing feature mapping. Biol. Cybern. 77(6), 419–427 (1997)
9. Aoki, T., Aoyagi, T.: Self-organizing maps with asymmetric neighborhood function. Neural Comput. 19(9), 2515–2535 (2007)
A Characterization of Simple Recurrent Neural Networks with Two Hidden Units as a Language Recognizer

Azusa Iwata¹, Yoshihisa Shinozawa¹, and Akito Sakurai¹,²

¹ Keio University, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
² CREST, Japan Science and Technology Agency
Abstract. We give a necessary condition for a simple recurrent neural network with two sigmoidal hidden units to implement a recognizer of the formal language {an bn |n > 0}, which is generated by the set of generating rules {S → aSb, S → ab}, and we show that by setting the parameters so as to conform to the condition we obtain a recognizer of the language. The condition explains the instability of the learning process reported in previous studies. It also suggests, despite this success in implementing the recognizer, the difficulty of obtaining a recognizer for more complicated languages.
1 Introduction

Pioneered by Elman [6], much research has been conducted on grammar learning by recurrent neural networks. A grammar is defined by generating (rewriting) rules such as S → Sa and S → b. S → Sa means that "the letter S should be rewritten to Sa," and S → b means that "the letter S should be rewritten to b." If more than one rule is applicable, we have to try all the possibilities. A string generated this way to which no further rewriting rule is applicable is called a sentence, and the set of all possible sentences is called the language generated by the grammar. Although the word "language" would better be termed a formal language, in contrast to a natural language, we follow the custom of the formal language theory field (e.g. [10]). In everyday usage a sentence is a sequence of words, whereas in the above example it is a string of characters; nevertheless the essence is common. The concept that a language is defined by a grammar in this way has been a major paradigm in a wide range of formal language studies and related fields. A study of grammar learning focuses on restoring a grammar from a finite sample of sentences of the language associated with that grammar. In contrast to the ease of generating sample sentences, learning a grammar from sample sentences is a hard problem; in fact it is impossible except in some very restrictive cases, e.g. when the language is finite. As is well known, humans do learn languages whose grammars are very complicated. The difference between the two situations might be attributed to the possible existence of some unknown restrictions on the types of natural language grammars in our brain. Since a neural network is a model of the
brain, some researchers think that a neural network could learn a grammar from exemplar sentences of a language. Research on grammar learning by neural networks is characterized by the following:

1. it focuses on the learning of self-embedding rules, since simpler rules that can be expressed by finite state automata are understood to be learnable,
2. simple recurrent neural networks (SRNs) are used as the basic mechanism, and
3. languages such as {an bn |n > 0} and {an bn cn |n > 0}, which are clearly the results of, though not representative of, self-embedding rules, are adopted as target languages.

Here an is a string of n repetitions of the character a; the language {an bn |n > 0} is generated by the grammar {S → ab, S → aSb}, and {an bn cn |n > 0} is generated by a context-sensitive grammar. The adoption of simple languages such as {an bn |n > 0} and {an bn cn |n > 0} as target languages is inevitable, since it is not easy to see whether enough generalization is achieved by the learning if realistic grammars are used. Although these languages are simple, in many cases what the networks learned was just what they were taught, not a grammar; that is, they did almost rote learning. In other words, their generalization capability was limited. Their learned results were also unstable, in the sense that when they were given new training sentences which were longer than the ones they had learned but still in the same language, the learned network changed unexpectedly; the change was more than just a refinement of the learned network. Considering these situations, one may doubt that a solution, i.e. a correctly learned network, really exists. Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), Chalup et al. ([5]), and others tried to clarify the reasons and explore the possibility of network learning of these languages, but their results were not conclusive and did not give clear conditions that the learned networks satisfy in common. In this paper we describe a condition that SRNs with two hidden units that have learned to recognize the language {an bn |n > 0} have in common, that is, a condition that has to be met for an SRN to qualify as having successfully learned the language. The condition implies, incidentally, the instability of learning results. Moreover, by utilizing the condition, we realize a language recognizer. In doing so we found that learning recognizers of languages more complicated than {an bn |n > 0} is hard. An RNN (recurrent neural network) is a type of network that has recurrent connections and feedforward connections. The computation is done at once for the feedforward part, and after a one-time-unit delay the recurrent connections give rise to additional inputs (additional to the external inputs) to the feedforward part. An RNN is a kind of discrete-time system. Starting from an initial state (the initial outputs, i.e. outputs without external inputs, of the feedforward part), the network accepts the characters of a given string one by one at its external inputs, reaches a final state, and produces the final external output from that state. An SRN (simple recurrent network) is a simple type of RNN which has only one layer of hidden units in its feedforward part.
Rodriguez et al. [13] showed that an SRN learns the languages {an bn |n > 0} (or, more precisely, its subset {ai bi |n ≥ i > 0}) and {an bn cn |n > 0} (or, more precisely, {ai bi ci |n ≥ i > 0}). For {an bn |n > 0}, they found an SRN that generalized to n ≤ 16 after being trained for n ≤ 11. They analyzed how the languages are processed by the SRN, but the results were not conclusive, so it is still an open problem whether it is really possible to realize recognizers of languages such as {an bn |n > 0} or {an bn cn |n > 0} and whether an SRN could learn or implement more complicated languages. Siegelmann [16] showed that an RNN has computational ability superior to a Turing machine. But the difficulty of learning {an bn |n > 0} by SRNs suggests that at least SRNs might not be able to realize a Turing machine. The difference might be attributed to the difference in the output functions of the neurons: a piecewise linear function in Siegelmann's case and a sigmoidal function in standard SRN cases, since we cannot find other substantial differences. On the other hand, Casey [4] and Maass [12] showed that in a noisy environment an RNN is equivalent to a finite automaton or something less powerful. These results suggest that correct (i.e., infinite precision) computation has to be considered when we investigate the possibility of computations by RNNs or, specifically, SRNs. Therefore, in the research reported in this paper we have adopted RNN models with infinite precision calculation and with a sigmoidal output function (tanh(x)) for the neurons. From the viewpoints above, we discuss two things in this paper: a necessary condition for an SRN with two hidden units to be a recognizer of the language {an bn |n > 0}, and the fact that the condition is sufficient to guide us in building an SRN recognizing the language.
2 Preliminaries

An SRN (simple recurrent network) is a simple type of recurrent neural network whose function is expressed by

s_{n+1} = σ(w_s · s_n + w_x · x_n),   N_n(s_n) = w_os · s_n + w_oc

where σ is a standard sigmoid function (tanh(x) = (1 − exp(−x))/(1 + exp(−x))) applied component-wise. A counter is a device which keeps an integer, allows +1 and −1 operations, and answers yes or no to an inquiry whether the content is 0 (the 0-test). A stack is a device which allows the operations push i (store i) and pop (recall the last-stored content, discard it, and restore the device as if it were just before the corresponding push operation). Clearly a stack is more general than a counter, so if a counter is not implementable, neither is a stack. Elman used the SRN as a predictor, but Rodriguez et al. [13] and others consider it as a counter. In this paper we mainly take the latter viewpoint. We first explain how a predictor is used as a recognizer or a counter. When we try to train an SRN to recognize a language, we train it to correctly predict the next character to come in an input string (see e.g. [6]). As usual, we adopted a "one-hot vector" representation for the network output. A one-hot vector is a vector representation in which a single element is one and the others
are zero. The network is trained so that the sum of the squared errors (the differences between the actual and desired outputs) is minimized. With this method, if two possible outputs with the same occurrence frequency exist for the same input in the training data, the network will learn to output 0.5 for the two corresponding elements of the output, with the others being 0, since this output vector gives the minimum of the sum of squared errors. It is easily seen that if a network correctly predicts the next character to come for a string in the language {an bn |n > 0}, it behaves as a counter with limited function. Let us add a new network output whose value is positive if the original network predicts only a to come (which happens only when the input string so far is an bn for some n; this has been common practice since Elman's work) and negative otherwise. The modified network appears to count up for the character a and count down for b, since it outputs a positive value when the numbers of a's and b's coincide and a negative value otherwise. Its counting capability, though, is limited, since it could output any value when an a is fed before the due number of b's has been fed, that is, when a count-up is required before the counter has returned to the 0-state. A (discrete-time) dynamical system is represented as the iteration of a function application: s_{i+1} = f(s_i), i ∈ N, s_i ∈ R^n. A point s is called a fixed point of f if f(s) = s. A point s is an attracting fixed point of f if s is a fixed point and there exists a neighborhood U_s around s such that lim_{i→∞} f^i(x) = s for all x ∈ U_s. A point s is a repelling fixed point of f if s is an attracting fixed point of f^{−1}. A point s is called a periodic point of f if f^n(s) = s for some n. A point s is an ω-limit point of x for f if lim_{i→∞} f^{n_i}(x) = s for some sequence with lim_{i→∞} n_i = ∞. A fixed point x of f is hyperbolic if all of the eigenvalues of Df at x have absolute values different from one, where Df = [∂f_i/∂x_j] is the Jacobian matrix of first partial derivatives of the function f. A set D is invariant under f if for any s ∈ D, f(s) ∈ D. The following theorem plays an important role in the current paper.

Theorem 1 (Stable Manifold Theorem for a Fixed Point [9]). Let f: R^n → R^n be a C^r (r ≥ 1) diffeomorphism with a hyperbolic fixed point x. Then there are local stable and unstable manifolds W_loc^{s,f}(x), W_loc^{u,f}(x), tangent to the eigenspaces E_x^{s,f}, E_x^{u,f} of Df at x and of corresponding dimension. W_loc^{s,f}(x) and W_loc^{u,f}(x) are as smooth as the map f, i.e. of class C^r.

The local stable and unstable manifolds for f are defined as follows:

W_loc^{s,f}(q) = {y ∈ U_q | lim_{m→∞} dist(f^m(y), q) = 0}
W_loc^{u,f}(q) = {y ∈ U_q | lim_{m→∞} dist(f^{−m}(y), q) = 0}
where U_q is a neighborhood of q and dist is a distance function. The global stable and unstable manifolds for f are then defined as W^{s,f}(q) = ∪_{i≥0} f^{−i}(W_loc^{s,f}(q)) and W^{u,f}(q) = ∪_{i≥0} f^{i}(W_loc^{u,f}(q)). As defined, an SRN is a pair of a discrete-time dynamical system s_{n+1} = σ(w_s · s_n + w_x · x_n) and an external output part N_n = w_os · s_n + w_oc. We simply
write the former (the dynamical system part) as s_{n+1} = f(s_n, x_n) and the external output part as h(s_n). When an RNN (or SRN) is used as a recognizer of the language {an bn |n > 0}, as described in the Introduction, it is seen as a counter where the input character a triggers a count-up operation (i.e., +1) and b a count-down operation (i.e., −1). In the following we may write x_+ for a and x_− for b and, as an abbreviation, f_+ = f(·, x_+) and f_− = f(·, x_−). Please note that f_−^{−1} is undefined for points outside and on the border of the square (I[−1, 1])², where I[−1, 1] is the closed interval [−1, 1]; in the following, though, we do not mention this, for simplicity. D_0 is the set {s | h(s) ≥ 0}, that is, the region where the counter value is 0. Let D_i = f_−^{−i}(D_0), that is, the region where the counter value is i. We postulate that f_+(D_i) ⊆ D_{i+1}. This means that any point in D_i is eligible as a state where the counter content is i. This may seem rather demanding. An alternative would be that a point p is for counter content c if and only if p = f_−^{m_i} ∘ f_+^{p_i} ∘ … ∘ f_−^{m_1} ∘ f_+^{p_1}(s_0) for a predefined s_0, some m_j ≥ 0 and p_j ≥ 0 for 1 ≤ j ≤ i, and i ≥ 0 such that Σ_{j=1}^{i}(p_j − m_j) = c. This, unfortunately, did not lead to a fruitful result. We also postulate that the D_i's are disjoint. Since we define D_i as a closed set, the postulate is natural; the point, therefore, is that we have chosen D_i to be closed. The postulate requires that we keep a margin between D_0 and D_1 and the other regions.
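Read as a counter, the recognition procedure can be sketched as follows. This is our simplified reading, not the authors' procedure: f_plus, f_minus, h and s0 are hypothetical arguments standing for a trained network's maps and start state, and the syntactic check that all a's precede all b's is handled outside the network.

```python
import numpy as np

def recognize(string, f_plus, f_minus, h, s0):
    """Counter-style recognition of a^n b^n (n > 0).

    The string must consist of a block of a's followed by a block of b's
    (checked syntactically); the network's 0-test h(s) >= 0 then decides
    whether the two blocks have equal length.
    """
    n_a = len(string) - len(string.lstrip("a"))
    if n_a == 0 or set(string[n_a:]) != {"b"}:
        return False
    s = np.array(s0, dtype=float)
    for ch in string:
        s = f_plus(s) if ch == "a" else f_minus(s)   # count up on a, down on b
    return h(s) >= 0.0
```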
3 A Necessary Condition

We consider only SRNs with two hidden units, i.e., all the vectors concerning s, such as w_s, s_n, w_os, are two-dimensional.

Definition 2. D_ω is the set of accumulation points of ∪_{i≥0} D_i, i.e. s ∈ D_ω iff s = lim_{i→∞} s_{k_i} for some s_{k_i} ∈ D_{k_i}.

Definition 3. P_ω is the set of ω-limit points of points in D_0 for f_+, i.e. s ∈ P_ω iff s = lim_{i→∞} f_+^{k_i}(s_0) for some k_i and s_0 ∈ D_0. Q_ω is the set of ω-limit points of points in D_0 for f_−^{−1}, i.e. s ∈ Q_ω iff s = lim_{i→∞} f_−^{−k_i}(s_0) for some k_i and s_0 ∈ D_0.

Considering the results obtained by Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), Chalup et al. ([5]), and others, it is natural, at least as a first consideration, to postulate that f_+^i(x) and f_−^i(x) do not wander, so that they converge to periodic points. Therefore P_ω and Q_ω are postulated to be finite sets of hyperbolic periodic points for f_+ and f_−, respectively. In the following, though, to simplify the presentation, we postulate that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively. Moreover, according to the same literature, the points in Q_ω are saddle points for f_−, so we further postulate that W_loc^{u,f_−}(q) for q ∈ Q_ω and W_loc^{s,f_−}(q) for q ∈ Q_ω are one-dimensional; their existence is guaranteed by Theorem 1.
A Characterization of Simple RNNs with Two Hidden Units
441
Postulate 4. We postulate that f+ (Di ) ⊆ Di+1 , Di ’s are disjoint, Pω and Qω u,f are finites set of hyperbolic fixed points for f+ and f− , respectively, and Wloc − (q) s,f− for q ∈ Qω and Wloc (q) for q ∈ Qω are one-dimensional. −1 −1 Lemma 5. f− ◦ f+ (Dω ) = Dω , f− (Dω (I(−1, 1))2 ) = Dω and Dω (I(−1, 1))2 = f− (Dω ), and f+ (Dω ) ⊆ Dω . Pω ⊆ Dω and Qω ⊆ Dω . −1 Definition 6. W u,−1 (q) is the global unstable manifold at q ∈ Qω for f− , i.e., −1 u,f− u,−1 s,f− W (q) = W (q) = W (q). i Lemma 7. For any p ∈ Dω , any accumulation point of {f− (p)|i > 0} is in Qω .
Proof. Since p is in Dω , there exist pki ∈ Dki such that p = limi→∞ pki . Suppose q in Dω is the accumulation point stated in the theorem statement, i.e., q = h limj→∞ f−j (p). We take ki large enough for any hj so that in any neighborhood of h h h −k q where f−j (p) is in, pki exists. Then q = limj→∞ f−j (pki ) = limj→∞ f−j i (ski ) ki where ki is a function of hj with ki > hj . Let ski = f− (pki ) ∈ D0 and s0 ∈ D0 −1 be an accumulation point of {ski }. Then since f− is continuous, letting nj = −n −hj + ki > 0, q = limj→∞ f− j (s0 ), i.e., q ∈ Qω .
Lemma 8. Dω = q∈Qω W u,−1 (q). Proof. Let p be any point in Dω . Since f− (Dω ) ⊆ (I[−1, 1])2 where I[−1, 1] n is the interval [−1, 1], i.e., f− (Dω ) is bounded, and f− (Dω ) ⊆ Dω , {f− (p)} has an accumulation point q in Dω , which is, by Lemma 7, in Qω . Then q is n expressed as q = limj→∞ f−j (p). Since Qω is a finite set of hyperbolic fixed −1 n points, q = limn→∞ f− (p), i.e., p ∈ W s,f (q) = W u,f (q) = W u,−1 (q).
Since Pω ⊆ Dω , the next theorem holds. Theorem 9. A point in Pω is either a point in Qω or in W u,−1 (q) for some q ∈ Qω . Please note that since q ∈ W u,−1 (q), the theorem statement is just “If p ∈ Pω then p ∈ W u,−1 (q) for some q ∈ Qω .”
4
An Example of a Recognizer
To construct an SRN recognizer for {an bn |n > 0}, the SRN should satisfy the conditions stated in Theorem 9 and Postulate 4, which are summarized as: 1. f+ (Di ) ⊆ Di+1 , 2. Di ’s are disjoint, 3. Pω and Qω are finites set of hyperbolic fixed points for f+ and f− , respectively, u,f s,f 4. Wloc − (q) for q ∈ Qω and Wloc − (q) for q ∈ Qω are one-dimensional, and u,−1 5. If p ∈ Pω then p ∈ W (q) for some q ∈ Qω .
442
A. Iwata, Y. Shinozawa, and A. Sakurai
Let us consider as simple as possible, so that the first choice is to think about a point p ∈ Pω and q ∈ Qω , that is f+ (p) = p and f− (q) = q. Since p cannot be −1 the same as q (because f− ◦ f+ (p) = p + w−1 s · w x · (x+ − x− ) = p ), we have u,−1 to find a way to let p ∈ W (q). Since it is very hard in general to calculate stable or unstable manifolds from a function and its fixed point, we had better try to let W u,−1 (q) be a “simple” manifold. There is one more reason to do so: we have to define D0 = {x|h(x) ≥ 0} but if W u,−1 (q) is not simple, suitable h may not exist. We have decided that W u,−1 (q) be a line (if possible). Considering the function form f− (s) = σ(w s · s + wx · x− ), it is not difficult to see that the line could be one of the axes or one of the bisectors of the right angles at the origin (i.e., one of the lines y = x and y = −x). We have chosen the bisector in the first (and the third) quadrant (i.e., the line y = x). By the way q was chosen to be the origin and p was chosen arbitrarily to be (0.8, 0.8). The item 4 is satisfied by setting one of the two eigenvalues of Df− at the origin to be greater than one, and the other smaller then one. We have chosen 1/0.6 for one and 1/μ for the other which is to be set so that Item 1 and 2 are satisfied by considering eigenvalues of Df+ at p for f+ . The design consideration that we have skipped is how to design D0 = {x|h(x) ≥ 0}. A simple way is to make the boundary h(x) = 0 parallel to W u,−1 (q) for our intended q ∈ Qω . Because, if we do so, by setting the largest eigenvalue of Df− at q to be equal to the inverse of the eigenvalue of Df+ at p along the 2 2 i normal to W u,−1 , we can get the points s ∈ D0 , f− ◦ f+ (s), f− ◦ f+ (s), . . . , f− ◦ i f+ (s), . . . that belong to {an bn |n > 0}, reside at approximately equal distance from W u,−1 . Needless to say that the points belonging to, say, {an+1 bn |n > 0} have approximately equal distance from W u,−1 among them and this distance is different from that for {an bn |n > 0}. Let f− (x) = σ(Ax + B0 ), f+ (x) = σ(Ax + B1 ). We plan to put Qω = {(0, 0)}, Pω = {(0.8, 0.8)}, W u,−1 = {(x, y)|y = x}, the eigenvalues of the tangent space −1 of f− at (0, 0) are 1/λ = 1/0.6 and 1/μ (where the eigenvector on y = x is expanding), and the eigenvalues of the tangent space of f+ at (0.8, 0.8) are 1/μ and any value. Then, considering derivatives at (0, 0) and (0.8, 0.8), it is easy to see π 1 π 1 λ 0 A=ρ ρ − , = (1 − 0.82 )μ 0 μ 2 4 4 μ where ρ(θ) is a rotation by θ. Then λ+μ λ−μ A= λ−μ λ+μ Next from σ(B0 ) = (0, 0)T and σ((0.8λ, 0.8λ)T + B1 ) = (0.8, 0.8)T , −1 0 σ (0.8) − 0.8λ B0 = , B1 = . 0 σ −1 (0.8) − 0.8λ These give us μ = 5/3, λ = 0.6, B1 ≈ (1.23722, 1.23722)T .
A Characterization of Simple RNNs with Two Hidden Units
443
Fig. 1. The vector field representation of f+ (left) and f− (right) 1
0.75
0.5
0.25
-1
-0.75
-0.5
-0.25
0.25
0.5
0.75
1
-0.25
-0.5
-0.75
-1
-1
-0.75
-0.5
1
1
0.75
0.75
0.5
0.5
0.25
0.25
-0.25
0.25
0.
0.75
1
-1
-0.75
-0.5
-0.25
0.25
-0.25
-0.25
-0.5
-0.5
-0.75
-0.75
-1
-1
0.5
0.75
1
n+1 n+1 n n n Fig. 2. {f− ◦ f+ (p)| n ≥ 1} (upper), {f− ◦ f+ (p)| n ≥ 1} (lower left), and {f− ◦ n f+ (p)| n ≥ 1} (lower right) where p = (0.5, 0.95)
444
A. Iwata, Y. Shinozawa, and A. Sakurai
In Fig. 1, the left image shows the vector field of f+ where the arrows starting at x end at f+ (x) and the right image shows the vector field of f− . In Fig. 2, the upper plot shows points corresponding to strings in {an bn |n > 0}, the lower-left plot {an+1 bn |n > 0}, and the lower-right plot {an bn+1 |n > 0}. The initial point was set to p = (0.5, 0.95) in Fig. 2. All of them are for n = 1 to n = 40 and when n grows the points gather so we could say that they stay in narrow stripes, i.e. Dn , for any n.
5
Discussion
We obtained a necessary condition that SRN implements a recognizer for the language {an bn |n > 0} by analyzing its behavior from the viewpoint of discrete dynamical systems. The condition supposes that Di ’s are disjoint, f+ (Di ) ⊆ Di+1 , and Qω is finite. It suggests a possibility of the implementation and in fact we have successfully built a recognizer for the language, thereby we showed that the learning problem of the language has at least a solution. Unstableness of any solutions for learning is suggested to be (but not derived to be) due to the necessity of Pω being in an unstable manifold W u,−1 (q) for n q ∈ Qω . Since Pω is attractive in the above example, f+ (s0 ) for s0 ∈ D0 comes n exponentially close to Pω for n. By even a small fluctuation of Pω , since f+ (s0 ), u,−1 n n too, is close to W (q), f− (f+ (s0 )), which should be in D0 , is disturbed much. This means that even if we are close to a solution, by just a small fluctuation of n n Pω caused by a new training data, f− (f+ (s0 )) may easily be pushed out of D0 . Since Rodriguez et al. [14] showed that the languages that do not belong to the context-free class could be learned to some degree, we have to further study the discrepancies. Instability of grammar learning by SRN shown above might not be seen in our natural language learning, which suggests that SRN might not be appropriate for a model of language learning.
References 1. Bod´en, M., Wiles, J., Tonkes, B., Blair, A.: Learning to predict a context-free language: analysis of dynamics in recurrent hidden units. Artificial Neural Networks (1999); Proc. ICANN 1999, vol. 1, pp. 359–364 (1999) 2. Bod´en, M., Wiles, J.: Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science 12(3/4), 197–210 (2000) 3. Bod´en, M., Blair, A.: Learning the dynamics of embedded clauses. Applied Intelligence: Special issue on natural language and machine learning 19(1/2), 51–63 (2003) 4. Casey, M.: Correction to proof that recurrent neural networks can robustly recognize only regular languages. Neural Computation 10, 1067–1069 (1998) 5. Chalup, S.K., Blair, A.D.: Incremental Training Of First Order Recurrent Neural Networks To Predict A Context-Sensitive Language. Neural Networks 16(7), 955– 972 (2003)
A Characterization of Simple RNNs with Two Hidden Units
445
6. Elman, J.L.: Distributed representations, simple recurrent networks and grammatical structure. Machine Learning 7, 195–225 (1991) 7. Elman, J.L.: Language as a dynamical system. In: Mind as Motion: Explorations in the Dynamics of Cognition, pp. 195–225. MIT Press, Cambridge 8. Gers, F.A., Schmidhuber, J.: LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks 12(6), 1333–1340 (2001) 9. Guckenheimer, J., Holmes, P.: Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer, Heidelberg (Corr. 5th print, 1997) 10. Hopcroft, J.E., Ullman, J.D.: Introduction to automata theory, languages, and computation. Addison-Wesley, Reading (1979) 11. Katok, A., Hasselblatt, B.: Introduction to the Modern Theory of Dynamical Systems. Cambridge University Press, Cambridge (1996) 12. Maass, W., Orponen, P.: On the effect of analog noise in discrete-time analog computations. Neural Computation 10, 1071–1095 (1998) 13. Rodriguez, P., Wiles, J., Elman, J.L.: A recurrent neural network that learns to count. Connection Science 11, 5–40 (1999) 14. Rodriguez, P.: Simple recurrent networks learn context-free and context-sensitive languages by counting. Neural Computation 13(9), 2093–2118 (2001) 15. Schmidhuber, J., Gers, F., Eck, D.: Learning Nonregular Languages: A Comparison of Simple Recurrent Networks and LSTM. Neural Computation 14(9), 2039–2041 (2002) 16. Siegelmann, H.T.: Neural Networks and Analog Computation: beyond the Turing Limit, Birkh¨ auser (1999) 17. Wiles, J., Blair, A.D., Bod´en, M.: Representation Beyond Finite States: Alternatives to Push-Down Automata. In: A Field Guide to Dynamical Recurrent Networks
Unbiased Likelihood Backpropagation Learning Masashi Sekino and Katsumi Nitta Tokyo Institute of Technology, Japan
Abstract. The error backpropagation is one of the popular methods for training an artificial neural network. When the error backpropagation is used for training an artificial neural network, overfitting occurs in the latter half of the training. This paper provides an explanation about why overfitting occurs with the model selection framework. The explanation leads to a new method for training an aritificial neural network, Unibiased Likelihood Backpropagation Learning. Several results are shown.
1
Introduction
An artificial neural network is one of the model for function approximation. It is possible to approximate arbitrary function when the number of basis functions is large. The error backpropagation learning [1], which is a famous method for training an artificial neural network, is the gradient discent method with the squared error to learning data as a target function. Therefore, the error backpropagation learning can obtain local optimum while monotonously decreasing the error. Here, although the error to learning data is monotonously decreasing, the error to test data increases in the latter half of training. This phenomenon is called overfitting. Early stopping is one of the method for preventing the overfitting. This method stop the training when an estimator of the generalization error does not decrease any longer. For example, the technique which stop the training when the error to hold-out data does not decrease any longer is often applied. However, the early stopping basically minimize the error to learning data, therefore there is no guarantee for obtaining the optimum parameter which minimize the estimator of the generalization error. When the parameters of the basis functions (model parameter) are fixed, an artificial neural network becomes a linear regression model. If a regularization parameter is introduced to assure the regularity of this linear regression model, the artificial neural network becomes a set of regular linear regression models. The cause of why an artificial neural network tends to overfit is that the maximum likelihood estimation with respect to the model parameter is the model selection about regular linear regression models based on the empirical likelihood. In this paper, we propose the unbiased likelihood backpropagation learning which is the gradient discent method for modifying the model parameter with unbiased likelihood (information criterion) as a target function. It is expected M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 446–455, 2008. c Springer-Verlag Berlin Heidelberg 2008
Unbiased Likelihood Backpropagation Learning
447
that the proposed method has better approximation performance because the method explicitly minimize an estimator of the generalization error. Following section, section 2 explains about statistical learning and maximum likelihood estimation, and section 3 explains about information criterion shortly. Next, in section 4, we give an explanation about an artificial neural network, regularized maximum likelihood estimation and why the error backpropagation learning cause overfitting. Then, the proposed method is explained in section 5. We show the effectiveness of the proposed method by applying the method to DELVE data set [4] in section 6. Finally we conclude this paper in section 7.
2
Statistical Learning
Statistical learning aims to construct an optimal approximation pˆ(x) of a true distribution q(x) from a set of hypotheses M ≡ {p(x|θ) | θ ∈ Θ} using learning data D ≡ {xn | n = 1, · · · , N } obtained from q(x). M is called model and approximation pˆ(x) is called estimation. When we want to clearly denote that the estimation pˆ(x) is constructed using learning data D, we use a notation pˆ(x|D). Kullback-Leibler divergence: q(x) D(q||p) ≡ q(x) log dx (1) p(x) is used for distance from q(x) to p(x). Probability density p(x) is called likelihood, and especially the likelihood of the estimation pˆ(x) to the learning data D: N pˆ(D|D) ≡ pˆ(xn |D) (2) n=1
is called empirical likelihood. Sample mean of the log-likelihood: N 1 1 log p(D) = log p(xn ) N N n=1
asymptotically converges in probability to the mean log-likelihood: Eq(x) log p(x) ≡ q(x) log p(x)dx
(3)
(4)
according to the law of large numbers, where Eq(x) denotes an expectation under q(x). Because Kullback-Leibler divergence can be decomposed as D(q||p) = Eq(x) log q(x) − Eq(x) log p(x) , (5) the maximization of the mean log-likelihood is equal to the minimization of Kullback-Leibler divergence. Therefore, statistical learning methods such as maximum likelihood estimation, maximum a posteriori estimation and Bayesian estimation are based on the likelihood.
448
M. Sekino and K. Nitta
Maximum Likelihood Estimation Maximum likelihood estimation pˆML (x) is the hypothesis p(x|θˆML ) by the maximizer θˆML of likelihood p(D|θ): pˆML (x) ≡ p(x|θˆML ), θˆML ≡ argmax p(D|θ). θ
3 3.1
(6) (7)
Information Criterion and Model Selection Information Criterion
Because sample mean of the log-likelihood asymptotically converges in probability to the mean log-likelihood, statistical learning methods are based on likelihood. However, because learning data D is a finite set in practice, the empirical likelihood pˆ(D|D) contains bias. This bias b(ˆ p) is defined as b(ˆ p) ≡ Eq(D) log pˆ(D) −N · Eq(x) log pˆ(x|D) , (8) N where q(D) ≡ n=1 q(xn ). Because of the bias, it is known that the most overfitted model to learning data is selected when select a regular model from a candidate set of regular models based on the empirical likelihood. Addressing this problem, there have been proposed many information criteria which evaluates learning results by correcting the bias. Generally the form of the information criterion is IC(ˆ p, D) ≡ −2 log pˆ(D) + 2ˆb(ˆ p)
(9)
where ˆb(ˆ p) is the estimator of the bias b(ˆ p). Corrected AIC (cAIC) [3] estimates and corrects accurate bias of the empirical log-likelihood as N (M + 1) ˆbcAIC (ˆ pML ) = , (10) N −M −2 under the assumption that the learning model is a normal linear regression model, the true model is included in the learning model and the estimation is constructed by maximum likelihood estimation. Here, M is the number of explanatory variables and dim θ = M +1 (1 is the number of the estimators of variance.) Therefore, cAIC asymptotically equals AIC [2]: ˆbcAIC (ˆ pML ) → ˆbAIC (ˆ pML ) (N → ∞).
4
Artificial Neural Network
In a regression problem, we want to estimate the true function using learning data D = {(xn , yn ) | n = 1, · · · , N }, where xn ∈ Rd is an input and yn ∈ R is the corresponding output. An artificial neural network is defined as
Unbiased Likelihood Backpropagation Learning
f (x; θM ) =
M
ai φ(x; ϕi )
449
(11)
i=1
where ai (i = 1, · · · , N ) are regression coefficients and φ(x; ϕi ) are basis functions which are parametalized by ϕi . Model parameter of this neural network is θM ≡ (ϕT1 , · · · , ϕTM )T (T denotes transpose.) Design matrix X, coefficient vector a and output vector y are defined as Xij ≡ φ(xi ; ϕj )
(12)
a ≡ (a1 , · · · , aM )
(13)
y ≡ (y1 , · · · , yN ) .
(14)
T
T
When the model parameter θM are fixed, the artificial neural network (11) becomes a linear regression model parametalized by a. A normal linear regression model:
1 (y −f (x; θM))2 p(y|x; θM ) ≡ √ exp − (15) 2σ 2 2πσ 2 is usually used when the noise included in the output y is assumed to follow a normal distribution. In this paper, we call the parameter θR ≡ (aT , σ 2 )T regular parameter. 4.1
Regularized Maximum Likelihood Estimation
To assure the regularity of the normal linear regression model (15), regularized maximum likelihood estimation is usually used for estimating θR = (aT , σ 2 )T . Regularized maximum likelihood estimation maximize regularized log-likelihood: log p(D) − exp(λ) a 2 .
(16)
λ ∈ R is called a regularization parameter. Regularized maximum likelihood estimators of the coefficient vector a and the valiance σ 2 are ˆ = Ly a (17) and σ ˆ2 = where
1 ˆ 2 , y − Xa N
L ≡ (X T X + exp(λ)I)−1 X T
and I is the identity matrix. Here, the effective number of the regression coefficients is
Mef f = tr XL .
(18) (19)
(20)
Therefore, Mef f is used for the number of explanatory variables M in (10)1 . 1
When the denominator of (10) is not positive value, we use ˆbcAIC (ˆ p) = ∞.
450
4.2
M. Sekino and K. Nitta
Overfitting of the Error Backpropagation Learning
The error backpropagation learning is usually used for training an artifical neural network. This method is equal to the gradient discent method based on the likelihood of the linear regression model (15), because the target function is the squared error to learning data. In what follows, we assume the noise follow a normal distribution and the normal linear regression model (15) is regular for all θM . Then, an artificial neural network becomes a set of regular linear regression models. For simplicity, let’s think about a set of regular models H ≡ {M(θM ) | θM ∈ ΘM }, where M(θM ) ≡ {p(x|θR ; θM ) | θR ∈ ΘR } is a regular model. We can define a new model MC ≡ {ˆ p(x; θM ) | θM ∈ ΘM }, where pˆ(x; θM ) is the estimation of M(θM ). Concerning model parameter θM , statistical learning methods construct the estimation of the model MC based on the empirical likelihood pˆ(D; θM ). For example, maximum likelihood estimation selects θˆMML which is the maximizer of pˆ(D; θM ): pˆML (x) = pˆ(x; θˆMML ), θˆMML ≡ argmax pˆ(D; θM ). θM
(21) (22)
Thus, the maximum likelihood estimation with respect to model parameter θM is the model selection from H based on the empirical likelihood. Because the error backpropagation is the method which realizes maximum likelihood estimation by the gradient discent method, the model M(θM ) becomes the one gradually overfitted in the latter half of the training. Therefore, we propose a learning method for model parameter θM based on unbiased likelihood which is the empirical likelihood corrected by an appropriate information criterion.
5 5.1
Unbiased Likelihood Backpropagation Learning Unbiased Likelihood
Using an information criterion IC(ˆ p, D), we define unbiased likelihood as: 1 pˆub (D) = exp − IC(ˆ p, D) . (23) 2 This unbiased likelihood satisfies 1 Eq(D) log pˆub (D) = Eq(x) log pˆ(x) N when the assumptions of the information criterion are satisfied.
(24)
Unbiased Likelihood Backpropagation Learning
5.2
451
Regular Hierarchical Model
In this paper, we consider about a certain type of hierarchical model, which we call a regular hierarchical model, defined as a set of regular models. A concise definitions of a regular hierarchical model are follows. Regular Hierarchical Model – H ≡ {M(θM ) | θM ∈ ΘM } – M(θM ) ≡ {p(x|θR ; θM ) | θR ∈ ΘR } – M(θM ) is a regular model with respect to θR . An artificial neural network is one of the regular hierarchical models. And also we define unbiased maximum likelihood estimation as follows. Unbiased Maximum Likelihood Estimation Unbiased maximum likelihood estimation pˆubML (x) is the estimation pˆ(x; θˆMubML) by the maximizer θˆMubML of the unbiased likelihood pˆub (D|θM ): pˆubML (x) ≡ pˆ(x; θˆMubML ) θˆMubML ≡ argmax pˆub (D; θM ). θM
5.3
(25) (26)
Unbiased Likelihood Backpropagation Learning
The partial differential of the unbiased likelihood is ∂ ∂ ∂ ˆ log pˆub (D; θM ) = log pˆ(D; θM ) − b(ˆ p; θM ). ∂θM ∂θM ∂θM
(27)
We define the unbiased likelihood estimation based on the gradient method with this partial differential as unbiased likelihood backpropagation learning. 5.4
Unbiased Likelihood Backpropagation Learning for an Artificial Neural Network
In this paper, we derive the unbiased likelihood backpropagation learning for an artificial neural network when the bias of the empirical likelihood is estimated by cAIC (10). The partial differential of the empirical likelihood with respect to θM which is the first term of (27), is ∂ 1 log pˆ(D; θM ) = 2 ∂θM σ ˆ ˆ ∂X a = ∂θM
ˆ ∂X a ∂θM
T
ˆ) (y − X a
T ∂X ∂X T ∂X L+L y − 2LT X T Ly. ∂θM ∂θM ∂θM
(28)
(29)
452
M. Sekino and K. Nitta
The partial differential of cAIC (10) with respect to θM , which is the second term of (27), is ∂ ˆ N (N − 1) ∂Mef f bcAIC (ˆ p; θM ) = ∂θM (N − M − 2)2 ∂θM ∂Mef f ∂X ∂X = 2 tr L − LT X T L . ∂θM ∂θM ∂θM
(30)
(31)
We can also obtain the partial differential of the unbiased likelihood with respect to λ as ∂ 1 ˆ ) exp(λ), log pˆ(D; θM ) = − 2 y T LT L(y − X a (32) ∂λ σ ˆ and ∂Mef f = −tr LT L exp(λ). (33) ∂λ Now, we have already obtain the partial differential of the unbiased likelihood with respect to θM and λ, therefore it is possible to apply the unbiased likelihood backpropagation learning to an artificial neural network.
6
Application to Kernel Regression Model
Kernel regression model is one of the artificial neural networks. The kernel regression model using gaussian kernels has the model parameter of degree one, which is the size of gaussian kernels. This model is comprehensible about the behavior of learning methods. Therefore, the kernel regression model using gaussian kernels is used in the following simulations. In the implementation of the gradient discent method, we adopted quasi-Newton method with BFGS method for estimating the Hesse matrix and golden section search for determining the modification length. 6.1
Kernel Regression Model
Kernel regression model is f (x; θM ) =
N
an K(x, xn ; θM ).
(34)
n=1
K(x, xn ; θM ) are kernel functions parametalized by model parameter θM . Gaussian kernel: x − xn 2 K(x, xn ; c) = exp − (35) 2c2 is used in the following simulations, where c is a parameter which decides the size of a gaussian kernel. Model parameter is θM = c.
Unbiased Likelihood Backpropagation Learning
453
1 emp true cAIC
log-likelihood
0.5
0
-0.5
-1 0
2
4
6 iterations
8
10
12
(a) Empirical Likelihood Backpropagation
1 emp true cAIC
log-likelihood
0.5
0
-0.5
-1 0
2
4
6 iterations
8
10
12
(b) Unbiased Likelihood Backpropagation Fig. 1. An example of the transition of mean log empirical likelihood, mean log test likelihood and mean log unbiased likelihood (cAIC)
454
6.2
M. Sekino and K. Nitta
Simulations
For the purpose of evaluation, the empirical likelihood backpropagation learning and the unbiased likelihood backpropagation learning are applied to the 8 dimensional input to 1 dimensional output regression problems of “kin-family” and “pumadyn-family” in the DELVE data set [4]. Each data has 4 combinations of fairly linear (f) or non linear (n), and medium noise (m) or high noise (h). We use 128 samples for learning data. 50 templates are chosen randomly and kernel functions are put on the templates. Fig.1 shows an example of the transition of mean log empirical likelihood, mean log test likelihood and mean log unbiased likelihood (cAIC). It shows the test likelihood of the empirical likelihood backpropagation learning (a) decrease in the latter half of the training and overfitting occurs. On the contrary, it shows the unbiased likelihood backpropagation learning (b) keeps the test likelihood close to the unbiased likelihood (cAIC) and overfitting does not occur. Table 1 shows mean and standard deviations of the mean log test likelihood for 100 experiments. The number in bold face shows it is significantly better result by the t-test at the significance level 1%. Table 1. Mean and standard deviations of the mean log test likelihood for 100 experiments. The number in bold face shows it is significantly better result by the t-test at the significance level 1%. Data Empirical BP kin-8fm 2.323 ± 0.346 kin-8fh 1.108 ± 0.240 kin-8nm −0.394 ± 0.415 kin-8nh −0.435 ± 0.236 pumadyn-8fm −2.235 ± 0.249 pumadyn-8fh −3.168 ± 0.187 pumadyn-8nm −3.012 ± 0.238 pumadyn-8nh −3.287 ± 0.255
6.3
Unbiased BP 2.531 ± 0.140 1.599 ± 0.100 0.078 ± 0.287 0.064 ± 0.048 −1.708 ± 0.056 −2.626 ± 0.021 −2.762 ± 0.116 −2.934 ± 0.055
Discussion
The reason why the results of the unbiased likelihood backpropagation learning in Table 1 shows better is attributed to the fact that the method maximizes the true likelihood averagely, because the mean of the log unbiased likelihood is equal to the log true likelihood (see (24)). The reason why the standard deviations of the test likelihood of the unbiased likelihood backpropagation learning is smaller than that of the empirical likelihood backpropagation learning is assumed to be due to the fact that the empirical likelihood prefer the model which has bigger degree of freedom. On the contrary, the unbiased likelihood prefer the model which has appropriate degree of freedom. Therefore, the variance of the estimation of the unbiased likelihood backpropagation learning becomes smaller than that of the empirical likelihood backpropagation learning.
Unbiased Likelihood Backpropagation Learning
7
455
Conclusion
In this paper, we provide an explanation about why overfitting occurs with the model selection framework. We propose the unibiased likelihood backpropagation learning, which is the gradient discent method for modifying the model parameter with unbiased likelihood (information criterion) as a target function. And we confirm the effectiveness of the proposed method by applying the method to DELVE data set.
References 1. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, Cambridge (1987) 2. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723 (1974) 3. Sugiura, N.: Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics, vol. A78, pp. 13–26 (1978) 4. Rasmussen, C.E., Neal, R.M., Hinton, G.E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., Tibshirani, R.: The DELVE manual (1996), http://www.cs.toronto.edu/∼ delve/
The Local True Weight Decay Recursive Least Square Algorithm Chi Sing Leung, Kwok-Wo Wong, and Yong Xu Department of Electronic Engineering, City University of Hong Kong, Hong Kong
[email protected] Abstract. The true weight decay recursive least square (TWDRLS) algorithm is an efficient fast online training algorithm for feedforward neural networks. However, its computational and space complexities are very large. This paper first presents a set of more compact TWDRLS equations. Afterwards, we propose a local version of TWDRLS to reduce the computational and space complexities. The effectiveness of this local version is demonstrated by simulations. Our analysis shows that the computational and space complexities of the local TWDRLS are much smaller than those of the global TWDRLS.
1
Introduction
Training multilayered feedforward neural networks (MFNNs) using recursive least square (RLS) algorithms has aroused much attention in many literatures [1, 2, 3, 4]. This is because those RLS algorithms are efficient second-order gradient descent training methods. They lead to a faster convergence when compared with first-order methods, such as the backpropagation (BP) algorithm. Moreover, fewer parameters are required to be tuned during training. Recently, Leung et. al. found that the standard RLS algorithm has an implicit weight decay effect [2]. However, its decay effect is not substantial and so its generalization ability is not very good. A true weight decay RLS (TWDRLS) algorithm is then proposed [5]. However, the computational complexity of TWDRLS is equal to O(M 3 ) at each iteration, where M is the number of weights. Therefore, it is necessary to reduce the complexity of TWDRLS so that the TWDRLS can be used for large scale practical problems. The main goal of this paper is to reduce both the computational complexity and storage requirement. In Section 2, we derive a set of concise equations for TWDRLS and give some discussions on it. We then describe a local TWDRLS algorithm in Section 3. Simulation results are presented in Section 4. We then summarize our findings in Section 5.
2
TWDRLS Algorithm
A general MFNN is composed of L layers, indexed by 1, · · · , L from input to output. There are nl neurons in layer l. The output of the i-th neuron in the M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 456–465, 2008. c Springer-Verlag Berlin Heidelberg 2008
The Local True Weight Decay Recursive Least Square Algorithm
457
l-th layer is denoted by yi,l . That means, the i-th neuron of the output layer is represented by yi,L while the i-th input of the network is represented by yi,1 . The connection weight from the j-th neuron of layer l − 1 to the i-th neuron of layer l is denoted by wi,j,l . Biases are implemented as weights and are specified by wi,(nl−1 +1),l , where l = 2, · · · , L. Hence, the number of weights in a MFNN is given by M = L l=2 (nl−1 + 1)nl . In the standard RLS algorithm, we arrange all weights as a M -dimensional vector, given by T w = w1,1,2 , · · · w1,(n1 +1),2 , · · · , wnL ,1,L , · · · , wnL ,(nL−1 +1),L . (1) The energy function up to the t-th training sample is given by E(w) =
t
T
ˆ ˆ d(τ ) − h(w, x(τ )) + [w − w(0)] P −1 (0) [w − w(0)] 2
(2)
τ =1
where x(τ ) is a n1 -dimensional input vector, d(τ ) is the desired nL -dimensional output, and h(w0 , x(τ )) is a nonlinear function that describes the function of the network. The matrix P (0) is the error covariance matrix and is usually set to δ −1 IM×M , where IM×M is a M × M identity matrix. The minimization of (2) leads to the standard RLS equations [1, 3, 6, 7], given by −1 K(t) = P (t − 1) H T (t) InL ×nL + H(t) P (t − 1) H T (t) (3) P (t) = P (t − 1) − K(t) H(t) P (t − 1) (4) ˆ ˆ − 1) + K(t) [d(t) − h(w(t ˆ − 1), x(t)) ] , w(t) = w(t (5) T ∂h(w,(t)) where H(t) = is the gradient matrix (nL × M ) of ∂w ˆ w=w(t−1)
h(w, x(t)); and K(t) is the so-called Kalman gain matrix (M × nL ) in the classical control theory. The matrix P (t) is the so-called error covariance matrix. It is symmetric positive definite. As mentioned in [5], the standard RLS algorithm only has the limited weight decay effect, being tδo per training iteration, where to is the number of training iterations. The decay effect decreases linearly as the number of training iterations increases. Hence, the more training presentations take place, the less smoothing effect would have in the data fitting process. A true weight decay RLS algorithm, namely TDRLS, was then proposed [5], where a decay term is added to the original energy function. The new energy function is given by t
E(w) =
T ˆ ˆ d(τ ) − h(w, x(τ ))2 + αw T w + [w − w(0)] P −1 (0) [w − w(0)] (6)
τ =1
where α is a regularization parameter. The gradient of E(w) is given by t ∂E(w) ˆ ≈ P −1 (0) [w − w(0)] + αw − H T (τ ) [d(τ ) − H(τ )w − ξ(τ )] . ∂w τ =1 (7)
458
C.S. Leung, K.-W. Wong, and Y. Xu
ˆ − 1). That means, In the above, we linearize h(w, x(τ )) around the estimate w(τ ˆ − 1), x(τ )) − H(τ )w(τ ˆ ) + ρ(τ ). h(w, x(τ )) = H(τ )w + ξ(τ ), where ξ(τ ) = h(w(τ To minimize the energy function, we set the gradient to zero. Hence, we have ˆ w(t) = P (t)r(t)
(8)
P −1 (t) = P −1 (t − 1) + H T (t) H(t) + αIM×M r(t) = r(t − 1) + H T (t) [d(t) − ξ(t)] .
(9) (10)
where
Δ
Furthermore, we define P ∗ (t) = [IM×M + αP (t − 1)]−1 P (t − 1). Hence, we have P ∗ −1 (t) = P −1 (t − 1) + αIM×M . With the matrix inversion lemma [7] in the recursive calculation of P (t), (8) becomes −1
P ∗ (t − 1) = [IM×M + αP (t − 1)] P (t − 1) (11) −1 ∗ T ∗ T K(t) = P (t − 1) H (t) InL ×nL + H(t) P (t − 1)H (t) (12) ∗ ∗ P (t) = P (t − 1) − K(t) H(t) P (t − 1) (13) ˆ ˆ − 1)−αP (t)w(t ˆ − 1) + K(t)[d(t) − h(w(t ˆ − 1), x(t))]. (14) w(t) = w(t Equations (11)-(14) are the general global TWDRLS equations. Those are more compact than the equations presented in [5]. Also, the weight updating equation in [5] ( i.e., (14) in this paper) is more complicated. When the regularization parameter α is set to zero, the decay term αw T w vanishes and (11)-(14) reduce to the standard RLS equations. The decay effect can be easily understood by the decay term αw T w in the energy function given by (6). As mentioned in [5], the decay effect per training iterations is equal to αw T w which does not decrease with the number of training iterations. The energy function of TWDRLS is the same as that of batch model weight decay methods. Hence, existing heuristic methods [8,9] for choosing the value of α can be used for the TWDRLS’s case. We can also explain the weight decay effect based on the recursive equations (11)-(14). The main difference between the standard RLS equations and the ˆ − 1) in TWDRLS equations is the introduction of a decay term −αP (t) w(t (14). This term guarantees that the magnitude of the updating weight vector decays an amount proportional to αP (t). Since P (t) is positive definite, the magnitude of the weight vector would not be too large. So the generalization ability of the trained networks would be better [8, 9]. A drawback of TWDRLS is the requirement in computing the inverse of the M -dimensional matrix (IM×M + αP (t − 1)). This complexity is equal to O(M 3 ) which is much larger than that of the standard RLS, O(M 2 ). Hence, the TWDRLS algorithm is computationally prohibitive even for a network with moderate size. In the next Section, a local version of the TWDRLS algorithm will be proposed to solve this large complexity problem.
The Local True Weight Decay Recursive Least Square Algorithm
3
459
Localization of the TWDRLS Algorithm
To localize the TWDRLS algorithm, we first divide the weight vector into several T small vectors, where wi,l = wi,1,l , · · · , wi,(nl−1 +1),l is denoted as the weights connecting all the neurons of layer l − 1 to the i − th neuron of layer l. We consider the estimation of each weight vector separately. When we consider the i−th neuron in layer l, we assume that other weight vectors are constant vectors. Such a technique is usually used in many numerical methods [10]. At each training iteration, we update each weight vector separately. Each neuron has its energy function. The energy function of the i − th neuron in layer l is given by E(wi,l ) =
t
2 d(τ ) − h(wi,l , x(τ )) + αw Ti,l wi,l τ =1 T
−1 ˆ i,l (0)] Pi,l ˆ i,l (0)] . + [wi,l − w (0) [w i,l − w
(15)
Utilizing a derivation process similar to the previous analysis, we obtain the following recursive equations for the local TWDRLS algorithm. Each neuron (excepting input neurons) has its set of TWDRLS equations. For the i − th neuron in layer l, the TWDRLS equations are given by
∗ Pi,l (t − 1) = I(nl−1 +1)×(nl−1 +1) + αPi,l (t − 1)
−1
Pi,l (t − 1)
∗ T ∗ T Ki,l (t) = Pi,l (t − 1) Hi,l (t) InL ×nL + Hi,l (t) Pi,l (t − 1)Hi,l (t)
Pi,l (t) =
∗ Pi,l (t
− 1) −
∗ Ki,l (t) Hi,l (t) Pi,l (t
− 1)
−1
(16) (17) (18)
ˆ i,l (t) = w ˆ i,l (t−1)−αPi,l (t)w ˆ i,l (t−1) + Ki,l (t)[d(t)−h(w ˆ i,l (t−1), x(t))], (19) w
where Hi,l is the nL × n(nl−1 +1) local gradient matrix. In this matrix, only one row associated with the considered neuron is nonzero for output layer L. Ki,l (t) is the (nl−1 + 1) × nL local Kalman gain. Pi,l (t) is the (nl−1 + 1) × (nl−1 + 1) local error covariance matrix. The follows. There training process of the local TWDRLS algorithm is as L are L n neurons (excepting input neurons). Hence, there are l l=2 l=2 nl sets of TWDRLS equations. We update the local weight vectors in accordance with a descending order of l and then an ascending order of i using (16)-(19). At each training stage, only the concerned local weight vector is updated and all other local weight vectors remain unchanged. In the global TWDRLS, the complexity mainly comes from the computing the inverse of the M -dimensional matrix (IM×M + αP (t − 1)). This complexity 3 is equal to O(M complexity is equal to T CCglobal = ). So, the computational
3 L 3 O(M ) = O . Since the size of the matrix is M × M , l=2 nl (nl−1 + 1) the space complexity (storage requirement) is equal to T CSglobal = O(M 2 ) =
2 L O . l=2 nl (nl−1 + 1) From (16), the computational cost of local TWDRLS algorithm mainly comes from the inversion of an (nl−1 + 1) × (nl−1 + 1) matrix. In this way, the
460
C.S. Leung, K.-W. Wong, and Y. Xu
computational complexity of each set of local TWDRLS equations is equal to O((nl−1 +1)3 ) and the corresponding space complexity is equal to O((nl−1 +1)2 ). Hence, the total of the local TWDRLS is given by computational complexity
L 3 T CClocal = O n (n + 1) and the space complexity (storage requirel−1 l=2 l
L 2 ment) is equal to T CSlocal = O n (n + 1) . They are much smaller l l−1 l=2 than the computational and space complexities of the global case.
4
Simulations
Two problems, the generalized XOR and the sunspot data prediction, are considered. We use three-layers networks. The initial weights are small zero-mean independent identically distributed Gaussian random variables. The transfer function of hidden neurons is a hyperbolic tangent. Since the generalized XOR is a classification problem, output neurons are with the hyperbolic tangent function. For the sunspot data prediction problem, output neurons are with the linear activation function. The training for each problem is performed 10 times with different random initial weights. 4.1
Generalized XOR Problem
The generalized XOR problem is formulated as d = sign(x1 x2 ) with inputs in the range [−1, 1]. The network has 2 input neurons, 10 hidden neurons, and 1 output neuron. As a result, there are 41 weights. The training set and test set, shown in Figure 1, consists of 50 and 2,000 samples, respectively. The total number of training cycles is set to 200. In each cycle, training samples from the training set are feeded to the network one by one. The decision boundaries obtained from typical networks trained with both global and local TWDRLS and standard RSL algorithms are plotted in Figure 2.
(a) Training samples
(b) Test samples
Fig. 1. Training and test samples for the generalized XOR problems
The Local True Weight Decay Recursive Least Square Algorithm
461
Table 1. Computational and space complexities of the global and local TWDRLS algorithms for solving the generalized XOR problem Algorithm Computational complexity Space complexitty Global O(6.89 × 104 ) O(1.68 × 103 ) 3 Local O(1.60 × 10 ) O(2.21 × 102 )
(a) Global TWDRLS, α = 0
(b) Local TWDRLS, α = 0
(c) Global TWDRLS, α = 0.00178
(d) Local TWDRLS, α = 0.00178
Fig. 2. Decision boundaries of various trained networks for the generalized XOR problem. Note that when α = 0, the TWDRLS is identical to RLS.
From Figures 1 and 2, the decision boundaries obtained from the trained networks with TWDRLS algorithm are closer to the ideal ones than those with the standard RLS algorithm. Also, both local and global TWDRLS algorithms produce a similar shape of decision boundaries. Figure 3 summarizes the average test set false rates in the 10 runs. The average test set false rates obtained by global and local TWDRLS algorithms are usually lower than those obtained by the standard RLS algorithm over a wide range of regularization parameters. That means, both global and local TWDRLS algorithms can improve the generalization ability. In terms of average false rate, the performance of the local TWDRLS algorithm is quite similar to that of the global ones. The computational and space complexities for global and local algorithms are listed in Table 1. From Figure 3 and Table 1, we can conclude that
462
C.S. Leung, K.-W. Wong, and Y. Xu
Fig. 3. Average test set false rate of 10 runs for the generalized XOR problem
the performance of local TWDRLS is comparable to that of the global ones, and that its complexities are much smaller. Figure 3 indicates that the average test set false rate first decreases with the regularization parameter α and then increases with it. This shows that a proper selection of α will indeed improve the generalization ability of the network. On the other hand, we observe that the test set false rate becomes very high at large values of α, especially for the networks trained with global TWDRLS algorithm. This is due to the fact that when the value of α is too large, the weight decay effect is very substantial and the trained network cannot learn the target function. In order to further illustrate this, we plot in Figure 4 the decision boundary obtained from the network trained with global TWDRLS algorithm for α = 0.0178. The figure shows that the network has already converged when the decision boundary is still quite far from the ideal one. This is because when the value of α is too large, the weight decay effect is too strong. That means, the regularization parameter α cannot be too large otherwise the network cannot learn the target function. 4.2
Sunspot Data Prediction
The sunspot data from 1700 to 1979 are normalized to the range [0,1] and taken as the training and the test sets. Following the common practice, we divide the data into a training set (1700 − 1920) and two test sets, namely, Test-set 1 (1921 − 1955) and Test-set 2 (1956 − 1979). The sunspot series is rather nonstationary and Test-set 2 is atypical for the series as a whole. In the simulation, we assume that the series is generated from the following auto-regressive model, given by d(t) = ϕ(d(t − 1), · · · , d(t − 12)) + (t)
(20)
where (t) is noise and ϕ(·, · · · , ·) is an unknown nonlinear function. A network with 12 input neurons, 8 hidden neurons (with hyperbolic tangent activation
The Local True Weight Decay Recursive Least Square Algorithm
463
Table 2. Computational and space complexities of the global and local TWDRLS algorithms for solving the sunspot data prediction Algorithm Computational complexity Space complexitty Global O(1.44 × 106 ) O(1.28 × 104 ) 4 Local O(1.83 × 10 ) O(1.43 × 103 )
Fig. 4. Decision boundaries of a trained network with local TWDRLS where α = 0.0178. In this case, the value of the regularization parameter is too large. Hence, the network cannot form a good decision boundary.
(a) Test-set 1 average RMSE
(b) Test-set 2 average RMSE
Fig. 5. RMSE of networks trained by global and local TWDRLS algorithms. Note that when α = 0, the TWDRLS is identical to RLS.
function), and one output neuron (with linear activation function) is used for approximating ϕ(·, · · · , ·). The total number of training cycles is equal to 200. As this is a time series problem, the training samples are feeded to the network sequentially in each iteration. The criterion to evaluate the model performance
464
C.S. Leung, K.-W. Wong, and Y. Xu
is the mean squared error (RMSE) of the test set. The experiment are repeated 10 times with different initial weights. Figure 5 summarizes the average RMSE in 10 runs. The computational and space complexities for global and local algorithms are listed in Table 2. We observe from Figure 5 that over a wide range of the regularization parameter α, both global and local TWDRLS algorithms have greatly improved the generalization ability of the trained networks, especially for test-set 2 that is quite different from the training set. However, the test RMSE becomes very large at large values of α. The reasons are similar to those stated in the last subsection. This is because at large value of α, the weight decay effect is too strong and so the network cannot learn the target function. In most cases, the performance of the local training is found to be comparable to that of the global ones. Also, Table 2 shows that those complexities of the local training are much smaller than those of the global one.
5
Conclusion
We have investigated the problem of training the MFNN model using the TWDRLS algorithms. We derive a set of concise equations for the local TWDRLS algorithm. The computational complexity and the storage requirement are reduced considerably when using the local approach. Computer simulations indicate that both local and global TWDRLS algorithms can improve the generation ability of MFNNs. The performance of the local TWDRLS algorithm is comparable to that of the global ones.
Acknowledgement The work is supported by the Hong Kong Special Administrative Region RGC Earmarked Grant (Project No. CityU 115606).
References 1. Shah, S., Palmieri, F., Datum, M.: Optimal filtering algorithm for fast learning in feedforward neural networks. Neural Networks 5, 779–787 (1992) 2. Leung, C.S., Wong, K.W., Sum, J., Chan, L.W.: A pruning method for recursive least square algorithm. Neural Networks 14, 147–174 (2001) 3. Scalero, R., Tepedelelenlioglu, N.: Fast new algorithm for training feedforward neural networks. IEEE Trans. Signal Processing 40, 202–210 (1992) 4. Leung, C.S., Sum, J., Young, G., Kan, W.K.: On the kalman filtering method in neural networks training and pruning. IEEE Trans. Neural Networks 10, 161–165 (1999) 5. Leung, C.S., Tsoi, A.H., Chan, L.W.: Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Trans. Neural Networks 12, 1314–1332 (2001) 6. Mosca, E.: Optimal Predictive and adaptive control. Prentice-Hall, Englewood Cliffs, NJ (1995)
The Local True Weight Decay Recursive Least Square Algorithm
465
7. Haykin, S.: Adaptive filter theory. Prentice-Hall, Englewood Cliffs, NJ (1991) 8. Mackay, D.: Bayesian interpolation. Neural Computation 4, 415–447 (1992) 9. Mackay, D.: A practical bayesian framework for backpropagation networks. Neural Computation 4, 448–472 (1992) 10. William H, H.: Applied numerical linear algebra. Prentice-Hall, Englewood Cliffs, NJ (1989)
Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift Keisuke Yamazaki and Sumio Watanabe Precision and Intelligence Laboratory, Tokyo Institute of Technology R2-5, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 Japan {k-yam,swatanab}@pi.titech.ac.jp
Abstract. In the standard setting of statistical learning theory, we assume that the training and test data are generated from the same distribution. However, this assumption cannot hold in many practical cases, e.g., brain-computer interfacing, bioinformatics, etc. Especially, changing input distribution in the regression problem often occurs, and is known as the covariate shift. There are a lot of studies to adapt the change, since the ordinary machine learning methods do not work properly under the shift. The asymptotic theory has also been developed in the Bayesian inference. Although many effective results are reported on statistical regular ones, the non-regular models have not been considered well. This paper focuses on behaviors of non-regular models under the covariate shift. In the former study [1], we formally revealed the factors changing the generalization error and established its upper bound. We here report that the experimental results support the theoretical findings. Moreover it is observed that the basis function in the model plays an important role in some cases.
1
Introduction
The task of regression problem is to estimate the input-output relation q(y|x) from sample data, where x, y are the input and output data, respectively. Then, we generally assume that the input distribution q(x) is generating both of the training and test data. However, this assumption cannot be satisfied in practical situations, e.g., brain computer interfacing [2], bioinformatics [3], etc. The change of the input distribution from the training q0 (x) into the test q1 (x) is referred to as the covariate shift [4]. It is known that, under the covariate shift, the standard techniques in machine learning cannot work properly, and many efficient methods to tackle this issue are proposed [4,5,6,7]. In the Bayes estimation, Shimodaira [4] revealed the generalization error improved by the importance weight on regular cases. We formally clarified the behavior of the error in non-regular cases [1]. The result shows the generalization error is determined by lower order terms, which are ignored at the situation without the covariate shift. At the same time, it appeared that the calculation of these terms is not straightforward even in a simple regular example. To cope M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 466–476, 2008. c Springer-Verlag Berlin Heidelberg 2008
Experimental Bayesian Generalization Error of Non-regular Models
467
with this problem, we also established the upper bound using the generalization error without the covariate shift. However, it is still open how to derive the theoretical generalization error in non-regular models. In this paper, we observe experimental results calculated with a Monte Carlo method and examine the theoretical upper bound on some non-regular models. Comparing the generalization error under the covariate shift to those without the shift, we investigate an effect of a basis function in the learning model and consider the tightness of the bound. In the next section, we define non-regular models, and summarize the Bayesian analyses for the models without and with the covariate shift. We show the experimental results in Section 3 and give discussions at the last.
2
Bayesian Generalization Errors with and without Covariate Shift
In this section, we summarize the asymptotic theory on the non-regular Bayes inference. At first, we define the non-regular case. Then, mathematical properties of non-regular models without the covariate shift are introduced [8]. Finally, we state the results of the former study [1], which clarified the generalization error under the shift. 2.1
Non-regular Models
Let us define the parametric learning model by p(y|x, w), where x, y and w are the input, output and parameter, respectively. When the true distribution r(y|x) is realized by the learning model, the true parameter w∗ exists, i.e., p(y|x, w∗ ) = r(y|x). The model is regular if w∗ is one point in the parameter space. Otherwise, non-regular; the true parameter is not one point but a set of parameters such that Wt = {w∗ : p(y|x, w∗ ) = r(y|x)}. For example, three-layer perceptrons are non-regular. Let the true distribution be a zero function with Gaussian noise 1 y2 r(y|x) = √ exp − , 2 2π and the learning models be a simple three-layer perceptron 1 (y − a tanh(bx))2 p(y|x, w) = √ exp − , 2 2π where the parameter w = {a, b}. It is easy to find that the true parameters are a set {a = 0}∪{b = 0} since 0×tanh(bx) = a×tanh 0 = 0. This non-regularity (also called non-identifiability) causes that the conventional statistical method cannot be applied to such models (We will mention the detail in Section 2.2). In spite of this difficulty for the analysis, the non-regular models, such as perceptrons, Gaussian mixtures, hidden Markov models, etc., are mainly employed in many information engineering fields.
468
2.2
K. Yamazaki and S. Watanabe
Properties of the Generalization Error without Covariate Shift
As we mentioned in the previous section, the conventional statistical manner does not work in the non-regular models. To cope with the issue, a method was developed based on algebraic geometry. Here, we introduce the summary. Hereafter, we denote the cases without and with the covariate shift subscript 0 and 1, respectively. Some of the functions with a suffix 0 will be replaced by those with 1 in the next section. Let {X n , Y n } = {X1 , Y1 , . . . , Xn , Yn } be a set of training samples that are independently and identically generated by the true distribution r(y|x)q0 (x). Let p(y|x, w) be a learning machine and ϕ(w) be an a priori distribution of parameter w. Then the a posteriori distribution/posterior is defined by p(w|X n , Y n ) =
n 1 p(Yi |Xi , w)ϕ(w), Z(X n , Y n ) i=1
where Z(X n , Y n ) =
n
p(Yi |Xi , w)ϕ(w)dw.
(1)
i=1
The Bayesian predictive distribution is given by n n p(y|x, X , Y ) = p(y|x, w)p(w|X n , Y n )dw. When the number of sample data is sufficiently large (n → ∞), the posterior has the peak at the true parameter(s). The posterior in a regular model is a Gaussian distribution, whose mean is asymptotically the parameter w∗ . On the other hand, the shape of the posterior in a non-regular model is not Gaussian because of Wt (cf. the right panel of Fig.10 in Section 3). We evaluate the generalization error by the average Kullback divergence from the true distribution to the predictive distribution: r(y|x) 0 G0 (n)=EX r(y|x)q0 (x) log dxdy . n ,Y n p(y|x, X n , Y n ) In the standard statistical manner, we can formally calculate the generalization error by integrating the predictive distribution. This integration is viable based on the Gaussian posterior. Therefore, this method is applicable only to regular models. The following is one of the solutions for non-regular cases. The stochastic complexity [9] is defined by F (X n , Y n ) = − log Z(X n , Y n ), (2) which can be used for selecting an appropriate model or hyper-parameters. To analyze the behavior of the stochastic complexity, the following functions play important roles: 0 n n U 0 (n) = EX )] , n ,Y n [F (X , Y 0 where EX n ,Y n [·] stands for the expectation value over r(y|x)q0 (x) and
(3)
Experimental Bayesian Generalization Error of Non-regular Models
F (X n , Y n ) = F (X n , Y n ) +
n
469
log r(Yi |Xi ).
i=1
The generalization error and the stochastic complexity are linked by the following equation [10]: G0 (n) = U 0 (n + 1) − U 0 (n).
(4)
When the learning machine p(y|x, w) can attain the true distribution r(y|x), the asymptotic expansion of F (n) is given as follows [8]. U 0 (n) = α log n − (β − 1) log log n + O(1).
(5)
The coefficients α and β are determined by the integral transforms of U 0 (n). More precisely, the rational number −α and natural number β are the largest pole and its order of J(z) = H0 (w)z ϕ(w)dw, r(y|x) H0 (w) = r(y|x)q0 (x) log dxdy. (6) p(y|x, w) J(z) is obtained by applying the inverse Laplace and Mellin transformations to exp[−U 0 (n)]. Combining Eqs.(5) and (4) immediately gives
α β−1 1 0 G (n) = − +o , n n log n n log n when G0 (n) has an asymptotic form. The coefficients α and β indicate the speed of convergence of the generalization error when the number of training samples is sufficiently large. When the learning machine cannot attain the true distribution (i.e., the model is misspecified), the stochastic complexity has an upper bound of the following asymptotic expression [11]. U 0 (n) ≤ nC + α log n − (β − 1) log log n + O(1),
(7)
where C is a non-negative constant. When the generalization error has an asymptotic form, combining Eqs.(7) and (4) gives
α β−1 1 G0 (n) ≤ C + − +o , (8) n n log n n log n where C is the bias. 2.3
Properties of the Generalization Error with Covariate Shift
Now, we introduce the results in [1]. Since the test data are distributed from r(y|x)q1 (x), the generalization error with the shift is defined by
470
K. Yamazaki and S. Watanabe
1
G
0 (n)=EX n ,Y n
r(y|x) r(y|x)q1 (x) log dxdy . p(y|x, X n , Y n )
(9)
We need the function similar to (3) given by 1 U 1 (n) = EX EX n−1 ,Y n−1 [F (X n , Y n )]. n ,Yn
(10)
Then, the variant of (4) is obtained as G1 (n) = U 1 (n + 1) − U 0 (n).
(11)
When we assume that G1 (n) has an asymptotic expansion and converges to a constant and that U i (n) has the following asymptotic expansion 1 di U i (n) = ai n + bi log n + · · · +ci + +o , n n i (n) TH
0
i (n) TL
1
it holds that G (n) and G (n) are expressed by 1 b0 G0 (n) = a0 + +o , n n 1 b0 + (d1 − d0 ) G1 (n) = a0 + (c1 − c0 ) + +o , (12) n n 1 0 and that TH (n) = TH (n). Note that b0 = α. The factors c1 −c0 and d1 −d0 determine the difference of the errors. We have also obtained that the generalization error G1 (n) has an upper bound G1 (n) ≤ M G0 (n), if the following condition is satisfied M ≡ max
x∼q0 (x)
3
q1 (x) < ∞. q0 (x)
(13)
(14)
Experimental Generalization Errors in Some Toy Models
Even though we know the factors causing the difference between G1 (n) and G0 (n) according to Eq.(12), it is not straightforward to calculate the lower order terms in TLi (n). More precisely, the solvable model is restricted to find the constant and decreasing factors ci , di in the increasing function U i (n). This implies to reveal the analytic expression of G1 (n) is still open issue in non-regular models. Here, we calculate G1 (n) with experiments and observe the behavior. A non-regular model requires the sampling from the non-Gaussian posterior in the Bayes inference. We use the Markov Chain Monte Carlo (MCMC) method to execute this task [12]. In the following examples, we use the common notations: the true distribution is defined by 1 (y − g(x))2 r(y|x) = √ exp − , 2 2π
Experimental Bayesian Generalization Error of Non-regular Models
-10
-5
0
5
10
20
15
-5
0
5
10
20
15
-5
0
5
10
15
0
5
10
15
20
-10
-10
-5
0
5
10
15
20
20
Fig. 7. (μ1 , σ1 ) = (10, 1)
-10
-5
0
5
10
15
20
Fig. 8. (μ1 , σ1 )=(10, 0.5)
-5
0
5
10
15
20
Fig. 3. (μ1 , σ1 ) = (0, 2)
-10
-5
0
5
10
15
20
Fig. 6. (μ1 , σ1 ) = (2, 2)
Fig. 5. (μ1 , σ1 ) = (2, 0.5)
Fig. 4. (μ1 , σ1 ) = (2, 1)
-10
-5
Fig. 2. (μ1 , σ1 ) = (0, 0.5)
Fig. 1. (μ1 , σ1 ) = (0, 1)
-10
-10
471
-10
-5
0
5
10
15
20
Fig. 9. (μ1 , σ1 ) = (10, 2)
The training and test distributions
the learning model is given by
1 (y − f (x, w))2 p(y|x, w) √ exp − , 2 2π
the prior ϕ(w) is a standard normal distribution, and the input distribution is of the form 1 (x − μi )2 qi (x) = √ exp − (i = 0, 1). 2σi2 2πσi The training input distribution has (μ0 , σ0 ) = (0, 1), and there are nine test distributions, where each one has the mean and variance as the combination between μ1 = {0, 2, 10} and σ1 = {1, 0.5, 2} (cf. Fig.1-9). Note that the case in Fig.1 corresponds to q0 (x). As for the experimental setting, the number of traning samples is n, the number of test samples is ntest , the number of parameter samples distributed from the posterior with the MCMC method is np , and the number of samples to have the expectation EX n ,Y n [·] is nD . In the mathematical expressions, n np 1 p(y|x, wj ) i=1 p(Yi |Xi , wj )ϕ(wj ) n n p(y|x, X , Y ) , n p np j=1 n1p nk=1 i=1 p(Yi |Xi , wk )ϕ(wk ) nD ntest 1 1 r(yj |xj ) G1 (n) log , nD i=1 ntest j=1 p(yj |xj , Di )
0.4
0.49
0.31
0.22
0.13
0.04
-0.05
-0.14
-0.23
-0.32
-0.5
K. Yamazaki and S. Watanabe
-0.41
472
4 cy en qu fre
2500 2000 1500 1000 500 0
4
−4
0
b
3
2
2
−2
1
a
0
−2
0 2
-1
-2
4
-3 -3
-2
-1
0
1
2
3
−4
4
Fig. 10. The sampling from the posteriors. The left-upper panel shows the histogram of a for the first model. The left-middle one is the histogram of a3 for the third model. The left-lower one is the point diagram of (a, b) for the second model, and the right one is its histogram.
where Di = {Xi1 , Yi1 , · · · , Xin , Yin } stands for the ith set of training data, and Di and (xj , yj ) in G1 (n) are taken from q0 (x)r(y|x) and q1 (x)r(y|x), respectively. The experimental parameters were as follows: n = 500, ntest = 1000, np = 10000, nD = 100. Example 1 (Lines with Various Parameterizations) g(x) = 0, f1 (x, a) = ax, f2 (x, a, b) = abx, f3 (x, a) = a3 x, where the true is the zero function, the learning functions are lines with the gradient a, ab and a3 . In this example, all learning functions belong to the same function class though the second model is non-regular (Wt = {a = 0} ∪ {b = 0}) and the third one has the non-Gaussian posterior. The gradient parameters are taken from the posterior depicted by Fig.10. Table 1 summarizes the results. The first row indicates the pairs (μ1 , σ1 ), and the rest does the experimental average generalization errors. G1 [fi ] stands for the error of the model with fi . M G0 [fi ] is the upper bound in each case according to Eq.(13). Note that there are some blanks in the row because of the condition Eq.(14). To compare G1 [f3 ] with G1 [f1 ], the last row shows the values 3 × G1 [f3 ] of each change. Since it is regular, the first model has theoretical results:
1 1 R 1 μ2 + σ12 G0 (n) = +o , G1 (n) = +o , R = 12 . 2n n log n 2n n log n μ0 + σ02 ‘th G1 [f1 ]’ in Table 1 is this theoretical result.
Experimental Bayesian Generalization Error of Non-regular Models
473
Table 1. Average generalization errors in Example 1 (μ1 , σ1 ) 1
th G [f1 ] G1 [f1 ] G1 [f2 ] G1 [f3 ] MG0 [f1 ] MG0 [f2 ] MG0 [f3 ]
(0,1)= G0 (0,0.5)
(0,2)
(2,1)
0.001 0.00025 0.004 0.005 0.001055 0.000239 0.004356 0.006162 0.000874 0.000170 0.003466 0.004523 0.000394 0.000059 0.001475 0.002374 — — —
0.002000 0.001356 0.000667
— — —
— — —
(2,0.5)
(2,2)
(10,1)
(10,0.5)
(10,2)
0.00425 0.008 0.101 0.10025 0.104 0.005107 0.009532 0.109341 0.106619 0.108042 0.003670 0.006669 0.079280 0.078802 0.080180 0.001912 0.003736 0.040287 0.038840 0.039682 0.028784 0.019521 0.009595
— — —
— — —
1.79×1026 1.22×1026 5.98×1025
— — —
3 × G1 [f3 ] 0.001182 0.000177 0.004425 0.007122 0.005736 0.011208 0.120861 0.116520 0.119046
Table 2. Average generalization errors in Example 2 (μ1 , σ1 ) 1
G [f4 ] G1 [f5 ] G1 [f6 ] MG0 [f4 ] MG0 [f5 ] MG0 [f5 ]
(0,1)= G0 (0,0.5)
(0,2)
(2,1)
(2,0.5)
(2,2)
(10,1)
(10,0.5)
(10,2)
0.000688 0.000260 0.002312 0.002977 0.003094 0.003423 0.013640 0.012769 0.012204 0.000251 0.000103 0.000934 0.001116 0.001291 0.001318 0.004350 0.003729 0.003743 0.000146 0.000062 0.000626 0.000705 0.000875 0.000918 0.002896 0.002357 0.002489 — — —
0.001356 0.000667 0.000400
— — —
— — —
0.019521 0.009595 0.005757
— — —
— — —
1.22×1026 5.98×1025 3.59×1025
— — —
3 × G1 [f5 ] 0.000753 0.000309 0.002802 0.003348 0.003873 0.003954 0.013050 0.011187 0.011229 5 × G1 [f6 ] 0.000730 0.000310 0.003130 0.003525 0.004375 0.004590 0.014480 0.011785 0.012445
We can find that ‘th G1 [f1 ]’ are very close to G1 [f1 ] in spite of the fact that the theoretical values are established in asymptotic cases. Based on this fact, the accuracy of experiments can be evaluated to compare them. As for f2 and f3 , they do not have any comparable theoretical value except for the upper bound. We can confirm that every value in G1 [f2 , f3 ] is actually smaller than the bound. Example 2 (Simple Neural Networks). Let assume that the true is the zero function, and the learning models are three-layer perceptrons: g(x) = 0, f4 (x, a, b) = a tanh(bx), f5 (x, a, b) = a3 tanh(bx), f6 (x, a, b) = a5 tanh(bx). Table 2 shows the results. In this example, we can also confirm that the bound works. Combining the results in the previous example, the bound tends to be tight when μ1 is small. As a matter of fact, the bound holds in small sample cases, i.e., the number of training data n does not have to be sufficiently large. Though we omit it because of the lack of space, the bound is always larger than the experimental results in n = 100, 200, . . . , 400. The property of the bound will be discussed in the next section.
4
Discussions
First, let us confirm if the sampling from the posterior was successfully done by the MCMC method. Based on the algebraic geometrical method, the coefficients of G0 (n) are derived in the models (cf. Table 3). As we mentioned, f2 , f4 and
474
K. Yamazaki and S. Watanabe Table 3. The coefficients of generalization error without the covariate shift f1 f2 , f4 f3 , f5 f6 α 1/2 1/2 1/6 1/10 β 1 2 1 1
f3 , f5 have the same theoretical error. According to the examples in the previous section, we can compare the theoretical value to the experimental one, G0 (n)[f1 ] = 0.001 0.001055, G0(n)[f2 , f4 ] = 0.000678 0.000874, 0.000688 G0 (n)[f3 , f5 ] = 0.000333 0.000394, 0.000251, G0(n)[f6 ] = 0.0002 0.000146. In the sense of the generalization error, the MCMC method worked well though there is some fluctuation in the results. Note that it is still open how to evaluate the method. Here we measured by the generalization error since the theoretical value is known in G0 (n). However, this index is just a necessary condition. To develop an evaluation of the selected samples is our future study. Next, we consider the behavior of G1 (n). In the examples, the true function was commonly the zero function g(x) = 0. It is an important case to learn the zero function because we often prepare an enough rich Kmodel in practice. Then, the learning function will be set up as f (x, w) = i=k t(w1k )h(x, w2k ), where h is the basis function t is the parameterization for its weight, and w = {w11 , w21 , w12 , w22 , . . . , w1K , w2K }. Note that many practical models are included in this expression. According to the redundancy of the function, some of h(x, w2k ) learn the zero function. Our examples provided the simplest situations and highlighted the effect of non-regularity in the learning models. The errors G0 (n) and G1 (n) are generally expressed as
α β−1 1 G0 (n) = − +o , n n log n n log n
α β−1 1 G1 (n) = R1 − R2 +o , n n log n n log n where R1 , R2 depend on f, g, q0 , and q1 . R1 and R2 cause the difference between G0 and G1 in this expression. In Eq. (12), the coefficient of 1/n is given by b0 + (d1 − d0 ). So b0 + (d1 − d0 ) d1 − d0 R1 = =1+ . b0 α Let us denote A B as “A is the only factor to determine a value of B”. As mentioned above, f, g, q0 , q1 R1 , R2 . Though f, g α, β (cf. around Eqs.(5)-(6)), we should emphasize that α, β, q0 , q1 R1 , R2 .
Experimental Bayesian Generalization Error of Non-regular Models
475
This fact is easily confirmed by comparing f2 to f4 (also f3 to f5 ). It holds that G1 (n)[f2 ] = G1 (n)[f4 ] for all q1 although they have the same α and β (G0 (n)[f2 ] = G0 (n)[f4 ]). Thus α and β are not enough informative to describe R1 and R2 . Comparing the values in G1 [f2 ] to the ones in G1 [f4 ], the basis function (x in f2 and tanh(bx) in f4 ) seems to play an important role. To clarify the effect of basis functions, let us fix the function class. Examples 1 and 2 correspond to h(x, w2 ) = x and h(x, w2 ) = tanh(bx), respectively. The values of G1 [f1 ] and 3 × G1 [f3 ] (also 3 × G1 [f5 ] and 5 × G1 [f6 ]) can be regarded as the same in any covariate shift. This implies h, g, q0 , q1 R1 , i.e. the parameterization t(w1 ) will not affect R1 . Instead, it affects the nonregularity or the multiplicity and decides α and β. Though it is an unclear factor, the influence of R2 does not seem as large as R1 . Last, let us analyze properties of the upper bound M G0 (n). According to the above discussion, it holds that R G1 /G0 ≤ M. The ratio G1 /G0 basically depends on g, h, q0 and q1 . However, M is determined by only the training and test input distributions, q0 , q1 M . Therefore this bound gives the worst case evaluation in any g and h. Considering the tightness of the bound, we can still improve it based on the relation between the true and learning functions.
5
Conclusions
In the former study, we have got the theoretical generalization error and its upper bound under the covariate shift. This paper showed that the theoretical value is supported by the experiments in spite of the fact that it is established under an asymptotic case. We observed the tightness of the bound and discussed an effect of basis functions in the learning models. In this paper, the non-regular models are simple lines and neural networks. It is an interesting issue to investigate more general models. Though we mainly considered the amount of G1 (n), the computational cost for the MCMC method strongly connects to the form of the learning function f . It is our future study to take account of the cost in the evaluation.
Acknowledgements The authors would like to thank Masashi Sugiyama, Motoaki Kawanabe, and Klaus-Robert M¨ uller for fruitful discussions. The software to calculate the MCMC method and technical comments were provided by Kenji Nagata. This research partly supported by the Alexander von Humboldt Foundation, and MEXT 18079007.
476
K. Yamazaki and S. Watanabe
References 1. Yamazaki, K., Kawanabe, M., Wanatabe, S., Sugiyama, M., M¨ uller, K.R.: Asymptotic bayesian generalization error when training and test distributions are different. In: Proceedings of the 24th International Conference on Machine Learning, pp. 1079–1086 (2007) 2. Wolpaw, J.R., Birbaumer, N., McFarland, D.J., Pfurtscheller, G., Vaughan, T.M.: Brain-computer interfaces for communication and control. Clinical Neurophysiology 113(6), 767–791 (2002) 3. Baldi, P., Brunak, S., Stolovitzky, G.A.: Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge (1998) 4. Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90, 227– 244 (2000) 5. Sugiyama, M., M¨ uller, K.R.: Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions 23(4), 249–279 (2005) 6. Sugiyama, M., Krauledat, M., M¨ uller, K.R.: Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8 (2007) 7. Huang, J., Smola, A., Gretton, A., Borgwardt, K.M., Sch¨ olkopf, B.: Correcting sample selection bias by unlabeled data. In: Sch¨ olkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19, MIT Press, Cambridge, MA (2007) 8. Watanabe, S.: Algebraic analysis for non-identifiable learning machines. Neural Computation 13(4), 899–933 (2001) 9. Rissanen, J.: Stochastic complexity and modeling. Annals of Statistics 14, 1080– 1100 (1986) 10. Watanabe, S.: Algebraic analysis for singular statistical estimation. In: Watanabe, O., Yokomori, T. (eds.) ALT 1999. LNCS (LNAI), vol. 1720, pp. 39–50. Springer, Heidelberg (1999) 11. Watanabe, S.: Algebraic information geometry for learning machines with singularities. Advances in Neural Information Processing Systems 14, 329–336 (2001) 12. Ogata, Y.: A monte carlo method for an objective bayesian procedure. Ann. Inst. Statis. Math. 42(3), 403–433 (1990)
Using Image Stimuli to Drive fMRI Analysis David R. Hardoon1 , Janaina Mour˜ ao-Miranda2, Michael Brammer2 , and John Shawe-Taylor1 1
The Centre for Computational Statistics and Machine Learning Department of Computer Science University College London Gower St., London WC1E 6BT {D.Hardoon,jst}@cs.ucl.ac.uk 2 Brain Image Analysis Unit Centre for Neuroimaging Sciences (PO 89) Institute of Psychiatry, De Crespigny Park London SE5 8AF {Janaina.Mourao-Miranda,Michael.Brammer}@iop.kcl.ac.uk
Abstract. We introduce a new unsupervised fMRI analysis method based on Kernel Canonical Correlation Analysis which differs from the class of supervised learning methods that are increasingly being employed in fMRI data analysis. Whereas SVM associates properties of the imaging data with simple specific categorical labels, KCCA replaces these simple labels with a label vector for each stimulus containing details of the features of that stimulus. We have compared KCCA and SVM analyses of an fMRI data set involving responses to emotionally salient stimuli. This involved first training the algorithm ( SVM, KCCA) on a subset of fMRI data and the corresponding labels/label vectors, then testing the algorithms on data withheld from the original training phase. The classification accuracies of SVM and KCCA proved to be very similar. However, the most important result arising from this study is that KCCA in able in part to extract many of the brain regions that SVM identifies as the most important in task discrimination blind to the categorical task labels. Keywords: Machine learning methods, Kernel canonical correlation analysis, Support vector machines, Classifiers, Functional magnetic resonance imaging data analysis.
1
Introduction
Recently, machine learning methodologies have been increasingly used to analyse the relationship between stimulus categories and fMRI responses [1,2,3,4,5,6,7,8, 9,10]. In this paper, we introduce a new unsupervised machine learning approach to fMRI analysis, in which the simple categorical description of stimulus type (e.g. type of task) is replaced by a more informative vector of stimulus features. We compare this new approach with a standard Support Vector Machine (SVM) analysis of fMRI data using a categorical description of stimulus type. M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 477–486, 2008. c Springer-Verlag Berlin Heidelberg 2008
478
D.R. Hardoon et al.
The technology of the present study originates from earlier research carried out in the domain of image annotation [11], where an image annotation methodology learns a direct mapping from image descriptors to keywords. Previous attempts at unsupervised fMRI analysis have been based on Kohonen selforganising maps, fuzzy clustering [12] and nonparametric estimation methods of the hemodynamic response function, such as the general method described in [13]. [14] have reported an interesting study which showed that the discriminability of PCA basis representations of images of multiple object categories is significantly correlated with the discriminability of PCA basis representation of the fMRI volumes based on category labels. The current study differs from conventional unsupervised approaches in that it makes use of the stimulus characteristics as an implicit representation of a complex state label. We use kernel Canonical Correlation Analysis (KCCA) to learn the correlation between an fMRI volume and its corresponding stimulus. Canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlations of the projections of the variables onto corresponding basis vectors are maximised. KCCA first projects the data into a higher dimensional feature space before performing CCA in the new feature space. CCA [15, 16] and KCCA [17] have been used in previous fMRI analysis using only conventional categorical stimulus descriptions without exploring the possibility of using complex characteristics of the stimuli as the source for feature selection from the fMRI data. The fMRI data used in the following study originated from an experiment in which the responses to stimuli were designed to evoke different types of emotional responses, pleasant or unpleasant. The pleasant images consisted of women in swimsuits while the unpleasant images were a collection of images of skin diseases. Each stimulus image was represented using Scale Invariant Feature Transformation (SIFT) [18] features. Interestingly, some of the properties of the SIFT representation have been modeled on the properties of complex neurons in the visual cortex. Although not specifically exploited in the current paper, future studies may be able to utilize this property to probe aspects of brain function such as modularity. In the current study, we present a feasibility study of the possibility of generating new activity maps by using the actual stimuli that had generated the fMRI volume. We have shown that KCCA is able to extract brain regions identified by supervised methods such as SVM in task discrimination and to achieve similar levels of accuracy and discuss some of the challenges in interpreting the results given the complex input feature vectors used by KCCA in place of categorical labels. This work is an extension of the work presented in [19]. The paper is structured as follows. Section 2 gives a review of the fMRI data acquisition as well as the experimental design and the pre-processing. These are followed by a brief description of the scale invariant feature transformation in Section 2.1. The SVM is briefly described in Section 2.2 while Section 2.2 elaborates on the KCCA methodology. Our results in Section 3. We conclude with a discussion in Section 4.
Using Image Stimuli to Drive fMRI Analysis
2
479
Materials and Methods
Due to the lack of space we refer the reader to [10] for a detailed account of the subject, data acquisition and pre-processing applied to the data as well as to the experimental design. 2.1
Scale Invariant Feature Transformation
Scale Invariant Feature Transformation (SIFT) was introduced by [18] and shown to be superior to other descriptors [20]. This is due to the SIFT descriptors being designed to be invariant to small shifts in position of salient (i.e. prominent) regions. Calculation of the SIFT vector begins with a scale space search in which local minima and maxima are identified in each image (so-called key locations). The properties of the image at each key location are then expressed in terms of gradient magnitude and orientation. A canonical orientation is then assigned to each key location to maximize rotation invariance. Robustness to reorientation is introduced by representing local image regions around key voxels in a number of orientations. A reference key vector is then computed over all images and the data for each image are represented in terms of distance from this reference. Interestingly, some of the properties of the SIFT representation have been modeled on the properties of complex neurons in the visual cortex. Although not specifically exploited in the current paper, future studies may be able to utilize this property to probe aspects of brain function such as modularity. Image Processing. Let fil be the SIFT features vector for image i where l is the number of features. Each image i has a different number of SIFT features l, making it difficult to directly compare two images. To overcome this problem we apply K-means to cluster the SIFT features into a uniform frame. Using K-means clustering we find K classes and their respective centers oj where j = 1, . . . , K. The feature vector xi of an image stimuli i is K dimensional with j’th component xi,j . The feature vectors is computed as the Gaussian measure of the minimal distance between the SIFT features fil to the centre oj . This can be represented as − minv∈f l d(v,oj )2
xi,j = exp
i
(1)
where d(., .) is the Euclidean distance. The number of centres is set to be the smallest number of SIFT features computed (found to be 300). Therefore after processing each image, we will have a 300 dimensional feature vector representing its relative distance from the cluster centres. 2.2
Methods
Support Vector Machines. Support vector machines [21] are kernel-based methods that find functions of the data that facilitate classification. They are derived from statistical learning theory [22] and have emerged as powerful tools for statistical pattern recognition [23]. In the linear formulation a SVM finds,
480
D.R. Hardoon et al.
during the training phase, the hyperplane that separates the examples in the input space according to their class labels. The SVM classifier is trained by providing examples of the form (x, y) where x represents a input and y it’s class label. Once the decision function has been learned from the training data it can be used to predict the class of a new test example. We used a linear kernel SVM that allows direct extraction of the weight vector as an image. A parameter C, that controls the trade-off between training errors and smoothness was fixed at C = 1 for all cases (default value).1 Kernel Canonical Correlation Analysis. Proposed by Hotelling in 1936, Canonical Correlation Analysis (CCA) is a technique for finding pairs of basis vectors that maximise the correlation between the projections of paired variables onto their corresponding basis vectors. Correlation is dependent on the chosen coordinate system, therefore even if there is a very strong linear relationship between two sets of multidimensional variables this relationship may not be visible as a correlation. CCA seeks a pair of linear transformations one for each of the paired variables such that when the variables are transformed the corresponding coordinates are maximally correlated. Consider the linear combination x = wa x and y = wb y. Let x and y be two random variables from a multi-dimensional distribution, with zero mean. The maximisation of the correlation between x and y corresponds to solving maxwa ,wb ρ = wa Cab wb subject to wa Caa wa = wb Cbb wb = 1. Caa and Cbb are the non-singular within-set covariance matrices and Cab is the between-sets covariance matrix. We suggest using the kernel variant of CCA [24] since due to the linearity of CCA useful descriptors may not be extracted from the data. This may occur as the correlation could exist in some non linear relationship. The kernelising of CCA offers an alternate solution by first projecting the data into a higher dimensional feature space φ : x = (x1 , . . . , xn ) → φ(x) = (φ1 (x), . . . , φN (x)) (N ≥ n) before performing CCA in the new feature space. Given the kernel functions κa and κb let Ka = Xa Xa and Kb = Xb Xb be the kernel matrices corresponding to the two representations of the data, where Xa is the matrix whose rows are the vectors φa (xi ), i = 1, . . . , from the first representation while Xb is the matrix with rows φb (xi ) from the second representation. The weights wa and wb can be expressed as a linear combination of the training examples wa = Xa α and wb = Xb β. Substituting into the primal CCA equation gives the optimisation maxα,β ρ = α Ka Kb β subject to α K2a α = β K2b β = 1. This is the dual form of the primal CCA optimisation problem given above, which can be cast as a generalised eigenvalue problem and for which the first k generalised eigenvectors can be found efficiently. Both CCA and KCCA can be formulated as an eigenproblem. The theoretical analysis shown in [25,26] suggests the need to regularise kernel CCA as it shows that the quality of the generalisation of the associated pattern function is controlled by the sum of the squares of the weight vector norms. We 1
The LibSVM toolbox for Matlab was used to perform the classifications http://www.csie.ntu.edu.tw/∼cjlin/libsvm/
Using Image Stimuli to Drive fMRI Analysis
481
refer the reader to [25, 26] for a detailed analysis and the regularised form of KCCA. Although there are advantages in using kernel CCA, which have been demonstrated in various experiments across the literature. We must clarify that in this particular work, as we are using a linear kernel in both views, regularised CCA is the same as regularised linear KCCA (since the former and latter are linear). Although using KCCA with a linear kernel has advantages over CCA, the most important of which is in our case speed, together with the regularisation.2 Using linear kernels as to allow the direct extraction of the weights, KCCA performs the analysis by projecting the fMRI volumes into the found semantic space defined by the eigenvector corresponding to the largest correlation value (these are outputted from the eigenproblem). We classify a new fMRI volume as follows; Let αi be the eigenvector corresponding to the largest eigenvalue, and let φ(ˆ x) be the new volume. We project the fMRI into the semantic space w = Xa αi (these are the training weights, similar to that of the SVM) and using the weights we are able to classify the new example as w ˆ = φ(ˆ x)w where w ˆ is a weighted value (score) for the new volume. The score can be thresholded to allocate a category to each test example. To avoid the complications of finding a threshold, we zeromean the outputs and threshold the scores at zero, where w ˆ < 0 will be associated with unpleasant (a label of −1) and w ˆ ≥ 0 will be associated with pleasant (a label of 1). We hypothesis that KCCA is able to derive additional activities that may exist a-priori, but possibly previously unknown, in the experiment. By projecting the fMRI volumes into the semantic space using the remaining eigenvectors corresponding to lower correlation values. We have attempted to corroborate this hypothesis on the existing data but found that the additional semantic features that cut across pleasant and unpleasant images did not share visible attributes. We have therefore confined our discussion here to the first eigenvector.
3
Results
Experiments were run on a leave-one-out basis where in each repeat a block of positive and negative fMRI volumes was withheld for testing. Data from the 16 subjects was combined. This amounted, per run, in 1330 training and 14 testing fMRI volumes, each set evenly split into positive and negative volumes (these pos/neg splits were not known to KCCA but simply ensured equal number of images with both types of emotional salience). The analyses were repeated 96 times. Similarly, we run a further experiment of leave-subject-out basis where 15 subjects were combined for training and one left for testing. This gave a sum total of 1260 training and 84 testing fMRI volumes. The analyses was repeated 16 times. The KCCA regularisation parameter was found using 2-fold cross validation on the training data. Initially we describe the fMRI activity analysis. After training the SVM we are able to extract and display the SVM weights as a representation of the brain 2
The KCCA toolbox used was from http://homepage.mac.com/davidrh/Code.html
482
D.R. Hardoon et al.
regions important in the pleasant/unpleasant discrimination. A thorough analysis is presented in [10]. We are able to view the results in Figures 1 and 2 where in both figures the weights are not thresholded and show the contrast between viewing Pleasant vs. Unpleasant. The weight value of each voxel indicates the importance of the voxel in differentiating between the two brain states. In Figure 1 the unthresholded SVM weight maps are given. Similarly with KCCA, once learning the semantic representation we are able to project the fMRI data into the learnt semantic feature space producing the primal weights. These weights, like those generated from the SVM approach, could be considered as a representation of the fMRI activity. Figure 2 displays the KCCA weights. In Figure 3 the unthresholded weights values for the KCCA approach with the hemodynamic function applied to the image stimuli (i.e. applied to the SIFT features prior to analysis) are displayed. The hemodynamic response function is the impulse response function which is used to model the delay and dispersion of hemodynamic responses to neuronal activation [27]. The application of the hemodynamic function to the images SIFT features allows for the reweighting of the image features according to the computed delay and dispersion model. We compute the hemodynamic function with the SPM2 toolbox with default parameter settings. As the KCCA weights are not driven by simple categorical image descriptors (pleasant/unpleasant) but by complex image feature vectors it is of great interest that many regions, especially in the visual cortex, found by SVM are also highlighted by the KCCA. We interpret this similarity as indicating that many important components of the SIFT feature vector are associated with pleasant/unpleasant discrimination. Other features in the frontal cortex are much less reproducible between SVM and KCCA indicting that many brain regions detect image differences not rooted in the major emotional salience of the images. In order to validate the activity patterns found in Figure 2 we show that the learnt semantic space can be used to correctly discriminate withheld (testing) fMRI volumes. We also give the 2−norm error to provide an indication as to
Fig. 1. The unthresholded weight values for the SVM approach showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant). The discrimination analysis on the training data was performed with labels (+1/ − 1).
Using Image Stimuli to Drive fMRI Analysis
483
Fig. 2. The unthresholded weight values for the KCCA approach showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant). The discrimination analysis on the training data was performed without labels. The class discrimination is automatically extracted from the analysis.
Fig. 3. The unthresholded weight values for the KCCA approach with the hemodynamic function applied to the image stimuli showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant).
the quality of the patterns found between the fMRI volumes and image stimuli from the testing set by Ka α − Kb β2 (normalised over the number of volumes and analyses repeats). The latter is especially important when the hemodynamic function has been applied to the image stimuli as straight forward discrimination is no longer possible to compare with. Table 1 shows the average and median performance of SVM and KCCA on the testing of pleasant and unpleasant fMRI blocks for the leave-two-block-out experiment. Our proposed unsupervised approach had achieved an average accuracy of 87.28%, slightly less than the 91.52% of the SVM. Although, both methods had the same median accuracy of 92.86%. The results of the leavesubject-out experiment are given in Table 2, where our KCCA has achieved an average accuracy of 79.24% roughly 5% less than the supervised SVM method. In both tables the Hemodynamic Function is abbreviated as HF. We are able to observe in both tables that the quality of the patterns are better than random. The results demonstrate that the activity analysis is meaningful. To further confirm the validity of the methodology we repeat the experiments with the
484
D.R. Hardoon et al.
Table 1. KCCA & SVM results on the leave-two-block-out experiment. Average and median performance over 96 repeats. The value represents accuracy, hence higher is better. For norm−2 error lower is better. Method Average Median Average · 2 error Median · 2 error KCCA 87.28 92.86 0.0048 0.0048 SVM 91.52 92.86 Random KCCA 49.78 50.00 0.0103 0.0093 Random SVM 52.68 50.00 KCCA with HF 0.0032 0.0031 Random KCCA with HF 1.1049 0.9492
Table 2. KCCA & SVM results on the leave-one-subject-out experiment. Average and median performance over 16 repeats. The value represents accuracy, hence higher is better. For norm−2 error lower is better. Method Average Median Average · 2 error Median · 2 error KCCA 79.24 79.76 0.0025 0.0024 SVM 84.60 86.90 Random KCCA 48.51 47.62 0.0052 0.0044 Random SVM 48.88 48.21 KCCA with HF 0.0016 0.0015 Random KCCA with HF 0.5869 0.0210
image stimuli randomised, hence breaking the relationship between fMRI volume and stimuli. Table 1 and 2 KCCA and SVM both show performance equivalent to the performance of a random classifier. It is also interesting to observe that when applying the hemodynamic function the random KCCA is substantially different, and worse than, the non random KCCA. Implying that the spurious correlations are found.
4
Discussion
In this paper we present a novel unsupervised methodology for fMRI activity analysis in which a simple categorical description of a stimulus type is replaced by a more informative vector of stimulus (SIFT) features. We use kernel canonical correlation analysis using an implicit representation of a complex state label to make use of the stimulus characteristics. The most interesting aspect of KCCA is its ability to extract visual regions very similar to those found to be important in categorical image classification using supervised SVM. KCCA “finds” areas in the brain that are correlated with the features in the SIFT vector regardless of the stimulus category. Because many features of the stimuli were associated with the pleasant/unpleasant categories we were able to use the KCCA results to classify the fMRI images between these categories. In the current study it is difficult to address the issue of modular versus distributed neural coding as the complexity of the stimuli (and consequently of the SIFT vector) is very high.
Using Image Stimuli to Drive fMRI Analysis
485
A further interesting possible application of KCCA relates to the detection of “inhomogeneities” in stimuli of a particular type (e.g happy/sad/disgusting emotional stimuli). If KCCA analysis revealed brain regions strongly associated with substructure within a single stimulus category this could be valuable in testing whether a certain type of image was being consistently processed by the brain and designing stimuli for particular experiments. There are many openended questions that have not been explored in our current research, which has primarily been focused on fMRI analysis and discrimination capacity. KCCA is a bi-directional technique and therefore are also able to compute a weight map for the stimuli from the learned semantic space. This capacity has the potential of greatly improving our understanding as to the link between fMRI analysis and stimuli by potentially telling us which image features were important. Acknowledgments. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST2002-506778. David R. Hardoon is supported by the EPSRC project Le Strum, EP-D063612-1. This publication only reflects the authors views. We would like to thank Karl Friston for the constructive suggestions.
References 1. Cox, D.D., Savoy, R.L.: Functional magnetic resonance imaging (fmri) ‘brain reading’: detecting and classifying distributed patterns of fmri activity in human visual cortex. Neuroimage 19, 261–270 (2003) 2. Carlson, T.A., Schrater, P., He, S.: Patterns of activity in the categorical representations of objects. Journal of Cognitive Neuroscience 15, 704–717 (2003) 3. Wang, X., Hutchinson, R., Mitchell, T.M.: Training fmri classifiers to detect cognitive states across multiple human subjects. In: Proceedings of the 2003 Conference on Neural Information Processing Systems (2003) 4. Mitchell, T., Hutchinson, R., Niculescu, R., Pereira, F., Wang, X., Just, M., Newman, S.: Learning to decode cognitive states from brain images. Machine Learning 1-2, 145–175 (2004) 5. LaConte, S., Strother, S., Cherkassky, V., Anderson, J., Hu, X.: Support vector machines for temporal classification of block design fmri data. NeuroImage 26, 317–329 (2005) 6. Mourao-Miranda, J., Bokde, A.L.W., Born, C., Hampel, H., Stetter, S.: Classifying brain states and determining the discriminating activation patterns: support vector machine on functional mri data. NeuroImage 28, 980–995 (2005) 7. Haynes, J.D., Rees, G.: Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nature Neuroscience 8, 686–691 (2005) 8. Davatzikos, C., Ruparel, K., Fan, Y., Shen, D.G., Acharyya, M., Loughead, J.W., Gur, R.C., Langleben, D.D.: Classifying spatial patterns of brain activity with machine learning methods: Application to lie detection. NeuroImage 28, 663–668 (2005) 9. Kriegeskorte, N., Goebel, R., Bandettini, P.: Information-based functional brain mapping. PANAS 103, 3863–3868 (2006)
486
D.R. Hardoon et al.
10. Mourao-Miranda, J., Reynaud, E., McGlone, F., Calvert, G., Brammer, M.: The impact of temporal compression and space selection on svm analysis of singlesubject and multi-subject fmri data. NeuroImage (accepted, 2006) 11. Hardoon, D.R., Saunders, C., Szedmak, S., Shawe-Taylor, J.: A correlation approach for automatic image annotation. In: Li, X., Za¨ıane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 681–692. Springer, Heidelberg (2006) 12. Wismuller, A., Meyer-Base, A., Lange, O., Auer, D., Reiser, M.F., Sumners, D.: Model-free functional mri analysis based on unsupervised clustering. Journal of Biomedical Informatics 37, 10–18 (2004) 13. Ciuciu, P., Poline, J., Marrelec, G., Idier, J., Pallier, C., Benali, H.: Unsupervised robust non-parametric estimation of the hemodynamic response function for any fmri experiment. IEEE TMI 22, 1235–1251 (2003) 14. O’Toole, A.J., Jiang, F., Abdi, H., Haxby, J.V.: Partially distributed representations of objects and faces in ventral temporal cortex. Journal of Cognitive Neuroscience 17(4), 580–590 (2005) 15. Friman, O., Borga, M., Lundberg, P., Knutsson, H.: Adaptive analysis of fMRI data. NeuroImage 19, 837–845 (2003) 16. Friman, O., Carlsson, J., Lundberg, P., Borga, M., Knutsson, H.: Detection of neural activity in functional MRI using canonical correlation analysis. Magnetic Resonance in Medicine 45(2), 323–330 (2001) 17. Hardoon, D.R., Shawe-Taylor, J., Friman, O.: KCCA for fMRI Analysis. In: Proceedings of Medical Image Understanding and Analysis, London, UK (2004) 18. Lowe, D.: Object recognition from local scale-invariant features. In: Proceedings of the 7th IEEE International Conference on Computer vision, Kerkyra, Greece, pp. 1150–1157 (1999) 19. Hardoon, D.R., Mourao-Miranda, J., Brammer, M., Shawe-Taylor, J.: Unsupervised analysis of fmri data using kernel canonical correlation. NeuroImag (in press, 2007) 20. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In: International Conference on Computer Vision and Pattern Recognition, pp. 257–263 (2003) 21. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000) 22. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995) 23. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: D. Proc. Fifth Ann. Workshop on Computational Learning Theory, pp. 144–152. ACM, New York (1992) 24. Fyfe, C., Lai, P.L.: Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems 10, 365–377 (2001) 25. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Computation 16, 2639–2664 (2004) 26. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004) 27. Stephan, K.E., Harrison, L.M., Penny, W.D., Friston, K.J.: Biophysical models of fmri responses. Current Opinion in Neurobiology 14, 629–635 (2004)
Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Japan
[email protected] Abstract. Reinforcement learning (RL) for a linear family of tasks is studied in this paper. The key of our discussion is nonlinearity of the optimal solution even if the task family is linear; we cannot obtain the optimal policy by a naive approach. Though there exists an algorithm for calculating the equivalent result to Q-learning for each task all together, it has a problem with explosion of set sizes. We introduce adaptive margins to overcome this difficulty.
1
Introduction
Reinforcement learning (RL) for a linear family of tasks is studied in this paper. Such learning is useful for time-varying environments, multi-criteria problems, and inverse RL [5,6]. The family is defined as a weighted sum of several criteria. This family is linear in the sense that reward is linear with respect to weight parameters. For instance, criteria of network routing include end-to-end delay, loss of packets, and power level associated with a node [5]. Selecting appropriate weights beforehand is difficult in practice and we need try and errors. In addition, appropriate weights may change someday. Parallel RL for all possible weight values is desirable in such cases. The key of our discussion is nonlinearity of the optimal solution; it is not linear but piecewise-linear actually. This fact implies that we cannot obtain the best policy by the following naive approach: 1. Find the value function for each criterion. 2. Calculate weighted sum of them to obtain the total value function. 3. Construct a policy on the basis of the total value function. A typical example is presented in section 5. Piecewise-linearity of the optimal solution has been pointed out independently in [4] and [5]. The latter aims at fast adaptation under time-varying environments. The former is our previous report, and we have tried to obtain the optimal solutions for various weight values all together. Though we have developed an algorithm that gives exactly equivalent solution to Q-learning for each weight value, it has a difficulty with explosion of set size. This difficulty is not a problem of the algorithm but an intrinsic nature of Q-learning for the weighted criterion model. M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 487–496, 2008. c Springer-Verlag Berlin Heidelberg 2008
488
K. Hiraoka, M. Yoshida, and T. Mishima
We have introduced a simple approximation with a ‘margin’ into decision of convexity first [6]. Then we have improved it so that we obtain an interval estimation and we can monitor the effect of the approximation [7]. In this paper, we propose adaptive adjustment of margins. In margin-based approach, we have to manage large sets of vectors in the first stage of learning. The peak of the set size tends to be large if we set a small margin to obtain an accurate final result. The proposed method reduces worry of this trade-off. By changing margins appropriately through learning steps, we can enjoy small set size in the first stage with large margins, and an accurate result in the final stage with small margins. The weighted criterion model is defined in section 2, and parallel RL for it is described in section 3. Then the difficulty of set size is pointed out and margins are introduced in section 4. Adaptive adjustment of margins is also proposed there. Its behavior is verified with experiments in section 5. Finally, a conclusion is given in section 6.
2
Weighted Criterion Model
An “orthodox” RL setting is assumed for states and actions as follows. – – – – –
The The The The The
time step is discrete (t = 0, 1, 2, 3, . . .). state set S and the action set A are finite and known. state transition rule P is unknown. state st is observable. task is a Markov decision process (MDP).
1 M The reward rt+1 is given as a weighted sum of partial rewards rt+1 , . . . , rt+1 :
rt+1 (β) =
M
i βi rt+1 = β · r t+1 ,
(1)
i=1
weight vector β ≡ (β1 , . . . , βM ) ∈ RM , reward vector r t+1 ≡
1 M (rt+1 , . . . , rt+1 )
M
∈R .
(2) (3)
1 M We assume that the partial rewards rt+1 , . . . , rt+1 are also observable, whereas their reward rules R(1), . . . , R(M ) are unknown. Multi-criteria RL problems of this type have been introduced independently in [3] and [5]. We hope to find the optimal policy πβ∗ for each weight β that maximizes the expected cumulative reward with a given discount factor 0 < γ < 1, ∞ πβ∗ = argmax E π γ τ rτ +1 (β) , (4) π
τ =0
π
where E [·] denotes the expectation under a policy π. To be exact, π ∗ is defined π∗
as a policy that attains Qββ (s, a; γ) = Q∗β (s, a; γ) ≡ maxπ Qπβ (s, a; γ) for all state-action pairs (s, a), where the action-value function Qπβ is defined as
Parallel Reinforcement Learning for Weighted Multi-criteria Model
Qπβ (s, a; γ)
≡E
π
∞
τ =0
γ rτ +1 (β) s0 = s, a0 = a . τ
489
(5)
It is well known that MDP has a deterministic policy πβ∗ that satisfies the above condition; such πβ∗ is obtained from the optimal value function [2], πβ∗ : S → A : s → argmax Q∗β (s, a; γ). a∈A
(6)
Thus we concentrate on estimation of Q∗β . Note that Q∗β is nonlinear with respect to β. A typical example is presented in section 5. Basic properties of the action-value function Q are described briefly in the rest of this section [4,5,6]. The discount factor γ is fixed through this paper, and it is omitted below. Proposition 1. Qπβ (s, a) is linear with respect to β for a fixed policy π. Proof. Neither P nor π depend on β from assumptions. Hence, joint distribution of (s0 , a0 ), (s1 , a1 ), (s2 , a2 ), . . . is independent of β. It implies linearity. Definition 1. If f : RM → R can be written as f (β) = maxq∈Ω (q · β) with a nonempty finite set Ω ⊂ RM , we call f Finite-Max-Linear (FML) and write it as f = FMLΩ . It is trivial that f is convex and piecewise-linear if f is FML. Proposition 2. The optimal action-value function is FML as a function of the weight β. Namely, there exists a nonempty finite set Ω ∗ (s, a) ⊂ RM for each state-action pair (s, a), and Q∗β is written as Q∗β (s, a) =
max
q∈Ω ∗ (s,a)
q · β.
(7)
Proof. We have assumed MDP. It is well known that Q∗β can be written Q∗β (s, a) = maxπ∈Π Qπβ (s, a) for the set Π of all deterministic policies. Π finite, and Qπβ is linear with respect to β from proposition 1. Hence, Q∗β FML.
as is is
Proposition 3. Assume that an estimated action-value function Qβ is FML as a function of the weight β. If we apply Q-learning, the updated new Qβ (st , at ) = (1 − α)Qβ (st , at ) + α β · rt+1 + γ max Qβ (st+1 , a) (8) a∈A
is still FML as a function of β, where α > 0 is the learning rate. Proof. There exists a nonempty finite set Ω(s, a) ⊂ RM such that Qβ (s, a) = ˜ · β, maxq∈Ω(s,a) (q · β) for each (s, a). Then (8) implies Qnew ˜q β (st , at ) = maxq ˜ ∈Ω where ˜ ≡ (1 − α)q + α(r t+1 + γq ) a ∈ A, q ∈ Ω(st , at ), q ∈ Ω(st+1 , a) , (9) Ω because maxx f (x) + maxy g(y) = maxx,y (f (x) + g(y)) holds in general. The set ˜ is finite, and Qnew is FML. Ω β These propositions imply that (1) the true Q∗β is FML, and (2) its estimation Qβ is also FML as long as the initial estimation is FML.
490
3
K. Hiraoka, M. Yoshida, and T. Mishima
Parallel Q-Learning for All Weights
A parallel Q-learning method for the weighted criterion model has been proposed in [6]. The estimation Qβ for all β ∈ RM are updated all together in parallel Q-learning. In this method, Qβ (s, a) for each (s, a) is treated in an FML expression: Qβ (s, a) =
max q · β = FMLΩ(s,a) (β)
(10)
q∈Ω(s,a)
with a certain set Ω(s, a) ⊂ RM . We store and update Ω(s, a) instead of Qβ (s, a) on the basis of propositions 2 and 3. Though a naive updating rule has been suggested in the proof of proposition 3, it is extremely redundant and inefficient. We need several definitions to describe a better algorithm. Definition 2. An element c ∈ Ω is redundant if FML(Ω−{c}) = FMLΩ . Definition 3. We use Ω † to represent non-redundant elements in Ω. Note that FMLΩ † = FMLΩ [5]. Definition 4. We define the following operations: cΩ ≡ {cq | q ∈ Ω},
c + Ω ≡ {c + q | q ∈ Ω},
K † K † Ω Ω ≡ (Ω ∪ Ω ) , Ωk ≡ Ωk , k=1
Ω ⊕ Ω ≡ {q + q | q ∈ Ω, q ∈ Ω },
(11) (12)
k=1 †
Ω Ω ≡ (Ω ⊕ Ω ) .
With these operations, the updating rule of Ω is described as follows [6]:
Ω new (st , at ) = (1 − α)Ω(st , at ) α r t+1 + γ Ω(st+1 , a) .
(13)
(14)
a∈A
The initial value of Ω at t = 0 is Ω(s, a) = {o} ⊂ RM for all (s, a) ∈ S × A. It corresponds to a constant initial function Qβ (s, a) = 0. Proposition 4. When (10) holds for all states s ∈ S and actions a ∈ A, Qnew β (st , at ) in (8) is equal to FMLΩ new (st ,at ) (β) for (14). Namely, parallel Qlearning is equivalent to Q-learning for each β: {Qβ (s, a)} FML expression
{Ω(s, a)}
update → Qnew β (st , at )
FML expression . → Ω new (st , at ) update
(15)
Parallel Reinforcement Learning for Weighted Multi-criteria Model
491
y
Ω
O
x
Ω[+]Ω’
(2) Merge and sort edges according to their arguments
Ω’
(3) Connect edges to generate a polygon
(4) Shift the origin (max x in Ω)+(max x in Ω’) 㤘(max x in Ω[+]Ω’) (max y in Ω)+(max y in Ω’) 㤘(max y in Ω[+]Ω’)
(1) Set directions of edges
Fig. 1. Calculation of Ω Ω in (14) for two-dimensional convex polygons. Vertices of polygons correspond to Ω, Ω and Ω Ω .
˜ in (9) to prove proposition 3. With the above Proof. We have introduced a set Ω operations, (9) is written as
˜ Ω = (1 − α)Ω(st , at ) ⊕ α r t+1 + Ω(st+1 , a) . a∈A †
˜ = Ω new (st , at ) is obtained and FMLΩ new (s ,a ) (β) = FML ˜ Then (Ω) t t Ω(st ,at ) (β) = new Qβ (st , at ) is implied. It is well known that Ω † is equal to the vertices in the convex hull of Ω [6]. Efficient algorithms of convex hull have been developed in computational geometry † [8]. Using them, we can calculate the merged set (Ω Ω ) = (Ω ∪ Ω ) . The sum set (Ω Ω ) have been also studied as Minkowski sum algorithms [9,10,11]. Its calculation is particularly easy for two-dimensional convex polygons (Fig.1). Before closing the present section, we note an FML version of Bellman equation in our notation. Theoretically, we can use successive iteration of this equation to find the optimal policy when we know P and R, though we must take care of numerical error in practice. Proposition 5. FML expression Q∗β = FMLΩ ∗ (β) satisfies †
Ω ∗ (s, a) = Ras + γ
a ∈A
a ∗ + Pss Ω (s , a ),
(16)
s ∈S
where i Ras = (Ras (1), . . . , Ras (M )), Ras (i) = E[rt+1 | st = s, at = a], a Pss = P (st+1 = s | st = s, at = a),
+ s ∈{s1 ,...,sk }
Xs = Xs1 Xs2 · · · Xsk .
(17) (18) (19)
492
K. Hiraoka, M. Yoshida, and T. Mishima
In particular, the next equation holds if state transition is deterministic: † Ω ∗ (s, a) = Ras + γ Ω ∗ (s , a ),
(20)
a ∈A
where s is the next state for the action a at the current state s. a Proof. Substituting (7) and Ras,β ≡ E[r t+1 (β) | st = s, at = a] = Rs · β into the ∗ a a ∗ Bellman equation Qβ (s, a) = Rs,β + γ s ∈S Pss maxa ∈A Qβ (s , a ), we obtain
max q · β, a ∗ Ω (s, a) = Ras + γ Pss q s q s ∈ Ω (s , a )
max
q∈Ω ∗ (s,a)
q·β =
q ∈Ω (s,a)
a ∈A
(22)
s ∈S
in the same way as (9). Hence, Ω ∗ is equal to Ω except for redundancy.
4
(21)
Interval Operations
Under regularity conditions, Q-learning has been proved to converge to Q∗ [1]. That result implies pointwise convergence of parallel Q-learning to Q∗β for each β because of proposition 3. From proposition 2, Q∗β (s, a) is expressed with a finite Ω ∗ (s, a). However, as we can see in Fig.1, the number of elements in the set Ω(s, a) increases monotonically and it never ‘converges’ to Ω ∗ (s, a). This is not a paradox; the following assertions can be true at the same time. 1. Vertices of polygons P1 , P2 , . . . monotonically increase. 2. Pt converges to a polygon P ∗ in the sense that the volume of the difference Pt P = (Pt ∪ P ∗ ) − (Pt ∩ P ∗ ) converges to 0. 2’. The function FMLPt (·) converges pointwise to FMLP ∗ (·). In short, pointwise convergence of a piecewise-linear function does not imply convergence of the number of pieces. Note that it is not a problem of the algorithm. It is an intrinsic nature of pointwise Q-learning of the weighted criterion model for each weight β. To overcome this difficulty, we tried a simple approximation with a small ‘margin’ at first [6]. Then we have introduced interval operations to monitor approximation error [7]. A pair of sets Ω L (s, a) and Ω U (s, a) are updated instead of the original Ω(s, a) so that CH Ω L (s, a) ⊂ CH Ω(s, a) ⊂ CH Ω U (s, a) holds, where CH Z represents the convex hull of Z. This relation implies lower and upU X per bounds QL β (s, a) ≤ Qβ (s, a) ≤ Qβ (s, a), where Qβ (s, a) = FMLΩ X (s,a) (β) L U for X = L, U . When the difference between Q and Q is sufficiently small, it is guaranteed that the effect of the approximation can be ignored. Updating rules of Ω L and Ω U are same as those of Ω, except for the following approximations after every calculation of and . We assume M = 2 here. Lower approximation for Ω L : A vertex is removed if the change of the area of CH Ω L (s, a) is smaller than a threshold L /2 (Fig.2 left).
Parallel Reinforcement Learning for Weighted Multi-criteria Model
b
c
b
d a
a
d
e
b a
c
if the area of triangle /// is small
if the area of triangle /// is small d - remove c e
493
z a
- remove b,c - add z d
Fig. 2. Lower approximation (left) and upper approximation (right)
Upper approximation for Ω U : An edge is removed if the change of the area of CH Ω U (s, a) is smaller than a threshold U /2 (Fig.2 right). In this paper, we propose an automatic adjustment of the margins L , U . The below procedures are performed at every step t after the updating of Ω L , Ω U . The symbol X represents L or U here. ξs , ξw ≥ 1 and θQ , θΩ ≥ 0 are constants. 1. Check the changes of set sizes and interval width compared with the previous ones. Namely, check these values: Xnew ∆X (st , at ) − Ω X (st , at ) , (23) Ω = Ω Unew U Lnew L ∆Q = Qβ¯ (st , at ) − Qβ¯ (st , at ) − Qβ¯ (st , at ) − Qβ¯ (st , at ) , (24) ¯ is selected beforehand. where |Z| is the number of elements in Z, and β 2. Increase of set size suggests a need of thinning, whereas increase of interval width suggests a need of more accurate calculation. Modify margins as ˜X (∆Q ≤ θQ ) X (∆X Xnew X Ω ≤ θΩ ) = X , where ˜ = . (25) X ˜ /ξw (∆Q > θQ ) ξs (∆X Ω > θΩ ) To avoid underflow, we set Xnew = min if Xnew is smaller than a constant min .
5 Experiments with a Basic Task of Weighted Criterion
We have verified the behavior of the proposed method. We set S = {S, G, A, B, X, Y}, A = {Up, Down, Left, Right}, s_0 = S, and γ = 0.8 (Fig. 3) [6]. Each action causes a deterministic state transition in the corresponding direction except at G, where the agent is moved to S regardless of its action. Rewards 1, 4b, b are offered at s_t = G, X, Y, respectively. If a_t is an action toward the 'outside' wall at s_t ≠ G, the state is unchanged and a negative reward (−1) is added further. This is a weighted criterion model with M = 2, because the reward can be written in the form r_{t+1} = β · r⃗_{t+1} with r⃗_{t+1} = (r^1_{t+1}, r^2_{t+1}) and β = (b, 1). The optimal policy changes depending on the weight b. Hence, the optimal value function is
Fig. 3. Task for experiments. Numbers in parentheses are reward values.

Table 1. Optimal state-value functions and optimal policies

Range of weight          | Optimal V*_b(S)       | Optimal state transition
b < −16/25               | 0                     | S → A → S → ···
−16/25 ≤ b < −225/1796   | (2000b + 1280)/2101   | S → A → Y → B → G → S → ···
−225/1796 ≤ b < 15/47    | (400b + 80)/61        | S → X → G → S → ···
15/47 ≤ b < 3/4          | 32b/3                 | S → X → Y → X → ···
3/4 ≤ b                  | 16b − 4               | S → X → X → ···
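Table 1 can be read directly as a piecewise-linear function of the weight b; a small sketch, useful for checking how the optimal value (and the policy switch points) depend on b:

```python
def optimal_value_S(b):
    """Optimal state value V*_b(S) as a function of the weight b (Table 1)."""
    if b < -16.0 / 25.0:
        return 0.0
    if b < -225.0 / 1796.0:
        return (2000.0 * b + 1280.0) / 2101.0
    if b < 15.0 / 47.0:
        return (400.0 * b + 80.0) / 61.0
    if b < 3.0 / 4.0:
        return 32.0 * b / 3.0
    return 16.0 * b - 4.0

# The slope of V*_b(S) changes where the optimal policy changes:
for b in (-1.0, -0.3, 0.0, 0.5, 1.0):
    print(b, optimal_value_S(b))
```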
Fig. 4. Transition of margins ε^L and ε^U from various initial margins.

Fig. 5. Total number of elements Σ_{s,a} |Ω^X(s, a)| (Left: X = L, Right: X = U).
Fig. 6. Interval width Q^U_{(0.2,1)}(A, Up) − Q^L_{(0.2,1)}(A, Up).

Fig. 7. Fixed-margin algorithm (ε^U = ε^L = 10^−2 and ε^U = ε^L = 10^−9). Left: total number of elements Σ_{s,a} |Ω^X(s, a)| for X = U, L. Right: interval width.

Fig. 8. Average of 100 trials with inappropriate factors ξ_s = 1.5, ξ_w = 1.015 for γ = 0.5. Left: total number of elements in upper approximation. Right: interval width.
nonlinear with respect to b (Table 1). Note that the second pattern (S→A→Y) in Table 1 cannot appear under the naive approach in Section 1.

The proposed algorithm is applied to this task with random actions a_t and β̄ = (0.2, 1), parameters α = 0.7, (ξ_s, ξ_w) = (1.7, 1.015), (θ_Q, θ_Ω) = (0, 2), ε_min = 10^−14. The initial margin ε^L = ε^U at t = 0 is one of 10^−1, 10^−4, 10^−7, 10^−10, 10^−13. On this task, we can replace convex hulls with upper convex hulls in our algorithm because β is restricted to the upper half plane [6]. We also assume |b| ≤ 10 ≡ b_max, and we safely remove the edges at both ends in Fig. 2 if the absolute value of their slope is greater than b_max for the lower approximation. Averages of 100 trials are shown in Figs. 4, 5, 6. The proposed algorithm is robust to a wide range of initial margins. It realizes reduced set sizes and small interval width at the same time; these requirements are a trade-off in the conventional fixed-margin algorithm [7] (Fig. 7). A problem of the proposed algorithm is its sensitivity to the factors ξ_s, ξ_w. When they are inappropriate, instability is observed after a long run (Fig. 8). Another problem is slow convergence of the interval width Q^U − Q^L compared with the fixed-margin algorithm.
6 Conclusion
A parallel RL method with adaptive margins is proposed for the weighted criterion model, and its behavior is verified experimentally on a basic task. Adaptive margins realize reduced set sizes and accurate results at the same time. One problem with the adaptive margins is instability under inappropriate parameters: although the method is robust to the initial margins, it requires tuning of the factor parameters. Another problem is slow convergence of the interval between the upper and lower estimates. These points must be studied further.
References
1. Jaakkola, T., et al.: Neural Computation 6, 1185–1201 (1994)
2. Sutton, R.S., Barto, A.G.: Reinforcement Learning. The MIT Press, Cambridge (1998)
3. Kaneko, Y., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. 167 (2004)
4. Kaneko, N., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. A-2-10 (2005)
5. Natarajan, S., et al.: In: Proc. Intl. Conf. on Machine Learning, pp. 601–608 (2005)
6. Hiraoka, K., et al.: The Brain & Neural Networks (in Japanese). Japanese Neural Network Society 13, 137–145 (2006)
7. Yoshida, M., et al.: Proc. FIT (in Japanese) (to appear, 2007)
8. Preparata, F.P., et al.: Computational Geometry. Springer, Heidelberg (1985)
9. Alexandrov, V.N., Dongarra, J., Juliano, B.A., Renner, R.S., Tan, C.J.K. (eds.): ICCS 2001. LNCS, vol. 2073. Springer, Heidelberg (2001)
10. Fukuda, K.: J. Symbolic Computation 38, 1261–1272 (2004)
11. Fogel, E., et al.: In: Proc. ALENEX, pp. 3–15 (2006)
Convergence Behavior of Competitive Repetition-Suppression Clustering
Davide Bacciu 1,2 and Antonina Starita 2
1 IMT Lucca Institute for Advanced Studies, P.zza San Ponziano 6, 55100 Lucca, Italy, [email protected]
2 Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy, [email protected]

Abstract. Competitive Repetition-suppression (CoRe) clustering is a bio-inspired learning algorithm that is capable of automatically determining the unknown cluster number from the data. In a previous work it has been shown how CoRe clustering represents a robust generalization of rival penalized competitive learning (RPCL) by means of M-estimators. This paper studies the convergence behavior of the CoRe model, based on the analysis proposed for the distance-sensitive RPCL (DSRPCL) algorithm. Furthermore, a global minimum criterion for learning vector quantization in kernel space is proposed and used to assess the correct location property of the CoRe algorithm.
1 Introduction
CoRe learning has been proposed as a biologically inspired learning model mimicking a memory mechanism of the visual cortex, i.e. repetition suppression [1]. CoRe is a soft-competitive model that allows only a subset of the most active units to learn in proportion to their activation strength, while it penalizes the least active units, driving them away from the patterns producing low firing strengths. This feature has been exploited in [2] to derive a clustering algorithm that is capable of automatically determining the unknown cluster number from the data by means of a reward-punishment procedure that resembles the rival penalization mechanism of RPCL [3]. Recently, Ma and Wang [4] have proposed a generalized loss function for the RPCL algorithm, named DSRPCL, that has been used for studying the convergence behavior of the rival penalization scheme. In this paper, we present a convergence analysis for CoRe clustering that builds on Ma and Wang's approach, describing how CoRe satisfies the three properties of separation nature, correct division and correct location [4]. The intuitive analysis presented in [4] for DSRPCL is reinforced with theoretical considerations showing that CoRe pursues a global optimality criterion for vector quantization algorithms. In order to do this, we introduce a kernel interpretation for the CoRe loss that is used to generalize the results given in [5] for hard vector quantization to kernel-based algorithms.
2 A Kernel Based Loss Function for CoRe Clustering
A CoRe clustering network consists of cluster detector units that are characterized by a prototype c_i, which identifies the preferred stimulus for the unit u_i and represents the learned cluster centroid. In addition, units are characterized by an activation function ϕ_i(x_k, λ_i), defined in terms of a set of parameters λ_i, that determines the firing strength of the unit in response to the presentation of an input pattern x_k ∈ χ. Such an activation function measures the similarity between the prototype c_i and the inputs, determining whether the pattern x_k belongs to the i-th cluster. In the remainder of the paper we will use an activation function that is a Gaussian centered in c_i with spread σ_i, i.e. ϕ_i(x_k | {c_i, σ_i}) = exp(−0.5 ||x_k − c_i||² / σ_i²). CoRe clustering works essentially by evolving a small set of highly selective cluster detectors out of an initially larger population by means of a competitive reward-punishment procedure that resembles the rival penalization mechanism [3]. Such a competition is engaged between two sets of units: at each step the most active units are selected to form the winners pool, while the remainder is inserted into the losers pool. More formally, we define the winners pool for the input x_k as the set of units u_i whose firing strength exceeds θ_win, or the single unit that is maximally active for the pattern, that is

win_k = { i | ϕ_i(x_k, {c_i, σ_i}) ≥ θ_win } ∪ { i | i = arg max_{j∈U} ϕ_j(x_k | {c_j, σ_j}) },   (1)
where the second term of the union ensures that win_k is non-empty. Conversely, the losers pool for x_k is lose_k = U \ win_k, that is, the complement of win_k with respect to the neuron set U. The units belonging to the losers pool are penalized and their response is suppressed. The strength of the penalization for the pattern x_k, at time t, is regulated by the repetition suppression RS^t_k ∈ [0, 1] and is proportional to the frequency of the pattern that has elicited the suppressive effect (see [2,6] for details). The repetition suppression is used to define a pseudo-target activation for the units in the losers pool as ϕ̂^t_i(x_k) = ϕ_i(x_k, {c_i, σ_i})(1 − RS^t_k). This reference signal forces the losers to reduce their activation proportionally to the amount of repetition suppression they receive. The error of the i-th loser unit can thus be written as

E^t_{i,k} = (1/2)(ϕ̂^t_i(x_k) − ϕ_i(x_k, {c_i, σ_i}))² = (1/2)(−ϕ_i(x_k, {c_i, σ_i}) RS^t_k)².   (2)

Conversely, in order to strengthen the activation of the winner units, we set the target activation for the neurons u_i (i ∈ win_k) to M, that is, the maximum of the activation function ϕ_i(·). The error, in this case, can be written as

Ē^t_{i,k} = (M − ϕ_i(x_k, {c_i, σ_i})).   (3)
To analyze the CoRe convergence, we give an error formulation that accumulates the residuals in (2) and (3) for a given epoch e: summing up over all CoRe units in U and the dataset χ = (x1 , . . . , xk , . . . , xK ) yields
J_e(χ, U) = Σ_{i=1}^I Σ_{k=1}^K δ_ik (1 − ϕ_i(x_k)) + Σ_{i=1}^I Σ_{k=1}^K (1 − δ_ik) ( ϕ_i(x_k) RS_k^{(e|χ|+k)} )²,   (4)
where δ_ik is the indicator function for the set win_k and where {c_i, σ_i} has been omitted from ϕ_i to ease the notation. Note that, in (4), we have implicitly used the fact that the units can be treated as independent. The CoRe learning equations can be derived using gradient descent to minimize J_e(χ, U) with respect to the parameters {c_i, σ_i} [2]. Hence, the prototype increment for the e-th epoch can be calculated as

Δc_i^e = α_c Σ_{k=1}^K [ δ_ik ϕ_i(x_k) (x_k − c_i^e)/(σ_i^e)² − (1 − δ_ik) ( ϕ_i(x_k) RS_k^{(e|χ|+k)} )² (x_k − c_i^e)/σ_i^e ],   (5)

where α_c is a suitable learning rate ensuring that J_e decreases with e. Similarly, the spread update can be calculated as

Δσ_i^e = α_σ Σ_{k=1}^K [ δ_ik ϕ_i(x_k) ||x_k − c_i^e||²/(σ_i^e)³ − (1 − δ_ik) ( ϕ_i(x_k) RS_k^{(e|χ|+k)} )² ||x_k − c_i^e||²/(σ_i^e)³ ].   (6)
As one would expect, unit prototypes are attracted by similar patterns (first term in (5)) and are repelled by the dissimilar inputs (second term in (5)). Moreover, the neural selectivity is enhanced by reducing the Gaussian spread each time the corresponding unit happens to be a winner. Conversely, the variance of loser neurons is enlarged, reducing the units' selectivity and penalizing them for not having sharp responses. The error formulation introduced so far can be restated by exploiting the kernel trick [7] to express the CoRe loss in terms of differences in a given feature space F. Kernel methods are algorithms that exploit a nonlinear mapping Φ : χ → F to project the data from the input space χ onto a convenient, implicit feature space F. The kernel trick is used to express all operations on Φ(x_1), Φ(x_2) ∈ F in terms of the inner product ⟨Φ(x_1), Φ(x_2)⟩. Such an inner product can be calculated without explicitly using the mapping Φ, by means of the kernel κ(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩. To derive the kernel interpretation for the CoRe loss in (4), consider first the formulation of the distance d_{Fκ} of two vectors x_1, x_2 ∈ χ in the feature space Fκ induced by the kernel κ and described by the mapping Φ : χ → Fκ, that is, d_{Fκ}(x_1, x_2) = ||Φ(x_1) − Φ(x_2)||²_{Fκ} = κ(x_1, x_1) − 2κ(x_1, x_2) + κ(x_2, x_2). The kernel trick [7] has been used to substitute the inner products in feature space with a suitable kernel κ calculated in the data space. If κ is chosen to be a Gaussian kernel, then we have κ(x, x) = 1. Hence d_{Fκ} can be rewritten as d_{Fκ} = ||Φ(x_1) − Φ(x_2)||²_{Fκ} = 2 − 2κ(x_1, x_2). Now, if we take x_1 to be an element of the input dataset, e.g. x_k ∈ χ, and x_2 to be the prototype c_i of the i-th CoRe unit, we can rewrite d_{Fκ} in such a way that it depends on the activation function ϕ_i. Therefore, applying the substitution κ(x_k, c_i) = ϕ_i(x_k, {c_i, σ_i}) we obtain ϕ_i(x_k, {c_i, σ_i}) = 1 − (1/2)||Φ(x_k) − Φ(c_i)||²_{Fκ}. Now, if we substitute this result into the formulation of the CoRe loss in (4), we obtain
J_e(χ, U) = (1/2) Σ_{i=1}^I Σ_{k=1}^K δ_ik ||Φ(x_k) − Φ(c_i)||²_{Fκ} + Σ_{i=1}^I Σ_{k=1}^K (1 − δ_ik) ( RS_k^{(e|χ|+k)} )² ( 1 − (1/2)||Φ(x_k) − Φ(c_i)||²_{Fκ} )².   (7)
Equation (7) states that CoRe minimizes the feature space distance between the prototype ci and those xk that are close in the kernel space induced by the activation functions ϕi , while it maximizes the feature space distance between the prototypes and those xk that are far from ci in the kernel space.
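As a rough illustration of the reward-punishment dynamics described by Eqs. (1)-(6), the following sketch moves winners towards the pattern while sharpening their spread, and repels losers in proportion to the squared suppressed activation. It is a qualitative sketch under the description above, not the authors' exact gradient-descent implementation; function names, learning rates and the omission of the neighborhood bookkeeping are ours.

```python
import numpy as np

def activation(x, c, sigma):
    """Gaussian activation phi_i(x | {c_i, sigma_i})."""
    return float(np.exp(-0.5 * np.sum((x - c) ** 2) / sigma ** 2))

def core_step(x_k, rs_k, centers, sigmas, theta_win=0.7, lr=0.05):
    """One schematic CoRe step for a single pattern x_k.

    Winners (Eq. (1)) are attracted to x_k and sharpen their spread;
    losers are repelled in proportion to (phi_i * RS_k)^2 and their
    spread is enlarged.
    """
    phis = [activation(x_k, c, s) for c, s in zip(centers, sigmas)]
    winners = {i for i, p in enumerate(phis) if p >= theta_win}
    winners.add(int(np.argmax(phis)))                      # keep win_k non-empty
    for i, (c, s) in enumerate(zip(centers, sigmas)):
        if i in winners:
            centers[i] = c + lr * phis[i] * (x_k - c)              # attraction
            sigmas[i] = max(s * (1.0 - lr * phis[i]), 1e-3)        # sharpen selectivity
        else:
            rep = (phis[i] * rs_k) ** 2
            centers[i] = c - lr * rep * (x_k - c)                  # repulsion
            sigmas[i] = s * (1.0 + lr * rep)                       # enlarge spread
    return centers, sigmas
```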
3 Separation Nature
To prove the separation nature of the CoRe process we need to demonstrate that, given a bounded hypersphere G containing all the sample data, after sufficient iterations of the algorithm the cluster prototypes will finally either fall into G or remain outside it and never get into G. In particular, those prototypes remaining outside the hypersphere will be driven far away from the samples by the RS repulsion. We consider a prototype c_i to be far away from the data if, for a given epoch e, it is in the loser pool for every x_k ∈ χ. To prove the CoRe separation nature we first demonstrate the following lemma.

Lemma 1. When a prototype c_i is far away from the data at a given epoch e, then it will always be a loser for every x_k ∈ χ and will be driven away from the data samples.

Proof. The definition of far away implies that, given c_i^e, i ∈ lose_k^e for all x_k ∈ χ, where the e in the superscript refers to the learning epoch. Given the prototype update in (5), we obtain the weight vector increment Δc_i^e at epoch e as

Δc_i^e = −α_σ Σ_{k=1}^K ( ϕ_i(x_k) RS_k^{(e|χ|+k)} )² (x_k − c_i^e) / σ_i^e.   (8)
As a result of (8), the prototype c_i^{e+1} is driven further from the data. On the other hand, by definition (1), for each of the data samples there exists at least one winner unit for every epoch e, such that its prototype is moved towards the samples for which it has been a winner. Moreover, not every prototype can be deflected from the data, since this would make the first term of J_e(χ, U) (see (4)) grow and, consequently, the whole J_e(χ, U) would diverge, since the loser error term in (4) is lower bounded. However, this would contradict the fact that J_e(χ, U) decreases with e since CoRe applies gradient descent to the loss function. Therefore, there must exist at least one winning prototype c_l^e that remains close to the samples at epoch e. On the other hand, c_i^e is already far away from the samples and, by (8), c_i^{e+1} will be further from the data and won't be a winner for any x_k ∈ χ. To prove this, consider the definition of win_k
in (1): for c_i^{e+1} to be a winner, it must hold that either (i) ϕ_i(x_k) ≥ θ_win or (ii) i = arg max_{j∈U} ϕ_j(x_k, λ_j). The former does not hold because the receptive field area where the firing strength of the i-th unit is above the threshold θ_win does not contain any sample at epoch e. Consequently, it cannot contain any sample at epoch e + 1 since its center c_i^{e+1} has been deflected further from the data. The latter does not hold since there exists at least one prototype, i.e. c_l, that remains close to the data, generating higher activations than unit u_i. As a consequence, a far away prototype c_i will be deflected from the data until it reaches a stable point where the corresponding firing strength ϕ_i is negligible.

Now we can proceed to demonstrate the following theorem.

Theorem 1. For a CoRe process there exists a hypersphere G surrounding the sample data χ such that after sufficient iterations each prototype c_i will finally either (i) fall into G or (ii) keep outside G and reach a stable point.

Proof. The CoRe process is a gradient descent (GD) algorithm on J_e(χ, U); hence, for a sufficiently small learning step, the loss decreases with the number of epochs. Therefore, J_e(χ, U) being always positive, the GD process will converge to a minimum J*. The sequences of prototype vectors {c_i^e} will converge either to a point close to the samples or to a point of negligible activation far away from the data. If a unit u_i has a sufficiently long subsequence of prototypes {c_i^e} diverging from the dataset then, after a certain time, it will no longer be a winner for any sample and, by Lemma 1, will converge to a point far away from the data. The attractors for the sequences {c_i^e} of the diverging units lie at a certain distance r from the samples, determined by those points x for which the Gaussian unit centered in x produces a negligible activation in response to any pattern x_k ∈ χ. Hence, G can be chosen as any hypersphere surrounding the samples with radius smaller than r. On the other hand, since J_e(χ, U) decreases to J*, there must exist at least one prototype that is not far away from the data (otherwise the first term of J_e(χ, U) in (4) would diverge). In this case, the sequences {c_i^e} must have accumulation points close to the samples. Therefore any hypersphere G enclosing all the samples will also surround the accumulation points of {c_i^e} and, after a certain epoch E, the sequence will always be within such a hypersphere.

In summary, Theorem 1 tells us that the separation nature holds for a CoRe process: some prototypes are possibly pushed away from the data until their contribution to the error in (4) becomes negligible. Far away prototypes will always be losers and will never head back to the data. Conversely, some prototypes will converge to the samples, heading to a saddle point of the loss J_e(χ, U) by means of a gradient descent process.
4 Correct Division and Location
Following the convergence analysis in [4] we now turn our attention to the issues of correct division and location of the weight vectors. This means that the
number of prototypes falling into G will be n_c, i.e. the number of the actual clusters in the sample data, and they will finally converge to the centers of the clusters. At this point, we leave the intuitive study presented for DSRPCL [4], introducing a sound analysis of the properties of the saddle points identified by CoRe, giving a sufficient and necessary condition for identifying the global minimum of a vector quantization loss in feature space.

4.1 A Global Minimum Condition for Vector Quantization in Kernel Space
The classical problem of hard vector quantization (VQ) in Euclidean space is to determine a codebook V = v_1, ..., v_N minimizing the total distortion, calculated by Euclidean norms, resulting from the approximation of the inputs x_k ∈ χ by the code vectors v_i. Here, we focus on a more general problem, that is, vector quantization in feature space. Given the nonlinear mapping Φ and the induced feature space norm ||·||_{Fκ} introduced in the previous sections, we aim at optimizing the distortion

min D(χ, Φ_V) = (1/K) Σ_{k=1}^K Σ_{i=1}^N δ¹_ik ||Φ(x_k) − Φ_{v_i}||²_{Fκ},   (9)

where Φ_V = {Φ_{v_1}, ..., Φ_{v_N}} represents the codebook in the kernel space and δ¹_ik is equal to 1 if the i-th cluster is the closest to the k-th pattern in the feature space Fκ, and is 0 otherwise. It is widely known that VQ generates a Voronoi tessellation of the quantized space and that a necessary condition for the minimization of the distortion requires the code vectors to be selected as the centroids of the Voronoi regions [8]. In [5], a necessary and sufficient condition is given for the global minimum of a Euclidean VQ distortion function. In the following, we generalize this result to vector quantization in feature space. To prove the global minimum condition in kernel space we need to extend the results in [9] (Propositions 3.1.7 and 3.2.4) to the more general case of a kernel-induced distance metric. Therefore we introduce the following lemma.

Lemma 2. Let κ be a kernel and Φ : χ → Fκ a map into the corresponding feature space Fκ. Given a dataset χ = x_1, ..., x_K partitioned into N subsets C_i, define the feature space mean Φ_χ = (1/|χ|) Σ_{k=1}^K Φ(x_k) and the i-th partition centroid Φ_{v_i} = (1/|C_i|) Σ_{k∈C_i} Φ(x_k); then we have

Σ_{k=1}^K ||Φ(x_k) − Φ_χ||²_{Fκ} = Σ_{i=1}^N Σ_{k∈C_i} ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + Σ_{i=1}^N |C_i| ||Φ_{v_i} − Φ_χ||²_{Fκ}.   (10)
Proof. Given a generic feature vector Φ_1, consider the identity Φ(x_k) − Φ_1 = (Φ(x_k) − Φ_{v_i}) + (Φ_{v_i} − Φ_1); its squared norm in feature space is

||Φ(x_k) − Φ_1||²_{Fκ} = ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + ||Φ_{v_i} − Φ_1||²_{Fκ} + 2(Φ(x_k) − Φ_{v_i})^T (Φ_{v_i} − Φ_1).
Summing over all the elements in the i-th partition we obtain

Σ_{k∈C_i} ||Φ(x_k) − Φ_1||²_{Fκ} = Σ_{k∈C_i} ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + |C_i| ||Φ_{v_i} − Φ_1||²_{Fκ} + 2 Σ_{k∈C_i} (Φ(x_k) − Φ_{v_i})^T (Φ_{v_i} − Φ_1)
= Σ_{k∈C_i} ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + |C_i| ||Φ_{v_i} − Φ_1||²_{Fκ}.   (11)

The last term in (11) vanishes since Σ_{k∈C_i} (Φ(x_k) − Φ_{v_i}) = 0 by definition of Φ_{v_i}. Now, applying the substitution Φ_1 = Φ_χ and summing up over all the N partitions yields

Σ_{k=1}^K ||Φ(x_k) − Φ_χ||²_{Fκ} = Σ_{i=1}^N Σ_{k∈C_i} ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + Σ_{i=1}^N |C_i| ||Φ_{v_i} − Φ_χ||²_{Fκ},   (12)

where the left side of the equality holds since ⋃_{i=1}^N C_i = χ and ⋂_{i=1}^N C_i = ∅.
Using the results from Lemma 2 we can proceed with the formulation of the global minimum criterion by generalizing the results of Proposition 1 in [5] to vector quantization in feature space.

Proposition 1. Let {Φ^g_{v_1}, ..., Φ^g_{v_N}} be a global minimum solution to the problem in (9); then we have

Σ_{i=1}^N |C^g_i| ||Φ^g_{v_i} − Φ_χ||²_{Fκ} ≥ Σ_{i=1}^N |C_i| ||Φ_{v_i} − Φ_χ||²_{Fκ}   (13)
for any locally optimal solution {Φ_{v_1}, ..., Φ_{v_N}} to (9), where {C^g_1, ..., C^g_N} and {C_1, ..., C_N} are the partitions of χ corresponding to the centroids Φ^g_{v_i} = (1/|C^g_i|) Σ_{k∈C^g_i} Φ(x_k) and Φ_{v_i} = (1/|C_i|) Σ_{k∈C_i} Φ(x_k), respectively, and where Φ_χ is the dataset mean (see the definition in Lemma 2).
N
Φ(xk ) − Φvig 2Fκ ≤
i=1 k∈Cig
N
Φvi − Φχ 2Fκ
(14)
i=1 k∈Ci
for any local minimum {Φv1 , . . . , ΦvN }. From Lemma 2 we have that K
Φ(xk ) − Φχ 2Fκ =
k=1
Φ(xk ) − Φvig 2Fκ +
i=1 k∈Cig
k=1 K
N
Φ(xk ) − Φχ 2Fκ =
N
Since (14) holds, we obtain
|Cig |Φvig − Φχ 2Fκ (15)
i=1
Φ(xk ) − Φvi 2Fκ +
i=1 k∈Ci
N
N
g g i=1 |Ci |Φvi
N
|Ci |Φvi − Φχ 2Fκ . (16)
i=1
−
Φχ 2Fκ
≥
N i=1
|Ci |Φvi − Φχ 2Fκ
4.2 Correct Division and Location for CoRe Clustering
To evaluate the correct division and location properties we first analyze the case when the number of units N is equal to the true cluster number n_c. Consider the loss in (4) as decomposed into a winner- and a loser-dependent term, i.e. J_e(χ, U) = J^win_e(χ, U) + J^lose_e(χ, U). By definition, J^win_e(χ, U) = Σ_{i=1}^{n_c} Σ_{k=1}^K δ_ik (1 − ϕ_i(x_k)) must have at least one minimum point. Applying the necessary condition ∂J^win_e(χ, U)/∂c_i = 0 we obtain an estimate of the prototypes by means of fixed point iteration, that is

c_i^e = Σ_{k=1}^K δ_ik ϕ_i(x_k) x_k / Σ_{k=1}^K δ_ik ϕ_i(x_k).   (17)

When the number of prototypes equals the number of clusters, the fixed point iteration in (17) converges by positioning each unit weight vector close to the true cluster centroids. In addition, it can be shown that (17) approximates a local minimum of the kernel vector quantization loss in (9). To prove this, consider the CoRe loss formulation in kernel space (7): we have J^win_e(χ, U) = (1/2) Σ_{i=1}^{n_c} Σ_{k=1}^K δ_ik ||Φ(x_k) − Φ(c_i)||²_{Fκ}, where c_i is estimated by (17).

Now, consider the VQ loss in (9): a necessary condition for its minimization requires the computation of the cluster centroids as Φ_{v_i} = (1/|C_i|) Σ_{k∈C_i} Φ(x_k). The exact calculation of Φ_{v_i} requires knowing the form of the implicit nonlinear mapping Φ to solve the so-called pre-image problem [10], that is, determining z such that Φ(z) = Φ_{v_i}. Unfortunately, such a problem is insolvable in the general case [10]. However, instead of calculating the exact pre-image we can search for an approximation by seeking the z minimizing ρ(z) = ||Φ_{v_i} − Φ(z)||²_{Fκ}, that is, the feature space distance between the centroid in kernel space and the mapping of its approximated pre-image. Rather than optimizing ρ(z), it is easier to minimize the distance between Φ_{v_i} and its orthogonal projection onto the span of Φ(z). Due to space limitations, we omit the technicalities of this calculation (see [10] for further details). It turns out that the minimization of ρ(z) reduces to the evaluation of the gradient of ρ'(z) = ⟨Φ_{v_i}, Φ(z)⟩. By substituting the definition of Φ_{v_i} and applying the kernel trick we obtain

ρ'(z) = (1/|C_i|) Σ_{k,j∈C_i} κ(x_k, x_j) + κ(z, z) + (1/|C_i|) Σ_{k∈C_i} κ(x_k, z),

where κ(z, z) = 1 since we are using a Gaussian kernel. Differentiating ρ'(z) with respect to z and solving by fixed point iteration yields

z^e = Σ_{k∈C_i} κ(x_k, z^{e−1}) x_k / Σ_{k∈C_i} κ(x_k, z^{e−1}),   (18)

which is the same as the prototype estimate obtained in (17) for Gaussian kernels centered in z^e. The indicator function δ_ik in (17) is non-null only for those points x_k for which unit u_i was in the winner set. This does not ensure the partition conditions over χ, since, by definition of win_k, some points can be associated with
two or more winners. However, by (6) we know that the variance of the winners tends to reduce as learning proceeds. Therefore, using the same arguments as Gersho [8], it can be demonstrated that, after a certain epoch E, the CoRe winners competition becomes a WTA process in which δ_ik ensures the partition conditions over χ. Summarizing, the minimization of the CoRe winners error J^win_e(χ, U) generates an approximate solution to the vector quantization problem in feature space in (9). As a consequence, the prototypes c_i become a local solution satisfying the conditions of Proposition 1. Hence, substituting the definition of Φ_χ into the results of Proposition 1 we obtain that {c_1, ..., c_{n_c}} is an approximated global minimum for (9) if and only if

Σ_{i=1}^{n_c} Σ_{k=1}^K (|C_i|/K) ||Φ(c_i) − Φ(x_k)||²_{Fκ} ≥ Σ_{i=1}^{n_c} Σ_{k=1}^K (|C̃_i|/K) ||Φ(c̃_i) − Φ(x_k)||²_{Fκ}   (19)
holds for every {c̃_1, ..., c̃_{n_c}} that are approximated pre-images of a local minimum for (9). In summary, a global optimum to (9) should minimize the feature space distance between the prototypes and the samples belonging to their cluster, while maximizing the weight vector distance from the sample mean or, equivalently, the distance from all the samples in the dataset χ. The loser component J^lose_e(χ, U) in the kernel CoRe loss (7) depends on the term (1 − (1/2)||Φ(c_i) − Φ(x_k)||²_{Fκ})², which maximizes the distance between the prototypes c_i and those x_k that do not fall in the respective Voronoi sets C_i. Hence, J^lose_e(χ, U) produces a distortion in the estimate of c_i that pursues the global optimality criterion, except for the fact that it discounts the repulsive effect of the x_k ∈ C_i. In fact, (19) suggests that c_i has to be repelled by all the x_k ∈ χ. On the other hand, the estimate c_i is a linear combination of the x_k ∈ C_i: applying the repulsive effect in (19) would subtract their contribution, either canceling the attractive effect (which would be catastrophic) or simply scaling the magnitude of the learning step without changing the final direction. Hence, the CoRe loss makes a reasonable assumption by discarding the repulsive effect of the x_k ∈ C_i when calculating the estimate of c_i. Summarizing, CoRe locates the prototypes close to the centroids of the n_c clusters by means of (17), escaping from local minima of the loss function by approximating the global minimum condition of Proposition 1. Finally, we need to study the behavior of J_e(χ, U) as the number of units N varies with respect to the true cluster number n_c. Using the same motivations as in [4], we see that the winner-dependent loss J^win_e tends to reduce as the number of units increases. However, if the number of units falling into G is larger than n_c, there will be a number of clusters that are erroneously split. Therefore, the samples from these clusters will tend to produce an increased level of error in J^lose_e, contrasting the reduction of J^win_e. On the other hand, J^lose_e will tend to reduce when the number of units inside G is lower than n_c. This, however, will produce increased levels of J^win_e since the prototype allocation won't match the underlying sample distribution. Hence, the CoRe error will have its minimum when the number of units inside G approximates n_c.
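For a Gaussian kernel, the fixed-point iteration (18) for the approximate pre-image of a feature-space centroid is straightforward to implement; a minimal sketch with illustrative data and parameter names of our choosing:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return float(np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2)))

def preimage_fixed_point(X_cluster, sigma=1.0, n_iter=50):
    """Approximate pre-image z of the feature-space centroid of X_cluster
    via the fixed-point iteration of Eq. (18)."""
    z = X_cluster.mean(axis=0)            # a reasonable starting point
    for _ in range(n_iter):
        w = np.array([gaussian_kernel(x, z, sigma) for x in X_cluster])
        z = (w[:, None] * X_cluster).sum(axis=0) / w.sum()
    return z

# Example: for a tight cluster the pre-image stays close to its Euclidean mean.
rng = np.random.default_rng(1)
cluster = rng.normal(loc=[2.0, -1.0], scale=0.3, size=(20, 2))
print(preimage_fixed_point(cluster))
```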
5 Conclusion
The paper presents a sound analysis of the convergence behavior of CoRe clustering, showing how the minimization of the CoRe cost function satisfies the properties of separation nature, correct division and location [4]. As the loss reduces to a minimum, the CoRe algorithm is shown to converge, allocating the correct number of prototypes to the centers of the clusters. Moreover, a sound optimality criterion is given that shows how the CoRe gradient descent pursues a global minimum of the vector quantization problem in feature space. The results presented in the paper hold for a batch gradient descent process. However, it can be proved that, under Ljung's conditions [11], they can be extended to stochastic (online) gradient descent. Moreover, we plan to investigate further the properties of the CoRe kernel formulation, extending the convergence analysis to a wider class of activation functions other than Gaussians, i.e. normalized kernels.
References
1. Grill-Spector, K., Henson, R., Martin, A.: Repetition and the brain: neural models of stimulus-specific effects. Trends in Cognitive Sciences 10(1), 14–23 (2006)
2. Bacciu, D., Starita, A.: A robust bio-inspired clustering algorithm for the automatic determination of unknown cluster number. In: Proceedings of the 2007 International Joint Conference on Neural Networks, pp. 1314–1319. IEEE, Los Alamitos (2007)
3. Xu, L., Krzyzak, A., Oja, E.: Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. on Neural Networks 4(4) (1993)
4. Ma, J., Wang, T.: A cost-function approach to rival penalized competitive learning (RPCL). IEEE Trans. on Systems, Man, and Cybernetics 36(4), 722–737 (2006)
5. Munoz-Perez, J., Gomez-Ruiz, J.A., Lopez-Rubio, E., Garcia-Bernal, M.A.: Expansive and competitive learning for vector quantization. Neural Processing Letters 15(3), 261–273 (2002)
6. Bacciu, D., Starita, A.: Competitive repetition suppression learning. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 130–139. Springer, Heidelberg (2006)
7. Scholkopf, B., Smola, A., Muller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
8. Yair, E., Zeger, K., Gersho, A.: Competitive learning and soft competition for vector quantizer design. IEEE Trans. on Signal Processing 40(2), 294–309 (1992)
9. Spath, H.: Cluster Analysis Algorithms. Ellis Horwood (1980)
10. Scholkopf, B., Mika, S., Burges, C.J.C., Knirsch, P., Muller, K.R., Ratsch, G., Smola, A.J.: Input space versus feature space in kernel-based methods. IEEE Trans. on Neural Networks 10(5), 1000–1017 (1999)
11. Ljung, L.: Strong convergence of a stochastic approximation algorithm. The Annals of Statistics 6(3), 680–696 (1978)
Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma Kyushu Institute of Technology, Faculty of Engineering, 1-1 Sensui-cho Tobata-ku Kitakyushu, 804-8550, Japan
[email protected]

Abstract. Adaptive Subspace Self-Organizing Map (ASSOM) is an evolution of the Self-Organizing Map, where each computational unit defines a linear subspace. Recently, a modified version, where each unit defines a linear manifold instead of the linear subspace, has been proposed. The linear manifold in a unit is represented by a mean vector and a set of basis vectors. After training, these units result in a set of linear variety detectors. From another point of view, we can consider that the AMSOM represents the latent commonality of data as linear structures. In numerous cases, however, these are not enough to describe the latent commonality of data because of their linearity. In this paper, nonlinear varieties are considered in order to represent the diversity of data in a class. The effectiveness of the proposed method is verified by applying it to some simple classification problems.
1 Introduction
The subspace method is popular in pattern recognition, feature extraction, compression, classification and signal processing [1]. Unlike other techniques, where classes are primarily defined as regions or zones in the feature space, the subspace method uses linear subspaces that are defined by a set of normalized basis vectors. One linear subspace is usually associated with one class. An input vector is classified to a particular class if its projection error onto the subspace associated with that class is the minimum. The subspace method, as compared to other pattern recognition techniques, has advantages in applications where the relative intensities or energies of the vector components are more important than the overall level of the signal. It also provides an economical representation for groups of vectors with high dimensionality, since one can often use a small set of basis vectors to approximate the subspace where the vectors reside. Another paradigm is to use a mixture of local subspaces to collectively model the data space. The Adaptive-Subspace Self-Organizing Map (ASSOM) [2][3] is a mixture-of-local-subspaces method for pattern recognition. ASSOM, which is an evolution
of the Self-Organizing Map (SOM) [4], consists of an input layer and a competitive layer arranging some computational units in a line or a lattice structure. Each computational unit defines a subspace spanned by some basis vectors. ASSOM creates a set of subspace representations by competitive selection and cooperative learning. In SOM, a set of reference vectors is spatially organized to partition the input space. In ASSOM, a set of reference sub-models is topologically ordered, with each sub-model responsible for describing a specific region of the input space by its local principal subspace. The ASSOM is attractive not only because it inherits the topographic representation property of the SOM, but also because the learning results of ASSOM can faithfully describe the core features of various transformation groups. The simulation results in [2] and [3] have illustrated that different feature filters can be self-organized in different low-dimensional subspaces and that a wavelet-type representation does emerge in the learning. Recently, the Adaptive Manifold Self-Organizing Map (AMSOM), which is a modified version of ASSOM, has been proposed [5]. AMSOM has the same structure as the ASSOM, except for the way each computational unit is represented. Each unit in AMSOM defines an affine subspace which is composed of a mean vector and a set of basis vectors. By incorporating a mean vector into each unit, the recognition performance has been improved significantly. The simulation results in [5] have shown that AMSOM outperforms the linear PCA-based method and ASSOM in a face recognition problem. In both ASSOM and AMSOM, the local subspace in each unit can be adapted by linear PCA learning algorithms. On the other hand, it is known that there are a number of advantages in introducing nonlinearities into a PCA-type network with reproducing kernels [6][13]. For example, the performance of the subspace method is affected by the dimensionality of the intersections of subspaces [1]. In other words, the dimensionality of each subspace should be as low as possible in order to achieve successful performance. It is, however, not enough to describe the variation in a class of patterns by a low-dimensional subspace because of its linearity. From this consideration, we propose a nonlinear extended version of the AMSOM with reproducing kernels. The proposed method can be expected to construct nonlinear varieties so that an effective representation of data belonging to the same category is achieved with low dimensionality. The effectiveness of the proposed method is verified by applying it to some simple pattern classification problems.
2 Adaptive Manifold Self-Organizing Map (AMSOM)
In this section, we give a brief review of the original AMSOM. Fig. 1 shows the structure of the AMSOM. It consists of an input layer and a competitive layer, in which n and M units are included, respectively. Suppose i ∈ {1, · · · , M} indexes the computational units in the competitive layer, and n is the dimensionality of the input vector. The i-th computational unit constructs an affine subspace, which is composed of a mean vector μ^{(i)} and a subspace spanned by H basis
Fig. 1. A structure of the Adaptive Manifold Self-Organizing Map (AMSOM)
Fig. 2. Affine Subspace in a computational unit (i)
vectors b_h^{(i)}, h ∈ {1, · · · , H}. First of all, we define the orthogonal projection of an input vector x onto the affine subspace of the i-th unit as

x̂^{(i)} = μ^{(i)} + Σ_{h=1}^H (φ^{(i)T} b_h^{(i)}) b_h^{(i)},   (1)

where φ^{(i)} = x − μ^{(i)}. Therefore the projection error is represented as

x̃^{(i)} = φ^{(i)} − Σ_{h=1}^H (φ^{(i)T} b_h^{(i)}) b_h^{(i)}.   (2)
Figure 2 shows a schematic interpretation of the orthogonal projection and the projection error of an input vector onto the affine subspace defined in a unit i. The AMSOM is a more general strategy than the ASSOM, where each computational unit solely defines a subspace. To illustrate why this is so, let us consider a very simple case: suppose we are given two clusters as shown in Fig. 3(a). It
Fig. 3. (a) Clusters in 2-dimensional space: An example of the case which can not be separated without a mean value. (b) Two 1-dimensional affine subspaces to approximate and classify clusters.
is not possible to use one-dimensional subspaces, that is, lines intersecting the origin O, to approximate the clusters. This is true even if the global mean is removed, so that the origin O is translated to the centroid of the two clusters. However, two one-dimensional affine subspaces can easily approximate the clusters as shown in Fig. 3(b), since the basis vectors are aligned in the direction that minimizes the projection error. In the AMSOM, the input vectors are grouped into episodes in order to present them to the network as input sets. For pattern classification, an episode input is defined as a subset of training data belonging to the same category. Assume that the number of input vectors in the subset is E; then an episode input ω_q in class q is denoted as ω_q = {x_1, x_2, · · · , x_E}, ω_q ⊆ Ω_q, where Ω_q is the set of training patterns belonging to class q. The set of input vectors of an episode has to be recognized as one class, such that any member of this set, and even an arbitrary linear combination of them, should have the same winning unit. The training process in AMSOM has the following steps:

(a) Winner lookup. The unit that gives the minimum projection error for an episode is selected. The unit is denoted as the winner, whose index is c. The decision criterion for the winner c is

c = arg min_i Σ_{e=1}^E ||x̃_e^{(i)}||²,  i ∈ {1, · · · , M}.   (3)
(b) Learning. For each unit i, and for each x_e, update μ^{(i)}:

μ^{(i)}(t + 1) = μ^{(i)}(t) + λ_m(t) h_{ci}(t) (x_e − μ^{(i)}(t)),   (4)

where λ_m(t) is the learning rate for μ^{(i)} at learning epoch t and h_{ci}(t) is the neighborhood function at learning epoch t with respect to the winner c. Both λ_m(t) and h_{ci}(t) are monotonically decreasing functions of t. In this paper, λ_m(t) = λ_m^{ini}(λ_m^{fin}/λ_m^{ini})^{t/t_max} and h_{ci}(t) = exp(−|c − i|/γ(t)), γ(t) = γ^{ini}(γ^{fin}/γ^{ini})^{t/t_max} are used. Then the basis vectors are updated as

b_h^{(i)}(t + 1) = b_h^{(i)}(t) + λ_b(t) h_{ci}(t) [ φ_e^{(i)}(t)^T b_h^{(i)}(t) / ( ||φ_e^{(i)}(t)|| ||φ̂_e^{(i)}(t)|| ) ] φ_e^{(i)}(t),   (5)

where φ_e^{(i)}(t) is the relative input vector in manifold i after the mean vector has been updated, i.e. φ_e^{(i)}(t) = x_e − μ^{(i)}(t + 1); φ̂_e^{(i)}(t) is the orthogonal projection of the relative input vector, φ̂_e^{(i)}(t) = Σ_{h=1}^H (φ_e^{(i)}(t)^T b_h^{(i)}(t)) b_h^{(i)}(t); and λ_b(t) is the learning rate for the basis vectors, which is also a monotonically decreasing function of t. In this paper, λ_b(t) = λ_b^{ini}(λ_b^{fin}/λ_b^{ini})^{t/t_max} is used.

After the learning phase, a categorization phase determines the class association of each unit: each unit is labeled by the class index for which it is selected as the winner most frequently when the input data for learning are applied to the AMSOM again.
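A compact sketch of one AMSOM training step on an episode, following the reconstruction of Eqs. (3)-(5) above. The dictionary-based unit representation, the default rates (taken from the experiment settings reported later) and the final re-orthonormalization of the basis are our choices, not the authors' code.

```python
import numpy as np

def project_error(x, mu, B):
    """Squared projection error of x onto the affine subspace (mu, B);
    B has (approximately) orthonormal basis columns b_1..b_H (Eq. (2))."""
    phi = x - mu
    return float(np.sum((phi - B @ (B.T @ phi)) ** 2))

def amsom_episode_step(episode, units, t, t_max, lam_m=(0.1, 0.01),
                       lam_b=(0.1, 0.01), gamma=(3.0, 0.03)):
    """One AMSOM step on an episode. `units` is a list of dicts
    {'mu': mean vector, 'B': matrix whose columns are the basis vectors}."""
    decay = lambda p: p[0] * (p[1] / p[0]) ** (t / t_max)
    lm, lb, g = decay(lam_m), decay(lam_b), decay(gamma)
    # (a) winner lookup: minimum accumulated projection error (Eq. (3))
    errs = [sum(project_error(x, u['mu'], u['B']) for x in episode) for u in units]
    c = int(np.argmin(errs))
    # (b) learning with the neighborhood function h_ci(t) = exp(-|c - i| / gamma(t))
    for i, u in enumerate(units):
        h = np.exp(-abs(c - i) / g)
        for x in episode:
            u['mu'] += lm * h * (x - u['mu'])                      # Eq. (4)
            phi = x - u['mu']
            phi_hat = u['B'] @ (u['B'].T @ phi)                    # projection of phi
            denom = np.linalg.norm(phi) * np.linalg.norm(phi_hat) + 1e-12
            for hh in range(u['B'].shape[1]):                      # Eq. (5)
                b = u['B'][:, hh]
                u['B'][:, hh] = b + lb * h * (phi @ b) / denom * phi
            u['B'], _ = np.linalg.qr(u['B'])                       # keep basis orthonormal
    return c
```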
3 Self-Organizing Clustering with Nonlinear Varieties

3.1 Reproducing Kernels
Reproducing kernels are functions k : X² → R which for all pattern sets

{x_1, · · · , x_l} ⊂ X   (6)

give rise to positive matrices K_ij := k(x_i, x_j). Here, X is some compact set in which the data resides, typically a subset of R^n. In the field of Support Vector Machines (SVM), reproducing kernels are often referred to as Mercer kernels. They provide an elegant way of dealing with nonlinear algorithms by reducing them to linear ones in some feature space F nonlinearly related to the input space: using k instead of a dot product in R^n corresponds to mapping the data into a possibly high-dimensional dot product space F by a (usually nonlinear) map Φ : R^n → F, and taking the dot product there, i.e. [11]

k(x, y) = (Φ(x), Φ(y)).   (7)
By virtue of this property, we shall call Φ a feature map associated with k. Any linear algorithm which can be carried out in terms of dot products can be made
nonlinear by substituting an a priori chosen kernel. Examples of such algorithms include the potential function method [12], SVM [7][8] and kernel PCA [9]. The price that one has to pay for this elegance, however, is that the solutions are only obtained as expansions in terms of input patterns mapped into feature space. For instance, the normal vector of an SV hyperplane is expanded in terms of Support Vectors, just as the kernel PCA feature extractors are expressed in terms of training examples,

Ψ = Σ_{i=1}^l α_i Φ(x_i).   (8)
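For a concrete instance of Eq. (7), the degree-2 homogeneous polynomial kernel in R² has an explicit three-dimensional feature map, and any kernel Gram matrix is positive semidefinite. A small check, with the example data being arbitrary:

```python
import numpy as np

def phi_poly2(x):
    """Explicit feature map for the homogeneous polynomial kernel of degree 2
    in R^2: k(x, y) = (x . y)^2 = <phi(x), phi(y)>."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
lhs = (x @ y) ** 2                      # kernel evaluated in input space
rhs = phi_poly2(x) @ phi_poly2(y)       # dot product in feature space
assert abs(lhs - rhs) < 1e-12

# Gram matrices built from a kernel are positive semidefinite:
X = np.random.default_rng(0).normal(size=(5, 2))
K = (X @ X.T) ** 2
assert np.all(np.linalg.eigvalsh(K) > -1e-10)
```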
3.2 AMSOM in the Feature Space

The AMSOM in the high-dimensional feature space F is considered. In the proposed method, the varieties defined by a computational unit i in the competitive layer take the nonlinear form

M_i = { Φ(x) | Φ(x) = Φ(μ^{(i)}) + Σ_{h=1}^H ξ Φ(b_h^{(i)}) },   (9)
where ξ ∈ R. Given a training data set {x_1, · · · , x_N}, the mean vector and the basis vectors in a unit i are represented in the following form:

Φ(μ^{(i)}) = Σ_{l=1}^N α_l^{(i)} Φ(x_l),   (10)

Φ(b_h^{(i)}) = Σ_{l=1}^N β_{hl}^{(i)} Φ(x_l),   (11)

respectively. α_l^{(i)} in Eq. (10) and β_{hl}^{(i)} in Eq. (11) are the parameters adjusted by learning. The derivation of the training procedure in the proposed method is given as follows.

(a) Winner lookup. The norm of the orthogonal projection error onto the i-th affine subspace with respect to the present input x_p is calculated as
||Φ(x̃_p)^{(i)}||² = k(x_p, x_p) + Σ_{h=1}^H (P_h^{(i)})² − 2 Σ_{l=1}^N α_l^{(i)} k(x_p, x_l) + Σ_{l1=1}^N Σ_{l2=1}^N α_{l1}^{(i)} α_{l2}^{(i)} k(x_{l1}, x_{l2}) + 2 Σ_{h=1}^H Σ_{l1=1}^N Σ_{l2=1}^N P_h^{(i)} α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}) − 2 Σ_{h=1}^H Σ_{l=1}^N P_h^{(i)} β_{hl}^{(i)} k(x_p, x_l),   (12)
where P_h^{(i)} denotes the orthogonal projection component of the present input x_p onto the basis Φ(b_h^{(i)}) and is calculated by

P_h^{(i)} = Σ_{l=1}^N β_{hl}^{(i)} k(x_p, x_l) − Σ_{l1=1}^N Σ_{l2=1}^N α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}).   (13)
The reproducing kernels generally used are as follows:

k(x_s, x_t) = (x_s^T x_t)^d,  d ∈ N,   (14)
k(x_s, x_t) = (x_s^T x_t + 1)^d,  d ∈ N,   (15)
k(x_s, x_t) = exp(−||x_s − x_t||²/(2σ²)),  σ ∈ R,   (16)

where N and R are the set of natural numbers and the set of reals, respectively. Eq. (14), Eq. (15) and Eq. (16) are referred to as homogeneous polynomial kernels, non-homogeneous polynomial kernels and Gaussian kernels, respectively. The winner for an episode input ω_q = {Φ(x_1), · · · , Φ(x_E)} is decided in the same manner as in the AMSOM, as follows:

c = arg min_i Σ_{e=1}^E ||Φ(x̃_e)^{(i)}||²,  i ∈ {1, · · · , M}.   (17)
(b) Learning. The learning rules for α_l^{(i)} and β_{hl}^{(i)} are as follows:

Δα_l^{(i)} = −α_l^{(i)}(t) λ_m(t) h_{ci}(t) + λ_m(t) h_{ci}(t)  for l = e,  and  Δα_l^{(i)} = −α_l^{(i)}(t) λ_m(t) h_{ci}(t)  for l ≠ e,   (18)

Δβ_{hl}^{(i)} = −α_l^{(i)}(t+1) λ_b(t) h_{ci}(t) T_h^{(i)}(t) + λ_b(t) h_{ci}(t) T_h^{(i)}(t)  for l = e,  and  Δβ_{hl}^{(i)} = −α_l^{(i)}(t+1) λ_b(t) h_{ci}(t) T_h^{(i)}(t)  for l ≠ e,   (19)

where

T_h^{(i)}(t) = Φ(φ_e^{(i)}(t))^T Φ(b_h^{(i)}(t)) / ( ||Φ(φ̂_e^{(i)}(t))|| ||Φ(φ_e^{(i)}(t))|| ),   (20)

||Φ(φ̂_e^{(i)}(t))|| = ( Σ_{h=1}^H [ Σ_{l=1}^N β_{hl}^{(i)} k(x_e, x_l) − Σ_{l1=1}^N Σ_{l2=1}^N α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}) ]² )^{1/2},   (21)

||Φ(φ_e^{(i)}(t))|| = ( k(x_e, x_e) − 2 Σ_{l=1}^N α_l^{(i)} k(x_e, x_l) + Σ_{l1=1}^N Σ_{l2=1}^N α_{l1}^{(i)} α_{l2}^{(i)} k(x_{l1}, x_{l2}) )^{1/2},   (22)

Φ(φ_e^{(i)}(t))^T Φ(b_h^{(i)}(t)) = Σ_{l=1}^N β_{hl}^{(i)} k(x_e, x_l) − Σ_{l1=1}^N Σ_{l2=1}^N α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}).   (23)
4
Experimental Results
Data Distribution 1. To analyze the effect of reducing the dimensionality for the intersections of subspaces by the proposed method, the data as shown in Fig.4(a) is used. For this data, although a set of each class can be approximated by 1-dimensional linear manifold in the input space R2 , the intersections of subspace could be happend between class 1 and class 2, and between class 2 and class 3 , even if the optimal linear manifold for each class can be obtained. However, the linear manifold in the high-dimensional space, that is the nonlinear manifold in input space, can be expected to classify the given data by reduction effect of the dimensionality for the intersections of subspaces. As the result of simulation, the given input data are classified correctly by the proposed method. Figure 4(b) and (c) are the decision regions constructed by AMSOM and the proposed method, respectively. In this experiment, 3 units were used in the competitive layer. The class associations to each unit are class 1 to unit 1, class 2 to unit 2, class 3 to unit 3, respectively. In this simulation, the experimental parameters are assigned as follows: the f in ini totoal number of epochs tmax = 100, λini = 0.1, m = 0.1, λm = 0.01, λb f in ini ini λb = 0.01, γ = 3, γ = 0.03, both in AMSOM and in the proposed method in common, H = 1 in AMSOM, and H = 2, k(x, y) = (xT y + 1)3 in the proposed method. From the experiment, it was shown that the the proposed method has the ability to reduce the dimensionality for intersections of subspaces. Data Distribution 2. To verify that the proposed method has the ability to describe the nonlinear manifolds, the data as shown in Fig.5(a) is used. This case is of impossible to describe each class by a linear manifold, that is 1-dimensional affine subspace. As the result of simulation, the given input data are classified correctly by the proposed method. Figure 5(b) and (c) are the decision regions constructed by AMSOM and the proposed method, respectively. In this experiment, 3 units were used in the competitive layer. The class associations to each unit are class 1 to unit 3, class 2 to unit 2, class 3 to unit 1, respectively. In this simulation, the experimental parameters are assigned as follows: the
Self-Organizing Clustering with Map 5
5
4
4
Unit 3
3
3
2
2
1
Unit 1
1
Unit 1
Unit 2
0
x2
x2
515
-1
-2
-2
-3
-3
-4
-4
-5 -5 -4 -3 -2 -1
0
x1
1
2
3
4
Unit 2
0
-1
Unit 3
-5 -5 -4 -3 -2 -1
5
(b)
(a)
0
x1
1
2
3
4
5
(c)
Fig. 4. (a) Training data used in the second experiment. (b) Decision regions learned by AMSOM and (c) the proposed method.
4
4
4
class 1 class 2 class 3
3
2
2
2
1
1
1
0
x2
3
x2
x2
3
0
0
-1
-1
-1
-2
-2
-2
-3
-3
-4
-3
-4 -4
-3
-2
-1
0
1
x1
(a)
2
3
4
-4 -4
-3
-2
-1
0
x1
(b)
1
2
3
4
-4
-3
-2
-1
0
1
2
3
4
x1
(c)
Fig. 5. (a) Training data used in the second experiment. (b) Decision regions learned by AMSOM and (c) the proposed method.

total number of epochs t_max = 100, λ_m^{ini} = 0.1, λ_b^{ini} = 0.1, λ_m^{fin} = 0.01, λ_b^{fin} = 0.01, γ^{ini} = 3, γ^{fin} = 0.03, both in AMSOM and in the proposed method in common, H = 1 in AMSOM, and H = 2, k(x, y) = (x^T y)² in the proposed method. From the experiment, it was shown that the proposed method has the ability to extract suitable nonlinear manifolds efficiently.
5 Conclusions
A clustering method with a map of nonlinear varieties was proposed as a new pattern classification method. The proposed method extends AMSOM to a nonlinear method in a straightforward way by applying the kernel method. The effectiveness of the proposed method was verified by the experiments. The proposed algorithm promises applications in a wide range of practical problems.
References
1. Oja, E.: Subspace Methods of Pattern Recognition. Research Studies Press (1983)
2. Kohonen, T.: Emergence of Invariant-Feature Detectors in the Adaptive-Subspace Self-Organizing Map. Biol. Cybern. 75, 281–291 (1996)
3. Kohonen, T., Kaski, S., Lappalainen, H.: Self-Organizing Formation of Various Invariant-Feature Filters in the Adaptive Subspace SOM. Neural Comput. 9, 1321–1344 (1997)
4. Kohonen, T.: Self-Organizing Maps. Springer, Berlin, Heidelberg, New York (1995)
5. Liu, Z.Q.: Adaptive Subspace Self-Organizing Map and Its Application in Face Recognition. International Journal of Image and Graphics 2(4), 519–540 (2002)
6. Saitoh, S.: Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England (1988)
7. Cortes, C., Vapnik, V.: Support Vector Networks. Machine Learning 20, 273–297 (1995)
8. Vapnik, V.: The Nature of Statistical Learning Theory, 2nd edn. Springer, New York, Berlin, Heidelberg (1995)
9. Schölkopf, B., Smola, A.J., Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Technical Report 44, Max-Planck-Institut für biologische Kybernetik (1996)
10. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.R.: Fisher discriminant analysis with kernels. Neural Networks for Signal Processing IX, 41–48 (1999)
11. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152 (1992)
12. Aizerman, M., Braverman, E., Rozonoer, L.: Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control 25, 821–837 (1964)
13. Schölkopf, B., Smola, A.J.: Learning with Kernels. The MIT Press, Cambridge (2002)
An Automatic Speaker Recognition System P. Chakraborty1, F. Ahmed1, Md. Monirul Kabir2, Md. Shahjahan1, and Kazuyuki Murase2,3 1
Department of Electrical & Electronic Engineering, Khulna University of Engineering and Technology, Khulna-920300, Bangladesh 2 Dept. of Human and Artificial Intelligence Systems, Graduate School of Engineering 3 Research and Education Program for Life Science, University of Fukui, 3-9-1 Bunkyo, Fukui 910-8507, Japan
[email protected],
[email protected]

Abstract. Speaker recognition is the process of identifying a speaker by analyzing the spectral shape of the voice signal. This is done by extracting and matching features of the voice signal. Mel-Frequency Cepstrum Coefficients (MFCC) is the feature extraction technique, producing a set of coefficients that serve as the extracted feature; this feature is taken as the input of the Vector Quantization process. Vector Quantization (VQ) is the feature matching technique, in which a VQ codebook is generated by clustering the training vectors of each speaker around predefined spectral vectors in a training session. Finally, test data are matched against the trained data by searching for the nearest neighbor. The system correctly recognizes the speakers, where music and speech data (both in English and in Bengali) are used for the recognition process. The correct recognition rate is almost ninety percent, which compares favorably with Hidden Markov Models (HMM) and Artificial Neural Networks (ANN).

Keywords: MFCC: Mel-Frequency Cepstrum Coefficient, DCT: Discrete Cosine Transform, IIR: Infinite Impulse Response, FIR: Finite Impulse Response, FFT: Fast Fourier Transform, VQ: Vector Quantization.
1 Introduction
Speaker recognition is the process of automatic recognition of the person who is speaking on the basis of individual information included in speech waves. This paper deals with an automatic speaker recognition system using Vector Quantization. There are other techniques for speaker recognition, such as Hidden Markov Models (HMM) and Artificial Neural Networks (ANN). We have used VQ because of its lower computational complexity [1]. There are two main modules, feature extraction and feature matching, in any speaker recognition system [1, 2]. The speaker-specific features are extracted using a Mel-Frequency Cepstrum Coefficient (MFCC) processor. A set of mel-frequency cepstrum coefficients is found, which are called acoustic vectors [3]. These are the extracted features of the speakers. These
acoustic vectors are used in feature matching with the vector quantization technique, in which a VQ codebook is generated from the training data. Test data are then matched against the trained data by searching for the nearest neighbor. The system correctly recognizes the speakers, where music and speech data (both in English and in Bengali) are used for the recognition process. This work is done with about 70 spectral data items, and the correct recognition rate is almost ninety percent; this is better than Hidden Markov Models (HMM) and Artificial Neural Networks (ANN), for which the correct recognition rate is below ninety percent. Future work is to generate a VQ codebook with many predefined spectral vectors, so that many training data can be added to that codebook in a training session; the main problem is that the network size and training time become prohibitively large with increasing data size. To overcome these limitations, a time alignment technique can be applied, so that a continuous speaker recognition system becomes possible. There are several earlier works on feature matching and identification. Lawrence Rabiner and B. H. Juang proposed the Mel-Frequency Cepstrum Coefficient (MFCC) method to extract the features and Vector Quantization as the feature matching technique [1]. Lawrence Rabiner and R. W. Schafer discussed the performance of the MFCC processor on the basis of several theoretical concepts [2]. S. B. Davis and P. Mermelstein described the characteristics of acoustic speech [3]. Linde, Buzo and Gray proposed the LBG algorithm to generate a VQ codebook by a splitting technique [4]. S. Furui described speaker-independent word recognition using dynamic features of the speech spectrum [5]. S. Furui also presented the overall speaker recognition technology using the MFCC and VQ methods [6].
2 Methodology
A general model of a speaker recognition system for several people is shown in Fig. 1. The model consists of four building blocks. The first is data extraction, which converts wave data stored in audio wave format into a form suitable for further computer processing and analysis. The second is pre-processing, which involves filtering,
Fig. 1. Block diagram of speaker recognition system
removing pauses, silences and weak unvoiced sounds, and detecting the valid speech signal. The third block is feature extraction, where speech features are extracted from the speech signal; the selected features carry enough information to recognize a speaker. In the fourth block, a class label is assigned to each word uttered by each speaker by examining the extracted features and comparing them with the classes learnt during the training phase. Vector quantization is used as the identifier.
3 Pre-processing
A digital filter is a mathematical algorithm, implemented in hardware and/or software, that operates on a digital input signal to produce a digital output signal in order to achieve a filtering objective. Digital filters often operate on digitized analog signals or simply on numbers representing some variable stored in computer memory. Digital filters are broadly divided into two classes, namely infinite impulse response (IIR) and finite impulse response (FIR) filters. We chose an FIR filter because FIR filters can have an exactly linear phase response, which means that no phase distortion is introduced into the signal by the filter. This is an important requirement in many applications, for example data transmission, biomedicine, digital audio and image processing; the phase responses of IIR filters are non-linear, especially at the band edges. When a machine is continuously listening to speech, a difficulty arises in determining where a word starts and stops. We solve this problem by examining the magnitude of several consecutive samples of sound: if the magnitude of these samples is large enough, those samples are kept and examined later [1].
Fig. 2. Sample Speech Signal
Fig. 3. Example of speech signal after it has been cleaned
One thing that is clearly noticeable in the example speech signal is that there is a lot of empty space where nothing is being said, so we simply remove it. An example speech signal is shown before cleaning in Figure 2 and after cleaning in Figure 3. After the empty space is removed, the signal is much shorter: in this case the signal had about 13,000 samples before it was cleaned, and after running it through the clean function it contained only 2,600 samples. This has several advantages. The amount of time required to perform calculations on 13,000 samples is much larger than that required for 2,600 samples, while the cleaned sample still contains all the important data required for the analysis of the speech. The sample produced from the cleaning process is then fed into the other parts of the ASR system.
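The cleaning step described above can be sketched as follows. The frame length and threshold ratio are illustrative assumptions, not values taken from the paper, and the function name is our own.

```python
import numpy as np

def clean_speech(signal, frame_len=80, threshold_ratio=0.03):
    """Drop near-silent regions: keep a block of consecutive samples only
    if its mean absolute magnitude exceeds a fraction of the signal peak."""
    threshold = threshold_ratio * np.max(np.abs(signal))
    kept = [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)
            if np.mean(np.abs(signal[i:i + frame_len])) > threshold]
    return np.concatenate(kept) if kept else np.empty(0)
```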
4 Feature Extraction
The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, namely:
• Linear Prediction Coding (LPC)
• Mel-Frequency Cepstrum Coefficients (MFCC)
• Linear Predictive Cepstral Coefficients (LPCC)
• Perceptual Linear Prediction (PLP)
• Neural Predictive Coding (NPC)
Among these, we used MFCC because it is the best known and most popular. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency; filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [1, 2].
4.1 Mel-Frequency Cepstrum Processor
A diagram of the structure of an MFCC processor is given in Figure 4. The speech input is typically recorded at a sampling rate above 10000 Hz; this sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs are shown to be less susceptible to the above-mentioned variations than the speech waveforms themselves [5, 6].
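The mel scale described above is commonly implemented with the following approximation; the exact variant and constants used by the authors are not stated in this text, so the values below are an assumption.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common mel-scale approximation: roughly linear below 1 kHz and
    logarithmic above it."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)
```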
Fig. 4. Block diagram of the MFCC processor
4.1.1 Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by
N − M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N − 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values are N = 256 and M = 100.
4.1.2 Windowing
The next processing step is windowing. Windowing minimizes the signal discontinuities at the beginning and end of each frame; the idea is to reduce spectral distortion by using the window to taper the signal to zero at the frame boundaries. If we define the window as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples per frame, then the result of windowing is the signal
y_l(n) = x_l(n) w(n),  0 ≤ n ≤ N − 1.  (1)
The following window types are commonly used:
▪ Hamming
▪ Rectangular
▪ Bartlett (triangular)
▪ Hanning
▪ Kaiser
▪ Lanczos
▪ Blackman-Harris
Among these, we mainly used the Hamming window because of its ease of mathematical computation; it is defined as:
w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1.  (2)
Besides this, we've also used the Hanning window and the Blackman-Harris window. As an example, a Hamming window with 256 samples is shown here.
Fig. 5. Hamming window with 256 speech sample
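A minimal sketch of the frame blocking and windowing steps with the typical values N = 256 and M = 100 stated in the text; the function name and the decision to drop trailing samples that do not fill a whole frame are our own choices.

```python
import numpy as np

def frame_and_window(signal, N=256, M=100):
    """Block the signal into overlapping frames of N samples with a hop of
    M samples and apply the Hamming window of Eq. (2) to each frame."""
    n = np.arange(N)
    hamming = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))   # Eq. (2)
    num_frames = max(1 + (len(signal) - N) // M, 0)
    frames = np.empty((num_frames, N))
    for i in range(num_frames):
        frames[i] = signal[i * M:i * M + N] * hamming            # Eq. (1)
    return frames
```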
4.1.3 Fast Fourier Transform (FFT)
The Fast Fourier Transform converts a signal from the time domain into the frequency domain. The FFT is a fast algorithm for computing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:
X_n = \sum_{k=0}^{N−1} x_k e^{−2πjkn/N},  n = 0, 1, 2, ..., N − 1.  (3)
We use j here to denote the imaginary unit, i.e. j = √(−1). In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies

lim_{n→∞} P( max_{1≤i≤n} |W_i|² > C_{n,ε} ) = 0,  for ε > 0,  (17)

lim_{n→∞} P( max_{1≤i≤n} |W_i|² ≤ C_{n,ε} ) = 0,  for ε < 0.  (18)

The strong convergence result with C_{n,1} is known, and C_{n,1} is used as the universal threshold level for removing irrelevant wavelet coefficients in wavelet denoising [5]. The universal threshold level is shown to be asymptotically equivalent to the minimax one [5]. (17) and (18) are weaker results, but they evaluate the O(log log n) term. We return to our problem and consider the case of v*_λ = 0_n, i.e. all of the v_{λ,i} are noise components. Here, 0_n denotes the n-dimensional zero vector. We define ũ = (ũ_1, ..., ũ_n) such that ũ ∼ N(0_n, σ²Γ_λ). We define σ_i² = σ²γ_i/(γ_i + λ), i = 1, ..., n, where σ_i ≠ 0 for any i. We also define u = (u_1, ..., u_n), where u_i = ũ_i/σ_i. Then u ∼ N(0_n, I_n). By (17) and the definition of u, we have

P( ⋃_{i=1}^{n} { ũ_i² > σ_i² C_{n,ε} } ) = P( max_{1≤i≤n} u_i² > C_{n,ε} ) → 0  (n → ∞),  (19)
if ε > 0. On the other hand, by using (18) and the definition of u, we have

P( ⋂_{i=1}^{n} { ũ_i² ≤ σ_i² C_{n,ε} } ) = P( max_{1≤i≤n} u_i² ≤ C_{n,ε} ) → 0  (n → ∞),  (20)
if ε < 0. (19) tells us that, for any i, ũ_i² cannot exceed σ_i²C_{n,ε} with high probability when n is large and ε > 0. Therefore, if we employ σ_i²C_{n,ε} with ε > 0 as component-wise threshold levels, (19) implies that they remove a component if it is a noise component. On the other hand, (20) tells us that there are some noise components which satisfy ũ_i² > σ_i²C_{n,ε} with high probability when n is large and ε < 0. Therefore, the ith component would be recognized as a noise component only if ũ_i² ≤ σ_i²C_{n,ε}, even when it is one; in other words, we cannot distinguish nonzero mean components from zero mean components in this case. Hence, σ_i²C_{n,0}, i = 1, ..., n are critical levels for identifying noise components. We note that these results are still valid when v*_{λ,i} is not zero for some i but the number of such components is very small. This is the case when assuming sparseness of the representation of h in the orthogonal domain. In conclusion, we propose a hard thresholding method obtained by putting θ_{n,i} = σ_i²C_{n,0}, i = 1, ..., n in (16), where we set ε = 0. We refer to this method as component-wise hard thresholding (CHT). In practical applications of the method, we need an estimate of the noise variance σ². Fortunately, for nonparametric regression methods, [1] suggested applying
σ² = y′[I − H_λ]² y / trace[(I − H_λ)²],  (21)
where H_λ is defined by H_λ = GF_λ^{−1}G′ = GQΓ_λQ′G′. Although the method includes a regularization parameter, it can be fixed at a small value. This is because the thresholding method keeps a trained machine compact in the orthogonal domain, so the contribution of the regularizer may not be significant in improving the generalization performance; it is needed only for ensuring the numerical stability of the matrix calculations. Since the basis functions can be nearly linearly dependent in practical applications, small eigenvalues are less reliable. We therefore ignore the orthogonal components whose eigenvalues are less than a predetermined small value, say, 10^-16. Although the run time of the eigendecomposition is O(n³), the subsequent procedures of CHT, such as the calculations of σ² and w_λ, are achieved at low computational cost given the eigendecomposition.
Hard thresholding with the universal threshold level (HTU). The eigendecomposition of G′G essentially corresponds to a principal component analysis of g_1, ..., g_n. Therefore, for nearly linearly dependent basis functions, only several eigenvectors contribute substantially, while the components with small eigenvalues are largely affected by numerical errors. It is therefore natural to take care of only the components with large eigenvalues. For a component
with a large eigenvalue, γ_i ≫ λ holds since we can choose a small value for λ. Thus, σ_i² ≈ σ² holds by the definition of σ_i² in CHT. We then consider applying a single threshold level σ²C_{n,0} instead of σ_i²C_{n,0}, i = 1, ..., n in CHT; i.e. we set θ_{n,i} = σ²C_{n,0} in (16). This is a direct application of the universal threshold level in wavelet denoising [5]. This method is referred to as hard thresholding with the universal threshold level (HTU).
Backward hard thresholding (BHT). On the other hand, since the threshold level derived here is a worst-case evaluation of the noise level, CHT and HTU may yield a bias between f_{w_λ} and h by unexpectedly removing contributing components. A component with a large eigenvalue is composed of a linear sum of many basis functions and may be a smooth component; removing such components may therefore yield a large bias. Actually, in wavelet denoising, the fast/detail components are the target of thresholding while the slow/approximation components are left unharmed [5], which may be a device for reducing the bias. For our method, we also introduce this idea and consider the following procedure. We assume that γ_1 ≤ γ_2 ≤ ... ≤ γ_n. By increasing j from 1 to n, we find the first j = ĵ for which v_{λ,j}² > σ²C_{n,0} occurs. Thresholding is then made by T_j(v_{λ,j}) = v_{λ,j} if j ≥ ĵ and T_j(v_{λ,j}) = 0 if j < ĵ. This keeps the components with large eigenvalues and possibly reduces the bias. We refer to this method as backward hard thresholding (BHT). BHT can be viewed as a stopping criterion for choosing contributing components among the orthogonal components enumerated in order of the magnitude of their eigenvalues.
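The three thresholding rules can be summarized in the following sketch. The orthogonal-domain coefficients v_λ, the eigenvalues γ_i, and the noise-variance estimate of Eq. (21) are assumed to be given. The concrete form C_{n,0} = 2 log n is our assumption (the definition of C_{n,ε} is not reproduced in this excerpt), so that constant should be replaced by the paper's definition; the function name is also our own.

```python
import numpy as np

def hard_threshold(v_hat, gamma, sigma2, lam, method="CHT"):
    """Keep or remove orthogonal components by CHT, HTU or BHT.
    v_hat  : estimated orthogonal-domain coefficients (eigenvalues assumed
             sorted in ascending order, as in the BHT description)
    gamma  : eigenvalues of G'G
    sigma2 : noise-variance estimate from Eq. (21)
    lam    : small, fixed regularization parameter"""
    n = len(v_hat)
    C_n0 = 2.0 * np.log(n)                       # assumed form of C_{n,0}
    sigma2_i = sigma2 * gamma / (gamma + lam)    # component-wise variances
    if method == "CHT":
        keep = v_hat**2 > sigma2_i * C_n0        # component-wise levels
    elif method == "HTU":
        keep = v_hat**2 > sigma2 * C_n0          # single universal level
    else:  # BHT: keep everything from the first exceedance onwards
        keep = np.zeros(n, dtype=bool)
        above = np.nonzero(v_hat**2 > sigma2 * C_n0)[0]
        if above.size > 0:
            keep[above[0]:] = True
    out = np.zeros_like(v_hat)
    out[keep] = v_hat[keep]
    return out
```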
4 Numerical Experiments
4.1 Choice of Regularization Parameter
CHT, HTU and BHT do not include parameters to be adjusted except for the regularization parameter. Since thresholding of orthogonal components yields a certain simple representation of a machine, the regularization parameter may not be significant in improving the generalization performance. To demonstrate this property of our methods, we examine, through a simple numerical experiment, the relationship between the generalization performance of trained machines and the regularization parameter value. The target function is h(x) = 5 sinc(8x) for x ∈ R. x_1, ..., x_n are randomly drawn in the interval [−1, 1]. We assume i.i.d. Gaussian noise with mean zero and variance σ² = 1. The basis functions are Gaussian basis functions defined by g_j(x) = exp{−(x − x_j)²/(2τ²)}, j = 1, ..., n, where we set τ² = 0.05. In this experiment, under a fixed value of the regularization parameter, we trained machines for 1000 sets of training data of size n. At each trial, the test error is measured by the mean squared error between the target function and the trained machine, calculated on 1000 equally spaced input points in [−1, 1]. Figure 1 (a) and (b) depict the results for n = 200 and n = 400 respectively, which show the relationship between the averaged test errors of trained machines
Fig. 1. The dependence of test errors on regularization parameters. (a) n = 200 and (b) n = 400. The filled circle, open circle and open square indicate the results for the raw estimate, CHT and BHT respectively.
and regularization parameter values. The filled circle, open circle and open square indicate the results for the raw estimator, CHT and BHT respectively, where the raw estimate is obtained by (6) at each fixed value of the regularization parameter. We do not show the result for HTU since it is almost the same as that for CHT. In these figures, we can see that the averaged test errors of our methods are almost unchanged for small values of the regularization parameter, while those of the raw estimates are sensitive to the regularization parameter value. We can also see that BHT is entirely superior to the raw estimate and CHT, while CHT is worse than the raw estimate around λ = 10^1 for both n = 200 and 400. In practical applications, the regularization parameter of the raw estimate should be determined based on training data; performance comparisons with the leave-one-out cross validation method are shown below.
4.2 Comparison with LOOCV
We compare the performances of the proposed methods to the performance of the leave-one-out cross validation (LOOCV) choice of a regularization parameter value. We examine not only generalization performances but also the computational times of the methods. For the regularized estimator considered in this article, it is known that the LOOCV error can be calculated without training on validation sets [3,6]. We assume the same conditions as in the previous experiment. The CPU time is measured only for the estimation procedure. The experiments are conducted using Matlab on a computer with a 2.13 GHz Core2 CPU and 1 GByte of memory. Table 1 (a) and (b) show the averaged test errors and averaged CPU times of LOOCV and our methods respectively, in which the standard deviations (divided by 2) are also appended. The examined values for the regularization parameter in LOOCV are {m × 10^-j : m = 1, 2, 5, j = −4, −3, ..., 3}. In our methods, the
Table 1. Test errors and CPU times of LOOCV, CHT, HTU and BHT

(a) Test errors
  n     LOOCV           CHT             HTU             BHT
  100   0.079 ± 0.027   0.101 ± 0.034   0.100 ± 0.034   0.076 ± 0.027
  200   0.040 ± 0.013   0.046 ± 0.014   0.045 ± 0.014   0.035 ± 0.011
  400   0.021 ± 0.006   0.023 ± 0.007   0.023 ± 0.007   0.017 ± 0.005

(b) CPU times
  n     LOOCV           CHT             HTU             BHT
  100   0.079 ± 0.002   0.013 ± 0.002   0.014 ± 0.002   0.014 ± 0.002
  200   0.533 ± 0.003   0.080 ± 0.002   0.080 ± 0.002   0.080 ± 0.002
  400   3.657 ± 0.003   0.523 ± 0.004   0.523 ± 0.004   0.523 ± 0.004
regularization parameter is fixed at 1 × 10^-4, which is the smallest of the candidate values for LOOCV. Based on Table 1 (a), we first discuss the generalization performances of the methods. CHT and HTU are almost comparable, which implies that only the components corresponding to large eigenvalues contribute. CHT and HTU are entirely worse than LOOCV on average, although the differences are within the standard deviations for n = 200 and n = 400. BHT entirely outperforms LOOCV, CHT and HTU on average, although the difference between the averaged test error of LOOCV and that of BHT is almost within the standard deviations. As pointed out previously, CHT and HTU may accidentally remove smooth components, since the threshold levels were determined based on a worst-case evaluation of the dispersion of the noise; this explains the better generalization performance of BHT compared with CHT and HTU in Table 1 (a). On the other hand, as shown in Table 1 (b), our methods completely outperform LOOCV in terms of CPU time.
5 Conclusions and Future Works
In this article, we proposed shrinkage methods for training a machine by a regularization method. The machine is represented by a linear combination of fixed basis functions, in which the number of basis functions, or equivalently the number of weights, is identical to the number of training data. In the regularized cost function, the error function is the sum of squared errors and the regularization term is a quadratic form of the weight vector. In the proposed shrinkage procedures, the basis functions are orthogonalized by eigendecomposition of the Gram matrix of the vectors of basis function outputs. The orthogonal components are then kept or removed according to the proposed thresholding methods. The proposed methods are based on the statistical properties of regularized estimators of the weights, which are derived by assuming i.i.d. Gaussian noise. The final weights are obtained by a linear transformation of the thresholded orthogonal components and are shrinkage estimators of the weights. We
proposed three versions of the thresholding method: component-wise hard thresholding, hard thresholding with the universal threshold level, and backward hard thresholding. Since the regularization parameter can be fixed at a small value in our methods, our methods are automatic. Additionally, since eigendecomposition algorithms are included in many software packages and the thresholding methods are simple, our methods are quite easy to implement. The numerical experiments showed that our methods achieve relatively good generalization capabilities in strictly less computational time compared with the LOOCV method. In particular, the backward hard thresholding method outperformed the LOOCV method on average in terms of generalization performance. As future work, we need to investigate the performance of our methods on real-world problems. Furthermore, we need to evaluate the generalization error when applying the proposed shrinkage methods.
References
1. Carter, C.K., Eagleson, G.K.: A comparison of variance estimators in nonparametric regression. J. R. Statist. Soc. B 54, 773–780 (1992)
2. Chen, S.: Local regularization assisted orthogonal least squares regression. Neurocomputing 69, 559–585 (2006)
3. Craven, P., Wahba, G.: Smoothing noisy data with spline functions. Numerische Mathematik 31, 377–403 (1979)
4. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000)
5. Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455 (1994)
6. Rifkin, R.: Everything old is new again: a fresh look at historical approaches in machine learning. Ph.D thesis, MIT (2002)
7. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288 (1996)
8. Suykens, J.A.K., Brabanter, J.D., Lukas, L., Vandewalle, J.: Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing 48, 85–105 (2002)
9. Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 682–688 (2001)
A Subspace Method Based on Data Generation Model with Class Information
Minkook Cho, Dongwoo Yoon, and Hyeyoung Park
School of Electrical Engineering and Computer Science, Kyungpook National University, Daegu, Korea
[email protected],
[email protected],
[email protected]
Abstract. Subspace methods have been widely used to reduce memory requirements and system complexity and to increase classification performance in pattern recognition and signal processing. We propose a new subspace method based on a data generation model with an intra-class factor and an extra-class factor. The extra-class factor is associated with the distribution of classes and is important for discriminating between classes. The intra-class factor is associated with the distribution within a class and should be diminished to obtain high class separability. In the proposed method, we first estimate the intra-class factors and remove them from the original data. We then extract the extra-class factors by PCA. To verify the proposed method, we conducted computational experiments on real facial data and show that it gives better performance than conventional methods.
1 Introduction
Subspace methods aim to find a low dimensional subspace that captures meaningful information in the input data. They are widely used for high dimensional pattern classification, such as image data, for two main reasons. First, by applying a subspace method, we can reduce memory requirements and system complexity. Second, we can expect to increase classification performance by eliminating useless information and emphasizing the information essential for classification. The most popular subspace methods are PCA (Principal Component Analysis) [10,11,8] and FA (Factor Analysis) [6,14], which are based on data generation models. PCA finds a subspace of independent linear combinations (principal components) that retains as much of the information in the original variables as possible. However, PCA is an unsupervised method which does not use class information, and this may cause some loss of information that is critical for classification. In contrast, LDA (Linear Discriminant Analysis) [1,4,5] is a supervised learning method which uses the target labels of the data set. LDA attempts to find basis vectors of a subspace maximizing the linear class separability. It is generally known that LDA can give better classification performance than PCA by using class information. However, LDA gives
at most k−1 basis vectors of the subspace for k classes, and cannot extract features stably for data sets with a limited number of data in each class. Another subspace method for classification is the intra-person space method [9], which was developed for face recognition. The intra-person space is defined by the differences between two facial data from the same person. For dimension reduction, a low dimensional eigenspace is obtained by applying PCA to the intra-person space. In classification tasks, raw input data are projected onto the intra-personal eigenspace to get low dimensional features. The intra-person method showed better performance than PCA and LDA on the FERET data [9]. However, it is not based on a data generation model and cannot give a sound theoretical reason why the intra-person space provides good information for classification. On the other hand, a data generation model with class information has recently been developed [12]. It is a variant of the factor analysis model with two types of factors: a class factor (what we call the extra-class factor) and an environment factor (what we call the intra-class factor). Based on the data generation model, the intra-class factor is estimated by using difference vectors between two data points in the same class. The estimated probability distribution of the intra-class factor is applied to measuring the similarity of data for classification. Though this method takes a similar approach to the intra-person method, in the sense that it uses difference vectors to obtain the intra-class information and similarity measure, it is based on a data generation model which can explain the developed similarity measure. Still, it does not include a dimension reduction process, and another subspace method is necessary for high-dimensional data. In this paper, we propose an appropriate subspace method for the data generation model developed in [12]. The proposed method finds a subspace which suppresses the effect of intra-class factors and enlarges the effect of extra-class factors based on the data generation model. In Section 2, the model is explained in detail.
2 Data Generation Model
Before defining the data generation model, consider a situation in which we have obtained several images (i.e. data) from different persons (i.e. classes). Pictures of different persons are obviously different, and the pictures of one person are not exactly the same due to environmental conditions such as illumination. Therefore, it is natural to assume that a data point consists of an intra-class factor, which represents within-class variations such as the variation among pictures of the same person, and an extra-class factor, which represents between-class variations such as the differences between two persons. Under this condition, a random variable x for observed data can be written as a function of two distinct random variables ξ and η, of the form
x = f(ξ, η),  (1)
where ξ represents an extra-class factor, which keeps some unique information for each class, and η represents an intra-class factor, which represents environmental
variation in the same class. In [12], it was assumed that η keeps information of any variation within a class and that its distribution is common for all classes. In order to explicitly define the generation model, a linear additive factor model was applied:
x_i = Wξ_i + η.  (2)
This means that a random sample x_i in class C_i is generated by the summation of a linearly transformed class prototype ξ_i and a random class-independent variation η. In this paper, as an extension of this model, we assume that the low dimensional intra-class factor is also linearly transformed to generate an observed data point x. Therefore, the function f is defined as
x_i = Wξ_i + Vη_i.  (3)
This model implies that a data point of a specific class is generated by the extra-class factor ξ, which gives discriminative information among classes, and the intra-class factor η, which represents variation within a class. In this equation, W and V are the transformation matrices of the corresponding factors; we call W the extra-class factor loading and V the intra-class factor loading. Figure 1 illustrates this data generation model. To find a good subspace for classification, we estimate V and W using the class information given with the input data. In Section 3, we explain how to find the subspace based on the data generation model.
Fig. 1. The proposed data generation model
3 Factor Analysis Based on Data Generation Model
For given data x, if we keep the extra-class information and reduce the intra-class information as much as possible, we can expect better classification performance. In this respect, the proposed method can be regarded as similar to the traditional LDA method. LDA finds a projection matrix which simultaneously maximizes the between-class scatter and minimizes the within-class scatter of the original data set. In contrast, the proposed method first estimates the intra-class information from the set of difference vectors within the same class, and then excludes the intra-class information from the original data to keep the extra-class information. Therefore, compared to LDA, the proposed method does
not need to compute an inverse of the within-class scatter matrix, and the number of basis vectors of the subspace does not depend on the number of classes. In addition, the subspaces are simple to obtain, and many variations of the proposed method can be developed by exploiting various data generation models. In this section, we describe in detail how to obtain the subspace based on the simple linear generative model.
3.1 Intra-class Factor Loading
We first find a projection matrix Λ which represents the intra-class factor, instead of the intra-class factor loading matrix V, in order to obtain the intra-class information. In the given data set {x}, we calculate the difference vectors δ between pairs of data from the same class, which can be written as
δ^k_{ij} = x^k_i − x^k_j = (Wξ^k_i − Wξ^k_j) + (Vη^k_i − Vη^k_j),  (4)
where x^k_i and x^k_j are data from a class C_k (k = 1, ..., K). Because x^k_i and x^k_j come from the same class, we can assume that the extra-class factor does not make much difference, and we ignore the first term Wξ^k_i − Wξ^k_j. Then we obtain the approximate relationship
δ^k_{ij} ≈ V(η^k_i − η^k_j).  (5)
Based on this relationship, we try to find the factor loading matrix V. For the obtained data set
Δ = {δ^k_{ij}}, k = 1, ..., K, i = 1, ..., N, j = 1, ..., N,  (6)
where K is the number of classes and N is the number of data in each class, we apply PCA to obtain the principal components of Δ. The obtained matrix Λ of principal components of Δ gives the subspace which maximizes the variance of the intra-class factor η. The original data set X is projected onto this subspace to extract the intra-class factors Y^intra, such that
Y^intra = XΛ.  (7)
Note that Y^intra is a low dimensional data set and contains the intra-class information of X. Using Y^intra, we reconstruct X^intra in the original dimension by the calculation
X^intra = Y^intra Λ^T (ΛΛ^T)^{−1}.  (8)
Note that X^intra keeps the intra-class information, which is not desirable for classification. To remove this undesirable information, we subtract X^intra from the original data set X. As a result, we get a new data set X̃ such that
X̃ = X − X^intra.  (9)
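A sketch of the intra-class step of Eqs. (4)-(9). Rows of X are samples; the number of intra-class components p_intra is a free parameter, and the function name is our own. Since the principal directions returned by the SVD are orthonormal, the reconstruction of Eq. (8) reduces to a multiplication with the transpose.

```python
import numpy as np

def remove_intra_class_factor(X, labels, p_intra):
    """Estimate the intra-class subspace from within-class difference
    vectors and subtract the reconstructed intra-class part from X."""
    diffs = []
    for c in np.unique(labels):
        Xc = X[labels == c]
        for i in range(len(Xc)):
            for j in range(len(Xc)):
                if i != j:
                    diffs.append(Xc[i] - Xc[j])          # Eq. (4)
    Delta = np.asarray(diffs)                            # Eq. (6)
    _, _, Vt = np.linalg.svd(Delta, full_matrices=False)
    Lam = Vt[:p_intra].T                                 # intra-class directions
    Y_intra = X @ Lam                                    # Eq. (7)
    X_intra = Y_intra @ Lam.T                            # Eq. (8), orthonormal case
    return X - X_intra, Lam                              # Eq. (9)
```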
3.2 Extra-Class Factor Loading
Using the newly obtained data set X̃, we try to find the projection matrix Λ̃ which represents the extra-class factor, instead of the extra-class factor loading matrix W, in order to preserve the extra-class information as much as possible. To solve this problem, let us consider the data set X̃. Noting that a data point x^intra in the data set X^intra is a reconstruction from the intra-class factor, we can write the approximate relationship
x^intra ≈ Vη.  (10)
By combining it with equation (3), the newly obtained data x̃ in X̃ can be rewritten as
x̃ ≈ x − Vη = Wξ.  (11)
From this, we can say that the new data set X̃ mainly has the extra-class information, and thus we need to preserve this information as much as possible. From these considerations, we apply PCA to the new data set X̃ and obtain the principal component matrix Λ̃ of the data set X̃. By projecting X̃ onto the basis vectors such that
X̃Λ̃ = Z̃,  (12)
we can obtain the data set Z̃, which has small intra-class variance and large extra-class variance. The obtained data set is used for classification.
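Continuing the sketch above, the extra-class projection of Eqs. (10)-(12) is again a PCA, this time on the intra-class-reduced data X̃; the number of retained directions p_extra is a free parameter and the function name is our own. The projected features Z̃ can then be fed to the minimum distance classifier used in Section 4.

```python
import numpy as np

def extract_extra_class_features(X_tilde, p_extra):
    """PCA on the intra-class-reduced data and projection onto the leading
    directions, yielding the low dimensional features Z_tilde of Eq. (12)."""
    _, _, Vt = np.linalg.svd(X_tilde - X_tilde.mean(axis=0), full_matrices=False)
    Lam_tilde = Vt[:p_extra].T
    return X_tilde @ Lam_tilde, Lam_tilde   # Eq. (12)
```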
4 Experimental Results
For verification of the proposed method, we conducted comparison experiments on facial data sets against conventional methods: PCA, LDA and the intra-person method. For classification, we apply the minimum distance method [10] with Euclidean distance. When finding the subspace for each method, we optimized the dimension of the subspace with respect to the classification rate for each data set.
4.1 Face Recognition
We first conducted a face recognition task on face images with different viewpoints, obtained from the FERET (Face Recognition Technology) database at http://www.itl.nist.gov/iad/humanid/feret/. Figure 2 shows some samples of the data set. We used 450 images from 50 subjects, and each subject has 9 images of different poses at 15-degree intervals from left to right. The left, right (±60 degrees) and frontal images are used for training, and the remaining 300 images are used for testing. The size of each image is 50 × 70, thus the dimension of the raw input is 3500. For the LDA method, we first applied PCA to solve the small sample set problem and obtained 52 dimensional features; we then applied LDA and obtained 9 dimensional features. Similarly, in the proposed method, we first
Fig. 2. Examples of the human face images with different viewpoints

Table 1. Result on face image data
  Method        Dimension   Classification Rate
  PCA           117         97
  LDA           9           99.66
  Intra-Person  92          92.33
  Proposed      8           100
find an 83 dimensional subspace for the intra-class factor and an 8 dimensional subspace for the extra-class factor. In this case, there are 50 classes and the number of data in each class is very limited: 3 per class. The experimental results are shown in Table 1. The performance of the proposed method is perfect, and the other methods also give generally good results. In spite of the 50 classes and the limited number of training data, the good results may be due to the between-class variation being intrinsically high.
4.2 Pose Recognition
We conducted a pose recognition task with the same data set used in Section 4.1. In this case, we used 450 images from 9 viewpoint classes ranging from −60 to 60 degrees at 15-degree intervals, and each class consists of 50 images from different persons. 225 images, composed of 25 images per class, are used for training and the remaining 225 images are used for testing. For the LDA method, we first applied PCA and obtained 167 dimensional features. We then applied LDA
Table 2. Result on the human pose image data
  Method        Dimension   Classification Rate
  PCA           65          36.44
  LDA           6           57.78
  Intra-Person  51          38.67
  Proposed      21          58.22
and obtained 6 dimensional features. For the proposed method, we first find a 128 dimensional subspace for the intra-class factor and a 21 dimensional subspace for the extra-class factor. In this case, there are 9 classes and 225 training data, composed of 25 data points from each class. The results are shown in Table 2. The performance is generally low, but the proposed method and LDA give much better performance than PCA and the intra-person method. From the low performance, we can conjecture that the between-class variance is very small in contrast to the within-class variance. Nevertheless, the proposed method achieved the best performance.
4.3 Facial Expression Recognition
We also conducted a facial expression recognition task with a data set obtained from PICS (Psychological Image Collection at Stirling) at http://pics.psych.stir.ac.uk/. Figure 3 shows sample facial expression images. We obtained 276 images from 69 persons, and each person has 4 images of different expressions. 80 images, composed of 20 images per expression, are used for training and the remaining 196 images are used for testing. The size of each image is 80 × 90, thus the dimension of the raw input is 7200. For the LDA method, we first applied PCA and obtained 59 dimensional features; we then applied LDA and obtained 3 dimensional features. For the proposed method, we first find a 48 dimensional subspace for the intra-class factor and a 14 dimensional subspace for the extra-class factor. In this case, there are 4 classes and 20 training data in each class. Although the performances of all methods are generally low, the proposed method performed much better than PCA and the intra-person method. As in the pose recognition task, it seems that the between-class variance is very small in contrast to the within-class variance. Nevertheless, the proposed method achieved the best performance.

Table 3. Result on facial expression image data
  Method        Dimension   Classification Rate
  PCA           65          35.71
  LDA           3           65.31
  Intra-Person  76          42.35
  Proposed      14          66.33
Fig. 3. Examples of the human facial expression images
5 Conclusions and Discussions
In this paper, we proposed a new subspace method based on a data generation model with class information, represented by intra-class factors and extra-class factors. By reducing the intra-class information in the original data and keeping the extra-class information using PCA, we obtain low dimensional features which preserve the essential information for the given classification problem. In experiments on various types of facial classification tasks, the proposed method showed better performance than conventional methods. As further study, it may be possible to find a more sophisticated dimension reduction than PCA which can enlarge the extra-class information. Also, kernel methods could be applied to overcome the non-linearity problem.
Acknowledgements. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00807).
References
1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7), 711–720 (1997)
2. Alpaydin, E.: Machine Learning. MIT Press, Cambridge (2004)
3. Jaakkola, T., Haussler, D.: Exploiting Generative Models in Discriminative Classifiers. Advances in Neural Information Processing Systems, 487–493 (1998)
4. Fisher, R.A.: The Statistical Utilization of Multiple Measurements. Annals of Eugenics 8, 376–386 (1938)
5. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, London (1990)
6. Hinton, G.E., Zemel, R.S.: Autoencoders, Minimum Description Length and Helmholtz Free Energy. Advances in Neural Information Processing Systems 6, 3–10 (1994)
7. Hinton, G.E., Ghahramani, Z.: Generative Models for Discovering Sparse Distributed Representations. Philosophical Transactions of the Royal Society B 352, 1177–1190 (1997)
8. Lee, O., Park, H., Choi, S.: PCA vs. ICA for Face Recognition. In: The 2000 International Technical Conference on Circuits/Systems, Computers, and Communications, pp. 873–876 (2000)
9. Moghaddam, B., Jebara, T., Pentland, A.: Bayesian Modeling of Facial Similarity. Advances in Neural Information Processing Systems, 910–916 (1998)
10. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press, London (1979)
11. Martinez, A., Kak, A.: PCA versus LDA. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
12. Park, H., Cho, M.: Classification of Bio-data with Small Data Set Using Additive Factor Model and SVM. In: Hoffmann, A., Kang, B.-h., Richards, D., Tsumoto, S. (eds.) PKAW 2006. LNCS (LNAI), vol. 4303, pp. 770–779. Springer, Heidelberg (2006)
13. Chopra, S., Hadsell, R., LeCun, Y.: Learning a Similarity Metric Discriminatively, with Application to Face Verification. In: Proc. of International Conference on Computer Vision and Pattern Recognition, pp. 539–546 (2005)
14. Ghahramani, Z.: Factorial Learning and the EM Algorithm. In: Advances in Neural Information Processing Systems, vol. 7, pp. 617–624 (1995)
Hierarchical Feature Extraction for Compact Representation and Classification of Datasets
Markus Schubert and Jens Kohlmorgen
Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
{markus,jek}@first.fraunhofer.de
http://ida.first.fraunhofer.de
Abstract. Feature extraction methods do generally not account for hierarchical structure in the data. For example, PCA and ICA provide transformations that solely depend on global properties of the overall dataset. We here present a general approach for the extraction of feature hierarchies from datasets and their use for classification or clustering. A hierarchy of features extracted from a dataset thereby constitutes a compact representation of the set that on the one hand can be used to characterize and understand the data and on the other hand serves as a basis to classify or cluster a collection of datasets. As a proof of concept, we demonstrate the feasibility of this approach with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording.
1 Introduction
The vast majority of feature extraction methods does not account for hierarchical structure in the data. For example, PCA [1] and ICA [2] provide transformations that solely depend on global properties of the overall data set. The ability to model the hierarchical structure of the data, however, might certainly help to characterize and understand the information contained in the data. For example, neural dynamics are often characterized by a hierarchical structure in space and time, where methods for hierarchical feature extraction might help to group and classify such data. A particular demand for these methods exists in EEG recordings, where slow dynamical components (sometimes interpreted as internal “state” changes) and the variability of features make data analysis difficult. Hierarchical feature extraction is so far mainly related to 2-D pattern analysis. In these approaches, pioneered by Fukushima’s work on the Neocognitron [3], the hierarchical structure is typically a priori hard-wired in the architecture and the methods primarily apply to a 2-D grid structure. There are, however, more recent approaches, like local PCA [4] or tree-dependent component analysis [5], that are promising steps towards structured feature extraction methods that derive also the structure from the data. While local PCA in [4] is not hierarchical and tree-dependent component analysis in [5] is restricted to the context of ICA, we here present a general approach for the extraction of feature hierarchies and
their use for classification and clustering. We exemplify this by using PCA as the core feature extraction method. In [6] and [7], hierarchies of two-dimensional PCA projections (using probabilistic PCA [8]) were proposed for the purpose of visualizing high-dimensional data. For obtaining the hierarchies, the selection of sub-clusters was performed either manually [6] or automatically by using a model selection criterion (AIC, MDL) [7], but in both cases based on two-dimensional projections. A 2-D projection of high-dimensional data, however, is often not sufficient to unravel the structure of the data, which thus might hamper both approaches, in particular, if the sub-clusters get superimposed in the projection. In contrast, our method is based on hierarchical clustering in the original data space, where the structural information is unchanged and therefore undiminished. Also, the focus of this paper is not on visualizing the data itself, which obviously is limited to 2-D or 3-D projections, but rather on the extraction of the hierarchical structure of the data (which can be visualized by plotting trees) and on replacing the data by a compact hierarchical representation in terms of a tree of extracted features, which can be used for classification and clustering. The individual quantity to be classified or clustered in this context, is a tree of features representing a set of data points. Note that classifying sets of points is a more general problem than the well-known problem of classifying individual data points. Other approaches to classify sets of points can be found, e.g., in [9, 10], where the authors define a kernel on sets, which can then be used with standard kernel classifiers. The paper is organized as follows. In section 2, we describe the hierarchical feature extraction method. In section 3, we show how feature hierarchies can be used for classification and clustering, and in section 4 we provide a proof of concept with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording. Section 5 concludes with a discussion.
2 Hierarchical Feature Extraction
We pursue a straightforward approach to hierarchical feature extraction that allows us to make any standard feature extraction method hierarchical: we perform hierarchical clustering of the data prior to feature extraction. The feature extraction method is then applied locally to each significant cluster in the hierarchy, resulting in a representation (or replacement) of the original dataset in terms of a tree of features.
2.1 Hierarchical Clustering
There are many known variants of hierarchical clustering algorithms (see, e.g., [11, 12]), which can be subdivided into divisive top-down procedures and agglomerative bottom-up procedures. More important than this procedural aspect, however, is the dissimilarity function that is used in most methods to quantify the dissimilarity between two clusters. This function is used as the criterion to
determine the clusters to be split (or merged) at each iteration of the top-down (or bottom-up) process. Thus, it is this function that determines the clustering result and it implicitly encodes what a “good” cluster is. Common agglomerative procedures are single-linkage, complete-linkage, and average-linkage. They differ simply in that they use different dissimilarity functions [12]. We here use Ward’s method [13], also called the minimum variance method, which is agglomerative and successively merges the pair of clusters that causes the smallest increase in terms of the total sum-of-squared-errors (SSE), where the error is defined as the Euclidean distance of a data point to its cluster mean. The increase in square-error caused by merging two clusters, D_i and D_j, is given by
d(D_i, D_j) = (n_i n_j / (n_i + n_j)) ‖m_i − m_j‖,  (1)
where n_i and n_j are the number of points in each cluster, and m_i and m_j are the means of the points in each cluster [12]. Ward’s method can now simply be described as a standard agglomerative clustering procedure [11, 12] with the particular dissimilarity function d given in Eq. (1). We use Ward’s criterion, because it is based on a global fitness criterion (SSE) and in [11] it is reported that the method outperformed other hierarchical clustering methods in several comparative studies. Nevertheless, depending on the particular application, other criteria might be useful as well. The result of a hierarchical clustering procedure that successively splits or merges two clusters is a binary tree. At each hierarchy level, k = 1, ..., n, it defines a partition of the given n samples into k clusters. The leaf node level consists of n nodes describing a partition into n clusters, where each cluster/node contains exactly one sample. Each hierarchy level further up contains one node with edges to the two child nodes that correspond to the clusters that have been merged. The tree can be depicted graphically as a dendrogram, which aligns the leaf nodes along the horizontal axis and connects them by lines to the higher level nodes along the vertical axis. The position of the nodes along the vertical axis could in principle correspond linearly to the hierarchy level k. This, however, would reveal almost nothing of the structure in the data. Most of the structural information is actually contained in the dissimilarity values. One therefore usually positions the node at level k vertically with respect to the dissimilarity value of its two corresponding child clusters, D_i and D_j,
δ(k) = d(D_i, D_j).  (2)
For k = n, there are no child clusters, and therefore δ(n) = 0 [11]. The function δ can be regarded as within-cluster dissimilarity. By using δ as the vertical scale in a dendrogram, a large gap between two levels, for example k and k + 1, means that two very dissimilar clusters have been merged at level k. 2.2
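In practice, Ward clustering and the dendrogram can be obtained directly from standard libraries. The sketch below uses SciPy; its Ward linkage follows the same minimum-variance idea, although its distance values may differ from Eq. (1) by a constant scaling convention.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# X holds n samples as rows; Z is the (n-1) x 4 linkage matrix, where
# Z[k] = [child_a, child_b, dissimilarity, size] describes the k-th merge
# and Z[k, 2] plays the role of the within-cluster dissimilarity delta(k).
X = np.random.randn(200, 2)
Z = linkage(X, method="ward")
```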
Extracting a Tree of Significant Clusters
As we have seen in the previous subsection, a hierarchical clustering algorithm always generates a tree containing n − 1 non-singleton clusters. This does not
necessarily mean that any of these clusters is clearly separated from the rest of the data or that there is any structure in the data at all. The identification of clearly separated clusters is usually done by visual inspection of the dendrogram, i.e. by identifying large gaps. For an automatic detection of significant clusters, we use the following straightforward criterion:
δ(parent(k)) / δ(k) > α,  for 1 < k < n,  (3)
where parent(k) is the parent cluster level of the cluster obtained at level k and α is a significance threshold. If a cluster at level k is merged into a cluster that has a within-cluster dissimilarity which is more than α times higher than that of cluster k, we call cluster k a significant cluster. That means that cluster k is significantly more compact than its merger (in the sense of the dissimilarity function). Note that this does not necessarily mean that the sibling of cluster k is also a significant cluster, as it might have a higher dissimilarity value than cluster k. The criterion directly corresponds to the relative increase of the dissimilarity value in a dendrogram from one merger level to the next. For small clusters that contain only a few points, the relative increase in dissimilarity can be large just because of the small sample size. To avoid that these clusters are detected as being significant, we require a minimum cluster size M for significant clusters. After having identified the significant clusters in the binary cluster tree, we can extract the tree of significant clusters simply by linking each significant cluster node to the next highest significant node in the tree, or, if there is none, to the root node (which is just for the convenience of getting a tree and not a forest). The tree of significant clusters is generally much smaller than the original tree and it is not necessarily a binary tree anymore. Also note that there might be data points that are not in any significant cluster, e.g., outliers. The criterion in (3) is somewhat related to the criterion in [14], which is used to take out clusters from the merging process in order to obtain a plain, nonhierarchical clustering. The criterion in [14] accounts for the relative change of the absolute dissimilarity increments, which seems to be somewhat less intuitive and unnecessarily complicated. This criterion might also be overly sensitive to small variations in the dissimilarities. 2.3
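A sketch of criterion (3) applied to a SciPy linkage matrix as produced above; the function name and the handling of the root cluster (which has no parent and is therefore skipped) are our own choices.

```python
import numpy as np

def significant_clusters(Z, n, alpha=3.0, M=40):
    """Return ids of significant clusters in linkage matrix Z (criterion (3)).
    Cluster n + k is the cluster created by the k-th merge; it is significant
    if its parent's dissimilarity is more than alpha times its own and it
    contains at least M points."""
    parent = {}
    for k in range(len(Z)):
        parent[int(Z[k, 0])] = k
        parent[int(Z[k, 1])] = k
    ids = []
    for k in range(len(Z)):
        cid, delta_k, size = n + k, Z[k, 2], Z[k, 3]
        if cid not in parent or delta_k <= 0 or size < M:
            continue  # root cluster, degenerate merge, or too small
        if Z[parent[cid], 2] / delta_k > alpha:
            ids.append(cid)
    return ids
```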
Obtaining a Tree of Features
To obtain a representation of the original dataset in terms of a tree of features, we can now apply any standard feature extraction method to the data points in each significant cluster in the tree and then replace the data points in the cluster by their corresponding features. For PCA, for example, the data points in each significant cluster are replaced by their mean vector and the desired number of principle components, i.e. the eigenvectors and eigenvalues of the covariance matrix of the data points. The obtained hierarchy of features thus constitutes
a compact representation of the dataset that does not contain the individual data points anymore, which can save a considerable amount of memory. This representation is also independent of the size of the dataset. The hierarchy can on the one hand be used to analyze and understand the structure of the data, on the other hand – as we will further explain in the next section – it can be used to perform classification or clustering in cases where the individual input quantity to be classified (or clustered) is an entire dataset and not, as usual, a single data point.
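For PCA as the core method, the feature representation of a single significant cluster can be computed as follows; the number of retained components is a free parameter and the function name is our own.

```python
import numpy as np

def cluster_features(points, n_components=2):
    """Replace the points of one significant cluster by their mean and the
    leading eigenvectors/eigenvalues of the cluster covariance matrix."""
    mean = points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(points, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvals[order], eigvecs[:, order]
```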
3 Classification of Feature Trees
The classification problem that we address here is not the well-known problem of classifying individual data points or vectors. Instead, it relates to the classification of objects that are sets of data points, for example, time series. Given a “training set” of such objects, i.e. a number of datasets, each one attached with a certain class label, the problem consists in assigning one class label to each new, unlabeled dataset. This can be accomplished by transforming each individual dataset into a tree of features and by defining a suitable distance function to compare each pair of trees. For example, trees of principal components can be regarded as (hierarchical) mixtures of Gaussians, since the principal components of each node in the tree (the eigenvectors and eigenvalues) describe a normal distribution, which is an approximation to the true distribution of the underlying data points in the corresponding significant cluster. Two mixtures (sums) of Gaussians, f and g, corresponding to two trees of principal components (of two datasets), can be compared, e.g., by using the squared L2-norm as distance function, which is also called the integrated squared error (ISE),
ISE(f, g) = ∫ (f − g)² dx.  (4)
The ISE has the advantage that the integral is analytically tractable for mixtures of Gaussians. Note that the computation of a tree of principal components, as described in the previous section, is in itself an interesting way to obtain a mixture of Gaussians representation of a dataset: without the need to specify the number of components in advance and without the need to run a maximum likelihood (gradient ascent) algorithm like, for example, expectation–maximization [15], which is prone to get stuck in local optima. Having obtained a distance function on feature trees, the next step is to choose a classification method that only requires pairwise distances to classify the trees (and their corresponding datasets). A particularly simple method is first-nearest-neighbor (1-NN) classification. For 1-NN classification, the tree of a test dataset is assigned the label of the nearest tree of a collection of trees that were generated from a labeled “training set” of datasets. If the generated trees are sufficiently different among the classes, first- (or k-) nearest-neighbor
classification can already be sufficient to obtain a good classification result, as we demonstrate in the next section. In addition to classification, the distance function on feature trees can also be used to cluster a collection of datasets by clustering their corresponding trees. Any clustering algorithm that uses pairwise distances can be used for this purpose [11, 12]. In this way it is possible to identify homogeneous groups of datasets.
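The ISE of Eq. (4) between two Gaussian mixtures has a closed form, using the identity that the integral of N(x; m1, S1) N(x; m2, S2) over x equals N(m1 − m2; 0, S1 + S2). The sketch below assumes each tree is summarized as lists of cluster means, covariances and mixture weights (e.g. proportional to cluster sizes), which is our reading of the representation rather than a detail stated in the text; the 1-NN rule then needs nothing beyond these pairwise distances.

```python
import numpy as np
from scipy.stats import multivariate_normal

def _cross(means_a, covs_a, w_a, means_b, covs_b, w_b):
    """Sum_ij w_a[i] w_b[j] * integral of N(x; m_a[i], S_a[i]) N(x; m_b[j], S_b[j]) dx."""
    total = 0.0
    for m1, S1, a in zip(means_a, covs_a, w_a):
        for m2, S2, b in zip(means_b, covs_b, w_b):
            total += a * b * multivariate_normal.pdf(
                m1 - m2, mean=np.zeros(len(m1)), cov=S1 + S2)
    return total

def ise(mix_f, mix_g):
    """Integrated squared error (Eq. (4)) between two Gaussian mixtures,
    each given as a tuple (means, covariances, weights)."""
    return _cross(*mix_f, *mix_f) - 2.0 * _cross(*mix_f, *mix_g) + _cross(*mix_g, *mix_g)

def nn_label(test_mix, train_mixes, train_labels):
    """First-nearest-neighbor classification of a feature tree (reduced here
    to its Gaussian mixture) using the ISE as tree-to-tree distance."""
    d = [ise(test_mix, t) for t in train_mixes]
    return train_labels[int(np.argmin(d))]
```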
4 4.1
Applications Mixtures of Gaussians
As a proof of concept, we demonstrate the feasibility of this approach with an application to mixtures of Gaussians with varying degree of structuredness. From three classes of Gaussian mixture distributions, which are exemplarily shown in Fig. 1(a)-(c), we generated 10 training samples for each class, which constitute the training set, and a total of 100 test samples constituting the test set. Each sample contains 540 data points. The mixture distribution of each test sample was chosen with equal probability from one of the three classes. Next, we generated the binary cluster tree from each sample using Ward’s criterion. Examples of the corresponding dendrograms for each class are shown in Fig. 1(d)-(f) (in gray). We then determined the significant clusters in each tree, using the significance factor α = 3 and the minimum cluster size M = 40. In Fig. 1(d)-(f), the significant clusters are depicted as black dots and the extracted trees of significant clusters are shown by means of thick black lines. The cluster of each node in a tree of significant clusters was then replaced by the principle components obtained from the data in the cluster, which turns the tree of clusters into a tree of features. In Fig. 1(g)-(i), the PCA components of all significant clusters are shown for the three example datasets from Fig. 1(a)-(c). Finally, we classified the feature trees obtained from the test samples, using the integrated squared error (Eq. (4)) and first-nearest-neighbor classification. We obtained a nearly perfect accuracy of 98% correct classifications (i.e.: only two misclassifications), which can largely be attributed to the circumstance that the structural differences between the classes were correctly exposed in the tree structures. This result demonstrates that an appropriate representation of the data can make the classification problem very simple. 4.2
4.2 Clinical EEG
To demonstrate the applicability of our approach to real-world data, we used a clinical recording of human EEG. The recording was carried out in order to screen for pathological features, in particular a predisposition to epilepsy. The subject went through a number of experimental conditions: eyes open (EO), eyes closed (EC), hyperventilation (HV), post-hyperventilation (PHV), and, finally, a stimulation with stroboscopic light of increasing frequency (PO: photic on).
Fig. 1. (a)-(c) Example datasets for the three types of mixture distributions used in the application. (d)-(f) The corresponding dendrograms for each example dataset (gray) and the extracted trees of significant clusters (black). Note that the extracted tree structure exactly corresponds to the structure in the data. (g)-(i) The PCA components of all significant clusters. The components are contained in the tree of features.
During the photic phase, the subject kept the eyes closed, while the rate of light flashes was increased every four seconds in steps of 1 Hz, from 5 Hz to 25 Hz. The obtained recording was subdivided into 507 epochs of fixed length (1s). For each epoch, we extracted four features that correspond to the power in
specific frequency bands of particular EEG electrodes.¹ The resulting set of four-dimensional feature vectors was then analyzed by our method. For the hierarchical clustering, we used Ward's method and found the significant clusters depicted in Fig. 2. The extracted tree of significant clusters consists of a two-level hierarchy. As expected, the majority of feature vectors in each sub-cluster corresponds to one of the experimental conditions. By applying PCA to each sub-cluster and replacing the data of each node with its principal components, we obtain a tree of features, which constitutes a compact representation of the original dataset. It can then be used for comparison with trees that arise from normal or various kinds of pathological EEG, as outlined in Section 3.

¹ In detail: (I.) the power of the α-band (8–12 Hz) at the electrode positions O1 and O2 (according to the international 10–20 system), (II.) the power of 5 Hz and its harmonics (except 50 Hz) at electrode F4, (III.) the power of 6 Hz and its harmonics at electrode F8, and (IV.) the power of the 25–80 Hz band at F7.

Fig. 2. The tree of significant clusters (black), obtained from the underlying dendrogram (gray) for the EEG data. The data in each significant sub-cluster largely corresponds to one of the experimental conditions (indicated in %): eyes open (EO), eyes closed (EC), hyperventilation (HV), post-hyperventilation (PHV), and 'photic on' (PO). Leaf clusters: 82% (EC), 69% (PHV), 92% (EO), 88% (PO), 76% (HV), 90% (HV).
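A small sketch of the kind of band-power feature used here, computed per 1-second epoch with Welch's method. The sampling rate and the single-channel example are illustrative assumptions, and the harmonic-comb features of Footnote 1 are not reproduced.

```python
import numpy as np
from scipy.signal import welch

def band_power(signal, fs, lo, hi):
    # Mean power of one epoch in the [lo, hi] Hz band, via Welch's PSD.
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), fs))
    return psd[(freqs >= lo) & (freqs <= hi)].mean()

# Example: the alpha-band feature (8-12 Hz) for a 1-second epoch sampled
# at an assumed 250 Hz; the remaining features are built the same way
# from the bands and electrodes listed in Footnote 1.
fs = 250
epoch_o1 = np.random.randn(fs)      # placeholder for one epoch at O1
alpha_power = band_power(epoch_o1, fs, 8, 12)
```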
5 Discussion
We proposed a general approach for the extraction of feature hierarchies from datasets and their use for classification or clustering. The feasibility of this approach was demonstrated with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording. In this paper we focused on PCA as the core feature extraction method. Other types of feature extraction, like, e.g., ICA, are also conceivable, which then should be complemented with an appropriate distance function on the feature trees (if used for classification or clustering). The basis of the proposed approach is hierarchical clustering. The quality of the resulting feature hierarchies thus depends on the quality of the clustering. Ward's criterion tends to find compact, hyperspherical clusters, which may not always be the optimal choice for a given problem. Therefore, one should consider adjusting the clustering criterion to the problem at hand. Our future work will focus on the application of this method to classify normal and pathological EEG. By comparing the different tree structures, the hope is to gain a better understanding of the pathological cases.

Acknowledgements. This work was funded by the German BMBF under grant 01GQ0415 and supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.
References [1] Jolliffe, I.: Principal Component Analysis. Springer, New York (1986) [2] Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, Chichester (2001) [3] Fukushima, K.: Neural network model for a mechanism of pattern recognition unaffected by shift in position — neocognitron. Transactions IECE 62-A(10), 658–665 (1979) [4] Bregler, C., Omohundro, S.: Surface learning with applications to lipreading. In: Cowan, J., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Precessing Systems, vol. 6, pp. 43–50. Morgan Kaufmann Publishers, San Mateo (1994) [5] Bach, F., Jordan, M.: Beyond independent components: Trees and clusters. Journal of Machine Learning Research 4, 1205–1233 (2003) [6] Bishop, C., Tipping, M.: A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 281–293 (1998) [7] Wang, Y., Luo, L., Freedman, M., Kung, S.: Probabilistic principal component subspaces: A hierarchical finite mixture model for data visualization. IEEE Transactions on Neural Networks 11(3), 625–636 (2000) [8] Tipping, M., Bishop, C.: Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B 61(3), 611–622 (1999) [9] Kondor, R., Jebara, T.: A kernel between sets of vectors. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the ICML, pp. 361–368. AAAI Press, Menlo Park (2003) [10] Desobry, F., Davy, M., Fitzgerald, W.: A class of kernels for sets of vectors. In: Proceedings of the ESANN, pp. 461–466 (2005) [11] Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice Hall, Inc., Englewood Cliffs (1988) [12] Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley–Interscience, Chichester (2000)
[13] Ward, J.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58, 236–244 (1963) [14] Fred, A., Leitao, J.: Clustering under a hypothesis of smooth dissimilarity increments. In: Proceedings of the ICPR, vol. 2, pp. 190–194 (2000) [15] Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39, 1–38 (1977)
Principal Component Analysis for Sparse High-Dimensional Data

Tapani Raiko, Alexander Ilin, and Juha Karhunen

Adaptive Informatics Research Center, Helsinki Univ. of Technology
P.O. Box 5400, FI-02015 TKK, Finland
{Tapani.Raiko,Alexander.Ilin,Juha.Karhunen}@tkk.fi
http://www.cis.hut.fi/projects/bayes/
Abstract. Principal component analysis (PCA) is a widely used technique for data analysis and dimensionality reduction. Eigenvalue decomposition is the standard algorithm for solving PCA, but a number of other algorithms have been proposed. For instance, the EM algorithm is much more efficient in case of high dimensionality and a small number of principal components. We study a case where the data are highdimensional and a majority of the values are missing. In this case, both of these algorithms turn out to be inadequate. We propose using a gradient descent algorithm inspired by Oja’s rule, and speeding it up by an approximate Newton’s method. The computational complexity of the proposed method is linear with respect to the number of observed values in the data and to the number of principal components. In the experiments with Netflix data, the proposed algorithm is about ten times faster than any of the four comparison methods.
1 Introduction
Principal component analysis (PCA) [1,2,3,4,5,6] is a classic technique in data analysis. It can be used for compressing higher dimensional data sets to lower dimensional ones for data analysis, visualization, feature extraction, or data compression. PCA can be derived from a number of starting points and optimization criteria [2,3,4]. The most important of these are minimization of the mean-square error in data compression, finding mutually orthogonal directions in the data having maximal variances, and decorrelation of the data using orthogonal transformations [5]. While standard PCA is a very well-established linear statistical technique based on second-order statistics (covariances), it has recently been extended into various directions and considered from novel viewpoints. For example, various adaptive algorithms for PCA have been considered and reviewed in [4,6]. Fairly recently, PCA was shown to emerge as a maximum likelihood solution from a probabilistic latent variable model independently by several authors; see [3] for a discussion and references. In this paper, we study PCA in the case where most of the data values are missing (or unknown). Common algorithms for solving PCA prove to be
inadequate in this case, and we thus propose a new algorithm. The problem of overfitting and possible solutions are also outlined.
2 Algorithms for Principal Component Analysis
Principal subspace and components. Assume that we have n d-dimensional data vectors x_1, x_2, ..., x_n, which form the d × n data matrix X = [x_1, x_2, ..., x_n]. The matrix X is decomposed into

X ≈ AS ,   (1)

where A is a d × c matrix, S is a c × n matrix and c ≤ d ≤ n. Principal subspace methods [6,4] find A and S such that the reconstruction error

C = \|X − AS\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{n} ( x_{ij} − \sum_{k=1}^{c} a_{ik} s_{kj} )^2 ,   (2)

is minimized. Here \|·\|_F denotes the Frobenius norm, and x_{ij}, a_{ik}, and s_{kj} are elements of the matrices X, A, and S, respectively. Typically the row-wise mean is removed from X as a preprocessing step. Without any further constraints, there exist infinitely many ways to perform such a decomposition. However, the subspace spanned by the column vectors of the matrix A, called the principal subspace, is unique. In PCA, these vectors are mutually orthogonal and have unit length. Further, for each k = 1, ..., c, the first k vectors form the k-dimensional principal subspace. This makes the solution practically unique; see [4,2,5] for details. There are many ways to determine the principal subspace and components [6,4,2]. We will discuss three common methods that can be adapted for the case of missing values.

Singular Value Decomposition. PCA can be determined by using the singular value decomposition (SVD) [5]

X = U Σ V^T ,   (3)

where U is a d × d orthogonal matrix, V is an n × n orthogonal matrix and Σ is a d × n pseudodiagonal matrix (diagonal if d = n) with the singular values on the main diagonal [5]. The PCA solution is obtained by selecting the c largest singular values from Σ, by forming A from the corresponding c columns of U, and S from the corresponding c rows of Σ V^T. Note that PCA can equivalently be defined using the eigendecomposition of the d × d covariance matrix C of the column vectors of the data matrix X:

C = (1/n) X X^T = U D U^T .   (4)
Here, the diagonal matrix D contains the eigenvalues of C, and the columns of the matrix U contain the unit-length eigenvectors of C in the same order
[6,4,2,5]. Again, the columns of U corresponding to the largest eigenvalues are taken as A, and S is computed as A^T X. This approach can be more efficient for cases where d ≪ n, since it avoids the n × n matrix.

EM Algorithm. The EM algorithm for solving PCA [7] iterates updating A and S alternately.¹ When either of these matrices is fixed, the other one can be obtained from an ordinary least-squares problem. The algorithm alternates between the updates

S ← (A^T A)^{-1} A^T X ,   A ← X S^T (S S^T)^{-1} .   (5)

This iteration is especially efficient when only a few principal components are needed, that is c ≪ d [7].

Subspace Learning Algorithm. It is also possible to minimize the reconstruction error (2) by any optimization algorithm. Applying the gradient descent algorithm yields rules for simultaneous updates

A ← A + γ (X − AS) S^T ,   S ← S + γ A^T (X − AS) ,   (6)

where γ > 0 is called the learning rate. The Oja-Karhunen learning algorithm [8,9,6,4] is an online learning method that uses the EM formula for computing S and the gradient for updating A, a single data vector at a time. A possible speed-up to the subspace learning algorithm is to use the natural gradient [10] for the space of matrices. This yields the update rules

A ← A + γ (X − AS) S^T A^T A ,   S ← S + γ S S^T A^T (X − AS) .   (7)
If needed, the end result of subspace analysis can be transformed into the PCA solution, for instance, by computing the eigenvalue decomposition S S^T = U_S D_S U_S^T and the singular value decomposition A U_S D_S^{1/2} = U_A Σ_A V_A^T. The transformed A is formed from the first c columns of U_A and the transformed S from the first c rows of Σ_A V_A^T D_S^{-1/2} U_S^T S. Note that the required decompositions are computationally lighter than the ones done to the data matrix directly.
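As a reference point for the complete-data case, the following sketch implements the SVD solution of Eq. (3) and the alternating least-squares updates of Eq. (5); the function names and the fixed iteration count are our own choices.

```python
import numpy as np

def pca_svd(X, c):
    # PCA by truncated SVD, following Eq. (3).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :c], np.diag(s[:c]) @ Vt[:c, :]

def pca_em(X, c, n_iter=100):
    # Alternating updates of Eq. (5): the zero-noise limit of the EM
    # algorithm for probabilistic PCA.
    d, n = X.shape
    A = np.random.randn(d, c)
    for _ in range(n_iter):
        S = np.linalg.solve(A.T @ A, A.T @ X)   # S <- (A^T A)^{-1} A^T X
        A = X @ S.T @ np.linalg.inv(S @ S.T)    # A <- X S^T (S S^T)^{-1}
    return A, S
```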
3 Principal Component Analysis with Missing Values
Let us consider the same problem when the data matrix has missing entries.² In the following there are N = 9 observed values and 6 missing values marked with a question mark (?):

X = ⎡ −1  +1   0   0   ? ⎤
    ⎢ −1  +1   ?   ?   0 ⎥ .   (8)
    ⎣  ?   ?  −1  +1   ? ⎦

¹ The procedure studied in [7] can be seen as the zero-noise limit of the EM algorithm for a probabilistic PCA model.
² We make the typical assumption that values are missing at random, that is, the missingness does not depend on the unobserved data. An example where the assumption does not hold is when out-of-scale measurements are marked missing.
We would like to find A and S such that X ≈ AS for the observed data values. The rest of the product AS represents the reconstruction of missing values.

Adapting SVD. One can use the SVD approach (4) in order to find an approximate solution to the PCA problem. However, estimating the covariance matrix C becomes very difficult when there are lots of missing values. If we estimate C leaving out terms with missing values from the average, we get for the estimate of the covariance matrix

C = (1/n) X X^T = ⎡ 0.5   1      0 ⎤
                  ⎢ 1     0.667  ? ⎥ .   (9)
                  ⎣ 0     ?      1 ⎦

There are at least two problems. First, the estimated covariance 1 between the first and second components is larger than their estimated variances 0.5 and 0.667. This is clearly wrong, and leads to the situation where the covariance matrix is not positive (semi)definite and some of its eigenvalues are negative. Secondly, the covariance between the second and the third component could not be estimated at all.³ Both problems appeared in practice with the data set considered in Section 5.

Another option is to complete the data matrix by iteratively imputing the missing values (see, e.g., [2]). Initially, the missing values can be replaced by zeroes. The covariance matrix of the complete data can be estimated without the problems mentioned above. Now, the product AS can be used as a better estimate for the missing values, and this process can be iterated until convergence. This approach requires the use of the complete data matrix, and therefore it is computationally very expensive if a large part of the data matrix is missing. The time complexity of computing the sample covariance matrix explicitly is O(nd²). We will further refer to this approach as the imputation algorithm. Note that after convergence, the missing values do not contribute to the reconstruction error (2). This means that the imputation algorithm leads to the solution which minimizes the reconstruction error of observed values only.

Adapting the EM Algorithm. Grung and Manne [11] studied the EM algorithm in the case of missing values. Experiments showed a faster convergence compared to the iterative imputation algorithm. The computational complexity is O(Nc² + nc³) per iteration, where N is the number of observed values, assuming naïve matrix multiplications and inversions but exploiting sparsity. This is quite a bit heavier than EM with complete data, whose complexity is O(ndc) [7] per iteration.

Adapting the Subspace Learning Algorithm. The subspace learning algorithm works in a straightforward manner also in the presence of missing values.

³ It could be filled by finding a value that maximizes the determinant of the covariance matrix (and thus the entropy of the underlying Gaussian distribution).
We just take the sum over only those indices i and j for which the data entry x_{ij} (the ij-th element of X) is observed, in short (i, j) ∈ O. The cost function is

C = \sum_{(i,j) \in O} e_{ij}^2 ,   with   e_{ij} = x_{ij} − \sum_{k=1}^{c} a_{ik} s_{kj} ,   (10)

and its partial derivatives are

∂C/∂a_{il} = −2 \sum_{j | (i,j) \in O} e_{ij} s_{lj} ,   ∂C/∂s_{lj} = −2 \sum_{i | (i,j) \in O} e_{ij} a_{il} .   (11)

The update rules for gradient descent are

A ← A − γ ∂C/∂A ,   S ← S − γ ∂C/∂S ,   (12)

and the update rules for natural gradient descent are

A ← A − γ (∂C/∂A) A^T A ,   S ← S − γ S S^T (∂C/∂S) .   (13)

We propose a novel speed-up to the original simple gradient descent algorithm. In Newton's method for optimization, the gradient is multiplied by the inverse of the Hessian matrix. Newton's method is known to converge fast especially in the vicinity of the optimum, but using the full Hessian is computationally too demanding in truly high-dimensional problems. Here we use only the diagonal part of the Hessian matrix. We also include a control parameter α that allows the learning algorithm to interpolate between the standard gradient descent (α = 0) and the diagonal Newton's method (α = 1), much like the Levenberg-Marquardt algorithm. The learning rules then take the form

a_{il} ← a_{il} − γ ( ∂²C/∂a_{il}² )^{−α} ∂C/∂a_{il} = a_{il} + γ [ \sum_{j | (i,j) \in O} e_{ij} s_{lj} ] / [ \sum_{j | (i,j) \in O} s_{lj}^2 ]^{α} ,   (14)

s_{lj} ← s_{lj} − γ ( ∂²C/∂s_{lj}² )^{−α} ∂C/∂s_{lj} = s_{lj} + γ [ \sum_{i | (i,j) \in O} e_{ij} a_{il} ] / [ \sum_{i | (i,j) \in O} a_{il}^2 ]^{α} .   (15)

The computational complexity is O(Nc + nc) per iteration.
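The sparse update rules (14)-(15) can be implemented so that only observed entries are touched, as in the following sketch. The observed set O is passed as parallel index/value arrays, the adaptive learning-rate schedule used in the experiments is omitted, and the names and default step size are illustrative; α defaults to 0.625, the value reported as best in Section 5.

```python
import numpy as np

def speedup_step(A, S, rows, cols, vals, gamma=0.01, alpha=0.625):
    # One sweep of the diagonal-Newton speed-up, Eqs. (14)-(15).
    # A: (d, c), S: (c, n); x[rows[k], cols[k]] = vals[k] for (i, j) in O.
    d, c = A.shape
    n = S.shape[1]

    def errors():
        # e_ij = x_ij - sum_k a_ik s_kj, on the observed entries only
        return vals - np.einsum('oc,oc->o', A[rows], S[:, cols].T)

    e = errors()
    num = np.zeros((d, c)); den = np.zeros((d, c))
    np.add.at(num, rows, e[:, None] * S[:, cols].T)   # sum_j e_ij s_lj
    np.add.at(den, rows, S[:, cols].T ** 2)           # sum_j s_lj^2
    A = A + gamma * num / np.maximum(den, 1e-12) ** alpha

    e = errors()                                      # re-evaluate with new A
    numT = np.zeros((n, c)); denT = np.zeros((n, c))
    np.add.at(numT, cols, e[:, None] * A[rows])       # sum_i e_ij a_il
    np.add.at(denT, cols, A[rows] ** 2)               # sum_i a_il^2
    S = S + gamma * (numT / np.maximum(denT, 1e-12) ** alpha).T
    return A, S
```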
4 Overfitting
A trained PCA model can be used for reconstructing missing values:

\hat{x}_{ij} = \sum_{k=1}^{c} a_{ik} s_{kj} ,   (i, j) ∉ O .   (16)
Although PCA performs a linear transformation of data, overfitting is a serious problem for large-scale problems with lots of missing values. This happens when the value of the cost function C in Eq. (10) is small for training data, but the quality of prediction (16) is poor for new data. For further details, see [12].
Regularization. A popular way to regularize ill-posed problems is penalizing the use of large parameter values by adding a proper penalty term into the cost function; see for example [3]. In our case, one can modify the cost function in Eq. (2) as follows:

C_λ = \sum_{(i,j) \in O} e_{ij}^2 + λ ( \|A\|_F^2 + \|S\|_F^2 ) .   (17)
This has the effect that the parameters that do not have significant evidence will decay towards zero. A more general penalization would use different regularization parameters λ for different parts of A and S. For example, one can use a λk parameter of its own for each of the column vectors ak of A and the row vectors sk of S. Note that since the columns of A can be scaled arbitrarily by rescaling the rows of S accordingly, one can fix the regularization term for ak , for instance, to unity. An equivalent optimization problem can be obtained using a probabilistic formulation with (independent) Gaussian priors and a Gaussian noise model:
p(x_{ij} | A, S) = N( x_{ij} ; \sum_{k=1}^{c} a_{ik} s_{kj} , v_x ) ,   (18)

p(a_{ik}) = N(a_{ik}; 0, 1) ,   p(s_{kj}) = N(s_{kj}; 0, v_{sk}) ,   (19)

where N(x; m, v) denotes the random variable x having a Gaussian distribution with the mean m and variance v. The regularization parameter λ_k = v_{sk}/v_x is the ratio of the prior variances v_{sk} and v_x. Then, the cost function (ignoring constants) is minus the logarithm of the posterior for A and S:

C_{BR} = \sum_{(i,j) \in O} ( e_{ij}^2 / v_x + \ln v_x ) + \sum_{i=1}^{d} \sum_{k=1}^{c} a_{ik}^2 + \sum_{k=1}^{c} \sum_{j=1}^{n} ( s_{kj}^2 / v_{sk} + \ln v_{sk} ) .   (20)

An attractive property of the Bayesian formulation is that it provides a natural way to choose the regularization constants. This can be done using the evidence framework (see, e.g., [3]) or simply by minimizing C_{BR} by setting v_x, v_{sk} to the means of e_{ij}^2 and s_{kj}^2, respectively. We will use the latter approach and refer to it as regularized PCA. Note that in case of joint optimization of C_{BR} w.r.t. a_{ik}, s_{kj}, v_{sk}, and v_x, the cost function (20) has a trivial minimum with s_{kj} = 0, v_{sk} → 0. We try to avoid this minimum by using an orthogonalized solution provided by unregularized PCA from the learning rules (14) and (15) for initialization. Note also that setting v_{sk} to small values for some components k is equivalent to removal of irrelevant components from the model. This allows for automatic determination of the proper dimensionality c instead of discrete model comparison (see, e.g., [13]). This justifies using separate v_{sk} in the model in (19).
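A sketch of the regularized variant: the cost C_BR of Eq. (20) evaluated on the observed entries and the simple variance update mentioned in the text. The sparse data layout follows the earlier sketch, and all names are our own.

```python
import numpy as np

def regularized_cost(A, S, rows, cols, vals, v_x, v_s):
    # C_BR of Eq. (20), up to additive constants; v_s holds one v_sk per component.
    e = vals - np.einsum('oc,oc->o', A[rows], S[:, cols].T)
    data_term = np.sum(e ** 2 / v_x + np.log(v_x))
    prior_A = np.sum(A ** 2)
    prior_S = np.sum(S ** 2 / v_s[:, None] + np.log(v_s)[:, None])
    return data_term + prior_A + prior_S

def update_variances(A, S, rows, cols, vals):
    # The simple choice mentioned in the text: set v_x and v_sk to the
    # means of e_ij^2 and s_kj^2, respectively.
    e = vals - np.einsum('oc,oc->o', A[rows], S[:, cols].T)
    return np.mean(e ** 2), np.mean(S ** 2, axis=1)
```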
the joint posterior of the unknown quantities using a simple multivariate distribution. Each model parameter is described a posteriori using independent Gaussian distributions. The means can then be used as point estimates of the parameters, while the variances give at least a crude estimate of the reliability of these point estimates. The method in [13] does not extend to missing values easily, but the subspace learning algorithm (Section 3) can be extended to VB. The derivation is somewhat lengthy, and it is omitted here together with the variational Bayesian learning rules because of space limitations; see [12] for details. The computational complexity of this method is still O(N c + nc) per iteration, but the VB version is in practice about 2–3 times slower than the original subspace learning algorithm.
5 Experiments
Collaborative filtering is the task of predicting preferences (or producing personal recommendations) by using other people's preferences. The Netflix problem [14] is such a task. It consists of movie ratings given by n = 480189 customers to d = 17770 movies. There are N = 100480507 ratings from 1 to 5 given, from which 1408395 ratings are reserved for validation (or probing). Note that 98.8% of the values are thus missing. We tried to find c = 15 principal components from the data using a number of methods.⁴ We subtracted the mean rating for each movie, assuming 22 extra ratings of 3 for each movie as a Dirichlet prior.

⁴ The PCA approach has been considered by other Netflix contestants as well (see, e.g., [15,16]).

Computational Performance. In the first set of experiments we compared the computational performance of different algorithms on PCA with missing values. The root mean square (rms) error is measured on the training data, E_O = \sqrt{ (1/|O|) \sum_{(i,j) \in O} e_{ij}^2 }. All experiments were run on a dual-CPU AMD Opteron SE 2220 using Matlab. First, we tested the imputation algorithm. The first iteration, where the missing values are replaced with zeros, was completed in 17 minutes and led to E_O = 0.8527. This iteration was still tolerably fast because the complete data matrix was sparse. After that, it takes about 30 hours per iteration. After three iterations, E_O was still 0.8513. Using the EM algorithm by [11], the E-step (updating S) takes 7 hours and the M-step (updating A) takes 18 hours. (There is some room for optimization since we used a straightforward Matlab implementation.) Each iteration gives a much larger improvement compared to the imputation algorithm, but starting from a random initialization, EM could not reach a good solution in reasonable time. We also tested the subspace learning algorithm described in Section 3 with and without the proposed speed-up. Each run of the algorithm with different values of the speed-up parameter α was initialized in the same starting point (generated randomly from a normal distribution). The learning rate γ was adapted such that
[Figure 1 plots; legend, left panel: Gradient, Speed-up, Natural Grad., Imputation, EM; right panel: Gradient, Speed-up, Natural Grad., Regularized, VB1, VB2.]
Fig. 1. Left: Learning curves for unregularized PCA (Section 3) applied to the Netflix data: Root mean-square error on the training data EO is plotted against computation time in hours. Right: The root mean square error on the validation data EV from the Netflix problem during runs of several algorithms: basic PCA (Section 3), regularized PCA (Section 4) and VB (Section 4). VB1 has some parameters fixed (see [12]) while VB2 updates all the parameters. The time scales are linear below 1 and logarithmic above 1.
if an update decreased the cost function, γ was multiplied by 1.1. Each time an update would increase the cost, the update was canceled and γ was divided by 2. Figure 1 (left) shows the learning curves for basic gradient descent, natural gradient descent, and the proposed speed-up with the best found parameter value α = 0.625. The proposed speed-up gave about a tenfold speed-up compared to the gradient descent algorithm even if each iteration took longer. Natural gradient was slower than the basic gradient. Table 1 gives a summary of the computational complexities.

Overfitting. We compared PCA (Section 3), regularized PCA (Section 4) and VB-PCA (Section 4) by computing the rms reconstruction error for the validation set V, that is, testing how the models generalize to new data: E_V = \sqrt{ (1/|V|) \sum_{(i,j) \in V} e_{ij}^2 }. We tested VB-PCA by firstly fixing some of the parameter values (this run is marked as VB1 in Fig. 1, see [12] for details) and secondly by adapting them (marked as VB2). We initialized regularized PCA and VB1 using normal PCA learned with α = 0.625 and orthogonalized A, and VB2 using VB1. The parameter α was set to 2/3. Fig. 1 (right) shows the results. The performance of basic PCA starts to degrade during learning, especially using the proposed speed-up. Natural gradient diminishes this phenomenon, known as overlearning, but it is even more effective to use regularization. The best results were obtained using VB2: the final validation error E_V was 0.9180 and the training rms error E_O was 0.7826, which is naturally larger than the unregularized E_O = 0.7657.

Table 1. Summary of the computational performance of different methods on the Netflix problem. Computational complexities (per iteration) assume naïve computation of products and inverses of matrices and ignore the computation of SVD in the imputation algorithm. While the proposed speed-up makes each iteration slower than the basic gradient update, the time to reach the error level 0.85 is greatly diminished.

Method          Complexity        Seconds/Iter    Hours to E_O = 0.85
Gradient        O(Nc + nc)        58              1.9
Speed-up        O(Nc + nc)        110             0.22
Natural Grad.   O(Nc + nc²)       75              3.5
Imputation      O(nd²)            110000          64
EM              O(Nc² + nc³)      45000           58
6 Discussion
We studied a number of different methods for PCA with sparse data, and it turned out that a simple gradient descent approach worked best due to its minimal computational complexity per iteration. We could also speed it up more than ten times by using an approximated Newton's method. We found empirically that setting the parameter α = 2/3 seems to work well for our problem. It is left for future work to find out whether this generalizes to other problem settings. There are also many other ways to speed up the gradient descent algorithm. The natural gradient did not help here, but we expect that the conjugate gradient method would. The modification to the gradient proposed in this paper could be used together with the conjugate gradient speed-up. This will be another future research topic. There are also other benefits in solving the PCA problem by gradient descent. Algorithms that minimize an explicit cost function are rather easy to extend. The case of variational Bayesian learning applied to PCA was considered in Section 4, but there are many other extensions of PCA, such as using non-Gaussianity, nonlinearity, mixture models, and dynamics. The developed algorithms can prove useful in many applications such as bioinformatics, speech processing, and meteorology, in which large-scale datasets with missing values are very common. The required computational burden is linearly proportional to the number of measured values. Note also that the proposed techniques provide an analogue of confidence regions showing the reliability of estimated quantities.

Acknowledgments. This work was supported in part by the Academy of Finland under its Centers for Excellence in Research Program, and the IST Program of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views. We would like to thank Antti Honkela for useful comments.
References 1. Pearson, K.: On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2(6), 559–572 (1901) 2. Jolliffe, I.: Principal Component Analysis. Springer, Heidelberg (1986) 3. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
4. Diamantaras, K., Kung, S.: Principal Component Neural Networks - Theory and Application. Wiley, Chichester (1996) 5. Haykin, S.: Modern Filters. Macmillan, Basingstoke (1989) 6. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing - Learning Algorithms and Applications. Wiley, Chichester (2002) 7. Roweis, S.: EM algorithms for PCA and SPCA. In: Advances in Neural Information Processing Systems, vol. 10, pp. 626–632. MIT Press, Cambridge (1998) 8. Karhunen, J., Oja, E.: New methods for stochastic approximation of truncated Karhunen-Loeve expansions. In: Proceedings of the 6th International Conference on Pattern Recognition, pp. 550–553. Springer, Heidelberg (1982) 9. Oja, E.: Subspace Methods of Pattern Recognition. Research Studies Press and J. Wiley (1983) 10. Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–276 (1998) 11. Grung, B., Manne, R.: Missing values in principal components analysis. Chemometrics and Intelligent Laboratory Systems 42(1), 125–139 (1998) 12. Raiko, T., Ilin, A., Karhunen, J.: Principal component analysis for large scale problems with lots of missing values. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladeniˇc, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 691–698. Springer, Heidelberg (2007) 13. Bishop, C.: Variational principal components. In: Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN 1999), pp. 509–514 (1999) 14. Netflix: Netflix prize webpage (2007), http://www.netflixprize.com/ 15. Funk, S.: Netflix update: Try this at home (December 2006), http://sifter.org/∼ simon/journal/20061211.html 16. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the International Conference on Machine Learning (2007)
Hierarchical Bayesian Inference of Brain Activity

Masa-aki Sato¹ and Taku Yoshioka¹,²

¹ ATR Computational Neuroscience Laboratories
  [email protected]
² National Institute of Information and Communication Technology
Abstract. Magnetoencephalography (MEG) can measure brain activity with millisecond-order temporal resolution, but its spatial resolution is poor, due to the ill-posed nature of the inverse problem, for estimating source currents from the electromagnetic measurement. Therefore, prior information on the source currents is essential to solve the inverse problem. We have proposed a new hierarchical Bayesian method to combine several sources of information. In our method, the variance of the source current at each source location is considered an unknown parameter and estimated from the observed MEG data and prior information by using variational Bayes method. The fMRI information can be imposed as prior distribution rather than the variance itself so that it gives a soft constraint on the variance. It is shown that the hierarchical Bayesian method has better accuracy and spatial resolution than conventional linear inverse methods by evaluating the resolution curve. The proposed method also demonstrated good spatial and temporal resolution for estimating current activity in early visual area evoked by a stimulus in a quadrant of the visual field.
1 Introduction
In recent years, there has been rapid progress in noninvasive neuroimaging measurement for the human brain. Functional organization of the human brain has been revealed by PET and functional magnetic resonance imaging (fMRI). However, these methods cannot reveal the detailed dynamics of information processing in the human brain, since they have poor temporal resolution due to slow hemodynamic responses to neural activity (Bandettini, 2000; Ogawa et al., 1990). On the other hand, magnetoencephalography (MEG) can measure brain activity with millisecond-order temporal resolution, but its spatial resolution is poor, due to the ill-posed nature of the inverse problem of estimating source currents from the electromagnetic measurement (Hamalainen et al., 1993). Therefore, prior information on the source currents is essential to solve the inverse problem. One of the standard methods for the inverse problem is a dipole method (Hari, 1991; Mosher et al., 1992). It assumes that brain activity can be approximated by a small number of current dipoles. Although this method gives good estimates when the number of active areas is small, it cannot give distributed brain activity for higher function. On the other hand, a number of distributed
source methods have been proposed to estimate distributed activity in the brain, such as the minimum norm method, the minimum L1-norm method, and others (Hamalainen et al., 1993). It has also been proposed to combine fMRI information with MEG data (Dale and Sereno, 1993; Ahlfors et al., 1999; Dale et al., 2000; Phillips et al., 2002). However, there are essential differences between fMRI and MEG due to their temporal resolution. The fMRI activity corresponds to an average of several thousands of MEG time series data and may not correspond to MEG activity at some time points. We have proposed a new hierarchical Bayesian method to combine several sources of information (Sato et al. 2004). In our method, the variance of the source current at each source location is considered an unknown parameter and estimated from the observed MEG data and prior information. The fMRI information can be imposed as prior information on the variance distribution rather than the variance itself, so that it gives a soft constraint on the variance. Therefore, our method is capable of appropriately estimating the source current variance from the MEG data supplemented with the fMRI data, even if the fMRI data convey inaccurate information. Accordingly, our method is robust against inaccurate fMRI information. Because of the hierarchical prior, the estimation problem becomes nonlinear and cannot be solved analytically. Therefore, the approximate posterior distribution is calculated by using the Variational Bayesian (VB) method (Attias, 1999; Sato, 2001). The resulting algorithm is an iterative procedure that converges quickly because the VB algorithm is a type of natural gradient method (Amari, 1998) that has an optimal local convergence property. The position and orientation of the cortical surface obtained from structural MRI can also be introduced as a hard constraint. In this article, we explain our hierarchical Bayesian method. To evaluate the performance of the hierarchical Bayesian method, the resolution curves were calculated by varying the numbers of model dipoles, simultaneously active dipoles and MEG sensors. The results show the superiority of the hierarchical Bayesian method over conventional linear inverse methods. We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of four quadrants of the visual field. The estimation results are consistent with known physiological findings and show the good spatial and temporal resolutions of the hierarchical Bayesian method.
2 MEG Inverse Problem
When neural current activity occurs in the brain, it produces a magnetic field observed by MEG. The relationship between the magnetic field B = {B_m | m = 1 : M} measured by M sensors and the primary source current J = {J_n | n = 1 : N} in the brain is given by

B = G · J ,   (1)
where G = {G_{m,n} | m = 1 : M, n = 1 : N} is the lead field matrix. The lead field G_{m,n} represents the magnetic field B_m produced by the n-th unit dipole current. The above equations give the forward model, and the inverse problem
is to estimate the source current J from the observed magnetic field data B. The probabilistic model for the source currents can be constructed assuming Gaussian noise for the MEG sensors. Then, the probability distribution that the magnetic field B is observed for a given current J is given by

P(B | J) ∝ exp[ −(β/2) (B − G · J)^T · Σ_G · (B − G · J) ] ,   (2)

where (β Σ_G)^{-1} denotes the covariance matrix of the sensor noise, Σ_G^{-1} is the normalized covariance matrix satisfying Tr(Σ_G^{-1}) = M, and β^{-1} is the average noise variance.
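For illustration, the forward model and the noise assumption of Eq. (2) can be simulated as follows; the matrix shapes, the random generator handling, and the function name are our own conventions rather than the paper's.

```python
import numpy as np

def simulate_meg(G, J, beta, Sigma_G_inv, rng=None):
    # Forward model B = G J plus sensor noise whose covariance is
    # (beta Sigma_G)^{-1} = Sigma_G^{-1} / beta, matching Eq. (2).
    # G: (M, N) lead field, J: (N, T) source currents.
    rng = np.random.default_rng() if rng is None else rng
    noise_cov = Sigma_G_inv / beta
    noise = rng.multivariate_normal(np.zeros(G.shape[0]), noise_cov,
                                    size=J.shape[1]).T
    return G @ J + noise
```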
3 Hierarchical Bayesian Method
In the hierarchical Bayesian method, the variances of the currents are considered unknown parameters and estimated from the observed MEG data by introducing a hierarchical prior on the current variance. The fMRI information can be imposed as prior information on the variance distribution rather than the variance itself, so that it gives a soft constraint on the variance. The spatial smoothness constraint, that neurons within a few-millimeter radius tend to fire simultaneously due to the neural interactions, can also be implemented as a hierarchical prior (Sato et al. 2004).

Hierarchical Prior. Let us suppose a time sequence of MEG data B_{1:T} ≡ {B(t) | t = 1 : T} is observed. The MEG inverse problem in this case is to estimate the primary source current J_{1:T} ≡ {J(t) | t = 1 : T} from the observed MEG data B_{1:T}. We assume a Normal prior for the current:

P_0(J_{1:T} | α) ∝ exp[ −(β/2) \sum_{t=1}^{T} J(t)^T · Σ_α · J(t) ] ,   (3)

where Σ_α is the diagonal matrix with diagonal elements α = {α_n | n = 1 : N}. We also assume that the current variance α^{-1} does not change over the period T. The current inverse variance parameter α is estimated by introducing an ARD (Automatic Relevance Determination) hierarchical prior (Neal, 1996):

P_0(α) = \prod_{n=1}^{N} Γ(α_n | ᾱ_0n, γ_0nα) ,   (4)
Γ(α | ᾱ, γ) ≡ α^{-1} (α γ / ᾱ)^γ Γ(γ)^{-1} e^{−α γ / ᾱ} ,

where Γ(α | ᾱ, γ) represents the Gamma distribution with mean ᾱ and degree of freedom γ, and Γ(γ) ≡ ∫_0^∞ dt t^{γ−1} e^{−t} is the Gamma function. When the fMRI data is not available, we use a non-informative prior for the current inverse variance parameter α_n, i.e., γ_0nα = 0 and P_0(α_n) = α_n^{-1}. When the fMRI data is available, fMRI information is imposed as the prior for the inverse variance parameter α_n. The mean of the prior, ᾱ_0n, is assumed to be inversely proportional to the fMRI activity. The confidence parameter γ_0nα controls the reliability of the fMRI information.
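A small sketch of how fMRI information could enter the ARD prior of Eq. (4): the prior mean of the current variance at each source is taken proportional to the fMRI t-value. The constant a0 = 500 is the value quoted in Section 5, while the confidence value and the function name are illustrative placeholders.

```python
import numpy as np

def ard_prior_params(t_values, a0=500.0, gamma0=0.1):
    # fMRI-informed hyperparameters for the ARD prior of Eq. (4):
    # mean prior variance alpha0_n^{-1} = a0 * t_f(n); gamma0 sets how
    # strongly that prior is trusted (gamma0 = 0 recovers the
    # non-informative prior).
    alpha0_inv = a0 * np.asarray(t_values, dtype=float)
    gamma0n = np.full(len(t_values), gamma0)
    return alpha0_inv, gamma0n
```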
Variational Bayesian Method. The objective of the Bayesian estimation is to calculate the posterior probability distribution of J for the observed data B (in the following, B_{1:T} and J_{1:T} are abbreviated as B and J, respectively, for notational simplicity):

P(J | B) = ∫ dα P(J, α | B) ,   P(J, α | B) = P(J, α, B) / P(B) ,
P(J, α, B) = P(B | J) P_0(J | α) P_0(α) ,   P(B) = ∫ dJ dα P(J, α, B) .

The calculation of the marginal likelihood P(B) cannot be done analytically. In the VB method, the calculation of the joint posterior P(J, α | B) is reformulated as the maximization problem of the free energy. The free energy for a trial distribution Q(J, α) is defined by

F(Q) = ∫ dJ dα Q(J, α) log[ P(J, α, B) / Q(J, α) ]
     = log P(B) − KL[ Q(J, α) || P(J, α | B) ] .   (5)

Equation (5) implies that the maximization of the free energy F(Q) is equivalent to the minimization of the Kullback-Leibler distance (KL-distance) defined by KL[Q(J, α) || P(J, α | B)] ≡ ∫ dJ dα Q(J, α) log( Q(J, α) / P(J, α | B) ). This measures the difference between the true joint posterior P(J, α | B) and the trial distribution Q(J, α). Since the KL-distance reaches its minimum at zero when the two distributions coincide, the joint posterior can be obtained by maximizing the free energy F(Q) with respect to the trial distribution Q. In addition, the maximum free energy gives the log-marginal likelihood log(P(B)). The optimization problem can be solved using a factorization approximation restricting the solution space (Attias, 1999; Sato, 2001):

Q(J, α) = Q_J(J) Q_α(α) .   (6)

Under the factorization assumption (6), the free energy can be written as

F(Q) = ⟨log P(J, α, B)⟩_{J,α} − ⟨log Q_J(J)⟩_J − ⟨log Q_α(α)⟩_α
     = ⟨log P(B | J)⟩_J − KL[ Q_J(J) Q_α(α) || P_0(J | α) P_0(α) ] ,   (7)

where ⟨·⟩_J and ⟨·⟩_α represent the expectation values with respect to Q_J(J) and Q_α(α), respectively. The first term in the second equation of (7) corresponds to the negative sign of the expected reconstruction error. The second term (KL-distance) measures the difference between the prior and the posterior
and corresponds to the effective degree of freedom that can be well specified from the observed data. Therefore, (the negative sign of) the free energy can be considered a regularized error function with a model complexity penalty term. The maximum free energy is obtained by alternately maximizing the free energy with respect to Q_J and Q_α. In the first step (J-step), the free energy F(Q) is maximized with respect to Q_J while Q_α is fixed. The solution is given by

Q_J(J) ∝ exp[ ⟨log P(J, α, B)⟩_α ] .   (8)

In the second step (α-step), the free energy F(Q) is maximized with respect to Q_α while Q_J is fixed. The solution is given by

Q_α(α) ∝ exp[ ⟨log P(J, α, B)⟩_J ] .   (9)

The above J- and α-steps are repeated until the free energy converges.

VB algorithm. The VB algorithm is summarized here. In the J-step, the inverse filter L(Σ_α^{-1}) is calculated using the estimated covariance matrix Σ_α^{-1} from the previous iteration:

L(Σ_α^{-1}) = Σ_α^{-1} · G^T · ( G · Σ_α^{-1} · G^T + Σ_G^{-1} )^{-1} .   (10)

The expectation values of the current J and the noise variance β^{-1} with respect to the posterior distribution are estimated using the inverse filter (10):

J = L(Σ_α^{-1}) · B ,
γ_β β^{-1} = (1/2) [ (B − G · J)^T · Σ_G · (B − G · J) + J^T · Σ_α · J ] ,   (11)

where γ_β = N T / 2.
In the α-step, the expectation values of the variance parameters α_n^{-1} with respect to the posterior distribution are estimated as

γ_nα ᾱ_n^{-1} = γ_0nα ᾱ_0n^{-1} + (T/2) ᾱ_n^{-1} [ 1 − ( Σ_α^{-1} · G^T · Σ_B^{-1} · G )_{n,n} ] ,   (12)

where Σ_B = G · Σ_α^{-1} · G^T + Σ_G^{-1} is the matrix inverted in (10), and γ_nα is given by γ_nα = γ_0nα + T/2.
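A minimal sketch of the J-step as reconstructed above (Eqs. (10)-(11)): build the inverse filter from the current variance estimates and apply it to the data. Array shapes and names are our own conventions, and the subsequent α-step re-estimation of ᾱ_n is not shown.

```python
import numpy as np

def j_step(B, G, alpha_inv, Sigma_G_inv):
    # Inverse filter of Eq. (10) and current estimate of Eq. (11).
    # B: (M, T) sensor data, G: (M, N) lead field,
    # alpha_inv: (N,) diagonal of Sigma_alpha^{-1} (prior current variances),
    # Sigma_G_inv: (M, M) normalized noise covariance Sigma_G^{-1}.
    Sigma_B = G @ (alpha_inv[:, None] * G.T) + Sigma_G_inv
    L = alpha_inv[:, None] * G.T @ np.linalg.inv(Sigma_B)
    return L @ B, L      # estimated currents and the inverse filter
```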
4 Resolution Curve
We evaluated the performance of the hierarchical Bayesian method by calculating the resolution curve and compared it with the minimum norm (MN) method. The inverse filter L of the MN method can be obtained from Eq. (10) if the inverse variance parameters α_n are set to a given constant which is independent of the position n. Let us define the resolution matrix R by

R = L · G .   (13)
Fig. 1. Resolution curves of the minimum norm method for different numbers of model dipoles (170, 262, 502, and 1242). The number of sensors is 251. The horizontal axis denotes the radius from the source position in m.
The (n,k) component of the resolution matrix, R_{n,k}, represents the n-th estimated current when the unit current dipole is applied at the k-th position without noise, i.e., J_k = 1 and J_l = 0 (l ≠ k). The resolution curve is defined by the averaged estimated currents as a function of distance r from the source position. It can be obtained by summing the estimated currents R_{n,k} at the n-th positions whose distance from the k-th position is in the range from r to r + dr, when the unit current dipole is applied at the k-th position. The averaged resolution curve is obtained by averaging the resolution curve over the k-th positions. If the estimation is perfect, the resolution curve at the origin, which is the estimation gain, should be one. In addition, the resolution curve should be zero elsewhere. However, the estimation gain of a linear inverse method such as the MN method satisfies the constraint \sum_{n=1}^{N} G_n ≤ M, where G_n denotes the estimation gain at the n-th position (Sato et al. 2004). This constraint implies that the linear inverse method cannot perfectly retrieve more current dipoles than the number of sensors M. To see the effect of this constraint, we calculated the resolution curve for the MN method by varying the number of model dipoles while the number of sensors M is fixed at 251 (Fig. 1). We assumed model dipoles are placed evenly on a hemisphere. Fig. 1 shows that the MN method gives perfect estimation if the number of dipoles is less than M. On the other hand, the performance degraded as the number of dipoles increases over M. Although the above results are obtained by using the MN method, similar results can be obtained for a class of linear inverse methods. This limitation is the main cause of poor spatial
Fig. 2. Resolution curves of the hierarchical Bayesian method with 4078/10442 model dipoles and 251/515 sensors. The number of active dipoles is 240 or 400. The horizontal axis denotes the radius from the source position in m.
resolution of the linear inverse methods. When several dipoles are simultaneously active, estimated currents in the linear inverse methods can be obtained by the summation of the estimated currents for each dipole. Therefore, the resolution curve gives a complete description of the spatial resolution of the linear inverse methods. From the theoretical analysis (in preparation), the hierarchical Bayesian method can estimate dipole currents perfectly even when the number of model dipoles is larger than the number of sensors M. This is because the hierarchical Bayesian method effectively eliminates inactive dipoles from the estimation model by adjusting the estimation gain of these dipoles to zero. Nevertheless, the number of active dipoles gives the constraint on the performance of the hierarchical Bayesian method. The calculation of the resolution curves for the hierarchical Bayesian method is somewhat complicated, because Bayesian inverse filters are dependent on the MEG data. To evaluate the performance for the situations where multiple dipoles are active, we generated 240 or 400 active dipoles randomly on the hemisphere, and calculated the corresponding MEG data where 240 or 400 dipoles were simultaneously active. The Bayesian inverse filters were estimated using these simulated MEG data. Then, the resolution curves were calculated using the estimated Bayesian inverse filters for each active dipole, and they were averaged over all active dipoles. Fig. 2 shows the resolution curves for the hierarchical Bayesian method with 4078/10442 model dipoles and 251/515 MEG sensors. When the number of simultaneously active dipoles is less than the number of MEG sensors, almost perfect estimation is obtained
Hierarchical Bayesian Inference of Brain Activity
583
regardless of the number of model dipoles. Therefore, the hierarchical Bayesian method can achieve much better spatial resolution than the conventional linear inverse method. On the other hand, the performance is degraded if the numbers of simultaneously active dipoles are larger than the number of MEG sensors. The above results demonstrate the superiority of the hierarchical Bayesian method over MN method.
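The resolution matrix and resolution curve described in this section can be computed directly, as in the following sketch; the MN filter simply reuses Eq. (10) with a position-independent α, and the binning convention and names are illustrative choices.

```python
import numpy as np

def minimum_norm_filter(G, Sigma_G_inv, alpha_const=1.0):
    # MN inverse filter: Eq. (10) with a constant inverse variance.
    N = G.shape[1]
    alpha_inv = np.full(N, 1.0 / alpha_const)
    Sigma_B = G @ (alpha_inv[:, None] * G.T) + Sigma_G_inv
    return alpha_inv[:, None] * G.T @ np.linalg.inv(Sigma_B)

def resolution_curve(L, G, positions, bins):
    # R = L G (Eq. (13)); sum the estimated currents R[n, k] within each
    # distance bin and average over the source positions k.
    R = L @ G
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    curve = np.zeros(len(bins) - 1)
    for b in range(len(bins) - 1):
        mask = (dist >= bins[b]) & (dist < bins[b + 1])
        curve[b] = R[mask].sum() / R.shape[1]
    return curve
```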
5 Visual Experiments
We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of four quadrants of the visual field. Red and green checkerboards of a pseudo-randomly selected quadrant were presented for 700 ms in one trial. fMRI experiments with the same quadrant stimuli were also done by adopting a conventional block design where the stimuli were presented for 15 seconds in a block. The global field power (sum of MEG signals of all sensors) recorded from subject RH induced by the upper right stimulus is shown in Fig. 3a. A strong peak was observed 93 ms after the stimulus onset. Cortical currents were estimated by applying the hierarchical Bayesian method to the averaged MEG data between 100 ms before and 400 ms after the stimulus onset. The fMRI activity t-values were used as a prior for the inverse
Fig. 3. Estimated current for the quadrant visual stimulus. (a) shows the global field power of the MEG signal. (b) shows the temporal patterns of averaged currents in V1, V2/3, and V4. (c-e) show spatial patterns of the current strength averaged over 20 ms time windows centered at 93 ms, 98 ms, and 134 ms.
variance parameters. As explained in the 'Hierarchical Prior' subsection, the mean of the prior was assumed to be ᾱ_0n^{-1} = a_0 · t_f(n), where t_f(n) is the t-value at the n-th position and a_0 is a hyperparameter, set to 500 in this analysis. Estimated spatiotemporal brain activities are illustrated in Fig. 3. We identified 3 ROIs (Regions Of Interest) in V1, V2/3, and V4, and temporal patterns of the estimated currents were obtained by averaging the current within these ROIs. Fig. 3b shows that V1, V2/3, and V4 are successively activated and attain their peaks around 93 ms, 98 ms, and 134 ms, respectively. Fig. 3c-3e illustrates the spatial pattern of the current strength averaged over 20 ms time windows (centered at 93 ms, 98 ms, and 134 ms), in a flattened map format. The flattened map was made by cutting along the bottom of the calcarine sulcus. We can see strongly active regions in V1, V2/3, and V4 corresponding to their peak activities. The above results are consistent with known physiological findings and show the good spatial and temporal resolutions of the hierarchical Bayesian method.
6 Conclusion
In this article, we have explained the hierarchical Bayesian method, which combines MEG and fMRI by using the hierarchical prior. We have shown the superiority of the hierarchical Bayesian method over conventional linear inverse methods by evaluating the resolution curve. We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of four quadrants of the visual field. The estimation results are consistent with known physiological findings and show the good spatial and temporal resolutions of the hierarchical Bayesian method. Currently, we are applying the hierarchical Bayesian method to brain-machine interfaces using noninvasive neuroimaging. In our approach, we first estimate current activity in the brain. Then, the intention or the motion of the subject is estimated by using the current activity. This approach enables us to use physiological knowledge and gives us more insight into the mechanisms of human information processing.

Acknowledgement. This research was supported in part by NICT-KARC.
References Ahlfors, S.P., Simpson, G.V., Dale, A.M., Belliveau, J.W., Liu, A.K., Korvenoja, A., Virtanen, J., Huotilainen, M., Tootell, R.B.H., Aronen, H.J., Ilmoniemi, R.J.: Spatiotemporal activity of a cortical network for processing visual motion revealed by MEG and fMRI. J. Neurophysiol. 82, 2545–2555 (1999) Amari, S.: Natural Gradient Works Efficiently in Learning. Neural Computation 10, 251–276 (1998) Attias, H.: Inferring parameters and structure of latent variable models by variational Bayes. In: Proc. 15th Conference on Uncertainty in Artificial Intelligence, pp. 21–30 (1999) Bandettini, P.A.: The temporal resolution of functional MRI. In: Moonen, C.T.W., Bandettini, P.A. (eds.) Functional MRI, pp. 205–220. Springer, Heidelberg (2000)
Dale, A.M., Liu, A.K., Fischl, B.R., Buchner, R.L., Belliveau, J.W., Lewine, J.D., Halgren, E.: Dynamic statistical parametric mapping: Combining fMRI and MEG for high-resolution imaging of cortical activity. Neuron 26, 55–67 (2000) Dale, A.M., Sereno, M.I.: Improved localization of cortical activity by combining EEG and MEG with MRI cortical surface reconstruction: A Linear approach. J. Cognit. Neurosci. 5, 162–176 (1993) Hamalainen, M.S., Hari, R., Ilmoniemi, R.J., Knuutila, J., Lounasmaa, O.V.: Magentoencephalography– Theory, instrumentation, and applications to noninvasive studies of the working human brain. Rev. Modern Phys. 65, 413–497 (1993) Hari, R.: On brain’s magnetic responses to sensory stimuli. J. Clinic. Neurophysiol. 8, 157–169 (1991) Mosher, J.C., Lewis, P.S., Leahy, R.M.: Multiple dipole modelling and localization from spatio-temporal MEG data. IEEE Trans. Biomed. Eng. 39, 541–557 (1992) Neal, R.M.: Bayesian learning for neural networks. Springer, Heidelberg (1996) Ogawa, S., Lee, T.-M., Kay, A.R., Tank, D.W.: Brain magnetic resonance imaging with contrast-dependent oxygenation. In: Proc. Natl. Acad. Sci. USA, vol. 87, pp. 9868–9872 (1990) Phillips, C., Rugg, M.D., Friston, K.J.: Anatomically Informed Basis Functions for EEG Source Localization: Combining Functional and Anatomical Constraints. NeuroImage 16, 678–695 (2002) Sato, M.: On-line Model Selection Based on the Variational Bayes. Neural Computation 13, 1649–1681 (2001) Sato, M., Yoshioka, T., Kajihara, S., Toyama, K., Goda, N., Doya, K., Kawato, M.: Hierarchical Bayesian estimation for MEG inverse problem. NeuroImage 23, 806–826 (2004)
Neural Decoding of Movements: From Linear to Nonlinear Trajectory Models

Byron M. Yu¹,², John P. Cunningham¹, Krishna V. Shenoy¹, and Maneesh Sahani²

¹ Dept. of Electrical Engineering and Neurosciences Program, Stanford University, Stanford, CA, USA
² Gatsby Computational Neuroscience Unit, UCL, London, UK
{byronyu,jcunnin,shenoy}@stanford.edu,
[email protected]

Abstract. To date, the neural decoding of time-evolving physical state – for example, the path of a foraging rat or arm movements – has been largely carried out using linear trajectory models, primarily due to their computational efficiency. The possibility of better capturing the statistics of the movements using nonlinear trajectory models, thereby yielding more accurate decoded trajectories, is enticing. However, nonlinear decoding usually carries a higher computational cost, which is an important consideration in real-time settings. In this paper, we present techniques for nonlinear decoding employing modal Gaussian approximations, expectation propagation, and Gaussian quadrature. We compare their decoding accuracy versus computation time tradeoffs based on high-dimensional simulated neural spike counts.

Keywords: Nonlinear dynamical models, nonlinear state estimation, neural decoding, neural prosthetics, expectation-propagation, Gaussian quadrature.
1 Introduction
We consider the problem of decoding time-evolving physical state from neural spike trains. Examples include decoding the path of a foraging rat from hippocampal neurons [1,2] and decoding the arm trajectory from motor cortical neurons [3,4,5,6,7,8]. Advances in this area have enabled the development of neural prosthetic devices, which seek to allow disabled patients to regain motor function through the use of prosthetic limbs, or computer cursors, that are controlled by neural activity [9,10,11,12,13,14,15]. Several of these prosthetic decoders, including population vectors [11] and linear filters [10,12,15], linearly map the observed neural activity to the estimate of physical state. Although these direct linear mappings are effective, recursive Bayesian decoders have been shown to provide more accurate trajectory estimates [1,6,7,16]. In addition, recursive Bayesian decoders provide confidence regions on the trajectory estimates and allow for nonlinear relationships between
the neural activity and the physical state variables. Recursive Bayesian decoders are based on the specification of a probabilistic model comprising 1) a trajectory model, which describes how the physical state variables change from one time step to the next, and 2) an observation model, which describes how the observed neural activity relates to the time-evolving physical state. The function of the trajectory model is to build into the decoder prior knowledge about the form of the trajectories. In the case of decoding arm movements, the trajectory model may reflect 1) the hard, physical constraints of the limb (for example, the elbow cannot bend backward), 2) the soft, control constraints imposed by neural mechanisms (for example, the arm is more likely to move smoothly than in a jerky motion), and 3) the physical surroundings of the person and his/her objectives in that environment. The degree to which the trajectory model captures the statistics of the actual movements directly affects the accuracy with which trajectories can be decoded from neural data [8]. The most commonly-used trajectory models assume linear dynamics perturbed by Gaussian noise, which we refer to collectively as linear-Gaussian models. The family of linear-Gaussian models includes the random walk model [1,2,6], those with a constant [8] or time-varying [17,18] forcing term, those without a forcing term [7,16], those with a time-varying state transition matrix [19], and those with higher-order Markov dependencies [20]. Linear-Gaussian models have been successfully applied to decoding the path of a foraging rat [1,2], as well as arm trajectories in ellipse-tracing [6], pursuit-tracking [7,20,16], “pinball” [7,16], and center-out reach [8] tasks. Linear-Gaussian models are widely used primarily due to their computational efficiency, which is an important consideration for real-time decoding applications. However, for particular types of movements, the family of linear-Gaussian models may be too restrictive and unable to capture salient properties of the observed movements [8]. We recently proposed a general approach to constructing trajectory models that can exhibit rather complex dynamical behaviors and whose decoder can be implemented to have the same running time (using a parallel implementation) as simpler trajectory models [8]. In particular, we demonstrated that a probabilistic mixture of linear-Gaussian trajectory models, each accurate within a limited regime of movement, can capture the salient properties of goal-directed reaches to multiple targets. This mixture model, which yielded more accurate decoded trajectories than a single linear-Gaussian model, can be viewed as a discrete approximation to a single, unified trajectory model with nonlinear dynamics. An alternate approach is to decode using this single, unified nonlinear trajectory model without discretization. This makes the decoding problem more difficult since nonlinear transformations of parametric distributions are typically no longer easily parametrized. State estimation in nonlinear dynamical systems is a field of active research that has made substantial progress in recent years, including the application of numerical quadrature techniques to dynamical systems [21,22,23], the development of expectation-propagation (EP) [24] and its application to dynamical systems [25,26,27,28], and the improvement in the
computational efficiency of Monte Carlo techniques (e.g., [29,30,31]). However, these techniques have not been rigorously tested and compared in the context of neural decoding, which typically involves observations that are high-dimensional vectors of non-negative integers. In particular, the tradeoff between decoding accuracy and computational cost among different neural decoding algorithms has not been studied in detail. Knowing the accuracy-computational cost tradeoff is important for real-time applications, where one may need to select the most accurate algorithm given a computational budget or the least computationally intensive algorithm given a minimal acceptable decoding accuracy. This paper takes a step in this direction by comparing three particular deterministic Gaussian approximations. In Section 2, we first introduce the nonlinear dynamical model for neural spike counts and the decoding problem. Sections 3 and 4 detail the three deterministic Gaussian approximations that we focus on in this report: global Laplace, Gaussian quadrature-EP (GQ-EP), and Laplace propagation (LP). Finally, in Section 5, we compare the decoding accuracy versus computational cost of these three techniques.
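As a point of reference for the linear-Gaussian trajectory-model family discussed above, the following short Python sketch (ours, with arbitrary placeholder parameters) shows the generic form those models share: a linear state update with Gaussian noise and an optional forcing term.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T = 3, 100                       # state dimension and number of time steps

A = 0.99 * np.eye(p)                # state transition matrix (identity -> random walk)
b = np.zeros(p)                     # constant forcing term (zero here)
Q = 1e-3 * np.eye(p)                # Gaussian noise covariance

x = np.zeros((T, p))
for t in range(1, T):
    # x_t = A x_{t-1} + b + w_t,  w_t ~ N(0, Q)
    x[t] = A @ x[t - 1] + b + rng.multivariate_normal(np.zeros(p), Q)
```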
2 Nonlinear Dynamical Model and Neural Decoding
In this report, we consider nonlinear dynamical models for neural spike counts of the following form:
\[
x_t \mid x_{t-1} \sim \mathcal{N}\left(f(x_{t-1}),\, Q\right) \tag{1a}
\]
\[
y_t^i \mid x_t \sim \mathrm{Poisson}\left(\lambda_i(x_t)\cdot\Delta\right), \tag{1b}
\]
where x_t ∈ R^{p×1} is a vector containing the physical state variables at time t = 1, …, T, y_t^i ∈ {0, 1, 2, …} is the corresponding observed spike count for neuron i = 1, …, q taken in a time bin of width Δ, and Q ∈ R^{p×p} is a covariance matrix. The functions f : R^{p×1} → R^{p×1} and λ_i : R^{p×1} → R_+ are, in general, nonlinear. The initial state x_1 is Gaussian-distributed. For notational compactness, the spike counts for all q simultaneously-recorded neurons are assembled into a q × 1 vector y_t, whose ith element is y_t^i. Note that the observations are discrete-valued and that, typically, q ≫ p. Equations (1a) and (1b) are referred to as the trajectory and observation models, respectively. The task of neural decoding involves finding, at each timepoint t, the likely physical states x_t given the neural activity observed up to that time, {y}_1^t. In other words, we seek to compute the filtered state posterior P(x_t | {y}_1^t) at each t. We previously showed how to estimate the filtered state posterior when f is a linear function [8]. Here, we consider how to compute P(x_t | {y}_1^t) when f is nonlinear. The extended Kalman filter (EKF) is a commonly-used technique for nonlinear state estimation. Unfortunately, it cannot be directly applied to the current problem because the observation noise in (1b) is not additive Gaussian. Possible alternatives are the unscented Kalman filter (UKF) [21,22] and the closely related quadrature Kalman filter (QKF) [23], both of which employ quadrature
techniques to approximate Gaussian integrals that are analytically intractable. While the UKF has been shown to outperform the EKF [21,22], the UKF requires making Gaussian approximations in the observation space. This property of the UKF is undesirable from the standpoint of the current problem because the observed spike counts are typically 0 or 1 (due to the use of relatively short binwidths Δ) and, therefore, distinctly non-Gaussian. As a result, the UKF yielded substantially lower decoding accuracy than the techniques presented in Sections 3 and 4 [28], which make Gaussian approximations only in the state space. While we have not yet tested the QKF, the number of quadrature points required grows geometrically with p + q, which quickly becomes impractical even for moderate values of p and q. Thus, we will no longer consider the UKF and QKF in the remainder of this paper. The decoding techniques described in Sections 3 and 4 naturally yield the smoothed state posterior P xt | {y}T1 , rather than the filtered state posterior P (xt | {y}t1 ). Thus, we will focus on the smoothed state posterior in this work. However, the filtered state posterior at time t can be easily obtained by smoothing using only observations from timepoints 1, . . . , t.
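To make the generative model (1) concrete, the following Python sketch (ours, not from the paper) draws a state trajectory and Poisson spike counts; the particular choices of f, λ_i, and all parameter values are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, T, Delta = 3, 20, 50, 0.02        # state dim, neurons, time steps, bin width (s)

Q = 1e-3 * np.eye(p)                    # state noise covariance
A = 0.98 * np.eye(p)                    # placeholder weights for the dynamics
C = rng.standard_normal((q, p))         # placeholder tuning vectors c_i (rows)
d = 3.0 * np.ones(q)                    # placeholder baseline drive (gives sparse counts)

def f(x):                               # a placeholder nonlinear dynamics function
    return A @ np.tanh(x)

def lam(x):                             # a positive-valued rate function lambda_i(x)
    return np.log1p(np.exp(C @ x + d))

X = np.zeros((T, p))
Y = np.zeros((T, q), dtype=int)
X[0] = rng.multivariate_normal(np.zeros(p), Q)          # Gaussian initial state
for t in range(1, T):
    X[t] = rng.multivariate_normal(f(X[t - 1]), Q)      # trajectory model (1a)
for t in range(T):
    Y[t] = rng.poisson(lam(X[t]) * Delta)               # observation model (1b)
```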
3 Global Laplace
The idea is to estimate the joint state posterior across the entire sequence (i.e., the global state posterior) as a Gaussian matched to the location and curvature of a mode of P({x}_1^T | {y}_1^T), as in Laplace's method [32]. The mode is defined as
\[
\{x^{*}\}_1^T = \operatorname*{argmax}_{\{x\}_1^T} P\left(\{x\}_1^T \mid \{y\}_1^T\right) = \operatorname*{argmax}_{\{x\}_1^T} L\left(\{x\}_1^T\right), \tag{2}
\]
where
\[
L\left(\{x\}_1^T\right) = \log P\left(\{x\}_1^T, \{y\}_1^T\right) = \log P(x_1) + \sum_{t=2}^{T} \log P(x_t \mid x_{t-1}) + \sum_{t=1}^{T}\sum_{i=1}^{q} \log P\left(y_t^i \mid x_t\right). \tag{3}
\]
Using the known distributions (1), the gradients of L({x}_1^T) can be computed exactly and a local mode {x*}_1^T can be found by applying a gradient optimization technique. The global state posterior is then approximated as:
\[
P\left(\{x\}_1^T \mid \{y\}_1^T\right) \approx \mathcal{N}\left(\{x^{*}\}_1^T,\; \left(-\nabla^2 L\left(\{x^{*}\}_1^T\right)\right)^{-1}\right). \tag{4}
\]
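As an illustration of this procedure, here is a minimal Python sketch (ours, not the authors' implementation) that stacks the state sequence into one vector, numerically maximizes L({x}_1^T) under a toy version of model (1), and returns the mode; all model choices and parameter values are placeholders, and in practice one would also form the Hessian at the mode (analytically or by automatic differentiation) to obtain the covariance in (4).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(1)
p, q, T, Delta = 3, 20, 50, 0.02            # toy dimensions and bin width (placeholders)
Q = 1e-3 * np.eye(p)
Qinv = np.linalg.inv(Q)
A = 0.98 * np.eye(p)
C = rng.standard_normal((q, p))
d = 3.0 * np.ones(q)

def f(x):                                    # placeholder nonlinear dynamics
    return A @ np.tanh(x)

def lam(x):                                  # placeholder positive rate function
    return np.log1p(np.exp(C @ x + d))

def neg_log_joint(xflat, Y):
    """-L({x}) from Eq. (3): Gaussian dynamics terms plus Poisson observation terms."""
    X = xflat.reshape(T, p)
    val = 0.5 * X[0] @ Qinv @ X[0]           # -log P(x_1) up to constants, assuming x_1 ~ N(0, Q)
    for t in range(1, T):
        diff = X[t] - f(X[t - 1])
        val += 0.5 * diff @ Qinv @ diff      # -log P(x_t | x_{t-1}) up to constants
    for t in range(T):
        mean = lam(X[t]) * Delta
        val -= np.sum(Y[t] * np.log(mean) - mean - gammaln(Y[t] + 1.0))
    return val

# Synthetic spike counts from the same toy model, then mode-finding as in Eq. (2).
X_true = np.zeros((T, p))
for t in range(1, T):
    X_true[t] = rng.multivariate_normal(f(X_true[t - 1]), Q)
Y = rng.poisson(np.array([lam(x) for x in X_true]) * Delta)

res = minimize(neg_log_joint, np.zeros(T * p), args=(Y,), method="L-BFGS-B")
x_mode = res.x.reshape(T, p)                 # the decoded (smoothed) state trajectory
```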
4 Expectation Propagation
We briefly summarize here the application of EP [24] to dynamical models [25,26,27,28]. More details can be found in the cited references. The two primary distributions of interest here are the marginal P(x_t | {y}_1^T) and pairwise
joint P(x_{t-1}, x_t | {y}_1^T) state posteriors. These distributions can be expressed in terms of forward α_t and backward β_t messages as follows
\[
P\left(x_t \mid \{y\}_1^T\right) = \frac{\alpha_t(x_t)\,\beta_t(x_t)}{P\left(\{y\}_1^T\right)} \tag{5}
\]
\[
P\left(x_{t-1}, x_t \mid \{y\}_1^T\right) = \frac{\alpha_{t-1}(x_{t-1})\,P(x_t \mid x_{t-1})\,P(y_t \mid x_t)\,\beta_t(x_t)}{P\left(\{y\}_1^T\right)}, \tag{6}
\]
where α_t(x_t) = P(x_t, {y}_1^t) and β_t(x_t) = P({y}_{t+1}^T | x_t). The messages α_t and β_t are typically approximated by an exponential family density; in this paper, we use an unnormalized Gaussian. These approximate messages are iteratively updated by matching the expected sufficient statistics (if the approximating distributions are assumed to be Gaussian, this is equivalent to matching the first two moments) of the marginal posterior (5) with those of the pairwise joint posterior (6). The updates are usually performed sequentially via multiple forward-backward passes. During the forward pass, the α_t are updated while the β_t remain fixed:
\[
P\left(x_t \mid \{y\}_1^T\right) = \int \frac{\alpha_{t-1}(x_{t-1})\,P(x_t \mid x_{t-1})\,P(y_t \mid x_t)\,\beta_t(x_t)}{P\left(\{y\}_1^T\right)}\, dx_{t-1} \tag{7}
\]
\[
\approx \int \hat{P}(x_{t-1}, x_t)\, dx_{t-1} \tag{8}
\]
\[
\alpha_t(x_t) \propto \frac{\int \hat{P}(x_{t-1}, x_t)\, dx_{t-1}}{\beta_t(x_t)}, \tag{9}
\]
where P̂(x_{t-1}, x_t) is an exponential family distribution whose expected sufficient statistics are matched to those of P(x_{t-1}, x_t | {y}_1^T). In this paper, P̂(x_{t-1}, x_t) is assumed to be Gaussian. The backward pass proceeds similarly, where the β_t are updated while the α_t remain fixed. The decoded trajectory is obtained by combining the messages α_t and β_t, as shown in (5), after completing the forward-backward passes. In Section 5, we investigate the accuracy-computational cost tradeoff of using different numbers of forward-backward iterations. Although the expected sufficient statistics (or moments) of P(x_{t-1}, x_t | {y}_1^T) cannot typically be computed analytically for the nonlinear dynamical model (1), they can be approximated using Gaussian quadrature [26,28]. This EP-based decoder is referred to as GQ-EP. By applying the ideas of Laplace propagation (LP) [33], a closely-related decoder has been developed that uses a modal Gaussian approximation of P(x_{t-1}, x_t | {y}_1^T) rather than matching moments [27,28]. This technique, which uses the same message-passing scheme as GQ-EP, is referred to here as LP. In practice, it is possible to encounter invalid message updates. For example, if the variance of x_t in the numerator is larger than that in the denominator in (9) due to approximation error in the choice of P̂, the update rule would assign α_t(x_t) a negative variance. A way around this problem is to simply skip that message update and hope that the update is no longer invalid during the next
forward-backward iteration [34]. An alternative is to set β_t(x_t) = 1 in (7) and (9), which guarantees a valid update for α_t(x_t). This is referred to as the one-sided update and its implications for decoding accuracy and computation time are considered in Section 5.
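The invalid-update problem and the one-sided remedy can be seen in one dimension with Gaussian messages written in natural parameters; the numbers below are arbitrary illustrative values (this sketch is ours, not from the paper).

```python
# Dividing Gaussians, as in Eq. (9), subtracts precisions (inverse variances).
var_marginal = 1.2          # variance of the approximate marginal from Eq. (8)
var_beta = 1.0              # variance of the backward message beta_t

prec_alpha = 1.0 / var_marginal - 1.0 / var_beta
print(prec_alpha)           # about -0.167: a negative precision, i.e., an invalid alpha_t

# One-sided update: redo the Eq. (7)-(8) approximation with beta_t(x_t) = 1, so no
# division by beta_t is needed and the resulting alpha_t precision stays positive.
```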
5 Results
We evaluated decoding accuracy versus computational cost of the techniques described in Sections 3 and 4. These performance comparisons were based on the model (1), where
\[
f(x) = (1 - k)\,x + k \cdot W \cdot \mathrm{erf}(x) \tag{10}
\]
\[
\lambda_i(x) = \log\left(1 + e^{c_i^{\top} x + d_i}\right) \tag{11}
\]
with parameters W ∈ R^{p×p}, c_i ∈ R^{p×1}, and d_i ∈ R. The error function (erf) in (10) acts element-by-element on its argument. We have chosen the dynamics (10) of a fully-connected recurrent network due to its nonlinear nature; we make no claims in this work about its suitability for particular decoding applications, such as for rat paths or arm trajectories. Because recurrent networks are often used to directly model neural activity, it is important to emphasize that x is a vector of physical state variables to be decoded, not a vector of neural activity. We generated 50 state trajectories, each with 50 time points, and corresponding spike counts from the model (1), where the model parameters were randomly chosen within a range that provided biologically realistic spike counts (typically, 0 or 1 spike in each bin). The time constant k ∈ R was set to 0.1. To understand how these algorithms scale with different numbers of physical state variables and observed neurons, we considered all pairings (p, q), where p ∈ {3, 10} and q ∈ {20, 100, 500}. For each pairing, we repeated the above procedure three times. For the global Laplace decoder, the modal trajectory was found using Polack-Ribière conjugate gradients with quadratic/cubic line searches and Wolfe-Powell stopping criteria (minimize.m by Carl Rasmussen, available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/). To stabilize GQ-EP, we used a modal Gaussian proposal distribution and the custom precision 3 quadrature rule with non-negative quadrature weights, as described in [28]. For both GQ-EP and LP, minimize.m was used to find a mode of P(x_{t-1}, x_t | {y}_1^T). Fig. 1 illustrates the decoding accuracy versus computation time of the presented techniques. Decoding accuracy was measured by evaluating the marginal state posteriors P(x_t | {y}_1^T) at the actual trajectory. The higher the log probability, the more accurate the decoder. Each panel corresponds to a different number of state variables and observed neurons. For GQ-EP (dotted line) and LP (solid line), we varied the number of forward-backward iterations between one and three; thus, there are three circles for each of these decoders. Across all panels, global Laplace required the least computation time and yielded state
[Fig. 1: panels (a)–(f), log probability versus computation time (sec).]
Fig. 1. Decoding accuracy versus computation time of global Laplace (no line), GQ-EP (dotted line), and LP (solid line). (a) p = 3, q = 20, (b) p = 3, q = 100, (c) p = 3, q = 500, (d) p = 10, q = 20, (e) p = 10, q = 100, (f) p = 10, q = 500. The circles and bars represent mean±SEM. Variability in computation time is not represented on the plots because it was negligible. The computation times were obtained using a 2.2-GHz AMD Athlon 64 processor with 2 GB RAM running MATLAB R14. Note that the scale of the vertical axes is not the same in each panel and that some error bars are so small that they can't be seen.
estimates as accurate as, or more accurate than, the other techniques. This is the key result of this report. We also implemented a basic particle smoother [35], where the number of particles (500 to 1500) was chosen such that its computation time was on the same order as those shown in Fig. 1 (results not shown). Although this particle smoother yielded substantially lower decoding accuracy than global Laplace, GQ-EP, and LP, the three deterministic techniques should be compared to more recently-developed Monte Carlo techniques, as described in Section 6. Fig. 1 shows that all three techniques have computation times that scale well with the number of state variables p and neurons q. In particular, the required computational time typically scales sub-linearly with increases in p and far sublinearly with increases in q. As the q increases, the accuracies of the techniques become more similar (note that different panels have different vertical scales), and there is less advantage to performing multiple forward-backward iterations for GQ-EP and LP. The decoding accuracy and required computation time both typically increase with the number of iterations. In a few cases (e.g., GQ-EP in Fig. 1(b)), it is possible for the accuracy to decrease slightly when going from two to three iterations, presumably due to one-sided updates. In theory, GQ-EP should require greater computation time than LP because it needs to perform the same modal Gaussian approximation, then use it as a proposal distribution for Gaussian quadrature. In practice, it is possible for LP
to be slower if it needs many one-sided updates (cf. Fig. 1(d)), since one-sided updates are used only when the usual update (9) fails. Furthermore, LP required greater computation time in Fig. 1(d) than in Fig. 1(e) due to the need for many more one-sided updates, despite having five times fewer neurons. It was previously shown that {x }T1 is a local optimum of P {x}T1 | {y}T1 (i.e., a solution of global Laplace) if and only if it is a fixed-point of LP [33]. Because the modal Gaussian approximation matches local curvature up to second order, it can also be shown that the estimated covariances using global Laplace and LP are equal at {x }T1 [33]. Empirically, we found both statements to be true if few one-sided updates were required for LP. Due to these connections between global Laplace and LP, the accuracy of LP after three forward-backward iterations was similar to that of global Laplace in all panels in Fig. 1. Although LP may have computational savings compared to global Laplace in certain applications [33], we found that global Laplace was substantially faster for the particular graph structure described by (1).
6 Conclusion
We have presented three deterministic techniques for nonlinear state estimation (global Laplace, GQ-EP, LP) and compared their decoding accuracy versus computation cost in the context of neural decoding, involving high-dimensional observations of non-negative integers. This work can be extended in the following directions. First, the deterministic techniques presented here should be compared to recently-developed Monte Carlo techniques that have yielded increased accuracy and/or reduced computational cost compared to the basic particle filter/smoother in applications other than neural decoding [29]. Examples include the Gaussian particle filter [31], sigma-point particle filter [30], and embedded hidden Markov model [36]. Second, we have compared these decoders based on one particular non-linear trajectory model (10). Other non-linear trajectory models (e.g., a model describing primate arm movements [37]) should be tested to see if the decoders have similar accuracy-computational cost tradeoffs as shown here. Acknowledgments. This work was supported by NIH-NINDS-CRCNS-R01, NDSEG Fellowship, NSF Graduate Research Fellowship, Gatsby Charitable Foundation, Michael Flynn Stanford Graduate Fellowship, Christopher Reeve Paralysis Foundation, Burroughs Wellcome Fund Career Award in the Biomedical Sciences, Stanford Center for Integrated Systems, NSF Center for Neuromorphic Systems Engineering at Caltech, Office of Naval Research, Sloan Foundation and Whitaker Foundation.
References 1. Brown, E.N., Frank, L.M., Tang, D., Quirk, M.C., Wilson, M.A.: A statistical paradigm for neural spike train decoding applied to position prediction from the ensemble firing patterns of rat hippocampal place cells. J. Neurosci 18(18), 7411– 7425 (1998)
2. Zhang, K., Ginzburg, I., McNaughton, B.L., Sejnowski, T.J.: Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol 79, 1017–1044 (1998) 3. Wessberg, J., Stambaugh, C.R., Kralik, J.D., Beck, P.D., Laubach, M., Chapin, J.K., Kim, J., Biggs, J., Srinivasan, M.A., Nicolelis, M.A.L.: Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408(6810), 361–365 (2000) 4. Schwartz, A.B., Taylor, D.M., Tillery, S.I.H.: Extraction algorithms for cortical control of arm prosthetics. Curr. Opin. Neurobiol. 11, 701–707 (2001) 5. Serruya, M., Hatsopoulos, N., Fellows, M., Paninski, L., Donoghue, J.: Robustness of neuroprosthetic decoding algorithms. Biol. Cybern. 88(3), 219–228 (2003) 6. Brockwell, A.E., Rojas, A.L., Kass, R.E.: Recursive Bayesian decoding of motor cortical signals by particle filtering. J. Neurophysiol 91(4), 1899–1907 (2004) 7. Wu, W., Black, M.J., Mumford, D., Gao, Y., Bienenstock, E., Donoghue, J.P.: Modeling and decoding motor cortical activity using a switching Kalman filter. IEEE Trans Biomed Eng 51(6), 933–942 (2004) 8. Yu, B.M., Kemere, C., Santhanam, G., Afshar, A., Ryu, S.I., Meng, T.H., Sahani, M., Shenoy, K.V.: Mixture of trajectory models for neural decoding of goal-directed movements. J. Neurophysiol. 97, 3763–3780 (2007) 9. Chapin, J.K., Moxon, K.A., Markowitz, R.S., Nicolelis, M.A.L.: Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nat. Neurosci. 2, 664–670 (1999) 10. Serruya, M.D., Hatsopoulos, N.G., Paninski, L., Fellows, M.R., Donoghue, J.P.: Instant neural control of a movement signal 416, 141–142 (2002) 11. Taylor, D.M., Tillery, S.I.H., Schwartz, A.B.: Direct cortical control of 3D neuroprosthetic devices. Science 296, 1829–1832 (2002) 12. Carmena, J.M., Lebedev, M.A., Crist, R.E., O’Doherty, J.E., Santucci, D.M., Dimitrov, D.F., Patil, P.G., Henriquez, C.S., Nicolelis, M.A.L.: Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biology 1(2), 193–208 (2003) 13. Musallam, S., Corneil, B.D., Greger, B., Scherberger, H., Andersen, R.A.: Cognitive control signals for neural prosthetics. Science 305, 258–262 (2004) 14. Santhanam, G., Ryu, S.I., Yu, B.M., Afshar, A., Shenoy, K.V.: A high-performance brain-computer interface. Nature 442, 195–198 (2006) 15. Hochberg, L.R., Serruya, M.D., Friehs, G.M., Mukand, J.A., Saleh, M., Caplan, A.H., Branner, A., Chen, D., Penn, R.D., Donoghue, J.P.: Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442, 164–171 (2006) 16. Wu, W., Gao, Y., Bienenstock, E., Donoghue, J.P., Black, M.J.: Bayesian population decoding of motor cortical activity using a Kalman filter. Neural Comput 18(1), 80–118 (2006) 17. Kemere, C., Meng, T.: Optimal estimation of feed-forward-controlled linear systems. In: Proc IEEE ICASSP, pp. 353–356 (2005) 18. Srinivasan, L., Eden, U.T., Willsky, A.S., Brown, E.N.: A state-space analysis for reconstruction of goal-directed movements using neural signals. Neural Comput 18(10), 2465–2494 (2006) 19. Srinivasan, L., Brown, E.N.: A state-space framework for movement control to dynamic goals through brain-driven interfaces. IEEE Trans. Biomed. Eng. 54(3), 526–535 (2007) 20. Shoham, S., Paninski, L.M., Fellows, M.R., Hatsopoulos, N.G., Donoghue, J.P., Normann, R.A.: Statistical encoding model for a primary motor cortical brainmachine interface. IEEE Trans. Biomed. Eng. 
52(7), 1313–1322 (2005)
21. Wan, E., van der Merwe, R.: The unscented Kalman filter. In: Haykin, S. (ed.) Kalman Filtering and Neural Networks, Wiley Publishing, Chichester (2001) 22. Julier, S., Uhlmann, J.: Unscented filtering and nonlinear estimation. Proceedings of the IEEE 92(3), 401–422 (2004) 23. Arasaratnam, I., Haykin, S., Elliott, R.: Discrete-time nonlinear filtering algorithms using Gauss-Hermite quadrature. Proceedings of the IEEE 95(5), 953–977 (2007) 24. Minka, T.: Expectation propagation for approximate Bayesian inference. In: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 362–369 (2001) 25. Heskes, T., Zoeter, O.: Expectation propagation for approximate inference in dynamic Bayesian networks. In: Darwiche, A., Friedman, N. (eds.) Proceedings UAI2002, pp. 216–223 (2002) 26. Zoeter, O., Ypma, A., Heskes, T.: Improved unscented Kalman smoothing for stock volatility estimation. In: Barros, A., Principe, J., Larsen, J., Adali, T., Douglas, S. (eds.) Proceedings of the IEEE Workshop on Machine Learning for Signal Processing (2004) 27. Ypma, A., Heskes, T.: Novel approximations for inference in nonlinear dynamical systems using expectation propagation. Neurocomputing 69, 85–99 (2005) 28. Yu, B.M., Shenoy, K.V., Sahani, M.: Expectation propagation for inference in nonlinear dynamical models with Poisson observations. In: Proc. IEEE Nonlinear Statistical Signal Processing Workshop (2006) 29. Doucet, A., de Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001) 30. van der Merwe, R., Wan, E.: Sigma-point Kalman filters for probabilistic inference in dynamic state-space models. In: Proceedings of the Workshop on Advances in Machine Learning (2003) 31. Kotecha, J.H., Djuric, P.M.: Gaussian particle filtering. IEEE Transactions on Signal Processing 51(10), 2592–2601 (2003) 32. MacKay, D.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003) 33. Smola, A., Vishwanathan, V., Eskin, E.: Laplace propagation. In: Thrun, S., Saul, L., Sch¨ olkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16, MIT Press, Cambridge (2004) 34. Minka, T., Lafferty, J.: Expectation-propagation for the generative aspect model. In: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 352–359 (2002) 35. Doucet, A., Godsill, S., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(3), 197–208 (2000) 36. Neal, R.M., Beal, M.J., Roweis, S.T.: Inferring state sequences for non-linear systems with embedded hidden Markov models. In: Thrun, S., Saul, L., Sch¨ olkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16, MIT Press, Cambridge (2004) 37. Chan, S.S., Moran, D.W.: Computational model of a primate arm: from hand position to joint angles, joint torques and muscle forces. J. Neural. Eng. 3, 327–337 (2006)
Estimating Internal Variables of a Decision Maker's Brain: A Model-Based Approach for Neuroscience

Kazuyuki Samejima1 and Kenji Doya2

1 Brain Science Institute, Tamagawa University, 6-1-1 Tamagawa-gakuen, Machida, Tokyo 194-8610, Japan
[email protected]
2 Initial Research Project, Okinawa Institute of Science and Technology, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan
[email protected]

Abstract. A major problem in the search for neural substrates of learning and decision making is that the process is highly stochastic and subject dependent, making simple stimulus- or output-triggered averaging inadequate. This paper presents a novel approach of characterizing neural recording or brain imaging data in reference to the internal variables of learning models (such as connection weights and parameters of learning) estimated from the history of external variables within a Bayesian inference framework. We specifically focus on reinforcement learning (RL) models of decision making and derive an estimation method for the variables by particle filtering, a recent method of dynamic Bayesian inference. We present the results of its application to decision-making experiments in monkeys and humans. The framework is applicable to a wide range of behavioral data analysis and diagnosis.
1 Introduction

The traditional approach in neuroscience to discover information processing mechanisms is to correlate neuronal activities with external physical variables, such as sensory stimuli or motor outputs. However, when we search for neural correlates of higher-order brain functions, such as attention, memory and learning, a problem has been that there are no external physical variables to correlate with. Recently, with advances in computational neuroscience, a number of computational models of such cognitive or learning processes have become available that make quantitative predictions of subjects' behavioral responses. Thus a possible new approach is to try to find neural activities that correlate with the internal variables of such computational models (Corrado and Doya, 2007). A major issue in such model-based analysis of neural data is how to estimate the hidden variables of the model. For example, in learning agents, hidden variables such as connection weights change in time. In addition, the course of learning is regulated by hidden meta-parameters such as learning rates. Another important issue is how to judge the validity of a model or to select the best model among a number of candidates.
The framework of Bayesian inference can provide coherent solutions to the issues of estimating hidden variables, including meta-parameters, from observable experimental data and of selecting the most plausible computational model out of multiple candidates. In this paper, we first review the reinforcement learning model of reward-based decision making (Sutton and Barto, 1998) and derive a Bayesian estimation method for the hidden variables of a reinforcement learning model by particle filtering (Samejima et al., 2004). We then review examples of application of the method to monkey neural recording (Samejima et al., 2005) and human imaging studies (Haruno et al., 2004; Tanaka et al., 2006; Behrens et al., 2007).
2 Reinforcement Learning Model as an Animal or Human Decision Maker

Reinforcement learning can be a model of animal or human decision making based on reward delivery. Notably, the responses of monkey midbrain dopamine neurons are successfully explained by the temporal difference (TD) error of reinforcement learning models (Schultz et al., 1997). The goal of reinforcement learning is to improve the policy, the rule of taking an action a_t at state s_t, so that the resulting reward r_t is maximized in the long run. The basic strategy of reinforcement learning is to estimate the cumulative future reward under the current policy as the value function for each state, and then to improve the policy based on the value function. In a standard reinforcement learning algorithm called "Q-learning," an agent learns the action-value function
\[
Q(s_t, a_t) = E\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s, a \,\right] \tag{1}
\]
which estimates the cumulative future reward when action a_t is taken at a state s_t. The discount factor 0 < γ < 1 is a meta-parameter that controls the time scale of prediction. The policy of the learner is then given by comparing action values, e.g. according to the Boltzmann distribution
\[
\pi(a \mid s) = \frac{\exp\left(\beta Q(a, s_t)\right)}{\sum_{a' \in A} \exp\left(\beta Q(a', s_t)\right)} \tag{2}
\]
where the inverse temperature β > 0 is another meta-parameter that controls randomness of action selection. From an experience of state s_t, action a_t, reward r_t, and next state s_{t+1}, the action-value function is updated by the Q-learning algorithm (Sutton and Barto, 1998) as
\[
\delta_t = r_t + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t)
\]
\[
Q(s_t, a_t) \Leftarrow Q(s_t, a_t) + \alpha \delta_t, \tag{3}
\]
where α > 0 is the meta-parameter for the learning rate. In the case of a reinforcement learning agent, we thus have three meta-parameters. Such a reinforcement learning model of behavior learning does not only predict a subject's actions, but can also provide candidates of the brain's internal processes for decision making, which may be captured in neural recording or brain imaging data. However, a big problem is that the predictions depend on the setting of meta-parameters, such as the learning rate α, action randomness β, and discount factor γ.
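For concreteness, here is a minimal Python sketch (ours, not from the paper) of a Q-learning agent with the Boltzmann policy of Eq. (2) and the update of Eq. (3), applied to a hypothetical one-state, two-action bandit with placeholder reward probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, gamma = 0.1, 2.0, 0.0      # meta-parameters (gamma = 0: single-state task)
p_reward = np.array([0.9, 0.5])         # placeholder reward probabilities for actions L, R

Q = np.zeros(2)                         # action values Q(a) for the single state
for t in range(200):
    # Boltzmann (softmax) policy, Eq. (2)
    pi = np.exp(beta * Q) / np.sum(np.exp(beta * Q))
    a = rng.choice(2, p=pi)
    r = float(rng.random() < p_reward[a])
    # TD error and Q-learning update, Eq. (3); the next-state value is scaled by gamma
    delta = r + gamma * np.max(Q) - Q[a]
    Q[a] += alpha * delta
```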
3 Probabilistic Dynamic Evolution of Internal Variable for Q-Learning Agent

Let us consider a problem of estimating the course of action values {Q_t(s, a); s ∈ S, a ∈ A, 0 < t < T}, and meta-parameters α, β, and γ of a reinforcement learner, by only observing the sequence of states s_t, actions a_t and rewards r_t. We use a Bayesian method of estimating a dynamical hidden variable {x_t; t ∈ N} from a sequence of observable variables {y_t; t ∈ N} to solve this problem. We assume that the unobservable signal (hidden variable) is modeled as a Markov process with initial distribution p(x_0) and transition probability p(x_{t+1} | x_t). The observations {y_t; t ∈ N} are assumed to be conditionally independent given the process {x_t; t ∈ N} and of marginal distribution p(y_t | x_t). The problem to solve
in this setting is to estimate recursively in time the posterior distribution of the hidden variable p(x_{1:t} | y_{1:t}), where x_{0:T} = {x_0, …, x_T} and y_{1:T} = {y_1, …, y_T}. The marginal distribution is given by the recursive procedure of the following prediction and updating steps.

Predicting:
\[
p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}
\]
Updating:
\[
p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})}{\int p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})\, dx_t}
\]
We use a numerical method that has been proposed to solve this Bayesian recursion procedure, called the particle filter (Doucet et al., 2001). In the particle filter, the distributions of the sequence of hidden variables are represented by a set of random samples, also named "particles". We use a bootstrap filter to calculate the recursion of the prediction and the update of the distribution of particles (Doucet et al., 2001). Figure 1 shows the dynamical Bayesian network representation of the evolution of internal variables in a Q-learning agent. The hidden variable x_t consists of the action values
Q(s, a) for each state-action pair, learning rate α, inverse temperature β, and discount factor γ. The observable variable y_t consists of the states s_t, actions a_t, and rewards r_t. The observation probability p(y_t | x_t) is given by the softmax action selection (2). The transition probability p(x_{t+1} | x_t) of the hidden variable is given by the Q-learning rule (3) and the assumption on the meta-parameter dynamics. Here we assume that the meta-parameters (α, β, and γ) are constant with small drifts. Because α, β and γ should all be positive, we assume random-walk dynamics in logarithmic space.
\[
\log(x_{t+1}) = \log(x_t) + \varepsilon_x, \qquad \varepsilon_x \sim \mathcal{N}(0, \sigma_x) \tag{4}
\]
where σ_x is a meta-meta-parameter that defines the random-walk variability of the meta-parameters.
Fig. 1. A Bayesian network representation of a Q-learning agent: the dynamics of observable and unobservable variables depend on the decision, reward probability, state transition, and the update rule for the value function. Circles: hidden variables. Double box: observable variables. Arrow: probabilistic dependency.
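The following Python sketch (ours, a simplified illustration rather than the authors' implementation) shows a bootstrap particle filter for this model in the single-state, two-action case (γ omitted): each particle carries action values and log meta-parameters, is reweighted by the softmax likelihood (2) of the observed choice, resampled, and then propagated through the Q-learning rule (3) and the log-space random walk (4); σ_x and the particle count are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma_x = 1000, 0.05                     # particles and random-walk scale (placeholders)

# Particle state: Q-values for the two actions, plus log(alpha) and log(beta).
Q = np.zeros((N, 2))
log_alpha = np.log(0.2) + 0.5 * rng.standard_normal(N)
log_beta = np.log(1.0) + 0.5 * rng.standard_normal(N)

def filter_step(a_obs, r_obs):
    """One bootstrap-filter step given the observed action (0 or 1) and reward (0 or 1)."""
    global Q, log_alpha, log_beta
    # 1) Weight particles by the softmax probability, Eq. (2), of the observed action.
    beta = np.exp(log_beta)
    logits = beta * Q
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    w = pi[:, a_obs]
    w /= w.sum()
    # 2) Resample particles in proportion to their weights.
    idx = rng.choice(N, size=N, p=w)
    Q, log_alpha, log_beta = Q[idx].copy(), log_alpha[idx].copy(), log_beta[idx].copy()
    # 3) Propagate: Q-learning update, Eq. (3) with gamma = 0, using each particle's own
    #    learning rate, then the log-space random walk on the meta-parameters, Eq. (4).
    alpha = np.exp(log_alpha)
    Q[:, a_obs] += alpha * (r_obs - Q[:, a_obs])
    log_alpha = log_alpha + sigma_x * rng.standard_normal(N)
    log_beta = log_beta + sigma_x * rng.standard_normal(N)
    return Q.mean(axis=0), alpha.mean(), beta.mean()

# Example: feed one observed trial (chose action 1, got a reward).
print(filter_step(1, 1))
```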
4 Computational Model-Based Analysis of Brain Activity

4.1 Application to Monkey Choice Behavior and Striatal Neural Activity

Samejima et al. (Samejima et al., 2005) used the internal parameter approach with a Q-learning model for a monkey's free choice task, a two-armed-bandit problem
(Figure 2). The task has only one state, two actions, and stochastic binary reward. The reward probability for each action is fixed within a block of 30–150 trials, but randomly chosen from five kinds of probability combinations, block by block. The reward probabilities P(a=L) for action a=L and P(a=R) for action a=R are selected randomly from five settings, [P(a=L), P(a=R)] = {[0.5, 0.5], [0.5, 0.1], [0.1, 0.5], [0.5, 0.9], [0.9, 0.5]}, at the beginning of each block.
Fig. 2. Two-armed bandit task for the monkey's behavioral choice. Upper: time course of the task. The monkey faced a panel in which three LEDs, right, left and up, were embedded, and a small LED was in the middle. When the small LED was illuminated red, the monkeys grasped a handle with their right hand and adjusted it at the center position. If the monkeys held the handle at the center position for 1 s, the small LED was turned off as a GO signal. Then, the monkeys turned the handle to either the right or left side, which was associated with a shift of yellow LED illumination from up to the turned direction. After 0.5 sec, the color of the LED changed from yellow to either green or red. A green LED was followed by a large amount of reward water, while a red LED was followed by a small amount of water. Lower panel: state diagram of the task. The circle indicates a state. Arrows indicate possible actions and state transitions.
The Q-learning model of monkey behavior tries to learn the reward expectation of each action, the action value, and to maximize the reward acquired in each block. Because the task has only one state, the agent does not need to take into account the next state's value, and thus we set the discount factor as γ = 0. (Samejima et al., 2005) showed that the computed internal variable, the action value for a particular movement direction (left/right), which is estimated from the past history of choices and outcomes (rewards), could predict the monkey's future choice probability (Figure 3). Action value is an example of a variable that could not be immediately
Fig. 3. Time course of predicted choice probability and estimated action values. Upper panel: an example history of actions (red = right, blue = left), rewards (dot = small, circle = large), choice ratio (cyan line, Gaussian smoothed, σ = 2.5) and predicted choice probability (black line). The color of the upper bar indicates the reward probability combination. Lower panel: estimated action values (blue = Q-value for left / red = Q-value for right). (From Samejima et al. 2005).
Fig. 4. An example of the activity of a caudate neuron plotted on the space of estimated action values QL(t) and QR(t). Left panel: 3-dimensional plot of neural activity against estimated QL(t) and QR(t). Right panel: 2-D projected plots of the discharge rates of the neuron on the QL axis (left side) and on the QR axis (right side). Grey lines are derived from the regression model. Circles and error bars indicate the average and standard deviation of neural discharge rates for each of 10 equally populated action-value bins. (From Samejima et al. 2005).
obvious from observable experimental parameters but can be inferred using an action-predicting computational model. Furthermore, the activity of most dorsal striatum projection neurons correlated with the estimated action value for a particular action (Figure 4).

4.2 Application to Human Imaging Data

Not only the internal variables but also the meta-parameters (e.g. learning rate, action stochasticity, and discount rate for future reward) can be estimated by this methodology. Although the subjective values of the learning meta-parameters might differ for individual subjects, the model-based approach can track subjective internal values under different meta-parameters. Especially in human imaging studies, this
admissibility is effective for extracting common neuronal circuit activation in multiple-subject experiments. One problem in cognitive neuroscience with decision-making tasks is the lack of controllability of internal variables. In conventional analyses in neuroscience and brain-imaging studies, the experimenter tries to control a cognitive state or an assumed internal parameter by a task demand or an experimental setting. Observed brain activities are then compared to the assumed parameter. However, the subjective internal variables may depend on personal behavioral tendencies and may be different from the parameter assumed by the experimenter. The Bayesian estimation method for internal variables, including meta-parameters, can reduce such a noise term of personal difference by fitting the meta-parameters. (Tanaka et al., 2006) showed that the variety of behavioral tendencies of multiple human subjects could be characterized by the estimated meta-parameters of a Q-learning agent. Figure 5 shows the distribution of three meta-parameters: learning rate α, action stochasticity β, and discount rate γ. The subjects whose estimated γ was lower tended to be trapped in a locally optimal policy and could not reach the optimal choice sequence (Figure 5, left panel). On the other hand, the subjects whose learning rate α and inverse temperature β were estimated lower than others reported in a post-experimental questionnaire that they could not find any confident action selection in each state, even in later experimental sessions of the task (Figure 5, right panel). Regardless of the variety of the subjects' behavioral tendencies, the fMRI signal that correlated with the estimated action value for the selected action was observed in the ventral striatum in the unpredictable condition, in which the state transitions were completely random, whereas the dorsal striatum correlated with the action value in the predictable environment, in which the state transitions were deterministic. This suggests that different cortico-basal ganglia circuits might be involved depending on the predictability of the environment (Tanaka et al., 2006).
Fig. 5. Subject distribution of estimated meta-parameters: learning rate α, action stochasticity (inverse temperature) β, and discount rate γ. Left panel: distribution in α–γ space. Subjects LI, NN, and NT (left panel, indicated by the inside of the ellipsoid) were trapped in a locally optimal action sequence. Right panel: distribution in α–β space. Subjects BB and LI (right panel, indicated by the inside of the ellipsoid) reported that they could not find any confident strategy.
5 Conclusion

The theoretical framework of reinforcement learning for modeling behavioral decision making and the Bayesian estimation method for subjective internal variables can be powerful tools for analyzing both neural recording (Samejima et al., 2005) and human imaging data (Daw et al., 2006; Pessiglione et al., 2006; Tanaka et al., 2006). Especially, tracking the meta-parameters of RL can capture the behavioral tendencies of animal or human decision making. Recently, a correlation between anterior cingulate cortex activity and the learning rate under uncertain environmental change was reported using a Bayesian decision model with a temporally evolving learning-rate parameter (Behrens et al., 2007). Although not detailed in this paper, the Bayesian estimation framework also provides a way of objectively selecting the best model in reference to the given data. The combination of Bayesian model selection and hidden variable estimation methods would contribute to a new understanding of the decision mechanisms of our brain through falsifiable hypotheses and objective experimental tests.
References 1. Behrens, T.E., Woolrich, M.W., Walton, M.E., Rushworth, M.F.: Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007) 2. Corrado, G., Doya, K.: Understanding neural coding through the model-based analysis of decision making. J. Neurosci. 27, 8178–8180 (2007) 3. Daw, N.D., O’Doherty, J.P., Dayan, P., Seymour, B., Dolan, R.J.: Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006) 4. Doucet, A., Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001) 5. Haruno, M., Kuroda, T., Doya, K., Toyama, K., Kimura, M., Samejima, K., Imamizu, H., Kawato, M.: A neural correlate of reward-based behavioral learning in caudate nucleus: a functional magnetic resonance imaging study of a stochastic decision task. J. Neurosci. 24, 1660–1665 (2004) 6. Pessiglione, M., Seymour, B., Flandin, G., Dolan, R.J., Frith, C.D.: Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature 442, 1042–1045 (2006) 7. Samejima, K., Doya, K., Ueda, Y., Kimura, M.: Advances in neural processing systems, vol. 16. The MIT Press, Cambridge, Massachusetts, London, England (2004) 8. Samejima, K., Ueda, Y., Doya, K., Kimura, M.: Representation of action-specific reward values in the striatum. Science 310, 1337–1340 (2005) 9. Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Science 275, 1593–1599 (1997) 10. Sutton, R.S., Barto, A.G.: Reinforcement Learning. The MIT press, Cambridge (1998) 11. Tanaka, S.C., Samejima, K., Okada, G., Ueda, K., Okamoto, Y., Yamawaki, S., Doya, K.: Brain mechanism of reward prediction under predictable and unpredictable environmental dynamics. Neural Netw. 19, 1233–1241 (2006)
Visual Tracking Achieved by Adaptive Sampling from Hierarchical and Parallel Predictions

Tomohiro Shibata1, Takashi Bando2, and Shin Ishii1,3

1 Graduate School of Information Science, Nara Institute of Science and Technology
[email protected]
2 DENSO Corporation
3 Graduate School of Informatics, Kyoto University
[email protected]

Abstract. Because inevitable ill-posedness exists in visual information, the brain essentially needs some prior knowledge, prediction, or hypothesis to acquire a meaningful solution. From a computational point of view, visual tracking is the real-time process of statistical spatiotemporal filtering of target states from an image stream, and incremental Bayesian computation is one of its most important devices. To make Bayesian computation of the posterior density of state variables tractable for any type of probability distribution, Particle Filters (PFs) have often been employed in the real-time vision area. In this paper, we briefly review incremental Bayesian computation and PFs for visual tracking, indicate drawbacks of PFs, and then propose our framework, in which hierarchical and parallel predictions are integrated by adaptive sampling to achieve an appropriate balance of tracking accuracy and robustness. Finally, we discuss the proposed model from the viewpoint of neuroscience.
1 Introduction
Because inevitable ill-posedness exists in visual information, the brain essentially needs some prior knowledge, prediction, or hypothesis to acquire a meaningful solution. Prediction is also essential for real-time recognition or visual tracking. Due to the flood of visual data, examining the whole data is infeasible, and ignoring the irrelevant data is essentially requisite. Primate fovea and oculomotor control can be viewed from this point; high visual acuity is realized by the narrow foveal region on the retina, and the visual axis has to be actively moved by oculomotor control. Computer vision, in particular real-time vision, faces the same computational problems discussed above, and attractive as well as feasible methods and applications have been developed in the light of particle filters (PFs) [4]. One of the key ideas of PFs is the importance sampling distribution, or proposal distribution, which can be viewed as prediction or attention in order to overcome the discussed computational problems. The aim of this paper is to propose a novel Bayesian visual tracking framework for hierarchically-modeled state variables for single object tracking, and to discuss PFs and our framework from the viewpoint of neuroscience.
2 Adaptive Sampling from Hierarchical and Parallel Predictions

2.1 Incremental Bayes and Particle Filtering
Particle filtering [4] is an approach to performing Bayesian estimation of intractable posterior distributions from time series signals with non-Gaussian noise, generalizing traditional Kalman filtering. This approach has been attracting attention in various research areas, including real-time visual processing (e.g., [5]). In clutter, there are usually several competing observations, and these cause the posterior to be multi-modal and therefore non-Gaussian. In reality, using a large number of particles is not allowed, especially for real-time processing, and thus there is strong demand to reduce the number of particles. Reducing the number of particles, however, can lead to sacrificing accuracy and robustness of filtering, particularly in the case that the dimension of the state variables is high. How to cope with this trade-off has been one of the most important computational issues in PFs, but there have been few efforts to reduce the dimension of the state variables in the context of PFs. Making use of hierarchy in state variables seems a natural solution to this problem. For example, in the case of pose estimation of a head from a video stream involving state variables of the three-dimensional head pose and the two-dimensional positions of face features on the image, the high-dimensional state space can be divided into two groups by using the causality in their states, i.e., the head pose strongly affects the positions of face features (cf. Fig. 1, except dotted arrows). There are, however, two big problems in this setting. First, the estimation of the lower state variables is strongly dependent on the estimation of the higher state variables. In real applications, it often happens that the assumption of generation from the higher state variable to the lower state variable is violated. Second, the assumption of generation from the lower state variable to the input, typically image frames, can also be violated. These two problems lead to failure in the estimation.
2.2 Estimation of Hierarchically-Modeled State Variables
Here we present a novel framework for hierarchically-modeled state variables for single object tracking. The intuition of our approach is that the higher and lower layers have their own dynamics, respectively, and mixing their predictions over the proposal distribution based on their reliability adds both robustness and accuracy in tracking with a smaller number of particles. We assume there are two continuous state vectors at time step t, denoted by a_t ∈ R^{N_a} and x_t ∈ R^{N_x}, and they are hierarchically modeled as in Fig. 1. Our goal is then estimating these unobservable states from an observation sequence z_{1:t}.

State Estimation. According to Bayes rule, the joint posterior density p(a_t, x_t | z_{1:t}) is given by
\[
p(a_t, x_t \mid z_{1:t}) \propto p(z_t \mid a_t, x_t)\, p(a_t, x_t \mid z_{1:t-1}),
\]
Fig. 1. A graphical model of a hierarchical and parallel dynamics. Right panels depict example physical representations processed in the layers.
where p(a_t, x_t | z_{1:t-1}) is the joint prior density, and p(z_t | a_t, x_t) is the likelihood. The joint prior density p(a_t, x_t | z_{1:t-1}) is given by the previous joint posterior density p(a_{t-1}, x_{t-1} | z_{1:t-1}) and the state transition model p(a_t, x_t | a_{t-1}, x_{t-1}) as follows:
\[
p(a_t, x_t \mid z_{1:t-1}) = \int p(a_t, x_t \mid a_{t-1}, x_{t-1})\, p(a_{t-1}, x_{t-1} \mid z_{1:t-1})\, da_{t-1}\, dx_{t-1}.
\]
The state vectors a_t and x_t are assumed to be generated from the hierarchical model shown in Fig. 1. Furthermore, the dynamics model of a_t is assumed to be represented as a temporal Markov chain, with conditional independence between a_{t-1} and x_t given a_t. Under these assumptions, the state transition model p(a_t, x_t | a_{t-1}, x_{t-1}) and the joint prior density p(a_t, x_t | z_{1:t-1}) can be represented as
\[
p(a_t, x_t \mid a_{t-1}, x_{t-1}) = p(x_t \mid a_t)\, p(a_t \mid a_{t-1})
\]
and
\[
p(a_t, x_t \mid z_{1:t-1}) = p(x_t \mid a_t)\, p(a_t \mid z_{1:t-1}),
\]
respectively. Then, we can carry out hierarchical computation of the joint posterior distribution by PF as follows: first, the higher samples {a_t^{(i)} | i = 1, ..., N_a} are drawn from the prior density p(a_t | z_{1:t-1}). The lower samples {x_t^{(j)} | j = 1, ..., N_x} are then drawn from the prior density p(x_t | z_{1:t-1}), described as the proposal distribution p(x_t | a_t^{(i)}). Finally, the weights of the samples are given by the likelihood p(z_t | a_t, x_t).

Adaptive Mixing of Proposal Distribution. When applying this to state estimation problems in the real world, the above proposal distribution can be contaminated by cruel non-Gaussian noise and/or deviations from the assumptions. Especially in the case of PF estimation with a small number of particles, the contaminated proposal distribution can give a fatal disturbance to the
estimation. In this study, we assume a state transition in the lower layer, represented by the dotted arrows in Fig. 1, independently of the upper layer, and we aim to enable robust and accurate estimation by adaptively determining the contribution from this hypothetical state transition. The state transition p(a_t, x_t | a_{t-1}, x_{t-1}) can be represented as
\[
p(a_t, x_t \mid a_{t-1}, x_{t-1}) = p(x_t \mid a_t, x_{t-1})\, p(a_t \mid a_{t-1}). \tag{1}
\]
Here, the dynamics model of x_t is assumed to be represented as
\[
p(x_t \mid a_t, x_{t-1}) = \alpha_{a,t}\, p(x_t \mid a_t) + \alpha_{x,t}\, p(x_t \mid x_{t-1}), \tag{2}
\]
with α_{a,t} + α_{x,t} = 1. In our algorithm, p(x_t | a_t, x_{t-1}) is modelled as a mixture of approximated prediction densities computed in the lower and higher layers, p(x_t | x_{t-1}) and p(x_t | a_t). Then, its mixture ratio α_t = {α_{a,t}, α_{x,t}}, representing the contribution of each layer, is determined by a procedure which enables the interaction between the layers. We describe the way of determining the mixture ratio α_t in the following subsection. From Eqs. (1) and (2), the joint prior density p(a_t, x_t | z_{1:t-1}) is given as
\[
p(a_t, x_t \mid z_{1:t-1}) = p(x_t \mid a_t, z_{1:t-1})\, p(a_t \mid z_{1:t-1}) = \pi(x_t \mid \alpha_t)\, p(a_t \mid z_{1:t-1}),
\]
where
\[
\pi(x_t \mid \alpha_t) = \alpha_{a,t}\, p(x_t \mid a_t) + \alpha_{x,t}\, p(x_t \mid z_{1:t-1}) \tag{3}
\]
is the adaptive-mixed proposal distribution for x_t based on the prediction densities p(x_t | a_t) and p(x_t | z_{1:t-1}).

Determination of α_t using the On-line EM Algorithm. The mixture ratio α_t is the parameter determining the adaptive-mixed proposal distribution π(x_t | α_t), and its determination by minimizing the KL divergence between the posterior density in the lower layer p(x_t | a_t, z_{1:t}) and π(x_t | α_t) gives robust and accurate estimation. The determination of α_t is equivalent to determining the mixture ratio in a two-component mixture model, and we employ a sequential Maximum-Likelihood (ML) estimation of the mixture ratio. In our method, the index variable for component selection becomes the latent variable, and therefore the sequential ML estimation is implemented by means of an on-line EM algorithm [11]. Resampling from the posterior density p(x_t | a_t, z_{1:t}), we obtain N_x samples x̃_t^{(i)}. Using the latent variable m = {m_a, m_x}, which indicates which prediction density, p(x_t | a_t) or p(x_t | z_{1:t-1}), is trusted, the on-line log likelihood can then be represented as
\[
L(\alpha_t) = \eta_t \sum_{\tau=1}^{t} \left( \prod_{s=\tau+1}^{t} \lambda_s \right) \sum_{i=1}^{N_x} \log \pi\left(\tilde{x}_\tau^{(i)} \mid \alpha_t\right)
= \eta_t \sum_{\tau=1}^{t} \left( \prod_{s=\tau+1}^{t} \lambda_s \right) \sum_{i=1}^{N_x} \log \sum_{m} p\left(\tilde{x}_\tau^{(i)}, m \mid \alpha_\tau\right),
\]
1. Estimation of the state variable x_t in the lower layer.
   - Obtain α_{a,t} N_x samples x_{a,t}^{(i)} from p(x_t | a_t).
   - Obtain α_{x,t} N_x samples x_{x,t}^{(i)} from p(x_t | z_{1:t-1}).
   - Obtain the expectation x̂_t and Std(x_t) of p(x_{n,t} | a_t, z_{1:t}) using the N_x mixture samples x_t^{(i)}, constituted by x_{x,t}^{(i)} and x_{a,t}^{(i)}.
   The above procedure is applied to each feature to obtain {x̂_{n,t}, Std(x_{n,t})}.
2. Estimation of the state variable a_t in the higher layer.
   - Obtain D_{a,n}(t) based on Std(x_{n,t}), and then estimate p(a_t | z_{1:t}).
3. Determination of the mixture ratio α_{t+1}.
   - Obtain N_x samples x̃_t^{(i)} from p(x_{n,t} | a_t, z_{1:t}).
   - Calculate α_{t+1} so as to maximize the on-line log likelihood.
   The above procedure is applied to each feature to obtain {α_{n,t+1}}.
Fig. 2. Hierarchical pose estimation algorithm
where λ_s is a decay constant to decline adverse influence from former inaccurate estimation, and
\[
\eta_t = \left( \sum_{\tau=1}^{t} \prod_{s=\tau+1}^{t} \lambda_s \right)^{-1}
\]
is a normalization constant which works as a learning coefficient. The optimal mixture ratio α*_t by means of the KL divergence, which gives the optimal proposal distribution π(x_t | α*_t), can be calculated by maximization of the on-line log likelihood as follows:
\[
\alpha^{*}_{m,t} = \frac{m_t}{\sum_{m' \in m} m'_t},
\]
where
\[
p\left(\tilde{x}_t^{(i)}, m_a \mid \alpha_t\right) = \alpha_{a,t}\, p\left(\tilde{x}_t^{(i)} \mid a_t\right), \qquad
p\left(\tilde{x}_t^{(i)}, m_x \mid \alpha_t\right) = \alpha_{x,t}\, p\left(\tilde{x}_t^{(i)} \mid z_{1:t-1}\right),
\]
and
\[
m_t = \eta_t \sum_{\tau=1}^{t} \left( \prod_{s=\tau+1}^{t} \lambda_s \right) \sum_{i=1}^{N_x} p\left(m \mid \tilde{x}_\tau^{(i)}, \alpha_\tau\right), \qquad
p\left(m \mid \tilde{x}_t^{(i)}, \alpha_t\right) = \frac{p\left(\tilde{x}_t^{(i)}, m \mid \alpha_t\right)}{\sum_{m' \in m} p\left(\tilde{x}_t^{(i)}, m' \mid \alpha_t\right)}.
\]
Note that m_t can be calculated incrementally.
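To illustrate the two ingredients just described — sampling from the adaptive-mixed proposal (3) and updating the mixture ratio with the incremental on-line EM statistics — here is a small Python sketch; it is our own simplified rendering with one-dimensional Gaussian prediction densities, a fixed decay constant, and placeholder values, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
Nx, lam_decay = 50, 0.5                 # particles per feature, decay constant (placeholders)

# 1-D Gaussian stand-ins for the two prediction densities of Eqs. (2)-(3).
def p_higher(x, a):                     # p(x_t | a_t): prediction projected from the higher layer
    return np.exp(-0.5 * (x - a) ** 2) / np.sqrt(2 * np.pi)

def p_lower(x, x_prev):                 # p(x_t | z_{1:t-1}): prediction from the lower-layer dynamics
    return np.exp(-0.5 * (x - x_prev) ** 2 / 4.0) / np.sqrt(2 * np.pi * 4.0)

alpha = np.array([0.5, 0.5])            # [alpha_a, alpha_x], the mixture ratio
m_stat = np.array([0.0, 0.0])           # discounted sufficient statistics m_t

def step(a_pred, x_prev, x_resampled):
    """One time step: sample from pi(x_t | alpha_t), then update alpha via on-line EM."""
    global alpha, m_stat
    # Sample from the mixed proposal, Eq. (3): choose a component per particle.
    use_higher = rng.random(Nx) < alpha[0]
    proposal = np.where(use_higher,
                        a_pred + rng.standard_normal(Nx),          # from p(x_t | a_t)
                        x_prev + 2.0 * rng.standard_normal(Nx))    # from p(x_t | z_{1:t-1})
    # E-step responsibilities p(m | x_tilde, alpha) for the resampled posterior particles.
    joint = np.stack([alpha[0] * p_higher(x_resampled, a_pred),
                      alpha[1] * p_lower(x_resampled, x_prev)])
    resp = joint / joint.sum(axis=0, keepdims=True)
    # Discounted accumulation of m_t and M-step for the mixture ratio
    # (the normalizer eta_t cancels in the ratio, so it is not tracked explicitly).
    m_stat = lam_decay * m_stat + resp.sum(axis=1)
    alpha = m_stat / m_stat.sum()
    return proposal

# Example call with placeholder predictions and posterior samples:
x_post = rng.standard_normal(Nx)
print(step(a_pred=0.0, x_prev=0.2, x_resampled=x_post))
```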
2.3 Application to Pose Estimation of a Rigid Object
Here the proposed method is applied to a real problem, pose estimation of a rigid object (cf. Fig. 1). The algorithm is shown in Fig. 2. Nf features on the image plane at time step t are denoted by xn,t = (un,t , vn,t )T , (n = 1, ..., Nf ), an affine camera matrix which projects a 3D model of the object onto the image plane at time step t is reshaped into a vector form at = (a1,t , ..., a8,t )T , for simplicity, and an observed image at time step t is denoted by zt . When applied to the pose estimation of a rigid object, Gaussian process is assumed in the higher layer and the object’s pose is estimated by Kalman Filter,
while tracking of the features is performed by a PF because of the existence of cruel non-Gaussian noise, e.g. occlusion, in the lower layer. To obtain the samples x_{n,t}^{(i)} from the mixture proposal distribution, we need two prediction densities, p(x_{n,t} | a_t) and p(x_{n,t} | z_{1:t-1}). The prediction density computed in the higher layer, p(x_{n,t} | a_t), which contains the physical relationship between the features, is given by the affine projection process of the 3D model of the rigid object.
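As an illustration of that higher-layer prediction, the sketch below is ours; the reshaping of a_t into a 2×4 affine camera matrix acting on homogeneous 3D model points, and all numerical values, are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical 3D feature positions of the rigid-object model (N_f x 3), in mm.
model_points = np.array([[0.0, 0.0, 0.0],
                         [30.0, 0.0, 0.0],
                         [0.0, 20.0, 0.0],
                         [0.0, 0.0, 20.0]])

def predict_features(a_t):
    """Predicted image positions x_{n,t} from the 8-dim affine parameter vector a_t.

    Assumption: a_t is the row-wise flattening of a 2x4 affine camera matrix applied
    to homogeneous 3D model points [X, Y, Z, 1]^T.
    """
    A = np.asarray(a_t).reshape(2, 4)
    homog = np.hstack([model_points, np.ones((model_points.shape[0], 1))])
    return homog @ A.T            # (N_f x 2) array of predicted (u, v) positions

# Example: an affine matrix that scales by 2 and shifts the image origin to (320, 240).
a_t = np.array([2.0, 0.0, 0.0, 320.0,
                0.0, 2.0, 0.0, 240.0])
print(predict_features(a_t))
```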
2.4 Experiments
Simulations. The goal of this simulation is to estimate the posture of a rigid object, as in Fig. 3A, from an observation sequence of eight features. The rigid object was a hexahedral piece of size 20 (upper base) × 30 (lower base) × 20 (height) × 20 (depth) [mm]; it was rotated 150 mm away from a pin-hole camera (focal length: 40 mm) and projected onto the image plane (pixel size: 0.1 × 0.1 [mm]) by a perspective projection process disturbed by Gaussian noise (mean: 0, standard deviation: 1 pixel). The four features at the back of the object were occluded by the object itself. An occluded feature was represented by assuming that the standard deviation of the measurement noise grows to 100 pixels from a non-occluded value of 1 pixel. The other four features at the front of the object were not occluded. The length of the observation sequence of the features was 625 frames, i.e., about 21 sec. In this simulation, we compared the performance of our proposed method with (i) the method with a fixed α_t = {1, 0}, which is equivalent to the simple hierarchical modeling in which the prediction density computed in the higher layer is trusted at every time step, and (ii) the method with a fixed α_t = {0, 1}, which does not implement the mutual interaction between the layers. The decay constant of the on-line EM algorithm was set at λ_t = 0.5. The pose estimated with the adaptive proposal distribution and α_{a,t} at each time step are shown in Figs. 3B and 3C, respectively. Here, the object's pose θ_t = {θ_{X,t}, θ_{Y,t}, θ_{Z,t}} was calculated from the estimated a_t using an Extended Kalman Filter (EKF). In our implementation, the maximum and minimum values of α_{a,t} were limited to 0.8 and 0.2, respectively, to prevent the robustness from degenerating. As shown in Fig. 3B, our method achieved robust estimation against the occlusion. Concurrently with the robust estimation of the object's pose, appropriate determination of the mixture ratio was exhibited. For example, in the case of the feature x_1, the prediction density computed in the higher layer was emphasized, and was well predicted using the 3D object model, during the period in which x_1 was occluded, because the observed feature position, contaminated by severe noise, lowered the confidence of the prediction density in the lower layer.

Real Experiments. To investigate the performance against the severe non-Gaussian noise existing in real environments, the proposed method was applied to head pose estimation of a driver from a real image sequence captured in a car. Face extraction/tracking from an image sequence is a well-studied problem because of its applicability to various areas, and several
Fig. 3. A: a sample simulation image. B: Time course of the estimated object’s pose. C: Time course of αa,t determined by the on-line EM algorithm. Gray background represents a frame in which the feature was occluded.
PF algorithms have been proposed. However, for accurate estimation of state variables, e.g., a human face, lying in a high-dimensional space, especially in the case of real-time processing, some technique for dimensionality reduction is required. The proposed method is expected to enable more robust estimation despite limited computing resources by exploiting the hierarchy of state variables. The real image sequence was captured by a near-infrared camera behind the steering wheel, i.e., the captured images did not contain color information, and the image resolution was 640 × 480. In such a visual tracking task, the true observation process p(z_t|x_{n,t}) is unknown because the true positions of the face features are unobservable; hence, a model of the observation process for calculating the particle weight w_{n,t}^(i) is needed. In this study, we employed normalized correlation with a template as the approximate observation model. Although this observation model seems too simple to apply to problems in real environments, it is sufficient for examining the efficiency of the proposal distribution. We employed the nose, eyes, canthi, eyebrows, and corners of the mouth as the face features. Using the 3D feature positions measured by 3D distance-measuring equipment, we constructed the 3D face model. The proposed method was applied with 50 particles for each face feature, as in the simulation, and processed on a Pentium 4 (2.8 GHz) Windows 2000 PC with 1048MB RAM. Our system processed one frame in 29.05 msec, and hence achieved real-time processing.
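The normalized-correlation observation model mentioned above can be sketched as follows; the template size, the image-indexing convention, and the mapping from correlation score to particle weight are assumptions of this illustration, not details taken from the paper.

```python
import numpy as np

def normalized_correlation(patch, template):
    """Zero-mean normalized cross-correlation between an image patch and a template."""
    p = patch.astype(float) - patch.mean()
    t = template.astype(float) - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum()) + 1e-12
    return float((p * t).sum() / denom)

def particle_weight(image, particle_uv, template):
    """Approximate observation likelihood for one particle position (u, v)."""
    h, w = template.shape
    u, v = int(particle_uv[0]), int(particle_uv[1])
    patch = image[v:v + h, u:u + w]
    if patch.shape != template.shape:     # particle fell outside the image
        return 1e-6
    score = normalized_correlation(patch, template)
    return max(score, 0.0) + 1e-6         # simple score-to-weight mapping (assumed)
```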
Fig. 4. Mean estimation error in the case when α_t was estimated by EM, fixed at α_{a,t} = {1, 0}, and fixed at α_{a,t} = {0, 1} (100 trials). Since some bars protruded from the figure, they were shortened, and the error value is displayed on top of them instead.
Fig. 5. Tracked face features
Fig. 4 shows the estimation error of the head pose with respect to the true head pose of the driver measured by a gyro sensor in the car. Our adaptive mixing proposal distribution achieved robust head pose estimation, as in the simulation task. In Fig. 5, the estimated face features are depicted by "+" marks for various head poses of the driver; the confidence of the estimated feature position is represented by the size of the "+" mark, i.e., the larger a "+" is, the higher the estimation confidence of the estimator.
3 Modeling Primate's Visual Tracking by Particle Filters
Here we note that our computer vision study and primates’ vision share the same computational problems and similar constraints. Namely, they need to perform real-time spatiotemporal filtering of the visual data robustly and accurately as
much as possible with limited computing resources. Although there are huge numbers of neurons in the brain, their firing rates are very noisy and much slower than recent personal computers. We can visually track only around four to six objects simultaneously (e.g., [2]). These facts indicate a limited attention resource. As mentioned at the beginning of this paper, it is widely known that only the foveal region of the retina can acquire high-resolution images in primates, and that humans usually make saccades mostly to the eyes, nose, lips, and contours when we watch a human face. In other words, primates actively ignore irrelevant information in the face of massive image inputs. Furthermore, there have been many behavioral and computational studies reporting that the brain would compute Bayesian statistics (e.g., [8][10]). As we discussed, however, Bayesian computation is intractable in general, and particle filters (PFs) are an attractive and feasible solution to the problem as they are very flexible, easy to implement, and parallelisable. Importance sampling is analogous to efficient delivery of computing resources. As a whole, we conjecture that the primate brain would employ PFs for visual tracking. Although one of the major drawbacks of PFs is that a large number of particles, typically exponential in the dimension of the state variables, are required for accurate estimation, our proposed framework, in which adaptive sampling is performed from hierarchical and parallel predictive distributions, can be a solution. As demonstrated in Section 2 and in others' work [1], adaptive importance sampling from multiple predictions can balance both accuracy and robustness of the estimation with a restricted number of particles. Along this line, overt/covert smooth pursuit in primates could be a research target to investigate our conjecture. Based on the model of Shibata et al. [12], Kawawaki et al. investigated the human brain mechanism of overt/covert smooth pursuit by fMRI experiments and suggested that the activity of the anterior/superior lateral occipito-temporal cortex (a/sLOTC) was responsible for target motion prediction rather than motor commands for eye movements [7]. Note that the LOTC involves the monkey medial superior temporal (MST) homologue responsible for visual motion processing (e.g., [9][6]). In their study, the mechanism for increasing the a/sLOTC activity remained unclear. The increase in the a/sLOTC activity was observed particularly when subjects pursued blinking target motion covertly. This blink condition might cause two predictions, e.g., one emphasizing observation and the other its belief, as proposed in [1], and require the computational resource for adaptive sampling. Multiple predictions might be performed in other brain regions such as the frontal eye field (FEF), the inferior temporal (IT) area, and the fusiform face area (FFA). It is known that the FEF is involved in smooth pursuit (e.g., [3]) and has reciprocal connections to the MST area (e.g., [13]), but how they work together is unclear. Visual tracking of a more general object rather than a small spot as a visual stimulus requires a specific target representation and a representation of distractors, including the background. So the IT, FFA, and other areas related to higher-order visual representation would be making parallel predictions to deal with the varying target appearance during tracking.
4 Conclusion
In this paper, we have first introduced particle filters (PFs) as an approximate incremental Bayesian computation, and pointed out their drawbacks. Then, we have proposed a novel framework for visual tracking based on PFs as a solution to these drawbacks. The keys of the framework are: (1) the high-dimensional state space is decomposed into hierarchical and parallel predictors that treat state variables of lower dimension, and (2) their integration is achieved by adaptive sampling. The feasibility of our framework has been demonstrated by real as well as simulation studies. Finally, we have pointed out the computational problems shared between PFs and human visual tracking, presented our conjecture that the primate brain, at least, employs PFs, and discussed its possibility and perspectives for future investigations.
References
1. Bando, T., Shibata, T., Doya, K., Ishii, S.: Switching particle filters for efficient visual tracking. J. Robot Auton. Syst. 54(10), 873 (2006)
2. Cavanagh, P., Alvarez, G.A.: Tracking multiple targets with multifocal attention. Trends in Cogn. Sci. 9(7), 349–354 (2005)
3. Fukushima, K., Yamanobe, T., Shinmei, Y., Fukushima, J., Kurkin, S., Peterson, B.W.: Coding of smooth eye movements in three-dimensional space by frontal cortex. Nature 419, 157–162 (2002)
4. Gordon, N.J., Salmond, J.J., Smith, A.F.M.: Novel approach to nonlinear non-Gaussian Bayesian state estimation. IEEE Proc. Radar Signal Processing 140, 107–113 (1993)
5. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. Int. J. Comput. Vis. 29(1), 5–28 (1998)
6. Kawano, M., Shidara, Y., Watanabe, Y., Yamane, S.: Neural activity in cortical area MST of alert monkey during ocular following responses. J. Neurophysiol. 71(6), 2305–2324 (1994)
7. Kawawaki, D., Shibata, T., Goda, N., Doya, K., Kawato, M.: Anterior and superior lateral occipito-temporal cortex responsible for target motion prediction during overt and covert visual pursuit. Neurosci. Res. 54(2), 112
8. Knill, D.C., Pouget, A.: The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosci. 27(12) (2004)
9. Newsome, W.T., Wurtz, H., Komatsu, R.H.: Relation of cortical areas MT and MST to pursuit eye movements. II. Differentiation of retinal from extraretinal inputs. J. Neurophysiol. 60(2), 604–620 (1988)
10. Rao, R.P.N.: The Bayesian Brain: Probabilistic Approaches to Neural Coding. In: Neural Models of Bayesian Belief Propagation. MIT Press, Cambridge (2006)
11. Sato, M., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Computation 12(2), 407–432 (2000)
12. Shibata, T., Tabata, H., Schaal, S., Kawato, M.: A model of smooth pursuit in primates based on learning the target dynamics. Neural Netw. 18(3), 213
13. Tian, J.-R., Lynch, J.C.: Corticocortical input to the smooth and saccadic eye movement subregions of the frontal eye field in cebus monkeys. J. Neurophysiol. 76(4), 2754–2771 (1996)
Bayesian System Identification of Molecular Cascades

Junichiro Yoshimoto 1,2 and Kenji Doya 1,2,3

1 Initial Research Project, Okinawa Institute of Science and Technology Corporation, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan {jun-y,doya}@oist.jp
2 Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
3 ATR Computational Neuroscience Laboratories, 2-2-2 Hikaridai, "Keihanna Science City", Kyoto 619-0288, Japan
Abstract. We present a Bayesian method for the system identification of molecular cascades in biological systems. The contribution of this study is to provide a theoretical framework for unifying three issues: 1) estimating the most likely parameters; 2) evaluating and visualizing the confidence of the estimated parameters; and 3) selecting the most likely structure of the molecular cascades from two or more alternatives. The usefulness of our method is demonstrated in several benchmark tests. Keywords: Systems biology, biochemical kinetics, system identification, Bayesian inference, Markov chain Monte Carlo method.
1 Introduction
In recent years, the analysis of molecular cascades by mathematical models has contributed to the elucidation of intracellular mechanisms related to learning and memory [1,2]. In such modeling studies, the structure and parameters of the molecular cascades are selected based on the literature and databases¹. However, if reliable information about a target molecular cascade is not obtained from those repositories, we must tune its structure and parameters so as to fit the model behaviors to the available experimental data. The development of a theoretically sound and efficient system identification framework is crucial for making such models useful. In this article, we propose a Bayesian system identification framework for molecular cascades. For a given set of experimental data, the system identification can be separated into two inverse problems: parameter estimation and model selection. The most popular strategy for parameter estimation is to find a single set of parameters based on the least mean-square-error or maximum likelihood criterion [3]. However, we should be aware that the estimated parameters might
¹ DOQCS (http://doqcs.ncbs.res.in/) and BIOMODELS (http://biomodels.net/) are available, for example.
suffer from an “over-fitting effect” because the available set of experimental data is often small and noisy. For evaluating the accuracy of the estimators, statistical methods based on the asymptotic theory [4] and Fisher information [5] were independently proposed. Still, we must pay attention to practical limitations: a large number of data are required for the former method; and the mathematical model should be linear at least locally for the latter method. For model selection, Sugimoto et al. [6] proposed a method based on genetic programming. This provided a systematic way to construct a mathematical model, but a criterion to adjust a tradeoff between data fitness and model complexity was somewhat heuristic. Below, we present a Bayesian formulation for the system identification of molecular cascades.
2 Problem Setting
In this study, we consider the dynamic behavior of molecular cascades that can be modeled as a well-stirred mixture of I molecular species {S_1, ..., S_I}, which chemically interact through J elementary reaction channels {R_1, ..., R_J} inside a compartment with a fixed volume Ω and constant temperature [7]. The dynamical state of this system is specified by a vector X(t) ≡ (X_1(t), ..., X_I(t)), where X_i(t) is the number of molecules of species S_i in the system at time t. If we assume that there are a sufficiently large number of molecules within the compartment, it is useful to take the state vector as the molar concentrations x(t) ≡ (x_1(t), ..., x_I(t)) by the transformation x(t) = X(t)/(Ω N_A), where N_A ≈ 6.0 × 10^23 is Avogadro's constant. In this case, the system behavior can be well approximated by the deterministic ordinary differential equations (ODEs) [8]:

ẋ_i(t) = Σ_{j=1}^{J} (ν″_{ij} − ν′_{ij}) k_j Π_{l=1}^{I} x_l(t)^{ν′_{lj}},   i = 1, ..., I,   (1)

where ẋ ≡ dx/dt denotes the time-derivative (velocity) of a variable x, ν′_{ij} and ν″_{ij} are the stoichiometric coefficients of the molecular species S_i as a reactant and a product, respectively, in the reaction channel R_j, and k_j is the rate constant of the reaction channel R_j. Equation (1) is called the law of mass action. To illustrate an example of this mathematical model, let us consider a simple interaction whose chemical equation is given by

A + B ⇌ AB   (forward rate constant k_f, backward rate constant k_b).   (2)
Here, the two values associated with the arrows (k_f and k_b) are the rate constants of their corresponding reactions. The dynamic behavior of this system is given by

d[AB]/dt = −d[A]/dt = −d[B]/dt = k_f [A][B] − k_b [AB],   (3)
where the bracket [Si ] denotes the concentration of the molecular species Si . Enzymatic reactions, which frequently appear in biological signalings, can be modeled as a series of elementary reactions. In an enzymatic reaction that
changes a substrate S into a product P by the catalysis of an enzyme E, for example, the enzyme-substrate complex ES is regarded as the intermediate product, and the overall reaction is modeled as

E + S ⇌ ES → E + P   (forward: k_f, backward: k_b, catalytic: k_cat).   (4)
We can derive a system of ODEs corresponding to the chemical equation (4) based on the law of mass action (1), but it is also common to use the approximate form called the Michaelis-Menten equation [8]:

d[P]/dt = −d[S]/dt = (k_cat [E_tot][S]) / (K_M + [S]),   (5)

where [E_tot] ≡ [E] + [ES] is the total concentration of the molecular species including the enzyme E, and K_M ≡ (k_b + k_cat)/k_f is called the Michaelis-Menten constant.
When considering signaling communications among other compartments and biochemical experiments wherein drugs are injected, it is convenient to separate a control vector u ≡ (u_1, ..., u_I), which denotes I external effects on the system, from a state vector x. Let us consider a model (2), in which [B] can be arbitrarily manipulated by an experimenter. In this case, the control vector is u = ([B]), and the state vector is modified into x = ([A], [AB]).
By summarizing the above discussion, the dynamic models of molecular cascades concerned in this study assume the general form

ẋ(t) = f(x(t), u(t), q_1)   subject to   x(0) = q_0,   (6)

where q_0 ∈ R_+^I is a set of I initial values of the state variables², q_1 ∈ R_+^M is a set of M system parameters such as rate constants and Michaelis-Menten constants, and f: (x, u, q_1) → ẋ is a function derived based on the law of mass action.
We assume that a set of control inputs U ≡ {u(t) | t ∈ [0, T_f]} has been designed before commencing an experiment, where T_f is the final time point of the experiment. Let O ⊂ {1, ..., I} be a set of indices such that the concentration of any species in {S_i | i ∈ O} can be observed through the experiment. Then, let t_1, ..., t_N ∈ [0, T_f] be the time points at which the observations are made in the experiment. Using these notations, the set of all data observed in the experiment is defined by Y ≡ {y_i(t_n) | i ∈ O; n = 1, ..., N}, where y_i(t_n) is the concentration of the species S_i observed at time t_n.
The objective of this study is to identify the mathematical model (6) from a limited number of experimental data Y. This objective can be separated into two inverse problems. One problem involves fitting a parameter vector q ≡ (q_1, q_0) so as to minimize the residuals ε_i(t_n) = y_i(t_n) − x_i(t_n; q, U), where x_i(t_n; q, U) denotes the solution of the ODEs (6) at time t_n, which depends on the parameter vector q and control inputs U. In other words, we intend to find q such that model (6) reproduces the observation data Y as closely as possible. This inverse problem is called parameter estimation. The second problem is to infer the most likely candidate among multiple options {C_1, ..., C_C} with different model structures (functional forms of f) from the observation data Y.
² We denote the set of all non-negative real values by R_+ in accordance with common usage.
This problem is called model selection, and we call each Cc (c = 1, . . . , C) a “model candidate” in this study.
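For concreteness, a model of the form (6) for the binding reaction (2)-(3) can be simulated with any standard ODE solver. The sketch below is only an illustration; the rate constants and initial concentrations are the ones quoted later in the first benchmark of Section 4, while the time horizon is an assumption of this sketch.

```python
import numpy as np
from scipy.integrate import odeint

def binding_ode(x, t, kf, kb):
    """Law of mass action for A + B <-> AB (equations (2)-(3))."""
    A, B, AB = x
    v = kf * A * B - kb * AB          # net forward flux
    return [-v, -v, v]                # d[A]/dt, d[B]/dt, d[AB]/dt

kf, kb = 5.0, 0.01                    # rate constants used in the first benchmark
x0 = [0.4, 0.3, 0.01]                 # [A](0), [B](0), [AB](0)
t = np.linspace(0.0, 5.0, 51)         # 51 observation time points (horizon assumed)

x = odeint(binding_ode, x0, t, args=(kf, kb))
print(x[-1])                          # concentrations at the final time point
```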
3 Bayesian System Identification
In this section, we first present our Bayesian approach to the parameter estimation problem, and then extend the discussion to the model selection problem.
In statistical inference, the residuals {ε_i(t_n) | i ∈ O; n = 1, ..., N} are modeled as unpredictable noises. At first, we present a physical interpretation of the noises. Let us assume that each observation y_i(t_n) is obtained by sampling particles of the molecular species S_i from a small region with a fixed volume ω_i (0 < ω_i < Ω) inside the model compartment. Since the interior of the model compartment is well-stirred, the particles of S_i would be distributed by a spatial Poisson distribution that is independent of the states of the other molecular species. This implies that the number of particles of the sampled species, X_{ω_i}(t_n), would be distributed according to a binomial distribution with the total number of trials X_i(t_n) and success probability (ω_i/Ω) [9]. Also, this distribution can be well approximated (i.i.d.) by X_{ω_i}(t_n) ~ N(ω_i X_i(t_n)/Ω, ω_i X_i(t_n)/Ω)³ under the assumption X_i(t_n) ≫ 0. Using the two relationships y_i(t_n) = X_{ω_i}(t_n)/(ω_i N_A) and x_i(t_n) = X_i(t_n)/(Ω N_A), we have y_i(t_n) ~ N(x_i(t_n), x_i(t_n)/(ω_i N_A)) i.i.d., that is,

ε_i(t_n) ~ N(0, x_i(t_n)/γ_i)   i.i.d.,   (7)

where γ_i ≡ ω_i N_A. In general, we cannot determine the exact values of γ ≡ (γ_i)_{i∈O} ∈ R_+^{|O|}; hence, they are regarded as latent variables to be estimated along with the parameter q via the Bayesian inference.
Now, we consider the Bayesian inference of the set of all the unknown variables θ ≡ (q, γ), i.e., the posterior distribution of θ. By combining (7) with (6), we have the conditional distribution of the observation data Y:

p(Y|θ; U) = Π_{n=1}^{N} Π_{i∈O} N_1( y_i(t_n); x_i(t_n; q, U), x_i(t_n; q, U)/γ_i ),   (8)

where N_p(·; ·, ·) denotes a p-dimensional Gaussian density function⁴. According to Bayes' theorem, the posterior distribution of θ is given by

p(θ|Y; U) = p(Y|θ; U) p(θ; U) / p(Y; U) = p(Y|θ; U) p(θ; U) / ∫ p(Y|θ; U) p(θ; U) dθ,   (9)

where p(θ; U) is a prior distribution of θ defined before obtaining Y. In this study, we employ a prior distribution of the form

p(θ; U) = Π_{i=1}^{I+M} L(q_i; ξ_0) Π_{i∈O} G(γ_i; α_0, β_0),   (10)

where q_i and γ_i are the i-th elements of the vectors q and γ, respectively, α_0, β_0, and ξ_0 are small positive constants, and L(·; ·) and G(·; ·, ·) denote the density
³ N(μ, σ²) denotes a Gaussian (normal) distribution with mean μ and variance σ².
⁴ N_p(x; m, S) ≡ (2π)^{−p/2} |S|^{−1/2} exp{ −(x − m) S^{−1} (x − m)ᵀ / 2 }, where it should be noted that all vectors are supposed to be row vectors throughout this article.
functions of a log-Laplace distribution and a gamma distribution, respectively⁵. By substituting (8) and (10) into (9), the posterior distribution is well-defined.
The posterior distribution can be regarded as a degree of certainty of θ deduced from the outcome Y [10]. The most "likely" value of θ can be given by the maximum a posteriori (MAP) estimator θ_MAP ≡ argmax_θ p(θ|Y; U). This value is asymptotically equivalent to the solutions of the maximum likelihood method and a generalized least-mean-square method as the number of time points N approaches infinity. We must be aware that the MAP estimator asymptotically has a variance of O(1/N) as long as N is finite. Accordingly, it is informative to evaluate a high-confidence region of θ. The Bayesian inference can be used for this purpose, as will be shown in Section 4.
Although we have considered the parameter estimation problem for a given model candidate so far, the denominator of (9), which is called the evidence, plays an important role in the model selection problem. To make the dependence on model candidates explicit, let us write the evidence p(Y; U) for each candidate C_c (c = 1, ..., C) as p(Y|C_c; U). The posterior distribution of C_c is given by p(C_c|Y; U) ∝ p(Y|C_c; U) if the prior distribution over {C_1, ..., C_C} is uniform. This implies that the model candidate C*, where C* = argmax_c p(Y|C_c; U), is the optimal solution in the sense of the MAP estimation. Accordingly, the main task in the Bayesian model selection reduces to the calculation of the evidence p(Y; U) for each model candidate C_c.
The major difficulty encountered in the Bayesian inference is that the intractable integration in the denominator of the final equation in (9) must be evaluated. For convenience of implementation, we adopt a Markov chain Monte Carlo technique [11] to approximate the posterior distribution and evidence. Hereafter, we describe only the key essence of our approximation method. The basic framework of our approximation is based on a Gibbs sampling method [11]. We suppose that it is possible to draw the realizations q and γ directly from the two conditional distributions with density functions p(q|γ, Y; U) and p(γ|q, Y; U), respectively. In this case, the following iterative procedure asymptotically generates a set of L (L ≫ 1) random samples Θ_{1:L} ≡ {θ^(l) ≡ (q^(l), γ^(l)) | l = 1, ..., L} distributed according to the density function p(θ|Y; U).
1. Initialize the value of γ^(0) appropriately and set l = 1.
2. Generate q^(l) from the distribution with the density p(q|γ^(l−1), Y; U).
3. Generate γ^(l) from the distribution with the density p(γ|q^(l), Y; U).
4. Increment l := l + 1 and go to Step 2.
Since we select a gamma distribution as the prior distribution of γ_i in this study, p(γ|q, Y; U) always has the form Π_{i∈O} G(γ_i; α, β_i(q)), where

α = α_0 + N/2   and   β_i(q) = β_0 + (1/2) Σ_{n=1}^{N} (y_i(t_n) − x_i(t_n; q, U))² / x_i(t_n; q, U).   (11)

This implies that it is possible to draw γ directly from p(γ|q, Y; U) using a gamma-distributed random number generator in Step 3. On the other hand, the
⁵ L(x; ξ) ≡ exp{−|ln x|/ξ}/(2ξx) and G(γ_i; α, β) ≡ β^α γ_i^{α−1} e^{−β γ_i}/Γ(α), where Γ(α) ≡ ∫_0^∞ t^{α−1} e^{−t} dt is the gamma function.
analytical calculation of p(q|γ, Y; U) is still intractable; however, we know that p(q|γ, Y; U) is proportional to p(Y|θ; U) p(θ; U) up to the normalization constant. Thus, we approximate Step 2 by a Metropolis sampling method [11]. Here, the transition distribution of the Markov chain is constructed based on an adaptive direction sampling method [12] in order to improve the efficiency of the sampling. Indeed, it is required to compute the solution of the ODEs x_i(t_n; q^(l), U) explicitly in both Steps 2 and 3; however, this is not a significant burden because many numerical ODE solvers are currently available (see [13] for example). After the procedure generates L random samples Θ_{1:L} distributed according to the density function p(θ|Y; U), the target function p(θ|Y; U) can be well reconstructed by the kernel density estimator [14]:

p̃(θ|Y; U) = (1/L) Σ_{l=1}^{L} N_{D_θ}(θ; θ^(l), V),   V = ( 4 / ((D_θ + 2) L) )^{2/(D_θ + 4)} Cov(Θ_{1:L}),   (12)

where D_θ is the dimensionality of the vector θ, and Cov(Θ_{1:L}) is the covariance matrix of the set of samples Θ_{1:L}. Furthermore, assuming that p̃(θ|Y; U) is a good approximator of the true posterior distribution p(θ|Y; U), importance sampling theory [11] states that the integration in the denominator of (9) (i.e., the evidence p(Y; U)) can be well approximated by

p̃(Y; U) = (1/L) Σ_{l=1}^{L} p(Y|θ^(l); U) p(θ^(l); U) / p̃(θ^(l)|Y; U).   (13)
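A minimal sketch of the overall sampling scheme is shown below. It uses a plain random-walk Metropolis update for Step 2 rather than the adaptive direction sampler of [12], and all function arguments (the log-likelihood, log-prior, and gamma-conditional sampler) are assumptions of this illustration; the log-likelihood is expected to solve the ODEs (6) internally.

```python
import numpy as np

def gibbs_metropolis(log_lik, log_prior, draw_gamma, q0, gamma0,
                     n_samples, step=0.05, seed=0):
    """Alternate (Step 2) a Metropolis update of q given gamma with
    (Step 3) an exact draw of gamma from its Gamma conditional (11)."""
    rng = np.random.default_rng(seed)
    q, gamma = np.array(q0, float), np.array(gamma0, float)
    samples = []
    logp = log_lik(q, gamma) + log_prior(q, gamma)
    for _ in range(n_samples):
        q_prop = q + step * rng.standard_normal(q.shape)     # symmetric proposal
        if np.all(q_prop > 0):                               # parameters stay positive
            logp_prop = log_lik(q_prop, gamma) + log_prior(q_prop, gamma)
            if np.log(rng.uniform()) < logp_prop - logp:
                q, logp = q_prop, logp_prop
        gamma = draw_gamma(q, rng)                            # e.g. rng.gamma(alpha, 1/beta_i(q))
        logp = log_lik(q, gamma) + log_prior(q, gamma)
        samples.append((q.copy(), gamma.copy()))
    return samples
```

The stored samples are then plugged into the kernel density estimator (12), and the evidence (13) is obtained by averaging p(Y|θ)p(θ)/p̃(θ|Y) over the same samples.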
4 Numerical Demonstrations
To demonstrate its usefulness, we applied our Bayesian method to several benchmark problems. For all the benchmark problems, the constants of the prior distribution were fixed at α_0 = 10^−4, β_0 = 1, and ξ_0 = 1. The total number of Monte Carlo samples was set at L = 10^7, and every element of the initial sample θ^(0) was drawn from a uniform distribution within the range [0.99 × 10^−10, 1.01 × 10^−10]. In the first benchmark, we supposed a situation where the structure of the molecular cascades was fixed by the chemical equation (2), but [A] was unobservable at any time point. Instead of real experiments, the observations Y with the total number of time points N = 51 were synthetically generated by (3) using the parameters q_1 = (k_f, k_b) = (5.0, 0.01) and the initial state q_0 = ([A](0), [B](0), [AB](0)) = (0.4, 0.3, 0.01), where the noises in the observation processes were added according to the model (7) with the parameter γ_i = 1000. The circles and squares in Fig. 1(a) show the generated observation Y. For the observation data Y, we evaluated the posterior distribution of the set of all the unknown variables θ using the method described in Section 3. Figures 1(b-f) show the posterior distributions of the non-trivial unknown variables (k_f, k_b, [A](0), and γ_i for species B and AB), respectively, after marginalizing out all other variates. Here, the bold solid lines denote the true values used in generating data Y, and the bold broken lines are MAP estimators that were approximately obtained by argmax_{l∈{1,...,L}} p(Y|θ^(l); U) p(θ^(l); U). The histograms indicate the distribution of the Monte Carlo samples Θ_{1:L}, and the curves surrounding the histograms
[Figure 1 panels: (a) observation data Y ([B]obs, [AB]obs) and the model reconstruction; (b)-(f) marginal posterior distributions p(k_f|Y;U), p(k_b|Y;U), p([A](0)|Y;U), p(γ_2|Y;U) for species B, and p(γ_3|Y;U) for species AB, each showing the histogram of Monte Carlo samples, the kernel density estimate, the true value, the MAP estimator, and the 0.5%/99.5% percentiles.]
Fig. 1. Brief summaries of results in the first benchmark
Fig. 2. The joint posterior distributions of two out of three parameters (kf , kb , [A](0))
denote the kernel density estimators (12). Though model (3) with the MAP estimators could reproduce the observation data Y well (see the solid and broken lines in Fig. 1(a)), the estimators were slightly different from their true values because the total number of data was not very large. For this reason, the broadness of the posterior distribution provided useful information about the confidence in the inference. To evaluate the confidence more quantitatively, we defined the "99% credible interval" as the interval between the 0.5% and 99.5% percentiles along each axis of Θ_{1:L}. The dotted lines in Figs. 1(b-f) show the 99% credible interval. Since the posterior distribution was originally a joint probability density of all the unknown variables θ, it was also possible to visualize the highly confident parameter space in higher dimensions, which provided useful information about the dependence among the parameters. The three panels in Fig. 2 show the joint posterior distributions of two out of the three parameters (k_f, k_b, and [A](0)), where the white circles denote the true values. From the panels, we can easily observe that these parameters have a strong correlation between them. Interestingly, highly
Table 1. MAP estimators and 99% credible intervals in the second benchmark
Parameters        True   MAP    99% itvls.
Ys (×10^-5)       7.00   7.00   [6.83, 7.19]
k2 (×10^+6)       6.00   6.22   [5.77, 6.66]
ksyn (×10^-2)     1.68   1.97   [1.46, 3.97]
KIB (×10^-2)      1.00   0.74   [0.28, 1.28]
Fig. 3. (a-b) Observation data Y and state trajectories reconstructed in the second benchmark; and (c) the joint posterior distribution of (ksyn , KIB )
confident spaces over (k_f, [A](0)) and (k_b, [A](0)) had hyperbolic shapes that reflected the bilinear form in the ODEs (3).
To demonstrate the applicability to more realistic molecular cascades, we adopted a simulator of a bioreactor system developed by [15] in the second benchmark. The mathematical model corresponding to this system was given by the following ODEs:

d[B]/dt = (μ − q_in)[B],   d[S]/dt = q_in(c_in − [S]) − r_1 M_w [B],
d[M1]/dt = r_1 − r_2 − μ[M1],   d[M2]/dt = r_2 − r_3 − μ[M2],   d[M3]/dt = r_syn − μ[M3],

with
r_1 ≡ r_{1,max}[S] / (K_S + [S]),   r_2 ≡ k_2[M1] / (K_{M1} + [M1]),   r_3 ≡ r_{3,max}[M2] / (K_{M2} + [M2]),
r_syn ≡ k_syn K_IB [M3][M1] / (K_IB + [M2]),   μ = Y_s r_1,
where M_w, r_{1,max}, K_S, K_{M1}, r_{3,max}, and K_{M2} were assumed to be known constants that were the same as those reported in [15]. u ≡ (c_in, q_in) were control inputs in this simulator, and all the state variables x ≡ ([B], [S], [M1], [M2], [M3]) were observable. The observation time series Y were downloaded from the website⁶. The objective in this benchmark was to estimate the four unknown system parameters q_1 ≡ (Y_s, k_2, k_syn, K_IB). As a result of evaluating the posterior distribution over all the elements of θ, the MAP estimators and 99% credible intervals of q_1 were obtained, and are shown in Table 1. Indeed, the observation noises in this benchmark were i.i.d. distributed according to ε_i(t_n) ~ N(0, (0.1 x_i(t_n))²), which was different from our model (7). This inconsistency and the small amount of data led to MAP estimators that deviated slightly from the true values. However, it was notable that the model with the MAP estimators could reproduce the observation data Y fairly well, as shown in Figs. 3(a-b), so that the credible intervals provide good information about the plausibility/implausibility of the
http://sysbio.ist.uni-stuttgart.de/projects/benchmark
[Figure 4 panels: (a) observation data Y1 with the state trajectories ([S], [P]) reconstructed by candidates C1-C3 and the log-evidences ln p(Y1|C;U) (bar values 373.41, 306.69, 168.29); (b) observation data Y2 with the corresponding reconstructions and log-evidences ln p(Y2|C;U) (bar values 246.92, 246.16, 189.74).]
Fig. 4. Brief summaries of results in the model selection problem
estimated parameters. In addition, we could find a nonlinear dependence between the two parameters (k_syn, K_IB) even in this problem, as shown in Fig. 3(c).
Finally, our Bayesian method was applied to a simple model selection problem. In this problem, we supposed that the observation data were obtained from the molecular cascades whose chemical equations were written as

S_in → S   (rate constant k = 1);   S + E ⇌ SE ⇌ P + E   (rate constants k_f1, k_b1 and k_f2, k_b2),   (14)
where x ≡ ([S], [P], [E], [SE]) was the state vector of our target compartment, and only two elements, [S] and [P], were observable. u ≡ ([S_in]) was a control input to the compartment. We considered three model candidates depending on the reversibility of the reaction:
C1: The ODEs consistent with model (14) were constructed based on the law of mass action without any approximation. In this case, four system parameters (k_f1, k_b1, k_f2, and k_b2) were estimated.
C2: k_b2 was assumed sufficiently small so that the second part of model (14) was reduced to model (4). The estimated parameters were also reduced to (k_f1, k_b1, k_f2).
C3: In addition to the assumption of C2, the second part of model (14) was approximated by the ODEs (5). In this case, the system parameters were further reduced to (K_M1, k_f2), where K_M1 ≡ (k_b1 + k_f2)/k_f1.
The goal of this benchmark is to select the most reasonable candidate from {C1, C2, C3} for the given observation data Y. We prepared two sets of observation data Y1 and Y2 that were generated by model candidates C1 and C2, respectively. The circles in the first and second panels in Figs. 4(a-b) show the data points of Y1 and Y2. Here, the curves associated with the circles denote the state trajectories
reconstructed by the three model candidates with their MAP estimators. Then, we evaluated the evidence p˜(Yd |Cc ; U ) for d = 1, 2 and c = 1, . . . , 3 by Eq.(13). The results are shown in the bottom panels in Figs.4(a-b), where the evidence is transformed into the natural logarithm to avoid numerical underflow. As shown in Fig.4(a), only C1 could satisfactorily reproduce the observation data Y1 , while the others could not. Accordingly, the evidence for the model candidate C1 was maximum. For the observation data Y2 , all the model candidates could reproduce them satisfactorily in a similar manner, as shown in Fig.4(b). In this case, the evidence for the simplest model candidate C3 was nearly as high as that for C2 , from which the observation data Y2 were generated. These results demonstrated that the evidence provided useful information for determining whether a structure of molecular cascades could be further simplified or not.
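Once a log-evidence has been computed for each candidate (for instance with a sampler like the one sketched in Section 3), the selection step itself is immediate. The values in the dictionary below are hypothetical, used only to illustrate the mechanics, not the values in Fig. 4.

```python
import numpy as np

# Hypothetical log-evidences ln p(Y | C_c; U) for three candidates (illustration only)
log_evidence = {"C1": -12.3, "C2": -10.8, "C3": -10.9}

best = max(log_evidence, key=log_evidence.get)
print("selected candidate:", best)

# Relative posterior probabilities under a uniform prior over candidates
vals = np.array(list(log_evidence.values()))
post = np.exp(vals - vals.max())
post /= post.sum()
print(dict(zip(log_evidence, np.round(post, 3))))
```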
5 Conclusion
In this article, we presented a Bayesian method for the system identification of molecular cascades in biological systems. The advantage of this method is that it provides a theoretically sound framework unifying parameter estimation and model selection. Due to the limited space, we have shown only small-scale benchmark results here. However, we have also confirmed that the precision of our MAP estimators was comparable to that of an existing good parameter estimation method in a medium-scale benchmark with 36 unknown system parameters [3]. The current limitations of our method are the difficulty of convergence diagnosis for the Monte Carlo sampling and the time-consuming computation for large-scale problems. We will overcome these limitations in our future work.
References
1. Bhalla, U.S., Iyengar, R.: Emergent properties of networks of biological signaling pathways. Science 283, 381–387 (1999)
2. Doi, T., et al.: Inositol 1,4,5-trisphosphate-dependent Ca2+ threshold dynamics detect spike timing in cerebellar Purkinje cells. The Journal of Neuroscience 25(4), 950–961 (2005)
3. Moles, C.G., et al.: Parameter estimation in biochemical pathways: A comparison of global optimization methods. Genome Research 13(11), 2467–2474 (2003)
4. Faller, D., et al.: Simulation methods for optimal experimental design in systems biology. Simulation 79, 717–725 (2003)
5. Banga, J.R., et al.: Computation of optimal identification experiments for nonlinear dynamic process models: A stochastic global optimization approach. Industrial & Engineering Chemistry Research 41, 2425–2430 (2002)
6. Sugimoto, M., et al.: Reverse engineering of biochemical equations from time-course data by means of genetic programming. BioSystems 80, 155–164 (2005)
7. Gillespie, D.T.: The chemical Langevin equation. Journal of Chemical Physics 113(1), 297–306 (2000)
8. Crampin, E.J., et al.: Mathematical and computational techniques to deduce complex biochemical reaction mechanisms. Progress in Biophysics & Molecular Biology 86, 77–112 (2004)
9. Daley, D.J., Vere-Jones, D.: An Introduction to the Theory of Point Processes, 2nd edn. Springer, Heidelberg (2003)
10. Bernardo, J.M., Smith, A.F.M.: Bayesian Theory. John Wiley & Sons Inc., Chichester (2000)
11. Andrieu, C., et al.: An introduction to MCMC for machine learning. Machine Learning 50(1-2), 5–43 (2003)
12. Gilks, W.R., et al.: Adaptive direction sampling. The Statistician 43(1), 179–189 (1994)
13. Hindmarsh, A.C., et al.: Sundials: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software 31, 363–396 (2005)
14. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, Boca Raton (1986)
15. Kremling, A., et al.: A benchmark for methods in reverse engineering and model discrimination: Problem formulation and solutions. Genome Research 14(9), 1773–1785 (2004)
Use of Circle-Segments as a Data Visualization Technique for Feature Selection in Pattern Classification

Shir Li Wang 1, Chen Change Loy 2, Chee Peng Lim 1,*, Weng Kin Lai 2, and Kay Sin Tan 3

1 School of Electrical & Electronic Engineering, University of Science Malaysia, Engineering Campus, 14300 Nibong Tebal, Penang, Malaysia [email protected]
2 Centre for Advanced Informatics, MIMOS Berhad, 57000 Kuala Lumpur, Malaysia
3 Department of Medicine, Faculty of Medicine, University of Malaya, 50603 Kuala Lumpur, Malaysia
Abstract. One of the issues associated with pattern classification using data-based machine learning systems is the "curse of dimensionality". In this paper, the circle-segments method is proposed as a feature selection method to identify important input features before the entire data set is provided for learning with machine learning systems. Specifically, four machine learning systems are deployed for classification, viz. Multilayer Perceptron (MLP), Support Vector Machine (SVM), Fuzzy ARTMAP (FAM), and k-Nearest Neighbour (kNN). The integration between the circle-segments method and the machine learning systems has been applied to two case studies comprising one benchmark and one real data set. Overall, the results after feature selection using the circle-segments method demonstrate improvements in performance even with more than 50% of the input features eliminated from the original data sets. Keywords: Feature selection, circle-segments, data visualization, principal component analysis, machine learning techniques.
1 Introduction

Data-based machine learning systems have wide applications owing to their capability of learning from a set of representative data samples and of improving performance as more and more data samples are used for learning. These systems have been employed to tackle many modeling, prediction, and classification tasks [1-7]. However, one of the crucial issues pertaining to pattern classification using data-based machine learning techniques is the "curse of dimensionality" [2, 6, 8]. This is especially true because it is important to identify an optimal set of input features for learning, e.g., with the support vector machine (SVM) [9]. The same problem arises in other data-based machine learning systems as well.
* Corresponding author.
In pattern classification, the main task is to learn and construct an appropriate function that is able to assign a set of input features to a finite set of classes. If noisy and irrelevant input features are used, the learning process may fail to formulate a good decision boundary that has discriminatory power for data samples of various classes. As a result, feature selection has a significant impact on classification accuracy [9]. Useful feature selection methods for machine learning systems include principal component analysis (PCA), the genetic algorithm (GA), as well as other data visualization techniques [1-2, 6, 8-13]. Despite the good performance of PCA and GA in feature selection [1-2, 6, 8-9], the circle-segments method, which is a data visualization technique, is investigated in this paper. Previously, the application of circle-segments was confined to displaying the history of stock data [14]. In this research, the circle-segments method is used in a different way, i.e., it is used to display the possible relationships between the input features and the output classes. More importantly, the circle-segments method allows the involvement of humans (domain users) in the process of data exploration and analysis. Indeed, use of the circle-segments method for feature selection focuses not only on the accuracy of a classification system, but also on the comprehensibility of the system. Here, comprehensibility refers to how easily a system can be assessed by humans [5]. In this regard, the circle-segments method provides a platform for easy interpretation and explanation of the decisions made in feature selection by a domain user. The user can obtain an overall picture of the input-output relationship of the data set, and interpret the possible relationships between the input features and the output classes based on additional, possibly intuitive, information. In this paper, we apply the circle-segments method as a feature selection method for pattern classification with four different machine learning systems, i.e., Multilayer Perceptron (MLP), Support Vector Machine (SVM), Fuzzy ARTMAP (FAM), and k-Nearest Neighbour (kNN). Two case studies (one benchmark and one real data set) are used to evaluate the effectiveness of the proposed approach. For each case study, the data set is first divided into a training set and a test set. The circle-segments method and PCA are used to select a set of important input features for learning with MLP, SVM, FAM, and kNN. The results are compared with those obtained without any feature selection method. The organization of this paper is as follows. Section 2 presents a description of the circle-segments method. Section 3 describes the case studies; the results are also presented, analyzed, and discussed there. A summary of the research findings, contributions, and suggestions for further work is presented in Section 4.
2 The Circle-Segments Method

Generally, the circle-segments method comprises three stages, i.e., dividing, ordering, and colouring. In the dividing stage, a circle is divided equally according to the number of features/attributes involved. For example, assume a process consists of seven input features and one output feature (which can take a number of discrete classes). Then, the circle is divided into eight equal segments, with one segment representing the output feature, while the others represent the seven input features.
In the ordering stage, a proper way of sorting the data samples is needed to ensure that all the features can be placed into the space of the circle. Since the focus is on the effects of each input feature towards the output, the correlation between the input features and the output is used to sort the data samples accordingly. Correlation has been used as a feature selection strategy in classification problems [17-18]. The results showed that correlation-based feature selection was able to determine the representative input features, and thus improve the classification accuracy. In this paper, correlation is used to sort the priority of the input features involved. The input features that have higher influence towards the output are sorted first, followed by the less influential input features. Correlation is defined in Equation (1):

r_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² Σ_{i=1}^{n} (y_i − ȳ)² )   (1)
where x_i and y_i refer to the input and output features respectively, and n is the number of samples. For ease of understanding, an example is used to illustrate the ordering stage. Assume a data set of n samples. Each sample consists of seven input features, (x1, x2, ..., x7), and one output feature, y, which represents four discrete classes. The combinations of the input-output data can be represented by a matrix, A, with 8 columns and n rows, as in Table 1. The original values of the input-output data are first normalized between 0 and 1. This facilitates the mapping of the input features into the colour map, whereby the colour values are between 0 and 1. The correlation of each input (x1, x2, ..., x7) towards the output is denoted as rx1, rx2, ..., rx7.

Table 1. The n samples of input-output data
            1st column (Output) y   2nd column x1   3rd column x2   ...   8th column x7
Sample 1    Output class            x11             x12             ...   x17
Sample 2    Output class            x21             x22             ...   x27
...         ...                     ...             ...             ...   ...
Sample n    Output class            xn1             xn2             ...   xn7
Assume the magnitudes of the correlation are as described in Equation (2),

R: rx1 > rx2 > ... > rx7.   (2)

Then, matrix A is sorted based on the output column (1st column). When the output values are equal, the rows of matrix A are further sorted, in ascending order, by the columns in the order specified in the vector Q, which lists the columns from the highest magnitude of correlation to the lowest, as shown in Equation (3):

Q = [Cx1, Cx2, ..., Cx7]   (3)
where C xi refers to the column order for feature i. Based on this example, the rows of matrix A are first sorted by the output column (1st column). When the output values are equal, the rows of A are further sorted by
column x2. When the values in column x2 are equal, the rows are further sorted by column x3, and so on, according to the column order specified in Q. After the ordering stage, matrix A has a new order. The first row of data in matrix A is located at the centre of the circle, while the last row of the data is located at the outside of the circle, as shown in Figure 1. The remaining data in between these two rows are put into the circle-segments based on their row order. In the colouring stage, colour values are used to encode the relevance of each data value to the original value based on a colour map. This relevance measure is in the range of 0 to 1. Based on the colour map located at the right side of Figure 1, the colours that have the highest and lowest values are represented by dark red and dark blue, respectively. With the help of the pseudo-colour approach, the data samples within each segment are linearly transformed into colours. Therefore, a combination of colours along the perimeter represents a combination of the input-output data. A small sketch of the ordering procedure is given after Fig. 1.
Fig. 1. Circle-segments with 7 input features and 1 output feature
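The ordering stage described above can be sketched compactly as follows; the colour mapping itself is left to a plotting library, and the data layout (column 0 holding the output class) follows Table 1, while the variable names are assumptions of this illustration.

```python
import numpy as np

def order_samples(A):
    """A has n rows and 8 columns: column 0 is the output class y,
    columns 1..7 are the normalized input features x1..x7 (Table 1)."""
    y = A[:, 0]
    X = A[:, 1:]
    # Equation (1): correlation of each input feature with the output
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    # Q: input columns ordered from the highest |correlation| to the lowest
    Q = np.argsort(-np.abs(r)) + 1
    # Sort rows by the output first, then by the columns listed in Q
    keys = tuple(A[:, c] for c in reversed(Q)) + (y,)   # last key has highest priority
    order = np.lexsort(keys)
    return A[order], Q
```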
3 Experiments

3.1 Data Sets
Two data sets are used to evaluate the effectiveness of the circle-segments method for feature selection. The first is the Iris data set obtained from the UCI Machine Learning Repository [15]. The data set has 150 samples, with 4 input features (sepal length, sepal width, petal length, and petal width) and 3 output classes (Iris Setosa, Iris Versicolour, and Iris Virginica). There are 50 samples in each output class. The second is a real medical data set of suspected acute stroke patients. The input features comprise patients' medical history, physical examination, laboratory test results, etc. The task is to predict the Rankin Scale category of patients upon discharge, either class 1 (Rankin scale between 0 and 1; 141 samples) or class 2 (Rankin scale between 2 and 6; 521 samples). After consultation with medical experts, a total of 18 input features, denoted as V1, V2, ..., V18, were selected.
3.2 Iris Classification
Figure 2 shows the circle-segments of the input-output data for the Iris data set. The circle-segments display for the three-class problem demonstrates the discrimination of the input features towards classification. Note that Iris Setosa, Iris Versicolour, and Iris Virginica are represented by dark blue, green and dark red respectively. Observing the projection of the Iris data into the circle-segments, the segments for petal length and petal width show significant colour changes as they propagate from the centre (blue) to the perimeter of the circle (red). By comparing these segments with segment Class, it is clear that petal width and petal length have a strong discriminatory power that could segregate the classes well. The discriminatory power of the other two segments is not as obvious, owing to colour overlapping and, thus, there is no clear progression of colour changes from blue to red. Based on segment Sepal Width, both Iris Versicolour (green) and Iris Virginica (dark red) have colour values lower than 0.6 for sepal width. Colour overlapping can also be observed for Iris Versicolour and Iris Virginica in segment Sepal Length too. The colour values of sepal length for both classes are distributed between 0.5 and 0.6. Therefore, sepal width and sepal length are only useful to differentiate Iris Setosa from the other two classes. As a result, the circle-segments method shows that petal width and petal length are important input features that can be used to classify the Iris data samples.
Fig. 2. Circle-segments of Iris data set
PCA is also used to extract significant input features. From the cumulative values shown in Table 2, the first principal component already accounts for 84% of the total variation. The features that tend to have a strong relationship with each principal component are those whose eigenvector elements are larger in absolute value than the others [16]. Therefore, the results in Table 3 indicate that the variables that tend to have a strong relationship with the first principal component are petal width and petal length, because their eigenvector elements tend to be larger in absolute value than the others.

Table 2. Eigenvalues of the covariance matrix of the Iris data set
Principal    Eigenvalue   Proportion   Cumulative
PC1          0.232        0.841        0.841
PC2          0.032        0.117        0.959
PC3          0.010        0.035        0.994
PC4          0.002        0.006        1.000
Table 3. Eigenvectors of the first principal component of the Iris data set
               PC1
Sepal Length   -0.425
Sepal Width     0.146
Petal Length   -0.616
Petal Width    -0.647
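The PCA analysis reported in Tables 2-3 amounts to an eigendecomposition of the covariance matrix of the normalized features; a hedged sketch (the threshold-based selection rule at the end is an illustration, not a rule stated by the paper):

```python
import numpy as np

def pca_summary(X):
    """X: (n_samples, n_features) normalized data.  Returns eigenvalues
    (variance explained) and eigenvectors (loadings), sorted by eigenvalue."""
    C = np.cov(X, rowvar=False)
    eigval, eigvec = np.linalg.eigh(C)            # ascending order
    idx = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[idx], eigvec[:, idx]
    proportion = eigval / eigval.sum()
    cumulative = np.cumsum(proportion)
    return eigval, eigvec, proportion, cumulative

# Features whose loading on the leading component(s) is large in absolute value
# are kept, e.g. keep = np.abs(eigvec[:, 0]) > threshold for a chosen threshold.
```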
The results obtained from the circle-segments and PCA methods suggest that petal width and petal length have more discriminatory power than the other two input features. As such, two data sets are produced: one containing the original data samples and another containing only petal width and petal length. Both data sets are used to train and evaluate the performance of the four machine learning systems. The free parameters of the four machine learning systems were determined by trial-and-error, as follows. For MLP, the early stopping method was used to adjust the number of hidden nodes and training epochs. The fast learning rule was used to train the FAM network. The baseline vigilance parameter was determined from a fine-tuning process by performing leave-one-out cross-validation on the training data sets. The same process was used to determine the number of neighbours for kNN. The radial basis function was selected as the kernel for SVM. Grid search with ten-fold cross-validation was performed to find the best values for the parameters of SVM.
Table 4 shows the overall average classification accuracy rates of 10 runs for the Iris data set. Although the Iris problem is a simple one, the results demonstrate that it is useful to apply feature selection methods to identify the important input features for pattern classification. As shown in Table 4, the accuracy rates are better, if not the same, even with 50% of the input features eliminated.

Table 4. Classification results of the Iris data set
Method   Accuracy (%) (before feature selection)   Accuracy (%) (after feature selection)
MLP      88.67                                     98.00
FAM      96.33                                     97.33
SVM      100                                       100
kNN      100                                       100
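A sketch of the comparison protocol for the Iris case study using scikit-learn stand-ins is shown below. Fuzzy ARTMAP has no scikit-learn implementation and is omitted, and the train/test split ratio and classifier settings are assumptions of this illustration rather than the paper's exact experimental setup.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
selected = [2, 3]                      # petal length, petal width

for cols, label in [(slice(None), "all features"), (selected, "selected features")]:
    Xs = X[:, cols]
    Xtr, Xte, ytr, yte = train_test_split(Xs, y, test_size=0.3, random_state=0)
    for clf in (MLPClassifier(max_iter=2000), SVC(kernel="rbf"),
                KNeighborsClassifier(n_neighbors=3)):
        acc = clf.fit(Xtr, ytr).score(Xte, yte)
        print(label, type(clf).__name__, round(acc, 3))
```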
3.3 Suspected Acute Stroke Patients
Figure 3 shows the circle-segments of the input-output data for the stroke data set. Based on the circle-segments, one can observe that the data samples are dominated by class 0 (dark red). Observing the projection of the data into the circle-segments, segments V8, V16, and V18 show significant colour changes from the centre (class 0) to the perimeter of the circle (class 1). From segment V8, one can see that most of the class 0 samples have colour values equal to or lower than 0.4, while for class 1, they are between 0.5 and 0.8. In segment V16, most of the class 0 samples are distributed within the colour range lower than 0.4, while for class 1, they are around 0.4. In segment V18, most of the class 0 samples have colour values lower than 0.7, while for class 1, they are mostly at around 0.4 with only a few samples between 0.7 and 1.0. The rest of the circle-segments do not depict a clear progression of colour changes pertaining to the two output classes.
The PCA method is again used to analyse the data set. According to [16], for a real data set, five or six principal components may be required to account for 70% to 75% of the total variation, and the more principal components that are required, the less useful each one becomes. As shown in Table 5, the cumulative values indicate that six principal components account for 72.1% of the total variation. Thus, six principal components are selected for further analysis. Table 6 presents the eigenvectors of the six principal components. The variables that have strong relationship with each principal component are in bold. These variables, i.e., V2, V3, V4, V6, V7, V8, V16, V17, and V18, have eigenvectors larger (in absolute value) than those of the others. Therefore, they are identified as the important input features.
Fig. 3. Circle-segments of the stroke data
Table 5. Eigenvalues of the covariance matrix of the stroke data set
Principal  Eigenvalue  Proportion  Cumulative
PC1 0.33573 0.262 0.262
PC2 0.15896 0.124 0.386
PC3 0.14386 0.112 0.499
PC4 0.11533 0.090 0.589
PC5 0.09023 0.070 0.659
PC6 0.07917 0.062 0.721
PC7 0.07403 0.058 0.779
PC8 0.06239 0.049 0.828
PC9 0.05390 0.042 0.870
PC10 0.04268 0.033 0.903
PC11 0.03252 0.025 0.929
PC12 0.03080 0.024 0.953
PC13 0.02588 0.020 0.973
PC14 0.01546 0.012 0.985
PC15 0.00894 0.007 0.992
PC16 0.00584 0.005 0.996
PC17 0.00293 0.002 0.999
PC18 0.00160 0.001 1.000
Three data sets are produced, i.e., one containing the original data samples, the other two containing the important input features identified by circle-segments and PCA, respectively. Three performance indicators that are commonly used in medical diagnostic systems are computed, i.e., accuracy (ratio of correct diagnoses to total number of patients), sensitivity (ratio of correct positive diagnoses to total number of patients with the disease), and specificity (ratio of correct negative diagnoses to total number of patients without the disease).
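For concreteness, these three indicators can be computed from a confusion matrix as in the following sketch (the labels and predictions shown are hypothetical placeholders, not values from the stroke study):

import numpy as np

def diagnostic_metrics(y_true, y_pred):
    # class 1 = patient with the disease, class 0 = patient without it
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # correct positive diagnoses / patients with the disease
    specificity = tn / (tn + fp)   # correct negative diagnoses / patients without the disease
    return accuracy, sensitivity, specificity

y_true = np.array([0, 0, 1, 1, 0, 1])   # hypothetical labels
y_pred = np.array([0, 1, 1, 0, 0, 1])   # hypothetical predictions
print(diagnostic_metrics(y_true, y_pred))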
Table 6. Eigenvectors of the first six principal components of the stroke data set
      PC1     PC2     PC3     PC4     PC5     PC6
V1    0.098   0.054  -0.083   0.042  -0.038   0.048
V2    0.611  -0.218   0.364   0.584  -0.316   0.024
V3    0.052   0.362  -0.064  -0.262  -0.741  -0.102
V4    0.520  -0.059  -0.188  -0.252   0.333  -0.410
V5   -0.054  -0.029   0.179  -0.244  -0.300  -0.023
V6    0.569   0.128  -0.027  -0.499   0.088   0.332
V7    0.039  -0.116  -0.389   0.090  -0.163  -0.683
V8   -0.035   0.122   0.407  -0.131   0.027  -0.170
V9    0.017   0.003   0.024  -0.049  -0.050   0.008
V10  -0.011   0.015   0.064  -0.073  -0.066   0.079
V11   0.006  -0.013  -0.023   0.045   0.114   0.136
V12   0.002  -0.011  -0.023   0.019   0.027   0.026
V13   0.018  -0.046  -0.054   0.059   0.082   0.142
V14   0.002  -0.015  -0.029   0.020   0.023   0.017
V15  -0.013   0.011   0.104  -0.072  -0.010  -0.071
V16   0.065   0.064   0.616  -0.217   0.148  -0.328
V17   0.086   0.870  -0.018   0.355   0.214  -0.083
V18   0.056   0.072  -0.267  -0.013  -0.136   0.216
Table 7 summarises the average results of 10 runs for the stroke data set. In general, the classification performances improve with feature selection using either PCA or circle-segments. The circle-segments method yields the best accuracy rates for all four machine learning systems, despite the fact that only three features are used for classification (a reduction of 83% of the number of input features). In terms of sensitivity, the circle-segments method also shows improvements (except for FAM) of 17%-35% as compared with those before feature selection. The specificity rates are better than those before feature selection for MLP and FAM, but inferior for SVM and kNN.
Table 7. Classification results and the number of input features for the stroke data set
Accuracy (%)
Sensitivity (%)
Specificity (%)
33.57
95.28
86.57
61.79
93.11
72.67
57.14
76.79
72.39
49.29
78.49
73.81
51.07
79.81
SVM  80.60  21.43  96.23  82.09  25.00  97.17  86.57  57.14  94.34
kNN  82.09  32.14  95.28  81.42  32.14  94.43  85.45  57.14  92.92
MLP
81.79
Sensitivity (%)
82.39
FAM
Accuracy (%)
Specificity (%)
Feature selection with circlesegments (3 input features)
Sensitivity (%)
Feature selection with PCA (8 input features) Accuracy (%)
Before feature selection (18 input features)
44.29
Specificity (%)
Methods
91.70
From Table 7, it seems that there is a trade-off between sensitivity and specificity with the use of the circle-segments method, i.e., more substantial improvement in sensitivity with marginal degradation in specificity for SVM and kNN while less substantial improvement in sensitivity with marginal improvement in specificity for MLP and FAM. This observation is interesting as it is important for a medical
diagnostic system to have high sensitivity as well as specificity rates so that patients with and without the disease can be identified accurately. Comparing the results of circle-segments and PCA, the accuracy and sensitivity rates of circle-segments are better than those of PCA. The trade-off is that PCA generally yields better specificity rates. Again, it is interesting to note the substantial improvement in sensitivity with only marginally inferior performance in specificity for both the circle-segments and PCA results. Another observation is that the PCA results do not show substantial improvements in terms of accuracy, sensitivity, and specificity as compared with those from the original data set.
4 Summary and Further Work
Feature selection in a data set based on data visualization is an interactive process involving the judgments of domain users. By identifying the patterns in the circle-segments, the important input features are distinguished from other, less important ones. The circle-segments method enables domain users to carry out the necessary filtering of input features, rather than feeding the whole data set, for learning using machine learning systems. With the circle-segments, domain users can visualize the relationships of the input-output data and comprehend the rationale of how the selection of input features is made. The results obtained from the two case studies positively demonstrate the usefulness of the circle-segments method for feature selection in pattern classification problems using four machine learning systems. Although the circle-segments method is useful, selection of the important features is very dependent on the users' knowledge, expertise, interpretation, and judgement. As such, it would be beneficial if some objective function could be integrated with the circle-segments method to quantify the information content of the selected features with respect to the original data sets. In addition, the use of the circle-segments method may be difficult when the problem involves high-dimensional data, and the users face limitations in analyzing and extracting patterns from massive data sets. It is infeasible to carry out the data analysis process when the dimension of the data is high, covering hundreds to thousands of variables. Therefore, a better characterization method for reducing the dimensionality of the data set is needed.
References
1. Lerner, B., Levinstein, M., Roserberg, B., Guterman, H., Dinstein, I., Romem, Y.: Feature Selection and Chromosome Classification Using a Multilayer Perceptron. IEEE World Congress on Computational Intelligence 6, 3540–3545 (1994)
2. Kavzoglu, T., Mather, P.M.: Using Feature Selection Techniques to Produce Smaller Neural Networks with Better Generalisation Capabilities. Geoscience and Remote Sensing Symposium 7, 3069–3071 (2000)
3. Lou, W.G., Nakai, S.: Application of artificial neural networks for predicting the thermal inactivation of bacteria: a combined effect of temperature, pH and water activity. Food Research International 34, 573–579 (2001)
4. Spedding, T.A., Wang, Z.Q.: Study on modeling of wire EDM process. Journal of Materials Processing Technology 69, 18–28 (1997)
5. Meesad, P., Yen, G.G.: Combined Numerical and Linguistic Knowledge Representation and Its Application to Medical Diagnosis. IEEE Transactions on Systems, Man and Cybernetics, Part A 33, 202–222 (2003)
6. Harandi, M.T., Ahmadabadi, M.N., Arabi, B.N., Lucas, C.: Feature selection using genetic algorithm and its application to face recognition. In: Proceedings of the 2004 IEEE Conference on Cybernetics and Intelligent Systems, pp. 1368–1373 (2004)
7. Wu, T.K., Huang, S.C., Meng, Y.R.: Evaluation of ANN and SVM classifiers as predictors to the diagnosis of students with learning disabilities. Expert Systems with Applications (Article in Press)
8. Melo, J.C.B., Cavalcanti, G.D.C., Guimarães, K.S.: PCA feature selection for protein structure prediction. In: Proceedings of the International Joint Conference on Neural Networks, vol. 4, pp. 2952–2957 (2003)
9. Huang, C.L., Wang, C.J.: A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications 31, 231–240 (2006)
10. Johansson, J., Treloar, R., Jern, M.: Integration of Unsupervised Clustering, Interaction and Parallel Coordinates for the Exploration of Large Multivariate Data. In: Proceedings of the Eighth International Conference on Information Visualization (IV 2004), pp. 52–57 (2004)
11. McCarthy, J.F., Marx, K.A., Hoffman, P.E., Gee, A.G., O'Neil, P., Ujwal, M.L., Hotchkiss, J.: Applications of Machine Learning and High Dimensional Visualization in Cancer Detection, Diagnosis, and Management. Annals of the New York Academy of Sciences 1020, 239–262 (2004)
12. Ruthkowska, D.: IF-THEN rules in neural networks for classification. In: Proceedings of the 2005 International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC 2005), vol. 2, pp. 776–780 (2005)
13. Hoffman, P., Grinstein, G., Marx, K., Grosse, I., Stanley, E.: DNA visual and analytic data mining. In: Proceedings on Visualization 1997, pp. 437–441 (1997)
14. Ankerst, M., Keim, D.A., Kriegel, H.P.: 'Circle Segments': A Technique for Visualizing Exploring Large Multidimensional Data Sets. In: Proc. Visualization 1996, Hot Topics Session (1996)
15. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
16. Johnson, D.E.: Applied Multivariate Methods for Data Analysts. Duxbury Press, USA (1998)
17. Michalak, K., Kwasnicka, H.: Correlation-based Feature Selection Strategy in Neural Classification. In: Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA 2006), vol. 1, pp. 741–746 (2006)
18. Kim, K.-J., Cho, S.-B.: Ensemble Classifiers based on Correlation Analysis for DNA Microarray Classification. Neurocomputing 70(1-3), 187–199 (2006)
Extraction of Approximate Independent Components from Large Natural Scenes
Yoshitatsu Matsuda1 and Kazunori Yamaguchi2
Department of Integrated Information Technology, Aoyama Gakuin University, 5-10-1 Fuchinobe, Sagamihara-shi, Kanagawa, 229-8558, Japan
[email protected] http://www-haradalb.it.aoyama.ac.jp/∼ matsuda 2 Department of General Systems Studies, Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1, Komaba, Meguro-ku, Tokyo, 153-8902, Japan
[email protected] Abstract. Linear multilayer ICA (LMICA) is an approximate algorithm for independent component analysis (ICA). In LMICA, approximate independent components are efficiently estimated by optimizing only highly-dependent pairs of signals. Recently, a new method named “recursive multidimensional scaling (recursive MDS)” has been proposed for the selection of pairs of highly-dependent signals. In recursive MDS, signals are sorted by one-dimensional MDS at first. Then, the sorted signals are divided into two sections and each of them is sorted by MDS recursively. Because recursive MDS is based on adaptive PCA, it does not need the stepsize control and its global optimality is guaranteed. In this paper, the LMICA algorithm with recursive MDS is applied to large natural scenes. Then, the extracted independent components of large scenes are compared with those of small scenes in the four statistics: the positions, the orientations, the lengths, and the length to width ratios of the generated edge detectors. While there are no distinct differences in the positions and the orientations, the lengths and the length to width ratios of the components from large scenes are greater than those from small ones. In other words, longer and sharper edges are extracted from large natural scenes.
1 Introduction
Independent component analysis (ICA) is a widely-used method in signal processing [1,2,3]. It solves blind source separation problems under the assumption that the source signals are statistically independent of each other. Though many efficient algorithms for ICA have been proposed [4,5], it nevertheless requires heavy computation for optimizing nonlinear functions. In order to avoid this problem, linear multilayer ICA (LMICA) has recently been proposed [6]. LMICA is a variation of Jacobian methods, where the sources are extracted by maximizing the independency of each pair of signals [7]. The difference is that LMICA optimizes
only pairs of highly-dependent signals instead of all pairs. LMICA is based on the intuition that optimizations on highly-dependent pairs probably increase the independency of all the signals more than those on weakly-dependent ones, and its validity was verified by numerical experiments on natural scenes. Besides, an additional method named "recursive multidimensional scaling (recursive MDS)" has been proposed for improving the selection of highly-dependent signals [8]. The method is based on the repetition of simple MDS. It sorts the signals in a one-dimensional array by MDS, then divides the array into the former and latter sections and sorts each of them recursively. In consequence, highly-correlated signals are brought into a neighborhood. Because simple MDS is equivalent to PCA [9], it can be solved efficiently without stepsize control by adaptive PCA algorithms (e.g., PAST [10]). The global optimality of adaptive PCA is guaranteed if the number of steps is sufficient. In this paper, the above LMICA algorithm with recursive MDS was applied to large natural scenes and the results were compared with those of small natural scenes in the following four statistics: the positions, the orientations, the lengths, and the length to width ratios of the generated edge detectors. As a result, it was observed that the independent components from large natural scenes are longer and sharper edge detectors than those from small ones. This paper is organized as follows. Section 2 gives a brief description of LMICA with recursive MDS. Section 3 shows the results of numerical experiments for small and large natural scenes. The paper is concluded in Sect. 4.
2 LMICA with Recursive MDS
Here, LMICA with recursive MDS is described in brief. See [6] and [8] for the details.
2.1 MaxKurt Algorithm
LMICA is an algorithm for extracting approximate independent components from signals. It is based on MaxKurt [7], which minimizes a contrast function of kurtoses by optimally "rotating" pairs of signals. The observed signals $x = (x_i)$ are assumed to be prewhitened. Then, one iteration of MaxKurt is given as follows:
– Pick up every pair $(i, j)$ of signals and find the optimal rotation $\hat{\theta}_{kurt}$ given as
$$\hat{\theta}_{kurt} = \arg\min_{\theta} \; -E\left\{ (x'_i)^4 + (x'_j)^4 \right\}, \qquad (1)$$
where $E\{\cdot\}$ is the expectation operator, $x'_i = \cos\theta \cdot x_i + \sin\theta \cdot x_j$, and $x'_j = -\sin\theta \cdot x_i + \cos\theta \cdot x_j$.
$\hat{\theta}_{kurt}$ is determined analytically and calculated easily. $E\{(x'_i)^4\}$ can be generalized to $E\{G(x'_i)\}$, where $G(u)$ is any suitable function such as $\log(\cosh(u))$. In this case, $\theta$ is initially set to $\hat{\theta}_{kurt}$ because it is expected to give a good initial value; $\theta$ is then incrementally updated by Newton's method for minimizing $E\{G(x'_i) + G(x'_j)\}$ w.r.t. $\theta$. Though this is much more time-consuming, it is expected to be much more robust to outliers.
2.2 Recursive MDS
In MaxKurt, all the pairs are optimized. On the other hand, LMICA optimizes only pairs that are highly correlated in higher-order statistics, so that approximate components can be extracted quite efficiently. In order to select such pairs, LMICA forms a one-dimensional array in which highly-correlated signals are near to each other by recursive MDS. Then, LMICA optimizes only the nearest-neighbor pairs of signals in the array. In recursive MDS, first, a one-dimensional mapping is formed by a simple MDS method [9], where the original distance $D_{ij}$ between the $i$-th and $j$-th signals is defined by
$$D_{ij} = E\left\{ \left( x_i^2 - x_j^2 \right)^2 \right\}. \qquad (2)$$
Because $D_{ij}$ is greater if $x_i$ and $x_j$ are more independent of each other in higher-order statistics, highly-dependent signals are globally close together in the formed map. Such a one-dimensional map is easily transformed into a discrete array by sorting it. Because simple MDS utilizes all the distances among signals instead of only neighbor relations, the nearest-neighbor pairs in the formed array do not always correspond to the ones with the smallest distance $D_{ij}$. In order to avoid this problem, the array is divided into the former and latter parts and MDS is applied to each part recursively. If a part includes only two signals, the recursion is terminated and pair optimization is applied. If a part includes three signals, the pair of the first and second signals and that of the second and third ones are optimized after MDS is applied to the three signals. The whole algorithm RMDS(signals) is described as follows:
RMDS(signals)
If the number of signals (N) is 2, optimize the pair of signals. Otherwise,
1. Sort the signals in a one-dimensional array by the simple MDS.
2. If N is 3, optimize the pair of the first and second signals and then that of the second and third ones. Otherwise, do RMDS(the former section of signals) and RMDS(the latter section of signals).
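The following sketch mirrors this recursion (an illustrative reimplementation, not the authors' code). The function sort_by_mds stands for the one-dimensional MDS ordering described below, and optimize_pair applies a MaxKurt-style rotation found by a simple numerical search over theta rather than the analytic solution:

import numpy as np

def optimize_pair(X, i, j):
    # rotate signals i and j by the angle that maximizes the sum of fourth moments, cf. Eq. (1)
    thetas = np.linspace(0.0, np.pi / 2.0, 90, endpoint=False)
    best = max(thetas, key=lambda t: np.mean(
        (np.cos(t) * X[i] + np.sin(t) * X[j]) ** 4 +
        (-np.sin(t) * X[i] + np.cos(t) * X[j]) ** 4))
    xi = np.cos(best) * X[i] + np.sin(best) * X[j]
    xj = -np.sin(best) * X[i] + np.cos(best) * X[j]
    X[i], X[j] = xi, xj

def rmds(X, indices, sort_by_mds):
    # X: (num_signals, num_samples); indices: signals forming the current section
    if len(indices) == 2:
        optimize_pair(X, indices[0], indices[1])
        return
    order = sort_by_mds(X, indices)        # one-dimensional MDS ordering
    if len(order) == 3:
        optimize_pair(X, order[0], order[1])
        optimize_pair(X, order[1], order[2])
        return
    half = len(order) // 2
    rmds(X, order[:half], sort_by_mds)     # former section
    rmds(X, order[half:], sort_by_mds)     # latter section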
Note that recursive MDS also does not guarantee that $D_{ij}$ of a nearest-neighbor pair is the smallest. However, numerical experiments have verified the validity of this method in [8].
Regarding the algorithm of the simple MDS in the one-dimensional space, its solution is equivalent to the first principal component of the covariance matrix of transformed signals $z$, each component of which is given as $z_i = x_i^2 - \frac{\sum_i x_i^2}{N}$, where $N$ is the number of signals [9]. Therefore, MDS is solved efficiently by applying an adaptive PCA algorithm to a sequence of signals $z$. Here, the well-known PAST algorithm [10] is employed. It is fast and does not need stepsize control. PAST repeats the following procedure $T$ times, where $y = (y_i)$ holds the coordinates of the signals in the one-dimensional space:
1. Pick up a $z$ randomly.
2. Calculate $\alpha = \sum_i y_i z_i$ and $\beta := \beta + \alpha^2$.
3. Calculate $e = (e_i)$ as $e_i = z_i - \alpha y_i$.
4. Update $y := y + \frac{\alpha}{\beta} e$.
The initial $y$ is given randomly and the initial $\beta$ is set to 1.0. This algorithm is guaranteed to converge to the global optimum. The value of $T$ is the only parameter to be set empirically.
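A sketch of this adaptive step, which plugs into the rmds sketch above as sort_by_mds, could look like this (illustrative only; the choice of T and the random-number handling are our assumptions):

import numpy as np

def sort_by_mds(X, indices, T=2000, seed=0):
    # X: (num_signals, num_samples); returns `indices` ordered by the 1-D MDS coordinate
    rng = np.random.default_rng(seed)
    sub = X[indices]                        # signals of the current section
    y = rng.standard_normal(len(indices))   # random initial coordinates
    beta = 1.0
    for _ in range(T):
        col = rng.integers(sub.shape[1])    # pick one sample at random
        z = sub[:, col] ** 2
        z = z - z.mean()                    # z_i = x_i^2 - (sum_i x_i^2) / N
        alpha = y @ z
        beta = beta + alpha ** 2
        e = z - alpha * y
        y = y + (alpha / beta) * e
    return [indices[k] for k in np.argsort(y)]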
3 Results
3.1 Experimental Settings
Two experiments were carried out to compare the independent components of small natural scenes with those of large ones. In the first experiment, the simple FastICA algorithm [4] was applied to 10 sets of 30000 small natural scenes of 12 × 12 pixels with the symmetrical approach and $G(u) = \log(\cosh(u))$. Because no dimension reduction was applied, the total number of extracted components is 1440 (144 × 10). Second, LMICA with recursive MDS was applied to 100000 large images of 64 × 64 pixels for 1000 layers with $G(u) = \log(\cosh(u))$. The original images were downloaded from http://www.cis.hut.fi/projects/ica/data/images/. The large images were prewhitened by ZCA. Note that an image of 64 × 64 pixels is not regarded as "large" in general. Since such images are large enough in comparison with the image patches usually used in other ICA algorithms, they are referred to as large images in this paper.
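ZCA whitening of the image patches can be sketched as follows (a generic formulation; the paper does not give its exact preprocessing code, and the small regularization constant is our assumption):

import numpy as np

def zca_whiten(X, eps=1e-5):
    # X: (num_patches, num_pixels); returns whitened patches with the same shape
    Xc = X - X.mean(axis=0)                    # remove the mean of each pixel
    cov = Xc.T @ Xc / Xc.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T  # ZCA transform
    return Xc @ W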
3.2 Experimental Results
Figure 1 shows that the decreasing curve of the contrast function nearly converged around 1000 layers. This means that LMICA reached an approximately optimal solution at the 1000th layer. Figure 2 displays the extracted independent components in the two experiments. It shows that edge detectors were generated in both cases. Regarding comparative experiments on the efficiency of our method versus other ICA algorithms, see [8]. In order to examine the statistical properties of the generated edge detectors, they are analyzed in a similar way as in [11]. First, the envelope of each detector was calculated by the Hilbert transformation. Next, the dominant elements were strengthened by raising each element to the fourth power. Then, the means and
Fig. 1. Decreasing curve of the contrast function E(log(cosh(x_i))) by LMICA for large natural scenes along the layers
Fig. 2. Independent components extracted from (a) small and (b) large natural scenes
covariances of each detector were calculated by regarding the values of the elements as a probability distribution on the discrete two-dimensional space ([0.5, 11.5] × [0.5, 11.5] for small scenes or [0.5, 63.5] × [0.5, 63.5] for large scenes, because the detectors exist only in these ranges). By approximating each edge as a Gaussian distribution with the same means and covariances on the continuous two-dimensional space, its position (the means), its orientation (the angle of the principal axis), its length (the full width at half maximum (FWHM) along the principal axis), and its width (the FWHM perpendicular to the principal axis) were calculated. The results are shown in Figs. 3-6 and Table 1. The scatter diagrams of the positions of edges in Fig. 3 show that they are uniformly distributed in both cases and there are no distinct differences. Figure 4 displays the histograms of the orientations of edges from 0 to π. It shows that edges with the horizontal (0 and π) and vertical (0.5π) orientations are dominant, as reported in [11], and there are no distinct differences between small and large scenes.
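A sketch of this edge-statistics analysis for a single basis function (our own reading of the procedure; the envelope is approximated here by a simple row-wise analytic signal rather than whatever 2-D variant the authors used):

import numpy as np
from scipy.signal import hilbert

def edge_statistics(basis, size):
    # basis: 1-D array of length size*size (one independent component)
    img = basis.reshape(size, size)
    env = np.abs(hilbert(img, axis=1))           # envelope (row-wise analytic signal, an assumption)
    p = env ** 4                                 # strengthen the dominant elements
    p = p / p.sum()                              # treat as a probability distribution
    ys, xs = np.mgrid[0.5:size, 0.5:size]        # grid centres 0.5 ... size-0.5
    mean = np.array([np.sum(p * xs), np.sum(p * ys)])
    dx, dy = xs - mean[0], ys - mean[1]
    cov = np.array([[np.sum(p * dx * dx), np.sum(p * dx * dy)],
                    [np.sum(p * dx * dy), np.sum(p * dy * dy)]])
    eigval, eigvec = np.linalg.eigh(cov)         # ascending eigenvalues
    fwhm = 2.0 * np.sqrt(2.0 * np.log(2.0)) * np.sqrt(eigval)
    orientation = np.arctan2(eigvec[1, 1], eigvec[0, 1])  # angle of the principal axis
    length, width = fwhm[1], fwhm[0]             # FWHM along and perpendicular to the principal axis
    return mean, orientation, length, width, length / width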
Table 1. Means and medians of the lengths and the length to width ratios for small and large scenes
                      small scenes   large scenes
mean of lengths       1.84           2.15
median of lengths     1.53           1.23
mean of lw ratios     1.69           2.53
median of lw ratios   1.55           1.70
Fig. 3. Plot diagrams of the positions of edges over the two-dimensional spaces: (a) distribution for small scenes over [0.5, 11.5] × [0.5, 11.5]; (b) distribution for large scenes over [0.5, 63.5] × [0.5, 63.5]
Fig. 4. Histograms of the orientations of edges from 0 to π: (a) small scenes; (b) large scenes
On the contrary, Figs. 5 and 6 show that the statistical properties of edges for large scenes obviously differ from those for small ones. In Fig. 5, short edges of length 1-1.5 are much more dominant for large scenes than for small ones. However, the rate of long edges of length over 3 is higher for large scenes than for small scenes. Table 1 also reflects this peculiarity: edges for large scenes are shorter on the median and longer on the mean than those for small ones. In Fig. 6, the length to width ratios for large scenes are greater than those for
small scenes. Table 1 also shows that both the mean and the median of the ratios for large scenes are greater than those for small scenes.
3.3 Discussion
The results show that there are significant differences in the distributions of the length and width of edges. First, a few long edges were observed for large scenes in Fig. 5. This shows that large images include some intrinsic components of long edges and that the division into small images hides them. Second, it was also observed in Fig. 5 that the rate of short edges for large images was greater than that for small ones. This seemingly strange phenomenon may be caused by the approximation made in LMICA. Because LMICA optimizes only nearest-neighbor pairs, it is expected to be biased in favor of locally short edges. Third, Fig. 6 shows that the length to width ratios for large natural scenes were greater than those for small ones. In other words, the edges from large scenes were "sharper" than those from small ones. The utilization of large scenes without any division
Fig. 5. Histograms of the lengths of edges from 0 to 10 for (a) small scenes and (b) large scenes. Edges longer than 10 are counted in the rightmost bar.
Fig. 6. Histograms of the length to width ratios of edges from 1 to 11 for (a) small scenes and (b) large scenes. Edges beyond 11 in ratio are counted in the rightmost bar.
drastically weakens the effect of the constraint that edges cannot exist beyond the image borders. This may be the reason why many sharper edges were generated.
4 Conclusion
In this paper, the method of LMICA with recursive MDS was described first. Then, the method was applied to two datasets of small and large natural scenes, and the generated edge detectors were compared with respect to some statistical properties. Consequently, it was observed in the experiment on large scenes that there are a few long edges and that many edges are sharper than those generated from small scenes. We are now planning to carry out additional experiments for verifying the speculations in this paper. Besides, we are planning to compare our results with the statistical properties observed in real brains in a similar way to [11]. In addition, we are planning to apply this algorithm to a movie, where a sequence of large natural scenes is given as a sample. This work is supported by a Grant-in-Aid for Young Scientists (KAKENHI) 19700267.
References
1. Jutten, C., Herault, J.: Blind separation of sources (part I): An adaptive algorithm based on neuromimetic architecture. Signal Processing 24(1), 1–10 (1991)
2. Comon, P.: Independent component analysis - a new concept? Signal Processing 36, 287–314 (1994)
3. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129–1159 (1995)
4. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999)
5. Cardoso, J.F., Laheld, B.: Equivariant adaptive source separation. IEEE Transactions on Signal Processing 44(12), 3017–3030 (1996)
6. Matsuda, Y., Yamaguchi, K.: Linear multilayer ICA generating hierarchical edge detectors. Neural Computation 19, 218–230 (2007)
7. Cardoso, J.F.: High-order contrasts for independent component analysis. Neural Computation 11(1), 157–192 (1999)
8. Matsuda, Y., Yamaguchi, K.: Linear multilayer ICA with recursive MDS (preprint, 2007)
9. Cox, T.F., Cox, M.A.A.: Multidimensional scaling. Chapman & Hall, London (1994)
10. Yang, B.: Projection approximation subspace tracking. IEEE Transactions on Signal Processing 43(1), 95–107 (1995)
11. van Hateren, J.H., van der Schaaf, A.: Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B 265, 359–366 (1998)
Local Coordinates Alignment and Its Linearization Tianhao Zhang1,3, Xuelong Li2, Dacheng Tao3, and Jie Yang1 1 Institute of IP & PR, Shanghai Jiao Tong Univ., P.R. China Sch. of Comp. Sci. and Info. Sys., Birkbeck, Univ. of London, U.K. 3 Dept. of Computing, Hong Kong Polytechnic Univ., Hong Kong {z.tianhao,dacheng.tao}@gmail.com,
[email protected],
[email protected] 2
Abstract. Manifold learning has been demonstrated to be an effective way to discover the intrinsic geometrical structure of a number of samples. In this paper, a new manifold learning algorithm, Local Coordinates Alignment (LCA), is developed based on the alignment technique. LCA first obtains the local coordinates as representations of a local neighborhood by preserving the proximity relations on the patch, which is locally Euclidean; the extracted local coordinates are then aligned to yield the global embeddings. To solve the out-of-sample problem, the linearization of LCA (LLCA) is also proposed. Empirical studies on both synthetic data and face images show the effectiveness of LCA and LLCA in comparison with existing manifold learning algorithms and linear subspace methods. Keywords: Manifold learning, Local Coordinates Alignment, dimensionality reduction.
1 Introduction
Manifold learning addresses the problem of discovering the intrinsic structure of a manifold from a number of samples. The generic problem of manifold learning can be described as follows. Consider a dataset $X$ consisting of $N$ samples $\vec{x}_i$ ($1 \le i \le N$) in a high-dimensional Euclidean space $\mathbb{R}^m$, i.e., $X = [\vec{x}_1, \ldots, \vec{x}_N] \in \mathbb{R}^{m \times N}$. Each sample $\vec{x}_i$ is obtained by embedding a sample $\vec{z}_i$, drawn from $M^d$ (a low-dimensional nonlinear manifold), into $\mathbb{R}^m$ (a higher-dimensional Euclidean space, $d < m$), i.e., $\varphi: M^d \to \mathbb{R}^m$ and $\vec{x}_i = \varphi(\vec{z}_i)$. A manifold learning algorithm aims to find the corresponding low-dimensional representation $\vec{y}_i \in \mathbb{R}^d$ of $\vec{x}_i$ that preserves the geometric properties of $M^d$. Recently, a number of manifold learning algorithms have been developed for pattern analysis, classification, clustering, and dimensionality reduction. Each algorithm detects a specific intrinsic, i.e., geometrical, structure of the underlying manifold of the raw data. The representative ones include ISOMAP [6], locally linear embedding (LLE) [7], Laplacian eigenmaps (LE) [1], local tangent space alignment (LTSA) [3], etc.
ISOMAP is a variant of MDS [13]. Unlike MDS, the distance measure between two samples is not the Euclidean distance but the geodesic distance, which is the shortest path from one sample to another computed by the Dijkstra’s algorithm. ISOMAP is intrinsically a global approach since it attempts to preserve the global geodesic distances of all pairs of samples. Different from ISOMAP, LLE, LE and LTSA are all local methods, which preserve the geometrical structure of every local neighborhood. LLE uses the linear coefficients, which reconstruct the given point by its neighbors, to represent the local structure, and then seeks a low-dimensional embedding, in which these coefficients are still suitable for reconstructions. LE is developed based on the Laplace Beltrami operator in the spectral graph theory [1,4]. This operator provides an optimal embedding for a manifold. The algorithm preserves proximity relationships by the manipulations on the undirected weighted graph, which indicates the neighbor relations of the pairwise points. LTSA exploits the local tangent information as a representation of the local structure and this local information is then aligned to give the global coordinates. Inspired by LTSA [3], we propose a new manifold learning algorithm, i.e., Local Coordinates Alignment (LCA). LCA obtains the local coordinates as representations of a local neighborhood by preserving the proximity relations on the patch. The extracted local coordinates are then aligned by the alignment trick [3] to yield the global embeddings. In LCA, the local representation and the global alignment are explicitly implemented as two steps for intrinsic structure discovery. In addition, to solve the out of sample problem, the linear approximation is applied to LCA, called Linear LCA or LLCA. Experimental results show that LCA discovers the manifold structure, and LLCA outperforms representative subspace methods for the face recognition. The rest of the paper is organized as follows: Section 2 describes LCA and its linearization is given in Section 3. Section 4 shows the empirical studies over synthetic data and face databases. Section 5 concludes.
2 LCA: Local Coordinates Alignment
The proposed algorithm is based on the assumption that each local neighborhood of the manifold $M^d$ is intrinsically Euclidean, i.e., it is homeomorphic to an open subset of the Euclidean space of dimension $d$. The proposed LCA extracts local coordinates by preserving the neighbor relationships of the raw data, and then obtains the optimal embeddings by aligning these local coordinates globally.
2.1 Local Representations
Given an arbitrary sample $\vec{x}_i \in \mathbb{R}^m$ and its $k$ nearest neighbors $\vec{x}_{i_1}, \ldots, \vec{x}_{i_k}$, measured in terms of the Euclidean distance, we let $X_i = [\vec{x}_{i_0}, \vec{x}_{i_1}, \ldots, \vec{x}_{i_k}] \in \mathbb{R}^{m \times (k+1)}$ denote a neighborhood in the manifold $M^d$, where $\vec{x}_{i_0} = \vec{x}_i$. For $X_i$, we have the local map $\psi_i: \mathbb{R}^m \to \mathbb{R}^d$, i.e., $\psi_i: X_i \mapsto Y_i = [\vec{y}_{i_0}, \vec{y}_{i_1}, \ldots, \vec{y}_{i_k}] \in \mathbb{R}^{d \times (k+1)}$. Here, $Y_i$ holds the local coordinates that parameterize the corresponding points in $X_i$. To yield faithful maps, we expect that nearby points remain nearby, that is, the point $\vec{y}_{i_0} \in \mathbb{R}^d$ is close to $\vec{y}_{i_1}, \ldots, \vec{y}_{i_k}$, i.e.,
$$\arg\min_{\vec{y}_{i_0}} \sum_{j=1}^{k} \left\| \vec{y}_{i_0} - \vec{y}_{i_j} \right\|^2 . \qquad (1)$$
To capture the geometrical structure of the neighborhood, the weighting vector $\vec{w}_i \in \mathbb{R}^k$ is introduced into (1) as
$$\arg\min_{\vec{y}_{i_0}} \sum_{j=1}^{k} \left\| \vec{y}_{i_0} - \vec{y}_{i_j} \right\|^2 (\vec{w}_i)_j , \qquad (2)$$
where the $j$th element of $\vec{w}_i$ is obtained by the heat kernel [4] with parameter $t \in \mathbb{R}$, i.e.,
$$(\vec{w}_i)_j = \exp\left( - \left\| \vec{x}_{i_0} - \vec{x}_{i_j} \right\|^2 / t \right) . \qquad (3)$$
The vector $\vec{w}_i$ is a natural measure of the distances between $\vec{x}_{i_0}$ and its neighbors. The larger the value $(\vec{w}_i)_j$ is (the closer the point $\vec{x}_{i_0}$ is to the $j$th neighbor), the more important is the corresponding neighbor. Setting $t = \infty$ reduces Eq. (2) to Eq. (1). For further derivation, Eq. (2) is transformed into
$$\arg\min_{\vec{y}_{i_0}} \operatorname{tr}\left( \begin{bmatrix} (\vec{y}_{i_0} - \vec{y}_{i_1})^T \\ \vdots \\ (\vec{y}_{i_0} - \vec{y}_{i_k})^T \end{bmatrix} \left[ \vec{y}_{i_0} - \vec{y}_{i_1}, \ldots, \vec{y}_{i_0} - \vec{y}_{i_k} \right] \operatorname{diag}(\vec{w}_i) \right) , \qquad (4)$$
where $\operatorname{tr}(\cdot)$ denotes the trace operator and $\operatorname{diag}(\cdot)$ denotes the diagonal matrix whose diagonal entries are the corresponding components of the given vector. Let $M_i = [-\vec{e}_k \;\; I_k]^T \in \mathbb{R}^{(k+1) \times k}$, where $\vec{e}_k = [1, \ldots, 1]^T \in \mathbb{R}^k$ and $I_k$ is the $k \times k$ identity matrix. Then, Eq. (4) reduces to
$$\arg\min_{Y_i} \operatorname{tr}\left( (Y_i M_i)^T Y_i M_i \operatorname{diag}(\vec{w}_i) \right) = \arg\min_{Y_i} \operatorname{tr}\left( Y_i M_i \operatorname{diag}(\vec{w}_i) M_i^T Y_i^T \right) = \arg\min_{Y_i} \operatorname{tr}\left( Y_i L_i Y_i^T \right) , \qquad (5)$$
where $L_i = M_i \operatorname{diag}(\vec{w}_i) M_i^T$, which encapsulates the local geometric information. The matrix $L_i$ is equivalent to
$$L_i = \begin{bmatrix} -\vec{e}_k^T \\ I_k \end{bmatrix} \operatorname{diag}(\vec{w}_i) \left[ -\vec{e}_k \;\; I_k \right] = \begin{bmatrix} \sum_{j=1}^{k} (\vec{w}_i)_j & -\vec{w}_i^T \\ -\vec{w}_i & \operatorname{diag}(\vec{w}_i) \end{bmatrix} . \qquad (6)$$
Therefore, according to (5), the local coordinates $Y_i$ for the $i$th neighborhood can be obtained.
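As a concrete illustration of Eqs. (3)-(6), the local matrix $L_i$ can be assembled as in the following sketch (an illustrative implementation under the notation above, not the authors' code):

import numpy as np

def local_matrix(Xi, t=5.0):
    # Xi: (m, k+1) neighborhood matrix whose first column is x_{i_0}
    k = Xi.shape[1] - 1
    d2 = np.sum((Xi[:, 1:] - Xi[:, [0]]) ** 2, axis=0)    # squared distances to the neighbors
    w = np.exp(-d2 / t)                                   # heat-kernel weights, Eq. (3)
    M = np.vstack([-np.ones((1, k)), np.eye(k)])          # M_i = [-e_k  I_k]^T, (k+1) x k
    return M @ np.diag(w) @ M.T                           # L_i = M_i diag(w_i) M_i^T, Eq. (6)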
2.2 Global Alignment
According to Section 2.1, local coordinates are obtained for all local neighborhoods, namely $Y_1, Y_2, \ldots, Y_N$. In this section, these local coordinates are aligned to yield the global ones.
Let $Y = [\vec{y}_1, \ldots, \vec{y}_N]$ denote the embedding coordinates, which are faithful representations of $X = [\vec{x}_1, \ldots, \vec{x}_N]$ sampled from a manifold. Define the selection matrix $S_i \in \mathbb{R}^{N \times (k+1)}$:
$$(S_i)_{pq} = \begin{cases} 1 & \text{if } p = (\vec{h}_i)_q \\ 0 & \text{else} \end{cases} , \qquad (7)$$
where
$$\vec{h}_i = [i_0, i_1, \ldots, i_k] \in \mathbb{R}^{k+1} \qquad (8)$$
denotes the set of indices for the $i$th point and its neighbors. Then the local coordinates $Y_i$ can be expressed as a selective combination of the embedding coordinates $Y$ via the selection matrix $S_i$:
$$Y_i = Y S_i . \qquad (9)$$
Now Eq. (5) can be written as
$$\arg\min_Y \operatorname{tr}\left( Y S_i L_i S_i^T Y^T \right) . \qquad (10)$$
For all samples, we have the global optimization
$$\arg\min_Y \sum_{i=1}^{N} \operatorname{tr}\left( Y S_i L_i S_i^T Y^T \right) = \arg\min_Y \operatorname{tr}\left( Y \sum_{i=1}^{N} S_i L_i S_i^T Y^T \right) = \arg\min_Y \operatorname{tr}\left( Y L Y^T \right) , \qquad (11)$$
where $L = \sum_{i=1}^{N} S_i L_i S_i^T$. The matrix $L$ is termed the alignment matrix [3]; it is symmetric, positive semi-definite, and sparse. The matrix $L$ can be formed by the iterative procedure
$$L(\vec{h}_i, \vec{h}_i) \leftarrow L(\vec{h}_i, \vec{h}_i) + L_i , \qquad (12)$$
for $1 \le i \le N$, with the initialization $L = 0$.
To uniquely determine $Y$, we impose the constraint $Y Y^T = I_d$, where $I_d$ is the $d \times d$ identity matrix. The objective function is now
$$\arg\min_Y \operatorname{tr}\left( Y L Y^T \right) \quad \text{s.t. } Y Y^T = I_d . \qquad (13)$$
This minimization problem can be converted into an eigenvalue decomposition problem:
$$L \vec{f} = \lambda \vec{f} . \qquad (14)$$
The column vectors $\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_d$ in $\mathbb{R}^N$ are the eigenvectors associated with the ordered nonzero eigenvalues $\lambda_1 < \lambda_2 < \cdots < \lambda_d$. Therefore, we obtain the embedding coordinates as
$$Y = [\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_d]^T . \qquad (15)$$
arg min tr ( AT XLX T A) . A
(16)
Impose the column orthonormal constraint to A , i.e. AT A = I d , we have:
arg min tr ( AT XLX T A) A
s.t. AT A = I d .
The solution of (17) is given by the eigenvalue decomposition: G G XLX T α = λα .
(17)
(18)
The transformation matrix A is
G G G A = [α1 , α 2 " , α d ] ,
(19)
G G G where α1 , α 2 " , α d are eigenvectors of XLX T associated with the first d smallest eigenvalues. LLCA can be implemented in either an unsupervised or a supervised mode. In the later case, the label information is utilized, i.e., each neighborhood is replaced by points G which have identical class label information. For a given point xi and its same-class
648
T. Zhang et al.
G G points xi1 ," , xini−1 , where ni denotes the number of samples in this class, one can G construct the vector wi ∈ \ ni −1 as follows: G
( wi ) j
G G = exp ⎛⎜ − xi0 − xi j ⎝
2
/ t ⎞⎟ , j = 1,", ni − 1 . ⎠
(20)
Furthermore, to denote new indices set for the ith point and its same-class points, one need reset G hi = [ i0 , i1 ," , ini −1 ] ∈ \ ni . (21)
4 Experiments In this section, several tests are performed to evaluate LCA and LLCA respectively.
Fig. 1. Synthetic Data: the left sub-figure is the S-curve dataset and the right is the Punctured Sphere dataset
4.1 Non-linear Dimensionality Reduction Using LCA
We employ two synthetic datasets, as shown in Fig. 1, which are randomly sampled from the S-curve and the Punctured Sphere [5]. LCA is implemented in comparison with PCA, ISOMAP, LLE, LE, and LTSA. For LCA, the reduced dimension is 2, the number of neighbors is 15, and the parameter t in the heat kernel is 5. Fig. 2 and Fig. 3 illustrate the experimental results. It is shown that PCA which only sees the global Euclidean structure fails to detect the underlying structure of the raw data, while LCA can unfold the non-linear manifolds as well as ISOMAP, LLE, LE, and LTSA. 4.2 Face Recognition Using LLCA
Since face images, parameterized by some continuous variables such as poses, illuminations and expressions, often belong to an intrinsically low dimensional submanifold [2,12], LLCA is implemented for effective face manifold learning and recognition. We briefly introduce three steps in our face recognition experiments. First, LLCA is conducted on the training face images and learn the transformation matrix. Second, each test face image is mapped into a low-dimensional subspace via the transformation matrix. Finally, we classify the test images by the Nearest Neighbor classifier with Euclidean measure.
Local Coordinates Alignment and Its Linearization
PCA
LE
ISOMAP
LTSA
649
LLE
LCA
Fig. 2. Embeddings of the S-curve dataset
PCA
ISOMAP
LLE
LE
LTSA
LCA
Fig. 3. Embeddings of the Punctured Sphere dataset
We compare LLCA with PCA [8], LDA [9], and LPP [14] over the publicly available ORL [10] and YALE [11] databases. PCA and LDA are two of the most popular traditional dimensionality reduction methods, while LPP is a newly proposed manifold learning method. Here, LPP is the supervised version, which is introduced in [15] as "LPP1". LLCA is also implemented in the supervised mode and the parameter t in the heat kernel is +∞. For all experiments, images are cropped based on the centers of the eyes, and the cropped images are normalized to 40 × 40 pixel arrays with 256 gray levels per pixel.
4.2.1 ORL
The ORL database [10] contains 400 images of 40 individuals, including variation in facial expression and pose. Fig. 4 illustrates a sample subject of ORL along with all its
Fig. 4. Sample face images from ORL
Fig. 5. Recognition rate vs. dimensionality reduction on ORL: the left sub-figure is achieved by selecting 3 images per person for training and the right sub-figure by selecting 5 images per person for training
Table 1. Best recognition rates (%) of four algorithms on ORL
Method PCA LDA LPP LLCA
3 Train 79.11 (113) 87.20 (39) 88.09 (41) 91.48 (70)
5 Train 88.15 (195) 94.68 (39) 94.77 (48) 97.42 (130)
10 views. For each person, p (= 3, 5) images are randomly selected for training and the rest are used for testing. For each given p, we average the realizations over 20 random splits. Fig. 5 shows the plots of the average recognition rates versus subspace dimensions. The best average results and the corresponding reduced dimensions are listed in Table 1. As can be seen, the LLCA algorithm outperforms the other algorithms involved in this experiment, and LLCA is more competitive in discovering the intrinsic structure of the raw face images.
4.2.2 YALE
The YALE database [11] contains 15 subjects and each subject has 11 samples with varying facial expression and illumination. Fig. 6 shows the sample images of an individual. Similarly to the strategy adopted on ORL, p (= 3, 5) images per person are randomly selected for training and the rest are used for testing. All the tests are repeated over 20 random splits independently, and then the average recognition results are calculated. The recognition results are shown in Fig. 7 and Table 2. We can draw a similar conclusion as before.
Fig. 6. Sample face images from YALE
Fig. 7. Recognition rate vs. dimensionality reduction on YALE: the left sub-figure is achieved by selecting 3 images per person for training and the right sub-figure by selecting 5 images per person for training
Table 2. Best recognition rates (%) of four algorithms on YALE
Method PCA LDA LPP LLCA
3 Train 50.71 (44) 61.42 (14) 66.79 (15) 67.50 (15)
5 Train 58.44 (74) 74.44 (14) 75.94 (19) 78.22 (33)
5 Conclusions
This paper presents a new manifold learning algorithm, called Local Coordinates Alignment (LCA). It first expresses the local coordinates as the parameterization of each local neighborhood, and then achieves global optimization by performing eigenvalue decomposition on the alignment matrix, which can be obtained by an iterative procedure. Meanwhile, the linearization of LCA (LLCA) is also proposed to solve the out-of-sample problem. Experiments over both synthetic datasets and real face datasets have shown the effectiveness of LCA and LLCA.
Acknowledgments. The authors would like to thank the anonymous reviewers for their constructive comments on the first version of this paper. The research was supported by the Internal Competitive Research Grants of the Department of Computing of the Hong Kong Polytechnic University (under project number A-PH42), the National Science Foundation of China (No. 60675023), and the China 863 High Tech. Plan (No. 2007AA01Z164).
References
1. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems (NIPS), pp. 585–591. MIT Press, Cambridge (2001)
2. Saul, L.K., Weinberger, K.Q., Ham, J.H., Sha, F., Lee, D.D.: Spectral methods for dimensionality reduction. In: Chapelle, O., Schoelkopf, B., Zien, A. (eds.) Semisupervised Learning. MIT Press, Cambridge (to appear)
3. Zhang, Z.Y., Zha, H.Y.: Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal of Scientific Computing 26(1), 313–338 (2004)
4. Rosenberg, S.: The Laplacian on a Riemannian Manifold. Cambridge University Press, Cambridge (1997)
5. Lafon, S.: Diffusion Maps and Geometric Harmonics. Ph.D. dissertation, Yale University (2004)
6. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
7. Saul, L., Roweis, S.: Think globally, fit locally: unsupervised learning of nonlinear manifolds. Journal of Machine Learning Research 4, 119–155 (2003)
8. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991)
9. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 711–720 (1997)
10. Available at http://www.uk.research.att.com/facedatabase.html
11. Available at http://cvc.yale.edu/projects/yalefaces/yalefaces.html
12. Shakhnarovich, G., Moghaddam, B.: Face recognition in subspaces. In: Li, S.Z., Jain, A.K. (eds.) Handbook of Face Recognition. Springer, Heidelberg (2004)
13. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, Inc., Chichester (2001)
14. He, X., Niyogi, P.: Locality preserving projections. In: Advances in Neural Information Processing Systems (NIPS) (2003)
15. Cai, D., He, X., Han, J.: Using graph model for face analysis. Department of Computer Science Technical Report No. 2636, University of Illinois at Urbana-Champaign (UIUCDCS-R-2005-2636) (September 2005)
Walking Appearance Manifolds without Falling Off
Nils Einecke1, Julian Eggert2, Sven Hellbach1, and Edgar Körner2
Technical University of Ilmenau Department of Neuroinformatics and Cognitive Robotics 98684 Ilmenau, Germany 2 Honda Research Institute Europe GmbH Carl-Legien-Str.30, 63073 Offenbach/Main, Germany
Abstract. Having a good description of an object’s appearance is crucial for good object tracking. However, modeling the whole appearance of an object is difficult because of the high dimensional and nonlinear character of the appearance. To tackle the first problem we apply nonlinear dimensionality reduction approaches on multiple views of an object in order to extract the appearance manifold of the object and to embed it into a lower dimensional space. The change of the appearance of the object over time then corresponds to a walk on the manifold, with view prediction reducing to a prediction of the next step on the manifold. An inherent problem here is to constrain the prediction to the embedded manifold. In this paper, we show an approach towards solving this problem by applying a special mapping which guarantees that low dimensional points are mapped only to high dimensional points lying on the appearance manifold.
1 Introduction
One focus of current research in computer vision is to find a way to represent the appearance of objects. Attempts at full 3D modeling of an object's shape have turned out to be unreasonable, as this is computationally intensive and learning or generating appropriate models is laborious. According to the viewer-centered theory [1,2,3], the human brain stores multiple views of an object in order to be able to recognize the object from various viewpoints. For example, in [4] an approach is introduced that uses multiple views of objects to model their appearance. There, the desired object is tracked; at each time step the pose of the object is estimated and a view is inserted into the appearance model if it holds new or better information. Unfortunately, this is very time-consuming as the approach works directly with the high-dimensional views. Actually, the different views of an object are samples of the appearance manifold of the object. This manifold is a nonlinear subspace in the space of all possible appearances (appearance space) where all the views of this particular object are located. In general, the appearance manifold has a much lower dimensionality than the appearance space it is embedded in. Non-Linear Dimensionality Reduction (NLDR) algorithms, like Locally Linear Embedding (LLE)
[5], Isometric Feature Mapping (Isomap) [6] or Local Tangent Space Alignment (LTSA) [7], can embed a manifold into lower dimensional spaces (embedding space) by means of a sufficient number of samples of the manifold. Elgammal and Lee [8] use embedded appearance manifolds for 3D body pose estimation of humans based on silhouettes of persons and LLE. Pose estimation is realized via an RBF (Radial Basis Function)-motivated mapping from the visual input to the embedded manifold and from there to the pose space. Note that they model the embedded manifold with cubic splines in order to be able to project points mapped into the embedding space onto the manifold. Lim et al. [9] follow a similar approach but, in contrast to Elgammal and Lee, they use the model of the embedded manifold to predict the next appearance. Actually, both approaches are limited to one-dimensional manifolds as the views were sampled during motion sequences and the modeling of the manifold is based on the available time information. In [10] Lui et al. use an aligned mixture of linear subspace models to generate the embedding of the appearance manifold, which does not depend on additional time information. Using a Dynamic Bayesian Network they infer the next position in the embedding space and, based on this, the position and scale parameters. The approach of Lui et al. is able to handle manifolds with more than one dimension, but the prediction process is not constrained to the structure of the manifold. This, however, is very important for predictions over a larger time span, as without this constraint the prediction would tend to leave the manifold, leading to awkward views when projected back to the appearance space or to wrong pose parameter estimates. Unfortunately, this constraining is quite difficult because of the highly nonlinear shape of the manifold. In the work presented here, we do not attempt to tackle this problem directly. Instead, we just use a simple non-constrained linear predictor in the low dimensional embedding space and leave the work of imposing the manifold constraint to the mapping procedure between the low dimensional embedding space and the high dimensional appearance space. The rest of this paper is organized as follows. In Sect. 2 we show what kind of objects we used to investigate our approach and we discuss the shape of appearance manifolds of rigid objects in the light of our way of sampling views. Section 3 introduces our approach for mapping between the spaces, which guarantees to map only to points lying on the manifold and its embedding. Then Sect. 4 provides the workflow of our view prediction approach. In Sect. 5 we describe the experiments conducted for analyzing our view prediction approach and present the results. Finally, Sect. 6 summarizes this paper and outlines future work.
2 Appearance Manifolds and Embedding
All possible views of an object together form the so-called appearance manifold. By embedding such a manifold in a low dimensional space one gets a low dimensional equivalent of this manifold. If one is able to correctly map between the spaces one can work efficiently in the low dimensional space and project the
Fig. 1. A trajectory on a two-dimensional band-like manifold. On the left we see the actual manifold and on the right its two-dimensional embedding.
results to the high dimensional space. For example, series of views appear as trajectories on the appearance manifold. Mapping such a trajectory into the low dimensional space eases the processing, as the trajectory's dimensionality and nonlinearity are reduced. Figure 1 shows a simple band-like manifold residing in the three-dimensional space and its embedding in the two-dimensional space. We used the POV-Ray tool for generating views of virtual objects (see Fig. 2). This way we are able to verify our approach under ideal conditions and, for now, we do not have to deal with problems like segmenting the object from the background. A view of an object is described mainly by the orientation parameters of the object. These could for example comprise: scaling, rotation about the three axes of the three-dimensional space, deformation and translation. However, we will concentrate only on the rotation here. While tracking deformation would considerably blow up the complexity of the problem, scaling can be handled by a resolution pyramid. Furthermore, it makes sense to use views which are centered, because this could be dealt with by a preprocessing step like a translational tracker. So we are left with a three-dimensional parameter space spanned by the three rotation angles. In addition, sampling views over all three angles is not feasible as this would lead to a too large number of views. Therefore we decided to sample views by varying only 2 axes. Unfortunately, experiments have shown that, in general, the views sampled varying 2 axes are not embeddable in a non-pervasive manner in a low dimensional (three-dimensional) space. Hence we decided to rotate the objects a full 360° about only one axis. We sampled views every 5° while rotating the object 360° about its vertical axis (y-axis) and tilting it from −45° to +45° about its horizontal axis (x-axis). Each 360° rotation by itself leads to a cyclic trajectory in the appearance space. As these trajectories are neighboring, all views together form a cylindric appearance manifold. This can be seen exemplarily in the embeddings of the views of the bee and the bird in Fig. 3. For embedding the appearance manifold into a low dimensional space we use the Isomap approach, because comparisons of LLE [5], LTSA [7] and Isomap [6] have shown that Isomap is most appropriate for this purpose.
POV-Ray is a freely available tool for rendering 3D scenes. The objects we used are templates from http://objects.povworld.org
Fig. 2. The objects used for analyzing our view prediction approach
Fig. 3. Three-dimensional embeddings of the views of the bee (left) and the bird (right) generated with Isomap. Views were sampled in an area of 360° vertically and from −45° to +45° horizontally. The colors encode the rotation angle about the vertical axis from blue (0°) to red (360°). As each full rotation about the vertical axis appears as a cyclic trajectory in the appearance space and since all cyclic trajectories are neighboring, the embedding of the views leads to a cylinder-like structure (appearance manifold).
3 Mapping between the Spaces
We prefer not to predict views directly in the high dimensional appearance space but on the low dimensional embedding of the appearance manifold. Two problems arise. First, most NLDR algorithms do not yield a function for mapping between appearance space and embedding space, and second, it is difficult to ensure that the prediction does not leave the manifold. In order to actually ensure that the prediction is done only along the manifold one has to constrain the prediction with the nonlinear shape of the manifold. This, however, is very problematic because appearance manifolds often exhibit highly nonlinear and wavy shapes. Take for example a simple linear prediction. Such a prediction is quite likely to predict positions that do not lie on the manifold, as can be seen in Fig. 4 a). Leaving the manifold in the low dimensional space also means leaving the appearance manifold, i.e., for a point in the low dimensional space which is not lying on the embedded manifold there is simply no valid corresponding view of the object. Usual interpolation methods cannot handle this problem. They just try to find an appropriate counterpart, but in doing so they are not directly constrained to the appearance manifold. This means that the views they map those points to are not valid views of the object and often show heavy distortions.
Fig. 4. These two figures show a subsection of a one-dimensional cyclic manifold. a) A linear prediction using the last two positions (light blue) on the manifold leads to a point (red) not belonging to the manifold. Reconstructing this point by convex combination of its nearest neighbors (orange) projects it back to the manifold. b) Reconstruction using the LLE idea does not ensure positive weights. However, iterative repetition of the reconstruction (yellow-to-green points) makes the weights converge to positive values. The reconstruction weights after 4 iterations are displayed.
A possible way out of this dilemma is the reconstruction idea upon which LLE [11] is based. What we want to do is to map between two structures, where one is a manifold in a high dimensional space and the other its embedding in a low dimensional space. By assuming local linearity (which is a fundamental assumption of most NLDR algorithms anyway) it is possible to calculate reconstruction weights for a point on one of these structures that account for both spaces, i.e. it is possible to calculate the reconstruction weights for a point in either of the two spaces and, by means of these, the counterpart of this point in the other space can be reconstructed. The weights in the appearance space are calculated by minimizing the following energy function

E(w_i) = \Big\| x_i - \sum_{j \in N_i} w_i^j \cdot x_j \Big\|^2 \quad \text{with} \quad \sum_{j \in N_i} w_i^j = 1,   (1)

where x_i is the D-dimensional point in the appearance space to reconstruct, N_i = {j | x_j is a k-nearest neighbor of x_i} and w_i is the vector holding the reconstruction weights. After the weights are determined, the counterpart y_i of x_i in the embedding space can be calculated by

y_i = \sum_{j \in N_i} w_i^j \cdot y_j,   (2)

with the y_j's being the d-dimensional embedding counterparts of the x_j's. Naturally d < D, and in general d \ll D. Reconstructing an x_i from a y_i works in an analogous way. The neighbors N_i of a data point are chosen only among those data points whose mapping is known, namely the data points that were used for the nonlinear dimensionality reduction.
If one demands the reconstruction weights to be larger than zero and to sum to one, then the reconstructed points always lie on the manifold. The reason is that this corresponds to a convex combination whose result is constrained to lie in the convex hull of the support points. Together with the local linearity assumption this leads to reconstruction results where the reconstructed points always lie on the manifold. So even if a point beyond the manifold is predicted, the mapping by reconstruction ensures that only valid views of the object are generated because it inherently projects the point onto the manifold. This can be seen in Fig. 4 a). In [11] it has been shown that the energy function (1) can be rewritten as a system of linear equations. This enables the weights to be calculated directly using matrix operations. Although the calculated weights are constrained to sum to one, they are not constrained to be positive. This is a problem as it violates the convex combination criterion, and hence it is not ensured that a reconstructed point lies on the manifold. However, an iterative repetition of the reconstruction, i.e. reconstructing the reconstructed point, projects the reconstructed point onto the manifold. During this process the weights converge to positive values. Figure 4 b) depicts an example.
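To make the reconstruction mapping concrete, the following minimal Python sketch (our own illustration, not the authors' code) solves the constrained least-squares problem of Eq. (1) for the weights and applies the iterative re-reconstruction described above; the regularization term and the neighborhood size are illustrative choices:

```python
import numpy as np

def reconstruction_weights(x, neighbors):
    """Weights summing to one that best reconstruct x from its k nearest
    neighbors (rows of `neighbors`), as in LLE-style reconstruction."""
    Z = neighbors - x                                # shift neighbors so x is the origin
    C = Z @ Z.T                                      # local (k x k) covariance matrix
    C += 1e-6 * np.trace(C) * np.eye(len(C))         # small regularization for stability
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                               # enforce the sum-to-one constraint

def project_onto_manifold(y, Y, X, k=8, iterations=4):
    """Iteratively re-reconstruct a low dimensional point y from its neighbors
    among the embedded samples Y, which pulls it back towards the manifold;
    returns the projected point, its appearance-space counterpart and weights."""
    for _ in range(iterations):
        idx = np.argsort(np.linalg.norm(Y - y, axis=1))[:k]
        w = reconstruction_weights(y, Y[idx])
        y = w @ Y[idx]                               # reconstructed (projected) point
    x = w @ X[idx]                                   # counterpart in the appearance space
    return y, x, w
```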
4 View Prediction
Embedding a set of views of an object into a low dimensional space leads to tuples (x_i, y_i) of views x_i in the appearance space and their low dimensional counterparts y_i. With this representation of the object's appearance, the process of view prediction is as follows:
1) At each time step t the current view x_t of the object is provided, e.g. from a tracking or a detection stage.
2) Determine the k nearest neighbors among the represented views: N_t = {i | x_i is a k-nearest neighbor of x_t}.
3) Calculate the reconstruction weights w_t in the appearance space: \hat{w}_t = \arg\min_{w_t} \| x_t - \sum_{i \in N_t} w_t^i \cdot x_i \|^2, subject to \sum_{i \in N_t} w_t^i = 1.
4) Calculate the mapping to the embedding space by reconstructing the low dimensional counterpart of view x_t: y_t = \sum_{i \in N_t} \hat{w}_t^i \cdot y_i.
5) Predict the next position in the low dimensional embedding space, e.g. using the last two views: (y_{t-1}, y_t) \to y_{t+1}^{pred}.
6) Determine the reconstruction weights w_a in the embedding space by iterative reconstruction. Set y_a = y_{t+1}^{pred} and repeat the following steps:
   (i) N_a = {i | y_i is a k-nearest neighbor of y_a}
   (ii) \hat{w}_a = \arg\min_{w_a} \| y_a - \sum_{i \in N_a} w_a^i \cdot y_i \|^2, subject to \sum_{i \in N_a} w_a^i = 1
   (iii) y_a = \sum_{i \in N_a} \hat{w}_a^i \cdot y_i
7) Map back to the appearance space: x_{t+1}^{pred} = \sum_{i \in N_a} \hat{w}_a^i \cdot x_i.
As explained in the last section, the iterative reconstruction assures that only valid object views are generated. We denote this procedure embedding view prediction.
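Building on the helper functions sketched above, one step of the embedding view prediction procedure could look as follows; the simple constant-velocity prediction used for step 5 is our own illustrative choice:

```python
import numpy as np
# relies on reconstruction_weights() and project_onto_manifold() from the earlier sketch

def predict_next_view(x_t, X, Y, k=8, prev_y=None):
    """One embedding view prediction step (sketch).
    X: stored views (appearance space), Y: their low dimensional embedding."""
    idx = np.argsort(np.linalg.norm(X - x_t, axis=1))[:k]     # steps 2-3
    w = reconstruction_weights(x_t, X[idx])
    y_t = w @ Y[idx]                                          # step 4: map to embedding
    if prev_y is None:
        return y_t, None                                      # no history yet, nothing to predict
    y_pred = y_t + (y_t - prev_y)                             # step 5: linear prediction
    _, x_pred, _ = project_onto_manifold(y_pred, Y, X, k=k)   # steps 6-7: project and lift back
    return y_t, x_pred
```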
5 Experiments
In order to analyze the embedding view predictor we conducted experiments in which we compared this view predictor with two view predictors working directly in the high dimensional appearance space. The first predictor linearly predicts the next view directly in the appearance space from the last two views. In general, this predicted view will lie beyond the manifold of the views. In order to be comparable to the embedding view predictor, the nearest neighbor of the linearly predicted view is determined and returned as the actual predicted view. We denote this view predictor the nearest neighbor view predictor. The second view predictor works like our embedding view predictor but, in contrast to it, operates directly in the high dimensional appearance space. This means that it linearly predicts views in the appearance space and projects the predicted views onto the appearance manifold using the iterative reconstruction idea. We denote this view predictor the iterative reconstruction view predictor. To validate our view prediction we generated two trajectories in the appearance space for each object. The trajectories are depicted exemplarily with the views of the bird in Fig. 5. It can be seen that the views of the trajectories do not correspond to views already represented in the set of sampled views introduced in Sect. 2. The tests we conducted assessed only the prediction ability of the embedding view predictor compared to the other two view predictors. The view predictors had to predict the views along the discussed trajectories. To this end, each view is predicted using its two predecessors in the trajectory. The predicted views are compared with the actual next views by means of the sum of absolute differences. Figure 6 shows the prediction error of the three view predictors applied to the two trajectories rotate and whirl (see Fig. 5) of the bee and the bird. It can be observed that the prediction in the low dimensional space is comparable to the predictors operating directly in the high dimensional appearance space. In general, the embedding view predictor is even slightly better. Sometimes, however, it tends to predict views with a large error, which appear as single high peaks in the error curve. A closer look revealed that this may be due to topological defects of the embedded appearance manifolds. The strong peaks occur more often when predicting the bird than the bee, and indeed the embedding of the bird's appearance manifold is more distorted than that of the bee (see Fig. 3).
Fig. 5. From left to right, the views in the two rows show the two variants of trajectories the view predictors are tested with. The upper one is a simple rotation about the vertical axis. The lower one starts at 320° horizontal and 2.5° vertical, goes straight to 40° horizontal and 360° vertical, and consists of 72 equally distributed views. In order to distinguish between these two trajectories, the first is called "rotate" and the second "whirl". The degrees in the top left corners of the images denote the horizontal rotation and those in the top right corners the vertical rotation.
Fig. 6. This figure displays the prediction error of the embedding view predictor, the nearest neighbor view predictor and the iterative reconstruction view predictor for the two trajectories rotate and whirl of the bird and the bee. The error is the sum of absolute differences between the predicted and the actual view. Almost all of the time the embedding view predictor is superior to the other two.
Furthermore, we analyzed the three predictors concerning their ability to predict further views without being updated with actual views, i.e. we simulated an occlusion of the objects. To this end the three view predictors were again applied to the whirl and rotate trajectories, but this time they had to rely solely on their own prediction from the 10th time step on. The results are shown in Fig. 7. It is striking that the embedding view predictor is able to reliably predict up to 10 further views while the other two predictors are only able to predict 2-3 further views. A possible explanation could be the higher ambiguity in the high
Fig. 7. This figure shows the prediction error of the three view predictors applied to the two trajectories rotate and whirl of the bird and the bee. From the 10th time step (view) on the objects are considered to be completely occluded. This means that the predictors have to rely entirely on their own prediction. It can be observed that the embedding predictor can reliably predict up to 10 further views while the other two predictors cannot predict more than two to three views.
dimensional appearance space. This is a hint that predicting on the embedding of the appearance manifold in a low dimensional space is more appropriate for tracking the appearance of objects than predicting directly in the high dimensional appearance space.
6 Conclusion
We introduced an approach for predicting views of an object by means of its appearance manifold. By applying Isomap to the various views of an object the appearance manifold of that object can be extracted and embedded into a lower dimensional space. A change of object appearance corresponds to a trajectory on the appearance manifold as well as its embedding. By keeping track of the position of the object on the embedded manifold it is possible to forecast the upcoming appearance. We used an iterative version of the reconstruction idea of LLE in order to map points from the embedding space back into the appearance space and showed that this maps points from the embedding space only to points on the appearance manifold, i.e. only valid views of the object are predicted. Simulations have shown that following the trajectory (and by doing so predicting views) is less error prone using the embedded manifold than its high dimensional equivalent. Furthermore, we have shown that predicting the appearance for several following time steps is also more accurate using the low dimensional embedding. We want to stress that the introduced approach is
not a full-fledged real object tracking system but rather a scheme for predicting complex views. In future work we want to investigate the possibility of using the simplex method for calculating the reconstruction weights, as it implicitly constrains the weights to a convex combination. Furthermore, we want to analyze our approach with real objects and integrate it into a tracking architecture based on a view prediction and confirmation model, hopefully boosting the performance of the tracker considerably.
References 1. Poggio, T., Edelman, S.: A network that learns to recognize three-dimensional objects. Nature 343, 263–266 (1990) 2. Edelman, S., Buelthoff, H.: Orientation dependence in the recognition of familiar and novel views of 3D objects. Vision Research 32, 2385–2400 (1992) 3. Ullman, S.: Aligning pictorial descriptions: An approach to object recognition. Cognition 32(3), 193–254 (1989) 4. Morency, L.P., Rahimi, A., Darrell, T.: Adaptive View-Based Appearance Models. In: Proceedings of CVPR 2003, vol. 1, pp. 803–812 (2003) 5. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by Locally Linear Embedding. Science 290(5500), 2323–2326 (2000) 6. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000) 7. Zhang, Z., Zha, H.: Principal Manifolds and Nonlinear Dimensionality Reduction via Tangent Space Alignment. SIAM J. Sci. Comput. 26(1), 313–338 (2004) 8. Elgammal, A., Lee, C.S.: Inferring 3D Body Pose from Silhouettes Using Activity Manifold Learning. In: Proceedings of CVPR 2004, vol. 2, pp. 681–688 (2004) 9. Lim, H., Camps, O.I., Sznaier, M., Morariu, V.I.: Dynamic Appearance Modeling for Human Tracking. In: Proceedings of CVPR 2006, pp. 751–757 (2006) 10. Liu, C.B., et al.: Object Tracking Using Globally Coordinated Nonlinear Manifolds. In: Proceedings of ICPR 2006, pp. 844–847 (2006) 11. Saul, L.K., Roweis, S.T.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4, 119–155 (2003)
Inverse-Halftoning for Error Diffusion Based on Statistical Mechanics of the Spin System Yohei Saika Wakayama National College of Technology, 77 Noshima, Nada, Gobo, Wakayama 644-0023, Japan
[email protected]
Abstract. On the basis of statistical mechanics of the Q-Ising model with ferromagnetic interactions under the random fields, we formulate the problem of inverse-halftoning for error diffusion using the Floyd-Steinberg kernel. Then, using the Monte Carlo simulation for a set of snapshots of the Q-Ising model and a standard image, we estimate the performance of our method based on the mean square error and edge structures of the reconstructed image, such as the edge length and the gradient of the gray-level. We clarify that the optimal performance of the MPM estimate is achieved by suppressing the gradient of the gray-level on the edges of the halftone image and by removing a part of the halftone image if we set parameters appropriately. Keywords: Bayes inference, digital halftoning, error diffusion, inverse-halftoning, Monte Carlo simulation.
1 Introduction
For many years, researchers have investigated information processing problems such as image analysis, spatial data and Markov random fields [1-5]. In recent years, based on the analogy between probabilistic information processing and statistical mechanics, statistical-mechanical methods have been applied to image restoration [6] and error-correcting codes [7]. Pryce and Bruce [8] formulated the threshold posterior marginal (TPM) estimate based on statistical mechanics of the classical spin system. Sourlas [9,10] pointed out the analogy between error correction of the Sourlas codes and statistical mechanics of spin glasses. Nishimori and Wong [11] then constructed a unified framework of image restoration and error-correcting codes based on statistical mechanics of the Ising model. Recently, statistical-mechanical techniques have been applied to various problems, such as mobile communication [12]. In the field of print technology, many information processing techniques have played important roles in printing multi-level images with high quality. In particular, digital halftoning is an essential technique to convert a multi-level image into a bi-level image which is visually similar to the original image [13]. Various techniques for digital halftoning have been established, such as the threshold mask method [14], the dither method [15], the blue noise mask method [16] and error diffusion [17,18]. Inverse-halftoning is likewise an important technique to reconstruct the
multi-level image from the halftone image [19]. For this purpose, various techniques [20] for image restoration have been used for inverse-halftoning. In recent years, the MAP estimate [21] has been applied both to the threshold mask method and to the error diffusion method. Recently, statistical-mechanical methods have been applied to the threshold mask method [21-23]. In this article, we show a statistical-mechanical formulation for the problem of inverse-halftoning of the error diffusion method using the maximizer of the posterior marginal (MPM) estimate. This method is based on the Bayes inference, and the posterior probability is constructed from the model prior and the likelihood via the Bayes formula. In this study, we use a model prior expressed by the Boltzmann factor of the Q-Ising model and a likelihood expressed by the Boltzmann factor of random fields enhancing the halftone image. Then, using the Monte Carlo simulation both for a set of snapshots of the Q-Ising model and for a standard image, we estimate the performance of our method based on the mean square error and on edge structures observed in the original, halftone and reconstructed images, such as the edge length and the gradient of the gray-level. We investigate the edge structures of the reconstructed image because the dot pattern with complex structures appearing in the halftone image is considered to influence the performance of inverse-halftoning. The simulation clarifies that the MPM estimate works effectively for inverse-halftoning of the halftone image converted by the error diffusion method using the Floyd-Steinberg kernel if we set the parameters appropriately. We also clarify that the optimal performance of our method is achieved by suppressing the gray-level difference between neighboring pixels and by removing a part of the edges which are embedded in the halftone image through the procedure of error diffusion. Further, we clarify the dynamical properties of the MPM estimate for inverse-halftoning. If the parameters are set appropriately, the mean square error smoothly converges to the optimal value irrespective of the choice of the initial condition; otherwise, the convergent value of the MPM estimate depends on the initial condition of the Monte Carlo simulation. The present article is organized as follows. In Section 2, we show the statistical-mechanical formulation for the problem of inverse-halftoning for error diffusion. In Section 3, using the Monte Carlo simulation both for a set of snapshots of the Q-Ising model and for a gray-level standard image, we estimate the performance of the MPM estimate based on the mean square error and the edge structures of the original, halftone and reconstructed images, such as the edge length and the gradient of the gray-level. Section 4 is devoted to a summary and discussion.
2 General Formulation
Here we show a statistical-mechanical formulation for the problem of inverse-halftoning of a halftone image generated by error diffusion using the Floyd-Steinberg kernel. First, we consider a gray-level image {ξ_{x,y}} in which all pixels are arranged on the lattice points of a square lattice. Here we set ξ_{x,y} = 0, …, Q−1 and x, y = 1, …, L.
Fig. 1. (a) a sample of the snapshot of the Q=4 Ising model with 100×100 pixels, (b) the halftone image converted from (a) by the error diffusion method using the Floyd-Steinberg kernel, (c) the 4-level image reconstructed by the MPM estimate when hs=1, Ts=1, h=1, T=0.1, J=0.875, (d) the 256-level standard image "girl" with 100×100 pixels, (e) the halftone image converted from (d) by the error diffusion method using the Floyd-Steinberg kernel, (f) the 256-level image reconstructed from (e) by the MPM estimate when h=1, T=0.1, J=1.40.
In this study, we treat two kinds of original images. One is the set of gray-level images generated by a true prior expressed by the Boltzmann factor of the Q-Ising model:
\Pr(\{\xi_{x,y}\}) = \frac{1}{Z_s} \exp\Big[ -\frac{h_s}{T_s} \sum_{n.n.} (\xi_{x,y} - \xi_{x',y'})^2 \Big],   (1)
where Z_s is the normalization factor and n.n. denotes the nearest-neighboring pairs. As shown in Fig. 1(a), we can generate gray-level images which have smooth structures, as can be seen in natural images, if we appropriately set the parameters h_s and T_s. The other is the 256-level standard image "girl" shown in Fig. 1(d). Next, in the procedure of digital halftoning based on error diffusion, we convert the gray-level image {ξ_{x,y}} into a halftone image {τ_{x,y}} which is visually similar to the original gray-level image, where we set τ_{x,y} = 0, 255 and x, y = 1, …, L. The halftone images obtained by the error diffusion method are shown in Figs. 1(b) and (e). The error diffusion algorithm is performed following the block diagram in Fig. 2 and the Floyd-Steinberg kernel in Fig. 3. As shown in these figures, the algorithm proceeds through the image in a raster scan, and a binary decision at each pixel is made based on the input gray-level at the (x,y)-th pixel and the filtered errors from the previously thresholded samples. At the (x,y)-th pixel the gray-level u_{x,y} is rewritten into the modified gray-level u'_{x,y} as
u'_{x,y} = u_{x,y} - \sum_{(k,l) \in S} h_{k,l}\, e_{x-k,\,y-l}.   (2)
Here {h_{k,l}} is the Floyd-Steinberg kernel and S is the support region of the Floyd-Steinberg kernel relative to the site (x, y). Then e_{x,y} is the error of the halftone image τ_{x,y} with respect to the gray-level value u_{x,y} at the site (x, y):
e_{x,y} = \tau_{x,y} - u'_{x,y}.   (3)
Here the pixel value τx,y of the halftone image is obtained using the threshold procedure as
\tau_{x,y} = \begin{cases} Q-1 & (u'_{x,y} \ge (Q-1)/2) \\ 0 & (\text{otherwise}) \end{cases}   (4)
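For illustration, Eqs. (2)-(4) can be implemented directly; the following Python sketch (our own, with simple border handling) performs error diffusion with the Floyd-Steinberg kernel on a gray-level array:

```python
import numpy as np

# Floyd-Steinberg kernel: fractions of the quantization error pushed to the
# right, lower-left, lower and lower-right neighbours during a raster scan.
FS_KERNEL = [(0, 1, 7/16), (1, -1, 3/16), (1, 0, 5/16), (1, 1, 1/16)]

def error_diffusion(image, Q=256):
    """Binarize a gray-level image (values 0..Q-1) by Floyd-Steinberg error diffusion."""
    u = image.astype(float).copy()
    halftone = np.zeros_like(u)
    L1, L2 = u.shape
    for x in range(L1):
        for y in range(L2):
            tau = (Q - 1) if u[x, y] >= (Q - 1) / 2 else 0    # threshold, Eq. (4)
            halftone[x, y] = tau
            err = tau - u[x, y]                               # quantization error, Eq. (3)
            for dx, dy, w in FS_KERNEL:
                if 0 <= x + dx < L1 and 0 <= y + dy < L2:
                    u[x + dx, y + dy] -= w * err              # diffuse the error, Eq. (2)
    return halftone
```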
Next, using the MPM estimate based on statistical mechanics of the Q-Ising model, we reconstruct a gray-level image from the halftone image converted by the error diffusion method using the Floyd-Steinberg kernel. In this method, we use a model system expressed by a set of Q-Ising spins {z_{x,y}} (z_{x,y} = 0, …, Q−1, x, y = 1, …, L) located on the square lattice. The procedure of inverse-halftoning is carried out so as to maximize the posterior marginal probability:

\hat{z}_{x,y} = \arg\max_{z_{x,y}} \sum_{\{z\} \ne z_{x,y}} P(\{z\} \mid \{\tau\}),   (5)
where the posterior probability is estimated based on the Bayes formula

P(\{z\} \mid \{\tau\}) \propto P(\{z\})\, P(\{\tau\} \mid \{z\})   (6)
using the model of the true prior and the likelihood. In this study, we assume the model of the true prior which is expressed by the Boltzmann factor of the Q-Ising model as
P(\{z\}) = \frac{1}{Z_m} \exp\Big[ -\frac{J}{T_m} \sum_{n.n.} (z_{x,y} - z_{x',y'})^2 \Big].   (7)
This model prior is expected to enhance smooth structures which can be seen in natural images. Then, we assume the likelihood:
P(\{\tau\} \mid \{z\}) \propto \exp\Big[ -\frac{h}{T_m} \sum_{x,y} (z_{x,y} - \hat{\tau}_{x,y})^2 \Big],   (8)
which generally enhances the gray-level image:
\hat{\tau}_{x,y} = \sum_{i,j=-1}^{1} a_{i,j}\, \tau_{x+i,\,y+j},   (9)
where {a_{i,j}} is the kernel of a conventional filter. In this study, we set τ̂_{x,y} to the halftone image itself, which is obtained by the error diffusion method using the Floyd-Steinberg kernel. We note that our method corresponds to the MAP estimate in the limit of T→0.
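As a rough illustration of how the MPM estimate of Eq. (5) can be computed in practice, the following Monte Carlo sketch samples the posterior built from Eqs. (7) and (8) with single-site Metropolis updates and takes the per-pixel maximizer of the accumulated marginals; the sweep count, the uniform proposal and the absence of an explicit burn-in are illustrative simplifications of ours, not the authors' exact procedure:

```python
import numpy as np

def mpm_reconstruct(tau_hat, Q, J, h, T, sweeps=200, seed=None):
    """Monte Carlo sketch of the MPM estimate for the Q-Ising posterior."""
    rng = np.random.default_rng(seed)
    L1, L2 = tau_hat.shape
    z = rng.integers(0, Q, size=(L1, L2))
    counts = np.zeros((L1, L2, Q))                    # per-pixel marginal histograms

    def local_energy(x, y, v):
        # prior term (Eq. 7) over the four nearest neighbours + likelihood term (Eq. 8)
        nn = 0.0
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= x + dx < L1 and 0 <= y + dy < L2:
                nn += (v - z[x + dx, y + dy]) ** 2
        return J * nn + h * (v - tau_hat[x, y]) ** 2

    for _ in range(sweeps):
        for x in range(L1):
            for y in range(L2):
                v = rng.integers(0, Q)                # propose a new gray level
                dE = local_energy(x, y, v) - local_energy(x, y, z[x, y])
                if dE <= 0 or rng.random() < np.exp(-dE / T):
                    z[x, y] = v                       # Metropolis acceptance
                counts[x, y, z[x, y]] += 1
        # (in practice a burn-in period would precede the accumulation of counts)
    return counts.argmax(axis=2)                      # MPM estimate, Eq. (5)
```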
Fig. 2. The block diagram of the error diffusion algorithm
Fig. 3. The Floyd-Steinberg kernel
Next, in order to estimate the performance of our method for a standard image, we use the mean square error as
\sigma = \frac{1}{NQ^2} \sum_{x,y=1}^{L} (\hat{z}_{x,y} - \xi_{x,y})^2,   (10)
where \xi_{x,y} and \hat{z}_{x,y} are the pixel values of the original gray-level and reconstructed
images. On the other hand, if we estimate the performance of gray-level images generated by the true prior P({ξ}), we evaluate the mean square error which is averaged over the true prior as
\sigma = \sum_{\{\xi\}} P(\{\xi\})\, \frac{1}{NQ^2} \sum_{x,y=1}^{L} (\hat{z}_{x,y} - \xi_{x,y})^2.   (11)
This value becomes zero if each pixel value of every reconstructed image is identical to that of the corresponding original image. As noted above, because the edge structures embedded in the halftone image are considered to influence the performance of inverse-halftoning, we estimate the edge length appearing in both the halftone and reconstructed images as
L^{GH}_{edge} = \sum_{\{\xi\}} P(\{\xi\}) \Big[ \sum_{n.n.} (1 - \delta_{\tau_{x,y},\,\tau_{x',y'}})(1 - \delta_{z_{x,y},\,z_{x',y'}}) \Big],   (12)
which is averaged over all the snapshots of the Q-Ising model. We also estimate the gradient of the gray-level on the edges appearing in both the halftone and reconstructed images as

|\delta z^{GH}_{x,y}| = \sum_{\{\xi\}} P(\{\xi\}) \Big[ \sum_{n.n.} (1 - \delta_{\tau_{x,y},\,\tau_{x',y'}})\, |z_{x,y} - z_{x',y'}| \Big],   (13)
which is averaged over the snapshots of the Q-Ising model. Next, in order to clarify how the edge structures of the original image are observed in the reconstructed image, we estimate the edge length on the edges in the original, halftone and reconstructed images:

L^{GHO}_{edge} = \sum_{\{\xi\}} P(\{\xi\}) \Big[ \sum_{n.n.} (1 - \delta_{\xi_{x,y},\,\xi_{x',y'}})(1 - \delta_{\tau_{x,y},\,\tau_{x',y'}})(1 - \delta_{z_{x,y},\,z_{x',y'}}) \Big],   (14)
which is averaged over the original images. Also, we estimate the gradient of the gray-level on the edges in the original, halftone and reconstructed images:

|\delta z^{GHO}_{x,y}| = \sum_{\{\xi\}} P(\{\xi\}) \Big[ \sum_{n.n.} (1 - \delta_{\xi_{x,y},\,\xi_{x',y'}})(1 - \delta_{\tau_{x,y},\,\tau_{x',y'}})\, |z_{x,y} - z_{x',y'}| \Big],   (15)
which is averaged over the original images.
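For a single triple of original, halftone and reconstructed images, the quantities inside the averages of Eqs. (12)-(15) can be accumulated as in the following sketch (our own illustration); averaging the returned values over the snapshots of the prior then gives the reported measures:

```python
def edge_measures(xi, tau, z):
    """Edge lengths and gray-level gradients on halftone edges (cf. Eqs. 12-15)
    for one original image xi, its halftone tau and the reconstruction z."""
    L_GH = dz_GH = L_GHO = dz_GHO = 0.0
    L1, L2 = z.shape
    for x in range(L1):
        for y in range(L2):
            for dx, dy in ((1, 0), (0, 1)):           # nearest-neighbour pairs
                if x + dx < L1 and y + dy < L2:
                    t_edge = tau[x, y] != tau[x + dx, y + dy]
                    o_edge = xi[x, y] != xi[x + dx, y + dy]
                    z_edge = z[x, y] != z[x + dx, y + dy]
                    grad = abs(float(z[x, y]) - float(z[x + dx, y + dy]))
                    L_GH += int(t_edge and z_edge)
                    dz_GH += int(t_edge) * grad
                    L_GHO += int(o_edge and t_edge and z_edge)
                    dz_GHO += int(o_edge and t_edge) * grad
    return L_GH, dz_GH, L_GHO, dz_GHO
```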
3 Performance
In order to estimate the performance of the MPM estimate, we carry out the Monte Carlo simulation both for the set of gray-level images generated by the Boltzmann factor of the Q-Ising model and for the gray-level standard image. First, we estimate the performance of the MPM estimate for the set of snapshots of the Q=4 Ising model shown in Fig. 1(a). The corresponding halftone image, shown in Fig. 1(b), is obtained by the error diffusion method using the Floyd-Steinberg kernel. When we estimate the performance of the MPM estimate, we use the mean square error averaged over 10 samples of the snapshots of the Q=4 Ising model with hs=1 and Ts=1. Now we investigate static properties of the MPM estimate based on the mean square error and the edge structures observed in the original, halftone and reconstructed images. First, we confirm the static properties of the MPM estimate when the posterior probability has the same form as the likelihood at J=0 or as the model prior in the limit of J→∞. At J=0, as the likelihood is assumed to enhance the halftone image, the MPM estimate reconstructs a gray-level image which is almost the same as the halftone image. Therefore the reconstructed image at J=0 has an edge length which is
Fig. 4. The mean square error as a function of J obtained by the MPM estimate for the halftone image which is obtained from the set of the snapshots of the Q=4 Ising model by the error diffusion method using the Floyd-Steinberg kernel when hs=1, Ts=1, h=1, T=0.1
Fig. 5. The edge structures observed in the reconstructed image obtained by the MPM estimate using the Q-Ising model for the set of the snapshots of the 4-Ising model when hs=1, Ts=1, h=1, T=1
longer than that of the original image, as the halftone image is expressed by a dot pattern which is visually similar to the original image. For instance, the edge lengths averaged over the set of the Q=4 Ising model snapshots and over the corresponding halftone images obtained by error diffusion using the Floyd-Steinberg kernel are 6021.3 and 12049.8, respectively. On the other hand, in the limit of J→∞, as the model prior is assumed to enhance smooth structures, the MPM estimate reconstructs a flat pattern.
Fig. 6. The mean square error as a function of J obtained by the MPM estimate for error diffusion for the Q=256 standard image "girl" when h=1, T=1
Fig. 7. The edge length and the gradient of the gray-level in the gray-level image restored by the MPM estimate for Q=256 standard image “girl” when h=1, T=1
Then we investigate the performance of the MPM estimate when the posterior probability is composed both of the model prior and the likelihood. In Fig. 4 we show the mean square error as a function of J for the snapshots of the Q=4 Ising model. The figure indicates that the mean square error takes its minimum at J=0.875. This means that the MPM estimate reconstructs the gray-level image by suppressing the gradient of the gray-level of the edges embedded in the halftone image by the Boltzmann factor of the Q-Ising model. Also the figure indicates that the mean square error rapidly decreases as we increase J from 0.025 to 0.3. The origin of the rapid drop of the mean square error can be explained by the edge structures of the reconstructed image, such
as the edge length and the gradient of the gray-level, L^{GH}_{edge}, L^{GHO}_{edge}, |δz^{GH}| and |δz^{GHO}|, which are given in Eqs. (12)-(15), in the following. Now we evaluate the edge structures of the reconstructed images, such as the edge length and the gradient of the gray-level. Figure 5 shows how L^{GH}, L^{GHO}, |δz^{GH}| and |δz^{GHO}| depend on the parameter J for the MPM estimate applied to the set of snapshots of the Q=4 Ising model. Figure 5 indicates that L^{GH} is steady in 0 < J 1.175. This means that the edge structures of the halftone image survive in J 0.6. On the other hand, |δz^{GH}| decreases as J increases from 0.025 to 0.3 and then becomes almost the same as the edge length L^{GH} at J=0.3. Then, as shown in Fig. 5, although L^{GH} is steady in 0 < J Jopt.

Next, we estimate the performance of the MPM estimate for the 256-level standard image "girl" with 100×100 pixels shown in Fig. 1(d). The halftone image in Fig. 1(e) is obtained from the original image by the error diffusion method using the Floyd-Steinberg kernel. As shown in Figs. 6 and 7, we estimate the performance in terms of the mean square error, the edge length and the gradient of the gray-level. Figure 6 indicates that the MPM estimate shows optimal performance at J=1.45. Then, as shown in Fig. 7, the Monte Carlo simulation clarifies that the edge length is longer than that of the halftone image and, further, that the gray-level difference is larger than that of the edge length in the reconstructed image. These results indicate that the fine structures introduced into the halftone image remain in the reconstructed image, though the gradient of the gray-level is sharply suppressed with the increase of J in 0 < J

1/2, where f(y) = 1/(1 + exp(−y/ε)), and y_{i,j}(t) is an internal state of the (i, j)th neuron at time t. The internal state of the chaotic neuron [1] is decomposed into two parts, ξ_{i,j}(t) and ζ_{i,j}(t), each of which represents different effects for determining the firing of a neuron in the algorithm: a gain effect, a refractory effect, and a mutual inhibition effect, respectively.
The first part, ξ_{i,j}(t), which expresses the gain effect, is defined by

\xi_{i,j}(t+1) = \beta\big(E_{i,j}(t) - \hat{E}\big),   (1)

E_{i,j}(t) = \frac{1}{L} \sum_{k=1}^{L} \sum_{\omega \in \Omega} f_k(\omega) \log_2 \frac{f_k(\omega)}{p(\omega)},   (2)
where β is a scaling parameter of the effect; E_{i,j}(t) represents the objective function in the CMS, the relative entropy score obtained when the candidate motif position is moved to the jth position in the sequence s_i; Ê is the entropy score of the current state; f_k(ω) is the number of appearances of an element (one of the four bases in the case of DNA or RNA sequences, or one of the 20 amino acids in the case of a protein sequence) ω ∈ Ω at the kth position of the subsequences; p(ω) is the background probability of appearance of the element ω; and Ω is a set of bases (Ω_BASE = {A, C, G, T}) or a set of amino acids (Ω_ACID = {N, S, D, Q, E, T, R, H, G, K, Y, W, C, M, P, F, A, V, L, I}). The quantity on the right-hand side of Eq. (1) becomes positive if the new motif candidate position is better than the current state. The second part, ζ_{i,j}(t), qualitatively realizes the refractoriness of the neuron. The refractoriness is one of the important properties of real biological neurons; once a neuron fires, it can hardly fire again for a certain period of time. The second part is then expressed as follows:
\zeta_{i,j}(t+1) = -\alpha \sum_{d=0}^{t} k_r^d\, x_{i,j}(t-d) + \theta   (3)
= -\alpha\, x_{i,j}(t) + k_r\, \zeta_{i,j}(t) + \theta(1 - k_r),   (4)
where α is a scaling parameter that determines the strength of the refractory effect after neuron firing (0 < α); k_r is a decay parameter that takes values between 0 and 1; and θ is a threshold value. Thus, in Eq. (3), ζ_{i,j}(t+1) expresses the refractory effect with a decay factor k_r: the more frequently the neuron has fired in its past, the more negative the first term on the right-hand side of Eq. (3) becomes, which in turn depresses the value of ζ_{i,j}(t+1) and leads the neuron to a relatively resting state. Although the refractoriness realized in the chaotic neuron [1] has a similar effect to the tabu effect [3, 4], we have already shown that the refractoriness can control the search of a solution in the state space better than the tabu effect [12, 13]. Using these two internal states, we construct an algorithm for extracting motifs, as described in the following. Consider a set of N sequences of lengths m_i (i = 1, 2, ..., N); further, let the length of a subsequence (motif) be L (Fig. 2). We proceed as follows:
1. The position of an initial subsequence t_{i,j} (i = 1, 2, ..., N; j = 1, 2, ..., m_i − L + 1) is set randomly in the ith sequence.
2. The ith sequence s_i is selected cyclically.
3. For the sequence s_i selected in the second step, to change the position of the motif candidate to a new position, y_{i,j}(t+1) is calculated from the first neuron (j = 1) to the last neuron (j = m_i − L + 1) in the sequence s_i. Then,
the neuron whose internal state takes the maximum value (y_{i,max}) is selected. If the (i, max)th neuron fires (x_{i,max}(t+1) > 1/2), the motif position is moved to the (i, max)th position and the value of Ê is updated.
4. Repeat steps 2 and 3 for a sufficient number of iterations.
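The following Python sketch illustrates one update of the above procedure for a single sequence; it is our own simplified rendering (f_k(ω) is taken as a relative frequency, and the per-neuron states x and ζ are assumed to be stored externally), not the authors' implementation:

```python
import numpy as np

def relative_entropy_score(positions, sequences, L, p):
    """Relative entropy E of the aligned motif columns (cf. Eq. 2);
    p maps each symbol to its background probability."""
    E = 0.0
    for k in range(L):
        column = [seq[t + k] for seq, t in zip(sequences, positions)]
        n = len(column)
        for sym in set(column):
            f = column.count(sym) / n
            E += f * np.log2(f / p[sym])
    return E / L

def cms_update_sequence(i, positions, sequences, L, p, x, zeta, params):
    """One CMS step for sequence i: evaluate every candidate position with the
    chaotic-neuron dynamics (cf. Eqs. 1-4) and move the motif if the winner fires."""
    alpha, beta, kr, theta, eps = params
    E_hat = relative_entropy_score(positions, sequences, L, p)
    m = len(sequences[i]) - L + 1
    y = np.empty(m)
    for j in range(m):
        trial = list(positions)
        trial[i] = j
        xi = beta * (relative_entropy_score(trial, sequences, L, p) - E_hat)   # gain effect
        zeta[j] = -alpha * x[j] + kr * zeta[j] + theta * (1.0 - kr)            # refractory effect
        y[j] = xi + zeta[j]
    x[:] = 1.0 / (1.0 + np.exp(-y / eps))       # neuron outputs
    j_max = int(np.argmax(y))
    if x[j_max] > 0.5:                          # the winning neuron fires
        positions[i] = j_max                    # accept the new motif position
    return positions
```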
3 Statistical Measures CV and LV [14]
The coefficient of variation (CV) of interspike intervals is a measure of the randomness of interval variation. The CV is defined as

CV = \frac{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (T_i - \bar{T})^2}}{\bar{T}},   (5)

where T_i is the ith interspike interval (ISI), n is the number of ISIs, and \bar{T} = \frac{1}{n} \sum_{i=1}^{n} T_i is the mean ISI. For spike intervals that are independently exponentially distributed, CV is 1 in the limit of a large number of intervals. For a regular spike time-series in which T_i is constant, CV = 0. Next, the local variation (LV) of interspike intervals is a measure of the spiking characteristics of an individual neuron. The LV is defined as
LV = \frac{1}{n-1} \sum_{i=1}^{n-1} \frac{3(T_i - T_{i+1})^2}{(T_i + T_{i+1})^2}.   (6)
For spike intervals that are independently exponentially distributed, LV = 1. For a regular spike time-series in which Ti is constant, LV = 0.
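Both measures are straightforward to compute from an interspike-interval sequence, for example:

```python
import numpy as np

def cv_lv(isi):
    """Coefficient of variation (Eq. 5) and local variation (Eq. 6)
    of a sequence of interspike intervals."""
    T = np.asarray(isi, dtype=float)
    n = len(T)
    cv = np.sqrt(np.sum((T - T.mean()) ** 2) / (n - 1)) / T.mean()
    lv = np.sum(3 * (T[:-1] - T[1:]) ** 2 / (T[:-1] + T[1:]) ** 2) / (n - 1)
    return cv, lv
```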
4 Results
4.1 Investigation of Statistical Measures of CV and LV
In order to investigate the values of CV and LV [14] for the chaotic neurons in the CMS solving the MEP, we used real protein sequences [9, 15]. First, we used a real protein data set consisting of N = 30 sequences. The ith sequence is composed of m_i (i = 1, 2, ..., N) amino acids, and one motif (L = 18) is embedded in each sequence. Because the correct locations of the motif in these real protein sequences [15] have already been identified by another method [9, 15], we can compare the analysis results between the neurons at the correct motif positions and the others. For more details of the sequence data and how the correct motif positions were identified, see Ref. [9] and the references therein. Table 1 shows the correct location of the motif in each sequence. To extract motifs from this data set, the parameters of the CMS are set to k_r = 0.8, α = 0.25, θ = 1.0, β = 30.0, and ε = 0.01.
Table 1. Correct location of the motif in each real protein sequence [15]
Sequence number
1
2
3
4
5
6
7
9 10 11 12
13 14 15
5 73 99 25 169 15
12 196 196
Sequence number 16 17 18 19 20 21 22 23 24 25 26 27
28 29 30
Correct location 225 198 22 326 449 22
Correct location 252 444 11 23
3
8
5 26 67 495 205 160
3
3 27 25
Fig. 3. (a) Raster plot, the values of (b) the firing rate, (c) CV and (d) LV for all neurons in the 3rd sequence of Table 1
It is considered that the frequently firing neurons correspond to the location of a correct motif, because the gain effect of the chaotic neurons is determined by the degree of similarity to the other motif candidates. Namely, the neurons corresponding to motifs fire frequently. In this sense, it might seem sufficient to observe the firing rates of the neurons. However, the firing rate alone does not always give a correct result. Figure 3 shows such an example: the results for the 3rd sequence (s_3) of the data in Table 1. For the 3rd sequence, although the location of the correct motif is 22, the most frequently firing neurons are not the 22nd but the middle (160∼170) and end (270∼280) neurons of the sequence (Fig. 3(b)). If we evaluate a raster plot of all the neurons in the 3rd sequence, the frequently firing neurons seem to fire randomly, while the firing pattern of the 22nd neuron shows characteristic behavior (Fig. 3(a)). To detect such characteristic firing behavior, we calculated statistical values of the output spike time-series. As a result, it is clearly seen that the value of CV of the 22nd neuron is higher than that of the other neurons and the value of LV is lower than that of the other neurons (Figs. 3(c) and (d)). On the other hand, for the frequently firing neurons, the values of CV are low and the values of LV are high. For the other
Fig. 4. The values of (a) the firing rate, (b) CV, and (c) LV of all neurons for the real protein data set. The correct locations of the motifs are shown in Table 1.
sequences, the neurons corresponding to a correct motif show the same tendency (Fig. 4 and Table 1). Furthermore, if the neuron at a correct motif position fires frequently, CV becomes not low but high, and LV becomes not high but low. The results indicate that spike time-series generated by the chaotic neuron of a correct motif position has different characteristics from the other neurons. 4.2
Multiple Motif Case
The original CMS [11, 12] cannot always find multiple motifs in a sequence, because only the single most similar subsequence is extracted as a motif from each sequence. However, multiple motifs might be extracted from each sequence by using
Table 2. Correct locations of the motif in each artificial protein sequence
Sequence number  Correct locations
1
2
3
4
5
6
130 225 198 22 326 467 22
7
8
9 10 11 12
15 5 99 169 35 12 196
Sequence number 13 14 15 16 17 18 19 20 21 22 23 24 25 Correct locations
11 23 196 250 150
3
50 5 26 67 495 205 178
3
3 25
(a)
T. Matsuura and T. Ikeguchi
Neuron index
680
200 100 0 0
100
200
300
400
500
600
t
700
800
900
1000
(b)
15 10
(c)
1.2 1 0.8
(d)
1.2 1 0.8 50
100
150
200
250
Neuron index
Sequence number
Fig. 5. (a) Raster plot, the values of (b) the firing rate, (c) CV and (d) LV for all neurons in the 15th sequence of Table 2
Fig. 6. The values of (a) the firing rate, (b) CV , and (c) LV of all neurons for the artificial protein data set. The correct location of the motifs are shown in Table 2.
characteristics of the spike time-series. To verify this hypothesis, we prepared an artificial data set. This artificial data set has 25 sequences (N = 25), and the ith sequence is composed of m_i (i = 1, 2, ..., N) amino acids. In each sequence, one
or two motifs are embedded. Table 2 shows the correct locations of the motif (L = 18) in each sequence. To extract motifs from this data set, the parameters are set to k_r = 0.8, α = 0.25, θ = 0.9, β = 25.0, and ε = 0.01. Figure 5 shows the results for the 15th sequence of Table 2, in which two motifs are embedded. For the 15th sequence, although one neuron of the motif (the 23rd) takes a high firing rate, the other (the 156th) is low (Figs. 5(a) and (b)). However, both values of CV are higher than those of the other neurons, and both values of LV are lower (Figs. 5(c) and (d)). For the other sequences, the neurons of correct motifs show the same tendency (Fig. 6 and Table 2).
5 Conclusions
In this paper, to improve the performance of solving the motif extraction problem with chaotic neurodynamics, we introduced the spike time-series produced by the chaotic neurons in the CMS and analyzed their complexity by using statistical measures, the coefficient of variation (CV) and the local variation (LV) of interspike intervals, which are frequently used in the field of neuroscience. As a result, the CV of neurons corresponding to correct motif positions becomes higher than that of the other neurons and the LV becomes lower than that of the other neurons, independent of the firing rates of the neurons. By using these characteristics for finding motifs, we can solve the MEP, i.e., extract the motifs, even in the case that multiple motifs are embedded in each sequence. Although the statistical values of CV and LV of the spikes produced by neurons corresponding to the correct locations depend on each sequence, their values are statistically significant within the sequence. Then, if we evaluate the motif positions more quantitatively by using the statistical values of CV and LV, we could develop a new CMS algorithm that eliminates the sequences with no motifs. The research of T.I. is partially supported by a Grant-in-Aid for Scientific Research (B) (No.16300072) and a research grant from the Mazda Foundation.
References
[1] Aihara, K., et al.: Chaotic Neural Networks. Physics Letters A 144, 333–340 (1990)
[2] Akutsu, T., et al.: On Approximation Algorithms for Local Multiple Alignment. In: Proceedings of the 4th Annual International Conference on Computational Molecular Biology, pp. 1–7 (2000)
[3] Glover, F.: Tabu Search I. ORSA Journal on Computing 1(3), 190–206 (1989)
[4] Glover, F.: Tabu Search II. ORSA Journal on Computing 2(1), 4–32 (1990)
[5] Hasegawa, M., et al.: Combination of Chaotic Neurodynamics with the 2-opt Algorithm to Solve Traveling Salesman Problems. Physical Review Letters 79(12), 2344–2347 (1997)
[6] Hasegawa, M., et al.: Exponential and Chaotic Neurodynamical Tabu Searches for Quadratic Assignment Problems. Control and Cybernetics 29(3), 773–788 (2000)
[7] Hasegawa, M., et al.: Solving Large Scale Traveling Salesman Problems by Chaotic Neurodynamics. Neural Networks 15(2), 271–283 (2002)
[8] Hasegawa, M., et al.: A Novel Chaotic Search for Quadratic Assignment Problems. European Journal of Operational Research 139(3), 543–556 (2002)
[9] Lawrence, C.E., et al.: Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262, 208–214 (1993)
[10] Matsuura, T., et al.: A Tabu Search for Extracting Motifs from DNA Sequences. In: Proceedings of Nonlinear Circuits and Signal Processing 2004, pp. 347–350 (2004)
[11] Matsuura, T., et al.: A Tabu Search and Chaotic Search for Extracting Motifs from DNA Sequences. In: Proceedings of the Metaheuristics International Conference 2005, pp. 677–682 (2005)
[12] Matsuura, T., Ikeguchi, T.: Refractory Effects of Chaotic Neurodynamics for Finding Motifs from DNA Sequences. In: Corchado, E.S., Yin, H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS, vol. 4224, pp. 1103–1110. Springer, Heidelberg (2006)
[13] Matsuura, T., et al.: Analysis on Memory Effect of Chaotic Dynamics for Combinatorial Optimization Problem. In: Proceedings of the Metaheuristics International Conference (2007)
[14] Shinomoto, S., et al.: Differences in Spiking Patterns Among Cortical Neurons. Neural Computation 15, 2823–2842 (2003)
[15] Information available at http://www.genome.jp/
A Thermodynamical Search Algorithm for Feature Subset Selection F´elix F. Gonz´alez and Llu´ıs A. Belanche Languages and Information Systems Department Polytechnic University of Catalonia, Barcelona, Spain {fgonzalez,belanche}@lsi.upc.edu
Abstract. This work tackles the problem of selecting a subset of features in an inductive learning setting, by introducing a novel Thermodynamic Feature Selection algorithm (TFS). Given a suitable objective function, the algorithm makes use of a specially designed form of simulated annealing to find a subset of attributes that maximizes the objective function. The new algorithm is evaluated against one of the most widespread and reliable algorithms, the Sequential Forward Floating Search (SFFS). Our experimental results in classification tasks show that TFS achieves significant improvements over SFFS in the objective function with a notable reduction in subset size.
1
Introduction
The main purpose of Feature Selection is to find a reduced set of attributes from a data set described by a feature set. This is often carried out with a subset search process in the power set of possible solutions. The search is guided by the optimization of a user-defined objective function. The generic purpose pursued is the improvement in the generalization capacity of the inductive learner by reduction of the noise generated by irrelevant or redundant features. The idea of using powerful general-purpose search strategies to find a subset of attributes in combination with an inductive learner is not new. There has been considerable work using genetic algorithms (see for example [1], or [2] for a recent successful application). On the contrary, simulated annealing (SA) has received comparatively much less attention, probably because of its prohibitive cost, which can be especially acute when it is combined with an inducer to evaluate the quality of every explored subset. In particular, [3] developed a rule-based system based on the application of a generic SA algorithm by standard bitwise random mutation. We are of the opinion that such algorithms must be tailored to feature subset selection if they are to be competitive with state-of-the-art stepwise search algorithms. In this work we contribute to tackling this problem by introducing a novel Thermodynamic Feature Selection algorithm (TFS). The algorithm makes use of a specially designed form of simulated annealing (SA) to find a subset of attributes that maximizes the objective function. TFS has a number of distinctive characteristics over other search algorithms for feature subset selection. First, the
probabilistic capability of SA to accept momentarily worse solutions is enhanced by the concept of an ε-improvement, explained below. Second, the algorithm is endowed with a feature search window in the backward steps that limits the neighbourhood size while retaining efficacy. Third, the algorithm accepts any objective function for evaluating the quality of generated subsets. In this paper three different inducers have been tried to assess this quality. TFS is evaluated against one of the most reliable stepwise search algorithms, the Sequential Forward Floating Search method (SFFS). Our experimental results in classification tasks show that TFS notably improves over SFFS in the objective function in all cases, with a very notable reduction in subset size compared to the full set size. We also find that the comparative computational cost is affordable for a small number of features and becomes more favourable as the number of features increases.
2
Feature Selection
Let X = {x1 , . . . , xn }, n > 0 denote the full feature set. Without loss of generality, we assume that the objective function J : P(X) → R+ ∪ {0} is to be maximized, where P denotes the power set. We denote by Xk ⊆ X a subset of selected features, with |Xk | = k. Hence, by definition, X0 = ∅, and Xn = X. The feature selection problem can be expressed as: given a set of candidate features, select a subset1 defined by one of two approaches: 1. Set 0 < m < n. Find Xm ⊂ X, such that J(Xm ) is maximum. 2. Set a value J0 , the minimum acceptable J. Find the Xm ⊆ X with smaller m such that J(Xm ) ≥ J0 . Alternatively, given α > 0, find the Xm ⊆ X with smaller m, such that J(Xm ) ≥ J(X) or |J(Xm ) − J(X)| < αJ(X). In the literature, several suboptimal algorithms have been proposed for feature selection. A wide family is formed by those algorithms which, departing from an initial solution, iteratively add or delete features by locally optimizing the objective function. Among them we find the sequential forward generation (SFG) and sequential backward generation (SBG), the plus-l-minus-r (also called plus l - take away r) [12], the Sequential Forward Floating Search (SFFS) and its backward counterpart SFBS [4]. The high trustworthiness and performance of SFFS has been ascertained experimentally (see e.g. [5], [6]).
3
Simulated Annealing
Simulated Annealing (SA) is a stochastic technique inspired by statistical mechanics for finding near globally optimum solutions to large (combinatorial) optimization problems. SA is a weak method in that it needs almost no information about the structure of the search space. The algorithm works by assuming that some part of the current solution belongs to a potentially better one, and thus this part 1
Such an optimal subset of features always exists but is not necessarily unique.
should be retained by exploring neighbors of the current solution. Assuming the objective function is to be minimized, SA can jump from hill to hill and escape or simply avoid sub-optimal solutions. When a system S (considered as a set of possible states) is in thermal equilibrium (at a given temperature T), the probability P_T(s) that it is in a certain state s depends on T and on the energy E(s) of s. This probability follows a Boltzmann distribution:

P_T(s) = \frac{\exp\big(-\frac{E(s)}{kT}\big)}{Z}, \quad \text{with} \quad Z = \sum_{s \in S} \exp\Big(-\frac{E(s)}{kT}\Big),
where k is the Boltzmann constant and Z acts as a normalization factor. Metropolis and his co-workers [7] developed a stochastic relaxation method that works by simulating the behavior of a system at a given temperature T. Being s the current state and s' a neighboring state, the probability of making a transition from s to s' is the ratio P_T(s \to s') between the probability of being in s' and the probability of being in s:

P_T(s \to s') = \frac{P_T(s')}{P_T(s)} = \exp\Big(-\frac{\Delta E}{kT}\Big),

where we have defined \Delta E = E(s') - E(s). Therefore, the acceptance or rejection of s' as the new state depends on the difference of the energies of both states at temperature T. If P_T(s') \ge P_T(s) then the "move" is always accepted. If P_T(s') < P_T(s) then it is accepted with probability P_T(s, s') < 1 (this situation corresponds to a transition to a higher-energy state). Note that this probability depends upon and decreases with the current temperature. In the end, there will be a temperature low enough (the freezing point), wherein these transitions will be very unlikely and the system will be considered frozen. In order to maximize the probability of finding states of minimal energy at every temperature, thermal equilibrium must be reached. The SA algorithm proposed in [8] consists in using the Metropolis idea at each temperature for a finite amount of time. In this algorithm the temperature is first set at an initially high value, spending enough time at it so as to approximate thermal equilibrium. Then a small decrement of the temperature is performed and the process is iterated until the system is considered frozen. If the cooling schedule is well designed, the final reached state may be considered a near-optimal solution.
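The Metropolis rule and the annealing loop described above can be summarized in a few lines; the following generic sketch of ours (for minimization, with k absorbed into the temperature scale) is illustrative and is not the TFS-specific variant introduced in the next section:

```python
import math
import random

def metropolis_accept(delta_e, temperature):
    """Metropolis criterion: always accept improvements, otherwise accept
    with probability exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return random.random() < math.exp(-delta_e / temperature)

def simulated_annealing(initial, energy, neighbor, t0=1.0, t_min=1e-4, c=0.9, steps=100):
    """Generic SA loop with a geometric cooling schedule t <- c*t."""
    state, t = initial, t0
    best, best_e = state, energy(state)
    while t > t_min:
        for _ in range(steps):                     # crude thermal-equilibrium phase
            candidate = neighbor(state)
            if metropolis_accept(energy(candidate) - energy(state), t):
                state = candidate
                if energy(state) < best_e:
                    best, best_e = state, energy(state)
        t *= c                                     # cooling step
    return best
```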
4
TFS: A Thermodynamic Feature Selection Algorithm
In this section we introduce TFS (Thermodynamic Feature Selection). Considering SA as a combinatorial optimization process [10], TFS finds a subset of attributes that maximizes the value of a given objective function. A special-purpose feature selection mechanism is embedded into the SA technique that takes advantage of the probabilistic acceptance capability of worse scenarios over a finite time. This characteristic is enhanced in TFS by the notion of an ε-improvement: a feature ε-improves a current solution if it has a higher value of the objective function or a value not worse than ε%. This mechanism is intended
to account for noise in the evaluation of the objective function due to finite sample sizes. The algorithm is also endowed with a feature search window (of size l) in the backward step, as follows. In forward steps, the best feature is always added (by looking at all possible additions). In backward steps this search is limited to l tries at random (without repetition). The value of l is incremented by one at every thermal equilibrium point. This mechanism is an additional source of non-determinism and a bias towards adding a feature only when it is the best option available. In contrast, to remove a feature, it suffices that its removal ε-improves the current solution. A direct consequence is of course a considerable speed-up of the algorithm. Note that the design of TFS is such that it grows more and more deterministic, informed and costly as it converges towards the final configuration.
Fig. 1. TFS algorithm for feature selection. For the sake of clarity, only those variables that are modified are passed as parameters to Forward and Backward.
The pseudo-code of TFS is depicted in Figs. 1 to 3. The algorithm consists of two major loops. The outer loop waits for the inner loop to finish and then updates the temperature according to the chosen cooling schedule. When the outer loop reaches Tmin , the algorithm halts. The algorithm keeps track of the best solution found (which is not necessarily the current one). The inner loop is the core of the algorithm and is composed of two interleaved procedures: Forward and Backward, that iterate until a thermal equilibrium point is found, represented by reaching the same solution before and after. These procedures work independently of each other, but share information about the results of their respective searches in the form of the current solution. Within them, feature selection takes place and the mechanism to escape from local minima starts
PROCEDURE Forward (var Z, JZ)
  Repeat
    x := argmax{ J(Z ∪ {xi}) }, xi ∈ Xn \ Z
    If ε-improves(Z, x, true) then accept := true
    else
      ΔJ := J(Z ∪ {x}) − J(Z)
      accept := rand(0, 1) < e^{ΔJ/t}
    endif
    If accept then Z := Z ∪ {x} endif
    If J(Z) > JZ then JZ := J(Z) endif
  Until not accept
END Forward
Fig. 2. TFS Forward procedure (note Z, JZ are modified)
working, as follows: these procedures iteratively add or remove features one at a time in such a way that an -improvement is accepted unconditionally, whereas a non -improvement is accepted probabilistically.
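To make the acceptance rule concrete, a minimal sketch in Python is given below. It assumes that the objective J takes strictly positive values (e.g., a classification accuracy); the function names are illustrative and not taken from the original implementation.

import math
import random

EPSILON = 0.01  # tolerance of the epsilon-improvement test

def eps_improves(j_current, j_candidate):
    # epsilon-improvement: the candidate is strictly better, or it is worse
    # by less than EPSILON in relative terms (cf. Fig. 3, right)
    if j_candidate > j_current:
        return True
    return (j_current - j_candidate) / j_current < EPSILON

def accept_move(j_current, j_candidate, temperature):
    # epsilon-improvements are accepted unconditionally; otherwise the move
    # is accepted probabilistically, as in simulated annealing
    if eps_improves(j_current, j_candidate):
        return True
    delta = j_candidate - j_current  # negative for a worse move
    return random.random() < math.exp(delta / temperature)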
5 An Experimental Study
In this section we report on empirical work. There are nine problems, taken mostly from the UCI repository and chosen to be a mixture of different real-life feature selection processes in classification tasks. In particular, full feature size
ranges from just a few (17) to dozens (65), sample size from a few dozen (86) to the thousands (3,175), and feature type is either continuous, categorical or binary. The problems have also been selected with a criterion in mind not very commonly found in similar experimental work, namely, that these problems are amenable to feature selection. By this it is meant that performance benefits clearly from a good selection process (and less clearly, or even worsens, with a bad one). Their main characteristics are summarized in Table 1.

Table 1. Problem characteristics. Instances is the number of instances, Features the number of features, Origin is real/artificial and Type is categorical/continuous/binary.

Name                 Instances  Features  Origin      Type
Breast Cancer (BC)   569        30        real        continuous
Ionosphere (IO)      351        34        real        continuous
Sonar (SO)           208        60        real        continuous
Mammogram (MA)       86         65        real        continuous
Kdnf (KD)            500        40        artificial  binary
Knum (KN)            500        30        artificial  binary
Hepatitis (HP)       129        17        real        categorical
Splice (SP)          3,175      60        real        categorical
Spectrum (SC)        187        22        real        categorical
5.1 Experimental Setup
Each data set was processed with both TFS and SFFS in wrapper mode [11], using the accuracy of several classifiers as the objective function: 1-Nearest Neighbor (1NN) for categorical data sets; 1-Nearest Neighbor, Linear and Quadratic discriminant analysis (LDA and QDA) for continuous data sets [12]. We chose these for being fast, parameter-free and not prone to overfitting (specifically, we discarded neural networks). Other classifiers (e.g. Naive Bayes or Logistic Regression) could be considered, this being a user-dependent choice. Decision trees are not recommended since they perform their own feature selection in addition to that done by the algorithm, which can hinder an accurate assessment of the results. The important point here is that the compared algorithms do not specifically depend on this choice. Despite taking more time, leave-one-out cross-validation is used to obtain reliable generalization estimates, since it is known to have low bias [9]. The TFS parameters are as follows: ε = 0.01, T0 = 0.1 and Tmin = 0.0001. These settings were chosen after some preliminary trials and are kept constant for all the problems. The cooling function was chosen to be geometric, α(t) = ct, taking c = 0.9, following recommendations in the literature [10]. Note that SFFS needs the specification of the desired size of the final solution [4], which also acts as a stopping criterion. This parameter is very difficult to estimate in practice. To overcome this problem, we let SFFS run over all possible sizes 1 . . . n, where n is the size of each data set. This is a way of getting the best performance out of this algorithm. In all cases, a limit of 100 hours of computing time was imposed on the executions.
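A minimal sketch of how such a wrapper objective J(Z) can be evaluated is given below; it assumes the scikit-learn library (not used in the original work) and a 1-NN classifier with leave-one-out cross-validation.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def wrapper_objective(X, y, subset):
    # J(Z): leave-one-out accuracy of a 1-NN classifier restricted to the
    # candidate feature subset Z (a collection of column indices)
    if len(subset) == 0:
        return 0.0
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                             X[:, sorted(subset)], y, cv=LeaveOneOut())
    return float(np.mean(scores))

The same function can be swapped for LDA or QDA accuracy on continuous data, since the compared search algorithms do not depend on the particular inducer.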
A Thermodynamical Search Algorithm for Feature Subset Selection
689
Table 2. Performance results. Jopt is the value of the objective function for the final solution and k its size. For categorical problems only 1NN is used. Note NR are the corresponding results with no reduction of features. The results for SFFS on the SP data set (marked with a star) were the best after 100 hours of computing time, when the algorithm was cut (this is the only occasion this happened).
        TFS                              NR                     SFFS
        1NN        LDA        QDA        1NN    LDA    QDA      1NN          LDA        QDA
        Jopt  k    Jopt  k    Jopt  k    Jopt   Jopt   Jopt     Jopt    k    Jopt  k    Jopt  k
BC      0.96  14   0.98  11   0.99  9    0.92   0.95   0.95     0.94    11   0.98  13   0.97  7
IO      0.95  12   0.90  13   0.95  8    0.87   0.85   0.87     0.93    13   0.89  6    0.94  14
SO      0.91  25   0.84  9    0.88  10   0.80   0.70   0.60     0.88    7    0.79  23   0.85  11
MA      0.91  6    0.99  10   0.94  6    0.70   0.71   0.62     0.84    16   0.93  6    0.93  5
SP      0.91  8    n/a   n/a  n/a   n/a  0.78   n/a    n/a      0.90*   6*   n/a   n/a  n/a   n/a
SC      0.74  7    n/a   n/a  n/a   n/a  0.64   n/a    n/a      0.73    9    n/a   n/a  n/a   n/a
KD      0.70  11   n/a   n/a  n/a   n/a  0.60   n/a    n/a      0.68    11   n/a   n/a  n/a   n/a
KN      0.86  14   n/a   n/a  n/a   n/a  0.69   n/a    n/a      0.85    12   n/a   n/a  n/a   n/a
HP      0.91  4    n/a   n/a  n/a   n/a  0.82   n/a    n/a      0.91    4    n/a   n/a  n/a   n/a
A number of questions were raised prior to the experiments:
1. Does the feature selection process help to find solutions of similar or better accuracy using lower numbers of features? Is there any systematic difference in performance for the various classifiers?
2. Does TFS find better solutions in terms of the objective function J? Does it find better solutions in terms of the size k?

5.2 Discussion of the Results
Performance results for TFS and SFFS are displayed in Table 2, including the results obtained with no feature selection of any kind, as a reference. In light of these results we can answer the previously raised questions:
1. The feature selection process indeed helps to find solutions of similar or better accuracy using (much) lower numbers of features. This is true for both algorithms and all of the three classifiers used. Regarding a systematic difference in performance across classifiers, the results are non-conclusive, as can reasonably be expected, this being problem-dependent in general [11].
2. In terms of the best value of the objective function, TFS outperforms SFFS in all the tasks, no matter the classifier, except in HP for 1NN, where there is a tie. The difference is sometimes substantial (more than 10%). Recall that SFFS was executed for all possible final size values and the figure reported is the best overall. In this sense SFFS never came across the subsets obtained by TFS in the search process (otherwise they would have been recorded). It is hence conjectured that TFS has better access to hidden good subsets than SFFS does. In terms of the final subset size, the results are quite interesting.
Both TFS and SFFS find solutions of smaller size using QDA, then using LDA and finally using 1NN, which can be checked by calculating the column totals. TFS gives priority to the optimization of the objective function, without any specific concern for reducing the final size. However, it finds very competitive solutions both in terms of accuracy and size. This fact should not be overlooked, since it is by no means clear that solutions of bigger size should have a better value of the objective function and vice versa. In order to avoid the danger of losing relevant information, it is often better to accept solutions of somewhat bigger size if these entail a significantly higher value of the objective function. Finally, it could be argued that the conclusion that TFS consistently yields better final objective function values is overly optimistic: possibly the results for other sizes are consistently equal or even worse. To check whether this is the case, we show the entire solution path for one of the runs, representative of the experimental behaviour found (Fig. 4). It is seen that for (almost) all sizes, TFS offers better accuracy, especially at the lowest values.
Fig. 4. Comparative full performance for all k with QDA on the BC data set. The best solutions (those shown in Table 2) are marked by a bigger circle or square.
5.3 Computational Cost
The computational cost of the performed experiments is displayed in Table 3. It is seen that SFFS takes less time for smaller sized problems (e.g. HP, BC, IO and KN), whereas TFS is more efficient for medium to bigger sized ones (e.g. SO, MA, KD and SP), where time becomes more and more of an issue. In this vein, the two-phase interleaved mechanism for forward and backward exploration carries out a good neighborhood exploration, thereby contributing to a fast relaxation of the algorithm. Further, the setup of the algorithm is easier than in a conventional SA, since the time spent at each temperature is automatically and dynamically set. In our experiments this mechanism did not lead to stagnation of the process. The last analyzed issue concerns the distribution of features as selected by TFS. In order to determine this, a perfectly known data set is needed. An artificial problem f has been explicitly designed, as follows: letting x1 , . . . , xn be the relevant features, f (x1 , · · · , xn ) = 1 if the majority of the xi are equal to 1, and 0 otherwise. Next, completely irrelevant features and redundant features (taken as copies of
Table 3. Computational costs: Jeval is the number of times the function J is called and Time is the total computing time (in hours)
Fig. 5. Distribution of features for the Majority data set as selected by TFS
relevant ones) are added. Ten different data samples of 1000 examples each are generated with n = 21. The truth about this data set is: 8 relevant features (1-8), 8 irrelevant (9-16) and 5 redundant (17-21). We were interested in analyzing the frequency distribution of the features selected by TFS according to their type. The results are shown in figure 5: it is remarkable that TFS gives priority to all the relevant features and rejects all the irrelevant ones. Redundant features are sometimes allowed, compensating for the absence of some relevant ones. Average performance as given by 1NN is 0.95. This figure should be compared to the performance of 0.77 obtained with the full feature set.
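As an aside, this artificial problem is easy to reproduce. A minimal generation sketch in Python is shown below; the majority threshold (strictly more than 4 of the 8 relevant bits) and the random seed are assumptions, since the text does not state how ties are broken.

import numpy as np

def make_majority_data(n_examples=1000, seed=0):
    rng = np.random.default_rng(seed)
    relevant = rng.integers(0, 2, size=(n_examples, 8))    # features 1-8
    irrelevant = rng.integers(0, 2, size=(n_examples, 8))  # features 9-16
    redundant = relevant[:, :5]                            # features 17-21, copies
    X = np.hstack([relevant, irrelevant, redundant])       # n = 21 features
    y = (relevant.sum(axis=1) > 4).astype(int)             # majority of relevant bits
    return X, y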
6 Conclusions
An algorithm for feature selection based on simulated annealing has been introduced. A notable characteristic over other search algorithms for this task is its capability to accept momentarily worse solutions. The algorithm has been evaluated against the Sequential Forward Floating Search (SFFS), one of the most reliable algorithms for moderate-sized problems. In comparative results with SFFS (and using a number of inducers as wrappers) the feature selection process shows remarkable results, superior in all cases to the full feature set and
substantially better than those achieved by SFFS alone. The proposed algorithm finds higher-evaluating solutions, whether their size is bigger or smaller than that found by SFFS, and offers a solid and reliable framework for feature subset selection tasks. As future work, we plan to use the concept of thermostatistical persistency [13] to improve the algorithm while reducing its computational cost.
References
1. Yang, J., Honavar, V.: Feature Subset Selection Using a Genetic Algorithm. In: Motoda, H., Liu, H. (eds.) Feature Extraction, Construction, and Subset Selection: A Data Mining Perspective. Kluwer, New York (1998)
2. Schlapbach, A., Kilchherr, V., Bunke, H.: Improving Writer Identification by Means of Feature Selection and Extraction. In: 8th Int. Conf. on Document Analysis and Recognition, pp. 131–135 (2005)
3. Debuse, J.C., Rayward-Smith, V.: Feature Subset Selection within a Simulated Annealing Data Mining Algorithm. J. of Intel. Inform. Systems 9, 57–81 (1997)
4. Pudil, P., Ferri, F., Novovicova, J., Kittler, J.: Floating search methods for feature selection. Pattern Recognition Letters 15(11), 1119–1125 (1994)
5. Jain, A., Zongker, D.: Feature selection: Evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Mach. Intell. 19(2), 153–158 (1997)
6. Kudo, M., Somol, P., Pudil, P., Shimbo, M., Sklansky, J.: Comparison of classifier-specific feature selection algorithms. In: Procs. of the Joint IAPR Intl. Workshop on Advances in Pattern Recognition, pp. 677–686 (2000)
7. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E.: Equations of state calculations by fast computing machines. J. of Chem. Phys. 21 (1953)
8. Kirkpatrick, S.: Optimization by simulated annealing: Quantitative studies. Journal of Statistical Physics 34 (1984)
9. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1996)
10. Reeves, C.R.: Modern Heuristic Techniques for Combinatorial Problems. McGraw-Hill, New York (1995)
11. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2), 273–324 (1997)
12. Duda, R.O., Hart, P., Stork, G.: Pattern Classification. Wiley & Sons, Chichester (2001)
13. Chardaire, P., Lutton, J.L., Sutter, A.: Thermostatistical persistency: A powerful improving concept for simulated annealing algorithms. European Journal of Operational Research 86(3), 565–579 (1995)
Solvable Performances of Optimization Neural Networks with Chaotic Noise and Stochastic Noise with Negative Autocorrelation

Mikio Hasegawa1 and Ken Umeno2,3

1 Tokyo University of Science, Tokyo 102-0073, Japan
2 ChaosWare Inc., Koganei-shi, 183-8795, Japan
3 NICT, Koganei-shi, 183-8795, Japan
Abstract. By adding chaotic sequences to a neural network that solves combinatorial optimization problems, its performance improves much more than when random number sequences are added. A previous study showed that a specific autocorrelation of the chaotic noise has a positive effect on this high performance. The autocorrelation of such effective chaotic noise takes a negative value at lag 1, and decreases with damped oscillation as the lag increases. In this paper, we generate a stochastic noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, similar to effective chaotic noise, and evaluate the performance of the neural network with such stochastic noise. First, we show that the appropriate amplitude value of the additive noise changes depending on the negative autocorrelation parameter r. We also show that the performance with negative autocorrelation noise is better than those with white Gaussian noise and positive autocorrelation noise, and almost the same as that of the chaotic noise. Based on such results, it can be concluded that the high solvable performance of the additive chaotic noise is due to its negative autocorrelation.
1 Introduction
Chaotic dynamics have been shown to be effective for combinatorial optimization problems in many studies [1,2,3,4,5,6,7,8,9]. In one of those approaches, which adds chaotic sequences to each neuron of a mutually connected neural network solving an optimization problem in order to avoid the local minimum problem, it has been clearly shown that chaotic noise is more effective than random noise for improving the performance [8,9]. For realizing higher performance using chaotic dynamics, it has also been utilized in optimization methods with heuristic algorithms which are applicable to large-scale problems [3], and this approach has been shown to be more effective than tabu search and simulated annealing [5,6]. In experimental analyses of the chaotic search, several factors enabling high performance were found. One is that chaotic dynamics close to the edge of chaos yields high performance [7]. The second is that a specific autocorrelation function of the chaotic dynamics has a positive effect [9], in the approach that adds the chaotic sequences as additive noise to a mutually connected neural network.
As another application of chaotic dynamics, it has been applied to CDMA communication systems [10,11,12]. For the spreading code, the cross correlation between the code sequences should be as small as possible to reduce the bit error rate caused by mutual interference. In the chaotic CDMA approach, it has been shown that chaotic sequences, which have a negative autocorrelation at lag 1, make the cross correlation smaller [11]. For generating such an optimum code sequence whose autocorrelation is C(τ) ≈ C × (−r)^τ, an FIR filter has also been proposed, and the advantages of such code sequences have been experimentally shown [12]. In this paper, we analyze the effects of chaotic noise on combinatorial optimization by introducing stochastic noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, the same as the sequences utilized in chaotic CDMA [12], because the effective chaotic noise shown in Ref. [9] has a similar autocorrelation. We evaluate their performance while changing the variance and parameters of the chaotic and stochastic noise, and investigate what makes chaotic noise effective for combinatorial optimization problems.
2 Optimization by Neural Networks with Chaotic Noise
We introduce Traveling Salesman Problems (TSPs) and Quadratic Assignment Problems (QAPs) to evaluate the performance of each type of noise. In this paper, the Hopfield-Tank neural network [13] is the basis of the solution search. It is well known that the energy function

E(t) = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{l=1}^{n} w_{ikjl}\, x_{ik}(t)\, x_{jl}(t) + \sum_{i=1}^{n} \sum_{k=1}^{n} \theta_{ik}\, x_{ik}(t)    (1)

is always decreasing when the neurons are updated by the following equation,

x_{ik}(t+1) = f\Big[ \sum_{j=1}^{n} \sum_{l=1}^{n} w_{ikjl}\, x_{jl}(t) + \theta_{ik} \Big],    (2)
where x_{ik}(t) is the output of the (i, k)th neuron at time t, w_{ikjl} is the connection weight between the (i, k)th and (j, l)th neurons, and θ_{ik} is the threshold of the (i, k)th neuron. Because the original Hopfield-Tank neural network stops searching at a local minimum, chaotic noise and other random dynamics have been added to the neurons to avoid trapping at such undesirable states and to achieve much higher performance. As in many conventional studies, the energy function for solving the TSPs [13,1,2,8,9] can be defined by the following equation,

E_{tsp} = A\Big[ \sum_{i=1}^{N} \Big( \sum_{k=1}^{N} x_{ik}(t) - 1 \Big)^2 + \sum_{k=1}^{N} \Big( \sum_{i=1}^{N} x_{ik}(t) - 1 \Big)^2 \Big] + B \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} d_{ij}\, x_{ik}(t) \{ x_{jk+1}(t) + x_{jk-1}(t) \},    (3)
Fig. 1. Solvable performance of chaotic noise and white Gaussian noise on a 20-city TSP
where N is the number of cities, d_{ij} is the distance between cities i and j, and A and B are the weights of the constraint term (formation of a closed tour) and the objective term (minimization of total tour length), respectively. From Eqs. (1) and (3), the connection weights w_{ijkl} and the thresholds θ_{ij} can be obtained as follows,

w_{ijkl} = -A\{ \delta_{ij}(1 - \delta_{kl}) + \delta_{kl}(1 - \delta_{ij}) \} - B d_{ij} (\delta_{l,k+1} + \delta_{l,k-1}),    (4)

\theta_{ij} = 2A,    (5)

where δ_{ij} = 1 if i = j, and δ_{ij} = 0 otherwise. For the QAPs, whose objective function is

F(p) = \sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij}\, b_{p(i)p(j)},    (6)

we use the following energy function,

E_{qap} = A\Big[ \sum_{i=1}^{N} \Big( \sum_{k=1}^{N} x_{ik}(t) - 1 \Big)^2 + \sum_{k=1}^{N} \Big( \sum_{i=1}^{N} x_{ik}(t) - 1 \Big)^2 \Big] + B \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{N} \sum_{l=1}^{N} a_{ij}\, b_{kl}\, x_{ik}(t)\, x_{jl}(t).    (7)

From Eqs. (1) and (7), the connection weights and the thresholds for the QAPs are obtained as follows,

w_{ijkl} = -A\{ \delta_{ij}(1 - \delta_{kl}) + \delta_{kl}(1 - \delta_{ij}) \} - B a_{ij} b_{kl},    (8)

\theta_{ij} = 2A.    (9)
Fig. 2. Solvable performance of chaotic noise and white Gaussian noise on a QAP with 12 nodes (Nug12)
In order to introduce additive noise into the update equation of each neuron, we use the following equation,

x_{ik}(t+1) = f\Big[ \sum_{j=1}^{N} \sum_{l=1}^{N} w_{ikjl}\, x_{jl}(t) + \theta_{ik} + \beta z_{ik}(t) \Big],    (10)

where z_{ik}(t) is a noise sequence added to the (i, k)th neuron, β is the amplitude of the noise, and f is the sigmoidal output function, f(y) = 1/(1 + exp(−y/ε)). The noise sequence introduced as z_{ik}(t) is normalized to have zero mean and unit variance. In Figs. 1 and 2, we compare the solvable performances of the above neural networks with the logistic-map chaos, z^c_{ik}(t+1) = a z^c_{ik}(t)(1 − z^c_{ik}(t)), and with white Gaussian noise as the additive noise sequence z_{ik}(t). For the chaotic noise, we used 3.82, 3.92, and 3.95 for the parameter a. The abscissa is the amplitude of the noise, β in Eq. (10). The solvable performance on the ordinate is defined as the percentage of runs obtaining the optimum solution among 1000 runs with different initial conditions; a successful run is one that hits the optimum solution state at least once within a fixed number of iterations. For the problems introduced in this paper, the exact optimum solution is known for each, obtained by an algorithm which guarantees exactness of the solution but requires a much larger amount of computational time. In this paper, the solvable performances of each type of noise are evaluated using a small and fixed computational time. We set the cutoff time longer for the QAP because it is more difficult than the TSP and includes it as a sub-problem. The cutoff times for each run are set to 1024 iterations for the TSP and 4096 iterations for the QAP, respectively. The parameters of the neural network are A = 1, B = 1, ε = 0.3 for the TSP, and A = 0.35, B = 0.2, ε = 0.075 for the QAP, respectively. The problems introduced in this paper are a 20-city TSP from [9] and a QAP with 12 nodes, Nug12 in QAPLIB [14].
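A minimal sketch of this noisy update in Python/NumPy is given below. The flattened weight matrix W (acting on the N² neuron outputs as a single vector), the synchronous sweep, and the default parameter values are illustrative assumptions rather than details taken from the original implementation.

import numpy as np

def logistic_noise(n_neurons, steps, a=3.92, seed=0):
    # logistic-map sequences z(t+1) = a z(t)(1 - z(t)), one per neuron,
    # normalized afterwards to zero mean and unit variance
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.01, 0.99, size=n_neurons)
    seq = np.empty((steps, n_neurons))
    for t in range(steps):
        z = a * z * (1.0 - z)
        seq[t] = z
    return (seq - seq.mean(axis=0)) / seq.std(axis=0)

def noisy_update(x, W, theta, z_t, beta=0.3, eps=0.3):
    # one sweep of Eq. (10): sigmoidal neurons driven by the recurrent input,
    # the threshold, and the additive noise of amplitude beta (eps = 0.3 as for the TSP)
    u = W @ x + theta + beta * z_t
    return 1.0 / (1.0 + np.exp(-u / eps))

For a TSP with N cities, x is the flattened vector of the N² neuron outputs, and W and theta would be built from Eqs. (4) and (5).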
Fig. 3. Solvable performance of logistic map chaos with different noise amplitudes β (top to bottom: β = 0.325, 0.375, 0.425) on a 20-city TSP
The results in both Figs. 1 and 2 show that the neural network with chaotic noise performs better than that with white Gaussian noise, when comparing the best solvable performance of each noise. From the results, it can also be seen that the best value of the noise amplitude for the highest solvable performance differs among the noise sequences. This is not due to a difference in the variances of the original noise sequences, because each sequence is normalized before being added to the neurons as z_{ik}(t), as described above. Figure 3 shows the solvable performance of the neural network with the logistic-map chaos for different noise amplitudes β. From the results, it can be seen that the value of the parameter a giving high performance changes depending on the noise amplitude. It is obvious that the temporal structure of the chaotic time series changes depending on a. Therefore, both the amplitude and the autocorrelation should be investigated at the same time to analyze the effects of the additive noise.
Fig. 4. Autocorrelation coefficients of chaotic noise that has high solvable performance
3 Negative Autocorrelation Noise
The autocorrelation coefficients of the chaotic sequences used for the results of Figs. 1 and 2 are shown in Fig. 4. The figure shows that the autocorrelation of the effective chaotic sequences has a negative value at lag 1. In other research using chaos, chaotic CDMA utilizes similar sequences whose autocorrelation is C(τ) ≈ C × (−r)^τ [12]. Such an autocorrelation has been shown to lead to small cross-correlation among the sequences; therefore, chaotic codes having such an autocorrelation are effective for reducing bit error rates caused by mutual interference. In this paper, we introduce stochastic noise having such an autocorrelation, C(τ) ≈ C × (−r)^τ, to evaluate the effects of the negative autocorrelation on the solvable performance of the neural networks, because it is similar to the effective autocorrelation function of the chaotic noise shown in Fig. 4 and makes it possible to analyze only the effect of such a negative, damped oscillation in the autocorrelation. We generate such a stochastic sequence whose autocorrelation is C(τ) by the following procedure. First, a target autocorrelation sequence C(τ) is generated. By applying the FFT to C(τ), its power spectrum is obtained. Only the phase of the spectrum is randomized, without changing the power spectrum. Then, a stochastic sequence whose autocorrelation is C(τ) is generated by applying the IFFT to the phase-randomized spectrum. When we apply the generated noise sequence to the neural network as z_{ik}(t) in Eq. (10), the sequence is normalized to zero mean and unit variance, as already described. We evaluate the solvable performance of such stochastic noise by changing the autocorrelation parameter r and the noise amplitude β. Figures 5 and 6 show the results on the TSP and the QAP solved by the neural networks with the stochastic noise whose autocorrelation is C(τ) ≈ C × (−r)^τ. The upper figures (1) show the solvable performance as both the noise amplitude β and the negative correlation parameter r are changed, and the lower figures (2) show another view from the r axis direction.
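A minimal NumPy sketch of this generation procedure is shown below. Using the square root of the target power spectrum (so that the synthesized sequence itself, rather than its autocorrelation, carries the target spectrum) and making the target circularly symmetric are implementation assumptions; the function name is illustrative.

import numpy as np

def negative_autocorr_noise(length, r, C=1.0, seed=0):
    # target autocorrelation C(tau) = C * (-r)**tau, made circularly symmetric
    rng = np.random.default_rng(seed)
    lag = np.arange(length)
    lag = np.minimum(lag, length - lag)
    target = C * (-r) ** lag
    power = np.abs(np.fft.fft(target))             # power spectrum of the target
    phase = rng.uniform(0.0, 2.0 * np.pi, length)  # randomize only the phase
    z = np.real(np.fft.ifft(np.sqrt(power) * np.exp(1j * phase)))
    return (z - z.mean()) / z.std()                # zero mean, unit variance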
Fig. 5. Solvable performance of a neural network with a stochastic additive noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, on a 20-city TSP, with changing noise amplitude β and negative autocorrelation parameter r
Figures 5(1) and 6(1) demonstrate that the best amplitude β for the best solvable performance varies depending on the negative autocorrelation parameter r; the best β increases as r increases. In conventional stochastic resonance, only the amplitude of the noise has been considered important for the phenomenon, and temporal structure such as the autocorrelation has not received much attention. However, the results shown in Figs. 5(1) and 6(1) suggest that the temporal structure of the additive noise is also important when analyzing the effect of the additive noise. Figures 5(2) and 6(2) clearly demonstrate that positive r (negative autocorrelation) induces higher performance: negative autocorrelation with oscillation, corresponding to r > 0, yields higher performance than white Gaussian noise, corresponding to r = 0, and positive autocorrelation noise, corresponding to r < 0. Comparing the best results of Figs. 5 and 6 with the results in Figs. 1 and 2, we can see that the stochastic noise with negative autocorrelation
Fig. 6. Solvable performance of a neural network with a stochastic additive noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, on a QAP with 12 nodes (Nug12), with changing noise amplitude β and negative autocorrelation parameter r
has almost the same high performance as the chaotic noise. As shown in Fig. 4, the chaotic noise also has a negative autocorrelation with damped oscillation, similar to the introduced stochastic noise with C(τ) ≈ C × (−r)^τ. Based on these results, we conclude that the effect of the chaotic dynamics used as additive sequences in the optimization neural network owes to its negative autocorrelation.
4 Conclusion
Chaotic noise has been shown to be more effective than white Gaussian noise in conventional chaotic-noise approaches to combinatorial optimization problems; we obtained results confirming this in Figs. 1 and 2. In previous research, specific characteristics of the autocorrelation function of chaotic noise were shown to be effective by using the method of surrogate data [9].
In order to investigate the more essential effect of the additive noise, we introduced stochastic noise whose autocorrelation is C(τ) ≈ C × (−r)^τ, because the chaotic sequences effective for optimization have a similar autocorrelation with negative, damped oscillation. Using such noise, we showed that the solvable performance depends not only on the noise amplitude but also on the negative autocorrelation parameter r: the best amplitude of the additive noise varies depending on the autocorrelation of the noise. We also showed that noise whose autocorrelation takes negative values with damped oscillation is more effective than white Gaussian noise and positive autocorrelation noise. Moreover, such stochastic noise leads to almost the same high solvable performance as the chaotic noise. From these results, we conclude that the essential effect of the specific autocorrelation of chaotic noise shown in Ref. [9] is this negative autocorrelation with damped oscillation. In chaotic CDMA, such noise sequences, whose autocorrelation takes negative values with damped oscillation, have been shown to be effective for minimizing the cross-correlation among the sequences. This paper showed that adding such a set of noise sequences, whose cross-correlation is low, leads to high performance of the neural network. Clarifying how this cross-correlation property of the noise relates to the solvable performance is important future work. In this paper, we dealt with just one approach to chaotic optimization, which is based on mutually connected neural networks. However, this approach has poor performance compared to a heuristic approach using chaos [3,6,5]. Therefore, it is also important future work to create a new chaotic algorithm in a heuristic approach, using the obtained results about the effective features of chaotic dynamics.
References
1. Nozawa, H.: A neural network model as a globally coupled map and applications based on chaos. Chaos 2, 377–386 (1992)
2. Chen, L., Aihara, K.: Chaotic Simulated Annealing by a Neural Network Model with Transient Chaos. Neural Networks 8(6), 915–930 (1995)
3. Hasegawa, M., Ikeguchi, T., Aihara, K.: Combination of Chaotic Neurodynamics with the 2-opt Algorithm to Solve Traveling Salesman Problems. Physical Review Letters 79(12), 2344–2347 (1997)
4. Ishii, S., Niitsuma, H.: Lambda-opt neural approaches to quadratic assignment problems. Neural Computation 12(9), 2209–2225 (2000)
5. Hasegawa, M., Ikeguchi, T., Aihara, K., Itoh, K.: A Novel Chaotic Search for Quadratic Assignment Problems. European Journal of Operational Research 139, 543–556 (2002)
6. Hasegawa, M., Ikeguchi, T., Aihara, K.: Solving Large Scale Traveling Salesman Problems by Chaotic Neurodynamics. Neural Networks 15, 271–283 (2002)
7. Hasegawa, M., Ikeguchi, T., Matozaki, T., Aihara, K.: Solving Combinatorial Optimization Problems by Nonlinear Neural Dynamics. In: Proc. of IEEE International Conference on Neural Networks, pp. 3140–3145 (1995)
8. Hayakawa, Y., Marumoto, A., Sawada, Y.: Effects of the Chaotic Noise on the Performance of a Neural Network Model for Optimization Problems. Physical Review E 51(4), 2693–2696 (1995)
9. Hasegawa, M., Ikeguchi, T., Matozaki, T., Aihara, K.: An Analysis on Additive Effects of Nonlinear Dynamics for Combinatorial Optimization. IEICE Trans. Fundamentals E80-A(1), 206–212 (1997)
10. Umeno, K., Kitayama, K.: Spreading Sequences using Periodic Orbits of Chaos for CDMA. Electronics Letters 35(7), 545–546 (1999)
11. Mazzini, G., Rovatti, R., Setti, G.: Interference minimization by autocorrelation shaping in asynchronous DS-CDMA systems: chaos-based spreading is nearly optimal. Electronics Letters 35(13), 1054–1055 (1999)
12. Umeno, K., Yamaguchi, A.: Construction of Optimal Chaotic Spreading Sequence Using Lebesgue Spectrum Filter. IEICE Trans. Fundamentals E85-A(4), 849–852 (2002)
13. Hopfield, J., Tank, D.: Neural Computation of Decision in Optimization Problems. Biological Cybernetics 52, 141–152 (1985)
14. Burkard, R., Karish, S., Rendl, F.: QAPLIB – A Quadratic Assignment Problem Library. Journal of Global Optimization 10, 391–403 (1997), http://www.seas.upenn.edu/qaplib
Solving the k-Winners-Take-All Problem and the Oligopoly Cournot-Nash Equilibrium Problem Using the General Projection Neural Networks

Xiaolin Hu1 and Jun Wang2

1 Tsinghua National Lab of Information Science & Technology, State Key Lab of Intelligent Technology & Systems, and Department of Computer Science & Technology, Tsinghua University, Beijing 100084, China
[email protected]
2 Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, China
[email protected]

Abstract. The k-winners-take-all (k-WTA) problem is to select the k largest inputs from a set of inputs in a network, which has many applications in machine learning. The Cournot-Nash equilibrium is an important problem in economic models. The two problems can be formulated as linear variational inequalities (LVIs). In this paper, a linear case of the general projection neural network (GPNN) is applied for solving the resulting LVIs, and consequently the two practical problems. Compared with existing recurrent neural networks capable of solving these problems, the designed GPNN is superior in its stability results and architecture complexity.
1 Introduction
Following the seminal work of Hopfield and Tank [1], numerous neural network models have been developed for solving optimization problems, from the earlier proposals, such as the neural network proposed by Kennedy and Chua [2] based on the penalty method and the switched-capacitor neural network by Rodríguez-Vázquez et al. [3], to the latest developments by Xia and Wang et al. [5,8]. In recent years, new research in this direction focuses on solving variational inequalities (VIs), a problem closely related to optimization problems, which itself has many applications in a variety of disciplines (see, e.g., [13]). Regarding solving VIs, a recurrent neural network called the projection neural network has been developed (see [5,8] and references therein). An extension of this neural network is presented in [6], termed the general projection neural network (GPNN), which is primarily designed for solving general variational inequalities (GVIs) with bound constraints on the variables. Based on that work, it is found that the GPNN with proper formulations is capable of solving a linear case of VI (for short, LVI) with linear constraints. In view of this observation, we then apply the formulated GPNNs for solving two typical LVI problems, the k-WTA problem and the oligopoly Cournot-Nash equilibrium problem, whose descriptions can be found in Section 3 and Section 4, respectively.
2 Neural Network Model
Consider the following linear variational inequality (LVI): find x* ∈ Ω such that

(M x^* + p)^T (x - x^*) \ge 0, \quad \forall x \in \Omega,    (1)

where M ∈ R^{n×n}, p ∈ R^n and

\Omega = \{ x \in R^n \mid Ax \in Y,\ Bx = c,\ x \in X \}    (2)

with A ∈ R^{m×n}, B ∈ R^{r×n}, c ∈ R^r. In the above, X and Y are two box sets defined as X = {x ∈ R^n | \underline{x} ≤ x ≤ \bar{x}}, Y = {y ∈ R^m | \underline{y} ≤ y ≤ \bar{y}}, where \underline{x}, \bar{x} ∈ R^n and \underline{y}, \bar{y} ∈ R^m. Note that any component of \underline{x}, \underline{y} may be set to −∞ and any component of \bar{x}, \bar{y} may be set to ∞. Without loss of generality, we assume \underline{y} < \bar{y}, since if \underline{y}_i = \bar{y}_i for some i, then the corresponding inequality constraint can be incorporated into Bx = c. Denote by Ω* the solution set of (1). Throughout the paper, it is assumed that Ω* ≠ ∅. Moreover, we assume Rank(B) = r in (2), which is always true for a well-posed problem. The above LVI is closely related to the following quadratic programming problem:

minimize (1/2) x^T M x + p^T x
subject to x ∈ Ω    (3)

where the parameters are defined the same as in (1). It is well known (e.g., [13]) that if M is symmetric and positive semi-definite, the two problems are actually equivalent. If M = 0, the above problem degenerates to a linear programming problem. Since Rank(B) = r, without loss of generality, B can be partitioned as [B_I, B_{II}], where B_I ∈ R^{r×r}, B_{II} ∈ R^{r×(n−r)} and det(B_I) ≠ 0. Then Bx = c can be decomposed into

(B_I, B_{II}) \begin{pmatrix} x_I \\ x_{II} \end{pmatrix} = c,

where x_I ∈ R^r, x_{II} ∈ R^{n−r}, which yields x_I = −B_I^{−1} B_{II} x_{II} + B_I^{−1} c and

x = Q x_{II} + q,    (4)

where

Q = \begin{pmatrix} -B_I^{-1} B_{II} \\ I \end{pmatrix} \in R^{n \times (n-r)}, \quad q = \begin{pmatrix} B_I^{-1} c \\ O \end{pmatrix} \in R^n.    (5)

Similarly, we have x* = Q x*_{II} + q. Substituting x and x* into (1) gives a new LVI. Let

u = x_{II}, \quad \bar{M} = Q^T M Q, \quad \bar{p} = Q^T M q + Q^T p, \quad \bar{A} = \begin{pmatrix} AQ \\ -B_I^{-1} B_{II} \end{pmatrix},

V = \Big\{ v \in R^{m+r} \;\Big|\; \begin{pmatrix} \underline{y} - Aq \\ \underline{x}_I - B_I^{-1} c \end{pmatrix} \le v \le \begin{pmatrix} \bar{y} - Aq \\ \bar{x}_I - B_I^{-1} c \end{pmatrix} \Big\}, \quad U = \{ u \in R^{n-r} \mid \underline{x}_{II} \le u \le \bar{x}_{II} \}
where \underline{x}_I = (\underline{x}_1, \cdots, \underline{x}_r)^T, \underline{x}_{II} = (\underline{x}_{r+1}, \cdots, \underline{x}_n)^T, \bar{x}_I = (\bar{x}_1, \cdots, \bar{x}_r)^T, \bar{x}_{II} = (\bar{x}_{r+1}, \cdots, \bar{x}_n)^T. Then the new LVI can be rewritten in the following compact form

(\bar{M} u^* + \bar{p})^T (u - u^*) \ge 0, \quad \forall u \in \bar{\Omega},    (6)

where

\bar{\Omega} = \{ u \in R^{n-r} \mid \bar{A} u \in V,\ u \in U \}.

Compared with the original LVI (1), the new formulation (6) has fewer variables, while the equality constraints are absorbed. By following techniques similar to those in [10], it is not difficult to obtain the following theorem.

Theorem 1. u* ∈ R^{n−r} is a solution of (6) if and only if there exists v* ∈ R^{m+r} such that w* = ((u*)^T, (v*)^T)^T satisfies \tilde{N} w^* \in W and

(\tilde{M} w^* + \tilde{p})^T (w - \tilde{N} w^*) \ge 0, \quad \forall w \in W,    (7)

where

\tilde{M} = \begin{pmatrix} \bar{M} & -\bar{A}^T \\ O & I \end{pmatrix}, \quad \tilde{N} = \begin{pmatrix} I & O \\ \bar{A} & O \end{pmatrix}, \quad \tilde{p} = \begin{pmatrix} \bar{p} \\ O \end{pmatrix}, \quad W = U \times V.
(8a) (8b)
where λ > 0 is a scalar, Q, q are defined in (5) and PW (·) is a standard projection operator [5, 8]. The architecture of the neural network can be drawn in a similar fashion in [6], and is omitted here for brevity. ¯ ≥ 0, M ˜ +N ˜ is nonsingular, and M ˜ TN ˜ is positive semiClearly, when M definite. Then we have the following theorem which follows from [6, Corollary4] directly. Theorem 2. Consider neural network (8) for solving LVI (6) with Ω defined in (2). ¯ ≥ 0, then the state of the neural network is stable in the sense of (i) If M Lyapunov and globally convergent to an equilibrium, which corresponds to an exact solution of the LVI by the output equation. ¯ > 0, then the output x(t) of the neural network is globally (ii) Furthermore, if M asymptotically stable at the unique solution x∗ of the LVI. Other neural networks existing in the literature can be also adopted to solve (6). In Table 1 we compare the proposed GPNN with several salient neural network models in terms of the stability conditions and the number of neurons. It is seen
706
X. Hu and J. Wang
Table 1. Comparison with Several Salient Neural Networks for Solving LVI (6) Refs. [7] Ref. [4] Ref. [11] Present paper ¯ >0 M ¯ ≥ 0, M ¯T = M ¯ M ¯ > 0, M ¯T = M ¯ ¯ ≥0 Conditions M M Neurons n + 2m + r n + 2m + r n+m n+m
that the GPNN requires the weakest conditions and fewest neurons. Note that fewer neurons imply fewer connective weights in the network, and consequently lower structural complexity of the network.
3
k-Winners-Take-All Network
Consider the following k-winners-take-all (k-WTA) problem xi = f (σi ) =
1, if σi ∈ {k largest elements of σ} 0, otherwise,
where σ ∈ n stands for the input vector and x ∈ {0, 1}n stands for the output vector. The k-WTA operation accomplishes a task of selecting k largest inputs from n inputs in a network. It has been shown to be a computationally powerful operation compared with standard neural network models with threshold logic gates [14]. In addition, it has important applications in machine learning as well, such as k-neighborhood classification, k-means clustering. Many attempts have been made to design neural networks to perform the k-WTA operation [15, 16, 17]. Recently, the problem was formulated into a quadratic programming problem and a simplified dual neural network was developed to k-WTA with n neurons [11]. The limitation of that approach lies in the finite resolution. Specifically, when the k-th largest input σk is equal to the (k + 1)-th largest input, the network cannot be applied. The k-WTA problem is equivalent to the following integer programming problem minimize −σ T x (9) subject to Bx = k, x ∈ {0, 1}n , where B = (1, · · · , 1). Suppose that σk = σk+1 , similar to the arguments given in [11], the k-WTA problem can be reasoned equivalent to the following linear programming problem minimize −σ T x subject to Bx = k, x ∈ [0, 1]n .
(10)
Then neural network (8) with M = 0, p = −σ, c = k, x = 0, x = 1 can be used to solve the problem. Explicitly, the neural network can be written in the following component form
Solving the k-WTA Problem
707
– State equation ⎧ ⎫ n−1 n−1 ⎨ ⎬ dui = λ −ui + uj + PUi (ui − v − σ1 + σi+1 ) + PV (− uj − v) ⎩ ⎭ dt j=1
j=1
∀i = 1, · · · , n − 1, ⎧ ⎫ n−1 n−1 ⎨ n−1 ⎬ dv =λ 2 uj − PUj (uj − v − σ1 + σj+1 ) + PV (− uj − v) , ⎩ ⎭ dt j=1 j=1 j=1 – Output equation x1 = −
n−1
uj + k
j=1
xi+1 = ui
∀i = 1, · · · , n − 1,
where U = {u ∈ n−1 |0 ≤ u ≤ 1}, V = {v ∈ | − k ≤ v ≤ 1 − k}. It is seen that no parameter is needed to choose in contrast to [11]. Next, we consider another scenario. Suppose that there are totally s identical inputs and some of them but not all should be selected as “winners”. Suppose r inputs among them are needed to be selected, where r < s. Then the proposed neural network will definitely output k − r 1’s and n − (k − l) − s 0’s which correspond to σ1 ≥ σ2 ≥ · · · ≥ σk−l and σs−l+k+1 ≥ · · · ≥ σn . The other outputs which correspond to s identical inputs σk−l+1 = · · · = σs−l+k might be neither 1’s nor 0’s, but the entire outputs x∗ still correspond to the minimum of (10). Denote the minimum value of the objective function as p∗ , then ∗
p =−
k−l
σi − σ ¯
i=1
s−l+k
x∗i ,
i=k−l+1
where σ ¯ = σk−l+1 = · · · = σs−l+k . From the constraints, we have s−l+k
x∗i = l,
∀x∗i ∈ [0, 1].
i=k−l+1
Therefore, the optimum value of the objective function will not change by varying x∗i in [0, 1] ∀i = k − l + 1, · · · , s − l + k, if only the sum of these x∗i ’s is equal to l. Clearly, we can let any l of these outputs equal to 1’s and the rest equal to 0’s. According to this rule, the output x∗ will be an optimum of (10) and consequently, an optimum of (9). So, for this particular application, a rectification module is needed which should be placed between the output layer of the GPNN and the desired output layer, as shown in Fig. 1. Example 1. In the k-WTA problem, let k = 2, n = 4 and the inputs σi = 10 sin[2π(t + 0.2(i − 2))], ∀i = 2, 3, 4 and σ1 = σ2 , where t increases from 0 to
708
X. Hu and J. Wang
1 continuously, which leads to four sinusoidal input signals σi for i = 1, · · · , 4 with σ1 = σ2 (see the top subfigure in Fig. 2). We use the rectified GPNN to accomplish the k-WTA operation. The other four subfigures in Fig. 2 record the final outputs of the proposed network at each time instant t. Note that when t is between approximately [0.05, 0.15] and [0.55, 0.65], either σ1 or σ2 may be selected as a winner, and the figure just shows one possible selection. x1
x2
xn
... Output rectification
xc1
...
xc2
xcn
General projection neural network
...
Vn
V1 V 2
Fig. 1. Diagram of the rectified GPNN for k-WTA operation in Example 1. xi stands for the output of the GPNN and xi stands for the final output. σ ,σ 1
σ
2
3
σ
4
σ
10 0 −10
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
x
1
1 0.5 0
x
2
1 0.5 0
x
3
1 0.5 0
x
4
1 0.5 0 t
Fig. 2. Inputs and outputs of the rectified GPNN for k-WTA operation in Example 1
Solving the k-WTA Problem
4
709
Oligopoly Cournot-Nash Equilibrium
In this section, we apply the GPNN for solving a spatial oligopoly model in economics formulated as a variational inequality [18, pp. 94-97]. Assume that a homogeneous commodity is produced by m firms and demanded by n markets that are generally spatially separated. Let Qij denote the commodity shipment from firm i to demand market j; qi denote the commodity produced by firm i satisfying qi = nj=1 Qij ; dj denote the demand for the commodity at market j m satisfying dj = i=1 Qij ; q denote a column vector with entries qi , i = 1, · · · , m; d denote a column vector with entries dj , j = 1, · · · , n; fi denote the production cost associated with firm i, which is a function of the entire production pattern, i.e., fi = fi (q); pj denote the demand price associated with market j, which is a function of the entire consumption pattern, i.e., pj = pj (d); cij denote the transaction cost associated with trading (shipping) the commodity from firm i to market j, which is a function of the entire shipment pattern; i.e., cij = cij (Q); and ui denote the profit or utility of firm i, given by ui (Q) =
n
pj Qij − fi −
j=1
n
cij Qij .
j=1
Clearly, for a well-defined model, Qij must be nonnegative. In addition, other constraints can be added to make the model more practical. For instance, a constrained set can be defined as follows ¯ ij , S = {Qij ∈ m×n |fi ≥ 0, p ≤ pj ≤ p¯j ,cij ≥ 0, 0 ≤ Qij ≤ Q j
∀i = 1, · · · , m; j = 1, · · · , n}, where pj and p¯j are respectively the lower and upper bound of the price pj , and ¯ ij is the capacity associated with the shipment Qij . Q An important issue in oligopoly problems is to determine the so-called CournotNash equilibrium [18]. According to Theorem 5.1 in [18, p. 97], assume that for each firm i the profit function ui (Q) is concave with respect to the variables {Qi1 , · · · , Qin } and continuously differentiable, then Q∗ is a Cournot-Nash equilibrium if and only if it satisfies the following VI −
m n ∂ui (Q∗ ) i=1 j=1
∂Qij
(Qij − Q∗ij ) ≥ 0 ∀Q ∈ S.
(11)
Then, in the linear case, this equilibrium problem can be solved by using the GPNN designed in Section 2. Example 2. In the above Cournot-Nash equilibrium problem, let m = 3, n = 5 and define the cost functions f1 (q) = 2q12 + 2q2 q3 + q2 + 4q3 + 1 f2 (q) = 3q22 + 2q32 + 2q2 q3 + q1 + 2q2 + 5q3 + 3 f3 (q) = 2q12 + q32 + 6q2 q3 + 2q1 + 2q2 + q3 + 5
710
X. Hu and J. Wang
and demand price functions p1 (d) = −3d1 − d2 + 5 p2 (d) = −2d2 + d3 + 8 p3 (d) = −d3 + 6 p4 (d) = d2 − d3 − 2d4 + 6 p5 (d) = −d3 − d5 + 3. For simplicity, let the transaction costs cij be constants; i.e., ⎛
⎞ 52329 cij = ⎝4 8 2 1 5⎠ . 65414 ¯ ij = 1. By these definitions, the In the constrained set S, let pj = 1, p¯j = 5, Q constraints fi ≥ 0 and ci ≥ 0 in S are automatically satisfied. By splitting Q into a column vector x ∈ 15 we can write (11) in the form of LVI (1) with ⎡
⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ 8 2 3 223 0 1 003 0 1 00 0 0 1 ⎢ 2 6 1 1 2 0 2 −1 0 0 0 2 −1 0 0 ⎥ ⎢ −6 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢3 1 4 3 3 0 0 1 0 0 0 0 1 0 0⎥ ⎢ −3 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 2 1 3 6 2 0 −1 1 2 0 0 −1 1 2 0 ⎥ ⎢ −4 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎢ ⎥ ⎢ ⎥ ⎥ ⎢2 2 3 2 4 0 0 1 0 1 0 0 1 0 1⎥ ⎢ 6 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢3 0 1 0 0 9 3 4 3 3 4 1 2 1 1⎥ ⎢ 1 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 0 2 −1 0 0 3 7 2 2 3 1 3 0 1 1 ⎥ ⎢ 2 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎥ , p = ⎢ −2 ⎥ , x = ⎢ 0 ⎥ , x = ⎢ 1 ⎥ 0 0 1 0 0 4 2 5 4 4 1 1 2 1 1 M =⎢ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 0 −1 1 2 0 3 2 4 7 3 1 0 2 3 1 ⎥ ⎢ −3 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢0 0 1 0 1 3 3 4 3 5 1 1 2 1 2⎥ ⎢ 4 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢3 0 1 0 0 6 3 4 3 3 7 1 2 1 1⎥ ⎢ 2 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 0 2 −1 0 0 3 5 2 3 3 1 5 0 0 1 ⎥ ⎢ −2 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢0 0 1 0 0 3 3 4 3 3 2 0 3 2 2⎥ ⎢ −1 ⎥ ⎢0⎥ ⎢1⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎣ 0 −1 1 2 0 3 2 4 5 3 1 0 2 5 1 ⎦ ⎣ −4 ⎦ ⎣0⎦ ⎣1⎦ 0 0 1 013 3 4 341 1 2 13 2 0 1 ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ −3 0 −1 0 0 −3 0 −1 0 0 −3 0 −1 0 0 −4 0 ⎢ 0 −2 1 0 0 0 −2 1 0 0 0 −2 1 0 0 ⎥ ⎢ −7 ⎥ ⎢ −3 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ 0 0 −1 0 0 0 0 −1 0 0 0 0 −1 0 0 −5 A=⎢ ⎥,y = ⎢ ⎥ , y = ⎢ −1 ⎥ ⎣ 0 1 −1 −2 0 0 1 −1 −2 0 0 1 −1 −2 0 ⎦ ⎣ −5 ⎦ ⎣ −1 ⎦ 0 0 −1 0 −1 0 0 −1 0 −1 0 0 −1 0 −1 −2 2
and B = 0, c = 0. For this LVI, there is no equality constraints. Consequently, there is no need to “eliminate” the equality constraints by reformulating the problem into another LVI as in (6). And a GPNN similar to (8) can be designed for solving the LVI directly. It can be checked that M > 0. Then, the designed GPNN is globally asymptotically stable. All simulation results show that with
Solving the k-WTA Problem
711
any positive λ and any initial state the GPNN is always convergent to the unique solution: ⎛ ⎞ 0.000 1.000 0.919 0.058 0.000 Q∗ = ⎝−0.000 0.000 0.081 0.221 0.000⎠ . 0.000 1.000 0.000 0.721 0.000 Fig. 3 depicts one simulation result of the neural network (state). Clearly, all trajectories reach steady states after a period of time. It is seen that by this numerical setting, in the Cournot-Nash equilibrium state, no firm will ship any commodity to market 1 or market 5. 10 8 6 4
States
2 0 −2 −4 −6 −8 −10
0
50
100
150
Time t
Fig. 3. State trajectories of the GPNN with a random initial point in Example 2
5
Concluding Remarks
In this paper, a general projection neural network (GPNN) is designed which is capable of solving a general class of linear variational inequalities (LVIs) with linear equality and two-sided linear inequality constraints. Then, the GPNN is applied for solving two problems, the k-winners-take-all problem and the oligopoly Cournot-Nash equilibrium problem, which are formulated into LVIs. The designed GPNN is of lower structural complexity than most existing ones. Numerical examples are discussed to illustrate the good applicability and performance of the proposed method.
Acknowledgments The work was supported by the National Natural Science Foundation of China under Grant 60621062 and by the National Key Foundation R&D Project under Grants 2003CB317007, 2004CB318108 and 2007CB311003.
712
X. Hu and J. Wang
References 1. Hopfield, J.J., Tank, D.W.: Computing with Neural Circuits: A Model. Science 233, 625–633 (1986) 2. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans. Circuits Syst. 35, 554–562 (1988) 3. Rodr´ıguez-V´ azquez, A., Dom´ınguez-Castro, R., Rueda, A., Huertas, J.L., S´ anchezSinencio, E.: Nonlinear Switched-Capacitor Neural Networks for Optimization Problems. IEEE Trans. Circuits Syst. 37, 384–397 (1990) 4. Tao, Q., Cao, J., Xue, M., Qiao, H.: A High Performance Neural Network for Solving Nonlinear Programming Problems with Hybrid Constraints. Physics Letters A 288, 88–94 (2001) 5. Xia, Y., Wang, J.: A Projection Neural Network and Its Application to Constrained Optimization Problems. IEEE Trans. Circuits Syst. I 49, 447–458 (2002) 6. Xia, Y., Wang, J.: A General Projection Neural Network for Solving Monotone Variational Inequalities and Related Optimization Problems. IEEE Trans. Neural Netw. 15, 318–328 (2004) 7. Xia, Y.: On Convergence Conditions of an Extended Projection Neural Network. Neural Computation 17, 515–525 (2005) 8. Hu, X., Wang, J.: Solving Pseudomonotone Variational Inequalities and Pseudoconvex Optimization Problems Using the Projection Neural Network. IEEE Trans. Neural Netw. 17, 1487–1499 (2006) 9. Hu, X., Wang, J.: A Recurrent Neural Network for Solving a Class of General Variational Inequalities. IEEE Transactions on Systems, Man and Cybernetics Part B: Cybernetics 37, 528–539 (2007) 10. Hu, X., Wang, J.: Solving Generally Constrained Generalized Linear Variational Inequalities Using the General Projection Neural Networks. IEEE Trans. Neural Netw. 18, 1697–1708 (2007) 11. Liu, S., Wang, J.: A Simplified Dual Neural Network for Quadratic Programming with Its KWTA Application. IEEE Trans. Neural Netw. 17, 1500–1510 (2006) 12. Pang, J.S., Yao, J.C.: On a Generalization of a Normal Map and Equations. SIAM J. Control Optim. 33, 168–184 (1995) 13. Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, vol. I and II. Springer, New York (2003) 14. Maass, W.: On the Computational Power of Winner-Take-All. Neural Comput. 12, 2519–2535 (2000) 15. Wolfe, W.J.: K-Winner Networks. IEEE Trans. Neural Netw. 2, 310–315 (1991) 16. Calvert, B.A., Marinov, C.: Another k-Winners-Take-All Analog Neural Network. IEEE Trans. Neural Netw. 11, 829–838 (2000) 17. Marinov, C.A., Hopfield, J.J.: Stable Computational Dynamics for a Class of Circuits with O(N ) Interconnections Capable of KWTA and Rank Extractions. IEEE Trans. Circuits Syst. I 52, 949–959 (2005) 18. Nagurney, A., Zhang, D.: Projected Dynamical Systems and Variational Inequalities with Applications. Kluwer, Boston (1996)
Optimization of Parametric Companding Function for an Efficient Coding

Shin-ichi Maeda1 and Shin Ishii2

1 Nara Institute of Science and Technology, 630–0192 Nara, Japan
2 Kyoto University, 611–0011 Kyoto, Japan
Abstract. Designing a lossy source code remains one of the important topics in information theory, and it has a lot of applications. Although plain vector quantization (VQ) can realize any fixed-length lossy source coding, it has a serious drawback in its computation cost. Companding vector quantization (CVQ) reduces the complexity by replacing vector quantization with a set of scalar quantizations. It can represent a wide class of practical VQs, although the structure of CVQ restricts it from representing every lossy source coding. In this article, we propose an optimization method for parametrized CVQ by utilizing a newly derived distortion formula. To test its validity, we applied the method especially to transform coding. We found that our trained CVQ outperforms Karhunen-Loève transformation (KLT)-based coding not only in the case of linear mixtures of uniform sources, but also in the case of low bit-rate coding of a Gaussian source.
1 Introduction
The objective of lossy source coding is to reduce the coding length while keeping the distortion between the original data and the reproduction data constant. It is indispensable for compressing various kinds of sources, such as images and audio signals, to save memory, and may serve as a useful tool for extracting features from high-dimensional inputs, as the well-known Karhunen-Loève transformation (KLT) does. Plain vector quantization (VQ) has been known to be able to realize any fixed-length lossy source coding, implying that VQ can be optimal among all fixed-length lossy source codings. However, it has a serious drawback in its computation cost (encoding, memory amount, and optimization of the coder), which increases exponentially as the input dimensionality increases. Companding vector quantization (CVQ) has the potential to reduce the complexity inherent in non-structured VQ by replacing vector quantization with a set of scalar quantizations, and it can represent a wide class of practically useful VQs such as transform coding, shape-gain VQ, and tree-structured VQ. On the other hand, the structure of CVQ restricts it from representing every lossy source coding. So far, optimal CVQ has been calculated analytically only for a very limited class. KLT is one of them; it is optimal for encoding Gaussian sources among the class of transform coding, in the limit of high rate [1] [2]. Although an analytical derivation of optimal CVQ has been attempted, numerical optimization to obtain suboptimal CVQ has hardly been done, except for the case that CVQ has a limited structure. In this article, we propose a general optimization method for the
parametrized CVQ, which includes a bit allocation algorithm based on a newly derived distortion formula, a generalization of Bennett's formula, and we test its validity through numerical simulations.
2 Companding Vector Quantization

2.1 Architecture
The architecture of our CVQ is shown in Fig. 1.
Fig. 1. Architecture of CVQ
First, the n-dimensional original datum x = [x_1, · · · , x_n]^T ∈ R^n is transformed into an m-dimensional feature vector y = [y_1, · · · , y_m]^T ∈ [0, 1]^m as y = ψ(x) by the compressor ψ, where T denotes transpose. Then, the quantizer Γ quantizes the feature vector y into ŷ = [ŷ_1, · · · , ŷ_m]^T = Γ(y). The quantizer Γ consists of m uniform scalar quantizers, Γ_1, · · · , Γ_m, each of which individually quantizes the corresponding element of y as ŷ_i = Γ_i(y_i) with quantization level a_i, as shown in Fig. 2. The set of quantization levels is represented by the quantization vector a = [a_1, · · · , a_m]^T. Finally, the expander φ transforms the quantized feature vector ŷ into the reproduction vector x̂ = [x̂_1, · · · , x̂_n]^T ∈ {μ_k ∈ R^n | k = 1, · · · , M}. The number of reproduction vectors M is related to the quantization vector a by M = \prod_{i=1}^{m} a_i. The reproduction vector x̂ is hence obtained by the expander φ as x̂ = φ(ŷ). The compressor and expander are parameterized as ψ(x; θ_α) and φ(ŷ; θ_β), respectively, and are assumed to be differentiable with respect to their parameters θ_α and θ_β and their inputs x and ŷ. A pair of a compressor and an expander is called a compander. As a whole, the reproduction vector x̂ is calculated as x̂ := ρ(x) = φ(Γ(ψ(x))). We also write ρ(x; a, θ_α, θ_β) for ρ(x) when we wish to emphasize the dependence of the coder ρ(x) on the parameters θ_α and θ_β and the quantization vector a.
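A minimal sketch of this pipeline in Python is given below; the compressor and expander are passed in as generic callables, and mapping each y_i to the midpoint of its quantization cell is an illustrative choice (the exact reproduction points follow the quantizer of Fig. 2).

import numpy as np

def uniform_quantize(y, levels):
    # uniform scalar quantization on [0, 1]: y_i is assigned to one of
    # levels[i] cells and replaced by that cell's midpoint
    a = np.asarray(levels, dtype=float)
    idx = np.minimum(np.floor(np.asarray(y) * a), a - 1)
    return (idx + 0.5) / a

def cvq_code(x, compressor, expander, levels):
    # companding VQ: x -> psi(x) -> scalar quantization -> phi(.) -> x_hat
    y = compressor(x)              # psi: R^n -> [0, 1]^m
    y_hat = uniform_quantize(y, levels)
    return expander(y_hat)         # phi: [0, 1]^m -> R^n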
2.2 Optimization
Since we consider fixed-rate coding, the optimization problem is given by

$$\min_{\theta_\alpha,\theta_\beta,\mathbf{a}} E[d(\mathbf{x},\hat{\mathbf{x}})] = \min_{\theta_\alpha,\theta_\beta,\mathbf{a}} \int p(\mathbf{x})\, d(\mathbf{x}, \rho(\mathbf{x};\theta_\alpha,\theta_\beta,\mathbf{a}))\, d\mathbf{x} \qquad (1)$$
$$\text{s.t.}\quad M \le \bar{M}, \quad \forall i \in \{1,\dots,m\},\ a_i \in \mathbb{N},$$
Fig. 2. Quantizer ŷ_i = Γ_i(y_i)
where M̄ is the maximum permissible number of reproduction vectors, E[·] denotes an expectation with respect to the original data distribution (source) p(x), and d(x, x̂) denotes a distortion between an original datum and a reproduction datum. Since the source p(x) is usually unknown in practice, provided that we have N independent and identically distributed samples {x^{(1)}, ..., x^{(N)}} from the source p(x), the expected distortion in equation (1) is approximated by the sample mean:

$$E[d(\mathbf{x}, \rho(\mathbf{x};\theta_\alpha,\theta_\beta,\mathbf{a}))] \approx D_N(\theta_\alpha,\theta_\beta,\mathbf{a}) \equiv \frac{1}{N}\sum_{k=1}^{N} d\bigl(\mathbf{x}^{(k)}, \rho(\mathbf{x}^{(k)};\theta_\alpha,\theta_\beta,\mathbf{a})\bigr). \qquad (2)$$
Since the expected distortion can be estimated accurately when enough samples are available, we try to minimize the average distortion (2). This optimization consists of two parts: one is the optimization of the companding parameters θ_α and θ_β, and the other is the optimization of the quantization vector a, i.e., bit allocation. First, we explain the optimization of the companding parameters θ_α and θ_β. Here, we present an iterative optimization as follows.
1. θ_β is optimized as
$$\min_{\theta_\beta} \sum_{i=1}^{N} d\bigl(x^{(i)}, \phi(\hat{y}^{(i)};\theta_\beta)\bigr), \qquad (3)$$
where ŷ^{(i)} = Γ(ψ(x^{(i)})) while θ_α is fixed.
2. θ_α is optimized as
$$\min_{\theta_\alpha} \sum_{i=1}^{N} d\bigl(x^{(i)}, \phi(\hat{y}^{(i)})\bigr), \qquad (4)$$
where ŷ^{(i)} = Γ(ψ(x^{(i)}; θ_α)) while θ_β is fixed.
The two optimization steps above correspond to the well-recognized conditions that an optimal VQ should satisfy: the centroid condition and the nearest neighbor condition. This is because the parameters θ_β and θ_α determine the reproduction vectors μ^{(k)} = φ(ν^{(k)}; θ_β) and the partition functions S^{(k)} = {x | ν^{(k)} = ψ(x; θ_α)}, respectively, and the centroid and nearest neighbor conditions specify the conditions that the optimal reproduction vectors and the optimal partition function should satisfy, respectively. Optimization for θ_β in equation (3) is performed by conventional gradient-based optimization methods such as steepest descent or Newton's method. On the other hand, the cost function in equation (4) is not differentiable due to the discontinuous function Γ; therefore, we replace the discontinuous function Γ with a continuous one, e.g.,
$$\Gamma_i(y_i) \approx y_i. \qquad (5)$$
The dotted line in Fig. 2 depicts this approximation, which becomes exact when the quantization level a_i becomes sufficiently high. With this approximation, the optimization problem given by equation (4) becomes
$$\min_{\theta_\alpha} \sum_{i=1}^{N} d\bigl(x^{(i)}, \phi(\psi(x^{(i)};\theta_\alpha))\bigr), \qquad (6)$$
which can be solved by a gradient-based method. In particular, if the expander φ is invertible, optimality is always achieved when θ̂_α forces the compressor ψ to be the inverse of the expander φ, ψ = φ^{−1}, because the distortion takes its minimum value d(x, φ(ψ(x; θ_α))) = d(x, x) = 0 only under that condition. In this case, we need no parameter search for θ_α and hence can avoid the local minima problem. Although this simple optimization scheme works well as expected, we have found that it sometimes fails to minimize the cost function (4). To improve the optimization for θ_α, therefore, we estimate a local curvature of the cost function given by (4), and employ a line search along the estimated steepest gradient. Next, we present our bit allocation scheme. Here, we consider the special case in which the distortion measure is the mean squared error (MSE), d(x, x̂) = (1/n) Σ_{i=1}^{n} (x_i − x̂_i)^2, and constitute an efficient search method for a based on an
estimated distortion:
$$E[d(\mathbf{x},\hat{\mathbf{x}})] \approx \frac{1}{n}\sum_{j=1}^{m} \frac{1}{12 a_j^2}\, E[G_j(\mathbf{y})], \qquad (7)$$
where $G_j(\mathbf{y}) = \sum_{i=1}^{n} \left(\frac{\partial \phi_i(\mathbf{y})}{\partial y_j}\right)^2$. The derivation of equation (7) is based on a Taylor
expansion, but it is omitted due to the space limitation. This estimation becomes accurate if the companded source pdf, p(y), is well approximated as a uniform distribution and the expander φ(y) is well approximated as a linear function within a small m-dimensional hyper-rectangle that includes ν^{(k)}, C^{(k)} ≡ L_1^{(k)} × ··· × L_m^{(k)}, where the interval in each dimension is given by L_i^{(k)} ≡ [(l_i^{(k)} − 1)/a_i, l_i^{(k)}/a_i] and l_i^{(k)} ∈ {1, ..., a_i} for each i = 1, ..., m. These conditions are both satisfied in the case of an asymptotic high rate. Note that the expectation with respect to the feature vector y in equation (7) can in practice be replaced with a sample average. Since equation (7) is identical to Bennett's original integral [3] if the dimensionalities of the original datum x and the feature vector y are both one (i.e., a scalar companding case), equation (7) is a multidimensional version of Bennett's integral. The estimated distortion is useful for improving the quantization vector a in our CVQ, as shown below. Using the estimated distortion (7), the optimization problem for a non-negative real-valued a is given by
$$\min_{\mathbf{a}} \sum_{i=1}^{m} \frac{E[G_i(\mathbf{y})]}{a_i^2} \qquad \text{s.t.}\quad \sum_{i=1}^{m} \log a_i \le \log \bar{M}, \quad \forall i \in \{1,\dots,m\},\ a_i > 0,\ a_i \in \mathbb{R}. \qquad (8)$$
If we allow a non-negative real-valued solution, we can obtain it by conventional optimization methods, as done by Segall [4]. The non-negative real-valued solution above, however, does not guarantee that each level a_i is a natural number. We therefore employ a two-stage optimization scheme. The first stage obtains a non-negative real-valued solution and rounds it to the nearest integer solution. This stage allows a global search for the quantization vector but provides only a rough solution of the original cost function given in equation (1). The second stage then optimizes over integer solutions around the solution obtained in the first stage, finely but locally. The second stage utilizes the rate-distortion cost function:
$$\min_{\mathbf{a}} J(\mathbf{a}) = \min_{\mathbf{a}} \sum_{i=1}^{m} \log_2 a_i + \gamma E[d(\mathbf{x},\hat{\mathbf{x}})] \approx \min_{\mathbf{a}} \sum_{i=1}^{m} \left( \log_2 a_i + \frac{\gamma}{12 n a_i^2} E[G_i(\mathbf{y})] \right), \qquad (9)$$
$$\text{s.t.}\quad \forall i \in \{1,\dots,m\},\ a_i \in \mathbb{N},$$
where the parameter γ balances the rate and the distortion. γ is set so that the real-valued solution of the cost function (9) corresponds to the solution of (8), as $\gamma = \frac{n \bar{M}^{2/n}}{2C \log 2}$, where $C = \left(\prod_{j=1}^{n} E[G_j(\mathbf{y})]\right)^{1/n}$.
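As a rough illustration of the first stage, the following sketch (hypothetical code, not from the paper) computes the real-valued allocation solving (8) — a_i proportional to the square root of E[G_i(y)], rescaled so that the product of the levels matches the budget — and rounds it to an integer starting point; `G_mean` is assumed to hold sample estimates of E[G_i(y)].

```python
import numpy as np

def real_valued_allocation(G_mean, M_max):
    """First-stage solution of (8): a_i ~ sqrt(E[G_i(y)]), rescaled so that
    prod(a) equals the permissible number of reproduction vectors."""
    G_mean = np.asarray(G_mean, dtype=float)
    a = np.sqrt(G_mean)
    a *= (M_max / np.prod(a)) ** (1.0 / len(a))
    return a

def rounded_allocation(G_mean, M_max):
    """Round the real-valued solution to integers (at least one level each),
    which serves as the starting point of the fine local search."""
    return np.maximum(np.rint(real_valued_allocation(G_mean, M_max)), 1).astype(int)
```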
We choose the quantization vector that minimizes the rate-distortion cost function (9) among the 2m quantization vectors, each of which sets one component a_i (i = 1, ..., m) at a level one larger or one smaller than that of the first-stage solution a. After re-optimizing the companding parameters so as to be suitable for the new quantization vector a, we check whether the new quantization vector and the re-optimized companding parameters {a, θ_α, θ_β} actually decrease the average distortion (2). The optimization procedure in the case of the MSE distortion measure is summarized in the panel below.
Summary of optimization procedure
1. Initialize D̂_N at a sufficiently large value so that the parameter update at step 5 is done at least once, and set {θ̂_α, θ̂_β, â} as an initial guess.
2. Select a trial quantization vector a according to the rough global search. Initialize the companding parameters {θ_α, θ_β} so that the trial parameter set is {θ_α, θ_β, a}.
3. Minimize D_N(θ_α, θ_β, a) with respect to the parameters {θ_α, θ_β}.
4. If D_N(θ_α, θ_β, a) < D̂_N, then replace D̂_N and {θ̂_α, θ̂_β, â} by D_N(θ_α, θ_β, a) and {θ_α, θ_β, a}, respectively, and go to step 2. Otherwise, go to step 5.
5. Improve the trial quantization vector a according to the fine local search.
6. Minimize D_N(θ_α, θ_β, a) with respect to the parameters {θ_α, θ_β}.
7. If D_N(θ_α, θ_β, a) < D̂_N, then replace D̂_N and {θ̂_α, θ̂_β, â} by D_N(θ_α, θ_β, a) and {θ_α, θ_β, a}, respectively, and go to step 5. Otherwise, terminate the algorithm.
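A minimal skeleton of this procedure might look as follows (hypothetical code; `optimize_compander`, `average_distortion`, `global_bit_allocation`, and `local_candidates` are assumed helpers corresponding to steps 2-3 and 5-6 and are not defined in the paper).

```python
def optimize_cvq(samples, M_max, optimize_compander, average_distortion,
                 global_bit_allocation, local_candidates, max_rounds=20):
    """Two-stage optimization: rough global search over the quantization
    vector, then fine local search around the incumbent (steps 1-7)."""
    best, best_D = None, float("inf")          # step 1

    for _ in range(max_rounds):                # steps 2-4
        a = global_bit_allocation(samples, M_max)
        theta = optimize_compander(samples, a)
        D = average_distortion(samples, theta, a)
        if D < best_D:
            best_D, best = D, (theta, a)
        else:
            break

    improved = True                            # steps 5-7
    while improved:
        improved = False
        for a in local_candidates(best[1]):
            theta = optimize_compander(samples, a)
            D = average_distortion(samples, theta, a)
            if D < best_D:
                best_D, best, improved = D, (theta, a), True
    return best, best_D
```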
3 Applications to Transform Coding

3.1 Architecture
Here, we apply the proposed learning method in particular to transform coding, which utilizes linear matrices and componentwise nonlinear functions for the companding functions, ψ(x; θ_α) = f(Wx; ω_α) and φ(ŷ; θ_β) = A g(ŷ; ω_β), where θ_α = {W, ω_α} and θ_β = {A, ω_β}. W and A are m × n and n × m matrices, respectively, and f and g are componentwise nonlinear functions with parameters ω_α and ω_β, respectively. We use a scaled sigmoid function for f, which expands a certain interval of a sigmoid function so that the range becomes [0, 1], and g is chosen as the inverse function of f:
$$f_i(x_i; \omega_\alpha) = \frac{d_i(a_i^\alpha, b_i^\alpha)}{1 + \exp(-a_i^\alpha (x_i - b_i^\alpha))} + c_i(a_i^\alpha, b_i^\alpha),$$
$$g_i(x_i; \omega_\beta) = -\frac{1}{a_i^\beta} \log\left( \frac{d_i(a_i^\beta, b_i^\beta)}{x_i - c_i(a_i^\beta, b_i^\beta)} - 1 \right) + b_i^\beta,$$
where $c_i = -\frac{e^{a_i(1-b_i)} + 1}{e^{a_i} - 1}$, $d_i = -c_i e^{a_i b_i}$, and ω_α = {a_i^α, b_i^α | i = 1, ..., m} and ω_β = {a_i^β, b_i^β | i = 1, ..., m} are the parameters of f and g, respectively. Note that when a_i^α = a_i^β and b_i^α = b_i^β, each of the functions f_i and g_i is the inverse function of the other. The above transform coding is optimized according to the method described in Section 2.2.
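A numerical sketch of these componentwise functions is given below (hypothetical code; the expressions for c and d follow the reconstruction above, so they should be treated as an assumption rather than a verified transcription). Whatever c and d are, g as written is the functional inverse of f for matching parameters.

```python
import numpy as np

def scaled_sigmoid(x, a, b):
    """f(x) = d / (1 + exp(-a (x - b))) + c: a sigmoid rescaled so that the
    interval of interest is expanded toward the range [0, 1]."""
    c = -(np.exp(a * (1.0 - b)) + 1.0) / (np.exp(a) - 1.0)
    d = -c * np.exp(a * b)
    return d / (1.0 + np.exp(-a * (x - b))) + c

def scaled_sigmoid_inverse(y, a, b):
    """g(y) = -(1/a) log(d / (y - c) - 1) + b, the inverse of f above."""
    c = -(np.exp(a * (1.0 - b)) + 1.0) / (np.exp(a) - 1.0)
    d = -c * np.exp(a * b)
    return -np.log(d / (y - c) - 1.0) / a + b
```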
However, the matrix A can be obtained analytically, because A is a solution of $\min_A \sum_{j=1}^{n} (X_j - \hat{Z}^T A_j)^T (X_j - \hat{Z}^T A_j)$, where A_j is the j-th row vector of the matrix A, X_j is a vector consisting of the j-th components of the N original data samples, and Ẑ is a matrix whose columns are the transformed vectors corresponding to the original data x^{(i)}. The MSE solution of this cost function is given as a set of row vectors [A_1, ..., A_n], each of which is obtained individually as A_i^* = Ẑ^{−T} X_i, where Ẑ^{−T} is the pseudo-inverse of the matrix Ẑ^T.

3.2 Simulation Results
To examine the performance of our optimization for CVQ, we compared the transform coding trained by our method with KLT-based transform coding, which uses a scalar quantizer trained by the Lloyd-I algorithm [5] for each feature component. Optimization of our CVQ and of the trained KLT was performed for 30 data sets, each consisting of 10,000 samples. The performance of each coding scheme was evaluated as the average distortion over 500,000 samples independently drawn from the true source distribution. Hereafter, the average distortion over the 10,000 samples and over the 500,000 independent samples are called the training error and the test error, respectively. In our CVQ, the matrix A was initially set at a value randomly chosen from a uniform distribution and normalized so that the variance of each transformed component was 1. The initial matrix W was determined as an inverse of A. The other companding parameters were set as a_i^α = a_i^β = 6 and b_i^α = b_i^β = 0.5, which are well suited to a signal whose distribution is bell-shaped with variance 1. First, we examined the performance of our CVQ by using two kinds of two-dimensional sources: a linear mixture of uniform sources and a Gaussian source. They have different characteristics as coding targets, because the former cannot be decomposed into independent signals by an orthogonal transformation, whereas the latter can. In the case of high bit-rate coding, KLT-based transform coding is optimal when there exists an orthogonal transformation that decomposes the source data into independent signals [6]. Therefore, the coding performance for a linear mixture of uniform sources, where the optimality of the KLT is not assured, is of great interest. Moreover, it is worth examining whether our CVQ can find a transformation superior to the KLT for low bit-rate encoding of a Gaussian source, because non-orthogonal matrices are permitted as the transformation matrix in our CVQ. Figures 3(a) and (b) compare the arrangements of reproduction vectors by our CVQ and the trained KLT for two different linear mixtures of uniform sources; both of them have their density on the gray shaded parallelograms, slanted (a) 45° and (b) 60°. Each figure shows the result with the minimum training error among 30 runs of the two methods. Black circles indicate the arranged reproduction vectors, and the two axes indicate the two feature vectors. Upper panels show the results of 2 bit-rate coding, and the lower ones show 4 bit-rate coding. Each panel title denotes the signal-to-noise ratio (SNR) for the test data set. As seen in this figure, our CVQ obtained feature vector axes that were almost independent of each other and efficiently arranged the reproduction vectors, while
[Figure panels: CVQ vs. trained KLT at 2 bit-rate (upper) and 4 bit-rate (lower) for sources slanted (a) 45° and (b) 60°; panel titles report the test-set SNR (for (a): 10.70 dB vs. 10.24 dB and 22.70 dB vs. 20.88 dB; for (b): 11.44 dB vs. 10.18 dB and 23.40 dB vs. 21.73 dB).]
Fig. 3. Arrangements of reproduction vectors for linear mixtures of uniform sources
the trained KLT code failed to do so; its test performance was degraded, especially in high-rate coding, because reproduction vectors were placed in areas without original data. We also see that our CVQ modifies the feature vector axes as the bit-rate changes, even for the same source, as typically illustrated in Fig. 3(a). Thus, our CVQ adaptively seeks an efficient allocation of reproduction vectors, which is difficult to perform analytically. Results with various bit-rates indicated that the test error of our CVQ was likely the best, especially when the bit-rate was high. On the other hand, the KLT code is guaranteed to be the best transform coder for encoding a Gaussian source at high bit-rate. However, there remains a possibility that a non-orthogonal transform coding might outperform the KLT at low bit-rate by changing the transformation matrix according to the bit-rate.
Fig. 4. Encoding result of a Gaussian source in low bit-rate (1.6 bit-rate): best CVQ (SNR 6.327 dB) vs. optimal KLT code (SNR 6.237 dB)
In fact, we found that the best transformation matrix in our CVQ was superior to the analytical KLT code at low bit-rate, as shown in Fig. 4. In this figure, we compare our trained CVQ with the optimal KLT code, which can be numerically calculated at low bit-rate; the SNR of our CVQ was estimated using 5,000,000 samples (three times the estimated standard deviation is shown in the brackets). Finally, we examined the performance when the source dimension became large. A linear mixture of uniform distributions was used for the original source, because it prevents the high-dimensional vector quantization from being replaced by a set of scalar
Fig. 5. Encoding results of a linear mixture of uniform sources with various dimensionality, in (a) 1 bit, and (b) 2 bits per dimension
quantizations via an orthogonal transformation, whereas a Gaussian source does not. We tested 1 and 2 bits per dimension, with dimensionality 2, 4, 8, or 16. In Figure 5, the abscissa denotes the dimension of the source, and the ordinate denotes the test error relative to the best CVQ (dB) among 30 runs. The test error of CVQ over 30 runs is shown by a box-whisker plot, and the median test errors of CVQ, analytical KLT, and trained KLT are denoted by a solid line, a dashed line, and a dot-and-dash line, respectively. The number of poor results by CVQ, which are not plotted in the figure, is shown in brackets on each x label if they exist. As seen from Fig. 5, the performance variance became large as the data dimensionality increased. We found that the initial setting of the parameters largely affected the performance of our CVQ when the data dimensionality was large. However, the best trained CVQ (the best run among 30 runs) was in all cases superior to both the trained and analytical KLTs, and this superiority was consistent regardless of the data dimensionality.
4 Discussion
The idea of companding has existed for a long time, and many theoretical analyses have been done. An optimal companding function for scalar quantization was analyzed by Bennett [3] and has been used to examine the ability of an optimal scalar quantizer. On the other hand, an optimal companding function for VQ has not been derived analytically except for very limited cases; moreover, it is known that the optimal CVQ does not correspond to the optimal VQ except for very limited cases [7] [8] [9]. These negative results show the difficulty of analytically deriving the optimal companding function. Yet, the optimization of CVQ is attractive because CVQs constitute a wide class of practically useful codes and avoid an exponential increase in the coder's complexity, referred to as the 'curse of dimensionality.' Through the numerical simulations, we noticed that theoretical analysis based on the high-rate assumption often deviates from the real situation, and an analytically derived companding function based on high-rate analysis does not show very good performance even when the bit-rate is quite high (e.g., 4 bit-rate). We guess
that such substantial disagreement stems from the high-rate assumption. Since high-rate coding implies that the objective function of the lossy source coding consists solely of the distortion, neglecting the coding length, the high-rate analysis may lead to large disagreement when the coding length cannot be completely neglected. These findings suggest the importance of optimizing practically useful codes by learning from sample data. Fortunately, plenty of data are usually available in practical lossy source coding, and the computation cost of the optimization is not very large in comparison with various machine learning problems, because the code optimization needs to be performed just once. Although we could show the potential ability of the CVQ, it should be noted that we must carefully choose the initial parameters for training the parametrized CVQ, especially in the case of high dimensionality. To find good initial parameters, the idea presented by Hinton for training deep networks [10] may be useful. Acknowledgement. This work was supported by KAKENHI 19700219.
References
1. Huang, J.H., Schultheiss, P.M.: Block quantization of correlated Gaussian random variables. IEEE Trans. Comm. CS-11, 289–296 (1963)
2. Goyal, V.K.: Theoretical foundations of transform coding. IEEE Signal Processing Mag. 18(5), 9–21 (2001)
3. Bennett, W.R.: Spectra of quantized signals. Bell Syst. Tech. J. 27, 446–472 (1948)
4. Segall, A.: Bit allocation and encoding for vector sources. IEEE Trans. Inform. Theory 22(2), 162–169 (1976)
5. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inform. Theory 28(2), 129–137 (1982)
6. Goyal, V.K., Zhuang, J., Vetterli, M.: Transform coding with backward adaptive updates. IEEE Trans. Inform. Theory 46(4), 1623–1633 (2000)
7. Gersho, A.: Asymptotically optimal block quantization. IEEE Trans. Inform. Theory 25, 373–380 (1979)
8. Bucklew, J.A.: Companding and random quantization in several dimensions. IEEE Trans. Inform. Theory 27(2), 207–211 (1981)
9. Bucklew, J.A.: A note on optimal multidimensional companders. IEEE Trans. Inform. Theory 29(2), 279 (1983)
10. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
A Modified Soft-Shape-Context ICP Registration System of 3-D Point Data Jiann-Der Lee1, Chung-Hsien Huang1, Li-Chang Liu1, Shih-Sen Hsieh1, Shuen-Ping Wang1, and Shin-Tseng Lee2 1
Department of Electrical Engineering, Chang Gung University, Tao-Yuan, Taiwan
[email protected],
[email protected],
[email protected], {m9321039,m9321042}@stmail.cgu.edu.tw 2 Department of Neurosurgery, Chang Gung Memorial Hospital, Lin-Kuo, Taiwan
[email protected]

Abstract. A facial point data registration system based on ICP is presented. The reference facial point data are extracted from the patient's pre-stored CT images, and the floating facial point data are captured from the patient directly by using a touch or non-touch capture device. A modified soft-shape-context ICP, which includes an adaptive dual AK-D tree for searching the closest point and a modified objective function, is embedded in this system. The adaptive dual AK-D tree searches for closest-point pairs and discards insignificant control coupling points by applying an adaptive distance threshold to the distance between the two returned closest points, which are searched using the AK-D tree search algorithm in two different partition orders. In the objective function of ICP, we utilize modified soft-shape-context information, which is one kind of projection information, to enhance the robustness of the objective function. By registering the floating data to the reference data, the system provides the geometric relationship for a medical assistant system and for preoperative training. Experiments using touch and non-touch capture devices to capture the floating point data are performed to show the superiority of the proposed system.

Keywords: ICP, KD-tree, Shape context, registration.
1 Introduction

Besl and McKay [8] proposed the Iterative Closest Point algorithm (ICP), which has become popular especially in 3D point data registration [10-12], and suggested using the K-D tree for the nearest neighbor search. Later, Greenspan [7] proposed the Approximate K-D tree search algorithm (AK-D tree) by excluding the backtracking in the K-D tree and giving more searching bin space. Greenspan claimed that the computation time of the best performance using AK-D tree is 7.6% and 39% of the computation time using the K-D tree and Elias [7][6], respectively. One weakness of using the K-D tree and the AK-D tree is that a false nearest neighbor point may be found, because only one projection plane is considered when partitioning a k-dimensional space into two groups to build the node tree in one partition iteration. In order to improve the
search for the best closest-point pair, which affects the transformation matrices in ICP, we use the AK-D tree twice in two different geometrical projection orders to determine the true nearest neighbor point and form the significant coupling points used in the later ICP stages. An adaptive threshold is used in the proposed Adaptive Dual AK-D tree search algorithm in order to reserve sufficient coupling points for a valid result. In the objective function of ICP, we modify the soft-shape-context idea proposed by Liu and Chen [2]. In soft-shape-context ICP (SICP) [2], each point generates a bin histogram, and a low-pass filter is used to smooth the neighboring histogram values. For 3-D point data, the computation time to generate bin histograms for all points is huge; therefore, in order to reduce the computation time of SICP, we compute only two bin histograms, for the centroid points of the reference and floating data. We propose a registration system for facial point data by combining the modified soft-shape-context concept and the ADAK-D tree. The floating facial point data are captured on-site from a touch or non-touch capture device, and the reference point data are extracted from pre-stored CT images in DICOM format. Experimental results will show that the registration results of the proposed system are more accurate than the results of using ICP and SICP. After the registration, surgeons can use the mouse to click on the desired location on any slice of the CT images in the user's interface, or use the digitizer to touch the desired location on the patient's face, and these locations will be shown in the registration-result sub-window for information survey. The purpose of the proposed system is to build a medical virtual reality environment to assist surgeons with pre-operation assistance and training.
2 Background of ICP Algorithm

2.1 The Approximate K-D Tree Search Algorithm

ICP registers the floating data to the reference data by finding the best matching relation with the minimum distance between the two data sets. The iterations in ICP are summarized in the following 5 steps.
Step 1. Search for the closest neighbor point X_k in the reference data set X to a given floating datum P_K of the floating data set P, i.e. d(P_K, X_k) = min{d(P_K, X)}.
Step 2. Compute the transformation, which comprises a rotation matrix R and a translation matrix T, by using the least squares method.
Step 3. Calculate the mean square error of the objective function.
$$e(R, T) = \frac{1}{N_p} \sum_{i=1}^{N_p} \| x_i - R(p_i) - T \|^2 \qquad (1)$$
Step 4. Apply R and T to P. If the stopping criteria are reached, then the algorithm stops. Otherwise, go to Step 2.
Step 5. Find the best transform parameters R and T with the minimum e(R, T).
In the closest-point-search step, the K-D tree and the AK-D tree [7] are widely used. In a K-D tree, one of the k dimensions is used as a key to split the complete space, and the key is stored in a node to generate a binary node tree. The root node contains the complete bin regions, and the leaf nodes contain a sub-bin region. For a query point
P_i, the key of P_i is compared with the keys in the nodes of Q to find the best-matching leaf node q_b. All points in the node bin region of q_b are examined to find the closest point. The Ball-within-bounds (BWB) test [7] is performed to avoid a false closest point in the K-D tree. In the AK-D tree, more bin regions are searched for an approximate closest point, and the BWB test is discarded to reduce the computations.

2.2 The Soft-Shape-Context ICP Registration

The shape context [9] is a description of the similarity of corresponding shapes between two images. Two shapes are compared to obtain a transformation relation by changing one shape according to the minimum square error of the transformation matrix equation. For a given image D, the bin information of every point d_i in D, i.e. d_i ∈ D, is computed to generate a bin histogram defined below.
$$h_i(k) = \#\{d_j \ne d_i : (d_j - d_i) \in \mathrm{bin}(k)\}, \quad k = 1, \dots, B \qquad (2)$$
where B is the total number of bins. The number of segmented bins affects the similarity result. If a segmented bin area is too small, the similarity information is too noisy. If a segmented bin area is too big, the similarity information contains only rough global information without the local difference information. For two given images D and F, the cost function defined in (3) is computed for the shape similarity:
$$C_{DF} = \frac{1}{2} \sum_{k=1}^{B} \frac{[h_D(k) - h_F(k)]^2}{h_D(k) + h_F(k)} \qquad (3)$$
Liu and Chen [2] proposed a soft-shape-context ICP (SICP) by adding the shape context information to the objective function of ICP. A symmetric triangular function Triang_K(l) is used to add the neighboring bin accumulations around K in order to smooth the bins. This triangular function acts like a low-pass filter that reduces the sensitivity around the bin boundaries. The histogram of the soft shape context is called SSC and is defined in (4).
$$SSC_{d_j}(l) = \sum_{K=\mathrm{label}(d_j)} \mathrm{Triang}_K(l) \cdot h_{d_j}(l) \qquad (4)$$
where l runs from one to B for a B-bin histogram centered at d_j. Each point d_j generates a B-bin histogram, which is one kind of projection information. The objective function of ICP is augmented with the soft-shape-context information and is rewritten in (5):
$$e(R, T) = \sum_i \left\{ \| q_j - R(p_i) - T \| + \alpha \cdot E_{shape}(q_j, p_i) \right\} \qquad (5)$$
$$E_{shape}(q_j, p_i) = \sum_{l=1}^{B} \bigl( SSC_{q_j}(l) - SSC_{p_i}(l) \bigr) \qquad (6)$$
where α is a weighting value that balances the two terms. Liu and Chen claimed in [2] that, for 2-D images, the experimental results of ICP with the soft shape context are superior to those of plain ICP and of ICP with the (hard) shape context.
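To make the bin-histogram idea concrete, the following sketch (hypothetical code, not from the paper) builds a simplified histogram using only radial distance bins and applies a triangular smoothing in the spirit of (4); real shape contexts use richer (e.g., log-polar) binnings, so this is purely illustrative.

```python
import numpy as np

def shape_context_histogram(points, center, edges):
    """Simplified bin histogram in the spirit of (2): counts of the other
    points falling into radial distance bins around `center`."""
    d = np.linalg.norm(points - center, axis=1)
    d = d[d > 0]                         # exclude the center point itself
    hist, _ = np.histogram(d, bins=edges)
    return hist.astype(float)

def soft_shape_context(hist, width=1):
    """Triangular (low-pass) smoothing of a bin histogram, as in (4)."""
    taps = np.arange(1.0, width + 1.0)
    kernel = np.concatenate([taps, [width + 1.0], taps[::-1]])
    kernel /= kernel.sum()
    return np.convolve(hist, kernel, mode="same")
```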
3 The Proposed Adaptive Registration System

3.1 The Proposed Registration System

To assist surgeons in tracking regions of interest in the pre-stored CT images, we propose a registration system that registers the facial point data captured from 3-D range scanning equipment to the facial point data of the pre-stored CT images. The flowchart of the proposed registration system is shown in Fig. 1.
Fig. 1. The proposed system flowchart
The first process is the point-data capture process. The reference facial point data are extracted from pre-stored CT imaging according to the grayscale values and the desired width, and the floating facial point data are captured directly from a patient in the operation space by using 3-D range scanning equipment. The second process is the trimming process, which discards undesired areas of the floating point data to reduce the computation time and to increase the registration accuracy. The third process is the registration process. We propose a modified soft-shape-context ICP registration including an adaptive dual AK-D tree search algorithm (ADAK-D) for the nearest neighbor point search and an ICP with modified soft-shape-context information in the objective function. The capture functions of a laser range scanner named LinearLaser (LSH II 150) [14] and of a 3-D digitizer, MicroScribe3D G2X [15], are embedded in the system to capture on-site floating point data. The system interface is shown in Fig. 2. The red line in the left sub-window of the user's interface is used to assign the desired face width to be extracted from pre-stored CT imaging or other DICOM images, and we can select the desired slice, brightness, and contrast of the organs, such as skin or bone tissues, to obtain the model data set. The right sub-window is used to discard unwanted areas of the floating point
data and to display the registration result. After the registration, a new target imported from the 3-D digitizer, or the target inside the CT image pointed to by the mouse, can also be displayed in the right sub-window. An example is shown in Fig. 2, where the black point in the registration result of the right sub-window corresponds to the clicked point in the CT image.
Fig. 2. The system interface. The pointed target inside a CT slice is displayed in the registered coordinate as the black point.
3.2 The Adaptive Dual AK-D Tree Search Algorithm

In the search structures of the K-D tree and the AK-D tree, the node tree is generated by splitting a k-dimensional space along a fixed order of geometrical projection directions. Because only one projection direction, or one projection plane of the k dimensions, is considered to split the space each time, two truly nearest neighbor points may be located in two distant sub-root nodes of the binary node tree. It is assumed that if a nearest-neighbor point is true, then the same point should be returned for any order of geometrical projection planes when using the AK-D tree or the K-D tree. Based on this assumption, we utilize the AK-D tree twice, in the two projection axis orders "x, y, z" and "z, y, x", to examine the queried results. The AK-D tree is used here because of its runtime efficiency. If the two queried results returned by the AK-D tree in the two different projection plane orders are very close, then the query point and the returned queried point are reserved as significant coupling points; otherwise the query point is rejected. The proposed ADAK-D tree is a 5-step process to search for and determine the significant coupling points used in ICP. In each iteration of ICP, the nearest neighbor points are found in the following 5 steps. It is assumed that there is a minimum 3-D rectangular box C containing the complete floating data, that M is the area of the biggest plane of C, and that the initial P is the total number of floating points.
Step 1. The node tree of the reference/model data is built as Database 1 by using the AK-D tree in the x-y-z projection axis order iteratively.
Step 2. The node tree of the reference/model data is built as Database 2 by using the AK-D tree in the z-y-x projection axis order iteratively.
Step 3. The threshold T is computed by Eq. 7.
$$T = M / P \qquad (7)$$
Step 4. For a query point from the floating data set, if the distance between the two returned queried points from Database 1 and Database 2 is smaller than the distance T, then the query point and the returned queried point from Database 1 are reserved as a pair of significant coupling points.
Step 5. After all floating points have been queried, P is updated to the number of reserved pairs.
The significant points reserved by the ADAK-D tree are the coupling points [12] used to compute the best translation and rotation parameters in the later ICP stages. In order to avoid falling into a local minimum during ICP because of insufficient coupling points, T in Step 3 is automatically adjusted using the information from the previous iteration. If there are insufficient coupling points, i.e. P decreases in this iteration, then T increases in the next iteration, which causes P to increase in the next iteration.

3.3 The Modified Soft-Shape-Context ICP

The SICP has a practical weakness when it is used on 3-D point data. For 3-D point data, every point of the floating data and of the reference data generates its own bin histogram in the soft-shape-context ICP. This processing consumes a huge amount of computation time and makes ICP appear weaker when compared with other registration algorithms such as those using Mutual Information [3][4] or a Genetic Algorithm (GA) [1][5]. To reduce the computation time of generating bin histograms, only two bin histograms, for the centroid points of the floating and reference data, are calculated, and equations (5) and (6) are rewritten as (8) and (9).
$$e(R, T) = \left\{ \sum_i \| q_j - R(p_i) - T \| \right\} + \alpha \cdot E_{shape}(q_c, p_c) \qquad (8)$$
$$E_{shape}(q_c, p_c) = \sum_{l=1}^{B} \bigl( SSC_{q_c}(l) - SSC_{p_c}(l) \bigr) \qquad (9)$$
where q_c and p_c are the centroid points of the data sets q and p. For the closest-point search in the modified SICP (MSICP), the ADAK-D tree is used. The proposed system utilizes the modified soft-shape-context ICP together with procedures and functions suggested by medical practitioners to build a brain virtual reality environment.
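The coupling-point filtering of Section 3.2 can be sketched as follows (hypothetical code). SciPy's cKDTree performs an exact nearest-neighbor search, so the two queries below always agree and the threshold only matters with an approximate search such as the AK-D tree; the sketch is meant only to show the structure of the dual query and the agreement test with the threshold T of (7).

```python
import numpy as np
from scipy.spatial import cKDTree

def adakd_coupling_pairs(reference, floating, T):
    """Return (floating point, closest reference point) pairs kept only when
    trees built with two different axis orderings agree to within T."""
    tree_xyz = cKDTree(reference)                                   # x-y-z axis ordering
    tree_zyx = cKDTree(np.ascontiguousarray(reference[:, ::-1]))    # z-y-x axis ordering

    _, idx1 = tree_xyz.query(floating)
    _, idx2 = tree_zyx.query(floating[:, ::-1])

    # Agreement test: the two returned reference points must be close.
    disagreement = np.linalg.norm(reference[idx1] - reference[idx2], axis=1)
    keep = disagreement < T
    return floating[keep], reference[idx1[keep]]
```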
4 Experimental Results

Synthetic and real experiments were performed to compare the proposed ICP registration algorithms against other methods. In the synthetic experiments, the reference point data set is translated by 10 mm along the x, y, and z axes and rotated by 10 degrees around each of the three axes to generate 6 floating point data sets, which are
labeled (a), (b), (c), (d), (e), and (f), respectively. In the real experiments, the reference data sets are obtained from the pre-stored CT images, and the floating data sets are captured with a touch or non-touch capture device.

4.1 The Comparisons of ICP with AK-D Tree and ADAK-D Tree

In the synthetic experiments, the reference point data set contains 11291 points and is captured by LinearLaser. Six floating data sets are then generated from the reference point data set, as listed in Table 1. The root-mean-square (RMS) distance [13] is used to measure the performance. The RMS of the registration result using the AK-D tree in Table 1(a) is high because the solution was trapped in a local minimum. The results in Table 1 show that the proposed method improves the accuracy over the AK-D method in most of the cases. In the real experiments, CT facial point data are first extracted from the pre-stored CT data. The reference data are CT facial point data with 17876 points, as shown in Fig. 3(a), and the first floating data are laser-scan facial point data with 9889 points, as shown in Fig. 3(b). The RMS of the registration result using the AK-D tree, as
Table 1. The comparison of registration results of using laser-scan surface data

      AK-D tree method [11]            ADAK-D tree method
      RMS (mm)   Runtime (sec.)        RMS (mm)   Runtime (sec.)
(a)   16.53      2.25                  0.19       5.5
(b)   6.47       1.73                  0.19       5.53
(c)   2.54       1.98                  0.19       5.75
(d)   2.55       2.00                  0.19       5.79
(e)   2.53       2.00                  0.19       5.64
(f)   2.52       1.98                  1.06       5.65
Fig. 3. The illustrations of registration results from laser scan facial surface data to CT facial surface data (panels (a)–(d))
Fig. 4. The illustrations of registration results of using different capture ways by a 3D digitizer (panels (a)–(d))
in Fig. 3(c), is 2.47 mm and the runtime is 3.21 seconds. The RMS of the registration result using the ADAK-D tree, as in Fig. 3(d), is 1.16 mm and the runtime is 4.43 seconds. In the second experiment, the floating facial point data are captured by MicroScribe3D G2X. We test two different capture modes: the first is a discrete mode, as shown in Fig. 4(a), and the second is a continuous mode, as shown in Fig. 4(b). The registration results are 3.24 mm with 0.43 seconds and 0.989 mm with 0.578 seconds, as shown in Fig. 4(c) and 4(d), respectively; a good registration result is obtained when the capture mode contains more 3-D direction information, even though the floating point data contain only a few hundred points.

4.2 The Comparisons of ICP, Soft-Shape-Context ICP and the Proposed Algorithm

The comparisons of ICP, the soft-shape-context ICP (SICP), and the proposed modified soft-shape-context ICP (MSICP) are performed in the synthetic and real experiments. We use the K-D tree in ICP and SICP, and 12*6 shell-model bins covering the complete 3-D space are used in SICP and MSICP. In the synthetic experiment, in
SICP
Runtime (sec.)
RMS(mm)
MSICP
Runtime (sec.)
RMS(mm)
Runtime (sec.)
(a)
6.14
0.1
2.77
32
0.97
0.1
(b)
8.99
0.1
1.46
30
0.97
0.4
(c)
7.45
0.1
1.36
31
1.16
0.1
(d)
10.71
0.1
1.36
31
0.91
0.2
(e)
11.37
0.1
1.37
35
0.78
1.2
(f)
10.81
0.1
1.41
32
1.18
1.1
Fig. 5. The illustrations of registration results of using ICP, SICP and MSICP (panels (a)–(e)) in Table 3

Table 3. The registration results of the real experiment

      ICP                          SICP                         MSICP
      RMS (mm)   Runtime (sec.)    RMS (mm)   Runtime (sec.)    RMS (mm)   Runtime (sec.)
      2.62       0.5               1.16       103.5             0.69       0.93
Table 2 the reference point data set is the 3-D digitized data with 729 points captured by MicroScribe3D G2X in the first experiment and artificially moved along and around the axes. In the real experiment, the reference data are CT facial point data with 31053 points and the floating facial point data with 984 points are captured by MicroScribe3D G2X. The registration results of using ICP, SICP and MSICP are 2.62 mm, 1.16 mm, and 0.69 mm as shown in Fig. 5(c), (d) and (e) respectively and in Table 3.
5 Conclusion

An adaptive ICP registration system, which utilizes the ADAK-D tree for the closest-point search and the modified soft-shape-context objective function, is presented in this paper to assist surgeons in finding a surgery target within a medical virtual reality environment. The proposed system registers floating facial point data captured by touch or non-touch capture equipment to reference facial point data extracted from pre-stored CT imaging. In the closest-point search process, an adaptive dual AK-D tree search algorithm (ADAK-D tree) is used, which applies the AK-D tree twice in different partition axis orders to search for the nearest neighbor point and to determine the significant coupling points used as control points in ICP. An adaptive threshold for the determination of significant coupling points in the ADAK-D tree also maintains sufficient control points during the iterations. Experimental results illustrated the superiority of the proposed system using the ADAK-D tree over the registration system using the AK-D tree. In the objective function of ICP, the proposed system adopted the soft-shape-context objective function, which contains the shape projection information and the distance error information, to improve the accuracy. We then modified the soft-shape-context objective function to reduce the computation time while maintaining the accuracy. Experimental results of synthetic and real experiments have shown that the proposed
system is more robust than ICP or soft-shape-context ICP. Additional functions, such as tracking a desired location in the registration result, were suggested by surgeons and are embedded in the user's interface. In the future, we would like to try other optical or electromagnetic 3D digitizers to obtain better capture data, and to export the registration information to a stereo display system.
References
1. Chow, C.K., Tsui, H.T., Lee, T.: Surface registration using a dynamic genetic algorithm. Pattern Recognition 37(1), 105–117 (2004)
2. Liu, D., Chen, T.: Soft shape context for iterative closest point registration. In: IEEE International Conference on ICIP, pp. 1081–1084 (2004)
3. Tomazevic, D., Likar, B., Pernus, F.: 3-D/2-D registration by integrating 2-D information in 3-D. IEEE Transactions on Medical Imaging 25(1), 17–27 (2006)
4. Chen, H., Varshney, P.K., Arora, M.K.: Performance of mutual information similarity measure for registration of multitemporal remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 41(11), 2445–2454 (2003)
5. Zhang, H., Zhou, X., Sun, J., Sun, J.: A novel medical image registration method based on mutual information and genetic algorithm. In: International Conference on Computer Graphics, Imaging and Vision: New Trends, pp. 221–226 (2005)
6. Cleary, J.G.: Analysis of an algorithm for finding nearest neighbours in Euclidean space. ACM Transactions on Mathematical Software 5(2), 183–192 (1979)
7. Greenspan, M., Yurick, M.: Approximate K-D tree search for efficient ICP. In: Proceedings of the Fourth International Conference on 3-D Digital Imaging and Modeling (3DIM), pp. 442–448 (2003)
8. Besl, P.J., McKay, N.D.: A method for registration of 3D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14(2), 239–256 (1992)
9. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. PAMI 24(4), 509–522 (2002)
10. Bhandarkar, S.M., Chowdhury, A.S., Tang, Y., Yu, J., Tollner, E.W.: Surface matching algorithms for computer aided reconstructive plastic surgery. In: Proceedings of the IEEE International Symposium on Biomedical Imaging: Macro to Nano, vol. 1, pp. 740–743 (2004)
11. Rusinkiewicz, S., Levoy, M.: Efficient variants of the ICP algorithm. In: Proceedings of the International Conference on 3-D Digital Imaging and Modeling (3DIM), pp. 145–152 (2001)
12. Jost, T., Hugli, H.: A multi-resolution ICP with heuristic closest point search for fast and robust 3D registration of range images. In: Proceedings of the Fourth International Conference on 3-D Digital Imaging and Modeling (3DIM), pp. 427–433 (2003)
13. Zagrodsky, V., Walimbe, V., Castro-Pareja, C.R., Qin, J.X., Song, J.M., Shekar, R.: Registration-assisted segmentation of real-time 3-D echocardiographic data using deformable models. IEEE Transactions on Medical Imaging 24(9), 1089–1099 (2005)
14. 3DFamily Technology Corporation, http://www.3dfamily.com/
15. Immersion Corporation, http://www.immersion.com/digitizer/
Solution Method Using Correlated Noise for TSP Atsuko Goto and Masaki Kawamura Yamaguchi University, 1677-1 Yoshida, Yamaguchi, Japan {kawamura,atsuko21}@is.sci.yamaguchi-u.ac.jp
Abstract. We propose a solution method for optimization problems that uses correlated noise. Correlated noise has been introduced into neural networks to discuss the mechanism of the synfire chain. Kawamura and Okada introduced correlated noise into associative memory models and analyzed its dynamics. In associative memory models, memory patterns are memorized as attractors, i.e., minima of the system, and they found that the correlated noise can make the state transit between the attractors. However, the mechanism of this state transition is not yet fully understood. On the other hand, for combinatorial optimization problems, the energy function of a problem can be defined, and finding an optimum solution amounts to finding a minimum of the energy function. The steepest descent method searches for a solution by going down along the gradient direction; with this method, however, the state is usually trapped in a local minimum of the system. In order to escape from the local minimum, simulated annealing, i.e., the Metropolis method, or chaotic disturbance is introduced. These methods can be represented by adding thermal noise or chaotic inputs to the dynamical equation. In this paper, we show that correlated noise introduced into neural networks can be applied to solving optimization problems. We solve the TSP, a typical NP-hard combinatorial optimization problem, and evaluate the solutions obtained by the steepest descent method, simulated annealing, and the proposed method with correlated noise. As a result, in the case of ten cities, the proposed method with correlated noise obtains more optimum solutions than the steepest descent method and simulated annealing. In the cases of larger numbers of cities, where it is hard to find an optimum solution, our method obtains solutions of at least the same quality as simulated annealing.
1 Introduction
In the activities of nerve cells, synfire chains, i.e., synchronous firings of neurons, can often be observed [1]. To analyze the mechanism of synchronous firings, the conditions for propagating them between layers have been investigated in layered neural networks [2,3]. In layered neural networks, it has been proved that spatial correlation between neurons is necessary [4]. Aoki and Aoyagi [5]
have shown that the state transition in associative memory models is invoked not by thermal independent noise but by synchronous spikes. Kawamura and Okada [6] proposed associative memory models into which common external inputs are introduced, and found that the state could transit between attractors owing to these inputs. The synchronous spikes of the Aoki and Aoyagi model correspond to the common external inputs. In associative memory models, memory patterns are memorized as attractors. When we consider the energy function or cost function of the associative memory models, the attractors are represented by minima of the system. The state of the neurons is attracted to the memory pattern nearest to the initial state. An optimization problem is a problem of minimizing an energy function. In engineering and social science, optimization problems are important. The combinatorial optimization problem is one class of optimization problems: the problem of finding the solution that minimizes the value of an objective function within a feasible region. Since the number of feasible solutions is finite, optimal solutions could be obtained if we could search all feasible solutions. Such solution methods are known as enumeration methods, e.g., the branch-and-bound method and dynamic programming. However, combinatorial optimization problems are NP-hard, and therefore we cannot obtain solutions within an acceptable time by these methods. Therefore, instead of finding optimum solutions over the whole feasible region, methods that can find optimum or quasi-optimum solutions have been developed. In these methods, the optimum solutions are designed as minima of the energy function, and the problems are formulated as finding the global minimum of the energy function. The steepest descent method (SDM), simulated annealing (SA) [7,8,9], and chaotic methods [10,11] have been introduced in order to find the global minimum. Since the steepest descent method obtains solutions along the gradient direction, the state cannot escape from a local minimum. Therefore, thermal independent noise or chaotic noise is introduced to escape from the local minimum. Simulated annealing is the method using thermal independent noise. The optimum solution can be found by decreasing the temperature T according to T_{t+1} ≥ c/log(1 + t), where c is a constant and t represents time [9]. We consider the correlated noise introduced into associative memory models by Kawamura and Okada [6], since the correlated noise can make the state transit between attractors. The optimum and quasi-optimum solutions of an optimization problem can be regarded as attractors, and it is expected that better optimum solutions can be obtained more easily by using the correlated noise. We therefore propose a method with correlated noise in order to solve combinatorial optimization problems. The thermal noise used in simulated annealing corresponds to independent noise, since the noise is fed to each element independently. The correlated noise that we propose is fed to all elements in common. Therefore, the states of the elements have spatial correlation. We show that better solutions are obtained more efficiently by the proposed method than by simulated annealing and the steepest descent method.
2 TSP
The traveling salesman problem (TSP) is one of the typical combinatorial optimization problems. The TSP is the problem in which a salesman visits each city once and finds the shortest path. There are (N − 1)!/2 different cyclic paths for N cities. In this paper, we show that the correlated noise can be applied to this combinatorial optimization problem. This kind of problem is formulated as the problem of obtaining the minimum value of its energy function. The state variable V_{xi} takes 1 when the salesman visits the x-th city at the i-th order, and 0 when he does not. The energy function of the TSP is defined as
$$E = \alpha E_c + \beta E_o, \qquad (1)$$
$$E_c = \frac{1}{2}\sum_{x=1}^{N}\left(\sum_{i=1}^{N} V_{xi} - 1\right)^2 + \frac{1}{2}\sum_{i=1}^{N}\left(\sum_{x=1}^{N} V_{xi} - 1\right)^2 + \sum_{x=1}^{N}\sum_{i=1}^{N} V_{xi}(1 - V_{xi}), \qquad (2)$$
$$E_o = \frac{1}{2}\sum_{x=1}^{N}\sum_{\substack{y=1 \\ y \ne x}}^{N}\sum_{i=1}^{N} \frac{d_{xy}}{\langle d \rangle}\, V_{xi}\left(V_{y,i-1} + V_{y,i+1}\right), \qquad (3)$$
where the energies E_c and E_o represent the constraint term and the objective function, respectively. The constant d_{xy} represents the distance between the x-th and y-th cities, and the average distance ⟨d⟩ is given by
$$\langle d \rangle = \frac{1}{N(N-1)}\sum_{x=1}^{N}\sum_{y=1}^{N} d_{xy}, \qquad (4)$$
i.e., the average distance between all pairs of different cities. The coefficient α is usually set to α = 1, and β is decided according to the locations of the cities and their number. The optimum solutions are obtained by searching for the minima of the energy function E. The minima that satisfy E_c = 0 are called solutions, and the solutions that give the shortest paths are called optimum solutions.
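For reference, equations (1)–(4) can be transcribed directly into NumPy as follows (hypothetical code, not from the paper), with V stored as an N × N matrix whose entry V[x, i] indicates that city x is visited at order i, and cyclic indexing used for i ± 1.

```python
import numpy as np

def tsp_energy(V, dist, alpha=1.0, beta=0.35):
    """Energy E = alpha * E_c + beta * E_o of equations (1)-(3)."""
    N = V.shape[0]
    d_mean = dist.sum() / (N * (N - 1))                 # eq. (4), assuming d_xx = 0

    E_c = 0.5 * ((V.sum(axis=1) - 1.0) ** 2).sum()      # each city visited once
    E_c += 0.5 * ((V.sum(axis=0) - 1.0) ** 2).sum()     # one city per order
    E_c += (V * (1.0 - V)).sum()                        # push V toward {0, 1}

    D = dist / d_mean
    np.fill_diagonal(D, 0.0)                            # exclude y == x
    neighbours = np.roll(V, 1, axis=1) + np.roll(V, -1, axis=1)   # V_{y,i-1} + V_{y,i+1}
    E_o = 0.5 * np.einsum('xi,xy,yi->', V, D, neighbours)

    return alpha * E_c + beta * E_o
```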
3 Proposed Method
In order to obtain a local minimum of the energy function E by the steepest descent method, the state V_{xi}(t) is updated by
$$\mu \frac{du_{xi}(t)}{dt} = -u_{xi}(t) + \sum_{y=1}^{N}\sum_{j=1}^{N} W_{xiyj} V_{yj}(t) + \theta_{xi}, \qquad (5)$$
$$V_{xi}(t) = F(u_{xi}(t)), \qquad (6)$$
736
A. Goto and M. Kawamura
where the function F is the output function which decides output Vxi (t) according to the internal state uxi (t). We used the output function, ⎧ ⎪ ⎨1, 1 < u F (u) = u, 0 < u ≤ 1 . (7) ⎪ ⎩ 0, u ≤ 0 From the energy function, the constant Wxiyj is given by Wxiyj = −δx,y (1 − δi,j ) − δi,j (1 − δx,y ) − β
dxy (δi−1,j + δi+1,j )(1 − δx,y ), (8) d
and the external input θxi is constant θxi = 1. The delta function δx,y is defined as
$$\delta_{x,y} = \begin{cases} 1, & x = y \\ 0, & x \ne y \end{cases} \qquad (9)$$
When independent noise ζ_{xi}(t) is introduced into (5), the equation corresponds to simulated annealing. When correlated noise η(t) is introduced, the equation gives the proposed method. Therefore, we consider the equation given by
$$\mu \frac{du_{xi}(t)}{dt} = -u_{xi}(t) + \sum_{y=1}^{N}\sum_{j=1}^{N} W_{xiyj} V_{yj}(t) + \theta_{xi} + \zeta_{xi}(t) + \eta(t), \qquad (10)$$
$$V_{xi}(t) = F(u_{xi}(t)). \qquad (11)$$
We note that the independent noise ζ_{xi}(t) is fed to each neuron independently, whereas the correlated noise η(t) is fed to all neurons in common. We assume that the independent noise obeys a normal distribution with mean 0 and variance σ_ζ², and that the correlated noise obeys a normal distribution with mean 0 and variance σ_η².
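A minimal sketch of one Euler step of (10)–(11), together with the weight tensor of (8), is given below (hypothetical code, not from the paper; the step size dt, the naive treatment of the noise terms, and the cyclic handling of the order index i ± 1 are our own assumptions). The standard deviations passed in are the square roots of the variances σ_ζ² and σ_η².

```python
import numpy as np

def output_F(u):
    """Piecewise-linear output function F of (7)."""
    return np.clip(u, 0.0, 1.0)

def build_weights(dist, beta):
    """Weight tensor W_{xiyj} of (8), indexed as W[x, i, y, j]; the order
    index is treated cyclically, so delta_{i-1,j} and delta_{i+1,j} wrap."""
    N = dist.shape[0]
    d_mean = dist.sum() / (N * (N - 1))
    delta = np.eye(N)
    ring = np.roll(delta, 1, axis=1) + np.roll(delta, -1, axis=1)
    W = -np.einsum('xy,ij->xiyj', delta, 1.0 - delta)        # same city, different order
    W -= np.einsum('xy,ij->xiyj', 1.0 - delta, delta)        # same order, different city
    W -= beta * np.einsum('xy,ij->xiyj', (dist / d_mean) * (1.0 - delta), ring)
    return W

def euler_step(u, W, theta=1.0, mu=1.0, dt=0.01, std_ind=0.0, std_cor=0.0, rng=None):
    """One Euler step of (10)-(11) with independent and correlated noise."""
    rng = rng or np.random.default_rng()
    V = output_F(u)
    drive = np.einsum('xiyj,yj->xi', W, V) + theta
    noise = std_ind * rng.standard_normal(u.shape) + std_cor * rng.standard_normal()
    return u + (dt / mu) * (-u + drive + noise)
```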
4 Simulation Results

4.1 Locations of Cities
Figure 1 shows the locations of the 10 cities, which are arranged at random, and of the 29 cities named bayg29 in TSPLIB [12]. The shortest path visiting the 10 cities is -A-D-B-E-J-H-I-G-F-C-, and that for the 29 cities is -1-28-6-12-9-26-3-29-5-21-2-20-10-4-15-18-14-17-22-11-19-25-7-23-8-27-16-13-24-. The length of the shortest path is 2.69 for the 10 cities and 9074.15 for the 29 cities, given to two decimal places.
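The quoted path lengths can be recomputed from the city coordinates with a few lines such as the following (hypothetical code; `coords` is assumed to be an array of city coordinates and `tour` a sequence of row indices in visiting order).

```python
import numpy as np

def cyclic_tour_length(coords, tour):
    """Total length of the closed tour visiting coords[tour[0]], coords[tour[1]], ..."""
    tour = np.asarray(tour)
    steps = coords[tour] - coords[np.roll(tour, -1)]
    return np.linalg.norm(steps, axis=1).sum()
```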
Experimental Procedure
The initial values of internal state uxi (t) are determined at random with uniform distribution on the interval [−0.01, 0.01). Using the steepest descent method, the
Solution Method Using Correlated Noise for TSP
1
2400 B
21
1600
2 20
D
I
1200 18
G
800
0
1
400
13
27
0
23
16
15
19
25
400
7
11
22
(a) 10 cities
8 24
14
17 0
10
1
4
F C
28
5
29
J H
12 6
2000
A
9
26
3
E
737
800
1200
1600
2000
(b) 29 cities
Fig. 1. Locations of (a) 10 cities and (b) 29 cities. The paths show the optimal solutions.
method with independent noise, and the proposed method, we performed computer simulations with α = 1 in (1). For the 10-city case we ran 100 trials, and for the 29-city case 1000 trials. We evaluate the ratio R of the path length of an obtained solution to the optimum path length,
$$R = \frac{[\text{path length of obtained solution}]}{[\text{optimum path length}]}. \qquad (12)$$
Since the state must satisfy the 0–1 condition when it converges, the final state is given by
$$V_{xi} = H(u_{xi}), \qquad (13)$$
where the function H(u_{xi}) is given by
$$H(u_{xi}) = \begin{cases} 1, & u_{xi} > 0 \\ 0, & u_{xi} \le 0 \end{cases} \qquad (14)$$
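A small evaluation helper corresponding to (12)–(14) might look as follows (hypothetical code); it thresholds the internal state, checks the permutation constraints (E_c = 0), and computes the ratio R from a distance matrix.

```python
import numpy as np

def read_out(u):
    """Final state of (13)-(14): threshold the internal state to {0, 1}."""
    return (u > 0.0).astype(float)

def solution_ratio(V, dist, optimum_length):
    """Ratio R of (12); returns None if V is not a valid tour (E_c > 0)."""
    if not (np.all(V.sum(axis=0) == 1) and np.all(V.sum(axis=1) == 1)):
        return None
    order = np.argmax(V, axis=0)             # order[i] = city visited at step i
    length = sum(dist[order[i], order[(i + 1) % len(order)]]
                 for i in range(len(order)))
    return length / optimum_length
```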
4.3 Results
For the 10 cities, we set β = 0.35, the variance of the independent noise to σ_ζ² = 0.08, and the variance of the correlated noise to σ_η² = 0.08. We count the number of optimum solutions over 100 trials for this configuration. The histogram of the path lengths of the obtained solutions is shown in Fig. 2. The abscissa represents the ratio R in (12), and the ordinate represents the percentage of solutions with each ratio R. The number of solutions obtained by the steepest descent method is 17, that by the method with independent noise is 55, and that by the proposed
Fig. 2. Histogram of path length of obtained solutions for 10 cities. Solid, dotted, and broken lines represent results obtained by correlated noise (CN), independent noise (IN), and the steepest descent method (SDM), respectively.
Fig. 3. Residual energy by (a) independent noises (IN) and (b) correlated noise (CN) for 10 cities
method with correlated noise is 90. The proposed method obtains the most solutions among these methods. Next, Figure 3 shows the transition of the residual energy for the method with independent noise and for that with correlated noise, where the number of updating steps of the internal state is 1,000,000. The residual energy E_res is the difference between the energy E(t) of V_{xi} at time t and the energy of the optimum solution, E_opt:
$$E_{res} = E(t) - E_{opt}. \qquad (15)$$
We found that the energy does not reach 0 with the method with independent noise, but it does with the proposed method.
Fig. 4. Histogram of path length for 29 cities. We assumed β = 0.35, a variance of the independent noise of σ_ζ² = 0.04, and a variance of the correlated noise of σ_η² = 0.04. The number of solutions obtained by CN is 979, by IN 931, and by SDM 883.
Fig. 5. Solutions ratio (%) for 29 cities
For the 29 cities, we computed solutions with variances of the independent noise of σ_ζ² = 0.01–0.10 and variances of the correlated noise of σ_η² = 0.01–0.10. The optimum solution could not be obtained by any of these methods for this configuration. Therefore, we computed the path lengths of the obtained solutions. Figure 4 shows the histogram of the path lengths of the obtained solutions. The abscissa represents the ratio R, and the ordinate represents the percentage of solutions with each ratio R. The method with independent noise
Table 1. The optimum variance and the solution ratio for β = 0.35 and β = 0.5

                       β = 0.35                      β = 0.5
                       IN             CN             IN             CN
optimum variance       σ_ζ² = 0.03    σ_η² = 0.03    σ_ζ² = 0.05    σ_η² = 0.05
solution ratio (%)     93             98             56             75
and the proposed method obtain better solutions than the steepest descent method. We compare the solution ratio of the method with independent noise with that of the proposed method. Figure 5 shows the solution ratio as a function of the variances σ_ζ² and σ_η² for β = 0.35 and 0.50, and Table 1 shows the optimum variance and the solution ratio. In the case of β = 0.35, the solution ratio of the method with independent noise at σ_ζ² = 0.03 is 93%, and that of the proposed method at σ_η² = 0.03 is 98%; there is not much difference between them. On the other hand, in the case of β = 0.5, the solution ratio of the method with independent noise at σ_ζ² = 0.05 is 56%, while that of the proposed method at σ_η² = 0.05 is 75%. Namely, the proposed method obtains better solutions than the method with independent noise. We therefore found that, depending on β, the proposed method can be much more effective than the method with independent noise for obtaining solutions.
5
Conclusion
In associative memory models, correlated noise is effective in inducing state transitions. In this paper, we proposed a solution method using correlated noise and applied it to the TSP, which is a typical NP-hard combinatorial optimization problem. As the results show, for the case of 10 cities the proposed method with correlated noise obtains more solutions than both the steepest descent method and the method with independent noise. For the case of 29 cities, none of these methods could obtain the optimum solution. However, we found that the proposed method can obtain better solutions than the existing methods, depending on β. From these results, we can show that correlated noise is also effective for combinatorial optimization problems.
Acknowledgments This work was partially supported by a Grant-in-Aid for Young Scientists (B) No. 16700210. The computer simulation results were obtained using the PC cluster system at Yamaguchi University.
Bayesian Collaborative Predictors for General User Modeling Tasks Jun-ichiro Hirayama1, Masashi Nakatomi2 , Takashi Takenouchi1 , and Shin Ishii1,3 1
Graduate School of Information Science, Nara Institute of Science and Technology, Takayama 8916-5, Ikoma, Nara {junich-h,ttakashi}@is.naist.jp 2 Ricoh Company, Ltd.
[email protected] 3 Graduate School of Informatics, Kyoto University
[email protected]

Abstract. A collaborative approach is of crucial importance in user modeling to improve individual prediction performance when only an insufficient amount of data is available for each user. Existing methods such as collaborative filtering or multitask learning, however, have the limitation that they cannot readily handle situations in which the individual tasks are required to model a complex dependency structure among the task-related variables, such as one represented by a Bayesian network. Motivated by this issue, we propose a general approach for collaboration which can be applied to Bayesian networks, based on a simple use of the Bayesian principle. We demonstrate that the proposed method can improve both the prediction accuracy and its variance in many cases with insufficient data, in an experiment with a real-world dataset related to user modeling. Keywords: User modeling, collaborative method, Bayesian network, Bayesian inference.
1
Introduction
Predicting users' actions based on past observations of their behaviors is an important topic for developing personalized systems. Such prediction usually needs a user model that effectively represents the knowledge of a user or a group of users and is useful for the prediction. User modeling (or user profiling) [12,10,3] is currently an active research area for this aim, which seeks methods of acquiring user models and making predictions based on them. Recently, probabilistic graphical models such as Bayesian networks (BNs) [8,6] have attracted attention as an effective modeling tool in this field, because of their capacity to deal with relatively complex dependency structures among variables, which is a favorable property in general user modeling tasks. One crucial demand on user modeling is to develop "collaborative" methods which utilize the other users' information to improve the prediction of a target
user. This is due partly to the limited sample size for individual users [10], and partly to the assumption that users may share common intentions in their decisions. Consider an interactive system that is repeatedly used by many users, such as some sorts of web sites, say e-commerce ones, or publicly used electronic devices, say network printers. Not all the users necessarily interact with the system actively, so there may be an insufficient amount of data to construct reliable models for some users. This is also the case with new users of the system; prediction about new users would often fail due to their limited sample sizes. Furthermore, in large-scale problems, it would be almost impossible in practice to collect a sufficient amount of data for every user. This is actually the case in many e-commerce recommendation systems [9]. Collaborative methods are particularly important to compensate for the lack of individual data by considering the relationship to other users. Collaborative filtering (CF) [2] is probably the best known collaborative method in the context of recommendation, which estimates users' unknown ratings over items based on the known ratings of similar users. The problem setting of rating estimation is, however, a limited one in that it does not usually consider general dependency structures among multiple variables (including ratings and content-related ones), so the usual CF methods cannot be directly applied to more general tasks. Alternatively, multi-task learning (MTL) has recently been an active research topic in machine learning, which solves multiple related prediction tasks simultaneously to improve each task's performance by exploiting the generalization among the tasks. However, the application of existing MTL schemes to general graphical models is not so straightforward, especially when one considers the learning of the graph structure in addition to the parameters. One attempt at MTL of BNs has recently been reported in [7], in which a prior distribution over multiple graph structures of BNs is employed to force them to have similar structures. However, the joint determination of multiple model structures is computationally expensive, and the method does not allow the heterogeneity among users which usually exists in user modeling tasks. In this article, we propose a simple framework for collaborative user modeling, with particular interest in its application to Bayesian networks. The principal aim here is to improve the typically low performance of prediction in the initial phase after a user has started to interact with the system. Our approach flexibly realizes knowledge sharing among individual models that have already been learned individually, instead of following the usual MTL setting in which the models are learned simultaneously. A similar post-sharing approach has recently been investigated in a different context in [11], which focused on the selection of a relevant subset of individual models. In this article, we also evaluate the proposed method with a real dataset, which has been collected in a real-world user modeling task.
2
Bayesian Network
For a general purpose of user modeling, BN is one of the key tools for representing knowledge and making prediction of users [12]. While our proposed approach
Fig. 1. An example of DAG
described in the next section is not limited to a specific learning model, we focus on its application to (discrete) BNs in this article, considering the importance of BNs in user modeling. In this section, we briefly review learning and prediction with BNs. For more details, see [8,6]. A BN is a probabilistic graphical model that defines a class of parametric distributions over multiple random variables in terms of a directed acyclic graph (DAG). Fig. 1 shows an example of a DAG. Each node corresponds to a single variable, and each edge represents conditional dependence between the variables. BNs have a relatively high representational power in conjunction with effective prediction and marginalization algorithms such as belief propagation (BP) [8] or the junction tree algorithm (JT) [4]. In addition, BNs are attractive because of their human-interpretable knowledge representation in terms of a DAG. Let $v$ be the set of variables of interest. The probability distribution of a BN can be written, corresponding to a specific DAG $G$, as

$$p(v \mid \theta, G) = \prod_i p(v^{(i)} \mid \mathrm{pa}_i, \theta_i, G), \qquad (1)$$
where $v^{(i)}$ denotes the $i$-th element (node $i$) of $v$ and $\mathrm{pa}_i$ the set of parent variables of node $i$. The model parameters are denoted by $\theta = \{\theta_i\}$, where $\theta_i$ is the set of parameters that define the local conditional distribution of $v^{(i)}$. Note that both $\mathrm{pa}_i$ and $\theta_i$ are defined under the specification of the DAG $G$. The training of a BN is conducted in two steps. First, the structure of the BN is determined according to a certain scoring criterion (structure learning). Second, the conditional multinomial probability distribution of each node given its parents is estimated by Bayesian inference, assuming a conjugate Dirichlet prior (parameter learning). In this study, we assume no hidden variables. The most basic (but rather computationally expensive) scoring criterion for the structure learning is then the (log-)marginal likelihood of the graph structures. Once a structure is given, the parameter learning can be done in a straightforward manner.
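To make the parameter-learning step concrete, the following is a minimal Python sketch that estimates the conditional multinomial table of one node given its parents with a symmetric Dirichlet prior (posterior-mean estimate). It is only an illustration under simplifying assumptions, not the implementation used in this paper; all names (fit_cpt, arity, alpha) are ours.

```python
from collections import defaultdict
import itertools

def fit_cpt(data, node, parents, arity, alpha=1.0):
    """Estimate p(node | parents) from complete discrete data with a symmetric
    Dirichlet prior of alpha pseudo-counts per cell (posterior-mean estimate).

    data    : list of dicts mapping variable name -> integer state
    node    : name of the child variable
    parents : list of parent variable names (may be empty)
    arity   : dict mapping variable name -> number of states
    """
    counts = defaultdict(lambda: [alpha] * arity[node])
    for record in data:
        pa = tuple(record[p] for p in parents)
        counts[pa][record[node]] += 1.0

    cpt = {}
    for pa in itertools.product(*(range(arity[p]) for p in parents)):
        row = counts[pa]                     # unseen configurations keep the prior only
        total = sum(row)
        cpt[pa] = [c / total for c in row]
    return cpt

# toy usage with hypothetical printer-log attributes
data = [{"docPages": 0, "duplex": 1}, {"docPages": 0, "duplex": 1}, {"docPages": 2, "duplex": 0}]
print(fit_cpt(data, "duplex", ["docPages"], {"docPages": 5, "duplex": 2}))
```

A full implementation would combine such tables with a structure search over candidate DAGs scored by the marginal likelihood, as described above.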
3 Bayesian Collaborative Predictors

3.1 Prediction with Individual User Models

Consider that a user $u$ repeatedly interacts with a target system, generating a sample $v_u^n = \{x_u^n, t_u^n\}$ in the $n$-th usage, where $x_u$ and $t_u$ denote
the sets of input and target variables, respectively. Given a set of observations, $D_u = \{v_u^n \mid n = 1, 2, \ldots, N_u\}$, the individual prediction task is stated generally as predicting a new instance of the target, $t_u^{\mathrm{new}}$, given a new input $x_u^{\mathrm{new}}$. In this article, we assume both $x_u$ and $t_u$ are discrete variables (allowing the use of discrete BNs), while our approach can be applied in a straightforward manner to other cases in which the input and/or target variables are continuous. A key requirement of our approach is to construct an individual user model as a conditional probability distribution, $p_u(t_u \mid x_u)$, where we explicitly put the subscript $u$ to indicate the user. This can be done by any kind of probabilistic regressor or classifier, but we focus on the use of discrete BNs to realize it. Given a BN joint distribution $p_u(v_u)$ that has already been trained on the individual dataset $D_u$, which can be done as described in the previous section, there are several ways to obtain the conditional distribution. One naive way is to directly calculate the conditional probabilities for all possible realizations of $t_u$ and $x_u$ with the BN parameters, and then normalize them for each condition of $x_u$. We refer to this approach as the enumerative method. Such exhaustive enumeration becomes intractable as the number of variables increases, but it is easy to implement while enabling exact computation; it is thus useful in cases with a moderate number of variables. When there are a large number of variables, however, the exact calculation is not realistic, and other ways are required. State-of-the-art methods of probabilistic reasoning such as BP or JT can efficiently calculate the marginals of single target nodes or of a subset/clique of target nodes conditional on a given input, instead of the joint distribution over all the target variables. The conditional distribution $p_u(t_u \mid x_u)$ (which considers the joint probability over $t_u$) can then be approximated by the product of the resulting marginals. Based on the conditional distribution obtained by these methods, prediction for each individual can simply be made by taking its conditional mode:

$$\hat{t}_u^{\mathrm{new}} = \arg\max_{t_u} p_u(t_u \mid x_u = x_u^{\mathrm{new}}). \qquad (2)$$

In this article, this type of prediction is referred to as individual prediction, in contrast to collaborative prediction.
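As an illustration of the enumerative method and of the individual prediction in Eq. (2), the following hedged Python sketch enumerates all target configurations for a given input, normalizes the joint scores into $p_u(t_u \mid x_u)$, and returns the conditional mode. The callable joint_prob stands in for the trained BN joint distribution of user $u$; it and the other names are illustrative assumptions rather than the paper's code.

```python
import itertools

def individual_predict(joint_prob, x_new, target_arities):
    """Enumerative sketch of the individual prediction of Eq. (2).

    joint_prob     : callable (x, t) -> joint probability under user u's trained BN
    x_new          : the new input configuration (a tuple of discrete states)
    target_arities : number of states of each target variable
    """
    targets = list(itertools.product(*(range(a) for a in target_arities)))
    scores = [joint_prob(x_new, t) for t in targets]
    z = sum(scores)
    conditional = [s / z for s in scores]        # p_u(t | x = x_new)
    i_best = max(range(len(targets)), key=lambda i: conditional[i])
    return targets[i_best], conditional[i_best]
```

With the task setting used later in the experiments (|t| = 20), the enumeration stays trivially cheap.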
3.2 Bayesian Collaboration of Pre-learned Predictors
Consider that individual user models, i.e., the conditional distributions, have already been learned for $U$ users. The number of training samples used in the learning, however, may be quite different for each user. The prediction accuracy for users with small sample sizes can be low, typically with a high uncertainty/variance of learning. The aim of this study is to deal with this problem by introducing a framework for knowledge sharing among the pre-learned individual user models, based on a Bayesian perspective. Our approach is based on a rather tricky use of the Bayesian principle. For convenience of explanation, we assume for a moment a virtual agent who makes predictions for a specific user $s$ by collecting the information of the other users.
Suppose the agent can freely access any other user's model and dataset. The available information for the agent is then the $U$ conditional models, $p_1(t_1 \mid x_1), p_2(t_2 \mid x_2), \ldots, p_U(t_U \mid x_U)$, each of which is for a different task but somehow related to the others, and also the $U$ original datasets, $D_1, D_2, \ldots, D_U$. The trick here is to regard each of the other users' models as another hypothesis that also describes the behavior of user $s$. The agent then turns out to have $U$ conditional hypotheses, $p_1(t_s \mid x_s), p_2(t_s \mid x_s), \ldots, p_U(t_s \mid x_s)$, for the specific prediction task of user $s$. If the agent is a Bayesian, a natural choice in such a situation is to compute the posterior distribution over the $U$ hypotheses, given the corresponding dataset $D_s$, and then form a predictive distribution by taking the posterior average over the conditional models. Using the notation $T_s = \{t_s^n \mid n = 1, 2, \ldots, N_s\}$ and $X_s = \{x_s^n \mid n = 1, 2, \ldots, N_s\}$, the posterior distribution over the $U$ models is given as

$$\pi_s(u \mid D_s) = \frac{p_u(T_s \mid X_s)\,\pi_s(u)}{\sum_{u=1}^{U} p_u(T_s \mid X_s)\,\pi_s(u)} = \frac{\prod_{n=1}^{N_s} p_u(t_s^n \mid x_s^n)\,\pi_s(u)}{\sum_{u=1}^{U} \prod_{n=1}^{N_s} p_u(t_s^n \mid x_s^n)\,\pi_s(u)}, \qquad (3)$$
where $\pi_s$ denotes the subjective belief of the agent for user $s$; $\pi_s(u)$ is the prior belief over the models. With this posterior, the Bayesian predictive distribution is given as

$$\bar{p}_s(t_s \mid x_s) \equiv \sum_{u=1}^{U} \pi_s(u \mid D_s)\, p_u(t_s \mid x_s). \qquad (4)$$
The final prediction is made, according to the predictive distribution, as

$$\hat{t}_s^{\mathrm{new}} = \arg\max_{t_s} \bar{p}_s(t_s \mid x_s = x_s^{\mathrm{new}}), \qquad (5)$$
which we refer to as (Bayesian) collaborative prediction in this study. We note that in Eq. (4) the predictive distribution does not depend on the other users' data. This may be an advantage in some distributed environments, for example when the users are at different sites and communication between them is costly. The posterior probability of another user's model becomes large when user $s$ has only a limited number of training data and there exist users similar to user $s$, in the sense that they generate similar outputs given the same inputs. The posterior integration will then incorporate additional knowledge from the similar users' models into the prediction for user $s$. The introduced knowledge is expected to be effective in improving the prediction accuracy. In addition, the model averaging would reduce the uncertainty/variance of the model. On the other hand, if user $s$ has a sufficient amount of data or there are no similar users, the posterior probability $\pi_s(u = s \mid D_s)$ would be close to one, which reduces the collaborative predictor to the individual predictor; this is quite natural, because the model is successfully learned individually or user $s$ is isolated.
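The following is a rough Python sketch of Eqs. (3)-(5), assuming every user model is available as a function returning $p_u(t \mid x)$ and that the target space can be enumerated. The names (models, D_s, targets) and details such as log-domain accumulation are our own implementation choices, not part of the paper.

```python
import numpy as np

def collaborative_predict(models, D_s, x_new, targets, prior=None):
    """Sketch of Eqs. (3)-(5): weight each user's conditional model by its
    posterior probability on user s's data D_s, then predict with the mixture.

    models  : list of U callables, models[u](t, x) = p_u(t | x)
    D_s     : list of (x, t) pairs observed for the target user s
    x_new   : new input for which a prediction is required
    targets : iterable of all possible target configurations (hashable)
    prior   : prior belief pi_s(u); uniform if None
    """
    U = len(models)
    log_post = np.log(np.full(U, 1.0 / U) if prior is None else np.asarray(prior, float))
    for x, t in D_s:                       # Eq. (3), accumulated in the log domain
        log_post += np.log([models[u](t, x) for u in range(U)])
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()                     # posterior pi_s(u | D_s)

    # Eq. (4): posterior-averaged predictive distribution over the target space
    p_bar = {t: float(sum(post[u] * models[u](t, x_new) for u in range(U))) for t in targets}
    t_hat = max(p_bar, key=p_bar.get)      # Eq. (5): conditional mode of the mixture
    return post, p_bar, t_hat
```

When $D_s$ is empty the posterior stays at the prior, and as $N_s$ grows the posterior typically concentrates on the target user's own model, matching the behavior discussed above.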
4 Experimental Results

4.1 Printer Usage Data
We evaluated the proposed approach using a real-world user modeling task. The dataset used here, which we refer to as the "printer usage" dataset, is a set of electronic log data collected through the daily use of printer devices that are shared via a network within a section of a company to which one of the authors belongs. The motivation has been presented in detail in [5], although the current task setting and dataset are slightly different from the previous ones. The log data consist of many records, where a single record corresponds to one printing job output via the network to one of the shared printers. Each record originally consists of a number of attributes, including the user ID, context-related identifiers such as date or time, and content- or function-related identifiers such as the number of document pages or the use of duplex/simplex printing. The aim of the experiment here, however, is not to construct full user models but to evaluate the basic performance of our approach. We therefore pre-processed the original log data to produce a rather compact task setting appropriate for the evaluation purpose. The log data were first separated into those of each individual according to the user ID. Then, in each individual log, only the records that meet the following condition were extracted: the paper size is A4, with only one copy and only one page per single sheet. This is usually the default setting of printer interfaces, and the usage of most users is strongly biased toward this setting. By limiting to the default condition, the distribution of the other attributes became regularly balanced and the dataset thus became suitable for normative performance evaluation. In this reduced dataset, we consider the five attributes other than the fixed attributes, which are summarized in Table 1. The values of the attribute modelName were replaced by anonymous ones. We quantized the values of the attribute docPages, which can take any natural number, as shown in the table. The attribute docExt may originally take various values, but the number of frequently appearing values is small, such as pdf, xls, html, etc. We thus extracted only the records that include these frequent values and removed the others from the experiment. Finally, we removed the records including missing values. In this experiment, we also fixed both the input and target variables within the five attributes; that is, we consistently set x = {docExt, docPages} and t = {modelName, duplex, colorMode}, where the numbers of possible realizations are |x| = 25 and |t| = 20. After the pre-processing, the total number of users became 76, and the numbers of individual training data were quite different, ranging from 2 to 1,192 (Fig. 2).
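A rough sketch of this preprocessing, assuming the log is available as a pandas DataFrame, is given below; the column names for the fixed attributes (paperSize, copies, pagesPerSheet) and for userID are assumptions, while the five retained attributes follow Table 1.

```python
import pandas as pd

FREQUENT_EXT = {"doc", "html", "pdf", "ppt", "xls"}

def preprocess(log: pd.DataFrame) -> dict:
    """Reduce the raw log to the default-setting records and split it per user."""
    default = log[(log["paperSize"] == "A4")
                  & (log["copies"] == 1)
                  & (log["pagesPerSheet"] == 1)]
    default = default[default["docExt"].isin(FREQUENT_EXT)].dropna().copy()
    # quantise docPages into the bins of Table 1
    default["docPages"] = pd.cut(default["docPages"],
                                 bins=[0, 1, 5, 20, 50, float("inf")],
                                 labels=["1", "2-5", "6-20", "21-50", "51-over"])
    return {uid: df.drop(columns=["userID"]) for uid, df in default.groupby("userID")}
```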
4.2 Simulation Setting
Although our method is particularly expected to improve the prediction performance of the users located toward the right of Fig. 2 (those with small sample sizes) in real
Table 1. Five attributes in printer logs

Attribute    Description                             Values
modelName    Anonymous form of model names           Pr1, Pr2, Pr3, Pr4, Pr5
docExt       File extension of original document     doc, html, pdf, ppt, xls
docPages     Number of pages in original document    1, 2-5, 6-20, 21-50, 51-over
duplex       Duplex or simplex                       duplex, simplex
colorMode    Color mode                              monochrome, fullColor
Fig. 2. The original numbers of data for 76 users.
situations, the limited number of data for such users prevents a quantitative evaluation of the prediction performance. In this experiment, therefore, we tested the collaborative predictors for the top 16 users (the leftmost users in Fig. 2), while artificially varying the number of training data, $N_u$, from relatively small to large by random selection. For each setting of $N_u$, twenty runs were conducted, in each of which both the collaborative and individual predictors were constructed based on the same data and their performances were then evaluated (see below). In contrast, the number of test data was commonly set at 200 in every run. In each single run for a user $s$, both the training data and the test data were randomly selected from the original data without overlap. The BN for user $s$ was first constructed based on the training dataset. The BN was trained in a standard manner as described in Sec. 2, where our implementation used the "deal" package written in the language R. To realize the structure learning, we simply used the heuristic optimization provided by "deal" as the function heuristic() (see [1] for the details), which searches for a good structure according to a partially greedy stepwise ascent of the score function starting from multiple initial conditions. The Dirichlet hyperparameters were commonly set at a constant such that the total number of prior pseudo-counts [6] was five. To obtain the conditional model $p_s(t_s \mid x_s)$ from the resultant joint distribution, we used the enumerative method. After learning, with this individual predictor $p_s$, we constructed the corresponding collaborative predictor $\bar{p}_s$ as follows. First, prior to the simulation, we trained individual predictors for all 76 users based on the numbers of data shown in Fig. 2, without limiting the number of the top 16 users' data. An ensemble of 76 individual predictors, $p_1^{org}, p_2^{org}, \ldots, p_U^{org}$, which we refer to as the original ensemble, was thus prepared in advance. The collaborative predictor $\bar{p}_s$ was then constructed by first replacing $p_s^{org}$ in the original ensemble with $p_s$, and then forming $\bar{p}_s$ based on the replaced
ensemble. The prior $\pi_s(u)$ was simply set to be uniform. To evaluate the prediction performance on the test dataset, we calculated the test accuracy, defined as the fraction of test cases in which the predictions of the three target variables were all correct.

4.3 Results
Fig. 3. Test accuracy for the 16 users. The vertical axis denotes the test accuracy and the horizontal axis the number of training data. The solid (black) and dashed (gray) lines respectively show the mean over the 20 runs by the collaborative (proposed) and individual predictors. The error bars represent ±SD (standard deviation).
Fig. 4. Left: the difference in test accuracy (Collaborative − Individual). This figure collectively shows the 16 users' results. A mark x denotes an actual value, a solid line denotes the mean value, and an error bar denotes ±SD. Right: the individual variance of test accuracy. Solid and dashed lines respectively denote the collaborative and individual predictors. An error bar denotes the standard deviation over the 16 users.

Fig. 3 shows the test accuracies obtained by the individual and collaborative predictors. Each panel corresponds to one of the 16 users, where the vertical axis denotes the test accuracy and the horizontal axis the number of training data; only the cases with $N_u$ = 5, 10, 20, 30, 40, and 50 are shown. In this figure, the individual predictors often exhibited relatively lower accuracy and larger variances when $N_u$ was less than about 20, in comparison to the cases with larger $N_u$. In contrast, the collaborative predictors improved these undesirable results of the individual predictors in that they showed a higher mean accuracy for a number of users, and also smaller variances for almost all the users. Fig. 4 (left) shows the improvement in test accuracy by the Bayesian collaboration over the individual predictor (i.e., the accuracy of the collaborative predictor minus that of the corresponding individual one). This figure collectively plots the results of the 16 users, where each point denotes the improvement in a single run for a single user. The improvement was achieved in many runs, especially with small samples. Fig. 4 (right) shows the variance of test accuracy of each individual user against the number of training data, where each error bar denotes the standard deviation over the 16 users. This figure clearly shows that the variance of test performance was substantially reduced by collaborative prediction in comparison to the individual one, especially in the cases of small samples.
5
Summary
In this article, we proposed a new collaborative framework for user modeling, with special interest in its application to BNs, which have recently become a popular modeling tool in general user modeling tasks. Our method is essentially a simple use of the Bayesian principle, but the key idea of regarding the other users' models as hypotheses for the target user is consistent with the basic assumption of collaborative methods, i.e., that there may be some other users similar to the target user. The effectiveness of the proposed method was demonstrated with a real-world dataset related to user modeling. While the improvements by our
method were shown only in a rather limited range of sample sizes, say less than 20, it should be noted that this range is likely to extend in more realistic problems having a large number of variables, where the needed amount of training data increases. More detailed performance evaluation, and also the investigation of remaining issues such as the computational cost or the effective setting of the prior distribution $\pi_s(u)$, are our future tasks.
References
1. Böttcher, S.G., Dethlefsen, C.: deal: A package for learning Bayesian networks. Journal of Statistical Software 8(20) (2003)
2. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proc. 14th Conf. on Uncertainty in Artificial Intelligence, pp. 43–52. Morgan Kaufmann, San Francisco (1998)
3. Godoy, D., Amandi, A.: User profiling in personal information agents: a survey. Knowl. Eng. Rev. 20(4), 329–361 (2005)
4. Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B 50(2), 157–224 (1988)
5. Nakatomi, M., Iga, S., Shinnishi, M., Nagatsuka, T., Shimada, A.: What affects printing options? - Toward personalization & recommendation system for printing devices. In: International Conference on Intelligent User Interfaces (Workshop: Beyond Personalization 2005) (2005)
6. Neapolitan, R.E.: Learning Bayesian Networks. Prentice-Hall, Inc., Upper Saddle River (2003)
7. Niculescu-Mizil, A., Caruana, R.: Inductive transfer for Bayesian network structure learning. In: Proc. 11th International Conf. on AI and Statistics (2007)
8. Pearl, J.: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo (1988)
9. Schafer, J.B., Konstan, J.A., Riedl, J.: E-commerce recommendation applications. Data Mining and Knowledge Discovery 5(1–2), 115–153 (2001)
10. Webb, G.I., et al.: Machine learning for user modeling. User Modeling and User-Adapted Interaction 11(1–2), 19–29 (2001)
11. Zhang, Y., Burer, S., Nick Street, W.: Ensemble pruning via semi-definite programming. Journal of Machine Learning Research 7, 1315–1338 (2006)
12. Zukerman, I., Albrecht, D.: Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction 11(1) (2001)
Discovery of Linear Non-Gaussian Acyclic Models in the Presence of Latent Classes Shohei Shimizu1,2 and Aapo Hyvärinen1 1
Helsinki Institute for Information Technology, Finland 2 The Institute of Statistical Mathematics, Japan http://www.hiit.fi/neuroinf
Abstract. An effective way to examine causality is to conduct an experiment with random assignment. However, in many cases it is impossible or too expensive to perform controlled experiments, and hence one often has to resort to methods for discovering good initial causal models from data which do not come from such controlled experiments. We have recently proposed such a discovery method based on independent component analysis (ICA) called LiNGAM and shown how to completely identify the data generating process under the assumptions of linearity, non-gaussianity, and no latent variables. In this paper, after briefly recapitulating this approach, we extend the framework to cases where latent classes (hidden groups) are present. The model identification can be accomplished using a method based on ICA mixtures. Simulations confirm the validity of the proposed method.
1
Introduction
An effective way to examine causality is to conduct an experiment with random assignment [1]. However, in many cases it is impossible or too expensive to perform controlled experiments. Hence one often has to resort to methods for discovering good initial causal models from data which do not come from such controlled experiments, though obviously one can never fully prove the validity of a causal model from such uncontrolled data alone. Thus, developing methods for causal inference from uncontrolled data is a fundamental problem with a very large number of potential applications such as social sciences [2], gene network estimation [3] and brain connectivity analysis [4]. Previous methods developed for statistical causal analysis of non-experimental data [2, 5, 6] generally work in one of two settings. In the case of discrete data, no functional form for the dependencies is usually assumed. On the other hand, when working with continuous variables, a linear-Gaussian approach is almost invariably taken and has hence been based solely on the covariance structure of the data. Because of this, additional information (such as the time-order of the variables and prior information) is usually required to obtain a full causal model of the variables. Without such information, algorithms based on the Gaussian assumption cannot in most cases distinguish between multiple equally possible causal models.
We have recently shown that when working with continuous-valued data, a significant advantage can be achieved by departing from the Gaussianity assumption [7,8,9]. The linear-Gaussian approach usually only leads to a set of possible models that are equivalent in their covariance structure. The simplest such case is that of two variables, x1 and x2 . A method based only on the covariance matrix has no way of preferring x1 → x2 over the reverse model x1 ← x2 [2, 7]. However, a linear-non-Gaussian setting actually allows the linear acyclic model to be uniquely identified [9]. In this paper, we extend our previous work to cases where latent classes (hidden groups) are present. The paper is structured as follows. In Section 2 we briefly describe the basics of LiNGAM and subsequently extend the framework in Section 3. Some illustrative examples are provided in Section 4, and the proposed method is empirically evaluated in Section 5. Section 6 concludes the paper.
2
LiNGAM
Here we provide a brief review of our previous work [9]. Assume that we observe data generated from a process with the following properties:
1. The observed variables $x_i$, $i \in \{1, \ldots, n\}$, can be arranged in a causal order $k(i)$, defined to be an ordering of the variables such that no later variable in the order participates in generating the value of any earlier variable. That is, the generating process is recursive [2], meaning it can be represented graphically by a directed acyclic graph (DAG) [5, 6].
2. The value assigned to each variable $x_i$ is a linear function of the values already assigned to the earlier variables, plus a 'disturbance' (noise) term $e_i$, plus an optional constant term $\mu_i$, that is

$$x_i = \sum_{k(j) < k(i)} b_{ij} x_j + e_i + \mu_i. \qquad (1)$$

... the above derived local algorithm could be computationally very time consuming.
In this section, we propose an alternative simple approach which converts the problem to a simple tri-NMF model: ¯ T +N ¯, Y¯ = GDG
(18)
Q ¯ = Q Dq = diag{d¯1 , d¯2 , . . . , d¯J } and N ¯ = where Y¯ = q=1 Y q ∈ RI×I , D q=1 Q I×I . q=1 N q ∈ R The above system of linear algebraic equations can be represented in an equivalent scalar form as: y¯it = j gij gtj d¯j + n ¯ it or equivalently in the vector form: ¯ where g are columns of G. Y¯ = j g j d¯j g Tj + N j Such a simple model is justified if noise in the frontal slices is uncorrelated. It is interesting to note that the model can be written in the equivalent form: ˜G ˜T +N ¯, Y¯ = G
(19)
˜ = GD ¯ 1/2 , assuming that D ¯ ∈ RJ×J is a non-singular matrix. Thus, the where G problem can be converted to a standard symmetric NMF problem to estimate ˜ Using any available NMF algorithm: Multiplicative, FP-ALS, or PG, matrix G. ˜ we can estimate the matrix G. For example, by minimizing the following regularized cost function: T
˜G ˜ )= D(Y¯ ||G
1 ¯ ˜G ˜ T ||2 + αG ||G|| ˜ 1 ||Y − G F 2
(20)
and applying the FP-ALS approach, we obtain the following simple algorithm ˜ ← (Y¯ T G ˜ − αG E)(G ˜ T G) ˜ −1 , G (21) +
˜ to unit-length in each iteration step, subject to normalization the columns of G where E is the matrix of all ones of appropriate size. 3.2
Row-Wise and Column-Wise Unfolding Approach
It is worth noting that the diagonal matrices D q are scaling matrices that can be absorbed by the matrix G. By defining the column-normalized matrices Gq = GD q , we can use the following simplified models: Y q = Gq GT + N q ,
(q = 1, . . . , Q)
(22)
Y q = GGTq + N q ,
(q = 1, . . . , Q).
(23)
or equivalently
These simplified models can be described by a single compact matrix equation using column-wise or row-wise unfolding as follows Y c = Gc GT ,
(24)
788
A. Cichocki et al.
or Y r = GGTr ,
(25)
where Y c = Y Tr = [Y 1 ; Y 2 ; . . . ; Y Q ] ∈ RI ×I is the column-wise unfolded matrix of the slices Y q and Gc = GTr = [G1 ; G2 ; . . . ; GQ ] ∈ RJI×I is columnwise unfolded matrix of the matrices Gq = GD q (q = 1, 2, , . . . , I). Using any efficient NMF algorithm (multiplicative, IPN, quasi-Newton, or FP-ALS) [23,24,25,26,27,28,9,13], we can estimate the matrix G. For example, by minimizing the following cost function: 2
D(Y c ||Gc GT ) =
1 ||Y c − Gc GT ||2F + αG ||G||1 2
(26)
and applying the FP-ALS approach, we obtain the following iterative algorithm: G ← ([Y Tc Gc − αG E]+ )(GTc Gc )−1 (27) +
or equivalently G ← ([Y r Gr − αG E]+ )(GTr Gr )−1 ,
(28)
+
where Gc = GTr = [GD 1 ; GD 2 ; . . . ; GD Q ], Dq = diag{g q } and g q means q-th row of G. 3.3
Semi-orthogonality Constraint
The matrix G is usually very sparse and additionally satisfies orthogonality constraints. We can easily impose orthogonality constraint by incorporating additionally the following iterations: −1/2 G ← G GT G . 3.4
(29)
Simulation Results
All the NTF algorithms presented in this paper have been tested for many difficult benchmarks for signals and images with various statistical distributions of signals and additive noise. Comparison and simulation results will be presented in the ICONIP-2007.
4
Conclusions and Discussion
We have proposed the generalized and flexible cost function (controlled by sparsity penalty/regularization terms) that allows us to derive a family of SNTF algorithms. The main objective and motivations of this paper is to derive simple multiplicative algorithms which are especially suitable both for very sparse
representation and highly over-determined cases. The basic advantage of the multiplicative algorithms is their simplicity and relatively straightforward generalization to L-order tensors (L > 3). However, the multiplicative algorithms are relatively slow. We found that simple approaches which convert a SNTF problem to a symmetric NMF (SNMF) or symmetric tri-NMF (ST-NMF) problem provide the more efficient and fast algorithms, especially for large scale problems. Moreover, by imposing orthogonality constraints, we can drastically improve performance, especially for noisy data. Obviously, there are many challenging open issues remaining, such as global convergence and an optimal choice of the associated parameters.
References 1. Amari, S.: Differential-Geometrical Methods in Statistics. Springer, Heidelberg (1985) 2. Hazan, T., Polak, S., Shashua, A.: Sparse image coding using a 3D non-negative tensor factorization. In: International Conference of Computer Vision (ICCV), pp. 50–57 (2005) 3. Workshop on tensor decompositions and applications, CIRM, Marseille, France (2005) 4. Heiler, M., Schnoerr, C.: Controlling sparseness in non-negative tensor factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 56–67. Springer, Heidelberg (2006) 5. Smilde, A., Bro, R., Geladi, P.: Multi-way Analysis: Applications in the Chemical Sciences. John Wiley and Sons, New York (2004) 6. Shashua, A., Zass, R., Hazan, T.: Multi-way clustering using super-symmetric nonnegative tensor factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 595–608. Springer, Heidelberg (2006) 7. Zass, R., Shashua, A.: A unifying approach to hard and probabilistic clustering. In: International Conference on Computer Vision (ICCV), Beijing, China (2005) 8. Sun, J., Tao, D., Faloutsos, C.: Beyond streams and graphs: dynamic tensor analysis. In: Proc.of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006) 9. Berry, M., Browne, M., Langville, A., Pauca, P., Plemmons, R.: Algorithms and applications for approximate nonnegative matrix factorization. In: Computational Statistics and Data Analysis (in press, 2006) 10. Cichocki, A., Zdunek, R., Amari, S.: Csiszar’s divergences for non-negative matrix factorization: Family of new algorithms. In: Rosca, J.P., Erdogmus, D., Pr´ıncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 32–39. Springer, Heidelberg (2006) 11. Cichocki, A., Amari, S., Zdunek, R., Kompass, R., Hori, G., He, Z.: Extended SMART algorithms for non-negative matrix factorization. In: Rutkowski, L., ˙ Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2006. LNCS (LNAI), vol. 4029, pp. 548–562. Springer, Heidelberg (2006) 12. Cichocki, A., Zdunek, R.: NTFLAB for Signal Processing. Technical report, Laboratory for Advanced Brain Signal Processing, BSI, RIKEN, Saitama, Japan (2006)
13. Dhillon, I., Sra, S.: Generalized nonnegative matrix approximations with Bregman divergences. In: Neural Information Proc. Systems, Vancouver, Canada, pp. 283– 290 (2005) 14. Kim, M., Choi, S.: Monaural music source separation: Nonnegativity, sparseness, and shift-invariance. In: Rosca, J.P., Erdogmus, D., Pr´ıncipe, J.C., Haykin, S. (eds.) ICA 2006. LNCS, vol. 3889, pp. 617–624. Springer, Heidelberg (2006) 15. Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999) 16. Mørup, M., Hansen, L.K., Herrmann, C.S., Parnas, J., Arnfred, S.M.: Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG. NeuroImage 29, 938–947 (2006) 17. Miwakeichi, F., Martinez-Montes, E., Valds-Sosa, P., Nishiyama, N., Mizuhara, H., Yamaguchi, Y.: Decomposing EEG data into space−time−frequency components using parallel factor analysi. NeuroImage 22, 1035–1045 (2004) 18. Zass, R., Shashua, A.: Nonnegative sparse pca. In: Neural Information Processing Systems (NIPS), Vancuver, Canada (2006) 19. Zass, R., Shashua, A.: Doubly stochastic normalization for spectral clustering. In: Neural Information Processing Systems (NIPS), Vancuver, Canada (2006) 20. Comon, P.: Tensor decompositions-state of the art and applications. In: McWhirter, J.G., Proudler, I.K. (eds.) Institute of Mathematics and its Applications Conference on Mathematics in Signal Processing, pp. 18–20. Clarendon Press, Oxford, UK (2001) 21. Byrne, C.L.: Choosing parameters in block-iterative or ordered-subset reconstruction algorithms. IEEE Transactions on Image Processing 14, 321–327 (2005) 22. Minami, M., Eguchi, S.: Robust blind source separation by Beta-divergence. Neural Computation 14, 1859–1886 (2002) 23. Cichocki, A., Zdunek, R., Choi, S., Plemmons, R., Amari, S.I.: Novel multi-layer nonnegative tensor factorization with sparsity constraints. In: Beliczynski, B., Dzielinski, A., Iwanowski, M., Ribeiro, B. (eds.) ICANNGA 2007. LNCS, vol. 4432, pp. 271–280. Springer, Heidelberg (2007) 24. Cichocki, A., Zdunek, R.: Regularized alternating least squares algorithms for nonnegative matrix/tensor factorizations. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds.) ISNN 2007. LNCS, vol. 4493, pp. 793–802. Springer, Heidelberg (2007) 25. Cichocki, A., Zdunek, R., Amari, S.: New algorithms for non-negative matrix factorization in applications to blind source separation. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP2006, Toulouse, France, vol. 5, pp. 621–624 (2006) 26. Cichocki, A., Zdunek, R., Choi, S., Plemmons, R., Amari, S.: Nonnegative tensor factorization using Alpha and Beta divergencies. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), Honolulu, Hawaii, USA, vol. III, pp. 1393–1396 (2007) 27. Zdunek, R., Cichocki, A.: Nonnegative matrix factorization with constrained second-order optimization. Signal Processing 87, 1904–1916 (2007) 28. Zdunek, R., Cichocki, A.: Nonnegative matrix factorization with quadratic programming. Neurocomputing (accepted, 2007) 29. Shashua, A., Hazan, T.: Non-negative tensor factorization with applications to statistics and computer vision. In: Proc. of the 22-th International Conference on Machine Learning, Bonn, Germany (2005)
Probabilistic Tensor Analysis with Akaike and Bayesian Information Criteria Dacheng Tao1, Jimeng Sun2, Xindong Wu3, Xuelong Li4, Jialie Shen5, Stephen J. Maybank4, and Christos Faloutsos2 1
Department of Computing, Hong Kong Polytechnic University, Hong Kong
[email protected] 2 Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA
[email protected],
[email protected] 3 Department of Computer Science, University of Vermont, Burlington, USA
[email protected] 4 Sch. Computer Science & Info Systems, Birkbeck, University of London, London, UK
[email protected],
[email protected] 5 School of Information Systems, Singapore Management University, Singapore
[email protected] Abstract. From data mining to computer vision, from visual surveillance to biometrics research, from biomedical imaging to bioinformatics, and from multimedia retrieval to information management, a large amount of data are naturally represented by multidimensional arrays, i.e., tensors. However, conventional probabilistic graphical models with probabilistic inference only model data in vector format, although they are very important in many statistical problems, e.g., model selection. Is it possible to construct multilinear probabilistic graphical models for tensor format data to conduct probabilistic inference, e.g., model selection? This paper provides a positive answer based on the proposed decoupled probabilistic model by developing the probabilistic tensor analysis (PTA), which selects suitable model for tensor format data modeling based on Akaike information criterion (AIC) and Bayesian information criterion (BIC). Empirical studies demonstrate that PTA associated with AIC and BIC selects correct number of models. Keywords: Probabilistic Inference, Akaike Information Criterion, Bayesian Information Criterion, Probabilistic Principal Component Analysis, and Tensor.
1 Introduction

In computer vision, data mining, and other applications, objects are naturally represented by tensors or multidimensional arrays, e.g., the gray face image in biometrics, the colour image in scene classification, the colour video shot (as shown in Figure 1) in multimedia information management, TCP flow records in computer networks, and the DBLP bibliography in data mining [6]. Therefore, tensor based data modeling has become very popular and a large number of learning models have been developed, from unsupervised learning to supervised learning, e.g., high order singular value decomposition [5] [1] [2], n-mode component analysis or tensor principal component
Fig. 1. A colour video shot is a fourth order tensor. Four indices are used to locate elements. Two indices are used for pixel locations; one index is used to locate the colour information; and the other index is used for time. The video shot comes from http://www-nlpir.nist.gov/projects/trecvid/.
analysis (TPCA) [5] [10] [6], three-mode data principal component analysis [4], tucker decomposition [9], tensorface [10], generalized low rank approximations of matrix [12], two dimensional linear discriminant analysis [11], dynamic tensor analysis [6] and general tensor discriminant analysis [7]. However, all these tensor based learning models lack systematic justifications at the probabilistic level. Therefore, it is impossible to conduct conventional statistical tasks, e.g., model selection, over existing tensor based learning models. The difficulty of applying probability theory, probabilistic inference, and probabilistic graphical modeling to tensor based learning models comes from the gap between the tensor based datum representation and the vector based input requirement in traditional probabilistic models. Is it possible to apply the probability theory and to develop probabilistic graphical models for tensors? Is it possible to apply the statistical models over specific tensor based learning models? To answer the above questions, we narrow down our focus on generalizing the probabilistic principal component analysis (PPCA) [8], which is an important generative model [3] for subspace selection or dimension reduction for vectors. This generalization forms the probabilistic tensor analysis (PTA), which is a generative model for tensors. The significances of PTA are as following: 1) Providing a probabilistic analysis for tensors. Based on PTA, a probabilistic graphical model can be constructed, the dimension reduction procedure for tensors could be understood at the probabilistic level, and the parameter estimation can be formulated by utilizing the expectation maximization (EM) algorithm under the maximum likelihood framework [3]; 2) Providing statistical methods for model selection based on the Akaike and Bayesian information criteria (AIC and BIC) [3]. It is impossible for conventional tensor based subspace analysis to find a criterion for model selection, i.e., determining the appropriate number of retained dimensions to model the original tensors. By
providing specific utilizations of AIC and BIC, the model selection for tensor subspace analysis comes true. In PTA associated with AIC/BIC, the number of retained dimensions can be chosen by an iterative procedure; and 3) Providing a flexible framework for tensor data modeling. PTA assumes entries in tensor measurements are drawn from multivariate normal distributions. This assumption could be changed for different applications, e.g., multinomial/binomial distributions in sparse tensor modeling. With different assumptions, different dimension reduction algorithm could be developed for different applications.
2 Probabilistic Tensor Analysis

In this section, we first construct the latent tensor model, which uses a multilinear mapping to relate the observed tensors to unobserved latent tensors. Based on the proposed latent tensor model, we then develop probabilistic tensor analysis with dimension reduction and data reconstruction. A detailed description of tensor terminology can be found in [9].

2.1 Latent Tensor Model

Similar to the latent variable model [8], the latent tensor model multilinearly relates high dimensional observation tensors $\mathcal{T}_i \in \mathbb{R}^{l_1\times l_2\times\cdots\times l_{M-1}\times l_M}$ for $1 \le i \le n$ to the
corresponding latent tensors $\mathcal{X}_i \in \mathbb{R}^{l'_1\times l'_2\times\cdots\times l'_{M-1}\times l'_M}$ in the low dimensional space, i.e.,
$$\mathcal{T}_i = \mathcal{X}_i \prod_{d=1}^{M} \times_d U_d^T + \mathcal{M} + \mathcal{E}_i, \qquad (1)$$
where $U_d \in \mathbb{R}^{l'_d\times l_d}$ ($1 \le d \le M$) is the $d$th projection matrix; $\mathcal{M} = (1/n)\sum_{i=1}^{n}\mathcal{T}_i$ is
the mean tensor of all observed tensors $\mathcal{T}_i$; and $\mathcal{E}_i$ is the $i$th residue tensor, every entry of which follows $N(0, \sigma^2)$. Moreover, the number of effective dimensions of the latent
tensors is upper bounded by $l_d - 1$ for the $d$th mode, i.e., $l'_d \le l_d - 1$ for the $d$th projection matrix $U_d$. Here, $1 \le i \le n$ and $n$ is the number of observed tensors. The projection matrices $U_d$ ($1 \le d \le M$) construct the multilinear mapping between the observations and the latent tensors.
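As an illustration of Eq. (1), the following NumPy sketch draws one observation from the latent tensor model for a second-order example; mode_dot, sample_observation and the shapes used are our own illustrative choices under the stated assumptions, not code from the paper.

```python
import numpy as np

def mode_dot(X, A, d):
    """Mode-d product X x_d A, where A has shape (new_dim, X.shape[d])."""
    Xd = np.tensordot(A, X, axes=(1, d))      # contracted mode becomes axis 0
    return np.moveaxis(Xd, 0, d)

def sample_observation(X, Us, Mean, sigma, rng):
    """One draw from Eq. (1): T = X x_1 U_1^T ... x_M U_M^T + Mean + E,
    with i.i.d. N(0, sigma^2) entries in E.  Us[d] has shape (l_d', l_d),
    so U_d^T maps mode d from l_d' up to l_d."""
    T = X
    for d, U in enumerate(Us):
        T = mode_dot(T, U.T, d)               # U.T has shape (l_d, l_d')
    return T + Mean + sigma * rng.standard_normal(Mean.shape)

rng = np.random.default_rng(0)
Us = [rng.random((2, 5)), rng.random((3, 6))]   # projection matrices U_d (l_d' x l_d)
X = rng.standard_normal((2, 3))                 # latent tensor
Mean = np.zeros((5, 6))                          # mean tensor M
T = sample_observation(X, Us, Mean, 0.01, rng)
print(T.shape)                                   # (5, 6)
```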
2.2 Probabilistic Tensor Analysis Armed with the latent tensor model and the probabilistic principal component analysis (PPCA), a probabilistic treatment of TPCA is constructed by introducing hyperparameters with prior distribution p ( M ,U d |dM=1 , σ 2 | D ) , where D = {Ti |in=1} is the set contains all observed tensors. According to the Bayesian theorem, with D the probabilistic modeling of TPCA or PTA is defined by the predictive density, i.e.,
p ( T | D ) = p ( T | M,U d |dM=1 , σ 2 ) p ( M,U d |dM=1 , σ 2 | D ) ,
(2)
where U d |Md =1∈ R ld′ ×ld , p ( T | M ,U d |dM=1 , σ 2 ) is the predictive density with the given
full probabilistic model, and p ( M ,U d |dM=1 , σ 2 | D ) is the posterior probability. The
model parameters ( M ,U d |dM=1 , σ 2 ) can be obtained by applying maximum a posterior (MAP). In PTA, to obtain ( M ,U d |dM=1 , σ 2 ) , we have the following concerns:
1) the probabilistic distributions are defined over vectors but not tensors. Although the vectorization operation is helpful to utilize the conventional probability theory and inferences, the computational cost will be very high and almost intractable for practical applications. Based on this perspective, it is important to develop a method to obtain the probabilistic model in a computational tractable way. In the next part, the decoupled probabilistic model is developed to significantly reduce the computational complexity. 2) how to determine the number of retained dimensions to model observed tensors? In probability theory and statistics, the Akaike information criterion (AIC) [3] and the Bayesian information criterion (BIC) [3] are popular in model selection. However, both AIC and BIC are developed for data represented by vectors. Therefore, it is important to generalize AIC and BIC for tensor data. Based on the above discussions, to obtain the projection matrices U d |Md =1 is computationally intractable, because the model requires obtaining all projection matrices simultaneously. To construct a computationally tractable algorithm to obtain U d |Md =1 , we can construct decoupled probabilistic model for (1), i.e., obtain each
Fig. 2. Decoupled probabilistic graphical model for probabilistic tensor analysis
projection matrix U d separately and form an alternating optimization procedure. The decoupled predictive density p ( T | M ,U d |dM=1 , σ 2 ) is defined as
M G p ( T | M,U d |dM=1 , σ 2 ) ∝ ∏ p ( T ×d U d | μd , σ d2 ) .
(3)
d =1
G where μd ∈ Rld is the mean vector for the dth mode; σ d2 is the variance of the noise to model the residue on the dth mode. The decoupled posterior distribution with the given observed tensors D is M G p ( M,U d |dM=1 , σ 2 | D ) ∝ ∏ p ( μ d ,U d , σ d2 | D, U k |1k≤≠kd≤ M ) .
(4)
d =1
Therefore, based on (3) and (4), the decoupled predictive density is
(
)
M G G p ( T | D ) ∝ ∏ p ( T ×d U d | μ d , σ d2 ) × p ( D,U k |1k≤≠kd≤ M | μ d ,U d , σ d2 ) . d =1
(5)
With (4) and (5), we have the decoupled probabilistic graphical model as shown in Figure 2. G Consider the column space of the projected data set Dd = {td ; j } , where the sample G td ; j ∈ R ld is the jth column of Td = mat d ( T ×d U d ) and Td ∈ R ld × ld , where ⎛ M ⎞ G ld = ∏ k ≠ d lk′ . Therefore, 1 ≤ j ≤ nld . Let X d = mat d ⎜ T∏ ×k U k ⎟ and xd ; j ∈ R ld′ is ⎝ k =1 ⎠ G G th T G 2 the j column of X d . Suppose, td | xd ~ N (U d xd + μ d , σ d I ) and the marginal G distribution of the latent tensors xd ~ N ( 0, I ) , then the marginal distribution of the G G G observed projected mode–d vector td is also Gaussian, i.e., td ~ N ( μ d , Cd ) . The L′ G mean vector and the corresponding covariance matrix are μd = ∑ j d=1 td ; j nld and
(
)( )
Cd = U U d + σ I , respectively. Based on the property of Gaussian properties, we have G G G xd | td ~ N ( M d−1U d ( td − μd ) , σ d2 M d−1 ) , (6) T d
2 d
where M d = (U d U dT + σ d2 I ) . In this decoupled model, the objective is to find the dth projection matrix U d based on MAP with given U k |1k≤≠kd≤ M , i.e., calculate U d by maximizing log p (U d | D ×d U d ) ∝ −
nld ⎡ log det ( Cd ) + tr ( Cd−1 Sd ) ⎤ , ⎦ 2 ⎣
(7)
where log ( a ) is the natural logarithm of a and Sd is the sample covariance matrix G of td . The eq. (7) is benefited from the decoupled definition of the probabilistic model, defined by (5). Based on (7), the total log of posterior distribution is
796
D. Tao et al. M M ⎧ ⎫ nl L = ∑ log p (U d | D ×d U d ) ∝ −∑ ⎨ d ⎡⎣log det ( Cd ) + tr ( Cd−1 Sd ) ⎤⎦ ⎬. d =1 d =1 ⎩ 2 ⎭
(8)
To implement MAP for the dth projection matrix U d estimation, the EM algorithm is applied here. The expectation of the log likelihood of complete data with respect to G G p ( xd ; j | td ; j , μ d ,U d , σ d2 ) is given by M n M nld G G E ( Lc ) = ∑∑ E ⎡log p ( Ti , Xi | U j ≠ d ) ⎤ = ∑∑ E ⎡log p ( td ; j , xd ; j ) ⎤. ⎣ ⎦ ⎣ ⎦ d =1 i =1 d =1 j =1
(
)
(
)
(9)
1 G G G 2 td ; j − U dT xd ; j − ud ;i . 2
(10)
G G Here, log p ( td ; j , xd ; j ) with the given U k |1k≤≠kd≤ M is given by
(
)
G G G log p ( td ; j , xd ; j ) ∝ − xd ; j
(
)
2
− d log (σ 2 ) −
σ
It is impossible to maximize E ( Lc ) with respect to all projection matrices U d |dM=1 because different projection matrices are inter-related [7] during optimization procedure, i.e., it is required to know U j ≠ d to optimize U d . Therefore, we need to apply alternating optimization procedure [7] for optimization. To optimize the dth projection matrix U d with σ d2 , we need the decoupled expectation of the log likelihood function on the dth mode: G
∑ E ⎡⎣log ( p ( t nld
j =1
d; j
)
G , xd ; j ) ⎤ ⎦
1 G G GT G T G G ⎞ ⎛ 2 ⎜ tr E ⎡⎣ xd ; j xd ; j ⎤⎦ + d log (σ ) + σ 2 ( td ; j − ud ;i ) ( td ; j − ud ;i ) ⎟ ⎟. ∝ −∑ ⎜ G 2 1 G G G GT ⎟ j =1 ⎜ T ⎜ − 2 E ⎡⎣ xd ; j ⎤⎦ U d ( td ; j − ud ;i ) + 2 tr U d U d E ⎡⎣ xd ; j xd ; j ⎤⎦ ⎟ σ ⎝ σ ⎠ Based on (6), then we have G G G E ⎡⎣ xd ; j ⎤⎦ = M d−1U d ( td ; j − μ d )
(
nld
)
(
(11)
)
(12)
and G G G G E ⎡⎣ xd ; j xdT; j ⎤⎦ = σ d2 M d−1 + E ⎡⎣ xd ; j ⎤⎦ E ⎡⎣ xdT; j ⎤⎦ .
(13)
Eq. (12) and (13) form the expectation step or E-step. The maximization step or M-step is obtained by maximizing nld G G E ⎡log p ( td ; j , xd ; j ) ⎤ with respect to U d and σ d2 . In detail, by setting ∑ ⎣ ⎦ j =1
(
)
⎡ nld G G ∂U d ⎢ ∑ E ⎡log p ( td ; j , xd ; j ) ⎣ ⎣ j =1
(
⎤
)⎤⎦ ⎥ = 0 , we have ⎦
−1
⎡ nld G G G ⎤ ⎡ nl G G ⎤ U d = ⎢ ∑ E ⎡⎣ xd ; j xdT; j ⎤⎦ ⎥ ⎢ ∑ E ⎡⎣ xd ; j ⎤⎦ ( td ; j − μd ) ⎥ ; ⎦ ⎣ j =1 ⎦ ⎣ j =1
(14)
Probabilistic Tensor Analysis with Akaike and Bayesian Information Criteria
and by setting $\partial_{\sigma_d^2}\Big[\sum_{j=1}^{n\tilde{l}_d}E\big[\log p(\vec{t}_{d;j},\vec{x}_{d;j})\big]\Big] = 0$, we have

$\sigma_d^2 = \dfrac{1}{n\tilde{l}_d\,l_d}\sum_{j=1}^{n\tilde{l}_d}\Big\{\|\vec{t}_{d;j}-\vec{\mu}_d\|^2 - 2E\big[\vec{x}_{d;j}\big]^T U_d(\vec{t}_{d;j}-\vec{\mu}_d) + \mathrm{tr}\big(E\big[\vec{x}_{d;j}\vec{x}_{d;j}^T\big]U_dU_d^T\big)\Big\}$.  (15)
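The E-step (12)–(13) and M-step (14)–(15) for a single mode can be written compactly in matrix form. The following is a minimal NumPy sketch of this per-mode EM loop, not the authors' implementation; the function and variable names (pta_mode_em, Td) are illustrative, and the projected mode-d samples are assumed to be already collected as columns of one matrix.

```python
import numpy as np

def pta_mode_em(Td, l_prime, n_iter=50, seed=0):
    """One-mode EM for the decoupled PTA model, Eqs. (12)-(15).

    Td      : (l_d, N) matrix whose columns are the projected mode-d samples t_{d;j}
    l_prime : target latent dimension l'_d
    Returns the projection matrix U_d (l'_d x l_d) and the noise variance sigma^2_d.
    """
    rng = np.random.default_rng(seed)
    l_d, N = Td.shape
    mu = Td.mean(axis=1, keepdims=True)          # mean vector mu_d
    Tc = Td - mu                                 # centred samples
    U = rng.standard_normal((l_prime, l_d))      # random initialisation of U_d
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step, Eqs. (12)-(13)
        M = U @ U.T + sigma2 * np.eye(l_prime)   # M_d = U_d U_d^T + sigma^2 I
        Minv = np.linalg.inv(M)
        Ex = Minv @ U @ Tc                       # E[x_{d;j}] for all j (l'_d x N)
        ExxT = N * sigma2 * Minv + Ex @ Ex.T     # sum_j E[x x^T]
        # M-step, Eqs. (14)-(15)
        U = np.linalg.solve(ExxT, Ex @ Tc.T)     # [sum E[xx^T]]^{-1} [sum E[x](t - mu)^T]
        resid = np.sum(Tc**2) - 2 * np.sum((U @ Tc) * Ex) + np.trace(ExxT @ U @ U.T)
        sigma2 = resid / (N * l_d)
    return U, sigma2
```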
2.3 Dimension Reduction and Data Reconstruction
After obtaining the projection matrices $U_d|_{d=1}^{M}$, the following operations are important for different applications:

Dimension Reduction: Given the projection matrices $U_d|_{d=1}^{M}$ and an observed tensor $\mathcal{T}\in\mathbb{R}^{l_1\times l_2\times\cdots\times l_{M-1}\times l_M}$ in the high dimensional space, how do we find the corresponding latent tensor $\mathcal{X}\in\mathbb{R}^{l_1'\times l_2'\times\cdots\times l_{M-1}'\times l_M'}$ in the low dimensional space? From tensor algebra, the dimension reduction is given by $\mathcal{X} = \mathcal{T}\prod_{d=1}^{M}\times_d U_d$. However, this rule lacks a probabilistic interpretation. Under the proposed decoupled probabilistic model, $\mathcal{X}$ is obtained by maximizing $p(\mathcal{X}\mid\mathcal{T}) \propto \prod_{d=1}^{M}p(\vec{x}_d\mid\vec{t}_d)$. The dimension reduction is

$\mathcal{X} = (\mathcal{T}-\mathcal{M})\prod_{d=1}^{M}\times_d\big(M_d^{-1}U_d\big)$.  (16)
Data Reconstruction: Given the projection matrices $U_d|_{d=1}^{M}$ and the latent tensor $\mathcal{X}\in\mathbb{R}^{l_1'\times l_2'\times\cdots\times l_{M-1}'\times l_M'}$ in the low dimensional space, how do we approximate the corresponding observed tensor $\mathcal{T}\in\mathbb{R}^{l_1\times l_2\times\cdots\times l_{M-1}\times l_M}$ in the high dimensional space? Based on (16), the data reconstruction procedure is given by

$\hat{\mathcal{T}} = \mathcal{X}\prod_{d=1}^{M}\times_d\big(U_d^T(U_dU_d^T)^{-1}M_d\big) + \mathcal{M}$.  (17)

The reconstruction error is given by $\|\mathcal{T}-\hat{\mathcal{T}}\|_{\mathrm{Fro}}$.
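A minimal sketch of the dimension-reduction rule (16) and the reconstruction rule (17) follows, assuming third order tensors and projection matrices $U_d$ of size $l_d'\times l_d$; the helper mode_product and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def mode_product(T, A, mode):
    """Mode-n product T x_n A; the rows of A give the new size of that mode."""
    T = np.moveaxis(T, mode, 0)
    out = A @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((A.shape[0],) + T.shape[1:]), 0, mode)

def pta_reduce(T, mean, Us, sigma2s):
    """Dimension reduction, Eq. (16): X = (T - M) prod_d x_d (M_d^{-1} U_d)."""
    X = T - mean
    for d, (U, s2) in enumerate(zip(Us, sigma2s)):
        Md = U @ U.T + s2 * np.eye(U.shape[0])
        X = mode_product(X, np.linalg.solve(Md, U), d)
    return X

def pta_reconstruct(X, mean, Us, sigma2s):
    """Data reconstruction, Eq. (17): T_hat = X prod_d x_d (U_d^T (U_d U_d^T)^{-1} M_d) + M."""
    T = X
    for d, (U, s2) in enumerate(zip(Us, sigma2s)):
        Md = U @ U.T + s2 * np.eye(U.shape[0])
        B = U.T @ np.linalg.solve(U @ U.T, Md)   # l_d x l'_d back-projection matrix
        T = mode_product(T, B, d)
    return T + mean
```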
2.4 Akaike and Bayesian Information Criteria for PTA

AIC and BIC are popular methods for model selection in statistics. However, they were developed for vector data, whereas in the proposed PTA the data are in tensor form. Therefore, it is important to find a suitable way to utilize AIC and BIC for tensor based learning models. In PTA, the conventional AIC and BIC can be applied to determine the size of $U_d|_{d=1}^{M}$. An exhaustive search based on AIC (BIC) is applied for model selection. In detail, for AIC based model selection, we need to calculate the AIC score
$J_d^{AIC}(U_d,\sigma_d^2,l_d') = \big(2l_dl_d' + 2 - l_d'(l_d'-1)\big) + n\tilde{l}_d\Big[\log\det\big(U_d^TU_d+\sigma_d^2I\big) + \mathrm{tr}\big((U_d^TU_d+\sigma_d^2I)^{-1}S_d\big)\Big]$  (18)

for each mode, $\prod_{d=1}^{M}(l_d-1)$ times in total, because the number of rows $l_d'$ in each projection matrix $U_d$ changes from 1 to $(l_d-1)$. In the determination stage, the optimal $l_d'^*$ is

$l_d'^* = \arg\min_{l_d'} J_d^{AIC}(U_d,\sigma_d^2,l_d')$,  (19)
where $1 \le l_d' \le l_d - 1$. For BIC based model selection in PTA, we have a similar definition as for AIC,

$J_d^{BIC}(U_d,\sigma_d^2,l_d') = \log(n\tilde{l}_d)\Big(l_dl_d' + 1 - \dfrac{l_d'(l_d'-1)}{2}\Big) + n\tilde{l}_d\Big[\log\det\big(U_d^TU_d+\sigma_d^2I\big) + \mathrm{tr}\big((U_d^TU_d+\sigma_d^2I)^{-1}S_d\big)\Big]$  (20)

for each mode, $\prod_{d=1}^{M}(l_d-1)$ times in total. In the determination stage, the optimal $l_d'^*$ is

$l_d'^* = \arg\min_{l_d'} J_d^{BIC}(U_d,\sigma_d^2,l_d')$,  (21)

where $1 \le l_d' \le l_d - 1$.
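The exhaustive search in (18)–(21) can be sketched as follows, assuming a routine fit_pta_mode (for example the EM sketch above) that returns $U_d$ and $\sigma_d^2$ for a candidate $l_d'$; the BIC score (20) is shown, and the AIC score (18) differs only in the penalty term. Names are illustrative.

```python
import numpy as np

def bic_score(U, sigma2, Sd, n_samples):
    """BIC score of Eq. (20) for one candidate l'_d."""
    l_p, l_d = U.shape
    C = U.T @ U + sigma2 * np.eye(l_d)                     # C_d = U_d^T U_d + sigma^2 I
    _, logdet = np.linalg.slogdet(C)
    fit = n_samples * (logdet + np.trace(np.linalg.solve(C, Sd)))
    penalty = np.log(n_samples) * (l_d * l_p + 1 - l_p * (l_p - 1) / 2)
    return penalty + fit

def select_mode_dim(Td, fit_pta_mode):
    """Exhaustive search over l'_d = 1 .. l_d - 1, Eq. (21)."""
    l_d, N = Td.shape
    Sd = np.cov(Td, bias=True)                             # sample covariance of t_{d;j}
    scores = []
    for l_p in range(1, l_d):
        U, sigma2 = fit_pta_mode(Td, l_p)                  # e.g. the EM sketch above
        scores.append(bic_score(U, sigma2, Sd, N))
    return int(np.argmin(scores)) + 1, scores
```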
3 Empirical Study

In this section, we utilize a synthetic data model to evaluate BIC PTA in terms of model selection accuracy. For AIC PTA, we obtained very similar experimental results. The accuracy is measured by the model selection error $\sum_{d=1}^{M}|l_d' - l_d'^*|$. Here, $l_d'$ is the real model, i.e., the real dimension of the $d$th mode of the unobserved latent tensor; and $l_d'^*$ is the selected model, i.e., the dimension of the $d$th mode of the unobserved latent tensor selected by BIC PTA. A multilinear transformation is applied to map the tensor from the low dimensional space $\mathbb{R}^{l_1'\times l_2'\times\cdots\times l_M'}$ to the high dimensional space $\mathbb{R}^{l_1\times l_2\times\cdots\times l_M}$ by $\mathcal{T}_i = \mathcal{X}_i\prod_{d=1}^{M}\times_d U_d^T + \mathcal{M} + \varsigma\,\mathcal{E}_i$, where $\mathcal{X}_i\in\mathbb{R}^{l_1'\times l_2'\times\cdots\times l_M'}$ and every entry of every unobserved latent tensor $\mathcal{X}_i$ is generated from a standard Gaussian, i.e., $N(0,1)$; $\mathcal{E}_i$ is the noise tensor and every entry $e_j$ is drawn from $N(0,1)$; $\varsigma$ is a scalar set to 0.01; the mean tensor $\mathcal{M}\in\mathbb{R}^{l_1\times l_2\times\cdots\times l_M}$ is a random tensor with every entry drawn from the uniform distribution on the interval $[0,1]$; the projection matrices $U_d|_{d=1}^{M}\in\mathbb{R}^{l_d'\times l_d}$ are random matrices with every entry drawn from the uniform distribution on $[0,1]$; and $i$ denotes the $i$th tensor measurement.
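A small sketch of this synthetic generator for the two-mode case follows (for second order tensors the two mode products reduce to ordinary matrix multiplications); the sizes match the first experiment below, and all names are illustrative.

```python
import numpy as np

def make_synthetic(n=10, l=(8, 6), lp=(3, 2), noise=0.01, seed=0):
    """Generate n observations T_i = X_i x_1 U_1^T x_2 U_2^T + M + noise * E_i (M = 2 modes)."""
    rng = np.random.default_rng(seed)
    U1 = rng.uniform(0, 1, size=(lp[0], l[0]))   # projection matrices, entries ~ U[0,1]
    U2 = rng.uniform(0, 1, size=(lp[1], l[1]))
    mean = rng.uniform(0, 1, size=l)             # mean tensor, entries ~ U[0,1]
    data = []
    for _ in range(n):
        X = rng.standard_normal(lp)              # latent tensor, entries ~ N(0,1)
        E = rng.standard_normal(l)               # noise tensor, entries ~ N(0,1)
        data.append(U1.T @ X @ U2 + mean + noise * E)
    return np.stack(data), (U1, U2), mean
```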
Fig. 3. BIC score matrices for the first and the second projection matrices. Each block corresponds to a BIC score; the darker the block, the smaller the BIC score. Based on this figure, we determine $l_1'^* = 3$ and $l_2'^* = 2$ from the BIC obtained by PTA, and the model selection error is 0. Figure 4 shows the Hinton diagram of the first and the second projection matrices in the left and the right sub-figures, respectively; these projection matrices are obtained from PTA by setting $l_1'^* = 7$ and $l_2'^* = 5$.
In the first experiment, the data generator gives 10 measurements by setting $M = 2$, $l_1 = 8$, $l_1' = 3$, $l_2 = 6$, and $l_2' = 2$. To determine $l_1'^*$ and $l_2'^*$ based on BIC for PTA, we need to conduct PTA $(l_1-1)(l_2-1)$ times and obtain two BIC score matrices, one for the first mode projection matrix $U_1$ and one for the second projection matrix $U_2$, as shown in Figure 3. In this figure, every block corresponds to a BIC score, and the darker the block, the smaller the corresponding BIC score. A light rectangle highlights the darkest block in each BIC score matrix, which corresponds to the smallest value. In the first BIC score matrix, shown in the left sub-figure of Figure 3, the smallest value is located at $(3,5)$. Because this BIC score matrix is calculated for the first mode projection matrix based on (20), we can set $l_1'^* = 3$ according to (21). Similarly to the determination of $l_1'^*$, we determine $l_2'^* = 2$ from the second BIC score matrix, shown on the right of Figure 3, because the smallest value is located at $(7,2)$. For this example, the model selection error is $\sum_{d=1}^{2}|l_d' - l_d'^*| = 0$.
We repeat the experiments, with a setting similar to the first experiment in this section, 30 times, but $l_1$, $l_1'$, $l_2$, and $l_2'$ are randomly set with the following requirements: $6 \le l_1, l_2 \le 10$, $2 \le l_1', l_2' \le 5$, $l_1' < l_1$, and $l_2' < l_2$. The total model selection error for BIC PTA is 0. We also conduct 30 experiments for third order tensors, with a similar setting as described above, where $l_1$, $l_1'$, $l_2$, $l_2'$, $l_3$, and $l_3'$ are set
Fig. 4. The Hinton diagram of the first and the second projection matrices obtained by PTA, as shown in the left and the right sub-figures, respectively
with the following requirements: $6 \le l_1, l_2, l_3 \le 10$, $2 \le l_1', l_2', l_3' \le 5$, $l_1' < l_1$, $l_2' < l_2$, and $l_3' < l_3$. The total model selection error is also 0. In every experiment, the reconstruction error is very small.
4 Conclusion

Vector data are normally used for probabilistic graphical models with probabilistic inference. However, tensor data, i.e., multidimensional arrays, are natural representations of a large amount of data in data mining, computer vision, and many other applications. Aiming to bridge the huge gap between vectors and tensors in conventional statistical tasks, e.g., model selection, this paper proposes a decoupled probabilistic algorithm, named probabilistic tensor analysis (PTA), with the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). PTA combined with AIC and BIC can select suitable models for tensor data, as demonstrated by the empirical studies.
Acknowledgment

The authors would like to thank Professor Andrew Blake of Microsoft Research Cambridge for his encouragement in developing the tensor probabilistic graphical model. This research was supported by the Competitive Research Grants at the Hong Kong Polytechnic University (under project numbers A-PH42 and A-PC0A) and the National Natural Science Foundation of China (under grant number 60703037).
References
[1] Bader, B.W., Kolda, T.G.: Efficient MATLAB Computations with Sparse and Factored Tensors. Technical Report SAND2006-7592, Sandia National Laboratories, Albuquerque, NM and Livermore, CA (2006)
[2] Bader, B.W., Kolda, T.G.: MATLAB Tensor Classes for Fast Algorithm Prototyping. ACM Transactions on Mathematical Software 32(4) (2006)
[3] Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995) [4] Kroonenberg, P., Leeuw, J.D.: Principal Component Analysis of Three-Mode Data by Means of Alternating Least Square Algorithms. Psychometrika, 45 (1980) [5] Lathauwer, L.D.: Signal Processing Based on Multilinear Algebra, Ph.D. Thesis. Katholike Universiteit Leuven (1997) [6] Sun, J., Tao, D., Faloutsos, C.: Beyond Streams and Graphs: Dynamic Tensor Analysis. In: The Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, pp. 374–383 (2006) [7] Tao, D., Li, X., Wu, X., Maybank, S.J.: General Tensor Discriminant Analysis and Gabor Features for Gait Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(10) (2007) [8] Tipping, M.E., Bishop, C.M.: Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, Series B 21(3), 611–622 (1999) [9] Tucker, L.R.: Some Mathematical Notes on Three-mode Factor Analysis. Psychometrika 31(3) (1966) [10] Vasilescu, M.A.O., Terzopoulos, D.: Multilinear Subspace Analysis of Image Ensembles. In: IEEE Proc. International Conference on Computer Vision and Pattern Recognition, Madison, Wisconsin, USA, vol. 2, pp. 93–99 (2003) [11] Ye, J., Janardan, R., Li, Q.: Two-Dimensional Linear Discriminant Analysis. In: Schölkopf, B., Platt, J., Hofmann, T. (eds.) Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada, vol. 17 (2004) [12] Ye, J.: Generalized Low Rank Approximations of Matrices. Machine Learning 61(1-3), 167–191 (2005)
Decomposing EEG Data into Space-Time-Frequency Components Using Parallel Factor Analysis and Its Relation with Cerebral Blood Flow

Fumikazu Miwakeichi1, Pedro A. Valdes-Sosa3, Eduardo Aubert-Vazquez3, Jorge Bosch Bayard3, Jobu Watanabe4, Hiroaki Mizuhara2, and Yoko Yamaguchi2

1 Medical System Course, Graduate School of Engineering, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba-shi, Chiba, 263-8522 Japan
[email protected]
2 Laboratory for Dynamics of Emergent Intelligence, RIKEN Brain Science Institute, Japan
3 Cuban Neuroscience Center, Cuba
4 Waseda Institute for Advance Study, Waseda University, Japan
Abstract. Finding the means to efficiently summarize electroencephalographic data has been a long-standing problem in electrophysiology. Our previous work showed that Parallel Factor Analysis (PARAFAC) can effectively perform atomic decomposition of the time-varying EEG spectrum in the space/frequency/time domain. In this study, we propose to use PARAFAC for extracting significant activities from EEG data that are concurrently recorded with functional Magnetic Resonance Imaging (fMRI), and to employ the temporal signature of the atom for investigating the relation between brain electrical activity and changes in the BOLD signal, which reflects cerebral blood flow. We evaluated the statistical significance of the dynamical response of BOLD with respect to EEG by modeling the BOLD signal with a plain autoregressive model (AR), an AR model with exogenous EEG input (ARX), and an ARX model with a nonlinear term (ARNX). Keywords: Parallel Factor Analysis, EEG space/frequency/time decomposition, Nonlinear time series analysis, Concurrent fMRI/EEG data.
1 Introduction

The electroencephalogram (EEG) is recorded as multi-channel time-varying data. Throughout the history of EEG research, many types of oscillatory phenomena in spontaneous and evoked EEG have been observed and reported. A statistical description of the oscillatory phenomena of the EEG was first carried out in the frequency domain by estimation of the power spectrum for quasi-stationary segments of data [1]. More recent characterizations of transient oscillations are carried out by estimation of the time-varying (or evolutionary) spectrum in the frequency/time domain [2]. These evolutionary spectra of EEG oscillations have a topographic distribution on the sensors that is contingent on the spatial configuration of the neural sources that generate them as well as the properties of the head as a volume conductor [3].
There is a long history of atomic decompositions for the EEG. However, to date, atoms have not been defined by the triplet of spatial, spectral and temporal signatures, but rather by pairwise combinations of these components. Space/time atoms are the basis of both Principal Component Analysis (PCA) and Independent Component Analysis (ICA) as applied to multi-channel EEG. PCA has been used for artifact removal and to extract significant activities in the EEG [4,5]. A basic problem is that atoms defined by only two signatures (space and time) are not determined uniquely. In PCA, orthogonality is therefore imposed between the corresponding signatures of different atoms, and there is the well-known non-uniqueness of PCA that allows an arbitrary choice of rotation of axes (e.g. Varimax and Quartimax rotations). More recently, ICA has become a popular tool for space/time atomic decomposition [6,7]. It avoids the arbitrary choice of rotation (Jung et al. 2001). Uniqueness, however, is achieved at the price of imposing a constraint even stronger than orthogonality, namely, statistical independence. In both PCA and ICA the frequency information may be obtained from the temporal signature of the extracted atoms in a separate step. For the purpose of decomposing single-channel EEG into frequency/time atoms, the Fast Fourier Transformation (FFT) with a sliding window [8] or the wavelet transformation [9,10] has been employed. In fact, any of the frequency/time atomic decompositions currently available [11] could, in principle, be used for the EEG. However, these methods do not address the topographic aspects of EEG time/frequency analysis. It has long been known, especially in the chemometrics literature, that unique multilinear decompositions of multi-way arrays of data (more than 2 dimensions) are possible under very weak conditions [12]. In fact, this is the basic argument for Parallel Factor Analysis (PARAFAC). This technique has recently been improved by Bro [13]. The important difference between PARAFAC and techniques such as PCA or ICA is that the decomposition of multi-way data is unique even without additional orthogonality or independence constraints. Thus, PARAFAC can be employed for a space/frequency/time atomic decomposition of the EEG. This makes use of the fact that multi-channel evolutionary spectra are multi-way arrays, indexed by electrode, frequency and time. The inherent uniqueness of the PARAFAC solution leads to a topographic time/frequency decomposition with a minimum of a priori assumptions. It has been shown that PARAFAC can effectively perform a time/frequency/spatial (T/F/S) atomic decomposition which is suitable for identification of fundamental modes of oscillatory activity in the EEG [14,15]. In this paper, the theory of PARAFAC and its application to EEG analysis will be presented. Moreover, the possibility of using this analysis to investigate the relation between the extracted EEG atoms and cerebral blood flow will be discussed.
2 Theory

For the purpose of EEG analysis, we define the $N_d\times N_f\times N_t$ data matrix S as the three-way time-varying EEG spectrum array obtained by applying a wavelet transformation, where $N_d$, $N_f$ and $N_t$ are the numbers of channels, frequency steps and
time points, respectively. For the wavelet transformation a complex Morlet mother function was used. The energy $S_{dft}$ of channel $d$ at frequency $f$ and time $t$ is given by the squared norm of the convolution of a Morlet wavelet with the EEG signal $v(d,t)$,

$S_{dft} = \big\|\,w(t,f)*v(d,t)\,\big\|^2$,  (1)

where the complex Morlet wavelet $w(t,f)$ is defined by $w(t,f) = \sqrt{\pi\sigma_b}\,e^{-(t/\sigma_b)^2}e^{i2\pi ft}$, with $\sigma_b$ denoting the bandwidth parameter. The
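The computation of $S_{dft}$ for one channel can be sketched as follows; the wavelet prefactor follows the reconstruction of Eq. (1) above and only scales the spectrum, and $\sigma_b$ is chosen per frequency so that the width $m = 2\pi\sigma_b f$ equals 7, as stated in the text. This is an illustrative NumPy sketch, not the authors' code.

```python
import numpy as np

def morlet_energy(v, freqs, fs, m=7.0):
    """Time-varying spectrum S[f, t] of one EEG channel v(t), Eq. (1).

    v     : 1-D signal of one channel
    freqs : analysis frequencies in Hz
    fs    : sampling frequency in Hz
    m     : wavelet width, m = 2*pi*sigma_b*f (set to 7 in this study)
    """
    S = np.zeros((len(freqs), len(v)))
    for i, f in enumerate(freqs):
        sigma_b = m / (2 * np.pi * f)                      # bandwidth parameter for this frequency
        t = np.arange(-3 * sigma_b, 3 * sigma_b, 1.0 / fs)
        w = np.sqrt(np.pi * sigma_b) * np.exp(-(t / sigma_b) ** 2) * np.exp(1j * 2 * np.pi * f * t)
        conv = np.convolve(v, w, mode="same")              # convolution of wavelet and signal
        S[i] = np.abs(conv) ** 2                           # squared norm = energy
    return S
```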
width of the wavelet, $m = 2\pi\sigma_b f$, is set to 7 in this study. The basic structural model for a PARAFAC decomposition of the data matrix S ($N_d\times N_f\times N_t$) with elements $S_{dft}$ is defined by

$S_{dft} = \sum_{k=1}^{N_k} a_{dk}\,b_{fk}\,c_{tk} + e_{dft} = \hat{S}_{dft} + e_{dft}$.  (2)

The problem is to find the loading matrices A, B and C, the elements of which are denoted by $a_{dk}$, $b_{fk}$ and $c_{tk}$ in Eq. (2).
Here we will refer to the components $k$ as "atoms", and the corresponding loading vectors $a_k = \{a_{dk}\}$, $b_k = \{b_{fk}\}$, $c_k = \{c_{tk}\}$ will be said to represent the spatial, spectral and temporal signatures of the atoms (Fig. 1). The uniqueness of the decomposition (2) is guaranteed if $\mathrm{rank}(A) + \mathrm{rank}(B) + \mathrm{rank}(C) \ge 2(N_k + 1)$. As can be seen, this is a less stringent condition than either orthogonality or statistical independence [12].
Fig. 1. Graphical explanation of the PARAFAC model. The multi-channel EEG evolutionary spectrum S is obtained from a channel-by-channel wavelet transform. S is a three-way data array indexed by channel, frequency and time. PARAFAC decomposes this array into a sum of "atoms". The k-th atom is the trilinear product of loading vectors representing spatial ($a_k$), spectral ($b_k$) and temporal ($c_k$) "signatures". Under these conditions PARAFAC can be summarized as finding the matrices $A = \{a_k\}$, $B = \{b_k\}$ and $C = \{c_k\}$ that explain S with minimal residual error.
The decomposition (2) can be obtained by evaluating $\min_{a_{dk},\,b_{fk},\,c_{tk}}\|S_{dft}-\hat{S}_{dft}\|^2$. Since the $S_{dft}$ can be regarded as representing spectra, this minimization should be carried out under a non-negativity constraint for the loading vectors. The result of the PARAFAC decomposition is given by the $(N_d\times 1)$ vector $a_k$, representing the topographical map of the $k$-th component, the $(N_f\times 1)$ vector $b_k$, representing the frequency spectrum of the $k$-th component, and the $(N_t\times 1)$ vector $c_k$, representing the temporal signature of the $k$-th component.
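For illustration only, a plain unconstrained alternating least squares sketch of the trilinear model (2) is given below; the analyses in this paper rely on Bro's PARAFAC implementation with a non-negativity constraint, so this sketch only conveys the structure of the decomposition. All names are illustrative.

```python
import numpy as np

def parafac_als(S, n_atoms, n_iter=100, seed=0):
    """Fit S[d, f, t] ~= sum_k a[d,k] b[f,k] c[t,k] by plain ALS (Eq. (2), no constraints)."""
    rng = np.random.default_rng(seed)
    Nd, Nf, Nt = S.shape
    A, B, C = rng.random((Nd, n_atoms)), rng.random((Nf, n_atoms)), rng.random((Nt, n_atoms))
    S1 = S.reshape(Nd, -1)                                  # mode-1 unfolding (f, t combined)
    S2 = np.moveaxis(S, 1, 0).reshape(Nf, -1)               # mode-2 unfolding
    S3 = np.moveaxis(S, 2, 0).reshape(Nt, -1)               # mode-3 unfolding
    khatri_rao = lambda P, Q: (P[:, None, :] * Q[None, :, :]).reshape(-1, P.shape[1])
    for _ in range(n_iter):
        A = S1 @ np.linalg.pinv(khatri_rao(B, C).T)         # solve for spatial signatures
        B = S2 @ np.linalg.pinv(khatri_rao(A, C).T)         # solve for spectral signatures
        C = S3 @ np.linalg.pinv(khatri_rao(A, B).T)         # solve for temporal signatures
    return A, B, C
```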
3 Results

3.1 Extracting Relevant Components from EEG

The experiment consisted of two conditions, an eyes-closed resting condition and a mental arithmetic condition; for each condition five epochs, each lasting 30 s, were recorded. During the mental arithmetic epochs, the subjects were asked to count backwards from 1000 in steps of a constant single-digit natural number, while keeping their eyes closed. At the beginning of each mental arithmetic epoch this number was randomly chosen by a computer and presented to the subjects through headphones. The end of each mental arithmetic epoch was announced by a beeping sound (4 kHz, 20 ms). Mental arithmetic and resting epochs occurred five times within each trial, arranged in alternating order; each subject was examined in two trials. During the experiment, fMRI and EEG were recorded simultaneously. The electrode set consisted of 61 EEG channels, two ECG channels and one EOG channel. The reference electrode was at FCz. Raw EEG signals were sampled at a frequency of 5 kHz, using a 1 Hz high-pass filter provided by the Brain Vision Recorder software (Brain Products, Munich, Germany). The fMRI was acquired as blood-oxygenation-sensitive (T2*-weighted) echoplanar images, using a 1.5 T MR scanner (Staratis II, Hitachi Medico, Japan) and a standard head coil. Thirty slices (4 mm in thickness, gapless), covering almost the entire cerebrum, were measured under the following conditions: TR 5 s, TA 3.3 s, TE 47.2 ms, FA 90°, FoV 240 mm, matrix size 64×64. In order to minimize head motion, the heads of the subjects were immobilized by a vacuum pad. SPM99 was used for preprocessing of the fMRI images (motion correction, slice timing correction). Using bilinear interpolation, images were normalized for each subject to a standard brain defined by the Montreal Neurological Institute; the normalized images were subsequently smoothed using a Gaussian kernel (full-width half-maximum: 10 mm). For each voxel, the BOLD signal time series were high-pass filtered (>0.01 Hz), in order to remove fluctuations with frequencies lower than the frequency defined by the switching between resting and task. The EEG signal $v(d,t)$ was subsampled to 500 Hz, and, in order to construct a three-way (channel/frequency/time) time-varying EEG $N_d\times N_f\times N_t$ data matrix S, a
Fig. 2. (a) Spectral signatures of atoms of Parallel Factor Analysis (PARAFAC) for a typical subject. Note the recurrent appearance of frequency peaks in the theta and alpha bands. The horizontal axis represents frequency in Hz, the vertical axis represents the normalized amplitude. (b) Spatial signatures of atoms displayed as a topographic map, for the theta, alpha and high alpha atoms (from above) of Parallel Factor Analysis (PARAFAC) for the same subject. (c) Temporal signatures of atoms, in the same order as in (a). The horizontal axis represents time in units of scans (time between scans is five seconds), the vertical axis represents normalized intensity. The black and red colored lines correspond to resting and task stages, respectively. (See colored figure in CD-R).
wavelet transform was applied in the frequency range from 0.5 to 40.0 Hz with 0.5 Hz steps, using the complex Morlet wavelet as mother function. To adjust the temporal resolution to the resolution of the fMRI data, the wavelet-transformed EEG was averaged over consecutive 5-second intervals. Then, by applying PARAFAC to this three-way data set, major signal components in the frequency range from 3.0 to 30.0 Hz were
extracted. The reason for choosing this frequency band was that at lower frequencies EEG data typically contain eye movement artifacts, while at higher frequencies there is no relevant activity in the data set analyzed in this study. In Fig. 2(a) spectral signatures of the identified atoms are shown: theta (around 6 Hz), alpha (around 9 Hz), and high alpha (around 12 Hz) atoms were found; in this study we denote the frequency of the last atom as "high alpha". We find that if the order of the decomposition is increased beyond three, two atoms with overlapping frequency peaks appear. Since this indicates overfitting, we select three as the order of the decomposition. In Fig. 2(b) spatial signatures of atoms are shown. It can be seen that the main power of the theta and alpha atoms is focused in frontal and occipital areas, respectively. In Fig. 2(c) temporal signatures of atoms are shown; red color refers to the task condition, black color to the resting condition. By comparing the amplitudes of the temporal signatures during both conditions by a standard T-test, we find that for the theta atom the amplitudes of the temporal signature are higher during the task condition, while for the alpha atom they are higher during the resting condition. For the high alpha atom no significant result is found with this test, therefore this atom will be omitted from further study.

3.2 Hemodynamic Response of Cerebral Blood Flow with Respect to EEG

In order to investigate the dynamical relation between EEG and cerebral blood flow, we considered the linear autoregressive model (AR), the linear autoregressive model with exogenous input (ARX), and the ARX model with a nonlinear nonparametric term (ARNX). The definitions of these models are as follows:

AR: $B_t = \alpha B_{t-1} + e_t$,  (3)
ARX: $B_t = \beta_0 B_{t-1} + \sum_{k\in U_p}\beta_k x_{t-k} + e_t$,  (4)

ARNX: $B_t = \beta_0 B_{t-1} + \sum_{k\in U_p}\beta_k x_{t-k} + f(x_{t-j_1}, x_{t-j_2}, \ldots) + e_t$, $\quad\{j_1, j_2, \ldots\}\subseteq U_{np}$,  (5)
where $f(\cdot)$ is a nonparametric Nadaraya-Watson-type kernel estimator and $e_t$ is the prediction error. The indices of those columns of the design matrix that belong to the parametric or nonparametric parts were fixed for ARX and ARNX at the same set of values, $U_p = U_{np} = \{1, 2, 3, 4, 5, 6\}$. The reason for choosing a maximum lag of 6 is that the effect of neural electric activity on the BOLD signal has been experimentally found to decay to zero level in approximately 30 seconds. In ARX and ARNX there is an autoregressive (AR) model term, with parameter $\beta_0$, describing the intrinsic dynamics of the BOLD signal, in addition to the terms representing the influence of the neural electric activity. The parameters of the parametric and the nonparametric model part, as well as the kernel width parameter, can be estimated by minimizing the modified Global Cross Validation (GCV) criterion (Hong, 1999). It can also be used for model comparison. Consider the difference between the values of GCV for two models, A and B:
$D = \mathrm{GCV}(A) - \mathrm{GCV}(B)$.  (6)
If for a particular voxel D assumes a positive value, model B is superior to A. If for a particular voxel the GCV value for ARX or ARNX is smaller than for AR, it can be said that this voxel is influenced by the EEG. From now on, we will focus only on this set of voxels. We expect that among these voxels there is a subset for which the nonlinear term of ARNX improves the model, as compared to ARX. For extracting these voxels, $D = \mathrm{GCV}(ARX) - \mathrm{GCV}(ARNX)$ was computed. If for a particular voxel D assumes a positive value, this indicates the necessity of employing a nonlinear response to the EEG for modeling the BOLD signal of this voxel. On the other hand, if D assumes a negative value, it is sufficient to model the BOLD signal of this voxel by ARX. Regions where ARX or ARNX outperforms AR are shown in Fig. 3(a) for the alpha atom and in Fig. 3(b) for the theta atom; note that these regions contain only 7% (alpha atom) and 4% (theta atom) of all gray-matter voxels. Regions where ARNX outperforms ARX (as measured by GCV) are shown in green; these regions contain about 3% (alpha atom) and 2% (theta atom) of all gray-matter voxels. The opposite case of regions where ARNX fails to outperform ARX is shown in orange; these regions contain about 4% (alpha atom) and 2% (theta atom) of all gray-matter voxels.

Fig. 3. Regions where the ARX and ARNX models outperform the AR model, for the alpha atom (a) and the theta atom (b). Green color denotes regions where ARNX outperforms ARX, while orange color denotes the opposite case (thresholded at p>
Fig. 3. Regions where the ARX and ARNX models outperform the AR model, for the alpha atom (a) and the theta atom (b). Green color denotes regions where ARNX outperforms ARX, while orange color denotes the opposite case (thresholded at p> I ≥ J is held, where J is known or can be estimated using PCA/SVD. In the paper, we propose a family of algorithms that can work also for an under-determined case, i.e., for T >> J > I, if sources are enough sparse and/or smooth. Our primary objective is to estimate the mixing (basis) matrix A and the sources X q , subject to additional natural constraints such as nonnegativity, sparsity and/or smoothness constraints. To deal with the factorization problem (1) efficiently, we adopt several approaches from constrained optimization, regularization theory, multi-criteria optimization, and projected gradient techniques. We minimize simultaneously or sequentially several cost functions with the desired constraints using switching between two sets of parameters: {A} and {X q }. 1
It should be noted that the 2D unfolded model, in a general case, is not exactly equivalent to the PARAFAC2 model (in respect to sources X q ), since we usually need to impose different additional constraints for each slice q. In other words, the PARAFAC2 model should not be considered as a 2-D model with the single 2-D unfolded matrix X . Profiles of the augmented (row-wise unfolded) X can only be treated as a single profile, while we need to impose individual constraints independently to each slice X q or even to each row of X q . Moreover, the 3D tensor factorization is considered as a dynamical process or a multi-stage process, where the data analysis is performed several times under different conditions (initial estimates, selected natural constraints, post-processing) to get full information about the available data and/or discover some inner structures in the data, or to extract physically meaningful components.
Fig. 1. Modified PARAFAC2 model illustrating factorization of 3D tensor into a set of fundamental matrices: A, D, {X q }. In the special case, the model is reduced to standard PARAFAC for X q = X 1 , ∀q, or tri-NMF model for Q = 1.
2 Projected Gradient Local Least Squares Regularized Algorithm
Many algorithms for the PARAFAC model are based on Alternating Least Squares (ALS) minimization of the squared Euclidean distance [1,4,5]. In particular, we can attempt to minimize a set of the following cost functions:

$D_{Fq}(Y_q\,\|\,AX_q) = \dfrac{1}{2}\|Y_q - AX_q\|_F^2 + \alpha_A J_A(A) + \alpha_X J_{Xq}(X_q)$,  (3)
subject to some additional constraints, where $J_A(A)$, $J_{Xq}(X_q)$ are penalty or regularization functions, and $\alpha_A$ and $\alpha_X$ are nonnegative coefficients controlling a tradeoff between data fidelity and a priori knowledge on the components to be estimated. A choice of the regularization terms can be critical for attaining a desired solution and noise suppression. Some of the candidates include the entropy, $l_p$-quasi norm and more complex, possibly non-convex or non-smooth regularization terms [11]. In such a case a basic approach to the above formulated optimization problem is alternating minimization or alternating projection: the specified cost function is alternately minimized with respect to two sets of the parameters $\{x_{jtq}\}$ and $\{a_{ij}\}$, each time optimizing one set of arguments while keeping the other one fixed [6,7]. In this paper, we consider a different approach: instead of minimizing only one global cost function, we perform sequential minimization of the set of local cost functions composed of the squared Euclidean terms and regularization terms:

$D_{Fq}^{(j)}(Y_q^{(j)}\,\|\,a_jx_{jq}) = \dfrac{1}{2}\|Y_q^{(j)} - a_jx_{jq}\|_F^2 + \alpha_a^{(j)}J_a(a_j) + \alpha_{Xq}^{(j)}J_x(x_{jq})$,  (4)

for $j = 1,2,\ldots,J$, $q = 1,2,\ldots,Q$, subject to additional constraints, where

$Y_q^{(j)} = Y_q - \sum_{r\neq j}a_rx_{rq} = Y_q - AX_q + a_jx_{jq}$,  (5)
$a_j\in\mathbb{R}^{I\times 1}$ are the columns of the basis mixing matrix A, $x_{jq}\in\mathbb{R}^{1\times T}$ are the rows of $X_q$ which represent unknown source signals, $J_a(a_j)$ and $J_x(x_{jq})$ are local penalty regularization terms which impose specific constraints on the estimated parameters, and $\alpha_a^{(j)}\ge 0$ and $\alpha_{Xq}^{(j)}\ge 0$ are nonnegative regularization parameters that control a tradeoff between data-fidelity and the imposed constraints. The construction of such a set of local cost functions follows from the simple observation that the observed data can be decomposed as $Y_q = \sum_{j=1}^{J}a_jx_{jq} + N_q$, $\forall q$. We are motivated to use such a representation and decomposition because the $x_{jq}$ have a physically meaningful interpretation as sources with specific temporal and morphological properties. The penalty terms may take different forms depending on properties of the estimated sources. For example, if the sources are sparse, we can apply the $l_p$ quasi-norm: $J_x(x_{jq}) = \|x_{jq}\|_p^p = \sum_t|x_{jtq}|^p$ with $0 < p \le 1$, or alternatively we can use the smooth approximation $J_x(x_{jq}) = \sum_t\big(|x_{jtq}|^2+\varepsilon\big)^{p/2}$, where $\varepsilon\ge 0$ is a small constant. In order to impose local smoothing of signals, we can apply the total variation (TV) $J_x(x_{jq}) = \sum_{t=1}^{T-1}|x_{j,t+1,q}-x_{j,t,q}|$, or if we wish to achieve a smoother solution: $J_x(x_{jq}) = \sum_{t=1}^{T-1}\sqrt{|x_{j,t+1,q}-x_{j,t,q}|^2+\varepsilon}$ [12]. The gradients of the local cost function (4) with respect to the unknown vectors $a_j$ and $x_{jq}$ are expressed by
$\dfrac{\partial D_{Fq}^{(j)}(Y_q^{(j)}\,\|\,a_jx_{jq})}{\partial x_{jq}} = a_j^Ta_jx_{jq} - a_j^TY_q^{(j)} + \alpha_{Xq}^{(j)}\Psi_x(x_{jq})$,  (6)

$\dfrac{\partial D_{Fq}^{(j)}(Y_q^{(j)}\,\|\,a_jx_{jq})}{\partial a_j} = a_jx_{jq}x_{jq}^T - Y_q^{(j)}x_{jq}^T + \alpha_a^{(j)}\Psi_a(a_j)$,  (7)
where the matrix functions $\Psi_a(a_j)$ and $\Psi_x(x_{jq})$ are defined as2

$\Psi_a(a_j) = \dfrac{\partial J_a(a_j)}{\partial a_j}$, $\qquad\Psi_x(x_{jq}) = \dfrac{\partial J_x(x_{jq})}{\partial x_{jq}}$.  (8)
By equating the gradient components to zero, we obtain a new set of local learning rules:

$x_{jq} \leftarrow \dfrac{1}{a_j^Ta_j}\Big(a_j^TY_q^{(j)} - \alpha_{Xq}^{(j)}\Psi_x(x_{jq})\Big)$,  (9)

$a_j \leftarrow \dfrac{1}{x_{jq}x_{jq}^T}\Big(Y_q^{(j)}x_{jq}^T - \alpha_a^{(j)}\Psi_a(a_j)\Big)$,  (10)
for j = 1, 2, . . . , J and q = 1, 2, . . . , Q. However, it should be noted that the above algorithm provides only a regularized least squares solution, and this is not sufficient to extract the desired 2
If the penalty functions are non-smooth, we can use sub-gradient instead of the gradient.
sources, especially in an under-determined case, since the problem may have many solutions. To solve this problem, we additionally need to impose nonlinear projections $P_{\Omega j}(x_{jq})$ or filtering after each iteration or each epoch in order to enforce that the individual estimated sources $x_{jq}$ satisfy the desired constraints. All such projections or filtering can be imposed individually for each source $x_{jq}$ depending on the morphological properties of the source signals. A similar nonlinear projection $P_{\Omega j}(a_j)$ can be applied, if necessary, individually for each vector $a_j$ of the mixing matrix A. Hence, using the projected gradient approach, our algorithm can take the following more general and flexible form:

$x_{jq} \leftarrow \dfrac{1}{a_j^Ta_j}\Big(a_j^TY_q^{(j)} - \alpha_{Xq}^{(j)}\Psi_x(x_{jq})\Big)$, $\quad x_{jq} \leftarrow P_{\Omega j}\{x_{jq}\}$,  (11)

$a_j \leftarrow \dfrac{1}{x_{jq}x_{jq}^T}\Big(Y_q^{(j)}x_{jq}^T - \alpha_a^{(j)}\Psi_a(a_j)\Big)$, $\quad a_j \leftarrow P_{\Omega j}\{a_j\}$;  (12)
where $P_{\Omega j}\{x_{jq}\}$ means, in general, a nonlinear projection, filtering, transformation, local interpolation/extrapolation, inpainting, or smoothing of the row vector $x_{jq}$. Such projections or transformations can take many different forms depending on the required properties of the estimated sources (see the next section for more details).

Remark 1. In practice, it is necessary to normalize the column vectors $a_j$ or the row vectors $x_{jq}$ to unit-length vectors (in the sense of the $l_p$ norm, $p = 1, 2, \ldots, \infty$) in each iterative step. In the special case of the $l_2$ norm, the above algorithm can be further simplified by neglecting the denominator in (11) or in (12), respectively. After estimating the matrices A and $\tilde{X}_q$ (i.e., $X_q$ normalized to unit-length rows), we can estimate the diagonal matrices, if necessary, as follows:

$D_q = \mathrm{diag}\{A^+Y_q\tilde{X}_q^+\}$, $\quad(q = 1, 2, \ldots, Q)$.  (13)

3 Flexible Component Analysis (FCA) – Possible Extensions and Practical Implementations
The above simple algorithm can be further extended or improved (with respect to convergence rate and performance). First of all, different cost functions can be used for estimating the rows of the matrices $X_q$ (q = 1, 2, ..., Q) and the columns of the matrix A. Furthermore, the columns of A can be estimated simultaneously instead of one by one. For example, by minimizing the set of cost functions in (4) with respect to $x_{jq}$, and simultaneously the cost function (3) with normalization of the columns $a_j$ to unit $l_2$-norm, we obtain a new FCA learning algorithm in which the individual rows of $X_q$ are updated locally (row by row) and the matrix A is updated globally (all the columns $a_j$ simultaneously):

$x_{jq} \leftarrow a_j^TY_q^{(j)} - \alpha_{Xq}^{(j)}\Psi_x(x_{jq})$, $\quad x_{jq} \leftarrow P_{\Omega j}\{x_{jq}\}$, $\quad(j = 1, \ldots, J)$,  (14)

$A \leftarrow \big(Y_qX_q^T - \alpha_A\Psi_A(A)\big)\big(X_qX_q^T\big)^{-1}$, $\quad A \leftarrow P_\Omega(A)$, $\quad(q = 1, \ldots, Q)$,  (15)
with the normalization (scaling) of the columns of A to unit length in the sense of the $l_2$ norm, where $\Psi_A(A) = \partial J_A(A)/\partial A$. In order to estimate the basis matrix A, we can alternatively use the following global cost function (see Eq. (2)): $D_F(Y\,\|\,AX) = (1/2)\|Y - AX\|_F^2 + \alpha_AJ_A(A)$. The minimization of this cost function for a fixed X leads to the updating rule

$A \leftarrow \big(YX^T - \alpha_A\Psi_A(A)\big)\big(XX^T\big)^{-1}$.  (16)

3.1 Nonnegative Matrix/Tensor Factorization
In order to enforce sparsity and nonnegativity constraints for all the parameters, $a_{ij}\ge 0$, $x_{jtq}\ge 0$, $\forall\,i,t,q$, we can apply the "half-way rectifying" element-wise projection $[x]_+ = \max\{\varepsilon, x\}$, where $\varepsilon$ is a small constant used to avoid numerical instabilities and remove background noise (typically, $\varepsilon = 10^{-2}$–$10^{-9}$). Simultaneously, we can impose weak sparsity constraints by using the $l_1$-norm penalty functions $J_A(A) = \|A\|_1 = \sum_{ij}a_{ij}$ and $J_x(x_{jq}) = \|x_{jq}\|_1 = \sum_tx_{jtq}$. In such a case, the FCA algorithm for the 3D NTF2 model (i.e., PARAFAC2 with nonnegativity constraints) takes the following form:

$x_{jq} \leftarrow \big[a_j^TY_q^{(j)} - \alpha_{Xq}^{(j)}\mathbf{1}\big]_+$, $\quad(j = 1, \ldots, J),\ (q = 1, \ldots, Q)$,  (17)

$A \leftarrow \big[(YX^T - \alpha_A\mathbf{1})(XX^T)^{-1}\big]_+$,  (18)
with normalization of the columns of A to unit $l_2$ length in each iterative step, where $\mathbf{1}$ means a matrix of all ones of appropriate size. It should be noted that the above algorithm can easily be extended to semi-NMF or semi-NTF, in which only some sources $x_{jq}$ are nonnegative and/or the mixing matrix A is bipolar, by simply removing the corresponding "half-wave rectifying" projections. Moreover, a similar algorithm can be used for arbitrary bounded sources with known lower and/or upper bounds (or supports), i.e. $l_{jq}\le x_{jtq}\le u_{jq}$, $\forall t$, rather than $x_{jtq}\ge 0$, by using suitably chosen nonlinear projections which bound the solutions.
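A minimal sketch of one sweep of the rectified updates (17)–(18) follows, assuming the columns of A are kept at unit norm so that the $1/(a_j^Ta_j)$ factor can be neglected (Remark 1); Y_slices and X_slices are lists of frontal slices, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def ntf2_sweep(Y_slices, A, X_slices, alpha_X=0.05, alpha_A=0.05, eps=1e-9):
    """One sweep of the rectified FCA/NTF2 updates, Eqs. (17)-(18)."""
    J = A.shape[1]
    for q, Yq in enumerate(Y_slices):
        Xq = X_slices[q]
        for j in range(J):
            # residual slice Y_q^{(j)} = Y_q - A X_q + a_j x_{jq}, Eq. (5)
            Yqj = Yq - A @ Xq + np.outer(A[:, j], Xq[j])
            Xq[j] = np.maximum(eps, A[:, j] @ Yqj - alpha_X)     # Eq. (17), columns of A unit-norm
    Y = np.hstack(Y_slices)                                      # row-wise unfolded data
    X = np.hstack(X_slices)                                      # row-wise unfolded sources
    A_new = np.maximum(eps, (Y @ X.T - alpha_A) @ np.linalg.inv(X @ X.T))   # Eq. (18)
    A_new /= np.linalg.norm(A_new, axis=0, keepdims=True)        # unit l2-norm columns
    return A_new, X_slices
```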
3.2 Smooth Component Analysis (SmoCA)
In order to enforce smooth estimation of the sources $x_{jq}$ for all or some preselected indexes j and q, we may apply after each iteration (epoch) a local smoothing or filtering of the estimated sources, such as MA (Moving Average), EMA, SAR or ARMA models. A quite efficient way of smoothing and denoising can be achieved by minimizing the following cost function (which satisfies a multi-resolution criterion):

$J(x_{jq}) = \sum_{t=1}^{T}\big(x_{jtq} - \tilde{x}_{jtq}\big)^2 + \sum_{t=1}^{T-1}\lambda_{jtq}\,g_t\big(\tilde{x}_{j,t+1,q} - \tilde{x}_{jtq}\big)$,  (19)

where $\tilde{x}_{jtq}$ is a smoothed version of the actually estimated (noisy) $x_{jtq}$, $g_t(u)$ is a convex continuously differentiable function with a global minimum at $u = 0$, and $\lambda_{jtq}$ are parameters that are data driven and chosen automatically.
3.3 Multi-way Sparse Component Analysis (MSCA)
In sparse component analysis the objective is to estimate the sources $x_{jq}$ which are sparse, usually with a prescribed or specified sparsification profile, possibly with additional constraints like local smoothness. In order to enforce that the estimated sources are sufficiently sparse, we need to apply a suitable nonlinear projection or filtering which allows us to adaptively sparsify the data. The simplest nonlinear projection which enforces some sparsity on the normalized data is the following weakly nonlinear element-wise projection:

$P_{\Omega j}(x_{jtq}) = \mathrm{sign}(x_{jtq})\,|x_{jtq}|^{1+\alpha_{jq}}$,  (20)
where $\alpha_{jq}$ is a small parameter which controls sparsity. Such a nonlinear projection can be considered as a simple (trivial) shrinking. Alternatively, we may use more sophisticated adaptive local soft or hard shrinking in order to sparsify individual sources. Usually, we have a three-step procedure: first, we perform the linear transformation $x_w = xW$; then the nonlinear shrinking (adaptive thresholding), e.g., the soft element-wise shrinking $P_\Omega(x_w) = \mathrm{sign}(x_w)\,[\,|x_w| - \delta\,]_+^{1+\delta}$; and finally the inverse transform $\hat{x} = P_\Omega(x_w)W^{-1}$. The threshold $\delta > 0$ is usually not fixed but is adaptively (data-driven) selected, or it gradually decreases to zero with iterations. The optimal choice of a shrinkage function depends on the distribution of the data. We have tested various shrinkage functions with gradually decreasing $\delta$: the hard thresholding rule, soft thresholding rule, non-negative Garrotte rule, n-degree Garrotte, and posterior median shrinkage rule [13]. For all of them, we have obtained promising results, and usually the best performance appears for the simple hard rule. Our method is somewhat related to the MoCA and SCA algorithms, proposed recently by Bobin et al., Daubechies et al., Elad et al., and many others [10,14,11]. However, in contrast to these approaches our algorithms are local and more flexible. Moreover, the proposed FCA is more general than the SCA, since it is not limited only to a sparse representation via shrinking and linear transformation but allows us to impose general and flexible (soft and hard) constraints, nonlinear projections, transformations, and filtering3. Furthermore, in contrast to many alternative algorithms which process the columns of $X_q$, we process their rows, which directly represent the source signals. We can outline the FCA algorithm as follows (a simple numerical sketch is given after the list):
1. Set the initial values of the matrix A and the matrices $X_q$, and normalize the vectors $a_j$ to unit $l_2$-norm length.
2. Calculate the new estimate $x_{jq}$ of the matrices $X_q$ using the iterative formula in (14).
3. If necessary, enforce the nonlinear projection or filtering by imposing natural constraints on individual sources (the rows of $X_q$, q = 1, 2, ..., Q), such as nonnegativity, boundedness, smoothness, and/or sparsity.
In this paper, in fact, we use two kinds of constraints: the soft (or weak) constraints via penalty and regularization terms in the local cost functions, and the hard (strong) constraints via iteratively adaptive postprocessing using nonlinear projections or filtering.
4. Calculate the new estimate of A from (16), normalize each column of A to unit length, and impose the additional constraints on A, if necessary.
5. Repeat steps (2) and (4) until the convergence criterion is satisfied.
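The outline above can be sketched as a single loop in which the projection applied in step 3 is passed in as a function; the example projection shown (nonnegativity followed by a short moving average) is only a hypothetical choice, and the update rules follow (14) and (16) with the regularization terms set to zero. All names are illustrative.

```python
import numpy as np

def fca(Y_slices, J, n_epochs=100, project_x=None, project_a=None, seed=0):
    """Schematic FCA loop (steps 1-5); local/global updates follow Eqs. (14) and (16)."""
    rng = np.random.default_rng(seed)
    I = Y_slices[0].shape[0]
    A = rng.random((I, J))
    A /= np.linalg.norm(A, axis=0, keepdims=True)               # step 1: unit-norm columns
    X_slices = [rng.random((J, Yq.shape[1])) for Yq in Y_slices]
    Y = np.hstack(Y_slices)
    for _ in range(n_epochs):
        for q, Yq in enumerate(Y_slices):
            Xq = X_slices[q]
            for j in range(J):
                Yqj = Yq - A @ Xq + np.outer(A[:, j], Xq[j])
                Xq[j] = A[:, j] @ Yqj                           # step 2: local row update
                if project_x is not None:
                    Xq[j] = project_x(Xq[j])                    # step 3: problem-specific projection
        X = np.hstack(X_slices)
        A = (Y @ X.T) @ np.linalg.pinv(X @ X.T)                 # step 4: global update of A
        if project_a is not None:
            A = project_a(A)
        A /= np.linalg.norm(A, axis=0, keepdims=True)
    return A, X_slices

# hypothetical projection: nonnegativity plus light moving-average smoothing
nonneg_smooth = lambda x: np.convolve(np.maximum(x, 0.0), np.ones(3) / 3, mode="same")
```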
3.4 Multi-layer Blind Identification
In order to improve the performance of the FCA algorithms proposed in this paper, especially for ill-conditioned and badly-scaled data, and also to reduce the risk of getting stuck in local minima of the non-convex alternating minimization, we have developed a simple hierarchical multi-stage procedure [15] combined with multi-start initializations, in which we perform a sequential decomposition of matrices as follows. In the first step, we perform the basic decomposition (factorization) $Y_q \approx A^{(1)}X_q^{(1)}$ using any suitable FCA algorithm presented in this paper. In the second stage, the results obtained from the first stage are used to build up a new tensor from the estimated frontal slices, defined as $Y_q^{(1)} = X_q^{(1)}$, (q = 1, 2, ..., Q). In the next step, we perform a similar decomposition for the newly available frontal slices: $Y_q^{(1)} = X_q^{(1)} \approx A^{(2)}X_q^{(2)}$, using the same or different update rules. We continue our decomposition taking into account only the last achieved components. The process can be repeated arbitrarily many times until some stopping criteria are satisfied. In each step, we usually obtain gradual improvements of the performance. Thus, our FCA model has the following form: $Y_q \approx A^{(1)}A^{(2)}\cdots A^{(L)}X_q^{(L)}$, (q = 1, 2, ..., Q), with the final components $A = A^{(1)}A^{(2)}\cdots A^{(L)}$ and $X_q = X_q^{(L)}$. Physically, this means that we build up a system that has many layers or cascade connections of L mixing subsystems. The key point in our approach is that the learning (update) process to find the matrices $X_q^{(l)}$ and $A^{(l)}$ is performed sequentially, i.e. layer by layer, where each layer is initialized randomly. In fact, we found that the hierarchical multi-layer approach plays a key role and is necessary in order to achieve satisfactory performance for the proposed algorithms.
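A schematic sketch of this multi-layer cascade follows, assuming a single-layer routine such as the loop sketched above is supplied as decompose; names are illustrative.

```python
def multilayer_fca(Y_slices, J, n_layers=10, decompose=None):
    """Cascade decomposition Y_q ~= A(1) A(2) ... A(L) X_q^(L) described above."""
    A_total = None
    X_slices = Y_slices
    for _ in range(n_layers):
        A_l, X_slices = decompose(X_slices, J)      # one layer, e.g. the fca(...) sketch above
        A_total = A_l if A_total is None else A_total @ A_l
    return A_total, X_slices
```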
4 Simulation Results
The algorithms presented in this paper have been tested on many difficult benchmarks for signals and images with various temporal and morphological properties and additive noise. Due to space limitations we present here only one illustrative example. Sparse nonnegative signals with different sparsity and smoothness profiles are collected in 10 slices $X_q$ (Q = 10), forming the tensor $X\in\mathbb{R}^{5\times1000\times10}$. The observed (mixed) 3D data $Y\in\mathbb{R}^{4\times1000\times10}$ are obtained by multiplying the randomly generated mixing matrix $A\in\mathbb{R}^{4\times5}$ by X. The simulation results are illustrated in Fig. 2 (only for one frontal slice q = 1).
Fig. 2. (left) Original 5 spectra (representing $X_1$); (middle) observed 4 mixed spectra $Y_1$ generated by a random matrix $A\in\mathbb{R}^{4\times5}$ (under-determined case); (right) estimated 5 spectra $X_1$ with our algorithm given by (17)–(18), using 10 layers and $\alpha_A = \alpha_{X1}^{(j)} = 0.05$. Signal-to-Interference Ratios (SIR) for A and $X_1$ are as follows: SIR_A = 31.6, 34, 31.5, 29.9, 23.1 [dB] and SIR_X1 = 28.5, 19.1, 29.3, 20.3, 23.2 [dB], respectively
5 Conclusions and Discussion
The main objective and motivation of this paper is to derive simple algorithms which are suitable both for under-determined (over-complete) and over-determined cases. We have applied simple local cost functions with flexible penalty or regularization terms, which allows us to derive a family of robust FCA algorithms, where the sources may have different temporal and morphological properties or different sparsity profiles. Exploiting these properties and a priori knowledge about the character of the sources, we have proposed a family of efficient algorithms for estimating sparse, smooth, and/or nonnegative sources, even if the number of sensors is smaller than the number of hidden components, under the assumption that some information about the morphological or desired properties of the sources is accessible. This is an original extension of the standard MoCA and NMF/NTF algorithms, and to the authors' best knowledge, this is the first time such algorithms have been applied to the multi-way PARAFAC models. In comparison to the ordinary BSS algorithms, the proposed algorithms are shown to be superior in terms of performance, speed, and convergence properties. We implemented the discussed algorithms in MATLAB [16]. The approach can be extended to other applications, such as dynamic MRI imaging, and it can be used as an alternative or improved reconstruction method to the k-t BLAST, k-t SENSE or k-t SPARSE, because our approach relaxes the problem of getting stuck in local minima and usually provides better performance than the standard FOCUSS algorithms. This research is motivated by potential applications of the proposed models and algorithms in three areas of data analysis (especially, EEG and fMRI) and signal/image processing: (i) multi-way blind source separation, (ii) model
reduction and selection, and (iii) sparse image coding. Our preliminary experimental results are promising. The models can be further extended by imposing additional natural constraints such as orthogonality, continuity, closure, unimodality, local rank-selectivity, and/or by taking into account a priori knowledge about the specific components.
References 1. Smilde, A., Bro, R., Geladi, P.: Multi-way Analysis: Applications in the Chemical Sciences. John Wiley and Sons, New York (2004) 2. Hazan, T., Polak, S., Shashua, A.: Sparse image coding using a 3D non-negative tensor factorization. In: International Conference of Computer Vision (ICCV), pp. 50–57 (2005) 3. Heiler, M., Schnoerr, C.: Controlling sparseness in non-negative tensor factorization. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 56–67. Springer, Heidelberg (2006) 4. Miwakeichi, F., Martnez-Montes, E., Valds-Sosa, P., Nishiyama, N., Mizuhara, H., Yamaguchi, Y.: Decomposing EEG data into space−time−frequency components using parallel factor analysi. NeuroImage 22, 1035–1045 (2004) 5. Mørup, M., Hansen, L.K., Herrmann, C.S., Parnas, J., Arnfred, S.M.: Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG. NeuroImage 29, 938–947 (2006) 6. Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999) 7. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing (New revised and improved edition). John Wiley, New York (2003) 8. Dhillon, I., Sra, S.: Generalized nonnegative matrix approximations with Bregman divergences. In: Neural Information Proc. Systems, Vancouver, Canada, pp. 283– 290 (2005) 9. Berry, M., Browne, M., Langville, A., Pauca, P., Plemmons, R.: Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis (in press, 2006) 10. Bobin, J., Starck, J.L., Fadili, J., Moudden, Y., Donoho, D.L.: Morphological component analysis: An adaptive thresholding strategy. IEEE Transactions on Image Processing (in print, 2007) 11. Elad, M.: Why simple shrinkage is still relevant for redundant representations? IEEE Trans. On Information Theory 52, 5559–5569 (2006) 12. Kovac, A.: Smooth functions and local extreme values. Computational Statistics and Data Analysis 51, 5155–5171 (2007) 13. Tao, T., Vidakovic, B.: Almost everywhere behavior of general wavelet shrinkage operators. Applied and Computational Harmonic Analysis 9, 72–82 (2000) 14. Daubechies, I., Defrrise, M., Mol, C.D.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Pure and Applied Mathematics 57, 1413–1457 (2004) 15. Cichocki, A., Zdunek, R.: Multilayer nonnegative matrix factorization. Electronics Letters 42, 947–948 (2006) 16. Cichocki, A., Zdunek, R.: NTFLAB for Signal Processing. Technical report, Laboratory for Advanced Brain Signal Processing, BSI, RIKEN, Saitama, Japan (2006)
Appearance Models for Medical Volumes with Few Samples by Generalized 3D-PCA

Rui Xu1 and Yen-Wei Chen2,3

1 Graduate School of Engineering and Science, Ritsumeikan University, Japan
[email protected]
2 College of Information Science and Engineering, Ritsumeikan University, Japan
[email protected]
3 School of Electronic and Information Engineering, Dalian University of Technology, China
Abstract. Appearance models are important for the task of medical image analysis, such as segmentation. Principal component analysis (PCA) is an efficient method to build appearance models; however, the 3D medical volumes must first be unfolded into long 1D vectors before PCA can be used. For large medical volumes, such an unfolding preprocessing step causes two problems: one is a huge computational burden and the other is poor generalization. A method named generalized 3D-PCA is proposed in this paper to build appearance models for medical volumes. In our method, the volumes are directly treated as third order tensors in the building of the model, without the unfolding preprocessing. The output of our method is three matrices whose columns are formed by the orthogonal bases of the three mode subspaces. With the help of these matrices, the bases of the third order tensor space can be constructed. Owing to these properties, our method does not suffer from the two problems of the PCA-based methods. Eighteen 256 × 256 × 26 MR brain volumes are used in the experiments on building appearance models. The leave-one-out testing shows that our method performs well in building appearance models for medical volumes even when few samples are used for training.
1 Introduction

In recent years, much research has focused on how to extract statistical information from 3D medical volumes. Inspired by 2D active shape models (ASMs) [1], 3D ASMs [2][3] have been proposed to extract statistical shape information from 3D medical volumes. The 3D ASMs are an important method for segmentation, but this method does not take into account the intensity information of the voxels.
This work was supported in part by the Grant-in-Aid for Scientific Research from the Japanese Ministry for Education, Science, Culture and Sports under Grant No. 19500161, and the Strategic Information and Communications R&D Promotion Programme (SCOPE) under Grant No. 051307017.
information of the voxel’s intensity. So, the authors in [4] extend the works of 2D active appearance model (AAMs) [5] and propose the 3DAAMs to combine the shape models and appearance models together for 3D medical data. In the 3DAAMs, the appearance models for the volumes is built by the method based on the principal component analysis (PCA). PCA is a widely-used method to extract statistical information and successfully applied into many fields, such as image representation and face recognition. When the PCA-based methods are applied to build the appearance models for medical volumes, it is required to first unfold the 3D volumes to 1D long vectors in order to make the data to be processed by PCA. There are two types of ways to implement the PCA. In the first way, the bases of the unfolding vector space are obtained by calculating eigenvectors directly from the covariance matrix Cov, where Cov = A · AT and the columns of the matrix A are formed by the unfolded vectors. The dimension of the covariance matrix is depended on the dimension of the unfolded vectors. If the 3D volumes have large dimensions, the dimensions of covariance matrix are also very large. So we encounter the problem of huge burden of computing cost when we calculate the eigenvectors. The other way to implement PCA is similar to the one used in the work of eigenfaces [6]. Instead of calculating the the spaces directly from Cov, the eigenvectors ui is first calculated from the matrix Cov = AT · A. Then the bases of the unfolding vector space vi can be calculated by vi = A · ui . For the second way, the dimension of Cov is depended on the number of the training samples which is much fewer than the dimension of the unfolded vectors. Therefore, the second way to implement PCA does not suffer from the problem of huge computing cost. However, only M − 1 bases are available in the second way, where M is the number of training samples. In the medical field, the typical dimension of unfolding vectors is several millions; however the typical number of the medical data is only several tens. So the obtained bases are not enough to represent an untrained vector in the huge vector space. Therefore, the second way has the problem of bad performance on generalization when the large 3D volumes are used. The generalization problem may be overcome by increasing the training samples; however since the dimension of the unfolding vectors is huge, it is difficult to collect enough medical data for the training. Additionally, with the development of medical imaging techniques, the medical volumes become larger and larger. The conflict between the large dimensions of medical volumes and small number of training samples will become more prominent. Therefore, it is desirable to research on how to build the appearance model for medical volumes when the training samples are relative few. In this paper, a method called generalized 3D-PCA (G3D-PCA) is proposed to build the appearance models for the medical volumes. In our method, the 3D volumes are treated as the third order tensors in stead of unfolding them to be the long 1D vectors. The bases of the tensor space are constructed by three matrices whose columns are set by the leading eigenvectors calculated from the covariance matrices on the three mode subspaces. Since the dimensions of these covariance matrices are only depended by the dimensions of the corresponding
In this paper, a method called generalized 3D-PCA (G3D-PCA) is proposed to build appearance models for medical volumes. In our method, the 3D volumes are treated as third order tensors instead of being unfolded into long 1D vectors. The bases of the tensor space are constructed from three matrices whose columns are set to the leading eigenvectors calculated from the covariance matrices of the three mode subspaces. Since the dimensions of these covariance matrices depend only on the dimensions of the corresponding mode subspaces, our method is able to overcome the problem of huge computing cost. Additionally, the number of bases is not limited by the number of training samples, so our method can also overcome the generalization problem. The proposed method is based on multilinear algebra [7] [8], which is the mathematical tool for higher order tensors. Recently, multilinear algebra based methods have been applied to the fields of computer vision and pattern recognition with good results. The authors in [9] propose tensorfaces to build statistical models of faces under different conditions, such as illumination and viewpoint. A method called concurrent subspaces analysis (CSA) [10] is proposed for video sequence reconstruction and digit recognition. However, we have not found research applying multilinear algebra based methods in the field of medical image analysis.
2
Background Knowledge of Multilinear Algebra
We introduce some background knowledge about multilinear algebra in this section. For more detailed knowledge of multilinear algebra, the readers can refer to [7] [8]. The notations of the multilinear algebra in this paper are listed as follows. Scalers are denoted by italic-shape letters, i.e. (a, b, ...) or (A, B, ...). Bold lower case letters, i.e. (a, b, ...), are used to represent vectors. Matrices are denoted by bold upper case letters, i.e. (A, B, ...); and higher-order tensors are denoted by calligraphic upper case letters, i.e. (A, B, ...). An N -th order tensor A is defined as a multiway array with N indices, where A ∈ RI1 ×I2 ×...×IN . Elements of the tensor A are denoted as ai1 ...in ...iN , where 1 in In . The space of the N -th order tensor is comprised by the N mode subspaces. In the viewpoint of tensor, scales, vectors and matrices can be seen as zeroth-order, first order and second order tensors respectively. Varying the n-th index in and keeping the other indices fixed, we can obtain the ”mode-n vectors” of the tensor A. The ”mode-n matrix” A(n) can be formed by arranging all the mode-n vectors sequentially as its column vectors, A(n) ∈ RIn ×(I1 ·...In−1 ·In+1 ·...IN ) . The procedure of forming the mode-n matrix is called unfolding of a tensor. Fig. 1 gives the examples to show how to unI2
Fig. 1. Example of unfolding a third-order tensor into its three mode-n matrices
The mode-n product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ and a matrix $\mathbf{U} \in \mathbb{R}^{J_n \times I_n}$, denoted $\mathcal{A} \times_n \mathbf{U}$, is an $(I_1 \times \ldots \times I_{n-1} \times J_n \times I_{n+1} \times \ldots \times I_N)$ tensor whose entries are defined by
$$(\mathcal{A} \times_n \mathbf{U})_{i_1 \ldots i_{n-1} j_n i_{n+1} \ldots i_N} = \sum_{i_n} a_{i_1 \ldots i_{n-1} i_n i_{n+1} \ldots i_N}\, u_{j_n i_n}.$$
These entries can also be obtained by a matrix product, $\mathbf{B}_{(n)} = \mathbf{U} \cdot \mathbf{A}_{(n)}$, where $\mathbf{B}_{(n)}$ is the mode-n matrix of the tensor $\mathcal{B} = \mathcal{A} \times_n \mathbf{U}$. The mode-n product has two properties: $(\mathcal{A} \times_n \mathbf{U}) \times_m \mathbf{V} = (\mathcal{A} \times_m \mathbf{V}) \times_n \mathbf{U} = \mathcal{A} \times_n \mathbf{U} \times_m \mathbf{V}$, and $(\mathcal{A} \times_n \mathbf{U}) \times_n \mathbf{V} = \mathcal{A} \times_n (\mathbf{V} \cdot \mathbf{U})$.
The inner product of two tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times I_2 \times \ldots \times I_N}$ is defined by $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1}\sum_{i_2}\cdots\sum_{i_N} a_{i_1 i_2 \ldots i_N}\, b_{i_1 i_2 \ldots i_N}$. The Frobenius norm of a tensor $\mathcal{A}$ is defined by $\|\mathcal{A}\| = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$. It can also be computed from any mode-n matrix, $\|\mathcal{A}\| = \|\mathbf{A}_{(n)}\| = \sqrt{\mathrm{tr}(\mathbf{A}_{(n)} \cdot \mathbf{A}_{(n)}^T)}$, where $\mathrm{tr}(\cdot)$ is the trace of a matrix.
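As a concrete illustration of these definitions, the short NumPy sketch below (ours, not from the paper) forms the mode-n matrix of a third-order tensor and computes a mode-n product; the final line checks the identity $\mathbf{B}_{(n)} = \mathbf{U} \cdot \mathbf{A}_{(n)}$ numerically.

```python
import numpy as np

def unfold(tensor, n):
    """Mode-n matrix A_(n): mode n becomes the rows, remaining modes the columns."""
    return np.moveaxis(tensor, n, 0).reshape(tensor.shape[n], -1)

def mode_n_product(tensor, U, n):
    """Mode-n product A x_n U for a tensor A and a matrix U of shape (J_n, I_n)."""
    moved = np.moveaxis(tensor, n, 0)                 # bring mode n to the front
    prod = np.tensordot(U, moved, axes=([1], [0]))    # contract over I_n
    return np.moveaxis(prod, 0, n)                    # move the new mode back

# Check B_(n) = U . A_(n) on a random 4 x 5 x 6 tensor.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 5, 6))
U = rng.standard_normal((3, 5))                       # acts on the second mode
B = mode_n_product(A, U, 1)
print(np.allclose(unfold(B, 1), U @ unfold(A, 1)))    # True
```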
3 Appearance Models Built by Generalized 3D-PCA
A medical volume, such as an MR volume, can be seen as a third-order tensor. The three modes have their own physical meanings (the coronal, sagittal and transversal axes, respectively). Given a series of medical volumes collected from the same organ of different patients, there is common statistical information in their appearance (the voxel intensities). In this paper, we propose a multilinear-algebra-based algorithm named G3D-PCA to build such appearance models for 3D medical volumes.

We formulate G3D-PCA for appearance model building as follows. Given a series of third-order tensors with zero mean, $\mathcal{A}_i \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, $i = 1, 2, \ldots, M$, $\sum_{i=1}^{M} \mathcal{A}_i = 0$, we wish to find three matrices with orthogonal columns, $\mathbf{U} \in \mathbb{R}^{I_1 \times J_1}$, $\mathbf{V} \in \mathbb{R}^{I_2 \times J_2}$, $\mathbf{W} \in \mathbb{R}^{I_3 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$), which reconstruct the $\mathcal{A}_i$ from the corresponding dimension-reduced tensors $\mathcal{B}_i \in \mathbb{R}^{J_1 \times J_2 \times J_3}$ with least error. (If the tensors do not have zero mean, we subtract the mean from each tensor to obtain a new series $\mathcal{A}'_i = \mathcal{A}_i - \frac{1}{M}\sum_{i=1}^{M} \mathcal{A}_i$ with zero mean.) Each reconstructed tensor $\hat{\mathcal{A}}_i$ can be expressed by Eq. 1. The matrices $\mathbf{U}, \mathbf{V}, \mathbf{W}$ contain the statistical intensity information of the volume set $\mathcal{A}_i$, and they are the orthogonal bases of the three mode subspaces, respectively.
$$\hat{\mathcal{A}}_i = \mathcal{B}_i \times_1 \mathbf{U} \times_2 \mathbf{V} \times_3 \mathbf{W} \qquad (1)$$
To find the three matrices, we minimize the cost function S shown in Eq. 2:
$$S = \sum_{i=1}^{M} \|\mathcal{A}_i - \hat{\mathcal{A}}_i\|^2 = \sum_{i=1}^{M} \|\mathcal{A}_i - \mathcal{B}_i \times_1 \mathbf{U} \times_2 \mathbf{V} \times_3 \mathbf{W}\|^2 \qquad (2)$$
Algorithm 1. Iterative algorithm to compute the matrices $\mathbf{U}_{opt}$, $\mathbf{V}_{opt}$, $\mathbf{W}_{opt}$

IN: a series of third-order tensors $\mathcal{A}_i \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, $i = 1, 2, \ldots, M$.
OUT: matrices $\mathbf{U}_{opt} \in \mathbb{R}^{I_1 \times J_1}$, $\mathbf{V}_{opt} \in \mathbb{R}^{I_2 \times J_2}$, $\mathbf{W}_{opt} \in \mathbb{R}^{I_3 \times J_3}$ ($J_1 < I_1$, $J_2 < I_2$, $J_3 < I_3$) with orthogonal column vectors.

1. Initial values: $k = 0$ and $\mathbf{U}_0$, $\mathbf{V}_0$, $\mathbf{W}_0$, whose columns are the first $J_1$, $J_2$, $J_3$ leading eigenvectors of the matrices $\sum_{i=1}^{M} \mathbf{A}_{i(1)} \mathbf{A}_{i(1)}^T$, $\sum_{i=1}^{M} \mathbf{A}_{i(2)} \mathbf{A}_{i(2)}^T$, $\sum_{i=1}^{M} \mathbf{A}_{i(3)} \mathbf{A}_{i(3)}^T$, respectively.
2. Iterate until convergence:
   - Maximize $S' = \sum_{i=1}^{M} \|\mathcal{C}_i \times_1 \mathbf{U}^T\|^2$ with $\mathcal{C}_i = \mathcal{A}_i \times_2 \mathbf{V}_k^T \times_3 \mathbf{W}_k^T$. Solution: $\mathbf{U}$ whose columns are the first $J_1$ leading eigenvectors of $\sum_{i=1}^{M} \mathbf{C}_{i(1)} \mathbf{C}_{i(1)}^T$. Set $\mathbf{U}_{k+1} = \mathbf{U}$.
   - Maximize $S' = \sum_{i=1}^{M} \|\mathcal{C}_i \times_2 \mathbf{V}^T\|^2$ with $\mathcal{C}_i = \mathcal{A}_i \times_1 \mathbf{U}_{k+1}^T \times_3 \mathbf{W}_k^T$. Solution: $\mathbf{V}$ whose columns are the first $J_2$ leading eigenvectors of $\sum_{i=1}^{M} \mathbf{C}_{i(2)} \mathbf{C}_{i(2)}^T$. Set $\mathbf{V}_{k+1} = \mathbf{V}$.
   - Maximize $S' = \sum_{i=1}^{M} \|\mathcal{C}_i \times_3 \mathbf{W}^T\|^2$ with $\mathcal{C}_i = \mathcal{A}_i \times_1 \mathbf{U}_{k+1}^T \times_2 \mathbf{V}_{k+1}^T$. Solution: $\mathbf{W}$ whose columns are the first $J_3$ leading eigenvectors of $\sum_{i=1}^{M} \mathbf{C}_{i(3)} \mathbf{C}_{i(3)}^T$. Set $\mathbf{W}_{k+1} = \mathbf{W}$.
   - $k = k + 1$.
3. Set $\mathbf{U}_{opt} = \mathbf{U}_k$, $\mathbf{V}_{opt} = \mathbf{V}_k$, $\mathbf{W}_{opt} = \mathbf{W}_k$.
In Eq. 2, only the tensors $\mathcal{A}_i$ are known. However, if the three matrices are given, finding the $\mathcal{B}_i$ that minimize Eq. 2 is simply a linear least-squares problem, which yields Theorem 1.

Theorem 1. Given fixed matrices $\mathbf{U}, \mathbf{V}, \mathbf{W}$, the tensors $\mathcal{B}_i$ that minimize the cost function Eq. 2 are given by $\mathcal{B}_i = \mathcal{A}_i \times_1 \mathbf{U}^T \times_2 \mathbf{V}^T \times_3 \mathbf{W}^T$.

With the help of Theorem 1, Theorem 2 can be obtained.

Theorem 2. If the tensors $\mathcal{B}_i$ are chosen as $\mathcal{B}_i = \mathcal{A}_i \times_1 \mathbf{U}^T \times_2 \mathbf{V}^T \times_3 \mathbf{W}^T$, minimizing the cost function Eq. 2 is equivalent to maximizing the cost function $S' = \sum_{i=1}^{M} \|\mathcal{A}_i \times_1 \mathbf{U}^T \times_2 \mathbf{V}^T \times_3 \mathbf{W}^T\|^2$.

There is no closed-form solution that resolves the matrices $\mathbf{U}, \mathbf{V}, \mathbf{W}$ simultaneously for the cost function $S'$; however, if two of them are fixed, the third matrix that maximizes $S'$ has an explicit solution.

Lemma 1. Given fixed matrices $\mathbf{V}, \mathbf{W}$, the cost function $S'$ is maximized if the columns of $\mathbf{U}$ are selected as the first $J_1$ leading eigenvectors of the matrix $\sum_{i=1}^{M} \mathbf{C}_{i(1)} \mathbf{C}_{i(1)}^T$, where $\mathbf{C}_{i(1)}$ is the mode-1 matrix of the tensor $\mathcal{C}_i = \mathcal{A}_i \times_2 \mathbf{V}^T \times_3 \mathbf{W}^T$.
Fig. 2. Illustration of reconstructing a volume by the principal components Bi and the three orthogonal bases of mode subspaces Uopt , Vopt , Wopt
Lemma 2. Given fixed matrices $\mathbf{U}, \mathbf{W}$, the cost function $S'$ is maximized if the columns of $\mathbf{V}$ are selected as the first $J_2$ leading eigenvectors of the matrix $\sum_{i=1}^{M} \mathbf{C}_{i(2)} \mathbf{C}_{i(2)}^T$, where $\mathbf{C}_{i(2)}$ is the mode-2 matrix of the tensor $\mathcal{C}_i = \mathcal{A}_i \times_1 \mathbf{U}^T \times_3 \mathbf{W}^T$.

Lemma 3. Given fixed matrices $\mathbf{U}, \mathbf{V}$, the cost function $S'$ is maximized if the columns of $\mathbf{W}$ are selected as the first $J_3$ leading eigenvectors of the matrix $\sum_{i=1}^{M} \mathbf{C}_{i(3)} \mathbf{C}_{i(3)}^T$, where $\mathbf{C}_{i(3)}$ is the mode-3 matrix of the tensor $\mathcal{C}_i = \mathcal{A}_i \times_1 \mathbf{U}^T \times_2 \mathbf{V}^T$.

According to Lemmas 1, 2 and 3, we can use an iterative algorithm to obtain the optimal matrices $\mathbf{U}_{opt}$, $\mathbf{V}_{opt}$, $\mathbf{W}_{opt}$ that maximize the cost function $S'$; it is summarized as Algorithm 1. We terminate the iteration when the value of the cost function no longer changes significantly between two consecutive iterations. Convergence is usually fast; in our experience, two or three iterations are enough. Using the computed matrices $\mathbf{U}_{opt}$, $\mathbf{V}_{opt}$, $\mathbf{W}_{opt}$, each volume $\mathcal{A}_i$ can be approximated with least error by $\hat{\mathcal{A}}_i = \mathcal{B}_i \times_1 \mathbf{U}_{opt} \times_2 \mathbf{V}_{opt} \times_3 \mathbf{W}_{opt}$, where $\mathcal{B}_i = \mathcal{A}_i \times_1 \mathbf{U}_{opt}^T \times_2 \mathbf{V}_{opt}^T \times_3 \mathbf{W}_{opt}^T$. The approximation is illustrated in Fig. 2. In G3D-PCA, the three matrices $\mathbf{U}_{opt}$, $\mathbf{V}_{opt}$, $\mathbf{W}_{opt}$ construct the bases of the third-order tensor space, and the entries of the tensors $\mathcal{B}_i$ are the principal components.
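One possible NumPy realization of this alternating procedure and of the reconstruction step is sketched below. This is our own illustrative code under the stated assumptions (random toy data, a fixed iteration count), not the authors' implementation; all function and variable names are ours.

```python
import numpy as np

def unfold(t, n):
    """Mode-n matrix of a tensor (mode n as rows)."""
    return np.moveaxis(t, n, 0).reshape(t.shape[n], -1)

def leading_eigvecs(S, J):
    """First J leading eigenvectors (as columns) of the symmetric matrix S."""
    w, V = np.linalg.eigh(S)
    return V[:, np.argsort(w)[::-1][:J]]

def project_except(A, mats, skip):
    """C = A x_m mats[m]^T for every mode m except `skip` (mats[m] is I_m x J_m)."""
    C = A
    for m, M in enumerate(mats):
        if m == skip:
            continue
        C = np.moveaxis(np.tensordot(M.T, np.moveaxis(C, m, 0), axes=([1], [0])), 0, m)
    return C

def g3d_pca(volumes, ranks, n_iter=3):
    """Alternating estimation of U, V, W in the spirit of Algorithm 1."""
    # Initialization from the mode-n covariance matrices of the data.
    mats = [leading_eigvecs(sum(unfold(A, n) @ unfold(A, n).T for A in volumes), J)
            for n, J in enumerate(ranks)]
    # Alternately refresh each factor while keeping the other two fixed.
    for _ in range(n_iter):
        for n in range(3):
            cov = 0
            for A in volumes:
                Cn = unfold(project_except(A, mats, n), n)
                cov = cov + Cn @ Cn.T
            mats[n] = leading_eigvecs(cov, ranks[n])
    return mats  # U_opt, V_opt, W_opt

# Toy usage: center the data, train, then reconstruct one volume.
rng = np.random.default_rng(0)
data = rng.standard_normal((6, 8, 9, 7))
data -= data.mean(axis=0)
U, V, W = g3d_pca(data, ranks=(4, 4, 3))
B0 = np.einsum('abc,aj,bk,cl->jkl', data[0], U, V, W)   # B_i = A_i x1 U^T x2 V^T x3 W^T
A_hat = np.einsum('jkl,aj,bk,cl->abc', B0, U, V, W)     # A_hat = B_i x1 U x2 V x3 W
print(A_hat.shape)  # (8, 9, 7)
```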
Table 1. Comparison of G3D-PCA and the two PCA-based methods, given the training volumes $\mathcal{A}_i \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, $i = 1, 2, \ldots, M$

Methods            Dimension of Covariance Matrices                 Maximal Number of Available Bases
PCA - First Way    (I1·I2·I3) × (I1·I2·I3)                          I1·I2·I3
PCA - Second Way   M × M                                            M − 1
G3D-PCA            Mode-1: I1×I1; Mode-2: I2×I2; Mode-3: I3×I3      I1·I2·I3
Compared with the PCA-based methods, G3D-PCA has advantages in two respects. Given M training volumes of dimension I1 × I2 × I3, the covariance matrix for the first way of implementing PCA has a huge dimension of (I1·I2·I3) × (I1·I2·I3), so that method suffers from a huge computing cost. In contrast, G3D-PCA finds three orthogonal bases in the three mode subspaces rather than one orthogonal basis in the very long 1D vector space. The three bases are obtained by computing the leading eigenvectors of the three covariance matrices of the mode subspaces; since these matrices have dimensions I1 × I1, I2 × I2 and I3 × I3 respectively, the eigenvector computation is efficient, and G3D-PCA avoids the huge computing cost. On the other hand, compared with the second way of implementing PCA, G3D-PCA can obtain enough bases to represent untrained volumes: in theory, the maximal number of tensor bases constructed from the bases of the three mode subspaces is I1·I2·I3. The number of available bases is therefore not limited by the number of training samples, which gives G3D-PCA good generalization performance even when few samples are used for training. Table 1 compares G3D-PCA with the two PCA-based methods.
4 Experiments
We use eighteen MR T1-weighted volumes from the Vanderbilt database [11] to build appearance models of the human brain. The eighteen volumes are collected from different patients, and their dimensions are 256 × 256 × 26. We choose one volume as the template and align the other seventeen volumes onto it by similarity-transformation-based registration. A 3D similarity transformation has seven parameters: three translations, three rotation angles and one scaling factor. Fig. 3 shows three of the registered volumes.
Fig. 3. Example of registered volumes
For these volumes, the first way of implementing PCA suffers from a huge computing burden. For 256 × 256 × 26 volumes, the unfolded vectors have a dimension of 1,703,936, so the covariance matrix in the unfolded vector space has an extremely large dimension of 1,703,936 × 1,703,936. Even if the covariance matrix is stored in single-precision float, about 10,816 GB of memory would be needed, which is impossible on current desktop PCs.
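For reference, the quoted memory figure follows directly from the matrix size (our back-of-the-envelope check, assuming 4-byte floats):

```latex
(256 \times 256 \times 26)^2 \times 4~\text{bytes}
  = 1{,}703{,}936^{\,2} \times 4~\text{bytes}
  \approx 1.16 \times 10^{13}~\text{bytes}
  \approx 10{,}816~\text{GB}.
```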
The second way of implementing PCA suffers from poor generalization. This can be demonstrated by a leave-one-out experiment, in which seventeen volumes are used to train the appearance model and the remaining one is reconstructed by the trained model for testing. Fig. 4 shows the result of the leave-one-out experiment when PCA is implemented in the second way; note that all 16 available bases are used in the reconstruction. The quality of the reconstructed slices is not satisfactory: the reconstructed slices are blurred, and the tumor in the 17th slice cannot be recognized in the reconstructed result. This is because the available bases are too few to represent an untrained vector in the high-dimensional unfolded vector space.
[Fig. 4 compares the original and reconstructed 7th, 13th and 17th slices; RMSE of the reconstruction: 0.197]
Fig. 4. Result of leave-one-out testing for the PCA-based method implemented by the second way
G3D-PCA does not suffer from these two problems. The covariance matrices of the three mode subspaces have dimensions of 256 × 256, 256 × 256 and 26 × 26 respectively, so their eigenvectors can be computed efficiently. Moreover, G3D-PCA can obtain enough bases to represent an untrained volume in the third-order tensor space. This is also demonstrated by the leave-one-out experiment: Fig. 5 shows the result for G3D-PCA, with the same training samples and testing volume as in the PCA experiment. The untrained volume is represented by tensor bases of three different sizes, 50·50·15 = 37,500, 75·75·20 = 112,500 and 100·100·20 = 200,000. Since the maximal number of available bases is 256·256·26 = 1,703,936, the compression rates for the three cases are approximately 2.2%, 6.6% and 11.7%, respectively.
[Fig. 5 shows the original 7th, 13th and 17th slices and the reconstructions with bases of dimension 50·50·15 (RMSE 7.69E-2), 75·75·20 (RMSE 5.36E-2) and 100·100·20 (RMSE 4.23E-2)]
Fig. 5. Result of leave-one-out testing for the proposed G3D-PCA method
The quality of the reconstructed results becomes better as the dimension of the bases increases. The root mean square error (RMSE) between the reconstructed volumes and the original volumes is also reported. Compared with the PCA-based result, the reconstructions based on G3D-PCA are much better: they are clearer, and the tumor region is also reconstructed well. This experiment shows that the appearance models for medical volumes built with G3D-PCA generalize well even when trained from few samples.
5 Conclusion
In this paper we propose a method named G3D-PCA, based on multilinear algebra, to build appearance models for 3D medical volumes with large dimensions.
In this method, 3D volumes are treated directly as third-order tensors without being unfolded into 1D vectors. The output of G3D-PCA is three matrices whose columns are the orthogonal bases of the three mode subspaces. The bases of the third-order tensor space can be constructed from these three matrices, and the maximal number of available tensor bases is not limited by the number of training samples. As a result, our method does not suffer from a huge computing cost and, moreover, generalizes well even when few samples are used in training.
References 1. Cootes, T.F., Cooper, D., Taylor, C.J., Graham, J.: Active shape models – their training and application. Comput. Vision Image Understanding 61(1), 38–59 (1995) 2. van Assen, H.C., Danilouchkine, M.G., Behloul, F.: J., H., van der Geest, R.J., Reiber, J.H.C., Lelieveldt, B.P.F.: Cardiac lv segmentation using a 3d active shape model driven by fuzzy inference. In: Procedings of Lecture Notes in Computer Science: Medical Image Computing and Computer-Assisted Intervention, pp. 533– 540 (2003) 3. Kaus, M.R., von Berg, J., Weese, J., Niessen, W.J., Pekar, V.: Automated segmentation of the left ventricle in cardiac mri. Medical Image Analysis 8(3), 245–254 (2004) 4. Mitchell, S.C., Bosch, J.G., Lelieveldt, B.P.F., van der Geest, R.J., Reiber, J.H.C., Sonka, M.: 3d active appearance models: Segmentation of cardiac mr and ultrasound images. IEEE Transactions on Medical Imaging 21(9), 1167–1178 (2002) 5. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001) 6. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991) 7. Lathauwer, L.D., Moor, B.D., Vandewalle, J.: A multilinear sigular value decomposition. SIAM Journal of Matrix Analysis and Application 21(4), 1253–1278 (2000) 8. Lathauwer, L.D., Moor, B.D., Vandewalle, J.: On the best rank-1 and rank-(r1, r2,., rn) approximation of higher-order tensors. SIAM Journal of Matrix Analysis and Application 21(4), 1324–1342 (2000) 9. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear analysis of image ensembles: Tensorfaces. In: Procedings of European Conference on Computer Vision, pp. 447–460 (2002) 10. Xu, D., Yan, S.C., Zhang, L., Zhang, H.J., Liu, Z.L., Shum, H.Y.: Concurrent subspaces analysis. In: Procedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 203–208 (2005) 11. West, J., Fitzpatrick, J., et al.: Comparison and evaluation of retrospective intermodality brain image registration techniques. J. Comput. Assist. Tomogr. 21, 554–566 (1997)
Head Pose Estimation Based on Tensor Factorization Wenlu Yang1,2 , Liqing Zhang1, , and Wenjun Zhu1 1
Department of Computer Science and Engineering, 800 Dongchuan Rd., Shanghai Jiaotong University, Shanghai 200240, China 2 Department of Electronic Engineering, 1550 Pudong Rd. Shanghai Maritime University, Shanghai 200135, China
[email protected],
[email protected],
[email protected] Abstract. This paper investigates the head pose estimation problem, which is considered as front-end preprocessing for improving multi-view human face recognition. We propose a computational model for perceiving head poses based on Non-negative Multi-way Factorization (NMWF). The model consists of three components: the tensor representation of multi-view faces, feature selection, and head pose perception. To find the facial representation basis, the NMWF algorithm is applied to the training set of facial images. The face tensor comprises three factors: facial images, poses, and people. The former two factors are used to construct the computational model for pose estimation. The discriminative measure for perceiving the head pose is defined by the similarity between the tensor basis and the representation of a test facial image, i.e., the projection of the face onto the subspace spanned by the "TensorFaces" basis. Computer simulation results show that the proposed model achieves satisfactory accuracy in estimating head poses of facial images in the CAS-PEAL face database. Keywords: Head pose estimation, Face recognition, Tensor decomposition, Non-negative multi-way factorization, TensorFaces.
1 Introduction
Humans possess a remarkable ability to recognize faces regardless of facial geometry, expression, head pose, lighting conditions, distance, and age. Modelling the functions of face recognition and identification remains a difficult open problem in computer vision and neuroscience. In particular, face recognition robust to head pose variation is still difficult in complex natural environments, so head pose estimation is a very useful front-end processing step for multi-view face recognition. Head pose variation covers three free rotation parameters: yaw (rotation around the neck), tilt (rotation up and down), and roll (rotation from left profile to right profile). In this paper, we focus on the yaw rotation, which has many important applications. Previous methods for head pose estimation from 2D images can be roughly divided into two categories: template-based approaches and appearance-based approaches.
The template-based approaches rely on nearest-neighbor classification with texture templates [1][2], or are deduced from geometric information such as configurations of facial landmarks [3][5][6]. Appearance-based approaches generally regard head pose estimation as a parameter estimation problem, which can be handled by regression, multi-class classification, or nonlinear dimension reduction. Typical appearance-based approaches include Support Vector Machines (SVM) [7][8], a tree-structured AdaBoost pose classifier [9], soft-margin AdaBoost [10], Neural Networks (NN) [11][12], Gabor Wavelet Networks [13], MDS [14], Principal Component Analysis (PCA) [15][16], Kernel PCA [17], Independent Component Analysis (ICA) [18][24], and linear or nonlinear embedding and mapping [19][20] (ISOMAP [21], Locally Linear Embedding (LLE) [22][23]). In this paper, we propose a computational model for perceiving head poses from facial images using tensor decomposition. The rest of this paper is organized as follows. In Section 2, we propose a computational model for pose estimation and introduce the tensor decomposition algorithm, the Non-negative Multi-Way Factorization (NMWF). In Section 4, the tensor bases for different head poses are obtained from training data, and computer simulations are provided to show the performance of our pose estimation method. Finally, discussions and conclusions are given in Section 5.
2 Tensor Representation and Learning Algorithm
In this section, we first review related linear models for tensor decomposition. We then propose a computational model for perceiving multi-view faces, or head poses, from facial images, introduce the Non-negative Multi-way Factorization (NMWF) algorithm for learning the representation basis functions of facial images, and define the representation bases for different views. Finally, we describe a method for estimating head poses from facial images.

2.1 Related Models for Tensor Decomposition
PCA has been a popular technique in facial image recognition, and its generalization, ICA, has also been employed in face recognition [24]. The PCA-based face recognition method Eigenfaces works well when face images are well aligned, but if facial images contain other factors, such as lighting, pose, and expression, the recognition performance of Eigenfaces degenerates dramatically. Tensor representations can formulate more than two factors in one framework, so tensor models have a richer representational capability than the commonly used matrix representation. Grimes and Rao [25] proposed a bilinear sparse model to learn styles and contents from natural images. When facial images contain more than two factors, such linear and bilinear models do not work well, and several researchers have proposed multilinear algebra to represent multiple factors of image ensembles. Vasilescu and Terzopoulos [26] applied the higher-order tensor approach to factorize facial images, resulting in the "TensorFaces" representation. Later, they proposed a multilinear
ICA model to learn the statistically independent components of multiple factors. Mørup et al. proposed the Non-negative Multi-way Factorization (NMWF) algorithm, a generalization of Lee and Seung's non-negative matrix factorization (NMF) algorithm [28], to decompose the time-frequency representation of EEG [27]. Taking advantage of the multiple-factor representation of tensor models, we use tensor factorization to find the tensor basis for facial images and propose a computational model for perceiving head poses. The model consists of two parts, a training module and a perception module, as shown in Fig. 1. The training module finds the tensor representation using the non-negative multi-way algorithm; using the coefficients of the pose factor, we define the representation bases for different head poses.
Fig. 1. Model for learning the “TensorFaces” and estimating poses
The second module, the perception module, estimates the head pose of an input facial image. It is composed of the testing set, the basis functions, the pose coefficients, the similarity analysis, and the output of the resulting pose. Given a facial image randomly selected from the testing set, the corresponding pose coefficients are obtained by projecting the image onto the subspace spanned by the basis functions. The similarity between these coefficients and the pose factors is then analyzed, and the maximal correlation coefficient indicates the corresponding pose. The next section gives a brief introduction to the NMWF and the algorithm for pose estimation.

2.2 The NMWF Algorithm
In this section, we briefly introduce the tensor representation model and the Non-negative Multi-way Factorization (NMWF) algorithm. Generally speaking, a tensor is a generalization of a 2-dimensional matrix. Given an n-way array $\mathcal{T}$, the objective of tensor factorization is to decompose $\mathcal{T}$ into the following N components:
$$\mathcal{T} \approx \mathcal{T}^h = \sum_{j=1}^{N} U_j^{(1)} \otimes U_j^{(2)} \otimes \cdots \otimes U_j^{(n)} \qquad (1)$$
This factorization is important for many applications, since each rank-1 component represents a certain feature of the tensor data. A number of questions need to be discussed for this factorization. The first is how many components are appropriate for the decomposition; this is related to matrix rank estimation in the singular value decomposition of matrices. It is usually difficult to estimate the rank of a tensor, because tensor rank estimation is still an NP-hard and open problem; see [4] for a tutorial. The second issue is what criterion should be used for finding the tensor components. Commonly used criteria include the least-squares error, the relative entropy, and the more general alpha-divergence. In this paper, we use the relative entropy as the criterion to find the tensor components. Given a non-negative tensor $\mathcal{T}$, the relative entropy between $\mathcal{T}$ and its approximation $\mathcal{T}^h$ is defined as
$$Div(\mathcal{T}\,\|\,\mathcal{T}^h) = \sum_{j_1, j_2, \cdots, j_N} \left( T_{j_1, j_2, \cdots, j_N} \log \frac{T_{j_1, j_2, \cdots, j_N}}{T^h_{j_1, j_2, \cdots, j_N}} - T_{j_1, j_2, \cdots, j_N} + T^h_{j_1, j_2, \cdots, j_N} \right) \qquad (2)$$
Denote the inner product of two tensors as $\langle \mathcal{T}, \mathcal{R} \rangle = \sum_{j_1, j_2, \cdots, j_N} T_{j_1, j_2, \cdots, j_N} R_{j_1, j_2, \cdots, j_N}$. Then the relative entropy can be rewritten as
$$Div(\mathcal{T}\,\|\,\mathcal{T}^h) = \langle \mathcal{T}, \log(\mathcal{T}) \rangle - \langle \mathcal{T}, \log(\mathcal{T}^h) \rangle - \langle 1, \mathcal{T} \rangle + \langle 1, \mathcal{T}^h \rangle. \qquad (3)$$
To exploit specific properties of the data, such as sparseness and smoothness, additional constraints may be imposed on the optimization problem. Using the gradient descent approach, the multiplicative learning rule is derived as
$$U_j^{p} \Leftarrow U_j^{p}\; \frac{\sum_{k_1, k_2, \cdots, k_N} \frac{T_{k_1, k_2, \cdots, k_N}}{T^h_{k_1, k_2, \cdots, k_N}} \prod_{i=1}^{n} U_{k_i}^{i} / U_j^{p}}{\sum_{k_1, k_2, \cdots, k_N} \prod_{i=1}^{n} U_{k_i}^{i} / U_j^{p}} \qquad (4)$$
The algorithm is implemented by alternately updating one set of parameters while keeping the others unchanged.
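To make the alternating multiplicative scheme tangible, the following NumPy sketch implements a non-negative CP-style factorization of a 3-way array with Lee-Seung-type updates for the relative-entropy criterion. It is an illustrative stand-in written by us for this kind of algorithm, not the NMWF code of [27]; names, the random initialization and the fixed iteration count are our assumptions.

```python
import numpy as np

def unfold(t, n):
    return np.moveaxis(t, n, 0).reshape(t.shape[n], -1)

def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J x R) and C (K x R) -> (J*K x R)."""
    R = B.shape[1]
    return (B[:, None, :] * C[None, :, :]).reshape(-1, R)

def nonneg_3way_kl(T, R, n_iter=200, eps=1e-12):
    """Non-negative 3-way factorization under a relative-entropy criterion (cf. Eq. 2)."""
    rng = np.random.default_rng(0)
    factors = [rng.random((dim, R)) + eps for dim in T.shape]
    for _ in range(n_iter):
        for n in range(3):
            others = [factors[m] for m in range(3) if m != n]
            KR = khatri_rao(others[0], others[1])      # matches unfold()'s column order
            Th_n = factors[n] @ KR.T + eps             # mode-n matrix of the model T^h
            ratio = unfold(T, n) / Th_n                # elementwise T / T^h
            factors[n] *= (ratio @ KR) / (KR.sum(axis=0) + eps)  # multiplicative update
    return factors

# Toy usage on a random non-negative 5 x 6 x 4 tensor with R = 3 components.
T = np.random.default_rng(1).random((5, 6, 4))
U1, U2, U3 = nonneg_3way_kl(T, R=3)
print(U1.shape, U2.shape, U3.shape)   # (5, 3) (6, 3) (4, 3)
```

As in Eq. (4), each factor is refreshed by a ratio of two non-negative terms, so non-negativity is preserved at every step.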
3 Face Tensor Representation and Head Pose Estimation
Facial images can be formulated in the framework of tensor representation, with each mode representing a different factor. In this paper, we take three factors into consideration: facial images, head poses, and people.
Thus the facial tensor discussed in this paper consists of three modes. For a data set of facial images with $I_3$ people, $I_2$ poses per person and $I_1$-dimensional facial images, the tensor $\mathcal{T}^{I_1 \times I_2 \times I_3}$ denotes all facial images. According to the decomposition of $\mathcal{T}^{I_1 \times I_2 \times I_3}$, $U_j^{(1)}$, $j = 1, 2, \cdots, N$, represent the set of basis functions for the facial images, while $U_j^{(2)}$ and $U_j^{(3)}$ denote the pose and people coefficients associated with the basis function $U_j^{(1)}$. Using all basis functions $U_j^{(1)}$, $j = 1, 2, \cdots, N$, the k-th pose coefficients $U_j^{(2)}(k)$ ($k = 1, 2, \cdots, I_2$) and the l-th people coefficients $U_j^{(3)}(l)$, the facial image $T_{k,l}$ is reconstructed as
$$T^h_{k,l} = \sum_{j=1}^{N} U_j^{(1)}\, U_j^{(2)}(k)\, U_j^{(3)}(l). \qquad (5)$$
Therefore, the parameters $U_j^{(2)}(k)$, $U_j^{(3)}(l)$, $j = 1, 2, \cdots, N$, are considered as the feature representation on the tensor basis $U_j^{(1)}$, $j = 1, 2, \cdots, N$, of $\mathcal{T}^{I_1 \times I_2 \times I_3}$. To find the tensor representation of an input image, we use the least-squares method. Given a new facial image $X$, we attempt to find the coefficients $S = (s_1, s_2, \cdots, s_N)^T$ of the tensor representation
$$X = \sum_{j=1}^{N} s_j U_j^{(1)}. \qquad (6)$$
The least-squares solution of the above equation is $S = A^{-1} B$, where $A = (A_{i,j}) = \langle U_i^{(1)}, U_j^{(1)} \rangle$ and $B = (B_j) = \langle U_j^{(1)}, X \rangle$.

3.1 Head Pose Estimation
After obtaining the decomposition of the tensor $\mathcal{D}^{I_1 \times I_2 \times I_3}$, we can use the second mode of the model to estimate the poses of faces randomly selected from the testing set. Given a facial image $I = U_1 S$, where $U_1$ denotes the basis functions and $S$ the representation of the image on the subspace spanned by the bases, $S$ is obtained by computing $S = (U_1^T U_1)^{-1} U_1^T I$. The similarity between $S$ and $U_{2,k}$ ($k = 1, 2, \cdots, I_2$) is then calculated by the correlation measure
$$Corr(S, U_{2,k}) = \frac{\langle S, U_{2,k} \rangle}{\sqrt{\langle S, S \rangle \langle U_{2,k}, U_{2,k} \rangle}}. \qquad (7)$$
For facial images with different head poses, the features of $U_{2,k}$ represented in $S$ are different. Therefore, head poses can be estimated simply by maximizing the similarity between $S$ and $U_{2,k}$.
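The whole perception step (projection onto the basis followed by the correlation of Eq. (7)) is small enough to sketch directly. The code below is our illustration with made-up array shapes and random stand-ins for the learned factors, not the authors' implementation; the normalization follows the usual correlation definition.

```python
import numpy as np

def estimate_pose(image_vec, basis, pose_coeffs):
    """Estimate the pose index of one facial image.

    image_vec:   flattened test image, shape (I1,)
    basis:       U1, the "TensorFaces" basis functions, shape (I1, N)
    pose_coeffs: U2, rows are the pose-factor vectors U_{2,k}, shape (I2, N)
    """
    # Least-squares representation on the basis: S = (U1^T U1)^{-1} U1^T I  (Sec. 3.1).
    S, *_ = np.linalg.lstsq(basis, image_vec, rcond=None)
    # Correlation measure of Eq. (7) between S and each pose vector U_{2,k}.
    corr = pose_coeffs @ S / (np.linalg.norm(S) * np.linalg.norm(pose_coeffs, axis=1))
    return int(np.argmax(corr)), corr

# Toy usage with random numbers standing in for learned factors (7 poses, N = 20).
rng = np.random.default_rng(0)
U1, U2 = rng.random((1080, 20)), rng.random((7, 20))
pose, scores = estimate_pose(rng.random(1080), U1, U2)
print(pose, scores.shape)   # e.g. 3 (7,)
```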
4 Simulations and Results
In this section, we present simulation results to verify the performance of the computational model.

4.1 Face Database
The facial images are selected from the pose subset of the public CAS-PEAL-R1 face database [29]: images of 1040 subjects across 7 different poses {±45°, ±30°, ±15°, 0°}, shown in Fig. 2. In our experiments, we first crop the original facial images to patches of 300 × 250 pixels that include the whole face, using the positions of the eyes, and then resize them to 36 × 30 because of the limits of computational memory. The tensor $\mathcal{D}$ in Equation (1) is a 3-mode tensor with pixels, poses, and people. Due to limited computing resources, images of one hundred subjects across the seven poses are randomly selected, and the facial tensor $\mathcal{D}^{1080 \times 7 \times 100}$ is generated for finding the tensor basis functions. We use all the other images in the pose subset as the testing set.
Fig. 2. Examples of facial images in the CAS-PEAL-R1 face database [29]
4.2 TensorFaces Representation
The tensor $\mathcal{D}$ is decomposed by the NMWF, and the basis functions, or "TensorFaces", are obtained, as shown in Fig. 3. Careful examination of the "TensorFaces" shows that they have favorable characteristics for representing the seven poses. When different facial images are projected onto the tensor basis, the resulting coefficients $S$ of the seven poses are clearly discriminative. Subsets of the "TensorFaces" projected onto subspaces at different yaw angles are shown in Fig. 4.

4.3 Head Pose Estimation
Using the algorithm described in Section 3.1, we estimate the head poses of facial images randomly selected from the testing set; 500 facial images were used for testing. The accuracy of pose estimation is shown in Fig. 6. Compared with other methods, our method shows advantages in simplicity, effectiveness, and accuracy. First, on the same face database (CAS-PEAL), Ma et al. [30] used a combination of an SVM classifier and LGBPH features for pose estimation and reported a best accuracy of 97.14%.
Fig. 3. Basis functions decomposed by the NMWF from facial images across seven poses
Fig. 4. Subsets of “TensorFaces” in different poses. (Left) Pose angle of yaw rotation: -30◦ . (Right) Pose angle of yaw rotation: 0◦ .
Fig. 5. The left figure shows the sample features of four poses projected into 2-dimensional space. The right figure shows the sample features of seven poses projected into 3-dimensional space.
Fig. 6. Pose estimation results. (Left) The X-axis denotes the index of the test and the Y-axis denotes the pose estimation result. (Right) The same results rearranged by pose: the X-axis denotes the poses in the range {-45°, -30°, -15°, 0°, +15°, +30°, +45°} and the Y-axis denotes the pose estimation results by pose.
Using the same data, the worst accuracy we obtained is 98.6% and the average is 99.2%. Second, in preprocessing, Ma et al. applied geometric normalization, histogram equalization, and a facial mask to the images and cropped them to 64 × 64 pixels, whereas we simply cropped the images to include the head and resized them to 36 × 30 because of limited computing resources; our preprocessing is therefore simpler and the results are better. If the facial images are not cropped, the resulting pose estimation accuracy is worse, about 90.4% on average. The training set, randomly selected from the face database, is composed of 100 subjects with seven poses; our supplementary simulations show that if the number of subjects in the training set increases, the accuracy of pose estimation improves further, approaching 100%.
5 Discussions and Conclusions
In this paper, we proposed a computational model for head pose estimation based on tensor decomposition. Using the NMWF algorithm, facial images are decomposed into three factors: the "TensorFaces", poses, and people. The facial images are then projected onto the subspace spanned by the "TensorFaces", and the resulting representations can be used for pose estimation and, further, for face recognition. Taking the correlation between the representation of a facial image and the pose components as the similarity measure, the head poses of the facial images in the public CAS-PEAL database are estimated. The simulation results show that the model is very efficient. Acknowledgments. The work was supported by the National Basic Research Program of China (Grant No. 2005CB724301) and the National High-Tech Research Program of China (Grant No. 2006AA01Z125). Portions of the research in this paper use the CAS-PEAL face database collected under the sponsorship of the Chinese National Hi-Tech Program and ISVISION Tech. Co. Ltd.
References 1. Bichsel, M., Pentland, A.: Automatic interpretation of human head movements. In: 13th International Joint Conference on Artificial Intelligence (IJCAI), Workshop on Looking At People, Chambery France (1993) 2. McKenna, S., Gong, S.: Real time face pose estimation. Real-Time Imaging 4(5), 333–347 (1998) 3. Heinzmann, J., Zelinsky, A.: 3D facial pose and gaze point estimation using a robust real-time tracking paradigm. In: Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pp. 142–147 (1998) 4. Bro, R.: PARAFAC. Tutorial and applications, Chemom. Intell. Lab. Syst. In: Special Issue 2nd Internet Conf. in Chemometrics (INCINC 1996), vol. 38, pp. 149–171 (1997) 5. Xu, M., Akatsuka, T.: Detecting Head Pose from Stereo Image Sequence for Active Face Recognition. In: Proceedings of the 3rd. International Conference on Face & Gesture Recognition, pp. 82–87 (1998) 6. Hu, Y.X., Chen, L.B., Zhou, Y., Zhang, H.J.: Estimating face pose by facial asymmetry and geometry. In: Proceedings of Automatic Face and Gesture Recognition, pp. 651–656 (2004) 7. Huang, J., Shao, X., Wechsler, H.: Face pose discrimination using support vector machines (SVM). In: Proceedings of Fourteenth International Conference on Pattern Recognition, vol. 1, pp. 154–156 (1998) 8. Moon, H., Miller, M.: Estimating facial pose from a sparse representation. In: International Conference on Image Processing, vol. 1, pp. 75–78 (2004) 9. Yang, Z., Ai, H., Okamoto, T.: Multi-view face pose classification by tree-structured classifier. In: International Conference on Image Processing, vol. 2, pp. 358–361 (2005) 10. Guo, Y., Poulton, G., Li, J., Hedley, M.: Soft margin AdaBoost for face pose classification. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 221–224 (2003)
11. Baluja, S., Sahami, M., Rowley, H.A.: Efficient face orientation discrimination. In: International Conference on Image Processing, vol. 1, pp. 589–592 (2004) 12. Voit, M., Nickel, K., Stiefelhagen, R.: Multi-view head pose estimation using neural networks. In: The 2nd Canadian Conference on Computer and Robot Vision, pp. 347–352 (2005) 13. Kruger, V., Sommer, G.: Gabor Wavelet Networks for Object Representation. Journal of the Optical Society of America 19(6), 1112–1119 (2002) 14. Chen, L.B., Zhang, L., Hu, Y.X., Li, M.J., Zhang, H.J.: Head Pose Estimation Using Fisher Manifold Learning. In: Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 203–207 (2003) 15. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 586–591 (1991) 16. Martinez, A.M., adn Kak, A.C.: PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001) 17. Li, S.Z., Fu, Q.D., Gu, L., Scholkopf, B., Cheng, Y.M., Zhang, H.J.: Kernel Machine Based Learning for Multi-view Face Detection and Pose Estimation. International Conference of Computer Vision 2, 674–679 (2001) 18. Li, S.Z., Lv, X.G., Hou, X.W., Peng, X.H., Cheng, Q.S.: Learning Multi-View Face Subspaces and Facial Pose Estimation Using Independent Component Analysis. IEEE Transactions on Image Processing 14(6), 705–712 (2005) 19. Hu, N., Huang, W.M., Ranganath, S.: Head pose estimation by non-linear embedding and mapping. In: International Conference on Image Processing, vol. 2, pp. 342–345 (2005) 20. Raytchev, B., Yoda, I., Sakaue, K.: Head Pose Estimation By Nonlinear Manifold Learning. In: International Conference on Pattern Recognition, vol. 4, pp. 462–466 (2004) 21. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000) 22. Saul, L.K., Roweis, S.T.: Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research 4, 119–155 (2003) 23. Roweis, S., Saul, L.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290, 2323–2326 (2000) 24. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face recognition by independent component analysis. IEEE Transactions on Neural Networks 13(6), 1450–1464 (2002) 25. Grimes David, B., Rao Rajesh, P.N.: Bilinear Sparse Coding for Invariant Vision. Neural Computation 17(1), 47–73 (2005) 26. Alex, M., Vasilescu, O., Terzopoulos, D.: Multilinear analysis of image ensembles: TensorFaces. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 447–460. Springer, Heidelberg (2002) 27. Mørup, M., Hansen, L.K., Parnas, J., Arnfred, S.M.: Decomposing the timefrequency representation of EEG using non-negative matrix and multi-way factorization. Technical reports (2006) 28. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999) 29. Gao, W., Cao, B., Shan, S., Zhou, D., Zhang, X., Zhao, D.: The CAS-PEAL Large-Scale Chinese Face Database and Evaluation Protocols, Technical Report No. JDL TR 04 FR 001, Joint Research & Development Laboratory, CAS (2004) 30. Ma, B.P., Zhang, W.C., Shan, S.G., Chen, X.L., Gao, W.: Robust Head Pose Estimation Using LGBP. In: Proceeding of International Conference on Pattern Recognition (2), pp. 512–515 (2006)
Kernel Maximum a Posteriori Classification with Error Bound Analysis Zenglin Xu, Kaizhu Huang, Jianke Zhu, Irwin King, and Michael R. Lyu Dept. of Computer Science and Engineering, The Chinese Univ. of Hong Kong, Shatin, N.T., Hong Kong {zlxu,kzhuang,jkzhu,king,lyu}@cse.cuhk.edu.hk
Abstract. Kernel methods have been widely used in data classification. Many kernel-based classifiers, like Kernel Support Vector Machines (KSVM), assume that data can be separated by a hyperplane in the feature space, and they do not consider the data distribution. This paper proposes a novel Kernel Maximum A Posteriori (KMAP) classification method, which makes a Gaussian density assumption in the feature space and can be regarded as a more general classification method than other kernel-based classifiers such as Kernel Fisher Discriminant Analysis (KFDA). We also adopt robust methods for parameter estimation. In addition, the error bound analysis for KMAP indicates the effectiveness of the Gaussian density assumption in the feature space. Furthermore, KMAP achieves very promising results on eight UCI benchmark data sets against the competitive methods.
1 Introduction
Recently, kernel methods have been regarded as state-of-the-art classification approaches [1]. The basic idea of kernel methods in supervised learning is to map data from an input space to a high-dimensional feature space in order to make the data more separable. Classical kernel-based classifiers include the Kernel Support Vector Machine (KSVM) [2], Kernel Fisher Discriminant Analysis (KFDA) [3], and the Kernel Minimax Probability Machine [4,5]. The rationale behind them is that linear discriminant functions in the feature space can represent complex separating surfaces when mapped back to the original input space. However, one drawback of KSVM is that it does not consider the data distribution and cannot directly output probabilities or confidences for classification, which makes it hard to apply in systems that reason under uncertainty. On the other hand, in statistical pattern recognition, probability densities can be estimated from data, and future examples are then assigned to the class with the Maximum A Posteriori probability (MAP) [6]. One typical probability density function is the Gaussian, which is easy to handle; however, the Gaussian assumption cannot easily be satisfied in the input space, and it is hard to deal with non-linearly separable problems there.
To solve these problems, we propose a Kernel Maximum A Posteriori (KMAP) classification method under a Gaussianity assumption in the feature space. Different from KSVM, we make a Gaussian density assumption, which implies that the data can be separated by more complex surfaces in the feature space. In principle, distributions other than the Gaussian could also be assumed in the feature space; however, under a distribution with a complex form it is hard to obtain a closed-form solution and easy to fall into over-fitting. Moreover, with the Gaussian assumption, a kernelized version of our model can be derived without knowing the explicit form of the mapping function. In addition, to indicate the effectiveness of our assumption, we calculate a separability measure and the error bound for bi-category data sets. The error bound analysis shows that the Gaussian density assumption is more easily satisfied in the feature space. This paper is organized as follows. Section 2 derives the MAP decision rules in the feature space and analyzes the separability measures and upper error bounds. Section 3 presents the experiments against other classifiers. Section 4 reviews related work. Section 5 draws conclusions and lists possible future research directions.
2 Main Results
In this section, our MAP classification model is derived. Then, we adopt a special regularization to estimate the parameters. The kernel trick is used to calculate our model. Last, the separability measure and the error bound are calculated in the kernel-induced feature space.

2.1 Model Formulation
Under the Gaussian distribution assumption, the conditional density function for each class $C_i$ ($1 \le i \le m$) is written as
$$p(\Phi(\mathbf{x})\,|\,C_i) = \frac{1}{(2\pi)^{N/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\Phi(\mathbf{x}) - \mu_i)^T \Sigma_i^{-1} (\Phi(\mathbf{x}) - \mu_i) \right\}, \qquad (1)$$
where $\Phi(\mathbf{x})$ is the image of $\mathbf{x}$ in the feature space, $N$ is the dimension of the feature space ($N$ could be infinite), $\mu_i$ and $\Sigma_i$ are the mean and the covariance matrix of $C_i$, respectively, and $|\Sigma_i|$ is the determinant of the covariance matrix. According to the Bayes theorem, the posterior probability of class $C_i$ is calculated by
$$P(C_i\,|\,\mathbf{x}) = \frac{p(\mathbf{x}\,|\,C_i)\, P(C_i)}{\sum_{j=1}^{m} p(\mathbf{x}\,|\,C_j)\, P(C_j)}. \qquad (2)$$
Based on Eq. (2), the decision rule can be formulated as
$$\mathbf{x} \in C_w \quad \text{if} \quad P(C_w\,|\,\mathbf{x}) = \max_{1 \le j \le m} P(C_j\,|\,\mathbf{x}). \qquad (3)$$
This means that a test data point is assigned to the class with the maximum posterior $P(C_w|\mathbf{x})$, i.e., the MAP. Since the MAP is calculated in the kernel-induced feature space, the resulting model is named the KMAP classification. KMAP can provide not only a class label but also the probability of a data point belonging to that class. This probability can be viewed as a confidence in classifying new data points and can be used in statistical systems that reason under uncertainty; if the confidence is lower than a specified threshold, the system can refuse to make an inference. Many kernel learning methods, including KSVM, cannot output such probabilities. Taking the negative logarithm of the class-conditional density and dropping terms common to all classes, the decision rule can be further formulated with the following discriminant function:
$$g_i(\Phi(\mathbf{x})) = (\Phi(\mathbf{x}) - \mu_i)^T \Sigma_i^{-1} (\Phi(\mathbf{x}) - \mu_i) + \log|\Sigma_i|. \qquad (4)$$
Intuitively, a class is more likely to be assigned to an unlabeled data point when the Mahalanobis distance from the data point to the class center is smaller.
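As a plain input-space illustration of the decision rule in Eqs. (3)-(4) (not the kernelized method itself; the example data and names are ours), the sketch below assigns a point to the class minimizing the Mahalanobis-plus-log-determinant score:

```python
import numpy as np

def map_classify(x, means, covs, priors=None):
    """MAP decision under Gaussian class-conditional densities (cf. Eqs. (3)-(4)).

    Returns the index of the class minimizing
        g_i(x) = (x - mu_i)^T Sigma_i^{-1} (x - mu_i) + log|Sigma_i| - 2 log P(C_i).
    With equal priors the last term can be dropped, which matches Eq. (4).
    """
    m = len(means)
    priors = np.full(m, 1.0 / m) if priors is None else np.asarray(priors)
    scores = []
    for mu, cov, p in zip(means, covs, priors):
        diff = x - mu
        maha = diff @ np.linalg.solve(cov, diff)
        scores.append(maha + np.linalg.slogdet(cov)[1] - 2.0 * np.log(p))
    return int(np.argmin(scores)), np.array(scores)

# Toy usage: two 2-D Gaussian classes.
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
label, g = map_classify(np.array([1.8, 1.9]), means, covs)
print(label)   # 1
```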
2.2 Parameter Estimation
In order to compute the Mahalanobis distance function, the mean vector and the covariance matrix for each class are required to be estimated. Typically, the mean vector ($\mu_i$) and the within-class covariance matrix ($\Sigma_i$) are calculated by maximum likelihood estimation. In the feature space, they are formulated as follows:
$$\mu_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \Phi(\mathbf{x}_j), \qquad (5)$$
$$\Sigma_i = S_i = \frac{1}{n_i} \sum_{j=1}^{n_i} (\Phi(\mathbf{x}_j) - \mu_i)(\Phi(\mathbf{x}_j) - \mu_i)^T, \qquad (6)$$
where $n_i$ is the number of data points belonging to $C_i$. Directly employing $S_i$ as the covariance matrix generates quadratic discriminant functions in the feature space; in this case, KMAP is denoted KMAP-M. However, the covariance estimation problem is clearly ill-posed, because the number of data points in each class is usually much smaller than the number of dimensions of the kernel-induced feature space. The treatment of this ill-posed problem is to introduce regularization, and there are several kinds of regularization methods. One of them is to replace the individual within-class covariance matrices by their average, i.e., $\Sigma_i = \bar{S} = \frac{1}{m}\sum_{i=1}^{m} S_i + r\mathbf{I}$, where $\mathbf{I}$ is the identity matrix and $r$ is a regularization coefficient. This substantially reduces the number of free parameters to be estimated, and it also reduces the discriminant function between two classes to a linear one, so a linear discriminant analysis method is obtained. Alternatively, we can estimate the covariance matrix by combining the above linear discriminant function with the quadratic one. Instead of estimating the
covariance matrix in the input space [7], we apply this method in the feature space. The formulation in the feature space is
$$\Sigma_i = (1 - \eta)\tilde{\Sigma}_i + \eta\, \frac{\mathrm{trace}(\tilde{\Sigma}_i)}{n}\, \mathbf{I}, \qquad (7)$$
where $\tilde{\Sigma}_i = (1 - \theta) S_i + \theta \bar{S}$. In these equations, $\theta$ ($0 \le \theta \le 1$) weights the linear discriminant term against the quadratic one, and $\eta$ ($0 \le \eta \le 1$) determines the shrinkage towards a multiple of the identity matrix. This approach is more flexible in adjusting the effect of the regularization. The corresponding KMAP is denoted KMAP-R.
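The following small NumPy sketch illustrates a shrinkage estimator of the kind defined by Eq. (7), applied to ordinary input-space covariance estimates; it is our illustration under stated assumptions (the trace is divided by the matrix dimension, Friedman-style), not the authors' kernel-space implementation.

```python
import numpy as np

def regularized_cov(class_covs, theta, eta):
    """Regularized class covariances in the spirit of Eq. (7) (illustrative only).

    class_covs: list of per-class ML covariance estimates S_i
    theta:      mixes S_i with the pooled average S_bar (0 = quadratic, 1 = linear)
    eta:        shrinks towards a multiple of the identity matrix
    """
    S_bar = sum(class_covs) / len(class_covs)          # average within-class covariance
    out = []
    for S_i in class_covs:
        S_tilde = (1.0 - theta) * S_i + theta * S_bar
        d = S_tilde.shape[0]
        out.append((1.0 - eta) * S_tilde + eta * (np.trace(S_tilde) / d) * np.eye(d))
    return out

# Toy usage with two 3 x 3 covariance estimates from small samples.
rng = np.random.default_rng(0)
A, B = rng.random((5, 3)), rng.random((4, 3))
covs = [np.cov(A, rowvar=False, bias=True), np.cov(B, rowvar=False, bias=True)]
print(regularized_cov(covs, theta=0.5, eta=0.1)[0].shape)   # (3, 3)
```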
2.3 Kernel Calculation
We derive methods to calculate the Mahalanobis distance (Eq. (4)) using the kernel trick, i.e., we only need to formulate the function in an inner-product form regardless of the explicit mapping function. To do this, we use the spectral representation of the covariance matrix, $\Sigma_i = \sum_{j=1}^{N} \Lambda_{ij} \Omega_{ij} \Omega_{ij}^T$, where $\Lambda_{ij} \in \mathbb{R}$ is the j-th eigenvalue of $\Sigma_i$ and $\Omega_{ij} \in \mathbb{R}^N$ is the corresponding eigenvector. However, the small eigenvalues degrade the performance of the function overwhelmingly, because they are underestimated due to the small number of examples. In this paper, we therefore estimate only the k largest eigenvalues and replace each remaining eigenvalue with a nonnegative number $h_i$. Eq. (4) can then be reformulated as
$$g_i(\Phi(\mathbf{x})) = \frac{1}{h_i}\left[ g_{1i}(\Phi(\mathbf{x})) - g_{2i}(\Phi(\mathbf{x})) \right] + g_{3i}(\Phi(\mathbf{x})) = \frac{1}{h_i}\left( \sum_{j=1}^{N} [\Omega_{ij}^T (\Phi(\mathbf{x}) - \mu_i)]^2 - \sum_{j=1}^{k} \left(1 - \frac{h_i}{\Lambda_{ij}}\right) [\Omega_{ij}^T (\Phi(\mathbf{x}) - \mu_i)]^2 \right) + \log\left( h_i^{N-k} \prod_{j=1}^{k} \Lambda_{ij} \right). \qquad (8)$$
In the following, we show that $g_{1i}(\Phi(\mathbf{x}))$, $g_{2i}(\Phi(\mathbf{x}))$, and $g_{3i}(\Phi(\mathbf{x}))$ can all be written in a kernel form. To formulate these equations, we need to calculate the eigenvalues $\Lambda_i$ and eigenvectors $\Omega_i$. The eigenvectors lie in the space spanned by all the training samples, i.e., each eigenvector $\Omega_{ij}$ can be written as a linear combination of the training samples:
$$\Omega_{ij} = \sum_{l=1}^{n} \gamma_{ij}^{(l)} \Phi(\mathbf{x}_l) = \mathbf{U} \gamma_{ij}, \qquad (9)$$
where $\gamma_{ij} = (\gamma_{ij}^{(1)}, \gamma_{ij}^{(2)}, \ldots, \gamma_{ij}^{(n)})^T$ is an n-dimensional column vector and $\mathbf{U} = (\Phi(\mathbf{x}_1), \ldots, \Phi(\mathbf{x}_n))$.
It is easy to prove that $\gamma_{ij}$ and $\Lambda_{ij}$ are actually the eigenvector and eigenvalue of the covariance matrix $\Sigma_{G^{(i)}}$, where $G^{(i)} \in \mathbb{R}^{n_i \times N}$ is the i-th block of the kernel matrix $G$ relevant to $C_i$; we omit the proof due to space limits. Accordingly, we can express $g_{1i}(\Phi(\mathbf{x}))$ in kernel form:
$$g_{1i}(\Phi(\mathbf{x})) = \sum_{j=1}^{n} \gamma_{ij}^T \mathbf{U}^T (\Phi(\mathbf{x}) - \mu_i)(\Phi(\mathbf{x}) - \mu_i)^T \mathbf{U} \gamma_{ij} = \sum_{j=1}^{n} \left[ \gamma_{ij}^T \left( \mathbf{K}_{\mathbf{x}} - \frac{1}{n_i} \sum_{l=1}^{n_i} \mathbf{K}_{\mathbf{x}_l} \right) \right]^2, \qquad (10)$$
where $\mathbf{K}_{\mathbf{x}} = \{K(\mathbf{x}_1, \mathbf{x}), \ldots, K(\mathbf{x}_n, \mathbf{x})\}^T$. In the same way, $g_{2i}(\Phi(\mathbf{x}))$ can be formulated as
$$g_{2i}(\Phi(\mathbf{x})) = \sum_{j=1}^{k} \left(1 - \frac{h_i}{\Lambda_{ij}}\right) \Omega_{ij}^T (\Phi(\mathbf{x}) - \mu_i)(\Phi(\mathbf{x}) - \mu_i)^T \Omega_{ij}. \qquad (11)$$
Substituting (9) into $g_{2i}(\Phi(\mathbf{x}))$, we have
$$g_{2i}(\Phi(\mathbf{x})) = \sum_{j=1}^{k} \left(1 - \frac{h_i}{\Lambda_{ij}}\right) \left[ \gamma_{ij}^T \left( \mathbf{K}_{\mathbf{x}} - \frac{1}{n_i} \sum_{l=1}^{n_i} \mathbf{K}_{\mathbf{x}_l} \right) \right]^2. \qquad (12)$$
Now the Mahalanobis distance function $g_i(\Phi(\mathbf{x}))$ in the feature space can be written entirely in kernel form, where N in $g_{3i}(\Phi(\mathbf{x}))$ is substituted by the number of data points n. The time complexity of KMAP is dominated by the eigenvalue decomposition, which scales as $O(n^3)$; thus KMAP has the same complexity as KFDA.
2.4 Connection to Other Kernel Methods
In the following, we show the connection between KMAP and other kernel-based methods. In the regularization method based on Eq. (7), other kernel-based classification methods can be derived by varying the settings of θ and η. When (θ = 0, η = 0), the KMAP model is a quadratic discriminant method in the kernel-induced feature space; when (θ = 1, η = 0), it is a kernel discriminant method; and when (θ = 0, η = 1) or (θ = 1, η = 1), it is the nearest mean classifier. Therefore, by varying θ and η, different models can be generated from different combinations of quadratic discriminant, linear discriminant and nearest mean methods. We consider a special case of the regularization method with θ = 1 and η = 0. If both classes are assumed to have the same covariance structure for a
binary classification problem, i.e., $\Sigma_i = \frac{\Sigma_1 + \Sigma_2}{2}$, this leads to a linear discriminant function. Assuming all classes have the same prior probabilities, $g_i(\Phi(\mathbf{x}))$ can be derived as $g_i(\Phi(\mathbf{x})) = (\Phi(\mathbf{x}) - \mu_i)^T \left(\frac{\Sigma_1 + \Sigma_2}{2}\right)^{-1} (\Phi(\mathbf{x}) - \mu_i)$, where $i = 1, 2$. We can rewrite this as $g_i(\Phi(\mathbf{x})) = \mathbf{w}_i \Phi(\mathbf{x}) + b_i$, where $\mathbf{w}_i = -4(\Sigma_1 + \Sigma_2)^{-1} \mu_i$ and $b_i = 2\mu_i^T (\Sigma_1 + \Sigma_2)^{-1} \mu_i$. The decision hyperplane is $f(\Phi(\mathbf{x})) = g_1(\Phi(\mathbf{x})) - g_2(\Phi(\mathbf{x}))$, i.e.,
$$f(\Phi(\mathbf{x})) = (\Sigma_1 + \Sigma_2)^{-1} (\mu_1 - \mu_2)\, \Phi(\mathbf{x}) - \frac{1}{2} (\mu_1 - \mu_2)^T (\Sigma_1 + \Sigma_2)^{-1} (\mu_1 + \mu_2). \qquad (13)$$
Eq. (13) is exactly the solution of KFDA [3]. Therefore, KFDA can be viewed as a special case of KMAP in which all classes share the same covariance structure.

Remark. KMAP provides a rich class of kernel-based classification algorithms through different regularization methods. This makes KMAP a flexible classification framework adaptive to the data distribution.

2.5 Separability Measures and Error Bounds
To measure the separability of different classes of data in the feature space, the Kullback-Leibler divergence (K-L distance) between two Gaussians is adopted. The K-L divergence is defined as
$$d_{KL}[p_i(\Phi(\mathbf{x})), p_j(\Phi(\mathbf{x}))] = \int p_i(\Phi(\mathbf{x})) \ln \frac{p_i(\Phi(\mathbf{x}))}{p_j(\Phi(\mathbf{x}))}\, d\Phi(\mathbf{x}). \qquad (14)$$
Since the K-L divergence is not symmetric, a two-way divergence is used to measure the distance between two distributions:
$$d_{ij} = d_{KL}[p_i(\Phi(\mathbf{x})), p_j(\Phi(\mathbf{x}))] + d_{KL}[p_j(\Phi(\mathbf{x})), p_i(\Phi(\mathbf{x}))]. \qquad (15)$$
Following [6], it can be proved that
$$d_{ij} = \frac{1}{2} (\mu_i - \mu_j)^T (\Sigma_i^{-1} + \Sigma_j^{-1})(\mu_i - \mu_j) + \frac{1}{2}\, \mathrm{trace}(\Sigma_i^{-1}\Sigma_j + \Sigma_j^{-1}\Sigma_i - 2\mathbf{I}), \qquad (16)$$
which can be solved by using the trick in Section 2.3. The Bayes decision rule guarantees the lowest average error rate, as presented in the following:
$$P(\text{correct}) = \sum_{i=1}^{m} \int_{R_i} p(\Phi(\mathbf{x})\,|\,C_i)\, P(C_i)\, d\Phi(\mathbf{x}), \qquad (17)$$
where $R_i$ is the decision region of class $C_i$. We implement the Bhattacharyya bound in the feature space for the Gaussian density distribution function. Following [6], we have
$$P(\text{error}) \le \sqrt{P(C_1) P(C_2)}\; e^{-q(0.5)}, \qquad (18)$$
where
$$q(0.5) = \frac{1}{8} (\mu_2 - \mu_1)^T \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_1 + \Sigma_2}{2} \right|}{\sqrt{|\Sigma_1||\Sigma_2|}}. \qquad (19)$$
Using the results in Section 2.3, the Bhattacharyya error bound can be easily calculated in the kernel-induced feature space.
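For two explicit Gaussians, both quantities are straightforward to evaluate. The sketch below (our illustration with input-space means and covariances; the feature-space version would substitute the kernel expressions of Section 2.3) computes the symmetric K-L divergence of Eq. (16) and the Bhattacharyya bound of Eqs. (18)-(19):

```python
import numpy as np

def two_way_kl(mu1, cov1, mu2, cov2):
    """Symmetric K-L divergence between two Gaussians, Eq. (16)."""
    d = mu1 - mu2
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    return 0.5 * d @ (inv1 + inv2) @ d + 0.5 * np.trace(inv1 @ cov2 + inv2 @ cov1
                                                        - 2 * np.eye(len(d)))

def bhattacharyya_bound(mu1, cov1, mu2, cov2, p1=0.5, p2=0.5):
    """Upper error bound of Eqs. (18)-(19) for two Gaussian classes."""
    d = mu2 - mu1
    avg = 0.5 * (cov1 + cov2)
    q = (d @ np.linalg.solve(avg, d) / 8.0
         + 0.5 * np.log(np.linalg.det(avg)
                        / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))))
    return np.sqrt(p1 * p2) * np.exp(-q)

# Toy usage: two well-separated 2-D Gaussians give a large divergence and a small bound.
mu1, mu2 = np.zeros(2), np.array([3.0, 3.0])
c1, c2 = np.eye(2), 0.5 * np.eye(2)
print(two_way_kl(mu1, c1, mu2, c2), bhattacharyya_bound(mu1, c1, mu2, c2))
```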
3 Experiments
In this section, we report the experiments to evaluate the separability measure, the error bound and the prediction performance of the proposed KMAP.

3.1 Synthetic Data
We compare the separability measures and error bounds on three synthetic data sets; descriptions of these data sets can be found in [8]. The data sets are named according to their characteristics and are plotted in Fig. 1. We map the data with the RBF kernel to a feature space in which Gaussian distributions are approximately satisfied, and then calculate the separability measures on all data sets according to Eq. (16). The separability values for Overlap, Bumpy, and Relevance in the original input space are 14.94, 5.16, and 22.18, respectively; the corresponding values in the feature space are 30.88, 5.87, and 3631. The results indicate that the data become more separable after being mapped into the feature space, especially for the Relevance data set. For data in the kernel-induced feature space, the error bounds are calculated according to Eq. (18). Figure 1 also plots the prediction rates and the upper error bounds for data in the input space and in the feature space; it can be observed that the error bounds are more valid in the feature space than in the input space.

3.2 Benchmark Data
Experimental Setup. In this experiment, KSVM, KFDA, the Modified Quadratic Discriminant Function (MQDF) [9] and Kernel Fisher's Quadratic Discriminant Analysis (KFQDA) [10] are employed as the competing algorithms, and we implement two variants of KMAP, KMAP-M and KMAP-R. The properties of the eight UCI benchmark data sets are described in Table 1. In all kernel methods, a Gaussian RBF kernel is used; the parameter C of KSVM and the kernel parameter γ are tuned by 10-fold cross-validation. In KMAP, we select k pairs of eigenvalues and eigenvectors according to their contribution to the covariance matrix, i.e., the indices $j \in \{\, \ell : \lambda_\ell / \sum_{q=1}^{n} \lambda_q \ge \alpha \,\}$; in MQDF the range of k is relatively small, and we select k by cross-validation. PCA is used as the regularization method in KFQDA, with the cumulative decay ratio set to 99%; the regularization parameter r is set to 0.001 in KFDA.
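The eigen-pair selection rule above amounts to keeping every eigenvalue whose relative contribution exceeds the threshold α. A tiny illustration of this rule (the threshold value is our assumption, not taken from the paper):

```python
import numpy as np

def select_k(eigvals, alpha=0.001):
    """Count the eigen-pairs whose contribution lambda_l / sum_q lambda_q >= alpha."""
    eigvals = np.sort(np.asarray(eigvals))[::-1]
    keep = eigvals / eigvals.sum() >= alpha
    return int(keep.sum())

print(select_k([5.0, 2.0, 0.5, 0.001]))   # 3 with the default threshold
```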
Overlap
(d) Separability Comparison
Fig. 1. The data plot of Overlap, Bumpy and Relevance and the comparison of data separability in the input space and the feature space Table 1. Data set information Data Set # Samples # Features # Classes Data Set # Samples # Features # Classes Iono 351 34 2 Breast 683 9 2 Twonorm 1000 21 2 Sonar 208 60 2 Pima 768 8 2 Iris 150 4 3 Wine 178 13 3 Segment 210 19 7
In both KMAP and MQDF, h takes the value of Λk+1 . In KMAP-R, extra parameters (θ, η) are tuned by cross-validation. All experimental results are obtained in 10 runs and each run is executed with 10-cross validation for each data set. Experimental Results. Table 2 reports the average prediction accuracy with the standard errors on each data set for all algorithms. It can be observed that both variants of KMAP outperform MQDF, which is an MAP method in the input space. This also empirically validates that the separability among different classes of data becomes larger and that the upper error bounds get tighter and more accurate, after data are mapped to the high dimensional feature space. Moreover, the performance of KMAP is competitive to that of other kernel methods. Especially, the performance of KMAP-R gets better prediction accuracy than all other methods for most of the data sets. The reason is that the regularization methods in KMAP favorably capture the prior distributions of
Kernel Maximum a Posteriori Classification with Error Bound Analysis
849
Table 2. The prediction results of KMAP and other methods Data set Iono(%) Breast(%) Twonorm(%) Sonar(%) Pima(%) Iris(%) Wine(%) Segment(%) Average(%)
KSVM 94.1±0.7 96.5±0.4 96.1±0.4 86.6±1.0 77.9±0.7 96.2±0.4 98.8±0.1 92.8±0.7 92.38
MQDF 89.6±0.5 96.5±0.1 97.4±0.4 83.7±0.7 73.1±0.4 96.0±0.1 99.2±1.3 86.9±1.2 90.30
KFDA 94.2±0.1 96.4±0.1 96.7±0.5 88.3±0.3 71.0±0.5 95.7±0.1 99.1±0.1 91.6±0.3 91.62
KFQDA KMAP-M KMAP-R 93.6±0.4 95.2±0.2 95.7±0.3 96.5±0.1 96.5±0.1 97.5±0.1 97.3±0.5 97.6±0.7 97.5±0.4 85.1±1.9 87.2±1.6 88.8±1.2 74.1±0.5 75.4±0.7 75.5±0.4 96.8±0.2 96.9±0.1 98.0±0.0 96.9±0.7 99.3±0.1 99.3±0.6 85.8±0.8 90.2±0.2 92.1±0.8 90.76 92.29 93.05
data, since the Gaussian assumption in the feature space can fit a very complex distribution in the input space.
4 Related Work
In statistical pattern recognition, the probability density function is first estimated from the data, and future examples are then assigned to the class with the maximum a posteriori probability. One typical example is the Quadratic Discriminant Function (QDF) [11], which is derived from the multivariate normal distribution and achieves the minimum mean error rate under a Gaussian distribution. In [9], a Modified Quadratic Discriminant Function (MQDF) that is less sensitive to estimation error is proposed. [7] improves the performance of QDF by covariance matrix interpolation. Unlike QDF, another type of classifier does not assume the probability density functions in advance, but is designed directly on the data samples. An example is Fisher discriminant analysis (FDA), which maximizes the between-class covariance while minimizing the within-class variance; it can be derived as a Bayesian classifier under a Gaussian assumption on the data. [3] develops Kernel Fisher Discriminant Analysis (KFDA) by extending FDA to a non-linear space via the kernel trick. To supplement the statistical justification of KFDA, [10] extends the maximum likelihood method and Bayes classification to their kernel generalizations under a Gaussian Hilbert space assumption. The authors do not directly kernelize the quadratic forms in terms of kernel values; instead, they use an explicit mapping function to map the data to a high dimensional space, so that the kernel matrix is usually used as the input data of FDA. The derived model is named Kernel Fisher's Quadratic Discriminant Analysis (KFQDA).
5 Conclusion and Future Work
In this paper, we present a novel kernel classifier named Kernel-based Maximum a Posteriori (KMAP), which implements a Gaussian distribution in the kernel-induced feature space. Compared with state-of-the-art classifiers, the advantages of KMAP are that the prior information of the distribution is incorporated and that it can output a probability or confidence when making a decision. Moreover, KMAP can
be regarded as a more general classification method than other kernel-based methods such as KFDA. In addition, the error bound analysis illustrates that the Gaussian distribution is more easily satisfied in the feature space than in the input space. More importantly, KMAP with proper regularization achieves very promising performance. In future work, we plan to incorporate the probability information into both the kernel function and the classifier.
Acknowledgments The work described in this paper is fully supported by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4205/04E and Project No. CUHK4235/04E).
References

1. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
2. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, Chichester (1998)
3. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.: Fisher discriminant analysis with kernels. In: Proceedings of the IEEE Neural Networks for Signal Processing Workshop, pp. 41–48 (1999)
4. Lanckriet, G.R.G., Ghaoui, L.E., Bhattacharyya, C., Jordan, M.I.: A robust minimax approach to classification. Journal of Machine Learning Research 3, 555–582 (2002)
5. Huang, K., Yang, H., King, I., Lyu, M.R., Chan, L.: Minimum error minimax probability machine. Journal of Machine Learning Research 5, 1253–1286 (2004)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience Publication (2000)
7. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical Association 84(405), 165–175 (1989)
8. Centeno, T.P., Lawrence, N.D.: Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis. Journal of Machine Learning Research 7(2), 455–491 (2006)
9. Kimura, F., Takashina, K., Tsuruoka, S., Miyake, Y.: Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 9, 149–153 (1987)
10. Huang, S.Y., Hwang, C.R., Lin, M.H.: Kernel Fisher's discriminant analysis in Gaussian Reproducing Kernel Hilbert Space. Technical report, Academia Sinica, Taiwan, R.O.C. (2005)
11. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, San Diego (1990)
Comparison of Local Higher-Order Moment Kernel and Conventional Kernels in SVM for Texture Classification Keisuke Kameyama Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
[email protected]

Abstract. The use of the local higher-order moment kernel (LHOM kernel) in SVMs for texture classification was investigated by comparing it with SVMs using other conventional kernels. In the experiments, it became clear that SVMs with LHOM kernels achieve better trainability and give more stable responses to the texture classes than those with conventional kernels. Also, the number of support vectors was kept low, which indicates better class separability in the nonlinearly-mapped feature space.

Keywords: Support Vector Machine (SVM), Higher-Order Moment Spectra, Kernel Function, Texture classification.
1 Introduction
Classification of image textures according to their local nature has various uses, among them scene understanding in robot vision, document processing, and diagnosis support in medical images. Generally, segmentation of textures relies on local features extracted by a local window. Multiple features are extracted from the windowed image, giving a feature vector of the local texture. The feature vector is then passed to a classifier, which maps it to a class label. One of the common choices among the local features is the local power spectrum of the signal. However, there are cases where second-order features are not sufficient. In such cases, higher-order features can be used to improve the classification ability [1]. However, when the orders of the features grow higher, exhaustive examination of the possible features, or an efficient search for an effective feature, becomes difficult due to the high dimensionality of the feature space. Thus, the use of moments and moment spectra features of higher order has been limited so far. Recently, the use of high dimensional feature spaces by the so-called kernel-based methods has been actively examined. The Support Vector Machine (SVM) [2] [3],
which also benefits from this property, has the potential to make high dimensional signal features a realistic choice for supervised signal classification. In the literature, kernel functions corresponding to higher-order moments [4,5] and their spectra [6] have been investigated. In this work, the effectiveness of using the local higher-order moment kernel (LHOM kernel) in SVMs for texture classification is examined by comparing it with SVMs using other conventional kernel functions. In the following, local higher-order moment (LHOM) features and SVMs are reviewed in Secs. 2 and 3, respectively. The definition of the LHOM kernel and its characteristics are given in Sec. 4. In Sec. 5, the setup and the results of the texture classification experiments are presented, with discussions focused on the differences introduced by the kernel choice. Finally, Sec. 6 concludes the paper.
2 Local Moment and Local Moment Spectra

2.1 Feature Extraction by Local Power Spectrum
Let s(t) be a real valued image signal defined on $\mathbb{R}^2$. One of the most common ways of characterizing a local signal is the use of the local power spectrum. It can be obtained as the Fourier transform of the local second-order moment of the signal s as

$$M_{s,w,2}(\boldsymbol{\omega}, \mathbf{x}) = \int m_{s,w,2}(\boldsymbol{\tau}, \mathbf{x})\, \exp(-j\boldsymbol{\omega}^T \boldsymbol{\tau})\, d\boldsymbol{\tau}. \qquad (1)$$

The integral is over $\mathbb{R}^2$ where not specified. The local second-order moment of signal s is defined as

$$m_{s,w,2}(\boldsymbol{\tau}, \mathbf{x}) = \int w(\mathbf{t})\, s(\mathbf{t}+\mathbf{x})\, w(\mathbf{t}+\boldsymbol{\tau})\, s(\mathbf{t}+\mathbf{x}+\boldsymbol{\tau})\, d\mathbf{t}, \qquad (2)$$

where w is the window function for localizing the signal characterization. In this work, a Gaussian window function defined as

$$w(\mathbf{t}, \Sigma) = (2\pi)^{-1} |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}\, \mathbf{t}^T \Sigma^{-1} \mathbf{t}\right) \qquad (3)$$

is used. Here, $\Sigma$ is a positive definite symmetric real 2 × 2 matrix that modifies the window shape, and the superscript T denotes the matrix transpose.

2.2 Local Higher-Order Moment Spectra (LHOMS)
Moment spectra of orders n higher than two, namely the bispectrum (n = 3), the trispectrum (n = 4) and so on, also contribute to characterizing the texture. The spectrum of order n reflects the correlation of n − 1 frequency components in complex values. Therefore it is possible, for example, to characterize the phase difference among the signal components using the bispectrum, whereas this information is lost in the case of the power spectrum.
The n-th order local moment spectrum is the Fourier transform of the n-th order local moment. By defining $\Omega_{n-1} = [\boldsymbol{\omega}_1^T\ \boldsymbol{\omega}_2^T \ldots \boldsymbol{\omega}_{n-1}^T]^T$ and $T_{n-1} = [\boldsymbol{\tau}_1^T\ \boldsymbol{\tau}_2^T \ldots \boldsymbol{\tau}_{n-1}^T]^T$, we have

$$M_{s,w,n}(\Omega_{n-1}, \mathbf{x}) = \int_{\mathbb{R}^{2(n-1)}} m_{s,w,n}(T_{n-1}, \mathbf{x})\, \exp(-j\,\Omega_{n-1}^T T_{n-1})\, dT_{n-1} \qquad (4)$$

where

$$m_{s,w,n}(T_{n-1}, \mathbf{x}) = \int w(\mathbf{t})\, s(\mathbf{t}+\mathbf{x}) \prod_{k=1}^{n-1} \{ w(\mathbf{t}+\boldsymbol{\tau}_k)\, s(\mathbf{t}+\mathbf{x}+\boldsymbol{\tau}_k) \}\, d\mathbf{t}. \qquad (5)$$
Being the product of n − 1 frequency components and their sum-frequency component, the n-th order LHOMS is an especially useful tool for characterizing the phase relations among the fundamental ($\omega_0$) and harmonic ($2\omega_0, 3\omega_0, \ldots$) frequency components.
3 Support Vector Machine and Kernel Functions
The Support Vector Machine (SVM) [2] [3] is a supervised pattern recognition scheme with the following two significant features. First, the SVM realizes an optimal linear classifier (optimal hyperplane) in the feature space, which is theoretically based on the Structural Risk Minimization (SRM) theory. The SVM learning achieves a linear classifier with minimum VC-dimension, thereby keeping the expected generalization error low. Second, utilization of feature spaces of very high dimension is made possible by way of kernel functions. The SVM takes an input vector $\mathbf{x} \in \mathbb{R}^L$, which is transformed by a predetermined feature extraction function $\Phi(\mathbf{x}) \in \mathbb{R}^N$. This feature vector is classified into one of the two classes by a linear classifier

$$y = \mathrm{sgn}(u(\mathbf{x})) = \mathrm{sgn}(\langle \mathbf{w}, \Phi(\mathbf{x})\rangle + b), \qquad y \in \{-1, 1\}, \qquad (6)$$

where $\mathbf{w} \in \mathbb{R}^N$ and $b \in \mathbb{R}$ are the weight vector and the bias determining the placement of the discrimination hyperplane in the feature space, respectively. The parameters $\mathbf{w}$ and $b$ are determined by supervised learning using a training set of input-output pairs $\{(\mathbf{x}_i, y_i) \in \mathbb{R}^L \times \{\pm 1\}\}_{i=1}^{\ell}$. Assuming that the training set mapped to the feature space, $\{\Phi(\mathbf{x}_i)\}_{i=1}^{\ell}$, is linearly separable, the weight vector $\mathbf{w}$ will be determined to maximize the margin between the two classes, by solving an optimization problem with inequality constraints:

$$\text{minimize } \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad \{\, y_i(\langle\mathbf{w}, \Phi(\mathbf{x}_i)\rangle + b) \ge 1 \,\}_{i=1}^{\ell}.$$
It can be solved by using the Lagrangian function $L(\mathbf{w}, b, \boldsymbol{\alpha})$, minimizing it with respect to the variables $\mathbf{w}$ and $b$, and maximizing it with respect to the multipliers $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_\ell]$.
When the training set is not linearly separable, the constraints in the optimization problem can be relaxed to

$$\text{minimize } \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{\ell}\xi_i \quad \text{subject to} \quad \{\, y_i(\langle\mathbf{w}, \Phi(\mathbf{x}_i)\rangle + b) \ge 1 - \xi_i \,\}_{i=1}^{\ell} \qquad (7)$$

utilizing so-called soft margins by introducing nonnegative slack variables $\{\xi_i\}_{i=1}^{\ell}$ and a regularization constant $C > 0$. In both cases, the function u for the classifier is obtained in the form

$$u(\mathbf{x}) = \sum_{i=1}^{\ell} y_i \alpha_i \langle\Phi(\mathbf{x}_i), \Phi(\mathbf{x})\rangle + b. \qquad (8)$$
Typically, the optimal solution of the original constrained problem resides where the equalities hold for only a very small fraction of the inequality conditions. This leaves the multipliers for the other conditions at zero and makes the use of Eq. (8) practical even for a large $\ell$. Training inputs with nonzero multipliers are called support vectors, hence the name of the SVM. The parameter b in Eq. (8) can be obtained from the (equality) constraint for one of the support vectors. In both training and using the SVM, the feature extraction function $\Phi(\mathbf{x})$ is never explicitly evaluated. Instead, the inner product $K(\mathbf{x}, \mathbf{y}) = \langle\Phi(\mathbf{x}), \Phi(\mathbf{y})\rangle$, called the kernel function, is always evaluated. Among the well known conventional kernel functions are the polynomial kernel

$$K(\mathbf{x}, \mathbf{y}) = (\langle\mathbf{x}, \mathbf{y}\rangle + 1)^d \qquad (d = 1, 2, \ldots) \qquad (9)$$

and the radial-basis function (RBF) kernel

$$K(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{\gamma^2}\right). \qquad (\gamma \in \mathbb{R}) \qquad (10)$$
In some cases of feature extraction, evaluating $\langle\Phi, \Phi\rangle$ requires less computation than directly evaluating $\Phi$. This property enables the SVM to efficiently utilize feature spaces of very high dimension.
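To make Eqs. (8)–(10) concrete, here is a minimal sketch of the two conventional kernels and of the kernelized decision function. NumPy and the variable names are illustrative assumptions; this is not the implementation used in the experiments.

```python
import numpy as np

def poly_kernel(x, y, d=2):
    # Polynomial kernel of Eq. (9): K(x, y) = (<x, y> + 1)^d
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, gamma=10.0):
    # RBF kernel of Eq. (10): K(x, y) = exp(-||x - y||^2 / gamma^2)
    return np.exp(-np.sum((x - y) ** 2) / gamma ** 2)

def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    # Eq. (8): u(x) = sum_i y_i * alpha_i * K(x_i, x) + b, classified by its sign
    u = sum(y_i * a_i * kernel(x_i, x)
            for x_i, y_i, a_i in zip(support_vectors, labels, alphas))
    return np.sign(u + b)
```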
4 The LHOMS and LHOM Kernels

4.1 LHOMS Kernel Derivation
Here, a kernel function for the case when the LHOMS corresponds to the nonlinear feature map Φ will be derived. This amounts to treating the LHOMS function as a feature vector of infinite dimension over all possible combinations of frequency vectors $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_{n-1}$.
The inner product of the n-th order moment spectra of signals s(t) and v(t), localized at x and y respectively, is

$$K_{w,n}(s, v\,;\mathbf{x}, \mathbf{y}) = \langle M_{s,w,n}(\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_{n-1}, \mathbf{x}),\ M_{v,w,n}(\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_{n-1}, \mathbf{y})\rangle$$
$$= \int\!\!\cdots\!\!\int \Big( S_w^*\Big(\sum_{k=1}^{n-1}\boldsymbol{\omega}_k, \mathbf{x}\Big) \prod_{k=1}^{n-1} S_w(\boldsymbol{\omega}_k, \mathbf{x}) \Big) \Big\{ V_w^*\Big(\sum_{k=1}^{n-1}\boldsymbol{\omega}_k, \mathbf{y}\Big) \prod_{k=1}^{n-1} V_w(\boldsymbol{\omega}_k, \mathbf{y}) \Big\}^{*} d\boldsymbol{\omega}_1 \cdots d\boldsymbol{\omega}_{n-1}$$
$$= (2\pi)^{2(n-1)} \int \Big( \int w(\mathbf{z})\, s(\mathbf{z}+\mathbf{x})\, w(\mathbf{z}+\boldsymbol{\tau})\, v(\mathbf{z}+\mathbf{y}+\boldsymbol{\tau})\, d\mathbf{z} \Big)^{n} d\boldsymbol{\tau}. \qquad (11)$$

See [6] for the derivation. The function $K_{w,n}(s, v\,;\mathbf{x}, \mathbf{y})$ serves as the kernel function in the SVM for signal classification. Notable here is that the increase of computational cost due to the moment order can be avoided as far as the inner product is concerned. In the literature, the inner product of the n-th order moments of two signals s and v, which omits the window w of Eq. (3), has been derived as

$$\langle m_{s,n}, m_{v,n}\rangle = \int \Big( \int s(\mathbf{z})\, v(\mathbf{z}+\boldsymbol{\tau})\, d\mathbf{z} \Big)^{n} d\boldsymbol{\tau}, \qquad (12)$$
by an early work of McLaughlin and Raviv [7] for use in optical character recognition. Recently, it has been used in the context of modern kernel-based methods for signal and image recognition problems [4, 5]. The non-localized and localized versions of this kernel will be referred to as the HOM and LHOM kernels, respectively.

4.2 Equivalence of LHOMS and LHOM Kernels
The following clarifies an important relation between the LHOMS kernel and the LHOM kernel.

Theorem (Plancherel). Define the Fourier transform of a function $f(\mathbf{x}) \in L^2(\mathbb{R}^N)$ as $(\mathcal{F}f)(\boldsymbol{\omega}) = \int_{\mathbb{R}^N} f(\mathbf{x})\, e^{-j\langle\boldsymbol{\omega}, \mathbf{x}\rangle}\, d\mathbf{x}$. Then $\langle\mathcal{F}f, \mathcal{F}g\rangle = (2\pi)^N \langle f, g\rangle$ holds for functions $f, g \in L^2(\mathbb{R}^N)$ [8].

Corollary. Denoting the LHOMS and LHOM of an image s(t) as $M_{s,w,n}$ and $m_{s,w,n}$, respectively, $\langle M_{s,w,n}, M_{v,w,n}\rangle = (2\pi)^{2(n-1)} \langle m_{s,w,n}, m_{v,w,n}\rangle$.
[Figure 1 appears here: a class map c(t) is produced from the image s(t) through a feature/kernel stage; the HOMS features φ (power spectrum, bispectrum, trispectrum, ...) and the HOM features ψ (autocorrelations, n = 1, 2, 3, ...) share the common kernel K(s, v).]
Fig. 1. The LHOM(S) kernel unifies the treatment of higher-order moment features and higher-order moment spectra features of a signal
This corollary shows that the LHOMS and LHOM kernels are proportional, which is clear from the fact that the LHOMS is the Fourier transform of the LHOM, as mentioned in Eq. (4). Therefore, this kernel function introduces a unified view of kernel-based pattern recognition using moments and moment spectra, as shown in Fig. 1.
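The following sketch evaluates the non-localized HOM kernel of Eq. (12) for discrete image patches: the inner sum over z is raised to the power n and accumulated over all shifts τ. Circular shifts are an assumption made for simplicity; the paper's actual discrete LHOM kernel [6] additionally applies the Gaussian window of Eq. (3), which is omitted here.

```python
import numpy as np

def hom_kernel(s, v, n=3):
    # Discrete sketch of Eq. (12): sum over shifts tau of
    # (sum over z of s(z) * v(z + tau)) raised to the power n.
    H, W = s.shape
    total = 0.0
    for dy in range(H):
        for dx in range(W):
            c = np.sum(s * np.roll(np.roll(v, -dy, axis=0), -dx, axis=1))
            total += c ** n
    return total

# toy usage on two random 8x8 patches
s = np.random.rand(8, 8)
v = np.random.rand(8, 8)
print(hom_kernel(s, v, n=3))
```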
5 Texture Classification Experiments

5.1 Experimental Conditions
The SVM with the LHOM(S) kernel was tested in two-class texture classification problems. The image luminance was preprocessed to be scaled to the range [0, 1]. The training sets were obtained from random positions of the original texture images. All the experiments used 20 training examples per class, each sized 32 × 32 pixels. The kernel types used in the experiments were the following.

1. Polynomial kernel of Eq. (9), for d = 1, ..., 5.
2. RBF kernel of Eq. (10), for γ = 1, 10, 100.
3. LHOM kernel in its discrete version [6] with isotropic Gaussian windows. The order n and the window width σ were varied as n = 2, 3, 4, 5 and σ = 4, 8, 12.

When testing the SVM after training, images with tiled texture patterns were used, as shown in Figs. 2(c) and 3(c). The classification rates were calculated for the center regions of the test images, allowing edge margins of 16 pixel width. The output class maps are presented as the linear output u in Eq. (8). No post-smoothing of the class maps has been used.

5.2 Sinusoidal Wave with Harmonic Components
Two texture classes were generated by a computer according to the two formulas

$$s_A(\mathbf{x}) = \sin(\boldsymbol{\omega}_0^T\mathbf{x}) + \sin(2\boldsymbol{\omega}_0^T\mathbf{x}) + \sin(3\boldsymbol{\omega}_0^T\mathbf{x})$$
and
$$s_B(\mathbf{x}) = \sin(\boldsymbol{\omega}_0^T\mathbf{x}) + \sin(2\boldsymbol{\omega}_0^T\mathbf{x} + \phi) + \sin(3\boldsymbol{\omega}_0^T\mathbf{x}).$$
Fig. 2. Computer generated textures including fundamental, second and third harmonic frequency components. (a) Class 1, (b) Class 2, and (c) tiled collage for testing.
Fig. 3. Natural texture images selected from the Vision Texture collection. (a) Fabric17 texture, (b) Fabric18 texture and (c) tiled collage for testing.
Here, the fundamental frequency vector was set to $\boldsymbol{\omega}_0 = [\pi/6, \pi/6]^T$. The phase shift $\phi = \pi/2$ in the second harmonic component of $s_B$ is the only difference between the two textures. Although the two classes cannot be discriminated using the local power spectrum, they should be easily classified using the phase information extracted by the higher-order features. The classification results in Table 1 show that the training was successful only for the RBF and LHOM kernels (n = 3, 4, 5), as expected. The higher test rates for the LHOMS kernels are partly due to the clear discrimination of the classes within the feature space, indicated by the small number of support vectors without using soft margins.
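A minimal sketch of how the two classes of Sec. 5.2 can be synthesized is given below, assuming a 32 × 32 pixel grid sampled at integer coordinates and luminance rescaled to [0, 1] as in Sec. 5.1; these sampling details are assumptions of ours rather than the paper's procedure.

```python
import numpy as np

def make_texture(size=32, phi=0.0, w0=(np.pi / 6, np.pi / 6)):
    # Texture of Sec. 5.2: fundamental plus second and third harmonics,
    # with an optional phase shift phi in the second harmonic (class B).
    y, x = np.mgrid[0:size, 0:size]
    arg = w0[0] * x + w0[1] * y                    # omega_0^T x
    s = np.sin(arg) + np.sin(2 * arg + phi) + np.sin(3 * arg)
    return (s - s.min()) / (s.max() - s.min())     # rescale luminance to [0, 1]

class_A = make_texture(phi=0.0)
class_B = make_texture(phi=np.pi / 2)
```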
5.3 Fabric Texture
The textures in Fig. 3 were selected from the fabric textures of the Vision Texture collection [9]. The results are shown in Table 2. Tendencies similar to the previous experiment are observed, showing the superiority of the LHOMS kernel. Fig. 5(a) indicates that the unthresholded output of the SVM with the RBF kernel has a very small inter-class variance, indicating the instability of the classifier. The use of soft margins resulted in a small improvement for the SVMs with LHOM kernels.
Table 1. Trainability, number of support vectors and the test rate of the SVMs using various kernels for the computer-generated texture

Kernel Type                Trained?   SV number (ratio to TS)   Test rate
Polynomial (d = 1, .., 5)    No          -                         -
RBF (γ = 10)                 Yes        39 (97.5%)                80 %
LHOM (n = 2) (σ = 8)         No          -                         -
LHOM (n = 3) (σ = 8)         Yes         2 (5%)                   83 %
LHOM (n = 4) (σ = 8)         Yes         3 (7.5%)                 82 %
LHOM (n = 5) (σ = 8)         Yes         8 (20%)                  84 %
Fig. 4. SVM linear outputs and classification rates for the test image of the computer-generated texture using various types of kernels. (a) RBF (80%), (b) LHOM (n = 3) (83%), (c) LHOM (n = 4) (82%) and (d) LHOM (n = 5) (84%).
Table 2. Trainability, number of support vectors and the test rate of the SVMs using various kernels for the natural texture

Kernel Type                Trained?                 SV number (No SM / Use SM)   Test rate (No SM / Use SM)
Polynomial (d = 1, .., 5)    Partly (at d = 1 only)   40 / 40                      50 % / 50 %
RBF (γ = 10)                 Yes                      40 / 40                      65 % / 65 %
LHOM (n = 2) (σ = 8)         Yes                       5 / 40                      69 % / 79 %
LHOM (n = 3) (σ = 8)         Yes                      27 / 40                      73 % / 75 %
LHOM (n = 4) (σ = 8)         Yes                      30 / 35                      75 % / 75 %
LHOM (n = 5) (σ = 8)         Yes                      37 / 40                      67 % / 69 %
Fig. 5. SVM linear outputs and classification rates for the test image of the natural texture using various types of kernels. (a) RBF(65%), (b) LHOM (n = 2)(69%), (c) LHOM (n = 3)(73%) and (d) LHOM (n = 4)(75 %).
5.4 Discussion
The test using the tiled images evaluates the combination of the raw classification rate and the spatial resolution. Further investigation separating the two factors is necessary. When non-tiled (pure) test images of the fabric texture were used, SVMs with the RBF kernel (γ = 10) and the LHOMS kernels (σ = 8) for n = 2, 3, 4 and 5 achieved 90%, 95%, 90%, 95% and 80%, respectively. In [1], classifiers using LHOM features of orders up to 3 are reported to have achieved test rates over 96% for 30 classes of natural textures. Although this work does not give a direct comparison, the results support the suitability of using LHOM(S) in natural texture classification.
6 Conclusion
Texture classification using SVMs with the LHOM kernel and with other conventional kernel functions was compared. It became clear that SVMs with LHOM kernels achieve better trainability and give more stable responses to the texture classes than those with conventional kernels. Also, the number of support vectors was lower, which indicates a better class separability in the feature space.
References

1. Kurita, T., Otsu, N.: Texture classification by higher order local autocorrelation features. In: Proc. of Asian Conf. on Computer Vision (ACCV 1993), pp. 175–178 (1993)
2. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
3. Vapnik, V.N.: The Nature of Statistical Learning Theory, 2nd edn. Springer, Heidelberg (2000)
4. Popovici, V., Thiran, J.P.: Higher order autocorrelations for pattern classification. In: Proceedings of International Conference on Image Processing 2001, pp. 724–727 (2001)
5. Popovici, V., Thiran, J.P.: Pattern recognition using higher-order local autocorrelation coefficients. In: Neural Networks for Signal Processing XII (NNSP), pp. 229–238 (2002)
6. Kameyama, K., Taga, K.: Texture classification by support vector machines with kernels for higher-order Gabor filtering. In: Proceedings of International Joint Conference on Neural Networks 2004, vol. 4, pp. 3009–3014 (2004)
7. McLaughlin, J.A., Raviv, J.: Nth-order autocorrelations in pattern recognition. Information and Control 12(2), 121–142 (1968)
8. Umegaki, H.: Basic Information Mathematics – Development via Functional Analysis (in Japanese). Saiensu-sha (1993)
9. MIT Vision and Modeling Group: Vision texture (1995)
Pattern Discovery for High-Dimensional Binary Datasets

Václav Snášel¹, Pavel Moravec¹, Dušan Húsek², Alexander Frolov³, Hana Řezanková⁴, and Pavel Polyakov⁵

¹ Department of Computer Science, FEECS, VŠB – Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic; {pavel.moravec,vaclav.snasel}@vsb.cz
² Institute of Computer Science, Dept. of Nonlinear Systems, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague, Czech Republic; [email protected]
³ Institute of Higher Nervous Activity and Neurophysiology, Russian Academy of Sciences, Butlerova 5a, 117 485 Moscow, Russia; [email protected]
⁴ Department of Statistics and Probability, University of Economics, Prague, W. Churchill sq. 4, 130 67 Prague, Czech Republic; [email protected]
⁵ Institute of Optical Neural Technologies, Russian Academy of Sciences, Vavilova 44, 119 333 Moscow, Russia; [email protected]

Abstract. In this paper we compare the performance of several dimension reduction techniques used as tools for feature extraction. The tested methods include singular value decomposition, semi-discrete decomposition, non-negative matrix factorization, a novel neural-network-based algorithm for Boolean factor analysis, and two cluster analysis methods as well. The so-called bars problem is used as the benchmark: a set of artificial signals generated as Boolean sums of a given number of bars is analyzed by these methods. The resulting images show that Boolean factor analysis is the most suitable method for this kind of data.
1 Introduction

In order to perform object recognition (no matter which kind), it is necessary to learn representations of the underlying characteristic components. Such components correspond to object parts, or features. These data sets may comprise discrete attributes, such as those from market basket analysis, information retrieval, and bioinformatics, as well as continuous attributes such as those in scientific simulations, astrophysical measurements, and sensor networks. Feature extraction from high-dimensional data typically consists of correlation analysis, clustering (including finding efficient representations for clustered data), data classification, and event association. The objective is to discover meaningful
information in the data so as to be able to represent it appropriately and, if possible, in a space of lower dimension. Feature extraction applied to binary datasets addresses many research and application fields, such as association rule mining [1], market basket analysis [2], discovery of regulation patterns in DNA microarray experiments [3], etc. Many of these problem areas have been described in tests of the PROXIMUS framework (e.g. [4]). Feature extraction methods can use different aspects of images as the features. Such methods either use heuristics based on the known properties of the image collection, or are fully automatic and may use the original image vectors as input. Here we concentrate on the case of black and white pictures of bar combinations represented as binary vectors, so complex feature extraction methods are unnecessary. The values of the entries of such a vector represent individual pixels, i.e. 0 for white and 1 for black. There are many approaches that could be used for this purpose. This article reports on an empirical investigation of the performance of some of them. For the sake of simplicity we use the well-known bars problem (see e.g. [5]), where we try to isolate separate horizontal and vertical bars from images containing their combinations. Here we concentrate on the category which uses dimension reduction techniques for automatic feature extraction, and we compare the results of the most up-to-date procedures. One such method is the singular value decomposition, which has already been used successfully many times for automatic feature extraction. In the case of a bars collection (such as our test data), the base vectors can be interpreted as images describing some common characteristics of several input signals. However, singular value decomposition is not suitable for huge collections and is computationally expensive, so other methods of dimension reduction were proposed; here we use the semi-discrete decomposition. Because the data matrix has all elements non-negative, we also tried to apply a newer method called non-negative matrix factorization. The question of how the brain forms a useful representation of its environment was behind the development of neural-network-based methods for dimension reduction: the neural network based Boolean factor analysis [6,7] and the optimal sparse coding network developed by Földiák [5]. Here we applied only the neural network based Boolean factor analysis developed by us, because we do not have the optimal sparse coding network algorithm at our disposal at this moment. For a first view of the image structures, we can also apply traditional statistical methods, mainly different algorithms for statistical cluster analysis. Analysis of discrete data sets, however, generally leads to NP-complete/hard problems, especially when physically interpretable results in discrete spaces are desired. The rest of this paper is organized as follows. The second section explains the dimension reduction methods used in this study. Section three describes the experimental results, and finally section four draws some conclusions.
2 Dimension Reduction

We used four promising methods of dimension reduction for our comparison – Singular Value Decomposition (SVD), Semi-Discrete Decomposition (SDD), Non-negative Matrix Factorization (NMF) and Neural network Boolean Factor Analysis (NBFA). For the first analysis we used two statistical clustering methods, the Hierarchical Agglomerative Algorithm (HAA) and Two-Step Cluster Analysis (TSCA). All methods are briefly described below.

2.1 Singular Value Decomposition

The SVD [8] is an algebraic extension of the classical vector model. It is similar to the PCA method, which was originally used for the generation of eigenfaces in image retrieval. Informally, SVD discovers significant properties and represents the images as linear combinations of the base vectors. Moreover, the base vectors are ordered according to their significance for the reconstructed image, which allows us to consider only the first k base vectors as important (the remaining ones are interpreted as "noise" and discarded). Furthermore, SVD is often referred to as more successful in recall when compared to querying whole image vectors [8]. Formally, we decompose the matrix of images A by SVD, calculating singular values and singular vectors of A. We have a matrix A, which is an n × m rank-r matrix (where m ≥ n without loss of generality), and the values $\sigma_1, \ldots, \sigma_r$ are calculated from the eigenvalues of the matrix $AA^T$ as $\sigma_i = \sqrt{\lambda_i}$. Based on them, we can calculate column-orthonormal matrices $U = (u_1, \ldots, u_n)$ and $V = (v_1, \ldots, v_n)$, where $U^TU = I_n$ and $V^TV = I_m$, and a diagonal matrix $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, where $\sigma_i > 0$ for $i \le r$, $\sigma_i \ge \sigma_{i+1}$ and $\sigma_{r+1} = \ldots = \sigma_n = 0$. The decomposition

$$A = U\Sigma V^T \qquad (1)$$

is called the singular value decomposition of matrix A, and the numbers $\sigma_1, \ldots, \sigma_r$ are the singular values of matrix A. Columns of U (or V) are called the left (or right) singular vectors of matrix A. Now we have a decomposition of the original matrix of images A. We get r nonzero singular values, where r is the rank of the original matrix A. Because the singular values usually fall quickly, we can take only the k greatest singular values with the corresponding singular vector coordinates and create a k-reduced singular decomposition of A. Let us have k (0 < k < r) and the singular value decomposition of A:

$$A = U\Sigma V^T \approx A_k = (U_k\ U_0) \begin{pmatrix} \Sigma_k & 0 \\ 0 & \Sigma_0 \end{pmatrix} \begin{pmatrix} V_k^T \\ V_0^T \end{pmatrix} \qquad (2)$$
We call $A_k = U_k \Sigma_k V_k^T$ a k-reduced singular value decomposition (rank-k SVD). Instead of the matrix $A_k$, the matrix of image vectors in the reduced space, $D_k = \Sigma_k V_k^T$, is used in SVD as the representation of the image collection. The image vectors (columns of $D_k$) are now represented as points in a k-dimensional space (the feature space).
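A minimal NumPy sketch of the rank-k SVD of Eq. (2) and of the reduced representation $D_k = \Sigma_k V_k^T$ follows; the function name and the toy matrix are illustrative assumptions, not the authors' code.

```python
import numpy as np

def rank_k_svd(A, k):
    # Rank-k SVD of Eq. (2): A ~ U_k Sigma_k V_k^T,
    # with D_k = Sigma_k V_k^T as the reduced image representation.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    A_k = U_k @ S_k @ Vt_k    # best rank-k approximation (Frobenius norm)
    D_k = S_k @ Vt_k          # image vectors as points in k-dimensional space
    return A_k, D_k

# toy usage: 1024-pixel binary images stored as columns of A
A = np.random.randint(0, 2, size=(1024, 50)).astype(float)
A_k, D_k = rank_k_svd(A, k=16)
```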
Fig. 1. Rank-k SVD
Rank-k SVD is the best rank-k approximation of the original matrix A. This means that any other decomposition will increase the approximation error, calculated as the sum of squares (Frobenius norm) of the error matrix $B = A - A_k$. However, it does not imply that we could not obtain a better result with a different approximation.

2.2 Semi-discrete Decomposition

The SDD is another LSI method, proposed recently for text retrieval in [9]. As mentioned earlier, the rank-k SVD method (called truncated SVD by the authors of the semi-discrete decomposition) produces dense matrices U and V, so the resulting required storage may be even larger than that needed by the original term-by-document matrix A. To improve the required storage size and query time, the semi-discrete decomposition was defined as

$$A \approx A_k = X_k D_k Y_k^T, \qquad (3)$$
where each coordinate of $X_k$ and $Y_k$ is constrained to have entries from the set $\varphi = \{-1, 0, 1\}$, and the matrix $D_k$ is a diagonal matrix with positive coordinates. The SDD does not reproduce A exactly, even if k = n, but it uses very little storage with respect to the observed accuracy of the approximation. A rank-k SDD (although from a mathematical standpoint it is a sum of rank-1 matrices) requires the storage of k(m + n) values from the set {−1, 0, 1} and k scalars. The scalars need only be single precision because the algorithm is self-correcting. The SDD approximation is formed iteratively. The optimal choice of the triplets $(x_i, d_i, y_i)$ for a given k can be determined using a greedy algorithm based on the residual $R_k = A - A_{k-1}$ (where $A_0$ is a zero matrix).

2.3 Non-negative Matrix Factorization

The NMF [10] method calculates an approximation of the matrix A as a product of two matrices, W and H. The matrices are usually pre-filled with random values (or H is initialized to zero and W is randomly generated). During the calculation the values in W and H stay positive. The approximation of matrix A, the matrix $A_k$, can be calculated as $A_k = WH$.
The original NMF method tries to minimize the Frobenius norm of the difference between A and $A_k$, using $\min_{W,H} \|V - WH\|_F^2$ as the criterion of the minimization problem.
Recently, a new method was proposed in [11], where the constrained least squares problem $\min_{H_j}\{\|V_j - WH_j\|^2 + \lambda\|H_j\|_2^2\}$ is the criterion of the minimization problem.
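For illustration, a minimal sketch of the basic NMF criterion $\min_{W,H}\|A - WH\|_F^2$ is given below using multiplicative updates; the update rule, the parameter choices and the function name are standard assumptions of ours, since the paper does not specify the optimization scheme it used.

```python
import numpy as np

def nmf(A, k, n_iter=100, eps=1e-9, seed=0):
    # Multiplicative-update NMF for min ||A - W H||_F^2
    # (one standard scheme, assumed here for illustration).
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H, keeping it non-negative
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W, keeping it non-negative
    return W, H
```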
The GD-CLS approach yields better results for sparse matrices. Unlike in SVD, the base vectors are not ordered from the most general one, and we have to calculate the decomposition for each value of k separately.

2.4 Neural Network Based Boolean Factor Analysis

The NBFA is a powerful method for revealing the information redundancy of high dimensional binary signals [7]. It allows every signal (vector of variables) from the binary data matrix X of observations to be expressed as a superposition of binary factors:

$$X = \bigvee_{l=1}^{L} S_l f^l, \qquad (4)$$

where $S_l$ is a component of the factor scores, $f^l$ is a vector of factor loadings, and $\vee$ denotes Boolean summation ($0 \vee 0 = 0$, $1 \vee 0 = 1$, $0 \vee 1 = 1$, $1 \vee 1 = 1$). If we denote Boolean matrix multiplication by the symbol $\otimes$, then we can express the approximation of the data matrix X in matrix notation as

$$X_k \simeq F \otimes S \qquad (5)$$
where S is the matrix of factor scores and F is the matrix of factor loadings. Boolean factor analysis implies that the components of the original signals, the factor loadings and the factor scores are binary values. An optimal solution of the decomposition of $X_k$ according to Eq. (5) by brute-force search is an NP-hard problem and as such is not suitable for high dimensional data. On the other hand, the classical linear methods cannot take into account the non-linearity of Boolean summation and are therefore inadequate for this task. The NBFA is based on a Hopfield-like neural network [12,13]. A fully connected network of N neurons with binary activity (1 – active, 0 – nonactive) is used. Each pattern of the learning set $X^m$ is stored in the matrix of synaptic connections J according to the Hebbian rule:

$$J_{ij} = \sum_{m=1}^{M} (X_i^m - q^m)(X_j^m - q^m), \qquad i, j = 1, \ldots, N,\ i \ne j,\ J_{ii} = 0 \qquad (6)$$

where M is the number of patterns in the learning set and the bias $q^m = \sum_{i=1}^{N} X_i^m / N$ is the total relative activity of the m-th pattern. This form of bias corresponds to the biologically plausible global inhibition being proportional to the overall neuron activity. One
special inhibitory neuron was added to the N principal neurons of the Hopfield network. The neuron was activated during the presentation of every pattern of the learning set and was connected with all the principal neurons by bidirectional connections. Patterns of the learning set are stored in the vector $\mathbf{j}$ of the connections according to the Hebbian rule:

$$j_i = \sum_{m=1}^{M} (X_i^m - q^m) = M(q_i - q), \qquad i = 1, \ldots, N, \qquad (7)$$

where $q_i = \sum_{m=1}^{M} X_i^m / M$ is the mean activity of the i-th neuron in the learning set and q is
the mean activity of all neurons in the learning set. We also supposed that the excitability of the introduced inhibitory neuron decreases inversely proportionally to the size of the learning set, being 1/M after all patterns are stored. In the recall stage its activity is then

$$A(t) = (1/M) \sum_{i=1}^{N} j_i X_i(t) = (1/M)\, \mathbf{j}^T X(t)$$
where $\mathbf{j}^T$ is the transpose of $\mathbf{j}$. The inhibition produced in all principal neurons of the network is given by the vector $\mathbf{j}A(t) = (1/M)\,\mathbf{j}\mathbf{j}^T X(t)$. Thus, the inhibition is equivalent to the subtraction of

$$J' = \mathbf{j}\mathbf{j}^T / M = M\,\mathbf{q}\mathbf{q}^T \qquad (8)$$
from J, where $\mathbf{q}$ is a vector with components $q_i - q$. Adding the inhibitory neuron is equivalent to replacing the ordinary connection matrix J by the matrix $\tilde{J} = J - J'$. To reveal factors we suggest the following two-run recall procedure. Its initialization starts with the presentation of a random initial pattern $X^{in}$ with $k_{in} = r_{in}N$ active neurons. The activity $k_{in}$ is supposed to be smaller than the activity of any factor. On presentation of $X^{in}$, the network activity X evolves to some attractor. This evolution is determined by the synchronous discrete time dynamics. At each time step:

$$X_i(t+1) = \Theta(h_i(t) - T(t)), \quad i = 1, \ldots, N, \qquad X_i(0) = X_i^{in} \qquad (9)$$

where the $h_i$ are the components of the vector of synaptic excitations

$$h(t) = J X(t), \qquad (10)$$
Θ is the step function, and T (t) is an activation threshold. At each time step of the recall process the threshold T (t) was chosen in such a way that the level of the network activity was kept constant and equal to kin . Thus, on each time step kin “winners" (neurons with the greatest synaptic excitation) were chosen and only they were active on the next time step. As shown in [12], this choice of activation threshold enables the network activity to stabilize in point or cyclic attractors
of length two. The fixed level of activity at this stage of the recall process could be ensured by biologically plausible non-linear negative feedback control accomplished by the inhibitory interneurons. It is worth noting that, although the convergence of the network's synchronous dynamics to point attractors or to cyclic attractors of length two was established earlier [14] for a fixed activation threshold T, for fixed network activity it was first shown in [12]. When the activity stabilizes at the initial level of activity $k_{in}$, the $k_{in} + 1$ neurons with maximal synaptic excitation are chosen for the next iteration step, and the network activity evolves to an attractor at the new level of activity $k_{in} + 1$. The level of activity then increases to $k_{in} + 2$, and so on, until the number of active neurons reaches the final level $k_f = r_f N$, where $r = k/N$ is the relative network activity. Thus, one trial of the recall procedure contains $(r_f - r_{in})N$ external steps and several internal steps (usually 2–3) inside each external step to reach an attractor for a given level of activity. At the end of each external step, when the network activity stabilizes at the level of k active neurons, a Lyapunov function is calculated by the formula

$$\lambda = X^T(t+1)\, J\, X(t) / k, \qquad (11)$$
where $X^T(t+1)$ and $X(t)$ are two network states in a cyclic attractor (for a point attractor, $X^T(t+1) = X(t)$). The identification of factors along the trajectories of the network dynamics is based on the analysis of the change of the Lyapunov function and of the activation threshold along each trajectory. In our definition of the Lyapunov function, its value gives the mean synaptic excitation of the neurons belonging to an attractor at the end of each external step.

2.5 Statistical Clustering Methods

The clustering methods help to reveal groups of black pixels, i.e. typical parts of images. However, obtaining only disjunctive clusters is a limitation of traditional hard clustering. The HAA algorithm starts with each pixel in a group of its own. It then merges clusters until only one large cluster, which includes all pixels, remains. The user must choose a dissimilarity or similarity measure and an agglomerative procedure. At the beginning, when each pixel represents its own cluster, the dissimilarity between two pixels is defined by the chosen dissimilarity measure. However, once several pixels have been linked together, we need a linkage or amalgamation rule to determine when two clusters are sufficiently similar to be linked together. Several linkage rules have been proposed, e.g. the distance between two different clusters can be determined by the greatest distance between two pixels in the clusters (Complete Linkage method – CL), or by the average distance between all pairs of objects in the two clusters (Average Linkage Between Groups – ALBG). Hierarchical clustering is based on the proximity matrix (dissimilarities for all pairs of pixels) and is independent of the order of the pixels. For binary data, we can choose for example the Jaccard and Ochiai (cosine) similarity measures of two pixels. The former can be expressed as $S_J = \frac{a}{a+b+c}$, where a is the number of common
occurrences of ones and b + c is the number of pairs in which one value is one and the second is zero. The latter can be expressed as $S_O = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}}$. In the TSCA algorithm, the pixels are arranged into subclusters, known as "cluster features". These cluster features are then clustered into k groups using a traditional hierarchical clustering procedure. A cluster feature (CF) represents a set of summary statistics on a subset of the data. The algorithm consists of two phases. In the first one, an initial CF tree is built (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data). In the second one, an arbitrary clustering algorithm is used to cluster the leaf nodes of the CF tree. A disadvantage of this method is its sensitivity to the order of the objects (pixels in our case). The log-likelihood distance measure can be used for the calculation of the distance between two clusters.
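A small sketch of the two binary similarity measures, computed from the counts a, b and c defined above, is given below; NumPy, the function names and the toy vectors are illustrative assumptions (the square-root form of the Ochiai coefficient is the usual cosine definition, reconstructed from the garbled formula above).

```python
import numpy as np

def jaccard(p, q):
    # S_J = a / (a + b + c): a = common ones, b + c = mismatched positions
    a = np.sum((p == 1) & (q == 1))
    bc = np.sum(p != q)
    return a / (a + bc) if (a + bc) > 0 else 0.0

def ochiai(p, q):
    # S_O = sqrt( a/(a+b) * a/(a+c) ) = a / sqrt((a+b)(a+c))  (cosine measure)
    a = np.sum((p == 1) & (q == 1))
    ab = np.sum(p == 1)   # a + b
    ac = np.sum(q == 1)   # a + c
    return a / np.sqrt(ab * ac) if ab * ac > 0 else 0.0

p = np.array([1, 1, 0, 1, 0])
q = np.array([1, 0, 0, 1, 1])
print(jaccard(p, q), ochiai(p, q))
```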
3 Experimental Results

For testing the above mentioned methods, we used a generic collection of 1600 32 × 32 black-and-white images containing different combinations of horizontal and vertical lines (bars). The probabilities of the bars occurring in the images were all the same and equal to 10/64, i.e. the images contain 10 bars on average. An example of several images from the generated collection is shown in Figure 3a. For a first view of the image structures, we applied traditional cluster analysis. We clustered the 1024 (32 × 32) positions into 64 and 32 clusters. The problem with the use of traditional cluster analysis is that we obtain disjunctive clusters, which means we can find only horizontal bars and parts of vertical bars, or vice versa. We applied HAA and TSCA clustering as implemented in the SPSS system. The problem of the TSCA method is that it depends on the order of the analyzed images, so we used two different orders. For HAA, we tried different similarity measures. We found that the linkage methods have more influence on the results of clustering than the similarity measures. We used the Jaccard and Ochiai (cosine) similarity measures, which are suitable for asymmetric binary attributes, and found both to be suitable for the identification of the bars or their parts. For 64 clusters, the differences were only in a few assignments of positions by the ALBG and CL methods with the Jaccard and Ochiai measures. The following figures illustrate the application of some of these techniques for 64 and 32 clusters. Figures 2a and 2b show the results of the ALBG method with the Jaccard measure, Figure 2a for 32 clusters and Figure 2b for 64 clusters. In the case of 32 clusters, we found 32 horizontal bars (see Figure 2c) by the TSCA method for the second order of features. Many of the tested methods were able to generate a set of base images or factors, which should ideally record all possible bar positions. However, not all methods were truly successful in this. With the SVD, we obtain classic singular vectors, the most general being among the first. The first few are shown in Figure 3b. We can see that the bars are not separated and different shades of gray appear. The NMF methods yield different results. The original NMF method, based on the adjustment of random matrices W and H, provides hardly-recognizable images even for k = 100 and 1000 iterations (we used 100 iterations for the other experiments).
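A minimal sketch of how such a bars collection can be generated is shown below, assuming each of the 32 horizontal and 32 vertical bars is present independently with probability 10/64 and each image is flattened into a 1024-dimensional binary vector; the independence assumption, the random seed and the function name are ours.

```python
import numpy as np

def generate_bars(n_images=1600, size=32, p=10/64, seed=0):
    # Each image is a Boolean sum of horizontal and vertical bars,
    # every bar being present independently with probability p.
    rng = np.random.default_rng(seed)
    images = np.zeros((n_images, size, size), dtype=np.uint8)
    for m in range(n_images):
        rows = rng.random(size) < p      # which horizontal bars are present
        cols = rng.random(size) < p      # which vertical bars are present
        images[m][rows, :] = 1
        images[m][:, cols] = 1
    return images.reshape(n_images, size * size)   # one binary vector per image

X = generate_bars()
```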
Fig. 2. (a) 64 and (b) 32 clusters of pixels by ALBG method (Jaccard coefficient) (c) 32 clusters of pixels by TSCA method
Fig. 3. (a) Several bars from generated collection (b) First 64 base images of bars for SVD method (c) First 64 factors for the original NMF method
Moreover, these base images still contain significant salt and pepper noise and have poor contrast. The factors are shown in Figure 3c. We must also note that the NMF decomposition yields slightly different results each time it is run, because the matrices are pre-filled with random values. The GD-CLS modification of the NMF method tries to improve the decomposition by solving the constrained least squares problem. This leads to a better overall quality; however, the decomposition still depends on the pre-filled random matrix H. The result is shown in Figure 4a. The SDD method differs slightly from the previous methods, since each factor contains only values {−1, 0, 1}. Gray in the factors shown in Figure 4b represents 0; −1 and 1 are represented by black and white, respectively. The base vectors in Figure 4b can be divided into three categories: base vectors containing only one bar; base vectors containing one horizontal and one vertical bar; and other base vectors, containing several bars and in some cases even noise. Finally, we decomposed the images into binary vectors by the NBFA method. Here the factors contain only values {0, 1} and we use Boolean arithmetic. The factor search was performed under the assumption that the number of ones in a factor is not less
Fig. 4. First 64 factors for (a) the GD-CLS NMF method, and first 64 base vectors for (b) the SDD method and (c) the NBFA method
than 5 and not greater than 200. Since the images are obtained by Boolean summation of binary bars, it is not surprising that the NBFA is able to reconstruct all bars as base vectors, providing an ideal solution, as we can see in Figure 4. It is clear that the classical linear methods cannot take into account the non-linearity of Boolean summation and are therefore inadequate for the bars problem. But the linear methods are fast and well elaborated, so it was very interesting to compare the linear approach with the NBFA and to compare the results qualitatively.
4 Conclusion

In this paper, we have compared NBFA and several dimension reduction approaches – including clustering methods – on the so-called bars problem. It is shown that NBFA perfectly found the basis (factors) from which all the learning pictures can be reconstructed. First, this is because the model on which the BFA is based is the same as that used for data generation. Secondly, it is because of the robustness of the BFA implementation based on a recurrent neural network. Some experiments show that the resistance against noise is very high; we hypothesize that this is due to the self-reconstruction ability of our neural network. Whilst the SVD is known to provide quality eigenfaces, it is computationally expensive, and in case we only need to beat the "curse of dimensionality" by reducing the dimension, SDD may suffice. As expected, of the methods which allowed direct querying in the reduced space, SVD was the slowest but most exact method. The NMF and SDD methods may also be used, but not with the L2 metric, since the distances are not preserved well enough in this case. There are some other newly-proposed methods which may be interesting for future testing, e.g. SparseMap [15]. Additionally, a faster pivot selection technique for FastMap [16] may be considered. Finally, testing the dimension reduction methods used here with deviation metrics on metric structures should answer the question of the projected data's indexability (which is poor for SVD-reduced data).
Cluster analysis in this application is focused on finding the original factors from which the images were generated. The applied clustering methods were quite successful in finding these factors. The problem of cluster analysis is that it provides disjunctive clusters only, so only some bars or their parts were revealed. However, from the general view of the 64 clusters, it is obvious that the images are composed of vertical and horizontal bars (lines). By two-step cluster analysis, 32 horizontal lines were revealed by clustering into 32 clusters.

Acknowledgment. The work was partly funded by the Centre of Applied Cybernetics 1M6840070004 and partly by the Institutional Research Plan AV0Z10300504 "Computer Science for the Information Society: Models, Algorithms, Applications" and by the project 1ET100300419 of the Program Information Society of the Thematic Program II of the National Research Program of the Czech Republic.
References 1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: VLDB 1994: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487–499. Morgan Kaufmann Publishers Inc., San Francisco (1994) 2. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic itemset counting and implication rules for market basket data. In: SIGMOD 1997: Proceedings of the 1997 ACM SIGMOD international conference on Management of data, pp. 255–264. ACM Press, New York (1997) 3. Spellman, P.T., Sherlock, G., Zhang, M.Q., Anders, V.I.K., Eisen, M.B., Brown, P., Botstein, D., Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9, 3273– 3297 (1998) 4. Koyutürk, M., Grama, A., Ramakrishnan, N.: Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Trans. Math. Softw. 32(1), 33–69 (2006) 5. Földiák., P.: Forming sparse representations by local anti-Hebbian learning. Biological cybernetics 64(22), 165–170 (1990) ˇ 6. Frolov, A., Húsek, D., Polyakov, P., Rezanková, H.: New Neural Network Based Approach Helps to Discover Hidden Russian Parliament Votting Paterns. In: International Joint Conference on Neural Networks, Omnipress, pp. 6518–6523 (2006) 7. Frolov, A.A., Húsek, D., Muravjev, P., Polyakov, P.: Boolean Factor Analysis by Attractor Neural Network. Neural Networks, IEEE Transactions 18(3), 698–707 (2007) 8. Berry, M., Dumais, S., Letsche, T.: Computational Methods for Intelligent Information Access. In: Proceedings of the 1995 ACM/IEEE Supercomputing Conference, San Diego, California, USA (1995) 9. Kolda, T.G., O’Leary, D.P.: Computation and uses of the semidiscrete matrix decomposition. In: ACM Transactions on Information Processing (2000) 10. Shahnaz, F., Berry, M., Pauca, P., Plemmons, R.: Document clustering using nonnegative matrix factorization. Journal on Information Processing and Management 42, 373–386 (2006) 11. Spratling, M.W.: Learning Image Components for Object Recognition. Journal of Machine Learning Research 7, 793–815 (2006) 12. Frolov, A.A., Húsek, D., Muravjev, P.: Informational efficiency of sparsely encoded Hopfield-like autoassociative memory. Optical Memory and Neural Networks (Information Optics), 177–198 (2003)
13. Frolov, A.A., Sirota, A.M., Húsek, D., Muravjev, P.: Binary factorization in Hopfield-like neural networks: single-step approximation and computer simulations. Neural Networks World, 139–152 (2004) 14. Goles-Chacc, E., Fogelman-Soulie, F.: Decreasing energy functions as a tool for studying threshold networks. Discrete Mathematics, 261–277 (1985) 15. Faloutsos, C.: Gray Codes for Partial Match and Range Queries. IEEE Transactions on Software Engineering 14(10) (1988) 16. Faloutsos, C., Lin, K.: FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. ACM SIGMOD Record 24(2), 163–174 (1995)
Expand-and-Reduce Algorithm of Particle Swarm Optimization Eiji Miyagawa and Toshimichi Saito Hosei University, Koganei, Tokyo, 184-0002 Japan
Abstract. This paper presents an optimization algorithm: particle swarm optimization with expand-and-reduce ability. When particles are trapped by a local optimal solution, a new particle is added and the trapped particle(s) can escape from the trap. The deletion of particles is also used in order to suppress excessive swarm growth. The algorithm's efficiency is verified through basic numerical experiments.
1 Introduction
Particle swarm optimization (PSO) has been used extensively as an effective technique for solving a variety of optimization problems [1] - [6]. It goes without saying that it is hard to find the true optimal solution in practical complex problems; the PSO algorithm tries to find a solution with sufficient accuracy in practical cases. The original PSO was developed by Kennedy and Eberhart [1]. The PSO shares the best information in the swarm and is used for the function optimization problem. In order to realize an effective PSO search, there exist several interesting approaches, including particle dynamics by random sampling from a flexible distribution [3], an improved PSO by speciation [4], growing multiple subswarms [5] and adaptive PSO [6]. The prospective applications of the PSO are many, including the iterated prisoner's dilemma, optimizing RNA secondary structure, mobile sensor networks and nonlinear state estimation [7] - [10]. However, the basic PSO may not find a desired solution in complex problems. For example, if one particle finds a local optimal solution, the entire swarm may be trapped by it, and once the swarm is trapped, it is difficult to escape from the trap. This paper presents a novel version of the PSO: PSO with expand-and-reduce ability (ERPSO). When particles are trapped by a local optimal solution, a new particle is added and the particle swarm can grow. In a swarm, particles share the information of added particles. The trapped particle(s) can escape from the trap provided the parameters are selected suitably. The deletion of particle(s) is also introduced in order to suppress excessive swarm growth: without deletion, the swarm grows excessively and causes computational overload. Performing basic numerical experiments, the effectiveness of the ERPSO algorithm is verified. It should be noted that the growing and/or reducing of swarms has been a key technique in several learning algorithms, including self-organizing maps, and in practical applications [11].
2 Fundamental Version of PSO
As preparation, we introduce the global best version of the PSO with an elementary numerical experiment. The PSO is an optimization algorithm in which the particles within the swarm learn from each other and move to become more similar to their "better" neighbors. The social structure for the PSO is determined through the formation of a neighborhood to communicate with. The global best (gbest) version is the fundamental one, where each particle can communicate with every other particle, forming a fully connected social network. The i-th particle at time t is characterized by its position xi(t), and the position is updated based on its velocity vi(t). We summarize the gbest version for finding the global minimum of a function F, where the dimensions of F and xi are the same:

Step 1: Initialization. Let t = 0. The particle positions xi(t) are located randomly in the swarm S(t), where i = 1 ∼ N and N is the number of particles.

Step 2: Compare the value of F of each particle to its best value (pbesti) so far: if F(xi(t)) < pbesti then (a) pbesti = F(xi(t)) and (b) xpbesti = xi(t).

Step 3: Compare the value of F of each particle to the global best value: if F(xi(t)) < gbest then (a) gbest = F(xi(t)) and (b) xgbest = xi(t).
[Figure 1 appears here: panels (a) initial state through (d) completion of the gbest search, with the optimal point marked "Opt".]
Fig. 1. Search process by the gbest version. Each dot denotes a particle. N = 10 and tmax = 50. The random variables ρ1 = ρ2 are uniform on [0, 2].
Step 4: Change the velocity vector of each particle:

$$v_i(t) = W v_i(t) + \rho_1 (x_{pbest_i} - x_i(t)) + \rho_2 (x_{gbest} - x_i(t)) \qquad (1)$$

where W is the inertia term defined by

$$W = W_{max} - \frac{W_{max} - W_{min}}{t_{max}} \times t. \qquad (2)$$
ρ1 and ρ2 are random variables.

Step 5: Move each particle to a new position: xi(t) = xi(t) + vi(t).

Step 6: Let t = t + 1. Go to Step 2, and repeat until t = tmax.

In order to demonstrate the performance of the PSO, we apply the gbest algorithm to the following elementary problem:

$$\text{minimize } f_1(x_1, x_2) = x_1^2 + x_2^2 \quad \text{subject to } |x_1| \le 50,\ |x_2| \le 50 \qquad (3)$$
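A minimal sketch of the gbest PSO (Steps 1–6 with Eqs. (1)–(2)) applied to the problem of Eq. (3) is given below; the exact order in which the bests are updated, the random seed, and the function name are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def gbest_pso(f, dim=2, n=10, t_max=50, bound=50.0,
              w_max=0.9, w_min=0.4, seed=0):
    # gbest PSO of Sec. 2 with rho1, rho2 drawn uniformly from [0, 2].
    rng = np.random.default_rng(seed)
    x = rng.uniform(-bound, bound, (n, dim))       # Step 1: random positions
    v = np.zeros((n, dim))
    pbest_x, pbest = x.copy(), np.array([f(xi) for xi in x])
    g = int(np.argmin(pbest))
    gbest_x, gbest = pbest_x[g].copy(), pbest[g]
    for t in range(t_max):
        w = w_max - (w_max - w_min) / t_max * t                     # Eq. (2)
        rho1 = rng.uniform(0, 2, (n, dim))
        rho2 = rng.uniform(0, 2, (n, dim))
        v = w * v + rho1 * (pbest_x - x) + rho2 * (gbest_x - x)     # Eq. (1)
        x = x + v                                                   # Step 5
        for i in range(n):                                          # Steps 2-3
            fi = f(x[i])
            if fi < pbest[i]:
                pbest[i], pbest_x[i] = fi, x[i].copy()
                if fi < gbest:
                    gbest, gbest_x = fi, x[i].copy()
    return gbest_x, gbest

f1 = lambda z: z[0] ** 2 + z[1] ** 2   # Eq. (3)
print(gbest_pso(f1))
```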
f1 has a unique extremum at the origin (x1, x2) = (0, 0), which is the optimal solution. Fig. 1 illustrates the learning process of the particles in the gbest version. We can see that the entire swarm converges to the optimal solution. The new algorithm is an improvement of this algorithm: the network has a ring structure, the number of particles is time-variant, and the global best value is replaced with the local best value within the closest neighbors. The detailed definition is given in Section 3.
3 Expand-and-Reduce Algorithm
The fundamental PSO can find the optimal solution of simple problems, as shown in the previous section. However, if the objective function has a complex shape with many local extrema, the entire swarm may be trapped by a local optimal solution. The velocity of each particle is changed by the random variables ρ1 and ρ2; however, it is very hard to adjust these parameters for various problems. Here we present a novel type of PSO in which the swarm can be expanded and reduced. In this expand-and-reduce particle swarm optimization (ERPSO), a particle can escape from the trap without adjusting the parameters. In the ERPSO, the i-th particle at time t is characterized by its position xi(t) and a counter ci(t) that counts the time-invariance of pbesti. The closest neighbors in the ring structure are used to update the particles, and the number of particles N is a variable depending on the values of ci(t). We define the ERPSO for a swarm with ring structure and for the problem of finding the global minimum of a function F.

Step 1: Initialization. Let t = 0; the number of particles N and the values of the counters ci(t) are initialized. The positions xi(t) are located randomly in the swarm.
Fig. 2. Expansion of the PSO and escape from a trap: (a), (b) a particle is trapped at a local optimum; (c), (d) a new particle is inserted and escapes from the trap
Step 2: Compare the cost of each particle with its best cost so far: if F(xi(t)) < pbesti then (a) pbesti = F(xi(t)) and (b) xpbesti = xi(t). The counter value increases if the improvement of pbesti is not sufficient:
ci(t) = ci(t) + 1  if |F(xi(t)) − pbesti| < ε   (4)
where ε is a small value.
Step 3 (expansion): If a particle is trapped at a local optimum or the global optimum, as shown in Fig. 2 (a) and (b), the value of ci becomes large. If the counter value exceeds a threshold, a new particle is inserted as the closest neighbor of the trapped particle and is located away from the trap, as shown in Fig. 2 (c) and (d). In the case where the i-th particle is trapped,
xnew(t) = xi(t) + r  if ci(t) > Tint   (5)
where r is a random variable. The indices are then reassigned: j becomes j + 1 for j > i, and the new particle becomes xi+1. Let N = N + 1, and let the counters be initialized, ci(t) = 0 for all i. This expansion can be effective for escaping from the trap.
Step 4 (reduction): If the value of gbest(t) does not change during some time interval after a new particle is inserted, one of the trapped particles is removed. The reduction aims at suppressing excessive growth of the swarm, which would cause unnecessary computation time.
Step 5: Compare the cost of each particle with its local best (lbest) cost so far: if F(xi(t)) < lbesti then (a) lbesti = F(xi(t)) and (b) xlbesti = xi(t), where lbesti is taken over both closest neighbors of the i-th particle.
Step 6: Change the velocity vector of each particle:
vi(t) = W vi(t) + ρ1(xpbesti − xi(t)) + ρ2(xlbesti − xi(t))   (6)
where W is the inertial term defined by Eq. (2), and ρ1 and ρ2 are random variables.
Fig. 3. Searching for the minimum value of Equation (3); each panel plots the value of f1 against the iteration, and the red line is the optimal solution. (a) lbest version without expand-and-reduce for N = 7. (b) ERPSO, N(0) = 3. Parameter values: Tint = 30, Wmax = 0.9, Wmin = 0.4, tmax = 750, ε = 10⁻³. The random variables ρ1 = ρ2 are given by a uniform distribution on [0, 2]. The time average of N(t) is 7.
Step 7: Move each particle to a new position: xi(t) = xi(t) + vi(t).
Step 8: Let t = t + 1. Go to Step 2, and repeat until t = tmax.
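To make the expansion and reduction concrete, the following is a rough sketch (not the authors' code) of how Steps 2–4 could be realized on top of a ring-topology PSO loop. The distribution of the random offset r, the criterion for choosing which trapped particle to remove, the way the "no change in gbest" interval is measured, and all helper names are assumptions.

```python
import numpy as np

def update_counters(fx, pbest_val, counters, eps=1e-3):
    """Step 2: increment c_i when the improvement of pbest_i is smaller than eps (Eq. 4)."""
    counters = counters.copy()
    counters[np.abs(fx - pbest_val) < eps] += 1
    return counters

def expand(x, v, counters, i, rng, t_int=30, spread=1.0):
    """Step 3 (expansion): if c_i exceeds T_int, insert a new particle next to particle i
    in the ring, displaced by a random offset r (Eq. 5), and reset all counters."""
    if counters[i] <= t_int:
        return x, v, counters
    r = rng.uniform(-spread, spread, x.shape[1])   # assumed distribution of r
    x = np.insert(x, i + 1, x[i] + r, axis=0)      # the new particle becomes x_{i+1}
    v = np.insert(v, i + 1, np.zeros(x.shape[1]), axis=0)
    counters = np.zeros(len(x), dtype=int)         # c_i(t) = 0 for all i
    return x, v, counters

def reduce(x, v, counters, gbest_history, wait=30):
    """Step 4 (reduction): if gbest has not changed for `wait` steps since the last
    insertion, remove one trapped particle (here: the one with the largest counter)."""
    if len(gbest_history) >= wait and len(set(gbest_history[-wait:])) == 1 and len(x) > 2:
        i = int(np.argmax(counters))               # which particle to remove is an assumption
        x = np.delete(x, i, axis=0)
        v = np.delete(v, i, axis=0)
        counters = np.delete(counters, i)
    return x, v, counters
```

In a full loop, the velocity update of Step 6 would use the lbest over each particle's two ring neighbors rather than the global best used in the gbest sketch above.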
4 Numerical Experiments
In order to investigate the efficiency of the ERPSO, we have performed basic numerical experiments. First, we have applied the ERPSO to the problem defined by Equation (3). The result is shown in Fig. 3, where the lbest version is defined by Steps 1 to 7 without Steps 3 and 4. This result suggests that the ERPSO can find the optimal solution slightly faster than the lbest version. The efficiency of the ERPSO seems to be more remarkable in more complicated problems.
Fig. 4. Expand-and-reduce search process in f2(x1, x2): (a) the function f2, with its optimal solution and local optimal solutions; (b) t = 0; (c) t = 250; (d) t = 500. Parameter values are defined in the Fig. 5 caption.
Fig. 5. Searching for the minimum value of Equation (7); each panel plots the value of f2 against the iteration, and the red line is the optimal solution. (a) lbest version without expand-and-reduce, with N = 9, which is the time average of N(t) in the ERPSO. (b) ERPSO, N(0) = 3. Parameter values: Tint = 30, Wmax = 0.9, Wmin = 0.4, tmax = 750, ε = 10⁻³. The random variables ρ1 = ρ2 are given by a uniform distribution on [0, 2].
Second, we have applied the ERPSO to a multi-extrema function: the modified Shekel's Foxholes function studied in [12]:
minimize  f2(x1, x2) = − Σ_{i=1}^{30} 1 / [ (x1 − αi)² + (x2 − βi)² + γi ]
subject to  0 ≤ x1 ≤ 10, 0 ≤ x2 ≤ 10   (7)
Table 1. Success rate for 1,000 trials. Parameters are as in Fig. 5.

Function  Algorithm  Success rate
f1        PSO        100 %
f1        ERPSO      100 %
f2        PSO        37.5 %
f2        ERPSO      56.2 %
(α1, · · ·, α10) = (9.681, 9.400, 8.025, 2.196, 8.074, 7.650, 1.256, 8.314, 0.226, 7.305)
(α11, · · ·, α20) = (0.652, 2.699, 8.327, 2.132, 4.707, 8.304, 8.632, 4.887, 2.440, 6.306)
(α21, · · ·, α30) = (0.652, 5.558, 3.352, 8.798, 1.460, 0.432, 0.679, 4.263, 9.496, 4.138)
(β1, · · ·, β10) = (0.667, 2.041, 9.152, 0.415, 8.777, 5.658, 3.605, 2.261, 8.858, 2.228)
(β11, · · ·, β20) = (7.027, 3.516, 3.897, 7.006, 5.579, 7.559, 4.409, 9.112, 6.686, 8.583)
(β21, · · ·, β30) = (2.343, 1.272, 7.549, 0.880, 8.057, 8.645, 2.800, 1.074, 4.830, 2.562)
(γ1, · · ·, γ10) = (0.806, 0.517, 0.100, 0.908, 0.965, 0.669, 0.524, 0.902, 0.531, 0.876)
(γ11, · · ·, γ20) = (0.462, 0.491, 0.463, 0.714, 0.352, 0.869, 0.813, 0.811, 0.828, 0.964)
(γ21, · · ·, γ30) = (0.789, 0.360, 0.369, 0.992, 0.332, 0.817, 0.632, 0.883, 0.608, 0.326)
This function has many local optimal solutions, as shown in Fig. 4(a). The optimal solution is f2(x1, x2) = −12.12 at (x1, x2) = (8.02, 9.14). A search process is illustrated in Fig. 5: we can see that the particles can escape from the trap and go to the optimal solution. We have confirmed that this optimal solution is hard to find by the fundamental gbest version. Fig. 5 and Table 1 compare the results of the ERPSO with those of the lbest version. In the results, the lbest version has difficulty finding the solution, but the ERPSO can find it. In the lbest version, the particles are almost always trapped at some local minimum.
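As a cross-check of this landscape, the following small sketch (not from the paper) evaluates f2 of Equation (7) with the parameters listed above; the coarse grid search and the array names are our own additions.

```python
import numpy as np

# Parameters (alpha_i, beta_i, gamma_i), i = 1..30, as listed above.
ALPHA = np.array([9.681, 9.400, 8.025, 2.196, 8.074, 7.650, 1.256, 8.314, 0.226, 7.305,
                  0.652, 2.699, 8.327, 2.132, 4.707, 8.304, 8.632, 4.887, 2.440, 6.306,
                  0.652, 5.558, 3.352, 8.798, 1.460, 0.432, 0.679, 4.263, 9.496, 4.138])
BETA = np.array([0.667, 2.041, 9.152, 0.415, 8.777, 5.658, 3.605, 2.261, 8.858, 2.228,
                 7.027, 3.516, 3.897, 7.006, 5.579, 7.559, 4.409, 9.112, 6.686, 8.583,
                 2.343, 1.272, 7.549, 0.880, 8.057, 8.645, 2.800, 1.074, 4.830, 2.562])
GAMMA = np.array([0.806, 0.517, 0.100, 0.908, 0.965, 0.669, 0.524, 0.902, 0.531, 0.876,
                  0.462, 0.491, 0.463, 0.714, 0.352, 0.869, 0.813, 0.811, 0.828, 0.964,
                  0.789, 0.360, 0.369, 0.992, 0.332, 0.817, 0.632, 0.883, 0.608, 0.326])

def f2(x1, x2):
    """Modified Shekel's Foxholes function of Eq. (7), to be minimized on [0, 10]^2."""
    return -np.sum(1.0 / ((x1 - ALPHA) ** 2 + (x2 - BETA) ** 2 + GAMMA))

# Coarse grid check: the minimum should appear near (8.02, 9.14) with value about -12.12.
xs = np.linspace(0, 10, 501)
grid = np.array([[f2(a, b) for b in xs] for a in xs])
i, j = np.unravel_index(grid.argmin(), grid.shape)
print(xs[i], xs[j], grid[i, j])
```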
5 Conclusions
A novel version of the PSO is presented in this paper. In the algorithm, the expanding function is effective for escaping from a trap, and the reducing function is effective for suppressing the computation cost. The efficiency of the algorithm is suggested through basic numerical experiments. This paper is a first step, and many problems remain, including the following:
1) analysis of the role of the parameters,
2) analysis of the effect of the topology of the particle network,
3) automatic adjustment of the parameters,
4) application to more complicated benchmarks, and
5) application to practical problems.
References
1. Kennedy, J., Eberhart, R.: Particle Swarm Optimization. In: Proc. of IEEE/ICNN, pp. 1942–1948 (1995)
2. Engelbrecht, A.P.: Computational Intelligence: An Introduction, pp. 185–198. Wiley, Chichester (2004)
3. Richer, T.J., Blackwell, T.M.: The Levy Particle Swarm. In: Proc. Congr. Evol. Comput., pp. 3150–3157 (2006)
4. Parrott, D., Li, X.: Locating and tracking multiple dynamic optima by a particle swarm model using speciation. IEEE Trans. Evol. Comput. 10(4), 440–458 (2006)
5. Brits, R., Engelbrecht, A.P., van den Bergh, F.: A Niching Particle Swarm Optimizer. In: Proc. of SEAL, vol. 1079 (2002)
6. Hu, X., Eberhart, R.C.: Adaptive Particle Swarm Optimization: Detection and Response to Dynamic Systems. In: Proc. of IEEE/CEC, pp. 1666–1670 (2002)
7. Franken, N., Engelbrecht, A.P.: Particle swarm optimization approaches to coevolve strategies for the Iterated Prisoner's Dilemma. IEEE Trans. Evol. Comput. 9(6), 562–579 (2005)
8. Neethling, M., Engelbrecht, A.P.: Determining RNA secondary structure using set-based particle swarm optimization. In: Proc. Congr. Evol. Comput., pp. 6134–6141 (2006)
9. Jatmiko, W., Sekiyama, K., Fukuda, T.: A PSO-based mobile sensor network for odor source localization in dynamic environment: theory, simulation and measurement. In: Proc. Congr. Evol. Comput., pp. 3781–3788 (2006)
10. Tong, G., Fang, Z., Xu, X.: A particle swarm optimized particle filter for nonlinear system state estimation. In: Proc. Congr. Evol. Comput., pp. 1545–1549 (2006)
11. Oshime, T., Saito, T., Torikai, H.: ART-based parallel learning of growing SOMs and its application to TSP. In: King, I., Wang, J., Chan, L.-W., Wang, D. (eds.) ICONIP 2006. LNCS, vol. 4232, pp. 1004–1011. Springer, Heidelberg (2006)
12. Bersini, H., Dorigo, M., Langerman, S., Geront, G., Gambardella, L.: Results of the first international contest on evolutionary optimisation (1st ICEO). In: Proc. of IEEE/ICEC, pp. 611–615 (1996)
Nonlinear Pattern Identification by Multi-layered GMDH-Type Neural Network Self-selecting Optimum Neural Network Architecture

Tadashi Kondo

School of Health Sciences, The University of Tokushima, 3-18-15 Kuramoto-cho, Tokushima 770-8509, Japan
[email protected]

Abstract. A revised Group Method of Data Handling (GMDH)-type neural network is applied to nonlinear pattern identification. The GMDH-type neural network has both characteristics of the GMDH and the conventional multi-layered neural network trained by the back-propagation algorithm, and it can automatically organize the optimum neural network architecture using the heuristic self-organization method. In the GMDH-type neural network, many types of neurons, described by such functions as the sigmoid function, the radial basis function, the high-order polynomial, and the linear function, can be used to organize the neural network architecture, and the neuron characteristics that fit the complexity of the nonlinear system are automatically selected so as to minimize the error criterion defined as Akaike's Information Criterion (AIC) or Prediction Sum of Squares (PSS). In this paper, the revised GMDH-type neural network is applied to the identification of a nonlinear pattern, showing that it is a useful method for this process.

Keywords: Neural Network, GMDH, Nonlinear pattern identification.
1 Introduction

The multi-layered GMDH-type neural networks, which have both characteristics of the GMDH [1],[2] and the conventional multi-layered neural network, have been proposed in our early works [3],[4]. The multi-layered GMDH-type neural network can automatically organize multi-layered neural network architectures using the heuristic self-organization method and can also organize optimum architectures of the high-order polynomial that fit the characteristics of the nonlinear complex system. The GMDH-type neural network has several advantages compared with the conventional multi-layered neural network. It has the ability of self-selecting useful input variables: useless input variables are eliminated and useful input variables are selected automatically. The GMDH-type neural network also has the ability of self-selecting the number of layers and the number of neurons in each layer. These structural parameters are automatically determined so as to minimize the error criterion defined as Akaike's Information Criterion (AIC) [5] or Prediction Sum of Squares (PSS) [6], and the optimum neural network architectures can be organized automatically.
Because of this feature, it is very easy to apply this algorithm to the identification problems of practical complex systems. In this paper, the revised GMDH-type neural network is applied to the identification problem of a nonlinear pattern. It is shown that the revised GMDH-type neural network can be applied easily and that it is a useful method for the identification of the nonlinear system.
2 Heuristic Self-organization Method [1],[2]

The architecture of the GMDH-type neural network is organized automatically by using the heuristic self-organization method, which is the basic theory of the GMDH algorithm. The heuristic self-organization method in the GMDH-type neural networks is implemented through the following five procedures:

Separating the Original Data into Training and Test Sets. The original data are separated into training and test sets. The training data are used for the estimation of the weights of the neural network. The test data are used for organizing the network architecture.

Generating the Combinations of the Input Variables in Each Layer. Many combinations of r input variables are generated in each layer. The number of combinations is p!/((p−r)! r!), where p is the number of input variables and r is usually set to two.

Selecting the Optimum Neuron Architectures. For each combination, the optimum neuron architectures which describe the partial characteristics of the nonlinear system can be calculated by applying regression analysis to the training data. The output variables (yk) of the optimum neurons are called intermediate variables.

Selecting the Intermediate Variables. The L intermediate variables giving the L smallest test errors, which are calculated by using the test data, are selected from the generated intermediate variables (yk).

Stopping the Multilayered Iterative Computation. The L selected intermediate variables are set to the input variables of the next layer, and the same computation is continued. When the errors of the test data in each layer stop decreasing, the iterative computation is terminated. The complete neural network which describes the characteristics of the nonlinear system can be constructed by using the optimum neurons generated in each layer.

The heuristic self-organization method plays a very important role in the organization of the GMDH-type neural network.
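To make the five procedures concrete, the following is a compact sketch (not the author's code) of one possible layer-by-layer construction. It uses pairwise quadratic polynomial neurons fitted by ordinary least squares, mean squared error on the test data as the selection criterion, and stopping when the best test error stops decreasing; these concrete choices, and the AIC/PSS criteria being replaced by test MSE, are assumptions made only for illustration.

```python
import numpy as np
from itertools import combinations

def fit_poly_neuron(u_i, u_j, y):
    """Fit y ≈ w0 + w1*ui + w2*uj + w3*ui*uj + w4*ui^2 + w5*uj^2 by least squares."""
    A = np.column_stack([np.ones_like(u_i), u_i, u_j, u_i * u_j, u_i ** 2, u_j ** 2])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def eval_poly_neuron(w, u_i, u_j):
    A = np.column_stack([np.ones_like(u_i), u_i, u_j, u_i * u_j, u_i ** 2, u_j ** 2])
    return A @ w

def gmdh_layers(X_train, y_train, X_test, y_test, L=6, max_layers=10):
    """Heuristic self-organization: add layers until the test error stops decreasing."""
    U_tr, U_te = X_train, X_test
    best_err = np.inf
    for _ in range(max_layers):
        candidates = []
        for i, j in combinations(range(U_tr.shape[1]), 2):       # combinations of r = 2 inputs
            w = fit_poly_neuron(U_tr[:, i], U_tr[:, j], y_train)  # weights from training data
            err = np.mean((eval_poly_neuron(w, U_te[:, i], U_te[:, j]) - y_test) ** 2)
            candidates.append((err, i, j, w))                     # selection uses test data
        candidates.sort(key=lambda c: c[0])
        selected = candidates[:L]                                 # keep L best intermediate variables
        if selected[0][0] >= best_err:                            # stop when test error stops decreasing
            break
        best_err = selected[0][0]
        U_tr = np.column_stack([eval_poly_neuron(w, U_tr[:, i], U_tr[:, j]) for _, i, j, w in selected])
        U_te = np.column_stack([eval_poly_neuron(w, U_te[:, i], U_te[:, j]) for _, i, j, w in selected])
    return best_err
```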
3 Revised GMDH-Type Neural Network Algorithm

The revised GMDH-type neural network has a common feedforward multi-layered architecture. Figure 1 shows the architecture of the revised GMDH-type neural network. This neural network is organized using the heuristic self-organization method.
Fig. 1. Architecture of the revised GMDH-type neural network: the input variables x1, x2, …, xp feed layers of Σf neurons, whose final outputs are φ1, φ2, …, φk
Procedures for determining the architecture of the revised GMDH-type neural network conform to the following.

3.1 First Layer

uj = xj   (j = 1, 2, …, p)   (1)
where xj (j = 1, 2, …, p) are the input variables of the nonlinear system and p is the number of input variables. In the first layer, the input variables are set to the output variables.

3.2 Second Layer

All combinations of r input variables are generated. For each combination, the optimum neuron architecture is automatically selected from the following two neurons. The architectures of the first and second type neurons are shown in Fig. 2.
Fig. 2. Neuron architectures of the two types of neurons: (a) the first type, with two inputs; (b) the second type, with r inputs
The revised GMDH-type neural network algorithm proposed in this paper can select the optimum neural network architecture from three neural network architectures: the sigmoid function neural network, the RBF neural network, and the polynomial neural network. The neuron architectures of the first and second type neurons in each neural network architecture are as follows.

Sigmoid Function Neural Network
The first type neuron
Σ: (Nonlinear function)
zk = w1ui + w2uj + w3uiuj + w4ui² + w5uj² + w6ui³ + w7ui²uj + w8uiuj² + w9uj³ − w0θ1   (2)
f: (Nonlinear function)
yk = 1 / (1 + e^(−zk))   (3)
Here, θ1 = 1 and wi (i = 0, 1, 2, …, 9) are the weights between the first and second layers. The value of r, which is the number of input variables u in each neuron, is set to two for the first type neuron.
The second type neuron
Σ: (Linear function)
zk = w1u1 + w2u2 + w3u3 + ··· + wrur − w0θ1   (r