
Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany

4984

Masumi Ishikawa Kenji Doya Hiroyuki Miyamoto Takeshi Yamakawa (Eds.)

Neural Information Processing 14th International Conference, ICONIP 2007 Kitakyushu, Japan, November 13-16, 2007 Revised Selected Papers, Part I


Volume Editors Masumi Ishikawa Hiroyuki Miyamoto Takeshi Yamakawa Kyushu Institute of Technology Department of Brain Science and Engineering 2-4 Hibikino, Wakamatsu, Kitakyushu 808-0196, Japan E-mail: {ishikawa, miyamo, yamakawa}@brain.kyutech.ac.jp Kenji Doya Okinawa Institute of Science and Technology Initial Research Project 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan E-mail: [email protected]

Library of Congress Control Number: Applied for
CR Subject Classification (1998): F.1, I.2, I.5, I.4, G.3, J.3, C.2.1, C.1.3, C.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-69154-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-69154-9 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com

© Springer-Verlag Berlin Heidelberg 2008
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12282845 06/3180 543210

Preface

This two-volume set comprises the post-conference proceedings of the 14th International Conference on Neural Information Processing (ICONIP 2007), held in Kitakyushu, Japan, during November 13–16, 2007. The Asia Pacific Neural Network Assembly (APNNA) was founded in 1993. The first ICONIP was held in 1994 in Seoul, Korea, sponsored by APNNA in collaboration with regional organizations. Since then, ICONIP has consistently provided a prestigious forum for presenting and exchanging ideas on neural networks and related fields. The research fields covered by ICONIP have now expanded to include bioinformatics, brain-machine interfaces, robotics, and computational intelligence. We received 288 ordinary paper submissions and 3 proposals for special organized sessions. Although the average quality of the submissions was exceptionally high, only 60% of them were accepted after rigorous reviews, with each paper reviewed by three reviewers. Of the three special organized session proposals, two were accepted. In addition to the ordinary submitted papers, we invited 15 special organized sessions, arranged by leading researchers in emerging fields, to promote the future expansion of neural information processing. ICONIP 2007 was held at the newly established Kitakyushu Science and Research Park in Kitakyushu, Japan. Its theme was "Towards an Integrated Approach to the Brain—Brain-Inspired Engineering and Brain Science," which emphasizes the need for cross-disciplinary approaches to understanding brain functions and applying that knowledge for the benefit of society. The conference was jointly sponsored by APNNA, the Japanese Neural Network Society (JNNS), and the 21st Century COE Program at Kyushu Institute of Technology. ICONIP 2007 consisted of 1 keynote speech, 5 plenary talks, 4 tutorials, 41 oral sessions, 3 poster sessions, 4 demonstrations, and social events such as the Banquet and the International Music Festival.
In all, 382 researchers registered, and 355 participants from 29 countries joined the conference. Each tutorial attracted about 60 participants on average. Five best paper awards and five student best paper awards were granted to encourage outstanding researchers. To minimize the number of researchers unable to present their excellent work at the conference due to financial difficulties, we provided travel and accommodation support of up to JPY 150,000 to six researchers and of up to JPY 100,000 to eight students. ICONIP 2007 was held jointly with the 4th BrainIT 2007, organized by the 21st Century COE Program "World of Brain Computing Interwoven out of Animals and Robots," with the support of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) and the Japan Society for the Promotion of Science (JSPS).


We would like to thank Mitsuo Kawato for his superb keynote speech, and Rajesh P.N. Rao, Frédéric Kaplan, Shin Ishii, Andrew Y. Ng, and Yoshiyuki Kabashima for their stimulating plenary talks. We would also like to thank Sven Buchholz, Eckhard Hitzer, Kanta Tachibana, Jung Wang, Nikhil R. Pal, and Tetsuo Furukawa for their enlightening tutorial lectures. We express our deepest appreciation to all the participants for making the conference attractive and fruitful through lively discussions, which we believe will contribute greatly to the future development of neural information processing. We also acknowledge the contributions of all the Committee members for their devoted work, especially Katsumi Tateno for his dedication as Secretary. Last but not least, we give special thanks to Irwin King and his students, Kam Tong Chan and Yi Ling Wong, for providing the submission and reviewing system; to Etsuko Futagoishi for her hard secretarial work; to Satoshi Sonoh and Shunsuke Sakaguchi for maintaining our conference server; and to the many secretaries and graduate students of our department for their diligent work in running the conference.

January 2008

Masumi Ishikawa Kenji Doya Hiroyuki Miyamoto Takeshi Yamakawa

Organization

Conference Committee Chairs
General Chair: Takeshi Yamakawa (Kyushu Institute of Technology, Japan)
Organizing Committee Chair: Shiro Usui (RIKEN, Japan)
Steering Committee Chair: Takeshi Yamakawa (Kyushu Institute of Technology, Japan)
Program Co-chairs: Masumi Ishikawa (Kyushu Institute of Technology, Japan), Kenji Doya (OIST, Japan)
Tutorials Chair: Hirokazu Yokoi (Kyushu Institute of Technology, Japan)
Exhibitions Chair: Masahiro Nagamatsu (Kyushu Institute of Technology, Japan)
Publications Chair: Hiroyuki Miyamoto (Kyushu Institute of Technology, Japan)
Publicity Chair: Hideki Nakagawa (Kyushu Institute of Technology, Japan)
Local Arrangements Chair: Satoru Ishizuka (Kyushu Institute of Technology, Japan)
Web Master: Tsutomu Miki (Kyushu Institute of Technology, Japan)
Secretary: Katsumi Tateno (Kyushu Institute of Technology, Japan)

Steering Committee Takeshi Yamakawa, Masumi Ishikawa, Hirokazu Yokoi, Masahiro Nagamatsu, Hiroyuki Miyamoto, Hideki Nakagawa, Satoru Ishizuka, Tsutomu Miki, Katsumi Tateno

Program Committee
Program Co-chairs: Masumi Ishikawa, Kenji Doya
Track Co-chairs:

Track 1: Masato Okada (Tokyo Univ.), Yoko Yamaguchi (RIKEN), Si Wu (Sussex Univ.) Track 2: Koji Kurata (Univ. of Ryukyus), Kazushi Ikeda (Kyoto Univ.), Liqing Zhang (Shanghai Jiaotong Univ.)


Track 3: Yuzo Hirai (Tsukuba Univ.), Yasuharu Koike (Tokyo Institute of Tech.), J.H. Kim (Handong Global Univ., Korea) Track 4: Akira Iwata (Nagoya Institute of Tech.), Noboru Ohnishi (Nagoya Univ.), SeYoung Oh (Postech, Korea) Track 5: Hideki Asoh (AIST), Shin Ishii (Kyoto Univ.), Sung-Bae Cho (Yonsei Univ., Korea)

Advisory Board Shun-ichi Amari (Japan), Sung-Yang Bang (Korea), You-Shou Wu (China), Lei Xu (Hong Kong), Nikola Kasabov (New Zealand), Kunihiko Fukushima (Japan), Tom D. Gedeon (Australia), Soo-Young Lee (Korea), Yixin Zhong (China), Lipo Wang (Singapore), Nikhil R. Pal (India), Chin-Teng Lin (Taiwan), Laiwan Chan (Hong Kong), Jun Wang (Hong Kong), Shuji Yoshizawa (Japan), Minoru Tsukada (Japan), Takashi Nagano (Japan), Shozo Yasui (Japan)

Referees
S. Akaho, P. Andras, T. Aonishi, T. Aoyagi, T. Asai, H. Asoh, J. Babic, R. Surampudi Bapi, A. Kardec Barros, J. Cao, H. Cateau, J-Y. Chang, S-B. Cho, S. Choi, I.F. Chung, A.S. Cichocki, M. Diesmann, K. Doya, P. Erdi, H. Fujii, N. Fukumura, W-k. Fung, T. Furuhashi, A. Garcez, T.D. Gedeon, S. Gruen, K. Hagiwara, M. Hagiwara, K. Hamaguchi, R.P. Hasegawa, H. Hikawa, Y. Hirai, K. Horio, K. Ikeda, F. Ishida, S. Ishii, M. Ishikawa, A. Iwata, K. Iwata, H. Kadone, Y. Kamitani, N. Kasabov, M. Kawamoto, C. Kim, E. Kim, K-J. Kim, S. Kimura, A. Koenig, Y. Koike, T. Kondo, S. Koyama, J.L. Krichmar, H. Kudo, T. Kurita, S. Kurogi, M. Lee, J. Liu, B-L. Lu, N. Masuda, N. Matsumoto, B. McKay, K. Meier, H. Miyamoto, Y. Miyawaki, H. Mochiyama, C. Molter, T. Morie, K. Morita, M. Morita, Y. Morita, N. Murata, H. Nakahara, Y. Nakamura, S. Nakauchi, K. Nakayama, K. Niki, J. Nishii, I. Nishikawa, S. Oba, T. Ogata, S-Y. Oh, N. Ohnishi, M. Okada, H. Okamoto, T. Omori, T. Omori, R. Osu, N.R. Pal, P.S. Pang, G-T. Park, J. Peters, S. Phillips, Y. Sakaguchi, K. Sakai, Y. Sakai, Y. Sakumura, K. Samejima, M. Sato, N. Sato, R. Setiono, T. Shibata, H. Shouno, M. Small, M. Sugiyama, I. Hong Suh, J. Suzuki, T. Takenouchi, Y. Tanaka, I. Tetsunari, N. Ueda, S. Usui, Y. Wada, H. Wagatsuma, L. Wang, K. Watanabe, J. Wu, Q. Xiao, Y. Yamaguchi, K. Yamauchi, Z. Yi, J. Yoshimoto, B.M. Yu, B-T. Zhang, L. Zhang, L. Zhang

Sponsoring Institutions Asia Pacific Neural Network Assembly (APNNA) Japanese Neural Network Society (JNNS) 21st Century COE Program, Kyushu Institute of Technology

Cosponsors RIKEN Brain Science Institute Advanced Telecommunications Research Institute International (ATR) Japan Society for Fuzzy Theory and Intelligent Informatics (SOFT) IEEE CIS Japan Chapter Fuzzy Logic Systems Institute (FLSI)


Table of Contents – Part I

Computational Neuroscience

A Retinal Circuit Model Accounting for Functions of Amacrine Cells .... 1
Murat Saglam, Yuki Hayashida, and Nobuki Murayama

Global Bifurcation Analysis of a Pyramidal Cell Model of the Primary Visual Cortex: Towards a Construction of Physiologically Plausible Model .... 7
Tatsuya Ishiki, Satoshi Tanaka, Makoto Osanai, Shinji Doi, Sadatoshi Kumagai, and Tetsuya Yagi

Representation of Medial Axis from Synchronous Firing of Border-Ownership Selective Cells .... 18
Yasuhiro Hatori and Ko Sakai

Neural Mechanism for Extracting Object Features Critical for Visual Categorization Task .... 27
Mitsuya Soga and Yoshiki Kashimori

An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion .... 37
Jordan H. Boyle, John Bryden, and Netta Cohen

Applying the String Method to Extract Bursting Information from Microelectrode Recordings in Subthalamic Nucleus and Substantia Nigra .... 48
Pei-Kuang Chao, Hsiao-Lung Chan, Tony Wu, Ming-An Lin, and Shih-Tseng Lee

Population Coding of Song Element Sequence in the Songbird Brain Nucleus HVC .... 54
Jun Nishikawa, Masato Okada, and Kazuo Okanoya

Spontaneous Voltage Transients in Mammalian Retinal Ganglion Cells Dissociated by Vibration .... 64
Tamami Motomura, Yuki Hayashida, and Nobuki Murayama

Region-Based Encoding Method Using Multi-dimensional Gaussians for Networks of Spiking Neurons .... 73
Lakshmi Narayana Panuku and C. Chandra Sekhar

Firing Pattern Estimation of Biological Neuron Models by Adaptive Observer .... 83
Kouichi Mitsunaga, Yusuke Totoki, and Takami Matsuo


Thouless-Anderson-Palmer Equation for Associative Memory Neural Network Models with Fluctuating Couplings .... 93
Akihisa Ichiki and Masatoshi Shiino

Spike-Timing Dependent Plasticity in Recurrently Connected Networks with Fixed External Inputs .... 102
Matthieu Gilson, David B. Grayden, J. Leo van Hemmen, Doreen A. Thomas, and Anthony N. Burkitt

A Comparative Study of Synchrony Measures for the Early Detection of Alzheimer's Disease Based on EEG .... 112
Justin Dauwels, François Vialatte, and Andrzej Cichocki

Reproducibility Analysis of Event-Related fMRI Experiments Using Laguerre Polynomials .... 126
Hong-Ren Su, Michelle Liou, Philip E. Cheng, John A.D. Aston, and Shang-Hong Lai

The Effects of Theta Burst Transcranial Magnetic Stimulation over the Human Primary Motor and Sensory Cortices on Cortico-Muscular Coherence .... 135
Murat Saglam, Kaoru Matsunaga, Yuki Hayashida, Nobuki Murayama, and Ryoji Nakanishi

Interactions between Spike-Timing-Dependent Plasticity and Phase Response Curve Lead to Wireless Clustering .... 142
Hideyuki Câteau, Katsunori Kitano, and Tomoki Fukai

A Computational Model of Formation of Grid Field and Theta Phase Precession in the Entorhinal Cells .... 151
Yoko Yamaguchi, Colin Molter, Wu Zhihua, Harshavardhan A. Agashe, and Hiroaki Wagatsuma

Working Memory Dynamics in a Flip-Flop Oscillations Network Model with Milnor Attractor .... 160
David Colliaux, Yoko Yamaguchi, Colin Molter, and Hiroaki Wagatsuma

Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization of Quasi-Attractors .... 170
Hiroshi Fujii, Kazuyuki Aihara, and Ichiro Tsuda

Tracking a Moving Target Using Chaotic Dynamics in a Recurrent Neural Network Model .... 179
Yongtao Li and Shigetoshi Nara

A Generalised Entropy Based Associative Model .... 189
Masahiro Nakagawa

The Detection of an Approaching Sound Source Using Pulsed Neural Network .... 199
Kaname Iwasa, Takeshi Fujisumi, Mauricio Kugler, Susumu Kuroyanagi, Akira Iwata, Mikio Danno, and Masahiro Miyaji

Sensitivity and Uniformity in Detecting Motion Artifacts .... 209
Wen-Chuang Chou, Michelle Liou, and Hong-Ren Su

A Ring Model for the Development of Simple Cells in the Visual Cortex .... 219
Takashi Hamada and Kazuhiro Okada

Learning and Memory

Practical Recurrent Learning (PRL) in the Discrete Time Domain .... 228
Mohamad Faizal Bin Samsudin, Takeshi Hirose, and Katsunari Shibata

Learning of Bayesian Discriminant Functions by a Layered Neural Network .... 238
Yoshifusa Ito, Cidambi Srinivasan, and Hiroyuki Izumi

RNN with a Recurrent Output Layer for Learning of Naturalness .... 248
Ján Dolinský and Hideyuki Takagi

Using Generalization Error Bounds to Train the Set Covering Machine .... 258
Zakria Hussain and John Shawe-Taylor

Model of Cue Extraction from Distractors by Active Recall .... 269
Adam Ponzi

PLS Mixture Model for Online Dimension Reduction .... 279
Jiro Hayami and Koichiro Yamauchi

Analysis on Bidirectional Associative Memories with Multiplicative Weight Noise .... 289
Chi Sing Leung, Pui Fai Sum, and Tien-Tsin Wong

Fuzzy ARTMAP with Explicit and Implicit Weights .... 299
Takeshi Kamio, Kenji Mori, Kunihiko Mitsubori, Chang-Jun Ahn, Hisato Fujisaka, and Kazuhisa Haeiwa

Neural Network Model of Forward Shift of CA1 Place Fields Towards Reward Location .... 309
Adam Ponzi


Neural Network Models

A New Constructive Algorithm for Designing and Training Artificial Neural Networks .... 317
Md. Abdus Sattar, Md. Monirul Islam, and Kazuyuki Murase

Effective Learning with Heterogeneous Neural Networks .... 328
Lluís A. Belanche-Muñoz

Pattern-Based Reasoning System Using Self-incremental Neural Network for Propositional Logic .... 338
Akihito Sudo, Manabu Tsuboyama, Chenli Zhang, Akihiro Sato, and Osamu Hasegawa

Effect of Spatial Attention in Early Vision for the Modulation of the Perception of Border-Ownership .... 348
Nobuhiko Wagatsuma, Ryohei Shimizu, and Ko Sakai

Effectiveness of Scale Free Network to the Performance Improvement of a Morphological Associative Memory without a Kernel Image .... 358
Takashi Saeki and Tsutomu Miki

Intensity Gradient Self-organizing Map for Cerebral Cortex Reconstruction .... 365
Cheng-Hung Chuang, Jiun-Wei Liou, Philip E. Cheng, Michelle Liou, and Cheng-Yuan Liou

Feature Subset Selection Using Constructive Neural Nets with Minimal Computation by Measuring Contribution .... 374
Md. Monirul Kabir, Md. Shahjahan, and Kazuyuki Murase

Dynamic Link Matching between Feature Columns for Different Scale and Orientation .... 385
Yasuomi D. Sato, Christian Wolff, Philipp Wolfrum, and Christoph von der Malsburg

Perturbational Neural Networks for Incremental Learning in Virtual Learning System .... 395
Eiichi Inohira, Hiromasa Oonishi, and Hirokazu Yokoi

Bifurcations of Renormalization Dynamics in Self-organizing Neural Networks .... 405
Peter Tiňo

Variable Selection for Multivariate Time Series Prediction with Neural Networks .... 415
Min Han and Ru Wei


Ordering Process of Self-Organizing Maps Improved by Asymmetric Neighborhood Function .... 426
Takaaki Aoki, Kaiichiro Ota, Koji Kurata, and Toshio Aoyagi

A Characterization of Simple Recurrent Neural Networks with Two Hidden Units as a Language Recognizer .... 436
Azusa Iwata, Yoshihisa Shinozawa, and Akito Sakurai

Supervised/Unsupervised/Reinforcement Learning

Unbiased Likelihood Backpropagation Learning .... 446
Masashi Sekino and Katsumi Nitta

The Local True Weight Decay Recursive Least Square Algorithm .... 456
Chi Sing Leung, Kwok-Wo Wong, and Yong Xu

Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift .... 466
Keisuke Yamazaki and Sumio Watanabe

Using Image Stimuli to Drive fMRI Analysis .... 477
David R. Hardoon, Janaina Mourão-Miranda, Michael Brammer, and John Shawe-Taylor

Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin .... 487
Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima

Convergence Behavior of Competitive Repetition-Suppression Clustering .... 497
Davide Bacciu and Antonina Starita

Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class .... 507
Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma

An Automatic Speaker Recognition System .... 517
P. Chakraborty, F. Ahmed, Md. Monirul Kabir, Md. Shahjahan, and Kazuyuki Murase

Modified Modulated Hebb-Oja Learning Rule: A Method for Biologically Plausible Principal Component Analysis .... 527
Marko Jankovic, Pablo Martinez, Zhe Chen, and Andrzej Cichocki

Statistical Learning Algorithms

Orthogonal Shrinkage Methods for Nonparametric Regression under Gaussian Noise .... 537
Katsuyuki Hagiwara


A Subspace Method Based on Data Generation Model with Class Information .... 547
Minkook Cho, Dongwoo Yoon, and Hyeyoung Park

Hierarchical Feature Extraction for Compact Representation and Classification of Datasets .... 556
Markus Schubert and Jens Kohlmorgen

Principal Component Analysis for Sparse High-Dimensional Data .... 566
Tapani Raiko, Alexander Ilin, and Juha Karhunen

Hierarchical Bayesian Inference of Brain Activity .... 576
Masa-aki Sato and Taku Yoshioka

Neural Decoding of Movements: From Linear to Nonlinear Trajectory Models .... 586
Byron M. Yu, John P. Cunningham, Krishna V. Shenoy, and Maneesh Sahani

Estimating Internal Variables of a Decision Maker's Brain: A Model-Based Approach for Neuroscience .... 596
Kazuyuki Samejima and Kenji Doya

Visual Tracking Achieved by Adaptive Sampling from Hierarchical and Parallel Predictions .... 604
Tomohiro Shibata, Takashi Bando, and Shin Ishii

Bayesian System Identification of Molecular Cascades .... 614
Junichiro Yoshimoto and Kenji Doya

Use of Circle-Segments as a Data Visualization Technique for Feature Selection in Pattern Classification .... 625
Shir Li Wang, Chen Change Loy, Chee Peng Lim, Weng Kin Lai, and Kay Sin Tan

Extraction of Approximate Independent Components from Large Natural Scenes .... 635
Yoshitatsu Matsuda and Kazunori Yamaguchi

Local Coordinates Alignment and Its Linearization .... 643
Tianhao Zhang, Xuelong Li, Dacheng Tao, and Jie Yang

Walking Appearance Manifolds without Falling Off .... 653
Nils Einecke, Julian Eggert, Sven Hellbach, and Edgar Körner

Inverse-Halftoning for Error Diffusion Based on Statistical Mechanics of the Spin System .... 663
Yohei Saika


Optimization Algorithms

Chaotic Motif Sampler for Motif Discovery Using Statistical Values of Spike Time-Series .... 673
Takafumi Matsuura and Tohru Ikeguchi

A Thermodynamical Search Algorithm for Feature Subset Selection .... 683
Félix F. González and Lluís A. Belanche

Solvable Performances of Optimization Neural Networks with Chaotic Noise and Stochastic Noise with Negative Autocorrelation .... 693
Mikio Hasegawa and Ken Umeno

Solving the k-Winners-Take-All Problem and the Oligopoly Cournot-Nash Equilibrium Problem Using the General Projection Neural Networks .... 703
Xiaolin Hu and Jun Wang

Optimization of Parametric Companding Function for an Efficient Coding .... 713
Shin-ichi Maeda and Shin Ishii

A Modified Soft-Shape-Context ICP Registration System of 3-D Point Data .... 723
Jiann-Der Lee, Chung-Hsien Huang, Li-Chang Liu, Shih-Sen Hsieh, Shuen-Ping Wang, and Shin-Tseng Lee

Solution Method Using Correlated Noise for TSP .... 733
Atsuko Goto and Masaki Kawamura

Novel Algorithms

Bayesian Collaborative Predictors for General User Modeling Tasks .... 742
Jun-ichiro Hirayama, Masashi Nakatomi, Takashi Takenouchi, and Shin Ishii

Discovery of Linear Non-Gaussian Acyclic Models in the Presence of Latent Classes .... 752
Shohei Shimizu and Aapo Hyvärinen

Efficient Incremental Learning Using Self-Organizing Neural Grove .... 762
Hirotaka Inoue and Hiroyuki Narihisa

Design of an Unsupervised Weight Parameter Estimation Method in Ensemble Learning .... 771
Masato Uchida, Yousuke Maehara, and Hiroyuki Shioya

Sparse Super Symmetric Tensor Factorization .... 781
Andrzej Cichocki, Marko Jankovic, Rafal Zdunek, and Shun-ichi Amari


Probabilistic Tensor Analysis with Akaike and Bayesian Information Criteria .... 791
Dacheng Tao, Jimeng Sun, Xindong Wu, Xuelong Li, Jialie Shen, Stephen J. Maybank, and Christos Faloutsos

Decomposing EEG Data into Space-Time-Frequency Components Using Parallel Factor Analysis and Its Relation with Cerebral Blood Flow .... 802
Fumikazu Miwakeichi, Pedro A. Valdes-Sosa, Eduardo Aubert-Vazquez, Jorge Bosch Bayard, Jobu Watanabe, Hiroaki Mizuhara, and Yoko Yamaguchi

Flexible Component Analysis for Sparse, Smooth, Nonnegative Coding or Representation .... 811
Andrzej Cichocki, Anh Huy Phan, Rafal Zdunek, and Li-Qing Zhang

Appearance Models for Medical Volumes with Few Samples by Generalized 3D-PCA .... 821
Rui Xu and Yen-Wei Chen

Head Pose Estimation Based on Tensor Factorization .... 831
Wenlu Yang, Liqing Zhang, and Wenjun Zhu

Kernel Maximum a Posteriori Classification with Error Bound Analysis .... 841
Zenglin Xu, Kaizhu Huang, Jianke Zhu, Irwin King, and Michael R. Lyu

Comparison of Local Higher-Order Moment Kernel and Conventional Kernels in SVM for Texture Classification .... 851
Keisuke Kameyama

Pattern Discovery for High-Dimensional Binary Datasets .... 861
Václav Snášel, Pavel Moravec, Dušan Húsek, Alexander Frolov, Hana Řezanková, and Pavel Polyakov

Expand-and-Reduce Algorithm of Particle Swarm Optimization .... 873
Eiji Miyagawa and Toshimichi Saito

Nonlinear Pattern Identification by Multi-layered GMDH-Type Neural Network Self-selecting Optimum Neural Network Architecture .... 882
Tadashi Kondo

Motor Control and Vision

Coordinated Control of Reaching and Grasping During Prehension Movement .... 892
Masazumi Katayama and Hirokazu Katayama

Computer Simulation of Vestibuloocular Reflex Motor Learning Using a Realistic Cerebellar Cortical Neuronal Network Model .... 902
Kayichiro Inagaki, Yutaka Hirata, Pablo M. Blazquez, and Stephen M. Highstein

Reflex Contributions to the Directional Tuning of Arm Stiffness .... 913
Gary Liaw, David W. Franklin, Etienne Burdet, Abdelhamid Kadi-allah, and Mitsuo Kawato

Analysis of Variability of Human Reaching Movements Based on the Similarity Preservation of Arm Trajectories .... 923
Takashi Oyama, Yoji Uno, and Shigeyuki Hosoe

Directional Properties of Human Hand Force Perception in the Maintenance of Arm Posture .... 933
Yoshiyuki Tanaka and Toshio Tsuji

Computational Understanding and Modeling of Filling-In Process at the Blind Spot .... 943
Shunji Satoh and Shiro Usui

Biologically Motivated Face Selective Attention Model .... 953
Woong-Jae Won, Young-Min Jang, Sang-Woo Ban, and Minho Lee

Multi-dimensional Histogram-Based Image Segmentation .... 963
Daniel Weiler and Julian Eggert

A Framework for Multi-view Gender Classification .... 973
Jing Li and Bao-Liang Lu

Japanese Hand Sign Recognition System .... 983
Hirotada Fujimura, Yuuichi Sakai, and Hiroomi Hikawa

An Image Warping Method for Temporal Subtraction Images Employing Smoothing of Shift Vectors on MDCT Images .... 993
Yoshinori Itai, Hyoungseop Kim, Seiji Ishikawa, Shigehiko Katsuragawa, Takayuki Ishida, Ikuo Kawashita, Kazuo Awai, and Kunio Doi

Conflicting Visual and Proprioceptive Reflex Responses During Reaching Movements .... 1002
David W. Franklin, Udell So, Rieko Osu, and Mitsuo Kawato

An Involuntary Muscular Response Induced by Perceived Visual Errors in Hand Position .... 1012
David W. Franklin, Udell So, Rieko Osu, and Mitsuo Kawato

Independence of Perception and Action for Grasping Positions .... 1021
Takahiro Fujita, Yoshinobu Maeda, and Masazumi Katayama

XX

Table of Contents – Part I

Handwritten Character Distinction Method Inspired by Human Vision Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1031 Jumpei Koyama, Masahiro Kato, and Akira Hirose Recent Advances in the Neocognitron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1041 Kunihiko Fukushima Engineering-Approach Accelerates Computational Understanding of V1–V2 Neural Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1051 Shunji Satoh and Shiro Usui Recent Studies Around the Neocognitron . . . . . . . . . . . . . . . . . . . . . . . . . . . 1061 Hayaru Shouno Toward Human Arm Attention and Recognition . . . . . . . . . . . . . . . . . . . . . 1071 Takeharu Yoshizuka, Masaki Shimizu, and Hiroyuki Miyamoto Projection-Field-Type VLSI Convolutional Neural Networks Using Merged/Mixed Analog-Digital Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081 Osamu Nomura and Takashi Morie Optimality of Reaching Movements Based on Energetic Cost under the Inﬂuence of Signal-Dependent Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1091 Yoshiaki Taniai and Jun Nishii Inﬂuence of Neural Delay in Sensorimotor Systems on the Control Performance and Mechanism in Bicycle Riding . . . . . . . . . . . . . . . . . . . . . . . 1100 Yusuke Azuma and Akira Hirose Global Localization for the Mobile Robot Based on Natural Number Recognition in Corridor Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1110 Su-Yong An, Jeong-Gwan Kang, Se-Young Oh, and Doo San Baek A System Model for Real-Time Sensorimotor Processing in Brain . . . . . . 1120 Yutaka Sakaguchi Perception of Two-Stroke Apparent Motion and Real Motion . . . . . . . . . . 1130 Qi Zhang and Ken Mogi Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1141

A Retinal Circuit Model Accounting for Functions of Amacrine Cells Murat Saglam, Yuki Hayashida, and Nobuki Murayama Graduate School of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Kumamoto 860-8555, Japan [email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp

Abstract. Previous experimental studies on vertebrates have shown that high-level processes of vision, such as object segregation and spatio-temporal pattern adaptation, begin at the retinal stage. In those visual functions, diverse subtypes of amacrine cells are believed to play essential roles by processing the excitatory and inhibitory signals laterally over a wide region of the retina to shape the ganglion cell responses. Previously, a simple "linear-nonlinear" model was proposed to explain a specific function of the retina and could capture the spiking behavior of the retinal output, although each class of the retinal neurons was largely omitted from it. Here, we present a spatio-temporal computational model based on the response function of each class of retinal neurons and the anatomical intercellular connections. This model not only reproduces the filtering properties of the outer retina but also realizes high-order inner-retinal functions such as the object segregation mechanism mediated by wide-field amacrine cells.
Keywords: Retina, Amacrine Cells, Model, Visual Function.

1 Introduction
The vertebrate retina is far more than a passive visual receptor. It has been reported that many high-level vision tasks, although generally believed to be performed in the visual cortices of the brain, begin in the retinal circuits [1]. One important task among these is the discrimination of the actual motion of an object from the global motion across the retina. Even for a perfectly stationary scene, eye movements cause retinal image drifts, so the retinal circuit never receives a stationary global input at the background [2]. To handle this problem, retinal circuits are able to distinguish object motions better when their patterns differ from the background motion. The synaptic configuration of diverse types of retinal cells plays an essential role in this function. It was reported that wide-field polyaxonal amacrine cells can drive an inhibitory process between the surround and the object regions (receptive field) on the retina [1, 2, 4, 5, 6]. Those wide-field amacrine cells are known to use inhibitory neurotransmitters such as glycine or GABA [7, 8]. A previous study reported that glycine-mediated wide-field inhibition exists in the salamander retina and proposed a simple "linear-nonlinear" model, consisting of a temporal filter and a threshold function [2]. However, that model does not include the details of any retinal neurons
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1–6, 2008. © Springer-Verlag Berlin Heidelberg 2008


accounting for that inhibitory mechanism, although it is capable of predicting the spiking behavior of the retinal output for certain input patterns. On the other hand, temporal models that include the behavior of each class of retinal neurons exist in the literature [9, 10]. Even though those models provide high temporal resolution, they lack the spatial dimension of retinal processing. Here, we present a spatio-temporal computational model that realizes wide-field inhibition between the object and surround regions via wide-field transient on/off amacrine cells. The model considers the responses of all major retinal neurons in detail.
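The "linear-nonlinear" model discussed above is simple enough to sketch in a few lines: a causal temporal filter applied to the stimulus, followed by a static threshold nonlinearity. The biphasic filter shape and the threshold value below are illustrative assumptions, not the parameters fitted in [2]:

```python
import numpy as np

def ln_model(stimulus, kernel, threshold=0.0):
    """Linear-nonlinear cascade: causal temporal filtering of the
    stimulus followed by a static threshold (half-wave rectification)."""
    drive = np.convolve(stimulus, kernel, mode="full")[: len(stimulus)]
    return np.maximum(drive - threshold, 0.0)  # firing-rate proxy

# Illustrative biphasic temporal filter at 1 ms resolution (not the
# filter fitted to salamander data in [2]).
t = np.arange(0.0, 0.3, 0.001)
kernel = t * np.exp(-t / 0.02) - 0.5 * t * np.exp(-t / 0.04)

stimulus = np.zeros(1000)
stimulus[200:350] = 1.0            # 150 ms flash starting at 200 ms
rate = ln_model(stimulus, kernel, threshold=0.01)
```

Because the filter is biphasic, the output is transient at stimulus onset even though the flash itself is sustained — the property the full circuit model below reproduces with explicit cell classes instead of a single filter.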

2 Retinal Model
The retina model parcels the stimulating input into a line of spatio-temporal computational units. Each unit consists of the main retinal elements that convey information forward (photoreceptors, bipolar and ganglion cells) and laterally (horizontal and amacrine cells). Figure 1 illustrates the organization and the synaptic connections of the retinal neurons.

Fig. 1. Three computational units of the model are depicted. Each unit includes: PR: Photoreceptor, HC: Horizontal Cell, onBC: On Bipolar Cell, offBC: Off Bipolar Cell, onAC: Sustained On Amacrine Cell, offAC: Sustained Off Amacrine Cell, on/offAC: Fast-transient Wide-field On/Off Amacrine Cell, GC: On/Off Ganglion Cell. Excitatory/Inhibitory synaptic connections are represented with black/white arrowheads, respectively. Gap junctions within neighboring HCs are indicated by dotted horizontal lines. Wide-field connections are realized between wide-field on/offAC and GCs, double-s-symbols point to distant connections within those neurons.


Each neuron's membrane dynamics is governed by a differential equation (eq. (1)), adapted from push-pull shunting models of retinal neurons [9]:

\[
\frac{dv_c(t)}{dt} = -A\,v_c(t) + [B - v_c(t)]\,e(t) - [D + v_c(t)]\,i(t) + \sum_{k=1}^{n} W_{ck}\,v_k(t) \tag{1}
\]

Here v_c(t) stands for the membrane potential of the neuron of interest. A represents the rate of passive membrane decay toward the resting potential in the dark. B and D are the saturation levels for the excitatory input e(t) and the inhibitory input i(t), respectively. Those excitatory/inhibitory inputs correspond to the synaptic connections (solid lines in Fig. 1) from other neurons in a computational unit. v_k(t) is the membrane potential of a different neuron, belonging to another unit, that makes a synapse or gap junction onto the neuron of interest, and the efficiency of that link is determined by a weight parameter W_ck. In the current model, spatial connectivity is present within horizontal cells as gap junctions (dashed lines in Fig. 1) and between on/off amacrine cells and ganglion cells as a wide-field inhibitory process (thin solid lines in Fig. 1). For the other neurons, W_ck is fixed to zero, since we ignore lateral spatial connections among them. A compressive nonlinearity (eq. (2)) is cascaded before the photoreceptor input stage in order to account for the limited dynamic range of the neural elements. Therefore the photoreceptor is fed by a hyperpolarizing input r(t) representing the compressed form of the light intensity f(t).

\[
r(t) = G \left( \frac{f(t)}{f(t) + I} \right)^{n} \tag{2}
\]

Here G denotes the saturation level of the hyperpolarizing input to the photoreceptor, I represents the light intensity yielding the half-maximum response, and n is a real constant. Although the ganglion cell receptive field size varies among animals, we set each unit to correspond to 500 μm, in good accordance with experiments on the salamander [2]. 32 computational units are interconnected in a line, and the W_ck values are determined as a function of the distance between computational units. The parameter set given in [9] was calibrated to reproduce the temporal dynamics of all neuron classes, and the spatial parameters of the model were selected to match the spatial ganglion cell response profile given in [2]. All differential equations in the model are solved sequentially using the fixed-step (1 ms) Bogacki-Shampine solver of the MATLAB/SIMULINK software package (The MathWorks, Inc., Natick, MA).
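As a concrete, deliberately simplified sketch, a single forward-Euler update of eq. (1), the compressive nonlinearity of eq. (2), and one plausible distance-dependent choice of W_ck can be written as below. All numerical constants — and the Gaussian fall-off of the weights — are our assumptions, not the paper's calibrated values (the paper only states that W_ck is a function of inter-unit distance):

```python
import numpy as np

def shunting_step(v_c, e, i, v_k, w_ck, dt=0.001, A=1.0, B=1.0, D=1.0):
    """One forward-Euler step of the push-pull shunting equation (1)
    for a single neuron; v_k/w_ck list the coupled potentials and
    their weights. A, B, D are placeholder constants."""
    dv = (-A * v_c + (B - v_c) * e - (D + v_c) * i
          + sum(w * v for w, v in zip(w_ck, v_k)))
    return v_c + dt * dv

def compress(f, G=1.0, I=0.5, n=1.0):
    """Compressive nonlinearity of eq. (2) feeding the photoreceptor."""
    return G * (f / (f + I)) ** n

def distance_weights(n_units=32, sigma=2.0, w0=0.1):
    """Illustrative distance-dependent coupling: W_ck falls off as a
    Gaussian of the unit separation (an assumption, not the paper's
    actual spatial profile)."""
    idx = np.arange(n_units)
    d = np.abs(idx[:, None] - idx[None, :])
    W = w0 * np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-connection
    return W
```

With a constant excitatory input and no coupling, eq. (1) relaxes to the bounded steady state v* = Be/(A + e); this saturation is exactly what the shunting terms with B and D are there to enforce.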

3 Results
First we confirmed that the responses of each neuron agree with physiological observations [9]. Figure 2 illustrates the responses of all neurons to a 150 ms-long flash of light stimulating the whole model retina. At the outer retina, the photoreceptor responds with a transient hyperpolarization followed by a less steep plateau, and returns to the resting potential with a small overshoot. The horizontal cell response is essentially a smoothed form of the photoreceptor response, owing to its low-pass filtering property. On- and off-bipolar cells are depolarized at the onset and offset of the flash, respectively. Since


Fig. 2. Responses of retinal neurons (labeled as in Fig. 1) to a 150 ms-long full-field light flash. Dashed horizontal lines indicate the dark responses (resting potentials) of each neuron. Note that the timings of the on/off responses of the wide-field ACs and the GC spike generating potentials (GC gen. pot.) match each other. This phenomenon drives the wide-field inhibition.

Fig. 3. GC spike generating potential responses (top row) of the center unit in the incoherent (left column) and coherent (right column) stimulation cases. Stimulation timing and position are depicted on the x and y axes, respectively (bottom row; white bars indicate the light onset). Under the coherent stimulation condition, off responses are significantly inhibited and on responses disappear entirely.

those cells form a negative feedback loop with the sustained on- and off-amacrine cells, their responses are more transient than those of photoreceptors and horizontal cells, as expected. Eventually, bipolar cells transmit excitatory inputs, and wide-field transient on/off amacrine cells


convey inhibitory inputs to ganglion cells. Significant inhibition at the ganglion cell level only happens when the wide-field amacrine cell signal coincides with the excitatory input. Figure 3 demonstrates how the inhibitory process differs when the peripheral (surround) and object regions are stimulated coherently or incoherently. In both cases the object region spans 3 units (750 μm radius) and is stimulated identically. When the surround is stimulated incoherently, the depolarized peaks of the wide-field amacrine cells do not coincide with the ganglion cell peaks, so that spike generating potentials are evident. However, when the surround region is stimulated coherently, inhibition from amacrine cells cancels out large portions of the ganglion cell depolarizations. This leads to maximum inhibition of the spike generating potentials of ganglion cells (Fig. 3, right column).
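The coincidence dependence of this inhibition can be illustrated with a toy calculation: a rectified "generator potential" survives when the amacrine inhibition arrives out of phase, and is cancelled when it coincides with the excitation. The Gaussian transients below are illustrative waveforms, not model output:

```python
import numpy as np

t = np.arange(0.0, 1.0, 0.001)

def transient(onset, width=0.05):
    """Illustrative Gaussian depolarization transient."""
    return np.exp(-((t - onset) ** 2) / (2.0 * width ** 2))

excitation = transient(0.30)       # bipolar-driven GC input (object)
inh_coherent = transient(0.30)     # surround stimulated in phase
inh_incoherent = transient(0.55)   # surround stimulated out of phase

# Rectified generator potential: inhibition only bites if it coincides.
drive_coherent = np.maximum(excitation - inh_coherent, 0.0)
drive_incoherent = np.maximum(excitation - inh_incoherent, 0.0)
```

Coincident inhibition cancels the generator potential almost entirely, while the time-shifted inhibition leaves the excitatory peak essentially untouched — the same contrast shown between the two columns of Fig. 3.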

4 Discussion
In the current study we reproduced the basic mechanism of an important retinal task: discriminating a moving object from a moving background image. The coherent stimulation (Fig. 2) can be linked to the global motion of the retinal image that takes place when the eye moves. However, when there is a moving object in the

Fig. 4. Relative generating potential response of the GC as a function of object size. The dashed line represents the model response with the original parameter set. Triangle and square markers indicate data points for the 'wide-field AC blocked' and 'control' cases, respectively. The maximum GC response is observed when the object radius is 250 μm (1-unit stimulation, 2nd data point). For the sake of symmetry, the 3rd data point represents 3-unit stimulation (750 μm radius, as in Fig. 3); similarly, each interval after the 2nd data point corresponds to a 500 μm increment in the object radius. As the object starts to invade the surround region, the GC response decreases. When the weights of the interconnections among the wide-field on/off ACs are set to zero (STR application), the inhibition process is partially disabled (solid line).


scene, its image would be reflected on the receptive field as a stimulation pattern different from the global pattern (incoherent stimulation, Fig. 3). Experimental results revealed that blocking glycine-mediated inhibition by strychnine (STR) disables the wide-field process [2]; this glycinergic mechanism could therefore be attributed to wide-field amacrine cells [7]. In our model, STR application can be realized by setting the synaptic weight parameters between wide-field amacrine cells and ganglion cells to zero. Figure 4 demonstrates how STR application affects the ganglion cell response. As the object invades the background, the ganglion cell response is expected to be inhibited, as in the control case; STR, however, prevents this phenomenon from occurring. This behavior of the model is in very good agreement with the experimental results in [2]. Note that the model is flexible enough to fit another wide-field inhibitory process, such as a GABAergic mechanism [8]. Spike generation of the ganglion cells is not implemented in the current model, in order to highlight the role of wide-field amacrine cells only; a specific spike generator can be cascaded onto the model to reproduce spike responses and highlight further retinal features. Since the model covers the on/off pathways and all major retinal neurons, it can be flexibly adjusted to reproduce other functions. Although we reduced the retina to a line of spatio-temporal computational units, the model was able to reproduce a retinal mechanism. This reduction can be bypassed, and more precise results achieved, by creating a 2-D mesh of spatio-temporal computational units.

References
1. Masland, R.H.: Vision: The retina's fancy tricks. Nature 423(6938), 387–388 (2003)
2. Olveczky, B.P., Baccus, S.A., Meister, M.: Segregation of object and background motion in the retina. Nature 423(6938), 401–408 (2003)
3. Völgyi, B., Xin, D., Amarillo, Y., Bloomfield, S.A.: Morphology and physiology of the polyaxonal amacrine cells in the rabbit retina. J. Comp. Neurol. 440(1), 109–125 (2001)
4. Lin, B., Masland, R.H.: Populations of wide-field amacrine cells in the mouse retina. J. Comp. Neurol. 499(5), 797–809 (2006)
5. Solomon, S.G., Lee, B.B., Sun, H.: Suppressive surrounds and contrast gain in magnocellular pathway retinal ganglion cells of macaque. J. Neurosci. 26(34), 8715–8726 (2006)
6. van Wyk, M., Taylor, W.R., Vaney, D.: Local edge detectors: a substrate for fine spatial vision at low temporal frequencies in rabbit retina. J. Neurosci. 26(51), 250–263 (2006)
7. Hennig, M.H., Funke, K., Wörgötter, F.: The influence of different retinal subcircuits on the nonlinearity of ganglion cell behavior. J. Neurosci. 22(19), 8726–8738 (2002)
8. Lukasiewicz, P.D.: Synaptic mechanisms that shape visual signaling at the inner retina. Prog. Brain Res. 147, 205–218 (2005)
9. Thiel, A., Greschner, M., Ammermüller, J.: The temporal structure of transient ON/OFF ganglion cell responses and its relation to intra-retinal processing. J. Comput. Neurosci. 21(2), 131–151 (2006)
10. Gaudiano, P.: Simulations of X and Y retinal ganglion cell behavior with a nonlinear push-pull model of spatiotemporal retinal processing. Vision Res. 34(13), 1767–1784 (1994)

Global Bifurcation Analysis of a Pyramidal Cell Model of the Primary Visual Cortex: Towards a Construction of Physiologically Plausible Model Tatsuya Ishiki, Satoshi Tanaka, Makoto Osanai, Shinji Doi, Sadatoshi Kumagai, and Tetsuya Yagi Division of Electrical, Electronic and Information Engineering, Graduate School of Engineering, Osaka University, Yamada-Oka 2-1, Suita, Osaka, Japan [email protected]

Abstract. Many mathematical models of different neurons have been proposed so far; however, the way of modeling Ca2+ regulation mechanisms has not been established yet. Therefore, we try to construct a physiologically plausible model which contains many regulating systems of the intracellular Ca2+, such as Ca2+ buffering, the Na+/Ca2+ exchanger and the Ca2+ pump current. In this paper, we seek plausible values of the parameters by analyzing the global bifurcation structure of our tentative model.

1 Introduction

Complex information processing in the brain is regulated by the electrical activity of neurons. Neurons transmit electrical signals called action potentials to each other for information processing. The action potential takes the form of spiking or bursting and plays an important role in the information processing of the brain. In the visual system, visual signals from the retina are processed by neurons in the primary visual cortex. There are several types of neurons in the visual cortex, and pyramidal cells compose roughly 80% of the neurons of the cortex. Pyramidal cells are connected to each other and form a complex neuronal circuit. Previous physiological and anatomical studies [1] revealed the fundamental structure of the circuit. However, it is not completely understood how visual signals propagate and function in the neuronal circuit of the visual cortex. In order to investigate the neuronal circuit, not only physiological experiments but also simulations using a mathematical model of the neuron are necessary. Many mathematical models of neurons have been proposed so far [2]. Though there are various models of neurons, the way of modeling the regulating system of the intracellular calcium ions (Ca2+) has not been established yet. The regulating system of the intracellular Ca2+ is a very important element, because the intracellular Ca2+ plays crucial roles in cellular processes such as hormone and neurotransmitter release, gene transcription, and the regulation of synaptic plasticity. Therefore, it is important to establish the way of modeling the regulating system of the intracellular Ca2+.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 7–17, 2008. © Springer-Verlag Berlin Heidelberg 2008


In this paper, we try to construct a model of pyramidal cells based on previous physiological experimental data, focusing especially on the regulating systems of the intracellular Ca2+, such as Ca2+ buffering, the Na+/Ca2+ exchanger and the Ca2+ pump current. In order to estimate the values of the parameters which cannot be determined by physiological experiments alone, we analyze the global bifurcation structure based on the slow/fast decomposition of the model. Thus we demonstrate the usefulness of such nonlinear analyses not only for the analysis of an established model but also for the construction of a model.

2 Cell Model

The well-known Hodgkin-Huxley (HH) equations [3] describe the temporal variation of the membrane potential of neuronal cells. Though there are many neuron models based on the HH equations, the way of modeling the regulating system of the intracellular Ca2+ has not been established yet. Thus, we construct a pyramidal cell model using several sets of physiological experimental data [4]-[11]. The model includes a Ca2+ buffer, a Ca2+ pump, and the Na+/Ca2+ exchanger in order to describe the regulating system of the intracellular Ca2+ appropriately. The model also includes seven ionic currents through the ionic channels. The equations of the pyramidal cell model are as follows:

\begin{align}
-C\frac{dV}{dt} &= I_{\mathrm{total}} - I_{\mathrm{ext}}, \tag{1a}\\
\frac{dy}{dt} &= \frac{1}{\tau_y}\,(y_\infty - y), \quad (y = M1, \cdots, M6,\ H1, \cdots, H6), \tag{1b}\\
\frac{d[\mathrm{Ca}^{2+}]}{dt} &= \frac{-S \cdot I_{\mathrm{Catotal}}}{2F} + k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}], \tag{1c}\\
\frac{d[\mathrm{Buf}]}{dt} &= k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}], \tag{1d}\\
\frac{d[\mathrm{CaBuf}]}{dt} &= -\left(k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}]\right), \tag{1e}
\end{align}
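A minimal numerical sketch of eqs. (1b)–(1e) is given below. Note that eqs. (1d) and (1e) are exact opposites, so the total buffer concentration [Buf] + [CaBuf] is conserved. The step size and rate constants are illustrative placeholders, not the values of the paper's Appendix:

```python
def step_gate(y, y_inf, tau_y, dt=0.01):
    """Forward-Euler step of eq. (1b): first-order gating relaxation."""
    return y + dt * (y_inf - y) / tau_y

def step_calcium(ca, buf, cabuf, i_ca_total, dt=0.01,
                 S=1.0, F=96485.0, k_minus=1.0, k_plus=100.0):
    """Forward-Euler step of eqs. (1c)-(1e): Ca2+ influx plus the
    reversible buffering reaction. All constants are illustrative."""
    react = k_minus * cabuf - k_plus * ca * buf
    d_ca = -S * i_ca_total / (2.0 * F) + react
    # (1d) and (1e) carry opposite signs, conserving buf + cabuf.
    return ca + dt * d_ca, buf + dt * react, cabuf - dt * react
```

A quick numerical check confirms both qualitative properties: each gating variable relaxes to its steady-state value y_inf, and binding to the buffer lowers free [Ca2+] while leaving [Buf] + [CaBuf] unchanged.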

where V is the membrane potential, C is the membrane capacitance, Itotal is the sum of all currents through the ionic channels and Na+ /Ca2+ exchanger, and Iext is the current injected to the cell externally. The variable y denotes gating variables (M 1, · · · , M 6, H1, · · · , H6) of the ionic channels, y∞ is the steady state function of y, and τy is a time constant. [Ca2+ ] denotes the intracellular Ca2+ concentration, ICatotal is the sum of all Ca2+ ionic currents, S is the surface to volume ratio, and F is the Faraday constant. [Buf] and [CaBuf] are the concentrations of the unbound and the bound buﬀers, and k− and k+ are the reverse and forward rate constants of the binding reaction, respectively. Details of all equations and parameter values of this model can be found in Appendix. First we show the simulation results when a certain external stimulus current is injected to the cell model (1). Figure 1 shows an action potential waveform, and a change of [Ca2+ ] when a narrow external stimulus current (length 1ms,


Fig. 1. Waveforms of [A] the membrane potential and [B] [Ca2+] in the case of a 1 ms narrow pulse injection


Fig. 2. [A] A waveform of the membrane potential under a long pulse injection. [B] A typical waveform of the membrane potential of a pyramidal cell in physiological experiments [12].

density 40 μA/cm2) is injected at t = 4000 ms. The waveforms of the membrane potential and [Ca2+] do not differ much, qualitatively, from the physiological experimental data [1]. In contrast, as shown in Fig. 2A, when a long pulse (length 1000 ms, density 40 μA/cm2) is injected, the membrane potential stays at a resting state after one action potential is generated. Though the membrane potential spikes continuously in the physiological experiment (Fig. 2B), the membrane potential of the model does not show such behavior. In general, it is well known that the membrane potential of a pyramidal cell is at rest in the case of no external stimulus, and is spiking or bursting when an external stimulus is added. The aim of this paper is a reconstruction, or a parameter tuning, of the model (1) so that it can reproduce such behavior of the membrane potential.

3 Bifurcation Analysis

The characteristics of the membrane potential vary with the values of some parameters; we therefore investigate the bifurcation structure of the model (1) to estimate the values of these parameters. For the bifurcation analysis in this paper, we used the bifurcation analysis software AUTO [13].



Fig. 3. One-parameter bifurcation diagram on the parameter Iext . The solid curve denotes stable equilibria of eq. (1).

3.1 The External Stimulation Current Iext

We analyze the bifurcation structure of the model to understand why continuous spiking of the membrane potential is not generated when a long pulse is injected. In order to investigate whether spiking is generated when Iext is increased, we vary the external stimulation current Iext as a bifurcation parameter. We show the one-parameter bifurcation diagram on the parameter Iext (Fig. 3), in which the solid curve denotes the membrane potential at stable equilibria of eq. (1). The one-parameter bifurcation diagram shows the dependence of the membrane potential on the parameter Iext. There is no bifurcation point in Fig. 3. Therefore the stability of the equilibrium point does not change, and thus the membrane potential of the model stays at rest even if Iext is increased. This result means that we cannot reproduce the physiological experimental result of Fig. 2B by varying Iext; thus we have to reconsider those parameter values of the model which cannot be determined by physiological experiments alone.

3.2 The Maximum Conductance of the Ca2+-Dependent Potassium Channel GKCa

The current through the Ca2+-dependent potassium channel is involved in the generation of spiking or bursting of the membrane potential. Therefore, we select the maximum conductance of the Ca2+-dependent potassium channel, GKCa, as a bifurcation parameter, and show the one-parameter bifurcation diagram for Iext = 10 (Fig. 4A). There are two saddle-node bifurcation points (SN1, SN2), three Hopf bifurcation points (HB1-HB3) and two torus bifurcation points (TR1, TR2). An unstable periodic solution which bifurcates from HB1 changes its stability at the two torus bifurcation points and merges into the equilibrium point at HB3. Only in the range between HB1 and HB3 can the membrane potential oscillate. In order to investigate the dependence of this oscillatory range on the Iext value, we show the two-parameter bifurcation diagram (Fig. 4B), in which the horizontal and vertical axes denote Iext and GKCa, respectively. The two-parameter bifurcation diagram shows the loci where a specific bifurcation occurs.
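The stability bookkeeping that AUTO performs at such Hopf points can be mimicked crudely by hand: sweep the parameter, locate the equilibrium, and test the Jacobian sign conditions. The sketch below does this for a toy FitzHugh-Nagumo system standing in for the full model (1); the cubic dynamics and all constants are illustrative, not the pyramidal-cell equations:

```python
def fhn_rhs(v, w, I):
    """Toy FitzHugh-Nagumo system used as a stand-in for model (1)."""
    return v - v ** 3 / 3.0 - w + I, 0.08 * (v + 0.7 - 0.8 * w)

def equilibrium(I, v0=-1.0, iters=200):
    """Newton iteration for the equilibrium at parameter value I."""
    v = v0
    for _ in range(iters):
        w = (v + 0.7) / 0.8                  # w-nullcline
        residual = v - v ** 3 / 3.0 - w + I  # v-nullcline residual
        slope = 1.0 - v ** 2 - 1.0 / 0.8
        v -= residual / slope
    return v, (v + 0.7) / 0.8

def is_stable(v):
    """A 2-D equilibrium is stable iff the Jacobian has trace < 0 and
    determinant > 0 (i.e. eigenvalues with negative real parts)."""
    trace = (1.0 - v ** 2) - 0.08 * 0.8
    det = -(1.0 - v ** 2) * 0.08 * 0.8 + 0.08
    return trace < 0.0 and det > 0.0
```

Sweeping I and recording is_stable(...) flags the Hopf-type loss of stability of the toy system; for the pyramidal-cell model, the analogous sweep over Iext (Fig. 3) finds no such change, which is exactly the problem the parameter tuning addresses.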


Fig. 4. [A] One-parameter bifurcation diagram on the parameter GKCa . The solid and broken curves show stable and unstable equilibria, respectively. The symbols • and ◦ denote the maximum value of V of the stable and unstable periodic solutions, respectively. [B] Two-parameter bifurcation diagram in the (Iext , GKCa )-plane.

In the diagram, the gray colored area separated by the HB and SN bifurcation curves corresponds to the range between HB1 and HB3 in Fig. 4A where the periodic solutions appear. As Iext increases, the gray colored area shrinks gradually and disappears near Iext = 25. This result means that the membrane potential of the model cannot present any oscillations (spontaneous spiking) for large values of Iext, no matter how we change the value of the parameter GKCa.

3.3 The Maximum Pumping Rate of the Ca2+ Pump Apump

The Ca2+ pump plays an important role in the regulation of the intracellular Ca2+. We therefore also investigate the effect on the membrane potential of varying the value of Apump, the maximum pumping rate of the Ca2+ pump. Figure 5A is the one-parameter bifurcation diagram for Iext = 10. There are two Hopf bifurcation points (HB1, HB2) and four double-cycle bifurcation points (DC1-DC4). A stable periodic solution generated at HB1 changes its stability at the four double-cycle bifurcation points and merges into the equilibrium point at HB2. Similarly to the case of GKCa, we show the two-parameter bifurcation diagram in the plane of the two parameters Iext and Apump (Fig. 5B) in order to examine the dependence of the oscillatory range between HB1 and HB2 on Iext. In Fig. 5B, the gray colored area, where the membrane potential oscillates, shrinks and disappears as Iext increases. The result shows that the membrane potential of the model cannot present any oscillations (spontaneous spiking) for large values of Iext no matter how we change the value of Apump, similarly to the case of GKCa.

3.4 Slow/Fast Decomposition Analysis

In this section, in order to investigate the dynamics of our pyramidal cell model in more detail, we use the slow/fast decomposition analysis [14].


Fig. 5. [A] One-parameter bifurcation diagram on the parameter Apump. [B] Two-parameter bifurcation diagram in the (Iext, Apump)-plane.

A system with multiple time scales can generally be written as follows:

\begin{align}
\frac{dx}{dt} &= f(x, y), \quad x \in \mathbb{R}^n,\ y \in \mathbb{R}^m, \tag{2a}\\
\frac{dy}{dt} &= \varepsilon\, g(x, y), \quad 0 < \varepsilon \ll 1. \tag{2b}
\end{align}

Equation (2b) is called a slow subsystem since the value of y changes slowly, while equation (2a) is called a fast subsystem. The whole eq. (2) is called a full system. So-called slow/fast analysis divides the full system into the slow and fast subsystems. In the fast subsystem (2a), the slow variable y is treated as a constant or a parameter. The variable x changes much more quickly than y, and thus x is considered to stay close to an attractor (stable equilibrium point, limit cycle, etc.) of the fast subsystem for a fixed value of y. The variable y changes slowly, with a velocity determined by g(x, y), in which x is considered to lie in the neighborhood of that attractor. The attractor of the fast subsystem may change as y is varied; analyzing the dependence of the attractor on the parameter y is a bifurcation problem. Thus the slow/fast analysis reduces the analysis of the full system to a bifurcation problem of the fast subsystem with a slowly-varying bifurcation parameter. In the case of the pyramidal cell model (1), the slow/fast analysis can be made under the assumption that the intracellular Ca2+ concentration [Ca2+] changes more slowly than the other variables. Thus, we consider [Ca2+] as a bifurcation parameter and eq. (1c) as the slow subsystem, and all other equations, eqs. (1a,b,d,e), as the fast subsystem. We show the bifurcation diagram of the fast subsystem obtained by varying the value of [Ca2+] as a parameter (Fig. 6). The figure shows the stable and unstable equilibria of the fast subsystem with Iext = 0 (thick solid and broken curves, respectively), and the nullcline of the slow subsystem (thin curve). The intersection of the equilibrium curve of the fast subsystem with the nullcline of the slow subsystem is the equilibrium point of the full system. The stability of the full system is determined by whether the intersection point lies on the stable or the unstable branch of the equilibrium curve of the fast subsystem.
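The recipe above can be sketched in code for a toy fast subsystem dv/dt = v − v³/3 − w + I, with the slow variable w treated as the bifurcation parameter (a stand-in for [Ca2+]; the cubic system is illustrative, not the model (1)):

```python
import numpy as np

def fast_equilibria(w, I=0.0):
    """Equilibria of the toy fast subsystem dv/dt = v - v^3/3 - w + I,
    treating the slow variable w as the bifurcation parameter."""
    # v - v^3/3 - w + I = 0  <=>  v^3 - 3v + 3(w - I) = 0
    roots = np.roots([1.0, 0.0, -3.0, 3.0 * (w - I)])
    real = roots[np.abs(roots.imag) < 1e-9].real
    # A branch is stable iff d/dv (v - v^3/3) = 1 - v^2 < 0 there.
    return [(v, bool(1.0 - v ** 2 < 0.0)) for v in sorted(real)]
```

Sweeping w traces the equilibrium curve of the fast subsystem (the thick curves of Fig. 6): in the bistable range there are two stable outer branches separated by an unstable middle branch. Where the slow nullcline crosses a stable branch the full system rests; where it crosses an unstable branch the full system oscillates.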
Therefore, the full system is stable in the case of


Fig. 6. Bifurcation diagram of the fast subsystem (Iext = 0) with [Ca2+] as a bifurcation parameter, and the slow-nullcline of the slow subsystem

Fig. 6. In addition, when Iext is increased, the bifurcation diagram (equilibrium curve) of the fast subsystem shifts upward and the stability of the full system keeps stable and no oscillation appear, as will be shown in Fig. 7. By changing the parameter of the slow subsystem, the shape of the slownullcline changes and the intersection point is also shifted. First, we select some parameters of the slow subsystem. Because the Ca2+ pump is included only in the slow subsystem, we select the Apump and the dissociation constant Kpump which are both parameters contained in Ca2+ pump as the parameters of the slow subsystem. Second, we change the values of Apump and Kpump in order to change the shape of the nullcline. Figure 7A shows the slow-nullclines (thin solid or broken curves) with varying Apump and also the equilibria of the fast subsystem (thick solid and broken curves) with Iext = 0, 20 and 40. By the increase of Apump , the nullcline of the slow subsystem shifts upward, and the intersection point of the equilibrium curve of the fast subsystem (Iext = 0) with the slow-nullcline is then located at an unstable equilibrium. Therefore, the membrane potential of the full system is spiking when Iext = 0. Figure 7B is the similar diagram to Fig. 7A, where the value of Kpump is varied (Apump is varied in Fig. 7A). By the change of Kpump value, the shape of the slow-nullcline is not changed much, therefore the intersection point of the equilibrium curve of the fast subsystem with the nullcline keeps staying at stable equilibria and the full system remains stable at a resting state. Next, in Fig. 8, we show an example of spontaneous spiking induced by an increase of Apump (Apump = 20). The gray colored orbit in Fig. 8A is the projected trajectory of the oscillatory membrane potential and the waveform is shown in Fig. 8B. 
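The geometric argument above — locating the intersection of the fast-subsystem equilibrium curve with the slow nullcline, and watching the operating point move up the curve as Apump grows — can be sketched numerically with bisection. Both curves below are toy stand-ins: only the saturating pump form Apump·[Ca2+]/([Ca2+] + Kpump) echoes the model, while the curve shapes and constants are assumptions for illustration:

```python
def v_eq(ca):
    """Toy stand-in for the fast-subsystem equilibrium curve V([Ca2+]):
    the rest potential drops as the Ca-dependent K current grows."""
    return -50.0 - 20.0 * ca / (ca + 0.2)

def v_null(ca, a_pump, k_pump=0.4):
    """Toy stand-in for the slow nullcline d[Ca2+]/dt = 0; its height
    rises with the pump rate a_pump (saturating pump form as in the model)."""
    return -70.0 + a_pump * ca / (ca + k_pump)

def intersection(a_pump, lo=1e-6, hi=50.0, tol=1e-10):
    """Bisection on g(ca) = v_eq - v_null to locate the full-system
    equilibrium, where the equilibrium curve meets the nullcline."""
    g = lambda ca: v_eq(ca) - v_null(ca, a_pump)
    assert g(lo) > 0 > g(hi)  # a sign change brackets the intersection
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

for ap in (5.0, 20.0):
    ca = intersection(ap)
    print(f"Apump={ap}: Ca*={ca:.3f}, V*={v_eq(ca):.2f} mV")
```

With the larger pump rate the nullcline sits higher, so the intersection moves to smaller [Ca2+] and a higher (less negative) membrane potential, mirroring the shift toward the unstable branch described in the text.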
Because the equilibrium curve of the fast subsystem intersects the slow-nullcline at an unstable equilibrium, the membrane potential oscillates even though Iext = 0. The projected trajectory of the full system follows the stable equilibrium of the fast subsystem (the lower branch of the thick curve) for a long time, which prolongs the inter-spike interval. After the trajectory passes through the intersection point, the membrane potential generates a spike. As the trajectory passes through the intersection point, it winds around it. This winding is possibly caused by a complicated nonlinear

14

T. Ishiki et al.

[Figure 7: panels A and B plot V (mV) against [Ca2+] (mM), 0 to 0.001. Panel A legend: Iext = 0, 20, 40 with Apump = 5 (default), 50, 100, 150. Panel B legend: Iext = 0, 20, 40 with Kpump = 0.2, 0.32, 0.4 (default), 1.2, 2.0, 3.2.]
Fig. 7. Variation of the equilibria of the fast subsystem and the nullcline of the slow subsystem under changes of the slow-subsystem parameters: [A] Apump, [B] Kpump

[Figure 8: panel A plots V (mV) against [Ca2+] (mM), 0 to 0.0015; panel B plots V (mV) against t (ms), 20000 to 22000.]

Fig. 8. [A] An oscillatory trajectory of the full system (gray curve) with bifurcation diagram of the fast subsystem (Iext = 0, thick solid and broken curve) and the nullcline of the slow subsystem (Apump = 20, thin curve), [B] The oscillatory waveform of the membrane potential

dynamics [14], and produces the subthreshold oscillation of the membrane potential just before the spike in Fig. 8B. This subthreshold oscillation, however, is not observed in physiological experiments on pyramidal cells (Fig. 2B).

4 Conclusion

In this research, we constructed a model of pyramidal cells in the visual cortex focusing on Ca2+ regulation mechanisms, and analyzed the global bifurcation structure of the model in order to find physiologically plausible values of its parameters. We analyzed the global bifurcation structure using the maximum conductance of the Ca2+-dependent potassium channel (GKCa) and the maximum pumping rate of the Ca2+ pump (Apump) as bifurcation parameters. The two-parameter bifurcation diagrams showed that the range in which spontaneous spiking occurs shrinks as the external stimulation current Iext increases. Therefore, the membrane potential of the model cannot oscillate for large values of Iext even if the values of both GKCa and Apump are changed. We also investigated the effect of Apump and the dissociation constant Kpump on the nullcline of the slow subsystem based on the slow/fast decomposition analysis. If Apump is increased, the membrane potential spikes even when Iext = 0, because the nullcline shifts upward and the full system becomes unstable. When Kpump is varied, the membrane potential keeps a resting state because the full system remains stable. Unfortunately, the expected behavior was not obtained by changing the parameters considered in this paper. We have, however, demonstrated the usefulness of nonlinear analyses, such as the bifurcation and slow/fast analyses, for examining parameter values when constructing a physiological model. A more detailed study using other parameters, toward the construction of an appropriate model, remains as future work.

References

1. Osanai, M., Takeno, Y., Hasui, R., Yagi, T.: Electrophysiological and optical studies on the signal propagation in visual cortex slices. In: Proc. of 2005 Annu. Conf. of Jpn. Neural Network Soc., pp. 89–90 (2005)
2. Herz, A.V.M., Gollisch, T., Machens, C.K., Jaeger, D.: Modeling single-neuron dynamics and computations: a balance of detail and abstraction. Science 314, 80–85 (2006)
3. Hodgkin, A.L., Huxley, A.F.: A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.) 117, 500–544 (1952)
4. Brown, A.M., Schwindt, P.C., Crill, W.E.: Voltage dependence and activation kinetics of pharmacologically defined components of the high-threshold calcium current in rat neocortical neurons. J. Neurophysiol. 70, 1516–1529 (1993)
5. Peterson, B.Z., DeMaria, C.D., Yue, D.T.: Calmodulin is the Ca2+ sensor for Ca2+-dependent inactivation of L-type calcium channels. Neuron 22, 549–558 (1999)
6. Cummins, T.R., Xia, Y., Haddad, G.G.: Functional properties of rat and human neocortical voltage-sensitive sodium currents. J. Neurophysiol. 71, 1052–1064 (1994)
7. Korngreen, A., Sakmann, B.: Voltage-gated K+ channels in layer 5 neocortical pyramidal neurons from young rats: subtypes and gradients. J. Physiol. (Lond.) 525, 621–639 (2000)
8. Kang, J., Huguenard, J.R., Prince, D.A.: Development of BK channels in neocortical pyramidal neurons. J. Neurophysiol. 76, 188–198 (1996)
9. Hayashida, Y., Yagi, T.: On the interaction between voltage-gated conductances and Ca2+ regulation mechanisms in retinal horizontal cells. J. Neurophysiol. 87, 172–182 (2002)
10. Naraghi, M., Neher, E.: Linearized buffered Ca2+ diffusion in microdomains and its implications for calculation of [Ca2+] at the mouth of a calcium channel. J. Neurosci. 17, 6961–6973 (1997)
11. Noble, D.: Influence of Na/Ca exchanger stoichiometry on model cardiac action potentials. Ann. N.Y. Acad. Sci. 976, 133–136 (2002)
12. Yuan, W., Burkhalter, A., Nerbonne, J.M.: Functional role of the fast transient outward K+ current IA in pyramidal neurons in (rat) primary visual cortex. J. Neurosci. 25, 9185–9194 (2005)
13. Doedel, E.J., Champneys, A.R., Fairgrieve, T.F., Kuznetsov, Y.A., Sandstede, B., Wang, X.: Continuation and bifurcation software for ordinary differential equations (with HomCont). Technical Report, Concordia University (1997)
14. Doi, S., Kumagai, S.: Generation of very slow neuronal rhythms and chaos near the Hopf bifurcation in single neuron models. J. Comput. Neurosci. 19, 325–356 (2005)

Appendix

Itotal = INa + IKs + IKf + IKCa + ICaL + ICa + Ileak + Iex
ICatotal = ICaL + ICa − 2Iex + Ipump

INa = GNa · M1 · H1 · (V − ENa),  GNa = 13.0 (mS/cm2), ENa = 35.0 (mV)
τM1(V) = 1/(0.182 (V + 29.5)/(1 − exp[−(V + 29.5)/6.7]) − 0.124 (V + 29.5)/(1 − exp[(V + 29.5)/6.7]))
τH1(V) = 0.5 + 1/(exp[−(V + 124.955511)/19.76147] + exp[−(V + 10.07413)/20.03406])

IKs = GKs · M2 · H2 · (V − EKs),  GKs = 0.66 (mS/cm2), EKs = −103.0 (mV)
τM2(V) = 1.25 + 115.0 exp[0.026V]  (V < −50 mV);  1.25 + 13.0 exp[−0.026V]  (V ≥ −50 mV)
τH2(V) = 360.0 + (1010.0 + 24.0(V + 55.0)) exp[−((V + 75.0)/48.0)^2]

IKf = GKf · M3 · H3 · (V − EKf),  GKf = 0.27 (mS/cm2), EKf = EKs
τM3(V) = 0.34 + 0.92 exp[−((V + 71.0)/59.0)^2]
τH3(V) = 8.0 + 49.0 exp[−((V + 37.0)/23.0)^2]

IKCa = GKCa · M4 · (V − EKCa),  GKCa = 12.5 (mS/cm2), EKCa = EKs
M4∞(V, [Ca2+]) = ([Ca2+]/([Ca2+] + Kh)) · (1/(1 + exp[−(V + 12.7)/26.2])),  Kh = 0.15 (μM)
τM4(V) = 1.25 + 1.12 exp[(V + 92.0)/41.9]  (V < 40 mV);  27.0  (V ≥ 40 mV)

ICaL = PCaL · M5 · H5 · ((2F)^2/(RT)) · V · ([Ca2+] exp[2VF/RT] − [Ca2+]o)/(exp[2VF/RT] − 1)
PCaL = 0.225 (cm/ms),  H5∞ = KCa^4/(KCa^4 + [Ca2+]^4),  KCa = 4.0 (μM)
τM5(V) = 2.5/(exp[−0.031(V + 37.1)] + exp[0.031(V + 37.1)]),  τH5 = 2000.0

ICa = PCa · M6 · H6 · ((2F)^2/(RT)) · V · ([Ca2+] exp[2VF/RT] − [Ca2+]o)/(exp[2VF/RT] − 1)
PCa = 0.155 (cm/ms),  τH6 = 2000.0
τM6(V) = 2.5/(exp[−0.031(V + 37.1)] + exp[0.031(V + 37.1)])

Ileak = V/30.0
Iex = k([Na+]i^3 · [Ca2+]o · exp[s · VF/RT] − [Na+]o^3 · [Ca2+] · exp[−(1 − s) · VF/RT])
k = 6.0 × 10^−5 (μA/cm2/mM^4),  s = 0.5
Ipump = (2 · F · Apump · [Ca2+])/([Ca2+] + Kpump),  Apump = 5.0 (pmol/s/cm2), Kpump = 0.4 (μM)

Mi∞(V) = 1/(1 + exp[−(V − αMi)/βMi]),  i = 1, 2, 3, 5, 6
Hj∞(V) = 1/(1 + exp[(V − αHj)/βHj]),  j = 1, 2, 3, 6

 i         1      2      3      5       6
 αMi (mV) −29.5  −3.0   −3.0  −18.75  18.75
 βMi       6.7   10.0   10.0   7.0     7.0

 j         1      2      3      6
 αHj (mV) −65.8 −51.0  −66.0  −12.6
 βHj       7.1   12.0   10.0   18.9

C = 1.0 (μF/cm2), S = 3.75 (/cm), k− = 5.0 (/ms), k+ = 500.0 (/mM·ms),
[Ca2+]o = 2.5 (mM), [Na+]i = 7.0 (mM), [Na+]o = 150.0 (mM)

T and R denote the absolute temperature and the gas constant, respectively. In all ionic currents of the model, the powers of the gating variables (M1, …, M6, H1, …, H6) are set to one in order to simplify the equations. The leak current is assumed to have no ion selectivity and to follow Ohm's law; thus its reversal potential is set to 0 (mV), though this is unusual.
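A minimal sketch of evaluating the appendix formulas, using the steady-state gating functions Mi∞, Hj∞ and the Na+ current as examples; the table constants are taken from the appendix, while evaluating all gates at steady state is a simplification for illustration:

```python
from math import exp

# Steady-state activation/inactivation from the appendix:
#   Mi_inf(V) = 1/(1 + exp(-(V - aM)/bM)),  Hj_inf(V) = 1/(1 + exp((V - aH)/bH))
ALPHA_M = {1: -29.5, 2: -3.0, 3: -3.0, 5: -18.75, 6: 18.75}
BETA_M  = {1: 6.7,   2: 10.0, 3: 10.0, 5: 7.0,    6: 7.0}
ALPHA_H = {1: -65.8, 2: -51.0, 3: -66.0, 6: -12.6}
BETA_H  = {1: 7.1,   2: 12.0,  3: 10.0,  6: 18.9}

def m_inf(i, v):
    return 1.0 / (1.0 + exp(-(v - ALPHA_M[i]) / BETA_M[i]))

def h_inf(j, v):
    return 1.0 / (1.0 + exp((v - ALPHA_H[j]) / BETA_H[j]))

def i_na(v, g_na=13.0, e_na=35.0):
    """Na current density INa = GNa * M1 * H1 * (V - ENa) with both
    gates evaluated at steady state (units: mS/cm2 * mV = uA/cm2)."""
    return g_na * m_inf(1, v) * h_inf(1, v) * (v - e_na)

print(i_na(-60.0))  # inward (negative) near rest
```

By construction, each sigmoid crosses 0.5 at its half-activation voltage αMi (or αHj), which gives a quick sanity check on the table constants.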

Representation of Medial Axis from Synchronous Firing of Border-Ownership Selective Cells Yasuhiro Hatori and Ko Sakai Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573 Japan [email protected] [email protected] http://www.cvs.cs.tsukuba.ac.jp/

Abstract. How object shape is represented in the visual system is one of the most crucial questions in brain science. Although we perceive figure shape correctly and quickly, without any effort, the underlying cortical mechanism is largely unknown. A physiological experiment with macaques indicated the possibility that the brain represents a surface with the Medial Axis (MA) representation. To examine whether early visual areas could provide a basis for MA representation, we constructed a physiologically realistic, computational model of the early visual cortex, and examined what constraint is necessary for the representation of MA. Our simulation results showed that simultaneous firing of Border-Ownership (BO) selective cells at the stimulus onset is a crucial constraint for MA representation. Keywords: Shape, representation, perception, vision, neuroscience.

1 Introduction

Segregation of figure from ground might be the first step in the cortex toward the recognition of shape and objects. Recent physiological studies have shown that around 60% of neurons in cortical areas V2 and V4 are selective to Border Ownership (BO), which tells which side of a contour owns the border, i.e., the direction of figure; even about 20% of V1 neurons also showed BO selectivity [1]. These reports also give an insightful idea on the coding of shape in early- to intermediate-level vision. The coding of shape is a major question in neuroscience as well as in robot vision. Specifically, it is of great interest how the visual information in early visual areas is processed to form the representation of shape. Physiological studies in monkeys [2] suggest that shape is coded by the medial axis (MA) representation in early visual areas. The MA representation codes a surface by a set of circles inscribed along the contour of the surface; an arbitrary shape can be reproduced from the centers of the circles and their diameters. We examined whether neural circuits in early- to intermediate-level visual areas could provide a basis for MA representation. We propose that the synchronized responses of BO-selective neurons could evoke the representation of MA. The physiological study [2] showed that V1 neurons responded to figure shape around 40 ms after the stimulus onset, while the latency of the cells that responded to MA was about 200 ms after the onset.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 18–26, 2008. © Springer-Verlag Berlin Heidelberg 2008

Physiological study on


BO-selective neurons reported that their latency is around 70 ms. These results give rise to the proposal that neurons detect contours first, then BO is determined from the detected local contrast, and finally the MA representation is constructed. It should be noted that BO can be determined from local contrast surrounding the classical receptive field (CRF); thus a single neuron with surround modulation would be sufficient to yield BO selectivity [3]. To examine the proposal, we constructed a physiologically realistic firing model of border-ownership (BO) selective neurons. We assume that the onset of a stimulus with high contrast evokes simultaneous, strong responses of the neurons. These fast responses propagate retinotopically, so that the neurons at the MA (equidistant from the contour) are activated. Our simulation results show that, because of the synchronization, even relatively small facilitation from the propagated signals yields firing of the model cells at the MA, indicating that simultaneous firing of BO-selective cells at the stimulus onset could enable the representation of MA.

2 The Proposed Model

The model is comprised of three stages: (1) a contrast detection stage, (2) a BO detection stage, and (3) an MA detection stage. The first stage extracts luminance contrast, similar to V1 simple cells. The model cells in the second stage mimic BO-selective neurons, which determine the direction of BO with respect to the border at the CRF based on modulation from surrounding contrast up to 5 deg in visual angle from the CRF. The third stage spatially pools the responses from the second stage to test the MA representation arising from simultaneous firing of BO cells. A schematic diagram of the model is given in Figure 1. The following sections describe the functions of each stage.

2.1 Model Neurons

To enable simulations with precise spatiotemporal properties, we implemented single-compartment firing neurons and their synaptic connections on the NEURON simulator [4]. The cell body of a model cell is approximated by a sphere. We set the radius and the membrane resistance to 50 μm and 34.5 Ωcm, respectively. The model neurons calculate the membrane potential following the Hodgkin-Huxley equations [5] with the constant parameters shown in Table 1. We used a biophysically realistic spiking neuron model because we need to examine the exact timing of the firing of BO-selective cells as well as the propagation of the signals, which cannot be realized by an integrate-and-fire neuron model or other abstract models.

2.2 Contrast Detection Stage

The model cells in the first stage have response properties similar to those of V1 cells, including contrast detection, dynamic contrast normalization, and static compressive nonlinearity. The model cells in this stage detect luminance contrast with oriented Gabor filters at distinct orientations. We limited ourselves to four orientations for the sake of simplicity: vertical (0 and 180 deg) and horizontal (90 and 270 deg). The Gabor filter is defined as follows:

Gθ(x, y) = cos(2πω(x sin θ + y cos θ)) × gaussian(x, y, μx, μy, σx, σy),   (1)

20

Y. Hatori and K. Sakai

Fig. 1. A schematic illustration of the proposed model. Luminance contrast is detected in the first stage which is then processed to determine the direction of BO by surrounding modulation. F and S represent excitatory and inhibitory regions, respectively, for the surrounding modulation. The last stage detects MA based on the propagations from BO model cells.

where x and y represent spatial location, μx and μy the center of the Gaussian, σx and σy its standard deviations, and θ and ω the orientation and spatial frequency, respectively. We take a convolution of an input image


with the Gabor filters, with dynamic contrast normalization [6] including a static, compressive nonlinear function. For efficient computation, the responses of the vertical pathways (0 and 180 deg) are integrated to form the vertical orientation, and likewise the horizontal pathways (90 and 270 deg), which is convenient for the computation of iso-orientation suppression and cross-orientation facilitation in the next stage:

Oθ(x, y) = (I ∗ Gθ)(x, y),   (2)
O1iso(x, y) = O0(x, y) + O180(x, y),   (3)
O1cross(x, y) = O90(x, y) + O270(x, y),   (4)

where I represents the input image, Oθ(x, y) the output of the convolution (∗), and O1iso (O1cross) the integrated responses of the vertical (horizontal) pathway.

Table 1. The constant values for the model cells used in the simulations

 Parameter   Value
 Cm          1 (μF/cm2)
 ENa         50 (mV)
 EK          −77 (mV)
 El          −54.3 (mV)
 gNa         0.120 (S/cm2)
 gK          0.036 (S/cm2)
 gl          0.0003 (S/cm2)
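The contrast-detection stage of eqs. (1)–(4) can be sketched as follows. The filter size, spatial frequency ω = 0.1, and σ are illustrative assumptions, and the convolution is evaluated at a single location as a dot product rather than over the whole image:

```python
import numpy as np

def gabor(theta, omega=0.1, size=31, sigma=5.0):
    """Gabor filter of eq. (1): a cosine grating at orientation theta times a
    Gaussian envelope (mu_x = mu_y = 0 and sigma_x = sigma_y = sigma here)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    grating = np.cos(2.0 * np.pi * omega * (x * np.sin(theta) + y * np.cos(theta)))
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return grating * envelope

def pooled_responses(patch):
    """Eqs. (2)-(4) at a single location: filter the patch with the four
    directions and pool 0/180 deg and 90/270 deg into the two pathways."""
    o = {d: float(np.sum(patch * gabor(np.deg2rad(d)))) for d in (0, 90, 180, 270)}
    return o[0] + o[180], o[90] + o[270]   # (O1_iso, O1_cross)
```

Feeding the filter its own preferred pattern drives the pooled pathway of that orientation far more strongly than the orthogonal one, which is the contrast signal the BO stage then modulates.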

2.3 BO Detection Stage

The second stage models the surround modulation reported in early- to intermediate-level vision for the determination of BO [3]. The model cells integrate surrounding contrast information up to 5 deg in visual angle from the CRF center. We modeled the surrounding region with two Gaussians, one inhibitory and one facilitatory, located asymmetrically with respect to the CRF center. If a part of the contour of an object is projected onto the excitatory (or inhibitory) region, the contrast information of the contour detected by the first stage is transmitted via a pulse to generate an EPSP (or IPSP) in the BO-selective model cell. In other words, the projection of a figure within the excitatory region facilitates the response of the BO model cell; conversely, if a figure is projected onto the inhibitory region, the response of the model cell is suppressed. Therefore, surrounding contrast signals from the excitatory and inhibitory regions modulate the activity of BO model cells depending on the direction of figure. In this way, we implemented BO model cells based on surround modulation. Note that Jones and her colleagues reported orientation dependency of the surround modulation in monkeys' V1 cells [7]: suppression is limited to orientations similar to the preferred orientation of the CRF (iso-orientation suppression), while facilitation is dominant for other orientations (cross-orientation facilitation). We implemented this orientation dependency for the surround modulation.


Taking into account the EPSPs and IPSPs from the surround, we compute the membrane potential of a BO-selective model cell at time t as follows:

O2(x1, y1, t) = input(x1, y1) + c Σ_{x,y} {Eiso(x, y, t − dx1,y1(x, y)) + Ecross(x, y, t − dx1,y1(x, y))},   (5)

where x1 and y1 represent the spatial position of the BO-selective cell, input(x1, y1) is the output of the first stage (O1iso or O1cross), c is a synaptic weight, and Eiso(x, y, t − dx1,y1(x, y)) is the EPSP (or IPSP) triggered by the pulse generated at t − dx1,y1(x, y); Ecross is defined in the same way for the other input orientation. The delay dx1,y1(x, y) is proportional to the distance between the BO-selective cell at (x1, y1) and the connected cell at (x, y). We define Eiso(x, y, t − dx1,y1(x, y)) and dx1,y1(x, y) as:

Eiso(x, y, t − dx1,y1(x, y)) = gaussian(x, y, μx1, μy1, σx1, σy1) × exp(−(t − dx1,y1(x, y))/τ) × (v − e),   (6)

dx1,y1(x, y) = ctime √((x1 − x)^2 + (y1 − y)^2),   (7)

where τ represents the time constant, v the membrane potential, e the reversal potential, and ctime a constant that converts distance to time. We set c, τ, e and ctime to 0.6 (or 1.0), 10 ms, 0 mV and 0.2 ms/μm, respectively. Ecross(x, y, t − dx1,y1(x, y)) is calculated similarly to eq. (6).

2.4 MA Detection Stage

The third stage integrates BO information from the second stage to detect the MA. An MA model cell has a single excitatory surrounding region represented by a Gaussian. The membrane potential of an MA model cell is given by:

O3(x2, y2, t) = cmedial Σ_{(x,y)∈(x1,y1)} gaussian(x, y, μx2, μy2, σx2, σy2) × O2(x, y, t − dx2,y2(x, y)),   (8)

where x2 and y2 represent the spatial position of the MA model cell, μx2 and μy2 the center of the Gaussian, cmedial a constant that we set to 1.8 (or 10.0), and dx2,y2(x, y) is calculated similarly to eq. (7). Note that MA model cells receive EPSPs only from BO model cells. When a BO model cell is activated, it transmits a pulse to MA model cells, with magnitude and time delay depending on the distance between the two. If an MA model cell is located equidistant from some parts of the contours, the pulses from the BO model cells on those contours reach the MA cell at the same time and evoke a strong EPSP that generates a spike. In contrast, MA model cells that are not equidistant from the contours never generate a spike. Therefore, the model cells located equidistant from the contours are activated by simultaneous input from BO model cells, and this neural population represents the medial axis of the object.
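The distance-dependent delay of eq. (7) and the decaying postsynaptic potential of eq. (6) can be sketched as below; the spatial Gaussian is folded into a single weight w, the constants follow the values quoted in the text (ctime = 0.2 ms/μm, τ = 10 ms, e = 0 mV), and the presynaptic membrane potential v is fixed at an assumed −65 mV:

```python
from math import exp, hypot

C_TIME = 0.2  # ms per unit distance, as in the text
TAU = 10.0    # ms, PSP decay time constant
E_REV = 0.0   # mV, reversal potential e

def delay(x1, y1, x, y):
    """Eq. (7): propagation delay proportional to Euclidean distance."""
    return C_TIME * hypot(x1 - x, y1 - y)

def psp(t, spike_t, x1, y1, x, y, v=-65.0, e=E_REV, w=1.0):
    """Eq. (6), simplified: a pulse emitted at spike_t from (x, y) produces
    at (x1, y1) a delayed, exponentially decaying PSP scaled by a weight w
    (the spatial Gaussian, folded in) and the driving term (v - e)."""
    t_arrive = spike_t + delay(x1, y1, x, y)
    if t < t_arrive:
        return 0.0  # the pulse has not yet propagated this far
    return w * exp(-(t - t_arrive) / TAU) * (v - e)
```

Because the delay grows linearly with distance, pulses from contour points at equal distances arrive together, which is what lets an MA cell sum them into one strong EPSP.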

3 Simulation Result

We carried out simulations of the model to test whether it shows the representation of MA. As typical examples, the results for three types of stimuli are


shown here: a square, a C-shaped figure, and a natural image of an eagle [8], as shown in Fig. 2. Note that the model cells described in the previous sections are distributed retinotopically to form layers of 138 × 138 cells each.


Fig. 2. Three examples of stimuli used for the simulations. A square (A), a C-shaped figure (B), and a natural image of an eagle from Berkeley Segmentation Dataset [8] (C).

3.1 A Single Square

First, we tested the model with a single square similar to that used in the corresponding physiological experiment [2], shown in Fig. 2(A). Although we carried out the simulations retinotopically in 2D, for the purpose of graphical presentation we show in Fig. 3(B) the responses along the horizontal cross section indicated in Fig. 3(A). The figure exhibits the firing rates for the two types of model cells: BO model cells responding to contours (solid lines at horizontal positions −1 and 1) and MA model cells (dotted lines at horizontal position 0). We observe a clear peak corresponding to the MA at the center, similar to the results of the physiological experiments by Lee et al. [2]. Although we tested the model along the horizontal cross section, the response along the vertical cross section is identical. This result suggests that simultaneous firing of BO cells is capable of generating the MA without any other particular constraints.

[Figure 3: firing rate (spikes/400 ms) plotted against horizontal position (−1 to 1).]

Fig. 3. The simulation results for a square. (A) We show the responses of the cells located along the dashed line (the number of neurons is 138). Horizontal positions -1 and 1 represent the places on the vertical edges of the square. Zero represents the center of the square. (B) The responses of the model cells in firing rate along the cross-section as a function of the horizontal location. A clear peak at the center corresponding to MA is observed.
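The intuition behind the central peak in Fig. 3 — pulses from equidistant contour points arriving simultaneously — can be illustrated with a toy coincidence count. This is a geometric sketch, not the firing model itself; the contour sampling density and the coincidence window are assumptions:

```python
import numpy as np

def square_contour(n=41, half=0.5):
    """Sample points along the contour of a unit square centered at the origin."""
    t = np.linspace(-half, half, n)
    sides = [np.stack([t, np.full(n, -half)], 1),
             np.stack([t, np.full(n, half)], 1),
             np.stack([np.full(n, -half), t], 1),
             np.stack([np.full(n, half), t], 1)]
    return np.concatenate(sides)

def coincident_pulses(p, contour, window=0.02):
    """Number of contour points whose distance to p lies within `window`
    of the shortest one; with a fixed conduction speed this is the number
    of pulses arriving (near-)simultaneously, the cue an MA cell integrates."""
    d = np.hypot(*(contour - np.asarray(p, float)).T)
    return int(np.sum(d <= d.min() + window))

sq = square_contour()
print(coincident_pulses((0.0, 0.0), sq))   # on the medial axis: many
print(coincident_pulses((0.3, 0.0), sq))   # off the axis: few
```

The center of the square is near-equidistant from all four sides, so many pulses coincide there; a point off the medial axis is closest to only one side and collects far fewer coincident arrivals.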


3.2 C-Shape

We tested the model with a C-shaped figure, which has been suggested to be a difficult shape for the determination of BO. Fig. 4 shows the simulation result for the C-shape. The responses along the horizontal and vertical cross sections indicated in Fig. 4(A) are shown in Fig. 4(B) and (C), respectively, with the same conventions as Fig. 3. The figure exhibits the firing rates for the two types of model cells: BO model cells responding to contours (solid lines) and MA model cells (dotted lines). The MA representation predicts a strong peak at 0 along the horizontal cross section and distributed strong responses along the vertical cross section. Although we observe the maximum responses at the centers corresponding to the MA, the peak along the horizontal was not significant, and the distribution along the vertical was peaky and uneven. It appears that MA cells cannot properly integrate the signals propagated from BO cells. This distributed MA response comes from complicated propagations of BO signals from the concave shape. Furthermore, as has been suggested [3], the C-shaped figure is a challenging shape for the determination of BO; therefore, the BO signals are not clear before the propagation begins. We note that the model is still capable of providing a basis for MA representation for such a complicated figure.

[Figure 4: firing rate (spikes/400 ms) plotted against horizontal position (B) and vertical position (C), −1 to 1.]

Fig. 4. The simulation results for the C-shaped figure. (A) Positions of the analysis along horizontal and vertical cross-sections as indicated by dashed line. The responses of the model cells in firing rate along the horizontal cross section (B), and that along the vertical cross section (C). Solid and dotted lines indicate the responses of BO and MA model cells, respectively. Although the maximum responses are observed at the centers, the responses are distributed.

3.3 Natural Images

The model has shown its ability to extract a basis for MA representation for not only simple but also difficult shapes. To further examine the model for arbitrary shapes, we


tested the model with natural images taken from the Berkeley Segmentation Dataset [8]. Fig. 2(C) shows an example, an eagle perched on a tree branch. Because we are interested in the representation of shape, we extracted its shape by binarizing the gray scale, as shown in Fig. 5(A). The simulation results of the BO and MA model cells are shown in Fig. 5(B); we plot the responses along the horizontal cross section indicated in Fig. 5(A). Although the shape is much more complicated, detailed, and asymmetric, the results are very similar to those for the square shown in Fig. 3(B). The BO model cells responded to the contours (horizontal positions −1 and 1), and the MA model cells exhibited a strong peak at the center (horizontal position 0). This result indicates that the model detects the MA for figures with arbitrary shapes. The aim of the simulations with natural images is to test a variety of stimulus shapes and configurations that are possible in natural scenes. Further simulations with a number of natural images are expected, specifically with images including occlusion, multiple objects, and ambiguous figures.

[Figure 5: firing rate (spikes/400 ms) plotted against horizontal position (−1 to 1).]

Fig. 5. An example of simulation results for natural images. (A) The binary image of an eagle, together with the horizontal cross section used for the graphical presentation of the results. (B) The responses of the model cells in firing rate along the cross section as a function of horizontal location. Horizontal positions −1 and 1 represent the vertical edges of the eagle; zero represents the center of the bird. A clear peak at the center corresponding to the MA is observed. Although the shape is much more complicated, detailed, and asymmetric, the results are very similar to those for the square shown in Fig. 3(B), indicating the robustness of the model.

4 Conclusion

We studied whether early visual areas could provide a basis for MA representation, specifically what constraint is necessary for the representation of MA. Our results showed that simultaneous firing of BO-selective neurons is crucial for MA representation. We implemented physiologically realistic firing model neurons with connections from BO model cells to MA model cells. If a stimulus is presented at once so that the BO cells along the contours fire simultaneously, and if an MA cell is located equidistant from some of the contours of the stimulus, then the MA cell fires because of the synchronous signals from the BO cells, which give rise to a strong EPSP. We showed three typical examples of the simulation results: a simple square, a difficult C-shaped figure, and a natural image of an eagle. The simulation results showed that the model provides a basis for MA representation for all three types of stimuli. These


results suggest that the simultaneous firing of BO cells is essential for the MA representation in early visual areas.

Acknowledgment We thank Dr. Haruka Nishimura for her insightful comments and Mr. Satoshi Watanabe for his help in simulations. This work was supported by Grant-in-aid for Scientific Research from the Brain Science Foundation, the Okawa Foundation, JSPS (19530648), and MEXT of Japan (19024011).

References

1. Zhou, H., Friedman, H.S., von der Heydt, R.: Coding of Border Ownership in Monkey Visual Cortex. The Journal of Neuroscience 20, 6594–6611 (2000)
2. Lee, T.S., Mumford, D., Romero, R., Lamme, V.A.F.: The role of the primary visual cortex in higher level vision. Vision Research 38, 2429–2454 (1998)
3. Sakai, K., Nishimura, H.: Surrounding Suppression and Facilitation in the Determination of Border Ownership. The Journal of Cognitive Neuroscience 18, 562–579 (2006)
4. NEURON: http://www.neuron.yale.edu/neuron/
5. Johnston, D., Wu, S. (eds.): Foundations of Cellular Neurophysiology. MIT Press, Cambridge (1999)
6. Carandini, M., Heeger, D.J., Movshon, J.A.: Linearity and Normalization in Simple Cells of the Macaque Primary Visual Cortex. The Journal of Neuroscience 17, 8621–8644 (1997)
7. Jones, H.E., Wang, W., Sillito, A.M.: Spatial Organization and Magnitude of Orientation Contrast Interactions in Primate V1. Journal of Neurophysiology 88, 2796–2808 (2002)
8. The Berkeley Segmentation Dataset: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
9. Engel, A.K., Fries, P., Singer, W.: Dynamic predictions: oscillations and synchrony in top-down processing. Nature Reviews Neuroscience 2, 704–716 (2001)

Neural Mechanism for Extracting Object Features Critical for Visual Categorization Task

Mitsuya Soga1 and Yoshiki Kashimori1,2

1 Dept. of Information Network Science, Graduate School of Information Systems, Univ. of Electro-Communications, Chofu, Tokyo 182-8585, Japan
2 Dept. of Applied Physics and Chemistry, Univ. of Electro-Communications, Chofu, Tokyo 182-8585, Japan

Abstract. The ability to group visual stimuli into meaningful categories is a fundamental cognitive process. Several experiments have investigated the neural mechanism of visual categorization. Although there is experimental evidence that prefrontal cortex (PFC) and inferior temporal (IT) cortex neurons respond sensitively in categorization tasks, little is known about the functional role of the interaction between PFC and IT in such tasks. To address this issue, we propose a functional model of the visual system and investigate the neural mechanism underlying the categorization of line drawings of faces. We show that IT represents the similarity of face images based on the information in the resolution maps of early visual stages. We also show that PFC neurons bind the information about the parts and locations of the face image, and that PFC generates a working memory state in which only the information about face features relevant to the categorization task is sustained.

1 Introduction

Visual categorization is fundamental to the behavior of higher primates. Our raw perceptions would be useless without our classification of items such as animals and food. The visual system has the ability to categorize visual stimuli, that is, the ability to react similarly to stimuli even when they are physically distinct, and to react differently to stimuli that may be physically similar. How does the brain group stimuli into meaningful categories? Several experiments have investigated the neural mechanism of visual categorization. Freedman et al. [1] examined the responses of neurons in the prefrontal cortex (PFC) of monkeys trained to categorize computer-generated animal forms as either "doglike" or "catlike". They reported that many PFC neurons responded selectively to the different types of visual stimuli belonging to either the cat or the dog category. Sigala and Logothetis [2] recorded from inferior temporal (IT) cortex after monkeys learned a categorization task, and found that the selectivity of the IT neurons was significantly increased for features critical to the task. The numerous reciprocal connections between PFC and IT could allow the necessary interactions to select the best diagnostic features of stimuli [3]. However, little is known about the role of the interaction between IT

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 27–36, 2008.
© Springer-Verlag Berlin Heidelberg 2008

and PFC in categorization tasks, in which the category boundaries are set by behavioral consequences. To address this issue, we propose a functional model of the visual system in which the categorization task is achieved through the functional roles of IT and PFC. The functional role of IT is to represent the features of object parts, based on different resolution maps in early visual areas such as V1 and V4. In IT, visual stimuli are categorized by similarity of the features of object parts. Posterior parietal cortex (PP) encodes the location of the object part to which attention is paid. PFC neurons combine the information about the features and locations of object parts, and generate a working memory of the object information relevant to the categorization task. The synaptic connections between IT and PFC are learned so as to achieve the categorization task. The feedback signals from PFC to IT enhance the sensitivity of the IT neurons that respond to the features of object parts critical for the categorization task, enabling the visual system to perform task-dependent categorization quickly and reliably. In the present study, we present a neural network model that categorizes visual objects depending on the categorization task. We investigated the neural mechanism of the categorization task with the line drawings of faces used by Sigala and Logothetis [2]. Using this model, we show that IT represents the similarity of face images based on the information in the resolution maps of V1 and V4. We also show that PFC generates a working memory state in which only the information about face features relevant to the categorization task is sustained.

2 Model

To investigate the neural mechanism of visual categorization, we constructed a neural network model of the form perception pathway from the retina to the prefrontal cortex (PFC). The model consists of eight neural networks corresponding to the retina, lateral geniculate nucleus (LGN), V1, V4, inferior temporal cortex (IT), posterior parietal cortex (PP), PFC, and premotor area, which are involved in the ventral and dorsal pathways [4,5]. The network structure of our model is illustrated in Fig. 1.

2.1 Model of Retina and LGN

The retinal network is an input layer, on which the object image is projected. The retina has a two-dimensional lattice structure containing an $N_R \times N_R$ pixel scene. The LGN network consists of three different types of neurons with respect to the spatial resolution of contrast detection: fine-tuned neurons with high spatial frequency (LGNF), middle-tuned neurons with middle spatial frequency (LGNM), and broad-tuned neurons with low spatial frequency (LGNB). The output of the LGNx neuron (x = B, M, F) at site (i, j) is given by

$$I_{LGN}(i, j; x) = \sum_{i_R}\sum_{j_R} I_R(i_R, j_R)\, M(i, j; x), \tag{1}$$

Fig. 1. The structure of our model. The model is composed of eight modules resembling the ventral and dorsal pathways of the visual cortex: retina, lateral geniculate nucleus (LGN), primary visual cortex (V1), V4, inferior temporal cortex (IT), posterior parietal cortex (PP), prefrontal cortex (PFC), and premotor area. a1–a3 denote dynamical attractors of visual working memory.

$$M(i, j; x) = A \exp\!\left(-\frac{(i - i_R)^2 + (j - j_R)^2}{\sigma_{1x}^2}\right) - B \exp\!\left(-\frac{(i - i_R)^2 + (j - j_R)^2}{\sigma_{2x}^2}\right), \tag{2}$$

where $I_R(i_R, j_R)$ is the gray-scale intensity of the pixel at retinal site $(i_R, j_R)$, and $M(i, j; x)$ is a Mexican-hat-like function that represents the convergence of the retinal inputs through ON-center OFF-surround connections between retina and LGN. The parameter values were set to A = 1.0, B = 1.0, σ1x = 1.0, and σ2x = 2.0.
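The ON-center OFF-surround convergence of Eqs. (1)–(2) can be sketched as a difference-of-Gaussians filter. The parameter values below follow the text (A = B = 1, σ1 = 1, σ2 = 2); the image size and the direct double loop are illustrative choices, not the authors' implementation:

```python
import numpy as np

def mexican_hat(i, j, i_r, j_r, sigma1=1.0, sigma2=2.0, A=1.0, B=1.0):
    """Difference-of-Gaussians weight M(i, j; x) between LGN site (i, j)
    and retinal site (i_r, j_r), as in Eq. (2)."""
    d2 = (i - i_r) ** 2 + (j - j_r) ** 2
    return A * np.exp(-d2 / sigma1 ** 2) - B * np.exp(-d2 / sigma2 ** 2)

def lgn_response(image, sigma1=1.0, sigma2=2.0):
    """LGN output map, Eq. (1): each LGN site sums retinal intensities
    weighted by the Mexican-hat profile centred on it."""
    n, m = image.shape
    ii, jj = np.meshgrid(np.arange(n), np.arange(m), indexing="ij")
    out = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            w = mexican_hat(i, j, ii, jj, sigma1, sigma2)
            out[i, j] = (image * w).sum()
    return out
```

Note that with A = B the weight at zero distance is exactly zero, so a uniform region produces a net negative response and only contrast (edges, corners) drives the output strongly, which is the contrast-detection role the text assigns to LGN.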

2.2 Model of V1 Network

The neurons of the V1 network detect elemental features of the object image, such as the orientation and edges of a bar. The V1 network consists of three different types of networks with high, middle, and low spatial resolutions, V1B, V1M, and V1F, which receive the outputs of LGNB, LGNM, and LGNF, respectively. The V1x network (x = B, M, F) contains $M_x \times M_x$ hypercolumns, each of which contains $L_x$ orientation columns. The neurons in V1x (x = B, M, F) have receptive fields performing a Gabor transform. The output of the V1x neuron at site (i, j) is given by

$$I_{V1}(i, j, \theta; x) = \sum_{p}\sum_{q} I_{LGN}(p, q; x)\, G(p, q, \theta; x), \tag{3}$$

$$G(p, q, \theta; x) = \frac{1}{2\pi\sigma_{Gx}\sigma_{Gy}} \exp\!\left(-\frac{1}{2}\left(\frac{(i - p)^2}{\sigma_{Gx}^2} + \frac{(j - q)^2}{\sigma_{Gy}^2}\right)\right) \times \sin\!\left(2\pi f_x p \cos\theta + 2\pi f_y q \sin\theta + \frac{\pi}{2}\right), \tag{4}$$

where $f_x$ and $f_y$ are the spatial frequencies along the x- and y-coordinates, respectively. The parameter values were σGx = 1, σGy = 1, fx = 0.1 Hz, and fy = 0.1 Hz.
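Equations (3)–(4) amount to correlating the LGN map with an oriented Gabor kernel per V1 site. A sketch under the stated parameters (σGx = σGy = 1, fx = fy = 0.1); the map size and test stimulus are illustrative:

```python
import numpy as np

def gabor_weight(i, j, p, q, theta, sigma_x=1.0, sigma_y=1.0, fx=0.1, fy=0.1):
    """Gabor receptive-field weight G(p, q, theta; x) of Eq. (4),
    centred on V1 site (i, j) with preferred orientation theta."""
    envelope = np.exp(-0.5 * ((i - p) ** 2 / sigma_x ** 2
                              + (j - q) ** 2 / sigma_y ** 2))
    carrier = np.sin(2 * np.pi * fx * p * np.cos(theta)
                     + 2 * np.pi * fy * q * np.sin(theta) + np.pi / 2)
    return envelope * carrier / (2 * np.pi * sigma_x * sigma_y)

def v1_response(lgn_map, theta, **kwargs):
    """V1 orientation response, Eq. (3): sum of LGN outputs weighted by
    the Gabor profile for preferred orientation theta."""
    n, m = lgn_map.shape
    pp, qq = np.meshgrid(np.arange(n), np.arange(m), indexing="ij")
    out = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            out[i, j] = (lgn_map * gabor_weight(i, j, pp, qq, theta, **kwargs)).sum()
    return out
```

Evaluating the same input at several values of theta gives the orientation-column responses of a hypercolumn: an oriented bar excites the matching theta map most strongly.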

2.3 Model of V4 Network

The V4 network consists of three different networks with high, middle, and low spatial resolutions, which receive convergent inputs from cell assemblies with the same tuning in V1F, V1M, and V1B. The convergence of the outputs of V1 neurons enables V4 neurons to respond specifically to combinations of elemental features, such as a cross or a triangle, represented on the V4 network.

2.4 Model of PP

The posterior parietal (PP) network consists of $N_{PP} \times N_{PP}$ neurons, each of which corresponds to the spatial position of a pixel of the retinal image. The functions of the PP network are to represent the spatial position of a whole object and the spatial arrangement of its parts in the retinotopic coordinate system, and to mediate the location of the object part to which attention is paid.

2.5 Model of IT

The IT network consists of three subnetworks, which receive the outputs of the V4F, V4M, and V4B maps, respectively. Each subnetwork has neurons tuned to various features of the object parts, depending on the resolution of the V4 maps and the location of the object parts to which attention is directed. The first subnetwork detects a broad outline of a whole object, the second subnetwork detects elemental figures that represent elemental outlines of the object, and the third subnetwork represents the information of the object parts based on the fine resolution of the V4F map. Each subnetwork was constructed using Kohonen's self-organizing map model. The elemental figures in the second subnetwork may play an important role in extracting similarity among object outlines.
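Since the IT subnetworks are stated to be built with Kohonen's self-organizing map, a generic 1-D SOM sketch may help. This is the standard algorithm, not the authors' exact training schedule (which the paper does not give); the unit count, learning rate, and decay schedule are illustrative:

```python
import numpy as np

def train_som(data, n_units=10, epochs=50, lr0=0.5, radius0=3.0, seed=0):
    """Minimal 1-D Kohonen SOM: each unit holds a weight vector; for each
    input, the best-matching unit and its chain neighbours move toward the
    input, with learning rate and neighbourhood radius shrinking over time."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(data.min(), data.max(), size=(n_units, data.shape[1]))
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)
        radius = max(radius0 * (1 - t / epochs), 0.5)
        for x in rng.permutation(data):
            winner = np.argmin(np.linalg.norm(w - x, axis=1))
            dist = np.abs(np.arange(n_units) - winner)
            h = np.exp(-dist ** 2 / (2 * radius ** 2))  # neighbourhood kernel
            w += lr * h[:, None] * (x - w)
    return w
```

After training, nearby units respond to similar inputs, which is the property the text relies on: stimuli with similar part features activate neighbouring IT neurons.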

2.6 Model for Working Memory in PFC

The PFC memorizes the information about the spatial positions of the object parts as dynamical attractors. The functional role of the PFC is to reproduce a complete form of the object by binding the information about the object parts memorized in the second and third subnetworks of IT with their spatial arrangements represented by the PP network. The PFC network model was constructed based on the dynamical map model [6,7]. The network model consists of three types of neurons: M, R, and Q neurons. Q neurons are connected to M neurons with inhibitory synapses. M neurons are interconnected with each other through excitatory and inhibitory synaptic connections. The M-neuron layer of the present model corresponds to an associative neural network. M neurons receive inputs from the three IT subnetworks. The dynamical evolution of the membrane potentials of the M, R, and Q neurons is described by

$$\tau_m \frac{du_{mi}}{dt} = -u_{mi} + \sum_{j}\sum_{\tau_{ij}=0}^{\tau_{max}} W_{mm,ij}(t, \tau_{ij})\, V_{mj}(t - \tau_{ij}) + W_{mq} U_{qi} + W_{mr} U_{ri} + \sum_{k} W_{IT,ik} X_k^{IT} + \sum_{l} W_{PP,il} X_l^{PP}, \tag{5}$$

$$\tau_{qi} \frac{du_{qi}}{dt} = -u_{qi} + W_{qm} V_{mi}, \tag{6}$$

$$\tau_{ri} \frac{du_{ri}}{dt} = -u_{ri} + W_{rm} V_{mi}, \tag{7}$$

where $u_{mi}$, $u_{qi}$, and $u_{ri}$ are the membrane potentials of the ith M, Q, and R neurons, respectively, and $\tau_{mi}$, $\tau_{qi}$, $\tau_{ri}$ are the relaxation times of these membrane potentials. $\tau_{ij}$ is the delay time of signal propagation from the jth M neuron to the ith one, and $\tau_{max}$ is the maximum delay time. The time delays play an important role in stabilizing the temporal sequence of the firing pattern. $W_{mm,ij}(t, \tau_{ij})$ is the strength of the axo-dendritic synaptic connection from the jth M neuron to the ith M neuron whose propagation delay time is $\tau_{ij}$. $W_{mq}$, $W_{mr}$, $W_{qm}$, and $W_{rm}$ are the strengths of the dendro-dendritic synaptic connections from Q to M, from R to M, from M to Q, and from M to R neurons, respectively. $V_{mi}$ is the output of the ith M neuron, and $U_{qi}$ and $U_{ri}$ are the dendritic outputs of the ith Q and R neurons, respectively. The outputs of the Q and R neurons are given by sigmoidal functions of $u_{qi}$ and $u_{ri}$, respectively. M neurons have spike outputs, whose firing probability is determined by a sigmoidal function of $u_{mi}$. $W_{IT,ik}$ and $W_{PP,il}$ are the synaptic strengths from the kth IT neuron and the lth PP neuron to the ith M neuron, respectively; $X_k^{IT}$ and $X_l^{PP}$ are their outputs. The parameters were set to $\tau_m$ = 3 ms, $\tau_{qi}$ = 1 ms, $\tau_{ri}$ = 1 ms, $W_{mq}$ = 1, $W_{mr}$ = 1, $W_{qm}$ = 1, $W_{qr}$ = −10, and $\tau_{max}$ = 38 ms. The dynamical evolution of the synaptic connections is described by

$$\tau_w \frac{dW_{mm,ij}(t, \tau_{ij})}{dt} = -W_{mm,ij}(t, \tau_{ij}) + \lambda V_{mi}(t) V_{mj}(t - \tau_{ij}), \tag{8}$$

where $\tau_w$ is a time constant and $\lambda$ is a learning rate. The parameter values were $\tau_w$ = 1200 ms and $\lambda$ = 28. The PFC connects reciprocally with the IT, PP, and premotor networks. The synaptic connections between PFC and IT and those between PFC and PP are learned by a Hebbian learning rule.
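The leaky-integrator form of Eqs. (5)–(7) can be sketched with forward-Euler integration. This is a deliberately reduced cartoon of one M-Q-R triad: the delayed recurrent sum of Eq. (5) is omitted, the IT/PP input terms are replaced by a constant drive, the sigmoid gain is an assumed value, and the inhibitory effect of Q is written as an explicit minus sign (the paper states the Q-to-M synapse is inhibitory):

```python
import numpy as np

def sigmoid(u, gain=5.0):
    """Assumed sigmoidal output function (gain not given in the paper)."""
    return 1.0 / (1.0 + np.exp(-gain * u))

def simulate_triad(i_ext, dt=0.1, t_end=50.0,
                   tau_m=3.0, tau_q=1.0, tau_r=1.0,
                   w_mq=1.0, w_mr=1.0, w_qm=1.0, w_rm=1.0):
    """Forward-Euler integration (times in ms) of one M-Q-R triad,
    Eqs. (5)-(7), with constant external drive i_ext standing in for
    the IT/PP input terms and no delayed recurrent input."""
    steps = int(t_end / dt)
    u_m = u_q = u_r = 0.0
    trace = np.empty(steps)
    for k in range(steps):
        v_m = sigmoid(u_m)                      # M-neuron output
        du_m = (-u_m - w_mq * sigmoid(u_q)      # Q inhibits M
                + w_mr * sigmoid(u_r)           # R excites M
                + i_ext) / tau_m                # Eq. (5), reduced
        du_q = (-u_q + w_qm * v_m) / tau_q      # Eq. (6)
        du_r = (-u_r + w_rm * v_m) / tau_r      # Eq. (7)
        u_m += dt * du_m
        u_q += dt * du_q
        u_r += dt * du_r
        trace[k] = v_m
    return trace
```

With the paper's symmetric weights the Q and R pathways balance and the unit relaxes to a steady firing level; in the full model the delayed recurrent term of Eq. (5), trained by Eq. (8), is what supports the temporal sequences underlying the attractors.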

2.7 Model of Premotor Cortex

The model of the premotor area consists of neurons whose firing corresponds to the action relevant to the categorization task, that is, pressing the right or left lever. The details of the mathematical description are given in Refs. [8–10].

3 Neural Mechanism for Extracting Diagnostic Features in Visual Categorization Task

3.1 Categorization Task

We used the line drawings of faces from Sigala and Logothetis [2] to investigate the neural mechanism of the categorization task. The face images have four varying features: eye height, eye separation, nose length, and mouth height. The monkeys were trained to categorize the face stimuli depending on two diagnostic features, eye height and eye separation. The two diagnostic features allowed separation between classes along a linear category boundary, as shown in Fig. 2b. The face stimuli were not linearly separable using the other two, non-diagnostic features, nose length and mouth height. On each trial, the monkeys saw one face stimulus and then pressed one of two levers to indicate the category. They received a reward only if they chose the correct category. After training, the monkeys were able to categorize various face stimuli based on the two diagnostic features. In our simulation, we used the four training stimuli shown in Fig. 2a and test stimuli in which the four features were varied.
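The task structure, linearly separable in the two diagnostic features while the non-diagnostic features are ignored, can be illustrated with a toy rule. The weights and threshold below are hypothetical; the paper only states that a linear boundary in the eye-height/eye-separation plane separates the classes:

```python
def categorize(eye_height, eye_separation, w=(1.0, 1.0), threshold=1.0):
    """Toy linear category rule on the two diagnostic features:
    category 1 if w . x exceeds the threshold, else category 2.
    Nose length and mouth height play no role, mirroring the task."""
    score = w[0] * eye_height + w[1] * eye_separation
    return 1 if score > threshold else 2
```

No linear rule of this form on nose length and mouth height alone could reproduce the reward contingency, which is exactly the sense in which those features are non-diagnostic.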

Fig. 2. a) The training stimulus set consisted of line drawings of faces with four varying features: eye separation, eye height, nose length, and mouth height. b) In the categorization task, the monkeys were presented with one stimulus at a time. The two categories were linearly separable along the line. The test stimuli are indicated by the marks 'x' and 'o'. See Ref. [2] for details of the task.

3.2 Neural Mechanism for Accomplishing Visual Categorization Task

The neural mechanism for accomplishing the visual categorization task is illustrated in Fig. 3. Object features are encoded by hierarchical processing at each stage of the ventral visual pathway from the retina to V4. The IT neurons encode the information of object parts such as the eyes, nose, and mouth. The PP neurons encode the locations of the object parts to which attention should be directed. The information about an object part and its location is combined by a dynamical attractor in PFC. The output of PFC is sent to two types of premotor neurons, the firing of each of which leads to pressing of the right or left lever. When the monkey exhibits behavior relevant to the task and receives a reward, the attractor in PFC representing the object information relevant to the task is gradually stabilized by the facilitation of learning across the PFC neurons and between the PFC and premotor neurons. In contrast, when the monkey exhibits behavior irrelevant to the task and receives no reward, the attractor associated with the irrelevant behavior is destabilized and eliminated from the PFC network. As a result, the PFC retains only the object information relevant to the categorization task as working memory. The feedback from PFC to IT and PP strengthens the responses of the IT and PP neurons, enabling the visual system to rapidly and accurately discriminate between object features belonging to different categories. When the monkey pays attention to a local region of the face stimulus, the attention signal from another brain area, such as the frontal eye field, increases the activity of the PP neurons encoding the location of the face part to which attention is directed. The PP neurons send their outputs back to V4, thereby increasing the activity of the V4 neurons encoding the features of the attended face part, because V4 has the same retinotopic map as PP. Thus the attention signal in PP allows V4 to send IT only the information about the face part to which attention is directed,

Fig. 3. Neural mechanism for accomplishing the visual categorization task. The information about an object part and its location is combined to generate working memory, indicated by α and β. The solid and dashed lines indicate the formation and elimination of synaptic connections, respectively.

leading to the generation of IT neurons encoding face parts. Furthermore, as training proceeds, the attention relevant to the task is fixed by the learning of the synaptic connections between PFC and PP, allowing the monkey to perform the visual task quickly.

4 Results

4.1 Information Processing of Visual Images in Early Visual Areas

Figure 4 shows the responses of neurons in the early visual areas LGN, V1, and V4 to the training stimulus face 1 shown in Fig. 2a. The visual information is processed in a hierarchical manner, because the neurons along the pathway from LGN to V4 have progressively larger receptive fields and prefer more complex stimuli. At the first stage, the contrast of the stimulus is encoded by the ON-center OFF-surround receptive fields of the LGN neurons, as shown in Fig. 4b. Then the V1 neurons, receiving the outputs of the LGN neurons, encode the orientations of the short bars contained in the face drawing, as shown in Fig. 4c. The V4 network was constructed with Kohonen's self-organizing map so that the V4 neurons could respond to more complex features of the stimulus. Figure 4d shows that the V4 neurons respond to more complex features such as the eyes, nose, and mouth.

Fig. 4. Responses of neurons in LGN, V1, and V4. The magnitude of the neuronal responses is shown in gray scale, with the response magnitude increasing with the depth of the gray. a) Face stimulus; the image is 90 × 90 pixels. b) Responses of LGN neurons. c) Responses of V1 neurons tuned to four directions. d) Responses of V4 neurons. The kth neurons (k = 1–3) encode stimulus features such as the eyes, nose, and mouth, respectively.

4.2 Information Processing of Visual Images in IT Cortex

Figure 5a shows the ability of the IT neurons encoding eye separation and eye height to categorize test face stimuli. The test stimuli, in which these two features were varied, were categorized by the four IT neurons trained with the four training stimuli shown in Fig. 2a, suggesting that the IT neurons are capable of separating test stimuli into categories based on similarity of the features of face parts.
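The grouping described here, test stimuli assigned to whichever trained face they most resemble in the two-feature plane, is equivalent to nearest-prototype matching. A sketch with hypothetical feature vectors (the paper does not give numeric feature values for the four training faces):

```python
import numpy as np

def nearest_prototype(stimulus, prototypes):
    """Index of the training face whose (eye separation, eye height) pair
    is closest to the test stimulus, mimicking the similarity-based
    grouping performed by the trained IT neurons."""
    stimulus = np.asarray(stimulus, dtype=float)
    d = np.linalg.norm(np.asarray(prototypes, dtype=float) - stimulus, axis=1)
    return int(np.argmin(d))

# Hypothetical diagnostic-feature values for the four training faces of Fig. 2a
faces = [(0.2, 0.8), (0.8, 0.8), (0.2, 0.2), (0.8, 0.2)]
```

The resulting partition of the feature plane (a Voronoi tessellation around the four prototypes) corresponds to the boundary lines drawn in Fig. 5a, and, as the text notes below, it is similarity-based rather than task-dependent.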

Fig. 5. a) Ability of IT neurons to categorize face stimuli by the two diagnostic features. The four IT neurons were formed using the four training stimuli shown in Fig. 2a, whose features are represented by four kinds of symbols (circle, square, triangle, cross). The test stimuli, represented by small symbols, are categorized by the four IT neurons; each small symbol indicates the IT neuron that categorizes the test stimulus. The solid lines mark the category boundaries. b) Temporal variation of the dynamic state of the PFC network during the categorization task. The attractors representing the diagnostic features are denoted by α–δ, and the attractor representing a non-diagnostic feature is denoted by ε. A mark on a row corresponding to α–ε indicates that the network activity stays in that attractor. The visual stimulus of face 1 was applied to the retina at 300–500 ms.

Similarly, the IT neurons encoding nose length and mouth height separated test stimuli into other categories on the basis of the similarity of these two features. However, the classification in IT is not task-dependent, but is based on the similarity of face features.

4.3 Mechanism for Generating Working Memory Attractor in PFC

The PFC combines the information about face features with the locations of the face parts to which attention is directed, and forms memory attractors for this information. Figure 5b shows the temporal variation of the memory attractors in PFC. The information about face parts with the two diagnostic features is represented by the attractors X (X = α, β, γ, δ), where X represents the information about the eye separation and eye height of the four training stimuli and the location around the eyes. The attractors X are dynamically linked in the PFC. As shown in Fig. 5b, the information about face parts with the diagnostic features is memorized as working memory (α–δ), because the synaptic connections between PFC and the premotor area are strengthened by the reward signal given on the choice of the correct category. In contrast, the information about face parts with non-diagnostic features is not memorized as a stable attractor (ε in Fig. 5b), because the information about non-diagnostic features does

not lead to correct categorization behavior. Thus, the PFC retains only the information required for the categorization task as working memory.

5 Concluding Remarks

In the present study, we have shown that IT represents the similarity of face images based on the resolution maps of V1 and V4, and that PFC generates a working memory state in which the information about face features relevant to the categorization task is sustained. The feedback from PFC to IT and PP may play an important role in extracting the diagnostic features critical for the categorization task. The feedback from PFC increases the sensitivity of the IT and PP neurons that encode the object feature and location relevant to the task, respectively. This allows the visual system to perform the categorization task rapidly and accurately. It remains to be seen how the feedback from PFC to IT and PP shapes the functional connections across the three visual areas.

References

1. Freedman, D.J., Riesenhuber, M., Poggio, T., Miller, E.K.: Categorical representation of visual stimuli in the primate prefrontal cortex. Science 291, 312–316 (2001)
2. Sigala, N., Logothetis, N.K.: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002)
3. Hasegawa, I., Miyashita, Y.: Categorizing the world: expert neurons look into key features. Nature Neurosci. 5, 90–91 (2002)
4. Marcelja, S.: Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. 70, 1297–1300 (1980)
5. Rolls, E.T., Deco, G.: Computational Neuroscience of Vision. Oxford University Press, Oxford (2002)
6. Hoshino, O., Inoue, S., Kashimori, Y., Kambara, T.: A hierarchical dynamical map as a basic frame for cortical mapping and its application to priming. Neural Comput. 13, 1781–1810 (2001)
7. Hoshino, O., Kashimori, Y., Kambara, T.: An olfactory recognition model of spatio-temporal coding of odor quality in olfactory bulb. Biol. Cybernet. 79, 109–120 (1998)
8. Suzuki, N., Hashimoto, N., Kashimori, Y., Zheng, M., Kambara, T.: A neural model of predictive recognition in form pathway of visual cortex. Biosystems 79, 33–42 (2004)
9. Ichinose, Y., Kashimori, Y., Fujita, K., Kambara, T.: A neural model of visual system based on multiple resolution maps for categorizing visual stimuli. In: Proceedings of ICONIP 2005, pp. 515–520 (2005)
10. Kashimori, Y., Suzuki, N., Fujita, K., Zheng, M., Kambara, T.: A functional role of multiple spatial resolution maps in form perception along the ventral visual pathway. Neurocomputing 65–66, 219–228 (2005)

An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion

Jordan H. Boyle, John Bryden, and Netta Cohen

School of Computing, University of Leeds, Leeds LS2 9JT, United Kingdom

Abstract. One of the most tractable organisms for the study of nervous systems is the nematode Caenorhabditis elegans, whose locomotion in particular has been the subject of a number of models. In this paper we present a ﬁrst integrated neuro-mechanical model of forward locomotion. We ﬁnd that a previous neural model is robust to the addition of a body with mechanical properties, and that the integrated model produces oscillations with a more realistic frequency and waveform than the neural model alone. We conclude that the body and environment are likely to be important components of the worm’s locomotion subsystem.

1 Introduction

The ultimate aim of neuroscience is to unravel and completely understand the links between animal behaviour, its neural control and the underlying molecular and genetic computation at the cellular and sub-cellular levels. This daunting challenge sets a distant goal post in the study of the vast majority of animals, but work on one animal in particular, the nematode Caenorhabditis elegans, is leading the way. This tiny worm has only 302 neurons and yet is capable of generating an impressive wealth of sensory-motor behaviours. With the first fully sequenced animal genome [1], a nearly complete wiring diagram of the nervous circuit [2], and hundreds of well characterised mutant strains, the link between genetics and behaviour never seemed more tractable. To date, a number of models have been constructed of subcircuits within the C. elegans nervous system, including sensory circuits for thermotaxis and chemotaxis [3,4], reflex control such as tap withdrawal [5], reversals (from forward to backward motion and vice versa) [6] and head swing motion [7]. Locomotion, like the overwhelming majority of known motor activity in animals, relies on the rhythmic contraction of muscles, which are controlled or regulated by neural networks. This system consists of a circuit in the head (generally postulated to initiate motion and determine direction) and an additional subcircuit along the ventral cord (responsible for propagating and sustaining undulations, and potentially generating them as well). Models of C. elegans locomotion have tended to focus on forward locomotion, and in particular, on the ability of the worm to generate and propagate undulations down its length [8,9,10,11,12]. These models have tended to study either the mechanics of locomotion [8] or the forward locomotion neural circuit [9,10,11,12]. In this paper we present simulations of an

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 37–47, 2008.
© Springer-Verlag Berlin Heidelberg 2008

integrated model of the neural control of forward locomotion [12] with a minimal model of muscle actuation and a mechanical model of a body, embedded in a minimal environment. The main questions we address are (i) whether the disembodied neural model is robust to the addition of a body with mechanical properties; and (ii) how the addition of mechanical properties alters the output from the motor neurons. In particular, models of the isolated neural circuit for locomotion suﬀer from a common limitation: the inability to reproduce undulations with frequencies that match the observed behaviour of the crawling worm. To address this question, we have limited our integrated model to a short section of the worm, rather than modelling the entire body. We ﬁnd that the addition of a mechanical framework to the neural control model of Ref. [12] leads to robust oscillations, with signiﬁcantly smoother waveforms and reduced oscillation frequencies, matching observations of the worm.

2 Background

2.1 C. elegans Locomotion

Forward locomotion is achieved by propagating sinusoidal undulations along the body from head to tail. When moving on a firm substrate (e.g. agarose) the worm lies on its side, with the ventral and dorsal muscles at any longitudinal level contracting in anti-phase. With the exception of the head and neck, the worm is only capable of bending in the dorso-ventral plane. Like all nematode worms, C. elegans lacks any form of rigid skeleton. Its roughly cylindrical body has a diameter of ∼ 80 μm and a length of ∼ 1 mm. It has an elastic cuticle containing (along with its intestine and gonad) pressurised fluid, which maintains the body shape while remaining flexible. This structure is referred to as a hydrostatic skeleton. The body wall muscles responsible for locomotion are anchored to the inside of the cuticle.

2.2 The Neural Model

The neural model used here is based on the work of Bryden and Cohen [11,12,13]. Specifically, we use the model (equations and parameters) presented in [12], which is itself an extension of Refs. [11,13]. The model simplifies the neuronal wiring diagram of the worm [2,14] into a minimal neural circuit for forward locomotion. This reduced model contains a set of repeating units (one "tail" unit and ten "body" units), where each unit consists of one dorsal motor neuron (of class DB) and one ventral motor neuron (of class VB). A single command interneuron (representing a pair of interneurons of class AVB in the biological worm) provides the "on" signal to the forward locomotion circuit and is electrically coupled (via gap junctions) to all motor neurons of classes DB and VB. In the model, motor neurons also have sensory function, integrating inputs from stretch receptors, or mechano-sensitive ion channels, that encode each unit's bending angle. Motor neurons receive both local and, with the exception of the tail, proximate sensory input, with proximate input received from the adjacent posterior unit.

Fig. 1. A: Schematic diagram of the physical model illustrating nomenclature (see Appendix B for details). B: The neural model, with only two units (one body, one tail). AVB is electrically coupled to each of the motor neurons via gap junctions (resistor symbols).

The sensory-motor loop for each unit gives rise to local oscillations which phase lock with adjacent units. Equations and parameters for the neural model are set out in Appendix A. This neural-only model uses a minimal physical framework to translate neuronal output to bending. Fig. 1B shows the neural model with only two units (a tail and one body unit), as modelled in this paper. In the following section, we outline a more realistic physical model of the body of the worm.
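Before turning to the mechanical model, it may help to see why a stretch-receptor feedback loop oscillates at all. The following toy relaxation oscillator is an illustrative cartoon (our assumption, not the model of Appendix A): neural drive bends the unit at a fixed rate until the stretch-encoded bending angle crosses a threshold, at which point drive switches to the opposite side, producing the square-like bending waveform reported for the neural-only model:

```python
def bang_bang_oscillator(theta_max=1.0, rate=0.5, dt=0.01, t_end=20.0):
    """Cartoon of one sensory-motor unit: the active side bends the unit
    at a fixed rate; when the stretch-receptor signal (bending angle)
    reaches +/- theta_max, drive switches to the opposite side.
    Units of theta_max, rate, and time are arbitrary here."""
    theta, direction = 0.0, 1
    angles = []
    t = 0.0
    while t < t_end:
        theta += direction * rate * dt
        if abs(theta) >= theta_max:   # stretch threshold reached
            direction = -direction    # dorsal/ventral drive switches
        angles.append(theta)
        t += dt
    return angles
```

The oscillation period in such a loop is set by how fast the angle can traverse the threshold band, which is why adding a body with realistic drag (next section) slows and smooths the rhythm.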

3 Physical Model

Our physical model is an adaptation of Ref. [8], a 2-D model consisting of two rows of N points (representing the dorsal and ventral sides of the worm). Each point is acted on by the opposing forces of the elastic cuticle and pressure, as well as muscle force and drag (often loosely referred to as friction or surface friction [8]). We modify this model by introducing simplifications to reduce simulation time, in part by allowing us to use a longer time step. Fig. 1A illustrates the model's structure. The worm is represented by a number of rigid beams, each connected to the adjacent beams by four springs. Two horizontal (h) springs connect points on the same side of adjacent beams and resist both elongation and compression. Two diagonal (d) springs connect the dorsal side of the ith beam to the ventral side of the (i+1)st, and vice versa. These springs strongly resist compression and have an effect analogous to that of pressure, in that they help to maintain a reasonably constant area in each unit. The model was implemented in C++, using a 4th-order Runge-Kutta method for numerical integration, with a time step of 0.1 ms (the original model [8] required a time step of 0.001 ms with the same integration method). Equations and parameters

of the physical model are given in Appendix B. The steps taken to interface the physical and neuronal models are described in Appendix C.
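The integration scheme described above can be sketched for a single degree of freedom. This is a one-point, one-spring, viscous-drag toy (an illustrative stand-in for the beam-and-spring model of Appendix B, with assumed stiffness, drag, and mass values), stepped with the classical 4th-order Runge-Kutta method:

```python
def rk4_step(f, y, t, dt):
    """One classical 4th-order Runge-Kutta step for dy/dt = f(t, y),
    where y is a list of state variables."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def damped_spring(k=1.0, drag=0.5, m=1.0):
    """Point of mass m on a spring of stiffness k with viscous drag:
    the same kind of force balance applied per body point in the model
    (parameter values here are arbitrary)."""
    def f(t, state):
        x, v = state
        return [v, (-k * x - drag * v) / m]
    return f

# Integrate from x = 1, v = 0 over 20 time units
state = [1.0, 0.0]
f = damped_spring()
for i in range(2000):
    state = rk4_step(f, state, i * 0.01, 0.01)
```

The stability advantage of RK4 over simple Euler stepping is what permits the larger 0.1 ms time step mentioned above; stiffer springs (as in the diagonal, pressure-like springs) are exactly the terms that would force a smaller step.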

4 Results

Using our integrated model we first simulated a single unit (the tail), and then implemented two phase-lagged units (adding a body unit). In what follows, we present these results, compared with those of the neural model alone.

4.1 Single Oscillating Segment

The neural model alone produces robust oscillations in unit bending angle (θi ) with a roughly square waveform, as shown in Fig. 2A. The model unit oscillates at about 3.5 Hz, as compared to frequencies of about 0.5 Hz observed for C. elegans forward locomotion on an agarose substrate. It has not been possible to ﬁnd parameters within reasonable electrophysiological bounds for the neural model that would slow the oscillations to the desired time scales [12]. Oscillations of the integrated neuro-mechanical model of a single unit are shown in Fig. 2B. All but four parameters of the neuronal model remain unchanged from Ref. [12]. However, parameters used for the actuation step caused a slight asymmetry in the oscillations when integrated with a physical model, and were therefore modiﬁed. As can be seen from the traces in the ﬁgure, the frequency of oscillation in the integrated model is about 0.5 Hz for typical agarose drag [8], and the waveform has a smooth, almost sinusoidal shape. Faster (and slower) oscillations are possible for lower (higher) values of drag. Fig. 2C shows a plot of oscillation frequencies as a function of drag for the integrated model.

Fig. 2. Oscillations of, A, the original neural model [12] and, B, the integrated model (with drag of 80 × 10⁻⁶ kg·s⁻¹). Note the different time scales. C: Oscillation frequency as a function of drag. The zero frequency point indicates that the unit can no longer oscillate.

An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion

4.2 Two Phase-Lagged Segments

Parameters of the neural model are given in Table A-1 for the tail unit and in Table A-2 for the body unit. Fig. 3 compares bending waveforms recorded from a living worm (Fig. 3A), simulated by the neural model (Fig. 3B) and simulated by the integrated model (Fig. 3C).

Fig. 3. Phase lagged oscillation of two units. A: Bending angles extracted from a recording of a forward locomoting worm on an agarose substrate. The traces are of two points along the worm (near the middle and 1/12 of a body length apart). B: Simulation of two coupled units in the neural model. C: Simulation of the integrated model. Take note of the faster oscillations in subplot B.

5 Discussion

C. elegans is amenable to manipulations at the genetic, molecular and neuronal levels but with such rich behaviour being produced by a system with so few components, it can often be diﬃcult to determine the pathways of cause and eﬀect. Mathematical and simulation models of the locomotion therefore provide an essential contribution to the understanding of C. elegans neurobiology and motor control. The inclusion of a realistic embodiment is particularly relevant to a model of C. elegans locomotion. Sensory feedback is important to the locomotion of all animals. However, in C. elegans, the postulated existence of stretch receptor inputs along the body (unpublished communication, L. Eberly and R. Russel, reported in [2]) would provide direct information about body posture to the motor neurons themselves. Thus, the neural control is likely to be tightly coupled to the shape the worm takes as it locomotes. Modelling the body physics is therefore particularly important in this organism. Here we have presented the ﬁrst steps in the implementation of such an integrated model, using biologically plausible parameters for both the neural and mechanical components. One interesting eﬀect is the smoothing of the waveform from a square-like waveform in the isolated neural model to a nearly sinusoidal waveform in the integrated model. The smoothing can be attributed to the body’s resistance to bending (modelled as a set of springs), which increases with the bending angle.


By contrast, in the original neural model, the rate of bending depends only on the neural output. The work presented here would naturally lead to an integrated neuromechanical model of locomotion for an entire worm. The next step toward this goal, extending the neural circuit to the entire ventral cord (and the corresponding motor system) is currently underway. The physical model introduces long range interactions between units via the body and environment. In a real worm, as in the physical model, for bending to occur at some point along the worm, local muscles must contract. However, such contractions also apply physical forces to adjacent units, and so on up and down the worm, giving rise to a signiﬁcant persistence length. For this reason the extension of the neuro-mechanical model from two to three (or more) units will not be automatic and will require parameter changes to model an operable balance between the eﬀects of the muscle and body properties. In fact, the worm’s physical properties (and, in particular, the existence of long range physical interactions along it) could set new constraints on the neural model, or could even be exploited by the worm to achieve more eﬀective locomotion. Either way, the physics of the worm’s locomotion is likely to oﬀer important insights that could not be gleaned from a model of the isolated neural subcircuit. We have shown that a neural model developed with only the most rudimentary physical framework can continue to function with a more realistic embodiment. Indeed, both the waveform and frequency have been improved beyond what was possible for the isolated neural model. We conclude that the body and environment are likely to be important components of the subsystem that generates locomotion in the worm.

Acknowledgement

This work was funded by the EPSRC, grant EP/C011961. NC was funded by the EPSRC, grant EP/C011953. Thanks to Stefano Berri for movies of worms and behavioural data.

References

1. C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282, 2012–2018 (1998)
2. White, J.G., Southgate, E., Thomson, J.N., Brenner, S.: The structure of the nervous system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society of London, Series B 314, 1–340 (1986)
3. Ferrée, T.C., Marcotte, B.A., Lockery, S.R.: Neural network models of chemotaxis in the nematode Caenorhabditis elegans. Advances in Neural Information Processing Systems 9, 55–61 (1997)
4. Ferrée, T.C., Lockery, S.R.: Chemotaxis control by linear recurrent networks. Journal of Computational Neuroscience: Trends in Research, 373–377 (1998)
5. Wicks, S.R., Roehrig, C.J., Rankin, C.H.: A Dynamic Network Simulation of the Nematode Tap Withdrawal Circuit: Predictions Concerning Synaptic Function Using Behavioral Criteria. Journal of Neuroscience 16, 4017–4031 (1996)


6. Tsalik, E.L., Hobert, O.: Functional mapping of neurons that control locomotory behavior in Caenorhabditis elegans. Journal of Neurobiology 56, 178–197 (2003)
7. Sakata, K., Shingai, R.: Neural network model to generate head swing in locomotion of Caenorhabditis elegans. Network: Computation in Neural Systems 15, 199–216 (2004)
8. Niebur, E., Erdős, P.: Theory of the locomotion of nematodes. Biophysical Journal 60, 1132–1146 (1991)
9. Niebur, E., Erdős, P.: Theory of the Locomotion of Nematodes: Control of the Somatic Motor Neurons by Interneurons. Mathematical Biosciences 118, 51–82 (1993)
10. Niebur, E., Erdős, P.: Modeling Locomotion and Its Neural Control in Nematodes. Comments on Theoretical Biology 3(2), 109–139 (1993)
11. Bryden, J.A., Cohen, N.: A simulation model of the locomotion controllers for the nematode Caenorhabditis elegans. In: Schaal, S., Ijspeert, A.J., Billard, A., Vijayakumar, S., Hallam, J., Meyer, J.A. (eds.) Proceedings of the Eighth International Conference on the Simulation of Adaptive Behavior, pp. 183–192. MIT Press / Bradford Books (2004)
12. Bryden, J.A., Cohen, N.: Neural control of C. elegans forward locomotion: The role of sensory feedback (Submitted 2007)
13. Bryden, J.A.: A simulation model of the locomotion controllers for the nematode Caenorhabditis elegans. Master's thesis, University of Leeds (2003)
14. Chen, B.L., Hall, D.H., Chklovskii, D.B.: Wiring optimization can relate neuronal structure and function. Proceedings of the National Academy of Sciences USA 103, 4723–4728 (2006)

Appendix A: Neural Model

Neurons are assumed to have graded potentials [11,12,13]. In particular, motor neurons (VB and DB) are modelled by leaky integrators with a transmembrane potential V(t) following

C dV/dt = −G(V − E_rev) − I^shape + I^AVB ,   (A-1)

where C is the cell's membrane capacitance; E_rev is the cell's effective reversal potential; and G is the total effective membrane conductance. The sensory input I^shape = Σ_{j=1}^{n} (V − E_j^stretch) G_j^stretch σ_j^stretch(θ_j) is the stretch receptor input from the shape of the body, where E_j^stretch is the reversal potential of the ion channels, θ_j is the bending angle of unit j and σ_j^stretch is a sigmoid response function of the stretch receptors to the local bending. The stretch receptor activation function is given by σ^stretch(θ) = 1/[1 + exp(−(θ − θ_0)/δθ)], where the steepness parameter δθ and the threshold θ_0 are constants. The command input current I^AVB = G^AVB (V_AVB − V) models gap junctional coupling with AVB (with coupling strength G^AVB and denoting the AVB voltage by V_AVB). Note that in the model, AVB is assumed to have a sufficiently high capacitance, so that the gap junctional currents have a negligible effect on its membrane potential.
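For concreteness, the right-hand side of eq. (A-1) can be sketched as follows (a minimal illustration; the parameter-dictionary layout and names are our own assumptions, with values to be taken from Tables A-1 and A-2):

```python
import math

def sigma_stretch(theta, theta0, dtheta):
    """Sigmoidal stretch receptor activation: 1/[1 + exp(-(theta - theta0)/dtheta)]."""
    return 1.0 / (1.0 + math.exp(-(theta - theta0) / dtheta))

def dV_dt(V, thetas, p):
    """Right-hand side of eq. (A-1) for one motor neuron.
    thetas: bending angles of the units feeding this neuron's stretch receptors.
    p: parameter dict (illustrative layout, not from the paper)."""
    I_shape = sum((V - p['E_stretch']) * g * sigma_stretch(th, t0, dth)
                  for g, th, t0, dth in zip(p['G_stretch'], thetas,
                                            p['theta0'], p['dtheta']))
    I_AVB = p['G_AVB'] * (p['V_AVB'] - V)
    return (-p['G'] * (V - p['E_rev']) - I_shape + I_AVB) / p['C']
```

With no stretch input and V at both the reversal potential and the AVB potential, the derivative is zero, as expected for a leaky integrator at rest.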


Segment bending in this model is given as a summation of an output function from each of the two neurons:

dθ/dt = σ^out_VB(V) − σ^out_DB(V) ,   (A-2)

where σ^out(V) = ω_max / [1 + exp(−(V − V_0)/δV)] with constants ω_max, δV and V_0. Note that dorsal and ventral muscles contribute to bending in opposite directions (with θ and −θ denoting ventral and dorsal bending, respectively).

Table A-1. Parameters for a self-oscillating tail unit (as in Ref. [12])

E_rev = −60 mV; V_AVB = −30.7 mV; C = 5 pF; G_VB = 19.07 pS; G_DB = 17.58 pS; G^AVB_VB = 35.37 pS; G^AVB_DB = 13.78 pS; G^stretch_VB = 98.55 pS; G^stretch_DB = 67.55 pS; E^stretch = 60 mV; θ_0,VB = −18.68°; θ_0,DB = −19.46°; δθ_VB = 0.1373°; δθ_DB = 0.4186°; ω_max,VB = 6987°/sec; ω_max,DB = 9951°/sec; V_0,VB = 22.8 mV; V_0,DB = 25.0 mV; δV_VB = 0.2888 mV; δV_DB = 0.0826 mV

Table A-2. Parameters for body units and tail-body interactions, as in Ref. [12]. All body-unit parameters that are not included here are the same as for the tail unit.

G_VB = 26.09 pS; G_DB = 25.76 pS; G^stretch_VB = 16.77 pS; G^stretch_DB = 18.24 pS; E^stretch = 60 mV; θ_0,VB = −19.14°; θ_0,DB = −13.26°; δθ_VB = 1.589°; δθ_DB = 1.413°

Appendix B: Physical Model

The physical model consists of N rigid beams which form the boundaries between the N − 1 units. The ith beam can be described in one of two ways: either by the (x, y) coordinates of the centre of mass (CoM_i in Fig. 1) and angle φ_i, or by the (x, y) coordinates of its two end points (P^D_i and P^V_i in Fig. 1). Each formulation has its own advantages and is used where appropriate.

B.1 Spring Forces

The rigid beams are connected to each of their neighbours by two horizontal (h) springs and two diagonal (d) springs, directed along the vectors

Δ^h_{k,i} = P^k_{i+1} − P^k_i   for k = D, V
Δ^d_{m,i} = P^k_{i+1} − P^l_i   for (k, l) = (D, V), (V, D) and m = 1, 2 ,   (B-1)

for i = 1 : N − 1, where P^k_i = (x^k_i, y^k_i) are the coordinates of the ends of the ith beam. The spring forces F_(s) depend on the lengths of these vectors, Δ^j_{k,i} = |Δ^j_{k,i}|, and are collinear with them. The magnitudes of the horizontal and diagonal spring forces are the piecewise linear functions

F^h_(s)(Δ) = κ^h_S2(Δ − L^h_2) + κ^h_S1(L^h_2 − L^h_0)   if Δ > L^h_2 ;
            κ^h_S1(Δ − L^h_0)   if L^h_2 > Δ > L^h_0 ;
            κ^h_C2(Δ − L^h_1) + κ^h_C1(L^h_1 − L^h_0)   if Δ < L^h_1 ;
            κ^h_C1(Δ − L^h_0)   otherwise ,   (B-2)

F^d_(s)(Δ) = κ^d_C2(Δ − L^d_1) + κ^d_C1(L^d_1 − L^d_0)   if Δ < L^d_1 ;
            κ^d_C1(Δ − L^d_0)   if L^d_1 < Δ < L^d_0 ;
            0   otherwise ,   (B-3)

where the spring (κ) and length (L) constants are given in Table B-1.

Table B-1. Parameters of the physical model. Note that the values for θ_0 and θ̄_0 differ from Ref. [12] and Table A-1.

D = 80 μm; L^h_0 = 50 μm; L^h_1 = 0.5 L^h_0; L^h_2 = 1.5 L^h_1; L^d_0 = √((L^h_0)² + D²); L^d_1 = 0.95 L^d_0; κ^h_S1 = 20 μN·m⁻¹; κ^h_S2 = 10 κ^h_S1; κ^h_C1 = 0.5 κ^h_S1; κ^h_C2 = 10 κ^h_C1; κ^d_C1 = 50 κ^h_S1; κ^d_C2 = 10 κ^d_C1; f_muscle = 0.005 L^h_0 κ^h_C1; c_∥ = c_⊥ = 80 × 10⁻⁶ kg·s⁻¹; θ_0,VB = −29.68°; θ_0,DB = −8.46°; θ̄_0,VB = −22.14°; θ̄_0,DB = −10.26°
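The piecewise linear forces of eqs. (B-2) and (B-3) can be sketched as follows (an illustrative implementation; the constants in the usage comment are chosen by us so that the regimes are ordered L^h_1 < L^h_0 < L^h_2, not copied verbatim from Table B-1):

```python
def spring_force_h(d, L0, L1, L2, kS1, kS2, kC1, kC2):
    """Horizontal spring force magnitude, piecewise linear as in eq. (B-2)."""
    if d > L2:                                    # strongly stretched
        return kS2 * (d - L2) + kS1 * (L2 - L0)
    elif d > L0:                                  # stretched
        return kS1 * (d - L0)
    elif d < L1:                                  # strongly compressed
        return kC2 * (d - L1) + kC1 * (L1 - L0)
    else:                                         # mildly compressed
        return kC1 * (d - L0)

def spring_force_d(d, L0, L1, kC1, kC2):
    """Diagonal spring force magnitude as in eq. (B-3): resists compression only."""
    if d < L1:                                    # strongly compressed
        return kC2 * (d - L1) + kC1 * (L1 - L0)
    elif d < L0:                                  # mildly compressed
        return kC1 * (d - L0)
    return 0.0                                    # no restoring force when stretched

# Illustrative constants (units: metres and N/m):
L0, L1, L2 = 50e-6, 25e-6, 75e-6
kS1 = 20e-6                                       # 20 uN/m
```

At the rest length L0 both functions return zero force, and the diagonal springs contribute nothing under stretch, which is what gives them their one-sided, pressure-like effect.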

B.2 Muscle Forces

Muscle forces F_(m) are directed along the horizontal vectors Δ^h_{k,i}, with magnitude

F_(m)k,i = f_muscle A_{k,i}   for k = D, V and i = 1 : N − 1 ,   (B-4)

where f_muscle is a constant (see Table B-1) and A_{k,i} are scalar activation functions for the dorsal and ventral muscles, determined by

(A_{D,i}, A_{V,i}) = (θ_i(t), 0)   if θ_i(t) ≥ 0 ;   (0, −θ_i(t))   if θ_i(t) < 0 ,   (B-5)

where θ_i(t) = ∫_0^t (dθ_i/dt′) dt′ is the integral over the output of the neural model.

B.3 Total Point Force

With the exception of points on the outer beams, each point i is subject to forces F_{D,i} and F_{V,i}, given by differences of the spring and muscle forces from the corresponding units (i and i − 1):


F_{D,i} = (F^h_(s)D,i − F^h_(s)D,i−1) + (F^d_(s)1,i − F^d_(s)2,i−1) + (F_(m)D,i − F_(m)D,i−1)
F_{V,i} = (F^h_(s)V,i − F^h_(s)V,i−1) + (F^d_(s)2,i − F^d_(s)1,i−1) + (F_(m)V,i − F_(m)V,i−1) .   (B-6)

Since the first beam has no anterior body parts, and the last beam has no posterior body parts, all terms with i = 0 or i = N are taken as zero.

B.4 Equations of Motion

Motion of the beams is calculated from the total force acting on each of the 2N points. Since the points P^D_i and P^V_i are connected by a rigid beam, it is convenient to convert F_(t)k,i to a force and a torque acting on the beam's centre of mass. Rotation by φ_i converts the coordinate system of F_(t)k,i = (F^x_(t)k,i, F^y_(t)k,i) to a new system F_(t)k,i = (F^⊥_(t)k,i, F^∥_(t)k,i), with axes perpendicular to (⊥) and parallel with (∥) the beam:

F^⊥_(t)k,i = F^x_(t)k,i cos(φ_i) + F^y_(t)k,i sin(φ_i)
F^∥_(t)k,i = F^y_(t)k,i cos(φ_i) − F^x_(t)k,i sin(φ_i) .   (B-7)

The parallel components are summed and applied to CoM_i, resulting in pure translation. The perpendicular components are separated into odd and even parts (giving rise to a torque and a force, respectively) by

F^⊥,even_i = (F^⊥_(t)D,i + F^⊥_(t)V,i) / 2
F^⊥,odd_i = (F^⊥_(t)D,i − F^⊥_(t)V,i) / 2 .   (B-8)

As in Ref. [8] we disregard inertia, but include Stokes' drag. Also following Ref. [8], we allow for different constants for drag in the parallel and perpendicular directions, given by c_∥ and c_⊥ respectively. The motion of CoM_i is therefore

V^∥_(CoM),i = (1/c_∥)(F^∥_(t)D,i + F^∥_(t)V,i)
V^⊥_(CoM),i = (1/c_⊥)(2 F^⊥,even_i)
ω_(CoM),i = (1/(r c_⊥))(2 F^⊥,odd_i) ,   (B-9)

where r = 0.5 D is the radius of the worm. Finally, we convert V^∥_(CoM),i and V^⊥_(CoM),i back to (x, y) coordinates with

V^x_(CoM),i = V^⊥_(CoM),i cos(φ_i) − V^∥_(CoM),i sin(φ_i)
V^y_(CoM),i = V^∥_(CoM),i cos(φ_i) + V^⊥_(CoM),i sin(φ_i) .   (B-10)


Appendix C: Integrating the Neural and Physical Model

In the neural model, the output dθ_i(t)/dt specifies the bending angles θ_i(t) for each unit. In the integrated model, θ_i(t) are taken as the input to the muscles. Muscle outputs (or contractions) are given by unit lengths. The bending angle α_i is then estimated from the dorsal and ventral unit lengths by

α_i = 36.2 (|Δ^h_D,i| − |Δ^h_V,i|) / L^h_0 ,   (C-1)

where L^h_0 is the resting unit length. (For simplicity, we have denoted the bending angles of both the neural and integrated models by θ in the Figures.)
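Eq. (C-1) amounts to a short helper (an illustrative sketch; the argument names are ours):

```python
import math

def bending_angle(p_d_i, p_d_ip1, p_v_i, p_v_ip1, L0=50e-6):
    """Estimate the bending angle of unit i from the dorsal and ventral
    unit lengths, as in eq. (C-1)."""
    dD = math.dist(p_d_i, p_d_ip1)    # dorsal unit length |Delta^h_D,i|
    dV = math.dist(p_v_i, p_v_ip1)    # ventral unit length |Delta^h_V,i|
    return 36.2 * (dD - dV) / L0
```

A straight (unbent) unit, with equal dorsal and ventral lengths, gives a zero angle; dorsal elongation relative to the ventral side gives a positive angle.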

Applying the String Method to Extract Bursting Information from Microelectrode Recordings in Subthalamic Nucleus and Substantia Nigra

Pei-Kuang Chao¹, Hsiao-Lung Chan¹,⁴, Tony Wu²,⁴, Ming-An Lin¹, and Shih-Tseng Lee³,⁴

¹ Department of Electrical Engineering, Chang Gung University
² Department of Neurology, Chang Gung Memorial Hospital
³ Department of Neurosurgery, Chang Gung Memorial Hospital
⁴ Center of Medical Augmented Virtual Reality, Chang Gung Memorial Hospital
259 Wen-Hua First Road, Gui-Shan, 333, Taoyuan, Taiwan
[email protected]

Abstract. This paper proposes that bursting characteristics can be effective parameters for classifying and identifying neural activities from the subthalamic nucleus (STN) and substantia nigra (SNr). The string method was used to quantify bursting patterns in microelectrode recordings into indexes. The inter-spike-interval (ISI) constraint was used as one of the independent variables to examine the effectiveness and consistency of the method. The results show consistent findings about bursting patterns in STN and SNr data across all ISI constraints. Neurons in STN tend to release a larger number of bursts with fewer spikes per burst, whereas neurons in SNr produce a smaller number of bursts with more spikes per burst. According to our statistical evaluation, 50 and 80 ms are suggested as the optimal ISI constraints for classifying STN and SNr bursting patterns by the string method.

Keywords: Subthalamic nucleus, substantia nigra, inter-spike-interval, burst, microelectrode.

1 Introduction

The subthalamic nucleus (STN) is frequently the target for studying and treating Parkinson's disease [1, 2]. Placing a microelectrode to record neural activities in deep brain nuclei provides useful information for localization during deep brain stimulation (DBS) neurosurgery. DBS has been approved by the FDA since 1998 [3]. The surgery implants a stimulator in deep brain nuclei, usually the STN, to alleviate Parkinson's symptoms such as tremor and rigidity. To search for the STN during the operation, a microelectrode probe is often used to acquire neural signals from outer areas down to the specific target. With the assistance of imagery techniques, microelectrode signals from different depths are read and recorded. Then, an important step in determining the STN location is to distinguish signals of the STN from its nearby areas, e.g. the substantia nigra (SNr) (which is slightly ventral and medial to the STN). Therefore, characterizing and quantifying the firing patterns of STN and

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 48–53, 2008. © Springer-Verlag Berlin Heidelberg 2008


SNr are essential. Firing rate, defined as the number of neural spikes within a period, is the most common variable used to describe neural activities. However, STN and SNr have broad and mostly overlapping ranges of firing rates [4] (although SNr has a slightly higher mean firing rate than STN). This makes it difficult to depend on firing rate to target the STN. Bursting patterns may provide a better solution for separating signals from different nuclei. Bursting, defined as clusters of high-frequency spikes released by a neuron, is believed to store important neural information. To establish long-term responses, central synapses usually require groups of action potentials (bursts) [5,6]. Exploring bursting information in neural activities has recently become fundamental in Parkinson's studies [1,2]. It has also been observed that spike trains are more regular in SNr than in STN signals [7]. However, the regularity of firing or grouping of spikes in STN and SNr potentials has not been investigated thoroughly. This study aims to extract bursting information from STN and SNr. A quantifying method for bursting, the string method, will be applied. The string method quantifies bursting information based on the inter-spike-interval (ISI) and spike number [8]. Although some other methods for quantifying bursts exist [9,10], the string method can additionally identify which spikes contribute to the detected bursts. In addition, because various ISIs have been used in research [8,9] to define bursts, this study will also evaluate the effect of ISI constraints on discriminating STN and SNr signals.

2 Method

The neuronal data used in this study were acquired during DBS neurosurgery at Chang Gung Memorial Hospital. With the assistance of imagery localization systems [11], trials (10 s each) of microelectrode recordings were collected at a sampling


Fig. 1. MRI images from one patient: a. In the sagittal plane – the elevation angle of the probe (yellow line) from the inter-commissural line (green line) was around 50 to 75°; b. In the frontal plane – the angle of the probe (yellow lines) from the midline (green line) was about 8 to 18° to right or left.


P.-K. Chao et al.

rate of 24,000 Hz. Based on several observations, e.g. magnetic resonance imaging (MRI), computed tomography (CT), motion/perception-related responses and probe location according to a stereotactic system, experienced neurologists diagnosed 18 trials as neural signals from the STN, and the other 23 trials as from the SNr. Trials which were collected outside the STN and SNr and/or could be confused between STN and SNr were excluded. In this paper, the data are from 3 Parkinson's patients (2 females, 1 male, age = 73.3±8.3 y/o) who received DBS treatment. Due to individual differences among the patients, e.g. head size, the depth of the STN from the scalp was found to vary between 15 and 20 cm. During surgery, the elevation angle of the probe from the intercommissural line was around 50 to 75° (Fig 1a) and the angle between the probe and the midline was about 8 to 18° toward either right or left (Fig 1b).

2.1 Spike Detection

Each trial of microelectrode recordings includes 2 types of signals: spikes and background signals. The background signals are interference from nearby neural areas or the environment. Because the background signals can be modelled as a Gaussian distribution, signals which are 3 standard deviations (SD) above or below the mean can be treated as non-background signals, i.e. spikes. Therefore, a threshold at the level of the mean plus 3 SD is applied in this study to detect spikes (Fig 2).
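A minimal sketch of this thresholding step (our own function; the paper does not publish code):

```python
import numpy as np

def detect_spikes(signal, fs=24000):
    """Threshold-based spike detection: samples exceeding mean + 3 SD are
    treated as spikes. Returns spike times in seconds, taking the first
    suprathreshold sample of each run as the spike onset."""
    signal = np.asarray(signal, dtype=float)
    thr = signal.mean() + 3.0 * signal.std()
    above = signal > thr
    # Rising edges: suprathreshold samples whose predecessor is below threshold.
    onsets = np.flatnonzero(above & ~np.r_[False, above[:-1]])
    return onsets / fs
```

Keeping only the rising edge of each suprathreshold run avoids counting one broad spike as several events.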

Fig. 2. A segment of microelectrode recording – the red horizontal line is the threshold; the green stars indicate the detected spikes; most signals around the baseline are background signals


2.2 The String Method

Every detected spike was plotted as a circle in a figure of spike sequential number versus spike occurrence time (Fig 3). The spike sequential number starts at 1 for the first spike in a trial. Spikes which are closer to each other are labelled as strings [8] and defined as bursts. Two parameters were controlled and manipulated to determine bursts: (1) the minimum number of spikes to form a burst was 5; (2) the maximum ISI of adjacent spikes in a burst was set to 20 ms, 50 ms, 80 ms, and 110 ms separately, to find an optimal condition for distinguishing STN and SNr bursting patterns.
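The string criterion amounts to grouping consecutive spikes whose inter-spike intervals never exceed the ISI constraint; a minimal sketch (our own implementation, not the authors' code):

```python
def find_bursts(spike_times, max_isi=0.05, min_spikes=5):
    """String-method burst detection: group consecutive spikes whose
    inter-spike intervals are all <= max_isi (seconds); keep groups with
    at least min_spikes spikes. Returns (start, end) index pairs."""
    bursts, start = [], 0
    for i in range(1, len(spike_times) + 1):
        end_of_run = (i == len(spike_times)
                      or spike_times[i] - spike_times[i - 1] > max_isi)
        if end_of_run:
            if i - start >= min_spikes:
                bursts.append((start, i - 1))
            start = i
    return bursts
```

From its output, the number of bursts per trial and the spike count per burst (the NB and SB variables below) follow directly.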


Fig. 3. A segment of strings plot – each blue circle means a spike; the red triangles mean the starting spikes of a burst; the black triangles mean the ending spikes of a burst

2.3 Dependent Variables

Three dependent variables were computed: (1) Firing rate (FR) was calculated as the total spike number divided by the trial duration (10 s). (2) Number of bursts (NB) was determined by the string method as the total burst number in a trial. (3) The average spike number per burst (SB) was also counted in every trial.

2.4 Statistical Analysis

An independent-samples t-test was applied to test the firing rate difference between STN and SNr signals. MANOVA was performed to evaluate NB and SB separately among the different ISI constraints (α = .05).


3 Results

The signals from STN and SNr showed similar firing rates but different bursting patterns. There is no significant difference between STN and SNr in firing rate (STN: 57.0±22.1; SNr: 68.8±23.5) (p>.05). The results of NB and SB are listed in Table 1 and Table 2. In NB, SNr has significantly fewer bursts than STN when the ISI setting is 50 ms and 80 ms (p

1.0, 65/104 cells), and 28% of neurons were more responsive to BOS than to OREV (d′(RS_BOS/RS_OREV) > 1.0, 29/104 cells). These results are consistent with past studies [4]. The mean of d′(RS_BOS/RS_REV) was 1.25, and the mean of d′(RS_BOS/RS_OREV) was 0.63. These results indicate that the responses of BOS-selective neurons in HVC are largely variable, especially in terms of their sequential response properties.


Fig. 1. An example of the auditory response to song elements (A) and element pairs (B) in a single unit from HVC

Population Coding of Song Element Sequence


Fig. 2. Song transition matrices of self-generated songs (ﬁrst row) and sequential response distribution matrices of each single unit (second to seventh rows)

3.2 Responses to Song Element Pair Stimuli

J. Nishikawa, M. Okada, and K. Okanoya

To investigate the neural selectivity to song element sequences, we recorded neural responses to all possible element pair stimuli. Because the playback of these stimuli is extremely time-consuming, we could maintain only 34% of the recorded single units stable throughout the entire presentation (35/104 cells, 12/23 birds). In total, 70% of the stable single units were BOS-selective (d′(RS_BOS/RS_REV) > 1.0, 27/35 cells, 12/12 birds). Thereafter, we focused on these data.

A typical example of neural responses to each song element is shown in Fig. 1(A). The neuron responded to a single element A or C with single phasic activity, but it did not respond to element B. It responded to element D with double phasic activity. These results indicate that the neuron has various response properties even during single element presentation. In addition, the neuron exhibited more complex response properties during the presentation of element pairs (Fig. 1(B)). The neuron responded more strongly to most of the element pairs when the second element was A or C, compared to single presentation of each element. However, the response was weaker when the first and second elements were the same. When the second element was B, no differences were observed between single and paired stimuli. When the second element was D, we measured single phasic responses, and a strong response to BD. These response properties were not correlated with the element-to-element transition probabilities in the song structure. In Fig. 1(B), the dotted boxes indicate the sequences included in BOS. The neuron responded only weakly to some sequences that were included in BOS (black arrows). In contrast, it responded strongly to other sequences that were not included in BOS (white arrows). Thus, the neuron had broad response properties to song element pairs beyond the structure of the self-generated song.

To quantitatively evaluate sequential response properties, we calculated the response strength measure d′(RS_S/RS_Baseline) for each element pair stimulus S. The sequential response distributions were created for each neuron in two individuals with more than five well identified single units. Song transition matrices and sequential response distributions are shown in Fig. 2. The response distributions were not correlated with the associated song transition matrices. Moreover, each HVC neuron in the same individual had broad but different response distribution properties. This tendency was consistent among individuals. This result indicates that the song element sequence is encoded at the population level, within broadly but differentially selective HVC neurons.
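A sketch of the selectivity index computation (the exact d′ formula is not restated in this excerpt; the definition below, d′ = 2(μ_a − μ_b)/√(σ_a² + σ_b²), is one common convention in the birdsong literature [8] and is our assumption):

```python
import numpy as np

def d_prime(rs_a, rs_b):
    """Selectivity index d'(RS_a/RS_b) between two sets of trial-by-trial
    response strengths. Assumed definition:
    d' = 2 * (mean_a - mean_b) / sqrt(var_a + var_b)."""
    rs_a = np.asarray(rs_a, dtype=float)
    rs_b = np.asarray(rs_b, dtype=float)
    return 2.0 * (rs_a.mean() - rs_b.mean()) / np.sqrt(rs_a.var() + rs_b.var())
```

Values above 1.0 then mark a unit as clearly more responsive to stimulus a than to stimulus b, matching the BOS-selectivity criterion used above.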

3.3 Population Dynamics Analysis

To analyze the information coding of song element sequences at the population level, we calculated the time course of population activity vectors, i.e. the set of instantaneous mean firing rates of each neuron in a 50 ms time window. Snapshots of population responses to stimuli are shown in the eight panels of Fig. 3A (n = 6, bird 2 of Fig. 2). Each point in a panel represents the population vector for one stimulus in the MDS space. The ellipses in the upper four panels indicate groups of vectors whose stimuli have the same first element, while the ellipses in the lower four panels indicate groups of vectors whose stimuli have the same second element. Note that the population activity vectors in the upper four panels are identical to those in the lower four panels; only the ellipses differ. Before the stimulus presentation ([-155 ms: -105 ms], upper and lower panels), only spontaneous activities were observed around the origin. After the first element presentation ([50 ms: 100 ms], upper panel), groups with the same first elements split apart. After the second element presentation ([131 ms: 181 ms], lower panel), groups with the same second elements still largely overlapped. In the next section, we show that confounded information, which represents the relation between the first and second elements, increased significantly at this timing. After sufficient time ([480 ms: 530 ms], upper and lower panels), the neurons returned to spontaneous activity. These results indicate that the population responses to the first and second elements are drastically different. Subsequently, we show that this overlap is derived from the information in the song element sequence.
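The embedding of population activity vectors can be sketched with classical (Torgerson) MDS [11] (an illustrative implementation; the paper's exact preprocessing is not specified in this excerpt):

```python
import numpy as np

def classical_mds(vectors, n_dims=2):
    """Classical (Torgerson) MDS: embed population activity vectors
    (stimuli x neurons) into n_dims dimensions so that pairwise Euclidean
    distances are preserved as well as possible."""
    X = np.asarray(vectors, dtype=float)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                   # centering matrix
    B = -0.5 * J @ D2 @ J                                 # double-centering
    w, V = np.linalg.eigh(B)                              # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n_dims]                    # keep the largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

When the input vectors already lie in a low-dimensional Euclidean space, this embedding reproduces their pairwise distances exactly, which is what makes the overlap of stimulus groups in the MDS panels interpretable.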

3.4 Information-Theoretic Analysis

To determine the origin of the overlap in the population response, we calculated the time course of mutual information between the stimulus and neural activity.


Fig. 3. Responses of HVC neurons at the population level (A) and encoded information (B)

The mutual information for ﬁrst elements I(S1 ; R), the second elements I(S2 ; R), and that of element pairs I(S1 , S2 ; R) was calculated within each time window; the window was shifted to analyze the temporal dynamics of information coding (left upper 3 graphs in Fig. 3B). Narrow lines in each graph indicate the cumulative trace of mutual information in each neuron. The thick line is the cumulative trace of all neurons in the individual. The bottom-left graph in Fig. 3B shows the probability of stimulus presentation. After the presentation of the ﬁrst elements, mutual information for the ﬁrst elements increased, showing a statistically signiﬁcant peak (P < 0.001). After the presentation of the second elements, mutual information for the second elements signiﬁcantly increased (P < 0.001). At the


same time, mutual information for element pairs also showed a significant peak (P < 0.001). Intuitively, information for element pairs I(S1, S2; R) would consist of information for the first elements I(S1; R) and second elements I(S2; R). However, the calculation of I(S1, S2; R) − I(S1; R) − I(S2; R) in each time window yields a statistically significant peak after the presentation of element pairs (P < 0.001; fourth graph from the left in Fig. 3B). The difference C represents the conditional mutual information between the first and second elements for a given neural response, otherwise known as confounded information [13]. Therefore, confounded information represents the relationship between the first and second elements encoded in the neural responses. The I(S1; R) peak occurred at the same time that groups of population vectors with the same first elements were splitting ([50 ms: 100 ms]). The peaks for I(S2; R), I(S1, S2; R), and C occurred during the same period that groups with the same second elements were still largely overlapping ([131 ms: 181 ms]). This indicates that the sequential information causes the overlap in the population response. In the population dynamics analysis, we cannot combine the data from different birds because each bird has a different number and different types of song elements. In the mutual information analysis, however, we can combine and average the data from different birds. The five graphs on the right in Fig. 3B show the time courses for I(S1; R), I(S2; R), I(S1, S2; R), C, and the stimulus presentation probability, calculated from all stable single units with BOS selectivity (n = 27, 12 birds). The combined mutual information for first elements was very similar to that from one bird, showing a significant peak after the presentation of the first elements (P < 0.001). Mutual information for second elements, element pairs, and confounded information also had significant peaks after the presentation of the second elements (P < 0.001).
These results show that the song element sequence is encoded into a neural ensemble in HVC by population coding.
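The information quantities used above can be sketched with plug-in estimates over discretized responses (an illustrative implementation; the paper's binning and bias-correction choices are not specified in this excerpt):

```python
import numpy as np
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * np.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def confounded_information(s1, s2, r):
    """C = I(S1,S2; R) - I(S1; R) - I(S2; R), as in [13]."""
    pairs = list(zip(s1, s2))
    return (mutual_information(pairs, r)
            - mutual_information(s1, r)
            - mutual_information(s2, r))
```

As a sanity check, when the response is the XOR of two independent binary elements, neither element alone is informative, yet the pair carries one full bit, so C = 1: exactly the signature of sequence (relational) coding that the confounded-information peak detects.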

4 Conclusion

In this study, we recorded auditory responses to all possible element pair stimuli from the Bengalese ﬁnch HVC. By determining the sequential response distributions for each neuron, we showed that each neuron in HVC has broad but diﬀerential response properties to song element sequences. The population dynamics analysis revealed that population activity vectors overlap after the presentation of element pairs. Using mutual information analysis, we demonstrated that this overlap in the population response is due to confounded information, namely, the sequential information of song elements. These results indicate that the song element sequence is encoded into the HVC microcircuit at the population level. Song element sequences are encoded in a neural ensemble with broad and differentially selective neuronal populations, rather than the chain-like model of diﬀerential TCS neurons.


Acknowledgment This study was partially supported by the RIKEN Brain Science Institute, and by a Grant-in-Aid for young scientists (B) No. 18700303 from the Japanese Ministry of Education, Culture, Sports, Science, and Technology.

References
1. Okanoya, K.: The Bengalese finch: a window on the behavioral neurobiology of birdsong syntax. Ann. N.Y. Acad. Sci. 1016, 724–735 (2004)
2. Doupe, A.J., Kuhl, P.K.: Birdsong and human speech: common themes and mechanisms. Annu. Rev. Neurosci. 22, 567–631 (1999)
3. Margoliash, D., Fortune, E.S.: Temporal and harmonic combination-selective neurons in the zebra finch's HVc. J. Neurosci. 12, 4309–4326 (1992)
4. Lewicki, M.S., Arthur, B.J.: Hierarchical organization of auditory temporal context sensitivity. J. Neurosci. 16, 6987–6998 (1996)
5. Drew, P.J., Abbott, L.F.: Model of song selectivity and sequence generation in area HVc of the songbird. J. Neurophysiol. 89, 2697–2706 (2003)
6. Deneve, S., Latham, P.E., Pouget, A.: Reading population codes: a neural implementation of ideal observers. Nat. Neurosci. 2, 740–745 (1999)
7. Pouget, A., Dayan, P., Zemel, R.: Information processing with population codes. Nat. Rev. Neurosci. 1, 125–132 (2000)
8. Green, D., Swets, J.: Signal Detection Theory and Psychophysics. Wiley, New York (1966)
9. Theunissen, F.E., Doupe, A.J.: Temporal and spectral sensitivity of complex auditory neurons in the nucleus HVc of male zebra finches. J. Neurosci. 18, 3786–3802 (1998)
10. Matsumoto, N., Okada, M., Sugase-Miyamoto, Y., Yamane, S., Kawano, K.: Population dynamics of face-responsive neurons in the inferior temporal cortex. Cereb. Cortex 15, 1103–1112 (2005)
11. Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–328 (1966)
12. Sugase, Y., Yamane, S., Ueno, S., Kawano, K.: Global and fine information coded by single neurons in the temporal visual cortex. Nature 400, 869–873 (1999)
13. Reich, D.S., Mechler, F., Victor, J.D.: Formal and attribute-specific information in primary visual cortex. J. Neurophysiol. 85, 305–318 (2001)

Spontaneous Voltage Transients in Mammalian Retinal Ganglion Cells Dissociated by Vibration Tamami Motomura, Yuki Hayashida, and Nobuki Murayama Graduate School of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Kumamoto 860-8555, Japan [email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp

Abstract. We recently developed a new method to dissociate neurons from mammalian retinae by combining low-Ca2+ tissue incubation with the vibrodissociation technique, without the use of enzymes. The retinal ganglion cell somata dissociated by this method showed spontaneous voltage transients (sVT) with a fast rise and a slower decay. In this study, we analyzed the characteristics of these sVT in cells under the perforated-patch whole-cell configuration, as well as in a single-compartment cell model. The sVT varied in amplitude in a quantal manner and reversed in polarity around −80 mV in normal physiological saline. The reversal potential of the sVT shifted with the K+ equilibrium potential, indicating the involvement of some K+ conductance. Based on the model, the conductance changes responsible for producing the sVT were largely independent of the membrane potential below −50 mV. These results suggest the presence of isolated, inhibitory presynaptic terminals attached to the ganglion cell somata. Keywords: Neuronal computation, dissociated cells, retina, patch-clamp, neuron model.

1 Introduction Elucidating the functional role of single neurons in neural information processing is difficult because neuronal computation is highly nonlinear and adaptive, and depends on combinations of many parameters, e.g., the ionic conductances, the intracellular signaling, their subcellular distributions, and the cell morphology. Furthermore, interactions with surrounding neurons and glia can alter those factors, and thereby prevent us from examining them separately. This can be overcome by pharmacologically or physically isolating neurons from their circuits. One can use pharmacological agents that block synaptic transmission in situ, although it is hard to know whether such agents have unintended side effects. Alternatively, one can dissociate neural tissue into single neurons by means of enzymatic digestion and mechanical trituration. Dissociated single neurons often lose their fine neurites and synaptic contacts with other cells during the dissociation procedure, and thus are useful for examining the properties of ionic conductances at known membrane potentials [3]. Unfortunately, however,
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 64–72, 2008. © Springer-Verlag Berlin Heidelberg 2008


several studies have demonstrated that proteolytic enzymes employed for cell dissociation can distort the amplitude, kinetics, localization, and pharmacological properties of ionic currents, e.g., [2]. These observations led to attempts to isolate neurons by enzyme-free, mechanical means. Recently, we developed a new protocol for dissociating single neurons from specific layers of mammalian retinae without the use of any proteolytic enzymes [12], combining low-Ca2+ tissue incubation [9] with the vibrodissociation technique [15], which has been applied to slices of the brain and spinal cord [1]. The somata of ganglion cells dissociated by our method showed spontaneous voltage transients (sVT) with a fast rise and slower decay [8]. To our knowledge, such sVT have never been reported in previous studies on retinal ganglion cells dissociated with or without enzymes [9]. Therefore, in this study, we analyzed the characteristics of these sVT in cells under the perforated-patch whole-cell configuration, as well as in a single-compartment cell model. The present results could suggest the presence of inhibitory presynaptic terminals attached to the ganglion cell somata we recorded from, as demonstrated in previous studies on vibrodissociated neurons of the brain and spinal cord [1]. If this is the case, the retinal neurons dissociated by our method would be advantageous for investigating the mechanisms of transmitter release in single, tiny synaptic boutons, even when isolated from axons and neurites.

2 Methods All animal care and experimental procedures in this study were approved by the committee for animal research of Kumamoto University. 2.1 Cell Dissociation The neural retinas were isolated from two freshly enucleated eyes of Wistar rats (P7-P25), cut into 2-4 pieces each, and briefly kept in chilled extracellular "bath" solution. This solution contained (in mM): 140 NaCl, 3.5 KCl, 1 MgCl2, 2.5 CaCl2, 10 D-glucose, 5 HEPES. The pH was adjusted to 7.3 with NaOH. A retinal piece was then placed photoreceptor-side down in a culture dish, covered with 0.4 ml of chilled, low-Ca2+ solution, and incubated for 3-5 min. This low-Ca2+ solution contained (in mM): 140 sucrose, 2.5 KCl, 70 CsOH, 20 NaOH, 1 NaH2PO4, 15 CaCl2, 20 EDTA, 11 D-glucose, 15 HEPES. The estimated free Ca2+ concentration was 100–200 nM. The pH was adjusted to 7.2 with HCl. After the incubation, a fire-blunted glass pipette, vibrating horizontally with an amplitude of 0.2-0.5 mm at 100 Hz, was applied to the flattened surface of the retina under visual control through the microscope, so that cells were dissociated mainly from the ganglion cell layer and only minimally from the inner and outer nuclear layers. After removing the remaining retinal tissue, the culture dish was filled with the bath solution and left on a vibration-isolation table for 15-40 min to allow the cells to settle. The bath solution was then replaced by a fresh aliquot supplemented with 1 mg/ml bovine serum albumin, and the dissociated cells were maintained at room temperature (20-25 °C) for 2-18 hrs prior to the electrophysiological recordings described below. The ganglion cells were identified


based on the size criteria [6]. Nearly all the cells we recorded from in voltage-/current-clamp showed large-amplitude voltage-gated Na+ currents and/or action potentials (see Fig. 1A-B), verifying that they were ganglion cells [4]. 2.2 Electrophysiology Since previous studies demonstrated that the membrane conductances of retinal ganglion cells can be modulated by intracellular messengers, e.g., Zn2+ [13] and cAMP [7], all recordings presented here were performed in perforated-patch whole-cell mode [9] to maintain cytoplasmic integrity. Patch electrodes were pulled from borosilicate glass capillaries to tip resistances of approximately 4-8 MΩ. The tips of the electrodes were filled with a recording "electrode" solution that contained (in mM): 110 K-D-gluconic acid, 15 KCl, 15 NaOH, 2.6 MgCl2, 0.34 CaCl2, 1 EGTA, 10 HEPES. The pH was adjusted to 7.2 with methanesulfonic acid. The shanks of the electrodes were filled with this solution after the addition of amphotericin B as the perforating agent (260 μg/ml, with 400 μg/ml Pluronic F-127). The recordings were made after the series resistance in the perforated-patch configuration reached a stable value (typically 20-40 MΩ, ranging from 10 to 100 MΩ). In the fast current-clamp mode of the amplifier (EPC-10, Heka), the voltage monitor output was analog-filtered by the built-in Bessel filters (3-pole 10–30 kHz followed by 4-pole 2-5 kHz) and digitally sampled (5–20 kHz). The voltage drop across the series resistance was compensated by the built-in circuitry. The recording bath was grounded via an agar bridge, and the bath solution was continuously superfused over each cell recorded from, at a constant flow rate (0.4 ml/min). The volume of solution in the recording chamber was kept at about 2 ml. To apply a high-K+ solution (Fig. 2B), 8 mM NaCl in the bath solution was replaced by equimolar KCl. An enzyme solution was made by supplementing the bath solution with 0.25 mg/ml papain and 2.5 mM L-cysteine.
All experiments were performed at room temperature.
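The voltage drop across the series resistance that the built-in circuitry compensates is simply I·Rs. A minimal sketch of the arithmetic, with hypothetical values inside the stated 20-40 MΩ range:

```python
def series_resistance_error_mV(i_pA, rs_MOhm):
    """Voltage drop I*Rs across the pipette series resistance.

    1 pA through 1 MOhm is 1e-12 A * 1e6 Ohm = 1e-6 V = 1e-3 mV.
    """
    return i_pA * rs_MOhm * 1e-3

# A 30-pA current step through a typical 30-MOhm series resistance
# offsets the voltage reading by about 0.9 mV if left uncompensated.
err = series_resistance_error_mV(30.0, 30.0)  # -> 0.9 mV
```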

3 Results Perforated-patch whole-cell recordings were made from the somata of ganglion cells dissociated by our recently developed protocol (see Methods), which offered quantitative measurements of the intrinsic membrane properties with minimal distortion due to enzymatic proteolysis [2]. Conversely, since these cells had never been exposed to any enzyme, they were useful for examining the effects of the enzymes utilized for cell dissociation in previous studies. In fact, the spike firing of the ganglion cells in response to constant current injection via the patch electrode (30-pA step in the positive direction) was irreversibly altered when the enzyme solution was superfused over those cells (n=3): 1) The resting potential depolarized by 5-20 mV and the spike firing diminished during the enzyme application; 2) When the enzyme was washed out of the recording chamber, the resting potential gradually hyperpolarized to near the original level and the spike firing partially recovered;

Fig. 1. Spontaneous voltage transients (sVT) observed in the dissociated retinal ganglion cells. A: Microphotograph of the cell recorded from. Note that the soma is larger than 15 μm in diameter. B: Membrane potential changes in response to step-wise constant current injections. Four traces are superimposed. The injected current was 10 pA in the negative direction and 10, 20, and 30 pA in the positive direction. C: Spontaneous hyperpolarizations under current-clamp. The recordings were made for 50 sec in three different episodes, with breaks of 12 sec between the first and second episodes and 6 sec between the second and third. A constant current (2 pA in the positive direction) was injected to hold the membrane potential at around −70 mV (dashed gray line). Inset: Examples of sVT on an expanded time scale. Five events are recognized. Three of them are similar in amplitude and time course, and the other two have roughly half and a quarter of the largest amplitude.

3) After ~20 min of washing out the enzyme, the spike firing reached a steady state at which the interval between the first and second spikes in response to the current step was shorter than that before the enzyme application, by 40 ± 11 % (mean ± S.E.) (not shown, [12]). These results suggest that, in previous studies on isolated retinal ganglion cells, some of the ionic channels could have been significantly distorted during the dissociation procedure because of the use of proteolytic enzymes. Moreover, we found spontaneous voltage transients (sVT) with a fast rise and slower decay in the retinal ganglion cell somata dissociated by our method [8]. Fig. 1C shows an example of sVT recorded from the cell shown in Fig. 1A. As shown in the figure, transient hyperpolarizations appeared spontaneously under a constant current injection. Most of these hyperpolarizations were similar in amplitude and time course at a given membrane potential (−70 mV here), and in some, the peak amplitude was roughly half or a quarter (or one-eighth, in other cells) of the largest one (Inset). Such sVT appeared in 10-20 % of the cells we recorded from, and could be observed in particular cells for as long as we maintained the recordings (0.5-2 hrs). When the enzyme solution was superfused over one of those cells, the sVT disappeared completely and were not seen again, even after 20 min of washing out the enzyme.

Fig. 2. Reversal potential of sVT. A, B: The sVT recorded in saline containing extracellular K+ of 3.5 mM (A) and 11.5 mM (B). The basal membrane potential (indicated by arrows) was varied by injecting holding currents ranging between −8 and +8 pA in A and between −8 and +12 pA in B. C: Plots of the peak amplitude versus basal potential. Only the events with the largest amplitude (see Fig. 1C) are taken into account. Note that the amplitudes of depolarizations are plotted as negative values, and vice versa. The filled circles and open circles represent the data for 3.5-mM K+ and 11.5-mM K+, respectively.

As shown in Fig. 1C, the sVT were all recorded as hyperpolarizations when the cell was held at approximately −70 mV. Thus, the reversal potential of the ionic current producing the sVT should be below this voltage. In Fig. 2, the reversal potential for sVT was measured by holding the basal membrane potential at different levels under current-clamp, c.f. [5]. As expected, the polarity of the sVT reversed around −80 mV when the basal membrane potential was varied from about −100 to −40 mV (Fig. 2A). Based on the ionic compositions of the bath and electrode solutions used in this recording, the equilibrium potential of K+ (EK) was estimated to be about −90 mV, close to the reversal potential for sVT. When EK was shifted by +30 mV, i.e., from about −90 to −60 mV, by applying the high-K+ solution (see Methods), the polarity of the sVT reversed between −54 and −39 mV of the basal membrane potential (Fig. 2B). Fig. 2C plots the peak amplitude of sVT versus the basal membrane potential. The linear regressions on these plots (gray lines) crossed the abscissa (dashed line) at approximately −76 mV and −48 mV for 3.5-mM K+ and 11.5-mM K+, respectively, showing a shift of the reversal potential parallel to the EK shift. Similar results were obtained in two other cells. These results indicate that the ionic conductance responsible for producing the sVT is permeable to at least K+. In the present experiments, we made recordings from cells without neurites or with neurites shorter than about 10 μm. Therefore, those cells can be modeled as the single compartment shown in Fig. 3A. In this model, the unknown conductance responsible for producing sVT and its reversal potential are represented by "gx" and "Ex", respectively. The membrane properties intrinsic to the cell are represented by the membrane capacitance Cm, the nonlinear conductance gm, and the apparent reversal potential Em. Here, Cm was measured with the capacitance compensation circuitry of
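The EK estimates quoted above follow directly from the Nernst equation and the stated solution compositions. The sketch below assumes the intracellular K+ is set by the electrode solution (110 mM K-gluconate + 15 mM KCl = 125 mM) and ignores activity coefficients:

```python
import math

def nernst_mV(c_out_mM, c_in_mM, temp_C=22.0, z=1):
    """Nernst equilibrium potential E = (RT/zF) ln(c_out/c_in), in mV."""
    rt_over_zf_mV = 8.314 * (273.15 + temp_C) / (z * 96485.0) * 1000.0
    return rt_over_zf_mV * math.log(c_out_mM / c_in_mM)

k_in = 110.0 + 15.0                  # K+ in the electrode solution (mM)
ek_normal = nernst_mV(3.5, k_in)     # 3.5 mM bath K+  -> about -90 mV
ek_high_k = nernst_mV(11.5, k_in)    # 11.5 mM bath K+ -> about -60 mV
```

Replacing 8 mM NaCl with KCl (3.5 → 11.5 mM K+) thus shifts EK by roughly +30 mV, matching the shift of the sVT reversal potential.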

Fig. 3. Conductance changes during sVT. A: Single-compartment model of the isolated somata. ICm, the current through Cm; Igm, the current through gm; Igx, the current through gx. B: Voltage responses to the injected current steps. The amplitude of the current steps (Iinj) was varied from −10 to +10 pA in 5-pA increments, and the corresponding voltage changes (Vm), from the bottom (black) to the top (light gray), were recorded. C: Plots of the membrane potential versus the amplitude of the current steps. The voltage was measured at the time points indicated by the marks in B (circle, square, triangle, rhombus, and hexagon). The solid line shows the best fit to the plots with a single exponential function. D: Voltage dependency of gm calculated from the plots in C. The derivative of the current with respect to the membrane potential gave the slope conductance gm, which could be approximated by a hyperbolic function, gm = Iα / (Vα − Vm), where Vm < Vα.

Region-Based Encoding Method Using Multi-dimensional Gaussians

n (> 2) GRFs are used. The center of the ith GRF is set to μi = Imin + ((2i − 3)/2)((Imax − Imin)/(n − 2)). One GRF is placed outside the range at each of the two ends. All the GRFs encoding an input variable have the same width, σ = (1/γ)((Imax − Imin)/(n − 2)), where γ controls the extent of overlap between the GRFs. For an input variable with value a, the activation value of the ith GRF with center μi and width σ is given by

fi(a) = exp(−(a − μi)^2 / (2σ^2))    (3)

The firing time of the neuron associated with this GRF is inversely related to fi(a). For a highly stimulated GRF, with fi(a) close to 1.0, the firing time t = 0 milliseconds is assigned. When the activation value of the
GRF is small, the firing time t is high, indicating that the neuron fires later. In our experiments, the firing time of an input neuron is chosen to be in the range 0 to 9 milliseconds. While converting the activation values of GRFs into firing times, a coding threshold is imposed on the activation value. A GRF that gives an activation value less than the coding threshold is marked as not-firing (NF), and the corresponding input neuron does not contribute to the membrane potential of the post-synaptic neuron. The range-based encoding method is illustrated in Fig. 1(c). For multi-variate data, each variable is encoded separately, so the correlation present among the variables is not used. In this encoding method, 1-D GRFs are uniformly placed along an input dimension, without considering the distribution of the data. Hence, when the data is sparse, some GRFs, placed in regions where no data is present, are not effectively used. This results in a high neuron count and computational cost. The widths of the GRFs are derived without using any knowledge of the data distribution, except for the range of values that the variables take. Taking one GRF along an input dimension and quantizing its activation value results in the formation of intervals within the range of values of that variable, such that one or more intervals are mapped onto a particular quantization level. When 2-D data is encoded by taking an array of 1-D GRFs along each input dimension, the input space is quantized into rectangular grids such that all the input patterns falling into a particular rectangular grid have the same vector of quantization levels, and hence the same encoded time vector. Additionally, one or more rectangular grids may have the same encoded time vector. For multivariate data, the input space is divided into hypercuboids. To demonstrate this, the single-ring data (shown in Fig.
2(a)) is encoded by placing 5 GRFs along each dimension, dividing the input space into grids as shown in Fig. 2(b). A 10-2 MDSNN, having 10 neurons in the input layer and 2 neurons in the output layer, is trained to cluster this data. The space of data points as represented by the output layer neurons is shown in Fig. 2(c). The cluster boundary is observed to be a combination of linear segments defined by the rectangular grid boundaries formed by encoding. The shape of the boundary formed by the MDSNN is significantly different from the desired circle-shaped boundary between the two clusters in the single-ring data. Increasing the number of GRFs used for encoding each dimension may give a boundary that is a combination of smaller linear segments, at the expense of a high neuron count. However, this may not result in proper clustering of the data, as the choice of the number of GRFs is observed to be crucial in the range-based encoding method. When the range-based encoding method is used along with the varying-threshold method and the multi-stage learning method [15] to cluster complex data sets such as the double-ring data and the spiral data, it is observed that proper subclusters are not formed by the neurons in the hidden layer. As each dimension is encoded separately, spatially disjoint subsets of data points that have similar encoding along a particular dimension, as shown by the marked regions in Fig. 3, are found to be represented by a single neuron in the hidden


Fig. 2. Clustering the single-ring data encoded using the range-based encoding method: (a) The single-ring data, (b) data space quantization due to the range-based encoding, and (c) data space representation by the output neurons


Fig. 3. Improper subclusters formed when the data is encoded with the range-based encoding method for (a) the double-ring data and (b) the spiral data

layer. This binding is observed to form during the initial iterations of learning, when the firing thresholds of the neurons are low. The established binding cannot be unlearnt in subsequent iterations, leading to improper clustering at the output layer. An encoding method that overcomes the above-discussed limitations is proposed in the next section.
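The range-based scheme of [14] reviewed in this section can be sketched as follows. The rounding of activation values to integer firing times and the 0.1 coding threshold are assumptions of the sketch; the text only states that times lie in 0-9 ms and that weakly stimulated GRFs are marked NF:

```python
import numpy as np

def grf_centers_width(i_min, i_max, n, gamma=1.5):
    """Centers and shared width of n (> 2) 1-D GRFs; one GRF falls outside
    the data range at each end, as described above."""
    assert n > 2
    i = np.arange(1, n + 1)
    centers = i_min + (2 * i - 3) / 2.0 * (i_max - i_min) / (n - 2)
    sigma = (1.0 / gamma) * (i_max - i_min) / (n - 2)
    return centers, sigma

def encode_1d(a, centers, sigma, t_max=9, threshold=0.1):
    """Map activations f_i(a) (Eq. 3) to firing times in [0, t_max] ms;
    GRFs below the coding threshold are marked not-firing (None)."""
    f = np.exp(-(a - centers) ** 2 / (2.0 * sigma ** 2))
    t = np.round((1.0 - f) * t_max)  # f close to 1 -> fires at t = 0 ms
    return [None if fi < threshold else int(ti) for fi, ti in zip(f, t)]

centers, sigma = grf_centers_width(0.0, 1.0, n=5)
times = encode_1d(0.3, centers, sigma)
```

For multi-variate data, each variable is run through this encoder independently, which is exactly why the resulting quantization is a rectangular grid.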

4 Region-Based Encoding Using Multi-dimensional Gaussian Receptive Fields

Using multi-dimensional GRFs for encoding helps in capturing the correlation present in the data. One approach would be to uniformly place the multi-dimensional GRFs covering the whole range of the input data space. However, this results in an exponential increase in the number of neurons in the input layer with the dimensionality of the data. To circumvent this, we propose a region-based encoding method that places the multi-dimensional GRFs only in the data-inhabited regions, i.e., the regions where data is present. The mean vectors and the covariance matrices of these GRFs are computed from the data in the regions, thus capturing the correlation present in the data. To identify the data-inhabited regions in the input space, first k-means clustering is performed on the data to be clustered, with the value of k being larger than the number of actual clusters. On each of the regions identified using the k-means clustering method, a multi-dimensional GRF is placed by computing the mean vector and the covariance matrix from the data in that region. The response of the ith GRF to a multi-variate input pattern a is computed as

fi(a) = exp(−(1/2) (a − μi)^T Σi^(−1) (a − μi)),    (4)

where μi and Σi are the mean vector and the covariance matrix of the ith GRF, respectively, and fi(a) is the activation value of that GRF. As discussed in Section 3, these activation values are translated into firing times in the range 0 to 9 milliseconds, and the non-optimally stimulated input neurons are marked as NF. By deriving the covariance of each GRF from the data, the region-based encoding method captures the correlation present in the data. The regions identified by k-means clustering and the data space quantization resulting from this encoding method, for the single-ring data used in Section 3, are shown in Fig. 4(a) and 4(b), respectively. The boundary given by the MDSNN with the region-based encoding method is shown in Fig. 4(c). This boundary is much closer to the desired circle-shaped boundary, as against the combination of linear segments observed with the range-based encoding method (see Fig. 2(c)).
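A compact sketch of the whole region-based pipeline — k-means regions, per-region mean and covariance, Eq. (4) activations, firing times — might look like this. The plain k-means implementation, the ring-shaped toy data, the covariance regularization, and the time quantization are all assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Plain k-means; returns a region label per data point."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def fit_region_grfs(X, k):
    """One multi-dimensional GRF (mean vector, inverse covariance) per region."""
    labels = kmeans(X, k)
    d = X.shape[1]
    grfs = []
    for j in range(k):
        Xj = X[labels == j]
        cov = np.cov(Xj.T) if len(Xj) > d else np.eye(d)
        cov = cov + 1e-6 * np.eye(d)       # regularize for invertibility
        grfs.append((Xj.mean(axis=0), np.linalg.inv(cov)))
    return grfs

def encode(a, grfs, t_max=9, threshold=0.1):
    """Eq. (4) activations mapped to firing times; sub-threshold GRFs -> NF."""
    times = []
    for mu, cov_inv in grfs:
        diff = a - mu
        f = np.exp(-0.5 * diff @ cov_inv @ diff)
        times.append(None if f < threshold else int(round((1.0 - f) * t_max)))
    return times

# Hypothetical single-ring-like 2-D data, encoded with k = 8 region GRFs (cf. Fig. 4).
theta = rng.uniform(0.0, 2.0 * np.pi, 400)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((400, 2))
grfs = fit_region_grfs(X, k=8)
spike_times = encode(X[0], grfs)
```

The Mahalanobis form in `encode` is what lets the receptive fields stretch along the local data distribution instead of being axis-aligned.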


Fig. 4. (a) Regions identiﬁed by k-means clustering with k = 8, (b) data space quantization due to the region-based encoding and (c) data space representation by the neurons in the output layer

Next, we study the performance of the proposed region-based encoding method in clustering complex 2-D and 3-D data sets. For the double-ring data, k-means clustering is performed with k = 20, and the resulting regions are shown in Fig. 5(a). Over each of the regions, a 2-D GRF is placed to encode the data. A 20-8-2 MDSNN is trained using the multi-stage learning method discussed in Section 2. It is observed that, out of the 8 neurons in the hidden layer, 3 neurons do not win for any of the training examples, and the data is represented by the remaining 5 neurons, as shown in Fig. 5(b). These 5 neurons provide the input in the second stage of learning to form the final clusters (Fig. 5(c)). The resulting cluster boundaries are seen to follow the data distribution, as shown in Fig. 5(d). Similarly, the spiral data is encoded using 40 2-D GRFs. The regions of data identified using the k-means clustering method are shown in Fig. 6(a). A 40-20-2 MDSNN is trained to cluster the spiral data. As shown in Fig. 6(b), 14 subclusters are formed in the hidden layer, and these are combined in the next layer to form the final clusters, as shown in Fig. 6(c). The region-based encoding method helps in proper subcluster formation at the hidden layer (Fig. 5(b) and Fig. 6(b)), as against the range-based encoding method (Fig. 3). The proposed method is also used to cluster 3-D data sets, namely the interlocking donuts data and the 3-D ring data. The interlocking donuts data is


Fig. 5. Clustering the double-ring data: (a) Regions identiﬁed by k-means clustering with k = 20, (b) subclusters formed at the hidden layer, (c) clusters formed at the output layer and (d) data space representation by the neurons in the output layer


Fig. 6. Clustering the spiral data: (a) Regions identiﬁed by k-means clustering with k = 40, (b) subclusters formed at the hidden layer, (c) clusters formed at the output layer and (d) data space representation by the neurons in the output layer


Fig. 7. Clustering of the interlocking donuts data and the 3-D ring data: (a) Regions identiﬁed by k-means clustering on the interlocking donuts data with k = 10 and (b) clusters formed at the output layer. (c) Regions identiﬁed by k-means clustering on the 3-D ring data with k = 5 and (d) clusters formed at the output layer.

encoded with 10 3-D GRFs, and a 10-2 MDSNN is trained to cluster this data. The k-means clustering results and the final clusters formed by the MDSNN are shown in Fig. 7(a) and (b), respectively. The clustering results for the 3-D ring data with the proposed encoding method are shown in Fig. 7(c) and 7(d). For comparison, the performance of the range-based encoding method and the region-based encoding method for different data sets is presented in Table 1. It is observed that the region-based encoding method outperforms the range-based encoding method in clustering complex data sets like the double-ring data and the spiral data. For the cases where both methods give the same or almost the same performance, the number of neurons used in the input layer is given in parentheses. It is observed that the region-based encoding method always maintains a low neuron count, thereby reducing the computational cost. The difference between the neuron counts for the two methods may look small for these 2-D and 3-D data sets. However, as the dimensionality of the data increases, this


Table 1. Comparison of the performance (in %) of MDSNNs using the range-based encoding method and the region-based encoding method for clustering. The numbers in parentheses give the number of neurons in the input layer.

Data set                     Range-based encoding   Region-based encoding
Double-ring data             74.82                  100.00
Spiral data                  66.18                  100.00
Single-ring data             100.00 (10)            100.00 (8)
Interlocking cluster data    99.30 (24)             100.00 (6)
3-D ring data                100.00 (15)            100.00 (5)
Interlocking donuts data     97.13 (21)             100.00 (10)

difference can be significant. From these results, it is evident that the proposed encoding method scales well to higher-dimensional data clustering problems, while keeping a low neuron count. Additionally, and more importantly, the nonlinear cluster boundaries given by the region-based encoding method follow the distribution of the data, i.e., the shapes of the clusters.
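To make the scaling argument concrete, compare input-layer neuron counts for the three placement strategies discussed in this paper; the d = 10, n = 5, k = 20 figures below are hypothetical:

```python
def range_based_count(n, d):
    """n 1-D GRFs along each of d input dimensions."""
    return n * d

def uniform_grid_count(n, d):
    """Multi-dimensional GRFs tiling the whole input space, n per dimension."""
    return n ** d

def region_based_count(k):
    """One multi-dimensional GRF per data-inhabited region."""
    return k

counts = (range_based_count(5, 10), uniform_grid_count(5, 10), region_based_count(20))
# -> (50, 9765625, 20): the uniform grid explodes exponentially with d,
# while the region-based count stays at k regardless of dimensionality.
```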

5 Conclusions

In this paper, we have proposed a new encoding method using multi-dimensional GRFs for MDSNNs. We have demonstrated that the proposed encoding method effectively uses the correlation present in the data and positions the GRFs in the data-inhabited regions. We have also shown that the proposed method results in a low neuron count, as opposed to the encoding method proposed in [14] and the simple approach of placing multi-dimensional GRFs covering the whole data space. This in turn results in a low computational cost for clustering. With the encoding method proposed in [14], the cluster boundaries obtained for nonlinearly separable data are observed to be combinations of linear segments, and the MDSNN failed to cluster the double-ring data and the spiral data. We have experimentally shown that, with the proposed encoding method, MDSNNs can cluster complex data like the double-ring data and the spiral data, while giving smooth nonlinear boundaries that follow the data distribution. In the existing range-based encoding method, when the data consists of clusters with different scales, i.e., narrow and wider clusters, GRFs with different widths are used; this technique is called multi-scale encoding. In the region-based encoding method, however, the widths of the multi-dimensional GRFs are automatically computed from the data-inhabited regions, and these widths can differ from region to region. In the proposed method, for clustering the 2-D and 3-D data, the value of k is decided empirically and the formation of subclusters at the hidden layer is verified visually. For higher-dimensional data, however, it is necessary to ensure the formation of subclusters automatically.


L.N. Panuku and C.C. Sekhar

References

1. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Englewood Cliffs (1998)
2. Kumar, S.: Neural Networks: A Classroom Approach. Tata McGraw-Hill, New Delhi (2004)
3. Maass, W.: Networks of Spiking Neurons: The Third Generation of Neural Network Models. Trans. Soc. Comput. Simul. Int. 14(4), 1659–1671 (1997)
4. Bi, Q., Poo, M.: Precise Spike Timing Determines the Direction and Extent of Synaptic Modifications in Cultured Hippocampal Neurons. J. Neuroscience 18, 10464–10472 (1998)
5. Maass, W., Bishop, C.M.: Pulsed Neural Networks. MIT Press, London (1999)
6. Gerstner, W., Kistler, W.M.: Spiking Neuron Models. Cambridge University Press, Cambridge (2002)
7. Maass, W.: Fast Sigmoidal Networks via Spiking Neurons. Neural Computation 9, 279–304 (1997)
8. Verstraeten, D., Schrauwen, B., Stroobandt, D., Campenhout, J.V.: Isolated Word Recognition with the Liquid State Machine: A Case Study. Information Processing Letters 95(6), 521–528 (2005)
9. Bohte, S.M., Kok, J.N., Poutre, H.L.: SpikeProp: Error-Backpropagation in Temporally Encoded Networks of Spiking Neurons. Neurocomputing 48, 17–37 (2002)
10. Natschläger, T., Ruf, B.: Spatial and Temporal Pattern Analysis via Spiking Neurons. Network: Comp. Neural Systems 9, 319–332 (1998)
11. Ruf, B., Schmitt, M.: Unsupervised Learning in Networks of Spiking Neurons using Temporal Coding. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 361–366. Springer, Heidelberg (1997)
12. Hopfield, J.J.: Pattern Recognition Computation using Action Potential Timing for Stimulus Representations. Nature 376, 33–36 (1995)
13. Gerstner, W., Kempter, R., van Hemmen, J.L., Wagner, H.: A Neuronal Learning Rule for Sub-millisecond Temporal Coding. Nature 383, 76–78 (1996)
14. Bohte, S.M., Poutre, H.L., Kok, J.N.: Unsupervised Clustering with Spiking Neurons by Sparse Temporal Coding and Multilayer RBF Networks. IEEE Transactions on Neural Networks 13, 426–435 (2002)
15. Panuku, L.N., Sekhar, C.C.: Clustering of Nonlinearly Separable Data using Spiking Neural Networks. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D. (eds.) ICANN 2007. LNCS, vol. 4668. Springer, Heidelberg (2007)

Firing Pattern Estimation of Biological Neuron Models by Adaptive Observer

Kouichi Mitsunaga 1, Yusuke Totoki 2, and Takami Matsuo 2

1 Control Engineering Department, Oita Institute of Technology, Oita, Japan
2 Department of Architecture and Mechatronics, Oita University, 700 Dannoharu, Oita, 870-1192, Japan

Abstract. In this paper, we present three adaptive observers that use only the membrane potential measurement, under the assumption that some of the parameters of the HR neuron are known. Using strict positive realness and Yu's stability criterion, we show the asymptotic stability of the error systems. The estimators allow us to recover the internal states and to distinguish the firing patterns from early-time dynamic behaviors.

1 Introduction

In traditional artificial neural networks, the neuron behavior is described only in terms of firing rate, while most real neurons, commonly known as spiking neurons, transmit information by pulses, also called action potentials or spikes. Model studies of neuronal synchronization can be separated into those where models of the integrate-and-fire type are used and those where conductance-based spiking and bursting models are employed [1]. Bursting occurs when neuron activity alternates, on a slow time scale, between a quiescent state and fast repetitive spiking. In any study of neural network dynamics, two issues are crucial: 1) what model describes the spiking dynamics of each neuron, and 2) how the neurons are connected [3]. Izhikevich considered the first issue and compared various models of spiking neurons. He reviewed 20 types of real (cortical) neuron responses to the injection of simple dc pulses, such as tonic spiking, phasic spiking, tonic bursting, and phasic bursting. Through his simulations, he suggested that if the goal is to study how the neuronal behavior depends on measurable physiological parameters, such as the maximal conductances, steady-state (in)activation functions, and time constants, then a Hodgkin-Huxley-type model is the best choice. However, its computational cost is the highest of all the models. He also pointed out that the Hindmarsh-Rose (HR) model is computationally simple and capable of producing the rich firing patterns exhibited by real biological neurons. Although the HR model is a computational model of neuronal bursting built from three coupled first-order differential equations [5,6], it can generate tonic spiking, phasic spiking, and so on, for different parameter values in the model equations. Carroll showed in simulation that additive noise shifts the neuron model into a two-frequency region (i.e., bursting) and that the slow part of the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 83–92, 2008.
© Springer-Verlag Berlin Heidelberg 2008


responses remains robust to the added noise in the HR model [7]. The parameters in the model equations are important in deciding the dynamic behaviors of the neuron [12]. From a measurement-theoretic point of view, it is important to estimate the states and parameters from measurement data, because extracellular recordings are common practice in neurophysiology and often represent the only way to measure the electrical activity of neurons [8]. Tokuda et al. applied an adaptive observer to estimate the parameters of an HR neuron by using membrane potential data recorded from a single lateral pyloric neuron synaptically isolated from other neurons [13]. However, their observer cannot guarantee the asymptotic stability of the error system. Steur [14] pointed out that the HR equations cannot be transformed into the adaptive observer canonical form, so it is not possible to make use of the adaptive observer proposed by Marino [10]. He simplified the three-dimensional HR equations and wrote them as a one-dimensional system with an exogenous signal using the contracting and wandering dynamics technique. His adaptive observer with a first-order differential equation cannot estimate the internal states of HR neurons. We have recently presented adaptive observers with full-state measurement and with the membrane potential measurement [15]. However, the estimates of the states by the observer with output measurement are not accurate enough to recover the immeasurable internal states. In this paper, we present three adaptive observers that use the membrane potential measurement under the assumption that some of the parameters of the HR neuron are known. Using the Kalman-Yakubovich lemma, we show the asymptotic stability of the error systems based on standard adaptive control theory [11]. The estimators allow us to recover the internal states and to distinguish the firing patterns from early-time dynamic behaviors. MATLAB simulations demonstrate the estimation performance of the proposed adaptive observers.

2 Review of Real (Cortical) Neuron Responses

There are many types of cortical neuron responses. Izhikevich reviewed 20 of the most prominent features of biological spiking neurons, considering the injection of simple dc pulses [3]. Typical responses are classified as follows [4]:

– Tonic Spiking (TS): The neuron fires a spike train as long as the input current is on. This kind of behavior can be observed in three types of cortical neurons: regular spiking excitatory neurons (RS), low-threshold spiking neurons (LTS), and fast spiking inhibitory neurons (FS).
– Phasic Spiking (PS): The neuron fires only a single spike at the onset of the input.
– Tonic Bursting (TB): The neuron fires periodic bursts of spikes when stimulated. This behavior may be found in chattering neurons in cat neocortex.
– Phasic Bursting (PB): The neuron fires only a single burst at the onset of the input.

Firing Pattern Estimation of Biological Neuron Models

85

– Mixed Mode (Bursting Then Spiking) (MM): The neuron fires a phasic burst at the onset of stimulation and then switches to the tonic spiking mode. The intrinsically bursting excitatory neurons in mammalian neocortex may exhibit this behavior.
– Spike Frequency Adaptation (SFA): The neuron fires tonic spikes with decreasing frequency. RS neurons usually exhibit adaptation of the interspike intervals, where these intervals increase until a steady state of periodic firing is reached, while FS neurons show no adaptation.

3 Single Model of HR Neuron

The Hindmarsh-Rose (HR) model is computationally simple and capable of producing the rich firing patterns exhibited by real biological neurons.

3.1 Dynamical Equations

The single model of the HR neuron [1,5,6] is given by

\[
\dot x = a x^2 - x^3 - y - z + I, \qquad
\dot y = (a + \alpha) x^2 - y, \qquad
\dot z = \mu (b x + c - z),
\]

where x represents the membrane potential, y and z are associated with the fast and slow currents, respectively, I is an applied current, and a, α, μ, b and c are constant parameters. We rewrite the single HR neuron in vectorized form:

\[
(S_0):\quad \dot w = h(w) + \Xi(x, z)\,\theta,
\]

where

\[
w = \begin{bmatrix} x \\ y \\ z \end{bmatrix}, \qquad
h(w) = \begin{bmatrix} -(x^3 + y + z) \\ -y \\ 0 \end{bmatrix}, \qquad
\Xi(x, z) = \begin{bmatrix} x^2 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & x^2 & 0 & 0 & 0 \\ 0 & 0 & 0 & x & 1 & -z \end{bmatrix},
\]
\[
\theta = \big[\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, \theta_6\big]^T
       = \big[a,\; I,\; a+\alpha,\; \mu b,\; \mu c,\; \mu\big]^T.
\]

3.2 Numerical Examples

The HR model shows a large variety of behaviors with respect to the parameter values in the differential equations [12]. Thus, we can characterize the dynamic behaviors with respect to different values of the parameters. We focus on the parameters a and I. The parameter a is an internal parameter of the single neuron, and I is an external depolarizing current. For fixed I = 0.05, the HR model shows tonic bursting for a ∈ [1.8, 2.85] and tonic spiking for a ≥ 2.9. On the other hand, for fixed a = 2.8, the HR model shows tonic bursting for I ∈ [0, 0.18] and tonic spiking for I ∈ [0.2, 5].
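As a concrete illustration, these parameter regimes can be reproduced with a direct numerical integration of the HR equations. The following is a minimal sketch (our own, not the authors' MATLAB code); the forward-Euler step size and initial condition are illustrative assumptions:

```python
# Sketch: forward-Euler integration of the single HR neuron of Sec. 3.1.
# Parameter values follow the text; dt and w0 are illustrative assumptions.
import numpy as np

def simulate_hr(a=2.8, alpha=1.6, b=9.0, c=5.0, mu=0.001, I=0.05,
                w0=(0.0, 0.0, 0.0), dt=0.01, t_end=1000.0):
    """Integrate x' = a x^2 - x^3 - y - z + I, y' = (a + alpha) x^2 - y,
    z' = mu (b x + c - z); returns the trajectory as an (n+1, 3) array."""
    n = int(t_end / dt)
    w = np.empty((n + 1, 3))
    w[0] = w0
    for k in range(n):
        x, y, z = w[k]
        w[k + 1] = w[k] + dt * np.array([
            a * x**2 - x**3 - y - z + I,   # membrane potential x
            (a + alpha) * x**2 - y,        # fast current y
            mu * (b * x + c - z),          # slow current z
        ])
    return w

burst = simulate_hr(a=2.8)  # tonic bursting (IBN regime)
spike = simulate_hr(a=3.0)  # tonic spiking (ISN regime)
```

Plotting the first column against time should reproduce the qualitative behavior of Figs. 1 and 3; a smaller dt (or a higher-order integrator) may be needed for quantitative agreement.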

Fig. 1. The response of x in the tonic bursting
Fig. 2. 3-D surface of x, y, z in the tonic bursting
Fig. 3. The response of x in the tonic spiking
Fig. 4. 3-D surface of x, y, z in the tonic spiking
Fig. 5. The response of x1 in the intrinsic bursting neuron
Fig. 6. 3-D surface of x, y, z in the intrinsic bursting neuron

The parameters of the HR model in the tonic bursting (TB) case are given by a = 2.8, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. Figure 1 shows the response of x, and Fig. 2 shows the 3-D surface of x, y, z. We call this neuron the intrinsic bursting neuron (IBN).


The parameters of the HR model in the tonic spiking (TS) case are given by a = 3.0, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. The only difference between the tonic bursting and the tonic spiking cases is the value of the parameter a. Figure 3 shows the response of x, and Fig. 4 shows the 3-D surface of x, y, z. We call this neuron the intrinsic spiking neuron (ISN). When the external current changes from I = 0.05 to I = 0.2, the IBN shows tonic spiking. Figure 5 shows the response of x, and Fig. 6 shows the 3-D surface of x, y, z.

4 Synaptically Coupled Model of HR Neurons

4.1 Dynamical Equations

Consider the following two synaptically coupled HR neurons [1]:

\[
\dot x_1 = a_1 x_1^2 - x_1^3 - y_1 - z_1 - g_s (x_1 - V_{s1})\,\Gamma(x_2), \qquad
\dot y_1 = (a_1 + \alpha_1) x_1^2 - y_1, \qquad
\dot z_1 = \mu_1 (b_1 x_1 + c_1 - z_1),
\]
\[
\dot x_2 = a_2 x_2^2 - x_2^3 - y_2 - z_2 - g_s (x_2 - V_{s2})\,\Gamma(x_1), \qquad
\dot y_2 = (a_2 + \alpha_2) x_2^2 - y_2, \qquad
\dot z_2 = \mu_2 (b_2 x_2 + c_2 - z_2),
\]

where Γ(x) is the sigmoid function given by

\[
\Gamma(x) = \frac{1}{1 + \exp(-\lambda (x - \theta_s))}.
\]

4.2 Numerical Examples

Consider the IBN neuron with a = 2.8 and the ISN neuron with a = 10.8, whose other parameters are as follows: αi = 1.6, ci = 5, bi = 9, μi = 0.001, Vsi = 2, θs = −0.25, λ = 10. Figures 7 and 8 show the responses of the membrane potentials in the coupling of the IBN neuron and the ISN neuron with coupling strength gs = 0.05. Each neuron behaves as an intrinsic single neuron. As the coupling strength increases, however, the IBN neuron shows chaotic behavior. Figures 9 and 10 show the responses of the membrane potentials in the coupling of the IBN neuron and the ISN neuron with coupling strength gs = 1. Figure 11 shows the response of the membrane potentials in the coupling of two identical IBNs with
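The coupled system can be simulated in the same way. The sketch below is our own (forward Euler, with an assumed step size and initial condition) and defaults to the IBN-IBN configuration; setting a2 = 10.8 and varying gs reproduces the other cases discussed above:

```python
# Sketch: two synaptically coupled HR neurons (Sec. 4.1), forward Euler.
# Shared parameters match the text (alpha_i = 1.6, etc.); dt is an assumption.
import numpy as np

def gamma_sig(x, lam=10.0, theta_s=-0.25):
    """Sigmoid coupling Gamma(x) = 1 / (1 + exp(-lambda (x - theta_s)))."""
    return 1.0 / (1.0 + np.exp(-lam * (x - theta_s)))

def simulate_coupled(a1=2.8, a2=2.8, gs=0.05, alpha=1.6, b=9.0, c=5.0,
                     mu=0.001, Vs=2.0, dt=0.005, t_end=1000.0):
    n = int(t_end / dt)
    s = np.zeros((n + 1, 6))          # [x1, y1, z1, x2, y2, z2]
    s[0, 0], s[0, 3] = 0.1, -0.1      # slightly asymmetric start (assumption)
    for k in range(n):
        x1, y1, z1, x2, y2, z2 = s[k]
        s[k + 1] = s[k] + dt * np.array([
            a1 * x1**2 - x1**3 - y1 - z1 - gs * (x1 - Vs) * gamma_sig(x2),
            (a1 + alpha) * x1**2 - y1,
            mu * (b * x1 + c - z1),
            a2 * x2**2 - x2**3 - y2 - z2 - gs * (x2 - Vs) * gamma_sig(x1),
            (a2 + alpha) * x2**2 - y2,
            mu * (b * x2 + c - z2),
        ])
    return s
```

With the default IBN-IBN setting, increasing gs from 0.05 to 1 should move the pair from burst synchronization toward spike synchronization, as in Figs. 11 and 12.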

Fig. 7. The response of x1 of the IBN
Fig. 8. The response of x2 of the ISN
Fig. 9. The response of x1 of the IBN
Fig. 10. The response of x2 of the ISN
Fig. 11. The response of x1 of the IBN-IBN coupling with gs = 0.05
Fig. 12. The response of x1 of the IBN-IBN coupling with gs = 1

the coupling strength gs = 0.05. In this case, the two IBNs synchronize as bursting neurons. Figure 12 shows the response of the membrane potentials in the coupling of two identical IBNs with the coupling strength gs = 1. In this case, the two IBNs synchronize as spiking neurons.

5 Adaptive Observer with Full States

We consider the parameter estimation problem of distinguishing the firing patterns by using early-time dynamic behaviors. In this section, assuming that the full states are measurable, we present an adaptive observer to estimate all parameters of the single HR neuron.

5.1 Construction of Adaptive Observer

We present an adaptive observer as

\[
(O_0):\quad \dot{\hat w} = W(\hat w - w) + h(w) + \Xi\,\hat\theta,
\]

where ŵ = [x̂, ŷ, ẑ]ᵀ is an estimate of the states, θ̂ is an estimate of the unknown parameters, and W is selected as a stable matrix. Using standard adaptive control theory [11], the parameter update law is given by

\[
\dot{\hat\theta} = \Gamma\,\Xi^T P\,(w - \hat w),
\]

where P is a positive definite solution of the following Lyapunov equation for a positive definite matrix Q:

\[
W^T P + P W = -Q.
\]

5.2 Numerical Examples

We show the simulation results for the single IBN case. The parameters in the tonic bursting are given by a = 2.8, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. The parameters of the adaptive observer are selected as W = −10 I3, Γ = diag{100, 50, 300}. Figure 13 shows the estimation behavior of a (solid line) and I (dotted line). The estimates â and Î converge to the true values of a and I.
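The full-state observer can be checked numerically. The sketch below is our own illustration, not the authors' MATLAB code; we take Q = I3 (so P = I3/20 for W = −10 I3) and extend Γ to six entries, both of which are assumptions, since the text only lists diag{100, 50, 300}:

```python
# Sketch: full-state adaptive observer (O0) of Sec. 5.1, forward Euler.
import numpy as np

def h(w):
    x, y, z = w
    return np.array([-(x**3 + y + z), -y, 0.0])

def xi(w):
    x, _, z = w
    return np.array([[x**2, 1.0, 0.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, x**2, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, x, 1.0, -z]])

def run_o0(dt=1e-3, t_end=100.0):
    a, alpha, b, c, mu, I = 2.8, 1.6, 9.0, 5.0, 0.001, 0.05
    theta = np.array([a, I, a + alpha, mu * b, mu * c, mu])   # true parameters
    W = -10.0 * np.eye(3)
    P = np.eye(3) / 20.0          # solves W^T P + P W = -I3 for this W
    gam = np.diag([100.0, 50.0, 300.0, 100.0, 100.0, 100.0])  # assumed gains
    w = np.zeros(3)               # plant state
    wh = np.ones(3)               # observer state estimate
    th = np.zeros(6)              # parameter estimate
    for _ in range(int(t_end / dt)):
        X = xi(w)
        w_next = w + dt * (h(w) + X @ theta)                  # plant (S0)
        wh += dt * (W @ (wh - w) + h(w) + X @ th)             # observer (O0)
        th += dt * (gam @ (X.T @ P @ (w - wh)))               # update law
        w = w_next
    return w, wh, th
```

The Lyapunov function V = eᵀPe + θ̃ᵀΓ⁻¹θ̃ (e = ŵ − w) is nonincreasing along the continuous-time dynamics, which is why the state error stays bounded in this sketch; parameter convergence additionally requires persistently exciting (here, bursting) trajectories.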

6 Adaptive Observer with a Partial State

We assume that the membrane potential x is available, but the other states are immeasurable. In this case, we consider the following problems:

– Estimate y and z using the available signal x;
– Estimate the parameter a or I to distinguish the firing patterns by using early-time dynamic behaviors.


6.1 Construction of Adaptive Observer

The parameters a and I are the key parameters that determine the firing pattern. The HR model can be rewritten in the following three forms [9]:

\[
(S_1):\ \dot w = A w + h_1(x) + b_1 (x^2 a), \quad(1)
\]
\[
(S_2):\ \dot w = A w + h_2(x) + b_2 I, \quad(2)
\]
\[
(S_3):\ \dot w = A w + h_3(x) + b_2 (\theta^T \xi), \quad(3)
\]

where

\[
A = \begin{bmatrix} 0 & -1 & -1 \\ 0 & -1 & 0 \\ \mu b & 0 & -\mu \end{bmatrix}, \quad
h_1 = \begin{bmatrix} -x^3 + I \\ \alpha x^2 \\ \mu c \end{bmatrix}, \quad
b_1 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \quad
b_2 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix},
\]
\[
h_2 = \begin{bmatrix} -x^3 + a x^2 \\ (a+\alpha) x^2 \\ \mu c \end{bmatrix}, \quad
h_3 = \begin{bmatrix} -x^3 \\ \delta x^2 \\ \mu c \end{bmatrix}, \quad
\theta = \begin{bmatrix} a \\ I \end{bmatrix}, \quad
\xi = \begin{bmatrix} x^2 \\ 1 \end{bmatrix}.
\]

In (S1) and (S2), the unknown parameters are assumed to be a and I, respectively. In (S3), we assume that the parameter δ = a + α is known, and a and I are unknown. Since the measurable signal is x, the output equation is given by

\[
x = c\,w, \qquad c = [\,1\ \ 0\ \ 0\,].
\]

We present adaptive observers that estimate the parameters for each system (Si), i = 1, 2, 3, as follows:

\[
(O_1):\ \dot{\hat w}_1 = A\hat w_1 + h_1(x) + b_1(x^2\hat a) + g(x - \hat x), \quad(4)
\]
\[
(O_2):\ \dot{\hat w}_2 = A\hat w_2 + h_2(x) + b_2\hat I + g(x - \hat x), \quad(5)
\]
\[
(O_3):\ \dot{\hat w}_3 = A\hat w_3 + h_3(x) + b_2(\hat\theta^T \xi) + g(x - \hat x), \quad(6)
\]

where g is selected such that A − gc is stable. Since (A, b1, c) and (A, b2, c) are strictly positive real, the parameter estimation laws are given as

\[
\dot{\hat a} = \gamma_1 x^2 (x - \hat x), \qquad \dot{\hat I} = \gamma_2 (x - \hat x). \quad(7)
\]
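For (O1), the stability argument behind the update law (7) can be sketched as follows; this is our paraphrase of the standard strict-positive-realness argument (cf. [11]), not a derivation taken verbatim from the paper. With the estimation error e ≡ w − ŵ1 and parameter error ã ≡ a − â,

\[
\dot e = (A - gc)\,e + b_1 x^2 \tilde a,
\]

and the Kalman-Yakubovich lemma provides P = Pᵀ > 0 with

\[
(A - gc)^T P + P (A - gc) = -Q, \qquad P b_1 = c^T .
\]

Taking V ≡ eᵀPe + γ₁⁻¹ã² gives

\[
\dot V = -e^T Q e + 2 x^2 \tilde a\,(b_1^T P e) - 2\gamma_1^{-1} \tilde a\,\dot{\hat a}
       = -e^T Q e + 2 x^2 \tilde a (x - \hat x) - 2 x^2 \tilde a (x - \hat x)
       = -e^T Q e \le 0,
\]

since b₁ᵀPe = ce = x − x̂ and â̇ = γ₁x²(x − x̂); asymptotic convergence of e then follows from Barbalat's lemma.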

Using the Kalman-Yakubovich (KY) lemma, we can show the asymptotic stability of the error system based on standard adaptive control theory [11].

6.2 Numerical Examples

We show the simulation results for the single IBN case. The parameters are the same as in the previous simulation. Figures 14 and 15 show the parameters estimated by the adaptive observers (O1) and (O2), respectively. Figures 16 and 17 show the responses of y (solid line) and its estimate ŷ (dotted line) for the adaptive observer (O1) for t ≤ 500 and t ≤ 20, respectively. Figure 18 shows the responses of z (solid line) and its estimate ẑ (dotted line) for the adaptive observer (O1). The simulation results for the other cases are omitted. The states and parameters can be asymptotically estimated.
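The observer (O1) with update law (7) can likewise be sketched in simulation form. This is our own illustration; the output-injection gain g and adaptation gain γ1 are assumed values, not taken from the paper:

```python
# Sketch: output-only adaptive observer (O1) with update law (7), Euler.
import numpy as np

def run_o1(g=(10.0, 0.0, 0.0), gamma1=10.0, dt=1e-3, t_end=50.0):
    a, alpha, b, c_p, mu, I = 2.8, 1.6, 9.0, 5.0, 0.001, 0.05
    A = np.array([[0.0, -1.0, -1.0],
                  [0.0, -1.0, 0.0],
                  [mu * b, 0.0, -mu]])
    b1 = np.array([1.0, 1.0, 0.0])
    g = np.asarray(g)

    def h1(x):
        return np.array([-x**3 + I, alpha * x**2, mu * c_p])

    w = np.zeros(3)      # plant state, written in form (S1)
    wh = np.zeros(3)     # observer state
    a_hat = 0.0          # estimate of the unknown parameter a
    for _ in range(int(t_end / dt)):
        x = w[0]         # measured output x = c w
        w_next = w + dt * (A @ w + h1(x) + b1 * (x**2 * a))
        wh += dt * (A @ wh + h1(x) + b1 * (x**2 * a_hat) + g * (x - wh[0]))
        a_hat += dt * gamma1 * x**2 * (x - wh[0])   # update law (7)
        w = w_next
    return w, wh, a_hat
```

With richer (bursting) excitation and longer horizons, a_hat should approach the true a = 2.8, qualitatively as in Fig. 14; the exact convergence speed depends on the assumed gains.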

Fig. 13. â (solid line) and Î (dotted line) in the adaptive observer (O0) with full states
Fig. 14. â (solid line) in the adaptive observer (O1) with x
Fig. 15. Î (solid line) in the adaptive observer (O2) with x
Fig. 16. y (solid line) and ŷ (dotted line) in the adaptive observer (O1) (t ≤ 500)
Fig. 17. y (solid line) and ŷ (dotted line) in the adaptive observer (O1) (t ≤ 20)
Fig. 18. z (solid line) and ẑ (dotted line) in the adaptive observer (O1) (t ≤ 20)

7 Conclusion

We have presented estimators of the parameters of the HR model using the adaptive observer technique with output measurement data, namely the membrane potential. The proposed observers allow us to distinguish the firing pattern at an early stage and to recover the immeasurable internal states.

References

1. Belykh, I., de Lange, E., Hasler, M.: Synchronization of Bursting Neurons: What Matters in the Network Topology. Phys. Rev. Lett. 94, 188101 (2005)
2. Izhikevich, E.M.: Simple Model of Spiking Neurons. IEEE Trans. on Neural Networks 14(6), 1569–1572 (2003)
3. Izhikevich, E.M.: Which model to use for cortical spiking neurons? IEEE Trans. on Neural Networks 15(5), 1063–1070 (2004)
4. Watts, L.: A Tour of NeuraLOG and Spike – Tools for Simulating Networks of Spiking Neurons (1993), http://www.lloydwatts.com/SpikeBrochure.pdf
5. Hindmarsh, J.L., Rose, R.M.: A model of the nerve impulse using two first order differential equations. Nature 296, 162–164 (1982)
6. Hindmarsh, J.L., Rose, R.M.: A model of neuronal bursting using three coupled first order differential equations. Proc. R. Soc. Lond. B 221, 87–102 (1984)
7. Carroll, T.L.: Chaotic systems that are robust to added noise. Chaos 15, 013901 (2005)
8. Meunier, N., Narion-Poll, R., Lansky, P., Rospars, J.O.: Estimation of the Individual Firing Frequencies of Two Neurons Recorded with a Single Electrode. Chem. Senses 28, 671–679 (2003)
9. Yu, H., Liu, Y.: Chaotic synchronization based on stability criterion of linear systems. Physics Letters A 314, 292–298 (2003)
10. Marino, R.: Adaptive Observers for Single Output Nonlinear Systems. IEEE Trans. on Automatic Control 35(9), 1054–1058 (1990)
11. Narendra, K.S., Annaswamy, A.M.: Stable Adaptive Systems. Prentice Hall Inc., Englewood Cliffs (1989)
12. Arena, P., Fortuna, L., Frasca, M., Rosa, M.L.: Locally active Hindmarsh-Rose neurons. Chaos, Solitons and Fractals 27, 405–412 (2006)
13. Tokuda, I., Parlitz, U., Illing, L., Kennel, M., Abarbanel, H.: Parameter estimation for neuron models. In: Proc. of the 7th Experimental Chaos Conference (2002), http://www.physik3.gwdg.de/~ulli/pdf/TPIKA02 pre.pdf
14. Steur, E.: Parameter Estimation in Hindmarsh-Rose Neurons (2006), http://alexandria.tue.nl/repository/books/626834.pdf
15. Fujikawa, H., Mitsunaga, K., Suemitsu, H., Matsuo, T.: Parameter Estimation of Biological Neuron Models with Bursting and Spiking. In: Proc. of SICE-ICASE International Joint Conference 2006, pp. 4487–4492 (2006)

Thouless-Anderson-Palmer Equation for Associative Memory Neural Network Models with Fluctuating Couplings

Akihisa Ichiki and Masatoshi Shiino

Department of Applied Physics, Faculty of Science, Tokyo Institute of Technology, 2-12-2 Ohokayama, Meguro-ku, Tokyo, Japan

Abstract. We derive Thouless-Anderson-Palmer (TAP) equations and order parameter equations for stochastic analog neural network models with fluctuating synaptic couplings. Such systems with a finite number of neurons originally have no energy concept, so they defy the use of the replica method or the cavity method, which require one. However, for some realizations of synaptic noise the systems have an effective Hamiltonian, and the cavity method becomes applicable to derive the TAP equations.

1 Introduction

The replica method [1] for random spin systems has been successfully employed in neural network models of associative memory to obtain the order parameters and the storage capacity [2], and the cavity method [3] has been employed to derive the Thouless-Anderson-Palmer (TAP) equations [4,5]. However, these techniques require the energy concept. On the other hand, various types of neural network models that have no energy concept, such as networks with temporally fluctuating synaptic couplings, may exist. The alternative approach to the replica method for deriving the order parameter equations, called the self-consistent signal-to-noise analysis (SCSNA), is closely related to the cavity concept in the case where networks have a free energy [6,7]. An advantage of applying the SCSNA to neural networks is that the energy concept is not required to derive the order parameter equations once the TAP equations are obtained. The SCSNA, which was originally proposed for deriving a set of order parameter equations for deterministic analog neural networks, becomes applicable to stochastic networks by noting that the TAP equations define deterministic networks. Furthermore, the coefficients of the Onsager reaction terms characteristic of the TAP equations, which determine the form of the transfer functions in analog networks, are self-consistently obtained through the concept shared by the cavity method and the SCSNA. Thus the TAP equations as well as the order parameter equations are derived self-consistently by the hybrid use of the cavity method and the SCSNA in the case where the energy concept exists. However, networks with synaptic noise, which have no energy concept, defy the use of the cavity method to obtain the TAP equations. On the other hand, as in [8], a network with a specific


involvement of synaptic noise can be analyzed by the cavity method to derive the TAP equations in the thermodynamic limit, since the energy concept appears as an effective Hamiltonian in this limit. It is natural to consider such neural network models with fluctuating synaptic couplings, since the synaptic couplings in real biological systems are updated by learning rules, and the time sequence of the synaptic couplings may be stochastic under the influence of noisy external stimuli. Thus the study of such networks is required to understand the retrieval process of realistic networks. The aim of this paper is two-fold: (i) we investigate which realizations of synaptic noise give the networks an energy concept, so that the cavity method can be applied to derive the TAP equations; (ii) we show the TAP equations for the networks in which the concept of the effective Hamiltonian appears. This paper is organized as follows: in the next section, we briefly review how the energy concept appears in a network with synaptic noise and derive the TAP equations and the order parameter equations by using the cavity method and the SCSNA [8]. Once the effective Hamiltonian is found, the replica method is also applicable to derive the order parameter equations. However, in the present paper, to make clear the relationship between the TAP equations and the order parameter equations, we do not use the replica trick. In Section 3, we investigate the cases where the energy concept appears in networks with synaptic noise. We will see that the TAP equations and the order parameter equations for some models can be derived in the framework described in Section 2. We also mention that some difficulties in deriving the TAP equations arise in models with other involvements of synaptic noise. In the last section, we discuss the structure of the TAP equations for the network with temporally fluctuating synaptic noise and conclude the paper.

2 Brief Review on Effective Hamiltonian, TAP Equations and Order Parameter Equations

In this section, we briefly review how the cavity method becomes applicable to networks with fluctuating synaptic couplings [8]. Then we derive the TAP equations and the order parameter equations self-consistently in the framework of the SCSNA. We deal with the following stochastic analog neural network of N neurons with temporally fluctuating synaptic noise (multiplicative noise):

\[
\dot x_i = -\phi'(x_i) + \sum_{j(\neq i)} J_{ij}(t)\, x_j + \eta_i(t), \quad(1)
\]
\[
\langle \eta_i(t)\,\eta_j(t') \rangle = 2D\,\delta_{ij}\,\delta(t - t'), \quad(2)
\]

where x_i (i = 1, ..., N) represents the state of the neuron at site i, taking a continuous value, φ(x_i) is a potential of arbitrary form which determines the probability distribution of x_i in the case without the input \(\sum_{j(\neq i)} J_{ij} x_j\), η_i is Langevin white noise with intensity 2D, and J_{ij}(t) is the synaptic coupling.


We note here that, in the case of associative memory neural network, the synaptic coupling Jij is usually deﬁned by the well-known Hebb learning rule. However, in the present paper, we will deal with the coupling Jij ﬂuctuating around the Hebb rule with a white noise: Jij (t) = J¯ij + ij (t), ˜ 2D ij (t)kl (t ) = δik δjl δ(t − t ), N

(3) (4)

where \(\bar J_{ij}\) is defined by the usual Hebb learning rule \(\bar J_{ij} \equiv \frac{1}{N}\sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu\) with p = αN the number of patterns embedded in the network, \(\xi_i^\mu = \pm 1\) is the μ-th embedded pattern at neuron i, and \(\epsilon_{ij}(t)\) denotes the synaptic noise independent of η_i(t), which we assume in the present model to be a white noise with intensity \(2\tilde D/N\).

Using the Ito integral, we obtain the Fokker-Planck equation corresponding to the Langevin equation (1) as

\[
\frac{\partial P(t, x)}{\partial t} = -\sum_{i=1}^{N} \frac{\partial}{\partial x_i}
\left\{ -\phi'(x_i) + \sum_{j(\neq i)} \bar J_{ij} x_j - \big(D + \tilde D \hat q\big)\frac{\partial}{\partial x_i} \right\} P(t, x), \quad(5)
\]

where \(\hat q \equiv \frac{1}{N}\sum_{j(\neq i)} x_j^2\). Since the self-averaging property holds in the thermodynamic limit N → ∞, \(\hat q\) is identified as

\[
\hat q = \frac{1}{N}\sum_{i=1}^{N} x_i^2. \quad(6)
\]

The order parameter \(\hat q\) is obtained self-consistently in our framework, as seen below. Supposing \(\hat q\) is given, one can easily find the equilibrium probability density for the Fokker-Planck equation (5) as

\[
P_N(x) = Z^{-1} \exp\left\{ -\beta_{\mathrm{eff}} \left( \sum_{i=1}^{N} \phi(x_i) - \sum_{i<j} \bar J_{ij} x_i x_j \right) \right\}, \quad(7)
\]

where Z denotes the normalization constant and

\[
\beta_{\mathrm{eff}}^{-1} \equiv D + \tilde D \hat q \quad(8)
\]

plays the role of the effective temperature of the network. The temperature of the system is modified as a consequence of the multiplicative noise, and it depends on the order parameter \(\hat q\). Notice here that the equilibrium distribution of the system becomes a Gibbs distribution in the thermodynamic limit N → ∞. The equilibrium solution of equation (5) for the finite N-body system indeed differs from the probability density (7). However, since \(\frac{1}{N}\sum_{j(\neq i)} x_j^2 - \frac{1}{N}\sum_{j} \langle x_j^2 \rangle = O(1/\sqrt N)\), the difference between the probability density of the finite N-body system, \(P_N\), and that of the system in the thermodynamic limit, \(P_{N\to\infty}\), is \(P_{N\to\infty} - P_N = O(1/\sqrt N)\).
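The self-consistent determination of \(\hat q\) and the effective temperature can be illustrated by a direct Euler-Maruyama simulation of Eq. (1). The sketch below is ours: the choice φ(x) = x⁴/4 (so φ'(x) = x³) and all numerical parameter values are illustrative assumptions, not taken from the paper:

```python
# Sketch: estimating q_hat (Eq. 6) and 1/beta_eff = D + D_tilde q_hat (Eq. 8)
# from an Euler-Maruyama simulation of the Langevin network (Eq. 1) with
# Hebbian mean couplings (Eq. 3) and white synaptic noise (Eq. 4).
import numpy as np

def estimate_qhat(N=100, p=3, D=0.1, D_tilde=0.1, dt=1e-3,
                  n_steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    patterns = rng.choice([-1.0, 1.0], size=(p, N))
    J_bar = (patterns.T @ patterns) / N          # Hebb rule
    np.fill_diagonal(J_bar, 0.0)
    x = 0.1 * rng.standard_normal(N)
    q_acc, count = 0.0, 0
    for step in range(n_steps):
        # discretised synaptic noise: std sqrt(2 D_tilde / (N dt)) per coupling
        eps = rng.standard_normal((N, N)) * np.sqrt(2.0 * D_tilde / (N * dt))
        np.fill_diagonal(eps, 0.0)
        drift = -x**3 + (J_bar + eps) @ x        # phi(x) = x^4 / 4 assumed
        x = x + dt * drift + np.sqrt(2.0 * D * dt) * rng.standard_normal(N)
        if step >= n_steps // 2:                 # discard transient
            q_acc += np.mean(x**2)
            count += 1
    q_hat = q_acc / count
    return q_hat, D + D_tilde * q_hat            # (q_hat, 1 / beta_eff)
```

Note how the synaptic-noise contribution to the per-neuron diffusion scales as \(2\tilde D \hat q\), so the returned second value plays the role of the effective temperature \(\beta_{\mathrm{eff}}^{-1}\) of Eq. (8).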


Thus one can conclude that the equilibrium density for equation (5) converges to the probability density (7) in the thermodynamic limit N → ∞. Since we have explicitly written down the equilibrium probability density (7) in Gibbsian form, one can define the effective Hamiltonian of the (sufficiently large) N-body system as

\[
H_N \equiv \sum_{i=1}^{N} \phi(x_i) - \sum_{i<j} \bar J_{ij} x_i x_j. \quad(9)
\]

Since we have found the effective Hamiltonian and the effective temperature, one can apply the usual cavity method [3] to this system and derive the TAP equation. According to the cavity method, we divide the Hamiltonian of the N-body system (9) into that of the (N−1)-body system and the part involving the state of the i-th neuron, as \(H_N = \phi(x_i) - h_i x_i + H_{N-1}\), where \(h_i \equiv \sum_{j(\neq i)} \bar J_{ij} x_j\) is the local field at site i and the Hamiltonian of the (N−1)-body system is \(H_{N-1} \equiv \sum_{j(\neq i)} \phi(x_j) - \sum_{j<k\,(j,k\neq i)} \bar J_{jk} x_j x_k\).

… (u > 0) and x = P for the potentiation part (u < 0), cf. [2, Sec. 5] for details and the parameter values.

structures) and it can be linked to the variability observed in physiological data; the other is due to the additional impact of the background activity upon the generation of the spike output of each neuron (intrinsic stochasticity modelled by the Poisson process).

3 Theoretical Analysis

3.1 Characterisation of the Neural Activity

We consider a network of N Poisson neurons (referred to as internal neurons) stimulated with M Poisson pulse trains (or external inputs), as shown in Fig. 2. The activity of the neural network can be described using the firing rates and the pairwise correlations, plus the weights. In the terminology of the theory of dynamical systems, these variables characterise the "state" of the network activity at each time t, and their evolution is referred to as the neural dynamics. Similar to [7,2], we consider the time-averaged firing rates ν_i(t) (for the internal neuron indexed by i; T is a given time period)

\[
\nu_i(t) \equiv \frac{1}{T} \int_{t-T}^{t} \langle S_i(t') \rangle \, dt' \quad(3)
\]

where S_i(t) is the spike-time series of the i-th neuron and the brackets ⟨...⟩ denote ensemble averaging. Likewise for the correlation coefficients (time-averaged correlation function convolved with the STDP window function W): \(D^W_{ik}(t)\) between the i-th internal neuron and the k-th external input,

\[
D^W_{ik}(t) \equiv \frac{1}{T} \int_{t-T}^{t} \int_{-\infty}^{+\infty} W(u)\, \langle S_i(t')\,\hat S_k(t'+u) \rangle \, dt' \, du, \quad(4)
\]

and \(Q^W_{ij}(t)\) between the i-th and j-th internal neurons (with S_i and S_j) [2, Sec. 3].
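The estimators (3) and (4) can be sketched on discretised spike trains as follows. This is our own toy illustration: the exponential window below is an assumed stand-in for the paper's STDP window W (whose actual form and parameter values are given in [2, Sec. 5]), and all numerical values are illustrative:

```python
# Sketch: time-averaged rate (Eq. 3) and W-weighted correlation (Eq. 4)
# on binned 0/1 spike trains; the window and parameters are assumptions.
import numpy as np

DT = 1e-3  # bin width in seconds

def stdp_window(u, a_plus=1.0, a_minus=0.6, tau=0.02):
    """Toy W(u): potentiation for u < 0, depression for u > 0."""
    return np.where(u < 0.0, a_plus * np.exp(u / tau),
                    -a_minus * np.exp(-u / tau))

def rate(spikes, t_bins):
    """nu_i over the last t_bins bins (Eq. 3), in spikes per second."""
    return spikes[-t_bins:].mean() / DT

def corr_w(s_i, s_k, t_bins, max_lag=100):
    """D^W_ik (Eq. 4): sum over lags u of W(u) <S_i(t') S_k(t'+u)>."""
    total = 0.0
    for lag in range(-max_lag, max_lag + 1):
        w = stdp_window(lag * DT)
        coincidence = np.mean(s_i[-t_bins:] * np.roll(s_k, -lag)[-t_bins:])
        total += w * coincidence
    return total

# Toy data: s_i repeats s_k with a 5 ms delay, so spike pairs fall at
# u = -5 ms (potentiation side) and the W-weighted correlation is positive.
rng = np.random.default_rng(1)
s_k = (rng.random(20000) < 20.0 * DT).astype(float)   # ~20 Hz train
s_i = np.roll(s_k, 5)
```

Here the ensemble average ⟨...⟩ is replaced by a single-trial time average, which is adequate only for long, stationary trains.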

Spike-Timing Dependent Plasticity in Recurrently Connected Networks


Fig. 2. Presentation of the network and the notation. The internal neurons (within the network) are indexed by i ∈ [1..N], and their output pulse trains are denoted S_i(t) (which can be understood as a sum of "Dirac functions" at the spiking times [2, Sec. 2]). Likewise for the external input pulse trains Ŝ_k(t) (k ∈ [1..M]). The time-averaged firing rates are denoted by ν_i(t), cf. Eq. 3; the correlation coefficients within the network by Q_ij(t) (resp. D_ik(t) between a neuron in the network and an external input, and Q̂_kl(t) between two external inputs), cf. Eq. 4. The weight of the connection from the k-th external input onto the i-th internal neuron is denoted K_ik(t) (resp. J_ij(t) from the j-th internal neuron onto the i-th internal neuron).

The stimulation parameters (which represent the “information” carried in the external inputs) are determined by the time-averaged input spiking rates (ˆ νk (t), ˆ W (t), deﬁned simideﬁned similarly to νi (t)) and their correlation coeﬃcients (Q kl W larly to Dik (t) and QW ij (t)) [2, Sec. 3]. In this paper, we only consider stimulation parameters that are constant in time. 3.2

3.2 Learning Equations

Learning equations can be derived from Eq. 2 as in Kempter et al. [7]. This requires the assumption that the internal pulse trains are statistically independent (which can be considered valid when the number of recurrent connections is large enough) and a small learning rate η. This leads to the matrix equation Eq. 10 for the weights between internal neurons (resp. to Eq. 9 for the input weights K).

3.3 Activation Dynamics

In order to study the evolution of the weights described by Eqs. 9 and 10, we need to evaluate the time-averaged neuron firing rates (the vector \nu(t)) and their time-averaged correlation coefficients (the matrices D^W(t) and Q^W(t)). Similarly to Kempter et al. [7] and Burkitt et al. [2], we approximate the instantaneous firing rate S_i(t) of the ith Poisson neuron by its expected inhomogeneous Poisson parameter \rho(t) (cf. Eq. 1), and we neglect the impact of the short-time dynamics (the synaptic response kernel and the synaptic delays \hat{d}_{ik} and d_{ij}) by using time-averaged variables (over a "long" period T). We require that the


M. Gilson et al.

learning occurs slowly compared to the activation mechanisms (cf. the neuron and synapse models), so that T is large compared to the time scale of these mechanisms but small compared to the inverse of the learning parameter, η^{-1}. This leads to the consistency matrix equations of the firing rates (Eq. 6) and of the correlation coefficients (Eqs. 7 and 8). See Burkitt et al. [2, Sec. 3] for details of the derivation. Note that the consistency equations of the correlation coefficients as defined in [2, Sec. 3 and 4] have been reformulated to express the usual covariance, using the assumption that the correlations are quasi-constant in time [5] (this implies that \hat{Q}^{VT} [2, Sec. 3] is actually equal to \hat{Q}^W). Eqs. 7 and 8 express the impact of the connectivity (through the term [I - J]^{-1} K) on the internal firing rates and the cross covariances in terms of the input covariance \hat{C}^W

\hat{C}^W \equiv \hat{Q}^W - \overline{W}\, \hat{\nu} \hat{\nu}^T,   (5)

where \overline{W} \equiv \int W(u)\, du evaluates the balance between the potentiation and the depression of our STDP rule. These equations are obtained by combining the equations in [2] with the firing-rate consistency equation Eq. 6.

3.4 Network Dynamical System

Putting everything together, the network dynamics is described by

\nu = [I - J]^{-1} (\nu_0 E + K \hat{\nu})   (6)

D^W - \overline{W}\, \nu \hat{\nu}^T = [I - J]^{-1} K (\hat{Q}^W - \overline{W}\, \hat{\nu} \hat{\nu}^T)   (7)

Q^W - \overline{W}\, \nu \nu^T = [I - J]^{-1} K (\hat{Q}^W - \overline{W}\, \hat{\nu} \hat{\nu}^T) K^T [I - J]^{-1\,T}   (8)

\frac{dK}{dt} = \Phi_K \left( w^{in} E \hat{\nu}^T + w^{out} \nu \hat{E}^T + D^W \right)   (9)

\frac{dJ}{dt} = \Phi_J \left( w^{in} E \nu^T + w^{out} \nu E^T + Q^W \right),   (10)

where E is the unit vector of N elements (likewise for \hat{E} with M elements); \Phi_J is a projector on the space of N × N matrices to which J belongs [2, Sec. 3]. The effect of \Phi_J is to nullify the coefficients corresponding to a missing connection in the network, viz. all the diagonal terms, because the self-connection of a neuron onto itself is forbidden. More generally, such a projection operator can account for any network connectivity [2, Sec. 3]. Note that time has been rescaled in order to remove η from these equations, and for simplicity of notation the dependence on time t will be omitted in the rest of this paper. The matrix [I − J(t)] is assumed to be invertible at all times (the contrary would imply diverging behaviour of the firing rates [2, Sec. 4]).
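To make the dynamical system concrete, here is an illustrative (not the authors') forward-Euler integration of Eqs. 6 and 10 in the simplest setting: fixed input weights K and uncorrelated inputs, so that the input covariance vanishes and Eq. 8 reduces to Q^W = \overline{W} \nu \nu^T. All parameter values in the sketch are made up.

```python
import numpy as np

def network_weight_dynamics(K, nu_hat, nu0, w_in, w_out, Wbar,
                            steps=2000, dt=0.01, J0=0.01):
    """Forward-Euler integration of Eqs. 6 and 10 with fixed K and
    uncorrelated inputs (C^W = 0, hence Q^W = Wbar * nu nu^T by Eq. 8).
    Returns the recurrent weights J and firing rates nu at the final step."""
    N = K.shape[0]
    E = np.ones(N)
    J = np.full((N, N), J0)
    np.fill_diagonal(J, 0.0)                 # Phi_J: no self-connections
    for _ in range(steps):
        # Eq. 6: nu = [I - J]^{-1} (nu0 E + K nu_hat)
        nu = np.linalg.solve(np.eye(N) - J, nu0 * E + K @ nu_hat)
        QW = Wbar * np.outer(nu, nu)         # Eq. 8 with C^W = 0
        dJ = w_in * np.outer(E, nu) + w_out * np.outer(nu, E) + QW   # Eq. 10
        np.fill_diagonal(dJ, 0.0)            # projector Phi_J
        J += dt * dJ
    return J, nu
```

With \overline{W} < 0 and the illustrative values used in our checks, the mean rate settles near the homeostatic value discussed in Section 4.1.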

4 Recurrent Network with Fixed Input Weights

We now examine the case of a recurrently connected network with ﬁxed input weights K and learning on the recurrent weights J. Only Eqs. 6, 8 and 10 remain,


which simplifies the analysis. In the case of full recurrent connectivity, \Phi_J in Eq. 10 only nullifies the diagonal terms of the square matrix in its argument.

4.1 Analytical Predictions

Homeostatic equilibrium. Similarly to [7,2], we derive the scalar equations of the mean firing rate \nu_{av} \equiv N^{-1} \sum_i \nu_i and the mean weight J_{av} \equiv (N(N-1))^{-1} \sum_{i \neq j} J_{ij}. This consists of neglecting the inhomogeneities of the firing rates and of the weights over the network, as well as of the connectivity, and we obtain

\nu_{av} = \frac{\nu_0 + (K\hat{\nu})_{av}}{1 - (N-1)\, J_{av}},
\qquad
\dot{J}_{av} = (w^{in} + w^{out})\, \nu_{av} + (\overline{W} + \tilde{C})\, \nu_{av}^2,   (11)

where (K\hat{\nu})_{av} denotes the mean of the matrix product K\hat{\nu}, and \tilde{C} is defined as

\tilde{C} \equiv \frac{(K \hat{C}^W K^T)_{av}}{[\nu_0 + (K\hat{\nu})_{av}]^2}.   (12)

Eqs. 11 are thus the same equations as for the case with no input [2, Sec. 5], with \nu_0 and \overline{W} replaced by, resp., \nu_0 + (K\hat{\nu})_{av} and \overline{W} + \tilde{C}. This qualitatively implies the same dynamical behaviour and, provided that \overline{W} + \tilde{C} < 0, the network exhibits a homeostatic equilibrium (when it is realisable; in particular it requires \mu > 0), and the means of the firing rates and of the weights converge towards

\nu^*_{av} = \mu \equiv -\frac{w^{in} + w^{out}}{\overline{W} + \tilde{C}},
\qquad
J^*_{av} = \frac{\mu - \nu_0 - (K\hat{\nu})_{av}}{(N-1)\,\mu}.   (13)
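Evaluating Eq. 13 numerically is immediate; a small helper (the argument names are ours: W_C stands for \overline{W} + \tilde{C}, and K_nu_av for (K\hat{\nu})_{av}):

```python
def homeostatic_equilibrium(w_in, w_out, W_C, nu0, K_nu_av, N):
    """Equilibrium mean firing rate and mean weight from Eq. 13.
    Requires W_C < 0 and mu > 0 for the equilibrium to be realisable."""
    mu = -(w_in + w_out) / W_C
    J_av = (mu - nu0 - K_nu_av) / ((N - 1) * mu)
    return mu, J_av
```

For instance, with w^in = 1, w^out = -0.5 and \overline{W} + \tilde{C} = -0.1, the predicted mean rate is \mu = 5.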

If the input correlations are time-invariant functions, positive (i.e. the inputs are more likely to fire in a synchronous way), and homogeneous among the correlated input pool, it follows that \tilde{C} is of the same sign as \overline{W}. Therefore, the condition for stability reverts to \overline{W} < 0 as in [7,2]. Structural dynamics. For the particular case of uncorrelated inputs (or, more generally, when K \hat{C}^W K^T = 0), the fixed-point structure of the dynamical system is qualitatively the same as for the case of no external input [2, Sec. 5]: a homogeneous distribution for the firing rates and a continuous manifold of fixed points for the internal weights. In the case of "spatial" inhomogeneities over the input correlations, the network dynamics shows a different evolution. To illustrate this and to compare with the case of feed-forward connections, we consider the network configuration described in Fig. 3 and inspired by [7], where one input pool is correlated while the other pool consists of uncorrelated sources. In general, the equations that


Fig. 3. Architecture of the simulated network. The inputs are divided into two subpools, each feeding half of the internal network with means K1 and K2 for the input weights from each subpool. Similarly to the recurrent weights J and the ﬁring rates νˆ and ν, the inhomogeneities are neglected within each subpool and they are assumed to be all identical to their mean. The weights and delays are initially set with 10% random variation around a mean.

determine the fixed points have no exact solution. Yet, we can reduce the dimensionality by neglecting the variance within each subpool, and make approximations to evaluate the asymptotic distribution of the firing rates, which turns out to be bimodal.

4.2 Simulation Protocol and Results

We simulated a network of Poisson neurons as described in Fig. 3 with random initial recurrent weights (uniformly distributed in a given interval, as are all the synaptic delays). An issue with such simulations is to maintain positive internal weights during their homeostatic convergence, because they individually diverge. Thus, not all equilibria are realisable, depending on the initial distributions of K and J and the weight bounds [7,2]. See [2, Sec. 5] for details about the simulation parameters. An interesting first case to test the analytical predictions consists of two pools of uncorrelated inputs, each feeding half of the network with distinct weights (K_1 \hat{\nu}_1 \neq K_2 \hat{\nu}_2 and \hat{C}^W = 0). Each half of the internal network thus has distinct firing rates initially and, as predicted, the outcome is a convergence of these firing rates towards a uniform value (similar to the case of no external input [2, Sec. 5]). In the case of a fully connected network (for both the K and the J, according to Fig. 3) stimulated by one correlated input pool (short-time correlation inspired by [4], so that \hat{C}^W \neq 0) and one uncorrelated pool, both the internal firing rates and the internal weights also exhibit a homeostatic equilibrium. As shown in Fig. 4, the means over the network (thick solid lines) converge towards the predicted equilibrium values (dashed lines). Furthermore, the individual firing rates tend to stabilise and their distribution remains bimodal (the subpool #1 excited by correlated inputs eventually fires at a lower rate when \overline{W} < 0). The recurrent weights individually diverge, similarly to [2, Sec. 5], and reorganise so that the outgoing weights from the subpool #1 (see the means over each weight subpool


Fig. 4. Evolution of the firing rates (left) and of the recurrent weights (right) for N = 30 fully-connected Poisson neurons (cf. Fig. 3) with short-time correlated inputs. The outcome is a quasi-bimodal distribution of the firing rates (the grey bundle; the mean is the thick solid line) around the predicted homeostatic equilibrium (dashed line). The subgroup #1 that receives correlated inputs is more excited initially but fires at a lower rate at the end of the simulation (cf. the two thin black solid lines, which represent the mean over each subpool). The internal weights individually diverge, while their mean (thick line) converges towards the predicted equilibrium value (dashed line). They reorganise themselves so that the weights outgoing from the subpool #2 (which receives uncorrelated inputs) become silent, while the ones from #1 are strengthened. Note that the homeostatic equilibrium is preserved even when some weights saturate.

Fig. 5. Evolution of the firing rates for a partially connected network of N = 75 neurons. Both the K and the J have 40% probability of connection, with the same setup as the network in Fig. 4 (to preserve the total input synaptic strength). The mean firing rate (very thick line) still converges towards the predicted equilibrium value (dashed line) and the two subgroups (grey bundles, each mean represented by a thin black solid line) separate similarly to the case of full connectivity. The internal weights (not shown) exhibit similar dynamics to the case of full connectivity.

J11 and J21 in Fig. 4) are strengthened while the other ones almost become silent (see J12 and J22 ). In other words, the subpool that receives correlated inputs takes the upper hand in the recurrent architecture.



Fig. 6. Simulation of networks of IF neurons with partial random connectivity of 50%. The network qualitatively exhibits the expected behaviour in the case of uncorrelated inputs (left) and inputs with one correlated pool and one uncorrelated pool (right).

In the case of partial connectivity for both the K and the J, the behaviour of the individual firing rates (cf. Fig. 5) still follows the predictions, but they are more dispersed, their convergence is slower, and the bimodal distribution is not always observed as clearly as in the case of full connectivity (in Fig. 5 the means of the two internal neuron subpools nevertheless remain clearly separated). The homeostatic equilibrium of the internal weights also holds, and they individually diverge. The partial connectivity needs to be dense enough for the predictions of the mean variables to be accurate. First results with IF neurons show that the analytical predictions remain valid even if the activation mechanisms are more complex (here with a connectivity of 50%, cf. Fig. 6).

5 Discussion and Future Work

The analytical results presented here are preliminary, and further investigation is needed to gain a better understanding of the interplay between the input correlation structure and STDP. Nevertheless, our results illustrate two points: STDP induces stable activity in recurrent architectures similar to that for feed-forward ones (homeostatic regulation of the network activity under the condition \overline{W} < 0); and the qualitative structure of the internal firing rates is mainly determined by the input correlation structure. Namely, a "poor" correlation structure (uncorrelated or delta-correlated inputs, so that \hat{C}^W = 0) induces a homogenisation of the firing activity. Finally, partial connectivity impacts upon the structure of the internal firing rates, but such networks still exhibit behaviour similar to fully connected ones. Preliminary results involving more complex patterns of K \hat{Q}^W K^T suggest a more complex interplay between the input correlation structure and the equilibrium distribution of the network firing rates and weights. Such cases are under investigation and may constitute more "interesting" dynamic behaviour of the network from a cognitive modelling point of view, namely through the


relationship between the attractors of the network activity and the input structure. The case of learning for both the input connections and the recurrent ones will also form part of a future study. Comparison with IF neurons suggests that the impact of the neuron activation mechanisms on the weight dynamics may not be significant. This can be linked to the separation of time scales between them in the case of slow learning. Note that with our approximations the IF neurons are assumed to be in a "linear input-output regime" (no bursting, for instance).

Acknowledgments The authors thank Iven Mareels, Chris Trengove, Sean Byrnes and Hamish Mefﬁn for useful discussions that introduced signiﬁcant improvements. MG is funded by two scholarships from The University of Melbourne and NICTA. ANB and DBG acknowledge funding from the Australian Research Council (ARC Discovery Projects #DP0453205 and #DP0664271) and The Bionic Ear Institute.

References

1. Bi, G.Q., Poo, M.M.: Synaptic modification by correlated activity: Hebb's postulate revisited. Annual Review of Neuroscience 24, 139–166 (2001)
2. Burkitt, A.N., Gilson, M., van Hemmen, J.L.: Spike-timing-dependent plasticity for neurons with recurrent connections. Biological Cybernetics 96, 533–546 (2007)
3. Gerstner, W., Kempter, R., van Hemmen, J.L., Wagner, H.: A neuronal learning rule for sub-millisecond temporal coding. Nature 383, 76–78 (1996)
4. Gutig, R., Aharonov, R., Rotter, S., Sompolinsky, H.: Learning input correlations through nonlinear temporally asymmetric hebbian plasticity. Journal of Neuroscience 23, 3697–3714 (2003)
5. Hawkes, A.G.: Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society Series B-Statistical Methodology 33, 438–443 (1971)
6. Hebb, D.O.: The organization of behavior: a neuropsychological theory. Wiley, Chichester (1949)
7. Kempter, R., Gerstner, W., van Hemmen, J.L.: Hebbian learning and spiking neurons. Physical Review E 59, 4498–4514 (1999)
8. van Rossum, M.C.W., Bi, G.Q., Turrigiano, G.G.: Stable hebbian learning from spike timing-dependent plasticity. Journal of Neuroscience 20, 8812–8821 (2000)

A Comparative Study of Synchrony Measures for the Early Detection of Alzheimer's Disease Based on EEG

Justin Dauwels, François Vialatte, and Andrzej Cichocki

RIKEN Brain Science Institute, Saitama, Japan
[email protected], {fvialatte,cia}@brain.riken.jp

Abstract. It has repeatedly been reported in the medical literature that the EEG signals of Alzheimer's disease (AD) patients are less synchronous than those of age-matched control patients. This phenomenon, however, does not at present allow reliable prediction of AD at an early stage, so-called mild cognitive impairment (MCI), due to the large variability among patients. In recent years, many novel techniques to quantify EEG synchrony have been developed; some of them are believed to be more sensitive to abnormalities in EEG synchrony than traditional measures such as the cross-correlation coefficient. In this paper, a wide variety of synchrony measures is investigated in the context of AD detection, including the cross-correlation coefficient, the mean-square and phase coherence functions, Granger causality, the recently proposed corr-entropy coefficient and two novel extensions, phase synchrony indices derived from the Hilbert transform and time-frequency maps, information-theoretic divergence measures in the time domain and the time-frequency domain, state space based measures (in particular, non-linear interdependence measures and the S-estimator), and lastly, the recently proposed stochastic-event synchrony measures. For the data set at hand, only two synchrony measures are able to convincingly distinguish MCI patients from age-matched control patients (p < 0.005), i.e., Granger causality (in particular, the full-frequency directed transfer function) and stochastic event synchrony (in particular, the fraction of non-coincident activity). Combining those two measures with additional features may eventually yield a reliable diagnostic tool for MCI and AD.

1 Introduction

Many studies have shown that the EEG signals of AD patients are generally less coherent than those of age-matched control patients (see [1] for an in-depth review). It is noteworthy, however, that this effect is not always easily detectable: there tends to be a large variability among AD patients. This is especially the case for patients in the pre-symptomatic phase, commonly referred to as Mild Cognitive Impairment (MCI), during which neuronal degeneration occurs prior to the appearance of clinical symptoms. On the other hand, it is crucial to predict AD at

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 112–125, 2008. © Springer-Verlag Berlin Heidelberg 2008


an early stage: medications that aim at delaying the effects of AD (and hence intend to improve the quality of life of AD patients) are most effective if applied in the pre-symptomatic phase. In recent years, a large variety of measures has been proposed to quantify EEG synchrony (we refer to [2]–[5] for recent reviews on EEG synchrony measures); some of those measures are believed to be more sensitive to perturbations in EEG synchrony than classical indices such as the cross-correlation coefficient or the coherence function. In this paper, we systematically investigate the state-of-the-art of measuring EEG synchrony, with special focus on the detection of AD in its early stages. (A related study has been presented in [6,7] in the context of epilepsy.) We consider various synchrony measures stemming from a wide spectrum of disciplines, such as physics, information theory, statistics, and signal processing. Our aim is to investigate which measures are the most suitable for detecting the effect of synchrony perturbations in MCI and AD patients; we also wish to better understand which aspects of synchrony are captured by the different measures, and how the measures are related to each other. This paper is structured as follows. In Section 2 we review the synchrony measures considered in this paper. In Section 3 those measures are applied to EEG data, in particular for the purpose of detecting MCI; we describe the EEG data set, elaborate on various implementation issues, and present our results. At the end of the paper, we briefly relate our results to earlier work and speculate about the neurophysiological interpretation of our results.

2 Synchrony Measures

We briefly review the various families of synchrony measures investigated in this paper: the cross-correlation coefficient and its analogues in the frequency and time-frequency domains, Granger causality, phase synchrony, state space based synchrony, information-theoretic interdependence measures, and lastly, stochastic-event synchrony measures, which we developed in recent work.

2.1 Cross-Correlation Coefficient

The cross-correlation coefficient r is perhaps the most well-known measure of (linear) interdependence between two signals x and y. If x and y are not linearly correlated, r is close to zero; on the other hand, if both signals are identical, then r = 1 [8].
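In code, r is simply the Pearson coefficient of the two sample vectors; a direct sketch:

```python
import numpy as np

def cross_correlation_coeff(x, y):
    """Zero-lag cross-correlation (Pearson) coefficient r between two signals."""
    x = np.asarray(x, float) - np.mean(x)
    y = np.asarray(y, float) - np.mean(y)
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))
```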

2.2 Coherence

The coherence function quantiﬁes linear correlations in frequency domain. One distinguishes the magnitude square coherence function c(f ) and the phase coherence function φ(f ) [8].
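As an illustration, c(f) and the phase of the cross-spectrum can be estimated with Welch-type averaging. The sketch below is ours, using `scipy.signal` and synthetic signals sharing a 10 Hz component; the signal length, noise level, and `nperseg` are arbitrary choices.

```python
import numpy as np
from scipy.signal import coherence, csd

fs = 200.0                                   # sampling rate (Hz), as for the EEG data below
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
common = np.sin(2 * np.pi * 10 * t)          # shared 10 Hz oscillation
x = common + 0.5 * rng.standard_normal(t.size)
y = common + 0.5 * rng.standard_normal(t.size)

# magnitude square coherence c(f), bounded in [0, 1] at every frequency
f, c = coherence(x, y, fs=fs, nperseg=256)

# phase coherence phi(f): the phase of the cross-spectral density
_, Sxy = csd(x, y, fs=fs, nperseg=256)
phi = np.angle(Sxy)
```

Here c(f) peaks near 10 Hz, the frequency at which the two signals share power.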

2.3 Corr-Entropy Coefficient

The corr-entropy coefficient r_E is a recently proposed [9] non-linear extension of the correlation coefficient r; it is close to zero if x and y are independent (which is stronger than being uncorrelated).

2.4 Coh-Entropy and Wav-Entropy Coefficient

One can define a non-linear magnitude square coherence function, which we will refer to as the "coh-entropy" coefficient c_E(f); it is an extension of the corr-entropy coefficient to the frequency domain. The corr-entropy coefficient r_E can also be extended to the time-frequency domain, by replacing the signals x and y in the definition of r_E by their time-frequency ("wavelet") transforms. In this paper, we use the complex Morlet wavelet, which is known to be well-suited for EEG signals [10]. The resulting measure is called the "wav-entropy" coefficient w_E(f). (To our knowledge, both c_E(f) and w_E(f) are novel.)

2.5 Granger Causality

Granger causality¹ refers to a family of synchrony measures that are derived from linear stochastic models of time series; like the above linear interdependence measures, they quantify to which extent different signals are linearly interdependent. Whereas the above linear interdependence measures are bivariate, i.e., they can only be applied to pairs of signals, Granger causality measures are multivariate: they can be applied to multiple signals simultaneously. Suppose that we are given n signals X_1(k), X_2(k), ..., X_n(k), each stemming from a different channel. We consider the multivariate autoregressive (MVAR) model:

X(k) = \sum_{j=1}^{p} A(j)\, X(k-j) + E(k),   (1)

where X(k) = (X_1(k), X_2(k), ..., X_n(k))^T, p is the model order, the model coefficients A(j) are n × n matrices, and E(k) is a zero-mean Gaussian random vector of size n. In words: each signal X_i(k) is assumed to depend linearly on its own p past values and the p past values of the other signals X_j(k). The deviation between X(k) and this linear dependence is modeled by the noise component E(k). Model (1) can also be cast in the form:

E(k) = \sum_{j=0}^{p} \tilde{A}(j)\, X(k-j),   (2)

¹ The Granger causality measures we consider here are implemented in the BioSig library, available from http://biosig.sourceforge.net/


where \tilde{A}(0) = I (identity matrix) and \tilde{A}(j) = -A(j) for j > 0. One can transform (2) into the frequency domain (by applying the z-transform and substituting z = e^{-2\pi i f \Delta t}, where 1/\Delta t is the sampling rate):

X(f) = \tilde{A}^{-1}(f)\, E(f) = H(f)\, E(f).   (3)

The power spectrum matrix of the signal X(k) is determined as

S(f) = \langle X(f) X^*(f) \rangle = H(f)\, V\, H^*(f),   (4)

where V stands for the covariance matrix of E(k). The Granger causality measures are defined in terms of coefficients of the matrices A, H, and S. Due to space limitations, only a short description of these methods is provided here; additional information can be found in the existing literature (e.g., [4]). From these coefficients, two symmetric measures can be defined:
– Granger coherence |K_{ij}(f)| ∈ [0, 1] describes the amount of in-phase components in signals i and j at the frequency f.
– Partial coherence (PC) |C_{ij}(f)| ∈ [0, 1] describes the amount of in-phase components in signals i and j at the frequency f when the influence (i.e., linear dependence) of the other signals is statistically removed.
The following asymmetric ("directed") Granger causality measures capture causal relations:
– Directed transfer function (DTF) \gamma^2_{ij}(f) quantifies the fraction of inflow to channel i stemming from channel j.
– Full frequency directed transfer function (ffDTF)

F^2_{ij}(f) = \frac{|H_{ij}(f)|^2}{\sum_f \sum_{j=1}^{m} |H_{ij}(f)|^2} \in [0, 1],   (5)

a variation of \gamma^2_{ij}(f) with a global normalization in frequency.
– Partial directed coherence (PDC) |P_{ij}(f)| ∈ [0, 1] represents the fraction of outflow from channel j to channel i.
– Direct directed transfer function (dDTF) \chi^2_{ij}(f) = F^2_{ij}(f)\, C^2_{ij}(f) is non-zero if the connection between channels i and j is causal (non-zero F^2_{ij}(f)) and direct (non-zero C^2_{ij}(f)).
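Given already-fitted MVAR coefficients A(j), the transfer matrix H(f) of Eq. 3 and the DTF follow by direct computation. The sketch below is ours (the model estimation itself, e.g. via the BioSig routines, is omitted); it uses the per-frequency row normalization \gamma^2_{ij}(f) = |H_{ij}(f)|^2 / \sum_m |H_{im}(f)|^2.

```python
import numpy as np

def dtf_from_mvar(A, f, dt):
    """Directed transfer function at frequency f from MVAR coefficients.
    A: array of shape (p, n, n) holding A(1)..A(p); dt = 1 / sampling rate.
    Returns the n x n matrix gamma2, whose rows sum to one."""
    p, n, _ = A.shape
    A_tilde = np.eye(n, dtype=complex)                              # A~(0) = I
    for j in range(1, p + 1):
        A_tilde -= A[j - 1] * np.exp(-2j * np.pi * f * j * dt)      # A~(j) = -A(j)
    H = np.linalg.inv(A_tilde)                                      # Eq. 3: X(f) = H(f) E(f)
    power = np.abs(H) ** 2
    return power / power.sum(axis=1, keepdims=True)
```

For two uncoupled AR(1) channels, the DTF is the identity matrix: there is no cross-channel inflow.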

2.6 Phase Synchrony

Phase synchrony refers to the interdependence between the instantaneous phases φx and φy of two signals x and y; the instantaneous phases may be strongly synchronized even when the amplitudes of x and y are statistically independent. The instantaneous phase φx of a signal x may be extracted as [11]:

\phi^H_x(k) = \arg [x(k) + i\,\tilde{x}(k)],   (6)


where \tilde{x} is the Hilbert transform of x. Alternatively, one can derive the instantaneous phase from the time-frequency transform X(k, f) of x:

\phi^W_x(k, f) = \arg [X(k, f)].   (7)

The phase \phi^W_x(k, f) depends on the center frequency f of the applied wavelet. By appropriately scaling the wavelet, the instantaneous phase may be computed in the frequency range of interest. The phase synchrony index \gamma for two instantaneous phases \phi_x and \phi_y is defined as [11]:

\gamma = \left| \langle e^{i(n\phi_x - m\phi_y)} \rangle \right| \in [0, 1],   (8)

where n and m are integers (usually n = 1 = m). We will use the notation \gamma_H and \gamma_W to indicate whether the instantaneous phases are computed by the Hilbert transform or the time-frequency transform, respectively. In this paper, we will consider two additional phase synchrony indices, i.e., the evolution map approach (EMA) and the instantaneous period approach (IPA) [12]. Due to space constraints, we will not describe those measures here; instead we refer the reader to [12]²; additional information about phase synchrony can be found in [6].
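A minimal sketch (ours) of the Hilbert-based index \gamma_H of Eqs. 6 and 8, using the usual modulus-of-mean estimator and assuming reasonably narrow-band signals:

```python
import numpy as np
from scipy.signal import hilbert

def phase_synchrony_hilbert(x, y, n=1, m=1):
    """gamma_H = |< exp(i (n phi_x - m phi_y)) >|, with instantaneous phases
    extracted from the analytic signal given by the Hilbert transform (Eq. 6)."""
    phi_x = np.angle(hilbert(np.asarray(x, float)))
    phi_y = np.angle(hilbert(np.asarray(y, float)))
    return float(np.abs(np.mean(np.exp(1j * (n * phi_x - m * phi_y)))))
```

Two sinusoids at the same frequency with a constant phase offset give \gamma_H close to 1; sinusoids at different frequencies give \gamma_H close to 0.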

2.7 State Space Based Synchrony

State space based synchrony (or "generalized synchronization") evaluates synchrony by analyzing the interdependence between the signals in a reconstructed state space (see e.g., [7]). The central hypothesis behind this approach is that the signals at hand are generated by some (unknown) deterministic, potentially high-dimensional, non-linear dynamical system. In order to reconstruct such a system from a signal x, one considers delay vectors X(k) = (x(k), x(k − τ), ..., x(k − (m − 1)τ))^T, where m is the embedding dimension and τ denotes the time lag. If τ and m are appropriately chosen, and the signals are indeed generated by a deterministic dynamical system (to a good approximation), the delay vectors lie on a smooth manifold ("mapping") in R^m, apart from small stochastic fluctuations. The S-estimator [13], here denoted by S_est, is a state space based measure obtained by applying principal component analysis (PCA) to delay vectors³. We also consider three measures of nonlinear interdependence, S^k, H^k, and N^k (see [6] for details⁴).
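The delay embedding, and one plausible reading of an S-estimator-style index (PCA eigenvalue entropy of the stacked delay vectors, normalized so that values near 0 indicate independent signals and values approaching 1 indicate strong synchronization), can be sketched as follows. The choices of m, tau, and the normalization details here are illustrative, not the exact recipe of [13].

```python
import numpy as np

def delay_embed(x, m, tau):
    """Delay vectors (x(k), x(k + tau), ..., x(k + (m-1) tau)) as rows."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[j * tau: j * tau + n] for j in range(m)])

def s_estimator(signals, m=5, tau=1):
    """S = 1 + sum_i l_i log l_i / log d, where l_i are the normalized
    eigenvalues of the covariance of all stacked (standardized) delay vectors."""
    emb = np.hstack([delay_embed(np.asarray(s, float), m, tau) for s in signals])
    emb = (emb - emb.mean(axis=0)) / emb.std(axis=0)
    lam = np.linalg.eigvalsh(np.cov(emb, rowvar=False))
    lam = np.clip(lam, 1e-12, None)          # guard against tiny negative eigenvalues
    lam = lam / lam.sum()
    return float(1.0 + np.sum(lam * np.log(lam)) / np.log(lam.size))
```

Two independent white-noise signals yield an index near 0, while a signal paired with itself yields a clearly larger value.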

2.8 Information-Theoretic Measures

Several interdependence measures have been proposed that have their roots in information theory. Mutual Information I is perhaps the most well-known

² Program code is available at www.agnld.uni-potsdam.de/%7Emros/dircnew.m
³ We used the S Toolbox downloadable from http://aperest.epfl.ch/docs/software.htm
⁴ Software is available from http://www.vis.caltech.edu/~rodri/software.htm


information-theoretic interdependence measure; it quantifies the amount of information the random variable Y contains about the random variable X (and vice versa); it is always non-negative, and it vanishes when X and Y are statistically independent. Recently, a sophisticated and effective technique to compute mutual information between time series was proposed [14]; we will use that method in this paper⁵. The method of [14] computes mutual information in the time domain; alternatively, this quantity may also be determined in the time-frequency domain (denoted by I_W), more specifically, from normalized spectrograms [15,16] (see also [17,18]). We will also consider several information-theoretic measures that quantify the dissimilarity (or "distance") between two random variables (or signals). In contrast to the previously mentioned measures, those divergence measures vanish if the random variables (or signals) are identical; moreover, they are not necessarily symmetric, and therefore, they cannot be considered distance measures in the strict sense. Divergences may be computed in the time domain and the time-frequency domain; in this paper, we will only compute the divergence measures in the time-frequency domain, since the computation in the time domain is far more involved. We consider the Kullback-Leibler divergence K, the Rényi divergence D_α, the Jensen-Shannon divergence J, and the Jensen-Rényi divergence J_α. Due to space constraints, we will not review those divergence measures here; we refer the interested reader to [15,16].
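For intuition only, here is a naive histogram estimator of I (the paper uses the far more refined binless estimator of [14]; the bin count below is an arbitrary choice):

```python
import numpy as np

def mutual_information_hist(x, y, bins=16):
    """Histogram estimate of I(X;Y) in nats:
    sum over cells of p(x,y) log[ p(x,y) / (p(x) p(y)) ]."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                      # skip empty cells (0 log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

Independent samples give an estimate near zero (up to the well-known positive bias of histogram estimators), while identical variables give an estimate near log(bins).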

2.9 Stochastic Event Synchrony (SES)

Stochastic event synchrony, an interdependence measure we developed in earlier work [19], describes the similarity between the time-frequency transforms of two signals x and y. As a first step, the time-frequency transform of each signal is approximated as a sum of (half-ellipsoid) basis functions, referred to as "bumps" (see Fig. 1 and [20]). The resulting bump models, representing the most prominent oscillatory activity, are then aligned (see Fig. 2): bumps in one time-frequency map may not be present in the other map ("non-coincident bumps"); other bumps are present in both maps ("coincident bumps"), but appear at slightly different positions on the maps. The black lines in Fig. 2 connect the centers of coincident bumps, and hence visualize the offset in position between pairs of coincident bumps. Stochastic event synchrony consists of five parameters that quantify the alignment of two bump models:
– ρ: fraction of non-coincident bumps,
– δ_t and δ_f: average time and frequency offset, respectively, between coincident bumps,
– s_t and s_f: variance of the time and frequency offset, respectively, between coincident bumps.
The alignment of the two bump models (cf. Fig. 2 (right)) is obtained by iterative max-product message passing on a graphical model; the five SES parameters are determined from the resulting alignment by maximum a posteriori (MAP)

⁵ The program code (in C) is available at www.klab.caltech.edu/~kraskov/MILCA/


Fig. 1. Bump modeling

Fig. 2. Coincident and non-coincident activity (“bumps”); (left) bump models of two signals; (right) coincident bumps; the black lines connect the centers of coincident bumps

estimation [19]. The parameters ρ and s_t are the most relevant for the present study, since they quantify the synchrony between the bump models (and hence the original time-frequency maps); low ρ and s_t imply that the two time-frequency maps at hand are well synchronized.

3 Detection of EEG Synchrony Abnormalities in MCI Patients

In the following section we describe the EEG data we analyzed. In Section 3.2 we address certain technical issues related to the synchrony measures, and in Section 3.3 we present and discuss our results.

3.1 EEG Data

The EEG data⁶ analyzed here have been used in previous studies concerning the early detection of AD [21]–[25]. They consist of rest eyes-closed EEG data

⁶ We are grateful to Prof. T. Musha for providing us with the EEG data.


recorded from 21 sites on the scalp based on the 10–20 system. The sampling frequency was 200 Hz, and the signals were band-pass filtered between 4 and 30 Hz. The subjects comprised two study groups. The first consisted of a group of 25 patients who had complained of memory problems. These subjects were diagnosed as suffering from MCI and subsequently developed mild AD. The criterion for inclusion in the MCI group was a mini mental state exam (MMSE) score above 24; the average score in the MCI group was 26 (SD of 1.8). The other group was a control set consisting of 56 age-matched, healthy subjects who had no memory or other cognitive impairments. The average MMSE of this control group was 28.5 (SD of 1.6). The ages of the two groups were 71.9 ± 10.2 and 71.7 ± 8.3, respectively. Pre-selection was conducted to ensure that the data were of high quality, as determined by the presence of at least 20 s of artifact-free data. Based on this requirement, the number of subjects in the two groups described above was reduced to 22 MCI patients and 38 control subjects.

Methods

In order to reduce the computational complexity, we aggregated the EEG signals into 5 zones (see Fig. 3); we computed the synchrony measures (except the S-estimator) from the averages of each zone. For all measures except SES, we used the arithmetic average; in the case of SES, the bump models obtained from the 21 electrodes were clustered into 5 zones by means of the aggregation algorithm described in [20]. We evaluated the S-estimator between each pair of zones by applying PCA to the state space embedded EEG signals of both zones.

We divided the EEG signals into segments of equal length L, and computed the synchrony measures by averaging over those segments. Since spontaneous EEG is usually highly non-stationary, and most synchrony measures are, strictly speaking, only applicable to stationary signals, the length L should be sufficiently small; on the other hand, in order to obtain reliable measures of synchrony, the length should be chosen sufficiently large. Consequently, it is not a priori clear how to choose the length L, and we therefore decided to test several values, i.e., L = 1 s, 5 s, and 20 s.

In the case of Granger causality measures, one needs to specify the model order p. Similarly, for mutual information (in the time domain) and the state space based measures, the embedding dimension m and the time lag τ need to be chosen; the phase synchrony indices IPA and EMA involve a time delay τ. Since it is not obvious which parameter values yield the best performance for detecting AD, we tried a range of parameter settings, i.e., p = 1, 2, ..., 10 and m = 1, 2, ..., 10; the time delay was in each case set to τ = 1/30 s, which is the period of the fastest oscillations in the EEG signals at hand.
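The segmentation-and-averaging procedure described above can be sketched as follows; function names are ours, and the classical cross-correlation coefficient r is used here only as an example measure (the same scheme applies to any of the pairwise measures):

```python
import numpy as np

def segment_averaged_synchrony(x, y, fs, seg_len_s, measure):
    """Split two zone-averaged EEG signals into non-overlapping segments
    of length L (seg_len_s, in seconds) and average a synchrony measure
    over those segments."""
    n = int(seg_len_s * fs)
    n_seg = min(len(x), len(y)) // n
    vals = [measure(x[k * n:(k + 1) * n], y[k * n:(k + 1) * n])
            for k in range(n_seg)]
    return float(np.mean(vals))

def corr_coeff(xs, ys):
    """Classical zero-lag cross-correlation coefficient r."""
    return np.corrcoef(xs, ys)[0, 1]

# Example: two correlated 20 s signals at 200 Hz, L = 5 s -> 4 segments.
fs = 200.0
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(t.size)
y = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(t.size)
r = segment_averaged_synchrony(x, y, fs, 5.0, corr_coeff)  # high r for the shared 10 Hz component
```

Shorter segments better respect the non-stationarity of spontaneous EEG, at the cost of noisier per-segment estimates; the averaging over segments is what trades these off.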

3.3 Results and Discussion

Our main results are summarized in Table 1, which shows the sensitivity of the synchrony measures for detecting MCI. Due to space constraints, the table only shows results for global synchrony, i.e., the synchrony measures were averaged


Fig. 3. The 21 electrodes used for EEG recording, distributed according to the 10–20 international placement system [8]. The clustering into 5 zones is indicated by the colors and dashed lines (1 = frontal, 2 = left temporal, 3 = central, 4 = right temporal, and 5 = occipital).

over all pairs of zones. (Results for local synchrony and individual frequency bands will be presented in a longer report, including a detailed description of the influence of various parameters, such as model order and embedding dimension, on the sensitivity.) The p-values, obtained by the Mann-Whitney test, strictly speaking need to be Bonferroni corrected; since we consider many different measures simultaneously, it is likely that a few of those measures have small p-values merely due to stochastic fluctuations (and not due to systematic differences between MCI and control patients). In the most conservative Bonferroni correction, the significance level is divided by the number of synchrony measures. From the table, it can be seen that only a few measures evince significant differences in EEG synchrony between MCI and control patients: full-frequency DTF and ρ are the most sensitive (for the data set at hand); their p-values remain significant (p_corr < 0.05) after Bonferroni correction. In other words, the effect of MCI and AD on EEG synchrony can be detected, as was reported earlier in the literature; we will expand on this issue in the following section.

In order to gain more insight into the relation between the different measures, we calculated the correlation between them (see Fig. 5; red and blue indicate strong correlation and anti-correlation, respectively). From this figure, it becomes strikingly clear that the majority of measures are strongly correlated (or anti-correlated) with each other; in other words, the measures can easily be classified into different families. In addition, many measures are strongly (anti-)correlated with the classical cross-correlation coefficient r, the most basic measure; as a result, they do not provide much additional information regarding EEG synchrony. Measures that are only weakly correlated with the cross-correlation coefficient include the phase synchrony indices, Granger causality measures, and stochastic-event synchrony measures; interestingly, those three families of synchrony measures are mutually uncorrelated, and as a consequence, they each seem to capture a specific kind of interdependence.
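The statistical procedure used above (a Mann-Whitney test per measure, then a conservative Bonferroni correction for the number of measures tested) can be written as follows; the synthetic group values and the count of 34 measures are illustrative only (the paper does not state the exact count), and SciPy is assumed:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
n_measures = 34  # number of synchrony measures tested simultaneously (illustrative count)

# Synthetic per-subject values of one synchrony measure for both groups.
mci = rng.normal(0.35, 0.05, size=22)   # 22 MCI patients
ctr = rng.normal(0.25, 0.05, size=38)   # 38 control subjects

p = mannwhitneyu(mci, ctr, alternative="two-sided").pvalue
# Most conservative Bonferroni correction: multiply p by the number of tests
# (equivalently, divide the 0.05 significance level by n_measures).
p_corr = min(1.0, p * n_measures)
significant = p_corr < 0.05
```

The non-parametric Mann-Whitney test is preferred here over a t-test because the distributions of synchrony values across subjects need not be Gaussian.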


Table 1. Sensitivity of synchrony measures for early prediction of AD (p-values for the Mann-Whitney test; * and ** indicate p < 0.05 and p < 0.005, respectively)

Correlation/coherence measures [8], [9]:
  Cross-correlation            0.028*
  Coherence                    0.060
  Phase Coherence              0.72
  Corr-entropy                 0.27
  Wave-entropy                 0.012*

Granger causality measures [4]:
  Granger coherence            0.15
  Partial Coherence            0.16
  PDC                          0.60
  DTF                          0.34
  ffDTF                        0.0012**
  dDTF                         0.030*

Divergence and mutual-information measures [14], [15]:
  Kullback-Leibler             0.072
  Rényi                        0.076
  Jensen-Shannon               0.084
  Jensen-Rényi                 0.12
  I_W                          0.060
  I                            0.080

State space measures [6], [13]:
  N^k                          0.032*
  S^k                          0.29
  H^k                          0.090
  S-estimator                  0.33

Phase synchrony measures [6], [12]:
  Hilbert Phase                0.15
  Wavelet Phase                0.082
  Evolution Map (EMA)          0.072
  Instantaneous Period (IPA)   0.020*

Stochastic-event synchrony measures:
  s_t                          0.92
  ρ                            0.00029**

In Fig. 4, we combine the two most sensitive synchrony measures (for the data set at hand), i.e., full-frequency DTF and ρ. In this figure, the MCI patients are fairly well distinguishable from the control patients. However, the separation is not sufficiently strong to yield reliable early prediction of AD. For this purpose, the two features would need to be combined with complementary features, for example, features derived from the slowing effect of AD on EEG, or perhaps from different modalities such as PET, MRI, DTI, or biochemical indicators. On the other hand, we remind the reader that in the data set at hand, patients did not carry out any specific task; moreover, the recordings were short (only 20 s). It is plausible that the sensitivity of EEG synchrony could be further improved by increasing the length of the recordings and by recording the EEG before, during, and after patients carry out specific tasks, e.g., working memory tasks.
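One minimal way to "combine" two such features, sketched below on synthetic clusters loosely mimicking the partial separation of Fig. 4, is Fisher's linear discriminant; all numbers and function names here are ours, not from the study:

```python
import numpy as np

def fisher_direction(class_a, class_b):
    """Fisher LDA direction w ∝ S_w^{-1} (mu_a - mu_b) for two feature
    matrices of shape (n_samples, n_features)."""
    mu_a, mu_b = class_a.mean(axis=0), class_b.mean(axis=0)
    # Pooled within-class scatter matrix S_w.
    sw = (np.cov(class_a.T) * (len(class_a) - 1)
          + np.cov(class_b.T) * (len(class_b) - 1))
    return np.linalg.solve(sw, mu_a - mu_b)

rng = np.random.default_rng(2)
# Columns: [ffDTF, rho]; synthetic, partially overlapping clusters.
mci = np.column_stack([rng.normal(0.055, 0.003, 22), rng.normal(0.35, 0.06, 22)])
ctr = np.column_stack([rng.normal(0.050, 0.003, 38), rng.normal(0.25, 0.06, 38)])

w = fisher_direction(mci, ctr)
scores_mci, scores_ctr = mci @ w, ctr @ w
# The 1-D projected scores separate the groups at least as well as either feature alone.
```

In practice, any such classifier would of course have to be validated (e.g., leave-one-out) on the real subject-level features.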

Fig. 4. ρ vs. ffDTF (F_ij²) for the MCI patients and control subjects (CTR)


Fig. 5. Correlation between the synchrony measures. Red and blue indicate strong correlation and anti-correlation, respectively; the measures are grouped into families (state space, corr/coh, mutual information, phase, divergence, Granger, SES).

4 Conclusions

In previous studies, brain dynamics in AD and MCI patients were mainly investigated using coherence (cf. Section 2.2) or state space based measures of synchrony (cf. Section 2.7). During working memory tasks, coherence shows significant effects in AD and MCI groups [26], [27]; in the resting condition, however, coherence does not show such differences in low frequencies (below 30 Hz), neither between AD patients and controls [28] nor between MCI patients and controls [27]. These results are consistent with our observations. In the gamma range, coherence seems to decrease with AD [29]; we did not investigate this frequency range, however, since the EEG signals analyzed here were band-pass filtered between 4 and 30 Hz.

Synchronization likelihood, a state space based synchronization measure similar to the non-linear interdependence measures S^k, H^k, and N^k (cf. Section 2.7), is believed to be more sensitive than coherence for detecting changes in AD patients [28]. Using state space based synchrony methods, significant differences were found between AD patients and controls in resting conditions [28], [30], [32], [33]. State space based synchrony failed to retrieve significant differences between MCI patients and control subjects at a global level [32], [33], but significant effects were observed locally: fronto-parietal electrode synchronization likelihood progressively decreased through the MCI and mild AD groups [30]. We report here a lower p-value for the state space based synchrony measure N^k (p = 0.032) than for coherence (p = 0.06); those low p-values, however, would not be statistically significant after Bonferroni correction.

A Comparative Study of Synchrony Measures for the Early Detection of AD

123

By means of Global Field Synchronization, a phase synchrony measure similar to the ones considered in this paper, Koenig et al. [31] observed a general decrease of synchronization in correlation with cognitive decline and AD. In our study, we analyzed five different phase synchrony measures: Hilbert and wavelet based phase synchrony, phase coherence, the evolution map approach (EMA), and the instantaneous period approach (IPA). The p-value of the latter is low (p = 0.020), in agreement with the results of [31], but it would be non-significant after Bonferroni correction.

The strongest observed effect is a significantly higher degree of local asynchronous activity (ρ) in MCI patients, more specifically, a high number of non-coincident, asynchronous oscillatory events (p = 0.00029). Interestingly, we did not observe a significant effect on the timing jitter s_t of the coincident events (p = 0.92). In other words, our results seem to indicate that there is significantly more non-coincident background activity, while the coincident activity remains well synchronized. On the one hand, this observation is in agreement with previous studies that report a general decrease of neural synchrony in MCI and AD patients; on the other hand, it goes beyond previous results, since it yields a more subtle description of EEG synchrony in MCI and AD patients: it suggests that the loss of coherence is mostly due to an increase of (local) non-coincident background activity, whereas the locked (coincident) activity remains equally well synchronized. In future work, we will verify this conjecture by means of other data sets.
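For reference, a Hilbert-based phase synchrony index of the kind discussed above can be sketched as a generic phase-locking value; this formulation and the function name are ours (assumes SciPy), not the exact definition used in the paper:

```python
import numpy as np
from scipy.signal import hilbert

def phase_locking_value(x, y):
    """Hilbert phase synchrony: |mean of exp(i * delta_phi)| over time.
    Close to 1 for phase-locked signals, near 0 for independent ones."""
    phi_x = np.angle(hilbert(x))
    phi_y = np.angle(hilbert(y))
    return float(np.abs(np.mean(np.exp(1j * (phi_x - phi_y)))))

fs = 200.0
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(3)
# Constant phase lag -> locked; independent noise -> unlocked.
locked = phase_locking_value(np.sin(2 * np.pi * 10 * t),
                             np.sin(2 * np.pi * 10 * t + 0.7))
independent = phase_locking_value(rng.standard_normal(t.size),
                                  rng.standard_normal(t.size))
```

Note that such an index deliberately ignores amplitude: two signals with a fixed phase lag score as fully synchronized even if their amplitudes are unrelated.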

References

1. Jeong, J.: EEG Dynamics in Patients with Alzheimer's Disease. Clinical Neurophysiology 115, 1490–1505 (2004)
2. Pereda, E., Quiroga, R.Q., Bhattacharya, J.: Nonlinear Multivariate Analysis of Neurophysiological Signals. Progress in Neurobiology 77, 1–37 (2005)
3. Breakspear, M.: Dynamic Connectivity in Neural Systems: Theoretical and Empirical Considerations. Neuroinformatics 2(2) (2004)
4. Kamiński, M., Liang, H.: Causal Influence: Advances in Neurosignal Analysis. Critical Reviews in Biomedical Engineering 33(4), 347–430 (2005)
5. Stam, C.J.: Nonlinear Dynamical Analysis of EEG and MEG: Review of an Emerging Field. Clinical Neurophysiology 116, 2266–2301 (2005)
6. Quiroga, R.Q., Kraskov, A., Kreuz, T., Grassberger, P.: Performance of Different Synchronization Measures in Real Data: A Case Study on EEG Signals. Physical Review E 65 (2002)
7. Sakkalis, V., Giurcăneanu, C.D., Xanthopoulos, P., Zervakis, M., Tsiaras, V.: Assessment of Linear and Non-Linear EEG Synchronization Measures for Evaluating Mild Epileptic Signal Patterns. In: Proc. of ITAB 2006, Ioannina-Epirus, Greece, October 26–28 (2006)
8. Nunez, P., Srinivasan, R.: Electric Fields of the Brain: The Neurophysics of EEG. Oxford University Press, Oxford (2006)
9. Xu, J.-W., Bakardjian, H., Cichocki, A., Principe, J.C.: EEG Synchronization Measure: A Reproducing Kernel Hilbert Space Approach. IEEE Transactions on Biomedical Engineering Letters (submitted, September 2006)


10. Herrmann, C.S., Grigutsch, M., Busch, N.A.: EEG Oscillations and Wavelet Analysis. In: Handy, T. (ed.) Event-Related Potentials: A Methods Handbook, pp. 229–259. MIT Press, Cambridge (2005)
11. Lachaux, J.-P., Rodriguez, E., Martinerie, J., Varela, F.J.: Measuring Phase Synchrony in Brain Signals. Human Brain Mapping 8, 194–208 (1999)
12. Rosenblum, M.G., Cimponeriu, L., Bezerianos, A., Patzak, A., Mrowka, R.: Identification of Coupling Direction: Application to Cardiorespiratory Interaction. Physical Review E 65, 041909 (2002)
13. Carmeli, C., Knyazeva, M.G., Innocenti, G.M., De Feo, O.: Assessment of EEG Synchronization Based on State-Space Analysis. Neuroimage 25, 339–354 (2005)
14. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating Mutual Information. Physical Review E 69(6), 66138 (2004)
15. Aviyente, S.: A Measure of Mutual Information on the Time-Frequency Plane. In: Proc. of ICASSP 2005, Philadelphia, PA, USA, March 18–23, vol. 4, pp. 481–484 (2005)
16. Aviyente, S.: Information-Theoretic Signal Processing on the Time-Frequency Plane and Applications. In: Proc. of EUSIPCO 2005, Antalya, Turkey, September 4–8 (2005)
17. Quiroga, R.Q., Rosso, O., Basar, E.: Wavelet-Entropy: A Measure of Order in Evoked Potentials. Electr. Clin. Neurophysiol. (Suppl.) 49, 298–302 (1999)
18. Blanco, S., Quiroga, R.Q., Rosso, O., Kochen, S.: Time-Frequency Analysis of EEG Series. Physical Review E 51, 2624 (1995)
19. Dauwels, J., Vialatte, F., Cichocki, A.: A Novel Measure for Synchrony and Its Application to Neural Signals. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawai'i, April 15–20 (2007)
20. Vialatte, F., Martin, C., Dubois, R., Haddad, J., Quenet, B., Gervais, R., Dreyfus, G.: A Machine Learning Approach to the Analysis of Time-Frequency Maps, and Its Application to Neural Dynamics. Neural Networks 20, 194–209 (2007)
21. Chapman, R., et al.: Brain Event-Related Potentials: Diagnosing Early-Stage Alzheimer's Disease. Neurobiol. Aging 28, 194–201 (2007)
22. Cichocki, A., et al.: EEG Filtering Based on Blind Source Separation (BSS) for Early Detection of Alzheimer's Disease. Clin. Neurophys. 116, 729–737 (2005)
23. Hogan, M., et al.: Memory-Related EEG Power and Coherence Reductions in Mild Alzheimer's Disease. Int. J. Psychophysiol. 49 (2003)
24. Musha, T., et al.: A New EEG Method for Estimating Cortical Neuronal Impairment that is Sensitive to Early Stage Alzheimer's Disease. Clin. Neurophys. 113, 1052–1058 (2002)
25. Vialatte, F., et al.: Blind Source Separation and Sparse Bump Modelling of Time-Frequency Representation of EEG Signals: New Tools for Early Detection of Alzheimer's Disease. In: IEEE Workshop on Machine Learning for Signal Processing, pp. 27–32 (2005)
26. Hogan, M.J., Swanwick, G.R., Kaiser, J., Rowan, M., Lawlor, B.: Memory-Related EEG Power and Coherence Reductions in Mild Alzheimer's Disease. Int. J. Psychophysiol. 49(2), 147–163 (2003)
27. Jiang, Z.Y.: Study on EEG Power and Coherence in Patients with Mild Cognitive Impairment During Working Memory Task. J. Zhejiang Univ. Sci. B 6(12), 1213–1219 (2005)
28. Stam, C.J., van Cappellen van Walsum, A.M., Pijnenburg, Y.A., Berendse, H.W., de Munck, J.C., Scheltens, P., van Dijk, B.W.: Generalized Synchronization of MEG Recordings in Alzheimer's Disease: Evidence for Involvement of the Gamma Band. J. Clin. Neurophysiol. 19(6), 562–574 (2002)


29. Herrmann, C.S., Demiralp, T.: Human EEG Gamma Oscillations in Neuropsychiatric Disorders. Clinical Neurophysiology 116, 2719–2733 (2005)
30. Babiloni, C., Ferri, R., Binetti, G., Cassarino, A., Forno, G.D., Ercolani, M., Ferreri, F., Frisoni, G.B., Lanuzza, B., Miniussi, C., Nobili, F., Rodriguez, G., Rundo, F., Stam, C.J., Musha, T., Vecchio, F., Rossini, P.M.: Fronto-Parietal Coupling of Brain Rhythms in Mild Cognitive Impairment: A Multicentric EEG Study. Brain Res. Bull. 69(1), 63–73 (2006)
31. Koenig, T., Prichep, L., Dierks, T., Hubl, D., Wahlund, L.O., John, E.R., Jelic, V.: Decreased EEG Synchronization in Alzheimer's Disease and Mild Cognitive Impairment. Neurobiol. Aging 26(2), 165–171 (2005)
32. Pijnenburg, Y.A., van der Made, Y., van Cappellen van Walsum, A.M., Knol, D.L., Scheltens, P., Stam, C.J.: EEG Synchronization Likelihood in Mild Cognitive Impairment and Alzheimer's Disease During a Working Memory Task. Clin. Neurophysiol. 115(6), 1332–1339 (2004)
33. Yagyu, T., Wackermann, J., Shigeta, M., Jelic, V., Kinoshita, T., Kochi, K., Julin, P., Almkvist, O., Wahlund, L.O., Kondakor, I., Lehmann, D.: Global Dimensional Complexity of Multichannel EEG in Mild Alzheimer's Disease and Age-Matched Cohorts. Dement. Geriatr. Cogn. Disord. 8(6), 343–347 (1997)

Reproducibility Analysis of Event-Related fMRI Experiments Using Laguerre Polynomials

Hong-Ren Su1,2, Michelle Liou2,*, Philip E. Cheng2, John A.D. Aston2, and Shang-Hong Lai1

1 Dept. of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
2 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
[email protected]

Abstract. In this study, we introduce the use of orthogonal causal Laguerre polynomials for analyzing data collected in event-related functional magnetic resonance imaging (fMRI) experiments. This particular family of polynomials has been widely used in the system identification literature and recommended for modeling impulse functions in BOLD-based fMRI experiments. In empirical studies, we applied Laguerre polynomials to analyze data collected in an event-related fMRI study conducted by Scott et al. (2001). The experimental study investigated neural mechanisms of visual attention in a change-detection task. By specifying a few meaningful Laguerre polynomials in the design matrix of a random effect model, we clearly found brain regions associated with trial onset and visual search. The results are consistent with the original findings in Scott et al. (2001). In addition, we found brain regions related to the mask presence in the parahippocampal gyrus, superior frontal gyrus, and inferior parietal lobule. Both positive and negative responses were also found in the lingual gyrus, cuneus, and precuneus.

Keywords: Reproducibility analysis, Event-related fMRI.

1 Introduction

We previously proposed a methodology for assessing reproducibility evidence in fMRI studies using an on-and-off paradigm without necessarily conducting replicated experiments, and suggested interpreting SPMs in conjunction with reproducibility evidence (Liou et al., 2003; 2006). Empirical studies have shown that the method is robust to the specification of hemodynamic response functions (HRFs). Recently, BOLD-based event-related fMRI experiments have been widely used as an advanced alternative to the on-and-off design for studies on human brain functions. In event-related fMRI experiments, the duration of stimulus presentation is generally longer, and there are no obvious contrasts between the experimental and control conditions to be used in data analyses. In order to detect possible brain activations during stimulus presentation and task performance, a variety of event-related HRFs have been proposed in the literature. In this study, we introduce the use of orthogonal causal

* Corresponding author.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 126–134, 2008.
© Springer-Verlag Berlin Heidelberg 2008


Laguerre polynomials for modeling response functions. This particular family of polynomials has been widely used in the system identification literature and was recommended for modeling impulse functions in fMRI experiments (Saha et al., 2004). In the empirical study, we applied Laguerre polynomials to analyze data from the study by Scott et al. (2001). The dataset was published by the US fMRI Data Center and is available for public access. The original experiment involved 10 human subjects and investigated brain functions associated with a change-detection task. In the experimental task, subjects looked attentively at two versions of the same picture in alternation, separated by a brief mask interval. Behavioral responses were also recorded: subjects pressed a button when they detected something changing between the pictures. In our reproducibility analysis, a few meaningful Laguerre polynomials matching the experimental design were inserted into a random effect model, and reproducibility analyses were conducted based on the selected polynomials. In these analyses, we successfully located brain regions associated with the visual change-detection task similar to those found in Scott et al. (2001). Additionally, we found other interesting brain regions that were not included in the previous study.

2 Method

In this section, we briefly describe the method for investigating reproducibility evidence in fMRI experiments, and outline the family of Laguerre polynomials, including those used in our empirical study.

2.1 Reproducibility Analysis

In the SPM generalized linear model, the fMRI responses in the i-th run can be expressed as

y_i = X_i β_i + e_i ,    (1)

where y_i is the vector of image intensity after pre-whitening, X_i is the design matrix, and β_i is the vector containing the unknown regression parameters. In the random effect model, the regression parameters β_i are additionally assumed to be random, drawn from a multivariate Gaussian distribution with common mean μ and variance Ω. The empirical Bayes estimate of β_i in the random effect model shrinks all estimates toward the mean μ, with greater shrinkage for noisy runs. In fMRI studies, the true status of each voxel is unknown, but it can be estimated using the t-values (i.e., standardized β estimates) within individual runs, derived from the random effect model along with the maximum likelihood estimation method. By specifying a mixed multinomial model, the receiver-operating characteristic (ROC) curve can be estimated using the maximum likelihood estimation method and the t-values of all image voxels. The curve is simply a bivariate plot of sensitivity versus the false alarm rate. The threshold (or operational point) on the ROC curve for classifying voxels into active/inactive status was found by maximizing the kappa value. We follow the


same definition as in Liou et al. (2006) to categorize voxels according to reproducibility: a voxel is strongly reproducible if its active status remains the same in at least 90% of the runs, moderately reproducible in 70–90% of the runs, weakly reproducible in 50–70% of the runs, and otherwise not reproducible. The brain activation maps are constructed on the basis of strongly reproducible voxels, but include voxels that are moderately reproducible and spatially proximal to strongly reproducible voxels.

2.2 Laguerre Polynomials

The Laguerre polynomials can be used for detecting experimental responses. This family of polynomials can be specified as

h(t) = Σ_{i=1}^{L} f_i g_i^a(t) ,    (2)

where h(t) contains the design coefficients to be input into X_i in (1); L is the order of the Laguerre polynomial; f_i is the coefficient of the i-th basis function; and g_i^a(t) is the inverse Z-transform of the i-th Laguerre polynomial,

g_i^a(t) = Z^{-1}[ (z^{-1} / (1 − a z^{-1})) · ((z^{-1} − a) / (1 − a z^{-1}))^{i−1} ] = Z^{-1}[ g̃_i^a(z) ] ,    (3)

where a is a time constant. As an illustration, Fig. 1 gives the response coefficients corresponding to L = 2 and L = 3.
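Equations (2)–(3) can be generated numerically by passing an impulse through the corresponding digital filters: a first-order low-pass stage followed by all-pass sections. The sketch below (function names and the example value of a are ours; SciPy assumed) omits the gain √(1 − a²) that some normalizations add, matching (3):

```python
import numpy as np
from scipy.signal import lfilter

def laguerre_basis(n_basis, a, n_samples):
    """Discrete Laguerre basis g_i^a(t), i = 1..n_basis, via Eq. (3):
    the impulse response of z^{-1}/(1 - a z^{-1}) cascaded with (i - 1)
    all-pass sections (z^{-1} - a)/(1 - a z^{-1})."""
    impulse = np.zeros(n_samples)
    impulse[0] = 1.0
    # First-order low-pass stage: z^{-1} / (1 - a z^{-1}).
    g = lfilter([0.0, 1.0], [1.0, -a], impulse)
    basis = [g]
    for _ in range(n_basis - 1):
        # All-pass stage: (z^{-1} - a) / (1 - a z^{-1}).
        g = lfilter([-a, 1.0], [1.0, -a], g)
        basis.append(g)
    return np.vstack(basis)

def laguerre_response(coeffs, a, n_samples):
    """h(t) = sum_i f_i g_i^a(t), Eq. (2)."""
    basis = laguerre_basis(len(coeffs), a, n_samples)
    return np.asarray(coeffs) @ basis

# Example shapes as in Fig. 1 (a = 0.7 is our guess, not the study's value).
h2 = laguerre_response([1.0, 1.0], 0.7, 64)        # L = 2
h3 = laguerre_response([1.0, 1.0, 1.0], 0.7, 64)   # L = 3
```

Because the second stage is all-pass, different basis functions are mutually orthogonal over a long enough horizon, which is the property exploited in Section 3.2.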

Fig. 1. The boxcar functions of the experimental conditions in the Scott et al. study are depicted in (a), and the Laguerre polynomials h(t) with L = 2 and L = 3 are depicted in (b)

3 Event-Related fMRI Experiments

Here we introduce the experimental design behind the fMRI dataset used in the empirical study, and select a design matrix suitable for the study.


3.1 Experimental Design

In our empirical study, the dataset contains functional MR images of 10 subjects who went through 10–12 experimental runs, with 10 stimulus trials in each run. Experimental runs involved the change-detection task, in which the two images within a pair differed in either the presence/absence or position of a single object, or the color of the object. The two images were presented alternately for 40 sec. In the first 30 sec, each image was presented for 300 msec followed by a 100 msec mask; the mask was removed in the last 10 sec. Subjects pressed a button when they detected something changing between the pair of images. The experimental images and stimulus durations are shown in Fig. 2.

Fig. 2. The experimental images and stimulus duration in the Scott et al. study

3.2 Experimental Design Matrix

We used the Laguerre polynomials in Fig. 1 for specifying the design matrix in (1) instead of the theoretical HRFs. According to the original experimental design, there are two contrasts of interest in the Scott et al. study. The first is the response after task onset within the 40-sec trial, and the second is the difference between stimulus presentations with and without the mask, that is, between responses during image presentation with the mask (0–30 sec) and without the mask (30–40 sec) in Fig. 2. The boxcar functions in Fig. 1 could also be specified in the design matrix, as was suggested by Liou et al. (2006) for the on-and-off design. However, the two boxcar functions are not orthogonal to each other and carry redundant information on the experimental effects. In an event-related fMRI experiment, the duration of stimulus presentation is always longer than that in the on-and-off design, and while the theoretical HRFs vanish during the stimulus presentation, there might be brain regions continuously responding to the stimulus. The Laguerre polynomials are orthogonal and offer possibilities for examining all kinds of experimental effects. We might also consider the Laguerre polynomials in Fig. 1 as a smoothed version of the boxcar functions.
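The non-orthogonality of the two boxcar contrasts can be checked numerically. The sketch below assumes a TR of 2 s and ±1 coding for the mask contrast; neither is stated in the text, so the numbers are illustrative only:

```python
import numpy as np

# One 40 s trial sampled at an assumed TR of 2 s -> 20 time points.
TR = 2.0
t = np.arange(0, 40, TR)

# Contrast 1: task on for the whole 40 s trial (vs. implicit inter-trial baseline).
task = np.ones_like(t)
# Contrast 2: image presentation with mask (0-30 s) vs. without mask (30-40 s).
mask = np.where(t < 30, 1.0, -1.0)

# The inner product of the two boxcar regressors is nonzero, so the
# design-matrix columns are correlated and carry redundant information.
overlap = float(task @ mask)  # 15 samples at +1, 5 at -1 -> 10.0
```

Orthogonal Laguerre regressors avoid exactly this redundancy, which is why they are preferred over the raw boxcars in the design matrix.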

4 Results

In the Scott et al. study, a response-contingent event-related analysis technique was used in the data analyses, and the original results showed brain regions associated with different processing components in the visual change-detection task. For instance, the lingual gyrus, cuneus, precentral gyrus, and medial frontal gyrus showed activations associated with the task onset; the pattern of activation in dorsal and ventral visual pathways was temporally associated with the duration of visual search; and parietal and frontal regions showed systematic deactivations during task performance. In the reproducibility analysis with Laguerre polynomials, we found similar activation regions associated with the task onset, visual search, and deactivations. In addition, we found activation regions in the parahippocampal gyrus, superior frontal gyrus, supramarginal gyrus, and inferior parietal lobule. Both positive and negative responses were found in the lingual gyrus, cuneus, and precuneus, and these are also reproducible across all subjects; this finding is consistent with our previous analyses of fMRI studies involving object recognition and word/pseudoword reading (Liou et al., 2006). Table 1 lists a few activation regions in the change-detection task for the 10 subjects.

Table 1. The activation regions in the change-detection task. The plus sign indicates a positive response and the minus sign a negative response.

Subjects Lingual gyrus Precuneus Cuneus Posterior cingulate Medial frontal gyrus Parahippocampal gyrus Superior frontal gyrus Supramarginal gyrus

1 +/+/+/+/+/-

2 +/+/+/-

+ + +

3 +/+/+/+/+/+/+ +

4 +/+/+/+/-

5 +/+/+/+/+/+ +

6 +/+/+/+ +/-

7 +/+/+/+/+/+/-

8 +/+/+/+/+/+/-

9 +/+/+/-

10 +/+/+/+/+/-

+ +

In the table, there are 4 subjects showing activations in the superior frontal gyrus and supramarginal gyrus in the change-detection task. The two regions have been referred to in fMRI studies on language process (e.g., the study on word and pseudoword reading). The 4 subjects, on average, had longer reaction time in the change-detection task, that is, a delay of pressing the button until the image presentation without the mask (30-40 sec.). Fig. 3 shows the brain activation regions for Subjects 5 and 7 in the Scott et al. study. Subject 5 involved the superior frontal gyrus and supramarginal gyrus and had the longest reaction time compared with other subjects in the experiment. On the other hand, Subject 7 had relatively shorter reaction time and showed no activations in the two regions.

Reproducibility Analysis of Event-Related fMRI Experiments

Fig. 3. Brain activation regions for Subjects 5 and 7 in the Scott et al. study

131

132

H.-R. Su et al.

Subject 5

Fig. 3. (continued)

Reproducibility Analysis of Event-Related fMRI Experiments

Subject 7

Fig. 3. (continued)

133

134

H.-R. Su et al.

5 Discussion

The reproducibility evidence suggests that the 10 subjects consistently show a pattern of increased/decreased responses in the lingual gyrus, cuneus, and precuneus. Similar observations were found in our empirical studies on other datasets published by the fMRIDC. In the fMRI literature, the precuneus, posterior cingulate, and medial prefrontal cortex are known to form the default network in the resting state and to show decreased activities in a variety of cognitive tasks. The physiological mechanisms behind the decreased responses are still under investigation; however, discussions of the network have focused on the decreased activities. We suggest considering both positive and negative responses when interpreting the default network. With the method of reproducibility analysis, we can clearly classify brain regions into those that show consistent responses across subjects and those that show inconsistent patterns across subjects (see results in Table 1). Higher mental functions are individual, and their localization in specific brain regions can be established only with some probability. These higher mental functions are connected with speech, that is, with external or internal speech organizing personal behavior. Subjects differ from each other as a result of using different speech designs when making decisions in performing experimental tasks, and change of functional localization is an additional characteristic of a subject's psychological traits. The proposed methodology would assist researchers in identifying those brain regions that are specific to individual speech designs and those that are consistent across subjects.

Acknowledgments. The authors are indebted to the fMRIDC at Dartmouth College for providing the datasets analyzed in this study. This research was supported by grant NSC 94-2413-H-001-001 from the National Science Council (Taiwan).

References

1. Liou, M., Su, H.-R., Lee, J.-D., Aston, J.A.D., Tsai, A.C., Cheng, P.E.: A Method for Generating Reproducible Evidence in fMRI Studies. NeuroImage 29, 383–395 (2006)
2. Huettel, S.A., Guzeldere, G., McCarthy, G.: Dissociating Neural Mechanisms of Visual Attention in Change Detection Using Functional MRI. Journal of Cognitive Neuroscience 13(7), 1006–1018 (2001)
3. Liou, M., Su, H.-R., Lee, J.-D., Cheng, P.E., Huang, C.-C., Tsai, C.-H.: Bridging Functional MR Images and Scientific Inference: Reproducibility Maps. Journal of Cognitive Neuroscience 15(7), 935–945 (2003)
4. Saha, S., Long, C.J., Brown, E., Aminoff, E., Bar, M., Solo, V.: Hemodynamic Transfer Function Estimation with Laguerre Polynomials and Confidence Intervals Construction from Functional Magnetic Resonance Imaging (fMRI) Data. IEEE ICASSP 3, 109–112 (2004)
5. Andrews, G.E., Askey, R., Roy, R.: Laguerre Polynomials. In: §6.2 in Special Functions, pp. 282–293. Cambridge University Press, Cambridge (1999)
6. Arfken, G.: Laguerre Functions. In: §13.2 in Mathematical Methods for Physicists, 3rd edn., pp. 721–731. Academic Press, Orlando (1985)

The Effects of Theta Burst Transcranial Magnetic Stimulation over the Human Primary Motor and Sensory Cortices on Cortico-Muscular Coherence Murat Saglam1, Kaoru Matsunaga2, Yuki Hayashida1, Nobuki Murayama1, and Ryoji Nakanishi2 1

1 Graduate School of Science and Technology, Kumamoto University, Japan
2 Department of Neurology, Kumamoto Kinoh Hospital, Japan
[email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp

Abstract. Recent studies proposed a new paradigm of repetitive transcranial magnetic stimulation (rTMS), "theta burst stimulation" (TBS): TBS applied to the primary motor cortex (M1) or sensory cortex (S1) can influence cortical excitability in humans. In particular, it has been shown that TBS can induce long-lasting effects with stimulation durations shorter than those of conventional rTMS protocols. However, in those studies, the effects of TBS over M1 or S1 were assessed only by means of motor- and/or somatosensory-evoked potentials. Here we asked how the coherence between electromyographic (EMG) and electroencephalographic (EEG) signals during isometric contraction of the first dorsal interosseous muscle is modified by TBS. The coherence magnitude, localized to the C3 scalp site and the 13-30 Hz band, significantly decreased 30-60 minutes after TBS over M1, but not after TBS over S1, and recovered to the original level within 90-120 minutes. These findings indicate that TBS over M1 can suppress cortico-muscular synchronization.

Keywords: Theta Burst Transcranial Magnetic Stimulation, Coherence, Electroencephalogram, Electromyogram, Motor Cortex.

1 Introduction

Previous studies have demonstrated dense functional and anatomical projections among motor cortical areas, building a global network that realizes the communication between the brain and peripheral muscles via the motor pathway [1, 2]. The quality of this communication is thought to depend highly on the efficacy of synaptic transmission between cortical units. In the past few decades, repetitive transcranial magnetic stimulation (rTMS) has been considered a promising method to modify cortical circuitry by inducing the phenomena of long-term potentiation (LTP) and depression (LTD) of synaptic connections in human subjects [3]. Furthermore, a recently developed rTMS paradigm, called "theta burst stimulation" (TBS), requires fewer stimulation pulses and offers even longer aftereffects than conventional rTMS protocols do [4].

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 135–141, 2008. © Springer-Verlag Berlin Heidelberg 2008

Previously, the efficacy of TBS has

been assessed by means of signal transmission from cortex to muscle or from muscle to cortex, by measuring the motor-evoked potential (MEP) or the somatosensory-evoked potential (SEP), respectively. It was shown that TBS applied over the sensory cortex (S1) as well as over the primary motor cortex (M1) could modify the amplitude of the SEP (recorded from the S1 scalp site), with effects lasting for tens of minutes after the TBS [5]. On the other hand, the MEP amplitude was not modified by TBS applied over S1, while it was significantly decreased by TBS applied over M1 [4, 5]. In the present study, we examined the effects of TBS applied over either M1 or S1 on the functional coupling between cortex and muscle by measuring the coherence between electroencephalographic (EEG) and electromyographic (EMG) signals during voluntary isometric contraction of the first dorsal interosseous (FDI) muscle.

2 Methods

2.1 Subjects

Seven subjects, out of approximately twenty recruited participants, showed significant coherence, and only those subjects participated in the TBS experiments. Experiments on M1 and S1 were performed on different days, and subjects did not report any side effects during or after the experiments.

2.2 Determination of M1 and S1 Location

The optimal location of the stimulating coil was determined by searching for the largest MEP response (from the contralateral FDI muscle) elicited by single-pulse TMS while

Fig. 1. EEG-EMG signals recorded before and after the application of TBS, as depicted in the experiment time line. Subjects were asked to contract four times in each recording set. The stimulation location and intensity were determined after the pre30 session. The pre0 recording was done to confirm that the search procedure itself had no conditioning effect. The TBS paradigm is illustrated in the upper-right inset.

moving the TMS coil in 1 cm steps around the presumed position of M1. Stimulation was applied with a High Power Magstim 200 machine and a figure-of-8 coil with a mean loop diameter of 70 mm (Magstim Co., Whitland, Dyfed, UK). The coil was placed tangentially to the scalp with the handle pointing backwards and laterally at a 45° angle away from the midline. Based on previous reports, S1 was assumed to be 2 cm posterior to the M1 site [5].

2.3 Theta Burst Stimulation

A continuous TBS (cTBS) paradigm of 600 pulses was applied to the M1 and S1 locations. cTBS consists of 50 Hz triplets of pulses repeating every 0.2 s (5 Hz) for 40 s [4]. The intensity of each pulse was set to 80% of the active motor threshold (AMT), defined as the minimum stimulation intensity that could evoke an MEP of no less than 200 μV during slight tonic contraction.

2.4 EEG and EMG Recording

EEG signals were recorded according to the international 10-20 scalp electrode placement system (19 electrodes) with earlobe reference. The EMG signal, during isometric hand contraction at 15% of the maximum level, was recorded from the FDI muscle of the right hand with reference to the metacarpal bone of the index finger. EEG and EMG signals were recorded with a 1000 Hz sampling frequency and passbands of 0.5-200 Hz and 5-300 Hz, respectively. Each recording set consisted of four one-minute-long recordings with 30 s rest intervals. To assess the TBS effect over time, a set was performed 30 minutes before (pre30), just before (pre0), and 0, 30, 60, 90, and 120 minutes after the delivery of TBS. Stimulation location and intensity were determined between the pre30 and pre0 recordings (Fig. 1).

2.5 Data Analysis

The coherence function is the squared magnitude of the cross-spectrum of a signal pair divided by the product of their power spectra. Therefore, cross- and power-spectra between the EMG and the 19 EEG channels were calculated.
A fast Fourier transform with an epoch size of 1024, yielding a frequency resolution of 0.98 Hz, was used to convert the signals into the frequency domain. The current source density (CSD) reference method was utilized in order to obtain spatially sharpened EEG signals [6]. The coherence between EEG and EMG signals was obtained using the expression below:

0 ≤ κ²xy(f) = |Sxy(f)|² / [Sxx(f) Syy(f)] ≤ 1    (1)

where Sxy(f) represents the cross-spectral density function, and Sxx(f) and Syy(f) stand for the auto-spectral densities of the signals x and y, respectively. Since coherence is a normalized measure of the correlation between signal pairs, κ²xy(f) = 1 represents a perfect linear dependence and κ²xy(f) = 0 indicates a lack of linear dependence within the signal pair. Coherence values κ²xy(f) > 0 are considered statistically significant only if they are above the 99% confidence limit, estimated by:

CL(α%) = 1 − (1 − α/100)^(1/(n−1))    (2)

where n is the number of epochs used for the cross- and power-spectra calculations.
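The epoch-averaged coherence of eq. (1) and the confidence limit of eq. (2) can be sketched as follows. This is a minimal illustration in Python with NumPy; the function names are ours, and the windowing, detrending, and CSD re-referencing used in the actual analysis are omitted:

```python
import numpy as np

def coherence(x, y, epoch=1024):
    """Magnitude-squared coherence (eq. 1): cross- and auto-spectra are
    averaged over non-overlapping epochs of length `epoch`."""
    n = min(len(x), len(y)) // epoch
    X = np.fft.rfft(np.reshape(x[:n * epoch], (n, epoch)), axis=1)
    Y = np.fft.rfft(np.reshape(y[:n * epoch], (n, epoch)), axis=1)
    Sxy = np.mean(X * np.conj(Y), axis=0)   # cross-spectrum
    Sxx = np.mean(np.abs(X) ** 2, axis=0)   # auto-spectra
    Syy = np.mean(np.abs(Y) ** 2, axis=0)
    return np.abs(Sxy) ** 2 / (Sxx * Syy), n

def confidence_limit(alpha, n):
    """Significance threshold for coherence estimated from n epochs (eq. 2)."""
    return 1.0 - (1.0 - alpha / 100.0) ** (1.0 / (n - 1))
```

As a consistency check, pooling the four one-minute recordings of a set at 1000 Hz with an epoch size of 1024 gives roughly 232 epochs, and confidence_limit(99, 232) ≈ 0.020, matching the 99% significance level of about 0.02 quoted in the Results.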

3 Results

First, we confirmed that the EEG-EMG coherence values for all (n=7) subjects lie above the 99% significance level (coh ≈ 0.02) within the beta (13-30 Hz) frequency

Table 1. Mean ± standard error of the mean (SEM) of beta-band EEG (C3)-EMG coherence values and peak frequencies for the TBS-over-M1 and TBS-over-S1 experiments (n=7)

                        Pre30        Pre0         0 min        30 min       60 min       90 min       120 min
Coherence        M1     0.061±0.008  0.051±0.014  0.053±0.014  0.030±0.007  0.031±0.009  0.057±0.015  0.067±0.016
magnitude        S1     0.059±0.021  0.068±0.026  0.058±0.027  0.061±0.023  0.052±0.014  0.034±0.006  0.040±0.016
Peak             M1     20.50±1.92   21.80±1.98   22.48±2.04   23.73±1.84   18.57±1.20   20.75±1.62   18.23±1.82
frequency (Hz)   S1     21.17±1.79   21.80±1.97   23.08±1.82   23.10±2.15   18.22±2.13   21.17±1.08   19.52±3.21

Fig. 2. Coherence spectra between EEG and EMG signals at the pre30 session. A, Coherence spectra between the 19 EEG channels and the EMG (FDI) for all subjects (n=7), superimposed and arranged topographically according to the approximate locations of the electrodes on the scalp. Each electrode is labeled with respect to its location (Fp: frontal pole, F: frontal, T: temporal, C: central, P: parietal, O: occipital). B, Expanded view of the coherence spectra between EEG (C3 scalp site) and EMG (FDI) for all subjects (n=7). Each line style specifies a different subject's coherence spectrum. Only coherence values above the 99% significance level (indicated by the solid horizontal line) are highlighted.


band at the C3 scalp site. Maximum coherence levels were observed at the C3 (n=6) and F3 (n=1) scalp sites, whereas no significant coherence was observed at other locations (Figure 2). These results are in good agreement with previous studies on EEG-EMG coherence during isometric contraction [7, 8]. Table 1 shows the average absolute coherence values and peak frequencies for all trials before and after the application of TBS. Figure 3 shows the normalized EEG (C3)-EMG coherence values averaged over all subjects. Coherence values obtained before the TBS were taken as control and set to 100%. Average beta-band coherence was suppressed to 56.2% after 30 minutes and 54.5% after 60 minutes with statistical significance (p

Δt > 0, and depression for a negative timing difference, Δt < 0. Many have argued that the asymmetry of this rule produces a one-way coupling (see e.g. [4]). Such arguments would be valid if Δt represented the time difference between post- and presynaptic spikes. However, most experimental papers [3] actually define Δt as the time difference between a postsynaptic spike and the onset or peak of the somatic excitatory postsynaptic potential (EPSP) induced by a presynaptic spike: Δt = t_(post spike) − t_(EPSP by pre). Hence the above argument does not apply. A somatic EPSP should lag behind a presynaptic spike by a few msec. Therefore, if two neurons fire in exact synchrony (Fig. 1g), Δt is negative [12] in both directions, which weakens the connections bidirectionally. Now, how does this mechanism convert initial asynchronous firing into clustered synchronous firing (Fig. 1b)? Initial asynchronous firing (Fig. 1a) is represented as firing phases evenly spread around the circle (Fig. 1h, left). The firing remains asynchronous without STDP. However, with the phases of many neurons squeezed onto the circle, any single neuron must have neighboring neurons that unwillingly fire synchronously with it (Fig. 1h). Among these neurons, the abovementioned mechanism weakens the connections bidirectionally. As their synaptic connections weaken, mutual repulsion is also weakened. This further synchronizes their firing. This positive feedback mechanism develops wireless clusters (Fig. 1h). Although this mechanism qualitatively explains how the clustering happens, the quantitative question of how many clusters are formed requires further consideration. We will later see that a stability analysis determines the possible number of clusters. In contrast to the vanishing intra-cluster connections, the inter-cluster connections survive and can be unidirectional (Fig. 1d), which defines a cyclic network topology such as that shown in Fig. 1f, upper. Let us ask how we can change this 3-cycle topology. We find that one of the recently observed higher-order rules of STDP [15,16] increases the number of clusters (Fig. 1e). The higher-order rule shown in [16] implies a gross reduction in the LTD effect, because LTP overrides the immediately preceding LTD, while LTD only partly cancels the immediately preceding LTP. The weakened LTD effect is likely to increase the total number of potentiated synapses, which is consistent with the increased ratio of black areas in Fig. 1e compared to Fig. 1d. In contrast to such cluster-wise synchrony observed with Model A neurons, Model B neurons that favor synchrony (a = 0.02, b = 0.2, c = −50, d = 40) self-organize into the globally synchronous state with or without STDP (Fig. 2). Due to the global synchrony, mutual synaptic connections are largely lost, and each neuron ends up being driven by the external input individually, having little sense of being present as a population. The global synchrony gives too strong
Among these neurons, the abovementioned mechanism weakens the connections bidirectionally. As their synaptic connections weaken, mutual repulsion is also weakened. This then further synchronizes their ﬁring. This positive feedback mechanism develops wireless clusters (Figs.1h). Although this mechanism qualitatively explains how the clustering happens, a quantitative question how many clusters are formed requires further consideration. We will later see that a stability analysis tells the possible number of clusters. In contrast to the vanishing intra-cluster connections, the inter-cluster connections survive and can be unidirectional (Fig.1d), which deﬁnes the cyclic network topology such as shown in Fig.1f, upper. Let us ask how we can change this 3-cycle topology. We ﬁnd that one of the recently observed higher-order rules of STDP [15,16] increases the number of clusters (Fig.1e). The higher-order rule shown in [16] implies the gross reduction in the LTD eﬀect because LTP override the immediately preceding LTD, while LTP simply cancels partly the immediately preceding LTD. The weakened LTD eﬀect is likely to increase the total number of potentiated synapses, which is in consistent with the increased ratio of black areas in Fig.1e compared to Fig.1d. In contrast to such cluster-wise synchrony observed with Model A neurons, Model B neurons that favor synchrony ( a = 0.02, b = 0.2, c = −50, d = 40) selforganize into the globally synchronous state with or without synchrony (Fig.2). Due to the global synchrony, mutual synaptic connections are largely lost, and each neuron ends up being driven by the external input individually, having little sense of being present as a population. The global synchrony gives too strong


H. Câteau, K. Kitano, and T. Fukai

Fig. 2. Global synchrony observed with Model B neurons that favor synchrony. A raster plot (a) and connection matrix (b) of fifty Model B neurons. The neurons were aligned with the connection-based method: neurons are defined to belong to the same cluster whenever their mutual connections are small enough.
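The connection-based alignment mentioned in the caption can be sketched as follows. This is a hypothetical implementation: the threshold `eps` and the union-find grouping are our assumptions, not details given in the paper.

```python
import numpy as np

def wireless_clusters(W, eps=0.05):
    """Group neurons into 'wireless' clusters: i and j join the same
    cluster when both W[i, j] and W[j, i] fall below eps (i.e. the
    mutual connections have effectively vanished). Membership is made
    transitive with a simple union-find."""
    n = W.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if W[i, j] < eps and W[j, i] < eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Applied to a connection matrix like Fig. 2b, neurons sharing a cluster label would be plotted as contiguous blocks in the raster.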

an impact and also has minimal coding capacity because all the neurons behave identically, and it appears to bear more similarity to pathological activity such as seizures in the brain than to meaningful information processing. By contrast, the clustered synchrony arising in the network of Model A neurons appears functionally useful. Generally in the brain, the unitary EPSP amplitude (∼0.5 mV) is designed to be much smaller than the voltage rise needed to elicit firing (∼15 mV). Therefore, single-neuron activity alone cannot cause other neurons to respond. Hence, it is difficult to regard single-neuron activity as a carrier of information transferred back and forth in the brain. In contrast, the self-organized assembly of tens of Model A neurons (Fig. 1d) looks like an ideal candidate for a carrier of information in the brain because its impact on other neurons is strong enough to elicit responses. Additionally, a cluster can reliably code timing information. The PRC, Z(2πt/T), representing the amount of advance/delay of the next firing time in response to an input at time t in the firing interval [0, T], has mostly been used to decide whether a coupled pair of neurons or oscillators tends to synchronize or desynchronize under the assumption that the connection strengths between the neurons are equal and unchanged. Specifically, suppose that a pair of neurons are mutually connected and a spike of one neuron introduces a current with the waveform EPSC(t) in the innervated neuron after a transmission delay of τd. The effective PRC, defined as

Γ−(θ) = (1/T) ∫₀ᵀ Z(2πt′/T) EPSC(t′ − τd − θT/(2π)) dt′,

is known to decide their synchrony tendency. If the slope of Γ−(θ) at θ = 0 is positive (negative), the two neurons are desynchronized (synchronized). This synchrony condition carries over to a population of neurons coupled in an all-to-all or random manner as long as the connection strengths remain unchanged.
Theoretically calculated Γ−(θ)s for Models A and B (Fig. 3a,b) explain why the all-to-all network of Model A (B) neurons exhibits global asynchrony (synchrony). Note that both Model A and B neurons belong to type II [19], so that both model neurons favor synchrony if they are delta-coupled with no synaptic delay. After STDP is switched on, the network consisting of Model A neurons self-organizes into the 3-cycle circuit (Fig. 1d), with the successive phase difference of the clustered activity being Δsucθ = 2π/3. Stability analysis shows that the slope

Interactions between STDP and PRC Lead to Wireless Clustering

Fig. 3. Effective PRCs and schema of the triad mechanism. The effective PRCs of Models A and B were calculated with the adjoint method [19] and are shown in (a) and (b). The slope at θ = 0 is positive for (a) but negative for (b), although this is hardly recognizable at this resolution. The slope at θ = 2π/3 is negative for (a) but positive for (b). The dashed lines represent θ = 2π/3 and θ = 2π/4.

of Γ−(θ) not at the origin but at θ = 2π − Δsucθ now determines the stability of the 3-cycle activity: the 3-cycle activity is stable if the slope of Γ− at 2π − Δsucθ is negative. Fig. 3a shows that the 3-cycle activity shown in Fig. 1c is stable. The stable cyclic activity is achieved through the following synergetic process: (1) the PRC determines the preferred network activity (e.g. asynchronous or synchronous); (2) the network activity determines how STDP works, and STDP modifies the network structure (e.g. from all-to-all to cyclic); and (3) the network structure determines how the PRC is read out (e.g. at θ = 0 or θ = 2π − Δsucθ), closing the loop. Generally, we can show that the n-cycle activity whose successive phase difference equals Δsucθ = 2π/n is stable if the slope of Γ− at 2π − Δsucθ is negative. PRCs of biologically plausible neuron models or real neurons [20] tend to have a negative slope in a later phase of the firing interval and converge to zero at θ = 2π, because the membrane potential starts the regenerative depolarization and becomes insensitive to any synaptic input. The corresponding effective PRCs inherit this negative slope in the later phase and tend to stabilize the n-cycle activity for some n.
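The stability criterion can be checked numerically. The sketch below is our own illustration: it evaluates Γ−(θ) for a given PRC Z with an assumed exponential EPSC waveform (the paper does not fix the waveform), then tests the sign of the slope at θ = 2π − Δsucθ.

```python
import numpy as np

def effective_prc(Z, T, tau_d, tau_s, n_theta=256, n_t=2048):
    """Numerically evaluate the effective PRC
       Gamma_-(theta) = (1/T) * int_0^T Z(2*pi*t/T) EPSC(t - tau_d - theta*T/(2*pi)) dt
    with an assumed EPSC kernel exp(-s/tau_s), wrapped on the firing period T.
    Z is any callable PRC defined on [0, 2*pi]."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    t = np.linspace(0.0, T, n_t, endpoint=False)
    dt = T / n_t
    gamma = np.empty(n_theta)
    for i, th in enumerate(thetas):
        s = np.mod(t - tau_d - th * T / (2.0 * np.pi), T)  # time since EPSC onset
        gamma[i] = np.sum(Z(2.0 * np.pi * t / T) * np.exp(-s / tau_s)) * dt / T
    return thetas, gamma

def n_cycle_stable(thetas, gamma, n):
    """n-cycle activity with successive phase difference 2*pi/n is stable
    if the slope of Gamma_- at theta = 2*pi - 2*pi/n is negative."""
    i = int(np.argmin(np.abs(thetas - 2.0 * np.pi * (1.0 - 1.0 / n))))
    dtheta = thetas[1] - thetas[0]
    slope = (gamma[(i + 1) % len(gamma)] - gamma[i - 1]) / (2.0 * dtheta)
    return slope < 0.0
```

With a delta-like EPSC (tau_s much smaller than T), Γ−(θ) essentially reproduces Z(θ), so a PRC with a negative slope only in its late phase stabilizes short cycles, in line with the argument above.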

3 Self-organization of Hodgkin-Huxley Type Neurons

Next we show that the self-organized cyclic activity with wireless clustering is also observed in a biologically realistic setting. Our simulations, as described in [18], with 200 excitatory and 50 inhibitory neurons modeled with the Hodgkin-Huxley (HH) formalism, exhibit the 3-cyclic activity with wireless clustering (Fig. 4a,b). The setup here is biologically realistic in that (1) HH-type neurons are used, (2) a physiologically known percentage of inhibitory neurons with nonplastic synapses is included, and (3) neurons fire with high irregularity due to large noise in the background input, unlike the well-regulated firing shown in Fig. 1c. Interestingly, the effective PRC (Fig. 4c) of the HH-type neuron shares important features with that of Model A: the positive initial slope implying a preference for asynchrony, and a negative later slope stabilizing the 3-cycle activity.

Fig. 4. The conductance-based model also develops the wireless clustering. (a) A raster plot of 200 HH-type excitatory neurons showing the 3-cycle activity. (b) The corresponding connection matrix showing the wirelessness. (c) Effective PRC, Γ−(θ), of the conductance-based model.

Generally, a technical difficulty with HH simulations is their massive computational demand due to the complexity of the system. That difficulty has hindered theoretical analysis and has left such studies largely experimental. In particular, we had previously tried hard, in vain, to understand why we never observed 4-cycle or longer activity. However, the analytic argument developed here with the simplified model gives clear insight into the biologically plausible but complex system. Comparison of Fig. 3a and Fig. 4c reveals that the negative slope of Γ−(θ) of the HH model is located further to the left than that of Model A, indicating less stability of long cycles in the HH simulations. Bearing in mind the larger amount of noise in the HH simulations, it is now understood that 4-cycles and longer can easily be destabilized there. Thus, our analysis developed with the simplified system serves as a useful tool for understanding biologically realistic but complex systems. There is, however, an interesting difference between the Model A and HH simulations. Although intra-cluster wirelessness is a fairly good first approximation in the HH model simulations (Fig. 4b), it is not as exact as in the Model A simulations (Fig. 1d,e). Interestingly, elimination of the residual intra-cluster connections destroys the cyclic activity, suggesting a supportive role of the tiny residual intra-cluster connections.

4 Discussion

In a previous simulation study [17] using the LIF model, cyclic activity was observed to propagate only at the theoretical speed limit: it takes only τd from one cluster to the next, requiring zero membrane integration time. To understand


why this was the case, we first recall that the effective PRC needs a negative slope at 2π − Δsucθ to stabilize the cyclic activity. However, the slope of the PRC of an LIF model, Z(θ) = c exp((T/τm)(θ/2π)), is always positive except at the end point, where Z(2π − 0) = c exp(T/τm) and Z(2π + 0) = c, implying Z′(2π) = −∞. This infinitely sharp negative slope of the PRC at θ = 2π is rounded and displaced to 2π − 2πτd/T in Γ−(θ) (see its definition). Since this is the only place where Γ−(θ) has a negative slope, the cyclic activity is stable only if Δsucθ = 2πτd/T, implying propagation at the theoretical speed limit. We demonstrated an intimate interplay between PRC and STDP using the Izhikevich neuron model as well as the HH-type model. The present study complements previous studies using the phase oscillator [11,14], where its mathematical tractability was exploited to analytically investigate the stability of global phase/frequency synchrony. The self-organization, or unsupervised learning, by STDP studied here complements the supervised learning studied in [22]. The propagation of synchronous firing and the temporal evolution of synaptic strength under STDP are known to be analyzable semi-analytically with the Fokker-Planck equation [5,6,8,9,21]. It is an interesting future direction to see how the Fokker-Planck equation can be used to understand the interplay between PRC and STDP.

Acknowledgement. The present authors thank Dr. T. Takewaka at RIKEN BSI for offering the code to calculate the PRC.

References
1. Kuramoto, Y.: Chemical Oscillations, Waves, and Turbulence. Springer, Berlin (1984)
2. Ermentrout, G., Kopell, N.: SIAM J. Math. Anal. 15, 215 (1984)
3. Markram, H., et al.: Science 275, 213 (1997); Bell, C.C., et al.: Nature 387, 278 (1997); Magee, J.C., Johnston, D.: Science 275, 209 (1997); Bi, G.-Q., Poo, M.-M.: J. Neurosci. 18, 10464 (1998); Feldman, D.E.: Neuron 27, 45 (2000); Nishiyama, M., et al.: Nature 408, 584 (2000)
4. Song, S., et al.: Nat. Neurosci. 3, 919 (2000)
5. van Rossum, M.C., Turrigiano, G.G., Nelson, S.B.: J. Neurosci. 22, 1956 (2000)
6. Rubin, J., et al.: Phys. Rev. Lett. 86, 364 (2001)
7. Abbott, L.F., Nelson, S.B.: Nat. Neurosci. 3, 1178 (2000)
8. Gerstner, W., Kistler, W.M.: Spiking Neuron Models. Cambridge University Press, Cambridge (2002)
9. Câteau, H., Fukai, T.: Neural Comput. 15, 597 (2003)
10. Izhikevich, E.M.: IEEE Trans. Neural Netw. 15, 1063 (2004)
11. Karbowski, J.J., Ermentrout, G.B.: Phys. Rev. E 65, 031902 (2002)
12. Nowotny, T., et al.: J. Neurosci. 23, 9776 (2003)
13. Zhigulin, V.P., et al.: Phys. Rev. E 67, 021901 (2003)
14. Masuda, N., Kori, H.: J. Comput. Neurosci. 22, 327 (2007)
15. Froemke, R.C., Dan, Y.: Nature 416, 433 (2002)
16. Wang, H.-X., et al.: Nat. Neurosci. 8, 187 (2005)
17. Levy, N., et al.: Neural Netw. 14, 815 (2001)
18. Kitano, K., Câteau, H., Fukai, T.: Neuroreport 13, 795 (2002)
19. Ermentrout, G.B.: Neural Comput. 8, 979 (1996)
20. Reyes, A.D., Fetz, E.E.: J. Neurophysiol. 69, 1673 (1993); Reyes, A.D., Fetz, E.E.: J. Neurophysiol. 69, 1661 (1993); Oprisan, S.A., Prinz, A.A., Canavier, C.C.: Biophys. J. 87, 2283 (2004); Netoff, T.I., et al.: J. Neurophysiol. 93, 1197 (2005); Lengyel, M., et al.: Nat. Neurosci. 8, 1667 (2005); Galan, R.F., Ermentrout, G.B., Urban, N.N.: Phys. Rev. Lett. 94, 158101 (2005); Preyer, A.J., Butera, R.J.: Phys. Rev. Lett. 95, 13810 (2005); Goldberg, J.A., Deister, C.A., Wilson, C.J.: J. Neurophysiol. 97, 208 (2007); Tateno, T., Robinson, H.P.: Biophys. J. 92, 683 (2007); Mancilla, J.G., et al.: J. Neurosci. 27, 2058 (2007); Tsubo, Y., et al.: Eur. J. Neurosci. 25, 3429 (2007)
21. Câteau, H., Reyes, A.D.: Phys. Rev. Lett. 96, 058101 (2006), and references therein
22. Lengyel, M., et al.: Nat. Neurosci. 8, 1677 (2005)

A Computational Model of Formation of Grid Field and Theta Phase Precession in the Entorhinal Cells

Yoko Yamaguchi1, Colin Molter1, Wu Zhihua1,2, Harshavardhan A. Agashe1, and Hiroaki Wagatsuma1

1 Lab for Dynamics of Emergent Intelligence, RIKEN Brain Science Institute, Wako, Saitama, Japan
2 Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
[email protected]

Abstract. This paper proposes a computational model of the formation of the spatio-temporal firing properties of the entorhinal neurons recently known as "grid cells". The model consists of module structures for local path integration, for multiple sensory integration, and for theta phase coding of grid fields. Theta phase precession naturally encodes the spatial information in theta phase. The proposed module structures are in good agreement with head direction cells and grid cells in the entorhinal cortex. The functional role of theta phase coding in the entorhinal cortex for cognitive map formation in the hippocampus is discussed.

Keywords: Cognitive map, hippocampus, temporal coding, theta rhythm, grid cell.

1 Introduction

In rodents, it is well known that hippocampal neurons increase their firing rate at specific positions in an environment [1]. These neurons are called place cells and are considered to provide the neural representation of a cognitive map. Recently it was found that entorhinal neurons, which give major inputs to the hippocampus, fire at positions distributed in the form of a triangular-grid-like pattern in the environment [2]. They are called "grid cells" and their spatial firing preference is termed the "grid field". Interestingly, temporal coding of spatial information, the "theta phase precession" initially found in hippocampal place cells, was also observed in grid cells in the superficial layer of the entorhinal cortex [3], as shown in Figs. 1-3: a sequence of neural firing is locked to the theta rhythm (4-12 Hz) of the local field potential (LFP) during spatial exploration. As a step toward understanding cognitive map formation in the rat hippocampus, the mechanism that forms the grid field, and also the mechanism of phase precession formation in grid cells, must be clarified.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 151–159, 2008. © Springer-Verlag Berlin Heidelberg 2008

Here we propose a model of neural computation to create grid cells based on known properties of entorhinal neurons, including "head direction cells", which fire


when the animal's head has a specific direction in the environment. We demonstrate that theta phase precession in the entorhinal cortex naturally emerges as a consequence of the grid cell formation mechanism.

Fig. 1. Theta phase precession observed in rat hippocampal place cells. When the rat traverses a place field, the spike timing of the place cell gradually advances relative to the local field potential (LFP) theta rhythm. In a running sequence through place fields A-B-C, the spike sequence in the order A-B-C emerges in each theta cycle. The spike sequence repeatedly encoded in theta phase is considered to lead to robust on-line memory formation of the running experience through asymmetric synaptic plasticity in the hippocampus.

Fig. 2. Network structure of the hippocampal formation (DG, CA3, CA1, the entorhinal cortex (EC deeper layer and EC superficial layer)) and cortical areas giving multimodal input. Theta phase precession was initially found in the hippocampus, and also found in the EC superficial layer. The EC superficial layer can be considered an origin of theta phase precession.

Fig. 3. Top) A grid field in an entorhinal grid cell (left) and a place field (right) in a hippocampal place cell. Bottom) Theta phase precession observed in the EC grid cell and in the hippocampal place cell.

2 Model

The firing rate of the i-th grid cell at a location (x, y) in a given environment increases under the condition given by the relation:

x = αi + nAi cos φi + mAi cos(φi + π/3),
y = βi + nAi sin φi + mAi sin(φi + π/3),
with n, m = integer + r,    (1)

where φi, Ai, and (αi, βi) denote one of the angles characterizing the grid orientation, the distance between nearby vertices, and the spatial phase of the grid field in the environment, respectively. The parameter r, less than 1.0, gives the relative size of the field with high firing rate.
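The grid-field condition of eq. (1) can be checked directly by inverting the linear map for real-valued (n, m). The sketch below is our illustrative reading: the cell is taken to fire where both n and m are within r/2 of the nearest integer; the tolerance rule and function name are assumptions, while the parameter names follow eq. (1).

```python
import numpy as np

def in_grid_field(x, y, phi, A, alpha, beta, r=0.3):
    """Does (x, y) fall inside a grid field of eq. (1)?
    Solve
      x = alpha + n*A*cos(phi) + m*A*cos(phi + pi/3)
      y = beta  + n*A*sin(phi) + m*A*sin(phi + pi/3)
    for real (n, m), then test closeness to the integer lattice."""
    B = np.array([[np.cos(phi), np.cos(phi + np.pi / 3)],
                  [np.sin(phi), np.sin(phi + np.pi / 3)]])
    n, m = np.linalg.solve(A * B, np.array([x - alpha, y - beta]))
    return abs(n - round(n)) < r / 2 and abs(m - round(m)) < r / 2
```

Scanning such a predicate over (x, y) produces the triangular lattice of firing fields: the two basis vectors differ by π/3, so the high-rate regions sit at the vertices of a triangular grid.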


Fig. 4. Illustration of a hypercolumn structure for grid field computation in the hypothesized entorhinal cortex. The bottom layer consists of local path integration modules with a hexagonal direction system. The middle layer associates the output of local path integration with visual cues in a given environment. The top layer consists of a triplet of grid cells whose grid fields have a common orientation, a common spatial scale, and complementary spatial phases. Phase precession is generated at the grid cell in each grid field.

The computational goal in creating a grid field is to find the region with n, m = integer + r. We hypothesize that the deeper layer of the entorhinal cortex works as a set of local path integration systems using head direction and running velocity. The local path integration results in a variable with slow gradual change that forms a grid field. This change can cause the gradual phase shift of theta phase precession, in accordance with the phenomenological model of theta phase precession by Yamaguchi et al. [4]. The schematic structure of the hypothesized entorhinal cortex and the multimodal sensory system is illustrated in Fig. 4. The entorhinal layer includes head direction cells in the deeper layer and grid cells in the superficial layer. The cells with theta phase precession can be considered stellate cells. The set of modules along the vertical direction forms a kind of functional column with a direction preference. These columns form a hypercolumnar structure with a set of directions. The mechanisms in the individual modules are explained below.

2.1 A Module of Local Path Integrator

The local path integration module consists of six units. During the animal's locomotion with a given head direction and velocity, each unit integrates the running distance in its direction with an angle-dependent coefficient. The units have preferred vector directions distributed at π/3 intervals, as shown in Fig. 5. The computation of animal displacement in given directions in this module is illustrated in Fig. 6. The maximum integration length of the distance in each direction is assumed to be common within a module, corresponding to the distance between nearby vertices of the subsequently formed grid field. This computation gives (n, m) in eq. (1). These systems are distributed in the deeper layer of the entorhinal cortex, in agreement with the observation of head direction cells. Different modules have different vector


directions and form a hypercolumn set covering all running directions, or the entire orientation of the resultant grid field. The entorhinal cortex is considered to include multiple hypercolumns with different spatial scales. They are considered to work in parallel, possibly to give stability in a global space by compensating for the accumulation of local errors.

Fig. 5. Left) A module of the local path integrator with hexagonal direction vectors and a common vector size. Right) Activity coefficient of each vector unit. A set of these vector units giving a displacement distance measure computes the animal's motion in a given head direction.

Fig. 6. Illustration of the computation of local path integration in a module. Animal locomotion in a given head direction is computed by a pair of vector units among the six vectors to give a position measure.

2.2 Grid Field Formation with Visual Cues

Computational results of local path integration are projected to the next module in the superficial layer of the entorhinal cortex, which receives multiple sensory inputs in a given environment. The association of path integration and visual cues yields the relative location of the path integration measure (αi, βi) in Eq. (1) in the module. Further interaction in a set of three cells, as shown in Fig. 7, can give robustness to the parameters (αi, βi). A possible interaction among these cells is mutual inhibition, giving a complementary distribution of the three grid fields.


2.3 Theta Phase Precession in the Grid Field

The input of the parameters (n, m) and (αi, βi) to a cell in the next module, at the top part of the column, can cause theta phase precession. It is obtained by the fundamental mechanism of theta phase precession proposed by Yamaguchi et al. [4][5]. The mechanism requires the presence of a gradual increase of the natural frequency in a cell with oscillatory activity. Here we assume that the top module consists of stellate cells with intrinsic theta oscillation. The gradual increase in natural frequency is expected to emerge from the input of path integration at each vertex of a grid field.

Fig. 7. A triplet of grid fields with the same local path integration system and different spatial phases can generate a mostly uniform spatial representation, where, at every location, a single grid cell fires. The uniformity can help robust assignment of environmental spatial phases with the help of environmental sensory cues. The association is processed in the middle part of each column in the entorhinal cortex. Projection of each cell's output to the module at the top generates a grid field with theta phase precession, as explained in the text.

3 Mathematical Formulation

A simple mathematical formulation of the above model is given phenomenologically below. The locomotion of the animal is represented by the current displacement (R, φc), computed from the head direction φH and the running velocity. An elementary vector at a column of the local path integration system has a vector angle φi and length A. The output I of the ith vector system is given by

I(φi) = 1 if |φi − φH| < π/2 and −r < S(φi) < r, and I(φi) = 0 otherwise,   (2)

with S(φi) = R cos(φi − φc) (mod A),


where r and A respectively represent the field radius and the distance between neighboring grid vertices. The output Di of the path integration module to the middle layer is given by

Di = ∏k I(φi + kπ/3).   (3)
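The vertex detection of Eqs. (2)-(3) can be sketched in a few lines; in this sketch A = 1 and r = 0.15 are assumed values (the text does not give numbers), and the head-direction gating of Eq. (2) is omitted so that only the spatial condition is tested:

```python
import numpy as np

A, r = 1.0, 0.15   # grid spacing parameter and field radius (assumed values)

def wrap(s):
    # the "(mod A)" operation of Eq. (2): wrap a displacement to one grid
    # period, centred on 0
    return (s + A / 2) % A - A / 2

def grid_output(x, y):
    """Spatial part of Eqs. (2)-(3): D = 1 only where the displacement along
    each of the three grid axes (pi/3 apart) is within r of a multiple of A.
    Head-direction gating is omitted in this sketch."""
    D = 1.0
    for k in range(3):
        phi_k = k * np.pi / 3
        S_k = wrap(x * np.cos(phi_k) + y * np.sin(phi_k))   # S(phi_k)
        D *= 1.0 if abs(S_k) < r else 0.0
    return D
```

Under these assumptions the set of points where `grid_output` returns 1 forms a triangular lattice with vertex spacing 2A/√3, i.e., a grid field.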

Through association with visual cues, the spatial phase of the grid is determined (details are not shown here). The input from Eqs. (2)-(3), from the middle layer to the top layer, gives on-off regulation and also a parameter with gradual increase within a grid field. The dynamics of the membrane potential Gi of the cell at the top layer is given by

dGi/dt = f(Gi, t) + aDi S(φi) + Itheta,   (4)

where f is a function of time-dependent ionic currents and a is a constant. The last term Itheta denotes a sinusoidal current representing the theta oscillation of inhibitory neurons. For proper dynamics of f, the second term on the right-hand side gives activation of the grid cell oscillation and a gradual increase in its natural frequency. According to our former results with a phenomenological model [5], the last term of theta currents leads to phase locking of grid cells with a gradual phase shift. This realizes a cell with a grid field and theta phase precession. One can test Eq. (4) by applying several types of equations, including simple reduced models and biophysical models of hippocampal or entorhinal cells. An example of computer experiments is given in the following section.
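The phase-locking mechanism behind Eq. (4) can be illustrated with a minimal phase model: a unit whose natural frequency ramps up above the theta frequency stays locked to LFP theta while its phase within each theta cycle gradually advances. All numerical values here (frequencies, ramp, locking strength) are assumed for illustration, not taken from [4][5]:

```python
import numpy as np

dt, T = 1e-3, 5.0                 # time step and duration (s)
f_theta = 8.0                     # LFP theta frequency (Hz)
K = 40.0                          # locking strength (rad/s)

t = np.arange(0.0, T, dt)
f_unit = 8.0 + 1.0 * t            # natural frequency ramps from 8 to 13 Hz
delta = np.zeros_like(t)          # unit phase relative to theta (rad)
for n in range(1, len(t)):
    # d(delta)/dt = 2*pi*(f_unit - f_theta) - K*sin(delta)
    delta[n] = delta[n - 1] + dt * (
        2 * np.pi * (f_unit[n - 1] - f_theta) - K * np.sin(delta[n - 1]))
```

As long as |2π(f_unit − f_theta)| < K the unit remains phase-locked, and delta grows gradually and monotonically: the firing phase advances within each theta cycle. A perturbation of delta decays back to the locked value, which illustrates the robustness of the phase shift noted in the text.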

4 Computer Simulation of Theta Phase Precession

The mechanism of theta phase precession was proposed phenomenologically by Yamaguchi et al. [4][5] as the coupling of two oscillations. One is the LFP theta oscillation, with the constant frequency of the theta rhythm. The other is a sustained oscillation with a gradual increase in natural frequency. In the presence of LFP theta, the sustained oscillation exhibits a gradual phase shift through quasi-steady states of phase locking. A simulation using a hippocampal pyramidal cell model [6] is shown in Fig. 8. It is clearly seen that LFP theta instantaneously captures the oscillation with gradually increasing natural frequency into a quasi-stable phase at each theta cycle, giving a gradual phase shift. This phase shift is robust against perturbations, as a consequence of phase locking in nonlinear oscillations. A simulation with a model of an entorhinal stellate cell [7] was also carried out, and we obtained similar phase precession with the stellate cell model. One important property of stellate cells is the presence of subthreshold oscillations, while synchronization of this oscillation can be reduced to the simple behavior of a phase model. Thus, the mechanism of the phenomenological model [5] is found to provide a comprehensive description of phase locking in complex biophysical neuron models.


Fig. 8. Computer experiment on theta phase precession using a hippocampal pyramidal neuron model [6]. (a) Bottom: input current with gradual increase. Top: resultant sustained oscillation of the membrane potential with gradual increase in natural frequency. (b) In the presence of LFP theta (middle), the neuronal activity exhibits theta phase precession.

5 Discussion and Conclusion

We presented a computational model of grid cells in the entorhinal cortex to investigate how temporal coding works for spatial representation in the brain. A computational model of grid field formation was proposed based on local path integration. This assumption was found to give theta phase precession within the grid field. This computational mechanism does not require the assumption of learning over repeated trials in a novel environment, but enables instantaneous spatial representation. Furthermore, the model agrees well with experimental observations of head direction cells and grid cells. The networks proposed in the model predict local interaction networks in the entorhinal cortex and head direction systems distributed over many areas. Although the computation of place cells based on grid cells is beyond the scope of this paper, the emergence of theta phase precession in the entorhinal cortex can be used for place cell formation and for instantaneous memory formation in the hippocampus [8]. These computational model studies, with a space-time structure for environmental space representation, highlight temporal coding over distributed areas in the real-time processing of spatial information in an ever-changing environment.


References

1. O'Keefe, J., Nadel, L.: The Hippocampus as a Cognitive Map. Clarendon Press, Oxford (1978)
2. Fyhn, M., Molden, S., Witter, M., Moser, E.I., Moser, M.B.: Spatial representation in the entorhinal cortex. Science 305, 1258–1264 (2004)
3. Hafting, T., Fyhn, M., Moser, M.B., Moser, E.I.: Phase precession and phase locking in entorhinal grid cells. Program No. 68.8, Neuroscience Meeting Planner. Society for Neuroscience, Atlanta, GA (2006), online
4. Yamaguchi, Y., Sato, N., Wagatsuma, H., Wu, Z., Molter, C., Aota, Y.: A unified view of theta-phase coding in the entorhinal-hippocampal system. Current Opinion in Neurobiology 17, 197–204 (2007)
5. Yamaguchi, Y., McNaughton, B.L.: Nonlinear dynamics generating theta phase precession in hippocampal closed circuit and generation of episodic memory. In: Usui, S., Omori, T. (eds.) The Fifth International Conference on Neural Information Processing (ICONIP 1998) and The 1998 Annual Conference of the Japanese Neural Network Society (JNNS 1998), Kitakyushu, Japan, vol. 2, pp. 781–784. IOS Press, Amsterdam (1998)
6. Pinsky, P.F., Rinzel, J.: Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. Journal of Computational Neuroscience 1, 39–60 (1994)
7. Fransén, E., Alonso, A.A., Dickson, C.T., Magistretti, J., Hasselmo, M.E.: Ionic mechanisms in the generation of subthreshold oscillations and action potential clustering in entorhinal layer II stellate neurons. Hippocampus 14(3), 368–384 (2004)
8. Molter, C., Yamaguchi, Y.: Theta phase precession for spatial representation and memory formation. In: The 1st International Conference on Cognitive Neurodynamics (ICCN 2007), Shanghai, 2-09-0002 (2007)

Working Memory Dynamics in a Flip-Flop Oscillations Network Model with Milnor Attractor

David Colliaux1,2, Yoko Yamaguchi1, Colin Molter1, and Hiroaki Wagatsuma1

1 Lab for Dynamics of Emergent Intelligence, RIKEN BSI, Wako, Saitama, Japan
2 Ecole Polytechnique (CREA), 75005 Paris, France
[email protected]

Abstract. A phenomenological model is developed where complex dynamics are the correlate of spatio-temporal memories. If the resting state is not a classical fixed point attractor but a Milnor attractor, multiple oscillations appear in the dynamics of a coupled system. This model can be helpful for describing brain activity in terms of well-classified dynamics and for implementing human-like real-time computation.

1 Introduction

Neuronal collective activities of the brain are widely characterized by oscillations in humans and animals [1][2]. Among the various frequency bands, distant synchronization in theta rhythms (4-8 Hz oscillations defined in human EEG) has recently been found to relate to working memory, a short-term memory for central execution, in human scalp EEG [3][4] and in neural firing in monkeys [5][6]. For long-term memory, information coding is mediated by synaptic plasticity, whereas short-term memory is stored in neural activities [7]. Recent neuroscience studies have reported various types of persistent activity of single neurons and of populations of neurons as possible mechanisms of working memory. Among these, bistable states of the membrane potential, up- and down-states, and flip-flop transitions between them have been measured in a number of cortical and subcortical neurons. The up-state, characterized by frequent firing, shows stability for seconds or more due to network interactions [8]. However, little is known about whether flip-flop transitions and distant synchronization work together, or what kinds of processing are enabled by a flip-flop oscillation network. An associative memory network with flip-flop changes was proposed for working memory within the classical rate coding view [9], while further consideration of the dynamical linking property based on firing oscillations, such as the synchronization of theta rhythms referred to above, is likely essential for the elucidation of multiple attractor systems. Meanwhile, Milnor extended the concept of attractors to invariant sets with Lyapunov instability, which have been of interest in physical, chemical, and biological systems. They might allow high freedom in spontaneous switching among semi-stable states [12]. In this paper, we propose a model of oscillation associative memory with flip-flop changes for working memory.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 160–169, 2008. © Springer-Verlag Berlin Heidelberg 2008

We found that


the Milnor attractor condition is satisﬁed in the resting state of the model. We will ﬁrst study how the Milnor attractor appears and will then show possible behaviors of coupled units in the Milnor attractor condition.

2 A Network Model

2.1 Structure

In order to realize up- and down-states, where the up-state is associated with oscillation, two phenomenological models are joined. Traditionally, associative memory networks are described by state variables representing the membrane potentials {Si} [9]. Oscillation is assumed to appear in the up-state as an internal process, represented by a variable φi for the ith unit. The oscillation dynamics is simply given by a phase model with a resting state and periodic motion [10,11]; cos(φi) stands for an oscillation current in the dynamics of the membrane potential.

2.2 Mathematical Formulation of the Model

The flip-flop oscillations network of N units is described by the set of state variables {Si, φi} ∈ ℝ^N × [0, 2π[^N (i ∈ [1, N]). The dynamics of Si and φi is given by the following equations:

dSi/dt = −Si + Σj Wij R(Sj) + σ(cos(φi) − cos(φ0)) + I±
dφi/dt = ω + (β − ρSi) sin(φi)    (1)

with R(x) = (1/2)(tanh(10(x − 0.5)) + 1), φ0 = arcsin(−ω/β) and cos(φ0) < 0. R is the spike density of the units, and the input I± is taken as positive (I+) or negative (I−) pulses (50 time steps), so that we can focus on the persistent activity of units after a phasic input. ω and β are respectively the frequency and the stabilization coefficient of the internal oscillation. ρ and σ represent the mutual feedback between the internal oscillation and the membrane potential. Wij are the connection weights describing the strength of coupling between units i and j. φ0 is known to be a stable fixed point of the equation for φ, and 0 a fixed point of the S equation.
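Eq. (1) for a single isolated unit (N = 1, no coupling) can be integrated with a simple forward Euler scheme. The values ω = 1 and β = 1.2 match those used in Sect. 3; σ, the time step, and the initial perturbation are assumed here for illustration:

```python
import numpy as np

# Forward-Euler sketch of Eq. (1) for one isolated unit (N = 1, W = 0).
omega, beta, rho, sigma = 1.0, 1.2, 1.0, 0.5      # mu = rho*sigma < mu_c
phi0 = np.pi - np.arcsin(-omega / beta)           # sin(phi0) = -omega/beta, cos(phi0) < 0

def R(x):
    # spike density of a unit (unused for W = 0, kept for completeness)
    return 0.5 * (np.tanh(10.0 * (x - 0.5)) + 1.0)

def step(S, phi, I, dt=1e-3):
    dS = -S + sigma * (np.cos(phi) - np.cos(phi0)) + I   # coupling term absent for N = 1
    dphi = omega + (beta - rho * S) * np.sin(phi)
    return S + dt * dS, (phi + dt * dphi) % (2 * np.pi)

S, phi = 0.05, phi0 + 0.1      # small perturbation of the resting state M0
for _ in range(20000):         # 20 time units with I = 0
    S, phi = step(S, phi, I=0.0)
# For mu < mu_c the unit relaxes back to M0 = (0, phi0).
```

With μ = ρσ below the critical value discussed in Sect. 3, the perturbed unit returns to the resting state, consistent with M0 being stable in that regime.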

3 An Isolated Unit

3.1 Resting State

The resting state is the stable equilibrium when I = 0 for a single unit. We assume ω < β, so that M0 = (0, φ0) is a fixed point of the system. To study the linear stability of this fixed point, we write the stability matrix around M0:

DF|M0 = [ −1          −σ sin(φ0) ]
        [ −ρ sin(φ0)   β cos(φ0) ]    (2)


The sign of the eigenvalues of DF|M0, and thus the stability of M0, depends only on μ = ρσ. With our choice of ω = 1 and β = 1.2, μc ≈ 0.96. If μ < μc, M0 is a stable fixed point and there is another fixed point M1 = (S1, φ1) with φ1 < φ0, which is unstable. If μ > μc, M0 is unstable and M1 is stable, with φ1 > φ0. The fixed points exchange stability as the bifurcation parameter μ increases (a transcritical bifurcation). The simplified system expressed in the eigenvector coordinates (X1, X2) of the matrix DF|M0 gives a clear illustration of the bifurcation:

dx1/dt = a x1² + λ1 x1
dx2/dt = λ2 x2    (3)

Here λ1 = 0 is equivalent to μ = μc, and in this condition there is a positive-measure basin of attraction although some directions are unstable. The resting state M0 is not a classical fixed point attractor because it does not attract all trajectories from an open neighborhood, but it is still an attractor under Milnor's extended definition of attractors. The phase plane (S, φ) in Fig. 1 shows that, for μ close to the critical value, the nullclines cross twice, staying close to each other in between. That narrow channel makes the configuration indistinguishable from a Milnor attractor in computer experiments.

Fig. 1. Top: Phase space (S, φ) with the vector field and nullclines of the system. The dashed domain in B shows that M0 has a positive-measure basin of attraction when μ = μc. Bottom: Fixed points with their stable and unstable directions for the equivalent simplified system. A: μ < μc. B: μ = μc. C: μ > μc.

Since we have shown that μ is the crucial parameter for the stability of the resting state, we now set ρ = 1 and study the dynamics as a function of σ, with a close look near the critical regime (σ = μc).
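The critical value μc ≈ 0.96 can be checked numerically from the stability matrix DF|M0 of Eq. (2). The following sketch does a bisection on μ with ρ = 1 (the bracketing interval and tolerance are assumptions of this sketch):

```python
import numpy as np

# Numerical check of mu_c from DF|M0 of Eq. (2), with omega = 1, beta = 1.2.
omega, beta = 1.0, 1.2
phi0 = np.pi - np.arcsin(-omega / beta)     # sin(phi0) = -omega/beta, cos(phi0) < 0

def max_eig(mu, rho=1.0):
    sigma = mu / rho
    DF = np.array([[-1.0,                 -sigma * np.sin(phi0)],
                   [-rho * np.sin(phi0),   beta * np.cos(phi0)]])
    return np.linalg.eigvals(DF).real.max()

# Bisection on mu: the largest eigenvalue crosses zero exactly at mu_c.
lo, hi = 0.5, 1.5
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if max_eig(mid) < 0 else (lo, mid)
mu_c = 0.5 * (lo + hi)
print(round(mu_c, 3))    # 0.955, i.e. mu_c ≈ 0.96 as stated in the text
```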


3.2 Constant Input Can Give Oscillations

Under constant input there are two possible dynamics: a fixed point and a limit cycle. If

|ω / (β − S)| < 1,    (4)

there is a stable fixed point (S1, φ1), with φ1 a solution of

ω + (β − σ(cos(φ1) − cos(φ0)) − I) sin(φ1) = 0,
S1 = σ(cos(φ1) − cos(φ0)) + I.    (5)

If condition (4) is not satisfied, the φ equation in (1) gives rise to oscillatory dynamics. Identifying S with its temporal average, dφ/dt = ω + Γ sin(φ) with Γ = β − S will be periodic with period ∫₀^{2π} dφ / (ω + (β − S) sin(φ)). This approximation gives an oscillation at frequency √(ω² − (β − S)²), which is in good qualitative agreement with computer experiments (Fig. 2).

Fig. 2. For each value of the constant current I, the maximum and minimum values of S1 are plotted. The dominant frequency of S1, obtained by FFT, is compared to the theoretical value when S is identified with its temporal average (Frequency vs. Frequency (theoretical)).
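The averaged-phase prediction can be checked with a short simulation; here S is frozen so that Γ = β − S is a constant, and Γ = 0.6, ω = 1 are assumed values satisfying |Γ| < ω:

```python
import numpy as np

# Integrate d(phi)/dt = omega + Gamma*sin(phi) and measure the period of
# full 2*pi cycles, comparing with the prediction 2*pi/sqrt(omega^2 - Gamma^2).
omega, Gamma, dt = 1.0, 0.6, 1e-4
phi, t, crossings = 0.0, 0.0, []
while t < 100.0:
    phi_new = phi + dt * (omega + Gamma * np.sin(phi))
    if phi_new // (2 * np.pi) > phi // (2 * np.pi):   # completed one full cycle
        crossings.append(t)
    phi, t = phi_new, t + dt

period = float(np.mean(np.diff(crossings)))
predicted = 2 * np.pi / np.sqrt(omega**2 - Gamma**2)
print(period, predicted)   # both are close to 7.85
```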

If we inject an oscillatory input into the system, S oscillates at the same frequency provided the input frequency is low. For higher frequencies, S cannot follow the input and shows complex oscillatory dynamics with multiple frequencies.

4 Two Coupled Units

For two coupled units, flip-flop of oscillations is observed under various conditions. We will analyze the case μ = 0 and the flip-flop properties under various strengths of the connection weights, assuming symmetrical connections (W12 = W21 = W).

4.1 Influence of the Feedback Loop

In equation (1), ρ and σ implement a feedback loop representing the mutual influence of φ and S for each unit.

The Case μ = 0. In the case σ = 0 or ρ = 0, φ remains constant at φ = φ0: the system is then a classical recurrent network. This model was used to provide an associative memory network storing patterns in fixed point attractors [9]. For small coupling strength, the resting state is a fixed point. For strong coupling strength, two more fixed points appear: one unstable, corresponding to a threshold, and one stable, providing memory storage. After a transient positive input I+ above threshold, the coupled system will be in the up-state. A transient negative input I− can bring it back to the resting state. For a small perturbation (σ ≪ 1 and ρ = 1), the active state is a small up-state oscillation, but the associative memory properties (storage, completion) are preserved.

Growing Oscillations. The up-state oscillation in the membrane potential dynamics, triggered by giving an I+ pulse to unit 1, grows as σ increases and saturates to an up-state fixed point for strong feedback. Interestingly, for a range of feedback strength values near μc, S returns transiently near the Milnor attractor resting state. Projection of the trajectories of the 4-dimensional system onto a 2-dimensional plane section P illustrates these complex dynamics (Fig. 3). A cycle would intersect this plane in two points. For each σ value, we consider S1 at these intersection points. For a range between 0.91 and 1.05 with our choice of parameters, there are many more than two intersection points M*, suggesting chaotic dynamics.

4.2 Influence of the Coupling Strength

After a transient input, the dynamics of two coupled units can be a fixed point attractor, as in the resting state (I = 0), or a down-state or up-state oscillation, depending on the coupling strength. Near the critical value of the feedback loop, in addition to these, more complex dynamics occur for intermediate coupling strength.


Fig. 3. A: Influence of the feedback loop. Bifurcation diagram as a function of σ (top); S1 coordinates of the intersection points of the trajectory with a plane section P as a function of σ (bottom). B: Influence of the coupling strength. S1 maximum and minimum values and average phase difference (φ1 − φ2) as functions of W (top); S1 coordinates of the intersection points of the trajectory with a plane section P as a function of W (bottom).

Down-state Oscillation. For small coupling strength, the system periodically visits the resting state for a long time and goes briefly to the up-state. The frequency of this oscillation increases with the coupling strength. The two units are in anti-phase (when Si takes its maximum value, Sj takes its minimum value), Fig. 4 (bottom).

Up-state Oscillation. For strong coupling strength, a transient input to unit 1 leads to an up-state oscillation, Fig. 4 (top). The two units are perfectly in phase at W = 0.75, and the phase difference stays small for stronger coupling strength.

Chaotic Dynamics. For intermediate coupling strength, an intermediate cycle is observed, and more complex dynamics occur for a small range (0.58 < W < 0.78


Fig. 4. Si temporal evolution, (S1 , S2 ) phase plane and (Si , φi ) cylinder space. Top: Up-state oscillation for strong coupling. Middle: Multiple frequency oscillation for intermediate coupling. Bottom: Down-state oscillation for weak coupling.

with our parameters) before full synchronization, characterized by φ1 − φ2 = 0. The trajectory can have many intersection points with P, and S* in Fig. 3 shows multiple roads to chaos through period doubling.

5 Application to Slow Selection of a Memorized Pattern

5.1 A Small Network

The network is a set N of five units, consisting of a subset N1 of three units A, B, and C and a subset N2 of two units D and E. Within the set N, units have symmetrical all-to-all weak connections (WN = 0.01), and within each subset units have symmetrical all-to-all strong connections (WNi = 0.1·M), with M a global parameter varying slowly in time between 1 and 10. These subsets could represent two objects stored in the weight matrix.

5.2 Memory Retrieval and Response Selection

We consider a transient structured input into the network. For constant M, partial or complete stimulation of a subset Ni can elicit retrieval and completion of the subset in an up-state, as a classical auto-associative memory network would.
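The weight matrix of the small network in Sect. 5.1 can be sketched directly. Whether the strong within-subset weight replaces or adds to the weak background weight is not specified in the text; in this sketch it replaces it:

```python
import numpy as np

# 5-unit network of Sect. 5.1: units 0-2 form N1 (A, B, C), units 3-4 form N2 (D, E).
def weight_matrix(M):
    W = np.full((5, 5), 0.01)                 # weak all-to-all connections W_N
    for group in ([0, 1, 2], [3, 4]):
        W[np.ix_(group, group)] = 0.1 * M     # strong within-subset connections
    np.fill_diagonal(W, 0.0)                  # no self-coupling assumed
    return W

W = weight_matrix(M=5)    # M varies slowly between 1 and 10 in the simulation
```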

(Fig. 5 panels A-E: time courses of S for the five units.)

Fig. 5. Slow activation of a robust synchronous up-state in N1 during slow increase of M

In the Milnor attractor condition, more complex retrieval can be achieved when M is slowly increased. As an illustration, we consider transient stimulation of units A and B from N1 and of unit E from N2 (Fig. 5). The N2 units show anti-phase


oscillations with increasing frequency. The N1 units first show synchronous down-state oscillations, with long stays near the Milnor attractor, and gradually move toward sustained up-state oscillations. In this example, the selection of N1 in the up-state is very slow, and synchrony between units plays an important role.

6 Conclusion

We demonstrated that, in cylinder space, a Milnor attractor appears at a critical condition through forward and reverse saddle-node bifurcations. Near the critical condition, the pair of saddle and node constructs a pseudo-attractor, which can serve for the observation of Milnor attractor-like properties in computer experiments. The semi-stability of the Milnor attractor in this model seems to be associated with the variety of oscillations and with chaotic dynamics through period-doubling roads. We demonstrated that an oscillations network provides a variety of working memory encodings in dynamical states in the presence of a Milnor attractor. Applications of the oscillatory dynamics were compared to classical auto-associative memory models. The importance of Milnor attractors was proposed in the analysis of coupled map lattices in high dimension [11] and for chaotic itinerancy in the brain [13]. The functional significance of flip-flop oscillations networks with the above dynamical complexity is of interest for further analysis of integrative brain dynamics.

References

1. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The brainweb: Phase synchronization and large-scale integration. Nature Reviews Neuroscience (2001)
2. Buzsaki, G., Draguhn, A.: Neuronal oscillations in cortical networks. Science (2004)
3. Onton, J., Delorme, A., Makeig, S.: Frontal midline EEG dynamics during working memory. NeuroImage (2005)
4. Mizuhara, H., Yamaguchi, Y.: Human cortical circuits for central executive function emerge by theta phase synchronization. NeuroImage (2004)
5. Rainer, G., Lee, H., Simpson, G.V., Logothetis, N.K.: Working-memory related theta (4-7 Hz) frequency oscillations observed in monkey extrastriate visual cortex. Neurocomputing (2004)
6. Tsujimoto, T., Shimazu, H., Isomura, Y., Sasaki, K.: Prefrontal theta oscillations associated with hand movements triggered by warning and imperative stimuli in the monkey. Neuroscience Letters (2003)
7. Goldman-Rakic, P.S.: Cellular basis of working memory. Neuron (1995)
8. McCormick, D.A.: Neuronal networks: Flip-flops in the brain. Current Biology (2005)
9. Durstewitz, D., Seamans, J.K., Sejnowski, T.J.: Neurocomputational models of working memory. Nature Neuroscience (2000)
10. Yamaguchi, Y.: A theory of hippocampal memory based on theta phase precession. Biological Cybernetics (2003)


11. Kaneko, K.: Dominance of Milnor attractors in globally coupled dynamical systems with more than 7±2 degrees of freedom (retitled from 'Magic Number 7±2 in Globally Coupled Dynamical Systems'). Physical Review Letters (2002)
12. Fujii, H., Tsuda, I.: Interneurons: their cognitive roles - A perspective from dynamical systems view. Development and Learning (2005)
13. Tsuda, I.: Towards an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behavioural and Brain Sciences (2001)

Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization of Quasi-Attractors

Hiroshi Fujii1,2, Kazuyuki Aihara2,3, and Ichiro Tsuda4,5

1 Department of Information and Communication Sciences, Kyoto Sangyo University, Kyoto 603-8555, Japan; [email protected]
2 Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan; [email protected]
3 ERATO, Japan Science and Technology Agency, Tokyo 151-0065, Japan
4 Research Institute for Electronic Science, Hokkaido University, Sapporo 060-0812, Japan; [email protected]
5 Center of Excellence (COE) in Mathematics, Department of Mathematics, Hokkaido University, Sapporo 060-0810, Japan


Abstract. A new hypothesis on a possible role for corticopetal acetylcholine (ACh) is provided from a dynamical systems standpoint. Corticopetal ACh helps to transiently organize global (inter- and intra-cortical) quasi-attractors via gamma-range synchrony when this is behaviorally needed, as in top-down attention and expectation.

1 Introduction

1.1 Corticopetal Acetylcholine

Acetylcholine (ACh) was the first substance identified as a neurotransmitter, by Otto Loewi [19]. Although it is increasingly recognized that ACh plays a critical role not only in arousal and sleep but also in higher cognitive functions such as attention and conscious flow, the question of the way in which ACh works in those cognitive processes remains a mystery [11]. Corticopetal ACh, originating in the nucleus basalis of Meynert (NBM), a part of the basal forebrain (BF), is the primary source of cortical ACh, and the major target of BF projections is the cortex [21]. Behavioral studies, as well as those using immunotoxins, provide consistent evidence of the role of ACh in top-down attention. A blockage of NBM ACh, whether disease-related or drug-induced, causes a severe loss of attention: selective attention, sustained attention, and divided attention, together with shifts of attention. ACh also concerns conscious flow (Perry & Perry [24]). Continual death of cholinergic neurons in the NBM causes Lewy body dementia (LBD), one of the most salient symptoms of which is complex visual hallucination (CVH) [1].1

1

Perry and Perry [24] noted those hallucinatory LBD patients who see "integrated images of people or animals which appear real at the time", "insects on walls", or "bicycles outside the fourth storey window". The images are generally vivid and colored, and continue for a few minutes (neither seconds nor hours). It is to be noted that "many of those experiences are enhanced during eye closure and relieved by visual input", and that "nicotinic antagonists, such as mecamylamine, are not reported to induce hallucinations".

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 170–178, 2008. © Springer-Verlag Berlin Heidelberg 2008


1.2 Attentions, Cortical State Transitions and Cholinergic Control System from NBM

The top-down flow of signals which accompanies attention, expectation, and so on may cause state transitions in the "downstream" cortices. Fries et al. [7] reported an increase of synchrony in the high-gamma range in accordance with selective attention (see also Jones [15], Buschman et al. [2]). Metherate et al. [22] stimulated the NBM in in vivo preparations of auditory cortex. Of particular interest in their observations is that NBM ACh produced a change in subthreshold membrane potential fluctuations from large-amplitude, slow (1-5 Hz) oscillations to low-amplitude, fast (20-40 Hz, i.e., gamma) oscillations. A shift of the spike discharge pattern from phasic to tonic was also observed.2 They pointed out that, in view of the widespread projections of NBM neurons, larger intercortical networks could also be modified. Together with the data of Fries et al., this suggests that NBM cholinergic projections may induce a state transition, as a shift of frequency and a change of discharge pattern, in neocortical neurons. This may be consistent with the observations made by Kay et al. [16]. During perceptual processing in the olfactory-limbic axis, a cascade of brain events at the successive stages of the task, such as "expectation" and/or "attention", was observed. 'Local' transitions of the olfactory structures, indicated by modulations of EEG signals such as gamma amplitude, periodicity, and coherence, were reported to exist. Kay et al. also observed that the local dynamics transiently falls into attractor-like states. Such 'local' transitions of states are generally postulated to be triggered by 'top-down' glutamatergic spike volleys from "upstream" organizations. However, such brain events of state transitions with a change in synchrony could be the result of a collaboration of descending glutamatergic spike volleys and ascending ACh afferents from NBM. (See also [26].)3

2 Neural Correlate of Conscious Percepts and the Role of the Corticopetal ACh

2.1 Neural Correlates of Conscious Percepts and Transient Synchrony

The corticopetal ACh pathway might be the critical control system which triggers various kinds of attention, receiving convergent inputs from other sensory and association areas, as Sarter et al. [25], [26] argued. In order to discuss the role of the

2

NBM contains cholinergic neurons and non-cholinergic neurons; GABAergic neurons are at least twice as numerous as cholinergic neurons [15]. The Metherate et al. observations (above) may be the result of the collective functioning of both the cholinergic and GABAergic projections. Wenk [31] argued for another possibility, that the NBM ACh projections on the reticular thalamic nucleus might cause the cortical state change.
3 Triangular attentional pathway: The above arguments may better be complemented by the triangular amplification circuitry, a pathway consisting of parietal cortex → prefrontal cortex → NBM → sensory cortex [10]. This may constitute the cholinergic control system from NBM, i.e., the top-down attentional pathway for cholinergic modulations of responses in cortical sensory areas.


corticopetal ACh related to attentions, we first begin with the question: What is the neural correlate of conscious percepts? The recent experiments by Kenet et al. [17] show the possibility that in a background (or spontaneous) state where no external stimuli exist, the visual cortex fluctuates between specific internal states. This might mean that cortical circuitry has a number of pre-existing and intrinsic internal states which represent features, and the cortex fluctuates between such multiple intrinsic states even when no external inputs exist. (See, also Treisman et al. [27].) Attention and Dynamical Binding Through Synchrony In order that perception of an object makes sense, its features, or fragmentary subassemblies, must be bound as a unity. How does the “binding” of those local fragmentary dynamics into a global dynamics is done? A widely accepted view is that top-down signals as expectation, attentions, and so on may play the role of such an integration, which is mediated by (possibly gamma) synchronization among the concerned assemblies representing stimuli or events. We postulate that such a process is a basis of global intra- and inter-cortical conjunctions of brain activities. See, Varela et al. [30],Womelsdorf et al. [32]. See, also Dehaene and Changeux [5].) The neural correlate of conscious percepts is a globally integrated state of related brain networks mediated by synchrony over gamma and other frequency bands. In mathematical terms, such transient processes of synchrony of global, but between selected groups of neurons may be described as a transitory state of approaching a global attractor. We note that such a “transitory” state may be conceptualized as an attractor ruin (Tsuda [28], Fujii et al. [8]), in which there are orbits approaching and stay there for a while, but may simultaneously possess repelling orbits from itself. The proper Milnor attractor and its perturbed structures can be a specific representation of attractor ruins [8]. 
However, since the concept of attractor ruins may include a wider class of non-classical attractors than the Milnor attractor, we use the term "attractor ruins" in this paper to include possible but as yet unknown classes of ruins.

2.2 Role of the Corticopetal ACh: A Working Hypothesis
How can top-down attention, expectation, etc. contribute to conscious perception with the aid of ACh? Granting the arguments of the preceding section, this question can be translated into: "How do the corticopetal ACh projections work for the emergence of globally organized attractor ruins via transient synchrony?" We summarize our tentative proposition as a Working Hypothesis in the following.

Working Hypothesis: The role of the corticopetal ACh accompanying top-down contextual signals such as attention is to act as the mediator for dynamically organizing quasi-attractors, which are required in conscious perception or in executing actions. ACh "integrates" multiple "floating subnetworks" into "a transient synchrony group" in the gamma frequency range. Such a transiently emerging synchrony group can be regarded as an attractor ruin in the dynamical-systems-theoretic sense.

Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization


3 Do the Existing Experimental Data Support the Working Hypothesis?

3.1 Introductory Remarks: Transient Synchronization by Virtue of Pre- and Post-synaptic ACh Modulations
ACh may have both pre-synaptic and post-synaptic effects on individual neurons in the cortex. First, top-down glutamatergic spike volleys flow into the cortical layers, which might convey contextual information about stimuli. The corticopetal ACh arrives concomitantly with the glutamatergic volleys. If ACh release modulates the synaptic connectivity between cortical neurons, even in an effective sense, by virtue of "pre-synaptic modulations", a metamorphosis of the attractor landscape (see footnote 4) should be inevitable. Post-synaptic influences of the corticopetal ACh on individual neurons, either inhibitory or excitatory, might have profound effects on their firing behavior and might induce a state transition with a collective gamma oscillation. A consequence of these effects together might be to trigger a specific group of networks to oscillate at gamma frequency with phase synchrony. These are all speculative stories based on experimental evidence. We need at least to examine the present status of the experimental data concerning the influence of cortical ACh on individual neurons.

This may be the place to add a comment on the specificity of the corticopetal ACh projections on the cortex, which may be a point of contention. It is reported that the cholinergic afferents establish specialized synaptic connections with postsynaptic targets, rather than releasing ACh non-specifically (Turrini et al. [29]).

3.2 Controversy over Experimental Data
Let us quickly review the existing experimental data. As noted before, there has existed little consensus among researchers for more than half a century [11]. The following is not intended as a complete review, but as background that may help in understanding the succeeding discussions.
ACh has two groups of receptors: the muscarinic receptors (mAChRs, with 5 subtypes) and the nicotinic receptors (nAChRs, with 17 subtypes). The nAChR is a relatively simple cationic (Na+ and Ca2+) channel, the opening of which leads to a rapid depolarization followed by desensitization. Activation of most mAChRs exhibits slower-onset and longer-lasting G-protein-coupled second-messenger generation. Here the primary interest is in mAChRs (see footnote 5). The functions of mAChRs are reported to be two-fold: one is pre-synaptic, and the other is post-synaptic modulation.

Footnote 4: "Attractor landscape" is usually used for potential systems. Here we use it to mean the landscape of "basins" (absorbing regions) of classical and non-classical attractors such as attractor ruins.
Footnote 5: The nicotinic receptors (nAChRs) may work as a disinhibition system acting on layer 2/3 pyramidal neurons [4]; their exact function is not yet known.


Post-synaptic Modulations
The results of traditional studies may be divided into two opposing camps. The majority view is that mAChRs function as an excitatory transmitter for post-synaptic neurons (see, e.g., McCormick [20]), while minority data claim inhibitory functioning. The latter, however, has been considered a consequence of ACh excitation of interneurons, which in turn may inhibit post-synaptic pyramidal neurons (PYR). Recently, Gulledge et al. [11], [12] stated that transient mAChR activation generates strong and direct transient inhibition of neocortical PYR. The underlying ionic process is the induction of calcium release from internal stores and the subsequent activation of small-conductance (SK-type) calcium-activated potassium channels. The authors claim that the traditional data do not describe the actions of transient mAChR activation, as is likely to occur during synaptic release of ACh, for the following reasons.

1. In vivo ACh concentration: Previous studies typically used high concentrations of muscarinic agonists (1–100 mM). Extracellular concentrations of ACh in the cortex are at least one order of magnitude lower than those required to depolarize PYR in vitro.
2. Phasic (transient) application vs. bath application: Most data depended on experiments with bath applications, which may correspond to prolonged, tonic mAChR stimulation. The ACh release accompanying attention, etc., would better correspond to a transient puff application, as in the authors' experiment.

The specificity of ACh afferents for postsynaptic targets was already noted [29].

Pre-synaptic Modulations
Experimental work on pre-synaptic modulations is mostly based on ACh bath applications, and the modulation data were measured in terms of local field potentials (LFP). Most results claimed pathway specificity of the modulations. Typically, it was concluded that muscarinic modulation can strongly suppress intracortical (IC) synaptic activity while exerting less suppression on, or actually enhancing, thalamocortical (TC) inputs [14]. Gil et al. [9] reported that 5 μM muscarine decreases both IC and TC pathway transmission, and that these were presynaptic effects, since membrane potential and input resistance were unchanged. Recently, Kuczewski et al. [18] studied the same problem and obtained results different from the previous ones: low ACh (less than 100 μM) produces facilitation, and as the ACh concentration increases, the result is depression. This holds for both the IC and TC pathways, i.e., for layer 2/3 and layer 4.

3.3 Possible Scenarios
The lack of consistent experimental data makes our job complicated. The situation might be compared to doing a jigsaw puzzle with many pieces missing and some pieces mixed in from other puzzles. What we can do at this moment is to propose alternative possible scenarios for the role of the corticopetal ACh.


The following is the list of prerequisites and evidence on which our arguments should be based.

1. The two modulations may occur simultaneously inside the 6 layers of the cortex. The firing characteristics of individual neurons and the strength of synaptic connections may change dynamically through either post-synaptic or pre-synaptic modulations. Virtually no models, to our knowledge, have been proposed that take the net effects of the two modulations into account.
2. The interaction of ACh with top-down glutamatergic spike volleys should be considered. The majority of neurons alter their behavior in response to combined, concomitant exposure to both acetylcholine and glutamate (Perry & Perry [24]).
3. As a post-synaptic influence, ACh release may change the firing regime of neurons and induce gamma oscillation [6], [23], [31].

As to pre-synaptic modulations, the details of the synaptic processes appear to be largely unknown. The significance of experimental studies cannot be overemphasized. Now let us try to sketch possible scenarios for the role of corticopetal ACh. Here we may put three cornerstones for the models:

1. Who triggers the gamma oscillation?
2. Who modulates the effective connectivity, and how?
3. What is the mechanism of phase synchrony, and what is its role?

Scenario I
The basic idea is that in the default, low-ACh state, globally organized attractors virtually do not exist; the dynamics instead takes the form of floating fragmentary dynamics. ACh release may then help to strengthen the synaptic connections pre-synaptically. One of the roles of the post-synaptic modulation is to start up the gamma oscillation. (Here the influence of GABAergic projections from the NBM might play a role.) Another, important role will be stated later.

Scenario II
The effective modulation of synaptic connectivity might be carried not by the pre-synaptic modulation, but by the phase synchrony of the gamma oscillation itself, which is triggered by the post-synaptic modulation. Such a mechanism for the change of synaptic connectivity, and the resulting binding of fragmentary groups of neurons, was proposed by Womelsdorf et al. [32]. They claimed that the mutual influence among neuronal groups depends on the phase relation between the rhythmic activities within the groups: phase relations supporting interactions among the groups preceded those interactions by a few milliseconds, consistent with a mechanistic role. (See also Buzsaki [3].) In the case of Scenario II, the role of the post-synaptic modulation is to start up the gamma oscillation and to reset its phase, as Gulledge and Stuart [11] suggested. The transient hyper-polarization may act as a referee, starting up the oscillation among the related groups in unison. Scenario I likewise makes the post-synaptic modulation carry the two roles of starting up the gamma oscillation and


resetting its phase. To the pre-synaptic modulation, it assigns the larger role of realizing attractors by virtue of synaptic-strength modulation.

4 Concluding Discussions

The critical role of the corticopetal ACh in cognitive functions, together with its relation to some disease-related symptoms such as complex visual hallucinations in DLB and its apparent involvement in neocortical state changes, has motivated us to study the functional role(s) of the corticopetal ACh from a dynamical-systems standpoint. Cognitive functions are phenomena carried by brain dynamics. We hope that understanding cognitive dynamics in the language of dynamical systems will open new theoretical horizons.

It is of some help to consider the conceptual difference between the two "forces" which flow into the 6 layers of the neocortex. Glutamatergic spike volleys could be, if viewed as an event in a dynamical system, an external force, which may kick the orbit onto another orbit, and may sometimes push it out of the "basin" of the present attractor, beyond its border, the separatrix. In contrast, ACh projections, though transient, could be regarded as a slow parameter working as a bifurcation parameter that modifies the landscape itself. What we have been looking at in the preceding arguments is that the two phenomena happen concomitantly in the 6 layers of the cortex.

Hasselmo and McGaughy [13] emphasized the role of ACh in memory as: "high acetylcholine sets circuit dynamics for attention and encoding; low acetylcholine sets dynamics for consolidation", which is based on some experimental data on selective pre-synaptic depression and facilitation. However, in view of the potential role of attention in local bindings or global integrations, we may pose alternative (but not necessarily exclusive) scenarios in which the function of ACh is temporarily to modify the quasi-attractor landscape, in collaboration with glutamatergic spike volleys. Rather, we speculate that the process of memorization itself would be realized through such a dynamic formation of attractor ruins, for which mAChRs may play a role.

Acknowledgements

The first author (HF) was supported by a Grant-in-Aid for Scientific Research (C), No. 19500259, from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government. The second author (KA) was partially supported by a Grant-in-Aid for Scientific Research on Priority Areas, No. 17022012, from the Ministry of Education, Culture, Sports, Science, and Technology, the Japanese Government. The third author (IT) was partially supported by a Grant-in-Aid for Scientific Research on Priority Areas, No. 18019002 and No. 18047001, a Grant-in-Aid for Scientific Research (B), No. 18340021, a Grant-in-Aid for Exploratory Research, No. 17650056, a Grant-in-Aid for Scientific Research (C), No. 16500188, and the 21st Century COE Program, Mathematics of Nonlinear Structures via Singularities.


References

1. Behrendt, R.-P., Young, C.: Hallucinations in schizophrenia, sensory impairment, and brain disease: A unifying model. Behav. Brain Sci. 27, 771–787 (2004)
2. Buschman, T.J., Miller, E.K.: Top-down Versus Bottom-up Control of Attention in the Prefrontal and Posterior Parietal Cortices. Science 315, 1860–1862 (2007)
3. Buzsaki, G.: Rhythms of the Brain. Oxford University Press, Oxford (2006)
4. Christophe, E., Roebuck, A., Staiger, J.F., Lavery, D.J., Charpak, S., Audinat, E.: Two Types of Nicotinic Receptors Mediate an Excitation of Neocortical Layer I Interneurons. J. Neurophysiol. 88, 1318–1327 (2002)
5. Dehaene, S., Changeux, J.-P.: Ongoing Spontaneous Activity Controls Access to Consciousness: A Neuronal Model for Inattentional Blindness. PLoS Biology 3, 910–927 (2005)
6. Detari, L.: Tonic and phasic influence of basal forebrain unit activity on the cortical EEG. Behav. Brain Res. 115, 159–170 (2000)
7. Fries, P., Reynolds, J.H., Rorie, A.E., Desimone, R.: Modulation of Oscillatory Neuronal Synchronization by Selective Visual Attention. Science 291, 1560–1563 (2001)
8. Fujii, H., Aihara, K., Tsuda, I.: Functional Relevance of 'Excitatory' GABA Actions in Cortical Interneurons: A Dynamical Systems Approach. J. Integrative Neurosci. 3, 183–205 (2004)
9. Gil, Z., Connors, B.W., Amitai, Y.: Differential Regulation of Neocortical Synapses by Neuromodulators and Activity. Neuron 19, 679–686 (1997)
10. Golmayo, L., Nunez, A., Zaborsky, L.: Electrophysiological Evidence for the Existence of a Posterior Cortical-Prefrontal-Basal Forebrain Circuitry in Modulating Sensory Responses in Visual and Somatosensory Rat Cortical Areas. Neuroscience 119, 597–609 (2003)
11. Gulledge, A.T., Stuart, G.J.: Cholinergic Inhibition of Neocortical Pyramidal Neurons. J. Neurosci. 25, 10308–10320 (2005)
12. Gulledge, A.T., Park, S.B., Kawaguchi, Y., Stuart, G.J.: Heterogeneity of phasic signaling in neocortical neurons. J. Neurophysiol. 97, 2215–2229 (2007)
13. Hasselmo, M.E., McGaughy, J.: High acetylcholine sets circuit dynamics for attention and encoding; Low acetylcholine sets dynamics for consolidation. Prog. Brain Res. 145, 207–231 (2004)
14. Hsieh, C.Y., Cruikshank, S.J., Metherate, R.: Differential modulation of auditory thalamocortical and intracortical synaptic transmission by cholinergic agonist. Brain Res. 880, 51–64 (2000)
15. Jones, B.E., Muhlethaler, M.: Cholinergic and GABAergic neurons of the basal forebrain: role in cortical activation. In: Lydic, R., Baghdoyan, H.A. (eds.) Handbook of Behavioral State Control, pp. 213–233. CRC Press, London (1999)
16. Kay, L.M., Lancaster, L.R., Freeman, W.J.: Reafference and attractors in the olfactory system during odor recognition. Int. J. Neural Systems 4, 489–495 (1996)
17. Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., Arieli, A.: Nerve cell activity when eyes are shut reveals internal views of the world. Nature 425, 954–956 (2003)
18. Kuczewski, N., Aztiria, E., Gautam, D., Wess, J., Domenici, L.: Acetylcholine modulates cortical synaptic transmission via different muscarinic receptors, as studied with receptor knockout mice. J. Physiol. 566.3, 907–919 (2005)
19. Loewi, O.: Ueber humorale Uebertragbarkeit der Herznervenwirkung. Pflueger's Archiv Gesamte Physiologie 189, 239–242 (1921)
20. McCormick, D.A., Prince, D.A.: Mechanisms of action of acetylcholine in the guinea-pig cerebral cortex in vitro. J. Physiol. 375, 169–194 (1986)


21. Mesulam, M.M., Mufson, E.J., Levey, A.I., Wainer, B.H.: Cholinergic innervation of cortex by the basal forebrain: cytochemistry and cortical connections of the septal area, diagonal band nuclei, nucleus basalis (substantia innominata), and hypothalamus in the rhesus monkey. J. Comp. Neurol. 214, 170–197 (1983)
22. Metherate, R., Cox, C.L., Ashe, J.H.: Cellular Bases of Neocortical Activation: Modulation of Neural Oscillations by the Nucleus Basalis and Endogenous Acetylcholine. J. Neurosci. 12, 4701–4711 (1992)
23. Niebur, E., Hsiao, S.S., Johnson, K.O.: Synchrony: a neuronal mechanism for attentional selection? Curr. Opin. Neurobiol. 12, 190–194 (2002)
24. Perry, E.K., Perry, R.H.: Acetylcholine and Hallucinations: Disease-Related Compared to Drug-Induced Alterations in Human Consciousness. Brain Cognit. 28, 240–258 (1995)
25. Sarter, M., Gehring, W.J., Kozak, R.: More attention should be paid: The neurobiology of attentional effort. Brain Res. Rev. 51, 145–160 (2006)
26. Sarter, M., Parikh, V.: Choline Transporters, Cholinergic Transmission and Cognition. Nature Reviews Neurosci. 6, 48–56 (2005)
27. Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cognit. Psychol. 12, 97–136 (1980)
28. Tsuda, I.: Chaotic Itinerancy as a Dynamical Basis of Hermeneutics of Brain and Mind. World Futures 32, 167–185 (1991)
29. Turrini, P., Casu, M.A., Wong, T.P., De Koninck, Y., Ribeiro-da-Silva, A., Cuello, A.C.: Cholinergic nerve terminals establish classical synapses in the rat cerebral cortex: synaptic pattern and age-related atrophy. Neuroscience 105, 277–285 (2001)
30. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The Brainweb: Phase synchronization and large-scale integration. Nature Rev. Neurosci. 2, 229–239 (2001)
31. Wenk, G.L.: The Nucleus Basalis Magnocellularis Cholinergic System: One Hundred Years of Progress. Neurobiol. Learn. Mem. 67, 85–95 (1997)
32. Womelsdorf, T., Schoffelen, J.M., Oostenveld, R., Singer, W., Desimone, R., Engel, A.K., Fries, P.: Modulation of neuronal interactions through neuronal synchronization. Science 316, 1578–1579 (2007)

Tracking a Moving Target Using Chaotic Dynamics in a Recurrent Neural Network Model

Yongtao Li and Shigetoshi Nara

Graduate School of Natural Science and Technology, Okayama University, 3-1-1 Tsushima-naka, Okayama 700-8530, Japan
[email protected]

Abstract. Chaotic dynamics introduced into a recurrent neural network model is applied to controlling a tracker so as to track a moving target in two-dimensional space, which is set as an ill-posed problem. The motion increments of the tracker are determined by a group of motion functions calculated in real time from the firing states of the neurons in the network. Several groups of cyclic memory attractors that correspond to simple motions of the tracker in two-dimensional space are embedded. Chaotic dynamics enables the tracker to perform various motions. Adaptive real-time switching of a control parameter causes chaotic itinerancy and enables the tracker to track a moving target successfully. The performance of tracking is evaluated by calculating the success rate over 100 trials. Simulation results show that chaotic dynamics is useful for tracking a moving target. To understand this further, the structure of the chaotic dynamics is investigated from a dynamical-systems viewpoint.

Keywords: Chaotic dynamics, tracking, moving target, neural network.

1 Introduction

Biological systems have become a hot research topic around the world because of their excellent functions, not only in information processing but also in well-regulated functioning and control, which work quite adaptively in various environments. However, our understanding of the mechanisms of biological systems, including brains, remains poor despite many efforts, because the enormous complexity originating from the dynamics of such systems is very difficult to understand and describe using conventional methodologies based on reductionism, that is, decomposing a system into parts or elements. Conventional reductionism more or less runs into two difficulties: one is "combinatorial explosion" and the other is "divergence of algorithmic complexity". These difficulties are not yet solved. On the other hand, a dynamical viewpoint for understanding these mechanisms seems to be a plausible approach. In particular, chaotic dynamics experimentally observed in biological systems including brains [1,2] has suggested that chaotic dynamics would play important roles in the complex functioning and control of biological systems. From this viewpoint, many dynamical models have been constructed to approach these mechanisms by means of large-scale simulation or heuristic methods. Artificial neural networks into which chaotic dynamics can be introduced have attracted great interest, and the relation between chaos and function has been discussed

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 179–188, 2008.
© Springer-Verlag Berlin Heidelberg 2008


[9,10,11,12]. As one of those works, Nara and Davis found that chaotic dynamics can occur in a recurrent neural network model (RNNM) consisting of binary neurons [3], and they investigated the functional aspects of chaos by applying it to solving a memory-search task set in an ill-posed context [7]. To show the potential of chaos in control, chaotic dynamics was applied to solving two-dimensional mazes, which were set as ill-posed problems [8]. Two important points were proposed. One is a simple coding method translating the neural states into motion increments; the other is a simple control algorithm that switches a system parameter adaptively to produce constrained chaos. The conclusions show that constrained chaotic behaviours can give better performance in solving a two-dimensional maze than a random walk. In this paper, we develop this idea and apply chaotic dynamics to tracking a moving target, which is set as another ill-posed problem. Let us describe the model of tracking a moving target. A tracker is assumed to move in two-dimensional space, with discrete time steps, and to track a target moving along a certain trajectory by employing chaotic dynamics. The state pattern of the network is transformed into the tracker's motion by the coding of motion functions, which will be given in a later section. In addition, several limit-cycle attractors, which are regarded as prototypical simple motions, are embedded in the network. By the coding of motion functions, each cycle corresponds to a monotonic motion in two-dimensional space. If the state pattern converges into a prototypical attractor, the tracker moves in a monotonic direction. Introducing chaotic dynamics into the network generates non-periodic state patterns, which are transformed into chaotic motion of the tracker by the motion functions.
Adaptive switching of a system parameter, via a simple evaluation, between chaotic dynamics and attractor dynamics in the network results in complex motions of the tracker in various environments. Considering this point, a simple control algorithm is proposed for tracking a moving target. In actual simulations, the present method using chaotic dynamics gives good performance. To understand the mechanism of this performance, the structure of the chaotic dynamics is investigated using statistical data.

2 Memory Attractors and Motion Functions

Our study works with a fully interconnected recurrent neural network consisting of N binary neurons. Its updating rule is defined by

$$S_i(t+1) = \mathrm{sgn}\Bigl(\sum_{j \in G_i(r)} W_{ij} S_j(t)\Bigr) \qquad (1)$$

$$\mathrm{sgn}(u) = \begin{cases} +1, & u \ge 0 \\ -1, & u < 0 \end{cases}$$

• $S_i(t) = \pm 1$ $(i = 1 \sim N)$: the firing state of the neuron specified by index $i$ at time $t$.
• $W_{ij}$: connection weight from neuron $S_j$ to neuron $S_i$ ($W_{ii}$ is taken to be 0).
• $r$: fan-in number for neuron $S_i$, called the connectivity ($0 < r < N$).
• $G_i(r)$: a spatial configuration set of connectivity $r$.
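As a minimal sketch, the update rule (1) can be written in a few lines of Python. The network size, the random weights, and the per-step random choice of the afferent set are illustrative assumptions of ours (the paper fixes $G_i(r)$ as a spatial configuration set and determines $W_{ij}$ by Eq. (2)):

```python
import random

random.seed(0)
N = 20          # network size (the paper uses N = 400)
r = N - 1       # connectivity: fan-in number per neuron

# illustrative random weights with zero self-connection (W[i][i] = 0)
W = [[0.0 if i == j else random.uniform(-1, 1) for j in range(N)]
     for i in range(N)]

def sgn(u):
    """sgn(u) = +1 if u >= 0, -1 if u < 0."""
    return 1 if u >= 0 else -1

def update(S, r):
    """One synchronous step of S_i(t+1) = sgn(sum_{j in G_i(r)} W_ij S_j(t)).
    Here G_i(r) is drawn at random each step for simplicity; in the paper
    it is a fixed spatial configuration set."""
    S_next = []
    for i in range(N):
        G_i = random.sample([j for j in range(N) if j != i], r)
        S_next.append(sgn(sum(W[i][j] * S[j] for j in G_i)))
    return S_next

S = [random.choice([-1, 1]) for _ in range(N)]  # random initial state pattern
S = update(S, r)
```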


At a certain time t, the state of the neurons in the network can be represented as an N-dimensional state vector S(t), called the state pattern. The time development of the state pattern S(t) depends on the connection weight matrix $\{W_{ij}\}$ and the connectivity r; in our study, the $W_{ij}$ are determined in the case of full connectivity r = N − 1 by a kind of orthogonalized learning method [7] and taken as follows:

$$W_{ij} = \sum_{\mu=1}^{L} \sum_{\lambda=1}^{K} (\xi_{\mu}^{\lambda+1})_i \cdot (\xi_{\mu}^{\lambda\dagger})_j \qquad (2)$$

where $\{\xi_{\mu}^{\lambda} \mid \lambda = 1 \ldots K,\ \mu = 1 \ldots L\}$ is the attractor pattern set, K is the number of memory patterns included in a cycle, and L is the number of memory cycles. $\xi_{\mu}^{\lambda\dagger}$ is the conjugate vector of $\xi_{\mu}^{\lambda}$, which satisfies $\xi_{\mu'}^{\lambda'\dagger} \cdot \xi_{\mu}^{\lambda} = \delta_{\mu\mu'} \cdot \delta_{\lambda\lambda'}$, where $\delta$ is Kronecker's delta. This method was confirmed to be effective in avoiding spurious attractors [3,4,5,6,7,8].

Biological data show that neurons in the brain cause various motions of the muscles in the body with quite large redundancy. Therefore, the network consisting of N neurons is used to realize two-dimensional motion control of a tracker. We confirmed that chaotic dynamics introduced in the network does not depend sensitively on the number of neurons [7]. In our actual computer simulations, N = 400.

Suppose that a tracker moves from the position $(p_x(t), p_y(t))$ to $(p_x(t+1), p_y(t+1))$ with a set of motion increments $(f_x(t), f_y(t))$. The state pattern S(t) at time t is a 400-dimensional vector, and we transform it into two-dimensional motion increments by the coding of motion functions $(f_x(\mathbf{S}(t)), f_y(\mathbf{S}(t)))$. In two-dimensional space, the actual motion of the tracker is given by

$$\begin{pmatrix} p_x(t+1) \\ p_y(t+1) \end{pmatrix} = \begin{pmatrix} p_x(t) \\ p_y(t) \end{pmatrix} + \begin{pmatrix} f_x(\mathbf{S}(t)) \\ f_y(\mathbf{S}(t)) \end{pmatrix} = \begin{pmatrix} p_x(t) \\ p_y(t) \end{pmatrix} + \frac{4}{N}\begin{pmatrix} \mathbf{A} \cdot \mathbf{C} \\ \mathbf{B} \cdot \mathbf{D} \end{pmatrix} \qquad (3)$$

where A, B, C, D are four independent N/4-dimensional sub-space vectors of the state pattern S(t). Since the inner product between two independent sub-space vectors is normalized by 4/N, the motion functions range from −1 to +1. In our actual simulations, two-dimensional space is digitized with a resolution of 0.02, owing to the binary neuron states ±1 and N = 400.

Now let us consider the construction of memory attractors corresponding to prototypical simple motions. We take 24 attractor patterns consisting of (L = 4 cycles) × (K = 6 patterns per cycle). Each cycle corresponds to one prototypical simple motion. We take four types of motion, in which the tracker moves toward (+1, +1), (−1, +1), (−1, −1), and (+1, −1) in two-dimensional space.
Each attractor pattern consists of four random sub-space vectors A, B, C and D, where C = A or −A, and D = B or −B. Thus only A and B are independent random patterns. By the law of large numbers, the memory patterns are almost orthogonal to each other. Furthermore, in determining $\{W_{ij}\}$, the orthogonalized learning method was employed; therefore, the memory patterns are orthogonalized to each other. The correspondence between memory attractors and prototypical simple motions is as follows:

$$(f_x(\xi_1^{\lambda}), f_y(\xi_1^{\lambda})) = (+1, +1) \qquad (f_x(\xi_2^{\lambda}), f_y(\xi_2^{\lambda})) = (-1, +1)$$
$$(f_x(\xi_3^{\lambda}), f_y(\xi_3^{\lambda})) = (-1, -1) \qquad (f_x(\xi_4^{\lambda}), f_y(\xi_4^{\lambda})) = (+1, -1)$$

3 Introducing Chaotic Dynamics in RNNM

Now let us consider the effects of the connectivity r. In the case of full connectivity r = N − 1, the network can function as a conventional associative memory. If the state pattern S(t) is one of the memory patterns $\xi_{\mu}^{\lambda}$, or near one, the output sequence S(t + kK) (k = 1, 2, 3, ...) will converge to the memory pattern $\xi_{\mu}^{\lambda}$. In other words, for each memory pattern there is a set of state patterns, called its memory basin $B_{\mu}^{\lambda}$: if S(t) is in the memory basin $B_{\mu}^{\lambda}$, then the output sequence S(t + kK) (k = 1, 2, 3, ...) will converge to the memory pattern $\xi_{\mu}^{\lambda}$. It is quite difficult to estimate basin volumes accurately because of the enormous amount of calculation required over the whole N-dimensional state space. Therefore, a statistical method is applied to estimate the approximate basin volumes. First, a sufficiently large number of state patterns are sampled from the state space. Second, each sample is taken as an initial pattern and updated with full connectivity. Third, statistics are taken of which memory attractor $\lim_{k\to\infty} \mathbf{S}(kK)$ of each sample converges into. The distribution of these statistics over the whole set of samples is regarded as the approximate basin volume of each memory attractor (see Fig. 1). The basin volumes show that, on average, almost all initial state patterns converge into one of the memory attractors, and spurious attractors are rare.


Fig. 1. Basin volume: The horizontal axis represents the memory pattern number (1–24). Basin 25 corresponds to samples that converged into cyclic outputs with a period of six steps but not to any one memory attractor. Basin 26 corresponds to samples excluded from any other case (1–25). The vertical axis represents the ratio of the corresponding samples to the whole set of samples.
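The three-step estimation procedure (sample initial patterns, update, tally the attractor reached) can be sketched on a toy network. As a simplifying assumption of ours, a small Hopfield-style network with fixed-point memories stands in for the authors' 400-neuron network with 6-step cyclic attractors; only the Monte Carlo procedure follows the text:

```python
import random

random.seed(2)
N, P, SAMPLES, MAX_STEPS = 32, 3, 100, 20
memories = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(P)]

# Hebbian weights with zero diagonal (a stand-in for Eq. (2))
W = [[0.0 if i == j else sum(m[i] * m[j] for m in memories) / N
      for j in range(N)] for i in range(N)]

def step(S):
    """One synchronous associative-memory update."""
    return [1 if sum(W[i][j] * S[j] for j in range(N)) >= 0 else -1
            for i in range(N)]

counts = [0] * (P + 1)          # extra bin: no stored memory reached
for _ in range(SAMPLES):
    S = [random.choice([-1, 1]) for _ in range(N)]
    for _ in range(MAX_STEPS):
        S_next = step(S)
        if S_next == S:          # fixed point reached
            break
        S = S_next
    # sign-flipped copies are attractors too in a Hopfield network
    hit = next((k for k, m in enumerate(memories)
                if S == m or S == [-x for x in m]), P)
    counts[hit] += 1

volumes = [c / SAMPLES for c in counts]   # approximate basin volumes
```

The last bin plays the role of basins 25 and 26 in Fig. 1, collecting samples that reach no stored memory.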

Next, we continue to decrease the connectivity r. When r is large enough, r ≈ N, the memory attractors are stable. As r becomes smaller and smaller, more and more state patterns gradually fail to converge into a certain memory pattern even when the network is updated for a long time; that is, the attractors become unstable. Finally, when r becomes quite small, the state pattern becomes a non-periodic output; that is, non-periodic dynamics occurs in the state space. In our previous papers, we confirmed that this non-periodic dynamics in the network is chaotic wandering. In order to investigate the dynamical structure, we calculated basin-visiting measures, which suggest that the trajectory can pass through the


whole N-dimensional state space; that is, the cyclic memory attractors become ruins owing to the quite small connectivity [3,4,5,6,7].

4 Motion Control and Tracking Algorithm

When the connectivity r is sufficiently large, a random initial pattern converges into one of the four limit-cycle attractors as time evolves. By the coding transformation of the motion functions, the corresponding motion of the tracker in two-dimensional space becomes monotonic (see Fig. 2). On the other hand, when the connectivity r is quite small, chaotic dynamics occurs in the state space and, correspondingly, the tracker moves chaotically (see Fig. 3). If the updating of the state pattern in the chaotic regime is replaced by a random 400-bit-pattern generator, the tracker shows a random walk (see Fig. 4). Obviously, chaotic motion is different from a random walk and has a certain dynamical structure.

Fig. 2. Monotonic motion (r = 399)

Fig. 3. Chaotic walk (r = 30, 500 steps)

Fig. 4. Random walk (500 steps)

Therefore, as the network evolves, monotonic motion and chaotic motion can be switched by switching the connectivity r. Based on this idea, we propose a simple algorithm to track a moving target, shown in Fig. 5. First, a tracker is assumed to be tracking a target that is moving along a certain trajectory in two-dimensional space, and the tracker can obtain rough directional information D1(t) about the moving target, called the global target direction. At a certain time t, the present position of the tracker is assumed to be the point (p_x(t), p_y(t)). This point is taken as the origin, and two-dimensional space can be divided into four quadrants. If the target is moving in the nth quadrant, D1(t) = n (n = 1, 2, 3, 4). Next, we also suppose that the tracker knows another piece of directional information, D2(t) = m (m = 1, 2, 3, 4), called the global motion direction. It means that the tracker has moved toward the mth quadrant from time t − 1 to t, that is, in the previous step. The global target direction D1(t) and the global motion direction D2(t) are time-dependent variables. If this information is fed back to the network in real time, the connectivity also becomes a time-dependent variable r(t), determined by D1(t) and D2(t). In Fig. 5, RL is a sufficiently large connectivity and RS is a quite small connectivity that leads to chaotic dynamics in the neural network. Adaptive switching of the connectivity is the core idea of the algorithm. When the connectivity r(t) has been determined by comparing the two directions D1(t − 1) and D2(t − 1), the motion increments of the tracker are calculated from the state pattern of the network updated with r(t). The new motion

184

Y. Li and S. Nara

Fig. 5. Control algorithm for tracking a moving target: by judging whether the global target direction D1(t) coincides with the global motion direction D2(t) or not, adaptive switching of the connectivity r between RS and RL produces chaotic dynamics or attractor dynamics in state space. Correspondingly, the tracker adaptively tracks a moving target in two-dimensional space.

causes the next D1(t) and D2(t), which produce the next connectivity r(t + 1). By repeating this process, the synaptic connectivity r(t) switches adaptively between RL and RS, and the tracker alternately performs monotonic motion and chaotic motion in two-dimensional space.
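The switching rule described above can be sketched in a few lines. Everything here (the helper names `quadrant` and `choose_r`, and the placeholder values of RL and RS) is our illustration under stated assumptions, not the paper's code:

```python
# Sketch of the adaptive connectivity-switching rule: a large connectivity
# gives attractor (monotonic) dynamics, a small one gives chaotic search.
# The concrete values of R_L and R_S below are placeholders.
R_L = 50   # large connectivity RL -> monotonic motion toward the target
R_S = 5    # small connectivity RS -> chaotic wandering

def quadrant(dx, dy):
    """Quadrant (1..4) in which the vector (dx, dy) lies."""
    if dx >= 0 and dy >= 0:
        return 1
    if dx < 0 and dy >= 0:
        return 2
    if dx < 0 and dy < 0:
        return 3
    return 4

def choose_r(D1, D2):
    """If the global target direction D1 agrees with the global motion
    direction D2, keep the large connectivity (monotonic motion);
    otherwise switch to the small one (chaotic search)."""
    return R_L if D1 == D2 else R_S
```

In a tracking loop, `D1` would be computed from the target position relative to the tracker and `D2` from the tracker's own last displacement, both via `quadrant`.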

5 Simulation Results

In order to confirm that this control algorithm is useful for tracking a moving target, the motion of the target must first be specified. We consider nine kinds of trajectories along which the target moves, shown in Fig. 6: one circular trajectory and eight linear trajectories. Suppose that the initial position of the tracker is the origin (0, 0) of the two-dimensional space. The distance L between the initial position of the tracker and that of the target is constant. Therefore, at the beginning of tracking, the tracker is at the center of the circular trajectory, and each of the eight linear trajectories is tangential to the circular trajectory at a certain angle α, measured from the x axis. The tangential angles are α = nπ/4 (n = 1, 2, . . . , 8), so we number the eight linear trajectories LTn and the circular trajectory LT0.

Fig. 6. Trajectories of the moving target: the arrow represents the moving direction of the target; the solid point marks the position at time t = 0.

Fig. 7. An example of tracking a target moving along a circular trajectory with the simple algorithm; the tracker captured the moving target at the intersection point.


Next, let us consider the velocity of the target. In the computer simulation, the tracker moves one step per discrete time step; at the same time, the target also moves one step of a certain length SL, which represents the velocity of the target. The motion increments of the tracker range from −1 to 1, so the step length SL is taken from 0.01 to 1 at intervals of 0.01, giving up to 100 different velocities. Because velocity is a relative quantity, SL = 0.01 corresponds to a slow target and SL = 1 to a fast target relative to the tracker. Now let us look at a simulation of tracking a moving target using the algorithm proposed above, shown in Fig. 7. When a target moves along a circular trajectory at a certain velocity, the tracker captures it at a certain point of the trajectory, which constitutes a successful capture for the circular trajectory.

6 Performance Evaluation


To evaluate the performance, we measured the success rate of tracking a target moving along one of the nine trajectories over 100 initial state patterns. A trial is regarded as successful if the tracker approaches the target within a certain tolerance during 600 steps; the fraction of successful trials is called the success rate. However, even for the same target trajectory, the tracking performance depends not only on the synaptic connectivity r but also on the target velocity, i.e. the target step length SL. Therefore, the success rate is evaluated for each pair of parameters, a connectivity r (1 ≤ r ≤ 60) and a target velocity SL (0.01 ≤ SL ≤ 1.0). Because we take 100 different target velocities at the same interval 0.01, we obtain 60 × 100 pairs of parameters. We evaluated the success rate of tracking a moving target along the different trajectories; two examples are shown in Fig. 8(a) and (b). Comparing Fig. 8(a) and (b), tracking a target on the circular trajectory performs better than on the linear trajectory, although for some linear trajectories quite excellent performance was observed. On the other hand, the success rate depends strongly on the connectivity r and the target velocity SL even when the same target trajectory is used. In order to observe the performance more clearly, we have taken the data

Fig. 8. Success rate of tracking a moving target along (a) a circular trajectory and (b) a linear trajectory. The positive orientation obeys the right-hand rule. The vertical axis represents the success rate, and the two axes in the horizontal plane represent the connectivity r and the target velocity SL, respectively.
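The grid evaluation described above can be sketched as follows. The function `run_trial` is a placeholder for one full tracking simulation (600 steps plus the capture test); its name and signature are our assumptions:

```python
# Hedged sketch of the (r, S_L) parameter-grid evaluation; run_trial stands
# in for one tracking simulation and is supplied by the caller.
def success_rate(run_trial, r, step_len, n_trials=100):
    """Fraction of trials (different random initial states) in which the
    tracker captures the target within the tolerance."""
    wins = sum(1 for k in range(n_trials) if run_trial(r, step_len, seed=k))
    return wins / n_trials

# The surface of Fig. 8 scans r = 1..60 and S_L = 0.01..1.00 (60 x 100 pairs):
grid = [(r, sl / 100.0) for r in range(1, 61) for sl in range(1, 101)]
```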

(a) r = 16: downward tendency; (b) r = 51: upward tendency

Fig. 9. Success rates drawn from Fig. 8(a): the data for certain connectivities are shown in a two-dimensional diagram. The horizontal axis represents the target velocity from 0.01 to 1.0, and the vertical axis represents the success rate.

for certain connectivities from Fig. 8(a) and plotted them in two-dimensional coordinates, as shown in Fig. 9. Comparing these figures, we see a novel behaviour: as the target velocity increases, the success rate shows an upward tendency, for instance at r = 51. In other words, when the chaotic dynamics is not too strong, it appears useful for tracking a faster target.

7 Discussion

In order to clarify the relation between the above cases and the chaotic dynamics, we have investigated the structure of the chaotic dynamics from a dynamical viewpoint. For each connectivity r from 1 to 60, the network performs chaotic wandering for a long time starting from a random initial state pattern. During this history, we have taken statistics of the time spent continuously in a certain basin [8] and evaluated the distribution p(l, μ), defined by

p(l, μ) = #{ l | S(t) ∈ β_μ for τ ≤ t ≤ τ + l, S(τ − 1) ∉ β_μ and S(τ + l + 1) ∉ β_μ },  μ ∈ [1, L],   (4)

β_μ = ⋃_{λ=1}^{K} B_{λμ},   (5)

T = Σ_l l p(l, μ),   (6)

where l is the length of a period of consecutive time steps spent in one attractor basin, and p(l, μ) is the distribution of stays of length l in the attractor basin μ within T steps. In our actual simulations, T = 10^5. The distributions p(l, μ) for the different connectivities r = 15 and r = 50 are shown in Fig. 10(a) and Fig. 10(b); in these figures, different basins are marked with different symbols. From the results we see that the continuous staying time l becomes longer and longer as the connectivity r increases. With the novel behaviour discussed in the previous section in mind, let us consider the reason.
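Given a recorded sequence of basin labels visited during chaotic wandering, the run-length statistics of eq. (4) can be gathered with a short sketch (the function name and the list-based representation are our assumptions):

```python
# Minimal sketch of the staying-time statistics: count how often the state
# stays in basin mu for exactly l consecutive steps.
from collections import Counter
from itertools import groupby

def staying_time_distribution(basin_seq):
    """Return p[(l, mu)] = number of maximal runs of label mu with length l."""
    p = Counter()
    for mu, run in groupby(basin_seq):   # groupby yields maximal runs
        p[(len(list(run)), mu)] += 1
    return p
```

A histogram of `p` over `l` for each basin `mu`, on a log scale, corresponds to the plots in Fig. 10.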

(a) r = 15: shorter; (b) r = 50: longer

Fig. 10. Log plot of the frequency distribution of the continuous staying time l: the horizontal axis represents the number of consecutive time steps l spent in a certain basin μ during long-time chaotic wandering, and the vertical axis represents the accumulated count p(l, μ) of stays of length l in basin μ. The continuous staying time l becomes longer as the connectivity r increases.

First, for slower target velocities, a success rate that decreases with increasing connectivity r is observed for both the circular target trajectory and the linear ones. This shows that chaotic dynamics that stays localized in a certain basin for too long is not good for tracking a slow target. Second, for faster target velocities, chaotic dynamics that is not too strong seems useful. Computer simulations show that, when the target moves quickly, the action of the tracker must remain largely chaotic in order to track it. From past experiments we know that the motion increments of chaotic motion are very short; therefore, short motion increments combined with a fast target result in poor tracking performance. However, when the continuous staying time l in a certain basin becomes longer, the tracker can move in a fixed direction for l steps, which helps it track the faster target. Therefore, when the connectivity becomes somewhat large (r = 50 or so), the success rate rises with increasing target velocity, as in the case shown in Fig. 9. As an issue for future study, the functional role of chaotic dynamics remains context dependent.

8 Summary

We proposed a simple method for tracking a moving target using chaotic dynamics in a recurrent neural network model. Although chaotic dynamics cannot always solve every complex problem with better performance, better results were often observed when using chaotic dynamics to solve certain ill-posed problems, such as tracking a moving target and solving mazes [8]. From the results of the computer simulations, we can state the following points.

• A simple method for tracking a moving target was proposed.
• Chaotic dynamics is quite efficient for tracking a target that moves along a circular trajectory.


• The performance of tracking a target on a linear trajectory is not as good as on a circular trajectory; however, for some linear trajectories excellent performance was observed.
• The length of the continuous staying time increases with the synaptic connectivity r that drives the chaotic dynamics in the network.
• A longer continuous staying time in a certain basin seems useful for tracking a faster target.

References

1. Babloyantz, A., Destexhe, A.: Low-dimensional chaos in an instance of epilepsy. Proc. Natl. Acad. Sci. USA 83, 3513–3517 (1986)
2. Skarda, C.A., Freeman, W.J.: How brains make chaos in order to make sense of the world. Behav. Brain Sci. 10, 161–195 (1987)
3. Nara, S., Davis, P.: Chaotic wandering and search in a cycle memory neural network. Prog. Theor. Phys. 88, 845–855 (1992)
4. Nara, S., Davis, P., Kawachi, M., Totuji, H.: Memory search using complex dynamics in a recurrent neural network model. Neural Networks 6, 963–973 (1993)
5. Nara, S., Davis, P., Kawachi, M., Totuji, H.: Chaotic memory dynamics in a recurrent neural network with cycle memories embedded by pseudo-inverse method. Int. J. Bifurcation and Chaos Appl. Sci. Eng. 5, 1205–1212 (1995)
6. Nara, S., Davis, P.: Learning feature constraints in a chaotic neural memory. Phys. Rev. E 55, 826–830 (1997)
7. Nara, S.: Can potentially useful dynamics to solve complex problems emerge from constrained chaos and/or chaotic itinerancy? Chaos 13(3), 1110–1121 (2003)
8. Suemitsu, Y., Nara, S.: A solution for two-dimensional mazes with use of chaotic dynamics in a recurrent neural network model. Neural Comput. 16(9), 1943–1957 (2004)
9. Tsuda, I.: Chaotic itinerancy as a dynamical basis of Hermeneutics in brain and mind. World Futures 32, 167–184 (1991)
10. Tsuda, I.: Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behav. Brain Sci. 24(5), 793–847 (2001)
11. Kaneko, K., Tsuda, I.: Chaotic Itinerancy. Chaos 13(3), 926–936 (2003)
12. Aihara, K., Takabe, T., Toyoda, M.: Chaotic Neural Networks. Phys. Lett. A 144, 333–340 (1990)

A Generalised Entropy Based Associative Model

Masahiro Nakagawa

Nagaoka University of Technology, Kamitomioka 1603-1, Nagaoka, Niigata 940-2188, Japan [email protected]

Abstract. In this paper, a generalised entropy based associative memory model is proposed and applied to memory retrieval with analogue embedded vectors, instead of binary ones, in order to compare it with the conventional autoassociative model based on a quadratic Lyapunov functional. In the present approach, the updating dynamics is constructed on the basis of an entropy minimization strategy, which reduces asymptotically to the autocorrelation dynamics as a special case. From numerical results, it is found that the proposed approach realizes a larger memory capacity, even for analogue memory retrieval, than the autocorrelation model underlying dynamics such as the associatron, owing to the higher-order correlations involved in the proposed dynamics. Keywords: Entropy, Associative Memory, Analogue Memory Retrieval.

1 Introduction

During the past quarter century, numerous autoassociative models have been extensively investigated on the basis of the autocorrelation dynamics. Since the proposals of the retrieval models by Anderson [1], Kohonen [2], and Nakano [3], works related to such an autoassociation model of neurons inter-connected through an autocorrelation matrix were theoretically analyzed by Amari [4], Amit et al. [5] and Gardner [6]. So far it has been well appreciated that the storage capacity of the autocorrelation model, i.e. the number of stored pattern vectors L that can be completely associated relative to the number of neurons N, which is called the relative storage capacity or loading rate and denoted α_c = L/N, is estimated as α_c ≈ 0.14 at most for the autocorrelation learning model with the signum activation function (sgn(x) for short) [7,8]. In contrast to the abovementioned models with monotonous activation functions, neuro-dynamics with a nonmonotonous mapping was more recently proposed by Morita [9], Yanai and Amari [10], and Shiino and Fukai [11]. They reported that a nonmonotonous mapping in the neurodynamics possesses a remarkable advantage in storage capacity, α_c ≈ 0.27, superior to the conventional association models with monotonous mappings, e.g. the signum or sigmoidal function. In the present paper, we shall propose a novel approach based on an entropy defined in terms of the overlaps, i.e. the inner products between the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 189–198, 2008. © Springer-Verlag Berlin Heidelberg 2008

190

M. Nakagawa

state vector and the analogue embedded vectors instead of the previously investigated binary ones [1-16,25].

2 Theory

Let us consider an associative model with embedded analogue vectors e_i^{(r)} (1 ≤ i ≤ N, 1 ≤ r ≤ L), where N and L are the number of neurons and the number of embedded vectors, respectively. The states of the neural network are characterized by the output vector s_i (1 ≤ i ≤ N) and the internal states u_i (1 ≤ i ≤ N), which are related by

s_i = f(u_i)  (1 ≤ i ≤ N),   (1)

where f(·) is the activation function of the neuron. Then we introduce the following entropy I, related to the overlaps:

I = −(1/2) Σ_{r=1}^{L} (m^{(r)})² log (m^{(r)})²,   (2)

where the overlaps m^{(r)} (r = 1, 2, ..., L) are defined by

m^{(r)} = Σ_{i=1}^{N} e_i^{†(r)} s_i;   (3)

here the covariant vectors e_i^{†(r)} are defined in terms of the following orthogonality relations:

Σ_{i=1}^{N} e_i^{†(r)} e_i^{(s)} = δ_{rs}  (1 ≤ r, s ≤ L),   (4a)

e_i^{†(r)} = Σ_{r'=1}^{L} (a^{−1})_{rr'} e_i^{(r')},   (4b)

a_{rr'} = Σ_{i=1}^{N} e_i^{(r)} e_i^{(r')}.   (4c)

The entropy defined by eq. (2) is minimized under the conditions

m^{(r)} = δ_{rs}  (1 ≤ r, s ≤ L),   (5a)

Σ_{r=1}^{L} (m^{(r)})² = 1.   (5b)
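As an illustration, the covariant vectors, the overlaps and the entropy can be computed as in the following sketch (NumPy; the function name, the matrix layout and the small `eps` regularizer are our assumptions, not the paper's code):

```python
# Sketch of the entropy and overlaps: covariant vectors via the inverse
# Gram matrix, then the entropy functional over the squared overlaps.
import numpy as np

def entropy_and_overlaps(E, s, eps=1e-12):
    """E: (L, N) matrix of embedded vectors; s: (N,) state vector.
    Returns (I, m) with the entropy I and the overlap vector m."""
    a = E @ E.T                    # Gram matrix a_{rr'}
    E_cov = np.linalg.inv(a) @ E   # covariant vectors e^{+(r)}
    m = E_cov @ s                  # overlaps m^{(r)}
    m2 = m ** 2
    I = -0.5 * np.sum(m2 * np.log(m2 + eps))   # entropy functional
    return I, m
```

When the state coincides with one embedded vector, `m` becomes a unit vector and `I` vanishes, matching the minimization conditions.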

That is, regarding (m^{(r)})² (1 ≤ r ≤ L) as a probability distribution in eq. (2), a target pattern may be retrieved by minimizing the entropy I with respect to m^{(r)}, or equivalently the state vector s_i, so that eqs. (5a) and (5b) are satisfied. Therefore the entropy may be considered as the functional to be minimized during the retrieval process of the auto-association model, in place of the conventional quadratic energy functional E, i.e.

E = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} w_{ij} s_i^† s_j,   (6a)

where s_i^† is the covariant state vector defined by

s_i^† = Σ_{r=1}^{L} Σ_{j=1}^{N} e_i^{†(r)} e_j^{†(r)} s_j,   (6b)

and the connection matrix w_{ij} is defined in terms of

w_{ij} = Σ_{r=1}^{L} e_i^{(r)} e_j^{†(r)}.   (6c)

According to the steepest descent approach in the discrete time model, the updating rule of the internal states u_i (1 ≤ i ≤ N) may be defined by

u_i(t+1) = −η ∂I/∂s_i^†  (1 ≤ i ≤ N),   (7)

where η (> 0) is a coefficient. Substituting eqs. (2) and (3) into eq. (7), and noting the following relation obtained with the aid of eq. (6b),

m^{(r)} = Σ_{i=1}^{N} e_i^{†(r)} s_i = Σ_{i=1}^{N} e_i^{(r)} s_i^†,   (8)

one may readily derive the following relation:

u_i(t+1) = −η ∂I/∂s_i^† = (η/2) ∂/∂s_i^† [ Σ_{r=1}^{L} (m^{(r)})² log (m^{(r)})² ]
         = η Σ_{r=1}^{L} e_i^{(r)} [ Σ_{j=1}^{N} e_j^{†(r)} s_j(t) ] { 1 + log [ Σ_{j=1}^{N} e_j^{†(r)} s_j(t) ]² }
         = η Σ_{r=1}^{L} e_i^{(r)} m^{(r)}(t) [ 1 + log m^{(r)}(t)² ].   (9)


Generalizing the above dynamics somewhat, in order to combine the quadratic approach (α → 0) and the present entropy one (α → 1), we propose the following dynamic rule for the internal states, in a somewhat ad-hoc manner:

u_i(t+1) = η Σ_{r=1}^{L} e_i^{(r)} [ Σ_{j=1}^{N} e_j^{†(r)} s_j(t) ] { 1 + log( 1 − α + α [ Σ_{j=1}^{N} e_j^{†(r)} s_j(t) ]² ) }
         = η Σ_{r=1}^{L} e_i^{(r)} m^{(r)}(t) [ 1 + log( 1 − α + α m^{(r)}(t)² ) ].   (10)

In fact, in the limit α → 0, the above dynamics reduces to the autocorrelation dynamics:

u_i(t+1) = lim_{α→0} η Σ_{r=1}^{L} e_i^{(r)} m^{(r)}(t) [ 1 + log( 1 − α + α m^{(r)}(t)² ) ]
         = η Σ_{r=1}^{L} e_i^{(r)} Σ_{j=1}^{N} e_j^{†(r)} s_j(t) = η Σ_{j=1}^{N} w_{ij} s_j(t).   (11)

On the other hand, eq. (10) reduces to eq. (9) in the case α → 1. Therefore one may control the dynamics between the autocorrelation approach (α → 0) and the entropy based approach (α → 1).
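The generalised update rule can be sketched as follows (NumPy; variable names and the small `eps` inside the logarithm are our additions, not the paper's code):

```python
# Hedged sketch of the generalised internal-state update: alpha = 0 gives
# the autocorrelation (associatron) dynamics, alpha = 1 the entropy one.
import numpy as np

def update_internal_state(E, E_cov, s, eta=1.0, alpha=1.0, eps=1e-12):
    """E: (L, N) embedded vectors; E_cov: (L, N) covariant vectors;
    s: (N,) state. Returns the new internal states u(t+1)."""
    m = E_cov @ s                                       # overlaps m^{(r)}(t)
    gain = m * (1.0 + np.log(1.0 - alpha + alpha * m**2 + eps))
    return eta * (E.T @ gain)
```

For `alpha=0` the logarithm vanishes and the update collapses to the linear autocorrelation form, matching the limit in the text.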

3 Numerical Results

The embedded vectors are set to analogue random vectors as follows:

e_i^{(r)} = z_i^{(r)}  (1 ≤ i ≤ N, 1 ≤ r ≤ L),   (12)

where z_i^{(r)} (1 ≤ i ≤ N, 1 ≤ r ≤ L) are zero-mean pseudo-random numbers between −1 and +1. For simplicity, the activation function f in eq. (1) is assumed to be a piecewise linear function, instead of the signum form used previously for binary embedded vectors [25], and is set to

s_i = f(u_i) = [1 + sgn(1 − |u_i|)]/2 · u_i + sgn(u_i) · [1 − sgn(1 − |u_i|)]/2,   (13)

where sgn(·) denotes the signum function, defined by

sgn(x) = +1 (x ≥ 0),  −1 (x < 0).   (14)
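The piecewise linear activation above is the identity inside [−1, 1] and saturates to sgn(u) outside it, so it can be written compactly (a sketch in our notation):

```python
# The piecewise linear activation: s = u for |u| <= 1, s = sgn(u) otherwise,
# which is exactly a clip to the interval [-1, 1].
import numpy as np

def activation(u):
    return np.clip(u, -1.0, 1.0)
```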


The initial vector s_i(0) (1 ≤ i ≤ N) is set to

s_i(0) = −e_i^{(s)}  (1 ≤ i ≤ H_d),  +e_i^{(s)}  (H_d + 1 ≤ i ≤ N),   (15)

where e_i^{(s)} is the target pattern to be retrieved and H_d is the Hamming distance between the initial vector s_i(0) and the target vector e_i^{(s)}. The retrieval is successful if

m^{(s)}(t) = Σ_{i=1}^{N} e_i^{†(s)} s_i(t)   (16)

results in ±1 for t ≫ 1, in which case the system is in a steady state such that

s_i(t+1) = s_i(t),   (17a)

u_i(t+1) = u_i(t).   (17b)

To see the retrieval ability of the present model, the success rate S_r is defined as the rate of success over 1000 trials with different embedded vector sets e_i^{(r)} (1 ≤ i ≤ N, 1 ≤ r ≤ L). To move from the autocorrelation dynamics in the initial stage (t ~ 1) to the entropy based dynamics in the final stage (t ~ T_max), the parameter α in eq. (10) is simply controlled by

α = α_max t / T_max  (0 ≤ t ≤ T_max),   (18)

where T_max and α_max are the maximum number of updating iterations according to eq. (10) and the maximum value of α, respectively. Choosing N = 200, η = 1, T_max = 25, L/N = 0.5 and α_max = 1, we first present an example of the dynamics of the overlaps in Figs. 1(a) and (b) (entropy based approach). Therein, the cross symbols (×) and the open circles (o) represent, for one retrieval process, the success of retrieval, in which eqs. (5a) and (5b) are satisfied, and the entropy defined by eq. (2), respectively. In addition, the time dependence of the parameter α/α_max defined by eq. (18) is depicted as dots (·). In Fig. 1, after a transient state, it is confirmed that the complete association corresponding to eqs. (5a) and (5b) is achieved. Next, the dependence of the success rate S_r on the loading rate α = L/N is depicted in Figs. 2(a) and (b) for H_d/N = 0.3 and N = 100, for the entropy approach and the associatron, respectively. From these results, one may confirm the larger memory capacity of the presently proposed model defined by eq. (10) in
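The pieces above (flipped initial state, α schedule, update rule, clipping activation, overlap test) can be tied together in a minimal retrieval loop. This is our sketch, not the paper's program; parameter defaults mirror the text:

```python
# Hedged sketch of one retrieval run: initialize at Hamming distance Hd
# from the target, ramp alpha linearly, iterate the update, report the
# final overlap with the target.
import numpy as np

def retrieve(E, E_cov, target_idx, Hd, T_max=25, eta=1.0, alpha_max=1.0):
    target = E[target_idx]
    s = target.copy()
    s[:Hd] *= -1.0                       # flip the first Hd components
    for t in range(T_max):
        alpha = alpha_max * t / T_max    # linear schedule from 0 to alpha_max
        m = E_cov @ s                    # overlaps
        u = eta * (E.T @ (m * (1 + np.log(1 - alpha + alpha * m**2 + 1e-12))))
        s = np.clip(u, -1.0, 1.0)        # piecewise linear activation
    return E_cov[target_idx] @ s         # final overlap with the target
```

A successful trial is one where the returned overlap is close to ±1.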

(a) H_d/N = 0.1; (b) H_d/N = 0.3

Fig. 1. The time dependence of the overlaps m^{(r)} of the present entropy based model defined by eq. (10)

(a) Entropy based model defined by eq. (10) (memory capacity 0.9999); (b) conventional associatron model defined by eq. (11) (memory capacity 0.0134)

Fig. 2. The dependence of the success rate on the loading rate α = L/N for the present entropy based model, eq. (10), and the associatron, eq. (11). Here the Hamming distance is set to H_d/N = 0.3.


comparison with the conventional autoassociation model defined by eq. (11). In practice, it is found that the present approach may achieve a high memory capacity beyond the conventional autocorrelation strategy, even for the analogue embedded vectors, as in the previously considered binary case [15,16,25].

4 Concluding Remarks

In the present paper, we have proposed an entropy based association model in place of the conventional autocorrelation dynamics. From numerical results, it was found that a large memory capacity can be achieved on the basis of the entropy approach. This advantage in association performance is considered to result from the fact that the present dynamics for updating the internal state, eq. (10), ensures that the entropy, eq. (2), is minimized under the conditions eqs. (5a) and (5b), which correspond to the successful retrieval of a target pattern. In other words, the higher-order correlations in the proposed dynamics, eq. (10), which are ignored in the conventional approaches [1-11], were found to play an important role in improving the memory capacity, i.e. the retrieval ability. To conclude this work, we show in Fig. 3 the dependence of the storage capacity, defined as the area under the success rate curves such as those in Fig. 2, on the Hamming distance, for the analogue embedded vectors (Ana) as well as the previous binary ones (Bin). Here OL and CL denote the orthogonal learning model and the autocorrelation learning model, respectively. One may again see the great advantage of the present model, based on the entropy functional to be minimized, over the conventional quadratic form [12,13], even for the analogue embedded vectors. In fact, one may realize a considerably larger storage capacity in the present model in comparison with the associatron over the range of H_d/N up to 0.5. The memory retrievals for the associatron based on the quadratic

Fig. 3. The dependence of the storage capacity on the Hamming distance. The symbols a, m and n are for the entropy based approach with eq. (10), with the orthogonal learning (OL) and the autocorrelation learning (CL) [16,17]; Ana and Bin denote the analogue and binary embedded vectors, respectively (a: OL, Ana; m: OL, Bin; n: CL, Bin). In addition, the associatron with orthogonal learning [13] is shown with symbol s, and the associatron with orthogonal learning under the condition w_ii = 0 [12] with symbol t.


Lyapunov functional to be minimized become troublesome near H_d/N = 0.5, as seen in Fig. 3, since the directional cosine between the initial vector and the target pattern eventually vanishes there. Remarkably, even in such a case, the present model attains a remarkably large memory capacity because of the higher-order correlations involved in eq. (10), as expected from Figs. 1 and 2, for the analogue vectors as well as the binary ones previously investigated [15,16,25]. As a future problem, it seems worthwhile to introduce chaotic dynamics into the present model by means of a periodic activation function, such as a sinusoidal one, as a nonmonotonic activation function [14]. The entropy based approach [15] with chaotic dynamics [14] is now in progress and will be reported elsewhere, together with the synergetic models [17-24], in the near future.

References

1. Anderson, J.A.: A Simple Neural Network Generating Interactive Memory. Mathematical Biosciences 14, 197–220 (1972)
2. Kohonen, T.: Correlation Matrix Memories. IEEE Transactions on Computers C-21, 353–359 (1972)
3. Nakano, K.: Associatron - a Model of Associative Memory. IEEE Trans. SMC-2, 381–388 (1972)
4. Amari, S.: Neural Theory of Association and Concept Formation. Biological Cybernetics 26, 175–185 (1977)
5. Amit, D.J., Gutfreund, H., Sompolinsky, H.: Storing Infinite Numbers of Patterns in a Spin-glass Model of Neural Networks. Physical Review Letters 55, 1530–1533 (1985)
6. Gardner, E.: Structure of Metastable States in the Hopfield Model. Journal of Physics A 19, L1047–L1052 (1986)
7. Kohonen, T., Ruohonen, M.M.: Representation of Associated Pairs by Matrix Operators. IEEE Transactions on Computers C-22, 701–702 (1973)
8. Amari, S., Maginu, K.: Statistical Neurodynamics of Associative Memory. Neural Networks 1, 63–73 (1988)
9. Morita, M.: Associative Memory with Nonmonotone Dynamics. Neural Networks 6, 115–126 (1993)
10. Yanai, H.-F., Amari, S.: Auto-associative Memory with Two-stage Dynamics of Nonmonotonic Neurons. IEEE Transactions on Neural Networks 7, 803–815 (1996)
11. Shiino, M., Fukai, T.: Self-consistent Signal-to-noise Analysis of the Statistical Behaviour of Analogue Neural Networks and Enhancement of the Storage Capacity. Phys. Rev. E 48, 867 (1993)
12. Kanter, I., Sompolinsky, H.: Associative Recall of Memory without Errors. Phys. Rev. A 35, 380–392 (1987)
13. Personnaz, L., Guyon, I., Dreyfus, G.: Information Storage and Retrieval in Spin-Glass like Neural Networks. J. Phys. (Paris) Lett. 46, L-359 (1985)
14. Nakagawa, M.: Chaos and Fractals in Engineering, p. 944. World Scientific, Singapore (1999)
15. Nakagawa, M.: Autoassociation Model based on Entropy Functionals. In: Proc. of NOLTA 2006, pp. 627–630 (2006)
16. Nakagawa, M.: Entropy based Associative Model. IEICE Trans. Fundamentals E89-A(4), 895–901 (2006)


17. Fuchs, A., Haken, H.: Pattern Recognition and Associative Memory as Dynamical Processes in a Synergetic System I. Biological Cybernetics 60, 17–22 (1988)
18. Fuchs, A., Haken, H.: Pattern Recognition and Associative Memory as Dynamical Processes in a Synergetic System II. Biological Cybernetics 60, 107–109 (1988)
19. Fuchs, A., Haken, H.: Dynamic Patterns in Complex Systems. In: Kelso, J.A.S., Mandell, A.J., Shlesinger, M.F. (eds.), World Scientific, Singapore (1988)
20. Haken, H.: Synergetic Computers and Cognition. Springer, Heidelberg (1991)
21. Nakagawa, M.: A Study of Association Model based on Synergetics. In: Proceedings of International Joint Conference on Neural Networks 1993, Nagoya, Japan, pp. 2367–2370 (1993)
22. Nakagawa, M.: A Synergetic Neural Network. IEICE Fundamentals E78-A, 412–423 (1995)
23. Nakagawa, M.: A Synergetic Neural Network with Crosscorrelation Dynamics. IEICE Fundamentals E80-A, 881–893 (1997)
24. Nakagawa, M.: A Circularly Connected Synergetic Neural Networks. IEICE Fundamentals E83-A, 881–893 (2000)
25. Nakagawa, M.: Entropy based Associative Model. In: Proceedings of ICONIP 2006, pp. 397–406. Springer, Heidelberg (2006)

The Detection of an Approaching Sound Source Using Pulsed Neural Network

Kaname Iwasa1, Takeshi Fujisumi1, Mauricio Kugler1, Susumu Kuroyanagi1, Akira Iwata1, Mikio Danno2, and Masahiro Miyaji3

1 Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, 466-8555, Japan [email protected]
2 Toyota InfoTechnology Center, Co., Ltd, 6-6-20 Akasaka, Minato-ku, Tokyo, 107-0052, Japan
3 Toyota Motor Corporation, 1 Toyota-cho, Toyota, Aichi, 471-8572, Japan

Abstract. Current automobiles' safety systems based on video cameras and movement sensors fail when objects are out of the line of sight. This paper proposes a system based on pulsed neural networks able to detect whether a sound source is approaching a microphone or moving away from it. The system, based on PN models, compares the sound level difference between consecutive instants of time in order to determine the source's relative movement. Moreover, the combined level difference information of all frequency channels permits identifying the type of the sound source. Experimental results show that, for three different vehicle sounds, the relative movement and the sound source type could be successfully identified.

1 Introduction

Driving safety is one of the major concerns of the automotive industry nowadays. Video cameras and movement sensors are used in order to improve the driver's perception of the environment surrounding the automobile [1][2]. These methods present good performance when detecting objects (e.g., cars, bicycles, and people) which are in the line of sight of the sensor, but fail in case of obstruction or dead angles. Moreover, the use of multiple cameras or sensors for handling dead angles increases the size and cost of the safety system. The human being, in contrast, is able to perceive people and vehicles nearby from the information provided by the auditory system [3]. If this ability could be reproduced by artificial devices, complementary safety systems for automobiles would emerge. Because of diffraction, sound waves can contour objects and be detected even when the source is not in direct line of sight. A possible approach for processing temporal data is the use of Pulsed Neuron (PN) models [4]. This type of neuron deals with input signals in the form of pulse trains, using an internal membrane potential as a reference for generating pulses on its output. PN models can directly deal with temporal data and can be efficiently implemented in hardware, due to their simple structure. Furthermore,

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 199–208, 2008. © Springer-Verlag Berlin Heidelberg 2008

200

K. Iwasa et al.

high processing speeds can be achieved, as PN model based methods are usually highly parallelizable. A sound localization system based on pulsed neural networks has already been proposed in [5], and a sound source identification system, with a corresponding implementation on FPGA, was introduced in [6]. This paper focuses specifically on the relative moving direction of a sound emitting object, and proposes a method to detect whether a sound source is approaching or moving away from a microphone. The system, based on PN models, compares the sound level difference between consecutive instants of time in order to determine the source's relative movement. Moreover, the proposed method also identifies the type of the sound source by using a PN model based competitive learning pulsed neural network to process the spectral information.

2 Pulsed Neuron Model

When processing time series data (e.g., sound), it is important to consider the time relation and to have computationally inexpensive calculation procedures to enable real-time processing. For these reasons, a PN model is used in this research.

Fig. 1. Pulsed neuron model

Figure 1 shows the structure of the PN model. When an input pulse $IN_k(t)$ reaches the $k$-th synapse, the local membrane potential $p_k(t)$ is increased by the value of the weight $w_k$. The local membrane potentials decay exponentially over time with a time constant $\tau_k$. The neuron's output $o(t)$ is given by

$$o(t) = H(I(t) - \theta) \qquad (1)$$

$$I(t) = \sum_{k=1}^{n} p_k(t) \qquad (2)$$

$$p_k(t) = w_k\,IN_k(t) + p_k(t-1)\,e^{-\frac{t}{\tau_k}} \qquad (3)$$

The Detection of an Approaching Sound Source

where $n$ is the total number of inputs, $I(t)$ is the inner potential, $\theta$ is the threshold, and $H(\cdot)$ is the unit step function. The PN model also has a refractory period $t_{ndti}$, during which the neuron is unable to fire, regardless of the membrane potential.
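A minimal Python sketch may make Eqs. (1)-(3) concrete. Note that the weights, time constant, threshold, and refractory length below are illustrative values chosen for the sketch, not parameters from the paper:

```python
import math

class PulsedNeuron:
    """Minimal pulsed neuron, Eqs. (1)-(3): weighted input pulses raise
    local membrane potentials, which decay exponentially; the neuron
    fires when the summed inner potential crosses the threshold, then
    stays silent for a refractory period."""

    def __init__(self, weights, tau=10.0, theta=1.0, refractory=3):
        self.w = list(weights)          # synaptic weights w_k
        self.p = [0.0] * len(weights)   # local membrane potentials p_k(t)
        self.tau = tau                  # decay time constant (shared here)
        self.theta = theta              # firing threshold
        self.refractory = refractory    # refractory period, in time steps
        self._silent = 0

    def step(self, pulses):
        """Advance one time step; pulses holds IN_k(t) in {0, 1}."""
        decay = math.exp(-1.0 / self.tau)
        for k, in_k in enumerate(pulses):
            self.p[k] = self.w[k] * in_k + self.p[k] * decay  # Eq. (3)
        inner = sum(self.p)                                   # Eq. (2)
        if self._silent > 0:                                  # refractory
            self._silent -= 1
            return 0
        if inner >= self.theta:                               # Eq. (1)
            self._silent = self.refractory
            return 1
        return 0

neuron = PulsedNeuron(weights=[0.3, 0.3])
out = [neuron.step([1, 1]) for _ in range(20)]  # constant input pulse train
```

With a constant input train the inner potential charges up over a few steps, the neuron fires, and the refractory period then spaces the output pulses, which is the simple structure that makes the model cheap to implement in hardware.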

3 The Proposed System

The basic structure of the proposed system is shown in Fig. 2. The system consists of three main blocks, the frequency-pulse converter, the level difference extractor, and the sound source classifier, of which the last two are based on PN models. The relative movement (approaching or moving away) of the sound source is determined from the sound level variation. The system compares a signal level $x(t)$ from a microphone with the level at a previous time, $x(t-\Delta t)$. If $x(t) > x(t-\Delta t)$, the sound source is getting closer to the microphone; if $x(t) < x(t-\Delta t)$, it is moving away. After the level difference has been extracted, the outputs of the level difference extractors contain the spectral pattern of the input sound, which is then used for recognizing the type of the source.

3.1 Filtering and Frequency-Pulse Converter

Initially, the input signal must be pre-processed and converted to a train of pulses. A bank of 4th-order band-pass filters decomposes the signal into 13 frequency channels equally spaced on a logarithmic scale from 500 Hz to 2 kHz. Each frequency channel is modified by the non-linear function shown in Eq. (4), and the resulting signal's envelope is extracted by a 400 Hz low-pass filter. Finally, each output signal is independently converted to a pulse train whose rate is proportional to the amplitude of the signal.

Fig. 2. The structure of the recognition system

$$F(t) = \begin{cases} x(t)^{\frac{1}{3}} & x(t) \ge 0 \\ \frac{1}{4}\,x(t)^{\frac{1}{3}} & x(t) < 0 \end{cases} \qquad (4)$$
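This front end can be sketched in NumPy as follows. This is a rough sketch only: the sampling rate, the first-order smoothing filters standing in for the paper's 4th-order band-pass filters, and the integrate-and-fire pulse generation are all simplifying assumptions; only the channel count, the frequency range, the cube-root nonlinearity of Eq. (4), and the 400 Hz envelope filter come from the text:

```python
import numpy as np

FS = 16000                                   # sampling rate (assumed)
CENTERS = np.geomspace(500.0, 2000.0, 13)    # 13 log-spaced channels

def smooth(x, fc):
    """First-order low-pass (exponential smoothing) with cut-off fc."""
    a = 1.0 - np.exp(-2.0 * np.pi * fc / FS)
    y = np.empty_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc += a * (v - acc)
        y[i] = acc
    return y

def frequency_pulse_converter(x):
    """Return an (n_channels, len(x)) array of 0/1 pulse trains whose
    rate follows the compressed envelope of each band."""
    out = np.zeros((len(CENTERS), len(x)))
    for c, fc in enumerate(CENTERS):
        # Crude band-pass as a difference of two low-passes (assumption)
        band = smooth(x, 1.3 * fc) - smooth(x, 0.77 * fc)
        # Non-linear compression, Eq. (4): cube root, negative half * 1/4
        f = np.where(band >= 0, np.cbrt(band), 0.25 * np.cbrt(band))
        # Envelope extraction with a 400 Hz low-pass filter
        env = np.clip(smooth(f, 400.0), 0.0, None)
        # Integrate-and-fire pulse generation: rate follows amplitude
        acc = np.cumsum(env * 0.05)
        out[c] = np.diff(np.floor(acc), prepend=0.0) > 0
    return out

tone = np.sin(2 * np.pi * 1000.0 * np.arange(FS // 10) / FS)  # 0.1 s, 1 kHz
pulses = frequency_pulse_converter(tone)
```

Because the asymmetric compression of Eq. (4) leaves a positive DC component, the envelope of an active channel is non-zero and its pulse rate grows with the input amplitude, which is exactly the property the level difference extractor relies on.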

3.2 Level Difference Extractor

Each pulse train generated by the frequency-pulse converter is fed independently into a Level Difference Extractor (LDE). The LDE, shown in Fig. 3, is composed of two parts, the Lateral Superior Olive (LSO) model and the Level Mapping Two (LM2) model [7]. In both the LSO and LM2 models, each neuron operates according to Eq. (3). The LSO is responsible for the level difference extraction itself, while the LM2 extracts the envelope of the complex firing pattern. The pulse train corresponding to each frequency channel is input to an LSO model. The inner potential of the $i$-th LSO neuron of the $f$-th channel, $I^{LSO}_{i,f}(t)$, is calculated as follows:

$$I^{LSO}_{i,f}(t) = p^{N}_{i,f}(t) + p^{B}_{i,f}(t) \qquad (5)$$

$$p^{N}_{i,f}(t) = w^{N}_{i,f}\,x_f(t) + p^{N}_{i,f}(t-1)\,e^{-\frac{t}{\tau_{LSO}}} \qquad (6)$$

$$p^{B}_{i,f}(t) = w^{B}_{i,f}\,x_f(t-\Delta t) + p^{B}_{i,f}(t-1)\,e^{-\frac{t}{\tau_{LSO}}} \qquad (7)$$

where $\tau_{LSO}$ is the time constant of the LSO neurons, and the weights $w^{N}_{i,f}$ and $w^{B}_{i,f}$ are defined as

$$w^{N}_{i,f} = \begin{cases} 0.0 & i = 0 \\ 1.0 & i > 0 \end{cases} \qquad w^{B}_{i,f} = \begin{cases} 0.0 & i = 0 \\ 1.0 & i > 0 \end{cases}$$

$E_{min} > E_{p\text{-}min}$. That is, when the last HN is added, the minimum training error $E_{min}$ should be larger than the previously achieved minimum training error $E_{p\text{-}min}$.

2.1.2 Stopping Criterion for Backward Elimination (BE)

2.1.2.1 Calculation of Error Reduction Rate (ERR). During the BE process, we need to calculate the ERR as follows:

$$ERR = -\frac{E'_{min} - E_{min}}{E_{min}}$$

Here, $E'_{min}$ is the minimum training error during backward elimination.

2.1.2.2 Stopping Criterion in double feature deletion. During the course of training, the irrelevant features are sequentially deleted two at a time in BE; the SC is adopted as $ERR < th$ and $CA' > CA$. Here, $th$ refers to a threshold value equal to 0.05, and $CA'$ and $CA$ refer to the classification accuracies during BE and before BE, respectively.

2.1.2.3 Stopping Criterion in single feature deletion. During the course of training, the irrelevant features are sequentially deleted one at a time in BE; the SC is adopted as $ERR < th$ and $CA' > CA$. Here, $th$ refers to a threshold value equal to 0.08.

2.2 Measurement of Contribution

Finding the contribution of the input attributes accurately ensures the effectiveness of CFSS. Therefore, during the one-by-one removal training, we measure the minimum

training error, $E^{i}_{min}$, for removing each $i$-th feature, and then calculate the contribution of each feature as follows. The training error difference for removing the $i$-th feature is

$$E^{i}_{diff} = E^{i}_{min} - E_{min}$$

where $E_{min}$ is the minimum training error using all features. We then calculate the average training error difference over feature removals,

$$E_{avg} = \frac{1}{N} \sum_{i=1}^{N} E^{i}_{diff}$$

where $N$ is the number of features. The percentage contribution of the $i$-th feature is then

$$Con_i = 100 \cdot \frac{E_{avg} - E^{i}_{diff}}{E_{avg}}$$

We can now rank the features according to $Con_{max} = \max(Con_i)$. The worst-ranked features are treated as irrelevant to the network, as they provide comparatively little contribution.
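The contribution measure translates directly into a few lines of code. In the sketch below, the error values in the toy call are made up, and the training itself (which produces the error values) is outside the snippet:

```python
def feature_contributions(E_min_all, E_min_removed):
    """Compute percentage contributions Con_i and a worst-first ranking.
    E_min_all: minimum training error with all features present.
    E_min_removed[i]: minimum training error after removing feature i."""
    E_diff = [e - E_min_all for e in E_min_removed]      # E_diff^i
    E_avg = sum(E_diff) / len(E_diff)                    # average difference
    con = [100.0 * (E_avg - d) / E_avg for d in E_diff]  # Con_i
    ranked = sorted(range(len(con)), key=lambda i: con[i])  # worst rank first
    return con, ranked

# Toy numbers: all-feature error 0.10; errors after removing each feature
con, ranked = feature_contributions(0.10, [0.12, 0.30, 0.11])
```

With these numbers, feature 1 has the largest error difference, hence the smallest $Con_i$, and therefore heads the worst-rank list used by the backward elimination phase.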

Feature Subset Selection Using Constructive Neural Nets

2.3 Calculation of Network Connections

The number of connections ($C$) of the final network architecture can be calculated from the number of remaining input attributes ($x$), the number of hidden units ($h$), and the number of outputs ($o$) as $C = (x \times h) + (h \times o) + h + o$.

2.3.1 Reduction of Network Connections (RNC). Here we estimate how many connections have been removed due to feature selection. We first calculate the total number of connections of the achieved network before and after BE, $C_{before}$ and $C_{after}$, respectively. We then estimate

$$RNC = 100 \cdot \frac{C_{before} - C_{after}}{C_{before}}$$

2.3.2 Increment of Network Accuracy (INA). Due to the reduction of network connections, we estimate the INA, which shows how much the network accuracy improves. We measure the classification accuracy before and after BE, $CA_{before}$ and $CA_{after}$, respectively, and then estimate

$$INA = 100 \cdot \frac{CA_{after} - CA_{before}}{CA_{before}}$$
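The connection count and its reduction are simple arithmetic. In the sketch below, the feature and hidden-unit counts are illustrative integers in the range reported for the breast cancer problem, not exact values from the experiments:

```python
def n_connections(x, h, o):
    """C = (x*h) + (h*o) + h + o: input-to-hidden and hidden-to-output
    weights, plus one bias link per hidden and per output unit."""
    return x * h + h * o + h + o

def rnc(c_before, c_after):
    """Reduction of Network Connections, in percent."""
    return 100.0 * (c_before - c_after) / c_before

c_full = n_connections(9, 3, 2)    # 9 inputs, 3 hidden, 2 outputs
c_pruned = n_connections(3, 3, 2)  # 3 inputs left after selection
reduction = rnc(c_full, c_pruned)
```

Dropping six of nine inputs removes almost half of the connections while the hidden layer stays the same size, which is the effect summarized later in Table 3.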

2.4 The Algorithm

CFSS comprises three main steps, summarized in Fig. 1: (a) architecture determination of the NN in constructive fashion (Steps 1-2), (b) measurement of the contribution of each feature (Step 3), and (c) subset generation (Steps 4-8). These steps depend on each other, and the entire process is accomplished one step after another according to particular criteria. The details of each step are as follows.

Step 1) Create a minimal NN architecture. Initially it has three layers: an input layer, an output layer, and a hidden layer with only one neuron. The numbers of input and output neurons are equal to the numbers of inputs and outputs of the problem. Randomly initialize the connection weights from the input layer to the hidden layer and from the hidden layer to the output layer within the range [-1.0, +1.0].

Step 2) Train the network using the BPL algorithm and try to achieve a minimum training error, $E_{min}$. Then add an HN, retrain the network from the beginning, and check the SC according to Section 2.1.1. When it is satisfied, validate the NN on the test set to calculate the classification accuracy, $CA$, and go to Step 3. Otherwise, continue the HN selection process.

Step 3) To find the contribution of the features, we perform one-by-one removal training of the network: successively delete the $i$-th feature and save the individual minimum training error, $E^{i}_{min}$. The rank of all features can then be determined in accordance with Section 2.2. Then go to Step 4.
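The constructive phase of Steps 1-2 amounts to the following growth loop. Here `train` is a stand-in for BPL training of an NN with a given number of hidden neurons (returning its minimum training error), and the synthetic error curve only illustrates the stopping rule of Section 2.1.1:

```python
def determine_architecture(train, max_hidden=10):
    """Start with one hidden neuron and add neurons until the newly
    added neuron no longer reduces the minimum training error
    (E_min > E_p_min, the SC of Section 2.1.1).
    train(h) -> E_min for a network with h hidden neurons."""
    h = 1
    e_prev = train(h)
    while h < max_hidden:
        e_new = train(h + 1)
        if e_new > e_prev:        # SC satisfied: keep the previous size
            return h, e_prev
        h, e_prev = h + 1, e_new
    return h, e_prev

# Synthetic error curve: improves up to 3 hidden neurons, then worsens
errors = {1: 0.30, 2: 0.22, 3: 0.18, 4: 0.21, 5: 0.25}
best_h, best_e = determine_architecture(lambda h: errors[h])
```

The loop stops at three hidden neurons because the fourth no longer improves the minimum training error, mirroring the small HN counts reported in Table 1.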

M.M. Kabir, M. Shahjahan, and K. Murase

Fig. 1. Overall flowchart of CFSS. Here, NN, HN and SC refer to neural network, hidden neuron, and stopping criterion, respectively.

Step 4) This stage is the first step of BE to generate the feature subset. Initially, we attempt to delete the two worst-ranked features at a time to accelerate the process. Calculate the minimum training error, $E'_{min}$, during training. Validate the existing features using the test set and calculate the classification accuracy $CA'$. After that, if the SC mentioned in Section 2.1.2.2 is satisfied, continue. Otherwise, go to Step 5.

Step 5) Perform backtracking. That is, the last deleted pair (or single) of features is restored, together with all its associated components.

Step 6) If the single-deletion process has not yet finished, attempt to delete one feature at a time using Step 4 to filter out the unnecessary ones; otherwise go to Step 7. If the SC mentioned in Section 2.1.2.3 is satisfied, continue. Otherwise, go to Step 5 and then Step 7.


Step 7) Before selecting the final subset, we again check whether any irrelevant features remain among the existing ones. The following steps accomplish this task. i) Delete the existing features from the network one at a time, in order of worst rank, and retrain after each deletion. ii) Validate the remaining features using the test set. For the next stage, save the classification accuracy $CA''$, the deleted feature responsible, and all its components. iii) If $DF < N$, repeat from Step 7-i. iv) Check whether any $CA'' > CA$. v) If better values of $CA''$ are available, identify the highest-ranked feature among them; otherwise, stop. Delete that higher-ranked feature, together with the corresponding worst-ranked ones, from the subset obtained at Step 7-i. After that, restore to the current network the components that were saved at Step 7-ii, and stop.

Step 8) Finally, we obtain the relevant feature subset with a compact network.
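The backward elimination of Steps 4-6 can be condensed into the following skeleton. This is a sketch only: `train_and_score` stands for retraining and validating the NN on a candidate subset, the features are assumed ordered worst rank first, the acceptance test is one reading of the stopping criteria of Sections 2.1.2.2-2.1.2.3, and the final re-check of Step 7 is omitted; the synthetic scorer exists only to exercise the loop:

```python
def cfss_backward_elimination(features, train_and_score,
                              th_double=0.05, th_single=0.08):
    """Backward elimination over a worst-first ranked feature list.
    train_and_score(subset) -> (E_min, CA)."""
    subset = list(features)
    e_min, ca = train_and_score(subset)

    def try_delete(count, th):
        nonlocal subset, e_min, ca
        while len(subset) > count:
            candidate = subset[count:]            # drop `count` worst features
            e_new, ca_new = train_and_score(candidate)
            err = -(e_new - e_min) / e_min        # error reduction rate (ERR)
            if err < th and ca_new > ca:          # SC holds: keep deleting
                subset, e_min, ca = candidate, e_new, ca_new
            else:
                break                             # backtrack: keep previous subset

    try_delete(2, th_double)   # Step 4: delete two features at a time
    try_delete(1, th_single)   # Step 6: then one feature at a time
    return subset, ca

# Synthetic scorer: only features 3 and 4 are informative
def score(sub):
    relevant = len({3, 4} & set(sub))
    noise = len(sub) - relevant
    return (0.1 + 0.3 * (2 - relevant) + 0.002 * noise,
            60 + 15 * relevant - 2 * noise)

best, best_ca = cfss_backward_elimination([0, 1, 2, 3, 4], score)
```

On the toy scorer the double-deletion pass strips two noise features, the single-deletion pass strips the last one, and the attempt to remove an informative feature fails the accuracy test and triggers the backtrack.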

3 Experimental Analysis

This section evaluates CFSS's performance on four well-known benchmark problems: Breast cancer (BCR), Diabetes (DBT), Glass (GLS), and Iris (IRS), obtained from [13]. The numbers of input attributes and output classes are 9 and 2 for BCR, 8 and 2 for DBT, 9 and 6 for GLS, and 4 and 3 for IRS, respectively. All datasets are partitioned into three sets: a training set, a validation set, and a test set. The first 50% is used as the training set, and the second 25% as the validation set, to monitor the training and stop it when the overall training error rises. The last 25% is used as the test set to evaluate the generalization performance of the network; it is not seen by the network during the whole training process. In all experiments, two bias nodes with a fixed input of +1 are connected to the hidden layer and the output layer. The learning rate $\eta$ is set between [0.1, 0.3] and the weights are initialized to random values in [-1.0, +1.0]. Each experiment was carried out 10 times, and the presented results are the averages of these 10 runs. All experiments were performed on a Pentium-IV 3 GHz desktop personal computer.

3.1 Experimental Results

Table 1 shows the average results of CFSS, including the number of selected features and the classification accuracies for the BCR, DBT, GLS, and IRS problems before and after the feature selection process. For clarification, in this table the training error is measured according to Section 2.5 of [12], and the classification accuracy $CA$ is the ratio of correctly classified examples to the total number of examples in the particular data set.

Fig. 2. Contribution of attributes in the breast cancer problem for a single run

Fig. 3. Performance of the network according to the size of the subset in the diabetes problem for a single run

Table 1. Results of the four problems BCR, DBT, GLS, and IRS. Numbers in () are the standard deviations. Here, TERR, CA, FS, HN, ITN, and CTN refer to training error, classification accuracy, feature selection, hidden neuron, iteration, and connection, respectively.

Name | Feature # (before FS) | TERR (%)      | CA (%)       | Feature # (after FS) | TERR (%)      | CA (%)       | HN # | ITN #  | CTN #
BCR  | 9 (0.00)              | 3.71 (0.002)  | 98.34 (0.31) | 3.2 (0.56)           | 3.76 (0.003)  | 98.57 (0.52) | 2.60 | 249.4  | 18.12
DBT  | 8 (0.00)              | 16.69 (0.006) | 75.04 (2.06) | 2.5 (0.52)           | 14.37 (0.024) | 76.13 (1.74) | 2.66 | 271.3  | 16.63
GLS  | 9 (0.00)              | 26.44 (0.026) | 64.71 (1.70) | 4.1 (1.19)           | 26.56 (0.028) | 66.62 (2.18) | 3.3  | 1480.9 | 42.63
IRS  | 4 (0.00)              | 4.69 (0.013)  | 97.29 (0.00) | 1.3 (0.48)           | 8.97 (0.022)  | 97.29 (0.00) | 3.2  | 2317.1 | 19.96

Table 2. Results of computational time for the BCR, DBT, GLS, and IRS problems

Name | Computational Time (s)
BCR  | 22.10
DBT  | 30.07
GLS  | 22.30
IRS  | 7.90

Table 3. Results of connection reduction and accuracy increase for the BCR, DBT, GLS, and IRS problems

Name | Connection Decrease (%) | Accuracy Increase (%)
BCR  | 45.42                   | 0.23
DBT  | 46.80                   | 1.45
GLS  | 27.5                    | 2.95
IRS  | 30.2                    | 0.00

In each experiment, CFSS generates a subset containing the minimum number of relevant features available, and thereby produces a better $CA$ with a smaller number of hidden neurons (HN). Table 1 shows, for example, that for BCR the network selects a small subset of relevant features (3.2 on average), yielding 98.57% $CA$ while using only 2.6 hidden neurons on average to build the minimal network structure. The results for the remaining problems are broadly similar. The exception is IRS: here the network cannot improve the accuracy any further after feature selection, but it generates a subset of only 1.3 relevant features (out of 4 attributes), which is sufficient to maintain the same network performance.

In addition, Fig. 2 shows the arrangement of the attributes during training according to their contributions for BCR; CFSS can easily delete the unnecessary ones and generates the subset {6,7,1}. Fig. 3 shows the relationship between classification accuracy and subset size for DBT. We calculated the network connections of the pruned network at the final stage according to Section 2.3; the results are shown in Table 1. The computational time for completing the entire FSS process is given in Table 2. We also estimated the decrease in network connections and the corresponding increase in accuracy due to FSS according to Sections 2.3.1 and 2.3.2, as shown in Table 3, which reveals the relation between the reduction of network connections and the increase in accuracy.

3.2 Comparison with Other Methods

In this section, we compare the results of CFSS with those obtained by two other methods, NNFS and ICFS, reported in [4] and [6], respectively. The results are summarized in Tables 4-7. Some caution is warranted, since these methods use different feature selection techniques.

Table 4. Comparison of the number of relevant features for the BCR, DBT, GLS, and IRS data sets

Name | CFSS | NNFS | ICFS (M1) | ICFS (M2)
BCR  | 3.2  | 2.70 | 5         | 5
DBT  | 2.5  | 2.03 | 2         | 3
GLS  | 4.1  | -    | 5         | 4
IRS  | 1.3  | -    | -         | -

Table 5. Comparison of the average testing CA (%) for the BCR, DBT, GLS, and IRS data sets

Name | CFSS  | NNFS  | ICFS (M1) | ICFS (M2)
BCR  | 98.57 | 94.10 | 98.25     | 98.25
DBT  | 76.13 | 74.30 | 78.88     | 78.70
GLS  | 66.62 | -     | 63.77     | 66.61
IRS  | 97.29 | -     | -         | -

Table 6. Comparison of the average number of hidden neurons for the BCR, DBT, GLS, and IRS data sets

Name | CFSS | NNFS | ICFS (M1) | ICFS (M2)
BCR  | 2.60 | 12   | 33.55     | 42.05
DBT  | 2.66 | 12   | 8.15      | 21.45
GLS  | 3.3  | -    | 62.5      | 53.95
IRS  | 3.2  | -    | -         | -

Table 7. Comparison of the average number of connections for the BCR, DBT, GLS, and IRS data sets

Name | CFSS  | NNFS  | ICFS (M1) | ICFS (M2)
BCR  | 18.12 | 70.4  | 270.4     | 338.5
DBT  | 16.63 | 62.36 | 42.75     | 130.7
GLS  | 42.63 | -     | 756       | 599.4
IRS  | 19.96 | -     | -         | -

Table 4 shows the discriminative capability of CFSS, which reduces the dimensionality of the input layer for the four problems mentioned above. For BCR, the result of CFSS is clearly better than both ICFS variants, though not better than NNFS. The result for DBT is average compared with the others, while for GLS the result is comparable to or better than the two ICFS variants.


The comparison of the average testing $CA$ for all problems is shown in Table 5. The results of CFSS for BCR and GLS are better than those of NNFS and ICFS, and for DBT better than NNFS. The most important advantage of the proposed method, however, is the smaller number of connections in the final network, due to the smaller number of HNs in the NN architecture. The results in Tables 6 and 7 show that the numbers of HNs and connections of CFSS are much smaller than those of the other methods. Note that there are some missing entries in Tables 4-7, since the IRS problem was not tested with NNFS and ICFS, and the GLS problem was not tested with NNFS.

4 Discussion

This paper presents a new combinatorial method for feature selection that generates a subset with minimal computation, owing to the minimal sizes of both the hidden layer and the input layer. A constructive technique is used to obtain a compact hidden layer, and a straightforward contribution measure leads to a reduced input layer with better performance. Moreover, the combination of BE with double and single elimination, backtracking, and validation helps CFSS generate subsets proficiently. The results in Table 1 show that CFSS generates subsets with a small number of relevant features while producing better performance on the four benchmark problems. The results for relevant subset generation and generalization performance are better than or comparable to those of the other methods, as shown in Tables 4 and 5.

Wrapper approaches to FSS have long been avoided because of their huge computational cost. In CFSS, the computational cost is much lower. As seen in Table 3, FSS reduces the number of connections by 37.48% while improving the network accuracy by 1.18%, averaged over the four problems. The computational times for the different problems to complete the entire CFSS process are shown in Table 2. We believe these values are sufficiently low, especially for the clinical field: the system can deliver a diagnostic result for a patient to the doctor within a minute. Although an exact comparison is difficult, other methods such as NNFS and ICFS may take 4-10 times longer, since their numbers of hidden neurons and connections are much larger, as seen in Tables 6 and 7, respectively. CFSS thus provides feature subset selection with minimal computation.

In addition, during subset generation, the generated subset is validated at every step. The reason is that during BE we build a composite SC, which eventually finds the point where network training should be stopped. Because of this criterion, the network achieves good performance and no final validation of the generated subset is needed. Furthermore, the additional check for irrelevant features in BE makes CFSS complete.

In this study we applied CFSS to datasets with at most 9 features. To obtain more relevant tests for real tasks, we intend to apply CFSS to datasets with larger numbers of features in the future. Extracting rules from an NN is always desirable in order to interpret how the learned knowledge works, and a compact NN is preferable for this purpose. Since CFSS helps fulfill this requirement, efficient rule extraction from NNs is a further future task.


5 Conclusion

This paper presents a new approach to feature subset selection based on the contributions of the input attributes in an NN. The combination of constructive training, contribution measurement, and backward elimination underlies the success of CFSS. Initially, a basic constructive algorithm is used to determine a minimal and optimal NN structure. Next, one-by-one removal of input attributes is adopted, which is not computationally expensive. Finally, backward elimination with new stopping criteria is used to generate the relevant feature subset efficiently. To evaluate CFSS, we tested it on four real-world problems: breast cancer, diabetes, glass, and iris. The experimental results confirm that CFSS has a strong feature selection capability: it removes irrelevant features from the network and generates a feature subset while producing a compact network at minimal computational cost.

Acknowledgements

This work was supported by grants to KM from the Japan Society for the Promotion of Science, the Yazaki Memorial Foundation for Science and Technology, and the University of Fukui.

References

1. Liu, H., Yu, L.: Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering 17(4), 491–502 (2005)
2. Dash, M., Liu, H.: Feature Selection for Classification. Intelligent Data Analysis - An International Journal 1(3), 131–156 (1997)
3. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97, 273–324 (1997)
4. Setiono, R., Liu, H.: Neural Network Feature Selector. IEEE Transactions on Neural Networks 8 (1997)
5. Milne, L.: Feature Selection using Neural Networks with Contribution Measures. In: 8th Australian Joint Conference on Artificial Intelligence, Canberra, November 27 (1995)
6. Guan, S., Liu, J., Qi, Y.: An incremental approach to Contribution-based Feature Selection. Journal of Intelligent Systems 13(1) (2004)
7. Schuschel, D., Hsu, C.: A weight analysis-based wrapper approach to neural nets feature subset selection. In: Proceedings of the 10th IEEE International Conference on Tools with Artificial Intelligence (1998)
8. Hsu, C., Huang, H., Schuschel, D.: The ANNIGMA-Wrapper Approach to Fast Feature Selection for Neural Nets. IEEE Trans. on Systems, Man, and Cybernetics-Part B: Cybernetics 32(2), 207–212 (2002)
9. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to Instability Problems with Sequential Wrapper-based Approaches to Feature Selection. Journal of Machine Learning Research (2002)


10. Phatak, D.S., Koren, I.: Connectivity and performance tradeoffs in the cascade correlation learning architecture. Technical Report TR-92-CSE-27, ECE Department, UMASS, Amherst (1994)
11. Rumelhart, D.E., McClelland, J.: Parallel Distributed Processing. MIT Press, Cambridge (1986)
12. Prechelt, L.: PROBEN1-A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Faculty of Informatics, University of Karlsruhe, Germany (1994)
13. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. Dept. of Information and Computer Sciences, University of California, Irvine (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html

Dynamic Link Matching between Feature Columns for Different Scale and Orientation

Yasuomi D. Sato 1, Christian Wolff 1, Philipp Wolfrum 1, and Christoph von der Malsburg 1,2

1 Frankfurt Institute for Advanced Studies (FIAS), Johann Wolfgang Goethe University, Max-von-Laue-Str. 1, 60438, Frankfurt am Main, Germany
2 Computer Science Department, University of Southern California, LA, 90089-2520, USA

Abstract. Object recognition in the presence of changing scale and orientation requires mechanisms to deal with the corresponding feature transformations. Using Gabor wavelets as an example, we approach this problem in a correspondence-based setting. We present a mechanism for finding feature-to-feature matches between corresponding points in pairs of images taken at different scales and/or orientations (leaving aside for the moment the problem of simultaneously finding point correspondences). The mechanism is based on a macro-columnar cortical model and dynamic links. We present tests of its ability to find the correct feature transformation in spite of added noise.

1 Introduction

When trying to set two images of the same object or scene into correspondence with each other, so that they can be compared in terms of similarity, it is necessary to find point-to-point correspondences in the presence of changes in scale or orientation (see Fig. 1). It is also necessary to transform local features (unless one chooses to work with features that are invariant to scale and orientation, accepting the reduced information content of such features). Correspondence-based object recognition systems [1,2,3,4] have so far mainly addressed the issue of finding point-to-point correspondences, leaving local features unchanged in the process [5,6]. In this paper, we propose a system that can not only transform features for comparison purposes, but also recognize the transformation parameters that best match two sets of local features, each taken from one point in an image. Our eventual aim is to find point correspondences and feature correspondences simultaneously in one homogeneous dynamic link matching system, but here we take the point correspondences as given for the time being. Both theoretical [7] and experimental [8] investigations suggest 2D Gabor-based wavelets as the features used in visual cortex. These are best sampled in a log-polar manner [11]. This representation has been shown to be particularly useful for face recognition [12,13], and due to its inherent symmetry it is highly appropriate for implementing a transformation system for scale and

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 385–394, 2008. © Springer-Verlag Berlin Heidelberg 2008


Fig. 1. When images of an object differ in (a) scale or (b) orientation, correspondence between them can only be established if the features extracted at a given point (e.g., at the dot in the middle of the images) are also transformed during comparison

orientation. The work described here is based entirely on the macro-columnar model of [14]. In computer simulations we demonstrate feature transformation and transformation parameter recognition.

2 Concept of Scale and Rotation Invariance

Let there be two images, called I and M (for image and model). The Gabor-based wavelet transform $\bar{J}^L$ ($L = I$ or $M$) has the form of a vector, called a jet, whose components are defined as convolutions of the image with a family of Gabor functions:

$$\bar{J}^L_{k,l}(\mathbf{x}_0, a, \theta) = \int_{\mathbb{R}^2} I(\mathbf{x})\,\frac{1}{a^2}\,\psi\!\left(\frac{1}{a}\,Q(\theta)(\mathbf{x}_0 - \mathbf{x})\right) d^2x, \qquad (1)$$

$$\psi(\mathbf{x}) = \frac{1}{D^2}\,\exp\!\left(-\frac{|\mathbf{x}|^2}{2D^2}\right)\left[\exp(i\,\mathbf{x}^T\!\cdot\mathbf{e}_1) - \exp\!\left(-\frac{D^2}{2}\right)\right], \qquad (2)$$

$$Q(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}. \qquad (3)$$

Here $a$ simultaneously represents the spatial frequency and controls the width of the Gaussian envelope window, and $D$ represents the standard deviation of the Gaussian. The individual components of the jet are indexed by $n$ possible orientations and $(m + m_1 + m_2)$ possible spatial frequencies, with $\theta = \pi k / n$ ($k \in \{0, 1, \ldots, n-1\}$) and $a = a_0^l$ ($0 < a_0 < 1$, $l \in \{0, 1, \ldots, m + m_1 + m_2 - 1\}$). The parameters $m_1$ and $m_2$ fix the numbers of scale steps by which the image can be scaled up or down, respectively, and $m$ ($> 1$) is the number of scales in the jets of the model domain.

The Gabor jet $\bar{J}^L$ is defined as the set $\{\bar{J}^L_{k,l}\}$ of $N_L$ Gabor wavelet components extracted at the position $\mathbf{x}_0$ of the image. In order to test the ability to find the correct feature transformation described below, we add noise of strength $\sigma_1$ to the Gabor wavelet components, $\tilde{J}^L_{k,l} = \bar{J}^L_{k,l}(1 + \sigma_1 S_{ran})$, where $S_{ran}$ are random numbers between 0 and 1. Instead of the Gabor jet $\bar{J}^L$ itself, we employ the sum-normalized $\tilde{J}^L_{k,l}$. Jets can be visualized in angular-eccentricity coordinates,

Fig. 2. Comparison of Gabor jets in the model domain M and the image domain I (left and right, respectively, in the two part-figures). Jets are visualized in log-polar coordinates: the orientation θ of the jet components is arranged circularly, while the spatial frequency l is set radially. The jet J^M of the model domain M is shown right, while the jet J^I from the transformed image in the image domain I is shown left. Arrows indicate which components of J^I and J^M are to be compared: (a) comparison of scaled jets, (b) comparison of rotated jets.

in which the orientation and spatial frequency of a jet are arranged circularly and radially (as shown in Fig. 2). The components of jets taken from corresponding points in images I and M may be arranged according to orientation (rows) and scale (columns):

$$J^I = \begin{pmatrix} 0 \cdots 0 & J^I_{0,m_1-1} & \cdots & J^I_{0,m_1-1+m} & 0 \cdots 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 \cdots 0 & J^I_{n-1,m_1-1} & \cdots & J^I_{n-1,m_1-1+m} & 0 \cdots 0 \end{pmatrix}, \qquad (4)$$

$$J^M = \begin{pmatrix} J^M_{0,m_1-1} & \cdots & J^M_{0,m_1-1+m} \\ \vdots & & \vdots \\ J^M_{n-1,m_1-1} & \cdots & J^M_{n-1,m_1-1+m} \end{pmatrix}. \qquad (5)$$

Let us assume that the two images M and I to be compared have exactly the same structure, apart from being transformed relative to each other, and that the transformation conforms to the sample grid of the jet components. Then there will be pair-wise identities between components in Eqs. (4) and (5). If the jet in I is scaled relative to the jet in M, the non-zero components of $J^I$ are shifted along the horizontal (or, in Fig. 2(a), radial) coordinate. If the image I, and correspondingly the jet $J^I$, is rotated, then the jet components are circularly permuted along the vertical axis in Eq. (4) (see Fig. 2(b)). When comparing scaled and rotated jets, the non-zero components of $J^I$ are shifted along both axes simultaneously. There are model jet components of $m$ different scales, and to allow for $m_1$ steps of scaling the image I down and $m_2$ steps of scaling it up, the jet in Eq. (4) is padded on the left and right with $m_1$ and $m_2$ columns of zeros, respectively.
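The effect of the two transformations on the padded jet matrix of Eq. (4) can be illustrated with NumPy. This is a sketch: the sizes n = 4, m = 3, m1 = m2 = 2 and the random jet values are arbitrary, and `np.roll` along the scale axis acts as a plain shift only as long as the shift stays within the zero padding:

```python
import numpy as np

n, m, m1, m2 = 4, 3, 2, 2        # orientations, model scales, down/up steps

rng = np.random.default_rng(0)
core = rng.random((n, m))        # the m active scale columns of the jet
J_I = np.zeros((n, m + m1 + m2))
J_I[:, m1:m1 + m] = core         # zero padding on both sides, as in Eq. (4)

def transform(J, scale_steps=0, rot_steps=0):
    """Scaling shifts the non-zero block along the scale (column) axis;
    rotation permutes the orientation (row) axis circularly."""
    out = np.roll(J, rot_steps, axis=0)      # circular in orientation
    out = np.roll(out, scale_steps, axis=1)  # shift in scale
    return out

scaled = transform(J_I, scale_steps=1)   # one scale step
rotated = transform(J_I, rot_steps=1)    # rotation by pi/n
```

After scaling, the active block sits one column further along the scale axis; after rotation, the rows are cyclically permuted, which is exactly the pair-wise component identity the matching mechanism has to discover.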


Fig. 3. A dynamic link matching network model of a pair of single macro-columns. In the I and M domains, the network consists of, respectively, NI and NM feature units (mini-columns). The units in the I and M domains represent components of the jets J I and J M . The Control domain C controls interconnections between the I and M macrocolumns, which are all-to-all at the beginning of a ν cycle. By comparing the two jets and through competition of control units in C, the initial all-to-all interconnections are dynamically reduced to a one-to-one interconnection.

3 Modelling a Columnar Network

In analogy to [14], we set up a dynamical system of variables, both for jet components and for all possible correspondences between jet components. These variables are interpreted as the activities of cortical mini-columns (here called "units", in this case "feature units"). The set of units corresponding to a jet, either in I or in M, forms a cortical macro-column (or, for short, "column"), and its units inhibit each other, the strength of inhibition being controlled by a parameter ν, which cyclically ramps up and falls down. In addition, there is a column of control units, forming the control domain C. Each control unit stands for one pair of possible values of relative scale and orientation between I and M, and can evaluate the similarity of the activity of feature units under the corresponding transformation. The whole network is shown schematically in Fig. 3.

Each domain, or column in our simple case, consists of a set of unit activities $\{P^L_1, \ldots, P^L_{N_L}\}$ in the respective domain ($L = I$ or $M$). $N_I = (m + m_1 + m_2) \times n$ and $N_M = m \times n$ are the total numbers of units in the I and M domains, respectively. The control unit activations are $\mathbf{P}^C = (P^C_1, \ldots, P^C_{N_C})^T$, where $N_C = (m_1 + m_2 + 1) \times n$. The equation system for the activations in the domains is given by

$$\frac{dP^L_\alpha}{dt} = f_\alpha(\mathbf{P}^L) + \kappa_L C_E J^L_\alpha + \kappa_L (1 - C_E)\,E^{L\bar{L}}_\alpha + \xi_\alpha(t), \quad (\alpha = 1, \ldots, N_L \text{ and } \bar{L} = \{M, I\} \setminus L), \qquad (6)$$

$$\frac{dP^C_\gamma}{dt} = f_\gamma(\mathbf{P}^C) + \kappa_C F_\gamma + \xi_\gamma(t), \quad (\gamma = 1, \ldots, N_C), \qquad (7)$$

Dynamic Link Matching between Feature Columns


Fig. 4. Time course of all activities in the domains I, M and C over a ν cycle for two jets without relative rotation and scale, for a system with size m = 5, n = 8, m1 = 0 and m2 = 4. The activity of units in the M and I columns is shown within the bars on the right and the bottom of the panels, respectively, each of the eight subblocks corresponding to one orientation, with individual components corresponding to diﬀerent scales. The indices k and l at the top label the orientation and scale of the I jet, respectively. The large matrix C contains, in the top version, as non-zero (white) entries all allowed component-to-component correspondences between the two jets (the empty, black, entries would correspond to forbidden relative scales outside the range set by m1 and m2 ). The matrix consists of 8 × 8 blocks, corresponding to the 8 × 8 possible correspondences between jet orientations, and each of these sub-matrices contains m×(m+m1 +m2 ) entries to make room for all possible scale correspondences. Each control unit controls (and evaluates) a whole jet to jet correspondence with the help of its entries in this matrix. The matrix in the lowest panel has active entries corresponding to just one control unit (for identical scale and orientation). Active units are white, silent units black. Unit activity is shown for three moments within one ν cycle. At t = 3.4 ms, the system has already singled out the correct scale, but is still totally undecided as to relative orientation. System parameters are κI = κM = 2, κC = 5, CE = 0.99, σ = 0.015 and σ1 = 0.1.


Y.D. Sato et al.

where ξ_α(t) and ξ_γ(t) are Gaussian noise of strength σ². C_E ∈ [0, 1] scales the ratio between the Gabor jet J_α^L and the total input E_α^{LL'} from units in the other domain, L', and κ_L controls the strength of the projection from L' to L. κ_C controls the strength of the influence of the similarity F_γ on the control units. Here we explain the other terms in the above equations. The function f_α(·) is

\[ f_\alpha(P^L) = a_1 P^L_\alpha \Big( P^L_\alpha - \nu(t) \max_{\beta = 1, \ldots, N_L} \{P^L_\beta\} - (P^L_\alpha)^2 \Big), \tag{8} \]

where a_1 = 25. The function ν(t) describes a cycle in time during which the feedback is periodically modulated:

\[ \nu(t) = \begin{cases} \nu_{\min} + (\nu_{\max} - \nu_{\min})\,\dfrac{t - T_k}{T_1}, & (T_k \le t < T_1 + T_k) \\[4pt] \nu_{\max}, & (T_1 + T_k \le t < T_1 + T_k + T_{\mathrm{relax}}) \end{cases} \tag{9} \]

Here T_1 [ms] is the duration during which ν increases with t, while T_k (k = 1, 2, 3, ...) are the periodic times at which the inhibition ν starts to increase. T_relax is a relaxation time after ν has been increased. We set ν_min = 0.4, ν_max = 1, T_1 = 36.0 ms and T_relax = T_1/6 in our study. Because of the increasing ν(t) and the noise in Eqs. (6) and (7), only one mini-column in each domain remains active at the end of the cycle, while the others are deactivated, as shown in Fig. 4. If we define the mean-free column activities as \( \tilde{P}^L_\alpha := P^L_\alpha - (1/N_L) \sum_{\alpha'} P^L_{\alpha'} \), the interaction terms in Eqs. (6) and (7) are

\[ E^{LL'}_\alpha = \sum_{\beta=1}^{N_C} \sum_{\alpha'=1}^{N_{L'}} P^C_\beta \, C^\beta_{\alpha\alpha'} \, \tilde{P}^{L'}_{\alpha'}, \tag{10} \]

\[ F_\beta = \sum_{\gamma=1}^{N_M} \sum_{\gamma'=1}^{N_I} \tilde{P}^M_\gamma \, C^\beta_{\gamma\gamma'} \, \tilde{P}^I_{\gamma'}. \tag{11} \]

Here N_C = n × (m_1 + m_2 + 1) is the number of permitted transformations between the two domains (n different orientations and m_1 + m_2 + 1 possible scales). The expression C^β_{αα'} designates a family of matrices, rotating and scaling one jet into another. If we write β = (o, s), so that o identifies the rotation and s the scaling, then we can write the C matrices as a tensor product: C^β = A^o ⊗ B^s, where A^o is an n × n matrix with a wrap-around shifted diagonal of entries 1 and zeros otherwise, implementing rotation o of jet components at any orientation, and B^s is an m × (m + m_1 + m_2) matrix with a shifted diagonal of entries 1 and zeros otherwise, implementing an up or down shift of jet components at any scale. The C matrix implementing the identity looks like the lowest panel in Fig. 4. The term E of Eq. (10) transfers feature unit activity from one domain to the other; it is a superposition of all possible transformations, weighted with the activity of the control units. The term F in Eq. (11) evaluates the similarity of the feature unit activity in one domain with that in the other under mutual transformation.
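The family of correspondence matrices C^β = A^o ⊗ B^s can be constructed directly. The sketch below is our own illustration, not code from the paper; treating o and s as 0-based shift indices is our convention:

```python
import numpy as np

def rotation_matrix(n, o):
    # A^o: n x n matrix with a wrap-around shifted diagonal,
    # cyclically shifting the n orientation components by o.
    A = np.zeros((n, n))
    A[np.arange(n), (np.arange(n) + o) % n] = 1.0
    return A

def scale_matrix(m, m1, m2, s):
    # B^s: m x (m + m1 + m2) matrix with a shifted diagonal (no wrap-around),
    # selecting m consecutive scale components at offset s = 0 .. m1 + m2.
    B = np.zeros((m, m + m1 + m2))
    B[np.arange(m), np.arange(m) + s] = 1.0
    return B

def correspondence_matrix(n, m, m1, m2, o, s):
    # C^beta = A^o (tensor) B^s for beta = (o, s): an N_M x N_I matrix made of
    # n x n blocks of size m x (m + m1 + m2), as in the matrix shown in Fig. 4.
    return np.kron(rotation_matrix(n, o), scale_matrix(m, m1, m2, s))

# For the Fig. 4 system (n = 8, m = 5, m1 = 0, m2 = 4), each C^beta maps the
# N_I = 72 I-jet components onto the N_M = 40 M-jet components.
C = correspondence_matrix(8, 5, 0, 4, 0, 0)
```

Each row of such a matrix contains exactly one non-zero entry, so applying C^β to an I jet picks out, for every M component, the single I component related to it by rotation o and scale shift s.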


Through the columnar dynamics of Eq. (6), the relative strengths of jet components are expressed in the history of feature unit activities (the strongest jet component, for instance, letting its unit stay on longest), so that the sum of products in Eq. (11), integrated over time in Eq. (7), indeed expresses jet similarities. The time course of all unit activities in our network model during a ν cycle is shown in Fig. 4. In this figure, the non-zero components of the jets used in the I and M domains are identical, with no relative rotation or scaling. At the beginning of the ν cycle, all units in the I, M and C domains are active, as indicated in Fig. 4 by the white jet-bars on the right and the bottom, and by all admissible correspondences in the matrix being on. The activity state of the units in each domain is gradually reduced following Eqs. (6)–(11). In the intermediate state at t = 3.4 ms, all control units for the wrong scale have switched off, but the control units for all orientations are still on. At t = 12.0 ms, finally, only one control unit has survived, the one that correctly identifies the identity map. This state remains stable during the rest of the ν cycle.
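The ν cycle and the resulting winner-take-all behaviour of a single column can be sketched numerically. The following is a minimal illustration of Eqs. (8) and (9) with simple Euler integration; it is our own simplification, not the authors' simulation code:

```python
import numpy as np

NU_MIN, NU_MAX, T1 = 0.4, 1.0, 36.0   # parameter values from the text
T_RELAX = T1 / 6.0

def nu(t):
    # Eq. (9) with T_k = 0: linear ramp over T1 ms, then a plateau
    # at nu_max for T_relax ms, repeated periodically.
    tc = t % (T1 + T_RELAX)
    return NU_MIN + (NU_MAX - NU_MIN) * tc / T1 if tc < T1 else NU_MAX

def f(P, t, a1=25.0):
    # Eq. (8): self-excitation with nu-modulated inhibition by the maximum.
    return a1 * P * (P - nu(t) * P.max() - P ** 2)

def run_column(N=10, dt=0.01, T=T1 + T_RELAX, sigma=0.015, seed=0):
    # Euler integration of one column of N feature units over one nu cycle.
    rng = np.random.default_rng(seed)
    P = np.full(N, 0.5)
    for step in range(int(T / dt)):
        P += dt * f(P, step * dt) + np.sqrt(dt) * sigma * rng.standard_normal(N)
        P = np.clip(P, 0.0, 1.0)
    return P
```

As ν ramps up, the inhibition term ν(t) max_β P_β progressively suppresses all but the strongest unit, which is the mechanism by which one mini-column per domain survives the cycle.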

4 Results

In this section, we describe tests of the functionality of our macro-columnar model, studying the robustness of matching to noise introduced into the image jets (parameter σ1), to noise in the dynamics of the units, Eqs. (6) and (7) (parameter σ), and to the admixture of input from the other domain, Eq. (6), due to non-vanishing (1 − CE). A correct match is one in which the surviving control unit corresponds to the correct transformation pair by which the I jet differs from the M jet. Jet noise models differences between jets actually extracted from natural images. Dynamic noise reflects signal fluctuations to be expected in units (mini-columns) composed of spiking neurons, and a non-zero admixture parameter may be relevant in a system in which transfer of jet information between the domains is desired. For both of the experiments described below, we used Gabor jets extracted from the center of real facial images. The transformed jets used for matching were extracted from images that had been scaled and rotated relative to the center position by exactly the same factors and angles that are also used in the Gabor transform. This ensures that there exists a perfect match between any two jets produced from two different rotated and scaled versions of the same face. The first experiment uses a single facial image and investigates the influence of noise in the units (σ) and of the parameter CE. Using a pair of identical jets, we obtain temporal averages of correct matching over 10 ν cycles for each size (n = 4, 5, ..., 10; m = 4, m1 = 0 and m2 = m − 1) of the macro-column. From these temporal averages, we calculate the sampling average over all n. The standard error is estimated as σd/√N1, where σd is the standard deviation of the sampling average and N1 is the sample size. Fig. 5(a) shows results for κI = κM = 1 and κC = 5. For strong noise (σ = 0.015 to 0.02), the correct matching probabilities for CE = 0.98 to CE = 0.96


[Fig. 5 plots, panels (a) and (b): probability of correct match vs. σ (0 to 0.02), with curves for CE = 1.0, 0.98, 0.96, 0.95 and 0.94.]

Fig. 5. Probability of correct match between units in the I and M domains. (a) κI = κM = 1 and κC = 5. (b) κI = 2.3, κM = 5 and κC = 5.


Fig. 6. Probability of correct matching for comparisons of two identical faces. Six facial images are used. Our system is set with m = 4, m1 = m2 = m − 1, n = 8, κI = κM = 6.5, κC = 5.0, CE = 0.98 and σ = 0.0.

are higher than those for CE = 1. Interestingly, for these low CE values, matching gets worse for weaker noise, collapsing in the case of σ = 0. This effect requires further investigation. However, for κI = 2.3 and κM = 5, we found that the correct matching probability takes higher values than in the case of Fig. 5(a). In particular, with the higher noise level σ = 0.015 our system even demonstrates perfect matching for CE = 0.98, independent of n. Next, we investigated the robustness of the matching against the second type of noise (σ1), using 6 different face types. Since the original image is resized and/or rotated with m1 + m2 + 1 = 7 scales and n = 8 orientations, a total of 56 images could be employed in the I domain. Here our system is set with κI = κM = 6.5, κC = 5.0 and CE = 0.98. For each face image, we take temporal averages of the correct matching probability for each size and orientation of the image, in a similar manner as described above. Averages and standard errors of these temporal averages over the 6 different facial images are plotted as a function of σ1 in Fig. 6. The result of this experiment is independent of σ as long as σ ≤ 0.02. As Fig. 6 shows, when random noise in the jets is increased, the correct matching probability decreases smoothly for some images, while for others it drops abruptly to around 0 at a certain σ1. We have also obtained


perfect matching, independent of σ1, when using the same but rotated or resized face. The most important point we would like to stress is that the probability stays above 87.74% for σ < 0.02 and σ1 < 0.08 (see Fig. 6). We can therefore say that the network model has a high capability for scale- and orientation-invariant recognition, independent of the facial type.
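The probability estimates above use the standard error σd/√N1; a minimal helper for this estimate (our own convenience function, not from the paper) is:

```python
import numpy as np

def mean_and_stderr(samples):
    # Sampling average and its standard error sigma_d / sqrt(N1), with
    # sigma_d the sample standard deviation and N1 the sample size.
    samples = np.asarray(samples, dtype=float)
    return samples.mean(), samples.std(ddof=1) / np.sqrt(len(samples))
```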

5 Discussion and Conclusion

The main purpose of this communication is to convey a simple concept. Much more work needs to be done on the way to practical applications, for instance, more experiments with features extracted from independently taken images of diﬀerent scale and orientation to better bring out the strengths and weaknesses of the approach. In addition to comparing and transforming local packages of feature values (our jets), it will be necessary to also handle spatial maps, that is, sets of point-to-point correspondences, a task we will approach next. Real applications involve, of course, a continuous range of transformation parameters, whereas we here had admitted only transformations from the same sample grid used for deﬁning the family of wavelets in a jet. We hope to address this problem by working with continuous superpositions of transformation matrices for the neighboring transformation parameter values that straddle the correct value. Our approach may be seen as using brute force, as it requires many separate control units to sample the space of transformation parameters with enough density. However, the same set of control units can be used in a whole region of visual space, and in addition to that, for controlling point-to-point maps between regions, as spatial maps and feature maps stand in one-to-one correspondence to each other. The number of control units can be reduced even further if transformations are performed in sequence, by consecutive, separate layers of dynamic links. Thus, the transformation from an arbitrary segment of primary visual cortex to an invariant window could be done in the sequence translation – scaling – rotation. In this case, the number of required control units would be the sum and not the product of the number of samples for each individual transformation. A disadvantage of that approach might be added diﬃculties in ﬁnding correspondences between the domains. Further work is required to elucidate these issues.

Acknowledgements This work was supported by the EU project Daisy, FP6-2005-015803 and by the Hertie Foundation. We would like to thank C. Weber for help with preparing the manuscript.

References

1. Anderson, C.H., Van Essen, D.C., Olshausen, B.A.: Directed visual attention and the dynamic control of information flow. In: Itti, L., Rees, G., Tsotsos, J.K. (eds.) Neurobiology of Attention, pp. 11–17. Academic Press/Elsevier (2005)
2. Weber, C., Wermter, S.: A self-organizing map of sigma-pi units. Neurocomputing 70, 2552–2560 (2007)


3. Wiskott, L., von der Malsburg, C.: Face Recognition by Dynamic Link Matching. In: Sirosh, J., Miikkulainen, R., Choe, Y. (eds.) Lateral Interactions in the Cortex: Structure and Function, ch. 4. Electronic book (1996)
4. Wolfrum, P., von der Malsburg, C.: What is the optimal architecture for visual information routing? Neural Comput. 19 (2007)
5. Lades, M.: Invariant Object Recognition with Dynamical Links, Robust to Variations in Illumination. Ph.D. Thesis, Ruhr-Univ. Bochum (1995)
6. Maurer, T., von der Malsburg, C.: Learning Feature Transformations to Recognize Faces Rotated in Depth. In: Fogelman-Soulié, F., Rault, J.C., Gallinari, P., Dreyfus, G. (eds.) Proc. of the International Conference on Artificial Neural Networks ICANN 1995, EC2 & Cie, p. 353 (1995)
7. Daugman, J.G.: Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A 2, 1160–1169 (1985)
8. Jones, J., Palmer, L.: An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol. 58, 1233–1258 (1987)
9. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. In: Proc. of the Third IEEE Conference on Computer Vision and Pattern Recognition, pp. 84–91 (1994)
10. Wiskott, L., Fellous, J.-M., Krüger, N., von der Malsburg, C.: Face Recognition by Elastic Bunch Graph Matching. IEEE Trans. Pattern Anal. & Machine Intelligence 19, 775–779 (1997)
11. Marcelja, S.: Mathematical description of the responses of simple cortical cells. J. Optical Soc. Am. 70, 1297–1300 (1980)
12. Yue, X., Tjan, B.C., Biederman, I.: What makes faces special? Vision Res. 46, 3802–3811 (2006)
13. Okada, K., Steffens, J., Maurer, T., Hong, H., Elagin, E., Neven, H., von der Malsburg, C.: The Bochum/USC Face Recognition System and How it Fared in the FERET Phase III Test. In: Wechsler, H., Phillips, P.J., Bruce, V., Fogelman Soulié, F., Huang, T.S. (eds.) Face Recognition: From Theory to Applications, pp. 186–205. Springer, Heidelberg (1998)
14. Lücke, J., von der Malsburg, C.: Rapid Correspondence Finding in Networks of Cortical Columns. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 668–677. Springer, Heidelberg (2006)

Perturbational Neural Networks for Incremental Learning in Virtual Learning System

Eiichi Inohira¹, Hiromasa Oonishi², and Hirokazu Yokoi¹

¹ Kyushu Institute of Technology, Hibikino 2-4, 808-0196 Kitakyushu, Japan
{inohira,yokoi}@life.kyutech.ac.jp
² Mitsubishi Heavy Industries, Ltd., Japan

Abstract. This paper presents a new type of neural network, the perturbational neural network, to realize incremental learning in autonomous humanoid robots. In our previous work, a virtual learning system was introduced to let a robot explore plausible behavior in its brain. Neural networks can generate plausible behavior in an unknown environment without time-consuming exploration. However, although an autonomous robot should grow step by step, conventional neural networks forget prior learning when trained with a new dataset. The proposed network adds the outputs of a sub neural network to the weights and thresholds of a main neural network. Incremental learning and high generalization capability are realized by only slightly changing the mapping of the main neural network. We show through numerical experiments with a two-dimensional stair-climbing bipedal robot that the proposed network realizes incremental learning without forgetting.

1 Introduction

Recently, humanoid robots such as ASIMO [1] have developed dramatically in terms of hardware and show promise for working much like a human. Although many researchers have studied artificial intelligence for a long time, humanoid robots do not yet have enough autonomy to make experts unnecessary. Humanoid robots should accomplish missions by themselves rather than having experts give them solutions such as models, algorithms, and programs. Researchers have studied how to give a robot learning ability through trial and error. Such studies use so-called soft computing techniques such as reinforcement learning [2] and central pattern generators (CPG) [3]. Learning by a robot saves expert work to some degree but takes much time; these techniques are less efficient than humans. Humans act instantly depending on the situation by using imagination and experience, and even a failure serves as experience. In particular, humans know the characteristics of the environment and of their own behavior and simulate trial and error to explore plausible behavior in their brains. In our previous work, Yamashita et al. [4] proposed a bipedal walking control system with a virtual environment, based on the motion control mechanism of primates. In that study, exploring plausible behavior by trial and error is carried

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 395–404, 2008.
© Springer-Verlag Berlin Heidelberg 2008


E. Inohira, H. Oonishi, and H. Yokoi

out in a robot's brain. However, the problem is that exploring takes much time. Kawano et al. [5] introduced into the previous work learning of the relation between environmental data and an optimized behavior in order to save time. Generating behavior becomes fast because a trained neural network (NN) immediately outputs plausible behavior thanks to its generalization capability, i.e., the ability to produce a desired output for an unknown input. However, a problem in this previous work is that conventional neural networks cannot realize incremental learning: a trained NN forgets prior learning when trained with a new dataset. Realizing incremental learning is indispensable for learning various behaviors step by step. We present a new type of NN, the perturbational neural network (PNN), to achieve incremental learning. A PNN consists of a main NN and a sub NN. The main NN learns with a representative dataset and never changes after that. The sub NN learns with new datasets so that the output error of the main NN for those datasets is canceled. The sub NN is connected to the main NN through additional terms in the weights and thresholds of the main NN. We expect that these connections only slightly change the characteristics of the main NN, i.e., that a perturbation is applied to the mapping of the main NN, so that incremental learning does not sacrifice generalization capability.

2 A Virtual Learning System for a Biped Robot

We assume that a robot tries to achieve a task in an unknown environment. Virtual learning is defined as learning from virtual experience in a virtual environment in the robot's brain. The virtual environment is generated from sensory information about the real environment around the robot and is used for exploring behavior fit for a task without real action. Exploring such behavior in the virtual environment is regarded as virtual experience gained by trial and error or ingenuity. Virtual learning then memorizes the relation between environment and behavior under a particular task. The benefits of virtual learning are (1) reducing the risk of robot failures in the real world, such as falling and colliding, and (2) enabling the robot to act immediately and achieve learned tasks in similar environments. The former concerns the robot's safety and physical wear; the latter saves work and time in achieving a task. We presented a virtual learning system for a biped robot in [4,5]. This virtual learning system has a virtual environment and NNs for learning, as shown in Fig. 1. We assume that a central pattern generator (CPG) generates the motion of the biped robot. The CPG consists of neural oscillators corresponding to the robot's joints. Behavior is then described by CPG parameters rather than joint angles. The CPG parameters for controlling the robot are called the CPG input. Optimized CPG input for a plausible behavior is found by exploring in the virtual environment. Virtual learning is realized by NNs. An NN learns a mapping from environmental data to optimized CPG input, i.e., a mapping from real environment to robot behavior. NNs have generalization capability, meaning that a desired output can be generated for an unknown input. When an NN has learned a mapping

PNNs for Incremental Learning in Virtual Learning System



Fig. 1. Virtual learning system for a biped robot

with representative data, it can generate the desired CPG input in an unknown environment, and the biped robot then achieves the task. It takes much time to train an NN, but a trained NN generates output fast. The CPG input is first generated by the NN and then checked in the virtual environment. When the CPG input generated by the NN achieves the task, it is sent to the CPG. Otherwise, CPG input obtained by exploring, which takes time, is used for controlling the robot, and the NN learns with that CPG input in order to correct its error. The combination of NN and exploring realizes autonomous learning, and each covers the other's shortcomings.
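The decision loop described above can be summarized in code. The sketch below is our own schematic of the control flow of Fig. 1; the class and function names are hypothetical, and the success test is a stand-in for checking the behavior in the virtual environment:

```python
import numpy as np

class VirtualEnv:
    # Hypothetical stand-in for the virtual environment: here a task is
    # "achieved" when the CPG input is close to the optimum for this setting.
    def __init__(self, optimal_cpg_input):
        self.optimal = np.asarray(optimal_cpg_input, dtype=float)

    def achieves_task(self, cpg_input):
        return np.linalg.norm(np.asarray(cpg_input) - self.optimal) < 0.1

def generate_behavior(env_data, nn_predict, nn_train, virtual_env, explore):
    # Fast path: the trained NN proposes a CPG input from environmental data,
    # which is then checked in the virtual environment.
    cpg_input = nn_predict(env_data)
    if virtual_env.achieves_task(cpg_input):
        return cpg_input
    # Slow path: explore in the virtual environment (e.g. by a GA), then
    # train the NN on the explored solution to correct its error.
    cpg_input = explore(virtual_env)
    nn_train(env_data, cpg_input)
    return cpg_input
```

The point of the design is that the slow exploration runs only when the NN's proposal fails, and every such failure produces a new training pair for the NN.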

3 Neural Networks for Incremental Learning

A key component of our virtual learning system is the NN. A virtual learning system should develop with experience in various environments and tasks. However, a multilayer perceptron (MLP) with the back-propagation (BP) learning algorithm, which is widely used in many applications, forgets prior learning when trained with a new dataset because its connection weights are overwritten. If all training datasets are stored in memory, prior learning can be preserved in an MLP by retraining with all the datasets, but training with a large volume of data takes a huge amount of time and memory. Humans, of course, can learn something new while keeping old experiences. A virtual learning system likewise needs to accumulate virtual experiences step by step.

3.1 Perturbational Neural Networks

The basic idea of the PNN is that the NN learns a new dataset by adding new parameters to the weights and thresholds, which are held constant after the initial training, instead of overwriting them. A PNN consists of a main NN and a sub NN, as shown in Fig. 2. The main NN learns with representative datasets and is constant after this learning. The sub NN learns with new datasets and generates additional terms for the weights and thresholds of the main



Fig. 2. A perturbational neural network


Fig. 3. A neuron in main NN

NN, i.e., Δw and Δh. A neuron of the PNN is shown in Fig. 3. The input-output relation of a conventional neuron is given by

\[ z = f\Big( \sum_i w_i x_i - h \Big), \tag{1} \]

where z denotes the output, w_i a weight, x_i an input, and f(·) the activation function. A neuron of the PNN is given by

\[ z = f\Big( \sum_i (w_i + \Delta w_i) x_i - (h + \Delta h) \Big), \tag{2} \]

where Δw_i and Δh are outputs of the sub NN. Training of the PNN is divided into two phases. First, the main NN learns with the representative dataset. For instance, assume that the representative dataset consists of environmental data (EA, EB, EC) and CPG parameters



Fig. 4. Training of NN with representative dataset


Fig. 5. Training of NN with new dataset

(PA, PB, PC), as shown in Fig. 4. Training of the main NN is the same as for an NN with BP. At the same time, the sub NN learns so that Δw and Δh equal zero; for the representative dataset, the sub NN thus has no effect on the main NN. Next, the PNN learns with a new dataset. For instance, assume that the new dataset consists of environmental data ED and CPG parameters PD, as shown in Fig. 5. The main NN does not learn with the new dataset; instead, the error signals are passed to the sub NN through the main NN. The output error of the main NN for ED is non-zero because ED is unknown to the main NN. The sub NN learns with the new dataset so that this error is canceled.
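The two training phases can be illustrated with a single perturbed neuron, Eq. (2). In the sketch below, the perturbations Δw and Δh are trained directly as free parameters, a simplification of ours (in the actual PNN they are outputs of the sub NN); the sigmoid activation, learning rates and toy datasets are also our assumptions:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, w, h, dw=0.0, dh=0.0):
    # Eq. (2): z = f(sum_i (w_i + dw_i) x_i - (h + dh)); dw = dh = 0 gives Eq. (1).
    return sigmoid(np.dot(w + dw, x) - (h + dh))

def train_main(X, T, w, h, lr=2.0, epochs=3000):
    # Phase 1: gradient descent on (w, h) with the representative dataset.
    for _ in range(epochs):
        for x, t in zip(X, T):
            z = forward(x, w, h)
            g = (z - t) * z * (1.0 - z)   # error gradient w.r.t. pre-activation
            w -= lr * g * x
            h += lr * g                   # d(pre-activation)/dh = -1
    return w, h

def train_sub(X, T, w, h, dw, dh, lr=2.0, epochs=3000):
    # Phase 2: (w, h) are frozen; only the perturbations learn the new data,
    # canceling the output error of the main neuron.
    for _ in range(epochs):
        for x, t in zip(X, T):
            z = forward(x, w, h, dw, dh)
            g = (z - t) * z * (1.0 - z)
            dw -= lr * g * x
            dh += lr * g
    return dw, dh
```

After phase 1 the representative pairs are fitted; after phase 2 the perturbed neuron also fits the new pair while w and h are untouched, which is the PNN's way of adding knowledge without overwriting.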

3.2 Related Work

Some authors [6,7,8] have proposed other methods in which the Mixture of Experts (MoE) architecture is applied to incremental learning of neural networks. The MoE architecture is based on a divide-and-conquer approach: a complex mapping that a single neural network would have to learn is divided into simple mappings, each of which a neural network can learn easily. In contrast, a PNN learns a complex mapping by connecting sub NNs to a main NN. A PNN is expected to have global generalization capability because it does not divide the mapping into local mappings. In the MoE architecture, each expert neural network is expected to have local generalization capability, but the system may lack global generalization capability because this is not addressed explicitly. Therefore, when generalization capability is the focus, a PNN should be used.


A PNN has the problem that it needs a large number of connections from the sub NNs to the main NN, which makes it very inefficient in resources. Although we have not yet studied the efficiency of the PNN, we expect that there is room to reduce the number of connections; this will be future work.

4 Numerical Experiments

4.1 Setup

We evaluate the generalization capability of the proposed NN for incremental learning through numerical experiments with a robot climbing stairs. Simplified experiments are performed because we focus on the NN for virtual learning.


Fig. 6. A two-dimensional ﬁve-link biped robot model


Fig. 7. A CPG for a biped robot

A two-dimensional five-link model of a bipedal robot is used, as shown in Fig. 6. The five joint angles defined in Fig. 6 are controlled by five neural oscillators, one per joint. The length of every link is 100 cm. The task of the robot is to climb five stairs. The height and depth of the stairs are used as


Table 1. Dimensions of the representative stairs

Stairs  Height [cm]  Depth [cm]
A       30           60
B       10           70
C       30           100
D       10           90


Fig. 8. A PNN used for experiments

environmental data. The robot considers only kinematics in the virtual environment and ignores dynamics. The CPG shown in Fig. 7 is used for controlling the bipedal robot. We use Matsuoka's neural oscillator model [9] in the CPG. In this study, 16 connection weights w and five constant external inputs u0 are defined as the CPG input for controlling the robot. The CPG input for climbing stairs is obtained by exploring the parameter space with a GA and is optimized for each environment. The internal parameters of the CPG are also given by a GA but are kept constant across all environments, because exploring all CPG parameters including the internal ones would take too much time; they are optimized for walking on a horizontal plane. The four pairs shown in Table 1 are defined as representative data for NN training. The NN therefore has 2 inputs and 21 outputs. The following targets are compared to evaluate their generalization capability:

– a CPG whose parameters are optimized for each of the three representative environments, i.e., stairs A, B, and C
– an MLP trained for stairs A, B, and C (MLP-I)
– an MLP trained for stairs D after being trained for stairs A, B, and C (MLP-II)
– a PNN trained in the same way as the above MLPs

These targets are optimized and trained for one to four kinds of stairs. The generalization capability of each target is measured by the number of conditions, different from the representative stairs, under which the biped robot can climb five


stairs successfully. Stair height ranges from 4 cm to 46 cm and depth from 40 cm to 110 cm. The MLPs have 30 neurons in a hidden layer. Initial weights of the MLPs are given by uniform random numbers in ±0.3. The PNN used in this paper has two sub NNs, as shown in Fig. 8. The main NN of the PNN is the same as the above MLP. The sub NN for the hidden layer has 100 neurons and 90 outputs; the sub NN for the output layer has 600 neurons and 561 outputs. All initial weights of the PNN are given in the same way as for the MLPs. One learning cycle is defined as presenting stairs A, B, and C to the MLP or PNN sequentially. MLP-I is trained on the three kinds of stairs for 10000 learning cycles, after which the sum of squared errors is much less than 10⁻⁷. MLP-II is trained on stairs D for 1000 learning cycles after being trained on the three kinds of stairs; as mentioned below, although the number of learning cycles for stairs D is small, MLP-II forgets stairs A, B, and C. The training condition for the main NN of the PNN is the same as for MLP-I. Incremental learning of the PNN for stairs D means that the two sub NNs are trained on stairs D while the main NN is held constant; these sub NNs are trained for 700 learning cycles. The learning rate of the NNs is optimized through preliminary experiments because performance depends heavily on it. We used the learning rate minimizing the sum of squared errors over a certain number of learning cycles for each NN; the learning rates used for MLP-I and for the sub NNs of the PNN are 0.30 and 0.12, respectively.
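Matsuoka's neural oscillator model [9], used above for each joint, consists of mutually inhibiting neurons with adaptation. The sketch below is a generic two-neuron implementation with commonly used parameter values; these values are our choices for illustration, not the GA-tuned parameters of this study:

```python
import numpy as np

def matsuoka_pair(T=10.0, dt=0.005, tau_u=0.25, tau_v=0.5,
                  beta=2.5, a=2.5, u0=1.0):
    # Two Matsuoka neurons: u is the membrane state, v the adaptation
    # (self-inhibition) state, y = max(0, u) the firing rate. The neurons
    # inhibit each other with weight a and receive a tonic input u0.
    u = np.array([0.1, 0.0])            # small asymmetry starts the oscillation
    v = np.zeros(2)
    W = np.array([[0.0, a], [a, 0.0]])  # mutual inhibition
    out = []
    for _ in range(int(T / dt)):
        y = np.maximum(u, 0.0)
        du = (-u - beta * v - W @ y + u0) / tau_u
        dv = (-v + y) / tau_v
        u = u + dt * du
        v = v + dt * dv
        out.append(y[0] - y[1])         # antiphase output drives one joint
    return np.array(out)
```

With these parameters the two outputs alternate in antiphase; a full CPG wires five such oscillators together, and the GA then searches the 16 connection weights and five tonic inputs that constitute the CPG input.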

4.2 Experimental Results

In Fig. 9, the mark x denotes a condition under which the robot successfully climbs five stairs, and a circle denotes a condition given as representative data. Fig. 9 (a) to (c) show that successful conditions spread around the representative stairs to some degree. This means that the CPG by itself has a certain generalization capability, as already known. However, the generalization capability of the CPG is not large enough to cover stairs A, B, and C simultaneously. On the other hand, Fig. 9 (d) shows that MLP-I covers the intermediate conditions between stairs A, B, and C; the effect of virtual learning with NNs is clear. Fig. 9 (e) and (f) concern incremental learning. Fig. 9 (e) shows that the PNN succeeds at incremental learning; moreover, near stairs C the generalization capability of the PNN is larger than that of MLP-I. Fig. 9 (f) shows that MLP-II forgot the preceding learning of stairs A, B, and C and fails at incremental learning. It is known that incremental learning of an MLP with BP fails because the connection weights are overwritten by training with the new dataset. In the PNN, the main NN is constant after training with the initial dataset and therefore does not forget it; incremental learning is realized by the sub NNs. The open questions about the sub NNs are the effects of their adjustment of the connection weights, i.e., whether incremental learning succeeds and whether performance on the initial dataset decreases. The experimental results show that the PNN realizes incremental learning and that incremental learning even increases its generalization capability.

PNNs for Incremental Learning in Virtual Learning System

45

45

40

40

25 20

10

80 100 Depth [cm] (d) Trained NN with data A, B, and C

45

45

40

40

35

35

25 20

A

30

C

25 20

B

10 5 40

60

80 100 Depth [cm] (b) CPG without NN for data B

D

80 100 Depth [cm] (e) Trained proposed NN with data D after the three data

45

45

40

40

35

60

25 20

Height [cm]

35 C

30

30

15 10 80 100 Depth [cm] (c) CPG without NN for data C

C

20

10 60

A

25

15

5 40

60

15 B

10 5 40

B

5 40

80 100 Depth [cm] (a) CPG without NN for data A

15

Height [cm]

20

10 60

C

25

15

30

A

30

15

5 40

Height [cm]

Height [cm]

30

35 A

Height [cm]

Height [cm]

35

403

5 40

B

D

60

80 100 Depth [cm] (f) Trained conventional NN with data D after the three data

Fig. 9. A comparison of the generalization capability of CPG, MLP, and PNN in stair climbing

E. Inohira, H. Oonishi, and H. Yokoi

5 Conclusions

We proposed a new type of NN for incremental learning in a virtual learning system. Our proposed NN features externally adjusted weights and thresholds that slightly change the mapping of a trained NN. This paper demonstrated numerical experiments with a two-dimensional five-link biped robot and a stair-climbing task. We showed that PNN succeeds in incremental learning and has generalization capability to some degree. This study was limited to verifying our approach. In future work, we will study PNN with as much data as an actual robot needs and compare it with related work quantitatively.

References

1. Hirai, K., Hirose, M., Haikawa, Y., Takenaka, T.: The development of Honda humanoid robot. In: Proc. IEEE ICRA, vol. 2, pp. 1321–1326 (1998)
2. Mahadevan, S., Connell, J.: Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence 55, 311–365 (1992)
3. Kuniyoshi, Y., Sangawa, S.: Early motor development from partially ordered neural-body dynamics: experiments with a cortico-spinal-musculo-skeletal model. Biological Cybernetics 95, 589–605 (2006)
4. Yamashita, I., Yokoi, H.: Control of a biped robot by using several virtual environments. In: Proceedings of the 22nd Annual Conference of the Robotics Society of Japan, 1K25 (in Japanese) (2004)
5. Kawano, T., Yamashita, I., Yokoi, H.: Control of the bipedal robot generating the target by the simulation in virtual space (in Japanese). IEICE Technical Report 104, 65–69 (2004)
6. Haruno, M., Wolpert, D.M., Kawato, M.: MOSAIC model for sensorimotor learning and control. Neural Computation 13, 2201–2220 (2001)
7. Schaal, S., Atkeson, C.G.: Constructive incremental learning from only local information. Neural Computation 10, 2047–2084 (1998)
8. Yamauchi, K., Hayami, J.: Incremental learning and model selection for radial basis function network through sleep. IEICE Transactions on Information and Systems E90-D, 722–735 (2007)
9. Matsuoka, K.: Sustained oscillations generated by mutually inhibiting neurons with adaptation. Biological Cybernetics 52, 367–376 (1985)

Bifurcations of Renormalization Dynamics in Self-organizing Neural Networks

Peter Tiňo

University of Birmingham, Birmingham, UK
[email protected]

Abstract. Self-organizing neural networks (SONN) driven by softmax weight renormalization are capable of finding high quality solutions of difficult assignment optimization problems. The renormalization is shaped by a temperature parameter: as the system cools down, the assignment weights become increasingly crisp. It has been reported that the SONN search process can exhibit complex adaptation patterns as the system cools down. Moreover, there exists a critical temperature setting at which SONN is capable of powerful intermittent search through a multitude of high quality solutions represented as meta-stable states. To shed light on such observed phenomena, we present a detailed bifurcation study of the renormalization process. As SONN cools down, new renormalization equilibria emerge in a strong structure, leading to a complex skeleton of saddle-type equilibria surrounding an unstable maximum entropy point, with decision-enforcing "one-hot" stable equilibria. This, in synergy with the SONN input driving process, can lead to sensitivity to annealing schedules and adaptation dynamics exhibiting signatures of complex dynamical behavior. We also show that (as hypothesized in earlier studies) the intermittent search by SONN can occur only at temperatures close to the first (symmetry breaking) bifurcation temperature.

1 Introduction

For almost three decades there has been energetic research activity on the application of neural computation techniques to difficult combinatorial optimization problems. Self-organizing neural network (SONN) [1] constitutes an example of a successful neural-based methodology for solving 0-1 assignment problems. SONN has been successfully applied in a wide variety of applications, from assembly line sequencing to frequency assignment in mobile communications. As in most self-organizing systems, the dynamics of SONN adaptation is driven by a synergy of cooperation and competition. In the competition phase, for each item to be assigned, the best candidate for the assignment is selected and the corresponding assignment weight is increased. In the cooperation phase, the assignment weights of other candidates that were likely to be selected, but were not quite as strong as the selected one, get increased as well, albeit to a lesser degree.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 405–414, 2008. © Springer-Verlag Berlin Heidelberg 2008

406 P. Tiňo

The assignment weights need to be positive and sum to 1. Therefore, after each SONN adaptation phase, the assignment weights need to be renormalized back onto the standard simplex, e.g. via the softmax function [2]. When endowed with a physics-based Boltzmann distribution interpretation, the softmax function contains a temperature parameter T > 0. As the system cools down, the assignments become increasingly crisp. In the original setting, SONN is annealed so that a single high quality solution to an assignment problem is found. Yet, renormalization onto the standard simplex is a double-edged sword: on the one hand, SONN with assignment weight renormalization has empirically shown sensitivity to annealing schedules; on the other hand, the quality of solutions can be greatly improved [3]. Interestingly enough, it has been reported recently [4] that there exists a critical temperature T∗ at which SONN is capable of powerful intermittent search through a multitude of high quality solutions represented as meta-stable states of the SONN adaptation dynamics. It has been hypothesised that the critical temperature may be closely related to the symmetry breaking bifurcation of equilibria in the autonomous softmax dynamics. At present there is still no theory regarding the dynamics of SONN adaptation driven by the softmax renormalization. Consequently, the processes of crystallising a solution in an annealed version of SONN, or of sampling the solution space in the intermittent search regime, are far from being understood. The first steps towards a theoretical underpinning of SONN adaptation driven by softmax renormalization were taken in [5,4,6]. For example, in [5] SONN is treated as a dynamical system with a bifurcation parameter T. The cooperation phase was not included in the model. The renormalization process was empirically shown to result in complicated bifurcation patterns, revealing the complex nature of the search process inside SONN as the system gets annealed.
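As a minimal numerical illustration of how the temperature parameter T controls the crispness of the renormalized weights (a generic softmax sketch, not the full SONN):

```python
import numpy as np

def softmax(w, T):
    """Softmax renormalization with temperature T."""
    e = np.exp((w - w.max()) / T)  # shift by max for numerical stability
    return e / e.sum()

w = np.array([0.5, 0.3, 0.2])
hot = softmax(w, T=10.0)   # high temperature: close to uniform
cold = softmax(w, T=0.01)  # low temperature: close to one-hot
```

At T = 10 the output is nearly uniform (maximum entropy), while at T = 0.01 almost all mass sits on the largest coordinate, i.e. the assignment becomes crisp.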
More recently, Kwok and Smith [4] suggested studying SONN adaptation dynamics by concentrating on the autonomous renormalization process, since it is this process that underpins the search dynamics in the SONN. In [6] we initiated a rigorous study of equilibria of the autonomous renormalization process. Based on dynamical properties of the autonomous renormalization, we found analytical approximations to the critical temperature T∗ as a function of SONN size. In this paper we complement [6] by reporting a detailed bifurcation study of the renormalization process and give a precise characterization of the equilibria and their stability types, as they emerge during the annealing process. An interesting and intricate structure of equilibria emerges as the system cools down, explaining the empirically observed complexity of SONN adaptation during intermediate stages of the annealing process. The analysis also clarifies why the intermittent search by SONN occurs near the first (symmetry breaking) bifurcation temperature of the renormalization step, as was experimentally verified in [4,6]. Due to space limitations, we cannot fully prove the statements presented in this study; detailed proofs can be found in [7] and will be published elsewhere.

Bifurcations of Renormalization Dynamics in SONNs

2 Self-organizing Neural Network and Iterative Softmax

First, we briefly introduce the Self-Organizing Neural Network (SONN) endowed with weight renormalization for solving assignment optimization problems (see e.g. [4]). Consider a finite set of input elements (neurons) i ∈ I = {1, 2, ..., M} that need to be assigned to outputs (output neurons) j ∈ J = {1, 2, ..., N}, so that some global cost of an assignment A : I → J is minimized. The partial cost of assigning i ∈ I to j ∈ J is denoted by V(i, j). The "strength" of assigning i to j is represented by the "assignment weight" w_{i,j} ∈ (0, 1). The SONN algorithm can be summarized as follows. The connection weights w_{i,j}, i ∈ I, j ∈ J, are first initialized to small random values. Then, repeatedly, an output item jc ∈ J is chosen, and the partial costs V(i, jc) incurred by assigning all possible input elements i ∈ I to jc are calculated in order to select the "winner" input element (neuron) i(jc) ∈ I that minimizes V(i, jc). The "neighborhood" B_L(i(jc)) of size L of the winner node i(jc) consists of the L nodes i ≠ i(jc) that yield the smallest partial costs V(i, jc). Weights from nodes i ∈ B_L(i(jc)) to jc get strengthened:

w_{i,jc} ← w_{i,jc} + η(i)(1 − w_{i,jc}),

i ∈ B_L(i(jc)),  (1)

where η(i) is proportional to the quality of the assignment i → jc, as measured by V(i, jc). The weights w_i = (w_{i,1}, w_{i,2}, ..., w_{i,N}) of each input node i ∈ I are then renormalized using the softmax

w_{i,j} ← exp(w_{i,j}/T) / Σ_{k=1}^{N} exp(w_{i,k}/T).  (2)
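The competition/cooperation update (1) followed by the softmax renormalization (2) can be sketched as follows; the decaying learning-rate profile η(i) and the random toy data are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sonn_step(W, V, jc, L, eta0, T):
    """One SONN adaptation step for output jc: strengthen the winner and its
    L-neighborhood (Eq. (1)), then renormalize each row by softmax (Eq. (2)).
    The profile eta0/(rank+1) is a hypothetical choice for eta(i)."""
    costs = V[:, jc]                        # partial costs V(i, jc)
    ranked = np.argsort(costs)[:L + 1]      # winner i(jc) plus neighborhood B_L
    for rank, i in enumerate(ranked):
        eta = eta0 / (rank + 1)             # weaker update for weaker candidates
        W[i, jc] += eta * (1.0 - W[i, jc])  # Eq. (1)
    e = np.exp(W / T)
    return e / e.sum(axis=1, keepdims=True) # Eq. (2), applied row-wise

rng = np.random.default_rng(0)
W = rng.uniform(0.0, 0.1, size=(5, 4))      # assignment weights, M = 5, N = 4
V = rng.uniform(size=(5, 4))                # partial cost matrix (toy values)
W = sonn_step(W, V, jc=2, L=2, eta0=0.5, T=0.5)
```

After the step, each weight vector w_i again lies in the interior of the standard simplex (positive entries summing to 1).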

We will refer to the SONN for solving an (M, N)-assignment problem as (M, N)-SONN. As mentioned earlier, following [4,6] we strive to understand the search dynamics inside SONN by analyzing the autonomous dynamics of the renormalization update step (2) of the SONN algorithm. The weight vector w_i of each of the M neurons in an (M, N)-SONN lives in the standard (N − 1)-simplex

S_{N−1} = { w = (w_1, w_2, ..., w_N) ∈ R^N | w_i ≥ 0, i = 1, 2, ..., N, and Σ_{i=1}^{N} w_i = 1 }.

Given a value of the temperature parameter T > 0, the softmax renormalization step in SONN adaptation transforms the weight vector of each unit as follows:

w → F(w; T) = (F_1(w; T), F_2(w; T), ..., F_N(w; T))⊤,  (3)

where

F_i(w; T) = exp(w_i/T) / Z(w; T),  i = 1, 2, ..., N,  (4)

(Here ⊤ denotes the transpose operator.)

408

P. Tiˇ no

and Z(w; T) = Σ_{k=1}^{N} exp(w_k/T) is the normalization factor. Formally, F maps R^N to S^0_{N−1}, the interior of S_{N−1}. The linearization of F around w ∈ S^0_{N−1} is given by the Jacobian J(w; T):

J(w; T)_{i,j} = (1/T) [δ_{i,j} F_i(w; T) − F_i(w; T) F_j(w; T)],  i, j = 1, 2, ..., N,  (5)

where δ_{i,j} = 1 iff i = j and δ_{i,j} = 0 otherwise. The softmax map F induces on S^0_{N−1} a discrete-time dynamics known as Iterative Softmax (ISM):

w(t + 1) = F(w(t); T).  (6)

The renormalization step in an (M, N)-SONN adaptation involves M separate renormalizations of the weight vectors of all of the M SONN units. For each temperature setting T, the structure of equilibria in the i-th system, w_i(t + 1) = F(w_i(t); T), gets copied in all the other M − 1 systems. Using this symmetry, it is sufficient to concentrate on a single ISM (6). Note that the weights of different units are coupled by the SONN adaptation step (1). We will study systems for N ≥ 2.

3 Equilibria of SONN Renormalization Step

We first introduce basic concepts and notation that will be used throughout the paper. An (r − 1)-simplex is the convex hull of a set of r affinely independent points in R^m, m ≥ r − 1. A special case is the standard (N − 1)-simplex S_{N−1}. The convex hull of any nonempty subset of n vertices of an (r − 1)-simplex Δ, n ≤ r, is called an (n − 1)-face of Δ. There are C(r, n) distinct (n − 1)-faces of Δ, and each (n − 1)-face is an (n − 1)-simplex. Given a set of n vertices w_1, w_2, ..., w_n ∈ R^m defining an (n − 1)-simplex Δ in R^m, the central point

w̄(Δ) = (1/n) Σ_{i=1}^{n} w_i  (7)

is called the maximum entropy point of Δ. We will denote the set of all (n − 1)-faces of the standard (N − 1)-simplex S_{N−1} by P_{N,n}. The set of their maximum entropy points is denoted by Q_{N,n}, i.e.

Q_{N,n} = { w̄(Δ) | Δ ∈ P_{N,n} }.  (8)

The n-dimensional column vectors of 1's and 0's are denoted by 1_n and 0_n, respectively. Note that w̃_{N,n} = (1/n)(1_n, 0_{N−n}) ∈ Q_{N,n}. In addition, all the other elements of Q_{N,n} can be obtained by simply permuting the coordinates of w̃_{N,n}. Due to this symmetry, we will be able to develop most of the material using w̃_{N,n} only and then transfer the results to permutations of w̃_{N,n}. The maximum entropy point w̃_{N,N} = N^{−1} 1_N of the standard (N − 1)-simplex S_{N−1} will be denoted simply by w̄. To simplify the notation we will use w̄ to denote both the maximum entropy point of S_{N−1} and the vector w̄ − 0_N.

We showed in [6] that w̄ is a fixed point of ISM (6) for any temperature setting T, and that all the other fixed points w = (w_1, w_2, ..., w_N) of ISM have exactly two different coordinate values: w_i ∈ {γ1, γ2}, such that N^{−1} < γ1 < N1^{−1} and 0 < γ2 < N^{−1}, where N1 is the number of coordinates γ1 larger than N^{−1}. Since w ∈ S^0_{N−1}, we have

γ2 = (1 − N1 γ1) / (N − N1).  (9)

The number of coordinates γ2 smaller than N^{−1} is denoted by N2. Obviously, N2 = N − N1. If w = (γ1 1_{N1}, γ2 1_{N2}) is a fixed point of ISM (6), so are all C(N, N1) distinct permutations of it. We collect w and its permutations in a set

E_{N,N1}(γ1) = { v | v is a permutation of (γ1 1_{N1}, ((1 − N1 γ1)/(N − N1)) 1_{N−N1}) }.  (10)

The fixed points in E_{N,N1}(γ1) exist if and only if the temperature parameter T is set to [6]

T_{N,N1}(γ1) = (N γ1 − 1) · [ −(N − N1) · ln( 1 − (N γ1 − 1)/((N − N1) γ1) ) ]^{−1}.  (11)
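Equation (11) also gives a direct way to locate the bifurcation temperatures numerically: T_E(N, N1) is the maximum of T_{N,N1}(γ1) over γ1 ∈ (N^{−1}, N1^{−1}). A small grid-search sketch (the grid resolution is an arbitrary choice):

```python
import numpy as np

def T_of_gamma(N, N1, g1):
    """Temperature at which fixed points with larger coordinate g1 exist, Eq. (11)."""
    return (N * g1 - 1) / (-(N - N1) * np.log(1 - (N * g1 - 1) / ((N - N1) * g1)))

def TE(N, N1, grid=100000):
    """Bifurcation temperature: maximum of T_{N,N1}(g1) on (1/N, 1/N1)."""
    g1 = np.linspace(1.0 / N + 1e-9, 1.0 / N1 - 1e-9, grid)
    return T_of_gamma(N, N1, g1).max()

# For N = 10 (the setting of Fig. 2), TE decreases with N1, and the first
# (symmetry breaking) bifurcation temperature is TE(10, 1):
temps = [TE(10, N1) for N1 in (1, 2, 3, 4)]
```

The computed values reproduce the ordering TE(10, 1) > TE(10, 2) > TE(10, 3) > TE(10, 4) > N^{−1} visible in Fig. 2.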

We will show that, as the system cools down, an increasing number of equilibria emerge in a strong structure. Let w̄, v ∈ S_{N−1} be two points on the standard simplex. The line from w̄ to v is parametrized as

ℓ(τ; w̄, v) = w̄ + τ · (v − w̄),  τ ∈ [0, 1].  (12)

Theorem 1. All equilibria of ISM (6) lie on lines connecting the maximum entropy point w̄ of S_{N−1} with the maximum entropy points of its faces. In particular, for 0 < N1 < N and γ1 ∈ (N^{−1}, N1^{−1}), all fixed points from E_{N,N1}(γ1) lie on lines ℓ(τ; w̄, w̃), where w̃ ∈ Q_{N,N1}.

Sketch of the Proof: Consider the maximum entropy point w̃_{N,N1} = (1/N1)(1_{N1}, 0_{N2}) of an (N1 − 1)-face of S_{N−1}. Then w(γ1) = (γ1 1_{N1}, γ2 1_{N2}) lies on the line ℓ(τ; w̄, w̃_{N,N1}) for the parameter setting τ = 1 − N γ2. Q.E.D.

The result is illustrated in Figure 1. As the (M, N)-SONN cools down, the ISM equilibria emerge on lines connecting w̄ with the maximum entropy points of faces of S_{N−1} of increasing dimensionality. Moreover, on each such line there can be at most two ISM equilibria.

Theorem 2. For N1 < N/2, there exists a temperature TE(N, N1) > N^{−1} such that for T ∈ (0, TE(N, N1)], ISM fixed points in E_{N,N1}(γ1) exist for some γ1 ∈


Fig. 1. Positions of equilibria of SONN renormalization illustrated for the case of 4-dimensional weight vectors w. ISM is operating on the standard 3-simplex S3 and its equilibria can only be found on the lines connecting the maximum entropy point w̄ (filled circle) with the maximum entropy points of its faces. Triangles, squares and diamonds represent maximum entropy points of 0-faces (vertices), 1-faces (edges) and 2-faces (facets), respectively.

(N^{−1}, N1^{−1}), and no ISM fixed points in E_{N,N1}(γ1), for any γ1 ∈ (N^{−1}, N1^{−1}), can exist at temperatures T > TE(N, N1). For each temperature T ∈ (N^{−1}, TE(N, N1)), there are two coordinate values γ1^{−}(T) and γ1^{+}(T), N^{−1} < γ1^{−}(T) < γ1^{+}(T) < N1^{−1}, such that ISM fixed points in both E_{N,N1}(γ1^{−}(T)) and E_{N,N1}(γ1^{+}(T)) exist at temperature T. Furthermore, as the temperature decreases, γ1^{−}(T) decreases towards N^{−1}, while γ1^{+}(T) increases towards N1^{−1}. For temperatures T ∈ (0, N^{−1}], there is exactly one γ1(T) ∈ (N^{−1}, N1^{−1}) such that ISM equilibria in E_{N,N1}(γ1(T)) exist at temperature T.

Sketch of the Proof: The temperature function T_{N,N1}(γ1) (11) is concave and can be continuously extended to [N^{−1}, N1^{−1}) with T_{N,N1}(N^{−1}) = N^{−1} and lim_{γ1→N1^{−1}} T_{N,N1}(γ1) = 0 < N^{−1}. The slope of T_{N,N1}(γ1) at N^{−1} is positive for N1 < N/2. Q.E.D.

Theorem 3. The bifurcation temperature TE(N, N1) is decreasing with increasing number N1 of equilibrium coordinates larger than N^{−1}.

Sketch of the Proof: It can be shown that for any feasible value of γ1 > N^{−1}, if there are two fixed points w ∈ E_{N,N1}(γ1) and w′ ∈ E_{N,N1′}(γ1) of ISM such that N1 < N1′, then w exists at a higher temperature than w′ does. For a given N1 < N/2, the bifurcation temperature TE(N, N1) corresponds to the maximum of T_{N,N1}(γ1) on γ1 ∈ (N^{−1}, N1^{−1}). It follows that N1 < N1′ implies TE(N, N1) > TE(N, N1′). Q.E.D.


Theorem 4. If N/2 ≤ N1 < N, for each temperature T ∈ (0, N^{−1}), there is exactly one coordinate value γ1(T) ∈ (N^{−1}, N1^{−1}) such that ISM fixed points in E_{N,N1}(γ1(T)) exist at temperature T. No ISM fixed points in E_{N,N1}(γ1), for any γ1 ∈ (N^{−1}, N1^{−1}), can exist for temperatures T > N^{−1}. As the temperature decreases, γ1(T) increases towards N1^{−1}.

Sketch of the Proof: Similar to the proof of Theorem 2, but this time the slope of T_{N,N1}(γ1) at N^{−1} is not positive. Q.E.D.

Let us now summarize the process of creation of new ISM equilibria as the (M, N)-SONN cools down. For temperatures T > 1/2, the ISM has exactly one equilibrium: the maximum entropy point w̄ of S_{N−1} [6]. As the temperature is lowered and hits the first bifurcation point, TE(N, 1), new fixed points of ISM emerge on the lines ℓ(τ; w̄, w̃), w̃ ∈ Q_{N,1}, one on each line. The lines connect w̄ with the vertices of S_{N−1}. As the temperature decreases further, on each line the single fixed point splits into two fixed points; one moves towards w̄, the other moves towards the corresponding maximum entropy point w̃ in Q_{N,1} (vertex of S_{N−1}). When the temperature reaches the second bifurcation point, TE(N, 2), new fixed points of ISM emerge on the lines ℓ(τ; w̄, w̃), w̃ ∈ Q_{N,2}, one on each line. This time the lines connect w̄ with the maximum entropy points (midpoints) w̃ of the edges of S_{N−1}. Again, as the temperature continues decreasing, on each line the single fixed point splits into two fixed points; one moves towards w̄, the other moves towards the corresponding maximum entropy point w̃ in Q_{N,2} (midpoint of an edge of S_{N−1}). The process continues until the last bifurcation temperature TE(N, N1) is reached, where N1 is the largest natural number smaller than N/2. At TE(N, N1), new fixed points of ISM emerge on the lines ℓ(τ; w̄, w̃), w̃ ∈ Q_{N,N1}, connecting w̄ with maximum entropy points w̃ of (N1 − 1)-faces of S_{N−1}.
As the temperature continues decreasing, on each line the single fixed point splits into two fixed points; one moves towards w̄, the other moves towards the corresponding maximum entropy point w̃ in Q_{N,N1}. At temperatures below N^{−1}, only the fixed points moving towards the maximum entropy points of faces of S_{N−1} exist. In the low temperature regime, 0 < T < N^{−1}, a fixed point occurs on every line ℓ(τ; w̄, w̃), w̃ ∈ Q_{N,N1}, N1 = ⌈N/2⌉, ⌈N/2⌉ + 1, ..., N − 1. Here, ⌈x⌉ denotes the smallest integer y such that y ≥ x. As the temperature decreases, the fixed points w move towards the corresponding maximum entropy points w̃ ∈ Q_{N,N1} of (N1 − 1)-faces of S_{N−1}. The process of creation of new fixed points and their flow as the temperature cools down is demonstrated in Figure 2 for an ISM operating on the 9-simplex S9 (N = 10). We plot, against each temperature setting T, the values of the larger coordinate γ1 > N^{−1} = 0.1 of the fixed points existing at T. The behavior of ISM in the neighborhood of its equilibrium w is given by the structure of the stable and unstable manifolds of the linearized system at w, outlined in the next section.

[Figure 2 (page 412) is a plot; only its labels are recoverable from the extraction. Title: "Bifurcation structure of ISM equilibria (N=10)". Horizontal axis: temperature T (0 to 0.25); vertical axis: gamma_1 (0 to 1). The branches are labelled N1 = 1, 2, 3, 4, with the bifurcation temperatures TE(10,1), TE(10,2), TE(10,3) and TE(10,4) marked.]
Fig. 2. Demonstration of the process of creation of new ISM fixed points and their flow as the system temperature cools down. Here N = 10, i.e. the ISM operates on the standard 9-simplex S9. Against each temperature setting T, the values of the larger coordinate γ1 > N^{−1} of the fixed points existing at T are plotted. The horizontal bold line corresponds to the maximum entropy point w̄ = 10^{−1} 1_{10}.

4 Stability Analysis of Renormalization Equilibria

The maximum entropy point w̄ is not only a fixed point of ISM (6); regarded as a vector w̄ − 0_N, it is also an eigenvector of the Jacobian J(w; T) at any w ∈ S^0_{N−1}, with eigenvalue λ = 0. This simply reflects the fact that ISM renormalization acts on the standard simplex S_{N−1}, which is a subset of an (N − 1)-dimensional hyperplane with normal vector 1_N. We have already seen that w̄ plays a special role in the ISM equilibria structure: all equilibria lie on lines going from w̄ towards maximum entropy points of faces of S_{N−1}. The lines themselves are of special interest, since we will show that these lines are invariant manifolds of the ISM renormalization and their directional vectors are eigenvectors of the ISM Jacobians at the fixed points located on them.

Theorem 5. Consider ISM (6) and 1 ≤ N1 < N. Then for each maximum entropy point w̃ ∈ Q_{N,N1} of an (N1 − 1)-face of S_{N−1}, the line ℓ(τ; w̄, w̃), τ ∈ [0, 1), connecting the maximum entropy point w̄ with w̃, is an invariant set under the ISM dynamics.

Sketch of the Proof: The result follows from plugging the parametrization ℓ(τ; w̄, w̃) into (3) and realizing (after some manipulation) that for each τ ∈ [0, 1) there exists a parameter setting τ1 ∈ [0, 1) such that F(ℓ(τ; w̄, w̃)) = ℓ(τ1; w̄, w̃). Q.E.D.
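The eigenvalue structure discussed in this section can be checked numerically from the Jacobian (5). In the sketch below (with illustratively chosen N and T), a fixed point on the N1 = 1 branch is located by iterating the map, and the spectrum shows one numerically zero eigenvalue (the w̄ direction) and all remaining eigenvalues inside the unit circle, i.e. a stable "one-hot"-enforcing equilibrium:

```python
import numpy as np

def softmax_T(w, T):
    e = np.exp(w / T)
    return e / e.sum()

def jacobian(w, T):
    """Jacobian of the softmax map F at w, Eq. (5)."""
    F = softmax_T(w, T)
    return (np.diag(F) - np.outer(F, F)) / T

N, T = 4, 0.15                      # T well below the first bifurcation point
w = np.array([0.9, 0.1 / 3, 0.1 / 3, 0.1 / 3])
for _ in range(500):                # converge to the N1 = 1 fixed point
    w = softmax_T(w, T)
J = jacobian(w, T)
eigs = np.sort(np.abs(np.linalg.eigvals(J)))
```

Here eigs[0] is numerically zero (eigenvector w̄) and eigs[-1] < 1, so the equilibrium near the vertex is attracting.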


The proofs of the next two theorems are rather involved and we refer the interested reader to [7].

Theorem 6. Let w ∈ E_{N,N1}(γ1) be a fixed point of ISM (6). Then w* = w − w̄ is an eigenvector of the Jacobian J(w; T_{N,N1}(γ1)) with the corresponding eigenvalue λ*, where
1. if N/2 ≤ N1 ≤ N − 1, then 0 < λ* < 1;
2. if 1 ≤ N1 < N/2 and N^{−1} < γ1 < (2N1)^{−1}, then λ* > 1;
3. if 1 ≤ N1 < N/2, then there exists γ̄1 ∈ ((2N1)^{−1}, N1^{−1}) such that for all ISM fixed points w ∈ E_{N,N1}(γ1) with γ1 ∈ (γ̄1, N1^{−1}), 0 < λ* < 1.

We have established that for an ISM equilibrium w, both w̄ and w* = w − w̄ are eigenvectors of the ISM Jacobian at w. Stability types of the remaining N − 2 eigendirections are characterized in the next theorem.

Theorem 7. Consider an ISM fixed point w ∈ E_{N,N1}(γ1) for some 1 ≤ N1 < N and N^{−1} < γ1 < N1^{−1}. Then there are N − N1 − 1 and N1 − 1 eigenvectors of the Jacobian J(w; T_{N,N1}(γ1)) of ISM at w with the same associated eigenvalue 0 < λ− < 1 and λ+ > 1, respectively.

5 Discussion – SONN Adaptation Dynamics

In the intermittent search regime of SONN [4], the search is driven by pulling promising solutions temporarily to the vicinity of the 0-1 "one-hot" assignment values, i.e. the vertices (0-dimensional faces) of the standard simplex S_{N−1}. The critical temperature for intermittent search should correspond to the case where the attractive forces already exist in the form of attractive equilibria near the "one-hot" assignment suggestions (vertices of S_{N−1}), but the convergence rates towards such equilibria are still sufficiently weak that the intermittent character of the search is not destroyed. This occurs at temperatures lower than, but close to, the first bifurcation temperature TE(N, 1) (for more details, see [7]). In [4] it is hypothesised that there is a strong link between the critical temperature for intermittent search by SONN and bifurcation temperatures of the autonomous ISM. In [6] we hypothesised (in accordance with [4]) that even though there are many potential ISM equilibria, the critical bifurcation points are related only to equilibria near the vertices of S_{N−1}, as only those could be guaranteed by the theory of [6] (stability bounds) to be stable, even though the theory did not prevent the other equilibria from being stable. In this study, we have rigorously shown that the stable equilibria can in fact exist only near the vertices of S_{N−1}, on the lines connecting w̄ with the vertices. Only for N1 = 1 are there no expansive eigendirections of the local Jacobian with λ+ > 1. As the SONN system cools down, more and more ISM equilibria emerge on the lines connecting the maximum entropy point w̄ of the standard simplex S_{N−1} with the maximum entropy points of its faces of increasing dimensionality. With decreasing temperature, the dimensionality of the stable and unstable manifolds


of the linearized ISM at emerging equilibria decreases and increases, respectively. At lower temperatures, this creates a peculiar pattern of saddle-type equilibria surrounding the unstable maximum entropy point w̄, with decision-enforcing "one-hot" stable equilibria located near the vertices of S_{N−1}. The trajectory towards the solution as the SONN system anneals is shaped by this complex skeleton of saddle-type equilibria with stable/unstable manifolds of varying dimensionalities, and can therefore, in synergy with the input driving process, exhibit signatures of very complex dynamical behavior, as reported e.g. in [5]. Once the temperature is sufficiently low, the attraction rates of the stable equilibria near the vertices of S_{N−1} are so high that the found solution is virtually pinned down by the system. Even though the present study clarifies the prominent role of the first (symmetry breaking) bifurcation temperature TE(N, 1) in obtaining the SONN intermittent search regime and helps to understand the origin of complex SONN adaptation patterns in the annealing regime, many interesting open questions remain. For example, no theory as yet exists of the role of the abstract neighborhood B_L(i(jc)) of the winner node i(jc) in the cooperative phase of SONN adaptation. We conclude by noting that it may be possible to apply the theory of ISM in other assignment optimization systems that incorporate the softmax assignment weight renormalization, e.g. [8,9].

References

1. Smith, K., Palaniswami, M., Krishnamoorthy, M.: Neural techniques for combinatorial optimization with applications. IEEE Transactions on Neural Networks 9, 1301–1318 (1998)
2. Guerrero, F., Lozano, S., Smith, K., Canca, D., Kwok, T.: Manufacturing cell formation using a new self-organizing neural network. Computers & Industrial Engineering 42, 377–382 (2002)
3. Kwok, T., Smith, K.: Improving the optimisation properties of a self-organising neural network with weight normalisation. In: Proceedings of the ICSC Symposia on Intelligent Systems and Applications (ISA 2000), Paper No. 1513-285 (2000)
4. Kwok, T., Smith, K.: Optimization via intermittency with a self-organizing neural network. Neural Computation 17, 2454–2481 (2005)
5. Kwok, T., Smith, K.: A noisy self-organizing neural network with bifurcation dynamics for combinatorial optimization. IEEE Transactions on Neural Networks 15, 84–88 (2004)
6. Tiňo, P.: Equilibria of iterative softmax and critical temperatures for intermittent search in self-organizing neural networks. Neural Computation 19, 1056–1081 (2007)
7. Tiňo, P.: Bifurcation structure of equilibria of adaptation dynamics in self-organizing neural networks. Technical Report CSRP-07-12, University of Birmingham, School of Computer Science (2007), http://www.cs.bham.ac.uk/~pxt/PAPERS/ism.bifurc.tr.pdf
8. Gold, S., Rangarajan, A.: Softmax to softassign: Neural network algorithms for combinatorial optimization. Journal of Artificial Neural Networks 2, 381–399 (1996)
9. Rangarajan, A.: Self-annealing and self-annihilation: unifying deterministic annealing and relaxation labeling. Pattern Recognition 33, 635–649 (2000)

Variable Selection for Multivariate Time Series Prediction with Neural Networks

Min Han and Ru Wei

School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116023, China
[email protected]

Abstract. This paper proposes a variable selection algorithm based on neural networks for multivariate time series prediction. Sensitivity analysis of the neural network error function with respect to the inputs is developed to quantify the saliency of each input variable. The input nodes with low sensitivity are then pruned along with their connections, which corresponds to deleting the associated redundant variables. The proposed algorithm is tested on both computer-generated time series and practical observations. Experimental results show that the proposed algorithm outperforms another variable selection method by achieving a more significant reduction in the training data size and higher prediction accuracy.

Keywords: Variable selection, neural network pruning, sensitivity, multivariate prediction.

1 Introduction

Nonlinear and chaotic time series prediction is a practical technique for studying the characteristics of complicated dynamics from measurements. Usually, multivariate inputs are required, since the output may depend not only on its own previous values but also on the past values of other variables. However, we cannot be sure that all of the variables are equally important: some of them may be redundant or even irrelevant. If these unnecessary input variables are included in the prediction model, the parameter estimation process becomes more difficult, and the overall results may be poorer than if only the required inputs are used [1]. Variable selection addresses this problem by discarding the redundant variables, which reduces the number of input variables and the complexity of the prediction model. A number of variable selection methods based on statistical or heuristic tools have been proposed, such as Principal Component Analysis (PCA) and Discriminant Analysis. These techniques attempt to reduce the dimensionality of the data by creating new variables that are linear combinations of the original ones. The major difficulty comes from the separation of the variable selection process and the prediction process. Therefore, variable selection using a neural network is attractive, since one can globally adapt the variable selector together with the predictor. Variable selection with a neural network can be seen as a special case of architecture pruning [2], where the pruning of input nodes is equivalent to removing the corresponding

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 415–425, 2008. © Springer-Verlag Berlin Heidelberg 2008

416

M. Han and R. Wei

variables from the original data set. One approach to pruning is to estimate the sensitivity of the output to the exclusion of each unit. There are several ways to perform sensitivity analysis with neural networks. Most of them are weight-based [3], built on the idea that weights connected to important variables attain large absolute values, while weights connected to unimportant variables attain values somewhere near zero. However, smaller weights usually result in smaller inputs to neurons and thus larger sigmoid derivatives in general, which increases the output sensitivity to the input. Mozer and Smolensky [4] introduced a method which estimates which units are least important and can be deleted during training. Gevrey et al. [5] compute the partial derivatives of the neural network output with respect to the input neurons and compare the performance of several different methods for evaluating the relative contribution of the input variables. This paper concentrates on a neural-network-based variable selection algorithm as the tool to determine which variables are to be discarded. A simple sensitivity criterion of the neural network error function with respect to each input is developed to quantify the saliency of each input variable. The input nodes are then sorted in decreasing order of sensitivity, so that the neural network can be pruned efficiently by discarding the last items with low sensitivity. The variable selection algorithm is then applied to both computer-generated data and practical observations and is compared with the PCA variable reduction method. The rest of this paper is organized as follows. Section 2 reviews the basic concept of multivariate time series prediction and a statistical variable selection method. Section 3 explains the sensitivity analysis with neural networks in detail. Section 4 presents two simulation results. The work is finally concluded in Section 5.
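A gradient-based input sensitivity of the kind discussed above can be sketched as follows; the tiny network, its weights, and the specific squared-gradient measure are illustrative stand-ins, not the exact criterion developed later in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "trained" 3-5-1 MLP; column 0 is made dominant so that input 0 matters.
W1 = rng.normal(size=(5, 3))
W1[:, 1:] *= 0.01                  # inputs 1 and 2 barely influence the net
b1 = rng.normal(size=5)
W2, b2 = rng.normal(size=(1, 5)), rng.normal(size=1)

def input_sensitivity(X, y):
    """Mean squared gradient of the squared error w.r.t. each input variable."""
    S = np.zeros(X.shape[1])
    for x, t in zip(X, y):
        h = np.tanh(W1 @ x + b1)
        out = W2 @ h + b2
        dE_dout = out - t                              # d(0.5*(out-t)^2)/d out
        dE_dh = (W2.T @ dE_dout).ravel() * (1 - h**2)  # back through tanh
        S += (W1.T @ dE_dh) ** 2                       # back to the inputs
    return S / len(X)

X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0])                # target driven by the first variable
S = input_sensitivity(X, y)
ranking = np.argsort(S)[::-1]      # prune inputs at the tail of this ranking
```

Inputs whose sensitivity falls at the tail of the ranking are candidates for pruning, together with their connections.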

2 Modeling Multivariate Chaotic Time Series

The basic idea of chaotic time series analysis is that a complex system can be described by a strange attractor in its phase space. Therefore, the reconstruction of an equivalent state space is usually the first step in chaotic time series prediction.

2.1 Multivariate Phase Space Reconstruction

Phase space reconstruction from observations can be accomplished by choosing a suitable embedding dimension and time delay. Consider an M-dimensional time series {X_i, i = 1, 2, …, M}, where X_i = [x_i(1), x_i(2), …, x_i(N)]^T and N is the length of each scalar time series. As in the univariate case (M = 1), the reconstructed phase space can be formed as [6]:

X(t) = [x_1(t), x_1(t − τ_1), …, x_1(t − (d_1 − 1)τ_1), …,
        x_M(t), x_M(t − τ_M), …, x_M(t − (d_M − 1)τ_M)]      (1)

where t = L, L + 1, …, N, L = max_{1≤i≤M} (d_i − 1)τ_i + 1, and τ_i and d_i (i = 1, 2, …, M) are the time delays and embedding dimensions of each time series, respectively. The delay time τ_i can be calculated using the mutual information method, and the embedding dimension with the false nearest neighbor method.
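For concreteness, the reconstruction of Eq.(1) can be sketched in a few lines; the function name and the row-per-time-step layout are our own choices, not from the paper:

```python
import numpy as np

def embed_multivariate(series, delays, dims):
    """Multivariate delay embedding of Eq.(1).

    series : list of M one-dimensional arrays, each of length N
    delays : the M delay times tau_i
    dims   : the M embedding dimensions d_i
    Returns an array whose row for time t is
    [x_1(t), x_1(t - tau_1), ..., x_M(t - (d_M - 1) tau_M)].
    """
    N = len(series[0])
    # earliest usable (0-based) index, i.e. L - 1 of the paper
    start = max((d - 1) * tau for d, tau in zip(dims, delays))
    rows = []
    for t in range(start, N):
        row = []
        for x, tau, d in zip(series, delays, dims):
            row.extend(x[t - j * tau] for j in range(d))
        rows.append(row)
    return np.array(rows)
```

Each row then serves as one delayed input vector for the predictor.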

Variable Selection for Multivariate Time Series Prediction with Neural Networks

417

According to Takens' embedding theorem, if D = ∑_{i=1}^{M} d_i is large enough, there exists a mapping F such that X(t+1) = F{X(t)}. The evolution X(t) → X(t+1) then reflects the evolution of the original dynamical system. The problem is to find an appropriate expression for the nonlinear mapping F. Many chaotic time series prediction models have been developed to date; neural networks are widely used because of their universal approximation capabilities.

2.2 Neural Network Model

A multilayer perceptron (MLP) trained with the back propagation (BP) algorithm is used as the nonlinear predictor for multivariate chaotic time series. The MLP is trained by supervised learning to minimize the mean square error between the computed output of the neural network and the desired output. The network usually consists of three layers: an input layer, one or more hidden layers and an output layer. Consider a three-layer MLP with one hidden layer. The D-dimensional delayed time series X(t) is used as the input of the network to generate the network output X(t+1). The neural network can then be expressed as follows:

o_j = f( ∑_{i=0}^{N_I} x_i w_ij^(I) )      (2)

y_k = ∑_{j=1}^{N_H} w_jk^(O) o_j      (3)

where [x_1, x_2, …, x_{N_I}] = X(t) denotes the input signal, N_I is the number of inputs to the neural network, w_ij^(I) is the weight from the ith input neuron to the jth hidden neuron, o_j is the output of the jth hidden neuron, N_H is the number of neurons in the hidden layer, [y_1, y_2, …, y_{N_O}] = X(t+1) is the output, N_O is the number of output neurons, and w_jk^(O) is the weight from the jth hidden neuron to the kth output neuron. The activation function f(·) is the sigmoid function given by

f(x) = 1 / (1 + exp(−x))      (4)

The error function of the net is usually defined as the sum of squared errors

E = ∑_{t=1}^{N} ∑_{k=1}^{N_O} [y_k(t) − p_k(t)]^2      (5)

where p_k(t) is the desired output for unit k and N is the length of the training sample.
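The forward pass of Eqs.(2)–(3) and the error of Eq.(5) can be sketched as below. Treating the i = 0 term of Eq.(2) as a bias unit fixed at 1 is our assumption; the function names are ours:

```python
import numpy as np

def sigmoid(x):
    """Eq.(4): logistic activation."""
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W_in, W_out):
    """Three-layer MLP of Eqs.(2)-(3).

    x     : input vector of length N_I (a bias of 1 is prepended for i = 0,
            an assumed reading of the paper's sum starting at i = 0)
    W_in  : (N_I + 1, N_H) input-to-hidden weights w_ij^(I)
    W_out : (N_H, N_O) hidden-to-output weights w_jk^(O)
    """
    x = np.concatenate(([1.0], x))   # bias term for i = 0
    o = sigmoid(x @ W_in)            # hidden activations o_j, Eq.(2)
    y = o @ W_out                    # linear outputs y_k, Eq.(3)
    return o, y

def sum_square_error(Y, P):
    """Eq.(5): squared error summed over samples and output units."""
    return float(np.sum((Y - P) ** 2))
```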


2.3 Statistical Variable Selection Method

For a multivariate time series, the dimension of the reconstructed phase space is usually very high, and increasing the number of input variables raises the complexity of the prediction model. Therefore, in many practical applications, variable selection is needed to reduce the dimensionality of the input data. The aim of variable selection in this paper is to select a subset of R inputs that retains most of the important features of the original input set; the remaining D − R irrelevant inputs are discarded.

Principal Component Analysis (PCA) is a traditional technique for variable reduction [7]. PCA first decomposes the normalized input vector X(t) with the singular value decomposition (SVD)

X = U Σ V^T      (6)

where Σ = diag[s_1, s_2, …, s_p, 0, …, 0], s_1 ≥ s_2 ≥ … ≥ s_p are the singular values of X arranged in decreasing order, and U and V are both orthogonal matrices. The first k singular values are preserved as the principal components. The reduced input can then be obtained as

Z = U_k^T X      (7)

where U_k consists of the first k columns of U. PCA is an efficient method to reduce the input dimension. However, we cannot be sure that the discarded factors have no influence on the prediction output, because the variable selection and prediction processes are carried out separately. A neural network selector is a good choice to combine the selection and prediction processes.
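The reduction of Eqs.(6)–(7) amounts to a truncated SVD. A minimal sketch, with the inputs stored as columns of X (a layout assumption of ours):

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce D-dimensional inputs (columns of X) to k dimensions.

    Follows Eqs.(6)-(7): X = U S V^T, then Z = U_k^T X, where U_k holds
    the k leading left singular vectors.
    X : (D, N) matrix of normalized input column vectors
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]        # k leading principal directions
    return U_k.T @ X      # (k, N) reduced representation
```

For data of numerical rank k, the reduction loses nothing: the Frobenius norm of Z equals that of X.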

3 Sensitivity Analysis with Neural Networks

Variable selection with neural networks can be achieved by pruning the input nodes of a neural network model based on some saliency measure, aiming to remove the less relevant variables. The significance of a variable can be defined as the error when the unit is removed minus the error when it is left in place:

S_i = E_{without unit i} − E_{with unit i} = E(x_i = 0) − E(x_i = x_i)      (8)

where E is the error function defined in Eq.(5). After the neural network has been trained, a brute-force pruning method is, for every input, to set the input to zero and evaluate the change in the error; if the error increases too much the input is restored, otherwise it is removed. In principle this can be done by training the network under all possible subsets of the input set, but such an exhaustive search is computationally infeasible and can be very slow for a large network. This paper follows the idea of Mozer and Smolensky [4] and approximates the sensitivity by introducing a gating term α_i for each unit such that

o_j = f( ∑_i w_ij α_i o_i )      (9)

where o_j is the activity of unit j and w_ij is the weight from unit i to unit j.

Variable Selection for Multivariate Time Series Prediction with Neural Networks

419

The gating terms α are shown in Fig. 1, where α_i^I (i = 1, 2, …, N_I) is the gating term of the ith input neuron and α_j^H (j = 1, 2, …, N_H) is the gating term of the jth hidden neuron.


Fig. 1. The gating term for each unit

The gating term α is merely a notational convenience rather than a parameter that must be implemented in the net. If α = 0, the unit has no influence on the network; if α = 1, the unit behaves normally. The importance of a unit is then approximated by the derivative

S_i = − ∂E/∂α_i |_{α_i = 1}      (10)

By using the standard error back-propagation algorithm, the derivative in Eq.(10) can be expressed in terms of the network weights as follows:

S_j^H = − ∂E/∂α_j^H = − ∑_{t=1}^{N} ∑_{k=1}^{N_O} (∂E/∂y_k)(∂y_k/∂α_j^H)
      = ∑_{t=1}^{N} ∑_{k=1}^{N_O} [ (p_k(t) − y_k(t)) w_jk^(O) o_j ]      (11)

S_i^I = − ∂E/∂α_i^I = − ∑_{t=1}^{N} ∑_{k=1}^{N_O} (∂E/∂y_k)(∂y_k/∂α_i^I)
      = ∑_{t=1}^{N} ∑_{k=1}^{N_O} [ (p_k(t) − y_k(t)) ∑_{j=1}^{N_H} w_jk^(O) o_j (1 − o_j) w_ij^(I) x_i(t) ]      (12)

where S_i^I is the sensitivity of the ith input neuron and S_j^H is the sensitivity of the jth hidden neuron. The algorithm can thus prune the input nodes as well as the hidden nodes according to their sensitivities during training. However, the sensitivity fluctuates strongly when calculated directly from Eq.(11) and Eq.(12), because of the approximation in Eq.(10); occasionally an input may be deleted incorrectly. In order to reduce the dimensionality of the input vectors reliably, the sensitivity needs to be evaluated over the entire training set. This paper considers several ways to define the overall sensitivity:

(1) The mean square average sensitivity:

S_{i,avg} = (1/N) ∑_{t=1}^{N} S_i(t)^2      (13)

where N is the number of samples in the training set.


(2) The absolute value average sensitivity:

S_{i,abs} = (1/N) ∑_{t=1}^{N} |S_i(t)|      (14)

(3) The maximum absolute sensitivity:

S_{i,max} = max_{1≤t≤N} |S_i(t)|      (15)

Any of the sensitivity measures in Eqs.(13)–(15) can provide a useful criterion to determine which inputs are to be deleted. For succinctness, this paper uses the mean square average sensitivity as an example. An input with a low sensitivity has little or no influence on the prediction accuracy and can therefore be removed. In order to get a more efficient criterion for pruning inputs, the sensitivity is normalized. Define the absolute sum of the sensitivities over all input nodes

S = ∑_{i=1}^{N_I} S_i      (16)

Then the normalized sensitivity of each unit can be defined as

Ŝ_i = S_i / S      (17)

where the normalized value Ŝ_i lies in [0, 1]. The input variables are then arranged in order of decreasing sensitivity:

Ŝ_1 ≥ Ŝ_2 ≥ … ≥ Ŝ_{N_I}      (18)

The larger values of Ŝ_i (i = 1, 2, …, N_I) indicate the important variables. Define the sum of the first k terms of the sensitivity, η_k, as

η_k = ∑_{j=1}^{k} Ŝ_j      (19)

where k = 1, 2, …, N_I. Choosing a threshold value 0 < η_0 < 1, the smallest k with η_k > η_0 determines how many inputs are preserved as the principal ones; the trailing inputs with low sensitivity are removed. The number of variables retained increases as the threshold η_0 increases.
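The pruning criterion of Eqs.(13) and (16)–(19) can be sketched as one function; the array-based formulation and names are our own:

```python
import numpy as np

def select_inputs(S_t, eta0=0.98):
    """Rank inputs by sensitivity and keep the smallest leading set whose
    cumulative normalized sensitivity exceeds eta0 (Eqs. 13, 16-19).

    S_t  : (N, N_I) array of per-sample sensitivities S_i(t)
    Returns (indices of kept inputs, normalized sensitivities).
    """
    S_avg = np.mean(S_t ** 2, axis=0)          # Eq.(13), mean square average
    S_hat = S_avg / np.sum(S_avg)              # Eqs.(16)-(17), normalization
    order = np.argsort(S_hat)[::-1]            # decreasing order, Eq.(18)
    eta = np.cumsum(S_hat[order])              # Eq.(19), cumulative sums
    k = int(np.searchsorted(eta, eta0) + 1)    # smallest k with eta_k > eta0
    return order[:k], S_hat
```

With a dominant input and eta0 = 0.98, only that input survives the cut.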

4 Simulations

In this section, two simulations are carried out on both computer-generated data and practical observed data to demonstrate the performance of the variable selection method proposed in this paper. The simulation results are then compared with the PCA method. The prediction performance is evaluated by two error criteria [8]: the Root Mean Square Error E_RMSE and the Prediction Accuracy E_PA:

E_RMSE = ( (1/(N − 1)) ∑_{t=1}^{N} [P(t) − O(t)]^2 )^{1/2}      (20)

E_PA = ∑_{t=1}^{N} [(P(t) − P_m)(O(t) − O_m)] / [(N − 1) σ_P σ_O]      (21)

where O(t) is the target value, P(t) is the predicted value, O_m is the mean value of O(t), σ_O is the standard deviation of O(t), and P_m and σ_P are the mean value and standard deviation of P(t), respectively. E_RMSE reflects the absolute deviation between the predicted and observed values, while E_PA denotes the correlation coefficient between them. In the ideal situation of error-free prediction, E_RMSE = 0 and E_PA = 1.

4.1 Prediction of Lorenz Time Series

The first data set is derived from the Lorenz system, given by three differential equations:

dx(t)/dt = a(−x(t) + y(t))
dy(t)/dt = b x(t) − y(t) − x(t) z(t)
dz(t)/dt = x(t) y(t) − c z(t)      (22)

where the typical values for the coefficients are a = 10, b = 28, c = 8/3 and the initial values are x(0) = 12, y(0) = 2, z(0) = 9. 1500 points of x(t), y(t) and z(t) obtained by the fourth-order Runge–Kutta method are used as the training sample and 500 points as the testing sample. In order to extract the dynamics of this system to predict x(t+1), the parameters for phase-space reconstruction are chosen as τ_x = τ_y = τ_z = 3 and m_x = m_y = m_z = 9. Thus an MLP neural network with 27 input nodes, one hidden layer of 20 neurons and one output node is considered, and a back propagation training algorithm is used.

After the training of the MLP neural network is stopped, sensitivity analysis is carried out to evaluate the contribution of each input variable to the error function of the neural network. The trajectories of the sensitivity through training for each input are shown in Fig. 2. It can be seen that the sensitivity fluctuates during training and finally converges when the weights and error are steady. The normalized sensitivity measures in Eq.(17) are calculated and shown in Fig. 3. A threshold η_0 = 0.98 is chosen to determine which inputs are discarded; the input dimension of the neural network is thereby reduced to 11. The original input matrix is replaced by the reduced input matrix and the structure of the neural network is simplified. The prediction performance over the testing samples with the reduced inputs is shown in Fig. 4.
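Data like the above can be generated with a standard fourth-order Runge–Kutta integrator, e.g. as sketched below. The step size dt is an assumption (the paper does not state it), and the coefficients follow the standard chaotic Lorenz set a = 10, b = 28, c = 8/3 for the system as written in Eq.(22):

```python
import numpy as np

def lorenz_rk4(n, dt=0.01, a=10.0, b=28.0, c=8.0 / 3.0, x0=(12.0, 2.0, 9.0)):
    """Integrate the Lorenz system of Eq.(22) with fourth-order Runge-Kutta."""
    def f(s):
        x, y, z = s
        return np.array([a * (-x + y), b * x - y - x * z, x * y - c * z])

    s = np.array(x0, dtype=float)
    out = np.empty((n, 3))
    for i in range(n):
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        s = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        out[i] = s
    return out
```

The three columns of the returned array would supply the x(t), y(t) and z(t) series for the embedding step.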


Fig. 2. The trajectories of the input sensitivity through training

Fig. 3. The normalized sensitivity for each input node

Fig. 4. The observed and predicted values of the Lorenz x(t) time series

The solid line in Fig. 4 represents the observed values and the dashed line the predicted values. It can be seen from Fig. 4 that the chaotic behavior of the x(t) time series is well predicted and the errors between the observed and predicted values are small. The prediction performances are listed in Table 1 and compared with the PCA variable reduction method.

Table 1. Prediction performance of the x(t) time series

              With All Variables   PCA Selection   NN Selection
Input Nodes           27                11              11
E_RMSE              0.1278            0.1979          0.0630
E_PA                0.9998            0.9997          1.0000

The prediction performances in Table 1 are comparable for the neural network variable selection method and the PCA method, while the algorithm proposed in this paper achieves the best prediction accuracy.
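For reference, the two criteria of Eqs.(20)–(21) used throughout the comparisons can be computed as follows; function names are ours:

```python
import numpy as np

def rmse(P, O):
    """Eq.(20): root mean square error with the paper's 1/(N-1) convention."""
    P, O = np.asarray(P, float), np.asarray(O, float)
    return float(np.sqrt(np.sum((P - O) ** 2) / (len(O) - 1)))

def prediction_accuracy(P, O):
    """Eq.(21): correlation coefficient between predicted and observed values."""
    P, O = np.asarray(P, float), np.asarray(O, float)
    N = len(O)
    num = np.sum((P - P.mean()) * (O - O.mean()))
    return float(num / ((N - 1) * P.std(ddof=1) * O.std(ddof=1)))
```

A perfect prediction gives rmse = 0 and prediction_accuracy = 1, matching the ideal case described above.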


4.2 Prediction of the Rainfall Time Series

Rainfall is an important variable in hydrological systems, and the chaotic character of rainfall time series has been demonstrated in many papers [9]. In this section, the simulation is carried out on the monthly rainfall time series of the city of Dalian, China over a period of 660 months (from 1951 to 2005). Since rainfall may be influenced by many factors, five other time series are also considered: temperature, air pressure, humidity, wind speed and sunlight. Following Takens' theorem, the embedding phase space is first reconstructed with dimensions and delay times m_1 = m_2 = … = m_6 = 9 and τ_1 = τ_2 = … = τ_6 = 3. The input of the neural network then contains L = 660 − (9 − 1) × 3 = 636 data points. In the experiments, this data set is divided into a training set composed of the first 436 points and a testing set containing the remaining 200 points. The neural network used here contains 54 input nodes, 20 hidden nodes and 1 output node. The threshold is again chosen as η_0 = 0.98. The trajectories of the input sensitivity and the normalized sensitivity for every input are shown in Fig. 5 and Fig. 6, respectively; 34 input nodes are retained according to the sensitivity values.
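The input sizing above follows directly from the embedding parameters; a quick check, with variable names of our own choosing:

```python
# Sizing sketch for the rainfall experiment: six series, each embedded with
# dimension m = 9 and delay tau = 3 over 660 monthly observations.
n_obs, m, tau, n_series = 660, 9, 3, 6

n_points = n_obs - (m - 1) * tau   # usable reconstructed points: 636
n_inputs = n_series * m            # input nodes of the network: 54
n_train = 436
n_test = n_points - n_train        # remaining testing points: 200
```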

Fig. 5. The trajectories of the input sensitivity through training

Fig. 6. The normalized sensitivity for each input node

The observed and predicted values of the rainfall time series are shown in Fig. 7, which demonstrates high prediction accuracy. It can be seen from the figures that the chaotic behavior of the rainfall time series is well predicted and the errors between the observed and predicted values are small. The corresponding values of E_RMSE and E_PA are given in Table 2. Both the figures and the error criteria indicate that the result for multivariate chaotic time series using neural-network-based variable selection is much better than the results with all variables or with the PCA method. It can be concluded from the two simulations that the variable selection algorithm using neural networks is able to capture the dynamics of both computer-generated and practical time series accurately and gives high prediction accuracy.


Fig. 7. The observed and predicted values of the rainfall time series

Table 2. Prediction performance of the rainfall time series

              With All Variables   PCA Selection   NN Selection
Input Nodes           54                43              31
E_RMSE             22.2189           21.0756         18.1435
E_PA                0.9217            0.9286          0.9529

5 Conclusions

This paper studies a variable selection algorithm that uses sensitivity analysis to prune the input nodes of a neural network model. A simple and effective criterion for identifying input nodes to be removed is derived which does not require high computational cost and proves to work well in practice. The validity of the method was examined through a multivariate prediction problem, and a comparison study was made with other variable selection methods. The experimental results encourage the application of the proposed method to complex tasks that need to identify significant input variables.

Acknowledgements. This research is supported by the project (60674073) of the National Natural Science Foundation of China, the project (2006CB403405) of the National Basic Research Program of China (973 Program) and the project (2006BAB14B05) of the National Key Technology R&D Program of China. All of these supports are appreciated.

References

[1] Verikas, A., Bacauskiene, M.: Feature selection with neural networks. Pattern Recognition Letters 23, 1323–1335 (2002)
[2] Castellano, G., Fanelli, A.M.: Variable selection using neural network models. Neurocomputing 31, 1–13 (2000)


[3] Castellano, G., Fanelli, A.M., Pelillo, M.: An iterative pruning algorithm for feedforward neural networks. IEEE Trans. Neural Networks 8(3), 519–531 (1997)
[4] Mozer, M.C., Smolensky, P.: Skeletonization: a technique for trimming the fat from a network via relevance assessment. NIPS 1, 107–115 (1989)
[5] Gevrey, M., Dimopoulos, I., Lek, S.: Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecol. Model. 160, 249–264 (2003)
[6] Cao, L.Y., Mees, A., Judd, K.: Dynamics from multivariate time series. Physica D 121, 75–88 (1998)
[7] Han, M., Fan, M., Xi, J.: Study of Nonlinear Multivariate Time Series Prediction Based on Neural Networks. In: Wang, J., Liao, X.-F., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3497, pp. 618–623. Springer, Heidelberg (2005)
[8] Chen, J.L., Islam, S., Biswas, P.: Nonlinear dynamics of hourly ozone concentrations: nonparametric short term prediction. Atmospheric Environment 32(11), 1839–1848 (1998)
[9] Liu, D.L., Scott, B.J.: Estimation of solar radiation in Australia from rainfall and temperature observations. Agricultural and Forest Meteorology 106(1), 41–59 (2001)

Ordering Process of Self-Organizing Maps Improved by Asymmetric Neighborhood Function

Takaaki Aoki¹, Kaiichiro Ota², Koji Kurata³, and Toshio Aoyagi¹,²

¹ CREST, JST, Kyoto 606-8501, Japan
² Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
³ Faculty of Engineering, University of the Ryukyus, Okinawa 903-0213, Japan
[email protected]

Abstract. The Self-Organizing Map (SOM) is an unsupervised learning method based on neural computation, which has recently found wide applications. However, the learning process sometimes has multiple stable states, in which the map is trapped in an undesirable disordered state containing topological defects. These topological defects critically degrade the performance of the SOM. To overcome this problem, we propose introducing an asymmetric neighborhood function into the SOM algorithm. Compared with the conventional symmetric one, the asymmetric neighborhood function accelerates the ordering process even in the presence of a defect. However, this asymmetry tends to generate a distorted map, which can be suppressed by an improved method of applying the asymmetric neighborhood function. In the case of the one-dimensional SOM, the number of steps required for perfect ordering is numerically shown to be reduced from O(N³) to O(N²).

Keywords: Self-Organizing Map, Asymmetric Neighborhood Function, Fast ordering.

1 Introduction

The Self-Organizing Map (SOM) is an unsupervised learning method, a type of nonlinear principal component analysis [1]. Historically, it was proposed as a simplified neural network model having the essential properties needed to reproduce topographic representations observed in the brain [2,3,4,5]. The SOM algorithm can be used to construct an ordered mapping from input stimulus data onto a two-dimensional array of neurons according to the topological relationships between various characters of the stimulus. This implies that the SOM algorithm is capable of extracting the essential information from complicated data. From the viewpoint of applied information processing, the SOM algorithm can be regarded as a generalized, nonlinear type of principal component analysis and has proven valuable in the fields of visualization, compression and data mining. Based on a simple, biologically motivated learning rule, this algorithm behaves as an unsupervised

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 426–435, 2008.
© Springer-Verlag Berlin Heidelberg 2008


Fig. 1. A: An example of a topological defect in a two-dimensional array of SOM with a uniform rectangular input space. The triangle point indicates the conflicting point in the feature map. B: Another example of topological defect in a one-dimensional array with scalar input data. The triangle points also indicate the conflicting points.

learning method and provides robust performance without delicate tuning of the learning conditions. However, there is a serious problem of multi-stability or meta-stability in the learning process [6,7,8]. When the learning process is trapped in one of these states, the map appears for practical purposes to have converged to its final state. However, some of these states are undesirable as solutions of the learning procedure; typically the map has topological defects, as shown in Fig. 1A. The map in Fig. 1A is twisted, with a topological defect at the center. In this situation, the two-dimensional array of the SOM should be arranged in the square space, since the input data are taken uniformly from the square space. But this topological defect is a global conflicting point, which is difficult to remove by local modulations of the reference vectors of the units. It can therefore require an enormous number of learning steps to rectify the topological defect. Thus, the existence of a topological defect critically degrades the performance of the SOM algorithm.

To avoid the emergence of topological defects, several conventional and empirical methods have been used. However, it is more desirable that the SOM algorithm work well without tuning any model parameters, even when a topological defect has emerged. We therefore consider a simple method which enables an effective ordering procedure of the SOM in the presence of a topological defect, and propose an asymmetric neighborhood function which effectively removes the defect [9]. In the process of removing a topological defect, the conflicting point must be moved out toward the boundary of the array, where it vanishes. The motion of the defect is therefore essential for the efficiency of the ordering process. With the original symmetric neighborhood, the movement of the defect resembles a random-walk stochastic process, which is inefficient. By introducing asymmetry into the neighborhood function, the movement behaves like a drift, which enables faster ordering. For this reason, in this paper we investigate the effect of an asymmetric neighborhood function on the performance of the SOM algorithm for one-dimensional and two-dimensional SOMs.

2 Methods

2.1 SOM

The SOM constructs a mapping from the input data space to an array of nodes, called the 'feature map'. To each node i, a parametric 'reference vector' m_i is assigned. Through SOM learning, these reference vectors are rearranged according to the following iterative procedure. An input vector x(t) is presented at each time step t, and the best matching unit, whose reference vector is closest to the given input vector x(t), is chosen. The best matching unit c, called the 'winner', is given by c = arg min_i ||x(t) − m_i||. In other words, the data point x(t) in the input data space is mapped onto the node c associated with the reference vector closest to x(t). In SOM learning, the update rule for the reference vectors is given by

m_i(t + 1) = m_i(t) + α · h(r_ic)[x(t) − m_i(t)],   r_ic ≡ r_i − r_c      (1)

where the learning rate α is some small constant. The function h(r) is called the 'neighborhood function', where r_ic is the distance from the position r_c of the winner node c to the position r_i of node i on the array of units. A widely used neighborhood function is the Gaussian, defined by h(r_ic) = exp(−r_ic² / 2σ²). We expect an ordered mapping after iterating the above procedure a sufficient number of times.

2.2 Asymmetric Neighborhood Function

We now introduce a method to transform any given symmetric neighborhood function into an asymmetric one (Fig. 2A). Let us define an asymmetry parameter β (β ≥ 1), representing the degree of asymmetry, and a unit vector k indicating the direction of asymmetry. If a unit i is located in the positive direction along k, the component of its distance from the winner parallel to k is scaled by 1/β. If a unit i is located in the negative direction, the parallel component of the distance is scaled by β. Hence the asymmetric function h_β(r), transformed from its symmetric counterpart h(r), is described by

h_β(r_ic) = [2 / (β + β⁻¹)] · h(r̃_ic),   r̃_ic = { sqrt(β⁻² r_∥² + r_⊥²)   (r_ic · k ≥ 0)
                                                { sqrt(β²  r_∥² + r_⊥²)   (r_ic · k < 0)      (2)

where r̃_ic is the scaled distance from the winner, r_∥ is the component of r_ic projected onto k, and r_⊥ denotes the remaining components perpendicular to k. In addition, in order to single out the effect of asymmetry, the overall area of the neighborhood function, ∫_{−∞}^{∞} h(r) dr, is preserved under this transformation. In the special case of the asymmetry parameter β = 1, h_β(r) is equal to the original symmetric function h(r). Figure 2B displays an example of an asymmetric Gaussian neighborhood function in the two-dimensional array of the SOM.
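Equations (1) and (2) can be combined into a single 1-D update step as sketched below, taking k = +1 along the array and the parameter values quoted in the paper (α = 0.05, σ = 50, β = 1.5); all names are our own:

```python
import numpy as np

def h_beta(r, sigma=50.0, beta=1.5):
    """Asymmetric Gaussian neighborhood of Eq.(2) for a 1-D array.

    r is the signed unit distance r_i - r_c with the asymmetric direction
    k = +1: positive-side distances are scaled by 1/beta, negative-side
    distances by beta. The prefactor 2/(beta + 1/beta) preserves the
    overall area of the symmetric Gaussian.
    """
    r_scaled = np.where(r >= 0, r / beta, r * beta)
    return 2.0 / (beta + 1.0 / beta) * np.exp(-r_scaled ** 2 / (2.0 * sigma ** 2))

def som_step(m, x, alpha=0.05, sigma=50.0, beta=1.5):
    """One 1-D SOM update of Eq.(1) with the asymmetric neighborhood."""
    c = int(np.argmin(np.abs(m - x)))        # winner unit
    r = np.arange(len(m), dtype=float) - c   # signed distances r_i - r_c
    return m + alpha * h_beta(r, sigma, beta) * (x - m)
```

Setting beta=1.0 recovers the ordinary symmetric Gaussian neighborhood.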

Fig. 2. A: Method of generating an asymmetric neighborhood function by scaling the distance r_ic asymmetrically. The degree of asymmetry is parameterized by β. The distance of a node in the positive direction of the asymmetry vector k is scaled by 1/β; the distance in the negative direction is scaled by β. The asymmetric function is therefore described by h_β(r_ic) = [2/(β + 1/β)] h(r̃_ic), where r̃_ic is the scaled distance of node i from the winner c. B: An example of an asymmetric Gaussian function. C: An illustration of the improved algorithm for the asymmetric neighborhood function.

Next, we introduce an improved algorithm for the asymmetric neighborhood function. The asymmetry of the neighborhood function tends to distort the feature map, so that the density of units no longer represents the probability density of the input data. Therefore, two additional procedures are introduced. The first is an inversion operation on the direction of the asymmetric neighborhood function. As illustrated in Fig. 2C, the direction of the asymmetry is flipped to the opposite direction after every time interval T, which is expected to average out the distortion in the feature map. Note that the interval T should be set larger than the typical ordering time for the asymmetric neighborhood function. The second procedure decreases the degree of asymmetry of the neighborhood function: since for β = 1 the neighborhood function equals the original symmetric function, β is decreased toward 1 over the time steps, as illustrated in Fig. 2C. In our numerical simulations, we adopt a linearly decreasing function.

2.3 Numerical Simulations

In the following sections, we test the learning procedure of the SOM with sample data to examine the performance of the ordering process. The sample data are generated from a random variable with a uniform distribution. In the case of the one-dimensional SOM, the distribution is uniform over the range [0, 1]. We use the Gaussian function as the original symmetric neighborhood function. The model parameters in SOM learning are as follows: the total number of units N = 1000, the learning rate α = 0.05 (constant), and the neighborhood radius σ = 50. The asymmetry parameter is β = 1.5 and the asymmetric direction k is set to the positive direction along the array. The interval period T for flipping the asymmetric direction is 3000. In the case of the two-dimensional SOM (2D → 2D map), the input data are taken uniformly from the square space [0, 1] × [0, 1]. The model parameters are the same as in the one-dimensional SOM, except that the total number of units is N = 900 (30 × 30) and σ = 5. The asymmetric direction k is taken along (1, 0), which can be chosen arbitrarily. In the numerical simulations below, we also confirmed that the results hold for other model parameters and other forms of the neighborhood function.

2.4 Topological Order and Distortion of the Feature Map

To examine the ordering process of the SOM, let us consider two measures which characterize properties of the feature map. One is the 'topological order' η, quantifying the order of the reference vectors in the SOM array. The units of the SOM array should be arranged according to their reference vectors m_i. In the presence of a topological defect, most of the units satisfy local ordering; however, the defect violates the global ordering, and the feature map is divided into fragments of ordered domains within which the units satisfy the order condition. The topological order η can therefore be defined as the ratio of the maximum domain size to the total number of units N:

η ≡ max_l N_l / N      (3)

where N_l is the size of domain l. In the case of the one-dimensional SOM, the order condition for the reference vectors is defined as m_{i−1} ≤ m_i ≤ m_{i+1} or m_{i−1} ≥ m_i ≥ m_{i+1}. In the case of the two-dimensional SOM referred to in the previous section, the order condition is defined by means of the vector product a_{(i,i)} ≡ (m_{(i+1,i)} − m_{(i,i)}) × (m_{(i,i+1)} − m_{(i,i)}). Within an ordered domain, the vector products a_{(i,i)} of the units have the same sign, because the reference vectors are arranged in the sample space ordered by the positions of the units.

The other measure is the 'distortion' χ, which quantifies the distortion of the feature map. The asymmetry of the neighborhood function tends to distort the distribution of the reference vectors away from the correct probability density of the input vectors. For example, when the probability density of the input vectors is uniform, an asymmetric neighborhood function produces a non-uniform distribution of reference vectors. Hence, to measure the non-uniformity of the distribution of reference vectors, we define the distortion χ as the coefficient of variation of the size distribution of the units' Voronoi cells:

χ = sqrt(Var(Δ_i)) / E(Δ_i)      (4)

where Δ_i is the size of the Voronoi cell of unit i. To eliminate the boundary effect of the SOM algorithm, the Voronoi cells on the edges of the array are excluded. When the reference vectors are distributed uniformly, the distortion χ converges to 0. Since the evaluation of Voronoi cells in the two-dimensional SOM is time-consuming, we approximate the size of a Voronoi cell by the magnitude of the vector product a_{(i,i)}. If the feature map is formed uniformly, this approximate value also converges to 0.
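The two measures of Eqs.(3)–(4) can be sketched for the one-dimensional case as follows; the run-length computation of η and the interior-cell convention for χ are our simplifications of the definitions above:

```python
import numpy as np

def topological_order(m):
    """Eq.(3): largest monotonically ordered run over N, for a 1-D map."""
    d = np.diff(m)
    best = cur = 1
    for i in range(1, len(d)):
        # consecutive slopes of the same sign keep the order condition
        cur = cur + 1 if d[i] * d[i - 1] >= 0 else 1
        best = max(best, cur)
    return (best + 1) / len(m)   # best counts slopes; +1 converts to units

def distortion(m):
    """Eq.(4): coefficient of variation of interior Voronoi cell sizes."""
    m = np.sort(m)
    edges = (m[1:] + m[:-1]) / 2.0   # 1-D Voronoi boundaries: midpoints
    cells = np.diff(edges)           # interior cells; edge cells excluded
    return float(np.std(cells) / np.mean(cells))
```

A perfectly ordered map gives η = 1, and a uniformly spaced map gives χ = 0, matching the limiting cases described in the text.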

3 Results

3.1 One-Dimensional Case

In this section, we investigate the ordering process of SOM learning in the presence of a topological defect for the symmetric, asymmetric, and improved asymmetric neighborhood functions. For this purpose, we use an initial condition in which a single topological defect appears at the center of the array. Because the density of input vectors is uniform, the desirable feature map is a linear arrangement of SOM nodes. Figure 3A shows a typical time development of the reference vectors m_i. In the case of the symmetric neighborhood function, a single defect remains around the center of the array even after 10000 steps. In contrast, in the case of the asymmetric one, this defect moves out to the right so that the reference vectors are ordered within 3000 steps. This phenomenon can also be confirmed in Fig. 3B, which shows the time dependence of the topological order η. In the case of the asymmetric neighborhood function, η rapidly converges to 1 (the completely ordered state) within 3000 steps, whereas the process of eliminating the last defect takes a large amount of time (∼18000 steps) for the symmetric one. On the other hand, one problem arises in the feature map obtained with the asymmetric neighborhood function. After 10000 steps, the distribution of the reference vectors in the feature map develops an unusual bias (Fig. 3A). Figure 3C shows the time dependence of the distortion χ during learning. In the case of the symmetric neighborhood function, χ eventually converges to almost 0. This result indicates that the feature map obtained with the symmetric one has an almost uniform size distribution of Voronoi cells. In contrast, in the case of the asymmetric one, χ converges to a finite value (≠ 0). Although the asymmetric neighborhood function accelerates the ordering process of SOM learning, the resultant map becomes distorted, which makes it unusable for applications. Therefore, the improved asymmetric method is introduced, as described in Method.
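The update rule being compared can be sketched as follows. This is a minimal 1D sketch; the exact asymmetric neighborhood function used here is defined in the Method section, so the shifted-Gaussian form and the `shift` parameter below are illustrative assumptions, not the paper's definition:

```python
import math

def som_step(m, x, eta=0.1, sigma=2.0, shift=0.0):
    """One SOM update for a list of 1D reference vectors m and scalar input x.

    shift = 0 gives the usual symmetric Gaussian neighborhood; a nonzero
    shift displaces the peak of the neighborhood away from the winner,
    giving a (hypothetical) asymmetric neighborhood.
    """
    winner = min(range(len(m)), key=lambda i: abs(m[i] - x))
    return [mi + eta * math.exp(-((i - winner - shift) ** 2) / (2 * sigma ** 2)) * (x - mi)
            for i, mi in enumerate(m)]

# the winner's reference vector always moves toward the input
m = [0.2, 0.5, 0.9]
m2 = som_step(m, 0.6)
assert abs(m2[1] - 0.6) < abs(m[1] - 0.6)
```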
Using this improved algorithm, χ converges to almost 0, just as for the symmetric one (Fig. 3C). Furthermore, as shown in Fig. 3B, the improved algorithm preserves the faster ordering. Therefore, by utilizing the improved algorithm with the asymmetric neighborhood function, we obtain the full benefit of both fast ordering and an undistorted feature map. To quantify the performance of the ordering process, let us define the 'ordering time' as the time at which η reaches 1. Figure 4A shows the ordering

432

T. Aoki et al.

Fig. 3. The asymmetric neighborhood function enhances the ordering process of SOM. A: A typical time development of the reference vectors m_i for the symmetric, asymmetric, and improved asymmetric neighborhood functions. B: The time dependence of the topological order η. The standard deviations are denoted by error bars, which cannot be seen because they are smaller than the plotted symbols. C: The time dependence of the distortion χ.

time as a function of the total number of units N for both the improved asymmetric and symmetric neighborhood functions. The ordering time is found to scale roughly as N^3 and N^2 for the symmetric and improved asymmetric neighborhood functions, respectively. For a detailed discussion of the reduction of the ordering time, refer to Aoki & Aoyagi (2007). Figure 4B shows the dependence of the ordering time on the width of the neighborhood function; it indicates that the ordering time is proportional to (N/σ)^k, with k = 2 for the asymmetric and k = 3 for the symmetric neighborhood function. This result implies that the combined use of the asymmetric method and an annealing method for the width of the neighborhood function is even more effective.

3.2 Two-Dimensional Case

In this section, we investigate the effect of the asymmetric neighborhood function on two-dimensional SOM (2D → 2D map). Figure 5 shows that a similar fast

Fig. 4. Ordering time as a function of the total number of units N. The fitting function is described by Const. · N^γ; the fitted exponents are N^{2.989±0.002} for the symmetric and N^{1.917±0.005} for the improved asymmetric neighborhood function.

Fig. 5. A: A typical time development of the reference vectors in a two-dimensional SOM array for the cases of symmetric, asymmetric, and improved asymmetric neighborhood functions. B: The time dependence of the topological order η. C: The time dependence of the distortion χ.

Fig. 6. Distribution of ordering times when the initial reference vectors are generated randomly. The white bin at the right of each graph indicates the population of failed trials that did not converge to the perfectly ordered state within 50000 steps.

ordering process can be realized with an asymmetric neighborhood function in two-dimensional SOM. The initial state has a global topological defect, in which the map is twisted at the center. In this situation, the conventional symmetric neighborhood function has trouble correcting the twisted map: because of its local stability, this topological defect is never corrected, even with a huge number of learning iterations. The asymmetric neighborhood function is also effective in overcoming such a topological defect, as in the case of one-dimensional SOM. However, the same problem of a distorted map occurs. Therefore, by using the improved asymmetric neighborhood function, the feature map converges to the completely ordered map in much less time and without any distortion. In the previous simulations, we considered a simple situation in which a single defect exists around the center of the feature map as an initial condition, in order to investigate the ordering process in the presence of the topological defect. However, when the initial reference vectors are set randomly, the total number of topological defects appearing in the map is not generally equal to one. Therefore, it is necessary to consider the statistical distribution of the ordering time, because the total number of topological defects and the convergence process generally depend on the initial conditions. Figure 6 shows the distribution of the ordering time when the initial reference vectors are randomly selected from the uniform distribution on [0, 1] × [0, 1]. In the case of the symmetric neighborhood function, some trials could not converge to the ordered state, being trapped in undesirable meta-stable states in which the topological defects are never rectified. Therefore, although a fast ordering process is observed in some successful cases (lucky initial conditions), the map formed with the symmetric one depends strongly on the initial conditions.
In contrast, for the improved asymmetric neighborhood function, the distribution of the ordering time has a single sharp peak, and the desired feature map is constructed stably without any tuning of the initial condition.

4 Conclusion

In this paper, we discussed the learning process of the self-organizing map, especially in the presence of a topological defect. Interestingly, even in the presence


of the defect, we found that the asymmetry of the neighborhood function enables the system to accelerate the learning process. Compared with the conventional symmetric one, the convergence time of the learning process is roughly reduced from O(N^3) to O(N^2) in one-dimensional SOM (N is the total number of units). Furthermore, this acceleration with the asymmetric neighborhood function is also effective in the case of two-dimensional SOM (2D → 2D map). In contrast, the conventional symmetric one cannot rectify the twisted feature map even with a sheer number of iteration steps, due to its local stability. These results suggest that the proposed method can be effective for more general cases of SOM, which is the subject of future study.

Acknowledgments. This work was supported by Grants-in-Aid for Scientific Research from the Ministry of Education, Science, Sports, and Culture of Japan: grant numbers 18047014, 18019019, and 18300079.

References
1. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43(1), 59–69 (1982)
2. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (London) 160(1), 106–154 (1962)
3. Hubel, D.H., Wiesel, T.N.: Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol. 158(3), 267–294 (1974)
4. von der Malsburg, C.: Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14(2), 85–100 (1973)
5. Takeuchi, A., Amari, S.: Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybern. 35(2), 63–72 (1979)
6. Erwin, E., Obermayer, K., Schulten, K.: Self-organizing maps: stationary states, metastability and convergence rate. Biol. Cybern. 67(1), 35–45 (1992)
7. Geszti, T., Csabai, I., Cazokó, F., Szakács, T., Serneels, R., Vattay, G.: Dynamics of the Kohonen map. In: Statistical Mechanics of Neural Networks: Proceedings of the XIth Sitges Conference, pp. 341–349. Springer, New York (1990)
8. Der, R., Herrmann, M., Villmann, T.: Time behavior of topological ordering in self-organizing feature mapping. Biol. Cybern. 77(6), 419–427 (1997)
9. Aoki, T., Aoyagi, T.: Self-organizing maps with asymmetric neighborhood function. Neural Comput. 19(9), 2515–2535 (2007)

A Characterization of Simple Recurrent Neural Networks with Two Hidden Units as a Language Recognizer

Azusa Iwata¹, Yoshihisa Shinozawa¹, and Akito Sakurai¹,²

¹ Keio University, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
² CREST, Japan Science and Technology Agency

Abstract. We give a necessary condition for a simple recurrent neural network with two sigmoidal hidden units to implement a recognizer of the formal language {a^n b^n | n > 0}, which is generated by the set of generating rules {S → aSb, S → ab}, and show that by setting the parameters so as to conform to the condition we obtain a recognizer of the language. The condition implies the instability of the learning process reported in previous studies. The condition also implies, contrary to its success in implementing the recognizer, the difficulty of obtaining a recognizer of more complicated languages.

1 Introduction

Pioneered by Elman [6], much research has been conducted on grammar learning by recurrent neural networks. A grammar is defined by generating or rewriting rules such as S → Sa and S → b. S → Sa means that "the letter S should be rewritten to Sa," and S → b means that "the letter S should be rewritten to b." If more than one rule is applicable, we have to try all the possibilities. A string generated this way to which no further rewriting rule is applicable is called a sentence. The set of all possible sentences is called the language generated by the grammar. Although the word "language" here would be better termed formal language, in contrast to natural language, we follow the custom of the formal language theory field (e.g., [10]). In everyday usage a sentence is a sequence of words, whereas in the above example it is a string of characters; nevertheless, the essence is common. The concept that a language is defined based on a grammar in this way has been a major paradigm in a wide range of formal language studies and related fields. A study of grammar learning focuses on restoring a grammar from finite samples of sentences of the language associated with the grammar. In contrast to the ease of generating sample sentences, grammar learning from sample sentences is a hard problem, and in fact it is impossible except in some very restrictive cases, e.g., when the language is finite. As is well known, humans do learn languages whose grammars are very complicated. The difference between the two situations might be attributed to the possible existence of some unknown restrictions on the types of natural language grammars in our brain. Since a neural network is a model of

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 436–445, 2008. © Springer-Verlag Berlin Heidelberg 2008


brain, some researchers think that neural networks could learn a grammar from exemplar sentences of a language. The research on grammar learning by neural networks is characterized by:
1. a focus on learning of self-embedding rules, since simpler rules that can be expressed by finite state automata are understood to be learnable,
2. the use of simple recurrent neural networks (SRNs) as the basic mechanism, and
3. the adoption of languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0}, which are clearly the results of, though not representative of, self-embedding rules, as target languages.
Here a^n is a string of n repetitions of a character a, the language {a^n b^n | n > 0} is generated by the grammar {S → ab, S → aSb}, and {a^n b^n c^n | n > 0} is generated by a context-sensitive grammar. The adoption of simple languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0} as target languages is inevitable, since it is not easy to see whether enough generalization is achieved by the learning if realistic grammars are used. Although these languages are simple, many times what the networks learned was just what they were taught, not a grammar; i.e., they did almost rote learning. In other words, their generalization capability in learning was limited. Also, their learned results were unstable, in the sense that when they were given new training sentences which were longer than the ones they had learned but still in the same language, the learned network changed unexpectedly. The change was more than just a refinement of the learned network. Considering these situations, we may doubt that there really exists a solution, or correct learned network. Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), Chalup et al. ([5]), and others tried to clarify the reasons and explore the possibility of network learning of the languages. But their results were not conclusive and did not give clear conditions that the learned networks satisfy in common.
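For concreteness, derivation in the grammar {S → aSb, S → ab} and a counter-style membership check for the resulting language can be sketched as follows (the helper names are ours, not from the paper):

```python
def derive(n: int) -> str:
    # apply S -> aSb (n - 1) times, then S -> ab once
    return "ab" if n == 1 else "a" + derive(n - 1) + "b"

def in_language(w: str) -> bool:
    # counter-style check for {a^n b^n | n > 0}: count up on 'a', down on 'b'
    count = 0
    seen_b = False
    for c in w:
        if c == "a":
            if seen_b:            # an 'a' after a 'b' is illegal
                return False
            count += 1
        elif c == "b":
            seen_b = True
            count -= 1
            if count < 0:         # more b's than a's so far
                return False
        else:
            return False
    return seen_b and count == 0

assert derive(3) == "aaabbb"
assert in_language("aabb") and not in_language("aab")
```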
We will in this paper describe a condition that SRNs with two hidden units that have learned to recognize the language {a^n b^n | n > 0} have in common, that is, a condition that has to be met for an SRN to qualify as having successfully learned the language. The condition implies, by the way, the instability of learning results. Moreover, by utilizing the condition, we realize a language recognizer. By doing this we found that learning recognizers of languages more complicated than {a^n b^n | n > 0} is hard. An RNN (recurrent neural network) is a type of network that has recurrent connections and feedforward network connections. The calculation is done at once for the feedforward part, and after a one-time-unit delay the recurrent connections give rise to additional inputs (i.e., additional to the external inputs) to the feedforward part. An RNN is a kind of discrete-time system. Starting from an initial state (the initial outputs, i.e., outputs without external inputs, of the feedforward part), the network proceeds to accept the next character of a string given to the external inputs, reaches the final state, and obtains the final external output from the final state. An SRN (simple recurrent network) is a simple type of RNN and has only one layer of hidden units in its feedforward part.


Rodriguez et al. [13] showed that SRN learns the languages {a^n b^n | n > 0} (or, more correctly, its subset {a^i b^i | n ≥ i > 0}) and {a^n b^n c^n | n > 0} (or, more correctly, {a^i b^i c^i | n ≥ i > 0}). For {a^n b^n | n > 0}, they found an SRN that generalized to n ≤ 16 after being trained for n ≤ 11. They analyzed how the languages are processed by the SRN, but the results were not conclusive, so it is still an open problem whether it is really possible to realize recognizers of languages such as {a^n b^n | n > 0} or {a^n b^n c^n | n > 0}, and whether an SRN could learn or implement more complicated languages. Siegelmann [16] showed that an RNN has computational ability superior to a Turing machine. But the difficulty of learning {a^n b^n | n > 0} by SRN suggests that at least SRN might not be able to realize a Turing machine. The difference might be attributed to the difference in the output functions of the neurons: a piecewise linear function in Siegelmann's case and a sigmoidal function in standard SRN cases, since we cannot find other substantial differences. On the other hand, Casey [4] and Maass [12] showed that in a noisy environment an RNN is equivalent to a finite automaton or a less powerful device. These results suggest that exact (i.e., infinite-precision) computation is to be considered when we study the possibility of computations by RNNs or, specifically, SRNs. Therefore, in the research we conducted and report in this paper, we have adopted RNN models with infinite-precision calculation and with a sigmoidal function as the output function of neurons. From the viewpoints above, we discuss two things in this paper: a necessary condition for an SRN with two hidden units to be a recognizer of the language {a^n b^n | n > 0}; and the fact that the condition is sufficient enough to guide us in building an SRN recognizing the language.

2 Preliminaries

SRN (simple recurrent network) is a simple type of recurrent neural network and its function is expressed by

s_{n+1} = σ(w_s · s_n + w_x · x_n),   N_n(s_n) = w_{os} · s_n + w_{oc},

where σ is a standard sigmoid function, σ(x) = (1 − exp(−x))/(1 + exp(−x)) (i.e., tanh(x/2)), applied component-wise. A counter is a device which keeps an integer, allows +1 and −1 operations, and answers yes or no to an inquiry whether its content is 0 (0-test). A stack is a device which allows the operations push i (store i) and pop (recall the last-stored content, discard it, and restore the device as if it were just before the corresponding push operation). Clearly a stack is more general than a counter, so that if a counter is not implementable, a stack is not either. Elman used the SRN as a predictor, but Rodriguez et al. [13] and others consider it as a counter. In this paper we mainly take the latter viewpoint. We will first explain how a predictor is used as a recognizer or a counter. When we try to train an SRN to recognize a language, we train it to predict correctly the next character to come in an input string (see e.g. [6]). As usual, we adopted the "one-hot vector" representation for the network output. A one-hot vector is a vector representation in which a single element is one and the others


are zero. The network is trained so that the sum of the squared errors (differences between the actual output and the desired output) is minimized. By this method, if two possible outputs with the same occurrence frequency exist for the same input in the training data, the network will learn to output 0.5 for two elements of the output vector, with the others being 0, since this output vector gives the minimum of the sum of the squared errors. It is easily seen that if a network correctly predicts the next character to come for a string in the language {a^n b^n | n > 0}, we can see that it behaves as a counter with limited function. Let us add a new network output whose value is positive if the original network predicts only a to come (which happens only when the input string up to that time was a^n b^n for some n; this has been common practice since Elman's work) and is negative otherwise. The modified network seems to count up for a character a and count down for b, since it outputs a positive value when the numbers of a's and b's coincide and a negative value otherwise. Its counting capability, though, is limited, since it can output any value when an a is fed before the due number of b's has been fed, that is, when a count-up action is required before the counter returns to the 0-state. A (discrete-time) dynamical system is represented as the iteration of a function application: s_{i+1} = f(s_i), i ∈ N, s_i ∈ R^n. A point s is called a fixed point of f if f(s) = s. A point s is an attracting fixed point of f if s is a fixed point and there exists a neighborhood U_s around s such that lim_{i→∞} f^i(x) = s for all x ∈ U_s. A point s is a repelling fixed point of f if s is an attracting fixed point of f^{−1}. A point s is called a periodic point of f if f^n(s) = s for some n. A point s is an ω-limit point of x for f if s = lim_{i→∞} f^{n_i}(x) for some sequence n_i with lim_{i→∞} n_i = ∞.
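As a concrete illustration of attracting and repelling fixed points, consider the toy one-dimensional map f(x) = tanh(2x) (our example, not from the paper): 0 is a fixed point with |f′(0)| = 2 > 1, hence repelling, and iteration from any small positive start converges to the attracting fixed point x* ≈ 0.9575 with f(x*) = x*:

```python
import math

def f(x):
    # toy 1D map: repelling fixed point at 0, attracting ones near +-0.9575
    return math.tanh(2 * x)

# iterate from a point near the repelling fixed point
x = 0.1
for _ in range(200):
    x = f(x)

print(round(x, 4))             # close to 0.9575, the attracting fixed point
assert abs(f(x) - x) < 1e-9    # x is (numerically) a fixed point

# near 0 the map expands distances, so the fixed point 0 is repelling
eps = 1e-6
assert abs(f(eps)) > abs(eps)
```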
A fixed point x of f is hyperbolic if all of the eigenvalues of Df at x have absolute values different from one, where Df = [∂f_i/∂x_j] is the Jacobian matrix of first partial derivatives of the function f. A set D is invariant under f if for any s ∈ D, f(s) ∈ D. The following theorem plays an important role in the current paper.

Theorem 1 (Stable Manifold Theorem for a Fixed Point [9]). Let f : R^n → R^n be a C^r (r ≥ 1) diffeomorphism with a hyperbolic fixed point x. Then there are local stable and unstable manifolds W_loc^{s,f}(x), W_loc^{u,f}(x), tangent to the eigenspaces E_x^{s,f}, E_x^{u,f} of Df at x and of corresponding dimension. W_loc^{s,f}(x) and W_loc^{u,f}(x) are as smooth as the map f, i.e., of class C^r.

The local stable and unstable manifolds for f are defined as follows:

W_loc^{s,f}(q) = {y ∈ U_q | lim_{m→∞} dist(f^m(y), q) = 0},
W_loc^{u,f}(q) = {y ∈ U_q | lim_{m→∞} dist(f^{−m}(y), q) = 0},

where U_q is a neighborhood of q and dist is a distance function. The global stable and unstable manifolds for f are then defined as W^{s,f}(q) = ∪_{i≥0} f^{−i}(W_loc^{s,f}(q)) and W^{u,f}(q) = ∪_{i≥0} f^i(W_loc^{u,f}(q)). As defined, an SRN is a pair of a discrete-time dynamical system s_{n+1} = σ(w_s · s_n + w_x · x_n) and an external output part N_n = w_{os} · s_n + w_{oc}. We simply


write the former (the dynamical system part) as s_{n+1} = f(s_n, x_n) and the external output part as h(s_n). When an RNN (or SRN) is used as a recognizer of the language {a^n b^n | n > 0}, as described in the Introduction, it is seen as a counter where the input character a is for a count-up operation (i.e., +1) and b is for a count-down operation (i.e., −1). In the following we may use x_+ for a and x_− for b. For abbreviation, in the following, we use f_+ = f(·, x_+) and f_− = f(·, x_−). Please note that f_− is undefined for points outside and on the border of the square (I[−1, 1])^2, where I[−1, 1] is the closed interval [−1, 1]. In the following, though, we do not mention this, for simplicity. D_0 is the set {s | h(s) ≥ 0}, that is, the region where the counter value is 0. Let D_i = f_−^{−i}(D_0), that is, the region where the counter value is i. We postulate that f_+(D_i) ⊆ D_{i+1}. This means that any point in D_i is eligible for a state where the counter content is i. This may seem rather demanding. An alternative would be that the point p is for counter content c if and only if p = f_−^{m_1} ∘ f_+^{p_1} ∘ ⋯ ∘ f_−^{m_i} ∘ f_+^{p_i}(s_0) for a predefined s_0, some m_j ≥ 0 and p_j ≥ 0 for 1 ≤ j ≤ i, and i ≥ 0 such that Σ_{j=1}^{i}(p_j − m_j) = c. This, unfortunately, has not led to a fruitful result. We also postulate that the D_i's are disjoint. Since we defined D_i as a closed set, the postulate is natural. The point is therefore that we have chosen D_i to be closed. The postulate requires that we keep a margin between D_0 and D_1, and between the others.

3 A Necessary Condition

We consider only SRNs with two hidden units, i.e., all the vectors concerning s, such as w_s, s_n, w_{os}, are two-dimensional.

Definition 2. D_ω is the set of the accumulation points of ∪_{i≥0} D_i, i.e., s ∈ D_ω iff s = lim_{i→∞} s_{k_i} for some s_{k_i} ∈ D_{k_i}.

Definition 3. P_ω is the set of ω-limit points of points in D_0 for f_+, i.e., s ∈ P_ω iff s = lim_{i→∞} f_+^{k_i}(s_0) for some k_i and s_0 ∈ D_0. Q_ω is the set of ω-limit points of points in D_0 for f_−^{−1}, i.e., s ∈ Q_ω iff s = lim_{i→∞} f_−^{−k_i}(s_0) for some k_i and s_0 ∈ D_0.

Considering the results obtained in Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), and Chalup et al. ([5]), it is natural, at least for a first consideration, to postulate that f_+^i(x) and f_−^i(x) are not wandering, so that they converge to periodic points. Therefore P_ω and Q_ω are postulated to be finite sets of hyperbolic periodic points for f_+ and f_−, respectively. In the following, though, for simplification of the presentation, we postulate that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively. Moreover, according to the same literature, the points in Q_ω are saddle points for f_−, so we further postulate that W_loc^{u,f_−}(q) and W_loc^{s,f_−}(q) for q ∈ Q_ω are one-dimensional, where their existence is guaranteed by Theorem 1.


Postulate 4. We postulate that f_+(D_i) ⊆ D_{i+1}, the D_i's are disjoint, P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively, and W_loc^{u,f_−}(q) and W_loc^{s,f_−}(q) for q ∈ Q_ω are one-dimensional.

Lemma 5. f_−^{−1} ∘ f_+(D_ω) = D_ω; f_−^{−1}(D_ω ∩ (I(−1, 1))^2) = D_ω and D_ω ∩ (I(−1, 1))^2 = f_−(D_ω); and f_+(D_ω) ⊆ D_ω. P_ω ⊆ D_ω and Q_ω ⊆ D_ω.

Definition 6. W^{u,−1}(q) is the global unstable manifold at q ∈ Q_ω for f_−^{−1}, i.e., W^{u,−1}(q) = W^{u,f_−^{−1}}(q) = W^{s,f_−}(q).

Lemma 7. For any p ∈ D_ω, any accumulation point of {f_−^i(p) | i > 0} is in Q_ω.

Proof. Since p is in D_ω, there exist p_{k_i} ∈ D_{k_i} such that p = lim_{i→∞} p_{k_i}. Suppose q in D_ω is the accumulation point stated in the lemma, i.e., q = lim_{j→∞} f_−^{h_j}(p). We take k_i large enough for each h_j so that p_{k_i} lies in any neighborhood of q in which f_−^{h_j}(p) lies. Then q = lim_{j→∞} f_−^{h_j}(p_{k_i}) = lim_{j→∞} f_−^{h_j − k_i}(s_{k_i}), where k_i is a function of h_j with k_i > h_j. Let s_{k_i} = f_−^{k_i}(p_{k_i}) ∈ D_0 and let s_0 ∈ D_0 be an accumulation point of {s_{k_i}}. Then, since f_−^{−1} is continuous, letting n_j = −h_j + k_i > 0, we have q = lim_{j→∞} f_−^{−n_j}(s_0), i.e., q ∈ Q_ω.

Lemma 8. D_ω = ∪_{q∈Q_ω} W^{u,−1}(q).

Proof. Let p be any point in D_ω. Since f_−(D_ω) ⊆ (I[−1, 1])^2, where I[−1, 1] is the interval [−1, 1], i.e., f_−(D_ω) is bounded, and f_−(D_ω) ⊆ D_ω, {f_−^n(p)} has an accumulation point q in D_ω, which is, by Lemma 7, in Q_ω. Then q is expressed as q = lim_{j→∞} f_−^{n_j}(p). Since Q_ω is a finite set of hyperbolic fixed points, q = lim_{n→∞} f_−^n(p), i.e., p ∈ W^{s,f_−}(q) = W^{u,f_−^{−1}}(q) = W^{u,−1}(q).

Since P_ω ⊆ D_ω, the next theorem holds.

Theorem 9. A point in P_ω is either a point in Q_ω or in W^{u,−1}(q) for some q ∈ Q_ω.

Please note that since q ∈ W^{u,−1}(q), the theorem statement is simply: "If p ∈ P_ω then p ∈ W^{u,−1}(q) for some q ∈ Q_ω."

4 An Example of a Recognizer

To construct an SRN recognizer for {a^n b^n | n > 0}, the SRN should satisfy the conditions stated in Theorem 9 and Postulate 4, which are summarized as:
1. f_+(D_i) ⊆ D_{i+1},
2. the D_i's are disjoint,
3. P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively,
4. W_loc^{u,f_−}(q) and W_loc^{s,f_−}(q) for q ∈ Q_ω are one-dimensional, and
5. if p ∈ P_ω then p ∈ W^{u,−1}(q) for some q ∈ Q_ω.


Let us make things as simple as possible: the first choice is to consider a single point p ∈ P_ω and a single q ∈ Q_ω, that is, f_+(p) = p and f_−(q) = q. Since p cannot be the same as q (because f_−^{−1} ∘ f_+(p) = p + w_s^{−1} · w_x · (x_+ − x_−) ≠ p), we have to find a way to let p ∈ W^{u,−1}(q). Since it is in general very hard to calculate stable or unstable manifolds from a function and its fixed point, we had better try to let W^{u,−1}(q) be a "simple" manifold. There is one more reason to do so: we have to define D_0 = {x | h(x) ≥ 0}, but if W^{u,−1}(q) is not simple, a suitable h may not exist. We have decided to let W^{u,−1}(q) be a line (if possible). Considering the functional form f_−(s) = σ(w_s · s + w_x · x_−), it is not difficult to see that the line could be one of the axes or one of the bisectors of the right angles at the origin (i.e., one of the lines y = x and y = −x). We have chosen the bisector in the first (and the third) quadrant (i.e., the line y = x). By the way, q was chosen to be the origin and p was chosen arbitrarily to be (0.8, 0.8). Item 4 is satisfied by setting one of the two eigenvalues of Df_− at the origin to be greater than one and the other smaller than one. We have chosen 1/0.6 for one and 1/μ for the other, where μ is to be set so that Items 1 and 2 are satisfied, by considering the eigenvalues of Df_+ at p. A design consideration that we have skipped is how to design D_0 = {x | h(x) ≥ 0}. A simple way is to make the boundary h(x) = 0 parallel to W^{u,−1}(q) for our intended q ∈ Q_ω. If we do so, by setting the largest eigenvalue of Df_− at q equal to the inverse of the eigenvalue of Df_+ at p along the normal to W^{u,−1}, we can make the points s ∈ D_0, f_− ∘ f_+(s), f_−^2 ∘ f_+^2(s), …, f_−^i ∘ f_+^i(s), …, which correspond to the strings in {a^n b^n | n > 0}, reside at approximately equal distances from W^{u,−1}.
Needless to say, the points corresponding to, say, {a^{n+1} b^n | n > 0} have approximately equal distances from W^{u,−1} among themselves, and this distance is different from that for {a^n b^n | n > 0}. Let f_−(x) = σ(Ax + B_0) and f_+(x) = σ(Ax + B_1). We plan to put Q_ω = {(0, 0)}, P_ω = {(0.8, 0.8)}, and W^{u,−1} = {(x, y) | y = x}; the eigenvalues of the tangent space of f_−^{−1} at (0, 0) are 1/λ = 1/0.6 and 1/μ (where the eigenvector on y = x is expanding), and the eigenvalues of the tangent space of f_+ at (0.8, 0.8) are 1/μ and any value. Then, considering the derivatives at (0, 0) and (0.8, 0.8), it is easy to see that

A = ρ(π/4) (2λ 0; 0 2μ) ρ(−π/4),   1/μ = (1 − 0.8²)μ,

where ρ(θ) is a rotation by θ. Then

A = (λ+μ  λ−μ; λ−μ  λ+μ).

Next, from σ(B_0) = (0, 0)^T and σ(A(0.8, 0.8)^T + B_1) = (0.8, 0.8)^T,

B_0 = (0, 0)^T,   B_1 = (σ^{−1}(0.8) − 1.6λ, σ^{−1}(0.8) − 1.6λ)^T.

These give μ = 5/3, λ = 0.6, and B_1 ≈ (1.23722, 1.23722)^T.


Fig. 1. The vector field representation of f_+ (left) and f_− (right)

Fig. 2. {f_−^n ∘ f_+^n(p) | n ≥ 1} (upper), {f_−^n ∘ f_+^{n+1}(p) | n ≥ 1} (lower left), and {f_−^{n+1} ∘ f_+^n(p) | n ≥ 1} (lower right), where p = (0.5, 0.95)


In Fig. 1, the left image shows the vector field of f_+, where the arrows starting at x end at f_+(x), and the right image shows the vector field of f_−. In Fig. 2, the upper plot shows the points corresponding to strings in {a^n b^n | n > 0}, the lower-left plot those for {a^{n+1} b^n | n > 0}, and the lower-right plot those for {a^n b^{n+1} | n > 0}. The initial point was set to p = (0.5, 0.95) in Fig. 2. All of them are for n = 1 to n = 40; as n grows, the points gather, so we can say that they stay in narrow stripes, i.e., in D_n, for every n.

5 Discussion

We obtained a necessary condition for an SRN to implement a recognizer of the language {a^n b^n | n > 0} by analyzing its behavior from the viewpoint of discrete dynamical systems. The condition supposes that the D_i's are disjoint, f_+(D_i) ⊆ D_{i+1}, and Q_ω is finite. It suggests the possibility of an implementation, and in fact we have successfully built a recognizer for the language, thereby showing that the learning problem for the language has at least one solution. The instability of any solution under learning is suggested (though not derived) to be due to the necessity of P_ω lying in an unstable manifold W^{u,−1}(q) for q ∈ Q_ω. Since P_ω is attractive in the above example, f_+^n(s_0) for s_0 ∈ D_0 comes exponentially close to P_ω as n grows. Under even a small fluctuation of P_ω, since f_+^n(s_0), too, is close to W^{u,−1}(q), the point f_−^n(f_+^n(s_0)), which should be in D_0, is disturbed greatly. This means that even if we are close to a solution, by just a small fluctuation of P_ω caused by a new training datum, f_−^n(f_+^n(s_0)) may easily be pushed out of D_0. Since Rodriguez et al. [14] showed that languages that do not belong to the context-free class could be learned to some degree, we have to study the discrepancies further. The instability of grammar learning by SRN shown above might not be seen in our natural language learning, which suggests that SRN might not be an appropriate model of language learning.

References

1. Bodén, M., Wiles, J., Tonkes, B., Blair, A.: Learning to predict a context-free language: analysis of dynamics in recurrent hidden units. In: Proc. ICANN 1999, vol. 1, pp. 359–364 (1999)
2. Bodén, M., Wiles, J.: Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science 12(3/4), 197–210 (2000)
3. Bodén, M., Blair, A.: Learning the dynamics of embedded clauses. Applied Intelligence: Special issue on natural language and machine learning 19(1/2), 51–63 (2003)
4. Casey, M.: Correction to proof that recurrent neural networks can robustly recognize only regular languages. Neural Computation 10, 1067–1069 (1998)
5. Chalup, S.K., Blair, A.D.: Incremental training of first order recurrent neural networks to predict a context-sensitive language. Neural Networks 16(7), 955–972 (2003)

A Characterization of Simple RNNs with Two Hidden Units


6. Elman, J.L.: Distributed representations, simple recurrent networks and grammatical structure. Machine Learning 7, 195–225 (1991)
7. Elman, J.L.: Language as a dynamical system. In: Mind as Motion: Explorations in the Dynamics of Cognition, pp. 195–225. MIT Press, Cambridge
8. Gers, F.A., Schmidhuber, J.: LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks 12(6), 1333–1340 (2001)
9. Guckenheimer, J., Holmes, P.: Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer, Heidelberg (corr. 5th printing, 1997)
10. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading (1979)
11. Katok, A., Hasselblatt, B.: Introduction to the Modern Theory of Dynamical Systems. Cambridge University Press, Cambridge (1996)
12. Maass, W., Orponen, P.: On the effect of analog noise in discrete-time analog computations. Neural Computation 10, 1071–1095 (1998)
13. Rodriguez, P., Wiles, J., Elman, J.L.: A recurrent neural network that learns to count. Connection Science 11, 5–40 (1999)
14. Rodriguez, P.: Simple recurrent networks learn context-free and context-sensitive languages by counting. Neural Computation 13(9), 2093–2118 (2001)
15. Schmidhuber, J., Gers, F., Eck, D.: Learning nonregular languages: a comparison of simple recurrent networks and LSTM. Neural Computation 14(9), 2039–2041 (2002)
16. Siegelmann, H.T.: Neural Networks and Analog Computation: Beyond the Turing Limit. Birkhäuser (1999)
17. Wiles, J., Blair, A.D., Bodén, M.: Representation beyond finite states: alternatives to push-down automata. In: A Field Guide to Dynamical Recurrent Networks

Unbiased Likelihood Backpropagation Learning

Masashi Sekino and Katsumi Nitta

Tokyo Institute of Technology, Japan

Abstract. Error backpropagation is one of the most popular methods for training an artificial neural network. When error backpropagation is used for training, overfitting occurs in the latter half of the training. This paper explains why overfitting occurs within the model selection framework. The explanation leads to a new method for training an artificial neural network, unbiased likelihood backpropagation learning. Several experimental results are shown.

1 Introduction

An artificial neural network is a model for function approximation. It can approximate an arbitrary function when the number of basis functions is large. Error backpropagation learning [1], a well-known method for training an artificial neural network, is a gradient descent method with the squared error on the learning data as its target function. Therefore, error backpropagation learning can reach a local optimum while monotonically decreasing the error. However, although the error on the learning data decreases monotonically, the error on test data increases in the latter half of training. This phenomenon is called overfitting.

Early stopping is one method for preventing overfitting. It stops the training when an estimator of the generalization error no longer decreases. For example, the technique of stopping the training when the error on hold-out data no longer decreases is often applied. However, early stopping basically minimizes the error on the learning data, so there is no guarantee of obtaining the parameter that minimizes the estimator of the generalization error.

When the parameters of the basis functions (the model parameter) are fixed, an artificial neural network becomes a linear regression model. If a regularization parameter is introduced to ensure the regularity of this linear regression model, the artificial neural network becomes a set of regular linear regression models. The reason an artificial neural network tends to overfit is that maximum likelihood estimation with respect to the model parameter amounts to model selection among regular linear regression models based on the empirical likelihood.

In this paper, we propose unbiased likelihood backpropagation learning, a gradient descent method that modifies the model parameter with the unbiased likelihood (an information criterion) as its target function.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 446–455, 2008.
© Springer-Verlag Berlin Heidelberg 2008

It is expected



that the proposed method has better approximation performance, because it explicitly minimizes an estimator of the generalization error.

The rest of the paper is organized as follows. Section 2 explains statistical learning and maximum likelihood estimation, and Section 3 briefly explains information criteria. In Section 4, we describe the artificial neural network, regularized maximum likelihood estimation, and why error backpropagation learning causes overfitting. The proposed method is then explained in Section 5. We show the effectiveness of the proposed method by applying it to the DELVE data set [4] in Section 6. Finally, we conclude the paper in Section 7.

2 Statistical Learning

Statistical learning aims to construct an optimal approximation p̂(x) of a true distribution q(x) from a set of hypotheses M ≡ {p(x|θ) | θ ∈ Θ}, using learning data D ≡ {x_n | n = 1, ..., N} obtained from q(x). M is called the model, and the approximation p̂(x) is called the estimation. When we want to make explicit that the estimation p̂(x) is constructed using the learning data D, we write p̂(x|D). The Kullback-Leibler divergence,

    D(q||p) ≡ ∫ q(x) log ( q(x) / p(x) ) dx,    (1)

is used as the distance from q(x) to p(x). The probability density p(x) is called the likelihood, and in particular the likelihood of the estimation p̂(x) on the learning data D,

    p̂(D|D) ≡ ∏_{n=1}^{N} p̂(x_n|D),    (2)

is called the empirical likelihood. The sample mean of the log-likelihood,

    (1/N) log p(D) = (1/N) ∑_{n=1}^{N} log p(x_n),    (3)

asymptotically converges in probability to the mean log-likelihood,

    E_{q(x)}[log p(x)] ≡ ∫ q(x) log p(x) dx,    (4)

according to the law of large numbers, where E_{q(x)} denotes an expectation under q(x). Because the Kullback-Leibler divergence can be decomposed as

    D(q||p) = E_{q(x)}[log q(x)] − E_{q(x)}[log p(x)],    (5)

the maximization of the mean log-likelihood is equivalent to the minimization of the Kullback-Leibler divergence. Therefore, statistical learning methods such as maximum likelihood estimation, maximum a posteriori estimation, and Bayesian estimation are based on the likelihood.
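As a quick numerical illustration of the convergence from (3) to (4), the following sketch (ours, not from the paper; the distributions and sample sizes are arbitrary assumptions) draws samples from a true normal q and compares the sample mean log-likelihood of a slightly misspecified normal p against its closed-form mean log-likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
q = norm(loc=0.0, scale=1.0)   # true distribution q(x)
p = norm(loc=0.2, scale=1.0)   # hypothesis p(x)

# Closed form of E_q[log p(x)] for unit-variance normals:
# -0.5*log(2*pi) - (Var_q[x] + (mu_q - mu_p)^2) / 2
mean_ll = -0.5 * np.log(2 * np.pi) - (1.0 + 0.2 ** 2) / 2.0

for N in (100, 10_000, 1_000_000):
    x = q.rvs(size=N, random_state=rng)
    sample_mean_ll = np.mean(p.logpdf(x))  # (1/N) log p(D), eq. (3)
    print(N, sample_mean_ll, mean_ll)
```

As N grows, the printed sample mean approaches the fixed mean log-likelihood, in line with the law of large numbers.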


Maximum Likelihood Estimation
The maximum likelihood estimation p̂_ML(x) is the hypothesis p(x|θ̂_ML) given by the maximizer θ̂_ML of the likelihood p(D|θ):

    p̂_ML(x) ≡ p(x|θ̂_ML),    (6)
    θ̂_ML ≡ argmax_θ p(D|θ).    (7)

3 Information Criterion and Model Selection

3.1 Information Criterion

Because the sample mean of the log-likelihood asymptotically converges in probability to the mean log-likelihood, statistical learning methods are based on the likelihood. However, because the learning data D is a finite set in practice, the empirical likelihood p̂(D|D) contains a bias. This bias b(p̂) is defined as

    b(p̂) ≡ E_{q(D)}[log p̂(D)] − N · E_{q(x)}[log p̂(x|D)],    (8)

where q(D) ≡ ∏_{n=1}^{N} q(x_n). Because of this bias, it is known that the model most overfitted to the learning data is selected when one selects a regular model from a candidate set of regular models based on the empirical likelihood. To address this problem, many information criteria have been proposed that evaluate learning results by correcting the bias. Generally, an information criterion has the form

    IC(p̂, D) ≡ −2 log p̂(D) + 2 b̂(p̂),    (9)

where b̂(p̂) is an estimator of the bias b(p̂). The corrected AIC (cAIC) [3] estimates and corrects the bias of the empirical log-likelihood accurately as

    b̂_cAIC(p̂_ML) = N(M + 1) / (N − M − 2),    (10)

under the assumptions that the learning model is a normal linear regression model, that the true model is included in the learning model, and that the estimation is constructed by maximum likelihood estimation. Here, M is the number of explanatory variables and dim θ = M + 1 (the 1 accounts for the estimator of the variance). cAIC asymptotically equals AIC [2]: b̂_cAIC(p̂_ML) → b̂_AIC(p̂_ML) as N → ∞.

4 Artificial Neural Network

In a regression problem, we want to estimate the true function using learning data D = {(x_n, y_n) | n = 1, ..., N}, where x_n ∈ R^d is an input and y_n ∈ R is the corresponding output. An artificial neural network is defined as

    f(x; θ_M) = ∑_{i=1}^{M} a_i φ(x; ϕ_i),    (11)

where a_i (i = 1, ..., M) are regression coefficients and φ(x; ϕ_i) are basis functions parameterized by ϕ_i. The model parameter of this neural network is θ_M ≡ (ϕ_1^T, ..., ϕ_M^T)^T (T denotes transpose). The design matrix X, coefficient vector a, and output vector y are defined as

    X_ij ≡ φ(x_i; ϕ_j),    (12)
    a ≡ (a_1, ..., a_M)^T,    (13)
    y ≡ (y_1, ..., y_N)^T.    (14)

When the model parameter θ_M is fixed, the artificial neural network (11) becomes a linear regression model parameterized by a. A normal linear regression model,

    p(y|x; θ_M) ≡ (1/√(2πσ²)) exp( −(y − f(x; θ_M))² / (2σ²) ),    (15)

is usually used when the noise included in the output y is assumed to follow a normal distribution. In this paper, we call the parameter θ_R ≡ (a^T, σ²)^T the regular parameter.

4.1 Regularized Maximum Likelihood Estimation

To ensure the regularity of the normal linear regression model (15), regularized maximum likelihood estimation is usually used for estimating θ_R = (a^T, σ²)^T. Regularized maximum likelihood estimation maximizes the regularized log-likelihood

    log p(D) − exp(λ) ||a||².    (16)

Here λ ∈ R is called a regularization parameter. The regularized maximum likelihood estimators of the coefficient vector a and the variance σ² are

    â = L y    (17)

and

    σ̂² = (1/N) ||y − X â||²,    (18)

where

    L ≡ (X^T X + exp(λ) I)^{-1} X^T    (19)

and I is the identity matrix. The effective number of regression coefficients is

    M_eff = tr(X L).    (20)

Therefore, M_eff is used in place of the number of explanatory variables M in (10).¹

¹ When the denominator of (10) is not positive, we use b̂_cAIC(p̂) = ∞.
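A minimal NumPy sketch of the estimators (17)-(20) (ours, not the authors' code; np.linalg.solve is used in place of an explicit matrix inverse for numerical stability):

```python
import numpy as np

def regularized_ml_fit(X, y, lam):
    """Regularized ML estimation, eqs. (17)-(20):
    L = (X^T X + e^lam I)^{-1} X^T, a_hat = L y,
    sigma2_hat = ||y - X a_hat||^2 / N, M_eff = tr(X L)."""
    N, M = X.shape
    L = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)  # eq. (19)
    a_hat = L @ y                                                # eq. (17)
    sigma2_hat = np.sum((y - X @ a_hat) ** 2) / N                # eq. (18)
    m_eff = np.trace(X @ L)                                      # eq. (20)
    return a_hat, sigma2_hat, m_eff
```

As λ → −∞ the penalty vanishes and M_eff approaches the number of basis functions M; as λ grows, M_eff shrinks toward 0, which is why it serves as the effective number of explanatory variables in (10).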

4.2 Overfitting of the Error Backpropagation Learning

Error backpropagation learning is usually used for training an artificial neural network. This method is equivalent to gradient descent based on the likelihood of the linear regression model (15), because its target function is the squared error on the learning data. In what follows, we assume that the noise follows a normal distribution and that the normal linear regression model (15) is regular for all θ_M. Then, an artificial neural network becomes a set of regular linear regression models.

For simplicity, consider a set of regular models H ≡ {M(θ_M) | θ_M ∈ Θ_M}, where M(θ_M) ≡ {p(x|θ_R; θ_M) | θ_R ∈ Θ_R} is a regular model. We can define a new model M_C ≡ {p̂(x; θ_M) | θ_M ∈ Θ_M}, where p̂(x; θ_M) is the estimation of M(θ_M). Concerning the model parameter θ_M, statistical learning methods construct the estimation of the model M_C based on the empirical likelihood p̂(D; θ_M). For example, maximum likelihood estimation selects θ̂_MML, the maximizer of p̂(D; θ_M):

    p̂_ML(x) = p̂(x; θ̂_MML),    (21)
    θ̂_MML ≡ argmax_{θ_M} p̂(D; θ_M).    (22)

Thus, maximum likelihood estimation with respect to the model parameter θ_M is model selection from H based on the empirical likelihood. Because error backpropagation realizes maximum likelihood estimation by gradient descent, the model M(θ_M) gradually becomes overfitted in the latter half of the training. Therefore, we propose a learning method for the model parameter θ_M based on the unbiased likelihood, which is the empirical likelihood corrected by an appropriate information criterion.

5 Unbiased Likelihood Backpropagation Learning

5.1 Unbiased Likelihood

Using an information criterion IC(p̂, D), we define the unbiased likelihood as

    p̂_ub(D) = exp( −(1/2) IC(p̂, D) ).    (23)

This unbiased likelihood satisfies

    (1/N) E_{q(D)}[log p̂_ub(D)] = E_{q(x)}[log p̂(x)]    (24)

when the assumptions of the information criterion are satisfied.


5.2 Regular Hierarchical Model

In this paper, we consider a certain type of hierarchical model, which we call a regular hierarchical model, defined as a set of regular models. A concise definition is as follows.

Regular Hierarchical Model
– H ≡ {M(θ_M) | θ_M ∈ Θ_M}
– M(θ_M) ≡ {p(x|θ_R; θ_M) | θ_R ∈ Θ_R}
– M(θ_M) is a regular model with respect to θ_R.

An artificial neural network is a regular hierarchical model. We also define unbiased maximum likelihood estimation as follows.

Unbiased Maximum Likelihood Estimation
The unbiased maximum likelihood estimation p̂_ubML(x) is the estimation p̂(x; θ̂_MubML) given by the maximizer θ̂_MubML of the unbiased likelihood p̂_ub(D; θ_M):

    p̂_ubML(x) ≡ p̂(x; θ̂_MubML),    (25)
    θ̂_MubML ≡ argmax_{θ_M} p̂_ub(D; θ_M).    (26)

5.3 Unbiased Likelihood Backpropagation Learning

The partial derivative of the log unbiased likelihood is

    (∂/∂θ_M) log p̂_ub(D; θ_M) = (∂/∂θ_M) log p̂(D; θ_M) − (∂/∂θ_M) b̂(p̂; θ_M).    (27)

We define the estimation based on the gradient method with this partial derivative as unbiased likelihood backpropagation learning.

5.4 Unbiased Likelihood Backpropagation Learning for an Artificial Neural Network

In this paper, we derive the unbiased likelihood backpropagation learning rule for an artificial neural network when the bias of the empirical likelihood is estimated by cAIC (10). The partial derivative of the log empirical likelihood with respect to θ_M, which is the first term of (27), is

    (∂/∂θ_M) log p̂(D; θ_M) = (1/σ̂²) ( ∂(X â)/∂θ_M )^T (y − X â),    (28)

where

    ∂(X â)/∂θ_M = ( (∂X/∂θ_M) L + L^T (∂X^T/∂θ_M) ) y − 2 L^T X^T (∂X/∂θ_M) L y.    (29)


The partial derivative of cAIC (10) with respect to θ_M, which is the second term of (27), is

    (∂/∂θ_M) b̂_cAIC(p̂; θ_M) = ( N(N − 1) / (N − M − 2)² ) ∂M_eff/∂θ_M,    (30)

where

    ∂M_eff/∂θ_M = 2 tr( (∂X/∂θ_M) L − L^T X^T (∂X/∂θ_M) L ).    (31)

We can also obtain the partial derivatives of the unbiased likelihood with respect to λ as

    (∂/∂λ) log p̂(D; θ_M) = −(1/σ̂²) y^T L^T L (y − X â) exp(λ)    (32)

and

    ∂M_eff/∂λ = −tr(L^T L) exp(λ).    (33)

Now that we have the partial derivatives of the unbiased likelihood with respect to θ_M and λ, it is possible to apply unbiased likelihood backpropagation learning to an artificial neural network.
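Equation (33) is easy to verify numerically. The sketch below (ours; the random design matrix and sizes are arbitrary assumptions) compares the analytic derivative of M_eff in λ with a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 6))

def L_of(lam):
    # L = (X^T X + e^lam I)^{-1} X^T, eq. (19)
    return np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(X.shape[1]), X.T)

def m_eff(lam):
    # M_eff = tr(X L), eq. (20)
    return np.trace(X @ L_of(lam))

lam = -1.0
L = L_of(lam)
analytic = -np.trace(L.T @ L) * np.exp(lam)            # eq. (33)
h = 1e-5
numeric = (m_eff(lam + h) - m_eff(lam - h)) / (2 * h)  # central difference
print(analytic, numeric)
```

Both values agree to high precision, and both are negative: increasing λ strengthens the penalty and reduces the effective number of coefficients.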

6 Application to Kernel Regression Model

The kernel regression model is an artificial neural network. The kernel regression model with Gaussian kernels has a one-dimensional model parameter, the width of the Gaussian kernels, which makes the behavior of learning methods easy to understand. Therefore, the kernel regression model with Gaussian kernels is used in the following simulations. In the implementation of the gradient method, we adopted a quasi-Newton method with the BFGS update for estimating the Hessian matrix and golden section search for determining the step length.

6.1 Kernel Regression Model

The kernel regression model is

    f(x; θ_M) = ∑_{n=1}^{N} a_n K(x, x_n; θ_M),    (34)

where the K(x, x_n; θ_M) are kernel functions parameterized by the model parameter θ_M. The Gaussian kernel

    K(x, x_n; c) = exp( −||x − x_n||² / (2c²) )    (35)

is used in the following simulations, where c is a parameter that decides the width of the Gaussian kernel. The model parameter is θ_M = c.
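The design matrix of this model (eq. (12) with the Gaussian kernel (35)) can be sketched as follows (our code; the 128-sample/50-template sizes from the simulations in the next subsection are used only as an example):

```python
import numpy as np

def gaussian_design_matrix(X, templates, c):
    """X_ij = K(x_i, t_j; c) = exp(-||x_i - t_j||^2 / (2 c^2)), eq. (35)."""
    d2 = ((X[:, None, :] - templates[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * c ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 8))                        # 128 samples, 8-dim inputs
templates = X[rng.choice(128, size=50, replace=False)]   # 50 random templates
Phi = gaussian_design_matrix(X, templates, c=1.5)
print(Phi.shape)   # (128, 50)
```

With this design matrix in hand, the regularized estimators (17)-(20) apply unchanged, with the kernel width c playing the role of θ_M.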


Fig. 1. An example of the transition of the mean log empirical likelihood, mean log test likelihood, and mean log unbiased likelihood (cAIC): (a) empirical likelihood backpropagation; (b) unbiased likelihood backpropagation

6.2 Simulations

For the purpose of evaluation, empirical likelihood backpropagation learning and unbiased likelihood backpropagation learning are applied to the 8-dimensional-input, 1-dimensional-output regression problems of the "kin family" and "pumadyn family" in the DELVE data set [4]. Each data set has 4 combinations of fairly linear (f) or nonlinear (n), and medium noise (m) or high noise (h). We use 128 samples as learning data. 50 templates are chosen randomly, and kernel functions are placed on the templates.

Fig. 1 shows an example of the transition of the mean log empirical likelihood, mean log test likelihood, and mean log unbiased likelihood (cAIC). The test likelihood of empirical likelihood backpropagation learning (a) decreases in the latter half of the training, and overfitting occurs. In contrast, unbiased likelihood backpropagation learning (b) keeps the test likelihood close to the unbiased likelihood (cAIC), and overfitting does not occur. Table 1 shows the means and standard deviations of the mean log test likelihood over 100 experiments. Bold face indicates a significantly better result by the t-test at the 1% significance level.

Table 1. Means and standard deviations of the mean log test likelihood over 100 experiments. Bold face indicates a significantly better result by the t-test at the 1% significance level.

    Data          Empirical BP       Unbiased BP
    kin-8fm        2.323 ± 0.346      2.531 ± 0.140
    kin-8fh        1.108 ± 0.240      1.599 ± 0.100
    kin-8nm       −0.394 ± 0.415      0.078 ± 0.287
    kin-8nh       −0.435 ± 0.236      0.064 ± 0.048
    pumadyn-8fm   −2.235 ± 0.249     −1.708 ± 0.056
    pumadyn-8fh   −3.168 ± 0.187     −2.626 ± 0.021
    pumadyn-8nm   −3.012 ± 0.238     −2.762 ± 0.116
    pumadyn-8nh   −3.287 ± 0.255     −2.934 ± 0.055

6.3 Discussion

The better results of unbiased likelihood backpropagation learning in Table 1 are attributed to the fact that the method maximizes the true likelihood on average, because the mean of the log unbiased likelihood equals the mean log true likelihood (see (24)). The smaller standard deviations of the test likelihood of unbiased likelihood backpropagation learning, compared with empirical likelihood backpropagation learning, are assumed to be due to the fact that the empirical likelihood prefers a model with a larger degree of freedom, whereas the unbiased likelihood prefers a model with an appropriate degree of freedom. Therefore, the variance of the estimation of unbiased likelihood backpropagation learning becomes smaller than that of empirical likelihood backpropagation learning.

7 Conclusion

In this paper, we gave an explanation of why overfitting occurs within the model selection framework. We proposed unbiased likelihood backpropagation learning, a gradient descent method that modifies the model parameter with the unbiased likelihood (an information criterion) as its target function, and we confirmed the effectiveness of the proposed method by applying it to the DELVE data set.

References

1. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, Cambridge (1987)
2. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723 (1974)
3. Sugiura, N.: Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics A7, 13–26 (1978)
4. Rasmussen, C.E., Neal, R.M., Hinton, G.E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., Tibshirani, R.: The DELVE manual (1996), http://www.cs.toronto.edu/~delve/

The Local True Weight Decay Recursive Least Square Algorithm

Chi Sing Leung, Kwok-Wo Wong, and Yong Xu

Department of Electronic Engineering, City University of Hong Kong, Hong Kong
[email protected]

Abstract. The true weight decay recursive least square (TWDRLS) algorithm is an eﬃcient fast online training algorithm for feedforward neural networks. However, its computational and space complexities are very large. This paper ﬁrst presents a set of more compact TWDRLS equations. Afterwards, we propose a local version of TWDRLS to reduce the computational and space complexities. The eﬀectiveness of this local version is demonstrated by simulations. Our analysis shows that the computational and space complexities of the local TWDRLS are much smaller than those of the global TWDRLS.

1 Introduction

Training multilayer feedforward neural networks (MFNNs) using recursive least square (RLS) algorithms has attracted much attention in the literature [1, 2, 3, 4], because RLS algorithms are efficient second-order gradient descent training methods. They lead to faster convergence compared with first-order methods such as the backpropagation (BP) algorithm. Moreover, fewer parameters need to be tuned during training. Recently, Leung et al. found that the standard RLS algorithm has an implicit weight decay effect [2]. However, its decay effect is not substantial, and so its generalization ability is not very good. A true weight decay RLS (TWDRLS) algorithm was then proposed [5]. However, the computational complexity of TWDRLS is O(M³) at each iteration, where M is the number of weights. It is therefore necessary to reduce the complexity of TWDRLS so that it can be used for large-scale practical problems. The main goal of this paper is to reduce both the computational complexity and the storage requirement. In Section 2, we derive a set of concise equations for TWDRLS and discuss them. We then describe a local TWDRLS algorithm in Section 3. Simulation results are presented in Section 4. We summarize our findings in Section 5.

2 TWDRLS Algorithm

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 456–465, 2008.
© Springer-Verlag Berlin Heidelberg 2008

A general MFNN is composed of L layers, indexed by 1, ..., L from input to output. There are n_l neurons in layer l. The output of the i-th neuron in the

l-th layer is denoted by y_{i,l}. That is, the i-th neuron of the output layer is represented by y_{i,L}, while the i-th input of the network is represented by y_{i,1}. The connection weight from the j-th neuron of layer l − 1 to the i-th neuron of layer l is denoted by w_{i,j,l}. Biases are implemented as weights and are specified by w_{i,(n_{l−1}+1),l}, where l = 2, ..., L. Hence, the number of weights in an MFNN is given by M = ∑_{l=2}^{L} (n_{l−1} + 1) n_l.

In the standard RLS algorithm, we arrange all weights as an M-dimensional vector,

    w = [w_{1,1,2}, ..., w_{1,(n_1+1),2}, ..., w_{n_L,1,L}, ..., w_{n_L,(n_{L−1}+1),L}]^T.    (1)

The energy function up to the t-th training sample is given by

    E(w) = ∑_{τ=1}^{t} ||d(τ) − h(w, x(τ))||² + [w − ŵ(0)]^T P^{-1}(0) [w − ŵ(0)],    (2)

where x(τ) is an n_1-dimensional input vector, d(τ) is the desired n_L-dimensional output, and h(w, x(τ)) is a nonlinear function that describes the mapping of the network. The matrix P(0) is the initial error covariance matrix and is usually set to δ^{-1} I_{M×M}, where I_{M×M} is an M × M identity matrix. The minimization of (2) leads to the standard RLS equations [1, 3, 6, 7], given by

    K(t) = P(t − 1) H^T(t) [ I_{n_L×n_L} + H(t) P(t − 1) H^T(t) ]^{-1},    (3)
    P(t) = P(t − 1) − K(t) H(t) P(t − 1),    (4)
    ŵ(t) = ŵ(t − 1) + K(t) [ d(t) − h(ŵ(t − 1), x(t)) ],    (5)

where H(t) = ∂h(w, x(t))/∂w |_{w=ŵ(t−1)} is the n_L × M gradient matrix of h(w, x(t)), and K(t) is the Kalman gain matrix (M × n_L) of classical control theory. The matrix P(t) is the error covariance matrix; it is symmetric positive definite.

As mentioned in [5], the standard RLS algorithm has only a limited weight decay effect, namely δ/t_o per training iteration, where t_o is the number of training iterations. The decay effect decreases linearly as the number of training iterations increases. Hence, the more training presentations take place, the less smoothing effect there is in the data fitting process. A true weight decay RLS algorithm, namely TWDRLS, was then proposed [5], in which a decay term is added to the original energy function. The new energy function is given by

    E(w) = ∑_{τ=1}^{t} ||d(τ) − h(w, x(τ))||² + α w^T w + [w − ŵ(0)]^T P^{-1}(0) [w − ŵ(0)],    (6)

where α is a regularization parameter. The gradient of E(w) is given by

    ∂E(w)/∂w ≈ P^{-1}(0) [w − ŵ(0)] + α w − ∑_{τ=1}^{t} H^T(τ) [ d(τ) − H(τ) w − ξ(τ) ].    (7)


In the above, we linearize h(w, x(τ)) around the estimate ŵ(τ − 1). That is, h(w, x(τ)) = H(τ) w + ξ(τ), where ξ(τ) = h(ŵ(τ − 1), x(τ)) − H(τ) ŵ(τ − 1) + ρ(τ). To minimize the energy function, we set the gradient to zero. Hence, we have

    ŵ(t) = P(t) r(t),    (8)

where

    P^{-1}(t) = P^{-1}(t − 1) + H^T(t) H(t) + α I_{M×M},    (9)
    r(t) = r(t − 1) + H^T(t) [ d(t) − ξ(t) ].    (10)

Furthermore, we define P*(t − 1) ≜ [I_{M×M} + α P(t − 1)]^{-1} P(t − 1); hence P*^{-1}(t − 1) = P^{-1}(t − 1) + α I_{M×M}. With the matrix inversion lemma [7] in the recursive calculation of P(t), (8) becomes

    P*(t − 1) = [ I_{M×M} + α P(t − 1) ]^{-1} P(t − 1),    (11)
    K(t) = P*(t − 1) H^T(t) [ I_{n_L×n_L} + H(t) P*(t − 1) H^T(t) ]^{-1},    (12)
    P(t) = P*(t − 1) − K(t) H(t) P*(t − 1),    (13)
    ŵ(t) = ŵ(t − 1) − α P(t) ŵ(t − 1) + K(t) [ d(t) − h(ŵ(t − 1), x(t)) ].    (14)

Equations (11)-(14) are the general global TWDRLS equations. They are more compact than the equations presented in [5], whose weight updating equation (i.e., (14) in this paper) is more complicated. When the regularization parameter α is set to zero, the decay term α w^T w vanishes and (11)-(14) reduce to the standard RLS equations.

The decay effect can be understood directly from the decay term α w^T w in the energy function (6). As mentioned in [5], the decay effect per training iteration equals α w^T w, which does not decrease with the number of training iterations. The energy function of TWDRLS is the same as that of batch-mode weight decay methods; hence, existing heuristic methods [8, 9] for choosing the value of α can be used in the TWDRLS case.

We can also explain the weight decay effect from the recursive equations (11)-(14). The main difference between the standard RLS equations and the TWDRLS equations is the decay term −α P(t) ŵ(t − 1) in (14). This term makes the magnitude of the weight vector decay by an amount proportional to α P(t). Since P(t) is positive definite, the magnitude of the weight vector does not become too large, so the generalization ability of the trained networks is better [8, 9].

A drawback of TWDRLS is the need to compute the inverse of the M-dimensional matrix (I_{M×M} + α P(t − 1)). This complexity is O(M³), which is much larger than that of the standard RLS, O(M²). Hence, the TWDRLS algorithm is computationally prohibitive even for a network of moderate size. In the next section, a local version of the TWDRLS algorithm is proposed to solve this complexity problem.
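One TWDRLS step (11)-(14) can be sketched for the simplest case of a linear single-output model h(w, x) = wᵀx, so that H(t) is the 1 × M row x(t)ᵀ (this is our illustration, not the authors' code; np.linalg.solve replaces the explicit inverse in (11)):

```python
import numpy as np

def twdrls_step(w, P, x, d, alpha):
    """One global TWDRLS update, eqs. (11)-(14), single linear output."""
    M = w.size
    H = x.reshape(1, -1)                                       # 1 x M
    P_star = np.linalg.solve(np.eye(M) + alpha * P, P)         # eq. (11)
    S = np.eye(1) + H @ P_star @ H.T                           # innovation term
    K = P_star @ H.T @ np.linalg.inv(S)                        # eq. (12), M x 1
    P_new = P_star - K @ H @ P_star                            # eq. (13)
    w_new = w - alpha * (P_new @ w) \
            + (K * (d - float(H @ w))).ravel()                 # eq. (14)
    return w_new, P_new
```

With alpha = 0 this reduces exactly to a standard RLS step (P* = P and the decay term vanishes); with alpha > 0, the −αP(t)ŵ(t−1) term shrinks the weight vector at every iteration.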

3 Localization of the TWDRLS Algorithm

To localize the TWDRLS algorithm, we first divide the weight vector into several small vectors, where w_{i,l} = [w_{i,1,l}, ..., w_{i,(n_{l−1}+1),l}]^T denotes the weights connecting all the neurons of layer l − 1 to the i-th neuron of layer l. We consider the estimation of each weight vector separately. When we consider the i-th neuron in layer l, we assume that the other weight vectors are constant. Such a technique is commonly used in numerical methods [10]. At each training iteration, we update each weight vector separately. Each neuron has its own energy function. The energy function of the i-th neuron in layer l is given by

    E(w_{i,l}) = ∑_{τ=1}^{t} ||d(τ) − h(w_{i,l}, x(τ))||² + α w_{i,l}^T w_{i,l} + [w_{i,l} − ŵ_{i,l}(0)]^T P_{i,l}^{-1}(0) [w_{i,l} − ŵ_{i,l}(0)].    (15)

Utilizing a derivation process similar to the previous analysis, we obtain the following recursive equations for the local TWDRLS algorithm. Each neuron (except input neurons) has its own set of TWDRLS equations. For the i-th neuron in layer l, the TWDRLS equations are given by

    P*_{i,l}(t − 1) = [ I_{(n_{l−1}+1)×(n_{l−1}+1)} + α P_{i,l}(t − 1) ]^{-1} P_{i,l}(t − 1),    (16)
    K_{i,l}(t) = P*_{i,l}(t − 1) H_{i,l}^T(t) [ I_{n_L×n_L} + H_{i,l}(t) P*_{i,l}(t − 1) H_{i,l}^T(t) ]^{-1},    (17)
    P_{i,l}(t) = P*_{i,l}(t − 1) − K_{i,l}(t) H_{i,l}(t) P*_{i,l}(t − 1),    (18)
    ŵ_{i,l}(t) = ŵ_{i,l}(t − 1) − α P_{i,l}(t) ŵ_{i,l}(t − 1) + K_{i,l}(t) [ d(t) − h(ŵ_{i,l}(t − 1), x(t)) ],    (19)

where H_{i,l} is the n_L × (n_{l−1} + 1) local gradient matrix. In this matrix, only the row associated with the considered neuron is nonzero for the output layer L. K_{i,l}(t) is the (n_{l−1} + 1) × n_L local Kalman gain, and P_{i,l}(t) is the (n_{l−1} + 1) × (n_{l−1} + 1) local error covariance matrix.

The training process of the local TWDRLS algorithm is as follows. There are ∑_{l=2}^{L} n_l neurons (excluding input neurons), and hence ∑_{l=2}^{L} n_l sets of TWDRLS equations. We update the local weight vectors in descending order of l and then ascending order of i using (16)-(19). At each training stage, only the concerned local weight vector is updated; all other local weight vectors remain unchanged.

In the global TWDRLS, the complexity mainly comes from computing the inverse of the M-dimensional matrix (I_{M×M} + α P(t − 1)). This complexity is O(M³), so the computational complexity is

    TCC_global = O(M³) = O( ( ∑_{l=2}^{L} n_l (n_{l−1} + 1) )³ ).

Since the size of the matrix is M × M, the space complexity (storage requirement) is

    TCS_global = O(M²) = O( ( ∑_{l=2}^{L} n_l (n_{l−1} + 1) )² ).

From (16), the computational cost of the local TWDRLS algorithm mainly comes from the inversion of an (n_{l−1} + 1) × (n_{l−1} + 1) matrix. In this way, the computational complexity of each set of local TWDRLS equations is O((n_{l−1} + 1)³), and the corresponding space complexity is O((n_{l−1} + 1)²). Hence, the total computational complexity of the local TWDRLS is

    TCC_local = O( ∑_{l=2}^{L} n_l (n_{l−1} + 1)³ ),

and the space complexity (storage requirement) is

    TCS_local = O( ∑_{l=2}^{L} n_l (n_{l−1} + 1)² ).

These are much smaller than the computational and space complexities of the global case.
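As a sanity check (ours, not the authors'), the counts behind these expressions for the 2-10-1 network used in the generalized XOR experiment of Section 4.1 can be computed directly:

```python
# Layer sizes n_1..n_3 for the 2-10-1 generalized XOR network
n = [2, 10, 1]

M = sum(n[l] * (n[l - 1] + 1) for l in range(1, len(n)))        # total weights
cc_global = M ** 3                                              # O(M^3)
cs_global = M ** 2                                              # O(M^2)
cc_local = sum(n[l] * (n[l - 1] + 1) ** 3 for l in range(1, len(n)))
cs_local = sum(n[l] * (n[l - 1] + 1) ** 2 for l in range(1, len(n)))
print(M, cc_global, cs_global, cc_local, cs_local)
```

This gives M = 41 weights, a global cost of 68921 ≈ 6.89 × 10⁴ with 1681 ≈ 1.68 × 10³ storage, and a local cost of 1601 ≈ 1.60 × 10³, on the order of the values reported in Table 1.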

4 Simulations

Two problems, the generalized XOR and sunspot data prediction, are considered. We use three-layer networks. The initial weights are small zero-mean independent identically distributed Gaussian random variables. The transfer function of the hidden neurons is the hyperbolic tangent. Since the generalized XOR is a classification problem, its output neurons use the hyperbolic tangent function; for the sunspot data prediction problem, the output neurons use the linear activation function. The training for each problem is performed 10 times with different random initial weights.

4.1 Generalized XOR Problem

The generalized XOR problem is formulated as d = sign(x1 x2) with inputs in the range [−1, 1]. The network has 2 input neurons, 10 hidden neurons, and 1 output neuron. As a result, there are 41 weights. The training set and test set, shown in Figure 1, consist of 50 and 2,000 samples, respectively. The total number of training cycles is set to 200. In each cycle, training samples from the training set are fed to the network one by one. The decision boundaries obtained from typical networks trained with the global and local TWDRLS algorithms and the standard RLS algorithm are plotted in Figure 2.
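The data set is easy to generate; a minimal sketch (illustrative only, with an arbitrary random seed):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_xor_data(n):
    """Generalized XOR: inputs uniform in [-1, 1]^2, target d = sign(x1 * x2)."""
    x = rng.uniform(-1.0, 1.0, size=(n, 2))
    d = np.sign(x[:, 0] * x[:, 1])
    return x, d

x_train, d_train = make_xor_data(50)    # training-set size used in the paper
x_test, d_test = make_xor_data(2000)    # test-set size used in the paper
```

The ideal decision boundaries are the two coordinate axes, which is what Figure 2 compares the trained networks against.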

(a) Training samples

(b) Test samples

Fig. 1. Training and test samples for the generalized XOR problems

The Local True Weight Decay Recursive Least Square Algorithm


Table 1. Computational and space complexities of the global and local TWDRLS algorithms for solving the generalized XOR problem

Algorithm   Computational complexity   Space complexity
Global      O(6.89 × 10^4)             O(1.68 × 10^3)
Local       O(1.60 × 10^3)             O(2.21 × 10^2)

(a) Global TWDRLS, α = 0

(b) Local TWDRLS, α = 0

(c) Global TWDRLS, α = 0.00178

(d) Local TWDRLS, α = 0.00178

Fig. 2. Decision boundaries of various trained networks for the generalized XOR problem. Note that when α = 0, the TWDRLS is identical to RLS.

From Figures 1 and 2, the decision boundaries obtained from the networks trained with the TWDRLS algorithm are closer to the ideal ones than those obtained with the standard RLS algorithm. Also, both local and global TWDRLS algorithms produce similarly shaped decision boundaries. Figure 3 summarizes the average test set false rates over the 10 runs. The average test set false rates obtained by the global and local TWDRLS algorithms are usually lower than those obtained by the standard RLS algorithm over a wide range of regularization parameters. This means that both global and local TWDRLS algorithms can improve the generalization ability. In terms of average false rate, the performance of the local TWDRLS algorithm is quite similar to that of the global one. The computational and space complexities of the global and local algorithms are listed in Table 1. From Figure 3 and Table 1, we can conclude that


Fig. 3. Average test set false rate of 10 runs for the generalized XOR problem

the performance of the local TWDRLS is comparable to that of the global one, and that its complexities are much smaller. Figure 3 indicates that the average test set false rate first decreases and then increases with the regularization parameter α. This shows that a proper selection of α will indeed improve the generalization ability of the network. On the other hand, we observe that the test set false rate becomes very high at large values of α, especially for the networks trained with the global TWDRLS algorithm. This is due to the fact that when the value of α is too large, the weight decay effect is very substantial and the trained network cannot learn the target function. To further illustrate this, we plot in Figure 4 the decision boundary obtained from the network trained with the global TWDRLS algorithm for α = 0.0178. The figure shows that the network has already converged while the decision boundary is still quite far from the ideal one. This is because when the value of α is too large, the weight decay effect is too strong. That means the regularization parameter α cannot be too large; otherwise the network cannot learn the target function.

4.2

Sunspot Data Prediction

The sunspot data from 1700 to 1979 are normalized to the range [0, 1] and taken as the training and test sets. Following the common practice, we divide the data into a training set (1700–1920) and two test sets, namely, Test-set 1 (1921–1955) and Test-set 2 (1956–1979). The sunspot series is rather nonstationary and Test-set 2 is atypical for the series as a whole. In the simulation, we assume that the series is generated from the following auto-regressive model:

d(t) = ϕ(d(t − 1), · · · , d(t − 12)) + ε(t)   (20)

where ε(t) is noise and ϕ(·, · · · , ·) is an unknown nonlinear function. A network with 12 input neurons, 8 hidden neurons (with hyperbolic tangent activation


Table 2. Computational and space complexities of the global and local TWDRLS algorithms for solving the sunspot data prediction

Algorithm   Computational complexity   Space complexity
Global      O(1.44 × 10^6)             O(1.28 × 10^4)
Local       O(1.83 × 10^4)             O(1.43 × 10^3)

Fig. 4. Decision boundaries of a trained network with local TWDRLS where α = 0.0178. In this case, the value of the regularization parameter is too large. Hence, the network cannot form a good decision boundary.

(a) Test-set 1 average RMSE

(b) Test-set 2 average RMSE

Fig. 5. RMSE of networks trained by global and local TWDRLS algorithms. Note that when α = 0, the TWDRLS is identical to RLS.

function), and one output neuron (with linear activation function) is used for approximating ϕ(·, · · · , ·). The total number of training cycles is equal to 200. As this is a time series problem, the training samples are fed to the network sequentially in each cycle. The criterion to evaluate the model performance


is the root mean squared error (RMSE) on the test set. The experiment is repeated 10 times with different initial weights. Figure 5 summarizes the average RMSE over the 10 runs. The computational and space complexities of the global and local algorithms are listed in Table 2. We observe from Figure 5 that over a wide range of the regularization parameter α, both global and local TWDRLS algorithms have greatly improved the generalization ability of the trained networks, especially for Test-set 2, which is quite different from the training set. However, the test RMSE becomes very large at large values of α. The reasons are similar to those stated in the last subsection: at large values of α, the weight decay effect is too strong and so the network cannot learn the target function. In most cases, the performance of the local training is found to be comparable to that of the global one. Also, Table 2 shows that the complexities of the local training are much smaller than those of the global one.
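The 12-lag autoregressive model (20) amounts to sliding a window over the normalized series to form input-target pairs; a minimal sketch (the series below is a made-up stand-in, not the real sunspot numbers):

```python
import numpy as np

def make_lagged_pairs(series, order=12):
    """Build (input, target) pairs for d(t) = phi(d(t-1), ..., d(t-order)) + noise.
    Row t of the inputs holds d(t-order), ..., d(t-1)."""
    x = np.array([series[t - order:t] for t in range(order, len(series))])
    d = np.array(series[order:])
    return x, d

# toy stand-in for the normalized sunspot numbers; 221 values mimic
# the 221 training years 1700-1920
series = np.linspace(0.0, 1.0, 221)
x, d = make_lagged_pairs(series, order=12)
print(x.shape, d.shape)   # (209, 12) (209,)
```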

5

Conclusion

We have investigated the problem of training the MFNN model using the TWDRLS algorithms. We derived a set of concise equations for the local TWDRLS algorithm. The computational complexity and the storage requirement are reduced considerably when using the local approach. Computer simulations indicate that both local and global TWDRLS algorithms can improve the generalization ability of MFNNs. The performance of the local TWDRLS algorithm is comparable to that of the global one.

Acknowledgement The work is supported by the Hong Kong Special Administrative Region RGC Earmarked Grant (Project No. CityU 115606).

References

1. Shah, S., Palmieri, F., Datum, M.: Optimal filtering algorithm for fast learning in feedforward neural networks. Neural Networks 5, 779–787 (1992)
2. Leung, C.S., Wong, K.W., Sum, J., Chan, L.W.: A pruning method for recursive least square algorithm. Neural Networks 14, 147–174 (2001)
3. Scalero, R., Tepedelelenlioglu, N.: Fast new algorithm for training feedforward neural networks. IEEE Trans. Signal Processing 40, 202–210 (1992)
4. Leung, C.S., Sum, J., Young, G., Kan, W.K.: On the Kalman filtering method in neural networks training and pruning. IEEE Trans. Neural Networks 10, 161–165 (1999)
5. Leung, C.S., Tsoi, A.H., Chan, L.W.: Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Trans. Neural Networks 12, 1314–1332 (2001)
6. Mosca, E.: Optimal Predictive and Adaptive Control. Prentice-Hall, Englewood Cliffs, NJ (1995)


7. Haykin, S.: Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ (1991)
8. MacKay, D.: Bayesian interpolation. Neural Computation 4, 415–447 (1992)
9. MacKay, D.: A practical Bayesian framework for backpropagation networks. Neural Computation 4, 448–472 (1992)
10. William H, H.: Applied Numerical Linear Algebra. Prentice-Hall, Englewood Cliffs, NJ (1989)

Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift Keisuke Yamazaki and Sumio Watanabe Precision and Intelligence Laboratory, Tokyo Institute of Technology R2-5, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 Japan {k-yam,swatanab}@pi.titech.ac.jp

Abstract. In the standard setting of statistical learning theory, we assume that the training and test data are generated from the same distribution. However, this assumption does not hold in many practical cases, e.g., brain-computer interfacing, bioinformatics, etc. In particular, a change of the input distribution in regression problems often occurs, and is known as the covariate shift. There are many studies on adapting to this change, since ordinary machine learning methods do not work properly under the shift. The asymptotic theory has also been developed in the Bayesian inference. Although many effective results have been reported for statistically regular models, non-regular models have not been considered well. This paper focuses on the behavior of non-regular models under the covariate shift. In a former study [1], we formally revealed the factors changing the generalization error and established its upper bound. Here we report that experimental results support the theoretical findings. Moreover, it is observed that the basis function in the model plays an important role in some cases.

1

Introduction

The task of the regression problem is to estimate the input-output relation q(y|x) from sample data, where x and y are the input and output data, respectively. Then, we generally assume that the same input distribution q(x) generates both the training and test data. However, this assumption cannot be satisfied in practical situations, e.g., brain-computer interfacing [2], bioinformatics [3], etc. The change of the input distribution from the training q0(x) to the test q1(x) is referred to as the covariate shift [4]. It is known that, under the covariate shift, the standard techniques in machine learning cannot work properly, and many efficient methods to tackle this issue have been proposed [4,5,6,7]. In the Bayes estimation, Shimodaira [4] revealed the generalization error improved by the importance weight in regular cases. We formally clarified the behavior of the error in non-regular cases [1]. The result shows that the generalization error is determined by lower order terms, which are ignored in the situation without the covariate shift. At the same time, it appeared that the calculation of these terms is not straightforward even in a simple regular example.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 466–476, 2008. © Springer-Verlag Berlin Heidelberg 2008

To cope


with this problem, we also established an upper bound using the generalization error without the covariate shift. However, it is still an open problem how to derive the theoretical generalization error in non-regular models. In this paper, we observe experimental results calculated with a Monte Carlo method and examine the theoretical upper bound on some non-regular models. Comparing the generalization error under the covariate shift to that without the shift, we investigate the effect of a basis function in the learning model and consider the tightness of the bound. In the next section, we define non-regular models and summarize the Bayesian analyses for models without and with the covariate shift. We show the experimental results in Section 3 and give discussions in the last section.

2

Bayesian Generalization Errors with and without Covariate Shift

In this section, we summarize the asymptotic theory of non-regular Bayes inference. First, we define the non-regular case. Then, mathematical properties of non-regular models without the covariate shift are introduced [8]. Finally, we state the results of the former study [1], which clarified the generalization error under the shift.

2.1

Non-regular Models

Let us define the parametric learning model by p(y|x, w), where x, y and w are the input, output and parameter, respectively. When the true distribution r(y|x) is realized by the learning model, the true parameter w* exists, i.e., p(y|x, w*) = r(y|x). The model is regular if w* is one point in the parameter space. Otherwise, it is non-regular; the true parameter is not one point but a set of parameters such that Wt = {w* : p(y|x, w*) = r(y|x)}. For example, three-layer perceptrons are non-regular. Let the true distribution be a zero function with Gaussian noise,

r(y|x) = (1/√(2π)) exp(−y²/2),

and the learning model be a simple three-layer perceptron,

p(y|x, w) = (1/√(2π)) exp(−(y − a tanh(bx))²/2),

where the parameter is w = {a, b}. It is easy to find that the true parameters form the set {a = 0} ∪ {b = 0}, since 0 × tanh(bx) = a × tanh 0 = 0. This non-regularity (also called non-identifiability) means that the conventional statistical methods cannot be applied to such models (we will mention the details in Section 2.2). In spite of this difficulty for the analysis, non-regular models, such as perceptrons, Gaussian mixtures, hidden Markov models, etc., are widely employed in many information engineering fields.
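To make the non-identifiability tangible, one can check numerically that every parameter on {a = 0} ∪ {b = 0} realizes the same (zero) regression function, while a generic parameter does not; a small illustrative sketch:

```python
import numpy as np

def mean_fn(x, a, b):
    """Mean function of the three-layer perceptron model p(y|x, w), w = {a, b}."""
    return a * np.tanh(b * x)

x = np.linspace(-3.0, 3.0, 101)

# Every parameter on {a = 0} or {b = 0} realizes the true (zero) function...
for a, b in [(0.0, -2.0), (0.0, 5.0), (1.3, 0.0), (-4.0, 0.0)]:
    assert np.allclose(mean_fn(x, a, b), 0.0)

# ...while a generic parameter does not, so the true parameter is a set, not a point.
assert not np.allclose(mean_fn(x, 1.0, 1.0), 0.0)
```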


2.2


Properties of the Generalization Error without Covariate Shift

As we mentioned in the previous section, the conventional statistical manner does not work for non-regular models. To cope with this issue, a method was developed based on algebraic geometry. Here, we introduce a summary. Hereafter, we denote the cases without and with the covariate shift by the subscripts 0 and 1, respectively. Some of the functions with a suffix 0 will be replaced by those with 1 in the next section. Let {X^n, Y^n} = {X1, Y1, . . . , Xn, Yn} be a set of training samples that are independently and identically generated by the true distribution r(y|x)q0(x). Let p(y|x, w) be a learning machine and ϕ(w) be an a priori distribution of the parameter w. Then the a posteriori distribution (posterior) is defined by

p(w|X^n, Y^n) = (1/Z(X^n, Y^n)) ∏_{i=1}^{n} p(Yi|Xi, w) ϕ(w),

where

Z(X^n, Y^n) = ∫ ∏_{i=1}^{n} p(Yi|Xi, w) ϕ(w) dw.   (1)

The Bayesian predictive distribution is given by

p(y|x, X^n, Y^n) = ∫ p(y|x, w) p(w|X^n, Y^n) dw.

When the number of sample data is sufficiently large (n → ∞), the posterior has its peak at the true parameter(s). The posterior of a regular model is a Gaussian distribution, whose mean asymptotically equals the parameter w*. On the other hand, the shape of the posterior of a non-regular model is not Gaussian because of Wt (cf. the right panel of Fig. 10 in Section 3). We evaluate the generalization error by the average Kullback divergence from the true distribution to the predictive distribution:

G0(n) = E^0_{X^n,Y^n} [ ∫∫ r(y|x) q0(x) log( r(y|x) / p(y|x, X^n, Y^n) ) dx dy ].

In the standard statistical manner, we can formally calculate the generalization error by integrating the predictive distribution. This integration is viable based on a Gaussian posterior. Therefore, this method is applicable only to regular models. The following is one of the solutions for non-regular cases. The stochastic complexity [9] is defined by

F(X^n, Y^n) = − log Z(X^n, Y^n),   (2)

which can be used for selecting an appropriate model or hyper-parameters. To analyze the behavior of the stochastic complexity, the following functions play important roles:

U^0(n) = E^0_{X^n,Y^n} [ F̃(X^n, Y^n) ],   (3)

where E^0_{X^n,Y^n}[·] stands for the expectation value over r(y|x)q0(x) and

F̃(X^n, Y^n) = F(X^n, Y^n) + Σ_{i=1}^{n} log r(Yi|Xi).

The generalization error and the stochastic complexity are linked by the following equation [10]:

G0(n) = U^0(n + 1) − U^0(n).   (4)

When the learning machine p(y|x, w) can attain the true distribution r(y|x), the asymptotic expansion of U^0(n) is given as follows [8]:

U^0(n) = α log n − (β − 1) log log n + O(1).   (5)

The coefficients α and β are determined by the integral transforms of U^0(n). More precisely, the rational number −α and the natural number β are the largest pole and its order of

J(z) = ∫ H0(w)^z ϕ(w) dw,
H0(w) = ∫∫ r(y|x) q0(x) log( r(y|x) / p(y|x, w) ) dx dy.   (6)

J(z) is obtained by applying the inverse Laplace and Mellin transformations to exp[−U^0(n)]. Combining Eqs. (5) and (4) immediately gives

G0(n) = α/n − (β − 1)/(n log n) + o(1/(n log n)),

when G0(n) has an asymptotic form. The coefficients α and β indicate the speed of convergence of the generalization error when the number of training samples is sufficiently large. When the learning machine cannot attain the true distribution (i.e., the model is misspecified), the stochastic complexity has an upper bound of the following asymptotic expression [11]:

U^0(n) ≤ nC + α log n − (β − 1) log log n + O(1),   (7)

where C is a non-negative constant. When the generalization error has an asymptotic form, combining Eqs. (7) and (4) gives

G0(n) ≤ C + α/n − (β − 1)/(n log n) + o(1/(n log n)),   (8)

where C is the bias.

2.3

Properties of the Generalization Error with Covariate Shift

Now, we introduce the results in [1]. Since the test data are distributed from r(y|x)q1(x), the generalization error with the shift is defined by

G1(n) = E^0_{X^n,Y^n} [ ∫∫ r(y|x) q1(x) log( r(y|x) / p(y|x, X^n, Y^n) ) dx dy ].   (9)

We need a function similar to (3), given by

U^1(n) = E^1_{Xn,Yn} E^0_{X^{n−1},Y^{n−1}} [ F̃(X^n, Y^n) ].   (10)

Then, the variant of (4) is obtained as

G1(n) = U^1(n + 1) − U^0(n).   (11)

When we assume that G1(n) has an asymptotic expansion and converges to a constant, and that U^i(n) has the asymptotic expansion

U^i(n) = a_i n + b_i log n + · · · + c_i + d_i/n + o(1/n),

where the growing part a_i n + b_i log n + · · · is denoted by T^i_H(n) and the remaining part c_i + d_i/n + o(1/n) by T^i_L(n), it holds that G0(n) and G1(n) are expressed by

G0(n) = a_0 + b_0/n + o(1/n),
G1(n) = a_0 + (c_1 − c_0) + (b_0 + (d_1 − d_0))/n + o(1/n),   (12)

and that T^1_H(n) = T^0_H(n). Note that b_0 = α. The factors c_1 − c_0 and d_1 − d_0 determine the difference of the errors. We have also obtained that the generalization error G1(n) has an upper bound

G1(n) ≤ M G0(n),   (13)

if the following condition is satisfied:

M ≡ max_{x∼q0(x)} q1(x)/q0(x) < ∞.   (14)

3

Experimental Generalization Errors in Some Toy Models

Even though we know the factors causing the difference between G1(n) and G0(n) according to Eq. (12), it is not straightforward to calculate the lower order terms in T^i_L(n). More precisely, the models for which the constant and decreasing factors c_i, d_i in the increasing function U^i(n) can be found are restricted. This implies that revealing the analytic expression of G1(n) is still an open issue in non-regular models. Here, we calculate G1(n) in experiments and observe its behavior. A non-regular model requires sampling from a non-Gaussian posterior in the Bayes inference. We use the Markov Chain Monte Carlo (MCMC) method to execute this task [12]. In the following examples, we use the common notation: the true distribution is defined by

r(y|x) = (1/√(2π)) exp(−(y − g(x))²/2),

Figs. 1–9. The training and test input distributions:
Fig. 1. (μ1, σ1) = (0, 1); Fig. 2. (μ1, σ1) = (0, 0.5); Fig. 3. (μ1, σ1) = (0, 2); Fig. 4. (μ1, σ1) = (2, 1); Fig. 5. (μ1, σ1) = (2, 0.5); Fig. 6. (μ1, σ1) = (2, 2); Fig. 7. (μ1, σ1) = (10, 1); Fig. 8. (μ1, σ1) = (10, 0.5); Fig. 9. (μ1, σ1) = (10, 2).

the learning model is given by

p(y|x, w) = (1/√(2π)) exp(−(y − f(x, w))²/2),

the prior ϕ(w) is a standard normal distribution, and the input distribution is of the form

qi(x) = (1/(√(2π)σi)) exp(−(x − μi)²/(2σi²))   (i = 0, 1).

The training input distribution has (μ0, σ0) = (0, 1), and there are nine test distributions, each having the mean and variance as a combination of μ1 ∈ {0, 2, 10} and σ1 ∈ {1, 0.5, 2} (cf. Figs. 1–9). Note that the case in Fig. 1 corresponds to q0(x). As for the experimental setting, the number of training samples is n, the number of test samples is ntest, the number of parameter samples distributed from the posterior with the MCMC method is np, and the number of samples used to take the expectation E_{X^n,Y^n}[·] is nD. In mathematical expressions,

p(y|x, X^n, Y^n) ≈ (1/np) Σ_{j=1}^{np} p(y|x, wj) · [ ∏_{i=1}^{n} p(Yi|Xi, wj) ϕ(wj) ] / [ (1/np) Σ_{k=1}^{np} ∏_{i=1}^{n} p(Yi|Xi, wk) ϕ(wk) ],

G1(n) ≈ (1/nD) Σ_{i=1}^{nD} (1/ntest) Σ_{j=1}^{ntest} log( r(yj|xj) / p(yj|xj, Di) ),


Fig. 10. The sampling from the posteriors. The left-upper panel shows the histogram of a for the ﬁrst model. The left-middle one is the histogram of a3 for the third model. The left-lower one is the point diagram of (a, b) for the second model, and the right one is its histogram.

where Di = {Xi1, Yi1, · · · , Xin, Yin} stands for the ith set of training data, and Di and (xj, yj) in G1(n) are taken from q0(x)r(y|x) and q1(x)r(y|x), respectively. The experimental parameters were as follows: n = 500, ntest = 1000, np = 10000, nD = 100.

Example 1 (Lines with Various Parameterizations)

g(x) = 0, f1(x, a) = ax, f2(x, a, b) = abx, f3(x, a) = a³x,

where the true function is the zero function and the learning functions are lines with gradient a, ab and a³, respectively. In this example, all learning functions belong to the same function class, though the second model is non-regular (Wt = {a = 0} ∪ {b = 0}) and the third one has a non-Gaussian posterior. The gradient parameters are taken from the posteriors depicted in Fig. 10. Table 1 summarizes the results. The first row indicates the pairs (μ1, σ1), and the other rows show the experimental average generalization errors. G1[fi] stands for the error of the model with fi. MG0[fi] is the upper bound in each case according to Eq. (13). Note that there are blanks in these rows because of the condition Eq. (14). To compare G1[f3] with G1[f1], the last row shows the values 3 × G1[f3] for each change. Since it is regular, the first model has theoretical results:

G0(n) = 1/(2n) + o(1/(n log n)),   G1(n) = R/(2n) + o(1/(n log n)),   R = (μ1² + σ1²)/(μ0² + σ0²).

1 1 R 1 μ2 + σ12 G0 (n) = +o , G1 (n) = +o , R = 12 . 2n n log n 2n n log n μ0 + σ02 ‘th G1 [f1 ]’ in Table 1 is this theoretical result.


Table 1. Average generalization errors in Example 1

(μ1, σ1)   (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)    (10,0.5)    (10,2)
th G1[f1]  0.001     0.00025   0.004     0.005     0.00425   0.008     0.101     0.10025     0.104
G1[f1]     0.001055  0.000239  0.004356  0.006162  0.005107  0.009532  0.109341  0.106619    0.108042
G1[f2]     0.000874  0.000170  0.003466  0.004523  0.003670  0.006669  0.079280  0.078802    0.080180
G1[f3]     0.000394  0.000059  0.001475  0.002374  0.001912  0.003736  0.040287  0.038840    0.039682
MG0[f1]    —         0.002000  —         —         0.028784  —         —         1.79×10^26  —
MG0[f2]    —         0.001356  —         —         0.019521  —         —         1.22×10^26  —
MG0[f3]    —         0.000667  —         —         0.009595  —         —         5.98×10^25  —
3×G1[f3]   0.001182  0.000177  0.004425  0.007122  0.005736  0.011208  0.120861  0.116520    0.119046
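The "th G1[f1]" row of Table 1 is just R/(2n) with n = 500 and (μ0, σ0) = (0, 1); a quick check over the nine shifts:

```python
def th_g1_f1(mu1, sigma1, n=500, mu0=0.0, sigma0=1.0):
    """Theoretical G1(n) = R / (2n) for the regular model f1(x, a) = a*x."""
    r = (mu1 ** 2 + sigma1 ** 2) / (mu0 ** 2 + sigma0 ** 2)
    return r / (2 * n)

shifts = [(0, 1), (0, 0.5), (0, 2), (2, 1), (2, 0.5), (2, 2),
          (10, 1), (10, 0.5), (10, 2)]
print([th_g1_f1(m, s) for m, s in shifts])
# matches the th G1[f1] row: 0.001, 0.00025, 0.004, 0.005, 0.00425,
# 0.008, 0.101, 0.10025, 0.104
```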

Table 2. Average generalization errors in Example 2

(μ1, σ1)   (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)    (10,0.5)    (10,2)
G1[f4]     0.000688  0.000260  0.002312  0.002977  0.003094  0.003423  0.013640  0.012769    0.012204
G1[f5]     0.000251  0.000103  0.000934  0.001116  0.001291  0.001318  0.004350  0.003729    0.003743
G1[f6]     0.000146  0.000062  0.000626  0.000705  0.000875  0.000918  0.002896  0.002357    0.002489
MG0[f4]    —         0.001356  —         —         0.019521  —         —         1.22×10^26  —
MG0[f5]    —         0.000667  —         —         0.009595  —         —         5.98×10^25  —
MG0[f6]    —         0.000400  —         —         0.005757  —         —         3.59×10^25  —
3×G1[f5]   0.000753  0.000309  0.002802  0.003348  0.003873  0.003954  0.013050  0.011187    0.011229
5×G1[f6]   0.000730  0.000310  0.003130  0.003525  0.004375  0.004590  0.014480  0.011785    0.012445

We can ﬁnd that ‘th G1 [f1 ]’ are very close to G1 [f1 ] in spite of the fact that the theoretical values are established in asymptotic cases. Based on this fact, the accuracy of experiments can be evaluated to compare them. As for f2 and f3 , they do not have any comparable theoretical value except for the upper bound. We can conﬁrm that every value in G1 [f2 , f3 ] is actually smaller than the bound. Example 2 (Simple Neural Networks). Let assume that the true is the zero function, and the learning models are three-layer perceptrons: g(x) = 0, f4 (x, a, b) = a tanh(bx), f5 (x, a, b) = a3 tanh(bx), f6 (x, a, b) = a5 tanh(bx). Table 2 shows the results. In this example, we can also conﬁrm that the bound works. Combining the results in the previous example, the bound tends to be tight when μ1 is small. As a matter of fact, the bound holds in small sample cases, i.e., the number of training data n does not have to be suﬃciently large. Though we omit it because of the lack of space, the bound is always larger than the experimental results in n = 100, 200, . . . , 400. The property of the bound will be discussed in the next section.

4

Discussions

First, let us confirm whether the sampling from the posterior was successfully done by the MCMC method. Based on the algebraic geometrical method, the coefficients of G0(n) are derived for these models (cf. Table 3). As we mentioned, f2, f4 and

Table 3. The coefficients of the generalization error without the covariate shift

     f1    f2, f4   f3, f5   f6
α    1/2   1/2      1/6      1/10
β    1     2        1        1

f3, f5 have the same theoretical error. According to the examples in the previous section, we can compare the theoretical values to the experimental ones:

G0(n)[f1] = 0.001 ≈ 0.001055,
G0(n)[f2, f4] = 0.000678 ≈ 0.000874, 0.000688,
G0(n)[f3, f5] = 0.000333 ≈ 0.000394, 0.000251,
G0(n)[f6] = 0.0002 ≈ 0.000146.

In the sense of the generalization error, the MCMC method worked well, though there is some fluctuation in the results. Note that it is still open how to evaluate the method. Here we measured it by the generalization error, since the theoretical value of G0(n) is known. However, this index is just a necessary condition. To develop an evaluation of the selected samples is our future study.

Next, we consider the behavior of G1(n). In the examples, the true function was commonly the zero function g(x) = 0. It is an important case to learn the zero function, because we often prepare a rich enough model in practice. Then, the learning function will be set up as f(x, w) = Σ_{k=1}^{K} t(w1k) h(x, w2k), where h is the basis function, t is the parameterization of its weight, and w = {w11, w21, w12, w22, . . . , w1K, w2K}. Note that many practical models are included in this expression. According to the redundancy of the function, some of the h(x, w2k) learn the zero function. Our examples provided the simplest situations and highlighted the effect of non-regularity in the learning models. The errors G0(n) and G1(n) are generally expressed as

G0(n) = α/n − (β − 1)/(n log n) + o(1/(n log n)),
G1(n) = R1 α/n − R2 (β − 1)/(n log n) + o(1/(n log n)),

where R1, R2 depend on f, g, q0, and q1. R1 and R2 cause the difference between G0 and G1 in this expression. In Eq. (12), the coefficient of 1/n is given by b0 + (d1 − d0). So

R1 = (b0 + (d1 − d0))/b0 = 1 + (d1 − d0)/α.

Let us denote A → B as "A is the only factor to determine a value of B". As mentioned above, f, g, q0, q1 → R1, R2. Though f, g → α, β (cf. around Eqs. (5)-(6)), we should emphasize that α, β, q0, q1 do not determine R1, R2.


This fact is easily confirmed by comparing f2 to f4 (and also f3 to f5). It holds that G1(n)[f2] ≠ G1(n)[f4] for all q1, although they have the same α and β (G0(n)[f2] = G0(n)[f4]). Thus α and β are not informative enough to describe R1 and R2. Comparing the values of G1[f2] to those of G1[f4], the basis function (x in f2 and tanh(bx) in f4) seems to play an important role. To clarify the effect of basis functions, let us fix the function class. Examples 1 and 2 correspond to h(x, w2) = x and h(x, w2) = tanh(bx), respectively. The values of G1[f1] and 3 × G1[f3] (and also 3 × G1[f5] and 5 × G1[f6]) can be regarded as the same under any covariate shift. This implies h, g, q0, q1 → R1, i.e., the parameterization t(w1) will not affect R1. Instead, it affects the non-regularity, or the multiplicity, and decides α and β. Though it is an unclear factor, the influence of R2 does not seem as large as that of R1.

Last, let us analyze properties of the upper bound MG0(n). According to the above discussion, it holds that R1 ≈ G1/G0 ≤ M. The ratio G1/G0 basically depends on g, h, q0 and q1. However, M is determined only by the training and test input distributions, q0, q1 → M. Therefore this bound gives the worst case evaluation over any g and h. Considering the tightness of the bound, we can still improve it based on the relation between the true and learning functions.
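The theoretical G0(n) values used in the comparison at the start of this section (0.001, 0.000678, 0.000333, 0.0002) follow from α/n − (β − 1)/(n log n) with n = 500 and the coefficients of Table 3; a quick check:

```python
import math

n = 500
# (alpha, beta) per model, from Table 3
coeffs = {"f1": (1/2, 1), "f2,f4": (1/2, 2), "f3,f5": (1/6, 1), "f6": (1/10, 1)}

for model, (alpha, beta) in coeffs.items():
    g0 = alpha / n - (beta - 1) / (n * math.log(n))
    print(model, round(g0, 6))
# f1 0.001 / f2,f4 0.000678 / f3,f5 0.000333 / f6 0.0002
```

Note how the β = 2 multiplicity of f2 and f4 lowers the theoretical value below α/n at finite n, even though α is the same as for f1.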

5

Conclusions

In a former study, we obtained the theoretical generalization error and its upper bound under the covariate shift. This paper showed that the theoretical value is supported by experiments, in spite of the fact that it is established in an asymptotic case. We observed the tightness of the bound and discussed the effect of basis functions in the learning models. In this paper, the non-regular models are simple lines and neural networks. It is an interesting issue to investigate more general models. Though we mainly considered the amount of G1(n), the computational cost of the MCMC method is strongly connected to the form of the learning function f. It is our future study to take account of this cost in the evaluation.

Acknowledgements The authors would like to thank Masashi Sugiyama, Motoaki Kawanabe, and Klaus-Robert Müller for fruitful discussions. The software to calculate the MCMC method and technical comments were provided by Kenji Nagata. This research was partly supported by the Alexander von Humboldt Foundation and MEXT 18079007.


References

1. Yamazaki, K., Kawanabe, M., Watanabe, S., Sugiyama, M., Müller, K.R.: Asymptotic Bayesian generalization error when training and test distributions are different. In: Proceedings of the 24th International Conference on Machine Learning, pp. 1079–1086 (2007)
2. Wolpaw, J.R., Birbaumer, N., McFarland, D.J., Pfurtscheller, G., Vaughan, T.M.: Brain-computer interfaces for communication and control. Clinical Neurophysiology 113(6), 767–791 (2002)
3. Baldi, P., Brunak, S., Stolovitzky, G.A.: Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge (1998)
4. Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90, 227–244 (2000)
5. Sugiyama, M., Müller, K.R.: Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions 23(4), 249–279 (2005)
6. Sugiyama, M., Krauledat, M., Müller, K.R.: Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8 (2007)
7. Huang, J., Smola, A., Gretton, A., Borgwardt, K.M., Schölkopf, B.: Correcting sample selection bias by unlabeled data. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19. MIT Press, Cambridge, MA (2007)
8. Watanabe, S.: Algebraic analysis for non-identifiable learning machines. Neural Computation 13(4), 899–933 (2001)
9. Rissanen, J.: Stochastic complexity and modeling. Annals of Statistics 14, 1080–1100 (1986)
10. Watanabe, S.: Algebraic analysis for singular statistical estimation. In: Watanabe, O., Yokomori, T. (eds.) ALT 1999. LNCS (LNAI), vol. 1720, pp. 39–50. Springer, Heidelberg (1999)
11. Watanabe, S.: Algebraic information geometry for learning machines with singularities. Advances in Neural Information Processing Systems 14, 329–336 (2001)
12. Ogata, Y.: A Monte Carlo method for an objective Bayesian procedure. Ann. Inst. Statist. Math. 42(3), 403–433 (1990)

Using Image Stimuli to Drive fMRI Analysis

David R. Hardoon¹, Janaina Mourão-Miranda², Michael Brammer², and John Shawe-Taylor¹

¹ The Centre for Computational Statistics and Machine Learning, Department of Computer Science, University College London, Gower St., London WC1E 6BT
{D.Hardoon,jst}@cs.ucl.ac.uk
² Brain Image Analysis Unit, Centre for Neuroimaging Sciences (PO 89), Institute of Psychiatry, De Crespigny Park, London SE5 8AF
{Janaina.Mourao-Miranda,Michael.Brammer}@iop.kcl.ac.uk

Abstract. We introduce a new unsupervised fMRI analysis method based on Kernel Canonical Correlation Analysis, which differs from the class of supervised learning methods that are increasingly being employed in fMRI data analysis. Whereas SVM associates properties of the imaging data with simple, specific categorical labels, KCCA replaces these simple labels with a label vector for each stimulus containing details of the features of that stimulus. We have compared KCCA and SVM analyses of an fMRI data set involving responses to emotionally salient stimuli. This involved first training the algorithms (SVM, KCCA) on a subset of fMRI data and the corresponding labels/label vectors, then testing the algorithms on data withheld from the original training phase. The classification accuracies of SVM and KCCA proved to be very similar. However, the most important result arising from this study is that KCCA is able, in part, to extract many of the brain regions that SVM identifies as the most important in task discrimination, while remaining blind to the categorical task labels. Keywords: Machine learning methods, Kernel canonical correlation analysis, Support vector machines, Classifiers, Functional magnetic resonance imaging data analysis.

1

Introduction

Recently, machine learning methodologies have been increasingly used to analyse the relationship between stimulus categories and fMRI responses [1,2,3,4,5,6,7,8,9,10]. In this paper, we introduce a new unsupervised machine learning approach to fMRI analysis, in which the simple categorical description of stimulus type (e.g. type of task) is replaced by a more informative vector of stimulus features. We compare this new approach with a standard Support Vector Machine (SVM) analysis of fMRI data using a categorical description of stimulus type.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 477–486, 2008. © Springer-Verlag Berlin Heidelberg 2008


The technology of the present study originates from earlier research carried out in the domain of image annotation [11], where an image annotation methodology learns a direct mapping from image descriptors to keywords. Previous attempts at unsupervised fMRI analysis have been based on Kohonen self-organising maps, fuzzy clustering [12], and nonparametric estimation methods for the hemodynamic response function, such as the general method described in [13]. The authors of [14] reported an interesting study showing that the discriminability of PCA basis representations of images of multiple object categories is significantly correlated with the discriminability of PCA basis representations of the fMRI volumes based on category labels. The current study differs from conventional unsupervised approaches in that it makes use of the stimulus characteristics as an implicit representation of a complex state label. We use kernel Canonical Correlation Analysis (KCCA) to learn the correlation between an fMRI volume and its corresponding stimulus. Canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlations of the projections of the variables onto the corresponding basis vectors are maximised. KCCA first projects the data into a higher-dimensional feature space before performing CCA in that new feature space. CCA [15,16] and KCCA [17] have been used in previous fMRI analyses, but only with conventional categorical stimulus descriptions, without exploring the possibility of using complex characteristics of the stimuli as the source for feature selection from the fMRI data. The fMRI data used in the following study originated from an experiment in which the stimuli were designed to evoke different types of emotional response, pleasant or unpleasant. The pleasant images consisted of women in swimsuits, while the unpleasant images were a collection of images of skin diseases.
Each stimulus image was represented using Scale Invariant Feature Transformation (SIFT) [18] features. In the current study, we present a feasibility study of generating new activity maps by using the actual stimuli that generated the fMRI volumes. We show that KCCA is able to extract brain regions identified by supervised methods such as SVM in task discrimination, and to achieve similar levels of accuracy; we also discuss some of the challenges in interpreting the results, given the complex input feature vectors used by KCCA in place of categorical labels. This work is an extension of the work presented in [19]. The paper is structured as follows. Section 2 reviews the fMRI data acquisition, the experimental design, and the pre-processing, followed by a brief description of the scale invariant feature transformation in Section 2.1. Section 2.2 briefly describes the SVM and elaborates on the KCCA methodology. Our results are presented in Section 3. We conclude with a discussion in Section 4.
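The SIFT bag-of-features encoding of the stimuli (detailed in Section 2.1) can be sketched as follows. This is a minimal reading of that construction, not the authors' code: the descriptor counts, dimensionality, and the helper names `kmeans` and `soft_assign` are all illustrative assumptions.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; stands in for the K-means step of the paper."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return centres

def soft_assign(sift_features, centres):
    """x_{i,j} = exp(-min_v d(v, o_j)^2), v ranging over one image's descriptors."""
    d2 = ((sift_features[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2.min(axis=0))          # one K-dimensional vector per image

# Toy stand-in: 3 "images" with varying numbers of 8-d descriptors
# (real SIFT descriptors are 128-d and counts are far larger).
rng = np.random.default_rng(1)
images = [rng.normal(size=(n, 8)) for n in (30, 45, 32)]
K = min(len(f) for f in images)             # smallest descriptor count, as in the paper
centres = kmeans(np.vstack(images), K)
X = np.array([soft_assign(f, centres) for f in images])
print(X.shape)                              # (3, 30): one K-dim vector per image
```

Each image, regardless of how many descriptors it has, ends up as a fixed-length vector of Gaussian similarities to the cluster centres, which makes images directly comparable.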


2


Materials and Methods

Due to lack of space, we refer the reader to [10] for a detailed account of the subjects, the data acquisition, the pre-processing applied to the data, and the experimental design.
2.1

Scale Invariant Feature Transformation

Scale Invariant Feature Transformation (SIFT) was introduced by [18] and has been shown to be superior to other descriptors [20]. This is because SIFT descriptors are designed to be invariant to small shifts in the position of salient (i.e. prominent) regions. Calculation of the SIFT vector begins with a scale-space search in which local minima and maxima are identified in each image (so-called key locations). The properties of the image at each key location are then expressed in terms of gradient magnitude and orientation. A canonical orientation is then assigned to each key location to maximize rotation invariance. Robustness to reorientation is introduced by representing local image regions around key locations in a number of orientations. A reference key vector is then computed over all images, and the data for each image are represented in terms of distance from this reference. Interestingly, some of the properties of the SIFT representation have been modeled on the properties of complex neurons in the visual cortex. Although not specifically exploited in the current paper, future studies may be able to utilize this property to probe aspects of brain function such as modularity.

Image Processing. Let f_i^l be the SIFT feature vector for image i, where l is the number of features. Each image i has a different number of SIFT features l, making it difficult to compare two images directly. To overcome this problem we apply K-means clustering to the SIFT features to obtain a uniform frame: we find K classes and their respective centres o_j, where j = 1, . . . , K. The feature vector x_i of an image stimulus i is K-dimensional, with j-th component x_{i,j} computed as a Gaussian measure of the minimal distance between the SIFT features f_i^l and the centre o_j:

    x_{i,j} = exp( − min_{v ∈ f_i^l} d(v, o_j)² ),                    (1)

where d(·, ·) is the Euclidean distance. The number of centres is set to the smallest number of SIFT features computed over the images (found to be 300). Therefore, after processing, each image is represented by a 300-dimensional feature vector encoding its relative distances from the cluster centres.
2.2

Methods

Support Vector Machines. Support vector machines [21] are kernel-based methods that find functions of the data that facilitate classification. They are derived from statistical learning theory [22] and have emerged as powerful tools for statistical pattern recognition [23]. In the linear formulation, an SVM finds,


during the training phase, the hyperplane that separates the examples in the input space according to their class labels. The SVM classifier is trained on examples of the form (x, y), where x represents an input and y its class label. Once the decision function has been learned from the training data, it can be used to predict the class of a new test example. We used a linear-kernel SVM, which allows direct extraction of the weight vector as an image. The parameter C, which controls the trade-off between training errors and smoothness, was fixed at C = 1 in all cases (the default value).¹

Kernel Canonical Correlation Analysis. Proposed by Hotelling in 1936, Canonical Correlation Analysis (CCA) is a technique for finding pairs of basis vectors that maximise the correlation between the projections of paired variables onto their corresponding basis vectors. Correlation is dependent on the chosen coordinate system; therefore, even if there is a very strong linear relationship between two sets of multidimensional variables, this relationship may not be visible as a correlation. CCA seeks a pair of linear transformations, one for each of the paired variables, such that when the variables are transformed the corresponding coordinates are maximally correlated. Let x and y be two zero-mean random variables from a multi-dimensional distribution, and consider the linear combinations x̃ = w_a⊤x and ỹ = w_b⊤y. Maximising the correlation between x̃ and ỹ corresponds to solving

    max_{w_a, w_b} ρ = w_a⊤ C_ab w_b   subject to   w_a⊤ C_aa w_a = w_b⊤ C_bb w_b = 1,

where C_aa and C_bb are the non-singular within-set covariance matrices and C_ab is the between-sets covariance matrix. We suggest using the kernel variant of CCA [24], since due to the linearity of CCA useful descriptors may not be extracted from the data: the correlation could exist only in some non-linear relationship.
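The constrained maximisation just stated can be cast as a generalised eigenvalue problem. Below is a minimal numerical sketch; the ridge term `reg` added to the within-set covariances is our assumption (to keep them non-singular on toy data), not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, reg=1e-3):
    """Solve max rho = wa' Cab wb s.t. wa' Caa wa = wb' Cbb wb = 1
    as the generalised eigenproblem  A v = rho B v."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    n, da = X.shape; db = Y.shape[1]
    Caa = X.T @ X / n + reg * np.eye(da)     # within-set covariances (+ ridge)
    Cbb = Y.T @ Y / n + reg * np.eye(db)
    Cab = X.T @ Y / n                        # between-sets covariance
    A = np.zeros((da + db, da + db)); B = np.zeros_like(A)
    A[:da, da:] = Cab; A[da:, :da] = Cab.T
    B[:da, :da] = Caa; B[da:, da:] = Cbb
    rhos, W = eigh(A, B)                     # eigenvalues ascending; top = max rho
    return rhos[-1], W[:da, -1], W[da:, -1]

# Two views sharing a latent signal z: CCA should recover a high correlation.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
Y = np.hstack([rng.normal(size=(500, 2)), -z + 0.1 * rng.normal(size=(500, 1))])
rho, wa, wb = cca(X, Y)
print(round(rho, 2))                         # close to 1 for this shared signal
```

The sign of the shared column in Y is flipped, which illustrates that CCA absorbs the sign into the basis vectors and still reports a large positive correlation.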
Kernelising CCA offers an alternative solution by first projecting the data into a higher-dimensional feature space, φ : x = (x_1, . . . , x_n) → φ(x) = (φ_1(x), . . . , φ_N(x)) (N ≥ n), before performing CCA in the new feature space. Given the kernel functions κ_a and κ_b, let K_a = X_a X_a⊤ and K_b = X_b X_b⊤ be the kernel matrices corresponding to the two representations of the data, where X_a is the matrix whose rows are the vectors φ_a(x_i), i = 1, . . . , from the first representation, while X_b is the matrix with rows φ_b(x_i) from the second representation. The weights w_a and w_b can be expressed as linear combinations of the training examples, w_a = X_a⊤α and w_b = X_b⊤β. Substituting into the primal CCA formulation gives the optimisation

    max_{α, β} ρ = α⊤ K_a K_b β   subject to   α⊤ K_a² α = β⊤ K_b² β = 1.

This is the dual form of the primal CCA optimisation problem given above; it can be cast as a generalised eigenvalue problem, for which the first k generalised eigenvectors can be found efficiently. Both CCA and KCCA can thus be formulated as eigenproblems. The theoretical analysis in [25,26] suggests the need to regularise kernel CCA, as it shows that the quality of generalisation of the associated pattern function is controlled by the sum of the squares of the weight vector norms. We

¹ The LibSVM toolbox for Matlab was used to perform the classifications: http://www.csie.ntu.edu.tw/∼cjlin/libsvm/


refer the reader to [25,26] for a detailed analysis and the regularised form of KCCA. The advantages of kernel CCA have been demonstrated in various experiments across the literature. We must clarify that in this particular work, since we use a linear kernel in both views, regularised CCA is the same as regularised linear KCCA. Using KCCA with a linear kernel nevertheless has advantages over plain CCA, the most important of which, in our case, is speed, together with the regularisation.²

Using linear kernels, so as to allow direct extraction of the weights, KCCA performs the analysis by projecting the fMRI volumes into the learnt semantic space defined by the eigenvector corresponding to the largest correlation value (output by the eigenproblem). We classify a new fMRI volume as follows. Let α_i be the eigenvector corresponding to the largest eigenvalue, and let φ(x̂) be the new volume. We project the fMRI data into the semantic space via w = X_a⊤α_i (these are the training weights, analogous to those of the SVM), and using these weights we classify the new example as ŵ = φ(x̂)⊤w, where ŵ is a weighted value (score) for the new volume. The score can be thresholded to allocate a category to each test example. To avoid the complication of finding a threshold, we zero-mean the outputs and threshold the scores at zero: ŵ < 0 is associated with unpleasant (a label of −1) and ŵ ≥ 0 with pleasant (a label of 1). We hypothesise that KCCA is able to derive additional activities that may exist a priori, but were previously unknown, in the experiment, by projecting the fMRI volumes into the semantic space using the remaining eigenvectors corresponding to lower correlation values.
We attempted to corroborate this hypothesis on the existing data, but found that the additional semantic features that cut across pleasant and unpleasant images did not share visible attributes. We have therefore confined our discussion here to the first eigenvector.
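The scoring rule described above can be condensed into the following sketch, using regularised KCCA with linear kernels on toy stand-in data. The regularisation constant `kappa`, all data sizes, and the helper `linear_kcca` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import eigh

def linear_kcca(Xa, Xb, kappa=100.0):
    """Regularised KCCA with linear kernels, solved as the dual generalised
    eigenproblem; returns the primal weight map w = Xa' alpha for the top
    correlation. `kappa` is an assumed regularisation constant."""
    n = Xa.shape[0]
    Ka, Kb = Xa @ Xa.T, Xb @ Xb.T            # linear kernel matrices
    A = np.zeros((2 * n, 2 * n)); B = np.zeros_like(A)
    A[:n, n:] = Ka @ Kb; A[n:, :n] = Kb @ Ka
    B[:n, :n] = Ka @ Ka + kappa * np.eye(n)
    B[n:, n:] = Kb @ Kb + kappa * np.eye(n)
    _, V = eigh(A, B)                        # top eigenvector = max correlation
    alpha = V[:n, -1]
    return Xa.T @ alpha

# Toy paired data: "fMRI volumes" Xa and "SIFT stimulus vectors" Xb
# sharing a hidden pleasant/unpleasant signal z.
rng = np.random.default_rng(0)
z = rng.choice([-1.0, 1.0], size=(80, 1))
Xa = z @ rng.normal(size=(1, 50)) + 0.3 * rng.normal(size=(80, 50))
Xb = z @ rng.normal(size=(1, 30)) + 0.3 * rng.normal(size=(80, 30))
w = linear_kcca(Xa[:64], Xb[:64])            # train on the first 64 pairs

scores = Xa[64:] @ w                         # project withheld volumes
scores -= scores.mean()                      # zero-mean the outputs
labels = np.where(scores < 0, -1, 1)         # threshold at zero
print(labels.shape)                          # one +/-1 label per test volume
```

Note that the label vector is recovered from the stimulus-volume correlation alone; no categorical labels enter the fit.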

3

Results

Experiments were run on a leave-one-out basis, where in each repeat a block of positive and a block of negative fMRI volumes were withheld for testing. Data from the 16 subjects were combined. This amounted, per run, to 1330 training and 14 testing fMRI volumes, each set evenly split into positive and negative volumes (these pos/neg splits were not known to KCCA but simply ensured an equal number of images of both types of emotional salience). The analyses were repeated 96 times. Similarly, we ran a further experiment on a leave-subject-out basis, where 15 subjects were combined for training and one was left out for testing. This gave a total of 1260 training and 84 testing fMRI volumes; the analysis was repeated 16 times. The KCCA regularisation parameter was found using 2-fold cross-validation on the training data. We first describe the fMRI activity analysis. After training the SVM we are able to extract and display the SVM weights as a representation of the brain

² The KCCA toolbox used was from http://homepage.mac.com/davidrh/Code.html


regions important in the pleasant/unpleasant discrimination. A thorough analysis is presented in [10]. The results are shown in Figures 1 and 2; in both figures the weights are unthresholded and show the contrast between viewing Pleasant vs. Unpleasant. The weight value of each voxel indicates the importance of that voxel in differentiating between the two brain states. Figure 1 gives the unthresholded SVM weight maps. Similarly for KCCA: once the semantic representation has been learnt, we are able to project the fMRI data into the learnt semantic feature space, producing the primal weights. These weights, like those generated by the SVM approach, can be considered a representation of the fMRI activity. Figure 2 displays the KCCA weights. Figure 3 displays the unthresholded weight values for the KCCA approach with the hemodynamic function applied to the image stimuli (i.e. applied to the SIFT features prior to analysis). The hemodynamic response function is the impulse response function used to model the delay and dispersion of hemodynamic responses to neuronal activation [27]. Applying the hemodynamic function to the images' SIFT features reweights the image features according to the computed delay and dispersion model; we compute the hemodynamic function with the SPM2 toolbox using default parameter settings. As the KCCA weights are driven not by simple categorical image descriptors (pleasant/unpleasant) but by complex image feature vectors, it is of great interest that many regions, especially in the visual cortex, found by SVM are also highlighted by KCCA. We interpret this similarity as indicating that many important components of the SIFT feature vector are associated with the pleasant/unpleasant discrimination.
Other features, in the frontal cortex, are much less reproducible between SVM and KCCA, indicating that many brain regions detect image differences not rooted in the major emotional salience of the images. In order to validate the activity patterns found in Figure 2, we show that the learnt semantic space can be used to correctly discriminate withheld (testing) fMRI volumes. We also give the 2-norm error to provide an indication as to

Fig. 1. The unthresholded weight values for the SVM approach, showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for positive (Pleasant) values. The discrimination analysis on the training data was performed with labels (+1/−1).


Fig. 2. The unthresholded weight values for the KCCA approach showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant). The discrimination analysis on the training data was performed without labels. The class discrimination is automatically extracted from the analysis.

Fig. 3. The unthresholded weight values for the KCCA approach with the hemodynamic function applied to the image stimuli showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant).

the quality of the patterns found between the fMRI volumes and image stimuli from the testing set, measured by ‖K_a α − K_b β‖₂ (normalised over the number of volumes and analysis repeats). The latter is especially important when the hemodynamic function has been applied to the image stimuli, as straightforward discrimination is then no longer available for comparison. Table 1 shows the average and median performance of SVM and KCCA on the testing of pleasant and unpleasant fMRI blocks for the leave-two-block-out experiment. Our proposed unsupervised approach achieved an average accuracy of 87.28%, slightly less than the 91.52% of the SVM, although both methods had the same median accuracy of 92.86%. The results of the leave-subject-out experiment are given in Table 2, where KCCA achieved an average accuracy of 79.24%, roughly 5% less than the supervised SVM method. In both tables the Hemodynamic Function is abbreviated as HF. We can observe in both tables that the quality of the patterns is better than random; the results demonstrate that the activity analysis is meaningful. To further confirm the validity of the methodology we repeat the experiments with the


Table 1. KCCA & SVM results on the leave-two-block-out experiment: average and median performance over 96 repeats. Values are accuracies (higher is better); for the 2-norm error, lower is better.

Method               Average  Median  Avg ‖·‖₂ error  Median ‖·‖₂ error
KCCA                 87.28    92.86   0.0048          0.0048
SVM                  91.52    92.86   –               –
Random KCCA          49.78    50.00   0.0103          0.0093
Random SVM           52.68    50.00   –               –
KCCA with HF         –        –       0.0032          0.0031
Random KCCA with HF  –        –       1.1049          0.9492

Table 2. KCCA & SVM results on the leave-one-subject-out experiment: average and median performance over 16 repeats. Values are accuracies (higher is better); for the 2-norm error, lower is better.

Method               Average  Median  Avg ‖·‖₂ error  Median ‖·‖₂ error
KCCA                 79.24    79.76   0.0025          0.0024
SVM                  84.60    86.90   –               –
Random KCCA          48.51    47.62   0.0052          0.0044
Random SVM           48.88    48.21   –               –
KCCA with HF         –        –       0.0016          0.0015
Random KCCA with HF  –        –       0.5869          0.0210

image stimuli randomised, thereby breaking the relationship between fMRI volume and stimulus. As Tables 1 and 2 show, KCCA and SVM then both perform equivalently to a random classifier. It is also interesting to observe that, when the hemodynamic function is applied, the error of the random KCCA is substantially different from, and worse than, that of the non-random KCCA, implying that only spurious correlations are found.
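The randomisation control can be sketched as follows, with a simple centroid-difference linear scorer standing in for the full SVM/KCCA pipeline and purely synthetic data; shuffling the labels breaks the volume-stimulus link and drives accuracy to chance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic paired data: every feature of a "volume" tracks its +/-1 label.
y = rng.choice([-1, 1], size=200)
X = np.outer(y, np.ones(10)) + rng.normal(size=(200, 10))

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a centroid-difference linear scorer."""
    hits = 0
    for i in range(len(y)):
        m = np.ones(len(y), bool); m[i] = False
        # weight vector: difference of the two class means on the training fold
        w = X[m][y[m] == 1].mean(0) - X[m][y[m] == -1].mean(0)
        hits += (np.sign(X[i] @ w) == y[i])
    return hits / len(y)

true_acc = loo_accuracy(X, y)
perm_acc = loo_accuracy(X, rng.permutation(y))   # break the volume-stimulus link
print(true_acc > 0.9, abs(perm_acc - 0.5) < 0.2)
```

The intact pairing scores near-perfectly, while the permuted pairing hovers around chance, mirroring the "Random KCCA/SVM" rows of the tables.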

4

Discussion

In this paper we have presented a novel unsupervised methodology for fMRI activity analysis in which a simple categorical description of stimulus type is replaced by a more informative vector of stimulus (SIFT) features. We use kernel canonical correlation analysis with an implicit representation of a complex state label in order to exploit the stimulus characteristics. The most interesting aspect of KCCA is its ability to extract visual regions very similar to those found to be important in categorical image classification using a supervised SVM. KCCA “finds” areas in the brain that are correlated with the features in the SIFT vector, regardless of the stimulus category. Because many features of the stimuli were associated with the pleasant/unpleasant categories, we were able to use the KCCA results to classify the fMRI images between these categories. In the current study it is difficult to address the issue of modular versus distributed neural coding, as the complexity of the stimuli (and consequently of the SIFT vector) is very high.


A further interesting possible application of KCCA relates to the detection of “inhomogeneities” in stimuli of a particular type (e.g. happy/sad/disgusting emotional stimuli). If KCCA analysis revealed brain regions strongly associated with substructure within a single stimulus category, this could be valuable in testing whether a certain type of image was being consistently processed by the brain, and in designing stimuli for particular experiments. There are many open-ended questions that have not been explored in our current research, which has primarily been focused on fMRI analysis and discrimination capacity. KCCA is a bi-directional technique, and we are therefore also able to compute a weight map for the stimuli from the learned semantic space. This capacity has the potential to greatly improve our understanding of the link between fMRI analysis and stimuli by potentially telling us which image features were important.

Acknowledgments. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. David R. Hardoon is supported by the EPSRC project Le Strum, EP-D063612-1. This publication only reflects the authors' views. We would like to thank Karl Friston for his constructive suggestions.

References

1. Cox, D.D., Savoy, R.L.: Functional magnetic resonance imaging (fMRI) ‘brain reading’: detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage 19, 261–270 (2003)
2. Carlson, T.A., Schrater, P., He, S.: Patterns of activity in the categorical representations of objects. Journal of Cognitive Neuroscience 15, 704–717 (2003)
3. Wang, X., Hutchinson, R., Mitchell, T.M.: Training fMRI classifiers to detect cognitive states across multiple human subjects. In: Proceedings of the 2003 Conference on Neural Information Processing Systems (2003)
4. Mitchell, T., Hutchinson, R., Niculescu, R., Pereira, F., Wang, X., Just, M., Newman, S.: Learning to decode cognitive states from brain images. Machine Learning 1-2, 145–175 (2004)
5. LaConte, S., Strother, S., Cherkassky, V., Anderson, J., Hu, X.: Support vector machines for temporal classification of block design fMRI data. NeuroImage 26, 317–329 (2005)
6. Mourao-Miranda, J., Bokde, A.L.W., Born, C., Hampel, H., Stetter, S.: Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data. NeuroImage 28, 980–995 (2005)
7. Haynes, J.D., Rees, G.: Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nature Neuroscience 8, 686–691 (2005)
8. Davatzikos, C., Ruparel, K., Fan, Y., Shen, D.G., Acharyya, M., Loughead, J.W., Gur, R.C., Langleben, D.D.: Classifying spatial patterns of brain activity with machine learning methods: Application to lie detection. NeuroImage 28, 663–668 (2005)
9. Kriegeskorte, N., Goebel, R., Bandettini, P.: Information-based functional brain mapping. PNAS 103, 3863–3868 (2006)


10. Mourao-Miranda, J., Reynaud, E., McGlone, F., Calvert, G., Brammer, M.: The impact of temporal compression and space selection on SVM analysis of single-subject and multi-subject fMRI data. NeuroImage (accepted, 2006)
11. Hardoon, D.R., Saunders, C., Szedmak, S., Shawe-Taylor, J.: A correlation approach for automatic image annotation. In: Li, X., Zaïane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 681–692. Springer, Heidelberg (2006)
12. Wismuller, A., Meyer-Base, A., Lange, O., Auer, D., Reiser, M.F., Sumners, D.: Model-free functional MRI analysis based on unsupervised clustering. Journal of Biomedical Informatics 37, 10–18 (2004)
13. Ciuciu, P., Poline, J., Marrelec, G., Idier, J., Pallier, C., Benali, H.: Unsupervised robust non-parametric estimation of the hemodynamic response function for any fMRI experiment. IEEE TMI 22, 1235–1251 (2003)
14. O'Toole, A.J., Jiang, F., Abdi, H., Haxby, J.V.: Partially distributed representations of objects and faces in ventral temporal cortex. Journal of Cognitive Neuroscience 17(4), 580–590 (2005)
15. Friman, O., Borga, M., Lundberg, P., Knutsson, H.: Adaptive analysis of fMRI data. NeuroImage 19, 837–845 (2003)
16. Friman, O., Carlsson, J., Lundberg, P., Borga, M., Knutsson, H.: Detection of neural activity in functional MRI using canonical correlation analysis. Magnetic Resonance in Medicine 45(2), 323–330 (2001)
17. Hardoon, D.R., Shawe-Taylor, J., Friman, O.: KCCA for fMRI analysis. In: Proceedings of Medical Image Understanding and Analysis, London, UK (2004)
18. Lowe, D.: Object recognition from local scale-invariant features. In: Proceedings of the 7th IEEE International Conference on Computer Vision, Kerkyra, Greece, pp. 1150–1157 (1999)
19. Hardoon, D.R., Mourao-Miranda, J., Brammer, M., Shawe-Taylor, J.: Unsupervised analysis of fMRI data using kernel canonical correlation. NeuroImage (in press, 2007)
20. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In: International Conference on Computer Vision and Pattern Recognition, pp. 257–263 (2003)
21. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
22. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
23. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Proc. Fifth Ann. Workshop on Computational Learning Theory, pp. 144–152. ACM, New York (1992)
24. Fyfe, C., Lai, P.L.: Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems 10, 365–377 (2001)
25. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Computation 16, 2639–2664 (2004)
26. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
27. Stephan, K.E., Harrison, L.M., Penny, W.D., Friston, K.J.: Biophysical models of fMRI responses. Current Opinion in Neurobiology 14, 629–635 (2004)

Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin

Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima

Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Japan
[email protected]

Abstract. Reinforcement learning (RL) for a linear family of tasks is studied in this paper. The key point of our discussion is the nonlinearity of the optimal solution, even though the task family is linear: we cannot obtain the optimal policy by a naive approach. Though there exists an algorithm for calculating the result equivalent to Q-learning for every task in the family all together, it suffers from an explosion of set sizes. We introduce adaptive margins to overcome this difficulty.

1

Introduction

Reinforcement learning (RL) for a linear family of tasks is studied in this paper. Such learning is useful for time-varying environments, multi-criteria problems, and inverse RL [5,6]. The family is defined as a weighted sum of several criteria; it is linear in the sense that the reward is linear with respect to the weight parameters. For instance, criteria for network routing include end-to-end delay, loss of packets, and the power level associated with a node [5]. Selecting appropriate weights beforehand is difficult in practice and requires trial and error. In addition, the appropriate weights may change over time. Parallel RL over all possible weight values is desirable in such cases. The key point of our discussion is the nonlinearity of the optimal solution: it is not linear but, in fact, piecewise-linear. This fact implies that we cannot obtain the best policy by the following naive approach:

1. Find the value function for each criterion.
2. Calculate the weighted sum of these to obtain the total value function.
3. Construct a policy on the basis of the total value function.

A typical example is presented in Section 5. Piecewise-linearity of the optimal solution has been pointed out independently in [4] and [5]. The latter aims at fast adaptation under time-varying environments. The former is our previous report, in which we tried to obtain the optimal solutions for various weight values all together. Though we developed an algorithm that gives a solution exactly equivalent to Q-learning for each weight value, it has a difficulty with explosion of set size. This difficulty is not a problem of the algorithm but an intrinsic nature of Q-learning for the weighted criterion model.
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 487–496, 2008. © Springer-Verlag Berlin Heidelberg 2008


We first introduced a simple approximation with a ‘margin’ into the decision of convexity [6]. We then improved it so as to obtain an interval estimation with which we can monitor the effect of the approximation [7]. In this paper, we propose adaptive adjustment of the margins. In the margin-based approach, we have to manage large sets of vectors in the first stage of learning, and the peak set size tends to be large if we set a small margin to obtain an accurate final result. The proposed method alleviates this trade-off: by changing the margins appropriately through the learning steps, we can enjoy both a small set size in the first stage, with large margins, and an accurate result in the final stage, with small margins. The weighted criterion model is defined in Section 2, and parallel RL for it is described in Section 3. The difficulty of set size is then pointed out and margins are introduced in Section 4, where adaptive adjustment of margins is also proposed. Its behavior is verified with experiments in Section 5. Finally, a conclusion is given in Section 6.

2

Weighted Criterion Model

An “orthodox” RL setting is assumed for states and actions, as follows:

– The time step is discrete (t = 0, 1, 2, 3, . . .).
– The state set S and the action set A are finite and known.
– The state transition rule P is unknown.
– The state s_t is observable.
– The task is a Markov decision process (MDP).

The reward r_{t+1} is given as a weighted sum of partial rewards r¹_{t+1}, ..., r^M_{t+1}:

$$r_{t+1}(\beta) = \sum_{i=1}^{M} \beta_i r_{t+1}^{i} = \beta \cdot \mathbf{r}_{t+1}, \qquad (1)$$

with the weight vector

$$\beta \equiv (\beta_1, \ldots, \beta_M) \in \mathbb{R}^M, \qquad (2)$$

and the reward vector

$$\mathbf{r}_{t+1} \equiv (r_{t+1}^{1}, \ldots, r_{t+1}^{M}) \in \mathbb{R}^M. \qquad (3)$$

We assume that the partial rewards r¹_{t+1}, ..., r^M_{t+1} are also observable, whereas their reward rules R(1), ..., R(M) are unknown. Multi-criteria RL problems of this type have been introduced independently in [3] and [5]. We wish to find the optimal policy π*_β for each weight β that maximizes the expected cumulative reward with a given discount factor 0 < γ < 1,

$$\pi_\beta^{*} = \operatorname*{argmax}_{\pi}\, E^{\pi}\!\left[ \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau+1}(\beta) \right], \qquad (4)$$

where E^π[·] denotes the expectation under a policy π. To be exact, π*_β is defined as a policy that attains Q^{π*_β}_β(s, a; γ) = Q*_β(s, a; γ) ≡ max_π Q^π_β(s, a; γ) for all state-action pairs (s, a), where the action-value function Q^π_β is defined as

Parallel Reinforcement Learning for Weighted Multi-criteria Model

$$Q_\beta^{\pi}(s, a; \gamma) \equiv E^{\pi}\!\left[ \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{\tau+1}(\beta) \,\middle|\, s_0 = s,\ a_0 = a \right]. \qquad (5)$$

It is well known that an MDP has a deterministic policy π*_β that satisfies the above condition; such a π*_β is obtained from the optimal value function [2],

$$\pi_\beta^{*} : S \to A : s \mapsto \operatorname*{argmax}_{a \in A} Q_\beta^{*}(s, a; \gamma). \qquad (6)$$

Thus we concentrate on estimation of Q*_β. Note that Q*_β is nonlinear with respect to β; a typical example is presented in section 5. Basic properties of the action-value function Q are described briefly in the rest of this section [4,5,6]. The discount factor γ is fixed throughout this paper, and it is omitted below.

Proposition 1. Q^π_β(s, a) is linear with respect to β for a fixed policy π.

Proof. Neither P nor π depends on β by assumption. Hence the joint distribution of (s_0, a_0), (s_1, a_1), (s_2, a_2), ... is independent of β, which implies linearity.

Definition 1. If f : R^M → R can be written as f(β) = max_{q∈Ω}(q · β) with a nonempty finite set Ω ⊂ R^M, we call f Finite-Max-Linear (FML) and write it as f = FML_Ω. It is immediate that f is convex and piecewise-linear if f is FML.

Proposition 2. The optimal action-value function is FML as a function of the weight β. Namely, there exists a nonempty finite set Ω*(s, a) ⊂ R^M for each state-action pair (s, a), and Q*_β is written as

$$Q_\beta^{*}(s, a) = \max_{q \in \Omega^{*}(s,a)} q \cdot \beta. \qquad (7)$$

Proof. We have assumed an MDP. It is well known that Q*_β can be written as Q*_β(s, a) = max_{π∈Π} Q^π_β(s, a) for the set Π of all deterministic policies. Π is finite, and Q^π_β is linear with respect to β from proposition 1. Hence Q*_β is FML.

Proposition 3. Assume that an estimated action-value function Q_β is FML as a function of the weight β. If we apply Q-learning, the updated value

$$Q_\beta^{\mathrm{new}}(s_t, a_t) = (1-\alpha)\, Q_\beta(s_t, a_t) + \alpha\left( \beta \cdot \mathbf{r}_{t+1} + \gamma \max_{a \in A} Q_\beta(s_{t+1}, a) \right) \qquad (8)$$

is still FML as a function of β, where α > 0 is the learning rate.

Proof. There exists a nonempty finite set Ω(s, a) ⊂ R^M such that Q_β(s, a) = max_{q∈Ω(s,a)}(q · β) for each (s, a). Then (8) implies Q^new_β(s_t, a_t) = max_{q̃∈Ω̃} q̃ · β, where

$$\tilde{\Omega} \equiv \left\{ (1-\alpha)q + \alpha(\mathbf{r}_{t+1} + \gamma q') \,\middle|\, a \in A,\ q \in \Omega(s_t, a_t),\ q' \in \Omega(s_{t+1}, a) \right\}, \qquad (9)$$

because max_x f(x) + max_y g(y) = max_{x,y}(f(x) + g(y)) holds in general. The set Ω̃ is finite, and Q^new_β is FML.

These propositions imply that (1) the true Q*_β is FML, and (2) its estimate Q_β is also FML as long as the initial estimate is FML.
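A minimal sketch of an FML function as just defined: the maximum of finitely many linear functions of β, evaluated by brute force. The set Ω and the sample weights below are illustrative only:

```python
import numpy as np

def fml(omega, beta):
    """Evaluate FML_Omega(beta) = max over q in Omega of q . beta."""
    return max(q @ beta for q in omega)

# A toy vector set Omega with M = 2 (numbers are illustrative).
omega = [np.array(q) for q in [(0.0, 0.0), (2.0, -1.0), (-1.0, 3.0)]]

# FML_Omega is convex and piecewise-linear in beta: different elements q
# attain the maximum on different regions of the weight space.
for b in [(-1.0, 1.0), (0.2, 1.0), (3.0, 1.0)]:
    print(fml(omega, np.array(b)))
```

Each printed value comes from a different maximizing q, which is exactly the piecewise-linear structure that the propositions above exploit.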

3 Parallel Q-Learning for All Weights

A parallel Q-learning method for the weighted criterion model has been proposed in [6]. The estimates Q_β for all β ∈ R^M are updated all together in parallel Q-learning. In this method, Q_β(s, a) for each (s, a) is treated in an FML expression:

$$Q_\beta(s, a) = \max_{q \in \Omega(s,a)} q \cdot \beta = \mathrm{FML}_{\Omega(s,a)}(\beta) \qquad (10)$$

with a certain set Ω(s, a) ⊂ R^M. We store and update Ω(s, a) instead of Q_β(s, a) on the basis of propositions 2 and 3. Though a naive updating rule has been suggested in the proof of proposition 3, it is extremely redundant and inefficient. We need several definitions to describe a better algorithm.

Definition 2. An element c ∈ Ω is redundant if FML_{Ω−{c}} = FML_Ω.

Definition 3. We write Ω† for the set of non-redundant elements in Ω. Note that FML_{Ω†} = FML_Ω [5].

Definition 4. We define the following operations:

$$c\,\Omega \equiv \{cq \mid q \in \Omega\}, \qquad c + \Omega \equiv \{c + q \mid q \in \Omega\}, \qquad \Omega \oplus \Omega' \equiv \{q + q' \mid q \in \Omega,\ q' \in \Omega'\}, \qquad (11)$$

$$\Omega \uplus \Omega' \equiv (\Omega \cup \Omega')^{\dagger}, \qquad \biguplus_{k=1}^{K} \Omega_k \equiv \Omega_1 \uplus \cdots \uplus \Omega_K, \qquad \Omega \boxplus \Omega' \equiv (\Omega \oplus \Omega')^{\dagger}. \qquad (12),\,(13)$$

With these operations, the updating rule of Ω is described as follows [6]:

$$\Omega^{\mathrm{new}}(s_t, a_t) = (1-\alpha)\,\Omega(s_t, a_t) \;\boxplus\; \alpha\left( \mathbf{r}_{t+1} + \gamma \biguplus_{a \in A} \Omega(s_{t+1}, a) \right). \qquad (14)$$

The initial value of Ω at t = 0 is Ω(s, a) = {o} ⊂ R^M for all (s, a) ∈ S × A, where o is the zero vector; it corresponds to the constant initial function Q_β(s, a) = 0.

Proposition 4. When (10) holds for all states s ∈ S and actions a ∈ A, Q^new_β(s_t, a_t) in (8) is equal to FML_{Ω^new(s_t,a_t)}(β) for (14). Namely, parallel Q-learning is equivalent to Q-learning for each β:

    {Q_β(s, a)}   --update-->   Q^new_β(s_t, a_t)
      | FML expression            | FML expression        (15)
    {Ω(s, a)}     --update-->   Ω^new(s_t, a_t)
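The set-valued update (14) can be sketched with plain set operations. This is the naive variant from the proof of proposition 3, with the redundancy-pruning (dagger) step omitted for clarity; all states, actions, and numbers are made up for illustration:

```python
import numpy as np

def scale(c, omega):                 # c * Omega
    return [c * q for q in omega]

def shift(r, omega):                 # r + Omega (r is a reward vector)
    return [r + q for q in omega]

def minkowski(om1, om2):             # Omega (+) Omega': all pairwise sums
    return [q1 + q2 for q1 in om1 for q2 in om2]

def parallel_q_update(omega, s, a, r, s_next, actions, alpha, gamma):
    """Naive set-valued Q-learning update of Omega(s, a), cf. (14),
    without pruning redundant vectors."""
    merged = [q for b in actions for q in omega[(s_next, b)]]  # union over actions
    target = shift(r, scale(gamma, merged))                    # r + gamma * Omega
    omega[(s, a)] = minkowski(scale(1 - alpha, omega[(s, a)]),
                              scale(alpha, target))
    return omega[(s, a)]

# Toy two-criterion example (all entries illustrative):
omega = {("s", "a"):  [np.array([1.0, 0.0])],
         ("s2", "a"): [np.array([0.0, 1.0])],
         ("s2", "b"): [np.array([2.0, 0.0])]}
out = parallel_q_update(omega, "s", "a", np.array([0.5, 0.5]),
                        "s2", ["a", "b"], alpha=0.5, gamma=0.8)
print(out)
```

Without the pruning step the set size multiplies at every update, which is precisely the explosion that the convex-hull-based Ω† operation is introduced to contain.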

[Fig. 1. Calculation of Ω ⊞ Ω′ in (14) for two-dimensional convex polygons. Vertices of the polygons correspond to Ω, Ω′ and Ω ⊞ Ω′. Steps: (1) set directions of edges; (2) merge and sort edges according to their arguments; (3) connect edges to generate a polygon; (4) shift the origin so that (max x in Ω) + (max x in Ω′) = (max x in Ω ⊞ Ω′), and likewise for y.]

Proof. We introduced the set Ω̃ in (9) to prove proposition 3. With the above operations, (9) is written as

$$\tilde{\Omega} = (1-\alpha)\,\Omega(s_t, a_t) \oplus \alpha\left( \mathbf{r}_{t+1} + \gamma \bigcup_{a \in A} \Omega(s_{t+1}, a) \right).$$

Then (Ω̃)† = Ω^new(s_t, a_t) is obtained, and FML_{Ω^new(s_t,a_t)}(β) = FML_{Ω̃}(β) = Q^new_β(s_t, a_t) is implied.

It is well known that Ω† is equal to the set of vertices of the convex hull of Ω [6]. Efficient convex-hull algorithms have been developed in computational geometry [8]. Using them, we can calculate the merged set Ω ⊎ Ω′ = (Ω ∪ Ω′)†. The sum set Ω ⊞ Ω′ has also been studied in the form of Minkowski sum algorithms [9,10,11]; its calculation is particularly easy for two-dimensional convex polygons (Fig. 1). Before closing the present section, we note an FML version of the Bellman equation in our notation. Theoretically, we can use successive iteration of this equation to find the optimal policy when we know P and R, though we must take care of numerical error in practice.
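For M = 2 the pruning Ω† amounts to keeping the convex-hull vertices. A sketch using Andrew's monotone chain, a standard computational-geometry routine (not necessarily the specific algorithm of [8]):

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def dagger(omega):
    """Prune redundant vectors for M = 2: keep only the convex-hull
    vertices of Omega (Andrew's monotone chain)."""
    pts = sorted(set(map(tuple, omega)))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# The interior point (1, 1) never attains max q . beta, so it is redundant.
print(dagger([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))
```

Applying `dagger` after every merge and Minkowski sum keeps the stored sets at the size of their hulls instead of growing multiplicatively.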

Proposition 5. The FML expression Q*_β = FML_{Ω*}(β) satisfies

$$\Omega^{*}(s, a) = \left( R_{s}^{a} + \gamma \mathop{\boxplus}_{s' \in S} P_{ss'}^{a} \biguplus_{a' \in A} \Omega^{*}(s', a') \right)^{\dagger}, \qquad (16)$$

where

$$R_{s}^{a} = (R_{s}^{a}(1), \ldots, R_{s}^{a}(M)), \qquad R_{s}^{a}(i) = E[r_{t+1}^{i} \mid s_t = s,\ a_t = a], \qquad (17)$$

$$P_{ss'}^{a} = P(s_{t+1} = s' \mid s_t = s,\ a_t = a), \qquad (18)$$

$$\mathop{\boxplus}_{s' \in \{s_1, \ldots, s_k\}} X_{s'} = X_{s_1} \boxplus X_{s_2} \boxplus \cdots \boxplus X_{s_k}. \qquad (19)$$

In particular, the next equation holds if the state transition is deterministic:

$$\Omega^{*}(s, a) = \left( R_{s}^{a} + \gamma \biguplus_{a' \in A} \Omega^{*}(s', a') \right)^{\dagger}, \qquad (20)$$

where s′ is the next state for the action a at the current state s.

Proof. Substituting (7) and R^a_{s,β} ≡ E[r_{t+1}(β) | s_t = s, a_t = a] = R^a_s · β into the Bellman equation Q*_β(s, a) = R^a_{s,β} + γ Σ_{s′∈S} P^a_{ss′} max_{a′∈A} Q*_β(s′, a′), we obtain

$$\max_{q \in \Omega^{*}(s,a)} q \cdot \beta = \max_{q' \in \Omega'(s,a)} q' \cdot \beta, \qquad (21)$$

$$\Omega'(s, a) = R_{s}^{a} + \gamma \left\{ \sum_{s' \in S} P_{ss'}^{a}\, q_{s'} \,\middle|\, q_{s'} \in \biguplus_{a' \in A} \Omega^{*}(s', a') \right\} \qquad (22)$$

in the same way as (9). Hence Ω* is equal to Ω′ except for redundancy.

4 Interval Operations

Under regularity conditions, Q-learning has been proved to converge to Q* [1]. That result implies pointwise convergence of parallel Q-learning to Q*_β for each β because of proposition 3. From proposition 2, Q*_β(s, a) is expressed with a finite Ω*(s, a). However, as we can see in Fig. 1, the number of elements in the set Ω(s, a) increases monotonically and never 'converges' to Ω*(s, a). This is not a paradox; the following assertions can be true at the same time.

1. The numbers of vertices of polygons P_1, P_2, ... increase monotonically.
2. P_t converges to a polygon P* in the sense that the volume of the symmetric difference P_t △ P* = (P_t ∪ P*) − (P_t ∩ P*) converges to 0.
2'. The function FML_{P_t}(·) converges pointwise to FML_{P*}(·).

In short, pointwise convergence of a piecewise-linear function does not imply convergence of the number of pieces. Note that this is not a problem of the algorithm; it is an intrinsic nature of pointwise Q-learning of the weighted criterion model for each weight β. To overcome this difficulty, we first tried a simple approximation with a small 'margin' [6]. Then we introduced interval operations to monitor the approximation error [7]. A pair of sets Ω^L(s, a) and Ω^U(s, a) are updated instead of the original Ω(s, a), so that CH Ω^L(s, a) ⊂ CH Ω(s, a) ⊂ CH Ω^U(s, a) holds, where CH Z represents the convex hull of Z. This relation implies lower and upper bounds Q^L_β(s, a) ≤ Q_β(s, a) ≤ Q^U_β(s, a), where Q^X_β(s, a) = FML_{Ω^X(s,a)}(β) for X = L, U. When the difference between Q^L and Q^U is sufficiently small, it is guaranteed that the effect of the approximation can be ignored. The updating rules of Ω^L and Ω^U are the same as those of Ω, except for the following approximations after every calculation of ⊎ and ⊞. We assume M = 2 here.

Lower approximation for Ω^L: A vertex is removed if the resulting change of the area of CH Ω^L(s, a) is smaller than a threshold ε^L/2 (Fig. 2 left).

[Fig. 2. Lower approximation (left) and upper approximation (right). Left: for consecutive vertices a, b, c, d, e, remove vertex c if the area of the triangle it forms with its neighbors is small. Right: if the area of the corresponding triangle is small, remove vertices b and c and add the intersection z of the extended adjacent edges.]

Upper approximation for Ω^U: An edge is removed if the resulting change of the area of CH Ω^U(s, a) is smaller than a threshold ε^U/2 (Fig. 2 right).

In this paper, we propose an automatic adjustment of the margins ε^L, ε^U. The procedures below are performed at every step t after the updating of Ω^L and Ω^U. The symbol X stands for L or U here; ξ_s, ξ_w ≥ 1 and θ_Q, θ_Ω ≥ 0 are constants.

1. Check the changes of the set sizes and of the interval width compared with the previous ones. Namely, check the values

$$\Delta_{\Omega}^{X} = \left| \Omega^{X,\mathrm{new}}(s_t, a_t) \right| - \left| \Omega^{X}(s_t, a_t) \right|, \qquad (23)$$

$$\Delta_{Q} = \left( Q_{\bar\beta}^{U,\mathrm{new}}(s_t, a_t) - Q_{\bar\beta}^{L,\mathrm{new}}(s_t, a_t) \right) - \left( Q_{\bar\beta}^{U}(s_t, a_t) - Q_{\bar\beta}^{L}(s_t, a_t) \right), \qquad (24)$$

where |Z| is the number of elements in Z, and β̄ is selected beforehand.

2. An increase of the set size suggests a need for thinning, whereas an increase of the interval width suggests a need for more accurate calculation. Modify the margins as

$$\epsilon^{X,\mathrm{new}} = \begin{cases} \tilde\epsilon^{X} & (\Delta_Q \le \theta_Q) \\ \tilde\epsilon^{X}/\xi_w & (\Delta_Q > \theta_Q) \end{cases}, \qquad \text{where } \tilde\epsilon^{X} = \begin{cases} \epsilon^{X} & (\Delta_{\Omega}^{X} \le \theta_\Omega) \\ \xi_s\, \epsilon^{X} & (\Delta_{\Omega}^{X} > \theta_\Omega) \end{cases}. \qquad (25)$$

To avoid underflow, we set ε^{X,new} = ε_min if ε^{X,new} is smaller than a constant ε_min.
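The adjustment rule (25), including the underflow guard, can be sketched as follows. The default constants mirror the experimental settings of section 5, and the dictionary-based interface is an assumption of this sketch:

```python
def adjust_margins(eps, delta_omega, delta_q,
                   xi_s=1.7, xi_w=1.015, theta_omega=2, theta_q=0.0,
                   eps_min=1e-14):
    """One step of margin adjustment, cf. (25). `eps` and `delta_omega`
    are dicts keyed by 'L' and 'U' (an interface assumption)."""
    new = {}
    for X in ("L", "U"):
        # Set grew too much: enlarge the margin to thin more aggressively.
        e_tilde = eps[X] * xi_s if delta_omega[X] > theta_omega else eps[X]
        # Interval widened: shrink the margin for a more accurate calculation.
        e_new = e_tilde / xi_w if delta_q > theta_q else e_tilde
        new[X] = max(e_new, eps_min)    # underflow guard
    return new

# Example: the lower set grew by 3 elements and the interval widened.
print(adjust_margins({"L": 1e-4, "U": 1e-4}, {"L": 3, "U": 0}, delta_q=0.5))
```

The two opposing pressures (growth factor ξ_s, shrink factor ξ_w) are what let the margins start large and decay as learning stabilizes.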

5 Experiments with a Basic Task of Weighted Criterion

We have verified the behavior of the proposed method. We set S = {S, G, A, B, X, Y}, A = {Up, Down, Left, Right}, s_0 = S, and γ = 0.8 (Fig. 3) [6]. Each action causes a deterministic state transition in the corresponding direction, except at G, where the agent is moved to S regardless of its action. Rewards 1, 4b, b are offered at s_t = G, X, Y, respectively. If a_t is an action toward the 'outside wall' at s_t ≠ G, the state is unchanged and a negative reward (−1) is added further. This is a weighted criterion model with M = 2, because the reward can be written in the form r_{t+1}(β) = β · r_{t+1} with r_{t+1} = (r¹_{t+1}, r²_{t+1}) and β = (b, 1). The optimal policy changes depending on the weight b. Hence, the optimal value function is

[Fig. 3. Task for experiments; numbers in parentheses are reward values. Grid layout: top row S, X (4b), G (1); bottom row A, Y (b), B; outside = wall (−1).]

Table 1. Optimal state-value functions and optimal policies

  Range of weight           Optimal V*_b(S)        Optimal state transition
  b < −16/25                0                      S → A → S → ···
  −16/25 ≤ b < −225/1796    (2000b + 1280)/2101    S → A → Y → B → G → S → ···
  −225/1796 ≤ b < 15/47     (400b + 80)/61         S → X → G → S → ···
  15/47 ≤ b < 3/4           32b/3                  S → X → Y → X → ···
  3/4 ≤ b                   16b − 4                S → X → X → ···
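Table 1 can be checked directly in FML form: V*_b(S) is the maximum of the five linear pieces, and sampling one b inside each range recovers the corresponding optimal policy. The piece labels below are abbreviations of the state transitions in the table:

```python
# Each optimal policy in Table 1 yields a state value linear in b;
# the optimal value V*(S) is their pointwise maximum (FML form).
pieces = {
    "S->A->S":       (0.0, 0.0),
    "S->A->Y->B->G": (2000 / 2101, 1280 / 2101),
    "S->X->G":       (400 / 61, 80 / 61),
    "S->X->Y->X":    (32 / 3, 0.0),
    "S->X->X":       (16.0, -4.0),
}

def v_star(b):
    """Return (value, winning-policy label) at weight b."""
    return max((m * b + c, name) for name, (m, c) in pieces.items())

# One sample b inside each range of Table 1.
for b in (-1.0, -0.4, 0.0, 0.5, 1.0):
    val, name = v_star(b)
    print(b, round(val, 3), name)
```

The breakpoints where the winning piece changes are exactly the range boundaries listed in Table 1, which is the piecewise-linear structure that the naive approach of section 1 cannot reproduce.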

[Fig. 4. Transition of the margins ε^L and ε^U from various initial margins (10^−13, 10^−10, 10^−7, 10^−4, 0.1), over t = 0 to 20000.]

[Fig. 5. Total number of elements Σ_{s,a} |Ω^X(s, a)| (left: X = L; right: X = U) for the same initial margins, over t = 0 to 20000.]

[Fig. 6. Interval width Q^U_{(0.2,1)}(A, Up) − Q^L_{(0.2,1)}(A, Up) for the various initial margins.]

[Fig. 7. Fixed-margin algorithm (ε^U = ε^L = 10^−2 and ε^U = ε^L = 10^−9). Left: total number of elements Σ_{s,a} |Ω^X(s, a)| for X = U, L. Right: interval width.]

[Fig. 8. Average of 100 trials with inappropriate factors ξ_s = 1.5, ξ_w = 1.015 for γ = 0.5. Left: total number of elements in the upper approximation. Right: interval width.]

nonlinear with respect to b (Table 1). Note that the second pattern (S→A→Y) in Table 1 cannot appear under the naive approach of section 1. The proposed algorithm is applied to this task with random actions a_t and parameters α = 0.7, (ξ_s, ξ_w) = (1.7, 1.015), (θ_Q, θ_Ω) = (0, 2), β̄ = (0.2, 1), ε_min = 10^−14. The initial margins ε^L = ε^U at t = 0 are one of 10^−1, 10^−4, 10^−7, 10^−10, 10^−13. On this task, we can replace convex hulls with upper convex hulls in our algorithm because β is restricted to the upper half plane [6]. We also assume |b| ≤ 10 ≡ b_max, so in the lower approximation we can safely remove the edges at both ends in Fig. 2 if the absolute value of their slope is greater than b_max. Averages of 100 trials are shown in Figs. 4, 5, and 6. The proposed algorithm is robust to a wide range of initial margins. It realizes reduced set sizes and a small interval width at the same time; these requirements are in a trade-off in the conventional fixed-margin algorithm [7] (Fig. 7). One problem of the proposed algorithm is sensitivity to the factors ξ_s, ξ_w: when they are inappropriate, instability is observed after a long run (Fig. 8). Another problem is slow convergence of the interval width Q^U − Q^L compared with the fixed-margin algorithm.

6 Conclusion

A parallel RL method with adaptive margins has been proposed for the weighted criterion model, and its behavior has been verified experimentally on a basic task. Adaptive margins realize reduced set sizes and accurate results. One problem of the adaptive margins is instability for inappropriate parameters: though the method is robust to the initial margins, it needs tuning of the factor parameters. Another problem is slow convergence of the interval between the upper and lower estimates. These points must be studied further.

References

1. Jaakkola, T., et al.: Neural Computation 6, 1185–1201 (1994)
2. Sutton, R.S., Barto, A.G.: Reinforcement Learning. The MIT Press, Cambridge (1998)
3. Kaneko, Y., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. 167 (2004)
4. Kaneko, N., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. A-2-10 (2005)
5. Natarajan, S., et al.: In: Proc. Intl. Conf. on Machine Learning, pp. 601–608 (2005)
6. Hiraoka, K., et al.: The Brain & Neural Networks (in Japanese). Japanese Neural Network Society 13, 137–145 (2006)
7. Yoshida, M., et al.: Proc. FIT (in Japanese) (to appear, 2007)
8. Preparata, F.P., et al.: Computational Geometry. Springer, Heidelberg (1985)
9. Alexandrov, V.N., Dongarra, J., Juliano, B.A., Renner, R.S., Tan, C.J.K. (eds.): ICCS 2001. LNCS, vol. 2073. Springer, Heidelberg (2001)
10. Fukuda, K.: J. Symbolic Computation 38, 1261–1272 (2004)
11. Fogel, E., et al.: In: Proc. ALENEX, pp. 3–15 (2006)

Convergence Behavior of Competitive Repetition-Suppression Clustering

Davide Bacciu¹,² and Antonina Starita²

¹ IMT Lucca Institute for Advanced Studies, P.zza San Ponziano 6, 55100 Lucca, Italy. [email protected]
² Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy. [email protected]

Abstract. Competitive Repetition-Suppression (CoRe) clustering is a bio-inspired learning algorithm that is capable of automatically determining the unknown cluster number from the data. In previous work it has been shown how CoRe clustering represents a robust generalization of rival penalized competitive learning (RPCL) by means of M-estimators. This paper studies the convergence behavior of the CoRe model, based on the analysis proposed for the distance-sensitive RPCL (DSRPCL) algorithm. Furthermore, a global minimum criterion for learning vector quantization in kernel space is proposed and used to assess the correct location property of the CoRe algorithm.

1 Introduction

CoRe learning has been proposed as a biologically inspired learning model mimicking a memory mechanism of the visual cortex, i.e. repetition suppression [1]. CoRe is a soft-competitive model that allows only a subset of the most active units to learn in proportion to their activation strength, while it penalizes the least active units, driving them away from the patterns producing low firing strengths. This feature has been exploited in [2] to derive a clustering algorithm that is capable of automatically determining the unknown cluster number from the data by means of a reward-punishment procedure that resembles the rival penalization mechanism of RPCL [3]. Recently, Ma and Wang [4] have proposed a generalized loss function for the RPCL algorithm, named DSRPCL, that has been used for studying the convergence behavior of the rival penalization scheme. In this paper, we present a convergence analysis for CoRe clustering founded on Ma and Wang's approach, describing how CoRe satisfies the three properties of separation nature, correct division and correct location [4]. The intuitive analysis presented in [4] for DSRPCL is reinforced with theoretical considerations showing that CoRe pursues a global optimality criterion for vector quantization algorithms. To this end, we introduce a kernel interpretation of the CoRe loss that is used to generalize the results given in [5] for hard vector quantization to kernel-based algorithms.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 497–506, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 A Kernel Based Loss Function for CoRe Clustering

A CoRe clustering network consists of cluster detector units, each characterized by a prototype c_i that identifies the preferred stimulus of the unit u_i and represents the learned cluster centroid. In addition, units are characterized by an activation function ϕ_i(x_k, λ_i), defined in terms of a set of parameters λ_i, that determines the firing strength of the unit in response to the presentation of an input pattern x_k ∈ χ. Such an activation function measures the similarity between the prototype c_i and the inputs, determining whether the pattern x_k belongs to the i-th cluster. In the remainder of the paper we will use an activation function that is a Gaussian centered at c_i with spread σ_i, i.e.

$$\varphi_i(x_k \mid \{c_i, \sigma_i\}) = \exp\left( -\frac{\lVert x_k - c_i \rVert^2}{2\sigma_i^2} \right).$$

CoRe clustering works essentially by evolving a small set of highly selective cluster detectors out of an initially larger population by means of a competitive reward-punishment procedure that resembles the rival penalization mechanism [3]. The competition is engaged between two sets of units: at each step the most active units are selected to form the winners pool, while the remainder is inserted into the losers pool. More formally, we define the winners pool for the input x_k as the set of units u_i that fire more than θ_win, or the single unit that is maximally active for the pattern, that is

$$win_k = \{\, i \mid \varphi_i(x_k, \{c_i, \sigma_i\}) \ge \theta_{win} \,\} \cup \{\, i \mid i = \arg\max_{j \in U} \varphi_j(x_k \mid \{c_j, \sigma_j\}) \,\}, \qquad (1)$$

where the second term of the union ensures that win_k is non-empty. Conversely, the losers pool for x_k is lose_k = U \ win_k, i.e. the complement of win_k with respect to the neuron set U. The units belonging to the losers pool are penalized and their response is suppressed. The strength of the penalization for the pattern x_k at time t is regulated by the repetition suppression RS^t_k ∈ [0, 1] and is proportional to the frequency of the pattern that has elicited the suppressive effect (see [2,6] for details). The repetition suppression is used to define a pseudo-target activation for the units in the losers pool as ϕ̂^t_i(x_k) = ϕ_i(x_k, {c_i, σ_i})(1 − RS^t_k). This reference signal forces the losers to reduce their activation proportionally to the amount of repetition suppression they receive. The error of the i-th loser unit can thus be written as

$$E_{i,k}^{t} = \frac{1}{2}\left( \hat\varphi_i^{t}(x_k) - \varphi_i(x_k, \{c_i, \sigma_i\}) \right)^2 = \frac{1}{2}\left( -\varphi_i(x_k, \{c_i, \sigma_i\})\, RS_k^{t} \right)^2. \qquad (2)$$

Conversely, in order to strengthen the activation of the winner units, we set the target activation for the neurons u_i (i ∈ win_k) to M, the maximum of the activation function ϕ_i(·). The error in this case can be written as

$$E_{i,k}^{t} = M - \varphi_i(x_k, \{c_i, \sigma_i\}). \qquad (3)$$

To analyze CoRe convergence, we give an error formulation that accumulates the residuals in (2) and (3) for a given epoch e: summing over all CoRe units in U and the dataset χ = (x_1, ..., x_k, ..., x_K) yields
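A sketch of the pool assignment (1) with Gaussian activations; the prototypes, spreads, and threshold θ_win below are illustrative only:

```python
import numpy as np

def activation(x, c, sigma):
    """Gaussian unit activation centered at prototype c with spread sigma."""
    return np.exp(-0.5 * np.sum((x - c) ** 2) / sigma ** 2)

def split_pools(x, prototypes, sigmas, theta_win=0.7):
    """Winners pool per (1): units above threshold, plus the single
    maximally active unit, which guarantees a non-empty pool."""
    acts = [activation(x, c, s) for c, s in zip(prototypes, sigmas)]
    winners = {i for i, a in enumerate(acts) if a >= theta_win}
    winners.add(int(np.argmax(acts)))
    losers = set(range(len(prototypes))) - winners
    return winners, losers

# Toy example: two prototypes near the input, one far away.
protos = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([0.2, 0.0])]
x = np.array([0.1, 0.0])
print(split_pools(x, protos, [1.0, 1.0, 1.0]))
```

The far prototype ends up in the losers pool, where the repetition-suppression penalty will push it further from the data, which is the mechanism that the separation-nature analysis below formalizes.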

$$J_e(\chi, U) = \sum_{i=1}^{I}\sum_{k=1}^{K} \delta_{ik}\left( 1 - \varphi_i(x_k) \right) + \sum_{i=1}^{I}\sum_{k=1}^{K} (1-\delta_{ik})\left( \varphi_i(x_k)\, RS_k^{(e|\chi|+k)} \right)^{2}, \qquad (4)$$

where δ_ik is the indicator function of the set win_k and where {c_i, σ_i} has been omitted from ϕ_i to ease the notation. Note that in (4) we have implicitly used the fact that the units can be treated as independent. The CoRe learning equations can be derived using gradient descent to minimize J_e(χ, U) with respect to the parameters {c_i, σ_i} [2]. Hence, the prototype increment for the e-th epoch can be calculated as

$$\Delta c_i^{e} = \alpha_c \sum_{k=1}^{K} \left[ \delta_{ik}\, \frac{\varphi_i(x_k)}{(\sigma_i^{e})^{2}}\,(x_k - c_i^{e}) - (1-\delta_{ik}) \left( \frac{\varphi_i(x_k)\, RS_k^{(e|\chi|+k)}}{\sigma_i^{e}} \right)^{2} (x_k - c_i^{e}) \right], \qquad (5)$$

where α_c is a suitable learning rate ensuring that J_e decreases with e. Similarly, the spread update can be calculated as

$$\Delta \sigma_i^{e} = \alpha_\sigma \sum_{k=1}^{K} \left[ \delta_{ik}\, \varphi_i(x_k)\, \frac{\lVert x_k - c_i^{e} \rVert^{2}}{(\sigma_i^{e})^{3}} - (1-\delta_{ik}) \left( \varphi_i(x_k)\, RS_k^{(e|\chi|+k)} \right)^{2} \frac{\lVert x_k - c_i^{e} \rVert^{2}}{(\sigma_i^{e})^{3}} \right]. \qquad (6)$$

As one would expect, unit prototypes are attracted by similar patterns (first term in (5)) and are repelled by dissimilar inputs (second term in (5)). Moreover, the neural selectivity is enhanced by reducing the Gaussian spread each time the corresponding unit happens to be a winner. Conversely, the variance of loser neurons is enlarged, reducing the units' selectivity and penalizing them for not having sharp responses. The error formulation introduced so far can be restated by exploiting the kernel trick [7] to express the CoRe loss in terms of differences in a given feature space F. Kernel methods are algorithms that exploit a nonlinear mapping Φ : χ → F to project the data from the input space χ onto a convenient, implicit feature space F. The kernel trick is used to express all operations on Φ(x_1), Φ(x_2) ∈ F in terms of the inner product ⟨Φ(x_1), Φ(x_2)⟩. Such an inner product can be calculated without explicitly using the mapping Φ, by means of the kernel κ(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩. To derive the kernel interpretation of the CoRe loss in (4), consider first the distance d_{F_κ} of two vectors x_1, x_2 ∈ χ in the feature space F_κ induced by the kernel κ and described by the mapping Φ : χ → F_κ, that is d_{F_κ}(x_1, x_2) = ‖Φ(x_1) − Φ(x_2)‖²_{F_κ} = κ(x_1, x_1) − 2κ(x_1, x_2) + κ(x_2, x_2), where the kernel trick [7] has been used to substitute the inner products in feature space with a suitable kernel κ calculated in the data space. If κ is chosen to be a Gaussian kernel, then κ(x, x) = 1, and d_{F_κ} can be rewritten as d_{F_κ} = ‖Φ(x_1) − Φ(x_2)‖²_{F_κ} = 2 − 2κ(x_1, x_2). Now, if we take x_1 to be an element of the input dataset, e.g. x_k ∈ χ, and x_2 to be the prototype c_i of the i-th CoRe unit, we can rewrite d_{F_κ} in such a way that it depends on the activation function ϕ_i. Applying the substitution κ(x_k, c_i) = ϕ_i(x_k, {c_i, σ_i}) we obtain ϕ_i(x_k, {c_i, σ_i}) = 1 − ½‖Φ(x_k) − Φ(c_i)‖²_{F_κ}.
Now, if we substitute this result in the formulation of the CoRe loss in (4), we obtain

$$J_e(\chi, U) = \frac{1}{2} \sum_{i=1}^{I}\sum_{k=1}^{K} \delta_{ik}\, \lVert \Phi(x_k) - \Phi(c_i) \rVert_{F_\kappa}^{2} + \sum_{i=1}^{I}\sum_{k=1}^{K} (1-\delta_{ik}) \left( RS_k^{(e|\chi|+k)} \right)^{2} \left( 1 - \frac{1}{2} \lVert \Phi(x_k) - \Phi(c_i) \rVert_{F_\kappa}^{2} \right)^{2}. \qquad (7)$$

Equation (7) states that CoRe minimizes the feature space distance between the prototype ci and those xk that are close in the kernel space induced by the activation functions ϕi , while it maximizes the feature space distance between the prototypes and those xk that are far from ci in the kernel space.
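The Gaussian-kernel identities used above can be checked numerically without ever constructing Φ; the points below are arbitrary:

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel kappa(a, b); note kappa(x, x) = 1."""
    return np.exp(-0.5 * np.sum((a - b) ** 2) / sigma ** 2)

x, c = np.array([0.3, -1.2]), np.array([1.0, 0.5])

# Implicit feature-space distance via the kernel trick:
# ||Phi(x) - Phi(c)||^2 = kappa(x,x) - 2*kappa(x,c) + kappa(c,c) = 2 - 2*kappa(x,c)
d_feat = gauss_kernel(x, x) - 2 * gauss_kernel(x, c) + gauss_kernel(c, c)
assert abs(d_feat - (2 - 2 * gauss_kernel(x, c))) < 1e-12

# The activation equals 1 - d/2, so minimizing the loser term pushes the
# feature-space distance toward its maximum.
phi = gauss_kernel(x, c)
assert abs(phi - (1 - d_feat / 2)) < 1e-12
print(d_feat, phi)
```

This is why the two terms of (7) can be read as attraction and repulsion measured entirely in the implicit feature space.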

3 Separation Nature

To prove the separation nature of the CoRe process we need to demonstrate that, given a bounded hypersphere G containing all the sample data, after sufficiently many iterations of the algorithm the cluster prototypes will finally either fall into G or remain outside it and never get into G. In particular, those prototypes remaining outside the hypersphere will be driven far away from the samples by the RS repulsion. We consider a prototype c_i to be far away from the data if, for a given epoch e, it is in the losers pool for every x_k ∈ χ. To prove the CoRe separation nature we first demonstrate the following lemma.

Lemma 1. When a prototype c_i is far away from the data at a given epoch e, it will always be a loser for every x_k ∈ χ and will be driven away from the data samples.

Proof. The definition of far away implies that, given c_i^e, i ∈ lose_k^e for all x_k ∈ χ, where the e in the superscript refers to the learning epoch. Given the prototype update in (5), we obtain the weight vector increment Δc_i^e at epoch e as

$$\Delta c_i^{e} = -\alpha_c \sum_{k=1}^{K} \left( \frac{\varphi_i(x_k)\, RS_k^{(e|\chi|+k)}}{\sigma_i^{e}} \right)^{2} (x_k - c_i^{e}). \qquad (8)$$

As a result of (8), the prototype c_i^{e+1} is driven further from the data. On the other hand, by definition (1), for each of the data samples there exists at least one winner unit at every epoch e, whose prototype is moved towards the samples for which it has been a winner. Moreover, not every prototype can be deflected from the data, since this would make the first term of J_e(χ, U) (see (4)) grow and, consequently, the whole J_e(χ, U) would diverge, since the loser error term in (4) is lower bounded. However, this would contradict the fact that J_e(χ, U) decreases with e, since CoRe applies gradient descent to the loss function. Therefore, there must exist at least one winning prototype c_l^e that remains close to the samples at epoch e. On the other hand, c_i^e is already far away from the samples and, by (8), c_i^{e+1} will be further from the data and will not be a winner for any x_k ∈ χ. To prove this, consider the definition of win_k in (1): for c_i^{e+1} to be a winner, either (i) ϕ_i(x_k) ≥ θ_win or (ii) i = arg max_{j∈U} ϕ_j(x_k, λ_j) must hold. The former does not hold because the receptive field area where the firing strength of the i-th unit is above the threshold θ_win does not contain any sample at epoch e; consequently, it cannot contain any sample at epoch e + 1, since its center c_i^{e+1} has been deflected further from the data. The latter does not hold since there exists at least one prototype, i.e. c_l, that remains close to the data, generating higher activations than the unit u_i. As a consequence, a far away prototype c_i will be deflected from the data until it reaches a stable point where the corresponding firing strength ϕ_i is negligible.

Now we can proceed to demonstrate the following.

Theorem 1. For a CoRe process there exists a hypersphere G surrounding the sample data χ such that after sufficiently many iterations each prototype c_i will finally either (i) fall into G or (ii) keep outside G and reach a stable point.

Proof. The CoRe process is a gradient descent (GD) algorithm on J_e(χ, U); hence, for a sufficiently small learning step, the loss decreases with the number of epochs. Therefore, J_e(χ, U) being always positive, the GD process will converge to a minimum J*. The sequences of prototype vectors {c_i^e} will converge either to a point close to the samples or to a point of negligible activation far away from the data. If a unit u_i has a sufficiently long subsequence of prototypes {c_i^e} diverging from the dataset then, after a certain time, it will no longer be a winner for any sample and, by Lemma 1, will converge to a point far away from the data. The attractors for the sequences {c_i^e} of the diverging units lie at a certain distance r from the samples, determined by those points x where the Gaussian unit centered at x produces a negligible activation in response to any pattern x_k ∈ χ. Hence, G can be chosen as any hypersphere surrounding the samples with radius smaller than r.
On the other hand, since J_e(χ, U) decreases to J*, there must exist at least one prototype that is not far away from the data (otherwise the first term of J_e(χ, U) in (4) would diverge). In this case, the sequences {c_i^e} must have accumulation points close to the samples. Therefore any hypersphere G enclosing all the samples will also surround the accumulation points of {c_i^e} and, after a certain epoch E, the sequences will always remain within such a hypersphere. In summary, Theorem 1 tells us that the separation nature holds for a CoRe process: some prototypes are possibly pushed away from the data until their contribution to the error in (4) becomes negligible. Far away prototypes will always be losers and will never head back to the data. Conversely, some prototypes will converge to the samples, heading to a saddle point of the loss J_e(χ, U) by means of a gradient descent process.

4 Correct Division and Location

Following the convergence analysis in [4] we now turn our attention to the issues of correct division and location of the weight vectors. This means that the number of prototypes falling into G will be n_c, i.e. the number of actual clusters in the sample data, and that they will finally converge to the centers of the clusters. At this point, we depart from the intuitive study presented for DSRPCL [4], introducing a sound analysis of the properties of the saddle points identified by CoRe and giving a sufficient and necessary condition identifying the global minimum of a vector quantization loss in feature space.

4.1 A Global Minimum Condition for Vector Quantization in Kernel Space

The classical problem of hard vector quantization (VQ) in Euclidean space is to determine a codebook V = {v_1, ..., v_N} minimizing the total distortion, calculated by Euclidean norms, resulting from the approximation of the inputs x_k ∈ χ by the code vectors v_i. Here, we focus on a more general problem, namely vector quantization in feature space. Given the nonlinear mapping Φ and the induced feature space norm ‖·‖_{F_κ} introduced in the previous sections, we aim at optimizing the distortion

$$\min\ D(\chi, \Phi_V) = \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{N} \delta_{ik}^{1}\, \lVert \Phi(x_k) - \Phi_{v_i} \rVert_{F_\kappa}^{2}, \qquad (9)$$

where Φ_V = {Φ_{v_1}, ..., Φ_{v_N}} represents the codebook in the kernel space and δ¹_ik is equal to 1 if the i-th cluster is the closest to the k-th pattern in the feature space F_κ, and 0 otherwise. It is widely known that VQ generates a Voronoi tessellation of the quantized space and that a necessary condition for the minimization of the distortion requires the code vectors to be selected as the centroids of the Voronoi regions [8]. In [5], a necessary and sufficient condition is given for the global minimum of a Euclidean VQ distortion function. In the following, we generalize this result to vector quantization in feature space. To prove the global minimum condition in kernel space we need to extend the results in [9] (Propositions 3.1.7 and 3.2.4) to the most general case of a kernel-induced distance metric. Therefore we introduce the following lemma.

Lemma 2. Let κ be a kernel and Φ : χ → F_κ a map into the corresponding feature space F_κ. Given a dataset χ = {x_1, ..., x_K} partitioned into N subsets C_i, define the feature space mean Φ_χ = (1/K) Σ_{k=1}^{K} Φ(x_k) and the i-th partition centroid Φ_{v_i} = (1/|C_i|) Σ_{k∈C_i} Φ(x_k); then we have

$$\sum_{k=1}^{K} \lVert \Phi(x_k) - \Phi_\chi \rVert_{F_\kappa}^{2} = \sum_{i=1}^{N} \sum_{k \in C_i} \lVert \Phi(x_k) - \Phi_{v_i} \rVert_{F_\kappa}^{2} + \sum_{i=1}^{N} |C_i|\, \lVert \Phi_{v_i} - \Phi_\chi \rVert_{F_\kappa}^{2}. \qquad (10)$$

Proof. Given a generic feature vector Φ_1, consider the identity Φ(x_k) − Φ_1 = (Φ(x_k) − Φ_{v_i}) + (Φ_{v_i} − Φ_1); its squared norm in feature space is

||Φ(x_k) − Φ_1||²_{Fκ} = ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + ||Φ_{v_i} − Φ_1||²_{Fκ} + 2(Φ(x_k) − Φ_{v_i})^T(Φ_{v_i} − Φ_1).

Convergence Behavior of Competitive Repetition-Suppression Clustering

503

Summing over all the elements in the i-th partition we obtain

Σ_{k∈C_i} ||Φ(x_k) − Φ_1||²_{Fκ} = Σ_{k∈C_i} ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + Σ_{k∈C_i} ||Φ_{v_i} − Φ_1||²_{Fκ} + 2 Σ_{k∈C_i} (Φ(x_k) − Φ_{v_i})^T (Φ_{v_i} − Φ_1)
= Σ_{k∈C_i} ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + |C_i| ||Φ_{v_i} − Φ_1||²_{Fκ}.   (11)

The last term in (11) vanishes since Σ_{k∈C_i} (Φ(x_k) − Φ_{v_i}) = 0 by definition of Φ_{v_i}. Now, applying the substitution Φ_1 = Φ_χ and summing over all the N partitions yields

Σ_{k=1}^{K} ||Φ(x_k) − Φ_χ||²_{Fκ} = Σ_{i=1}^{N} Σ_{k∈C_i} ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + Σ_{i=1}^{N} |C_i| ||Φ_{v_i} − Φ_χ||²_{Fκ}   (12)

where the left side of the equality holds since ∪_{i=1}^{N} C_i = χ and ∩_{i=1}^{N} C_i = ∅.
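Lemma 2's decomposition (10) can be checked numerically without ever forming Φ explicitly, since every norm involved reduces to Gram-matrix entries via the kernel trick. A minimal sketch with a Gaussian kernel (the synthetic data and the two-block partition are illustrative assumptions):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
K = rbf(X, X)
parts = [np.arange(0, 6), np.arange(6, 12)]   # a partition C_1, C_2 of the data

# sum_k ||Phi(x_k) - Phi_chi||^2 via the kernel trick:
# K_kk - 2 * mean_l K_kl + mean_{l,m} K_lm
lhs = (np.diag(K) - 2 * K.mean(1) + K.mean()).sum()

rhs = 0.0
for C in parts:
    Kc = K[np.ix_(C, C)]
    within = (np.diag(Kc) - 2 * Kc.mean(1) + Kc.mean()).sum()
    # ||Phi_vi - Phi_chi||^2 = mean_{C,C} K + mean_{all,all} K - 2 mean_{C,all} K
    between = Kc.mean() + K.mean() - 2 * K[C].mean()
    rhs += within + len(C) * between

print(abs(lhs - rhs))  # ~0: total scatter = within + between, Eq. (10)
```

The same Gram-matrix identities are what make the feature-space centroids usable even though Φ_{v_i} itself is never computed.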

Using the results of Lemma 2 we can proceed with the formulation of the global minimum criterion by generalizing Proposition 1 of [5] to vector quantization in feature space.

Proposition 1. Let {Φ_{v_1}^g, ..., Φ_{v_N}^g} be a global minimum solution to the problem in (9); then we have

Σ_{i=1}^{N} |C_i^g| ||Φ_{v_i}^g − Φ_χ||²_{Fκ} ≥ Σ_{i=1}^{N} |C_i| ||Φ_{v_i} − Φ_χ||²_{Fκ}   (13)

for any locally optimal solution {Φ_{v_1}, ..., Φ_{v_N}} of (9), where {C_1^g, ..., C_N^g} and {C_1, ..., C_N} are the partitions of χ corresponding to the centroids Φ_{v_i}^g = (1/|C_i^g|) Σ_{k∈C_i^g} Φ(x_k) and Φ_{v_i} = (1/|C_i|) Σ_{k∈C_i} Φ(x_k), respectively, and where Φ_χ is the dataset mean (see the definition in Lemma 2).

Proof. Since {Φ_{v_1}^g, ..., Φ_{v_N}^g} is a global minimum for (9) we have

Σ_{i=1}^{N} Σ_{k∈C_i^g} ||Φ(x_k) − Φ_{v_i}^g||²_{Fκ} ≤ Σ_{i=1}^{N} Σ_{k∈C_i} ||Φ(x_k) − Φ_{v_i}||²_{Fκ}   (14)

for any local minimum {Φ_{v_1}, ..., Φ_{v_N}}. From Lemma 2 we have that

Σ_{k=1}^{K} ||Φ(x_k) − Φ_χ||²_{Fκ} = Σ_{i=1}^{N} Σ_{k∈C_i^g} ||Φ(x_k) − Φ_{v_i}^g||²_{Fκ} + Σ_{i=1}^{N} |C_i^g| ||Φ_{v_i}^g − Φ_χ||²_{Fκ}   (15)

Σ_{k=1}^{K} ||Φ(x_k) − Φ_χ||²_{Fκ} = Σ_{i=1}^{N} Σ_{k∈C_i} ||Φ(x_k) − Φ_{v_i}||²_{Fκ} + Σ_{i=1}^{N} |C_i| ||Φ_{v_i} − Φ_χ||²_{Fκ}.   (16)

Since (14) holds, we obtain Σ_{i=1}^{N} |C_i^g| ||Φ_{v_i}^g − Φ_χ||²_{Fκ} ≥ Σ_{i=1}^{N} |C_i| ||Φ_{v_i} − Φ_χ||²_{Fκ}.
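Proposition 1's Euclidean special case is easy to probe numerically: since within-cluster distortion plus between-cluster scatter is constant (Lemma 2), a solution with lower distortion must have larger between-cluster scatter. The sketch below compares a good and a poor Lloyd local minimum; the three-cluster dataset and the two initializations are illustrative assumptions:

```python
import numpy as np

def lloyd(X, V, iters=50):
    # plain Lloyd iteration (Euclidean VQ) from a given initial codebook V
    for _ in range(iters):
        lab = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1).argmin(1)
        V = np.stack([X[lab == i].mean(0) if (lab == i).any() else V[i]
                      for i in range(len(V))])
    return V, lab

def scatters(X, V, lab):
    # within-cluster distortion and between-cluster scatter (cf. Eqs. 13-16)
    within = sum(((X[lab == i] - V[i]) ** 2).sum() for i in range(len(V)))
    m = X.mean(0)
    between = sum((lab == i).sum() * ((V[i] - m) ** 2).sum() for i in range(len(V)))
    return within, between

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(c, 0.1, size=(30, 2))
                    for c in [(-2.0, 0.0), (0.0, 0.0), (2.0, 0.0)]])

V_good, lab_g = lloyd(X, X[[0, 30, 60]].copy())  # one seed per true cluster
V_bad, lab_b = lloyd(X, X[[0, 1, 30]].copy())    # two seeds inside one cluster
wg, bg = scatters(X, V_good, lab_g)
wb, bb = scatters(X, V_bad, lab_b)
print(wg < wb, bg > bb)  # lower distortion pairs with larger between-cluster scatter
```

The two solutions have the same total scatter, so the ordering of the between-cluster terms mirrors the ordering of the distortions, exactly as (13) states.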

504

D. Bacciu and A. Starita

4.2 Correct Division and Location for CoRe Clustering

To evaluate the correct division and location properties we first analyze the case in which the number of units N equals the true cluster number n_c. Consider the loss in (4) as decomposed into a winner-dependent and a loser-dependent term, i.e. J_e(χ, U) = J_e^win(χ, U) + J_e^lose(χ, U). By definition, J_e^win(χ, U) = Σ_{i=1}^{n_c} Σ_{k=1}^{K} δ_{ik}(1 − φ_i(x_k)) must have at least one minimum point. Applying the necessary condition ∂J_e^win(χ, U)/∂c_i = 0 we obtain an estimate of the prototypes by means of fixed point iteration, that is

c_i^e = Σ_{k=1}^{K} δ_{ik} φ_i(x_k) x_k / Σ_{k=1}^{K} δ_{ik} φ_i(x_k).   (17)

When the number of prototypes equals the number of clusters, the fixed point iteration in (17) converges by positioning each unit weight vector close to the true cluster centroids. In addition, it can be shown that (17) approximates a local minimum of the kernel vector quantization loss in (9). To prove this, consider the CoRe loss formulation in kernel space (7): we have J_e^win(χ, U) = (1/2) Σ_{i=1}^{n_c} Σ_{k=1}^{K} δ_{ik} ||Φ(x_k) − Φ(c_i)||²_{Fκ}, where c_i is estimated by (17). Now, consider the VQ loss in (9): a necessary condition for its minimization requires the computation of the cluster centroids as Φ_{v_i} = (1/|C_i|) Σ_{k∈C_i} Φ(x_k). The exact calculation of Φ_{v_i} requires knowing the form of the implicit nonlinear mapping Φ in order to solve the so-called pre-image problem [10], that is, determining z such that Φ(z) = Φ_{v_i}. Unfortunately, such a problem is insolvable in the general case [10]. However, instead of calculating the exact pre-image we can search for an approximation by seeking the z minimizing ρ(z) = ||Φ_{v_i} − Φ(z)||²_{Fκ}, that is, the feature space distance between the centroid in kernel space and the mapping of its approximated pre-image. Rather than optimizing ρ(z), it is easier to minimize the distance between Φ_{v_i} and its orthogonal projection onto the span of Φ(z). Due to space limitations, we omit the technicalities of this calculation (see [10] for further details).
It turns out that the minimization of ρ(z) reduces to the evaluation of the gradient of ρ'(z) = ⟨Φ_{v_i}, Φ(z)⟩. Substituting the definition of Φ_{v_i} and applying the kernel trick, the distance expands as

ρ(z) = (1/|C_i|²) Σ_{k,j∈C_i} κ(x_k, x_j) + κ(z, z) − (2/|C_i|) Σ_{k∈C_i} κ(x_k, z),

where κ(z, z) = 1 since we are using a Gaussian kernel. Differentiating with respect to z and solving by fixed point iteration yields

z^e = Σ_{k∈C_i} κ(x_k, z^{e−1}) x_k / Σ_{k∈C_i} κ(x_k, z^{e−1}),   (18)

which is the same as the prototype estimate obtained in (17) for Gaussian kernels centered in z^e. The indicator function δ_{ik} in (17) is non-null only for those points x_k for which unit u_i was in the winner set. This does not ensure the partition conditions over χ, since, by definition of win_k, some points can be associated with


two or more winners. However, by (6) we know that the variance of the winners tends to reduce as learning proceeds. Therefore, using the same arguments as Gersho [8], it can be demonstrated that, after a certain epoch E, the CoRe winner competition becomes a WTA process where δ_{ik} ensures the partition conditions over χ. Summarizing, the minimization of the CoRe winner error J_e^win(χ, U) generates an approximate solution to the vector quantization problem in feature space in (9). As a consequence, the prototypes c_i become a local solution satisfying the conditions of Proposition 1. Hence, substituting the definition of Φ_χ into the results of Proposition 1, we obtain that {c_1, ..., c_{n_c}} is an approximated global minimum for (9) if and only if

Σ_{i=1}^{n_c} Σ_{k=1}^{K} (|C_i|/K) ||Φ(c_i) − Φ(x_k)||²_{Fκ} ≥ Σ_{i=1}^{n_c} Σ_{k=1}^{K} (|C̃_i|/K) ||Φ(c̃_i) − Φ(x_k)||²_{Fκ}   (19)

holds for every {c̃_1, ..., c̃_{n_c}} that are approximated pre-images of a local minimum of (9). In summary, a global optimum of (9) should minimize the feature space distance between the prototypes and the samples belonging to their cluster, while maximizing the weight vector distance from the sample mean or, equivalently, the distance from all the samples in the dataset χ. The loser component J_e^lose(χ, U) in the kernel CoRe loss (7) depends on the term (1 − (1/2)||Φ(c_i) − Φ(x_k)||²_{Fκ}), which maximizes the distance between the prototypes c_i and those x_k that do not fall in the respective Voronoi sets C_i. Hence, J_e^lose(χ, U) produces a distortion in the estimate of c_i that pursues the global optimality criterion, except that it discounts the repulsive effect of the x_k ∈ C_i. In fact, (19) suggests that c_i has to be repelled by all the x_k ∈ χ. On the other hand, the estimate c_i is a linear combination of the x_k ∈ C_i: applying the repulsive effect in (19) would subtract their contribution, either canceling the attractive effect (which would be catastrophic) or simply scaling the magnitude of the learning step without changing the final direction. Hence, the CoRe loss makes a reasonable assumption by discarding the repulsive effect of the x_k ∈ C_i when calculating the estimate of c_i. Summarizing, CoRe locates the prototypes close to the centroids of the n_c clusters by means of (17), escaping from local minima of the loss function by approximating the global minimum condition of Proposition 1. Finally, we need to study the behavior of J_e(χ, U) as the number of units N varies with respect to the true cluster number n_c. Using the same motivations as in [4], we see that the winner-dependent loss J_e^win tends to reduce as the number of units increases. However, if the number of units falling into G is larger than n_c, a number of clusters will be erroneously split. Therefore, the samples from these clusters will tend to produce an increased level of error in J_e^lose, contrasting the reduction of J_e^win. On the other hand, J_e^lose will tend to reduce when the number of units inside G is lower than n_c. This, however, will produce increased levels of J_e^win since the prototype allocation will not match the underlying sample distribution. Hence, the CoRe error has its minimum when the number of units inside G approximates n_c.
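For Gaussian kernels, the fixed-point pre-image iteration (18) used above is just a kernel-weighted mean of the points in the Voronoi set, recomputed until it stabilizes. A minimal sketch (the cluster data and kernel width are illustrative assumptions):

```python
import numpy as np

def preimage(Xc, sigma=0.5, iters=100):
    # fixed-point iteration (18): z^e is a kernel-weighted mean of the
    # cluster members, with Gaussian weights centred on z^{e-1}
    z = Xc.mean(0)                      # initialise at the input-space mean
    for _ in range(iters):
        w = np.exp(-((Xc - z) ** 2).sum(1) / (2 * sigma ** 2))
        z = (w[:, None] * Xc).sum(0) / w.sum()
    return z

rng = np.random.default_rng(0)
Xc = rng.normal([1.0, -1.0], 0.2, size=(40, 2))  # one cluster's Voronoi set C_i
z = preimage(Xc)
print(z)  # close to the cluster centre for a compact cluster
```

For a compact, roughly isotropic cluster the weights are nearly uniform and the iteration stays close to the plain centroid, consistent with the remark that (18) matches the prototype estimate (17).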

5 Conclusion

This paper presents a sound analysis of the convergence behavior of CoRe clustering, showing how the minimization of the CoRe cost function satisfies the properties of separation nature and correct division and location [4]. As the loss reduces to a minimum, the CoRe algorithm is shown to converge, allocating the correct number of prototypes to the centers of the clusters. Moreover, a sound optimality criterion is given that shows how CoRe gradient descent pursues a global minimum of the vector quantization problem in feature space. The results presented in the paper hold for a batch gradient descent process. However, it can be proved that, under Ljung's conditions [11], they extend to stochastic (online) gradient descent. Moreover, we plan to further investigate the properties of the CoRe kernel formulation, extending the convergence analysis to a wider class of activation functions beyond Gaussians, e.g. normalized kernels.

References 1. Grill-Spector, K., Henson, R., Martin, A.: Repetition and the brain: neural models of stimulus-speciﬁc eﬀects. Trends in Cognitive Sciences 10(1), 14–23 (2006) 2. Bacciu, D., Starita, A.: A robust bio-inspired clustering algorithm for the automatic determination of unknown cluster number. In: Proceedings of the 2007 International Joint Conference on Neural Networks, pp. 1314–1319. IEEE, Los Alamitos (2007) 3. Xu, L., Krzyzak, A., Oja, E.: Rival penalized competitive learning for clustering analysis, rbf net, and curve detection. IEEE Trans. on Neur. Net. 4(4) (1993) 4. Ma, J., Wang, T.: A cost-function approach to rival penalized competitive learning (rpcl). IEEE Trans. on Sys., Man, and Cyber 36(4), 722–737 (2006) 5. Munoz-Perez, J., Gomez-Ruiz, J.A., Lopez-Rubio, E., Garcia-Bernal, M.A.: Expansive and competitive learning for vector quantization. Neural Process. Lett. 15(3), 261–273 (2002) 6. Bacciu, D., Starita, A.: Competitive repetition suppression learning. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 130–139. Springer, Heidelberg (2006) 7. Scholkopf, B., Smola, A., Muller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comp. 10(5), 1299–1319 (1998) 8. Yair, E., Zeger, K., Gersho, A.: Competitive learning and soft competition for vector quantizer design. IEEE Trans. on Sign. Proc. 40(2), 294–309 (1992) 9. Spath, H.: Cluster analysis algorithms. Ellis Horwood (1980) 10. Scholkopf, B., Mika, S., Burges, C.J.C., Knirsch, P., Muller, K.R., Ratsch, G., Smola, A.J.: Input space versus feature space in kernel-based methods. IEEE Trans. on Neur. Net. 10(5), 1000–1017 (1999) 11. Ljung, L.: Strong convergence of a stochastic approximation algorithm. The Annals of Statistics 6(3), 680–696 (1978)

Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma Kyushu Institute of Technology, Faculty of Engineering, 1-1 Sensui-cho Tobata-ku Kitakyushu, 804-8550, Japan [email protected]

Abstract. The Adaptive Subspace Self-Organizing Map (ASSOM) is an evolution of the Self-Organizing Map in which each computational unit defines a linear subspace. Recently, a modified version, in which each unit defines a linear manifold instead of a linear subspace, has been proposed. The linear manifold in a unit is represented by a mean vector and a set of basis vectors. After training, these units result in a set of linear variety detectors. From another point of view, we can consider that the AMSOM represents the latent commonality of data as linear structures. In numerous cases, however, linear structures are not enough to describe the latent commonality of the data. In this paper, nonlinear varieties are considered in order to represent the diversity of data within a class. The effectiveness of the proposed method is verified by applying it to some simple classification problems.

1 Introduction

The subspace method is popular in pattern recognition, feature extraction, compression, classification and signal processing [1]. Unlike other techniques, where classes are primarily defined as regions or zones in the feature space, the subspace method uses linear subspaces defined by a set of normalized basis vectors. One linear subspace is usually associated with one class. An input vector is classified to a particular class if its projection error onto the subspace associated with that class is the minimum. The subspace method, compared to other pattern recognition techniques, has advantages in applications where the relative intensities or energies of the vector components are more important than the overall level of the signal. It also provides an economical representation for groups of vectors with high dimensionality, since one can often use a small set of basis vectors to approximate the subspace where the vectors reside. Another paradigm is to use a mixture of local subspaces to collectively model the data space. The Adaptive-Subspace Self-Organizing Map (ASSOM) [2][3] is a mixture-of-local-subspaces method for pattern recognition. ASSOM, which is an evolution

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 507–516, 2008. © Springer-Verlag Berlin Heidelberg 2008

508

H. Kawano, H. Maeda, and N. Ikoma

of the Self-Organizing Map (SOM) [4], consists of an input layer and a competitive layer arranging computational units in a line or a lattice structure. Each computational unit defines a subspace spanned by some basis vectors. ASSOM creates a set of subspace representations by competitive selection and cooperative learning. In SOM, a set of reference vectors is spatially organized to partition the input space. In ASSOM, a set of reference sub-models is topologically ordered, with each sub-model responsible for describing a specific region of the input space by its local principal subspace. The ASSOM is attractive not only because it inherits the topographic representation property of the SOM, but also because the learning results of ASSOM can faithfully describe the core features of various transformation groups. The simulation results in references [2] and [3] have illustrated that different feature filters can self-organize to different low-dimensional subspaces and that a wavelet-type representation emerges in the learning. Recently, the Adaptive Manifold Self-Organizing Map (AMSOM), a modified version of ASSOM, has been proposed [5]. AMSOM has the same structure as the ASSOM, except for the way each computational unit is represented. Each unit in AMSOM defines an affine subspace, which is composed of a mean vector and a set of basis vectors. By incorporating a mean vector into each unit, the recognition performance has been improved significantly. The simulation results in reference [5] have shown that AMSOM outperforms the linear PCA-based method and ASSOM on a face recognition problem. In both ASSOM and AMSOM, the local subspace in each unit can be adapted by linear PCA learning algorithms.
On the other hand, it is known that there are a number of advantages in introducing nonlinearities into a PCA-type network with reproducing kernels [6][13]. For example, the performance of the subspace method is affected by the dimensionality of the intersections of subspaces [1]. In other words, the dimensionality of the subspaces should be as low as possible in order to achieve good performance. It is, however, not enough to describe the variation within a class of patterns by a low-dimensional subspace because of its linearity. From this consideration, we propose a nonlinear extension of the AMSOM based on reproducing kernels. The proposed method can be expected to construct nonlinear varieties so that an effective representation of data belonging to the same category is achieved with low dimensionality. The effectiveness of the proposed method is verified by applying it to some simple pattern classification problems.

2 Adaptive Manifold Self-Organizing Map (AMSOM)

In this section, we give a brief review of the original AMSOM. Fig. 1 shows the structure of the AMSOM. It consists of an input layer and a competitive layer, containing n and M units respectively. Suppose i ∈ {1, ..., M} indexes the computational units in the competitive layer and the dimensionality of the input vector is n. The i-th computational unit constructs an affine subspace, which is composed of a mean vector μ^{(i)} and a subspace spanned by H basis

Self-Organizing Clustering with Map

x1

xj

509

xn

Input Layer

1

M

i

Competitive Layer

Fig. 1. A structure of the Adaptive Manifold Self-Organizing Map (AMSOM)

input vector :

relative input vector :

x ~ ~ x=

φ

φ

φ^ mean vector : μ

x^

Subspace

Origin

Fig. 2. Aﬃne Subspace in a computational unit (i)

vectors b_h^{(i)}, h ∈ {1, ..., H}. First of all, we define the orthogonal projection of an input vector x onto the affine subspace of the i-th unit as

x̂^{(i)} = μ^{(i)} + Σ_{h=1}^{H} (φ^{(i)T} b_h^{(i)}) b_h^{(i)},   (1)

where φ^{(i)} = x − μ^{(i)}. Therefore the projection error is represented as

x̃^{(i)} = φ^{(i)} − Σ_{h=1}^{H} (φ^{(i)T} b_h^{(i)}) b_h^{(i)}.   (2)
Figure 2 shows a schematic interpretation of the orthogonal projection and the projection error of an input vector onto the affine subspace defined in a unit i. The AMSOM is a more general strategy than the ASSOM, where each computational unit solely defines a subspace. To illustrate why this is so, let us consider a very simple case: suppose we are given two clusters as shown in Fig. 3(a). It

Fig. 3. (a) Clusters in 2-dimensional space: an example of a case which cannot be separated without a mean value. (b) Two 1-dimensional affine subspaces to approximate and classify the clusters.

is not possible to use one-dimensional subspaces, that is, lines passing through the origin O, to approximate the clusters. This is true even if the global mean is removed, so that the origin O is translated to the centroid of the two clusters. However, two one-dimensional affine subspaces can easily approximate the clusters, as shown in Fig. 3(b), since the basis vectors are aligned in the direction that minimizes the projection error. In the AMSOM, the input vectors are grouped into episodes in order to present them to the network as input sets. For pattern classification, an episode input is defined as a subset of the training data belonging to the same category. Assume that the number of input vectors in the subset is E; then an episode input ω_q in the class q is denoted as ω_q = {x_1, x_2, ..., x_E}, ω_q ⊆ Ω_q, where Ω_q is the set of training patterns belonging to the class q. The set of input vectors of an episode has to be recognized as one class, such that any member of this set, and even an arbitrary linear combination of them, should have the same winning unit. The training process in AMSOM has the following steps:

(a) Winner lookup. The unit that gives the minimum projection error for an episode is selected. This unit is denoted the winner, whose index is c. The decision criterion for the winner c is

c = arg min_i Σ_{e=1}^{E} ||x̃_e^{(i)}||²,   (3)

where i ∈ {1, ..., M}.

(b) Learning. For each unit i and for each x_e, update μ^{(i)}:

μ^{(i)}(t + 1) = μ^{(i)}(t) + λ_m(t) h_{ci}(t) (x_e − μ^{(i)}(t)),   (4)

where λ_m(t) is the learning rate for μ^{(i)} at learning epoch t and h_{ci}(t) is the neighborhood function at epoch t with respect to the winner c. Both λ_m(t) and h_{ci}(t) are monotonically decreasing functions of t. In this paper, λ_m(t) = λ_m^{ini}(λ_m^{fin}/λ_m^{ini})^{t/t_max}, h_{ci}(t) = exp(−|c − i|/γ(t)) and γ(t) = γ^{ini}(γ^{fin}/γ^{ini})^{t/t_max} are used. Then the basis vectors are updated as

b_h^{(i)}(t + 1) = b_h^{(i)}(t) + λ_b(t) h_{ci}(t) [φ_e^{(i)}(t)^T b_h^{(i)}(t) / (||φ̂_e^{(i)}(t)|| ||φ_e^{(i)}(t)||)] φ_e^{(i)}(t),   (5)

where φ_e^{(i)}(t) = x_e − μ^{(i)}(t + 1) is the relative input vector after the mean vector update, φ̂_e^{(i)}(t) = Σ_{h=1}^{H} (φ_e^{(i)}(t)^T b_h^{(i)}(t)) b_h^{(i)}(t) is the orthogonal projection of the relative input vector, and λ_b(t) is the learning rate for the basis vectors, which is also a monotonically decreasing function of t. In this paper, λ_b(t) = λ_b^{ini}(λ_b^{fin}/λ_b^{ini})^{t/t_max} is used. After the learning phase, a categorization phase determines the class association of each unit: each unit is labeled with the class for which it is selected as the winner most frequently when the training data are presented to the AMSOM again.
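One winner-lookup/learning step of Eqs. (3)-(5) can be sketched as follows. The unit initialization, the episode, and the QR re-orthonormalization of the basis after each update are assumptions of this sketch (the paper does not spell out how basis orthonormality is maintained):

```python
import numpy as np

def winner(episode, units):
    # Eq. (3): unit with the minimum summed projection error over the episode
    errs = []
    for mu, B in units:                # B holds basis vectors b_h as rows
        e = 0.0
        for x in episode:
            phi = x - mu
            e += ((phi - B.T @ (B @ phi)) ** 2).sum()
        errs.append(e)
    return int(np.argmin(errs))

def train_step(episode, units, c, t, t_max,
               lam_ini=0.1, lam_fin=0.01, g_ini=3.0, g_fin=0.03):
    lam = lam_ini * (lam_fin / lam_ini) ** (t / t_max)   # decaying learning rate
    gamma = g_ini * (g_fin / g_ini) ** (t / t_max)
    for i, (mu, B) in enumerate(units):
        h = np.exp(-abs(c - i) / gamma)                  # neighbourhood h_ci(t)
        for x in episode:
            mu = mu + lam * h * (x - mu)                 # Eq. (4)
            phi = x - mu
            proj = B.T @ (B @ phi)
            n = np.linalg.norm(proj) * np.linalg.norm(phi)
            if n > 1e-12:
                B = B + lam * h * np.outer(B @ phi / n, phi)   # Eq. (5)
            Q, _ = np.linalg.qr(B.T)                     # re-orthonormalise the basis
            B = Q.T
        units[i] = (mu, B)

units = [(np.zeros(2), np.array([[1.0, 0.0]])),
         (np.ones(2), np.array([[0.0, 1.0]]))]
episode = [np.array([2.0, 0.1]), np.array([1.5, -0.1])]
c = winner(episode, units)
train_step(episode, units, c, t=0, t_max=100)
print(c)
```

With this episode drawn near the x-axis, unit 0 (whose basis already spans the x-axis) wins and both units' bases remain orthonormal after the step.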

3 Self-Organizing Clustering with Nonlinear Varieties

3.1 Reproducing Kernels

Reproducing kernels are functions k : X² → R which, for all pattern sets

{x_1, ..., x_l} ⊂ X,   (6)

give rise to positive matrices K_ij := k(x_i, x_j). Here, X is some compact set in which the data reside, typically a subset of R^n. In the field of Support Vector Machines (SVM), reproducing kernels are often referred to as Mercer kernels. They provide an elegant way of dealing with nonlinear algorithms by reducing them to linear ones in some feature space F nonlinearly related to the input space: using k instead of a dot product in R^n corresponds to mapping the data into a possibly high-dimensional dot product space F by a (usually nonlinear) map Φ : R^n → F, and taking the dot product there, i.e. [11]

k(x, y) = (Φ(x), Φ(y)).   (7)

By virtue of this property, we shall call Φ a feature map associated with k. Any linear algorithm which can be carried out in terms of dot products can be made

nonlinear by substituting an a priori chosen kernel. Examples of such algorithms include the potential function method [12], SVM [7][8] and kernel PCA [9]. The price one has to pay for this elegance, however, is that the solutions are only obtained as expansions in terms of the input patterns mapped into feature space. For instance, the normal vector of an SV hyperplane is expanded in terms of support vectors, just as the kernel PCA feature extractors are expressed in terms of training examples:

Ψ = Σ_{i=1}^{l} α_i Φ(x_i).   (8)
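The defining property in (6)-(7), that k gives rise to positive matrices, is straightforward to check empirically: build the Gram matrix for any pattern set and inspect its eigenvalues. A quick sketch with a Gaussian kernel (the random patterns are an assumption):

```python
import numpy as np

def gram(X, k):
    # K_ij := k(x_i, x_j), Eq. (6)
    return np.array([[k(a, b) for b in X] for a in X])

rbf = lambda x, y, s=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * s ** 2))

X = np.random.default_rng(0).normal(size=(10, 3))
K = gram(X, rbf)
eigs = np.linalg.eigvalsh(K)
print(K.shape, eigs.min())  # symmetric, smallest eigenvalue >= 0 up to round-off
```

Positive semi-definiteness of the Gram matrix is exactly what guarantees that a feature map Φ with k(x, y) = (Φ(x), Φ(y)) exists.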

3.2 AMSOM in the Feature Space

The AMSOM in the high-dimensional feature space F is considered. In the proposed method, the variety defined by a computational unit i in the competitive layer takes the nonlinear form

M_i = {Φ(x) | Φ(x) = Φ(μ^{(i)}) + Σ_{h=1}^{H} ξ Φ(b_h^{(i)})},   (9)

where ξ ∈ R. Given a training data set {x_1, ..., x_N}, the mean vector and the basis vectors in a unit i are represented in the following form:

Φ(μ^{(i)}) = Σ_{l=1}^{N} α_l^{(i)} Φ(x_l),   (10)

Φ(b_h^{(i)}) = Σ_{l=1}^{N} β_{hl}^{(i)} Φ(x_l),   (11)

respectively. The α_l^{(i)} in Eq. (10) and the β_{hl}^{(i)} in Eq. (11) are the parameters adjusted by learning. The derivation of the training procedure in the proposed method is given as follows:

(a) Winner lookup. The norm of the orthogonal projection error onto the i-th affine subspace with respect to the present input x_p is calculated as follows:

||Φ(x̃_p^{(i)})||² = k(x_p, x_p) + Σ_{h=1}^{H} P_h^{(i)2} − 2 Σ_{l=1}^{N} α_l^{(i)} k(x_p, x_l) + Σ_{l1=1}^{N} Σ_{l2=1}^{N} α_{l1}^{(i)} α_{l2}^{(i)} k(x_{l1}, x_{l2}) + 2 Σ_{h=1}^{H} Σ_{l1=1}^{N} Σ_{l2=1}^{N} P_h^{(i)} α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}) − 2 Σ_{h=1}^{H} Σ_{l=1}^{N} P_h^{(i)} β_{hl}^{(i)} k(x_p, x_l),   (12)

where P_h^{(i)} is the orthogonal projection component of the present input x_p onto the basis Φ(b_h^{(i)}), calculated by

P_h^{(i)} = Σ_{l=1}^{N} β_{hl}^{(i)} k(x_p, x_l) − Σ_{l1=1}^{N} Σ_{l2=1}^{N} α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}).   (13)

The reproducing kernels generally used are:

k(x_s, x_t) = (x_s^T x_t)^d, d ∈ N,   (14)

k(x_s, x_t) = (x_s^T x_t + 1)^d, d ∈ N,   (15)

k(x_s, x_t) = exp(−||x_s − x_t||² / (2σ²)), σ ∈ R,   (16)

where N and R are the set of natural numbers and the set of reals, respectively. Eqs. (14), (15) and (16) are referred to as homogeneous polynomial kernels, non-homogeneous polynomial kernels and Gaussian kernels, respectively. The winner for an episode input ω_q = {Φ(x_1), ..., Φ(x_E)} is decided in the same manner as in the AMSOM:

c = arg min_i Σ_{e=1}^{E} ||Φ(x̃_e^{(i)})||², i ∈ {1, ..., M}.   (17)

(b) Learning. The learning rules for α_l^{(i)} and β_{hl}^{(i)} are as follows:

Δα_l^{(i)} = −α_l^{(i)}(t) λ_m(t) h_{ci}(t) for l ≠ e,
Δα_l^{(i)} = −α_l^{(i)}(t) λ_m(t) h_{ci}(t) + λ_m(t) h_{ci}(t) for l = e,   (18)

Δβ_{hl}^{(i)} = −α_l^{(i)}(t + 1) λ_b(t) h_{ci}(t) T_h^{(i)}(t) for l ≠ e,
Δβ_{hl}^{(i)} = −α_l^{(i)}(t + 1) λ_b(t) h_{ci}(t) T_h^{(i)}(t) + λ_b(t) h_{ci}(t) T_h^{(i)}(t) for l = e,   (19)

where

T_h^{(i)}(t) = Φ(φ_e^{(i)}(t))^T Φ(b_h^{(i)}(t)) / (||Φ(φ̂_e^{(i)}(t))|| ||Φ(φ_e^{(i)}(t))||),   (20)

||Φ(φ̂_e^{(i)}(t))|| = [Σ_{h=1}^{H} (Σ_{l=1}^{N} β_{hl}^{(i)} k(x_e, x_l) − Σ_{l1=1}^{N} Σ_{l2=1}^{N} α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}))²]^{1/2},   (21)

||Φ(φ_e^{(i)}(t))|| = [k(x_e, x_e) − 2 Σ_{l=1}^{N} α_l^{(i)} k(x_e, x_l) + Σ_{l1=1}^{N} Σ_{l2=1}^{N} α_{l1}^{(i)} α_{l2}^{(i)} k(x_{l1}, x_{l2})]^{1/2},   (22)

Φ(φ_e^{(i)}(t))^T Φ(b_h^{(i)}(t)) = Σ_{l=1}^{N} β_{hl}^{(i)} k(x_e, x_l) − Σ_{l1=1}^{N} Σ_{l2=1}^{N} α_{l1}^{(i)} β_{hl2}^{(i)} k(x_{l1}, x_{l2}).   (23)
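Eqs. (12)-(13) can be validated against an explicit feature map for the homogeneous polynomial kernel of degree 2, whose map in 2-D is Φ(x) = (x_1², √2·x_1x_2, x_2²). The sketch below (random data, a uniform α for the mean, and a single unit-norm basis with H = 1 are all assumptions) computes the projection error once through the kernel-only formulas and once explicitly:

```python
import numpy as np

phi = lambda x: np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])  # explicit map of k(x,y)=(x.y)^2
k = lambda x, y: (x @ y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = np.array([[k(a, b) for b in X] for a in X])

alpha = np.full(5, 0.2)                     # Phi(mu) = mean of the mapped data, Eq. (10)
beta = rng.normal(size=5)
beta /= np.sqrt(beta @ K @ beta)            # unit norm of Phi(b) in feature space, Eq. (11)

xp = rng.normal(size=2)
kp = np.array([k(xp, x) for x in X])

P = beta @ kp - alpha @ K @ beta            # Eq. (13), H = 1
err2 = (k(xp, xp) + P**2 - 2*(alpha @ kp) + alpha @ K @ alpha
        + 2*P*(alpha @ K @ beta) - 2*P*(beta @ kp))      # Eq. (12), H = 1

# cross-check with the explicit feature map
F = np.array([phi(x) for x in X])
mu_F, b_F = alpha @ F, beta @ F
resid = phi(xp) - mu_F - ((phi(xp) - mu_F) @ b_F) * b_F
print(err2, resid @ resid)                  # the two computations agree
```

Algebraically, (12) collapses to ||Φ(x_p) − Φ(μ)||² − Σ_h P_h^{(i)2} when the Φ(b_h) are orthonormal, which is what the explicit residual computes.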

In Eqs. (18) and (19), λ_m(t), λ_b(t) and h_{ci}(t) are the same parameters as in the AMSOM training process. After the learning phase, a categorization phase determines the class association of each unit. The categorization procedure is the same as described in the previous section.

4 Experimental Results

Data Distribution 1. To analyze the effect of reducing the dimensionality of the intersections of subspaces by the proposed method, the data shown in Fig. 4(a) are used. For these data, although each class can be approximated by a 1-dimensional linear manifold in the input space R², intersections of subspaces could occur between class 1 and class 2, and between class 2 and class 3, even if the optimal linear manifold for each class were obtained. However, a linear manifold in the high-dimensional space, that is, a nonlinear manifold in the input space, can be expected to classify the given data thanks to the reduction effect on the dimensionality of the intersections of subspaces. In the simulation, the given input data are classified correctly by the proposed method. Figures 4(b) and (c) show the decision regions constructed by AMSOM and by the proposed method, respectively. In this experiment, 3 units were used in the competitive layer. The class associations are class 1 to unit 1, class 2 to unit 2 and class 3 to unit 3, respectively. The experimental parameters were assigned as follows: total number of epochs t_max = 100, λ_m^{ini} = 0.1, λ_m^{fin} = 0.01, λ_b^{ini} = 0.1, λ_b^{fin} = 0.01, γ^{ini} = 3, γ^{fin} = 0.03, common to both AMSOM and the proposed method; H = 1 in AMSOM, and H = 2 with k(x, y) = (x^T y + 1)³ in the proposed method. The experiment showed that the proposed method has the ability to reduce the dimensionality of the intersections of subspaces.

Data Distribution 2. To verify that the proposed method has the ability to describe nonlinear manifolds, the data shown in Fig. 5(a) are used. In this case it is impossible to describe each class by a linear manifold, that is, a 1-dimensional affine subspace. In the simulation, the given input data are classified correctly by the proposed method. Figures 5(b) and (c) show the decision regions constructed by AMSOM and by the proposed method, respectively.
In this experiment, 3 units were used in the competitive layer. The class associations are class 1 to unit 3, class 2 to unit 2 and class 3 to unit 1, respectively. The experimental parameters were assigned as follows: the

Fig. 4. (a) Training data used in the first experiment. (b) Decision regions learned by AMSOM and (c) by the proposed method.

total number of epochs t_max = 100, λ_m^{ini} = 0.1, λ_m^{fin} = 0.01, λ_b^{ini} = 0.1, λ_b^{fin} = 0.01, γ^{ini} = 3, γ^{fin} = 0.03, common to both AMSOM and the proposed method; H = 1 in AMSOM, and H = 2 with k(x, y) = (x^T y)² in the proposed method. The experiment showed that the proposed method has the ability to extract suitable nonlinear manifolds efficiently.

Fig. 5. (a) Training data used in the second experiment. (b) Decision regions learned by AMSOM and (c) by the proposed method.

5 Conclusions

A clustering method with a map of nonlinear varieties was proposed as a new pattern classification method. The proposed method extends AMSOM to a nonlinear method in a straightforward way by applying the kernel method. The effectiveness of the proposed method was verified by the experiments. The proposed algorithm opens highly promising applications of the ASSOM family to a wide area of practical problems.


References 1. Oja, E.: Subspace Methods of Pattern Recognition. Research Studies Press (1983) 2. Kohonen, T.: Emergence of Invariant-Feature Detectors in the Adaptive-Subspace Self-Organizing Map. Biol. Cybern. 75, 281–291 (1996) 3. Kohonen, T., Kaski, S., Lappalainen, H.: Self-Organizing Formation of Various Invariant-Feature Filters in the Adaptive-Subspace SOM. Neural Comput. 9, 1321–1344 (1997) 4. Kohonen, T.: Self-Organizing Maps. Springer, Berlin, Heidelberg, New York (1995) 5. Liu, Z.Q.: Adaptive Subspace Self-Organizing Map and Its Application in Face Recognition. International Journal of Image and Graphics 2(4), 519–540 (2002) 6. Saitoh, S.: Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England (1988) 7. Cortes, C., Vapnik, V.: Support Vector Networks. Machine Learning 20, 273–297 (1995) 8. Vapnik, V.: The Nature of Statistical Learning Theory, 2nd edn. Springer, New York, Berlin, Heidelberg (1995) 9. Schölkopf, B., Smola, A.J., Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Technical Report 44, Max-Planck-Institut für biologische Kybernetik (1996) 10. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.R.: Fisher Discriminant Analysis with Kernels. Neural Networks for Signal Processing IX, 41–48 (1999) 11. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152 (1992) 12. Aizerman, M., Braverman, E., Rozonoer, L.: Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control 25, 821–837 (1964) 13. Schölkopf, B., Smola, A.J.: Learning with Kernels. The MIT Press, Cambridge (2002)

An Automatic Speaker Recognition System P. Chakraborty1, F. Ahmed1, Md. Monirul Kabir2, Md. Shahjahan1, and Kazuyuki Murase2,3

1 Department of Electrical & Electronic Engineering, Khulna University of Engineering and Technology, Khulna-920300, Bangladesh 2 Dept. of Human and Artificial Intelligence Systems, Graduate School of Engineering 3 Research and Education Program for Life Science, University of Fukui, 3-9-1 Bunkyo, Fukui 910-8507, Japan [email protected], [email protected]

Abstract. Speaker recognition is the process of identifying a speaker by analyzing the spectral shape of the voice signal. This is done by extracting and matching features of the voice signal. The Mel-Frequency Cepstrum Coefficient (MFCC) technique is used for feature extraction; the resulting Mel-frequency cepstrum coefficients are the extracted features and are taken as the input of the vector quantization process. Vector Quantization (VQ) is a typical feature matching technique in which a VQ codebook is generated by clustering the training vectors of each speaker around a predefined set of spectral vectors in a training session. Finally, test data are provided, and the nearest neighbor is searched to match the test data with the trained data. The system correctly recognizes the speakers on music and speech data (in both English and Bengali); the correct recognition rate is almost ninety percent, which compares favorably with the Hidden Markov Model (HMM) and Artificial Neural Network (ANN) approaches. Keywords: MFCC - Mel-Frequency Cepstrum Coefficient, DCT - Discrete Cosine Transform, IIR - Infinite Impulse Response, FIR - Finite Impulse Response, FFT - Fast Fourier Transform, VQ - Vector Quantization.

1 Introduction

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This paper deals with an automatic speaker recognition system using Vector Quantization (VQ). Other techniques exist for speaker recognition, such as the Hidden Markov Model (HMM) and Artificial Neural Networks (ANN); we have used VQ because of its lower computational complexity [1]. Any speaker recognition system has two main modules: feature extraction and feature matching [1, 2]. The speaker-specific features are extracted using a Mel-Frequency Cepstrum Coefficient (MFCC) processor. The resulting set of mel-frequency cepstrum coefficients is called an acoustic vector [3]; these are the extracted features of the speakers. These

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 517–526, 2008. © Springer-Verlag Berlin Heidelberg 2008


acoustic vectors are used for feature matching with the vector quantization technique, in which a VQ codebook is generated from the training data. Test data are then matched against the trained data by searching for the nearest neighbor. The system correctly recognizes speakers from music and speech data (in both English and Bengali). This work used about 70 spectral data sets, and correct recognition is almost ninety percent, which is better than the Hidden Markov Model (HMM) and Artificial Neural Network (ANN), whose correct recognition rates are below ninety percent. Future work is to generate a VQ codebook with many predefined spectral vectors, so that many trained data can be added to the codebook in a training session; the main problem is that the network size and training time become prohibitively large as the data size increases. To overcome these limitations, a time-alignment technique can be applied, making continuous speaker recognition possible. There are several implementations of feature matching and identification. Lawrence Rabiner and B. H. Juang proposed the mel-frequency cepstrum coefficient (MFCC) method for feature extraction and vector quantization as the feature matching technique [1]. Lawrence Rabiner and R. W. Schafer discussed the performance of the MFCC processor following several theoretical concepts [2]. S. B. Davis and P. Mermelstein described the characteristics of acoustic speech [3]. Y. Linde, A. Buzo, and R. Gray proposed the LBG algorithm for generating a VQ codebook by a splitting technique [4]. S. Furui described speaker-independent word recognition using dynamic features of the speech spectrum [5]. S. Furui also surveyed speaker recognition technology using the MFCC and VQ methods [6].
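The nearest-neighbor codebook search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the Euclidean distortion measure and the codebook layout are assumptions.

```python
import numpy as np

def vq_distortion(features, codebook):
    """Average distance from each acoustic vector to its nearest codeword.

    features: (T, d) array of acoustic vectors from the test utterance.
    codebook: (K, d) array of codewords for one speaker.
    """
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def identify_speaker(features, codebooks):
    """Return the speaker whose trained codebook gives the smallest distortion."""
    return min(codebooks, key=lambda spk: vq_distortion(features, codebooks[spk]))
```

In a full system, each speaker's codebook would be trained from the MFCC vectors of the training session, e.g. with the LBG splitting algorithm cited as [4].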

2 Methodology

A general model of a speaker recognition system for several people is shown in Fig. 1. The model consists of four building blocks. The first is data extraction, which converts wave data stored in an audio wave format into a form suitable for further computer processing and analysis. The second is pre-processing, which involves filtering,

Fig. 1. Block diagram of speaker recognition system


removing pauses, silences, and weak unvoiced sound signals, and detecting the valid speech signal. The third block is feature extraction, where speech features are extracted from the speech signal. The selected features carry enough information to recognize a speaker. A class label is assigned to each word uttered by each speaker by examining the extracted features and comparing them with the classes learnt during the training phase. Vector quantization is used as the identifier.

3 Pre-processing

A digital filter is a mathematical algorithm, implemented in hardware and/or software, that operates on a digital input signal to produce a digital output signal in order to achieve a filtering objective. Digital filters often operate on digitized analog signals, or simply on numbers representing some variable stored in computer memory. Digital filters are broadly divided into two classes: infinite impulse response (IIR) and finite impulse response (FIR) filters. We chose an FIR filter because FIR filters can have an exactly linear phase response, which means that no phase distortion is introduced into the signal by the filter. This is an important requirement in many applications, for example data transmission, biomedicine, digital audio, and image processing. The phase responses of IIR filters are non-linear, especially at the band edges. When a machine is continuously listening to speech, a difficulty arises in determining where a word starts and stops. We solved this problem by examining the magnitude of several consecutive samples of sound: if the magnitude of these samples is great enough, those samples are kept for later examination [1].

Fig. 2. Sample Speech Signal

Fig. 3. Example of speech signal after it has been cleaned

One thing that is clearly noticeable in the example speech signal is that there is a lot of empty space where nothing is being said, so we simply remove it. An example speech signal is shown before cleaning in Figure 2, and after cleaning in Figure 3. After the empty space is removed, the signal is much shorter: in this case the signal had about 13,000 samples before it was cleaned, and after it was run through the clean function it contained only 2,600 samples. There are several advantages to this. The time required to perform calculations on 13,000 samples is much larger than that required for 2,600 samples, and the cleaned sample still contains all the important data required to perform the analysis of the speech. The sample produced by the cleaning process is then fed into the other parts of the ASR system.
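The silence-removal step can be sketched as follows. This is an illustrative reconstruction: the paper does not give its exact frame length or magnitude threshold, so both are assumptions here.

```python
import numpy as np

def clean_speech(signal, frame_len=80, threshold_ratio=0.1):
    """Remove low-magnitude (silent) stretches from a speech signal.

    Frames whose mean absolute magnitude falls below a fraction of the
    signal's overall mean magnitude are discarded; the remaining frames
    are concatenated.  frame_len and threshold_ratio are assumed values.
    """
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.abs(frames).mean(axis=1)
    keep = energy > threshold_ratio * np.abs(signal).mean()
    return frames[keep].ravel()
```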


4 Feature Extraction

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, namely:

• Linear Prediction Coding (LPC)
• Mel-Frequency Cepstrum Coefficients (MFCC)
• Linear Predictive Cepstral Coefficients (LPCC)
• Perceptual Linear Prediction (PLP)
• Neural Predictive Coding (NPC)

Among the above we used MFCC, because it is the best known and most popular. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency; filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, with linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [1, 2].

4.1 Mel-Frequency Cepstrum Processor

A diagram of the structure of an MFCC processor is given in Figure 4. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs are shown to be less susceptible to the mentioned variations than the speech waveforms themselves [5, 6].

Fig. 4. Block diagram of the MFCC processor
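The mel scale mentioned above is commonly expressed by an analytic approximation; the specific constants below are the widely used convention, not formulas printed in this excerpt.

```python
import math

def hz_to_mel(f_hz):
    """Common analytic mel-scale approximation: roughly linear below
    1 kHz and logarithmic above it."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

With this convention, 1000 Hz maps to approximately 1000 mel, and frequencies above 1 kHz are compressed logarithmically.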

4.1.1 Frame Blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by


N − M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N − 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values are N = 256 and M = 100.

4.1.2 Windowing

The next processing step is windowing, which minimizes the signal discontinuities at the beginning and end of each frame. The idea is to reduce spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples, then the result of windowing is the signal

y_l(n) = x_l(n) w(n),   0 ≤ n ≤ N − 1   (1)

The following are common types of window:
▪ Hamming
▪ Rectangular
▪ Bartlett (triangular)
▪ Hanning
▪ Kaiser
▪ Lanczos
▪ Blackman-Harris

Of these, we mainly used the Hamming window for ease of mathematical computation; it is defined as:

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1   (2)

Besides this, we have also used the Hanning window and the Blackman-Harris window. As an example, a Hamming window with 256 samples is shown here.

Fig. 5. Hamming window with 256 speech samples
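Frame blocking followed by Hamming windowing, using the typical values N = 256 and M = 100 given above, can be sketched as follows (an illustrative sketch, not the authors' code):

```python
import numpy as np

def frame_and_window(signal, N=256, M=100):
    """Block a signal into overlapping frames of N samples with hop M,
    then apply the Hamming window w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))
    to each frame.  Typical values from the text: N = 256, M = 100."""
    n_frames = 1 + (len(signal) - N) // M
    idx = np.arange(N)[None, :] + M * np.arange(n_frames)[:, None]
    frames = signal[idx]                    # shape (n_frames, N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    return frames * w
```

Each row of the result is one windowed frame, ready to be passed to the FFT stage described next.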

4.1.3 Fast Fourier Transform (FFT)

The Fast Fourier Transform converts a signal from the time domain into the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:

X_n = ∑_{k=0}^{N−1} x_k e^{−2πjkn/N},   n = 0, 1, 2, ..., N − 1   (3)


We use j here to denote the imaginary unit, i.e. j = √(−1). In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < F_s/2 correspond to values 1 ≤ n ≤ N/2 − 1, while negative frequencies −F_s/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1.

lim_{n→∞} P( max_{1≤i≤n} |W_i|² > C_{n,ε} ) = 0,   for ε > 0,   (17)

lim_{n→∞} P( max_{1≤i≤n} |W_i|² ≤ C_{n,ε} ) = 0,   for ε < 0.   (18)

The strong convergence result with C_{n,1} is known, and C_{n,1} is used as the universal threshold level for removing irrelevant wavelet coefficients in wavelet denoising [5]. The universal threshold level is shown to be asymptotically equivalent to the minimax one [5]. (17) and (18) are weaker results, but they evaluate the O(log log n) term. We return to our problem and consider the case of v*_λ = 0_n, i.e. all of the ṽ_{λ,i} are noise components. Here, 0_n denotes the n-dimensional zero vector. We define ũ = (ũ_1, ..., ũ_n) that satisfies ũ ∼ N(0_n, σ²Γ_λ). We define σ_i² = σ²γ_i/(γ_i + λ), i = 1, ..., n, where σ_i ≠ 0 for any i. We also define u = (u_1, ..., u_n), where u_i = ũ_i/σ_i. Then u ∼ N(0_n, I_n). By (17) and the definition of u, we have

P( ⋃_{i=1}^{n} { ũ_i² > σ_i² C_{n,ε} } ) = P( max_{1≤i≤n} u_i² > C_{n,ε} ) → 0   (n → ∞),   (19)

542

K. Hagiwara

if ε > 0. On the other hand, by using (18) and the definition of u, we have

P( ⋂_{i=1}^{n} { ũ_i² ≤ σ_i² C_{n,ε} } ) = P( max_{1≤i≤n} u_i² ≤ C_{n,ε} ) → 0   (n → ∞),   (20)

if ε < 0. (19) tells us that, for any i, ũ_i² cannot exceed σ_i²C_{n,ε} with high probability when n is large and ε > 0. Therefore, if we employ σ_i²C_{n,ε} with ε > 0 as component-wise threshold levels, (19) implies that they remove a component whenever it is a noise component. On the other hand, (20) tells us that there are some noise components which satisfy ũ_i² > σ_i²C_{n,ε} with high probability when n is large and ε < 0; such components would erroneously be kept, so we cannot distinguish nonzero-mean components from zero-mean components in this case. Hence σ_i²C_{n,0}, i = 1, ..., n, are critical levels for identifying noise components. We note that these results are still valid when the v*_{λ,i} are not zero for some i but the number of such components is very small; this is the case when assuming sparseness of the representation of h in the orthogonal domain. As a conclusion, we propose a hard thresholding method by putting θ_{n,i} = σ_i²C_{n,0}, i = 1, ..., n, in (16), where we set ε = 0. We refer to this method as component-wise hard thresholding (CHT). In practical applications of the method, we need an estimate of the noise variance σ². Fortunately, in nonparametric regression, [1] suggested applying

σ̂² = yᵀ(I − H_λ)²y / trace[(I − H_λ)²],   (21)

where H_λ is defined by H_λ = G F_λ⁻¹ Gᵀ = G Q Γ_λ Qᵀ Gᵀ. Although the method includes a regularization parameter, it can be fixed at a small value. This is because the thresholding method keeps a trained machine compact on the orthogonal domain, by which the contribution of the regularizer may not be significant in improving the generalization performance; the regularizer is needed only to ensure the numerical stability of the matrix calculations. Since the basis functions can be nearly linearly dependent in practical applications, small eigenvalues are less reliable. We therefore ignore the orthogonal components whose eigenvalues are less than a predetermined small value, say, 10⁻¹⁶. Although the run time of eigendecomposition is O(n³), the subsequent procedures of CHT, such as the calculations of σ̂² and w_λ, are achieved with less computational cost given the eigendecomposition.

Hard thresholding with the universal threshold level (HTU). Basically, eigendecomposition of GᵀG corresponds to principal component analysis of g_1, ..., g_n. Therefore, for nearly linearly dependent basis functions, only a few eigenvectors contribute substantially, while the components with small eigenvalues are largely affected by numerical errors. It is thus natural to take care only of the components with large eigenvalues. For a component

Orthogonal Shrinkage Methods for Nonparametric Regression

543

with a large eigenvalue, γ_i ≫ λ holds, since we can choose a small value for λ. Thus σ_i² ≈ σ² holds by the definition of σ_i² in CHT. We then consider applying a single threshold level σ̂²C_{n,0} instead of σ̂_i²C_{n,0}, i = 1, ..., n, in CHT; i.e. we set θ_{n,i} = σ̂²C_{n,0} in (16). This is a direct application of the universal threshold level in wavelet denoising [5]. This method is referred to as hard thresholding with the universal threshold level (HTU).

Backward hard thresholding (BHT). On the other hand, since the threshold level derived here is a worst-case evaluation of the noise level, CHT and HTU may yield a bias between f_{w_λ} and h through unexpected removal of contributing components. Since a component with a large eigenvalue is composed of a linear sum of many basis functions, it may be a smooth component, and removing such components may yield a large bias. Indeed, in wavelet denoising, fast/detail components are the target of thresholding while slow/approximation components are left untouched [5], which may be a device for reducing the bias. We introduce this idea into our method as follows. Assume that γ_1 ≤ γ_2 ≤ ... ≤ γ_n. By increasing j from 1 to n, we find the first j = ĵ for which ṽ_{λ,j}² > σ̂²C_{n,0} occurs. Thresholding is then given by T_j(ṽ_{λ,j}) = ṽ_{λ,j} if j ≥ ĵ and T_j(ṽ_{λ,j}) = 0 if j < ĵ. This keeps the components with large eigenvalues and possibly reduces the bias. We refer to this method as backward hard thresholding (BHT). BHT can be viewed as a stopping criterion for choosing contributing components among the orthogonal components enumerated in order of the magnitude of their eigenvalues.
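A sketch of CHT under stated assumptions. Two pieces are not fully specified in this excerpt and are assumed here: the orthogonal components are taken as ṽ = (Γ + λI)^(−1/2) Qᵀ Gᵀ y, whose variance σ²γ_i/(γ_i + λ) under pure noise matches the σ_i² defined above, and C_{n,0} is taken as 2 log n, i.e. the universal level without the paper's O(log log n) correction.

```python
import numpy as np

def cht(G, y, lam=1e-4, eig_tol=1e-16):
    """Component-wise hard thresholding (CHT) sketch; see the assumptions above."""
    n = len(y)
    gamma, Q = np.linalg.eigh(G.T @ G)            # eigendecomposition of the Gram matrix
    keep = gamma > eig_tol                        # discard numerically unreliable components
    gamma, Q = gamma[keep], Q[:, keep]
    # assumed definition of the orthogonal components
    v = (Q.T @ (G.T @ y)) / np.sqrt(gamma + lam)
    # noise variance estimate, eq. (21)
    H = G @ (Q * (1.0 / (gamma + lam))) @ Q.T @ G.T
    R = np.eye(n) - H
    sigma2 = (y @ R @ R @ y) / np.trace(R @ R)
    # component-wise threshold levels sigma_i^2 * C_{n,0}
    sigma_i2 = sigma2 * gamma / (gamma + lam)
    threshold = sigma_i2 * 2.0 * np.log(n)        # assumed C_{n,0} = 2 log n
    v_thr = np.where(v ** 2 > threshold, v, 0.0)  # hard thresholding
    return v_thr, sigma2
```

HTU would replace the component-wise levels by the single level σ̂²C_{n,0}; BHT would instead scan the components in increasing order of γ_i and keep everything from the first exceedance onward.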

4 Numerical Experiments

4.1 Choice of Regularization Parameter

CHT, HTU and BHT include no parameters to be adjusted except the regularization parameter. Since thresholding of orthogonal components yields a certain simple representation of a machine, the regularization parameter may not be significant in improving the generalization performance. To demonstrate this property of our methods, through a simple numerical experiment, we examine the relationship between the generalization performance of trained machines and the regularization parameter value. The target function is h(x) = 5 sinc(8x) for x ∈ R. x_1, ..., x_n are drawn at random in the interval [−1, 1]. We assume i.i.d. Gaussian noise with mean zero and variance σ² = 1. The basis functions are Gaussian basis functions defined by g_j(x) = exp{−(x − x_j)²/(2τ²)}, j = 1, ..., n, where we set τ² = 0.05. In this experiment, under a fixed value of the regularization parameter, we trained machines on 1000 sets of training data of size n. At each trial, the test error is measured by the mean squared error between the target function and the trained machine, calculated on 1000 equally spaced input points in [−1, 1]. Figure 1 (a) and (b) depict the results for n = 200 and n = 400 respectively, showing the relationship between the averaged test errors of trained machines


Fig. 1. The dependence of test errors on regularization parameters. (a) n = 200 and (b) n = 400. The ﬁlled circle, open circle and open square indicate the results for the raw estimate, CHT and BHT respectively.

and regularization parameter values. The filled circle, open circle and open square indicate the results for the raw estimator, CHT and BHT respectively, where the raw estimate is obtained by (6) at each fixed value of the regularization parameter. We do not show the result for HTU since it is almost the same as that for CHT. In these figures, we can see that the averaged test errors of our methods are almost unchanged for small values of the regularization parameter, while those of the raw estimates are sensitive to the regularization parameter value. We can also see that BHT is uniformly superior to the raw estimate and CHT, while CHT is worse than the raw estimate around λ = 10¹ for both n = 200 and 400. In practical applications, the regularization parameter of the raw estimate should be determined from the training data; performance comparisons with the leave-one-out cross validation method are shown below.

4.2 Comparison with LOOCV

We compare the performance of the proposed methods to that of the leave-one-out cross validation (LOOCV) choice of the regularization parameter value. We examine not only generalization performance but also the computational time of each method. For the regularized estimator considered in this article, it is known that the LOOCV error can be calculated without training on validation sets [3,6]. We assume the same conditions as in the previous experiment. The CPU time is measured only for the estimation procedure. The experiments are conducted using Matlab on a computer with a 2.13 GHz Core2 CPU and 1 GByte of memory. Table 1 (a) and (b) show the averaged test errors and averaged CPU times of LOOCV and our methods respectively, with the standard deviations (divided by 2) appended. The examined values of the regularization parameter in LOOCV are {m × 10⁻ʲ : m = 1, 2, 5, j = −4, −3, ..., 3}. In our methods, the


Table 1. Test errors and CPU times of LOOCV, CHT, HTU and BHT

(a) Test errors

n     LOOCV          CHT            HTU            BHT
100   0.079± 0.027   0.101± 0.034   0.100± 0.034   0.076± 0.027
200   0.040± 0.013   0.046± 0.014   0.045± 0.014   0.035± 0.011
400   0.021± 0.006   0.023± 0.007   0.023± 0.007   0.017± 0.005

(b) CPU times

n     LOOCV          CHT            HTU            BHT
100   0.079± 0.002   0.013± 0.002   0.014± 0.002   0.014± 0.002
200   0.533± 0.003   0.080± 0.002   0.080± 0.002   0.080± 0.002
400   3.657± 0.003   0.523± 0.004   0.523± 0.004   0.523± 0.004

regularization parameter is fixed at 1 × 10⁻⁴, which is the smallest of the candidate values for LOOCV. Based on Table 1 (a), we first discuss the generalization performance of the methods. CHT and HTU are almost comparable, which implies that only the components corresponding to large eigenvalues contribute. CHT and HTU are worse than LOOCV on average, although the differences are within the standard deviations for n = 200 and n = 400. BHT outperforms LOOCV, CHT and HTU on average, although the difference between the averaged test errors of LOOCV and BHT is almost within the standard deviations. As pointed out previously, CHT and HTU may accidentally remove smooth components, since the threshold levels were determined from a worst-case evaluation of the dispersion of the noise; the better generalization performance of BHT compared with CHT and HTU in Table 1 (a) reflects this fact. On the other hand, as shown in Table 1 (b), our methods completely outperform LOOCV in terms of CPU time.

5 Conclusions and Future Works

In this article, we proposed shrinkage methods for training a machine with a regularization method. The machine is represented by a linear combination of fixed basis functions, in which the number of basis functions, or equivalently the number of weights, is identical to the number of training data. In the regularized cost function, the error function is the sum of squared errors and the regularization term is a quadratic form of the weight vector. In the proposed shrinkage procedures, the basis functions are orthogonalized by eigendecomposition of the Gram matrix of the vectors of basis function outputs. The orthogonal components are then kept or removed according to the proposed thresholding methods, which are based on the statistical properties of the regularized estimators of the weights derived under i.i.d. Gaussian noise. The final weights are obtained by a linear transformation of the thresholded orthogonal components and are shrinkage estimators of the weights. We


proposed three versions of thresholding: component-wise hard thresholding, hard thresholding with the universal threshold level, and backward hard thresholding. Since the regularization parameter can be fixed at a small value, our methods are automatic. Additionally, since eigendecomposition algorithms are included in many software packages and the thresholding methods are simple, our methods are quite easy to implement. The numerical experiments showed that our methods achieve relatively good generalization capability in strictly less computational time than the LOOCV method; in particular, the backward hard thresholding method outperformed the LOOCV method on average in terms of generalization performance. As future work, we need to investigate the performance of our methods on real-world problems, and to evaluate the generalization error when applying the proposed shrinkage methods.

References

1. Carter, C.K., Eagleson, G.K.: A comparison of variance estimators in nonparametric regression. J. R. Statist. Soc. B 54, 773–780 (1992)
2. Chen, S.: Local regularization assisted orthogonal least squares regression. Neurocomputing 69, 559–585 (2006)
3. Craven, P., Wahba, G.: Smoothing noisy data with spline functions. Numerische Mathematik 31, 377–403 (1979)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
5. Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455 (1994)
6. Rifkin, R.: Everything old is new again: a fresh look at historical approaches in machine learning. Ph.D. thesis, MIT (2002)
7. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288 (1996)
8. Suykens, J.A.K., De Brabanter, J., Lukas, L., Vandewalle, J.: Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing 48, 85–105 (2002)
9. Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 682–688 (2001)

A Subspace Method Based on Data Generation Model with Class Information

Minkook Cho, Dongwoo Yoon, and Hyeyoung Park

School of Electrical Engineering and Computer Science, Kyungpook National University, Daegu, Korea
[email protected], [email protected], [email protected]

Abstract. Subspace methods have been widely used in pattern recognition and signal processing to reduce memory requirements and system complexity and to improve classification performance. We propose a new subspace method based on a data generation model with an intra-class factor and an extra-class factor. The extra-class factor is associated with the distribution across classes and is important for discriminating classes. The intra-class factor is associated with the distribution within a class and should be diminished to obtain high class separability. In the proposed method, we first estimate the intra-class factors and remove them from the original data; we then extract the extra-class factors by PCA. To verify the proposed method, we conducted computational experiments on real facial data, and show that it gives better performance than conventional methods.

1 Introduction

Subspace methods aim at finding a low-dimensional subspace that preserves meaningful information about the input data. They are widely used for high-dimensional pattern classification, such as of image data, for two main reasons. First, by applying a subspace method, we can reduce memory requirements and system complexity. Second, we can expect improved classification performance by eliminating useless information and emphasizing information essential for classification. The most popular subspace methods are PCA (Principal Component Analysis) [10,11,8] and FA (Factor Analysis) [6,14], which are based on data generation models. PCA finds a subspace of linear combinations (principal components) that retains as much of the information in the original variables as possible. However, PCA is an unsupervised method that does not use class information, which may cause some loss of information critical for classification. In contrast, LDA (Linear Discriminant Analysis) [1,4,5] is a supervised learning method that uses the target labels of the data set. LDA attempts to find basis vectors of a subspace maximizing the linear class separability. It is generally known that LDA can give better classification performance than PCA by using class information. However, LDA gives

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 547–555, 2008. © Springer-Verlag Berlin Heidelberg 2008


at most k − 1 basis vectors of the subspace for a k-class problem, and cannot extract features stably when the number of data in each class is limited. Another subspace method for classification is the intra-person space method [9], developed for face recognition. The intra-person space is defined by differences between two facial images of the same person. For dimension reduction, a low-dimensional eigenspace is obtained by applying PCA to the intra-person space. In classification tasks, raw input data are projected onto the intra-personal eigenspace to obtain low-dimensional features. The intra-person method showed better performance than PCA and LDA on the FERET data [9]. However, it is not based on a data generation model, and it cannot give a sound theoretical reason why the intra-person space provides good information for classification. On the other hand, a data generation model with class information has recently been developed [12]. It is a variant of the factor analysis model with two types of factors: a class factor (what we call the extra-class factor) and an environment factor (what we call the intra-class factor). Based on this data generation model, the intra-class factor is estimated using difference vectors between pairs of data in the same class, and the estimated probability distribution of the intra-class factor is applied to measuring the similarity of data for classification. Though this method takes an approach similar to the intra-person method, in the sense that it uses difference vectors to obtain the intra-class information and similarity measure, it is based on a data generation model that can explain the derived similarity measure. Still, it does not include a dimension reduction process, and another subspace method is necessary for high-dimensional data. In this paper, we propose an appropriate subspace method for the data generation model developed in [12]. The proposed method finds a subspace that suppresses the effect of the intra-class factors and enlarges the effect of the extra-class factors based on the data generation model. In Section 2, the model is explained in detail.

2 Data Generation Model

In advance of defining the data generation model, consider that we have obtained several images (i.e. data) from different persons (i.e. classes). We know that pictures of different persons are obviously different. We also know that pictures of the same person are not exactly the same, due to environmental conditions such as illumination. Therefore, it is natural to assume that a data point consists of an intra-class factor, which represents within-class variation such as the variation among pictures of the same person, and an extra-class factor, which represents between-class variation such as the differences between two persons. Under this assumption, a random variable x for the observed data can be written as a function of two distinct random variables ξ and η:

x = f(ξ, η),   (1)

where ξ represents an extra-class factor, which keeps some unique information in each class, and η represents an intra-class factor which represents environmental

A Subspace Method Based on Data Generation Model

549

variation in the same class. In [12], it was assumed that η keeps information of any variation within a class and that its distribution is common to all classes. In order to explicitly define the generation model, a linear additive factor model was applied:

x_i = W ξ_i + η.   (2)

This means that a random sample x_i in class C_i is generated by the summation of a linearly transformed class prototype ξ_i and a random class-independent variation η. In this paper, as an extension of the model, we assume that the low-dimensional intra-class factor is also linearly transformed to generate an observed data point x. The function f is therefore defined by

x_i = W ξ_i + V η_i.   (3)

This model implies that data of a specific class are generated by the extra-class factor ξ, which provides discriminative information across classes, and the intra-class factor η, which represents variation within a class. In this equation, W and V are the transformation matrices of the corresponding factors; we call W the extra-class factor loading and V the intra-class factor loading. Figure 1 depicts this data generation model. To find a good subspace for classification, we try to find V and W using the class information given with the input data. In Section 3, we explain how to find the subspace based on the data generation model.
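A toy simulation of the generative model (3) can make the two factors concrete; all dimensions below are chosen arbitrarily for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, p, q, K = 20, 3, 2, 4          # assumed: data dim, factor dims, number of classes

W = rng.normal(size=(D, p))       # extra-class factor loading
V = rng.normal(size=(D, q))       # intra-class factor loading
xi = rng.normal(size=(K, p))      # one extra-class factor (class prototype) per class

def sample_class(k, n=50):
    """Draw n samples of class k: x = W xi_k + V eta, with eta ~ N(0, I_q)."""
    eta = rng.normal(size=(n, q))
    return xi[k] @ W.T + eta @ V.T

X0, X1 = sample_class(0), sample_class(1)
```

All samples of class k share the prototype term W ξ_k; only the V η term varies within the class, which is exactly the variation the proposed method estimates and removes.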


Fig. 1. The proposed data generation model

3 Factor Analysis Based on Data Generation Model

For given data x, if we keep the extra-class information and reduce the intra-class information as much as possible, we can expect better classification performance. In this respect, the proposed method is similar to the traditional LDA method: LDA finds a projection matrix which simultaneously maximizes the between-class scatter and minimizes the within-class scatter of the original data set. In contrast, the proposed method first estimates the intra-class information from the set of differences between data of the same class, and then excludes the intra-class information from the original data to keep the extra-class information. Therefore, compared to LDA, the proposed method does

550

M. Cho, D. Yoon, and H. Park

not need to compute a inverse matrix of the within-scatter matrix and the number of basis of subspace does not depend on a number of class. In addition, the proposed method is so simple to get subspaces and many variations of the proposed method can be developed by exploiting various data generation model. In this section, we will state how to obtain the subspace in detail based on simple linear generative model. 3.1

Intra-class Factor Loading

We first find the projection matrix Λ, which represents the intra-class factor, instead of the intra-class factor loading matrix V, in order to obtain the intra-class information. In the given data set {x}, we calculate the difference vector δ between two data points from the same class, which can be written as

δ^k_ij = x^k_i − x^k_j = (W ξ^k_i − W ξ^k_j) + (V η^k_i − V η^k_j),  (4)

where x^k_i and x^k_j are data from class C_k (k = 1, ..., K). Because x^k_i and x^k_j come from the same class, we can assume that the extra-class factor does not differ much between them, and we ignore the first term W ξ^k_i − W ξ^k_j. We then obtain the approximate relationship

δ^k_ij ≈ V (η^k_i − η^k_j).  (5)

Based on this relationship, we try to find the factor loading matrix V. For the obtained data set

Δ = {δ^k_ij}, k = 1, ..., K, i = 1, ..., N, j = 1, ..., N,  (6)

where K is the number of classes and N is the number of data points in each class, we apply PCA to obtain the principal components of Δ. The resulting matrix Λ of principal components of Δ spans the subspace that maximizes the variance of the intra-class factor η. The original data set X is projected onto this subspace to extract the intra-class factors Y^intra:

Y^intra = XΛ.  (7)

Note that Y^intra is a low-dimensional data set that contains the intra-class information of X. Using Y^intra, we reconstruct X^intra in the original dimension by

X^intra = Y^intra Λ^T (ΛΛ^T)^{−1}.  (8)

Note that X^intra keeps the intra-class information, which is not desirable for classification. To remove this undesirable information, we subtract X^intra from the original data set X. As a result, we get a new data set X̃:

X̃ = X − X^intra.  (9)
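The steps of Eqs. (4)-(9) can be sketched in NumPy as follows. This is a minimal illustration under the assumptions that data are stored row-wise and that the principal axes are orthonormal (so the pseudo-inverse in Eq. (8) reduces to Λᵀ); it is not the authors' implementation.

```python
import numpy as np

def remove_intra_class(X, labels, m):
    """Estimate the intra-class subspace from within-class difference
    vectors (Eq. 4) and subtract the reconstruction (Eqs. 7-9).
    X: (n, d) data, rows are samples; m: intra-class subspace dimension."""
    # difference vectors between all ordered pairs from the same class (Eq. 4)
    deltas = []
    for c in np.unique(labels):
        Xc = X[labels == c]
        for i in range(len(Xc)):
            for j in range(len(Xc)):
                if i != j:
                    deltas.append(Xc[i] - Xc[j])
    D = np.array(deltas)
    # PCA of the difference set: top-m eigenvectors of its covariance
    cov = D.T @ D / len(D)
    eigval, eigvec = np.linalg.eigh(cov)      # eigenvalues in ascending order
    Lam = eigvec[:, ::-1][:, :m]              # (d, m), columns = principal axes
    Y_intra = X @ Lam                         # Eq. (7)
    X_intra = Y_intra @ Lam.T                 # Eq. (8), orthonormal-column case
    return X - X_intra                        # Eq. (9)
```

With intra-class variation confined to one axis, the returned data have that axis removed while class-discriminative directions are untouched.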

A Subspace Method Based on Data Generation Model

3.2 Extra-Class Factor Loading

Using the newly obtained data set X̃, we try to find the projection matrix Λ̃, which represents the extra-class factor, instead of the extra-class factor loading matrix W, in order to preserve the extra-class information as much as possible. To solve this problem, consider the data set X̃. Noting that a data point x^intra in the data set X^intra is a reconstruction from the intra-class factor, we can write the approximate relationship

x^intra ≈ V η.  (10)

By combining this with equation (3), the newly obtained data point x̃ in X̃ can be rewritten as

x̃ ≈ x − V η = W ξ.  (11)

From this, we can say that the new data set X̃ mainly carries the extra-class information, which we therefore need to preserve as much as possible. From these considerations, we apply PCA to the new data set X̃ and obtain its principal component matrix Λ̃. By projecting X̃ onto these basis vectors,

X̃ Λ̃ = Z̃,  (12)

we obtain the data set Z̃, which has small intra-class variance and large extra-class variance. This data set is used for classification.
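The extra-class step (Eqs. (10)-(12)) is simply a second PCA on the cleaned data X̃, followed here by the minimum-distance classification used in Section 4. Again a hedged sketch with row-wise data assumed, not the authors' code.

```python
import numpy as np

def extra_class_projection(X_tilde, r):
    """PCA on the intra-class-reduced data X~ and projection Z~ = X~ Lam~ (Eq. 12).
    Returns the projected data (n, r) and the basis (d, r)."""
    Xc = X_tilde - X_tilde.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigval, eigvec = np.linalg.eigh(cov)   # ascending eigenvalues
    Lam_t = eigvec[:, ::-1][:, :r]         # top-r principal axes
    return X_tilde @ Lam_t, Lam_t

def nearest_mean_classify(Z_train, y_train, Z_test):
    """Minimum (Euclidean) distance to the class means, as in Section 4."""
    classes = np.unique(y_train)
    means = np.array([Z_train[y_train == c].mean(axis=0) for c in classes])
    d2 = ((Z_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]
```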

4 Experimental Results

To verify the proposed method, we conducted comparative experiments on facial data sets against conventional methods: PCA, LDA, and the intra-person method. For classification, we apply the minimum distance method [10] with Euclidean distance. When finding the subspace for each method, we optimized the subspace dimension with respect to the classification rate for each data set.

4.1 Face Recognition

We first conducted a face recognition task on face images with different viewpoints, obtained from the FERET (Face Recognition Technology) database (http://www.itl.nist.gov/iad/humanid/feret/). Figure 2 shows some samples from the data set. We used 450 images of 50 subjects; each subject has 9 images at poses from −60 to +60 degrees in 15-degree steps. The left and right (±60 degree) and frontal images were used for training, and the remaining 300 images for testing. The image size is 50 × 70, so the dimension of the raw input is 3500. For the LDA method, we first applied PCA to address the small-sample-size problem and obtained 52-dimensional features; we then applied LDA and obtained 9-dimensional features. Similarly, for the proposed method, we first found an 83-dimensional subspace for the intra-class factor and an 8-dimensional subspace for the extra-class factor. In this case, there are 50 classes, and the number of training samples per class is very limited: 3. The experimental results are shown in Table 1. The proposed method achieves a perfect score, and the other methods also give generally good results. Despite the 50 classes and the limited training data, the good results are likely due to the intrinsically high between-class variation.

Fig. 2. Examples of the human face images with different viewpoints

Table 1. Results on the face image data

Method        Dimension  Classification Rate (%)
PCA           117        97
LDA           9          99.66
Intra-Person  92         92.33
Proposed      8          100

4.2 Pose Recognition

We conducted a pose recognition task with the same data set used in Section 4.1. Here, the 450 images form 9 viewpoint classes, from −60° to 60° at 15° intervals, each class consisting of 50 images of different persons. 225 images (25 per class) were used for training and the remaining 225 images for testing. For the LDA method, we first applied PCA and obtained 167-dimensional features; we then applied LDA and obtained 6-dimensional features. For the proposed method, we first found a 128-dimensional subspace for the intra-class factor and a 21-dimensional subspace for the extra-class factor. In this case, there are 9 classes and 225 training samples, 25 from each class. The results are shown in Table 2. Performance is generally low, but the proposed method and LDA perform much better than PCA and the intra-person method. From the low performance, we conjecture that the between-class variance is very small compared with the within-class variance. Nevertheless, the proposed method achieved the best performance.

Table 2. Results on the human pose image data

Method        Dimension  Classification Rate (%)
PCA           65         36.44
LDA           6          57.78
Intra-Person  51         38.67
Proposed      21         58.22

4.3 Facial Expression Recognition

We also conducted a facial expression recognition task with a data set obtained from PICS (Psychological Image Collection at Stirling, http://pics.psych.stir.ac.uk/). Figure 3 shows sample facial expression images. We used 276 images of 69 persons; each person has 4 images with different expressions. 80 images (20 per expression) were used for training and the remaining 196 images for testing. The image size is 80 × 90, so the dimension of the raw input is 7200. For the LDA method, we first applied PCA and obtained 59-dimensional features; we then applied LDA and obtained 3-dimensional features. For the proposed method, we first found a 48-dimensional subspace for the intra-class factor and a 14-dimensional subspace for the extra-class factor. In this case, there are 4 classes and 20 training samples per class. Although the performance of all methods is generally low, the proposed method performed much better than PCA and the intra-person method. As in the pose recognition task, the between-class variance appears to be very small compared with the within-class variance. Nevertheless, the proposed method achieved the best performance.

Table 3. Results on the facial expression image data

Method        Dimension  Classification Rate (%)
PCA           65         35.71
LDA           3          65.31
Intra-Person  76         42.35
Proposed      14         66.33

Fig. 3. Examples of the human facial expression images

5 Conclusions and Discussion

In this paper, we proposed a new subspace method based on a data generation model with class information, which is represented by intra-class factors and extra-class factors. By reducing the intra-class information in the original data and keeping the extra-class information using PCA, we obtain low-dimensional features that preserve the information essential for the given classification problem. In experiments on various types of facial classification tasks, the proposed method showed better performance than conventional methods. As further study, more sophisticated dimension reduction methods than PCA, which can further enlarge the extra-class information, could be investigated. Kernel methods could also be applied to deal with nonlinearity.

Acknowledgements. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00807).

References

1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7), 711–720 (1997)
2. Alpaydin, E.: Machine Learning. MIT Press, Cambridge (2004)
3. Jaakkola, T., Haussler, D.: Exploiting Generative Models in Discriminative Classifiers. Advances in Neural Information Processing Systems, 487–493 (1998)


4. Fisher, R.A.: The Statistical Utilization of Multiple Measurements. Annals of Eugenics 8, 376–386 (1938)
5. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, London (1990)
6. Hinton, G.E., Zemel, R.S.: Autoencoders, Minimum Description Length and Helmholtz Free Energy. Advances in Neural Information Processing Systems 6, 3–10 (1994)
7. Hinton, G.E., Ghahramani, Z.: Generative Models for Discovering Sparse Distributed Representations. Philosophical Transactions of the Royal Society B 352, 1177–1190 (1997)
8. Lee, O., Park, H., Choi, S.: PCA vs. ICA for Face Recognition. In: The 2000 International Technical Conference on Circuits/Systems, Computers, and Communications, pp. 873–876 (2000)
9. Moghaddam, B., Jebara, T., Pentland, A.: Bayesian Modeling of Facial Similarity. Advances in Neural Information Processing Systems, 910–916 (1998)
10. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press, London (1979)
11. Martinez, A., Kak, A.: PCA versus LDA. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(2), 228–233 (2001)
12. Park, H., Cho, M.: Classification of Bio-data with Small Data Set Using Additive Factor Model and SVM. In: Hoffmann, A., Kang, B.-h., Richards, D., Tsumoto, S. (eds.) PKAW 2006. LNCS (LNAI), vol. 4303, pp. 770–779. Springer, Heidelberg (2006)
13. Chopra, S., Hadsell, R., LeCun, Y.: Learning a Similarity Metric Discriminatively, with Application to Face Verification. In: Proc. of the International Conference on Computer Vision and Pattern Recognition, pp. 539–546 (2005)
14. Ghahramani, Z.: Factorial Learning and the EM Algorithm. In: Advances in Neural Information Processing Systems, vol. 7, pp. 617–624 (1995)

Hierarchical Feature Extraction for Compact Representation and Classification of Datasets

Markus Schubert and Jens Kohlmorgen

Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
{markus,jek}@first.fraunhofer.de
http://ida.first.fraunhofer.de

Abstract. Feature extraction methods generally do not account for hierarchical structure in the data. For example, PCA and ICA provide transformations that depend solely on global properties of the overall dataset. Here we present a general approach for extracting feature hierarchies from datasets and for using them for classification or clustering. A hierarchy of features extracted from a dataset constitutes a compact representation of the set that, on the one hand, can be used to characterize and understand the data and, on the other hand, serves as a basis for classifying or clustering a collection of datasets. As a proof of concept, we demonstrate the feasibility of this approach with an application to mixtures of Gaussians with varying degrees of structuredness and to a clinical EEG recording.

1 Introduction

The vast majority of feature extraction methods do not account for hierarchical structure in the data. For example, PCA [1] and ICA [2] provide transformations that depend solely on global properties of the overall data set. The ability to model the hierarchical structure of the data, however, may well help to characterize and understand the information contained in it. For example, neural dynamics are often characterized by a hierarchical structure in space and time, where methods for hierarchical feature extraction might help to group and classify such data. A particular demand for these methods exists in EEG recordings, where slow dynamical components (sometimes interpreted as internal "state" changes) and the variability of features make data analysis difficult. Hierarchical feature extraction has so far mainly been related to 2-D pattern analysis. In these approaches, pioneered by Fukushima's work on the Neocognitron [3], the hierarchical structure is typically hard-wired a priori in the architecture, and the methods primarily apply to a 2-D grid structure. There are, however, more recent approaches, like local PCA [4] or tree-dependent component analysis [5], that are promising steps towards structured feature extraction methods that also derive the structure from the data. While local PCA in [4] is not hierarchical and tree-dependent component analysis in [5] is restricted to the context of ICA, we here present a general approach for the extraction of feature hierarchies and

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 556–565, 2008.
© Springer-Verlag Berlin Heidelberg 2008


their use for classification and clustering. We exemplify this by using PCA as the core feature extraction method. In [6] and [7], hierarchies of two-dimensional PCA projections (using probabilistic PCA [8]) were proposed for the purpose of visualizing high-dimensional data. For obtaining the hierarchies, the selection of sub-clusters was performed either manually [6] or automatically by using a model selection criterion (AIC, MDL) [7], but in both cases based on two-dimensional projections. A 2-D projection of high-dimensional data, however, is often not sufficient to unravel the structure of the data, which may thus hamper both approaches, in particular if the sub-clusters get superimposed in the projection. In contrast, our method is based on hierarchical clustering in the original data space, where the structural information is unchanged and therefore undiminished. Also, the focus of this paper is not on visualizing the data itself, which obviously is limited to 2-D or 3-D projections, but rather on extracting the hierarchical structure of the data (which can be visualized by plotting trees) and on replacing the data by a compact hierarchical representation in terms of a tree of extracted features, which can be used for classification and clustering. The individual quantity to be classified or clustered in this context is a tree of features representing a set of data points. Note that classifying sets of points is a more general problem than the well-known problem of classifying individual data points. Other approaches to classify sets of points can be found, e.g., in [9, 10], where the authors define a kernel on sets, which can then be used with standard kernel classifiers. The paper is organized as follows. In Section 2, we describe the hierarchical feature extraction method.
In section 3, we show how feature hierarchies can be used for classiﬁcation and clustering, and in section 4 we provide a proof of concept with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording. Section 5 concludes with a discussion.

2 Hierarchical Feature Extraction

We pursue a straightforward approach to hierarchical feature extraction that allows us to make any standard feature extraction method hierarchical: we perform hierarchical clustering of the data prior to feature extraction. The feature extraction method is then applied locally to each significant cluster in the hierarchy, resulting in a representation (or replacement) of the original dataset in terms of a tree of features.

2.1 Hierarchical Clustering

There are many known variants of hierarchical clustering algorithms (see, e.g., [11, 12]), which can be subdivided into divisive top-down procedures and agglomerative bottom-up procedures. More important than this procedural aspect, however, is the dissimilarity function that is used in most methods to quantify the dissimilarity between two clusters. This function is used as the criterion to


determine the clusters to be split (or merged) at each iteration of the top-down (or bottom-up) process. Thus, it is this function that determines the clustering result, and it implicitly encodes what a "good" cluster is. Common agglomerative procedures are single-linkage, complete-linkage, and average-linkage; they differ simply in that they use different dissimilarity functions [12]. We here use Ward's method [13], also called the minimum variance method, which is agglomerative and successively merges the pair of clusters that causes the smallest increase in the total sum of squared errors (SSE), where the error is defined as the Euclidean distance of a data point to its cluster mean. The increase in squared error caused by merging two clusters, D_i and D_j, is given by

d(D_i, D_j) = (n_i n_j)/(n_i + n_j) · ‖m_i − m_j‖²,  (1)

where n_i and n_j are the numbers of points in the two clusters, and m_i and m_j are the means of the points in each cluster [12]. Ward's method can now simply be described as a standard agglomerative clustering procedure [11, 12] with the particular dissimilarity function d given in Eq. (1). We use Ward's criterion because it is based on a global fitness criterion (SSE), and because [11] reports that the method outperformed other hierarchical clustering methods in several comparative studies. Nevertheless, depending on the particular application, other criteria might be useful as well. The result of a hierarchical clustering procedure that successively splits or merges two clusters is a binary tree. At each hierarchy level, k = 1, ..., n, it defines a partition of the given n samples into k clusters. The leaf node level consists of n nodes describing a partition into n clusters, where each cluster/node contains exactly one sample. Each hierarchy level further up contains one node with edges to the two child nodes that correspond to the clusters that have been merged.
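A naive agglomerative implementation of Ward's method, using the dissimilarity of Eq. (1), can be sketched as follows. It is O(n³) and meant only to illustrate the procedure (library routines such as SciPy's Ward linkage are far more efficient); the squared-norm form of Eq. (1) is an interpretation of the garbled printed formula.

```python
import numpy as np

def ward_dissimilarity(ni, mi, nj, mj):
    """Increase in total SSE when merging clusters of sizes/means (ni, mi), (nj, mj)."""
    diff = mi - mj
    return ni * nj / (ni + nj) * float(diff @ diff)   # Eq. (1)

def ward_clustering(X):
    """Agglomerative clustering; returns the merge history as a list of
    (cluster_id_a, cluster_id_b, delta) with delta as in Eq. (2)."""
    clusters = {i: (1, X[i].astype(float)) for i in range(len(X))}
    history, next_id = [], len(X)
    while len(clusters) > 1:
        ids, best = list(clusters), None
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                i, j = ids[a], ids[b]
                (ni, mi), (nj, mj) = clusters[i], clusters[j]
                d = ward_dissimilarity(ni, mi, nj, mj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (ni, mi), (nj, mj) = clusters.pop(i), clusters.pop(j)
        # the merged cluster keeps the pooled size and mean
        clusters[next_id] = (ni + nj, (ni * mi + nj * mj) / (ni + nj))
        history.append((i, j, d))
        next_id += 1
    return history
```

On two tight, well-separated pairs, the first merges have tiny deltas and the final merge has a large one, which is exactly the gap a dendrogram exposes.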
The tree can be depicted graphically as a dendrogram, which aligns the leaf nodes along the horizontal axis and connects them by lines to the higher-level nodes along the vertical axis. The position of the nodes along the vertical axis could in principle correspond linearly to the hierarchy level k. This, however, would reveal almost nothing of the structure in the data; most of the structural information is actually contained in the dissimilarity values. One therefore usually positions the node at level k vertically according to the dissimilarity value of its two corresponding child clusters, D_i and D_j,

δ(k) = d(D_i, D_j).  (2)

For k = n, there are no child clusters, and therefore δ(n) = 0 [11]. The function δ can be regarded as a within-cluster dissimilarity. When δ is used as the vertical scale in a dendrogram, a large gap between two levels, for example k and k + 1, means that two very dissimilar clusters have been merged at level k.

2.2 Extracting a Tree of Significant Clusters

As we have seen in the previous subsection, a hierarchical clustering algorithm always generates a tree containing n − 1 non-singleton clusters. This does not


necessarily mean that any of these clusters is clearly separated from the rest of the data, or that there is any structure in the data at all. The identification of clearly separated clusters is usually done by visual inspection of the dendrogram, i.e., by identifying large gaps. For an automatic detection of significant clusters, we use the following straightforward criterion:

δ(parent(k)) / δ(k) > α,  for 1 < k < n,  (3)

where parent(k) is the level of the parent cluster of the cluster obtained at level k, and α is a significance threshold. If a cluster at level k is merged into a cluster whose within-cluster dissimilarity is more than α times higher than that of cluster k, we call cluster k a significant cluster. This means that cluster k is significantly more compact than its merger (in the sense of the dissimilarity function). Note that this does not necessarily mean that the sibling of cluster k is also a significant cluster, as it might have a higher dissimilarity value than cluster k. The criterion directly corresponds to the relative increase of the dissimilarity value in a dendrogram from one merger level to the next. For small clusters that contain only a few points, the relative increase in dissimilarity can be large simply because of the small sample size. To prevent such clusters from being detected as significant, we require a minimum cluster size M for significant clusters. After having identified the significant clusters in the binary cluster tree, we can extract the tree of significant clusters simply by linking each significant cluster node to the next highest significant node in the tree, or, if there is none, to the root node (which is just for the convenience of getting a tree and not a forest). The tree of significant clusters is generally much smaller than the original tree, and it is not necessarily a binary tree anymore. Also note that there might be data points that are not in any significant cluster, e.g., outliers. The criterion in (3) is somewhat related to the criterion in [14], which is used to take clusters out of the merging process in order to obtain a plain, non-hierarchical clustering. The criterion in [14] accounts for the relative change of the absolute dissimilarity increments, which seems less intuitive and unnecessarily complicated; it might also be overly sensitive to small variations in the dissimilarities.

2.3 Obtaining a Tree of Features

To obtain a representation of the original dataset in terms of a tree of features, we can now apply any standard feature extraction method to the data points in each significant cluster in the tree and then replace the data points in the cluster by their corresponding features. For PCA, for example, the data points in each significant cluster are replaced by their mean vector and the desired number of principal components, i.e., the eigenvectors and eigenvalues of the covariance matrix of the data points. The obtained hierarchy of features thus constitutes a compact representation of the dataset that no longer contains the individual data points, which can save a considerable amount of memory. This representation is also independent of the size of the dataset. The hierarchy can, on the one hand, be used to analyze and understand the structure of the data; on the other hand, as we will further explain in the next section, it can be used to perform classification or clustering in cases where the individual input quantity to be classified (or clustered) is an entire dataset and not, as usual, a single data point.
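The two remaining steps, the significance test of Eq. (3) and the per-cluster PCA features, can be sketched as follows. For simplicity, this version works on a flat list of cluster dissimilarities with an explicit parent mapping using 0-based levels, a simplification of the full binary-tree bookkeeping rather than the paper's exact procedure.

```python
import numpy as np

def significant_levels(delta, alpha, parent=None):
    """delta[k]: within-cluster dissimilarity of the cluster created at level k
    (root last). parent[k]: level of the merger that absorbs cluster k;
    defaults to k + 1 (a chain), an assumption made for this sketch.
    Returns levels satisfying delta[parent(k)] / delta[k] > alpha (Eq. 3)."""
    sig = []
    for k in range(len(delta) - 1):
        p = parent[k] if parent is not None else k + 1
        if delta[k] > 0 and delta[p] / delta[k] > alpha:
            sig.append(k)
    return sig

def cluster_features(points, n_components):
    """Replace a significant cluster by its mean and leading principal
    components (eigenvalues and eigenvectors of the cluster covariance)."""
    P = np.asarray(points, float)
    mean = P.mean(axis=0)
    C = P - mean
    cov = C.T @ C / len(P)
    eigval, eigvec = np.linalg.eigh(cov)          # ascending order
    order = np.argsort(eigval)[::-1][:n_components]
    return mean, eigval[order], eigvec[:, order]
```

A minimum cluster size M, as required in the text, would simply be an extra condition before appending to `sig`.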

3 Classification of Feature Trees

The classification problem that we address here is not the well-known problem of classifying individual data points or vectors. Instead, it relates to the classification of objects that are sets of data points, for example, time series. Given a "training set" of such objects, i.e., a number of datasets, each one attached with a certain class label, the problem consists in assigning one class label to each new, unlabeled dataset. This can be accomplished by transforming each individual dataset into a tree of features and by defining a suitable distance function to compare each pair of trees. For example, trees of principal components can be regarded as (hierarchical) mixtures of Gaussians, since the principal components of each node in the tree (the eigenvectors and eigenvalues) describe a normal distribution, which is an approximation to the true distribution of the underlying data points in the corresponding significant cluster. Two mixtures (sums) of Gaussians, f and g, corresponding to two trees of principal components (of two datasets), can be compared, e.g., by using the squared L2-norm as distance function, also called the integrated squared error (ISE),

ISE(f, g) = ∫ (f − g)² dx.  (4)

The ISE has the advantage that the integral is analytically tractable for mixtures of Gaussians. Note that the computation of a tree of principal components, as described in the previous section, is in itself an interesting way to obtain a mixture-of-Gaussians representation of a dataset: without the need to specify the number of components in advance, and without the need to run a maximum likelihood (gradient ascent) algorithm such as expectation-maximization [15], which is prone to getting stuck in local optima. Having obtained a distance function on feature trees, the next step is to choose a classification method that only requires pairwise distances to classify the trees (and their corresponding datasets).
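Concretely, Eq. (4) is tractable because ∫ N(x; μ₁, Σ₁) N(x; μ₂, Σ₂) dx = N(μ₁ − μ₂; 0, Σ₁ + Σ₂). A sketch for mixtures given as (weight, mean, covariance) triples follows; the input format is an assumption, not the paper's data structure.

```python
import numpy as np

def _gauss_overlap(mu1, S1, mu2, S2):
    """Integral of the product of two Gaussian densities:
    equals the density N(mu1 - mu2; 0, S1 + S2)."""
    d = len(mu1)
    S = S1 + S2
    diff = mu1 - mu2
    quad = diff @ np.linalg.solve(S, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return np.exp(-0.5 * quad) / norm

def ise(mix_f, mix_g):
    """Integrated squared error (Eq. 4) between two Gaussian mixtures,
    each a list of (weight, mean, covariance) triples."""
    def cross(m1, m2):
        return sum(w1 * w2 * _gauss_overlap(mu1, S1, mu2, S2)
                   for (w1, mu1, S1) in m1 for (w2, mu2, S2) in m2)
    return cross(mix_f, mix_f) - 2 * cross(mix_f, mix_g) + cross(mix_g, mix_g)
```

Identical mixtures give ISE 0, and the value grows as the components move apart, which is what a tree-distance function needs.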
A particularly simple method is first-nearest-neighbor (1-NN) classification. For 1-NN classification, the tree of a test dataset is assigned the label of the nearest tree in a collection of trees generated from a labeled "training set" of datasets. If the generated trees are sufficiently different among the classes, first- (or k-) nearest-neighbor classification can already be sufficient to obtain a good classification result, as we demonstrate in the next section. In addition to classification, the distance function on feature trees can also be used to cluster a collection of datasets by clustering their corresponding trees. Any clustering algorithm that uses pairwise distances can be used for this purpose [11, 12]. In this way it is possible to identify homogeneous groups of datasets.
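Given any pairwise distance on trees (such as the ISE above), 1-NN classification needs only a few lines; a generic sketch:

```python
def nn_classify(dist, train_items, train_labels, test_items):
    """1-NN classification using only a pairwise distance function `dist`
    (e.g., the ISE between two feature trees)."""
    preds = []
    for x in test_items:
        d = [dist(x, t) for t in train_items]
        preds.append(train_labels[min(range(len(d)), key=d.__getitem__)])
    return preds
```

Because only `dist` is problem-specific, the same routine works for trees, mixtures, or any other set representation.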

4 Applications

4.1 Mixtures of Gaussians

As a proof of concept, we demonstrate the feasibility of the approach with an application to mixtures of Gaussians with varying degrees of structuredness. From three classes of Gaussian mixture distributions, shown exemplarily in Fig. 1(a)-(c), we generated 10 training samples per class, which constitute the training set, and a total of 100 test samples constituting the test set. Each sample contains 540 data points. The mixture distribution of each test sample was chosen with equal probability from one of the three classes. Next, we generated the binary cluster tree from each sample using Ward's criterion. Examples of the corresponding dendrograms for each class are shown in Fig. 1(d)-(f) (in gray). We then determined the significant clusters in each tree, using the significance factor α = 3 and the minimum cluster size M = 40. In Fig. 1(d)-(f), the significant clusters are depicted as black dots, and the extracted trees of significant clusters are shown by means of thick black lines. The cluster of each node in a tree of significant clusters was then replaced by the principal components obtained from the data in the cluster, which turns the tree of clusters into a tree of features. In Fig. 1(g)-(i), the PCA components of all significant clusters are shown for the three example datasets from Fig. 1(a)-(c). Finally, we classified the feature trees obtained from the test samples, using the integrated squared error (Eq. (4)) and first-nearest-neighbor classification. We obtained a nearly perfect accuracy of 98% correct classifications (i.e., only two misclassifications), which can largely be attributed to the circumstance that the structural differences between the classes were correctly exposed in the tree structures. This result demonstrates that an appropriate representation of the data can make the classification problem very simple.

4.2 Clinical EEG

To demonstrate the applicability of our approach to real-world data, we used a clinical recording of human EEG. The recording was carried out in order to screen for pathological features, in particular a disposition to epilepsy. The subject went through a number of experimental conditions: eyes open (EO), eyes closed (EC), hyperventilation (HV), post-hyperventilation (PHV), and, finally, stimulation with stroboscopic light of increasing frequency (PO: photic on).


Fig. 1. (a)-(c) Example datasets for the three types of mixture distributions used in the application. (d)-(f) The corresponding dendrograms for each example dataset (gray) and the extracted trees of signiﬁcant clusters (black). Note that the extracted tree structure exactly corresponds to the structure in the data. (g)-(i) The PCA components of all signiﬁcant clusters. The components are contained in the tree of features.

During the photic phase, the subject kept the eyes closed, while the rate of light ﬂashes was increased every four seconds in steps of 1 Hz, from 5 Hz to 25 Hz. The obtained recording was subdivided into 507 epochs of ﬁxed length (1s). For each epoch, we extracted four features that correspond to the power in


[Figure 2: dendrogram with dissimilarity on the vertical axis; dominant condition per sub-cluster: 82% (EC), 69% (PHV), 92% (EO), 88% (PO), 76% (HV), 90% (HV)]

Fig. 2. The tree of signiﬁcant clusters (black), obtained from the underlying dendrogram (gray) for the EEG data. The data in each signiﬁcant sub-cluster largely corresponds to one of the experimental conditions (indicated in %): eyes open (EO), eyes closed (EC), hyperventilation (HV), post-hyperventilation (PHV), and ‘photic on’ (PO).

specific frequency bands of particular EEG electrodes.¹ The resulting set of four-dimensional feature vectors was then analyzed by our method. For the hierarchical clustering, we used Ward's method and found the significant clusters depicted in Fig. 2. The extracted tree of significant clusters consists of a two-level hierarchy. As expected, the majority of feature vectors in each sub-cluster corresponds to one of the experimental conditions. By applying PCA to each sub-cluster and replacing the data of each node with its principal components, we obtain a tree of features, which constitutes a compact representation of the original dataset. It can then be used for comparison with trees that arise from normal or various kinds of pathological EEG, as outlined in Section 3.
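Band-power features of the kind described here (and detailed in the footnote) can be computed from a single epoch with a simple periodogram. The following sketch assumes a sampling rate of 256 Hz and a synthetic test signal; it is an illustration, not the clinical pipeline.

```python
import numpy as np

def band_power(x, fs, f_lo, f_hi):
    """Power of signal x in the band [f_lo, f_hi] Hz via the periodogram
    (a simple stand-in for the per-epoch feature extraction)."""
    X = np.fft.rfft(x - np.mean(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = (np.abs(X) ** 2) / (len(x) * fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return psd[band].sum() * (freqs[1] - freqs[0])

# e.g. the alpha-band power of a 1 s epoch sampled at 256 Hz (assumed rate):
fs = 256
t = np.arange(fs) / fs
epoch = np.sin(2 * np.pi * 10 * t)   # a pure 10 Hz oscillation
alpha = band_power(epoch, fs, 8, 12)
```

For a 10 Hz oscillation, virtually all power lands in the 8-12 Hz band and almost none in the 25-80 Hz band.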

5 Discussion

We proposed a general approach for the extraction of feature hierarchies from datasets and their use for classiﬁcation or clustering. The feasibility of this approach was demonstrated with an application to mixtures of Gaussians with 1

[Footnote 1] In detail: (I.) the power of the α-band (8–12 Hz) at the electrode positions O1 and O2 (according to the international 10–20 system), (II.) the power of 5 Hz and its harmonics (except 50 Hz) at electrode F4, (III.) the power of 6 Hz and its harmonics at electrode F8, and (IV.) the power of the 25–80 Hz band at F7.

564

M. Schubert and J. Kohlmorgen

varying degrees of structure and to a clinical EEG recording. In this paper we focused on PCA as the core feature extraction method. Other types of feature extraction, e.g., ICA, are also conceivable; these should then be complemented with an appropriate distance function on the feature trees (if used for classification or clustering). The basis of the proposed approach is hierarchical clustering, so the quality of the resulting feature hierarchies depends on the quality of the clustering. Ward's criterion tends to find compact, hyperspherical clusters, which may not always be the optimal choice for a given problem. Therefore, one should consider adjusting the clustering criterion to the problem at hand. Our future work will focus on the application of this method to classify normal and pathological EEG. By comparing the different tree structures, we hope to gain a better understanding of the pathological cases.

Acknowledgements. This work was funded by the German BMBF under grant 01GQ0415 and supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

References
[1] Jolliffe, I.: Principal Component Analysis. Springer, New York (1986)
[2] Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, Chichester (2001)
[3] Fukushima, K.: Neural network model for a mechanism of pattern recognition unaffected by shift in position — neocognitron. Transactions IECE 62-A(10), 658–665 (1979)
[4] Bregler, C., Omohundro, S.: Surface learning with applications to lipreading. In: Cowan, J., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems, vol. 6, pp. 43–50. Morgan Kaufmann Publishers, San Mateo (1994)
[5] Bach, F., Jordan, M.: Beyond independent components: Trees and clusters. Journal of Machine Learning Research 4, 1205–1233 (2003)
[6] Bishop, C., Tipping, M.: A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 281–293 (1998)
[7] Wang, Y., Luo, L., Freedman, M., Kung, S.: Probabilistic principal component subspaces: A hierarchical finite mixture model for data visualization. IEEE Transactions on Neural Networks 11(3), 625–636 (2000)
[8] Tipping, M., Bishop, C.: Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B 61(3), 611–622 (1999)
[9] Kondor, R., Jebara, T.: A kernel between sets of vectors. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the ICML, pp. 361–368. AAAI Press, Menlo Park (2003)
[10] Desobry, F., Davy, M., Fitzgerald, W.: A class of kernels for sets of vectors. In: Proceedings of the ESANN, pp. 461–466 (2005)
[11] Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice Hall, Inc., Englewood Cliffs (1988)
[12] Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley-Interscience, Chichester (2000)


[13] Ward, J.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58, 236–244 (1963)
[14] Fred, A., Leitao, J.: Clustering under a hypothesis of smooth dissimilarity increments. In: Proceedings of the ICPR, vol. 2, pp. 190–194 (2000)
[15] Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39, 1–38 (1977)

Principal Component Analysis for Sparse High-Dimensional Data

Tapani Raiko, Alexander Ilin, and Juha Karhunen

Adaptive Informatics Research Center, Helsinki Univ. of Technology
P.O. Box 5400, FI-02015 TKK, Finland
{Tapani.Raiko,Alexander.Ilin,Juha.Karhunen}@tkk.fi
http://www.cis.hut.fi/projects/bayes/

Abstract. Principal component analysis (PCA) is a widely used technique for data analysis and dimensionality reduction. Eigenvalue decomposition is the standard algorithm for solving PCA, but a number of other algorithms have been proposed. For instance, the EM algorithm is much more efficient in the case of high dimensionality and a small number of principal components. We study the case where the data are high-dimensional and a majority of the values are missing. Both of these algorithms turn out to be inadequate in this case. We propose using a gradient descent algorithm inspired by Oja's rule, and speeding it up by an approximate Newton's method. The computational complexity of the proposed method is linear with respect to the number of observed values in the data and to the number of principal components. In experiments with the Netflix data, the proposed algorithm is about ten times faster than any of the four comparison methods.

1 Introduction

Principal component analysis (PCA) [1,2,3,4,5,6] is a classic technique in data analysis. It can be used for compressing higher dimensional data sets to lower dimensional ones for data analysis, visualization, feature extraction, or data compression. PCA can be derived from a number of starting points and optimization criteria [2,3,4]. The most important of these are minimization of the mean-square error in data compression, finding mutually orthogonal directions in the data having maximal variances, and decorrelation of the data using orthogonal transformations [5]. While standard PCA is a very well-established linear statistical technique based on second-order statistics (covariances), it has recently been extended into various directions and considered from novel viewpoints. For example, various adaptive algorithms for PCA have been considered and reviewed in [4,6]. Fairly recently, PCA was shown to emerge as a maximum likelihood solution from a probabilistic latent variable model independently by several authors; see [3] for a discussion and references. In this paper, we study PCA in the case where most of the data values are missing (or unknown). Common algorithms for solving PCA prove to be

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 566–575, 2008.
© Springer-Verlag Berlin Heidelberg 2008


inadequate in this case, and we thus propose a new algorithm. The problem of overﬁtting and possible solutions are also outlined.

2 Algorithms for Principal Component Analysis

Principal subspace and components. Assume that we have n d-dimensional data vectors x1 , x2 , . . . , xn , which form the d×n data matrix X=[x1 , x2 , . . . , xn ]. The matrix X is decomposed into X ≈ AS,

(1)

where A is a d × c matrix, S is a c × n matrix and c ≤ d ≤ n. Principal subspace methods [6,4] find A and S such that the reconstruction error

$$C = \|X - AS\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{n} \Bigl( x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj} \Bigr)^2 \qquad (2)$$

is minimized. Here $\|\cdot\|_F$ denotes the Frobenius norm, and $x_{ij}$, $a_{ik}$, and $s_{kj}$ denote the elements of the matrices X, A, and S, respectively. Typically the row-wise mean is removed from X as a preprocessing step. Without any further constraints, there exist infinitely many ways to perform such a decomposition. However, the subspace spanned by the column vectors of the matrix A, called the principal subspace, is unique. In PCA, these vectors are mutually orthogonal and have unit length. Further, for each k = 1, ..., c, the first k vectors form the k-dimensional principal subspace. This makes the solution practically unique; see [4,2,5] for details. There are many ways to determine the principal subspace and components [6,4,2]. We will discuss three common methods that can be adapted for the case of missing values.

Singular Value Decomposition. PCA can be determined by using the singular value decomposition (SVD) [5]

$$X = U \Sigma V^T, \qquad (3)$$

where U is a d × d orthogonal matrix, V is an n × n orthogonal matrix, and Σ is a d × n pseudodiagonal matrix (diagonal if d = n) with the singular values on the main diagonal [5]. The PCA solution is obtained by selecting the c largest singular values from Σ, forming A from the corresponding c columns of U, and S from the corresponding c rows of ΣVᵀ. Note that PCA can equivalently be defined using the eigendecomposition of the d × d covariance matrix C of the column vectors of the data matrix X:

$$C = \frac{1}{n} X X^T = U D U^T. \qquad (4)$$

Here, the diagonal matrix D contains the eigenvalues of C, and the columns of the matrix U contain the unit-length eigenvectors of C in the same order

568

T. Raiko, A. Ilin, and J. Karhunen

[6,4,2,5]. Again, the columns of U corresponding to the largest eigenvalues are taken as A, and S is computed as AᵀX. This approach can be more efficient for cases where d ≪ n, since it avoids computing the n × n matrix.

EM Algorithm. The EM algorithm for solving PCA [7] iterates updating A and S alternately.1 When either of these matrices is fixed, the other one can be obtained from an ordinary least-squares problem. The algorithm alternates between the updates

$$S \leftarrow (A^T A)^{-1} A^T X, \qquad A \leftarrow X S^T (S S^T)^{-1}. \qquad (5)$$

This iteration is especially efficient when only a few principal components are needed, that is, when c ≪ d [7].

Subspace Learning Algorithm. It is also possible to minimize the reconstruction error (2) by any optimization algorithm. Applying the gradient descent algorithm yields the rules for simultaneous updates

$$A \leftarrow A + \gamma (X - AS) S^T, \qquad S \leftarrow S + \gamma A^T (X - AS), \qquad (6)$$

where γ > 0 is called the learning rate. The Oja-Karhunen learning algorithm [8,9,6,4] is an online learning method that uses the EM formula for computing S and the gradient for updating A, one data vector at a time. A possible speed-up to the subspace learning algorithm is to use the natural gradient [10] for the space of matrices. This yields the update rules

$$A \leftarrow A + \gamma (X - AS) S^T A^T A, \qquad S \leftarrow S + \gamma S S^T A^T (X - AS). \qquad (7)$$

If needed, the end result of subspace analysis can be transformed into the PCA solution, for instance, by computing the eigendecomposition $S S^T = U_S D_S U_S^T$ and the singular value decomposition $A U_S D_S^{1/2} = U_A \Sigma_A V_A^T$. The transformed A is formed from the first c columns of $U_A$ and the transformed S from the first c rows of $\Sigma_A V_A^T D_S^{-1/2} U_S^T S$. Note that the required decompositions are computationally lighter than the ones applied to the data matrix directly.
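A minimal numpy sketch of the simultaneous gradient updates (6) on a toy complete-data matrix; the sizes and the learning rate are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, c = 8, 100, 3
X = rng.normal(size=(d, c)) @ rng.normal(size=(c, n))
X -= X.mean(axis=1, keepdims=True)

A = 0.1 * rng.normal(size=(d, c))
S = 0.1 * rng.normal(size=(c, n))
gamma = 1e-3  # learning rate; the value is illustrative only

def cost(A, S):
    # Squared Frobenius reconstruction error, Eq. (2).
    return np.sum((X - A @ S) ** 2)

c0 = cost(A, S)
for _ in range(500):
    E = X - A @ S  # residual
    # Simultaneous updates of Eq. (6); both right-hand sides use the old A and S.
    A, S = A + gamma * E @ S.T, S + gamma * A.T @ E
```

With a suitably small γ the cost decreases steadily; the tuple assignment ensures both factors are updated from the same old values, matching the simultaneous form of Eq. (6).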

3 Principal Component Analysis with Missing Values

Let us consider the same problem when the data matrix has missing entries.2 In the following there are N = 9 observed values and 6 missing values, marked with a question mark (?):

$$X = \begin{pmatrix} -1 & +1 & 0 & 0 & ? \\ -1 & +1 & ? & ? & 0 \\ ? & ? & -1 & +1 & ? \end{pmatrix}. \qquad (8)$$

[Footnote 1] The procedure studied in [7] can be seen as the zero-noise limit of the EM algorithm for a probabilistic PCA model.
[Footnote 2] We make the typical assumption that values are missing at random, that is, the missingness does not depend on the unobserved data. An example where the assumption does not hold is when out-of-scale measurements are marked missing.


We would like to find A and S such that X ≈ AS for the observed data values. The rest of the product AS represents the reconstruction of missing values.

Adapting SVD. One can use the SVD approach (4) in order to find an approximate solution to the PCA problem. However, estimating the covariance matrix C becomes very difficult when there are many missing values. If we estimate C leaving the terms with missing values out of the average, we get the estimate

$$C = \frac{1}{n} X X^T = \begin{pmatrix} 0.5 & 1 & 0 \\ 1 & 0.667 & ? \\ 0 & ? & 1 \end{pmatrix}. \qquad (9)$$

There are at least two problems. First, the estimated covariance 1 between the first and second components is larger than their estimated variances 0.5 and 0.667. This is clearly wrong, and leads to a situation where the covariance matrix is not positive (semi)definite and some of its eigenvalues are negative. Second, the covariance between the second and the third component could not be estimated at all.3 Both problems appeared in practice with the data set considered in Section 5.

Another option is to complete the data matrix by iteratively imputing the missing values (see, e.g., [2]). Initially, the missing values can be replaced by zeroes. The covariance matrix of the complete data can then be estimated without the problems mentioned above. Next, the product AS can be used as a better estimate for the missing values, and this process can be iterated until convergence. This approach requires the use of the complete data matrix, and is therefore computationally very expensive if a large part of the data matrix is missing. The time complexity of computing the sample covariance matrix explicitly is O(nd²). We will refer to this approach as the imputation algorithm. Note that after convergence, the missing values do not contribute to the reconstruction error (2). This means that the imputation algorithm leads to the solution which minimizes the reconstruction error of the observed values only.

Adapting the EM Algorithm. Grung and Manne [11] studied the EM algorithm in the case of missing values. Experiments showed a faster convergence compared to the iterative imputation algorithm. The computational complexity is O(Nc² + nc³) per iteration, where N is the number of observed values, assuming naïve matrix multiplications and inversions but exploiting sparsity. This is considerably heavier than EM with complete data, whose complexity is O(ndc) per iteration [7].

Adapting the Subspace Learning Algorithm. The subspace learning algorithm works in a straightforward manner also in the presence of missing values.

[Footnote 3] It could be filled in by finding the value that maximizes the determinant of the covariance matrix (and thus the entropy of the underlying Gaussian distribution).
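The imputation algorithm can be sketched as follows. The toy sizes and the 30% missing rate are our own choices; at Netflix scale, each SVD over the completed matrix is exactly the expensive step the text warns about.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, c = 6, 40, 2
X_true = rng.normal(size=(d, c)) @ rng.normal(size=(c, n))  # exactly rank-c
mask = rng.random((d, n)) < 0.7  # True where a value is observed

X = np.where(mask, X_true, 0.0)  # step 1: impute zeros for missing entries
for _ in range(100):
    # Rank-c PCA reconstruction of the completed matrix via SVD.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    recon = (U[:, :c] * s[:c]) @ Vt[:c]
    X = np.where(mask, X_true, recon)  # step 2: re-impute and repeat

rms_missing = np.sqrt(np.mean((recon - X_true)[~mask] ** 2))
```

On this well-posed toy problem the iteration recovers the missing entries well; the point of the paper is that each iteration touches the full dense matrix, which does not scale.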


We just take the sum over only those indices i and j for which the data entry $x_{ij}$ (the ijth element of X) is observed, in short (i, j) ∈ O. The cost function is

$$C = \sum_{(i,j) \in O} e_{ij}^2, \quad \text{with} \quad e_{ij} = x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj}, \qquad (10)$$

and its partial derivatives are

$$\frac{\partial C}{\partial a_{il}} = -2 \sum_{j \mid (i,j) \in O} e_{ij} s_{lj}, \qquad \frac{\partial C}{\partial s_{lj}} = -2 \sum_{i \mid (i,j) \in O} e_{ij} a_{il}. \qquad (11)$$

The update rules for gradient descent are

$$A \leftarrow A - \gamma \frac{\partial C}{\partial A}, \qquad S \leftarrow S - \gamma \frac{\partial C}{\partial S}, \qquad (12)$$

and the update rules for natural gradient descent are

$$A \leftarrow A - \gamma \frac{\partial C}{\partial A} A^T A, \qquad S \leftarrow S - \gamma S S^T \frac{\partial C}{\partial S}. \qquad (13)$$

We propose a novel speed-up to the original simple gradient descent algorithm. In Newton's method for optimization, the gradient is multiplied by the inverse of the Hessian matrix. Newton's method is known to converge fast especially in the vicinity of the optimum, but using the full Hessian is computationally too demanding in truly high-dimensional problems. Here we use only the diagonal part of the Hessian matrix. We also include a control parameter α that allows the learning algorithm to interpolate between the standard gradient descent (α = 0) and the diagonal Newton's method (α = 1), much like the Levenberg-Marquardt algorithm. The learning rules then take the form

$$a_{il} \leftarrow a_{il} - \gamma \left( \frac{\partial^2 C}{\partial a_{il}^2} \right)^{-\alpha} \frac{\partial C}{\partial a_{il}} = a_{il} + \gamma \, \frac{\sum_{j \mid (i,j) \in O} e_{ij} s_{lj}}{\bigl( \sum_{j \mid (i,j) \in O} s_{lj}^2 \bigr)^{\alpha}}, \qquad (14)$$

$$s_{lj} \leftarrow s_{lj} - \gamma \left( \frac{\partial^2 C}{\partial s_{lj}^2} \right)^{-\alpha} \frac{\partial C}{\partial s_{lj}} = s_{lj} + \gamma \, \frac{\sum_{i \mid (i,j) \in O} e_{ij} a_{il}}{\bigl( \sum_{i \mid (i,j) \in O} a_{il}^2 \bigr)^{\alpha}}. \qquad (15)$$

The computational complexity is O(N c + nc) per iteration.
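The proposed update rules (14)-(15), together with the learning-rate adaptation used later in the experiments, can be sketched with a dense observation mask. A real implementation would use sparse data structures to obtain the stated O(Nc + nc) cost; all sizes here are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, c = 8, 60, 2
alpha = 0.625  # 0 = plain gradient descent, 1 = diagonal Newton's method
X = rng.normal(size=(d, c)) @ rng.normal(size=(c, n))
M = (rng.random((d, n)) < 0.5).astype(float)  # 1 where (i, j) is observed

A = 0.1 * rng.normal(size=(d, c))
S = 0.1 * rng.normal(size=(c, n))
gamma = 0.1

def cost(A, S):
    # Eq. (10): squared error over the observed entries only.
    return np.sum(M * (X - A @ S) ** 2)

c0 = prev = cost(A, S)
for _ in range(300):
    E = M * (X - A @ S)  # e_ij on observed entries, zero elsewhere
    # Eq. (14): element-wise step sum_j e_ij s_lj / (sum_j s_lj^2)^alpha.
    A_new = A + gamma * (E @ S.T) / (M @ (S ** 2).T + 1e-9) ** alpha
    # Eq. (15): the analogous element-wise step for S, using the old A.
    S_new = S + gamma * (A.T @ E) / ((A ** 2).T @ M + 1e-9) ** alpha
    cur = cost(A_new, S_new)
    if cur < prev:  # learning-rate adaptation described in Section 5
        A, S, prev, gamma = A_new, S_new, cur, gamma * 1.1
    else:
        gamma /= 2.0  # cancel the update and shrink the step
```

The small additive constant in the denominators guards against divisions by zero when a row or column has no observed entries; it is our own safeguard, not part of the paper's formulas.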

4 Overfitting

A trained PCA model can be used for reconstructing missing values:

$$\hat{x}_{ij} = \sum_{k=1}^{c} a_{ik} s_{kj}, \qquad (i,j) \notin O. \qquad (16)$$

Although PCA performs a linear transformation of data, overﬁtting is a serious problem for large-scale problems with lots of missing values. This happens when the value of the cost function C in Eq. (10) is small for training data, but the quality of prediction (16) is poor for new data. For further details, see [12].
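A toy demonstration of this effect (our own construction, with made-up sizes): fitting an over-flexible model to sparsely observed noise drives the observed-entry error down while the error on held-out entries stays large.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, c = 10, 15, 5  # c is deliberately generous for this little data
X = rng.normal(size=(d, n))  # pure noise: there is nothing to generalize
M = rng.random((d, n)) < 0.4  # only ~40% of the entries are observed

# Fit A and S to the observed entries by alternating least squares.
A = rng.normal(size=(d, c))
S = rng.normal(size=(c, n))
for _ in range(100):
    for j in range(n):  # refit each column of S from its observed rows
        o = M[:, j]
        if o.any():
            S[:, j] = np.linalg.lstsq(A[o], X[o, j], rcond=None)[0]
    for i in range(d):  # refit each row of A from its observed columns
        o = M[i]
        if o.any():
            A[i] = np.linalg.lstsq(S[:, o].T, X[i, o], rcond=None)[0]

R = A @ S
rms_train = np.sqrt(np.mean((R - X)[M] ** 2))   # small: Eq. (10) is minimized
rms_test = np.sqrt(np.mean((R - X)[~M] ** 2))   # large: predictions (16) are poor
```

Here there are more free parameters than observed values, so the training error can be driven near zero even though the "predictions" are meaningless, which is precisely the failure mode the regularization below addresses.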


Regularization. A popular way to regularize ill-posed problems is to penalize large parameter values by adding a suitable penalty term to the cost function; see for example [3]. In our case, one can modify the cost function in Eq. (2) as follows:

$$C_{\lambda} = \sum_{(i,j) \in O} e_{ij}^2 + \lambda \bigl( \|A\|_F^2 + \|S\|_F^2 \bigr). \qquad (17)$$

This has the eﬀect that the parameters that do not have signiﬁcant evidence will decay towards zero. A more general penalization would use diﬀerent regularization parameters λ for diﬀerent parts of A and S. For example, one can use a λk parameter of its own for each of the column vectors ak of A and the row vectors sk of S. Note that since the columns of A can be scaled arbitrarily by rescaling the rows of S accordingly, one can ﬁx the regularization term for ak , for instance, to unity. An equivalent optimization problem can be obtained using a probabilistic formulation with (independent) Gaussian priors and a Gaussian noise model:

$$p(x_{ij} \mid A, S) = \mathcal{N}\Bigl( x_{ij};\; \sum_{k=1}^{c} a_{ik} s_{kj},\; v_x \Bigr), \qquad (18)$$

$$p(a_{ik}) = \mathcal{N}(a_{ik}; 0, 1), \qquad p(s_{kj}) = \mathcal{N}(s_{kj}; 0, v_{sk}), \qquad (19)$$

where $\mathcal{N}(x; m, v)$ denotes the random variable x having a Gaussian distribution with mean m and variance v. The regularization parameter λk = vsk/vx is the ratio of the prior variances vsk and vx. Then, the cost function (ignoring constants) is minus the logarithm of the posterior for A and S:

$$C_{\mathrm{BR}} = \sum_{(i,j) \in O} \Bigl( \frac{e_{ij}^2}{v_x} + \ln v_x \Bigr) + \sum_{i=1}^{d} \sum_{k=1}^{c} a_{ik}^2 + \sum_{k=1}^{c} \sum_{j=1}^{n} \Bigl( \frac{s_{kj}^2}{v_{sk}} + \ln v_{sk} \Bigr). \qquad (20)$$

An attractive property of the Bayesian formulation is that it provides a natural way to choose the regularization constants. This can be done using the evidence framework (see, e.g., [3]) or simply by minimizing CBR, setting vx and vsk to the means of e²ij and s²kj, respectively. We use the latter approach and refer to it as regularized PCA. Note that in the case of joint optimization of CBR w.r.t. aik, skj, vsk, and vx, the cost function (20) has a trivial minimum at skj = 0, vsk → 0. We try to avoid this minimum by initializing with an orthogonalized solution provided by unregularized PCA from the learning rules (14) and (15). Note also that setting vsk to a small value for some component k is equivalent to removing that irrelevant component from the model. This allows automatic determination of the proper dimensionality c instead of discrete model comparison (see, e.g., [13]), and justifies using a separate vsk for each component in the model (19).

Variational Bayesian Learning. Variational Bayesian (VB) learning provides even stronger tools against overfitting. The VB version of PCA by [13] approximates


the joint posterior of the unknown quantities using a simple multivariate distribution. Each model parameter is described a posteriori by an independent Gaussian distribution. The means can then be used as point estimates of the parameters, while the variances give at least a crude estimate of the reliability of these point estimates. The method in [13] does not extend easily to missing values, but the subspace learning algorithm (Section 3) can be extended to VB. The derivation is somewhat lengthy and is omitted here, together with the variational Bayesian learning rules, because of space limitations; see [12] for details. The computational complexity of this method is still O(Nc + nc) per iteration, but the VB version is in practice about 2–3 times slower than the original subspace learning algorithm.

5 Experiments

Collaborative filtering is the task of predicting preferences (or producing personal recommendations) by using other people's preferences. The Netflix problem [14] is such a task. It consists of movie ratings given by n = 480189 customers to d = 17770 movies. There are N = 100480507 ratings from 1 to 5 given, of which 1408395 ratings are reserved for validation (or probing); 98.8% of the entries are thus missing. We tried to find c = 15 principal components from the data using a number of methods.4 We subtracted the mean rating for each movie, assuming 22 extra ratings of 3 for each movie as a Dirichlet prior.

Computational Performance. In the first set of experiments we compared the computational performance of different algorithms on PCA with missing values. The root mean square (rms) error is measured on the training data, $E_O = \sqrt{ \tfrac{1}{|O|} \sum_{(i,j) \in O} e_{ij}^2 }$. All experiments were run on a dual-CPU AMD Opteron SE 2220 using Matlab.

First, we tested the imputation algorithm. The first iteration, where the missing values are replaced with zeros, was completed in 17 minutes and led to EO = 0.8527. This iteration was still tolerably fast because the complete data matrix was sparse. After that, it takes about 30 hours per iteration. After three iterations, EO was still 0.8513.

Using the EM algorithm by [11], the E-step (updating S) takes 7 hours and the M-step (updating A) takes 18 hours. (There is some room for optimization, since we used a straightforward Matlab implementation.) Each iteration gives a much larger improvement than an iteration of the imputation algorithm, but starting from a random initialization, EM could not reach a good solution in reasonable time.

We also tested the subspace learning algorithm described in Section 3, with and without the proposed speed-up. Each run of the algorithm with a different value of the speed-up parameter α was initialized at the same starting point (generated randomly from a normal distribution). The learning rate γ was adapted such that

[Footnote 4] The PCA approach has been considered by other Netflix contestants as well (see, e.g., [15,16]).


Fig. 1. Left: Learning curves for unregularized PCA (Section 3) applied to the Netﬂix data: Root mean-square error on the training data EO is plotted against computation time in hours. Right: The root mean square error on the validation data EV from the Netﬂix problem during runs of several algorithms: basic PCA (Section 3), regularized PCA (Section 4) and VB (Section 4). VB1 has some parameters ﬁxed (see [12]) while VB2 updates all the parameters. The time scales are linear below 1 and logarithmic above 1.

if an update decreased the cost function, γ was multiplied by 1.1. Each time an update would have increased the cost, the update was canceled and γ was divided by 2. Figure 1 (left) shows the learning curves for basic gradient descent, natural gradient descent, and the proposed speed-up with the best found parameter value α = 0.625. The proposed speed-up gave about a tenfold speed-up over the gradient descent algorithm, even though each iteration took longer. Natural gradient was slower than the basic gradient. Table 1 gives a summary of the computational complexities.

Overfitting. We compared PCA (Section 3), regularized PCA (Section 4) and VB-PCA (Section 4) by computing the rms reconstruction error for the validation set V, that is, testing how the models generalize to new data: $E_V = \sqrt{ \tfrac{1}{|V|} \sum_{(i,j) \in V} e_{ij}^2 }$. We tested VB-PCA by first fixing some of the parameter values (this run is marked as VB1 in Fig. 1; see [12] for details) and second by

Table 1. Summary of the computational performance of different methods on the Netflix problem. Computational complexities (per iteration) assume naïve computation of products and inverses of matrices and ignore the computation of the SVD in the imputation algorithm. While the proposed speed-up makes each iteration slower than the basic gradient update, the time to reach the error level 0.85 is greatly diminished.

Method         | Complexity     | Seconds/Iter | Hours to EO = 0.85
Gradient       | O(Nc + nc)     | 58           | 1.9
Speed-up       | O(Nc + nc)     | 110          | 0.22
Natural Grad.  | O(Nc + nc²)    | 75           | 3.5
Imputation     | O(nd²)         | 110000       | 64
EM             | O(Nc² + nc³)   | 45000        | 58


adapting them (marked as VB2). We initialized regularized PCA and VB1 using normal PCA learned with α = 0.625 and orthogonalized A, and VB2 using VB1. The parameter α was set to 2/3. Fig. 1 (right) shows the results. The performance of basic PCA starts to degrade during learning, especially with the proposed speed-up. Natural gradient diminishes this phenomenon, known as overlearning, but it is even more effective to use regularization. The best results were obtained using VB2: the final validation error EV was 0.9180 and the training rms error EO was 0.7826, which is naturally larger than the unregularized EO = 0.7657.

6 Discussion

We studied a number of different methods for PCA with sparse data, and it turned out that a simple gradient descent approach worked best due to its minimal computational complexity per iteration. We could also speed it up more than ten times by using an approximated Newton's method. We found empirically that setting the parameter α = 2/3 seems to work well for our problem. It is left for future work to find out whether this generalizes to other problem settings.

There are also many other ways to speed up the gradient descent algorithm. The natural gradient did not help here, but we expect that the conjugate gradient method would. The modification to the gradient proposed in this paper could be used together with the conjugate gradient speed-up. This will be another future research topic.

There are also other benefits in solving the PCA problem by gradient descent. Algorithms that minimize an explicit cost function are rather easy to extend. The case of variational Bayesian learning applied to PCA was considered in Section 4, but there are many other extensions of PCA, such as using non-Gaussianity, nonlinearity, mixture models, and dynamics.

The developed algorithms can prove useful in many applications, such as bioinformatics, speech processing, and meteorology, in which large-scale datasets with missing values are very common. The required computational burden is linearly proportional to the number of measured values. Note also that the proposed techniques provide an analogue of confidence regions showing the reliability of the estimated quantities.

Acknowledgments. This work was supported in part by the Academy of Finland under its Centers for Excellence in Research Program, and the IST Program of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views. We would like to thank Antti Honkela for useful comments.

References
1. Pearson, K.: On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2(6), 559–572 (1901)
2. Jolliffe, I.: Principal Component Analysis. Springer, Heidelberg (1986)
3. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)


4. Diamantaras, K., Kung, S.: Principal Component Neural Networks - Theory and Application. Wiley, Chichester (1996)
5. Haykin, S.: Modern Filters. Macmillan, Basingstoke (1989)
6. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing - Learning Algorithms and Applications. Wiley, Chichester (2002)
7. Roweis, S.: EM algorithms for PCA and SPCA. In: Advances in Neural Information Processing Systems, vol. 10, pp. 626–632. MIT Press, Cambridge (1998)
8. Karhunen, J., Oja, E.: New methods for stochastic approximation of truncated Karhunen-Loeve expansions. In: Proceedings of the 6th International Conference on Pattern Recognition, pp. 550–553. Springer, Heidelberg (1982)
9. Oja, E.: Subspace Methods of Pattern Recognition. Research Studies Press and J. Wiley (1983)
10. Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–276 (1998)
11. Grung, B., Manne, R.: Missing values in principal components analysis. Chemometrics and Intelligent Laboratory Systems 42(1), 125–139 (1998)
12. Raiko, T., Ilin, A., Karhunen, J.: Principal component analysis for large scale problems with lots of missing values. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 691–698. Springer, Heidelberg (2007)
13. Bishop, C.: Variational principal components. In: Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN 1999), pp. 509–514 (1999)
14. Netflix: Netflix Prize webpage (2007), http://www.netflixprize.com/
15. Funk, S.: Netflix update: Try this at home (December 2006), http://sifter.org/~simon/journal/20061211.html
16. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the International Conference on Machine Learning (2007)

Hierarchical Bayesian Inference of Brain Activity

Masa-aki Sato(1) and Taku Yoshioka(1,2)

(1) ATR Computational Neuroscience Laboratories, [email protected]
(2) National Institute of Information and Communication Technology

Abstract. Magnetoencephalography (MEG) can measure brain activity with millisecond-order temporal resolution, but its spatial resolution is poor, due to the ill-posed nature of the inverse problem of estimating source currents from the electromagnetic measurements. Therefore, prior information on the source currents is essential for solving the inverse problem. We have proposed a hierarchical Bayesian method to combine several sources of information. In our method, the variance of the source current at each source location is considered an unknown parameter and is estimated from the observed MEG data and prior information using the variational Bayes method. The fMRI information can be imposed as a prior distribution on the variance, rather than as the variance itself, so that it gives a soft constraint on the variance. It is shown that the hierarchical Bayesian method has better accuracy and spatial resolution than conventional linear inverse methods, as evaluated by the resolution curve. The proposed method also demonstrated good spatial and temporal resolution in estimating current activity in the early visual area evoked by a stimulus in a quadrant of the visual field.

1 Introduction

In recent years, there has been rapid progress in noninvasive neuroimaging measurement for the human brain. Functional organization of the human brain has been revealed by PET and functional magnetic resonance imaging (fMRI). However, these methods cannot reveal the detailed dynamics of information processing in the human brain, since they have poor temporal resolution due to slow hemodynamic responses to neural activity (Bandettini, 2000; Ogawa et al., 1990). On the other hand, magnetoencephalography (MEG) can measure brain activity with millisecond-order temporal resolution, but its spatial resolution is poor, due to the ill-posed nature of the inverse problem of estimating source currents from the electromagnetic measurements (Hamalainen et al., 1993). Therefore, prior information on the source currents is essential for solving the inverse problem. One of the standard methods for the inverse problem is the dipole method (Hari, 1991; Mosher et al., 1992). It assumes that brain activity can be approximated by a small number of current dipoles. Although this method gives good estimates when the number of active areas is small, it cannot give distributed brain activity for higher function. On the other hand, a number of distributed

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 576–585, 2008.
© Springer-Verlag Berlin Heidelberg 2008


source methods have been proposed to estimate distributed activity in the brain such as the minimum norm method, the minimum L1-norm method, and others (Hamalainen et al., 1993)). It has been also proposed to combine fMRI information with MEG data (Dale and Sereno, 1993;Ahlfors et al., 1999; Dale et al., 2000;Phillips et al., 2002). However, there are essential diﬀerences between fMRI and MEG due to their temporal resolution. The fMRI activity corresponds to an average of several thousands of MEG time series data and may not correspond MEG activity at some time points. We have proposed a new hierarchical Bayesian method to combine several sources of information (Sato et al. 2004). In our method, the variance of the source current at each source location is considered an unknown parameter and estimated from the observed MEG data and prior information. The fMRI information can be imposed as prior information on the variance distribution rather than the variance itself so that it gives a soft constraint on the variance. Therefore, our method is capable of appropriately estimating the source current variance from the MEG data supplemented with the fMRI data, even if fMRI data convey inaccurate information. Accordingly, our method is robust against inaccurate fMRI information. Because of the hierarchical prior, the estimation problem becomes nonlinear and cannot be solved analytically. Therefore, the approximate posterior distribution is calculated by using the Variational Bayesian (VB) method (Attias, 1999; Sato, 2001). The resulting algorithm is an iterative procedure that converges quickly because the VB algorithm is a type of natural gradient method (Amari, 1998) that has an optimal local convergence property. The position and orientation of the cortical surface obtained from structural MRI can be also introduced as hard constraint. In this article, we explain our hierarchical Bayesian method. 
To evaluate the performance of the hierarchical Bayesian method, resolution curves were calculated while varying the numbers of model dipoles, simultaneously active dipoles, and MEG sensors. The results show the superiority of the hierarchical Bayesian method over conventional linear inverse methods. We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of the four quadrants of the visual field. The estimation results are consistent with known physiological findings and demonstrate the good spatial and temporal resolution of the hierarchical Bayesian method.

2

MEG Inverse Problem

When neural current activity occurs in the brain, it produces a magnetic field observed by MEG. The relationship between the magnetic field B = {B_m | m = 1 : M} measured by M sensors and the primary source current J = {J_n | n = 1 : N} in the brain is given by

B = G · J,   (1)

where G = {G_{m,n} | m = 1 : M, n = 1 : N} is the lead field matrix. The lead field G_{m,n} represents the magnetic field B_m produced by the n-th unit dipole current. The above equation defines the forward model, and the inverse problem


M. Sato and T. Yoshioka

is to estimate the source current J from the observed magnetic field data B. A probabilistic model for the source currents can be constructed by assuming Gaussian noise for the MEG sensors. Then the probability distribution that the magnetic field B is observed for a given current J is

P(B | J) ∝ exp( −(β/2) (B − G·J)ᵀ · Σ_G · (B − G·J) ),   (2)

where (β Σ_G)^{-1} denotes the covariance matrix of the sensor noise. Σ_G^{-1} is the normalized covariance matrix satisfying Tr(Σ_G^{-1}) = M, and β^{-1} is the average noise variance.
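As a numerical illustration (not from the paper), the forward model (1) and the Gaussian likelihood (2) can be sketched as follows; the dimensions, the random lead field, and all parameter values are arbitrary choices for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 20                       # sensors and source locations (arbitrary)

G = rng.standard_normal((M, N))    # lead field matrix (random here, for illustration)
J = np.zeros(N)
J[3] = 1.0                         # a single active unit dipole
beta = 100.0                       # average noise variance is 1/beta
Sigma_G = np.eye(M)                # normalized matrix with Tr(Sigma_G^{-1}) = M

# Forward model (1): B = G*J plus sensor noise with covariance (beta*Sigma_G)^{-1}
noise = rng.multivariate_normal(np.zeros(M), np.linalg.inv(beta * Sigma_G))
B = G @ J + noise

def log_likelihood(B, J):
    """Log of Eq. (2), up to an additive constant."""
    r = B - G @ J
    return -0.5 * beta * r @ Sigma_G @ r
```

With the fixed seed above, the true current configuration fits the simulated data far better than, say, an all-zero current, which is what makes the inverse problem well defined once a prior resolves its ill-posedness.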

3

Hierarchical Bayesian Method

In the hierarchical Bayesian method, the variances of the currents are considered unknown parameters and are estimated from the observed MEG data by introducing a hierarchical prior on the current variance. The fMRI information can be imposed as prior information on the variance distribution rather than on the variance itself, so that it gives a soft constraint on the variance. The spatial smoothness constraint, namely that neurons within a few-millimeter radius tend to fire simultaneously due to neural interactions, can also be implemented as a hierarchical prior (Sato et al., 2004).

Hierarchical Prior. Suppose a time sequence of MEG data B_{1:T} ≡ {B(t) | t = 1 : T} is observed. The MEG inverse problem in this case is to estimate the primary source current J_{1:T} ≡ {J(t) | t = 1 : T} from the observed MEG data B_{1:T}. We assume a normal prior for the current:

P_0(J_{1:T} | α) ∝ exp( −(β/2) Σ_{t=1}^{T} J(t)ᵀ · Σ_α · J(t) ),   (3)

where Σ_α is the diagonal matrix with diagonal elements α = {α_n | n = 1 : N}. We also assume that the current variance α^{-1} does not change over the period T. The current inverse variance parameter α is estimated by introducing an ARD (Automatic Relevance Determination) hierarchical prior (Neal, 1996):

P_0(α) = Π_{n=1}^{N} Γ(α_n | ᾱ_{0n}, γ_{0nα}),   (4)
Γ(α | ᾱ, γ) ≡ α^{-1} (αγ/ᾱ)^γ Γ(γ)^{-1} e^{−αγ/ᾱ},

where Γ(α | ᾱ, γ) represents the Gamma distribution with mean ᾱ and degree of freedom γ, and Γ(γ) ≡ ∫_0^∞ dt t^{γ−1} e^{−t} is the Gamma function. When fMRI data are not available, we use a non-informative prior for the current inverse variance parameter α_n, i.e., γ_{0nα} = 0 and P_0(α_n) = α_n^{-1}. When fMRI data are available, the fMRI information is imposed as the prior for the inverse variance parameter α_n: the mean of the prior, ᾱ_{0n}, is assumed to be inversely proportional to the fMRI activity. The confidence parameter γ_{0nα} controls the reliability of the fMRI information.
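To make the role of the confidence parameter concrete, the following sketch (our illustration; the t-values and all numbers are hypothetical) draws samples of α_n from the Gamma prior (4), parameterized by its mean ᾱ_{0n} and degree of freedom γ_{0nα}. A larger confidence parameter concentrates α_n more tightly around the fMRI-derived mean, i.e., a harder constraint:

```python
import numpy as np

t_values = np.array([0.5, 2.0, 8.0])     # hypothetical fMRI t-values at 3 locations
a0 = 500.0                                # hypothetical scale hyperparameter
alpha0_bar = 1.0 / (a0 * t_values)        # prior mean of alpha_n (inverse variance)

def sample_alpha(gamma0, n_samples=20000, seed=0):
    """Sample alpha_n from a Gamma prior with mean alpha0_bar and dof gamma0.

    A Gamma with shape gamma0 and scale mean/gamma0 has the desired mean,
    matching the density in Eq. (4)."""
    rng = np.random.default_rng(seed)
    return rng.gamma(shape=gamma0, scale=alpha0_bar[:, None] / gamma0,
                     size=(len(t_values), n_samples))

low_conf = sample_alpha(gamma0=2.0)       # soft constraint: broad spread around the mean
high_conf = sample_alpha(gamma0=50.0)     # hard constraint: tight spread around the mean
```

The relative spread of a Gamma variable is 1/sqrt(γ), so increasing the confidence parameter shrinks the spread of α_n around its fMRI-derived mean without ever fixing it outright, which is exactly the "soft constraint" behavior described above.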


Variational Bayesian Method. The objective of Bayesian estimation is to calculate the posterior probability distribution of J given the observed data B (in the following, B_{1:T} and J_{1:T} are abbreviated as B and J, respectively, for notational simplicity):

P(J | B) = ∫ dα P(J, α | B),
P(J, α | B) = P(J, α, B) / P(B),
P(J, α, B) = P(B | J) P_0(J | α) P_0(α),
P(B) = ∫ dJ dα P(J, α, B).

The marginal likelihood P(B) cannot be calculated analytically. In the VB method, the calculation of the joint posterior P(J, α | B) is reformulated as a maximization problem for the free energy. The free energy for a trial distribution Q(J, α) is defined by

F(Q) = ∫ dJ dα Q(J, α) log( P(J, α, B) / Q(J, α) )
     = log P(B) − KL[ Q(J, α) || P(J, α | B) ].   (5)

Equation (5) implies that maximizing the free energy F(Q) is equivalent to minimizing the Kullback-Leibler distance (KL-distance), defined by KL[Q(J, α) || P(J, α | B)] ≡ ∫ dJ dα Q(J, α) log( Q(J, α) / P(J, α | B) ), which measures the difference between the true joint posterior P(J, α | B) and the trial distribution Q(J, α). Since the KL-distance attains its minimum, zero, when the two distributions coincide, the joint posterior can be obtained by maximizing the free energy F(Q) with respect to the trial distribution Q. In addition, the maximum free energy gives the log-marginal likelihood log P(B). The optimization problem can be solved using a factorization approximation that restricts the solution space (Attias, 1999; Sato, 2001):

Q(J, α) = Q_J(J) Q_α(α).   (6)

Under the factorization assumption (6), the free energy can be written as

F(Q) = ⟨log P(J, α, B)⟩_{Jα} − ⟨log Q_J(J)⟩_J − ⟨log Q_α(α)⟩_α
     = ⟨log P(B | J)⟩_J − KL[ Q_J(J) Q_α(α) || P_0(J | α) P_0(α) ],   (7)

where ⟨·⟩_J and ⟨·⟩_α denote expectation values with respect to Q_J(J) and Q_α(α), respectively. The first term in the second line of (7) corresponds to the negative of the expected reconstruction error. The second term (the KL-distance) measures the difference between the prior and the posterior


and corresponds to the effective degrees of freedom that can be well specified from the observed data. Therefore, the negative of the free energy can be considered a regularized error function with a model-complexity penalty term. The maximum free energy is obtained by alternately maximizing the free energy with respect to Q_J and Q_α. In the first step (J-step), the free energy F(Q) is maximized with respect to Q_J while Q_α is fixed. The solution is given by

Q_J(J) ∝ exp( ⟨log P(J, α, B)⟩_α ).   (8)

In the second step (α-step), the free energy F(Q) is maximized with respect to Q_α while Q_J is fixed. The solution is given by

Q_α(α) ∝ exp( ⟨log P(J, α, B)⟩_J ).   (9)

The above J- and α-steps are repeated until the free energy converges.

VB Algorithm. The VB algorithm is summarized here. In the J-step, the inverse filter L(Σ_α^{-1}) is calculated using the covariance matrix Σ_α^{-1} estimated in the previous iteration:

L(Σ_α^{-1}) = Σ_α^{-1} · Gᵀ · ( G · Σ_α^{-1} · Gᵀ + Σ_G^{-1} )^{-1}.   (10)

The expectation values of the current J and the noise variance β^{-1} with respect to the posterior distribution are then estimated using the inverse filter (10):

J̄ = L(Σ_α^{-1}) · B,
γ_β β̄^{-1} = (1/2) [ (B − G·J̄)ᵀ · Σ_G · (B − G·J̄) + J̄ᵀ · Σ_α · J̄ ],
γ_β = (1/2) N T.   (11)

In the α-step, the expectation values of the inverse variance parameters α_n^{-1} with respect to the posterior distribution are estimated as

γ_{nα} ᾱ_n^{-1} = γ_{0nα} ᾱ_{0n}^{-1} + (β̄/2) Σ_{t=1}^{T} J̄_n(t)² + (T/2) ᾱ_n^{-1} [ 1 − ( Σ_α^{-1} · Gᵀ · Σ_B^{-1} · G )_{n,n} ],   (12)

where Σ_B ≡ G · Σ_α^{-1} · Gᵀ + Σ_G^{-1}, and γ_{nα} is given by γ_{nα} = γ_{0nα} + T/2.

4

Resolution Curve

We evaluated the performance of the hierarchical Bayesian method by calculating resolution curves and compared it with the minimum norm (MN) method. The inverse filter L of the MN method can be obtained from Eq. (10) by setting the inverse variance parameter α_n to a constant that is independent of the position n. Let us define the resolution matrix R by

R = L · G.   (13)


Fig. 1. Resolution curves of the minimum norm method for different numbers of model dipoles (170, 262, 502, and 1242). The number of sensors is 251. The horizontal axis denotes the radius from the source position in m.

The (n, k) component R_{n,k} of the resolution matrix represents the n-th estimated current when a unit current dipole is applied at the k-th position without noise, i.e., J_k = 1 and J_l = 0 (l ≠ k). The resolution curve is defined as the averaged estimated current as a function of the distance r from the source position. It is obtained by summing the estimated currents R_{n,k} over the positions n whose distance from the k-th position lies in the range from r to r + dr, when a unit current dipole is applied at the k-th position. The averaged resolution curve is obtained by averaging the resolution curves over the positions k. If the estimation were perfect, the resolution curve at the origin, which is the estimation gain, would be one, and the resolution curve would be zero elsewhere. However, the estimation gain of a linear inverse method such as the MN method satisfies the constraint Σ_{n=1}^{N} G_n ≤ M, where G_n denotes the estimation gain at the n-th position (Sato et al., 2004). This constraint implies that a linear inverse method cannot perfectly retrieve more current dipoles than the number of sensors M. To see the effect of this constraint, we calculated the resolution curve for the MN method by varying the number of model dipoles while the number of sensors M was fixed at 251 (Fig. 1). We assumed that the model dipoles were placed evenly on a hemisphere. Fig. 1 shows that the MN method gives perfect estimation if the number of dipoles is less than M. On the other hand, the performance degrades as the number of dipoles increases beyond M. Although the above results were obtained using the MN method, similar results hold for the whole class of linear inverse methods. This limitation is the main cause of the poor spatial


Fig. 2. Resolution curves of the hierarchical Bayesian method with 4078/10442 model dipoles and 251/515 sensors. The number of active dipoles is 240 or 400. The horizontal axis denotes the radius from the source position in m.

resolution of the linear inverse methods. When several dipoles are simultaneously active, the currents estimated by a linear inverse method are the sum of the currents estimated for each dipole separately. Therefore, the resolution curve gives a complete description of the spatial resolution of the linear inverse methods. Our theoretical analysis (in preparation) shows that the hierarchical Bayesian method can estimate dipole currents perfectly even when the number of model dipoles is larger than the number of sensors M. This is because the hierarchical Bayesian method effectively eliminates inactive dipoles from the estimation model by driving the estimation gain of these dipoles to zero. Nevertheless, the number of active dipoles constrains the performance of the hierarchical Bayesian method. The calculation of the resolution curves for the hierarchical Bayesian method is somewhat more complicated, because the Bayesian inverse filters depend on the MEG data. To evaluate the performance in situations where multiple dipoles are active, we generated 240 or 400 active dipoles randomly on the hemisphere and calculated the corresponding MEG data with all of them simultaneously active. The Bayesian inverse filters were estimated using these simulated MEG data. The resolution curves were then calculated using the estimated Bayesian inverse filters for each active dipole and averaged over all active dipoles. Fig. 2 shows the resolution curves for the hierarchical Bayesian method with 4078/10442 model dipoles and 251/515 MEG sensors. When the number of simultaneously active dipoles is less than the number of MEG sensors, almost perfect estimation is obtained


regardless of the number of model dipoles. Therefore, the hierarchical Bayesian method achieves much better spatial resolution than the conventional linear inverse methods. On the other hand, the performance degrades when the number of simultaneously active dipoles exceeds the number of MEG sensors. The above results demonstrate the superiority of the hierarchical Bayesian method over the MN method.
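The contrast between the MN method's bounded gain budget and the hierarchical method's pruning can be sketched numerically. This is our toy example, not the paper's simulation: a random lead field, T = 1, β = 1, non-informative hyperpriors, and a simplified fixed-point form of the updates (10)-(12):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 30, 120                      # fewer sensors than model dipoles
G = rng.standard_normal((M, N))     # toy lead field

J_true = np.zeros(N)
active = [5, 40, 90]
J_true[active] = 3.0
B = G @ J_true + 0.05 * rng.standard_normal(M)

def inverse_filter(alpha_inv):
    """Eq. (10) with Sigma_G = I: L = Sa * G' * (G * Sa * G' + I)^{-1}."""
    Sa = np.diag(alpha_inv)
    return Sa @ G.T @ np.linalg.inv(G @ Sa @ G.T + np.eye(M))

# Minimum norm method: constant alpha; the total estimation gain is bounded by M
L_mn = inverse_filter(np.ones(N))
gain_mn = np.einsum('ij,ji->i', L_mn, G)    # diagonal of the resolution matrix L*G

# Hierarchical Bayesian (ARD) iteration: alpha_inv adapts to the data
alpha_inv = np.ones(N)
for _ in range(100):
    L = inverse_filter(alpha_inv)
    J_hat = L @ B
    gain = np.einsum('ij,ji->i', L, G)
    # Simplified fixed-point update in the spirit of Eq. (12) (T = 1, beta = 1)
    alpha_inv = J_hat**2 + alpha_inv * (1.0 - gain)
```

On this toy problem the MN gains must share a total budget of at most M across all N positions, while the ARD iteration shrinks the variance parameters of inactive dipoles, concentrating the available gain on the truly active ones.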

5

Visual Experiments

We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of the four quadrants of the visual field. Red and green checkerboards in a pseudo-randomly selected quadrant were presented for 700 ms in each trial. fMRI experiments with the same quadrant stimuli were also conducted, adopting a conventional block design in which the stimuli were presented for 15 seconds per block. The global field power (the sum of the MEG signals over all sensors) recorded from subject RH and induced by the upper-right stimulus is shown in Fig. 3a. A strong peak was observed 93 ms after stimulus onset. Cortical currents were estimated by applying the hierarchical Bayesian method to the averaged MEG data between 100 ms before and 400 ms after stimulus onset. The fMRI activity t-values were used as a prior for the inverse

Fig. 3. Estimated current for the quadrant visual stimulus. (a) shows the global field power of the MEG signal. (b) shows the temporal patterns of the averaged currents in V1, V2/3, and V4. (c-e) show the spatial patterns of the current strength averaged over 20 ms time windows centered at 93 ms, 98 ms, and 134 ms.


variance parameters. As explained in the 'Hierarchical Prior' subsection, the mean of the prior was assumed to be ᾱ_{0n}^{-1} = a_0 · t_f(n), where t_f(n) is the t-value at the n-th position and a_0 is a hyperparameter, set to 500 in this analysis. The estimated spatiotemporal brain activity is illustrated in Fig. 3. We identified three ROIs (regions of interest) in V1, V2/3, and V4, and obtained temporal patterns of the estimated currents by averaging the current within these ROIs. Fig. 3b shows that V1, V2/3, and V4 are successively activated, attaining their peaks around 93 ms, 98 ms, and 134 ms, respectively. Figs. 3c-3e illustrate the spatial pattern of the current strength averaged over 20 ms time windows (centered at 93 ms, 98 ms, and 134 ms) in a flattened map format. The flattened map was made by cutting along the bottom of the calcarine sulcus. Strongly active regions can be seen in V1, V2/3, and V4, corresponding to their peak activities. These results are consistent with known physiological findings and demonstrate the good spatial and temporal resolution of the hierarchical Bayesian method.

6

Conclusion

In this article, we have explained the hierarchical Bayesian method, which combines MEG and fMRI by means of a hierarchical prior. We have shown the superiority of the hierarchical Bayesian method over conventional linear inverse methods by evaluating resolution curves. We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of the four quadrants of the visual field. The estimation results are consistent with known physiological findings and demonstrate the good spatial and temporal resolution of the hierarchical Bayesian method. Currently, we are applying the hierarchical Bayesian method to brain-machine interfaces based on noninvasive neuroimaging. In our approach, we first estimate the current activity in the brain; the intention or motion of the subject is then estimated from that current activity. This approach enables us to use physiological knowledge and gives us more insight into the mechanisms of human information processing.

Acknowledgement. This research was supported in part by NICT-KARC.

References

Ahlfors, S.P., Simpson, G.V., Dale, A.M., Belliveau, J.W., Liu, A.K., Korvenoja, A., Virtanen, J., Huotilainen, M., Tootell, R.B.H., Aronen, H.J., Ilmoniemi, R.J.: Spatiotemporal activity of a cortical network for processing visual motion revealed by MEG and fMRI. J. Neurophysiol. 82, 2545-2555 (1999)
Amari, S.: Natural Gradient Works Efficiently in Learning. Neural Computation 10, 251-276 (1998)
Attias, H.: Inferring parameters and structure of latent variable models by variational Bayes. In: Proc. 15th Conference on Uncertainty in Artificial Intelligence, pp. 21-30 (1999)
Bandettini, P.A.: The temporal resolution of functional MRI. In: Moonen, C.T.W., Bandettini, P.A. (eds.) Functional MRI, pp. 205-220. Springer, Heidelberg (2000)


Dale, A.M., Liu, A.K., Fischl, B.R., Buckner, R.L., Belliveau, J.W., Lewine, J.D., Halgren, E.: Dynamic statistical parametric mapping: Combining fMRI and MEG for high-resolution imaging of cortical activity. Neuron 26, 55-67 (2000)
Dale, A.M., Sereno, M.I.: Improved localization of cortical activity by combining EEG and MEG with MRI cortical surface reconstruction: A linear approach. J. Cognit. Neurosci. 5, 162-176 (1993)
Hamalainen, M.S., Hari, R., Ilmoniemi, R.J., Knuutila, J., Lounasmaa, O.V.: Magnetoencephalography: Theory, instrumentation, and applications to noninvasive studies of the working human brain. Rev. Modern Phys. 65, 413-497 (1993)
Hari, R.: On brain's magnetic responses to sensory stimuli. J. Clin. Neurophysiol. 8, 157-169 (1991)
Mosher, J.C., Lewis, P.S., Leahy, R.M.: Multiple dipole modelling and localization from spatio-temporal MEG data. IEEE Trans. Biomed. Eng. 39, 541-557 (1992)
Neal, R.M.: Bayesian Learning for Neural Networks. Springer, Heidelberg (1996)
Ogawa, S., Lee, T.-M., Kay, A.R., Tank, D.W.: Brain magnetic resonance imaging with contrast dependent on blood oxygenation. Proc. Natl. Acad. Sci. USA 87, 9868-9872 (1990)
Phillips, C., Rugg, M.D., Friston, K.J.: Anatomically Informed Basis Functions for EEG Source Localization: Combining Functional and Anatomical Constraints. NeuroImage 16, 678-695 (2002)
Sato, M.: On-line Model Selection Based on the Variational Bayes. Neural Computation 13, 1649-1681 (2001)
Sato, M., Yoshioka, T., Kajihara, S., Toyama, K., Goda, N., Doya, K., Kawato, M.: Hierarchical Bayesian estimation for MEG inverse problem. NeuroImage 23, 806-826 (2004)

Neural Decoding of Movements: From Linear to Nonlinear Trajectory Models

Byron M. Yu¹,², John P. Cunningham¹, Krishna V. Shenoy¹, and Maneesh Sahani²

¹ Dept. of Electrical Engineering and Neurosciences Program, Stanford University, Stanford, CA, USA
² Gatsby Computational Neuroscience Unit, UCL, London, UK
{byronyu,jcunnin,shenoy}@stanford.edu, [email protected]

Abstract. To date, the neural decoding of time-evolving physical state (for example, the path of a foraging rat, or arm movements) has been largely carried out using linear trajectory models, primarily due to their computational efficiency. The possibility of better capturing the statistics of the movements using nonlinear trajectory models, thereby yielding more accurate decoded trajectories, is enticing. However, nonlinear decoding usually carries a higher computational cost, which is an important consideration in real-time settings. In this paper, we present techniques for nonlinear decoding employing modal Gaussian approximations, expectation propagation, and Gaussian quadrature. We compare their decoding accuracy versus computation time tradeoffs based on high-dimensional simulated neural spike counts.

Keywords: Nonlinear dynamical models, nonlinear state estimation, neural decoding, neural prosthetics, expectation propagation, Gaussian quadrature.

1

Introduction

We consider the problem of decoding time-evolving physical state from neural spike trains. Examples include decoding the path of a foraging rat from hippocampal neurons [1,2] and decoding the arm trajectory from motor cortical neurons [3,4,5,6,7,8]. Advances in this area have enabled the development of neural prosthetic devices, which seek to allow disabled patients to regain motor function through the use of prosthetic limbs, or computer cursors, controlled by neural activity [9,10,11,12,13,14,15]. Several of these prosthetic decoders, including population vectors [11] and linear filters [10,12,15], linearly map the observed neural activity to the estimate of physical state. Although these direct linear mappings are effective, recursive Bayesian decoders have been shown to provide more accurate trajectory estimates [1,6,7,16]. In addition, recursive Bayesian decoders provide confidence regions on the trajectory estimates and allow for nonlinear relationships between

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 586–595, 2008.
© Springer-Verlag Berlin Heidelberg 2008


the neural activity and the physical state variables. Recursive Bayesian decoders are based on the speciﬁcation of a probabilistic model comprising 1) a trajectory model, which describes how the physical state variables change from one time step to the next, and 2) an observation model, which describes how the observed neural activity relates to the time-evolving physical state. The function of the trajectory model is to build into the decoder prior knowledge about the form of the trajectories. In the case of decoding arm movements, the trajectory model may reﬂect 1) the hard, physical constraints of the limb (for example, the elbow cannot bend backward), 2) the soft, control constraints imposed by neural mechanisms (for example, the arm is more likely to move smoothly than in a jerky motion), and 3) the physical surroundings of the person and his/her objectives in that environment. The degree to which the trajectory model captures the statistics of the actual movements directly aﬀects the accuracy with which trajectories can be decoded from neural data [8]. The most commonly-used trajectory models assume linear dynamics perturbed by Gaussian noise, which we refer to collectively as linear-Gaussian models. The family of linear-Gaussian models includes the random walk model [1,2,6], those with a constant [8] or time-varying [17,18] forcing term, those without a forcing term [7,16], those with a time-varying state transition matrix [19], and those with higher-order Markov dependencies [20]. Linear-Gaussian models have been successfully applied to decoding the path of a foraging rat [1,2], as well as arm trajectories in ellipse-tracing [6], pursuit-tracking [7,20,16], “pinball” [7,16], and center-out reach [8] tasks. Linear-Gaussian models are widely used primarily due to their computational eﬃciency, which is an important consideration for real-time decoding applications. 
However, for particular types of movements, the family of linear-Gaussian models may be too restrictive and unable to capture salient properties of the observed movements [8]. We recently proposed a general approach to constructing trajectory models that can exhibit rather complex dynamical behaviors and whose decoder can be implemented to have the same running time (using a parallel implementation) as simpler trajectory models [8]. In particular, we demonstrated that a probabilistic mixture of linear-Gaussian trajectory models, each accurate within a limited regime of movement, can capture the salient properties of goal-directed reaches to multiple targets. This mixture model, which yielded more accurate decoded trajectories than a single linear-Gaussian model, can be viewed as a discrete approximation to a single, uniﬁed trajectory model with nonlinear dynamics. An alternate approach is to decode using this single, uniﬁed nonlinear trajectory model without discretization. This makes the decoding problem more diﬃcult since nonlinear transformations of parametric distributions are typically no longer easily parametrized. State estimation in nonlinear dynamical systems is a ﬁeld of active research that has made substantial progress in recent years, including the application of numerical quadrature techniques to dynamical systems [21,22,23], the development of expectation-propagation (EP) [24] and its application to dynamical systems [25,26,27,28], and the improvement in the


computational eﬃciency of Monte Carlo techniques (e.g., [29,30,31]). However, these techniques have not been rigorously tested and compared in the context of neural decoding, which typically involves observations that are high-dimensional vectors of non-negative integers. In particular, the tradeoﬀ between decoding accuracy and computational cost among diﬀerent neural decoding algorithms has not been studied in detail. Knowing the accuracy-computational cost tradeoﬀ is important for real-time applications, where one may need to select the most accurate algorithm given a computational budget or the least computationally intensive algorithm given a minimal acceptable decoding accuracy. This paper takes a step in this direction by comparing three particular deterministic Gaussian approximations. In Section 2, we ﬁrst introduce the nonlinear dynamical model for neural spike counts and the decoding problem. Sections 3 and 4 detail the three deterministic Gaussian approximations that we focus on in this report: global Laplace, Gaussian quadrature-EP (GQ-EP), and Laplace propagation (LP). Finally, in Section 5, we compare the decoding accuracy versus computational cost of these three techniques.

2

Nonlinear Dynamical Model and Neural Decoding

In this report, we consider nonlinear dynamical models for neural spike counts of the following form:

x_t | x_{t−1} ∼ N( f(x_{t−1}), Q ),   (1a)
y_t^i | x_t ∼ Poisson( λ_i(x_t) · Δ ),   (1b)

where x_t ∈ R^{p×1} is a vector containing the physical state variables at time t = 1, …, T; y_t^i ∈ {0, 1, 2, …} is the corresponding observed spike count for neuron i = 1, …, q, taken in a time bin of width Δ; and Q ∈ R^{p×p} is a covariance matrix. The functions f : R^{p×1} → R^{p×1} and λ_i : R^{p×1} → R_+ are, in general, nonlinear. The initial state x_1 is Gaussian-distributed. For notational compactness, the spike counts of all q simultaneously recorded neurons are assembled into a q × 1 vector y_t, whose i-th element is y_t^i. Note that the observations are discrete-valued and that, typically, q ≫ p. Equations (1a) and (1b) are referred to as the trajectory and observation models, respectively.

The task of neural decoding involves finding, at each time point t, the likely physical states x_t given the neural activity observed up to that time, {y}_1^t. In other words, we seek to compute the filtered state posterior P(x_t | {y}_1^t) at each t. We previously showed how to estimate the filtered state posterior when f is a linear function [8]. Here, we consider how to compute P(x_t | {y}_1^t) when f is nonlinear. The extended Kalman filter (EKF) is a commonly used technique for nonlinear state estimation. Unfortunately, it cannot be directly applied to the current problem because the observation noise in (1b) is not additive Gaussian. Possible alternatives are the unscented Kalman filter (UKF) [21,22] and the closely related quadrature Kalman filter (QKF) [23], both of which employ quadrature


techniques to approximate Gaussian integrals that are analytically intractable. While the UKF has been shown to outperform the EKF [21,22], the UKF requires making Gaussian approximations in the observation space. This property of the UKF is undesirable for the current problem because the observed spike counts are typically 0 or 1 (due to the use of relatively short bin widths Δ) and therefore distinctly non-Gaussian. As a result, the UKF yielded substantially lower decoding accuracy than the techniques presented in Sections 3 and 4 [28], which make Gaussian approximations only in the state space. While we have not yet tested the QKF, the number of quadrature points it requires grows geometrically with p + q, which quickly becomes impractical even for moderate values of p and q. Thus, we will no longer consider the UKF and QKF in the remainder of this paper. The decoding techniques described in Sections 3 and 4 naturally yield the smoothed state posterior P(x_t | {y}_1^T), rather than the filtered state posterior P(x_t | {y}_1^t). Thus, we will focus on the smoothed state posterior in this work. However, the filtered state posterior at time t can easily be obtained by smoothing using only the observations from time points 1, …, t.

3

Global Laplace

The idea is to estimate the joint state posterior across the entire sequence (i.e., the global state posterior) as a Gaussian matched to the location and curvature of a mode of P({x}_1^T | {y}_1^T), as in Laplace's method [32]. The mode is defined as

{x*}_1^T = argmax_{{x}_1^T} P({x}_1^T | {y}_1^T) = argmax_{{x}_1^T} L({x}_1^T),   (2)

where

L({x}_1^T) = log P({x}_1^T, {y}_1^T)
           = log P(x_1) + Σ_{t=2}^{T} log P(x_t | x_{t−1}) + Σ_{t=1}^{T} Σ_{i=1}^{q} log P(y_t^i | x_t).   (3)

Using the known distributions (1), the gradients of L({x}_1^T) can be computed exactly, and a local mode {x*}_1^T can be found by applying a gradient optimization technique. The global state posterior is then approximated as

P({x}_1^T | {y}_1^T) ≈ N( {x*}_1^T, [ −∇² L({x*}_1^T) ]^{−1} ).   (4)
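A minimal 1-D sketch of this procedure (our illustration, with arbitrary dynamics, rate function, and noise values; the paper's own experiments use minimize.m with exact gradients):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T = 30
q_var = 0.09                           # state noise variance Q (1-D case)
f = lambda x: 0.9 * x                  # simple dynamics for this sketch
rate = lambda x: np.log1p(np.exp(x)) * 0.5 + 1e-12   # lambda(x)*Delta, kept positive

# Simulate a latent path and Poisson counts from the model (1)
x = np.zeros(T)
for t in range(1, T):
    x[t] = f(x[t-1]) + np.sqrt(q_var) * rng.standard_normal()
y = rng.poisson(rate(x))

def neg_L(xs):
    """Negative of Eq. (3), up to constants; x1 ~ N(0, 1)."""
    r = rate(xs)
    obs = np.sum(y * np.log(r) - r)                              # Poisson terms
    dyn = -0.5 * xs[0]**2 - np.sum((xs[1:] - f(xs[:-1]))**2) / (2 * q_var)
    return -(obs + dyn)

res = minimize(neg_L, np.zeros(T), method='BFGS')    # gradient-based mode search
x_map = res.x                                        # the mode {x*} of Eq. (2)
cov = res.hess_inv                                   # curvature-based covariance
                                                     # of Eq. (4) (BFGS estimate)
```

Here `res.hess_inv` stands in for the exact inverse negative Hessian of Eq. (4); in a real implementation one would use the analytic block-tridiagonal Hessian, which is what makes global Laplace efficient for long sequences.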

4

Expectation Propagation

We briefly summarize here the application of EP [24] to dynamical models [25,26,27,28]. More details can be found in the cited references. The two primary distributions of interest here are the marginal P(x_t | {y}_1^T) and pairwise


joint P(x_{t−1}, x_t | {y}_1^T) state posteriors. These distributions can be expressed in terms of forward messages α_t and backward messages β_t as follows:

P(x_t | {y}_1^T) = α_t(x_t) β_t(x_t) / P({y}_1^T),   (5)
P(x_{t−1}, x_t | {y}_1^T) = α_{t−1}(x_{t−1}) P(x_t | x_{t−1}) P(y_t | x_t) β_t(x_t) / P({y}_1^T),   (6)

where α_t(x_t) = P(x_t, {y}_1^t) and β_t(x_t) = P({y}_{t+1}^T | x_t). The messages α_t and β_t are typically approximated by an exponential-family density; in this paper, we use an unnormalized Gaussian. These approximate messages are iteratively updated by matching the expected sufficient statistics (for Gaussian approximating distributions, this is equivalent to matching the first two moments) of the marginal posterior (5) with those of the pairwise joint posterior (6). The updates are usually performed sequentially via multiple forward-backward passes. During the forward pass, the α_t are updated while the β_t remain fixed:

P(x_t | {y}_1^T) = ∫ dx_{t−1} α_{t−1}(x_{t−1}) P(x_t | x_{t−1}) P(y_t | x_t) β_t(x_t) / P({y}_1^T)   (7)
≈ ∫ dx_{t−1} P̂(x_{t−1}, x_t),   (8)

α_t(x_t) ∝ [ ∫ dx_{t−1} P̂(x_t, x_{t−1}) ] / β_t(x_t),   (9)

where P̂(x_{t−1}, x_t) is an exponential-family distribution whose expected sufficient statistics are matched to those of P(x_{t−1}, x_t | {y}_1^T). In this paper, P̂(x_{t−1}, x_t) is assumed to be Gaussian. The backward pass proceeds similarly, with the β_t updated while the α_t remain fixed. The decoded trajectory is obtained by combining the messages α_t and β_t, as shown in (5), after completing the forward-backward passes. In Section 5, we investigate the accuracy-computational cost tradeoff of using different numbers of forward-backward iterations.

Although the expected sufficient statistics (or moments) of P(x_{t−1}, x_t | {y}_1^T) cannot typically be computed analytically for the nonlinear dynamical model (1), they can be approximated using Gaussian quadrature [26,28]. This EP-based decoder is referred to as GQ-EP. By applying the ideas of Laplace propagation (LP) [33], a closely related decoder has been developed that uses a modal Gaussian approximation of P(x_{t−1}, x_t | {y}_1^T) rather than matching moments [27,28]. This technique, which uses the same message-passing scheme as GQ-EP, is referred to here as LP.

In practice, it is possible to encounter invalid message updates. For example, if the variance of x_t in the numerator is larger than that in the denominator in (9), due to approximation error in the choice of P̂, the update rule would assign α_t(x_t) a negative variance. One way around this problem is simply to skip that message update and hope that the update is no longer invalid during the next forward-backward iteration [34]. An alternative is to set β_t(x_t) = 1 in (7) and (9), which guarantees a valid update for α_t(x_t). This is referred to as the one-sided update; its implications for decoding accuracy and computation time are considered in Section 5.
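The moment-matching step at the heart of GQ-EP can be illustrated in one dimension. This is our sketch, not the paper's implementation: a Gaussian message multiplied by a Poisson likelihood with an exponential rate, integrated by Gauss-Hermite quadrature:

```python
import numpy as np

# Gauss-Hermite nodes and weights, for integrals against exp(-u^2)
nodes, weights = np.polynomial.hermite.hermgauss(30)

def match_moments(mu, var, y):
    """Mean and variance of p(x) proportional to N(x; mu, var) * Poisson(y | exp(x)),
    computed by Gauss-Hermite quadrature (1-D illustration)."""
    x = mu + np.sqrt(2.0 * var) * nodes      # change of variables u -> x
    lam = np.exp(x)
    lik = lam**y * np.exp(-lam)              # unnormalized Poisson likelihood
    w = weights * lik
    Z = w.sum()                              # normalizer (up to constants)
    m1 = (w * x).sum() / Z
    m2 = (w * x**2).sum() / Z
    return m1, m2 - m1**2

# Observing zero spikes pulls the state estimate toward lower firing rates
m0, v0 = match_moments(mu=0.0, var=1.0, y=0)
```

Matching a Gaussian to these two moments gives the updated message. An invalid update of the kind described above corresponds to the matched variance exceeding what the denominator message in (9) can absorb, which is precisely when the skip or one-sided strategies are needed.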

5 Results

We evaluated decoding accuracy versus computational cost of the techniques described in Sections 3 and 4. These performance comparisons were based on the model (1), where

f(x) = (1 − k) x + k · W · erf(x)   (10)
λ_i(x) = log(1 + e^{c_i^T x + d_i})   (11)

with parameters W ∈ R^{p×p}, c_i ∈ R^{p×1}, and d_i ∈ R. The error function (erf) in (10) acts element-by-element on its argument. We have chosen the dynamics (10) of a fully-connected recurrent network due to its nonlinear nature; we make no claims in this work about its suitability for particular decoding applications, such as for rat paths or arm trajectories. Because recurrent networks are often used to directly model neural activity, it is important to emphasize that x is a vector of physical state variables to be decoded, not a vector of neural activity.

We generated 50 state trajectories, each with 50 time points, and corresponding spike counts from the model (1), where the model parameters were randomly chosen within a range that provided biologically realistic spike counts (typically, 0 or 1 spike in each bin). The time constant k ∈ R was set to 0.1. To understand how these algorithms scale with different numbers of physical state variables and observed neurons, we considered all pairings (p, q), where p ∈ {3, 10} and q ∈ {20, 100, 500}. For each pairing, we repeated the above procedure three times.

For the global Laplace decoder, the modal trajectory was found using Polack-Ribière conjugate gradients with quadratic/cubic line searches and Wolfe-Powell stopping criteria (minimize.m by Carl Rasmussen, available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/). To stabilize GQ-EP, we used a modal Gaussian proposal distribution and the custom precision 3 quadrature rule with non-negative quadrature weights, as described in [28]. For both GQ-EP and LP, minimize.m was used to find a mode of P(x_{t−1}, x_t | {y}_1^T).

Fig. 1 illustrates the decoding accuracy versus computation time of the presented techniques. Decoding accuracy was measured by evaluating the marginal state posteriors P(x_t | {y}_1^T) at the actual trajectory. The higher the log probability, the more accurate the decoder.
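The data-generation procedure just described can be sketched as follows, assuming Gaussian process noise in (1); the parameter ranges and noise level are illustrative assumptions, not the paper's settings.

```python
import math
import numpy as np

# Sample state trajectories and spike counts from model (1) with
# dynamics (10) and firing rates (11).
rng = np.random.default_rng(0)
erf = np.vectorize(math.erf)           # element-wise error function, as in (10)

p, q, T, k = 3, 20, 50, 0.1            # states, neurons, time points, time constant
W = rng.normal(0.0, 0.5, (p, p))       # recurrent weights (illustrative scale)
C = rng.normal(0.0, 1.0, (q, p))       # rows are c_i^T
d = rng.normal(-1.0, 0.5, q)           # offsets d_i, chosen for low spike counts

x = np.zeros((T, p))
y = np.zeros((T, q), dtype=int)
x_prev = np.zeros(p)
for t in range(T):
    # f(x) = (1 - k) x + k W erf(x), eq. (10), plus assumed Gaussian noise
    x[t] = (1 - k) * x_prev + k * (W @ erf(x_prev)) + rng.normal(0.0, 0.1, p)
    lam = np.log1p(np.exp(C @ x[t] + d))   # rate lambda_i(x), eq. (11)
    y[t] = rng.poisson(lam)                # Poisson spike counts
    x_prev = x[t]
```

With offsets around −1 the softplus rates stay well below one, so most bins contain 0 or 1 spike, matching the "biologically realistic" regime described above.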
Each panel corresponds to a diﬀerent number of state variables and observed neurons. For GQ-EP (dotted line) and LP (solid line), we varied the number of forward-backward iterations between one and three; thus, there are three circles for each of these decoders. Across all panels, global Laplace required the least computation time and yielded state

[Fig. 1 appears here: six panels (a)-(f); vertical axes show log probability, horizontal axes show computation time (sec).]

Fig. 1. Decoding accuracy versus computation time of global Laplace (no line), GQ-EP (dotted line), and LP (solid line). (a) p = 3, q = 20, (b) p = 3, q = 100, (c) p = 3, q = 500, (d) p = 10, q = 20, (e) p = 10, q = 100, (f) p = 10, q = 500. The circles and bars represent mean ± SEM. Variability in computation time is not represented on the plots because it was negligible. The computation times were obtained using a 2.2-GHz AMD Athlon 64 processor with 2 GB RAM running MATLAB R14. Note that the scale of the vertical axes is not the same in each panel and that some error bars are so small that they cannot be seen.

estimates as accurate as, or more accurate than, the other techniques. This is the key result of this report. We also implemented a basic particle smoother [35], where the number of particles (500 to 1500) was chosen such that its computation time was on the same order as those shown in Fig. 1 (results not shown). Although this particle smoother yielded substantially lower decoding accuracy than global Laplace, GQ-EP, and LP, the three deterministic techniques should be compared to more recently-developed Monte Carlo techniques, as described in Section 6.

Fig. 1 shows that all three techniques have computation times that scale well with the number of state variables p and neurons q. In particular, the required computation time typically scales sub-linearly with increases in p and far sub-linearly with increases in q. As q increases, the accuracies of the techniques become more similar (note that different panels have different vertical scales), and there is less advantage to performing multiple forward-backward iterations for GQ-EP and LP. The decoding accuracy and required computation time both typically increase with the number of iterations. In a few cases (e.g., GQ-EP in Fig. 1(b)), it is possible for the accuracy to decrease slightly when going from two to three iterations, presumably due to one-sided updates. In theory, GQ-EP should require greater computation time than LP because it needs to perform the same modal Gaussian approximation, then use it as a proposal distribution for Gaussian quadrature. In practice, it is possible for LP

Neural Decoding Using Nonlinear Trajectory Models


to be slower if it needs many one-sided updates (cf. Fig. 1(d)), since one-sided updates are used only when the usual update (9) fails. Furthermore, LP required greater computation time in Fig. 1(d) than in Fig. 1(e) due to the need for many more one-sided updates, despite having five times fewer neurons. It was previously shown that {x*}_1^T is a local optimum of P({x}_1^T | {y}_1^T) (i.e., a solution of global Laplace) if and only if it is a fixed-point of LP [33]. Because the modal Gaussian approximation matches local curvature up to second order, it can also be shown that the estimated covariances using global Laplace and LP are equal at {x*}_1^T [33]. Empirically, we found both statements to be true if few one-sided updates were required for LP. Due to these connections between global Laplace and LP, the accuracy of LP after three forward-backward iterations was similar to that of global Laplace in all panels in Fig. 1. Although LP may have computational savings compared to global Laplace in certain applications [33], we found that global Laplace was substantially faster for the particular graph structure described by (1).

6 Conclusion

We have presented three deterministic techniques for nonlinear state estimation (global Laplace, GQ-EP, LP) and compared their decoding accuracy versus computation cost in the context of neural decoding, involving high-dimensional observations of non-negative integers. This work can be extended in the following directions. First, the deterministic techniques presented here should be compared to recently-developed Monte Carlo techniques that have yielded increased accuracy and/or reduced computational cost compared to the basic particle ﬁlter/smoother in applications other than neural decoding [29]. Examples include the Gaussian particle ﬁlter [31], sigma-point particle ﬁlter [30], and embedded hidden Markov model [36]. Second, we have compared these decoders based on one particular non-linear trajectory model (10). Other non-linear trajectory models (e.g., a model describing primate arm movements [37]) should be tested to see if the decoders have similar accuracy-computational cost tradeoﬀs as shown here. Acknowledgments. This work was supported by NIH-NINDS-CRCNS-R01, NDSEG Fellowship, NSF Graduate Research Fellowship, Gatsby Charitable Foundation, Michael Flynn Stanford Graduate Fellowship, Christopher Reeve Paralysis Foundation, Burroughs Wellcome Fund Career Award in the Biomedical Sciences, Stanford Center for Integrated Systems, NSF Center for Neuromorphic Systems Engineering at Caltech, Oﬃce of Naval Research, Sloan Foundation and Whitaker Foundation.

References

1. Brown, E.N., Frank, L.M., Tang, D., Quirk, M.C., Wilson, M.A.: A statistical paradigm for neural spike train decoding applied to position prediction from the ensemble firing patterns of rat hippocampal place cells. J. Neurosci. 18(18), 7411–7425 (1998)


B.M. Yu et al.

2. Zhang, K., Ginzburg, I., McNaughton, B.L., Sejnowski, T.J.: Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol. 79, 1017–1044 (1998)
3. Wessberg, J., Stambaugh, C.R., Kralik, J.D., Beck, P.D., Laubach, M., Chapin, J.K., Kim, J., Biggs, J., Srinivasan, M.A., Nicolelis, M.A.L.: Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408(6810), 361–365 (2000)
4. Schwartz, A.B., Taylor, D.M., Tillery, S.I.H.: Extraction algorithms for cortical control of arm prosthetics. Curr. Opin. Neurobiol. 11, 701–707 (2001)
5. Serruya, M., Hatsopoulos, N., Fellows, M., Paninski, L., Donoghue, J.: Robustness of neuroprosthetic decoding algorithms. Biol. Cybern. 88(3), 219–228 (2003)
6. Brockwell, A.E., Rojas, A.L., Kass, R.E.: Recursive Bayesian decoding of motor cortical signals by particle filtering. J. Neurophysiol. 91(4), 1899–1907 (2004)
7. Wu, W., Black, M.J., Mumford, D., Gao, Y., Bienenstock, E., Donoghue, J.P.: Modeling and decoding motor cortical activity using a switching Kalman filter. IEEE Trans. Biomed. Eng. 51(6), 933–942 (2004)
8. Yu, B.M., Kemere, C., Santhanam, G., Afshar, A., Ryu, S.I., Meng, T.H., Sahani, M., Shenoy, K.V.: Mixture of trajectory models for neural decoding of goal-directed movements. J. Neurophysiol. 97, 3763–3780 (2007)
9. Chapin, J.K., Moxon, K.A., Markowitz, R.S., Nicolelis, M.A.L.: Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nat. Neurosci. 2, 664–670 (1999)
10. Serruya, M.D., Hatsopoulos, N.G., Paninski, L., Fellows, M.R., Donoghue, J.P.: Instant neural control of a movement signal. Nature 416, 141–142 (2002)
11. Taylor, D.M., Tillery, S.I.H., Schwartz, A.B.: Direct cortical control of 3D neuroprosthetic devices. Science 296, 1829–1832 (2002)
12. Carmena, J.M., Lebedev, M.A., Crist, R.E., O'Doherty, J.E., Santucci, D.M., Dimitrov, D.F., Patil, P.G., Henriquez, C.S., Nicolelis, M.A.L.: Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biology 1(2), 193–208 (2003)
13. Musallam, S., Corneil, B.D., Greger, B., Scherberger, H., Andersen, R.A.: Cognitive control signals for neural prosthetics. Science 305, 258–262 (2004)
14. Santhanam, G., Ryu, S.I., Yu, B.M., Afshar, A., Shenoy, K.V.: A high-performance brain-computer interface. Nature 442, 195–198 (2006)
15. Hochberg, L.R., Serruya, M.D., Friehs, G.M., Mukand, J.A., Saleh, M., Caplan, A.H., Branner, A., Chen, D., Penn, R.D., Donoghue, J.P.: Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442, 164–171 (2006)
16. Wu, W., Gao, Y., Bienenstock, E., Donoghue, J.P., Black, M.J.: Bayesian population decoding of motor cortical activity using a Kalman filter. Neural Comput. 18(1), 80–118 (2006)
17. Kemere, C., Meng, T.: Optimal estimation of feed-forward-controlled linear systems. In: Proc. IEEE ICASSP, pp. 353–356 (2005)
18. Srinivasan, L., Eden, U.T., Willsky, A.S., Brown, E.N.: A state-space analysis for reconstruction of goal-directed movements using neural signals. Neural Comput. 18(10), 2465–2494 (2006)
19. Srinivasan, L., Brown, E.N.: A state-space framework for movement control to dynamic goals through brain-driven interfaces. IEEE Trans. Biomed. Eng. 54(3), 526–535 (2007)
20. Shoham, S., Paninski, L.M., Fellows, M.R., Hatsopoulos, N.G., Donoghue, J.P., Normann, R.A.: Statistical encoding model for a primary motor cortical brain-machine interface. IEEE Trans. Biomed. Eng. 52(7), 1313–1322 (2005)


21. Wan, E., van der Merwe, R.: The unscented Kalman filter. In: Haykin, S. (ed.) Kalman Filtering and Neural Networks. Wiley Publishing, Chichester (2001)
22. Julier, S., Uhlmann, J.: Unscented filtering and nonlinear estimation. Proceedings of the IEEE 92(3), 401–422 (2004)
23. Arasaratnam, I., Haykin, S., Elliott, R.: Discrete-time nonlinear filtering algorithms using Gauss-Hermite quadrature. Proceedings of the IEEE 95(5), 953–977 (2007)
24. Minka, T.: Expectation propagation for approximate Bayesian inference. In: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 362–369 (2001)
25. Heskes, T., Zoeter, O.: Expectation propagation for approximate inference in dynamic Bayesian networks. In: Darwiche, A., Friedman, N. (eds.) Proceedings UAI-2002, pp. 216–223 (2002)
26. Zoeter, O., Ypma, A., Heskes, T.: Improved unscented Kalman smoothing for stock volatility estimation. In: Barros, A., Principe, J., Larsen, J., Adali, T., Douglas, S. (eds.) Proceedings of the IEEE Workshop on Machine Learning for Signal Processing (2004)
27. Ypma, A., Heskes, T.: Novel approximations for inference in nonlinear dynamical systems using expectation propagation. Neurocomputing 69, 85–99 (2005)
28. Yu, B.M., Shenoy, K.V., Sahani, M.: Expectation propagation for inference in nonlinear dynamical models with Poisson observations. In: Proc. IEEE Nonlinear Statistical Signal Processing Workshop (2006)
29. Doucet, A., de Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001)
30. van der Merwe, R., Wan, E.: Sigma-point Kalman filters for probabilistic inference in dynamic state-space models. In: Proceedings of the Workshop on Advances in Machine Learning (2003)
31. Kotecha, J.H., Djuric, P.M.: Gaussian particle filtering. IEEE Transactions on Signal Processing 51(10), 2592–2601 (2003)
32. MacKay, D.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
33. Smola, A., Vishwanathan, V., Eskin, E.: Laplace propagation. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge (2004)
34. Minka, T., Lafferty, J.: Expectation-propagation for the generative aspect model. In: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 352–359 (2002)
35. Doucet, A., Godsill, S., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(3), 197–208 (2000)
36. Neal, R.M., Beal, M.J., Roweis, S.T.: Inferring state sequences for non-linear systems with embedded hidden Markov models. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge (2004)
37. Chan, S.S., Moran, D.W.: Computational model of a primate arm: from hand position to joint angles, joint torques and muscle forces. J. Neural Eng. 3, 327–337 (2006)

Estimating Internal Variables of a Decision Maker's Brain: A Model-Based Approach for Neuroscience

Kazuyuki Samejima¹ and Kenji Doya²

¹ Brain Science Institute, Tamagawa University, 6-1-1 Tamagawa-gakuen, Machida, Tokyo 194-8610, Japan, [email protected]
² Initial Research Project, Okinawa Institute of Science and Technology, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan, [email protected]

Abstract. A major problem in the search for neural substrates of learning and decision making is that the process is highly stochastic and subject-dependent, making simple stimulus- or output-triggered averaging inadequate. This paper presents a novel approach of characterizing neural recording or brain imaging data in reference to the internal variables of learning models (such as connection weights and parameters of learning) estimated from the history of external variables within a Bayesian inference framework. We specifically focus on reinforcement learning (RL) models of decision making and derive an estimation method for the variables by particle filtering, a recent method of dynamic Bayesian inference. We present the results of its application to decision-making experiments in monkeys and humans. The framework is applicable to a wide range of behavioral data analysis and diagnosis.

1 Introduction

The traditional approach in neuroscience to discovering information processing mechanisms is to correlate neuronal activities with external physical variables, such as sensory stimuli or motor outputs. However, when we search for neural correlates of higher-order brain functions, such as attention, memory, and learning, a problem has been that there are no external physical variables to correlate with. Recently, with advances in computational neuroscience, a number of computational models of such cognitive or learning processes have been developed that make quantitative predictions of a subject's behavioral responses. Thus a possible new approach is to try to find neural activities that correlate with the internal variables of such computational models (Corrado and Doya, 2007). A major issue in such model-based analysis of neural data is how to estimate the hidden variables of the model. For example, in learning agents, hidden variables such as connection weights change over time. In addition, the course of learning is regulated by hidden meta-parameters such as learning rates. Another important issue is how to judge the validity of a model or to select the best model among a number of candidates.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 596–603, 2008. © Springer-Verlag Berlin Heidelberg 2008


The framework of Bayesian inference can provide coherent solutions to these issues: estimating hidden variables, including meta-parameters, from observable experimental data, and selecting the most plausible computational model out of multiple candidates. In this paper, we first review the reinforcement learning model of reward-based decision making (Sutton and Barto, 1998) and derive a Bayesian estimation method for the hidden variables of a reinforcement learning model by particle filtering (Samejima et al., 2004). We then review examples of application of the method to monkey neural recording (Samejima et al., 2005) and human imaging studies (Haruno et al., 2004; Tanaka et al., 2006; Behrens et al., 2007).

2 Reinforcement Learning Model as an Animal or Human Decision Maker

Reinforcement learning can be a model of animal or human decision making based on reward delivery. Notably, the responses of monkey midbrain dopamine neurons are successfully explained by the temporal difference (TD) error of reinforcement learning models (Schultz et al., 1997). The goal of reinforcement learning is to improve the policy, the rule of taking an action a_t at state s_t, so that the resulting rewards r_t are maximized in the long run. The basic strategy of reinforcement learning is to estimate the cumulative future reward under the current policy as the value function for each state, and then to improve the policy based on the value function. In a standard reinforcement learning algorithm called "Q-learning," an agent learns the action-value function

Q(s_t, a_t) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + … | s_t, a_t ]   (1)

which estimates the cumulative future reward when action a_t is taken at state s_t. The discount factor 0 < γ < 1 is a meta-parameter that controls the time scale of prediction. The policy of the learner is then given by comparing action-values, e.g., according to the Boltzmann distribution

π(a | s_t) = exp(β Q(a, s_t)) / Σ_{a′∈A} exp(β Q(a′, s_t))   (2)

where the inverse temperature β > 0 is another meta-parameter that controls the randomness of action selection. From an experience of state s_t, action a_t, reward r_t, and next state s_{t+1}, the action-value function is updated by the Q-learning algorithm (Sutton and Barto, 1998) as

δ_t = r_t + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t)
Q(s_t, a_t) ⇐ Q(s_t, a_t) + α δ_t,   (3)


where α > 0 is the meta-parameter for the learning rate. Thus, for a reinforcement learning agent, we have three meta-parameters. Such a reinforcement learning model of behavior learning not only predicts a subject's actions, but can also provide candidates for the brain's internal processes of decision making, which may be captured in neural recording or brain imaging data. However, a big problem is that the predictions depend on the settings of the meta-parameters: the learning rate α, action randomness β, and discount factor γ.
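As an illustration, the softmax policy (2) and the Q-learning update (3) can be sketched in a few lines for the one-state bandit case (where the max over next-state values drops out, so γ plays no role). The meta-parameter values and reward probabilities below are illustrative, not the settings used in the experiments.

```python
import numpy as np

# Q-learning with softmax (Boltzmann) action selection on a two-armed bandit.
rng = np.random.default_rng(0)
alpha, beta = 0.1, 2.0             # learning rate, inverse temperature
p_reward = np.array([0.9, 0.1])    # reward probabilities of the two actions
Q = np.zeros(2)

for t in range(5000):
    pi = np.exp(beta * Q) / np.exp(beta * Q).sum()  # policy (2)
    a = rng.choice(2, p=pi)                         # sample an action
    r = float(rng.random() < p_reward[a])           # stochastic binary reward
    delta = r - Q[a]                                # TD error (3) with gamma = 0
    Q[a] += alpha * delta                           # value update (3)
```

After many trials each Q(a) fluctuates around the reward probability of that action, and the dependence of this behavior on (α, β) is exactly what the estimation problem below must invert from observed choices.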

3 Probabilistic Dynamic Evolution of Internal Variables for a Q-Learning Agent

Let us consider the problem of estimating the course of the action-values {Q_t(s, a); s ∈ S, a ∈ A, 0 < t < T} and the meta-parameters α, β, and γ of a reinforcement learner by only observing the sequence of states s_t, actions a_t, and rewards r_t. We use a Bayesian method of estimating a dynamical hidden variable {x_t; t ∈ N} from a sequence of observable variables {y_t; t ∈ N} to solve this problem. We assume that the unobservable signal (hidden variable) is modeled as a Markov process with initial distribution p(x_0) and transition probability p(x_{t+1} | x_t). The observations {y_t; t ∈ N} are assumed to be conditionally independent given the process {x_t; t ∈ N}, with marginal distribution p(y_t | x_t). The problem to solve in this setting is to estimate recursively in time the posterior distribution of the hidden variable p(x_{0:t} | y_{1:t}), where x_{0:T} = {x_0, …, x_T} and y_{1:T} = {y_1, …, y_T}. The marginal distribution is given by the recursive procedure of the following prediction and updating steps:

Predicting:   p(x_t | y_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1}

Updating:   p(x_t | y_{1:t}) = p(y_t | x_t) p(x_t | y_{1:t−1}) / ∫ p(y_t | x_t) p(x_t | y_{1:t−1}) dx_t
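On a discrete state space the prediction and updating steps become exact matrix operations, which makes the recursion concrete; the three-state chain and observation model below are illustrative toy numbers, not from the paper.

```python
import numpy as np

# The predicting/updating recursion computed exactly on a discrete state
# space, where the integrals become sums.
P_trans = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])   # P_trans[i, j] = p(x_t = j | x_{t-1} = i)
P_obs = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.2, 0.7]])     # P_obs[y, x] = p(y_t = y | x_t = x)
belief = np.ones(3) / 3                 # p(x_0): uniform prior
for y in [0, 0, 1, 2, 2]:
    belief = P_trans.T @ belief         # predicting step
    belief = P_obs[y] * belief          # updating step: likelihood times prediction
    belief /= belief.sum()              # ...divided by the normalizer
```

For continuous hidden variables these sums are intractable in general, which is why the text turns to the particle filter: the same two steps are carried out on a population of samples rather than on an explicit distribution.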

We use a numerical method to solve this Bayesian recursion, called the particle filter (Doucet et al., 2001). In the particle filter, the distribution over the sequence of hidden variables is represented by a set of random samples, also named "particles". We use a bootstrap filter to calculate the recursion of the prediction and updating steps over the distribution of particles (Doucet et al., 2001).

Figure 1 shows the dynamical Bayesian network representation of the evolution of internal variables in a Q-learning agent. The hidden variable x_t consists of the action-values Q(s, a) for each state-action pair, the learning rate α, the inverse temperature β, and the discount factor γ. The observable variable y_t consists of the states s_t, actions a_t, and rewards r_t.

The observation probability p(y_t | x_t) is given by the softmax action selection (2). The transition probability p(x_{t+1} | x_t) of the hidden variable is given by the Q-learning rule (3) and an assumption on the meta-parameter dynamics. Here we assume that the meta-parameters (α, β, and γ) are constant with small drifts. Because α, β, and γ should all be positive, we assume random-walk dynamics in logarithmic space:

log(x_{t+1}) = log(x_t) + ε_x,   ε_x ~ N(0, σ_x)   (4)

where σ_x is a meta-meta-parameter that defines the random-walk variability of the meta-parameters.
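A minimal bootstrap particle filter of the kind described can be sketched as follows for a one-state agent, with hidden particle state (Q_L, Q_R, log α, log β) and the random-walk dynamics (4). The task settings, particle count, and priors are illustrative assumptions, not the settings of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(q, beta):
    e = np.exp(beta * (q - np.max(q)))
    return e / e.sum()

# Simulate choices of a "true" Q-learning agent on a two-armed bandit.
alpha_true, beta_true = 0.2, 3.0
p_reward = np.array([0.8, 0.2])
Q_true, data = np.zeros(2), []
for t in range(300):
    a = rng.choice(2, p=softmax(Q_true, beta_true))
    r = float(rng.random() < p_reward[a])
    data.append((a, r))
    Q_true[a] += alpha_true * (r - Q_true[a])

# Bootstrap filter over N particles for x_t = (Q_L, Q_R, log alpha, log beta).
N, sigma_x = 2000, 0.02                      # sigma_x: meta-meta-parameter in (4)
Qp = np.zeros((N, 2))
la = np.log(rng.uniform(0.01, 1.0, N))       # particles for log alpha
lb = np.log(rng.uniform(0.1, 10.0, N))       # particles for log beta
for a, r in data:
    la += rng.normal(0.0, sigma_x, N)        # predicting: random walk (4)
    lb += rng.normal(0.0, sigma_x, N)
    qc = Qp - Qp.max(axis=1, keepdims=True)  # updating: softmax likelihood (2)
    e = np.exp(np.exp(lb)[:, None] * qc)
    w = e[:, a] / e.sum(axis=1)
    idx = rng.choice(N, size=N, p=w / w.sum())   # bootstrap resampling
    Qp, la, lb = Qp[idx], la[idx], lb[idx]
    Qp[:, a] += np.exp(la) * (r - Qp[:, a])  # propagate values by rule (3)

alpha_hat, beta_hat = float(np.exp(la).mean()), float(np.exp(lb).mean())
```

The posterior means over particles give trial-by-trial estimates of the action values and meta-parameters from the observed choice/reward sequence alone, which is the quantity correlated with neural data in the applications below.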

Fig. 1. A Bayesian network representation of a Q-learning agent: the dynamics of the observable and unobservable variables depend on the decision, reward probability, state transition, and update rule for the value function. Circles: hidden variables. Double boxes: observable variables. Arrows: probabilistic dependency.

4 Computational Model-Based Analysis of Brain Activity

4.1 Application to Monkey Choice Behavior and Striatal Neural Activity

Samejima et al. (2005) used the internal-variable estimation approach with a Q-learning model for a monkey free-choice task, a two-armed bandit problem


(Figure 2). The task has only one state, two actions, and stochastic binary reward. The reward probability for each action is fixed within a block of 30–150 trials, but is randomly chosen from five probability combinations, block by block. The reward probabilities P(a=L) for action a=L and P(a=R) for action a=R are selected randomly from five settings, [P(a=L), P(a=R)] ∈ {[0.5, 0.5], [0.5, 0.1], [0.1, 0.5], [0.5, 0.9], [0.9, 0.5]}, at the beginning of each block.

Fig. 2. Two-armed bandit task for the monkey's behavioral choice. Upper panel: time course of the task. The monkey faced a panel in which three LEDs (right, left, and up) were embedded, with a small LED in the middle. When the small LED was illuminated red, the monkey grasped a handle with its right hand and adjusted it to the center position. If the monkey held the handle at the center position for 1 s, the small LED was turned off as the GO signal. The monkey then turned the handle to either the right or the left side, which was associated with a shift of yellow LED illumination from up to the turned direction. After 0.5 s, the color of the LED changed from yellow to either green or red. A green LED was followed by a large amount of reward water, while a red LED was followed by a small amount of water. Lower panel: state diagram of the task. The circle indicates the state. Arrows indicate possible actions and state transitions.

The Q-learning model of monkey behavior tries to learn the reward expectation of each action, the action value, and to maximize the reward acquired in each block. Because the task has only one state, the agent does not need to take into account the next state's value, and thus we set the discount factor to γ = 0. Samejima et al. (2005) showed that the computed internal variable, the action value for a particular movement direction (left/right), estimated from the past history of choices and outcomes (rewards), could predict the monkey's future choice probability (Figure 3). The action value is an example of a variable that could not be immediately


Fig. 3. Time course of predicted choice probability and estimated action values. Upper panel: an example history of actions (red=right, blue=left), rewards (dot=small, circle=large), choice ratio (cyan line, Gaussian smoothed, σ=2.5), and predicted choice probability (black line). The color of the upper bar indicates the reward probability combination. Lower panel: estimated action values (blue=Q-value for left, red=Q-value for right). (From Samejima et al., 2005.)

Fig. 4. An example of the activity of a caudate neuron plotted in the space of estimated action values Q_L(t) and Q_R(t). Left panel: 3-dimensional plot of neural activity against estimated Q_L(t) and Q_R(t). Right panel: 2-D projected plots of the discharge rates of the neuron on the Q_L axis (left side) and the Q_R axis (right side). Grey lines are derived from a regression model. Circles and error bars indicate the average and standard deviation of neural discharge rates for each of 10 equally populated action-value bins. (From Samejima et al., 2005.)

obvious from observable experimental parameters but can be inferred using an action-predicting computational model. Furthermore, the activity of most dorsal striatum projection neurons correlated with the estimated action value for a particular action (Figure 4).

4.2 Application to Human Imaging Data

Not only the internal variables but also the meta-parameters (e.g., learning rate, action stochasticity, and discount rate for future reward) can be estimated by this methodology. Although the values of the learning meta-parameters might differ across individual subjects, the model-based approach can track subjective internal values under different meta-parameters. Especially in human imaging studies, this


capability is effective for extracting common neuronal circuit activation in multi-subject experiments. One problem in cognitive neuroscience with decision-making tasks is the lack of controllability of internal variables. In conventional analyses in neuroscience and brain-imaging studies, the experimenter tries to control a cognitive state or an assumed internal parameter through a task demand or an experimental setting. Observed brain activities are then compared to the assumed parameter. However, the subjective internal variables may depend on personal behavioral tendencies and may differ from the parameter assumed by the experimenter. The Bayesian estimation method for internal variables, including meta-parameters, can reduce such noise from personal differences by fitting the meta-parameters. Tanaka et al. (2006) showed that the variety of behavioral tendencies across multiple human subjects could be characterized by the estimated meta-parameters of a Q-learning agent. Figure 5 shows the distribution of the three meta-parameters: learning rate α, action stochasticity β, and discount rate γ. Subjects whose estimated γ was lower tended to be trapped in a locally optimal policy and could not reach the optimal choice sequence (Figure 5, left panel). On the other hand, subjects whose learning rate α and inverse temperature β were estimated to be lower than the others reported in a post-experimental questionnaire that they could not find any confident action selection in each state, even in later experimental sessions of the task (Figure 5, right panel). Regardless of the variety of subjects' behavioral tendencies, an fMRI signal correlated with the estimated action value of the selected action was observed in the ventral striatum in the unpredictable condition, in which the state transitions were completely random, whereas dorsal striatum activity was correlated with action value in the predictable environment, in which the state transitions were deterministic.
This suggests that different cortico-basal ganglia circuits might be involved depending on the predictability of the environment (Tanaka et al., 2006).

Fig. 5. Subject distribution of estimated meta-parameters: learning rate α, action stochasticity (inverse temperature) β, and discount rate γ. Left panel: distribution in α–γ space. Subjects LI, NN, and NT (indicated inside the ellipsoid) were trapped in a locally optimal action sequence. Right panel: distribution in α–β space. Subjects BB and LI (indicated inside the ellipsoid) reported that they could not find any confident strategy.


5 Conclusion

The theoretical framework of reinforcement learning for modeling behavioral decision making, together with Bayesian estimation methods for subjective internal variables, can be a powerful tool for analyzing both neural recordings (Samejima et al., 2005) and human imaging data (Daw et al., 2006; Pessiglione et al., 2006; Tanaka et al., 2006). In particular, tracking the meta-parameters of RL can capture behavioral tendencies of animal or human decision making. Recently, a correlation between anterior cingulate cortex activity and learning rate under uncertain environmental change was reported using a Bayesian decision model with a temporally evolving learning-rate parameter (Behrens et al., 2007). Although not detailed in this paper, the Bayesian estimation framework also provides a way of objectively selecting the best model for the given data. The combination of Bayesian model selection and hidden variable estimation methods would contribute to a new understanding of the decision mechanisms of our brain through falsifiable hypotheses and objective experimental tests.

References

1. Behrens, T.E., Woolrich, M.W., Walton, M.E., Rushworth, M.F.: Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007)
2. Corrado, G., Doya, K.: Understanding neural coding through the model-based analysis of decision making. J. Neurosci. 27, 8178–8180 (2007)
3. Daw, N.D., O'Doherty, J.P., Dayan, P., Seymour, B., Dolan, R.J.: Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006)
4. Doucet, A., de Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001)
5. Haruno, M., Kuroda, T., Doya, K., Toyama, K., Kimura, M., Samejima, K., Imamizu, H., Kawato, M.: A neural correlate of reward-based behavioral learning in caudate nucleus: a functional magnetic resonance imaging study of a stochastic decision task. J. Neurosci. 24, 1660–1665 (2004)
6. Pessiglione, M., Seymour, B., Flandin, G., Dolan, R.J., Frith, C.D.: Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature 442, 1042–1045 (2006)
7. Samejima, K., Doya, K., Ueda, Y., Kimura, M.: Estimating internal variables and parameters of a learning agent by a particle filter. In: Advances in Neural Information Processing Systems, vol. 16. The MIT Press, Cambridge (2004)
8. Samejima, K., Ueda, Y., Doya, K., Kimura, M.: Representation of action-specific reward values in the striatum. Science 310, 1337–1340 (2005)
9. Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Science 275, 1593–1599 (1997)
10. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. The MIT Press, Cambridge (1998)
11. Tanaka, S.C., Samejima, K., Okada, G., Ueda, K., Okamoto, Y., Yamawaki, S., Doya, K.: Brain mechanism of reward prediction under predictable and unpredictable environmental dynamics. Neural Netw. 19, 1233–1241 (2006)

Visual Tracking Achieved by Adaptive Sampling from Hierarchical and Parallel Predictions

Tomohiro Shibata^1, Takashi Bando^2, and Shin Ishii^{1,3}

^1 Graduate School of Information Science, Nara Institute of Science and Technology, [email protected]
^2 DENSO Corporation
^3 Graduate School of Informatics, Kyoto University

Abstract. Because visual information is inherently ill-posed, the brain essentially needs prior knowledge, prediction, or hypotheses to acquire a meaningful solution. From a computational point of view, visual tracking is the real-time process of statistical spatiotemporal filtering of target states from an image stream, and incremental Bayesian computation is one of its most important devices. To make the Bayesian computation of the posterior density of the state variables tractable for any type of probability distribution, particle filters (PFs) have often been employed in real-time vision. In this paper, we briefly review incremental Bayesian computation and PFs for visual tracking, indicate drawbacks of PFs, and then propose our framework, in which hierarchical and parallel predictions are integrated by adaptive sampling to achieve an appropriate balance between tracking accuracy and robustness. Finally, we discuss the proposed model from the viewpoint of neuroscience.

1 Introduction

Because visual information is inherently ill-posed, the brain essentially needs prior knowledge, prediction, or hypotheses to acquire a meaningful solution. Prediction is also essential for real-time recognition and visual tracking. Due to the flood of visual data, examining all of the data is infeasible, and ignoring irrelevant data is essential. The primate fovea and oculomotor control can be viewed from this standpoint: high visual acuity is realized only by the narrow foveal region of the retina, so the visual axis has to be moved actively by oculomotor control. Computer vision, in particular real-time vision, faces the same computational problems discussed above, and attractive as well as feasible methods and applications have been developed in the light of particle filters (PFs) [4]. One of the key ideas of PFs is the importance sampling distribution, or proposal distribution, which can be viewed as prediction or attention that overcomes the computational problems discussed above. The aim of this paper is to propose a novel Bayesian visual tracking framework with hierarchically modeled state variables for single-object tracking, and to discuss PFs and our framework from the viewpoint of neuroscience.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 604–613, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Adaptive Sampling from Hierarchical and Parallel Predictions

2.1 Incremental Bayes and Particle Filtering

Particle filtering [4] is an approach to performing Bayesian estimation of intractable posterior distributions from time-series signals with non-Gaussian noise, generalizing traditional Kalman filtering. This approach has been attracting attention in various research areas, including real-time visual processing (e.g., [5]). In clutter, there are usually several competing observations, and these cause the posterior to be multi-modal and therefore non-Gaussian. In practice, using a large number of particles is not allowed, especially for real-time processing, so there is a strong demand to reduce the number of particles. Reducing the number of particles, however, can sacrifice the accuracy and robustness of filtering, particularly when the dimension of the state variables is high. How to cope with this trade-off has been one of the most important computational issues in PFs, but there have been few efforts to reduce the dimension of the state variables in the context of PFs. Making use of hierarchy in the state variables seems a natural solution to this problem. For example, in the case of estimating the pose of a head from a video stream, with state variables consisting of the three-dimensional head pose and the two-dimensional positions of face features on the image, the high-dimensional state space can be divided into two groups by using the causality in their states, i.e., the head pose strongly affects the positions of the face features (cf. Fig. 1, excluding the dotted arrows). There are, however, two big problems in this setting. First, the estimation of the lower state variables depends strongly on the estimation of the higher state variables. In real applications, it often happens that the assumed generative relation from the higher state variable to the lower state variable is violated. Second, the assumed generative relation from the lower state variable to the input, typically image frames, can also be violated. These two problems lead to failure in the estimation.
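The prediction, correction, and resampling cycle of incremental Bayesian filtering that PFs implement can be sketched as a generic bootstrap filter. This is an illustrative 1-D toy with Gaussian transition and observation models assumed by us, not the method developed below:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pf_step(particles, weights, z, trans_std=0.1, obs_std=0.5):
    """One incremental-Bayes step: predict via the transition model,
    weight by the likelihood p(z|x), then resample."""
    # Prediction: sample from the proposal p(x_t | x_{t-1})
    particles = particles + rng.normal(0.0, trans_std, size=particles.shape)
    # Correction: weight each particle by a Gaussian likelihood
    weights = weights * np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
    weights /= weights.sum()
    # Resampling to avoid weight degeneracy
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# Track a slowly drifting state from noisy observations
particles = rng.normal(0.0, 1.0, size=500)
weights = np.full(500, 1.0 / 500)
for z in [0.1, 0.2, 0.3, 0.4, 0.5]:
    particles, weights = bootstrap_pf_step(particles, weights, z)
estimate = particles.mean()
```

Here the transition model itself serves as the proposal distribution; the paper's point is that a better proposal spends the limited particles more wisely.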

2.2 Estimation of Hierarchically-Modeled State Variables

Here we present a novel framework with hierarchically modeled state variables for single-object tracking. The intuition behind our approach is that the higher and lower layers each have their own dynamics, and mixing their predictions into the proposal distribution according to their reliability adds both robustness and accuracy to tracking with a smaller number of particles. We assume there are two continuous state vectors at time step t, denoted by a_t ∈ R^{N_a} and x_t ∈ R^{N_x}, which are hierarchically modeled as in Fig. 1. Our goal is then to estimate these unobservable states from an observation sequence z_{1:t}.

State Estimation. According to Bayes' rule, the joint posterior density p(a_t, x_t | z_{1:t}) is given by

p(a_t, x_t | z_{1:t}) ∝ p(z_t | a_t, x_t) p(a_t, x_t | z_{1:t-1}),


Fig. 1. A graphical model of the hierarchical and parallel dynamics. The right panels depict example physical representations processed in the layers.

where p(a_t, x_t | z_{1:t-1}) is the joint prior density and p(z_t | a_t, x_t) is the likelihood. The joint prior density is given by the previous joint posterior density p(a_{t-1}, x_{t-1} | z_{1:t-1}) and the state transition model p(a_t, x_t | a_{t-1}, x_{t-1}) as follows:

p(a_t, x_t | z_{1:t-1}) = ∫∫ p(a_t, x_t | a_{t-1}, x_{t-1}) p(a_{t-1}, x_{t-1} | z_{1:t-1}) da_{t-1} dx_{t-1}.

The state vectors a_t and x_t are assumed to be generated from the hierarchical model shown in Fig. 1. Furthermore, the dynamics model of a_t is assumed to be a temporal Markov chain, with conditional independence between a_{t-1} and x_t given a_t. Under these assumptions, the state transition model and the joint prior density can be represented as

p(a_t, x_t | a_{t-1}, x_{t-1}) = p(x_t | a_t) p(a_t | a_{t-1})

and

p(a_t, x_t | z_{1:t-1}) = p(x_t | a_t) p(a_t | z_{1:t-1}),

respectively. Then, we can carry out hierarchical computation of the joint posterior distribution by PF as follows: first, the higher-layer samples {a_t^{(i)} | i = 1, ..., N_a} are drawn from the prior density p(a_t | z_{1:t-1}). The lower-layer samples {x_t^{(j)} | j = 1, ..., N_x} are then drawn from the prior density p(x_t | z_{1:t-1}), described by the proposal distribution p(x_t | a_t). Finally, the weights of the samples are given by the likelihood p(z_t | a_t, x_t).

Adaptive Mixing of Proposal Distributions. When applied to state estimation problems in the real world, the above proposal distribution can be contaminated by severe non-Gaussian noise and/or by violations of the model assumptions. Especially in the case of PF estimation with a small number of particles, a contaminated proposal distribution can fatally disturb the


estimation. In this study, we assume an additional state transition in the lower layer, independent of the upper layer (the dotted arrows in Fig. 1), and we achieve robust and accurate estimation by adaptively determining the contribution of this hypothetical state transition. The state transition p(a_t, x_t | a_{t-1}, x_{t-1}) can be represented as

p(a_t, x_t | a_{t-1}, x_{t-1}) = p(x_t | a_t, x_{t-1}) p(a_t | a_{t-1}).   (1)

Here, the dynamics model of x_t is assumed to be

p(x_t | a_t, x_{t-1}) = α_{a,t} p(x_t | a_t) + α_{x,t} p(x_t | x_{t-1}),   (2)

with α_{a,t} + α_{x,t} = 1. In our algorithm, p(x_t | a_t, x_{t-1}) is modeled as a mixture of the approximated prediction densities computed in the lower and higher layers, p(x_t | x_{t-1}) and p(x_t | a_t). Its mixture ratio α_t = {α_{a,t}, α_{x,t}}, representing the contribution of each layer, is determined by a procedure that enables interaction between the layers; we describe how α_t is determined in the following subsection. From Eqs. (1) and (2), the joint prior density p(a_t, x_t | z_{1:t-1}) is given as

p(a_t, x_t | z_{1:t-1}) = p(x_t | a_t, z_{1:t-1}) p(a_t | z_{1:t-1}) = π(x_t | α_t) p(a_t | z_{1:t-1}),

where

π(x_t | α_t) = α_{a,t} p(x_t | a_t) + α_{x,t} p(x_t | z_{1:t-1})   (3)

is the adaptively mixed proposal distribution for x_t, based on the prediction densities p(x_t | a_t) and p(x_t | z_{1:t-1}).

Determination of α_t Using an On-line EM Algorithm. The mixture ratio α_t is the parameter determining the adaptively mixed proposal distribution π(x_t | α_t), and determining it by minimizing the KL divergence between the posterior density in the lower layer, p(x_t | a_t, z_{1:t}), and π(x_t | α_t) gives robust and accurate estimation. Determining α_t is equivalent to determining the mixture ratio of a two-component mixture model, and we employ sequential maximum-likelihood (ML) estimation of the mixture ratio. In our method, the index variable for component selection becomes the latent variable, and therefore the sequential ML estimation is implemented by means of an on-line EM algorithm [11]. By resampling from the posterior density p(x_t | a_t, z_{1:t}), we obtain N_x samples x̃_t^{(i)}. Using the latent variable m ∈ {m_a, m_x}, which indicates which prediction density, p(x_t | a_t) or p(x_t | z_{1:t-1}), is trusted, the on-line log likelihood can then be represented as

L(α_t) = η_t Σ_{τ=1}^t ( Π_{s=τ+1}^t λ_s ) Σ_{i=1}^{N_x} log π(x̃_τ^{(i)} | α_t)
       = η_t Σ_{τ=1}^t ( Π_{s=τ+1}^t λ_s ) Σ_{i=1}^{N_x} log Σ_m p(x̃_τ^{(i)}, m | α_τ),

608

T. Shibata, T. Bando, and S. Ishii

1. Estimation of the state variable x_t in the lower layer.
   – Obtain α_{a,t} N_x samples x_{a,t}^{(i)} from p(x_t | a_t).
   – Obtain α_{x,t} N_x samples x_{x,t}^{(i)} from p(x_t | z_{1:t-1}).
   – Obtain the expectation x̂_t and Std(x_t) of p(x_{n,t} | a_t, z_{1:t}) using the N_x mixture samples x_t^{(i)}, constituted by x_{x,t}^{(i)} and x_{a,t}^{(i)}.
   The above procedure is applied to each feature, yielding {x̂_{n,t}, Std(x_{n,t})}.
2. Estimation of the state variable a_t in the higher layer.
   – Obtain D_{a,n}(t) based on Std(x_{n,t}), and then estimate p(a_t | z_{1:t}).
3. Determination of the mixture ratio α_{t+1}.
   – Obtain N_x samples x̃_t^{(i)} from p(x_{n,t} | a_t, z_{1:t}).
   – Calculate α_{t+1} so as to maximize the on-line log likelihood.
   The above procedure is applied to each feature, yielding {α_{n,t+1}}.

Fig. 2. Hierarchical pose estimation algorithm
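Step 1 of the algorithm in Fig. 2 (drawing α_{a,t} N_x samples from the higher-layer prediction and the remainder from the lower-layer prediction, then weighting the mixture by the likelihood) can be sketched as a 1-D toy. The Gaussian predictions and all numeric values here are illustrative stand-ins, not the paper's densities:

```python
import numpy as np

rng = np.random.default_rng(1)

def lower_layer_step(alpha_a, pred_a, pred_x, z, n_x=200, obs_std=0.3):
    """Step 1 of the hierarchical algorithm as a 1-D toy: draw alpha_a*n_x
    samples from the higher-layer prediction p(x_t|a_t) and the rest from
    the lower-layer prediction p(x_t|z_{1:t-1}), then weight the mixture by
    a Gaussian likelihood. pred_a / pred_x are (mean, std) pairs."""
    n_a = int(round(alpha_a * n_x))
    xs = np.concatenate([
        rng.normal(pred_a[0], pred_a[1], size=n_a),        # from p(x_t | a_t)
        rng.normal(pred_x[0], pred_x[1], size=n_x - n_a),  # from p(x_t | z_{1:t-1})
    ])
    w = np.exp(-0.5 * ((z - xs) / obs_std) ** 2)
    w /= w.sum()
    mean = float(np.sum(w * xs))                        # expectation of p(x_t | a_t, z_{1:t})
    std = float(np.sqrt(np.sum(w * (xs - mean) ** 2)))  # Std(x_t) fed to the higher layer
    return xs, w, mean, std

# Higher layer predicts the feature near 1.0, lower layer near 0.0;
# the observation z = 1.0 supports the higher-layer prediction.
xs, w, mean, std = lower_layer_step(0.7, (1.0, 0.2), (0.0, 0.2), z=1.0)
```

The posterior mean and standard deviation computed here correspond to the quantities passed upward in step 2 of Fig. 2.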

where λ_s is a decay constant that attenuates the adverse influence of earlier, inaccurate estimates, and

η_t = ( Σ_{τ=1}^t Π_{s=τ+1}^t λ_s )^{-1}

is a normalization constant that works as a learning coefficient. The mixture ratio α*_t that is optimal in the KL-divergence sense, which gives the optimal proposal distribution π(x_t | α*_t), can be calculated by maximization of the on-line log likelihood as follows:

α*_{m,t} = ⟨m⟩_t / Σ_{m'∈{m_a,m_x}} ⟨m'⟩_t,

where

p(x̃_t^{(i)}, m_a | α_t) = α_{a,t} p(x̃_t^{(i)} | a_t),     p(x̃_t^{(i)}, m_x | α_t) = α_{x,t} p(x̃_t^{(i)} | z_{1:t-1}),

and

⟨m⟩_t = η_t Σ_{τ=1}^t ( Π_{s=τ+1}^t λ_s ) Σ_{i=1}^{N_x} p(m | x̃_τ^{(i)}, α_τ),     p(m | x̃_t^{(i)}, α_t) = p(x̃_t^{(i)}, m | α_t) / Σ_{m'∈{m_a,m_x}} p(x̃_t^{(i)}, m' | α_t).

Note that ⟨m⟩_t can be calculated incrementally.
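The incremental update above can be sketched as follows. This is our own minimal 1-D implementation with a hypothetical interface (likelihood vectors passed in directly); the normalization constant η_t is omitted because it cancels in the mixture ratio, and the clipping of α to [0.2, 0.8] follows the implementation described in Sect. 2.4:

```python
import numpy as np

def online_em_alpha(lik_a, lik_x, stats, alpha, lam=0.5, clip=(0.2, 0.8)):
    """One on-line EM step for the mixture ratio alpha_{a,t} (a sketch,
    not the authors' code). lik_a[i] = p(x~_t^(i) | a_t) and
    lik_x[i] = p(x~_t^(i) | z_{1:t-1}); stats = (m_a, m_x) are decayed
    sums of responsibilities (eta_t cancels in the ratio)."""
    # E-step: responsibilities p(m_a | x~_i, alpha_t)
    joint_a = alpha * np.asarray(lik_a)
    joint_x = (1.0 - alpha) * np.asarray(lik_x)
    resp_a = joint_a / (joint_a + joint_x)
    # Decay past statistics by lam, then accumulate the new responsibilities
    m_a = lam * stats[0] + resp_a.sum()
    m_x = lam * stats[1] + (1.0 - resp_a).sum()
    # M-step: alpha* = <m_a>_t / (<m_a>_t + <m_x>_t), clipped to [0.2, 0.8]
    return float(np.clip(m_a / (m_a + m_x), *clip)), (m_a, m_x)

# If the higher-layer prediction consistently explains the resampled
# particles better, alpha_{a,t} rises to its upper limit:
alpha, stats = 0.5, (0.0, 0.0)
for _ in range(5):
    alpha, stats = online_em_alpha(np.full(50, 0.9), np.full(50, 0.1), stats, alpha)
```

In this toy run the ratio saturates at the upper clip value of 0.8, mirroring the behavior shown for occluded features in Fig. 3C.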

2.3 Application to Pose Estimation of a Rigid Object

Here the proposed method is applied to a real problem, pose estimation of a rigid object (cf. Fig. 1). The algorithm is shown in Fig. 2. The N_f features on the image plane at time step t are denoted by x_{n,t} = (u_{n,t}, v_{n,t})^T (n = 1, ..., N_f); the affine camera matrix that projects a 3D model of the object onto the image plane at time step t is, for simplicity, reshaped into a vector form a_t = (a_{1,t}, ..., a_{8,t})^T; and the observed image at time step t is denoted by z_t. When applied to the pose estimation of a rigid object, a Gaussian process is assumed in the higher layer and the object's pose is estimated by a Kalman filter,


while tracking of the features is performed by PF because of the severe non-Gaussian noise, e.g., occlusion, in the lower layer. To obtain the samples x_{n,t}^{(i)} from the mixture proposal distribution, we need two prediction densities, p(x_{n,t} | a_t) and p(x_{n,t} | z_{1:t-1}). The prediction density computed in the higher layer, p(x_{n,t} | a_t), which encodes the physical relationships among the features, is given by the affine projection process of the 3D model of the rigid object.
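The affine projection that yields the higher-layer prediction can be sketched as follows. The reshaping of the 8-vector a_t into a 2x4 affine camera matrix is our assumption about the parameterization, and the numeric values are purely illustrative:

```python
import numpy as np

def project_features(a_t, model_points):
    """Higher-layer prediction p(x_{n,t} | a_t): affine projection of the
    3-D model features onto the image plane. a_t is the 8-vector reshaped
    into a 2x4 affine camera matrix (our assumed parameterization).
    model_points: (N_f, 3) array of 3-D feature positions."""
    M = np.asarray(a_t, dtype=float).reshape(2, 4)
    # Homogeneous 3-D coordinates [X, Y, Z, 1]
    X = np.hstack([model_points, np.ones((len(model_points), 1))])
    return (M @ X.T).T  # (N_f, 2) predicted image positions (u, v)

# Identity-like affine camera: u = X + 10, v = Y + 20 (illustrative values)
a_t = np.array([1, 0, 0, 10, 0, 1, 0, 20], dtype=float)
pts = np.array([[0.0, 0.0, 5.0], [1.0, -1.0, 5.0]])
uv = project_features(a_t, pts)  # [[10, 20], [11, 19]]
```

Because the projection is linear in a_t, a Kalman filter in the higher layer is a natural fit, as the text states.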

2.4 Experiments

Simulations. The goal of this simulation is to estimate the posture of a rigid object, as in Fig. 3A, from an observation sequence of eight features. The rigid object was a hexahedral piece of size 20 (upper base) × 30 (lower base) × 20 (height) × 20 (depth) [mm]; it was rotated 150 mm away from a pin-hole camera (focal length: 40 mm) and projected onto the image plane (pixel size: 0.1 × 0.1 [mm]) by a perspective projection process disturbed by Gaussian noise (mean: 0, standard deviation: 1 pixel). The four features at the back of the object were occluded by the object itself. An occluded feature was represented by assuming that the standard deviation of the measurement noise grows to 100 pixels from a non-occluded value of 1 pixel. The other four features, at the front of the object, were not occluded. The length of the observation sequence of the features was 625 frames, i.e., about 21 s. In this simulation, we compared the performance of our proposed method with (i) the method with a fixed α_t = {1, 0}, which is equivalent to simple hierarchical modeling in which the prediction density computed in the higher layer is trusted at every time step, and (ii) the method with a fixed α_t = {0, 1}, which does not implement mutual interaction between the layers. The decay constant of the on-line EM algorithm was set at λ_t = 0.5. The pose estimated with the adaptive proposal distribution and α_{a,t} at each time step are shown in Figs. 3B and 3C, respectively. Here, the object's pose θ_t = {θ_{X,t}, θ_{Y,t}, θ_{Z,t}} was calculated from the estimated a_t using an extended Kalman filter (EKF). In our implementation, the maximum and minimum values of α_{a,t} were limited to 0.8 and 0.2, respectively, to keep the robustness from degenerating. As shown in Fig. 3B, our method achieved robust estimation against occlusion. Concurrently with the robust estimation of the object's pose, appropriate determination of the mixture ratio was exhibited.
For example, in the case of feature x_1, the prediction density computed in the higher layer was emphasized and predicted well using the 3D object model during the period in which x_1 was occluded, because the observed feature position, contaminated by severe noise, depressed the confidence of the lower-layer prediction density.

Real Experiments. To investigate the performance against the severe non-Gaussian noise existing in real environments, the proposed method was applied to the problem of estimating a driver's head pose from a real image sequence captured in a car. Face extraction/tracking from an image sequence is a well-studied problem because of its applicability to various areas, and several


Fig. 3. A: a sample simulation image. B: time course of the estimated object's pose. C: time course of α_{a,t} determined by the on-line EM algorithm. Gray backgrounds mark frames in which the feature was occluded.

PF algorithms have been proposed. However, for accurate estimation of state variables lying in a high-dimensional space, e.g., a human face, especially in the case of real-time processing, some technique for dimensionality reduction is required. The proposed method is expected to enable more robust estimation, in spite of limited computing resources, by exploiting the hierarchy of the state variables. The real image sequence was captured by a near-infrared camera mounted behind the steering wheel, i.e., the captured images did not contain color information, and the image resolution was 640 × 480. In such a visual tracking task, the true observation process p(z_t | x_{n,t}) is unknown because the true positions of the face features are unobservable; hence, a model of the observation process is needed for calculating the particle weights w_{n,t}^{(i)}. In this study, we employed normalized correlation with a template as the approximate observation model. Although this observation model may seem too simple for problems in real environments, it is sufficient for examining the efficiency of the proposal distribution. We employed the nose, eyes, canthi, eyebrows, and corners of the mouth as the face features. Using the 3D feature positions measured by 3D distance measuring equipment, we constructed the 3D face model. The proposed method was applied with 50 particles for each face feature, as in the simulation, and was run on a Pentium 4 (2.8 GHz) Windows 2000 PC with 1048 MB RAM. Our system processed one frame in 29.05 ms, and hence achieved real-time processing.
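A normalized-correlation observation model of the kind described can be sketched as follows; the exponential sharpening parameter beta is our own hypothetical choice, not specified in the paper:

```python
import numpy as np

def ncc_weight(patch, template, beta=5.0):
    """Particle weight from normalized correlation with a template, a
    hypothetical stand-in for the paper's approximate observation model.
    beta sharpens the correlation into a likelihood-like score."""
    p = patch - patch.mean()
    t = template - template.mean()
    ncc = (p * t).sum() / (np.linalg.norm(p) * np.linalg.norm(t) + 1e-12)
    return np.exp(beta * (ncc - 1.0))  # 1.0 when patch matches exactly

tmpl = np.arange(25.0).reshape(5, 5)
w_match = ncc_weight(tmpl, tmpl)            # identical patch: weight ~ 1
w_flip = ncc_weight(tmpl[::-1, ::-1], tmpl) # anti-correlated patch: tiny weight
```

Any monotone mapping from correlation to weight would serve; normalized correlation has the advantage of being invariant to local brightness and contrast changes, which matters for near-infrared imagery.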


Fig. 4. Mean estimation error when α_t was estimated by EM, fixed at α_t = {1, 0}, and fixed at α_t = {0, 1} (100 trials). Since some bars protruded beyond the figure, they were shortened and the error amounts are instead displayed on top of them.

Fig. 5. Tracked face features

Fig. 4 shows the estimation error of the head pose, with the driver's true head pose measured by a gyro sensor in the car. Our adaptively mixed proposal distribution achieved robust head pose estimation, as in the simulation task. In Fig. 5, the estimated face features are depicted by "+" marks for various head poses of the driver; the variance of each estimated feature position is represented by the size of the "+" mark, i.e., the larger a "+" is, the higher the estimation confidence of the estimator.

3 Modeling Primate's Visual Tracking by Particle Filters

Here we note that our computer vision study and primates' vision share the same computational problems and similar constraints: namely, both need to perform real-time spatiotemporal filtering of the visual data as robustly and accurately as


much as possible with limited computing resources. Although there are huge numbers of neurons in the brain, their firing rates are very noisy and much slower than recent personal computers. We can visually track only around four to six objects simultaneously (e.g., [2]). These facts indicate a limited attention resource. As mentioned at the beginning of this paper, it is widely known that only the foveal region of the retina can acquire high-resolution images in primates, and that humans usually make saccades mostly to the eyes, nose, lips, and contours when watching a human face. In other words, primates actively ignore irrelevant information in the face of massive image inputs. Furthermore, many behavioral and computational studies have reported that the brain appears to compute Bayesian statistics (e.g., [8][10]). As we discussed, however, Bayesian computation is intractable in general, and particle filters (PFs) are an attractive and feasible solution to this problem because they are very flexible, easy to implement, and parallelizable. Importance sampling is analogous to efficient delivery of computing resources. As a whole, we conjecture that the primate brain may employ PFs for visual tracking. Although one of the major drawbacks of PFs is that a large number of particles, typically exponential in the dimension of the state variables, is required for accurate estimation, our proposed framework, in which adaptive sampling integrates hierarchical and parallel predictive distributions, can be a solution. As demonstrated in Section 2 and in others' work [1], adaptive importance sampling from multiple predictions can balance both accuracy and robustness of the estimation with a restricted number of particles. Along this line, overt/covert smooth pursuit in primates could be a research target for investigating our conjecture. Based on the model of Shibata et al. [12], Kawawaki et al. investigated the human brain mechanism of overt/covert smooth pursuit with fMRI experiments and suggested that the activity of the anterior/superior lateral occipito-temporal cortex (a/sLOTC) was responsible for target motion prediction rather than for motor commands for eye movements [7]. Note that the LOTC includes the homologue of the monkey medial superior temporal (MST) area, which is responsible for visual motion processing (e.g., [9][6]). In their study, the mechanism for increasing the a/sLOTC activity remained unclear. The increase in a/sLOTC activity was observed particularly when subjects covertly pursued a blinking target motion. This blink condition might elicit two predictions, e.g., one emphasizing the observation and the other its belief, as proposed in [1], and require computational resources for adaptive sampling. Multiple predictions might also be performed in other brain regions such as the frontal eye field (FEF), the inferior temporal (IT) area, and the fusiform face area (FFA). It is known that the FEF is involved in smooth pursuit (e.g., [3]) and has reciprocal connections to the MST area (e.g., [13]), but how they work together is unclear. Visual tracking of more general objects, rather than of a small spot as a visual stimulus, requires a specific target representation and representations of distractors, including the background. Thus the IT, FFA, and other areas related to higher-order visual representation may make parallel predictions to deal with the varying target appearance during tracking.

4 Conclusion

In this paper, we first introduced particle filters (PFs) as an approximate incremental Bayesian computation and pointed out their drawbacks. We then proposed a novel framework for visual tracking based on PFs as a solution to these drawbacks. The keys of the framework are: (1) the high-dimensional state space is decomposed into hierarchical and parallel predictors that treat state variables of lower dimension, and (2) their integration is achieved by adaptive sampling. The feasibility of our framework has been demonstrated by real as well as simulation studies. Finally, we pointed out the computational problems shared by PFs and human visual tracking, presented our conjecture that at least the primate brain employs PFs, and discussed its plausibility and perspectives for future investigation.

References

1. Bando, T., Shibata, T., Doya, K., Ishii, S.: Switching particle filters for efficient visual tracking. Robot. Auton. Syst. 54(10), 873 (2006)
2. Cavanagh, P., Alvarez, G.A.: Tracking multiple targets with multifocal attention. Trends Cogn. Sci. 9(7), 349–354 (2005)
3. Fukushima, K., Yamanobe, T., Shinmei, Y., Fukushima, J., Kurkin, S., Peterson, B.W.: Coding of smooth eye movements in three-dimensional space by frontal cortex. Nature 419, 157–162 (2002)
4. Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F: Radar and Signal Processing 140, 107–113 (1993)
5. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. Int. J. Comput. Vis. 29(1), 5–28 (1998)
6. Kawano, K., Shidara, M., Watanabe, Y., Yamane, S.: Neural activity in cortical area MST of alert monkey during ocular following responses. J. Neurophysiol. 71(6), 2305–2324 (1994)
7. Kawawaki, D., Shibata, T., Goda, N., Doya, K., Kawato, M.: Anterior and superior lateral occipito-temporal cortex responsible for target motion prediction during overt and covert visual pursuit. Neurosci. Res. 54(2), 112
8. Knill, D.C., Pouget, A.: The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 27(12) (2004)
9. Newsome, W.T., Wurtz, R.H., Komatsu, H.: Relation of cortical areas MT and MST to pursuit eye movements. II. Differentiation of retinal from extraretinal inputs. J. Neurophysiol. 60(2), 604–620 (1988)
10. Rao, R.P.N.: Neural models of Bayesian belief propagation. In: The Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press, Cambridge (2006)
11. Sato, M., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Comput. 12(2), 407–432 (2000)
12. Shibata, T., Tabata, H., Schaal, S., Kawato, M.: A model of smooth pursuit in primates based on learning the target dynamics. Neural Netw. 18(3), 213
13. Tian, J.-R., Lynch, J.C.: Corticocortical input to the smooth and saccadic eye movement subregions of the frontal eye field in cebus monkeys. J. Neurophysiol. 76(4), 2754–2771 (1996)

Bayesian System Identification of Molecular Cascades

Junichiro Yoshimoto^{1,2} and Kenji Doya^{1,2,3}

^1 Initial Research Project, Okinawa Institute of Science and Technology Corporation, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan, {jun-y,doya}@oist.jp
^2 Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
^3 ATR Computational Neuroscience Laboratories, 2-2-2 Hikaridai, "Keihanna Science City", Kyoto 619-0288, Japan

Abstract. We present a Bayesian method for the system identiﬁcation of molecular cascades in biological systems. The contribution of this study is to provide a theoretical framework for unifying three issues: 1) estimating the most likely parameters; 2) evaluating and visualizing the conﬁdence of the estimated parameters; and 3) selecting the most likely structure of the molecular cascades from two or more alternatives. The usefulness of our method is demonstrated in several benchmark tests. Keywords: Systems biology, biochemical kinetics, system identiﬁcation, Bayesian inference, Markov chain Monte Carlo method.

1 Introduction

In recent years, the analysis of molecular cascades by mathematical models has contributed to the elucidation of intracellular mechanisms related to learning and memory [1,2]. In such modeling studies, the structure and parameters of the molecular cascades are selected based on the literature and databases.^1 However, if reliable information about a target molecular cascade cannot be obtained from those repositories, we must tune its structure and parameters so as to fit the model behaviors to the available experimental data. The development of a theoretically sound and efficient system identification framework is crucial for making such models useful. In this article, we propose a Bayesian system identification framework for molecular cascades. For a given set of experimental data, the system identification can be separated into two inverse problems: parameter estimation and model selection. The most popular strategy for parameter estimation is to find a single set of parameters based on the least mean-square-error or maximum-likelihood criterion [3]. However, we should be aware that the estimated parameters might

^1 DOQCS (http://doqcs.ncbs.res.in/) and BIOMODELS (http://biomodels.net/) are available, for example.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 614–624, 2008. © Springer-Verlag Berlin Heidelberg 2008


suffer from an "over-fitting effect," because the available set of experimental data is often small and noisy. For evaluating the accuracy of the estimators, statistical methods based on asymptotic theory [4] and on Fisher information [5] have been proposed independently. Still, we must pay attention to practical limitations: a large number of data are required for the former method, and the mathematical model should be linear, at least locally, for the latter method. For model selection


Library of Congress Control Number: Applied for CR Subject Classification (1998): F.1, I.2, I.5, I.4, G.3, J.3, C.2.1, C.1.3, C.3 LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues ISSN ISBN-10 ISBN-13

0302-9743 3-540-69154-5 Springer Berlin Heidelberg New York 978-3-540-69154-9 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2008 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12282845 06/3180 543210

Preface

This two-volume set comprises the post-conference proceedings of the 14th International Conference on Neural Information Processing (ICONIP 2007), held in Kitakyushu, Japan, during November 13–16, 2007. The Asia Pacific Neural Network Assembly (APNNA) was founded in 1993. The first ICONIP was held in 1994 in Seoul, Korea, sponsored by APNNA in collaboration with regional organizations. Since then, ICONIP has consistently provided prestigious opportunities for presenting and exchanging ideas on neural networks and related fields. The research fields covered by ICONIP have now expanded to include bioinformatics, brain-machine interfaces, robotics, and computational intelligence. We received 288 ordinary paper submissions and 3 special organized session proposals. Although the quality of the submitted papers was exceptionally high on average, only 60% of them could be accepted after rigorous reviews, each paper being reviewed by three reviewers. Concerning the special organized session proposals, two out of three were accepted. In addition to ordinary submitted papers, we invited 15 special organized sessions organized by leading researchers in emerging fields to promote the future expansion of neural information processing. ICONIP 2007 was held at the newly established Kitakyushu Science and Research Park in Kitakyushu, Japan. Its theme was "Towards an Integrated Approach to the Brain—Brain-Inspired Engineering and Brain Science," which emphasizes the need for cross-disciplinary approaches to understanding brain functions and utilizing that knowledge for contributions to society. It was jointly sponsored by APNNA, the Japanese Neural Network Society (JNNS), and the 21st Century COE Program at the Kyushu Institute of Technology. ICONIP 2007 comprised 1 keynote speech, 5 plenary talks, 4 tutorials, 41 oral sessions, 3 poster sessions, 4 demonstrations, and social events such as the Banquet and the International Music Festival.
In all, 382 researchers registered, and 355 participants joined the conference from 29 countries. Each tutorial attracted about 60 participants on average. Five best paper awards and five student best paper awards were granted to encourage outstanding researchers. To minimize the number of researchers unable to present their excellent work at the conference for financial reasons, we provided travel and accommodation support of up to JPY 150,000 to six researchers and of up to JPY 100,000 to eight students. ICONIP 2007 was held jointly with the 4th BrainIT 2007, organized by the 21st Century COE Program, “World of Brain Computing Interwoven out of Animals and Robots,” with the support of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) and the Japan Society for the Promotion of Science (JSPS).


We would like to thank Mitsuo Kawato for his superb keynote speech, and Rajesh P.N. Rao, Frédéric Kaplan, Shin Ishii, Andrew Y. Ng, and Yoshiyuki Kabashima for their stimulating plenary talks. We would also like to thank Sven Buchholz, Eckhard Hitzer, Kanta Tachibana, Jung Wang, Nikhil R. Pal, and Tetsuo Furukawa for their enlightening tutorial lectures. We express our deepest appreciation to all the participants for making the conference attractive and fruitful through lively discussions, which we believe will contribute tremendously to the future development of neural information processing. We also acknowledge the contributions of all the Committee members for their devoted work, especially Katsumi Tateno for his dedication as Secretary. Last but not least, we give special thanks to Irwin King and his students, Kam Tong Chan and Yi Ling Wong, for providing the submission and reviewing system; Etsuko Futagoishi for her hard secretarial work; Satoshi Sonoh and Shunsuke Sakaguchi for maintaining our conference server; and the many secretaries and graduate students in our department for their diligent work in running the conference.

January 2008

Masumi Ishikawa Kenji Doya Hiroyuki Miyamoto Takeshi Yamakawa

Organization

Conference Committee Chairs

General Chair: Takeshi Yamakawa (Kyushu Institute of Technology, Japan)
Organizing Committee Chair: Shiro Usui (RIKEN, Japan)
Steering Committee Chair: Takeshi Yamakawa (Kyushu Institute of Technology, Japan)
Program Co-chairs: Masumi Ishikawa (Kyushu Institute of Technology, Japan) and Kenji Doya (OIST, Japan)
Tutorials Chair: Hirokazu Yokoi (Kyushu Institute of Technology, Japan)
Exhibitions Chair: Masahiro Nagamatsu (Kyushu Institute of Technology, Japan)
Publications Chair: Hiroyuki Miyamoto (Kyushu Institute of Technology, Japan)
Publicity Chair: Hideki Nakagawa (Kyushu Institute of Technology, Japan)
Local Arrangements Chair: Satoru Ishizuka (Kyushu Institute of Technology, Japan)
Web Master: Tsutomu Miki (Kyushu Institute of Technology, Japan)
Secretary: Katsumi Tateno (Kyushu Institute of Technology, Japan)

Steering Committee Takeshi Yamakawa, Masumi Ishikawa, Hirokazu Yokoi, Masahiro Nagamatsu, Hiroyuki Miyamoto, Hideki Nakagawa, Satoru Ishizuka, Tsutomu Miki, Katsumi Tateno

Program Committee

Program Co-chairs: Masumi Ishikawa, Kenji Doya

Track Co-chairs

Track 1: Masato Okada (Tokyo Univ.), Yoko Yamaguchi (RIKEN), Si Wu (Sussex Univ.)
Track 2: Koji Kurata (Univ. of Ryukyus), Kazushi Ikeda (Kyoto Univ.), Liqing Zhang (Shanghai Jiaotong Univ.)


Track 3: Yuzo Hirai (Tsukuba Univ.), Yasuharu Koike (Tokyo Institute of Tech.), J.H. Kim (Handong Global Univ., Korea)
Track 4: Akira Iwata (Nagoya Institute of Tech.), Noboru Ohnishi (Nagoya Univ.), Se-Young Oh (POSTECH, Korea)
Track 5: Hideki Asoh (AIST), Shin Ishii (Kyoto Univ.), Sung-Bae Cho (Yonsei Univ., Korea)

Advisory Board Shun-ichi Amari (Japan), Sung-Yang Bang (Korea), You-Shou Wu (China), Lei Xu (Hong Kong), Nikola Kasabov (New Zealand), Kunihiko Fukushima (Japan), Tom D. Gedeon (Australia), Soo-Young Lee (Korea), Yixin Zhong (China), Lipo Wang (Singapore), Nikhil R. Pal (India), Chin-Teng Lin (Taiwan), Laiwan Chan (Hong Kong), Jun Wang (Hong Kong), Shuji Yoshizawa (Japan), Minoru Tsukada (Japan), Takashi Nagano (Japan), Shozo Yasui (Japan)

Referees S. Akaho P. Andras T. Aonishi T. Aoyagi T. Asai H. Asoh J. Babic R. Surampudi Bapi A. Kardec Barros J. Cao H. Cateau J-Y. Chang S-B. Cho S. Choi I.F. Chung A.S. Cichocki M. Diesmann K. Doya P. Erdi H. Fujii N. Fukumura W-k. Fung T. Furuhashi A. Garcez T.D. Gedeon

S. Gruen K. Hagiwara M. Hagiwara K. Hamaguchi R.P. Hasegawa H. Hikawa Y. Hirai K. Horio K. Ikeda F. Ishida S. Ishii M. Ishikawa A. Iwata K. Iwata H. Kadone Y. Kamitani N. Kasabov M. Kawamoto C. Kim E. Kim K-J. Kim S. Kimura A. Koenig Y. Koike T. Kondo

S. Koyama J.L. Krichmar H. Kudo T. Kurita S. Kurogi M. Lee J. Liu B-L. Lu N. Masuda N. Matsumoto B. McKay K. Meier H. Miyamoto Y. Miyawaki H. Mochiyama C. Molter T. Morie K. Morita M. Morita Y. Morita N. Murata H. Nakahara Y. Nakamura S. Nakauchi K. Nakayama


K. Niki J. Nishii I. Nishikawa S. Oba T. Ogata S-Y. Oh N. Ohnishi M. Okada H. Okamoto T. Omori T. Omori R. Osu N. R. Pal P. S. Pang G-T. Park J. Peters S. Phillips

Y. Sakaguchi K. Sakai Y. Sakai Y. Sakumura K. Samejima M. Sato N. Sato R. Setiono T. Shibata H. Shouno M. Small M. Sugiyama I. Hong Suh J. Suzuki T. Takenouchi Y. Tanaka I. Tetsunari

N. Ueda S. Usui Y. Wada H. Wagatsuma L. Wang K. Watanabe J. Wu Q. Xiao Y. Yamaguchi K. Yamauchi Z. Yi J. Yoshimoto B.M. Yu B-T. Zhang L. Zhang L. Zhang

Sponsoring Institutions Asia Pacific Neural Network Assembly (APNNA) Japanese Neural Network Society (JNNS) 21st Century COE Program, Kyushu Institute of Technology

Cosponsors RIKEN Brain Science Institute Advanced Telecommunications Research Institute International (ATR) Japan Society for Fuzzy Theory and Intelligent Informatics (SOFT) IEEE CIS Japan Chapter Fuzzy Logic Systems Institute (FLSI)


Table of Contents – Part I

Computational Neuroscience A Retinal Circuit Model Accounting for Functions of Amacrine Cells . . . Murat Saglam, Yuki Hayashida, and Nobuki Murayama

1

Global Bifurcation Analysis of a Pyramidal Cell Model of the Primary Visual Cortex: Towards a Construction of Physiologically Plausible Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tatsuya Ishiki, Satoshi Tanaka, Makoto Osanai, Shinji Doi, Sadatoshi Kumagai, and Tetsuya Yagi

7

Representation of Medial Axis from Synchronous Firing of Border-Ownership Selective Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Hatori and Ko Sakai

18

Neural Mechanism for Extracting Object Features Critical for Visual Categorization Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mitsuya Soga and Yoshiki Kashimori

27

An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jordan H. Boyle, John Bryden, and Netta Cohen

37

Applying the String Method to Extract Bursting Information from Microelectrode Recordings in Subthalamic Nucleus and Substantia Nigra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pei-Kuang Chao, Hsiao-Lung Chan, Tony Wu, Ming-An Lin, and Shih-Tseng Lee

48

Population Coding of Song Element Sequence in the Songbird Brain Nucleus HVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Nishikawa, Masato Okada, and Kazuo Okanoya

54

Spontaneous Voltage Transients in Mammalian Retinal Ganglion Cells Dissociated by Vibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tamami Motomura, Yuki Hayashida, and Nobuki Murayama

64

Region-Based Encoding Method Using Multi-dimensional Gaussians for Networks of Spiking Neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lakshmi Narayana Panuku and C. Chandra Sekhar

73

Firing Pattern Estimation of Biological Neuron Models by Adaptive Observer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kouichi Mitsunaga, Yusuke Totoki, and Takami Matsuo

83


Thouless-Anderson-Palmer Equation for Associative Memory Neural Network Models with Fluctuating Couplings . . . . . . . . . . . . . . . . . . . . 93
Akihisa Ichiki and Masatoshi Shiino

Spike-Timing Dependent Plasticity in Recurrently Connected Networks with Fixed External Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Matthieu Gilson, David B. Grayden, J. Leo van Hemmen, Doreen A. Thomas, and Anthony N. Burkitt

A Comparative Study of Synchrony Measures for the Early Detection of Alzheimer’s Disease Based on EEG . . . . . . . . . . . . . . . . . . . . . . . . . 112
Justin Dauwels, François Vialatte, and Andrzej Cichocki

Reproducibility Analysis of Event-Related fMRI Experiments Using Laguerre Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Hong-Ren Su, Michelle Liou, Philip E. Cheng, John A.D. Aston, and Shang-Hong Lai

The Eﬀects of Theta Burst Transcranial Magnetic Stimulation over the Human Primary Motor and Sensory Cortices on Cortico-Muscular Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Murat Saglam, Kaoru Matsunaga, Yuki Hayashida, Nobuki Murayama, and Ryoji Nakanishi

135

Interactions between Spike-Timing-Dependent Plasticity and Phase Response Curve Lead to Wireless Clustering . . . . . . . . . . . . . . . . . . . . Hideyuki Câteau, Katsunori Kitano, and Tomoki Fukai

142

A Computational Model of Formation of Grid Field and Theta Phase Precession in the Entorhinal Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Yoko Yamaguchi, Colin Molter, Wu Zhihua, Harshavardhan A. Agashe, and Hiroaki Wagatsuma

Working Memory Dynamics in a Flip-Flop Oscillations Network Model with Milnor Attractor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
David Colliaux, Yoko Yamaguchi, Colin Molter, and Hiroaki Wagatsuma

Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization of Quasi-Attractors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hiroshi Fujii, Kazuyuki Aihara, and Ichiro Tsuda

170

Tracking a Moving Target Using Chaotic Dynamics in a Recurrent Neural Network Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yongtao Li and Shigetoshi Nara

179

A Generalised Entropy Based Associative Model . . . . . . . . . . . . . . . . . . . . . Masahiro Nakagawa

189

The Detection of an Approaching Sound Source Using Pulsed Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Kaname Iwasa, Takeshi Fujisumi, Mauricio Kugler, Susumu Kuroyanagi, Akira Iwata, Mikio Danno, and Masahiro Miyaji

Sensitivity and Uniformity in Detecting Motion Artifacts . . . . . . . . . . . . . Wen-Chuang Chou, Michelle Liou, and Hong-Ren Su

209

A Ring Model for the Development of Simple Cells in the Visual Cortex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takashi Hamada and Kazuhiro Okada

219

Learning and Memory Practical Recurrent Learning (PRL) in the Discrete Time Domain . . . . . Mohamad Faizal Bin Samsudin, Takeshi Hirose, and Katsunari Shibata

228

Learning of Bayesian Discriminant Functions by a Layered Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshifusa Ito, Cidambi Srinivasan, and Hiroyuki Izumi

238

RNN with a Recurrent Output Layer for Learning of Naturalness . . . . . . Ján Dolinský and Hideyuki Takagi

248

Using Generalization Error Bounds to Train the Set Covering Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zakria Hussain and John Shawe-Taylor

258

Model of Cue Extraction from Distractors by Active Recall . . . . . . . . . . . . Adam Ponzi

269

PLS Mixture Model for Online Dimension Reduction . . . . . . . . . . . . . . . . . Jiro Hayami and Koichiro Yamauchi

279

Analysis on Bidirectional Associative Memories with Multiplicative Weight Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chi Sing Leung, Pui Fai Sum, and Tien-Tsin Wong

289

Fuzzy ARTMAP with Explicit and Implicit Weights . . . . . . . . . . . . . . . . . 299
Takeshi Kamio, Kenji Mori, Kunihiko Mitsubori, Chang-Jun Ahn, Hisato Fujisaka, and Kazuhisa Haeiwa

Neural Network Model of Forward Shift of CA1 Place Fields Towards Reward Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Adam Ponzi


Neural Network Models A New Constructive Algorithm for Designing and Training Artiﬁcial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Md. Abdus Sattar, Md. Monirul Islam, and Kazuyuki Murase

317

Effective Learning with Heterogeneous Neural Networks . . . . . . . . . . . . Lluís A. Belanche-Muñoz

328

Pattern-Based Reasoning System Using Self-incremental Neural Network for Propositional Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akihito Sudo, Manabu Tsuboyama, Chenli Zhang, Akihiro Sato, and Osamu Hasegawa

338

Eﬀect of Spatial Attention in Early Vision for the Modulation of the Perception of Border-Ownership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nobuhiko Wagatsuma, Ryohei Shimizu, and Ko Sakai

348

Eﬀectiveness of Scale Free Network to the Performance Improvement of a Morphological Associative Memory without a Kernel Image . . . . . . . Takashi Saeki and Tsutomu Miki

358

Intensity Gradient Self-organizing Map for Cerebral Cortex Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Cheng-Hung Chuang, Jiun-Wei Liou, Philip E. Cheng, Michelle Liou, and Cheng-Yuan Liou

Feature Subset Selection Using Constructive Neural Nets with Minimal Computation by Measuring Contribution . . . . . . . . . . . . . . . . . . . . . . . 374
Md. Monirul Kabir, Md. Shahjahan, and Kazuyuki Murase

Dynamic Link Matching between Feature Columns for Different Scale and Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Yasuomi D. Sato, Christian Wolff, Philipp Wolfrum, and Christoph von der Malsburg

Perturbational Neural Networks for Incremental Learning in Virtual Learning System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eiichi Inohira, Hiromasa Oonishi, and Hirokazu Yokoi

395

Bifurcations of Renormalization Dynamics in Self-organizing Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peter Tiňo

405

Variable Selection for Multivariate Time Series Prediction with Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Han and Ru Wei

415


Ordering Process of Self-Organizing Maps Improved by Asymmetric Neighborhood Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takaaki Aoki, Kaiichiro Ota, Koji Kurata, and Toshio Aoyagi

426

A Characterization of Simple Recurrent Neural Networks with Two Hidden Units as a Language Recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Azusa Iwata, Yoshihisa Shinozawa, and Akito Sakurai

436

Supervised/Unsupervised/Reinforcement Learning Unbiased Likelihood Backpropagation Learning . . . . . . . . . . . . . . . . . . . . . . Masashi Sekino and Katsumi Nitta

446

The Local True Weight Decay Recursive Least Square Algorithm . . . . . . Chi Sing Leung, Kwok-Wo Wong, and Yong Xu

456

Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Keisuke Yamazaki and Sumio Watanabe

466

Using Image Stimuli to Drive fMRI Analysis . . . . . . . . . . . . . . . . . . . . . . David R. Hardoon, Janaina Mourão-Miranda, Michael Brammer, and John Shawe-Taylor

477

Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima

487

Convergence Behavior of Competitive Repetition-Suppression Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Davide Bacciu and Antonina Starita

497

Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma

507

An Automatic Speaker Recognition System . . . . . . . . . . . . . . . . . . . . . . 517
P. Chakraborty, F. Ahmed, Md. Monirul Kabir, Md. Shahjahan, and Kazuyuki Murase

Modified Modulated Hebb-Oja Learning Rule: A Method for Biologically Plausible Principal Component Analysis . . . . . . . . . . . . . . . 527
Marko Jankovic, Pablo Martinez, Zhe Chen, and Andrzej Cichocki

Statistical Learning Algorithms Orthogonal Shrinkage Methods for Nonparametric Regression under Gaussian Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Katsuyuki Hagiwara

537


A Subspace Method Based on Data Generation Model with Class Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Minkook Cho, Dongwoo Yoon, and Hyeyoung Park

547

Hierarchical Feature Extraction for Compact Representation and Classiﬁcation of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Markus Schubert and Jens Kohlmorgen

556

Principal Component Analysis for Sparse High-Dimensional Data . . . . . . Tapani Raiko, Alexander Ilin, and Juha Karhunen

566

Hierarchical Bayesian Inference of Brain Activity . . . . . . . . . . . . . . . . . . . . . Masa-aki Sato and Taku Yoshioka

576

Neural Decoding of Movements: From Linear to Nonlinear Trajectory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Byron M. Yu, John P. Cunningham, Krishna V. Shenoy, and Maneesh Sahani

586

Estimating Internal Variables of a Decision Maker’s Brain: A Model-Based Approach for Neuroscience . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuyuki Samejima and Kenji Doya

596

Visual Tracking Achieved by Adaptive Sampling from Hierarchical and Parallel Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tomohiro Shibata, Takashi Bando, and Shin Ishii

604

Bayesian System Identiﬁcation of Molecular Cascades . . . . . . . . . . . . . . . . Junichiro Yoshimoto and Kenji Doya

614

Use of Circle-Segments as a Data Visualization Technique for Feature Selection in Pattern Classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shir Li Wang, Chen Change Loy, Chee Peng Lim, Weng Kin Lai, and Kay Sin Tan

625

Extraction of Approximate Independent Components from Large Natural Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshitatsu Matsuda and Kazunori Yamaguchi

635

Local Coordinates Alignment and Its Linearization . . . . . . . . . . . . . . . . . . . Tianhao Zhang, Xuelong Li, Dacheng Tao, and Jie Yang

643

Walking Appearance Manifolds without Falling Off . . . . . . . . . . . . . . . . Nils Einecke, Julian Eggert, Sven Hellbach, and Edgar Körner

653

Inverse-Halftoning for Error Diﬀusion Based on Statistical Mechanics of the Spin System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yohei Saika

663


Optimization Algorithms Chaotic Motif Sampler for Motif Discovery Using Statistical Values of Spike Time-Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takafumi Matsuura and Tohru Ikeguchi

673

A Thermodynamical Search Algorithm for Feature Subset Selection . . . . Félix F. González and Lluís A. Belanche

683

Solvable Performances of Optimization Neural Networks with Chaotic Noise and Stochastic Noise with Negative Autocorrelation . . . . . . . . . . . . . Mikio Hasegawa and Ken Umeno

693

Solving the k-Winners-Take-All Problem and the Oligopoly Cournot-Nash Equilibrium Problem Using the General Projection Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
Xiaolin Hu and Jun Wang

Optimization of Parametric Companding Function for an Efficient Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
Shin-ichi Maeda and Shin Ishii

A Modified Soft-Shape-Context ICP Registration System of 3-D Point Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
Jiann-Der Lee, Chung-Hsien Huang, Li-Chang Liu, Shih-Sen Hsieh, Shuen-Ping Wang, and Shin-Tseng Lee

Solution Method Using Correlated Noise for TSP . . . . . . . . . . . . . . . . . 733
Atsuko Goto and Masaki Kawamura

Novel Algorithms Bayesian Collaborative Predictors for General User Modeling Tasks . . . . Jun-ichiro Hirayama, Masashi Nakatomi, Takashi Takenouchi, and Shin Ishii

742

Discovery of Linear Non-Gaussian Acyclic Models in the Presence of Latent Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shohei Shimizu and Aapo Hyvärinen

752

Eﬃcient Incremental Learning Using Self-Organizing Neural Grove . . . . . Hirotaka Inoue and Hiroyuki Narihisa

762

Design of an Unsupervised Weight Parameter Estimation Method in Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masato Uchida, Yousuke Maehara, and Hiroyuki Shioya

771

Sparse Super Symmetric Tensor Factorization . . . . . . . . . . . . . . . . . . . . . . . Andrzej Cichocki, Marko Jankovic, Rafal Zdunek, and Shun-ichi Amari

781


Probabilistic Tensor Analysis with Akaike and Bayesian Information Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
Dacheng Tao, Jimeng Sun, Xindong Wu, Xuelong Li, Jialie Shen, Stephen J. Maybank, and Christos Faloutsos

Decomposing EEG Data into Space-Time-Frequency Components Using Parallel Factor Analysis and Its Relation with Cerebral Blood Flow . . . . 802
Fumikazu Miwakeichi, Pedro A. Valdes-Sosa, Eduardo Aubert-Vazquez, Jorge Bosch Bayard, Jobu Watanabe, Hiroaki Mizuhara, and Yoko Yamaguchi

Flexible Component Analysis for Sparse, Smooth, Nonnegative Coding or Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrzej Cichocki, Anh Huy Phan, Rafal Zdunek, and Li-Qing Zhang

811

Appearance Models for Medical Volumes with Few Samples by Generalized 3D-PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rui Xu and Yen-Wei Chen

821

Head Pose Estimation Based on Tensor Factorization . . . . . . . . . . . . . . . . . Wenlu Yang, Liqing Zhang, and Wenjun Zhu

831

Kernel Maximum a Posteriori Classification with Error Bound Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 841
Zenglin Xu, Kaizhu Huang, Jianke Zhu, Irwin King, and Michael R. Lyu

Comparison of Local Higher-Order Moment Kernel and Conventional Kernels in SVM for Texture Classification . . . . . . . . . . . . . . . . . . . . . . 851
Keisuke Kameyama

Pattern Discovery for High-Dimensional Binary Datasets . . . . . . . . . . . . Václav Snášel, Pavel Moravec, Dušan Húsek, Alexander Frolov, Hana Řezanková, and Pavel Polyakov

861

Expand-and-Reduce Algorithm of Particle Swarm Optimization . . . . . . . . Eiji Miyagawa and Toshimichi Saito

873

Nonlinear Pattern Identiﬁcation by Multi-layered GMDH-Type Neural Network Self-selecting Optimum Neural Network Architecture . . . . . . . . . Tadashi Kondo

882

Motor Control and Vision Coordinated Control of Reaching and Grasping During Prehension Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masazumi Katayama and Hirokazu Katayama

892

Computer Simulation of Vestibuloocular Reflex Motor Learning Using a Realistic Cerebellar Cortical Neuronal Network Model . . . . . . . . . . . . 902
Kayichiro Inagaki, Yutaka Hirata, Pablo M. Blazquez, and Stephen M. Highstein

Reflex Contributions to the Directional Tuning of Arm Stiffness . . . . . . . 913
Gary Liaw, David W. Franklin, Etienne Burdet, Abdelhamid Kadi-allah, and Mitsuo Kawato

Analysis of Variability of Human Reaching Movements Based on the Similarity Preservation of Arm Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . Takashi Oyama, Yoji Uno, and Shigeyuki Hosoe

923

Directional Properties of Human Hand Force Perception in the Maintenance of Arm Posture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yoshiyuki Tanaka and Toshio Tsuji

933

Computational Understanding and Modeling of Filling-In Process at the Blind Spot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shunji Satoh and Shiro Usui

943

Biologically Motivated Face Selective Attention Model . . . . . . . . . . . . . . . . Woong-Jae Won, Young-Min Jang, Sang-Woo Ban, and Minho Lee

953

Multi-dimensional Histogram-Based Image Segmentation . . . . . . . . . . . . . . Daniel Weiler and Julian Eggert

963

A Framework for Multi-view Gender Classiﬁcation . . . . . . . . . . . . . . . . . . . Jing Li and Bao-Liang Lu

973

Japanese Hand Sign Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hirotada Fujimura, Yuuichi Sakai, and Hiroomi Hikawa

983

An Image Warping Method for Temporal Subtraction Images Employing Smoothing of Shift Vectors on MDCT Images . . . . . . . . . . . . . Yoshinori Itai, Hyoungseop Kim, Seiji Ishikawa, Shigehiko Katsuragawa, Takayuki Ishida, Ikuo Kawashita, Kazuo Awai, and Kunio Doi

993

Conﬂicting Visual and Proprioceptive Reﬂex Responses During Reaching Movements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1002 David W. Franklin, Udell So, Rieko Osu, and Mitsuo Kawato An Involuntary Muscular Response Induced by Perceived Visual Errors in Hand Position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1012 David W. Franklin, Udell So, Rieko Osu, and Mitsuo Kawato Independence of Perception and Action for Grasping Positions . . . . . . . . . 1021 Takahiro Fujita, Yoshinobu Maeda, and Masazumi Katayama


Handwritten Character Distinction Method Inspired by Human Vision Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1031 Jumpei Koyama, Masahiro Kato, and Akira Hirose Recent Advances in the Neocognitron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1041 Kunihiko Fukushima Engineering-Approach Accelerates Computational Understanding of V1–V2 Neural Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1051 Shunji Satoh and Shiro Usui Recent Studies Around the Neocognitron . . . . . . . . . . . . . . . . . . . . . . . . . . . 1061 Hayaru Shouno Toward Human Arm Attention and Recognition . . . . . . . . . . . . . . . . . . . . . 1071 Takeharu Yoshizuka, Masaki Shimizu, and Hiroyuki Miyamoto Projection-Field-Type VLSI Convolutional Neural Networks Using Merged/Mixed Analog-Digital Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081 Osamu Nomura and Takashi Morie Optimality of Reaching Movements Based on Energetic Cost under the Inﬂuence of Signal-Dependent Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1091 Yoshiaki Taniai and Jun Nishii Inﬂuence of Neural Delay in Sensorimotor Systems on the Control Performance and Mechanism in Bicycle Riding . . . . . . . . . . . . . . . . . . . . . . . 1100 Yusuke Azuma and Akira Hirose Global Localization for the Mobile Robot Based on Natural Number Recognition in Corridor Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1110 Su-Yong An, Jeong-Gwan Kang, Se-Young Oh, and Doo San Baek A System Model for Real-Time Sensorimotor Processing in Brain . . . . . . 1120 Yutaka Sakaguchi Perception of Two-Stroke Apparent Motion and Real Motion . . . . . . . . . . 1130 Qi Zhang and Ken Mogi Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1141

A Retinal Circuit Model Accounting for Functions of Amacrine Cells Murat Saglam, Yuki Hayashida, and Nobuki Murayama Graduate School of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Kumamoto 860-8555, Japan [email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp

Abstract. Previous experimental studies on vertebrates have shown that high-level processes of vision, such as object segregation and spatio-temporal pattern adaptation, begin at the retinal stage. In these visual functions, diverse subtypes of amacrine cells are believed to play essential roles by processing the excitatory and inhibitory signals laterally over a wide region of the retina to shape the ganglion cell responses. Previously, a simple “linear-nonlinear” model was proposed to explain a specific function of the retina; it could capture the spiking behavior of the retinal output, although each class of retinal neurons was largely omitted from it. Here, we present a spatio-temporal computational model based on a response function for each class of retinal neurons and on the anatomical intercellular connections. This model not only reproduces the filtering properties of the outer retina but also realizes high-order inner-retinal functions such as an object-segregation mechanism mediated by wide-field amacrine cells.
Keywords: Retina, Amacrine Cells, Model, Visual Function.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 1–6, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

The vertebrate retina is far more than a passive visual receptor. Many high-level vision tasks have been reported to begin in the retinal circuits, although they were long believed to be performed entirely in the visual cortices of the brain [1]. One important task among these is discriminating the actual motion of an object from global motion across the retina. Even for a perfectly stationary scene, eye movements cause retinal image drifts that prevent the retinal circuit from receiving a stationary global input at the background [2]. To handle this problem, retinal circuits distinguish object motions better when their motion patterns differ from that of the background. The synaptic configuration of diverse types of retinal cells plays an essential role in this function. It has been reported that wide-field polyaxonal amacrine cells can drive an inhibitory process between the surround and the object region (receptive field) on the retina [1, 2, 4, 5, 6]. These wide-field amacrine cells are known to use inhibitory neurotransmitters such as glycine or GABA [7, 8]. A previous study reported that glycine-mediated wide-field inhibition exists in the salamander retina and proposed a simple “linear-nonlinear” model, consisting of a temporal filter and a threshold function [2]. However, that model does not include the details of any retinal neurons

2

M. Saglam, Y. Hayashida, and N. Murayama

accounting for that inhibitory mechanism although it is capable of predicting the spiking behavior of the retinal output for certain input patterns. On the other hand, different temporal models that include the behavior of each class of retinal neurons exist in the literature [9, 10]. Even though those models provide high temporal resolution, they lack spatial information of the retinal processing. Here, we present a spatio-temporal computational model that realizes wide-field inhibition between object and surround region via wide-field transient on/off amacrine cells. The model considers the responses of all major retinal neurons in detail.

2 Retinal Model

The retinal model divides the stimulus input over a line of spatio-temporal computational units. Each unit consists of the main retinal elements that convey information forward (photoreceptors, bipolar and ganglion cells) and laterally (horizontal and amacrine cells). Figure 1 illustrates the organization and the synaptic connections of the retinal neurons.

Fig. 1. Three computational units of the model are depicted. Each unit includes: PR: Photoreceptor, HC: Horizontal Cell, onBC: On Bipolar Cell, offBC: Off Bipolar Cell, onAC: Sustained On Amacrine Cell, offAC: Sustained Off Amacrine Cell, on/offAC: Fast-transient Wide-field On/Off Amacrine Cell, GC: On/Off Ganglion Cell. Excitatory/inhibitory synaptic connections are represented with black/white arrowheads, respectively. Gap junctions between neighboring HCs are indicated by dotted horizontal lines. Wide-field connections are realized between wide-field on/offACs and GCs; double-s symbols mark distant connections between those neurons.

A Retinal Circuit Model Accounting for Functions of Amacrine Cells


Each neuron's membrane dynamics is governed by a differential equation (eqn. 1), adapted from push-pull shunting models of retinal neurons [9]:

$$\frac{dv_c(t)}{dt} = -A\,v_c(t) + \left[B - v_c(t)\right]e(t) - \left[D + v_c(t)\right]i(t) + \sum_{k=1}^{n} W_{ck}\,v_k(t) \qquad (1)$$

Here vc(t) stands for the membrane potential of the neuron of interest. A represents the rate of passive membrane decay toward the resting potential in the dark. B and D are the saturation levels for the excitatory, e(t), and inhibitory, i(t), inputs, respectively. Those excitatory/inhibitory inputs correspond to the synaptic connections (solid lines in Fig. 1) between different neurons in a computational unit. vk(t) is the membrane potential of a different neuron, belonging to another unit, making a synapse or gap junction onto the neuron of interest, vc(t). The efficiency of that link is determined by a weight parameter, Wck. In the current model, spatial connectivity is present within horizontal cells as gap junctions (dashed lines in Fig. 1) and between on/off amacrine cells and ganglion cells as a wide-field inhibitory process (thin solid lines in Fig. 1). For the other neurons Wck is fixed to zero, since we ignore lateral spatial connections within them. A compressive nonlinearity (eqn. 2) is cascaded prior to the photoreceptor input stage in order to account for the limited dynamic range of the neural elements. The photoreceptor is therefore fed by a hyperpolarizing input, r(t), representing the compressed form of the light intensity, f(t).

$$r(t) = G\left(\frac{f(t)}{f(t)+I}\right)^{n} \qquad (2)$$

Here G denotes the saturation level of the hyperpolarizing input to the photoreceptor, I represents the light intensity yielding the half-maximum response, and n is a real constant. Although ganglion cell receptive field size is diverse among different animals, we set each unit to span 500 μm, in good accordance with experiments on the salamander [2]. 32 computational units are interconnected as a line. Wck values are determined as a function of the distance between computational units. The parameter set given in [9] was calibrated to reproduce the temporal dynamics of all neuron classes. The spatial parameters of the model were selected to match the spatial ganglion cell response profile given in [2]. All differential equations in the model are solved sequentially using the fixed-step (1 ms) Bogacki-Shampine solver of the MATLAB/SIMULINK software package (The MathWorks, Inc., Natick, MA).
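The single-unit dynamics above can be sketched numerically. The following Python sketch integrates eqn (1) for one isolated neuron driven through the compressive nonlinearity of eqn (2), using a simple fixed-step forward-Euler scheme in place of the Bogacki-Shampine solver; all parameter values (A, B, D, G, I, n) are illustrative placeholders, not the paper's calibrated set.

```python
import numpy as np

def compress(f, G=1.0, I=0.5, n=1.0):
    """Compressive nonlinearity of eqn (2): r = G * (f / (f + I))**n."""
    return G * (f / (f + I)) ** n

def step_unit(v, e, i, coupled, dt=0.001, A=10.0, B=1.0, D=1.0):
    """One forward-Euler step of the shunting equation (1) for one neuron:
    dv/dt = -A*v + (B - v)*e - (D + v)*i + sum_k W_ck * v_k."""
    dv = -A * v + (B - v) * e - (D + v) * i + coupled
    return v + dt * dv

# Drive a single photoreceptor-like unit with a 150 ms light flash.
dt = 0.001                                   # 1 ms time step, as in the paper
t = np.arange(0.0, 0.5, dt)
light = np.where((t >= 0.1) & (t < 0.25), 1.0, 0.0)

v, trace = 0.0, []
for f in light:
    r = compress(f)                          # hyperpolarizing drive, eqn (2)
    v = step_unit(v, e=0.0, i=r, coupled=0.0)  # light enters as inhibition here
    trace.append(v)
trace = np.array(trace)

print(trace.min() < 0.0, abs(trace[-1]) < 0.01)  # hyperpolarizes, then recovers
```

The shunting form of eqn (1) automatically bounds the potential between −D and B, which is why a plain Euler step remains stable here.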

3 Results

First, we confirmed that the responses of each neuron agree with physiological observations [9]. Figure 2 illustrates the responses of all neurons to a 150 ms flash of light stimulating the whole model retina. At the outer retina, the photoreceptor responds with a transient hyperpolarization followed by a less steep plateau, and returns to the resting potential with a small overshoot. The horizontal cell response is essentially a smoothed form of the photoreceptor response, owing to its low-pass filtering property. On- and off-bipolar cells are depolarized at the onset and offset of the flash, respectively. Since


Fig. 2. Responses of the retinal neurons (labeled as in Fig. 1) to a 150 ms full-field light flash. Dashed horizontal lines indicate the dark responses (resting potentials) of each neuron. Note that the timings of the on/off responses of the wide-field ACs and of the GC spike generating potentials (GC gen. pot.) match each other. This phenomenon drives the wide-field inhibition.

Fig. 3. GC spike generating potential responses (top row) of the center unit for the incoherent (left column) and coherent (right column) stimulation cases. Stimulation timing and position are depicted on the x and y axes, respectively (bottom row; white bars indicate the light onset). Under the coherent stimulation condition, the off responses are significantly inhibited and the on responses disappear entirely.

those cells form negative feedback loops with sustained on- and off-amacrine cells, their responses are more transient than those of photoreceptors and horizontal cells, as expected. Ultimately, bipolar cells transmit excitatory inputs, and wide-field transient on/off amacrine cells convey inhibitory inputs, to the ganglion cells. Significant inhibition at the ganglion cell level occurs only when the wide-field amacrine cell signal coincides with the excitatory input. Figure 3 demonstrates how the inhibitory process differs when the peripheral (surround) and the object regions are stimulated coherently or incoherently. In both cases the object region spans 3 units (750 μm radius) and is stimulated identically. When the surround is stimulated incoherently, the depolarized peaks of the wide-field amacrine cells do not coincide with the ganglion cell peaks, so spike generating potentials remain evident. However, when the surround region is stimulated coherently, inhibition from the amacrine cells cancels out large portions of the ganglion cell depolarizations. This leads to maximal inhibition of the spike generating potentials of the ganglion cells (Fig. 3, right column).
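The coincidence-dependent inhibition described above can be illustrated with toy signals. The sketch below compares a rectified ganglion-cell generating potential when a hypothetical wide-field amacrine signal is aligned (coherent) or misaligned (incoherent) with the excitatory drive, and when the wide-field weight is zeroed; the pulse shapes, timings, and the weight value are invented for illustration and are not the model's actual traces.

```python
import numpy as np

dt = 0.001
t = np.arange(0.0, 1.0, dt)

def transient_response(onsets, tau=0.02):
    """Toy transient depolarization: an exponential pulse at each onset time."""
    r = np.zeros_like(t)
    for t0 in onsets:
        m = t >= t0
        r[m] += np.exp(-(t[m] - t0) / tau)
    return r

object_onsets  = [0.2, 0.5, 0.8]     # flicker of the object region
coherent_sur   = [0.2, 0.5, 0.8]     # surround flickers in step with the object
incoherent_sur = [0.35, 0.65, 0.95]  # surround flickers out of step

gc_exc = transient_response(object_onsets)   # excitatory drive to the GC

areas = {}
for label, surround, w in [("coherent", coherent_sur, 1.0),
                           ("incoherent", incoherent_sur, 1.0),
                           ("coherent, weight zeroed", coherent_sur, 0.0)]:
    ac = transient_response(surround)        # wide-field amacrine cell signal
    gen = np.clip(gc_exc - w * ac, 0.0, None)  # rectified generating potential
    areas[label] = gen.sum() * dt            # total generating-potential area
print(areas)
```

When the surround transients coincide with the excitatory drive, the subtraction cancels the generating potential almost entirely; misaligned transients, or a zeroed wide-field weight, leave it largely intact.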

4 Discussion

In the current study we realized the basic mechanism of an important retinal task: discriminating a moving object from a moving background image. The coherent stimulation (Fig. 3, right column) can be linked to the global motion of the retinal image that takes place when the eye moves. However, when there is a moving object in the

Fig. 4. Relative generating potential response of the GC as a function of object size. The dashed line represents the model response with the original parameter set. Triangle and square markers indicate data points for the 'wide-field AC blocked' and 'control' cases, respectively. The maximum GC response is observed when the object radius is 250 μm (1-unit stimulation, 2nd data point). For the sake of symmetry, the 3rd data point represents 3-unit stimulation (750 μm radius, as in Fig. 3); similarly, each interval after the 2nd data point corresponds to a 500 μm increment in the radius of the object. As the object starts to invade the surround region, the GC response decreases. When the weights of the interconnections among wide-field on/off ACs are set to zero (STR application), the inhibition process is partially disabled (solid line).


scene, its image would be reflected on the receptive field as a stimulation pattern different from the global pattern (incoherent stimulation, Fig. 3). Experimental results revealed that blocking glycine-mediated inhibition with strychnine (STR) disables the wide-field process [2]. Therefore this glycinergic mechanism can be attributed to wide-field amacrine cells [7]. In our model, STR application can be realized by setting the synaptic weight parameters between wide-field amacrine cells and ganglion cells to zero. Figure 4 demonstrates how STR application affects the ganglion cell response. As the object invades the background, the ganglion cell response is inhibited as in the control case; STR, however, prevents this inhibition from occurring. This behavior of the model is in good agreement with the experimental results in [2]. Note that the model is flexible enough to fit another wide-field inhibitory process, such as a GABAergic mechanism [8]. Spike generation of the ganglion cells is not implemented in the current model, in order to highlight the role of wide-field amacrine cells only. A specific spike generator can be cascaded onto the model to reproduce spike responses and highlight further retinal features. Since the model covers the on/off pathways and all major retinal neurons, it can be flexibly adjusted to reproduce other functions. Although we reduced the retina to a line of spatio-temporal computational units, the model was able to reproduce a retinal mechanism. This reduction can be bypassed, and more precise results achieved, by creating a 2-D mesh of spatio-temporal computational units.

References
1. Masland, R.H.: Vision: The retina's fancy tricks. Nature 423(6938), 387–388 (2003)
2. Olveczky, B.P., Baccus, S.A., Meister, M.: Segregation of object and background motion in the retina. Nature 423(6938), 401–408 (2003)
3. Volgyi, B., Xin, D., Amarillo, Y., Bloomfield, S.A.: Morphology and physiology of the polyaxonal amacrine cells in the rabbit retina. J. Comp. Neurol. 440(1), 109–125 (2001)
4. Lin, B., Masland, R.H.: Populations of wide-field amacrine cells in the mouse retina. J. Comp. Neurol. 499(5), 797–809 (2006)
5. Solomon, S.G., Lee, B.B., Sun, H.: Suppressive surrounds and contrast gain in magnocellular-pathway retinal ganglion cells of macaque. J. Neurosci. 26(34), 8715–8726 (2006)
6. van Wyk, M., Taylor, W.R., Vaney, D.: Local edge detectors: a substrate for fine spatial vision at low temporal frequencies in rabbit retina. J. Neurosci. 26(51), 13250–13263 (2006)
7. Hennig, M.H., Funke, K., Wörgötter, F.: The influence of different retinal subcircuits on the nonlinearity of ganglion cell behavior. J. Neurosci. 22(19), 8726–8738 (2002)
8. Lukasiewicz, P.D.: Synaptic mechanisms that shape visual signaling at the inner retina. Prog. Brain Res. 147, 205–218 (2005)
9. Thiel, A., Greschner, M., Ammermüller, J.: The temporal structure of transient ON/OFF ganglion cell responses and its relation to intra-retinal processing. J. Comput. Neurosci. 21(2), 131–151 (2006)
10. Gaudiano, P.: Simulations of X and Y retinal ganglion cell behavior with a nonlinear push-pull model of spatiotemporal retinal processing. Vision Res. 34(13), 1767–1784 (1994)

Global Bifurcation Analysis of a Pyramidal Cell Model of the Primary Visual Cortex: Towards a Construction of Physiologically Plausible Model

Tatsuya Ishiki, Satoshi Tanaka, Makoto Osanai, Shinji Doi, Sadatoshi Kumagai, and Tetsuya Yagi

Division of Electrical, Electronic and Information Engineering, Graduate School of Engineering, Osaka University, Yamada-Oka 2-1, Suita, Osaka, Japan
[email protected]

Abstract. Many mathematical models of different neurons have been proposed so far; however, the way of modeling Ca2+ regulation mechanisms has not been established yet. Therefore, we try to construct a physiologically plausible model which contains many regulating systems of the intracellular Ca2+, such as Ca2+ buffering, the Na+/Ca2+ exchanger, and the Ca2+ pump current. In this paper, we seek plausible values of the parameters by analyzing the global bifurcation structure of our provisional model.

1

Introduction

Complex information processing in the brain is regulated by the electrical activity of neurons. Neurons transmit electrical signals called action potentials to each other for information processing. Action potentials take the form of spiking or bursting, and play an important role in the information processing of the brain. In the visual system, visual signals from the retina are processed by neurons in the primary visual cortex. There are several types of neurons in the visual cortex, and pyramidal cells compose roughly 80% of the neurons of the cortex. Pyramidal cells are connected to each other and form a complex neuronal circuit. Previous physiological and anatomical studies [1] revealed the fundamental structure of the circuit. However, it is not completely understood how visual signals propagate and function in the neuronal circuit of the visual cortex. In order to investigate the neuronal circuit, not only physiological experiments but also simulations using mathematical models of neurons are necessary. Many mathematical models of neurons have been proposed so far [2]. Though there are various models of neurons, the way of modeling the regulating system of the intracellular calcium ions (Ca2+) has not been established yet. The regulating system of the intracellular Ca2+ is a very important element, because intracellular Ca2+ plays crucial roles in cellular processes such as hormone and neurotransmitter release, gene transcription, and the regulation of synaptic plasticity. Therefore, it is important to establish a way of modeling the regulating system of the intracellular Ca2+.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 7–17, 2008. © Springer-Verlag Berlin Heidelberg 2008


In this paper, we try to construct a model of pyramidal cells based on previous physiological experimental data, focusing especially on the regulating systems of the intracellular Ca2+, such as Ca2+ buffering, the Na+/Ca2+ exchanger, and the Ca2+ pump current. In order to estimate the values of the parameters that cannot be determined by physiological experiments alone, we analyze the global bifurcation structure based on the slow/fast decomposition of the model. We thus demonstrate the usefulness of such nonlinear analyses not only for the analysis of an established model but also for the construction of a model.

2

Cell Model

The well-known Hodgkin-Huxley (HH) equations [3] describe the temporal variation of the membrane potential of neuronal cells. Though there are many neuron models based on the HH equations, the way of modeling the regulating system of the intracellular Ca2+ has not been established yet. Thus, we construct a pyramidal cell model using data from several physiological experiments [4]-[11]. The model includes a Ca2+ buffer, a Ca2+ pump, and a Na+/Ca2+ exchanger in order to describe the regulating system of the intracellular Ca2+ appropriately. The model also includes seven ionic currents through the ionic channels. The equations of the pyramidal cell model are as follows:

$$-C\frac{dV}{dt} = I_{\mathrm{total}} - I_{\mathrm{ext}}, \qquad (1a)$$

$$\frac{dy}{dt} = \frac{1}{\tau_y}\,(y_\infty - y), \quad (y = M1, \dots, M6, H1, \dots, H6), \qquad (1b)$$

$$\frac{d[\mathrm{Ca}^{2+}]}{dt} = \frac{-S \cdot I_{\mathrm{Catotal}}}{2F} + k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}], \qquad (1c)$$

$$\frac{d[\mathrm{Buf}]}{dt} = k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}], \qquad (1d)$$

$$\frac{d[\mathrm{CaBuf}]}{dt} = -\left(k_-[\mathrm{CaBuf}] - k_+[\mathrm{Ca}^{2+}][\mathrm{Buf}]\right), \qquad (1e)$$

where V is the membrane potential, C is the membrane capacitance, Itotal is the sum of all currents through the ionic channels and the Na+/Ca2+ exchanger, and Iext is the current injected into the cell externally. The variable y denotes the gating variables (M1, ..., M6, H1, ..., H6) of the ionic channels, y∞ is the steady-state function of y, and τy is a time constant. [Ca2+] denotes the intracellular Ca2+ concentration, ICatotal is the sum of all Ca2+ ionic currents, S is the surface-to-volume ratio, and F is the Faraday constant. [Buf] and [CaBuf] are the concentrations of the unbound and bound buffers, and k− and k+ are the reverse and forward rate constants of the binding reaction, respectively. Details of all equations and parameter values of this model can be found in the Appendix. First we show the simulation results when a certain external stimulus current is injected into the cell model (1). Figure 1 shows an action potential waveform and a change of [Ca2+] when a narrow external stimulus current (length 1 ms,


Fig. 1. Waveforms of [A] the membrane potential and [B] [Ca2+] in the case of a 1 ms narrow pulse injection


Fig. 2. [A] A waveform of the membrane potential under a long pulse injection. [B] A typical waveform of the membrane potential of pyramidal cell in physiological experiments [12].

density 40μA/cm2 ) is injected at t = 4000ms. The waveforms of the membrane potential and [Ca2+ ] are not so diﬀerent from the physiological experimental data qualitatively [1]. In contrast, as shown in Fig. 2A, in the case that a long pulse (length 1000ms, density 40μA/cm2 ) is injected, the membrane potential keeps a resting state after one action potential is generated. Though the membrane potential is spiking continuously in the physiological experiment (Fig. 2B), the membrane potential of the model does not show such a behavior. In general, it is well known that the membrane potential of pyramidal cell is resting in the case of no external stimulus, and spiking or bursting when the external stimulus is added. The aim of this paper is a reconstruction or a parameter tuning of the model (1) which can reproduce such a behavior of the membrane potential.

3

Bifurcation Analysis

The characteristics of the membrane potential vary with changes in the values of some parameters; therefore we investigate the bifurcation structure of the model (1) to estimate the values of those parameters. For the bifurcation analysis in this paper, we used the bifurcation analysis software AUTO [13].
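AUTO traces equilibrium and periodic branches by numerical continuation; the idea of a one-parameter scan can be sketched by brute force on a toy excitable system. The following Python sketch uses a FitzHugh-Nagumo model standing in for the pyramidal cell model (the parameters a, b, eps are illustrative, not from this paper) and classifies the equilibrium at each Iext via the eigenvalues of the Jacobian.

```python
import numpy as np

# Toy FitzHugh-Nagumo model standing in for the neuron model;
# a, b, eps are illustrative parameters, not from the pyramidal cell model.
a, b, eps = 0.7, 0.8, 0.08

def equilibrium(I):
    """The (unique, real) equilibrium v* of  v - v^3/3 - (v + a)/b + I = 0."""
    roots = np.roots([-1.0 / 3.0, 0.0, 1.0 - 1.0 / b, I - a / b])
    return roots[np.abs(roots.imag) < 1e-9].real[0]

def is_stable(v):
    """Stability from the eigenvalues of the Jacobian at the equilibrium."""
    J = np.array([[1.0 - v * v, -1.0],
                  [eps, -eps * b]])
    return bool(np.all(np.linalg.eigvals(J).real < 0.0))

# One-parameter scan over the injected current: a brute-force analogue of the
# equilibrium branch that AUTO would trace by continuation.
scan = {round(float(I), 2): is_stable(equilibrium(I))
        for I in np.arange(0.0, 1.01, 0.05)}
print(scan[0.0], scan[0.5])   # rest is stable at I = 0; unstable at I = 0.5
```

In this toy system the equilibrium loses stability through a Hopf bifurcation as the current increases, which is exactly the kind of transition searched for in Sects. 3.1-3.3; continuation software additionally tracks the bifurcating periodic orbits, which a pointwise scan cannot do.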


Fig. 3. One-parameter bifurcation diagram on the parameter Iext . The solid curve denotes stable equilibria of eq. (1).

3.1

The External Stimulation Current Iext

We analyze the bifurcation structure of the model to understand why continuous spiking of the membrane potential is not generated when a long pulse is injected. In order to investigate whether spiking is generated when Iext is increased, we vary the external stimulation current Iext as a bifurcation parameter. We show the one-parameter bifurcation diagram on the parameter Iext (Fig. 3), in which the solid curve denotes the membrane potential at stable equilibria of eq. (1). The one-parameter bifurcation diagram shows the dependence of the membrane potential on the parameter Iext. There is no bifurcation point in Fig. 3. Therefore the stability of the equilibrium point does not change, and thus the membrane potential of the model remains at rest even if Iext is increased. This result means that we cannot reproduce the physiological experimental result of Fig. 2B by varying Iext; thus we have to reconsider the values of the model parameters that cannot be determined by physiological experiments alone.

3.2

The Maximum Conductance of Ca2+ -Dependent Potassium Channel GKCa

The current through the Ca2+-dependent potassium channel is involved in the generation of spiking or bursting of the membrane potential. Therefore, we select the maximum conductance of the Ca2+-dependent potassium channel, GKCa, as a bifurcation parameter, and show the one-parameter bifurcation diagram when Iext = 10 (Fig. 4A). There are two saddle-node bifurcation points (SN1, SN2), three Hopf bifurcation points (HB1-HB3) and two torus bifurcation points (TR1, TR2). An unstable periodic solution bifurcating from HB1 changes its stability at the two torus bifurcation points and merges into the equilibrium point at HB3. Only in the range between HB1 and HB3 can the membrane potential oscillate. In order to investigate the dependence of the oscillatory range on the value of Iext, we show the two-parameter bifurcation diagram (Fig. 4B), in which the horizontal and vertical axes denote Iext and GKCa, respectively. The two-parameter bifurcation diagram shows the loci where a specific bifurcation occurs.


Fig. 4. [A] One-parameter bifurcation diagram on the parameter GKCa . The solid and broken curves show stable and unstable equilibria, respectively. The symbols • and ◦ denote the maximum value of V of the stable and unstable periodic solutions, respectively. [B] Two-parameter bifurcation diagram in the (Iext , GKCa )-plane.

In the diagram, the gray colored area separated by the HB and SN bifurcation curves corresponds to the range between HB1 and HB3 in Fig. 4A where the periodic solutions appear. As Iext increases, the gray colored area shrinks gradually and disappears near Iext = 25. This result means that the membrane potential of the model cannot exhibit any oscillations (spontaneous spiking) for large values of Iext, whatever the value of the parameter GKCa.

3.3

The Maximum Pumping Rate of the Ca2+ Pump Apump

The Ca2+ pump plays an important role in the regulation of the intracellular Ca2+. We also investigate the effect on the membrane potential of varying the value of Apump, the maximum pumping rate of the Ca2+ pump. Figure 5A is the one-parameter bifurcation diagram when Iext = 10. There are two Hopf bifurcation points (HB1, HB2) and four double-cycle bifurcation points (DC1-DC4). A stable periodic solution generated at HB1 changes its stability at the four double-cycle bifurcation points and merges into the equilibrium point at HB2. As in the case of GKCa, we show the two-parameter bifurcation diagram in the plane of the two parameters Iext and Apump (Fig. 5B), in order to examine the dependence of the oscillatory range between HB1 and HB2 on Iext. In Fig. 5B, the gray colored area, where the membrane potential oscillates, shrinks and disappears as Iext increases. The result shows that, as for GKCa, the membrane potential of the model cannot exhibit any oscillations (spontaneous spiking) for large values of Iext, whatever the value of Apump.

3.4

Slow/Fast Decomposition Analysis

In this section, in order to investigate the dynamics of our pyramidal cell model in more detail, we use the slow/fast decomposition analysis [14].


Fig. 5. [A] One-parameter bifurcation diagram on the parameter Apump. [B] Two-parameter bifurcation diagram in the (Iext, Apump)-plane.

A system with multiple time scales can be written generally as follows:

$$\frac{dx}{dt} = f(x, y), \quad x \in \mathbb{R}^n,\ y \in \mathbb{R}^m, \qquad (2a)$$

$$\frac{dy}{dt} = \varepsilon\, g(x, y), \quad 0 < \varepsilon \ll 1. \qquad (2b)$$

Equation (2b) is called a slow subsystem, since the value of y changes slowly, while equation (2a) is called a fast subsystem. The whole of eq. (2) is called the full system. The so-called slow/fast analysis divides the full system into the slow and fast subsystems. In the fast subsystem (2a), the slow variable y is considered a constant or a parameter. The variable x changes much more quickly than y, and thus x is considered to stay close to an attractor (stable equilibrium point, limit cycle, etc.) of the fast subsystem for a fixed value of y. The variable y changes slowly with a velocity g(x, y) in which x is considered to be in the neighborhood of the attractor. The attractor of the fast subsystem may change if y is varied; analyzing the dependence of the attractor on the parameter y is a bifurcation problem. Thus the slow/fast analysis reduces the analysis of the full system to a bifurcation problem of the fast subsystem with a slowly varying bifurcation parameter. In the case of the pyramidal cell model (1), the slow/fast analysis can be made under the assumption that the change of the intracellular Ca2+ concentration [Ca2+] is slower than that of the other variables. Thus, we consider [Ca2+] as a bifurcation parameter and eq. (1c) as the slow subsystem, and all the other equations (1a,b,d,e) as the fast subsystem. We show the bifurcation diagram of the fast subsystem obtained by varying the value of [Ca2+] as a parameter (Fig. 6). The figure shows the stable and unstable equilibria of the fast subsystem with Iext = 0 (thick solid and broken curves, resp.), and the nullcline of the slow subsystem (thin curve). The intersection point of the equilibrium curve of the fast subsystem with the nullcline of the slow subsystem is the equilibrium point of the full system. The stability of the full system is determined by whether the intersection point lies on the stable or the unstable branch of the equilibrium curve of the fast subsystem.
Therefore, the full system is stable in the case of


Fig. 6. Bifurcation diagram of the fast subsystem (Iext = 0) with Ca2+ as a bifurcation parameter and the slow-nullcline of the slow subsystem

Fig. 6. In addition, when Iext is increased, the bifurcation diagram (equilibrium curve) of the fast subsystem shifts upward; the full system remains stable and no oscillation appears, as will be shown in Fig. 7. By changing a parameter of the slow subsystem, the shape of the slow-nullcline changes and the intersection point is also shifted. First, we select some parameters of the slow subsystem. Because the Ca2+ pump is included only in the slow subsystem, we select Apump and the dissociation constant Kpump, both parameters of the Ca2+ pump, as the parameters of the slow subsystem. Second, we change the values of Apump and Kpump in order to change the shape of the nullcline. Figure 7A shows the slow-nullclines (thin solid or broken curves) for varying Apump, together with the equilibria of the fast subsystem (thick solid and broken curves) for Iext = 0, 20 and 40. As Apump increases, the nullcline of the slow subsystem shifts upward, and the intersection point of the equilibrium curve of the fast subsystem (Iext = 0) with the slow-nullcline is then located at an unstable equilibrium. Therefore, the membrane potential of the full system is spiking when Iext = 0. Figure 7B is the diagram analogous to Fig. 7A, where the value of Kpump is varied (Apump is varied in Fig. 7A). When the Kpump value is changed, the shape of the slow-nullcline does not change much; therefore the intersection point of the equilibrium curve of the fast subsystem with the nullcline stays at stable equilibria, and the full system remains stable at a resting state. Next, in Fig. 8, we show an example of spontaneous spiking induced by an increase of Apump (Apump = 20). The gray colored orbit in Fig. 8A is the projected trajectory of the oscillatory membrane potential, and the waveform is shown in Fig. 8B.
Because the equilibrium curve of the fast subsystem intersects the slow-nullcline at an unstable equilibrium, the membrane potential oscillates even though Iext = 0. The projected trajectory of the full system follows the stable equilibrium of the fast subsystem (the lower branch of the thick curve) for a long time, and this prolongs the inter-spike interval. After the trajectory passes through the intersection point, the membrane potential makes a spike. When the trajectory passes through the intersection point, it winds around it. This winding is possibly caused by a complicated nonlinear


Fig. 7. Variation of the equilibria of the fast subsystem and the nullcline of the slow subsystem under changes of the slow-subsystem parameters: [A] Apump, [B] Kpump


Fig. 8. [A] An oscillatory trajectory of the full system (gray curve) with bifurcation diagram of the fast subsystem (Iext = 0, thick solid and broken curve) and the nullcline of the slow subsystem (Apump = 20, thin curve), [B] The oscillatory waveform of the membrane potential

dynamics [14] and produces the subthreshold oscillation of the membrane potential just before the spike in Fig. 8B. However, this subthreshold oscillation is not observed in physiological experiments on pyramidal cells (Fig. 2B).
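The slow/fast decomposition used above can be illustrated on a generic two-time-scale toy system (a van der Pol-type relaxation oscillator, not the pyramidal cell model): the fast subsystem's equilibria are computed with the slow variable frozen as a parameter, and a full-system simulation shows the trajectory drifting along the stable branches and jumping between them. The time-scale separation eps and all other values are illustrative.

```python
import numpy as np

eps = 0.05   # time-scale separation, illustrative

def fast_equilibria(y):
    """Equilibria of the fast subsystem dx/dt = x - x^3/3 - y, with the slow
    variable y frozen as a bifurcation parameter; stable iff 1 - x^2 < 0."""
    roots = np.roots([-1.0 / 3.0, 0.0, 1.0, -y])
    real = roots[np.abs(roots.imag) < 1e-9].real
    return [(float(x), bool(1.0 - x * x < 0.0)) for x in real]

# At y = 0 the fast subsystem is bistable: outer equilibria stable, middle unstable.
eq = fast_equilibria(0.0)

# Full system: the trajectory hugs a stable branch of the fast equilibrium
# curve, drifts slowly, and jumps to the other branch at the knees.
x, y = 2.0, 0.0
dt, xs = 0.01, []
for _ in range(40000):                  # T = 400, several relaxation cycles
    x += dt * (x - x ** 3 / 3.0 - y)    # fast subsystem, as in eqn (2a)
    y += dt * eps * x                   # slow subsystem, as in eqn (2b)
    xs.append(x)
xs = np.array(xs)

print(len(eq), xs.max() > 1.0 and xs.min() < -1.0)
```

The long intervals spent near a stable branch correspond to the prolonged inter-spike intervals described above, and the jumps correspond to the spikes.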

4

Conclusion

In this research, we tried to construct a model of pyramidal cells in the visual cortex focusing on Ca2+ regulation mechanisms, and analyzed the global bifurcation structure of the model in order to seek physiologically plausible values of its parameters. We analyzed the global bifurcation structure of the model using the maximum conductance of the Ca2+-dependent potassium channel (GKCa) and the maximum pumping rate of the Ca2+ pump (Apump) as bifurcation parameters. From the two-parameter bifurcation diagrams we showed that the range where spontaneous spiking occurs shrinks as the external stimulation current Iext


increases. Therefore, the membrane potential of the model cannot oscillate for large values of Iext even if both values of GKCa and Apump are changed. We also investigated the effect of Apump and the dissociation constant Kpump on the nullcline of the slow subsystem, based on the slow/fast decomposition analysis. If Apump is increased, the membrane potential spikes when Iext = 0, because the nullcline shifts upward and the full system becomes unstable. When Kpump is varied, the membrane potential keeps a resting state, because the full system remains stable. Unfortunately, the expected behavior was not obtained by changing the values of the parameters considered in this paper. We have, however, demonstrated the usefulness of nonlinear analyses such as bifurcation and slow/fast analyses for examining parameter values and constructing a physiological model. A more detailed study using other parameters is necessary for the construction of an appropriate model, and remains as future work.

References
1. Osanai, M., Takeno, Y., Hasui, R., Yagi, T.: Electrophysiological and optical studies on the signal propagation in visual cortex slices. In: Proc. of 2005 Annu. Conf. of Jpn. Neural Network Soc., pp. 89–90 (2005)
2. Herz, A.V.M., Gollisch, T., Machens, C.K., Jaeger, D.: Modeling single-neuron dynamics and computations: a balance of detail and abstraction. Science 314, 80–85 (2006)
3. Hodgkin, A.L., Huxley, A.F.: A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.) 117, 500–544 (1952)
4. Brown, A.M., Schwindt, P.C., Crill, W.E.: Voltage dependence and activation kinetics of pharmacologically defined components of the high-threshold calcium current in rat neocortical neurons. J. Neurophysiol. 70, 1516–1529 (1993)
5. Peterson, B.Z., DeMaria, C.D., Yue, D.T.: Calmodulin is the Ca2+ sensor for Ca2+-dependent inactivation of L-type calcium channels. Neuron 22, 549–558 (1999)
6. Cummins, T.R., Xia, Y., Haddad, G.G.: Functional properties of rat and human neocortical voltage-sensitive sodium currents. J. Neurophysiol. 71, 1052–1064 (1994)
7. Korngreen, A., Sakmann, B.: Voltage-gated K+ channels in layer 5 neocortical pyramidal neurons from young rats: subtypes and gradients. J. Physiol. (Lond.) 525, 621–639 (2000)
8. Kang, J., Huguenard, J.R., Prince, D.A.: Development of BK channels in neocortical pyramidal neurons. J. Neurophysiol. 76, 188–198 (1996)
9. Hayashida, Y., Yagi, T.: On the interaction between voltage-gated conductances and Ca2+ regulation mechanisms in retinal horizontal cells. J. Neurophysiol. 87, 172–182 (2002)
10. Naraghi, M., Neher, E.: Linearized buffered Ca2+ diffusion in microdomains and its implications for calculation of [Ca2+] at the mouth of a calcium channel. J. Neurosci. 17, 6961–6973 (1997)
11. Noble, D.: Influence of Na/Ca exchanger stoichiometry on model cardiac action potentials. Ann. N.Y. Acad. Sci. 976, 133–136 (2002)

16

T. Ishiki et al.


Appendix

Itotal = INa + IKs + IKf + IKCa + ICaL + ICa + Ileak + Iex
ICatotal = ICaL + ICa − 2Iex + Ipump

INa = GNa · M1 · H1 · (V − ENa),  GNa = 13.0 (mS/cm2), ENa = 35.0 (mV)
τM1(V) = 1 / ( 0.182 (V + 29.5)/(1 − exp[−(V + 29.5)/6.7]) − 0.124 (V + 29.5)/(1 − exp[(V + 29.5)/6.7]) )
τH1(V) = 0.5 + 1 / ( exp[−(V + 124.955511)/19.76147] + exp[−(V + 10.07413)/20.03406] )

IKs = GKs · M2 · H2 · (V − EKs),  GKs = 0.66 (mS/cm2), EKs = −103.0 (mV)
τM2(V) = 1.25 + 115.0 exp[0.026 V]  (V < −50 mV);  1.25 + 13.0 exp[−0.026 V]  (V ≥ −50 mV)
τH2(V) = 360.0 + (1010.0 + 24.0 (V + 55.0)) exp[−((V + 75.0)/48.0)2]

IKf = GKf · M3 · H3 · (V − EKf),  GKf = 0.27 (mS/cm2), EKf = EKs
τM3(V) = 0.34 + 0.92 exp[−((V + 71.0)/59.0)2]
τH3(V) = 8.0 + 49.0 exp[−((V + 37.0)/23.0)2]

IKCa = GKCa · M4 · (V − EKCa),  GKCa = 12.5 (mS/cm2), EKCa = EKs
M4∞(V, [Ca2+]) = ([Ca2+]/([Ca2+] + Kh)) · (1/(1 + exp[−(V + 12.7)/26.2])),  Kh = 0.15 (μM)
τM4(V) = 1.25 + 1.12 exp[(V + 92.0)/41.9]  (V < 40 mV);  27.0  (V ≥ 40 mV)

ICaL = PCaL · M5 · H5 · ((2F)2/RT) · V · ([Ca2+] exp[2VF/RT] − [Ca2+]o)/(exp[2VF/RT] − 1)
PCaL = 0.225 (cm/ms),  H5∞ = KCa4/(KCa4 + [Ca2+]4),  KCa = 4.0 (μM)
τM5(V) = 2.5/(exp[−0.031(V + 37.1)] + exp[0.031(V + 37.1)]),  τH5 = 2000.0

ICa = PCa · M6 · H6 · ((2F)2/RT) · V · ([Ca2+] exp[2VF/RT] − [Ca2+]o)/(exp[2VF/RT] − 1)
PCa = 0.155 (cm/ms),  τM6(V) = 2.5/(exp[−0.031(V + 37.1)] + exp[0.031(V + 37.1)]),  τH6 = 2000.0

Ileak = V/30.0

Iex = k ( [Na+]i3 · [Ca2+]o · exp[s · VF/RT] − [Na+]o3 · [Ca2+] · exp[−(1 − s) · VF/RT] )
k = 6.0 × 10−5 (μA/cm2/mM4),  s = 0.5

Ipump = (2F · Apump · [Ca2+])/([Ca2+] + Kpump),  Apump = 5.0 (pmol/s/cm2), Kpump = 0.4 (μM)

Mi∞(V) = 1/(1 + exp[−(V − αMi)/βMi]),  i = 1, 2, 3, 5, 6
Hj∞(V) = 1/(1 + exp[(V − αHj)/βHj]),  j = 1, 2, 3, 6

i          1       2      3      5       6
αMi (mV)   −29.5   −3.0   −3.0   −18.75  18.75
βMi        6.7     10.0   10.0   7.0     7.0

j          1       2      3      6
αHj (mV)   −65.8   −51.0  −66.0  −12.6
βHj        7.1     12.0   10.0   18.9

Global Bifurcation Analysis of a Pyramidal Cell Model


C = 1.0(μF/cm2 ), S = 3.75(/cm), k− = 5.0(/ms), k+ = 500.0(/mM · ms) [Ca2+ ]o = 2.5(mM), [Na+ ]i = 7.0(mM), [Na+ ]o = 150.0(mM)

T and R denote the absolute temperature and the gas constant, respectively. In all ionic currents of the model, the powers of the gating variables (M1, ..., M6, H1, ..., H6) are set to one in order to simplify the equations. The leak current is assumed to have no ion selectivity and to follow Ohm's law; thus the reversal potential of the leak current is set to 0 (mV), though this is unusual.
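For concreteness, the steady-state gating curves Mi∞, Hj∞ and the pump current Ipump defined in the appendix can be evaluated directly. The following is a minimal Python sketch; the function names are ours, and the constants are taken at face value from the appendix (with Faraday's constant in C/mol):

```python
import math

F = 96485.0  # Faraday constant (C/mol)

def m_inf(V, alpha, beta):
    """Steady-state activation: Mi_inf = 1/(1 + exp(-(V - alpha)/beta))."""
    return 1.0 / (1.0 + math.exp(-(V - alpha) / beta))

def h_inf(V, alpha, beta):
    """Steady-state inactivation: Hj_inf = 1/(1 + exp((V - alpha)/beta))."""
    return 1.0 / (1.0 + math.exp((V - alpha) / beta))

def i_pump(ca, A_pump=5.0, K_pump=0.4):
    """Ca pump current: I_pump = 2*F*A_pump*[Ca]/([Ca] + K_pump)."""
    return 2.0 * F * A_pump * ca / (ca + K_pump)

# Each Boltzmann curve crosses 1/2 exactly at V = alpha, and the pump
# current saturates toward 2*F*A_pump as [Ca2+] grows past K_pump.
print(m_inf(-29.5, -29.5, 6.7))  # 0.5
```

At [Ca2+] = Kpump the pump runs at exactly half of its saturation value, which is the usual reading of a dissociation constant in such Michaelis-Menten-type expressions.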

Representation of Medial Axis from Synchronous Firing of Border-Ownership Selective Cells

Yasuhiro Hatori and Ko Sakai

Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
[email protected], [email protected]
http://www.cvs.cs.tsukuba.ac.jp/

Abstract. How object shape is represented in the visual system is one of the most crucial questions in brain science. Although we perceive figure shape correctly and quickly, without any effort, the underlying cortical mechanism is largely unknown. Physiological experiments with macaques indicated the possibility that the brain represents a surface by the Medial Axis (MA) representation. To examine whether early visual areas could provide a basis for MA representation, we constructed a physiologically realistic computational model of the early visual cortex and examined what constraint is necessary for the representation of MA. Our simulation results showed that simultaneous firing of Border-Ownership (BO) selective cells at the stimulus onset is a crucial constraint for MA representation.

Keywords: Shape, representation, perception, vision, neuroscience.

1 Introduction

Segregation of figure from ground might be the first step in the cortex toward the recognition of shape and objects. Recent physiological studies have shown that around 60% of neurons in cortical areas V2 and V4 are selective to Border Ownership (BO), which tells which side of a contour owns the border, i.e., the direction of figure; even about 20% of V1 neurons showed BO selectivity [1]. These reports also give an insightful idea on the coding of shape in early- to intermediate-level vision. The coding of shape is a major question in neuroscience as well as in robot vision. Specifically, it is of great interest how the visual information in early visual areas is processed to form the representation of shape. Physiological studies in monkeys [2] suggest that shape is coded by the medial axis (MA) representation in early visual areas. The MA representation is a method that codes a surface by a set of circles inscribed along the contour of the surface. An arbitrary shape can be reproduced from the centers of the circles and their diameters.

We examined whether neural circuits in early- to intermediate-level visual areas could provide a basis for MA representation. We propose that the synchronized responses of BO-selective neurons could evoke the representation of MA. The physiological study [2] showed that V1 neurons responded to figure shape around 40 ms after the stimulus onset, while the latency of the cells that responded to MA was about 200 ms after the onset.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 18–26, 2008. © Springer-Verlag Berlin Heidelberg 2008

Physiological studies of


BO-selective neurons reported that their latency is around 70 ms. These results give rise to the proposal that the neurons detect contours first, then BO is determined from the detected local contrast, and finally the MA representation is constructed. It should be noted that BO can be determined from the local contrast surrounding the classical receptive field (CRF); thus a single neuron with surround modulation would be sufficient to yield BO selectivity [3]. To examine this proposal, we constructed a physiologically realistic firing model of Border-Ownership (BO)-selective neurons. We assume that the onset of a stimulus with high contrast evokes simultaneous, strong responses of the neurons. These fast responses propagate retinotopically, so that the neurons at the MA (equidistant from the contour) are activated. Our simulation results show that, because of the synchronization, even relatively small facilitation from the propagated signals yields firing of the model cells at the MA, indicating that simultaneous firing of BO-selective cells at the stimulus onset could enable the representation of MA.

2 The Proposed Model

The model comprises three stages: (1) a contrast detection stage, (2) a BO detection stage, and (3) an MA detection stage. The first stage extracts luminance contrast, similarly to V1 simple cells. The model cells in the second stage mimic BO-selective neurons, which determine the direction of BO with respect to the border at the CRF, based on modulation from surrounding contrast up to 5 deg in visual angle from the CRF. The third stage spatially pools the responses from the second stage to test the MA representation arising from simultaneous firing of BO cells. A schematic diagram of the model is given in Fig. 1. The following sections describe the function of each stage.

2.1 Model Neurons

To enable simulations with precise spatiotemporal properties, we implemented single-compartment firing neurons and their synaptic connections on the NEURON simulator [4]. The cell body of the model cell is approximated by a sphere. We set the radius and the membrane resistance to 50 μm and 34.5 Ωcm, respectively. The model neurons compute the membrane potential following the Hodgkin-Huxley equations [5] with the constant parameters shown in Table 1. We used a biophysically realistic spiking neuron model because we need to examine the exact timing of the firing of BO-selective cells, as well as the propagation of the signals, which cannot be realized by an integrate-and-fire neuron model or other abstract models.

2.2 Contrast Detection Stage

The model cells in the first stage have response properties similar to those of V1 cells, including contrast detection, dynamic contrast normalization, and a static compressive nonlinearity. The model cells in this stage detect luminance contrast with oriented Gabor filters with distinct orientations. We limited ourselves to four orientations for the sake of simplicity: vertical (0 and 180 deg) and horizontal (90 and 270 deg). The Gabor filter is defined as follows:

Gθ(x, y) = cos(2πω(x sinθ + y cosθ)) × gaussian(x, y, μx, μy, σx, σy),   (1)


Fig. 1. A schematic illustration of the proposed model. Luminance contrast is detected in the first stage which is then processed to determine the direction of BO by surrounding modulation. F and S represent excitatory and inhibitory regions, respectively, for the surrounding modulation. The last stage detects MA based on the propagations from BO model cells.

where x and y represent the spatial location, μx and μy the center of the Gaussian, σx and σy its standard deviations, and θ and ω the orientation and spatial frequency, respectively. We take the convolution of an input image


with the Gabor filters, with dynamic contrast normalization [6] including a static compressive nonlinear function. For efficient computation, the responses of the vertical pathways (0 and 180 deg) are integrated to form the vertical orientation, and likewise the horizontal pathways (90 and 270 deg), which is convenient for the computation of iso-orientation suppression and cross-orientation facilitation in the next stage:

Oθ(x, y) = (I * Gθ)(x, y),   (2)

O1iso(x, y) = O0(x, y) + O180(x, y),   (3)

O1cross(x, y) = O90(x, y) + O270(x, y),   (4)

where I represents the input image, Oθ(x, y) the output of the convolution (*), and O1iso (O1cross) the integrated responses of the vertical (horizontal) pathway.

Table 1. The constant values for the model cells used in the simulations

Parameter   Value
Cm          1 (μF/cm2)
ENa         50 (mV)
EK          −77 (mV)
El          −54.3 (mV)
gNa         0.120 (S/cm2)
gK          0.036 (S/cm2)
gl          0.0003 (S/cm2)
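To illustrate the contrast-detection stage (Eqs. (1)-(4)), the following Python sketch builds an oriented Gabor kernel and takes its inner product with a grating patch. This is a simplification of the model (a single kernel size, the response evaluated at one location only, no normalization or rectification), and the function names are ours:

```python
import numpy as np

def gabor(size, theta, omega=0.1, sigma=2.0):
    """Gabor kernel in the spirit of Eq. (1): a cosine grating times a
    Gaussian envelope centred on the patch (mu_x = mu_y = 0 here)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    carrier = np.cos(2.0 * np.pi * omega * (x * np.sin(theta) + y * np.cos(theta)))
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return carrier * envelope

# Response of one model cell ~ inner product of the kernel with an image
# patch (a single sample of the convolution in Eq. (2)).
half = 4
y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
grating = np.cos(2.0 * np.pi * 0.1 * x)  # stimulus varying along x

matched = float(np.sum(grating * gabor(9, np.pi / 2)))   # kernel also varies along x
orthogonal = float(np.sum(grating * gabor(9, 0.0)))      # kernel varies along y
print(matched > orthogonal)  # True: orientation-tuned response
```

The kernel whose carrier is aligned with the stimulus grating responds more strongly, which is the orientation selectivity the pooling in Eqs. (3)-(4) builds on.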

2.3 BO Detection Stage

The second stage models the surround modulation reported in early- to intermediate-level vision for the determination of BO [3]. The model cells integrate surrounding contrast information up to 5 deg in visual angle from the CRF center. We modeled the surrounding region with two Gaussians, one inhibitory and the other facilitatory, located asymmetrically with respect to the CRF center. If a part of the contour of an object is projected onto the excitatory (or inhibitory) region, the contrast information of the contour detected by the first stage is transmitted via a pulse to generate an EPSP (or IPSP) in the BO-selective model cell. In other words, the projection of a figure within the excitatory region facilitates the response of the BO model cell. Conversely, if a figure is projected onto the inhibitory region, the response of the model cell is suppressed. Therefore, surrounding contrast signals from the excitatory and inhibitory regions modulate the activity of BO model cells depending on the direction of figure. In this way, we implemented BO model cells based on surround modulation. Note that Jones and her colleagues reported an orientation dependency of the surround modulation in monkeys' V1 cells [7]: suppression is limited to orientations similar to the preferred orientation of the CRF (iso-orientation suppression), and facilitation is dominant for other orientations (cross-orientation facilitation). We implemented this orientation dependency for the surround modulation.
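The asymmetric two-Gaussian surround rule just described can be sketched in a toy computation: contour contrast falling in the excitatory lobe raises the cell's drive, and contrast in the inhibitory lobe lowers it, so two cells with mirrored lobes signal opposite BO directions. The geometry, sizes, and names below are our illustrative choices, not the paper's actual parameters:

```python
import numpy as np

def gauss2d(x, y, mu, sigma):
    """Unnormalized 2-D Gaussian centred at mu = (mu_x, mu_y)."""
    return np.exp(-((x - mu[0]) ** 2 + (y - mu[1]) ** 2) / (2.0 * sigma ** 2))

def bo_modulation(contrast, exc_mu, inh_mu, sigma=3.0):
    """Net surround modulation of a BO cell whose CRF sits at the origin:
    excitatory-region contrast adds, inhibitory-region contrast subtracts."""
    ys, xs = np.mgrid[-10:11, -10:11].astype(float)
    weight = gauss2d(xs, ys, exc_mu, sigma) - gauss2d(xs, ys, inh_mu, sigma)
    return float(np.sum(weight * contrast))

# A toy vertical contour with figure contrast to the left of the CRF:
contrast = np.zeros((21, 21))
contrast[:, 5] = 1.0  # contour 5 pixels left of the centre (x = -5)

# Cell preferring figure-on-left: excitatory lobe left, inhibitory lobe right.
left_pref = bo_modulation(contrast, exc_mu=(-5.0, 0.0), inh_mu=(5.0, 0.0))
right_pref = bo_modulation(contrast, exc_mu=(5.0, 0.0), inh_mu=(-5.0, 0.0))
print(left_pref > right_pref)  # True: the left-preferring cell is facilitated
```

The two mirrored cells receive exactly opposite net modulation from the same contour, which is how the pair can signal the direction of figure.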


Taking into account the EPSPs and IPSPs from the surround, we compute the membrane potential of a BO-selective model cell at time t as follows:

O2(x1, y1, t) = input(x1, y1) + c Σx,y { Eiso(x, y, t − dx1,y1(x, y)) + Ecross(x, y, t − dx1,y1(x, y)) },   (5)

where x1 and y1 represent the spatial position of the BO-selective cell, input(x1, y1) represents the output of the first stage, i.e. O1iso or O1cross, c represents a synaptic connection weight, and Eiso(x, y, t − dx1,y1(x, y)) represents the EPSP (or IPSP) triggered by the pulse generated at t − dx1,y1(x, y). Ecross(x, y, t − dx1,y1(x, y)) is the same except for the input orientation. dx1,y1(x, y) is the time delay, proportional to the distance between the BO-selective cell at (x1, y1) and the connected cell at (x, y). We defined Eiso(x, y, t − dx1,y1(x, y)) and dx1,y1(x, y) as:

Eiso(x, y, t − dx1,y1(x, y)) = gaussian(x, y, μx1, μy1, σx1, σy1) × exp(−(t − dx1,y1(x, y))/τ) × (v − e),   (6)

dx1,y1(x, y) = ctime √((x1 − x)2 + (y1 − y)2),   (7)

where τ represents a time constant, v the membrane potential, e the reversal potential, and ctime a constant that converts distance to time. We set c, τ, e and ctime to 0.6 (or 1.0), 10 ms, 0 mV and 0.2 ms/μm, respectively. Ecross(x, y, t − dx1,y1(x, y)) is calculated similarly to Eq. (6).

2.4 MA Detection Stage

The third stage integrates BO information from the second stage to detect the MA. An MA model cell has a single excitatory surrounding region represented by a Gaussian. The membrane potential of an MA model cell is given by:

O3(x2, y2, t) = cmedial Σ(x,y)∈(x1,y1) gaussian(x, y, μx2, μy2, σx2, σy2) × O2(x, y, t − dx2,y2(x, y)),   (8)

where x2 and y2 represent the spatial position of the MA model cell, μx2 and μy2 the center of the Gaussian, and cmedial a constant that we set to 1.8 (or 10.0); dx2,y2(x, y) is calculated similarly to Eq. (7). Note that MA model cells receive EPSPs only from BO model cells. When a BO model cell is activated, it transmits a pulse to MA model cells, with magnitude and time delay depending on the distance between the two. If an MA model cell is located equidistant from some parts of the contours, the pulses from the BO model cells on the contours reach the MA cell at the same time and evoke a strong EPSP that generates a spike. On the other hand, MA model cells that are not equidistant from the contours never evoke a pulse. Therefore, the model cells located equidistant from the contours are activated by simultaneous activation from BO model cells, and this neural population represents the medial axis of the object.
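The synchrony argument of this stage can be illustrated with a toy computation: under constant-speed propagation as in Eq. (7), pulses launched simultaneously at stimulus onset arrive at the same time exactly at points equidistant from two or more contour points. The following is a minimal sketch of that geometric criterion (function names and grid are ours, not the paper's spiking implementation):

```python
import numpy as np
from itertools import product

def medial_axis_by_synchrony(contour, eps=1e-9):
    """Flag grid points where pulses from two or more contour points,
    launched simultaneously and travelling at constant speed, arrive
    together, i.e. points equidistant from >= 2 nearest contour points."""
    pts = np.argwhere(contour)              # contour coordinates
    ma = np.zeros_like(contour, dtype=bool)
    for yy, xx in product(range(contour.shape[0]), range(contour.shape[1])):
        if contour[yy, xx]:
            continue
        d = np.hypot(pts[:, 0] - yy, pts[:, 1] - xx)  # travel times ~ distances
        ma[yy, xx] = np.sum(np.abs(d - d.min()) < eps) >= 2
    return ma

# A 9x9 square outline: the centre is equidistant from the four edge midpoints.
grid = np.zeros((9, 9), dtype=bool)
grid[0, :] = grid[8, :] = grid[:, 0] = grid[:, 8] = True
ma = medial_axis_by_synchrony(grid)
print(ma[4, 4])  # True: the centre receives synchronous pulses
```

Points with a unique nearest contour point (e.g. one step in from the middle of an edge) receive their earliest pulse alone and are not flagged, while the square's centre and its diagonals are, matching the classical medial axis of a square.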

3 Simulation Results

We carried out simulations of the model to test whether it shows the representation of MA. As typical examples, the results for three types of stimuli are shown here: a square, a C-shaped figure, and a natural image of an eagle [8], as shown in Fig. 2. Note that the model cells described in the previous sections are distributed retinotopically to form layers of 138 × 138 cells each.

Fig. 2. Three examples of stimuli used for the simulations: a square (A), a C-shaped figure (B), and a natural image of an eagle from the Berkeley Segmentation Dataset [8] (C).

3.1 A Single Square

First, we tested the model with a single square similar to that used in the corresponding physiological experiment [2], as shown in Fig. 2(A). Although we carried out the simulations retinotopically in 2D, for graphical presentation the responses along the horizontal cross section of the stimulus indicated in Fig. 3(A) are shown in Fig. 3(B). The figure exhibits the firing rates of two types of model cells: BO model cells responding to contours (solid lines at the horizontal positions of −1 and 1) and MA model cells (dotted lines at the horizontal position of 0). We observe a clear peak corresponding to the MA at the center, similar to the results of the physiological experiments by Lee et al. [2]. Although we tested the model along the horizontal cross section, the response along the vertical is identical. This result suggests that simultaneous firing of BO cells is capable of generating the MA without any other particular constraints.

Fig. 3. The simulation results for a square. (A) We show the responses of the cells located along the dashed line (the number of neurons is 138). Horizontal positions −1 and 1 represent the vertical edges of the square; zero represents the center of the square. (B) The responses of the model cells in firing rate along the cross section as a function of horizontal location. A clear peak at the center, corresponding to the MA, is observed.


3.2 C-Shape

We tested the model with a C-shaped figure, which has been suggested to be a difficult shape for the determination of BO. Fig. 4 shows the simulation results for the C-shape. The responses along horizontal and vertical cross sections, as indicated in Fig. 4(A), are shown in Fig. 4(B) and (C), respectively, with the same conventions as Fig. 3. The figure exhibits the firing rates of two types of model cells: BO model cells responding to contours (solid lines) and MA model cells (dotted lines). The MA representation predicts a strong peak at 0 along the horizontal cross section, and distributed strong responses along the vertical cross section. Although we observe maximal responses at the centers corresponding to the MA, the peak along the horizontal was not significant, and the distribution along the vertical was peaky and uneven. It appears that the MA cells cannot properly integrate the signals propagated from the BO cells. This distributed MA response comes from the complicated propagation of BO signals from the concave shape. Furthermore, as has been suggested [3], the C-shaped figure is a challenging shape for the determination of BO; therefore, the BO signals are not clear before the propagation begins. We would like to note that the model is still capable of providing a basis for MA representation for a complicated figure.


Fig. 4. The simulation results for the C-shaped figure. (A) Positions of the analysis along horizontal and vertical cross sections, indicated by dashed lines. The responses of the model cells in firing rate along the horizontal cross section (B) and the vertical cross section (C). Solid and dotted lines indicate the responses of BO and MA model cells, respectively. Although the maximum responses are observed at the centers, the responses are distributed.

3.3 Natural Images

The model has shown its ability to extract a basis for MA representation not only for simple but also for difficult shapes. To further examine the model for arbitrary shapes, we


tested the model with natural images taken from the Berkeley Segmentation Dataset [8]. Fig. 2(C) shows an example, an eagle perched on a tree branch. Because we are interested in the representation of shape, we extracted the shape by binarizing the gray scale, as shown in Fig. 5(A). The simulation results for the BO and MA model cells are shown in Fig. 5(B); we plotted the responses along the horizontal cross section indicated in Fig. 5(A). Although the shape is much more complicated, detailed, and asymmetric, the results are very similar to those for the square shown in Fig. 3(B). The BO model cells responded to the contours (horizontal positions −1 and 1), and the MA model cells exhibited a strong peak at the center (horizontal position 0). This result indicates that the model detects the MA for figures with arbitrary shape. The aim of the simulations with natural images is to test a variety of stimulus shapes and configurations that are possible in natural scenes. Further simulations with a number of natural images are expected, specifically with images including occlusion, multiple objects, and ambiguous figures.

Fig. 5. An example of the simulation results for natural images. (A) The binary image of an eagle, together with the horizontal cross section used for the graphical presentation of the results. (B) The responses of the model cells in firing rate along the cross section as a function of horizontal location. Horizontal positions −1 and 1 represent the vertical edges of the eagle; zero represents the center of the bird. A clear peak at the center, corresponding to the MA, is observed. Although the shape is much more complicated, detailed, and asymmetric, the results are very similar to those for the square shown in Fig. 3(B), indicating the robustness of the model.

4 Conclusion

We studied whether early visual areas could provide a basis for MA representation, specifically what constraint is necessary for the representation of MA. Our results showed that simultaneous firing of BO-selective neurons is crucial for MA representation. We implemented physiologically realistic firing model neurons with connections from BO model cells to MA model cells. If a stimulus is presented at once so that the BO cells along the contours fire simultaneously, and if an MA cell is located equidistant from some of the contours of the stimulus, then the MA cell fires because the synchronous signals from the BO cells give rise to a strong EPSP. We showed three typical examples of the simulation results: a simple square, a difficult C-shaped figure, and a natural image of an eagle. The simulation results showed that the model provides a basis for MA representation for all three types of stimuli. These


results suggest that the simultaneous firing of BO cells is essential for the MA representation in early visual areas.

Acknowledgment We thank Dr. Haruka Nishimura for her insightful comments and Mr. Satoshi Watanabe for his help in simulations. This work was supported by Grant-in-aid for Scientific Research from the Brain Science Foundation, the Okawa Foundation, JSPS (19530648), and MEXT of Japan (19024011).

References

1. Zhou, H., Friedman, H.S., von der Heydt, R.: Coding of Border Ownership in Monkey Visual Cortex. The Journal of Neuroscience 86, 2796–2808 (2000)
2. Lee, T.S., Mumford, D., Romero, R., Lamme, V.A.F.: The role of the primary visual cortex in higher level vision. Vision Research 38, 2429–2454 (1998)
3. Sakai, K., Nishimura, H.: Surrounding Suppression and Facilitation in the Determination of Border Ownership. The Journal of Cognitive Neuroscience 18, 562–579 (2006)
4. NEURON: http://www.neuron.yale.edu/neuron/
5. Johnston, D., Wu, S. (eds.): Foundations of Cellular Neurophysiology. MIT Press, Cambridge (1999)
6. Carandini, M., Heeger, D.J., Movshon, J.A.: Linearity and Normalization in Simple Cells of the Macaque Primary Visual Cortex. The Journal of Neuroscience 21, 8621–8644 (1997)
7. Jones, H.E., Wang, W., Sillito, A.M.: Spatial Organization and Magnitude of Orientation Contrast Interactions in Primate V1. Journal of Neurophysiology 88, 2796–2808 (2002)
8. The Berkeley Segmentation Dataset: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
9. Engel, A.K., Fries, P., Singer, W.: Dynamic predictions: oscillations and synchrony in top-down processing. Nature Reviews Neuroscience 2, 704–716 (2001)

Neural Mechanism for Extracting Object Features Critical for Visual Categorization Task

Mitsuya Soga(1) and Yoshiki Kashimori(1,2)

(1) Dept. of Information Network Science, Graduate School of Information Systems, Univ. of Electro-Communications, Chofu, Tokyo 182-8585, Japan
(2) Dept. of Applied Physics and Chemistry, Univ. of Electro-Communications, Chofu, Tokyo 182-8585, Japan

Abstract. The ability to group visual stimuli into meaningful categories is a fundamental cognitive process. Several experiments have been made to investigate the neural mechanism of visual categorization. Although there is experimental evidence that prefrontal cortex (PFC) and inferior temporal (IT) cortex neurons respond sensitively in categorization tasks, little is known about the functional role of the interaction between PFC and IT in such tasks. To address this issue, we propose a functional model of the visual system and investigate the neural mechanism underlying the categorization of line drawings of faces. We show here that IT represents the similarity of face images based on the information in the resolution maps of early visual stages. We also show that PFC neurons bind the information of part and location of the face image, and that PFC generates a working memory state in which only the information of face features relevant to the categorization task is sustained.

1 Introduction

Visual categorization is fundamental to the behavior of higher primates. Our raw perceptions would be useless without our classification of items such as animals and food. The visual system has the ability to categorize visual stimuli: the ability to react similarly to stimuli even when they are physically distinct, and to react differently to stimuli that may be similar. How does the brain group stimuli into meaningful categories? Several experiments have been made to investigate the neural mechanism of visual categorization. Freedman et al. [1] examined the responses of neurons in the prefrontal cortex (PFC) of monkeys trained to categorize computer-generated animal forms as either "doglike" or "catlike". They reported that many PFC neurons responded selectively to the different types of visual stimuli belonging to either the cat or the dog category. Sigala and Logothetis [2] recorded from inferior temporal (IT) cortex after monkeys learned a categorization task, and found that the selectivity of the IT neurons was significantly increased for features critical to the task. The numerous reciprocal connections between PFC and IT could allow the necessary interactions to select the best diagnostic features of stimuli [3]. However, little is known about the role of the interaction between IT

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 27–36, 2008. © Springer-Verlag Berlin Heidelberg 2008


and PFC in a categorization task that gives the category boundaries in relation to behavioral consequences. To address this issue, we propose a functional model of the visual system in which the categorization task is achieved through the functional roles of IT and PFC. The functional role of IT is to represent features of object parts, based on different resolution maps in the early visual system such as V1 and V4. In IT, visual stimuli are categorized by similarity based on the features of object parts. The posterior parietal cortex (PP) encodes the location of the object part to which attention is paid. The PFC neurons combine the information about feature and location of object parts, and generate a working memory of the object information relevant to the categorization task. The synaptic connections between IT and PFC are learned so as to achieve the categorization task. The feedback signals from PFC to IT enhance the sensitivity of the IT neurons that respond to the features of object parts critical for the categorization task, thereby enabling the visual system to perform task-dependent categorization quickly and reliably. In the present study, we present a neural network model that forms categories of visual objects depending on the categorization task. We investigated the neural mechanism of the categorization task for the line drawings of faces used by Sigala and Logothetis [2]. Using this model, we show that IT represents the similarity of face images based on the information in the resolution maps in V1 and V4. We also show that PFC generates a working memory state in which only the information of face features relevant to the categorization task is sustained.

2 Model

To investigate the neural mechanism of visual categorization, we constructed a neural network model of the form-perception pathway from the retina to the prefrontal cortex (PFC). The model consists of eight neural networks corresponding to the retina, lateral geniculate nucleus (LGN), V1, V4, inferior temporal cortex (IT), posterior parietal cortex (PP), PFC, and premotor area, which are involved in the ventral and dorsal pathways [4,5]. The network structure of our model is illustrated in Fig. 1.

2.1 Model of Retina and LGN

The retinal network is an input layer onto which the object image is projected. The retina has a two-dimensional lattice structure that contains an NR × NR pixel scene. The LGN network consists of three types of neurons that differ in the spatial resolution of contrast detection: fine-tuned neurons with high spatial frequency (LGNF), middle-tuned neurons with middle spatial frequency (LGNM), and broad-tuned neurons with low spatial frequency (LGNB). The output of the LGNX neuron (X = B, M, F) at site (i, j) is given by

ILGN(iR, jR; x) = Σi Σj IR(iR, jR) M(i, j; x),   (1)


Fig. 1. The structure of our model. The model is composed of eight modules resembling the ventral and dorsal pathways of the visual cortex: retina, lateral geniculate nucleus (LGN), primary visual cortex (V1), V4, inferior temporal cortex (IT), posterior parietal cortex (PP), prefrontal cortex (PFC), and premotor area. a1 ∼ a3 denote dynamical attractors of visual working memory.

M(i, j; x) = A exp[−((i − iR)2 + (j − jR)2)/σ1x2] − B exp[−((i − iR)2 + (j − jR)2)/σ2x2],   (2)

where IR(iR, jR) is the gray-scale intensity of the pixel at retinal site (iR, jR), and the function M(i, j; x) is a Mexican-hat-like function that represents the convergence of the retinal inputs, with ON-center, OFF-surround connections between retina and LGN. The parameter values were set to A = 1.0, B = 1.0, σ1x = 1.0, and σ2x = 2.0.
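The retina-to-LGN weighting of Eqs. (1)-(2) can be sketched directly; the function names below are ours, and the parameter values are those given in the text. Note that with the printed A = B = 1 the unnormalized difference of Gaussians is non-positive everywhere (the usual ON-center profile would scale each Gaussian by its normalizing factor), so the demonstration is simply that the weighting is local:

```python
import numpy as np

def mexican_hat(i, j, iR, jR, sigma1=1.0, sigma2=2.0, A=1.0, B=1.0):
    """Centre-surround weight M(i, j; x) of Eq. (2): a difference of
    Gaussians centred on the retinal site (iR, jR)."""
    r2 = (i - iR) ** 2 + (j - jR) ** 2
    return A * np.exp(-r2 / sigma1 ** 2) - B * np.exp(-r2 / sigma2 ** 2)

def lgn_output(image, iR, jR):
    """Eq. (1): the LGN unit at (iR, jR) sums retinal intensities
    weighted by the centre-surround profile."""
    ii, jj = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    return float(np.sum(image * mexican_hat(ii, jj, iR, jR)))

# The weighting is local: a spot near the receptive-field centre affects
# the unit far more than the same spot placed far away.
near_img = np.zeros((11, 11)); near_img[5, 7] = 1.0
far_img = np.zeros((11, 11)); far_img[5, 10] = 1.0
near = lgn_output(near_img, 5, 5)
far = lgn_output(far_img, 5, 5)
print(abs(near) > abs(far))  # True
```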

2.2 Model of V1 Network

The neurons of the V1 network have the ability to detect elemental features of the object image, such as the orientation and edges of a bar. The V1 network consists of three different networks with high, middle, and broad spatial resolutions, V1B, V1M, and V1F, each of which receives the outputs of LGNB, LGNM, and LGNF, respectively. The V1X network (X = B, M, F) contains MX × MX hypercolumns, each of which contains LX orientation columns. The neurons in V1X (X = B, M, F) have receptive fields performing a Gabor transform. The output of the V1X neuron at site (i, j) is given by

IV1(i, j, θ; x) = Σp Σq ILGN(p, q; x) G(p, q, θ; x),   (3)


G(p, q, θ; x) = (1/(2πσGx σGy)) exp[−(1/2)((i − p)2/σGx2 + (j − q)2/σGy2)] × sin(2πfx p cosθ + 2πfy q sinθ + π/2),   (4)

where fx and fy are the spatial frequencies along the x- and y-coordinates, respectively. The parameter values were σGx = 1, σGy = 1, fx = 0.1 Hz, and fy = 0.1 Hz.

2.3 Model of V4 Network

The V4 network consists of three different networks with high, middle, and low spatial resolutions, which receive convergent inputs from the cell assemblies with the same tuning in V1F, V1M, and V1B. The convergence of the outputs of V1 neurons enables V4 neurons to respond specifically to combinations of elemental features, such as a cross or triangle, represented on the V4 network.

2.4 Model of PP

The posterior parietal (PP) network consists of N_PP × N_PP neurons, each of which corresponds to the spatial position of a pixel of the retinal image. The functions of the PP network are to represent the spatial position of a whole object and the spatial arrangement of its parts in the retinotopic coordinate system, and to mediate the location of the object part to which attention is paid.

2.5 Model of IT

The IT network consists of three subnetworks, each of which receives the outputs of the V4F, V4M, and V4B maps, respectively. Each subnetwork has neurons tuned to various features of the object parts, depending on the resolution of the corresponding V4 map and the location of the object parts to which attention is directed. The first subnetwork detects a broad outline of a whole object, the second subnetwork detects elemental figures that represent elemental outlines of the object, and the third subnetwork represents the information of the object parts based on the fine resolution of the V4F map. Each subnetwork was built on Kohonen's self-organizing map model. The elemental figures in the second subnetwork may play an important role in extracting similarity to the outlines of objects.
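Each IT subnetwork is built on a self-organizing map. A minimal 1-D Kohonen map over arbitrary feature vectors is sketched below; the map size, learning-rate and neighborhood schedules are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def train_som(data, n_units=16, epochs=30, lr0=0.5, radius0=4.0, seed=0):
    # 1-D Kohonen self-organizing map: for each input, move the winning
    # unit and its neighbors on the map toward the input, while the
    # learning rate and neighborhood radius shrink over epochs.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_units, data.shape[1]))
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)
        radius = max(radius0 * (1 - epoch / epochs), 0.5)
        for x in data:
            winner = np.argmin(np.linalg.norm(w - x, axis=1))
            d = np.abs(np.arange(n_units) - winner)
            h = np.exp(-d**2 / (2 * radius**2))   # neighborhood kernel
            w += lr * h[:, None] * (x - w)
    return w
```

After training, nearby units respond to similar feature vectors, which is the property the text uses to group face parts by similarity.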

2.6 Model for Working Memory in PFC

The PFC memorizes the information of the spatial positions of the object parts as dynamical attractors. The functional role of the PFC is to reproduce a complete form of the object by binding the information of the object parts memorized in the second and third subnetworks of IT with their spatial arrangements represented by the PP network. The PFC network model was based on the dynamical map model [6,7]. The network consists of three types of neurons: M, R, and Q neurons. Q neurons are connected to M neurons with inhibitory synapses. M neurons are interconnected with each other through excitatory and inhibitory synaptic connections; the M-neuron layer of the present model corresponds to an associative neural network. M neurons receive inputs from the three IT subnetworks. The dynamical evolution of the membrane potentials of the M, R, and Q neurons is described by

τm du_mi/dt = −u_mi + Σ_j Σ_{τij=0}^{τmax} W_mm,ij(t, τij) V_mj(t − τij) + W_mq U_qi + W_mr U_ri + Σ_k W_IT,ik X_k^IT + Σ_l W_PP,il X_l^PP,   (5)

τ_qi du_qi/dt = −u_qi + W_qm V_mi,   (6)

τ_ri du_ri/dt = −u_ri + W_rm V_mi,   (7)

where u_mi, u_qi, and u_ri are the membrane potentials of the ith M, Q, and R neurons, respectively, and τm, τ_qi, τ_ri are the relaxation times of these membrane potentials. τij is the delay time of signal propagation from the jth M neuron to the ith one, and τmax is the maximum delay time. The time delays play an important role in the stabilization of the temporal sequence of firing patterns. W_mm,ij(t, τij) is the strength of the axo-dendritic synaptic connection from the jth M neuron to the ith M neuron whose propagation delay time is τij. W_mq, W_mr, W_qm, and W_rm are the strengths of the dendro-dendritic synaptic connections from Q neuron to M neuron, from R neuron to M neuron, from M neuron to Q neuron, and from M neuron to R neuron, respectively. V_mi is the output of the ith M neuron; U_qi and U_ri are the dendritic outputs of the ith Q and R neurons, respectively. The outputs of Q and R neurons are given by sigmoidal functions of u_qi and u_ri, respectively. An M neuron emits a spike output whose firing probability is determined by a sigmoidal function of u_mi. W_IT,ik and W_PP,il are the synaptic strengths from the kth IT neuron and from the lth PP neuron to the ith M neuron, respectively, and X_k^IT and X_l^PP are the outputs of the kth IT neuron and the lth PP neuron. The parameters were set to τm = 3 ms, τ_qi = 1 ms, τ_ri = 1 ms, W_mq = 1, W_mr = 1, W_qm = 1, W_qr = −10, and τmax = 38 ms. The dynamical evolution of the synaptic connections is described by

τw dW_mm,ij(t, τij)/dt = −W_mm,ij(t, τij) + λ V_mi(t) V_mj(t − τij),   (8)

where τw is a time constant and λ is a learning rate. The parameter values were τw = 1200 ms and λ = 28. The PFC connects reciprocally with the IT, PP, and premotor networks. The synaptic connections between PFC and IT and those between PFC and PP are learned by a Hebbian learning rule.
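As a numerical illustration of Eqs. (5) and (8), here is a minimal sketch of the delayed recurrent M-neuron term together with the time-delayed Hebbian rule. The network size, time step, sigmoidal output, and initial conditions are our assumptions, and the Q, R, IT, and PP terms are omitted for brevity.

```python
import numpy as np

def simulate_pfc(n=10, t_max=38, steps=500, dt=0.5, tau_m=3.0,
                 tau_w=1200.0, lam=28.0, seed=0):
    # u[i]: membrane potential of M neuron i (Eq. 5, recurrent term only).
    # W[k, i, j]: synapse from j to i with propagation delay delays[k].
    rng = np.random.default_rng(seed)
    delays = np.arange(1, t_max + 1)              # delays in time steps
    W = 0.01 * rng.standard_normal((t_max, n, n))
    u = rng.standard_normal(n)
    V_hist = np.zeros((steps + t_max, n))         # buffer of past outputs

    def sig(x):                                   # sigmoidal output, clipped
        return 1.0 / (1.0 + np.exp(-np.clip(x, -50.0, 50.0)))

    for t in range(steps):
        V_hist[t + t_max] = sig(u)
        # Recurrent input: sum over delays of W[k] @ V(t - delay)
        rec = sum(W[k] @ V_hist[t + t_max - d] for k, d in enumerate(delays))
        u = u + dt / tau_m * (-u + rec)
        # Eq. (8): tau_w dW/dt = -W + lam * V_i(t) * V_j(t - delay)
        for k, d in enumerate(delays):
            pre = V_hist[t + t_max - d]
            W[k] += dt / tau_w * (-W[k] + lam * np.outer(V_hist[t + t_max], pre))
    return u, W
```

Because the Hebbian term correlates the current output with a delayed one, each delay channel learns a different temporal offset, which is what stabilizes a temporal firing sequence in the dynamical map model.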

2.7 Model of Premotor Cortex

The model of the premotor area consists of neurons whose firing corresponds to the actions relevant to the categorization task, that is, pressing the right or left lever. The details of the mathematical description are given in Refs. [8-10].

3 Neural Mechanism for Extracting Diagnostic Features in Visual Categorization Task

3.1 Categorization Task

We used the line drawings of faces used by Sigala and Logothetis [2] to investigate the neural mechanism of the categorization task. The face images have four varying features: eye height, eye separation, nose length, and mouth height. The monkeys were trained to categorize the face stimuli depending on two diagnostic features, eye height and eye separation. The two diagnostic features allowed separation between the classes along a linear category boundary, as shown in Fig. 2b. The face stimuli were not linearly separable by using the other two, non-diagnostic features, nose length and mouth height. On each trial, the monkeys saw one face stimulus and then pressed one of two levers to indicate the category. Thereafter, they received a reward only if they chose the correct category. After the training, the monkeys were able to categorize various face stimuli based on the two diagnostic features. In our simulation, we used the four training stimuli shown in Fig. 2a and test stimuli with all four features varied.
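The linear separability of the categories in the two diagnostic features can be illustrated with a toy linear classifier. The feature values below are invented for illustration and are not the stimuli of Ref. [2].

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=0.1):
    # Simple perceptron: finds a linear boundary w.x + b = 0 whenever the
    # classes are linearly separable in the given feature space.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # misclassified -> update
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy stimuli: columns = (eye height, eye separation); the category label
# depends only on these two diagnostic features (values are assumptions).
X = np.array([[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.2]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
```

Adding non-diagnostic columns (nose length, mouth height) with category-independent values would leave such a boundary unchanged, mirroring the task design.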

Fig. 2. a) The training stimulus set consisted of line drawings of faces with four varying features: eye separation, eye height, nose length, and mouth height. b) In the categorization task, the monkeys were presented with one stimulus at a time. The two categories were linearly separable along the line. The test stimuli are illustrated by the marks 'x' and 'o'. See Ref. [2] for details of the task.

Neural Mechanism for Extracting Object Features

3.2 Neural Mechanism for Accomplishing Visual Categorization Task

The neural mechanism for accomplishing the visual categorization task is illustrated in Fig. 3. Object features are encoded by hierarchical processing at each stage of the ventral visual pathway from retina to V4. The IT neurons encode the information of object parts such as eyes, nose, and mouth. The PP neurons encode the locations of object parts to which attention should be directed. The information of an object part and its location are combined by a dynamical attractor in PFC. The output of PFC is sent to two types of premotor neurons, whose firing leads to pressing of the right or left lever, respectively. When the monkeys exhibit the behavior relevant to the task and then receive a reward, the attractor in PFC that represents the object information relevant to the task is gradually stabilized by the facilitation of learning across the PFC neurons and between PFC and premotor neurons. On the other hand, when the monkeys exhibit behavior irrelevant to the task and receive no reward, the attractor associated with the irrelevant behavior is destabilized and then eliminated in the PFC network. As a result, the PFC retains only the object information relevant to the categorization task, as working memory. The feedback from PFC to IT and PP strengthens the responses of IT and PP neurons, thereby enabling the visual system to rapidly and accurately discriminate between object features belonging to different categories. When a monkey pays attention to a local region of the face stimulus, the attention signal from other brain areas, such as the prefrontal eye field, increases the activity of PP neurons encoding the location of the face part to which attention is directed. The PP neurons send their outputs back to V4, thereby increasing the activity of V4 neurons encoding the features of the face parts to which attention is paid, because V4 has the same retinotopic map as PP. Thus the attention in PP allows V4 to send to IT only the information of the face part to which attention is directed, leading to the generation of IT neurons encoding face parts. Furthermore, as the training proceeds, the attention relevant to the task is fixed by the learning of synaptic connections between PFC and PP, allowing the monkey to perform the visual task quickly.

Fig. 3. Neural mechanism for accomplishing the visual categorization task. The information about an object part and its location are combined to generate working memory, indicated by α and β. The solid and dashed lines indicate the formation and elimination of synaptic connections, respectively.

4 Results

4.1 Information Processing of Visual Images in Early Visual Areas

Figure 4 shows the responses of neurons in the early visual areas, LGN, V1, and V4, to the training stimulus face 1 shown in Fig. 2a. The visual information is processed in a hierarchical manner, because the neurons along the pathway from LGN to V4 have progressively larger receptive fields and prefer more complex stimuli. At the first stage, the contrast of the stimulus is encoded by the ON-center, OFF-surround receptive fields of LGN neurons, as shown in Fig. 4b. Then the V1 neurons, receiving the outputs of LGN neurons, encode the information of directional features of short bars contained in the face drawing, as shown in Fig. 4c. The V4 network was made by Kohonen's self-organizing map so that the V4 neurons could respond to more complex features of the stimulus. Figure 4d shows that the V4 neurons respond to more complex features such as eyes, nose, and mouth.

Fig. 4. Responses of neurons in LGN, V1, and V4. The magnitude of the neuronal responses in these areas is illustrated with a gray scale, in which the response magnitude increases with the depth of the gray. a) Face stimulus. The image is 90 × 90 pixels. b) Responses of LGN neurons. c) Responses of V1 neurons tuned to four kinds of directions. d) Responses of V4 neurons. The kth neuron (k = 1–3) encodes stimulus features such as eyes, nose, and mouth, respectively.

4.2 Information Processing of Visual Images in IT Cortex

Figure 5a shows the ability of IT neurons encoding eye separation and eye height to categorize test stimuli of faces. Test stimuli with the two features varied were categorized by the four IT neurons learned from the four training stimuli shown in Fig. 2a, suggesting that the IT neurons are capable of separating test stimuli into categories based on similarity to the features of face parts.


Fig. 5. a) Ability of IT neurons to categorize face stimuli for the two diagnostic features. The four IT neurons were made by using the four kinds of training stimuli shown in Fig. 2a, whose features are represented by four kinds of symbols (circle, square, triangle, cross). The test stimuli, represented by small symbols, are categorized by the four IT neurons; the kind of small symbol indicates which IT neuron categorizes the test stimulus. The solid lines indicate the category boundaries. b) Temporal variation of the dynamic state of the PFC network during the categorization task. The attractors representing the diagnostic features are denoted by α ∼ δ, and a further attractor represents the non-diagnostic features. A mark on a row indicates that the network activity stays in the corresponding attractor. The visual stimulus of face 1 was applied to the retina at 300 ∼ 500 ms.

Similarly, the IT neurons encoding nose length and mouth height separated test stimuli into other categories on the basis of the similarity of those two features. However, the classification in IT is not task-dependent, but is made based on the similarity of face features.

4.3 Mechanism for Generating Working Memory Attractor in PFC

The PFC combines the information of face features with that of the locations of face parts to which attention is directed, and then forms memory attractors for this information. Figure 5b shows the temporal variation of the memory attractors in PFC. The information about face parts with the two diagnostic features is represented by attractors X (X = α, β, γ, δ), in which X represents the information about eye separation and eye height of the four training stimuli and the location around the eyes. The attractors X are dynamically linked in the PFC. As shown in Fig. 5b, the information about face parts with the diagnostic features is memorized as working memory α ∼ δ, because the synaptic connections between PFC and premotor area are strengthened by the reward signal given by the choice of the correct categorization. On the other hand, the information about face parts with non-diagnostic features is not memorized as a stable attractor, as shown in Fig. 5b, because the information of non-diagnostic features does not lead to correct categorization behavior. Thus, the PFC can retain only the information required for the categorization task, as working memory.

5 Concluding Remarks

In the present study, we have shown that IT represents the similarity of face images based on the resolution maps of V1 and V4, and that PFC generates a working memory state in which the information of face features relevant to the categorization task is sustained. The feedback from PFC to IT and PP may play an important role in extracting the diagnostic features critical for the categorization task. The feedback from PFC increases the sensitivity of the IT and PP neurons that encode, respectively, the object feature and the location relevant to the task. This allows the visual system to rapidly and accurately perform the categorization task. It remains to be seen how the feedback from PFC to IT and PP forms the functional connections across the three visual areas.

References

1. Freedman, D.J., Riesenhuber, M., Poggio, T., Miller, E.K.: Categorical representation of visual stimuli in the primate prefrontal cortex. Science 291, 312–316 (2001)
2. Sigala, N., Logothetis, N.K.: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002)
3. Hasegawa, I., Miyashita, Y.: Categorizing the world: expert neurons look into key features. Nature Neurosci. 5, 90–91 (2002)
4. Marcelja, S.: Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. 70, 1297–1300 (1980)
5. Rolls, E.T., Deco, G.: Computational Neuroscience of Vision. Oxford University Press, Oxford (2002)
6. Hoshino, O., Inoue, S., Kashimori, Y., Kambara, T.: A hierarchical dynamical map as a basic frame for cortical mapping and its application to priming. Neural Comput. 13, 1781–1810 (2001)
7. Hoshino, O., Kashimori, Y., Kambara, T.: An olfactory recognition model of spatio-temporal coding of odor quality in olfactory bulb. Biol. Cybern. 79, 109–120 (1998)
8. Suzuki, N., Hashimoto, N., Kashimori, Y., Zheng, M., Kambara, T.: A neural model of predictive recognition in form pathway of visual cortex. Biosystems 79, 33–42 (2004)
9. Ichinose, Y., Kashimori, Y., Fujita, K., Kambara, T.: A neural model of visual system based on multiple resolution maps for categorizing visual stimuli. In: Proceedings of ICONIP 2005, pp. 515–520 (2005)
10. Kashimori, Y., Suzuki, N., Fujita, K., Zheng, M., Kambara, T.: A functional role of multiple spatial resolution maps in form perception along the ventral visual pathway. Neurocomputing 65–66, 219–228 (2005)

An Integrated Neuro-mechanical Model of C. elegans Forward Locomotion

Jordan H. Boyle, John Bryden, and Netta Cohen

School of Computing, University of Leeds, Leeds LS2 9JT, United Kingdom

Abstract. One of the most tractable organisms for the study of nervous systems is the nematode Caenorhabditis elegans, whose locomotion in particular has been the subject of a number of models. In this paper we present a ﬁrst integrated neuro-mechanical model of forward locomotion. We ﬁnd that a previous neural model is robust to the addition of a body with mechanical properties, and that the integrated model produces oscillations with a more realistic frequency and waveform than the neural model alone. We conclude that the body and environment are likely to be important components of the worm’s locomotion subsystem.

1 Introduction

The ultimate aim of neuroscience is to unravel and completely understand the links between animal behaviour, its neural control, and the underlying molecular and genetic computation at the cellular and sub-cellular levels. This daunting challenge sets a distant goal post in the study of the vast majority of animals, but work on one animal in particular, the nematode Caenorhabditis elegans, is leading the way. This tiny worm has only 302 neurons and yet is capable of generating an impressive wealth of sensory-motor behaviours. With the first fully sequenced animal genome [1], a nearly complete wiring diagram of the nervous circuit [2], and hundreds of well characterised mutant strains, the link between genetics and behaviour never seemed more tractable. To date, a number of models have been constructed of subcircuits within the C. elegans nervous system, including sensory circuits for thermotaxis and chemotaxis [3,4], reflex control such as tap withdrawal [5], reversals (from forward to backward motion and vice versa) [6], and head swing motion [7]. Locomotion, like the overwhelming majority of known motor activity in animals, relies on the rhythmic contraction of muscles, which are controlled or regulated by neural networks. This system consists of a circuit in the head (generally postulated to initiate motion and determine direction) and an additional subcircuit along the ventral cord (responsible for propagating and sustaining undulations, and potentially generating them as well). Models of C. elegans locomotion have tended to focus on forward locomotion, and in particular on the ability of the worm to generate and propagate undulations down its length [8,9,10,11,12]. These models have tended to study either the mechanics of locomotion [8] or the forward locomotion neural circuit [9,10,11,12]. In this paper we present simulations of an integrated model of the neural control of forward locomotion [12] with a minimal model of muscle actuation and a mechanical model of a body, embedded in a minimal environment. The main questions we address are (i) whether the disembodied neural model is robust to the addition of a body with mechanical properties; and (ii) how the addition of mechanical properties alters the output from the motor neurons. In particular, models of the isolated neural circuit for locomotion suffer from a common limitation: the inability to reproduce undulations with frequencies that match the observed behaviour of the crawling worm. To address this question, we have limited our integrated model to a short section of the worm, rather than modelling the entire body. We find that the addition of a mechanical framework to the neural control model of Ref. [12] leads to robust oscillations, with significantly smoother waveforms and reduced oscillation frequencies, matching observations of the worm.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 37–47, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Background

2.1 C. elegans Locomotion

Forward locomotion is achieved by propagating sinusoidal undulations along the body from head to tail. When moving on a firm substrate (e.g., agarose), the worm lies on its side, with the ventral and dorsal muscles at any longitudinal level contracting in anti-phase. With the exception of the head and neck, the worm is only capable of bending in the dorso-ventral plane. Like all nematode worms, C. elegans lacks any form of rigid skeleton. Its roughly cylindrical body has a diameter of ∼80 μm and a length of ∼1 mm. It has an elastic cuticle containing (along with its intestine and gonad) pressurised fluid, which maintains the body shape while remaining flexible. This structure is referred to as a hydrostatic skeleton. The body wall muscles responsible for locomotion are anchored to the inside of the cuticle.

2.2 The Neural Model

The neural model used here is based on the work of Bryden and Cohen [11,12,13]. Specifically, we use the model (equations and parameters) presented in [12], which is itself an extension of Refs. [11,13]. The model simplifies the neuronal wiring diagram of the worm [2,14] into a minimal neural circuit for forward locomotion. This reduced model contains a set of repeating units (one "tail" and ten "body" units), where each unit consists of one dorsal motor neuron (of class DB) and one ventral motor neuron (of class VB). A single command interneuron (representing a pair of interneurons of class AVB in the biological worm) provides the "on" signal to the forward locomotion circuit and is electrically coupled (via gap junctions) to all motor neurons of classes DB and VB. In the model, motor neurons also have a sensory function, integrating inputs from stretch receptors, or mechano-sensitive ion channels, that encode each unit's bending angle. Motor neurons receive both local and, with the exception of the tail, proximate sensory input, with proximate input received from the adjacent posterior unit.


Fig. 1. A: Schematic diagram of the physical model illustrating nomenclature (see Appendix B for details). B: The neural model, with only two units (one body, one tail). AVB is electrically coupled to each of the motor neurons via gap junctions (resistor symbols).

The sensory-motor loop for each unit gives rise to local oscillations which phase lock with adjacent units. Equations and parameters for the neural model are set out in Appendix A. This neural-only model uses a minimal physical framework to translate neuronal output to bending. Fig. 1B shows the neural model with only two units (a tail and one body unit), as modelled in this paper. In the following section, we outline a more realistic physical model of the body of the worm.

3 Physical Model

Our physical model is an adaptation of Ref. [8], a 2-D model consisting of two rows of N points (representing the dorsal and ventral sides of the worm). Each point is acted on by the opposing forces of the elastic cuticle and pressure, as well as by muscle force and drag (often loosely referred to as friction or surface friction [8]). We modify this model by introducing simplifications to reduce simulation time, in part by allowing us to use a longer time step. Fig. 1A illustrates the model's structure. The worm is represented by a number of rigid beams, each connected to the adjacent beams by four springs. Two horizontal (h) springs connect points on the same side of adjacent beams and resist both elongation and compression. Two diagonal (d) springs connect the dorsal side of the ith beam to the ventral side of the (i + 1)st, and vice versa. These springs strongly resist compression and have an effect analogous to that of pressure, in that they help to maintain a reasonably constant area in each unit. The model was implemented in C++, using a 4th-order Runge-Kutta method for numerical integration with a time step of 0.1 ms (the original model [8] required a time step of 0.001 ms with the same integration method). Equations and parameters of the physical model are given in Appendix B. The steps taken to interface the physical and neuronal models are described in Appendix C.
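The 4th-order Runge-Kutta step used for the integration has the standard form sketched below. This is a generic sketch only; the state vector and derivative function of the actual model live in the C++ implementation and Appendix B.

```python
def rk4_step(f, y, t, dt):
    # Classical 4th-order Runge-Kutta: advance state y (a list of floats)
    # by one time step dt, where f(t, y) returns the list dy/dt.
    k1 = f(t, y)
    k2 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
```

For example, integrating dy/dt = −y over one unit of time in ten steps reproduces e⁻¹ to well below 1e-6 absolute error.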

4 Results

Using our integrated model we first simulated a single unit (the tail), and then implemented two phase-lagged units (adding a body unit). In what follows, we present these results, as compared to those of the neural model alone.

4.1 Single Oscillating Segment

The neural model alone produces robust oscillations in unit bending angle (θi ) with a roughly square waveform, as shown in Fig. 2A. The model unit oscillates at about 3.5 Hz, as compared to frequencies of about 0.5 Hz observed for C. elegans forward locomotion on an agarose substrate. It has not been possible to ﬁnd parameters within reasonable electrophysiological bounds for the neural model that would slow the oscillations to the desired time scales [12]. Oscillations of the integrated neuro-mechanical model of a single unit are shown in Fig. 2B. All but four parameters of the neuronal model remain unchanged from Ref. [12]. However, parameters used for the actuation step caused a slight asymmetry in the oscillations when integrated with a physical model, and were therefore modiﬁed. As can be seen from the traces in the ﬁgure, the frequency of oscillation in the integrated model is about 0.5 Hz for typical agarose drag [8], and the waveform has a smooth, almost sinusoidal shape. Faster (and slower) oscillations are possible for lower (higher) values of drag. Fig. 2C shows a plot of oscillation frequencies as a function of drag for the integrated model.

Fig. 2. Oscillations of, A, the original neural model [12] and, B, the integrated model (with drag of 80 × 10⁻⁶ kg s⁻¹). Note the different time scales. C: Oscillation frequency as a function of drag. The zero-frequency point indicates that the unit can no longer oscillate.

4.2 Two Phase-Lagged Segments

Parameters of the neural model are given in Table A-1 for the tail unit and in Table A-2 for the body unit. Fig. 3 compares bending waveforms recorded from a living worm (Fig. 3A), simulated by the neural model (Fig. 3B) and simulated by the integrated model (Fig. 3C).

Fig. 3. Phase-lagged oscillation of two units. A: Bending angles extracted from a recording of a forward locomoting worm on an agarose substrate. The traces are of two points along the worm (near the middle and 1/12 of a body length apart). B: Simulation of two coupled units in the neural model. C: Simulation of the integrated model. Take note of the faster oscillations in subplot B.

5 Discussion

C. elegans is amenable to manipulations at the genetic, molecular and neuronal levels but with such rich behaviour being produced by a system with so few components, it can often be diﬃcult to determine the pathways of cause and eﬀect. Mathematical and simulation models of the locomotion therefore provide an essential contribution to the understanding of C. elegans neurobiology and motor control. The inclusion of a realistic embodiment is particularly relevant to a model of C. elegans locomotion. Sensory feedback is important to the locomotion of all animals. However, in C. elegans, the postulated existence of stretch receptor inputs along the body (unpublished communication, L. Eberly and R. Russel, reported in [2]) would provide direct information about body posture to the motor neurons themselves. Thus, the neural control is likely to be tightly coupled to the shape the worm takes as it locomotes. Modelling the body physics is therefore particularly important in this organism. Here we have presented the ﬁrst steps in the implementation of such an integrated model, using biologically plausible parameters for both the neural and mechanical components. One interesting eﬀect is the smoothing of the waveform from a square-like waveform in the isolated neural model to a nearly sinusoidal waveform in the integrated model. The smoothing can be attributed to the body’s resistance to bending (modelled as a set of springs), which increases with the bending angle.

42

J.H. Boyle, J. Bryden, and N. Cohen

By contrast, in the original neural model, the rate of bending depends only on the neural output. The work presented here would naturally lead to an integrated neuromechanical model of locomotion for an entire worm. The next step toward this goal, extending the neural circuit to the entire ventral cord (and the corresponding motor system) is currently underway. The physical model introduces long range interactions between units via the body and environment. In a real worm, as in the physical model, for bending to occur at some point along the worm, local muscles must contract. However, such contractions also apply physical forces to adjacent units, and so on up and down the worm, giving rise to a signiﬁcant persistence length. For this reason the extension of the neuro-mechanical model from two to three (or more) units will not be automatic and will require parameter changes to model an operable balance between the eﬀects of the muscle and body properties. In fact, the worm’s physical properties (and, in particular, the existence of long range physical interactions along it) could set new constraints on the neural model, or could even be exploited by the worm to achieve more eﬀective locomotion. Either way, the physics of the worm’s locomotion is likely to oﬀer important insights that could not be gleaned from a model of the isolated neural subcircuit. We have shown that a neural model developed with only the most rudimentary physical framework can continue to function with a more realistic embodiment. Indeed, both the waveform and frequency have been improved beyond what was possible for the isolated neural model. We conclude that the body and environment are likely to be important components of the subsystem that generates locomotion in the worm.

Acknowledgement

This work was funded by the EPSRC, grant EP/C011961. NC was funded by the EPSRC, grant EP/C011953. Thanks to Stefano Berri for movies of worms and behavioural data.

References

1. C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282, 2012–2018 (1998)
2. White, J.G., Southgate, E., Thomson, J.N., Brenner, S.: The structure of the nervous system of the nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society of London, Series B 314, 1–340 (1986)
3. Ferrée, T.C., Marcotte, B.A., Lockery, S.R.: Neural network models of chemotaxis in the nematode Caenorhabditis elegans. Advances in Neural Information Processing Systems 9, 55–61 (1997)
4. Ferrée, T.C., Lockery, S.R.: Chemotaxis control by linear recurrent networks. Journal of Computational Neuroscience: Trends in Research, 373–377 (1998)
5. Wicks, S.R., Roehrig, C.J., Rankin, C.H.: A dynamic network simulation of the nematode tap withdrawal circuit: Predictions concerning synaptic function using behavioral criteria. Journal of Neuroscience 16, 4017–4031 (1996)


6. Tsalik, E.L., Hobert, O.: Functional mapping of neurons that control locomotory behavior in Caenorhabditis elegans. Journal of Neurobiology 56, 178–197 (2003)
7. Sakata, K., Shingai, R.: Neural network model to generate head swing in locomotion of Caenorhabditis elegans. Network: Computation in Neural Systems 15, 199–216 (2004)
8. Niebur, E., Erdös, P.: Theory of the locomotion of nematodes. Biophysical Journal 60, 1132–1146 (1991)
9. Niebur, E., Erdös, P.: Theory of the locomotion of nematodes: Control of the somatic motor neurons by interneurons. Mathematical Biosciences 118, 51–82 (1993)
10. Niebur, E., Erdös, P.: Modeling locomotion and its neural control in nematodes. Comments on Theoretical Biology 3(2), 109–139 (1993)
11. Bryden, J.A., Cohen, N.: A simulation model of the locomotion controllers for the nematode Caenorhabditis elegans. In: Schaal, S., Ijspeert, A.J., Billard, A., Vijayakumar, S., Hallam, J., Meyer, J.A. (eds.) Proceedings of the Eighth International Conference on the Simulation of Adaptive Behavior, pp. 183–192. MIT Press / Bradford Books (2004)
12. Bryden, J.A., Cohen, N.: Neural control of C. elegans forward locomotion: The role of sensory feedback (Submitted 2007)
13. Bryden, J.A.: A simulation model of the locomotion controllers for the nematode Caenorhabditis elegans. Master's thesis, University of Leeds (2003)
14. Chen, B.L., Hall, D.H., Chklovskii, D.B.: Wiring optimization can relate neuronal structure and function. Proceedings of the National Academy of Sciences USA 103, 4723–4728 (2006)

Appendix A: Neural Model

Neurons are assumed to have graded potentials [11,12,13]. In particular, the motor neurons (VB and DB) are modelled as leaky integrators with a transmembrane potential V(t) following

  C dV/dt = −G (V − E_rev) − I^shape + I^AVB ,   (A-1)

where C is the cell's membrane capacitance, E_rev is the cell's effective reversal potential, and G is the total effective membrane conductance. The sensory input

  I^shape = Σ_{j=1..n} G^stretch_j (V − E^stretch_j) σ^stretch_j(θ_j)

is the stretch receptor input from the shape of the body, where E^stretch_j is the reversal potential of the ion channels, θ_j is the bending angle of unit j, and σ^stretch_j is a sigmoid response function of the stretch receptors to the local bending. The stretch receptor activation function is given by σ^stretch(θ) = 1 / [1 + exp(−(θ − θ_0)/δθ)], where the steepness parameter δθ and the threshold θ_0 are constants. The command input current I^AVB = G^AVB (V_AVB − V) models gap-junctional coupling with AVB (with coupling strength G^AVB and AVB voltage V_AVB). Note that in the model, AVB is assumed to have a sufficiently high capacitance, so that the gap-junctional currents have a negligible effect on its membrane potential.


J.H. Boyle, J. Bryden, and N. Cohen

Segment bending in this model is given as a summation of an output function from each of the two neurons:

  dθ/dt = σ^out_VB(V_VB) − σ^out_DB(V_DB) ,   (A-2)

where σ^out(V) = ω_max / [1 + exp(−(V − V_0)/δV)], with constants ω_max, δV and V_0. Note that dorsal and ventral muscles contribute to bending in opposite directions (with θ and −θ denoting ventral and dorsal bending, respectively).

Table A-1. Parameters for a self-oscillating tail unit (as in Ref. [12])

  E_rev = −60 mV           V_AVB = −30.7 mV          C = 5 pF
  G_VB = 19.07 pS          G_DB = 17.58 pS           G^AVB_VB = 35.37 pS
  G^AVB_DB = 13.78 pS      G^stretch_VB = 98.55 pS   G^stretch_DB = 67.55 pS
  E^stretch = 60 mV        θ_0,VB = −18.68°          θ_0,DB = −19.46°
  δθ_VB = 0.1373°          δθ_DB = 0.4186°           ω_max,VB = 6987°/sec
  ω_max,DB = 9951°/sec     V_0,VB = 22.8 mV          V_0,DB = 25.0 mV
  δV_VB = 0.2888 mV        δV_DB = 0.0826 mV

Table A-2. Parameters for body units and tail–body interactions, as in Ref. [12]. All body-unit parameters not included here are the same as for the tail unit.

  G_VB = 26.09 pS          G_DB = 25.76 pS           G^stretch_VB = 16.77 pS
  G^stretch_DB = 18.24 pS  E^stretch = 60 mV         θ_0,VB = −19.14°
  θ_0,DB = −13.26°         δθ_VB = 1.589°            δθ_DB = 1.413°
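For concreteness, the unit dynamics of Eqs. (A-1)–(A-2) can be sketched with forward-Euler integration. This is a hypothetical reduced implementation, not the authors' code: a single tail unit whose bending angle θ is driven directly by Eq. (A-2), with no body mechanics, using the Table A-1 parameters; whether this reduced system oscillates depends on details omitted here.

```python
import math

def simulate_unit(T=2.0, dt=1e-4):
    """Forward-Euler sketch of Eqs. (A-1)-(A-2) for one tail unit.
    Units: V in mV, t in s, C in pF, G in pS (so pS*mV = fA and pF*mV/s = fA).
    Returns a list of (t, V_VB, V_DB, theta) samples."""
    C, Erev, Estr, Vavb = 5.0, -60.0, 60.0, -30.7
    # Per-neuron parameters from Table A-1:
    # (G, G_AVB, G_stretch, theta0, dtheta, wmax, V0, dV)
    par = {
        "VB": (19.07, 35.37, 98.55, -18.68, 0.1373, 6987.0, 22.8, 0.2888),
        "DB": (17.58, 13.78, 67.55, -19.46, 0.4186, 9951.0, 25.0, 0.0826),
    }

    def sig(x, x0, dx):
        # Sigmoid with overflow guard (the steepness constants are tiny).
        z = (x - x0) / dx
        return 0.0 if z < -60 else (1.0 if z > 60 else 1.0 / (1.0 + math.exp(-z)))

    V = {"VB": Erev, "DB": Erev}
    theta, trace = 0.0, []
    for n in range(int(T / dt)):
        out = {}
        for k, (G, Gavb, Gstr, th0, dth, wmax, V0, dV) in par.items():
            I_shape = Gstr * (V[k] - Estr) * sig(theta, th0, dth)  # stretch input
            I_avb = Gavb * (Vavb - V[k])                           # AVB gap junction
            out[k] = wmax * sig(V[k], V0, dV)                      # sigma_out(V)
            V[k] += dt * (-G * (V[k] - Erev) - I_shape + I_avb) / C
        theta += dt * (out["VB"] - out["DB"])                      # Eq. (A-2)
        trace.append((n * dt, V["VB"], V["DB"], theta))
    return trace
```

With dt = 0.1 ms the Euler step is well below the fastest membrane time constant (C over the summed conductances, tens of milliseconds), so the integration is stable.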

Appendix B: Physical Model

The physical model consists of N rigid beams which form the boundaries between the N − 1 units. The ith beam can be described in one of two ways: either by the (x, y) coordinates of the centre of mass (CoM_i in Fig. 1) and the angle φ_i, or by the (x, y) coordinates of its two end points (P^D_i and P^V_i in Fig. 1). Each formulation has its own advantages and is used where appropriate.

B.1 Spring Forces

The rigid beams are connected to each of their neighbours by two horizontal (h) springs and two diagonal (d) springs, directed along the vectors

  Δ^h_k,i = P^k_{i+1} − P^k_i   for k = D, V
  Δ^d_m,i = P^k_{i+1} − P^l_i   for (k, l) = (D, V), (V, D) and m = 1, 2 ,   (B-1)

for i = 1 : N − 1, where P^k_i = (x^k_i, y^k_i) are the coordinates of the ends of the ith beam. The spring forces F_(s) depend on the lengths of these vectors, Δ^j_k,i = |Δ^j_k,i|, and are collinear with them. The magnitudes of the horizontal and diagonal spring forces are the piecewise linear functions

  F^h_(s)(Δ) = κ^h_S2 (Δ − L^h_2) + κ^h_S1 (L^h_2 − L^h_0)   if Δ > L^h_2
               κ^h_S1 (Δ − L^h_0)                            if L^h_2 > Δ > L^h_0
               κ^h_C2 (Δ − L^h_1) + κ^h_C1 (L^h_1 − L^h_0)   if Δ < L^h_1
               κ^h_C1 (Δ − L^h_0)                            otherwise ,   (B-2)

  F^d_(s)(Δ) = κ^d_C2 (Δ − L^d_1) + κ^d_C1 (L^d_1 − L^d_0)   if Δ < L^d_1
               κ^d_C1 (Δ − L^d_0)                            if L^d_1 < Δ < L^d_0
               0                                             otherwise ,   (B-3)

where the spring (κ) and length (L) constants are given in Table B-1.

Table B-1. Parameters of the physical model. Note that the values of θ_0,VB and θ_0,DB differ from Ref. [12] and Table A-1.

  D = 80 μm               L^h_0 = 50 μm            L^h_1 = 0.5 L^h_0
  L^h_2 = 1.5 L^h_0       L^d_0 = L^h_0 + D        L^d_1 = 0.95 L^d_0
  κ^h_S1 = 20 μN·m^−1     κ^h_S2 = 10 κ^h_S1       κ^h_C1 = 0.5 κ^h_S1
  κ^h_C2 = 10 κ^h_C1      κ^d_C1 = 50 κ^h_S1       κ^d_C2 = 10 κ^d_C1
  f_muscle = 0.005 L^h_0 κ^h_C1                    c_∥ = c_⊥ = 80 × 10^−6 kg·s^−1
  θ_0,VB = −29.68°        θ_0,DB = −8.46°
  θ_0,VB = −22.14°        θ_0,DB = −10.26°
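As a concrete reading of Eq. (B-2), the horizontal spring law can be written as a small function. This is a sketch with the Table B-1 constants; L^h_2 = 1.5 L^h_0 is assumed here so that the branch ordering L^h_1 < L^h_0 < L^h_2 holds.

```python
def spring_force_h(d, L0=50.0, kS1=20.0):
    """Horizontal spring force magnitude, Eq. (B-2).
    Lengths in um; kS1 in uN/m, so the returned force is in pN.
    Assumes L2 = 1.5*L0 (so that L1 < L0 < L2)."""
    L1, L2 = 0.5 * L0, 1.5 * L0
    kS2, kC1 = 10.0 * kS1, 0.5 * kS1
    kC2 = 10.0 * kC1
    if d > L2:                        # strongly stretched
        return kS2 * (d - L2) + kS1 * (L2 - L0)
    if d > L0:                        # stretched
        return kS1 * (d - L0)
    if d < L1:                        # strongly compressed
        return kC2 * (d - L1) + kC1 * (L1 - L0)
    return kC1 * (d - L0)             # mildly compressed ("otherwise" branch)
```

The constant offsets in the outer branches make the force continuous at the breakpoints L^h_1 and L^h_2, with zero force at the rest length L^h_0.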

B.2 Muscle Forces

Muscle forces F_(m) are directed along the horizontal vectors Δ^h_k,i, with magnitudes

  F_(m)k,i = f_muscle A_k,i   for k = D, V and i = 1 : N − 1 ,   (B-4)

where f_muscle is a constant (see Table B-1) and the A_k,i are scalar activation functions for the dorsal and ventral muscles, determined by

  (A_D,i , A_V,i) = (θ_i(t), 0)    if θ_i(t) ≥ 0
                    (0, −θ_i(t))   if θ_i(t) < 0 ,   (B-5)

where θ_i(t) = ∫_0^t (dθ_i/dt′) dt′ is the integral over the output of the neural model.

B.3 Total Point Force

With the exception of points on the outer beams, each point i is subject to forces F_D,i and F_V,i, given by differences of the spring and muscle forces from the corresponding units (i and i − 1):

  F_D,i = (F^h_(s)D,i − F^h_(s)D,i−1) + (F^d_(s)1,i − F^d_(s)2,i−1) + (F_(m)D,i − F_(m)D,i−1)
  F_V,i = (F^h_(s)V,i − F^h_(s)V,i−1) + (F^d_(s)2,i − F^d_(s)1,i−1) + (F_(m)V,i − F_(m)V,i−1) .   (B-6)

Since the first beam has no anterior body parts, and the last beam has no posterior body parts, all terms with i = 0 or i = N are taken as zero.

B.4 Equations of Motion

Motion of the beams is calculated from the total force acting on each of the 2N points. Since the points P^D_i and P^V_i are connected by a rigid beam, it is convenient to convert F_(t)k,i to a force and a torque acting on the beam's centre of mass. Rotation by φ_i converts the coordinate system of F_(t)k,i = (F^x_(t)k,i , F^y_(t)k,i) to a new system F_(t)k,i = (F^∥_(t)k,i , F^⊥_(t)k,i), with axes parallel with (∥) and perpendicular to (⊥) the beam:

  F^∥_(t)k,i = F^x_(t)k,i cos(φ_i) + F^y_(t)k,i sin(φ_i)
  F^⊥_(t)k,i = F^y_(t)k,i cos(φ_i) − F^x_(t)k,i sin(φ_i) .   (B-7)

The parallel components are summed and applied to CoM_i, resulting in pure translation. The perpendicular components are separated into odd and even parts (giving rise to a torque and a force, respectively) by

  F^⊥,even_i = (F^⊥_(t)D,i + F^⊥_(t)V,i) / 2
  F^⊥,odd_i  = (F^⊥_(t)D,i − F^⊥_(t)V,i) / 2 .   (B-8)

As in Ref. [8], we disregard inertia but include Stokes' drag. Also following Ref. [8], we allow for different drag constants in the parallel and perpendicular directions, given by c_∥ and c_⊥ respectively. The motion of CoM_i is therefore

  V^∥_(CoM),i = (1/c_∥) (F^∥_(t)D,i + F^∥_(t)V,i)
  V^⊥_(CoM),i = (1/c_⊥) (2 F^⊥,even_i)
  ω_(CoM),i   = (1/(r c_⊥)) (2 F^⊥,odd_i) ,   (B-9)

where r = 0.5D is the radius of the worm. Finally, we convert V^∥_(CoM),i and V^⊥_(CoM),i back to (x, y) coordinates with

  V^x_(CoM),i = V^∥_(CoM),i cos(φ_i) − V^⊥_(CoM),i sin(φ_i)
  V^y_(CoM),i = V^⊥_(CoM),i cos(φ_i) + V^∥_(CoM),i sin(φ_i) .   (B-10)
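The chain (B-7)–(B-10) for a single beam can be sketched as follows. This is a hypothetical helper, not the authors' implementation; it assumes the rotation convention of Eqs. (B-7) and (B-10) and mutually consistent force/drag units.

```python
import math

def beam_motion(F_D, F_V, phi, D=80e-6, c_par=80e-6, c_perp=80e-6):
    """Convert the (x, y) forces on a beam's dorsal/ventral end points into
    centre-of-mass velocities (vx, vy) and angular velocity omega,
    following Eqs. (B-7)-(B-10)."""
    def to_beam_frame(Fx, Fy):
        f_par = Fx * math.cos(phi) + Fy * math.sin(phi)     # (B-7)
        f_perp = Fy * math.cos(phi) - Fx * math.sin(phi)
        return f_par, f_perp

    pD, qD = to_beam_frame(*F_D)
    pV, qV = to_beam_frame(*F_V)
    f_even = (qD + qV) / 2.0          # (B-8): translation-producing part
    f_odd = (qD - qV) / 2.0           #        torque-producing part
    r = 0.5 * D                       # worm radius
    v_par = (pD + pV) / c_par         # (B-9): Stokes drag, inertia neglected
    v_perp = 2.0 * f_even / c_perp
    omega = 2.0 * f_odd / (r * c_perp)
    vx = v_par * math.cos(phi) - v_perp * math.sin(phi)     # (B-10)
    vy = v_perp * math.cos(phi) + v_par * math.sin(phi)
    return (vx, vy), omega
```

A pair of equal forces along the beam's own x-axis yields pure translation (omega = 0), while equal and opposite perpendicular forces on the two end points yield pure rotation.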


Appendix C: Integrating the Neural and Physical Model

In the neural model, the output dθ_i(t)/dt specifies the bending angle θ_i(t) of each unit. In the integrated model, the θ_i(t) are taken as the input to the muscles. Muscle outputs (contractions) are given by unit lengths. The bending angle α_i is then estimated from the dorsal and ventral unit lengths by

  α_i = 36.2 (|Δ^h_D,i| − |Δ^h_V,i|) / L^h_0 ,   (C-1)

where L^h_0 is the resting unit length. (For simplicity, we have denoted the bending angles of both the neural and integrated models by θ in the figures.)

Applying the String Method to Extract Bursting Information from Microelectrode Recordings in Subthalamic Nucleus and Substantia Nigra Pei-Kuang Chao1, Hsiao-Lung Chan1,4, Tony Wu2,4, Ming-An Lin1, and Shih-Tseng Lee3,4 1

Department of Electrical Engineering, Chang Gung University 2 Department of Neurology, Chang Gung Memorial Hospital 3 Department of Neurosurgery, Chang Gung Memorial Hospital 4 Center of Medical Augmented Virtual Reality, Chang Gung Memorial Hospital 259 Wen-Hua First Road, Gui-Shan, 333, Taoyuan, Taiwan [email protected]

Abstract. This paper proposes that bursting characteristics can be effective parameters for classifying and identifying neural activities from the subthalamic nucleus (STN) and substantia nigra (SNr). The string method was used to quantify the bursting patterns in microelectrode recordings into indices. The inter-spike-interval (ISI) was used as one of the independent variables to examine the effectiveness and consistency of the method. The results show consistent findings about bursting patterns in STN and SNr data across all ISI constraints. Neurons in STN tend to release a larger number of bursts with fewer spikes per burst. Neurons in SNr produce a smaller number of bursts with more spikes per burst. According to our statistical evaluation, 50 and 80 ms are suggested as optimal ISI constraints for classifying STN and SNr bursting patterns with the string method. Keywords: Subthalamic nucleus, substantia nigra, inter-spike-interval, burst, microelectrode.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 48–53, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Introduction

The subthalamic nucleus (STN) is frequently the target for studying and treating Parkinson's disease [1, 2]. Placing a microelectrode to record neural activities in deep brain nuclei provides useful information for localization during deep brain stimulation (DBS) neurosurgery. DBS has been approved by the FDA since 1998 [3]. The surgery implants a stimulator in deep brain nuclei, usually the STN, to alleviate Parkinson's symptoms such as tremor and rigidity. To search for the STN during the operation, a microelectrode probe is often used to acquire neural signals from outer areas down to the specific target. With the assistance of imaging techniques, microelectrode signals from different depths are read and recorded. Then, an important step in determining the STN location is to distinguish signals of the STN from its nearby areas, e.g. the substantia nigra (SNr), which is slightly ventral and medial to the STN. Therefore, characterizing and quantifying the firing patterns of STN and


SNr are essential. Firing rate, defined as the number of neural spikes within a period, is the most common variable used to describe neural activity. However, STN and SNr have broad and largely overlapping ranges of firing rate [4] (although SNr has a slightly higher mean firing rate than STN). This makes it difficult to rely on firing rate to target the STN. Bursting patterns may provide a better solution for separating signals from different nuclei. Bursting, defined as clusters of high-frequency spikes released by a neuron, is believed to store important neural information. To establish long-term responses, central synapses usually require groups of action potentials (bursts) [5,6]. Exploring bursting information in neural activities has recently become fundamental in Parkinson's studies [1,2]. It has also been observed that spike trains are more regular in SNr than in STN signals [7]. However, the regularity of firing and the grouping of spikes in STN and SNr potentials have not been investigated thoroughly. This study aims to extract bursting information from STN and SNr. A quantification method for bursting, the string method, will be applied. The string method quantifies bursting information based on inter-spike-intervals (ISI) and spike numbers [8]. Although other methods for quantifying bursts exist [9,10], the string method is the one that can identify which spikes contribute to the detected bursts. In addition, because various ISIs have been used in research [8,9] to define bursts, this study will also evaluate the effect of ISI constraints on discriminating STN and SNr signals.

2 Method

The neuronal data used in this study were acquired during DBS neurosurgery in Chang Gung Memorial Hospital. With the assistance of imaging localization systems [11], trials (10 s each) of microelectrode recordings were collected at a sampling


Fig. 1. MRI images from one patient: a. In the sagittal plane – the elevation angle of the probe (yellow line) from the inter-commissural line (green line) was around 50 to 75°; b. In the frontal plane – the angle of the probe (yellow lines) from the midline (green line) was about 8 to 18° to right or left.


rate of 24,000 Hz. Based on several observations — e.g. magnetic resonance imaging (MRI), computed tomography (CT), motion/perception-related responses, and the probe location according to a stereotactic system — experienced neurologists diagnosed 18 trials as neural signals from STN and the other 23 trials as from SNr. Trials that were collected outside STN and SNr, and/or could not be unambiguously assigned to STN or SNr, were excluded. In this paper, the data are from 3 Parkinson's patients (2 females, 1 male, age = 73.3±8.3 y/o) who received DBS treatment. Due to the patients' individual differences, e.g. head size, the depth of the STN from the scalp was found to vary between 15 and 20 cm. During surgery, the elevation angle of the probe from the intercommissural line was around 50 to 75° (Fig 1a), and the angle between the probe and the midline was about 8 to 18° toward either right or left (Fig 1b).

2.1 Spike Detection

Each trial of microelectrode recordings includes 2 types of signals: spikes and background signals. The background signals are interference from nearby neural areas or the environment. Because the background signals can be modeled as a Gaussian distribution, signals that lie 3 standard deviations (SD) above or below the mean can be treated as non-background signals, i.e. spikes. Therefore, a threshold at the level of the mean plus 3 SD is applied in this study to detect spikes (Fig 2).
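The thresholding step can be sketched as follows (an illustrative simplification; `detect_spikes` and its upward-crossing logic are our own, not the authors' code):

```python
import statistics

def detect_spikes(signal, fs=24000.0, k=3.0):
    """Flag samples exceeding mean + k*SD of the trace (background assumed
    Gaussian) and keep one spike time per upward threshold crossing.
    Returns spike times in seconds."""
    mu = statistics.fmean(signal)
    sd = statistics.pstdev(signal)
    thr = mu + k * sd
    times, above = [], False
    for i, x in enumerate(signal):
        if x > thr and not above:       # new upward crossing -> one spike
            times.append(i / fs)
            above = True
        elif x <= thr:                  # re-arm once back below threshold
            above = False
    return times
```

Note that large spikes inflate the estimated SD slightly, so in practice the threshold is somewhat conservative with respect to the pure background statistics.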

Fig. 2. A segment of a microelectrode recording (amplitude versus time) – the red horizontal line is the threshold; the green stars indicate the detected spikes; most signals around the baseline are background signals


2.2 The String Method

Every detected spike was plotted as a circle in a plot of spike sequential number versus spike occurrence time (Fig 3). The spike sequential number starts at 1 for the first spike in a trial. Spikes that occur close to each other are labeled as strings [8] and defined as bursts. Two parameters were controlled and manipulated to determine bursts: (1) the minimum number of spikes required to form a burst was 5; (2) the maximum ISI between adjacent spikes in a burst was set to 20 ms, 50 ms, 80 ms, and 110 ms separately, to find an optimal condition for distinguishing STN and SNr bursting patterns.
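As a concrete reading, the string grouping and the dependent variables of Sect. 2.3 can be sketched as follows (hypothetical helper names `find_bursts` and `burst_measures`, assuming spike times in seconds):

```python
def find_bursts(spike_times, max_isi=0.05, min_spikes=5):
    """String-method grouping: consecutive spikes whose ISIs stay below
    max_isi (s) form a string; strings of >= min_spikes spikes are bursts.
    Returns (start_index, end_index) pairs into spike_times."""
    bursts, start = [], 0
    for i in range(1, len(spike_times) + 1):
        # A string ends at the last spike or at an ISI above the constraint.
        if i == len(spike_times) or spike_times[i] - spike_times[i - 1] > max_isi:
            if i - start >= min_spikes:
                bursts.append((start, i - 1))
            start = i
    return bursts

def burst_measures(spike_times, duration=10.0, max_isi=0.05):
    """Dependent variables of the study: firing rate (FR), number of
    bursts (NB), and mean spikes per burst (SB)."""
    bursts = find_bursts(spike_times, max_isi=max_isi)
    fr = len(spike_times) / duration
    nb = len(bursts)
    sb = (sum(e - s + 1 for s, e in bursts) / nb) if nb else 0.0
    return fr, nb, sb
```

Sweeping `max_isi` over 0.02, 0.05, 0.08 and 0.11 s reproduces the four ISI constraints compared in the study.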

Fig. 3. A segment of a strings plot (spike sequential number versus time) – each blue circle is a spike; the red triangles mark the starting spikes of bursts; the black triangles mark the ending spikes of bursts

2.3 Dependent Variables

Three dependent variables were computed: (1) Firing rate (FR) was calculated as the total spike number divided by the trial duration (10 s). (2) Number of bursts (NB) was determined by the string method as the total burst count in a trial. (3) The average spike number per burst (SB) was also computed for every trial.

2.4 Statistical Analysis

Independent-samples t-tests were applied to test the firing-rate difference between STN and SNr signals. MANOVA was performed to evaluate NB and SB separately across the different ISI constraints (α = .05).


3 Results

The signals from STN and SNr showed similar firing rates but different bursting patterns. There was no significant difference between STN and SNr in firing rate (STN: 57.0±22.1; SNr: 68.8±23.5) (p > .05). The results for NB and SB are listed in Table 1 and Table 2. For NB, SNr has significantly fewer bursts than STN when the ISI setting is 50 ms or 80 ms (p < .05).

(d′(RS_BOS/RS_REV) > 1.0, 65/104 cells), and 28% of neurons were more responsive to BOS than to OREV (d′(RS_BOS/RS_OREV) > 1.0, 29/104 cells). These results are consistent with past studies [4]. The mean of d′(RS_BOS/RS_REV) was 1.25, and the mean of d′(RS_BOS/RS_OREV) was 0.63. These results indicate that BOS-selective neurons in HVC are highly variable, especially in terms of sequential response properties.
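The d′ selectivity statistics quoted here are computed from per-trial response strengths. A commonly used form in the birdsong literature, following signal-detection theory in the style of Green and Swets, is d′ = 2(R̄_A − R̄_B)/√(σ²_A + σ²_B); this is an assumed definition for illustration (the paper's own definition lies in a part of the text not reproduced here), and `d_prime` is our own helper name:

```python
import statistics

def d_prime(rs_a, rs_b):
    """Selectivity index between per-trial response strengths to stimuli
    A and B. Assumed form: d' = 2*(mean_a - mean_b) / sqrt(var_a + var_b)."""
    ma, mb = statistics.fmean(rs_a), statistics.fmean(rs_b)
    va, vb = statistics.pvariance(rs_a), statistics.pvariance(rs_b)
    return 2.0 * (ma - mb) / (va + vb) ** 0.5
```

Under this convention, d′ > 1.0 marks a cell as clearly more responsive to stimulus A than to stimulus B.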

Fig. 1. An example of the auditory response to song elements (A) and element pairs (B) in a single unit from HVC (raster plots and PSTHs, in Hz, for each element and each element pair)

Population Coding of Song Element Sequence


Fig. 2. Song transition matrices of self-generated songs (ﬁrst row) and sequential response distribution matrices of each single unit (second to seventh rows)

3.2 Responses to Song Element Pair Stimuli

To investigate neural selectivity to song element sequences, we recorded neural responses to all possible element pair stimuli. Because the playback of these stimuli is extremely time-consuming, only 34% of the recorded single units could be held stable throughout the entire presentation (35/104 cells, 12/23 birds). In total, 70% of these stable single units were BOS-selective (d′(RS_BOS/RS_REV) > 1.0, 27/35 cells, 12/12 birds). Hereafter, we focus on these data. A typical example of neural responses to each song element is shown in Fig. 1(A). The neuron responded to the single elements A or C with single phasic activity, but it did not respond to element B. It responded to element D with double phasic activity. These results indicate that the neuron has varied response properties even during single-element presentation. In addition, the neuron exhibited more complex response properties during the presentation of element pairs (Fig. 1(B)). The neuron responded more strongly to most of the element pairs when the second element was A or C, compared to single presentation of each element. However, the response was weaker when the first and second elements were the same. When the second element was B, no differences were observed between single and paired stimuli. When the second element was D, we measured single


J. Nishikawa, M. Okada, and K. Okanoya

phasic responses, and a strong response to BD. These response properties were not correlated with the element-to-element transition probabilities in the song structure. The dotted boxes indicate the sequences included in BOS. However, the neuron responded only weakly to some sequences that were included in BOS (black arrows). In contrast, the neuron responded strongly to other sequences that were not included in BOS (white arrows). Thus, the neuron had broad response properties to song element pairs, beyond the structure of the self-generated song. To quantitatively evaluate sequential response properties, we calculated the response strength measure d′(RS_S/RS_Baseline) for the element pair stimuli S. The sequential response distributions were created for each neuron in two individuals with more than five well-identified single units. Song transition matrices and sequential response distributions are shown in Fig. 2. The response distributions were not correlated with the associated song transition matrices. However, each HVC neuron in the same individual had broad but different response distribution properties. This tendency was consistent among individuals. This result indicates that the song element sequence is encoded at the population level, within broadly but differentially selective HVC neurons.

3.3 Population Dynamics Analysis

To analyze the information coding of song element sequences at the population level, we calculated the time course of population activity vectors, i.e., the set of instantaneous mean firing rates of each neuron in a 50 ms time window. Snapshots of the population responses to stimuli are shown in the eight panels of Fig. 3A (n = 6, bird 2 of Fig. 2). Each point in a panel represents the population vector for one stimulus in the MDS space. The ellipses in the upper four panels indicate groups of vectors whose stimuli share the same first element, while the ellipses in the lower four panels indicate groups of vectors whose stimuli share the same second element. Note that the population activity vectors in the upper four panels are identical to those in the bottom four panels; only the ellipses differ. Before the stimulus presentation ([-155 ms: -105 ms], upper and lower panels), only spontaneous activity was observed around the origin. After the first element presentation ([50 ms: 100 ms], upper panel), groups with the same first elements split apart. After the second element presentation ([131 ms: 181 ms], lower panel), groups with the same second elements still largely overlapped. In the next section, we will show that confounded information, which represents the relation between the first and second elements, increased significantly at this timing. After sufficient time ([480 ms: 530 ms], upper and lower panels), the neurons returned to spontaneous activity. These results indicate that the population responses to the first and second elements are drastically different. Subsequently, we will show that this overlap is derived from the information in the song element sequence.

3.4 Information-Theoretic Analysis

To determine the origin of the overlap in the population response, we calculated the time course of mutual information between the stimulus and neural activity.


Fig. 3. Responses of HVC neurons at the population level (A) and encoded information (B)

The mutual information for the first elements I(S1; R), the second elements I(S2; R), and the element pairs I(S1, S2; R) was calculated within each time window; the window was shifted to analyze the temporal dynamics of information coding (upper-left 3 graphs in Fig. 3B). Narrow lines in each graph indicate the cumulative trace of mutual information in each neuron. The thick line is the cumulative trace over all neurons in the individual. The bottom-left graph in Fig. 3B shows the probability of stimulus presentation. After the presentation of the first elements, mutual information for the first elements increased, showing a statistically significant peak (P < 0.001). After the presentation of the second elements, mutual information for the second elements significantly increased (P < 0.001). At the


same time, mutual information for element pairs also showed a signiﬁcant peak (P < 0.001). Intuitively, information for element pairs I(S1 , S2 ; R) would consist of information for the ﬁrst elements I(S1 ; R) and second elements I(S2 ; R). However, the consecutive calculation of I(S1 , S2 ; R) − I(S1 ; R) − I(S2 ; R) in each time window causes a statistical peak after the presentation of element pairs (P < 0.001; forth graph from left in Fig. 3B). The diﬀerence C represents the conditional mutual information between the ﬁrst and second elements for a given neural response, otherwise known as confounded information [13]. Therefore, confounded information represents the relationship between the ﬁrst and second elements encoded in the neural responses. The I(S1 ; R) peak occurred at the same time that groups of population vectors with the same ﬁrst elements were splitting ([50 ms: 100 ms]). The peaks for I(S2 ; R), I(S1 , S2 ; R), and C occurred during the same time that groups with the same second elements were still largely overlapping ([131 ms: 181 ms]). This indicates that the sequential information causes an overlap in the population response. In the population dynamics analysis, we cannot combine the data from diﬀerent birds because each bird has a diﬀerent number and types of song elements. However, in the mutual information analysis, we can combine and average the data from diﬀerent birds. The ﬁve graphs on the right in Fig. 3B show the time courses for I(S1 ; R), I(S2 ; R), I(S1 , S2 ; R), C, and the stimulus presentation probability, which were calculated from all stable single units with BOS selectivity (n = 27, 12 birds). The combined mutual information for ﬁrst elements was very similar to that from one bird, showing a signiﬁcant peak after the presentation of the ﬁrst elements (P < 0.001). Mutual information for second elements, element pairs, and confounded information also had signiﬁcant peaks after the presentation of the second elements (P < 0.001). 
These results show that the song element sequence is encoded into a neural ensemble in HVC by population coding.
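The confounded-information calculation C = I(S1, S2; R) − I(S1; R) − I(S2; R) can be sketched with plug-in estimates over discretized responses (a simplification of the windowed analysis; the function names are ours):

```python
import math
from collections import Counter

def mutual_info(pairs):
    """I(X; Y) in bits from a list of (x, y) samples (plug-in estimate)."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def confounded_info(samples):
    """C = I(S1,S2; R) - I(S1; R) - I(S2; R) over (s1, s2, r) samples [13]."""
    i12 = mutual_info([((s1, s2), r) for s1, s2, r in samples])
    i1 = mutual_info([(s1, r) for s1, s2, r in samples])
    i2 = mutual_info([(s2, r) for s1, s2, r in samples])
    return i12 - i1 - i2
```

A response that depends on the *combination* of elements (e.g. an XOR-like dependence on whether the two elements match) carries zero information about each element alone but positive pair information, so C > 0 — exactly the signature of sequence coding discussed above.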

4 Conclusion

In this study, we recorded auditory responses to all possible element pair stimuli from the Bengalese ﬁnch HVC. By determining the sequential response distributions for each neuron, we showed that each neuron in HVC has broad but diﬀerential response properties to song element sequences. The population dynamics analysis revealed that population activity vectors overlap after the presentation of element pairs. Using mutual information analysis, we demonstrated that this overlap in the population response is due to confounded information, namely, the sequential information of song elements. These results indicate that the song element sequence is encoded into the HVC microcircuit at the population level. Song element sequences are encoded in a neural ensemble with broad and differentially selective neuronal populations, rather than the chain-like model of diﬀerential TCS neurons.


Acknowledgment This study was partially supported by the RIKEN Brain Science Institute, and by a Grant-in-Aid for young scientists (B) No. 18700303 from the Japanese Ministry of Education, Culture, Sports, Science, and Technology.

References

1. Okanoya, K.: The Bengalese finch: a window on the behavioral neurobiology of birdsong syntax. Ann. N.Y. Acad. Sci. 1016, 724–735 (2004)
2. Doupe, A.J., Kuhl, P.K.: Birdsong and human speech: common themes and mechanisms. Annu. Rev. Neurosci. 22, 567–631 (1999)
3. Margoliash, D., Fortune, E.S.: Temporal and harmonic combination-selective neurons in the zebra finch's HVc. J. Neurosci. 12, 4309–4326 (1992)
4. Lewicki, M.S., Arthur, B.J.: Hierarchical organization of auditory temporal context sensitivity. J. Neurosci. 16, 6987–6998 (1996)
5. Drew, P.J., Abbott, L.F.: Model of song selectivity and sequence generation in area HVc of the songbird. J. Neurophysiol. 89, 2697–2706 (2003)
6. Deneve, S., Latham, P.E., Pouget, A.: Reading population codes: a neural implementation of ideal observers. Nat. Neurosci. 2, 740–745 (1999)
7. Pouget, A., Dayan, P., Zemel, R.: Information processing with population codes. Nat. Rev. Neurosci. 1, 125–132 (2000)
8. Green, D., Swets, J.: Signal Detection Theory and Psychophysics. Wiley, New York (1966)
9. Theunissen, F.E., Doupe, A.J.: Temporal and spectral sensitivity of complex auditory neurons in the nucleus HVc of male zebra finches. J. Neurosci. 18, 3786–3802 (1998)
10. Matsumoto, N., Okada, M., Sugase-Miyamoto, Y., Yamane, S., Kawano, K.: Population dynamics of face-responsive neurons in the inferior temporal cortex. Cereb. Cortex 15, 1103–1112 (2005)
11. Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–328 (1966)
12. Sugase, Y., Yamane, S., Ueno, S., Kawano, K.: Global and fine information coded by single neurons in the temporal visual cortex. Nature 400, 869–873 (1999)
13. Reich, D.S., Mechler, F., Victor, J.D.: Formal and attribute-specific information in primary visual cortex. J. Neurophysiol. 85, 305–318 (2001)

Spontaneous Voltage Transients in Mammalian Retinal Ganglion Cells Dissociated by Vibration Tamami Motomura, Yuki Hayashida, and Nobuki Murayama Graduate School of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Kumamoto 860-8555, Japan [email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp

Abstract. We recently developed a new method to dissociate neurons from mammalian retinae by utilizing low-Ca2+ tissue incubation and the vibrodissociation technique, without the use of enzymes. The retinal ganglion cell somata dissociated by this method showed spontaneous voltage transients (sVT) with a fast rise and slower decay. In this study, we analyzed the characteristics of these sVT in cells under the perforated-patch whole-cell configuration, as well as in a single-compartment cell model. The sVT varied in amplitude in a quantal manner and reversed in polarity around −80 mV in normal physiological saline. The reversal potential of the sVT shifted with the K+ equilibrium potential, indicating the involvement of some K+ conductance. Based on the model, the conductance changes responsible for producing the sVT were largely independent of the membrane potential below −50 mV. These results could suggest the presence of isolated, inhibitory presynaptic terminals attached to the ganglion cell somata. Keywords: Neuronal computation, dissociated cells, retina, patch-clamp, neuron model.

1 Introduction

Elucidating the functional role of single neurons in neural information processing is intricate, because neuronal computation itself is highly nonlinear and adaptive, and depends on combinations of many parameters, e.g. the ionic conductances, intracellular signaling, their subcellular distributions, and the cell morphology. Furthermore, interactions with surrounding neurons/glia can alter those factors, and thereby hinder us from examining some of them separately. This can be overcome by pharmacologically or physically isolating neurons from their circuits. One approach is to use pharmacological agents that can block synaptic signal transmission in situ, although it is hard to know whether such agents have unintended side-effects. Alternatively, one can dissociate neural tissue into single neurons by means of enzymatic digestion and mechanical trituration. Dissociated single neurons often lose their fine neurites and the synaptic contacts with other cells during the dissociation procedure, and are thus useful for examining the properties of ionic conductances at known membrane potentials [3]. Unfortunately, however,
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 64–72, 2008. © Springer-Verlag Berlin Heidelberg 2008


several studies have demonstrated that the proteolytic enzymes employed for cell dissociation can distort the amplitude, kinetics, localization, and pharmacological properties of ionic currents, e.g. [2]. These observations have led to attempts to isolate neurons by enzyme-free, mechanical means. Recently, we developed a new protocol for dissociating single neurons from specific layers of mammalian retinae without the use of any proteolytic enzymes [12], combining low-Ca2+ tissue incubation [9] with the vibrodissociation technique [15], which has been applied to slices of brains and spinal cords [1]. The somata of ganglion cells dissociated by our method showed spontaneous voltage transients (sVT) with a fast rise and slower decay in time course [8]. To our knowledge, such sVT have never been reported in previous studies on retinal ganglion cells dissociated with or without enzymes [9]. Therefore, in this study, we analyzed the characteristics of these sVT in cells under the perforated-patch whole-cell configuration, as well as in a single-compartment cell model. The present results could suggest the presence of inhibitory presynaptic terminals attached to the ganglion cell somata we recorded from, as demonstrated in previous studies on vibrodissociated neurons of brains and spinal cords [1]. If this is the case, the retinal neurons dissociated by our method would be advantageous for investigating the mechanisms of transmitter release in single tiny synaptic boutons, even in isolation from the axons and neurites.

2 Methods

All animal care and experimental procedures in this study were approved by the committee for animal research of Kumamoto University.

2.1 Cell Dissociation

The neural retinas were isolated from two freshly enucleated eyes of Wistar rats (P7-P25), cut into 2-4 pieces each, and briefly kept in chilled extracellular "bath" solution. This solution contained (in mM): 140 NaCl, 3.5 KCl, 1 MgCl2, 2.5 CaCl2, 10 D-glucose, 5 HEPES. The pH was adjusted to 7.3 with NaOH. A retinal piece was then placed photoreceptor-side down in a culture dish, covered with 0.4 ml of chilled, low-Ca2+ solution, and incubated for 3-5 min. This low-Ca2+ solution contained (in mM): 140 sucrose, 2.5 KCl, 70 CsOH, 20 NaOH, 1 NaH2PO4, 15 CaCl2, 20 EDTA, 11 D-glucose, 15 HEPES. The estimated free Ca2+ concentration was 100–200 nM. The pH was adjusted to 7.2 with HCl. After the incubation, a fire-blunted glass pipette, vibrating horizontally with an amplitude of 0.2-0.5 mm at 100 Hz, was applied to the flattened surface of the retina under visual control with the microscope, so that cells were dissociated from the ganglion cell layer but scarcely from the inner and outer nuclear layers. After removing the remaining retinal tissue, the culture dish was filled with the bath solution and left on a vibration-isolation table for 15-40 min to allow the cells to settle. The bath solution was then replaced by a fresh aliquot supplemented with 1 mg/ml bovine serum albumin, and the dissociated cells were maintained at room temperature (20-25 °C) for 2-18 hrs prior to the electrophysiological recordings described below. The ganglion cells were identified


T. Motomura, Y. Hayashida, and N. Murayama

based on the size criteria [6]. Nearly all the cells we recorded from in voltage-/current-clamp showed large-amplitude voltage-gated Na+ currents and/or action potentials (see Fig. 1A-B), verifying that they were ganglion cells [4].

2.2 Electrophysiology

Since previous studies demonstrated that the membrane conductances of retinal ganglion cells can be modulated by intracellular messengers, e.g., Zn2+ [13] and cAMP [7], all recordings presented here were performed in perforated-patch whole-cell mode [9] to maintain cytoplasmic integrity. Patch electrodes were pulled from borosilicate glass capillaries to tip resistances of approximately 4-8 MΩ. The tips of the electrodes were filled with a recording "electrode" solution that contained (in mM): 110 K-D-gluconic acid, 15 KCl, 15 NaOH, 2.6 MgCl2, 0.34 CaCl2, 1 EGTA, 10 HEPES. The pH was adjusted to 7.2 with methanesulfonic acid. The shanks of the electrodes were filled with this solution after the addition of amphotericin B as the perforating agent (260 μg/ml, with 400 μg/ml Pluronic F-127). The recordings were made after the series resistance in the perforated-patch configuration reached a stable value (typically 20-40 MΩ, ranging 10-100 MΩ). In the fast current-clamp mode of the amplifier (EPC-10, Heka), the voltage monitor output was analog-filtered by the built-in Bessel filters (3-pole 10–30 kHz followed by 4-pole 2-5 kHz) and digitally sampled (5–20 kHz). The voltage drop across the series resistance was compensated by the built-in circuitry. The recording bath was grounded via an agar bridge, and the bath solution was continuously superfused over each cell recorded from, at a constant flow rate (0.4 ml/min). The volume of solution in the recording chamber was kept at about 2 ml. To apply a high-K+ solution (Fig. 2B), 8 mM NaCl in the bath solution was replaced by equimolar KCl. An enzyme solution was made by supplementing the bath solution with 0.25 mg/ml papain and 2.5 mM L-cysteine.
All experiments were performed at room temperature.

3 Results

Perforated-patch whole-cell recordings were made from the somata of ganglion cells dissociated by our recently developed protocol (see Methods), which offered quantitative measurements of the intrinsic membrane properties with the least distortion due to proteolysis by enzymes [2]. At the same time, since these cells were never exposed to any enzyme, they were useful for examining the effects of the enzymes utilized for cell dissociation in previous studies. In fact, spike firing of the ganglion cells in response to constant current injection via the patch electrode (30-pA step in the positive direction) was irreversibly altered when the enzyme solution was superfused over those cells (n=3): 1) The resting potential depolarized by 5-20 mV and the spike firing diminished during the enzyme application; 2) When the enzyme was washed out from the recording chamber, the resting potential gradually hyperpolarized to near the original level and the spike firing recovered to some extent;


Fig. 1. Spontaneous voltage transients (sVT) observed in the dissociated retinal ganglion cells. A: Microphotograph of the cell recorded from. Note the soma being larger than 15 μm in diameter. B: Membrane potential changes in response to step-wise constant current injections. Four traces are superimposed. The injected current was 10 pA in the negative direction and 10, 20, and 30 pA in the positive direction. C: Spontaneous hyperpolarizations under the current-clamp. The recordings were made for 50 sec in three different episodes, with breaks of 12 sec between the first and second and of 6 sec between the second and third. A constant current (2 pA in the positive direction) was injected to hold the membrane potential at around –70 mV (dashed gray line). Inset: Examples of sVT on an expanded time scale. Five events are recognized. Three of them are similar in their amplitude and time course, and the other two have roughly half and a quarter of the largest amplitude.

3) After ~20 min of washing out the enzyme, the spike firing reached a steady state at which the interval between the first and second spikes in response to the current step was shorter than that before the enzyme application, by 40 ± 11 % (mean ± S.E.) (not shown, [12]). These results suggest that, in previous studies on isolated retinal ganglion cells, some of the ionic channels could have been significantly distorted during the dissociation procedure because of the use of proteolytic enzymes. Moreover, we found spontaneous voltage transients (sVT) with a fast rise and slower decay in the retinal ganglion cell somata dissociated by our method [8]. Fig. 1C shows an example of sVT recorded from the cell shown in Fig. 1A. As shown in the figure, transient hyperpolarizations spontaneously appeared under a constant current injection. Most of these hyperpolarizations are similar in amplitude and time course at a given membrane potential (−70 mV here), and in some, the peak amplitude was roughly half or a quarter (or one-eighth, in other cells) of the largest one (Inset). Such sVT appeared in 10-20 % of the cells we made recordings from, and could be observed in particular cells as long as we kept the recordings (0.5-2 hrs). When the enzyme solution was superfused over one of those cells, the sVT disappeared completely and were not seen again even after 20 min of washing out the enzyme.


Fig. 2. Reversal potential of sVT. A, B: The sVT recorded in saline containing extracellular K+ of 3.5 mM (A) and 11.5 mM (B). The basal membrane potential (indicated by arrows) was varied by injecting holding currents ranging between –8 and +8 pA in A and between –8 and +12 pA in B. C: Plots of the peak amplitude versus the basal potential. Only the events having the largest amplitude (see Fig. 1C) are taken into account. Note that the amplitudes of depolarizations are plotted as negative values, and vice versa. The filled and open circles represent the data for 3.5-mM K+ and 11.5-mM K+, respectively.

As shown in Fig. 1C, the sVT were all recorded as hyperpolarizations when the cell was held at approximately −70 mV. Thus, the reversal potential of the ionic current producing sVT should be below this voltage. In Fig. 2, the reversal potential for sVT was measured by holding the basal membrane potential at different levels under the current-clamp, cf. [5]. As expected, the polarity of sVT reversed around −80 mV when the basal membrane potential was varied from about −100 to −40 mV (Fig. 2A). Based on the ionic compositions of the bath and electrode solutions used in this recording, the equilibrium potential of K+ (EK) was estimated to be about −90 mV, close to the reversal potential for sVT. When EK was shifted by +30 mV, i.e., from about −90 to −60 mV, by applying the high-K+ solution (see Methods), the polarity of sVT reversed between −54 and −39 mV of the basal membrane potential (Fig. 2B). Fig. 2C plots the peak amplitude of sVT versus the basal membrane potential. The linear regressions on these plots (gray lines) crossed the abscissa (dashed line) at approximately –76 mV and –48 mV for 3.5-mM K+ and 11.5-mM K+, respectively, showing a shift of the reversal potential parallel to the EK shift. Similar results were obtained in two other cells. These results indicate that the ionic conductance responsible for producing sVT is permeable to, at least, K+. In the present experiments, we made recordings from cells without neurites or with neurites no longer than about 10 μm. Therefore, those cells can be modeled as the single compartment shown in Fig. 3A. In this model, the unknown conductance responsible for producing sVT and its reversal potential are represented by "gx" and "Ex", respectively. The membrane properties intrinsic to the cell are represented by the membrane capacitance Cm, the nonlinear conductance gm, and the apparent reversal potential Em. Here, Cm was measured with the capacitance compensation circuitry of
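The K+ equilibrium potentials quoted above (EK ≈ −90 mV in 3.5 mM K+, shifting to ≈ −60 mV in 11.5 mM K+) can be checked with the Nernst equation. A minimal sketch, assuming room temperature T ≈ 295 K and taking the intracellular K+ as roughly 125 mM (110 mM K-gluconate plus 15 mM KCl in the electrode solution):

```python
import math

def nernst_mV(c_out_mM, c_in_mM, T=295.0):
    """Nernst equilibrium potential for a monovalent cation, in mV."""
    R, F = 8.314, 96485.0  # gas constant J/(mol K), Faraday constant C/mol
    return 1e3 * (R * T / F) * math.log(c_out_mM / c_in_mM)

print(round(nernst_mV(3.5, 125.0)))   # control bath: about -91 mV
print(round(nernst_mV(11.5, 125.0)))  # high-K+ bath: about -61 mV
```

Both values agree with the estimates in the text, and their difference reproduces the ~+30 mV shift of the sVT reversal potential.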


Fig. 3. Conductance changes during sVT. A: Single-compartment model of the isolated somata. ICm, the current through Cm; Igm, the current through gm; Igx, the current through gx. B: Voltage responses to the injected current steps. The amplitude of the current steps (Iinj) was varied from –10 to +10 pA in 5-pA increments, and the corresponding voltage changes (Vm), from the bottom (black) to the top (light gray), were recorded. C: Plots of the membrane potential versus the amplitude of the current steps. The voltage was measured at the time points indicated by the marks in B (circle, square, triangle, rhombus, and hexagon). The solid line shows the best fit to the plots with a single exponential function. D: Voltage dependency of gm calculated from the plots in C. The derivative of current with respect to the membrane potential gave the slope conductance gm, which could be approximated by a hyperbolic function, gm = Iα / (Vα − Vm), where Vm

n (n > 2) GRFs are used. The center of the i-th GRF is set to μi = Imin + ((2i − 3)/2)·((Imax − Imin)/(n − 2)). One GRF is placed outside the range at each of the two ends. All the GRFs encoding an input variable have the same width, which is set to σ = (1/γ)·((Imax − Imin)/(n − 2)), where γ controls the extent of overlap between the GRFs. For an input variable with value a, the activation value of the i-th GRF with center μi and width σ is given by

fi(a) = exp(−(a − μi)² / (2σ²))    (3)

The firing time of the neuron associated with this GRF is inversely proportional to fi(a). For a highly stimulated GRF, with fi(a) close to 1.0, the firing time t = 0 milliseconds is assigned. When the activation value of the

Region-Based Encoding Method Using Multi-dimensional Gaussians


GRF is small, the firing time t is high, indicating that the neuron fires later. In our experiments, the firing time of an input neuron is chosen to be in the range 0 to 9 milliseconds. While converting the activation values of GRFs into firing times, a coding threshold is imposed on the activation value. A GRF that gives an activation value less than the coding threshold is marked as not-firing (NF), and the corresponding input neuron does not contribute to the membrane potential of the post-synaptic neuron. The range-based encoding method is illustrated in Fig. 1(c). For multi-variate data, each variable is encoded separately, effectively ignoring the correlation present among the variables. In this encoding method, 1-D GRFs are uniformly placed along an input dimension, without considering the distribution of the data. Hence, when the data is sparse, some GRFs, placed in regions where no data is present, are not effectively used. This results in a high neuron count and computational cost. The widths of the GRFs are derived without using any knowledge of the data distribution, except for the range of values that the variables take. Taking one GRF along an input dimension and quantizing its activation value results in the formation of intervals within the range of values of that variable, such that one or more intervals are mapped onto a particular quantization level. When 2-D data is encoded by taking an array of 1-D GRFs along each input dimension, the input space is quantized into rectangular grids such that all the input patterns falling into a particular rectangular grid have the same vector of quantization levels, and hence the same encoded time vector. Additionally, one or more rectangular grids may have the same encoded time vector. For multivariate data, the input space is divided into hypercuboids. To demonstrate this, the single-ring data (shown in Fig. 2(a)) is encoded by placing 5 GRFs along each dimension, dividing the input space into grids as shown in Fig. 2(b). A 10-2 MDSNN, having 10 neurons in the input layer and 2 neurons in the output layer, is trained to cluster this data. The space of data points as represented by the output layer neurons is shown in Fig. 2(c). The cluster boundary is observed to be a combination of linear segments defined by the rectangular grid boundaries formed by the encoding. The shape of the boundary formed by the MDSNN is significantly different from the desired circle-shaped boundary between the two clusters in the single-ring data. Increasing the number of GRFs used for encoding each dimension may give a boundary that is a combination of smaller linear segments, at the expense of a high neuron count. However, this may not result in proper clustering of the data, as the choice of the number of GRFs is observed to be crucial in the range-based encoding method. When the range-based encoding method is used along with the varying-threshold method and the multi-stage learning method [15] to cluster complex data sets such as the double-ring data and the spiral data, it is observed that proper subclusters are not formed by the neurons in the hidden layer. As each dimension is encoded separately, spatially disjoint subsets of data points that have similar encoding along a particular dimension, as shown by the marked regions in Fig. 3, are found to be represented by a single neuron in the hidden


L.N. Panuku and C.C. Sekhar


Fig. 2. Clustering the single-ring data encoded using the range-based encoding method: (a) The single-ring data, (b) data space quantization due to the range-based encoding, and (c) data space representation by the output neurons


Fig. 3. Improper subclusters formed when the data is encoded with the range-based encoding method for (a) the double-ring data and (b) the spiral data

layer. This binding is observed to form during the initial iterations of learning, when the firing thresholds of the neurons are low. The established binding cannot be unlearnt in subsequent iterations, leading to improper clustering at the output layer. An encoding method that overcomes the above-discussed limitations is proposed in the next section.

4 Region-Based Encoding Using Multi-dimensional Gaussian Receptive Fields

Using multi-dimensional GRFs for encoding helps in capturing the correlation present in the data. One approach would be to uniformly place the multi-dimensional GRFs covering the whole range of the input data space. However, this results in an exponential increase in the number of neurons in the input layer with the dimensionality of the data. To circumvent this, we propose a region-based encoding method that places the multi-dimensional GRFs only in the data-inhabited regions, i.e., the regions where data is present. The mean vectors and the covariance matrices of these GRFs are computed from the data in the regions, thus capturing the correlation present in the data. To identify the data-inhabited regions in the input space, first, k-means clustering is performed on the data to be clustered, with the value of k being larger than the number of actual clusters. On each of the regions identified using the k-means clustering method, a multi-dimensional GRF is placed by computing the mean vector and the covariance matrix from the data in that region. The response of the i-th GRF for a multi-variate input pattern a is computed as

fi(a) = exp( −(1/2) (a − μi)^T Σi^{−1} (a − μi) ),    (4)

where μi and Σi are the mean vector and the covariance matrix of the i-th GRF, respectively, and fi(a) is the activation value of that GRF. As discussed in Section 3, these activation values are translated into firing times in the range 0 to 9 milliseconds, and the non-optimally stimulated input neurons are marked as NF. By deriving the covariance for a GRF from the data, the region-based encoding method captures the correlation present in the data. The regions identified by k-means clustering and the data space quantization resulting from this encoding method, for the single-ring data used in Section 3, are shown in Fig. 4(a) and 4(b), respectively. The boundary given by the MDSNN with the region-based encoding method is shown in Fig. 4(c). This boundary is more like the desired circle-shaped boundary, as against the combination of linear segments observed with the range-based encoding method (see Fig. 2(c)).


Fig. 4. (a) Regions identiﬁed by k-means clustering with k = 8, (b) data space quantization due to the region-based encoding and (c) data space representation by the neurons in the output layer
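The whole pipeline of the proposed method — find the data-inhabited regions by k-means, fit one multi-dimensional GRF per region, then translate activation values into firing times — can be sketched as follows. The k-means initialization, the linear activation-to-time mapping, and the threshold value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    # plain Lloyd's algorithm to locate the data-inhabited regions
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        lbl = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        C = np.array([X[lbl == j].mean(0) if (lbl == j).any() else C[j]
                      for j in range(k)])
    return lbl

def fit_grfs(X, k):
    # one multi-dimensional GRF per region: mean vector + covariance matrix
    lbl = kmeans(X, k)
    grfs = []
    for j in range(k):
        m = lbl == j
        if m.sum() >= 3:  # skip degenerate regions with too few points
            grfs.append((X[m].mean(0), np.cov(X[m].T)))
    return grfs

def encode(a, grfs, t_max=9, theta=0.01):
    # Eq. (4) activation -> firing time; None marks a not-firing neuron
    times = []
    for mu, Sigma in grfs:
        d = a - mu
        f = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)
        times.append(None if f < theta else int(round(t_max * (1 - f))))
    return times

# single-ring-like data: points scattered around the unit circle
ang = rng.uniform(0, 2 * np.pi, 500)
X = np.c_[np.cos(ang), np.sin(ang)] + 0.05 * rng.standard_normal((500, 2))
grfs = fit_grfs(X, 8)
print(encode(grfs[0][0], grfs))  # a GRF evaluated at its own mean fires at t = 0
```

Because each covariance matrix is fitted to the points of its region, the GRFs stretch along the local orientation of the ring, which is where the smoother, data-following boundaries come from.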

Next, we study the performance of the proposed region-based encoding method in clustering complex 2-D and 3-D data sets. For the double-ring data, the k-means clustering is performed with k = 20, and the resulting regions are shown in Fig. 5(a). Over each of the regions, a 2-D GRF is placed to encode the data. A 20-8-2 MDSNN is trained using the multi-stage learning method discussed in Section 2. It is observed that, out of the 8 neurons in the hidden layer, 3 do not win for any of the training examples, and the data is represented by the remaining 5 neurons as shown in Fig. 5(b). These 5 neurons provide the input in the second stage of learning to form the final clusters (Fig. 5(c)). The resulting cluster boundaries are seen to follow the data distribution, as shown in Fig. 5(d). Similarly, the spiral data is encoded using 40 2-D GRFs. The regions of data identified using the k-means clustering method are shown in Fig. 6(a). A 40-20-2 MDSNN is trained to cluster the spiral data. As shown in Fig. 6(b), 14 subclusters are formed in the hidden layer, which are combined in the next layer to form the final clusters, as shown in Fig. 6(c). The region-based encoding method helps in proper subcluster formation at the hidden layer (Fig. 5(b) and Fig. 6(b)), as against the range-based encoding method (Fig. 3). The proposed method is also used to cluster 3-D data sets, namely, the interlocking donuts data and the 3-D ring data. The interlocking donuts data is


Fig. 5. Clustering the double-ring data: (a) Regions identiﬁed by k-means clustering with k = 20, (b) subclusters formed at the hidden layer, (c) clusters formed at the output layer and (d) data space representation by the neurons in the output layer


Fig. 6. Clustering the spiral data: (a) Regions identified by k-means clustering with k = 40, (b) subclusters formed at the hidden layer, (c) clusters formed at the output layer and (d) data space representation by the neurons in the output layer


Fig. 7. Clustering of the interlocking donuts data and the 3-D ring data: (a) Regions identiﬁed by k-means clustering on the interlocking donuts data with k = 10 and (b) clusters formed at the output layer. (c) Regions identiﬁed by k-means clustering on the 3-D ring data with k = 5 and (d) clusters formed at the output layer.

encoded with ten 3-D GRFs, and a 10-2 MDSNN is trained to cluster this data. The k-means clustering results and the final clusters formed by the MDSNN are shown in Fig. 7(a) and 7(b), respectively. The clustering results for the 3-D ring data with the proposed encoding method are shown in Fig. 7(c) and 7(d). For comparison, the performance of the range-based encoding method and the region-based encoding method, for different data sets, is presented in Table 1. It is observed that the region-based encoding method outperforms the range-based encoding method in clustering complex data sets like the double-ring data and the spiral data. For the cases where both methods give the same or almost the same performance, the number of neurons used in the input layer is given in parentheses. It is observed that the region-based encoding method always maintains a low neuron count, thereby reducing the computational cost. The difference between the neuron counts of the two methods may look small for these 2-D and 3-D data sets. However, as the dimensionality of the data increases, this


Table 1. Comparison of the performance (in %) of MDSNNs using the range-based encoding method and the region-based encoding method for clustering. The numbers in parentheses give the number of neurons in the input layer.

Data set                     Range-based encoding   Region-based encoding
Double-ring data             74.82                  100.00
Spiral data                  66.18                  100.00
Single-ring data             100.00 (10)            100.00 (8)
Interlocking cluster data    99.30 (24)             100.00 (6)
3-D ring data                100.00 (15)            100.00 (5)
Interlocking donuts data     97.13 (21)             100.00 (10)

difference can be significant. From these results, it is evident that the proposed encoding method scales well to higher-dimensional data clustering problems while keeping a low neuron count. Additionally, and more importantly, the nonlinear cluster boundaries given by the region-based encoding method follow the distribution of the data and the shapes of the clusters.
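The scaling argument can be made concrete: with n GRFs per dimension, the range-based method needs n·d input neurons for d-dimensional data, a uniform multi-dimensional grid would need n^d, while the region-based method needs only k, independent of d. The values n = 5 and k = 10 below are illustrative:

```python
n, k = 5, 10  # GRFs per dimension vs. number of k-means regions (illustrative)
for d in (2, 3, 10):
    print(f"d={d:2d}: range-based {n * d}, uniform grid {n ** d}, region-based {k}")
```

Already at d = 10, a uniform grid would require nearly ten million input neurons, whereas the region-based count stays at k.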

5 Conclusions

In this paper, we have proposed a new encoding method using multi-dimensional GRFs for MDSNNs. We have demonstrated that the proposed encoding method effectively uses the correlation present in the data and positions the GRFs in the data-inhabited regions. We have also shown that the proposed method results in a low neuron count, as opposed to the encoding method proposed in [14] and the simple approach of placing multi-dimensional GRFs covering the whole data space. This in turn results in a low computational cost for clustering. With the encoding method proposed in [14], the cluster boundaries obtained for nonlinearly separable data are observed to be combinations of linear segments, and the MDSNN failed to cluster the double-ring data and the spiral data. We have experimentally shown that, with the proposed encoding method, the MDSNNs can cluster complex data like the double-ring data and the spiral data, while giving smooth nonlinear boundaries that follow the data distribution. In the existing range-based encoding method, when the data consists of clusters with different scales, i.e., narrow and wide clusters, GRFs with different widths are used; this technique is called multi-scale encoding. In the region-based encoding method, however, the widths of the multi-dimensional GRFs are automatically computed from the data-inhabited regions, and the widths of these GRFs can differ. In the proposed method, for clustering the 2-D and 3-D data, the value of k is decided empirically and the formation of subclusters at the hidden layer is verified visually. For higher-dimensional data, however, it will be necessary to verify the formation of subclusters automatically.


References

1. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Englewood Cliffs (1998)
2. Kumar, S.: Neural Networks: A Classroom Approach. Tata McGraw-Hill, New Delhi (2004)
3. Maass, W.: Networks of Spiking Neurons: The Third Generation of Neural Network Models. Trans. Soc. Comput. Simul. Int. 14(4), 1659–1671 (1997)
4. Bi, Q., Poo, M.: Precise Spike Timing Determines the Direction and Extent of Synaptic Modifications in Cultured Hippocampal Neurons. Neuroscience 18, 10464–10472 (1998)
5. Maass, W., Bishop, C.M.: Pulsed Neural Networks. MIT Press, London (1999)
6. Gerstner, W., Kistler, W.M.: Spiking Neuron Models. Cambridge University Press, Cambridge (2002)
7. Maass, W.: Fast Sigmoidal Networks via Spiking Neurons. Neural Computation 9, 279–304 (1997)
8. Verstraeten, D., Schrauwen, B., Stroobandt, D., Campenhout, J.V.: Isolated Word Recognition with the Liquid State Machine: A Case Study. Information Processing Letters 95(6), 521–528 (2005)
9. Bohte, S.M., Kok, J.N., Poutre, H.L.: Spike-Prop: Error-backpropagation in Temporally Encoded Networks of Spiking Neurons. Neural Computation 48, 17–37 (2002)
10. Natschlager, T., Ruf, B.: Spatial and Temporal Pattern Analysis via Spiking Neurons. Network: Comp. Neural Systems 9, 319–332 (1998)
11. Ruf, B., Schmitt, M.: Unsupervised Learning in Networks of Spiking Neurons using Temporal Coding. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 361–366. Springer, Heidelberg (1997)
12. Hopfield, J.J.: Pattern Recognition Computation using Action Potential Timing for Stimulus Representations. Nature 376, 33–36 (1995)
13. Gerstner, W., Kempter, R., Van Hemmen, J.L., Wagner, H.: A Neuronal Learning Rule for Sub-millisecond Temporal Coding. Nature 383, 76–78 (1996)
14. Bohte, S.M., Poutre, H.L., Kok, J.N.: Unsupervised Clustering with Spiking Neurons by Sparse Temporal Coding and Multilayer RBF Networks. IEEE Transactions on Neural Networks 13, 426–435 (2002)
15. Panuku, L.N., Sekhar, C.C.: Clustering of Nonlinearly Separable Data using Spiking Neural Networks. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D. (eds.) ICANN 2007. LNCS, vol. 4668, Springer, Heidelberg (2007)

Firing Pattern Estimation of Biological Neuron Models by Adaptive Observer

Kouichi Mitsunaga 1, Yusuke Totoki 2, and Takami Matsuo 2

1 Control Engineering Department, Oita Institute of Technology, Oita, Japan
2 Department of Architecture and Mechatronics, Oita University, 700 Dannoharu, Oita, 870-1192, Japan

Abstract. In this paper, we present three adaptive observers that use the membrane potential measurement, under the assumption that some of the parameters of the HR neuron model are known. Using strict positive realness and Yu's stability criterion, we can show the asymptotic stability of the error systems. The estimators allow us to recover the internal states and to distinguish the firing patterns from early-time dynamic behaviors.

1 Introduction

In traditional artificial neural networks, neuron behavior is described only in terms of firing rate, while most real neurons, commonly known as spiking neurons, transmit information by pulses, also called action potentials or spikes. Model studies of neuronal synchronization can be separated into those where models of the integrate-and-fire type are used and those where conductance-based spiking and bursting models are employed [1]. Bursting occurs when neuron activity alternates, on a slow time scale, between a quiescent state and fast repetitive spiking. In any study of neural network dynamics, there are two crucial issues: 1) what model describes the spiking dynamics of each neuron, and 2) how the neurons are connected [3]. Izhikevich considered the first issue and compared various models of spiking neurons. He reviewed 20 types of real (cortical) neuron responses to the injection of simple dc pulses, such as tonic spiking, phasic spiking, tonic bursting, and phasic bursting. Through his simulations, he suggested that if the goal is to study how neuronal behavior depends on measurable physiological parameters, such as the maximal conductances, steady-state (in)activation functions, and time constants, then the Hodgkin-Huxley type model is the best. However, its computational cost is the highest of all the models. He also pointed out that the Hindmarsh-Rose (HR) model is computationally simple and capable of producing the rich firing patterns exhibited by real biological neurons. Although the HR model is a computational model of neuronal bursting built from three coupled first-order differential equations [5,6], it can generate tonic spiking, phasic spiking, and so on, for different parameters in the model equations. Carroll showed by simulation that additive noise shifts the neuron model into a two-frequency region (i.e., bursting) and that the slow part of the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 83–92, 2008.
© Springer-Verlag Berlin Heidelberg 2008


K. Mitsunaga, Y. Totoki, and T. Matsuo

responses remains robust to the added noise in the HR model [7]. The parameters in the model equations are important in deciding the dynamic behaviors of the neuron [12]. From the measurement-theoretic point of view, it is important to estimate the states and parameters using measurement data, because extracellular recordings are a common practice in neurophysiology and often represent the only way to measure the electrical activity of neurons [8]. Tokuda et al. applied an adaptive observer to estimate the parameters of the HR neuron using membrane potential data recorded from a single lateral pyloric neuron synaptically isolated from other neurons [13]. However, their observer cannot guarantee the asymptotic stability of the error system. Steur [14] pointed out that the HR equations cannot be transformed into the adaptive observer canonical form, so that it is not possible to make use of the adaptive observer proposed by Marino [10]. He simplified the three-dimensional HR equations and rewrote them as a one-dimensional system with an exogenous signal using the contracting and wandering dynamics technique. His adaptive observer, based on a first-order differential equation, cannot estimate the internal states of HR neurons. We have recently presented adaptive observers with full-state measurement and with membrane potential measurement [15]. However, the estimates of the states by the observer with output measurement are not enough to recover the immeasurable internal states. In this paper, we present three adaptive observers with the membrane potential measurement, under the assumption that some of the parameters of the HR neuron are known. Using the Kalman-Yakubovich lemma, we can show the asymptotic stability of the error systems based on standard adaptive control theory [11]. The estimators allow us to recover the internal states and to distinguish the firing patterns from early-time dynamic behaviors. The MATLAB simulations demonstrate the estimation performance of the proposed adaptive observers.

2 Review of Real (Cortical) Neuron Responses

There are many types of cortical neuron responses. Izhikevich reviewed 20 of the most prominent features of biological spiking neurons, considering the injection of simple dc pulses[3]. Typical responses are classified as follows[4]:

– Tonic Spiking (TS): The neuron fires a spike train as long as the input current is on. This behavior can be observed in three types of cortical neurons: regular spiking excitatory neurons (RS), low-threshold spiking neurons (LTS), and fast spiking inhibitory neurons (FS).
– Phasic Spiking (PS): The neuron fires only a single spike at the onset of the input.
– Tonic Bursting (TB): The neuron fires periodic bursts of spikes when stimulated. This behavior may be found in chattering neurons in cat neocortex.
– Phasic Bursting (PB): The neuron fires only a single burst at the onset of the input.

Firing Pattern Estimation of Biological Neuron Models


– Mixed Mode (Bursting Then Spiking) (MM): The neuron fires a phasic burst at the onset of stimulation and then switches to the tonic spiking mode. The intrinsically bursting excitatory neurons in mammalian neocortex may exhibit this behavior.
– Spike Frequency Adaptation (SFA): The neuron fires tonic spikes with decreasing frequency. RS neurons usually exhibit adaptation of the interspike intervals, where these intervals increase until a steady state of periodic firing is reached, while FS neurons show no adaptation.

3 Single Model of HR Neuron

The Hindmarsh-Rose (HR) model is computationally simple and capable of producing the rich firing patterns exhibited by real biological neurons.

3.1 Dynamical Equations

The single HR neuron model[1,5,6] is given by

  ẋ = a x² − x³ − y − z + I
  ẏ = (a + α) x² − y
  ż = μ(b x + c − z)

where x represents the membrane potential, y and z are associated with the fast and slow currents, respectively, I is an applied current, and a, α, μ, b and c are constant parameters. We rewrite the single HR neuron in vectorized form:

  (S₀): ẇ = h(w) + Ξ(x, z)θ

where

  w = [x, y, z]ᵀ,  h(w) = [−(x³ + y + z), −y, 0]ᵀ,

  Ξ(x, z) = [ x²  1  0   0  0   0
               0  0  x²  0  0   0
               0  0  0   x  1  −z ],

  θ = [θ₁, θ₂, θ₃, θ₄, θ₅, θ₆]ᵀ = [a, I, a + α, μb, μc, μ]ᵀ.

3.2 Numerical Examples

The HR model shows a large variety of behaviors depending on the parameter values in the differential equations[12]. Thus, we can characterize the dynamic behaviors with respect to different values of the parameters. We focus on the parameters a and I. The parameter a is an internal parameter of the single neuron and I is an external depolarizing current. For fixed I = 0.05, the HR model shows tonic bursting for a ∈ [1.8, 2.85] and tonic spiking for a ≥ 2.9. On the other hand, for fixed a = 2.8, the HR model shows tonic bursting for I ∈ [0, 0.18] and tonic spiking for I ∈ [0.2, 5].
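As a concrete illustration of the two regimes above, the following sketch integrates the HR equations with the forward Euler method (a minimal Python illustration of our own; the integration scheme, step size and initial condition are assumptions, not taken from the paper):

```python
import numpy as np

def simulate_hr(a, I, alpha=1.6, b=9.0, c=5.0, mu=0.001,
                w0=(0.0, 0.0, 0.0), t_end=1000.0, dt=0.01):
    """Integrate the single HR model with the forward Euler method."""
    n = round(t_end / dt)
    x, y, z = w0
    xs = np.empty(n)
    for k in range(n):
        dx = a * x**2 - x**3 - y - z + I
        dy = (a + alpha) * x**2 - y
        dz = mu * (b * x + c - z)
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        xs[k] = x
    return xs

# Parameter sets from the text: a = 2.8 gives tonic bursting, a = 3.0 tonic spiking
x_burst = simulate_hr(a=2.8, I=0.05)
x_spike = simulate_hr(a=3.0, I=0.05)
```

Because μ is small, z evolves on a much slower time scale than x and y, which is what produces the alternation between quiescence and rapid spiking in the bursting regime.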

86

K. Mitsunaga, Y. Totoki, and T. Matsuo

Fig. 1. The response of x in the tonic bursting
Fig. 2. 3-D surface of x, y, z in the tonic bursting
Fig. 3. The response of x in the tonic spiking
Fig. 4. 3-D surface of x, y, z in the tonic spiking
Fig. 5. The response of x1 in the intrinsic bursting neuron
Fig. 6. 3-D surface of x, y, z in the intrinsic bursting neuron

The parameters of the HR model in the tonic bursting (TB) are given by a = 2.8, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. Figure 1 shows the response of x, and Figure 2 shows the three-dimensional surface of x, y, z. We call this neuron the intrinsic bursting neuron (IBN).


The parameters of the HR model in the tonic spiking (TS) are given by a = 3.0, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. The difference between the tonic bursting and the tonic spiking is only the value of the parameter a. Figure 3 shows the response of x, and Figure 4 shows the three-dimensional surface of x, y, z. We call this neuron the intrinsic spiking neuron (ISN). When the external current changes from I = 0.05 to I = 0.2, the IBN shows tonic spiking. Figure 5 shows the response of x, and Figure 6 shows the three-dimensional surface of x, y, z.

4 Synaptically Coupled Model of HR Neurons

4.1 Dynamical Equations

Consider the following two synaptically coupled HR neurons[1]:

  ẋ₁ = a₁x₁² − x₁³ − y₁ − z₁ − g_s(x₁ − V_s1)Γ(x₂)
  ẏ₁ = (a₁ + α₁)x₁² − y₁
  ż₁ = μ₁(b₁x₁ + c₁ − z₁)
  ẋ₂ = a₂x₂² − x₂³ − y₂ − z₂ − g_s(x₂ − V_s2)Γ(x₁)
  ẏ₂ = (a₂ + α₂)x₂² − y₂
  ż₂ = μ₂(b₂x₂ + c₂ − z₂)

where Γ(x) is the sigmoid function given by

  Γ(x) = 1 / (1 + exp(−λ(x − θ_s))).

4.2 Numerical Examples

Consider the IBN with a = 2.8 and the ISN with a = 10.8, whose other parameters are as follows: αᵢ = 1.6, cᵢ = 5, bᵢ = 9, μᵢ = 0.001, V_si = 2, θ_s = −0.25, λ = 10. Figures 7 and 8 show the responses of the membrane potentials in the coupling of the IBN and the ISN with the coupling strength gs = 0.05, respectively. Each neuron behaves as an intrinsic single neuron. As the coupling strength increases, however, the IBN shows a chaotic behavior. Figures 9 and 10 show the responses of the membrane potentials in the coupling of the IBN and the ISN with the coupling strength gs = 1, respectively. Figure 11 shows the response of the membrane potentials in the coupling of two identical IBNs with
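The coupled system above can be simulated in the same way as the single neuron. The sketch below (our own illustration; integration scheme, step size and initial states are assumptions) implements the sigmoid coupling Γ and one Euler step of the two-neuron system:

```python
import math

def gamma(x, lam=10.0, theta_s=-0.25):
    # Sigmoid synaptic activation: Gamma(x) = 1 / (1 + exp(-lam*(x - theta_s)))
    return 1.0 / (1.0 + math.exp(-lam * (x - theta_s)))

def coupled_hr_step(s1, s2, a1, a2, gs, dt,
                    alpha=1.6, b=9.0, c=5.0, mu=0.001, vs=2.0):
    """One forward-Euler step for two synaptically coupled HR neurons."""
    (x1, y1, z1), (x2, y2, z2) = s1, s2
    dx1 = a1 * x1**2 - x1**3 - y1 - z1 - gs * (x1 - vs) * gamma(x2)
    dy1 = (a1 + alpha) * x1**2 - y1
    dz1 = mu * (b * x1 + c - z1)
    dx2 = a2 * x2**2 - x2**3 - y2 - z2 - gs * (x2 - vs) * gamma(x1)
    dy2 = (a2 + alpha) * x2**2 - y2
    dz2 = mu * (b * x2 + c - z2)
    return ((x1 + dt * dx1, y1 + dt * dy1, z1 + dt * dz1),
            (x2 + dt * dx2, y2 + dt * dy2, z2 + dt * dz2))

# Two identical IBNs (a = 2.8) with weak coupling gs = 0.05
s1, s2 = (0.1, 0.0, 0.0), (-0.1, 0.0, 0.0)
for _ in range(20000):
    s1, s2 = coupled_hr_step(s1, s2, a1=2.8, a2=2.8, gs=0.05, dt=0.005)
```

Since x stays below V_s = 2 for these parameter values, the coupling term −gs(x − V_s)Γ(·) acts as an excitatory synaptic current gated by the presynaptic membrane potential.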

88

K. Mitsunaga, Y. Totoki, and T. Matsuo 2

12

1.5

10 8

1

6

x

x

0.5 4

0 2 −0.5

0

−1

−2

−1.5 0

200

400

600

800

−4 0

1000

200

400

time

600

800

1000

time

Fig. 7. The response of x1 of the IBN

Fig. 8. The response of x2 of the ISN

2

12

1.5

10 8

1

6

x

x

0.5 4

0 2 −0.5

0

−1 −1.5 0

−2

200

400

600

800

−4 0

1000

200

time

400

600

800

1000

time

Fig. 9. The response of x1 of the IBN

Fig. 10. The response of x2 of the ISN

2

2

1.5

1.5

1 1

x

x

0.5 0.5

0 0 −0.5 −0.5

−1 −1.5 0

200

400

600

time

800

1000

−1 0

200

400

600

800

1000

time

Fig. 11. The response of x1 of the IBN- Fig. 12. The response of x1 of the IBNIBN coupling with gs = 0.05 IBN coupling with gs = 1

the coupling strength gs = 0.05. In this case, the two IBNs synchronize as bursting neurons. Figure 12 shows the response of the membrane potentials in the coupling of two identical IBNs with the coupling strength gs = 1. In this case, the two IBNs synchronize as spiking neurons.

5 Adaptive Observer with Full States

We consider the parameter estimation problem of distinguishing the firing patterns by using early-time dynamic behaviors. In this section, assuming that the full states are measurable, we present an adaptive observer to estimate all parameters in the single HR neuron.

5.1 Construction of Adaptive Observer

We present an adaptive observer as

  (O₀): dŵ/dt = W(ŵ − w) + h(w) + Ξθ̂

where ŵ = [x̂, ŷ, ẑ]ᵀ is an estimate of the states, θ̂ is an estimate of the unknown parameters, and W is selected as a stable matrix. Using standard adaptive control theory[11], the parameter update law is given by

  dθ̂/dt = ΓΞᵀP(w − ŵ),

where P is a positive definite solution of the following Lyapunov equation for a positive definite matrix Q:

  WᵀP + PW = −Q.

5.2 Numerical Examples

We show the simulation results for the single IBN case. The parameters in the tonic bursting are given by a = 2.8, α = 1.6, c = 5, b = 9, μ = 0.001, I = 0.05. The parameters of the adaptive observer are selected as W = −10I₃, Γ = diag{100, 50, 300}. Figure 13 shows the estimation behavior of a (solid line) and I (dotted line). The estimates â and Î converge to the true values of a and I.
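A minimal numerical sketch of this full-state observer follows (our own Python illustration, not the authors' code; the gain choices Γ = I₆ and Q = 20I₃, which give P = I₃ for W = −10I₃, as well as the step size and initial conditions, are assumptions):

```python
import numpy as np

def h(w):
    x, y, z = w
    return np.array([-(x**3 + y + z), -y, 0.0])

def Xi(x, z):
    return np.array([[x**2, 1.0, 0.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, x**2, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, x, 1.0, -z]])

# True parameters theta = [a, I, a+alpha, mu*b, mu*c, mu] for the IBN
theta = np.array([2.8, 0.05, 4.4, 0.009, 0.005, 0.001])

W = -10.0 * np.eye(3)   # stable observer matrix
P = np.eye(3)           # solves W'P + PW = -Q with Q = 20*I
Gamma = np.eye(6)       # adaptation gain (assumption)

dt, steps = 0.005, 40000
w = np.zeros(3)                     # plant state
w_hat = np.array([1.0, 1.0, 1.0])   # observer state
theta_hat = np.zeros(6)

err0 = np.linalg.norm(w_hat - w)
for _ in range(steps):
    Xi_w = Xi(w[0], w[2])
    w_dot = h(w) + Xi_w @ theta                               # plant (S0)
    w_hat_dot = W @ (w_hat - w) + h(w) + Xi_w @ theta_hat     # observer (O0)
    theta_hat_dot = Gamma @ Xi_w.T @ P @ (w - w_hat)          # update law
    w = w + dt * w_dot
    w_hat = w_hat + dt * w_hat_dot
    theta_hat = theta_hat + dt * theta_hat_dot

err_end = np.linalg.norm(w_hat - w)
```

The state estimation error shrinks because the error dynamics ė = We + Ξ(θ̂ − θ) are driven to zero by the stable W, while the bursting trajectory provides the excitation needed for the parameter estimates to adapt.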

6 Adaptive Observer with a Partial State

We assume that the membrane potential x is available, but the other states are immeasurable. In this case, we consider the following problems:
– Estimate y and z using the available signal x;
– Estimate the parameter a or I to distinguish the firing patterns by using early-time dynamic behaviors.

90

K. Mitsunaga, Y. Totoki, and T. Matsuo

6.1

Construction of Adaptive Observer

The parameters a and I are key parameters that determine the firing pattern. The HR model can be rewritten in the following three forms[9]:

  (S₁): ẇ = Aw + h₁(x) + b₁(x²a)   (1)
  (S₂): ẇ = Aw + h₂(x) + b₂(I)     (2)
  (S₃): ẇ = Aw + h₃(x) + b₂(θᵀξ)   (3)

where

  A = [ 0   −1  −1
        0   −1   0
        μb   0  −μ ],

  h₁(x) = [−x³ + I, αx², μc]ᵀ,  b₁ = [1, 1, 0]ᵀ,  b₂ = [1, 0, 0]ᵀ,
  h₂(x) = [−x³ + ax², (a + α)x², μc]ᵀ,
  h₃(x) = [−x³, δx², μc]ᵀ,  θ = [a, I]ᵀ,  ξ = [x², 1]ᵀ.

In (S₁) and (S₂), the unknown parameters are assumed to be a and I, respectively. In (S₃), we assume that the parameter δ = a + α is known, and a and I are unknown. Since the measurable signal is x, the output equation is given by

  x = cw = [1 0 0]w.

We present adaptive observers that estimate the parameter for each system (Sᵢ), i = 1, 2, 3, as follows:

  (O₁): dŵ₁/dt = Aŵ₁ + h₁(x) + b₁(x²â) + g(x − x̂)    (4)
  (O₂): dŵ₂/dt = Aŵ₂ + h₂(x) + b₂(Î) + g(x − x̂)      (5)
  (O₃): dŵ₃/dt = Aŵ₃ + h₃(x) + b₂(θ̂ᵀξ) + g(x − x̂)   (6)

where g is selected such that A − gc is stable. Since (A, b₁, c) and (A, b₂, c) are strictly positive real, the parameter estimation laws are given as

  dâ/dt = γ₁x²(x − x̂),  dÎ/dt = γ₂(x − x̂).   (7)

Using the Kalman-Yakubovich (KY) lemma, we can show the asymptotic stability of the error system based on standard adaptive control theory[11].

6.2 Numerical Examples

We show the simulation results for the single IBN case. The parameters are the same as in the previous simulation. Figures 14 and 15 show the parameters estimated by the adaptive observers (O₁) and (O₂), respectively. Figures 16 and 17 show the responses of y (solid line) and its estimate ŷ (dotted line) by the adaptive observer (O₁) for t ≤ 500 and for t ≤ 20, respectively. Figure 18 shows the responses of z (solid line) and its estimate ẑ (dotted line) by the adaptive observer (O₁). The simulation results of the other cases are omitted. The states and parameters can be asymptotically estimated.
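The output-only observer (O₁) can be sketched as follows (our own Python illustration; the output-injection gain g, the adaptation gain γ₁, the step size and the initial conditions are assumptions chosen so that A − gc is stable):

```python
import numpy as np

# HR parameters (a = 2.8 is the unknown to be estimated; the rest are known)
a_true, alpha, b, c, mu, I = 2.8, 1.6, 9.0, 5.0, 0.001, 0.05

A = np.array([[0.0, -1.0, -1.0],
              [0.0, -1.0,  0.0],
              [mu * b, 0.0, -mu]])
b1 = np.array([1.0, 1.0, 0.0])
g = np.array([10.0, 0.0, 0.0])   # output-injection gain, A - g c stable (assumption)

def h1(x):
    return np.array([-x**3 + I, alpha * x**2, mu * c])

dt, steps, gamma1 = 0.005, 40000, 5.0
w = np.array([0.1, 0.0, 0.0])    # plant state [x, y, z]
w_hat = np.zeros(3)              # observer state
a_hat = 0.0

for _ in range(steps):
    x, x_hat = w[0], w_hat[0]
    w_dot = A @ w + h1(x) + b1 * (x**2 * a_true)                    # plant (S1)
    w_hat_dot = A @ w_hat + h1(x) + b1 * (x**2 * a_hat) + g * (x - x_hat)  # (O1)
    a_hat += dt * gamma1 * x**2 * (x - x_hat)                       # Eq. (7)
    w, w_hat = w + dt * w_dot, w_hat + dt * w_hat_dot
```

Note that only the measured output x enters the observer and the update law, in line with the partial-state assumption of this section.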

Fig. 13. â (solid line) and Î (dotted line) in the adaptive observer (O₀) with full states
Fig. 14. â (solid line) in the adaptive observer (O₁) with x
Fig. 15. Î (solid line) in the adaptive observer (O₂) with x
Fig. 16. y (solid line) and ŷ (dotted line) in the adaptive observer (O₁) (t ≤ 500)
Fig. 17. y (solid line) and ŷ (dotted line) in the adaptive observer (O₁) (t ≤ 20)
Fig. 18. z (solid line) and ẑ (dotted line) in the adaptive observer (O₁) (t ≤ 20)

7 Conclusion

We presented estimators of the parameters of the HR model using the adaptive observer technique with output measurement data such as the membrane potential. The proposed observers allow us to distinguish the firing pattern at an early time and to recover the immeasurable internal states.

References

1. Belykh, I., de Lange, E., Hasler, M.: Synchronization of Bursting Neurons: What Matters in the Network Topology. Phys. Rev. Lett. 94, 188101 (2005)
2. Izhikevich, E.M.: Simple Model of Spiking Neurons. IEEE Trans. on Neural Networks 14(6), 1569–1572 (2003)
3. Izhikevich, E.M.: Which model to use for cortical spiking neurons? IEEE Trans. on Neural Networks 15(5), 1063–1070 (2004)
4. Watts, L.: A Tour of NeuraLOG and Spike - Tools for Simulating Networks of Spiking Neurons (1993), http://www.lloydwatts.com/SpikeBrochure.pdf
5. Hindmarsh, J.L., Rose, R.M.: A model of the nerve impulse using two first order differential equations. Nature 296, 162–164 (1982)
6. Hindmarsh, J.L., Rose, R.M.: A model of neuronal bursting using three coupled first order differential equations. Proc. R. Soc. Lond. B 221, 87–102 (1984)
7. Carroll, T.L.: Chaotic systems that are robust to added noise. Chaos 15, 013901 (2005)
8. Meunier, N., Marion-Poll, R., Lansky, P., Rospars, J.O.: Estimation of the Individual Firing Frequencies of Two Neurons Recorded with a Single Electrode. Chem. Senses 28, 671–679 (2003)
9. Yu, H., Liu, Y.: Chaotic synchronization based on stability criterion of linear systems. Physics Letters A 314, 292–298 (2003)
10. Marino, R.: Adaptive Observers for Single Output Nonlinear Systems. IEEE Trans. on Automatic Control 35(9), 1054–1058 (1990)
11. Narendra, K.S., Annaswamy, A.M.: Stable Adaptive Systems. Prentice Hall Inc., Englewood Cliffs (1989)
12. Arena, P., Fortuna, L., Frasca, M., Rosa, M.L.: Locally active Hindmarsh-Rose neurons. Chaos, Solitons and Fractals 27, 405–412 (2006)
13. Tokuda, I., Parlitz, U., Illing, L., Kennel, M., Abarbanel, H.: Parameter estimation for neuron models. In: Proc. of the 7th Experimental Chaos Conference (2002), http://www.physik3.gwdg.de/~ulli/pdf/TPIKA02_pre.pdf
14. Steur, E.: Parameter Estimation in Hindmarsh-Rose Neurons (2006), http://alexandria.tue.nl/repository/books/626834.pdf
15. Fujikawa, H., Mitsunaga, K., Suemitsu, H., Matsuo, T.: Parameter Estimation of Biological Neuron Models with Bursting and Spiking. In: Proc. of SICE-ICASE International Joint Conference 2006, CD-ROM, pp. 4487–4492 (2006)

Thouless-Anderson-Palmer Equation for Associative Memory Neural Network Models with Fluctuating Couplings

Akihisa Ichiki and Masatoshi Shiino

Department of Applied Physics, Faculty of Science, Tokyo Institute of Technology, 2-12-2 Ohokayama, Meguro-ku, Tokyo, Japan

Abstract. We derive Thouless-Anderson-Palmer (TAP) equations and order parameter equations for stochastic analog neural network models with fluctuating synaptic couplings. Such systems with a finite number of neurons originally have no energy concept and thus defy the use of the replica method or the cavity method, which require one. However, for some realizations of the synaptic noise, the systems have an effective Hamiltonian, and the cavity method becomes applicable to derive the TAP equations.

1 Introduction

The replica method [1] for random spin systems has been successfully employed in neural network models of associative memory to obtain the order parameters and the storage capacity [2], and the cavity method [3] has been employed to derive the Thouless-Anderson-Palmer (TAP) equations [4,5]. However, these techniques require the energy concept. On the other hand, various types of neural network models that have no energy concept, such as networks with temporally fluctuating synaptic couplings, may exist. The alternative approach to the replica method for deriving the order parameter equations, called the self-consistent signal-to-noise analysis (SCSNA), is closely related to the cavity concept in the case where networks have a free energy [6,7]. An advantage of applying the SCSNA to neural networks is that the energy concept is not required to derive the order parameter equations once the TAP equations are obtained. The SCSNA, which was originally proposed for deriving a set of order parameter equations for deterministic analog neural networks, becomes applicable to stochastic networks by noting that the TAP equations define deterministic networks. Furthermore, the coefficients of the Onsager reaction terms characteristic of the TAP equations, which determine the form of the transfer functions in analog networks, are self-consistently obtained through the concept shared by the cavity method and the SCSNA. Thus the TAP equations as well as the order parameter equations are derived self-consistently by the hybrid use of the cavity method and the SCSNA in the case where the energy concept exists. However, the networks with synaptic noise, which have no energy concept, defy the use of the cavity method to derive the TAP equations. On the other hand, as in [8], a network with a specific
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 93–101, 2008. c Springer-Verlag Berlin Heidelberg 2008

94

A. Ichiki and M. Shiino

involvement of synaptic noise can be analyzed by the cavity method to derive the TAP equations in the thermodynamic limit, since the energy concept appears as an effective Hamiltonian in this limit. It is natural to consider such neural network models with fluctuating synaptic couplings, since the synaptic couplings in real biological systems are updated by learning rules and the time sequence of the synaptic couplings may be stochastic under the influence of noisy external stimuli. Thus the study of such networks is required to understand the retrieval process of realistic networks. The aim of this paper is two-fold: (i) we will investigate which realizations of synaptic noise give the networks an energy concept, so that the cavity method can be applied to derive the TAP equations; (ii) we will show the TAP equations for the networks when the concept of the effective Hamiltonian appears. This paper is organized as follows: in the next section, we briefly review how the energy concept appears in a network with synaptic noise and derive the TAP equations and the order parameter equations by using the cavity method and the SCSNA [8]. Once the effective Hamiltonian is found, the replica method is also applicable to derive the order parameter equations. However, in the present paper, to make clear the relationship between the TAP equations and the order parameter equations, we do not use the replica trick. In section 3, we will investigate the cases where the energy concept appears in networks with synaptic noise. We will see that the TAP equations and the order parameter equations for some models can be derived in the framework mentioned in section 2. We will also mention that some difficulties in deriving the TAP equations arise in models with other involvements of synaptic noise. In the last section, we will discuss the structure of the TAP equations for the network with temporally fluctuating synaptic noise and conclude this paper.

2 Brief Review on Effective Hamiltonian, TAP Equations and Order Parameter Equations

In this section, we briefly review how the cavity method becomes applicable to a network with fluctuating synaptic couplings [8]. Then we derive the TAP equations and the order parameter equations self-consistently in the framework of the SCSNA. We deal with the following stochastic analog neural network of N neurons with temporally fluctuating synaptic noise (multiplicative noise):

  ẋᵢ = −φ′(xᵢ) + Σ_{j(≠i)} Jᵢⱼ(t)xⱼ + ηᵢ(t),   (1)

  ⟨ηᵢ(t)ηⱼ(t′)⟩ = 2Dδᵢⱼδ(t − t′),   (2)

where xᵢ (i = 1, …, N) represents the state of the neuron at site i, taking a continuous value, φ(xᵢ) is a potential of an arbitrary form which determines the probability distribution of xᵢ in the case without the input Σ_{j(≠i)} Jᵢⱼxⱼ, ηᵢ is the Langevin white noise with noise intensity 2D, and Jᵢⱼ(t) is the synaptic coupling.
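A direct numerical realization of this Langevin network with fluctuating Hebbian couplings can be sketched as follows (our own Python illustration; the double-well potential φ(x) = x⁴/4 − x²/2, the noise intensities and the Euler-Maruyama discretization are assumptions for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3                  # network size and number of stored patterns
D, D_tilde = 0.1, 0.05         # intensities of eta and of the synaptic noise
dt, steps = 0.01, 2000

xi = rng.choice([-1.0, 1.0], size=(p, N))
J_bar = (xi.T @ xi) / N        # Hebb learning rule
np.fill_diagonal(J_bar, 0.0)

def phi_prime(x):
    # Derivative of a double-well potential phi(x) = x**4/4 - x**2/2 (assumption)
    return x**3 - x

x = 0.1 * rng.standard_normal(N)
for _ in range(steps):
    # Fluctuating couplings J_ij(t) = J_bar_ij + eps_ij(t)
    eps = np.sqrt(2 * D_tilde / (N * dt)) * rng.standard_normal((N, N))
    eta = np.sqrt(2 * D / dt) * rng.standard_normal(N)
    J = J_bar + eps
    np.fill_diagonal(J, 0.0)
    x = x + dt * (-phi_prime(x) + J @ x + eta)

q_hat = np.mean(x**2)          # empirical order parameter
```

The 1/√dt scaling of the noise samples is the usual Euler-Maruyama discretization of white noise with the delta-correlated intensities written above.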

TAP Equation for Associative Memory Neural Network Models


We note here that, in the case of an associative memory neural network, the synaptic coupling Jᵢⱼ is usually defined by the well-known Hebb learning rule. In the present paper, however, we deal with a coupling Jᵢⱼ fluctuating around the Hebb rule with a white noise:

  Jᵢⱼ(t) = J̄ᵢⱼ + εᵢⱼ(t),   (3)

  ⟨εᵢⱼ(t)εₖₗ(t′)⟩ = (2D̃/N)δᵢₖδⱼₗδ(t − t′),   (4)

where J̄ᵢⱼ is defined by the usual Hebb learning rule J̄ᵢⱼ ≡ (1/N)Σ_{μ=1}^{p} ξᵢ^μ ξⱼ^μ, with p = αN the number of patterns embedded in the network, ξᵢ^μ = ±1 the μth embedded pattern at neuron i, and εᵢⱼ(t) the synaptic noise independent of ηᵢ(t), which we assume in the present model to be a white noise with intensity 2D̃/N.
Using the Ito integral, we obtain the Fokker-Planck equation corresponding to the Langevin equation (1) as

  ∂P(t, x)/∂t = −Σ_{i=1}^{N} (∂/∂xᵢ){−φ′(xᵢ) + Σ_{j(≠i)} J̄ᵢⱼxⱼ − (D + D̃q̂)(∂/∂xᵢ)}P(t, x),   (5)

where q̂ ≡ (1/N)Σ_{j(≠i)} xⱼ². Since the self-averaging property holds in the thermodynamic limit N → ∞, q̂ is identified as

  q̂ = (1/N)Σ_{i=1}^{N} xᵢ².   (6)

The order parameter q̂ is obtained self-consistently in our framework, as seen below. Supposing q̂ is given, one can easily find the equilibrium probability density for the Fokker-Planck equation (5) as

  P_N(x) = Z⁻¹ exp[−β_eff(Σ_{i=1}^{N} φ(xᵢ) − Σ_{i<j} J̄ᵢⱼxᵢxⱼ)],   (7)

where Z denotes the normalization constant and

  β_eff⁻¹ ≡ D + D̃q̂   (8)

plays the role of the effective temperature of the network. The temperature of the system is modified as a consequence of the multiplicative noise and depends on the order parameter q̂. Notice here that the equilibrium distribution of the system becomes a Gibbs distribution in the thermodynamic limit N → ∞. The equilibrium solution of equation (5) in the finite N-body system indeed differs from the probability density (7). However, since (1/N)Σ_{j(≠i)} xⱼ² − (1/N)Σⱼ xⱼ² = O(1/N), the difference between the probability densities in the finite N-body system P_N and in the thermodynamic limit P_{N→∞} is P_{N→∞} − P_N = O(1/√N).
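The effective noise intensity in Eq. (8) can be checked directly by summing the variance contributed by the multiplicative coupling noise over the inputs to neuron i (a short consistency check using only Eqs. (2), (4) and (6)):

```latex
\Big\langle \sum_{j(\neq i)} \epsilon_{ij}(t)\,x_j \sum_{k(\neq i)} \epsilon_{ik}(t')\,x_k \Big\rangle
 = \frac{2\tilde{D}}{N} \sum_{j(\neq i)} x_j^{2}\,\delta(t-t')
 = 2\tilde{D}\hat{q}\,\delta(t-t'),
```

so the total white-noise intensity acting on xᵢ is 2D + 2D̃q̂, which is exactly the diffusion coefficient D + D̃q̂ appearing in the Fokker-Planck equation (5) and the inverse effective temperature β_eff⁻¹ = D + D̃q̂ of Eq. (8).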

96

A. Ichiki and M. Shiino

Thus one can conclude that the equilibrium density for equation (5) converges to the probability density (7) in the thermodynamic limit N → ∞. Since we have explicitly written down the equilibrium probability density (7) in Gibbsian form, one can define the effective Hamiltonian of the (sufficiently large) N-body system as

  H_N ≡ Σ_{i=1}^{N} φ(xᵢ) − Σ_{i<j} J̄ᵢⱼxᵢxⱼ.   (9)

Since we have found the effective Hamiltonian and the effective temperature, one can apply the usual cavity method [3] to this system and derive the TAP equation. According to the cavity method, we divide the Hamiltonian of the N-body system (9) into that of the (N−1)-body system and the part involving the state of the ith neuron as H_N = φ(xᵢ) − hᵢxᵢ + H_{N−1}, where hᵢ ≡ Σ_{j(≠i)} J̄ᵢⱼxⱼ is the local field at site i and H_{N−1} ≡ Σ_{j(≠i)} φ(xⱼ) − Σ_{j<k(≠i)} J̄ⱼₖxⱼxₖ is the Hamiltonian of the (N−1)-body system.

(u > 0) and x = P for the potentiation part (u < 0), cf. [2, Sec. 5] for details and the parameter values.

structures) and it can be linked to the variability observed in physiological data; the other is due to the additional impact of the background activity upon the generation of the spike output of each neuron (intrinsic stochasticity modelled by the Poisson process).

3 Theoretical Analysis

3.1 Characterisation of the Neural Activity

We consider a network of N Poisson neurons (referred to as internal neurons) stimulated with M Poisson pulse trains (or external inputs), as shown in Fig. 2. The activity of the neural network can be described using the firing rates and the pairwise correlations, plus the weights. In the terminology of the theory of dynamical systems, these variables characterise the "state" of the network activity at each time t and their evolution is referred to as the neural dynamics. Similar to [7,2], we consider the time-averaged firing rates νᵢ(t) (for the internal neuron indexed by i; T is a given time period)

  νᵢ(t) ≜ (1/T)∫_{t−T}^{t} ⟨Sᵢ(t′)⟩ dt′   (3)

where Sᵢ(t) is the spike-time series of the ith neuron and the brackets ⟨…⟩ denote ensemble averaging. Likewise for the correlation coefficients (the time-averaged correlation function convolved with the STDP window function W): D^W_ik(t) between the ith internal neuron and the kth external input,

  D^W_ik(t) ≜ (1/T)∫_{t−T}^{t}∫_{−∞}^{+∞} W(u)⟨Sᵢ(t′)Ŝₖ(t′ + u)⟩ dt′ du   (4)

and Q^W_ij(t) between the ith and jth internal neurons (with Sᵢ and Sⱼ) [2, Sec. 3].

Spike-Timing Dependent Plasticity in Recurrently Connected Networks


Fig. 2. Presentation of the network and the notation. The internal neurons (within the network) are indexed by i ∈ [1..N ], and their output pulse trains are denoted Si (t) (it can be understood as a sum of “Dirac functions” at each spiking time [2, Sec. 2]). Likewise for the external input pulse trains Sˆk (t) (k ∈ [1..M ]). The time-averaged ﬁring rates are denoted by νi (t), cf. Eq. 3; the correlation coeﬃcients within the network by Qij (t) (resp. Dik (t) between a neuron in the network and an external input; and ˆ kl (t) between two external inputs), cf. Eq. 4. The weight of the connection from the Q kth external input onto the ith internal neuron is denoted Kik (t) (resp. Jij (t) from the j th internal neuron onto the ith internal neuron).

The stimulation parameters (which represent the "information" carried by the external inputs) are determined by the time-averaged input spiking rates (ν̂ₖ(t), defined similarly to νᵢ(t)) and their correlation coefficients (Q̂^W_kl(t), defined similarly to D^W_ik(t) and Q^W_ij(t)) [2, Sec. 3]. In this paper, we only consider stimulation parameters that are constant in time.

3.2 Learning Equations

Learning equations can be derived from Eq. 2 as in Kempter et al. [7]. This requires the assumption that the internal pulse trains are statistically independent (which can be considered valid when the number of recurrent connections is large enough) and a small learning rate η. This leads to the matrix equation Eq. 10 for the weights between internal neurons (resp. Eq. 9 for the input weights K).

3.3 Activation Dynamics

In order to study the evolution of the weights described by Eqs. 9 and 10, we need to evaluate the time-averaged firing rates of the neurons (the vector ν(t)) and their time-averaged correlation coefficients (the matrices D^W(t) and Q^W(t)). Similar to Kempter et al. [7] and Burkitt et al. [2], we approximate the instantaneous firing rate Sᵢ(t) of the ith Poisson neuron by its expected inhomogeneous Poisson parameter ρ(t) (cf. Eq. 1) and we neglect the impact of the short-time dynamics (the synaptic response kernel and the synaptic delays d̂ᵢₖ and dᵢⱼ) by using time-averaged variables (over a "long" period T). We require that the

106

M. Gilson et al.

learning occurs slowly compared to the activation mechanisms (cf. the neuron and synapse models), so that T is large compared to the time scale of these mechanisms but small compared to the inverse of the learning rate η⁻¹. This leads to the consistency matrix equations of the firing rates (Eq. 6) and of the correlation coefficients (Eqs. 7 and 8). See Burkitt et al. [2, Sec. 3] for details of the derivation.
Note that the consistency equations of the correlation coefficients as defined in [2, Sec. 3 and 4] have been reformulated to express the usual covariance, using the assumption that the correlations are quasi-constant in time [5] (this implies that Q̂^VT [2, Sec. 3] is actually equal to Q̂^W). Eqs. 7 and 8 express the impact of the connectivity (through the term [I − J]⁻¹K) on the internal firing rates and the cross covariances in terms of the input covariance

  Ĉ^W ≜ Q̂^W − W̃ ν̂ν̂ᵀ,   (5)

where W̃ ≜ ∫ W(u) du evaluates the balance between the potentiation and the depression of our STDP rule. These equations are obtained by combining the equations in [2] with the firing-rate consistency equation Eq. 6.

3.4 Network Dynamical System

Putting everything together, the network dynamics is described by

  ν = [I − J]⁻¹(ν₀E + Kν̂)   (6)

  D^W − W̃ νν̂ᵀ = [I − J]⁻¹K(Q̂^W − W̃ ν̂ν̂ᵀ)   (7)

  Q^W − W̃ ννᵀ = [I − J]⁻¹K(Q̂^W − W̃ ν̂ν̂ᵀ)Kᵀ[I − J]⁻¹ᵀ   (8)

  dK/dt = Φ_K(w^in Eν̂ᵀ + w^out νÊᵀ + D^W)   (9)

  dJ/dt = Φ_J(w^in Eνᵀ + w^out νEᵀ + Q^W),   (10)

where E is the unit vector of N elements (likewise for Ê with M elements); Φ_J is a projector onto the space of N × N matrices to which J belongs [2, Sec. 3]. The effect of Φ_J is to nullify the coefficients corresponding to missing connections in the network, viz. all the diagonal terms, because the self-connection of a neuron onto itself is forbidden. More generally, such a projection operator can account for any network connectivity [2, Sec. 3]. Note that time has been rescaled in order to remove η from these equations, and for simplicity of notation the dependence on time t will be omitted in the rest of this paper. The matrix [I − J(t)] is assumed invertible at all times (the contrary would imply diverging firing rates [2, Sec. 4]).
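As a sanity check on the firing-rate consistency equation Eq. 6, the fixed rates for a given connectivity can be computed directly (a minimal Python sketch; the network size and weight values are arbitrary illustrative choices of our own, not from the paper):

```python
import numpy as np

# Hypothetical small network: N = 4 internal neurons, M = 3 external inputs
N, M = 4, 3
rng = np.random.default_rng(0)
nu0 = 5.0                      # spontaneous rate (Hz)
nu_hat = np.full(M, 20.0)      # external input rates (Hz)
J = rng.uniform(0.0, 0.04, (N, N))
np.fill_diagonal(J, 0.0)       # no self-connections (effect of the projector Phi_J)
K = rng.uniform(0.0, 0.05, (N, M))

# Consistency equation (6): nu = [I - J]^(-1) (nu0 * E + K @ nu_hat)
E = np.ones(N)
nu = np.linalg.solve(np.eye(N) - J, nu0 * E + K @ nu_hat)
```

Using `solve` rather than explicitly inverting [I − J] is the standard numerically stable way to evaluate this expression.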

4 Recurrent Network with Fixed Input Weights

We now examine the case of a recurrently connected network with ﬁxed input weights K and learning on the recurrent weights J. Only Eqs. 6, 8 and 10 remain,


which simplifies the analysis. In the case of full recurrent connectivity, Φ_J in Eq. 10 only nullifies the diagonal terms of the square matrix in its argument.

4.1 Analytical Predictions

Homeostatic equilibrium. Similarly to [7,2], we derive the scalar equations for the mean firing rate ν_av ≜ N⁻¹Σᵢνᵢ and the mean weight J_av ≜ (N(N − 1))⁻¹Σ_{i≠j}Jᵢⱼ. This consists of neglecting the inhomogeneities of the firing rates and of the weights over the network, as well as of the connectivity, and we obtain

  ν_av = [ν₀ + (Kν̂)_av] / [1 − (N − 1)J_av],
  J̇_av = (w^in + w^out)ν_av + (W̃ + C̃)ν_av²,   (11)

where (Kν̂)_av denotes the mean of the matrix product Kν̂ and C̃ is defined as

  C̃ ≜ (KĈ^W Kᵀ)_av / [ν₀ + (Kν̂)_av]².   (12)

Eqs. 11 are thus the same equations as for the case with no input [2, Sec. 5], with ν₀ and W̃ replaced by, resp., ν₀ + (Kν̂)_av and W̃ + C̃. This qualitatively implies the same dynamical behaviour and, provided that W̃ + C̃ < 0, the network exhibits a homeostatic equilibrium (when it is realisable; it requires in particular μ > 0), and the means of the firing rates and of the weights converge towards

  ν*_av = μ ≜ −(w^in + w^out)/(W̃ + C̃),
  J*_av = [μ − ν₀ − (Kν̂)_av] / [(N − 1)μ].   (13)

If the input correlations are time-invariant, positive (i.e. the inputs are more likely to fire in a synchronous way) and homogeneous among the correlated input pool, it follows that C̃ is of the same sign as W̃. Therefore, the condition for stability reverts to W̃ < 0 as in [7,2].

Structural dynamics. For the particular case of uncorrelated inputs (or, more generally, when KĈ^W Kᵀ = 0), the fixed-point structure of the dynamical system is qualitatively the same as for the case of no external input [2, Sec. 5]: a homogeneous distribution of the firing rates and a continuous manifold of fixed points for the internal weights. In the case of "spatial" inhomogeneities over the input correlations, the network dynamics shows a different evolution. To illustrate this and to compare it with the case of feed-forward connections, we consider the network configuration described in Fig. 3, inspired by [7], where one input pool has correlated sources while the other pool has uncorrelated sources. In general, the equations that determine the fixed points have no exact solution. Yet, we can reduce the dimensionality by neglecting the variance within each subpool and make approximations to evaluate the asymptotic distribution of the firing rates, which turns out to be bimodal.

Fig. 3. Architecture of the simulated network. The inputs are divided into two subpools, each feeding half of the internal network, with means K1 and K2 for the input weights from each subpool. Similarly to the recurrent weights J and the firing rates ν̂ and ν, the inhomogeneities are neglected within each subpool and they are assumed to be all identical to their mean. The weights and delays are initially set with 10% random variation around a mean.

4.2 Simulation Protocol and Results

We simulated a network of Poisson neurons as described in Fig. 3 with random initial recurrent weights (uniformly distributed in a given interval, as are all the synaptic delays). An issue with such simulations is to maintain positive internal weights during their homeostatic convergence, because they individually diverge. Thus, not all equilibria are realisable, depending on the initial distributions of K and J and the weight bounds [7,2]. See [2, Sec. 5] for details about the simulation parameters. An interesting first case to test the analytical predictions consists of two pools of uncorrelated inputs, each feeding half of the network with distinct weights (K₁ν̂₁ ≠ K₂ν̂₂ and Ĉ^W = 0). Each half of the internal network thus has distinct firing rates initially and, as predicted, the outcome is a convergence of these firing rates towards a uniform value (similar to the case of no external input [2, Sec. 5]). In the case of a fully connected network (for both the K and the J, according to Fig. 3) stimulated by one correlated input pool (short-time correlation inspired by [4], so that Ĉ^W ≠ 0) and one uncorrelated pool, both the internal firing rates and the internal weights also exhibit a homeostatic equilibrium. As shown in Fig. 4, the means over the network (thick solid lines) converge towards the predicted equilibrium values (dashed lines). Furthermore, the individual firing rates tend to stabilise and their distribution remains bimodal (the subpool #1, excited by correlated inputs, eventually fires at a lower rate when W̃ < 0). The recurrent weights individually diverge, similarly to [2, Sec. 5], and reorganise so that the outgoing weights from the subpool #1 (see the means over each weight subpool

Spike-Timing Dependent Plasticity in Recurrently Connected Networks 25


Fig. 4. Evolution of the firing rates (left) and of the recurrent weights (right) for N = 30 fully-connected Poisson neurons (cf. Fig. 3) with short-time correlated inputs. The outcome is a quasi-bimodal distribution of the firing rates (the grey bundle, the mean is the thick solid line) around the predicted homeostatic equilibrium (dashed line). The subgroup #1 that receives correlated inputs is more excited initially but fires at a lower rate at the end of the simulation (cf. the two thin black solid lines which represent the mean over each subpool). The internal weights individually diverge, while their mean (thick line) converges towards the predicted equilibrium value (dashed line). They reorganise themselves so that the weights outgoing from the subpool #2 (that receives uncorrelated inputs) become silent, while the ones from #1 are strengthened. Note that the homeostatic equilibrium is preserved even when some weights saturate.

Fig. 5. Evolution of the firing rates for a partially connected network of N = 75 neurons. Both the K and the J have 40% probability of connection, with the same setup as the network in Fig. 4 (to preserve the total input synaptic strength). The mean firing rate (very thick line) still converges towards the predicted equilibrium value (dashed line) and the two subgroups (grey bundles, each mean represented by a thin black solid line) become separated, similarly to the case of full connectivity. The internal weights (not shown) exhibit similar dynamics to the case of full connectivity.

J11 and J21 in Fig. 4) are strengthened while the other ones almost become silent (see J12 and J22 ). In other words, the subpool that receives correlated inputs takes the upper hand in the recurrent architecture.
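As a toy illustration of the weight behaviour described above — and not a reproduction of the authors' recurrently connected Poisson-network model — the following sketch applies additive pair-based STDP with exponential windows to two independent Poisson spike trains. The window integral Ŵ = A₊τ₊ − A₋τ₋ is chosen negative (the homeostatic regime discussed in the text), and an individual weight drifts under hard bounds, loosely mimicking the divergence-with-clipping behaviour of the simulations. All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative STDP window parameters (assumptions, not the paper's values)
A_plus, tau_plus = 0.005, 20e-3     # potentiation amplitude / time constant (s)
A_minus, tau_minus = 0.0053, 20e-3  # depression amplitude / time constant (s)
W_hat = A_plus * tau_plus - A_minus * tau_minus  # integral of the STDP window

def poisson_train(rate, T, dt):
    """Binary spike train of a Poisson neuron (rate in Hz)."""
    return rng.random(int(T / dt)) < rate * dt

def stdp_run(w0=0.5, rate=20.0, T=200.0, dt=1e-3, w_min=0.0, w_max=1.0):
    """Additive pair-based STDP between one pre and one post Poisson train."""
    pre, post = poisson_train(rate, T, dt), poisson_train(rate, T, dt)
    x_pre = x_post = 0.0   # exponential traces of recent pre/post spikes
    w = w0
    for s_pre, s_post in zip(pre, post):
        x_pre += -x_pre * dt / tau_plus + (A_plus if s_pre else 0.0)
        x_post += -x_post * dt / tau_minus + (A_minus if s_post else 0.0)
        if s_post:
            w += x_pre      # pre-before-post pairing: potentiate
        if s_pre:
            w -= x_post     # post-before-pre pairing: depress
        w = min(max(w, w_min), w_max)  # hard weight bounds, as in the simulations
    return w

final_w = stdp_run()
print(W_hat, final_w)  # W_hat < 0; the weight drifts but stays within bounds
```

This caricature only shows a single synapse under bounded additive STDP; the full interplay between the recurrent architecture and the input correlation structure is what Sections 4.1–4.2 analyse.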


M. Gilson et al.


Fig. 6. Simulation of networks of IF neurons with partial random connectivity of 50%. The network qualitatively exhibits the expected behaviour in the case of uncorrelated inputs (left) and inputs with one correlated pool and one uncorrelated pool (right).

In the case of partial connectivity for both the K and the J, the behaviour of the individual firing rates (cf. Fig. 5) still follows the predictions, but they are more dispersed, their convergence is slower, and the bimodal distribution is not always observed as clearly as in the case of full connectivity (in Fig. 5 the means of the two internal neuron subpools nevertheless remain clearly separated). The homeostatic equilibrium of the internal weights also holds, and they individually diverge. The partial connectivity needs to be rich enough for the predictions for the mean variables to be accurate. First results with IF neurons confirm that the analytical predictions remain valid even if the activation mechanisms are more complex (here with a connectivity of 50%, cf. Fig. 6).

5 Discussion and Future Work

The analytical results presented here are preliminary, and further investigation is needed to gain a better understanding of the interplay between the input correlation structure and STDP. Nevertheless, our results illustrate two points: STDP induces stable activity in recurrent architectures similar to that for feed-forward ones (homeostatic regulation of the network activity under the condition Ŵ < 0); and the qualitative structure of the internal firing rates is mainly determined by the input correlation structure. Namely, a "poor" correlation structure (uncorrelated or delta-correlated inputs, so that Ĉ^W = 0) induces a homogenisation of the firing activity. Finally, partial connectivity impacts upon the structure of the internal firing rates, but such networks still exhibit behaviour similar to fully connected ones. Preliminary results involving more complex patterns of K Q̂^W K^T suggest a more complex interplay between the input correlation structure and the equilibrium distribution of the network firing rates and weights. Such cases are under investigation and may constitute more "interesting" dynamic behaviour of the network from a cognitive modelling point of view, namely through the


relationship between the attractors of the network activity and the input structure. The case of learning for both the input connections and the recurrent ones will also form part of a future study. Comparison with IF neurons suggests that the impact of the neuron activation mechanisms on the weight dynamics may not be significant. This can be linked to the separation of time scales between them in the case of slow learning. Note that with our approximations the IF neurons are assumed to be in a "linear input-output regime" (no bursting, for instance).

Acknowledgments. The authors thank Iven Mareels, Chris Trengove, Sean Byrnes and Hamish Meffin for useful discussions that led to significant improvements. MG is funded by two scholarships from The University of Melbourne and NICTA. ANB and DBG acknowledge funding from the Australian Research Council (ARC Discovery Projects #DP0453205 and #DP0664271) and The Bionic Ear Institute.

References

1. Bi, G.Q., Poo, M.M.: Synaptic modification by correlated activity: Hebb's postulate revisited. Annual Review of Neuroscience 24, 139–166 (2001)
2. Burkitt, A.N., Gilson, M., van Hemmen, J.L.: Spike-timing-dependent plasticity for neurons with recurrent connections. Biological Cybernetics 96, 533–546 (2007)
3. Gerstner, W., Kempter, R., van Hemmen, J.L., Wagner, H.: A neuronal learning rule for sub-millisecond temporal coding. Nature 383, 76–78 (1996)
4. Gütig, R., Aharonov, R., Rotter, S., Sompolinsky, H.: Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. Journal of Neuroscience 23, 3697–3714 (2003)
5. Hawkes, A.G.: Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society Series B 33, 438–443 (1971)
6. Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory. Wiley, Chichester (1949)
7. Kempter, R., Gerstner, W., van Hemmen, J.L.: Hebbian learning and spiking neurons. Physical Review E 59, 4498–4514 (1999)
8. van Rossum, M.C.W., Bi, G.Q., Turrigiano, G.G.: Stable Hebbian learning from spike timing-dependent plasticity. Journal of Neuroscience 20, 8812–8821 (2000)

A Comparative Study of Synchrony Measures for the Early Detection of Alzheimer's Disease Based on EEG

Justin Dauwels, François Vialatte, and Andrzej Cichocki

RIKEN Brain Science Institute, Saitama, Japan
[email protected], {fvialatte,cia}@brain.riken.jp

Abstract. It has repeatedly been reported in the medical literature that the EEG signals of Alzheimer's disease (AD) patients are less synchronous than those of age-matched control patients. This phenomenon, however, does not at present allow AD to be reliably predicted at an early stage, so-called mild cognitive impairment (MCI), due to the large variability among patients. In recent years, many novel techniques to quantify EEG synchrony have been developed; some of them are believed to be more sensitive to abnormalities in EEG synchrony than traditional measures such as the cross-correlation coefficient. In this paper, a wide variety of synchrony measures is investigated in the context of AD detection, including the cross-correlation coefficient, the mean-square and phase coherence functions, Granger causality, the recently proposed corr-entropy coefficient and two novel extensions, phase synchrony indices derived from the Hilbert transform and time-frequency maps, information-theoretic divergence measures in the time domain and time-frequency domain, state space based measures (in particular, non-linear interdependence measures and the S-estimator), and lastly, the recently proposed stochastic-event synchrony measures. For the data set at hand, only two synchrony measures are able to convincingly distinguish MCI patients from age-matched control patients (p < 0.005): Granger causality (in particular, the full-frequency directed transfer function) and stochastic event synchrony (in particular, the fraction of non-coincident activity). Combining those two measures with additional features may eventually yield a reliable diagnostic tool for MCI and AD.

1 Introduction

Many studies have shown that the EEG signals of AD patients are generally less coherent than those of age-matched control patients (see [1] for an in-depth review). It is noteworthy, however, that this effect is not always easily detectable: there tends to be a large variability among AD patients. This is especially the case for patients in the pre-symptomatic phase, commonly referred to as Mild Cognitive Impairment (MCI), during which neuronal degeneration occurs prior to the appearance of clinical symptoms. On the other hand, it is crucial to predict AD at

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 112–125, 2008. © Springer-Verlag Berlin Heidelberg 2008


an early stage: medication that aims at delaying the effects of AD (and hence at improving the quality of life of AD patients) is most effective if applied in the pre-symptomatic phase. In recent years, a large variety of measures has been proposed to quantify EEG synchrony (we refer to [2]–[5] for recent reviews of EEG synchrony measures); some of those measures are believed to be more sensitive to perturbations in EEG synchrony than classical indices such as the cross-correlation coefficient or the coherence function. In this paper, we systematically investigate the state of the art in measuring EEG synchrony, with a special focus on the detection of AD in its early stages. (A related study has been presented in [6,7] in the context of epilepsy.) We consider various synchrony measures, stemming from a wide spectrum of disciplines, such as physics, information theory, statistics, and signal processing. Our aim is to investigate which measures are the most suitable for detecting the effect of synchrony perturbations in MCI and AD patients; we also wish to better understand which aspects of synchrony are captured by the different measures, and how the measures are related to each other. This paper is structured as follows. In Section 2 we review the synchrony measures considered in this paper. In Section 3 those measures are applied to EEG data, in particular for the purpose of detecting MCI; we describe the EEG data set, elaborate on various implementation issues, and present our results. At the end of the paper, we briefly relate our results to earlier work, and speculate about the neurophysiological interpretation of our results.

2 Synchrony Measures

We briefly review the various families of synchrony measures investigated in this paper: the cross-correlation coefficient and its analogues in the frequency and time-frequency domains, Granger causality, phase synchrony, state space based synchrony, information-theoretic interdependence measures, and lastly, stochastic-event synchrony measures, which we developed in recent work.

2.1 Cross-Correlation Coefficient

The cross-correlation coefficient r is perhaps the most well-known measure of (linear) interdependence between two signals x and y. If x and y are not linearly correlated, r is close to zero; on the other hand, if both signals are identical, then r = 1 [8].

2.2 Coherence

The coherence function quantifies linear correlations in the frequency domain. One distinguishes the magnitude squared coherence function c(f) and the phase coherence function φ(f) [8].
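As a concrete illustration of the measures of Sections 2.1 and 2.2, the sketch below estimates r and the magnitude squared coherence c(f) for a synthetic signal pair. The sampling rate matches the EEG data analyzed later (200 Hz); the signals themselves and the Welch segment length are illustrative assumptions.

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(1)
fs = 200.0                      # sampling rate (Hz), as for the EEG data
t = np.arange(0, 20.0, 1 / fs)  # a 20 s segment

# Two illustrative signals sharing a 10 Hz component plus independent noise
common = np.sin(2 * np.pi * 10 * t)
x = common + 0.5 * rng.standard_normal(t.size)
y = common + 0.5 * rng.standard_normal(t.size)

# Cross-correlation coefficient r (Sec. 2.1): Pearson correlation at lag zero
r = np.corrcoef(x, y)[0, 1]

# Magnitude squared coherence c(f) (Sec. 2.2), Welch estimate
f, c = coherence(x, y, fs=fs, nperseg=512)
c_10hz = c[np.argmin(np.abs(f - 10.0))]  # coherence near the shared 10 Hz peak

print(r, c_10hz)  # r well below 1 (broadband noise), coherence high at 10 Hz
```

Note how the two measures differ: r is diluted by the broadband noise, while c(f) isolates the strong linear coupling at the shared frequency.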

2.3 Corr-Entropy Coefficient

The corr-entropy coefficient rE is a recently proposed [9] non-linear extension of the correlation coefficient r; it is close to zero if x and y are independent (which is stronger than being uncorrelated).

2.4 Coh-Entropy and Wav-Entropy Coefficient

One can define a non-linear magnitude squared coherence function, which we will refer to as the "coh-entropy" coefficient cE(f); it is an extension of the corr-entropy coefficient to the frequency domain. The corr-entropy coefficient rE can also be extended to the time-frequency domain, by replacing the signals x and y in the definition of rE by their time-frequency ("wavelet") transforms. In this paper, we use the complex Morlet wavelet, which is known to be well suited for EEG signals [10]. The resulting measure is called the "wav-entropy" coefficient wE(f). (To our knowledge, both cE(f) and wE(f) are novel.)

2.5 Granger Causality

Granger causality¹ refers to a family of synchrony measures that are derived from linear stochastic models of time series; like the linear interdependence measures above, they quantify to which extent different signals are linearly interdependent. Whereas the above linear interdependence measures are bivariate, i.e., they can only be applied to pairs of signals, Granger causality measures are multivariate: they can be applied to multiple signals simultaneously. Suppose that we are given n signals X1(k), X2(k), ..., Xn(k), each stemming from a different channel. We consider the multivariate autoregressive (MVAR) model:

X(k) = Σ_{j=1}^{p} A(j) X(k − j) + E(k),   (1)

where X(k) = (X1(k), X2(k), ..., Xn(k))^T, p is the model order, the model coefficients A(j) are n × n matrices, and E(k) is a zero-mean Gaussian random vector of size n. In words: each signal Xi(k) is assumed to depend linearly on its own p past values and the p past values of the other signals Xj(k). The deviation between X(k) and this linear dependence is modeled by the noise component E(k). Model (1) can also be cast in the form:

E(k) = Σ_{j=0}^{p} Ã(j) X(k − j),   (2)

¹ The Granger causality measures we consider here are implemented in the BioSig library, available from http://biosig.sourceforge.net/


where Ã(0) = I (the identity matrix) and Ã(j) = −A(j) for j > 0. One can transform (2) into the frequency domain (by applying the z-transform and substituting z = e^{−2πif Δt}, where 1/Δt is the sampling rate):

X(f) = Ã^{−1}(f) E(f) = H(f) E(f).   (3)

The power spectrum matrix of the signal X(k) is determined as

S(f) = X(f) X(f)* = H(f) V H*(f),   (4)

where V stands for the covariance matrix of E(k). The Granger causality measures are defined in terms of the coefficients of the matrices A, H, and S. Due to space limitations, only a short description of these methods is provided here; additional information can be found in the existing literature (e.g., [4]). From these coefficients, two symmetric measures can be defined:

– Granger coherence |Kij(f)| ∈ [0, 1] describes the amount of in-phase components in signals i and j at the frequency f.
– Partial coherence (PC) |Cij(f)| ∈ [0, 1] describes the amount of in-phase components in signals i and j at the frequency f when the influence (i.e., linear dependence) of the other signals is statistically removed.

The following asymmetric ("directed") Granger causality measures capture causal relations:

– Directed transfer function (DTF) γ²ij(f) quantifies the fraction of inflow to channel i stemming from channel j.
– Full-frequency directed transfer function (ffDTF)

F²ij(f) = |Hij(f)|² / (Σ_f Σ_{j=1}^{m} |Hij(f)|²) ∈ [0, 1],   (5)

is a variation of γ²ij(f) with a global normalization in frequency.
– Partial directed coherence (PDC) |Pij(f)| ∈ [0, 1] represents the fraction of outflow from channel j to channel i.
– Direct directed transfer function (dDTF) χ²ij(f) = F²ij(f) C²ij(f) is non-zero if the connection between channels i and j is causal (non-zero F²ij(f)) and direct (non-zero C²ij(f)).
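A minimal NumPy sketch of this pipeline — not the BioSig implementation the authors used — fits the MVAR model (1) by least squares, forms H(f) via (3), and evaluates the ffDTF of Eq. (5). The toy data, model order, and frequency grid are illustrative assumptions.

```python
import numpy as np

def fit_mvar(X, p):
    """Least-squares fit of MVAR(p): X(k) = sum_j A(j) X(k-j) + E(k).

    X: (n_channels, n_samples). Returns A of shape (p, n, n)."""
    n, T = X.shape
    # Stack lagged samples as regressors: rows are lag-1 block, lag-2 block, ...
    Z = np.vstack([X[:, p - j:T - j] for j in range(1, p + 1)])  # (n*p, T-p)
    Y = X[:, p:]                                                 # (n, T-p)
    coeffs = Y @ np.linalg.pinv(Z)                               # (n, n*p)
    return coeffs.reshape(n, p, n).transpose(1, 0, 2)

def ffdtf(A, freqs, dt):
    """Full-frequency DTF F_ij^2(f), Eq. (5), from MVAR coefficients A."""
    p, n, _ = A.shape
    H = np.empty((len(freqs), n, n), dtype=complex)
    for k, f in enumerate(freqs):
        # A_tilde(f) = I - sum_j A(j) exp(-2*pi*i*f*j*dt); H(f) = inverse
        Af = np.eye(n, dtype=complex)
        for j in range(1, p + 1):
            Af -= A[j - 1] * np.exp(-2j * np.pi * f * j * dt)
        H[k] = np.linalg.inv(Af)
    # Global normalization over frequency and source channel j, per target i
    denom = (np.abs(H) ** 2).sum(axis=(0, 2), keepdims=True)
    return (np.abs(H) ** 2) / denom

rng = np.random.default_rng(2)
# Toy 3-channel series in which channel 1 drives channel 0 at lag 1
T, dt = 2000, 1 / 200.0
X = rng.standard_normal((3, T))
X[0, 1:] += 0.8 * X[1, :-1]

A = fit_mvar(X, p=5)
F2 = ffdtf(A, freqs=np.arange(1, 45), dt=dt)
inflow_01 = F2[:, 0, 1].sum()   # inflow to channel 0 from channel 1
inflow_10 = F2[:, 1, 0].sum()
print(inflow_01, inflow_10)     # the directed influence 1 -> 0 dominates
```

The asymmetry of the recovered matrix (inflow to 0 from 1, but not vice versa) is exactly what distinguishes the directed measures from symmetric ones such as coherence.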

2.6 Phase Synchrony

Phase synchrony refers to the interdependence between the instantaneous phases φx and φy of two signals x and y; the instantaneous phases may be strongly synchronized even when the amplitudes of x and y are statistically independent. The instantaneous phase φx of a signal x may be extracted as [11]:

φ_x^H(k) = arg[x(k) + i x̃(k)],   (6)


where x̃ is the Hilbert transform of x. Alternatively, one can derive the instantaneous phase from the time-frequency transform X(k, f) of x:

φ_x^W(k, f) = arg[X(k, f)].   (7)
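A minimal sketch of the Hilbert-phase extraction of Eq. (6), together with the phase synchrony index γ of Eq. (8) (with n = m = 1), using SciPy's analytic-signal routine; the test signals and noise levels are illustrative assumptions.

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_phase(x):
    """Eq. (6): phi_x(k) = arg[x(k) + i*x_tilde(k)] via the analytic signal."""
    return np.angle(hilbert(x))

def phase_locking(phi_x, phi_y, n=1, m=1):
    """Eq. (8): gamma = |<exp(i(n*phi_x - m*phi_y))>| in [0, 1]."""
    return np.abs(np.mean(np.exp(1j * (n * phi_x - m * phi_y))))

rng = np.random.default_rng(3)
fs = 200.0
t = np.arange(0, 20.0, 1 / fs)

# Illustrative pair: locked 10 Hz phases, independent amplitudes and noise
x = (1 + 0.3 * rng.standard_normal(t.size)) * np.sin(2 * np.pi * 10 * t)
y = np.sin(2 * np.pi * 10 * t + 0.5) + 0.1 * rng.standard_normal(t.size)

gamma = phase_locking(instantaneous_phase(x), instantaneous_phase(y))
print(gamma)  # close to 1 despite the independent amplitude fluctuations
```

The example illustrates the point made above: the constant phase offset of 0.5 rad does not reduce γ, and neither do the independent amplitude fluctuations.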

The phase φ_x^W(k, f) depends on the center frequency f of the applied wavelet. By appropriately scaling the wavelet, the instantaneous phase may be computed in the frequency range of interest. The phase synchrony index γ for two instantaneous phases φx and φy is defined as [11]:

γ = |⟨e^{i(n φ_x − m φ_y)}⟩| ∈ [0, 1],   (8)

where n and m are integers (usually n = 1 = m). We will use the notation γH and γW to indicate whether the instantaneous phases are computed by the Hilbert transform or the time-frequency transform, respectively. In this paper, we will consider two additional phase synchrony indices, i.e., the evolution map approach (EMA) and the instantaneous period approach (IPA) [12]. Due to space constraints, we will not describe those measures here; instead we refer the reader to [12]². Additional information about phase synchrony can be found in [6].

2.7 State Space Based Synchrony

State space based synchrony (or "generalized synchronization") evaluates synchrony by analyzing the interdependence between the signals in a state space reconstructed domain (see e.g., [7]). The central hypothesis behind this approach is that the signals at hand are generated by some (unknown) deterministic, potentially high-dimensional, non-linear dynamical system. In order to reconstruct such a system from a signal x, one considers delay vectors X(k) = (x(k), x(k − τ), ..., x(k − (m − 1)τ))^T, where m is the embedding dimension and τ denotes the time lag. If τ and m are appropriately chosen, and the signals are indeed generated by a deterministic dynamical system (to a good approximation), the delay vectors lie on a smooth manifold ("mapping") in R^m, apart from small stochastic fluctuations. The S-estimator [13], here denoted by Sest, is a state space based measure obtained by applying principal component analysis (PCA) to delay vectors³. We also considered three measures of non-linear interdependence, S^k, H^k, and N^k (see [6] for details⁴).

2.8 Information-Theoretic Measures

Several interdependence measures have been proposed that have their roots in information theory. Mutual information I is perhaps the most well-known

² Program code is available at www.agnld.uni-potsdam.de/%7Emros/dircnew.m
³ We used the S Toolbox, downloadable from http://aperest.epfl.ch/docs/software.htm
⁴ Software is available from http://www.vis.caltech.edu/~rodri/software.htm


information-theoretic interdependence measure; it quantifies the amount of information the random variable Y contains about the random variable X (and vice versa); it is always positive, and it vanishes when X and Y are statistically independent. Recently, a sophisticated and effective technique to compute mutual information between time series was proposed [14]; we will use that method in this paper⁵. The method of [14] computes mutual information in the time domain; alternatively, this quantity may also be determined in the time-frequency domain (denoted by IW), more specifically, from normalized spectrograms [15,16] (see also [17,18]). We will also consider several information-theoretic measures that quantify the dissimilarity (or "distance") between two random variables (or signals). In contrast to the previously mentioned measures, those divergence measures vanish if the random variables (or signals) are identical; moreover, they are not necessarily symmetric, and therefore they cannot be considered distance measures in the strict sense. Divergences may be computed in the time domain and the time-frequency domain; in this paper, we will only compute the divergence measures in the time-frequency domain, since the computation in the time domain is far more involved. We consider the Kullback-Leibler divergence K, the Rényi divergence Dα, the Jensen-Shannon divergence J, and the Jensen-Rényi divergence Jα. Due to space constraints, we will not review those divergence measures here; we refer the interested reader to [15,16].

2.9 Stochastic Event Synchrony (SES)

Stochastic event synchrony, an interdependence measure we developed in earlier work [19], describes the similarity between the time-frequency transforms of two signals x and y. As a first step, the time-frequency transform of each signal is approximated as a sum of (half-ellipsoid) basis functions, referred to as "bumps" (see Fig. 1 and [20]). The resulting bump models, representing the most prominent oscillatory activity, are then aligned (see Fig. 2): bumps in one time-frequency map may not be present in the other map ("non-coincident bumps"); other bumps are present in both maps ("coincident bumps"), but appear at slightly different positions on the maps. The black lines in Fig. 2 connect the centers of coincident bumps, and hence visualize the offset in position between pairs of coincident bumps. Stochastic event synchrony consists of five parameters that quantify the alignment of two bump models:

– ρ: fraction of non-coincident bumps,
– δt and δf: average time and frequency offset, respectively, between coincident bumps,
– st and sf: variance of the time and frequency offset, respectively, between coincident bumps.

The alignment of the two bump models (cf. Fig. 2 (right)) is obtained by iterative max-product message passing on a graphical model; the five SES parameters are determined from the resulting alignment by maximum a posteriori (MAP)

⁵ The program code (in C) is available at www.klab.caltech.edu/~kraskov/MILCA/


Fig. 1. Bump modeling

[Fig. 2 panel titles: "Bump models of two EEG signals" (left); "Coincident bumps (ρ = 27%)" (right)]

Fig. 2. Coincident and non-coincident activity (“bumps”); (left) bump models of two signals; (right) coincident bumps; the black lines connect the centers of coincident bumps

estimation [19]. The parameters ρ and st are the most relevant for the present study, since they quantify the synchrony between bump models (and hence, between the original time-frequency maps); low ρ and st imply that the two time-frequency maps at hand are well synchronized.
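Once an alignment of two bump models is available, the five SES parameters reduce to simple statistics of the matched pairs. The sketch below assumes the alignment itself (which the authors obtain by max-product message passing) is already given; the bump coordinates are invented toy values.

```python
import numpy as np

def ses_parameters(matched_x, matched_y, n_non_coincident, n_total):
    """Five SES parameters from an already-computed bump alignment.

    matched_x, matched_y: (k, 2) arrays of (time, frequency) centers of the
    k coincident bump pairs; n_non_coincident / n_total count the bumps."""
    dt = matched_y[:, 0] - matched_x[:, 0]  # time offsets of coincident bumps
    df = matched_y[:, 1] - matched_x[:, 1]  # frequency offsets
    return {
        "rho": n_non_coincident / n_total,  # fraction of non-coincident bumps
        "delta_t": dt.mean(), "delta_f": df.mean(),
        "s_t": dt.var(), "s_f": df.var(),
    }

# Toy alignment: 8 coincident pairs with small jitter, 3 non-coincident bumps
rng = np.random.default_rng(4)
bx = np.column_stack([np.linspace(50, 700, 8), np.full(8, 10.0)])
by = bx + rng.normal(0, [5.0, 0.5], size=bx.shape)
params = ses_parameters(bx, by, n_non_coincident=3, n_total=11)
print(params["rho"])  # 3/11 ≈ 0.27, as in the example of Fig. 2
```

Low ρ and s_t in this toy case correspond to the "well synchronized" situation described above: few unmatched bumps, and small timing jitter among the matched ones.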

3 Detection of EEG Synchrony Abnormalities in MCI Patients

In the following section, we describe the EEG data we analyzed. In Section 3.2 we address certain technical issues related to the synchrony measures, and in Section 3.3 we present and discuss our results.

3.1 EEG Data

The EEG data⁶ analyzed here have been analyzed in previous studies concerning early detection of AD [21]–[25]. They consist of rest eyes-closed EEG data

⁶ We are grateful to Prof. T. Musha for providing us the EEG data.


recorded from 21 sites on the scalp based on the 10–20 system. The sampling frequency was 200 Hz, and the signals were band-pass filtered between 4 and 30 Hz. The subjects comprised two study groups. The first consisted of 25 patients who had complained of memory problems; these subjects were diagnosed as suffering from MCI and subsequently developed mild AD. The criterion for inclusion in the MCI group was a mini mental state exam (MMSE) score above 24; the average score in the MCI group was 26 (SD of 1.8). The other group was a control set consisting of 56 age-matched, healthy subjects who had no memory or other cognitive impairments; the average MMSE of this control group was 28.5 (SD of 1.6). The ages of the two groups were 71.9 ± 10.2 and 71.7 ± 8.3, respectively. Pre-selection was conducted to ensure that the data were of high quality, as determined by the presence of at least 20 s of artifact-free data. Based on this requirement, the number of subjects in the two groups described above was reduced to 22 MCI patients and 38 control subjects.

3.2 Methods

In order to reduce the computational complexity, we aggregated the EEG signals into 5 zones (see Fig. 3); we computed the synchrony measures (except the S-estimator) from the averages of each zone. For all those measures except SES, we used the arithmetic average; in the case of SES, the bump models obtained from the 21 electrodes were clustered into 5 zones by means of the aggregation algorithm described in [20]. We evaluated the S-estimator between each pair of zones by applying PCA to the state space embedded EEG signals of both zones. We divided the EEG signals into segments of equal length L, and computed the synchrony measures by averaging over those segments. Since spontaneous EEG is usually highly non-stationary, and most synchrony measures are, strictly speaking, only applicable to stationary signals, the length L should be sufficiently small; on the other hand, in order to obtain reliable measures of synchrony, the length should be chosen sufficiently large. Consequently, it is not a priori clear how to choose the length L, and therefore we decided to test several values, i.e., L = 1 s, 5 s, and 20 s. In the case of Granger causality measures, one needs to specify the model order p. Similarly, for mutual information (in the time domain) and the state space based measures, the embedding dimension m and the time lag τ need to be chosen; the phase synchrony indices IPA and EMA involve a time delay τ. Since it is not obvious which parameter values yield the best performance for detecting AD, we have tried a range of parameter settings, i.e., p = 1, 2, ..., 10, and m = 1, 2, ..., 10; the time delay was in each case set to τ = 1/30 s, which is the period of the fastest oscillations in the EEG signals at hand.

3.3 Results and Discussion

Our main results are summarized in Table 1, which shows the sensitivity of the synchrony measures for detecting MCI. Due to space constraints, the table only shows results for global synchrony, i.e., the synchrony measures were averaged



Fig. 3. The 21 electrodes used for EEG recording, distributed according to the 10–20 international placement system [8]. The clustering into 5 zones is indicated by the colors and dashed lines (1 = frontal, 2 = left temporal, 3 = central, 4 = right temporal and 5 = occipital).

over all pairs of zones. (Results for local synchrony and individual frequency bands will be presented in a longer report, including a detailed description of the influence of various parameters such as model order and embedding dimension on the sensitivity.) The p-values, obtained by the Mann-Whitney test, need, strictly speaking, to be Bonferroni corrected; since we consider many different measures simultaneously, it is likely that a few of those measures have small p-values merely due to stochastic fluctuations (and not due to systematic differences between MCI and control patients). In the most conservative Bonferroni post-correction, the p-values are multiplied by the number of synchrony measures. From the table, it can be seen that only a few measures evince significant differences in EEG synchrony between MCI and control patients: full-frequency DTF and ρ are the most sensitive (for the data set at hand); their p-values remain significant (p_corr < 0.05) after Bonferroni correction. In other words, the effect of MCI and AD on EEG synchrony can be detected, as was reported earlier in the literature; we will expand on this issue in the following section. In order to gain more insight into the relation between the different measures, we calculated the correlation between them (see Fig. 5; red and blue indicate strong correlation and anti-correlation, respectively). From this figure, it becomes strikingly clear that the majority of measures are strongly correlated (or anti-correlated) with each other; in other words, the measures can easily be classified into different families. In addition, many measures are strongly (anti-)correlated with the classical cross-correlation coefficient r, the most basic measure; as a result, they do not provide much additional information regarding EEG synchrony.
Measures that are only weakly correlated with the cross-correlation coeﬃcient include the phase synchrony indices, Granger causality measures, and stochastic-event synchrony measures; interestingly, those three families of synchrony measures are mutually uncorrelated, and as a consequence, they each seem to capture a speciﬁc kind of interdependence.
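The statistical procedure used above — a Mann-Whitney test per measure, followed by a conservative Bonferroni post-correction — can be sketched as follows. The per-subject synchrony values are synthetic stand-ins; only the group sizes (22 MCI, 38 controls) match the data set.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(5)

# Synthetic per-subject values, one column per synchrony measure; only the
# first "measure" carries a genuine group difference (e.g. higher rho in MCI)
n_measures = 35
mci = rng.standard_normal((22, n_measures))
ctr = rng.standard_normal((38, n_measures))
mci[:, 0] += 1.5

p_values = np.array([
    mannwhitneyu(mci[:, i], ctr[:, i], alternative="two-sided").pvalue
    for i in range(n_measures)
])

# Conservative Bonferroni post-correction: multiply p by the number of
# measures tested simultaneously (capped at 1)
p_corrected = np.minimum(p_values * n_measures, 1.0)
print(p_corrected[0])  # survives correction; the null measures mostly do not
```

This illustrates why the correction matters: with ~35 simultaneous tests, a handful of uncorrected p-values below 0.05 are expected by chance alone.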

Table 1. Sensitivity of synchrony measures for early prediction of AD (p-values for the Mann-Whitney test; * and ** indicate p < 0.05 and p < 0.005, respectively)

Measure                 p-value     References
Cross-correlation       0.028*      [8]
Coherence               0.060       [8]
Phase Coherence         0.72        [8]
Corr-entropy            0.27        [9]
Wave-entropy            0.012*      [9]
Granger coherence       0.15        [4]
Partial Coherence       0.16        [4]
PDC                     0.60        [4]
DTF                     0.34        [4]
ffDTF                   0.0012**    [4]
dDTF                    0.030*      [4]
Hilbert Phase           0.15        [6]
Wavelet Phase           0.082       [6]
Evolution Map           0.072       [12]
Instantaneous Period    0.020*      [12]
Kullback-Leibler        0.072       [15]
Rényi                   0.076       [15]
Jensen-Shannon          0.084       [15]
Jensen-Rényi            0.12        [15]
I_W                     0.060       [15]
I                       0.080       [14]
N^k                     0.032*      [6]
S^k                     0.29        [6]
H^k                     0.090       [6]
S-estimator             0.33        [13]
s_t                     0.92        [19]
ρ                       0.00029**   [19]

In Fig. 4, we combine the two most sensitive synchrony measures (for the data set at hand), i.e., full-frequency DTF and ρ. In this figure, the MCI patients are fairly well distinguishable from the control patients. As such, the separation is not sufficiently strong to yield reliable early prediction of AD. For this purpose, the two features need to be combined with complementary features, for example derived from the slowing effect of AD on EEG, or perhaps from different modalities such as PET, MRI, DTI, or biochemical indicators. On the other hand, we remind the reader that in the data set at hand, patients did not carry out any specific task; moreover, the recordings were short (only 20 s). It is plausible that the sensitivity of EEG synchrony could be further improved by increasing the length of the recordings and by recording the EEG before, during, and after patients carry out specific tasks, e.g., working memory tasks.

Fig. 4. ρ vs. ffDTF (scatter of MCI and control subjects)


Fig. 5. Correlation between the synchrony measures (the measures are grouped into families: state space, correlation/coherence, mutual information, phase, divergence, Granger, and SES)

4 Conclusions

In previous studies, brain dynamics in AD and MCI patients were mainly investigated using coherence (cf. Section 2.2) or state space based measures of synchrony (cf. Section 2.7). During working memory tasks, coherence shows significant effects in AD and MCI groups [26,27]; in the resting condition, however, coherence does not show such differences at low frequencies (below 30 Hz), neither between AD patients and controls [28] nor between MCI patients and controls [27]. These results are consistent with our observations. In the gamma range, coherence seems to decrease with AD [29]; we did not investigate this frequency range, however, since the EEG signals analyzed here were band-pass filtered between 4 and 30 Hz. Synchronization likelihood, a state space based synchronization measure similar to the non-linear interdependence measures S^k, H^k, and N^k (cf. Section 2.7), is believed to be more sensitive than coherence for detecting changes in AD patients [28]. Using state space based synchrony methods, significant differences were found between AD patients and controls in resting conditions [28,30,32,33]. State space based synchrony failed to retrieve significant differences between MCI patients and control subjects at a global level [32,33], but significant effects were observed locally: fronto-parietal electrode synchronization likelihood progressively decreased through the MCI and mild AD groups [30]. We report here a lower p-value for the state space based synchrony measure N^k (p = 0.032) than for coherence (p = 0.06); those low p-values, however, would not be statistically significant after Bonferroni correction.

A Comparative Study of Synchrony Measures for the Early Detection of AD

123

By means of Global Field Synchronization, a phase synchrony measure similar to the ones we considered in this paper, Koenig et al. [31] observed a general decrease of synchronization in correlation with cognitive decline and AD. In our study, we analyzed five different phase synchrony measures: Hilbert and wavelet based phase synchrony, phase coherence, the evolution map approach (EMA), and the instantaneous period approach (IPA). The p-value of the latter is low (p = 0.020), in agreement with the results of [31], but it would be non-significant after Bonferroni correction. The strongest observed effect is a significantly higher degree of local asynchronous activity (ρ) in MCI patients, more specifically, a high number of non-coincident, asynchronous oscillatory events (p = 0.00029). Interestingly, we did not observe a significant effect on the timing jitter s_t of the coincident events (p = 0.92). In other words, our results seem to indicate that there is significantly more non-coincident background activity, while the coincident activity remains well synchronized. On the one hand, this observation is in agreement with previous studies that report a general decrease of neural synchrony in MCI and AD patients; on the other hand, it goes beyond previous results, since it yields a more subtle description of EEG synchrony in MCI and AD patients: it suggests that the loss of coherence is mostly due to an increase of (local) non-coincident background activity, whereas the locked (coincident) activity remains equally well synchronized. In future work, we will verify this conjecture by means of other data sets.

References

1. Jeong, J.: EEG Dynamics in Patients with Alzheimer's Disease. Clinical Neurophysiology 115, 1490–1505 (2004)
2. Pereda, E., Quiroga, R.Q., Bhattacharya, J.: Nonlinear Multivariate Analysis of Neurophysiological Signals. Progress in Neurobiology 77, 1–37 (2005)
3. Breakspear, M.: Dynamic Connectivity in Neural Systems: Theoretical and Empirical Considerations. Neuroinformatics 2(2) (2004)
4. Kamiński, M., Liang, H.: Causal Influence: Advances in Neurosignal Analysis. Critical Reviews in Biomedical Engineering 33(4), 347–430 (2005)
5. Stam, C.J.: Nonlinear Dynamical Analysis of EEG and MEG: Review of an Emerging Field. Clinical Neurophysiology 116, 2266–2301 (2005)
6. Quiroga, R.Q., Kraskov, A., Kreuz, T., Grassberger, P.: Performance of Different Synchronization Measures in Real Data: A Case Study on EEG Signals. Physical Review E 65 (2002)
7. Sakkalis, V., Giurcăneanu, C.D., Xanthopoulos, P., Zervakis, M., Tsiaras, V.: Assessment of Linear and Non-Linear EEG Synchronization Measures for Evaluating Mild Epileptic Signal Patterns. In: Proc. of ITAB 2006, Ioannina-Epirus, Greece, October 26–28 (2006)
8. Nunez, P., Srinivasan, R.: Electric Fields of the Brain: The Neurophysics of EEG. Oxford University Press, Oxford (2006)
9. Xu, J.-W., Bakardjian, H., Cichocki, A., Principe, J.C.: EEG Synchronization Measure: a Reproducing Kernel Hilbert Space Approach. Submitted to IEEE Transactions on Biomedical Engineering Letters (September 2006)


10. Herrmann, C.S., Grigutsch, M., Busch, N.A.: EEG Oscillations and Wavelet Analysis. In: Handy, T. (ed.) Event-Related Potentials: a Methods Handbook, pp. 229–259. MIT Press, Cambridge (2005)
11. Lachaux, J.-P., Rodriguez, E., Martinerie, J., Varela, F.J.: Measuring Phase Synchrony in Brain Signals. Human Brain Mapping 8, 194–208 (1999)
12. Rosenblum, M.G., Cimponeriu, L., Bezerianos, A., Patzak, A., Mrowka, R.: Identification of Coupling Direction: Application to Cardiorespiratory Interaction. Physical Review E 65, 041909 (2002)
13. Carmeli, C., Knyazeva, M.G., Innocenti, G.M., De Feo, O.: Assessment of EEG Synchronization Based on State-Space Analysis. NeuroImage 25, 339–354 (2005)
14. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating Mutual Information. Physical Review E 69(6), 66138 (2004)
15. Aviyente, S.: A Measure of Mutual Information on the Time-Frequency Plane. In: Proc. of ICASSP 2005, Philadelphia, PA, USA, March 18–23, vol. 4, pp. 481–484 (2005)
16. Aviyente, S.: Information-Theoretic Signal Processing on the Time-Frequency Plane and Applications. In: Proc. of EUSIPCO 2005, Antalya, Turkey, September 4–8 (2005)
17. Quiroga, R.Q., Rosso, O., Basar, E.: Wavelet-Entropy: A Measure of Order in Evoked Potentials. Electr. Clin. Neurophysiol. (Suppl.) 49, 298–302 (1999)
18. Blanco, S., Quiroga, R.Q., Rosso, O., Kochen, S.: Time-Frequency Analysis of EEG Series. Physical Review E 51, 2624 (1995)
19. Dauwels, J., Vialatte, F., Cichocki, A.: A Novel Measure for Synchrony and Its Application to Neural Signals. In: Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, Hawaii, April 15–20 (2007)
20. Vialatte, F., Martin, C., Dubois, R., Haddad, J., Quenet, B., Gervais, R., Dreyfus, G.: A Machine Learning Approach to the Analysis of Time-Frequency Maps, and Its Application to Neural Dynamics. Neural Networks 20, 194–209 (2007)
21. Chapman, R., et al.: Brain Event-Related Potentials: Diagnosing Early-Stage Alzheimer's Disease. Neurobiol. Aging 28, 194–201 (2007)
22. Cichocki, A., et al.: EEG Filtering Based on Blind Source Separation (BSS) for Early Detection of Alzheimer's Disease. Clin. Neurophys. 116, 729–737 (2005)
23. Hogan, M., et al.: Memory-Related EEG Power and Coherence Reductions in Mild Alzheimer's Disease. Int. J. Psychophysiol. 49 (2003)
24. Musha, T., et al.: A New EEG Method for Estimating Cortical Neuronal Impairment that is Sensitive to Early Stage Alzheimer's Disease. Clin. Neurophys. 113, 1052–1058 (2002)
25. Vialatte, F., et al.: Blind Source Separation and Sparse Bump Modelling of Time-Frequency Representation of EEG Signals: New Tools for Early Detection of Alzheimer's Disease. In: IEEE Workshop on Machine Learning for Signal Processing, pp. 27–32 (2005)
26. Hogan, M.J., Swanwick, G.R., Kaiser, J., Rowan, M., Lawlor, B.: Memory-Related EEG Power and Coherence Reductions in Mild Alzheimer's Disease. Int. J. Psychophysiol. 49(2), 147–163 (2003)
27. Jiang, Z.Y.: Study on EEG Power and Coherence in Patients with Mild Cognitive Impairment During Working Memory Task. J. Zhejiang Univ. Sci. B 6(12), 1213–1219 (2005)
28. Stam, C.J., van Cappellen van Walsum, A.M., Pijnenburg, Y.A., Berendse, H.W., de Munck, J.C., Scheltens, P., van Dijk, B.W.: Generalized Synchronization of MEG Recordings in Alzheimer's Disease: Evidence for Involvement of the Gamma Band. J. Clin. Neurophysiol. 19(6), 562–574 (2002)


29. Herrmann, C.S., Demiralp, T.: Human EEG Gamma Oscillations in Neuropsychiatric Disorders. Clinical Neurophysiology 116, 2719–2733 (2005)
30. Babiloni, C., Ferri, R., Binetti, G., Cassarino, A., Forno, G.D., Ercolani, M., Ferreri, F., Frisoni, G.B., Lanuzza, B., Miniussi, C., Nobili, F., Rodriguez, G., Rundo, F., Stam, C.J., Musha, T., Vecchio, F., Rossini, P.M.: Fronto-Parietal Coupling of Brain Rhythms in Mild Cognitive Impairment: A Multicentric EEG Study. Brain Res. Bull. 69(1), 63–73 (2006)
31. Koenig, T., Prichep, L., Dierks, T., Hubl, D., Wahlund, L.O., John, E.R., Jelic, V.: Decreased EEG Synchronization in Alzheimer's Disease and Mild Cognitive Impairment. Neurobiol. Aging 26(2), 165–171 (2005)
32. Pijnenburg, Y.A., van der Made, Y., van Cappellen van Walsum, A.M., Knol, D.L., Scheltens, P., Stam, C.J.: EEG Synchronization Likelihood in Mild Cognitive Impairment and Alzheimer's Disease During a Working Memory Task. Clin. Neurophysiol. 115(6), 1332–1339 (2004)
33. Yagyu, T., Wackermann, J., Shigeta, M., Jelic, V., Kinoshita, T., Kochi, K., Julin, P., Almkvist, O., Wahlund, L.O., Kondakor, I., Lehmann, D.: Global Dimensional Complexity of Multichannel EEG in Mild Alzheimer's Disease and Age-Matched Cohorts. Dement. Geriatr. Cogn. Disord. 8(6), 343–347 (1997)

Reproducibility Analysis of Event-Related fMRI Experiments Using Laguerre Polynomials

Hong-Ren Su1,2, Michelle Liou2,*, Philip E. Cheng2, John A.D. Aston2, and Shang-Hong Lai1

1 Dept. of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
2 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
[email protected]

Abstract. In this study, we introduce the use of orthogonal causal Laguerre polynomials for analyzing data collected in event-related functional magnetic resonance imaging (fMRI) experiments. This particular family of polynomials has been widely used in the system identification literature and recommended for modeling impulse functions in BOLD-based fMRI experiments. In our empirical study, we applied Laguerre polynomials to analyze data collected in an event-related fMRI study conducted by Scott et al. (2001). That experiment investigated neural mechanisms of visual attention in a change-detection task. By specifying a few meaningful Laguerre polynomials in the design matrix of a random effect model, we clearly found brain regions associated with trial onset and visual search. The results are consistent with the original findings in Scott et al. (2001). In addition, we found brain regions related to the mask presence in the parahippocampal gyrus, superior frontal gyrus, and inferior parietal lobule. Both positive and negative responses were also found in the lingual gyrus, cuneus, and precuneus.

Keywords: Reproducibility analysis, Event-related fMRI.

1 Introduction

We previously proposed a methodology for assessing reproducibility evidence in fMRI studies using an on-and-off paradigm without necessarily conducting replicated experiments, and suggested interpreting SPMs in conjunction with reproducibility evidence (Liou et al., 2003; 2006). Empirical studies have shown that the method is robust to the specification of hemodynamic response functions (HRFs). Recently, BOLD-based event-related fMRI experiments have been widely used as an advanced alternative to the on-and-off design for studies on human brain functions. In event-related fMRI experiments, the duration of stimulus presentation is generally longer, and there are no obvious contrasts between the experimental and control conditions to be used in data analyses. In order to detect possible brain activations during stimulus presentation and task performance, a variety of event-related HRFs have been proposed in the literature. In this study, we introduce the use of orthogonal causal

* Corresponding author.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 126–134, 2008. © Springer-Verlag Berlin Heidelberg 2008


Laguerre polynomials for modeling response functions. This particular family of polynomials has been widely used in the system identification literature and was recommended for modeling impulse functions in fMRI experiments (Saha et al., 2004). In the empirical study, we applied Laguerre polynomials to analyze data from the study by Scott et al. (2001). The dataset was published by the US fMRI Data Center and is available for public access. The original experiment involved 10 human subjects and investigated brain functions associated with a change-detection task. In the experimental task, subjects looked attentively at two versions of the same picture presented in alternation, separated by a brief mask interval. The experiment additionally recorded behavioral responses: when subjects detected something changing between the pictures, they pressed a button. In our reproducibility analysis, a few meaningful Laguerre polynomials matching the experimental design were inserted into a random effect model, and reproducibility analyses were conducted based on the selected polynomials. In the analyses, we successfully located brain regions associated with the visual change-detection task similar to those found in Scott et al. Additionally, we found other interesting brain regions that were not included in the previous study.

2 Method

In this section, we briefly describe the method for investigating reproducibility evidence in fMRI experiments, and outline the family of Laguerre polynomials, including those used in our empirical study.

2.1 Reproducibility Analysis

In the SPM generalized linear model, the fMRI responses in the i-th run can be expressed as

    y_i = X_i β_i + e_i ,    (1)

where y_i is the vector of image intensity after pre-whitening, X_i is the design matrix, and β_i is the vector containing the unknown regression parameters. In the random effect model, the regression parameters β_i are additionally assumed to be random draws from a multivariate Gaussian distribution with common mean μ and variance Ω. The empirical Bayes estimate of β_i in the random effect model shrinks all estimates toward the mean μ, with greater shrinkage for noisy runs. In fMRI studies, the true status of each voxel is unknown, but it can be estimated using the t-values (i.e., standardized β estimates) within individual runs derived from the random effect model along with the maximum likelihood estimation method. By specifying a mixed multinomial model, the receiver operating characteristic (ROC) curve can be estimated using the maximum likelihood estimation method and the t-values of all image voxels. The curve is simply a bivariate plot of sensitivity versus the false alarm rate. The threshold (or operating point) on the ROC curve for classifying voxels into the active/inactive status was found by maximizing the kappa value. We follow the same definition as in Liou et al. (2006) to categorize voxels according to reproducibility, that is, a voxel is strongly reproducible if its active status remains the same in at least 90% of the runs, moderately reproducible in 70-90% of the runs, weakly reproducible in 50-70% of the runs, and otherwise not reproducible. The brain activation maps are constructed on the basis of strongly reproducible voxels, but include voxels that are moderately reproducible and spatially proximal to strongly reproducible voxels.

2.2 Laguerre Polynomials

The Laguerre polynomials can be used for detecting experimental responses. This family of polynomials can be specified as follows:

    h(t) = Σ_{i=1}^{L} f_i g_i^a(t) ,    (2)

where h(t) gives the design coefficients to be input into X_i in (1); L is the order of the Laguerre polynomial; f_i is the coefficient of the basis function; and g_i^a(t) is the inverse Z-transform of the i-th Laguerre polynomial, given by

    g_i^a(t) = Z^{-1}[ (z^{-1} / (1 − a z^{-1})) ((z^{-1} − a) / (1 − a z^{-1}))^{i−1} ] = Z^{-1}[ g̃_i(z) ] ,    (3)

where a is a time constant. As an illustration, Fig. 1 gives the response coefficients corresponding to L = 2 and L = 3.
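In practice, the basis functions in (3) need not be obtained via a symbolic inverse Z-transform: g_1^a is the impulse response of z^{-1}/(1 − a z^{-1}), and each higher-order function follows by filtering the previous one through the all-pass section (z^{-1} − a)/(1 − a z^{-1}). A minimal sketch in plain Python (the time constant a = 0.8 and the horizon T are illustrative values of our choosing, not taken from the paper):

```python
def laguerre_basis(L, a=0.8, T=200):
    """Unnormalized discrete Laguerre functions g_i^a(t), i = 1..L, t = 0..T-1.

    g_1 is the impulse response of z^-1 / (1 - a z^-1); each subsequent
    function passes the previous one through the all-pass
    (z^-1 - a) / (1 - a z^-1).
    """
    delta = [1.0] + [0.0] * (T - 1)          # unit impulse
    g = [0.0] * T
    for t in range(1, T):                    # g_1[t] = a*g_1[t-1] + delta[t-1]
        g[t] = a * g[t - 1] + delta[t - 1]
    basis = [g]
    for _ in range(1, L):
        prev, g = basis[-1], [0.0] * T
        for t in range(1, T):
            # all-pass update: g[t] = a*g[t-1] + prev[t-1] - a*prev[t]
            g[t] = a * g[t - 1] + prev[t - 1] - a * prev[t]
        basis.append(g)
    return basis

def design_coefficients(f, basis):
    """h(t) = sum_i f_i * g_i^a(t), as in Eq. (2)."""
    return [sum(fi * gi[t] for fi, gi in zip(f, basis))
            for t in range(len(basis[0]))]
```

The functions generated this way are mutually orthogonal (up to truncation of the horizon), which is the property exploited in Section 3.2.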


Fig. 1. The boxcar functions of experimental conditions in the Scott et al. study are depicted in (a), and the Laguerre polynomials h(t) with L=2 and L =3 are depicted in (b)
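The voxel categorization rule of Section 2.1 reduces to thresholding, per voxel, the fraction of runs that share the voxel's modal active/inactive status. A small sketch (the function name, the use of the modal status, and the handling of boundary cases are our own assumptions, not taken from the paper):

```python
def reproducibility_category(active_by_run):
    """Categorize one voxel by its active/inactive status across runs.

    active_by_run: list of booleans, one per run (True = active).
    Agreement is the fraction of runs matching the voxel's majority status;
    thresholds follow Section 2.1 (>=90% strong, 70-90% moderate,
    50-70% weak, otherwise not reproducible).
    """
    n = len(active_by_run)
    n_active = sum(active_by_run)
    agreement = max(n_active, n - n_active) / n
    if agreement >= 0.9:
        return "strong"
    if agreement >= 0.7:
        return "moderate"
    if agreement >= 0.5:
        return "weak"
    return "not reproducible"
```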

3 Event-Related fMRI Experiments

Here we introduce the experimental design behind the fMRI dataset used in the empirical study, and select a design matrix suitable for the study.


3.1 Experimental Design

In our empirical study, the dataset contains functional MR images of 10 subjects who went through 10-12 experimental runs, with 10 stimulus trials in each run. Experimental runs involved the change-detection task, in which the two images within a pair differed in either the presence/absence or position of a single object, or in the color of the object. The two images were presented in alternation for 40 sec. In the first 30 sec, each image was presented for 300 msec followed by a 100 msec mask; the mask was removed in the last 10 sec. Subjects pressed a button when detecting something changing between the pair of images. The experimental images and stimulus duration are shown in Fig. 2.

Fig. 2. The experimental images and stimulus duration in the Scott et al. study

3.2 Experimental Design Matrix

We used the Laguerre polynomials in Fig. 1 for specifying the design matrix in (1) instead of the theoretical HRFs. According to the original experimental design, there are two contrasts of interest in the Scott et al. study. The first is the response after the task onset within the 40-sec trial, and the second is the difference between stimulus presentations with and without the mask, that is, between responses during the image presentation with the mask (0-30 sec) and without the mask (30-40 sec) in Fig. 2. The boxcar functions in Fig. 1 could also be specified in the design matrix, as was suggested in the study by Liou et al. (2006) for the on-and-off design. However, the two boxcar functions are not orthogonal to each other and carry redundant information on experimental effects. In an event-related fMRI experiment, the duration of stimulus presentation is always longer than that in the on-and-off design, and the theoretical HRFs vanish during the stimulus presentation, although there might be brain regions continuously responding to the stimulus. The Laguerre polynomials are orthogonal and offer possibilities for examining all kinds of experimental effects. We might also consider the Laguerre polynomials in Fig. 1 as a smoothed version of the boxcar functions.

4 Results

In the Scott et al. study, a response-contingent event-related analysis technique was used in the data analyses, and the original results showed brain regions associated with different processing components in the visual change-detection task. For instance, the lingual gyrus, cuneus, precentral gyrus, and medial frontal gyrus showed activations associated with the task onset, and the pattern of activation in the dorsal and ventral visual pathways was temporally associated with the duration of visual search. Finally, parietal and frontal regions showed systematic deactivations during task performance. In the reproducibility analysis with Laguerre polynomials, we found similar activation regions associated with the task onset, visual search, and deactivations. In addition, we found activation regions in the parahippocampal gyrus, superior frontal gyrus, supramarginal gyrus, and inferior parietal lobule. Both positive and negative responses were also found in the lingual gyrus, cuneus, and precuneus, and these are also reproducible across all subjects; this finding is consistent with our previous data analyses of fMRI studies involving object recognition and word/pseudoword reading (Liou et al., 2006). Table 1 lists a few activation regions in the change-detection task for the 10 subjects.

Table 1. The activation regions in the change-detection task. A plus sign indicates a positive response and a minus sign indicates a negative response.

Rows: Lingual gyrus, Precuneus, Cuneus, Posterior cingulate, Medial frontal gyrus, Parahippocampal gyrus, Superior frontal gyrus, Supramarginal gyrus. Columns: Subjects 1-10. Entries are +/−, +, or blank.
(Per-cell entries not recoverable from the extracted text.)
In the table, there are 4 subjects showing activations in the superior frontal gyrus and supramarginal gyrus in the change-detection task. These two regions have been referred to in fMRI studies on language processing (e.g., studies on word and pseudoword reading). The 4 subjects, on average, had longer reaction times in the change-detection task, that is, they delayed pressing the button until the image presentation without the mask (30-40 sec). Fig. 3 shows the brain activation regions for Subjects 5 and 7 in the Scott et al. study. Subject 5 showed activation in the superior frontal gyrus and supramarginal gyrus and had the longest reaction time compared with the other subjects in the experiment. On the other hand, Subject 7 had a relatively shorter reaction time and showed no activations in the two regions.


Fig. 3. Brain activation regions for Subjects 5 and 7 in the Scott et al. study


5 Discussion

The reproducibility evidence suggests that the 10 subjects consistently show a pattern of increased/decreased responses in the lingual gyrus, cuneus, and precuneus. Similar observations were also found in our empirical studies on other datasets published by the fMRIDC. In the fMRI literature, the precuneus, posterior cingulate, and medial prefrontal cortex are known to constitute the default network in the resting state and to show decreased activity in a variety of cognitive tasks. The physiological mechanisms behind the decreased responses are still under investigation. However, discussions of the network have focused on the decreased activities. We suggest considering both positive and negative responses when interpreting the default network. With the method of reproducibility analysis, we can clearly classify brain regions into those that show consistent responses across subjects and those that show inconsistencies across subjects (see results in Table 1). Higher mental functions are individual, and their localization in specific brain regions can be made only with some probability. The higher mental functions are connected with speech, that is, external or internal speech organizing personal behavior. Subjects differ from each other as a result of using different speech designs when making decisions in performing experimental tasks. Change of functional localization is an additional characteristic of a subject's psychological traits. The proposed methodology would assist researchers in identifying those brain regions that are specific to individual speech designs and those that are consistent across subjects.

Acknowledgments. The authors are indebted to the fMRIDC at Dartmouth College for providing the datasets analyzed in this study. This research was supported by grant NSC 94-2413-H-001-001 from the National Science Council (Taiwan).

References

1. Liou, M., Su, H.-R., Lee, J.-D., Aston, J.A.D., Tsai, A.C., Cheng, P.E.: A Method for Generating Reproducible Evidence in fMRI Studies. NeuroImage 29, 383–395 (2006)
2. Huettel, S.A., Guzeldere, G., McCarthy, G.: Dissociating Neural Mechanisms of Visual Attention in Change Detection Using Functional MRI. Journal of Cognitive Neuroscience 13(7), 1006–1018 (2001)
3. Liou, M., Su, H.-R., Lee, J.-D., Cheng, P.E., Huang, C.-C., Tsai, C.-H.: Bridging Functional MR Images and Scientific Inference: Reproducibility Maps. Journal of Cognitive Neuroscience 15(7), 935–945 (2003)
4. Saha, S., Long, C.J., Brown, E., Aminoff, E., Bar, M., Solo, V.: Hemodynamic Transfer Function Estimation with Laguerre Polynomials and Confidence Intervals Construction from Functional Magnetic Resonance Imaging (fMRI) Data. In: Proc. IEEE ICASSP, vol. 3, pp. 109–112 (2004)
5. Andrews, G.E., Askey, R., Roy, R.: Laguerre Polynomials. In: Special Functions, §6.2, pp. 282–293. Cambridge University Press, Cambridge (1999)
6. Arfken, G.: Laguerre Functions. In: Mathematical Methods for Physicists, 3rd edn., §13.2, pp. 721–731. Academic Press, Orlando, FL (1985)

The Effects of Theta Burst Transcranial Magnetic Stimulation over the Human Primary Motor and Sensory Cortices on Cortico-Muscular Coherence

Murat Saglam1, Kaoru Matsunaga2, Yuki Hayashida1, Nobuki Murayama1, and Ryoji Nakanishi2

1 Graduate School of Science and Technology, Kumamoto University, Japan
2 Department of Neurology, Kumamoto Kinoh Hospital, Japan
[email protected], {yukih,murayama}@cs.kumamoto-u.ac.jp

Abstract. Recent studies proposed that a new paradigm of repetitive transcranial magnetic stimulation (rTMS), "theta burst stimulation" (TBS), applied to the primary motor cortex (M1) or sensory cortex (S1), can influence cortical excitability in humans. In particular, it has been shown that TBS can induce long-lasting effects with stimulation durations shorter than those of conventional rTMS. However, in those studies, the effects of TBS over M1 or S1 were assessed only by means of motor- and/or somatosensory-evoked potentials. Here we asked how the coherence between electromyographic (EMG) and electroencephalographic (EEG) signals during isometric contraction of the first dorsal interosseous muscle is modified by TBS. The coherence magnitude, localized at the C3 scalp site and in the 13-30 Hz band, significantly decreased 30-60 minutes after TBS over M1, but not after TBS over S1, and recovered to the original level within 90-120 minutes. These findings indicate that TBS over M1 can suppress cortico-muscular synchronization.

Keywords: Theta Burst Transcranial Magnetic Stimulation, Coherence, Electroencephalogram, Electromyogram, Motor Cortex.

1 Introduction

Previous studies have demonstrated dense functional and anatomical projections within the motor cortex, building a global network that realizes the communication between the brain and peripheral muscles via the motor pathway [1, 2]. The quality of this communication is thought to depend highly on the efficacy of synaptic transmission between cortical units. In the past few decades, repetitive transcranial magnetic stimulation (rTMS) has been considered a promising method for modifying cortical circuitry by inducing the phenomena of long-term potentiation (LTP) and depression (LTD) of synaptic connections in human subjects [3]. Furthermore, a recently developed rTMS paradigm, called "theta burst stimulation" (TBS), requires fewer stimulation pulses and offers even longer aftereffects than conventional rTMS protocols do [4]. Previously, the efficacy of TBS has been assessed by means of signal transmission from cortex to muscle or from muscle to cortex, by measuring motor-evoked potentials (MEP) or somatosensory-evoked potentials (SEP), respectively. It was shown that TBS applied over the surface of the sensory cortex (S1) as well as the primary motor cortex (M1) could modify the amplitude of the SEP (recorded from the S1 scalp site), lasting for tens of minutes after TBS [5]. On the other hand, the amplitude of the MEP was not modified by TBS applied over S1, while it was significantly decreased by TBS applied over M1 [4, 5]. In the present study, we examined the effects of TBS applied over either M1 or S1 on the functional coupling between cortex and muscle by measuring the coherence between electroencephalographic (EEG) and electromyographic (EMG) signals during voluntary isometric contraction of the first dorsal interosseous (FDI) muscle.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 135–141, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 Methods

2.1 Subjects

Seven subjects among the whole set of recruited participants (approximately twenty) showed significant coherence, and only those subjects participated in the TBS experiments. Experiments on M1 and S1 were performed on different days, and subjects did not report any side effects during or after the experiments.

2.2 Determination of the M1 and S1 Locations

The optimal location of the stimulating coil was determined by searching for the largest MEP response (from the contralateral FDI muscle) elicited by single-pulse TMS while moving the TMS coil in 1 cm steps over the presumed position of M1. Stimulation was applied with a High Power Magstim 200 machine and a figure-of-8 coil with a mean loop diameter of 70 mm (Magstim Co., Whitland, Dyfed, UK). The coil was placed tangentially to the scalp with the handle pointing backwards and laterally at a 45˚ angle away from the midline. Based on previous reports, S1 is assumed to be 2 cm posterior to the M1 site [5].

Fig. 1. EEG-EMG signals recorded before and after the application of TBS, as depicted in the experiment time line. Subjects were asked to contract four times in each recording set. The stimulation location and intensity were determined after the pre30 session. The pre0 recording was done to confirm that the coil search did not produce any conditioning. The TBS paradigm is illustrated in the upper-right inset.

2.3 Theta Burst Stimulation

The continuous TBS (cTBS) paradigm of 600 pulses was applied at the M1 or S1 location. cTBS consists of 50 Hz triplets of pulses repeating every 0.2 s (5 Hz) for 40 s [4]. The intensity of each pulse was set to 80% of the active motor threshold (AMT), defined as the minimum stimulation intensity that could evoke an MEP of no less than 200 μV during slight tonic contraction.

2.4 EEG and EMG Recording

EEG signals were recorded based on the international 10-20 scalp electrode placement method (19 electrodes) with an earlobe reference. The EMG signal, during isometric hand contraction at 15% of the maximum level, was recorded from the FDI muscle of the right hand with reference to the metacarpal bone of the index finger. EEG and EMG signals were recorded with a 1000 Hz sampling frequency and passbands of 0.5-200 Hz and 5-300 Hz, respectively. Each recording set consists of four one-minute-long recordings with 30 s rest intervals. To assess the TBS effect over time, a set was performed 30 minutes before (pre30), just before (pre0), and 0, 30, 60, 90, and 120 minutes after the delivery of TBS. The stimulation location and intensity were determined between the pre30 and pre0 recordings (Fig. 1).

2.5 Data Analysis

The coherence function is the squared magnitude of the cross-spectrum of a signal pair divided by the product of their power spectra. Therefore, cross- and power-spectra between the EMG and the 19 EEG channels were calculated.
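As a sanity check on the protocol arithmetic of Section 2.3: 50 Hz triplets repeating at 5 Hz for 40 s yield 200 bursts × 3 pulses = 600 pulses. A short illustrative sketch (our own, not code from the study):

```python
def ctbs_pulse_times(duration_s=40.0, burst_rate_hz=5.0,
                     pulses_per_burst=3, intra_burst_hz=50.0):
    """Pulse onset times (in s) for a continuous theta burst protocol."""
    n_bursts = int(duration_s * burst_rate_hz)          # 200 bursts in 40 s
    times = []
    for b in range(n_bursts):
        for p in range(pulses_per_burst):
            # burst onset + 20 ms spacing within the 50 Hz triplet
            times.append(b / burst_rate_hz + p / intra_burst_hz)
    return times

pulses = ctbs_pulse_times()
# 600 pulses, all delivered within the 40 s stimulation window
```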
The fast Fourier transform, with an epoch size of 1024 samples (resulting in a frequency resolution of 0.98 Hz), was used to convert the signals into the frequency domain. The current source density (CSD) reference method was utilized in order to achieve spatially sharpened EEG signals [6]. The coherence between EEG and EMG signals was obtained using the expression below:

    0 ≤ κ²_xy(f) = |S_xy(f)|² / (S_xx(f) S_yy(f)) ≤ 1 ,    (1)

where S_xy(f) represents the cross-spectral density function, and S_xx(f) and S_yy(f) stand for the auto-spectral densities of the signals x and y, respectively. Since coherence is a normalized measure of the correlation between signal pairs, κ²_xy(f) = 1 represents a perfect linear dependence and κ²_xy(f) = 0 indicates a lack of linear dependence within those signal pairs. Coherence values κ²_xy(f) > 0 are assumed to be statistically significant only if they lie above the 99% confidence limit, estimated by

    CL(α%) := 1 − (1 − α/100)^{1/(n−1)} ,    (2)

where n is the number of epochs used for the cross- and power-spectra calculations.
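Equations (1) and (2) can be implemented by averaging cross- and power-spectra over non-overlapping epochs. A sketch with NumPy (the epoch size of 1024 follows the text; the averaging scheme, function names, and test signal are our own assumptions):

```python
import numpy as np

def coherence(x, y, fs=1000.0, nper=1024):
    """Magnitude-squared coherence, Eq. (1), from non-overlapping epochs."""
    n = min(len(x), len(y)) // nper                    # number of epochs
    X = np.fft.rfft(np.reshape(x[:n * nper], (n, nper)), axis=1)
    Y = np.fft.rfft(np.reshape(y[:n * nper], (n, nper)), axis=1)
    Sxy = (X * np.conj(Y)).mean(axis=0)                # cross-spectrum
    Sxx = (np.abs(X) ** 2).mean(axis=0)                # auto-spectra
    Syy = (np.abs(Y) ** 2).mean(axis=0)
    k2 = np.abs(Sxy) ** 2 / (Sxx * Syy)
    freqs = np.fft.rfftfreq(nper, d=1.0 / fs)
    return freqs, k2, n

def confidence_limit(n_epochs, alpha_pct=99.0):
    """Confidence limit for zero coherence, Eq. (2)."""
    return 1.0 - (1.0 - alpha_pct / 100.0) ** (1.0 / (n_epochs - 1))
```

Note that with the four one-minute recordings per set (roughly 234 epochs of 1024 samples at 1000 Hz), CL(99%) ≈ 0.02, which is consistent with the significance level quoted in Section 3.
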

3 Results

First, we confirmed that the EEG-EMG coherence values for all subjects (n = 7) lie above the 99% significance level (coh ≈ 0.02) and within the beta (13-30 Hz) frequency

Table 1. Mean ± standard error of the mean (SEM) of beta-band EEG (C3)-EMG coherence values and peak frequencies for the TBS-over-M1 and TBS-over-S1 experiments (n = 7)

                         Pre30        Pre0         0 min        30 min       60 min       90 min       120 min
Coherence      M1        0.061±0.008  0.051±0.014  0.053±0.014  0.030±0.007  0.031±0.009  0.057±0.015  0.067±0.016
magnitude      S1        0.059±0.021  0.068±0.026  0.058±0.027  0.061±0.023  0.052±0.014  0.034±0.006  0.040±0.016
Peak           M1        20.50±1.92   21.80±1.98   22.48±2.04   23.73±1.84   18.57±1.20   20.75±1.62   18.23±1.82
frequency (Hz) S1        21.17±1.79   21.80±1.97   23.08±1.82   23.10±2.15   18.22±2.13   21.17±1.08   19.52±3.21

Fig. 2. Coherence spectra between EEG and EMG signals in the pre30 session. A: Coherence spectra between the 19 EEG channels and the EMG (FDI) for all subjects (n=7), superimposed and arranged topographically according to the approximate locations of the electrodes on the scalp. Each electrode is labeled with respect to its location (Fp: frontal pole, F: frontal, T: temporal, C: central, P: parietal, O: occipital). B: Expanded view of the coherence spectra between EEG (C3 scalp site) and EMG (FDI) for all subjects (n=7). Each line style corresponds to a different subject's coherence spectrum. Only coherence values above the 99% significance level (indicated by the solid horizontal line) are highlighted.

The Effects of Theta Burst Transcranial Magnetic Stimulation

band at the C3 scalp site. Maximum coherence levels were observed at the C3 (n=6) and F3 (n=1) scalp sites, whereas no significant coherence was observed at other locations (Figure 2). These results are in good agreement with previous studies on EEG-EMG coherence during isometric contraction [7, 8]. Table 1 shows the average absolute coherence values and peak frequencies for all trials before and after the application of TBS. Figure 3 shows the normalized EEG (C3)-EMG coherence values averaged over all subjects. Coherence values obtained before the TBS were taken as the control and set to 100%. Average beta-band coherence was suppressed to 56.2% after 30 minutes and to 54.5% after 60 minutes, with statistical significance (p

Δt > 0, and depression for a negative timing difference, Δt < 0. Many have argued that the asymmetry of this rule produces a one-way coupling (see, e.g., [4]). Such arguments would be valid if Δt represented the time difference between post- and presynaptic spikes. However, most experimental studies [3] actually define Δt as the time difference between a postsynaptic spike and the onset or peak of the somatic excitatory postsynaptic potential (EPSP) induced by a presynaptic spike: Δt = t(post spike) − t(EPSP by pre). Hence the above argument does not apply. A somatic EPSP lags behind a presynaptic spike by a few msec. Therefore, if two neurons fire in exact synchrony (Fig. 1g), Δt < 0 holds [12] for both directions, thereby weakening the connections bidirectionally. Now, how does this mechanism convert initial asynchronous firing to clustered synchronous firing (Fig. 1b)? Initial asynchronous firing (Fig. 1a) is represented as firing phases spread evenly around the circle (Fig. 1h, left). The firing remains asynchronous without STDP. However, with the phases of many neurons squeezed onto the circle, any single neuron must have neighboring neurons that unwillingly fire synchronously with it (Fig. 1h).
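The sign argument can be made concrete in a few lines of Python. The window amplitudes, time constant, and the 2 ms EPSP latency below are assumed for illustration; the point is only that exactly synchronous firing gives Δt < 0, and hence depression, in both directions.

```python
import math

def stdp_dw(dt, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Exponential STDP window: potentiation for dt > 0, depression for dt < 0."""
    if dt > 0:
        return a_plus * math.exp(-dt / tau)
    return -a_minus * math.exp(dt / tau)

latency = 2.0          # assumed lag of the somatic EPSP behind the presynaptic spike (ms)
t_a = t_b = 100.0      # neurons A and B fire in exact synchrony (ms)

# dt is measured from the EPSP (pre-spike time + latency) to the postsynaptic spike
dt_a_to_b = t_b - (t_a + latency)   # negative -> depression of the A->B synapse
dt_b_to_a = t_a - (t_b + latency)   # negative -> depression of the B->A synapse
```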
Among these neurons, the abovementioned mechanism weakens the connections bidirectionally. As their synaptic connections weaken, their mutual repulsion is also weakened, which further synchronizes their firing. This positive feedback mechanism develops wireless clusters (Fig. 1h). Although this mechanism qualitatively explains how the clustering happens, the quantitative question of how many clusters are formed requires further consideration. We will see later that a stability analysis predicts the possible number of clusters. In contrast to the vanishing intra-cluster connections, the inter-cluster connections survive and can be unidirectional (Fig. 1d), which defines a cyclic network topology such as that shown in Fig. 1f, upper. Let us ask how we can change this 3-cycle topology. We find that one of the recently observed higher-order rules of STDP [15,16] increases the number of clusters (Fig. 1e). The higher-order rule reported in [16] implies a gross reduction of the LTD effect, because LTP overrides the immediately preceding LTD, whereas LTD only partly cancels the immediately preceding LTP. The weakened LTD effect is likely to increase the total number of potentiated synapses, which is consistent with the increased ratio of black areas in Fig. 1e compared to Fig. 1d. In contrast to the cluster-wise synchrony observed with Model A neurons, Model B neurons that favor synchrony (a = 0.02, b = 0.2, c = −50, d = 40) self-organize into the globally synchronous state with or without STDP (Fig. 2). Due to the global synchrony, mutual synaptic connections are largely lost, and each neuron ends up being driven by the external input individually, having little sense of being present as a population. The global synchrony gives too strong


H. Câteau, K. Kitano, and T. Fukai

Fig. 2. Global synchrony observed with Model B neurons that favor synchrony. A raster plot (a) and a connection matrix (b) of fifty Model B neurons. The neurons were aligned with the connection-based method: neurons are defined to belong to the same cluster whenever their mutual connections are small enough.

an impact and also has minimal coding capacity, because all the neurons behave identically; it appears to bear more similarity to pathological activity such as seizures than to meaningful information processing. By contrast, the clustered synchrony arising in the network of Model A neurons appears functionally useful. Generally in the brain, the unitary EPSP amplitude (∼0.5 mV) is much smaller than the voltage rise needed to elicit firing (∼15 mV). Therefore, single-neuron activity alone cannot cause other neurons to respond, and it is difficult to regard single-neuron activity as a carrier of information transferred back and forth in the brain. In contrast, the self-organized assembly of tens of Model A neurons (Fig. 1d) looks like an ideal candidate for a carrier of information in the brain, because its impact on other neurons is strong enough to elicit responses. Additionally, a cluster can reliably code timing information. The PRC, Z(2πt/T), representing the amount of advance/delay of the next firing time in response to an input at time t in the firing interval [0, T], has mostly been used to decide whether a coupled pair of neurons or oscillators tends to synchronize or desynchronize, under the assumption that the connection strengths between the neurons are equal and unchanged. Specifically, suppose that a pair of neurons are mutually connected and a spike of one neuron introduces a current with the waveform EPSC(t) in the innervated neuron after a transmission delay of τd. The effective PRC, defined as

Γ−(θ) = (1/T) ∫₀ᵀ Z(2πt′/T) EPSC(t′ − τd − Tθ/(2π)) dt′,

is known to determine their synchrony tendency. If the slope of Γ−(θ) at θ = 0 is positive (negative), the two neurons are desynchronized (synchronized). This synchrony condition carries over to a population of neurons coupled in an all-to-all or random manner as long as the connection strengths remain unchanged.
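Numerically, the effective PRC is a periodic average of Z against a delayed, phase-shifted EPSC. The sketch below uses a generic PRC shape and an alpha-function EPSC with assumed period, delay, and time constant; these are illustrative stand-ins, not the Model A/B functions.

```python
import numpy as np

T, tau_d, tau_s = 25.0, 2.0, 2.0   # period, synaptic delay, EPSC time constant (ms); all assumed
ts = np.linspace(0.0, T, 2000, endpoint=False)

def Z(theta):
    """Generic PRC shape (a stand-in for the model's PRC)."""
    return 1.0 - np.cos(theta)

def epsc(t):
    """Alpha-function EPSC, zero for t < 0."""
    return np.where(t > 0, (t / tau_s) * np.exp(1.0 - t / tau_s), 0.0)

def gamma_minus(theta):
    """Effective PRC: average of Z against the delayed, phase-shifted EPSC
    (the EPSC argument is wrapped to one period in this periodic sketch)."""
    arg = (ts - tau_d - T * theta / (2.0 * np.pi)) % T
    return np.mean(Z(2.0 * np.pi * ts / T) * epsc(arg))

thetas = np.linspace(0.0, 2.0 * np.pi, 120, endpoint=False)
G = np.array([gamma_minus(th) for th in thetas])
```

Finite-difference slopes of G at θ = 0 and at θ = 2π − 2π/3 then indicate the pairwise synchrony tendency and the 3-cycle stability, respectively, as described in the text.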
Theoretically calculated Γ−(θ)s for Models A and B (Fig. 3a,b) explain why the all-to-all network of Model A (B) neurons exhibits global asynchrony (synchrony). Note that both Model A and Model B neurons belong to type II [19], so both model neurons favor synchrony if they are delta-coupled with no synaptic delay. After STDP is switched on, the network consisting of Model A neurons self-organizes into the 3-cycle circuit (Fig. 1d), with the successive phase difference of the clustered activity being Δsucθ = 2π/3. Stability analysis shows that the slope

Interactions between STDP and PRC Lead to Wireless Clustering



Fig. 3. Effective PRCs and schema of the triad mechanism. The effective PRCs of Models A and B were calculated with the adjoint method [19] and are shown in (a) and (b), respectively. The slope at θ = 0 is positive for (a) but negative for (b), although this is hardly recognizable at this resolution. The slope at θ = 2π/3 is negative for (a) but positive for (b). The dashed lines represent θ = 2π/3 and θ = 2π/4.

of Γ−(θ), not at the origin but at θ = 2π − Δsucθ, now determines the stability of the 3-cycle activity: the 3-cycle activity is stable if the slope there is negative, i.e., Γ−′(2π − Δsucθ) < 0. Fig. 3a shows that the 3-cycle activity in Fig. 1c is stable. The stable cyclic activity is achieved through the following synergetic process: (1) the PRC determines the preferred network activity (e.g., asynchronous or synchronous), (2) the network activity determines how STDP works, and STDP modifies the network structure (e.g., from all-to-all to cyclic), and (3) the network structure determines how the PRC is read out (e.g., at θ = 0 or θ = 2π − Δsucθ), closing the loop. Generally, we can show that the n-cycle activity whose successive phase difference equals Δsucθ = 2π/n is stable if Γ−′(2π − Δsucθ) < 0. PRCs of biologically plausible neuron models or real neurons [20] tend to have a negative slope in a later phase of the firing interval and converge to zero at θ = 2π, because the membrane potential starts its regenerative depolarization and becomes insensitive to any synaptic input. The corresponding effective PRCs inherit this negative slope in the later phase and tend to stabilize the n-cycle activity for some n.
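The n-cycle criterion amounts to checking the sign of a slope, which is easy to do numerically. The Γ− used below is an arbitrary smooth stand-in (assumed), purely to illustrate the bookkeeping, not the effective PRC of the paper's models.

```python
import numpy as np

def stable_cycles(gamma, n_max=6, h=1e-4):
    """Cycle lengths n for which the slope of gamma at 2*pi - 2*pi/n,
    estimated by a central finite difference, is negative (stable)."""
    stable = []
    for n in range(2, n_max + 1):
        theta = 2.0 * np.pi - 2.0 * np.pi / n
        slope = (gamma(theta + h) - gamma(theta - h)) / (2.0 * h)
        if slope < 0.0:
            stable.append(n)
    return stable

# stand-in effective PRC (an assumed smooth shape, not the paper's Gamma)
gamma_demo = lambda th: np.sin(th + 0.3)
stable = stable_cycles(gamma_demo)   # negative slope at 2*pi - 2*pi/n only for n = 2, 3
```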

3 Self-organization of Hodgkin-Huxley Type Neurons

Next we show that the self-organized cyclic activity with wireless clustering is also observed in a biologically realistic setting. Our simulations, as described in [18], with 200 excitatory and 50 inhibitory neurons modeled with the Hodgkin-Huxley (HH) formalism exhibit 3-cyclic activity with wireless clustering (Fig. 4a,b). The setup here is biologically realistic in that (1) HH-type neurons are used, (2) a physiologically known percentage of inhibitory neurons with non-plastic synapses is included, and (3) neurons fire with high irregularity due to large noise in the background input, unlike the well-regulated firing shown in Fig. 1c. Interestingly, the effective PRC of the HH-type neuron (Fig. 4c) shares important features with that of Model A: a positive initial slope implying a preference for asynchrony, and a negative later slope stabilizing the 3-cycle activity.


Fig. 4. The conductance-based model also develops the wireless clustering. (a) A raster plot of 200 HH-type excitatory neurons showing 3-cycle activity. (b) The corresponding connection matrix showing the wirelessness. (c) Effective PRC, Γ−(θ), of the conductance-based model.

Generally, a technical difficulty in HH simulations is their massive computational demand due to the complexity of the system. That difficulty has hindered theoretical analysis and has left such studies largely experimental. In particular, we had previously tried in vain to understand why we never observed 4-cycle or longer activity. However, the analytic argument developed here with the simplified model gives clear insight into the biologically plausible but complex system. Comparison of Fig. 3a and Fig. 4c reveals that the negative slope of Γ−(θ) of the HH model is located further to the left than that of Model A, indicating less stability of long cycles in the HH simulations. Bearing in mind the larger amount of noise in the HH simulations, it is now understood that 4-cycles and longer can easily be destabilized there. Thus, our analysis developed with the simplified system serves as a useful tool for understanding biologically realistic but complex systems. There is, however, an interesting difference between the Model A and HH simulations. Although intra-cluster wirelessness is a fairly good first approximation in the HH model simulations (Fig. 4b), it is not as exact as in the Model A simulations (Fig. 1d,e). Interestingly, eliminating the residual intra-cluster connections destroys the cyclic activity, suggesting a supportive role for these tiny residual intra-cluster connections.

4 Discussion

In a previous simulation study [17] using the LIF model, cyclic activity was observed to propagate only at the theoretical speed limit: it takes only τd from one cluster to the next, requiring zero membrane integration time. To understand


why this was the case, we first recall that the effective PRC needs a negative slope at 2π − Δsucθ to stabilize the cyclic activity. However, the slope of the PRC of an LIF model, Z(θ) = c exp((T/τm)(θ/2π)), is always positive except at the end point, where Z(2π − 0) = c exp(T/τm) and Z(2π + 0) = c, implying Z′(2π) = −∞. This infinitely sharp negative slope of the PRC at θ = 2π is rounded and displaced to 2π − 2πτd/T in Γ−(θ) (see its definition). Since this is the only place where Γ−(θ) has a negative slope, the cyclic activity is stable only if Δsucθ = 2πτd/T, implying propagation at the theoretical speed limit. We demonstrated an intimate interplay between PRC and STDP using the Izhikevich neuron model as well as the HH-type model. The present study complements previous studies using the phase oscillator [11,14], where its mathematical tractability was exploited to analytically investigate the stability of global phase/frequency synchrony. The self-organization, or unsupervised learning, by STDP studied here complements the supervised learning studied in [22]. The propagation of synchronous firing and the temporal evolution of synaptic strength under STDP are known to be analyzable semi-analytically with the Fokker-Planck equation [5,6,8,9,21]. It is an interesting future direction to see how the Fokker-Planck equation can be used to understand the interplay between PRC and STDP.

Acknowledgement The present authors thank Dr. T. Takewaka at RIKEN BSI for oﬀering the code to calculate the PRC.

References

1. Kuramoto, Y.: Chemical oscillations, waves, and turbulence. Springer, Berlin (1984)
2. Ermentrout, G., Kopell, N.: SIAM J. Math. Anal. 15, 215 (1984)
3. Markram, H., et al.: Science 275, 213 (1997); Bell, C.C., et al.: Nature 387, 278 (1997); Magee, J.C., Johnston, D.: Science 275, 209 (1997); Bi, G.-Q., Poo, M.-M.: J. Neurosci. 18, 10464 (1998); Feldman, D.E.: Neuron 27, 45 (2000); Nishiyama, M., et al.: Nature 408, 584 (2000)
4. Song, S., et al.: Nat. Neurosci. 3, 919 (2000)
5. van Rossum, M.C., Turrigiano, G.G., Nelson, S.B.: J. Neurosci. 22, 1956 (2000)
6. Rubin, J., et al.: Phys. Rev. Lett. 86, 364 (2001)
7. Abbott, L.F., Nelson, S.B.: Nat. Neurosci. 3, 1178 (2000)
8. Gerstner, W., Kistler, W.M.: Spiking Neuron Models. Cambridge University Press, Cambridge (2002)
9. Câteau, H., Fukai, T.: Neural Comput. 15, 597 (2003)
10. Izhikevich, E.M.: IEEE Trans. Neural Netw. 15, 1063 (2004)
11. Karbowski, J.J., Ermentrout, G.B.: Phys. Rev. E 65, 031902 (2002)
12. Nowotny, T., et al.: J. Neurosci. 23, 9776 (2003)
13. Zhigulin, V.P., et al.: Phys. Rev. E 67, 021901 (2003)
14. Masuda, N., Kori, H.: J. Comput. Neurosci. 22, 327 (2007)
15. Froemke, R.C., Dan, Y.: Nature 416, 433 (2002)

16. Wang, H.-X., et al.: Nat. Neurosci. 8, 187 (2005)
17. Levy, N., et al.: Neural Netw. 14, 815 (2001)
18. Kitano, K., Câteau, H., Fukai, T.: Neuroreport 13, 795 (2002)
19. Ermentrout, G.B.: Neural Comput. 8, 979 (1996)
20. Reyes, A.D., Fetz, E.E.: J. Neurophysiol. 69, 1673 (1993); Reyes, A.D., Fetz, E.E.: J. Neurophysiol. 69, 1661 (1993); Oprisan, S.A., Prinz, A.A., Canavier, C.C.: Biophys. J. 87, 2283 (2004); Netoff, T.I., et al.: J. Neurophysiol. 93, 1197 (2005); Lengyel, M., et al.: Nat. Neurosci. 8, 1667 (2005); Galan, R.F., Ermentrout, G.B., Urban, N.N.: Phys. Rev. Lett. 94, 158101 (2005); Preyer, A.J., Butera, R.J.: Phys. Rev. Lett. 95, 13810 (2005); Goldberg, J.A., Deister, C.A., Wilson, C.J.: J. Neurophysiol. 97, 208 (2007); Tateno, T., Robinson, H.P.: Biophys. J. 92, 683 (2007); Mancilla, J.G., et al.: J. Neurosci. 27, 2058 (2007); Tsubo, Y., et al.: Eur. J. Neurosci. 25, 3429 (2007)
21. Câteau, H., Reyes, A.D.: Phys. Rev. Lett. 96, 058101 (2006), and references therein
22. Lengyel, M., et al.: Nat. Neurosci. 8, 1677 (2005)

A Computational Model of Formation of Grid Field and Theta Phase Precession in the Entorhinal Cells

Yoko Yamaguchi1, Colin Molter1, Wu Zhihua1,2, Harshavardhan A. Agashe1, and Hiroaki Wagatsuma1

1 Lab for Dynamics of Emergent Intelligence, RIKEN Brain Science Institute, Wako, Saitama, Japan 2 Institute of Biophysics, Chinese Academy of Sciences, Beijing, China [email protected]

Abstract. This paper proposes a computational model of spatio-temporal property formation in the entorhinal neurons recently known as “grid cells”. The model consists of module structures for local path integration, multiple sensory integration and for theta phase coding of grid fields. Theta phase precession naturally encodes the spatial information in theta phase. The proposed module structures have good agreement with head direction cells and grid cells in the entorhinal cortex. The functional role of theta phase coding in the entorhinal cortex for cognitive map formation in the hippocampus is discussed. Keywords: Cognitive map, hippocampus, temporal coding, theta rhythm, grid cell.

1 Introduction

In rodents, it is well known that a hippocampal neuron increases its firing rate at some specific position in an environment [1]. These neurons are called place cells and are considered to provide the neural representation of a cognitive map. Recently it was found that entorhinal neurons, which give major inputs to the hippocampus, fire at positions distributed in the form of a triangular-grid-like pattern in the environment [2]. They are called "grid cells" and their spatial firing preference is termed a "grid field". Interestingly, temporal coding of spatial information, "theta phase precession", initially found in hippocampal place cells, was also observed in grid cells in the superficial layer of the entorhinal cortex [3], as shown in Figs. 1-3. A sequence of neural firing is locked to the theta rhythm (4-12 Hz) of the local field potential (LFP) during spatial exploration. As a step toward understanding cognitive map formation in the rat hippocampus, the mechanism that forms the grid field and the mechanism of phase precession in grid cells must be clarified. Here we propose a model of neural computation to create grid cells based on known properties of entorhinal neurons, including "head direction cells", which fire

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 151-159, 2008. © Springer-Verlag Berlin Heidelberg 2008


Y. Yamaguchi et al.

when the animal's head has some specific direction in the environment. We demonstrate that theta phase precession in the entorhinal cortex naturally emerges as a consequence of the grid cell formation mechanism.

Fig. 1. Theta phase precession observed in rat hippocampal place cells. When the rat traverses a place field, the spike timing of the place cell gradually advances relative to the local field potential (LFP) theta rhythm. In a running sequence through place fields A-B-C, the spike sequence in the order A-B-C emerges in each theta cycle. The spike sequence repeatedly encoded in theta phase is considered to lead to robust on-line memory formation of the running experience through asymmetric synaptic plasticity in the hippocampus.

Fig. 2. Network structure of the hippocampal formation (DG, CA3, CA1, the entorhinal cortex (EC deeper layer and EC superficial layer)) and cortical areas giving multimodal input. Theta phase precession was initially found in the hippocampus, and was also found in the EC superficial layer. The EC superficial layer can therefore be considered an origin of theta phase precession.


Fig. 3. Top) A grid field of an entorhinal grid cell (left) and a place field of a hippocampal place cell (right). Bottom) Theta phase precession observed in the EC grid cell and in the hippocampal place cell.

2 Model

The firing rate of the ith grid cell at a location (x, y) in a given environment increases in the condition given by the relation:

x = αi + nAi cos φi + mAi cos(φi + π/3),
y = βi + nAi sin φi + mAi sin(φi + π/3),   with n, m = integer + r,   (1)

where φi, Ai and (αi, βi) denote an angle characterizing the grid orientation, the distance between nearby vertices, and the spatial phase of the grid field in the environment, respectively. The parameter r is less than 1.0 and gives the relative size of the field with a high firing rate.
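For illustration, sweeping integer (n, m) in Eq. (1) generates the grid vertices, which form a triangular lattice with nearest-neighbor spacing Ai. The parameter values below are assumed for the example only.

```python
import numpy as np

def grid_vertices(alpha, beta, A, phi, k=3):
    """Vertex positions of Eq. (1) for integer n, m in [-k, k]."""
    n, m = np.meshgrid(np.arange(-k, k + 1), np.arange(-k, k + 1))
    x = alpha + n * A * np.cos(phi) + m * A * np.cos(phi + np.pi / 3)
    y = beta + n * A * np.sin(phi) + m * A * np.sin(phi + np.pi / 3)
    return np.column_stack([x.ravel(), y.ravel()])

pts = grid_vertices(alpha=0.2, beta=-0.1, A=0.5, phi=0.3)
```

Because the two basis directions differ by π/3, every vertex has six nearest neighbors at distance A, i.e., a triangular grid.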


Fig. 4. Illustration of a hypercolumn structure for grid field computation in the hypothesized entorhinal cortex. The bottom layer consists of local path integration modules with a hexagonal direction system. The middle layer associates the output of local path integration with visual cues in a given environment. The top layer consists of triplets of grid cells whose grid fields have a common orientation, a common spatial scale, and complementary spatial phases. Phase precession is generated by the grid cell in each grid field.

The computational goal in creating a grid field is to find the region with n, m = integer + r. We hypothesize that the deeper layer of the entorhinal cortex works as a local path integration system using head direction and running velocity. The local path integration results in a variable with slow gradual change that forms a grid field. This change can cause the gradual phase shift of theta phase precession in accordance with the phenomenological model of theta phase precession by Yamaguchi et al. [4]. The schematic structure of the hypothesized entorhinal cortex and multimodal sensory system is illustrated in Fig. 4. The entorhinal layer includes head direction cells in the deeper layer and grid cells in the superficial layer. Cells with theta phase precession can be considered to be stellate cells. The set of modules along the vertical direction forms a kind of functional column with a direction preference, and these columns form a hypercolumnar structure covering a set of directions. The mechanisms of the individual modules are explained below.

2.1 A Module of Local Path Integrator

The local path integration module consists of six units. During the animal's locomotion with a given head direction and velocity, each unit integrates the running distance in its direction with an angle-dependent coefficient. The units have preferred vector directions distributed at π/3 intervals, as shown in Fig. 5. The computation of the animal's displacement in given directions in this module is illustrated in Fig. 6. The maximum integration length in each direction is assumed to be common within a module, corresponding to the distance between nearby vertices of the subsequently formed grid field. This computation gives (n, m) in Eq. (1). These systems are distributed in the deeper layer of the entorhinal cortex, in agreement with observations of head direction cells. Different modules have different vector


directions and form a hypercolumn set covering all running directions, or equivalently all orientations of the resultant grid field. The entorhinal cortex is considered to include multiple hypercolumns with different spatial scales. They are considered to work in parallel, possibly to give stability in a global space by compensating for the accumulation of local errors.
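The stated computational goal — detecting where n and m in Eq. (1) are both within r of an integer — can be sketched by inverting the two lattice basis vectors. This is an illustrative reading of the model with assumed default parameters, not the authors' implementation.

```python
import numpy as np

def in_grid_field(x, y, alpha=0.0, beta=0.0, A=1.0, phi=0.0, r=0.2):
    """True when the lattice coordinates (n, m) of (x, y) under Eq. (1)
    are both within r of an integer, i.e. the position is near a vertex."""
    B = np.array([[A * np.cos(phi), A * np.cos(phi + np.pi / 3)],
                  [A * np.sin(phi), A * np.sin(phi + np.pi / 3)]])
    nm = np.linalg.solve(B, np.array([x - alpha, y - beta]))
    return bool(np.all(np.abs(nm - np.round(nm)) < r))
```

For example, with phi = 0 the point (2.5, sin(π/3)) is the vertex (n, m) = (2, 1) and lies in a field, while the midpoint (0.5, 0) does not.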

Fig. 5. Left) A module of the local path integrator with hexagonal direction vectors and a common vector size. Right) Activity coefficient of each vector unit. A set of these vector units computes the animal's motion in a given head direction to give a displacement-distance measure.

Fig. 6. Illustration of the computation of local path integration in a module. Animal locomotion in a given head direction is computed by a pair of vector units among the six vectors to give a position measure.

2.2 Grid Field Formation with Visual Cues

The computational results of local path integration are projected to the next module in the superficial layer of the entorhinal cortex, which receives multiple sensory inputs in a given environment. The association of path integration and visual cues determines the relative location of the path integration measure (αi, βi) in Eq. (1) in the module. Further interaction within a set of three cells, as shown in Fig. 7, can make the parameters (αi, βi) robust. A possible interaction among these cells is mutual inhibition, giving complementary distributions of the three grid fields.


2.3 Theta Phase Precession in the Grid Field

The input of the parameters (n, m) and (αi, βi) to a cell in the next module, at the top of the column, can cause theta phase precession. It is obtained by the fundamental mechanism of theta phase precession proposed by Yamaguchi et al. [4][5]. The mechanism requires a gradual increase of the natural frequency in a cell with oscillatory activity. Here we consider the top module to consist of stellate cells with intrinsic theta oscillation. The gradual increase in natural frequency is expected to emerge from the path integration input at each vertex of a grid field.

Fig. 7. A triplet of grid fields with the same local path integration system and different spatial phases can generate a mostly uniform spatial representation, in which a single grid cell fires at every location. The uniformity can help the robust assignment of environmental spatial phases with the help of environmental sensory cues. The association is processed in the middle part of each column in the entorhinal cortex. Projection of each cell's output to the module at the top generates a grid field with theta phase precession, as explained in the text.

3 Mathematical Formulation

A simple mathematical formulation of the above model is given phenomenologically below. The locomotion of the animal is represented by the current displacement (R, φc), computed from the head direction φH and the running velocity. An elementary vector at a column of the local path integration system has a vector angle φi and length A. The output I of the ith vector unit is given by

I(φi) = 1 if |φi − φH| < π/2 and −r < S(φi) < r, and I(φi) = 0 otherwise,   (2)

with S(φi) = R cos(φi − φc) (S mod A),


where r and A represent the field radius and the distance between neighboring grid vertices, respectively. The output Di of the path integration module to the middle layer is given by

Di = ∏k I(φi + kπ/3).   (3)

Through association with visual cues, the spatial phase of the grid is determined (details are not shown here). The input given by Eqs. (2)-(3) from the middle layer to the top layer provides on-off regulation and also a parameter with a gradual increase within a grid field. The dynamics of the membrane potential Gi of the cell at the top layer are given by

dGi/dt = f(Gi, t) + aDi S(φi) + Itheta,   (4)

where f is a function describing time-dependent ionic currents and a is a constant. The last term, Itheta, denotes a sinusoidal current representing the theta oscillation of inhibitory neurons. For proper dynamics of f, the second term on the right-hand side activates the grid cell oscillation and gradually increases its natural frequency. According to our former results using a phenomenological model [5], the theta current then leads to phase locking of grid cells with a gradual phase shift. This realizes a cell with a grid field and theta phase precession. One can test Eq. (4) with several types of equations, including simple reduced models and biophysical models of hippocampal or entorhinal cells. An example of computer experiments is given in the following section.
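The essence of Eq. (4) — an oscillation whose natural frequency gradually rises while it is phase-locked to a fixed theta input — can be caricatured with a single forced phase equation. All numbers below (8 Hz theta, a 2 Hz frequency ramp, coupling K) are assumed for illustration; in the locked state the phase lead is arcsin(2πΔf/K), which grows with the frequency mismatch Δf, reproducing the gradual phase shift described above.

```python
import numpy as np

f_theta, K = 8.0, 30.0           # LFP theta frequency (Hz) and coupling (1/s); assumed
dt = 1e-4
t = np.arange(0.0, 1.0, dt)      # one second of simulated time
f_cell = f_theta + 2.0 * t       # natural frequency ramps from 8 Hz to 10 Hz

theta = 2.0 * np.pi * f_theta * t     # LFP theta phase
psi = np.zeros_like(t)                # phase of the cell oscillation
for i in range(1, t.size):
    # phase model: own frequency plus attraction toward the theta phase
    dpsi = 2.0 * np.pi * f_cell[i] + K * np.sin(theta[i] - psi[i - 1])
    psi[i] = psi[i - 1] + dpsi * dt

lead = psi - theta               # phase of the cell relative to theta
```

Spikes emitted at a fixed phase of ψ therefore occur at progressively earlier phases of the theta cycle, i.e., phase precession.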

4 Computer Simulation of Theta Phase Precession

The mechanism of theta phase precession was proposed phenomenologically by Yamaguchi et al. [4][5] as the coupling of two oscillations: the LFP theta oscillation with a constant frequency, and a sustained oscillation with a gradually increasing natural frequency. In the presence of LFP theta, the sustained oscillation exhibits a gradual phase shift through quasi-steady states of phase locking. A simulation using a hippocampal pyramidal cell model [6] is shown in Fig. 8. It is clearly seen that LFP theta instantaneously captures the oscillation with gradually increasing natural frequency into a quasi-stable phase at each theta cycle, giving a gradual phase shift. This phase shift is robust against perturbations, as a consequence of phase locking in nonlinear oscillations. A simulation with a model of an entorhinal stellate cell [7] was also performed, and we obtained similar phase precession with the stellate cell model. One important property of stellate cells is the presence of subthreshold oscillations, whose synchronization can be reduced to the simple behavior of a phase model. Thus, the mechanism of the phenomenological model [5] provides a comprehensive description of the phase locking of complex biophysical neuron models.


Fig. 8. Computer experiment of theta phase precession using a hippocampal pyramidal neuron model [6]. (a) Bottom: input current with a gradual increase; top: the resultant sustained oscillation of the membrane potential with a gradual increase in natural frequency. (b) In the presence of LFP theta (middle), the neuronal activity exhibits theta phase precession.

5 Discussions and Conclusion

We have elucidated a computational model of grid cells in the entorhinal cortex to investigate how temporal coding works for spatial representation in the brain. A computational model of grid field formation was proposed based on local path integration. This assumption was found to give theta phase precession within the grid field. This computational mechanism does not need the assumption of learning over repeated trials in a novel environment, but enables instantaneous spatial representation. Furthermore, this model is in good agreement with experimental observations of head direction cells and grid cells. The networks proposed in the model predict local interaction networks in the entorhinal cortex and also head direction systems distributed over many areas. Although the computation of place cells based on grid cells is beyond the scope of this paper, the emergence of theta phase precession in the entorhinal cortex can be used for place cell formation and also for instantaneous memory formation in the hippocampus [8]. These computational model studies, with their space-time structure for environmental space representation, highlight the temporal coding over distributed areas used in real-time processing of spatial information in an ever-changing environment.


References

1. O'Keefe, J., Nadel, L.: The hippocampus as a cognitive map. Clarendon Press, Oxford (1978)
2. Fyhn, M., Molden, S., Witter, M., Moser, E.I., Moser, M.B.: Spatial representation in the entorhinal cortex. Science 305, 1258-1264 (2004)
3. Hafting, T., Fyhn, M., Moser, M.B., Moser, E.I.: Phase precession and phase locking in entorhinal grid cells. Program No. 68.8, Neuroscience Meeting Planner, Atlanta, GA: Society for Neuroscience, Online (2006)
4. Yamaguchi, Y., Sato, N., Wagatsuma, H., Wu, Z., Molter, C., Aota, Y.: A unified view of theta-phase coding in the entorhinal-hippocampal system. Current Opinion in Neurobiology 17, 197-204 (2007)
5. Yamaguchi, Y., McNaughton, B.L.: Nonlinear dynamics generating theta phase precession in hippocampal closed circuit and generation of episodic memory. In: Usui, S., Omori, T. (eds.) The Fifth International Conference on Neural Information Processing (ICONIP 1998) and The 1998 Annual Conference of the Japanese Neural Network Society (JNNS 1998), Kitakyushu, Japan, vol. 2, pp. 781-784. IOS Press, Amsterdam (1998)
6. Pinsky, P.F., Rinzel, J.: Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. Journal of Computational Neuroscience 1, 39-60 (1994)
7. Fransén, E., Alonso, A.A., Dickson, C.T., Magistretti, J., Hasselmo, M.E.: Ionic mechanisms in the generation of subthreshold oscillations and action potential clustering in entorhinal layer II stellate neurons, 14(3), 368-384 (2004)
8. Molter, C., Yamaguchi, Y.: Theta phase precession for spatial representation and memory formation. In: The 1st International Conference on Cognitive Neurodynamics (ICCN 2007), Shanghai, 2-09-0002 (2007)

Working Memory Dynamics in a Flip-Flop Oscillations Network Model with Milnor Attractor David Colliaux1,2 , Yoko Yamaguchi1 , Colin Molter1 , and Hiroaki Wagatsuma1 1

Lab for Dynamics of Emergent Intelligence, RIKEN BSI, Wako, Saitama, Japan 2 Ecole Polytechnique (CREA), 75005 Paris, France [email protected]

Abstract. A phenomenological model is developed where complex dynamics are the correlate of spatio-temporal memories. If resting is not a classical ﬁxed point attractor but a Milnor attractor, multiple oscillations appear in the dynamics of a coupled system. This model can be helpful for describing brain activity in terms of well classiﬁed dynamics and for implementing human-like real-time computation.

1

Introduction

Neuronal collective activities of the brain are widely characterized by oscillations in humans and animals [1][2]. Among various frequency bands, distant synchronization in theta rhythms (4-8 Hz oscillation defined in human EEG) has recently been linked to working memory, a short-term memory for central execution, in human scalp EEG [3][4] and in neural firing in monkeys [5][6]. For long-term memory, information coding is mediated by synaptic plasticity, whereas short-term memory is stored in neural activities [7]. Recent neuroscience studies have reported various types of persistent activities of single neurons and of populations of neurons as possible mechanisms of working memory. Among those, bistable states of the membrane potential, up- and down-states, and their flip-flop transitions were measured in a number of cortical and subcortical neurons. The up-state, characterized by frequent firing, shows stability for seconds or more due to network interactions [8]. However, little is known about whether flip-flop transitions and distant synchronization work together, or about what kinds of processing are enabled by a flip-flop oscillation network. An associative memory network with flip-flop changes was proposed for working memory within the classical rate coding view [9], while further consideration of the dynamical linking property based on firing oscillation, such as the synchronization of theta rhythms mentioned above, is likely essential for the elucidation of multiple attractor systems. Besides, Milnor extended the concept of attractors to invariant sets with Lyapunov instability, which have been of interest in physical, chemical and biological systems; they might allow high freedom in spontaneous switching among semi-stable states [12]. In this paper, we propose a model of oscillation associative memory with flip-flop changes for working memory. We found that

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 160–169, 2008. © Springer-Verlag Berlin Heidelberg 2008


the Milnor attractor condition is satisﬁed in the resting state of the model. We will ﬁrst study how the Milnor attractor appears and will then show possible behaviors of coupled units in the Milnor attractor condition.

2 A Network Model

2.1 Structure

In order to realize up- and down-states, where the up-state is associated with oscillation, phenomenological models are joined. Traditionally, associative memory networks are described by state variables representing the membrane potentials {Si} [9]. Oscillation is assumed to appear in the up-state as an internal process, with a phase variable φi for the ith unit. The oscillation dynamics is simply given by a phase model with a resting state and periodic motion [10,11]. The term cos(φi) stands for an oscillation current in the dynamics of the membrane potential.

2.2 Mathematical Formulation of the Model

The flip-flop oscillations network of N units is described by the set of state variables {Si, φi} ∈ ℝN × [0, 2π[N (i ∈ [1, N]). The dynamics of Si and φi are given by the following equations:

dSi/dt = −Si + Σj Wij R(Sj) + σ(cos(φi) − cos(φ0)) + I±
dφi/dt = ω + (β − ρSi) sin(φi)        (1)

with R(x) = (1/2)(tanh(10(x − 0.5)) + 1), φ0 = arcsin(−ω/β) and cos(φ0) < 0. R is the spike density of the units, and the input I± is applied as positive (I+) or negative (I−) pulses (50 time steps), so that we can focus on the persistent activity of the units after a phasic input. ω and β are respectively the frequency and the stabilization coefficient of the internal oscillation. ρ and σ represent the mutual feedback between the internal oscillation and the membrane potential. Wij are the connection weights describing the strength of the coupling between units i and j. φ0 is known to be a stable fixed point of the equation for φ, and 0 to be a fixed point of the S equation.
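Eq. (1) can be integrated directly. The following is a minimal sketch (Python/NumPy assumed; the paper prescribes no implementation) of forward-Euler integration for one isolated unit, all Wij = 0:

```python
import numpy as np

def simulate_unit(T=100.0, dt=0.01, omega=1.0, beta=1.2,
                  rho=1.0, sigma=0.5, I=0.0):
    """Forward-Euler integration of Eq. (1) for one isolated unit (all Wij = 0)."""
    # phi0 solves sin(phi0) = -omega/beta on the branch where cos(phi0) < 0
    phi0 = np.pi + np.arcsin(omega / beta)
    S, phi = 0.0, phi0                      # start at the resting state M0 = (0, phi0)
    traj = np.empty((int(round(T / dt)), 2))
    for k in range(traj.shape[0]):
        dS = -S + sigma * (np.cos(phi) - np.cos(phi0)) + I
        dphi = omega + (beta - rho * S) * np.sin(phi)
        S, phi = S + dt * dS, (phi + dt * dphi) % (2 * np.pi)
        traj[k] = S, phi
    return traj

traj_rest = simulate_unit(I=0.0)   # with no input the unit stays at M0
```

Starting exactly at M0 = (0, φ0) both derivatives vanish, so the unit stays at rest; a constant input strong enough to violate condition (4) of Section 3.2 instead yields a sustained oscillation of S.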

3 An Isolated Unit

3.1 Resting State

The resting state is the stable equilibrium when I = 0 for a single unit. We assume ω < β, so that M0 = (0, φ0) is a fixed point of the system. To study the linear stability of this fixed point, we write the stability matrix around M0:

DF|M0 = ( −1          −σ sin(φ0) )
        ( −ρ sin(φ0)   β cos(φ0) )        (2)


The sign of the eigenvalues of DF|M0, and thus the stability of M0, depends only on μ = ρσ. With our choice of ω = 1 and β = 1.2, μc ≈ 0.96. If μ < μc, M0 is a stable fixed point and there is another fixed point M1 = (S1, φ1), with φ1 < φ0, which is unstable. If μ > μc, M0 is unstable and M1 is stable, with φ1 > φ0. The fixed points exchange stability as the bifurcation parameter μ increases (a transcritical bifurcation). The simplified system along the eigenvectors (X1, X2) of the matrix DF|M0 gives a clear illustration of the bifurcation:

dx1/dt = a x1² + λ1 x1
dx2/dt = λ2 x2        (3)

Here λ1 = 0 corresponds to μ = μc, and in this condition there is a positive-measure basin of attraction although some directions are unstable. The resting state M0 is then not a classical fixed point attractor, because it does not attract all trajectories from an open neighborhood, but it is still an attractor under Milnor's extended definition of attractors. The phase plane (S, φ) in Fig. 1 shows that, for μ close to the critical value, the nullclines cross twice while staying close to each other in between. That narrow channel makes the configuration indistinguishable from a Milnor attractor in computer experiments.
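The one-sided attraction at criticality can be checked directly on the x1 equation of (3): with the linear coefficient at zero, trajectories on one side of the origin converge to it while those on the other side are repelled, which is exactly the positive-measure-basin-with-unstable-direction situation. A minimal numerical sketch (Python assumed; a = 1 is an illustrative choice):

```python
def flow(x0, a=1.0, lam=0.0, T=5.0, dt=1e-3):
    """Euler integration of the x1 equation of (3): dx/dt = a*x**2 + lam*x."""
    x = x0
    for _ in range(int(round(T / dt))):
        x += dt * (a * x * x + lam * x)
    return x

x_left = flow(-0.5)   # this side is attracted to the origin
x_right = flow(0.1)   # this side is slowly repelled from it
```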

Fig. 1. Top: Phase space (S, φ) with the vector field and nullclines of the system. The dashed domain in B shows that M0 has a positive-measure basin of attraction when μ = μc. Bottom: Fixed points with their stable and unstable directions for the equivalent simplified system. A: μ < μc. B: μ = μc. C: μ > μc.

Since we showed that μ is the crucial parameter for the stability of the resting state, we can now set ρ = 1 and study the dynamics as a function of σ, with a close look near the critical regime (σ = μc).
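The quoted μc can be recovered numerically from the stability matrix (2). A sketch (Python/NumPy assumed) that bisects on the largest real part of the eigenvalues of DF|M0, for ω = 1 and β = 1.2:

```python
import numpy as np

omega, beta = 1.0, 1.2
phi0 = np.pi + np.arcsin(omega / beta)   # sin(phi0) = -omega/beta, cos(phi0) < 0

def max_eig(mu, rho=1.0):
    """Largest real part of the eigenvalues of DF|M0, Eq. (2), with mu = rho*sigma."""
    sigma = mu / rho
    DF = np.array([[-1.0,                 -sigma * np.sin(phi0)],
                   [-rho * np.sin(phi0),   beta * np.cos(phi0)]])
    return np.linalg.eigvals(DF).real.max()

# Bisection on mu: the resting state is stable while max_eig(mu) < 0.
lo, hi = 0.0, 2.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if max_eig(mid) < 0.0 else (lo, mid)
mu_c = 0.5 * (lo + hi)
```

This gives μc ≈ 0.955, consistent with the μc ≈ 0.96 quoted in the text; since the trace of DF|M0 is negative here, the instability sets in where det DF|M0 = 0, i.e. at μ = −β cos(φ0)/sin²(φ0).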

3.2 Constant Input Can Give Oscillations

Under constant input there are two possible dynamics: a fixed point and a limit cycle. If

|ω / (β − S)| < 1        (4)

there is a stable fixed point (S1, φ1), with φ1 the solution of

ω + (β − σ(cos(φ1) − cos(φ0)) − I) sin(φ1) = 0,   S1 = σ(cos(φ1) − cos(φ0)) + I        (5)

If condition (4) is not satisfied, the φ equation in (1) gives rise to oscillatory dynamics. Identifying S with its temporal average, dφ/dt = ω + Γ sin(φ) with Γ = β − S is periodic with period

T = ∫0^2π dφ / (ω + (β − S) sin(φ))

This approximation gives an oscillation at frequency ω' = √(ω² − (β − S)²), which is in good qualitative agreement with computer experiments (Fig. 2).
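The period integral and the closed-form frequency can be checked against each other. A sketch (Python/NumPy assumed; the average value S = 0.8 is an illustrative choice satisfying |β − S| < ω):

```python
import numpy as np

omega, beta, S_avg = 1.0, 1.2, 0.8       # S identified with its temporal average
Gamma = beta - S_avg                      # |Gamma| < omega, so phi keeps rotating

# Numerical period: trapezoidal quadrature of the integral quoted in the text.
phi = np.linspace(0.0, 2.0 * np.pi, 200001)
g = 1.0 / (omega + Gamma * np.sin(phi))
T = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(phi))

f_numeric = 2.0 * np.pi / T               # angular frequency from the integral
f_theory = np.sqrt(omega**2 - Gamma**2)   # closed form quoted in the text
```

Both give ≈ 0.917 for these values; the closed form follows from the standard integral ∫0^2π dφ/(a + b sin φ) = 2π/√(a² − b²) for a > |b|.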


Fig. 2. For each value of the constant current I, the maximum and minimum values of S1 are plotted. The dominant frequency of S1, obtained by FFT, is compared to the theoretical value when S is identified with its temporal average.

If we inject an oscillatory input into the system, S oscillates at the same frequency provided the input frequency is low. For higher frequencies, S cannot follow the input and shows complex oscillatory dynamics with multiple frequencies.

4 Two Coupled Units

For two coupled units, a flip-flop of oscillations is observed under various conditions. We will analyze the case μ = 0 and the flip-flop properties under various connection weight strengths, assuming symmetrical connections (W12 = W21 = W).

4.1 Influence of the Feedback Loop

In equation (1), ρ and σ implement a feedback loop representing the mutual influence of φ and S within each unit.

The Case μ = 0. In the case σ = 0 or ρ = 0, φ remains constant (φ = φ0): the system is then a classical recurrent network. This model was used to provide an associative memory network storing patterns in fixed point attractors [9]. For small coupling strength, the resting state is a fixed point. For strong coupling strength, two more fixed points appear: one unstable, corresponding to a threshold, and one stable, providing memory storage. After a transient positive input I+ above threshold, the coupled system will be in the up-state. A transient negative input I− can bring it back to the resting state. For a small perturbation (σ ≪ 1 and ρ = 1), the active state is a small up-state oscillation, but the associative memory properties (storage, completion) are preserved.

Growing Oscillations. The up-state oscillation in the membrane potential dynamics, triggered by giving an I+ pulse to unit 1, grows as σ increases and saturates to an up-state fixed point for strong feedback. Interestingly, for a range of feedback strength values near μc, S returns transiently near the Milnor attractor resting state. Projection of the trajectories of the 4-dimensional system onto a 2-dimensional plane section P illustrates these complex dynamics (Fig. 3). A cycle would intersect this plane in two points. For each σ value, we consider S1 at these intersection points. For a range between 0.91 and 1.05 with our choice of parameters, there are many more than two intersection points M∗, suggesting chaotic dynamics.
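The classical flip-flop behavior in the μ = 0 case can be sketched with two symmetrically coupled units and the spike density R of Eq. (1) (Python/NumPy assumed; W = 2 and the pulse amplitudes, signs and timings are illustrative choices, not values from the paper):

```python
import numpy as np

def pair_memory(W=2.0, T=40.0, dt=0.01):
    """Two units at mu = 0 (phi frozen), i.e. dS_i/dt = -S_i + W*R(S_j) + I:
    a classical flip-flop memory switched by transient pulses."""
    R = lambda x: 0.5 * (np.tanh(10.0 * (x - 0.5)) + 1.0)
    S = np.zeros(2)
    samples = {}
    for k in range(int(round(T / dt))):
        t = k * dt
        I = 1.0 if 5.0 <= t < 7.0 else (-2.0 if 25.0 <= t < 27.0 else 0.0)
        S = S + dt * (-S + W * R(S[::-1]) + I)   # S[::-1] swaps the two partners
        if abs(t - 20.0) < dt / 2:
            samples['after_on'] = S.copy()       # long after the I+ pulse
        if abs(t - 39.0) < dt / 2:
            samples['after_off'] = S.copy()      # long after the I- pulse
    return samples

states = pair_memory()
```

A sufficiently strong I+ pulse drives both units above the threshold fixed point, after which they hold the up-state S ≈ W (R saturates near 1); a strong I− pulse erases the memory and the pair relaxes back to rest.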

4.2 Influence of the Coupling Strength

After a transient input, the dynamics of two coupled units can be a fixed point attractor, as in the resting state (I = 0), or a down-state or up-state oscillation, depending on the coupling strength. Near the critical value of the feedback loop, in addition to these, more complex dynamics occur for intermediate coupling strength.


Fig. 3. A: Influence of the feedback loop. Bifurcation diagram according to σ (Top); S1 coordinates of the intersection points of the trajectory with a plane section P according to σ (Bottom). B: Influence of the coupling strength. S1 maximum and minimum values and average phase difference (φ1 − φ2) according to W (Top); S1 coordinates of the intersection points of the trajectory with a plane section P according to W (Bottom).

Down-state Oscillation. For small coupling strength, the system periodically visits the resting state for a long time and goes briefly to the up-state. The frequency of this oscillation increases with the coupling strength. The two units are in anti-phase (when Si takes its maximum value, Sj takes its minimum value), Fig. 4 (Bottom).

Up-state Oscillation. For strong coupling strength, a transient input to unit 1 leads to an up-state oscillation, Fig. 4 (Top). The two units are perfectly in-phase at W = 0.75, and the phase difference stays small for stronger coupling.

Chaotic Dynamics. For intermediate coupling strength, an intermediate cycle is observed and more complex dynamics occur for a small range (0.58 < W < 0.78


Fig. 4. Si temporal evolution, (S1 , S2 ) phase plane and (Si , φi ) cylinder space. Top: Up-state oscillation for strong coupling. Middle: Multiple frequency oscillation for intermediate coupling. Bottom: Down-state oscillation for weak coupling.

with our parameters) before full synchronization, characterized by φ1 − φ2 = 0. The trajectory can have many intersection points with P, and S∗ in Fig. 3 shows multiple roads to chaos through period doubling.

5 Application to Slow Selection of a Memorized Pattern

5.1 A Small Network

The network is a set N of five units, consisting of a subset N1 of three units A, B and C, and another subset N2 of two units D and E. Within the set N, units have symmetrical all-to-all weak connections (WN = 0.01), and within each subset units have symmetrical all-to-all strong connections (WNi = 0.1 · M), with M a global parameter varying slowly in time between 1 and 10. These subsets could represent two objects stored in the weight matrix.

5.2 Memory Retrieval and Response Selection


We consider a transient structured input into the network. For constant M, a partial or complete stimulation of a subset Ni can elicit retrieval and completion of the subset in an up-state, as a classical auto-associative memory network would.
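The weight structure of this small network can be written out explicitly. A sketch (Python/NumPy assumed; zero self-coupling is our assumption, since the paper does not specify the diagonal):

```python
import numpy as np

def build_weights(M, W_N=0.01, W_sub=0.1):
    """Symmetric all-to-all weights for the 5-unit network of Section 5.1.
    Units 0-2 form subset N1 (A, B, C); units 3-4 form subset N2 (D, E)."""
    W = np.full((5, 5), W_N)                # weak coupling across the whole set N
    for subset in ([0, 1, 2], [3, 4]):
        for i in subset:
            for j in subset:
                W[i, j] = W_sub * M         # strong within-subset coupling, 0.1*M
    np.fill_diagonal(W, 0.0)                # no self-coupling (our assumption)
    return W

W = build_weights(M=1)
```

Slowly ramping M from 1 to 10 then simply rescales the two within-subset blocks while the weak background WN stays fixed.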


Fig. 5. Slow activation of a robust synchronous up-state in N1 during slow increase of M

In the Milnor attractor condition, more complex retrieval can be achieved when M is slowly increased. As an illustration, we consider transient stimulation of units A and B from N1 and unit E from N2, Fig. 5. N2 units show anti-phase


oscillations with increasing frequency. N1 units first show synchronous down-state oscillations with long stays near the Milnor attractor and gradually move toward sustained up-state oscillations. In this example, the selection of N1 in the up-state is very slow, and synchrony between units plays an important role.

6 Conclusion

We demonstrated that, in the cylinder space, a Milnor attractor appears at a critical condition through forward and reverse saddle-node bifurcations. Near the critical condition, the pair of saddle and node constructs a pseudo-attractor, which can serve for the observation of Milnor attractor-like properties in computer experiments. The semi-stability of the Milnor attractor in this model seems to be associated with the variety of oscillations and with chaotic dynamics through period-doubling roads. We demonstrated that an oscillations network provides a variety of working memory encodings in dynamical states in the presence of a Milnor attractor. Applications of the oscillatory dynamics have been compared to classical auto-associative memory models. The importance of Milnor attractors was proposed in the analysis of coupled map lattices in high dimension [11] and for chaotic itinerancy in the brain [13]. The functional significance of flip-flop oscillations networks with the above dynamical complexity is of interest for further analysis of integrative brain dynamics.

References

1. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The brainweb: Phase synchronization and large-scale integration. Nature Reviews Neuroscience (2001)
2. Buzsaki, G., Draguhn, A.: Neuronal oscillations in cortical networks. Science (2004)
3. Onton, J., Delorme, A., Makeig, S.: Frontal midline EEG dynamics during working memory. NeuroImage (2005)
4. Mizuhara, H., Yamaguchi, Y.: Human cortical circuits for central executive function emerge by theta phase synchronization. NeuroImage (2004)
5. Rainer, G., Lee, H., Simpson, G.V., Logothetis, N.K.: Working-memory related theta (4-7 Hz) frequency oscillations observed in monkey extrastriate visual cortex. Neurocomputing (2004)
6. Tsujimoto, T., Shimazu, H., Isomura, Y., Sasaki, K.: Prefrontal theta oscillations associated with hand movements triggered by warning and imperative stimuli in the monkey. Neuroscience Letters (2003)
7. Goldman-Rakic, P.S.: Cellular basis of working memory. Neuron (1995)
8. McCormick, D.A.: Neuronal networks: Flip-flops in the brain. Current Biology (2005)
9. Durstewitz, D., Seamans, J.K., Sejnowski, T.J.: Neurocomputational models of working memory. Nature Neuroscience (2000)
10. Yamaguchi, Y.: A theory of hippocampal memory based on theta phase precession. Biological Cybernetics (2003)


11. Kaneko, K.: Dominance of Milnor attractors in globally coupled dynamical systems with more than 7 ± 2 degrees of freedom (retitled from 'Magic Number 7 ± 2 in Globally Coupled Dynamical Systems'). Physical Review Letters (2002)
12. Fujii, H., Tsuda, I.: Interneurons: their cognitive roles - A perspective from dynamical systems view. Development and Learning (2005)
13. Tsuda, I.: Towards an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behavioural and Brain Sciences (2001)

Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization of Quasi-Attractors Hiroshi Fujii1,2, Kazuyuki Aihara2,3, and Ichiro Tsuda4,5 1

Department of Information and Communication Sciences, Kyoto Sangyo University, Kyoto 603-8555, Japan [email protected] 2 Institute of Industrial Science, the University of Tokyo, Tokyo 153-8505 [email protected] 3 ERATO, Japan Science and Technology Agency, Tokyo 151-0065, Japan 4 Research Institute for Electronic Science, Hokkaido University, Sapporo 060-0812, Japan [email protected] 5 Center of Excellence COE in Mathematics, Department of Mathematics, Hokkaido University, Sapporo 060-0810, Japan


Abstract. A new hypothesis on a possible role of the corticopetal acetylcholine (ACh) is provided from a dynamical systems standpoint: the corticopetal ACh helps to transiently organize global (inter- and intra-cortical) quasi-attractors via gamma-range synchrony when this is behaviorally needed, as in top-down attention and expectation.

1 Introduction

1.1 Corticopetal Acetylcholine

Acetylcholine (ACh) was the first substance identified as a neurotransmitter, by Otto Loewi [19]. Although it is increasingly recognized that ACh plays a critical role not only in arousal and sleep but also in higher cognitive functions such as attention and conscious flow, the question of how ACh works in those cognitive processes remains a mystery [11]. The corticopetal ACh originates in the nucleus basalis of Meynert (NBM), a part of the basal forebrain (BF), which is the primary source of cortical ACh; the major target of BF projections is the cortex [21]. Behavioral studies, as well as those using immunotoxins, provide consistent evidence for the role of ACh in top-down attention. A blockade of NBM ACh, whether disease-related or drug-induced, causes a severe loss of attention: selective attention, sustained attention, and divided attention, together with the shifting of attention. ACh also concerns conscious flow (Perry & Perry [24]). Continual death of cholinergic neurons in the NBM causes Lewy body dementia (LBD), one of whose most salient symptoms is complex visual hallucination (CVH) [1].¹

¹ Perry and Perry [24] noted hallucinatory LBD patients who see "integrated images of people or animals which appear real at the time", "insects on walls", or "bicycles outside the fourth storey window". Images are generally vivid and colored, and continue for a few minutes (neither seconds nor hours). It is to be noted that "many of those experiences are enhanced during eye closure and relieved by visual input", and that "nicotinic antagonists, such as mecamylamine, are not reported to induce hallucinations".

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 170–178, 2008. © Springer-Verlag Berlin Heidelberg 2008


1.2 Attentions, Cortical State Transitions and the Cholinergic Control System from NBM

Top-down flow of signals accompanying attention, expectation and so on may cause state transitions in the "downstream" cortices. Fries et al. [7] reported an increase of synchrony in the high-gamma range accompanying selective attention (see also Jones [15], Buschman et al. [2]). Metherate et al. [22] stimulated NBM in in vivo preparations of auditory cortex. Of particular interest in their observations is that NBM ACh produced a change in subthreshold membrane potential fluctuations from large-amplitude, slow (1-5 Hz) oscillations to low-amplitude, fast (20-40 Hz, i.e., gamma) oscillations. A shift of spike discharge pattern from phasic to tonic was also observed.² They pointed out that, in view of the widespread projections of NBM neurons, larger intercortical networks could also be modified. Together with the data of Fries et al., this suggests that NBM cholinergic projections may induce a state transition as a shift of frequency and a change of discharge pattern in neocortical neurons.

This may be consistent with the observation made by Kay et al. [16]. During perceptual processing in the olfactory-limbic axis, a cascade of brain events was observed at the successive stages of the task, such as "expectation" and/or "attention". 'Local' transitions of the olfactory structures, indicated by modulations of EEG signals such as gamma amplitude, periodicity, and coherence, were reported to exist. Kay et al. also observed that the local dynamics transiently falls into attractor-like states. Such 'local' transitions of states are generally postulated to be triggered by 'top-down' glutamatergic spike volleys from "upper stream" organizations. However, such state transitions with a change in synchrony could be a result of collaboration between descending glutamatergic spike volleys and ascending ACh afferents from NBM (see also [26]).³

2 Neural Correlate of Conscious Percepts and the Role of the Corticopetal ACh

2.1 Neural Correlates of Conscious Percepts and Transient Synchrony

The corticopetal ACh pathway might be the critical control system which triggers various kinds of attention, receiving convergent inputs from other sensory and association areas, as Sarter et al. [25], [26] argued. In order to discuss the role of the

² NBM contains cholinergic neurons and non-cholinergic neurons; GABAergic neurons are at least twice as numerous as cholinergic neurons [15]. The Metherate et al. observations (above) may be the result of collective functioning of both the cholinergic and GABAergic projections. Wenk [31] argued another possibility: that the NBM ACh projections onto the reticular thalamic nucleus might cause the cortical state change.
³ Triangular attentional pathway: the above arguments may better be complemented by the triangular amplification circuitry, a pathway consisting of parietal cortex → prefrontal cortex → NBM → sensory cortex [10]. This may constitute the cholinergic control system from NBM, i.e., the top-down attentional pathway for cholinergic modulations of responses in cortical sensory areas.


corticopetal ACh related to attention, we first begin with the question: what is the neural correlate of conscious percepts? The recent experiments by Kenet et al. [17] show the possibility that in a background (or spontaneous) state, where no external stimuli exist, the visual cortex fluctuates between specific internal states. This might mean that the cortical circuitry has a number of pre-existing, intrinsic internal states which represent features, and that the cortex fluctuates between such multiple intrinsic states even when no external inputs exist (see also Treisman et al. [27]).

Attention and Dynamical Binding Through Synchrony. In order that perception of an object makes sense, its features, or fragmentary subassemblies, must be bound as a unity. How is the "binding" of those local fragmentary dynamics into a global dynamics done? A widely accepted view is that top-down signals such as expectation and attention may play the role of such an integration, mediated by (possibly gamma) synchronization among the concerned assemblies representing stimuli or events. We postulate that such a process is the basis of global intra- and inter-cortical conjunctions of brain activities (see Varela et al. [30], Womelsdorf et al. [32]; see also Dehaene and Changeux [5]). The neural correlate of conscious percepts is a globally integrated state of related brain networks mediated by synchrony over gamma and other frequency bands. In mathematical terms, such a transient process of synchrony, global but between selected groups of neurons, may be described as a transitory state approaching a global attractor. We note that such a "transitory" state may be conceptualized as an attractor ruin (Tsuda [28], Fujii et al. [8]), which has orbits approaching it and staying nearby for a while, but may simultaneously possess orbits repelling from it. The proper Milnor attractor and its perturbed structures can be a specific representation of attractor ruins [8].
However, since the concept of attractor ruins may include a wider class of non-classical attractors than the Milnor attractor, we may use the term “attractor ruins” in this paper to include possible but unknown classes of ruins. 2.2 Role of the Corticopetal ACh: A Working Hypothesis How can top-down attentions, expectation, etc. contribute to conscious perception with the aid of ACh? Assuming the arguments in the preceding section, this question could be translated into: “How do the corticopetal ACh projections work for the emergence of globally organized attractor ruins via transient synchrony?” We summarize our tentative proposition as a Working Hypothesis in the following. Working Hypothesis: The role of the corticopetal ACh accompanied with top-down contextual signals as attentions and so on is the mediator for dynamically organizing quasi-attractors, which are required in conscious perception, or in executing actions. ACh “integrates” multiple “floating subnetworks” into “a transient synchrony group” in the gamma frequency range. Such a transiently emerging synchrony group can be regarded as an attractor ruin in the dynamical systems-theoretic sense.

Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization

173

3 Do the Existing Experimental Data Support the Working Hypothesis? 3.1 Introductory Remarks: Transient Synchronization by Virtue of Pre- and Post-synaptic ACh Modulations ACh may have both pre-synaptic and post-synaptic effects on individual neurons in the cortex. First, top-down glutamatergic spike volleys flow into cortical layers, which might convey contextual information on stimuli. The corticopetal ACh arrives concomitantly with the glutamatergic volleys. If ACh release modulates synaptic connectivity between cortical neurons even in an effective sense by virtue of “pre-synaptic modulations”, metamorphosis of the attractor landscape4 should be inevitable. Post-synaptic influences of the corticopetal ACh on individual neurons – either inhibitory or excitatory, might cause deep effects on their firing behavior, and might induce a state transition with a collective gamma oscillation. A consequence of the three effects together might trigger a specific group of networks to oscillate in gamma frequency with phase synchrony. These are all speculative stories based on experimental evidence. We need at least to examine the present status on the experimental data concerning the cortical ACh influence on individual neurons. This may be the place to add a comment on the specificity of the corticopetal ACh projections on the cortex, which may be a point of arguments. It is reported that the cholinergic afferents specialized synaptic connections with post synaptic targets, rather than releasing ACh non-specifically (Turrini et al..[29].) 3.2 Controversy on Experimental Data Let us review quickly the existing experimental data. As noted before, “there exist little consensus among researchers for more than a half century” [11]. The following is not intended to give a complete review, but to give a preliminary knowledge which may be of help to understand the succeeding discussions. 
ACh have two groups of receptors, one is the muscarinic receptors, mAChRs with 5 subtypes, and the other is nicotinic receptors, nAChRs with 17 subtypes. The nAChR is a relatively simple cationic (Na+ and Ca2+) channel, the opening of which leads to a rapid depolarization followed by desensitization. Most of mAChRs activation exhibits the slower onset and longer lasting G-protein coupled second messenger generation. Here the primary interest is in mAChRs. 5 The functions of mAChRs are reported to be two-fold: one is pre-synaptic, and the other is postsynaptic modulations. 4

⁴ "Attractor landscape" is usually used for potential systems. Here, we use it to mean the landscape of "basins" (absorbing regions) of classical and non-classical attractors such as attractor ruins.
⁵ The nicotinic receptors, nAChRs, may work as a disinhibition system for layer 2/3 pyramidal neurons [4]; their exact function is not yet known.


Post-synaptic Modulations. The results of traditional studies may be divided into two opposing sets of data. The majority view is that mAChRs function as an excitatory transmitter for post-synaptic neurons (see, e.g., McCormick [20]), while there are minority data that claim inhibitory functioning. The latter, however, has been considered to be a consequence of ACh excitation of interneurons, which in turn may inhibit post-synaptic pyramidal neurons (PYR). Recently, Gulledge et al. [11], [12] stated that transient mAChR activation generates strong and direct transient inhibition of neocortical PYR. The underlying ionic process is the induction of calcium release from internal stores, and the subsequent activation of small-conductance (SK-type) calcium-activated potassium channels. The authors claim that the traditional data do not describe the actions of transient mAChR activation, as is likely to happen during synaptic release of ACh, for the following reasons:

1. In vivo ACh concentration: Previous studies typically used high concentrations of muscarinic agonists (1–100 mM). Extracellular concentrations of ACh in the cortex are at least one order of magnitude lower than those required to depolarize PYR in vitro.
2. Phasic (transient) application vs. bath application: Most data depended on experiments with bath applications, which may correspond to prolonged, tonic mAChR stimulation. The ACh release accompanying attention, etc., would better correspond to a transient puff application, as in the authors' experiment.

The specificity of ACh afferents on post-synaptic targets was already noted [29].

Pre-synaptic Modulations. Experimental works on pre-synaptic modulations are mostly based on ACh bath applications, and the modulation data were measured in terms of local field potentials (LFP). Most results claimed pathway specificity of the modulations. Typically, it was concluded that muscarinic modulation can strongly suppress intracortical (IC) synaptic activity while exerting less suppression on, or actually enhancing, thalamocortical (TC) inputs [14]. Gil et al. [9] reported that 5 μM muscarine decreases both IC and TC pathway transmission, and that these were pre-synaptic effects, since membrane potential and input resistance were unchanged. Recently, Kuczewski et al. [18] studied the same problem and obtained results different from the previous ones: low ACh (less than 100 μM) shows facilitation, and as the ACh concentration goes higher, the result is depression. This holds for both the IC and TC pathways, i.e., for layer 2/3 and layer 4.

3.3 Possible Scenarios

The lack of consistent experimental data makes our job complicated. The situation might be compared to solving a jigsaw puzzle with many pieces missing and some pieces mingled in from other jigsaw pictures. What we can do at this moment may be to propose possible alternative scenarios for the role of the corticopetal ACh.

Corticopetal Acetylcholine: Possible Scenarios on the Role for Dynamic Organization


The following is a list of prerequisites and evidence on which our arguments should be based.

1. Two modulations may occur simultaneously inside the six layers of the cortex. The firing characteristics of individual neurons and the strength of synaptic connections may change dynamically, as post-synaptic or pre-synaptic modulations, respectively. Virtually no models, to our knowledge, have been proposed that take the net effect of the two modulations into account.

2. The interaction of ACh with top-down glutamatergic spike volleys should be considered. The majority of neurons alter their responses under combined, concomitant exposure to both acetylcholine and glutamate (Perry & Perry [24]).

3. As a post-synaptic influence, ACh release may change the firing regime of neurons and induce gamma oscillation [6], [23], [31].

As to pre-synaptic modulations, the details of the synaptic processes appear to be largely unknown. The significance of experimental studies cannot be overemphasized. Now let us try to draw a sketch of possible scenarios for the role of corticopetal ACh. Here we may lay three cornerstones for the models:

1. Who triggers the gamma oscillation?
2. Who modulates the effective connectivity, and how?
3. What is the mechanism of phase synchrony, and what is its role?

Scenario I. The basic idea is that in the default, low-ACh state, globally organized attractors virtually do not exist, and activity may take the form of floating fragmentary dynamics. ACh release may then help to strengthen the synaptic connections pre-synaptically. One of the roles of the post-synaptic modulation is to start up the gamma oscillation. (Here the influence of GABAergic projections from the NBM might play a role.) Another, equally important role will be stated later.

Scenario II. The effective modulation of synaptic connectivity might be carried, rather than by the pre-synaptic modulation, by the phase synchrony of the gamma oscillation itself, triggered by the post-synaptic modulation. Such a mechanism for the change of effective synaptic connectivity, and the resulting binding of fragmentary groups of neurons, was proposed by Womelsdorf et al. [32]. They claimed that the mutual influence among neuronal groups depends on the phase relation between the rhythmic activities within the groups; phase relations supporting interactions among the groups preceded those interactions by a few milliseconds, consistent with a mechanistic role. See also Buzsaki [3]. In Scenario II, the role of the post-synaptic modulation is to start up the gamma oscillation and to reset its phase, as Gulledge and Stuart [11] suggested; the transient hyper-polarization may play the role of a referee, starting up the oscillation among the related groups in unison. Scenario I makes the post-synaptic modulation carry the two roles of starting up the gamma oscillation, and


H. Fujii, K. Aihara, and I. Tsuda

resetting its phase; to the pre-synaptic modulation, the bigger role of realizing attractors by virtue of synaptic-strength modulation is assigned.

4 Concluding Discussions

The critical role of corticopetal ACh in cognitive functions, together with its relation to disease-related symptoms such as complex visual hallucinations in DLB and its apparent involvement in neocortical state changes, has motivated us to study the functional role(s) of corticopetal ACh from a dynamical-systems standpoint. Cognitive functions are phenomena carried by brain dynamics, and we hope that describing cognitive dynamics in the language of dynamical systems will open new theoretical horizons. It is of some help to consider the conceptual difference between the two "forces" that flow into the six layers of the neocortex. Glutamatergic spike volleys could be regarded, viewed as an event in a dynamical system, as an external force, which may kick the orbit to another orbit, sometimes out of the "basin" of the present attractor beyond its border, the separatrix. In contrast, ACh projections, though transient, could be regarded as a slow parameter, a bifurcation parameter that modifies the landscape itself. What we have argued above is that the two phenomena happen concomitantly in the six layers of the cortex. Hasselmo and McGaughy [13] summarized the role of ACh in memory as "high acetylcholine sets circuit dynamics for attention and encoding; low acetylcholine sets dynamics for consolidation", based on experimental data on selective pre-synaptic depression and facilitation. However, in view of the potential role of attention in local binding and global integration, we may pose alternative (but not necessarily exclusive) scenarios in which ACh temporarily modifies the quasi-attractor landscape, in collaboration with glutamatergic spike volleys. Indeed, we speculate that the process of memorization itself would be realized through such a dynamic formation of attractor ruins, for which mAChR may play a role.

Acknowledgements. The first author (HF) was supported by a Grant-in-Aid for Scientific Research (C), No. 19500259, from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government. The second author (KA) was partially supported by a Grant-in-Aid for Scientific Research on Priority Areas, No. 17022012, from the same Ministry. The third author (IT) was partially supported by Grants-in-Aid for Scientific Research on Priority Areas, No. 18019002 and No. 18047001, a Grant-in-Aid for Scientific Research (B), No. 18340021, a Grant-in-Aid for Exploratory Research, No. 17650056, a Grant-in-Aid for Scientific Research (C), No. 16500188, and the 21st Century COE Program "Mathematics of Nonlinear Structures via Singularities".


References

1. Behrendt, R.-P., Young, C.: Hallucinations in schizophrenia, sensory impairment, and brain disease: A unifying model. Behav. Brain Sci. 27, 771–787 (2004)
2. Buschman, T.J., Miller, E.K.: Top-Down Versus Bottom-Up Control of Attention in the Prefrontal and Posterior Parietal Cortices. Science 315, 1860–1862 (2007)
3. Buzsaki, G.: Rhythms of the Brain. Oxford University Press, Oxford (2006)
4. Christophe, E., Roebuck, A., Staiger, J.F., Lavery, D.J., Charpak, S., Audinat, E.: Two Types of Nicotinic Receptors Mediate an Excitation of Neocortical Layer I Interneurons. J. Neurophysiol. 88, 1318–1327 (2002)
5. Dehaene, S., Changeux, J.-P.: Ongoing Spontaneous Activity Controls Access to Consciousness: A Neuronal Model for Inattentional Blindness. PLoS Biology 3, 910–927 (2005)
6. Detari, L.: Tonic and phasic influence of basal forebrain unit activity on the cortical EEG. Behav. Brain Res. 115, 159–170 (2000)
7. Fries, P., Reynolds, J.H., Rorie, A.E., Desimone, R.: Modulation of Oscillatory Neuronal Synchronization by Selective Visual Attention. Science 291, 1560–1563 (2001)
8. Fujii, H., Aihara, K., Tsuda, I.: Functional Relevance of 'Excitatory' GABA Actions in Cortical Interneurons: A Dynamical Systems Approach. J. Integrative Neurosci. 3, 183–205 (2004)
9. Gil, Z., Connors, B.W., Amitai, Y.: Differential Regulation of Neocortical Synapses by Neuromodulators and Activity. Neuron 19, 679–686 (1997)
10. Golmayo, L., Nunez, A., Zaborsky, L.: Electrophysiological Evidence for the Existence of a Posterior Cortical-Prefrontal-Basal Forebrain Circuitry in Modulating Sensory Responses in Visual and Somatosensory Rat Cortical Areas. Neuroscience 119, 597–609 (2003)
11. Gulledge, A.T., Stuart, G.J.: Cholinergic Inhibition of Neocortical Pyramidal Neurons. J. Neurosci. 25, 10308–10320 (2005)
12. Gulledge, A.T., Park, S.B., Kawaguchi, Y., Stuart, G.J.: Heterogeneity of phasic cholinergic signaling in neocortical neurons. J. Neurophysiol. 97, 2215–2229 (2007)
13. Hasselmo, M.E., McGaughy, J.: High acetylcholine sets circuit dynamics for attention and encoding; low acetylcholine sets dynamics for consolidation. Prog. Brain Res. 145, 207–231 (2004)
14. Hsieh, C.Y., Cruikshank, S.J., Metherate, R.: Differential modulation of auditory thalamocortical and intracortical synaptic transmission by cholinergic agonist. Brain Res. 880, 51–64 (2000)
15. Jones, B.E., Muhlethaler, M.: Cholinergic and GABAergic neurons of the basal forebrain: role in cortical activation. In: Lydic, R., Baghdoyan, H.A. (eds.) Handbook of Behavioral State Control, pp. 213–233. CRC Press, London (1999)
16. Kay, L.M., Lancaster, L.R., Freeman, W.J.: Reafference and attractors in the olfactory system during odor recognition. Int. J. Neural Systems 4, 489–495 (1996)
17. Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., Arieli, A.: Spontaneously emerging cortical representations of visual attributes. Nature 425, 954–956 (2003)
18. Kuczewski, N., Aztiria, E., Gautam, D., Wess, J., Domenici, L.: Acetylcholine modulates cortical synaptic transmission via different muscarinic receptors, as studied with receptor knockout mice. J. Physiol. 566.3, 907–919 (2005)
19. Loewi, O.: Ueber humorale Uebertragbarkeit der Herznervenwirkung. Pfluegers Archiv Gesamte Physiologie 189, 239–242 (1921)
20. McCormick, D.A., Prince, D.A.: Mechanisms of action of acetylcholine in the guinea-pig cerebral cortex in vitro. J. Physiol. 375, 169–194 (1986)
21. Mesulam, M.M., Mufson, E.J., Levey, A.I., Wainer, B.H.: Cholinergic innervation of cortex by the basal forebrain: cytochemistry and cortical connections of the septal area, diagonal band nuclei, nucleus basalis (substantia innominata), and hypothalamus in the rhesus monkey. J. Comp. Neurol. 214, 170–197 (1983)
22. Metherate, R., Cox, C.L., Ashe, J.H.: Cellular Bases of Neocortical Activation: Modulation of Neural Oscillations by the Nucleus Basalis and Endogenous Acetylcholine. J. Neurosci. 12, 4701–4711 (1992)
23. Niebur, E., Hsiao, S.S., Johnson, K.O.: Synchrony: a neuronal mechanism for attentional selection? Curr. Opin. Neurobiol. 12, 190–194 (2002)
24. Perry, E.K., Perry, R.H.: Acetylcholine and Hallucinations: Disease-Related Compared to Drug-Induced Alterations in Human Consciousness. Brain Cognit. 28, 240–258 (1995)
25. Sarter, M., Gehring, W.J., Kozak, R.: More attention must be paid: The neurobiology of attentional effort. Brain Res. Rev. 51, 145–160 (2006)
26. Sarter, M., Parikh, V.: Choline Transporters, Cholinergic Transmission and Cognition. Nature Reviews Neurosci. 6, 48–56 (2005)
27. Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cognit. Psychol. 12, 97–136 (1980)
28. Tsuda, I.: Chaotic Itinerancy as a Dynamical Basis of Hermeneutics of Brain and Mind. World Futures 32, 167–185 (1991)
29. Turrini, P., Casu, M.A., Wong, T.P., De Koninck, Y., Ribeiro-da-Silva, A., Cuello, A.C.: Cholinergic nerve terminals establish classical synapses in the rat cerebral cortex: synaptic pattern and age-related atrophy. Neuroscience 105, 277–285 (2001)
30. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The Brainweb: Phase synchronization and large-scale integration. Nature Rev. Neurosci. 2, 229–239 (2001)
31. Wenk, G.L.: The Nucleus Basalis Magnocellularis Cholinergic System: One Hundred Years of Progress. Neurobiol. Learn. Mem. 67, 85–95 (1997)
32. Womelsdorf, T., Schoffelen, J.M., Oostenveld, R., Singer, W., Desimone, R., Engel, A.K., Fries, P.: Modulation of neuronal interactions through neuronal synchronization. Science 316, 1609–1612 (2007)

Tracking a Moving Target Using Chaotic Dynamics in a Recurrent Neural Network Model Yongtao Li and Shigetoshi Nara Graduate School of Natural Science and Technology, Okayama University, 3-1-1 Tsushima-naka, Okayama 700-8530, Japan [email protected]

Abstract. Chaotic dynamics introduced in a recurrent neural network model is applied to controlling a tracker so that it tracks a moving target in two-dimensional space, a task set as an ill-posed problem. The motion increments of the tracker are determined by a group of motion functions calculated in real time from the firing states of the neurons in the network. Several groups of cyclic memory attractors corresponding to simple motions of the tracker in two-dimensional space are embedded. Chaotic dynamics enables the tracker to perform various motions. Adaptive real-time switching of a control parameter causes chaotic itinerancy and enables the tracker to track a moving target successfully. The performance of tracking is evaluated by calculating the success rate over 100 trials. Simulation results show that chaotic dynamics is useful for tracking a moving target. To understand this further, the dynamical structure of the chaotic dynamics is investigated from a dynamical viewpoint. Keywords: Chaotic dynamics, tracking, moving target, neural network.

1 Introduction

Biological systems have become a hot research topic around the world because of their excellent capabilities, not only in information processing but also in well-regulated functioning and control, which work quite adaptively in various environments. However, our understanding of the mechanisms of biological systems, including brains, remains poor despite many efforts of researchers, because the enormous complexity originating from the dynamics of such systems is very difficult to understand and describe using conventional methodologies based on reductionism, that is, on decomposing a system into parts or elements. Conventional reductionism more or less runs into two difficulties: one is "combinatorial explosion" and the other is "divergence of algorithmic complexity". These difficulties are not yet solved. On the other hand, a dynamical viewpoint for understanding these mechanisms seems to be a plausible approach. In particular, chaotic dynamics experimentally observed in biological systems including brains [1,2] has suggested that chaotic dynamics plays important roles in the complex functioning and control of biological systems. From this viewpoint, many dynamical models have been constructed to approach these mechanisms by means of large-scale simulation or heuristic methods. Artificial neural networks in which chaotic dynamics can be introduced have been attracting great interest, and the relation between chaos and function has been discussed

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 179–188, 2008.
© Springer-Verlag Berlin Heidelberg 2008


[9,10,11,12]. In one of those works, Nara and Davis found that chaotic dynamics can occur in a recurrent neural network model (RNNM) consisting of binary neurons [3], and they investigated the functional aspects of chaos by applying it to solving a memory search task set in an ill-posed context [7]. To show the potential of chaos in control, chaotic dynamics was then applied to solving two-dimensional mazes, which were set as ill-posed problems [8]. Two important ideas were proposed there: a simple coding method translating the neural states into motion increments, and a simple control algorithm, switching a system parameter adaptively to produce constrained chaos. The conclusion was that constrained chaotic behaviour can give better performance in solving a two-dimensional maze than a random walk. In this paper, we develop this idea and apply chaotic dynamics to tracking a moving target, which is set as another ill-posed problem.

Let us describe the model of tracking a moving target. A tracker is assumed to move in two-dimensional space, with discrete time steps, and to track a target moving along a certain trajectory by employing chaotic dynamics. The state pattern is transformed into the tracker's motion by the coding of motion functions, which will be given in a later section. In addition, several limit cycle attractors, regarded as prototypical simple motions, are embedded in the network. By the coding of the motion functions, each cycle corresponds to a monotonic motion in two-dimensional space. If the state pattern converges into a prototypical attractor, the tracker moves in a monotonic direction. Introducing chaotic dynamics into the network generates non-periodic state patterns, which are transformed into chaotic motion of the tracker by the motion functions. Adaptive switching of a system parameter, based on a simple evaluation, between chaotic dynamics and attractor dynamics in the network results in complex motions of the tracker in various environments. On this basis, a simple control algorithm is proposed for tracking a moving target. In actual simulations, the present method using chaotic dynamics gives novel performance. To understand the mechanism of this better performance, the dynamical structure of the chaotic dynamics is investigated through statistical data.

2 Memory Attractors and Motion Functions

Our study works with a fully interconnected recurrent neural network consisting of N binary neurons. Its updating rule is defined by

    S_i(t + 1) = sgn( Σ_{j ∈ G_i(r)} W_{ij} S_j(t) ),    (1)

    sgn(u) = +1 for u ≥ 0; −1 for u < 0,

where
• S_i(t) = ±1 (i = 1 ∼ N): the firing state of the neuron specified by index i at time t;
• W_{ij}: connection weight from neuron S_j to neuron S_i (W_{ii} is taken to be 0);
• r: fan-in number for neuron S_i, named the connectivity (0 < r < N);
• G_i(r): a spatial configuration set of connectivity r.
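As an illustration (the paper reports simulations but no code; Python and all names here are our own), the update rule of Eq. (1) with a fan-in index set G_i(r) can be sketched as:

```python
import numpy as np

def update(S, W, G):
    """One synchronous update of the binary network, Eq. (1).

    S : (N,) vector of +/-1 firing states S_i(t)
    W : (N, N) connection weights (W[i, i] == 0)
    G : (N, r) index array; G[i] lists the r neurons feeding neuron i,
        i.e. the spatial configuration set G_i(r)
    """
    u = np.array([W[i, G[i]] @ S[G[i]] for i in range(len(S))])
    return np.where(u >= 0, 1, -1)  # sgn with sgn(0) = +1, as in Eq. (1)
```

For full connectivity r = N − 1, G[i] simply lists all indices except i.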


At a certain time t, the state of the neurons in the network can be represented as an N-dimensional state vector S(t), called the state pattern. The time development of the state pattern S(t) depends on the connection weight matrix {W_ij} and the connectivity r. In our study, the W_ij are therefore determined in the case of full connectivity r = N − 1 by a kind of orthogonalized learning method [7] and taken as follows:

    W_ij = Σ_{μ=1}^{L} Σ_{λ=1}^{K} (ξ^{λ+1}_μ)_i · (ξ^{λ†}_μ)_j,    (2)

where {ξ^λ_μ | λ = 1, ..., K; μ = 1, ..., L} is the attractor pattern set, K is the number of memory patterns included in a cycle, and L is the number of memory cycles. ξ^{λ†}_μ is the conjugate vector of ξ^λ_μ, which satisfies ξ^{λ†}_μ · ξ^{λ'}_{μ'} = δ_{μμ'} · δ^{λλ'}, where δ is Kronecker's delta. This method was confirmed to be effective in avoiding spurious attractors [3,4,5,6,7,8].

Biological data show that neurons in the brain drive various motions of muscles in the body with quite large redundancy. Therefore, the network consisting of N neurons is used to realize two-dimensional motion control of a tracker. We confirmed that the chaotic dynamics introduced in the network does not depend sensitively on the number of neurons [7]. In our actual computer simulations, N = 400. Suppose that the tracker moves from the position (p_x(t), p_y(t)) to (p_x(t + 1), p_y(t + 1)) with a set of motion increments (f_x(t), f_y(t)). The state pattern S(t) at time t is a 400-dimensional vector, and we transform it into two-dimensional motion increments by the coding of motion functions (f_x(S(t)), f_y(S(t))). In two-dimensional space, the actual motion of the tracker is given by

    p_x(t + 1) = p_x(t) + f_x(S(t)),    f_x(S(t)) = (4/N) A · C,
    p_y(t + 1) = p_y(t) + f_y(S(t)),    f_y(S(t)) = (4/N) B · D,    (3)

where A, B, C, D are the four independent N/4-dimensional sub-space vectors of the state pattern S(t). After the inner product between two sub-space vectors is normalized by 4/N, the motion functions range from −1 to +1. In our actual simulations, two-dimensional space is digitized with a resolution of 0.02, due to the binary neuron states ±1 and N = 400.

Now let us consider the construction of memory attractors corresponding to prototypical simple motions. We take 24 attractor patterns consisting of (L = 4 cycles) × (K = 6 patterns per cycle). Each cycle corresponds to one prototypical simple motion. We take four types of motion, in which the tracker moves toward (+1, +1), (−1, +1), (−1, −1), (+1, −1) in two-dimensional space.
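The orthogonalized learning rule of Eq. (2) can be sketched as follows. This is an illustrative Python sketch; realizing the conjugate vectors through the pseudo-inverse of the stacked pattern matrix is one standard way to satisfy the biorthogonality condition, and is our assumption here (the cited papers may construct them differently):

```python
import numpy as np

def orthogonalized_weights(patterns):
    """Weights of Eq. (2) for L cycles of K patterns each.

    patterns : (L, K, N) array of +/-1 patterns xi^lambda_mu.
    The rows of pinv(X).T are conjugate vectors satisfying the
    biorthogonality condition of the text.  Each pattern is mapped onto
    its cyclic successor (lambda -> lambda + 1, wrapping around), which
    embeds L limit-cycle attractors of period K.
    """
    L, K, N = patterns.shape
    X = patterns.reshape(L * K, N)            # rows: all L*K patterns
    conj = np.linalg.pinv(X).T                # rows: conjugate vectors
    succ = np.roll(patterns, -1, axis=1).reshape(L * K, N)  # cyclic successors
    return succ.T @ conj                      # Eq. (2) as a sum of outer products
```

One synchronous update with these weights then maps each stored pattern onto the next one of its cycle (for brevity the diagonal is not zeroed here, unlike W_ii = 0 in Eq. (1)).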
Each attractor pattern consists of four random sub-space vectors A, B, C and D, where C = A or −A, and D = B or −B, so only A and B are independent random patterns. By the law of large numbers, the memory patterns are almost orthogonal to each other; furthermore, in determining {W_ij}, the orthogonalized learning method was employed, so the memory patterns are orthogonalized. The corresponding relations between memory attractors and prototypical simple motions are as follows:

    (f_x(ξ^λ_1), f_y(ξ^λ_1)) = (+1, +1)    (f_x(ξ^λ_2), f_y(ξ^λ_2)) = (−1, +1)
    (f_x(ξ^λ_3), f_y(ξ^λ_3)) = (−1, −1)    (f_x(ξ^λ_4), f_y(ξ^λ_4)) = (+1, −1)
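The coding of Eq. (3) from a state pattern to motion increments can be sketched directly (Python; a transcription of the sub-vector inner products, with names of our choosing):

```python
import numpy as np

def motion_increments(S):
    """Eq. (3): state pattern -> two-dimensional motion increments.

    S : (N,) +/-1 state pattern, N divisible by 4.
    A, B, C, D are the four N/4-dimensional sub-space vectors of S;
    f_x = (4/N) A.C and f_y = (4/N) B.D, each ranging over [-1, +1]
    with resolution 0.02 for N = 400.
    """
    N = len(S)
    A, B, C, D = np.split(np.asarray(S, dtype=float), 4)
    fx = 4.0 * float(A @ C) / N
    fy = 4.0 * float(B @ D) / N
    return fx, fy
```

For a prototype pattern with C = A and D = B this yields (+1, +1); with C = −A it yields f_x = −1, matching the four prototypical motions.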


3 Introducing Chaotic Dynamics in RNNM

Now let us state the effects of the connectivity r. In the case of full connectivity r = N − 1, the network functions as a conventional associative memory. If the state pattern S(t) is one of the memory patterns ξ^λ_μ, or near one, the output sequence S(t + kK) (k = 1, 2, 3, ...) will converge to the memory pattern ξ^λ_μ. In other words, for each memory pattern there is a set of state patterns, called its memory basin B^λ_μ: if S(t) is in B^λ_μ, then the output sequence S(t + kK) converges to ξ^λ_μ. It is quite difficult to estimate basin volumes accurately because of the enormous amount of calculation required over the whole N-dimensional state space, so a statistical method is applied to estimate approximate basin volumes. First, a sufficiently large number of state patterns are sampled from the state space. Second, each sample is taken as an initial pattern and updated with full connectivity. Third, statistics are taken of which memory attractor lim_{k→∞} S(kK) each sample converges into. The distribution of these statistics over the whole sample set is regarded as the approximate basin volume of each memory attractor (see Fig. 1). The basin volumes show that almost all initial state patterns converge into one of the memory attractors, and there are seldom spurious attractors.

Fig. 1. Basin volume: The horizontal axis represents the memory pattern number (1–24). Basin 25 corresponds to samples that converged into cyclic outputs with a period of six steps but not into any one memory attractor. Basin 26 corresponds to samples excluded from any other case (1–25). The vertical axis represents the ratio of the corresponding samples to the whole sample set.
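The statistical basin-volume estimate described above can be sketched as follows (Python; the sample size and settling time are our assumptions, and the remainder that Fig. 1 splits into basins 25 and 26 is collapsed into a single "other" bin):

```python
import numpy as np

def estimate_basin_volumes(W, patterns, n_samples=200, n_settle=60, seed=0):
    """Monte-Carlo estimate of basin volumes at full connectivity.

    W        : (N, N) connection weights
    patterns : (L, K, N) embedded cyclic attractor patterns
    Returns a length-(L+1) array: the fraction of random initial states
    ending on each memory cycle, plus a final bin for samples that
    reach no embedded pattern.
    """
    L, K, N = patterns.shape
    flat = patterns.reshape(L * K, N)
    rng = np.random.default_rng(seed)
    counts = np.zeros(L + 1)
    for _ in range(n_samples):
        S = rng.choice([-1.0, 1.0], size=N)          # random initial pattern
        for _ in range(n_settle):                    # let transients die out
            S = np.where(W @ S >= 0.0, 1.0, -1.0)    # full-connectivity update
        hit = np.nonzero((flat == S).all(axis=1))[0]
        counts[hit[0] // K if hit.size else L] += 1  # cycle index, or "other"
    return counts / n_samples
```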

Next, we continue to decrease the connectivity r. When r is large enough (r ≈ N), the memory attractors are stable. As r becomes smaller and smaller, more and more state patterns gradually fail to converge into any memory pattern even when the network is updated for a long time; that is, the attractors become unstable. Finally, when r becomes quite small, the state pattern becomes a non-periodic output; non-periodic dynamics occurs in the state space. In our previous papers, we confirmed that this non-periodic dynamics in the network is chaotic wandering. In order to investigate the dynamical structure, we calculated basin visiting measures, which suggest that the trajectory can pass through the


whole N-dimensional state space; that is, the cyclic memory attractors are ruined by the quite small connectivity [3,4,5,6,7].

4 Motion Control and Tracking Algorithm

When the connectivity r is sufficiently large, a random initial pattern converges into one of the four limit cycle attractors as time evolves. Via the coding transformation of the motion functions, the corresponding motion of the tracker in two-dimensional space becomes monotonic (see Fig. 2). On the other hand, when the connectivity r is quite small, chaotic dynamics occurs in the state space and, correspondingly, the tracker moves chaotically (see Fig. 3). If the updating of the state pattern in the chaotic regime is replaced by a random 400-bit-pattern generator, the tracker shows a random walk (see Fig. 4). Obviously, the chaotic motion is different from the random walk, and has a certain dynamical structure.

Fig. 2. Monotonic motion (r = 399)

Fig. 3. Chaotic walk (r = 30; 500 steps)

Fig. 4. Random walk (500 steps)

Therefore, when the network evolves, monotonic motion and chaotic motion can be switched between by switching the connectivity r. Based on this idea, we propose a simple algorithm to track a moving target, shown in Fig. 5. First, the tracker is assumed to be tracking a target that moves along a certain trajectory in two-dimensional space, and the tracker can obtain rough directional information D1(t) about the moving target, called the global target direction. At a certain time t, the present position of the tracker is assumed to be the point (p_x(t), p_y(t)). Taking this point as the origin, two-dimensional space can be divided into four quadrants; if the target is in the n-th quadrant, D1(t) = n (n = 1, 2, 3, 4). Next, we suppose that the tracker also knows another piece of directional information, D2(t) = m (m = 1, 2, 3, 4), called the global motion direction: the tracker has moved toward the m-th quadrant from time t − 1 to t, that is, in the previous step. The global target direction D1(t) and the global motion direction D2(t) are time-dependent variables. If this information is fed back to the network in real time, the connectivity also becomes a time-dependent variable r(t), determined by D1(t) and D2(t). In Fig. 5, RL is a sufficiently large connectivity and RS is a quite small connectivity that leads to chaotic dynamics in the neural network. Adaptive switching of the connectivity is the core idea of the algorithm. When the synaptic connectivity r(t) has been determined by comparing the two directions D1(t − 1) and D2(t − 1), the motion increments of the tracker are calculated from the state pattern of the network updated with r(t). The new motion


Fig. 5. Control algorithm for tracking a moving target: by judging whether the global target direction D1(t) coincides with the global motion direction D2(t) or not, adaptive switching of the connectivity r between RS and RL produces chaotic dynamics or attractor dynamics in the state space. Correspondingly, the tracker adaptively tracks a moving target in two-dimensional space.

causes the next D1(t) and D2(t), which produce the next connectivity r(t + 1). By repeating this process, the synaptic connectivity r(t) switches adaptively between RL and RS, and the tracker alternately executes monotonic motion and chaotic motion in two-dimensional space.
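The decision step of the algorithm can be sketched in a few lines (Python; the particular values of RL and RS are examples in the spirit of Figs. 2 and 3, not values fixed by the paper):

```python
def quadrant(dx, dy):
    """Quadrant number n = 1..4 of a displacement (dx, dy); usable for both
    the global target direction D1(t) and the global motion direction D2(t)."""
    if dx >= 0:
        return 1 if dy >= 0 else 4
    return 2 if dy >= 0 else 3

def choose_connectivity(D1, D2, RL=399, RS=30):
    """Core of the control algorithm of Fig. 5: keep the large connectivity
    RL (attractor dynamics, monotonic motion) while the motion direction
    agrees with the target direction; otherwise switch to the small
    connectivity RS, letting chaotic dynamics search for a new motion."""
    return RL if D1 == D2 else RS
```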

5 Simulation Results

In order to confirm that this control algorithm is useful for tracking a moving target, the moving target must first be specified. We have taken nine kinds of trajectories along which the target moves, shown in Fig. 6: one circular trajectory and eight linear trajectories. Suppose that the initial position of the tracker is the origin (0, 0) of two-dimensional space. The distance L between the initial position of the tracker and that of the target is a constant value. Therefore, at the beginning of tracking, the tracker is at the center of the circular trajectory, and the eight linear trajectories are tangential to the circular trajectory at a certain angle α, defined with respect to the x axis. The tangential angles are α = nπ/4 (n = 1, 2, ..., 8), so we number the eight linear trajectories LTn, and the circular trajectory LT0.

Fig. 6. Trajectories of the moving target: each arrow represents the moving direction of the target; the solid point marks its position at time t = 0.


Fig. 7. An example of tracking a target moving along a circular trajectory with the simple algorithm: the tracker captured the moving target at the intersection point.


Next, let us consider the velocity of the target. In the computer simulation, the tracker moves one step per discrete time step; at the same time, the target also moves one step, with a certain step length SL that represents the velocity of the target. The motion increments of the tracker range from −1 to 1, so the step length SL is taken at intervals of 0.01, from 0.01 to 1, giving 100 different velocities. Because velocity is a relative quantity, SL = 0.01 is a slow target velocity and SL = 1 is a fast target velocity relative to the tracker. Now let us look at a simulation of tracking a moving target using the algorithm proposed above, shown in Fig. 7. When the target moves along a circular trajectory at a certain velocity, the tracker captures the target at a certain point of the circular trajectory, a successful capture on a circular trajectory.

6 Performance Evaluation


To show the performance of tracking a moving target, we have evaluated the success rate of tracking a target that moves along one of the nine trajectories, over 100 initial state patterns. In the tracking process, a trial in which the tracker approaches the target sufficiently closely, within a certain tolerance, during 600 steps is regarded as a successful trial. The rate of successful trials is called the success rate. However, even for the same target trajectory, the performance of tracking depends not only on the synaptic connectivity r, but also on the target velocity, i.e., the target step length SL. Therefore, when we evaluate the success rate of tracking, a pair of parameters is taken: a connectivity r (1 ≤ r ≤ 60) and a target velocity SL (0.01 ≤ SL ≤ 1.0). Because we take 100 different target velocities at the same interval of 0.01, we have 60 × 100 pairs of parameters. We have evaluated the success rate of tracking a moving target along the different trajectories; two examples are shown in Fig. 8(a) and (b). Comparing Fig. 8(a) and (b), we find that tracking a moving target on the circular trajectory gives better performance than on the linear trajectory, although for some linear trajectories quite excellent performance was also observed. On the other hand, the success rate depends strongly on the connectivity r and the target velocity SL even when the same target trajectory is set. In order to observe the performance clearly, we have taken the data


Fig. 8. Success rate of tracking a moving target along (a) the circular trajectory and (b) a linear trajectory: the positive orientation obeys the right-hand rule. The vertical axis represents the success rate, and the two axes in the horizontal plane represent the connectivity r and the target velocity SL, respectively.

(a) r = 16: downward tendency    (b) r = 51: upward tendency

Fig. 9. Success rates drawn from Fig. 8(a): we take the data for a given connectivity and show them in a two-dimensional diagram. The horizontal axis represents the target velocity from 0.01 to 1.0, and the vertical axis represents the success rate.

for certain connectivities from Fig. 8(a) and plotted them in two-dimensional coordinates, shown in Fig. 9. Comparing these figures, we can see a novel property: when the target velocity becomes faster, the success rate shows an upward tendency, for example at r = 51. In other words, when the chaotic dynamics is not too strong, it seems useful for tracking a faster target.
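The success-rate evaluation of Sec. 6 can be sketched generically (Python; the trial runner is injected, and the tolerance value 0.5 is our placeholder, since the paper only says "within a certain tolerance"):

```python
import numpy as np

def capture(tracker_xy, target_xy, tol=0.5):
    """True if the tracker is within tolerance `tol` of the target."""
    d = np.asarray(tracker_xy, dtype=float) - np.asarray(target_xy, dtype=float)
    return bool(np.hypot(d[0], d[1]) <= tol)

def success_rate(run_trial, n_trials=100):
    """Fraction of successful trials over n_trials initial state patterns;
    run_trial(k) runs one tracking trial from the k-th initial pattern and
    returns True if the target was captured within 600 steps."""
    return sum(bool(run_trial(k)) for k in range(n_trials)) / n_trials
```

Repeating this over every pair (r, SL) yields the success-rate surfaces of Fig. 8.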

7 Discussion

In order to show the relation between the above cases and chaotic dynamics, we have investigated, from a dynamical viewpoint, the dynamical structure of the chaotic dynamics. For each connectivity from 1 to 60, the network performs chaotic wandering for a long time from a random initial state pattern. During this history, we have taken statistics of the continuously staying time in a certain basin [8] and evaluated the distribution p(l, mu), which is defined by

    p(l, mu) = #{ l | S(t) in beta_mu for tau <= t <= tau + l, S(tau - 1) not in beta_mu and S(tau + l + 1) not in beta_mu },  mu in [1, L],   (4)

    beta_mu = U_{lambda=1}^{K} B_{lambda mu},   (5)

    T = Sum_l l * p(l, mu),   (6)

where l is the length of the continuously staying time steps in each attractor basin, and p(l, mu) represents the distribution of continuously staying l steps in attractor basin mu within T steps. In our actual simulation, T = 10^5. For the different connectivities r = 15 and r = 50, the distributions p(l, mu) are shown in Fig.10(a) and Fig.10(b), where different basins are marked with different symbols. From the results, we can see that the continuously staying time l becomes longer and longer with the increase of the connectivity r. Referring to the novel behaviors discussed in the previous section, let us try to consider the reason.
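As a concrete illustration, the statistics of eq.(4) amount to run-length counting over a record of visited basins (a sketch; `basin_seq` is a hypothetical pre-computed sequence of basin indices, one per time step):

```python
from collections import Counter

def staying_time_distribution(basin_seq):
    # basin_seq[t] is the index of the basin that S(t) belongs to.
    # Returns {(l, mu): count}: how often the orbit stayed exactly l
    # consecutive steps in basin mu, as in eq.(4).
    p = Counter()
    run_basin, run_len = basin_seq[0], 1
    for b in basin_seq[1:]:
        if b == run_basin:
            run_len += 1
        else:
            p[(run_len, run_basin)] += 1
            run_basin, run_len = b, 1
    p[(run_len, run_basin)] += 1
    return dict(p)
```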

Tracking a Moving Target Using Chaotic Dynamics   187

[Fig. 10 plots omitted: frequency distribution of staying for Basins 1-4, on a log scale from 1 to 100000; (a) r = 15: shorter; (b) r = 50: longer.]

Fig. 10. The log plot of the frequency distribution of the continuously staying time l: the horizontal axis represents the continuously staying time steps l in a certain basin mu during long-time chaotic wandering, and the vertical axis represents the accumulated number p(l, mu) of the same staying time steps l in that basin. The continuously staying time steps l become long with the increase of the connectivity r.

First, in the case of slower target velocity, a decreasing success rate with the increase of connectivity r is observed for both circular target trajectories and linear ones. This shows that chaotic dynamics localized in a certain basin for too long is not good for tracking a slower target. Second, in the case of faster target velocity, it seems useful for tracking when the chaotic dynamics is not too strong. Computer simulations show that, when the target moves quickly, the action of the tracker is always chaotic as it tracks the target. From past experiments, we know that the motion increments of chaotic motion are very short. Therefore, shorter motion increments and faster target velocity result in bad tracking performance. However, when the continuously staying time l in a certain basin becomes longer, the tracker can move in a certain direction for l steps. This is useful for tracking a faster target. Therefore, when the connectivity becomes somewhat large (r = 50 or so), the success rate rises with the increase of target velocity, as in the case shown in Fig.9. As an issue for future study, the functional aspects of chaotic dynamics still show context dependence.

8 Summary

We proposed a simple method for tracking a moving target using chaotic dynamics in a recurrent neural network model. Although chaotic dynamics cannot always solve all complex problems with better performance, better results were often observed when using chaotic dynamics to solve certain ill-posed problems, such as tracking a moving target and solving mazes [8]. From the results of the computer simulations, we can state the following points.

• A simple method for tracking a moving target was proposed.
• Chaotic dynamics is quite efficient for tracking a target that is moving along a circular trajectory.


• The performance of tracking a moving target along a linear trajectory is not as good as along a circular trajectory; however, for some linear trajectories, excellent performance was observed.
• The length of the continuously staying time steps becomes long with the increase of the synaptic connectivity r, which induces chaotic dynamics in the network.
• A longer continuously staying time in a certain basin seems useful for tracking a faster target.

References

1. Babloyantz, A., Destexhe, A.: Low-dimensional chaos in an instance of epilepsy. Proc. Natl. Acad. Sci. USA 83, 3513–3517 (1986)
2. Skarda, C.A., Freeman, W.J.: How brains make chaos in order to make sense of the world. Behav. Brain Sci. 10, 161–195 (1987)
3. Nara, S., Davis, P.: Chaotic wandering and search in a cycle memory neural network. Prog. Theor. Phys. 88, 845–855 (1992)
4. Nara, S., Davis, P., Kawachi, M., Totuji, H.: Memory search using complex dynamics in a recurrent neural network model. Neural Networks 6, 963–973 (1993)
5. Nara, S., Davis, P., Kawachi, M., Totuji, H.: Chaotic memory dynamics in a recurrent neural network with cycle memories embedded by pseudo-inverse method. Int. J. Bifurcation and Chaos Appl. Sci. Eng. 5, 1205–1212 (1995)
6. Nara, S., Davis, P.: Learning feature constraints in a chaotic neural memory. Phys. Rev. E 55, 826–830 (1997)
7. Nara, S.: Can potentially useful dynamics to solve complex problems emerge from constrained chaos and/or chaotic itinerancy? Chaos 13(3), 1110–1121 (2003)
8. Suemitsu, Y., Nara, S.: A solution for two-dimensional mazes with use of chaotic dynamics in a recurrent neural network model. Neural Comput. 16(9), 1943–1957 (2004)
9. Tsuda, I.: Chaotic itinerancy as a dynamical basis of Hermeneutics in brain and mind. World Futures 32, 167–184 (1991)
10. Tsuda, I.: Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behav. Brain Sci. 24(5), 793–847 (2001)
11. Kaneko, K., Tsuda, I.: Chaotic Itinerancy. Chaos 13(3), 926–936 (2003)
12. Aihara, K., Takabe, T., Toyoda, M.: Chaotic Neural Networks. Phys. Lett. A 144, 333–340 (1990)

A Generalised Entropy Based Associative Model

Masahiro Nakagawa

Nagaoka University of Technology, Kamitomioka 1603-1, Nagaoka, Niigata 940-2188, Japan
[email protected]

Abstract. In this paper, a generalised entropy based associative memory model is proposed and applied to memory retrieval with analogue embedded vectors, instead of binary ones, in order to compare with the conventional autoassociative model with a quadratic Lyapunov functional. In the present approach, the updating dynamics is constructed on the basis of an entropy minimization strategy, which may be reduced asymptotically to the autocorrelation dynamics as a special case. From numerical results, it is found that the proposed approach realizes a larger memory capacity, even for analogue memory retrieval, in comparison with the autocorrelation model based on dynamics such as the associatron, owing to the higher-order correlations involved in the proposed dynamics. Keywords: Entropy, Associative Memory, Analogue Memory Retrieval.

1 Introduction

During the past quarter century, numerous autoassociative models have been extensively investigated on the basis of the autocorrelation dynamics. Since the proposals of the retrieval models by Anderson [1], Kohonen [2], and Nakano [3], works related to such autoassociation models of neurons inter-connected through an autocorrelation matrix were theoretically analyzed by Amari [4], Amit et al. [5] and Gardner [6]. So far it has been well appreciated that the storage capacity of the autocorrelation model, i.e. the number of stored pattern vectors L that can be completely associated vs. the number of neurons N, which is called the relative storage capacity or loading rate and denoted c = L/N, is estimated as c ~ 0.14 at most for the autocorrelation learning model with the signum activation function (sgn(x) for short) [7,8]. In contrast to the abovementioned models with monotonous activation functions, neuro-dynamics with a nonmonotonous mapping was proposed by Morita [9], Yanai and Amari [10], and Shiino and Fukai [11]. They reported that a nonmonotonous mapping in the neurodynamics possesses a remarkable advantage in the storage capacity, c ~ 0.27, superior to the conventional association models with monotonous mappings, e.g. the signum or sigmoidal function. In the present paper, we shall propose a novel approach based on the entropy defined in terms of the overlaps, i.e. the inner products between the

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 189–198, 2008. © Springer-Verlag Berlin Heidelberg 2008


state vector and the analogue embedded vectors instead of the previously investigated binary ones [1-16,25].

2 Theory

Let us consider an associative model with the embedded analogue vectors e_i^(r) (1 <= i <= N, 1 <= r <= L), where N and L are the number of neurons and the number of embedded vectors, respectively. The states of the neural network are characterized in terms of the output vector s_i (1 <= i <= N) and the internal states sigma_i (1 <= i <= N), which are related to each other by

    s_i = f(sigma_i)   (1 <= i <= N),   (1)

where f(.) is the activation function of the neuron. Then we introduce the following entropy I, which is related to the overlaps as

    I = -(1/2) Sum_{r=1}^{L} (m^(r))^2 log (m^(r))^2,   (2)

where the overlaps m^(r) (r = 1, 2, ..., L) are defined by

    m^(r) = Sum_{i=1}^{N} e_i^†(r) s_i;   (3)

here the covariant vectors e_i^†(r) are defined in terms of the following orthogonal relation,

    Sum_{i=1}^{N} e_i^†(r) e_i^(s) = delta_rs   (1 <= r, s <= L),   (4)

    e_i^†(r) = Sum_{r'=1}^{L} a_rr' e_i^(r'),   (4a)

    a_rr' = (Lambda^-1)_rr',   (4b)

    Lambda_rr' = Sum_{i=1}^{N} e_i^(r) e_i^(r').   (4c)

The entropy defined by eq.(2) is minimized under the following conditions:

    m^(r) = delta_rs   (1 <= r, s <= L)   (5)

and

    Sum_{r=1}^{L} (m^(r))^2 = 1.   (6)

That is, regarding the (m^(r))^2 (1 <= r <= L) as a probability distribution in eq.(2), a target pattern may be retrieved by minimizing the entropy I with respect to the m^(r), or the state vector s_i, so that eqs.(5) and (6) are satisfied. Therefore the entropy function may be considered a functional to be minimized during the retrieval process of the auto-association model, instead of the conventional quadratic energy functional E, i.e.

    E = -(1/2) Sum_{i=1}^{N} Sum_{j=1}^{N} w_ij s_i^† s_j,   (6a)

where s_i^† is the covariant vector defined by

    s_i^† = Sum_{r=1}^{L} Sum_{j=1}^{N} e_i^†(r) e_j^†(r) s_j,   (6b)

and the connection matrix w_ij is defined in terms of

    w_ij = Sum_{r=1}^{L} e_i^(r) e_j^†(r).   (6c)
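As an illustration, the covariant vectors, overlaps and entropy of eqs.(2)-(4c) can be computed as below (a sketch assuming NumPy; the matrix `E` holds hypothetical embedded vectors as rows):

```python
import numpy as np

def covariant_vectors(E):
    # E: L x N matrix whose rows are the embedded vectors e^(r).
    # Returns the covariant vectors e^†(r) of eqs.(4a)-(4c), so that
    # Sum_i e_i^†(r) e_i^(s) = delta_rs, i.e. eq.(4) holds.
    Lam = E @ E.T                  # Lambda_rr' = Sum_i e_i^(r) e_i^(r')
    return np.linalg.inv(Lam) @ E  # a_rr' applied to the e^(r')

def overlaps(E_dag, s):
    # m^(r) = Sum_i e_i^†(r) s_i, eq.(3).
    return E_dag @ s

def entropy(m):
    # I = -(1/2) Sum_r (m^(r))^2 log (m^(r))^2, eq.(2); zero overlaps skipped.
    m2 = m[m != 0.0] ** 2
    return -0.5 * np.sum(m2 * np.log(m2))
```

When the state equals one embedded vector, the overlaps are delta_rs and the entropy of eq.(2) vanishes, which is exactly the minimum of eqs.(5) and (6).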

According to the steepest descent approach in the discrete time model, the updating rule of the internal states sigma_i (1 <= i <= N) may be defined by

    sigma_i(t+1) = -eta dI/ds_i^†   (1 <= i <= N),   (7)

where eta (> 0) is a coefficient. Substituting eqs.(2) and (3) into eq.(7), and noting the following relation with the aid of eq.(6b),

    m^(r) = Sum_{i=1}^{N} e_i^†(r) s_i = Sum_{i=1}^{N} e_i^(r) s_i^†,   (8)

one may readily derive the following relation:

    sigma_i(t+1) = -eta dI/ds_i^†
                 = (eta/2) Sum_{r=1}^{L} d[(m^(r)(t))^2 log (m^(r)(t))^2]/ds_i^†
                 = eta Sum_{r=1}^{L} e_i^(r) m^(r)(t) [1 + log (m^(r)(t))^2].   (9)


Generalizing the above dynamics somewhat, in order to interpolate between the quadratic approach (alpha -> 0) and the present entropy based one (alpha -> 1), we propose the following dynamic rule, in a somewhat ad-hoc manner, for the internal states:

    sigma_i(t+1) = eta Sum_{r=1}^{L} e_i^(r) (Sum_{j=1}^{N} e_j^†(r) s_j(t)) [1 + log(1 - alpha + alpha (Sum_{j=1}^{N} e_j^†(r) s_j(t))^2)]
                 = eta Sum_{r=1}^{L} e_i^(r) m^(r)(t) [1 + log(1 - alpha + alpha (m^(r)(t))^2)].   (10)

In practice, in the limit alpha -> 0, the above dynamics reduces to the autocorrelation dynamics:

    sigma_i(t+1) = lim_{alpha->0} eta Sum_{r=1}^{L} e_i^(r) m^(r)(t) [1 + log(1 - alpha + alpha (m^(r)(t))^2)]
                 = eta Sum_{r=1}^{L} e_i^(r) Sum_{j=1}^{N} e_j^†(r) s_j(t) = eta Sum_{j=1}^{N} w_ij s_j(t).   (11)

On the other hand, eq.(10) reduces to eq.(9) in the case alpha -> 1. Therefore one may control the dynamics between the autocorrelation dynamics (alpha -> 0) and the entropy based approach (alpha -> 1).
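A numerical sketch of one update of eq.(10), assuming NumPy and a generic stand-in activation (`np.tanh` here; the paper's own piecewise-linear f appears below as eq.(13)); for alpha = 0 it reproduces the autocorrelation step of eq.(11):

```python
import numpy as np

def update(sigma, E, E_dag, alpha, eta=1.0, f=np.tanh):
    # One step of the generalised dynamics, eq.(10):
    # sigma_i(t+1) = eta * Sum_r e_i^(r) m^(r) [1 + log(1 - alpha + alpha m^(r)^2)],
    # with m^(r) = Sum_j e_j^†(r) s_j and s = f(sigma).
    s = f(sigma)
    m = E_dag @ s
    return eta * (E.T @ (m * (1.0 + np.log(1.0 - alpha + alpha * m**2))))
```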

3 Numerical Results

The embedded vectors are set to random vectors as follows:

    e_i^(r) = z_i^(r)   (1 <= i <= N, 1 <= r <= L),   (12)

where the z_i^(r) (1 <= i <= N, 1 <= r <= L) are zero-mean pseudo-random numbers between -1 and +1. For simplicity, the activation function of eq.(1) is assumed to be a piecewise linear function, instead of the signum form previously used for the binary embedded vectors [25], and is set to

    s_i = f(sigma_i) = [(1 + sgn(1 - |sigma_i|))/2] sigma_i + sgn(sigma_i) [(1 - sgn(1 - |sigma_i|))/2],   (13)

where sgn(.) denotes the signum function defined by

    sgn(x) = +1 (x >= 0), -1 (x < 0).   (14)
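Eq.(13) is simply a linear function saturated at +-1; as a small sketch:

```python
def sgn(x):
    # Signum of eq.(14), with sgn(x) = +1 for x >= 0.
    return 1.0 if x >= 0.0 else -1.0

def f(sigma):
    # Piecewise-linear activation of eq.(13): f(sigma) = sigma for
    # |sigma| <= 1 and sgn(sigma) otherwise.
    inside = (1.0 + sgn(1.0 - abs(sigma))) / 2.0  # 1 if |sigma| <= 1 else 0
    return inside * sigma + sgn(sigma) * (1.0 - inside)
```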


The initial vector s_i(0) (1 <= i <= N) is set to

    s_i(0) = -e_i^(s)   (1 <= i <= H_d),
             +e_i^(s)   (H_d + 1 <= i <= N),   (15)

where e_i^(s) is the target pattern to be retrieved and H_d is the Hamming distance between the initial vector s_i(0) and the target vector e_i^(s). The retrieval is regarded as successful if the overlap

    m^(s)(t) = Sum_{i=1}^{N} e_i^†(s) s_i(t)   (16)

converges to +-1 for t >> 1, in which case the system is in a steady state such that

    s_i(t+1) = s_i(t),   (17a)

    sigma_i(t+1) = sigma_i(t).   (17b)

To see the retrieval ability of the present model, the success rate S_r is defined as the rate of successful retrievals over 1000 trials with different embedded vector sets e_i^(r) (1 <= i <= N, 1 <= r <= L). To shift from the autocorrelation dynamics at the initial state (t ~ 1) to the entropy based dynamics (t ~ T_max), the parameter alpha in eq.(10) was simply controlled by

    alpha = (t / T_max) alpha_max   (0 <= t <= T_max),   (18)

where T_max is the maximum number of updating iterations according to eq.(10) and alpha_max is the maximum value of alpha. Choosing N = 200, eta = 1, T_max = 25, L/N = 0.5 and alpha_max = 1, we first present an example of the dynamics of the overlaps in Figs.1(a) and (b) (entropy based approach). Therein the cross symbols (x) and the open circles (o) represent, for a retrieval process, the overlaps of a successful retrieval, in which eqs.(5) and (6) are satisfied, and the entropy defined by eq.(2), respectively. In addition, the time dependence of the parameter alpha/alpha_max defined by eq.(18) is depicted as dots. In Fig. 1 it is confirmed that, after a transient state, the complete association corresponding to eqs.(5) and (6) is achieved. We then present the dependence of the success rate S_r on the loading rate alpha = L/N in Figs.2(a) and (b), for H_d/N = 0.3 and N = 100, for the entropy approach and the associatron, respectively. From these results, one may confirm the larger memory capacity of the presently proposed model defined by eq.(10) in

[Fig. 1 plots omitted: overlaps <o(n)> vs. iteration step n (0 to 50), for (a) H_d/N = 0.1 and (b) H_d/N = 0.3; simulation banner: N=100, Ns=100, Tmax=50, alpmax=1.]

Fig. 1. The time dependence of the overlaps m^(r) of the present entropy based model defined by eq.(10)

[Fig. 2 plots omitted: success rate S_r(L/N) vs. loading rate L/N (0 to 1); (a) entropy based model defined by eq.(10), memory capacity 0.9999; (b) conventional associatron model defined by eq.(11), memory capacity 0.0134; H_d/N = 0.3.]

Fig. 2. The dependence of the success rate on the loading rate alpha = L/N for the present entropy based model defined by eq.(10) and for the associatron defined by eq.(11). Here the Hamming distance is set to H_d/N = 0.3.


comparison with the conventional autoassociation model defined by eq.(11). In practice, it is found that the present approach may achieve a high memory capacity beyond the conventional autocorrelation strategy, even for the analogue embedded vectors, as well as in the previously investigated binary case [15,16,25].
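The simulation setup of eqs.(15) and (18) can be sketched as follows (the function names are ours, not the paper's):

```python
def initial_state(target, h_d):
    # Initial vector of eq.(15): the first h_d components of the target
    # pattern are sign-flipped, so its Hamming distance to the target is h_d.
    return [-x for x in target[:h_d]] + list(target[h_d:])

def alpha_schedule(t, t_max, alpha_max=1.0):
    # Linear control of eq.(18): alpha grows from 0 (autocorrelation
    # dynamics) to alpha_max (entropy based dynamics) over t_max steps.
    return (t / t_max) * alpha_max
```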

4 Concluding Remarks

In the present paper, we have proposed an entropy based association model instead of the conventional autocorrelation dynamics. From numerical results, it was found that a large memory capacity may be achieved on the basis of the entropy approach. This advantage in the association property of the present model is considered to result from the fact that the present dynamics updating the internal states, eq.(10), ensures that the entropy, eq.(2), is minimized under the conditions eqs.(5) and (6), which correspond to the successful retrieval of a target pattern. In other words, the higher-order correlations in the proposed dynamics, eq.(10), which were ignored in the conventional approaches [1-11], were found to play an important role in improving the memory capacity, or the retrieval ability. To conclude this work, we show the dependence of the storage capacity, defined as the area under the success rate curves as shown in Fig.2, on the Hamming distance in Fig.3, for the analogue embedded vectors (Ana) as well as the previous binary ones (Bin). In addition, OL and CL denote the orthogonal learning model and the autocorrelation learning model, respectively. Therein one may again see the great advantage of the present model, based on the entropy functional to be minimized, beyond the conventional quadratic form [12,13], even for the analogue embedded vectors. In fact, one may realize a considerably larger storage capacity in the present model in comparison with the associatron over the range of H_d/N up to 0.5. The memory retrievals for the associatron based on the quadratic

[Fig. 3 plot omitted: memory capacity (0 to 1) vs. H_d/N (0.01 to 0.45). Legend: a — Entropy based Model (OL:Ana); m — Entropy based Model (OL:Bin); n — Entropy based Model (CL:Bin); s — Associatron (OL:Bin); t — Associatron (OL:Wii=0:Bin).]

Fig. 3. The dependence of the storage capacity on the Hamming distance. The symbols a, m and n are for the entropy based approach of eq.(10) with orthogonal learning (OL) and autocorrelation learning (CL) [16,17], where Ana and Bin denote the analogue and the binary embedded vectors, respectively. In addition, we present the associatron with orthogonal learning (symbols s) [13] and the associatron with orthogonal learning under the condition wii = 0 (symbols t) [12].


Lyapunov functionals to be minimized become troublesome near H_d/N = 0.5, as seen in Fig.3, since the directional cosine between the initial vector and a target pattern eventually vanishes there. Remarkably, even in such a case, the present model attains a remarkably large memory capacity because of the higher-order correlations involved in eq.(10), as expected from Figs. 1 and 2, for the analogue vectors as well as the binary ones previously investigated [15,16,25]. As a future problem, it seems worthwhile to introduce chaotic dynamics into the present model by means of a periodic activation function, such as a sinusoidal one, as a nonmonotonic activation function [14]. The entropy based approach [15] with chaos dynamics [14] is now in progress and will be reported elsewhere, together with the synergetic models [17-24], in the near future.

References

1. Anderson, J.A.: A Simple Neural Network Generating Interactive Memory. Mathematical Biosciences 14, 197–220 (1972)
2. Kohonen, T.: Correlation Matrix Memories. IEEE Transactions on Computers C-21, 353–359 (1972)
3. Nakano, K.: Associatron - a Model of Associative Memory. IEEE Trans. SMC-2, 381–388 (1972)
4. Amari, S.: Neural Theory of Association and Concept Formation. Biological Cybernetics 26, 175–185 (1977)
5. Amit, D.J., Gutfreund, H., Sompolinsky, H.: Storing Infinite Numbers of Patterns in a Spin-glass Model of Neural Networks. Physical Review Letters 55, 1530–1533 (1985)
6. Gardner, E.: Structure of Metastable States in the Hopfield Model. Journal of Physics A19, L1047–L1052 (1986)
7. Kohonen, T., Ruohonen, M.M.: Representation of Associated Pairs by Matrix Operators. IEEE Transactions C-22, 701–702 (1973)
8. Amari, S., Maginu, K.: Statistical Neurodynamics of Associative Memory. Neural Networks 1, 63–73 (1988)
9. Morita, M.: Associative Memory with Nonmonotone Dynamics. Neural Networks 6, 115–126 (1993)
10. Yanai, H.-F., Amari, S.: Auto-associative Memory with Two-stage Dynamics of Nonmonotonic Neurons. IEEE Transactions on Neural Networks 7, 803–815 (1996)
11. Shiino, M., Fukai, T.: Self-consistent Signal-to-noise Analysis of the Statistical Behaviour of Analogue Neural Networks and Enhancement of the Storage Capacity. Phys. Rev. E48, 867 (1993)
12. Kanter, I., Sompolinsky, H.: Associative Recall of Memory without Errors. Phys. Rev. A 35, 380–392 (1987)
13. Personnaz, L., Guyon, I., Dreyfus, G.: Information Storage and Retrieval in Spin-Glass like Neural Networks. J. Phys. (Paris) Lett. 46, L-359 (1985)
14. Nakagawa, M.: Chaos and Fractals in Engineering, p. 944. World Scientific Inc., Singapore (1999)
15. Nakagawa, M.: Autoassociation Model based on Entropy Functionals. In: Proc. of NOLTA 2006, pp. 627–630 (2006)
16. Nakagawa, M.: Entropy based Associative Model. IEICE Trans. Fundamentals EA-89(4), 895–901 (2006)


17. Fuchs, A., Haken, H.: Pattern Recognition and Associative Memory as Dynamical Processes in a Synergetic System I. Biological Cybernetics 60, 17–22 (1988)
18. Fuchs, A., Haken, H.: Pattern Recognition and Associative Memory as Dynamical Processes in a Synergetic System II. Biological Cybernetics 60, 107–109 (1988)
19. Fuchs, A., Haken, H.: Dynamic Patterns in Complex Systems. In: Kelso, J.A.S., Mandell, A.J., Shlesinger, M.F. (eds.), World Scientific, Singapore (1988)
20. Haken, H.: Synergetic Computers and Cognition. Springer, Heidelberg (1991)
21. Nakagawa, M.: A Study of Association Model based on Synergetics. In: Proceedings of the International Joint Conference on Neural Networks 1993, Nagoya, Japan, pp. 2367–2370 (1993)
22. Nakagawa, M.: A Synergetic Neural Network. IEICE Fundamentals E78-A, 412–423 (1995)
23. Nakagawa, M.: A Synergetic Neural Network with Crosscorrelation Dynamics. IEICE Fundamentals E80-A, 881–893 (1997)
24. Nakagawa, M.: A Circularly Connected Synergetic Neural Networks. IEICE Fundamentals E83-A, 881–893 (2000)
25. Nakagawa, M.: Entropy based Associative Model. In: Proceedings of ICONIP 2006, pp. 397–406. Springer, Heidelberg (2006)

The Detection of an Approaching Sound Source Using Pulsed Neural Network

Kaname Iwasa1, Takeshi Fujisumi1, Mauricio Kugler1, Susumu Kuroyanagi1, Akira Iwata1, Mikio Danno2, and Masahiro Miyaji3

1 Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, 466-8555, Japan
[email protected]
2 Toyota InfoTechnology Center, Co., Ltd, 6-6-20 Akasaka, Minato-ku, Tokyo, 107-0052, Japan
3 Toyota Motor Corporation, 1 Toyota-cho, Toyota, Aichi, 471-8572, Japan

Abstract. Current automobile safety systems based on video cameras and movement sensors fail when objects are out of the line of sight. This paper proposes a system based on pulsed neural networks that is able to detect whether a sound source is approaching a microphone or moving away from it. The system, based on PN models, compares the sound level difference between consecutive instants of time in order to determine the source's relative movement. Moreover, the combined level difference information of all frequency channels permits the identification of the type of the sound source. Experimental results show that, for the sounds of three different vehicles, the relative movement and the sound source type could be successfully identified.

1 Introduction

Driving safety is one of the major concerns of the automotive industry nowadays. Video cameras and movement sensors are used in order to improve the driver's perception of the environment surrounding the automobile [1][2]. These methods present good performance when detecting objects (e.g., cars, bicycles, and people) which are in the line of sight of the sensor, but fail in the case of obstruction or dead angles. Moreover, the use of multiple cameras or sensors for handling dead angles increases the size and cost of the safety system. The human being, in contrast, is able to perceive people and vehicles in the surroundings from the information provided by the auditory system [3]. If this ability could be reproduced by artificial devices, complementary safety systems for automobiles would emerge. Because of diffraction, sound waves can contour objects and be detected even when the source is not in the direct line of sight. A possible approach for processing temporal data is the use of Pulsed Neuron (PN) models [4]. This type of neuron deals with input signals in the form of pulse trains, using an internal membrane potential as a reference for generating pulses on its output. PN models can directly deal with temporal data and can be efficiently implemented in hardware, due to their simple structure. Furthermore,

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 199–208, 2008. © Springer-Verlag Berlin Heidelberg 2008


high processing speeds can be achieved, as PN model based methods are usually highly parallelizable. A sound localization system based on pulsed neural networks has already been proposed in [5], and a sound source identification system, with a corresponding implementation on FPGA, was introduced in [6]. This paper focuses specifically on the relative moving direction of a sound emitting object, and proposes a method to detect whether a sound source is approaching a microphone or moving away from it. The system, based on PN models, compares the sound level difference between consecutive instants of time in order to determine the source's relative movement. Moreover, the proposed method also identifies the type of the sound source by using a PN model based competitive learning pulsed neural network to process the spectral information.

2 Pulsed Neuron Model

When processing time series data (e.g., sound), it is important to consider the time relations in the data and to have computationally inexpensive calculation procedures that enable real-time processing. For these reasons, a PN model is used in this research.

[Fig. 1 diagram omitted: input pulses IN_1(t) ... IN_n(t) enter synapses with weights w_1 ... w_n, producing local membrane potentials p_1(t) ... p_n(t); their sum is the inner potential of the neuron I(t), which is compared against the threshold theta to produce the output pulses o(t).]

Fig. 1. Pulsed neuron model

Figure 1 shows the structure of the PN model. When an input pulse IN_k(t) reaches the k-th synapse, the local membrane potential p_k(t) is increased by the value of the weight w_k. The local membrane potentials decay exponentially across time with a time constant tau_k. The neuron's output o(t) is given by

    o(t) = H(I(t) - theta),   (1)

    I(t) = Sum_{k=1}^{n} p_k(t),   (2)

    p_k(t) = w_k IN_k(t) + p_k(t-1) e^(-dt/tau_k),   (3)


where n is the total number of inputs, I(t) is the inner potential, θ is the threshold and H(·) is the unit step function. The PN model also has a refractory period tndti , during which the neuron is unable to ﬁre, independently of the membrane potential.
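A minimal sketch of Eqs.(1)-(3) for one time step (a hypothetical helper; a unit step interval dt = 1 is assumed, and the refractory period is omitted):

```python
import math

def pn_step(p, inputs, w, tau, theta, dt=1.0):
    # One time step of the pulsed neuron of Eqs.(1)-(3): each local
    # membrane potential p_k decays with time constant tau and is
    # incremented by w_k when a pulse arrives; the neuron fires when the
    # summed inner potential I(t) reaches the threshold theta.
    decay = math.exp(-dt / tau)
    p = [pk * decay + wk * xk for pk, wk, xk in zip(p, w, inputs)]
    fired = 1 if sum(p) >= theta else 0
    return p, fired
```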

3 The Proposed System

The basic structure of the proposed system is shown in Fig.2. The system consists of three main blocks, the frequency-pulse converter, the level difference extractor and the sound source classifier, of which the last two are based on PN models. The relative movement (approaching or moving away) of the sound source is determined from the sound level variation. The system compares the signal level x(t) from a microphone with the level at a previous time, x(t-dt). If x(t) > x(t-dt), the sound source is getting closer to the microphone; if x(t) < x(t-dt), it is moving away. After the level difference has been extracted, the outputs of the level difference extractors contain the spectral pattern of the input sound, which is then used for recognizing the type of the source.

3.1 Filtering and Frequency-Pulse Converter

Initially, the input signal must be pre-processed and converted to a train of pulses. A bank of 4th order band-pass filters decomposes the signal into 13 frequency channels equally spaced on a logarithmic scale from 500 Hz to 2 kHz. Each frequency channel is modified by the non-linear function shown in Eq.(4), and the resulting signal's envelope is extracted by a 400 Hz low-pass filter.

[Fig. 2 diagram omitted: input signal -> filter bank & frequency-pulse converter (channels f_1 ... f_N) -> time delays providing x(t) and x(t-dt) for each channel -> level difference extractors -> sound source classifier (approaching detection & sound classification).]

Fig. 2. The structure of the recognition system

Finally,

each output signal is independently converted to a pulse train whose rate is proportional to the amplitude of the signal.

    F(t) = x(t)^(1/3)         (x(t) >= 0)
           (1/4) x(t)^(1/3)   (x(t) < 0)   (4)
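A sketch of the compression of Eq.(4) as reconstructed above (the 1/4 factor on the negative half-wave is our reading of the fragmentary formula):

```python
def compress(x):
    # Cube-root amplitude compression of Eq.(4); the cube root of a
    # negative value is computed as -(-x)^(1/3) to stay in the reals.
    if x >= 0.0:
        return x ** (1.0 / 3.0)
    return -0.25 * (-x) ** (1.0 / 3.0)
```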

3.2 Level Difference Extractor

Each pulse train generated by the frequency-pulse converter is input to a Level Difference Extractor (LDE) independently. The LDE, shown in Fig. 3, is composed of two parts, the Lateral Superior Olive (LSO) model and the Level Mapping Two (LM2) model [7]. In the LSO model and the LM2 model, each neuron works according to Eq.(3). The LSO is responsible for the level difference extraction itself, while the LM2 extracts the envelope of the complex firing pattern. The pulse train corresponding to each frequency channel is input to an LSO model. The PN inner potential I^LSO_{i,f}(t) of the i-th LSO neuron of the f-th channel is calculated as follows:

    I^LSO_{i,f}(t) = p^N_{i,f}(t) + p^B_{i,f}(t),   (5)

    p^N_{i,f}(t) = w^N_{i,f} x_f(t) + p^N_{i,f}(t-1) e^(-dt/tau_LSO),   (6)

    p^B_{i,f}(t) = w^B_{i,f} x_f(t-dt) + p^B_{i,f}(t-1) e^(-dt/tau_LSO),   (7)

ERR = − Here,

′ − Emin Emin Emin

′ is the minimum training error during backward elimination. Emin

2.1.2.2 Stopping Criterion in double feature deletion During the course of training, the irrelevant features are sequentially deleted double at a time in BE: the SC is adopted as ERR CA . Here, th refers to threshold value equals to 0.05, CA′ and CA refer to the classification accuracies during BE and before BE, respectively.

2.1.2.2 Stopping Criterion for Double Feature Deletion. During the course of training, irrelevant features are sequentially deleted two at a time in BE; the SC is adopted by comparing ERR with a threshold th and CA' with CA. Here th refers to a threshold value equal to 0.05, and CA' and CA refer to the classification accuracies during BE and before BE, respectively.

training error, Emin for removing each ith feature. Then, calculate the contribution of each feature as follows. i i Training error difference for removing the ith features is, Ediff = Emin − Emin where Emin is the minimum training error using all features. Now, we need to calculate the average training error difference for removing ith feature Eavg = 1

N

∑E N i =1

i diff

where

N is the number of features. Now, the percentage contribution of ith feature is,

Coni = 100.

i Eavg − Ediff

Eavg

Now, we can decide the rank of features according to Conmax = max(Coni ) . The worst ranking features are treated as the irrelevant features for the network as these are providing comparatively less contribution for the network.
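A minimal sketch of this contribution measure (illustrative names, not from the paper):

```python
def contributions(e_min_all, e_min_removed):
    """Percentage contribution Con_i of each feature.

    e_min_all:     E_min, the minimum training error using all features.
    e_min_removed: E^i_min for each feature i, the minimum training error
                   after removing that feature (one-by-one removal training).
    """
    e_diff = [e_i - e_min_all for e_i in e_min_removed]      # E^i_diff
    e_avg = sum(e_diff) / len(e_diff)                        # E_avg
    return [100.0 * (e_avg - d) / e_avg for d in e_diff]     # Con_i
```

Features can then be ranked by Con_i, the worst-ranked ones becoming deletion candidates during BE.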

Feature Subset Selection Using Constructive Neural Nets

377

2.3 Calculation of Network Connection
The number of connections (C) of the final network architecture can be calculated in terms of the number of remaining input attributes (x), the number of remaining hidden units (h), and the number of outputs (o) as

    C = (x × h) + (h × o) + h + o.

2.3.1 Reduction of Network Connection (RNC)
Here we estimate how much the number of connections has been reduced due to feature selection. We first calculate the total number of connections of the network before and after BE, C_before and C_after, respectively. We then estimate

    RNC = 100 · (C_before − C_after) / C_before.

2.3.2 Increment of Network Accuracy (INA)
Along with the reduction of network connections, we estimate the INA, which shows how much the network accuracy improves. We measure the classification accuracy before and after BE, CA_before and CA_after, respectively. We then estimate

    INA = 100 · (CA_after − CA_before) / CA_before.
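Under the definitions above, the connection count and the two percentages can be sketched as follows (the CA_before denominator in INA is my reading of the formula, chosen to be consistent with the accuracy increases reported in Table 3):

```python
def n_connections(x, h, o):
    """C = (x*h) + (h*o) + h + o: input-to-hidden and hidden-to-output
    weights plus one bias per hidden and per output unit."""
    return x * h + h * o + h + o

def rnc(c_before, c_after):
    """Reduction of Network Connection (%)."""
    return 100.0 * (c_before - c_after) / c_before

def ina(ca_before, ca_after):
    """Increment of Network Accuracy (%)."""
    return 100.0 * (ca_after - ca_before) / ca_before
```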

2.4 The Algorithm

In this framework, CFSS comprises three main steps, summarized in Fig. 1: (a) determination of the NN architecture in a constructive fashion (Steps 1-2), (b) measurement of the contribution of each feature (Step 3), and (c) subset generation (Steps 4-8). These steps depend on each other, and the entire process is carried out step by step according to particular criteria. The details of each step are as follows.

Step 1) Create a minimal NN architecture. Initially it has three layers: an input layer, an output layer, and a hidden layer with only one neuron. The numbers of input and output neurons are equal to the numbers of inputs and outputs of the problem. Randomly initialize the connection weights between the input and hidden layers and between the hidden and output layers within the range [−1.0, +1.0].

Step 2) Train the network using the BPL algorithm, aiming for a minimum training error, E_min. Then add an HN, retrain the network from the beginning, and check the SC according to subsection 2.1.1. When it is satisfied, validate the NN on the test set to calculate the classification accuracy, CA, and go to Step 3. Otherwise, continue the HN selection process.

Step 3) To find the contribution of the features, we perform one-by-one removal training of the network: delete each ith feature in turn and save the individual minimum training error, E^i_min. The rank of all features is then obtained according to subsection 2.2. Go to Step 4.

378

M.M. Kabir, M. Shahjahan, and K. Murase

[Flowchart omitted: (1) create an NN with minimal architecture and train; (2) add HNs until the SC is satisfied, then validate the NN and calculate accuracy; (3) one-by-one removal training, compute contributions; (4) delete double/single features, train, validate the subset, and calculate accuracy; (5) backtrack when the SC fails; (6) repeat until single deletion ends; (7) further check; (8) final subset.]

Fig. 1. Overall flowchart of CFSS. Here, NN, HN, and SC refer to neural network, hidden neuron, and stopping criterion, respectively.

Step 4) This is the first stage of BE for generating the feature subset. Initially, we attempt to delete the two worst-ranked features at a time in order to accelerate the process. Calculate the minimum training error E′_min during training. Validate the remaining features on the test set and calculate the classification accuracy CA′. If the SC of subsection 2.1.2.2 is satisfied, continue; otherwise, go to Step 5.

Step 5) Perform backtracking; that is, restore the most recently deleted pair (or single feature) together with all associated components.

Step 6) If the single-deletion process has not yet finished, attempt to delete one feature at a time as in Step 4 to filter out the remaining unnecessary ones; otherwise go to Step 7. If the SC of subsection 2.1.2.3 is satisfied, continue; otherwise, go to Step 5 and then Step 7.


Step 7) Before selecting the final subset, we check the remaining features once more for any irrelevant feature. The following steps accomplish this task. i) Delete the remaining features from the network one at a time in order of worst rank, retraining after each deletion. ii) Validate the remaining features on the test set; for the next stage, save the classification accuracy CA′′, the deleted feature responsible, and all its components. iii) If DF … CA. v) If better values of CA′′ are available, identify the highest-ranked feature among them; otherwise, stop. Delete that higher-ranked feature, together with the corresponding worst-ranked ones, from the subset obtained at Step 7-i. After that, restore the components of the network obtained at Step 7-ii and stop.

Step 8) Finally, we obtain the relevant feature subset with a compact network.
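At a high level, the backward-elimination loop of Steps 4-6 can be sketched as follows; train, validate, and rank stand in for the training, test-set validation, and contribution-ranking procedures described above (this is a sketch of the control flow, not the paper's code):

```python
def backward_elimination(features, train, validate, rank,
                         th_double=0.05, th_single=0.08):
    """Delete worst-ranked features, two at a time and then one at a
    time, backtracking (keeping the previous subset) when the SC fails."""
    e_min, ca = train(features), validate(features)
    for step, th in ((2, th_double), (1, th_single)):
        while len(features) > step:
            worst = rank(features)[:step]               # worst-ranked features
            trial = [f for f in features if f not in worst]
            e_be, ca_be = train(trial), validate(trial)
            err = (e_be - e_min) / e_min                # ERR (subsection 2.1.2.1)
            if err < th and ca_be >= ca:                # SC satisfied: keep deletion
                features = trial
            else:                                       # SC failed: backtrack
                break
    return features
```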

3 Experimental Analysis

This section evaluates CFSS's performance on four well-known benchmark problems obtained from [13]: Breast cancer (BCR), Diabetes (DBT), Glass (GLS), and Iris (IRS). The numbers of input attributes and output classes are 9 and 2 for BCR, 8 and 2 for DBT, 9 and 6 for GLS, and 4 and 3 for IRS. Each dataset is partitioned into three sets: a training set, a validation set, and a test set. The first 50% is used as the training set; the second 25% is used as the validation set to monitor training and stop it when the overall training error rises. The last 25% is used as the test set to evaluate the generalization performance of the network and is not seen by the network during the whole training process. In all experiments, two bias nodes with a fixed input of +1 are connected to the hidden and output layers. The learning rate η is set between [0.1, 0.3], and the weights are initialized to random values in [−1.0, +1.0]. Each experiment was carried out 10 times, and the reported results are the averages of these 10 runs. All experiments were run on a Pentium-IV 3 GHz desktop personal computer.

3.1 Experimental Results

Table 1 shows the average results of CFSS, including the number of selected features and classification accuracies, for the BCR, DBT, GLS, and IRS problems before and after the feature selection process. For clarity, the training error in this table is measured as in Section 2.5 of [12], and the classification accuracy CA is the ratio of correctly classified examples to the total number of examples in the particular data set.

Fig. 2. Contribution of attributes (%) in the breast cancer problem for a single run.

Fig. 3. Performance (classification accuracy, %) of the network according to the size of the subset in the diabetes problem for a single run.

Table 1. Results for the four problems BCR, DBT, GLS, and IRS. Numbers in parentheses are standard deviations. TERR, CA, FS, HN, ITN, and CTN refer to training error, classification accuracy, feature selection, hidden neuron, iteration, and connection, respectively.

Name | Feature# (before FS) | TERR% (before) | CA% (before) | Feature# (after FS) | TERR% (after) | CA% (after) | HN#  | ITN#   | CTN#
BCR  | 9 (0.00)             | 3.71 (0.002)   | 98.34 (0.31) | 3.2 (0.56)          | 3.76 (0.003)  | 98.57 (0.52) | 2.60 | 249.4  | 18.12
DBT  | 8 (0.00)             | 16.69 (0.006)  | 75.04 (2.06) | 2.5 (0.52)          | 14.37 (0.024) | 76.13 (1.74) | 2.66 | 271.3  | 16.63
GLS  | 9 (0.00)             | 26.44 (0.026)  | 64.71 (1.70) | 4.1 (1.19)          | 26.56 (0.028) | 66.62 (2.18) | 3.3  | 1480.9 | 42.63
IRS  | 4 (0.00)             | 4.69 (0.013)   | 97.29 (0.00) | 1.3 (0.48)          | 8.97 (0.022)  | 97.29 (0.00) | 3.2  | 2317.1 | 19.96

Table 2. Computational time for the BCR, DBT, GLS, and IRS problems.

Name | Computational Time (seconds)
BCR  | 22.10
DBT  | 30.07
GLS  | 22.30
IRS  | 7.90

Table 3. Connection reduction and accuracy increase for the BCR, DBT, GLS, and IRS problems.

Name | Connection Decrease (%) | Accuracy Increase (%)
BCR  | 45.42                   | 0.23
DBT  | 46.80                   | 1.45
GLS  | 27.5                    | 2.95
IRS  | 30.2                    | 0.00

In each experiment, CFSS generates a subset containing the minimum number of relevant features. Thereby, CFSS produces better CA with a smaller number of hidden neurons (HNs). Table 1 shows, for example, that for BCR the network selects a minimal subset of 3.2 relevant features on average, which yields 98.57% CA while using only 2.6 hidden neurons to build the minimal network structure. The results for the remaining problems DBT and GLS are similar to BCR, with IRS as the exception. This is because, for the IRS problem, the network cannot produce better accuracy any


more after feature selection; rather, it generates a subset of 1.3 relevant features on average out of 4 attributes, which is sufficient to exhibit the same network performance. In addition, Fig. 2 shows the arrangement of the attributes during training according to their contribution for BCR; CFSS can easily delete the unnecessary ones and generate the subset {6, 7, 1}. Fig. 3 shows the relationship between classification accuracy and subset size for DBT. We calculated the number of connections of the pruned network at the final stage according to Section 2.3; the results are shown in Table 1. The computational time for completing the entire FSS process is given in Table 2. We then estimated the decrease in network connections and the corresponding increase in accuracy due to FSS, shown in Table 3, according to subsections 2.3.1 and 2.3.2. One can see the relation between the reduction of network connections and the increase in accuracy.

3.2 Comparison with Other Methods

In this section, we compare the results of CFSS with those obtained by two other methods, NNFS and ICFS, reported in [4] and [6], respectively. The results are summarized in Tables 4-7. Some caution is required, because the methods involve different feature-selection techniques.

Table 4. Comparison of the number of relevant features for the BCR, DBT, GLS, and IRS data sets.

Name | CFSS | NNFS | ICFS (M1) | ICFS (M2)
BCR  | 3.2  | 2.70 | 5         | 5
DBT  | 2.5  | 2.03 | 2         | 3
GLS  | 4.1  | -    | 5         | 4
IRS  | 1.3  | -    | -         | -

Table 5. Comparison of the average testing CA (%) for the BCR, DBT, GLS, and IRS data sets.

Name | CFSS  | NNFS  | ICFS (M1) | ICFS (M2)
BCR  | 98.57 | 94.10 | 98.25     | 98.25
DBT  | 76.13 | 74.30 | 78.88     | 78.70
GLS  | 66.62 | -     | 63.77     | 66.61
IRS  | 97.29 | -     | -         | -

Table 6. Comparison of the average number of hidden neurons for the BCR, DBT, GLS, and IRS data sets.

Name | CFSS | NNFS | ICFS (M1) | ICFS (M2)
BCR  | 2.60 | 12   | 33.55     | 42.05
DBT  | 2.66 | 12   | 8.15      | 21.45
GLS  | 3.3  | -    | 62.5      | 53.95
IRS  | 3.2  | -    | -         | -

Table 7. Comparison of the average number of connections for the BCR, DBT, GLS, and IRS data sets.

Name | CFSS  | NNFS  | ICFS (M1) | ICFS (M2)
BCR  | 18.12 | 70.4  | 270.4     | 338.5
DBT  | 16.63 | 62.36 | 42.75     | 130.7
GLS  | 42.63 | -     | 756       | 599.4
IRS  | 19.96 | -     | -         | -

Table 4 shows the discrimination capability of CFSS, in which the dimensionality of the input layer is reduced for the above-mentioned four problems. For BCR, CFSS performs considerably better than both ICFS variants, though not better than NNFS. The result for DBT is average in comparison with the others, while for GLS the result is comparable to or better than the two ICFS variants.


The comparison of the average testing CA for all problems is shown in Table 5. The results for BCR and GLS are better than those of NNFS and ICFS, and the result for DBT is better than that of NNFS. The most important aspect of our proposed method, however, is the smaller number of connections in the final network, due to the smaller number of HNs in the NN architecture. The results are shown in Tables 6 and 7, where the numbers of HNs and connections of CFSS are much smaller than those of the other methods. Note that there are some missing entries in Tables 4-7, since the IRS problem was not tested by NNFS and ICFS, nor the GLS problem by NNFS.

4 Discussion

This paper presents a new combinatorial method for feature selection that generates a subset with minimal computation, owing to the minimal size of the hidden layer as well as the input layer. A constructive technique is used to achieve a compact hidden layer, and a straightforward contribution measurement leads to a reduced input layer with better performance. Moreover, the composite combination of BE by means of double and single elimination, backtracking, and validation helps CFSS generate subsets proficiently. The results in Table 1 show that CFSS generates subsets with a small number of relevant features while producing better performance on the four benchmark problems. The results of relevant subset generation and generalization performance are better than or comparable to the other methods, as shown in Tables 4 and 5. Wrapper approaches to FSS have long been avoided because of their heavy computational cost; in CFSS, the computational cost is much lower. As seen in Table 3, due to FSS, 37.48% of the network connections are removed on average, accompanied by a 1.18% increase in network accuracy, over the four problems. The computational time needed to complete the entire process in CFSS for the different problems is shown in Table 2. We believe these values are sufficiently low, especially for the clinical field: the system can give a diagnostic result for a patient to the doctor within a minute. Although an exact comparison is difficult, other methods such as NNFS and ICFS may take 4-10 times longer, since their numbers of hidden neurons and connections are much larger, as seen in Tables 6 and 7, respectively. CFSS thus provides minimal computation in feature subset selection. In addition, during subset generation, we validate the generated subset at each step. The reason is that during BE we build a composite SC, which eventually finds the local minimum at which network training should be stopped.
Because of this criterion, the network achieves significant performance, and there is thus no need to validate the generated subset at the very end. Furthermore, the additional check for irrelevant features in BE brings completeness to CFSS. In this study we applied CFSS to datasets with up to 9 features. To obtain more relevant tests for real tasks, we intend to apply CFSS to datasets with a larger number of features in the future. Extracting rules from an NN is always in demand for interpreting the knowledge of how it works, and for this a compact NN is desirable. As CFSS helps fulfill this requirement, rule extraction from the NN is a further task for future work.


5 Conclusion

This paper presents a new approach for feature subset selection based on the contribution of input attributes in an NN. The combination of constructive training, contribution measurement, and backward elimination carries the success of CFSS. Initially, a basic constructive algorithm is used to determine a minimal and optimal structure of the NN. In the latter part, one-by-one removal of input attributes is adopted, which is not computationally expensive. Finally, backward elimination with new stopping criteria is used to generate the relevant feature subset efficiently. To evaluate CFSS, we tested it on four real-world problems: breast cancer, diabetes, glass, and iris. Experimental results confirmed that CFSS has a strong capability for feature selection: it can remove irrelevant features from the network and generates a feature subset while producing a compact network with minimal computational cost.

Acknowledgements. This work was supported by grants to KM from the Japan Society for the Promotion of Science, the Yazaki Memorial Foundation for Science and Technology, and the University of Fukui.

References
1. Liu, H., Yu, L.: Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering 17(4), 491-502 (2005)
2. Dash, M., Liu, H.: Feature Selection for Classification. Intelligent Data Analysis 1(3), 131-156 (1997)
3. Kohavi, R., John, G.H.: Wrappers for Feature Subset Selection. Artificial Intelligence 97, 273-324 (1997)
4. Setiono, R., Liu, H.: Neural Network Feature Selector. IEEE Transactions on Neural Networks 8 (1997)
5. Milne, L.: Feature Selection Using Neural Networks with Contribution Measures. In: 8th Australian Joint Conference on Artificial Intelligence, Canberra (1995)
6. Guan, S., Liu, J., Qi, Y.: An Incremental Approach to Contribution-based Feature Selection. Journal of Intelligent Systems 13(1) (2004)
7. Schuschel, D., Hsu, C.: A Weight Analysis-based Wrapper Approach to Neural Nets Feature Subset Selection. In: Proceedings of the 10th IEEE International Conference on Tools with Artificial Intelligence (1998)
8. Hsu, C., Huang, H., Schuschel, D.: The ANNIGMA-Wrapper Approach to Fast Feature Selection for Neural Nets. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 32(2), 207-212 (2002)
9. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to Instability Problems with Sequential Wrapper-based Approaches to Feature Selection. Journal of Machine Learning Research (2002)


10. Phatak, D.S., Koren, I.: Connectivity and Performance Tradeoffs in the Cascade Correlation Learning Architecture. Technical Report TR-92-CSE-27, ECE Department, UMASS, Amherst (1994)
11. Rumelhart, D.E., McClelland, J.: Parallel Distributed Processing. MIT Press, Cambridge (1986)
12. Prechelt, L.: PROBEN1 - A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21/94, Faculty of Informatics, University of Karlsruhe, Germany (1994)
13. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. Dept. of Information and Computer Sciences, University of California, Irvine (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html

Dynamic Link Matching between Feature Columns for Different Scale and Orientation

Yasuomi D. Sato(1), Christian Wolff(1), Philipp Wolfrum(1), and Christoph von der Malsburg(1,2)

(1) Frankfurt Institute for Advanced Studies (FIAS), Johann Wolfgang Goethe University, Max-von-Laue-Str. 1, 60438 Frankfurt am Main, Germany
(2) Computer Science Department, University of Southern California, LA, 90089-2520, USA

Abstract. Object recognition in the presence of changing scale and orientation requires mechanisms to deal with the corresponding feature transformations. Using Gabor wavelets as example, we approach this problem in a correspondence-based setting. We present a mechanism for ﬁnding feature-to-feature matches between corresponding points in pairs of images taken at diﬀerent scale and/or orientation (leaving out for the moment the problem of simultaneously ﬁnding point correspondences). The mechanism is based on a macro-columnar cortical model and dynamic links. We present tests of the ability of ﬁnding the correct feature transformation in spite of added noise.

1 Introduction

When trying to set two images of the same object or scene into correspondence with each other, so that they can be compared in terms of similarity, it is necessary to find point-to-point correspondences in the presence of changes in scale or orientation (see Fig. 1). It is also necessary to transform local features (unless one chooses to work with features that are invariant to scale and orientation, accepting the reduced information content of such features). Correspondence-based object recognition systems [1,2,3,4] have so far mainly addressed the issue of finding point-to-point correspondences, leaving local features unchanged in the process [5,6]. In this paper, we propose a system that can not only transform features for comparison purposes, but also recognize the transformation parameters that best match two sets of local features, each one taken from one point in an image. Our eventual aim is to find point correspondences and feature correspondences simultaneously in one homogeneous dynamic link matching system, but here we take point correspondences as given for the time being. Both theoretical [7] and experimental [8] investigations suggest 2D-Gabor-based wavelets as the features used in visual cortex. These are best sampled in a log-polar manner [11]. This representation has been shown to be particularly useful for face recognition [12,13], and due to its inherent symmetry it is highly appropriate for implementing a transformation system for scale and

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 385-394, 2008.
© Springer-Verlag Berlin Heidelberg 2008


Fig. 1. When images of an object differ in (a) scale or (b) orientation, correspondence between them can only be established if the features extracted at a given point (e.g., at the dot in the middle of the images) are also transformed during comparison.

orientation. The work described here is based entirely on the macro-columnar model of [14]. In computer simulations we demonstrate feature transformation and transformation parameter recognition.

2 Concept of Scale and Rotation Invariance

Let there be two images, called I and M (for image and model). The Gabor-based wavelet transform J̄^L (L = I or M) has the form of a vector, called a jet, whose components are defined as convolutions of the image with a family of Gabor functions:

    J̄^L_{k,l}(x_0, a, θ) = ∫_{R²} I(x) (1/a²) ψ( (1/a) Q(θ)(x_0 − x) ) d²x,      (1)

    ψ(x) = (1/D²) exp(−|x|²/(2D²)) [ exp(i x^T · e_1) − exp(−D²/2) ],            (2)

    Q(θ) = (  cos θ   sin θ
             −sin θ   cos θ ).                                                    (3)

Here a simultaneously represents the spatial frequency and controls the width of a Gaussian envelope window. D represents the standard deviation of the Gaussian. The individual components of the jet are indexed by n possible orientations and (m + m1 + m2) possible spatial frequencies, with θ = πk/n (k ∈ {0, 1, …, n−1}) and a = a_0^l (0 < a_0 < 1, l ∈ {0, 1, …, m + m1 + m2 − 1}). The parameters m1 and m2 fix the number of scale steps by which the image can be scaled up or down, respectively. m (> 1) is the number of scales in the jets of the model domain. The Gabor jet J̄^L is defined as the set {J̄^L_{k,l}} of N_L Gabor wavelet components extracted from the position x_0 of the image. In order to test the ability to find the correct feature transformation described below, we add noise of strength σ1 to the Gabor wavelet components, J̃^L_{k,l} = J̄^L_{k,l}(1 + σ1 S_ran), where S_ran are random numbers between 0 and 1. Instead of the Gabor jet J̄^L itself, we employ the sum-normalized J̃^L_{k,l}. Jets can be visualized in angular-eccentricity coordinates,


Fig. 2. Comparison of Gabor jets in the model domain M and the image domain I (left and right, resp., in the two part-ﬁgures). Jets are visualized in log-polar coordinates: The orientation θ of jets is arranged circularly while the spatial frequency l is set radially. The jet J M of the model domain M is shown right while the jet J I from the transformed image in the image domain I is shown left. Arrows indicate which components of J I and J M are to be compared. (a) comparison of scaled jets and (b) comparison of rotated jets.

in which orientation and spatial frequency of a jet are arranged circularly and radially (as shown in Fig. 2). The components of jets that are taken from corresponding points in images I and M may be arranged according to orientation (rows) and scale (columns):

    J^I = ( 0 ⋯ 0   J^I_{0,m1}   ⋯ J^I_{0,m1−1+m}     0 ⋯ 0
            ⋮           ⋮                ⋮             ⋮
            0 ⋯ 0   J^I_{n−1,m1} ⋯ J^I_{n−1,m1−1+m}   0 ⋯ 0 ),      (4)

    J^M = ( J^M_{0,m1}   ⋯ J^M_{0,m1−1+m}
            ⋮                  ⋮
            J^M_{n−1,m1} ⋯ J^M_{n−1,m1−1+m} ).                       (5)

Let us assume the two images M and I to be compared have the exact same structure, apart from being transformed relative to each other, and that the transformation conforms to the sample grid of jet components. Then there will be pair-wise identities between components in Eqs. (4) and (5). If the jet in I is scaled relative to the jet in M, the non-zero components of J^I are shifted along the horizontal (or, in Fig. 2(a), radial) coordinate. If the image I, and correspondingly the jet J^I, is rotated, then the jet components are circularly permuted along the vertical axis in Eq. (4) (see Fig. 2(b)). When comparing scaled and rotated jets, the non-zero components of J^I are shifted along both axes simultaneously. There are model jet components of m different scales; to allow for m1 steps of scaling the image I down and m2 steps of scaling it up, the jet in Eq. (4) is padded on the left and right with m1 and m2 columns of zeros, respectively.
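Under these assumptions, the zero padding of Eq. (4) and the shift behaviour of scaled and rotated jets can be sketched with NumPy (illustrative only; the dimensions follow the text):

```python
import numpy as np

n, m, m1, m2 = 8, 5, 2, 2            # orientations; model scales; down/up scale steps
jet_m = np.random.rand(n, m)         # model jet J^M, one row per orientation

# Image jet J^I: the m model scales padded with m1 zero columns on the
# left and m2 on the right, as in Eq. (4).
jet_i = np.zeros((n, m + m1 + m2))
jet_i[:, m1:m1 + m] = jet_m

# Scaling shifts the non-zero components along the scale (horizontal) axis;
# rotation permutes them circularly along the orientation (vertical) axis.
scaled = np.roll(jet_i, 1, axis=1)   # one scale step
rotated = np.roll(jet_i, 3, axis=0)  # rotation by 3 * pi / n
```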


Fig. 3. A dynamic link matching network model of a pair of single macro-columns. In the I and M domains, the network consists of, respectively, NI and NM feature units (mini-columns). The units in the I and M domains represent components of the jets J I and J M . The Control domain C controls interconnections between the I and M macrocolumns, which are all-to-all at the beginning of a ν cycle. By comparing the two jets and through competition of control units in C, the initial all-to-all interconnections are dynamically reduced to a one-to-one interconnection.

3 Modelling a Columnar Network

In analogy to [14], we set up a dynamical system of variables, both for jet components and for all possible correspondences between jet components. These variables are interpreted as the activity of cortical mini-columns (here called "units", in this case "feature units"). The set of units corresponding to a jet, either in I or in M, forms a cortical macro-column (or short "column") and its units inhibit each other, the strength of inhibition being controlled by a parameter ν, which cyclically ramps up and falls down. In addition, there is a column of control units, forming the control domain C. Each control unit stands for one pair of possible values of relative scale and orientation between I and M and can evaluate the similarity in the activity of feature units under the corresponding transformation. The whole network is shown schematically in Fig. 3. Each domain, or column in our simple case, consists of a set of unit activities {P^L_1, …, P^L_{N_L}} in the respective domain (L = I or M). N_I = (m + m1 + m2) × n and N_M = m × n represent the total numbers of units in the I and M domains, respectively. The control unit activations are P^C = (P^C_1, …, P^C_{N_C})^T, where N_C = (m1 + m2 + 1) × n. The equation system for the activations in the domains is given by

    dP^L_α/dt = f_α(P^L) + κ_L C_E J^L_α + κ_L (1 − C_E) E^{LL′}_α + ξ_α(t),
    (α = 1, …, N_L and L′ = {M, I}\L),                                            (6)

    dP^C_γ/dt = f_γ(P^C) + κ_C F_γ + ξ_γ(t),
    (γ = 1, …, N_C),                                                              (7)


Fig. 4. Time course of all activities in the domains I, M and C over a ν cycle for two jets without relative rotation and scale, for a system with size m = 5, n = 8, m1 = 0 and m2 = 4. The activity of units in the M and I columns is shown within the bars on the right and the bottom of the panels, respectively, each of the eight subblocks corresponding to one orientation, with individual components corresponding to diﬀerent scales. The indices k and l at the top label the orientation and scale of the I jet, respectively. The large matrix C contains, in the top version, as non-zero (white) entries all allowed component-to-component correspondences between the two jets (the empty, black, entries would correspond to forbidden relative scales outside the range set by m1 and m2 ). The matrix consists of 8 × 8 blocks, corresponding to the 8 × 8 possible correspondences between jet orientations, and each of these sub-matrices contains m×(m+m1 +m2 ) entries to make room for all possible scale correspondences. Each control unit controls (and evaluates) a whole jet to jet correspondence with the help of its entries in this matrix. The matrix in the lowest panel has active entries corresponding to just one control unit (for identical scale and orientation). Active units are white, silent units black. Unit activity is shown for three moments within one ν cycle. At t = 3.4 ms, the system has already singled out the correct scale, but is still totally undecided as to relative orientation. System parameters are κI = κM = 2, κC = 5, CE = 0.99, σ = 0.015 and σ1 = 0.1.


where ξ_α(t) and ξ_γ(t) are Gaussian noise of strength σ². C_E ∈ [0, 1] scales the ratio between the Gabor jet J^L_α and the total input E^{LL′}_α from units in the other domain L′, and κ_L controls the strength of the projection from L′ to L. κ_C controls the strength of the influence of the similarity F_γ on the control units. We now explain the remaining terms in the above equations. The function f_α(·) is:

    f_α(P^L) = a_1 P^L_α ( P^L_α − ν(t) max_{β=1,…,N_L} {P^L_β} − (P^L_α)² ),     (8)

where a_1 = 25. The function ν(t) describes a cycle in time during which the feedback is periodically modulated:

    ν(t) = { (ν_max − ν_min)(t − T_k)/T_1 + ν_min,   (T_k ≤ t < T_1 + T_k)
             ν_max,                                   (T_1 + T_k ≤ t < T_1 + T_k + T_relax) }   (9)

Here T_1 [ms] is the duration during which ν increases with t, while T_k (k = 1, 2, 3, …) are the periodic times at which the inhibition ν starts to increase. T_relax is a relaxation time after ν has been increased. We set ν_min = 0.4, ν_max = 1, T_1 = 36.0 ms and T_relax = T_1/6 in our study. Because of the increasing ν(t) and the noise in Eqs. (6) and (7), only one mini-column in each domain remains active at the end of the cycle, while the others are deactivated, as shown in Fig. 4. If we define the mean-free column activities as P̃^L_α := P^L_α − (1/N_L) Σ_{α′} P^L_{α′}, the interaction terms in Eqs. (6) and (7) are

    E^{LL′}_α = Σ_{β=1}^{N_C} Σ_{α′=1}^{N_{L′}} P^C_β C^β_{αα′} P̃^{L′}_{α′},     (10)

    F_β = Σ_{γ=1}^{N_M} Σ_{γ′=1}^{N_I} P̃^M_γ C^β_{γγ′} P̃^I_{γ′}.                 (11)

Here N_C = n × (m1 + m2 + 1) is the number of permitted transformations between the two domains (n different orientations and m1 + m2 + 1 possible scales). The expression C^β_{αα′} designates a family of matrices rotating and scaling one jet into another. If we write β = (o, s), so that o identifies the rotation and s the scaling, then we can write the C matrices as a tensor product: C^β = A^o ⊗ B^s, where A^o is an n × n matrix with a wrap-around shifted diagonal of entries 1 and zeros otherwise, implementing a rotation o of jet components at any orientation, and B^s is an m × (m + m1 + m2) matrix with a shifted diagonal of entries 1 and zeros otherwise, implementing an up or down shift of jet components at any scale. The C matrix implementing identity looks like the lowest panel in Fig. 4. The function E of Eq. (10) transfers feature unit activity from one domain to the other, and is a weighted superposition of all possible transformations, weighted with the activity of the control units. The function F of Eq. (11) evaluates the similarity of the feature unit activity in one domain with that in the other under mutual transformation.
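The tensor-product construction of C^β can be sketched directly with NumPy's Kronecker product (a sketch under the dimensions given in the text; the function name is mine):

```python
import numpy as np

def transform_matrix(o, s, n, m, m1, m2):
    """C^beta = A^o (x) B^s.  A^o: n x n wrap-around shifted diagonal
    (orientation rotation); B^s: m x (m + m1 + m2) shifted diagonal
    (scale shift, s in {0, ..., m1 + m2})."""
    a = np.roll(np.eye(n, dtype=int), o, axis=1)   # circular orientation shift
    b = np.zeros((m, m + m1 + m2), dtype=int)
    b[np.arange(m), np.arange(m) + s] = 1          # shifted diagonal
    return np.kron(a, b)
```

Applied to a flattened image jet of length n(m + m1 + m2), such a matrix selects the components corresponding to one relative orientation and scale, producing a model-sized vector of length nm.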


Through the columnar dynamics of Eq. (6), the relative strengths of jet components are expressed in the history of feature unit activities (the strongest jet component, for instance, letting its unit stay on longest), so that the sum of products in Eq. (11), integrated over time in Eq. (7), indeed expresses jet similarities. The time course of all unit activities in our network model during a ν cycle is shown in Fig. 4. In this figure, the non-zero components of the jets used in the I and M domains are identical, with no relative rotation or scaling. At the beginning of the ν cycle, all units in the I, M and C domains are active, as indicated in Fig. 4 by the white jet-bars on the right and the bottom, and by all admissible correspondences in the matrix being on. The activity of the units in each domain is gradually reduced following Eqs. (6)-(11). In the intermediate state at t = 3.4 ms, all control units for the wrong scale have switched off, but the control units for all orientations are still on. At t = 12.0 ms, finally, only one control unit has survived, the one that correctly identifies the identity map. This state remains stable during the rest of the ν cycle.
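The inhibition ramp ν(t) of Eq. (9) that drives this winner selection can be sketched as follows (parameter values from the text; the function name is mine):

```python
def nu(t, t_k, t1=36.0, nu_min=0.4, nu_max=1.0):
    """Linear rise from nu_min to nu_max over T1 ms starting at T_k,
    then nu_max during the relaxation phase."""
    if t_k <= t < t_k + t1:
        return (nu_max - nu_min) * (t - t_k) / t1 + nu_min
    return nu_max
```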

4 Results

In this section, we describe tests of the functionality of our macrocolumnar model by studying the robustness of matching to noise introduced into the image jets (parameter σ1) and into the dynamics of the units, Eqs. (6) and (7) (parameter σ), as well as robustness to the admixture of input from the other domain, Eq. (6), due to non-vanishing (1 − CE). A correct match is one in which the surviving control unit corresponds to the correct transformation pair by which the I jet differs from the M jet. Jet noise models differences between jets actually extracted from natural images. Dynamic noise reflects signal fluctuations to be expected in units (minicolumns) composed of spiking neurons, and a nonzero admixture parameter may be relevant in a system in which transfer of jet information between the domains is desired. For both of the experiments described below, we used Gabor jets extracted from the center of real facial images. The transformed jets used for matching were extracted from images that had been scaled and rotated relative to the center position by exactly the same factors and angles that are also used in the Gabor transform. This ensures that there exists a perfect match between any two jets produced from two different rotated and scaled versions of the same face. The first experiment uses a single facial image and investigates the influence of noise in the units (σ) and of the parameter CE. Using a pair of identical jets, we obtain temporal averages of correct matching over 10 ν cycles for each size (n = 4, 5, ..., 10, m = 4, m1 = 0 and m2 = m − 1) of the macro-column. From these temporal averages, we calculate the sampling average over all n. The standard error is estimated as σd/√N1, where σd is the standard deviation of the sampling average and N1 is the sample size. Fig. 5(a) shows results for κI = κM = 1 and κC = 5. For strong noise (σ = 0.015 to 0.02), the correct matching probabilities for CE = 0.98 to CE = 0.96

392

Y.D. Sato et al.

[Figure: two panels, (a) and (b), plotting probability of correct match against σ (0 to 0.02) for CE = 1.0, 0.98, 0.96, 0.95, 0.94.]

Fig. 5. Probability of correct match between units in the I and M domains. (a) κI = κM = 1 and κC = 5. (b) κI = 2.3, κM = 5 and κC = 5.

[Figure: probability of correct matching against σ1 (0 to 0.2) for σ = 0.000.]

Fig. 6. Probability of correct matching when comparing two identical faces. Six facial images are used. Our system is set with m = 4, m1 = m2 = m − 1, n = 8, κI = κM = 6.5, κC = 5.0, CE = 0.98 and σ = 0.0.

are higher than those for CE = 1. Interestingly, for these low CE values, matching gets worse for weaker noise, collapsing in the case of σ = 0. This effect requires further investigation. However, for κI = 2.3 and κM = 5, we have found that the correct matching probability takes higher values than in the case of Fig. 5(a). In particular, our system with the higher noise level σ = 0.015 even demonstrates perfect matching for CE = 0.98, independent of n. Next, we investigated the robustness of the matching against the second type of noise (σ1), using 6 different face types. Since the original image is resized and/or rotated with m1 + m2 + 1 = 7 scales and n = 8 orientations, a total of 56 images could be employed in the I domain. Here our system is set with κI = κM = 6.5, κC = 5.0 and CE = 0.98. For each face image, we take temporal averages of the correct matching probability for each size–orientation of the image, in the same manner as described above. Averages and standard errors of the temporal averages over the 6 different facial images are plotted as a function of σ1 [Fig. 6]. The result of this experiment is independent of σ as long as σ ≤ 0.02. As Fig. 6 shows, as random noise in the jets is increased, the correct matching probability either decreases smoothly for one image or abruptly drops to around 0 at a certain σ1 for another image. We have also obtained


perfect matching, independent of σ1, when using the same but rotated or resized face. The most important point we wish to stress is that the probability stays above 87.74% for σ < 0.02 and σ1 < 0.08 (see Fig. 6). We can therefore say that the network model has a high recognition ability with scale and orientation invariance, independent of the facial type.
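The sampling averages and standard errors reported above can be computed straightforwardly. The sketch below assumes the per-size temporal averages have already been collected; the function name is ours, not the authors'.

```python
import numpy as np

def sampling_stats(temporal_averages):
    """Mean and standard error (sigma_d / sqrt(N1)) of per-size temporal
    averages of the correct-matching probability."""
    x = np.asarray(temporal_averages, dtype=float)
    n1 = x.size               # sample size N1
    sigma_d = x.std(ddof=1)   # sample standard deviation
    return x.mean(), sigma_d / np.sqrt(n1)
```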

5 Discussion and Conclusion

The main purpose of this communication is to convey a simple concept. Much more work needs to be done on the way to practical applications, for instance more experiments with features extracted from independently taken images of different scale and orientation, to better bring out the strengths and weaknesses of the approach. In addition to comparing and transforming local packages of feature values (our jets), it will be necessary to also handle spatial maps, that is, sets of point-to-point correspondences, a task we will approach next. Real applications involve, of course, a continuous range of transformation parameters, whereas we here admitted only transformations from the same sample grid used for defining the family of wavelets in a jet. We hope to address this problem by working with continuous superpositions of transformation matrices for the neighboring transformation parameter values that straddle the correct value. Our approach may be seen as brute force, as it requires many separate control units to sample the space of transformation parameters with sufficient density. However, the same set of control units can be used in a whole region of visual space and, in addition, for controlling point-to-point maps between regions, as spatial maps and feature maps stand in one-to-one correspondence with each other. The number of control units can be reduced even further if transformations are performed in sequence, by consecutive, separate layers of dynamic links. Thus, the transformation from an arbitrary segment of primary visual cortex to an invariant window could be done in the sequence translation – scaling – rotation. In this case, the number of required control units would be the sum and not the product of the number of samples for each individual transformation. A disadvantage of that approach might be added difficulty in finding correspondences between the domains. Further work is required to elucidate these issues.

Acknowledgements This work was supported by the EU project Daisy, FP6-2005-015803 and by the Hertie Foundation. We would like to thank C. Weber for help with preparing the manuscript.


Perturbational Neural Networks for Incremental Learning in Virtual Learning System

Eiichi Inohira1, Hiromasa Oonishi2, and Hirokazu Yokoi1

1 Kyushu Institute of Technology, Hibikino 2-4, 808-0196 Kitakyushu, Japan
{inohira,yokoi}@life.kyutech.ac.jp
2 Mitsubishi Heavy Industries, Ltd., Japan

Abstract. This paper presents a new type of neural network, the perturbational neural network, to realize incremental learning in autonomous humanoid robots. In our previous work, a virtual learning system was proposed to realize the exploration of plausible behavior in a robot's brain. Neural networks can generate plausible behavior in an unknown environment without time-consuming exploration. Although an autonomous robot should grow step by step, conventional neural networks forget prior learning when trained with a new dataset. The proposed neural network adds the outputs of a sub neural network to the weights and thresholds of a main neural network. Incremental learning and high generalization capability are realized by slightly changing the mapping of the main neural network. We show that the proposed neural network realizes incremental learning without forgetting, through numerical experiments with a two-dimensional stair-climbing bipedal robot.

1 Introduction

Humanoid robots such as ASIMO [1] have recently developed dramatically in terms of hardware and show promise for working just like humans. Although many researchers have studied artificial intelligence for a long time, humanoid robots do not yet have enough autonomy to make experts unnecessary. Humanoid robots should accomplish missions by themselves, rather than having experts give them solutions such as models, algorithms, and programs. Researchers have studied how to give a robot learning ability through trial and error. Such studies use so-called soft computing techniques such as reinforcement learning [2] and central pattern generators (CPG) [3]. Robot learning saves expert work to some degree but takes much time; these techniques are less efficient than humans. Humans act instantly depending on the situation by using imagination and experience. Even if a human fails, the failure serves as experience. In particular, humans know the characteristics of the environment and of their own behavior, and simulate trial and error to explore plausible behavior in their brains. In our previous work, Yamashita et al. [4] proposed a bipedal walking control system with a virtual environment based on the motion control mechanism of primates. In that study, exploring plausible behavior by trial and error is carried

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 395–404, 2008.
© Springer-Verlag Berlin Heidelberg 2008

396

E. Inohira, H. Oonishi, and H. Yokoi

out in a robot's brain. However, the problem is that exploring takes much time. Kawano et al. [5] introduced learning of the relation between environmental data and an optimized behavior into the previous work to save time. Generating behavior becomes fast because a trained neural network (NN) immediately outputs plausible behavior thanks to its generalization capability, which is the ability to produce a desired output for an unknown input. However, a problem in this previous work is that conventional neural networks cannot realize incremental learning: a trained NN forgets prior learning when trained with a new dataset. Realizing incremental learning is indispensable for learning various behaviors step by step. We present a new type of NN, the perturbational neural network (PNN), to achieve incremental learning. A PNN consists of a main NN and a sub NN. The main NN learns with a representative dataset and never changes after that. The sub NN learns with new datasets so that the output error of the main NN for those datasets is canceled. The sub NN is connected to the main NN through additional terms in the weights and thresholds of the main NN. We expect that these connections slightly change the characteristics of the main NN, which means that a perturbation is applied to the mapping of the main NN, and that incremental learning therefore does not sacrifice generalization capability.

2 A Virtual Learning System for a Biped Robot

We assume that a robot tries to achieve a task in an unknown environment. Virtual learning is defined as learning from virtual experience in a virtual environment in the robot's brain. The virtual environment is generated from sensory information on the real environment around the robot and is used for exploring behavior fit for a task without real action. Exploring such behavior in the virtual environment is regarded as virtual experience gained by trial and error or ingenuity. Virtual learning is to memorize the relation between environment and behavior under a particular task. The benefits of virtual learning are (1) to reduce the risk of robot failure in the real world, such as falling and colliding, and (2) to enable the robot to act immediately and achieve learned tasks in similar environments. The former concerns the robot's safety and physical wear; the latter saves work and time in achieving a task. We presented a virtual learning system for a biped robot in [4,5]. This virtual learning system has a virtual environment and NNs for learning, as shown in Fig. 1. We assume that a central pattern generator (CPG) generates the motion of the biped robot. The CPG of a biped robot consists of neural oscillators corresponding to its joints. Behavior is therefore described as CPG parameters rather than joint angles. The CPG parameters controlling the robot are called the CPG input. Optimized CPG input for a plausible behavior is obtained by exploring in the virtual environment. Virtual learning is realized by NNs. An NN learns a mapping from environmental data to optimized CPG input, i.e., a mapping from the real environment to the robot's behavior. NNs have generalization capability, which means that a desired output can be generated for an unknown input. When an NN has learned a mapping

PNNs for Incremental Learning in Virtual Learning System

397

[Figure: environmental data feed a virtual environment, where optimized motion is explored and checked against the purpose; the NN generates CPG input, which drives the CPG and the robot.]

Fig. 1. Virtual learning system for a biped robot

with representative data, it can generate the desired CPG input for an unknown environment, and the biped robot can then achieve the task. It takes much time to train an NN, but a trained NN generates output quickly. CPG input is first generated by the NN and then checked in the virtual environment. When the CPG input generated by the NN achieves the task, it is sent to the CPG. Otherwise, CPG input obtained by exploring, which takes time, is used for controlling the robot, and the NN learns from the CPG input obtained by exploring in order to correct its error. The combination of NN and exploring realizes autonomous learning, and the two compensate for each other's shortcomings.
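The loop just described (try the NN's output in the virtual environment; if it fails, fall back to exploring and retrain on the explored result) can be sketched as follows. The class and the `predict`/`train`/`achieves_task`/`explore` interfaces are our assumptions for illustration, not the authors' implementation.

```python
class VirtualLearningSystem:
    """Minimal sketch of combining an NN with exploring in a virtual environment."""

    def __init__(self, nn, env):
        self.nn = nn    # neural network: environmental data -> CPG input
        self.env = env  # virtual environment for checking and exploring

    def step(self, env_data):
        cpg_input = self.nn.predict(env_data)           # fast path: NN generalization
        if self.env.achieves_task(env_data, cpg_input):
            return cpg_input                            # send directly to the CPG
        cpg_input = self.env.explore(env_data)          # slow path: trial and error
        self.nn.train(env_data, cpg_input)              # NN corrects its error
        return cpg_input
```

The design point is that the slow exploration path is exercised only when the fast NN path fails, and each such failure becomes a training example for the NN.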

3 Neural Networks for Incremental Learning

A key component of our virtual learning system is the NN. A virtual learning system should develop with experience in various environments and tasks. However, a multilayer perceptron (MLP) with the back-propagation (BP) learning algorithm, which is widely used in many applications, has the problem that it forgets prior learning when trained with a new dataset, because its connection weights are overwritten. If all training datasets are stored in memory, prior learning can be reflected in an MLP by training with all the datasets, but training with a large volume of data takes a huge amount of time and memory. Humans, of course, can learn something new while keeping old experiences. A virtual learning system needs to accumulate virtual experiences step by step.

3.1 Perturbational Neural Networks

The basic idea of PNN is that a NN learns a new dataset by adding new parameters to constant weights and thresholds after training without overwriting them. PNN consists of main NN and sub NN as shown in Fig.2. Main NN learns with representative datasets and is constant after this learning. Sub NN learns with new datasets and generates new terms in weights and thresholds of main

[Figure: the main NN maps input to output; the sub NN supplies Δw and Δh to the main NN.]

Fig. 2. A perturbational neural network

[Figure: a neuron with input x1, weight w1 + Δw1 and threshold h + Δh, where Δw1 and Δh come from the sub NN, producing output z1 through the activation f.]

Fig. 3. A neuron in main NN

NN, i.e., Δw and Δh. A neuron of the PNN is shown in Fig. 3. The input–output relation of a conventional neuron is given by

z = f( Σi wi xi − h ),  (1)

where z denotes the output, wi the weights, xi the inputs, h the threshold, and f(·) the activation function. A neuron of the PNN is given by

z = f( Σi (wi + Δwi) xi − (h + Δh) ),  (2)

where Δwi and Δh are outputs of the sub NN. Training of the PNN is divided into two phases. First, the main NN learns with a representative dataset. For instance, assume that the representative dataset consists of environmental data (EA, EB, EC) and CPG parameters


[Figure: the main NN is trained on environmental data (EA, EB, EC) and CPG parameters (PA, PB, PC) while the sub NN learns to output Δw = Δh = 0.]

Fig. 4. Training of NN with representative dataset

[Figure: learning of the main NN is stopped; the sub NN is trained on the new pair (ED, PD) through Δw and Δh.]

Fig. 5. Training of NN with new dataset

(PA, PB, PC), as shown in Fig. 4. Training of the main NN is the same as for an ordinary NN with BP. At this stage, the sub NN learns so that Δw and Δh equal zero; for the representative dataset, the sub NN therefore has no effect on the main NN. Next, the PNN learns with a new dataset. For instance, assume that the new dataset consists of environmental data ED and CPG parameters PD, as shown in Fig. 5. The main NN does not learn with the new dataset, but error signals are passed to the sub NN through the main NN. The output error of the main NN for ED is nonzero because ED is unknown to the main NN, and the sub NN learns with the new dataset so that this error is canceled.

3.2 Related Work

Several authors [6,7,8] have proposed other methods in which the Mixture of Experts (MoE) architecture is applied to incremental learning in neural networks. The MoE architecture is based on a divide-and-conquer approach: a complex mapping that a single neural network would have to learn is divided into simple mappings that individual networks can learn easily. A PNN, on the other hand, learns a complex mapping by connecting sub NNs to a main NN. A PNN is expected to have global generalization capability because it does not divide the mapping into local mappings. In the MoE architecture, each expert neural network is expected to have local generalization capability, but the system as a whole may lack global generalization capability because this is not explicitly addressed. Therefore, when generalization capability is the focus, a PNN should be used.


A PNN has the problem that it needs a large number of connections from the sub NNs to the main NN, which makes it very inefficient in resources. Although we have not yet studied the efficiency of PNNs, we expect that there is room to reduce the number of connections; this is future work.
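As a concrete illustration, the perturbational neuron of Eq. (2) can be written in a few lines. The logistic sigmoid activation is an assumption, since the paper does not specify f.

```python
import numpy as np

def pnn_neuron(x, w, h, dw, dh):
    """Perturbational neuron, Eq. (2): z = f(sum_i (w_i + dw_i) x_i - (h + dh)),
    where dw and dh are outputs of the sub NN; f is a logistic sigmoid here."""
    a = np.dot(w + dw, x) - (h + dh)
    return 1.0 / (1.0 + np.exp(-a))
```

With dw = 0 and dh = 0 this reduces to the conventional neuron of Eq. (1), which is exactly the condition the sub NN is trained to satisfy on the representative dataset.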

4 Numerical Experiments

4.1 Setup

We evaluate the generalization capability of the proposed NN for incremental learning through numerical experiments on a robot climbing stairs. The experiments are simplified because we focus on the NN for virtual learning.

[Figure: a planar five-link biped with joint angles θ1–θ5.]

Fig. 6. A two-dimensional five-link biped robot model

[Figure: five neural oscillators, one per joint angle θ1–θ5.]

Fig. 7. A CPG for a biped robot

A two-dimensional five-link model of a bipedal robot is used, as shown in Fig. 6. The five joint angles defined in Fig. 6 are controlled by five neural oscillators, one per joint angle. The length of every link is 100 cm. The task of the bipedal robot is to climb five stairs. The height and depth of the stairs are used for


Table 1. Dimensions of the representative stairs

Stairs  Height [cm]  Depth [cm]
A       30           60
B       10           70
C       30           100
D       10           90

[Figure: a PNN with two sub NNs, one supplying Δwh, Δhh to the hidden layer of the main NN and one supplying Δwo, Δho to its output layer.]

Fig. 8. A PNN used for experiments

environmental data. The robot considers only kinematics in the virtual environment and ignores dynamics. The CPG shown in Fig. 7 is used to control the bipedal robot. We used Matsuoka's neural oscillator model [9] in the CPG. In this study, 16 connection weights w and five constant external inputs u0 are defined as the CPG input controlling the bipedal robot. The CPG input for climbing stairs is obtained by exploring the parameter space with a GA and is optimized for each environment. The internal parameters of the CPG are also given by a GA but are held constant in all environments, because it would take much time to explore all CPG parameters including the internal ones; the internal parameters are optimized for walking on a horizontal plane. The four pairs shown in Table 1 are defined as the representative data for NN training. The NNs therefore have 2 inputs and 21 outputs. The following targets are compared to evaluate their generalization capability:

– a CPG whose parameters are optimized for each of the three representative environments, i.e., stairs A, B, and C
– an MLP trained for stairs A, B, and C (MLP-I)
– an MLP trained for stairs D after being trained for stairs A, B, and C (MLP-II)
– a PNN trained in the same way as the above MLPs

These targets are optimized and trained for one to four kinds of stairs. The generalization capability of each target is measured by the number of conditions, different from the representative stairs, in which the biped robot can climb the five


stairs successfully. Stair height ranges from 4 cm to 46 cm and depth from 40 cm to 110 cm. The MLPs have 30 neurons in a hidden layer. Initial weights of the MLPs are given by uniform random numbers ranging between ±0.3. The PNN used in this paper has two sub NNs, as shown in Fig. 8. The main NN of the PNN is the same as the above MLP. The sub NN for the hidden layer has 100 neurons and 90 outputs; the sub NN for the output layer has 600 neurons and 561 outputs. All initial weights of the PNN are given in the same way as for the MLPs. One learning cycle is defined as presenting stairs A, B, and C to the MLP or PNN sequentially. MLP-I is trained on the three kinds of stairs for 10000 learning cycles, after which the sum of squared errors is well below 10⁻⁷. MLP-II is trained on stairs D for 1000 learning cycles after being trained on the three kinds of stairs. As mentioned below, although the number of learning cycles for stairs D is small, MLP-II forgets stairs A, B, and C. The training condition for the main NN of the PNN is the same as for MLP-I. Incremental learning of the PNN for stairs D means that the two sub NNs are trained on stairs D while the main NN is kept constant; these sub NNs are trained for 700 learning cycles. The learning rate of each NN is optimized through preliminary experiments, because performance depends heavily on it; we used the learning rate minimizing the sum of squared errors within a fixed number of learning cycles. The learning rates used in MLP-I and in the sub NNs of the PNN are 0.30 and 0.12, respectively.

4.2 Experimental Results

In Fig. 9, an x mark denotes a condition in which the robot successfully climbs the five stairs, and a circle denotes a condition given in the representative dataset. Fig. 9 (a) to (c) show that successful conditions spread around the representative stairs to some degree. This means that the CPG has a certain generalization capability by itself, as already known. However, the generalization capability of the CPG alone is not enough to cover stairs A, B, and C simultaneously. On the other hand, Fig. 9 (d) shows that MLP-I covers the conditions intermediate between stairs A, B, and C; the effect of virtual learning with NNs is clear. Fig. 9 (e) and (f) concern incremental learning. Fig. 9 (e) shows that the PNN succeeds at incremental learning. Moreover, looking at the conditions near stairs C, the generalization capability of the PNN is larger than that of MLP-I. Fig. 9 (f) shows that MLP-II forgot the preceding learning on stairs A, B, and C and fails at incremental learning. It is known that incremental learning of an MLP with BP fails because the connection weights are overwritten by training with a new dataset. In the PNN, the main NN is constant after training with the initial dataset, so it does not forget the initial dataset; incremental learning is realized by the sub NNs. The open questions for the sub NNs concern the effects of adjusting the connection weights, i.e., whether incremental learning succeeds and whether performance on the initial dataset decreases. From the experimental results, we showed that the PNN realized incremental learning and even increased its generalization capability through incremental learning.


[Figure: six panels plotting successful stair-climbing conditions over stair height [cm] (4–46) versus depth [cm] (40–110): (a) CPG without NN for data A; (b) CPG without NN for data B; (c) CPG without NN for data C; (d) trained NN with data A, B, and C; (e) trained proposed NN with data D after the three data; (f) trained conventional NN with data D after the three data.]

Fig. 9. A comparison of the generalization capability of the CPG, MLP, and PNN in stair climbing

5 Conclusions

We proposed a new type of NN for incremental learning in a virtual learning system. The proposed NN features externally adjusting the weights and thresholds of a trained NN so as to slightly change its mapping. This paper reported numerical experiments with a two-dimensional five-link biped robot on a stair-climbing task. We showed that the PNN succeeds at incremental learning and has generalization capability to some degree. This study was limited to verifying our approach. In future work, we will study the PNN with as much data as an actual robot needs and compare it with related work quantitatively.

References

1. Hirai, K., Hirose, M., Haikawa, Y., Takenaka, T.: The development of Honda humanoid robot. In: Proc. IEEE ICRA, vol. 2, pp. 1321–1326 (1998)
2. Mahadevan, S., Connell, J.: Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence 55, 311–365 (1992)
3. Kuniyoshi, Y., Sangawa, S.: Early motor development from partially ordered neural-body dynamics: experiments with a cortico-spinal-musculo-skeletal model. Biological Cybernetics 95, 589–605 (2006)
4. Yamashita, I., Yokoi, H.: Control of a biped robot by using several virtual environments. In: Proceedings of the 22nd Annual Conference of the Robotics Society of Japan, 1K25 (in Japanese) (2004)
5. Kawano, T., Yamashita, I., Yokoi, H.: Control of the bipedal robot generating the target by the simulation in virtual space (in Japanese). IEICE Technical Report 104, 65–69 (2004)
6. Haruno, M., Wolpert, D.M., Kawato, M.: MOSAIC model for sensorimotor learning and control. Neural Computation 13, 2201–2220 (2001)
7. Schaal, S., Atkeson, C.G.: Constructive incremental learning from only local information. Neural Computation 10, 2047–2084 (1998)
8. Yamauchi, K., Hayami, J.: Incremental learning and model selection for radial basis function network through sleep. IEICE Transactions on Information and Systems E90-D, 722–735 (2007)
9. Matsuoka, K.: Sustained oscillations generated by mutually inhibiting neurons with adaptation. Biological Cybernetics 52, 367–376 (1985)

Bifurcations of Renormalization Dynamics in Self-organizing Neural Networks

Peter Tiňo

University of Birmingham, Birmingham, UK
[email protected]

Abstract. Self-organizing neural networks (SONN) driven by softmax weight renormalization are capable of finding high-quality solutions of difficult assignment optimization problems. The renormalization is shaped by a temperature parameter: as the system cools down, the assignment weights become increasingly crisp. It has been reported that the SONN search process can exhibit complex adaptation patterns as the system cools down. Moreover, there exists a critical temperature setting at which SONN is capable of powerful intermittent search through a multitude of high-quality solutions represented as meta-stable states. To shed light on these observed phenomena, we present a detailed bifurcation study of the renormalization process. As SONN cools down, new renormalization equilibria emerge in a strong structure, leading to a complex skeleton of saddle-type equilibria surrounding an unstable maximum-entropy point, with decision-enforcing "one-hot" stable equilibria. This, in synergy with the SONN input-driving process, can lead to sensitivity to annealing schedules and to adaptation dynamics exhibiting signatures of complex dynamical behavior. We also show that (as hypothesized in earlier studies) intermittent search by SONN can occur only at temperatures close to the first (symmetry-breaking) bifurcation temperature.

1 Introduction

For almost three decades there has been energetic research activity on applying neural computation techniques to difficult combinatorial optimization problems. The self-organizing neural network (SONN) [1] constitutes an example of a successful neural-based methodology for solving 0-1 assignment problems. SONN has been successfully applied in a wide variety of applications, from assembly line sequencing to frequency assignment in mobile communications. As in most self-organizing systems, the dynamics of SONN adaptation is driven by a synergy of cooperation and competition. In the competition phase, for each item to be assigned, the best candidate for the assignment is selected and the corresponding assignment weight is increased. In the cooperation phase, the assignment weights of other candidates that were likely to be selected, but were not quite as strong as the selected one, get increased as well, albeit to a lesser

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 405–414, 2008.
© Springer-Verlag Berlin Heidelberg 2008

406

P. Tiňo

degree. The assignment weights need to be positive and sum to 1. Therefore, after each SONN adaptation phase, the assignment weights are renormalized back onto the standard simplex, e.g., via the softmax function [2]. When endowed with a physics-based Boltzmann distribution interpretation, the softmax function contains a temperature parameter T > 0. As the system cools down, the assignments become increasingly crisp. In the original setting, SONN is annealed so that a single high-quality solution to an assignment problem is found. Yet renormalization onto the standard simplex is a double-edged sword: on one hand, SONN with assignment weight renormalization has empirically shown sensitivity to annealing schedules; on the other hand, the quality of solutions could be greatly improved [3]. Interestingly enough, it has been reported recently [4] that there exists a critical temperature T∗ at which SONN is capable of powerful intermittent search through a multitude of high-quality solutions represented as meta-stable states of the SONN adaptation dynamics. It is hypothesised that the critical temperature may be closely related to the symmetry-breaking bifurcation of equilibria in the autonomous softmax dynamics. At present there is still no theory regarding the dynamics of SONN adaptation driven by softmax renormalization. Consequently, the processes of crystallising a solution in an annealed version of SONN, or of sampling the solution space in the intermittent search regime, are far from being understood. The first steps towards a theoretical underpinning of SONN adaptation driven by softmax renormalization were taken in [5,4,6]. For example, in [5] SONN is treated as a dynamical system with bifurcation parameter T. The cooperation phase was not included in the model. The renormalization process was empirically shown to result in complicated bifurcation patterns, revealing the complex nature of the search process inside SONN as the system gets annealed.
More recently, Kwok and Smith [4] suggested studying SONN adaptation dynamics by concentrating on the autonomous renormalization process, since it is this process that underpins the search dynamics in the SONN. In [6] we initiated a rigorous study of equilibria of the autonomous renormalization process. Based on dynamical properties of the autonomous renormalization, we found analytical approximations to the critical temperature T* as a function of SONN size. In this paper we complement [6] by reporting a detailed bifurcation study of the renormalization process and give a precise characterization of the equilibria and their stability types, as they emerge during the annealing process. An interesting and intricate equilibria structure emerges as the system cools down, explaining the empirically observed complexity of SONN adaptation during intermediate stages of the annealing process. The analysis also clarifies why the intermittent search by SONN occurs near the first (symmetry-breaking) bifurcation temperature of the renormalization step, as was experimentally verified in [4,6]. Due to space limitations, we cannot fully prove the statements presented in this study. Detailed proofs can be found in [7] and will be published elsewhere.

Bifurcations of Renormalization Dynamics in SONNs

2 Self-organizing Neural Network and Iterative Softmax

First, we briefly introduce the Self-Organizing Neural Network (SONN) endowed with weight renormalization for solving assignment optimization problems (see e.g. [4]). Consider a finite set of input elements (neurons) i ∈ I = {1, 2, ..., M} that need to be assigned to outputs (output neurons) j ∈ J = {1, 2, ..., N}, so that some global cost of an assignment A : I → J is minimized. The partial cost of assigning i ∈ I to j ∈ J is denoted by V(i, j). The "strength" of assigning i to j is represented by the "assignment weight" w_{i,j} ∈ (0, 1). The SONN algorithm can be summarized as follows: The connection weights w_{i,j}, i ∈ I, j ∈ J, are first initialized to small random values. Then, repeatedly, an output item j_c ∈ J is chosen and the partial costs V(i, j_c) incurred by assigning all possible input elements i ∈ I to j_c are calculated in order to select the "winner" input element (neuron) i(j_c) ∈ I that minimizes V(i, j_c). The "neighborhood" B_L(i(j_c)) of size L of the winner node i(j_c) consists of the L nodes i ≠ i(j_c) that yield the smallest partial costs V(i, j_c). Weights from nodes i ∈ B_L(i(j_c)) to j_c get strengthened:

    w_{i,j_c} ← w_{i,j_c} + η(i)(1 − w_{i,j_c}),   i ∈ B_L(i(j_c)),    (1)

where η(i) is proportional to the quality of the assignment i → j_c, as measured by V(i, j_c). The weight vectors¹ w_i = (w_{i,1}, w_{i,2}, ..., w_{i,N})⊤ for each input node i ∈ I are then renormalized using the softmax

    w_{i,j} ← exp(w_{i,j}/T) / Σ_{k=1}^{N} exp(w_{i,k}/T).    (2)
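One pass of the adaptation step (1) followed by the renormalization (2) can be sketched in NumPy. This is an illustrative sketch, not the paper's implementation: the function names, the constant learning rate `eta` (the paper takes η(i) proportional to assignment quality), the inclusion of the winner in the updated set, and the random test data are all assumptions.

```python
import numpy as np

def softmax_renormalize(W, T):
    """Row-wise softmax renormalization (Eq. 2): each weight vector
    w_i is mapped back onto the standard simplex at temperature T."""
    E = np.exp(W / T)
    return E / E.sum(axis=1, keepdims=True)

def sonn_step(W, V, jc, L, eta=0.1, T=1.0):
    """One hypothetical SONN adaptation step for output item jc.

    W : (M, N) assignment weights, V : (M, N) partial costs V(i, j),
    L : neighborhood size.  The winner and the L runner-up input nodes
    with the smallest cost V(i, jc) have their weights to jc
    strengthened (Eq. 1); every row is then renormalized (Eq. 2)."""
    costs = V[:, jc]
    neighborhood = np.argsort(costs)[:L + 1]   # winner + L neighbors
    W = W.copy()
    W[neighborhood, jc] += eta * (1.0 - W[neighborhood, jc])
    return softmax_renormalize(W, T)

rng = np.random.default_rng(0)
M, N = 6, 4
W = softmax_renormalize(rng.uniform(0.0, 0.1, size=(M, N)), T=1.0)
V = rng.uniform(size=(M, N))
W = sonn_step(W, V, jc=2, L=2, T=0.5)
print(W.sum(axis=1))   # every row stays on the simplex
```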

We will refer to the SONN for solving an (M, N)-assignment problem as (M, N)-SONN. As mentioned earlier, following [4,6] we strive to understand the search dynamics inside SONN by analyzing the autonomous dynamics of the renormalization update step (2) of the SONN algorithm. The weight vector w_i of each of the M neurons in an (M, N)-SONN lives in the standard (N − 1)-simplex,

    S_{N−1} = { w = (w_1, w_2, ..., w_N)⊤ ∈ R^N | w_i ≥ 0, i = 1, 2, ..., N, and Σ_{i=1}^{N} w_i = 1 }.

Given a value of the temperature parameter T > 0, the softmax renormalization step in SONN adaptation transforms the weight vector of each unit as follows:

    w → F(w; T) = (F_1(w; T), F_2(w; T), ..., F_N(w; T))⊤,    (3)

where

    F_i(w; T) = exp(w_i/T) / Z(w; T),   i = 1, 2, ..., N,    (4)

¹ Here ⊤ denotes the transpose operator.

and Z(w; T) = Σ_{k=1}^{N} exp(w_k/T) is the normalization factor. Formally, F maps R^N to S⁰_{N−1}, the interior of S_{N−1}.

Linearization of F around w ∈ S⁰_{N−1} is given by the Jacobian J(w; T):

    J(w; T)_{i,j} = (1/T) [ δ_{i,j} F_i(w; T) − F_i(w; T) F_j(w; T) ],   i, j = 1, 2, ..., N,    (5)

where δ_{i,j} = 1 iff i = j and δ_{i,j} = 0 otherwise.

The softmax map F induces on S⁰_{N−1} a discrete-time dynamics known as Iterative Softmax (ISM):

    w(t + 1) = F(w(t); T).    (6)

The renormalization step in an (M, N)-SONN adaptation involves M separate renormalizations of the weight vectors of all of the M SONN units. For each temperature setting T, the structure of equilibria in the i-th system, w_i(t + 1) = F(w_i(t); T), is copied in all the other M − 1 systems. Using this symmetry, it is sufficient to concentrate on a single ISM (6). Note that the weights of different units are coupled by the SONN adaptation step (1). We will study systems for N ≥ 2.
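The autonomous ISM (6) is easy to iterate directly. A minimal sketch (temperature value and random starting point are arbitrary choices): above the first bifurcation temperature the maximum entropy point (1/N)1_N is the unique equilibrium, so iteration from a random interior point settles there.

```python
import numpy as np

def ism(w, T):
    """One Iterative Softmax step (Eq. 6): w(t+1) = F(w(t); T)."""
    e = np.exp(w / T)
    return e / e.sum()

rng = np.random.default_rng(1)
N = 4
w = rng.dirichlet(np.ones(N))   # random start in the simplex interior

# At T = 1.0 > 1/2 the maximum entropy point is the only equilibrium,
# and the iteration contracts towards it.
for _ in range(200):
    w = ism(w, T=1.0)
print(w)   # ≈ [0.25, 0.25, 0.25, 0.25]
```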

3 Equilibria of SONN Renormalization Step

We first introduce basic concepts and notation that will be used throughout the paper. An (r − 1)-simplex is the convex hull of a set of r affinely independent points in R^m, m ≥ r − 1. A special case is the standard (N − 1)-simplex S_{N−1}. The convex hull of any nonempty subset of n vertices of an (r − 1)-simplex Δ, n ≤ r, is called an (n − 1)-face of Δ. There are (r choose n) distinct (n − 1)-faces of Δ and each (n − 1)-face is an (n − 1)-simplex. Given a set of n vertices w_1, w_2, ..., w_n ∈ R^m defining an (n − 1)-simplex Δ in R^m, the central point

    w̄(Δ) = (1/n) Σ_{i=1}^{n} w_i    (7)

is called the maximum entropy point of Δ. We will denote the set of all (n − 1)-faces of the standard (N − 1)-simplex S_{N−1} by P_{N,n}. The set of their maximum entropy points is denoted by Q_{N,n}, i.e.

    Q_{N,n} = { w̄(Δ) | Δ ∈ P_{N,n} }.    (8)

The n-dimensional column vectors of 1's and 0's are denoted by 1_n and 0_n, respectively. Note that w_{N,n} = (1/n)(1_n⊤, 0_{N−n}⊤)⊤ ∈ Q_{N,n}. In addition, all the other elements of Q_{N,n} can be obtained by simply permuting the coordinates of w_{N,n}. Due to this symmetry, we will be able to develop most of the material using w_{N,n} only and then transfer the results to permutations of w_{N,n}. The maximum entropy point w̄_{N,N} = N^{−1} 1_N of the standard (N − 1)-simplex S_{N−1} will be denoted


simply by w̄. To simplify the notation we will use w̄ to denote both the maximum entropy point of S_{N−1} and the vector w̄ − 0_N. We showed in [6] that w̄ is a fixed point of ISM (6) for any temperature setting T, and that all the other fixed points w = (w_1, w_2, ..., w_N)⊤ of ISM have exactly two different coordinate values: w_i ∈ {γ_1, γ_2}, such that N^{−1} < γ_1 < N_1^{−1} and 0 < γ_2 < N^{−1}, where N_1 is the number of coordinates γ_1 larger than N^{−1}. Since w ∈ S⁰_{N−1}, we have

    γ_2 = (1 − N_1 γ_1) / (N − N_1).    (9)

The number of coordinates γ_2 smaller than N^{−1} is denoted by N_2. Obviously, N_2 = N − N_1. If w = (γ_1 1_{N_1}⊤, γ_2 1_{N_2}⊤)⊤ is a fixed point of ISM (6), so are all (N choose N_1) distinct permutations of it. We collect w and its permutations in a set

    E_{N,N_1}(γ_1) = { v | v is a permutation of (γ_1 1_{N_1}⊤, ((1 − N_1 γ_1)/(N − N_1)) 1_{N−N_1}⊤)⊤ }.    (10)

The fixed points in E_{N,N_1}(γ_1) exist if and only if the temperature parameter T is set to [6]

    T_{N,N_1}(γ_1) = (N γ_1 − 1) [ −(N − N_1) ln( 1 − (N γ_1 − 1)/((N − N_1) γ_1) ) ]^{−1}.    (11)
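Equations (9) and (11) can be checked numerically: placing N_1 coordinates at γ_1 and the remaining N − N_1 at γ_2 from (9) gives a vector that the softmax map leaves fixed exactly at the temperature (11). The concrete values of N, N_1, γ_1 below are arbitrary test choices.

```python
import numpy as np

def ism(w, T):
    """One Iterative Softmax step (Eq. 6)."""
    e = np.exp(w / T)
    return e / e.sum()

def fixed_point_temperature(N, N1, g1):
    """Eq. (11): the unique temperature at which the vector with N1
    coordinates g1 and N - N1 coordinates g2 (Eq. 9) is ISM-fixed."""
    x = N * g1 - 1.0
    return x / (-(N - N1) * np.log(1.0 - x / ((N - N1) * g1)))

N, N1, g1 = 5, 2, 0.3                    # N^-1 = 0.2 < g1 < N1^-1 = 0.5
g2 = (1.0 - N1 * g1) / (N - N1)          # Eq. (9)
w = np.array([g1] * N1 + [g2] * (N - N1))
T = fixed_point_temperature(N, N1, g1)
print(T, np.max(np.abs(ism(w, T) - w)))  # residual ~ 0: w is fixed
```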

We will show that as the system cools down, an increasing number of equilibria emerge in a strong structure. Let w, v ∈ S_{N−1} be two points on the standard simplex. The line from w to v is parametrized as

    ℓ(τ; w, v) = w + τ · (v − w),   τ ∈ [0, 1].    (12)

Theorem 1. All equilibria of ISM (6) lie on lines connecting the maximum entropy point w̄ of S_{N−1} with the maximum entropy points of its faces. In particular, for 0 < N_1 < N and γ_1 ∈ (N^{−1}, N_1^{−1}), all fixed points from E_{N,N_1}(γ_1) lie on the lines ℓ(τ; w̄, w̃), where w̃ ∈ Q_{N,N_1}.

Sketch of the Proof: Consider the maximum entropy point w_{N,N_1} = (1/N_1)(1_{N_1}⊤, 0_{N_2}⊤)⊤ of an (N_1 − 1)-face of S_{N−1}. Then w(γ_1) = (γ_1 1_{N_1}⊤, γ_2 1_{N_2}⊤)⊤ lies on the line ℓ(τ; w̄, w_{N,N_1}) for the parameter setting τ = 1 − N γ_2. Q.E.D.

The result is illustrated in figure 1. As the (M, N)-SONN cools down, the ISM equilibria emerge on lines connecting w̄ with the maximum entropy points of faces of S_{N−1} of increasing dimensionality. Moreover, on each such line there can be at most two ISM equilibria.

Theorem 2. For N_1 < N/2, there exists a temperature T_E(N, N_1) > N^{−1} such that for T ∈ (0, T_E(N, N_1)], ISM fixed points in E_{N,N_1}(γ_1) exist for some γ_1 ∈


Fig. 1. Positions of equilibria of SONN renormalization illustrated for the case of 4-dimensional weight vectors w. ISM is operating on the standard 3-simplex S_3 and its equilibria can only be found on the lines connecting the maximum entropy point w̄ (filled circle) with maximum entropy points of its faces. Triangles, squares and diamonds represent maximum entropy points of 0-faces (vertices), 1-faces (edges) and 2-faces (facets), respectively.

(N^{−1}, N_1^{−1}), and no ISM fixed points in E_{N,N_1}(γ_1), for any γ_1 ∈ (N^{−1}, N_1^{−1}), can exist at temperatures T > T_E(N, N_1). For each temperature T ∈ (N^{−1}, T_E(N, N_1)), there are two coordinate values γ_1^−(T) and γ_1^+(T), N^{−1} < γ_1^−(T) < γ_1^+(T) < N_1^{−1}, such that ISM fixed points in both E_{N,N_1}(γ_1^−(T)) and E_{N,N_1}(γ_1^+(T)) exist at temperature T. Furthermore, as the temperature decreases, γ_1^−(T) decreases towards N^{−1}, while γ_1^+(T) increases towards N_1^{−1}. For temperatures T ∈ (0, N^{−1}], there is exactly one γ_1(T) ∈ (N^{−1}, N_1^{−1}) such that ISM equilibria in E_{N,N_1}(γ_1(T)) exist at temperature T.

Sketch of the Proof: The temperature function T_{N,N_1}(γ_1) (11) is concave and can be continuously extended to [N^{−1}, N_1^{−1}) with T_{N,N_1}(N^{−1}) = N^{−1} and lim_{γ_1 → N_1^{−1}} T_{N,N_1}(γ_1) = 0 < N^{−1}. The slope of T_{N,N_1}(γ_1) at N^{−1} is positive for N_1 < N/2. Q.E.D.

Theorem 3. The bifurcation temperature T_E(N, N_1) is decreasing with increasing number N_1 of equilibrium coordinates larger than N^{−1}.

Sketch of the Proof: It can be shown that for any feasible value of γ_1 > N^{−1}, if there are two fixed points w ∈ E_{N,N_1}(γ_1) and w′ ∈ E_{N,N_1′}(γ_1) of ISM, such that N_1 < N_1′, then w exists at a higher temperature than w′ does. For a given N_1 < N/2, the bifurcation temperature T_E(N, N_1) corresponds to the maximum of T_{N,N_1}(γ_1) on γ_1 ∈ (N^{−1}, N_1^{−1}). It follows that N_1 < N_1′ implies T_E(N, N_1) > T_E(N, N_1′). Q.E.D.
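T_E(N, N_1) can be obtained numerically as the maximum of the concave temperature function (11) over (N^{−1}, N_1^{−1}). The grid maximization below is an illustrative shortcut, not the paper's analytical treatment; the output reproduces the strictly decreasing ordering of Theorem 3.

```python
import numpy as np

def T_N_N1(N, N1, g1):
    """Eq. (11), defined for g1 in (1/N, 1/N1)."""
    x = N * g1 - 1.0
    return x / (-(N - N1) * np.log(1.0 - x / ((N - N1) * g1)))

def bifurcation_temperature(N, N1, grid=100000):
    """T_E(N, N1): maximum of the concave function (11), approximated
    on a fine grid with the singular endpoints excluded."""
    g1 = np.linspace(1.0 / N, 1.0 / N1, grid + 2)[1:-1]
    return T_N_N1(N, N1, g1).max()

N = 10
TE = [bifurcation_temperature(N, N1) for N1 in range(1, N // 2)]
print(TE)   # strictly decreasing in N1 (Theorem 3), all above 1/N
```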

Bifurcations of Renormalization Dynamics in SONNs

411

Theorem 4. If N/2 ≤ N_1 < N, for each temperature T ∈ (0, N^{−1}), there is exactly one coordinate value γ_1(T) ∈ (N^{−1}, N_1^{−1}) such that ISM fixed points in E_{N,N_1}(γ_1(T)) exist at temperature T. No ISM fixed points in E_{N,N_1}(γ_1), for any γ_1 ∈ (N^{−1}, N_1^{−1}), can exist at temperatures T > N^{−1}. As the temperature decreases, γ_1(T) increases towards N_1^{−1}.

Sketch of the Proof: Similar to the proof of Theorem 2, but this time the slope of T_{N,N_1}(γ_1) at N^{−1} is not positive. Q.E.D.

Let us now summarize the process of creation of new ISM equilibria as the (M, N)-SONN cools down. For temperatures T > 1/2, the ISM has exactly one equilibrium - the maximum entropy point w̄ of S_{N−1} [6]. As the temperature is lowered and hits the first bifurcation point, T_E(N, 1), new fixed points of ISM emerge on the lines ℓ(τ; w̄, w̃), w̃ ∈ Q_{N,1}, one on each line. The lines connect w̄ with the vertices w̃ of S_{N−1}. As the temperature decreases further, on each line the single fixed point splits into two fixed points; one moves towards w̄, the other moves towards the corresponding maximum entropy point w̃ ∈ Q_{N,1} (vertex of S_{N−1}). When the temperature reaches the second bifurcation point, T_E(N, 2), new fixed points of ISM emerge on the lines ℓ(τ; w̄, w̃), w̃ ∈ Q_{N,2}, one on each line. This time the lines connect w̄ with the maximum entropy points (midpoints) w̃ of the edges of S_{N−1}. Again, as the temperature continues decreasing, on each line the single fixed point splits into two fixed points; one moves towards w̄, the other moves towards the corresponding maximum entropy point w̃ ∈ Q_{N,2} (midpoint of an edge of S_{N−1}). The process continues until the last bifurcation temperature T_E(N, N_1) is reached, where N_1 is the largest natural number smaller than N/2. At T_E(N, N_1), new fixed points of ISM emerge on the lines ℓ(τ; w̄, w̃), w̃ ∈ Q_{N,N_1}, connecting w̄ with the maximum entropy points w̃ of (N_1 − 1)-faces of S_{N−1}.
As the temperature continues decreasing, on each line the single fixed point splits into two fixed points; one moves towards w̄, the other moves towards the corresponding maximum entropy point w̃ ∈ Q_{N,N_1}. At temperatures below N^{−1}, only the fixed points moving towards the maximum entropy points of faces of S_{N−1} exist. In the low temperature regime, 0 < T < N^{−1}, a fixed point occurs on every line ℓ(τ; w̄, w̃), w̃ ∈ Q_{N,N_1}, N_1 = ⌈N/2⌉, ⌈N/2⌉ + 1, ..., N − 1. Here, ⌈x⌉ denotes the smallest integer y such that y ≥ x. As the temperature decreases, the fixed points w move towards the corresponding maximum entropy points w̃ ∈ Q_{N,N_1} of (N_1 − 1)-faces of S_{N−1}. The process of creation of new fixed points and their flow as the temperature cools down is demonstrated in figure 2 for an ISM operating on the 9-simplex S_9 (N = 10). We plot against each temperature setting T the values of the larger coordinate γ_1 > N^{−1} = 0.1 of the fixed points existing at T. The behavior of ISM in the neighborhood of its equilibrium w is given by the structure of stable and unstable manifolds of the linearized system at w, as outlined in the next section.
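For N_1 < N/2 and a temperature strictly between N^{−1} and T_E(N, N_1), Theorem 2 promises exactly two coordinate branches γ_1^−(T) < γ_1^+(T). A bisection sketch on either side of the concave maximum of (11) recovers them (function and parameter names are illustrative; the test temperature is an arbitrary midpoint).

```python
import numpy as np

def T_of_g1(N, N1, g1):
    """Eq. (11)."""
    x = N * g1 - 1.0
    return x / (-(N - N1) * np.log(1.0 - x / ((N - N1) * g1)))

def branch_roots(N, N1, T):
    """Solve T_{N,N1}(g1) = T for g1-(T) < g1+(T) (Theorem 2) by
    bisection on the rising / falling side of the concave maximum."""
    g = np.linspace(1.0 / N, 1.0 / N1, 20001)[1:-1]
    m = int(np.argmax(T_of_g1(N, N1, g)))

    def bisect(lo, hi, increasing):
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if (T_of_g1(N, N1, mid) < T) == increasing:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return bisect(g[0], g[m], True), bisect(g[m], g[-1], False)

N, N1 = 10, 1
TE = T_of_g1(N, N1, np.linspace(1.0 / N, 1.0 / N1, 20001)[1:-1]).max()
gm, gp = branch_roots(N, N1, 0.5 * (1.0 / N + TE))  # N^-1 < T < T_E
print(gm, gp)   # gamma_1^-(T) < gamma_1^+(T)
```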

[Figure 2: Bifurcation structure of ISM equilibria (N = 10). The larger fixed-point coordinate γ_1 is plotted against the temperature T; the bifurcation temperatures T_E(10,1) > T_E(10,2) > T_E(10,3) > T_E(10,4) and the branches N_1 = 1, 2, 3, 4 are marked.]
Fig. 2. Demonstration of the process of creation of new ISM fixed points and their flow as the system temperature cools down. Here N = 10, i.e. the ISM operates on the standard 9-simplex S_9. Against each temperature setting T, the values of the larger coordinate γ_1 > N^{−1} of the fixed points existing at T are plotted. The horizontal bold line corresponds to the maximum entropy point w̄ = 10^{−1} 1_{10}.

4 Stability Analysis of Renormalization Equilibria

The maximum entropy point w̄ is not only a fixed point of ISM (6); regarded as a vector w̄ − 0_N, it is also an eigenvector of the Jacobian J(w; T) at any w ∈ S⁰_{N−1}, with eigenvalue λ = 0. This simply reflects the fact that the ISM renormalization acts on the standard simplex S_{N−1}, which is a subset of an (N − 1)-dimensional hyperplane with normal vector 1_N. We have already seen that w̄ plays a special role in the ISM equilibria structure: all equilibria lie on lines going from w̄ towards the maximum entropy points of faces of S_{N−1}. The lines themselves are of special interest, since we will show that they are invariant manifolds of the ISM renormalization and that their directional vectors are eigenvectors of the ISM Jacobians at the fixed points located on them.

Theorem 5. Consider ISM (6) and 1 ≤ N_1 < N. Then for each maximum entropy point w̃ ∈ Q_{N,N_1} of an (N_1 − 1)-face of S_{N−1}, the line ℓ(τ; w̄, w̃), τ ∈ [0, 1), connecting the maximum entropy point w̄ with w̃, is an invariant set under the ISM dynamics.

Sketch of the Proof: The result follows from plugging the parametrization ℓ(τ; w̄, w̃) into (3) and realizing (after some manipulation) that for each τ ∈ [0, 1) there exists a parameter setting τ_1 ∈ [0, 1) such that F(ℓ(τ; w̄, w̃)) = ℓ(τ_1; w̄, w̃). Q.E.D.


The proofs of the next two theorems are rather involved and we refer the interested reader to [7].

Theorem 6. Let w ∈ E_{N,N_1}(γ_1) be a fixed point of ISM (6). Then w* = w − w̄ is an eigenvector of the Jacobian J(w; T_{N,N_1}(γ_1)) with the corresponding eigenvalue λ*, where

1. if N/2 ≤ N_1 ≤ N − 1, then 0 < λ* < 1;
2. if 1 ≤ N_1 < N/2 and N^{−1} < γ_1 < (2N_1)^{−1}, then λ* > 1;
3. if 1 ≤ N_1 < N/2, then there exists γ̄_1 ∈ ((2N_1)^{−1}, N_1^{−1}), such that for all ISM fixed points w ∈ E_{N,N_1}(γ_1) with γ_1 ∈ (γ̄_1, N_1^{−1}), 0 < λ* < 1.

We have established that for an ISM equilibrium w, both w̄ and w* = w − w̄ are eigenvectors of the ISM Jacobian at w. The stability types of the remaining N − 2 eigendirections are characterized in the next theorem.

Theorem 7. Consider an ISM fixed point w ∈ E_{N,N_1}(γ_1) for some 1 ≤ N_1 < N and N^{−1} < γ_1 < N_1^{−1}. Then there are N − N_1 − 1 and N_1 − 1 eigenvectors of the Jacobian J(w; T_{N,N_1}(γ_1)) of ISM at w with the same associated eigenvalue 0 < λ_− < 1 and λ_+ > 1, respectively.
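The eigenstructure asserted in Theorems 6 and 7 can be inspected numerically; since J(w; T) in (5) equals (1/T)(diag(F) − FF⊤), it is symmetric with a real spectrum. The sketch below (parameter choices are arbitrary) takes an N_1 = 1 equilibrium near a vertex and confirms that, besides the zero eigenvalue along w̄, all eigenvalues lie below 1, i.e. the near-vertex equilibrium is stable, and that the N − N_1 − 1 eigenvalues λ_− coincide.

```python
import numpy as np

def softmax(w, T):
    e = np.exp(w / T)
    return e / e.sum()

def jacobian(w, T):
    """Eq. (5): J(w; T) = (1/T) (diag(F) - F F^T) with F = F(w; T)."""
    F = softmax(w, T)
    return (np.diag(F) - np.outer(F, F)) / T

N, N1, g1 = 6, 1, 0.8                    # N1 = 1: near-vertex equilibrium
g2 = (1.0 - N1 * g1) / (N - N1)          # Eq. (9)
x = N * g1 - 1.0
T = x / (-(N - N1) * np.log(1.0 - x / ((N - N1) * g1)))   # Eq. (11)
w = np.array([g1] + [g2] * (N - N1))

J = jacobian(w, T)
wbar = np.ones(N) / N
print(np.linalg.norm(J @ wbar))          # ~ 0: eigenvalue 0 along wbar
lam = np.sort(np.linalg.eigvalsh(J))     # real spectrum (J symmetric)
print(lam)                               # all eigenvalues below 1
```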

5 Discussion – SONN Adaptation Dynamics

In the intermittent search regime of SONN [4], the search is driven by pulling promising solutions temporarily to the vicinity of the 0-1 "one-hot" assignment values - the vertices of S_{N−1} (0-dimensional faces of the standard simplex S_{N−1}). The critical temperature for intermittent search should correspond to the case where attractive forces already exist in the form of attractive equilibria near the "one-hot" assignment suggestions (vertices of S_{N−1}), but the convergence rates towards such equilibria are sufficiently weak so that the intermittent character of the search is not destroyed. This occurs at temperatures lower than, but close to, the first bifurcation temperature T_E(N, 1) (for more details, see [7]). In [4] it is hypothesised that there is a strong link between the critical temperature for intermittent search by SONN and the bifurcation temperatures of the autonomous ISM. In [6] we hypothesised (in accordance with [4]) that even though there are many potential ISM equilibria, the critical bifurcation points are related only to the equilibria near the vertices of S_{N−1}, as only those could be guaranteed by the theory of [6] (stability bounds) to be stable, even though the theory did not prevent the other equilibria from being stable. In this study, we have rigorously shown that stable equilibria can in fact exist only near the vertices of S_{N−1}, on the lines connecting w̄ with the vertices. Only when N_1 = 1 are there no expansive eigendirections of the local Jacobian with λ_+ > 1. As the SONN system cools down, more and more ISM equilibria emerge on the lines connecting the maximum entropy point w̄ of the standard simplex S_{N−1} with the maximum entropy points of its faces of increasing dimensionality. With decreasing temperature, the dimensionality of stable and unstable manifolds


of the linearized ISM at the emerging equilibria decreases and increases, respectively. At lower temperatures, this creates a peculiar pattern of saddle-type equilibria surrounding the unstable maximum entropy point w̄, with decision-enforcing "one-hot" stable equilibria located near the vertices of S_{N−1}. The trajectory towards the solution as the SONN system anneals is shaped by this complex skeleton of saddle-type equilibria with stable/unstable manifolds of varying dimensionalities and can therefore, in synergy with the input driving process, exhibit signatures of very complex dynamical behavior, as reported e.g. in [5]. Once the temperature is sufficiently low, the attraction rates of the stable equilibria near the vertices of S_{N−1} are so high that the found solution is virtually pinned down by the system. Even though the present study clarifies the prominent role of the first (symmetry-breaking) bifurcation temperature T_E(N, 1) in obtaining the SONN intermittent search regime and helps to understand the origin of complex SONN adaptation patterns in the annealing regime, many interesting open questions remain. For example, no theory as yet exists of the role of the abstract neighborhood B_L(i(j_c)) of the winner node i(j_c) in the cooperative phase of SONN adaptation. We conclude by noting that it may be possible to apply the theory of ISM in other assignment optimization systems that incorporate softmax assignment weight renormalization, e.g. [8,9].

References

1. Smith, K., Palaniswami, M., Krishnamoorthy, M.: Neural techniques for combinatorial optimization with applications. IEEE Transactions on Neural Networks 9, 1301–1318 (1998)
2. Guerrero, F., Lozano, S., Smith, K., Canca, D., Kwok, T.: Manufacturing cell formation using a new self-organizing neural network. Computers & Industrial Engineering 42, 377–382 (2002)
3. Kwok, T., Smith, K.: Improving the optimisation properties of a self-organising neural network with weight normalisation. In: Proceedings of the ICSC Symposia on Intelligent Systems and Applications (ISA 2000), Paper No. 1513-285 (2000)
4. Kwok, T., Smith, K.: Optimization via intermittency with a self-organizing neural network. Neural Computation 17, 2454–2481 (2005)
5. Kwok, T., Smith, K.: A noisy self-organizing neural network with bifurcation dynamics for combinatorial optimization. IEEE Transactions on Neural Networks 15, 84–88 (2004)
6. Tiňo, P.: Equilibria of iterative softmax and critical temperatures for intermittent search in self-organizing neural networks. Neural Computation 19, 1056–1081 (2007)
7. Tiňo, P.: Bifurcation structure of equilibria of adaptation dynamics in self-organizing neural networks. Technical Report CSRP-07-12, University of Birmingham, School of Computer Science (2007), http://www.cs.bham.ac.uk/~pxt/PAPERS/ism.bifurc.tr.pdf
8. Gold, S., Rangarajan, A.: Softmax to softassign: Neural network algorithms for combinatorial optimization. Journal of Artificial Neural Networks 2, 381–399 (1996)
9. Rangarajan, A.: Self-annealing and self-annihilation: unifying deterministic annealing and relaxation labeling. Pattern Recognition 33, 635–649 (2000)

Variable Selection for Multivariate Time Series Prediction with Neural Networks

Min Han and Ru Wei

School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116023, China
[email protected]

Abstract. This paper proposes a variable selection algorithm based on neural networks for multivariate time series prediction. Sensitivity analysis of the neural network error function with respect to the input is developed to quantify the saliency of each input variable. The input nodes with low sensitivity are then pruned along with their connections, which corresponds to deleting the redundant variables. The proposed algorithm is tested on both computer-generated time series and practical observations. Experimental results show that the proposed algorithm outperforms other variable selection methods by achieving a more significant reduction in the training data size and higher prediction accuracy.

Keywords: Variable selection, neural network pruning, sensitivity, multivariate prediction.

1 Introduction

Nonlinear and chaotic time series prediction is a practical technique which can be used for studying the characteristics of complicated dynamics from measurements. Usually, multiple variables are required, since the output may depend not only on its own previous values but also on the past values of other variables. However, we cannot be sure that all of the variables are equally important. Some of them may be redundant or even irrelevant. If these unnecessary input variables are included in the prediction model, the parameter estimation process becomes more difficult, and the overall results may be poorer than if only the required inputs are used [1]. Variable selection is the problem of discarding the redundant variables, which reduces the number of input variables and the complexity of the prediction model. A number of variable selection methods based on statistical or heuristic tools have been proposed, such as Principal Component Analysis (PCA) and Discriminant Analysis. These techniques attempt to reduce the dimensionality of the data by creating new variables that are linear combinations of the original ones. The major difficulty comes from the separation of the variable selection process and the prediction process. Therefore, variable selection using a neural network is attractive, since one can globally adapt the variable selector together with the predictor. Variable selection with a neural network can be seen as a special case of architecture pruning [2], where the pruning of input nodes is equivalent to removing the corresponding

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 415–425, 2008. © Springer-Verlag Berlin Heidelberg 2008


variables from the original data set. One approach to pruning is to estimate the sensitivity of the output to the exclusion of each unit. There are several ways to perform sensitivity analysis with neural networks. Most of them are weight-based [3], building on the idea that weights connected to important variables attain large absolute values while weights connected to unimportant variables attain values near zero. However, smaller weights usually result in smaller inputs to neurons and, in general, larger sigmoid derivatives, which increases the output sensitivity to the input. Mozer and Smolensky [4] introduced a method which estimates which units are least important and can be deleted during training. Gevrey et al. [5] compute the partial derivatives of the neural network output with respect to the input neurons and compare the performance of several different methods for evaluating the relative contribution of the input variables. This paper concentrates on a neural-network-based variable selection algorithm as the tool to determine which variables are to be discarded. A simple sensitivity criterion of the neural network error function with respect to each input is developed to quantify the saliency of each input variable. The input nodes are then arranged in decreasing sensitivity order so that the neural network can be pruned efficiently by discarding the last items with low sensitivity. The variable selection algorithm is then applied to both computer-generated data and practical observations and is compared with the PCA variable reduction method. The rest of this paper is organized as follows. Section 2 reviews the basic concepts of multivariate time series prediction and a statistical variable selection method. Section 3 explains the sensitivity analysis with neural networks in detail. Section 4 presents two simulation results. The work is finally concluded in Section 5.

2 Modeling Multivariate Chaotic Time Series

The basic idea of chaotic time series analysis is that a complex system can be described by a strange attractor in its phase space. Therefore, the reconstruction of the equivalent state space is usually the first step in chaotic time series prediction.

2.1 Multivariate Phase Space Reconstruction

Phase space reconstruction from observations can be accomplished by choosing a suitable embedding dimension and time delay. Consider an M-dimensional time series {X_i, i = 1, 2, ..., M}, where X_i = [x_i(1), x_i(2), ..., x_i(N)]^T and N is the length of each scalar time series. As in the case of univariate time series (where M = 1), the reconstructed phase space can be written as [6]:

    X(t) = [x_1(t), x_1(t − τ_1), ..., x_1(t − (d_1 − 1)τ_1), ...,
            x_M(t), x_M(t − τ_M), ..., x_M(t − (d_M − 1)τ_M)]^T    (1)

where t = L, L + 1, ..., N, L = max_{1≤i≤M} (d_i − 1)·τ_i + 1, and τ_i and d_i (i = 1, 2, ..., M) are the time delays and embedding dimensions of each time series, respectively. The delay time τ_i can be calculated using the mutual information method and the embedding dimension can be computed with the false nearest neighbor method.
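The multivariate delay embedding (1) can be sketched in a few lines of NumPy. This is an illustrative sketch: it uses 0-based indexing (the paper indexes time from 1), and the toy signals and parameter values are arbitrary.

```python
import numpy as np

def reconstruct(series, taus, dims):
    """Multivariate delay embedding (Eq. 1).

    series : list of M 1-d arrays of equal length n,
    taus, dims : per-series time delay tau_i and embedding dim d_i.
    Returns the (n - L + 1, D) matrix of state vectors X(t),
    where D = sum(dims) and L = max_i (d_i - 1) * tau_i + 1."""
    n = len(series[0])
    L = max((d - 1) * tau for d, tau in zip(dims, taus)) + 1
    rows = []
    for t in range(L - 1, n):
        row = [x[t - k * tau]
               for x, tau, d in zip(series, taus, dims)
               for k in range(d)]
        rows.append(row)
    return np.array(rows)

t = np.arange(200)
x1, x2 = np.sin(0.1 * t), np.cos(0.07 * t)
X = reconstruct([x1, x2], taus=[3, 5], dims=[2, 3])
print(X.shape)   # (190, 5): here L = 11, D = 2 + 3 = 5
```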


According to Takens' embedding theorem, if D = Σ_{i=1}^{M} d_i is large enough, there exists a mapping F: X(t+1) = F{X(t)}. Then the evolution X(t) → X(t+1) reflects the evolution of the original dynamical system. The problem is then to find an appropriate expression for the nonlinear mapping F. Up to the present, many chaotic time series prediction models have been developed. Neural networks have been widely used because of their universal approximation capabilities.

2.2 Neural Network Model

A multilayer perceptron (MLP) with a back propagation (BP) algorithm is used as a nonlinear predictor for multivariate chaotic time series prediction. The MLP is a supervised learning algorithm designed to minimize the mean square error between the computed output of the neural network and the desired output. The network usually consists of three layers: an input layer, one or more hidden layers and an output layer. Consider a three-layer MLP that contains one hidden layer. The D-dimensional delayed time series X(t) is used as the input of the network to generate the network output X(t+1). Then the neural network can be expressed as follows:

    o_j = f( Σ_{i=0}^{N_I} x_i w_{ij}^{(I)} )    (2)

    y_k = Σ_{j=1}^{N_H} w_{jk}^{(O)} o_j    (3)

where [x_1, x_2, ..., x_{N_I}] = X(t) denotes the input signal, N_I is the number of input signals to the neural network, w_{ij}^{(I)} is the weight connecting the ith input neuron to the jth hidden neuron, o_j is the output of the jth hidden neuron, N_H is the number of neurons in the hidden layer, [y_1, y_2, ..., y_{N_O}] = X(t+1) is the output, N_O is the number of output neurons, and w_{jk}^{(O)} is the weight connecting the jth hidden neuron to the kth output neuron. The activation function f(·) is the sigmoid function given by

    f(x) = 1 / (1 + exp(−x))    (4)

The error function of the net is usually defined as the sum square of the error

    E = Σ_{t=1}^{N} Σ_{k=1}^{N_O} [y_k(t) − p_k(t)]²    (5)

where p_k(t) is the desired output for unit k and N is the length of the training sample.
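The forward pass (2)-(4) and the error (5) take only a few lines of NumPy. A minimal sketch under stated assumptions: the bias term (the i = 0 input in Eq. 2) is omitted, the output layer is linear as in (3), and the shapes and random data are purely illustrative.

```python
import numpy as np

def sigmoid(x):                       # Eq. (4)
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(X, W_I, W_O):
    """Eqs. (2)-(3): sigmoid hidden layer, linear output layer.
    X : (T, N_I) inputs, W_I : (N_I, N_H), W_O : (N_H, N_O)."""
    O = sigmoid(X @ W_I)              # hidden activations o_j
    return O @ W_O, O                 # outputs y_k

rng = np.random.default_rng(2)
T_len, N_I, N_H, N_O = 50, 5, 8, 1
X = rng.normal(size=(T_len, N_I))
W_I = rng.normal(scale=0.5, size=(N_I, N_H))
W_O = rng.normal(scale=0.5, size=(N_H, N_O))
P = rng.normal(size=(T_len, N_O))     # desired outputs p_k(t)

Y, O = mlp_forward(X, W_I, W_O)
E = np.sum((Y - P) ** 2)              # Eq. (5)
print(Y.shape, E)
```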


2.3 Statistical Variable Selection Method

For multivariate time series, the dimension of the reconstructed phase space is usually very high. Moreover, the increase in the number of input variables leads to a high complexity of the prediction model. Therefore, in many practical applications, variable selection is needed to reduce the dimensionality of the input data. The aim of variable selection in this paper is to select a subset of R inputs that retains most of the important features of the original input set. Thus, D − R irrelevant inputs are discarded. Principal Component Analysis (PCA) is a traditional technique for variable selection [7]. PCA attempts to reduce the dimensionality by first decomposing the normalized input vector X(t) with the singular value decomposition (SVD) method

    X = U Σ V^T    (6)

where Σ = diag[s_1 s_2 ... s_p 0 ... 0], s_1 ≥ s_2 ≥ ... ≥ s_p are the first p singular values of X arranged in decreasing order, and U and V are both orthogonal matrices. Then the first k singular values are preserved as the principal components. The final input can be obtained as

    Z = Ũ^T X    (7)

where Ũ is formed from the first k columns of U. PCA is an efficient method to reduce the input dimension. However, we cannot be sure that the factors we discard have no influence on the prediction output, because the variable selection and prediction processes are carried out separately. A neural network selector is a good choice to combine the selection process and the prediction process.
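A sketch of the PCA/SVD reduction (6)-(7) on hypothetical data whose variance is concentrated in two directions. The data generation and the rule for choosing k (keeping singular values above a fraction of the largest) are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

# Hypothetical data: 200 samples of a 6-dimensional input in which
# only 2 latent directions carry variance (plus a little noise).
rng = np.random.default_rng(3)
latent = rng.normal(size=(2, 200))
mix = rng.normal(size=(6, 2))
X = mix @ latent + 0.01 * rng.normal(size=(6, 200))  # columns = samples
X = X - X.mean(axis=1, keepdims=True)                # normalize

U, s, Vt = np.linalg.svd(X, full_matrices=False)     # Eq. (6)
k = int(np.sum(s > 0.1 * s[0]))                      # keep dominant PCs
Z = U[:, :k].T @ X                                   # Eq. (7)
print(k, Z.shape)
```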

3 Sensitivity Analysis with Neural Networks

Variable selection with neural networks can be achieved by pruning the input nodes of a neural network model based on some saliency measure, aiming to remove the less relevant variables. The significance of a variable can be defined as the error when the unit is removed minus the error when it is left in place:

    S_i = E_WithoutUnit_i − E_WithUnit_i = E(x_i = 0) − E(x_i = x_i)    (8)

where E is the error function defined in Eq. (5). After the neural network has been trained, a brute-force pruning method for every input is to set that input to zero and evaluate the change in the error: if it increases too much, the input is restored; otherwise it is removed. Theoretically, this can be done by training the network under all possible subsets of the input set. However, this exhaustive search is computationally infeasible and can be very slow for large networks. This paper uses the same idea as Mozer and Smolensky [4] to approximate the sensitivity by introducing a gating term α_i for each unit such that

    o_j = f( Σ_i w_{ij} α_i o_i )    (9)

where o_j is the activity of unit j and w_{ij} is the weight from unit i to unit j.


The gating terms α are shown in Fig. 1, where α_i^I, i = 1, 2, ..., N_I, is the gating term of the ith input neuron and α_j^H, j = 1, 2, ..., N_H, is the gating term of the jth hidden neuron.


Fig. 1. The gating term for each unit

The gating term α is merely a notational convenience rather than an actual parameter of the network. If α = 0, the unit has no influence on the network; if α = 1, the unit behaves normally. The importance of a unit is then approximated by the derivative

S_i = − ∂E/∂α_i |_{α_i = 1}    (10)

By using a standard error back-propagation algorithm, the derivative in Eq. (10) can be expressed in terms of the network weights as follows:

S_j^H = − Σ_{t=1}^{N} Σ_{k=1}^{N_O} (∂E/∂y_k) · (∂y_k/∂α_j^H) = Σ_{t=1}^{N} Σ_{k=1}^{N_O} ( p_k(t) − y_k(t) ) w_jk^(O) o_j    (11)

S_i^I = − Σ_{t=1}^{N} Σ_{k=1}^{N_O} (∂E/∂y_k) · (∂y_k/∂α_i^I) = Σ_{t=1}^{N} Σ_{k=1}^{N_O} [ ( p_k(t) − y_k(t) ) Σ_{j=1}^{N_H} w_jk^(O) o_j (1 − o_j) w_ij^(I) x_i(t) ]    (12)

where S_i^I is the sensitivity of the ith input neuron and S_j^H is the sensitivity of the jth hidden neuron. The algorithm can thus prune the input nodes as well as the hidden nodes according to their sensitivities over training. However, the sensitivity fluctuates strongly when calculated directly from Eq. (11) and Eq. (12), because of the approximation in Eq. (10); occasionally an input may be deleted incorrectly. In order to reduce the dimensionality of the input vectors reliably, the sensitivity therefore needs to be evaluated over the entire training set. This paper considers several ways to define the overall sensitivity, such as: (1) The mean square average sensitivity:

S_{i,avg} = (1/T) Σ_{t=1}^{T} S_i(t)²    (13)

where T is the number of data in the training set.


M. Han and R. Wei

(2) The absolute value average sensitivity:

S_{i,abs} = (1/T) Σ_{t=1}^{T} |S_i(t)|    (14)

(3) The maximum absolute sensitivity:

S_{i,max} = max_{1≤t≤T} |S_i(t)|    (15)

Any of the sensitivity measures in Eqs. (13)–(15) provides a useful criterion for deciding which inputs to delete. For succinctness, this paper uses the mean square average sensitivity as an example. An input with a low sensitivity has little or no influence on the prediction accuracy and can therefore be removed. In order to obtain a more effective pruning criterion, the sensitivity is normalized. Define the absolute sum of the sensitivities over all input nodes,

S = Σ_{i=1}^{N_I} S_i    (16)

Then the normalized sensitivity of each unit can be defined as

Ŝ_i = S_i / S    (17)

where the normalized value Ŝ_i lies in [0, 1]. The input variables are then arranged in order of decreasing sensitivity:

Ŝ_1 ≥ Ŝ_2 ≥ … ≥ Ŝ_{N_I}    (18)

Larger values of Ŝ_i, i = 1, 2, …, N_I, indicate the more important variables. Define the sum of the first k terms of the sensitivity, η_k, as

η_k = Σ_{j=1}^{k} Ŝ_j    (19)

where k = 1, 2, …, N_I. Choosing a threshold value 0 < η_0 < 1, if η_k > η_0 then the first k variables are preserved and the remaining inputs with low sensitivity are removed. The number of variables retained increases as the threshold η_0 increases.
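As an illustrative sketch (not the authors' implementation), the whole pipeline — the per-sample back-propagated sensitivities of Eq. (12), their mean square average over the training set (Eq. (13)), and the normalize-rank-threshold rule of Eqs. (16)–(19) — might look as follows; the toy network and all names are our assumptions:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def input_sensitivities(X, P, W_in, W_out):
    """Per-sample input sensitivities S_i(t) of Eq. (12) for a net with
    sigmoid hidden units and linear outputs.
    X: (T, N_I) inputs, P: (T, N_O) targets. Returns a (T, N_I) array."""
    O_h = sigmoid(X @ W_in)                # hidden activities o_j
    R = P - O_h @ W_out                    # residuals p_k(t) - y_k(t)
    G = (R @ W_out.T) * O_h * (1.0 - O_h)  # back-propagated hidden factor
    return (G @ W_in.T) * X                # S_i(t) = sum_j g_j(t) w_ij x_i(t)

def select_inputs(S_t, eta0=0.98):
    """Eqs. (13), (16)-(19): average the squared sensitivities, normalize,
    rank in decreasing order, and keep the smallest leading set whose
    cumulative normalized sensitivity exceeds eta0."""
    S = np.mean(S_t ** 2, axis=0)          # Eq. (13)
    S_hat = S / S.sum()                    # Eqs. (16)-(17)
    order = np.argsort(S_hat)[::-1]        # Eq. (18)
    eta = np.cumsum(S_hat[order])          # Eq. (19)
    k = int(np.searchsorted(eta, eta0) + 1)
    return np.sort(order[:k])

rng = np.random.default_rng(1)
T, N_I, N_H, N_O = 300, 5, 8, 1
X = rng.normal(size=(T, N_I))
X[:, 4] = 0.0                              # an input carrying no signal
P = np.tanh(X[:, :2] @ rng.normal(size=(2, N_O)))
W_in = 0.5 * rng.normal(size=(N_I, N_H))
W_out = 0.5 * rng.normal(size=(N_H, N_O))
S_t = input_sensitivities(X, P, W_in, W_out)
keep = select_inputs(S_t, eta0=0.98)       # indices of retained inputs
```

An input that is identically zero contributes nothing to the output, so its sensitivity vanishes exactly and it is always ranked last — the behavior the threshold rule exploits.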

4 Simulations

In this section, two simulations, on computer-generated data and on practically observed data, are carried out to demonstrate the performance of the variable selection method proposed in this paper. The simulation results are then compared with the


PCA method. The prediction performance can be evaluated by two error evaluation criteria [8], the Root Mean Square Error E_RMSE and the Prediction Accuracy E_PA:

E_RMSE = ( (1/(N−1)) Σ_{t=1}^{N} [P(t) − O(t)]² )^{1/2}    (20)

E_PA = Σ_{t=1}^{N} [ (P(t) − P_m)(O(t) − O_m) ] / ( (N−1) σ_P σ_O )    (21)

where O(t) is the target value, P(t) is the predicted value, O_m is the mean value of O(t), σ_O is the standard deviation of O(t), and P_m and σ_P are the mean value and standard deviation of P(t), respectively. E_RMSE reflects the absolute deviation between the predicted and observed values, while E_PA denotes the correlation coefficient between the observed and predicted values. In the ideal situation, if there are no errors in prediction, these criteria take the values E_RMSE = 0 and E_PA = 1.

4.1 Prediction of the Lorenz Time Series

The first data set is derived from the Lorenz system, given by three differential equations:

dx(t)/dt = a ( −x(t) + y(t) )
dy(t)/dt = b x(t) − y(t) − x(t) z(t)
dz(t)/dt = x(t) y(t) − c z(t)    (22)

where typical values of the coefficients are a = 10, b = 28, c = 8/3 (consistent with Eq. (22)) and the initial values are x(0) = 12, y(0) = 2, z(0) = 9. 1500 points of x(t), y(t) and z(t) obtained by the fourth-order Runge–Kutta method are used as the training sample and 500 points as the testing sample. In order to extract the dynamics of this system and predict x(t+1), the parameters for phase-space reconstruction are chosen as τ_x = τ_y = τ_z = 3 and m_x = m_y = m_z = 9. Thus an MLP neural network with 27 input nodes, one hidden layer of 20 neurons and one output node is considered, and a back-propagation training algorithm is used. After the training of the MLP neural network is stopped, sensitivity analysis is carried out to evaluate the contribution of each input variable to the error function of the neural network. The trajectories of the sensitivity through training for each input are shown in Fig. 2. It can be seen that the sensitivity fluctuates during training and finally converges when the weights and the error become steady. The normalized sensitivity measures of Eq. (17) are then calculated, and a threshold η_0 = 0.98 is chosen to determine which inputs are discarded. The input dimension of the neural network is thus reduced to 11. The original input matrix is replaced by the reduced input matrix, and the structure of the neural network is simplified. The prediction performance over the testing samples with the reduced inputs is shown in Fig. 4.
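A sketch of how such training data can be generated — RK4 integration of Eq. (22) followed by delay embedding; the step size dt and all names are our assumptions, since the paper does not state them:

```python
import numpy as np

def lorenz_rk4(n, dt=0.01, a=10.0, b=28.0, c=8.0 / 3.0, s0=(12.0, 2.0, 9.0)):
    """Integrate the Lorenz system of Eq. (22) with classical fourth-order
    Runge-Kutta and return n samples of (x, y, z)."""
    def f(s):
        x, y, z = s
        return np.array([a * (-x + y), b * x - y - x * z, x * y - c * z])
    s = np.array(s0, dtype=float)
    out = np.empty((n, 3))
    for i in range(n):
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        s = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        out[i] = s
    return out

def embed(series, m=9, tau=3):
    """Delay-embed one scalar series: row t is [u(t), u(t-tau), ...]."""
    start = (m - 1) * tau
    return np.array([series[t - np.arange(m) * tau]
                     for t in range(start, len(series))])

data = lorenz_rk4(2000)                                    # columns x, y, z
inputs = np.hstack([embed(data[:, j]) for j in range(3)])  # 27 input variables
```

With m = 9 and τ = 3 per variable, the three embedded series give exactly the 27-dimensional input vectors used by the MLP above.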



Fig. 2. The trajectories of the input sensitivity through training
Fig. 3. The normalized sensitivity for each input node


Fig. 4. The observed and predicted values of Lorenz x(t) time series

The solid line in Fig. 4 represents the observed values and the dashed line the predicted values. It can be seen from Fig. 4 that the chaotic behavior of the x(t) time series is well predicted and that the errors between the observed and predicted values are small. The prediction performance is summarized in Table 1 and compared with the PCA variable reduction method.

Table 1. Prediction performance of the x(t) time series

              With All Variables   PCA Selection   NN Selection
Input Nodes   27                   11              11
E_RMSE        0.1278               0.1979          0.0630
E_PA          0.9998               0.9997          1.0000

The prediction performances in Table 1 show that the neural-network-based variable selection and the PCA method are comparable, while the algorithm proposed in this paper achieves the best prediction accuracy.


4.2 Prediction of the Rainfall Time Series

Rainfall is an important variable in hydrological systems, and the chaotic character of rainfall time series has been demonstrated in many papers [9]. In this section, the simulation is performed on the monthly rainfall time series of the city of Dalian, China, over a period of 660 months (from 1951 to 2005). Rainfall may be influenced by many factors, so five other time series are also considered: temperature, air pressure, humidity, wind speed and sunlight. The method again follows Takens' theorem, first reconstructing the embedding phase space with the dimensions and delay times m1 = m2 = m3 = m4 = m5 = m6 = 9, τ1 = τ2 = τ3 = τ4 = τ5 = τ6 = 3. The input of the neural network then contains L = 660 − (9 − 1) × 3 = 636 data points. In the experiments, this data set is divided into a training set composed of the first 436 points and a testing set containing the remaining 200 points. The neural network used here contains 54 input nodes, 20 hidden nodes and 1 output node. The threshold is again chosen as η_0 = 0.98. The trajectories of the input sensitivity and the normalized sensitivity for every input are shown in Fig. 5 and Fig. 6, respectively. According to the sensitivity values, 34 input nodes are then retained.


Fig. 5. The trajectories of the input sensitivity through training


Fig. 6. The normalized sensitivity for each input node

The observed and predicted values of the rainfall time series are shown in Fig. 7, which exhibits high prediction accuracy. It can be seen from the figures that the chaotic behavior of the rainfall time series is well predicted and that the errors between the observed and predicted values are small. The corresponding values of E_RMSE and E_PA are shown in Table 2. Both the figures and the error evaluation criteria indicate that the result for multivariate chaotic time series using neural-network-based variable selection is much better than the results obtained with all variables or with the PCA method. It can be concluded from the two simulations that the variable selection algorithm using neural networks is able to capture the dynamics of both computer-generated and practically observed time series accurately and gives high prediction accuracy.



Fig. 7. The observed and predicted values of the rainfall time series

Table 2. Prediction performance of the rainfall time series

              With All Variables   PCA Selection   NN Selection
Input Nodes   54                   43              31
E_RMSE        22.2189              21.0756         18.1435
E_PA          0.9217               0.9286          0.9529

5 Conclusions

This paper studies a variable selection algorithm that uses sensitivity analysis to prune the input nodes of a neural network model. A simple and effective criterion for identifying the input nodes to be removed is derived; it does not require a high computational cost and proves to work well in practice. The validity of the method was examined on a multivariate prediction problem, and a comparison study was made with other variable selection methods. The experimental results encourage the application of the proposed method to complex tasks that require the identification of significant input variables.

Acknowledgements. This research is supported by project 60674073 of the National Natural Science Foundation of China, project 2006CB403405 of the National Basic Research Program of China (973 Program) and project 2006BAB14B05 of the National Key Technology R&D Program of China. All of these supports are appreciated.

References

[1] Verikas, A., Bacauskiene, M.: Feature selection with neural networks. Pattern Recognition Letters 23, 1323–1335 (2002)
[2] Castellano, G., Fanelli, A.M.: Variable selection using neural network models. Neurocomputing 31, 1–13 (2000)


[3] Castellano, G., Fanelli, A.M., Pelillo, M.: An iterative method for pruning feed-forward neural networks. IEEE Trans. Neural Networks 8(3), 519–531 (1997)
[4] Mozer, M.C., Smolensky, P.: Skeletonization: a technique for trimming the fat from a network via a relevance assessment. NIPS 1, 107–115 (1989)
[5] Gevrey, M., Dimopoulos, I., Lek, S.: Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecol. Model. 160, 249–264 (2003)
[6] Cao, L.Y., Mees, A., Judd, K.: Dynamics from multivariate time series. Physica D 121, 75–88 (1998)
[7] Han, M., Fan, M., Xi, J.: Study of Nonlinear Multivariate Time Series Prediction Based on Neural Networks. In: Wang, J., Liao, X.-F., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3497, pp. 618–623. Springer, Heidelberg (2005)
[8] Chen, J.L., Islam, S., Biswas, P.: Nonlinear dynamics of hourly ozone concentrations: nonparametric short term prediction. Atmospheric Environment 32(11), 1839–1848 (1998)
[9] Liu, D.L., Scott, B.J.: Estimation of solar radiation in Australia from rainfall and temperature observations. Agricultural and Forest Meteorology 106(1), 41–59 (2001)

Ordering Process of Self-Organizing Maps Improved by Asymmetric Neighborhood Function

Takaaki Aoki 1, Kaiichiro Ota 2, Koji Kurata 3, and Toshio Aoyagi 1,2

1 CREST, JST, Kyoto 606-8501, Japan
2 Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
3 Faculty of Engineering, University of the Ryukyus, Okinawa 903-0213, Japan
[email protected]

Abstract. The Self-Organizing Map (SOM) is an unsupervised learning method based on neural computation, which has recently found wide applications. However, the learning process sometimes has multi-stable states, in which the map is trapped in an undesirable disordered state containing topological defects. These topological defects critically degrade the performance of the SOM. In order to overcome this problem, we propose introducing an asymmetric neighborhood function into the SOM algorithm. Compared with the conventional symmetric one, the asymmetric neighborhood function accelerates the ordering process even in the presence of a defect. However, this asymmetry tends to generate a distorted map, which can be suppressed by an improved version of the asymmetric neighborhood function method. In the case of the one-dimensional SOM, the number of steps required for perfect ordering is numerically shown to be reduced from O(N³) to O(N²). Keywords: Self-Organizing Map, Asymmetric Neighborhood Function, Fast ordering.

1 Introduction

The Self-Organizing Map (SOM) is an unsupervised learning method, a type of nonlinear principal component analysis [1]. Historically, it was proposed as a simplified neural network model having the essential properties needed to reproduce the topographic representations observed in the brain [2,3,4,5]. The SOM algorithm can be used to construct an ordered mapping from input stimulus data onto a two-dimensional array of neurons according to the topological relationships between various characters of the stimulus. This implies that the SOM algorithm is capable of extracting the essential information from complicated data. From the viewpoint of applied information processing, the SOM algorithm can be regarded as a generalized, nonlinear type of principal component analysis and has proven valuable in the fields of visualization, compression and data mining. Based on a simple, biologically motivated learning rule, this algorithm behaves as an unsupervised

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 426–435, 2008. © Springer-Verlag Berlin Heidelberg 2008


Fig. 1. A: An example of a topological defect in a two-dimensional array of the SOM with a uniform rectangular input space. The triangle indicates the conflicting point in the feature map. B: Another example of a topological defect, in a one-dimensional array with scalar input data. The triangles again indicate the conflicting points.

learning method and provides robust performance without delicate tuning of the learning conditions. However, there is a serious problem of multi-stability or meta-stability in the learning process [6,7,8]. When the learning process is trapped in one of these states, the map appears in practice to have converged to its final state. However, some of these states are undesirable as solutions of the learning procedure: typically, the map has topological defects, as shown in Fig. 1A. The map in Fig. 1A is twisted, with a topological defect at the center. In this situation the two-dimensional array of the SOM should be arranged over the square space, since the input data are taken uniformly from that square space. But the topological defect is a global conflicting point which is difficult to remove by local modulations of the reference vectors of the units. Rectifying it therefore requires an enormous number of learning steps, and so the existence of a topological defect critically degrades the performance of the SOM algorithm. To avoid the emergence of topological defects, several conventional, empirical methods have been used. However, it would be preferable for the SOM algorithm to work well without tuning any model parameters, even when a topological defect has emerged. We therefore consider a simple method that enables an effective ordering procedure in the presence of a topological defect: we propose an asymmetric neighborhood function which effectively removes the defect [9]. In the process of removing the topological defect, the conflicting point must be moved out toward the boundary of the array, where it vanishes. The movement process of the defect is therefore essential for the efficiency of the ordering process. With the original symmetric neighborhood function, the movement of the defect resembles a random walk, whose efficiency is poor. By introducing asymmetry into the neighborhood function, the movement behaves like a drift, which enables faster ordering. For this reason, in this paper we investigate the effect of an asymmetric neighborhood function on the performance of the SOM algorithm for one-dimensional and two-dimensional SOMs.

2 Methods

2.1 SOM

The SOM constructs a mapping from the input data space to an array of nodes, called the 'feature map'. To each node i, a parametric 'reference vector' m_i is assigned. Through SOM learning, these reference vectors are rearranged according to the following iterative procedure. An input vector x(t) is presented at each time step t, and the best matching unit, whose reference vector is closest to the given input vector x(t), is chosen. The best matching unit c, called the 'winner', is given by c = arg min_i ||x(t) − m_i||. In other words, the data point x(t) in the input space is mapped onto the node c whose reference vector is closest to x(t). In SOM learning, the update rule for the reference vectors is given by

m_i(t+1) = m_i(t) + α · h(r_ic) [x(t) − m_i(t)],   r_ic ≡ ||r_i − r_c||    (1)

where α, the learning rate, is some small constant. The function h(r) is called the 'neighborhood function', and r_ic is the distance from the position r_c of the winner node c to the position r_i of node i on the array of units. A widely used neighborhood function is the Gaussian, h(r_ic) = exp( −r_ic² / (2σ²) ). We expect an ordered mapping after iterating the above procedure a sufficient number of times.
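As a minimal sketch (not the authors' code), one learning step of Eq. (1) for a one-dimensional array with scalar inputs can be written as follows; the parameter values follow Sect. 2.3, and the names are ours:

```python
import numpy as np

def som_step(m, x, alpha=0.05, sigma=50.0):
    """One update of Eq. (1) for a 1-D array of scalar reference vectors m,
    with the Gaussian neighborhood h(r) = exp(-r^2 / (2 sigma^2))."""
    c = np.argmin(np.abs(m - x))           # winner: closest reference vector
    r = np.abs(np.arange(len(m)) - c)      # array distance r_ic to the winner
    h = np.exp(-r ** 2 / (2.0 * sigma ** 2))
    return m + alpha * h * (x - m)

rng = np.random.default_rng(2)
m = rng.random(1000)                       # random initial reference vectors
for _ in range(5000):
    m = som_step(m, rng.random())          # inputs uniform on [0, 1]
```

Each update pulls the winner and its array neighbors toward the input, which is all that is needed to reproduce the ordering dynamics discussed below.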

2.2 Asymmetric Neighborhood Function

We now introduce a method to transform any given symmetric neighborhood function into an asymmetric one (Fig. 2A). Let us define an asymmetry parameter β (β ≥ 1), representing the degree of asymmetry, and a unit vector k indicating the direction of asymmetry. If a unit i is located in the positive direction along k, the component of its distance from the winner parallel to k is scaled by 1/β. If a unit i is located in the negative direction, the parallel component of the distance is scaled by β. Hence, the asymmetric function h_β(r), transformed from its symmetric counterpart h(r), is described by

h_β(r_ic) = 2/(β + β⁻¹) · h(r̃_ic),   r̃_ic = sqrt( (r_∥/β)² + r_⊥² )  if r_ic · k ≥ 0,   r̃_ic = sqrt( (β r_∥)² + r_⊥² )  if r_ic · k < 0    (2)

where r̃_ic is the scaled distance from the winner, r_∥ is the component of r_ic projected onto k, and r_⊥ comprises the remaining components perpendicular to k. In addition, in order to single out the effect of the asymmetry, the overall area of the neighborhood function, ∫_{−∞}^{∞} h(r) dr, is preserved under this transformation. In the special case of asymmetry parameter β = 1, h_β(r) equals the original symmetric function h(r). Figure 2B displays an example of an asymmetric Gaussian neighborhood function in the two-dimensional array of the SOM.


Fig. 2. A: Method of generating an asymmetric neighborhood function by scaling the distance r_ic asymmetrically. The degree of asymmetry is parameterized by β. The distance of a node in the positive direction along the asymmetric unit vector k is scaled by 1/β; the distance in the negative direction is scaled by β. The asymmetric function is therefore described by h_β(r_ic) = 2/(β + 1/β) · h(r̃_ic), where r̃_ic is the scaled distance of node i from the winner c. B: An example of an asymmetric Gaussian function. C: An illustration of the improved algorithm for the asymmetric neighborhood function.

Next, we introduce an improved algorithm for the asymmetric neighborhood function. The asymmetry of the neighborhood function tends to distort the feature map, so that the density of units no longer represents the probability density of the input data. Two additional procedures are therefore introduced. The first is an inversion operation on the direction of the asymmetric neighborhood function. As illustrated in Fig. 2C, the direction of the asymmetry is reversed after every time interval T, which is expected to average out the distortion of the feature map. Note that the interval T should be set larger than the typical ordering time for the asymmetric neighborhood function. The second procedure is an operation that decreases the degree of asymmetry of the neighborhood function. When β = 1, the neighborhood function equals the original symmetric function. With this operation, β is decreased toward 1 over the time steps, as illustrated in Fig. 2C. In our numerical simulations, we adopt a linearly decreasing function.
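The transformation of Eq. (2), together with the two improvements (periodic flipping of k and linear decay of β), can be sketched for the one-dimensional case as follows; the schedule details and names are our assumptions:

```python
import numpy as np

def h_gauss(r, sigma=50.0):
    """Symmetric Gaussian neighborhood function."""
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

def h_asym(ric, beta=1.5, k=+1, sigma=50.0):
    """Eq. (2) in 1-D: distances on the k side are scaled by 1/beta, on the
    other side by beta; the prefactor 2/(beta + 1/beta) preserves the area."""
    scaled = np.where(ric * k >= 0, np.abs(ric) / beta, np.abs(ric) * beta)
    return 2.0 / (beta + 1.0 / beta) * h_gauss(scaled, sigma)

def beta_schedule(t, T_flip=3000, beta0=1.5, t_end=20000):
    """Improved algorithm: flip the asymmetric direction every T_flip steps
    and decay beta linearly toward 1 (the symmetric case) by t_end."""
    k = +1 if (t // T_flip) % 2 == 0 else -1
    beta = max(1.0, beta0 - (beta0 - 1.0) * t / t_end)
    return beta, k
```

The area-preserving prefactor is exact: the 1/β-stretched half contributes β times the half-area and the β-compressed half contributes 1/β times it, so multiplying by 2/(β + 1/β) restores the original integral.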

2.3 Numerical Simulations

In the following sections, we test the SOM learning procedures on sample data to examine the performance of the ordering process. The sample data are


generated from a random variable with a uniform distribution. In the case of the one-dimensional SOM, the distribution is uniform over the range [0, 1]. We use a Gaussian as the original symmetric neighborhood function. The model parameters in SOM learning were as follows: the total number of units N = 1000, the learning rate α = 0.05 (constant), and the neighborhood radius σ = 50. The asymmetry parameter is β = 1.5 and the asymmetric direction k is set to the positive direction along the array. The interval T between flips of the asymmetric direction is 3000. In the case of the two-dimensional SOM (2D → 2D map), the input data are taken uniformly from the square space [0, 1] × [0, 1]. The model parameters are the same as in the one-dimensional SOM, except that the total number of units is N = 900 (30 × 30) and σ = 5. The asymmetric direction k is taken along (1, 0); this choice can be made arbitrarily. In the numerical simulations below, we also confirmed that the results hold for other model parameters and other forms of the neighborhood function.

2.4 Topological Order and Distortion of the Feature Map

To examine the ordering process of the SOM, let us consider two measures which characterize properties of the feature map. One is the 'topological order' η, quantifying the order of the reference vectors in the SOM array. The units of the SOM array should be arranged according to their reference vectors m_i. In the presence of a topological defect, most of the units satisfy local ordering, but the defect violates the global ordering and the feature map is divided into fragments of ordered domains within which the units satisfy the order-condition. The topological order η can therefore be defined as the ratio of the maximum domain size to the total number of units N:

η ≡ max_l N_l / N    (3)

where N_l is the size of domain l. In the case of the one-dimensional SOM, the order-condition for the reference vectors is defined as m_{i−1} ≤ m_i ≤ m_{i+1} or m_{i−1} ≥ m_i ≥ m_{i+1}. In the case of the two-dimensional SOM, as described in the previous section, the order-condition is likewise defined explicitly via the vector product a_(i,i) ≡ (m_(i+1,i) − m_(i,i)) × (m_(i,i+1) − m_(i,i)). Within an ordered domain, the vector products a_(i,i) of the units have the same sign, because the reference vectors are arranged in the sample space in the order of the unit positions. The other measure is the 'distortion' χ, which quantifies the distortion of the feature map. The asymmetry of the neighborhood function tends to distort the distribution of the reference vectors away from the correct probability density of the input vectors. For example, when the probability density of the input vectors is uniform, an asymmetric neighborhood function produces a non-uniform distribution of reference vectors. Hence, to measure the non-uniformity of the distribution of reference vectors, we define the distortion χ as the coefficient of variation of the size distribution of the units' Voronoi tessellation cells:

χ = sqrt( Var(Δ_i) ) / E(Δ_i)    (4)


where Δ_i is the size of the Voronoi cell of unit i. To eliminate the boundary effect of the SOM algorithm, the Voronoi cells on the edges of the array are excluded. When the reference vectors are distributed uniformly, the distortion χ converges to 0. It should be noted that the evaluation of the Voronoi cells in the two-dimensional SOM is time-consuming, so we approximate the size of a Voronoi cell by the magnitude of the vector product a_(i,i). If the feature map is formed uniformly, this approximate value also converges to 0.
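For the one-dimensional case, the two measures of Eqs. (3)–(4) can be sketched as follows (function names are ours; χ uses the half-distances between sorted neighbors as 1-D Voronoi cell widths, edges excluded):

```python
import numpy as np

def topological_order(m):
    """Eta of Eq. (3): size of the largest domain of units satisfying the
    1-D order-condition, divided by the total number of units."""
    d = np.sign(np.diff(m))
    best = run = 1
    for i in range(1, len(d)):
        run = run + 1 if d[i] == d[i - 1] else 1
        best = max(best, run)
    # a run of L equal signs covers L + 1 monotonically ordered units
    return (best + 1) / len(m)

def distortion(m):
    """Chi of Eq. (4) in 1-D: coefficient of variation of the Voronoi cell
    sizes, i.e. half-distances between sorted neighbors, edges excluded."""
    ms = np.sort(m)
    cells = 0.5 * (ms[2:] - ms[:-2])     # interior cell widths
    return np.sqrt(np.var(cells)) / np.mean(cells)

m_uniform = np.linspace(0.0, 1.0, 100)   # perfectly ordered, uniform map
print(topological_order(m_uniform))      # -> 1.0
print(distortion(m_uniform))             # essentially 0
```

A map twisted at the center (two monotone halves) gives η ≈ 0.5, matching the intuition that a single central defect splits the map into two ordered domains.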

3 Results

3.1 One-Dimensional Case

In this section, we investigate the ordering process of SOM learning in the presence of a topological defect for the symmetric, asymmetric, and improved asymmetric neighborhood functions. For this purpose, we use an initial condition in which a single topological defect appears at the center of the array. Because the density of the input vectors is uniform, the desirable feature map is a linear arrangement of the SOM nodes. Figure 3A shows a typical time development of the reference vectors m_i. In the case of the symmetric neighborhood function, a single defect remains around the center of the array even after 10000 steps. In contrast, in the asymmetric case this defect moves out to the right, so that the reference vectors become ordered within 3000 steps. This can also be confirmed in Fig. 3B, which shows the time dependence of the topological order η. For the asymmetric neighborhood function, η rapidly converges to 1 (the completely ordered state) within 3000 steps, whereas eliminating the last defect takes a very long time (∼18000 steps) with the symmetric one. On the other hand, one problem arises in the feature map obtained with the asymmetric neighborhood function: after 10000 steps, the distribution of the reference vectors develops an unusual bias (Fig. 3A). Figure 3C shows the time dependence of the distortion χ during learning. In the case of the symmetric neighborhood function, χ eventually converges to almost 0, indicating that the resulting feature map has an almost uniform size distribution of Voronoi cells. In contrast, in the asymmetric case χ converges to a finite value (≠ 0). Although the asymmetric neighborhood function accelerates the ordering process of SOM learning, the resulting map is distorted, which makes it unusable in applications. Therefore, the improved asymmetric method described in Methods is introduced. Using this improved algorithm, χ converges to almost 0, just as for the symmetric function (Fig. 3C). Furthermore, as shown in Fig. 3B, the improved algorithm preserves the fast order learning. Thus, the improved algorithm for the asymmetric neighborhood function gives the full benefit of both fast order learning and an undistorted feature map. To quantify the performance of the ordering process, let us define the 'ordering time' as the time at which η reaches 1. Figure 4A shows the ordering



Fig. 3. The asymmetric neighborhood function enhances the ordering process of SOM. A: A typical time development of the reference vectors mi in cases of symmetric, asymmetric, and improved asymmetric neighborhood functions. B: The time dependence of the topological order η. The standard deviations are denoted by the error bars, which cannot be seen because they are smaller than the size of the graphed symbol. C: The time dependence of the distortion χ.

time as a function of the total number of units N for both the improved asymmetric and the symmetric neighborhood functions. The ordering time is found to scale roughly as N³ for the symmetric and as N² for the improved asymmetric neighborhood function. For a detailed discussion of this reduction of the ordering time, refer to Aoki & Aoyagi (2007) [9]. Figure 4B shows the dependence of the ordering time on the width of the neighborhood function, indicating that the ordering time is proportional to (N/σ)^k with k = 2 for the asymmetric and k = 3 for the symmetric neighborhood function. This result implies that combined use of the asymmetric method and an annealing method for the width of the neighborhood function is even more effective.

3.2 Two-Dimensional Case

In this section, we investigate the effect of the asymmetric neighborhood function for the two-dimensional SOM (2D → 2D map). Figure 5 shows that a similar fast


Fig. 4. Ordering time as a function of the total number of units N. The fitting function is Const. · N^γ.


Fig. 5. A: A typical time development of reference vectors in two-dimensional array of SOM for the cases of symmetric, asymmetric and improved asymmetric neighborhood functions. B: The time dependence of the topological order η. C: The time dependence of the distortion χ.



Fig. 6. Distribution of ordering times when the initial reference vectors are generated randomly. The white bin at the right of each graph indicates the population of failed trials that did not converge to the perfectly ordered state within 50000 steps.

ordering process can be realized with an asymmetric neighborhood function in two-dimensional SOM. The initial state has a global topological defect, in which the map is twisted at the center. In this situation, the conventional symmetric neighborhood function has trouble in correcting the twisted map. Because of the local stability, this topological defect is never corrected even with a huge learning iteration. The asymmetric neighborhood function also is eﬀective to overcome such a topological defect, like the case of one-dimensional SOM. However, the same problem of ’distortion map’ occurs. Therefore, by using the improved asymmetric neighborhood function, the feature map converges to the completely ordered map in much less time without any distortion. In the previous simulations, we have considered a simple situation that a single defect exists around the center of the feature map as an initial condition in order to investigate the ordering process with the topological defect. However, when the initial reference vectors are set randomly, the total number of topological defects appearing in the map is not generally equal to one. Therefore, it is necessary to consider the statistical distribution of the ordering time, because the total number of the topological defects and the convergence process depend generally on the initial conditions. Figure 6 shows the distribution of the ordering time, when the initial reference vectors are randomly selected from the uniform distribution [0, 1] × [0, 1]. In the case of symmetric neighborhood function, a part of trial could not converged to the ordered state with trapped in the undesirable meta-stable states, in which the topological defects are never rectiﬁed. Therefore, although the fast ordering process is observed in some successful cases (lucky initial conditions), the formed map with symmetric one is highly depends on the initial conditions. 
In contrast, with the improved asymmetric neighborhood function, the distribution of the ordering time has a single sharp peak, and the feature map is constructed stably without any tuning of the initial condition.
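The ordering-time statistics discussed above can be illustrated with a minimal one-dimensional SOM in which the Gaussian neighborhood is optionally shifted by `shift` units away from the winner, a simple form of asymmetry. The function names, the learning rate, and the neighborhood width are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def train_som_1d(n_units=20, n_steps=5000, shift=0.0, seed=0):
    """Train a 1-D SOM on inputs drawn uniformly from [0, 1].

    shift = 0.0 gives the conventional symmetric Gaussian neighborhood;
    shift != 0 shifts the Gaussian away from the winner, i.e. an
    asymmetric neighborhood function (illustrative, not the paper's exact form).
    """
    rng = np.random.default_rng(seed)
    w = rng.random(n_units)            # random initial reference vectors in [0, 1)
    idx = np.arange(n_units)
    sigma, eta = 3.0, 0.1              # neighborhood width and learning rate
    for _ in range(n_steps):
        x = rng.random()               # input sample
        c = int(np.argmin(np.abs(w - x)))                 # best-matching unit
        h = np.exp(-((idx - c - shift) ** 2) / (2.0 * sigma ** 2))
        w += eta * h * (x - w)         # move units toward the input
    return w

def is_perfectly_ordered(w):
    # the map is perfectly ordered when the reference vectors are monotonic
    d = np.diff(w)
    return bool(np.all(d > 0) or np.all(d < 0))
```

Counting how many update steps are needed before `is_perfectly_ordered` first holds, over many random seeds, yields the kind of ordering-time histogram shown in Fig. 6.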

4

Conclusion

In this paper, we discussed the learning process of the self-organized map, especially in the presence of a topological defect. Interestingly, even in the presence

Ordering Process of SOMs Improved by Asymmetric Neighborhood Function

435

of the defect, we found that the asymmetry of the neighborhood function enables the system to accelerate the learning process. Compared with the conventional symmetric function, the convergence time of the learning process is roughly reduced from O(N^3) to O(N^2) in the one-dimensional SOM (N is the total number of units). Furthermore, this acceleration with the asymmetric neighborhood function is also effective in the case of the two-dimensional SOM (2D → 2D map). In contrast, the conventional symmetric function cannot rectify the twisted feature map even after a huge number of iteration steps, due to its local stability. These results suggest that the proposed method can be effective for more general cases of the SOM, which is the subject of future study.

Acknowledgments. This work was supported by Grant-in-Aid for Scientific Research from the Ministry of Education, Science, Sports, and Culture of Japan: grant numbers 18047014, 18019019 and 18300079.

References

1. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43(1), 59–69 (1982)
2. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol.-London 160(1), 106–154 (1962)
3. Hubel, D.H., Wiesel, T.N.: Sequence regularity and geometry of orientation columns in monkey striate cortex. J. Comp. Neurol. 158(3), 267–294 (1974)
4. von der Malsburg, C.: Self-organization of orientation sensitive cells in striate cortex. Kybernetik 14(2), 85–100 (1973)
5. Takeuchi, A., Amari, S.: Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybern. 35(2), 63–72 (1979)
6. Erwin, E., Obermayer, K., Schulten, K.: Self-organizing maps: stationary states, metastability and convergence rate. Biol. Cybern. 67(1), 35–45 (1992)
7. Geszti, T., Csabai, I., Cazokó, F., Szakács, T., Serneels, R., Vattay, G.: Dynamics of the Kohonen map. In: Statistical Mechanics of Neural Networks: Proceedings of the XIth Sitges Conference, pp. 341–349. Springer, New York (1990)
8. Der, R., Herrmann, M., Villmann, T.: Time behavior of topological ordering in self-organizing feature mapping. Biol. Cybern. 77(6), 419–427 (1997)
9. Aoki, T., Aoyagi, T.: Self-organizing maps with asymmetric neighborhood function. Neural Comput. 19(9), 2515–2535 (2007)

A Characterization of Simple Recurrent Neural Networks with Two Hidden Units as a Language Recognizer

Azusa Iwata(1), Yoshihisa Shinozawa(1), and Akito Sakurai(1,2)

(1) Keio University, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
(2) CREST, Japan Science and Technology Agency

Abstract. We give a necessary condition for a simple recurrent neural network with two sigmoidal hidden units to implement a recognizer of the formal language {a^n b^n | n > 0}, which is generated by the set of generating rules {S → aSb, S → ab}, and show that by setting the parameters so as to conform to the condition we obtain a recognizer of the language. The condition implies the instability of the learning process reported in previous studies. It also implies, in contrast to its success in implementing the recognizer, the difficulty of obtaining recognizers of more complicated languages.

1

Introduction

Pioneered by Elman [6], much research has been conducted on grammar learning by recurrent neural networks. A grammar is defined by generating or rewriting rules such as S → Sa and S → b; S → Sa means "the letter S should be rewritten to Sa," and S → b means "the letter S should be rewritten to b." If more than one rule is applicable, we have to try all the possibilities. A string generated this way to which no rewriting rule is applicable any further is called a sentence. The set of all possible sentences is called the language generated by the grammar. Although the word "language" would be better termed formal language, in contrast to natural language, we follow the custom of the formal language theory field (e.g. [10]). In everyday usage a sentence is a sequence of words, whereas in the above example it is a string of characters; nevertheless the essence is the same. The concept that a language is defined by a grammar in this way has been a major paradigm in a wide range of formal language studies and related fields. The study of grammar learning focuses on restoring a grammar from finite samples of sentences of the language associated with the grammar. In contrast to the ease of generating sample sentences, learning a grammar from sample sentences is a hard problem; in fact it is impossible except in some very restrictive cases, e.g. when the language is finite. As is well known, humans do learn languages whose grammars are very complicated. The difference between the two situations might be attributed to the possible existence of some unknown restrictions on the types of natural language grammars in our brain. Since a neural network is a model of
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 436–445, 2008.
© Springer-Verlag Berlin Heidelberg 2008

A Characterization of Simple RNNs with Two Hidden Units

437

the brain, some researchers think that a neural network could learn a grammar from exemplar sentences of a language. Research on grammar learning by neural networks is characterized by: 1. a focus on learning self-embedding rules, since simpler rules that can be expressed by finite state automata are understood to be learnable; 2. the use of simple recurrent neural networks (SRNs) as the basic mechanism; and 3. the adoption of target languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0}, which are clearly the results of, though not representative of, self-embedding rules. Here a^n is a string of n repetitions of a character a, the language {a^n b^n | n > 0} is generated by the grammar {S → ab, S → aSb}, and {a^n b^n c^n | n > 0} is generated by a context-sensitive grammar. The adoption of simple languages such as {a^n b^n | n > 0} and {a^n b^n c^n | n > 0} as target languages is inevitable, since it is not easy to see whether enough generalization is achieved by learning if realistic grammars are used. Although these languages are simple, in many cases what the networks learned was just what they were taught, not a grammar; that is, they did almost rote learning. In other words, their generalization capability was limited. Their learned results were also unstable, in the sense that when they were given new training sentences that were longer than the ones they had learned but still in the same language, the learned network changed unexpectedly; the change was more than just a refinement of the learned network. Considering these situations, we may doubt whether there really exists a solution, i.e., a correctly learned network. Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), Chalup et al. ([5]), and others tried to clarify the reasons and explore the possibility of network learning of these languages, but their results were not conclusive and did not give clear conditions that the learned networks satisfy in common.
In this paper we describe a condition that SRNs with two hidden units that have learned to recognize the language {a^n b^n | n > 0} have in common, that is, a condition that has to be met for an SRN to qualify as having successfully learned the language. The condition implies, incidentally, the instability of learning results. Moreover, by utilizing the condition, we construct a language recognizer; in doing so we found that learning recognizers of languages more complicated than {a^n b^n | n > 0} is hard. An RNN (recurrent neural network) is a type of network that has recurrent connections in addition to feedforward connections. The calculation is done at once for the feedforward part, and after a one-time-unit delay the recurrent connections give rise to additional inputs (i.e., additional to the external inputs) to the feedforward part. An RNN is thus a kind of discrete-time system. Starting from an initial state (the initial outputs of the feedforward part, i.e., the outputs without external inputs), the network accepts the characters of a string given to the external inputs one by one, reaches the final state, and produces the final external output from the final state. An SRN (simple recurrent network) is a simple type of RNN that has only one layer of hidden units in its feedforward part.

Rodriguez et al. [13] showed that an SRN learns the languages {a^n b^n | n > 0} (or, more correctly, a subset {a^i b^i | n ≥ i > 0}) and {a^n b^n c^n | n > 0} (or, more correctly, {a^i b^i c^i | n ≥ i > 0}). For {a^n b^n | n > 0}, they found an SRN that generalized to n ≤ 16 after being trained for n ≤ 11. They analyzed how the languages are processed by the SRN, but the results were not conclusive, so it is still an open problem whether it is really possible to realize recognizers of languages such as {a^n b^n | n > 0} or {a^n b^n c^n | n > 0}, and whether an SRN could learn or implement more complicated languages. Siegelmann [16] showed that an RNN has computational ability superior to a Turing machine. But the difficulty of learning {a^n b^n | n > 0} by an SRN suggests that at least an SRN might not be able to realize a Turing machine. The difference might be attributed to the difference in the output functions of the neurons: a piecewise linear function in Siegelmann's case and a sigmoidal function in standard SRNs, since we cannot find other substantial differences. On the other hand, Casey [4] and Maass [12] showed that in a noisy environment an RNN is equivalent to a finite automaton or something less powerful. These results suggest that exact (i.e., infinite-precision) computation is to be considered when we investigate the possibility of computation by RNNs, or specifically SRNs. Therefore, in the research reported in this paper, we have adopted RNN models with infinite-precision calculation and with a sigmoidal function as the output function of the neurons. From the viewpoints above, we discuss two things in this paper: a necessary condition for an SRN with two hidden units to be a recognizer of the language {a^n b^n | n > 0}, and the fact that the condition is sufficient to guide us in building an SRN that recognizes the language.

2

Preliminaries

An SRN (simple recurrent network) is a simple type of recurrent neural network, and its function is expressed by

s_{n+1} = σ(w_s · s_n + w_x · x_n),  N_n(s_n) = w_os · s_n + w_oc,

where σ is the sigmoid function σ(x) = (1 − exp(−x))/(1 + exp(−x)) = tanh(x/2), applied component-wise. A counter is a device that keeps an integer, allows +1 and −1 operations, and answers yes or no to an inquiry whether its content is 0 (0-test). A stack is a device that allows the operations push i (store i) and pop (recall the last-stored content, discard it, and restore the device to the state just before the corresponding push). Clearly a stack is more general than a counter, so if a counter is not implementable, neither is a stack. Elman used the SRN as a predictor, but Rodriguez et al. [13] and others consider it a counter; in this paper we mainly take the latter viewpoint. We first explain how a predictor is used as a recognizer or a counter. When we try to train an SRN to recognize a language, we train it to predict correctly the next character to come in an input string (see e.g. [6]). As usual, we adopt the "one-hot vector" representation for the network output. A one-hot vector is a vector in which a single element is one and the others

are zero. The network is trained so that the sum of squared errors (differences between actual and desired outputs) is minimized. With this method, if two possible outputs with the same occurrence frequency exist for the same input in the training data, the network will learn to output 0.5 for the two corresponding elements of the output, with the others being 0, since this output vector gives the minimum of the sum of squared errors. It is easily seen that if a network correctly predicts the next character to come for a string in the language {a^n b^n | n > 0}, it behaves as a counter with limited function. Let us add a new network output whose value is positive if the original network predicts only a to come (which happens only when the input string up to that time was a^n b^n for some n), and negative otherwise; this is common practice since Elman's work. The modified network seems to count up for a character a and count down for b, since it outputs a positive value when the numbers of a's and b's coincide and a negative value otherwise. Its counting capability, though, is limited, since it may output any value when a is fed before the due number of b's has been fed, that is, when a count-up action is required before the counter has returned to the 0-state. A (discrete-time) dynamical system is represented as the iteration of a function application: s_{i+1} = f(s_i), i ∈ N, s_i ∈ R^n. A point s is called a fixed point of f if f(s) = s. A point s is an attracting fixed point of f if s is a fixed point and there exists a neighborhood U_s around s such that lim_{i→∞} f^i(x) = s for all x ∈ U_s. A point s is a repelling fixed point of f if s is an attracting fixed point of f^{−1}. A point s is called a periodic point of f if f^n(s) = s for some n. A point s is an ω-limit point of x for f if s = lim_{i→∞} f^{n_i}(x) for some sequence n_i with lim_{i→∞} n_i = ∞.
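The 0.5-output behavior of the one-hot training scheme described above follows from a general fact: the constant vector minimizing a sum of squared errors is the mean of the target vectors. A minimal check (the three-character alphabet is our illustrative choice):

```python
import numpy as np

# Two equally frequent one-hot targets ("a comes next" vs "b comes next")
# observed for the same input context:
targets = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])

# For a fixed input, the constant output y minimizing sum_k ||t_k - y||^2
# is the mean of the targets, giving 0.5 on the two ambiguous elements:
y_opt = targets.mean(axis=0)
```

Perturbing `y_opt` in any direction strictly increases the summed squared error, confirming the minimizer.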
A fixed point x of f is hyperbolic if all of the eigenvalues of Df at x have absolute values different from one, where Df = [∂f_i/∂x_j] is the Jacobian matrix of first partial derivatives of f. A set D is invariant under f if for any s ∈ D, f(s) ∈ D. The following theorem plays an important role in the current paper.

Theorem 1 (Stable Manifold Theorem for a Fixed Point [9]). Let f : R^n → R^n be a C^r (r ≥ 1) diffeomorphism with a hyperbolic fixed point x. Then there are local stable and unstable manifolds W^{s,f}_loc(x), W^{u,f}_loc(x), tangent to the eigenspaces E^{s,f}_x, E^{u,f}_x of Df at x and of corresponding dimension. W^{s,f}_loc(x) and W^{u,f}_loc(x) are as smooth as the map f, i.e. of class C^r.

The local stable and unstable manifolds for f are defined as follows:

W^{s,f}_loc(q) = {y ∈ U_q | lim_{m→∞} dist(f^m(y), q) = 0},
W^{u,f}_loc(q) = {y ∈ U_q | lim_{m→∞} dist(f^{−m}(y), q) = 0},

where U_q is a neighborhood of q and dist is a distance function. The global stable and unstable manifolds for f are then defined as W^{s,f}(q) = ∪_{i≥0} f^{−i}(W^{s,f}_loc(q)) and W^{u,f}(q) = ∪_{i≥0} f^{i}(W^{u,f}_loc(q)). As defined, an SRN is a pair of a discrete-time dynamical system s_{n+1} = σ(w_s · s_n + w_x · x_n) and an external output part N_n = w_os · s_n + w_oc. We simply

write the former (the dynamical system part) as s_{n+1} = f(s_n, x_n) and the external output part as h(s_n). When an RNN (or SRN) is used as a recognizer of the language {a^n b^n | n > 0}, as described in the Introduction, it is seen as a counter where the input character a triggers a count-up operation (i.e., +1) and b a count-down operation (i.e., −1). In the following we may write x_+ for a and x_− for b, and for abbreviation we use f_+ = f(·, x_+) and f_− = f(·, x_−). Please note that f_−^{−1} is undefined for points outside and on the border of the square (I[−1, 1])^2, where I[−1, 1] is the closed interval [−1, 1]; in the following, though, we do not mention this for simplicity. D_0 is the set {s | h(s) ≥ 0}, that is, the region where the counter value is 0. Let D_i = f_−^{−i}(D_0), that is, the region where the counter value is i. We postulate that f_+(D_i) ⊆ D_{i+1}. This means that any point in D_i is eligible as a state where the counter content is i. This may seem rather demanding. An alternative would be that a point p is for counter content c if and only if p = f_−^{m_1} ◦ f_+^{p_1} ◦ … ◦ f_−^{m_i} ◦ f_+^{p_i}(s_0) for a predefined s_0, some m_j ≥ 0 and p_j ≥ 0 for 1 ≤ j ≤ i, and i ≥ 0 such that Σ_{j=1}^{i} (p_j − m_j) = c. This, unfortunately, has not led to a fruitful result. We also postulate that the D_i's are disjoint. Since we have defined each D_i as a closed set, the postulate is natural; the point is therefore that we have chosen the D_i to be closed. The postulate requires that we keep a margin between D_0, D_1, and the others.
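The maps f, f_+ and f_− can be written down directly from the definitions above; the weight values and input codes below are placeholder assumptions, chosen only to exhibit the shapes involved.

```python
import numpy as np

def sigma(z):
    # the paper's sigmoid: (1 - exp(-z)) / (1 + exp(-z)) = tanh(z / 2)
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

def make_srn(Ws, Wx, x_plus, x_minus):
    # f(s, x) = sigma(Ws . s + Wx . x);
    # f_plus = f(., x_plus) reads an 'a', f_minus = f(., x_minus) reads a 'b'
    f = lambda s, x: sigma(Ws @ s + Wx @ x)
    f_plus = lambda s: f(s, x_plus)
    f_minus = lambda s: f(s, x_minus)
    return f_plus, f_minus

# placeholder two-unit weights and one-hot input codes, for shape only:
Ws = np.array([[0.5, 0.0], [0.0, 0.5]])
Wx = np.array([[1.0, 0.0], [0.0, 1.0]])
f_plus, f_minus = make_srn(Ws, Wx, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Since σ maps into the open interval (−1, 1), iterates of `f_plus` and `f_minus` always stay strictly inside the square, which is why f_−^{−1} is undefined outside it.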

3

A Necessary Condition

We consider only SRNs with two hidden units, i.e., all the vectors concerning s, such as w_s, s_n, w_os, are two-dimensional.

Definition 2. D_ω is the set of accumulation points of ∪_{i≥0} D_i, i.e. s ∈ D_ω iff s = lim_{i→∞} s_{k_i} for some s_{k_i} ∈ D_{k_i}.

Definition 3. P_ω is the set of ω-limit points of points in D_0 for f_+, i.e. s ∈ P_ω iff s = lim_{i→∞} f_+^{k_i}(s_0) for some k_i and s_0 ∈ D_0. Q_ω is the set of ω-limit points of points in D_0 for f_−^{−1}, i.e. s ∈ Q_ω iff s = lim_{i→∞} f_−^{−k_i}(s_0) for some k_i and s_0 ∈ D_0.

Considering the results obtained by Bodén et al. ([1,2,3]), Rodriguez et al. ([13,14]), and Chalup et al. ([5]), it is natural, at least as a first consideration, to postulate that f_+^i(x) and f_−^i(x) do not wander, so that they converge to periodic points. Therefore P_ω and Q_ω are postulated to be finite sets of hyperbolic periodic points for f_+ and f_−, respectively. In the following, though, to simplify the presentation, we postulate that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively. Moreover, according to the same literature, the points in Q_ω are saddle points for f_−, so we further postulate that W^{u,f_−}_loc(q) and W^{s,f_−}_loc(q) for q ∈ Q_ω are one-dimensional, where their existence is guaranteed by Theorem 1.

Postulate 4. We postulate that f_+(D_i) ⊆ D_{i+1}, that the D_i's are disjoint, that P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively, and that W^{u,f_−}_loc(q) and W^{s,f_−}_loc(q) for q ∈ Q_ω are one-dimensional.

Lemma 5. f_−^{−1} ◦ f_+(D_ω) = D_ω; f_−^{−1}(D_ω ∩ (I(−1, 1))^2) = D_ω and D_ω ∩ (I(−1, 1))^2 = f_−(D_ω); and f_+(D_ω) ⊆ D_ω. P_ω ⊆ D_ω and Q_ω ⊆ D_ω.

Definition 6. W^{u,−1}(q) is the global unstable manifold at q ∈ Q_ω for f_−^{−1}, i.e., W^{u,−1}(q) = W^{u,f_−^{−1}}(q) = W^{s,f_−}(q).

Lemma 7. For any p ∈ D_ω, any accumulation point of {f_−^i(p) | i > 0} is in Q_ω.

Proof. Since p is in D_ω, there exist p_{k_i} ∈ D_{k_i} such that p = lim_{i→∞} p_{k_i}. Suppose q ∈ D_ω is the accumulation point stated in the lemma, i.e., q = lim_{j→∞} f_−^{h_j}(p). We take k_i large enough for each h_j so that p_{k_i} lies in any given neighborhood of q containing f_−^{h_j}(p). Then q = lim_{j→∞} f_−^{h_j}(p_{k_i}) = lim_{j→∞} f_−^{h_j − k_i}(s_{k_i}), where k_i is a function of h_j with k_i > h_j. Let s_{k_i} = f_−^{k_i}(p_{k_i}) ∈ D_0 and let s_0 ∈ D_0 be an accumulation point of {s_{k_i}}. Then, since f_−^{−1} is continuous, letting n_j = −h_j + k_i > 0, we get q = lim_{j→∞} f_−^{−n_j}(s_0), i.e., q ∈ Q_ω.

Lemma 8. D_ω = ∪_{q∈Q_ω} W^{u,−1}(q).

Proof. Let p be any point in D_ω. Since f_−(D_ω) ⊆ (I[−1, 1])^2, where I[−1, 1] is the interval [−1, 1], i.e., f_−(D_ω) is bounded, and f_−(D_ω) ⊆ D_ω, the sequence {f_−^n(p)} has an accumulation point q in D_ω, which is, by Lemma 7, in Q_ω. Then q is expressed as q = lim_{j→∞} f_−^{n_j}(p). Since Q_ω is a finite set of hyperbolic fixed points, q = lim_{n→∞} f_−^n(p), i.e., p ∈ W^{s,f_−}(q) = W^{u,f_−^{−1}}(q) = W^{u,−1}(q).

Since P_ω ⊆ D_ω, the next theorem holds.

Theorem 9. A point in P_ω is either a point in Q_ω or in W^{u,−1}(q) for some q ∈ Q_ω.

Please note that since q ∈ W^{u,−1}(q), the theorem statement is simply: if p ∈ P_ω then p ∈ W^{u,−1}(q) for some q ∈ Q_ω.

4

An Example of a Recognizer

To construct an SRN recognizer for {a^n b^n | n > 0}, the SRN should satisfy the conditions stated in Theorem 9 and Postulate 4, which are summarized as:
1. f_+(D_i) ⊆ D_{i+1},
2. the D_i's are disjoint,
3. P_ω and Q_ω are finite sets of hyperbolic fixed points for f_+ and f_−, respectively,
4. W^{u,f_−}_loc(q) and W^{s,f_−}_loc(q) for q ∈ Q_ω are one-dimensional, and
5. if p ∈ P_ω then p ∈ W^{u,−1}(q) for some q ∈ Q_ω.

Let us keep things as simple as possible: the first choice is to consider a single point p ∈ P_ω and a single q ∈ Q_ω, that is, f_+(p) = p and f_−(q) = q. Since p cannot be the same as q (because f_−^{−1} ◦ f_+(p) = p + w_s^{−1} · w_x · (x_+ − x_−) ≠ p), we have to find a way to let p ∈ W^{u,−1}(q). Since it is in general very hard to calculate stable or unstable manifolds from a function and its fixed point, we had better try to let W^{u,−1}(q) be a "simple" manifold. There is one more reason to do so: we have to define D_0 = {x | h(x) ≥ 0}, but if W^{u,−1}(q) is not simple, a suitable h may not exist. We have decided to let W^{u,−1}(q) be a line (if possible). Considering the function form f_−(s) = σ(w_s · s + w_x · x_−), it is not difficult to see that the line could be one of the axes or one of the bisectors of the right angles at the origin (i.e., one of the lines y = x and y = −x). We have chosen the bisector in the first (and third) quadrant (i.e., the line y = x). Accordingly, q was chosen to be the origin, and p was chosen, arbitrarily, to be (0.8, 0.8). Item 4 is satisfied by setting one of the two eigenvalues of Df_− at the origin to be greater than one and the other smaller than one. We have chosen 1/0.6 for the one and 1/μ for the other, where μ is to be set so that Items 1 and 2 are satisfied, by considering the eigenvalues of Df_+ at p. The design consideration that we have skipped so far is how to design D_0 = {x | h(x) ≥ 0}. A simple way is to make the boundary h(x) = 0 parallel to W^{u,−1}(q) for our intended q ∈ Q_ω. If we do so, by setting the largest eigenvalue of Df_− at q equal to the inverse of the eigenvalue of Df_+ at p along the normal to W^{u,−1}, we can make the points s ∈ D_0, f_− ◦ f_+(s), f_−^2 ◦ f_+^2(s), …, f_−^i ◦ f_+^i(s), …, which correspond to strings in {a^n b^n | n > 0}, reside at approximately equal distances from W^{u,−1}.
Needless to say, the points corresponding to, say, {a^{n+1} b^n | n > 0} also lie at approximately equal distances from W^{u,−1} among themselves, and this distance is different from that for {a^n b^n | n > 0}. Let f_−(x) = σ(Ax + B_0) and f_+(x) = σ(Ax + B_1). We plan to put Q_ω = {(0, 0)}, P_ω = {(0.8, 0.8)}, W^{u,−1} = {(x, y) | y = x}, the eigenvalues of the tangent map of f_−^{−1} at (0, 0) to be 1/λ = 1/0.6 and 1/μ (where the eigenvector on y = x is expanding), and the eigenvalues of the tangent map of f_+ at (0.8, 0.8) to be 1/μ and some arbitrary value. Then, considering the derivatives at (0, 0) and (0.8, 0.8) (note that σ'(x) = (1 − σ(x)^2)/2), it is easy to see that

A = 2ρ(π/4) [λ 0; 0 μ] ρ(−π/4),  1/μ = (1 − 0.8^2)μ,

where ρ(θ) is a rotation by θ. Then

A = [λ+μ  λ−μ; λ−μ  λ+μ].

Next, from σ(B_0) = (0, 0)^T and σ((1.6λ, 1.6λ)^T + B_1) = (0.8, 0.8)^T,

B_0 = (0, 0)^T,  B_1 = (σ^{−1}(0.8) − 1.6λ, σ^{−1}(0.8) − 1.6λ)^T.

These give us μ = 5/3, λ = 0.6, B_1 ≈ (1.23722, 1.23722)^T.
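The construction is concrete enough to verify numerically. The sketch below instantiates A, B_0 and B_1 with λ = 0.6 and μ = 5/3 and checks the designed fixed points q = (0, 0) of f_− and p = (0.8, 0.8) of f_+; only the helper names are ours.

```python
import numpy as np

lam, mu = 0.6, 5.0 / 3.0

def sigma(z):
    # the paper's sigmoid (tanh(z/2)), applied component-wise
    return (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

def sigma_inv(y):
    return np.log((1.0 + y) / (1.0 - y))

# A = 2 rho(pi/4) diag(lam, mu) rho(-pi/4)
A = np.array([[lam + mu, lam - mu],
              [lam - mu, lam + mu]])
B0 = np.zeros(2)
B1 = np.full(2, sigma_inv(0.8) - 1.6 * lam)   # approx (1.23722, 1.23722)^T

f_minus = lambda s: sigma(A @ s + B0)   # reading character b: count down
f_plus  = lambda s: sigma(A @ s + B1)   # reading character a: count up

def after_string(n_a, n_b, s0):
    # state after reading a^{n_a} b^{n_b} starting from s0
    s = s0
    for _ in range(n_a):
        s = f_plus(s)
    for _ in range(n_b):
        s = f_minus(s)
    return s
```

Plotting `after_string(n, n, p)`, `after_string(n + 1, n, p)` and `after_string(n, n + 1, p)` for p = (0.5, 0.95) reproduces the three point sets of Fig. 2.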

Fig. 1. The vector field representation of f_+ (left) and f_− (right)

Fig. 2. {f_−^n ◦ f_+^n(p) | n ≥ 1} (upper), {f_−^n ◦ f_+^{n+1}(p) | n ≥ 1} (lower left), and {f_−^{n+1} ◦ f_+^n(p) | n ≥ 1} (lower right), where p = (0.5, 0.95)

In Fig. 1, the left image shows the vector field of f_+, where each arrow starts at x and ends at f_+(x), and the right image shows the vector field of f_−. In Fig. 2, the upper plot shows the points corresponding to strings in {a^n b^n | n > 0}, the lower-left plot those for {a^{n+1} b^n | n > 0}, and the lower-right plot those for {a^n b^{n+1} | n > 0}. The initial point in Fig. 2 was set to p = (0.5, 0.95). All of the plots are for n = 1 to n = 40, and as n grows the points cluster, so we can say that they stay in narrow stripes, i.e. in D_n, for every n.

5

Discussion

We obtained a necessary condition for an SRN to implement a recognizer of the language {a^n b^n | n > 0} by analyzing its behavior from the viewpoint of discrete dynamical systems. The condition supposes that the D_i's are disjoint, f_+(D_i) ⊆ D_{i+1}, and Q_ω is finite. It suggests the possibility of such an implementation, and in fact we have successfully built a recognizer for the language, thereby showing that the learning problem for this language has at least one solution. The instability of any solution under learning is suggested to be (but not proved to be) due to the necessity of P_ω lying on an unstable manifold W^{u,−1}(q) for q ∈ Q_ω. Since P_ω is attracting in the above example, f_+^n(s_0) for s_0 ∈ D_0 comes exponentially close to P_ω as n grows. Under even a small fluctuation of P_ω, since f_+^n(s_0), too, is close to W^{u,−1}(q), the point f_−^n(f_+^n(s_0)), which should be in D_0, is disturbed greatly. This means that even if we are close to a solution, a small fluctuation of P_ω caused by new training data may easily push f_−^n(f_+^n(s_0)) out of D_0. Since Rodriguez et al. [14] showed that languages that do not belong to the context-free class could be learned to some degree, we have to study the discrepancies further. The instability of grammar learning by SRNs shown above may not be seen in our natural language learning, which suggests that the SRN might not be an appropriate model of language learning.

References

1. Bodén, M., Wiles, J., Tonkes, B., Blair, A.: Learning to predict a context-free language: analysis of dynamics in recurrent hidden units. In: Proc. ICANN 1999 (Artificial Neural Networks), vol. 1, pp. 359–364 (1999)
2. Bodén, M., Wiles, J.: Context-free and context-sensitive dynamics in recurrent neural networks. Connection Science 12(3/4), 197–210 (2000)
3. Bodén, M., Blair, A.: Learning the dynamics of embedded clauses. Applied Intelligence: Special Issue on Natural Language and Machine Learning 19(1/2), 51–63 (2003)
4. Casey, M.: Correction to proof that recurrent neural networks can robustly recognize only regular languages. Neural Computation 10, 1067–1069 (1998)
5. Chalup, S.K., Blair, A.D.: Incremental training of first order recurrent neural networks to predict a context-sensitive language. Neural Networks 16(7), 955–972 (2003)

6. Elman, J.L.: Distributed representations, simple recurrent networks and grammatical structure. Machine Learning 7, 195–225 (1991)
7. Elman, J.L.: Language as a dynamical system. In: Mind as Motion: Explorations in the Dynamics of Cognition, pp. 195–225. MIT Press, Cambridge
8. Gers, F.A., Schmidhuber, J.: LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks 12(6), 1333–1340 (2001)
9. Guckenheimer, J., Holmes, P.: Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer, Heidelberg (corr. 5th printing, 1997)
10. Hopcroft, J.E., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading (1979)
11. Katok, A., Hasselblatt, B.: Introduction to the Modern Theory of Dynamical Systems. Cambridge University Press, Cambridge (1996)
12. Maass, W., Orponen, P.: On the effect of analog noise in discrete-time analog computations. Neural Computation 10, 1071–1095 (1998)
13. Rodriguez, P., Wiles, J., Elman, J.L.: A recurrent neural network that learns to count. Connection Science 11, 5–40 (1999)
14. Rodriguez, P.: Simple recurrent networks learn context-free and context-sensitive languages by counting. Neural Computation 13(9), 2093–2118 (2001)
15. Schmidhuber, J., Gers, F., Eck, D.: Learning nonregular languages: a comparison of simple recurrent networks and LSTM. Neural Computation 14(9), 2039–2041 (2002)
16. Siegelmann, H.T.: Neural Networks and Analog Computation: Beyond the Turing Limit. Birkhäuser (1999)
17. Wiles, J., Blair, A.D., Bodén, M.: Representation beyond finite states: alternatives to push-down automata. In: A Field Guide to Dynamical Recurrent Networks

Unbiased Likelihood Backpropagation Learning

Masashi Sekino and Katsumi Nitta

Tokyo Institute of Technology, Japan

Abstract. Error backpropagation is one of the popular methods for training an artificial neural network. When error backpropagation is used for training an artificial neural network, overfitting occurs in the latter half of the training. This paper provides an explanation of why overfitting occurs, within the model selection framework. The explanation leads to a new method for training an artificial neural network, Unbiased Likelihood Backpropagation Learning. Several experimental results are shown.

1

Introduction

An artificial neural network is one of the models for function approximation; it can approximate an arbitrary function when the number of basis functions is large. Error backpropagation learning [1], a famous method for training an artificial neural network, is gradient descent with the squared error on the learning data as the target function. Therefore, error backpropagation learning can obtain a local optimum while monotonically decreasing the error. Here, although the error on the learning data decreases monotonically, the error on test data increases in the latter half of training. This phenomenon is called overfitting. Early stopping is one of the methods for preventing overfitting. This method stops the training when an estimator of the generalization error no longer decreases; for example, the technique of stopping the training when the error on hold-out data no longer decreases is often applied. However, early stopping basically minimizes the error on the learning data, so there is no guarantee of obtaining the optimal parameter that minimizes the estimator of the generalization error. When the parameters of the basis functions (the model parameters) are fixed, an artificial neural network becomes a linear regression model. If a regularization parameter is introduced to assure the regularity of this linear regression model, the artificial neural network becomes a set of regular linear regression models. The reason why an artificial neural network tends to overfit is that maximum likelihood estimation with respect to the model parameters amounts to model selection among regular linear regression models based on the empirical likelihood. In this paper, we propose unbiased likelihood backpropagation learning, which is gradient descent that modifies the model parameters with the unbiased likelihood (an information criterion) as the target function. It is expected
M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 446–455, 2008.
© Springer-Verlag Berlin Heidelberg 2008

that the proposed method has better approximation performance, because it explicitly minimizes an estimator of the generalization error. In the following, Section 2 explains statistical learning and maximum likelihood estimation, and Section 3 briefly explains information criteria. Next, in Section 4, we explain artificial neural networks, regularized maximum likelihood estimation, and why error backpropagation learning causes overfitting. The proposed method is then described in Section 5. We show the effectiveness of the proposed method by applying it to the DELVE data set [4] in Section 6. Finally, we conclude the paper in Section 7.
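For comparison with the proposed method, hold-out early stopping as described in the introduction can be sketched as follows; the function names and the demonstration objective are illustrative assumptions.

```python
import numpy as np

def train_with_early_stopping(grad, holdout_error, theta0,
                              lr=0.01, patience=20, max_steps=10_000):
    """Gradient descent that stops when the hold-out error estimate
    has not improved for `patience` consecutive steps."""
    theta = np.array(theta0, dtype=float)
    best_err, best_theta, waited = np.inf, theta.copy(), 0
    for _ in range(max_steps):
        theta = theta - lr * grad(theta)   # descend the training error
        err = holdout_error(theta)         # estimate generalization error
        if err < best_err:
            best_err, best_theta, waited = err, theta.copy(), 0
        else:
            waited += 1
            if waited >= patience:
                break                      # hold-out error stopped improving
    return best_theta
```

Note that the returned parameter is still a point on the training-error descent path; nothing forces it to minimize the generalization-error estimator itself, which is the limitation the paper addresses.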

2

Statistical Learning

Statistical learning aims to construct an optimal approximation p̂(x) of a true distribution q(x) from a set of hypotheses M ≡ {p(x|θ) | θ ∈ Θ}, using learning data D ≡ {x_n | n = 1, …, N} obtained from q(x). M is called the model and the approximation p̂(x) is called the estimation. When we want to denote clearly that the estimation p̂(x) is constructed using the learning data D, we write p̂(x|D). The Kullback-Leibler divergence

D(q‖p) ≡ ∫ q(x) log (q(x)/p(x)) dx    (1)

is used as the distance from q(x) to p(x). The probability density p(x) is called the likelihood, and in particular the likelihood of the estimation p̂(x) on the learning data D,

p̂(D|D) ≡ ∏_{n=1}^{N} p̂(x_n|D),    (2)

is called empirical likelihood. Sample mean of the log-likelihood: N 1 1 log p(D) = log p(xn ) N N n=1

asymptotically converges in probability to the mean log-likelihood: Eq(x) log p(x) ≡ q(x) log p(x)dx

(3)

(4)

according to the law of large numbers, where Eq(x) denotes an expectation under q(x). Because Kullback-Leibler divergence can be decomposed as D(q||p) = Eq(x) log q(x) − Eq(x) log p(x) , (5) the maximization of the mean log-likelihood is equal to the minimization of Kullback-Leibler divergence. Therefore, statistical learning methods such as maximum likelihood estimation, maximum a posteriori estimation and Bayesian estimation are based on the likelihood.
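As a quick numerical illustration of the convergence of (3) to (4), the sketch below compares the sample mean of the log-likelihood with the analytic mean log-likelihood. The true distribution q = N(0, 1) and the hypothesis p = N(0.5, 1) are our own toy choices, not taken from the paper.

```python
import math
import random

random.seed(0)
N = 200_000
mu = 0.5  # mean of the hypothetical model density p = N(mu, 1); q is N(0, 1)

def log_p(x, mu):
    # log-density of N(mu, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

# sample mean of the log-likelihood, Eq. (3)
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
sample_mean = sum(log_p(x, mu) for x in xs) / N

# analytic mean log-likelihood, Eq. (4):
# E_q[log p] = -(1/2) log(2*pi) - (1/2)(1 + mu^2), since E_q[(x - mu)^2] = 1 + mu^2
analytic = -0.5 * math.log(2 * math.pi) - 0.5 * (1 + mu ** 2)
```

With two hundred thousand samples the two quantities agree to a couple of decimal places, which is all the law of large numbers promises here.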


M. Sekino and K. Nitta

Maximum Likelihood Estimation. The maximum likelihood estimation p̂_ML(x) is the hypothesis p(x|θ̂_ML) given by the maximizer θ̂_ML of the likelihood p(D|θ):

    p̂_ML(x) ≡ p(x | θ̂_ML),                                              (6)
    θ̂_ML ≡ argmax_θ p(D|θ).                                             (7)

3 Information Criterion and Model Selection

3.1 Information Criterion

Because the sample mean of the log-likelihood asymptotically converges in probability to the mean log-likelihood, statistical learning methods are based on the likelihood. However, because the learning data D is a finite set in practice, the empirical likelihood p̂(D|D) contains a bias. This bias b(p̂) is defined as

    b(p̂) ≡ E_{q(D)}[log p̂(D)] − N · E_{q(x)}[log p̂(x|D)],               (8)

where q(D) ≡ ∏_{n=1}^{N} q(x_n). Because of this bias, it is known that the model most overfitted to the learning data is selected when a regular model is chosen from a candidate set of regular models based on the empirical likelihood. To address this problem, many information criteria have been proposed that evaluate learning results by correcting the bias. Generally, an information criterion has the form

    IC(p̂, D) ≡ −2 log p̂(D) + 2 b̂(p̂),                                    (9)

where b̂(p̂) is an estimator of the bias b(p̂). Corrected AIC (cAIC) [3] accurately estimates and corrects the bias of the empirical log-likelihood as

    b̂_cAIC(p̂_ML) = N(M + 1) / (N − M − 2),                              (10)

under the assumptions that the learning model is a normal linear regression model, the true model is included in the learning model, and the estimation is constructed by maximum likelihood estimation. Here, M is the number of explanatory variables and dim θ = M + 1 (the additional 1 is the estimator of the variance). cAIC asymptotically equals AIC [2]: b̂_cAIC(p̂_ML) → b̂_AIC(p̂_ML) as N → ∞.
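The bias estimator (10) is simple enough to code directly. The helper below is a minimal sketch; the convention of returning an infinite bias when the denominator N − M − 2 is not positive follows a footnote later in the paper, and the large-N check reflects the stated convergence to the AIC bias M + 1.

```python
def b_caic(N, M):
    # corrected-AIC bias estimator, Eq. (10)
    # treated as infinite when the denominator is not positive (footnote in Sect. 4)
    if N - M - 2 <= 0:
        return float("inf")
    return N * (M + 1) / (N - M - 2)
```

For small N the cAIC bias exceeds the AIC bias M + 1, which is exactly the finite-sample correction the criterion exists to provide.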

4 Artificial Neural Network

In a regression problem, we want to estimate the true function using learning data D = {(x_n, y_n) | n = 1, ..., N}, where x_n ∈ R^d is an input and y_n ∈ R is the corresponding output. An artificial neural network is defined as

    f(x; θ_M) = ∑_{i=1}^{M} a_i φ(x; ϕ_i),                               (11)

where a_i (i = 1, ..., M) are regression coefficients and φ(x; ϕ_i) are basis functions parameterized by ϕ_i. The model parameter of this neural network is θ_M ≡ (ϕ_1^T, ..., ϕ_M^T)^T (T denotes transpose). The design matrix X, coefficient vector a, and output vector y are defined as

    X_{ij} ≡ φ(x_i; ϕ_j),                                                (12)
    a ≡ (a_1, ..., a_M)^T,                                               (13)
    y ≡ (y_1, ..., y_N)^T.                                               (14)

When the model parameter θ_M is fixed, the artificial neural network (11) becomes a linear regression model parameterized by a. The normal linear regression model

    p(y|x; θ_M) ≡ (1/√(2πσ²)) exp( −(y − f(x; θ_M))² / (2σ²) )           (15)

is usually used when the noise included in the output y is assumed to follow a normal distribution. In this paper, we call the parameter θ_R ≡ (a^T, σ²)^T the regular parameter.

4.1 Regularized Maximum Likelihood Estimation

To assure the regularity of the normal linear regression model (15), regularized maximum likelihood estimation is usually used for estimating θ_R = (a^T, σ²)^T. Regularized maximum likelihood estimation maximizes the regularized log-likelihood

    log p(D) − exp(λ) ‖a‖²,                                              (16)

where λ ∈ R is called the regularization parameter. The regularized maximum likelihood estimators of the coefficient vector a and the variance σ² are

    â = L y                                                              (17)

and

    σ̂² = (1/N) ‖y − X â‖²,                                               (18)

where

    L ≡ (X^T X + exp(λ) I)^{−1} X^T                                      (19)

and I is the identity matrix. The effective number of regression coefficients is

    M_eff = tr(X L).                                                     (20)

Therefore, M_eff is used in place of the number of explanatory variables M in (10).¹

¹ When the denominator of (10) is not positive, we use b̂_cAIC(p̂) = ∞.
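The estimators (17)-(20) translate directly into code. The sketch below uses a synthetic design matrix and targets of our own choosing, with an arbitrary regularization value λ = −2; the only properties exercised are that the estimators are well defined and that 0 < M_eff < M whenever exp(λ) > 0.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 30, 5
X = rng.normal(size=(N, M))                     # design matrix, Eq. (12) (synthetic)
y = X @ rng.normal(size=M) + 0.1 * rng.normal(size=N)

lam = -2.0                                      # assumed regularization parameter
L = np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)   # Eq. (19)
a_hat = L @ y                                   # Eq. (17)
sigma2_hat = np.mean((y - X @ a_hat) ** 2)      # Eq. (18)
M_eff = np.trace(X @ L)                         # Eq. (20)
```

Because each singular direction of X contributes a factor strictly between 0 and 1 to the trace, M_eff is always below the nominal parameter count M, which is what makes it a sensible substitute for M in (10).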

4.2 Overfitting of the Error Backpropagation Learning

Error backpropagation learning is usually used for training an artificial neural network. This method is equivalent to gradient descent based on the likelihood of the linear regression model (15), because the target function is the squared error on the learning data. In what follows, we assume that the noise follows a normal distribution and that the normal linear regression model (15) is regular for all θ_M. Then, an artificial neural network becomes a set of regular linear regression models.

For simplicity, consider a set of regular models H ≡ {M(θ_M) | θ_M ∈ Θ_M}, where M(θ_M) ≡ {p(x|θ_R; θ_M) | θ_R ∈ Θ_R} is a regular model. We can define a new model M_C ≡ {p̂(x; θ_M) | θ_M ∈ Θ_M}, where p̂(x; θ_M) is the estimation of M(θ_M). Concerning the model parameter θ_M, statistical learning methods construct the estimation of the model M_C based on the empirical likelihood p̂(D; θ_M). For example, maximum likelihood estimation selects the maximizer θ̂_M^ML of p̂(D; θ_M):

    p̂_ML(x) = p̂(x; θ̂_M^ML),                                             (21)
    θ̂_M^ML ≡ argmax_{θ_M} p̂(D; θ_M).                                    (22)

Thus, maximum likelihood estimation with respect to the model parameter θ_M is model selection from H based on the empirical likelihood. Because error backpropagation realizes maximum likelihood estimation by gradient descent, the model M(θ_M) gradually becomes overfitted in the latter half of training. Therefore, we propose a learning method for the model parameter θ_M based on the unbiased likelihood, i.e., the empirical likelihood corrected by an appropriate information criterion.

5 Unbiased Likelihood Backpropagation Learning

5.1 Unbiased Likelihood

Using an information criterion IC(p̂, D), we define the unbiased likelihood as

    p̂_ub(D) ≡ exp( −(1/2) IC(p̂, D) ).                                   (23)

This unbiased likelihood satisfies

    (1/N) E_{q(D)}[log p̂_ub(D)] = E_{q(x)}[log p̂(x)]                     (24)

when the assumptions of the information criterion are satisfied.
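In log form, (23) combined with (9) says nothing more than log p̂_ub(D) = log p̂(D) − b̂(p̂): the unbiased log-likelihood is the empirical log-likelihood with the bias estimate subtracted. The tiny helpers below, exercised with made-up numbers in the test, record that identity.

```python
def info_criterion(log_emp_lik, b_hat):
    # Eq. (9): IC = -2 log p(D) + 2 b_hat
    return -2.0 * log_emp_lik + 2.0 * b_hat

def log_unbiased_likelihood(log_emp_lik, b_hat):
    # Eq. (23) in log form: log p_ub(D) = -IC/2 = log p(D) - b_hat
    return -0.5 * info_criterion(log_emp_lik, b_hat)
```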

5.2 Regular Hierarchical Model

In this paper, we consider a certain type of hierarchical model, which we call a regular hierarchical model, defined as a set of regular models. A concise definition is as follows.

Regular Hierarchical Model
– H ≡ {M(θ_M) | θ_M ∈ Θ_M}
– M(θ_M) ≡ {p(x|θ_R; θ_M) | θ_R ∈ Θ_R}
– M(θ_M) is a regular model with respect to θ_R.

An artificial neural network is one of the regular hierarchical models. We also define unbiased maximum likelihood estimation as follows.

Unbiased Maximum Likelihood Estimation. The unbiased maximum likelihood estimation p̂_ubML(x) is the estimation p̂(x; θ̂_M^ubML) given by the maximizer θ̂_M^ubML of the unbiased likelihood p̂_ub(D; θ_M):

    p̂_ubML(x) ≡ p̂(x; θ̂_M^ubML),                                         (25)
    θ̂_M^ubML ≡ argmax_{θ_M} p̂_ub(D; θ_M).                               (26)

5.3 Unbiased Likelihood Backpropagation Learning

The partial differential of the log unbiased likelihood is

    ∂/∂θ_M log p̂_ub(D; θ_M) = ∂/∂θ_M log p̂(D; θ_M) − ∂/∂θ_M b̂(p̂; θ_M).  (27)

We define the unbiased likelihood estimation based on the gradient method with this partial differential as unbiased likelihood backpropagation learning.

5.4 Unbiased Likelihood Backpropagation Learning for an Artificial Neural Network

In this paper, we derive the unbiased likelihood backpropagation learning for an artificial neural network whose empirical-likelihood bias is estimated by cAIC (10). The partial differential of the empirical likelihood with respect to θ_M, which is the first term of (27), is

    ∂/∂θ_M log p̂(D; θ_M) = (1/σ̂²) ( ∂(Xâ)/∂θ_M )^T (y − Xâ),            (28)

    ∂(Xâ)/∂θ_M = ( (∂X/∂θ_M) L + L^T (∂X^T/∂θ_M) ) y − 2 L^T X^T (∂X/∂θ_M) L y.   (29)

The partial differential of cAIC (10) with respect to θ_M, which is the second term of (27), is

    ∂/∂θ_M b̂_cAIC(p̂; θ_M) = ( N(N − 1) / (N − M − 2)² ) ∂M_eff/∂θ_M,     (30)

    ∂M_eff/∂θ_M = 2 tr( (∂X/∂θ_M) L − L^T X^T (∂X/∂θ_M) L ).             (31)

We can also obtain the partial differentials with respect to λ:

    ∂/∂λ log p̂(D; θ_M) = −(1/σ̂²) y^T L^T L (y − Xâ) exp(λ),              (32)

    ∂M_eff/∂λ = −tr(L^T L) exp(λ).                                       (33)

Having obtained the partial differentials of the unbiased likelihood with respect to θ_M and λ, we can now apply the unbiased likelihood backpropagation learning to an artificial neural network.
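Derivatives such as (33) are easy to sanity-check numerically. The sketch below, on a synthetic design matrix of our own choosing, compares (33) with a central finite difference of M_eff(λ); the agreement is a check of the formula, not part of the authors' method.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 4
X = rng.normal(size=(N, M))          # fixed synthetic design matrix

def L_of(lam):
    # Eq. (19)
    return np.linalg.solve(X.T @ X + np.exp(lam) * np.eye(M), X.T)

def m_eff(lam):
    # Eq. (20)
    return np.trace(X @ L_of(lam))

lam = 0.3
L = L_of(lam)
analytic = -np.trace(L.T @ L) * np.exp(lam)                   # Eq. (33)
eps = 1e-6
numeric = (m_eff(lam + eps) - m_eff(lam - eps)) / (2 * eps)   # central difference
```

The derivative is negative, as expected: increasing the regularization strength exp(λ) always reduces the effective number of coefficients.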

6 Application to Kernel Regression Model

The kernel regression model is one of the artificial neural networks. The kernel regression model using Gaussian kernels has a single model parameter, the width of the Gaussian kernels, which makes the behavior of learning methods easy to understand. Therefore, the kernel regression model with Gaussian kernels is used in the following simulations. In the implementation of the gradient descent method, we adopted a quasi-Newton method with the BFGS update for estimating the Hessian matrix, and golden section search for determining the step length.

6.1 Kernel Regression Model

The kernel regression model is

    f(x; θ_M) = ∑_{n=1}^{N} a_n K(x, x_n; θ_M),                          (34)

where K(x, x_n; θ_M) are kernel functions parameterized by the model parameter θ_M. The Gaussian kernel

    K(x, x_n; c) = exp( −‖x − x_n‖² / (2c²) )                            (35)

is used in the following simulations, where c is the parameter that determines the width of the Gaussian kernel. The model parameter is θ_M = c.
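Putting (35), (17)-(20), (10), and (23) together gives a one-pass evaluation of the empirical and unbiased log-likelihoods for a fixed kernel width. The data below are a toy sine curve, not the DELVE sets, and the width c and regularization λ are arbitrary assumed values; a full learning run would instead adjust c (and λ) by gradient ascent on the unbiased likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 40
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=N)   # toy data (our own, not DELVE)

c, lam = 0.5, -2.0                               # assumed kernel width and regularization
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * c ** 2))   # Eq. (35), design matrix
L = np.linalg.solve(K.T @ K + np.exp(lam) * np.eye(N), K.T)    # Eq. (19) with X = K
a_hat = L @ y                                    # Eq. (17)
sigma2 = np.mean((y - K @ a_hat) ** 2)           # Eq. (18)
M_eff = np.trace(K @ L)                          # Eq. (20)

# empirical log-likelihood of the fitted normal model, then the cAIC correction
log_emp = -0.5 * N * (np.log(2.0 * np.pi * sigma2) + 1.0)
b_caic = N * (M_eff + 1.0) / (N - M_eff - 2.0)   # Eq. (10) with M = M_eff
log_unbiased = log_emp - b_caic                  # log of Eq. (23)
```

Since the bias estimate is positive here, the unbiased log-likelihood always sits below the empirical one, which is precisely the optimism correction the method optimizes.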

[Fig. 1. An example of the transition of the mean log empirical likelihood (emp), mean log test likelihood (true), and mean log unbiased likelihood (cAIC): (a) empirical likelihood backpropagation; (b) unbiased likelihood backpropagation. Both panels plot log-likelihood against training iterations.]

6.2 Simulations

For the purpose of evaluation, empirical likelihood backpropagation learning and unbiased likelihood backpropagation learning are applied to the 8-dimensional-input, 1-dimensional-output regression problems of the "kin-family" and "pumadyn-family" in the DELVE data set [4]. Each data set has 4 combinations of fairly linear (f) or nonlinear (n) and medium noise (m) or high noise (h). We use 128 samples as learning data. 50 templates are chosen randomly, and kernel functions are placed on the templates.

Fig. 1 shows an example of the transition of the mean log empirical likelihood, mean log test likelihood, and mean log unbiased likelihood (cAIC). The test likelihood of empirical likelihood backpropagation learning (a) decreases in the latter half of training, i.e., overfitting occurs. In contrast, unbiased likelihood backpropagation learning (b) keeps the test likelihood close to the unbiased likelihood (cAIC), and overfitting does not occur.

Table 1 shows means and standard deviations of the mean log test likelihood over 100 experiments. Bold numbers indicate significantly better results by the t-test at the 1% significance level.

Table 1. Means and standard deviations of the mean log test likelihood over 100 experiments. Bold numbers indicate significantly better results by the t-test at the 1% significance level.

Data          Empirical BP        Unbiased BP
kin-8fm        2.323 ± 0.346       2.531 ± 0.140
kin-8fh        1.108 ± 0.240       1.599 ± 0.100
kin-8nm       −0.394 ± 0.415       0.078 ± 0.287
kin-8nh       −0.435 ± 0.236       0.064 ± 0.048
pumadyn-8fm   −2.235 ± 0.249      −1.708 ± 0.056
pumadyn-8fh   −3.168 ± 0.187      −2.626 ± 0.021
pumadyn-8nm   −3.012 ± 0.238      −2.762 ± 0.116
pumadyn-8nh   −3.287 ± 0.255      −2.934 ± 0.055

6.3 Discussion

The better results of unbiased likelihood backpropagation learning in Table 1 are attributed to the fact that the method maximizes the true likelihood on average, because the mean of the log unbiased likelihood equals the mean log true likelihood (see (24)). The smaller standard deviations of the test likelihood for unbiased likelihood backpropagation learning, compared with empirical likelihood backpropagation learning, are assumed to be due to the fact that the empirical likelihood prefers models with larger degrees of freedom, whereas the unbiased likelihood prefers models with an appropriate degree of freedom. Therefore, the variance of the estimation obtained by unbiased likelihood backpropagation learning is smaller than that of empirical likelihood backpropagation learning.

7 Conclusion

In this paper, we explained why overfitting occurs using the model selection framework. We proposed unbiased likelihood backpropagation learning, a gradient descent method that modifies the model parameter with the unbiased likelihood (an information criterion) as the target function, and we confirmed the effectiveness of the proposed method by applying it to the DELVE data set.

References

1. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, Cambridge (1987)
2. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723 (1974)
3. Sugiura, N.: Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics A7, 13–26 (1978)
4. Rasmussen, C.E., Neal, R.M., Hinton, G.E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., Tibshirani, R.: The DELVE manual (1996), http://www.cs.toronto.edu/~delve/

The Local True Weight Decay Recursive Least Square Algorithm

Chi Sing Leung, Kwok-Wo Wong, and Yong Xu
Department of Electronic Engineering, City University of Hong Kong, Hong Kong
[email protected]

Abstract. The true weight decay recursive least square (TWDRLS) algorithm is an efficient fast online training algorithm for feedforward neural networks. However, its computational and space complexities are very large. This paper first presents a set of more compact TWDRLS equations. Afterwards, we propose a local version of TWDRLS to reduce the computational and space complexities. The effectiveness of this local version is demonstrated by simulations. Our analysis shows that the computational and space complexities of the local TWDRLS are much smaller than those of the global TWDRLS.

1 Introduction

Training multilayered feedforward neural networks (MFNNs) using recursive least square (RLS) algorithms has attracted much attention in the literature [1, 2, 3, 4], because RLS algorithms are efficient second-order gradient descent training methods. They converge faster than first-order methods such as the backpropagation (BP) algorithm, and fewer parameters need to be tuned during training. Recently, Leung et al. found that the standard RLS algorithm has an implicit weight decay effect [2]. However, this decay effect is not substantial, so the generalization ability is not very good. A true weight decay RLS (TWDRLS) algorithm was then proposed [5]. However, the computational complexity of TWDRLS is O(M³) per iteration, where M is the number of weights. It is therefore necessary to reduce the complexity of TWDRLS so that it can be used for large-scale practical problems. The main goal of this paper is to reduce both the computational complexity and the storage requirement. In Section 2, we derive a set of concise equations for TWDRLS and discuss them. We then describe a local TWDRLS algorithm in Section 3. Simulation results are presented in Section 4. We summarize our findings in Section 5.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 456–465, 2008. © Springer-Verlag Berlin Heidelberg 2008

2 TWDRLS Algorithm

A general MFNN is composed of L layers, indexed by 1, ..., L from input to output. There are n_l neurons in layer l. The output of the i-th neuron in the


l-th layer is denoted by y_{i,l}. That is, the i-th neuron of the output layer is represented by y_{i,L}, while the i-th input of the network is represented by y_{i,1}. The connection weight from the j-th neuron of layer l−1 to the i-th neuron of layer l is denoted by w_{i,j,l}. Biases are implemented as weights and are specified by w_{i,(n_{l−1}+1),l}, where l = 2, ..., L. Hence, the number of weights in an MFNN is given by M = ∑_{l=2}^{L} (n_{l−1} + 1) n_l. In the standard RLS algorithm, we arrange all weights as an M-dimensional vector,

    w = ( w_{1,1,2}, ..., w_{1,(n_1+1),2}, ..., w_{n_L,1,L}, ..., w_{n_L,(n_{L−1}+1),L} )^T.   (1)

The energy function up to the t-th training sample is given by

    E(w) = ∑_{τ=1}^{t} ‖d(τ) − h(w, x(τ))‖² + [w − ŵ(0)]^T P^{−1}(0) [w − ŵ(0)],   (2)

where x(τ) is an n_1-dimensional input vector, d(τ) is the desired n_L-dimensional output, and h(w, x(τ)) is the nonlinear function describing the network. The matrix P(0) is the error covariance matrix and is usually set to δ^{−1} I_{M×M}, where I_{M×M} is the M × M identity matrix. The minimization of (2) leads to the standard RLS equations [1, 3, 6, 7]:

    K(t) = P(t−1) H^T(t) [ I_{n_L×n_L} + H(t) P(t−1) H^T(t) ]^{−1},      (3)
    P(t) = P(t−1) − K(t) H(t) P(t−1),                                    (4)
    ŵ(t) = ŵ(t−1) + K(t) [ d(t) − h(ŵ(t−1), x(t)) ],                     (5)

where H(t) = ∂h(w, x(t))/∂w |_{w=ŵ(t−1)} is the n_L × M gradient matrix of h(w, x(t)), and K(t) is the M × n_L Kalman gain matrix of classical control theory. The matrix P(t) is the error covariance matrix; it is symmetric positive definite.

As mentioned in [5], the standard RLS algorithm has only a limited weight decay effect, of about δ/t_o per training iteration, where t_o is the number of training iterations. The decay effect decreases as the number of training iterations increases. Hence, the more training presentations take place, the less smoothing effect there is in the data-fitting process. A true weight decay RLS algorithm, namely TWDRLS, was therefore proposed [5], in which a decay term is added to the original energy function. The new energy function is given by

    E(w) = ∑_{τ=1}^{t} ‖d(τ) − h(w, x(τ))‖² + α w^T w + [w − ŵ(0)]^T P^{−1}(0) [w − ŵ(0)],   (6)

where α is a regularization parameter. The gradient of E(w) is given by

    ∂E(w)/∂w ≈ P^{−1}(0) [w − ŵ(0)] + αw − ∑_{τ=1}^{t} H^T(τ) [ d(τ) − H(τ)w − ξ(τ) ].   (7)


In the above, we linearize h(w, x(τ)) around the estimate ŵ(τ−1); that is, h(w, x(τ)) = H(τ)w + ξ(τ), where ξ(τ) = h(ŵ(τ−1), x(τ)) − H(τ)ŵ(τ−1) + ρ(τ). To minimize the energy function, we set the gradient to zero. Hence, we have

    ŵ(t) = P(t) r(t),                                                    (8)

where

    P^{−1}(t) = P^{−1}(t−1) + H^T(t) H(t) + α I_{M×M},                    (9)
    r(t) = r(t−1) + H^T(t) [ d(t) − ξ(t) ].                              (10)

Furthermore, we define P*(t−1) ≜ [I_{M×M} + αP(t−1)]^{−1} P(t−1), so that P*^{−1}(t−1) = P^{−1}(t−1) + α I_{M×M}. With the matrix inversion lemma [7] in the recursive calculation of P(t), (8) becomes

    P*(t−1) = [ I_{M×M} + αP(t−1) ]^{−1} P(t−1),                          (11)
    K(t) = P*(t−1) H^T(t) [ I_{n_L×n_L} + H(t) P*(t−1) H^T(t) ]^{−1},     (12)
    P(t) = P*(t−1) − K(t) H(t) P*(t−1),                                  (13)
    ŵ(t) = ŵ(t−1) − αP(t)ŵ(t−1) + K(t) [ d(t) − h(ŵ(t−1), x(t)) ].       (14)

Equations (11)-(14) are the general global TWDRLS equations. They are more compact than the equations presented in [5]; in particular, the weight updating equation in [5] (i.e., (14) here) is more complicated. When the regularization parameter α is set to zero, the decay term αw^T w vanishes and (11)-(14) reduce to the standard RLS equations.

The decay effect can be understood from the decay term αw^T w in the energy function (6). As mentioned in [5], the decay effect per training iteration equals αw^T w, which does not decrease with the number of training iterations. The energy function of TWDRLS is the same as that of batch-mode weight decay methods, so existing heuristic methods [8, 9] for choosing the value of α can be used for TWDRLS.

We can also explain the weight decay effect from the recursive equations (11)-(14). The main difference between the standard RLS equations and the TWDRLS equations is the decay term −αP(t)ŵ(t−1) in (14). This term guarantees that the magnitude of the weight vector decays by an amount proportional to αP(t). Since P(t) is positive definite, the magnitude of the weight vector does not become too large, and so the generalization ability of the trained networks is better [8, 9].

A drawback of TWDRLS is the requirement of computing the inverse of the M-dimensional matrix (I_{M×M} + αP(t−1)). This complexity is O(M³), much larger than the O(M²) of standard RLS. Hence, the TWDRLS algorithm is computationally prohibitive even for a network of moderate size. In the next section, a local version of the TWDRLS algorithm is proposed to solve this complexity problem.
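The recursion (11)-(14) can be sketched for the simplest possible case, a scalar-output linear model h(w, x) = w^T x, so that H(t) is exact and the linearization offset vanishes. The target weights, noise level, and α = 0.1 are our own toy choices. Setting α = 0 recovers the standard RLS equations (3)-(5), and a positive α visibly shrinks the learned weight vector.

```python
import numpy as np

def train(alpha, T=300, seed=4):
    rng = np.random.default_rng(seed)
    M = 3
    w_true = np.array([2.0, -3.0, 1.5])   # toy target weights (assumed)
    P = 10.0 * np.eye(M)                  # P(0) = delta^{-1} I with delta = 0.1
    w = np.zeros(M)
    I = np.eye(M)
    for _ in range(T):
        x = rng.normal(size=M)
        d = w_true @ x + 0.1 * rng.normal()
        H = x[None, :]                    # exact gradient of h(w, x) = w^T x  (1 x M)
        Ps = np.linalg.solve(I + alpha * P, P)            # Eq. (11)
        K = Ps @ H.T / (1.0 + (H @ Ps @ H.T).item())      # Eq. (12) with n_L = 1
        P = Ps - K @ H @ Ps                               # Eq. (13)
        w = w - alpha * (P @ w) + K.ravel() * (d - (H @ w).item())  # Eq. (14)
    return w

w_rls = train(alpha=0.0)   # alpha = 0: standard RLS, Eqs. (3)-(5)
w_twd = train(alpha=0.1)   # true weight decay shrinks the solution
```

With α = 0 the estimate essentially matches the least-squares solution, while α = 0.1 pulls every coordinate toward zero, illustrating the decay term −αP(t)ŵ(t−1) in (14).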

3 Localization of the TWDRLS Algorithm

To localize the TWDRLS algorithm, we first divide the weight vector into several small vectors, where w_{i,l} = ( w_{i,1,l}, ..., w_{i,(n_{l−1}+1),l} )^T denotes the weights connecting all the neurons of layer l−1 to the i-th neuron of layer l. We consider the estimation of each weight vector separately. When we consider the i-th neuron in layer l, we assume that the other weight vectors are constant. Such a technique is commonly used in numerical methods [10]. At each training iteration, we update each weight vector separately. Each neuron has its own energy function. The energy function of the i-th neuron in layer l is given by

    E(w_{i,l}) = ∑_{τ=1}^{t} ‖d(τ) − h(w_{i,l}, x(τ))‖² + α w_{i,l}^T w_{i,l} + [w_{i,l} − ŵ_{i,l}(0)]^T P_{i,l}^{−1}(0) [w_{i,l} − ŵ_{i,l}(0)].   (15)

Using a derivation similar to the previous analysis, we obtain the following recursive equations for the local TWDRLS algorithm. Each neuron (except input neurons) has its own set of TWDRLS equations. For the i-th neuron in layer l, they are

    P*_{i,l}(t−1) = [ I_{(n_{l−1}+1)×(n_{l−1}+1)} + α P_{i,l}(t−1) ]^{−1} P_{i,l}(t−1),   (16)
    K_{i,l}(t) = P*_{i,l}(t−1) H_{i,l}^T(t) [ I_{n_L×n_L} + H_{i,l}(t) P*_{i,l}(t−1) H_{i,l}^T(t) ]^{−1},   (17)
    P_{i,l}(t) = P*_{i,l}(t−1) − K_{i,l}(t) H_{i,l}(t) P*_{i,l}(t−1),    (18)
    ŵ_{i,l}(t) = ŵ_{i,l}(t−1) − α P_{i,l}(t) ŵ_{i,l}(t−1) + K_{i,l}(t) [ d(t) − h(ŵ_{i,l}(t−1), x(t)) ],   (19)

where H_{i,l} is the n_L × (n_{l−1}+1) local gradient matrix; in this matrix, only the row associated with the considered neuron is nonzero for the output layer L. K_{i,l}(t) is the (n_{l−1}+1) × n_L local Kalman gain, and P_{i,l}(t) is the (n_{l−1}+1) × (n_{l−1}+1) local error covariance matrix.

The training process of the local TWDRLS algorithm is as follows. There are ∑_{l=2}^{L} n_l neurons (excluding input neurons), and hence ∑_{l=2}^{L} n_l sets of TWDRLS equations. We update the local weight vectors in descending order of l and then in ascending order of i, using (16)-(19). At each training stage, only the concerned local weight vector is updated; all other local weight vectors remain unchanged.

In the global TWDRLS, the complexity comes mainly from computing the inverse of the M-dimensional matrix (I_{M×M} + αP(t−1)), which costs O(M³). The computational complexity is therefore

    TCC_global = O(M³) = O( ( ∑_{l=2}^{L} n_l (n_{l−1}+1) )³ ).

Since the size of the matrix is M × M, the space complexity (storage requirement) is

    TCS_global = O(M²) = O( ( ∑_{l=2}^{L} n_l (n_{l−1}+1) )² ).

From (16), the computational cost of the local TWDRLS algorithm comes mainly from the inversion of an (n_{l−1}+1) × (n_{l−1}+1) matrix. In this way, the


computational complexity of each set of local TWDRLS equations is O((n_{l−1}+1)³), and the corresponding space complexity is O((n_{l−1}+1)²). Hence, the total computational complexity of the local TWDRLS is

    TCC_local = O( ∑_{l=2}^{L} n_l (n_{l−1}+1)³ ),

and the space complexity (storage requirement) is

    TCS_local = O( ∑_{l=2}^{L} n_l (n_{l−1}+1)² ).

These are much smaller than the computational and space complexities of the global case.

4 Simulations

Two problems, the generalized XOR and sunspot data prediction, are considered. We use three-layer networks. The initial weights are small zero-mean independent identically distributed Gaussian random variables. The transfer function of the hidden neurons is the hyperbolic tangent. Since the generalized XOR is a classification problem, its output neurons use the hyperbolic tangent function; for the sunspot data prediction problem, the output neurons use the linear activation function. The training for each problem is performed 10 times with different random initial weights.

4.1 Generalized XOR Problem

The generalized XOR problem is formulated as d = sign(x₁x₂), with inputs in the range [−1, 1]. The network has 2 input neurons, 10 hidden neurons, and 1 output neuron; as a result, there are 41 weights. The training set and test set, shown in Figure 1, consist of 50 and 2,000 samples, respectively. The total number of training cycles is set to 200. In each cycle, training samples from the training set are fed to the network one by one. The decision boundaries obtained from typical networks trained with the global and local TWDRLS algorithms and the standard RLS algorithm are plotted in Figure 2.

[Fig. 1. Training and test samples for the generalized XOR problem: (a) training samples; (b) test samples.]
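The data sets of Section 4.1 are easy to regenerate. The sketch below samples the generalized XOR problem d = sign(x₁x₂) on [−1, 1]²; the random seed is our own choice, so the samples will not match Figure 1 exactly.

```python
import random

random.seed(5)

def make_xor(n):
    # generalized XOR: d = sign(x1 * x2), inputs uniform on [-1, 1]^2 (Sect. 4.1)
    data = []
    for _ in range(n):
        x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
        data.append(((x1, x2), 1.0 if x1 * x2 > 0 else -1.0))
    return data

train_set = make_xor(50)    # 50 training samples, as in the paper
test_set = make_xor(2000)   # 2,000 test samples
```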

Table 1. Computational and space complexities of the global and local TWDRLS algorithms for solving the generalized XOR problem

Algorithm   Computational complexity   Space complexity
Global      O(6.89 × 10⁴)              O(1.68 × 10³)
Local       O(1.60 × 10³)              O(2.21 × 10²)

[Fig. 2. Decision boundaries of various trained networks for the generalized XOR problem: (a) global TWDRLS, α = 0; (b) local TWDRLS, α = 0; (c) global TWDRLS, α = 0.00178; (d) local TWDRLS, α = 0.00178. Note that when α = 0, TWDRLS is identical to RLS.]

From Figures 1 and 2, the decision boundaries obtained from the networks trained with the TWDRLS algorithm are closer to the ideal ones than those obtained with the standard RLS algorithm. Also, the local and global TWDRLS algorithms produce similarly shaped decision boundaries. Figure 3 summarizes the average test-set false rates over the 10 runs. The average test-set false rates obtained by the global and local TWDRLS algorithms are usually lower than those obtained by the standard RLS algorithm over a wide range of regularization parameters; that is, both global and local TWDRLS algorithms improve the generalization ability. In terms of average false rate, the performance of the local TWDRLS algorithm is quite similar to that of the global one. The computational and space complexities of the global and local algorithms are listed in Table 1. From Figure 3 and Table 1, we can conclude that


Fig. 3. Average test set false rate of 10 runs for the generalized XOR problem

the performance of the local TWDRLS is comparable to that of the global one, while its complexities are much smaller.

Figure 3 indicates that the average test-set false rate first decreases with the regularization parameter α and then increases. This shows that a proper selection of α indeed improves the generalization ability of the network. On the other hand, we observe that the test-set false rate becomes very high at large values of α, especially for networks trained with the global TWDRLS algorithm. This is because when the value of α is too large, the weight decay effect is very substantial and the trained network cannot learn the target function. To illustrate this further, Figure 4 plots the decision boundary obtained from a network trained with the global TWDRLS algorithm for α = 0.0178: the network has already converged while its decision boundary is still quite far from the ideal one. In other words, the regularization parameter α must not be too large, otherwise the network cannot learn the target function.

4.2 Sunspot Data Prediction

The sunspot data from 1700 to 1979 are normalized to the range [0, 1] and used as the training and test sets. Following common practice, we divide the data into a training set (1700−1920) and two test sets, namely Test-set 1 (1921−1955) and Test-set 2 (1956−1979). The sunspot series is rather nonstationary, and Test-set 2 is atypical of the series as a whole. In the simulation, we assume that the series is generated from the auto-regressive model

    d(t) = ϕ( d(t−1), ..., d(t−12) ) + ε(t),                             (20)

where ε(t) is noise and ϕ(·, ..., ·) is an unknown nonlinear function. A network with 12 input neurons, 8 hidden neurons (with hyperbolic tangent activation

Table 2. Computational and space complexities of the global and local TWDRLS algorithms for the sunspot data prediction

Algorithm   Computational complexity   Space complexity
Global      O(1.44 × 10⁶)              O(1.28 × 10⁴)
Local       O(1.83 × 10⁴)              O(1.43 × 10³)

Fig. 4. Decision boundaries of a trained network with local TWDRLS where α = 0.0178. In this case, the value of the regularization parameter is too large. Hence, the network cannot form a good decision boundary.

[Fig. 5. RMSE of networks trained by the global and local TWDRLS algorithms: (a) Test-set 1 average RMSE; (b) Test-set 2 average RMSE. Note that when α = 0, TWDRLS is identical to RLS.]

function), and one output neuron (with linear activation function) is used for approximating ϕ(·, ..., ·). The total number of training cycles is 200. As this is a time-series problem, the training samples are fed to the network sequentially in each iteration. The criterion used to evaluate the model performance


is the root mean squared error (RMSE) on the test set. The experiments are repeated 10 times with different initial weights. Figure 5 summarizes the average RMSE over the 10 runs. The computational and space complexities of the global and local algorithms are listed in Table 2. We observe from Figure 5 that, over a wide range of the regularization parameter α, both the global and local TWDRLS algorithms greatly improve the generalization ability of the trained networks, especially for Test-set 2, which is quite different from the training set. However, the test RMSE becomes very large at large values of α, for reasons similar to those stated in the last subsection: at large values of α, the weight decay effect is too strong and the network cannot learn the target function. In most cases, the performance of the local training is comparable to that of the global one. Also, Table 2 shows that the complexities of the local training are much smaller than those of the global one.
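The auto-regressive setup (20) amounts to building 12-lag input windows from the normalized series. The helper below sketches that preprocessing; the short ramp series is a stand-in of our own, since the actual sunspot numbers are not reproduced here.

```python
def make_lagged(series, order=12):
    # (input, target) pairs for the AR model d(t) = phi(d(t-1), ..., d(t-order)), Eq. (20)
    pairs = []
    for t in range(order, len(series)):
        window = [series[t - k] for k in range(1, order + 1)]  # d(t-1), ..., d(t-order)
        pairs.append((window, series[t]))
    return pairs

series = [i / 100 for i in range(30)]   # stand-in for the normalized sunspot numbers
pairs = make_lagged(series, order=12)
```

Applied to the real series, the 1700−1920 portion of the pairs would form the training set and the remaining windows the two test sets.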

5 Conclusion

We have investigated the problem of training the MFNN model using TWDRLS algorithms. We derived a set of concise equations for the local TWDRLS algorithm. The computational complexity and the storage requirement are reduced considerably by the local approach. Computer simulations indicate that both the local and global TWDRLS algorithms can improve the generalization ability of MFNNs, and that the performance of the local TWDRLS algorithm is comparable to that of the global one.

Acknowledgement The work is supported by the Hong Kong Special Administrative Region RGC Earmarked Grant (Project No. CityU 115606).


The Local True Weight Decay Recursive Least Square Algorithm

465


Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift

Keisuke Yamazaki and Sumio Watanabe

Precision and Intelligence Laboratory, Tokyo Institute of Technology, R2-5, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 Japan
{k-yam,swatanab}@pi.titech.ac.jp

Abstract. In the standard setting of statistical learning theory, we assume that the training and test data are generated from the same distribution. However, this assumption does not hold in many practical cases, e.g., brain-computer interfacing, bioinformatics, etc. In particular, a change of the input distribution in regression problems often occurs and is known as the covariate shift. Many studies have addressed adapting to this change, since ordinary machine learning methods do not work properly under the shift. The asymptotic theory has also been developed for Bayesian inference. Although many eﬀective results have been reported for regular statistical models, non-regular models have not been well studied. This paper focuses on the behavior of non-regular models under the covariate shift. In a former study [1], we formally revealed the factors changing the generalization error and established its upper bound. We here report that experimental results support the theoretical ﬁndings. Moreover, it is observed that the basis function of the model plays an important role in some cases.

1 Introduction

The task of the regression problem is to estimate the input–output relation q(y|x) from sample data, where x and y are the input and output data, respectively. We generally assume that the same input distribution q(x) generates both the training and test data. However, this assumption is not satisﬁed in many practical situations, e.g., brain-computer interfacing [2], bioinformatics [3], etc. The change of the input distribution from the training q0(x) to the test q1(x) is referred to as the covariate shift [4]. It is known that, under the covariate shift, standard techniques in machine learning do not work properly, and many eﬃcient methods to tackle this issue have been proposed [4,5,6,7]. In Bayes estimation, Shimodaira [4] revealed how the generalization error is improved by the importance weight in regular cases. We formally clariﬁed the behavior of the error in non-regular cases [1]. The result shows that the generalization error is determined by lower-order terms, which are ignored in the situation without the covariate shift. At the same time, it appeared that the calculation of these terms is not straightforward even in a simple regular example. To cope

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 466–476, 2008.
© Springer-Verlag Berlin Heidelberg 2008


with this problem, we also established an upper bound using the generalization error without the covariate shift. However, it is still open how to derive the theoretical generalization error for non-regular models. In this paper, we observe experimental results calculated with a Monte Carlo method and examine the theoretical upper bound on some non-regular models. Comparing the generalization error under the covariate shift to that without the shift, we investigate the eﬀect of the basis function in the learning model and consider the tightness of the bound. In the next section, we deﬁne non-regular models and summarize the Bayesian analyses for the models without and with the covariate shift. We show the experimental results in Section 3 and give a discussion at the end.

2 Bayesian Generalization Errors with and without Covariate Shift

In this section, we summarize the asymptotic theory of non-regular Bayesian inference. First, we deﬁne the non-regular case. Then, mathematical properties of non-regular models without the covariate shift are introduced [8]. Finally, we state the results of the former study [1], which clariﬁed the generalization error under the shift.

2.1 Non-regular Models

Let us deﬁne the parametric learning model by p(y|x, w), where x, y, and w are the input, output, and parameter, respectively. When the true distribution r(y|x) is realized by the learning model, a true parameter w* exists, i.e., p(y|x, w*) = r(y|x). The model is regular if w* is a single point in the parameter space. Otherwise, it is non-regular: the true parameter is not one point but a set of parameters,

W_t = {w* : p(y|x, w*) = r(y|x)}.

For example, three-layer perceptrons are non-regular. Let the true distribution be a zero function with Gaussian noise,

r(y|x) = (1/√(2π)) exp( −y²/2 ),

and let the learning model be a simple three-layer perceptron,

p(y|x, w) = (1/√(2π)) exp( −(y − a tanh(bx))²/2 ),

where the parameter is w = {a, b}. It is easy to ﬁnd that the true parameters form the set {a = 0} ∪ {b = 0}, since 0 × tanh(bx) = a × tanh 0 = 0. This non-regularity (also called non-identiﬁability) means that conventional statistical methods cannot be applied to such models (we will discuss the details in Section 2.2). In spite of this diﬃculty for analysis, non-regular models such as perceptrons, Gaussian mixtures, and hidden Markov models are widely employed in many information engineering ﬁelds.
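To make this non-identifiability concrete, the following short sketch (our own illustration, not from the paper) evaluates, by Monte Carlo, the averaged Kullback–Leibler divergence between the true distribution and the perceptron model; for unit-variance Gaussian noise it reduces to E_{q0}[ (a tanh(bx))² ] / 2, which vanishes on the whole set {a = 0} ∪ {b = 0} and is strictly positive elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # inputs x ~ q0 = N(0, 1)

def H0(a, b):
    # Averaged KL divergence E_q0[ KL(N(0,1) || N(a*tanh(b*x), 1)) ]
    # = E_q0[ (a*tanh(b*x))**2 ] / 2 for unit-variance Gaussian noise
    return np.mean((a * np.tanh(b * x)) ** 2) / 2

# The divergence vanishes along both axes of the set {a = 0} ∪ {b = 0}:
print(H0(0.0, 1.7))   # a = 0  -> exactly 0
print(H0(2.5, 0.0))   # b = 0  -> exactly 0
print(H0(1.0, 1.0))   # generic point -> strictly positive
```

Any point on either axis of the (a, b) plane reproduces the true distribution exactly, which is what makes the posterior non-Gaussian.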

2.2 Properties of the Generalization Error without Covariate Shift

As we mentioned in the previous section, the conventional statistical manner does not work for non-regular models. To cope with this issue, a method based on algebraic geometry was developed; here we introduce a summary. Hereafter, we denote the cases without and with the covariate shift by the subscripts 0 and 1, respectively. Some of the functions with a suﬃx 0 will be replaced by those with 1 in the next section.

Let {X^n, Y^n} = {X_1, Y_1, …, X_n, Y_n} be a set of training samples that are independently and identically generated by the true distribution r(y|x)q0(x). Let p(y|x, w) be a learning machine and ϕ(w) be an a priori distribution of the parameter w. Then the a posteriori distribution (posterior) is deﬁned by

p(w|X^n, Y^n) = (1/Z(X^n, Y^n)) ∏_{i=1}^{n} p(Y_i|X_i, w) ϕ(w),

where

Z(X^n, Y^n) = ∫ ∏_{i=1}^{n} p(Y_i|X_i, w) ϕ(w) dw.   (1)

The Bayesian predictive distribution is given by

p(y|x, X^n, Y^n) = ∫ p(y|x, w) p(w|X^n, Y^n) dw.

When the number of sample data is suﬃciently large (n → ∞), the posterior has its peak(s) at the true parameter(s). The posterior of a regular model is asymptotically a Gaussian distribution whose mean is the parameter w*. On the other hand, the shape of the posterior of a non-regular model is not Gaussian because of W_t (cf. the right panel of Fig. 10 in Section 3). We evaluate the generalization error by the average Kullback divergence from the true distribution to the predictive distribution:

G^0(n) = E^0_{X^n,Y^n} [ ∫ r(y|x) q0(x) log ( r(y|x) / p(y|x, X^n, Y^n) ) dx dy ].

In the standard statistical manner, we can formally calculate the generalization error by integrating the predictive distribution. This integration is viable based on the Gaussian posterior; therefore, the method is applicable only to regular models. The following is one of the solutions for non-regular cases. The stochastic complexity [9] is deﬁned by

F(X^n, Y^n) = − log Z(X^n, Y^n),   (2)

which can be used for selecting an appropriate model or hyper-parameters. To analyze the behavior of the stochastic complexity, the following functions play important roles:

U^0(n) = E^0_{X^n,Y^n} [ F̃(X^n, Y^n) ],   (3)

where E^0_{X^n,Y^n}[·] stands for the expectation over r(y|x)q0(x) and

F̃(X^n, Y^n) = F(X^n, Y^n) + Σ_{i=1}^{n} log r(Y_i|X_i).

The generalization error and the stochastic complexity are linked by the following equation [10]:

G^0(n) = U^0(n + 1) − U^0(n).   (4)

When the learning machine p(y|x, w) can attain the true distribution r(y|x), the asymptotic expansion of U^0(n) is given as follows [8]:

U^0(n) = α log n − (β − 1) log log n + O(1).   (5)

The coeﬃcients α and β are determined by the integral transforms of U^0(n). More precisely, the rational number −α and the natural number β are the largest pole and its order of

J(z) = ∫ H_0(w)^z ϕ(w) dw,
H_0(w) = ∫ r(y|x) q0(x) log ( r(y|x) / p(y|x, w) ) dx dy.   (6)

J(z) is obtained by applying the inverse Laplace and Mellin transformations to exp[−U^0(n)]. Combining Eqs. (5) and (4) immediately gives

G^0(n) = α/n − (β − 1)/(n log n) + o( 1/(n log n) ),

when G^0(n) has an asymptotic form. The coeﬃcients α and β indicate the speed of convergence of the generalization error when the number of training samples is suﬃciently large. When the learning machine cannot attain the true distribution (i.e., the model is misspeciﬁed), the stochastic complexity has an upper bound of the following asymptotic expression [11]:

U^0(n) ≤ nC + α log n − (β − 1) log log n + O(1),   (7)

where C is a non-negative constant. When the generalization error has an asymptotic form, combining Eqs. (7) and (4) gives

G^0(n) ≤ C + α/n − (β − 1)/(n log n) + o( 1/(n log n) ),   (8)

where C is the bias.

2.3 Properties of the Generalization Error with Covariate Shift

Now, we introduce the results in [1]. Since the test data are distributed according to r(y|x)q1(x), the generalization error with the shift is deﬁned by

G^1(n) = E^0_{X^n,Y^n} [ ∫ r(y|x) q1(x) log ( r(y|x) / p(y|x, X^n, Y^n) ) dx dy ].   (9)

We need the function similar to (3), given by

U^1(n) = E^1_{X_n,Y_n} E_{X^{n−1},Y^{n−1}} [ F̃(X^n, Y^n) ].   (10)

Then, the variant of (4) is obtained as

G^1(n) = U^1(n + 1) − U^0(n).   (11)

When we assume that G^1(n) has an asymptotic expansion and converges to a constant, and that U^i(n) has the asymptotic expansion

U^i(n) = a_i n + b_i log n + ⋯ + c_i + d_i/n + o(1/n),

where the increasing terms a_i n + b_i log n + ⋯ are denoted by T_H^i(n) and the remaining terms c_i + d_i/n + o(1/n) by T_L^i(n), it holds that G^0(n) and G^1(n) are expressed by

G^0(n) = a_0 + b_0/n + o(1/n),
G^1(n) = a_0 + (c_1 − c_0) + (b_0 + (d_1 − d_0))/n + o(1/n),   (12)

and that T_H^1(n) = T_H^0(n). Note that b_0 = α. The factors c_1 − c_0 and d_1 − d_0 determine the diﬀerence of the errors. We have also obtained that the generalization error G^1(n) has an upper bound

G^1(n) ≤ M G^0(n),   (13)

if the following condition is satisﬁed:

M ≡ max_{x∼q0(x)} q1(x)/q0(x) < ∞.   (14)

3 Experimental Generalization Errors in Some Toy Models

Even though we know the factors causing the diﬀerence between G^1(n) and G^0(n) according to Eq. (12), it is not straightforward to calculate the lower-order terms in T_L^i(n). More precisely, only for restricted solvable models can we ﬁnd the constant and decreasing factors c_i, d_i in the increasing function U^i(n). This implies that revealing the analytic expression of G^1(n) is still an open issue for non-regular models. Here, we calculate G^1(n) experimentally and observe its behavior. A non-regular model requires sampling from a non-Gaussian posterior in the Bayes inference; we use a Markov Chain Monte Carlo (MCMC) method to execute this task [12]. In the following examples, we use the common notations: the true distribution is deﬁned by

r(y|x) = (1/√(2π)) exp( −(y − g(x))²/2 ),

Experimental Bayesian Generalization Error of Non-regular Models

-10

-5

0

5

10

20

15

-5

0

5

10

20

15

-5

0

5

10

15

0

5

10

15

20

-10

-10

-5

0

5

10

15

20

20

Fig. 7. (μ1 , σ1 ) = (10, 1)

-10

-5

0

5

10

15

20

Fig. 8. (μ1 , σ1 )=(10, 0.5)

-5

0

5

10

15

20

Fig. 3. (μ1 , σ1 ) = (0, 2)

-10

-5

0

5

10

15

20

Fig. 6. (μ1 , σ1 ) = (2, 2)

Fig. 5. (μ1 , σ1 ) = (2, 0.5)

Fig. 4. (μ1 , σ1 ) = (2, 1)

-10

-5

Fig. 2. (μ1 , σ1 ) = (0, 0.5)

Fig. 1. (μ1 , σ1 ) = (0, 1)

-10

-10

471

-10

-5

0

5

10

15

20

Fig. 9. (μ1 , σ1 ) = (10, 2)

The training and test distributions

the learning model is given by

p(y|x, w) = (1/√(2π)) exp( −(y − f(x, w))²/2 ),

the prior ϕ(w) is a standard normal distribution, and the input distributions are of the form

q_i(x) = (1/(√(2π) σ_i)) exp( −(x − μ_i)²/(2σ_i²) )   (i = 0, 1).

The training input distribution has (μ0, σ0) = (0, 1), and there are nine test distributions, each with the mean and variance taken from the combinations of μ1 ∈ {0, 2, 10} and σ1 ∈ {1, 0.5, 2} (cf. Figs. 1–9). Note that the case in Fig. 1 corresponds to q0(x). As for the experimental setting, the number of training samples is n, the number of test samples is n_test, the number of parameter samples distributed from the posterior with the MCMC method is n_p, and the number of samples used for the expectation E_{X^n,Y^n}[·] is n_D. In mathematical expressions,

p(y|x, X^n, Y^n) ≈ (1/n_p) Σ_{j=1}^{n_p} p(y|x, w_j) ∏_{i=1}^{n} p(Y_i|X_i, w_j) ϕ(w_j) / [ (1/n_p) Σ_{k=1}^{n_p} ∏_{i=1}^{n} p(Y_i|X_i, w_k) ϕ(w_k) ],

G^1(n) ≈ (1/n_D) Σ_{i=1}^{n_D} (1/n_test) Σ_{j=1}^{n_test} log [ r(y_j|x_j) / p(y_j|x_j, D_i) ],
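As a runnable illustration of these estimators (our own simplification: self-normalized importance sampling from the prior stands in for the paper's MCMC sampler, only the regular line model f1(x, a) = ax is used, and a single training set, i.e. n_D = 1, with a smaller n_p):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test, n_p = 500, 1000, 5000
mu1, sigma1 = 2.0, 0.5               # one of the nine test input settings

# Training data from r(y|x) q0(x): g(x) = 0, q0 = N(0, 1)
X = rng.standard_normal(n)
Y = rng.standard_normal(n)           # y = g(x) + noise = noise

# Parameter samples a_j from the prior phi = N(0, 1); model f1(x, a) = a*x
a = rng.standard_normal(n_p)

# Self-normalized weights w_j ∝ prod_i p(Y_i | X_i, a_j), in the log domain
loglik = -0.5 * ((Y[None, :] - a[:, None] * X[None, :]) ** 2).sum(axis=1)
w = np.exp(loglik - loglik.max())
w /= w.sum()

# Test data from r(y|x) q1(x)
x1 = mu1 + sigma1 * rng.standard_normal(n_test)
y1 = rng.standard_normal(n_test)

# Predictive density p(y|x, D) ≈ sum_j w_j p(y|x, a_j), and
# G1(n) ≈ mean over the test set of log r(y|x) / p(y|x, D)
dens = np.exp(-0.5 * (y1[None, :] - a[:, None] * x1[None, :]) ** 2) / np.sqrt(2 * np.pi)
pred = w @ dens
log_r = -0.5 * y1 ** 2 - 0.5 * np.log(2 * np.pi)
G1 = float(np.mean(log_r - np.log(pred)))
print(G1)  # should be of the order of th G1[f1] = R/(2n) = 0.00425 for this setting
```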


Fig. 10. The sampling from the posteriors. The left-upper panel shows the histogram of a for the ﬁrst model. The left-middle one is the histogram of a3 for the third model. The left-lower one is the point diagram of (a, b) for the second model, and the right one is its histogram.
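The cross-shaped posterior of the second model, f2(x, a, b) = abx, can be reproduced with a few lines of random-walk Metropolis (our own sketch; the paper uses the MCMC procedure of [12], and the step size, run length, and burn-in here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal(n)           # inputs from q0 = N(0, 1)
Y = rng.standard_normal(n)           # outputs from the true model g(x) = 0

def log_post(a, b):
    # Log posterior (up to a constant): Gaussian likelihood of f2 = a*b*x
    # plus the standard normal prior on (a, b)
    return -0.5 * np.sum((Y - a * b * X) ** 2) - 0.5 * (a ** 2 + b ** 2)

samples = np.empty((20000, 2))
state = np.array([1.0, 1.0])         # arbitrary starting point
lp = log_post(*state)
for t in range(samples.shape[0]):
    prop = state + 0.25 * rng.standard_normal(2)   # random-walk proposal
    lp_prop = log_post(*prop)
    if np.log(rng.random()) < lp_prop - lp:        # Metropolis accept/reject
        state, lp = prop, lp_prop
    samples[t] = state

a, b = samples[5000:, 0], samples[5000:, 1]        # discard burn-in
# The posterior mass concentrates near the set {a = 0} ∪ {b = 0},
# i.e. the product a*b stays close to zero:
print(np.mean(np.abs(a * b) < 0.2))
```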

where D_i = {X_{i1}, Y_{i1}, · · · , X_{in}, Y_{in}} stands for the ith set of training data, and D_i and (x_j, y_j) in G^1(n) are taken from q0(x)r(y|x) and q1(x)r(y|x), respectively. The experimental parameters were as follows: n = 500, n_test = 1000, n_p = 10000, n_D = 100.

Example 1 (Lines with Various Parameterizations).

g(x) = 0,  f1(x, a) = ax,  f2(x, a, b) = abx,  f3(x, a) = a³x,

where the true function is the zero function and the learning functions are lines with gradients a, ab, and a³, respectively. In this example, all learning functions belong to the same function class, though the second model is non-regular (W_t = {a = 0} ∪ {b = 0}) and the third one has a non-Gaussian posterior. The gradient parameters are taken from the posteriors depicted in Fig. 10. Table 1 summarizes the results. The ﬁrst row indicates the pairs (μ1, σ1), and the remaining rows give the experimental average generalization errors. G1[fi] stands for the error of the model with fi. MG0[fi] is the upper bound in each case according to Eq. (13). Note that some entries are blank because the condition of Eq. (14) is not satisﬁed there. To compare G1[f3] with G1[f1], the last row shows the values 3 × G1[f3] for each shift. Since the ﬁrst model is regular, it has the theoretical results:

G^0(n) = 1/(2n) + o( 1/(n log n) ),   G^1(n) = R/(2n) + o( 1/(n log n) ),   R = (μ1² + σ1²)/(μ0² + σ0²).

‘th G1[f1]’ in Table 1 is this theoretical result.
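This closed-form value can be tabulated directly; the following check (our own) reproduces the ‘th G1[f1]’ row of Table 1 for n = 500:

```python
# Theoretical generalization error of the regular model f1(x, a) = a*x:
# G1(n) ≈ R / (2n) with R = (mu1^2 + sigma1^2) / (mu0^2 + sigma0^2).
n = 500
mu0, sigma0 = 0.0, 1.0
shifts = [(0, 1), (0, 0.5), (0, 2), (2, 1), (2, 0.5),
          (2, 2), (10, 1), (10, 0.5), (10, 2)]

def th_G1(mu1, sigma1):
    R = (mu1 ** 2 + sigma1 ** 2) / (mu0 ** 2 + sigma0 ** 2)
    return R / (2 * n)

for mu1, sigma1 in shifts:
    print((mu1, sigma1), th_G1(mu1, sigma1))
# Reproduces the 'th G1[f1]' row of Table 1:
# 0.001, 0.00025, 0.004, 0.005, 0.00425, 0.008, 0.101, 0.10025, 0.104
```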


Table 1. Average generalization errors in Example 1

(μ1, σ1)   (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)    (10,0.5)    (10,2)
th G1[f1]  0.001     0.00025   0.004     0.005     0.00425   0.008     0.101     0.10025     0.104
G1[f1]     0.001055  0.000239  0.004356  0.006162  0.005107  0.009532  0.109341  0.106619    0.108042
G1[f2]     0.000874  0.000170  0.003466  0.004523  0.003670  0.006669  0.079280  0.078802    0.080180
G1[f3]     0.000394  0.000059  0.001475  0.002374  0.001912  0.003736  0.040287  0.038840    0.039682
MG0[f1]    —         0.002000  —         —         0.028784  —         —         1.79×10^26  —
MG0[f2]    —         0.001356  —         —         0.019521  —         —         1.22×10^26  —
MG0[f3]    —         0.000667  —         —         0.009595  —         —         5.98×10^25  —
3×G1[f3]   0.001182  0.000177  0.004425  0.007122  0.005736  0.011208  0.120861  0.116520    0.119046

Table 2. Average generalization errors in Example 2

(μ1, σ1)   (0,1)=G0  (0,0.5)   (0,2)     (2,1)     (2,0.5)   (2,2)     (10,1)    (10,0.5)    (10,2)
G1[f4]     0.000688  0.000260  0.002312  0.002977  0.003094  0.003423  0.013640  0.012769    0.012204
G1[f5]     0.000251  0.000103  0.000934  0.001116  0.001291  0.001318  0.004350  0.003729    0.003743
G1[f6]     0.000146  0.000062  0.000626  0.000705  0.000875  0.000918  0.002896  0.002357    0.002489
MG0[f4]    —         0.001356  —         —         0.019521  —         —         1.22×10^26  —
MG0[f5]    —         0.000667  —         —         0.009595  —         —         5.98×10^25  —
MG0[f6]    —         0.000400  —         —         0.005757  —         —         3.59×10^25  —
3×G1[f5]   0.000753  0.000309  0.002802  0.003348  0.003873  0.003954  0.013050  0.011187    0.011229
5×G1[f6]   0.000730  0.000310  0.003130  0.003525  0.004375  0.004590  0.014480  0.011785    0.012445

We can ﬁnd that ‘th G1[f1]’ is very close to G1[f1], in spite of the fact that the theoretical values are established in asymptotic cases. Based on this fact, the accuracy of the experiments can be evaluated by comparing them. As for f2 and f3, they do not have any comparable theoretical value except for the upper bound. We can conﬁrm that every value of G1[f2] and G1[f3] is actually smaller than the bound.

Example 2 (Simple Neural Networks). Let us assume that the true function is the zero function, and the learning models are three-layer perceptrons:

g(x) = 0,  f4(x, a, b) = a tanh(bx),  f5(x, a, b) = a³ tanh(bx),  f6(x, a, b) = a⁵ tanh(bx).

Table 2 shows the results. In this example, we can also conﬁrm that the bound works. Combining this with the results of the previous example, the bound tends to be tight when μ1 is small. As a matter of fact, the bound holds in small-sample cases, i.e., the number of training data n does not have to be suﬃciently large. Though we omit the details for lack of space, the bound is always larger than the experimental results for n = 100, 200, . . . , 400. The properties of the bound will be discussed in the next section.
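As a cross-check of the MG0 rows (our own sketch; the grid range is an arbitrary numerical choice), the constant M of Eq. (14) can be evaluated for Gaussian input distributions. The log-density ratio log(q1(x)/q0(x)) is quadratic in x, so it is bounded above only when σ1 < σ0, which is why only the σ1 = 0.5 columns of Tables 1 and 2 carry finite MG0 entries.

```python
import numpy as np

def M(mu1, sigma1, mu0=0.0, sigma0=1.0):
    # M = max_x q1(x)/q0(x); the log-ratio is a quadratic in x that is
    # concave (hence bounded above) only when sigma1 < sigma0.
    x = np.linspace(-60.0, 60.0, 1_200_001)
    log_ratio = (np.log(sigma0 / sigma1)
                 - (x - mu1) ** 2 / (2 * sigma1 ** 2)
                 + (x - mu0) ** 2 / (2 * sigma0 ** 2))
    return float(np.exp(log_ratio.max()))

print(M(0, 0.5))    # ~2
print(M(2, 0.5))    # ~28.78, so MG0[f1] ≈ 28.78 × 0.001 ≈ 0.028784
print(M(10, 0.5))   # ~1.79e29, hence the 10^26-scale bounds in the tables
```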

4 Discussions

First, let us conﬁrm whether the sampling from the posterior was successfully done by the MCMC method. Based on the algebraic geometrical method, the coeﬃcients of G^0(n) for these models are derived (cf. Table 3). As we mentioned, f2, f4 and

Table 3. The coeﬃcients of the generalization error without the covariate shift

     f1    f2, f4   f3, f5   f6
α    1/2   1/2      1/6      1/10
β    1     2        1        1

f3, f5 have the same theoretical error. Using the examples of the previous section, we can compare the theoretical values to the experimental ones:

G^0(n)[f1] = 0.001 ≈ 0.001055,
G^0(n)[f2, f4] = 0.000678 ≈ 0.000874, 0.000688,
G^0(n)[f3, f5] = 0.000333 ≈ 0.000394, 0.000251,
G^0(n)[f6] = 0.0002 ≈ 0.000146.

In the sense of the generalization error, the MCMC method worked well, though there is some ﬂuctuation in the results. Note that it is still open how to evaluate the method. Here we measured it by the generalization error, since the theoretical value of G^0(n) is known. However, this index is just a necessary condition. Developing an evaluation of the selected samples is our future study.

Next, we consider the behavior of G^1(n). In the examples, the true function was commonly the zero function g(x) = 0. Learning the zero function is an important case, because in practice we often prepare a suﬃciently rich model. Then, the learning function will be set up as

f(x, w) = Σ_{k=1}^{K} t(w_1^k) h(x, w_2^k),

where h is the basis function, t is the parameterization of its weight, and w = {w_1^1, w_2^1, w_1^2, w_2^2, . . . , w_1^K, w_2^K}. Note that many practical models are included in this expression. According to the redundancy of the function, some of the h(x, w_2^k) learn the zero function. Our examples provided the simplest situations and highlighted the eﬀect of non-regularity in the learning models. The errors G^0(n) and G^1(n) are generally expressed as

G^0(n) = α/n − (β − 1)/(n log n) + o( 1/(n log n) ),
G^1(n) = R_1 α/n − R_2 (β − 1)/(n log n) + o( 1/(n log n) ),

where R_1, R_2 depend on f, g, q0, and q1. In this expression, R_1 and R_2 cause the diﬀerence between G^0 and G^1. In Eq. (12), the coeﬃcient of 1/n is given by b_0 + (d_1 − d_0). So

R_1 = (b_0 + (d_1 − d_0))/b_0 = 1 + (d_1 − d_0)/α.

Let us denote A → B as “A is the only factor to determine the value of B”. As mentioned above, f, g, q0, q1 → R_1, R_2. Though f, g → α, β (cf. around Eqs. (5)–(6)), we should emphasize that α, β, q0, q1 do not by themselves determine R_1 and R_2.


This fact is easily conﬁrmed by comparing f2 with f4 (and also f3 with f5). It holds that G^1(n)[f2] ≠ G^1(n)[f4] for all q1, although they have the same α and β (and G^0(n)[f2] = G^0(n)[f4]). Thus α and β are not informative enough to describe R_1 and R_2. Comparing the values of G1[f2] with those of G1[f4], the basis function (x in f2 and tanh(bx) in f4) seems to play an important role. To clarify the eﬀect of basis functions, let us ﬁx the function class. Examples 1 and 2 correspond to h(x, w_2) = x and h(x, w_2) = tanh(bx), respectively. The values of G1[f1] and 3 × G1[f3] (and also 3 × G1[f5] and 5 × G1[f6]) can be regarded as the same under any covariate shift. This implies h, g, q0, q1 → R_1, i.e., the parameterization t(w_1) will not aﬀect R_1. Instead, it aﬀects the non-regularity, or the multiplicity, and decides α and β. Though it is an unclear factor, the inﬂuence of R_2 does not seem as large as that of R_1. Last, let us analyze properties of the upper bound MG^0(n). According to the above discussion, it holds that R_1 ≈ G^1/G^0 ≤ M. The ratio G^1/G^0 basically depends on g, h, q0 and q1. However, M is determined only by the training and test input distributions: q0, q1 → M. Therefore this bound gives the worst-case evaluation over any g and h. Considering the tightness of the bound, we can still improve it based on the relation between the true and learning functions.

5 Conclusions

In our former study, we obtained the theoretical generalization error and its upper bound under the covariate shift. This paper showed that the theoretical value is supported by experiments, in spite of the fact that it was established in an asymptotic setting. We observed the tightness of the bound and discussed the eﬀect of basis functions in the learning models. In this paper, the non-regular models are simple lines and neural networks; it is an interesting issue to investigate more general models. Though we mainly considered the magnitude of G^1(n), the computational cost of the MCMC method is strongly connected to the form of the learning function f. Taking account of this cost in the evaluation is our future study.

Acknowledgements. The authors would like to thank Masashi Sugiyama, Motoaki Kawanabe, and Klaus-Robert Müller for fruitful discussions. The software for the MCMC method and technical comments were provided by Kenji Nagata. This research was partly supported by the Alexander von Humboldt Foundation and MEXT 18079007.


References
1. Yamazaki, K., Kawanabe, M., Watanabe, S., Sugiyama, M., Müller, K.R.: Asymptotic Bayesian generalization error when training and test distributions are diﬀerent. In: Proceedings of the 24th International Conference on Machine Learning, pp. 1079–1086 (2007)
2. Wolpaw, J.R., Birbaumer, N., McFarland, D.J., Pfurtscheller, G., Vaughan, T.M.: Brain-computer interfaces for communication and control. Clinical Neurophysiology 113(6), 767–791 (2002)
3. Baldi, P., Brunak, S., Stolovitzky, G.A.: Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge (1998)
4. Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90, 227–244 (2000)
5. Sugiyama, M., Müller, K.R.: Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions 23(4), 249–279 (2005)
6. Sugiyama, M., Krauledat, M., Müller, K.R.: Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research 8 (2007)
7. Huang, J., Smola, A., Gretton, A., Borgwardt, K.M., Schölkopf, B.: Correcting sample selection bias by unlabeled data. In: Schölkopf, B., Platt, J., Hoﬀman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19. MIT Press, Cambridge, MA (2007)
8. Watanabe, S.: Algebraic analysis for non-identiﬁable learning machines. Neural Computation 13(4), 899–933 (2001)
9. Rissanen, J.: Stochastic complexity and modeling. Annals of Statistics 14, 1080–1100 (1986)
10. Watanabe, S.: Algebraic analysis for singular statistical estimation. In: Watanabe, O., Yokomori, T. (eds.) ALT 1999. LNCS (LNAI), vol. 1720, pp. 39–50. Springer, Heidelberg (1999)
11. Watanabe, S.: Algebraic information geometry for learning machines with singularities. Advances in Neural Information Processing Systems 14, 329–336 (2001)
12. Ogata, Y.: A Monte Carlo method for an objective Bayesian procedure. Ann. Inst. Statist. Math. 42(3), 403–433 (1990)

Using Image Stimuli to Drive fMRI Analysis

David R. Hardoon¹, Janaina Mourão-Miranda², Michael Brammer², and John Shawe-Taylor¹

¹ The Centre for Computational Statistics and Machine Learning, Department of Computer Science, University College London, Gower St., London WC1E 6BT
{D.Hardoon,jst}@cs.ucl.ac.uk
² Brain Image Analysis Unit, Centre for Neuroimaging Sciences (PO 89), Institute of Psychiatry, De Crespigny Park, London SE5 8AF
{Janaina.Mourao-Miranda,Michael.Brammer}@iop.kcl.ac.uk

Abstract. We introduce a new unsupervised fMRI analysis method based on Kernel Canonical Correlation Analysis, which diﬀers from the class of supervised learning methods that are increasingly being employed in fMRI data analysis. Whereas SVM associates properties of the imaging data with simple speciﬁc categorical labels, KCCA replaces these simple labels with a label vector for each stimulus containing details of the features of that stimulus. We have compared KCCA and SVM analyses of an fMRI data set involving responses to emotionally salient stimuli. This involved ﬁrst training each algorithm (SVM, KCCA) on a subset of fMRI data and the corresponding labels/label vectors, then testing the algorithms on data withheld from the original training phase. The classiﬁcation accuracies of SVM and KCCA proved to be very similar. However, the most important result arising from this study is that KCCA is able, in part, to extract many of the brain regions that SVM identiﬁes as the most important in task discrimination, while remaining blind to the categorical task labels.

Keywords: Machine learning methods, Kernel canonical correlation analysis, Support vector machines, Classiﬁers, Functional magnetic resonance imaging data analysis.

1 Introduction

Recently, machine learning methodologies have been increasingly used to analyse the relationship between stimulus categories and fMRI responses [1,2,3,4,5,6,7,8,9,10]. In this paper, we introduce a new unsupervised machine learning approach to fMRI analysis, in which the simple categorical description of stimulus type (e.g., type of task) is replaced by a more informative vector of stimulus features. We compare this new approach with a standard Support Vector Machine (SVM) analysis of fMRI data using a categorical description of stimulus type.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 477–486, 2008.
© Springer-Verlag Berlin Heidelberg 2008


The technology of the present study originates from earlier research carried out in the domain of image annotation [11], where an image annotation methodology learns a direct mapping from image descriptors to keywords. Previous attempts at unsupervised fMRI analysis have been based on Kohonen self-organising maps, fuzzy clustering [12] and nonparametric estimation methods of the hemodynamic response function, such as the general method described in [13]. [14] reported an interesting study which showed that the discriminability of PCA basis representations of images of multiple object categories is signiﬁcantly correlated with the discriminability of PCA basis representations of the fMRI volumes based on category labels. The current study diﬀers from conventional unsupervised approaches in that it makes use of the stimulus characteristics as an implicit representation of a complex state label. We use kernel Canonical Correlation Analysis (KCCA) to learn the correlation between an fMRI volume and its corresponding stimulus. Canonical correlation analysis can be seen as the problem of ﬁnding basis vectors for two sets of variables such that the correlations of the projections of the variables onto the corresponding basis vectors are maximised. KCCA ﬁrst projects the data into a higher-dimensional feature space before performing CCA in the new feature space. CCA [15,16] and KCCA [17] have been used in previous fMRI analyses using only conventional categorical stimulus descriptions, without exploring the possibility of using complex characteristics of the stimuli as the source for feature selection from the fMRI data. The fMRI data used in the following study originated from an experiment in which the stimuli were designed to evoke diﬀerent types of emotional responses, pleasant or unpleasant. The pleasant images consisted of women in swimsuits, while the unpleasant images were a collection of images of skin diseases. Each stimulus image was represented using Scale Invariant Feature Transformation (SIFT) [18] features. Interestingly, some of the properties of the SIFT representation have been modeled on the properties of complex neurons in the visual cortex. Although not speciﬁcally exploited in the current paper, future studies may be able to utilize this property to probe aspects of brain function such as modularity. In the current study, we present a feasibility study of the possibility of generating new activity maps by using the actual stimuli that generated the fMRI volumes. We show that KCCA is able to extract brain regions identiﬁed by supervised methods such as SVM in task discrimination and to achieve similar levels of accuracy, and we discuss some of the challenges in interpreting the results given the complex input feature vectors used by KCCA in place of categorical labels. This work is an extension of the work presented in [19]. The paper is structured as follows. Section 2 reviews the fMRI data acquisition as well as the experimental design and the pre-processing, followed by a brief description of the scale invariant feature transformation in Section 2.1. The SVM and KCCA methodologies are described in Section 2.2. Our results are presented in Section 3. We conclude with a discussion in Section 4.

2 Materials and Methods

Due to the lack of space we refer the reader to [10] for a detailed account of the subject, data acquisition and pre-processing applied to the data, as well as the experimental design.

2.1 Scale Invariant Feature Transformation

Scale Invariant Feature Transformation (SIFT) was introduced by [18] and shown to be superior to other descriptors [20]. This is due to the SIFT descriptors being designed to be invariant to small shifts in the position of salient (i.e., prominent) regions. Calculation of the SIFT vector begins with a scale-space search in which local minima and maxima are identiﬁed in each image (so-called key locations). The properties of the image at each key location are then expressed in terms of gradient magnitude and orientation. A canonical orientation is then assigned to each key location to maximize rotation invariance. Robustness to reorientation is introduced by representing local image regions around key locations in a number of orientations. A reference key vector is then computed over all images and the data for each image are represented in terms of distance from this reference.

Image Processing. Let f_i^l be the SIFT feature vector for image i, where l is the number of features. Each image i has a diﬀerent number of SIFT features l, making it diﬃcult to directly compare two images. To overcome this problem we apply K-means clustering to the SIFT features to obtain a uniform frame: we ﬁnd K classes and their respective centres o_j, where j = 1, . . . , K. The feature vector x_i of an image stimulus i is K-dimensional, with jth component x_{i,j} computed as a Gaussian measure of the minimal distance between the SIFT features f_i^l and the centre o_j. This can be represented as

x_{i,j} = exp( − min_{v ∈ f_i^l} d(v, o_j)² ),   (1)

where d(·, ·) is the Euclidean distance. The number of centres is set to the smallest number of SIFT features computed for any image (found to be 300). Therefore, after processing, each image is represented by a 300-dimensional feature vector encoding its relative distance from the cluster centres.
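A minimal sketch of this encoding step, assuming toy descriptor arrays and a plain Lloyd-iteration K-means in place of whatever clustering implementation the authors used; all function names are hypothetical:

```python
import numpy as np

def kmeans(points, K, iters=20, seed=0):
    """Plain Lloyd iterations (a stand-in for any K-means implementation)."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), K, replace=False)]
    for _ in range(iters):
        # squared distances from every point to every centre
        d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(K):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return centres

def encode_images(sift_per_image, centres):
    """Bag-of-features encoding of eq. (1):
    x[i, j] = exp(-min over descriptors v of image i of ||v - o_j||^2)."""
    K = centres.shape[0]
    X = np.empty((len(sift_per_image), K))
    for i, feats in enumerate(sift_per_image):
        d2 = ((feats[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        X[i] = np.exp(-d2.min(axis=0))   # minimal distance to each centre
    return X
```

In use, the descriptors of all images would be pooled, clustered once, and each image then encoded against the shared centres, giving every image a fixed-length vector regardless of its own descriptor count.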

2.2 Methods

Support Vector Machines. Support vector machines [21] are kernel-based methods that find functions of the data that facilitate classification. They are derived from statistical learning theory [22] and have emerged as powerful tools for statistical pattern recognition [23]. In the linear formulation, an SVM finds,


D.R. Hardoon et al.

during the training phase, the hyperplane that separates the examples in the input space according to their class labels. The SVM classifier is trained by providing examples of the form (x, y), where x represents an input and y its class label. Once the decision function has been learned from the training data, it can be used to predict the class of a new test example. We used a linear-kernel SVM, which allows direct extraction of the weight vector as an image. A parameter C, which controls the trade-off between training errors and smoothness, was fixed at C = 1 for all cases (default value).1

Kernel Canonical Correlation Analysis. Proposed by Hotelling in 1936, Canonical Correlation Analysis (CCA) is a technique for finding pairs of basis vectors that maximise the correlation between the projections of paired variables onto their corresponding basis vectors. Correlation is dependent on the chosen coordinate system; therefore, even if there is a very strong linear relationship between two sets of multidimensional variables, this relationship may not be visible as a correlation. CCA seeks a pair of linear transformations, one for each of the paired variables, such that when the variables are transformed the corresponding coordinates are maximally correlated. Let x and y be two random variables from a multi-dimensional distribution, with zero mean, and consider the linear combinations x̂ = w_a⊤x and ŷ = w_b⊤y. The maximisation of the correlation between x̂ and ŷ corresponds to solving max_{w_a,w_b} ρ = w_a⊤C_ab w_b subject to w_a⊤C_aa w_a = w_b⊤C_bb w_b = 1, where C_aa and C_bb are the non-singular within-set covariance matrices and C_ab is the between-sets covariance matrix. We use the kernel variant of CCA [24] because, owing to the linearity of CCA, useful descriptors may not be extracted from the data; the correlation could exist in some nonlinear relationship.

The kernelising of CCA offers an alternative solution by first projecting the data into a higher-dimensional feature space φ : x = (x_1, …, x_n) ↦ φ(x) = (φ_1(x), …, φ_N(x)) (N ≥ n) before performing CCA in the new feature space. Given the kernel functions κ_a and κ_b, let K_a = X_a X_a⊤ and K_b = X_b X_b⊤ be the kernel matrices corresponding to the two representations of the data, where X_a is the matrix whose rows are the vectors φ_a(x_i), i = 1, …, from the first representation, while X_b is the matrix with rows φ_b(x_i) from the second representation. The weights w_a and w_b can be expressed as linear combinations of the training examples: w_a = X_a⊤α and w_b = X_b⊤β. Substituting into the primal CCA equation gives the optimisation max_{α,β} ρ = α⊤K_a K_b β subject to α⊤K_a²α = β⊤K_b²β = 1. This is the dual form of the primal CCA optimisation problem given above, which can be cast as a generalised eigenvalue problem and for which the first k generalised eigenvectors can be found efficiently. Both CCA and KCCA can be formulated as an eigenproblem. The theoretical analysis in [25,26] suggests the need to regularise kernel CCA, as it shows that the quality of generalisation of the associated pattern function is controlled by the sum of the squares of the weight vector norms. We

1. The LibSVM toolbox for Matlab was used to perform the classifications: http://www.csie.ntu.edu.tw/∼cjlin/libsvm/

Using Image Stimuli to Drive fMRI Analysis


refer the reader to [25,26] for a detailed analysis and the regularised form of KCCA. Although there are advantages in using kernel CCA, demonstrated in various experiments across the literature, we must clarify that in this particular work, as we use a linear kernel in both views, regularised CCA is the same as regularised linear KCCA. Using KCCA with a linear kernel nevertheless has advantages over plain CCA, the most important of which, in our case, is speed, together with the regularisation.2 Using linear kernels also allows the direct extraction of the weights. KCCA performs the analysis by projecting the fMRI volumes into the found semantic space defined by the eigenvector corresponding to the largest correlation value (these are output from the eigenproblem). We classify a new fMRI volume as follows. Let α_i be the eigenvector corresponding to the largest eigenvalue, and let φ(x̂) be the new volume. We project the fMRI data onto the semantic space via w = X_a⊤α_i (these are the training weights, similar to those of the SVM), and using the weights we classify the new example as ŵ = φ(x̂)·w, where ŵ is a weighted value (score) for the new volume. The score can be thresholded to allocate a category to each test example. To avoid the complication of finding a threshold, we zero-mean the outputs and threshold the scores at zero, where ŵ < 0 is associated with unpleasant (a label of −1) and ŵ ≥ 0 with pleasant (a label of 1). We hypothesise that KCCA is able to derive additional activities that may exist a priori, but were possibly previously unknown, in the experiment, by projecting the fMRI volumes into the semantic space using the remaining eigenvectors corresponding to lower correlation values.
We have attempted to corroborate this hypothesis on the existing data but found that the additional semantic features that cut across pleasant and unpleasant images did not share visible attributes. We have therefore conﬁned our discussion here to the ﬁrst eigenvector.
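A minimal numpy sketch of the scheme described above: linear kernels, a regularised generalised eigenproblem, projection onto the top semantic direction, and zero-thresholded scores. The block-matrix formulation and the placement of the regulariser follow the standard treatment only loosely and are assumptions here, as are all names:

```python
import numpy as np

def kcca_linear(Xa, Xb, reg=0.1):
    """Regularised KCCA with linear kernels: maximise a'Ka Kb b subject to
    a'(Ka^2 + reg*I)a = b'(Kb^2 + reg*I)b = 1, via a generalised eigenproblem."""
    n = Xa.shape[0]
    Ka, Kb = Xa @ Xa.T, Xb @ Xb.T
    I = np.eye(n)
    A = np.block([[np.zeros((n, n)), Ka @ Kb],
                  [Kb @ Ka, np.zeros((n, n))]])
    B = np.block([[Ka @ Ka + reg * I, np.zeros((n, n))],
                  [np.zeros((n, n)), Kb @ Kb + reg * I]])
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    top = vecs[:, np.argmax(vals.real)].real     # direction of largest correlation
    return top[:n], top[n:]                       # alpha (view a), beta (view b)

def score_volumes(Xa_train, alpha, Xa_test):
    """Project volumes onto the learnt semantic direction w = Xa' alpha,
    zero-mean using the training scores; the sign of a score gives the class."""
    w = Xa_train.T @ alpha
    return Xa_test @ w - (Xa_train @ w).mean()
```

On paired data whose shared structure is a binary label, the sign of the projected score recovers the label up to an arbitrary global sign flip of the eigenvector.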

3 Results

Experiments were run on a leave-one-out basis where in each repeat a block of positive and a block of negative fMRI volumes were withheld for testing. Data from the 16 subjects were combined. This amounted, per run, to 1330 training and 14 testing fMRI volumes, each set evenly split into positive and negative volumes (these pos/neg splits were not known to KCCA but simply ensured an equal number of images of both types of emotional salience). The analyses were repeated 96 times. Similarly, we ran a further experiment on a leave-subject-out basis where 15 subjects were combined for training and one was left out for testing. This gave a total of 1260 training and 84 testing fMRI volumes. The analyses were repeated 16 times. The KCCA regularisation parameter was found using 2-fold cross-validation on the training data. We first describe the fMRI activity analysis. After training the SVM we are able to extract and display the SVM weights as a representation of the brain

2. The KCCA toolbox used was from http://homepage.mac.com/davidrh/Code.html


regions important in the pleasant/unpleasant discrimination. A thorough analysis is presented in [10]. The results can be viewed in Figures 1 and 2, where in both figures the weights are not thresholded and show the contrast between viewing Pleasant vs. Unpleasant. The weight value of each voxel indicates the importance of the voxel in differentiating between the two brain states. In Figure 1 the unthresholded SVM weight maps are given. Similarly with KCCA, once the semantic representation has been learnt, we are able to project the fMRI data into the learnt semantic feature space, producing the primal weights. These weights, like those generated by the SVM approach, can be considered a representation of the fMRI activity. Figure 2 displays the KCCA weights. In Figure 3 the unthresholded weight values for the KCCA approach with the hemodynamic function applied to the image stimuli (i.e. applied to the SIFT features prior to analysis) are displayed. The hemodynamic response function is the impulse response function used to model the delay and dispersion of hemodynamic responses to neuronal activation [27]. The application of the hemodynamic function to the images' SIFT features allows for the reweighting of the image features according to the computed delay and dispersion model. We computed the hemodynamic function with the SPM2 toolbox with default parameter settings. As the KCCA weights are not driven by simple categorical image descriptors (pleasant/unpleasant) but by complex image feature vectors, it is of great interest that many regions, especially in the visual cortex, found by the SVM are also highlighted by KCCA. We interpret this similarity as indicating that many important components of the SIFT feature vector are associated with pleasant/unpleasant discrimination.
Other features in the frontal cortex are much less reproducible between SVM and KCCA, indicating that many brain regions detect image differences not rooted in the major emotional salience of the images. In order to validate the activity patterns found in Figure 2, we show that the learnt semantic space can be used to correctly discriminate withheld (testing) fMRI volumes. We also give the 2-norm error to provide an indication as to

Fig. 1. The unthresholded weight values for the SVM approach showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant). The discrimination analysis on the training data was performed with labels (+1/ − 1).


Fig. 2. The unthresholded weight values for the KCCA approach showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant). The discrimination analysis on the training data was performed without labels. The class discrimination is automatically extracted from the analysis.

Fig. 3. The unthresholded weight values for the KCCA approach with the hemodynamic function applied to the image stimuli showing the contrast between viewing Pleasant vs. Unpleasant. We use the blue scale for negative (Unpleasant) values and the red scale for the positive values (Pleasant).

the quality of the patterns found between the fMRI volumes and image stimuli from the testing set, computed as ‖K_aα − K_bβ‖₂ (normalised over the number of volumes and analysis repeats). The latter is especially important when the hemodynamic function has been applied to the image stimuli, as straightforward discrimination is then no longer available for comparison. Table 1 shows the average and median performance of SVM and KCCA on the testing of pleasant and unpleasant fMRI blocks for the leave-two-block-out experiment. Our proposed unsupervised approach achieved an average accuracy of 87.28%, slightly less than the 91.52% of the SVM, although both methods had the same median accuracy of 92.86%. The results of the leave-subject-out experiment are given in Table 2, where our KCCA achieved an average accuracy of 79.24%, roughly 5% less than the supervised SVM method. In both tables the Hemodynamic Function is abbreviated as HF. We observe in both tables that the quality of the patterns is better than random. The results demonstrate that the activity analysis is meaningful. To further confirm the validity of the methodology we repeat the experiments with the


Table 1. KCCA & SVM results on the leave-two-block-out experiment. Average and median performance over 96 repeats. The value represents accuracy, hence higher is better. For the 2-norm error, lower is better.

Method                Average   Median   Average ‖·‖₂ error   Median ‖·‖₂ error
KCCA                   87.28     92.86        0.0048              0.0048
SVM                    91.52     92.86          –                   –
Random KCCA            49.78     50.00        0.0103              0.0093
Random SVM             52.68     50.00          –                   –
KCCA with HF             –         –          0.0032              0.0031
Random KCCA with HF      –         –          1.1049              0.9492

Table 2. KCCA & SVM results on the leave-one-subject-out experiment. Average and median performance over 16 repeats. The value represents accuracy, hence higher is better. For the 2-norm error, lower is better.

Method                Average   Median   Average ‖·‖₂ error   Median ‖·‖₂ error
KCCA                   79.24     79.76        0.0025              0.0024
SVM                    84.60     86.90          –                   –
Random KCCA            48.51     47.62        0.0052              0.0044
Random SVM             48.88     48.21          –                   –
KCCA with HF             –         –          0.0016              0.0015
Random KCCA with HF      –         –          0.5869              0.0210

image stimuli randomised, hence breaking the relationship between fMRI volume and stimulus. In Tables 1 and 2, KCCA and SVM both then show performance equivalent to that of a random classifier. It is also interesting to observe that when the hemodynamic function is applied, the random KCCA error is substantially different from, and worse than, that of the non-random KCCA, implying that spurious correlations are found.

4 Discussion

In this paper we present a novel unsupervised methodology for fMRI activity analysis in which a simple categorical description of a stimulus type is replaced by a more informative vector of stimulus (SIFT) features. We use kernel canonical correlation analysis with an implicit representation of a complex state label to make use of the stimulus characteristics. The most interesting aspect of KCCA is its ability to extract visual regions very similar to those found to be important in categorical image classification using a supervised SVM. KCCA "finds" areas in the brain that are correlated with the features in the SIFT vector regardless of the stimulus category. Because many features of the stimuli were associated with the pleasant/unpleasant categories, we were able to use the KCCA results to classify the fMRI images between these categories. In the current study it is difficult to address the issue of modular versus distributed neural coding, as the complexity of the stimuli (and consequently of the SIFT vector) is very high.


A further interesting possible application of KCCA relates to the detection of "inhomogeneities" in stimuli of a particular type (e.g. happy/sad/disgusting emotional stimuli). If KCCA analysis revealed brain regions strongly associated with substructure within a single stimulus category, this could be valuable in testing whether a certain type of image was being consistently processed by the brain and in designing stimuli for particular experiments. There are many open-ended questions that have not been explored in our current research, which has primarily been focused on fMRI analysis and discrimination capacity. KCCA is a bi-directional technique, and we are therefore also able to compute a weight map for the stimuli from the learned semantic space. This capacity has the potential of greatly improving our understanding of the link between fMRI analysis and stimuli by potentially telling us which image features were important.

Acknowledgments. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. David R. Hardoon is supported by the EPSRC project Le Strum, EP-D063612-1. This publication only reflects the authors' views. We would like to thank Karl Friston for the constructive suggestions.

References

1. Cox, D.D., Savoy, R.L.: Functional magnetic resonance imaging (fMRI) 'brain reading': detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage 19, 261–270 (2003)
2. Carlson, T.A., Schrater, P., He, S.: Patterns of activity in the categorical representations of objects. Journal of Cognitive Neuroscience 15, 704–717 (2003)
3. Wang, X., Hutchinson, R., Mitchell, T.M.: Training fMRI classifiers to detect cognitive states across multiple human subjects. In: Proceedings of the 2003 Conference on Neural Information Processing Systems (2003)
4. Mitchell, T., Hutchinson, R., Niculescu, R., Pereira, F., Wang, X., Just, M., Newman, S.: Learning to decode cognitive states from brain images. Machine Learning 1-2, 145–175 (2004)
5. LaConte, S., Strother, S., Cherkassky, V., Anderson, J., Hu, X.: Support vector machines for temporal classification of block design fMRI data. NeuroImage 26, 317–329 (2005)
6. Mourao-Miranda, J., Bokde, A.L.W., Born, C., Hampel, H., Stetter, S.: Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data. NeuroImage 28, 980–995 (2005)
7. Haynes, J.D., Rees, G.: Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nature Neuroscience 8, 686–691 (2005)
8. Davatzikos, C., Ruparel, K., Fan, Y., Shen, D.G., Acharyya, M., Loughead, J.W., Gur, R.C., Langleben, D.D.: Classifying spatial patterns of brain activity with machine learning methods: Application to lie detection. NeuroImage 28, 663–668 (2005)
9. Kriegeskorte, N., Goebel, R., Bandettini, P.: Information-based functional brain mapping. PNAS 103, 3863–3868 (2006)


10. Mourao-Miranda, J., Reynaud, E., McGlone, F., Calvert, G., Brammer, M.: The impact of temporal compression and space selection on SVM analysis of single-subject and multi-subject fMRI data. NeuroImage (accepted, 2006)
11. Hardoon, D.R., Saunders, C., Szedmak, S., Shawe-Taylor, J.: A correlation approach for automatic image annotation. In: Li, X., Zaïane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 681–692. Springer, Heidelberg (2006)
12. Wismuller, A., Meyer-Base, A., Lange, O., Auer, D., Reiser, M.F., Sumners, D.: Model-free functional MRI analysis based on unsupervised clustering. Journal of Biomedical Informatics 37, 10–18 (2004)
13. Ciuciu, P., Poline, J., Marrelec, G., Idier, J., Pallier, C., Benali, H.: Unsupervised robust non-parametric estimation of the hemodynamic response function for any fMRI experiment. IEEE TMI 22, 1235–1251 (2003)
14. O'Toole, A.J., Jiang, F., Abdi, H., Haxby, J.V.: Partially distributed representations of objects and faces in ventral temporal cortex. Journal of Cognitive Neuroscience 17(4), 580–590 (2005)
15. Friman, O., Borga, M., Lundberg, P., Knutsson, H.: Adaptive analysis of fMRI data. NeuroImage 19, 837–845 (2003)
16. Friman, O., Carlsson, J., Lundberg, P., Borga, M., Knutsson, H.: Detection of neural activity in functional MRI using canonical correlation analysis. Magnetic Resonance in Medicine 45(2), 323–330 (2001)
17. Hardoon, D.R., Shawe-Taylor, J., Friman, O.: KCCA for fMRI Analysis. In: Proceedings of Medical Image Understanding and Analysis, London, UK (2004)
18. Lowe, D.: Object recognition from local scale-invariant features. In: Proceedings of the 7th IEEE International Conference on Computer Vision, Kerkyra, Greece, pp. 1150–1157 (1999)
19. Hardoon, D.R., Mourao-Miranda, J., Brammer, M., Shawe-Taylor, J.: Unsupervised analysis of fMRI data using kernel canonical correlation. NeuroImage (in press, 2007)
20. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In: International Conference on Computer Vision and Pattern Recognition, pp. 257–263 (2003)
21. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000)
22. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
23. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Proc. Fifth Ann. Workshop on Computational Learning Theory, pp. 144–152. ACM, New York (1992)
24. Fyfe, C., Lai, P.L.: Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems 10, 365–377 (2001)
25. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Computation 16, 2639–2664 (2004)
26. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
27. Stephan, K.E., Harrison, L.M., Penny, W.D., Friston, K.J.: Biophysical models of fMRI responses. Current Opinion in Neurobiology 14, 629–635 (2004)

Parallel Reinforcement Learning for Weighted Multi-criteria Model with Adaptive Margin

Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima

Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Japan
[email protected]

Abstract. Reinforcement learning (RL) for a linear family of tasks is studied in this paper. The key of our discussion is nonlinearity of the optimal solution even if the task family is linear; we cannot obtain the optimal policy by a naive approach. Though there exists an algorithm for calculating the equivalent result to Q-learning for each task all together, it has a problem with explosion of set sizes. We introduce adaptive margins to overcome this diﬃculty.

1 Introduction

Reinforcement learning (RL) for a linear family of tasks is studied in this paper. Such learning is useful for time-varying environments, multi-criteria problems, and inverse RL [5,6]. The family is defined as a weighted sum of several criteria; it is linear in the sense that the reward is linear with respect to the weight parameters. For instance, criteria of network routing include end-to-end delay, loss of packets, and the power level associated with a node [5]. Selecting appropriate weights beforehand is difficult in practice and requires trial and error. In addition, appropriate weights may change over time. Parallel RL for all possible weight values is desirable in such cases.

The key of our discussion is nonlinearity of the optimal solution; it is actually not linear but piecewise-linear. This fact implies that we cannot obtain the best policy by the following naive approach:

1. Find the value function for each criterion.
2. Calculate the weighted sum of them to obtain the total value function.
3. Construct a policy on the basis of the total value function.

A typical example is presented in section 5. Piecewise-linearity of the optimal solution has been pointed out independently in [4] and [5]. The latter aims at fast adaptation under time-varying environments. The former is our previous report, in which we tried to obtain the optimal solutions for various weight values all together. Though we have developed an algorithm that gives an exactly equivalent solution to Q-learning for each weight value, it has a difficulty with explosion of set size. This difficulty is not a problem of the algorithm but an intrinsic nature of Q-learning for the weighted criterion model.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 487–496, 2008.
© Springer-Verlag Berlin Heidelberg 2008


We first introduced a simple approximation with a 'margin' into the decision of convexity [6]. Then we improved it so that we obtain an interval estimation and can monitor the effect of the approximation [7]. In this paper, we propose adaptive adjustment of margins. In the margin-based approach, we have to manage large sets of vectors in the first stage of learning, and the peak set size tends to be large if we set a small margin to obtain an accurate final result. The proposed method alleviates this trade-off: by changing margins appropriately through the learning steps, we can enjoy small set sizes in the first stage with large margins, and an accurate result in the final stage with small margins. The weighted criterion model is defined in section 2, and parallel RL for it is described in section 3. The difficulty of set size is then pointed out and margins are introduced in section 4, where adaptive adjustment of margins is also proposed. Its behavior is verified with experiments in section 5. Finally, a conclusion is given in section 6.

2 Weighted Criterion Model

An “orthodox” RL setting is assumed for states and actions as follows.

– The time step is discrete (t = 0, 1, 2, 3, …).
– The state set S and the action set A are finite and known.
– The state transition rule P is unknown.
– The state s_t is observable.
– The task is a Markov decision process (MDP).

The reward r_{t+1} is given as a weighted sum of partial rewards r^1_{t+1}, …, r^M_{t+1}:

r_{t+1}(β) = Σ_{i=1}^{M} β_i r^i_{t+1} = β · r_{t+1},   (1)

weight vector β ≡ (β_1, …, β_M) ∈ R^M,   (2)

reward vector r_{t+1} ≡ (r^1_{t+1}, …, r^M_{t+1}) ∈ R^M.   (3)
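Equation (1) in code, with made-up numbers; the point is that the reward itself is exactly linear in the weight vector β:

```python
import numpy as np

def weighted_reward(beta, r_vec):
    """Eq. (1): r_{t+1}(beta) = sum_i beta_i * r^i_{t+1} = beta . r_{t+1}."""
    return float(np.dot(beta, r_vec))
```

For instance, with M = 2 partial rewards r_{t+1} = (2, −1), the weights (1, 0) and (0.5, 0.5) give rewards 2 and 0.5 respectively; the nonlinearity discussed in this paper appears only in the optimal value function, never in the reward itself.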

We assume that the partial rewards r^1_{t+1}, …, r^M_{t+1} are also observable, whereas their reward rules R(1), …, R(M) are unknown. Multi-criteria RL problems of this type have been introduced independently in [3] and [5]. We hope to find the optimal policy π*_β for each weight β that maximizes the expected cumulative reward with a given discount factor 0 < γ < 1,

π*_β = argmax_π E^π[ Σ_{τ=0}^{∞} γ^τ r_{τ+1}(β) ],   (4)

where E^π[·] denotes the expectation under a policy π. To be exact, π*_β is defined as a policy that attains Q^{π*_β}_β(s, a; γ) = Q*_β(s, a; γ) ≡ max_π Q^π_β(s, a; γ) for all state-action pairs (s, a), where the action-value function Q^π_β is defined as

Q^π_β(s, a; γ) ≡ E^π[ Σ_{τ=0}^{∞} γ^τ r_{τ+1}(β) | s_0 = s, a_0 = a ].   (5)

It is well known that an MDP has a deterministic policy π*_β that satisfies the above condition; such a π*_β is obtained from the optimal value function [2],

π*_β : S → A : s ↦ argmax_{a∈A} Q*_β(s, a; γ).   (6)

Thus we concentrate on estimation of Q*_β. Note that Q*_β is nonlinear with respect to β; a typical example is presented in section 5. Basic properties of the action-value function Q are described briefly in the rest of this section [4,5,6]. The discount factor γ is fixed throughout this paper, and it is omitted below.

Proposition 1. Q^π_β(s, a) is linear with respect to β for a fixed policy π.

Proof. Neither P nor π depends on β by assumption. Hence, the joint distribution of (s_0, a_0), (s_1, a_1), (s_2, a_2), … is independent of β, which implies linearity.

Definition 1. If f : R^M → R can be written as f(β) = max_{q∈Ω}(q · β) with a nonempty finite set Ω ⊂ R^M, we call f Finite-Max-Linear (FML) and write it as f = FML_Ω. It is trivial that f is convex and piecewise-linear if f is FML.

Proposition 2. The optimal action-value function is FML as a function of the weight β. Namely, there exists a nonempty finite set Ω*(s, a) ⊂ R^M for each state-action pair (s, a), and Q*_β is written as

Q*_β(s, a) = max_{q∈Ω*(s,a)} q · β.   (7)

Proof. We have assumed an MDP. It is well known that Q*_β can be written as Q*_β(s, a) = max_{π∈Π} Q^π_β(s, a) for the set Π of all deterministic policies. Π is finite, and Q^π_β is linear with respect to β from proposition 1. Hence, Q*_β is FML.

Proposition 3. Assume that an estimated action-value function Q_β is FML as a function of the weight β. If we apply Q-learning, the updated

Q^new_β(s_t, a_t) = (1 − α)Q_β(s_t, a_t) + α( β · r_{t+1} + γ max_{a∈A} Q_β(s_{t+1}, a) )   (8)

is still FML as a function of β, where α > 0 is the learning rate.

Proof. There exists a nonempty finite set Ω(s, a) ⊂ R^M such that Q_β(s, a) = max_{q∈Ω(s,a)}(q · β) for each (s, a). Then (8) implies Q^new_β(s_t, a_t) = max_{q̃∈Ω̃} q̃ · β, where

Ω̃ ≡ { (1 − α)q + α(r_{t+1} + γq′) | a ∈ A, q ∈ Ω(s_t, a_t), q′ ∈ Ω(s_{t+1}, a) },   (9)

because max_x f(x) + max_y g(y) = max_{x,y}(f(x) + g(y)) holds in general. The set Ω̃ is finite, and Q^new_β is FML.

These propositions imply that (1) the true Q*_β is FML, and (2) its estimation Q_β is also FML as long as the initial estimation is FML.

3 Parallel Q-Learning for All Weights

A parallel Q-learning method for the weighted criterion model has been proposed in [6]. The estimates Q_β for all β ∈ R^M are updated all together in parallel Q-learning. In this method, Q_β(s, a) for each (s, a) is treated in an FML expression:

Q_β(s, a) = max_{q∈Ω(s,a)} q · β = FML_{Ω(s,a)}(β)   (10)

with a certain set Ω(s, a) ⊂ R^M. We store and update Ω(s, a) instead of Q_β(s, a) on the basis of propositions 2 and 3. Though a naive updating rule has been suggested in the proof of proposition 3, it is extremely redundant and inefficient. We need several definitions to describe a better algorithm.

Definition 2. An element c ∈ Ω is redundant if FML_{Ω−{c}} = FML_Ω.

Definition 3. We use Ω† to represent the non-redundant elements of Ω. Note that FML_{Ω†} = FML_Ω [5].

Definition 4. We define the following operations:

cΩ ≡ {cq | q ∈ Ω},   c + Ω ≡ {c + q | q ∈ Ω},   Ω ⊕ Ω′ ≡ {q + q′ | q ∈ Ω, q′ ∈ Ω′},   (11)

Ω ⊔ Ω′ ≡ (Ω ∪ Ω′)†,   ⊔_{k=1}^{K} Ω_k ≡ (∪_{k=1}^{K} Ω_k)†,   (12)

Ω ⊞ Ω′ ≡ (Ω ⊕ Ω′)†.   (13)

With these operations, the updating rule of Ω is described as follows [6]:

Ω^new(s_t, a_t) = (1 − α)Ω(s_t, a_t) ⊞ α( r_{t+1} + γ ⊔_{a∈A} Ω(s_{t+1}, a) ).   (14)

The initial value of Ω at t = 0 is Ω(s, a) = {o} ⊂ R^M for all (s, a) ∈ S × A. It corresponds to the constant initial function Q_β(s, a) = 0.

Proposition 4. When (10) holds for all states s ∈ S and actions a ∈ A, Q^new_β(s_t, a_t) in (8) is equal to FML_{Ω^new(s_t,a_t)}(β) for (14). Namely, parallel Q-learning is equivalent to Q-learning for each β:

{Q_β(s, a)}  --update-->  Q^new_β(s_t, a_t)
   | FML expression          | FML expression        (15)
{Ω(s, a)}    --update-->  Ω^new(s_t, a_t)
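For M = 2, the dagger and the operations ⊔ and ⊞ reduce to convex-hull computations on point sets. A sketch with Andrew's monotone chain, where a brute-force pairwise sum stands in for a faster Minkowski-sum routine; all names are hypothetical and points are plain tuples:

```python
def hull_vertices(points):
    """Omega-dagger for M = 2: keep only the vertices of the convex hull
    of a finite point set (Andrew's monotone chain, collinear points dropped)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    def half(seq):
        out = []
        for p in seq:
            while len(out) >= 2 and cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
        return out
    return half(pts)[:-1] + half(pts[::-1])[:-1]

def merge(om1, om2):
    """Pruned union of eq. (12): union of the sets, then the dagger."""
    return hull_vertices(list(om1) + list(om2))

def msum(om1, om2):
    """Pruned sum of eq. (13): all pairwise sums, then the dagger."""
    return hull_vertices([(p[0] + q[0], p[1] + q[1]) for p in om1 for q in om2])
```

Interior and edge-collinear vectors never attain the maximum in q · β, which is why pruning to hull vertices preserves the FML function exactly.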


Fig. 1. Calculation of Ω ⊞ Ω′ in (14) for two-dimensional convex polygons: (1) set directions of edges; (2) merge and sort the edges according to their arguments; (3) connect the edges to generate a polygon; (4) shift the origin so that (max x in Ω) + (max x in Ω′) → (max x in Ω ⊞ Ω′) and (max y in Ω) + (max y in Ω′) → (max y in Ω ⊞ Ω′). Vertices of the polygons correspond to Ω, Ω′ and Ω ⊞ Ω′.

Proof. We have introduced a set Ω̃ in (9) to prove proposition 3. With the above operations, (9) is written as

Ω̃ = (1 − α)Ω(s_t, a_t) ⊕ α( r_{t+1} + γ ⊔_{a∈A} Ω(s_{t+1}, a) ).

Then (Ω̃)† = Ω^new(s_t, a_t) is obtained, and FML_{Ω^new(s_t,a_t)}(β) = FML_{Ω̃}(β) = Q^new_β(s_t, a_t) is implied.

It is well known that Ω† is equal to the vertices of the convex hull of Ω [6]. Efficient convex-hull algorithms have been developed in computational geometry [8]. Using them, we can calculate the merged set Ω ⊔ Ω′ = (Ω ∪ Ω′)†. The sum set Ω ⊞ Ω′ has also been studied in the form of Minkowski sum algorithms [9,10,11]. Its calculation is particularly easy for two-dimensional convex polygons (Fig. 1). Before closing the present section, we note an FML version of the Bellman equation in our notation. Theoretically, we can use successive iteration of this equation to find the optimal policy when we know P and R, though we must take care of numerical error in practice.

Proposition 5. The FML expression Q*_β = FML_{Ω*}(β) satisfies

Ω*(s, a) = ( R^a_s + γ ⊞_{s′∈S} P^a_{ss′} ⊔_{a′∈A} Ω*(s′, a′) )†,   (16)

where

R^a_s = (R^a_s(1), …, R^a_s(M)),   R^a_s(i) = E[r^i_{t+1} | s_t = s, a_t = a],   (17)

P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a),   (18)

⊞_{s′∈{s_1,…,s_k}} X_{s′} = X_{s_1} ⊞ X_{s_2} ⊞ ··· ⊞ X_{s_k}.   (19)


In particular, the next equation holds if the state transition is deterministic:

Ω*(s, a) = ( R^a_s + γ ⊔_{a′∈A} Ω*(s′, a′) )†,   (20)

where s′ is the next state for the action a at the current state s.

Proof. Substituting (7) and R^a_{s,β} ≡ E[r_{t+1}(β) | s_t = s, a_t = a] = R^a_s · β into the Bellman equation Q*_β(s, a) = R^a_{s,β} + γ Σ_{s′∈S} P^a_{ss′} max_{a′∈A} Q*_β(s′, a′), we obtain

max_{q∈Ω*(s,a)} q · β = max_{q′∈Ω′(s,a)} q′ · β,   (21)

Ω′(s, a) = { R^a_s + γ Σ_{s′∈S} P^a_{ss′} q_{s′} | q_{s′} ∈ ⊔_{a′∈A} Ω*(s′, a′) },   (22)

in the same way as (9). Hence, Ω* is equal to Ω′ except for redundancy.
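Since Π is finite, Proposition 2 also licenses a brute-force check of this structure: for a tiny deterministic MDP, each policy's value is linear in β, and the optimum switches policies as β moves, making the piecewise-linearity of Q*_β (and the failure of the naive weighted-sum approach) explicit. The toy MDP and helper names below are assumptions:

```python
import numpy as np
from itertools import product

# Toy deterministic MDP: 2 states, 2 actions, M = 2 partial rewards.
nxt = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
R = {(0, 0): np.array([1.0, 0.0]), (0, 1): np.array([0.0, 0.0]),
     (1, 0): np.array([0.0, 0.0]), (1, 1): np.array([0.0, 1.0])}
gamma = 0.8

def policy_value(pi):
    """Per-criterion value vectors v(s) of a deterministic policy pi,
    solving the linear system v = r_pi + gamma * P_pi v (cf. Proposition 1)."""
    P = np.zeros((2, 2)); r = np.zeros((2, 2))
    for s in (0, 1):
        P[s, nxt[(s, pi[s])]] = 1.0
        r[s] = R[(s, pi[s])]
    return np.linalg.solve(np.eye(2) - gamma * P, r)   # shape (states, M)

def best_policy(beta, s=0):
    """V*_beta(s) = max over the finite policy set of beta . v_pi(s)."""
    return max(product((0, 1), repeat=2),
               key=lambda pi: float(policy_value(pi)[s] @ beta))
```

Here β = (1, 0) makes staying at state 0 optimal, while β = (0, 1) makes moving to state 1 optimal, so V*_β(0) is the maximum of two different linear functions of β.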

4 Interval Operations

Under regularity conditions, Q-learning has been proved to converge to Q* [1]. That result implies pointwise convergence of parallel Q-learning to Q*_β for each β because of proposition 3. From proposition 2, Q*_β(s, a) is expressed with a finite Ω*(s, a). However, as we can see in Fig. 1, the number of elements in the set Ω(s, a) increases monotonically and it never 'converges' to Ω*(s, a). This is not a paradox; the following assertions can be true at the same time.

1. The vertices of the polygons P_1, P_2, … monotonically increase.
2. P_t converges to a polygon P* in the sense that the volume of the symmetric difference P_t △ P* = (P_t ∪ P*) − (P_t ∩ P*) converges to 0.
2'. The function FML_{P_t}(·) converges pointwise to FML_{P*}(·).

In short, pointwise convergence of a piecewise-linear function does not imply convergence of the number of pieces. Note that this is not a problem of the algorithm; it is an intrinsic nature of pointwise Q-learning of the weighted criterion model for each weight β. To overcome this difficulty, we first tried a simple approximation with a small 'margin' [6]. Then we introduced interval operations to monitor the approximation error [7]. A pair of sets Ω^L(s, a) and Ω^U(s, a) is updated instead of the original Ω(s, a) so that CH Ω^L(s, a) ⊂ CH Ω(s, a) ⊂ CH Ω^U(s, a) holds, where CH Z represents the convex hull of Z. This relation implies the lower and upper bounds Q^L_β(s, a) ≤ Q_β(s, a) ≤ Q^U_β(s, a), where Q^X_β(s, a) = FML_{Ω^X(s,a)}(β) for X = L, U. When the difference between Q^L and Q^U is sufficiently small, it is guaranteed that the effect of the approximation can be ignored. The updating rules of Ω^L and Ω^U are the same as those of Ω, except for the following approximations after every calculation of ⊔ and ⊞. We assume M = 2 here.

Lower approximation for Ω^L: A vertex is removed if the change of the area of CH Ω^L(s, a) is smaller than a threshold ε_L/2 (Fig. 2, left).
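For a convex polygon stored as a vertex list, removing a vertex changes the area by exactly the area of the triangle it forms with its two neighbours, so the lower approximation can be sketched as dropping vertices while that triangle stays below the threshold; names are hypothetical and only the vertex-removal (lower) variant is shown:

```python
def triangle_area(a, b, c):
    """Area of the triangle (a, b, c) via the cross product."""
    return abs((b[0] - a[0]) * (c[1] - a[1])
               - (b[1] - a[1]) * (c[0] - a[0])) / 2.0

def lower_approx(poly, eps):
    """Lower approximation: repeatedly drop a vertex whose removal changes
    the polygon's area by less than eps/2 (the shaded triangle of Fig. 2)."""
    poly = list(poly)
    changed = True
    while changed and len(poly) > 3:
        changed = False
        for i in range(len(poly)):
            a, b, c = poly[i - 1], poly[i], poly[(i + 1) % len(poly)]
            if triangle_area(a, b, c) < eps / 2.0:
                del poly[i]
                changed = True
                break
    return poly
```

Because the retained polygon is contained in the original, the resulting FML function is a guaranteed lower bound, which is what the interval scheme relies on.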

Parallel Reinforcement Learning for Weighted Multi-criteria Model


Fig. 2. Lower approximation (left) and upper approximation (right)

Upper approximation for Ω^U: an edge is removed if the resulting change of the area of CH Ω^U(s, a) is smaller than a threshold ε^U/2 (Fig. 2, right).

In this paper, we propose an automatic adjustment of the margins ε^L, ε^U. The procedures below are performed at every step t after the updating of Ω^L, Ω^U. The symbol X stands for L or U here, and ξ_s, ξ_w ≥ 1 and θ_Q, θ_Ω ≥ 0 are constants.

1. Check the changes of the set sizes and of the interval width compared with the previous ones, namely

Δ^X_Ω = |Ω^{X,new}(s_t, a_t)| − |Ω^X(s_t, a_t)|,   (23)

Δ_Q = (Q^{U,new}_β̄(s_t, a_t) − Q^{L,new}_β̄(s_t, a_t)) − (Q^U_β̄(s_t, a_t) − Q^L_β̄(s_t, a_t)),   (24)

where |Z| is the number of elements in Z, and β̄ is selected beforehand.

2. An increase of the set size suggests a need for thinning, whereas an increase of the interval width suggests a need for more accurate calculation. Modify the margins as

ε^{X,new} = { ε̃^X        (Δ_Q ≤ θ_Q)
            { ε̃^X / ξ_w  (Δ_Q > θ_Q) ,    where    ε̃^X = { ε^X        (Δ^X_Ω ≤ θ_Ω)
                                                          { ξ_s ε^X    (Δ^X_Ω > θ_Ω) .   (25)

To avoid underflow, we set ε^{X,new} = ε_min if ε^{X,new} is smaller than a constant ε_min.
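A minimal sketch of this margin-update step (Eqs. (23)-(25)); the function name and argument conventions are our own, and the two deltas are assumed to be precomputed by the caller:

```python
def adapt_margin(eps, d_omega, d_q, xi_s=1.7, xi_w=1.015,
                 theta_omega=2, theta_q=0.0, eps_min=1e-14):
    """One update of a margin eps^X. d_omega is the growth of the set
    size |Omega^X(s_t, a_t)|, d_q the growth of the interval width
    Q^U - Q^L at beta-bar. A growing set asks for more thinning
    (multiply by xi_s); a growing interval asks for more accuracy
    (divide by xi_w); eps_min guards against underflow."""
    eps_tilde = xi_s * eps if d_omega > theta_omega else eps
    eps_new = eps_tilde / xi_w if d_q > theta_q else eps_tilde
    return max(eps_new, eps_min)
```

For example, a set that grew by 3 elements with an unchanged interval enlarges the margin to 1.7ε, while a widening interval alone shrinks it by the factor 1/ξ_w.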

5 Experiments with a Basic Task of Weighted Criterion

We have verified the behavior of the proposed method. We set S = {S, G, A, B, X, Y}, A = {Up, Down, Left, Right}, s_0 = S, and γ = 0.8 (Fig. 3) [6]. Each action causes a deterministic state transition in the corresponding direction, except at G, where the agent is moved to S regardless of its action. Rewards 1, 4b, b are offered at s_t = G, X, Y, respectively. If a_t is an action into the 'outside wall' at a state s_t ≠ G, the state is unchanged and a negative reward (−1) is added further. This is a weighted-criterion model with M = 2, because the reward can be written in the form r_{t+1} = β · r⃗_{t+1} for r⃗_{t+1} = (r¹_{t+1}, r²_{t+1}) and β = (b, 1). The optimal policy changes depending on the weight b. Hence, the optimal value function is


K. Hiraoka, M. Yoshida, and T. Mishima

Fig. 3. Task for experiments. Numbers in parentheses are reward values.

Table 1. Optimal state-value functions and optimal policies

Range of weight          | Optimal V*_b(S)       | Optimal state transition
b < −16/25               | 0                     | S → A → S → ···
−16/25 ≤ b < −225/1796   | (2000b + 1280)/2101   | S → A → Y → B → G → S → ···
−225/1796 ≤ b < 15/47    | (400b + 80)/61        | S → X → G → S → ···
15/47 ≤ b < 3/4          | 32b/3                 | S → X → Y → X → ···
3/4 ≤ b                  | 16b − 4               | S → X → X → ···
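Table 1's piecewise-linear optimal value can be transcribed directly and sanity-checked: adjacent pieces agree at every breakpoint, confirming that V*_b(S) is continuous (though nonlinear) in b. The function name is ours:

```python
def v_star(b):
    """Optimal state value V*_b(S) from Table 1 (gamma = 0.8)."""
    if b < -16/25:
        return 0.0
    if b < -225/1796:
        return (2000*b + 1280) / 2101
    if b < 15/47:
        return (400*b + 80) / 61
    if b < 3/4:
        return 32*b / 3
    return 16*b - 4

# adjacent pieces meet at each breakpoint, so V* is continuous in b
for b0 in (-16/25, -225/1796, 15/47, 3/4):
    assert abs(v_star(b0 - 1e-9) - v_star(b0)) < 1e-6
```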

Fig. 4. Transition of margins ε^L and ε^U from various initial margins

Fig. 5. Total number of elements Σ_{s,a} |Ω^X(s, a)| (left: X = L; right: X = U)


Fig. 6. Interval width Q^U_{(0.2,1)}(A, Up) − Q^L_{(0.2,1)}(A, Up)

Fig. 7. Fixed-margin algorithm (ε^U = ε^L = 10^{−2} and ε^U = ε^L = 10^{−9}). Left: total number of elements Σ_{s,a} |Ω^X(s, a)| for X = U, L. Right: interval width.

Fig. 8. Average of 100 trials with inappropriate factors ξ_s = 1.5, ξ_w = 1.015 for γ = 0.5. Left: total number of elements in the upper approximation. Right: interval width.

nonlinear with respect to b (Table 1). Note that the second pattern (S→A→Y) in Table 1 cannot appear under the naive approach in Section 1. The proposed algorithm is applied to this task with random actions a_t and parameters α = 0.7, (ξ_s, ξ_w) = (1.7, 1.015), (θ_Q, θ_Ω) = (0, 2), β̄ = (0.2, 1), ε_min = 10^{−14}. The initial margin ε^L = ε^U at t = 0 is one of 10^{−1}, 10^{−4}, 10^{−7}, 10^{−10}, 10^{−13}. On this task, we can replace convex hulls with upper convex hulls in our algorithm because β is restricted to the upper half plane [6]. We also assume |b| ≤ 10 ≡ b_max, so for the lower approximation we can safely remove the edges at both ends in Fig. 2 whenever the absolute value of their slope is greater than b_max. Averages of 100 trials are shown in Figs. 4, 5, and 6. The proposed algorithm is robust to a wide range of initial margins. It realizes reduced set sizes and a small interval width at the same time; these two requirements trade off against each other in the conventional fixed-margin algorithm [7] (Fig. 7). One problem of the proposed algorithm is sensitivity to the factors ξ_s, ξ_w: when they are inappropriate, instability is observed after a long run (Fig. 8). Another problem is slow convergence of the interval width Q^U − Q^L compared with the fixed-margin algorithm.

6 Conclusion

A parallel RL method with adaptive margins has been proposed for the weighted-criterion model, and its behavior has been verified experimentally on a basic task. Adaptive margins realize reduced set sizes and accurate results at the same time. One problem of the adaptive margins is instability under inappropriate parameters: though the method is robust to the initial margins, it needs tuning of the factor parameters. Another problem is slow convergence of the interval between the upper and lower estimates. These points must be studied further.

References

1. Jaakkola, T., et al.: Neural Computation 6, 1185–1201 (1994)
2. Sutton, R.S., Barto, A.G.: Reinforcement Learning. The MIT Press, Cambridge (1998)
3. Kaneko, Y., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. 167 (2004)
4. Kaneko, N., et al.: In: Proc. IEICE Society Conference (in Japanese), vol. A-2-10 (2005)
5. Natarajan, S., et al.: In: Proc. Intl. Conf. on Machine Learning, pp. 601–608 (2005)
6. Hiraoka, K., et al.: The Brain & Neural Networks (in Japanese). Japanese Neural Network Society 13, 137–145 (2006)
7. Yoshida, M., et al.: Proc. FIT (in Japanese) (to appear, 2007)
8. Preparata, F.P., et al.: Computational Geometry. Springer, Heidelberg (1985)
9. Alexandrov, V.N., Dongarra, J., Juliano, B.A., Renner, R.S., Tan, C.J.K. (eds.): ICCS 2001. LNCS, vol. 2073. Springer, Heidelberg (2001)
10. Fukuda, K.: J. Symbolic Computation 38, 1261–1272 (2004)
11. Fogel, E., et al.: In: Proc. ALENEX, pp. 3–15 (2006)

Convergence Behavior of Competitive Repetition-Suppression Clustering

Davide Bacciu¹,² and Antonina Starita²

¹ IMT Lucca Institute for Advanced Studies, P.zza San Ponziano 6, 55100 Lucca, Italy, [email protected]
² Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy, [email protected]

Abstract. Competitive Repetition-Suppression (CoRe) clustering is a bio-inspired learning algorithm that is capable of automatically determining the unknown cluster number from the data. In a previous work it has been shown how CoRe clustering represents a robust generalization of rival penalized competitive learning (RPCL) by means of M-estimators. This paper studies the convergence behavior of the CoRe model, based on the analysis proposed for the distance-sensitive RPCL (DSRPCL) algorithm. Furthermore, a global minimum criterion for learning vector quantization in kernel space is proposed and used to assess the correct location property of the CoRe algorithm.

1 Introduction

CoRe learning has been proposed as a biologically inspired learning model mimicking a memory mechanism of the visual cortex, i.e. repetition suppression [1]. CoRe is a soft-competitive model that allows only a subset of the most active units to learn in proportion to their activation strength, while it penalizes the least active units, driving them away from the patterns producing low firing strengths. This feature has been exploited in [2] to derive a clustering algorithm that is capable of automatically determining the unknown cluster number from the data by means of a reward-punishment procedure that resembles the rival penalization mechanism of RPCL [3]. Recently, Ma and Wang [4] have proposed a generalized loss function for the RPCL algorithm, named DSRPCL, which has been used for studying the convergence behavior of the rival penalization scheme. In this paper, we present a convergence analysis for CoRe clustering that builds on Ma and Wang's approach, describing how CoRe satisfies the three properties of separation nature, correct division and correct location [4]. The intuitive analysis presented in [4] for DSRPCL is reinforced with theoretical considerations showing that CoRe pursues a global optimality criterion for vector quantization algorithms. In order to do this, we introduce a kernel interpretation of the CoRe loss that is used to generalize the results given in [5] for hard vector quantization to kernel-based algorithms.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 497–506, 2008.
© Springer-Verlag Berlin Heidelberg 2008

2 A Kernel Based Loss Function for CoRe Clustering

A CoRe clustering network consists of cluster detector units, each characterized by a prototype c_i that identifies the preferred stimulus for the unit u_i and represents the learned cluster centroid. In addition, each unit is characterized by an activation function ϕ_i(x_k, λ_i), defined in terms of a set of parameters λ_i, that determines the firing strength of the unit in response to the presentation of an input pattern x_k ∈ χ. Such an activation function measures the similarity between the prototype c_i and the inputs, determining whether the pattern x_k belongs to the i-th cluster. In the remainder of the paper we use an activation function that is a Gaussian centered at c_i with spread σ_i, i.e. ϕ_i(x_k | {c_i, σ_i}) = exp(−‖x_k − c_i‖² / 2σ_i²). CoRe clustering works essentially by evolving a small set of highly selective cluster detectors out of an initially larger population, by means of a competitive reward-punishment procedure that resembles the rival penalization mechanism [3]. The competition is engaged between two sets of units: at each step the most active units are selected to form the winners pool, while the remainder is inserted into the losers pool. More formally, we define the winners pool for the input x_k as the set of units u_i that fire more strongly than θ_win, or the single unit that is maximally active for the pattern, that is

win_k = {i | ϕ_i(x_k, {c_i, σ_i}) ≥ θ_win} ∪ {i | i = arg max_{j∈U} ϕ_j(x_k, {c_j, σ_j})},   (1)

where the second term of the union ensures that win_k is non-empty. Conversely, the losers pool for x_k is lose_k = U \ win_k, the complement of win_k with respect to the neuron set U. The units belonging to the losers pool are penalized and their response is suppressed. The strength of the penalization for the pattern x_k at time t is regulated by the repetition suppression RS^t_k ∈ [0, 1] and is proportional to the frequency of the pattern that has elicited the suppressive effect (see [2,6] for details). The repetition suppression is used to define a pseudo-target activation for the units in the losers pool as ϕ̂^t_i(x_k) = ϕ_i(x_k, {c_i, σ_i})(1 − RS^t_k). This reference signal forces the losers to reduce their activation proportionally to the amount of repetition suppression they receive. The error of the i-th loser unit can thus be written as

E^t_{i,k} = (1/2)(ϕ̂^t_i(x_k) − ϕ_i(x_k, {c_i, σ_i}))² = (1/2)(−ϕ_i(x_k, {c_i, σ_i}) RS^t_k)².   (2)

Conversely, in order to strengthen the activation of the winner units, we set the target activation for the neurons u_i (i ∈ win_k) to M, the maximum of the activation function ϕ_i(·). The error in this case can be written as

E^t_{i,k} = M − ϕ_i(x_k, {c_i, σ_i}).   (3)
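The winners/losers split of Eq. (1) can be sketched as follows; the Gaussian activation matches the definition above, while the dictionary layout and the threshold value are illustrative assumptions:

```python
import math

def activation(x, c, sigma):
    """Gaussian firing strength phi_i(x | {c_i, sigma_i})."""
    d2 = sum((xj - cj) ** 2 for xj, cj in zip(x, c))
    return math.exp(-0.5 * d2 / sigma ** 2)

def split_pools(x, units, theta_win=0.7):
    """Winners pool per Eq. (1): every unit firing at least theta_win,
    plus the maximally active unit (so the pool is never empty);
    the remaining units form the losers pool."""
    acts = {i: activation(x, c, s) for i, (c, s) in units.items()}
    winners = {i for i, a in acts.items() if a >= theta_win}
    winners.add(max(acts, key=acts.get))
    return winners, set(units) - winners
```

For a pattern close to one prototype and far from the other, only the near unit wins and the far unit is penalized as a loser.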

To analyze the CoRe convergence, we give an error formulation that accumulates the residuals in (2) and (3) over a given epoch e: summing over all CoRe units in U and the dataset χ = (x_1, ..., x_k, ..., x_K) yields

J_e(χ, U) = Σ_{i=1}^{I} Σ_{k=1}^{K} δ_ik (1 − ϕ_i(x_k)) + Σ_{i=1}^{I} Σ_{k=1}^{K} (1 − δ_ik) (ϕ_i(x_k) RS_k^{(e|χ|+k)})²,   (4)

where δ_ik is the indicator function of the set win_k, and {c_i, σ_i} has been omitted from ϕ_i to ease the notation. Note that in (4) we have implicitly used the fact that the units can be treated as independent. The CoRe learning equations can be derived using gradient descent to minimize J_e(χ, U) with respect to the parameters {c_i, σ_i} [2]. Hence, the prototype increment for the e-th epoch can be calculated as

Δc_i^e = α_c Σ_{k=1}^{K} [ δ_ik (ϕ_i(x_k)/(σ_i^e)²) (x_k − c_i^e) − (1 − δ_ik) (ϕ_i(x_k) RS_k^{(e|χ|+k)}/σ_i^e)² (x_k − c_i^e) ],   (5)

where α_c is a suitable learning rate ensuring that J_e decreases with e. Similarly, the spread update can be calculated as

Δσ_i^e = α_σ Σ_{k=1}^{K} [ δ_ik ϕ_i(x_k) ‖x_k − c_i^e‖²/(σ_i^e)³ − (1 − δ_ik) (ϕ_i(x_k) RS_k^{(e|χ|+k)})² ‖x_k − c_i^e‖²/(σ_i^e)³ ].   (6)
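A sketch of the prototype update (5) for a single unit over one epoch; `activation` is the Gaussian firing strength defined earlier, and the argument layout (parallel lists for the winner mask and the repetition-suppression values) is an assumption of ours:

```python
import math

def activation(x, c, sigma):
    """Gaussian firing strength of a unit with prototype c, spread sigma."""
    d2 = sum((xj - cj) ** 2 for xj, cj in zip(x, c))
    return math.exp(-0.5 * d2 / sigma ** 2)

def prototype_increment(c, sigma, data, winner_mask, rs, alpha_c=0.7):
    """Delta c_i^e of Eq. (5): winner patterns attract the prototype with
    weight phi/sigma^2, loser patterns repel it with weight (phi*RS/sigma)^2."""
    dc = [0.0] * len(c)
    for x, win, r in zip(data, winner_mask, rs):
        phi = activation(x, c, sigma)
        w = phi / sigma ** 2 if win else -(phi * r / sigma) ** 2
        for j in range(len(c)):
            dc[j] += alpha_c * w * (x[j] - c[j])
    return dc
```

A pattern for which the unit wins pulls the prototype towards it, while the same pattern seen as a loser (with nonzero RS) pushes the prototype away, matching the interpretation of the two terms in (5).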

As one would expect, unit prototypes are attracted by similar patterns (first term in (5)) and repelled by dissimilar inputs (second term in (5)). Moreover, the neural selectivity is enhanced by reducing the Gaussian spread each time the corresponding unit happens to be a winner. Conversely, the variance of loser neurons is enlarged, reducing the units' selectivity and penalizing them for not having sharp responses. The error formulation introduced so far can be restated by exploiting the kernel trick [7] to express the CoRe loss in terms of differences in a given feature space F. Kernel methods are algorithms that exploit a nonlinear mapping Φ : χ → F to project the data from the input space χ onto a convenient, implicit feature space F. The kernel trick is used to express all operations on Φ(x_1), Φ(x_2) ∈ F in terms of the inner product ⟨Φ(x_1), Φ(x_2)⟩. This inner product can be calculated without explicitly using the mapping Φ, by means of the kernel κ(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩. To derive the kernel interpretation of the CoRe loss in (4), consider first the distance d_Fκ between two vectors x_1, x_2 ∈ χ in the feature space F_κ induced by the kernel κ and described by the mapping Φ : χ → F_κ, that is d_Fκ(x_1, x_2) = ‖Φ(x_1) − Φ(x_2)‖²_{Fκ} = κ(x_1, x_1) − 2κ(x_1, x_2) + κ(x_2, x_2), where the kernel trick [7] has been used to substitute the inner products in feature space with a suitable kernel κ calculated in the data space. If κ is chosen to be a Gaussian kernel, then κ(x, x) = 1; hence d_Fκ can be rewritten as d_Fκ = ‖Φ(x_1) − Φ(x_2)‖²_{Fκ} = 2 − 2κ(x_1, x_2). Now, if we take x_1 to be an element of the input dataset, e.g. x_k ∈ χ, and x_2 to be the prototype c_i of the i-th CoRe unit, we can rewrite d_Fκ so that it depends on the activation function ϕ_i. Applying the substitution κ(x_k, c_i) = ϕ_i(x_k, {c_i, σ_i}), we obtain ϕ_i(x_k, {c_i, σ_i}) = 1 − (1/2)‖Φ(x_k) − Φ(c_i)‖²_{Fκ}.
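This identity is easy to verify numerically: for a Gaussian kernel the feature-space distance collapses to 2 − 2κ, so the activation equals 1 − ½‖Φ(x_k) − Φ(c_i)‖². A small check (the point values are arbitrary):

```python
import math

def kappa(x, c, sigma=1.0):
    """Gaussian kernel; identical to the unit activation phi."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, c))
    return math.exp(-0.5 * d2 / sigma ** 2)

def feature_dist2(x, c):
    """||Phi(x) - Phi(c)||^2 = kappa(x,x) - 2 kappa(x,c) + kappa(c,c)."""
    return kappa(x, x) - 2 * kappa(x, c) + kappa(c, c)

x, c = (1.0, 2.0), (0.5, -1.0)
# kappa(x, x) = 1, so the distance is 2 - 2*kappa(x, c) ...
assert abs(feature_dist2(x, c) - (2 - 2 * kappa(x, c))) < 1e-12
# ... and the activation is 1 - 0.5 * feature-space distance
assert abs(kappa(x, c) - (1 - 0.5 * feature_dist2(x, c))) < 1e-12
```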
Now, substituting this result into the formulation of the CoRe loss in (4), we obtain

J_e(χ, U) = (1/2) Σ_{i=1}^{I} Σ_{k=1}^{K} δ_ik ‖Φ(x_k) − Φ(c_i)‖²_{Fκ} + Σ_{i=1}^{I} Σ_{k=1}^{K} (1 − δ_ik) (RS_k^{(e|χ|+k)})² (1 − (1/2)‖Φ(x_k) − Φ(c_i)‖²_{Fκ})².   (7)

Equation (7) states that CoRe minimizes the feature space distance between the prototype ci and those xk that are close in the kernel space induced by the activation functions ϕi , while it maximizes the feature space distance between the prototypes and those xk that are far from ci in the kernel space.

3 Separation Nature

To prove the separation nature of the CoRe process we need to demonstrate that, given a bounded hypersphere G containing all the sample data, after sufficiently many iterations of the algorithm each cluster prototype will finally either fall into G or remain outside it and never get into G. In particular, the prototypes remaining outside the hypersphere will be driven far away from the samples by the RS repulsion. We consider a prototype c_i to be far away from the data if, for a given epoch e, it is in the losers pool for every x_k ∈ χ. To prove the CoRe separation nature we first demonstrate the following lemma.

Lemma 1. When a prototype c_i is far away from the data at a given epoch e, it will always be a loser for every x_k ∈ χ and will be driven away from the data samples.

Proof. The definition of far away implies that, given c_i^e, we have i ∈ lose_k^e for all x_k ∈ χ, where the e in the superscript refers to the learning epoch. Given the prototype update in (5), we obtain the weight vector increment Δc_i^e at epoch e as

Δc_i^e = −α_c Σ_{k=1}^{K} (ϕ_i(x_k) RS_k^{(e|χ|+k)}/σ_i^e)² (x_k − c_i^e).   (8)

As a result of (8), the prototype c_i^{e+1} is driven further from the data. On the other hand, by definition (1), for each data sample there exists at least one winner unit at every epoch e, whose prototype is moved towards the samples for which it has been a winner. Moreover, not every prototype can be deflected from the data, since this would make the first term of J_e(χ, U) (see (4)) grow and, consequently, the whole J_e(χ, U) would diverge, since the loser error term in (4) is lower bounded. This would contradict the fact that J_e(χ, U) decreases with e, since CoRe applies gradient descent to the loss function. Therefore, there must exist at least one winning prototype c_l^e that remains close to the samples at epoch e. On the other hand, c_i^e is already far away from the samples and, by (8), c_i^{e+1} will be further from the data and will not be a winner for any x_k ∈ χ. To prove this, consider the definition of win_k in (1): for c_i^{e+1} to be a winner, it must hold that either (i) ϕ_i(x_k) ≥ θ_win or (ii) i = arg max_{j∈U} ϕ_j(x_k, λ_j). The former does not hold because the receptive field area where the firing strength of the i-th unit is above the threshold θ_win does not contain any sample at epoch e; consequently, it cannot contain any sample at epoch e + 1, since its center c_i^{e+1} has been deflected further from the data. The latter does not hold since there exists at least one prototype, i.e. c_l, that remains close to the data, generating higher activations than unit u_i. As a consequence, a far away prototype c_i will be deflected from the data until it reaches a stable point where the corresponding firing strength ϕ_i is negligible. □

Now we can proceed to demonstrate the following theorem.

Theorem 1. For a CoRe process there exists a hypersphere G surrounding the sample data χ such that, after sufficiently many iterations, each prototype c_i will finally either (i) fall into G or (ii) stay outside G and reach a stable point.

Proof. The CoRe process is a gradient descent (GD) algorithm on J_e(χ, U); hence, for a sufficiently small learning step, the loss decreases with the number of epochs. Therefore, J_e(χ, U) being always positive, the GD process converges to a minimum J∗. The sequences of prototype vectors {c_i^e} converge either to a point close to the samples or to a point of negligible activation far away from the data. If a unit u_i has a sufficiently long subsequence of prototypes {c_i^e} diverging from the dataset then, after a certain time, it will no longer be a winner for any sample and, by Lemma 1, will converge to a point far away from the data. The attractors of the sequences {c_i^e} of the diverging units lie at a certain distance r from the samples, determined by those points x where the Gaussian unit centered at x produces a negligible activation in response to any pattern x_k ∈ χ. Hence, G can be chosen as any hypersphere surrounding the samples with radius smaller than r.
On the other hand, since J_e(χ, U) decreases to J∗, there must exist at least one prototype that is not far away from the data (otherwise the first term of J_e(χ, U) in (4) would diverge). In this case, the sequences {c_i^e} must have accumulation points close to the samples. Therefore any hypersphere G enclosing all the samples also surrounds the accumulation points of {c_i^e} and, after a certain epoch E, the sequence is always within this hypersphere. □

In summary, Theorem 1 tells us that the separation nature holds for a CoRe process: some prototypes are possibly pushed away from the data until their contribution to the error in (4) becomes negligible. Far away prototypes will always be losers and will never head back to the data. Conversely, the remaining prototypes will converge to the samples, heading to a saddle point of the loss J_e(χ, U) by means of a gradient descent process.

4 Correct Division and Location

Following the convergence analysis in [4], we now turn our attention to the issues of correct division and location of the weight vectors; that is, the number of prototypes falling into G will be n_c, the number of actual clusters in the sample data, and they will finally converge to the centers of the clusters. At this point, we leave the intuitive study presented for DSRPCL [4] and introduce a sound analysis of the properties of the saddle points identified by CoRe, giving a necessary and sufficient condition for identifying the global minimum of a vector quantization loss in feature space.

4.1 A Global Minimum Condition for Vector Quantization in Kernel Space

The classical problem of hard vector quantization (VQ) in Euclidean space is to determine a codebook V = {v_1, ..., v_N} minimizing the total distortion, measured by Euclidean norms, resulting from the approximation of the inputs x_k ∈ χ by the code vectors v_i. Here we focus on a more general problem, vector quantization in feature space. Given the nonlinear mapping Φ and the induced feature-space norm ‖·‖_{Fκ} introduced in the previous sections, we aim at minimizing the distortion

min D(χ, Φ_V) = (1/K) Σ_{i=1}^{N} Σ_{k=1}^{K} δ_ik ‖Φ(x_k) − Φ_{v_i}‖²_{Fκ},   (9)

where Φ_V = {Φ_{v_1}, ..., Φ_{v_N}} represents the codebook in the kernel space and δ_ik is equal to 1 if the i-th cluster is the closest to the k-th pattern in the feature space F_κ, and 0 otherwise. It is widely known that VQ generates a Voronoi tessellation of the quantized space and that a necessary condition for the minimization of the distortion is that the code vectors be the centroids of the Voronoi regions [8]. In [5], a necessary and sufficient condition is given for the global minimum of a Euclidean VQ distortion function. In the following, we generalize this result to vector quantization in feature space. To prove the global minimum condition in kernel space we need to extend the results in [9] (Propositions 3.1.7 and 3.2.4) to the more general case of a kernel-induced distance metric. We therefore introduce the following lemma.

Lemma 2. Let κ be a kernel and Φ : χ → F_κ a map into the corresponding feature space F_κ. Given a dataset χ = {x_1, ..., x_K} partitioned into N subsets C_i, define the feature-space mean Φ_χ = (1/K) Σ_{k=1}^{K} Φ(x_k) and the i-th partition centroid Φ_{v_i} = (1/|C_i|) Σ_{k∈C_i} Φ(x_k). Then

Σ_{k=1}^{K} ‖Φ(x_k) − Φ_χ‖²_{Fκ} = Σ_{i=1}^{N} Σ_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{Fκ} + Σ_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{Fκ}.   (10)

Proof. Given a generic feature vector Φ_1, consider the identity Φ(x_k) − Φ_1 = (Φ(x_k) − Φ_{v_i}) + (Φ_{v_i} − Φ_1); its squared norm in feature space is

‖Φ(x_k) − Φ_1‖²_{Fκ} = ‖Φ(x_k) − Φ_{v_i}‖²_{Fκ} + ‖Φ_{v_i} − Φ_1‖²_{Fκ} + 2(Φ(x_k) − Φ_{v_i})ᵀ(Φ_{v_i} − Φ_1).

Summing over all the elements of the i-th partition we obtain

Σ_{k∈C_i} ‖Φ(x_k) − Φ_1‖²_{Fκ} = Σ_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{Fκ} + Σ_{k∈C_i} ‖Φ_{v_i} − Φ_1‖²_{Fκ} + 2 Σ_{k∈C_i} (Φ(x_k) − Φ_{v_i})ᵀ(Φ_{v_i} − Φ_1)
                               = Σ_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{Fκ} + |C_i| ‖Φ_{v_i} − Φ_1‖²_{Fκ},   (11)

where the last term of (11) vanishes since Σ_{k∈C_i} (Φ(x_k) − Φ_{v_i}) = 0 by definition of Φ_{v_i}. Now, applying the substitution Φ_1 = Φ_χ and summing over all the N partitions yields

Σ_{k=1}^{K} ‖Φ(x_k) − Φ_χ‖²_{Fκ} = Σ_{i=1}^{N} Σ_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{Fκ} + Σ_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{Fκ},   (12)

where the left-hand side follows since ∪_{i=1}^{N} C_i = χ and ∩_{i=1}^{N} C_i = ∅. □
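Since every term in (10) can be written with kernel evaluations only, the decomposition can be checked numerically without ever constructing Φ. A sketch with a Gaussian kernel and an arbitrary two-way partition (the helper names and the sample data are ours):

```python
import math
import random

def kappa(x, y):
    """Gaussian kernel, so kappa(x, x) = 1."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))

def gram_mean(A, B):
    """Mean of kappa over the product set A x B."""
    return sum(kappa(a, b) for a in A for b in B) / (len(A) * len(B))

def d2_point_centroid(x, S):
    """||Phi(x) - Phi_S||^2 computed via the kernel trick."""
    return kappa(x, x) - 2 * gram_mean([x], S) + gram_mean(S, S)

def d2_centroids(A, B):
    """||Phi_A - Phi_B||^2 computed via the kernel trick."""
    return gram_mean(A, A) - 2 * gram_mean(A, B) + gram_mean(B, B)

random.seed(0)
chi = [(random.random(), random.random()) for _ in range(8)]
parts = [chi[:3], chi[3:]]                       # any partition works
total = sum(d2_point_centroid(x, chi) for x in chi)
within = sum(d2_point_centroid(x, C) for C in parts for x in C)
between = sum(len(C) * d2_centroids(C, chi) for C in parts)
assert abs(total - (within + between)) < 1e-9    # Eq. (10) holds exactly
```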

Using the results of Lemma 2 we can proceed to the formulation of the global minimum criterion by generalizing Proposition 1 of [5] to vector quantization in feature space.

Proposition 1. Let {Φ_{v_1}^g, ..., Φ_{v_N}^g} be a global minimum solution of the problem in (9). Then

Σ_{i=1}^{N} |C_i^g| ‖Φ_{v_i}^g − Φ_χ‖²_{Fκ} ≥ Σ_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{Fκ}   (13)

for any locally optimal solution {Φ_{v_1}, ..., Φ_{v_N}} of (9), where {C_1^g, ..., C_N^g} and {C_1, ..., C_N} are the partitions of χ corresponding to the centroids Φ_{v_i}^g = (1/|C_i^g|) Σ_{k∈C_i^g} Φ(x_k) and Φ_{v_i} = (1/|C_i|) Σ_{k∈C_i} Φ(x_k), respectively, and where Φ_χ is the dataset mean (see the definition in Lemma 2).

Proof. Since {Φ_{v_1}^g, ..., Φ_{v_N}^g} is a global minimum of (9), we have

Σ_{i=1}^{N} Σ_{k∈C_i^g} ‖Φ(x_k) − Φ_{v_i}^g‖²_{Fκ} ≤ Σ_{i=1}^{N} Σ_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{Fκ}   (14)

for any local minimum {Φ_{v_1}, ..., Φ_{v_N}}. From Lemma 2 we have

Σ_{k=1}^{K} ‖Φ(x_k) − Φ_χ‖²_{Fκ} = Σ_{i=1}^{N} Σ_{k∈C_i^g} ‖Φ(x_k) − Φ_{v_i}^g‖²_{Fκ} + Σ_{i=1}^{N} |C_i^g| ‖Φ_{v_i}^g − Φ_χ‖²_{Fκ},   (15)

Σ_{k=1}^{K} ‖Φ(x_k) − Φ_χ‖²_{Fκ} = Σ_{i=1}^{N} Σ_{k∈C_i} ‖Φ(x_k) − Φ_{v_i}‖²_{Fκ} + Σ_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{Fκ}.   (16)

Since (14) holds, we obtain

Σ_{i=1}^{N} |C_i^g| ‖Φ_{v_i}^g − Φ_χ‖²_{Fκ} ≥ Σ_{i=1}^{N} |C_i| ‖Φ_{v_i} − Φ_χ‖²_{Fκ}. □

4.2 Correct Division and Location for CoRe Clustering

To evaluate the correct division and location properties, we first analyze the case in which the number of units N equals the true cluster number n_c. Consider the loss in (4) as decomposed into a winner-dependent and a loser-dependent term, i.e. J_e(χ, U) = J_e^win(χ, U) + J_e^lose(χ, U). By definition, J_e^win(χ, U) = Σ_{i=1}^{n_c} Σ_{k=1}^{K} δ_ik (1 − ϕ_i(x_k)) must have at least one minimum point. Applying the necessary condition ∂J_e^win(χ, U)/∂c_i = 0, we obtain an estimate of the prototypes by fixed-point iteration, that is

c_i^e = Σ_{k=1}^{K} δ_ik ϕ_i(x_k) x_k / Σ_{k=1}^{K} δ_ik ϕ_i(x_k).   (17)

When the number of prototypes equals the number of clusters, the fixed-point iteration (17) converges, positioning each unit's weight vector close to a true cluster centroid. In addition, it can be shown that (17) approximates a local minimum of the kernel vector quantization loss in (9). To see this, consider the CoRe loss formulation in kernel space (7): we have J_e^win(χ, U) = (1/2) Σ_{i=1}^{n_c} Σ_{k=1}^{K} δ_ik ‖Φ(x_k) − Φ(c_i)‖²_{Fκ}, where c_i is estimated by (17).
It turns out that the minimization of ρ(z) reduces to the the evaluation of the gradient of ρ (z) = Φvi , Φ(z) . By substituting the deﬁnition of Φvi and applying the kernel trick we obtain ρ (z) = (1/|Ci |) κ(xk , xj ) + κ(z, z) + (1/|Ci |) κ(xk , z) k,j∈Ci

k∈Ci

where κ(z, z) = 1 since we are using a gaussian kernel. Diﬀerentiating ρ (z) with respect to z and solving by ﬁxed point iteration yields e−1 )xk k∈Ci κ(xk , z e z = (18) e−1 ) k∈Ci κ(xk , z that is the same as the prototype estimate obtained in (17) for gaussian kernels centered in z e . The indicator function δik in (17) is not null only for those points xk for which unit ui was in the winner set. This does not ensures the partition conditions over χ, since, by deﬁnition of wink , some points can be associated with

Convergence Behavior of Competitive Repetition-Suppression Clustering

505

two or more winners. However, by (6) we know that the variance of the winners tends to reduce as learning proceeds. Therefore, using the same arguments by Gersho [8] it can be demonstrated that, after a certain epoch E, the CoRe winners competition will become a WTA process where δik will be ensuring the partition conditions over χ. Summarizing, the minimization of the CoRe winners error Jewin (χ, U ) generates an approximate solution to the vector quantization problem in feature space in (9). As a consequence, the prototypes ci become a local solution satisfying the conditions of Proposition 1. Hence, substituting the deﬁnition of Φχ in the results of Proposition 1 we obtain that {c1 , . . . , cnc } is an approximated global minimum for (9) if and only if nc K |Ci | i=1 k=1

K

Φ(ci ) − Φ(xk )2Fκ ≥

nc K |C˜i | i=1 k=1

K

Φ(˜ ci ) − Φ(xk )2Fκ

(19)

holds for every {˜ c1 , . . . , c˜nc } that are approximated pre-images of a local minimum for (9). In summary, a global optimum to (9) should minimize the feature space distance between the prototypes and samples belonging to their cluster while maximizing the weight vector distance from the sample mean, or, equivalently, the distance from all the samples in the dataset χ. The loser component Jelose (χ, U ) in the kernel CoRe loss (7) depends on the term (1 − (1/2)Φ(ci ) − Φ(xk )2Fκ that maximizes the distance between the prototypes ci and those xk that do not fall in the respective Voronoi sets Ci . Hence, Jelose (χ, U ) produces a distortion in the estimate of ci that pursues the global optimality criterion except for the fact that it discounts the repulsive eﬀect of the xk ∈ Ci . In fact, (19) suggests that ci has to be repelled by all the xk ∈ χ. On the other hand, the estimate ci is a linear combination of the xk ∈ Ci : applying the repulsive eﬀect in (19) would subtract their contribution, either canceling the attractive eﬀect (which would be catastrophic) or simply scaling the magnitude of the learning step without changing the ﬁnal direction. Hence, the CoRe loss makes a reasonable assumption discarding the repulsive eﬀect of the xk ∈ Ci when calculating the estimate of ci . Summarizing, CoRe locates the prototypes close to the centroids of the nc clusters by means of (17), escaping from local minima of the loss function by approximating the global minimum condition of Proposition 1. Finally, we need to study the behavior of Je (χ, U ) as the number of units N varies with respect to the true cluster number nc . Using the same motivations in [4], we see that the winner-dependent loss Jewin tends to reduce as the the number of units increases. However, if the number of units falling into G is larger than nc there will be a number of clusters that are erroneously split. 
Therefore, the samples from these clusters will tend to produce an increased level of error in Jelose contrasting the reduction of Jewin . On the other hand, Jelose will tend to reduce when the number of units inside G is lower than nc . This however will produce increased levels of Jewin since the prototype allocation won’t match the underlying sample distribution. Hence, the CoRe error will have its minimum when the number of units inside G will approximate nc .

D. Bacciu and A. Starita

5 Conclusion

The paper presents a sound analysis of the convergence behavior of CoRe clustering, showing how the minimization of the CoRe cost function satisfies the properties of separation nature, correct division and correct location [4]. As the loss reduces to a minimum, the CoRe algorithm is shown to converge, allocating the correct number of prototypes to the centers of the clusters. Moreover, a sound optimality criterion is given that shows how CoRe gradient descent pursues a global minimum of the vector quantization problem in feature space. The results presented in the paper hold for a batch gradient descent process. However, it can be proved that, under Ljung's conditions [11], they extend to stochastic (online) gradient descent. Moreover, we plan to investigate further the properties of the CoRe kernel formulation, extending the convergence analysis to a wider class of activation functions other than Gaussians, i.e. normalized kernels.

References

1. Grill-Spector, K., Henson, R., Martin, A.: Repetition and the brain: neural models of stimulus-specific effects. Trends in Cognitive Sciences 10(1), 14–23 (2006)
2. Bacciu, D., Starita, A.: A robust bio-inspired clustering algorithm for the automatic determination of unknown cluster number. In: Proceedings of the 2007 International Joint Conference on Neural Networks, pp. 1314–1319. IEEE, Los Alamitos (2007)
3. Xu, L., Krzyzak, A., Oja, E.: Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Trans. on Neural Networks 4(4) (1993)
4. Ma, J., Wang, T.: A cost-function approach to rival penalized competitive learning (RPCL). IEEE Trans. on Systems, Man, and Cybernetics 36(4), 722–737 (2006)
5. Munoz-Perez, J., Gomez-Ruiz, J.A., Lopez-Rubio, E., Garcia-Bernal, M.A.: Expansive and competitive learning for vector quantization. Neural Processing Letters 15(3), 261–273 (2002)
6. Bacciu, D., Starita, A.: Competitive repetition suppression learning. In: Kollias, S., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 130–139. Springer, Heidelberg (2006)
7. Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
8. Yair, E., Zeger, K., Gersho, A.: Competitive learning and soft competition for vector quantizer design. IEEE Trans. on Signal Processing 40(2), 294–309 (1992)
9. Späth, H.: Cluster Analysis Algorithms. Ellis Horwood (1980)
10. Schölkopf, B., Mika, S., Burges, C.J.C., Knirsch, P., Müller, K.R., Rätsch, G., Smola, A.J.: Input space versus feature space in kernel-based methods. IEEE Trans. on Neural Networks 10(5), 1000–1017 (1999)
11. Ljung, L.: Strong convergence of a stochastic approximation algorithm. The Annals of Statistics 6(3), 680–696 (1978)

Self-Organizing Clustering with Map of Nonlinear Varieties Representing Variation in One Class

Hideaki Kawano, Hiroshi Maeda, and Norikazu Ikoma

Kyushu Institute of Technology, Faculty of Engineering,
1-1 Sensui-cho, Tobata-ku, Kitakyushu, 804-8550, Japan
[email protected]

Abstract. The Adaptive Subspace Self-Organizing Map (ASSOM) is an evolution of the Self-Organizing Map in which each computational unit defines a linear subspace. Recently, a modified version, the Adaptive Manifold Self-Organizing Map (AMSOM), has been proposed, where each unit defines a linear manifold instead of a linear subspace. The linear manifold in a unit is represented by a mean vector and a set of basis vectors. After training, these units result in a set of linear variety detectors. From another point of view, the AMSOM can be considered to represent the latent commonality of data as linear structures. In numerous cases, however, linear structures are not enough to describe the latent commonality of data. In this paper, nonlinear varieties are considered in order to represent the diversity of data within a class. The effectiveness of the proposed method is verified by applying it to some simple classification problems.

1 Introduction

The subspace method is popular in pattern recognition, feature extraction, compression, classification and signal processing [1]. Unlike other techniques where classes are primarily defined as regions or zones in the feature space, the subspace method uses linear subspaces defined by a set of normalized basis vectors. One linear subspace is usually associated with one class. An input vector is classified to a particular class if its projection error onto the subspace associated with that class is the minimum. The subspace method, as compared to other pattern recognition techniques, has advantages in applications where the relative intensities or energies of the vector components are more important than the overall level of the signal. It also provides an economical representation for groups of vectors with high dimensionality, since one can often use a small set of basis vectors to approximate the subspace where the vectors reside. Another paradigm is to use a mixture of local subspaces to collectively model the data space.

The Adaptive-Subspace Self-Organizing Map (ASSOM) [2][3] is a mixture-of-local-subspaces method for pattern recognition. ASSOM, which is an evolution of the Self-Organizing Map (SOM) [4], consists of an input layer and a competitive layer arranging computational units in a line or a lattice structure. Each computational unit defines a subspace spanned by some basis vectors. ASSOM creates a set of subspace representations by competitive selection and cooperative learning. In SOM, a set of reference vectors is spatially organized to partition the input space. In ASSOM, a set of reference sub-models is topologically ordered, with each sub-model responsible for describing a specific region of the input space by its local principal subspace. The ASSOM is attractive not only because it inherits the topographic representation property of the SOM, but also because the learning results of ASSOM can faithfully describe the core features of various transformation groups. The simulation results in references [2] and [3] illustrate that different feature filters can self-organize to different low-dimensional subspaces and that a wavelet-type representation emerges in the learning.

Recently, the Adaptive Manifold Self-Organizing Map (AMSOM), a modified version of ASSOM, has been proposed [5]. AMSOM has the same structure as the ASSOM, except for the way each computational unit is represented. Each unit in AMSOM defines an affine subspace composed of a mean vector and a set of basis vectors. By incorporating a mean vector into each unit, the recognition performance has been improved significantly. The simulation results in reference [5] show that AMSOM outperforms a linear PCA-based method and ASSOM in a face recognition problem. In both ASSOM and AMSOM, the local subspace in each unit can be adapted by linear PCA learning algorithms.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 507–516, 2008. © Springer-Verlag Berlin Heidelberg 2008
On the other hand, it is known that there are a number of advantages in introducing nonlinearities into a PCA-type network with reproducing kernels [6][13]. For example, the performance of the subspace method is affected by the dimensionality of the intersections of subspaces [1]. In other words, the dimensionality of each subspace should be as low as possible in order to achieve good performance. It is, however, not sufficient to describe the variation within a class of patterns by a low-dimensional subspace because of its linearity. From this consideration, we propose a nonlinear extension of the AMSOM using reproducing kernels. The proposed method can be expected to construct nonlinear varieties so that an effective representation of data belonging to the same category is achieved with low dimensionality. The effectiveness of the proposed method is verified by applying it to some simple pattern classification problems.

2 Adaptive Manifold Self-Organizing Map (AMSOM)

In this section, we give a brief review of the original AMSOM. Fig. 1 shows the structure of the AMSOM. It consists of an input layer and a competitive layer, which contain n and M units respectively. Suppose i ∈ {1, · · · , M} indexes the computational units in the competitive layer, and the dimensionality of the input vector is n. The i-th computational unit constructs an affine subspace, which is composed of a mean vector \mu^{(i)} and a subspace spanned by H basis vectors b_h^{(i)}, h ∈ {1, · · · , H}.

Fig. 1. A structure of the Adaptive Manifold Self-Organizing Map (AMSOM)

Fig. 2. Affine subspace in a computational unit i

First of all, we define the orthogonal projection of an input vector x onto the affine subspace of the i-th unit as

\hat{x}^{(i)} = \mu^{(i)} + \sum_{h=1}^{H} (\phi^{(i)T} b_h^{(i)}) b_h^{(i)},    (1)

where \phi^{(i)} = x - \mu^{(i)}. Therefore the projection error is represented as

\tilde{x}^{(i)} = \phi^{(i)} - \sum_{h=1}^{H} (\phi^{(i)T} b_h^{(i)}) b_h^{(i)}.    (2)
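Concretely, Eqs. (1) and (2) are mean subtraction followed by projection onto an orthonormal basis. A minimal numpy sketch (the function and variable names are ours, not from the paper; the basis columns are assumed orthonormal):

```python
import numpy as np

def project_affine(x, mu, B):
    """Orthogonal projection of x onto the affine subspace mu + span(B).

    B is an (n, H) matrix whose columns are orthonormal basis vectors b_h.
    Returns (x_hat, x_tilde): the projection (Eq. 1) and the residual (Eq. 2).
    """
    phi = x - mu                      # relative input vector phi = x - mu
    x_hat = mu + B @ (B.T @ phi)      # Eq. (1)
    x_tilde = phi - B @ (B.T @ phi)   # Eq. (2): projection error
    return x_hat, x_tilde

# toy check: a 1-dimensional affine subspace in R^2
mu = np.array([1.0, 1.0])
B = np.array([[1.0], [0.0]])          # basis along the x1 axis
x = np.array([3.0, 2.0])
x_hat, x_tilde = project_affine(x, mu, B)
print(x_hat)     # [3. 1.]
print(x_tilde)   # [0. 1.]
```

Note that x̂^{(i)} + x̃^{(i)} = x, i.e. the projection and the residual decompose the input.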

Figure 2 shows a schematic interpretation of the orthogonal projection and the projection error of an input vector onto the affine subspace defined in a unit i. The AMSOM is a more general strategy than the ASSOM, where each computational unit solely defines a subspace. To illustrate why this is so, let us consider a very simple case: suppose we are given two clusters as shown in Fig. 3(a).

Fig. 3. (a) Clusters in 2-dimensional space: an example of a case which cannot be separated without a mean value. (b) Two 1-dimensional affine subspaces that approximate and classify the clusters.

It is not possible to use one-dimensional subspaces, that is, lines intersecting the origin O, to approximate the clusters. This is true even if the global mean is removed, so that the origin O is translated to the centroid of the two clusters. However, two one-dimensional affine subspaces can easily approximate the clusters, as shown in Fig. 3(b), since the basis vectors are aligned in the direction that minimizes the projection error.

In the AMSOM, the input vectors are grouped into episodes in order to present them to the network as input sets. For pattern classification, an episode input is defined as a subset of training data belonging to the same category. Assume that the number of input vectors in the subset is E; then an episode input \omega_q in the class q is denoted as \omega_q = \{x_1, x_2, \cdots, x_E\}, \omega_q \subseteq \Omega_q, where \Omega_q is the set of training patterns belonging to the class q. The set of input vectors of an episode has to be recognized as one class, such that any member of this set, and even an arbitrary linear combination of them, should have the same winning unit.

The training process in AMSOM has the following steps:

(a) Winner lookup. The unit that gives the minimum projection error for an episode is selected. This unit is denoted as the winner, whose index is c. The decision criterion for the winner c is

c = \arg\min_i \sum_{e=1}^{E} \|\tilde{x}_e^{(i)}\|^2,    (3)

where i ∈ {1, · · · , M}.

(b) Learning. For each unit i, and for each x_e, update \mu^{(i)}:

\mu^{(i)}(t+1) = \mu^{(i)}(t) + \lambda_m(t) h_{ci}(t) \left( x_e - \mu^{(i)}(t) \right),    (4)

where \lambda_m(t) is the learning rate for \mu^{(i)} at learning epoch t, and h_{ci}(t) is the neighborhood function at learning epoch t with respect to the winner c. Both \lambda_m(t) and h_{ci}(t) are monotonically decreasing functions of t. In this paper, \lambda_m(t) = \lambda_m^{ini} (\lambda_m^{fin}/\lambda_m^{ini})^{t/t_{max}}, h_{ci}(t) = \exp(-|c - i|/\gamma(t)), and \gamma(t) = \gamma^{ini} (\gamma^{fin}/\gamma^{ini})^{t/t_{max}} are used. Then the basis vectors are updated as

b_h^{(i)}(t+1) = b_h^{(i)}(t) + \lambda_b(t) h_{ci}(t) \frac{\phi_e^{(i)}(t)^T b_h^{(i)}(t)}{\|\hat{\phi}_e^{(i)}(t)\| \, \|\phi_e^{(i)}(t)\|} \, \phi_e^{(i)}(t),    (5)

where \phi_e^{(i)}(t) is the relative input vector after the mean vector update, given by \phi_e^{(i)}(t) = x_e - \mu^{(i)}(t+1); \hat{\phi}_e^{(i)}(t) is the orthogonal projection of the relative input vector, given by \hat{\phi}_e^{(i)}(t) = \sum_{h=1}^{H} (\phi_e^{(i)}(t)^T b_h^{(i)}(t)) b_h^{(i)}(t); and \lambda_b(t) is the learning rate for the basis vectors, which is also a monotonically decreasing function of t. In this paper, \lambda_b(t) = \lambda_b^{ini} (\lambda_b^{fin}/\lambda_b^{ini})^{t/t_{max}} is used.

After the learning phase, a categorization phase determines the class association of each unit. Each unit is labeled by the class index for which it is selected as the winner most frequently when the training data are presented to the AMSOM again.
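The training steps above can be sketched as follows. This is a simplified illustration of Eqs. (3)-(5), not the authors' code: all names are our own, and we re-orthonormalize the basis with a QR step after each update, a common practice that the paper does not spell out.

```python
import numpy as np

def residual(x, mu, B):
    phi = x - mu
    return phi - B @ (B.T @ phi)

def train_step(episode, units, lam_m, lam_b, gamma):
    """One AMSOM step on one episode: winner lookup (Eq. 3), then mean
    and basis updates (Eqs. 4-5) for every unit, weighted by h_ci."""
    errs = [sum(np.sum(residual(x, u['mu'], u['B']) ** 2) for x in episode)
            for u in units]
    c = int(np.argmin(errs))                      # winner index, Eq. (3)
    for i, u in enumerate(units):
        h = np.exp(-abs(c - i) / gamma)           # neighborhood function h_ci
        for x in episode:
            u['mu'] += lam_m * h * (x - u['mu'])  # Eq. (4)
            phi = x - u['mu']                     # relative input vector
            phi_hat = u['B'] @ (u['B'].T @ phi)   # projection onto span(B)
            denom = np.linalg.norm(phi_hat) * np.linalg.norm(phi) + 1e-12
            for k in range(u['B'].shape[1]):
                u['B'][:, k] += lam_b * h * (phi @ u['B'][:, k]) / denom * phi  # Eq. (5)
            u['B'], _ = np.linalg.qr(u['B'])      # keep the basis orthonormal
    return c

units = [{'mu': np.zeros(2), 'B': np.array([[1.0], [0.0]])},
         {'mu': np.array([10.0, 10.0]), 'B': np.array([[0.0], [1.0]])}]
episode = [np.array([0.5, 0.0]), np.array([1.0, 0.0])]
c = train_step(episode, units, lam_m=0.5, lam_b=0.1, gamma=1.0)
print(c)  # 0 (the unit whose affine subspace already contains the episode)
```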

3 Self-Organizing Clustering with Nonlinear Varieties

3.1 Reproducing Kernels

Reproducing kernels are functions k : X² → R which, for all pattern sets

\{x_1, \cdots, x_l\} \subset X,    (6)

give rise to positive matrices K_{ij} := k(x_i, x_j). Here, X is some compact set in which the data resides, typically a subset of R^n. In the field of Support Vector Machines (SVM), reproducing kernels are often referred to as Mercer kernels. They provide an elegant way of dealing with nonlinear algorithms by reducing them to linear ones in some feature space F nonlinearly related to input space: using k instead of a dot product in R^n corresponds to mapping the data into a possibly high-dimensional dot product space F by a (usually nonlinear) map Φ : R^n → F, and taking the dot product there, i.e. [11]

k(x, y) = (\Phi(x), \Phi(y)).    (7)

By virtue of this property, we shall call Φ a feature map associated with k. Any linear algorithm which can be carried out in terms of dot products can be made nonlinear by substituting an a priori chosen kernel. Examples of such algorithms include the potential function method [12], SVM [7][8] and kernel PCA [9]. The price that one has to pay for this elegance, however, is that the solutions are only obtained as expansions in terms of input patterns mapped into feature space. For instance, the normal vector of an SV hyperplane is expanded in terms of Support Vectors, just as the kernel PCA feature extractors are expressed in terms of training examples,

\Psi = \sum_{i=1}^{l} \alpha_i \Phi(x_i).    (8)
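The identity in Eq. (7) can be checked numerically. The sketch below (our own illustration, not from the paper) uses the degree-2 non-homogeneous polynomial kernel in R², whose explicit feature map is known, and also verifies that the Gram matrix K_ij = k(x_i, x_j) is positive semidefinite:

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (x @ y + 1.0) ** d

def feature_map(x):
    """Explicit Phi for (x.y + 1)^2 with x in R^2."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(poly_kernel(x, y))                # 25.0
print(feature_map(x) @ feature_map(y))  # 25.0 -- Eq. (7): k(x,y) = <Phi(x), Phi(y)>

# Gram matrices built from a reproducing kernel are positive semidefinite
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
K = (X @ X.T + 1.0) ** 2
print(bool(np.all(np.linalg.eigvalsh(K) >= -1e-9)))  # True
```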

3.2 AMSOM in the Feature Space

The AMSOM in the high-dimensional feature space F is now considered. In the proposed method, the variety defined by a computational unit i in the competitive layer takes the nonlinear form

M_i = \{\Phi(x) \mid \Phi(x) = \Phi(\mu^{(i)}) + \sum_{h=1}^{H} \xi \, \Phi(b_h^{(i)})\},    (9)

where ξ ∈ R. Given a training data set {x_1, · · · , x_N}, the mean vector and the basis vectors in a unit i are represented in the following form:

\Phi(\mu^{(i)}) = \sum_{l=1}^{N} \alpha_l^{(i)} \Phi(x_l),    (10)

\Phi(b_h^{(i)}) = \sum_{l=1}^{N} \beta_{hl}^{(i)} \Phi(x_l),    (11)

respectively. \alpha_l^{(i)} in Eq. (10) and \beta_{hl}^{(i)} in Eq. (11) are the parameters adjusted by learning. The training procedure of the proposed method is derived as follows:

(a) Winner lookup. The norm of the orthogonal projection error onto the i-th affine subspace with respect to the present input x_p is calculated as

\|\Phi(\tilde{x}_p)^{(i)}\|^2 = k(x_p, x_p) + \sum_{h=1}^{H} P_h^{(i)2} - 2\sum_{l=1}^{N} \alpha_l^{(i)} k(x_p, x_l) + \sum_{l_1=1}^{N}\sum_{l_2=1}^{N} \alpha_{l_1}^{(i)} \alpha_{l_2}^{(i)} k(x_{l_1}, x_{l_2}) + 2\sum_{h=1}^{H}\sum_{l_1=1}^{N}\sum_{l_2=1}^{N} P_h^{(i)} \alpha_{l_1}^{(i)} \beta_{hl_2}^{(i)} k(x_{l_1}, x_{l_2}) - 2\sum_{h=1}^{H}\sum_{l=1}^{N} P_h^{(i)} \beta_{hl}^{(i)} k(x_p, x_l),    (12)

where P_h^{(i)} denotes the orthogonal projection component of the present input x_p onto the basis \Phi(b_h^{(i)}), calculated by

P_h^{(i)} = \sum_{l=1}^{N} \beta_{hl}^{(i)} k(x_p, x_l) - \sum_{l_1=1}^{N}\sum_{l_2=1}^{N} \alpha_{l_1}^{(i)} \beta_{hl_2}^{(i)} k(x_{l_1}, x_{l_2}).    (13)

The reproducing kernels generally used are as follows:

k(x_s, x_t) = (x_s^T x_t)^d, \quad d \in \mathbb{N},    (14)
k(x_s, x_t) = (x_s^T x_t + 1)^d, \quad d \in \mathbb{N},    (15)
k(x_s, x_t) = \exp\left(-\frac{\|x_s - x_t\|^2}{2\sigma^2}\right), \quad \sigma \in \mathbb{R},    (16)

where \mathbb{N} and \mathbb{R} are the set of natural numbers and the set of reals, respectively. Eq. (14), Eq. (15) and Eq. (16) are referred to as homogeneous polynomial kernels, non-homogeneous polynomial kernels and Gaussian kernels, respectively. The winner for an episode input \omega_q = \{\Phi(x_1), \cdots, \Phi(x_E)\} is decided in the same manner as in the AMSOM:

c = \arg\min_i \sum_{e=1}^{E} \|\Phi(\tilde{x}_e)^{(i)}\|^2, \quad i \in \{1, \cdots, M\}.    (17)

(b) Learning. The learning rules for \alpha_l^{(i)} and \beta_{hl}^{(i)} are as follows:

\Delta\alpha_l^{(i)} = \begin{cases} -\alpha_l^{(i)}(t)\lambda_m(t)h_{ci}(t) & \text{for } l \neq e \\ -\alpha_l^{(i)}(t)\lambda_m(t)h_{ci}(t) + \lambda_m(t)h_{ci}(t) & \text{for } l = e \end{cases}    (18)

\Delta\beta_{hl}^{(i)} = \begin{cases} -\alpha_l^{(i)}(t+1)\lambda_b(t)h_{ci}(t)T^{(i)}(t) & \text{for } l \neq e \\ -\alpha_l^{(i)}(t+1)\lambda_b(t)h_{ci}(t)T^{(i)}(t) + \lambda_b(t)h_{ci}(t)T^{(i)}(t) & \text{for } l = e \end{cases}    (19)

where

T^{(i)}(t) = \frac{\Phi(\phi_e^{(i)}(t))^T \Phi(b_h^{(i)}(t))}{\|\Phi(\hat{\phi}_e^{(i)}(t))\| \, \|\Phi(\phi_e^{(i)}(t))\|},    (20)

\|\Phi(\hat{\phi}_e^{(i)}(t))\| = \left( \sum_{h=1}^{H} \left( \sum_{l=1}^{N} \beta_{hl}^{(i)} k(x_e, x_l) - \sum_{l_1=1}^{N}\sum_{l_2=1}^{N} \alpha_{l_1}^{(i)} \beta_{hl_2}^{(i)} k(x_{l_1}, x_{l_2}) \right)^2 \right)^{1/2},    (21)

\|\Phi(\phi_e^{(i)}(t))\| = \left( k(x_e, x_e) - 2\sum_{l=1}^{N} \alpha_l^{(i)} k(x_e, x_l) + \sum_{l_1=1}^{N}\sum_{l_2=1}^{N} \alpha_{l_1}^{(i)} \alpha_{l_2}^{(i)} k(x_{l_1}, x_{l_2}) \right)^{1/2},    (22)

\Phi(\phi_e^{(i)}(t))^T \Phi(b_h^{(i)}(t)) = \sum_{l=1}^{N} \beta_{hl}^{(i)} k(x_e, x_l) - \sum_{l_1=1}^{N}\sum_{l_2=1}^{N} \alpha_{l_1}^{(i)} \beta_{hl_2}^{(i)} k(x_{l_1}, x_{l_2}).    (23)

In Eqs. (18) and (19), \lambda_m(t), \lambda_b(t) and h_{ci}(t) are the same parameters as in the AMSOM training process. After the learning phase, a categorization phase determines the class association of each unit, in the same manner as in the previous section.
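For reference, the projection error of Eq. (12) can be evaluated using only kernel values. The sketch below is our own vectorized rewriting via the Gram matrix, not the authors' code; it assumes the feature-space basis vectors Φ(b_h) are orthonormal and uses the Gaussian kernel of Eq. (16), for which k(x, x) = 1:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_projection_error(xp, X, alpha, beta, sigma=1.0):
    """Squared feature-space residual of Phi(xp) w.r.t. the variety with
    Phi(mu) = sum_l alpha_l Phi(x_l) and Phi(b_h) = sum_l beta_hl Phi(x_l).
    alpha: (N,), beta: (H, N). Algebraically equal to the expanded Eq. (12)."""
    K = gaussian_kernel(X, X, sigma)                # K_ij = k(x_i, x_j)
    kp = gaussian_kernel(xp[None, :], X, sigma)[0]  # k(xp, x_l)
    P = beta @ kp - beta @ (K @ alpha)              # components P_h, Eq. (13)
    dist2 = 1.0 - 2.0 * alpha @ kp + alpha @ (K @ alpha)  # ||Phi(xp)-Phi(mu)||^2
    return dist2 - P @ P

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.array([1.0, 0.0, 0.0])   # mean sits exactly on Phi(x_0)
beta = np.zeros((1, 3))             # degenerate basis, for illustration only
print(kernel_projection_error(X[0], X, alpha, beta))  # 0.0
```

With beta = 0 the error reduces to the feature-space distance from the mean; a trained unit would carry nonzero, orthonormalized beta rows.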

4 Experimental Results

Data Distribution 1. To analyze the effect of reducing the dimensionality of the intersections of subspaces by the proposed method, the data shown in Fig. 4(a) are used. Although each class can be approximated by a 1-dimensional linear manifold in the input space R², intersections of subspaces could occur between class 1 and class 2, and between class 2 and class 3, even if the optimal linear manifold for each class were obtained. However, a linear manifold in the high-dimensional space, that is, a nonlinear manifold in the input space, can be expected to classify the given data through the dimensionality-reduction effect on the intersections of subspaces. As the result of the simulation, the given input data are classified correctly by the proposed method. Figures 4(b) and (c) show the decision regions constructed by AMSOM and the proposed method, respectively. In this experiment, 3 units were used in the competitive layer. The class associations are class 1 to unit 1, class 2 to unit 2, and class 3 to unit 3, respectively. The experimental parameters were assigned as follows: the total number of epochs t_{max} = 100, \lambda_m^{ini} = 0.1, \lambda_m^{fin} = 0.01, \lambda_b^{ini} = 0.1, \lambda_b^{fin} = 0.01, \gamma^{ini} = 3, \gamma^{fin} = 0.03, both in AMSOM and in the proposed method in common; H = 1 in AMSOM; and H = 2 with k(x, y) = (x^T y + 1)^3 in the proposed method. The experiment shows that the proposed method has the ability to reduce the dimensionality of the intersections of subspaces.
Data Distribution 2. To verify that the proposed method has the ability to describe nonlinear manifolds, the data shown in Fig. 5(a) are used. In this case it is impossible to describe each class by a linear manifold, that is, a 1-dimensional affine subspace. As the result of the simulation, the given input data are classified correctly by the proposed method. Figures 5(b) and (c) show the decision regions constructed by AMSOM and the proposed method, respectively.
In this experiment, 3 units were used in the competitive layer. The class associations are class 1 to unit 3, class 2 to unit 2, and class 3 to unit 1, respectively. The experimental parameters were assigned as follows: the total number of epochs t_{max} = 100, \lambda_m^{ini} = 0.1, \lambda_m^{fin} = 0.01, \lambda_b^{ini} = 0.1, \lambda_b^{fin} = 0.01, \gamma^{ini} = 3, \gamma^{fin} = 0.03, both in AMSOM and in the proposed method in common; H = 1 in AMSOM; and H = 2 with k(x, y) = (x^T y)^2 in the proposed method. The experiment shows that the proposed method has the ability to extract suitable nonlinear manifolds efficiently.

Fig. 4. (a) Training data used in the first experiment. (b) Decision regions learned by AMSOM and (c) by the proposed method.

Fig. 5. (a) Training data used in the second experiment. (b) Decision regions learned by AMSOM and (c) by the proposed method.

5 Conclusions

A clustering method with a map of nonlinear varieties was proposed as a new pattern classification method. The proposed method extends AMSOM to a nonlinear method in a straightforward way by applying the kernel method. The effectiveness of the proposed method was verified by the experiments. The proposed algorithm holds promise for applications of the ASSOM family in a wide area of practical problems.


References

1. Oja, E.: Subspace Methods of Pattern Recognition. Research Studies Press (1983)
2. Kohonen, T.: Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map. Biological Cybernetics 75, 281–291 (1996)
3. Kohonen, T., Kaski, S., Lappalainen, H.: Self-organized formation of various invariant-feature filters in the adaptive-subspace SOM. Neural Computation 9, 1321–1344 (1997)
4. Kohonen, T.: Self-Organizing Maps. Springer, Berlin, Heidelberg, New York (1995)
5. Liu, Z.Q.: Adaptive subspace self-organizing map and its application in face recognition. International Journal of Image and Graphics 2(4), 519–540 (2002)
6. Saitoh, S.: Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England (1988)
7. Cortes, C., Vapnik, V.: Support vector networks. Machine Learning 20, 273–297 (1995)
8. Vapnik, V.: The Nature of Statistical Learning Theory, 2nd edn. Springer, New York, Berlin, Heidelberg (1995)
9. Schölkopf, B., Smola, A.J., Müller, K.R.: Nonlinear component analysis as a kernel eigenvalue problem. Technical Report 44, Max-Planck-Institut für biologische Kybernetik (1996)
10. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.R.: Fisher discriminant analysis with kernels. Neural Networks for Signal Processing IX, 41–48 (1999)
11. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152 (1992)
12. Aizerman, M., Braverman, E., Rozonoer, L.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821–837 (1964)
13. Schölkopf, B., Smola, A.J.: Learning with Kernels. The MIT Press, Cambridge (2002)

An Automatic Speaker Recognition System

P. Chakraborty1, F. Ahmed1, Md. Monirul Kabir2, Md. Shahjahan1, and Kazuyuki Murase2,3

1 Department of Electrical & Electronic Engineering, Khulna University of Engineering and Technology, Khulna-920300, Bangladesh
2 Dept. of Human and Artificial Intelligence Systems, Graduate School of Engineering
3 Research and Education Program for Life Science, University of Fukui, 3-9-1 Bunkyo, Fukui 910-8507, Japan
[email protected], [email protected]

Abstract. Speaker recognition is the process of identifying a speaker by analyzing the spectral shape of the voice signal. This is done by extracting and matching features of the voice signal. Mel-Frequency Cepstrum Coefficient (MFCC) is the feature extraction technique, which yields a set of coefficients called mel-frequency cepstrum coefficients; these coefficients are the extracted features. The extracted features are taken as the input of the vector quantization process. Vector Quantization (VQ) is the typical feature matching technique, in which a VQ codebook is generated by providing pre-defined spectral vectors for each speaker to cluster the training vectors in a training session. Finally, test data are provided and the nearest neighbor is searched to match the data with the trained data. The result is that the speakers are recognized correctly, where music and speech data (in both English and Bengali) are taken for the recognition process. The correct recognition rate is almost ninety percent, which is comparatively better than the Hidden Markov Model (HMM) and Artificial Neural Networks (ANN).

Keywords: MFCC: Mel-Frequency Cepstrum Coefficient; DCT: Discrete Cosine Transform; IIR: Infinite Impulse Response; FIR: Finite Impulse Response; FFT: Fast Fourier Transform; VQ: Vector Quantization.

1 Introduction

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This paper deals with an automatic speaker recognition system using Vector Quantization. There are other techniques for speaker recognition, such as the Hidden Markov Model (HMM) and Artificial Neural Networks (ANN). We have used VQ because of its lower computational complexity [1]. There are two main modules, feature extraction and feature matching, in any speaker recognition system [1, 2]. The speaker-specific features are extracted using a Mel-Frequency Cepstrum Coefficient (MFCC) processor. A set of mel-frequency cepstrum coefficients is found; these are called acoustic vectors [3] and are the extracted features of the speakers. These acoustic vectors are used in feature matching using the vector quantization technique. It is the typical feature matching technique, in which a VQ codebook is generated using trained data. Finally, test data are provided and the nearest neighbor is searched to match the data with the trained data. The result is that the speakers are recognized correctly, where music and speech data (in both English and Bengali) are taken for the recognition process. This work is done with about 70 spectral data. The correct recognition rate is almost ninety percent. It is comparatively better than the Hidden Markov Model (HMM) and Artificial Neural Networks (ANN), because the correct recognition rate for HMM and ANN is below ninety percent.

The future work is to generate a VQ codebook with many pre-defined spectral vectors. It will then be possible to add many trained data to that codebook in a training session, but the main problem is that the network size and training time become prohibitively large with increasing data size. To overcome these limitations, a time-alignment technique can be applied, so that a continuous speaker recognition system becomes possible.

There are several implementations of feature matching and identification. Lawrence Rabiner and B. H. Juang proposed the mel-frequency cepstrum coefficient (MFCC) method to extract features and Vector Quantization as the feature matching technique [1]. Lawrence Rabiner and R. W. Schafer discussed the performance of the MFCC processor following several theoretical concepts [2]. S. B. Davis and P. Mermelstein described the characteristics of acoustic speech [3]. Y. Linde, A. Buzo and R. Gray proposed the LBG algorithm to generate a VQ codebook by a splitting technique [4]. S. Furui described speaker-independent word recognition using dynamic features of the speech spectrum [5]. S. Furui also reviewed the overall speaker recognition technology using the MFCC and VQ methods [6].

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 517–526, 2008. © Springer-Verlag Berlin Heidelberg 2008
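The matching side of the pipeline (per-speaker codebooks, identification by minimal average distortion) can be sketched as follows. This is an illustrative stand-in, not the authors' code: we use a plain k-means-style codebook in place of the LBG splitting procedure [4], synthetic 2-D vectors in place of real MFCC acoustic vectors, and all names are our own.

```python
import numpy as np

def train_codebook(vectors, k, iters=20, seed=0):
    """Cluster training vectors into a k-entry VQ codebook (k-means style)."""
    rng = np.random.default_rng(seed)
    code = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        d = ((vectors[:, None, :] - code[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                code[j] = vectors[labels == j].mean(axis=0)
    return code

def avg_distortion(vectors, code):
    d = ((vectors[:, None, :] - code[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).mean()

def identify(test_vectors, codebooks):
    """Pick the speaker whose codebook yields the smallest average distortion."""
    return min(codebooks, key=lambda spk: avg_distortion(test_vectors, codebooks[spk]))

rng = np.random.default_rng(1)
spk_a = rng.normal(0.0, 0.3, (200, 2))   # stand-in feature vectors, speaker A
spk_b = rng.normal(3.0, 0.3, (200, 2))   # stand-in feature vectors, speaker B
books = {'A': train_codebook(spk_a, 4), 'B': train_codebook(spk_b, 4)}
test = rng.normal(0.0, 0.3, (50, 2))     # unseen utterance from speaker A
print(identify(test, books))             # A
```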

2 Methodology

A general model of a speaker recognition system for several people is shown in Fig. 1. The model consists of four building blocks. The first is data extraction, which converts wave data stored in audio wave format into a form suitable for further computer processing and analysis. The second is pre-processing, which involves filtering, removing pauses, silences and weak unvoiced sound signals, and detecting the valid speech signal. The third block is feature extraction, where speech features are extracted from the speech signal. The selected features carry enough information to recognize a speaker. Here a class label is assigned to each word uttered by each speaker by examining the extracted features and comparing them with the classes learnt during the training phase. Vector quantization is used as an identifier.

Fig. 1. Block diagram of speaker recognition system

3 Pre-processing

A digital filter is a mathematical algorithm, implemented in hardware and/or software, that operates on a digital input signal to produce a digital output signal in order to achieve a filtering objective. Digital filters often operate on digitized analog signals, or just numbers representing some variable, stored in computer memory. Digital filters are broadly divided into two classes, namely infinite impulse response (IIR) and finite impulse response (FIR) filters. We chose an FIR filter because FIR filters can have an exactly linear phase response. The implication of this is that no phase distortion is introduced into the signal by the filter. This is an important requirement in many applications, for example data transmission, biomedicine, digital audio and image processing. The phase responses of IIR filters are non-linear, especially at the band edges.

When a machine is continuously listening to speech, a difficulty arises when it tries to figure out where a word starts and stops. We solved this problem by examining the magnitude of several consecutive samples of sound: if the magnitude of these samples is great enough, those samples are kept and examined later [1].

Fig. 2. Sample Speech Signal

Fig. 3. Example of speech signal after it has been cleaned

One thing that is clearly noticeable in the example speech signal is that there is a lot of empty space where nothing is being said, so we simply remove it. An example speech signal is shown before cleaning in Figure 2, and after in Figure 3. After the empty space is removed from the speech signal, the signal is much shorter. In this case the signal had about 13,000 samples before it was cleaned; after it was run through the cleaning function, it contained only 2,600 samples. There are several advantages to this. The amount of time required to perform calculations on 13,000 samples is much larger than that required for 2,600 samples, and the cleaned sample still contains all the important data required to perform the analysis of the speech. The sample produced by the cleaning process is then fed into the other parts of the ASR system.
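The cleaning step can be sketched with a simple frame-energy gate. The frame length and threshold below are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def remove_silence(signal, frame_len=160, threshold=0.03):
    """Keep only frames whose mean absolute magnitude reaches the threshold."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    keep = np.abs(frames).mean(axis=1) >= threshold
    return frames[keep].reshape(-1)

# toy signal: silence, a voiced burst, silence
sig = np.concatenate([np.zeros(800),
                      0.5 * np.sin(np.linspace(0.0, 40.0 * np.pi, 800)),
                      np.zeros(800)])
clean = remove_silence(sig)
print(len(sig), len(clean))  # 2400 800
```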

4 Feature Extraction

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, namely:

• Linear Prediction Coding (LPC)
• Mel-Frequency Cepstrum Coefficients (MFCC)
• Linear Predictive Cepstral Coefficients (LPCC)
• Perceptual Linear Prediction (PLP)
• Neural Predictive Coding (NPC)

Among the above classes we used MFCC, because it is the best known and most popular. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency; filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz [1, 2].

4.1 Mel-Frequency Cepstrum Processor

A diagram of the structure of an MFCC processor is given in Figure 4. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. These sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs are shown to be less susceptible to the mentioned variations than the speech waveforms themselves [5, 6].

Fig. 4. Block diagram of the MFCC processor

4.1.1 Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by

An Automatic Speaker Recognition System


N − M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N − 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 and M = 100.

4.1.2 Windowing
The next processing step is windowing. By windowing, the signal discontinuities at the beginning and end of each frame are minimized. The idea is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples in each frame, then the result of windowing is the signal

y_l(n) = x_l(n) w(n),  0 ≤ n ≤ N − 1   (1)

The following are common types of window:
• Hamming
• Rectangular
• Bartlett (triangular)
• Hanning
• Kaiser
• Lanczos
• Blackman-Harris

Among all of the above, we mainly used the Hamming window, for ease of mathematical computation. It is described as

w(n) = 0.54 − 0.46 cos(2πn/(N − 1)),  0 ≤ n ≤ N − 1   (2)

Besides this, we have also used the Hanning window and the Blackman-Harris window. As an example, a Hamming window with 256 samples is shown here.
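The frame-blocking and windowing steps above can be sketched as follows, using the typical values N = 256 and M = 100 given in the text:

```python
import numpy as np

def frame_blocking(signal, N=256, M=100):
    """Block a signal into overlapping frames of N samples whose start
    points are M samples apart, so adjacent frames overlap by N - M."""
    signal = np.asarray(signal)
    n_frames = 1 + max(0, (len(signal) - N) // M)
    idx = np.arange(N)[None, :] + M * np.arange(n_frames)[:, None]
    return signal[idx]

def hamming(N):
    """Hamming window of equation (2); identical to np.hamming(N)."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frames = frame_blocking(np.arange(1000.0))   # 8 frames of 256 samples
windowed = frames * hamming(256)             # y_l(n) = x_l(n) w(n), eq. (1)
print(frames.shape)  # (8, 256)
```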

Fig. 5. Hamming window with 256 speech samples

4.1.3 Fast Fourier Transform (FFT)
The Fast Fourier Transform converts a signal from the time domain into the frequency domain. The FFT is a fast algorithm implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:

X_n = Σ_{k=0}^{N−1} x_k e^{−2πjkn/N},  n = 0, 1, 2, …, N − 1   (3)
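Equation (3) can be checked directly against a library FFT; a naive O(N²) implementation of the sum is:

```python
import numpy as np

def dft(x):
    """Naive O(N^2) DFT, term by term as in equation (3):
    X_n = sum_k x_k * exp(-2*pi*j*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    k = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N))
                     for n in range(N)])

x = np.sin(2 * np.pi * 3 * np.arange(16) / 16)  # 3 cycles in 16 samples
assert np.allclose(dft(x), np.fft.fft(x))       # the FFT computes the same DFT
```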


We use j here to denote the imaginary unit, i.e. j = √(−1). In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < F_s/2 correspond to the values 1 ≤ n ≤ N/2 − 1, while negative frequencies −F_s/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1, where F_s denotes the sampling frequency.

lim_{n→∞} P( max_{1≤i≤n} |W_i|² > C_{n,ε} ) = 0,  for ε > 0,   (17)

lim_{n→∞} P( max_{1≤i≤n} |W_i|² ≤ C_{n,ε} ) = 0,  for ε < 0.   (18)
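The extreme-value behaviour behind (17) and (18) — the maximum of n squared i.i.d. standard Gaussian variables concentrating near 2 log n — can be checked empirically (taking C_{n,0} ≈ 2 log n purely as an assumption here, since the exact definition of C_{n,ε} precedes this excerpt):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
W = rng.standard_normal(n)
# max_i |W_i|^2 concentrates near 2*log(n) for i.i.d. N(0,1) variables
print(float(np.max(W ** 2)), 2.0 * np.log(n))
```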

1≤i≤n

The strong convergence result with C_{n,1} is known, and C_{n,1} is used as the universal threshold level for removing irrelevant wavelet coefficients in wavelet denoising [5]. The universal threshold level is shown to be asymptotically equivalent to the minimax one [5]. (17) and (18) are weaker results, but they evaluate the O(log log n) term. We return to our problem and consider the case of v*_λ = 0_n, i.e. all of the ṽ_{λ,i} are noise components. Here, 0_n denotes the n-dimensional zero vector. We define ũ = (ũ_1, …, ũ_n) that satisfies ũ ∼ N(0_n, σ²Γ_λ). We define σ_i² = σ²γ_i/(γ_i + λ), i = 1, …, n, where σ_i ≠ 0 for any i. We also define u = (u_1, …, u_n), where u_i = ũ_i/σ_i. Then, u ∼ N(0_n, I_n). By (17) and the definition of u, we have

P( ⋃_{i=1}^{n} { ũ_i² > σ_i² C_{n,ε} } ) = P( max_{1≤i≤n} u_i² > C_{n,ε} ) → 0  (n → ∞),   (19)


K. Hagiwara

if ε > 0. On the other hand, by using (18) and the definition of u, we have

P( ⋂_{i=1}^{n} { ũ_i² ≤ σ_i² C_{n,ε} } ) = P( max_{1≤i≤n} u_i² ≤ C_{n,ε} ) → 0  (n → ∞),   (20)

if ε < 0. (19) tells us that, for any i, ũ_i² cannot exceed σ_i²C_{n,ε} with high probability when n is large and ε > 0. Therefore, if we employ σ_i²C_{n,ε} with ε > 0 as component-wise threshold levels, (19) implies that they remove the ith component whenever it is a noise component. On the other hand, (20) tells us that there are some noise components which satisfy ũ_i² > σ_i²C_{n,ε} with high probability when n is large and ε < 0; a component exceeding such a level may thus still be a noise component. In other words, we cannot distinguish nonzero-mean components from zero-mean components in this case. Hence, σ_i²C_{n,0}, i = 1, …, n, are critical levels for identifying noise components. We note that these results are still valid when v*_{λ,i} is not zero for some i, provided the number of such components is very small. This is the case under the assumption of sparseness of the representation of h in the orthogonal domain. As a conclusion, we propose a hard thresholding method obtained by putting θ_{n,i} = σ_i²C_{n,0}, i = 1, …, n, in (16), i.e. we set ε = 0. We refer to this method as component-wise hard thresholding (CHT). In practical applications of the method, we need an estimate of the noise variance σ². Fortunately, for nonparametric regression, [1] suggested applying

σ̂² = yᵀ(I − H_λ)²y / trace[(I − H_λ)²],   (21)

where H_λ is defined by H_λ = GF_λ⁻¹Gᵀ = GQΓ_λQᵀGᵀ. Although the method includes a regularization parameter, it can be fixed at a small value. This is because the thresholding method keeps a trained machine compact in the orthogonal domain, so the contribution of the regularizer may not be significant in improving the generalization performance; it is needed only for ensuring the numerical stability of the matrix calculations. Since the basis functions can be nearly linearly dependent in practical applications, small eigenvalues are less reliable. We therefore ignore the orthogonal components whose eigenvalues are less than a predetermined small value, say 10⁻¹⁶. Although the run time of the eigendecomposition is O(n³), the subsequent procedures of CHT, such as the calculations of σ̂² and w_λ, are achieved at less computational cost given the eigendecomposition.

Hard thresholding with the universal threshold level (HTU). Basically, the eigendecomposition of GᵀG corresponds to a principal component analysis of g_1, …, g_n. Therefore, for nearly linearly dependent basis functions, only a few eigenvectors contribute significantly. On the other hand, the components with small eigenvalues are largely affected by numerical errors. It is therefore natural to take care of only the components with large eigenvalues. For a component

Orthogonal Shrinkage Methods for Nonparametric Regression


with a large eigenvalue, γ_i ≫ λ holds, since we can choose a small value for λ. Thus, σ_i² ≈ σ² holds by the definition of σ_i² in CHT. We then consider applying a single threshold level σ²C_{n,0} instead of σ_i²C_{n,0}, i = 1, …, n, in CHT; i.e. we set θ_{n,i} = σ²C_{n,0} in (16). This is a direct application of the universal threshold level in wavelet denoising [5]. This method is referred to as hard thresholding with the universal threshold level (HTU).

Backward hard thresholding (BHT). On the other hand, since the threshold level derived here is a worst-case evaluation of the noise level, CHT and HTU may introduce a bias between f_{w_λ} and h through unexpected removal of contributing components. Since a component with a large eigenvalue is composed of a linear sum of many basis functions, it may be a smooth component; removal of such components may therefore yield a large bias. Indeed, in wavelet denoising, fast/detail components are the target of the thresholding method while slow/approximation components are left untouched [5], which may be a device for reducing the bias. For our method, we also introduce this idea and consider the following procedure. We assume that γ_1 ≤ γ_2 ≤ ··· ≤ γ_n. By increasing j from 1 to n, we find the first j = ĵ for which ṽ²_{λ,j} > σ²C_{n,0} occurs. Then, thresholding is performed by T_j(ṽ_{λ,j}) = ṽ_{λ,j} if j ≥ ĵ and T_j(ṽ_{λ,j}) = 0 if j < ĵ. This keeps components with large eigenvalues and possibly reduces the bias. We refer to this method as backward hard thresholding (BHT). BHT can be viewed as a stopping criterion for choosing contributing components among the orthogonal components enumerated in order of the magnitudes of the eigenvalues.
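The three thresholding rules can be sketched as follows. This is a hedged illustration: C_{n,0} is taken as 2 log n (the universal level of wavelet denoising), and the hat matrix H_λ is taken in the ridge form G(GᵀG + λI)⁻¹Gᵀ; the paper's exact definitions of C_{n,ε} and F_λ appear before this excerpt. HTU corresponds to cht_mask with the component-wise levels σ_i² replaced by the single level σ².

```python
import numpy as np

def noise_variance(G, y, lam=1e-4):
    """Noise variance estimate of equation (21):
    sigma^2 = y^T (I - H_lam)^2 y / trace[(I - H_lam)^2],
    with the (assumed) ridge hat matrix H = G (G^T G + lam I)^{-1} G^T."""
    n = len(y)
    H = G @ np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T)
    R = np.eye(n) - H
    return float(y @ R @ R @ y / np.trace(R @ R))

def cht_mask(v, gammas, sigma2, lam=1e-4, eig_tol=1e-16):
    """CHT: keep component i iff v_i^2 > sigma_i^2 * C_{n,0}, where
    sigma_i^2 = sigma^2 * gamma_i / (gamma_i + lam).  C_{n,0} is taken
    as 2*log(n) (an assumption).  Tiny-eigenvalue components are dropped."""
    v, gammas = np.asarray(v), np.asarray(gammas)
    C = 2.0 * np.log(len(v))
    sigma_i2 = sigma2 * gammas / (gammas + lam)
    return (v ** 2 > sigma_i2 * C) & (gammas > eig_tol)

def bht_mask(v, sigma2):
    """BHT: components ordered by increasing eigenvalue; keep everything
    from the first j with v_j^2 > sigma^2 * C_{n,0} onwards."""
    v = np.asarray(v)
    thr = sigma2 * 2.0 * np.log(len(v))
    exceed = np.flatnonzero(v ** 2 > thr)
    mask = np.zeros(len(v), dtype=bool)
    if exceed.size:
        mask[exceed[0]:] = True   # keep the large-eigenvalue side
    return mask
```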

4 Numerical Experiments

4.1 Choice of Regularization Parameter

CHT, HTU and BHT do not include parameters to be adjusted except the regularization parameter. Since thresholding of orthogonal components yields a certain simple representation of a machine, the regularization parameter may not be significant in improving the generalization performance. To demonstrate this property of our methods, we examine, through a simple numerical experiment, the relationship between the generalization performance of trained machines and the regularization parameter value. The target function is h(x) = 5 sinc(8x) for x ∈ R. x_1, …, x_n are randomly drawn from the interval [−1, 1]. We assume i.i.d. Gaussian noise with mean zero and variance σ² = 1. The basis functions are Gaussian basis functions defined by g_j(x) = exp{−(x − x_j)²/(2τ²)}, j = 1, …, n, where we set τ² = 0.05. In this experiment, under a fixed value of the regularization parameter, we trained machines on 1000 sets of training data of size n. At each trial, the test error is measured by the mean squared error between the target function and the trained machine, calculated on 1000 equally spaced input points in [−1, 1]. Figure 1 (a) and (b) depict the results for n = 200 and n = 400 respectively, showing the relationship between the averaged test errors of trained machines


Fig. 1. The dependence of test errors on regularization parameters. (a) n = 200 and (b) n = 400. The ﬁlled circle, open circle and open square indicate the results for the raw estimate, CHT and BHT respectively.

and regularization parameter values. The filled circle, open circle and open square indicate the results for the raw estimate, CHT and BHT respectively, where the raw estimate is obtained by (6) at each fixed value of the regularization parameter. We do not show the result for HTU, since it is almost the same as that for CHT. In these figures, we can see that the averaged test errors of our methods are almost unchanged for small values of the regularization parameter, while those of the raw estimates are sensitive to the regularization parameter value. We can also see that BHT is entirely superior to the raw estimate and CHT, while CHT is worse than the raw estimate around λ = 10¹ for both n = 200 and n = 400. In practical applications, the regularization parameter of the raw estimate should be determined from training data; performance comparisons with the leave-one-out cross-validation method are shown below.

4.2 Comparison with LOOCV

We compare the performances of the proposed methods with the performance of the leave-one-out cross-validation (LOOCV) choice of the regularization parameter value. We examine not only generalization performance but also the computational times of the methods. For the regularized estimator considered in this article, it is known that the LOOCV error can be calculated without training on validation sets [3,6]. We assume the same conditions as in the previous experiment. The CPU time is measured only for the estimation procedure. The experiments are conducted using Matlab on a computer with a 2.13 GHz Core2 CPU and 1 GByte of memory. Table 1 (a) and (b) show the averaged test errors and averaged CPU times of LOOCV and our methods respectively, with the standard deviations (divided by 2) appended. The examined values for the regularization parameter in LOOCV are {m × 10^j : m = 1, 2, 5, j = −4, −3, …, 3}. In our methods, the


Table 1. Test errors and CPU times of LOOCV, CHT, HTU and BHT

(a) Test errors
n     LOOCV         CHT           HTU           BHT
100   0.079±0.027   0.101±0.034   0.100±0.034   0.076±0.027
200   0.040±0.013   0.046±0.014   0.045±0.014   0.035±0.011
400   0.021±0.006   0.023±0.007   0.023±0.007   0.017±0.005

(b) CPU times
n     LOOCV         CHT           HTU           BHT
100   0.079±0.002   0.013±0.002   0.014±0.002   0.014±0.002
200   0.533±0.003   0.080±0.002   0.080±0.002   0.080±0.002
400   3.657±0.003   0.523±0.004   0.523±0.004   0.523±0.004

regularization parameter is fixed at 1 × 10⁻⁴, which is the smallest of the candidate values for LOOCV. Based on Table 1 (a), we first discuss the generalization performances of the methods. CHT and HTU are almost comparable. This implies that only the components corresponding to large eigenvalues contribute. CHT and HTU are entirely worse than LOOCV on average, although the differences are within the standard deviations for n = 200 and n = 400. BHT entirely outperforms LOOCV, CHT and HTU on average, although the difference between the averaged test error of LOOCV and that of BHT is almost within the standard deviations. As pointed out previously, CHT and HTU may accidentally remove smooth components, since the threshold levels were determined from a worst-case evaluation of the dispersion of the noise. The better generalization performance of BHT compared with CHT and HTU in Table 1 (a) is explained by this fact. On the other hand, as shown in Table 1 (b), our methods completely outperform LOOCV in terms of CPU time.
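The identity that lets the LOOCV error be computed without refitting, for linear smoothers such as the regularized estimator considered here, can be sketched as follows (the hat-matrix form is an assumption consistent with (21)):

```python
import numpy as np

def loocv_score(G, y, lam):
    """LOOCV error without refitting, via the linear-smoother identity
        e_loo_i = (y_i - yhat_i) / (1 - H_ii),
    where H = G (G^T G + lam I)^{-1} G^T is the (assumed) hat matrix."""
    H = G @ np.linalg.solve(G.T @ G + lam * np.eye(G.shape[1]), G.T)
    resid = y - H @ y
    return float(np.mean((resid / (1.0 - np.diag(H))) ** 2))

# Candidate grid from the experiments: {m * 10^j : m = 1, 2, 5, j = -4..3}
grid = sorted(m * 10.0 ** j for j in range(-4, 4) for m in (1, 2, 5))
```

This is exact for ridge-type estimators (see Rifkin's thesis, cited as [6]), so LOOCV costs one fit per grid value rather than n fits, yet still requires scanning the whole grid, which explains the CPU-time gap in Table 1 (b).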

5

Conclusions and Future Work

In this article, we proposed shrinkage methods in training a machine by using a regularization method. The machine is represented by a linear combination of ﬁxed basis functions, in which the number of basis functions, or equivalently, the number of weights is identical to that of training data. In the regularized cost function, the error function is deﬁned by the sum of squared errors and the regularization term is deﬁned by the quadratic form of the weight vector. In the proposed shrinkage procedures, basis functions are orthogonalized by eigendecomposition of the Gram matrix of the vectors of basis function outputs. Then, the orthogonal components are kept or removed according to the proposed thresholding methods. The proposed methods are based on the statistical properties of regularized estimators of weights, which are derived by assuming i.i.d. Gaussian noise. The ﬁnal weights are obtained by a linear transformation of the thresholded orthogonal components and are shrinkage estimators of weights. We


proposed three versions of the thresholding method: component-wise hard thresholding, hard thresholding with the universal threshold level, and backward hard thresholding. Since the regularization parameter can be fixed at a small value in our methods, our methods are automatic. Additionally, since eigendecomposition algorithms are included in many software packages and the thresholding methods are simple, our methods are quite easy to implement. The numerical experiments showed that our methods achieve relatively good generalization capability in strictly less computational time compared with the LOOCV method. In particular, the backward hard thresholding method outperformed the LOOCV method on average in terms of generalization performance. As future work, we need to investigate the performance of our methods on real-world problems. Furthermore, we need to evaluate the generalization error when applying the proposed shrinkage methods.

References
1. Carter, C.K., Eagleson, G.K.: A comparison of variance estimators in nonparametric regression. J. R. Statist. Soc. B 54, 773–780 (1992)
2. Chen, S.: Local regularization assisted orthogonal least squares regression. Neurocomputing 69, 559–585 (2006)
3. Craven, P., Wahba, G.: Smoothing noisy data with spline functions. Numerische Mathematik 31, 377–403 (1979)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
5. Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455 (1994)
6. Rifkin, R.: Everything old is new again: a fresh look at historical approaches in machine learning. Ph.D. thesis, MIT (2002)
7. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288 (1996)
8. Suykens, J.A.K., De Brabanter, J., Lukas, L., Vandewalle, J.: Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing 48, 85–105 (2002)
9. Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 682–688 (2001)

A Subspace Method Based on Data Generation Model with Class Information

Minkook Cho, Dongwoo Yoon, and Hyeyoung Park

School of Electrical Engineering and Computer Science, Kyungpook National University, Daegu, Korea
[email protected], [email protected], [email protected]

Abstract. Subspace methods have been used widely to reduce memory requirements and system complexity and to increase classification performance in pattern recognition and signal processing. We propose a new subspace method based on a data generation model with intra-class and extra-class factors. The extra-class factor is associated with the distribution across classes and is important for discriminating classes. The intra-class factor is associated with the distribution within a class, and needs to be diminished to obtain high class separability. In the proposed method, we first estimate the intra-class factors and remove them from the original data. We then extract the extra-class factors by PCA. For verification of the proposed method, we conducted computational experiments on real facial data, and show that it gives better performance than conventional methods.

1

Introduction

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 547–555, 2008.
© Springer-Verlag Berlin Heidelberg 2008

Subspace methods aim at finding a low-dimensional subspace which preserves meaningful information in the input data. They are widely used for high-dimensional pattern classification, such as that of image data, for two main reasons. First, by applying a subspace method, we can reduce memory requirements and system complexity. Second, we can expect to increase classification performance by eliminating useless information and emphasizing information essential for classification. The most popular subspace methods are PCA (Principal Component Analysis) [10,11,8] and FA (Factor Analysis) [6,14], which are based on data generation models. PCA finds a subspace of independent linear combinations (principal components) that retains as much of the information in the original variables as possible. However, PCA is an unsupervised method which does not use class information. This may cause some loss of information critical for classification. In contrast, LDA (Linear Discriminant Analysis) [1,4,5] is a supervised learning method which uses the target labels of the data set. LDA attempts to find basis vectors of a subspace maximizing the linear class separability. It is generally known that LDA can give better classification performance than PCA by using class information. However, LDA gives


at most k − 1 basis vectors of the subspace for a k-class problem, and cannot stably extract features for data sets with a limited number of samples in each class. Another subspace method for classification is the intra-person space method [9], which was developed for face recognition. The intra-person space is defined by differences between two facial images of the same person. For dimension reduction, a low-dimensional eigenspace is obtained by applying PCA to the intra-person space. In classification tasks, raw input data are projected onto the intra-personal eigenspace to obtain low-dimensional features. The intra-person method showed better performance than PCA and LDA on the FERET data [9]. However, it is not based on a data generation model and cannot give a sound theoretical reason why the intra-person space provides good information for classification. On the other hand, a data generation model with class information has recently been developed [12]. It is a variant of the factor analysis model with two types of factors: a class factor (what we call the extra-class factor) and an environment factor (what we call the intra-class factor). Based on the data generation model, the intra-class factor is estimated using difference vectors between two data points in the same class. The estimated probability distribution of the intra-class factor is applied to measuring the similarity of data for classification. Though this method takes an approach similar to the intra-person method, in the sense that it uses the difference vectors to obtain the intra-class information and the similarity measure, it is based on a data generation model which can give an explanation of the developed similarity measure. Still, it does not include a dimension reduction process, and another subspace method is necessary for high-dimensional data. In this paper, we propose an appropriate subspace method for the data generation model developed in [12].
The proposed method finds a subspace which suppresses the effect of the intra-class factors and enhances the effect of the extra-class factors, based on the data generation model. In Section 2, the model is explained in detail.

2

Data Generation Model

Before defining the data generation model, suppose that we have obtained several images (i.e. data) from different persons (i.e. classes). We know that pictures of different persons are obviously different. We also know that the pictures of one person are not exactly the same, due to environmental conditions such as illumination. Therefore, it is natural to assume that a data point consists of an intra-class factor, which represents within-class variations such as the variation among pictures of the same person, and an extra-class factor, which represents between-class variations such as the differences between two persons. Under this assumption, a random variable x for observed data can be written as a function of two distinct random variables ξ and η, of the form

x = f(ξ, η),   (1)

where ξ represents an extra-class factor, which keeps some unique information in each class, and η represents an intra-class factor which represents environmental


variation within the same class. In [12], it was assumed that η keeps information of any variation within a class and that its distribution is common to all classes. In order to explicitly define the generation model, a linear additive factor model was applied:

x_i = Wξ_i + η.   (2)

This means that a random sample x_i in class C_i is generated as the sum of a linearly transformed class prototype ξ_i and a random class-independent variation η. In this paper, as an extension of the model, we assume that a low-dimensional intra-class factor is also linearly transformed to generate an observed data point x. Therefore, the function f is defined as

x_i = Wξ_i + Vη_i.   (3)

This model implies that a data point of a specific class is generated from the extra-class factor ξ, which gives discriminative information among classes, and the intra-class factor η, which represents variation within a class. In this equation, W and V are transformation matrices for the corresponding factors. We call W the extra-class factor loading and V the intra-class factor loading. Figure 1 represents this data generation model. To find a good subspace for classification, we try to find V and W using the class information given with the input data. In Section 3, we explain how to find the subspace based on the data generation model.

Fig. 1. The proposed data generation model

3

Factor Analysis Based on Data Generation Model

For given data x, if we keep the extra-class information and reduce the intra-class information as much as possible, we can expect better classification performance. In this respect, the proposed method can be considered similar to the traditional LDA method. LDA finds a projection matrix which simultaneously maximizes the between-class scatter and minimizes the within-class scatter of the original data set. The proposed method, on the other hand, first estimates the intra-class information from the set of within-class difference vectors, and then excludes the intra-class information from the original data to keep the extra-class information. Therefore, compared with LDA, the proposed method does


not need to compute an inverse of the within-class scatter matrix, and the number of basis vectors of the subspace does not depend on the number of classes. In addition, the proposed method obtains the subspaces in a simple manner, and many variations of it can be developed by exploiting various data generation models. In this section, we describe in detail how to obtain the subspace based on the simple linear generative model.

3.1 Intra-class Factor Loading

We first find the projection matrix Λ which represents the intra-class factor, in place of the intra-class factor loading matrix V itself, to obtain the intra-class information. Given a data set {x}, we calculate the difference vectors δ between pairs of data points from the same class, which can be written as

δ_ij^k = x_i^k − x_j^k = (Wξ_i^k − Wξ_j^k) + (Vη_i^k − Vη_j^k),   (4)

where x_i^k and x_j^k are data from class C_k (k = 1, …, K). Because x_i^k and x_j^k come from the same class, we can assume that the extra-class factor does not differ much, and we ignore the first term Wξ_i^k − Wξ_j^k. Then we obtain the approximate relationship

δ_ij^k ≈ V(η_i^k − η_j^k).   (5)

Based on this relationship, we try to find the factor loading matrix V. For the obtained data set

Δ = {δ_ij^k},  k = 1, …, K, i = 1, …, N, j = 1, …, N,   (6)

where K is the number of classes and N is the number of data points in each class, we apply PCA to obtain the principal components of Δ. The obtained matrix Λ of principal components of Δ gives the subspace which maximizes the variance of the intra-class factor η. The original data set X is projected onto this subspace to extract the intra-class factors Y_intra, such that

Y_intra = XΛ.   (7)

Note that Y_intra is a low-dimensional data set which contains the intra-class information of X. Using Y_intra, we reconstruct X_intra in the original dimension by the calculation

X_intra = Y_intra Λᵀ(ΛΛᵀ)⁻¹.   (8)

Note that X_intra keeps the intra-class information, which is not desirable for classification. To remove this undesirable information, we subtract X_intra from the original data set X. As a result, we get a new data set X̃ such that

X̃ = X − X_intra.   (9)


3.2 Extra-Class Factor Loading

Using the newly obtained data set X̃, we try to find the projection matrix Λ̃ which represents the extra-class factor, instead of the extra-class factor loading matrix W, so as to preserve the extra-class information as much as possible. To solve this problem, let us consider the data set X̃. Noting that a data point x_intra in the data set X_intra is a reconstruction from the intra-class factor, we can write the approximate relationship

x_intra ≈ Vη.   (10)

By combining this with equation (3), the newly obtained data point x̃ in X̃ can be rewritten as

x̃ ≈ x − Vη = Wξ.   (11)

From this, we can say that the new data set X̃ mainly carries the extra-class information, and thus we need to preserve this information as much as possible. From these considerations, we apply PCA to the new data set X̃ and obtain the principal component matrix Λ̃ of X̃. By projecting X̃ onto the basis vectors such that

X̃Λ̃ = Z̃,   (12)

we can obtain the data set Z̃, which has small intra-class variance and large extra-class variance. The obtained data set is used for classification.
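Steps (4)-(12), together with the minimum-distance classification used later in Section 4, can be sketched as follows. This is a minimal illustration: the PCA directions are taken orthonormal, so the reconstruction (8) reduces to multiplication by Λᵀ, and the centering choices are assumptions not stated in the text.

```python
import numpy as np

def fit_subspace(X, labels, d_intra, d_extra):
    """Sketch of the proposed subspace method, equations (4)-(12).
    X has samples as rows."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # (4)-(6): within-class difference vectors (mean zero by symmetry)
    deltas = []
    for c in np.unique(labels):
        Xc = X[labels == c]
        for i in range(len(Xc)):
            for j in range(len(Xc)):
                if i != j:
                    deltas.append(Xc[i] - Xc[j])
    Delta = np.array(deltas)
    # intra-class directions Lam: leading principal components of Delta
    _, _, Vt = np.linalg.svd(Delta, full_matrices=False)
    Lam = Vt[:d_intra].T
    # (7)-(9): project onto Lam, reconstruct, and subtract
    X_tilde = X - (X @ Lam) @ Lam.T
    # (10)-(12): extra-class directions from PCA of X_tilde
    _, _, Vt2 = np.linalg.svd(X_tilde - X_tilde.mean(axis=0),
                              full_matrices=False)
    Lam_t = Vt2[:d_extra].T
    return X_tilde @ Lam_t, Lam, Lam_t

def min_distance_classify(Z_tr, y_tr, Z_te):
    """Minimum distance classifier: assign the nearest class mean
    in Euclidean distance (a common reading of the method in [10])."""
    classes = np.unique(y_tr)
    means = np.array([Z_tr[y_tr == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(Z_te[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

As a toy usage example, take two classes whose within-class variation lies along one axis and whose class difference lies along another; the learned projection then removes the former and keeps the latter.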

4

Experimental Results

For verification of the proposed method, we conducted comparison experiments on facial data sets against conventional methods: PCA, LDA, and intra-person. For classification, we apply the minimum distance method [10] with Euclidean distance. When finding the subspace for each method, we optimized the dimension of the subspace with respect to the classification rate for each data set.

4.1 Face Recognition

We first conducted a face recognition task on face images with different viewpoints, obtained from the FERET (Face Recognition Technology) database (http://www.itl.nist.gov/iad/humanid/feret/). Figure 2 shows some samples from the data set. We used 450 images of 50 subjects; each subject has 9 images of different poses at 15-degree intervals from left to right. The left and right (±60 degree) and frontal images are used for training, and the remaining 300 images are used for testing. The size of each image is 50 × 70, so the dimension of the raw input is 3500. For the LDA method, we first applied PCA to solve the small sample size problem and obtained 52-dimensional features. We then applied LDA and obtained 9-dimensional features. Similarly, in the proposed method, we first


Fig. 2. Examples of the human face images with different viewpoints

Table 1. Result on face image data

Method        Dimension   Classification Rate (%)
PCA           117         97
LDA           9           99.66
Intra-Person  92          92.33
Proposed      8           100

find an 83-dimensional subspace for the intra-class factor and an 8-dimensional subspace for the extra-class factor. In this case, there are 50 classes and the number of data points in each class is very limited: 3 per class. The experimental results are shown in Table 1. The performance of the proposed method is perfect, and the other methods also give generally good results. In spite of the 50 classes and the limited number of training data, the good results may be due to the intrinsically high between-class variation.

4.2 Pose Recognition

We conducted a pose recognition task with the same data set used in Section 4.1. Here we used the 450 images as 9 viewpoint classes from −60° to 60° at 15° intervals; each class consists of 50 images of different persons. 225 images, 25 per class, are used for training, and the remaining 225 images are used for testing. For the LDA method, we first applied PCA and obtained 167-dimensional features. We then applied LDA


Table 2. Result on the human pose image data

Method        Dimension   Classification Rate (%)
PCA           65          36.44
LDA           6           57.78
Intra-Person  51          38.67
Proposed      21          58.22

and obtained 6-dimensional features. For the proposed method, we first find a 128-dimensional subspace for the intra-class factor and a 21-dimensional subspace for the extra-class factor. In this case, there are 9 classes and 225 training images, 25 from each class. The results are shown in Table 2. The performance is generally low, but the proposed method and LDA give much better performance than PCA and the intra-person method. From the low performance, we can conjecture that the between-class variance is very small in contrast to the within-class variance. Nevertheless, the proposed method achieved the best performance.

4.3 Facial Expression Recognition

We also conducted a facial expression recognition task with a data set obtained from PICS (Psychological Image Collection at Stirling) (http://pics.psych.stir.ac.uk/). Figure 3 shows sample facial expression images. We obtained 276 images of 69 persons; each person has 4 images with different expressions. 80 images, 20 per expression, are used for training, and the remaining 196 images are used for testing. The size of each image is 80 × 90, so the dimension of the raw input is 7200. For the LDA method, we first applied PCA and obtained 59-dimensional features. We then applied LDA and obtained 3-dimensional features. For the proposed method, we first find a 48-dimensional subspace for the intra-class factor and a 14-dimensional subspace for the extra-class factor. In this case, there are 4 classes and 20 training images per class. Although the performances of all methods are generally low, the proposed method performed much better than PCA and the intra-person method. As in the pose recognition task, it also seems that the between-class variance

Table 3. Result on facial expression image data

Method        Dimension   Classification Rate (%)
PCA           65          35.71
LDA           3           65.31
Intra-Person  76          42.35
Proposed      14          66.33

554

M. Cho, D. Yoon, and H. Park

Fig. 3. Examples of the human facial expression images

is very small in contrast to the variance of within-class. However, the proposed method is achieved the best performance.

5 Conclusions and Discussions

In this paper, we proposed a new subspace method based on a data generation model with class information, which can be represented by intra-class factors and extra-class factors. By removing the intra-class information from the original data and by keeping the extra-class information using PCA, we obtain low-dimensional features that preserve the essential information for the given classification problem. In experiments on various types of facial classification tasks, the proposed method showed better performance than conventional methods. As a further study, it may be possible to find a more sophisticated dimension reduction than PCA that can enlarge the extra-class information. Also, a kernel method could be applied to overcome the non-linearity problem.

Acknowledgements. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00807).


A Subspace Method Based on Data Generation Model



Hierarchical Feature Extraction for Compact Representation and Classification of Datasets

Markus Schubert and Jens Kohlmorgen

Fraunhofer FIRST.IDA, Kekuléstr. 7, 12489 Berlin, Germany
{markus,jek}@first.fraunhofer.de
http://ida.first.fraunhofer.de

Abstract. Feature extraction methods generally do not account for hierarchical structure in the data. For example, PCA and ICA provide transformations that depend solely on global properties of the overall dataset. We here present a general approach for extracting feature hierarchies from datasets and using them for classification or clustering. A hierarchy of features extracted from a dataset thereby constitutes a compact representation of the set that, on the one hand, can be used to characterize and understand the data and, on the other hand, serves as a basis for classifying or clustering a collection of datasets. As a proof of concept, we demonstrate the feasibility of this approach with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording.

1 Introduction

The vast majority of feature extraction methods does not account for hierarchical structure in the data. For example, PCA [1] and ICA [2] provide transformations that depend solely on global properties of the overall data set. The ability to model the hierarchical structure of the data, however, might well help to characterize and understand the information contained in the data. For example, neural dynamics are often characterized by a hierarchical structure in space and time, where methods for hierarchical feature extraction might help to group and classify such data. A particular demand for these methods exists in EEG recordings, where slow dynamical components (sometimes interpreted as internal “state” changes) and the variability of features make data analysis difficult. Hierarchical feature extraction has so far mainly been related to 2-D pattern analysis. In these approaches, pioneered by Fukushima’s work on the Neocognitron [3], the hierarchical structure is typically hard-wired a priori in the architecture, and the methods primarily apply to a 2-D grid structure. There are, however, more recent approaches, like local PCA [4] or tree-dependent component analysis [5], that are promising steps towards structured feature extraction methods that also derive the structure from the data. While local PCA in [4] is not hierarchical and tree-dependent component analysis in [5] is restricted to the context of ICA, we here present a general approach for the extraction of feature hierarchies and

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 556–565, 2008.
© Springer-Verlag Berlin Heidelberg 2008


their use for classification and clustering. We exemplify this by using PCA as the core feature extraction method. In [6] and [7], hierarchies of two-dimensional PCA projections (using probabilistic PCA [8]) were proposed for the purpose of visualizing high-dimensional data. For obtaining the hierarchies, the selection of sub-clusters was performed either manually [6] or automatically by using a model selection criterion (AIC, MDL) [7], but in both cases based on two-dimensional projections. A 2-D projection of high-dimensional data, however, is often not sufficient to unravel the structure of the data, which thus might hamper both approaches, in particular if the sub-clusters get superimposed in the projection. In contrast, our method is based on hierarchical clustering in the original data space, where the structural information is unchanged and therefore undiminished. Also, the focus of this paper is not on visualizing the data itself, which obviously is limited to 2-D or 3-D projections, but rather on extracting the hierarchical structure of the data (which can be visualized by plotting trees) and on replacing the data by a compact hierarchical representation in terms of a tree of extracted features, which can be used for classification and clustering. The individual quantity to be classified or clustered in this context is a tree of features representing a set of data points. Note that classifying sets of points is a more general problem than the well-known problem of classifying individual data points. Other approaches to classifying sets of points can be found, e.g., in [9,10], where the authors define a kernel on sets, which can then be used with standard kernel classifiers.

The paper is organized as follows. In section 2, we describe the hierarchical feature extraction method.
In section 3, we show how feature hierarchies can be used for classiﬁcation and clustering, and in section 4 we provide a proof of concept with an application to mixtures of Gaussians with varying degree of structuredness and to a clinical EEG recording. Section 5 concludes with a discussion.

2 Hierarchical Feature Extraction

We pursue a straightforward approach to hierarchical feature extraction that allows us to make any standard feature extraction method hierarchical: we perform hierarchical clustering of the data prior to feature extraction. The feature extraction method is then applied locally to each significant cluster in the hierarchy, resulting in a representation (or replacement) of the original dataset in terms of a tree of features.

2.1 Hierarchical Clustering

There are many known variants of hierarchical clustering algorithms (see, e.g., [11, 12]), which can be subdivided into divisive top-down procedures and agglomerative bottom-up procedures. More important than this procedural aspect, however, is the dissimilarity function that is used in most methods to quantify the dissimilarity between two clusters. This function is used as the criterion to


determine the clusters to be split (or merged) at each iteration of the top-down (or bottom-up) process. Thus, it is this function that determines the clustering result, and it implicitly encodes what a “good” cluster is. Common agglomerative procedures are single-linkage, complete-linkage, and average-linkage. They differ simply in that they use different dissimilarity functions [12]. We here use Ward’s method [13], also called the minimum variance method, which is agglomerative and successively merges the pair of clusters that causes the smallest increase in terms of the total sum-of-squared-errors (SSE), where the error is defined as the Euclidean distance of a data point to its cluster mean. The increase in square-error caused by merging two clusters, D_i and D_j, is given by

d(D_i, D_j) = \frac{n_i n_j}{n_i + n_j} \lVert m_i - m_j \rVert^2 ,   (1)

where n_i and n_j are the numbers of points in each cluster, and m_i and m_j are the means of the points in each cluster [12]. Ward’s method can now simply be described as a standard agglomerative clustering procedure [11,12] with the particular dissimilarity function d given in Eq. (1). We use Ward’s criterion because it is based on a global fitness criterion (SSE), and in [11] it is reported that the method outperformed other hierarchical clustering methods in several comparative studies. Nevertheless, depending on the particular application, other criteria might be useful as well. The result of a hierarchical clustering procedure that successively splits or merges two clusters is a binary tree. At each hierarchy level, k = 1, ..., n, it defines a partition of the given n samples into k clusters. The leaf node level consists of n nodes describing a partition into n clusters, where each cluster/node contains exactly one sample. Each hierarchy level further up contains one node with edges to the two child nodes that correspond to the clusters that have been merged.
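Ward's merge cost can be computed directly from the two clusters. The sketch below is our illustration (function names are ours, not the authors' code); its comment records the defining property that the cost equals the increase in total SSE caused by the merge.

```python
import numpy as np

def ward_increase(Di, Dj):
    """Increase in total SSE caused by merging clusters Di and Dj, Eq. (1):
    d(Di, Dj) = ni*nj/(ni+nj) * ||mi - mj||^2."""
    ni, nj = len(Di), len(Dj)
    mi, mj = Di.mean(axis=0), Dj.mean(axis=0)
    return ni * nj / (ni + nj) * float(np.sum((mi - mj) ** 2))

def sse(D):
    """Sum of squared Euclidean distances of the points to their mean."""
    return float(np.sum((D - D.mean(axis=0)) ** 2))

# Defining property of Ward's cost:
# sse(merged) - sse(Di) - sse(Dj) == ward_increase(Di, Dj)
```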
The tree can be depicted graphically as a dendrogram, which aligns the leaf nodes along the horizontal axis and connects them by lines to the higher-level nodes along the vertical axis. The position of the nodes along the vertical axis could in principle correspond linearly to the hierarchy level k. This, however, would reveal almost nothing of the structure in the data. Most of the structural information is actually contained in the dissimilarity values. One therefore usually positions the node at level k vertically with respect to the dissimilarity value of its two corresponding child clusters, D_i and D_j,

\delta(k) = d(D_i, D_j) .   (2)

For k = n, there are no child clusters, and therefore \delta(n) = 0 [11]. The function \delta can be regarded as a within-cluster dissimilarity. By using \delta as the vertical scale in a dendrogram, a large gap between two levels, for example k and k + 1, means that two very dissimilar clusters have been merged at level k.

2.2 Extracting a Tree of Significant Clusters

As we have seen in the previous subsection, a hierarchical clustering algorithm always generates a tree containing n − 1 non-singleton clusters. This does not


necessarily mean that any of these clusters is clearly separated from the rest of the data, or that there is any structure in the data at all. The identification of clearly separated clusters is usually done by visual inspection of the dendrogram, i.e., by identifying large gaps. For an automatic detection of significant clusters, we use the following straightforward criterion:

\frac{\delta(\mathrm{parent}(k))}{\delta(k)} > \alpha, \quad \text{for } 1 < k < n,   (3)

where parent(k) is the parent cluster level of the cluster obtained at level k, and α is a significance threshold. If a cluster at level k is merged into a cluster that has a within-cluster dissimilarity more than α times higher than that of cluster k, we call cluster k a significant cluster. That means that cluster k is significantly more compact than its merger (in the sense of the dissimilarity function). Note that this does not necessarily mean that the sibling of cluster k is also a significant cluster, as it might have a higher dissimilarity value than cluster k. The criterion directly corresponds to the relative increase of the dissimilarity value in a dendrogram from one merger level to the next. For small clusters that contain only a few points, the relative increase in dissimilarity can be large just because of the small sample size. To avoid such clusters being detected as significant, we require a minimum cluster size M for significant clusters. After having identified the significant clusters in the binary cluster tree, we can extract the tree of significant clusters simply by linking each significant cluster node to the next highest significant node in the tree, or, if there is none, to the root node (which is just for the convenience of getting a tree and not a forest). The tree of significant clusters is generally much smaller than the original tree, and it is not necessarily a binary tree anymore. Also note that there might be data points that are not in any significant cluster, e.g., outliers. The criterion in (3) is somewhat related to the criterion in [14], which is used to take clusters out of the merging process in order to obtain a plain, non-hierarchical clustering. The criterion in [14] accounts for the relative change of the absolute dissimilarity increments, which seems somewhat less intuitive and unnecessarily complicated. This criterion might also be overly sensitive to small variations in the dissimilarities.

2.3 Obtaining a Tree of Features

To obtain a representation of the original dataset in terms of a tree of features, we can now apply any standard feature extraction method to the data points in each significant cluster in the tree and then replace the data points in the cluster by their corresponding features. For PCA, for example, the data points in each significant cluster are replaced by their mean vector and the desired number of principal components, i.e., the eigenvectors and eigenvalues of the covariance matrix of the data points. The obtained hierarchy of features thus constitutes


a compact representation of the dataset that does not contain the individual data points anymore, which can save a considerable amount of memory. This representation is also independent of the size of the dataset. The hierarchy can on the one hand be used to analyze and understand the structure of the data, on the other hand – as we will further explain in the next section – it can be used to perform classiﬁcation or clustering in cases where the individual input quantity to be classiﬁed (or clustered) is an entire dataset and not, as usual, a single data point.
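The pipeline of sections 2.1 to 2.3 (Ward clustering, significance test, per-cluster PCA) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it uses SciPy's Ward linkage, whose merge heights are a monotone transform of the SSE increase, which keeps the ratio criterion qualitatively intact, and the values α = 3 and M = 40 are taken from Sec. 4.1.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def significant_clusters(X, alpha=3.0, min_size=40):
    """Return index arrays of clusters satisfying delta(parent)/delta(k) > alpha
    with at least min_size members (criterion of Sec. 2.2; sketch only)."""
    n = len(X)
    Z = linkage(X, method='ward')           # (n-1) merges: [left, right, height, size]
    members = [[i] for i in range(n)]       # data indices per tree node (leaves 0..n-1)
    delta = np.zeros(2 * n - 1)             # within-cluster dissimilarity per node
    parent = np.full(2 * n - 1, -1, dtype=int)
    for t, row in enumerate(Z):
        a, b, d = int(row[0]), int(row[1]), row[2]
        node = n + t
        members.append(members[a] + members[b])
        delta[node] = d
        parent[a] = parent[b] = node
    out = []
    for node in range(2 * n - 1):
        p = parent[node]
        if p == -1 or len(members[node]) < min_size or delta[node] <= 0:
            continue                        # root, too small, or a leaf-level node
        if delta[p] / delta[node] > alpha:  # Eq. (3)
            out.append(np.array(members[node]))
    return out

def pca_features(points, n_components=2):
    """Replace a cluster by its mean and leading principal components (Sec. 2.3)."""
    mean = points.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(points.T))
    order = np.argsort(w)[::-1][:n_components]
    return mean, w[order], V[:, order]
```

On three well-separated Gaussian blobs, each blob is returned as one significant cluster and then summarized by its mean, eigenvalues, and eigenvectors.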

3 Classification of Feature Trees

The classification problem that we address here is not the well-known problem of classifying individual data points or vectors. Instead, it relates to the classification of objects that are sets of data points, for example, time series. Given a “training set” of such objects, i.e., a number of datasets, each one attached with a certain class label, the problem consists in assigning a class label to each new, unlabeled dataset. This can be accomplished by transforming each individual dataset into a tree of features and by defining a suitable distance function to compare each pair of trees. For example, trees of principal components can be regarded as (hierarchical) mixtures of Gaussians, since the principal components of each node in the tree (the eigenvectors and eigenvalues) describe a normal distribution, which is an approximation to the true distribution of the underlying data points in the corresponding significant cluster. Two mixtures (sums) of Gaussians, f and g, corresponding to two trees of principal components (of two datasets), can be compared, e.g., by using the squared L_2-norm as the distance function, which is also called the integrated squared error (ISE),

\mathrm{ISE}(f, g) = \int (f - g)^2 \, dx .   (4)

The ISE has the advantage that the integral is analytically tractable for mixtures of Gaussians. Note that the computation of a tree of principal components, as described in the previous section, is in itself an interesting way to obtain a mixture-of-Gaussians representation of a dataset: without the need to specify the number of components in advance and without the need to run a maximum likelihood (gradient ascent) algorithm like, for example, expectation–maximization [15], which is prone to get stuck in local optima. Having obtained a distance function on feature trees, the next step is to choose a classification method that only requires pairwise distances to classify the trees (and their corresponding datasets).
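The tractability of Eq. (4) rests on the closed form of the Gaussian cross term, since the integral of N(x; m1, C1) N(x; m2, C2) over x equals N(m1; m2, C1 + C2). A minimal sketch (helper names are ours; a mixture is given as a list of weight/mean/covariance triples):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_overlap(m1, C1, m2, C2):
    """Closed form of the integral of N(x; m1, C1) * N(x; m2, C2) dx."""
    return multivariate_normal.pdf(m1, mean=m2, cov=C1 + C2)

def ise(f, g):
    """Integrated squared error between two Gaussian mixtures, Eq. (4).
    Each mixture is a list of (weight, mean, covariance) triples."""
    def cross(p, q):
        return sum(wi * wj * gauss_overlap(mi, Ci, mj, Cj)
                   for wi, mi, Ci in p for wj, mj, Cj in q)
    return cross(f, f) - 2.0 * cross(f, g) + cross(g, g)
```

By construction the ISE is symmetric, non-negative, and zero for identical mixtures.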
A particularly simple method is ﬁrst-nearest-neighbor (1-NN) classiﬁcation. For 1-NN classiﬁcation, the tree of a test dataset is assigned the label of the nearest tree of a collection of trees that were generated from a labeled “training set” of datasets. If the generated trees are suﬃciently diﬀerent among the classes, ﬁrst- (or k-) nearest-neighbor


classiﬁcation can already be suﬃcient to obtain a good classiﬁcation result, as we demonstrate in the next section. In addition to classiﬁcation, the distance function on feature trees can also be used to cluster a collection of datasets by clustering their corresponding trees. Any clustering algorithm that uses pairwise distances can be used for this purpose [11, 12]. In this way it is possible to identify homogeneous groups of datasets.
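Given any such pairwise distance on feature trees, 1-NN classification takes only a few lines; a generic sketch (the helper name and interface are ours):

```python
def one_nn_label(test_item, train_items, train_labels, dist):
    """1-NN: return the label of the training item nearest to test_item
    under the distance function `dist` (illustrative helper)."""
    d = [dist(test_item, t) for t in train_items]
    return train_labels[min(range(len(d)), key=d.__getitem__)]
```

Here `dist` would be an ISE-type distance on trees; the same interface works for any distance that is defined pairwise.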

4 Applications

4.1 Mixtures of Gaussians

As a proof of concept, we demonstrate the feasibility of this approach with an application to mixtures of Gaussians with varying degree of structuredness. From three classes of Gaussian mixture distributions, shown exemplarily in Fig. 1(a)-(c), we generated 10 training samples for each class, which constitute the training set, and a total of 100 test samples constituting the test set. Each sample contains 540 data points. The mixture distribution of each test sample was chosen with equal probability from one of the three classes. Next, we generated the binary cluster tree from each sample using Ward’s criterion. Examples of the corresponding dendrograms for each class are shown in Fig. 1(d)-(f) (in gray). We then determined the significant clusters in each tree, using the significance factor α = 3 and the minimum cluster size M = 40. In Fig. 1(d)-(f), the significant clusters are depicted as black dots, and the extracted trees of significant clusters are shown by means of thick black lines. The cluster of each node in a tree of significant clusters was then replaced by the principal components obtained from the data in the cluster, which turns the tree of clusters into a tree of features. In Fig. 1(g)-(i), the PCA components of all significant clusters are shown for the three example datasets from Fig. 1(a)-(c). Finally, we classified the feature trees obtained from the test samples, using the integrated squared error (Eq. (4)) and first-nearest-neighbor classification. We obtained a nearly perfect accuracy of 98% correct classifications (i.e., only two misclassifications), which can largely be attributed to the circumstance that the structural differences between the classes were correctly exposed in the tree structures. This result demonstrates that an appropriate representation of the data can make the classification problem very simple.

4.2 Clinical EEG

To demonstrate the applicability of our approach to real-world data, we used a clinical recording of human EEG. The recording was carried out in order to screen for pathological features, in particular a predisposition to epilepsy. The subject went through a number of experimental conditions: eyes open (EO), eyes closed (EC), hyperventilation (HV), post-hyperventilation (PHV), and, finally, stimulation with stroboscopic light of increasing frequency (PO: photic on).


Fig. 1. (a)-(c) Example datasets for the three types of mixture distributions used in the application. (d)-(f) The corresponding dendrograms for each example dataset (gray) and the extracted trees of signiﬁcant clusters (black). Note that the extracted tree structure exactly corresponds to the structure in the data. (g)-(i) The PCA components of all signiﬁcant clusters. The components are contained in the tree of features.

During the photic phase, the subject kept the eyes closed, while the rate of light ﬂashes was increased every four seconds in steps of 1 Hz, from 5 Hz to 25 Hz. The obtained recording was subdivided into 507 epochs of ﬁxed length (1s). For each epoch, we extracted four features that correspond to the power in

[Fig. 2 dendrogram; cluster purities: 82% (EC), 69% (PHV), 92% (EO), 88% (PO), 76% (HV), 90% (HV)]

Fig. 2. The tree of signiﬁcant clusters (black), obtained from the underlying dendrogram (gray) for the EEG data. The data in each signiﬁcant sub-cluster largely corresponds to one of the experimental conditions (indicated in %): eyes open (EO), eyes closed (EC), hyperventilation (HV), post-hyperventilation (PHV), and ‘photic on’ (PO).

specific frequency bands of particular EEG electrodes.¹ The resulting set of four-dimensional feature vectors was then analyzed by our method. For the hierarchical clustering, we used Ward’s method and found the significant clusters depicted in Fig. 2. The extracted tree of significant clusters consists of a two-level hierarchy. As expected, the majority of feature vectors in each sub-cluster corresponds to one of the experimental conditions. By applying PCA to each sub-cluster and replacing the data of each node with its principal components, we obtain a tree of features, which constitutes a compact representation of the original dataset. It can then be used for comparison with trees that arise from normal or various kinds of pathological EEG, as outlined in section 3.
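Band-power features of the kind listed in the footnote can be computed from a simple periodogram. The sketch below is ours (the paper does not specify its estimator), and the sampling rate in the usage comment is an assumption:

```python
import numpy as np

def band_power(x, fs, f_lo, f_hi):
    """Power of a 1-D signal x in the band [f_lo, f_hi] Hz, from a plain
    FFT periodogram. Illustrative sketch, not the authors' code."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return float(psd[mask].sum())

# e.g. alpha-band power of a 1 s epoch at an assumed sampling rate of 256 Hz:
# alpha = band_power(epoch_o1, fs=256, f_lo=8, f_hi=12)
```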

5 Discussion

We proposed a general approach for the extraction of feature hierarchies from datasets and their use for classification or clustering. The feasibility of this approach was demonstrated with an application to mixtures of Gaussians with

¹ In detail: (I.) the power of the α-band (8–12 Hz) at the electrode positions O1 and O2 (according to the international 10–20 system), (II.) the power of 5 Hz and its harmonics (except 50 Hz) at electrode F4, (III.) the power of 6 Hz and its harmonics at electrode F8, and (IV.) the power of the 25–80 Hz band at F7.


varying degree of structuredness and to a clinical EEG recording. In this paper we focused on PCA as the core feature extraction method. Other types of feature extraction, like, e.g., ICA, are also conceivable; these should then be complemented with an appropriate distance function on the feature trees (if used for classification or clustering). The basis of the proposed approach is hierarchical clustering. The quality of the resulting feature hierarchies thus depends on the quality of the clustering. Ward’s criterion tends to find compact, hyperspherical clusters, which may not always be the optimal choice for a given problem. Therefore, one should consider adjusting the clustering criterion to the problem at hand. Our future work will focus on the application of this method to classify normal and pathological EEG. By comparing the different tree structures, the hope is to gain a better understanding of the pathological cases.

Acknowledgements. This work was funded by the German BMBF under grant 01GQ0415 and supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

References

[1] Jolliffe, I.: Principal Component Analysis. Springer, New York (1986)
[2] Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, Chichester (2001)
[3] Fukushima, K.: Neural network model for a mechanism of pattern recognition unaffected by shift in position — neocognitron. Transactions IECE 62-A(10), 658–665 (1979)
[4] Bregler, C., Omohundro, S.: Surface learning with applications to lipreading. In: Cowan, J., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems, vol. 6, pp. 43–50. Morgan Kaufmann Publishers, San Mateo (1994)
[5] Bach, F., Jordan, M.: Beyond independent components: Trees and clusters. Journal of Machine Learning Research 4, 1205–1233 (2003)
[6] Bishop, C., Tipping, M.: A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 281–293 (1998)
[7] Wang, Y., Luo, L., Freedman, M., Kung, S.: Probabilistic principal component subspaces: A hierarchical finite mixture model for data visualization. IEEE Transactions on Neural Networks 11(3), 625–636 (2000)
[8] Tipping, M., Bishop, C.: Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B 61(3), 611–622 (1999)
[9] Kondor, R., Jebara, T.: A kernel between sets of vectors. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the ICML, pp. 361–368. AAAI Press, Menlo Park (2003)
[10] Desobry, F., Davy, M., Fitzgerald, W.: A class of kernels for sets of vectors. In: Proceedings of the ESANN, pp. 461–466 (2005)
[11] Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice Hall, Inc., Englewood Cliffs (1988)
[12] Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley–Interscience, Chichester (2000)


[13] Ward, J.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58, 236–244 (1963)
[14] Fred, A., Leitao, J.: Clustering under a hypothesis of smooth dissimilarity increments. In: Proceedings of the ICPR, vol. 2, pp. 190–194 (2000)
[15] Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39, 1–38 (1977)

Principal Component Analysis for Sparse High-Dimensional Data

Tapani Raiko, Alexander Ilin, and Juha Karhunen

Adaptive Informatics Research Center, Helsinki Univ. of Technology, P.O. Box 5400, FI-02015 TKK, Finland
{Tapani.Raiko,Alexander.Ilin,Juha.Karhunen}@tkk.fi
http://www.cis.hut.fi/projects/bayes/

Abstract. Principal component analysis (PCA) is a widely used technique for data analysis and dimensionality reduction. Eigenvalue decomposition is the standard algorithm for solving PCA, but a number of other algorithms have been proposed. For instance, the EM algorithm is much more efficient in the case of high dimensionality and a small number of principal components. We study a case where the data are high-dimensional and a majority of the values are missing. In this case, both of these algorithms turn out to be inadequate. We propose using a gradient descent algorithm inspired by Oja’s rule, and speeding it up by an approximate Newton’s method. The computational complexity of the proposed method is linear with respect to the number of observed values in the data and to the number of principal components. In experiments with the Netflix data, the proposed algorithm is about ten times faster than any of the four comparison methods.

1 Introduction

Principal component analysis (PCA) [1,2,3,4,5,6] is a classic technique in data analysis. It can be used for compressing higher-dimensional data sets to lower-dimensional ones for data analysis, visualization, feature extraction, or data compression. PCA can be derived from a number of starting points and optimization criteria [2,3,4]. The most important of these are minimization of the mean-square error in data compression, finding mutually orthogonal directions in the data having maximal variances, and decorrelation of the data using orthogonal transformations [5]. While standard PCA is a very well-established linear statistical technique based on second-order statistics (covariances), it has recently been extended in various directions and considered from novel viewpoints. For example, various adaptive algorithms for PCA have been considered and reviewed in [4,6]. Fairly recently, PCA was shown to emerge as a maximum likelihood solution from a probabilistic latent variable model independently by several authors; see [3] for a discussion and references. In this paper, we study PCA in the case where most of the data values are missing (or unknown). Common algorithms for solving PCA prove to be

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 566–575, 2008.
© Springer-Verlag Berlin Heidelberg 2008


inadequate in this case, and we thus propose a new algorithm. The problem of overﬁtting and possible solutions are also outlined.

2 Algorithms for Principal Component Analysis

Principal subspace and components. Assume that we have n d-dimensional data vectors x_1, x_2, ..., x_n, which form the d × n data matrix X = [x_1, x_2, ..., x_n]. The matrix X is decomposed into

$$X \approx AS, \qquad (1)$$

where A is a d × c matrix, S is a c × n matrix and c ≤ d ≤ n. Principal subspace methods [6,4] find A and S such that the reconstruction error

$$C = \|X - AS\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{n} \Bigl( x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj} \Bigr)^2 \qquad (2)$$

is minimized. Here $\|\cdot\|_F$ denotes the Frobenius norm, and x_{ij}, a_{ik}, and s_{kj} are elements of the matrices X, A, and S, respectively. Typically the row-wise mean is removed from X as a preprocessing step. Without any further constraints, there exist infinitely many ways to perform such a decomposition. However, the subspace spanned by the column vectors of the matrix A, called the principal subspace, is unique. In PCA, these vectors are mutually orthogonal and have unit length. Further, for each k = 1, ..., c, the first k vectors form the k-dimensional principal subspace. This makes the solution practically unique; see [4,2,5] for details.

There are many ways to determine the principal subspace and components [6,4,2]. We will discuss three common methods that can be adapted for the case of missing values.

Singular Value Decomposition. PCA can be determined by using the singular value decomposition (SVD) [5]

$$X = U \Sigma V^T, \qquad (3)$$

where U is a d × d orthogonal matrix, V is an n × n orthogonal matrix, and Σ is a d × n pseudodiagonal matrix (diagonal if d = n) with the singular values on its main diagonal [5]. The PCA solution is obtained by selecting the c largest singular values from Σ, by forming A from the corresponding c columns of U, and S from the corresponding c rows of ΣV^T. Note that PCA can equivalently be defined using the eigendecomposition of the d × d covariance matrix C of the column vectors of the data matrix X:

$$C = \frac{1}{n} X X^T = U D U^T. \qquad (4)$$

Here, the diagonal matrix D contains the eigenvalues of C, and the columns of the matrix U contain the unit-length eigenvectors of C in the same order [6,4,2,5]. Again, the columns of U corresponding to the largest eigenvalues are taken as A, and S is computed as A^T X. This approach can be more efficient for cases where d ≪ n, since it avoids forming the n × n matrix.

T. Raiko, A. Ilin, and J. Karhunen

EM Algorithm. The EM algorithm for solving PCA [7] iterates updating A and S alternately.¹ When either of these matrices is fixed, the other one can be obtained from an ordinary least-squares problem. The algorithm alternates between the updates

$$S \leftarrow (A^T A)^{-1} A^T X, \qquad A \leftarrow X S^T (S S^T)^{-1}. \qquad (5)$$
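As a concrete illustration (our sketch, not the paper's code; dimensions and data are arbitrary), the following NumPy snippet computes the same rank-c reconstruction both via the truncated SVD of Eq. (3) and via the alternating least-squares updates of Eq. (5):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, c = 10, 200, 3
X = rng.normal(size=(d, c)) @ rng.normal(size=(c, n))  # exactly rank-c data

# PCA via SVD, Eq. (3): keep the c largest singular values.
U, sv, VT = np.linalg.svd(X, full_matrices=False)
A_svd = U[:, :c]                        # c columns of U
S_svd = np.diag(sv[:c]) @ VT[:c]        # c rows of Sigma V^T

# EM iteration of Eq. (5): alternating least-squares updates.
A = rng.normal(size=(d, c))
for _ in range(50):
    S = np.linalg.solve(A.T @ A, A.T @ X)    # S <- (A^T A)^{-1} A^T X
    A = X @ S.T @ np.linalg.inv(S @ S.T)     # A <- X S^T (S S^T)^{-1}
```

For rank-c data both routes reach the same reconstruction AS = X; the SVD route additionally returns an orthonormal A.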

This iteration is especially efficient when only a few principal components are needed, that is, when c ≪ d [7].

Subspace Learning Algorithm. It is also possible to minimize the reconstruction error (2) by any optimization algorithm. Applying the gradient descent algorithm yields rules for simultaneous updates

$$A \leftarrow A + \gamma (X - AS) S^T, \qquad S \leftarrow S + \gamma A^T (X - AS), \qquad (6)$$

where γ > 0 is called the learning rate. The Oja–Karhunen learning algorithm [8,9,6,4] is an online learning method that uses the EM formula for computing S and the gradient for updating A, a single data vector at a time. A possible speed-up to the subspace learning algorithm is to use the natural gradient [10] for the space of matrices. This yields the update rules

$$A \leftarrow A + \gamma (X - AS) S^T A^T A, \qquad S \leftarrow S + \gamma S S^T A^T (X - AS). \qquad (7)$$

If needed, the end result of subspace analysis can be transformed into the PCA solution, for instance, by computing the eigenvalue decomposition $S S^T = U_S D_S U_S^T$ and the singular value decomposition $A U_S D_S^{1/2} = U_A \Sigma_A V_A^T$. The transformed A is formed from the first c columns of $U_A$ and the transformed S from the first c rows of $\Sigma_A V_A^T D_S^{-1/2} U_S^T S$. Note that the required decompositions are computationally lighter than the ones done to the data matrix directly.
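The transformation just described can be verified numerically. The sketch below (our illustration; shapes are arbitrary) converts an arbitrary factorization (A, S) into one with orthonormal columns in A and decorrelated rows in S, without changing the product AS:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, c = 8, 50, 3
A = rng.normal(size=(d, c))
S = rng.normal(size=(c, n))

# Eigendecomposition S S^T = U_S diag(D_S) U_S^T (symmetric, so eigh applies).
D_S, U_S = np.linalg.eigh(S @ S.T)
# SVD of A U_S D_S^{1/2} = U_A Sigma_A V_A^T.
U_A, sig, VT_A = np.linalg.svd(A @ U_S @ np.diag(np.sqrt(D_S)),
                               full_matrices=False)

A_pca = U_A[:, :c]                                          # orthonormal columns
S_pca = np.diag(sig) @ VT_A @ np.diag(D_S ** -0.5) @ U_S.T @ S
```

The product is unchanged because the inserted factors cancel: A_pca S_pca = A U_S U_S^T S = A S, while S_pca S_pca^T = diag(σ²) is diagonal.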

3 Principal Component Analysis with Missing Values

Let us consider the same problem when the data matrix has missing entries². In the following there are N = 9 observed values and 6 missing values marked with a question mark (?):

$$X = \begin{bmatrix} -1 & +1 & 0 & 0 & ? \\ -1 & +1 & ? & ? & 0 \\ ? & ? & -1 & +1 & ? \end{bmatrix}. \qquad (8)$$

¹ The procedure studied in [7] can be seen as the zero-noise limit of the EM algorithm for a probabilistic PCA model.
² We make the typical assumption that values are missing at random, that is, the missingness does not depend on the unobserved data. An example where the assumption does not hold is when out-of-scale measurements are marked missing.


We would like to find A and S such that X ≈ AS for the observed data values. The rest of the product AS represents the reconstruction of missing values.

Adapting SVD. One can use the SVD approach (4) in order to find an approximate solution to the PCA problem. However, estimating the covariance matrix C becomes very difficult when there are lots of missing values. If we estimate C leaving out terms with missing values from the average, we get for the estimate of the covariance matrix

$$C = \frac{1}{n} X X^T = \begin{bmatrix} 0.5 & 1 & 0 \\ 1 & 0.667 & ? \\ 0 & ? & 1 \end{bmatrix}. \qquad (9)$$

There are at least two problems. First, the estimated covariance 1 between the first and second components is larger than their estimated variances 0.5 and 0.667. This is clearly wrong, and leads to the situation where the covariance matrix is not positive (semi)definite and some of its eigenvalues are negative. Second, the covariance between the second and the third component could not be estimated at all³. Both problems appeared in practice with the data set considered in Section 5.

Another option is to complete the data matrix by iteratively imputing the missing values (see, e.g., [2]). Initially, the missing values can be replaced by zeroes. The covariance matrix of the complete data can be estimated without the problems mentioned above. Now, the product AS can be used as a better estimate for the missing values, and this process can be iterated until convergence. This approach requires the use of the complete data matrix, and is therefore computationally very expensive if a large part of the data matrix is missing. The time complexity of computing the sample covariance matrix explicitly is O(nd²). We will further refer to this approach as the imputation algorithm. Note that after convergence, the missing values do not contribute to the reconstruction error (2). This means that the imputation algorithm leads to the solution which minimizes the reconstruction error of the observed values only.

Adapting the EM Algorithm.
Grung and Manne [11] studied the EM algorithm in the case of missing values. Experiments showed faster convergence compared to the iterative imputation algorithm. The computational complexity is O(N c² + n c³) per iteration, where N is the number of observed values, assuming naïve matrix multiplications and inversions but exploiting sparsity. This is quite a bit heavier than EM with complete data, whose complexity is O(ndc) per iteration [7].

Adapting the Subspace Learning Algorithm. The subspace learning algorithm works in a straightforward manner also in the presence of missing values.

³ It could be filled by finding a value that maximizes the determinant of the covariance matrix (and thus the entropy of the underlying Gaussian distribution).


We just take the sum over only those indices i and j for which the data entry x_{ij} (the ijth element of X) is observed, in short (i, j) ∈ O. The cost function is

$$C = \sum_{(i,j)\in O} e_{ij}^2, \qquad \text{with} \quad e_{ij} = x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj}, \qquad (10)$$

and its partial derivatives are

$$\frac{\partial C}{\partial a_{il}} = -2 \sum_{j \mid (i,j)\in O} e_{ij} s_{lj}, \qquad \frac{\partial C}{\partial s_{lj}} = -2 \sum_{i \mid (i,j)\in O} e_{ij} a_{il}. \qquad (11)$$

The update rules for gradient descent are

$$A \leftarrow A - \gamma \frac{\partial C}{\partial A}, \qquad S \leftarrow S - \gamma \frac{\partial C}{\partial S}, \qquad (12)$$

and the update rules for natural gradient descent are

$$A \leftarrow A - \gamma \frac{\partial C}{\partial A} A^T A, \qquad S \leftarrow S - \gamma S S^T \frac{\partial C}{\partial S}. \qquad (13)$$

We propose a novel speed-up to the original simple gradient descent algorithm. In Newton's method for optimization, the gradient is multiplied by the inverse of the Hessian matrix. Newton's method is known to converge fast especially in the vicinity of the optimum, but using the full Hessian is computationally too demanding in truly high-dimensional problems. Here we use only the diagonal part of the Hessian matrix. We also include a control parameter α that allows the learning algorithm to interpolate between standard gradient descent (α = 0) and the diagonal Newton's method (α = 1), much like the Levenberg–Marquardt algorithm. The learning rules then take the form

$$a_{il} \leftarrow a_{il} - \gamma \frac{\partial C}{\partial a_{il}} \left( \frac{\partial^2 C}{\partial a_{il}^2} \right)^{-\alpha} = a_{il} + \gamma \, \frac{\sum_{j \mid (i,j)\in O} e_{ij} s_{lj}}{\bigl( \sum_{j \mid (i,j)\in O} s_{lj}^2 \bigr)^{\alpha}}, \qquad (14)$$

$$s_{lj} \leftarrow s_{lj} - \gamma \frac{\partial C}{\partial s_{lj}} \left( \frac{\partial^2 C}{\partial s_{lj}^2} \right)^{-\alpha} = s_{lj} + \gamma \, \frac{\sum_{i \mid (i,j)\in O} e_{ij} a_{il}}{\bigl( \sum_{i \mid (i,j)\in O} a_{il}^2 \bigr)^{\alpha}}. \qquad (15)$$

The computational complexity is O(N c + nc) per iteration.
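A minimal NumPy sketch of the update rules (14)–(15), using a dense 0/1 mask for the observed set O (the function name, initialization scale, fixed learning rate, and iteration count are our choices for illustration; the constant factor 2 from the derivatives is absorbed into γ):

```python
import numpy as np

def pca_missing(X, mask, c, alpha=0.625, gamma=0.01, n_iter=200, seed=0):
    """Fit X ~ A S on observed entries with the alpha-interpolated
    diagonal-Newton updates of Eqs. (14)-(15)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    A = 0.1 * rng.normal(size=(d, c))
    S = 0.1 * rng.normal(size=(c, n))
    M = mask.astype(float)                 # 1 where (i,j) is observed
    eps = 1e-12                            # guards fully unobserved rows/columns
    for _ in range(n_iter):
        E = M * (X - A @ S)                # e_ij on O, zero elsewhere
        HA = M @ (S ** 2).T + eps          # [i,l]: sum of s_lj^2 over observed j
        A = A + gamma * (E @ S.T) / HA ** alpha
        E = M * (X - A @ S)
        HS = (A ** 2).T @ M + eps          # [l,j]: sum of a_il^2 over observed i
        S = S + gamma * (A.T @ E) / HS ** alpha
    return A, S
```

Each sweep costs O(Nc + nc) when the masked products are computed sparsely; the dense mask here is only for brevity.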

4 Overfitting

A trained PCA model can be used for reconstructing missing values:

$$\hat{x}_{ij} = \sum_{k=1}^{c} a_{ik} s_{kj}, \qquad (i,j) \notin O. \qquad (16)$$

Although PCA performs a linear transformation of the data, overfitting is a serious problem for large-scale problems with lots of missing values. This happens when the value of the cost function C in Eq. (10) is small for the training data, but the quality of the prediction (16) is poor for new data. For further details, see [12].


Regularization. A popular way to regularize ill-posed problems is penalizing the use of large parameter values by adding a proper penalty term into the cost function; see for example [3]. In our case, one can modify the cost function in Eq. (10) as follows:

$$C_\lambda = \sum_{(i,j)\in O} e_{ij}^2 + \lambda \bigl( \|A\|_F^2 + \|S\|_F^2 \bigr). \qquad (17)$$
This has the effect that parameters that do not have significant evidence will decay towards zero. A more general penalization would use different regularization parameters λ for different parts of A and S. For example, one can use a parameter λ_k of its own for each of the column vectors a_k of A and the row vectors s_k of S. Note that since the columns of A can be scaled arbitrarily by rescaling the rows of S accordingly, one can fix the regularization term for a_k, for instance, to unity.

An equivalent optimization problem can be obtained using a probabilistic formulation with (independent) Gaussian priors and a Gaussian noise model:

$$p(x_{ij} \mid A, S) = \mathcal{N}\Bigl( x_{ij} ;\; \sum_{k=1}^{c} a_{ik} s_{kj},\; v_x \Bigr), \qquad (18)$$

$$p(a_{ik}) = \mathcal{N}(a_{ik} ; 0, 1), \qquad p(s_{kj}) = \mathcal{N}(s_{kj} ; 0, v_{s_k}), \qquad (19)$$

where N(x; m, v) denotes the random variable x having a Gaussian distribution with mean m and variance v. The regularization parameter λ_k = v_x / v_{s_k} is the ratio of the variances v_x and v_{s_k}. Then, the cost function (ignoring constants) is minus the logarithm of the posterior for A and S:

$$C_{BR} = \sum_{(i,j)\in O} \bigl( e_{ij}^2 / v_x + \ln v_x \bigr) + \sum_{i=1}^{d} \sum_{k=1}^{c} a_{ik}^2 + \sum_{k=1}^{c} \sum_{j=1}^{n} \bigl( s_{kj}^2 / v_{s_k} + \ln v_{s_k} \bigr). \qquad (20)$$
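As an illustration of the penalized cost (17), here is a hedged sketch (function names are ours; the simple weight-decay gradient step below is the plain-gradient analogue of the penalty, not the paper's exact procedure):

```python
import numpy as np

def regularized_cost(X, mask, A, S, lam):
    """C_lambda of Eq. (17): observed squared error plus Frobenius penalties."""
    E = mask * (X - A @ S)
    return np.sum(E ** 2) + lam * (np.sum(A ** 2) + np.sum(S ** 2))

def penalized_step(X, mask, A, S, lam, gamma):
    """One gradient step on C_lambda; the penalty shows up as weight decay."""
    E = mask * (X - A @ S)
    A = A + gamma * (E @ S.T - lam * A)
    E = mask * (X - A @ S)
    S = S + gamma * (A.T @ E - lam * S)
    return A, S
```

For a small enough step size γ, each call to `penalized_step` decreases C_λ, pulling unneeded parameters toward zero.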

An attractive property of the Bayesian formulation is that it provides a natural way to choose the regularization constants. This can be done using the evidence framework (see, e.g., [3]) or simply by minimizing C_BR, setting v_x and v_{s_k} to the means of e_{ij}² and s_{kj}², respectively. We will use the latter approach and refer to it as regularized PCA. Note that in the case of joint optimization of C_BR with respect to a_{ik}, s_{kj}, v_{s_k}, and v_x, the cost function (20) has a trivial minimum with s_{kj} = 0, v_{s_k} → 0. We try to avoid this minimum by using an orthogonalized solution provided by unregularized PCA from the learning rules (14) and (15) for initialization. Note also that setting v_{s_k} to small values for some components k is equivalent to the removal of irrelevant components from the model. This allows for automatic determination of the proper dimensionality c instead of discrete model comparison (see, e.g., [13]). This justifies using separate v_{s_k} in the model in (19).

Variational Bayesian Learning. Variational Bayesian (VB) learning provides even stronger tools against overfitting. The VB version of PCA by [13] approximates


the joint posterior of the unknown quantities using a simple multivariate distribution. Each model parameter is described a posteriori using independent Gaussian distributions. The means can then be used as point estimates of the parameters, while the variances give at least a crude estimate of the reliability of these point estimates. The method in [13] does not extend to missing values easily, but the subspace learning algorithm (Section 3) can be extended to VB. The derivation is somewhat lengthy, and it is omitted here together with the variational Bayesian learning rules because of space limitations; see [12] for details. The computational complexity of this method is still O(N c + nc) per iteration, but the VB version is in practice about 2–3 times slower than the original subspace learning algorithm.

5 Experiments

Collaborative filtering is the task of predicting preferences (or producing personal recommendations) by using other people's preferences. The Netflix problem [14] is such a task. It consists of movie ratings given by n = 480189 customers to d = 17770 movies. There are N = 100480507 ratings from 1 to 5 given, from which 1408395 ratings are reserved for validation (or probing). Note that 98.8% of the values are thus missing. We tried to find c = 15 principal components from the data using a number of methods.⁴ We subtracted the mean rating for each movie, assuming 22 extra ratings of 3 for each movie as a Dirichlet prior.

Computational Performance. In the first set of experiments we compared the computational performance of different algorithms on PCA with missing values. The root mean square (rms) error is measured on the training data,

$$E_O = \sqrt{ \frac{1}{|O|} \sum_{(i,j)\in O} e_{ij}^2 }.$$

All experiments were run on a dual-CPU AMD Opteron SE 2220 using Matlab.

First, we tested the imputation algorithm. The first iteration, where the missing values are replaced with zeros, was completed in 17 minutes and led to E_O = 0.8527. This iteration was still tolerably fast because the complete data matrix was sparse. After that, it takes about 30 hours per iteration. After three iterations, E_O was still 0.8513.

Using the EM algorithm by [11], the E-step (updating S) takes 7 hours and the M-step (updating A) takes 18 hours. (There is some room for optimization since we used a straightforward Matlab implementation.) Each iteration gives a much larger improvement compared to the imputation algorithm, but starting from a random initialization, EM could not reach a good solution in reasonable time.

We also tested the subspace learning algorithm described in Section 3 with and without the proposed speed-up. Each run of the algorithm with different values of the speed-up parameter α was initialized at the same starting point (generated randomly from a normal distribution). The learning rate γ was adapted such that

⁴ The PCA approach has been considered by other Netflix contestants as well (see, e.g., [15,16]).
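The mean subtraction above amounts to shrinking each movie's mean toward 3 with 22 pseudo-ratings; a hedged one-line sketch (function name ours, reflecting our reading of the smoothing step):

```python
def shrunk_movie_mean(rating_sum, count, prior_count=22, prior_mean=3.0):
    """Per-movie mean computed as if prior_count extra ratings of
    prior_mean had been observed."""
    return (rating_sum + prior_count * prior_mean) / (count + prior_count)
```

A movie with no ratings gets the prior mean 3.0, while a frequently rated movie keeps essentially its empirical mean.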


[Figure 1: two panels of learning curves (see caption below). Left panel legend: Gradient, Speed-up, Natural Grad., Imputation, EM. Right panel legend: Gradient, Speed-up, Natural Grad., Regularized, VB1, VB2.]

Fig. 1. Left: Learning curves for unregularized PCA (Section 3) applied to the Netﬂix data: Root mean-square error on the training data EO is plotted against computation time in hours. Right: The root mean square error on the validation data EV from the Netﬂix problem during runs of several algorithms: basic PCA (Section 3), regularized PCA (Section 4) and VB (Section 4). VB1 has some parameters ﬁxed (see [12]) while VB2 updates all the parameters. The time scales are linear below 1 and logarithmic above 1.

if an update decreased the cost function, γ was multiplied by 1.1. Each time an update would have increased the cost, the update was canceled and γ was divided by 2. Figure 1 (left) shows the learning curves for basic gradient descent, natural gradient descent, and the proposed speed-up with the best found parameter value α = 0.625. The proposed speed-up gave about a tenfold speed-up compared to the gradient descent algorithm even if each iteration took longer. Natural gradient was slower than the basic gradient. Table 1 gives a summary of the computational complexities.

Table 1. Summary of the computational performance of different methods on the Netflix problem. Computational complexities (per iteration) assume naïve computation of products and inverses of matrices and ignore the computation of SVD in the imputation algorithm. While the proposed speed-up makes each iteration slower than the basic gradient update, the time to reach the error level 0.85 is greatly diminished.

  Method          Complexity        Seconds/Iter   Hours to E_O = 0.85
  Gradient        O(Nc + nc)        58             1.9
  Speed-up        O(Nc + nc)        110            0.22
  Natural Grad.   O(Nc + nc²)       75             3.5
  Imputation      O(nd²)            110000         64
  EM              O(Nc² + nc³)      45000          58

Overfitting. We compared PCA (Section 3), regularized PCA (Section 4) and VB-PCA (Section 4) by computing the rms reconstruction error for the validation set V, that is, testing how the models generalize to new data:

$$E_V = \sqrt{ \frac{1}{|V|} \sum_{(i,j)\in V} e_{ij}^2 }.$$

We tested VB-PCA by first fixing some of the parameter values (this run is marked as VB1 in Fig. 1; see [12] for details) and secondly by


adapting them (marked as VB2). We initialized regularized PCA and VB1 using normal PCA learned with α = 0.625 and orthogonalized A, and VB2 using VB1. The parameter α was set to 2/3. Fig. 1 (right) shows the results. The performance of basic PCA starts to degrade during learning, especially when using the proposed speed-up. Natural gradient diminishes this phenomenon, known as overlearning, but it is even more effective to use regularization. The best results were obtained using VB2: the final validation error E_V was 0.9180 and the training rms error E_O was 0.7826, which is naturally larger than the unregularized E_O = 0.7657.

6 Discussion

We studied a number of different methods for PCA with sparse data, and it turned out that a simple gradient descent approach worked best due to its minimal computational complexity per iteration. We could also speed it up more than ten times by using an approximated Newton's method. We found empirically that setting the parameter α = 2/3 seems to work well for our problem. It is left for future work to find out whether this generalizes to other problem settings. There are also many other ways to speed up the gradient descent algorithm. The natural gradient did not help here, but we expect that the conjugate gradient method would. The modification to the gradient proposed in this paper could be used together with the conjugate gradient speed-up. This will be another future research topic.

There are also other benefits in solving the PCA problem by gradient descent. Algorithms that minimize an explicit cost function are rather easy to extend. The case of variational Bayesian learning applied to PCA was considered in Section 4, but there are many other extensions of PCA, such as using non-Gaussianity, nonlinearity, mixture models, and dynamics. The developed algorithms can prove useful in many applications such as bioinformatics, speech processing, and meteorology, in which large-scale datasets with missing values are very common. The required computational burden is linearly proportional to the number of measured values. Note also that the proposed techniques provide an analogue of confidence regions showing the reliability of the estimated quantities.

Acknowledgments. This work was supported in part by the Academy of Finland under its Centers for Excellence in Research Program, and the IST Program of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views. We would like to thank Antti Honkela for useful comments.

References

1. Pearson, K.: On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2(6), 559–572 (1901)
2. Jolliffe, I.: Principal Component Analysis. Springer, Heidelberg (1986)
3. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)


4. Diamantaras, K., Kung, S.: Principal Component Neural Networks – Theory and Applications. Wiley, Chichester (1996)
5. Haykin, S.: Modern Filters. Macmillan, Basingstoke (1989)
6. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing – Learning Algorithms and Applications. Wiley, Chichester (2002)
7. Roweis, S.: EM algorithms for PCA and SPCA. In: Advances in Neural Information Processing Systems, vol. 10, pp. 626–632. MIT Press, Cambridge (1998)
8. Karhunen, J., Oja, E.: New methods for stochastic approximation of truncated Karhunen-Loeve expansions. In: Proceedings of the 6th International Conference on Pattern Recognition, pp. 550–553. Springer, Heidelberg (1982)
9. Oja, E.: Subspace Methods of Pattern Recognition. Research Studies Press and J. Wiley (1983)
10. Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–276 (1998)
11. Grung, B., Manne, R.: Missing values in principal component analysis. Chemometrics and Intelligent Laboratory Systems 42(1), 125–139 (1998)
12. Raiko, T., Ilin, A., Karhunen, J.: Principal component analysis for large scale problems with lots of missing values. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 691–698. Springer, Heidelberg (2007)
13. Bishop, C.: Variational principal components. In: Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN 1999), pp. 509–514 (1999)
14. Netflix: Netflix prize webpage (2007), http://www.netflixprize.com/
15. Funk, S.: Netflix update: Try this at home (December 2006), http://sifter.org/~simon/journal/20061211.html
16. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the International Conference on Machine Learning (2007)

Hierarchical Bayesian Inference of Brain Activity

Masa-aki Sato¹ and Taku Yoshioka¹,²

¹ ATR Computational Neuroscience Laboratories, [email protected]
² National Institute of Information and Communication Technology

Abstract. Magnetoencephalography (MEG) can measure brain activity with millisecond-order temporal resolution, but its spatial resolution is poor, due to the ill-posed nature of the inverse problem of estimating source currents from the electromagnetic measurements. Therefore, prior information on the source currents is essential to solve the inverse problem. We have proposed a new hierarchical Bayesian method to combine several sources of information. In our method, the variance of the source current at each source location is considered an unknown parameter and estimated from the observed MEG data and prior information by using the variational Bayes method. The fMRI information can be imposed as a prior distribution on the variance rather than as the variance itself, so that it gives a soft constraint on the variance. It is shown that the hierarchical Bayesian method has better accuracy and spatial resolution than conventional linear inverse methods, as evaluated by the resolution curve. The proposed method also demonstrated good spatial and temporal resolution for estimating current activity in the early visual area evoked by a stimulus in a quadrant of the visual field.

1 Introduction

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 576–585, 2008. © Springer-Verlag Berlin Heidelberg 2008

In recent years, there has been rapid progress in noninvasive neuroimaging measurements of the human brain. The functional organization of the human brain has been revealed by PET and functional magnetic resonance imaging (fMRI). However, these methods cannot reveal the detailed dynamics of information processing in the human brain, since they have poor temporal resolution due to slow hemodynamic responses to neural activity (Bandettini, 2000; Ogawa et al., 1990). On the other hand, magnetoencephalography (MEG) can measure brain activity with millisecond-order temporal resolution, but its spatial resolution is poor, due to the ill-posed nature of the inverse problem of estimating source currents from the electromagnetic measurements (Hamalainen et al., 1993). Therefore, prior information on the source currents is essential to solve the inverse problem. One of the standard methods for the inverse problem is the dipole method (Hari, 1991; Mosher et al., 1992). It assumes that brain activity can be approximated by a small number of current dipoles. Although this method gives good estimates when the number of active areas is small, it cannot give distributed brain activity for higher functions. On the other hand, a number of distributed


source methods have been proposed to estimate distributed activity in the brain, such as the minimum norm method, the minimum L1-norm method, and others (Hamalainen et al., 1993). It has also been proposed to combine fMRI information with MEG data (Dale and Sereno, 1993; Ahlfors et al., 1999; Dale et al., 2000; Phillips et al., 2002). However, there are essential differences between fMRI and MEG due to their temporal resolution. The fMRI activity corresponds to an average of several thousands of MEG time series data and may not correspond to MEG activity at some time points. We have proposed a new hierarchical Bayesian method to combine several sources of information (Sato et al., 2004). In our method, the variance of the source current at each source location is considered an unknown parameter and estimated from the observed MEG data and prior information. The fMRI information can be imposed as prior information on the variance distribution rather than the variance itself, so that it gives a soft constraint on the variance. Therefore, our method is capable of appropriately estimating the source current variance from the MEG data supplemented with the fMRI data, even if the fMRI data convey inaccurate information. Accordingly, our method is robust against inaccurate fMRI information. Because of the hierarchical prior, the estimation problem becomes nonlinear and cannot be solved analytically. Therefore, the approximate posterior distribution is calculated by using the variational Bayesian (VB) method (Attias, 1999; Sato, 2001). The resulting algorithm is an iterative procedure that converges quickly because the VB algorithm is a type of natural gradient method (Amari, 1998) that has an optimal local convergence property. The position and orientation of the cortical surface obtained from structural MRI can also be introduced as a hard constraint. In this article, we explain our hierarchical Bayesian method.
To evaluate the performance of the hierarchical Bayesian method, the resolution curves were calculated by varying the numbers of model dipoles, simultaneously active dipoles and MEG sensors. The results show the superiority of the hierarchical Bayesian method over conventional linear inverse methods. We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a ﬂickering stimulus in one of four quadrants of the visual ﬁeld. The estimation results are consistent with known physiological ﬁndings and show the good spatial and temporal resolutions of the hierarchical Bayesian method.

2 MEG Inverse Problem

When neural current activity occurs in the brain, it produces a magnetic field observed by MEG. The relationship between the magnetic field B = {B_m | m = 1 : M} measured by M sensors and the primary source current J = {J_n | n = 1 : N} in the brain is given by

$$B = G \cdot J, \qquad (1)$$

where G = {G_{m,n} | m = 1 : M, n = 1 : N} is the lead field matrix. The lead field G_{m,n} represents the magnetic field B_m produced by the n-th unit dipole current. The above equation gives the forward model, and the inverse problem


is to estimate the source current J from the observed magnetic field data B. The probabilistic model for the source currents can be constructed assuming Gaussian noise for the MEG sensors. Then, the probability distribution that the magnetic field B is observed for a given current J is given by

$$P(B \mid J) \propto \exp\Bigl[ -\frac{1}{2} \beta \, (B - G \cdot J) \cdot \Sigma_G \cdot (B - G \cdot J) \Bigr], \qquad (2)$$

where $(\beta \Sigma_G)^{-1}$ denotes the covariance matrix of the sensor noise. $\Sigma_G^{-1}$ is the normalized covariance matrix satisfying $\mathrm{Tr}(\Sigma_G^{-1}) = M$, and $\beta^{-1}$ is the average noise variance.
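A toy NumPy sketch of the forward model (1)–(2); the lead field below is random for illustration (a real G comes from a head model), and white sensor noise (Σ_G = I) is assumed for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
M_sensors, N_sources = 16, 40
G = rng.normal(size=(M_sensors, N_sources))   # toy lead field matrix
J = np.zeros(N_sources)
J[5] = 1.0                                    # one active unit dipole
beta_inv = 0.01                               # average sensor-noise variance
# B = G J plus Gaussian sensor noise with covariance beta_inv * I
B = G @ J + rng.normal(scale=np.sqrt(beta_inv), size=M_sensors)
```

With a single unit dipole active, the noiseless measurement is simply the corresponding column of the lead field matrix.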

3 Hierarchical Bayesian Method

In the hierarchical Bayesian method, the variances of the currents are considered unknown parameters and estimated from the observed MEG data by introducing a hierarchical prior on the current variance. The fMRI information can be imposed as prior information on the variance distribution rather than the variance itself, so that it gives a soft constraint on the variance. The spatial smoothness constraint, that neurons within a few-millimeter radius tend to fire simultaneously due to neural interactions, can also be implemented as a hierarchical prior (Sato et al., 2004).

Hierarchical Prior. Let us suppose a time sequence of MEG data $B_{1:T} \equiv \{B(t) \mid t = 1 : T\}$ is observed. The MEG inverse problem in this case is to estimate the primary source current $J_{1:T} \equiv \{J(t) \mid t = 1 : T\}$ from the observed MEG data $B_{1:T}$. We assume a normal prior for the current:

$$P_0(J_{1:T} \mid \alpha) \propto \exp\Bigl[ -\frac{1}{2} \beta \sum_{t=1}^{T} J(t) \cdot \Sigma_\alpha \cdot J(t) \Bigr], \qquad (3)$$

where $\Sigma_\alpha$ is the diagonal matrix with diagonal elements $\alpha = \{\alpha_n \mid n = 1 : N\}$. We also assume that the current variance $\alpha^{-1}$ does not change over the period T. The current inverse variance parameter α is estimated by introducing an ARD (Automatic Relevance Determination) hierarchical prior (Neal, 1996):

$$P_0(\alpha) = \prod_{n=1}^{N} \Gamma(\alpha_n \mid \bar{\alpha}_{0n}, \gamma_{0n\alpha}), \qquad (4)$$

$$\Gamma(\alpha \mid \bar{\alpha}, \gamma) \equiv \alpha^{-1} (\alpha\gamma/\bar{\alpha})^{\gamma} \, \Gamma(\gamma)^{-1} \, e^{-\alpha\gamma/\bar{\alpha}},$$

where $\Gamma(\alpha \mid \bar{\alpha}, \gamma)$ represents the Gamma distribution with mean $\bar{\alpha}$ and degree of freedom γ, and $\Gamma(\gamma) \equiv \int_0^\infty dt\, t^{\gamma-1} e^{-t}$ is the Gamma function. When fMRI data is not available, we use a non-informative prior for the current inverse variance parameter $\alpha_n$, i.e., $\gamma_{0n\alpha} = 0$ and $P_0(\alpha_n) = \alpha_n^{-1}$. When fMRI data is available, the fMRI information is imposed as the prior for the inverse variance parameter $\alpha_n$. The mean of the prior, $\bar{\alpha}_{0n}$, is assumed to be inversely proportional to the fMRI activity. The confidence parameter $\gamma_{0n\alpha}$ controls the reliability of the fMRI information.
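In NumPy's shape/scale parameterization, the Gamma prior of Eq. (4) with mean ᾱ and degree of freedom γ corresponds to shape γ and scale ᾱ/γ; a small sketch (function name ours):

```python
import numpy as np

def sample_ard_prior(alpha_bar, gamma, size, rng):
    """Draw alpha ~ Gamma(mean alpha_bar, degree of freedom gamma),
    i.e., shape gamma and scale alpha_bar / gamma."""
    return rng.gamma(shape=gamma, scale=alpha_bar / gamma, size=size)
```

The sample mean then matches ᾱ, and larger γ concentrates the prior more tightly around it.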


Variational Bayesian Method. The objective of the Bayesian estimation is to calculate the posterior probability distribution of J for the observed data B (in the following, $B_{1:T}$ and $J_{1:T}$ are abbreviated as B and J, respectively, for notational simplicity):

$$P(J \mid B) = \int d\alpha \, P(J, \alpha \mid B), \qquad P(J, \alpha \mid B) = \frac{P(J, \alpha, B)}{P(B)},$$
$$P(J, \alpha, B) = P(B \mid J)\, P_0(J \mid \alpha)\, P_0(\alpha), \qquad P(B) = \int dJ \int d\alpha \, P(J, \alpha, B).$$

The calculation of the marginal likelihood P(B) cannot be done analytically. In the VB method, the calculation of the joint posterior P(J, α | B) is reformulated as the maximization problem of the free energy. The free energy for a trial distribution Q(J, α) is defined by

$$F(Q) = \int dJ \int d\alpha \, Q(J, \alpha) \log \frac{P(J, \alpha, B)}{Q(J, \alpha)} = \log P(B) - KL\bigl[ Q(J, \alpha) \,\|\, P(J, \alpha \mid B) \bigr]. \qquad (5)$$

Equation (5) implies that the maximization of the free energy F(Q) is equivalent to the minimization of the Kullback–Leibler distance (KL-distance) defined by

$$KL\bigl[ Q(J, \alpha) \,\|\, P(J, \alpha \mid B) \bigr] \equiv \int dJ \int d\alpha \, Q(J, \alpha) \log \frac{Q(J, \alpha)}{P(J, \alpha \mid B)}.$$

This measures the difference between the true joint posterior P(J, α | B) and the trial distribution Q(J, α). Since the KL-distance reaches its minimum at zero when the two distributions coincide, the joint posterior can be obtained by maximizing the free energy F(Q) with respect to the trial distribution Q. In addition, the maximum free energy gives the log-marginal likelihood log P(B). The optimization problem can be solved using a factorization approximation restricting the solution space (Attias, 1999; Sato, 2001):

$$Q(J, \alpha) = Q_J(J) \, Q_\alpha(\alpha). \qquad (6)$$

Under the factorization assumption (6), the free energy can be written as

$$F(Q) = \langle \log P(J, \alpha, B) \rangle_{J\alpha} - \langle \log Q_J(J) \rangle_J - \langle \log Q_\alpha(\alpha) \rangle_\alpha$$
$$\;\;\;\;\; = \langle \log P(B \mid J) \rangle_J - KL\bigl[ Q_J(J) Q_\alpha(\alpha) \,\|\, P_0(J \mid \alpha) P_0(\alpha) \bigr], \qquad (7)$$

where $\langle \cdot \rangle_J$ and $\langle \cdot \rangle_\alpha$ represent the expectation values with respect to $Q_J(J)$ and $Q_\alpha(\alpha)$, respectively. The first term in the second equation of (7) corresponds to the negative of the expected reconstruction error. The second term (the KL-distance) measures the difference between the prior and the posterior


and corresponds to the effective degrees of freedom that can be well specified from the observed data. Therefore, the (negative) free energy can be considered a regularized error function with a model-complexity penalty term.

The maximum free energy is obtained by alternately maximizing the free energy with respect to $Q_J$ and $Q_\alpha$. In the first step (J-step), the free energy F(Q) is maximized with respect to $Q_J$ while $Q_\alpha$ is fixed. The solution is given by

$$Q_J(J) \propto \exp\bigl[ \langle \log P(J, \alpha, B) \rangle_\alpha \bigr]. \qquad (8)$$

In the second step (α-step), the free energy F(Q) is maximized with respect to $Q_\alpha$ while $Q_J$ is fixed. The solution is given by

$$Q_\alpha(\alpha) \propto \exp\bigl[ \langle \log P(J, \alpha, B) \rangle_J \bigr]. \qquad (9)$$

The above J- and α-steps are repeated until the free energy converges.

VB Algorithm. The VB algorithm is summarized here. In the J-step, the inverse filter $L(\Sigma_\alpha^{-1})$ is calculated using the estimated covariance matrix $\Sigma_\alpha^{-1}$ from the previous iteration:

$$L(\Sigma_\alpha^{-1}) = \Sigma_\alpha^{-1} \cdot G^T \cdot \bigl( G \cdot \Sigma_\alpha^{-1} \cdot G^T + \Sigma_G^{-1} \bigr)^{-1}. \qquad (10)$$

The expectation values of the current $\bar{J}$ and the noise variance $\bar{\beta}^{-1}$ with respect to the posterior distribution are estimated using the inverse filter (10):

$$\bar{J} = L(\Sigma_\alpha^{-1}) \cdot B, \qquad \gamma_\beta \bar{\beta}^{-1} = \frac{1}{2} \bigl[ (B - G \cdot \bar{J}) \cdot \Sigma_G \cdot (B - G \cdot \bar{J}) + \bar{J} \cdot \Sigma_\alpha \cdot \bar{J} \bigr], \qquad \gamma_\beta = \frac{1}{2} N T. \qquad (11)$$

In the α-step, the expectation values of the variance parameters ᾱ_n^{-1} with respect to the posterior distribution are estimated as

γ_{nα} ᾱ_n^{-1} = γ_{0nα} ᾱ_{0n}^{-1} + (T/2) ᾱ_n^{-1} [1 − (Σ_α^{-1} · G' · Σ_B^{-1} · G)_{n,n}],   (12)

where γ_{nα} is given by γ_{nα} = γ_{0nα} + T/2.
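In outline, the alternating J- and α-steps can be sketched in a few lines of NumPy. This is only an illustration: the dimensions and parameter names are made up, and the α-step below uses a simplified ARD-style shrinkage in place of the exact update (12).

```python
import numpy as np

def vb_current_estimate(B, G, Sigma_G, alpha0_inv, gamma0, n_iter=30):
    """Sketch of the VB iteration for the MEG inverse problem.

    B: (M, T) sensor data; G: (M, N) lead field; Sigma_G: (M, M) noise term;
    alpha0_inv: (N,) prior current variances; gamma0: prior confidence.
    """
    T = B.shape[1]
    alpha_inv = alpha0_inv.copy()
    gamma = gamma0 + T / 2.0                 # degrees of freedom, cf. Eq. (12)
    for _ in range(n_iter):
        # J-step, Eq. (10): L = Sigma_alpha^{-1} G' (G Sigma_alpha^{-1} G' + noise)^{-1}
        SGt = alpha_inv[:, None] * G.T       # Sigma_alpha^{-1} G'
        L = SGt @ np.linalg.inv(G @ SGt + Sigma_G)
        J = L @ B                            # posterior mean current, Eq. (11)
        # alpha-step (simplified): dipoles carrying little current are shrunk,
        # which effectively removes them from the estimation model
        alpha_inv = (gamma0 * alpha0_inv + 0.5 * np.sum(J**2, axis=1)) / gamma
    return J, alpha_inv
```

Iterating drives the variance estimates of inactive dipoles toward zero, so the filter (10) progressively focuses on the active sources.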

4 Resolution Curve

We evaluated the performance of the hierarchical Bayesian method by calculating the resolution curve and compared it with the minimum norm (MN) method. The inverse filter L of the MN method can be obtained from Eq. (10) if the inverse variance parameters α_n are set to a constant that is independent of the position n. Let us define the resolution matrix R by

R = L · G.   (13)
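The resolution matrix (13) and the gain constraint discussed below can be checked numerically for an MN-type filter; the lead field and regularization constant here are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 20, 60                             # sensors, model dipoles (N > M)
G = rng.standard_normal((M, N))           # toy lead-field matrix

# MN inverse filter: constant prior variance, small regularizer (assumed)
lam = 1e-2
L = G.T @ np.linalg.inv(G @ G.T + lam * np.eye(M))

R = L @ G                                 # resolution matrix, Eq. (13)
gains = np.diag(R)                        # estimation gain at each position
assert gains.sum() <= M                   # the constraint sum_n G_n <= M holds
```

The bound holds because trace(R) = Σ_i σ_i²/(σ_i² + λ) runs over at most M singular values of G, each term being less than one.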

Hierarchical Bayesian Inference of Brain Activity


Fig. 1. Resolution curves of the minimum norm method for different numbers of model dipoles (170, 262, 502, and 1242). The number of sensors is 251. The horizontal axis denotes the radius from the source position in m.

The (n, k) component R_{n,k} of the resolution matrix represents the n-th estimated current when a unit current dipole is applied at the k-th position without noise, i.e., J_k = 1 and J_l = 0 (l ≠ k). The resolution curve is defined as the averaged estimated current as a function of the distance r from the source position. It is obtained by summing the estimated currents R_{n,k} over the n-th positions whose distances from the k-th position lie in the range from r to r + dr, when a unit current dipole is applied at the k-th position. The averaged resolution curve is obtained by averaging the resolution curve over the k-th positions. If the estimation is perfect, the resolution curve at the origin, which is the estimation gain, should be one, and the resolution curve should be zero elsewhere. However, the estimation gain of a linear inverse method such as the MN method satisfies the constraint Σ_{n=1}^{N} G_n ≤ M, where G_n denotes the estimation gain at the n-th position (Sato et al. 2004). This constraint implies that a linear inverse method cannot perfectly retrieve more current dipoles than the number of sensors M. To see the effect of this constraint, we calculated the resolution curve for the MN method by varying the number of model dipoles while the number of sensors M was fixed at 251 (Fig. 1). We assumed model dipoles placed evenly on a hemisphere. Fig. 1 shows that the MN method gives perfect estimation if the number of dipoles is less than M. On the other hand, the performance degrades as the number of dipoles increases beyond M. Although the above results were obtained using the MN method, similar results hold for a class of linear inverse methods. This limitation is the main cause of the poor spatial


Fig. 2. Resolution curves of the hierarchical Bayesian method with 4078/10442 model dipoles and 251/515 sensors. The number of active dipoles is 240 or 400. The horizontal axis denotes the radius from the source position in m.

resolution of the linear inverse methods. When several dipoles are simultaneously active, the estimated currents of a linear inverse method are obtained as the summation of the estimated currents for each dipole. Therefore, the resolution curve gives a complete description of the spatial resolution of the linear inverse methods. From theoretical analysis (in preparation), the hierarchical Bayesian method can estimate dipole currents perfectly even when the number of model dipoles is larger than the number of sensors M. This is because the hierarchical Bayesian method effectively eliminates inactive dipoles from the estimation model by adjusting the estimation gain of these dipoles to zero. Nevertheless, the number of active dipoles constrains the performance of the hierarchical Bayesian method. The calculation of the resolution curves for the hierarchical Bayesian method is somewhat complicated, because the Bayesian inverse filters depend on the MEG data. To evaluate the performance in situations where multiple dipoles are active, we generated 240 or 400 active dipoles randomly on the hemisphere and calculated the corresponding MEG data in which the 240 or 400 dipoles were simultaneously active. The Bayesian inverse filters were estimated using these simulated MEG data. Then, the resolution curves were calculated using the estimated Bayesian inverse filters for each active dipole and averaged over all active dipoles. Fig. 2 shows the resolution curves for the hierarchical Bayesian method with 4078/10442 model dipoles and 251/515 MEG sensors. When the number of simultaneously active dipoles is less than the number of MEG sensors, almost perfect estimation is obtained


regardless of the number of model dipoles. Therefore, the hierarchical Bayesian method can achieve much better spatial resolution than the conventional linear inverse methods. On the other hand, the performance is degraded if the number of simultaneously active dipoles is larger than the number of MEG sensors. The above results demonstrate the superiority of the hierarchical Bayesian method over the MN method.

5 Visual Experiments

We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of four quadrants of the visual field. Red and green checkerboards in a pseudo-randomly selected quadrant were presented for 700 ms in each trial. fMRI experiments with the same quadrant stimuli were also conducted using a conventional block design in which the stimuli were presented for 15 seconds per block. The global field power (the sum of the MEG signals over all sensors) recorded from subject RH and induced by the upper-right stimulus is shown in Fig. 3a. A strong peak was observed 93 ms after stimulus onset. Cortical currents were estimated by applying the hierarchical Bayesian method to the MEG data averaged between 100 ms before and 400 ms after stimulus onset. The fMRI activity t-values were used as a prior for the inverse

Fig. 3. Estimated current for the quadrant visual stimulus. (a) Global field power of the MEG signal. (b) Temporal patterns of the averaged currents in V1, V2/3, and V4. (c-e) Spatial patterns of the current strength averaged over 20 ms time windows centered at 93 ms, 98 ms, and 134 ms.


variance parameters. As explained in the 'Hierarchical Prior' subsection, the mean of the prior was assumed to be ᾱ_{0n}^{-1} = a_0 · t_f(n), where t_f(n) is the t-value at the n-th position and a_0 is a hyperparameter, set to 500 in this analysis. The estimated spatiotemporal brain activities are illustrated in Fig. 3. We identified three ROIs (regions of interest) in V1, V2/3, and V4, and temporal patterns of the estimated currents were obtained by averaging the current within these ROIs. Fig. 3b shows that V1, V2/3, and V4 are successively activated and attain their peaks around 93 ms, 98 ms, and 134 ms, respectively. Figs. 3c-3e illustrate the spatial patterns of the current strength averaged over 20 ms time windows (centered at 93 ms, 98 ms, and 134 ms) in a flattened map format. The flattened map was made by cutting along the bottom of the calcarine sulcus. Strongly active regions appear in V1, V2/3, and V4 corresponding to their peak activities. The above results are consistent with known physiological findings and demonstrate the good spatial and temporal resolution of the hierarchical Bayesian method.

6 Conclusion

In this article, we have explained the hierarchical Bayesian method, which combines MEG and fMRI by means of a hierarchical prior. We have shown the superiority of the hierarchical Bayesian method over conventional linear inverse methods by evaluating the resolution curve. We also applied the hierarchical Bayesian method to visual experiments, in which subjects viewed a flickering stimulus in one of four quadrants of the visual field. The estimation results are consistent with known physiological findings and demonstrate the good spatial and temporal resolution of the hierarchical Bayesian method. Currently, we are applying the hierarchical Bayesian method to brain-machine interfaces using noninvasive neuroimaging. In our approach, we first estimate the current activity in the brain; the intention or the motion of the subject is then estimated from this current activity. This approach enables us to use physiological knowledge and gives us more insight into the mechanisms of human information processing.

Acknowledgement. This research was supported in part by NICT-KARC.

References

Ahlfors, S.P., Simpson, G.V., Dale, A.M., Belliveau, J.W., Liu, A.K., Korvenoja, A., Virtanen, J., Huotilainen, M., Tootell, R.B.H., Aronen, H.J., Ilmoniemi, R.J.: Spatiotemporal activity of a cortical network for processing visual motion revealed by MEG and fMRI. J. Neurophysiol. 82, 2545–2555 (1999)
Amari, S.: Natural Gradient Works Efficiently in Learning. Neural Computation 10, 251–276 (1998)
Attias, H.: Inferring parameters and structure of latent variable models by variational Bayes. In: Proc. 15th Conference on Uncertainty in Artificial Intelligence, pp. 21–30 (1999)
Bandettini, P.A.: The temporal resolution of functional MRI. In: Moonen, C.T.W., Bandettini, P.A. (eds.) Functional MRI, pp. 205–220. Springer, Heidelberg (2000)


Dale, A.M., Liu, A.K., Fischl, B.R., Buckner, R.L., Belliveau, J.W., Lewine, J.D., Halgren, E.: Dynamic statistical parametric mapping: Combining fMRI and MEG for high-resolution imaging of cortical activity. Neuron 26, 55–67 (2000)
Dale, A.M., Sereno, M.I.: Improved localization of cortical activity by combining EEG and MEG with MRI cortical surface reconstruction: A linear approach. J. Cognit. Neurosci. 5, 162–176 (1993)
Hamalainen, M.S., Hari, R., Ilmoniemi, R.J., Knuutila, J., Lounasmaa, O.V.: Magnetoencephalography – theory, instrumentation, and applications to noninvasive studies of the working human brain. Rev. Modern Phys. 65, 413–497 (1993)
Hari, R.: On brain's magnetic responses to sensory stimuli. J. Clin. Neurophysiol. 8, 157–169 (1991)
Mosher, J.C., Lewis, P.S., Leahy, R.M.: Multiple dipole modelling and localization from spatio-temporal MEG data. IEEE Trans. Biomed. Eng. 39, 541–557 (1992)
Neal, R.M.: Bayesian learning for neural networks. Springer, Heidelberg (1996)
Ogawa, S., Lee, T.-M., Kay, A.R., Tank, D.W.: Brain magnetic resonance imaging with contrast dependent on blood oxygenation. Proc. Natl. Acad. Sci. USA 87, 9868–9872 (1990)
Phillips, C., Rugg, M.D., Friston, K.J.: Anatomically Informed Basis Functions for EEG Source Localization: Combining Functional and Anatomical Constraints. NeuroImage 16, 678–695 (2002)
Sato, M.: On-line Model Selection Based on the Variational Bayes. Neural Computation 13, 1649–1681 (2001)
Sato, M., Yoshioka, T., Kajihara, S., Toyama, K., Goda, N., Doya, K., Kawato, M.: Hierarchical Bayesian estimation for MEG inverse problem. NeuroImage 23, 806–826 (2004)

Neural Decoding of Movements: From Linear to Nonlinear Trajectory Models

Byron M. Yu^{1,2}, John P. Cunningham^1, Krishna V. Shenoy^1, and Maneesh Sahani^2

^1 Dept. of Electrical Engineering and Neurosciences Program, Stanford University, Stanford, CA, USA
^2 Gatsby Computational Neuroscience Unit, UCL, London, UK
{byronyu,jcunnin,shenoy}@stanford.edu, [email protected]

Abstract. To date, the neural decoding of time-evolving physical state – for example, the path of a foraging rat or arm movements – has been largely carried out using linear trajectory models, primarily due to their computational efficiency. The possibility of better capturing the statistics of the movements using nonlinear trajectory models, thereby yielding more accurate decoded trajectories, is enticing. However, nonlinear decoding usually carries a higher computational cost, which is an important consideration in real-time settings. In this paper, we present techniques for nonlinear decoding employing modal Gaussian approximations, expectation propagation, and Gaussian quadrature. We compare their decoding accuracy versus computation time tradeoffs based on high-dimensional simulated neural spike counts. Keywords: Nonlinear dynamical models, nonlinear state estimation, neural decoding, neural prosthetics, expectation propagation, Gaussian quadrature.

1 Introduction

We consider the problem of decoding time-evolving physical state from neural spike trains. Examples include decoding the path of a foraging rat from hippocampal neurons [1,2] and decoding the arm trajectory from motor cortical neurons [3,4,5,6,7,8]. Advances in this area have enabled the development of neural prosthetic devices, which seek to allow disabled patients to regain motor function through the use of prosthetic limbs, or computer cursors, that are controlled by neural activity [9,10,11,12,13,14,15]. Several of these prosthetic decoders, including population vectors [11] and linear filters [10,12,15], linearly map the observed neural activity to the estimate of physical state. Although these direct linear mappings are effective, recursive Bayesian decoders have been shown to provide more accurate trajectory estimates [1,6,7,16]. In addition, recursive Bayesian decoders provide confidence regions on the trajectory estimates and allow for nonlinear relationships between

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 586–595, 2008. © Springer-Verlag Berlin Heidelberg 2008

Neural Decoding Using Nonlinear Trajectory Models

587

the neural activity and the physical state variables. Recursive Bayesian decoders are based on the speciﬁcation of a probabilistic model comprising 1) a trajectory model, which describes how the physical state variables change from one time step to the next, and 2) an observation model, which describes how the observed neural activity relates to the time-evolving physical state. The function of the trajectory model is to build into the decoder prior knowledge about the form of the trajectories. In the case of decoding arm movements, the trajectory model may reﬂect 1) the hard, physical constraints of the limb (for example, the elbow cannot bend backward), 2) the soft, control constraints imposed by neural mechanisms (for example, the arm is more likely to move smoothly than in a jerky motion), and 3) the physical surroundings of the person and his/her objectives in that environment. The degree to which the trajectory model captures the statistics of the actual movements directly aﬀects the accuracy with which trajectories can be decoded from neural data [8]. The most commonly-used trajectory models assume linear dynamics perturbed by Gaussian noise, which we refer to collectively as linear-Gaussian models. The family of linear-Gaussian models includes the random walk model [1,2,6], those with a constant [8] or time-varying [17,18] forcing term, those without a forcing term [7,16], those with a time-varying state transition matrix [19], and those with higher-order Markov dependencies [20]. Linear-Gaussian models have been successfully applied to decoding the path of a foraging rat [1,2], as well as arm trajectories in ellipse-tracing [6], pursuit-tracking [7,20,16], “pinball” [7,16], and center-out reach [8] tasks. Linear-Gaussian models are widely used primarily due to their computational eﬃciency, which is an important consideration for real-time decoding applications. 
However, for particular types of movements, the family of linear-Gaussian models may be too restrictive and unable to capture salient properties of the observed movements [8]. We recently proposed a general approach to constructing trajectory models that can exhibit rather complex dynamical behaviors and whose decoder can be implemented to have the same running time (using a parallel implementation) as simpler trajectory models [8]. In particular, we demonstrated that a probabilistic mixture of linear-Gaussian trajectory models, each accurate within a limited regime of movement, can capture the salient properties of goal-directed reaches to multiple targets. This mixture model, which yielded more accurate decoded trajectories than a single linear-Gaussian model, can be viewed as a discrete approximation to a single, uniﬁed trajectory model with nonlinear dynamics. An alternate approach is to decode using this single, uniﬁed nonlinear trajectory model without discretization. This makes the decoding problem more diﬃcult since nonlinear transformations of parametric distributions are typically no longer easily parametrized. State estimation in nonlinear dynamical systems is a ﬁeld of active research that has made substantial progress in recent years, including the application of numerical quadrature techniques to dynamical systems [21,22,23], the development of expectation-propagation (EP) [24] and its application to dynamical systems [25,26,27,28], and the improvement in the

588

B.M. Yu et al.

computational eﬃciency of Monte Carlo techniques (e.g., [29,30,31]). However, these techniques have not been rigorously tested and compared in the context of neural decoding, which typically involves observations that are high-dimensional vectors of non-negative integers. In particular, the tradeoﬀ between decoding accuracy and computational cost among diﬀerent neural decoding algorithms has not been studied in detail. Knowing the accuracy-computational cost tradeoﬀ is important for real-time applications, where one may need to select the most accurate algorithm given a computational budget or the least computationally intensive algorithm given a minimal acceptable decoding accuracy. This paper takes a step in this direction by comparing three particular deterministic Gaussian approximations. In Section 2, we ﬁrst introduce the nonlinear dynamical model for neural spike counts and the decoding problem. Sections 3 and 4 detail the three deterministic Gaussian approximations that we focus on in this report: global Laplace, Gaussian quadrature-EP (GQ-EP), and Laplace propagation (LP). Finally, in Section 5, we compare the decoding accuracy versus computational cost of these three techniques.
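As a toy illustration of the mixture of linear-Gaussian trajectory models discussed above (with made-up parameters, not the parameterization of [8]): a component — e.g., a reach target — is drawn once per trial, and the corresponding linear-Gaussian dynamics then generate the trajectory.

```python
import numpy as np

def sample_mixture_trajectory(T, As, bs, Q, pi, seed=0):
    """Sample one trial from a mixture of linear-Gaussian trajectory models."""
    rng = np.random.default_rng(seed)
    m = rng.choice(len(pi), p=pi)            # mixture component for this trial
    p = Q.shape[0]
    x = np.zeros(p)
    traj = []
    for _ in range(T):                       # x_t = A_m x_{t-1} + b_m + Gaussian noise
        x = As[m] @ x + bs[m] + rng.multivariate_normal(np.zeros(p), Q)
        traj.append(x)
    return m, np.array(traj)
```

Each component is accurate within a limited movement regime; together they discretely approximate a single nonlinear trajectory model.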

2 Nonlinear Dynamical Model and Neural Decoding

In this report, we consider nonlinear dynamical models for neural spike counts of the following form:

x_t | x_{t−1} ∼ N(f(x_{t−1}), Q)   (1a)
y_t^i | x_t ∼ Poisson(λ_i(x_t) · Δ),   (1b)

where x_t ∈ R^{p×1} is a vector containing the physical state variables at time t = 1, …, T, y_t^i ∈ {0, 1, 2, …} is the corresponding observed spike count for neuron i = 1, …, q taken in a time bin of width Δ, and Q ∈ R^{p×p} is a covariance matrix. The functions f : R^{p×1} → R^{p×1} and λ_i : R^{p×1} → R_+ are, in general, nonlinear. The initial state x_1 is Gaussian-distributed. For notational compactness, the spike counts for all q simultaneously-recorded neurons are assembled into a q × 1 vector y_t, whose i-th element is y_t^i. Note that the observations are discrete-valued and that, typically, q ≫ p. Equations (1a) and (1b) are referred to as the trajectory and observation models, respectively. The task of neural decoding involves finding, at each timepoint t, the likely physical states x_t given the neural activity observed up to that time, {y}_1^t. In other words, we seek to compute the filtered state posterior P(x_t | {y}_1^t) at each t. We previously showed how to estimate the filtered state posterior when f is a linear function [8]. Here, we consider how to compute P(x_t | {y}_1^t) when f is nonlinear. The extended Kalman filter (EKF) is a commonly-used technique for nonlinear state estimation. Unfortunately, it cannot be directly applied to the current problem because the observation noise in (1b) is not additive Gaussian. Possible alternatives are the unscented Kalman filter (UKF) [21,22] and the closely-related quadrature Kalman filter (QKF) [23], both of which employ quadrature


techniques to approximate Gaussian integrals that are analytically intractable. While the UKF has been shown to outperform the EKF [21,22], the UKF requires making Gaussian approximations in the observation space. This property of the UKF is undesirable from the standpoint of the current problem because the observed spike counts are typically 0 or 1 (due to the use of relatively short binwidths Δ) and, therefore, distinctly non-Gaussian. As a result, the UKF yielded substantially lower decoding accuracy than the techniques presented in Sections 3 and 4 [28], which make Gaussian approximations only in the state space. While we have not yet tested the QKF, the number of quadrature points it requires grows geometrically with p + q, which quickly becomes impractical even for moderate values of p and q. Thus, we will no longer consider the UKF and QKF in the remainder of this paper. The decoding techniques described in Sections 3 and 4 naturally yield the smoothed state posterior P(x_t | {y}_1^T), rather than the filtered state posterior P(x_t | {y}_1^t). Thus, we will focus on the smoothed state posterior in this work. However, the filtered state posterior at time t can be easily obtained by smoothing using only the observations from timepoints 1, …, t.
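For concreteness, spike counts can be sampled from the generative model (1a)-(1b) as follows; here f is a recurrent-network form like the one the authors adopt in Section 5, and all parameter scales are guesses rather than values used in the paper.

```python
import numpy as np
from math import erf

def simulate(T, p, q, k=0.1, delta=0.01, seed=0):
    """Draw a state trajectory (1a) and Poisson spike counts (1b)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((p, p)) / np.sqrt(p)   # recurrent weights (assumed scale)
    C = 0.5 * rng.standard_normal((q, p))          # c_i vectors
    d = rng.standard_normal(q) - 1.0               # d_i offsets
    Q = 0.01 * np.eye(p)                           # state noise covariance
    verf = np.vectorize(erf)
    x = rng.standard_normal(p)
    xs, ys = [], []
    for _ in range(T):
        x = (1 - k) * x + k * (W @ verf(x)) + rng.multivariate_normal(np.zeros(p), Q)
        lam = np.log1p(np.exp(C @ x + d))          # soft-rectified rates lambda_i(x)
        ys.append(rng.poisson(lam * delta))        # counts in a bin of width Delta
        xs.append(x.copy())
    return np.array(xs), np.array(ys)
```

With a short bin width Δ, most sampled counts are 0 or 1, matching the regime discussed above.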

3 Global Laplace

The idea is to estimate the joint state posterior across the entire sequence (i.e., the global state posterior) as a Gaussian matched to the location and curvature of a mode of P({x}_1^T | {y}_1^T), as in Laplace's method [32]. The mode is defined as

{x*}_1^T = argmax_{{x}_1^T} P({x}_1^T | {y}_1^T) = argmax_{{x}_1^T} L({x}_1^T),   (2)

where

L({x}_1^T) = log P({x}_1^T, {y}_1^T) = log P(x_1) + Σ_{t=2}^{T} log P(x_t | x_{t−1}) + Σ_{t=1}^{T} Σ_{i=1}^{q} log P(y_t^i | x_t).   (3)

Using the known distributions (1), the gradients of L({x}_1^T) can be computed exactly, and a local mode {x*}_1^T can be found by applying a gradient optimization technique. The global state posterior is then approximated as

P({x}_1^T | {y}_1^T) ≈ N({x*}_1^T, −[∇²L({x*}_1^T)]^{-1}).   (4)
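A one-dimensional toy version of (2)-(4) — with a linear f, a single soft-rectified Poisson neuron, and hand-picked constants, all our choices rather than the paper's setup — finds the mode by plain gradient ascent on L:

```python
import numpy as np

rng = np.random.default_rng(1)
T, a, q_var = 40, 0.9, 0.02         # AR(1) trajectory model (toy choice)
c, d, dt = 2.0, 0.0, 0.05           # observation model lambda(x) = softplus(c x + d)
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = a * x_true[t - 1] + rng.normal(0.0, np.sqrt(q_var))
y = rng.poisson(np.log1p(np.exp(c * x_true + d)) * dt)   # observed counts

def grad_L(x):
    """Gradient of the log joint L({x}) of Eq. (3) for this toy model."""
    g = np.zeros(T)
    g[0] -= x[0]                                  # log P(x1): unit-variance prior
    g[1:] -= (x[1:] - a * x[:-1]) / q_var         # transition terms log P(x_t | x_{t-1})
    g[:-1] += a * (x[1:] - a * x[:-1]) / q_var
    lam = np.log1p(np.exp(c * x + d)) * dt
    sig = 1.0 / (1.0 + np.exp(-(c * x + d)))
    g += (y / lam - 1.0) * sig * c * dt           # Poisson terms log P(y_t^i | x_t)
    return g

x_mode = np.zeros(T)                              # gradient ascent toward {x*} of Eq. (2)
for _ in range(2000):
    x_mode += 5e-3 * grad_L(x_mode)
# the negative inverse Hessian of L at x_mode would give the covariance of Eq. (4)
```

In practice a conjugate-gradient or Newton scheme converges far faster; plain ascent is used here only to keep the sketch short.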

4 Expectation Propagation

We briefly summarize here the application of EP [24] to dynamical models [25,26,27,28]. More details can be found in the cited references. The two primary distributions of interest here are the marginal P(x_t | {y}_1^T) and pairwise joint P(x_{t−1}, x_t | {y}_1^T) state posteriors. These distributions can be expressed in terms of forward α_t and backward β_t messages as follows:

P(x_t | {y}_1^T) = α_t(x_t) β_t(x_t) / P({y}_1^T)   (5)
P(x_{t−1}, x_t | {y}_1^T) = α_{t−1}(x_{t−1}) P(x_t | x_{t−1}) P(y_t | x_t) β_t(x_t) / P({y}_1^T),   (6)

where α_t(x_t) = P(x_t, {y}_1^t) and β_t(x_t) = P({y}_{t+1}^T | x_t). The messages α_t and β_t are typically approximated by an exponential family density; in this paper, we use an unnormalized Gaussian. These approximate messages are iteratively updated by matching the expected sufficient statistics (for Gaussian approximating distributions, this is equivalent to matching the first two moments) of the marginal posterior (5) with those of the pairwise joint posterior (6). The updates are usually performed sequentially via multiple forward-backward passes. During the forward pass, the α_t are updated while the β_t remain fixed:

P(x_t | {y}_1^T) = ∫ α_{t−1}(x_{t−1}) P(x_t | x_{t−1}) P(y_t | x_t) β_t(x_t) / P({y}_1^T) dx_{t−1}   (7)
≈ ∫ P̂(x_{t−1}, x_t) dx_{t−1}   (8)
α_t(x_t) ∝ [∫ P̂(x_t, x_{t−1}) dx_{t−1}] / β_t(x_t),   (9)

where P̂(x_{t−1}, x_t) is an exponential family distribution whose expected sufficient statistics are matched to those of P(x_{t−1}, x_t | {y}_1^T). In this paper, P̂(x_{t−1}, x_t) is assumed to be Gaussian. The backward pass proceeds similarly, with the β_t updated while the α_t remain fixed. The decoded trajectory is obtained by combining the messages α_t and β_t, as shown in (5), after completing the forward-backward passes. In Section 5, we investigate the accuracy-computational cost tradeoff of using different numbers of forward-backward iterations. Although the expected sufficient statistics (or moments) of P(x_{t−1}, x_t | {y}_1^T) cannot typically be computed analytically for the nonlinear dynamical model (1), they can be approximated using Gaussian quadrature [26,28]. This EP-based decoder is referred to as GQ-EP. By applying the ideas of Laplace propagation (LP) [33], a closely-related decoder has been developed that uses a modal Gaussian approximation of P(x_{t−1}, x_t | {y}_1^T) rather than matching moments [27,28]. This technique, which uses the same message-passing scheme as GQ-EP, is referred to here as LP. In practice, it is possible to encounter invalid message updates. For example, if the variance of x_t in the numerator is larger than that in the denominator in (9) due to approximation error in the choice of P̂, the update rule would assign α_t(x_t) a negative variance. A way around this problem is to simply skip that message update and hope that the update is no longer invalid during the next


forward-backward iteration [34]. An alternative is to set β_t(x_t) = 1 in (7) and (9), which guarantees a valid update for α_t(x_t). This is referred to as the one-sided update, and its implications for decoding accuracy and computation time are considered in Section 5.
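In one dimension, the moment matching inside GQ-EP reduces to weighting Gauss-Hermite quadrature nodes by the Poisson likelihood. The sketch below is our simplification — the paper uses a custom precision-3 quadrature rule in higher dimensions.

```python
import numpy as np

def gh_moments(m, v, y, delta=0.01, n=20):
    """Match the mean/variance of p(x) proportional to N(x; m, v) * Poisson(y; lambda(x) delta)
    by Gauss-Hermite quadrature, with lambda(x) = log(1 + e^x) as a stand-in rate."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n)  # z ~ N(0, 1) nodes
    x = m + np.sqrt(v) * nodes
    lam = np.log1p(np.exp(x)) * delta
    w = weights * np.exp(y * np.log(lam) - lam)   # likelihood-weighted quadrature
    Z = w.sum()
    mean = (w * x).sum() / Z
    var = (w * x**2).sum() / Z - mean**2
    return mean, var
```

With y = 0 the likelihood decays with x, so the matched mean shifts slightly below the prior mean m; dividing such matched marginals by the fixed message β_t is where the negative-variance problem described above can appear.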

5 Results

We evaluated the decoding accuracy versus computational cost of the techniques described in Sections 3 and 4. These performance comparisons were based on the model (1), where

f(x) = (1 − k) x + k · W · erf(x)   (10)
λ_i(x) = log(1 + exp(c_i' x + d_i))   (11)

with parameters W ∈ R^{p×p}, c_i ∈ R^{p×1}, and d_i ∈ R. The error function (erf) in (10) acts element-by-element on its argument. We chose the dynamics (10) of a fully-connected recurrent network due to their nonlinear nature; we make no claims in this work about their suitability for particular decoding applications, such as rat paths or arm trajectories. Because recurrent networks are often used to model neural activity directly, it is important to emphasize that x is a vector of physical state variables to be decoded, not a vector of neural activity. We generated 50 state trajectories, each with 50 time points, and corresponding spike counts from the model (1), where the model parameters were randomly chosen within a range that provided biologically realistic spike counts (typically, 0 or 1 spike in each bin). The time constant k ∈ R was set to 0.1. To understand how these algorithms scale with different numbers of physical state variables and observed neurons, we considered all pairings (p, q), where p ∈ {3, 10} and q ∈ {20, 100, 500}. For each pairing, we repeated the above procedure three times. For the global Laplace decoder, the modal trajectory was found using Polack-Ribière conjugate gradients with quadratic/cubic line searches and Wolfe-Powell stopping criteria (minimize.m by Carl Rasmussen, available at http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/). To stabilize GQ-EP, we used a modal Gaussian proposal distribution and the custom precision-3 quadrature rule with non-negative quadrature weights, as described in [28]. For both GQ-EP and LP, minimize.m was used to find a mode of P(x_{t−1}, x_t | {y}_1^T). Fig. 1 illustrates the decoding accuracy versus computation time of the presented techniques. Decoding accuracy was measured by evaluating the marginal state posteriors P(x_t | {y}_1^T) at the actual trajectory. The higher the log probability, the more accurate the decoder.
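The accuracy measure just described — the log of the marginal posteriors evaluated at the true trajectory — can be computed directly when the marginals are Gaussian; the diagonal-covariance form here is our simplification.

```python
import numpy as np

def log_marginal_accuracy(x_true, means, variances):
    """Sum over t of log N(x_true[t]; means[t], variances[t]) (diagonal case)."""
    z = (x_true - means) ** 2 / variances
    return -0.5 * np.sum(z + np.log(2.0 * np.pi * variances))
```

Higher values mean the decoder places more posterior mass on the trajectory that actually occurred.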
Each panel corresponds to a diﬀerent number of state variables and observed neurons. For GQ-EP (dotted line) and LP (solid line), we varied the number of forward-backward iterations between one and three; thus, there are three circles for each of these decoders. Across all panels, global Laplace required the least computation time and yielded state

Fig. 1. Decoding accuracy versus computation time of global Laplace (no line), GQ-EP (dotted line), and LP (solid line). (a) p = 3, q = 20, (b) p = 3, q = 100, (c) p = 3, q = 500, (d) p = 10, q = 20, (e) p = 10, q = 100, (f) p = 10, q = 500. The circles and bars represent mean ± SEM. Variability in computation time is not shown on the plots because it was negligible. The computation times were obtained using a 2.2-GHz AMD Athlon 64 processor with 2 GB RAM running MATLAB R14. Note that the scale of the vertical axes differs across panels and that some error bars are too small to be seen.

estimates as accurate as, or more accurate than, the other techniques. This is the key result of this report. We also implemented a basic particle smoother [35], where the number of particles (500 to 1500) was chosen such that its computation time was of the same order as those shown in Fig. 1 (results not shown). Although this particle smoother yielded substantially lower decoding accuracy than global Laplace, GQ-EP, and LP, the three deterministic techniques should be compared to more recently-developed Monte Carlo techniques, as described in Section 6. Fig. 1 shows that all three techniques have computation times that scale well with the number of state variables p and neurons q. In particular, the required computation time typically scales sub-linearly with increases in p and far sub-linearly with increases in q. As q increases, the accuracies of the techniques become more similar (note that different panels have different vertical scales), and there is less advantage to performing multiple forward-backward iterations for GQ-EP and LP. The decoding accuracy and the required computation time both typically increase with the number of iterations. In a few cases (e.g., GQ-EP in Fig. 1(b)), the accuracy can decrease slightly when going from two to three iterations, presumably due to one-sided updates. In theory, GQ-EP should require greater computation time than LP because it needs to perform the same modal Gaussian approximation and then use it as a proposal distribution for Gaussian quadrature. In practice, it is possible for LP


to be slower if it needs many one-sided updates (cf. Fig. 1(d)), since one-sided updates are used only when the usual update (9) fails. Furthermore, LP required greater computation time in Fig. 1(d) than in Fig. 1(e) due to the need for many more one-sided updates, despite having five times fewer neurons. It was previously shown that {x*}_1^T is a local optimum of P({x}_1^T | {y}_1^T) (i.e., a solution of global Laplace) if and only if it is a fixed point of LP [33]. Because the modal Gaussian approximation matches local curvature up to second order, it can also be shown that the covariances estimated by global Laplace and LP are equal at {x*}_1^T [33]. Empirically, we found both statements to hold when few one-sided updates were required for LP. Due to these connections between global Laplace and LP, the accuracy of LP after three forward-backward iterations was similar to that of global Laplace in all panels of Fig. 1. Although LP may offer computational savings over global Laplace in certain applications [33], we found that global Laplace was substantially faster for the particular graph structure described by (1).

6 Conclusion

We have presented three deterministic techniques for nonlinear state estimation (global Laplace, GQ-EP, and LP) and compared their decoding accuracy versus computational cost in the context of neural decoding, which involves high-dimensional observations of non-negative integers. This work can be extended in the following directions. First, the deterministic techniques presented here should be compared to recently-developed Monte Carlo techniques that have yielded increased accuracy and/or reduced computational cost compared to the basic particle filter/smoother in applications other than neural decoding [29]. Examples include the Gaussian particle filter [31], the sigma-point particle filter [30], and the embedded hidden Markov model [36]. Second, we have compared these decoders based on one particular nonlinear trajectory model (10). Other nonlinear trajectory models (e.g., a model describing primate arm movements [37]) should be tested to see whether the decoders exhibit similar accuracy-computational cost tradeoffs to those shown here.

Acknowledgments. This work was supported by NIH-NINDS-CRCNS-R01, NDSEG Fellowship, NSF Graduate Research Fellowship, Gatsby Charitable Foundation, Michael Flynn Stanford Graduate Fellowship, Christopher Reeve Paralysis Foundation, Burroughs Wellcome Fund Career Award in the Biomedical Sciences, Stanford Center for Integrated Systems, NSF Center for Neuromorphic Systems Engineering at Caltech, Office of Naval Research, Sloan Foundation and Whitaker Foundation.

References

1. Brown, E.N., Frank, L.M., Tang, D., Quirk, M.C., Wilson, M.A.: A statistical paradigm for neural spike train decoding applied to position prediction from the ensemble firing patterns of rat hippocampal place cells. J. Neurosci. 18(18), 7411–7425 (1998)

594

B.M. Yu et al.

2. Zhang, K., Ginzburg, I., McNaughton, B.L., Sejnowski, T.J.: Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol. 79, 1017–1044 (1998)
3. Wessberg, J., Stambaugh, C.R., Kralik, J.D., Beck, P.D., Laubach, M., Chapin, J.K., Kim, J., Biggs, J., Srinivasan, M.A., Nicolelis, M.A.L.: Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408(6810), 361–365 (2000)
4. Schwartz, A.B., Taylor, D.M., Tillery, S.I.H.: Extraction algorithms for cortical control of arm prosthetics. Curr. Opin. Neurobiol. 11, 701–707 (2001)
5. Serruya, M., Hatsopoulos, N., Fellows, M., Paninski, L., Donoghue, J.: Robustness of neuroprosthetic decoding algorithms. Biol. Cybern. 88(3), 219–228 (2003)
6. Brockwell, A.E., Rojas, A.L., Kass, R.E.: Recursive Bayesian decoding of motor cortical signals by particle filtering. J. Neurophysiol. 91(4), 1899–1907 (2004)
7. Wu, W., Black, M.J., Mumford, D., Gao, Y., Bienenstock, E., Donoghue, J.P.: Modeling and decoding motor cortical activity using a switching Kalman filter. IEEE Trans. Biomed. Eng. 51(6), 933–942 (2004)
8. Yu, B.M., Kemere, C., Santhanam, G., Afshar, A., Ryu, S.I., Meng, T.H., Sahani, M., Shenoy, K.V.: Mixture of trajectory models for neural decoding of goal-directed movements. J. Neurophysiol. 97, 3763–3780 (2007)
9. Chapin, J.K., Moxon, K.A., Markowitz, R.S., Nicolelis, M.A.L.: Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nat. Neurosci. 2, 664–670 (1999)
10. Serruya, M.D., Hatsopoulos, N.G., Paninski, L., Fellows, M.R., Donoghue, J.P.: Instant neural control of a movement signal. Nature 416, 141–142 (2002)
11. Taylor, D.M., Tillery, S.I.H., Schwartz, A.B.: Direct cortical control of 3D neuroprosthetic devices. Science 296, 1829–1832 (2002)
12. Carmena, J.M., Lebedev, M.A., Crist, R.E., O'Doherty, J.E., Santucci, D.M., Dimitrov, D.F., Patil, P.G., Henriquez, C.S., Nicolelis, M.A.L.: Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biology 1(2), 193–208 (2003)
13. Musallam, S., Corneil, B.D., Greger, B., Scherberger, H., Andersen, R.A.: Cognitive control signals for neural prosthetics. Science 305, 258–262 (2004)
14. Santhanam, G., Ryu, S.I., Yu, B.M., Afshar, A., Shenoy, K.V.: A high-performance brain-computer interface. Nature 442, 195–198 (2006)
15. Hochberg, L.R., Serruya, M.D., Friehs, G.M., Mukand, J.A., Saleh, M., Caplan, A.H., Branner, A., Chen, D., Penn, R.D., Donoghue, J.P.: Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442, 164–171 (2006)
16. Wu, W., Gao, Y., Bienenstock, E., Donoghue, J.P., Black, M.J.: Bayesian population decoding of motor cortical activity using a Kalman filter. Neural Comput. 18(1), 80–118 (2006)
17. Kemere, C., Meng, T.: Optimal estimation of feed-forward-controlled linear systems. In: Proc. IEEE ICASSP, pp. 353–356 (2005)
18. Srinivasan, L., Eden, U.T., Willsky, A.S., Brown, E.N.: A state-space analysis for reconstruction of goal-directed movements using neural signals. Neural Comput. 18(10), 2465–2494 (2006)
19. Srinivasan, L., Brown, E.N.: A state-space framework for movement control to dynamic goals through brain-driven interfaces. IEEE Trans. Biomed. Eng. 54(3), 526–535 (2007)
20. Shoham, S., Paninski, L.M., Fellows, M.R., Hatsopoulos, N.G., Donoghue, J.P., Normann, R.A.: Statistical encoding model for a primary motor cortical brain-machine interface. IEEE Trans. Biomed. Eng. 52(7), 1313–1322 (2005)

Neural Decoding Using Nonlinear Trajectory Models

595

21. Wan, E., van der Merwe, R.: The unscented Kalman filter. In: Haykin, S. (ed.) Kalman Filtering and Neural Networks. Wiley, Chichester (2001)
22. Julier, S., Uhlmann, J.: Unscented filtering and nonlinear estimation. Proceedings of the IEEE 92(3), 401–422 (2004)
23. Arasaratnam, I., Haykin, S., Elliott, R.: Discrete-time nonlinear filtering algorithms using Gauss-Hermite quadrature. Proceedings of the IEEE 95(5), 953–977 (2007)
24. Minka, T.: Expectation propagation for approximate Bayesian inference. In: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 362–369 (2001)
25. Heskes, T., Zoeter, O.: Expectation propagation for approximate inference in dynamic Bayesian networks. In: Darwiche, A., Friedman, N. (eds.) Proceedings UAI-2002, pp. 216–223 (2002)
26. Zoeter, O., Ypma, A., Heskes, T.: Improved unscented Kalman smoothing for stock volatility estimation. In: Barros, A., Principe, J., Larsen, J., Adali, T., Douglas, S. (eds.) Proceedings of the IEEE Workshop on Machine Learning for Signal Processing (2004)
27. Ypma, A., Heskes, T.: Novel approximations for inference in nonlinear dynamical systems using expectation propagation. Neurocomputing 69, 85–99 (2005)
28. Yu, B.M., Shenoy, K.V., Sahani, M.: Expectation propagation for inference in nonlinear dynamical models with Poisson observations. In: Proc. IEEE Nonlinear Statistical Signal Processing Workshop (2006)
29. Doucet, A., de Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001)
30. van der Merwe, R., Wan, E.: Sigma-point Kalman filters for probabilistic inference in dynamic state-space models. In: Proceedings of the Workshop on Advances in Machine Learning (2003)
31. Kotecha, J.H., Djuric, P.M.: Gaussian particle filtering. IEEE Transactions on Signal Processing 51(10), 2592–2601 (2003)
32. MacKay, D.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
33. Smola, A., Vishwanathan, V., Eskin, E.: Laplace propagation. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge (2004)
34. Minka, T., Lafferty, J.: Expectation-propagation for the generative aspect model. In: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 352–359 (2002)
35. Doucet, A., Godsill, S., Andrieu, C.: On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(3), 197–208 (2000)
36. Neal, R.M., Beal, M.J., Roweis, S.T.: Inferring state sequences for non-linear systems with embedded hidden Markov models. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge (2004)
37. Chan, S.S., Moran, D.W.: Computational model of a primate arm: from hand position to joint angles, joint torques and muscle forces. J. Neural Eng. 3, 327–337 (2006)

Estimating Internal Variables of a Decision Maker's Brain: A Model-Based Approach for Neuroscience

Kazuyuki Samejima1 and Kenji Doya2

1 Brain Science Institute, Tamagawa University, 6-1-1 Tamagawa-gakuen, Machida, Tokyo 194-8610, Japan
[email protected]
2 Initial Research Project, Okinawa Institute of Science and Technology, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan
[email protected]

Abstract. A major problem in the search for neural substrates of learning and decision making is that the process is highly stochastic and subject-dependent, making simple stimulus- or output-triggered averaging inadequate. This paper presents a novel approach of characterizing neural recording or brain imaging data in reference to the internal variables of learning models (such as connection weights and parameters of learning) estimated from the history of external variables within a Bayesian inference framework. We specifically focus on reinforcement learning (RL) models of decision making and derive an estimation method for the variables by particle filtering, a recent method of dynamic Bayesian inference. We present the results of its application to decision-making experiments in monkeys and humans. The framework is applicable to a wide range of behavioral data analysis and diagnosis.

1 Introduction

The traditional approach in neuroscience to discovering information processing mechanisms is to correlate neuronal activities with external physical variables, such as sensory stimuli or motor outputs. However, when we search for neural correlates of higher-order brain functions, such as attention, memory, and learning, a problem has been that there are no external physical variables to correlate with. Recently, with advances in computational neuroscience, a number of computational models of such cognitive or learning processes have become available that make quantitative predictions of the subject's behavioral responses. A possible new approach is thus to try to find neural activities that correlate with the internal variables of such computational models (Corrado and Doya, 2007). A major issue in such model-based analysis of neural data is how to estimate the hidden variables of the model. For example, in learning agents, hidden variables such as connection weights change over time. In addition, the course of learning is regulated by hidden meta-parameters such as learning rates. Another important issue is how to judge the validity of a model, or to select the best model among a number of candidates.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 596–603, 2008. © Springer-Verlag Berlin Heidelberg 2008

Estimating Internal Variables of a Decision Maker’s Brain

597

The framework of Bayesian inference can provide coherent solutions to the issues of estimating hidden variables, including meta-parameters, from observable experimental data, and of selecting the most plausible computational model out of multiple candidates. In this paper, we first review the reinforcement learning model of reward-based decision making (Sutton and Barto, 1998) and derive a Bayesian estimation method for the hidden variables of a reinforcement learning model by particle filtering (Samejima et al., 2004). We then review examples of application of the method to monkey neural recording (Samejima et al., 2005) and human imaging studies (Haruno et al., 2004; Tanaka et al., 2006; Behrens et al., 2007).

2 Reinforcement Learning Model as an Animal or Human Decision Maker

Reinforcement learning can serve as a model of animal or human decision making based on reward delivery. Notably, the responses of monkey midbrain dopamine neurons are successfully explained by the temporal difference (TD) error of reinforcement learning models (Schultz et al., 1997). The goal of reinforcement learning is to improve the policy, the rule of taking an action a_t at state s_t, so that the resulting rewards r_t are maximized in the long run. The basic strategy of reinforcement learning is to estimate the cumulative future reward under the current policy as the value function for each state, and then to improve the policy based on the value function. In a standard reinforcement learning algorithm called "Q-learning," an agent learns the action-value function

Q(s_t, a_t) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... | s_t, a_t ]   (1)

which estimates the cumulative future reward when action a_t is taken at state s_t. The discount factor 0 < γ < 1 is a meta-parameter that controls the time scale of prediction. The policy of the learner is then given by comparing action values, e.g., according to the Boltzmann distribution

π(a | s_t) = exp(β Q(a, s_t)) / Σ_{a'∈A} exp(β Q(a', s_t))   (2)

where the inverse temperature β > 0 is another meta-parameter that controls the randomness of action selection. From an experience of state s_t, action a_t, reward r_t, and next state s_{t+1}, the action-value function is updated by the Q-learning algorithm (Sutton and Barto, 1998) as

δ_t = r_t + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t)
Q(s_t, a_t) ⇐ Q(s_t, a_t) + α δ_t   (3)
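As a concrete illustration, the softmax policy (2) and the Q-learning update (3) can be sketched in a few lines of Python. The toy environment (two states, a reward for one hypothetical state-action pair, uniformly random transitions) and all numerical values are illustrative assumptions, not the tasks analyzed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(q_row, beta):
    """Boltzmann action selection, Eq. (2): p(a|s) proportional to exp(beta*Q(s,a))."""
    p = np.exp(beta * (q_row - q_row.max()))  # subtract max for numerical stability
    return p / p.sum()

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Q-learning update, Eq. (3)."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    return delta

# Toy run: 2 states, 2 actions.
Q = np.zeros((2, 2))
alpha, beta, gamma = 0.1, 2.0, 0.9   # the three meta-parameters
s = 0
for t in range(1000):
    a = rng.choice(2, p=softmax_policy(Q[s], beta))
    r = 1.0 if (s == 0 and a == 1) else 0.0   # hypothetical reward rule
    s_next = int(rng.integers(2))             # uniformly random state transitions
    q_update(Q, s, a, r, s_next, alpha, gamma)
    s = s_next
```

After enough trials the learned values favor the rewarded action, and the softmax policy chooses it with probability controlled by β.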

598

K. Samejima and K. Doya

where α > 0 is the meta-parameter for the learning rate. Thus, in the case of a Q-learning agent, we have three meta-parameters. Such a reinforcement learning model of behavior learning not only predicts the subject's actions, but can also provide candidates for the brain's internal processes of decision making, which may be captured in neural recording or brain imaging data. However, a big problem is that the predictions depend on the settings of the meta-parameters, such as the learning rate α, action randomness β, and discount factor γ.

3 Probabilistic Dynamic Evolution of Internal Variables for a Q-Learning Agent

Let us consider the problem of estimating the course of action values {Q_t(s, a); s ∈ S, a ∈ A, 0 < t < T} and the meta-parameters α, β, and γ of a reinforcement learner by observing only the sequence of states s_t, actions a_t, and rewards r_t. To solve this problem, we use a Bayesian method of estimating a dynamical hidden variable {x_t; t ∈ N} from a sequence of observable variables {y_t; t ∈ N}. We assume that the unobservable signal (hidden variable) is modeled as a Markov process with initial distribution p(x_0) and transition probability p(x_{t+1} | x_t). The observations {y_t; t ∈ N} are assumed to be conditionally independent given the process {x_t; t ∈ N}, with marginal distribution p(y_t | x_t). The problem to solve in this setting is to estimate recursively in time the posterior distribution of the hidden variable p(x_{1:t} | y_{1:t}), where x_{0:T} = {x_0, ..., x_T} and y_{1:T} = {y_1, ..., y_T}. The marginal distribution is given by the recursive procedure of the following prediction and updating steps.

Predicting:

p(x_t | y_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1}

Updating:

p(x_t | y_{1:t}) = p(y_t | x_t) p(x_t | y_{1:t−1}) / ∫ p(y_t | x_t) p(x_t | y_{1:t−1}) dx_t

We use a numerical method, called the particle filter, that was proposed to solve this Bayesian recursion (Doucet et al., 2001). In the particle filter, the distributions over sequences of hidden variables are represented by a set of random samples, also named "particles". We use a bootstrap filter to calculate the recursion of the prediction and the update of the particle distribution (Doucet et al., 2001).

Figure 1 shows the dynamical Bayesian network representation of the evolution of internal variables in a Q-learning agent. The hidden variable x_t consists of the action values Q(s, a) for each state-action pair, the learning rate α, the inverse temperature β, and the discount factor γ. The observable variable y_t consists of the states s_t, actions a_t, and rewards r_t.

The observation probability p(y_t | x_t) is given by the softmax action selection (2). The transition probability p(x_{t+1} | x_t) of the hidden variable is given by the Q-learning rule (3) and an assumption on the meta-parameter dynamics. Here we assume that the meta-parameters (α, β, and γ) are constant with small drifts. Because α, β, and γ should all be positive, we assume random-walk dynamics in logarithmic space,

log(x_{t+1}) = log(x_t) + ε_x,   ε_x ~ N(0, σ_x)   (4)

where σ_x is a meta-meta-parameter that defines the random-walk variability of the meta-parameters.
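To make the procedure concrete, the following is a minimal bootstrap-filter sketch for this model, specialized (as in Sec. 4.1 below) to one state, two actions, and γ = 0, so each particle carries {Q(a), log α, log β}. The behavioral data are simulated from a hypothetical Q-learner, and all constants (N, σ_x, the priors over meta-parameters) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

N, sigma_x = 1000, 0.05   # particle count; meta-meta-parameter of Eq. (4)

def simulate_behavior(T=200, alpha=0.3, beta=2.0):
    """Generate a hypothetical choice/reward sequence from a Q-learner."""
    Q, data = np.zeros(2), []
    p_reward = np.array([0.1, 0.5])
    for _ in range(T):
        p = np.exp(beta * Q)
        p /= p.sum()
        a = int(rng.choice(2, p=p))
        r = float(rng.random() < p_reward[a])
        Q[a] += alpha * (r - Q[a])            # Q-learning with gamma = 0
        data.append((a, r))
    return data

particles = {"Q": np.zeros((N, 2)),
             "log_alpha": np.log(0.5) + 0.5 * rng.standard_normal(N),
             "log_beta": 0.5 * rng.standard_normal(N)}

for a, r in simulate_behavior():
    # prediction: log-space random walk on meta-parameters, Eq. (4)
    particles["log_alpha"] += sigma_x * rng.standard_normal(N)
    particles["log_beta"] += sigma_x * rng.standard_normal(N)
    # update: weight each particle by the softmax likelihood of the action, Eq. (2)
    beta = np.exp(particles["log_beta"])
    logits = beta[:, None] * particles["Q"]
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = (p / p.sum(axis=1, keepdims=True))[:, a]
    w /= w.sum()
    # resample, then apply the Q-learning rule (3) as the Q-value transition
    idx = rng.choice(N, size=N, p=w)
    for k in particles:
        particles[k] = particles[k][idx]
    alpha = np.exp(particles["log_alpha"])
    particles["Q"][:, a] += alpha * (r - particles["Q"][:, a])

alpha_hat = np.exp(particles["log_alpha"]).mean()   # posterior-mean estimate
```

The resampled particle cloud approximates the joint posterior over action values and meta-parameters given the behavioral history.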

Fig. 1. A Bayesian network representation of a Q-learning agent: the dynamics of the observable and unobservable variables depend on the decision, reward probability, state transition, and update rule for the value function. Circles: hidden variables. Double boxes: observable variables. Arrows: probabilistic dependencies.

4 Computational Model-Based Analysis of Brain Activity

4.1 Application to Monkey Choice Behavior and Striatal Neural Activity

Samejima et al. (2005) used the internal-variable estimation approach with a Q-learning model for a monkey free-choice task, a two-armed-bandit problem (Figure 2). The task has only one state, two actions, and stochastic binary reward. The reward probability for each action is fixed within a block of 30–150 trials, but is randomly chosen from five probability combinations, block by block. The reward probabilities P(a=L) for action a=L and P(a=R) for action a=R are selected randomly from five settings, [P(a=L), P(a=R)] ∈ {[0.5, 0.5], [0.5, 0.1], [0.1, 0.5], [0.5, 0.9], [0.9, 0.5]}, at the beginning of each block.
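The block design just described can be sketched as a tiny simulator (the class and variable names are our own; the paper provides no code):

```python
import numpy as np

rng = np.random.default_rng(2)

# One state, two actions (0 = L, 1 = R); reward probabilities fixed within a
# block of 30-150 trials and redrawn from five settings at each block boundary.
SETTINGS = [(0.5, 0.5), (0.5, 0.1), (0.1, 0.5), (0.5, 0.9), (0.9, 0.5)]

class BanditTask:
    def __init__(self):
        self.new_block()

    def new_block(self):
        self.p = SETTINGS[rng.integers(len(SETTINGS))]
        self.trials_left = int(rng.integers(30, 151))   # 30-150 trials per block

    def step(self, action):
        reward = float(rng.random() < self.p[action])   # stochastic binary reward
        self.trials_left -= 1
        if self.trials_left == 0:
            self.new_block()
        return reward
```

A Q-learning agent such as the one in Sec. 2 can be run against this simulator to reproduce the qualitative block-wise choice dynamics.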

Fig. 2. Two-armed bandit task for the monkey's behavioral choice. Upper: time course of the task. The monkey faced a panel in which three LEDs (right, left, and up) were embedded, with a small LED in the middle. When the small LED was illuminated red, the monkey grasped a handle with its right hand and adjusted it to the center position. If the monkey held the handle at the center position for 1 s, the small LED was turned off as a GO signal. The monkey then turned the handle to either the right or left side, which was associated with a shift of the yellow LED illumination from up to the turned direction. After 0.5 s, the color of the LED changed from yellow to either green or red. A green LED was followed by a large amount of reward water, while a red LED was followed by a small amount of water. Lower panel: state diagram of the task. The circle indicates a state. Arrows indicate possible actions and state transitions.

The Q-learning model of monkey behavior tries to learn the reward expectation of each action (the action value) and to maximize the reward acquired in each block. Because the task has only one state, the agent does not need to take into account the next state's value, and thus we set the discount factor to γ = 0. Samejima et al. (2005) showed that the computed internal variable, the action value for a particular movement direction (left/right), estimated from the past history of choices and outcomes (rewards), could predict the monkey's future choice probability (Figure 3). The action value is an example of a variable that could not be immediately


Fig. 3. Time course of predicted choice probability and estimated action values. Upper panel: an example history of actions (red = right, blue = left), rewards (dot = small, circle = large), choice ratio (cyan line, Gaussian smoothed, σ = 2.5), and predicted choice probability (black line). The color of the upper bar indicates the reward probability combination. Lower panel: estimated action values (blue = Q-value for left, red = Q-value for right). (From Samejima et al., 2005.)

Fig. 4. An example of the activity of a caudate neuron plotted in the space of estimated action values QL(t) and QR(t). Left panel: 3-dimensional plot of neural activity against estimated QL(t) and QR(t). Right panel: 2-D projected plots of the discharge rates of the neuron on the QL axis (left side) and the QR axis (right side). Grey lines are derived from a regression model. Circles and error bars indicate the average and standard deviation of neural discharge rates for each of 10 equally populated action-value bins. (From Samejima et al., 2005.)

obvious from observable experimental parameters, but can be inferred using an action-predicting computational model. Furthermore, the activity of most dorsal striatum projection neurons correlated with the estimated action value for a particular action (Figure 4).

4.2 Application to Human Imaging Data

Not only the internal variables but also the meta-parameters (e.g., learning rate, action stochasticity, and discount rate for future reward) can be estimated by this methodology. Although the subjective values of the learning meta-parameters may differ across individual subjects, the model-based approach can track subjective internal values under different meta-parameters. Especially in human imaging studies, this


flexibility is effective for extracting common neural circuit activation across multi-subject experiments. One problem in cognitive neuroscience studies using decision-making tasks is the lack of controllability of internal variables. In conventional analyses in neuroscience and brain imaging, the experimenter tries to control a cognitive state or an assumed internal variable through a task demand or an experimental setting, and observed brain activities are compared to the assumed variable. However, the subjective internal variables may depend on personal behavioral tendencies and may differ from the variable assumed by the experimenter. The Bayesian estimation method for internal variables, including meta-parameters, can reduce such noise due to personal differences by fitting the meta-parameters. Tanaka et al. (2006) showed that the variety of behavioral tendencies across multiple human subjects could be characterized by the estimated meta-parameters of a Q-learning agent. Figure 5 shows the distribution of the three meta-parameters: learning rate α, action stochasticity β, and discount rate γ. The subjects whose estimated γ was lower tended to be trapped in a locally optimal policy and could not reach the optimal choice sequence (Figure 5, left panel). On the other hand, the subjects whose learning rate α and inverse temperature β were estimated to be lower than the others' reported in a post-experimental questionnaire that they could not find any confident action selection in each state, even in later experimental sessions of the task (Figure 5, right panel). Regardless of the variety of the subjects' behavioral tendencies, an fMRI signal correlated with the estimated action value of the selected action was observed in the ventral striatum in the unpredictable condition, in which the state transitions were completely random, whereas dorsal striatum activity was correlated with the action value in the predictable environment, in which the state transitions were deterministic. This suggests that different cortico-basal ganglia circuits might be involved depending on the predictability of the environment (Tanaka et al., 2006).

Fig. 5. Subject distribution of estimated meta-parameters: learning rate α, action stochasticity (inverse temperature) β, and discount rate γ. Left panel: distribution in α–γ space. Subjects LI, NN, and NT (indicated inside the ellipsoid) were trapped in a locally optimal action sequence. Right panel: distribution in α–β space. Subjects BB and LI (indicated inside the ellipsoid) reported that they could not find any confident strategy.


5 Conclusion

The theoretical framework of reinforcement learning for modeling behavioral decision making, together with the Bayesian estimation method for subjective internal variables, can be a powerful tool for analyzing both neural recording data (Samejima et al., 2005) and human imaging data (Daw et al., 2006; Pessiglione et al., 2006; Tanaka et al., 2006). In particular, tracking the meta-parameters of RL can capture the behavioral tendencies of animal or human decision making. Recently, a correlation between anterior cingulate cortex activity and the learning rate under uncertain environmental change was reported using a Bayesian decision model with a temporally evolving learning-rate parameter (Behrens et al., 2007). Although not detailed in this paper, the Bayesian estimation framework also provides a way of objectively selecting the best model in reference to the given data. The combination of Bayesian model selection and hidden-variable estimation methods would contribute to a new understanding of the decision mechanisms of our brain through falsifiable hypotheses and objective experimental tests.

References

1. Behrens, T.E., Woolrich, M.W., Walton, M.E., Rushworth, M.F.: Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007)
2. Corrado, G., Doya, K.: Understanding neural coding through the model-based analysis of decision making. J. Neurosci. 27, 8178–8180 (2007)
3. Daw, N.D., O'Doherty, J.P., Dayan, P., Seymour, B., Dolan, R.J.: Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006)
4. Doucet, A., de Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2001)
5. Haruno, M., Kuroda, T., Doya, K., Toyama, K., Kimura, M., Samejima, K., Imamizu, H., Kawato, M.: A neural correlate of reward-based behavioral learning in caudate nucleus: a functional magnetic resonance imaging study of a stochastic decision task. J. Neurosci. 24, 1660–1665 (2004)
6. Pessiglione, M., Seymour, B., Flandin, G., Dolan, R.J., Frith, C.D.: Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature 442, 1042–1045 (2006)
7. Samejima, K., Doya, K., Ueda, Y., Kimura, M.: Advances in Neural Information Processing Systems, vol. 16. MIT Press, Cambridge (2004)
8. Samejima, K., Ueda, Y., Doya, K., Kimura, M.: Representation of action-specific reward values in the striatum. Science 310, 1337–1340 (2005)
9. Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Science 275, 1593–1599 (1997)
10. Sutton, R.S., Barto, A.G.: Reinforcement Learning. MIT Press, Cambridge (1998)
11. Tanaka, S.C., Samejima, K., Okada, G., Ueda, K., Okamoto, Y., Yamawaki, S., Doya, K.: Brain mechanism of reward prediction under predictable and unpredictable environmental dynamics. Neural Netw. 19, 1233–1241 (2006)

Visual Tracking Achieved by Adaptive Sampling from Hierarchical and Parallel Predictions

Tomohiro Shibata1, Takashi Bando2, and Shin Ishii1,3

1 Graduate School of Information Science, Nara Institute of Science and Technology
[email protected]
2 DENSO Corporation
3 Graduate School of Informatics, Kyoto University

Abstract. Because visual information is inevitably ill-posed, the brain essentially needs some prior knowledge, prediction, or hypothesis to acquire a meaningful solution. From a computational point of view, visual tracking is the real-time process of statistical spatiotemporal filtering of target states from an image stream, and incremental Bayesian computation is one of its most important devices. To make the Bayesian computation of the posterior density of state variables tractable for any type of probability distribution, particle filters (PFs) have often been employed in the real-time vision area. In this paper, we briefly review incremental Bayesian computation and PFs for visual tracking, indicate drawbacks of PFs, and then propose our framework, in which hierarchical and parallel predictions are integrated by adaptive sampling to achieve an appropriate balance of tracking accuracy and robustness. Finally, we discuss the proposed model from the viewpoint of neuroscience.

1 Introduction

Because visual information is inevitably ill-posed, the brain essentially needs some prior knowledge, prediction, or hypothesis to acquire a meaningful solution. Prediction is also essential for real-time recognition or visual tracking. Due to the flood of visual data, examining the whole data stream is infeasible, and ignoring irrelevant data is essential. The primate fovea and oculomotor control can be viewed from this standpoint: high visual acuity is realized by the narrow foveal region on the retina, and the visual axis has to be actively moved by oculomotor control. Computer vision, in particular real-time vision, faces the same computational problems discussed above, and attractive as well as feasible methods and applications have been developed based on particle filters (PFs) [4]. One of the key ideas of PFs is the importance sampling distribution, or proposal distribution, which can be viewed as prediction or attention to overcome the computational problems discussed above. The aim of this paper is to propose a novel Bayesian visual tracking framework with hierarchically-modeled state variables for single object tracking, and to discuss PFs and our framework from the viewpoint of neuroscience.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 604–613, 2008. © Springer-Verlag Berlin Heidelberg 2008

Visual Tracking Achieved by Adaptive Sampling

2 Adaptive Sampling from Hierarchical and Parallel Predictions

2.1 Incremental Bayes and Particle Filtering

Particle filtering [4] is an approach to performing Bayesian estimation of intractable posterior distributions from time-series signals with non-Gaussian noise, generalizing traditional Kalman filtering. This approach has been attracting attention in various research areas, including real-time visual processing (e.g., [5]). In clutter, there are usually several competing observations, and these cause the posterior to be multi-modal and therefore non-Gaussian. In reality, using a large number of particles is not allowed, especially for real-time processing, and thus there is a strong demand to reduce the number of particles. Reducing the number of particles, however, can lead to sacrificing the accuracy and robustness of filtering, particularly when the dimension of the state variables is high. How to cope with this trade-off has been one of the most important computational issues for PFs, but there have been few efforts to reduce the dimension of the state variables in the context of PFs. Making use of hierarchy in the state variables seems a natural solution to this problem. For example, in the case of estimating the pose of a head from a video stream, involving state variables for the three-dimensional head pose and the two-dimensional positions of face features in the image, the high-dimensional state space can be divided into two groups by using the causality in their states, i.e., the head pose strongly affects the positions of the face features (cf. Fig. 1, except dotted arrows). There are, however, two big problems in this setting. First, the estimation of the lower state variables is strongly dependent on the estimation of the higher state variables. In real applications, it often happens that the assumption of generation from the higher state variable to the lower state variable is violated. Second, the assumption of generation from the lower state variable to the input, typically image frames, can also be violated. These two problems lead to failure in the estimation.

2.2 Estimation of Hierarchically-Modeled State Variables

Here we present a novel framework for hierarchically-modeled state variables for single object tracking. The intuition behind our approach is that the higher and lower layers have their own dynamics, and mixing their predictions in the proposal distribution based on their reliability adds both robustness and accuracy to tracking with a smaller number of particles. We assume there are two continuous state vectors at time step t, denoted by a_t ∈ R^{N_a} and x_t ∈ R^{N_x}, and that they are hierarchically modeled as in Fig. 1. Our goal is then to estimate these unobservable states from an observation sequence z_{1:t}.

State Estimation. According to Bayes' rule, the joint posterior density p(a_t, x_t | z_{1:t}) is given by

p(a_t, x_t | z_{1:t}) ∝ p(z_t | a_t, x_t) p(a_t, x_t | z_{1:t−1}),

606

T. Shibata, T. Bando, and S. Ishii

Fig. 1. A graphical model of hierarchical and parallel dynamics. The right panels depict example physical representations processed in the layers.

where p(a_t, x_t | z_{1:t−1}) is the joint prior density, and p(z_t | a_t, x_t) is the likelihood. The joint prior density p(a_t, x_t | z_{1:t−1}) is given by the previous joint posterior density p(a_{t−1}, x_{t−1} | z_{1:t−1}) and the state transition model p(a_t, x_t | a_{t−1}, x_{t−1}) as follows:

p(a_t, x_t | z_{1:t−1}) = ∫ p(a_t, x_t | a_{t−1}, x_{t−1}) p(a_{t−1}, x_{t−1} | z_{1:t−1}) da_{t−1} dx_{t−1}.

The state vectors a_t and x_t are assumed to be generated from the hierarchical model shown in Fig. 1. Furthermore, the dynamics model of a_t is assumed to be represented as a temporal Markov chain, with conditional independence between a_{t−1} and x_t given a_t. Under these assumptions, the state transition model p(a_t, x_t | a_{t−1}, x_{t−1}) and the joint prior density p(a_t, x_t | z_{1:t−1}) can be represented as

p(a_t, x_t | a_{t−1}, x_{t−1}) = p(x_t | a_t) p(a_t | a_{t−1})

and

p(a_t, x_t | z_{1:t−1}) = p(x_t | a_t) p(a_t | z_{1:t−1}),

respectively. Then, we can carry out hierarchical computation of the joint posterior distribution by a PF as follows: first, the higher samples {a_t^{(i)} | i = 1, ..., N_a} are drawn from the prior density p(a_t | z_{1:t−1}). The lower samples {x_t^{(j)} | j = 1, ..., N_x} are then drawn from the prior density p(x_t | z_{1:t−1}), described by the proposal distribution p(x_t | a_t^{(i)}). Finally, the weights of the samples are given by the likelihood p(z_t | a_t, x_t).

Adaptive Mixing of the Proposal Distribution. When applied to state estimation problems in the real world, the above proposal distribution can be contaminated by severe non-Gaussian noise and/or violations of the assumptions. Especially in the case of PF estimation with a small number of particles, a contaminated proposal distribution can give a fatal disturbance to the

Visual Tracking Achieved by Adaptive Sampling


estimation. In this study, we assume a state transition in the lower layer that is independent of the upper layer (represented by the dotted arrows in Fig. 1), and we aim to achieve robust and accurate estimation by adaptively determining the contribution from this hypothetical state transition. The state transition p(a_t, x_t | a_{t-1}, x_{t-1}) can be represented as

    p(a_t, x_t \mid a_{t-1}, x_{t-1}) = p(x_t \mid a_t, x_{t-1})\, p(a_t \mid a_{t-1}).    (1)
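To make this factorization concrete, one prediction-update cycle of a PF under this transition model can be sketched as follows. The 1-D random-walk dynamics, Gaussian densities, noise scales, and the fixed mixing probability are all illustrative assumptions, not the paper's models; the mixture form of p(x_t | a_t, x_{t-1}) is introduced below as Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                                 # number of particles
sig_a, sig_x, sig_z = 0.1, 0.1, 0.2     # toy noise scales (assumed)

def pf_step(a, x, z, alpha_a=0.5):
    """One PF cycle under the factorized transition
    p(a_t, x_t | a_{t-1}, x_{t-1}) = p(x_t | a_t, x_{t-1}) p(a_t | a_{t-1}).

    p(x_t | a_t, x_{t-1}) is realized as a two-component mixture: with
    probability alpha_a the lower state is regenerated from the higher layer,
    otherwise it follows its own hypothetical transition.
    """
    a_new = a + sig_a * rng.standard_normal(N)                # p(a_t | a_{t-1})
    use_higher = rng.random(N) < alpha_a                      # component choice
    x_new = np.where(use_higher,
                     a_new + sig_x * rng.standard_normal(N),  # p(x_t | a_t)
                     x + sig_x * rng.standard_normal(N))      # p(x_t | x_{t-1})
    w = np.exp(-0.5 * ((z - x_new) / sig_z) ** 2)             # p(z_t | x_t)
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)                          # resample jointly
    return a_new[idx], x_new[idx]

a, x = np.zeros(N), np.zeros(N)
for z in (0.0, 0.1, 0.2):               # toy observation sequence
    a, x = pf_step(a, x, z)
print(x.shape)                          # (200,)
```

Here the mixing probability is fixed; the point of the paper's method is to make it adaptive, as developed next.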

Here, the dynamics model of x_t is assumed to be represented as

    p(x_t \mid a_t, x_{t-1}) = \alpha_{a,t}\, p(x_t \mid a_t) + \alpha_{x,t}\, p(x_t \mid x_{t-1}),    (2)

with α_{a,t} + α_{x,t} = 1. In our algorithm, p(x_t | a_t, x_{t-1}) is modeled as a mixture of the approximated prediction densities computed in the lower and higher layers, p(x_t | x_{t-1}) and p(x_t | a_t). The mixture ratio α_t = {α_{a,t}, α_{x,t}}, representing the contribution of each layer, is then determined in a way that enables interaction between the layers; we describe how α_t is determined in the following subsection. From Eqs. (1) and (2), the joint prior density p(a_t, x_t | z_{1:t-1}) is given as

    p(a_t, x_t \mid z_{1:t-1}) = p(x_t \mid a_t, z_{1:t-1})\, p(a_t \mid z_{1:t-1}) = \pi(x_t \mid \alpha_t)\, p(a_t \mid z_{1:t-1}),

where

    \pi(x_t \mid \alpha_t) = \alpha_{a,t}\, p(x_t \mid a_t) + \alpha_{x,t}\, p(x_t \mid z_{1:t-1})    (3)

is the adaptively mixed proposal distribution for x_t, based on the prediction densities p(x_t | a_t) and p(x_t | z_{1:t-1}).

Determination of α_t using the On-line EM Algorithm. The mixture ratio α_t is the parameter determining the adaptively mixed proposal distribution π(x_t | α_t); determining it by minimizing the KL divergence between the lower-layer posterior density p(x_t | a_t, z_{1:t}) and π(x_t | α_t) yields robust and accurate estimation. Determining α_t is equivalent to determining the mixture ratio of a two-component mixture model, and we employ sequential maximum-likelihood (ML) estimation of the mixture ratio. In our method, the index variable for component selection becomes a latent variable, and the sequential ML estimation is therefore implemented by means of an on-line EM algorithm [11]. Resampling from the posterior density p(x_t | a_t, z_{1:t}), we obtain N_x samples x̃_t^{(i)}. Using the latent variable m \in {m_a, m_x}, which indicates which prediction density, p(x_t | a_t) or p(x_t | z_{1:t-1}), is trusted, the on-line log likelihood can then be represented as

    L(\alpha_t) = \eta_t \sum_{\tau=1}^{t} \Big( \prod_{s=\tau+1}^{t} \lambda_s \Big) \sum_{i=1}^{N_x} \log \pi(\tilde{x}_\tau^{(i)} \mid \alpha_t)
                = \eta_t \sum_{\tau=1}^{t} \Big( \prod_{s=\tau+1}^{t} \lambda_s \Big) \sum_{i=1}^{N_x} \log \sum_{m \in \{m_a, m_x\}} p(\tilde{x}_\tau^{(i)}, m \mid \alpha_t),


1. Estimation of the state variable x_t in the lower layer.
   - Obtain α_{a,t} N_x samples x_{a,t}^{(i)} from p(x_t | a_t).
   - Obtain α_{x,t} N_x samples x_{x,t}^{(i)} from p(x_t | z_{1:t-1}).
   - Obtain the expectation x̂_t and Std(x_t) of p(x_{n,t} | a_t, z_{1:t}) using the N_x mixture samples x_t^{(i)}, constituted by x_{x,t}^{(i)} and x_{a,t}^{(i)}.
   The above procedure is applied to each feature, yielding {x̂_{n,t}, Std(x_{n,t})}.
2. Estimation of the state variable a_t in the higher layer.
   - Obtain D_{a,n}(t) based on Std(x_{n,t}), and then estimate p(a_t | z_{1:t}).
3. Determination of the mixture ratio α_{t+1}.
   - Obtain N_x samples x̃_t^{(i)} from p(x_{n,t} | a_t, z_{1:t}).
   - Calculate α_{t+1} so as to maximize the on-line log likelihood.
   The above procedure is applied to each feature, yielding {α_{n,t+1}}.

Fig. 2. Hierarchical pose estimation algorithm

where λ_s is a decay constant that reduces the adverse influence of former inaccurate estimation, and

    \eta_t = \Big( \sum_{\tau=1}^{t} \prod_{s=\tau+1}^{t} \lambda_s \Big)^{-1}

is a normalization constant which works as a learning coefficient. The optimal mixture ratio α*_t in the sense of the KL divergence, which gives the optimal proposal distribution π(x_t | α*_t), can be calculated by maximization of the on-line log likelihood as follows:

    \alpha^*_{m,t} = \frac{\langle m \rangle_t}{\sum_{m' \in \{m_a, m_x\}} \langle m' \rangle_t},

where

    p(\tilde{x}_t^{(i)}, m_a \mid \alpha_t) = \alpha_{a,t}\, p(\tilde{x}_t^{(i)} \mid a_t), \qquad
    p(\tilde{x}_t^{(i)}, m_x \mid \alpha_t) = \alpha_{x,t}\, p(\tilde{x}_t^{(i)} \mid z_{1:t-1}),

and

    \langle m \rangle_t = \eta_t \sum_{\tau=1}^{t} \Big( \prod_{s=\tau+1}^{t} \lambda_s \Big) \sum_{i=1}^{N_x} p(m \mid \tilde{x}_\tau^{(i)}, \alpha_\tau), \qquad
    p(m \mid \tilde{x}_t^{(i)}, \alpha_t) = \frac{p(\tilde{x}_t^{(i)}, m \mid \alpha_t)}{\sum_{m' \in \{m_a, m_x\}} p(\tilde{x}_t^{(i)}, m' \mid \alpha_t)}.
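A minimal sketch of this on-line EM update, vectorized over the resampled particles; the toy likelihood values, the folding of the normalization η_t into the ratio of statistics (which leaves α* unchanged), and the clipping bounds are illustrative choices:

```python
import numpy as np

def online_em_alpha(lik_a, lik_x, alpha, m_bar, lam=0.5,
                    alpha_min=0.2, alpha_max=0.8):
    """One on-line EM update of the mixture ratio alpha = (alpha_a, alpha_x).

    lik_a[i] ~ p(x_tilde_i | a_t) and lik_x[i] ~ p(x_tilde_i | z_{1:t-1})
    are evaluated at particles resampled from the lower-layer posterior.
    m_bar holds the decayed sufficient statistics <m>_t; lam is the decay
    constant. Clipping alpha keeps some mass on both layers (the paper limits
    alpha_{a,t} to [0.2, 0.8] in its experiments).
    """
    a_a, a_x = alpha
    joint = np.stack([a_a * lik_a, a_x * lik_x])      # p(x, m | alpha)
    resp = joint / joint.sum(axis=0, keepdims=True)   # E-step: p(m | x, alpha)
    m_bar = lam * m_bar + resp.mean(axis=1)           # decayed statistics
    alpha_a = float(np.clip(m_bar[0] / m_bar.sum(), alpha_min, alpha_max))
    return (alpha_a, 1.0 - alpha_a), m_bar

# toy case: the higher-layer prediction explains the particles much better
lik_a = np.full(50, 0.9)
lik_x = np.full(50, 0.1)
alpha, m_bar = (0.5, 0.5), np.zeros(2)
for _ in range(10):
    alpha, m_bar = online_em_alpha(lik_a, lik_x, alpha, m_bar)
print(round(alpha[0], 2))   # converges to the upper limit 0.8
```

Because `m_bar` is updated recursively, each step needs only the current particles, matching the incremental computation noted in the text.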

Note that ⟨m⟩_t can be calculated incrementally.

2.3  Application to Pose Estimation of a Rigid Object

Here the proposed method is applied to a real problem, pose estimation of a rigid object (cf. Fig. 1). The algorithm is shown in Fig. 2. The N_f features on the image plane at time step t are denoted by x_{n,t} = (u_{n,t}, v_{n,t})^T (n = 1, ..., N_f); an affine camera matrix that projects a 3D model of the object onto the image plane at time step t is, for simplicity, reshaped into a vector a_t = (a_{1,t}, ..., a_{8,t})^T; and the observed image at time step t is denoted by z_t. When applied to the pose estimation of a rigid object, a Gaussian process is assumed in the higher layer and the object's pose is estimated by a Kalman filter,


while tracking of the features is performed by PF because of severe non-Gaussian noise, e.g. occlusion, in the lower layer.

To obtain the samples x_{n,t}^{(i)} from the mixture proposal distribution, we need two prediction densities, p(x_{n,t} | a_t) and p(x_{n,t} | z_{1:t-1}). The prediction density computed in the higher layer, p(x_{n,t} | a_t), which encodes the physical relationship between the features, is given by the affine projection of the 3D model of the rigid object.

2.4  Experiments

Simulations. The goal of this simulation is to estimate the posture of a rigid object, as in Fig. 3A, from an observation sequence of eight features. The rigid object was a hexahedral piece, 20 (upper base) × 30 (lower base) × 20 (height) × 20 (depth) [mm] in size, rotated 150 mm away from a pin-hole camera (focal length: 40 mm) and projected onto the image plane (pixel size: 0.1 × 0.1 [mm]) by a perspective projection process disturbed by Gaussian noise (mean: 0, standard deviation: 1 pixel). The four features at the back of the object were occluded by the object itself. An occluded feature was represented by assuming that the standard deviation of the measurement noise grows to 100 pixels from a non-occluded value of 1 pixel. The other four features at the front of the object were not occluded. The length of the observation sequence of the features was 625 frames, i.e., about 21 sec.

In this simulation, we compared the performance of our proposed method with (i) a method with fixed α_t = {1, 0}, which is equivalent to simple hierarchical modeling in which the prediction density computed in the higher layer is always trusted, and (ii) a method with fixed α_t = {0, 1}, which does not implement the mutual interaction between the layers. The decay constant of the on-line EM algorithm was set at λ_t = 0.5. The pose estimated with the adaptive proposal distribution and α_{a,t} at each time step are shown in Figs. 3B and 3C, respectively. Here, the object's pose θ_t = {θ_{X,t}, θ_{Y,t}, θ_{Z,t}} was calculated from the estimated a_t using an extended Kalman filter (EKF). In our implementation, the maximum and minimum values of α_{a,t} were limited to 0.8 and 0.2, respectively, to prevent degeneration of robustness. As shown in Fig. 3B, our method achieved robust estimation against the occlusion. Concurrently with the robust estimation of the object's pose, appropriate determination of the mixture ratio was exhibited.
For example, in the case of feature x_1, the prediction density computed in the higher layer was emphasized, and the feature was well predicted using the 3D object model, during the period in which x_1 was occluded, because the observed feature position, contaminated by severe noise, depressed the confidence of the lower-layer prediction density.

Real Experiments. To investigate the performance against the severe non-Gaussian noise existing in real environments, the proposed method was applied to the head pose estimation of a driver from a real image sequence captured in a car. Face extraction/tracking from an image sequence is a well-studied problem because of its applicability to various areas, and several


Fig. 3. A: a sample simulation image. B: time course of the estimated object's pose. C: time course of α_{a,t} determined by the on-line EM algorithm. Gray backgrounds mark frames in which the feature was occluded.

PF algorithms have been proposed. However, for accurate estimation of state variables lying in a high-dimensional space, e.g. a human face, especially in the case of real-time processing, some technique for dimensionality reduction is required. The proposed method is expected to enable more robust estimation despite limited computing resources by exploiting the hierarchy of state variables. The real image sequence was captured by a near-infrared camera behind the steering wheel, i.e., the captured images contained no color information, and the image resolution was 640 × 480.

In such a visual tracking task, the true observation process p(z_t | x_{n,t}) is unknown because the true positions of the face features are unobservable; hence, a model of the observation process is needed for calculating the particle weights w_{n,t}^{(i)}. In this study, we employed normalized correlation with a template as the model of the approximate observation process. Although this observation model may seem too simple for problems in real environments, it is sufficient for examining the efficiency of the proposal distribution. We employed the nose, eyes, canthi, eyebrows, and corners of the mouth as the face features. Using the 3D feature positions measured by 3D distance-measuring equipment, we constructed the 3D face model. The proposed method was applied with 50 particles for each face feature, as in the simulation, and processed on a Pentium 4 (2.8 GHz) Windows 2000 PC with 1048 MB RAM. Our system processed one frame in 29.05 msec, and hence achieved real-time processing.
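A normalized-correlation observation model of this kind can be sketched as follows; the tiny template, the gain/offset test values, and the mapping of the correlation score to a nonnegative weight are illustrative assumptions (the paper does not specify how the correlation value enters the weight):

```python
import numpy as np

def ncc_weight(patch, template, eps=1e-9):
    """Particle weight from normalized cross-correlation with a template.

    Returns a value in [0, 1]; 1 means a perfect match up to gain and offset.
    The linear mapping of NCC from [-1, 1] to [0, 1] is an illustrative choice.
    """
    p = patch - patch.mean()
    t = template - template.mean()
    ncc = (p * t).sum() / (np.linalg.norm(p) * np.linalg.norm(t) + eps)
    return 0.5 * (ncc + 1.0)    # map [-1, 1] -> [0, 1]

template = np.array([[1.0, 2.0], [3.0, 4.0]])
same = ncc_weight(template * 2.0 + 5.0, template)   # gain/offset invariant
flipped = ncc_weight(-template, template)           # contrast-reversed patch
print(round(same, 3), round(flipped, 3))            # -> 1.0 0.0
```

The invariance to gain and offset is what makes normalized correlation attractive under the varying illumination of an in-car near-infrared camera.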


Fig. 4. Mean estimation error when α_t was estimated by EM, fixed at α_t = {1, 0}, and fixed at α_t = {0, 1} (100 trials). Since some bars protruded beyond the figure, they were shortened and the error amounts are instead displayed on top of them.

Fig. 5. Tracked face features

Fig. 4 shows the estimation error of the head pose relative to the true head pose of the driver measured by a gyro sensor in the car. Our adaptive-mixing proposal distribution achieved robust head pose estimation, as in the simulation task. In Fig. 5, the estimated face features are depicted by "+" marks for various head poses of the driver; the variance of the estimated feature position is represented by the size of the "+" mark, reflecting the estimator's confidence.

3  Modeling Primate's Visual Tracking by Particle Filters

Here we note that our computer vision study and primate vision share the same computational problems and similar constraints; namely, both need to perform real-time spatiotemporal filtering of visual data as robustly and accurately as


possible with limited computing resources. Although there are huge numbers of neurons in the brain, their firing rates are very noisy and much slower than the clock rates of recent personal computers. We can visually track only around four to six objects simultaneously (e.g., [2]). These facts indicate a limited attention resource. As mentioned at the beginning of this paper, it is widely known that only the foveal region of the retina acquires high-resolution images in primates, and that humans usually make saccades mostly to the eyes, nose, lips, and contours when watching a human face. In other words, primates actively ignore irrelevant information in the face of massive image inputs. Furthermore, many behavioral and computational studies have reported that the brain may compute Bayesian statistics (e.g., [8][10]). As we discussed, however, Bayesian computation is intractable in general, and particle filters (PFs) are an attractive and feasible solution to the problem, as they are very flexible, easy to implement, and parallelizable. Importance sampling is analogous to efficient delivery of computing resources. As a whole, we conjecture that the primate brain may employ PFs for visual tracking.

Although one of the major drawbacks of PFs is that a large number of particles, typically exponential in the dimension of the state variables, is required for accurate estimation, our proposed framework, in which sampling is performed adaptively from hierarchical and parallel predictive distributions, can be a solution. As demonstrated in Section 2 and in others' work [1], adaptive importance sampling from multiple predictions can balance accuracy and robustness of the estimation with a restricted number of particles. Along this line, overt/covert smooth pursuit in primates could be a research target for investigating our conjecture. Based on the model of Shibata et al. [12], Kawawaki et al. investigated the human brain mechanism of overt/covert smooth pursuit by fMRI experiments and suggested that the activity of the anterior/superior lateral occipito-temporal cortex (a/sLOTC) was responsible for target motion prediction rather than for motor commands for eye movements [7]. Note that the LOTC includes the homologue of the monkey medial superior temporal (MST) area, responsible for visual motion processing (e.g., [9][6]). In their study, the mechanism for increasing the a/sLOTC activity remained unclear. The increase in a/sLOTC activity was observed particularly when subjects pursued a blinking target motion covertly. This blink condition might elicit two predictions, e.g., one emphasizing the observation and the other its belief, as proposed in [1], and require computational resources for adaptive sampling. Multiple predictions might be performed in other brain regions such as the frontal eye field (FEF), the inferior temporal (IT) area, and the fusiform face area (FFA). The FEF is known to be involved in smooth pursuit (e.g., [3]) and has reciprocal connections to the MST area (e.g., [13]), but how they work together is unclear. Visual tracking of more general objects, rather than a small spot as a visual stimulus, requires a specific target representation and representations of distractors, including the background. Thus, the IT, FFA, and other areas related to higher-order visual representation may be making parallel predictions to deal with the varying target appearance during tracking.

4  Conclusion

In this paper, we first introduced particle filters (PFs) as approximate incremental Bayesian computation and pointed out their drawbacks. We then proposed a novel framework for visual tracking based on PFs as a solution to these drawbacks. The keys of the framework are: (1) the high-dimensional state space is decomposed into hierarchical and parallel predictors that treat state variables of lower dimension, and (2) their integration is achieved by adaptive sampling. The feasibility of our framework has been demonstrated by real as well as simulation studies. Finally, we pointed out the computational problems shared by PFs and human visual tracking, presented our conjecture that the primate brain, at least, employs PFs, and discussed its plausibility and perspectives for future investigation.

References

1. Bando, T., Shibata, T., Doya, K., Ishii, S.: Switching particle filters for efficient visual tracking. J. Robot. Auton. Syst. 54(10), 873 (2006)
2. Cavanagh, P., Alvarez, G.A.: Tracking multiple targets with multifocal attention. Trends Cogn. Sci. 9(7), 349-354 (2005)
3. Fukushima, K., Yamanobe, T., Shinmei, Y., Fukushima, J., Kurkin, S., Peterson, B.W.: Coding of smooth eye movements in three-dimensional space by frontal cortex. Nature 419, 157-162 (2002)
4. Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F Radar Signal Process. 140, 107-113 (1993)
5. Isard, M., Blake, A.: Condensation - conditional density propagation for visual tracking. Int. J. Comput. Vis. 29(1), 5-28 (1998)
6. Kawano, K., Shidara, M., Watanabe, Y., Yamane, S.: Neural activity in cortical area MST of alert monkey during ocular following responses. J. Neurophysiol. 71(6), 2305-2324 (1994)
7. Kawawaki, D., Shibata, T., Goda, N., Doya, K., Kawato, M.: Anterior and superior lateral occipito-temporal cortex responsible for target motion prediction during overt and covert visual pursuit. Neurosci. Res. 54(2), 112 (2006)
8. Knill, D.C., Pouget, A.: The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 27(12) (2004)
9. Newsome, W.T., Wurtz, R.H., Komatsu, H.: Relation of cortical areas MT and MST to pursuit eye movements. II. Differentiation of retinal from extraretinal inputs. J. Neurophysiol. 60(2), 604-620 (1988)
10. Rao, R.P.N.: Neural models of Bayesian belief propagation. In: The Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press, Cambridge (2006)
11. Sato, M., Ishii, S.: On-line EM algorithm for the normalized Gaussian network. Neural Computation 12(2), 407-432 (2000)
12. Shibata, T., Tabata, H., Schaal, S., Kawato, M.: A model of smooth pursuit in primates based on learning the target dynamics. Neural Netw. 18(3), 213 (2005)
13. Tian, J.-R., Lynch, J.C.: Corticocortical input to the smooth and saccadic eye movement subregions of the frontal eye field in cebus monkeys. J. Neurophysiol. 76(4), 2754-2771 (1996)

Bayesian System Identification of Molecular Cascades

Junichiro Yoshimoto^{1,2} and Kenji Doya^{1,2,3}

1 Initial Research Project, Okinawa Institute of Science and Technology Corporation, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan
  {jun-y,doya}@oist.jp
2 Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan
3 ATR Computational Neuroscience Laboratories, 2-2-2 Hikaridai, "Keihanna Science City", Kyoto 619-0288, Japan

Abstract. We present a Bayesian method for the system identiﬁcation of molecular cascades in biological systems. The contribution of this study is to provide a theoretical framework for unifying three issues: 1) estimating the most likely parameters; 2) evaluating and visualizing the conﬁdence of the estimated parameters; and 3) selecting the most likely structure of the molecular cascades from two or more alternatives. The usefulness of our method is demonstrated in several benchmark tests. Keywords: Systems biology, biochemical kinetics, system identiﬁcation, Bayesian inference, Markov chain Monte Carlo method.

1

Introduction

In recent years, the analysis of molecular cascades by mathematical models has contributed to the elucidation of intracellular mechanisms related to learning and memory [1,2]. In such modeling studies, the structure and parameters of the molecular cascades are selected based on the literature and databases.¹ However, if reliable information about a target molecular cascade cannot be obtained from those repositories, we must tune its structure and parameters so as to fit the model's behavior to the available experimental data. The development of a theoretically sound and efficient system identification framework is crucial for making such models useful.

In this article, we propose a Bayesian system identification framework for molecular cascades. For a given set of experimental data, system identification can be separated into two inverse problems: parameter estimation and model selection. The most popular strategy for parameter estimation is to find a single set of parameters based on the least mean-square-error or maximum-likelihood criterion [3]. However, we should be aware that the estimated parameters might

¹ For example, DOQCS (http://doqcs.ncbs.res.in/) and BIOMODELS (http://biomodels.net/) are available.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 614-624, 2008. © Springer-Verlag Berlin Heidelberg 2008


suffer from an "over-fitting effect" because the available set of experimental data is often small and noisy. To evaluate the accuracy of the estimators, statistical methods based on asymptotic theory [4] and on Fisher information [5] have been proposed independently. Still, we must pay attention to their practical limitations: a large amount of data is required for the former method, and the mathematical model must be at least locally linear for the latter. For model selection