JANUARY 2011
VOLUME 22
NUMBER 1
ITNNEP
(ISSN 1045-9227)
EDITORIAL
One Year as EiC, and Editorial-Board Changes at TNN .......... 1
REGULAR PAPERS
Signature Neural Networks: Definition and Application to Multidimensional Sorting Problems .......... R. Latorre, F. de Borja Rodríguez, and P. Varona 8
Adaptive Dynamic Programming for Finite-Horizon Optimal Control of Discrete-Time Nonlinear Systems with ε-Error Bound .......... F.-Y. Wang, N. Jin, D. Liu, and Q. Wei 24
Solving Nonstationary Classification Problems with Coupled Support Vector Machines .......... G. L. Grinblat, L. C. Uzal, H. A. Ceccatto, and P. M. Granitto 37
Optimum Spatio-Spectral Filtering Network for Brain–Computer Interface .......... H. Zhang, Z. Y. Chin, K. K. Ang, C. Guan, and C. Wang 52
24-GOPS 4.5-mm² Digital Cellular Neural Network for Rapid Visual Attention in an Object-Recognition SoC .......... S. Lee, M. Kim, K. Kim, J.-Y. Kim, and H.-J. Yoo 64
An Augmented Echo State Network for Nonlinear Adaptive Filtering of Complex Noncircular Signals .......... Y. Xia, B. Jelfs, M. M. Van Hulle, J. C. Príncipe, and D. P. Mandic 74
Learning Pattern Recognition Through Quasi-Synchronization of Phase Oscillators .......... E. Vassilieva, G. Pinto, J. A. de Barros, and P. Suppes 84
ELITE: Ensemble of Optimal Input-Pruned Neural Networks Using TRUST-TECH .......... B. Wang and H.-D. Chiang 96
Approximate Confidence and Prediction Intervals for Least Squares Support Vector Regression .......... K. De Brabanter, J. De Brabanter, J. A. K. Suykens, and B. De Moor 110
Super-Resolution Method for Face Recognition Using Nonlinear Mappings on Coherent Features .......... H. Huang and H. He 121
Minimum Complexity Echo State Network .......... A. Rodan and P. Tiňo 131
Bounded H∞ Synchronization and State Estimation for Discrete Time-Varying Stochastic Complex Networks Over a Finite Horizon .......... B. Shen, Z. Wang, and X. Liu 145
BRIEF PAPERS
Extended Input Space Support Vector Machine .......... R. Santiago-Mozos, F. Pérez-Cruz, and A. Artés-Rodríguez 158
Robust Stability Criterion for Discrete-Time Uncertain Markovian Jumping Neural Networks with Defective Statistics of Modes Transitions .......... Y. Zhao, L. Zhang, S. Shen, and H. Gao 164
ANNOUNCEMENTS
Call for Papers—The IEEE TRANSACTIONS ON NEURAL NETWORKS Special Issue: Online Learning in Kernel Methods .......... 171
Call for Participation—The 2011 International Joint Conference on Neural Networks .......... 172
IEEE TRANSACTIONS ON NEURAL NETWORKS

IEEE TRANSACTIONS ON NEURAL NETWORKS is published by the IEEE Computational Intelligence Society. Members may subscribe to this TRANSACTIONS for $22.00 per year. IEEE student members may subscribe for $11.00 per year. Nonmembers may subscribe for $1,750.00. For additional subscription information visit http://www.ieee.org/nns/pubs. For information on receiving this TRANSACTIONS, write to the IEEE Service Center at the address below. Member copies of Transactions/Journals are for personal use only. For more information about this TRANSACTIONS see http://www.ieee-cis.org/pubs/tnn.
Editor-in-Chief
DERONG LIU
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Dept. of Electrical and Computer Engineering, University of Illinois, Chicago, IL 60607 USA
Email: [email protected]

Associate Editors
HOJJAT ADELI, Ohio State Univ., USA
CESARE ALIPPI, Politecnico di Milano, Italy
MARCO BAGLIETTO, DIST-Univ. of Genova, Italy
LUBICA BENUSKOVA, Univ. of Otago, New Zealand
AMIT BHAYA, Federal Univ. of Rio de Janeiro, Brazil
IVO BUKOVSKY, Czech Technical Univ. in Prague, Czech Republic
SHENG CHEN, Univ. of Southampton, U.K.
TIANPING CHEN, Fudan Univ., China
PAU-CHOO (JULIA) CHUNG, National Cheng Kung Univ., Taiwan
MING DONG, Wayne State Univ., USA
EL-SAYED EL-ALFY, King Fahd Univ. of Petroleum & Minerals, Saudi Arabia
PABLO A. ESTEVEZ, Univ. of Chile, Chile
HAIBO HE, Univ. of Rhode Island, USA
TOM HESKES, Radboud Univ. Nijmegen, The Netherlands
AKIRA HIROSE, Univ. of Tokyo, Japan
ZENG-GUANG HOU, The Chinese Acad. Sci., China
SANQING HU, Hangzhou Dianzi Univ., China
AMIR HUSSAIN, Univ. of Stirling, U.K.
KAZUSHI IKEDA, Nara Inst. of Sci. & Technol., Japan
HOSSEIN JAVAHERIAN, General Motors R&D Center, USA
YAOCHU JIN, Honda Research Inst., Germany
FAKHRI KARRAY, Univ. of Waterloo, Canada
RHEE MAN KIL, Korea Advanced Inst. of Science and Technology, Korea
IRWIN KING, Chinese Univ. of Hong Kong
LI-WEI (LEO) KO, National Chiao-Tung Univ., Taiwan
JAMES KWOK, Hong Kong Univ. of Sci. & Technol.
ROBERT LEGENSTEIN, Graz Univ. of Technology, Austria
FRANK L. LEWIS, Univ. of Texas at Arlington, USA
ARISTIDIS LIKAS, Univ. of Ioannina, Greece
GUO-PING LIU, Univ. of Glamorgan, U.K.
JINHU LU, The Chinese Acad. Sci., China
YUNQIAN MA, Honeywell International Inc., USA
MALIK MAGDON-ISMAIL, Rensselaer Polytechnic Institute, USA
DANILO P. MANDIC, Imperial College London, U.K.
SEIICHI OZAWA, Kobe Univ., Japan
MIKE PAULIN, Univ. of Otago, New Zealand
ROBI POLIKAR, Rowan Univ., USA
DANIL PROKHOROV, Toyota Research Institute NA, USA
MARCELLO SANGUINETI, Univ. of Genoa, Italy
ALESSANDRO SPERDUTI, Univ. of Padova, Italy
STEFANO SQUARTINI, Univ. Politecnica delle Marche, Italy
DIPTI SRINIVASAN, National Univ. of Singapore
SERGIOS THEODORIDIS, Univ. of Athens, Greece
MARC M. VAN HULLE, Katholieke Univ. Leuven, Belgium
DRAGUNA VRABIE, Univ. of Texas at Arlington, USA
ZIDONG WANG, Brunel Univ., U.K.
MARCO WIERING, Univ. of Groningen, The Netherlands
ZHANG YI, Sichuan Univ., China
VICENTE ZARZOSO, Univ. of Nice-Sophia Antipolis, France
ZHIGANG ZENG, Huazhong Univ. of Sci. & Technol., China
G. PETER ZHANG, Georgia State Univ., USA
HUAGUANG ZHANG, Northeastern Univ., China
NIAN ZHANG, Univ. of District of Columbia, USA
LIANG ZHAO, Univ. of Sao Paulo, Brazil
NANNING ZHENG, Xi'an Jiaotong Univ., China
IEEE Officers
MOSHE KAM, President
GORDON W. DAY, President-Elect
ROGER D. POLLARD, Secretary
HAROLD FLESCHER, Treasurer
PEDRO A. RAY, Past President
TARIQ S. DURRANI, Vice President, Educational Activities
DAVID A. HODGES, Vice President, Publication Services and Products
HOWARD E. MICHEL, Vice President, Member and Geographic Activities
STEVE M. MILLS, President, Standards Association
DONNA L. HUDSON, Vice President, Technical Activities
RONALD G. JENSEN, President, IEEE-USA
VINCENZO PIURI, Director, Division X
IEEE Executive Staff
DR. E. JAMES PRENDERGAST, Executive Director & Chief Operating Officer
THOMAS SIEGERT, Business Administration
MATTHEW LOEB, Corporate Activities
DOUGLAS GORHAM, Educational Activities
BETSY DAVIS, SPHR, Human Resources
CHRIS BRANTLEY, IEEE-USA
ALEXANDER PASIK, Information Technology
PATRICK MAHONEY, Marketing
CECELIA JANKOWSKI, Member and Geographic Activities
ANTHONY DURNIAK, Publications Activities
JUDITH GORMAN, Standards Activities
MARY WARD-CALLAN, Technical Activities
IEEE Periodicals Transactions/Journals Department Staff
Director: FRAN ZAPPULLA
Editorial Director: DAWN MELLEY
Production Director: PETER M. TUOHY
Managing Editor: JEFFREY E. CICHOCKI
Journal Coordinator: MICHAEL J. HELLRIGEL

IEEE TRANSACTIONS ON NEURAL NETWORKS (ISSN 1045-9227) is published monthly by The Institute of Electrical and Electronics Engineers, Inc. Responsibility for the contents rests upon the authors and not upon the IEEE, the Society/Council, or its members. IEEE Corporate Office: 3 Park Avenue, 17th Floor, New York, NY 10016-5997. IEEE Operations Center: 445 Hoes Lane, Piscataway, NJ 08854-4141. Telephone: +1 732 981 0060. Price/Publication Information: Individual copies: IEEE Members $20.00 (first copy only), nonmembers $146.00 per copy. (Note: Postage and handling charge not included.) Member and nonmember subscription prices available upon request. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For all other copying, reprint, or republication permission, write to Copyrights and Permissions Department, IEEE Publications Administration, 445 Hoes Lane, Piscataway, NJ 08854-4141. Copyright © 2011 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals Postage Paid at New York, NY, and at additional mailing offices. Postmaster: Send address changes to IEEE TRANSACTIONS ON NEURAL NETWORKS, IEEE, 445 Hoes Lane, Piscataway, NJ 08854-4141. GST Registration No. 125634188. CPC Sales Agreement #40013087. Return undeliverable Canada addresses to: Pitney Bowes IMEX, P.O. Box 4332, Stanton Rd., Toronto, ON M5W 3J4, Canada. IEEE prohibits discrimination, harassment, and bullying. For more information visit http://www.ieee.org/nondiscrimination. Printed in U.S.A.
Digital Object Identifier 10.1109/TNN.2010.2102710
Editorial: One Year as EiC, and Editorial-Board Changes at TNN
I am about to start my second year of service as the Editor-in-Chief (EiC) of the IEEE TRANSACTIONS ON NEURAL NETWORKS (TNN). Needless to say, my first year as the EiC has been full of excitement and challenges. The transition from my predecessor went very smoothly between September 2009 and January 2010. During the past year, we have accumulated 50+ Associate Editors (AEs) handling roughly 600 new submissions (not counting resubmissions and revised submissions). With the help of these AEs and my predecessor, I was quickly able to learn to do my job, and as such, the transition had very few glitches. The easy part of my job is checking whether a submission is in compliance with our guidelines and whether it is within the scope of the TRANSACTIONS, before it is assigned to an AE for handling. The difficult part of my job has been dealing with some papers with three or more reviewers, all of whom agreed to review them but for some reason failed to respond to repeated automatic review reminders. AEs handling these papers have to take several extra steps: reminding reviewers through phone calls or e-mails, looking for replacement reviewers, or reviewing the papers themselves. Most authors have been appreciative of the work of the AEs and reviewers, and they accept our decisions without a problem.
The backlog of papers has been kept short over the last year. We have maintained an organized printing and paper-acceptance schedule, with papers typically printed in the journal within 2–3 months of acceptance. Our page budget has been kept constant in the past few years (roughly 2060 pages per year), and we expect to hold the same page count next year.
Three special issues are being organized this year: 1) White-box nonlinear prediction models (organized by Bart Baesens, David Martens, Rudy Setiono, and Jacek Zurada); 2) Data-based optimization, control, and modeling (organized by Tianyou Chai, Zhongsheng Hou, Frank L. Lewis, and Amir Hussain); and 3) Online learning in kernel methods (organized by Jose C. Principe, Seiichi Ozawa, Sergios Theodoridis, Tulay Adali, Danilo P. Mandic, and Weifeng Liu). Interested authors should refer to the individual solicitations or contact the special-issue organizers for more details.
I would like to take this opportunity to thank the hard-working AEs whose terms have ended this year. They are
Angelo Alessandri, Fahmida Chowdhury, Bhaskar DasGupta, Rene Doursat, Deniz Erdogmus, Mark Girolami, Barbara Hammer, Giacomo Indiveri, Stefanos Kollias, Chih-Jen Lin, Mark Plumbley, Jagath Rajapakse, George A. Rovithakis, Kate Smith-Miles, Changyin Sun, and Simon X. Yang. Thank you for your excellent service to TNN. I wish you much success in your future endeavors. I would also like to welcome the following new AEs whose terms officially start on January 1, 2011 (K. Ikeda and J. Lu started on June 1, 2010):
• Marco Baglietto, DIST-University of Genova, Italy
• Lubica Benuskova, University of Otago, New Zealand
• Ivo Bukovsky, Czech Technical University in Prague, Czech Republic
• Tianping Chen, Fudan University, China
• Tom Heskes, Radboud University Nijmegen, The Netherlands
• Kazushi Ikeda, Nara Institute of Science and Technology, Japan
• Fakhri Karray, University of Waterloo, Canada
• Rhee Man Kil, Korea Advanced Institute of Science and Technology, Korea
• Robert Legenstein, Graz University of Technology, Austria
• Jinhu Lu, Chinese Academy of Sciences, China
• Yunqian Ma, Honeywell International Inc., USA
• Malik Magdon-Ismail, Rensselaer Polytechnic Institute, USA
• Mike Paulin, University of Otago, New Zealand
• Robi Polikar, Rowan University, USA
• Danil Prokhorov, Toyota Research Institute NA, USA
• Marco Wiering, University of Groningen, The Netherlands
• Vicente Zarzoso, University of Nice Sophia Antipolis, France
All the above AEs are established authorities in their respective fields and have been carefully selected on the basis of their achievements, their geographical diversity, and our need for expertise in the various subject areas of TNN. I look forward to working with them to make TNN an even better journal.
Date of current version January 4, 2011. Digital Object Identifier 10.1109/TNN.2010.2099171
DERONG LIU, Editor-in-Chief
Marco Baglietto (M’04) was born in Savona, Italy, in 1970. He received the Laurea degree in electronic engineering in 1995, and the Ph.D. degree in electronic engineering and computer science in 1999, both from the University of Genoa, Genoa, Italy. He has been an Assistant Professor of Automatic Control in the Department of Communications, Computer and Systems Science, University of Genoa, since 1999. His current research interests include neural approximations, linear and nonlinear estimation, distributed-information control systems, and control of communication networks. Dr. Baglietto is currently an Associate Editor for the IEEE Control Systems Society Conference Editorial Board. He has been a member of the guest editorial team of the Special Issue of the IEEE T RANSACTIONS ON N EURAL N ETWORKS on “Adaptive Learning Systems in Communication Networks.” He was a co-recipient of the 2004 Outstanding Paper Award of the IEEE T RANSACTIONS ON N EURAL N ETWORKS.
Lubica "Luba" Benuskova received the Ph.D. degree in biophysics from Comenius University, Bratislava, Slovakia, in 1994. She became an Associate Professor in the Department of Applied Informatics of the Faculty of Mathematics, Physics, and Informatics at Comenius University in 2002. She then served as a Director of the Center for Neurocomputation and Neuroinformatics in the Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland, New Zealand, in 2007. Currently, she is a Senior Lecturer in the Department of Computer Science, University of Otago, Dunedin, New Zealand. She co-authored the book Computational Neurogenetic Modelling (New York, NY: Springer, 2007). Her current research interests include computational neuroscience, spiking neural networks, neural dynamics, neuroinformatics, bioinformatics, and consciousness/emotions. Dr. Benuskova is currently a member of the Editorial Board of the peer-reviewed journal Neural Network World. She is a member of the IEEE Computational Intelligence Society and the Otago Chapter of the Society for Neuroscience. Recently, she became a Professional Member of the Royal Society of New Zealand.
Ivo Bukovsky received the Ph.D. degree in the field of control and system engineering from the Czech Technical University, Prague, Czech Republic, in 2007. He is currently the Head of the Division of Automatic Control and Engineering Informatics in the Department of Instrumentation and Control Engineering within the Faculty of Mechanical Engineering, Czech Technical University. He was a Visiting Researcher at the University of Saskatchewan, Saskatoon, SK, Canada, in 2003. His thesis on nonconventional neural units and an adaptive approach to the evaluation of complicated dynamical systems was recognized by the Werner von Siemens Excellence Award in 2007. For six months in 2009, he worked on neural networks and biomedical applications at the Cyberscience Center, Tohoku University, Miyagi, Japan. He held a short assignment at the University of Manitoba, Winnipeg, MB, Canada, in 2010. His current research interests include multiscale analyses for the adaptive evaluation of complicated dynamical systems and neural networks. Dr. Bukovsky has been a member of the IEEE Computational Intelligence Society (CIS) Neural Networks Technical Committee since 2007, and the Chair of the CIS Neural Networks Technical Committee Task Force on Education since 2009. He became involved in the IEEE CIS Student Activities Subcommittee in 2010.
Tianping Chen received the Postgraduate degree from the Mathematics Department, Fudan University, Shanghai, China, in 1965. He is currently a Professor in the School of Mathematical Sciences, Fudan University. His current research interests include complex networks, neural networks, principal component analysis, independent component analysis, dynamical systems, harmonic analysis, and approximation theory. Prof. Chen was a recipient of several awards, including the second prize of the National Natural Science Award of China in 2002, the Outstanding Paper Award of the IEEE TRANSACTIONS ON NEURAL NETWORKS in 1997, and the Best Paper Award of the Japanese Neural Network Society in 1997.
Tom Heskes received the Ph.D. degree in physics from Radboud University, Nijmegen, The Netherlands, in 1993. He was a Post-Doctoral Fellow at the Beckman Institute, University of Illinois at Urbana-Champaign, Urbana. He is currently a Professor of artificial intelligence and computer science at Radboud University, where he leads the Machine Learning Group and is a Principal Investigator and Director of the Institute for Computing and Information Sciences. He is also a Principal Investigator at the Donders Center for Neuroscience, Radboud University. He has published over 100 research papers and books. His current research interests include (Bayesian) machine learning and probabilistic graphical models, with applications to cognitive neuroimaging and bioinformatics. Prof. Heskes received the prestigious national Vici grant for research on probabilistic artificial intelligence in 2006. He is the Editor-in-Chief of Neurocomputing and an Associate Editor of several other journals. He has served on the program committees of dozens of international conferences.
Kazushi Ikeda (M'94–SM'07) received the B.E., M.E., and Ph.D. degrees in mathematical engineering and information physics from the University of Tokyo, Tokyo, Japan, in 1989, 1991, and 1994, respectively. He joined the Department of Electrical and Computer Engineering, Kanazawa University, Kanazawa, Japan, and moved to the Department of Systems Science, Kyoto University, Kyoto, Japan, as an Associate Professor, in 1998. Since 2008, he has been a Professor in the Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan. His current research interests include machine learning theory, such as support vector machines and information geometry, applications to adaptive systems, and brain informatics. Dr. Ikeda is currently the Editor-in-Chief of the Journal of the Japanese Neural Network Society, an Action Editor of Neural Networks, and an Associate Editor of the Institute of Electronics, Information and Communication Engineers Transactions on Information and Systems. He has served as a member of the Board of Governors of the Japanese Neural Network Society and the Institute of Systems, Control and Information Engineers.
Fakhri Karray (S'89–M'90–SM'99) received the Ph.D. degree from the University of Illinois at Urbana-Champaign, Urbana, in 1989. He is a Professor in the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, and the Associate Director of the Pattern Analysis and Machine Intelligence Laboratory, University of Waterloo. He holds 13 U.S. patents in various areas of intelligent systems design using tools of computational intelligence. He is the coauthor of the textbook Tools of Soft Computing and Intelligent Systems Design (New York, NY: Addison-Wesley, 2004). He has published extensively in his areas of interest, which include soft computing and tools of computational intelligence with applications to autonomous systems and intelligent man-machine interaction. Dr. Karray has served over the years as an Associate Editor for the IEEE TRANSACTIONS ON MECHATRONICS, the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B, the IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE, the International Journal of Robotics and Automation, the International Journal of Control and Intelligent Systems, and the International Journal of Image Processing. He has been a Guest Editor for the IEEE TRANSACTIONS ON MECHATRONICS and the JOURNAL OF CONTROL AND INTELLIGENT SYSTEMS. He was the recipient of a number of professional and scholarly awards and has served as Chair/Co-Chair for more than 12 international conferences and technical programs. He is the founding General Co-Chair of the International Conference on Autonomous and Intelligent Systems, the founding Co-Chair of the IEEE Computational Intelligence Society Kitchener-Waterloo Chapter, and Chair of the IEEE Control Systems Society chapter in the same region.
Rhee Man Kil (M'94–SM'09) received the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, in 1991. He then joined the Basic Research Department of the Electronics and Telecommunications Research Institute, Daejeon, Korea. Since 1994, he has been with the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, where he is currently an Associate Professor in the Department of Mathematical Sciences. At KAIST, he has been serving as an Operating Committee member of the Brain Science Research Center funded by the Korean Ministry of Science and Technology. His current research interests include theories and applications of machine learning, pattern classification, model selection in regression problems, active learning, text mining, financial data mining, noise-robust speech feature extraction, and binaural information processing. He has served as a Guest Editor for neural information processing journals and as a program committee member for several international conferences related to neural networks.
Robert Legenstein received the Ph.D. degree in telematics from Graz University of Technology (TUG), Graz, Austria, in 2002. He is currently an Assistant Professor in the Department of Computer Science, TUG. He is also the Deputy Head of the Institute for Theoretical Computer Science, TUG. He is especially interested in biologically inspired neural computation. Currently, he is coordinating the international research project “Novel Brain-Inspired Learning Paradigms for Large-Scale Neuronal Networks” of the European Commission. His current research interests include neural networks, learning in neural systems, reward-based learning, spiking neural networks, information processing in biological neural systems, and dynamics in neural networks. Dr. Legenstein has been honored as an outstanding reviewer at the 2008 conference on Advances in Neural Information Processing Systems.
Jinhu Lu (M'03–SM'06) received the Ph.D. degree in applied mathematics from the Academy of Mathematics and Systems Science (AMSS), Chinese Academy of Sciences (CAS), Beijing, China, in 2002. He is an Associate Professor at AMSS, CAS, and also a Professor and Australian Research Council (ARC) Future Fellow with the School of Electrical and Computer Engineering, Royal Melbourne Institute of Technology University, Melbourne, Australia. He has held several visiting positions in Australia, Canada, France, Germany, and Hong Kong, and was a Visiting Fellow at Princeton University, Princeton, NJ, from 2005 to 2006. His current research interests include nonlinear circuits and systems, neural networks, complex systems, and networks. Dr. Lu is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I AND II. He is the Secretary of the Technical Committee on Neural Systems and Applications of the IEEE Circuits and Systems Society. He has received several prestigious awards, including the National Science Fund for Distinguished Young Scholars in China, the Hundred Talents Program of CAS, the National Natural Science Award from the Chinese Government, the Natural Science Award of the Ministry of Education of China, and the ARC Future Fellowship Award in Australia.
Yunqian Ma (SM'07) received the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, in 2003. He then joined Honeywell International Inc., Morristown, NJ, where he is currently a Senior Principal Research Scientist in the Advanced Technology Laboratory, Honeywell Aerospace. He holds 10 U.S. patents and has 35 patent applications. He has authored 50 publications, including two books. His research has been supported by internal funds and external contracts from agencies such as the Defense Advanced Research Projects Agency, the Homeland Security Advanced Research Projects Agency, and the Federal Aviation Administration. His current research interests include inertial navigation, integrated navigation, surveillance, signal and image processing, pattern recognition, computer vision, machine learning, and neural networks. Dr. Ma received the International Neural Network Society Young Investigator Award for outstanding contributions in the application of neural networks in 2006. He is currently on the Editorial Board of Pattern Recognition Letters and has served on the program committees of several international conferences. He also served on a panel of the National Science Foundation Division of Information and Intelligent Systems. He is included in the Marquis Who's Who in Engineering and Science.
Malik Magdon-Ismail received the B.S. degree in physics from Yale University in 1993, the Master's degree in physics in 1995, and the Ph.D. degree in electrical engineering with a minor in physics from the California Institute of Technology, Pasadena, in 1998. He is currently an Associate Professor of Computer Science at Rensselaer Polytechnic Institute (RPI), Troy, NY, where he is a member of the Theory Group. His current research interests include the theory and applications of machine learning, social network algorithms, communication networks, computational finance, and theoretical and algorithmic aspects of learning from data. Dr. Magdon-Ismail has served on the program committees of several conferences and was an Associate Editor for Neurocomputing. He has several publications in peer-reviewed journals and conferences, has been a financial consultant, has collaborated with a number of companies, and has several active grants from the National Science Foundation and other government funding agencies. He has been awarded the RPI Early Career Award in recognition of his research.
Mike Paulin received the B.Sc. (hons.) degree in mathematics from the University of Otago, Dunedin, New Zealand, in 1979, and the Ph.D. degree from the University of Auckland, Auckland, New Zealand, in 1985. He carried out post-doctoral research in experimental and computational neuroscience at the University of Southern California, Los Angeles, and at the California Institute of Technology, Pasadena. He has been a Scientific Programmer and a Lecturer in mathematics at the University of Auckland. For a number of years, he has been a Technical Consultant and Distinguished Visiting Scientist developing biologically inspired algorithms for robotics at NASA-Jet Propulsion Laboratory, Pasadena. He is currently an Associate Professor at the University of Otago. He teaches zoology, neuroscience, mathematics, and computational modeling. His current research interests include principles of neural computation and mechanical design for agility in animals and robots. Prof. Paulin is a member of the NZ Mathematical Society, the NZ Institute of Mathematics and its Applications, and the IEEE Computational Intelligence Society.
Robi Polikar (M'93–SM'09) received the co-major Ph.D. degree in electrical engineering and biomedical engineering from Iowa State University, Ames, in 2000. He is currently an Associate Professor with the Department of Electrical and Computer Engineering at Rowan University, Glassboro, NJ, where he directs the Signal Processing and Pattern Recognition Laboratory. He is also a long-term Visiting Scholar at the School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA. He is the author of over 120 publications. His current research interests include machine learning, pattern recognition, and neural networks, with specific emphasis on incremental learning, nonstationary learning, concept drift, data fusion, and applications of computational intelligence in neuroscience. His work in these areas has been supported primarily by the National Science Foundation's CAREER, Power, Control, and Adaptive Networks, and Collaborative Research in Computational Neuroscience programs, and by various industrial partners. Dr. Polikar is a member of the IEEE Computational Intelligence Society and its Neural Networks Technical Committee. He was the recipient of Rowan University's Research Excellence and Achievement Award.
Danil Prokhorov (SM'02) began his technical career in St. Petersburg, Russia, in 1992, as a Research Engineer in the St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences. He became involved in automotive research in 1995, when he was a summer intern at the Ford Scientific Research Laboratory, Dearborn, MI. In 1997, he became a Ford Research Staff Member involved in application-driven research on neural networks and other machine learning methods. While at Ford, he took an active part in several production-bound projects, including neural-network-based engine misfire detection. Since 2005, he has been with the Toyota Technical Center, Ann Arbor, MI, overseeing important mid- and long-term research projects in computational intelligence. He has published more than 100 papers in various journals and conference proceedings and has several inventions to his credit. Dr. Prokhorov is a frequent member of the program committees of various international conferences, including the International Joint Conference on Neural Networks and the World Congress on Computational Intelligence, as well as a member of several IEEE technical committees and journal editorial boards.
Marco Wiering received the Ph.D. degree from the University of Amsterdam, Amsterdam, The Netherlands, in 1999, after completing the Ph.D. degree program at the Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Lugano, Switzerland. He worked as an Assistant Professor in the Intelligent Systems Group, Utrecht University, Utrecht, The Netherlands, from 2000 to 2007. He is currently in a tenure track toward Full Professor in the Department of Artificial Intelligence, University of Groningen, Groningen, The Netherlands, where he is also the Director of the Robotlab. He has published more than 70 peer-reviewed conference and journal papers and has supervised, or is supervising, seven Ph.D. students. Furthermore, he has supervised more than 70 master graduation projects on many different topics. Together with Dr. Martijn van Otterlo, he is editing the book Reinforcement Learning: State of the Art, which will be published in 2011. His current research interests include reinforcement learning, neural networks, robotics, computer games, computer vision, and signal processing. Dr. Wiering was the Chair of the IEEE Computational Intelligence Society Technical Committee on Adaptive Dynamic Programming and Reinforcement Learning in 2010.
Vicente Zarzoso (S'94–M'03–SM'10) received the Graduate degree with highest distinction in telecommunications engineering from the Polytechnic University of Valencia, Valencia, Spain, in 1996. After starting the Ph.D. degree program at the University of Strathclyde, Glasgow, U.K., he received the Ph.D. degree from the University of Liverpool, Liverpool, U.K., in 1999. He obtained the Habilitation to Lead Researches from the University of Nice Sophia Antipolis, Nice, France, in 2009. From 2000 to 2005, he held a Research Fellowship from the Royal Academy of Engineering of the U.K. Since 2005, he has been with the Computer Science, Signals and Systems Laboratory of Sophia Antipolis, University of Nice Sophia Antipolis, where he was appointed as a Professor in 2010. His current research interests include statistical signal and array processing, with emphasis on independent component analysis, signal separation, and their application to biomedical problems and communications. He has authored nearly 100 publications on these topics. Dr. Zarzoso has served as a Program Committee Member for several international conferences and was a Program Committee Chair of the 9th International Conference on Latent Variable Analysis and Signal Separation in 2010.
Signature Neural Networks: Definition and Application to Multidimensional Sorting Problems

Roberto Latorre, Francisco de Borja Rodríguez, and Pablo Varona
Abstract—In this paper, we present a self-organizing neural network paradigm that is able to discriminate information locally using a strategy for information coding and processing inspired by recent findings in living neural systems. The proposed neural network uses: 1) neural signatures to identify each unit in the network; 2) local discrimination of input information during the processing; and 3) a multicoding mechanism for information propagation regarding the who and the what of the information. The local discrimination implies a distinct processing as a function of the neural signature recognition and a local transient memory. In the context of artificial neural networks, none of these mechanisms has been analyzed in detail, and our goal is to demonstrate that they can be used to efficiently solve some specific problems. To illustrate the proposed paradigm, we apply it to the problem of multidimensional sorting, which can take advantage of the local information discrimination. In particular, we compare the results of this new approach with traditional methods to solve jigsaw puzzles and we analyze the situations where the new paradigm improves the performance.

Index Terms—Jigsaw puzzles, local contextualization, local discrimination, multicoding, neural signatures, self-organization.
I. Introduction

Recent experiments in living neural circuits known as central pattern generators (CPG) show that some individual cells have neural signatures that consist of neuron-specific spike timings in their bursting activity [33], [34]. Model simulations indicate that neural signatures that identify each cell can play a functional role in the activity of CPG circuits [22]–[24]. Neural signatures coexist with the information encoded in the slow-wave rhythm of the CPG. Readers of the signal emitted by the CPG can take advantage of these multiple simultaneous codes and process them one by one, or simultaneously, in order to perform different tasks [23]. The who and the what of the signals can be used to discriminate the information received by a neuron by distinctly processing the input as a function of these multiple codes. These results emphasize the importance of cell diversity for some living neural networks and suggest that local discrimination is important
Manuscript received December 31, 2009; revised May 5, 2010 and July 14, 2010; accepted July 14, 2010. Date of publication November 18, 2010; date of current version January 4, 2011. This work was supported in part by the Ministry of Science and Innovation (MICINN) under Grant BFU2009-08473 and Grant TIN2007-65989, and in part by the Comunidad Autónoma de Madrid (CAM) under Grant S-SEM-0255-2006. The authors are with the Grupo de Neurocomputación Biológica, Dpto. de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid 28049, Spain (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2060495
in systems where neural signatures are present. This kind of information processing can be a powerful strategy for neural systems to enhance their capacity and performance.

Artificial neural networks (ANNs) are inspired to some extent by their biological counterparts. However, in the context of artificial neural computation, phenomena such as local recognition, discrimination of input signals, and multicoding strategies have not been analyzed in detail. Most traditional ANN paradigms consider network elements as indistinguishable units, with the same transfer functions, and without mechanisms of transient memory in each cell. None of the existing ANN paradigms discriminates information as a function of the recognition of the emitter unit. While neuron uniformity facilitates the mathematical formalism of classical paradigms [1], [3], [13], [17], [36] (which has largely contributed to their success [39]), some specific problems could benefit from other approaches.

Here, we propose a neural network paradigm that makes use of neural signatures to identify each unit of the network, and multiple simultaneous codes to discriminate the information received by a cell. The network self-organization is based on the signature recognition and on a distinct processing of input information as a function of a local transient memory in each cell that we have called the local informational context of the unit. The efficiency of the network depends on a tradeoff between the advantages provided by the local information discrimination and its computational cost. In this paper, we discuss the application of signature neural networks (SNNs) to solve multidimensional sorting problems. In particular, to fully illustrate the use of this neural network and to evaluate its performance, we apply this formalism to the task of solving canonical jigsaw puzzles.

The paper is organized as follows. In Section II, we present the general formalization of the proposed paradigm. In Section III, we: 1) discuss its application to generic multidimensional sorting, and 2) provide an implementation for this kind of problem. To test the performance, in Section IV we: 1) review the jigsaw puzzle problem and the traditional algorithms to solve it; 2) provide a specific solution using a SNN; 3) describe the methods to evaluate the performance; and 4) present our quantitative results comparing this new approach with traditional methods to solve jigsaw puzzles, analyzing the situations where the new paradigm improves the performance (Section IV-H). Finally, in the Appendix, we illustrate in detail the evolution of the network with another example of multidimensional sorting.
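As a minimal illustration of the who/what multicoding idea described above (our sketch, not part of the original paper; all names are ours), every piece of information in an SNN pairs the emitter's signature with its data, and outgoing messages are built from the local informational context plus the unit's own signed information:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass(frozen=True)
class NeuronInformation:
    """Joint 'who' (signature) and 'what' (data) of a unit."""
    signature: int   # unique ID of the emitting neuron
    data: Any        # problem-specific payload

def sign_message(own: NeuronInformation,
                 context: List[NeuronInformation],
                 n_context: int) -> List[NeuronInformation]:
    """Build an output message: part of the local informational context
    plus the emitter's own information, truncated to n_context entries."""
    return (context + [own])[-n_context:]
```

Because the signature always travels with the data, a receiving unit can filter its inputs by emitter before any problem-specific processing, which is exactly the local discrimination the paradigm relies on.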
II. SNN Formalization

In this section, we present the SNN paradigm. Behind this new paradigm, there are four main ideas.
1) Each neuron of the network has a signature that allows its unequivocal identification by the rest of the cells.
2) The neuron outputs are signed with the neural signature. Therefore, there are multiple codes in a message (multicoding) regarding the who and the what of the information.
3) The single neuron discriminates the input signals as a function of the following: a) the recognition of the emitter signature; b) a transient memory that keeps track of the information and its sources. This memory provides a contextualization mechanism to the single neuron processing.
4) The network self-organization relies to a large extent on the local discrimination by each unit.

A. SNN Definitions

The formalism requires the definition of several terms that will be used in the following sections. Some of the SNN definitions are open and depend on the specific problem to be solved. This provides a framework that can be applied to different problems by only customizing the open definitions. To illustrate the use of the SNN, we will fix these open definitions for the general multidimensional sorting problem and the particular case of the jigsaw puzzle solver in Sections III-A and IV-D, respectively.
1) Neuron or cell: the processing unit of the network.
2) Neuron signature: the neuron ID in the network. This ID is used for the local information discrimination.
3) Neuron data: information stored in each neuron about the problem.
4) Neuron information: the joint information about the who (neuron signature) and the what (neuron data) of the cell.
5) Synapse: connection between two neurons.
6) Neuron neighborhood: cells directly connected to the neuron. This concept is used to define the output channels of each neuron. The neuron neighborhood can change during the evolution of the SNN.
7) Local informational context: transient memory of each neuron to keep track of the information and its sources. This memory consists of a subset of neuron informations from other cells received in previous iterations. The maximum size of the context (Ncontext) is the maximum number of elements in this set, and it is an important parameter of the algorithm. The neuron signature and the local informational context are the key concepts of the SNN.
8) Local discrimination: the distinct processing of a unit as a function of the recognition of the emitter and the local informational context.
9) Message: the output or total information transmitted through a synapse between two neurons in a single iteration. The message consists of the neuron information of a subset of cells that are part of the context of the emitter, plus its own neuron information (see below).
The maximum message size is equal to Ncontext. The input to a neuron consists of all messages received at a given iteration.
10) A receptor starts recognizing the signature of an emitter cell during the message processing when it detects that the neuron data of the emitter is relevant to solve the problem (emitter and receptor data are compatible). The network self-organization is based on this recognition. The meaning of "relevant" depends on the specific problem.
11) Information propagation mode: depending on the problem, the information propagation can be monosynaptic or multisynaptic. Monosynaptic means that each neuron can receive only one input message per iteration. The information propagation is bidirectional between cells.
12) A neuron belongs to a cluster if it recognizes the signature of all the neurons in its neighborhood. Clusters simplify the processing rules of the SNN.

B. Algorithm

The connectivity, the neuron data, and the local informational contexts of all the network units are initialized beforehand. Depending on the problem, connectivity and neuron data initialization can be random or heuristic. Three different context initializations can be considered.
1) A free context initialization, where the context of every neuron is initially empty. In this way, the cells have no information about the rest of the network.
2) A random context initialization, where the context of the neurons is chosen randomly.
3) A neighborhood context initialization, where all the contexts are coherent with the neighborhood of each neuron.

After the initialization, the algorithm consists in the iteration of three different steps for each neuron in the network until the stop condition is fulfilled. Note that the network self-organization takes place both in steps 1 and 3 by modifying the network connections.
1) Process synaptic inputs: in this phase of the algorithm, each neuron applies the local information discrimination.
a) First, the cell discriminates the input messages as a function of the emitter signature to determine which of them will pass to a second discrimination stage. If no signatures are recognized (a situation likely in the first iterations), all messages pass to the second stage.
b) Second, the neuron uses the memory of the processing in previous iterations stored in its local informational context to select the set of neuron informations from the messages that will be finally processed.
c) Third, the cell processes this set of neuron informations by applying its corresponding transfer functions or processing rules (which are specific to the problem to solve). If the neuron data processed is relevant to solve the problem, the cell starts recognizing the corresponding signature and establishes a new connection with the cell identified by this signature.
d) Finally, as the last step of this phase, the local informational context of the receptor is updated using the neuron information set analyzed during the processing of the input messages.
Local discrimination can lead to changes in the network connectivity. Network reconfiguration as a function of the local discrimination implies a nonsupervised synaptic learning. Clusters represent partial solutions to the problem. Neurons belonging to a cluster have the same processing rules.
2) Propagate information: during this phase, neurons build and send the output messages. For this task, each neuron adds its own information to the local informational context and signs the message. If the message size reaches the Ncontext value, the neuron information from the oldest cell of the context is deleted from the message (this will be illustrated in Fig. 4 for the puzzle-solver case). The output message of a neuron is the same for all output channels.
3) Restore neighborhood: if a neuron has not reached its maximum number of neighbors, it randomly tries to connect to another neuron in the same situation (only one connection per neuron and iteration). First, it tries to connect to neurons from its local informational context and, if this is not possible, to other cells. This maximizes the information propagation in the network: establishing synapses with cells that do not belong to the local context propagates information to other regions of the network.

III. SNNs for Multidimensional Sorting

An ANN paradigm based on local discrimination relies on the criteria used to perform the discrimination, which necessarily depend on the problem to be solved. This implies that a SNN must be designed with the specific problem in mind. To illustrate the concept and the applicability of the SNN paradigm, we will apply it to the problem of multidimensional sorting. The ideas relating neural signatures to local information discrimination have a direct application in the wide scope of multidimensional sorting problems. This is an example in which a solver can take advantage of local information discrimination, specifically when the global solution depends on local sorting criteria. For example, if we consider a scheduling problem (i.e., a specific case of multidimensional sorting where different tasks must be ordered to optimize the time or cost spent in a global problem), the SNN will find the global solution by defining a local discrimination task for the neurons, which use the informational context as a transient memory to achieve the final sorting goal.

A general multidimensional sorting problem [16] consists in finding the correct order of a set of elements in several dimensions simultaneously. Different criteria must be met in each of the dimensions to reach the solution. In many cases, these criteria are not global but local, which makes the problem much harder.
Here we will emphasize how the local information discrimination of SNNs can lead to an efficient solution of multidimensional sorting problems with local order criteria.

A. Customization of the SNN

In this section, we describe how to use a SNN to solve a general multidimensional sorting problem. For this task, we have to fix some of the open definitions, conditions, and constraints for the algorithm. All definitions of Section II-A apply to build the SNN network. However, the parameters regarding the dimension of the problem, the number of neighbors, the final structure of the network and, most of all, the recognition for the local discrimination task will depend on the specific problem at hand. Common grounds for all multidimensional sorting problems to be solved with a SNN are as follows.
1) The number of neurons of the network is equal to the number of elements to sort. There is a one-to-one relationship between the neurons and the elements to sort.
2) The neuron signature can be the neuron number or some other value that allows unequivocal identification of each neuron.
3) The neuron data of each cell is a structure with information about the element to sort in each dimension (e.g., in a scheduling problem, the neuron data will be a task with a cost, effort, and priority).
4) The network is d-dimensional, where d is the number of dimensions of the problem (each dimension can use a specific sorting algorithm, global or local).
5) The information propagation mode is multisynaptic.
6) The compatibility for an element is given by the sorting criteria. If the sorting criterion is local, elements can only be compatible or not compatible. If the sorting criterion is global, a best-compatibility measure can be assigned among different elements. During the algorithm evolution, neurons can dynamically change their compatibilities and, thus, the set of signatures recognized in a given iteration, to adjust the discrimination rules.
7) The sorting criteria define two possible neighbors in each dimension: the previous and the next element to be sorted.
8) If none of the neurons has learned a new signature for a given number of iterations, the network reaches the stop condition.

B. Implementation of the SNN

Here we present a brief pseudocode for the SNN paradigm to solve a general multidimensional sorting problem. The following notation is used: expression → variable means that variable takes the value of the evaluation of expression; variable[] means that variable is a vector; and Signature(ni), Compatibility(ni, nj), and Context(ni) denote the signature, the compatibility with nj, and the local informational context
of neuron ni, respectively. P is the probability to establish a new connection with a cell of its local informational context during the processing phase. T is the threshold for the maximum number of iterations in which a neuron is allowed to have an incomplete neighborhood.

The initialization of the network consists of the following steps:
1) assign each element to sort to a neuron of the network;
2) establish random connections between cells to build the initial network architecture;
3) for each neuron of the network → ni:
   a) initialize Context(ni) using one of the initializing algorithms (see Section II-B).

The neural network main function is repeated until the end condition is fulfilled. This function consists of the following steps in each neuron:
1) Process synaptic input:
   a) synaptic input messages → inputs[].
   Regarding the who of the information (steps b) and c) constitute the first discrimination stage, while step e) corresponds to the second discrimination stage):
   b) select from inputs[] those messages sent by an emitter with a recognized signature → recognized[];
   c) if recognized[] is empty, select all messages from inputs[] → recognized[];
   d) for each emitter in recognized[] → emitter:
      i) if receptor recognizes Signature(emitter), reconfigure the network to place elements in their correct position;
   e) select randomly Ncontext neurons not included in Context(receptor) from messages in recognized[] → in.
   Regarding the what of the information:
   f) for each dimension, process the information of in:
      i) sort neurons of in with the corresponding sorting algorithm → sorted[];
      ii) choose from sorted[] those neurons that are compatible with receptor → ni;
      iii) if ni exists:
         - choose the corresponding neighbor of receptor for the corresponding dimension → neighbor;
         - if Compatibility(receptor, ni) is better than Compatibility(receptor, neighbor):
           (i) break the connection between receptor and neighbor and connect receptor and ni;
           (ii) receptor starts recognizing Signature(ni);
           (iii) receptor stops recognizing Signature(neighbor);
      iv) else search in Context(receptor) for a cell with incomplete neighborhood → nj:
         - if nj exists, connect receptor and nj with probability P.
   g) Update Context(receptor) with in.
2) Propagate information:
   a) for each neuron of the network → ni:
      i) for each neighbor of ni → nj, send messages between ni and nj.
3) Restore neighborhood:
   a) for each neuron with incomplete neighborhood → ni:
      i) search for a side of ni without a neighbor → empty;
      ii) search for a neuron in Context(ni) without a neighbor on the opposite side to empty → nj;
      iii) if nj exists, connect ni and nj through empty;
      iv) else, if ni has had an incomplete neighborhood for a number of iterations larger than T:
         - choose randomly among the non-neighbors of ni a neuron different from ni → nk;
         - break the connection of nk on the opposite side to empty and connect ni and nk through empty.

This general implementation can be used in a wide variety of problems by customizing the discrimination rules. In the Appendix, we describe in detail a multidimensional sorting example that helps the reader to further understand the use of the local informational context and the local information discrimination. To test the performance of the SNN framework, we will first consider another example in which the discrimination rules are well known.
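The following Python sketch renders the three phases above as executable code for a 1-D sorting toy problem. It is our simplified illustration under stated assumptions, not the authors' implementation: the compatibility test, the two-neighbor limit, and the reconnection policy are reduced to their simplest form, and the neighbor-replacement and probability-P steps of the pseudocode are omitted.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set, Tuple

@dataclass
class Neuron:
    signature: int                                   # the "who": unique unit ID
    data: Any                                        # the "what": element to sort
    neighbors: Set[int] = field(default_factory=set)
    context: List[Tuple[int, Any]] = field(default_factory=list)  # transient memory
    recognized: Set[int] = field(default_factory=set)             # learned signatures

class SNN:
    def __init__(self, items: List[Any], n_context: int,
                 compatible: Callable[[Any, Any], bool]):
        self.neurons = [Neuron(i, x) for i, x in enumerate(items)]
        self.n_context = n_context
        self.compatible = compatible                 # problem-specific sorting criterion

    def step(self) -> bool:
        """One iteration of the three phases; True if any new signature was learned."""
        # Propagate: every unit signs its context and sends it to its neighbors.
        inbox: Dict[int, List[List[Tuple[int, Any]]]] = \
            {n.signature: [] for n in self.neurons}
        for n in self.neurons:
            message = (n.context + [(n.signature, n.data)])[-self.n_context:]
            for j in n.neighbors:
                inbox[j].append(message)
        # Process synaptic inputs with local discrimination.
        learned = False
        for n in self.neurons:
            msgs = ([m for m in inbox[n.signature] if m[-1][0] in n.recognized]
                    or inbox[n.signature])           # nothing recognized yet: keep all
            known = {s for s, _ in n.context}
            infos = [i for m in msgs for i in m if i[0] not in known][:self.n_context]
            for sig, data in infos:
                if sig != n.signature and self.compatible(n.data, data) \
                        and sig not in n.recognized:
                    n.recognized.add(sig)            # start recognizing the emitter
                    n.neighbors.add(sig)             # nonsupervised synaptic learning
                    self.neurons[sig].neighbors.add(n.signature)
                    learned = True
            n.context = (n.context + infos)[-self.n_context:]
        # Restore neighborhood: one random connection for under-connected units.
        for n in self.neurons:
            if len(n.neighbors) < 2:                 # 1-D sorting: previous and next
                k = random.randrange(len(self.neurons))
                if k != n.signature:
                    n.neighbors.add(k)
                    self.neurons[k].neighbors.add(n.signature)
        return learned

# Toy run: sort integers, "compatible" meaning consecutive values (a local criterion).
net = SNN(items=random.sample(range(30), 30), n_context=4,
          compatible=lambda a, b: abs(a - b) == 1)
for _ in range(200):
    net.step()
print(sum(len(n.recognized) for n in net.neurons), "signatures recognized")
```

The crucial difference from a classical ANN is visible in the processing phase: each unit first filters its inputs by emitter signature and by its transient context, and only then applies the problem-specific compatibility rule.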
IV. SNN for the Jigsaw Puzzle Problem

A. Problem Definition

Jigsaw puzzles are a specific case of multidimensional sorting problems in which the order criterion is local and given by the fitting among pieces. A typical jigsaw puzzle is a 2-D picture that has to be rebuilt from different fragments or pieces. Once the pieces are mixed, the solution to the problem consists in reassembling them into the original picture. The difficulty of solving the puzzle depends mainly on the number of pieces, on their assembly complexity, and on the graphical representation of the picture. Rebuilding a jigsaw puzzle without the original image is an NP-complete problem [11]. Efficiently solving jigsaw puzzles is considered a classical fitting or pattern recognition problem, and the algorithms to solve it have potential applications in many different fields of knowledge, such as archeology, art restoration, failure analysis, steganography, and others. For example, such algorithms have been used to reassemble manuscripts from separate pieces [26], to rebuild broken objects [20], [25], [37], [40], to send secure messages over a nonsecure channel [45], to hide secret messages in seemingly innocuous carriers [12], or even to design evolutionary algorithms to solve complex problems [44]. Although some of these problems can be considered as 3-D jigsaw puzzle assembly, we focus our work on solving 2-D puzzles such as the one shown in Fig. 1. Jigsaw puzzle pieces are typically rectangular and they fit with their contiguous neighbors, which are usually four except for the border pieces.
Fig. 1. Example of a canonical jigsaw puzzle with a picture of Mount Kilimanjaro. Pieces are rectangular and the number of neighbors is four, except for the border pieces: corners have two neighbors and the rest of the border pieces have three. The solution to the puzzle consists of reassembling the pieces into the original picture once they are mixed.
are typically rectangular and fit with their contiguous neighbors, of which there are usually four except for the border pieces. The full picture is usually square-shaped. Puzzles with these constraints are called canonical jigsaw puzzles [43]. Our method and results can be easily extended to rebuild 3-D objects from fragments with a different number of neighbors. Although the solution to the problem involves several different tasks, the research literature about jigsaw puzzles and reconstruction of broken objects is mainly focused on algorithms that test the matching of pieces according to their shape [4], [6], [14], [15], [27], [30], [38], [42] and, more recently, also on image (texture and color) matching [8], [21], [31], [43]. Different techniques have been used for this purpose: shape matching [43], image merging [43], neural networks [32], genetic algorithms [35], best-first search [5], and so on. To solve the jigsaw puzzle, pairs of pieces are chosen (randomly or with a heuristic method) to test their fitting. Several tasks related to solving the puzzle can also be considered part of a sorting or classification problem, in the sense that pieces must be sorted and clustered into different groups to reduce the search space for a correct fitting. However, the performance of the associated sorting algorithm is usually disregarded. It is typically assumed that the efficiency of the solver is mainly related to the way it determines whether two pieces can fit, rather than to the way pieces are sorted and classified to test this fitting. Here we focus on the sorting and classification tasks. The jigsaw puzzle problem is interesting in the context of our study because the sorting and classification algorithms are multidimensional sorting problems that can take advantage of local information discrimination to reduce the search space for correct fittings [28], [40]. If similar pieces are grouped into sets, each piece only needs to be compared with those in the same set. We have used the proposed paradigm to build a neural network that is able to efficiently implement the fitting algorithm. With this paradigm, we improve jigsaw-puzzle-solving performance by optimizing the strategy used to choose the pairs of pieces whose fitting is tested.
B. General Solver Schema
Traditional puzzle-solver algorithms follow a common general schema to find the correct solution. The reconstruction of the puzzle (or of the object in the general case) is usually an exhaustive search over all pieces or fragments that tries to find the best fittings. Therefore, we can consider that the general algorithm is as follows:
1) choose a piece (P1 ) from the set of available pieces;
2) search for one piece (P2 ) that fits with P1 through one of its borders;
3) assemble both pieces into a new single piece;
4) add this new piece to the set of available pieces, deleting P1 and P2 ;
5) go back to the first step until only one piece is left.
Differences between existing approaches arise both from the algorithm used to test the matching of pieces P1 and P2 and from the one used to select which pieces are to be tested. Therefore, the performance of a jigsaw puzzle solver depends mainly on these two algorithms.
C. Traditional Algorithms to Choose Pieces to Compare
In classical approaches to solving jigsaw puzzles, the algorithms used to select a pair of pieces to test their fitting are based on the way humans solve jigsaw puzzles. First, humans search for border pieces. They may then group the rest of the pieces into different sets (e.g., according to the number of straight edges of each piece, their colors, or any other similitude metric) to make the search easier by focusing only on pieces with a greater probability of fitting. Finally, they try to find the correct fitting for each piece of the puzzle.
Traditional algorithms for piece selection are stochastic to different degrees. In most cases they are brute-force techniques that sort pieces until the correct solution is found [6]. Pieces are placed randomly and, if the solution is not reached, there is a new random search iteration. Alternatively, each piece is compared with all the rest until the correct fittings are found [14]. In other cases, for each piece of the puzzle the fitting is tested only for a subset of the available pieces. For example, in many approaches key pieces are identified first and then assembled independently using different heuristics. This set of approaches sorts all possible matchings according to specific measures to find the best candidates to fit as a function of the shape and/or graphical content of the piece. Here, pieces are chosen following an order (best first, highest confidence first, and so on) and not randomly [5], [10], [15], [20], [28], [42], [43]. Thus, the search space is reduced. These algorithms require the calculation of complex similitude measures to be effective; the measures that are easy to calculate do not always give a good performance. Experiments reported in the literature using this kind of algorithm use puzzles with fewer than 300 pieces. A detailed comparison between different image-feature solving methods can be found in [28].
D. SNN to Solve Jigsaw Puzzles
In this section, we describe in detail how to use the SNN to solve the jigsaw puzzle problem, and in the next section we provide a pseudocode for the implementation of this algorithm.
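Before detailing the adaptation, it is useful to keep in mind the generic loop of Section IV-B that the SNN departs from. The following Python sketch is only an illustration; fits and select_pair are hypothetical callbacks standing in for the two algorithms on which, as noted above, solver performance mainly depends.

def generic_solver(pieces, fits, select_pair):
    # Generic solver loop of Section IV-B. `fits(p1, p2)` tests whether
    # two pieces share a complementary border; `select_pair` picks the
    # candidate pair. Assembly is mimicked here by nesting tuples.
    available = list(pieces)
    while len(available) > 1:
        p1, p2 = select_pair(available)
        if fits(p1, p2):
            available.remove(p1)
            available.remove(p2)
            available.append((p1, p2))  # stand-in for real piece merging
    return available[0]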
Fig. 2. Example of iteration status in a fragment of the proposed neural network. In the jigsaw puzzle case, signatures are the neuron numbers (10, 11, 12, . . . ) and the data are the specific pieces of the puzzle.
The SNN paradigm defines a different search than the general puzzle-solver schema described in Section IV-B. Here, the processing units are neurons that try to find the best fitting locally. The SNN described in Section III-B for the general multidimensional sorting case can be easily adapted to the jigsaw puzzle problem. However, to allow a fair comparison between the performance of the SNN and a traditional stochastic algorithm (SA), we need to impose some restrictions on the general framework. Note that the SNN can also be applied without these restrictions, as we will discuss later.
1) The number of neurons in the network is equal to the number of pieces of the puzzle.
2) The neuron signature is the neuron number. There exist different matching algorithms that use different metrics to represent the characteristics of a piece or fragment [18], [19], [41]. For example, objects can be represented by "shape signatures," which are strings obtained from an approximation of the boundary curve. The signature of a neuron could be the shape signature of the piece that it contains. However, as we are not interested in evaluating the fitting algorithm, to simplify our implementation we use the neuron number as the neural signature.
3) The neuron data of each cell is one piece of the puzzle (this is illustrated in Fig. 2).
4) As we solve canonical jigsaw puzzles, the maximum number of neighbors is four, one for each side of the piece that the neuron represents (up, down, left, and right). The neighbor order is important: up-down and left-right are opposite sides. In a more general case, e.g., to rebuild broken objects, there could be more than four neighbors. The SNN has periodic boundary conditions.
5) The initial structure of the network is 2-D, with each cell connected to its four nearest neighbors.
Fig. 3. Reconfiguration of the SNN when two neurons recognize their signatures. For example, neurons 18 and 25 of Fig. 2 recognize their signatures; however, their corresponding pieces are not well located. The piece corresponding to neuron 25 has to be located to the left of the piece corresponding to neuron 18, not below it (compare Fig. 1). When the network is reconfigured, 1) the connections between 17–18 and 25–26 are broken; 2) 17 and 25 are interchanged; and 3) 18–25 are connected in their correct position. In this example, as a consequence of this network reconfiguration, neurons 18 and 25 temporarily have only three neighbors. Note that neurons 17 and 26 are now connected as a consequence of the SNN reconfiguration.
6) In our example, the information propagation mode is monosynaptic, i.e., only one input message is processed per iteration. Fig. 4 shows the way messages are built and propagated with this choice of parameters.
7) When two neurons contain pieces with a complementary border (borders that match correctly), they are compatible. For example, in Fig. 2, since neurons 18 and 25 contain complementary pieces, they are compatible and recognize each other's signatures. In the puzzle solution (see Fig. 1), the piece that corresponds to neuron 25 is located to the left of the piece that corresponds to neuron 18. When a neuron recognizes the signature of another cell, the network is reconfigured to move pieces to their correct positions (Fig. 3). Neural signatures make it possible to identify the source of the information and achieve the correct fitting by reconfiguring the network from the starting 2-D structure to a multidimensional one. At the end of the self-organization, the network recovers a 2-D structure.
8) If a neuron belongs to a cluster: a) it does not process the part of its informational context related to its neighbors; and b) it does not add its own neuron data to its output. Neurons in a cluster are only relayers of their input information.
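The restrictions above translate naturally into a small data structure. The following Python sketch is purely illustrative; the class layout and helper names are our own assumptions, not part of the original implementation.

from dataclasses import dataclass, field

SIDES = ("up", "right", "down", "left")

@dataclass
class Neuron:
    signature: int   # restriction 2: the neuron number acts as the signature
    piece: object    # restriction 3: the neuron data is one puzzle piece
    neighbors: dict = field(default_factory=lambda: dict.fromkeys(SIDES))
    context: list = field(default_factory=list)   # local informational context
    recognized: set = field(default_factory=set)  # signatures recognized so far

def initial_grid(pieces, n):
    # Restrictions 1, 4, and 5: one neuron per piece, four ordered
    # neighbors per cell, and periodic boundaries on an n x n grid.
    cells = [Neuron(i, p) for i, p in enumerate(pieces)]
    for i, cell in enumerate(cells):
        r, c = divmod(i, n)
        cell.neighbors["up"] = cells[((r - 1) % n) * n + c]
        cell.neighbors["down"] = cells[((r + 1) % n) * n + c]
        cell.neighbors["left"] = cells[r * n + (c - 1) % n]
        cell.neighbors["right"] = cells[r * n + (c + 1) % n]
    return cells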
Fig. 4. Synaptic transmission example for the SNN shown in Fig. 2. In this example, we consider Ncontext = 3. If a message follows the path 10−11−12−19, in iteration 1, the message only consists of information about neuron 10. In iterations 2 and 3, information about neurons 11 and 12 is added to the head of the message. Finally, in iteration 4, neuron 19 deletes the tail information of its input message (information about neuron 10) and adds its own information to the head of its output message.
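The message construction of Fig. 4 can be written compactly. In the sketch below (an illustration only; next_message is a hypothetical helper), a message is a list of neuron informations with the newest entry at the head.

def next_message(own_info, incoming, n_context):
    # Keep at most Ncontext - 1 of the incoming neuron informations
    # (dropping the tail) and add the sender's own information to the head.
    return [own_info] + incoming[: n_context - 1]

# With Ncontext = 3, the path 10-11-12-19 of Fig. 4 gives:
msg = []
for signature in (10, 11, 12, 19):
    msg = next_message(signature, msg, 3)
print(msg)  # [19, 12, 11]: the information about neuron 10 has been dropped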
E. Puzzle Solver Implementation
Here we present a brief pseudocode for the SNN paradigm to solve jigsaw puzzles, with the same notation as in Section III-B. In the simulations discussed in this paper, the values of P and T are 0.1 and 10, respectively.
1) Process synaptic input:
   a) synaptic input message → inputs[] (note that when information propagation is monosynaptic, inputs[] only contains one input message). Regarding the who of the information:
   b) select from inputs[] those messages sent by an emitter with a recognized signature and not in its correct position → recognized[];
   c) if recognized[] is empty, select randomly a message of inputs[] → recognized[];
   d) for each emitter in recognized[] → emitter:
      i) if receptor has recognized Signature(emitter), reconfigure the network to move pieces to their correct position (see Fig. 3);
   e) select randomly Ncontext neurons not included in Context(receptor) from messages in recognized[] → in. Regarding the what of the information:
   f) for each dimension, process information of in as follows:
      i) search in in for a neuron whose signature has not been recognized by receptor but that has a complementary piece to Piece(receptor) in the corresponding dimension → ni . Note that the part of the incoming messages about neurons whose signature is recognized is not processed;
      ii) if ni exists:
         - connect receptor and ni ;
         - receptor starts recognizing Signature(ni ). This means that the emitter will recognize the signature of the receptor in the next iteration;
      iii) else, search in Context(receptor) for a cell with incomplete neighborhood → nj :
         - if nj exists, connect receptor and nj with probability P;
   g) set Context(receptor) equal to the set of neuron informations of in. If receptor belongs to a cluster, do not include the neuron information of any neuron in its neighborhood when building Context(receptor).
2) Propagate piece information:
   a) for each neuron of the network whose corresponding piece is not in its correct position → ni :
      i) search for a neighbor of ni with a signature not recognized by ni that contains a complementary piece of Piece(ni ) → nj ;
      ii) if nj exists, send messages between ni and nj ;
      iii) else, choose randomly a neighbor of ni (note that information propagation is monosynaptic) → nk :
         - if nk exists, send messages between ni and nk .
3) Restore neighborhood:
   a) for each neuron whose corresponding piece is not in its correct position and with incomplete neighborhood → ni :
      i) search for a side of ni without a neighbor → empty;
      ii) search for a neuron in Context(ni ) without a neighbor on the opposite side to empty → nj ;
      iii) if nj exists, connect ni and nj through empty;
      iv) else, if ni has had an incomplete neighborhood for a number of iterations larger than T :
         - choose randomly among the non-neighbors of ni a neuron different from ni → nk ;
         - break the connection of nk on the opposite side to empty and connect ni and nk through empty.
Note that the only significant change with respect to the pseudocode described in Section III-B is related to the what of the information.
F. Methodology and Validation
1) How to Evaluate the Performance of the SNN: To evaluate the SNN as a jigsaw puzzle solver, we have compared its performance with that of a traditional SA based on the general solver schema described in Section IV-B. The SA consists of the following steps.
a) For each piece of the puzzle (Pi ), repeat N times:
   i) choose randomly a piece of the puzzle (Pj );
   ii) if Pi and Pj have a complementary side:
      - assemble both pieces into a new single piece;
      - add the new piece to the set of available pieces, deleting Pi and Pj .
b) Go back to the first step until only one piece is left.
The number of attempts to find a complementary piece for each piece per iteration (N) is the main parameter of the SA.
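A minimal Python sketch of this SA loop follows. It is an illustration under simplifying assumptions: complementary is a hypothetical ground-truth fitting test, and merging is mimicked by nesting tuples rather than tracking the free borders of assembled fragments.

import random

def stochastic_solver(pieces, complementary, n_attempts):
    # Each iteration gives every available piece N random chances to
    # find a complementary one; matched pairs are merged.
    available, iterations = list(pieces), 0
    while len(available) > 1:
        iterations += 1
        for piece in list(available):
            if piece not in available:  # already merged this iteration
                continue
            for _ in range(n_attempts):
                other = random.choice(available)
                if other is not piece and complementary(piece, other):
                    available.remove(piece)
                    available.remove(other)
                    available.append((piece, other))  # stand-in for assembly
                    break
    return available[0], iterations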
We consider this parameter equivalent to the context size in the SNN (Ncontext ). In Section IV-C we reviewed the traditional algorithms used to choose pieces to compare when solving puzzles. All of them are stochastic to different degrees and can be described under this general scheme, some of them simply by setting the number of attempts to find a complementary piece for each piece per iteration (N). For the rest, specific rules for clustering pieces need to be added to execute the SA on different sets of pieces (e.g., border pieces), or to use a priori knowledge about the matching of two pieces (e.g., their concavity and convexity). This set of rules can also be easily added to the SNN. When they are applied, the improvement is equivalent for both methods under the same conditions. Here we do not use them, in order to compare both approaches in the simplest case.
2) What Puzzles to Solve: To test the viability of the proposed algorithm and compare it with traditional approaches, we have solved several puzzles of different sizes with the SNN and with the SA described in the previous section. In all our tests we have used computer-generated canonical square jigsaw puzzles of size n × n. To generate these puzzles, we divided pictures into n × n square fragments and mixed them randomly.
3) How to Test the Piece Matching: To test the piece matching we use information about the overall picture. Before mixing the pieces, we save the neighborhood of all the fragments. In this way, during the algorithm evolution, we can evaluate whether two pieces fit or not.
4) How to Quantify the Performance: To assess the algorithm performance we use three measurements that allow us to compare the different methods in terms of time requirements and effectiveness: the average number of iterations needed to solve the puzzles, the average total number of fitting tests needed, and the effective number of fitting tests (see below). These three measurements allow us to quantitatively analyze our results.
A method's performance is often evaluated using the average time needed to solve the jigsaw puzzle. Let us define an iteration as a cycle of the algorithm in which all the processing units (pieces in the SA and neurons in the SNN) are updated. This measure is therefore equivalent for both methods, in the sense that in each iteration they try to find the best fittings for all the pieces of the puzzle. To have a measure of the performance that is independent of the computer power and of the quality of the implementation, here we quantify the performance of the algorithms in terms of the average number of iterations needed to solve puzzles of different sizes.
Fitting algorithms can be complex and computationally expensive. Therefore, performance improves as the total number of fitting tests is reduced, independently of the number of iterations. We consider that a fitting test takes place during the algorithm evolution every time the borders of two pieces are compared. For example, when reassembling two pieces of four borders, a maximum of 16 fitting tests is performed. Finally, the effective number of fitting tests per iteration is defined as the percentage of correct matchings between pieces in each iteration of the algorithm. This quantity is
used to assess the relationship between the two previous measurements.
To illustrate the results of the performance comparison between the proposed SNN and the SA, we calculate the difference between the values of the above-defined measures for each algorithm. Thus, let us define the following "distances":
dit = IterationsSA − IterationsSNN
dtests = TestsSA − TestsSNN
deff = EffectiveTestsSNN − EffectiveTestsSA.
Negative values of these distances mean poor performance of the SNN as compared with that of the SA. Note that the larger the number of iterations and the larger the number of fitting tests, the worse the performance; conversely, the larger the number of effective tests, the better the performance.
5) Simulation Parameter: The main parameters in our simulations are the local informational context size (for the SNN) and the number of attempts to find complementary pieces in each iteration (for the SA). For each piece, these values indicate the maximum number of pieces for the fitting test per iteration. In this sense, we consider both parameters equivalent for the purpose of comparing the performance of the two algorithms. From here on, this quantity is called the simulation parameter. In all our tests we set the simulation parameter to a percentage of the puzzle border length. For example, when we deal with puzzles of size 50 × 50 and we say that the simulation parameter is 10%, it means that the size of the local informational context of the SNN and the number of fitting attempts in the SA are equal to 5 (10% of 50). Note that the storage requirement of the SNN is O(N · Ncontext), where N is the number of neurons in the network.
G. Context Initialization
In order to test the dependency of the SNN on the initial conditions, we have used the three different context initializations proposed in Section II-B: a free context initialization, a random context initialization, and a neighborhood context initialization.
H. Results
To assess the viability of the SNN paradigm we have solved several canonical jigsaw puzzles of different sizes: from puzzles of 5 × 5 pieces to puzzles of 100 × 100 pieces, increasing the border size in steps of five pieces. In all cases we compare the results obtained with the SNN with those of solving the puzzles using the SA described in Section IV-F1. As mentioned before, we evaluate the performance as a function of the simulation parameter: the size of the local informational context for the SNN and the number of attempts to find complementary pieces per iteration for the SA. For each size, the simulation parameter goes from 10% to 100% in steps of 5%.
The first result observed in our tests is that the SNN performance does not depend on the context initialization. There are only very small differences resulting from the three
Fig. 5. Comparison between the mean number of iterations needed to solve 100 puzzles of 25 × 25 (top) and 100 × 100 pieces (bottom) with the SA and the SNN. The x-axis is the simulation parameter (see Section IV-F5). The y-axis is the mean number of iterations needed to solve 100 different puzzles. For small puzzles, the performance of the neural network improves as the value of the simulation parameter increases, but it is never better than the performance of the SA. For large puzzles, the performance of the SNN is better than the SA for small values of the simulation parameter.
methods proposed to initialize the network. These differences are significant only when the size of the local informational context and the puzzle size are large (greater than 80% and 75 × 75, respectively). Taking this result into account, we decided to use the free context initialization in all the simulations discussed here, since this is the method that uses no a priori information to solve the problem.
We start the comparison between the performance of the SNN and the SA by analyzing the results of solving jigsaw puzzles in terms of the number of iterations (Fig. 5) and the number of fitting tests (Fig. 6) required to solve 100 puzzles of small size (25 × 25) and 100 puzzles of large size (100 × 100). Figs. 5 and 6 show that there is no clear relationship between the two measures: while the SA generally performs better in terms of the number of iterations, for the number of fitting tests the situation is the opposite. This is an interesting result in the context of the jigsaw puzzle problem. It means that the way pieces are chosen to test their fitting is important for improving the performance of the solver.
Fig. 5 shows that the performance of the SNN in terms of the mean number of iterations is better only for large puzzles when the simulation parameter is small (below 30%). For example, with a simulation parameter equal to 10% our algorithm requires a mean of 894 iterations, while the SA needs a mean of 1036, i.e., a performance improvement of 14%. As one might expect, the efficiency of the puzzle solver (for both methods) in terms of the number of iterations improves
Fig. 6. Mean number of fitting tests needed to solve 100 puzzles of 25 × 25 (top) and 100 × 100 pieces (bottom) as a function of the simulation parameter. This measure is smaller for the SA only for small puzzles with a small simulation parameter. In all other cases, the number of fitting tests needed to solve the puzzles is always lower for the SNN.
with larger simulation parameters. With the SA, the number of iterations falls steadily as the simulation parameter tends to 100%: the larger the number of attempts, the larger the probability of finding the right piece in each iteration. The extreme case occurs when the number of attempts to find a complementary piece is equal to the total number of pieces; in this case, puzzles can be solved in a single iteration. The SNN can never achieve this performance level because it requires an adaptation period during the initial iterations to fill the local informational contexts of all the neurons with information relevant to each unit. For example, the bottom panel of Fig. 5 shows that with a simulation parameter equal to 100% (x-axis), the SA needs an average of 106 iterations, while the SNN needs an average of 219, i.e., the SA requires 52% fewer iterations. Thus, for large values of the simulation parameter, the SA always requires fewer iterations to solve the puzzle. However, for large puzzles, the computational cost of the local information discrimination is less significant compared with the total number of iterations needed to solve the puzzle. Therefore, the performance of the SNN improves as the context size decreases.
Fig. 6 shows the results for the evaluation of the mean number of fitting tests. In general, the performance of the SNN is better in this case. The only exception is for small puzzles with a small value of the simulation parameter. Again, this is due to the initial adaptation period of the SNN. For example, with a simulation parameter equal to 10%, the performance of the SNN is approximately 75% worse (around 6 · 10^6 more fitting tests). Outside this region, the mean number of fitting tests depends on the puzzle size but not on the simulation parameter. For small puzzles the number of fitting tests is
Fig. 7. Left panels: comparison between the number of iterations needed to solve puzzles of a given size with the SA and the SNN for different values of the simulation parameter. Right panels: comparison between the mean number of fitting tests between pieces needed to solve puzzles with both approaches. The x-axis is the simulation parameter (from 10% to 100% of the puzzle border size). The y-axis is the puzzle border size (the total size of the puzzle goes from 10 × 10 to 100 × 100 pieces). In the top panels, the z-axis is the corresponding average distance (dit, left; dtests, right) over 100 different puzzles solved with both algorithms. The dark plane shows the zero value; above this plane, the number of iterations or fitting tests needed with the SA is greater than with the SNN. The bottom panels show these distances as contour maps, where lighter colors denote regions in which the performance of the SNN is better. In terms of the number of iterations, the larger the puzzle and the smaller the simulation parameter, the better the performance of the SNN. The worst performance of the SNN appears where both the puzzle size and the simulation parameter are small; in the remaining regions its performance is slightly worse than that of the SA (compare Fig. 5). For the number of fitting tests, the performance of the SNN is always better than that of the SA except for small puzzles with a small value of the simulation parameter.
very similar for both methods. However, for large puzzles (bottom panel) the difference between the two algorithms is approximately 15 · 10^7 fitting tests, meaning a performance improvement of 24% for the SNN. These results suggest that the larger the puzzle, the better the performance of the SNN in terms of the number of fitting tests, independently of the simulation parameter.
To extend the analysis, we calculated the distances dit and dtests for solving 100 different puzzles with the SA and the SNN over a wide range of puzzle sizes and simulation parameters (Fig. 7). The results are in agreement with those shown in Figs. 5 and 6. For the mean number of iterations, the performance space can be divided into three different regions (Fig. 7, left panels).
1) For puzzles of moderate size (up to 50 × 50) and a small simulation parameter (smaller than 20%), the performance of the SNN is poor compared with that of the SA.
2) For puzzles with more than 50 × 50 pieces and a simulation parameter between 10% and 30%, the performance of the SNN is better, i.e., the value of dit is greater than 0.
3) In the rest of the cases, the performance of the SA is better, but very similar to that provided by the SNN.
Regarding the number of fitting tests (right panels of Fig. 7), the performance space can also be divided into three different regions.
a) For small puzzles and small simulation parameters, the performance of the SA is better.
b) The second region also corresponds to small puzzles, but now with the largest values of the simulation parameter. Here, the mean number of fitting tests needed to solve the puzzles is very similar for both methods.
c) For puzzles larger than 45 × 45, the SNN has the best performance independently of the value of the simulation parameter. The improvement clearly increases with the puzzle size. For example, for puzzles of size 75 × 75 the performance of the SNN improves by approximately 10% with respect to the SA; for puzzles of 100 × 100 pieces this improvement is 24%.
The results shown in Fig. 7 suggest that there is no clear relationship between the number of iterations and the number of fitting tests needed to solve a puzzle. However, both values decrease when the puzzle size increases and when the local informational context decreases. To address this point we have used a small context (10% of the border size) to solve
100 different puzzles with 100 × 100, 200 × 200, 300 × 300, 400 × 400, and 500 × 500 pieces. Fig. 8 shows the results of these tests. Both the number of iterations and the number of fitting tests increase with the puzzle size. However, the increment rate is larger for the SA. Therefore, the corresponding values of the distances dit and dtests also increase with the puzzle size. For example, for puzzles of 500 × 500 pieces, dit ≈ 2000 iterations and dtests ≈ 175 · 10^9 fitting tests. These values represent a 40% performance improvement by the SNN in both cases.
The performance comparison between the SNN and the SA using the measure deff is shown in Fig. 9. Effectiveness is measured as the percentage of correct matchings between pieces in relation to the total number of fitting tests in one iteration. Each panel in Fig. 9 corresponds to the contour plot of deff for a specific context size as a function of the iteration number and the puzzle size. Note that the SA has better performance in the initial iterations, especially for large context sizes (darker regions in the contour plots), while the SNN needs some iterations to achieve a minimum level of self-organization. Then the SNN improves and the effectiveness becomes similar for both approaches (white regions where deff is close to zero). For small puzzles the SA solves the puzzle before the SNN reaches a minimum effectiveness level, and the effectiveness improvement is not translated into better performance in terms of the number of iterations. However, for large puzzles (and especially for small values of the context size, panels a and b), the opposite situation occurs: the SNN starts with a poor performance, but then reaches a high effectiveness level and solves the puzzle before the SA. Taking into account the results shown in Fig. 8, the performance advantage of the SNN over the SA will further increase for larger puzzles.
A large local context does not imply optimal performance. When the context size is increased, the total number of fitting tests needed to solve the puzzle is very similar to the one needed with a small context (right panels of Fig. 7). In the limit, for a context size close to the total number of neurons in the network, the local processing becomes equivalent to the global processing of the SA. In this case, the problem can be solved in only a few iterations, but this does not mean that the number of fitting tests decreases. Based on all our measurements, we can conclude that the best performance of the SNN (in terms of both the number of fitting tests and the number of iterations) is achieved for large puzzles using a relatively small context. This combination provides the optimal balance between the number of iterations and the number of comparisons needed to efficiently solve the problem.
Fig. 8. Top panel: comparison between the mean number of iterations needed to solve puzzles of different sizes (100 × 100, 200 × 200, 300 × 300, 400 × 400, and 500 × 500) with a simulation parameter equal to 10%. The y-axis is the mean number of iterations needed to solve the puzzles. This is a linear function of the border size for both algorithms. Note that the slope for the SA is greater: the larger the puzzle, the better the performance of the SNN. Bottom panel: comparison between the mean number of fitting tests needed to solve the puzzles with a simulation parameter equal to 10%. In this case, the performance is also better for the SNN, but the number of fitting tests increases nonlinearly with the puzzle size.
V. Discussion
In this paper we have introduced a self-organizing neural network paradigm that is able to discriminate information locally using an information-processing strategy inspired by recent findings in living neural systems. The network uses neural signatures to identify each unit, a transient memory in each cell to keep track of the information and its sources, and a multicoding mechanism for information propagation. This provides the ability to discriminate inputs in each neuron during processing. To illustrate that the proposed paradigm can use these strategies to efficiently solve a problem, we have defined a general framework for multidimensional sorting problems and applied it to a classical task: the assembly of jigsaw puzzles. We have compared the results of our new approach with those of a classical stochastic method, and we have pointed out the situations in which the new paradigm improves performance.
We have analyzed the performance of the proposed algorithm in terms of the effort needed to solve the problem according to two different measurements (number of iterations and number of fitting tests). In both cases we found a similar result: local information discrimination has a computational cost that is evident in small puzzles, but for large puzzles this cost is justified, as our results show that local discrimination provides better performance.
Due to the nature of the jigsaw puzzle problem, we have limited our analysis to the case in which each neuron receives
Fig. 9. Evolution of deff as a function of the puzzle size for three different context sizes. Dark colors indicate the cases in which the SA has better effectiveness (deff < 0); conversely, light colors indicate better effectiveness for the SNN (deff > 0). Solid/dashed lines denote the average number of iterations needed to solve a puzzle of a specific size with the SA/SNN paradigm (calculated over 100 puzzles). In the initial iterations the effectiveness of the SA is always better. Then the SNN improves and the effectiveness becomes similar for both approaches (white regions where deff is near 0). For small puzzles the SA solves the problem before the SNN reaches a minimum effectiveness level. However, for large puzzles (and especially for small values of the context size, panels a and b), the opposite situation occurs. (a) Context size = 15. (b) Context size = 25. (c) Context size = 50.
and processes one input message per iteration. This restriction allowed us to compare the SNN performance with a classical approach under equivalent conditions. If we consider multiple messages per iteration in the SNN, neurons can process a larger amount of information in parallel. In this case, the local informational context can be built in different ways. In a multiple-message scenario, we have randomly chosen different fragments of the inputs of each cell to build the context. This strategy leads to solving the problem in fewer iterations (see top panel of Fig. 10). However, the number of fitting tests required is larger, as expected (bottom panel of Fig. 10). On the other hand, we have only used the local informational context to store neuron data from cells whose piece matching is to be tested. Alternatively, a "negative context" can be used to save temporary information about cells whose matching has already been tested with a negative result, and whose information is thus considered not useful for the neuron. This negative context reduces not only the number of fitting tests, but also the number of iterations.
In the context of the jigsaw puzzle problem, the SNN defines a different search than the general puzzle-solver schema. Local discrimination makes it possible to group pieces into clusters dynamically with no a priori information. Each cluster contains pieces that have a high probability of matching. Regarding the problem of reassembling real 3-D broken objects, this is a desirable property, because the fitting among fragments is usually more difficult than among pieces of a commercially produced jigsaw puzzle [40]. Some of the traditional algorithms try to group similar pieces to reduce the number of fitting tests [5], [15], [20], [43]. However, these approaches require significant preprocessing. On the other hand, the processing rules of the SNN for the jigsaw puzzle could include the use of classical similitude metrics such as the concavity and convexity of border pieces. Used together with local information discrimination, they could significantly reduce the number of fitting tests needed for the SNN to find the puzzle solution.
We would like to emphasize that the proposed paradigm has a wider use beyond the context of jigsaw puzzles. There is large flexibility in implementing the core concepts of the SNN; thus, these networks can be adapted to solve different problems
Fig. 10. Comparison between the performance of the SNN in monosynaptic information propagation mode (only one input channel per iteration and neuron) and in multisynaptic propagation mode (each neuron receives four input messages in parallel). Top panel: Performance in terms of the mean number of iterations needed to solve the puzzle. Bottom panel: Performance in terms of the mean number of fitting tests. In all cases the size of the local informational context is equal to 10%. All measures are calculated by solving 100 different puzzles for each border size. The large number of iterations for puzzles of size 30 × 30 pieces is due to the low informational context for this size. Note that this effect is reduced in the multisynaptic mode.
that can benefit from local information discrimination. Depending on the specific problem, one may need to balance performance in terms of the number of iterations against performance in terms of cost measurements; in many cases, the SNN can provide a good balance between performance and computational cost. A straightforward application of the SNN is multidimensional sorting when the order in a particular dimension can be independent of the order in other dimensions, or when there is no global sorting criterion in any dimension. The local discrimination of the SNN can contribute to providing an
efficient solution to these problems once the right balance between its cost and the performance is found (through the specification of the size of the local informational context). Areas of application for this kind of sorting that are likely to benefit from the SNN approach are scheduling, planning, and optimization [2], [7].
Note that the SNN uses a self-organizing strategy that includes unsupervised learning as a function of the local discrimination. In addition, SNNs allow for a new set of learning rules that can include not only the modification of the connections, but also of the parameters that affect the local discrimination. Subcellular plasticity is also a characteristic that has recently been studied in the nervous system [9]. In the introduction we mentioned that the uniformity of neurons has facilitated the mathematical formulation of many ANN paradigms. Local discrimination somewhat hinders a compact formalization of the SNN paradigm, since this formalization depends on the specific problem that the network is trying to solve. This does not mean that the concepts underlying the strategy of the SNN paradigm cannot be used to extend classical ANNs. For example, in particular applications, we can consider having a different set of transfer functions for each unit, and make the selection of the specific function depend on the state of a local informational context. This strategy can combine synaptic and intra-unit learning paradigms and lead to multifunctionality in the network. There is an increasing amount of new results on the strategies of information processing in living neural systems [29]. Beyond the specific results reported in this paper, the use of novel bio-inspired information processing strategies can contribute to a new generation of neural networks with enhanced capacity to perform a given task.
Appendix
Multidimensional Sorting Example
To illustrate the strategy of the SNNs for solving multidimensional sorting problems, and to explain in detail the evolution of the SNN implementation described in Section III-B, let us consider the multidimensional sorting problem presented in Fig. 11. For simplicity, we describe the bidimensional case first. In this case, the final goal is to sort 12 elements horizontally and vertically with a SNN in the order shown in panel a. The order criterion is given by the compatibility between the colors displayed in this panel. In this SNN example [Fig. 11(b)], the neuron signature is the number of each neuron in the network. The neuron data are the elements to sort, illustrated by the different colors. For simplicity, we choose Ncontext = 3, P = 100%, and T = 10. There is no global order criterion, so two neurons are compatible if their corresponding blocks are adjacent in Fig. 11(a). We consider a multisynaptic information propagation mode with a maximum of two active output channels per iteration. Taking this into account, we describe below the evolution of the SNN presented in Fig. 11(b) during the first two iterations. To follow this example, we recommend the reader to keep in mind the definitions of Section II-A and Fig. 12.
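For concreteness, the compatibility relation of this example can be emulated with a few lines of Python. This is a hypothetical reconstruction that assumes the 12 blocks of Fig. 11(a) form a 4 × 3 grid; the actual layout in the figure may differ, so treat it purely as an illustration.

COLS = 4  # assumed layout of the 12 blocks
TARGET = {block: divmod(block, COLS) for block in range(12)}  # block -> (row, col)

def compatible(a, b):
    # Two blocks are compatible iff they are adjacent (horizontally or
    # vertically) in the target arrangement.
    (ra, ca), (rb, cb) = TARGET[a], TARGET[b]
    return abs(ra - rb) + abs(ca - cb) == 1

print(compatible(0, 1), compatible(0, 4), compatible(0, 5))  # True True False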
Fig. 11. Basic example of multidimensional sorting. (a) The correct order of the elements to sort in a 2-D problem is represented by the colors assigned to the blocks. (b) Initialization of the SNN used to sort the elements of panel a (elements are randomly assigned). Each neuron of the network has a signature that identifies it and some information needed to solve the problem. In the example, the signature of each neuron is its order number (N01, N02, N03, and so on) and the data are the blocks to sort. (c) Architecture of the SNN when the solution is reached. (d) 3-D generalization of the problem, in this case the compatibility is represented by adjacent sides with the same color.
A. Initialization of the SNN
The initialization of the network consists of the following.
1) Neuron data are initialized by randomly assigning an element to each neuron.
2) The initial structure of the network is 2-D, with each cell connected to its four nearest neighbors. The SNN has periodic boundary conditions: each neuron on a border is connected to the neuron on the opposite side.
3) For each neuron of the network → ni :
   a) initialize Context(ni ) with the top, right, and bottom neighbors of ni (not shown).
B. Discrimination and Processing Rules
After the initialization, neurons build and send initial messages and the SNN starts to search for the solution (see Fig. 12). For this task, neurons follow the algorithm described in Section III-B. This example uses the following local discrimination rules.
1) During the process synaptic input phase, in the second discrimination stage [step e) of the algorithm], the neurons to be processed are selected from the messages of in. To simplify our description, here we assume that active channels are sampled one neuron information at a time in clockwise order (top channel first, right second, bottom third, and left fourth). For example, if we consider neuron N01 in the left panel of Fig. 12, the neuron receives message N10 − N05 − N09 from N09 (top channel) and message N08 − N12 − N04 from N04 (left channel). The right and bottom channels are not active in this case. As the signatures of N09 and N04 are not recognized, both messages pass to the second discrimination stage. To select the neuron
Fig. 12. Evolution of the SNN for the sorting problem illustrated in Fig. 11. Left panel: neighborhood, local informational context, and output messages of each neuron of the SNN shown in the right panel of Fig. 11 at the end of iteration 1. The details of the processing and discrimination rules are described in the text. In this example, Ncontext = 3 and the context initialization is a neighborhood context initialization (initially, all the local informational contexts contain the corresponding top, right, and bottom neuron informations; for example, for N01 the initial context corresponds to neurons N09, N02, and N05). After the initialization, the first messages are built and sent (not shown here). In our case, N01 sends messages to N09 and N04; N02 to N10 and N03; N03 to N11 and N02; N04 to N01 and N08; N05 to N06 and N08; N06 to N10 and N05; N07 to N11; N08 to N04 and N05; N09 to N01 and N12; N10 to N06 and N02; N11 to N07 and N03; and N12 to N09. During the process synaptic input phase of iteration 1, all incoming messages pass to the second discrimination stage. Now, each neuron selects the set of neuron informations to process following the rules described in the text. The selected set is shown below each neuron, and this will be the local informational context for the next iteration. When a neuron starts recognizing a signature, the corresponding cell is shown green and filled. If the cells are also in their correct position, the connections between them are green and solid instead of red and dotted. Grey filled neurons have established a new connection during the restore neighborhood phase. Arrows denote the output channels activated during the propagate information phase. Right panel: evolution of the SNN in iteration 2. The differences with respect to the network in the left panel are the yellow filled neurons. These are cells moved to their correct position because a receptor received a message from an emitter whose signature is recognized (and the emitter is not in its correct position).
informations to process from the selected messages, N01 initially chooses N09 (first neuron information from the top channel), N04 (first neuron information from the left channel), and N05 (second neuron information from the top channel). N09 and N05 are discarded because they belong to the local informational context of N01 [remember that at the beginning of this iteration the context of N01 is the one built during the context initialization: N09, N02, and N05; see Fig. 11(a)]. Then, as Ncontext = 3 in this example, N01 chooses N12 (second neuron information from the left channel) and N10 (third neuron information from the top channel), to finally process N04, N12, and N10.
2) If a receptor starts recognizing a signature in a given iteration, it sends a message to that emitter. As we have a maximum number of active channels, these messages have a higher priority during the information propagation phase [step a) of the algorithm]. In the example, N01 starts recognizing the signature of N04 in iteration 1 (left panel of Fig. 12) and therefore sends a message to N04. The rest of the output channels are activated randomly, taking into account that the communication is bidirectional and respecting the maximum number of active channels.
3) During the information propagation phase, the emitter does not include information about the receptor in the output message. For example, let us consider neuron N01, whose context is N04, N12, and N10 at the end of
iteration 1 (shown below the neuron in Fig. 12). N01 sends to neuron N04 the message N10 − N12 − N01 instead of N12 − N04 − N01.
C. Evolution of the SNN
1) Iteration 1: The left panel of Fig. 12 illustrates the evolution of the SNN presented in Fig. 11(b) during the first iteration of the algorithm. The first step during the process synaptic input phase consists of selecting the messages to process. In iteration 1 there are no messages received from an emitter with a recognized signature, because no neuron recognizes any signature yet. Therefore, all messages pass to the second stage of the discrimination. In this stage each neuron selects the set of neuron informations to be processed. In the figure, this set is shown below each neuron, as it becomes the local informational context for the next iteration. The neurons use this set to find their adjacent blocks (a compatible neuron in a given dimension). In this way, neuron N01 starts recognizing signature N04, and neuron N08 signature N01. In the first case, N04 belongs to the neighborhood of N01 and a reconfiguration of the network is not needed. On the other hand, N01 does not belong to the neighborhood of N08. This implies a network reconfiguration (self-organization of the SNN) before N08 starts to recognize the signature. As a consequence of this reconfiguration, N08 and N01 are now neighbors, the connections between N08 and N07, and between N01 and N02, are broken, and N07 stops having a right neighbor
and N02 a left neighbor. This reconfiguration mechanism is repeated during the algorithm evolution for the rest of the neurons. Finally, during the restore neighborhood phase, neurons with incomplete neighborhoods try to connect with neurons of their local informational context in the same status, to maximize the information flow in the network. For example, at the end of iteration 1, as N04 is included in the local informational context of N02, N02 does not have a bottom neighbor, and N04 does not have a top neighbor, a new connection is established between these cells. Note that although N02 and N09 could be connected to complete their neighborhoods (N02 is included in the local context of N09), they are not connected because they are already neighbors. Note also that the neighborhood restoration takes place after the local informational context is updated.
2) Iteration 2: The second iteration starts from the situation shown in the left panel of Fig. 12. The mechanism used to process the incoming messages in the receptor is analogous to the mechanism described for iteration 1, with only one difference. Now, N01 receives a message from N04, an emitter with a recognized signature that is not in its correct position. Then, N01 only processes this message. Before processing the message, as N04 is not in its correct position, a reconfiguration of the network takes place to move N04 on top of N01 (right panel of Fig. 12). At this point N04 does not recognize signature N01 yet. Later, when N04 processes its input message, it will start recognizing N01. During the process synaptic input phase, N04 processes, in this order, the neuron information of cells N01, N03, and N12. The processing of the neuron information of N01 implies that N04 starts recognizing signature N01. As N01 belongs to the neighborhood of N04, a reconfiguration is not needed. The same occurs when the neuron information of N03 is processed. Finally, the processing of the neuron information of N12 also implies that N04 starts recognizing signature N12. But now, N12 does not belong to the neighborhood of N04. When a new synapse is established to set N12 in its correct position, the connection between N03 and N04 must be broken. As a consequence, N04 stops recognizing signature N03 and N03 stops recognizing N04. In this simple example, in just two iterations most of the SNN neurons have reached the local solution [compare to Fig. 11(c)], and only two more iterations are needed to reach the global solution to the problem (not shown here). Note that the corresponding problem in three dimensions [illustrated in Fig. 11(d)] or more only requires repeating step f) of the algorithm described in Section III-B.
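The reconfiguration steps walked through above (breaking an obsolete connection, establishing the new one, and updating the recognized signatures) can be sketched as follows. This is a simplified illustration with our own naming; the side chosen in the demo is arbitrary, and the displaced cell's remaining connections are ignored for brevity.

OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}

def reconfigure(receptor, target, side, neighbors, recognized):
    # Place `target` next to `receptor` on `side`, breaking the obsolete
    # connection first, as when the N03-N04 connection is broken to put
    # N12 in its correct position. `neighbors[n]` is the {side: cell}
    # map of cell n; `recognized[n]` is the set of signatures n recognizes.
    old = neighbors[receptor][side]
    if old is not None:
        neighbors[receptor][side] = None
        neighbors[old][OPPOSITE[side]] = None
        recognized[receptor].discard(old)   # receptor stops recognizing old
        recognized[old].discard(receptor)   # and vice versa
    neighbors[receptor][side] = target
    neighbors[target][OPPOSITE[side]] = receptor
    recognized[receptor].add(target)        # receptor starts recognizing target

neighbors = {n: dict.fromkeys(OPPOSITE) for n in ("N03", "N04", "N12")}
recognized = {n: set() for n in neighbors}
neighbors["N04"]["up"] = "N03"; neighbors["N03"]["down"] = "N04"
reconfigure("N04", "N12", "up", neighbors, recognized)
print(neighbors["N04"]["up"])  # N12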
References
[1] M. Anthony, “On the generalization error of fixed combinations of classifiers,” J. Comput. Syst. Sci., vol. 73, no. 5, pp. 725–734, 2007.
[2] W. G. Aref and I. Kamel, “On multi-dimensional sorting orders,” in Proc. 11th Int. Conf. Database Expert Syst. Applicat., vol. 1873. 2000, pp. 774–783.
[3] P. Auer, H. Burgsteiner, and W. Maass, “A learning rule for very simple universal approximators consisting of a single layer of perceptrons,” Neural Netw., vol. 21, no. 5, pp. 786–795, 2008.
[4] N. Ayache and O. D. Faugeras, “HYPER: A new approach for the recognition and positioning of two-dimensional objects,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, no. 1, pp. 44–54, Jan. 1986.
[5] H. Bunke and G. Kaufmann, “Jigsaw puzzle solving using approximate string matching and best-first search,” in Proc. 5th Int. Conf. CAIP, 1993, pp. 299–308.
[6] B. Burdea and H. Wolfson, “Solving jigsaw puzzles by a robot,” IEEE Trans. Robot. Autom., vol. 5, no. 6, pp. 752–764, Dec. 1989.
[7] O. Catoni, “Solving scheduling problems by simulated annealing,” SIAM J. Contr. Optim., vol. 36, no. 5, pp. 1539–1575, 1998.
[8] M. Chung, M. Fleck, and D. Forsyth, “Jigsaw puzzle solver using shape and color,” in Proc. 4th ICSP, 1998, pp. 877–880.
[9] G. W. Davis, “Homeostatic control of neural activity: From phenomenology to molecular design,” Annu. Rev. Neurosci., vol. 29, no. 1, pp. 307–323, 2006.
[10] J. De Bock, R. De Smet, W. Philips, and J. D’Haeyer, “Constructing the topological solution of jigsaw puzzles,” in Proc. ICSP, vol. 3. 2004, pp. 2127–2130.
[11] E. D. Demaine and M. L. Demaine, “Jigsaw puzzles, edge matching, and polyomino packing: Connections and complexity,” Graph. Comb., vol. 23, no. 1, pp. 195–208, 2007.
[12] E.-J. Farn and C.-C. Chen, “Novel steganographic method based on jig swap puzzle images,” J. Electron. Imag., vol. 18, no. 1, p. 013003, 2009.
[13] J. Fort, “SOM’s mathematics,” Neural Netw., vol. 19, nos. 6–7, pp. 812–816, 2006.
[14] H. Freeman and L. Gardner, “Apictorial jigsaw puzzles: A computer solution to a problem in pattern recognition,” IEEE Trans. Electron. Comput., vol. EC-13, no. 2, pp. 118–127, 1964.
[15] D. Goldberg, C. Malon, and M. Bern, “A global approach to automatic solution of jigsaw puzzles,” Computat. Geometry, vol. 28, nos. 2–3, pp. 165–174, 2004.
[16] J. E. Goodman and R. Pollack, “Multidimensional sorting,” SIAM J. Comput., vol. 12, no. 3, pp. 484–507, 1983.
[17] R. Ilin, R. Kozma, and P. Werbos, “Beyond feedforward models trained by backpropagation: A practical training tool for a more efficient universal approximator,” IEEE Trans. Neural Netw., vol. 19, no. 6, pp. 929–937, Jun. 2008.
[18] E. Kishon, T. Hastie, and H. Wolfson, “3-D curve matching using splines,” in Proc. 1st Eur. Conf. Comput. Vision, 1990, pp. 589–591.
[19] E. Kishon and H. Wolfson, “3-D curve matching,” in Proc. AAAI Workshop Spatial Reasoning Multi-Sensor Fusion, 1987, pp. 250–261.
[20] W. Kong and B. Kimia, “On solving 2-D and 3-D puzzles using curve matching,” in Proc. IEEE Comput. Vision Patt. Recog., vol. 2. 2001, pp. II-583–II-590.
[21] D. A. Kosiba, P. M. Devaux, S. Balasubramanian, T. L. Gandhi, and R. Kasturi, “An automatic jigsaw puzzle solver,” in Proc. 12th IAPR Int. Conf. Patt. Recog., 1994, pp. 616–618.
[22] R. Latorre, F. B. Rodríguez, and P. Varona, “Effect of individual spiking activity on rhythm generation of central pattern generators,” Neurocomputing, vols. 58–60, pp. 535–540, Jun. 2004.
[23] R. Latorre, F. B. Rodríguez, and P. Varona, “Neural signatures: Multiple coding in spiking-bursting cells,” Biol. Cybern., vol. 95, no. 2, pp. 169–183, 2006.
[24] R. Latorre, F. B. Rodríguez, and P. Varona, “Reaction to neural signatures through excitatory synapses in central pattern generator models,” Neurocomputing, vol. 70, pp. 1797–1801, Jun. 2007.
[25] H. C. G. Leitao and J. Stolfi, “A multiscale method for the reassembly of two-dimensional fragmented objects,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1239–1251, Sep. 2002.
[26] M. Levison, “The siting of fragments,” Comput. J., vol. 7, no. 4, pp. 275–277, 1965.
[27] K. Nagura, K. Sato, H. Maekawa, T. Morita, and K.
Fujii, “Partial contour processing using curvature function-assembly of jigsaw puzzles and recognition of moving figures,” Syst. Comput., vol. 2, pp. 30–39, 1986.
[28] T. R. Nielsen, P. Drewsen, and K. Hansen, “Solving jigsaw puzzles using image features,” Patt. Recogn. Lett., vol. 29, no. 14, pp. 1924–1933, 2008.
[29] M. I. Rabinovich, P. Varona, A. I. Selverston, and H. D. I. Abarbanel, “Dynamical principles in neuroscience,” Rev. Modern Phys., vol. 78, no. 4, pp. 1213–1265, 2006.
[30] G. Radack and N. Badler, “Jigsaw puzzle matching using a boundary-centered polar encoding,” Comput. Graphics Image Process., vol. 19, no. 1, pp. 1–17, May 1982.
[31] J. T. Schwartz and M. Sharir, “Identification of partially obscured objects in two and three dimensions by matching noisy characteristic curves,” Int. J. Robotics Res., vol. 6, no. 2, pp. 29–44, Jun. 1987.
[32] P. N. Suganthan, “Solving jigsaw puzzles using Hopfield network,” in Proc. Int. Conf. Neural Netw., Jul. 1999, pp. 10–16.
LATORRE et al.: SIGNATURE NEURAL NETWORKS: DEFINITION AND APPLICATION TO MULTIDIMENSIONAL SORTING PROBLEMS
[33] A. Sz¨ucs, H. D. I. Abarbanel, M. I. Rabinovich, and A. I. Selverston, “Dopamine modulation of spike dynamics in bursting neurons,” Eur. J. Neurosci., vol. 21, no. 3, pp. 763–772, Feb. 2005. [34] A. Sz¨ucs, R. D. Pinto, M. I. Rabinovich, H. D. I. Abarbanel, and A. I. Selverston, “Synaptic modulation of the interspike interval signatures of bursting pyloric neurons,” J. Neurophysiol., vol. 89, no. 3, pp. 1363– 1377, Mar. 2003. [35] F. Toyama, Y. Fujiki, K. Shoji, and J. Miyamichi, “Assembly of puzzles using a genetic algorithm,” in Proc. 16th Int. Conf. Patt. Recog., vol. 4. 2002, pp. 389–392. [36] S. Trenn, “Multilayer perceptrons: Approximation order and necessary number of hidden units,” IEEE Trans. Neural Netw., vol. 19, no. 5, pp. 836–844, May 2008. ¨ [37] G. Ucoluk and I. Toroslu, “Automatic reconstruction of broken 3D surface objects,” Comput. Graphics, vol. 23, no. 4, pp. 573–582, 1999. [38] R. W. Webster, P. S. LaFollette, and R. L. Stafford, “Isthmus critical points for solving jigsaw puzzles in computer vision,” IEEE Trans. Syst., Man, Cybern., vol. 21, no. 5, pp. 1271–1278, 1991. [39] D. A. White and A. Sofge, Eds., Handbook of Intelligent Control Neural, Fuzzy, and Adaptive Approaches. New York: Reinhold, 1992. [40] A. Willis and D. Cooper, “Computational reconstruction of ancient artifacts,” IEEE Signal Process. Mag., vol. 25, no. 4, pp. 65–83, Jul. 2008. [41] H. Wolfson, “On curve matching,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 12, no. 5, pp. 483–489, May 1990. [42] H. Wolfson, E. Schonberg, A. Kalvin, and Y. Landam, “Solving jigsaw puzzles by computer,” Ann. Oper. Res., vol. 12, nos. 1–4, pp. 51–64, Feb. 1988. [43] F.-H. Yao and G.-F. Shao, “A shape and image merging technique to solve jigsaw puzzles,” Patt. Recogn. Lett., vol. 24, no. 12, pp. 1819– 1835, 2003. [44] A. Zaritsky and M. Sipper, “The preservation of favored building blocks in the struggle for fitness: The puzzle algorithm,” IEEE Trans. Evol. Comput., vol. 8, no. 5, pp. 443–455, Oct. 2004. [45] Y.-X. Zhao, M.-C. Su, Z.-L. Chou, and J. Lee, “A puzzle solver and its application in speech descrambling,” in Proc. Annu. Conf. Int. Conf. Comput. Eng. Applicat., 2007, pp. 171–176.
23
Roberto Latorre received the B.S. degree in computer engineering and the Ph.D. degree in computer science and telecommunications from Universidad Autónoma de Madrid, Madrid, Spain, in 2000 and 2008, respectively. Since 2002, he has been a Profesor Asociado with the Escuela Politécnica Superior, Universidad Autónoma de Madrid. He has been a member of the Grupo de Neurocomputación Biológica, Escuela Politécnica Superior, since 2001. His research interests include different topics in neuroscience and neurocomputing, from the generation of motor patterns and information coding to pattern recognition and ANNs.
Francisco de Borja Rodríguez received the B.S. degree in applied physics and the Ph.D. degree in computer science from Universidad Autónoma de Madrid, Madrid, Spain, in 1992 and 1999, respectively. He was then with Nijmegen University, Nijmegen, The Netherlands, and the Institute for Nonlinear Science, University of California, San Diego. Since 2002, he has been a Profesor Titular with the Escuela Politécnica Superior, Universidad Autónoma de Madrid.
Pablo Varona received the B.S. degree in theoretical physics and the Ph.D. degree in computer science from Universidad Autónoma de Madrid, Madrid, Spain, in 1992 and 1997, respectively. He was a Post-Doctoral Fellow and later an Assistant Research Scientist with the Institute for Nonlinear Science, University of California, San Diego. Since 2002, he has been a Profesor Titular with the Escuela Politécnica Superior, Universidad Autónoma de Madrid.
Adaptive Dynamic Programming for Finite-Horizon Optimal Control of Discrete-Time Nonlinear Systems with ε-Error Bound

Fei-Yue Wang, Fellow, IEEE, Ning Jin, Student Member, IEEE, Derong Liu, Fellow, IEEE, and Qinglai Wei
Abstract—In this paper, we study the finite-horizon optimal control problem for discrete-time nonlinear systems using the adaptive dynamic programming (ADP) approach. The idea is to use an iterative ADP algorithm to obtain the optimal control law that makes the performance index function close to the greatest lower bound of all performance indices within an ε-error bound. The optimal number of control steps can also be obtained by the proposed ADP algorithms. A convergence analysis of the proposed ADP algorithms in terms of performance index function and control policy is provided. In order to facilitate the implementation of the iterative ADP algorithms, neural networks are used for approximating the performance index function, computing the optimal control policy, and modeling the nonlinear system. Finally, two simulation examples are employed to illustrate the applicability of the proposed method.

Index Terms—Adaptive critic designs, adaptive dynamic programming, approximate dynamic programming, learning control, neural control, neural dynamic programming, optimal control, reinforcement learning.
I. INTRODUCTION

The optimal control problem of nonlinear systems has been a key focus of the control field for the past several decades [1]–[15]. Traditional optimal control approaches are mostly formulated over an infinite time horizon [2], [5], [9], [11], [13], [16], [17]. However, most real-world systems need to be effectively controlled within a finite time horizon (finite-horizon for brief), e.g., stabilized or tracked to a desired trajectory within a finite duration of time. The design of finite-horizon optimal controllers faces a major obstacle in
Manuscript received April 16, 2010; revised August 20, 2010; accepted August 24, 2010. Date of publication September 27, 2010; date of current version January 4, 2011. This work was supported in part by the Natural Science Foundation (NSF) of China under Grant 60573078, Grant 60621001, Grant 60728307, Grant 60904037, Grant 60921061, and Grant 70890084, by the MOST 973 Project 2006CB705500 and Project 2006CB705506, by the Beijing Natural Science Foundation under Grant 4102061, and by the NSF under Grant ECS-0529292 and Grant ECCS-0621694. The acting Editor-in-Chief who handled the review of this paper was Frank L. Lewis.

F. Y. Wang and Q. Wei are with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]).

N. Jin is with the Department of Electrical and Computer Engineering, University of Illinois, Chicago, IL 60607 USA (e-mail: [email protected]).

D. Liu is with the Department of Electrical and Computer Engineering, University of Illinois, Chicago, IL 60607 USA. He is also with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNN.2010.2076370
comparison to the infinite-horizon one. An infinite-horizon optimal controller generally obtains an asymptotic result for the controlled system [9], [11]; that is, the system will not be stabilized or tracked until the time reaches infinity, while for finite-horizon optimal control problems the system must be stabilized or tracked to a desired trajectory within a finite duration of time [1], [8], [12], [14], [15]. Furthermore, in the case of discrete-time systems, the determination of the number of optimal control steps is necessary for finite-horizon optimal control problems, while for infinite-horizon optimal control problems the number of optimal control steps is in general infinite. The finite-horizon control problem has been addressed by many researchers [18]–[23], but most of the existing methods consider only the stability problems of systems under finite-horizon controllers [18], [20], [22], [23]. Due to the lack of methodology, and because the number of control steps is difficult to determine, the optimal controller design for finite-horizon problems still presents a major challenge to control engineers. This motivates our present research.

As is known, dynamic programming is very useful in solving optimal control problems. However, due to the "curse of dimensionality" [24], it is often computationally untenable to run dynamic programming to obtain the optimal solution. The adaptive/approximate dynamic programming (ADP) algorithms were proposed in [25] and [26] as a way to solve optimal control problems forward in time. There are several synonyms used for ADP, including "adaptive critic designs" [27]–[29], "adaptive dynamic programming" [30]–[32], "approximate dynamic programming" [26], [33]–[35], "neural dynamic programming" [36], "neuro-dynamic programming" [37], and "reinforcement learning" [38]. In recent years, ADP and related research have gained much attention from researchers [27], [28], [31], [33]–[36], [39]–[57]. In [26] and [29], ADP approaches were classified into several main schemes: heuristic dynamic programming (HDP), action-dependent HDP (ADHDP, also known as Q-learning [58]), dual heuristic dynamic programming (DHP), action-dependent DHP (ADDHP), globalized DHP (GDHP), and action-dependent GDHP (ADGDHP). Saridis and Wang [10], [52], [59] studied the optimal control problem for a class of nonlinear stochastic systems and presented the corresponding Hamilton-Jacobi-Bellman (HJB) equation for stochastic control problems. Al-Tamimi et al. [27] proposed a greedy HDP iteration algorithm to solve the discrete-time HJB (DTHJB) equation of the optimal control problem for discrete-time nonlinear systems. Though great progress has been made for ADP in the optimal control
field, most ADP methods address infinite-horizon problems, such as [16], [27], [33], [36], [37], [43]–[45], [53], [56], and [57]. Only [60] and [61] discussed how to solve finite-horizon optimal control problems based on ADP and backpropagation-through-time algorithms.

In this paper, we will develop a new ADP scheme for finite-horizon optimal control problems. We will study the optimal control problems with an ε-error bound using ADP algorithms. First, the HJB equation for finite-horizon optimal control of discrete-time systems is derived. In order to solve this HJB equation, a new iterative ADP algorithm is developed with convergence and optimality proofs. Second, the difficulties of obtaining the optimal solution using the iterative ADP algorithm are presented, and then the ε-optimal control algorithm is derived based on the iterative ADP algorithms. Next, it will be shown that the ε-optimal control algorithm can obtain suboptimal control solutions within a fixed finite number of control steps that make the performance index function converge to its optimal value with an ε-error. Furthermore, in order to facilitate the implementation of the iterative ADP algorithms, we use neural networks to obtain the iterative performance index function and the optimal control policy. Finally, an ε-optimal state feedback controller is obtained for finite-horizon optimal control problems.

This paper is organized as follows. In Section II, the problem statement is presented. In Section III, the iterative ADP algorithm for the finite-horizon optimal control problem is derived, and its convergence and optimality properties are proved. In Section IV, the ε-optimal control algorithm is developed and its properties are proved. In Section V, two examples are given to demonstrate the effectiveness of the proposed control scheme. Finally, in Section VI, the conclusion is drawn.

II. PROBLEM STATEMENT

In this paper, we will study deterministic discrete-time systems
$$x_{k+1} = F(x_k, u_k), \qquad k = 0, 1, 2, \ldots \qquad (1)$$
where $x_k \in \mathbb{R}^n$ is the state and $u_k \in \mathbb{R}^m$ is the control vector. Let $x_0$ be the initial state. The system function $F(x_k, u_k)$ is continuous for $\forall\, x_k, u_k$ and $F(0, 0) = 0$. Hence, $x = 0$ is an equilibrium state of system (1) under the control $u = 0$. The performance index function for state $x_0$ under the control sequence $u_0^{N-1} = (u_0, u_1, \ldots, u_{N-1})$ is defined as
$$J\left(x_0, u_0^{N-1}\right) = \sum_{i=0}^{N-1} U(x_i, u_i) \qquad (2)$$
where $U$ is the utility function, $U(0, 0) = 0$, and $U(x_i, u_i) \ge 0$ for $\forall\, x_i, u_i$.

The sequence $u_0^{N-1}$ defined above is a finite sequence of controls. Using this sequence of controls, system (1) gives a trajectory starting from $x_0$: $x_1 = F(x_0, u_0)$, $x_2 = F(x_1, u_1)$, $\ldots$, $x_N = F(x_{N-1}, u_{N-1})$. We call the number of elements in the control sequence $u_0^{N-1}$ the length of $u_0^{N-1}$ and denote it as $\left|u_0^{N-1}\right|$. Then, $\left|u_0^{N-1}\right| = N$. The length of the associated trajectory $x_0^N = (x_0, x_1, \ldots, x_N)$
is $N + 1$. We denote the final state of the trajectory as $x^{(f)}\left(x_0, u_0^{N-1}\right)$, i.e., $x^{(f)}\left(x_0, u_0^{N-1}\right) = x_N$. Then, for $\forall\, k \ge 0$, the finite control sequence starting at $k$ can be written as $u_k^{k+i-1} = (u_k, u_{k+1}, \ldots, u_{k+i-1})$, where $i \ge 1$ is the length of the control sequence. The final state can then be written as $x^{(f)}\left(x_k, u_k^{k+i-1}\right) = x_{k+i}$.

We note that the performance index function defined in (2) has no term associated with the final state, since in this paper we specify the final state $x_N = F(x_{N-1}, u_{N-1})$ to be at the origin, i.e., $x_N = x^{(f)} = 0$. For the present finite-horizon optimal control problems, the feedback controller $u_k = u(x_k)$ must not only drive the system state to zero within a finite number of time steps but also guarantee the performance index function (2) to be finite, i.e., $u_k^{N-1} = (u(x_k), u(x_{k+1}), \ldots, u(x_{N-1}))$ must be a finite-horizon admissible control sequence, where $N > k$ is a finite integer.

Definition 2.1: A control sequence $u_k^{N-1}$ is said to be finite-horizon admissible for a state $x_k$ if $x^{(f)}\left(x_k, u_k^{N-1}\right) = 0 \in \mathbb{R}^n$ and $J\left(x_k, u_k^{N-1}\right)$ is finite, where $N > k$ is a finite integer. A state $x_k$ is said to be finite-horizon controllable (controllable for brief) if there is a finite-horizon admissible control sequence associated with this state.

Let $u_k$ denote an arbitrary finite-horizon admissible control sequence starting at $k$, and let
$$\mathcal{A}_{x_k} = \left\{ u_k : x^{(f)}(x_k, u_k) = 0 \right\}$$
be the set of all finite-horizon admissible control sequences of $x_k$. Let
$$\mathcal{A}^{(i)}_{x_k} = \left\{ u_k^{k+i-1} : x^{(f)}\left(x_k, u_k^{k+i-1}\right) = 0,\ \left|u_k^{k+i-1}\right| = i \right\}$$
be the set of all finite-horizon admissible control sequences of $x_k$ with length $i$. Then, $\mathcal{A}_{x_k} = \cup_{1 \le i < \infty} \mathcal{A}^{(i)}_{x_k}$.

Define the limit of the iterative performance index functions as
$$V_\infty(x_k) = \lim_{i \to \infty} V_i(x_k). \qquad (13)$$
Theorem 3.2: Let $V_\infty(x_k)$ be defined as in (13). Then, for any controllable state $x_k$, $V_\infty(x_k)$ satisfies
$$V_\infty(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \}.$$
Proof: Let $\eta_k$ be an arbitrary control and let $x_{k+1} = F(x_k, \eta_k)$. From (8), we have
$$V_\infty(x_k) \le V_{i+1}(x_k) \le U(x_k, \eta_k) + V_i(x_{k+1}).$$
Letting $i \to \infty$, we have
$$V_\infty(x_k) \le U(x_k, \eta_k) + V_\infty(x_{k+1})$$
which is true for $\forall\, \eta_k$. Therefore,
$$V_\infty(x_k) \le \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \}. \qquad (14)$$
Let $\varepsilon > 0$ be an arbitrary positive number. Since $V_i(x_k)$ is nonincreasing for $i \ge 1$ and $\lim_{i \to \infty} V_i(x_k) = V_\infty(x_k)$, there exists a positive integer $p$ such that $V_p(x_k) - \varepsilon \le V_\infty(x_k) \le V_p(x_k)$. From (8), we have
$$V_p(x_k) = \min_{u_k} \{ U(x_k, u_k) + V_{p-1}(F(x_k, u_k)) \} = U(x_k, v_p(x_k)) + V_{p-1}(F(x_k, v_p(x_k))).$$
Hence,
$$V_\infty(x_k) \ge U(x_k, v_p(x_k)) + V_{p-1}(F(x_k, v_p(x_k))) - \varepsilon \ge U(x_k, v_p(x_k)) + V_\infty(F(x_k, v_p(x_k))) - \varepsilon \ge \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \} - \varepsilon.$$
Since $\varepsilon$ is arbitrary, we have
$$V_\infty(x_k) \ge \min_{u_k} \{ U(x_k, u_k) + V_\infty(x_{k+1}) \}. \qquad (15)$$
Combining (14) and (15), we prove the theorem.

Next, we will prove that the iterative performance index function $V_i(x_k)$ converges to the optimal performance index function $J^*(x_k)$ as $i \to \infty$.
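To make the recursion concrete, the following is a minimal numerical sketch of the one-step backup $V_i(x_k) = \min_{u_k}\{U(x_k, u_k) + V_{i-1}(F(x_k, u_k))\}$ that underlies the proof above. The scalar system, the grids, and the interpolation are illustrative assumptions for the sketch only; the paper's implementation approximates $V_i$ and $v_i$ with neural networks, as described in Section V.

```python
import numpy as np

# Illustrative system and utility (assumptions for this sketch only).
def F(x, u):
    return 0.5 * x + u            # system function, F(0, 0) = 0

def U(x, u):
    return x**2 + u**2            # utility, U(0, 0) = 0 and U >= 0

xs = np.linspace(-2.0, 2.0, 201)  # state grid
us = np.linspace(-2.0, 2.0, 401)  # control grid

def backup(V_prev):
    """One backup V_i(x) = min_u { U(x, u) + V_{i-1}(F(x, u)) } on the grid."""
    V_next = np.empty_like(V_prev)
    for j, x in enumerate(xs):
        # V_{i-1} at the successor states, by linear interpolation on the grid
        succ = np.interp(F(x, us), xs, V_prev)
        V_next[j] = np.min(U(x, us) + succ)
    return V_next

V = np.zeros_like(xs)             # start from V_0 = 0
for i in range(50):
    V = backup(V)                 # the limit of V_i satisfies the Bellman equation
```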
Theorem 3.3: Let $V_\infty(x_k)$ be defined in (13). If the system state $x_k$ is controllable, then the performance index function $V_\infty(x_k)$ equals the optimal performance index function $J^*(x_k)$, i.e.,
$$\lim_{i \to \infty} V_i(x_k) = J^*(x_k)$$
where $V_i(x_k)$ is defined in (8).

Proof: According to (3) and (10), we have
$$J^*(x_k) \le \min_{u_k^{k+i-1}} \left\{ J\left(x_k, u_k^{k+i-1}\right) : u_k^{k+i-1} \in \mathcal{A}^{(i)}_{x_k} \right\} = V_i(x_k).$$
Then, letting $i \to \infty$, we obtain
$$J^*(x_k) \le V_\infty(x_k). \qquad (16)$$
Next, we show that
$$V_\infty(x_k) \le J^*(x_k). \qquad (17)$$
For any $\omega > 0$, by the definition of $J^*(x_k)$ in (3), there exists $\eta_k \in \mathcal{A}_{x_k}$ such that
$$J(x_k, \eta_k) \le J^*(x_k) + \omega. \qquad (18)$$
Suppose that $|\eta_k| = p$. Then, $\eta_k \in \mathcal{A}^{(p)}_{x_k}$. So, by Theorem 3.1 and (10), we have
$$V_\infty(x_k) \le V_p(x_k) = \min_{u_k^{k+p-1}} \left\{ J\left(x_k, u_k^{k+p-1}\right) : u_k^{k+p-1} \in \mathcal{A}^{(p)}_{x_k} \right\} \le J(x_k, \eta_k) \le J^*(x_k) + \omega.$$
Since $\omega$ is chosen arbitrarily, we know that (17) is true. Therefore, from (16) and (17), we prove the theorem.

We can now present the following corollary.

Corollary 3.1: Let the performance index function $V_i(x_k)$ be defined by (8). If the system state $x_k$ is controllable, then the iterative control law $v_i(x_k)$ converges to the optimal control law $u^*(x_k)$, i.e., $\lim_{i \to \infty} v_i(x_k) = u^*(x_k)$.

Remark 3.3: Generally speaking, for finite-horizon optimal control problems, the optimal performance index function depends not only on the state $x_k$ but also on the time left (see [60], [61]). For finite-horizon optimal control problems with unspecified terminal time, we have proved that the iterative performance index functions converge to the optimal one as the iteration index $i$ reaches infinity. Then, the time left is negligible and we say that the optimal performance index function $V_\infty(x_k)$ is only a function of the state $x_k$, as in the case of infinite-horizon optimal control problems.

By Theorem 3.3 and Corollary 3.1, we know that if $x_k$ is controllable, then, as $i \to \infty$, the iterative performance index function $V_i(x_k)$ converges to the optimal performance index function $J^*(x_k)$ and the iterative control law $v_i(x_k)$ also converges to the optimal control law $u^*(x_k)$. So, it is important to note that for a controllable state $x_k$, the iterative performance index functions $V_i(x_k)$ are well defined for all $i$ under the iterative control law $v_i(x_k)$.

Let $T_0 = \{0\}$. For $i = 1, 2, \ldots$, define
$$T_i = \{ x_k \in \mathbb{R}^n \mid \exists\, u_k \in \mathbb{R}^m \ \text{s.t.} \ F(x_k, u_k) \in T_{i-1} \}. \qquad (19)$$
Next, we prove the following theorem.

Theorem 3.4: Let $T_0 = \{0\}$ and let $T_i$ be defined in (19). Then, for $i = 0, 1, \ldots$, we have $T_i \subseteq T_{i+1}$.

Proof: We prove the theorem by mathematical induction. First, let $i = 0$. Since $T_0 = \{0\}$ and $F(0, 0) = 0$, we know that $0 \in T_1$. Hence, $T_0 \subseteq T_1$. Next, assume that $T_{i-1} \subseteq T_i$ holds. Now, if $x_k \in T_i$, we have $F(x_k, \eta_{i-1}(x_k)) \in T_{i-1}$ for some $\eta_{i-1}(x_k)$. Hence, $F(x_k, \eta_{i-1}(x_k)) \in T_i$ by the assumption that $T_{i-1} \subseteq T_i$. So, $x_k \in T_{i+1}$ by (19). Thus, $T_i \subseteq T_{i+1}$, which proves the theorem.

According to Theorem 3.4, we have $\{0\} = T_0 \subseteq T_1 \subseteq \cdots \subseteq T_{i-1} \subseteq T_i \subseteq \cdots$. We can see that, by introducing the sets $T_i$, $i = 0, 1, \ldots$, the states $x_k$ can be classified correspondingly. According to Theorem 3.4, the properties of the ADP algorithm can be derived in the following theorem.

Theorem 3.5:
1) For any $i$, $x_k \in T_i \Leftrightarrow \mathcal{A}^{(i)}_{x_k} \neq \emptyset \Leftrightarrow V_i(x_k)$ is defined at $x_k$.
2) Let $T_\infty = \cup_{i=1}^{\infty} T_i$. Then, $x_k \in T_\infty \Leftrightarrow \mathcal{A}_{x_k} \neq \emptyset \Leftrightarrow J^*(x_k)$ is defined at $x_k$ $\Leftrightarrow$ $x_k$ is controllable.
3) If $V_i(x_k)$ is defined at $x_k$, then $V_j(x_k)$ is defined at $x_k$ for every $j \ge i$.
4) $J^*(x_k)$ is defined at $x_k$ if and only if there exists an $i$ such that $V_i(x_k)$ is defined at $x_k$.

IV. ε-OPTIMAL CONTROL ALGORITHM
In the previous section, we proved that the iterative performance index function $V_i(x_k)$ converges to the optimal performance index function $J^*(x_k)$, and that $J^*(x_k) = \min_{u_k}\{ J(x_k, u_k) : u_k \in \mathcal{A}_{x_k} \}$ satisfies Bellman's equation (4) for any controllable state $x_k \in T_\infty$. To obtain the optimal performance index function $J^*(x_k)$, a natural strategy is to run the iterative ADP algorithm (6)–(9) until $i \to \infty$. Unfortunately, it is not practical to do so. In many cases, we cannot find the equality $J^*(x_k) = V_i(x_k)$ for any finite $i$. That is, for any admissible control sequence $u_k$ with finite length, the performance index starting from $x_k$ under the control of $u_k$ will be larger than, not equal to, $J^*(x_k)$. On the other hand, by running the iterative ADP algorithm (6)–(9), we can obtain a control vector $v_\infty(x_k)$ and then construct a control sequence $u_\infty(x_k) = (v_\infty(x_k), v_\infty(x_{k+1}), \ldots, v_\infty(x_{k+i}), \ldots)$, where $x_{k+1} = F(x_k, v_\infty(x_k)), \ldots, x_{k+i} = F(x_{k+i-1}, v_\infty(x_{k+i-1})), \ldots$. In general, $u_\infty(x_k)$ has infinite length; that is, the controller $v_\infty(x_k)$ cannot control the state to reach the target in a finite number of steps. To overcome this difficulty, a new ε-optimal control method using the iterative ADP algorithm will be developed in this section.

A. ε-Optimal Control Method

In this section, we will introduce our method of iterative ADP with consideration of the length of control sequences. For different $x_k$, we will consider different lengths $i$ for the
optimal control sequence. For a given error bound $\varepsilon > 0$, the number $i$ will be chosen so that the error between $J^*(x_k)$ and $V_i(x_k)$ is within the bound.

Let $\varepsilon > 0$ be any small number and let $x_k \in T_\infty$ be any controllable state. Let the performance index function $V_i(x_k)$ be defined by (8) and let $J^*(x_k)$ be the optimal performance index function. According to Theorem 3.3, given $\varepsilon > 0$, there exists a finite $i$ such that
$$|V_i(x_k) - J^*(x_k)| \le \varepsilon. \qquad (20)$$
We can now give the following definition.

Definition 4.1: Let $x_k \in T_\infty$ be a controllable state vector. Let $\varepsilon > 0$ be a small positive number. The approximate length of the optimal control sequence with respect to $\varepsilon$ is defined as
$$K_\varepsilon(x_k) = \min\{ i : |V_i(x_k) - J^*(x_k)| \le \varepsilon \}. \qquad (21)$$
Given a small positive number $\varepsilon$, for any state vector $x_k$, the number $K_\varepsilon(x_k)$ gives a suitable length of control sequence for optimal control starting from $x_k$. For $x_k \in T_\infty$, since $\lim_{i \to \infty} V_i(x_k) = J^*(x_k)$, we can always find an $i$ such that (20) is satisfied. Therefore, $\{ i : |V_i(x_k) - J^*(x_k)| \le \varepsilon \} \neq \emptyset$ and $K_\varepsilon(x_k)$ is well defined.

We can see that an error $\varepsilon$ between $V_i(x_k)$ and $J^*(x_k)$ is introduced into the iterative ADP algorithm, which makes the performance index function $V_i(x_k)$ converge within a finite number of iteration steps. In this part, we will show that the corresponding control is also an effective control that drives the performance index function to within the error bound $\varepsilon$ of its optimal value. From Definition 4.1, we can see that all the states $x_k$ that satisfy (21) can be classified into one set. Motivated by the definition in (19), we can further classify this set using the following definition.

Definition 4.2: Let $\varepsilon$ be a positive number. Define $T_0^{(\varepsilon)} = \{0\}$ and, for $i = 1, 2, \ldots$, define
$$T_i^{(\varepsilon)} = \{ x_k \in T_\infty : K_\varepsilon(x_k) \le i \}.$$
Accordingly, when $x_k \in T_i^{(\varepsilon)}$, to find the optimal control sequence which has a performance index less than or equal to $J^*(x_k) + \varepsilon$, one only needs to consider the control sequences $u_k$ with length $|u_k| \le i$. The sets $T_i^{(\varepsilon)}$ have the following properties.

Theorem 4.1: Let $\varepsilon > 0$ and $i = 0, 1, \ldots$. Then:
1) $x_k \in T_i^{(\varepsilon)}$ if and only if $V_i(x_k) \le J^*(x_k) + \varepsilon$;
2) $T_i^{(\varepsilon)} \subseteq T_i$;
3) $T_i^{(\varepsilon)} \subseteq T_{i+1}^{(\varepsilon)}$;
4) $\cup_i T_i^{(\varepsilon)} = T_\infty$;
5) if $\varepsilon > \delta > 0$, then $T_i^{(\varepsilon)} \supseteq T_i^{(\delta)}$.

Proof:
1) Let $x_k \in T_i^{(\varepsilon)}$. By Definition 4.2, $K_\varepsilon(x_k) \le i$. Let $j = K_\varepsilon(x_k)$. Then, $j \le i$ and, by Definition 4.1, $|V_j(x_k) - J^*(x_k)| \le \varepsilon$. So, $V_j(x_k) \le J^*(x_k) + \varepsilon$. By Theorem 3.1, $V_i(x_k) \le V_j(x_k) \le J^*(x_k) + \varepsilon$. On the other hand, if $V_i(x_k) \le J^*(x_k) + \varepsilon$, then $|V_i(x_k) - J^*(x_k)| \le \varepsilon$. So, $K_\varepsilon(x_k) = \min\{ j : |V_j(x_k) - J^*(x_k)| \le \varepsilon \} \le i$, which implies that $x_k \in T_i^{(\varepsilon)}$.
2) If $x_k \in T_i^{(\varepsilon)}$, then $K_\varepsilon(x_k) \le i$ and $|V_i(x_k) - J^*(x_k)| \le \varepsilon$. So, $V_i(x_k)$ is defined at $x_k$. According to Theorem 3.5 1), we have $x_k \in T_i$. Hence, $T_i^{(\varepsilon)} \subseteq T_i$.
3) If $x_k \in T_i^{(\varepsilon)}$, then $K_\varepsilon(x_k) \le i < i + 1$. So, $x_k \in T_{i+1}^{(\varepsilon)}$. Thus, $T_i^{(\varepsilon)} \subseteq T_{i+1}^{(\varepsilon)}$.
4) Obviously, $\cup_i T_i^{(\varepsilon)} \subseteq T_\infty$ since the $T_i^{(\varepsilon)}$ are subsets of $T_\infty$. For any $x_k \in T_\infty$, let $p = K_\varepsilon(x_k)$. Then, $x_k \in T_p^{(\varepsilon)}$. So, $x_k \in \cup_i T_i^{(\varepsilon)}$. Hence, $T_\infty \subseteq \cup_i T_i^{(\varepsilon)} \subseteq T_\infty$, and we obtain $\cup_i T_i^{(\varepsilon)} = T_\infty$.
5) If $x_k \in T_i^{(\delta)}$, then $V_i(x_k) \le J^*(x_k) + \delta$ by part 1) of this theorem. Clearly, $V_i(x_k) \le J^*(x_k) + \varepsilon$ since $\delta < \varepsilon$. This implies that $x_k \in T_i^{(\varepsilon)}$. Therefore, $T_i^{(\varepsilon)} \supseteq T_i^{(\delta)}$.
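Definition 4.1 translates directly into a stopping-index search. Below is a minimal sketch; `V` (a routine returning the iterative performance index $V_i(x_k)$) and `J_star` (standing in for $J^*(x_k)$, which is unknown in practice and is replaced in Section IV-B by the surrogate criterion (34)) are hypothetical callables assumed for illustration.

```python
def K_eps(x, V, J_star, eps, i_max=1000):
    """Approximate length of the optimal control sequence w.r.t. eps
    (Definition 4.1): the smallest i with |V_i(x) - J*(x)| <= eps."""
    for i in range(1, i_max + 1):
        if abs(V(i, x) - J_star(x)) <= eps:
            return i
    raise RuntimeError("no i <= i_max satisfies the eps bound")
```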
According to Theorem 4.1 1), $T_i^{(\varepsilon)}$ is just the region where $V_i(x_k)$ is close to $J^*(x_k)$ with error less than $\varepsilon$. This region is a subset of $T_i$ according to Theorem 4.1 2). As stated in Theorem 4.1 3), when $i$ is large, the set $T_i^{(\varepsilon)}$ is also large; that is, when $i$ is large, we have a large region where we can use $V_i(x_k)$ as the approximation of $J^*(x_k)$ under a certain error. On the other hand, we claim that if $x_k$ is far away from the origin, we have to choose a long control sequence to approximate the optimal control sequence. Theorem 4.1 4) means that for every controllable state $x_k \in T_\infty$, we can always find a suitable control sequence with length $i$ to approximate the optimal control. The size of the set $T_i^{(\varepsilon)}$ depends on the value of $\varepsilon$: a smaller value of $\varepsilon$ gives a smaller set $T_i^{(\varepsilon)}$, which is indicated by Theorem 4.1 5).

Let $x_k \in T_\infty$ be an arbitrary controllable state. If $x_k \in T_i^{(\varepsilon)}$ and the iterative performance index function satisfies (20) under the control $v_i(x_k)$, we call this control the ε-optimal control and denote it as $\mu^*_\varepsilon(x_k)$:
$$\mu^*_\varepsilon(x_k) = v_i(x_k) = \arg\min_{u_k} \{ U(x_k, u_k) + V_{i-1}(F(x_k, u_k)) \}. \qquad (22)$$
We have the following corollary.

Corollary 4.1: Let $\mu^*_\varepsilon(x_k)$ be expressed in (22), which makes the performance index function satisfy (20) for $x_k \in T_i^{(\varepsilon)}$. Then, for any $x_k' \in T_i^{(\varepsilon)}$, $\mu^*_\varepsilon(x_k')$ guarantees
$$|V_i(x_k') - J^*(x_k')| \le \varepsilon. \qquad (23)$$
Proof: The corollary can be proved by contradiction. Assume that the conclusion is not true. Then, the inequality (23) is false under the control law $\mu^*_\varepsilon(\cdot)$ for some $x_k'' \in T_i^{(\varepsilon)}$. As $\mu^*_\varepsilon(x_k)$ makes the performance index function satisfy (20) for $x_k \in T_i^{(\varepsilon)}$, we have $K_\varepsilon(x_k) \le i$. Using the ε-optimal control law $\mu^*_\varepsilon(\cdot)$ at the state $x_k''$, according to the assumption, we have $|V_i(x_k'') - J^*(x_k'')| > \varepsilon$. Then, $K_\varepsilon(x_k'') > i$ and $x_k'' \notin T_i^{(\varepsilon)}$. This contradicts the assumption $x_k'' \in T_i^{(\varepsilon)}$. Therefore, the assumption is false and (23) holds for any $x_k' \in T_i^{(\varepsilon)}$.

Remark 4.1: Corollary 4.1 is very important for the neural network implementation of the iterative ADP algorithm. It shows that we do not need to obtain the optimal control law by searching the entire subset $T_i^{(\varepsilon)}$. Instead, we can just find one point of $T_i^{(\varepsilon)}$, i.e., $x_k \in T_i^{(\varepsilon)}$, to obtain the ε-optimal control
$\mu^*_\varepsilon(x_k)$, which will be effective for any other state $x_k' \in T_i^{(\varepsilon)}$. This property not only greatly reduces the computational complexity but also makes the optimal control law easy to obtain using neural networks.

Theorem 4.2: Let $x_k \in T_i^{(\varepsilon)}$ and let $\mu^*_\varepsilon(x_k)$ be expressed in (22). Then, $F(x_k, \mu^*_\varepsilon(x_k)) \in T_{i-1}^{(\varepsilon)}$. In other words, if $K_\varepsilon(x_k) = i$, then $K_\varepsilon(F(x_k, \mu^*_\varepsilon(x_k))) \le i - 1$.

Proof: Since $x_k \in T_i^{(\varepsilon)}$, by Theorem 4.1 1), we know that
$$V_i(x_k) \le J^*(x_k) + \varepsilon. \qquad (24)$$
According to (8) and (22), we have
$$V_i(x_k) = U(x_k, \mu^*_\varepsilon(x_k)) + V_{i-1}(F(x_k, \mu^*_\varepsilon(x_k))). \qquad (25)$$
Combining (24) and (25), we have
$$V_{i-1}(F(x_k, \mu^*_\varepsilon(x_k))) = V_i(x_k) - U(x_k, \mu^*_\varepsilon(x_k)) \le J^*(x_k) + \varepsilon - U(x_k, \mu^*_\varepsilon(x_k)). \qquad (26)$$
On the other hand, we have
$$J^*(x_k) \le U(x_k, \mu^*_\varepsilon(x_k)) + J^*(F(x_k, \mu^*_\varepsilon(x_k))). \qquad (27)$$
Putting (27) into (26), we obtain
$$V_{i-1}(F(x_k, \mu^*_\varepsilon(x_k))) \le J^*(F(x_k, \mu^*_\varepsilon(x_k))) + \varepsilon.$$
By Theorem 4.1 1), we have
$$F(x_k, \mu^*_\varepsilon(x_k)) \in T_{i-1}^{(\varepsilon)}. \qquad (28)$$
So, if $K_\varepsilon(x_k) = i$, we know that $x_k \in T_i^{(\varepsilon)}$ and $F(x_k, \mu^*_\varepsilon(x_k)) \in T_{i-1}^{(\varepsilon)}$ according to (28). Therefore, we have $K_\varepsilon(F(x_k, \mu^*_\varepsilon(x_k))) \le i - 1$, which proves the theorem.

Remark 4.2: From Theorem 4.2, we can see that the parameter $K_\varepsilon(x_k)$ gives an important property of the finite-horizon ADP algorithm. It not only gives an optimality condition for the iteration process, but also gives an optimal number of control steps for the finite-horizon ADP algorithm. For example, if $|V_i(x_k) - J^*(x_k)| \le \varepsilon$ for small $\varepsilon$, then we have $V_i(x_k) \approx J^*(x_k)$. According to Theorem 4.2, we can get $N = k + i$, where $N$ is the number of control steps needed to drive the system to zero. The whole control sequence $u_0^{N-1}$ may not be ε-optimal, but the control sequence $u_k^{N-1}$ is an ε-optimal control sequence. If $k = 0$, we have $N = K_\varepsilon(x_0) = i$. Under this condition, we say that the iteration index $K_\varepsilon(x_0)$ denotes the number of ε-optimal control steps.

Corollary 4.2: Let $\mu^*_\varepsilon(x_k)$ be expressed in (22), which makes the performance index function satisfy (20) for $x_k \in T_i^{(\varepsilon)}$. Then, for any $x_k' \in T_j^{(\varepsilon)}$, where $0 \le j \le i$, $\mu^*_\varepsilon(x_k')$ guarantees
$$|V_i(x_k') - J^*(x_k')| \le \varepsilon. \qquad (29)$$
Proof: The proof is similar to that of Corollary 4.1 and is omitted here.

Remark 4.3: Corollary 4.2 shows that the ε-optimal control $\mu^*_\varepsilon(x_k)$ obtained for $\forall\, x_k \in T_i^{(\varepsilon)}$ is effective for any state $x_k' \in T_j^{(\varepsilon)}$, where $0 \le j \le i$. This means that for $\forall\, x_k' \in T_j^{(\varepsilon)}$, $0 \le j \le i$, we can use the same ε-optimal control $\mu^*_\varepsilon(x_k')$ to control the system.

According to Theorem 4.1 3) and Corollary 4.1, the ε-optimal control $\mu^*_\varepsilon(x_k)$ obtained for an $x_k \in T_i^{(\varepsilon)}$ is effective for any state $x_k' \in T_{i-1}^{(\varepsilon)}$ (which is also stated in Corollary 4.2). That is to say, in order to obtain an effective ε-optimal control, the iterative ADP algorithm only needs to be run at some state $x_k \in T_\infty$. In order to obtain an effective ε-optimal control law $\mu^*_\varepsilon(x_k)$, we should choose the state $x_k \in T_i^{(\varepsilon)} \setminus T_{i-1}^{(\varepsilon)}$ for each $i$ to run the iterative ADP algorithm. The control process using the iterative ADP algorithm is illustrated in Fig. 1.

Fig. 1. Control process of the controllable state $x_k \in T_i^{(\varepsilon)}$ using the iterative ADP algorithm.

B. ε-Optimal Control Algorithm

From the iterative ADP algorithm (6)–(9), we can see that for any state $x_k \in \mathbb{R}^n$, there exists a control $u_k \in \mathbb{R}^m$ that drives the system to zero in one step. In other words, for $\forall\, x_k \in \mathbb{R}^n$, there exists a control $u_k \in \mathbb{R}^m$ such that $x_{k+1} = F(x_k, u_k) = 0$ holds. A large class of systems possesses this property, for example, all linear systems of the type $x_{k+1} = A x_k + B u_k$ when $B$ is invertible, and affine nonlinear systems of the type $x_{k+1} = f(x_k) + g(x_k) u_k$ when the inverse of $g(x_k)$ exists. But there are also classes of systems for which there does not exist any control $u_k \in \mathbb{R}^m$ that drives the state to zero in one step for some $x_k \in \mathbb{R}^n$, i.e., $\exists\, x_k \in \mathbb{R}^n$ such that $F(x_k, u_k) = 0$ is not possible for $\forall\, u_k \in \mathbb{R}^m$. In the following part, we will discuss the situation where $F(x_k, u_k) = 0$ cannot be achieved for some $x_k \in \mathbb{R}^n$.

Since $x_k$ is controllable, there exists a finite-horizon admissible control sequence $u_k^{k+i-1} = (u_k, u_{k+1}, \ldots, u_{k+i-1}) \in \mathcal{A}^{(i)}_{x_k}$ that makes $x^{(f)}\left(x_k, u_k^{k+i-1}\right) = x_{k+i} = 0$. Let $N = k + i$ be the terminal time. Assume that for $k + 1, k + 2, \ldots, N - 1$, the optimal control sequence $u_{k+1}^{(N-1)*} = (u_{k+1}^*, u_{k+2}^*, \ldots, u_{N-1}^*) \in \mathcal{A}^{(N-k-1)}_{x_{k+1}}$ has been determined. Denote the performance index function for $x_{k+1}$ as $J\left(x_{k+1}, u_{k+1}^{(N-1)*}\right) = V_0(x_{k+1})$. Now, we use the iterative ADP algorithm to determine the optimal control sequence for the state $x_k$.
The performance index function for $i = 1$ is computed as
$$V_1(x_k) = U(x_k, v_1(x_k)) + V_0(F(x_k, v_1(x_k))) \qquad (30)$$
where
$$v_1(x_k) = \arg\min_{u_k} \{ U(x_k, u_k) + V_0(F(x_k, u_k)) \}. \qquad (31)$$
Note that the initial condition used in the above expression is the performance index function $V_0$, which was obtained previously for $x_{k+1}$ and is now applied at $F(x_k, u_k)$. For $i = 2, 3, 4, \ldots$, the iterative ADP algorithm is implemented as follows:
$$V_i(x_k) = U(x_k, v_i(x_k)) + V_{i-1}(F(x_k, v_i(x_k))) \qquad (32)$$
where
$$v_i(x_k) = \arg\min_{u_k} \{ U(x_k, u_k) + V_{i-1}(F(x_k, u_k)) \}. \qquad (33)$$
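Compared with the earlier recursion, the only change in (30)–(33) is the initial condition: the iteration at $x_k$ starts from the performance index $V_0$ already determined for $x_{k+1}$ rather than from zero. A small sketch of this warm start, reusing the illustrative `backup` routine and grids from the earlier sketch in Section III (so the same assumptions apply):

```python
def iterate_from(V0_vals, n_iters):
    """Run (30)-(33): V_1(x) = min_u { U(x,u) + V_0(F(x,u)) }, then
    V_i(x) = min_u { U(x,u) + V_{i-1}(F(x,u)) } for i = 2, 3, ...,
    where V0_vals holds the previously obtained V_0 on the state grid."""
    V = np.asarray(V0_vals, dtype=float)  # warm start, not V_0 = 0
    for _ in range(n_iters):
        V = backup(V)                     # same one-step backup as before
    return V
```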
Theorem 4.3: Let $x_k$ be an arbitrary controllable state vector. Then, the performance index function $V_i(x_k)$ obtained by (30)–(33) is a monotonically nonincreasing sequence for $\forall\, i \ge 0$, i.e., $V_{i+1}(x_k) \le V_i(x_k)$ for $\forall\, i \ge 0$.

Proof: It can easily be proved following the proof of Theorem 3.1, and the proof is omitted here.

Theorem 4.4: Let the performance index function $V_i(x_k)$ be defined by (32). If the system state $x_k$ is controllable, then the performance index function $V_i(x_k)$ obtained by (30)–(33) converges to the optimal performance index function $J^*(x_k)$ as $i \to \infty$:
$$\lim_{i \to \infty} V_i(x_k) = J^*(x_k).$$
Proof: This theorem can be proved following steps similar to those in the proof of Theorem 3.3, and the proof is omitted here.

Remark 4.4: We can see that the iterative ADP algorithm (30)–(33) is an expansion of the previous one (6)–(9), so the properties of the iterative ADP algorithm (6)–(9) also hold for the current one (30)–(33). But there also exist differences. From Theorem 3.1, we can see that $V_{i+1}(x_k) \le V_i(x_k)$ for all $i \ge 1$, which means that $V_1(x_k) = \max\{V_i(x_k) : i = 0, 1, \ldots\}$, while Theorem 4.3 shows that $V_{i+1}(x_k) \le V_i(x_k)$ for all $i \ge 0$, which means that $V_0(x_k) = \max\{V_i(x_k) : i = 0, 1, \ldots\}$. This difference is caused by the difference in the initial conditions of the two iterative ADP algorithms. The previous iterative ADP algorithm (6)–(9) begins with the initial performance index function $V_0(x_k) = 0$, since $F(x_k, u_k) = 0$ can be solved, while the current iterative ADP algorithm (30)–(33) begins with the performance index function $V_0$ for the state $x_{k+1}$, which was determined previously. This also causes the difference between the proofs of Theorems 3.1 and 3.3 and the corresponding results in Theorems 4.3 and 4.4. But the difference in the initial conditions of the iterative performance index function does not affect the convergence property of the two iterative ADP algorithms.

For the iterative ADP algorithm, the optimality criterion (20) is very difficult to verify because the optimal performance index function $J^*(x_k)$ is unknown in general. So, an equivalent criterion is established to replace (20). If $|V_i(x_k) - J^*(x_k)| \le \varepsilon$ holds, we have $V_i(x_k) \le J^*(x_k) + \varepsilon$ and $J^*(x_k) \le V_{i+1}(x_k) \le V_i(x_k)$. These imply that
$$0 \le V_i(x_k) - V_{i+1}(x_k) \le \varepsilon \qquad (34)$$
or $|V_i(x_k) - V_{i+1}(x_k)| \le \varepsilon$. On the other hand, according to Theorem 4.4, $|V_i(x_k) - V_{i+1}(x_k)| \to 0$ implies $V_i(x_k) \to J^*(x_k)$. Therefore, for any given small $\varepsilon$, if $|V_i(x_k) - V_{i+1}(x_k)| \le \varepsilon$ holds and $i$ is sufficiently large, then $|V_i(x_k) - J^*(x_k)| \le \varepsilon$ holds. We will use (34) as the optimality criterion instead of the criterion (20).

Let $u_0^{K-1} = (u_0, u_1, \ldots, u_{K-1})$ be an arbitrary finite-horizon admissible control sequence and let the corresponding state sequence be $x_0^K = (x_0, x_1, \ldots, x_K)$, where $x_K = 0$. The initial control sequence $u_0^{K-1}$ may not be optimal, which means that the initial number of control steps $K$ may not be optimal. So, the iterative ADP algorithm must complete two kinds of optimization: one is to optimize the number of control steps, and the other is to optimize the control law. In the following, we will show how the number of control steps and the control law are both optimized simultaneously in the iterative ADP algorithm.

For the state $x_{K-1}$, we have $F(x_{K-1}, u_{K-1}) = 0$. Then, we run the iterative ADP algorithm (6)–(9) at $x_{K-1}$ as follows. The performance index function for $i = 1$ is computed as
$$V_1^1(x_{K-1}) = \min_{u_{K-1}} \left\{ U(x_{K-1}, u_{K-1}) + V_0(F(x_{K-1}, u_{K-1})) : F(x_{K-1}, u_{K-1}) = 0 \right\} = U\left(x_{K-1}, v_1^1(x_{K-1})\right) \qquad (35)$$
where
$$v_1^1(x_{K-1}) = \arg\min_{u_{K-1}} \left\{ U(x_{K-1}, u_{K-1}) : F(x_{K-1}, u_{K-1}) = 0 \right\} \qquad (36)$$
and $V_0(F(x_{K-1}, u_{K-1})) = 0$. The iterative ADP algorithm is then implemented as follows for $i = 2, 3, 4, \ldots$:
$$V_i^1(x_{K-1}) = U\left(x_{K-1}, v_i^1(x_{K-1})\right) + V_{i-1}^1\left(F\left(x_{K-1}, v_i^1(x_{K-1})\right)\right) \qquad (37)$$
where
$$v_i^1(x_{K-1}) = \arg\min_{u_{K-1}} \left\{ U(x_{K-1}, u_{K-1}) + V_{i-1}^1(F(x_{K-1}, u_{K-1})) \right\} \qquad (38)$$
until the inequality
$$\left| V_{l_1}^1(x_{K-1}) - V_{l_1+1}^1(x_{K-1}) \right| \le \varepsilon \qquad (39)$$
is satisfied for some $l_1 > 0$. This means that $x_{K-1} \in T_{l_1}^{(\varepsilon)}$ and the optimal number of control steps is $K_\varepsilon(x_{K-1}) = l_1$.

Considering $x_{K-2}$, we have $F(x_{K-2}, u_{K-2}) = x_{K-1}$. Put $x_{K-2}$ into (39). If $\left| V_{l_1}^1(x_{K-2}) - V_{l_1+1}^1(x_{K-2}) \right| \le \varepsilon$ holds, then, according to Theorem 4.1 1), we know that $x_{K-2} \in T_{l_1}^{(\varepsilon)}$. Otherwise, if $x_{K-2} \notin T_{l_1}^{(\varepsilon)}$, we run the iterative ADP algorithm as follows. Using the performance index function $V_{l_1}^1$ as the initial condition, we compute for $i = 1$
$$V_1^2(x_{K-2}) = U\left(x_{K-2}, v_1^2(x_{K-2})\right) + V_{l_1}^1\left(F\left(x_{K-2}, v_1^2(x_{K-2})\right)\right) \qquad (40)$$
where
$$v_1^2(x_{K-2}) = \arg\min_{u_{K-2}} \left\{ U(x_{K-2}, u_{K-2}) + V_{l_1}^1(F(x_{K-2}, u_{K-2})) \right\}. \qquad (41)$$
The iterative ADP algorithm is then implemented as follows for $i = 2, 3, 4, \ldots$:
$$V_i^2(x_{K-2}) = U\left(x_{K-2}, v_i^2(x_{K-2})\right) + V_{i-1}^2\left(F\left(x_{K-2}, v_i^2(x_{K-2})\right)\right) \qquad (42)$$
where
$$v_i^2(x_{K-2}) = \arg\min_{u_{K-2}} \left\{ U(x_{K-2}, u_{K-2}) + V_{i-1}^2(F(x_{K-2}, u_{K-2})) \right\} \qquad (43)$$
until the inequality
$$\left| V_{l_2}^2(x_{K-2}) - V_{l_2+1}^2(x_{K-2}) \right| \le \varepsilon \qquad (44)$$
is satisfied for some $l_2 > 0$. We can then obtain that $x_{K-2} \in T_{l_2}^{(\varepsilon)}$ and the optimal number of control steps is $K_\varepsilon(x_{K-2}) = l_2$.

Next, assume that $j \ge 2$, that $x_{K-j+1} \in T_{l_{j-1}}^{(\varepsilon)}$, and that
$$\left| V_{l_{j-1}}^{j-1}(x_{K-j+1}) - V_{l_{j-1}+1}^{j-1}(x_{K-j+1}) \right| \le \varepsilon \qquad (45)$$
holds. Considering $x_{K-j}$, we have $F(x_{K-j}, u_{K-j}) = x_{K-j+1}$. Putting $x_{K-j}$ into (45), if
$$\left| V_{l_{j-1}}^{j-1}(x_{K-j}) - V_{l_{j-1}+1}^{j-1}(x_{K-j}) \right| \le \varepsilon \qquad (46)$$
holds, then we know that $x_{K-j} \in T_{l_{j-1}}^{(\varepsilon)}$. Otherwise, if $x_{K-j} \notin T_{l_{j-1}}^{(\varepsilon)}$, we run the iterative ADP algorithm as follows. Using the performance index function $V_{l_{j-1}}^{j-1}$ as the initial condition, we compute for $i = 1$
$$V_1^j(x_{K-j}) = U\left(x_{K-j}, v_1^j(x_{K-j})\right) + V_{l_{j-1}}^{j-1}\left(F\left(x_{K-j}, v_1^j(x_{K-j})\right)\right) \qquad (47)$$
where
$$v_1^j(x_{K-j}) = \arg\min_{u_{K-j}} \left\{ U(x_{K-j}, u_{K-j}) + V_{l_{j-1}}^{j-1}(F(x_{K-j}, u_{K-j})) \right\}. \qquad (48)$$
The iterative ADP algorithm is then implemented as follows for $i = 2, 3, 4, \ldots$:
$$V_i^j(x_{K-j}) = U\left(x_{K-j}, v_i^j(x_{K-j})\right) + V_{i-1}^j\left(F\left(x_{K-j}, v_i^j(x_{K-j})\right)\right) \qquad (49)$$
where
$$v_i^j(x_{K-j}) = \arg\min_{u_{K-j}} \left\{ U(x_{K-j}, u_{K-j}) + V_{i-1}^j(F(x_{K-j}, u_{K-j})) \right\} \qquad (50)$$
until the inequality
$$\left| V_{l_j}^j(x_{K-j}) - V_{l_j+1}^j(x_{K-j}) \right| \le \varepsilon \qquad (51)$$
is satisfied for some $l_j > 0$. We can then obtain that $x_{K-j} \in T_{l_j}^{(\varepsilon)}$ and the optimal number of control steps is $K_\varepsilon(x_{K-j}) = l_j$.

Finally, considering $x_0$, we have $F(x_0, u_0) = x_1$. If
$$\left| V_{l_{K-1}}^{K-1}(x_0) - V_{l_{K-1}+1}^{K-1}(x_0) \right| \le \varepsilon$$
holds, then we know that $x_0 \in T_{l_{K-1}}^{(\varepsilon)}$. Otherwise, if $x_0 \notin T_{l_{K-1}}^{(\varepsilon)}$, we run the iterative ADP algorithm as follows. Using the performance index function $V_{l_{K-1}}^{K-1}$ as the initial condition, we compute for $i = 1$
$$V_1^K(x_0) = U\left(x_0, v_1^K(x_0)\right) + V_{l_{K-1}}^{K-1}\left(F\left(x_0, v_1^K(x_0)\right)\right) \qquad (52)$$
where
$$v_1^K(x_0) = \arg\min_{u_0} \left\{ U(x_0, u_0) + V_{l_{K-1}}^{K-1}(F(x_0, u_0)) \right\}. \qquad (53)$$
The iterative ADP algorithm is then implemented as follows for $i = 2, 3, 4, \ldots$:
$$V_i^K(x_0) = U\left(x_0, v_i^K(x_0)\right) + V_{i-1}^K\left(F\left(x_0, v_i^K(x_0)\right)\right) \qquad (54)$$
where
$$v_i^K(x_0) = \arg\min_{u_0} \left\{ U(x_0, u_0) + V_{i-1}^K(F(x_0, u_0)) \right\} \qquad (55)$$
until the inequality
$$\left| V_{l_K}^K(x_0) - V_{l_K+1}^K(x_0) \right| \le \varepsilon \qquad (56)$$
is satisfied for some $l_K > 0$. Therefore, we can obtain that $x_0 \in T_{l_K}^{(\varepsilon)}$ and the optimal number of control steps is $K_\varepsilon(x_0) = l_K$. Starting from the initial state $x_0$, the optimal number of control steps is thus $l_K$ according to our ADP algorithm.

Remark 4.5: For the case where, for some $x_k \in \mathbb{R}^n$, there does not exist a control $u_k \in \mathbb{R}^m$ that drives the system to zero in one step, the computational complexity of the iterative ADP algorithm is closely related to the original finite-horizon admissible control sequence $u_0^{K-1}$. First, we repeat the iterative ADP algorithm at $x_{K-1}, x_{K-2}, \ldots, x_1, x_0$, respectively, so the cost is related to the number of control steps $K$ of $u_0^{K-1}$: if $K$ is large, it means that $u_0^{K-1}$ takes a large number of control steps to drive the initial state $x_0$ to zero, and then the iterative ADP algorithm must be repeated many times. Second, the computational complexity is also related to the quality of the control results of $u_0^{K-1}$: if $u_0^{K-1}$ is close to the optimal control sequence $u_0^{(N-1)*}$, then it takes less computation to make (51) hold for each $j$.
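The procedure (35)–(56) can be summarized as a loop along the initial admissible trajectory (cf. the step-by-step summary that follows). The sketch below is schematic only, not the paper's neural-network implementation: `bellman_backup` is an assumed routine performing one minimization of the form (37)/(42)/(49)/(54), the performance index functions are represented abstractly as callables, and the stopping rule is the criterion $|V_l - V_{l+1}| \le \varepsilon$ of (39), (44), (51), and (56).

```python
def eps_optimal_steps(states, bellman_backup, V0, eps, i_max=1000):
    """Schematic pass over x_{K-1}, x_{K-2}, ..., x_0 (reverse order along an
    initial admissible trajectory with x_K = 0). Returns the final performance
    index function and the recorded K_eps value for each visited state."""
    V, K_eps = V0, {}
    for x in states:
        l, V_next = 1, bellman_backup(V)
        # iterate until |V_l(x) - V_{l+1}(x)| <= eps, cf. (39), (44), (51), (56)
        while abs(V(x) - V_next(x)) > eps:
            V, V_next = V_next, bellman_backup(V_next)
            l += 1
            if l > i_max:
                raise RuntimeError("stopping criterion not met")
        K_eps[x] = l      # optimal number of control steps recorded at x
    return V, K_eps
```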
C. Summary of the ε-Optimal Control Algorithm

Now, we summarize the iterative ADP algorithm as follows.

Step 1: Choose an error bound ε and randomly choose an array of initial states $x_0$.
Step 2: Obtain an initial finite-horizon admissible control sequence $u_0^{K-1} = (u_0, u_1, \ldots, u_{K-1})$ and the corresponding state sequence $x_0^K = (x_0, x_1, \ldots, x_K)$, where $x_K = 0$.
Step 3: For the state $x_{K-1}$ with $F(x_{K-1}, u_{K-1}) = 0$, run the iterative ADP algorithm (35)–(38) at $x_{K-1}$ until (39) holds.
Step 4: Record $V_{l_1}^1(x_{K-1})$, $v_{l_1}^1(x_{K-1})$, and $K_\varepsilon(x_{K-1}) = l_1$.
Step 5: For $j = 2, 3, \ldots, K$, if for $x_{K-j}$ the inequality (46) holds, go to Step 7; otherwise, go to Step 6.
Step 6: Using the performance index function $V_{l_{j-1}}^{j-1}$ as the initial condition, run the iterative ADP algorithm (47)–(50) until (51) is satisfied.
Step 7: If $j = K$, then we have obtained the optimal performance index function $V^*(x_0) = V_{l_K}^K(x_0)$, the law of the optimal control sequence $u^*(x_0) = v_{l_K}^K(x_0)$, and the number of optimal control steps $K_\varepsilon(x_0) = l_K$; otherwise, set $j = j + 1$ and go to Step 5.
Step 8: Stop.

V. SIMULATION STUDY

To evaluate the performance of our iterative ADP algorithm, we choose two examples with quadratic utility functions for numerical experiments.

Example 5.1: Our first example is chosen from [57]. We consider the following nonlinear system:
$$x_{k+1} = f(x_k) + g(x_k) u_k$$
where $x_k = [x_{1k}\ x_{2k}]^T$ and $u_k = [u_{1k}\ u_{2k}]^T$ are the state and control variables, respectively. The system functions are given as
$$f(x_k) = \begin{bmatrix} 0.2\, x_{1k} \exp(x_{2k}^2) \\ 0.3\, x_{2k}^3 \end{bmatrix}, \qquad g(x_k) = \begin{bmatrix} -0.2 & 0 \\ 0 & -0.2 \end{bmatrix}.$$
The initial state is $x_0 = [1\ -1]^T$. The performance index function is in quadratic form with finite time horizon, expressed as
$$J\left(x_0, u_0^{N-1}\right) = \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right)$$
where $Q = R = I$ and $I$ denotes the identity matrix with suitable dimensions. The error bound of the iterative ADP algorithm is chosen as $\varepsilon = 10^{-5}$. Neural networks are used to implement the iterative ADP algorithm; the neural network structure can be seen in [32] and [57]. The critic network and the action network are chosen as three-layer backpropagation (BP) neural networks with the structures 2–8–1 and 2–8–2, respectively. The model network is also chosen as a three-layer BP neural network with the structure 4–8–2. The critic network is used to approximate the iterative performance index functions, which are expressed by (35), (37), (40), (42), (47), (49), (52), and (54). The action network is used to approximate the optimal control laws, which are expressed by (36), (38), (41), (43), (48), (50), (53), and (55). The training rules of the neural networks can be seen in [50]. For each iterative step, the critic network and the action network are trained for 1000 iteration steps using the learning rate $\alpha = 0.05$, so that the neural network training error becomes less than $10^{-8}$. Enough iteration steps should be implemented to guarantee that the iterative performance index functions and the control law converge sufficiently. We let the algorithm run for 15 iterative steps to obtain the optimal performance index function and the optimal control law.

The convergence curve of the performance index function is shown in Fig. 2(a). We then apply the optimal control law to the system for $T_f = 10$ time steps and obtain the following results. The ε-optimal control trajectories are shown in Fig. 2(b), and the corresponding state curves are shown in Fig. 2(c) and (d). After seven steps of iteration, we have $|V_6(x_0) - V_7(x_0)| \le 10^{-5} = \varepsilon$. We then obtain the optimal number of control steps $K_\varepsilon(x_0) = 6$. We can see that after six time steps, the state variable becomes $x_6 = [0.912 \times 10^{-6},\ 0.903 \times 10^{-7}]^T$. The entire computation process takes about 10 s before satisfactory results are obtained.

Fig. 2. Simulation results for Example 1. (a) Convergence of performance index function. (b) ε-optimal control vectors. (c) and (d) Corresponding state trajectories.

Example 5.2: The second example is chosen from [62] with some modifications. We consider the following system:
$$x_{k+1} = F(x_k, u_k) = x_k + \sin\left(0.1\, x_k^2 + u_k\right) \qquad (57)$$
where $x_k, u_k \in \mathbb{R}$ and $k = 0, 1, 2, \ldots$. The performance index function is defined as in Example 5.1 with $Q = R = 1$. The initial state is $x_0 = 1.5$. Since $F(0, 0) = 0$, $x_k = 0$ is an equilibrium state of system (57). But since $(\partial F(x_k, u_k)/\partial x_k)(0, 0) = 1$, system (57) is marginally stable at $x_k = 0$ and the equilibrium $x_k = 0$ is not attractive. We can see that for the fixed initial state $x_0$, there does not exist a control $u_0 \in \mathbb{R}$ that makes $x_1 = F(x_0, u_0) = 0$. The error bound of the iterative ADP algorithm is chosen as $\varepsilon = 10^{-4}$. The critic network, the action network, and the model network are chosen as three-layer BP neural networks with the structures 1–3–1, 1–3–1, and 2–4–1, respectively.
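The claim that no single control step can reach the origin from $x_0 = 1.5$ follows from the fact that the increment $\sin(\cdot)$ is bounded by 1, and it is easy to check numerically; a quick sketch (the brute-force control scan is purely illustrative):

```python
import numpy as np

F = lambda x, u: x + np.sin(0.1 * x**2 + u)   # system (57)

x0 = 1.5
us = np.linspace(-10.0, 10.0, 200001)         # dense control grid
x1 = F(x0, us)
# sin(.) lies in [-1, 1], so x1 stays in [0.5, 2.5] and never reaches 0
print(x1.min(), x1.max())
```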
Fig. 3. Simulation results for Case 1 of Example 2. (a) Convergence of performance index function at xk = 0.8. (b) Convergence of performance index function at xk = 1.5. (c) ε-optimal control trajectory. (d) Corresponding state trajectory.
According to (57), the control can be expressed by
$$u_k = -0.1\, x_k^2 + \sin^{-1}(x_{k+1} - x_k) + 2\lambda\pi \qquad (58)$$
where $\lambda = 0, \pm 1, \pm 2, \ldots$. To show the effectiveness of our algorithm, we choose two initial finite-horizon admissible control sequences.

Case 1: The control sequence is $\hat{u}_0^1 = (-0.225 - \sin^{-1}(0.7),\ -0.064 - \sin^{-1}(0.8))$ and the corresponding state sequence is $\hat{x}_0^2 = (1.5, 0.8, 0)$. For the initial finite-horizon admissible control sequence in this case, we run the iterative ADP algorithm at the states 0.8 and 1.5, respectively. For each iterative step, the critic network and the action network are trained for 1000 iteration steps using the learning rate $\alpha = 0.05$, so that the neural network training accuracy of $10^{-8}$ is reached. After the algorithm runs for 15 iterative steps, we obtain the performance index function trajectories shown in Fig. 3(a) and (b), respectively. The ε-optimal control and state trajectories are shown in Fig. 3(c) and (d), respectively, for 10 time steps. We obtain $K_\varepsilon(0.8) = 5$ and $K_\varepsilon(1.5) = 8$.

Case 2: The control sequence is $\hat{u}_0^3 = (-0.225 - \sin^{-1}(0.01),\ 2\pi - 2.2201 - \sin^{-1}(0.29),\ -0.144 - \sin^{-1}(0.5),\ -\sin^{-1}(0.7))$ and the corresponding state sequence is $\hat{x}_0^4 = (1.5, 1.49, 1.2, 0.7, 0)$. For the initial finite-horizon admissible control sequence in this case, we run the iterative ADP algorithm at the states 0.7, 1.2, and 1.49, respectively. For each iterative step, the critic network and the action network are also trained for 1000 iteration steps using the learning rate $\alpha = 0.05$, so that the neural network training accuracy of $10^{-8}$ is reached. We then obtain the performance index function trajectories shown in Fig. 4(a)–(c), respectively. We have $K_\varepsilon(0.7) = 4$, $K_\varepsilon(1.2) = 6$, and $K_\varepsilon(1.49) = 8$. After 25 iteration steps, the performance index function $V_i(x_k)$ has converged sufficiently at $x_k = 1.49$, with $V_8^3(1.49)$ as the performance index function. For the state $x_k = 1.5$, we have $|V_8^3(1.5) - V_9^3(1.5)| = 0.52424 \times 10^{-7} < \varepsilon$.
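Equation (58) is simply the inversion of (57) for a desired successor state, so the Case 1 sequence can be reproduced from it directly. A small sketch verifying that the listed controls drive $1.5 \to 0.8 \to 0$ (taking $\lambda = 0$):

```python
import numpy as np

F = lambda x, u: x + np.sin(0.1 * x**2 + u)                    # system (57)
u_for = lambda x, x_next: -0.1 * x**2 + np.arcsin(x_next - x)  # (58), lambda = 0

xs_case1 = [1.5, 0.8, 0.0]                                      # Case 1 states
for x, x_next in zip(xs_case1[:-1], xs_case1[1:]):
    u = u_for(x, x_next)
    print(u, F(x, u))   # prints the Case 1 controls and the reached states
```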
Fig. 4. Simulation results for Case 2 of Example 2. (a) Convergence of performance index function at xk = 0.7. (b) Convergence of performance index function at xk = 1.2. (c) Convergence of performance index function at xk = 1.49. (d) ε-optimal control trajectory and the corresponding state trajectory.
Therefore, the optimal performance index function at $x_k = 1.5$ is $V_8^3(1.5)$, and thus we have $x_k = 1.5 \in T_8^{(\varepsilon)}$ and $K_\varepsilon(1.5) = 8$. The whole computation process takes about 20 s before satisfactory results are obtained. We then apply the optimal control law to the system for $T_f = 10$ time steps. The ε-optimal control and state trajectories are shown in Fig. 4(d). We can see that the ε-optimal control trajectory in Fig. 4(d) is the same as the one in Fig. 3(c), and the corresponding state trajectory in Fig. 4(d) is the same as the one in Fig. 3(d). Therefore, the optimal control law does not depend on the initial control law: the initial control sequence $\hat{u}_0^{K-1}$ can be chosen arbitrarily as long as it is finite-horizon admissible.

Remark 5.1: If the number of control steps of the initial admissible control sequence is larger than the number of control steps of the optimal control sequence, then some of the states in the initial sequence possess the same number of optimal control steps. For example, in Case 2 of Example 2, the two states $x = 1.49$ and $x = 1.5$ possess the same number of optimal control steps, i.e., $K_\varepsilon(1.49) = K_\varepsilon(1.5) = 8$. Thus, we say that the control $u = -0.225 - \sin^{-1}(0.01)$ that takes $x = 1.5$ to $x = 1.49$ is an unnecessary control step. After the unnecessary control steps are identified and removed, the number of control steps reduces to the optimal number of control steps, and thus the initial admissible control sequence does not affect the final optimal control results.

VI. CONCLUSION

In this paper, we developed an effective iterative ADP algorithm for finite-horizon ε-optimal control of discrete-time nonlinear systems. Convergence of the performance index function for the iterative ADP algorithm was proved, and the ε-optimal number of control steps can also be obtained. Neural networks were used to implement the iterative ADP algorithm. Finally, two simulation examples were given to illustrate the performance of the proposed algorithm.
REFERENCES

[1] A. E. Bryson and Y.-C. Ho, Applied Optimal Control: Optimization, Estimation, and Control. New York: Wiley, 1975.
[2] T. Cimen and S. P. Banks, "Nonlinear optimal tracking control with application to super-tankers for autopilot design," Automatica, vol. 40, no. 11, pp. 1845–1863, Nov. 2004.
[3] N. Fukushima, M. S. Arslan, and I. Hagiwara, "An optimal control method based on the energy flow equation," IEEE Trans. Control Syst. Technol., vol. 17, no. 4, pp. 866–875, Jul. 2009.
[4] H. Ichihara, "Optimal control for polynomial systems using matrix sum of squares relaxations," IEEE Trans. Autom. Control, vol. 54, no. 5, pp. 1048–1053, May 2009.
[5] S. Keerthi and E. Gilbert, "Optimal infinite-horizon control and the stabilization of linear discrete-time systems: State-control constraints and nonquadratic cost functions," IEEE Trans. Autom. Control, vol. 31, no. 3, pp. 264–266, Mar. 1986.
[6] I. Kioskeridis and C. Mademlis, "A unified approach for four-quadrant optimal controlled switched reluctance machine drives with smooth transition between control operations," IEEE Trans. Autom. Control, vol. 24, no. 1, pp. 301–306, Jan. 2009.
[7] J. Mao and C. G. Cassandras, "Optimal control of multi-stage discrete event systems with real-time constraints," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 108–123, Jan. 2009.
[8] I. Necoara, E. C. Kerrigan, B. D. Schutter, and T. Boom, "Finite-horizon min-max control of max-plus-linear systems," IEEE Trans. Autom. Control, vol. 52, no. 6, pp. 1088–1093, Jun. 2007.
[9] T. Parisini and R. Zoppoli, "Neural approximations for infinite-horizon optimal control of nonlinear stochastic systems," IEEE Trans. Neural Netw., vol. 9, no. 6, pp. 1388–1408, Nov. 1998.
[10] G. N. Saridis and F. Y. Wang, "Suboptimal control of nonlinear stochastic systems," Control-Theory Adv. Technol., vol. 10, no. 4, pp. 847–871, Dec. 1994.
[11] C. Seatzu, D. Corona, A. Giua, and A. Bemporad, "Optimal control of continuous-time switched affine systems," IEEE Trans. Autom. Control, vol. 51, no. 5, pp. 726–741, May 2006.
[12] K. Uchida and M. Fujita, "Finite horizon H∞ control problems with terminal penalties," IEEE Trans. Autom. Control, vol. 37, no. 11, pp. 1762–1767, Nov. 1992.
[13] E. Yaz, "Infinite horizon quadratic optimal control of a class of nonlinear stochastic systems," IEEE Trans. Autom. Control, vol. 34, no. 11, pp. 1176–1180, Nov. 1989.
[14] F. Yang, Z. Wang, G. Feng, and X. Liu, "Robust filtering with randomly varying sensor delay: The finite-horizon case," IEEE Trans. Circuits Syst. I, vol. 56, no. 3, pp. 664–672, Mar. 2009.
[15] E. Zattoni, "Structural invariant subspaces of singular Hamiltonian systems and nonrecursive solutions of finite-horizon optimal control problems," IEEE Trans. Autom. Control, vol. 53, no. 5, pp. 1279–1284, Jun. 2008.
[16] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar, Convex Analysis and Optimization. Boston, MA: Athena Scientific, 2003.
[17] J. Doyle, K. Zhou, K. Glover, and B. Bodenheimer, "Mixed H2 and H∞ performance objectives II: Optimal control," IEEE Trans. Autom. Control, vol. 39, no. 8, pp. 1575–1587, Aug. 1994.
[18] L. Blackmore, S. Rajamanoharan, and B. C. Williams, "Active estimation for jump Markov linear systems," IEEE Trans. Autom. Control, vol. 53, no. 10, pp. 2223–2236, Nov. 2008.
[19] O. L. V. Costa and E. F. Tuesta, "Finite horizon quadratic optimal control and a separation principle for Markovian jump linear systems," IEEE Trans. Autom. Control, vol. 48, no. 10, pp. 1836–1842, Oct. 2003.
[20] P. J. Goulart, E. C. Kerrigan, and T. Alamo, "Control of constrained discrete-time systems with bounded ℓ2 gain," IEEE Trans. Autom. Control, vol. 54, no. 5, pp. 1105–1111, May 2009.
[21] J. H. Park, H. W. Yoo, S. Han, and W. H. Kwon, "Receding horizon controls for input-delayed systems," IEEE Trans. Autom. Control, vol. 53, no. 7, pp. 1746–1752, Aug. 2008.
[22] A. Zadorojniy and A. Shwartz, "Robustness of policies in constrained Markov decision processes," IEEE Trans. Autom. Control, vol. 51, no. 4, pp. 635–638, Apr. 2006.
[23] H. Zhang, L. Xie, and G. Duan, "H∞ control of discrete-time systems with multiple input delays," IEEE Trans. Autom. Control, vol. 52, no. 2, pp. 271–283, Feb. 2007.
[24] R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957.
[25] P. J. Werbos, "A menu of designs for reinforcement learning over time," in Neural Networks for Control, W. T. Miller, R. S. Sutton, and P. J. Werbos, Eds. Cambridge, MA: MIT Press, 1991, pp. 67–95.
[26] P. J. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, D. A. White and D. A. Sofge, Eds. New York: Reinhold, 1992, ch. 13.
[27] A. Al-Tamimi, M. Abu-Khalaf, and F. L. Lewis, "Adaptive critic designs for discrete-time zero-sum games with application to H∞ control," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 37, no. 1, pp. 240–247, Feb. 2007.
[28] S. N. Balakrishnan and V. Biega, "Adaptive-critic-based neural networks for aircraft optimal control," J. Guidance, Control, Dynamics, vol. 19, no. 4, pp. 893–898, Jul.–Aug. 1996.
[29] D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sep. 1997.
[30] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Jun. 2009.
[31] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Trans. Syst., Man, Cybern., Part C: Appl. Rev., vol. 32, no. 2, pp. 140–153, May 2002.
[32] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009.
[33] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 943–949, Aug. 2008.
[34] S. Ferrari, J. E. Steck, and R. Chandramohan, "Adaptive feedback control by constrained approximate dynamic programming," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 982–987, Aug. 2008.
[35] J. Seiffertt, S. Sanyal, and D. C. Wunsch, "Hamilton-Jacobi-Bellman equations and approximate dynamic programming on time scales," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 918–923, Aug. 2008.
[36] R. Enns and J. Si, "Helicopter trimming and tracking control using direct neural dynamic programming," IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 929–939, Jul. 2003.
[37] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[38] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[39] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779–791, May 2005.
[40] Z. Chen and S. Jagannathan, "Generalized Hamilton-Jacobi-Bellman formulation-based neural network control of affine nonlinear discrete-time systems," IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 90–106, Jan. 2008.
[41] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 631–647, May 2007.
[42] G. G. Lendaris, "A retrospective on adaptive dynamic programming for control," in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, Jun. 2009, pp. 14–19.
[43] B. Li and J. Si, "Robust dynamic programming for discounted infinite-horizon Markov decision processes with uncertain stationary transition matrices," in Proc. IEEE Symp. Approx. Dyn. Program. Reinforcement Learn., Honolulu, HI, Apr. 2007, pp. 96–102.
[44] D. Liu, X. Xiong, and Y. Zhang, "Action-dependent adaptive critic designs," in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 2. Washington, D.C., Jul. 2001, pp. 990–995.
[45] D. Liu and H. Zhang, "A neural dynamic programming approach for learning control of failure avoidance problems," Int. J. Intell. Control Syst., vol. 10, no. 1, pp. 21–32, Mar. 2005.
[46] D. Liu, Y. Zhang, and H. Zhang, "A self-learning call admission control scheme for CDMA cellular networks," IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1219–1228, Sep. 2005.
[47] C. Lu, J. Si, and X. Xie, "Direct heuristic dynamic programming for damping oscillations in a large power system," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 1008–1013, Aug. 2008.
[48] S. Shervais, T. T. Shannon, and G. G. Lendaris, "Intelligent supply chain management using adaptive critic learning," IEEE Trans. Syst., Man, Cybern., Part A: Syst. Humans, vol. 33, no. 2, pp. 235–244, Mar. 2003.
[49] P. Shih, B. C. Kaul, S. Jagannathan, and J. A. Drallmeier, "Reinforcement-learning-based dual-control methodology for complex nonlinear discrete-time systems with application to spark engine EGR operation," IEEE Trans. Neural Netw., vol. 19, no. 8, pp. 1369–1388, Aug. 2008.
[50] J. Si and Y.-T. Wang, "On-line learning control by association and reinforcement," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 264–276, Mar. 2001.
[51] A. H. Tan, N. Lu, and D. Xiao, "Integrating temporal difference methods and self-organizing neural networks for reinforcement learning with delayed evaluative feedback," IEEE Trans. Neural Netw., vol. 19, no. 2, pp. 230–244, Feb. 2008.
[52] F. Y. Wang and G. N. Saridis, "Suboptimal control for nonlinear stochastic systems," in Proc. 31st IEEE Conf. Decis. Control, Tucson, AZ, Dec. 1992, pp. 1856–1861.
[53] Q. L. Wei, H. G. Zhang, and J. Dai, "Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions," Neurocomputing, vol. 72, nos. 7–9, pp. 1839–1848, Mar. 2009.
[54] P. J. Werbos, "Using ADP to understand and replicate brain intelligence: The next level design," in Proc. IEEE Symp. Approx. Dyn. Program. Reinforcement Learn., Honolulu, HI, Apr. 2007, pp. 209–216.
[55] P. J. Werbos, "Intelligence in the brain: A theory of how it works and how to build it," Neural Netw., vol. 22, no. 3, pp. 200–212, Apr. 2009.
[56] H. G. Zhang, Y. H. Luo, and D. Liu, "Neural network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraint," IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009.
[57] H. G. Zhang, Q. L. Wei, and Y. H. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 38, no. 4, pp. 937–942, Aug. 2008.
[58] C. Watkins, "Learning from delayed rewards," Ph.D. dissertation, Dept. Comput. Sci., Cambridge Univ., Cambridge, U.K., 1989.
[59] F. Y. Wang and G. N. Saridis, "On successive approximation of optimal control of stochastic dynamic systems," in Modeling Uncertainty: An Examination of Stochastic Theory, Methods, and Applications, M. Dror, P. Lécuyer, and F. Szidarovszky, Eds. Boston, MA: Kluwer, 2002, pp. 333–386.
[60] D. Han and S. N. Balakrishnan, "State-constrained agile missile control with adaptive-critic-based neural networks," IEEE Trans. Control Syst. Technol., vol. 10, no. 4, pp. 481–489, Jul. 2002.
[61] E. S. Plumer, "Optimal control of terminal processes using neural networks," IEEE Trans. Neural Netw., vol. 7, no. 2, pp. 408–418, Mar. 1996.
[62] N. Jin, D. Liu, T. Huang, and Z. Pang, "Discrete-time adaptive dynamic programming using wavelet basis function neural networks," in Proc. IEEE Symp. Approx. Dyn. Program. Reinforcement Learn., Honolulu, HI, Apr. 2007, pp. 135–142.
Fei-Yue Wang (S’87–M’89–SM’94–F’03) received the Ph.D. degree in computer and systems engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1990. He joined the University of Arizona, Tucson, in 1990, and became a Professor and Director of the Robotics and Automation Laboratory and the Program for Advanced Research in Complex Systems. In 1999, he founded the Intelligent Control and Systems Engineering Center at the Chinese Academy of Sciences (CAS), Beijing, China, with the support of the Outstanding Overseas Chinese Talents Program. Since 2002, he has been the Director of the Key Laboratory of Complex Systems and Intelligence Science at CAS. Currently, he is a Vice-President of the Institute of Automation, CAS. His current research interests include social computing, web science, complex systems, and intelligent control. Dr. Wang is a member of Sigma Xi and an elected Fellow of the International Council on Systems Engineering, the International Federation of Automatic Control, the American Society of Mechanical Engineers (ASME), and the American Association for the Advancement of Science. He was the Editor-in-Chief of the International Journal of Intelligent Control and Systems and the World Scientific Series in Intelligent Control and Intelligent Automation
from 1995 to 2000. Currently, he is the Editor-in-Chief of IEEE Intelligent Systems and the IEEE Transactions on Intelligent Transportation Systems. He has served as Chair of more than 20 conferences of the IEEE, the Association for Computing Machinery (ACM), the Institute for Operations Research and the Management Sciences, and ASME. He was the President of the IEEE Intelligent Transportation Systems Society from 2005 to 2007, the Chinese Association for Science and Technology, Pittsburgh, PA, in 2005, and the American Zhu Kezhen Education Foundation from 2007 to 2008. Currently, he is the Vice-President of the ACM China Council and Vice-President/Secretary-General of the Chinese Association of Automation. In 2007, he received the National Prize in Natural Sciences of China and was awarded the Outstanding Scientist Award by ACM for his work in intelligent control and social computing.
Ning Jin (S’06) received the Ph.D. degree in electrical and computer engineering from the University of Illinois, Chicago, in 2005. He was an Associate Professor in the Department of Mathematics at Nanjing Normal University, Nanjing, China. From 2002 to 2005, he was a Visiting Scholar in the Department of Mathematics, Statistics, and Computer Science, University of Illinois. His current research interests include optimal control and dynamic programming, artificial intelligence, pattern recognition, neural networks, and wavelet analysis.
Derong Liu (S’91–M’94–SM’96–F’05) received the Ph.D. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, in 1994. He was a Staff Fellow with the General Motors Research and Development Center, Warren, MI, from 1993 to 1995. He was an Assistant Professor in the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, from 1995 to 1999. He joined the University of Illinois, Chicago, in 1999, and became a Full Professor of electrical and computer engineering and of computer science in 2006. He was selected for the “100 Talents Program” by the Chinese Academy of Sciences in 2008. Dr. Liu has been an Associate Editor of several IEEE publications. Currently, he is the Editor-in-Chief of the IEEE Transactions on Neural Networks and an Associate Editor of the IEEE Transactions on Control Systems Technology. He received the Michael J. Birck Fellowship from the University of Notre Dame in 1990, the Harvey N. Davis Distinguished Teaching Award from the Stevens Institute of Technology in 1997, the Faculty Early Career Development Award from the National Science Foundation in 1999, the University Scholar Award from the University of Illinois in 2006, and the Overseas Outstanding Young Scholar Award from the National Natural Science Foundation of China in 2008.
Qinglai Wei received the B.S. degree in automation, the M.S. degree in control theory and control engineering, and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2002, 2005, and 2008, respectively. He is currently a Post-Doctoral Fellow with the Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include neural-networks-based control, nonlinear control, adaptive dynamic programming, and their industrial applications.
Solving Nonstationary Classification Problems with Coupled Support Vector Machines Guillermo L. Grinblat, Lucas C. Uzal, H. Alejandro Ceccatto, and Pablo M. Granitto
Abstract— Many learning problems may vary slowly over time; this is the case, in particular, in some critical real-world applications. When facing such a problem, it is desirable that the learning method be able to find the correct input–output function, detect the change in the concept, and adapt to it. We introduce the time-adaptive support vector machine (TA-SVM), a new method for generating adaptive classifiers, capable of learning concepts that change with time. The basic idea of TA-SVM is to use a sequence of classifiers, each one appropriate for a small time window but, in contrast to other proposals, learning all the hyperplanes in a global way. We show that the addition of a new term in the cost function of the set of SVMs (that penalizes the diversity between consecutive classifiers) produces a coupling of the sequence that allows TA-SVM to learn as a single adaptive classifier. We evaluate different aspects of the method using appropriate drifting problems. In particular, we analyze the regularizing effect of changing the number of classifiers in the sequence or adapting the strength of the coupling. A comparison with other methods in several problems, including the well-known STAGGER dataset and the real-world electricity pricing domain, shows the good performance of TA-SVM in all tested situations.

Index Terms— Adaptive methods, drifting concepts, support vector machine.
I. INTRODUCTION

IN MANY real-world applications, pattern recognition problems may vary slowly over time. For example, the weather conditions under which meteorological alerts should be raised are seasonal, or the state of a critical mechanical system that should trigger an alarm could change with the wear of the machine. In most cases, the underlying causes and characteristics of these slow changes are not evident from the data under analysis. Under such circumstances, it is desirable for the pattern recognition method to be able to learn related but distinct input–output functions at different epochs and, in particular, to have the flexibility to do it in a continuous way, profiting from the slow-drift property and thereby harnessing information from the entire historical database. In the next section, we review some previous works on this topic, which is sometimes called “drifting concepts”
Manuscript received February 16, 2010; revised July 18, 2010; accepted September 22, 2010. Date of publication November 9, 2010; date of current version January 4, 2011. This work was supported in part by the ANPCyT under Grant PICT-2006 643 and Grant 2226. The authors are with CIFASIS, the French Argentine International Center for Information and Systems Sciences, UPCAM (France) / UNR-CONICET (Argentina), Rosario S2000EZP, Argentina (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2010.2083684
[1]–[4]. In this context, some authors distinguish sudden or instantaneous drift from gradual change [5], [6]. As Stanley [5] points out, the two problems present very different challenges. Algorithms appropriate for sudden concept changes [7]–[13] should be fast in detecting the change and react to it in an appropriate way. In gradual drift [4], [14]–[18], on the other hand, there is no need for a rapid reaction, and the interesting problem is how to use efficiently the information from the full dataset. Our method focuses on this latter kind of problem, and in particular on situations with scarce data, but it also works efficiently for problems with a sudden change, as we will show later. Most previous approaches to handling concept drift rely on the use of “local” classifiers, each one fitted or adapted to a particular temporal window of a given length [2], [7]–[9], [19], [20]. As we discuss in Section II, the methods differ in how they select the length of the window, in how they weigh the selected samples, or even in how they use the set of classifiers (some methods keep several classifiers in an ensemble, others use only the classifier corresponding to the current window). Here we present a new approach to this problem: the use of a sequence of classifiers that vary following the concept change, but which are all fitted in a global way. To build the sequence of classifiers, we selected one of the most powerful methods available nowadays, the support vector machine (SVM) [21], [22], which we adapted accordingly. As in most previous methods, each SVM in the sequence is trained using data points from only one of a set of consecutive nonoverlapping time windows. The novelty of our method is that the classifiers in the sequence are not independent. We solve all the SVMs at the same time, using a coupling term that forces time neighbors to be similar to each other. In our method, the interval of validity of each classifier can be as small as needed to follow the change in the concept, but with reduced overfitting, because the classifiers are trained to minimize a global measure of the error instead of being adjusted locally. In a previous work [23], we introduced a limited version of this method and showed its potential using an artificial drifting problem. In this paper, we describe an extended version of our algorithm that can use fewer classifiers than points in the dataset,1 producing more robust and efficient solutions. Based on the ideas of [4], we evaluate the new method in three different settings for drifting concepts: estimation, prediction, and extrapolation.

1 The previous version was limited to using one SVM for each point in the training sequence.
In the estimation task, we train a sequence of classifiers using a given dataset, and then we test the sequence of classifiers on a new dataset equivalent to the training one (involving the same time span as the training set). This estimation task is appropriate, for example, for the analysis of a slowly drifting problem with cyclic behavior [24]. In this case, it is important to model accurately the whole time span of the dataset, not only the last section. Our method is particularly aimed at this task, in which one can use not only information from past records but also from records corresponding to a future time. For the prediction and extrapolation tasks, we train a sequence of classifiers on a section of a dataset and then test it on the following section. In the prediction task, we evaluate each sequence using only the next point in the dataset. In the more difficult extrapolation task, we use a subset of data points including several steps ahead in the future. In this case, we need to extrapolate the position of the decision boundary, for which we use a simple linear technique. The objective in these two tasks is to make short-term predictions on a system that is evolving in a completely unknown way. In both cases, the evaluation puts emphasis on the performance of the last classifier(s) in the sequence. We discuss the three settings again in Section IV. This paper is organized as follows. In Section II, we discuss previous works on concept drift. In Section III, we introduce our solution to the problem and show illustrative examples, leaving mathematical details to Appendix A. We also discuss the relation with similar solutions in other areas. Then, in Section IV, we present empirical results and comparisons with similar methods using artificial and real-world datasets and, finally, Section V closes the work with some remarks and conclusions.

II. PREVIOUS WORK

Drifting concepts are specific classification problems in which the labels of examples or the shape of the decision boundaries change over time. In particular, we are interested in problems that change slowly over time. In a recent work, Kolter et al. [25] include a lengthy review of the state of the art in the field, starting from the early work of Schlimmer and Granger Jr. [26]. Accordingly, in this section we limit ourselves to a brief description of most of the previous methods and refer the interested reader to [25] for details and further references. There are three main approaches to concept drift in the literature: sample selection, sample weighting, and ensemble methods [5], [6]. As we stated before, the most common solution to the drifting concepts problem is to use a temporal window of a given length, also called a sliding window (SW), and to build a different classifier (or adapt a previous one) for each window [2], [7]–[9], [19], [20]. Some authors prefer to use the equivalent idea of uniform or stationary “batches” [20], [27], [28]. If the window is too big, the response time needed by the algorithm to follow the changes is excessive. On the contrary, when the window is too small, the algorithm adapts
quickly to any drift in the data, but it is also more sensitive to noise and loses accuracy because it must learn the input–output relationship from only a few examples. As a potential solution, many algorithms include an adaptive window size. One of the first to do that was FLORA2 [8]. Klinkenberg and Renz [19] presented an algorithm that modifies the number of stationary batches in the dataset by monitoring the accuracy, recall, and precision of the method. They applied it to the problem of detecting relevant documents in a series. Klinkenberg and Joachims [2] used SVMs to find the optimal time interval. The method adjusts SVMs with various window sizes, calculates the corresponding ξα-estimator [29] using the last batch, and keeps the window size that minimizes that quantity. Castillo et al. [27] and also Lanquillon [20] use statistical quality control to determine whether there is a concept change in a given batch. When this happens, a new classifier is constructed from scratch using only the data points considered to belong to the new context. Koychev et al. [30] also used a (different) statistical test to determine whether there was a change in concept in the last batch. In an interesting series of papers, Alippi and Roveri [12], [13], [31] developed another test for concept change and an adaptive classifier based on nearest neighbors. In a work focused on recurrent systems, Koychev [32] proposed to use a relatively small time window to learn the current context, then to select the past episodes that show a high predictive accuracy, and finally to train again the classifier using the original and the newly selected data points. In a similar way, Maloof et al. [33] introduced a method for the selection of examples in a partial memory learning system. They select some extreme examples and add them to the current ones to model the actual concept description. FLORA3 [8] also takes recurrence into account. In general, all these methods select an appropriate subset of the original dataset to train independent classifiers that are, each one, accurate at the corresponding time. In most cases, the selected subset is taken from the most recent examples. As a representative of sample selection methods in the comparisons we present in Section IV, we chose a simple SW, but with the length of the window optimized using independent validation sets. In an early work, Koychev [34] proposed to decrease the importance of old examples in the classifier simply by giving each data point a relative weight that decreases with time. The method, called gradual forgetting (GF), can be viewed as a softening of the SW strategy, which gives “hard” (0/1) weights to (older/newer) examples. The author suggests using a simple linearly decreasing function for the relative weight; a minimal sketch of this weighting is shown below. Klinkenberg [28] used an exponentially decaying function to weight older samples. The GF method is simple and easy to implement, and usually gives better results than SW. We also included (linear) GF in the comparisons in Section IV as a representative of sample weighting methods.
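As an illustration only, linear GF can be implemented on top of a standard SVM through per-sample weights. The parameterization below (function name, the `slope` parameter, and the use of scikit-learn's SVC) is our own sketch, not the exact implementation of [34] or [28]:

```python
import numpy as np
from sklearn.svm import SVC

def gf_svm(X, y, slope=0.5, C=1.0):
    """Gradual forgetting sketch: one SVM fitted with linearly decaying weights."""
    n = len(y)
    age = (n - 1 - np.arange(n)) / max(n - 1, 1)  # 0 for the newest point, 1 for the oldest
    weights = 1.0 - slope * age                   # linearly decreasing relative weight
    clf = SVC(kernel="linear", C=C)
    clf.fit(X, y, sample_weight=weights)
    return clf
```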
Several authors have discussed the use of ensemble methods for drifting concepts. The streaming ensemble algorithm [10] fits an independent classifier to each batch of data; these classifiers are combined into a fixed-size ensemble using a heuristic replacement strategy. Wang et al. [35] used a similar strategy, weighting each member of the ensemble according to its accuracy on the current batch. Gao et al. [36] applied ensembles to the task of classifying data streams, and Hashemi et al. [37] used evolving one-versus-all multiclass classifiers for the same problem. Polikar and co-workers [24], [38], [39] developed an ensemble-based framework for different nonstationary problems. Recently, Kolter and Maloof [25], [40] introduced an improved ensemble method based on a previous work of Littlestone et al. [41]. Their dynamic weighted majority algorithm (DWM) dynamically creates and removes weighted experts in response to changes in performance. DWM uses four mechanisms to cope with concept drift: it trains, weights, or removes learners based on their individual performance, and also adds new experts based on the global performance of the ensemble. The authors produced an extensive evaluation and concluded that DWM outperformed most of the other learners considered in their work. DWM is also included in the evaluation in Section IV as the representative of ensemble methods. Finally, out of the scope of our work, we mention that the drifting problem has also been addressed from a computational learning theory point of view [4], [14], [16], [17], where some guarantees and theoretical bounds regarding the learning of sequences of functions were established.

III. TA-SVM

Let us assume that we have a dataset [(x1, y1), . . . , (xn, yn)], where each pair (xi, yi) was obtained at time i (that is, the pairs are time ordered), xi is a vector in a given vector space, yi = ±1, and the relation between x and y changes slowly with time. Our strategy to cope with this problem is to divide the dataset into m consecutive nonoverlapping time windows twν (with ν = 1, . . . , m and m ≤ n), and to create a coupled sequence of m (static) classifiers, each one being optimal in the corresponding time window. As we are assuming that the concept has a slow evolution, we expect the classifiers to have the same property. Accordingly, we seek a sequence of good classifiers in which time neighbors are similar to each other. The best solution to our problem should be a compromise between (individual) optimality and (neighbor) similarity. If we can define a simple distance measure d(cν, cµ) to quantify the diversity between two neighbor classifiers cν and cµ, the basic idea of our method is to minimize the two-term cost function
min (1/m) Σ_{µ=1}^{m} Errµ² + γ Σ_{µ=1}^{m−1} d(cµ, cµ+1)    (1)
where the first term is the average of the usual cost function for each of the m classifiers and the second evaluates the total difference among the sequence of discriminant functions. The free parameter γ regulates the compromise between both terms, as in any regularized fitting. In principle, this method can be used with any classifier, if an appropriate distance measure can be defined. In this formulation, we use linear SVMs as classifiers (as usual, kernels can be used to produce nonlinear predictors if needed). Therefore, we look for a sequence of m pairs (w, b), each one defining a high-margin
hyperplane hν given by wν · x + bν = 0, where x belongs to the dataset’s vector space. We use a simple quadratic distance measure to quantify the diversity between hyperplanes, d(hν, hµ) = ||wν − wµ||² + (bν − bµ)². Applying this measure to (1), we can introduce a new cost function for the full sequence of SVMs
min (1/m) Σ_{µ=1}^{m} ||wµ||² + C Σ_{i=1}^{n} ξi + (γ/(m−1)) Σ_{µ=1}^{m−1} d(hµ, hµ+1)    (2)
subject to
ξi ≥ 0,    yi (wµ(i) · xi + bµ(i)) − 1 + ξi ≥ 0

where i = 1, . . . , n and µ(i) indicates the time window including point xi. The first two terms in (2) correspond to the usual margin and error penalization terms in SVM [42], but for a complete set of classifiers, each one trained on a different time window. It is easy to see that the solution of this two-term problem gives the same sequence of SVMs that can be obtained by solving each SVM individually (if we use the same C for all SVMs). The last term in (2) corresponds to the new diversity penalization. The inclusion of this term couples the sequence, making each SVM dependent on all the others. The free parameter γ regulates the relative cost of the new term. Low γ values will almost decouple the sequence of classifiers, allowing for increased flexibility. High γ values, on the other hand, will produce a sequence of almost identical SVMs. In this formulation, we have only considered the case in which data points arrive at regular time intervals. The more general case of nonconstant intervals (including missing data or data coming in bursts) can be addressed with simple extensions, for example, by giving different relative weights to the distances considered in the second term of (1), or by assigning different numbers of points to each hyperplane (see Appendix A). It is interesting to see that this formulation is valid even for time windows including only one point (m = n), because the coupling introduced by the new penalization term removes the indetermination of having only one point to define a hyperplane. As we show in detail in Appendix A, by deriving the corresponding dual (as usual in SVM methods) we can rephrase the problem in (2) as

max_α −(1/2) αᵀ R α + Σ_i αi
subject to
0 ≤ αi ≤ C,    Σ_i αi yi = 0
where αi are the Lagrange multipliers and R is a matrix with kernel properties. The solution to this maximization problem is a coupled set of SVMs that evolve in time, which we call time-adaptive SVMs (TA-SVMs).
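The matrix R is derived in Appendix A and is not reproduced here. Purely as an illustration of how the coupling in (2) ties the windows together, the linear primal can also be attacked directly with a subgradient method; the following sketch (the function name, learning rate, and initialization are our own choices, not the authors' dual solver) returns one hyperplane per window:

```python
import numpy as np

def ta_svm_subgradient(X, y, m, C=1.0, gamma=1.0, lr=1e-3, epochs=2000, seed=0):
    """Subgradient sketch of the coupled primal (2) for linear classifiers.

    X: (n, d) time-ordered inputs; y: (n,) labels in {-1, +1}.
    Returns W (m, d) and b (m,), one hyperplane per time window.
    """
    n, d = X.shape
    win = (np.arange(n) * m) // n                 # window index mu(i) for each point
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1e-3, size=(m, d))
    b = np.zeros(m)
    for _ in range(epochs):
        gW = 2.0 * W / m                          # gradient of (1/m) sum ||w_mu||^2
        gb = np.zeros(m)
        margins = y * ((X * W[win]).sum(axis=1) + b[win])
        viol = margins < 1.0                      # points with nonzero hinge loss
        np.add.at(gW, win[viol], -C * y[viol, None] * X[viol])
        np.add.at(gb, win[viol], -C * y[viol])
        if m > 1:                                 # coupling gamma/(m-1) sum d(h, h')
            coef = 2.0 * gamma / (m - 1)
            dW, db = np.diff(W, axis=0), np.diff(b)
            gW[:-1] -= coef * dW
            gW[1:] += coef * dW
            gb[:-1] -= coef * db
            gb[1:] += coef * db
        W -= lr * gW
        b -= lr * gb
    return W, b
```

In practice, the authors solve the dual with standard SVM machinery; the sketch above only makes the role of the coupling term explicit.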
The time complexity of a TA-SVM is similar to that of the plain SVM. It can be analyzed in two stages: the kernel computation and the solution of the optimization problem. The new kernel can be computed in O(n²), as we show in Appendix B. What is left after this is solving a conventional SVM optimization problem, which is also O(n²) using, for example, sequential minimal optimization (SMO) [43]. Overall, TA-SVM, in its basic form, has the same scaling problems as plain SVM. For the estimation task this is not critical, as one usually needs to solve the problem only once. But basic TA-SVM does not work by making updates of previous solutions; it always looks for a new global solution. Then, for the prediction and extrapolation tasks, our method requires solving an O(n_t²) problem each time a new batch arrives, where n_t is the total number of instances at time t.

A. Connections with Other Areas

There are connections between TA-SVM and methods from other areas, so it is interesting to discuss them at this point. In online learning [41], [44], [45], the objective is to learn the correct input–output relationship as fast as possible from data that arrive sequentially, one instance at a time. Algorithms for online learning mostly work by making updates from the solution at the previous step. Many concept drift methods, including for example ensemble methods or the FLORA series described before, also update their classifiers as a function of the last batch. Some of these methods can use very short batches or even learn from one point at a time, making them closer to online learning strategies. The main difference between the two settings is that online learning does not necessarily track concept changes, and therefore in most cases no effort is made to forget out-of-date information. In particular, algorithms for sudden drift are more related to online learning, given their common objective of fast convergence to the optimal solution. In a recent work, [46] introduced an online learning algorithm for kernel methods, the passive-aggressive (PA) algorithm. At each step, PA modifies its solution trying to give the right label to the newly arrived instance while keeping the solution similar to the previous step by using a coupling term similar to our formulation (2); a sketch of this update is shown below. The general idea of keeping a tradeoff between local accuracy and smoothness of the time evolution of the solution is similar to ours, but with two main differences. One is determined by the different objectives of the methods. TA-SVM is able to use information from both past and future times (when available), because it fits (offline) the full sequence of SVMs at the same time, while PA looks for the best current solution using only past information. This becomes particularly relevant for the estimation task, as defined before. The second difference is that TA-SVM looks actively for a maximum margin solution, while PA does not update its solution if the margin of new points is greater than a given value, because of its passive nature. Other authors have also used the idea of constraining the possible updates of the current solution [47], [48]. For example, [49] projects the update to a convex set, which allows establishing shifting bounds on the total loss for some general additive regression algorithms.
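For reference, the classical PA-I variant of this update (the textbook form from the passive-aggressive literature, not code from this paper) is:

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One passive-aggressive (PA-I) step on a new instance (x, y)."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on the new point
    if loss == 0.0:
        return w                             # passive: margin already sufficient
    tau = min(C, loss / np.dot(x, x))        # aggressive step, capped by C
    return w + tau * y * x
```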
TA-SVM is in fact more related to the area of multiple task learning (MTL) [50], [51]. MTL algorithms are designed to learn several related tasks at the same time, profiting from the relevant information available in the different tasks. Our method makes use of the same idea. In TA-SVM, each of the related tasks is the classification problem corresponding to a given time window. As we are assuming a gradual drift, time-neighbor tasks should be similar to each other. Recently, [52] introduced a framework for MTL with kernel methods. In their formulation, the relation between the different tasks can be regulated using an appropriate homogeneous quadratic function. TA-SVM can be viewed as a particular case of this formulation, using the first and third terms of (2) as the regularizer. As we mentioned in the introduction, we use three different tasks to evaluate TA-SVM, namely estimation, prediction, and extrapolation. The estimation task is more related to the MTL problem, giving the same importance to all problems, while the prediction and extrapolation tasks are more tied to the online learning problem and its emphasis on the current hypothesis.

B. Illustrative Example

As a first example of the potential of TA-SVM, we apply it to the artificial sliding Gaussians dataset. This is a two-class problem, in which each class is sampled from a Gaussian distribution. Both classes drift together, following a sinusoidal trajectory on a 2-D input space. We generated n = 500 points according to

xi = ( 2iπ/500 − π + 0.2 yi + ε1 , sin(2iπ/500 − π + 0.2 yi) + ε2 )
where i = 1, . . . , 500, ε1,2 are sampled from a normal distribution with zero mean and σ = 0.1, and yi is a balanced random sequence of ±1 (a generator for this dataset is sketched after Fig. 1). Fig. 1 shows a realization of the dataset at three different times. We used the first 450 points as training set, and generated in each case a second realization of 450 points to use as validation set, in order to select the optimal values of γ, C, and the length l of the window used by SW. Fig. 2 shows the sequence of hyperplanes obtained with TA-SVM (m = n) and SW-SVM. In this latter case, for each point xi we trained an SVM using a specific time window of length 2l + 1 centered on that point (when this is not possible, here and in all other experiments we used l points from one side and all the available points from the other); a sketch of this baseline is given after Fig. 3. To improve the readability of the figure, we show only one in every ten consecutive SVMs. It is evident that the coupled solution of TA-SVM produces a more regular, less noisy sequence of classifiers than the use of independent optimal SWs.

C. Dependence on γ

As a second demonstrative example, we evaluated the dependence on γ of TA-SVM solutions. In this case, we used the rotating hyperplane dataset, which is a set of 500 points sampled from a uniform distribution in a d-dimensional hypercube [−1, 1]^d [35], [53], [54].
Fig. 1. Sliding Gaussians dataset at (a) t = 25, (b) t = 175, and (c) t = 475 time units. In each figure, the last 25 generated points are filled.
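A generator for the sliding Gaussians data, under our reading of the reconstructed law above (we assume the class offset 0.2·y enters both coordinates through the trajectory parameter, and that n is even), might look like:

```python
import numpy as np

def sliding_gaussians(n=500, sigma=0.1, seed=0):
    """Two Gaussian classes sliding together along a sinusoidal trajectory."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, n + 1)
    y = rng.permutation(np.repeat([-1, 1], n // 2))  # balanced random labels (n even)
    t = 2 * i * np.pi / n - np.pi + 0.2 * y          # drifting trajectory parameter
    x1 = t + rng.normal(0.0, sigma, n)
    x2 = np.sin(t) + rng.normal(0.0, sigma, n)
    return np.column_stack([x1, x2]), y
```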
Fig. 2. Sequence of hyperplanes obtained as a solution of the sliding Gaussians dataset with (a) TA-SVM and (b) SW-SVM.

Fig. 3. TA-SVM solutions for the rotating hyperplane problem using different γ values.
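The SW-SVM baseline of Fig. 2 (one SVM per point, window truncated at the borders) can be sketched as follows, assuming scikit-learn is available and every window contains both classes:

```python
import numpy as np
from sklearn.svm import SVC

def sw_svm_sequence(X, y, l, C=1.0):
    """One linear SVM per point, trained on a window of length 2l + 1 centered
    on that point (truncated near the borders, as described in the text)."""
    n = len(y)
    classifiers = []
    for i in range(n):
        lo, hi = max(0, i - l), min(n, i + l + 1)
        clf = SVC(kernel="linear", C=C)
        clf.fit(X[lo:hi], y[lo:hi])  # assumes both classes appear in the window
        classifiers.append(clf)
    return classifiers
```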
The decision boundary for the two classes is a slowly rotating hyperplane passing through the origin. The direction of the hyperplane is defined by its normal vector v, which in this experiment follows the law

v1(i) = cos(2πi/500)/10,    v2(i) = sin(2πi/500),    v3,...,d(i) = 0.

Point xi has class yi = sign(xi · v(i)). In this first experiment with this dataset, we used d = 2, m = n, and three different γ values that show the typical responses of our method. The C parameter was set to 1 in all cases, because we checked that the solutions are almost independent of C in this problem. In Fig. 3, we plot the real angle between v and the first axis as a function of time. We also show the solutions found by TA-SVM for three values of γ. Using a low γ (∼10², dashed line), the TA-SVM solution is too flexible, following particularities
of the training set. For an adequate mid-γ value (∼10⁵, dash-dotted line), there is an optimal solution, with a balance between local accuracy and global flexibility. Last, for a high γ (∼10⁸, dotted line), the change over time of the hyperplane is highly penalized; therefore, it remains almost constant and similar to the solution that can be found by a classical SVM. For this particular dataset, as v does a complete turn, both classes are almost uniformly distributed and the optimal static solution is a null vector. The soft and erratic trajectory of TA-SVM corresponds to the angle of a nearly null vector in this case.

D. Dependence on m

In the previous example, we used the maximum flexibility of TA-SVM, m = n, and regularized it by optimizing the value of γ. TA-SVM has another simple way to control its complexity, i.e., to use a shorter sequence of SVMs (m < n), with the added advantage of a reduced computational burden. In Fig. 4, we show the evaluation of this possibility. Again, we used 100 realizations of the rotating hyperplane dataset, with d = 2. We considered three settings: one classifier for each training point (m = n, full line), one classifier for every two data points (m = n/2, dotted line), and one for every eight points (m = n/8, dashed line). In (a), we show the corresponding results as a function of γ. The first observable result is that the optimal values of γ decrease when using fewer classifiers. This is easy to explain considering that shorter sequences are naturally less flexible (simply because there are fewer hyperplanes and hence fewer
Fig. 4. Test errors for the rotating hyperplane problem as a function of γ. (a) Results using d = 2 and different values of m. (b) Same as before, but for a noisy dataset with 10% flipped labels.
adjustable parameters) and thus require a lower diversity penalization. A second observation is that the best result is obtained with m = n, and that in this case the use of fewer SVMs produces a small decrease in performance. This result is a consequence of using a noiseless dataset, as will become clear when analyzing (b) of the same figure. Horizontal lines represent the error rates produced by SW-SVM with different lengths (l). It is interesting to note that in all cases there is a wide range of γ values for which TA-SVM outperforms the results of the optimal l. We repeated the full experiment using a noisy dataset, in which 10% of the labels were randomly switched. We show the corresponding results in Fig. 4(b). Qualitatively, the results are similar to the noiseless dataset. TA-SVM results are better than SW-SVM over a wide range of γ values. The only difference is that in this noisy case the best performance is obtained using eight points per hyperplane (m = n/8, dashed line). In this case, the higher flexibility of the m = n models allows them to learn some noisy characteristics of the dataset, and the problem cannot be avoided using a stronger coupling. Using more points per hyperplane allows TA-SVM to filter some noise locally, at each SVM, thereby improving the performance of the sequence.

E. STAGGER

As a last example, we applied TA-SVM to the most used benchmark in drifting concepts methods [8], [30], [32]–[34],
i.e., the STAGGER dataset [26]. The dataset has three categorical inputs, each one taking three possible values (a sketch of the standard concepts is given below). The dataset has 120 training instances, and the concept changes abruptly two times, once every 40 instances. This is a particularly challenging problem for TA-SVM, because there are only sudden drifts of the concept in this case. With this dataset, we are demonstrating the capabilities of our method in the most unfavorable situation. We first generated the training sequence of 120 data points, each time sampling with repetition from the set of 27 possible instances, and labeling each point with the right concept for each time step. As in [8], we generated a similar test sequence but with 100 points at each one of the 120 time steps. We also generated a third sequence with 100 points at each time step to use as a validation sequence.2 At each time step i, we trained a sequence of classifiers using x1,...,i, and used the last classifier in the sequence to predict each one of the 100 test points at step i (i.e., the prediction setting). For both methods, we used one SVM per training instance (n = m) and the independent validation set to select the optimal γ and l values for the full sequence, as in previous datasets. Again, we used a fixed C value in this noiseless dataset, because we verified that both methods are nearly independent of C also in this case. The full experiment was repeated 100 times. In Fig. 5, we show the average accuracy of both methods as a function of time. In both (a) and (b), the two vertical dashed lines correspond to the concept changes. In (a), we show the performance of independent SWs for three different lengths. For a short window, there is a quick response to changes, but there is also a lack of information about the concept. On the contrary, for the biggest window the adaptation times are longer, but the final performance is better. The optimal length compromises between both situations. In (b), we compare the optimal settings for both methods. TA-SVM is equivalent to or better than independent SW-SVM even in this (most unfavorable) dataset.

IV. EMPIRICAL EVALUATION

In this section, we compare the new method with other state-of-the-art strategies for drifting problems under the three settings discussed in the introduction: prediction, estimation, and extrapolation. We use the same artificial datasets and the STAGGER concepts introduced in the previous section, and the real-world electricity pricing dataset [55]. Unless stated otherwise, we report the mean classification error, with its standard deviation, over 100 independent realizations of the training sets. In all cases, we use independent validation sets to optimize the internal parameters of the methods under evaluation. We always use the same realizations of the training, validation, and test sets for all the methods.

2 The use of a validation sequence is not typical in the previous literature on the STAGGER dataset. The setting we use in this example is therefore not comparable with previous works. We use it as a demonstration of the capabilities of TA-SVM in its optimal setting. In Section IV, we use this dataset to evaluate TA-SVM in the standard setting.
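For reference, the three classic STAGGER concepts from [26] are size = small ∧ color = red; color = green ∨ shape = circular; and size = medium ∨ large. A generator for the training sequence might look like this (the integer attribute encoding is our own choice):

```python
import numpy as np

def stagger_sequence(steps=120, seed=0):
    """STAGGER training sequence: three categorical attributes, concept
    switching every 40 steps (the classic concepts from [26])."""
    rng = np.random.default_rng(seed)
    # encodings: size 0/1/2 = small/medium/large, color 0/1/2 = red/green/blue,
    # shape 0/1/2 = square/circular/triangular
    size, color, shape = (rng.integers(0, 3, steps) for _ in range(3))
    concepts = [
        lambda s, c, sh: (s == 0) & (c == 0),   # small and red
        lambda s, c, sh: (c == 1) | (sh == 1),  # green or circular
        lambda s, c, sh: (s == 1) | (s == 2),   # medium or large
    ]
    y = np.empty(steps, dtype=int)
    for i in range(steps):
        y[i] = 1 if concepts[min(i // 40, 2)](size[i], color[i], shape[i]) else -1
    return np.column_stack([size, color, shape]), y
```

For a linear SVM, the three categorical attributes would typically be one-hot encoded.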
Fig. 5. Results for the STAGGER dataset. (a) Behavior of independent SVMs using overlapping sliding time windows. (b) Direct comparison of optimal settings of both methods.

Fig. 6. Prediction test errors as a function of the dataset dimension for the rotating hyperplane problem.
A. Prediction

In the prediction task, we are given a subset of data points spanning some period of time, and our goal is to predict the next arriving data point (which should have no drift from the last one). To evaluate the different methods in this case, we use the same settings as for the STAGGER dataset in the previous section. We compare TA-SVM with the three other methods described in Section II: SW, GF, and DWM, in all cases using linear SVMs as classifiers. In a first evaluation we use the rotating hyperplane dataset, but with a uniform rotation in this case: vi = (cos(2πi/500), sin(2πi/500)). In all cases, we generated 100 training sets with 500 points, and for each time i and each training set we generated an independent validation and test sequence with 100 points. We use different values of d in order to evaluate the performance of the methods in high-dimensional spaces. For each training set (and all the corresponding validation and test sets), we applied a random rotation of the original space, to produce a dataset in which all the variables are relevant to the concept. We use one classifier per data point (which is the most flexible setup) for the four methods. For TA-SVM, this means that we set m = n (which could be not optimal, as we showed in Fig. 4). In the case of SW-SVM and GF-SVM, for each point xi we fit an SVM using a specific time window of length 2l + 1 centered on that point. It is worth mentioning that in this case consecutive time windows are almost completely overlapped. For DWM-SVM,
this is the default setting where the ensemble is updated at each time step. In Fig. 6, we compare the performance of all methods as a function of the number of dimensions in the dataset. SW and GF are the best methods for d = 2, probably because of some small overfitting due to the nonoptimal setting of TA-SVM. On the other hand, they are clearly more affected by the increase in the number of dimensions. This can be explained considering that, even if SVMs are known to work well in high-dimensional spaces, there are always more chances of producing solutions with bad generalization in this case. SW and GF can use bigger time windows (that is, more training points) to increase their performance, but this has the added cost of allowing a bigger concept change in the considered window. TA-SVM can deal more efficiently with this problem, because it searches for a global solution of the problem, sharing information among all classifiers as a consequence of the coupling. DWM starts with a relatively low performance, but it is also less affected by the “curse of dimensionality” than SW and GF, probably because it also shares information among the sequence of classifiers, as we discussed before. We repeated the evaluation using the rotating Gaussians dataset. This dataset is very similar to the previous one. The main difference is that the classes are sampled not from uniform distributions at each side of the hyperplane but from normal distributions centered at +vi/2 and −vi/2, both with σ = 0.3 (a generator for this drift law is sketched below). The optimal solution of this problem is more difficult for SVMs because it has some class overlapping and fewer points on the decision boundary. We consider two scenarios here. In the first one, all other settings of the problem are equal to the previous case. The corresponding results are shown in Fig. 7(a). Qualitatively, the behavior of all four methods is the same as in the previous dataset. There is a bigger gap for low-dimensional problems and also a bigger decay of the performance of SW and GF for high-dimensional datasets. In the second scenario, we considered a faster drift of the classes, including two full turns of the Gaussians around the origin in the same time span. In this case, we generated sequences of 500 points using vi = (cos(2πi/250), sin(2πi/250)).
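A sketch of this drift law, with turns = 1 reproducing the first scenario and turns = 2 the faster one (the label sampling below is our own assumption):

```python
import numpy as np

def rotating_gaussians(n=500, d=2, sigma=0.3, turns=1, seed=0):
    """Classes centered at +v_i/2 and -v_i/2, with v rotating uniformly in the
    first two dimensions (requires d >= 2)."""
    rng = np.random.default_rng(seed)
    theta = 2 * np.pi * turns * np.arange(1, n + 1) / n
    v = np.zeros((n, d))
    v[:, 0], v[:, 1] = np.cos(theta), np.sin(theta)
    y = rng.choice([-1, 1], size=n)
    X = y[:, None] * v / 2 + rng.normal(0.0, sigma, size=(n, d))
    return X, y
```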
Fig. 7. Prediction test errors as a function of the dataset dimension for the rotating Gaussians problem (a) for a dataset including a full turn of the Gaussians and (b) for the same problem, but including two full turns of the classes, i.e., a faster drift.

Fig. 8. Average prediction accuracy of the methods tested in this paper on the STAGGER dataset, as a function of time. Error bars show the 95% confidence interval.
In Fig. 7(b), we show the corresponding results. It is easy to see that there is a general increase in error levels, associated with the faster drifting of the classes. Again, SW and GF are better than the other methods for low-dimensional datasets. TA-SVM still outperforms the other methods for high-dimensional datasets, but DWM-SVM does not work well in any situation in this case. We also evaluated the STAGGER dataset in the prediction task. In this case, we followed as much as possible the settings used in previous works with this dataset. According to this, we do not use an external validation sequence; we only use the training and test sequences described in the previous section. We replace the external validation with an internal fourfold cross validation using the training sequence available at each time step in order to set the optimal values of the free parameters of all the methods. In Fig. 8, we show the average prediction accuracy as a function of time. TA-SVM shows the fastest response to concept drift, but after that DWM-SVM shows a better convergence to the optimal decision. The bias toward a continuous drift in TA-SVM is reflected in a slower convergence rate in this case. In Table I, we show the same results averaged over time. Overall, TA-SVM shows a very good performance in this problem involving sudden concept drift. DWM-SVM slightly outperformed the new method, but most of the difference is in the first concept, before any concept drift.
TABLE I
PREDICTION ACCURACY ON THE STAGGER DATASET, AVERAGED OVER REALIZATIONS AND TIME, FOR THE METHODS TESTED IN THIS PAPER. IN PARENTHESES WE SHOW THE STANDARD DEVIATION OF THE MEAN ACCURACIES

Method    | Accuracy (%)
SW-SVM    | 87.14 (0.12)
GF-SVM    | 87.67 (0.12)
TA-SVM    | 90.00 (0.10)
DWM-SVM   | 90.83 (0.12)
B. Estimation

As we stated in the introduction, in the estimation task we train a sequence of classifiers using a given dataset and then we test the complete sequence of classifiers on an independent test set involving the same time span as the training set. The objective is to evaluate all the classifiers at the same time, not only the last one as in the previous task. In this case, for each training set of 500 points we generated 100 equivalent validation and test sets. We optimized all internal parameters using the validation sets and then we evaluated the resulting classifiers on the test sets. We assume that we know the time step i at which each point in the test set was measured, so we can use the corresponding classifier from the trained sequence. For the SW and GF methods, we use symmetric time windows of length 2l + 1, centered at time i. We do not use DWM in this case because it was not designed for the estimation task. In DWM, the classifier corresponding to time i can only use information from the points in its past. The other three methods can see the full training set, which would make the comparison unfair. In this evaluation, we fixed the number of dimensions at d = 2 (for all the datasets) and varied the number m of classifiers in the sequence, in order to evaluate the dependence of the methods on the m/n ratio. In the first place, we used the same rotating hyperplane dataset as in the prediction task. In Fig. 9(a), we show the results for the three methods included this time. We also evaluated a noisy version of this dataset, (b), in which a random 10% of the labels were switched. In (c) of the same figure we show the results corresponding to
Fig. 9. Estimation test errors as a function of the number m of classifiers in the sequence (a) for the rotating hyperplane problem, (b) for the same problem with 10% noise, (c) for the sliding Gaussians problem, and (d) for the same problem using normal distributions with bigger σ.
the sliding Gaussians dataset, using the same settings as in Section III-B. Finally, in (d) we use a second version of the sliding Gaussians generated with σ = 0.3, i.e., with more overlap of the classes, and the other settings equal to (c). In the four situations (two datasets times two noise levels), the qualitative results are similar. The overall performance of SW and GF deteriorates when fewer classifiers are used in the sequence (higher n/m values). Clearly, when using fewer SVMs the concept drift becomes more relevant to the problem. The effect is clearer for the low-noise situations, (a) and (c). For noisy situations, as we discussed before, there is always a certain trade-off between noise and drift. TA-SVM clearly outperforms SW and GF in the estimation task. This was expected, as TA-SVM was designed to learn accurately all the sequence at once. We already showed in Section III-D that TA-SVM usually works better when using n/m > 1, in particular in noisy situations. This result is evident here in all cases except for (c), which is a problem with low noise and a relatively fast drift of the decision boundary.

C. Extrapolation

The extrapolation task is an extension of the prediction task, in which we are interested in the prediction of several steps ahead into the future. In this case, we need to extrapolate
the position of the decision boundary some steps into the future, starting from the last classifier in the sequence.3 Our method does not assume any functional form for the time evolution of the sequence of classifiers; the only constraint is that neighboring hyperplanes should be close to each other, so we do not have a principled way to determine the position of each future classifier. In consequence, we must choose an appropriate external method to make an extrapolation based on the position of each classifier in the sequence. In this paper, we use a simple linear extrapolation, but a more complex model could be applied if required (and if there is enough data available). For the experiments in this task, we generated training sequences with 450 points, and for each one of these training sets we generated 100 test sequences with 500 points each. At each run, we optimized all internal parameters using the first 450 points of the test sequences as validation sets and then evaluated the resulting classifiers on the last 50 points of

3 A simple way to do it would be to use an extended sequence of classifiers with hyperplanes located in the future period that we want to predict, and let TA-SVM choose the position of each one. Unfortunately, this procedure will not have the effect we are looking for. As we do not have training points for the future period, the solution will be a compromise between only two penalties: the last term in (2), which will make the solution stay at the position of the last hyperplane before the extrapolation period, and the first term in (2), which will move the solution to the null vector in the extrapolation period.
Fig. 10. Extrapolation test errors as a function of the number m of classifiers in the sequence (a) for the sliding Gaussians problem and (b) for the same problem using distributions with bigger σ.

Fig. 11. Extrapolation test errors as a function of the number of predicted steps into the future (a) for the sliding Gaussians problem and (b) for the same problem using distributions with bigger σ.
these test sets. As in the previous case, we fixed the number of dimensions at d = 2 and varied the number m of classifiers in the sequence, from n/m = 1 to n/m = 20. To extrapolate the position of the decision boundary, we used a simple linear extrapolation of the values of each component of w (wi(t) = αi t + βi) and of b (b(t) = αb t + βb), fitting the coefficients of the linear models (αi,b and βi,b) using all the SVMs in the last 50 training points; a minimal sketch of this step is given below. The number of SVMs included in the extrapolation goes from 50 for n/m = 1 to only 2 for n/m = 20. We do not use DWM in this case because it produces an ensemble of SVMs and there is no simple way to extrapolate the position of the decision boundary for this method. For this evaluation, we used the sliding Gaussians dataset in the same two settings that we explained in the estimation task. In Fig. 10, we show the corresponding results. All methods become more unstable in this task, as indicated by the bigger error bars, because we are superposing two error sources: the fitting of the classifiers and the extrapolation of their position. In Fig. 10(a), we can see again a typical behavior of TA-SVM, with the best performance at n/m > 1. For n/m = 20, all methods show similar poor results, mainly associated with bad extrapolations of the decision boundaries. For the noisy situation in Fig. 10(b), the difference between TA-SVM and the other methods is clearer. In Fig. 11, we show, for the same two datasets, the evolution of the classification error as a function of the number of time steps into the future predicted by all methods.
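A minimal least-squares version of this linear extrapolation (the helper name is ours) might be:

```python
import numpy as np

def extrapolate_hyperplanes(W, b, steps):
    """Fit w_i(t) = alpha_i * t + beta_i and b(t) = alpha_b * t + beta_b to the
    last k hyperplanes, then evaluate the lines `steps` positions ahead.

    W: (k, d) weight vectors of the last k SVMs; b: (k,) biases.
    """
    k = len(b)
    A = np.vstack([np.arange(k), np.ones(k)]).T      # (k, 2) design matrix
    coef_W, *_ = np.linalg.lstsq(A, W, rcond=None)   # (2, d): alphas and betas
    coef_b, *_ = np.linalg.lstsq(A, b, rcond=None)   # (2,)
    A_future = np.vstack([np.arange(k, k + steps), np.ones(steps)]).T
    return A_future @ coef_W, A_future @ coef_b
```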
The results correspond to n/m = 10. In the low-noise situation, (a), TA-SVM shows the best performance for all but the maximum number of time steps, when all methods are equivalent. With more noise present, (b), TA-SVM is clearly superior in all situations.

D. Real-World Case: Electricity Pricing

As a last evaluation of TA-SVM, we considered a real-world problem, the electricity pricing dataset [55]. The dataset contains 45 312 instances collected at regular 30-min intervals during 30 months, from May 1996 to December 1998. The data was first obtained directly from the electricity supplier in New South Wales, Australia. There are five attributes in total. The first two date the record in day of week (1 to 7) and half-hour period (1 to 48). The last three attributes measure the current demand in New South Wales, the demand in Victoria, and the amount of electricity scheduled for transfer between the two states. The target is a binary value indicating whether the price of electricity will go up or down. Following Harries [55], we considered batches of 1-week length. At each week, we train the classifiers with all previous batches and predict the next batch (i.e., the 336 instances in the current week). We considered this setting as a prediction task, and correspondingly for our method we use the last SVM in the sequence to make predictions. In order to select the free parameters of the other methods, we used a simple validation scheme.
TABLE II
PREDICTION ACCURACY ON THE ELECTRICITY PRICING DATASET, AVERAGED OVER TIME, FOR ALL METHODS TESTED IN THIS PAPER. THE ROWS LABELED SVM SHOW THE RESULTS OBTAINED WITH A STATIONARY SVM

Method     Kernel     Accuracy (%)
SVM        Linear     63.3
GF-SVM     Linear     65.1
SW-SVM     Linear     65.3
DWM-SVM    Linear     63.3
TA-SVM     Linear     65.6
SVM        Gaussian   66.1
GF-SVM     Gaussian   67.2
SW-SVM     Gaussian   67.8
DWM-SVM    Gaussian   66.9
TA-SVM     Gaussian   68.9
At each step, we set aside the last available week in the training set as a validation set (not the current week, which we want to predict, but the previous one). Once we have selected the parameters that are optimal for this validation set, we retrain all methods using the complete training set. As this is a real-world dataset, we do not know in advance the actual amount of drift in the data. In order to estimate the benefit of using concept-drift methods in this case, we also applied a standard SVM with the same procedure as described before (i.e., for each week we determined optimal parameters, trained the SVM, and predicted the current week). In Table II we show the corresponding results. The first rows show the results obtained using a linear kernel, as in all previous datasets. All adaptive methods outperformed the standard SVM in this case, suggesting the actual presence of some concept drift in the dataset, and TA-SVM shows the best performance. For reference, Harries [55] used a decision tree with a sliding window on the same problem, reporting 1-week prediction accuracies between 66% and 67.7% for various window sizes. Looking for a better solution, we repeated the experiment using a Gaussian kernel. All methods improved with the use of nonlinear classifiers. Again, TA-SVM shows the best performance on this dataset.⁴

V. CONCLUSION

In this paper we presented TA-SVM, a new method for generating adaptive classifiers that is capable of learning concepts that change with time. The basic idea of TA-SVM is to use a sequence of classifiers, each one appropriate for a small time window but, in contrast to other proposals, learning all the hyperplanes in a global way. Starting from the solution of independent SVMs, we showed that the addition of a new term in the cost function (which penalizes the diversity between consecutive classifiers) in fact produces a coupling of the sequence. Once coupled, the set of SVMs acts as a single adaptive classifier.

⁴DWM shows a low performance in this setup. Kolter and Maloof [25] applied DWM to this dataset with better results, but using an online learning setting, learning and predicting one instance at a time (an easier task for this problem), which differs from Harries's methodology.
We evaluated different aspects of TA-SVM using artificial drifting problems. In particular, we showed that by changing the number of classifiers (the n/m ratio) and the coupling constant γ we can effectively regularize the sequence of classifiers. We compared TA-SVM with other state-of-the-art methods in three different settings: estimation, prediction, and extrapolation, including problems with small datasets, high-dimensional input spaces, and noise. In all cases, TA-SVM proved to be equivalent to or better than the other methods. Even in the most unfavorable situation for TA-SVM, i.e., the sudden changes of the STAGGER dataset, our new method showed very good performance. We also applied TA-SVM to a real-world dataset, Harries's electricity pricing, with very good results.

TA-SVM has two free parameters, m and γ. In our experience, the most efficient way to use them is to fix the n/m ratio in the range of 5 to 10 and then tune γ using an internal cross-validation. If the dataset is small or there are indications of high drift levels, one can use n = m to increase the flexibility of the model. The C parameter follows the same rules as in standard SVMs. If there is previous knowledge about noise levels, the C value can be set accordingly. If not, we recommend beginning with a low C value and leaving the regularization to the coupling term.

There is nothing in our formulation or in the derivation of the dual problem that prevents the use of arbitrary kernel functions to evaluate distances and create nonlinear adaptive classifiers. We already used this possibility in the modeling of the electricity pricing domain. The only potential difficulty can arise in the extrapolation setting. For kernel functions corresponding to finite-dimensional feature spaces it is always possible, in principle, to use our simple extrapolation. However, this cannot be done if the kernel is associated with an infinite-dimensional feature space.

If needed, there are some simple ways to make TA-SVM scale efficiently to larger problems. For the prediction and extrapolation tasks, the focus is on the performance of the last classifiers in the sequence. In the prediction case, we only use the last classifier to predict the labels of the test samples. For the extrapolation task, we use only a few SVMs from the end of the sequence in order to extrapolate the solution. As the coupling term in (2) involves only interactions between first time neighbors, the influence of old examples on the current TA-SVM prediction decays exponentially with time. According to this analysis, very old examples quickly become useless to TA-SVM and can be eliminated from the dataset. In practice, for prediction and extrapolation we pay only a reduced cost by limiting TA-SVM to a fixed number of the most recent examples. In addition, we can easily force the first hyperplane in this reduced sequence to keep the optimal position found in a previous step, which reduces the loss in performance even further. Going further in the same direction, we can even arrive at a quasi-online version of TA-SVM, in which only a few hyperplanes are adjusted at each step.

We are currently studying the application of TA-SVM to real problems in slowly drifting systems, in particular to fault prediction in critical mechanical systems. Also, we are
evaluating the extension of TA-SVM to one-class classification and regression problems. Finally, we are considering the use of partially overlapping windows for TA-SVM.

ACKNOWLEDGMENT

The authors would like to thank P. F. Verdes and three anonymous reviewers for useful suggestions that considerably improved this manuscript.

APPENDIX A
DERIVING THE DUAL PROBLEM

First, we introduce the notation used in this section. We consider the case where we want to adjust a sequence of m hyperplanes to a dataset with n points. We use Greek letters for the indices that range over the hyperplanes and Latin letters for the indices that range over the points. As explained in the main text, the hyperplanes are defined by a vector $w_\mu$ and a scalar $b_\mu$, for $\mu \in \{1, \ldots, m\}$. The hyperplane corresponding to the $i$-th point is $(w_{\mu_i}, b_{\mu_i})$. $p_\mu$ is the set of points $\{i : \mu_i = \mu\}$. $P$ is an $m \times n$ matrix defined by $P_{\mu j} = 1$ if $j \in p_\mu$ and $0$ otherwise. Also, we use the kernel matrix $K$, defined by $K_{ij} = y_i y_j \, x_i \cdot x_j$. As in conventional SVMs, we can always replace this definition with any other involving a useful inner product. Another required matrix is $Q$, given by

$$Q_{\mu\nu} = \begin{cases} 1, & \text{if hyperplanes } \nu \text{ and } \mu \text{ are neighbors} \\ 0, & \text{otherwise} \end{cases}$$

which we symmetrize as $Q^S = (Q + Q^T)/2$. We also use the notation $P \odot Q$ for the entrywise (or Hadamard) matrix product of $P$ and $Q$: $(P \odot Q)_{ij} = P_{ij} Q_{ij}$.

We start from the problem

$$\min_{w_\mu, b_\mu} \; \frac{1}{2} \sum_{\mu=1}^{m} \|w_\mu\|^2 + \frac{\gamma}{2m} \sum_{\mu=1}^{m} \sum_{\nu=1}^{m} Q_{\mu\nu} \left[ \|w_\mu - w_\nu\|^2 + (b_\mu - b_\nu)^2 \right] + C \sum_{i=1}^{n} \xi_i$$

subject to

$$\xi_i \ge 0, \qquad y_i \left( w_{\mu_i} \cdot x_i + b_{\mu_i} \right) - 1 + \xi_i \ge 0$$

where $\|w\|^2 = w \cdot w$. This is the same problem we introduced in the main text, with small differences that help in the search for the solution. Given the symmetry of the term including $Q$, it is easy to rewrite the problem using $Q^S$

$$\frac{1}{2} \sum_{\mu=1}^{m} \|w_\mu\|^2 + \frac{\gamma}{2m} \sum_{\mu=1}^{m} \sum_{\nu=1}^{m} Q^S_{\mu\nu} \left[ \|w_\mu - w_\nu\|^2 + (b_\mu - b_\nu)^2 \right] + C \sum_{i=1}^{n} \xi_i.$$

Then the corresponding Lagrangian is

$$L = \frac{1}{2} \sum_{\mu=1}^{m} \|w_\mu\|^2 + \frac{\gamma}{2m} \sum_{\mu=1}^{m} \sum_{\nu=1}^{m} Q^S_{\mu\nu} \left[ \|w_\mu - w_\nu\|^2 + (b_\mu - b_\nu)^2 \right] + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i \left( w_{\mu_i} \cdot x_i + b_{\mu_i} \right) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i \tag{3}$$

where $\alpha_i \ge 0$ and $\beta_i \ge 0$. We have to maximize $L$ with respect to $\alpha_i$ and $\beta_i$ and minimize it with respect to $w_\mu$, $b_\mu$, and $\xi_i$. At the solution, the derivatives with respect to the primal variables must be zero

$$\frac{\partial L}{\partial \xi_i} = 0, \qquad \frac{\partial L}{\partial w_\mu} = 0, \qquad \frac{\partial L}{\partial b_\mu} = 0.$$

From these equations we can eliminate the variables $\xi_i$, $w_\mu$, and $b_\mu$ from $L$ and obtain the dual problem. We start with the derivative with respect to $\xi_i$

$$\frac{\partial L}{\partial \xi_i} = 0 = C - \alpha_i - \beta_i$$

which implies that $0 \le \alpha_i \le C$. On the other hand, taking into account that each $\xi_i$ is multiplied by $(C - \alpha_i - \beta_i)$, (3) becomes

$$L = \frac{1}{2} \sum_{\mu=1}^{m} \|w_\mu\|^2 + \frac{\gamma}{2m} \sum_{\mu=1}^{m} \sum_{\nu=1}^{m} Q^S_{\mu\nu} \left[ \|w_\mu - w_\nu\|^2 + (b_\mu - b_\nu)^2 \right] - \sum_{i=1}^{n} \alpha_i \left[ y_i \left( w_{\mu_i} \cdot x_i + b_{\mu_i} \right) - 1 \right]. \tag{4}$$

In the case of $w_\mu$ we have

$$\frac{\partial L}{\partial w_\mu} = 0 = w_\mu + \frac{\gamma}{m} \sum_{\nu} Q^S_{\mu\nu} (w_\mu - w_\nu) - \sum_{j \in p_\mu} \alpha_j y_j x_j$$

which results in

$$w_\mu + \frac{\gamma}{m} \sum_{\nu} Q^S_{\mu\nu} (w_\mu - w_\nu) = \sum_{j \in p_\mu} \alpha_j y_j x_j.$$

Defining the matrix $M$ as

$$M_{\mu\nu} = \begin{cases} 1 + \frac{\gamma}{m} \sum_{\kappa} Q^S_{\mu\kappa}, & \text{if } \mu = \nu \\ -\frac{\gamma}{m} Q^S_{\mu\nu}, & \text{otherwise} \end{cases}$$

we can write $w_\mu$ as

$$w_\mu = \sum_{j} M^{-1}_{\mu \mu_j} \alpha_j y_j x_j. \tag{5}$$
Using this, we can rewrite the term $\sum_\mu \|w_\mu\|^2$ in (4)

$$\sum_{\mu} \|w_\mu\|^2 = \sum_{ij} \sum_{\mu} M^{-1}_{\mu\mu_i} M^{-1}_{\mu\mu_j} \, \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j.$$

On the other hand

$$\left( P^T M^{-2} P \right)_{ij} = \sum_{\mu} M^{-1}_{\mu\mu_i} M^{-1}_{\mu\mu_j}.$$

With this and the definition of $K$, we have

$$\sum_{\mu} \|w_\mu\|^2 = \alpha^T \left( P^T M^{-2} P \odot K \right) \alpha. \tag{6}$$

The $\sum_{\mu\nu} Q^S_{\mu\nu} \|w_\mu - w_\nu\|^2$ term can be rewritten as

$$\sum_{\mu\nu} Q^S_{\mu\nu} \|w_\mu - w_\nu\|^2 = \sum_{ij} \sum_{\mu\nu} Q^S_{\mu\nu} \left( M^{-1}_{\mu\mu_i} - M^{-1}_{\nu\mu_i} \right) \left( M^{-1}_{\mu\mu_j} - M^{-1}_{\nu\mu_j} \right) \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j = 2 \sum_{ij} \left( \sum_{\mu} D_{\mu\mu} M^{-1}_{\mu\mu_i} M^{-1}_{\mu\mu_j} - \sum_{\mu\nu} M^{-1}_{\mu\mu_i} Q^S_{\mu\nu} M^{-1}_{\nu\mu_j} \right) \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$$

as $Q^S$ is symmetric, and using the $m \times m$ diagonal matrix defined by

$$D_{\mu\nu} = \begin{cases} \sum_{\kappa} Q^S_{\mu\kappa}, & \text{if } \mu = \nu \\ 0, & \text{otherwise.} \end{cases}$$

Given that

$$\left[ P^T M^{-1} (D - Q^S) M^{-1} P \right]_{ij} = \sum_{\mu} D_{\mu\mu} M^{-1}_{\mu\mu_i} M^{-1}_{\mu\mu_j} - \sum_{\mu\nu} M^{-1}_{\mu\mu_i} Q^S_{\mu\nu} M^{-1}_{\nu\mu_j}$$

we can write $\sum_{\mu\nu} Q^S_{\mu\nu} \|w_\mu - w_\nu\|^2$ as

$$\sum_{\mu\nu} Q^S_{\mu\nu} \|w_\mu - w_\nu\|^2 = 2 \, \alpha^T \left[ P^T M^{-1} (D - Q^S) M^{-1} P \odot K \right] \alpha. \tag{7}$$

Using (6) and (7) we can write (4) as

$$L = \frac{1}{2} \alpha^T \left( P^T M^{-2} P \odot K \right) \alpha + \frac{\gamma}{2m} \alpha^T \left[ P^T M^{-1} (D - Q^S) M^{-1} P \odot K \right] \alpha - \alpha^T \left( P^T M^{-1} P \odot K \right) \alpha + \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 + \sum_{i} \alpha_i - \sum_{i} \alpha_i y_i b_{\mu_i}.$$

The matrices $M$, $D$, and $Q^S$ are related by the equation

$$M = I + \frac{\gamma}{m} \left( D - Q^S \right).$$

With this equality, we can simplify the obtained $L$

$$L = -\frac{1}{2} \alpha^T \left( P^T M^{-1} P \odot K \right) \alpha + \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 + \sum_{i} \alpha_i - \sum_{i} \alpha_i y_i b_{\mu_i}. \tag{8}$$

Now we use the derivatives with respect to $b_\mu$. In this case

$$\frac{\partial L}{\partial b_\mu} = 0 = \frac{\gamma}{m} \sum_{\nu=1}^{m} Q^S_{\mu\nu} (b_\mu - b_\nu) - \sum_{i \in p_\mu} \alpha_i y_i$$

which gives

$$\frac{\gamma}{m} \left( b_\mu \sum_{\nu=1}^{m} Q^S_{\mu\nu} - \sum_{\nu=1}^{m} Q^S_{\mu\nu} b_\nu \right) = \sum_{i \in p_\mu} \alpha_i y_i \tag{9}$$

which, defining $h_i = \alpha_i y_i$, we can write as

$$\frac{\gamma}{m} \left( D - Q^S \right) b = P h. \tag{10}$$

Since $(D - Q^S)$ is singular, given that $(D - Q^S) \mathbf{1} = 0$, we can write

$$0 = 0 \cdot b = \frac{\gamma}{m} \mathbf{1}^T \left( D - Q^S \right) b = \mathbf{1}^T P h = \sum_{i=1}^{n} \alpha_i y_i.$$

In this case, the solution to the system (10) is

$$b = \frac{m}{\gamma} \left( D - Q^S \right)^+ P h \tag{11}$$

where $(D - Q^S)^+$ is the pseudoinverse of $(D - Q^S)$. We still need to eliminate the $b_\mu$ from $L$. The part that depends on $b$ is

$$\frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 - \sum_{i} \alpha_i y_i b_{\mu_i} = \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 - b^T P h. \tag{12}$$

We can write this as

$$\frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu - b_\nu \right)^2 - b^T P h = \frac{\gamma}{4m} \sum_{\mu,\nu} Q^S_{\mu\nu} \left( b_\mu^2 - 2 b_\mu b_\nu + b_\nu^2 \right) - b^T P h = \frac{\gamma}{2m} \left( \sum_{\mu} b_\mu^2 \sum_{\nu} Q^S_{\mu\nu} - b^T Q^S b \right) - b^T P h = \frac{\gamma}{2m} b^T \left( D - Q^S \right) b - b^T P h = \frac{b^T P h}{2} - b^T P h = -\frac{b^T P h}{2} = -\frac{m}{2\gamma} h^T P^T \left( D - Q^S \right)^+ P h = -\frac{m}{2\gamma} \alpha^T \left[ P^T \left( D - Q^S \right)^+ P \odot Y \right] \alpha$$
where $Y = y y^T$. Taking into account the last equality, $L$ becomes

$$L = -\frac{1}{2} \alpha^T \left[ P^T M^{-1} P \odot K + P^T (M - I)^+ P \odot Y \right] \alpha + \sum_{i} \alpha_i$$

which has the form

$$L = -\frac{1}{2} \alpha^T R \alpha + \sum_{i} \alpha_i$$

with the matrix $R$ defined accordingly. Finally, the dual problem is

$$\max_{\alpha} \; -\frac{1}{2} \alpha^T R \alpha + \sum_{i} \alpha_i \tag{13}$$

subject to

$$0 \le \alpha_i \le C, \qquad \sum_{i} \alpha_i y_i = 0$$

which is the same quadratic minimization problem with constraints that is solved in conventional SVMs (with a different matrix $R$). In consequence, any technique employed to solve the conventional SVM problem can be used here, for example, SMO [43].
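For small problems, (13) can also be handed directly to a general-purpose solver. The following sketch is our illustration, not the authors' implementation (which could equally rely on SMO [43]); it assumes R, y, and C have already been computed, and poses (13) as an equivalent minimization for SciPy's SLSQP routine:

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(R, y, C):
    """Solve (13): max_a -0.5 a'R a + sum(a)
    subject to 0 <= a_i <= C and sum(a_i y_i) = 0."""
    n = len(y)
    objective = lambda a: 0.5 * a @ R @ a - a.sum()   # negated (13)
    gradient = lambda a: R @ a - np.ones(n)
    res = minimize(objective, np.zeros(n), jac=gradient,
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq",
                                "fun": lambda a: a @ y,
                                "jac": lambda a: y},
                   method="SLSQP")
    return res.x  # the optimal alpha vector
```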
APPENDIX B
COMPLEXITY EVALUATION

As follows from (13), the complexity of the whole problem is given by the computation of the matrix $R$ and the solution of the optimization problem. As mentioned before, this last step is equivalent to a conventional SVM optimization problem, which is $O(n^2)$. The computation of $R$ involves the inversion of $M$ and the computation of the pseudoinverse of $(M - I)$. The general solutions of these problems are costly but, in our case, given that we consider only interactions between first time neighbors, both can be obtained analytically. After this, the computation of $P^T M^{-1} P$ and $P^T (M - I)^+ P$ is trivial, given that

$$\left( P^T M^{-1} P \right)_{ij} = M^{-1}_{\mu_i \mu_j}, \qquad \left( P^T (M - I)^+ P \right)_{ij} = (M - I)^+_{\mu_i \mu_j}.$$

Hence, the computation of each element of the Hadamard products is $O(1)$, which means that the computation of $R$ is $O(n^2)$, i.e., no greater than the optimization step.

REFERENCES

[1] J. C. Schlimmer and R. H. Granger, “Beyond incremental processing: Tracking concept drift,” in Proc. 5th Nat. Conf. Artif. Intell., Irvine, CA, 1986, pp. 502–507.
[2] R. Klinkenberg and T. Joachims, “Detecting concept drift with support vector machines,” in Proc. 17th Int. Conf. Mach. Learn., San Mateo, CA, 2000, pp. 487–494.
[3] R. Vicente, O. Kinouchi, and N. Caticha, “Statistical mechanics of online learning of drifting concepts: A variational approach,” Mach. Learn., vol. 32, no. 2, pp. 179–201, Aug. 1998.
[4] P. L. Bartlett, S. Ben-David, and S. R. Kulkarni, “Learning changing concepts by exploiting the structure of change,” Mach. Learn., vol. 41, no. 2, pp. 153–174, Nov. 2000.
[5] K. Stanley, “Learning concept drift with a committee of decision trees,” Dept. Comput. Sci., Univ. Texas, Austin, Tech. Rep. UTAI-TR-03-302, 2003.
[6] A. Tsymbal, “The problem of concept drift: Definitions and related work,” Dept. Comput. Sci., Trinity College Dublin, Dublin, Ireland, Tech. Rep. TCD-CS-2004-15, Apr. 2004.
[7] T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, “Experience with a learning personal assistant,” Commun. ACM, vol. 37, no. 7, pp. 81–91, 1994.
[8] G. Widmer and M. Kubat, “Learning in the presence of concept drift and hidden contexts,” Mach. Learn., vol. 23, no. 1, pp. 69–101, Apr. 1996.
[9] M. Salganicoff, “Tolerating concept and sampling shift in lazy learning using prediction error context switching,” Artif. Intell. Rev., vol. 11, nos. 1–5, pp. 133–155, Feb. 1997.
[10] W. N. Street and Y. Kim, “A streaming ensemble algorithm (SEA) for large-scale classification,” in Proc. 7th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, San Francisco, CA, 2001, pp. 377–382.
[11] S. H. Bach and M. A. Maloof, “Paired learners for concept drift,” in Proc. IEEE Int. Conf. Data Mining, Los Alamitos, CA, 2008, pp. 23–32.
[12] C. Alippi and M. Roveri, “Just-in-time adaptive classifiers—part I: Detecting nonstationary changes,” IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1145–1153, Jul. 2008.
[13] C. Alippi and M. Roveri, “Just-in-time adaptive classifiers—part II: Designing the classifier,” IEEE Trans. Neural Netw., vol. 19, no. 12, pp. 2053–2064, Dec. 2008.
[14] D. P. Helmbold and P. M. Long, “Tracking drifting concepts by minimizing disagreements,” Mach. Learn., vol. 14, no. 1, pp. 27–45, Jan. 1994.
[15] P. L. Bartlett, “Learning with a slowly changing distribution,” in Proc. 5th Annu. Workshop Comput. Learn. Theory, Pittsburgh, PA, 1992, pp. 243–252.
[16] R. D. Barve and P. M. Long, “On the complexity of learning from drifting distributions,” in Proc. 9th Annu. Workshop Comput. Learn. Theory, San Mateo, CA, 1996, pp. 170–193.
[17] Y. Freund and Y. Mansour, “Learning under persistent drift,” in Proc. 3rd Eur. Conf. Comput. Learn. Theory, London, U.K., 1997, pp. 109–118.
[18] C. Alippi, G. Boracchi, and M. Roveri, “Just in time classifiers: Managing the slow drift case,” in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, 2009, pp. 114–120.
[19] R. Klinkenberg and I. Renz, “Adaptive information filtering: Learning in the presence of concept drifts,” in Workshop Notes of the ICML/AAAI Workshop Learning for Text Categorization. Menlo Park, CA: AAAI Press, 1998, pp. 33–40.
[20] C. Lanquillon, “Enhancing text classification to improve information filtering,” Ph.D. thesis, Faculty Comp. Sci., Univ. Magdeburg, Magdeburg, Germany, 2001.
[21] K. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 181–201, Mar. 2001.
[22] V. Vapnik, “An overview of statistical learning theory,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, Sep. 1999.
[23] G. L. Grinblat, P. M. Granitto, and H. A. Ceccatto, “Time-adaptive support vector machines,” Inteligencia Artif., vol. 12, no. 40, pp. 39–50, 2008.
[24] M. Karnick, M. Ahiskali, M. D. Muhlbaier, and R. Polikar, “Learning concept drift in nonstationary environments using an ensemble of classifiers based approach,” in Proc. IEEE Int. Joint Conf.
Neural Netw., Hong Kong, China, Jun. 2008, pp. 3455–3462.
[25] J. Z. Kolter and M. A. Maloof, “Dynamic weighted majority: An ensemble method for drifting concepts,” J. Mach. Learn. Res., vol. 8, pp. 2755–2790, Dec. 2007.
[26] J. C. Schlimmer and R. H. Granger, Jr., “Incremental learning from noisy data,” Mach. Learn., vol. 1, no. 3, pp. 317–354, 1986.
[27] G. Castillo, J. Gama, and P. Medas, “Adaptation to drifting concepts,” in Proc. Progress Artif. Intell., 11th Portuguese Conf. Artif. Intell. (EPIA), LNCS 2902, Beja, Portugal, 2003, pp. 279–293.
[28] R. Klinkenberg, “Learning drifting concepts: Example selection versus example weighting,” Intell. Data Anal., vol. 8, no. 3, pp. 281–300, Aug. 2004.
[29] T. Joachims, “Estimating the generalization performance of a SVM efficiently,” in Proc. 17th Int. Conf. Mach. Learn., San Francisco, CA, 2000, pp. 431–438.
[30] I. Koychev and R. Lothian, “Tracking drifting concepts by time window optimization,” in Proc. 25th SGAI Int. Conf. Innov. Tech. Appl. Artif. Intell., New York, 2005, pp. 46–59.
[31] C. Alippi and M. Roveri, “Just-in-time adaptive classifiers in nonstationary conditions,” in Proc. Int. Joint Conf. Neural Netw., Orlando, FL, 2007, pp. 1014–1019.
[32] I. Koychev, “Tracking changing user interests through prior-learning of context,” in Adaptive Hypermedia, LNCS 2347. New York: Springer-Verlag, 2002, pp. 223–232.
[33] M. Maloof and R. Michalski, “Selecting examples for partial memory learning,” Mach. Learn., vol. 41, no. 1, pp. 27–52, Oct. 2000.
[34] I. Koychev, “Gradual forgetting for adaptation to concept drift,” in Proc. ECAI Workshop Current Issues Spatio-Temporal Reason., Berlin, Germany, 2000, pp. 101–106.
[35] H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining concept-drifting data streams using ensemble classifiers,” in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Washington D.C., 2003, pp. 226–235.
[36] J. Gao, B. Ding, J. Han, W. Fan, and P. Yu, “Classifying data streams with skewed class distributions and concept drifts,” IEEE Internet Comput., vol. 12, no. 6, pp. 37–49, Nov.–Dec. 2008.
[37] S. Hashemi, Y. Yang, Z. Mirzamomen, and M. Kangavari, “Adapted one-versus-all decision trees for data stream classification,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 624–637, May 2009.
[38] R. Elwell and R. Polikar, “Incremental learning in nonstationary environments with controlled forgetting,” in Proc. Int. Joint Conf. Neural Netw., Atlanta, GA, 2009, pp. 771–778.
[39] M. D. Muhlbaier, A. Topalis, and R. Polikar, “Learn++.NC: Combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes,” IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 152–168, Jan. 2009.
[40] Z. Kolter and M. Maloof, “Dynamic weighted majority: A new ensemble method for tracking concept drift,” in Proc. 3rd IEEE Int. Conf. Data Mining, Nov. 2003, pp. 123–130.
[41] N. Littlestone and M. K. Warmuth, “The weighted majority algorithm,” Inform. Comput., vol. 108, no. 2, pp. 212–261, Feb. 1994.
[42] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[43] J. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods-Support Vector Learning. Cambridge, MA: MIT Press, 2000, pp. 185–208.
[44] N. Littlestone, “Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm,” Mach. Learn., vol. 2, no. 4, pp. 285–318, Apr. 1987.
[45] S. Ferrari and M. Jensenius, “A constrained optimization approach to preserving prior knowledge during incremental training,” IEEE Trans. Neural Netw., vol. 19, no. 6, pp. 996–1009, Jun. 2008.
[46] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” J. Mach. Learn. Res., vol. 7, pp. 551–585, Dec. 2006.
[47] J. Kivinen and M. Warmuth, “Exponentiated gradient versus gradient descent for linear predictors,” Inform. Comput., vol. 132, no. 1, pp. 1–63, Jan. 1997.
[48] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176, Aug. 2004.
[49] M. Herbster and M. K. Warmuth, “Tracking the best linear predictor,” J. Mach. Learn. Res., vol. 1, pp. 281–309, Sep. 2001.
[50] R.
Caruana, “Multi-task learning,” Mach. Learn., vol. 28, no. 1, pp. 41–75, Jul. 1997. [51] S. Thrun and L. Pratt, Learning to Learn. Norwell, MA: Kluwer, 1997. [52] T. Evgeniou, C. M. Micchelli, and M. Pontil, “Learning multiple tasks with kernel methods,” J. Mach. Learn. Res., vol. 6, pp. 615–637, Dec. 2005. [53] W. Fan, “Systematic data selection to mine concept-drifting data streams,” in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Seattle, WA, 2004, pp. 128–137. [54] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen, “Dynamic integration of classifiers for handling concept drift,” Inform. Fusion, vol. 9, no. 1, pp. 56–68, Jan. 2008. [55] M. Harries, “Splice-2 comparative evaluation: Electricity pricing,” School Comput. Sci. & Eng., Univ. New South Wales, Sydney, Australia, Tech. Rep. NSW-CSE-TR-9905, 1999.
Guillermo L. Grinblat was born in Miramar, Buenos Aires, Argentina, in 1976. He received the Licentiate degree in computer sciences from the National University of Rosario, Rosario, Argentina, in 2006. He currently holds a fellowship at the French Argentine International Center for Information and Systems Sciences, Rosario. He is also a Teaching Assistant at the National University of Rosario. His current research interests include drifting problems, kernel methods, and deep architectures.
Lucas C. Uzal was born in Pergamino, Argentina, in 1982. He received the Licentiate and M.Sc. degrees in physics from the Balseiro Institute, San Carlos de Bariloche, Argentina, in 2005 and 2006, respectively. He has been with the French Argentine International Center for Information and Systems Sciences, Rosario, Argentina, since 2007, on a research grant from the Consejo Nacional de Investigaciones Científicas y Tecnológicas, Rosario. His current research interests include complex systems and time series analysis.
H. Alejandro Ceccatto was born in Argentina in 1953. He received the M.Sc. degree in physics from the Universidad Nacional de Rosario, Rosario, Argentina, in 1979, and the Ph.D. degree in physics from the Universidad Nacional de La Plata, La Plata, Argentina, in 1985. He was a Post-Doctoral Fellow at the Department of Applied Physics, Stanford University, Stanford, CA, in 1988, and at the Institut für Theoretische Physik, Universität zu Köln, Cologne, Germany, in 1989. Since 1995, he has been the Director of the Intelligent Systems Group, Instituto de Física Rosario, Rosario. He is currently a Full Professor at the Universidad Nacional de Rosario and Director of the French Argentine International Center for Information and Systems Sciences, Rosario. He has supervised 20 M.Sc. and 11 Ph.D. theses.
Pablo M. Granitto was born in Rosario, Argentina, in 1970. He received the degree in physics and the Ph.D. degree in physics from the Universidad Nacional de Rosario (UNR), Rosario, Argentina, in 1997 and 2003, respectively. He was a Post-Doctoral Researcher at the Istituto Agrario San Michele all'Adige, Trento, Italy. Since 2006, he has been a full-time Researcher at the Consejo Nacional de Investigaciones Científicas y Tecnológicas, Rosario, and UNR. He leads the Machine Learning Group at the French Argentine International Center for Information and Systems Sciences, Rosario. His current research interests include the application of modern machine learning techniques to agroindustrial and biological problems, involving feature selection, clustering, and ensemble methods.
Optimum Spatio-Spectral Filtering Network for Brain–Computer Interface Haihong Zhang, Member, IEEE, Zheng Yang Chin, Member, IEEE, Kai Keng Ang, Member, IEEE, Cuntai Guan, Senior Member, IEEE, and Chuanchu Wang, Member, IEEE
Abstract— This paper proposes a feature extraction method for motor imagery brain–computer interface (BCI) using electroencephalogram. We consider the primary neurophysiologic phenomenon of motor imagery, termed event-related desynchronization, and formulate the learning task for feature extraction as maximizing the mutual information between the spatio-spectral filtering parameters and the class labels. After introducing a nonparametric estimate of mutual information, a gradient-based learning algorithm is devised to efficiently optimize the spatial filters in conjunction with a band-pass filter. The proposed method is compared with two existing methods on real data: a BCI Competition IV dataset as well as our data collected from seven human subjects. The results indicate the superior performance of the method for motor imagery classification, as it produced higher classification accuracy with statistical significance (≥95% confidence level) in most cases. Index Terms— Brain–computer interface, motor imagery electroencephalography, spatio-spectral filtering.
I. INTRODUCTION

The necessity of developing high-performance brain–computer interfaces (BCI) is growing rapidly alongside advances in neural devices and demands from rehabilitation, assistive technology, and beyond [1], [2]. Among the various useful signals for electroencephalogram (EEG) based BCI [3], motor imagery [4] is probably the most common one. It refers to the imagination or mental rehearsal of a motor action without any real motor output. The primary phenomenon of motor imagery electroencephalography (EEG) is event-related desynchronization (ERD) [4], [5], which is the attenuation of the rhythmic activity over the sensorimotor cortex in the µ (8–14 Hz) and β (14–30 Hz) rhythms. ERD can be induced both by imagined movements in healthy people and by intended movements in paralyzed patients [6]. Previous studies have demonstrated that, based on ERD analysis, it is feasible to classify imagined movements of the left hand, right hand, feet, and tongue [4], [7], [8]. A complementary phenomenon called the Bereitschaftspotential is a nonoscillatory characteristic of motor imagery EEG and can also be used for BCI [9]. This paper will focus on the ERD.
Manuscript received April 3, 2010; revised August 1, 2010 and September 2, 2010; accepted September 28, 2010. Date of publication November 9, 2010; date of current version January 4, 2011. This work was supported by the Science and Engineering Research Council of the Agency for Science, Technology and Research, Singapore.

The authors are with the Institute for Infocomm Research, Agency for Science, Technology and Research, 138632, Singapore (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNN.2010.2084099
For decoding different motor imaginations from EEG, the essential task is to distinguish the respective ERD signals. Neurologically, the spatial pattern of the ERD provides a clue. For instance, movements of the left/right hand are associated with activity in the contralateral (right/left) motor cortex areas [4]. However, localization of the ERD sources is impeded by the EEG's poor spatial specificity, caused by volume conduction and coherency [10], [11]. Furthermore, the ERD is sensitive to artifacts caused by muscle activity or by visual cortex activity, since their frequency ranges overlap strongly while the ERD signal is rather weak [12]. Besides, both the spatial pattern and the particular rhythm vary among people, requiring subject-specific learning [5]. Therefore, from a signal processing point of view, it is important to design a feature extraction mechanism that can learn to capture the effective spatial and spectral features associated with the ERD for each particular person. As a recent survey [13] indicates, considerable effort has been devoted to this topic by the signal processing, machine learning, and artificial neural network communities.

Particularly, spatial filtering techniques are widely used to extract discriminative spatial features of the ERD in multichannel EEG. Techniques such as independent component analysis [14] and beam-forming [15] have been introduced, while the most commonly used technique thus far is the common spatial pattern (CSP) [4], [16], [17]. As [18] shows, CSP can yield significantly higher accuracy in motor imagery classification than various independent component analysis methods. CSP consists of a linear projection of the time samples of multichannel EEG onto a few vectors that correspond to individual spatial filters. Mathematically, the projection matrix is constructed by maximizing the separability, in terms of the Rayleigh coefficient [17], between motor imagery EEG classes. The coefficient is determined by the intraclass covariance matrices of the EEG time samples, and its maximization can be readily solved by generalized eigenvalue decomposition. Usually, CSP works together with a subject-specific band-pass filter that selects the particular rhythm of the ERD.

To learn the band-pass filter and the spatial filters in a unified framework, several extensions of CSP have been devised. In [19], the authors embedded a first-order finite impulse
response filter into CSP. In view of the limited capability of first-order filters to select frequency bands, a higher order finite impulse response (FIR) filter was proposed in [20], although a sophisticated regularization method was necessary to make the solution robust. More recently, Wu et al. [21] proposed an iterative learning method in which an FIR filter and a classifier were simultaneously parameterized and optimized in the spectral domain, alternating with optimization of the spatial filters using CSP. Another method, called filter bank common spatial pattern (FBCSP) [22], introduced a feature selection algorithm to combine a filter bank framework with CSP. It decomposes EEG data into an array of passbands, performs CSP in each band, and selects a reduced set of features from all the bands. An offline study [23] suggested its higher performance over the above-mentioned iterative learning method. Furthermore, its efficacy was demonstrated in the latest BCI Competition [24], where it served as the basis of all the winning algorithms in the EEG categories. FBCSP was further improved in [25] by employing a robust maximum mutual information criterion for feature selection. (Another method [8] used the maximum mutual information principle, but in a different formulation, to select spatial components from independent component analysis.)

However, learning optimum spatio-spectral filters is still an open issue. Extensions of CSP often inherit its limitation in exploring spatial patterns. Specifically, as shown in the Appendix and in [26, Sec. 10.2], CSP is equivalent to minimizing a classification error bound for two unimodal multivariate Gaussian distributions only. As [13, p. R43] puts it, it can also be sensitive to artifacts in the training data, as a single trial contaminated with artifacts can unfortunately cause extreme changes to the filters.

In this paper, we present an information-theoretic approach to learning the spatio-spectral filters. Particularly, the approach constructs an optimum spatio-spectral filtering network (OSSFN) that optimizes the filters by maximizing the mutual information between the feature vectors and the corresponding class labels. As mentioned earlier, the maximum mutual information criterion was employed in [25] for feature selection, where numerical optimization of spatial filters was not considered. By contrast, this paper addresses the more challenging and interesting issue of feature extraction, which involves numerical optimization of the spatial filters together with selection of a band-pass filter. Therefore, one of the major contributions of this paper is the introduction of a nonparametric mutual information estimate to formulate the objective for spatio-spectral feature extraction. Importantly, based on this new formulation, we devise a gradient-based method for optimizing the spatial filters jointly with a band-pass filter.

We conduct an experimental study to assess the proposed method while comparing with existing methods, including CSP and FBCSP. The study uses motor imagery data collected from seven human subjects in our laboratory, as well as the publicly available BCI Competition IV Dataset I. The study performs randomized cross-validation to assess the classification accuracy with a linear support vector machine, and runs t-tests
TABLE I
LIST OF SYMBOLS

z(t)      A block of raw n_c-channel EEG signal; t ∈ [0, L]
x(t)      Signal after spectral filtering using a band-pass filter h
y(t)      Signal after spatial filtering
W         Spatial filtering matrix in R^(n_c × n_l), with the spatial filter vectors as columns
w_l       The l-th spatial filter vector in W, in R^(n_c × 1)
a; A      A particular feature vector in R^(n_l × 1) for z(t); the feature vector variable
ω; Ω      A particular class label; the class label variable
p; P      Probability density function and probability function of a random variable
H         Entropy of a random variable
I(A, Ω)   Mutual information between A and Ω
n_a       Number of samples (z(t)) in the training data
n_ω       Number of class-ω samples in the training data
to verify the statistical significance of the results between the different methods.

The rest of this paper is organized as follows. Section II describes the proposed method and formulates the maximum mutual information learning problem. Section III derives a numerical solution. Section IV describes the experimental study and the results, followed by discussions in Section V. Section VI concludes this paper.

II. OSSFN

For the convenience of readers, Table I lists the essential mathematical symbols. The architecture of the proposed filtering network OSSFN is illustrated in Fig. 1. It learns and performs consecutive band-pass filtering, spatial filtering, and log power integration to extract discriminative features for motor imagery classification. The input of the network is a time window of n_c-channel EEG waveforms z(t) (without loss of generality, we assume t ∈ [0, L] in the time window), and the output is a feature vector a that represents the mean power of spatio-spectral components of z(t). The procedure transforming the EEG block z(t) into the feature vector a comprises the following steps.

1) Spectral filtering: a band-pass filter that extracts a specific rhythmic activity of the ERD; it produces the band-pass-filtered signal x.

2) Spatial filtering: a linear projection that transforms x into a lower dimensional signal y

$$y(t) = W^T x(t). \tag{1}$$

Here, the superscript T denotes the transpose operator. Each column of the transformation matrix $W \in \mathbb{R}^{(n_c \times n_l)}$ determines one of the $n_l$ spatial filters. Therefore, each element of y describes the activity of a particular spatial component.

3) Log power integral: a process that computes the ERD features as the mean power of y in the time window

$$a = \log \left( \frac{1}{L} \int_0^L y^2(t) \, dt \right) \tag{2}$$

where the square is taken elementwise. Each element of a represents the mean band power of a particular spatial component in W.
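The three steps are straightforward to express in code. The sketch below is a minimal rendering that we add for illustration; the fourth-order Butterworth band-pass is an assumption standing in for the learned spectral filter h, and this is not the authors' implementation:

```python
import numpy as np
from scipy import signal

def ossfn_features(z, W, fs, band):
    """Map one EEG block to an OSSFN feature vector.

    z:    (n_c, T) raw EEG block (channels x time samples)
    W:    (n_c, n_l) spatial filtering matrix, filters in columns
    fs:   sampling rate in Hz
    band: (low, high) passband in Hz of the spectral filter
    """
    # 1) Spectral filtering (a stand-in for the learned filter h).
    sos = signal.butter(4, band, btype="bandpass", fs=fs, output="sos")
    x = signal.sosfiltfilt(sos, z, axis=1)
    # 2) Spatial filtering, eq. (1): y(t) = W^T x(t).
    y = W.T @ x                              # (n_l, T)
    # 3) Log power integral, eq. (2): log mean band power per component.
    return np.log(np.mean(y ** 2, axis=1))   # (n_l,) feature vector
```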
[Fig. 1 block diagram: an EEG trial passes through spectral filtering and spatial filtering, then per-component squaring (.)², integration ∫, and logarithm ln, to produce the output features; in the training program, a mutual information estimate computed from the features and class labels drives the optimization algorithm that tunes the filters; in the test program, features from unseen (test) data and class labels feed a classifier for evaluation.]
Fig. 1. Diagram of the proposed network for extracting motor imagery EEG features. A motor imagery EEG block, in the form of time-windowed multivariate waveforms x(t), is processed first by a spectral (band-pass) filter to pick up the subject-specific responsive rhythmic activity, and subsequently by a linear transformation (the spatial filters) and log power integration. The output feature vector describes the mean power of particular spatio-spectral components associated with motor imagery. The network takes a maximum mutual information approach to optimizing the spectral filter and the spatial filters.
The logarithm operation has been widely used since the introduction of CSP in [16], which describes its purpose as “to approximate normal distribution of the data.” We would like to note that another positive effect of the logarithm operation is the reduced dynamic range, which facilitates the subsequent processing, e.g., by a classifier. In addition, extreme feature values (suspected artifacts) in some EEG blocks are largely compressed before the corrupted information (such as intraclass variance) is fed into the learning machine. Our BCI experience suggests that the logarithm operation can improve classification accuracy.
This paper introduces mutual information [27] to formulate the objective function for the learning machine. Consider the mutual information between the feature vector variable A and the class label variable Ω

$$I(A, \Omega) = H(\Omega) - H(\Omega|A) = H(A) - H(A|\Omega) = H(A) - \sum_{\omega \in \Omega} H(A|\omega) P(\omega) \tag{3}$$

where $H(\Omega)$ (or $H(A)$) is the entropy of the class label (or of the feature vector), and ω is a particular class label (e.g., ω = 1 or ω = 2 represents left- or right-hand motor imagination). $H(A|\omega)$ is the conditional entropy of the obtained feature vector for a particular class, and $H(\Omega|A)$ is the conditional entropy of the class label given the obtained feature vector.

Now we define the objective function for learning. Since the feature vector a is determined by the band-pass filter h and the spatial filters W, the objective is to maximize $I(A, \Omega)$ with respect to h and W

$$\{h_{opt}, W_{opt}\} = \arg\max_{\{h, W\}} I(A, \Omega). \tag{4}$$

Let us discuss the relevance of mutual information as an objective function for discriminative learning. The mutual information $I(A, \Omega)$ is the reduction of uncertainty brought by the feature vector [27]: the entropy $H(\Omega)$ is the uncertainty about the class label, while after observing the feature vector the uncertainty reduces to the conditional entropy $H(\Omega|A)$. An earlier paper [28] connected the maximum mutual information criterion to the minimum Bayes error via lower and upper bounds. A recent paper [29] further studied the
relationship between maximum mutual information and other criteria for feature extraction, though in the context of linear feature extraction rather than in the present nonlinear context (see the processing steps above). Importantly, that paper concludes that maximum mutual information is Bayesian optimal under more general conditions than the other criteria. Coincidentally, recent years have seen attempts [30], [31] to address linear feature extraction problems using the maximum mutual information principle.

III. LEARNING ALGORITHM

The technical challenge in achieving the objective in (4) lies primarily in the fact that the objective function (the mutual information) is a functional of probability densities and, in general, cannot be expressed in explicit form. To address this problem, we propose a learning method that first introduces a mutual information estimation technique and then derives a gradient-based optimization algorithm.

A. Mutual Information Estimate

Since the mutual information in (3) depends on the entropies, we approximate it by first estimating the entropies. The entropy of the feature vector variable and the conditional entropy are, respectively, given by

$$H(A) = -\int_a p(a) \log(p(a)) \, da \tag{5}$$

and

$$H(A|\omega) = -\int_a p(a|\omega) \log\left[p(a|\omega)\right] da. \tag{6}$$

The entropy of A can be viewed as an expectation of the function $-\log(p(a))$ [32, Sec. 5]. Suppose a set of $n_a$ empirical samples of the feature vector a is available: $a_i$, $i = 1, \ldots, n_a$. The entropy can then be estimated by

$$H(A) = -E[\log(p(a))] \approx -\frac{1}{n_a} \sum_{i=1}^{n_a} \log(p(a_i)). \tag{7}$$

Similarly

$$H(A|\omega) = -E[\log(p(a|\omega))] \approx -\frac{1}{n_\omega} \sum_{a_i \in \omega} \log(p(a_i|\omega)). \tag{8}$$

The underlying probability density function can also be estimated from the samples using kernel density estimation [33]

$$\hat{p}(a) = \frac{1}{n_a} \sum_{i=1}^{n_a} \varphi(a - a_i). \tag{9}$$

Using a Gaussian for the kernel function φ, the Gaussian kernel density estimate is well known for its capability for general data analysis [33], [34]. A multivariate Gaussian function is given by

$$\varphi(r) = (2\pi)^{-\frac{n_l}{2}} |\psi|^{-\frac{1}{2}} e^{-\frac{1}{2} r^T \psi^{-1} r} \tag{10}$$

where r denotes the term $a - a_i$, and ψ usually takes a diagonal matrix form called the bandwidth matrix. The diagonal elements of the bandwidth matrix determine the smoothness of the kernel. We choose the following bandwidth for the kernel:

$$\psi_{k,k} = \zeta \, \frac{1}{n_a - 1} \sum_{i=1}^{n_a} (a_{ik} - \bar{a}_k)^2 \tag{11}$$

where $\bar{a}_k$ is the empirical mean of $\{a_{ik}\}$, i.e., of the k-th element across the feature vector samples. We use the normal optimal smoothing strategy [34] to set the coefficient, i.e., $\zeta = (4/(3 n_a))^{0.1}$.

By introducing (9) into (7), the entropy H(A) is approximated by

$$H(A) \approx \hat{H}(A) = -\frac{1}{n_a} \sum_{j=1}^{n_a} \log \left\{ \frac{1}{n_a} \sum_{i=1}^{n_a} \varphi(a_j - a_i) \right\}. \tag{12}$$

The conditional intraclass entropy $\hat{H}(A|\omega)$ is estimated similarly. We replace the entropies in (3) by the estimates $\hat{H}(A)$ and $\hat{H}(A|\omega)$. This results in a sample-based estimate of the mutual information. The full expression of the estimate is omitted since it is straightforward from the above.
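Equations (9)–(12) translate almost line by line into code. The sketch below is our own NumPy transcription, added for illustration; for simplicity it reuses the pooled bandwidth of (11) for the intraclass entropies, a detail the text does not spell out:

```python
import numpy as np

def entropy_kde(A, psi_diag):
    """Entropy estimate (12) for samples A of shape (n, d),
    with the diagonal bandwidth matrix of (11)."""
    n, d = A.shape
    diff = A[:, None, :] - A[None, :, :]           # a_j - a_i, shape (n, n, d)
    quad = np.sum(diff ** 2 / psi_diag, axis=2)    # r' psi^{-1} r
    norm = (2 * np.pi) ** (-d / 2) / np.sqrt(np.prod(psi_diag))
    phi = norm * np.exp(-0.5 * quad)               # Gaussian kernel (10)
    p_hat = phi.mean(axis=1)                       # KDE (9) at every a_j
    return -np.mean(np.log(p_hat))                 # entropy estimate (12)

def mutual_information(A, labels):
    """Sample-based estimate of I(A, Omega), cf. (3)."""
    n = len(A)
    zeta = (4.0 / (3.0 * n)) ** 0.1                # normal optimal smoothing
    psi_diag = zeta * A.var(axis=0, ddof=1)        # bandwidth (11)
    H_A = entropy_kde(A, psi_diag)
    H_cond = 0.0
    for w in np.unique(labels):                    # sum_w P(w) H(A|w)
        Aw = A[labels == w]
        H_cond += (len(Aw) / n) * entropy_kde(Aw, psi_diag)
    return H_A - H_cond
```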
B. Subspace Gradient Descent Learning

In this section, we derive a numerical solution for maximizing the mutual information estimate with respect to the spatial filters in W in conjunction with a band-pass filter. For simultaneous optimization of all the spatial filter vectors in W, we consider a joint vector formed by concatenating all the spatial filters

$$\hat{w} = \left[ w_1^T \; \ldots \; w_l^T \; \ldots \; w_{n_l}^T \right]^T. \tag{13}$$

As described earlier, the mutual information $I(A, \Omega)$ is estimated from all the feature vector samples $\{a_i\}$. Since each of the samples is in turn a function of $\hat{w}$, we have

$$\frac{\partial I(A, \Omega)}{\partial \hat{w}} = \sum_{i=1}^{n_a} \frac{\partial I(A, \Omega)}{\partial a_i} \frac{\partial a_i}{\partial \hat{w}}. \tag{14}$$

The partial derivative $\partial I(A, \Omega)/\partial a_i$ can be computed by differentiating (3), which gives

$$\frac{\partial I(A, \Omega)}{\partial a_i} = \frac{\partial H(A)}{\partial a_i} - P(\omega) \frac{\partial H(A|\omega)}{\partial a_i} \tag{15}$$

where ω is the class label of the sample $a_i$. To compute $\partial I/\partial a_i$, the partial derivatives $\partial H(A)/\partial a_i$ and $\partial H(A|\omega)/\partial a_i$ are required. To compute $\partial H(A)/\partial a_i$, differentiate (12) with respect to $a_i$, which gives

$$\frac{\partial H(A)}{\partial a_i} = -\frac{1}{n_a} \sum_{j=1}^{n_a} \beta_j \, \frac{1}{n_a} \sum_{k=1}^{n_a} \frac{\partial \varphi(a_j - a_k)}{\partial a_i} \tag{16}$$

where

$$\beta_j = \left[ \frac{1}{n_a} \sum_{k=1}^{n_a} \varphi(a_j - a_k) \right]^{-1} \tag{17}$$
and

$$\frac{\partial \varphi(a_j - a_k)}{\partial a_i} = \begin{cases} -\varphi(a_i - a_k) \, \psi^{-1} (a_i - a_k), & \text{if } i = j \\ -\varphi(a_i - a_j) \, \psi^{-1} (a_i - a_j), & \text{if } i = k \\ 0, & \text{otherwise.} \end{cases} \tag{18}$$

The computation of the partial derivative $\partial H(A|\omega)/\partial a_i$ is performed similarly.

To compute the partial derivative $\partial a_i / \partial \hat{w}$, we first consider a particular element, say the l-th element $a_{il}$ of $a_i$. From (13), the partial derivative of this element with respect to $w_l$ is

$$\frac{\partial a_{il}}{\partial w_l} = \frac{\partial \log \left[ \frac{1}{L} \int_0^L \left( w_l^T x_i(t) \right)^2 dt \right]}{\partial w_l} = \frac{1}{e^{a_{il}}} \cdot \frac{\partial \left[ \frac{1}{L} w_l^T \int_0^L x_i(t) x_i^T(t) \, dt \; w_l \right]}{\partial w_l} = \frac{2}{L e^{a_{il}}} w_l^T R_{x_i} \tag{19}$$

where $x_i(t)$ denotes the EEG sequence in the i-th trial, and

$$R_{x_i} = \int_0^L x_i(t) \, x_i^T(t) \, dt. \tag{20}$$

Since $a_{ij}$ is dependent on $w_l$ only for $j = l$

$$\frac{\partial a_{ij}}{\partial w_l} = 0 \quad \text{if } j \ne l \tag{21}$$

the partial derivative of $a_i$ with respect to $\hat{w}$ is thus the block-diagonal matrix

$$\frac{\partial a_i}{\partial \hat{w}} = \frac{2}{L} \, \mathrm{blockdiag} \left( e^{-a_{i1}} w_1^T R_{x_i}, \; e^{-a_{i2}} w_2^T R_{x_i}, \; \ldots, \; e^{-a_{i n_l}} w_{n_l}^T R_{x_i} \right). \tag{22}$$

Now we can compute the gradient by introducing the above equation into (14). However, a practical issue arises for multichannel EEG and multiple spatial filters. Consider an example in which the EEG has $n_c = 59$ channels and W contains $n_l = 2$ filters. The number of free parameters would then be 2 × 59 = 118, and gradient-based optimization in this high-dimensional space would be difficult. To address this issue, we propose the following subspace optimization approach.

Consider an $n_u$-dimensional ($n_u \ll n_c$) subspace $\mathcal{U}$, linearly spanned by the $n_c$-dimensional column vectors (the bases) of a matrix $U \in \mathbb{R}^{(n_c \times n_u)}$

$$U = \left[ u_1, u_2, \ldots, u_k, \ldots, u_{n_u} \right] \tag{23}$$

where $u_k$ denotes the k-th basis vector. A spatial filter vector $w_l$ in the subspace can be expressed as

$$w_l = \sum_{k=1}^{n_u} b_{lk} u_k = U b_l \tag{24}$$

where $b_l$ is a coefficient vector that determines $w_l$

$$b_l = \left[ b_{l1}, b_{l2}, \ldots, b_{l n_u} \right]^T. \tag{25}$$

Hence, $b_l$ is the low-dimensional representation of the spatial filter $w_l$. In the subspace $\mathcal{U}$, simultaneous optimization of the spatial filters is equivalent to simultaneous optimization of the concatenated coefficient vector

$$\hat{b} = \left[ b_1^T \; b_2^T \; \ldots \; b_{n_l}^T \right]^T. \tag{26}$$

Now consider the partial derivatives of $I(A, \Omega)$ with respect to $\hat{b}$

$$\frac{\partial I(A, \Omega)}{\partial \hat{b}} = \sum_{i=1}^{n_a} \frac{\partial I}{\partial a_i} \frac{\partial a_i}{\partial \hat{b}}. \tag{27}$$

Substitution of (24) into (2) gives

$$a_{il} = \log \left[ \frac{1}{L} \int_0^L \left( (U b_l)^T x_i(t) \right)^2 dt \right]. \tag{28}$$

Similar to (19), differentiating (28) gives

$$\frac{\partial a_{il}}{\partial b_l} = \frac{2}{L e^{a_{il}}} (U b_l)^T R_{x_i} U. \tag{29}$$

Therefore

$$\frac{\partial a_i}{\partial \hat{b}} = \frac{2}{L} \, \mathrm{blockdiag} \left( e^{-a_{i1}} (U b_1)^T R_{x_i} U, \; \ldots, \; e^{-a_{i n_l}} (U b_{n_l})^T R_{x_i} U \right). \tag{30}$$

Now, introducing the above equation into $\partial I(A, \Omega)/\partial \hat{b}$ (expressed in a form similar to (14), with $\hat{b}$ substituted for $\hat{w}$), we can compute the gradient of the mutual information estimate with respect to the low-dimensional $\hat{b}$. This effectively reduces the number of free parameters for learning: in the earlier example, the number of free parameters reduces from 118 to 8 in an $n_u$ = 4-dimensional subspace.

How to optimally construct the subspace U is, however, beyond the scope of this paper. Tentatively, we simply use the spatial filters given by CSP (with the band-pass filter selected by FBCSP) as the subspace bases. We stress that the proposed optimization procedure, as a general approach, is neither tailored nor dedicated to the CSP or FBCSP subspace; we expect that more effective subspace construction methods will be devised.

As mentioned earlier, the subject-specific sensorimotor rhythm of the ERD must be selected for effective extraction of the spatial patterns associated with the ERD. To this end, we need to maximize the mutual information estimate with respect to the spatial filters in conjunction with a band-pass filter. Inspired by previous works [22], [35] that choose the optimum band-pass
filter from an array of filter banks, we propose a joint spatio-spectral filter learning algorithm (Fig. 2) in a filter bank framework. Briefly, the algorithm first decomposes the EEG data into an array of frequency bands that cover the range of possible ERD rhythms, performs spatial filter optimization in each band, and then selects the band with the maximum mutual information estimate.

IV. EXPERIMENTS AND RESULTS

This section reports an offline analysis of the proposed method for extracting the ERD features.

A. Materials: Motor Imagery EEG Datasets

1) BCI Competition IV Dataset I: The dataset [24] consists of both human and artificially generated motor imagery data. We consider the human EEG data only, which were collected from four healthy subjects using a BrainAmp MR plus EEG amplifier with 59 channels sampled at 1000 Hz. Each subject participated in two data collection sessions with different protocols, as described below.

In the calibration session, a visual cue was displayed on a computer screen to the subjects, who then started to perform motor imagery tasks according to the cue. The cue represented specific motor imagery tasks: each subject chose two classes of motor imagery tasks from left hand, right hand, or foot. Specifically, subject "a" chose {left, foot}, "b" chose {left, right}, "f" chose {left, foot}, and "g" chose {left, right}. Each subject performed a total of 200 motor imagery tasks (balanced between the two tasks), each in the [0, 4]-s window after the cue. Consecutive motor imagery tasks were interleaved with a 4-s break.

In the evaluation session, the subjects followed soft voice commands from an instructor to perform motor imagery tasks of varying length between 1.5 and 8 s. Consecutive tasks were also interleaved with intervals of varying length from 1.5 to 8 s. This session was meant for offline validation of motor imagery classification algorithms for self-paced BCI (see [36]).

Our study uses the down-sampled data (provided by the organizer) at a 100-Hz sampling rate, with all 59 channels employed for spatio-spectral feature extraction. The 59 channels are AF3, AF4, F5, F3, F1, Fz, F2, F4, F6, FC5, FC3, FC1, FCz, FC2, FC4, FC6, CFC7, CFC5, CFC3, CFC1, CFC2, CFC4, CFC6, CFC8, T7, C5, C3, C1, Cz, C2, C4, C6, T8, CCP7, CCP5, CCP3, CCP1, CCP2, CCP4, CCP6, CCP8, CP5, CP3, CP1, CPz, CP2, CP4, CP6, P5, P3, P1, Pz, P2, P4, P6, PO1, PO2, O1, and O2.

2) Our Motor Imagery Dataset: The data were recorded in our laboratory from seven healthy male subjects. Each subject performed 160 motor imagery tasks (80 left-hand and 80 right-hand). Similar to the calibration session of the BCI Competition dataset, the data collection procedure used visual cues to prompt the subjects to perform motor imagery tasks for 4 s each.
Consecutive motor imagery tasks were interleaved with a 6-s break. The EEG data were recorded using a NuAmps amplifier with 25 channels sampled at 250 Hz. The 25 channels, F7, F3, Fz, F4, F8, FT7, FC3, FCz, FC4, FT8, T7, C3, Cz, C4, T8, TP7, CP3, CPz, CP4, TP8, P7, P3, Pz, P4, and P8, cover the full scalp. The data collection and study were approved by the National University of Singapore Institutional Review Board with reference code 08-036.

Given the considerable differences in data collection setup, in terms of EEG amplifiers and motor imagery task protocols, an effective unification of the two datasets is difficult. Instead, this paper validates the proposed method on the two datasets separately. Furthermore, this allows validation of the proposed method under two different conditions (effectively three, since the calibration session and the evaluation session in the BCI Competition data followed different protocols), which is an important consideration for studying generalization performance.

B. Selection of Hyperparameters

The following describes how we set the hyperparameters for feature extraction and classification. First, selecting a time interval within the motor imagery tasks is common practice in learning from motor imagery EEG. This paper selects the time interval [1, 4] s after the cue. The first 1-s period after the cue is excluded since it contains the spontaneous responses (evoked potentials) to the cue stimulus [37, Sec. V]. As the BCI Competition evaluation set has motor imagery tasks of varying duration, we consider the same time interval and remove those motor imagery tasks shorter than 4 s. Consequently, the number of remaining motor imagery tasks in the evaluation data ranges from 111 to 126 across the four subjects.

Second, the filter bank (an array of band-pass filters) is constructed to continuously cover a wide frequency range. Specifically, a total of eight Chebyshev Type II filters (though other types of filters could be used instead) are built with center frequencies spanning from 8 to 32 Hz at a constant interval in the logarithm domain. Consequently, the center frequencies are, respectively, 8, 9.75, 11.89, 14.49, 17.67, 21.53, 26.25, and 32 Hz. All the filters have a uniform Q-factor (bandwidth-to-center frequency ratio) of 0.33 as well as an order of 4. The filter bank processes each EEG block separately after it is extracted from the selected time interval mentioned above.

The number of spatial filters to be constructed is also an important hyperparameter. This paper considers the learning of two spatial filters only, corresponding to a transformation matrix W in (1) with two column vectors. Consequently, the feature vector is bivariate.

C. Mutual Information Surface and Selected Spatial Patterns

Here we use the calibration data from the BCI Competition dataset to investigate the surface of the mutual information versus the spatial filters. We visualize the mutual information estimate in a low-dimensional space U (see Section III-B), in which each point defines a particular spatial filter.
Input: Training EEG data comprising N sample blocks {z(t)}, each block with a specific class label.
Output: The filtering network depicted in Fig. 1, with optimum parameters for the spatial filters and the selection of the optimum band-pass filter.

Step 1: Construct an array of n_s band-pass filters that covers the EEG rhythms of motor imagery, then filter {z(t)} to yield {x_m(t)} for m = 1, ..., n_s.

Step 2: For each band-pass filter output {x_m(t)}:
  1) Construct a discriminative spatial filter subspace:
     a) compute the empirical covariance matrices of the two classes, Σ_x0 and Σ_x1;
     b) compute the eigenvectors and eigenvalues of Σ_x0⁻¹ Σ_x1 (refer to (37));
     c) select the n_u eigenvectors that correspond to the largest and smallest eigenvalues λ, sort the eigenvectors from large to small eigenvalues, and use these eigenvectors as the bases U of the low-dimensional subspace for parameterization of the spatial filters.
  2) Set the initial parameters of the spatial filters: b_1⁰ = [1, 0, ..., 0]ᵀ, b_2⁰ = [0, ..., 0, 1]ᵀ, b_3⁰ = [0, 1, 0, ..., 0]ᵀ, and so on. This setting effectively chooses the top and bottom spatial filters generated by CSP or FBCSP.
  3) Set the iteration count k = 0 and repeat the following steps until convergence, the criterion being that the change of the mutual information estimate is smaller than a small threshold ζ:
     a) compute the spatial filters W from b̂_k via (24): W = U[b_1^k, b_2^k, ..., b_{n_l}^k];
     b) use W to update the feature vectors according to (1) and (2);
     c) compute the gradient ∇b̂ = ∂I/∂b̂ using (27);
     d) perform a line search over a step factor s selected from the range [−1, 1] with an interval of 0.01:
        i) set
           b̂(s) = b̂_k + s ∇b̂ / ||∇b̂||₂    (31)
           where ||·||₂ denotes the l2 norm of the gradient vector;
        ii) compute the mutual information estimate I(s) with the spatial filters defined by b̂(s);
     e) update the parameter vector of the spatial filters using the optimum step s_opt = arg max_s I(s);
     f) update the mutual information I_k = I(s_opt) and set b̂_k = b̂(s_opt);
     g) compute the change in mutual information δ = I_k − I_{k−1} (with I_{k−1} = I(s = 0) if unassigned); if δ < ζ or the iteration count k exceeds a preset number, continue to the next step; otherwise go back to step a);
     h) set the optimum spatial filters for this frequency band to W_m = U b̂_k and the corresponding mutual information to I_m = I_k.

Step 3: Select the optimum frequency band m_opt = arg max_m I_m, set the spectral filter to the m_opt-th band-pass filter, and set the spatial filters to W_{m_opt}.

Fig. 2. Learning algorithm for the spatio-spectral filtering network.
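The inner optimization of Fig. 2 (steps 3a–3g) can be compactly sketched as below. This is our paraphrase rather than the authors' code: mi_fn and grad_fn are placeholders for the mutual information estimate of Section III-A and the gradient of (27), both evaluated for a given coefficient vector b̂:

```python
import numpy as np

def maximize_mi(b0, mi_fn, grad_fn, max_iter=100, tol=1e-4):
    """Gradient ascent with the line search of Fig. 2, steps 3a-3g.

    b0:      initial concatenated coefficient vector b-hat, eq. (26)
    mi_fn:   callable b -> mutual information estimate I(b)
    grad_fn: callable b -> gradient dI/db, eq. (27)
    """
    b = np.asarray(b0, dtype=float)
    I_prev = mi_fn(b)
    for _ in range(max_iter):
        g = grad_fn(b)
        direction = g / np.linalg.norm(g)         # normalized gradient, (31)
        s_grid = np.arange(-1.0, 1.01, 0.01)      # step factors in [-1, 1]
        candidates = [b + s * direction for s in s_grid]
        scores = [mi_fn(c) for c in candidates]
        k = int(np.argmax(scores))                # s_opt = argmax_s I(s)
        b, I_new = candidates[k], scores[k]
        if I_new - I_prev < tol:                  # convergence on delta I
            break
        I_prev = I_new
    return b
```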
As surfaces of more than two dimensions would be difficult to visualize, we consider a 2-D subspace spanned by the first pair (i.e., the top and the bottom) of CSP filters from the frequency band selected by FBCSP [22]. Fig. 3 uses color image presentations to illustrate the result for each of the four subjects in the BCI Competition data. Each pixel represents a spatial filter, and the value of the corresponding mutual information estimate is denoted by the color. In three of the subjects, namely "a," "b," and "g," a peak of the mutual information estimate appears near the point [0 1], which represents the bottom spatial filter from FBCSP. In "f," however, no such peak is found near [0 1]. Instead, a peak is prominent near the point [1 0], which corresponds to the top filter from FBCSP. Hence, we use the top filter for the FBCSP mark in "f," and the bottom one in the others.

The result suggests favorable conditions for the proposed method. First, the surface is smooth, which facilitates gradient-based optimization. Second, target peaks on the mutual information surface often have an FBCSP filter in the vicinity, which validates the use of FBCSP spatial filters for initializing the optimization.
Furthermore, Fig. 4 shows the top two spatial patterns (each with a particular frequency band) that together maximize the mutual information measure for each of the subjects in the competition dataset. The patterns are consistent with neurophysiological principles of motor imagery, except for subject "b." For example, the spatial patterns for subject "g" show that the two most discriminative patterns correspond to EEG sources originating in the motor cortex of the right and left hemispheres. Furthermore, the frequency bands of the selected spatial patterns are mostly from the β rhythm, except for the second spatial pattern of subject "a."

D. Classification Results

This paper compares the proposed method with CSP and FBCSP using five rounds of fivefold cross-validation. FBCSP shares the same band-pass filter array as described in Section IV-B; in effect, it is the proposed method before optimization. Thus, it selects only one frequency band and two spatial filters. CSP is implemented following [16], using an [8, 30]-Hz Chebyshev Type II band-pass filter. The top and the bottom filters produced by CSP are selected.
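For reference, the CSP baseline amounts to a generalized eigenvalue problem on the two intraclass covariance matrices (cf. step 2.1 of Fig. 2). The sketch below is a standard textbook-style implementation that we include for illustration; it is not the code used in the experiments:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials, labels, n_filters=2):
    """Compute CSP spatial filters from band-pass-filtered trials.

    trials: list of (n_c, T) arrays; labels: array of 0/1 class labels.
    Returns an (n_c, n_filters) matrix of top/bottom eigenvectors.
    """
    def mean_cov(cls):
        covs = [z @ z.T / np.trace(z @ z.T)       # normalized covariance
                for z, y in zip(trials, labels) if y == cls]
        return np.mean(covs, axis=0)

    S0, S1 = mean_cov(0), mean_cov(1)
    # Generalized eigenvalue problem: S0 w = lambda (S0 + S1) w.
    vals, vecs = eigh(S0, S0 + S1)                # eigenvalues ascending
    half = n_filters // 2
    pick = np.r_[np.arange(half),
                 np.arange(len(vals) - (n_filters - half), len(vals))]
    return vecs[:, pick]
```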
Fig. 3. Surface of the mutual information estimate over a bivariate b (see (24)), i.e., the coefficient vector that defines a spatial filter. Notes: see Section IV-C for details. The four graphs correspond to the four subjects in BCI Competition IV Dataset I. The axes b1 and b2 denote the first and second elements of the coefficient vector b in (24) that defines a spatial filter. The value of the mutual information estimate is indicated by the color according to the overhead color bar. See Section III-B for the description of the parameterization of spatial filters in the subspace of spatial filters; here the subspace is spanned by two spatial filters from FBCSP, so that, e.g., the point [1 0] corresponds to the first spatial filter of FBCSP. The previous FBCSP filter and the local optimum filter are annotated, respectively, by a square and a circle in each graph.
[Fig. 4 panels: eight spatial-pattern plots, two per subject ("a," "b," "f," "g"), each titled with its motor imagery class and frequency band: foot [27.7 36.3] Hz, left [6.7 9.3] Hz, foot [6.7 9.3] Hz, left [14.8 20.6] Hz, left [6.7 9.3] Hz, right [27.7 36.3] Hz, left [18.0 25.1] Hz, right [18.0 25.1] Hz.]
Fig. 4. Spatial patterns of motor imagery EEG. In each column, the two spatial features selected according to maximal mutual information between classes are plotted in the form of spatial patterns (see [16]). The positions of the electrodes are superimposed as black dots. The spatial patterns are viewed from the top, with the nose facing upward. The motor imagery class and the frequency band of each feature are given above each plot.
TABLE II
CLASSIFICATION ACCURACY (MEAN AND SD) IN BCI COMPETITION IV DATASET I

Sub   Train–Test     CSP          FBCSP        OSSFN        p-value (OSSFN = CSP)   p-value (OSSFN = FBCSP)
a     Calib.         67.1 (11.4)  66.5 (13.3)  89.8 (9.0)
a     Calib.–Eval.   65.9 (4.7)   79.2 (7.2)   93.6 (3.4)