Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6594
Andrej Dobnikar, Uroš Lotrič, and Branko Šter (Eds.)
Adaptive and Natural Computing Algorithms 10th International Conference, ICANNGA 2011 Ljubljana, Slovenia, April 14-16, 2011 Proceedings, Part II
Volume Editors Andrej Dobnikar, Uroš Lotrič, Branko Šter University of Ljubljana Faculty of Computer and Information Science Tržaška 25, 1000 Ljubljana, Slovenia E-mail: {andrej.dobnikar, uros.lotric, branko.ster}@fri.uni-lj.si
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-20266-7 e-ISBN 978-3-642-20267-4 DOI 10.1007/978-3-642-20267-4 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011923992 CR Subject Classification (1998): F.1-2, I.2.3, I.2, I.5, D.2.2, D.4.7, D.1 LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The 2011 edition of ICANNGA marked the 10th anniversary of the conference series, started in 1993 in Innsbruck, Austria, where it was decided to organize a similar scientific meeting biennially. Since then, and with considerable success, the conference has taken place in Alès in France (1995), Norwich in the UK (1997), Portorož in Slovenia (1999), Prague in the Czech Republic (2001), Roanne in France (2003), Coimbra in Portugal (2005), Warsaw in Poland (2007), and Kuopio in Finland (2009), and this year, for the second time, in Slovenia, in its capital Ljubljana (2011). The Faculty of Computer and Information Science of the University of Ljubljana was pleased and honored to host this conference. We chose the old university palace as the conference site in order to keep the traditionally good academic atmosphere of the meeting. It is located in the very centre of the capital and is surrounded by many cultural and tourist sights. The ICANNGA conference was originally limited to neural networks and genetic algorithms, and was named after this primary orientation: International Conference on Artificial Neural Networks and Genetic Algorithms. Very soon the conference broadened its outlook, and in Coimbra (2005) the same abbreviation got a new meaning: International Conference on Adaptive and Natural computiNG Algorithms. Thereby the popular short name remained, and yet the conference is widely open to many new disciplines related to adaptive and natural algorithms. This year we received 144 papers from 33 countries. After a peer-review process with at least two reviewers per paper, 83 papers were accepted and included in the proceedings. The papers were divided into seven groups: neural networks, evolutionary computation, pattern recognition, soft computing, systems theory, support vector machines, and bioinformatics. The submissions were recommended for oral or for poster presentation. The ICANNGA 2011 plenary lectures were planned to combine several compatible disciplines, such as adaptive computation (Rudolf Albrecht), artificial intelligence (Ivan Bratko), synthetic biology and biomolecular modelling of new biological systems (Roman Jerala), computational neurogenetic modelling (Nikola Kasabov), and robots with biological brains (Kevin Warwick). We believe these discussions served as an inspiration for future contributions. One of the traditions of all ICANNGA conferences so far has been to combine pleasantness and usefulness. The cultural and culinary traditions of the organizing country helped to create an atmosphere for a successful and friendly meeting. We would like to thank the Advisory Committee for their guidance, advice and discussions. Furthermore, we wish to express our gratitude to the Program Committee, the reviewers and sub-reviewers for their substantial work in revising
the papers. Our recognition also goes to Springer, our publisher, and especially to Alfred Hofmann, Editor-in-Chief of LNCS, for his support and collaboration. Many thanks go to the agency Go-mice and its representative Natalija Bah Čad for her help and effort. And last but not least, on behalf of the Organizing Committee of ICANNGA 2011, we want to express our special recognition to all the participants, who contributed enormously to the success of the conference. We hope that you will enjoy reading this volume and that you will find it inspiring and stimulating for your future work and research. April 2011
Andrej Dobnikar
Uroš Lotrič
Branko Šter
Organization
ICANNGA 2011 was organized by the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.
Advisory Committee
Rudolf Albrecht, University of Innsbruck, Austria
Bartlomiej Beliczynski, Warsaw University of Technology, Poland
Andrej Dobnikar, University of Ljubljana, Slovenia
Mikko Kolehmainen, University of Eastern Finland, Finland
Vera Kurkova, Academy of Sciences of the Czech Republic, Czech Republic
David Pearson, University Jean Monnet of Saint-Etienne, France
Bernardete Ribeiro, University of Coimbra, Portugal
Nigel Steele, Coventry University, UK
Program Committee Andrej Dobnikar, Slovenia (Chair) Jarmo Alander, Finland Rudolf Albrecht, Austria Rubén Armañanzas, Spain Bartlomiej Beliczynski, Poland Ernesto Costa, Portugal Janez Demšar, Slovenia Antonio Dourado, Portugal Stefan Figedy, Slovakia Alexandru Floares, Romania Juan A. Gomez-Pulido, Spain Barbara Hammer, Germany Honggui Han, China Osamu Hoshino, Japan Marcin Iwanowski, Poland Martti Juhola, Finland Paul C. Kainen, USA Helen Karatza, Greece Kostas D. Karatzas, Greece Nikola Kasabov, New Zealand Mikko Kolehmainen, Finland Igor Kononenko, Slovenia Jozef Korbicz, Poland
Vera Kurkova, Czech Republic Kauko Leiviska, Finland Aleš Leonardis, Slovenia Uroš Lotrič, Slovenia Danilo P. Mandic, UK Francesco Masulli, Italy Roman Neruda, Czech Republic Stanislaw Osowski, Poland David Pearson, France Jan Peters, Germany Bernardete B. Ribeiro, Portugal Juan M. Sanchez-Perez, Spain Catarina Silva, Portugal Nigel Steele, UK Branko Šter, Slovenia Miroslaw Swiercz, Poland Ryszard Tadeusiewicz, Poland Tatiana Tambouratzis, Greece Miguel A. Vega-Rodriguez, Spain Kevin Warwick, UK Blaž Zupan, Slovenia
Organizing Committee Andrej Dobnikar, Uroš Lotrič, Branko Šter, Nejc Ilc, Davor Sluga, Jernej Zupanc, Natalija Bah Čad
Reviewers Jarmo Alander, Rudolf Albrecht, Ana de Almeida, Mário João Antunes, Rubén Armañanzas, Iztok Lebar Bajec, Bartlomiej Beliczynski, Zoran Bosnić, Ernesto Costa, Janez Demšar, Andrej Dobnikar, Antonio Dourado, Stefan Figedy, Alexandru Floares, Juan A. Gomez-Pulido, Črtomir Gorup, Barbara Hammer, Honggui Han, Jorge Henriques, Osamu Hoshino, Marcin Iwanowski, Martti Juhola, Paul C. Kainen, Helen Karatza, Kostas D. Karatzas, Nikola Kasabov, Mikko Kolehmainen, Igor Kononenko, Jozef Korbicz, Vera Kurkova, Kauko Leiviska, Aleš Leonardis, Pedro Luis López-Cruz, Uroš Lotrič,
Danilo P. Mandic, Francesco Masulli, Neža Mramor Kosta, Miha Mraz, Roman Neruda, Dominik Olszewski, Stanislaw Osowski, David Pearson, Jan Peters, Matija Polajnar, Mengyu Qiao, Bernardete B. Ribeiro, Marko Robnik Šikonja, Mauno Rönkkö, Gregor Rot, Aleksander Sadikov, Juan M. Sanchez-Perez, Catarina Silva, Danijel Skočaj, Nigel Steele, Miroslaw Swiercz, Miha Štajdohar, Branko Šter, Ryszard Tadeusiewicz, Tatiana Tambouratzis, Marko Toplak, Miguel A. Vega-Rodriguez, Alen Vrečko, Kevin Warwick, Blaž Zupan, Jure Žabkar, Lan Žagar, Jure Žbontar
Table of Contents – Part II

Pattern Recognition and Learning

Asymmetric k-Means Algorithm (Dominik Olszewski) . . . . . 1
Gravitational Clustering of the Self-Organizing Map (Nejc Ilc and Andrej Dobnikar) . . . . . 11
A General Method for Visualizing and Explaining Black-Box Regression Models (Erik Štrumbelj and Igor Kononenko) . . . . . 21
An Experimental Study on Electrical Signature Identification of Non-Intrusive Load Monitoring (NILM) Systems (Marisa B. Figueiredo, Ana de Almeida, and Bernardete Ribeiro) . . . . . 31
Evaluation of a Resource Allocating Network with Long Term Memory Using GPU (Bernardete Ribeiro, Ricardo Quintas, and Noel Lopes) . . . . . 41
Gabor Descriptors for Aerial Image Classification (Vladimir Risojević, Snježana Momić, and Zdenka Babić) . . . . . 51
Text Representation in Multi-label Classification: Two New Input Representations (Rodrigo Alfaro and Héctor Allende) . . . . . 61
Fraud Detection in Telecommunications Using Kullback-Leibler Divergence and Latent Dirichlet Allocation (Dominik Olszewski) . . . . . 71
Classification of EEG in a Steady State Visual Evoked Potential Based Brain Computer Interface Experiment (Zafer İşcan, Özen Özkaya, and Zümray Dokur) . . . . . 81
Fast Projection Pursuit Based on Quality of Projected Clusters (Marek Grochowski and Włodzisław Duch) . . . . . 89
A New N-gram Feature Extraction-Selection Method for Malicious Code (Hamid Parvin, Behrouz Minaei, Hossein Karshenas, and Akram Beigi) . . . . . 98
A Robust Learning Model for Dealing with Missing Values in Many-Core Architectures (Noel Lopes and Bernardete Ribeiro) . . . . . 108
A Model of Saliency-Based Selective Attention for Machine Vision Inspection Application (Xiao-Feng Ding, Li-Zhong Xu, Xue-Wu Zhang, Fang Gong, Ai-Ye Shi, and Hui-Bin Wang) . . . . . 118
Grapheme-Phoneme Translator for Brazilian Portuguese (Danilo Picagli Shibata and Ricardo Luis de Azevedo da Rocha) . . . . . 127

Soft Computing

Improvement of Inventory Control under Parametric Uncertainty and Constraints (Nicholas Nechval, Konstantin Nechval, Maris Purgailis, and Uldis Rozevskis) . . . . . 136
Modified Jakubowski Shape Transducer for Detecting Osteophytes and Erosions in Finger Joints (Marzena Bielecka, Andrzej Bielecki, Mariusz Korkosz, Marek Skomorowski, Wadim Wojciechowski, and Bartosz Zieliński) . . . . . 147
Using CMAC for Mobile Robot Motion Control (Kristóf Gáti and Gábor Horváth) . . . . . 156
Optimizing the Robustness of Scale-Free Networks with Simulated Annealing (Pierre Buesser, Fabio Daolio, and Marco Tomassini) . . . . . 167
Numerically Efficient Analytical MPC Algorithm Based on Fuzzy Hammerstein Models (Piotr M. Marusak) . . . . . 177
Online Adaptation of Path Formation in UAV Search-and-Identify Missions (Willem H. van Willigen, Martijn C. Schut, A.E. Eiben, and Leon J.H.M. Kester) . . . . . 186
Reconstruction of Causal Networks by Set Covering (Nick Fyson, Tijl De Bie, and Nello Cristianini) . . . . . 196
The Noise Identification Method Based on Divergence Analysis in Ensemble Methods Context (Ryszard Szupiluk, Piotr Wojewnik, and Tomasz Zabkowski) . . . . . 206
Efficient Predictive Control and Set-Point Optimization Based on a Single Fuzzy Model (Piotr M. Marusak) . . . . . 215
Wind Turbines States Classification by a Fuzzy-ART Neural Network with a Stereographic Projection as a Signal Normalization (Tomasz Barszcz, Marzena Bielecka, Andrzej Bielecki, and Mateusz Wójcik) . . . . . 225
Binding and Cross-Modal Learning in Markov Logic Networks (Alen Vrečko, Danijel Skočaj, and Aleš Leonardis) . . . . . 235
Chaotic Exploration Generator for Evolutionary Reinforcement Learning Agents in Nondeterministic Environments (Akram Beigi, Nasser Mozayani, and Hamid Parvin) . . . . . 245
Parallel Graph Transformations Supported by Replicated Complementary Graphs (Leszek Kotulski and Adam Sędziwy) . . . . . 254
Diagnosis of Cardiac Arrhythmia Using Fuzzy Immune Approach (Olgierd Unold) . . . . . 265

Systems Theory

Adaptive Finite Automaton: A New Algebraic Approach (Reginaldo Inojosa Silva Filho and Ricardo Luis de Azevedo da Rocha) . . . . . 275
Cryptanalytic Attack on the Self-Shrinking Sequence Generator (Maria Eugenia Pazo-Robles and Amparo Fúster-Sabater) . . . . . 285
About Nonnegative Matrix Factorization: On the posrank Approximation (Ana de Almeida) . . . . . 295
Stability of Positive Fractional Continuous-Time Linear Systems with Delays (Tadeusz Kaczorek) . . . . . 305
Output-Error Model Training for Gaussian Process Models (Juš Kocijan and Dejan Petelin) . . . . . 312

Support Vector Machines

Learning Readers' News Preferences with Support Vector Machines (Elena Hensinger, Ilias Flaounas, and Nello Cristianini) . . . . . 322
Incorporating a Priori Knowledge from Detractor Points into Support Vector Classification (Marcin Orchel) . . . . . 332
A Hybrid AIS-SVM Ensemble Approach for Text Classification (Mário Antunes, Catarina Silva, Bernardete Ribeiro, and Manuel Correia) . . . . . 342
Regression Based on Support Vector Classification (Marcin Orchel) . . . . . 353
Two One-Pass Algorithms for Data Stream Classification Using Approximate MEBs (Ricardo Ñanculef, Héctor Allende, Stefano Lodi, and Claudio Sartori) . . . . . 363

Bioinformatics

X-ORCA - A Biologically Inspired Low-Cost Localization System (Enrico Heinrich, Marian Lüder, Ralf Joost, and Ralf Salomon) . . . . . 373
On the Origin and Features of an Evolved Boolean Model for Subcellular Signal Transduction Systems (Branko Šter, Monika Avbelj, Roman Jerala, and Andrej Dobnikar) . . . . . 383
Similarity of Transcription Profiles for Genes in Gene Sets (Marko Toplak, Tomaž Curk, and Blaž Zupan) . . . . . 393

Author Index . . . . . 401
Table of Contents – Part I

Plenary Session

Autonomous Discovery of Abstract Concepts by a Robot (Ivan Bratko) . . . . . 1

Neural Networks

Kernel Networks with Fixed and Variable Widths (Věra Kůrková and Paul C. Kainen) . . . . . 12
Evaluating Reliability of Single Classifications of Neural Networks (Darko Pevec, Erik Štrumbelj, and Igor Kononenko) . . . . . 22
Nonlinear Predictive Control Based on Multivariable Neural Wiener Models (Maciej Ławryńczuk) . . . . . 31
Methods of Integration of Ensemble of Neural Predictors of Time Series - Comparative Analysis (Stanislaw Osowski and Krzysztof Siwek) . . . . . 41
A Rejection Option for the Multilayer Perceptron Using Hyperplanes (Eduardo Gasca A., Sergio Saldaña T., José S. Sánchez G., Valentín Velásquez G., Eréndira Rendón L., Itzel M. Abundez B., Rosa M. Valdovinos R., and Rafael Cruz R.) . . . . . 51
Parallelization of Algorithms with Recurrent Neural Networks (João Pedro Neto and Fernando Silva) . . . . . 61
Parallel Training of Artificial Neural Networks Using Multithreaded and Multicore CPUs (Olena Schuessler and Diego Loyola) . . . . . 70
Supporting Diagnostics of Coronary Artery Disease with Neural Networks (Matjaž Kukar and Ciril Grošelj) . . . . . 80
The Right Delay: Detecting Specific Spike Patterns with STDP and Axonal Conduction Delays (Arvind Datadien, Pim Haselager, and Ida Sprinkhuizen-Kuyper) . . . . . 90
New Measure of Boolean Factor Analysis Quality (Alexander A. Frolov, Dusan Husek, and Pavel Yu. Polyakov) . . . . . 100
Mechanisms of Adaptive Spatial Integration in a Neural Model of Cortical Motion Processing (Stefan Ringbauer, Stephan Tschechne, and Heiko Neumann) . . . . . 110
Self-organized Short-Term Memory Mechanism in Spiking Neural Network (Mikhail Kiselev) . . . . . 120
Approximation of Functions by Multivariable Hermite Basis: A Hybrid Method (Bartlomiej Beliczynski) . . . . . 130
Using Pattern Recognition to Predict Driver Intent (Firas Lethaus, Martin R.K. Baumann, Frank Köster, and Karsten Lemmer) . . . . . 140
Neural Networks Committee for Improvement of Metal's Mechanical Properties Estimates (Olga A. Mishulina, Igor A. Kruglov, and Murat B. Bakirov) . . . . . 150
Logarithmic Multiplier in Hardware Implementation of Neural Networks (Uroš Lotrič and Patricio Bulić) . . . . . 158
Efficiently Explaining Decisions of Probabilistic RBF Classification Networks (Marko Robnik-Šikonja, Aristidis Likas, Constantinos Constantinopoulos, Igor Kononenko, and Erik Štrumbelj) . . . . . 169
Evolving Sum and Composite Kernel Functions for Regularization Networks (Petra Vidnerová and Roman Neruda) . . . . . 180
Optimisation of Concentrating Solar Thermal Power Plants with Neural Networks (Pascal Richter, Erika Ábrahám, and Gabriel Morin) . . . . . 190
Emergence of Attention Focus in a Biologically-Based Bidirectionally-Connected Hierarchical Network (Mohammad Saifullah and Rita Kovordányi) . . . . . 200
Visualizing Multidimensional Data through Multilayer Perceptron Maps (Antonio Neme and Antonio Nido) . . . . . 210
Input Separability in Living Liquid State Machines (Robert L. Ortman, Kumar Venayagamoorthy, and Steve M. Potter) . . . . . 220
Predictive Control of a Distillation Column Using a Control-Oriented Neural Model (Maciej Ławryńczuk) . . . . . 230
Neural Prediction of Product Quality Based on Pilot Paper Machine Process Measurements (Paavo Nieminen, Tommi Kärkkäinen, Kari Luostarinen, and Jukka Muhonen) . . . . . 240
A Robotic Scenario for Programmable Fixed-Weight Neural Networks Exhibiting Multiple Behaviors (Guglielmo Montone, Francesco Donnarumma, and Roberto Prevete) . . . . . 250
Self-Organising Maps in Document Classification: A Comparison with Six Machine Learning Methods (Jyri Saarikoski, Jorma Laurikkala, Kalervo Järvelin, and Martti Juhola) . . . . . 260
Analysis and Short-Term Forecasting of Highway Traffic Flow in Slovenia (Primož Potočnik and Edvard Govekar) . . . . . 270

Evolutionary Computation

A New Method of EEG Classification for BCI with Feature Extraction Based on Higher Order Statistics of Wavelet Components and Selection with Genetic Algorithms (Marcin Kolodziej, Andrzej Majkowski, and Remigiusz J. Rak) . . . . . 280
Regressor Survival Rate Estimation for Enhanced Crossover Configuration (Alina Patelli and Lavinia Ferariu) . . . . . 290
A Study on Population's Diversity for Dynamic Environments (Anabela Simões, Rui Carvalho, João Campos, and Ernesto Costa) . . . . . 300
Effect of the Block Occupancy in GPGPU over the Performance of Particle Swarm Algorithm (Miguel Cárdenas-Montes, Miguel A. Vega-Rodríguez, Juan José Rodríguez-Vázquez, and Antonio Gómez-Iglesias) . . . . . 310
Two Improvement Strategies for Logistic Dynamic Particle Swarm Optimization (Qingjian Ni and Jianming Deng) . . . . . 320
Digital Watermarking Enhancement Using Wavelet Filter Parametrization (Piotr Lipiński and Jan Stolarek) . . . . . 330
CellularDE: A Cellular Based Differential Evolution for Dynamic Optimization Problems (Vahid Noroozi, Ali B. Hashemi, and Mohammad Reza Meybodi) . . . . . 340
Optimization of Topological Active Nets with Differential Evolution (Jorge Novo, José Santos, and Manuel G. Penedo) . . . . . 350
Study on the Effects of Pseudorandom Generation Quality on the Performance of Differential Evolution (Ville Tirronen, Sami Äyrämö, and Matthieu Weber) . . . . . 361
Sensitiveness of Evolutionary Algorithms to the Random Number Generator (Miguel Cárdenas-Montes, Miguel A. Vega-Rodríguez, and Antonio Gómez-Iglesias) . . . . . 371
New Efficient Techniques for Dynamic Detection of Likely Invariants (Saeed Parsa, Behrouz Minaei, Mojtaba Daryabari, and Hamid Parvin) . . . . . 381
Classification Ensemble by Genetic Algorithms (Hamid Parvin, Behrouz Minaei, Akram Beigi, and Hoda Helmi) . . . . . 391
Simulated Evolution (SimE) Based Embedded System Synthesis Algorithm for Electric Circuit Units (ECUs) (Umair F. Siddiqi, Yoichi Shiraishi, Mona A. El-Dahb, and Sadiq M. Sait) . . . . . 400
Taxi Pick-Ups Route Optimization Using Genetic Algorithms (Jorge Nunes, Luís Matos, and António Trigo) . . . . . 410
Optimization of Gaussian Process Models with Evolutionary Algorithms (Dejan Petelin, Bogdan Filipič, and Juš Kocijan) . . . . . 420

Author Index . . . . . 431
Asymmetric k-Means Algorithm Dominik Olszewski Faculty of Electrical Engineering, Warsaw University of Technology, Poland
[email protected]

Abstract. In this paper, an asymmetric version of the k-means clustering algorithm is proposed. The asymmetry arises from the use of asymmetric dissimilarities in the k-means algorithm. The application of asymmetric measures of dissimilarity is motivated by the basic nature of the k-means algorithm, which uses dissimilarities in an asymmetric manner. Cluster centroids are treated as the dominance points governing the asymmetric relationships in the entire cluster analysis. The results of an experimental study on real data show the superiority of asymmetric dissimilarities employed in the k-means method over their symmetric counterparts.

Keywords: k-means clustering, asymmetric dissimilarity, signal recognition.

1 Introduction
The k-means clustering algorithm [1,2,3,4,5] is a well-known statistical data analysis tool used to form an arbitrarily chosen number of clusters in the analyzed data set. The algorithm aims to separate clusters of possibly most similar objects. An object represented as a vector of d features can be interpreted as a point in d-dimensional space. Hence, the k-means algorithm can be formulated as follows: given n points in d-dimensional space, and the number k of desired clusters, the algorithm seeks a set of k clusters so as to minimize the sum of squared dissimilarities between each point and its cluster centroid. The name "k-means" was introduced in [2]; however, the algorithm itself was formulated by H. Steinhaus in [1]. The k-means algorithm forms clusters on the basis of multiple allocations of objects to the nearest clusters. The nearest cluster is the one with a minimal dissimilarity between its centroid and an object being allocated. Hence, the principal behavior of the discussed algorithm is based on evaluating a dissimilarity between two distinct entities (object vs. cluster centroid). The Euclidean distance, most frequently used in k-means, like any other symmetric measure, does not apply properly to evaluating a dissimilarity between a single object and a cluster centroid. We propose employing asymmetric dissimilarities in the k-means algorithm, since we claim that this is more consistent with the fundamental nature of the algorithm, i.e., it properly reflects the asymmetric relationship between a single object and a cluster centroid.
The application of asymmetric dissimilarities in data analysis has been extensively studied by A. Okada and T. Imaizumi [6,7,8]. They have concentrated in their work on multidimensional scaling for analyzing one-mode two-way (object × object) and two-mode three-way (object × object × source) asymmetric proximities. They have introduced the dominance point governing asymmetry in the proximity relationships among objects, represented as points in the multidimensional Euclidean space. They claim that ignoring or neglecting the asymmetry in proximity analysis discards potentially valuable information. Our method can be regarded as the extension of this solution to the k-means clustering algorithm, where the centroids of the clusters are treated as the dominance points governing the multiple allocations of objects, and, consequently, governing the whole clustering process. Therefore, the distinction between a centroid and a single object is that the centroid is a privileged entity acting as an "attractor" of objects in the analyzed data set. Our solution can also be interpreted as the generalization of Okada's and Imaizumi's idea to multidimensional non-Euclidean spaces associated with non-standard asymmetric dissimilarity measures, such as the Kullback-Leibler divergence. Finally, we wanted to confirm and continue their assertion that the property of asymmetry does not have to be considered an inhibiting shortcoming; on the contrary, in certain areas of research it can even be significantly beneficial.
2 Dissimilarities

In this section, we briefly present six dissimilarity measures. Three of them are symmetric (the Hellinger distance, the total variation distance, and the Euclidean distance), one is asymmetric (the Kullback-Leibler divergence), and two are either symmetric or asymmetric, depending on the values of their parameters (the Chernoff distance and the Lissack-Fu distance). Some of these measures are metrics (they satisfy all metric conditions), and some are not, but they still present interesting properties. We wanted to compare the usefulness of symmetric and asymmetric dissimilarities employed in the k-means algorithm, in order to verify our assertion that asymmetric measures are more suitable for this algorithm.

Throughout this section, we will use the following notation. Let P and Q denote two probability measures on a measurable space Ω with σ-algebra F. Let λ be a measure on (Ω, F) such that P and Q are absolutely continuous with respect to λ, with corresponding probability density functions p and q. All definitions presented in this section are independent of the choice of measure λ.

2.1 Symmetric Dissimilarities

Hellinger Distance

Definition 1. The Hellinger distance between P and Q on a continuous measurable space (Ω, F) is defined as

    d_H(P, Q) \overset{\mathrm{def}}{=} \left( \frac{1}{2} \int_\Omega (\sqrt{p} - \sqrt{q})^2 \, d\lambda \right)^{1/2} .    (1)

In some papers, the factor of 1/2 in Definition 1 is omitted. We consider the definition containing this factor, as it normalizes the range of values taken by this dissimilarity. Some sources define the Hellinger distance as the square of d_H. Defined by formula (1), the Hellinger distance is a metric, while d_H^2 is not a metric, since it does not satisfy the triangle inequality.

Total Variation Distance

Definition 2. The total variation distance between P and Q on a continuous measurable space (Ω, F) is defined as

    d_{TV}(P, Q) \overset{\mathrm{def}}{=} \max_{|h| \le 1} \left( \int_\Omega h \, dP - \int_\Omega h \, dQ \right) = \int_\Omega |p - q| \, d\lambda ,    (2)

where h: Ω → R satisfies |h(x)| ≤ 1. The total variation distance is a metric, which assumes values in the interval [0, 2]. This dissimilarity is often called the L_1-norm of P − Q, and is denoted by \|P − Q\|_1.

Euclidean Distance. This measure is used to determine the distance between two points in the Euclidean space.

Definition 3. The Euclidean distance between points p = (p_1, p_2, ..., p_N) and q = (q_1, q_2, ..., q_N) in the N-dimensional Euclidean space is defined as

    d_E(p, q) \overset{\mathrm{def}}{=} \sqrt{ \sum_{i=1}^{N} (p_i - q_i)^2 } .    (3)

The Euclidean distance is a metric, which takes values from the interval [0, ∞]. It can be interpreted as a generalization of the distance between two points in the plane, i.e., in the 2-dimensional Euclidean space, which can be derived from the Pythagorean theorem.

2.2 Asymmetric Dissimilarity

Kullback-Leibler Divergence (Relative Entropy)

Definition 4. The Kullback-Leibler divergence between P and Q on a continuous measurable space (Ω, F) is defined as

    d_{KL}(P, Q) \overset{\mathrm{def}}{=} \int_\Omega p \log_2 \frac{p}{q} \, d\lambda .    (4)

According to the convention, the value of 0 log(0/q) is assumed to be 0 for all real q, and the value of p log(p/0) is assumed to be ∞ for all real non-zero p. Therefore, the relative entropy takes values from the interval [0, ∞]. The Kullback-Leibler divergence is not a metric, since it is not symmetric and it does not satisfy the triangle inequality. However, it has many useful properties, including additivity over marginals of product measures.
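For concreteness, the following is a minimal numpy sketch of Definitions 1-4 for discrete distributions (e.g., normalized spectra), where the integrals over Ω reduce to sums over bins. The small constant EPS, used to avoid division by zero and logarithms of zero, is our own implementation choice and is not part of the definitions.

import numpy as np

EPS = 1e-12  # numerical floor; our own choice, not part of the definitions

def hellinger(p, q):
    # Definition 1: the continuous integral becomes a sum over bins.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def total_variation(p, q):
    # Definition 2: the L1-norm of p - q, with values in [0, 2].
    return np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float)))

def euclidean(p, q):
    # Definition 3.
    return np.sqrt(np.sum((np.asarray(p, float) - np.asarray(q, float)) ** 2))

def kullback_leibler(p, q):
    # Definition 4 (base-2 logarithm); asymmetric: kullback_leibler(p, q) != kullback_leibler(q, p).
    p, q = np.asarray(p, float) + EPS, np.asarray(q, float) + EPS
    return np.sum(p * np.log2(p / q))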
2.3 Parametrized Dissimilarities

In this subsection, we present two dissimilarities whose definitions involve parameters. Depending on the parameter values, these dissimilarities can be either symmetric or asymmetric. This property is very convenient for the purpose of this paper, since it allows for investigating the influence of symmetrizing and asymmetrizing the same dissimilarity on the final results of clustering.

Chernoff Distance

Definition 5. The Chernoff distance between P and Q on a continuous measurable space (Ω, F) is defined as

    d_{Ch}(P, Q) \overset{\mathrm{def}}{=} -\log_2 \int_\Omega p^{\alpha} q^{1-\alpha} \, d\lambda ,    (5)

where 0 < α < 1. Depending on the choice of the parameter α, the Chernoff distance can be either a symmetric or an asymmetric measure. For α = 0.5 it is symmetric, and for all other values of this parameter it does not satisfy the symmetry condition. We have chosen α = 0.1 and α = 0.9 in order to obtain the asymmetric dissimilarity measure, while α = 0.5 resulted in a symmetric dissimilarity.

Lissack-Fu Distance

Definition 6. The Lissack-Fu distance between P and Q on a continuous measurable space (Ω, F) is defined as

    d_{LF}(P, Q) \overset{\mathrm{def}}{=} \int_\Omega \frac{|p P_a - q P_b|^{\alpha}}{|p P_a + q P_b|^{\alpha - 1}} \, d\lambda ,    (6)

where 0 ≤ α ≤ ∞. Changing the values of the parameters P_a and P_b enables obtaining either a symmetric or an asymmetric dissimilarity. For P_a = P_b one has a symmetric measure, and for P_a ≠ P_b the measure becomes asymmetric. The value of the parameter α does not affect the symmetry property of the dissimilarity. Therefore, in our experiments, we have fixed α = 0.5.
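A matching sketch of Definitions 5 and 6 for discrete distributions is given below; the parameter values in the comments are those used in the experiments of Sect. 4, and EPS is again our own numerical safeguard.

import numpy as np

EPS = 1e-12  # numerical floor; our own choice

def chernoff(p, q, alpha=0.9):
    # Definition 5; asymmetric for alpha != 0.5 (the paper uses 0.1, 0.5, and 0.9).
    p, q = np.asarray(p, float) + EPS, np.asarray(q, float) + EPS
    return -np.log2(np.sum(p ** alpha * q ** (1.0 - alpha)))

def lissack_fu(p, q, pa=0.5, pb=1.0, alpha=0.5):
    # Definition 6; asymmetric when pa != pb (the paper uses pa = 0.5, pb = 1.0),
    # symmetric when pa == pb (the paper uses pa = pb = 1.0).
    p, q = np.asarray(p, float) + EPS, np.asarray(q, float) + EPS
    num = np.abs(p * pa - q * pb) ** alpha
    den = np.abs(p * pa + q * pb) ** (alpha - 1.0)
    return np.sum(num / den)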
3 Asymmetric k-Means Clustering

The asymmetric k-means algorithm starts from a random choice of k objects from the entire data space. These objects are used to form initial clusters, each containing one object. Then, the algorithm consists of two alternating steps:

Step 1. Forming of the clusters: The algorithm iterates over the entire data set and allocates each object to the cluster represented by the centroid nearest to this object. The nearest centroid is determined with the use of a chosen asymmetric dissimilarity measure. Therefore, for each object in the analyzed data set, the following minimal asymmetric dissimilarity has to be found:

    \min_i d_{ASYM}(FE_{new}, FE_{c_i}) ,    (7)

where d_{ASYM} is the chosen asymmetric dissimilarity measure, FE_{new} is the vector of features of a given object in the analyzed data set, and FE_{c_i} is the vector of features of the i-th cluster centroid, i = 1, ..., k. This process can be presented with the following pseudocode:

    for x ∈ X do
        min ← MAX_VALUE
        for c ∈ centroids do
            if min > d_{ASYM}(x, c) then
                min ← d_{ASYM}(x, c)
                x temporarily belongs to cluster cluster(c)
            end if
        end for
    end for

After the execution of this pseudocode, each object x from the entire data set X is allocated to the cluster represented by the centroid nearest to this object. The centroids variable stores the set of all current centroids, cluster(c) denotes the cluster with centroid c, min is an auxiliary variable, while MAX_VALUE is the maximal value of the min variable.

Step 2. Finding centroids for the clusters: For each cluster, a centroid is determined on the basis of the objects belonging to this cluster. The algorithm calculates the centroids of the clusters so as to minimize a formal objective function, the mean-squared-error (MSE) distortion:

    MSE(X_j) = \sum_{i=1}^{n_j} d_{ASYM}^2(x_i, c_j) ,    (8)

where X_j, j = 1, ..., k, is the j-th cluster; x_i, i = 1, ..., n_j, are the objects in the j-th cluster; n_j, j = 1, ..., k, is the number of objects in the j-th cluster; c_j, j = 1, ..., k, is the centroid of the j-th cluster; k is the number of clusters; and d_{ASYM}(a, b) is a chosen asymmetric dissimilarity measure. Both these steps must be carried out with the same dissimilarity measure in order to guarantee the monotone property of the k-means algorithm. Steps 1 and 2 have to be repeated until the termination condition is met. The termination condition might be either reaching convergence of the iterative application of the objective function (9), or reaching a pre-defined number of cycles. After each cycle (Steps 1 and 2), the value of the following mean-squared-error objective function needs to be computed in order to track the convergence of the whole clustering process:
    MSE(X) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} d_{ASYM}^2(x_i, c_j) ,    (9)

where X is the analyzed set of objects, and the rest of the notation is described in (8).

A serious problem concerning the traditional k-means algorithm (i.e., using symmetric dissimilarities), as well as the asymmetric k-means version proposed in this paper, is that the clustering process may not converge to an optimal or near-optimal configuration. The algorithm can assure only local optimality, which depends on the initial locations of the objects. An exhaustive study of the asymptotic behavior of the k-means algorithm is conducted in [2].
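As a small illustration, assuming a dissimilarity function d_asym(a, b) such as those sketched in Sect. 2, the per-cluster distortion (8) and the overall objective (9) can be computed as follows; the function and variable names are ours.

def mse_cluster(cluster_points, centroid, d_asym):
    # Eq. (8): sum of squared asymmetric dissimilarities within one cluster.
    return sum(d_asym(x, centroid) ** 2 for x in cluster_points)

def mse_total(clusters, centroids, d_asym):
    # Eq. (9): objective tracked after each cycle to monitor convergence;
    # clusters is a list of point lists, centroids a parallel list of centroids.
    return sum(mse_cluster(c_pts, c, d_asym)
               for c_pts, c in zip(clusters, centroids))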
3.1 Minimization Technique Employed for Finding Centroids of the Clusters
The minimization technique we have employed is the traditional complete search method. The variables space was, in our study, the feature space, i.e., the search was conducted in the feature space. For numerical simplicity and speed, we have limited the variables space to the points corresponding to the current members of the specific cluster. This means that the search process was carried out in the set of current members of the considered cluster. This kind of approach is sometimes referred to as the k-medoids algorithm. We justify this simplification by the negligible performance decrease in the case of clusters with a large number of objects. The objective function (8) was the criterion of the minimization process. Therefore, the minimization technique we have used can be presented with the following pseudocode:

    min ← MAX_VALUE
    sum ← 0
    for i ∈ cluster do
        for j ∈ cluster do
            sum ← sum + d_{ASYM}(i, j)
        end for
        if min > sum then
            min ← sum
            centroid ← i
        end if
        sum ← 0
    end for

After the execution of this pseudocode, the centroid variable stores the coordinates of the centroid of the given cluster. The function d_{ASYM}(i, j) is a chosen asymmetric dissimilarity measure, while min and sum are auxiliary variables. The cluster variable represents the specific cluster for which a centroid is being computed, and MAX_VALUE is the maximal value of the min variable.
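Putting the two steps together, the following is a minimal, self-contained Python sketch of the asymmetric k-means loop described above (Step 1 allocation according to (7), the medoid-style centroid search of this subsection, and the objective (9) used as a stopping criterion). The function and variable names are ours, and d_asym stands for any asymmetric dissimilarity from Sect. 2; this is an illustrative rendering, not the author's reference implementation.

import numpy as np

def asymmetric_kmeans(X, k, d_asym, max_cycles=100, tol=1e-6, seed=0):
    # X: array of shape (n, d); d_asym(a, b): chosen asymmetric dissimilarity.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    prev_obj = np.inf
    for _ in range(max_cycles):
        # Step 1: allocate each object to the cluster whose centroid minimizes (7).
        labels = np.array([np.argmin([d_asym(x, c) for c in centroids]) for x in X])
        # Step 2: for each cluster, pick the member minimizing the summed
        # dissimilarity to the other members (the complete search of Sect. 3.1).
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            costs = [sum(d_asym(a, b) for b in members) for a in members]
            centroids[j] = members[int(np.argmin(costs))]
        # Track the objective (9) and stop when it no longer improves.
        obj = sum(d_asym(x, centroids[l]) ** 2 for x, l in zip(X, labels))
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return labels, centroids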
4 Experiments
We have tested the performance of the discussed improved k-means clustering algorithm by carrying out experiments on real data in the field of signal recognition, i.e., piano music composer clustering and human heart rhythm clustering. Human heart rhythms are represented with ECG recordings derived from the MIT-BIH ECG Databases [9]. We have employed the different symmetric, asymmetric, and parametrized dissimilarities presented in Section 2, in order to evaluate their effectiveness in cooperating with the discussed k-means algorithm. In this way, we verify the main assertion of this paper, i.e., that asymmetric dissimilarities are better suited to the k-means algorithm.

4.1 Piano Music Composer Clustering
In this part of our experiments, we tested our enhancement to the k-means algorithm forming three clusters representing three piano music composers: Johann Sebastian Bach, Ludwig van Beethoven, and Fryderyk Chopin. The numbers of music pieces belonging to each of these composers are given in Table 1. Each music piece was represented with a 20-second sound signal sampled with the 44100 Hz frequency. The entire data set was composed of 32 sound signals. The feature extraction process was carried out according to the traditional Discrete-Fourier-Transform-based (DFT-based) method. The DFT was implemented with the fast Fourier transform (FFT) algorithm. Sampling signals with the 44100 Hz frequency resulted in the 44100/2 Hz value of the upper boundary of the FFT result range. The results of this part of our experiments are gathered in Table 1, which presents the accuracy degree of clustering with k-means cooperating with asymmetric dissimilarities, and with their symmetric counterparts. The numbers 1 and 2 given with each asymmetric dissimilarity denote this dissimilarity computed in two different directions, i.e., d_{ASYM1} = d_{ASYM}(p, q) and d_{ASYM2} = d_{ASYM}(q, p). The asymmetric Chernoff distance was obtained by applying its parameter α = 0.9, while the symmetric Chernoff distance was obtained with α = 0.5. The asymmetric Lissack-Fu distance, in turn, was obtained by applying its parameters P_a = 0.5 and P_b = 1.0, while the symmetric form of this quantity was obtained with P_a = 1.0 and P_b = 1.0. The accuracies were calculated on the basis of the following accuracy degree:

    a_i = \frac{x_i^{max}}{N_i} ,    (10)

where a_i, i = 1, 2, 3, is the accuracy degree for the i-th composer; x_i^{max}, i = 1, 2, 3, is the maximal number of music pieces of the i-th composer in any of the clusters; and N_i, i = 1, 2, 3, is the total number of music pieces of the i-th composer. Once the accuracy degree for the i-th composer is calculated, the corresponding cluster is not considered in the calculations of the accuracy degrees for the remaining composers.
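Returning briefly to the feature extraction step described above, a minimal sketch of DFT-based features is given below; it is our own rendering, since the exact spectral post-processing is not specified in the text, and the normalization to a discrete distribution is assumed so that the features can be fed to the probabilistic dissimilarities of Sect. 2.

import numpy as np

def spectral_features(signal, n_bins=1024):
    # Magnitude spectrum via FFT, folded into n_bins and normalized so that
    # the feature vector can be treated as a discrete probability distribution.
    spectrum = np.abs(np.fft.rfft(np.asarray(signal, float)))
    bins = np.array_split(spectrum, n_bins)
    features = np.array([b.sum() for b in bins])
    return features / features.sum()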
Table 1. Accuracies of piano music composer clustering

                                     Bach   Beethoven   Chopin   Average Accuracy
Number of Signals                     11        12         9
Kullback-Leibler Divergence 1       0.818     0.750     0.778         0.781
Kullback-Leibler Divergence 2       0.727     0.667     0.778         0.719
Asymmetric Chernoff Distance 1      0.818     0.750     0.778         0.781
Symmetric Chernoff Distance         0.727     0.750     0.778         0.750
Asymmetric Chernoff Distance 2      0.727     0.667     0.778         0.719
Asymmetric Lissack-Fu Distance 1    0.818     0.750     0.778         0.781
Symmetric Lissack-Fu Distance       0.727     0.750     0.778         0.750
Asymmetric Lissack-Fu Distance 2    0.727     0.667     0.778         0.719
Hellinger Distance                  0.727     0.750     0.778         0.750
Total Variation Distance            0.636     0.750     0.778         0.719
Euclidean Distance                  0.818     0.583     0.556         0.656
Each row with the accuracy entries ends with the average accuracy degree, estimating the quality of each clustering approach. It is the arithmetic average of the accuracy degrees associated with all three composers:

    a_{average} = \frac{a_1 + a_2 + a_3}{3} .    (11)
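A small sketch of how the accuracy degrees (10) and their average (11) can be computed from a contingency table of composers versus clusters, including the rule that a cluster already matched to one composer is excluded for the remaining composers, is given below; the function and variable names are ours.

import numpy as np

def accuracy_degrees(counts):
    # counts[i, j]: number of pieces of composer i assigned to cluster j.
    counts = np.asarray(counts, float)
    available = list(range(counts.shape[1]))
    degrees = []
    for i in range(counts.shape[0]):
        j_best = max(available, key=lambda j: counts[i, j])   # cluster with most pieces of composer i
        degrees.append(counts[i, j_best] / counts[i].sum())   # Eq. (10)
        available.remove(j_best)                              # exclude the matched cluster
    return degrees, sum(degrees) / len(degrees)               # Eq. (11)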
We used this average accuracy degree as the basis of the comparison between the investigated approaches.

Table 1 shows that clustering with the k-means algorithm and asymmetric dissimilarities allowed for obtaining better results than with the symmetric dissimilarities. What is worth noting is the fact that the clustering performance strongly depends on the direction of asymmetry in the case of asymmetric measures, i.e., whether we consider d_{ASYM}(p, q) or d_{ASYM}(q, p). Therefore, asymmetric dissimilarities outperform their symmetric competitors if the right direction of asymmetry is chosen; otherwise, they produce worse results. This kind of observation is not surprising, since, if the k-means algorithm operates in an asymmetric manner, then the asymmetric dissimilarities should be applied in the direction of asymmetry consistent with the direction of asymmetry of the algorithm itself. How to determine this direction prior to clustering remains an open question, since we do not provide any procedure for finding it in this paper. It may depend on the asymmetry in the data being analyzed. The obvious and simplest way to determine this direction is on the basis of the final results of clustering, i.e., which direction corresponds to higher clustering performance. However, leaving these considerations aside, and assuming the proper direction of asymmetry is chosen, the experimental results confirmed that asymmetric dissimilarities employed in the k-means algorithm are superior in comparison with the symmetric measures cooperating with this algorithm, which is the main assertion of this paper.

4.2 Human Heart Rhythm Clustering
In this part of our experiments, we investigated our algorithm forming three clusters representing three types of human heart rhythms: normal sinus rhythm, atrial arrhythmia, and ventricular arrhythmia. This kind of clustering can be viewed as cardiac arrhythmia detection and recognition based on the ECG recordings. In general, the cardiac arrhythmia disease may be classified either by rate (tachycardias – the heart beat is too fast, and bradycardias – the heart beat is too slow) or by site of origin (atrial arrhythmias – they begin in the atria, and ventricular arrhythmias – they begin in the ventricles). Our clustering recognizes the normal rhythm, and also recognizes arrhythmias originating in the atria and in the ventricles. We analyzed 20-minute ECG Holter recordings sampled with the 250 Hz frequency. The entire data set was composed of 63 ECG signals. The numbers of recordings belonging to each rhythm type are given in Table 2. The feature extraction was carried out in the same way as for the piano music composer clustering. The results of this part of our experiments are gathered in Table 2, which is constructed in the same way as Table 1. The accuracy degrees and average accuracy degrees are also calculated in a similar way as in the previous subsection (formulae (10) and (11), respectively), with the only difference that instead of composers we regard three types of human heart rhythms.

Table 2. Accuracies of human heart rhythm clustering

                                    Normal    Atrial       Ventricular   Average
                                    Rhythm    Arrhythmia   Arrhythmia    Accuracy
Number of Signals                     18          23            22
Kullback-Leibler Divergence 1       0.944       0.783         0.773        0.825
Kullback-Leibler Divergence 2       0.944       0.826         0.636        0.794
Asymmetric Chernoff Distance 1      0.944       0.826         0.773        0.841
Symmetric Chernoff Distance         0.944       0.826         0.727        0.825
Asymmetric Chernoff Distance 2      0.944       0.826         0.636        0.794
Asymmetric Lissack-Fu Distance 1    0.944       0.826         0.773        0.841
Symmetric Lissack-Fu Distance       0.944       0.826         0.727        0.825
Asymmetric Lissack-Fu Distance 2    1.000       0.826         0.636        0.810
Hellinger Distance                  0.944       0.783         0.727        0.810
Total Variation Distance            0.944       0.826         0.682        0.810
Euclidean Distance                  0.833       0.739         0.636        0.730
Table 2 shows results very similar to those of the previous part of our experiments. Most interestingly, the same effect of the direction of asymmetry can be observed in the case of asymmetric dissimilarities. In one of the directions of asymmetry (which we call the "correct" direction), the asymmetric dissimilarities outperform the symmetric ones, while in the other ("incorrect") direction, they provide lower clustering performance.
5 Summary

This paper presented an improvement to the k-means clustering algorithm. We proposed the application of asymmetric dissimilarities in this algorithm, as more consistent with the behavior of the algorithm than the most commonly employed symmetric dissimilarities, e.g., the Euclidean distance. We claim that asymmetric measures are more suitable for the k-means technique because it evaluates the dissimilarity between two distinct entities (object vs. cluster centroid). Consequently, we assert that asymmetric dissimilarities, in certain areas of research, can be regarded as superior to their symmetric counterparts, contrary to the frequent opinion regarding them as mathematically inconvenient quantities.
References

1. Steinhaus, H.: Sur la Division des Corps Matériels en Parties. Bulletin de l'Académie Polonaise des Sciences, Cl. III 4(12), 801–804 (1956)
2. MacQueen, J.: Some Methods for Classification and Analysis of Multivariate Observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297 (1967)
3. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 881–892 (2002)
4. Biau, G., Devroye, L., Lugosi, G.: On the Performance of Clustering in Hilbert Spaces. IEEE Transactions on Information Theory 54(2), 781–790 (2008)
5. Olszewski, D., Kolodziej, M., Twardy, M.: A Probabilistic Component for K-Means Algorithm and its Application to Sound Recognition. Przeglad Elektrotechniczny 86(6), 185–190 (2010)
6. Okada, A., Imaizumi, T.: Asymmetric Multidimensional Scaling of Two-Mode Three-Way Proximities. Journal of Classification 14(2), 195–224 (1997)
7. Okada, A.: An Asymmetric Cluster Analysis Study of Car Switching Data. In: Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Heidelberg (2000)
8. Okada, A., Imaizumi, T.: Multidimensional Scaling of Asymmetric Proximities with a Dominance Point. In: Advances in Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 307–318. Springer, Heidelberg (2007)
9. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23), e215–e220 (2000), Circulation Electronic Pages: http://circ.ahajournals.org/cgi/content/full/101/23/e215
Gravitational Clustering of the Self-Organizing Map Nejc Ilc and Andrej Dobnikar Faculty of Computer and Information Science, University of Ljubljana Tržaška cesta 25, 1000 Ljubljana, Slovenia {nejc.ilc,andrej.dobnikar}@fri.uni-lj.si
Abstract. Data clustering is a fundamental data analysis method, widely used for solving problems in the field of machine learning. Numerous clustering algorithms exist, based on various theories and approaches, one of them being the well-known Kohonen's self-organizing map (SOM). Unfortunately, after training the SOM there is no explicitly obtained information about clusters in the underlying data, so another technique for grouping SOM units has to be applied afterwards. In this paper, a contribution towards clustering of the SOM is presented, employing principles of the gravitational law. On the first level of the proposed algorithm, the SOM is trained on the input data and prototypes are extracted. On the second level, each prototype acts as a unit-mass point in a feature space, in which the presence of a gravitational force is simulated, exploiting the information about connectivity gained on the first level. The proposed approach is capable of discovering clusters of complex shape, not limited to spherical ones, and is able to automatically determine the number of clusters. Experiments with synthetic and real data are conducted to show the performance of the presented method in comparison with other clustering techniques. Keywords: clustering, self-organizing map, gravitational clustering, data analysis, two-level approach.
1 Introduction

Clustering is an unsupervised process of organizing data into natural groups or clusters, such that objects or data points assigned to the same cluster have high similarity, whereas the similarity between objects assigned to different clusters is low [1]. Clustering techniques have been widely used in the fields of data mining, feature extraction, function approximation, image segmentation, and others [2]. Kohonen's self-organizing map (SOM) is one of the more successful neural network approaches for clustering, which has been applied to a broad range of applications in the previously mentioned fields [3]. Actually, the SOM is not only a clustering method – it is also a popular data exploration and visualization tool, since it is capable of mapping a d-dimensional input space to an m-dimensional output space, where m ≪ d and usually m = 2 or m = 3. The SOM consists of a set of
neurons arranged in a 2- or 3-dimensional structure, usually in a rectangular or hexagonal grid with a defined neighborhood. Through unsupervised training, the SOM folds and fits onto the input data points, preserving their density and topology. Such a trained map of neurons can be used as a powerful visualization surface, convenient for the detection of interesting regions or clusters. The number of neurons in the SOM is usually much greater than the number of clusters in the underlying data. Hence, the main problem is to find a meaningful grouping of neurons and, as a consequence, to obtain a good insight into the structure of the data. In the past, different attempts were made towards the clustering of the SOM. In [4], the SOM is clustered using two methods: k-means and a hierarchical agglomerative clustering algorithm. In both cases, a significant running time reduction is shown compared to direct clustering of the data. However, clustering quality is not improved, as that was not the purpose of the research. The opposite is the case in [5], where superior clustering accuracy is achieved using maps with a huge number of neurons. Consequently, the increased time complexity of the algorithm has to be taken into account. Another interesting approach, which is able to automatically determine the number of clusters, is proposed in [6]. It employs a recursive flooding algorithm for the detection of clusters in the SOM. However, the results of the experiments are not convincing – in a comparison, simple k-means clustering of the SOM performs better on average. This paper presents a new method for clustering the SOM using a gravitational approach, which assumes that every point in the data set can be viewed as a mass particle. If a gravitational force between points exists, they begin to move towards each other with respect to mass and distance, thus producing clusters. This nature-inspired notion was first used in the algorithm proposed by Wright [7] and recently extended in [8]. In our proposed algorithm, which is called GSOM, the basic idea from the latter is used and integrated with the SOM, considering the connections between neurons. The goal of our research is to develop an efficient clustering method capable of dealing with arbitrarily shaped clusters, where the number of clusters has to be determined automatically, without user interaction. At the moment, GSOM can handle only numeric data due to the limitations of the implemented SOM algorithm. The rest of the paper is organized as follows. In Section 2, the proposed algorithm GSOM is described. Section 3 presents the performance of GSOM in comparison with three other clustering algorithms over nine synthetic and real data sets. Results are presented along with a discussion. Finally, the conclusion is drawn in Section 4.
2 Proposed Algorithm GSOM
Clustering algorithm GSOM is based on a two-level approach depicted in Fig. 1. First, a set of prototypes is created using the SOM as a vector quantization method. Each data point belongs to its closest prototype called best matching unit (BMU). Data points with common BMU, acting as their representative,
Gravitational Clustering of the SOM
13
Fig. 1. Two-level scheme of GSOM. a) Input data set. b) SOM is trained and BMUs are identified (black circles). Interpolating units (empty diamonds) are eliminated together with connections. c) BMUs are interpreted as mass points and moved around under influence of gravitational force. Merging occur when two points are close enough. d) As a result, final clustering is obtained - different markers are used for different clusters.
form a first-level cluster. There are several times more prototypes than the expected number of clusters. On the next level, the prototypes are observed as movable objects in a feature space, where a force of gravity moves them towards each other. When two prototypes are close enough, they merge into a single prototype with mass unchanged, for the reason explained later in this section.

The main benefit of using the SOM on the first level of the proposed algorithm is to obtain topologically ordered representatives of the data. Prototypes are connected with each other in a grid, and the neighbors of each of them are known. We use this valuable and often omitted information to bound the influence of the gravitational field to a prototype's close neighbors and therefore to stabilize and enhance the gravitational clustering process on the second level. Another advantage of the SOM is a reduction of noise: the prototypes are local averages of the data and therefore less sensitive to random variations. Finally, a SOM with a properly chosen number of neurons reduces the computational complexity of clustering the data, especially when the number of input points is huge, as shown in [4].

2.1 SOM Algorithm
The SOM is a regular two-dimensional grid a×b of M = a · b neurons. Each neuron is represented by a prototype vector m_i = [m_i1, ..., m_id], where d is the dimension of the input space. The neurons are connected to the adjacent ones with a neighborhood relation. Each neuron, except the ones on the border of the map, has four or six direct neighbors, depending on whether a rectangular or a hexagonal grid structure is chosen, respectively. Before the training, a linear initialization of the SOM is made in the subspace spanned by the two eigenvectors with the greatest eigenvalues computed from the original data. For the initialization and training of the SOM, the SOM Toolbox
for MATLAB¹ was used. In our case the SOM is trained in a batch mode, where the whole data set is presented to the SOM before the adjustments are made to the prototype vectors. In each epoch, the data set is partitioned according to the Voronoi regions of the neurons. Each data point x_j belongs to the neuron to which it is the closest. After this, the prototype vector of neuron i is updated as

m_i(t+1) = \frac{\sum_{j=1}^{N} h_{ic(j)}(t)\, x_j}{\sum_{j=1}^{N} h_{ic(j)}(t)},    (1)
where c(j) is the BMU of data point x_j and N is the number of points in the data set. The new value of the i-th prototype vector m_i is computed as a weighted average of all data points, where each data point's weight is the value of the neighborhood kernel function h_{ic(j)}(t) centered on its BMU c(j). We used a Gaussian neighborhood kernel function with a width defined by the parameter σ that decreases monotonically in time. The initial value of σ is σ_0 = max{1, max{a, b}/8}. The dimensions a and b are chosen such that the ratio a/b is approximately equal to the square root of the ratio between the two largest eigenvalues of the data in the input space. The SOM is trained in two phases: a rough phase with the number of epochs l_r = max{1, 10 · M/N} and a fine-tuning phase, where the number of epochs is l_f = max{1, 40 · M/N}. Above, M = S · 5 · √N, where S is a scale factor set to 1 by default. The values of σ_0, l_r, l_f and M are heuristically determined as proposed by the authors of the SOM Toolbox.

As a result, on the first level of the algorithm we obtain prototypes which represent the original data. Interpolating prototypes, which are not the BMU of any data point, are eliminated together with the connections to their neighbors. This proves to be very beneficial in the sense of widening the gap between distant regions of map units. Therefore, only BMUs are taken onto the second level of GSOM.
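A single batch epoch implementing Eq. (1) can be sketched as follows. This is our own minimal NumPy illustration with hypothetical names, not the SOM Toolbox implementation; the kernel width sigma is assumed to be given for the current epoch.

  import numpy as np

  def batch_som_epoch(X, prototypes, grid_coords, sigma):
      """One batch epoch of Eq. (1): X is (N, d) data, prototypes is (M, d),
      grid_coords is (M, 2) with the map-grid position of each neuron."""
      # 1. Best matching unit c(j) of every data point x_j.
      dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)   # (N, M)
      bmu = dists.argmin(axis=1)
      # 2. Gaussian neighborhood kernel h_{i c(j)} evaluated on the map grid.
      grid_d2 = ((grid_coords[:, None, :] - grid_coords[None, :, :]) ** 2).sum(axis=2)
      h = np.exp(-grid_d2 / (2.0 * sigma ** 2))                                 # (M, M)
      # 3. Weighted average of all data points (numerator / denominator of Eq. (1)).
      w = h[:, bmu]                                                             # (M, N)
      numer = w @ X
      denom = w.sum(axis=1, keepdims=True)
      return np.where(denom > 1e-12, numer / np.maximum(denom, 1e-12), prototypes)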
2.2 Gravitational Clustering
The BMUs identified on the first level of the algorithm are now interpreted as d-dimensional particles in the d-dimensional space with mass equal to unity. During the iterations, each particle is moved around according to a simplified version of the Law of Gravitation combined with Newton's Second Law of Motion, as proposed in [8]. The new position of point x influenced by the gravity of point y is

x(t+1) = x(t) + \frac{G}{\|d\|^2} \cdot \frac{d}{\|d\|},    (2)
where d = x(t) − y(t), ‖d‖ is the Euclidean distance between points x and y, and G is the gravity parameter, which is decreased by a factor ΔG at each iteration, following the rule G = (1 − ΔG) · G. When two points move close enough, i.e. ‖d‖ is lower than the parameter α, they are merged into a single point with mass kept equal to unity. This principle ensures that clusters with greater density do not
¹ SOM Toolbox for MATLAB is available under the GNU General Public License at: http://www.cis.hut.fi/somtoolbox/.
affect smaller or less dense ones. The experiments presented in Section 3 show that such an approach is beneficial. Obviously, the number of points decreases during the iterations when an appropriate G is chosen.

At each iteration of the algorithm, every point x in the remaining set of points, denoted P, is considered once. We then have to choose another point y and move both of them according to Eq. 2. As both points are actually BMUs taken from the SOM, the point y can be selected in two ways: either from the neighbors of x, if any of them exist, or as a random point from the set P not equal to x. With probability 1 − p, one of the existing neighbors of x is randomly chosen, and with probability p a random point from P is selected, where p is a parameter of the algorithm. When p is small, the point's movement is more influenced by its closest neighbors, and when p is large, the information about locality is less important.

The algorithm stops when G is reduced to a value at which the movements of all remaining points are below a particular threshold. Alternative stopping criteria are reaching a predefined maximum number of iterations or having only two points remain in the set P. The last criterion implicitly means that we want to split the data into at least two groups, which is a reasonable assumption. The points remaining in the set P are the final cluster representatives. Each representative may contain one or more BMUs and therefore all the data points they cover. Obviously, the number of discovered clusters depends on the data set's features and the input parameters' values. Therefore, GSOM determines the number of clusters automatically, without predefining it. The essential step is the selection of the parameters, which is considered in the next section.
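A rough sketch of this second-level procedure, under our own naming and with simplified bookkeeping of the neighbor lists after merges, might look like the following; the default parameter values follow Section 3.2.

  import random
  import numpy as np

  def gravitational_clustering(points, neighbors, G=8e-4, dG=0.045, alpha=0.01,
                               p=0.1, max_iter=1000, min_move=1e-4):
      """points: dict id -> position (np.array) of a BMU; neighbors: dict id -> set of ids."""
      P = dict(points)
      for _ in range(max_iter):
          if len(P) <= 2:
              break
          largest_move = 0.0
          for x_id in list(P.keys()):
              if x_id not in P or len(P) <= 2:
                  continue
              nbrs = [n for n in neighbors.get(x_id, ()) if n in P and n != x_id]
              if nbrs and random.random() > p:
                  y_id = random.choice(nbrs)                         # with prob. 1 - p: a grid neighbor
              else:
                  y_id = random.choice([k for k in P if k != x_id])  # with prob. p: a random point of P
              d = P[x_id] - P[y_id]
              dist = float(np.linalg.norm(d))
              if dist < alpha:                                       # merge; mass stays equal to unity
                  neighbors.setdefault(x_id, set()).update(neighbors.get(y_id, set()))
                  del P[y_id]
                  continue
              step = (G / dist ** 2) * (d / dist)                    # Eq. (2)
              P[x_id] = P[x_id] - step                               # move both points towards each other
              P[y_id] = P[y_id] + step                               # (d = x - y points from y to x)
              largest_move = max(largest_move, float(np.linalg.norm(step)))
          G *= (1.0 - dG)                                            # G = (1 - dG) * G
          if largest_move < min_move:
              break
      return P                                                       # remaining points: cluster representatives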
3 Experiments and Results
Experiments were conducted on the synthetic data sets Giant, Hepta, Ring, Wave, Moon, and Flag and the real data sets Iris, Wine, and LetterABC. The performance of the proposed clustering algorithm is assessed in comparison with three selected algorithms: the Expectation Maximization algorithm using a mixture of Gaussians (EM GMM) [9], the Cauchy-Schwarz divergence clustering algorithm (CS) [10], and the clustering of the SOM with the k-means algorithm (SOMkM) [4]. EM GMM was chosen as a baseline method because of its popularity and efficiency, although it assumes a hyper-spherical shape of clusters. The CS algorithm is more advanced in the sense of discovering complex cluster shapes. In addition, the SOMkM method has been included as a representative of the algorithms which perform clustering of the SOM. Table 1 summarizes the properties of the data sets and their plots are presented in Fig. 2, which also shows the best clustering results obtained by the GSOM algorithm. Note that the data sets Hepta, Iris, Wine and LetterABC are plotted using the PCA (principal component analysis) projection due to the high dimensionality of the data. Each data set is linearly scaled to fit the range [0, 1] before clustering is carried out.
Fig. 2. Data sets used in experiments. The best clustering results of GSOM are displayed using different shapes or colors of markers. PCA projection is used to visualize Hepta, Iris, Wine, and LetterABC data.
3.1 Data Sets
A short description of the used data sets is given as follows:

a) Data set Giant consists of 862 2-D points and has two clusters: one small spherical cluster on the right side with 10 points and one huge spherical cluster with 852 points on the left side. The much greater density of the leftmost cluster, compared to the other one, is a difficulty here, leading algorithms to split the giant instead of finding the dwarf.

b) Hepta is a data set with 212 points, which form seven clusters of spherical shape. Each cluster contains 10 data points, except for the middle one, which contains two more points. Hepta is a part of the Fundamental Clustering Problem Suite, available at http://www.uni-marburg.de/fb12/datenbionik/.

c) Data set Ring consists of 800 2-D points forming two concentric rings, each containing 400 points. Non-linear separability and sophisticated connectivity are present here to challenge the methods (a generation sketch is given after this list).
d) Data set Wave is generated to measure the algorithms' performance on highly irregular, longitudinal and linearly non-separable clusters. The 2-D data consist of 148 points in the upper and 145 points in the lower wavy curve.

e) Data set Moon is another problem domain with linearly non-separable clusters. Here, four clusters are defined, containing 104, 150, 150 and 110 2-D points, from the topmost to the lowermost cluster, respectively.

f) Data set Flag consists of 640 points that form three clusters. The spherical cluster in the middle contains 100 2-D points; the cluster above and the cluster beneath contain 270 points each.

g) The Iris data set [11] has been widely used in classification tasks. It has 150 points of four dimensions, divided into three classes of an iris plant with 50 points each. The first class is linearly separable from the other two. The second and the third class overlap and are linearly non-separable.

h) The Wine data set [11] has 178 13-D points with three known classes of wines derived from three different cultivars. The numbers of data points in the classes are 59, 71 and 48, respectively.

i) The LetterABC data set is based on the Letter data set from [11], containing only the data for identification of the letters A, B and C. There are 1719 data points in total with 16 numerical attributes.
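As an example of how such a synthetic benchmark can be produced, the sketch below generates two concentric rings of 400 points each. The radii and noise level are our own choice; the exact generation procedure of the original Ring data set is not specified in the paper.

  import numpy as np

  def make_ring(n_per_ring=400, radii=(1.0, 3.0), noise=0.1, seed=0):
      """Two concentric noisy rings in 2-D; returns points and their ring labels."""
      rng = np.random.default_rng(seed)
      points, labels = [], []
      for label, r in enumerate(radii):
          angles = rng.uniform(0.0, 2.0 * np.pi, n_per_ring)
          radius = r + rng.normal(0.0, noise, n_per_ring)
          points.append(np.column_stack((radius * np.cos(angles), radius * np.sin(angles))))
          labels.append(np.full(n_per_ring, label))
      return np.vstack(points), np.concatenate(labels)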
3.2 Parameters Setting
As can be seen from Section 2, six parameters need to be set for the GSOM algorithm to work. Fortunately, it turns out that default values, or values selected with heuristics, can be used in the majority of cases. Extensive experiments on the influence of the parameters were conducted, including SOM sizes with four scale factor values S = {0.5, 0.75, 1, 2}, two shapes of the SOM grid {rectangular, hexagonal}, five values of G = {4·10⁻⁴, 6·10⁻⁴, 8·10⁻⁴, 9·10⁻⁴, 1·10⁻³}, five values of ΔG = {0.03, 0.04, 0.045, 0.05, 0.06}, five values of p = {0, 0.1, 0.5, 0.9, 1}, and five values of α = {0.001, 0.005, 0.01, 0.05, 0.1}. For each data set a total of 5000 configurations of the GSOM parameters was considered. Every clustering result is then evaluated with an external measure of quality, called the clustering error (CE) [12], defined as the percentage of wrongly clustered data points. In order to calculate the CE, the optimal covering, i.e. the maximization of the intersection between the result of the clustering method and the desired clustering, is considered. The analysis of the results, summarized only briefly here, shows that the following parameter values should be taken as defaults: SOM size with S = 1, rectangular SOM grid, G = 0.0008, ΔG = 0.045, α = 0.01, and p = 0.1. In addition, the parameter α proves to have the lowest impact on the quality of clustering; it is followed, in increasing order of impact, by the SOM grid, p, ΔG, S, and G. Table 2 displays the best parameter values, which give the minimal CE, for each data set.

The parameters of the other methods were set as follows. The maximum number of iterations for EM GMM was set to 500 in order to assure convergence. The parameters of the CS algorithm were set in accordance with the authors' suggestions in [10] and [13]. Concerning the parameters of the SOMkM algorithm, a benchmark test of the parameters SOM size and SOM grid, similar to those described for
Table 1. Data sets used for performance measurements. The number of clusters is the manually assigned ground truth.

data set     data points   dim.   clusters
Giant 862 2 2
Hepta 212 3 7
Ring 800 2 2
Wave 293 2 2
Moon 514 2 4
Flag 640 2 3
Iris 150 4 3
Wine 178 13 3
LetterABC 1719 16 3
Table 2. The best values of the parameters for the GSOM algorithm. SOM size is the size of the 2-D grid with the scale factor S given in brackets, SOM grid can be rectangular (rect) or hexagonal (hexa), G is the initial gravitational constant, ΔG the reduction factor of G, α the merging distance and p the probability of choosing a random point instead of a neighbor.

data set    SOM size          SOM grid   G       ΔG     α     p
Giant       13×11 (S = 1)     rect       0.0008  0.045  0.01  0.1
Hepta       9×8 (S = 1)       rect       0.0008  0.060  0.01  0.1
Ring        11×10 (S = 0.75)  rect       0.0008  0.045  0.01  0.1
Wave        14×12 (S = 2)     rect       0.0008  0.045  0.01  0.1
Moon        20×10 (S = 2)     rect       0.0008  0.045  0.01  0.0
Flag        14×9 (S = 1)      rect       0.0008  0.045  0.01  0.1
Iris        12×5 (S = 2)      rect       0.0008  0.045  0.01  0.1
Wine        7×5 (S = 0.5)     rect       0.0008  0.030  0.01  0.1
LetterABC   12×9 (S = 0.5)    rect       0.0010  0.030  0.01  0.1
GSOM, was performed and the values which give the minimal CE were chosen. All three algorithms require the number of clusters as an input parameter. We set it to the values shown in Table 1.
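The clustering error used to rank these configurations can be computed by matching the found clusters to the ground-truth classes so that the total overlap is maximal. One possible implementation, our own and not necessarily the exact procedure of [12], uses the Hungarian algorithm on the contingency table:

  import numpy as np
  from scipy.optimize import linear_sum_assignment

  def clustering_error(y_true, y_pred):
      """CE = fraction of points not covered by the optimal cluster-to-class matching."""
      y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
      classes, clusters = np.unique(y_true), np.unique(y_pred)
      # contingency[i, j]: number of points of class i assigned to cluster j
      contingency = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                              for c in classes])
      rows, cols = linear_sum_assignment(-contingency)   # maximize the total intersection
      return 1.0 - contingency[rows, cols].sum() / y_true.size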
3.3 Evaluation of Results
Clustering of the nine data sets was performed by the proposed GSOM and three other algorithms: EM GMM, CS, and SOMkM. The results are evaluated with respect to the desired clustering, i.e. the ground truth, and the clustering error was computed - its minimal, maximal and mean values over 100 runs of each algorithm. The average running time was measured on an Intel Core2 Duo 2.1 GHz processor with 3 GB of memory and MATLAB version R2007b. The results of the experiments are collected in Table 3 and the best ones of GSOM are visualized in Fig. 2.

Considering the minimal and mean clustering errors on the data sets Giant, Hepta, Wave, Moon, and Flag, GSOM outperforms the other methods. The CS method is the only one that achieves a perfect result on the Ring data set. It is followed by GSOM, which is able to discover the inner circle, while the outer one is partitioned into three clusters. When clustering the Iris data set, the best results are obtained with EM GMM and GSOM, though CS achieves the lowest mean error. The EM GMM method is also the most successful in clustering the Wine and the LetterABC data. The latter is obviously the hardest problem for the proposed algorithm GSOM, due to the highest error rate among all compared methods.
Table 3. Performance of the GSOM algorithm compared to EM GMM, CS and SOMkM. The clustering error (min / max, mean ± standard deviation) and the average running time (s) are measured for every data set.

Data set    Method   CE min / max    CE mean ± std    time (s)
Giant       EM GMM   0.000 / 0.017   0.007 ± 0.002      0.054
            CS       0.219 / 0.497   0.404 ± 0.062     78.694
            SOMkM    0.352 / 0.458   0.457 ± 0.011      0.092
            GSOM     0.000 / 0.000   0.000 ± 0.000      0.315
Hepta       EM GMM   0.000 / 0.557   0.254 ± 0.121      0.042
            CS       0.000 / 0.269   0.057 ± 0.062      0.900
            SOMkM    0.000 / 0.542   0.227 ± 0.124      0.032
            GSOM     0.000 / 0.142   0.003 ± 0.02       0.116
Ring        EM GMM   0.418 / 0.500   0.491 ± 0.022      0.397
            CS       0.000 / 0.000   0.000 ± 0.000     48.109
            SOMkM    0.466 / 0.500   0.493 ± 0.010      0.032
            GSOM     0.288 / 0.395   0.349 ± 0.025      0.204
Wave        EM GMM   0.280 / 0.491   0.448 ± 0.069      0.031
            CS       0.130 / 0.403   0.237 ± 0.093      3.180
            SOMkM    0.126 / 0.495   0.393 ± 0.107      0.045
            GSOM     0.000 / 0.495   0.173 ± 0.131      0.318
Moon        EM GMM   0.307 / 0.541   0.421 ± 0.058      0.206
            CS       0.000 / 0.465   0.284 ± 0.165     11.011
            SOMkM    0.288 / 0.521   0.451 ± 0.062      0.030
            GSOM     0.000 / 0.292   0.048 ± 0.103      0.386
Flag        EM GMM   0.000 / 0.641   0.114 ± 0.192      0.060
            CS       0.000 / 0.252   0.003 ± 0.025     23.636
            SOMkM    0.000 / 0.361   0.118 ± 0.163      0.033
            GSOM     0.000 / 0.000   0.000 ± 0.000      0.215
Iris        EM GMM   0.033 / 0.613   0.169 ± 0.165      0.019
            CS       0.040 / 0.173   0.072 ± 0.029      0.447
            SOMkM    0.047 / 0.333   0.087 ± 0.055      0.024
            GSOM     0.033 / 0.333   0.260 ± 0.096      0.125
Wine        EM GMM   0.011 / 0.494   0.268 ± 0.130      0.032
            CS       0.056 / 0.427   0.139 ± 0.055      1.071
            SOMkM    0.051 / 0.056   0.053 ± 0.003      0.130
            GSOM     0.034 / 0.601   0.253 ± 0.113      0.091
LetterABC   EM GMM   0.068 / 0.601   0.294 ± 0.114      0.216
            CS       0.180 / 0.453   0.301 ± 0.096    918.997
            SOMkM    0.180 / 0.472   0.318 ± 0.049      0.067
            GSOM     0.361 / 0.571   0.498 ± 0.074      0.346
It is important to stress that the execution times of the GSOM algorithm are shorter than those of CS by a factor of 100, or even 1000 in the case of the LetterABC data set, while the error rates are in general quite comparable. EM GMM and SOMkM are approximately 10 times faster than GSOM. Except for the data sets Ring and LetterABC, GSOM correctly finds the expected number of clusters.
4 Conclusion
A novel approach to clustering Kohonen's SOM is presented in the paper, utilizing gravitational clustering in a two-level scheme. According to the results of the experiments, the advantages of the presented method GSOM are as follows. First, GSOM is able to detect and to successfully cluster data of complex shapes with linearly non-separable regions. Second, the proposed algorithm
determines the number of clusters automatically. Finally, employing the SOM on the first level of the algorithm greatly decreases the overall execution time and thus enables the processing of large data sets, which will also be the subject of our further research. Furthermore, data preprocessing methods have to be studied in order to set the values of the GSOM input parameters according to the features of a given data set instead of using heuristics.
References

1. Dunham, M.H.: Data Mining: Introductory and Advanced Topics. Prentice Hall, Englewood Cliffs (2003)
2. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, Amsterdam (2005)
3. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (2001)
4. Vesanto, J., Alhoniemi, E.: Clustering of the Self-Organizing Map. IEEE Trans. on Neural Networks 11(3), 586–600 (2000)
5. Ultsch, A.: Emergence in Self-Organizing Feature Maps. In: 6th International Workshop on Self-Organizing Maps (2007)
6. Brugger, D., Bogdan, M., Rosenstiel, W.: Automatic Cluster Detection in Kohonen's SOM. IEEE Trans. on Neural Networks 19(3), 442–459 (2008)
7. Wright, W.E.: Gravitational Clustering. Pattern Recognition 9, 151–166 (1977)
8. Gomez, J., Dasgupta, D., Nasraoui, O.: A New Gravitational Clustering Algorithm. In: 3rd SIAM International Conference on Data Mining (2003)
9. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
10. Jenssen, R., Principe, J.C., Eltoft, T.: Cauchy-Schwarz pdf Divergence Measure for non-Parametric Clustering. In: IEEE Norway Section International Symposium on Signal Processing (2003)
11. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html
12. Meila, M., Heckerman, D.: An Experimental Comparison of Model-Based Clustering Methods. Machine Learning 42, 9–29 (2001)
13. Jenssen, R., Principe, J.C., Erdogmus, D., Eltoft, T.: The Cauchy-Schwarz Divergence and Parzen Windowing: Connections to Graph Theory and Mercer Kernels. Journal of the Franklin Institute 343(6), 614–629 (2006)
A General Method for Visualizing and Explaining Black-Box Regression Models

Erik Štrumbelj and Igor Kononenko

Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, 1000 Ljubljana, Slovenia {erik.strumbelj,igor.kononenko}@fri.uni-lj.si
Abstract. We propose a method for explaining regression models and their predictions for individual instances. The method successfully reveals how individual features influence the model and can be used with any type of regression model in a uniform way. We used different types of models and data sets to demonstrate that the method is a useful tool for explaining, comparing, and identifying errors in regression models. Keywords: Neural networks, SVM, prediction, transparency.
1 Introduction
Explaining prediction models and their predictions is an integral part of machine learning. The purpose of such methods is to make models more informative, easier to understand, and easier to use. These benefits are especially welcome when using non-transparent prediction models, such as artificial neural networks and SVM. Some of the most popular learning algorithms (naive Bayes, decision trees) owe a part of their popularity to their ability to produce models which are inherently easy to interpret. For others, model-specific explanation and visualization methods have been developed [3,5,6]. There also exist general methods that can be applied to any model. The latter are the focus of this paper.

Before discussing general explanation methods, we start with a simple example. Figure 1 is an explanation for an instance from the artificial data set testA. Instances from this data set describe a situation involving a student in consultation with a professor about his final mark. The teacher can immediately pass the student or may opt to test the student with additional questions, in which case it comes down to the student's knowledge to determine whether the student will pass. The model's task is to predict the student's chances of success. The binary feature teacher describes the teacher's action. The feature student describes the student's knowledge and has 6 possible equally spread levels, where 0 means certain failure, 1 means a 20% chance, ..., and 5 means certain success. In testA all combinations of values of the two features are equally probable. The explanation in Figure 1 is consistent with our intuition and helps us understand the model's prediction. Observe how the explanation is given in the form of magnitudes and directions of features' contributions. Assigning a contribution
Fig. 1. The decision tree makes a dire prediction (0.12) for this poorly prepared student (student = 0) who will be tested (teacher = 1). The explanation suggests that both features have an approximately equal contribution. Both are negative, speaking against the student’s chances.
Fig. 2. A general explanation reveals that both features are approximately equally important (grey dots). Studying increases the student’s chances. Not being tested is beneficial while being tested has an opposite effect.
(score, rank, etc.) is a common approach and is used in most of the previously mentioned model-specific methods and in all of the general methods that follow. By using a general method, machine learning and data mining practitioners can avoid using a different model-specific explanation method for each different model, which also simplifies comparison. Furthermore, in a practical setting it is very desirable, especially from the end-user's perspective, that the explanation method need not be replaced if the underlying prediction models change. To achieve such generality, methods must avoid anything model-specific, essentially treating every model as a black box and limiting all interaction to changing the inputs (feature values) and observing the outputs. Clearly, going through all possible combinations of input values is infeasible, so each method is forced into some sort of tradeoff between its time complexity and the complexity of what it can extract from a model.

Some existing methods, such as [7] and [4], use the "one feature at a time" approach. A feature's contribution for a particular instance is defined as the average change in prediction when the feature's value is permuted. While this reduces the time complexity, it, in some cases, does not result in a change that reveals the true importance of a feature. Observe how the value of the expression 1 ∨ 1 does not change if we change either of the 1's to 0. Both must be changed at the same time to achieve a change. A recently published paper introduces FIRM, a method for computing the importance of features for a given model [9]. For each feature the method observes the variance of the conditional expected output of the model, across all possible values of that feature (conditional on the given value of the feature). However, observe how for two uniformly distributed binary variables E[b_1 XOR b_2 | b_1 = 1] = E[b_1 XOR b_2 | b_1 = 0] = 0.5. The conditional expected outputs will be the same and the variance will be 0. A clearly important variable will be assigned 0 importance.
A method that solves the problems mentioned in the previous paragraph was recently developed for classification models [8]. The authors' basic idea is to observe changes across all subsets of features (for example, also observing how the value of 1 ∨ 1 changes if we change both values at the same time). The exponential time complexity is resolved by an approximation algorithm. However, unresolved issues remain. First, it is limited to classification models and cannot be used to explain a regression model. Second, it can only be used to explain a particular instance (see Figure 1) - users would benefit from a global overview of how features contribute (see Figure 2). And third, the proposed approximation algorithm is based on a very strict assumption that all combinations of feature values are equiprobable. Successfully dealing with the first two issues and loosening the assumption in the third are the main contributions of this paper.

The remainder of the paper is divided into 3 sections. In Section 2 we adapt the explanation method for use with regression models and introduce improvements. Section 3 describes a series of experiments on artificial data sets, followed by an experiment on a real-world data set. With Section 4 we conclude the paper and give some ideas for further work.
2 Explaining Regression Models' Predictions
Let A = A_1 × A_2 × ... × A_n be our feature space, where each feature A_i is a set of values. Let p be the probability mass function defined on the sample space A. Let f : A → ℝ be our regression model. No other assumptions are made about f. Let S = {A_1, ..., A_n}. The influence of a certain subset of features Q ⊆ S in a given instance x ∈ A is defined as

\Delta(Q)(x) = E[f \mid \text{values of features in } Q \text{ for } x] - E[f].    (1)
In other words, the contribution of a subset of feature values in a particular instance is the change in expectation caused by observing those feature values. Suppose we have Δ(Q)(x) for every Q ⊆ S. How do we combine these values to form contributions of individual feature values? In [8] they propose using the well-known game-theoretic solution - the Shapley value - to define ϕ_i(x), the contribution of the i-th feature for instance x:

\varphi_i(x) = \sum_{Q \subseteq S \setminus \{i\}} \frac{|Q|!\,(|S| - |Q| - 1)!}{|S|!} \big(\Delta(Q \cup \{i\})(x) - \Delta(Q)(x)\big).    (2)
Eq. 2 has desirable properties. The feature contributions are implicitly normalized (they sum up to the initial difference Δ(S)(x)), which makes them easier to interpret. If a feature does not have any impact on the prediction, it will be assigned a 0 contribution. And features with a symmetrical impact will be assigned equal contributions. The work described so far in this section is credited to [8] and only minor modifications were necessary to apply the method to a regression setting (in our case f is a regression model's output, instead of a classification model's probabilistic prediction for a given class value).
2.1 Approximation Algorithm
Eq. 2 reflects any influence the feature might have on the prediction. However, in practice it is often impossible to calculate the Δ-terms due to the time complexity. Even if we could, we still face the exponential time complexity of computing ϕ_i(x). In [8] this is resolved by assuming that p(x) = 1/|A|, for all x ∈ A. For any given feature space this assumption limits the choice of p to a single possibility.

The distribution of values plays an important part in how people intuitively explain events. Recall the teacher/student scenario. The concept that students are more likely to pass if they study or are not tested is universal (that is, such a model would perform well at any university with a similar concept, regardless of the distribution of feature values). Our intuitive explanation, however, depends heavily on the distribution of feature values. For example, a student who does not study and is tested will fail. If this teacher tests students most of the time, we would say that it is mostly the student's own fault for not studying. On the other hand, if the teacher almost never tests a student, most would say it was "bad luck" (that is, being tested is a much more important contributor than the amount of study). This example emphasizes the importance of providing more flexibility with respect to the choice of p, while still retaining an efficient explanation algorithm.

To loosen the restriction, we assume that p is such that the individual features are mutually independent. Then we transform Eq. 1 into

\Delta(Q)(x) = \sum_{y \in A} p(y)\,\big(f(\tau(x, y, Q)) - f(y)\big).    (3)

Note that τ(x, y, W) = (z_1, z_2, ..., z_n), where z_i = x_i iff i ∈ W and z_i = y_i otherwise. We use the alternative formulation of the Shapley value (equivalent to Eq. 2)

\varphi_i(x) = \frac{1}{n!} \sum_{O \in \pi(n)} \big(\Delta(Pr^i(O) \cup \{i\})(x) - \Delta(Pr^i(O))(x)\big),    (4)

where π(n) is the set of all permutations of n elements and Pr^i(O) is the set of all features which precede the i-th feature in permutation O ∈ π(n). By combining Eq. 3 and Eq. 4, we get

\varphi_i(x) = \frac{1}{n!} \sum_{O \in \pi(n)} \sum_{y \in A} p(y) \cdot \big(f(\tau(x, y, Pr^i(O) \cup \{i\})) - f(\tau(x, y, Pr^i(O)))\big),    (5)
which facilitates the use of random sampling and an efficient approximation algorithm (see Algorithm 1). Note that "at random" refers to drawing each feature's value at random, according to the distribution of that feature's values (usually, by sampling from a data set). Note that, due to sampling with replacement, features with finite and infinite domains are treated identically. Therefore, the method can be applied to both nominal and numeric features. Observe the same model's prediction for the same instance, but from the data set testB, where the teacher tests the students the vast majority of the time
Algorithm 1. Approximating ϕ_i(x), the importance of the i-th feature's value for instance x and model f. Take m samples.

  ϕ_i(x) ← 0
  for k = 1 to m do
    select (at random) permutation O ∈ π(n) and instance y ∈ A
    x1 ← feature i and the features preceding i in O take their values from x;
         the features succeeding i in O take their values from y
    x2 ← the features preceding i in O take their values from x;
         feature i and the features succeeding i in O take their values from y
    ϕ_i(x) ← ϕ_i(x) + f(x1) − f(x2)
  end for
  ϕ_i(x) ← ϕ_i(x) / m

Algorithm 2. Approximating ψ_{i,j}, the global importance of the i-th feature's value j for model f. Take m samples.

  ψ_{i,j} ← 0
  for k = 1 to m do
    select (at random) instance y ∈ A
    x1 ← set the i-th feature to j, take the other values from y
    ψ_{i,j} ← ψ_{i,j} + f(x1) − f(y)
  end for
  ψ_{i,j} ← ψ_{i,j} / m
(Figure 3) and compare it with Figure 1. The explanation now depends on the context, and the proposed explanation method provides us with explanations which are in accordance with our own intuitive explanation.

Figures 1 and 3 show how individual feature values influence the model's prediction for a given instance. For a global overview of how a feature contributes, we could observe the contributions across several instances. Instead, we provide the same information within a single visualization. We define the global contribution of the i-th feature's j-th value as the expected value of that feature's contribution (see Eq. 5) for an instance where its value is j:
\psi_{i,j} = \sum_{x \in A} p(x)\,\big(f(x') - f(x)\big),    (6)

where x' is x with the i-th feature's value set to j. Eq. 6 can be approximated using Algorithm 2.
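As an illustration only, the two sampling procedures translate almost line by line into NumPy. In the sketch below (names are ours), `model` is any callable returning a prediction and the instances y are drawn from a reference sample of A; drawing whole rows approximates the independent-features assumption made above only roughly.

  import numpy as np

  def phi(model, x, data, i, m=1000, seed=0):
      """Algorithm 1: sampled contribution of feature i to the prediction f(x)."""
      rng = np.random.default_rng(seed)
      n, total = x.size, 0.0
      for _ in range(m):
          order = rng.permutation(n)                    # random permutation O
          y = data[rng.integers(len(data))]             # random instance y
          pos = int(np.where(order == i)[0][0])
          preceding = order[:pos]
          x1, x2 = y.copy(), y.copy()
          x1[preceding] = x[preceding]; x1[i] = x[i]    # i and its predecessors from x
          x2[preceding] = x[preceding]                  # i taken from y
          total += model(x1) - model(x2)
      return total / m

  def psi(model, data, i, j, m=1000, seed=0):
      """Algorithm 2: global importance of value j of feature i."""
      rng = np.random.default_rng(seed)
      total = 0.0
      for _ in range(m):
          y = data[rng.integers(len(data))].copy()
          x1 = y.copy()
          x1[i] = j                                     # force the i-th feature to value j
          total += model(x1) - model(y)
      return total / m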
Fig. 3. If it is likely that the teacher will test the student then studying hard (or lack of) becomes much more important
Fig. 4. KNN1 does not perform well, but the features have a strong influence on its predictions. We can conclude it overfits.
Fig. 5. M5P successfully models dDisj and correctly predicts R = 1. The visualization shows that a single feature is responsible for the prediction, while the other two have the opposite effect.
Fig. 6. The neural network successfully models dXorBin and correctly predicts this instance. The explanation reveals that the first three features are important and all three contribute towards 1.
Figure 2 is a visualization of the global importance of the features for our illustrative data set testA. Each grey/black point pair is obtained by running Algorithm 2. The mean of the ψ_{i,j} samples (black points) reveals the magnitude and direction of the value's average influence. The standard deviation of the ψ_{i,j} samples (grey points) is also included for each value to reveal its global importance. For an instance explanation, we repeat Algorithm 1 for each feature.

To ensure with a certain probability that the approximated contribution will be within a certain distance of the actual contribution, we require a constant number of samples. Therefore, for a given error the number of samples m needed to generate the explanation for a single feature does not increase with the number of features. The same applies to the global visualizations, although the constant is larger because we repeat the process for each feature value we plot. The total running time for one explanation is: a constant × the number of features n × the model's prediction time complexity T(f(x)). The method's time complexity is O(n · T(f(x))). For most regression algorithms T(f(x)) is O(n), which implies quadratic time complexity. Our purpose is to show that the method is a well-founded and useful tool, which can be used to generate explanations in real time
Fig. 7. SVM provides the best fit for dPoly. Subsequently, the contributions closely match the actual concepts.
Fig. 8. The visualization shows us that MP learned some but not all of the concepts behind the dPoly data set.
(order of seconds) for data sets with up to a few dozen features (already shown in [8]). A more rigorous analysis of the limits of the method with respect to the number of features it can handle for a given type of model is delegated to further work.
3 Experimental Verification
We have shown that the method is theoretically well founded and has several desirable properties. But how well does it translate into practice? The time complexity was discussed at the end of Section 2.1. Due to length limits, we omit an in-depth analysis of running times in favor of showing more examples.

We tested the method using six different regression algorithms: linear regression (LR), a Support Vector Machine for regression (SVM), a multi-layer perceptron with a single hidden layer (MP), k-nearest neighbors (k = 1 and k = 11), a regression tree (M5P), and pace regression (PR). The method was implemented in Java using Weka's learning algorithm classes [1]. Default parameters were used, with the exception of SVM, where a 2nd degree polynomial kernel was used. A variety of models (in terms of performance and type) is desirable, as we can verify whether the explanations reveal why they performed well or poorly. Artificial data allow us to test if explanations generated for good models are close to those generated for the optimal model (and vice versa). All feature values lie between 0.00 and 100.00, R is the target variable, each data set has 5 features, and the features that are not explicitly mentioned have no influence on R. Note that 1000 training and 1000 test samples were generated for each data set.

Data sets: dLinear (R = A1 + 2A2 + 3A3), dRedund (R = 2A1 − 2A2; A3 always has the same value as A2 to create a redundant feature), dLocLin (features A3 and A4 are binary and divide the problem space into 4 locally linear subproblems: R = 5A1 + A2 if A3 = 0 ∧ A4 = 0; R = A1 − 4A2 if A3 = 0 ∧ A4 = 1; R = 2A1 + 8A2 if A3 = 1 ∧ A4 = 0; R = −2A1 − 3A2 if A3 = 1 ∧ A4 = 1), dTrig (R = sin(2πA1/100) + cos(2πA2/100)), dPoly (R = 2((A1 − 50)/25)² − 3((A2 − 50)/25)² − (A3 − 50)/25), dDisj (R = 1 if (A1 > 50) ∨ (A2 > 40) ∨ (A3 > 60), otherwise R = 0), dXor (an XOR problem, R = (A1 > 50) XOR (A2 > 50) XOR (A3 > 50)), dXorBin (similar to dXor, but all five features are binary, R = A1 XOR A2 XOR A3), dRand (R is chosen at random).

First, we investigated if the generated contributions reflect what the model learns. We evaluated the models with the relative root mean squared error (RRMSE). For a distance measure¹ we used the Euclidean distance between the vector (ϕ_1, ..., ϕ_n) and the vector generated when using optimal predictions instead of f. Table 1 shows the results for the described experiment.

Table 1. RRMSE and distances from the explanation for an optimal model (in parentheses). The correlation coefficients between the two are included for each data set.

        dLinear       dLocLin        dRedund       dTrig        dPoly        dDisj        dXor         dXorBin      dRand
LR      0.00 (3.09)   0.49 (112.72)  0.00 (2.33)   0.83 (0.78)  0.98 (4.06)  0.85 (0.17)  1.00 (0.35)  1.00 (0.29)  1.00 (1.82)
MP      0.01 (3.10)   0.05 (13.75)   0.02 (3.01)   0.33 (0.20)  0.88 (3.25)  0.72 (0.13)  0.57 (0.17)  0.00 (0.05)  0.99 (1.72)
SVM     0.01 (3.08)   0.13 (24.24)   0.01 (2.78)   0.50 (0.33)  0.13 (0.67)  1.05 (0.33)  0.81 (0.26)  1.60 (0.81)  1.00 (3.19)
M5P     0.24 (24.12)  0.08 (20.38)   0.12 (7.20)   0.18 (0.13)  0.30 (1.03)  0.34 (0.03)  0.30 (0.06)  0.00 (0.04)  1.00 (3.40)
KNN1    0.34 (19.80)  0.11 (25.53)   0.16 (17.21)  0.59 (0.35)  0.66 (2.52)  0.75 (0.10)  0.75 (0.23)  0.00 (0.14)  1.00 (14.87)
KNN10   0.24 (22.12)  0.11 (28.73)   0.13 (11.23)  0.52 (0.43)  0.60 (2.24)  0.61 (0.12)  0.60 (0.21)  0.26 (0.16)  1.01 (5.79)
PR      0.00 (2.97)   0.50 (114.52)  0.00 (3.25)   0.79 (0.73)  0.97 (4.05)  0.74 (0.16)  1.00 (0.35)  1.00 (0.29)  1.00 (1.89)
coeff   0.942         0.998          0.927         0.958        0.991        0.911        0.992        0.913        NA

Some models perform better and some data sets are more difficult. Regardless, explanation quality and model performance are highly correlated. Correlation is not applicable to dRand. All models should have an RRMSE of 1 (any deviations are due to noise). However, some models overfit, which results in explanations far from optimal. For example, KNN1 is likely to overfit. Figure 4 reveals that feature A1 has a substantial influence on the KNN1 model, despite being useless for predicting R. The results confirm that the explanations reflect, at least in an abstract sense, what the models have learnt.

We continue by observing some examples and verifying whether the explanations are useful from a user's perspective. We start with instances from dDisj and dXorBin. Figures 5 and 6 are explanations for M5P on dDisj and MP on dXorBin, respectively. In the introduction we pointed out that these two concepts are representative of what existing general methods are unable to handle correctly. The visualizations show that the proposed method reveals the important features and their contributions. Now we proceed to the global visualizations². The best model for dPoly was SVM. The explanation (Figure 7) confirms that it fits the data well. The worst were the
¹ That is, to describe how much the explanations generated for a given model differ from those generated for an optimal model.
² We left some irrelevant features out of the visualizations to conserve space.
Fig. 9. LR is most influenced by cement, water, and age. Concrete strength increases with age and amount of cement and decreases with the amount of water.
Fig. 10. Similar to LR, cement, water, and age are the most important for the neural network model. However, MP fits the non-linear relationships better.
Fig. 11. For this particular prediction age contributes positively. The amount of water and cement have a negative contribution. Construction experts agree with the explanation and elaborate that the mixture suffers from a high water-to-cement ratio. Least important features were removed.
linear models, which cannot fit the polynomial. The MP model is somewhere in between, and Figure 8 reveals why: the model learned only a part of the concept, missing the relevance of feature A2.

We conclude the section with a more realistic example of what data mining practitioners encounter in practice. The concrete data set has 9 numeric features - concrete mixture components (in kg/m³) and age (in days) - and one target feature - the compressive strength of the mixture (in MPa). The data were obtained from the UCI repository, where they were made available by prof. I-Cheng Yeh [2]. The compression strength is a highly non-linear problem [2]. Using LR and MP we achieved mean squared errors of 109 and 55, respectively, while predicting with the mean value results in a mean squared error of 279 (we used 10-fold cross-validation). The minimum, maximum, mean, and standard deviation of the compressive strength class variable are 2.3, 82.6, 35.8, and 16.706, respectively.
Figures 9 and 10 are visualizations for LR and MP. These are used to reveal the overall importance of individual features and their contribution to the model's predictions. When interested in a specific prediction, we observe the corresponding instance explanation. For example, Figure 11 is an instance explanation for MP's prediction for a particular concrete mixture. MP's prediction is close to the actual concrete compressive strength, while LR overestimates the compressive strength for this instance and predicts 60 MPa. The explanation reveals which features contribute towards/against compressive strength.
4 Conclusion
The proposed explanation method is simple to implement and can be applied to any regression model. It can explain both the model and its predictions. Results across different regression models and data sets confirmed that the method’s explanations reflect what the models learn, even in cases where existing general explanation methods would fail. The examples presented throughout the paper illustrate that the method is a useful tool for visualizing models, comparing them, and identifying potential errors. With emphasis on the theoretical properties and the method’s usefulness, less attention was given to measuring and optimizing running times. We delegate this to further work, together with an in-depth analysis of running times across different types of models.
References

1. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explorations 11(1), 1–3 (2009)
2. I-Cheng, Y.: Modeling of strength of high performance concrete using artificial neural networks. Cement and Concrete Research 28(12), 1797–1808 (1998)
3. Jakulin, A., Možina, M., Demšar, J., Bratko, I., Zupan, B.: Nomograms for visualizing support vector machines. In: KDD 2005: ACM SIGKDD, pp. 108–117 (2005)
4. Lemaire, V., Féraud, R., Voisine, N.: Contact personalization using a score understanding method. In: IJCNN 2008 (2008)
5. Možina, M., Demšar, J., Kattan, M., Zupan, B.: Nomograms for visualization of naive Bayesian classifier. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) PKDD 2004. LNCS (LNAI), vol. 3202, pp. 337–348. Springer, Heidelberg (2004)
6. Poulet, F.: SVM and graphical algorithms: A cooperative approach. In: 4th IEEE ICDM, pp. 499–502 (2004)
7. Robnik-Šikonja, M., Kononenko, I.: Explaining classifications for individual instances. IEEE TKDE 20, 589–600 (2008)
8. Štrumbelj, E., Kononenko, I.: An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research 11, 1–18 (2010)
9. Zien, A., Krämer, N., Sonnenburg, S., Rätsch, G.: The feature importance ranking measure. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS, vol. 5782, pp. 694–709. Springer, Heidelberg (2009)
An Experimental Study on Electrical Signature Identification of Non-Intrusive Load Monitoring (NILM) Systems Marisa B. Figueiredo, Ana de Almeida, and Bernardete Ribeiro CISUC, Department of Informatics Engineering, University of Coimbra, Polo II, P-3030-290 Coimbra, Portugal {mbfig,amaria,bribeiro}@dei.uc.pt
Abstract. Electrical load disambiguation for end-use recognition in the residential sector has become an area of study of its own right. Several works have shown that individual loads can be detected (and separated) from sampling of the power at a single point (e.g. the electrical service entrance for the house) using a non-intrusive load monitoring (NILM) approach. This work presents the development of an algorithm for electrical feature extraction and pattern recognition, capable of determining the individual consumption of each device from the aggregate electric signal of the home. Namely, the idea consists of analyzing the electrical signal and identifying the unique patterns that occur whenever a device is turned on or off by applying signal processing techniques. We further describe our technique for distinguishing loads by matching different signal parameters (step-changes in active and reactive powers and power factor) to known patterns. Computational experiments show the effectiveness of the proposed approach. Keywords: feature extraction and classification, k-nearest neighbors, non-intrusive load monitoring, steady-state signatures, support vector machines.
1 Introduction
“Your TV set has just been switched on.” This may very well be an SMS or email message received on your mobile phone in the near future. For energy monitoring, health care or home automation, concepts like Smart Grids or in-Home Activity Tracking are a recent and important trend. In that context, an accurate identification and monitoring of electrical appliance consumption is required. Moreover, such a monitoring system should be inconspicuous. Currently, the available solutions for load consumption monitoring are smart meters and individual meters. The former supply aggregated consumption information without identifying which devices are on. To overcome this limitation, using an individual meter for each appliance in the house would be sufficient. However, this would turn out to be an expensive solution for a household.
A non-intrusive load monitoring (NILM) system fulfills all the requirements imposed by the Smart Grids and in-Home Activity Tracking challenges at virtually no cost. NILM is a viable solution for monitoring individual electrical loads: a single device is used to monitor the electrical system and to identify the electric load related to each appliance, without increasing the marginal cost of electricity or needing extra sub-measurements. Nevertheless, only with the present low-cost sensing devices can its full potential be achieved.

The central goal of a NILM system is to identify which appliances are switched on at a certain moment in time. The signals from the aggregate consumption of an electrical network are acquired and electrical features are extracted in order to identify which devices are switched on. Each appliance has a particular electrical signature which must be recognized in order to perform an accurate identification. This paper presents a study of such a distinctive electrical characterization. The proposed signature is based on the analysis and recognition of steady-states occurring in the active and reactive power signals and on the power factor measurements. To evaluate this approach, data from a set of appliances were collected and classified using a Support Vector Machine (SVM) method and the K-Nearest Neighbors (K-NN) technique. The results of the computational experiments indicate that an accurate identification of the devices can, in fact, be accomplished.

This paper is organized as follows: the next section presents a brief overview of the related literature. Section 3 describes the concepts behind the NILM system and the electrical signature problem. It proceeds by describing the developments associated with the analysis of step-changes in an electrical signal and the features that can be used as distinctive marks, introducing a result that enables an algorithm for steady-state recognition. Finally, Section 4 describes the experimental setup, where the new algorithm for feature extraction was used, followed by the SVM and K-NN classification algorithms, and the classification results. Conclusions and future work are addressed in Section 5.
2 Related Literature
To identify the devices that are switched on at a certain moment in time, a non-intrusive load monitoring system uses only the voltage and current signals of the aggregated electrical consumption, sampling the power at a single point. The concept was independently introduced by Hart [1] (then working at the Electric Power Research Institute) and by Sultanem (Electricité de France) [2]. Over the last decades, due to the pressing environmental and economic issues, the interest in this area has increased, and it has been the focus of PhD theses such as [3]. In 1996, the first NILM system was commercialized by the company Enetics, Inc.

The main steps in a non-intrusive load monitoring system are: a) the acquisition of electrical signals, b) the extraction of the important events and/or characteristics and c) the production of a classifier of electrical events (see Figure 1). To perform the identification, the definition of an electrical device ID is needed. Therefore, the electrical signatures are the basis of any NILM system [1]. These are defined as a set of parameters that can be measured from the total load. For
Fig. 1. A NILM high-level system and approach for the device signature study
a NILM system, usually these parameters result either from signal steady-states or are obtained by sensing transients.

A steady-state signature is deduced from the difference between two steady-states in a signal, a steady-state being a stable set of consecutive samples whose values are within a given threshold. The basic steady-state identifies the turning on and off of an electrical device connected to the network. Achieving this steady-state detection is much less demanding than what is required for the capture and analysis of transients. Other advantages are the fact that we can recognize turning-off states and that, when two appliances are on at the same time, it is possible to analyze their summed signature. Hence, steady-states were used by Hart for the prototype presented in [1]. Since then, steady-state signatures have been used by several authors, mainly for residential load monitoring systems. In [4,5,6] discrete changes in the active and reactive power are analyzed, while [7] only uses the active power. Nevertheless, some limitations can be pointed out, such as the impossibility of distinguishing two different appliances with the same steady-state signature. The small sampling rate can also be considered a disadvantage: sequences of loads turned on within a period smaller than the sampling interval cannot be identified. To overcome these limitations, the transient signatures, which result from the noise in the electrical signal caused by the switching on/off of an appliance, can be used. Yet, for transient identification a high sampling rate is needed. Since both steady-state and transient signatures have their own limitations, considering both for a study of a joint ID is interesting. This was considered by Chang et al. in [8] very recently. The following section describes our approach.
3 Steady-States (StS) Recognition: Proposed Approach
Electrical signatures are the main component of a NILM system. Usually, individual load identification uses transient and steady-state (StS) signatures. Due to the high sampling frequency needed for the former, residential NILM systems typically use the latter. However, one of the drawbacks of StS is the fact that distinct appliances can present very similar signatures. In fact, using only the step-changes in the active power, little information is provided, which may lead to
an incorrect identification. This paper studies the incorporation of further signal information in order to enrich the electric profile of each appliance, namely from the reactive power signal and the power factor measurements.

The first step in the definition of a StS is the recognition of a stable value sequence in the sampled signal. In [9] the authors presented a method for the identification of a steady-state signature based on ratios between rectangular areas defined by the successive state values. The method allows for the identification of a complete steady-state, i.e., when the StS begins and when it ends. The approach is based on the difference between the rectangular area produced by aggregating a new sample and the one already defined by the previous values in the stable state. However, keeping only the extreme values already in the stable state and testing the new sample value against them can simplify this approach. This improvement is described in the following. The new result was implemented in order to extract features from the power signals: the active power, reactive power and power factor signals.
3.1 A Rule for Steady-States Recognition
A sequence of consecutive samples is regarded as a stable state if the difference between any two samples of the sequence does not exceed a given tolerance value. The minimum number of consecutive samples needed to identify a stable state depends on the sampling frequency: when this is low, a small number of samples is enough, otherwise a bigger number is needed. For instance, with a sampling frequency of 1 Hz, the minimum number of samples can be defined as three, which is the one used in [1], where other methods for steady-state recognition are also proposed (namely, filtering, differentiating and peak detection).

Consider a sequence of n consecutive sample values, Y = {y_i, i = 1, ..., n}, already identified as a steady-state. By definition, |y_i − y_j| ≤ ε ∀i, j = 1, ..., n with i ≠ j, where ε > 0 is the defined tolerance. Let y_M = max{y_i} and y_m = min{y_i}, i = 1, ..., n, be the maximum and minimum values, respectively, of Y, and let y_r (r = n + 1) be the next sample value. Next we prove that y_r maintains the stable behavior of Y only for a limited range of values.

Theorem 1. In the conditions above, the n + 1 consecutive values form a steady-state iff y_M − ε ≤ y_r ≤ y_m + ε, i.e., |y_i − y_j| ≤ ε, ∀i, j = 1, ..., n + 1.

Proof. In fact, if y_m ≤ y_r ≤ y_M, then |y_i − y_r| ≤ |y_m − y_M| ≤ ε, for all i = 1, ..., n. Consider now that y_M < y_r ≤ y_m + ε. For any y_i ∈ [y_m, y_M], i = 1, ..., n, we have |y_i − y_r| ≤ |y_m − y_r| ≤ |y_m − (y_m + ε)| = ε. Thus, the sequence of the n + 1 values y_i, i = 1, ..., n + 1, forms a steady-state with a new maximum value: y_M = y_{n+1} = y_r.
Fig. 2. Range of acceptable values for inserting y_r into a previously identified StS
If we assume that y_M − ε ≤ y_r < y_m, then, using a similar reasoning, we prove that y_r = y_{n+1} maintains the value stability of the state, and the steady sequence y_i, i = 1, ..., n + 1, has a new minimum value: y_m = y_{n+1} = y_r. In all other cases, that is, y_r < y_M − ε or y_r > y_m + ε, y_r does not belong to the steady-state Y since it exceeds the maximum tolerance value. Let us consider y_r < y_M − ε. Then (Figure 2), y_r < y_M − ε ≤ y_m ≤ y_i ≤ y_M. Hence, |y_r − y_M| > |y_M − ε − y_M| = ε. The remaining case can be proved similarly.

In conclusion, a consecutive sample point y_r belongs to the steady-state immediately before it if y_M − ε ≤ y_r ≤ y_m + ε, where y_m and y_M are the minimum and the maximum values in the state. Otherwise, the previous sample is considered the end of the steady-state. When all the samples of the signal have been tested for steady-state identification, the method ends by computing the differences between consecutive states, and a feature vector is built.
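Theorem 1 yields a simple online rule: keep the running minimum and maximum of the current state and test each new sample against them. A sketch of our own formulation follows; the names eps and min_len are ours, and taking the mean level of each state as the basis for the step-change is one reasonable choice, not necessarily the one used in the prototype.

  from statistics import fmean

  def detect_steady_states(samples, eps, min_len=3):
      """Split a sampled signal into steady-states using the rule of Theorem 1.
      Returns (start, end) index pairs, end exclusive."""
      states, start = [], 0
      y_min = y_max = samples[0]
      for r in range(1, len(samples)):
          y = samples[r]
          if y_max - eps <= y <= y_min + eps:      # y_r keeps the state stable
              y_min, y_max = min(y_min, y), max(y_max, y)
          else:                                    # the previous sample ends the state
              if r - start >= min_len:
                  states.append((start, r))
              start, y_min, y_max = r, y, y
      if len(samples) - start >= min_len:
          states.append((start, len(samples)))
      return states

  def step_changes(samples, states):
      """Differences between the mean levels of consecutive steady-states."""
      levels = [fmean(samples[a:b]) for a, b in states]
      return [nxt - cur for cur, nxt in zip(levels, levels[1:])]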
3.2 Defining a Signature
As was mentioned, a signature composed only of the changes in the active power provides little information for an accurate appliance recognition. The active power (also known as real power) represents the power that is being consumed by the appliances. However, two other electrical parameters can also be used: the apparent power and the reactive power. In a simple alternating current circuit, current and voltage are sinusoidal waves that, according to the load in the circuit, can be in phase or not. For a resistive load the two curves will be in phase and multiplying their values, at each instant, produces a positive result. This indicates that only real power is transferred in the circuit. In the case of a purely reactive load, current and voltage will be ninety degrees out of phase, which means that only reactive energy exists. In practice, resistance, inductance and capacitance loads will occur, so both real and reactive power will exist. Finally, the product of the root-mean-square voltage and current values represents the apparent power. The real, reactive and apparent powers are measured in watts (W), volt-amperes reactive (VAR) and volt-amperes (VA), respectively. See an example for a 20" LCD in Figure 3.

The relation between the three parameters is given by S = √(P² + Q²), where S, P and Q represent the apparent, active and reactive powers, respectively. The apparent and the real powers are also connected by the power factor. The latter constitutes an efficiency measure of a power distribution system and is computed as the ratio between the real power and the apparent power in a circuit. It varies between 0, for a purely reactive load, and 1, in the case of a resistive load.
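For instance, given the measured active power and the RMS voltage and current, the apparent power, reactive power and power factor can be derived as follows; this is a small helper of our own for illustration, not part of the sensing prototype.

  import math

  def power_triangle(p_active, v_rms, i_rms):
      """Derive apparent power S (VA), reactive power Q (VAR) and the power factor."""
      s_apparent = v_rms * i_rms
      q_reactive = math.sqrt(max(s_apparent ** 2 - p_active ** 2, 0.0))  # Q = sqrt(S^2 - P^2)
      power_factor = p_active / s_apparent if s_apparent > 0 else 1.0    # PF = P / S
      return s_apparent, q_reactive, power_factor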
Fig. 3. Active power, reactive power and power factor for a 20” LCD
4
Computational Experiments
4.1
Data Collection and Feature Extraction
The data, namely the active power, voltage, current and power factor signals, were acquired using a sensing meter prototype provided by ISA - Intelligent Sensing Anywhere [10]. However, this prototype has a severe limitation for monitoring several parameters: only one parameter value can be supplied at each point in time. This implies a delay between the values of different parameter types. Another shortcoming is that measurement errors can occasionally occur, resulting in failure to deliver the expected value. To evaluate the effectiveness of the composed signatures, data from several electrical appliances were acquired, with a 100-millisecond delay between the samples of the different parameters. The parameter data types are: active power, current, voltage and power factor. Therefore, the sampling period for each parameter type was 400 milliseconds. The data for each appliance are collected in four steps: a) during 10 to 15 seconds, signal samples are acquired without the appliance being plugged into the socket; b) the device is plugged in and samples are collected for 15 seconds; c) the apparatus is switched on and runs for a period of 1 minute¹; and d) the appliance is switched off, after which a 15-second sampling period occurs. For each of the appliances the process was repeated fifty times. The devices chosen for the experiments were: a microwave, a coffee machine, a toaster, an incandescent lamp and two LCDs (from the same manufacturer but different models). In order to proceed with StS identification, Theorem 1 from Section 3 was implemented, obtaining a recognition algorithm for processing the collected signals. For each of the different appliances it is possible to identify three steady-states:
¹ For the coffee machine, the running time is less than a minute, corresponding to the time needed for an espresso.
a stable signal before the switch-on of any of the devices; another StS corresponding to the appliance's operation phase; and a last one occurring after switching off. In fact, one LCD in particular presented four different states: it was possible to identify the steady-state related to the standby mode. For each of the four measured parameters, the difference between the identified steady-states was calculated such that a positive/negative value was associated with the switch-on/off, respectively.
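A composed signature can then be assembled per event by concatenating the step changes of the measured parameters and using the sign of the active-power change to mark switch-on/off. The sketch below assumes the steady_states and state_changes helpers from Section 3.1 and a hypothetical dictionary of per-parameter tolerances; it is an illustration only.

```python
def composed_signature(signals, eps):
    """Build one feature vector per detected event.

    `signals` maps a parameter name (e.g. 'P', 'Q', 'pf') to its samples and
    `eps` maps the same names to their tolerances. The signature concatenates
    the step change of every parameter; the sign of the active-power ('P')
    change labels the event as switch-on (+) or switch-off (-).
    """
    changes = {name: state_changes(x, steady_states(x, eps[name]))
               for name, x in signals.items()}
    n_events = min(len(c) for c in changes.values())
    features, labels = [], []
    for k in range(n_events):
        features.append([changes[name][k] for name in sorted(changes)])
        labels.append('on' if changes['P'][k] > 0 else 'off')
    return features, labels
```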
4.2 Feature Classification Methods and Multi-evaluation Metrics
To assess the performance of the composed signature, the features for the six-class problem associated with the switch-on were normalized. Classification was performed using Support Vector Machines (SVM) and 5-Nearest Neighbors (5-NN) methods. The SVM was developed for solving binary classification problems; nevertheless, in the related literature two main approaches to solve a multi-classification problem can be found: one-against-all and one-against-one [11]. In the first technique, a binary problem is defined by using each class against the remaining ones. This implies that m binary classifiers are applied (m > 2 is the number of different classes). In the second, m(m−1)/2 binary classifiers are employed, comparing each pair of classes. For a given sample, a voting is carried out among the classifiers and the class obtaining the maximum number of votes is assigned to it. This last approach is supplied in LIBSVM [12], a package available to implement SVM classification. A similar package is SVMLight [13], which uses the multi-class formulation described in [14] and the algorithm based on Structural SVMs [15] to perform multi-class classification. To perform the classification of the composed electrical signatures the one-against-all strategy was implemented using the SVM and 5-NN methods. For the SVM, the linear kernel and the radial basis function (RBF) kernel with scaling factor σ = 1 were used. For the multi-class classification, the available SVMLight implementation was chosen. A 3-fold cross validation was applied to the data set in order to evaluate the test performance. The results are reported in Table 1. To assess the test performance, accuracies, macro-averages and micro-averages were used. For the latter two, the F-measure is calculated in two different ways: a) as the mean value of the F-scores computed for each binary problem (macro-average); b) as the global F-measure, calculated from a global confusion matrix obtained by summing all the confusion matrices of the binary problems (micro-average). The F-score is an evaluation of a test's accuracy which combines the recall (R) and the precision (P) of a test. The general formula is Fβ = (1 + β²) · P·R / (β²·P + R). In this paper β = 1 is used, i.e., Fβ is the harmonic mean of the precision and recall. In order to evaluate a binary decision task, we first define a contingency matrix representing the possible outcomes of the classification, namely the True Positives (TP - positive examples classified as positive), the True Negatives (TN - negative examples classified as negative), the False Positives (FP - negative examples classified as positive) and the False Negatives (FN - positive examples classified as negative). The recall is defined as TP/(TP + FN) and the precision as TP/(TP + FP).
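For reference, macro- and micro-averaged F1 (β = 1) can be obtained from the per-class confusion counts as sketched below; this only restates the definitions above and is independent of the LIBSVM/SVMLight tools used in the experiments.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f1(counts):
    """`counts` is a list of (TP, FP, FN) tuples, one per binary problem."""
    macro = sum(f1(*c) for c in counts) / len(counts)   # mean of per-class F1
    tp = sum(c[0] for c in counts)                      # global confusion matrix
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return macro, f1(tp, fp, fn)
```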
Table 1. The mean accuracies and F-scores for the tests performed using one-against-all SVM (linear and RBF kernels) and one-against-all 5-NN

                      SVM one-against-all (Linear)    SVM one-against-all (RBF)     5-NN one-against-all
                      F1 (%)        Acc. (%)          F1 (%)        Acc. (%)        F1 (%)        Acc. (%)
Incandescent bulb     95.2 ± 1.7    98.1 ± 0.0        97.1 ± 2.8    98.5 ± 0.1      99.4 ± 1.1    99.7 ± 0.6
Lcd 22                n.d.          83.2 ± 0.1        49.1 ± 21.2   89.6 ± 1.2      99.4 ± 0.0    99.0 ± 0.0
Lcd 32                96.0 ± 0.0    99.3 ± 0.5        95.9 ± 1.9    98.4 ± 0.4      96.0 ± 4.6    97.9 ± 0.8
Microwave             96.7 ± 1.6    98.7 ± 0.3        97.9 ± 3.6    99.4 ± 0.5      96.8 ± 5.6    98.1 ± 1.0
Toaster               97.2 ± 2.8    99.2 ± 0.3        98.2 ± 3.2    99.8 ± 0.4      100.0 ± 0.0   100.0 ± 0.0
Coffee Machine        n.d.          99.2 ± 0.3        86.9 ± 4.9    99.8 ± 0.9      98.0 ± 1.2    100.0 ± 0.0
Average               n.d.          96.3 ± 6.4        87.5 ± 19.3   97.5 ± 4.1      98.3 ± 1.6    99.1 ± 0.9
Micro-average         76.8                            90.0                          98.9

4.3
Evaluation Results
For each of the one-against-all tests (SVM and 5-NN), the F-scores and mean accuracies are shown in Table 1. Towards a global evaluation, the macro-averages (mean values of the F-scores), micro-averages and mean accuracies are also presented. As can be observed, the performance of the one-against-all approaches is quite effective. On average, we have an accuracy of around 96% for the linear SVM, 97% for the RBF SVM and 99% for the 5-NN. In contrast, the multi-class method presents very low accuracy values: around 40%. Micro- and macro-averages are measures only applied to binary problems. Therefore, in order to compare the SVMLight multi-class test results, micro-averages were computed for the obtained classifications. For that, the results of the six one-against-all binary problems were derived from the multi-class classification results, as well as the respective confusion matrices and F-measures. With respect to the accuracy, the results for the linear and RBF kernels were 40.57±8.55% and 40.57±8.55%, respectively. The micro-average values were computed as previously described, resulting in a value of 40.67% for both kernels. Notice that accuracy values provide only global information: high accuracy is not necessarily related to a precise identification of true positives. In fact, micro-averages supply particular information related to the samples classified as positive, where the multi-class classification scored badly. This may result from the fact that, in the test data sets associated with the binary problems used, the number of samples that belong to the class under test is smaller than the remaining ones. Therefore, the number of TN will probably be greater than the number of samples labeled as TP. Actually, cases where no TP are labelled may occur, and then the F-score cannot be defined. In our case, for the multi-class SVM the accuracy is low, in line with the respective micro-average, while for the remaining tests both performance metrics are high.
For both one-against-all methods, the good performance indicates that the composed signature can be an accurate description of each of the appliances in the database. Nevertheless, these findings may be related to the fact that the number of electrical appliances used is still very small. Moreover, all of these appliances have distinct loads, with the exception of the LCD devices. Taking a closer look at the multi-class classification tests, the incandescent bulb was misclassified as the toaster, and the LCDs and the coffee machine were misclassified as the microwave. To overcome this limitation, other methods for multi-class classification can be studied, such as neural networks or a hybrid approach, or more features can be added to the signature, for instance, information related to the transient signals.
5
Conclusions and Future Work
The project for the deployment of Smart Energy Grids requires automatic solutions for the identification of electrical appliances. The implementation of in-home activity modeling and recognition relies on cheap and inconspicuous recognition systems. The most suitable solution for both problems is a NILM system. Moreover, such a scheme can also be used as a household electrical management system. For implementing a NILM system, based on the sampling of power at a single point, feature extraction techniques and classification methods are needed to detect and separate individual loads. This can only be accomplished through the definition of an effective electrical signature. This work begins by presenting an approach to perform the identification of the step-changes of the electrical signal. This strategy was applied to the analysis of the signals acquired for a given data set of appliances, in order to extract features for the definition of an electrical ID for each device. The features use the step-changes in active and reactive powers and power factor. In order to evaluate the proposed approach, we used SVM and 5-NN one-against-all classification tests as well as SVM multi-class classification tests. The results show that the simplest methods are able to accurately tackle the recognition issue. This work constitutes an experimental test case study for a composed steady-state signature. Future work will acquire more steady-state IDs in order to increase the data set and perform more ambitious tests. The incorporation of the transient pattern associated with each appliance in the signature is under study. The first problem to overcome is the very high sampling frequency required to capture transients, not easy to achieve unless a specific sensing device is developed. Another research question that needs to be answered addresses the consumption variations of a device operating in its intermediate state. The proper identification of these variations with the respective device can bring added value to the analysis of the information provided by a NILM system.
Acknowledgments. The authors would like to thank ISA for the collaboration and the iTeam project for the support grant.
References 1. Hart, G.W.: Nonintrusive appliance load monitoring. Proc. of the IEEE 80, 1870– 1891 (1992) 2. Sultanem, F.: Using appliance signatures for monitoring residential loads at meter panel level. IEEE Transactions on Power Delivery 6, 1380–1385 (1991) 3. Leeb, S.B.: A conjoint pattern recognition approach to nonintrusive load monitoring. PhD thesis, Massachusetts Institute of Technology (1993) 4. Cole, A., Albicki, A.: Data extraction for effective non-intrusive identification of residential power loads. In: Instrumentation and Measurement Technology Conf., IMTC 1998. Conf. Proc. IEEE, vol. 2, pp. 812–815 (1998) 5. Cole, A., Albicki, A.: Algorithm for non intrusive identification of residential appliances. In: Proc. of the 1998 IEEE Intl. Symposium on Circuits and Systems, ISCAS 1998, vol. 3, pp. 338–341 (1998) 6. Berges, M., Goldman, E., Matthews, H.S., Soibelman, L.: Learning systems for electric consumption of buildings. In: ASCE Intl. Workshop on Computing in Civil Engineering, Austin, Texas (2009) 7. Bijker, A., Xia, X., Zhang, J.: Active power residential non-intrusive appliance load monitoring system. In: AFRICON 2009, pp. 1–6 (2009) 8. Chang, H.H., Lin, C.L., Lee, J.K.: Load identification in nonintrusive load monitoring using steady-state and turn-on transient energy algorithms. In: 2010 14th Intl. Conf. on Computer Supported Cooperative Work in Design, pp. 27–32 (2010) 9. Figueiredo, M., de Almeida, A., Ribeiro, B., Martins, A.: Extracting features from an electrical signal of a non-intrusive load monitoring system. In: Fyfe, C., Tino, P., Charles, D., Garcia-Osorio, C., Yin, H. (eds.) IDEAL 2010. LNCS, vol. 6283, pp. 210–217. Springer, Heidelberg (2010) 10. ISA Intelligent Sensing Anywhere, S.: Isa intelligent sensing anywhere (2009), http://www.isasensing.com/ [Online; accessed 18-October-2010]. 11. Fauvel, M., Chanussot, J., Benediktsson, J.: Evaluation of kernels for multiclass classification of hyperspectral remote sensing data. In: Proc. of IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, pp. 813–816. IEEE, Los Alamitos (2006) 12. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm 13. Joachims, T.: Making large-scale svm learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning. MITPress, Cambridge (1999) 14. Crammer, K., Singer, Y., Cristianini, N., Shawe-taylor, J., Williamson, B.: On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2, 265–292 (2001) 15. Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y.: Support vector machine learning for interdependent and structured output spaces. In: ICML 2004: Proc. of the twenty-first Intl. Conf. on Machine learning, p. 104. ACM, New York (2004)
Evaluation of a Resource Allocating Network with Long Term Memory Using GPU
Bernardete Ribeiro¹,², Ricardo Quintas², and Noel Lopes²
¹ Department of Informatics Engineering, University of Coimbra, Portugal
² CISUC - Center for Informatics and Systems of University of Coimbra, Portugal
Abstract. Incremental learning has recently received broad attention in many applications of pattern recognition and data mining. With many typical incremental learning situations in the real world where a fast response to changing data is necessary, developing a parallel implementation (on fast processing units) will have a great impact on many applications. Current research on incremental learning methods employs a modified version of a resource allocating network (RAN), which is one variation of a radial basis function network (RBFN). This paper evaluates the impact of a Graphics Processing Unit (GPU) based implementation of a RAN network incorporating Long Term Memory (LTM) [4]. The incremental learning algorithm is compared with the batch RBF approach in terms of accuracy and computational cost, both in sequential and GPU implementations. The UCI machine learning benchmark datasets and a real-world problem of multimedia forgery detection were considered in the experiments. The preliminary evaluation shows that although the creation of the model is faster with the RBF algorithm, the RAN-LTM can be useful in environments that need fast-changing models and high-dimensional data. Keywords: Incremental Learning, GPU Computing.
1
Introduction
The amount of data available on the Internet appears to be growing exponentially with time. In addition, the complexity of data created by non-stationary underlying processes poses many challenges in the machine learning area. To extract relevant information, humans need help from methods based on incremental learning, where neural networks can be optimal in many application domains. The most promising strategy for incremental learning is the memory-based learning approach, where almost all training samples are stored in memory and then used in each learning step [5]. In many incremental learning problems this strategy is impractical because the number of training samples is not known in advance. To overcome this limitation, a Resource Allocating Network with Long-Term Memory (RAN-LTM) has been proposed in [4]. In RAN-LTM, not only training data but also memory items stored in the long-term memory are trained. In some of the tasks involved, computation can be rather intensive and
time consuming. With the release of friendly frameworks to program Graphics Processing Units (GPUs), many applications that needed high processing power found new ways to speed up their execution. One field that has greatly benefited from this technical progress is machine learning. CUDA (Compute Unified Device Architecture) and its C-like language interface have enabled parallel implementations of neural network algorithms, easing computation that is heavily data-dependent. In this study, we compare two different learning strategies, batch and incremental learning, exploiting the high-performance SIMD architecture of GPU computing. For testing, we ran the experiments using the UCI machine learning repository and high-dimensional data from a real-world problem of audio steganalysis. In this problem the aim is to detect hidden messages embedded in audio WAV files. While traditional methods commonly build a static steganalysis model unable to adapt to new behavior patterns, adaptive detection models with self-learning ability dynamically update to new, changing data. The results have shown that the GPU-based RAN-LTM reduces the computational cost of audio forgery detection. The paper is organized as follows. Section 2 describes the incremental learning with long-term memory algorithm (RAN-LTM) and briefly presents the tailored kernels needed for GPU computing. In Section 3 we introduce the experimental setup. The results are discussed and analyzed in this section, taking into account each algorithm on both platforms, CPU and GPU. Finally, Section 4 summarizes the conclusions and points out lines for future work.
2
Incremental Learning
Incremental learning is an important technique, especially in today's environments where a fast response to changing data is necessary. One algorithm that follows the incremental learning model and uses RBF units in its hidden layer is the Resource Allocating Network (RAN) [5]. The network learns by allocating new units and adjusting the parameters of existing units. If the network performs poorly on a presented pattern, then a new unit is allocated to correct its response. Thus, the units in this network respond only to a local region of the space of input values. One variation of this algorithm has been investigated [6,4] to prevent catastrophic interference [2], which occurs when new training disrupts existing memory. A Long-Term Memory (LTM) is then added, which has proven to perform well in incremental learning environments. The RAN-LTM algorithm automatically allocates RBF units in the hidden layer in an online fashion. The long-term memory also stores samples from the training data, to perform the update of the weights without losing generalization capabilities. The samples stored in the LTM are called memory items, and correspond to input-output pairs of the training data [6].
Fig. 1. RAN-LTM network architecture with (I = 4, J = 5, K = 2)
2.1
Resource Allocating Network with Long Term Memory
We follow the notation given in [6]. Let us denote the input vector x = {x1, x2, . . . , xI}T, the vector of RBF outputs y = {y1, y2, . . . , yJ}T and the network output vector z = {z1, z2, . . . , zK}T, respectively, for I inputs, J RBF outputs, and K network outputs. The RAN-LTM proceeds as follows:

yj = exp( −||x − cj||² / σj² )   (j = 1, · · · , J)                      (1)

zk = Σ_{j=1}^{J} wkj yj + bk     (k = 1, · · · , K)                      (2)
where cj = {cj1, . . . , cjI}T and σj are, respectively, the center and the width of the jth RBF, wkj is the connection weight from the jth unit to the kth output, and bk is the bias of output k. The items in the LTM (see Figure 1) correspond to representative input-output pairs selected from the training data. The procedure retrieves these pairs when learning new training data in order to suppress catastrophic interference. The training of the RAN-LTM network is divided into two phases: (i) the allocation of RBFs and (ii) the calculation of the weights. The weight calculation W = {wjk} is similar to the standard RBFN except that instead of the complete target training vector t, only the targets T from the training samples stored in the LTM and the target d of the sample being trained are used. Therefore, to
minimize the errors one needs to solve ΦW = Z, where Z is the matrix whose column vectors correspond to the target of the sample being trained and the targets of the M stored memory items. In order to solve for W, Singular Value Decomposition (SVD) is used. To calculate the widths we use the same heuristic rule as in [6]. Initially a maximum value σmax is set using the training data x and targets d:

σmax = median_i { min_j (||xi − xj||) }   for di ≠ dj                    (3)

Subsequently, the width updates are performed whenever a new RBF unit J is added, and adjusted as follows:

σJ = min { min_j (||cJ − cj||), σmax }                                   (4)

σj = min( ||cJ − cj||, σj )   (j = 1, ..., J − 1)                        (5)
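The two ingredients just described -- the SVD-based solution of ΦW = Z and the width heuristic of Eqs. (3)-(5) -- can be written compactly with NumPy; the following is our illustrative sketch, not the CUDA implementation evaluated later.

```python
import numpy as np

def solve_weights(phi, z):
    """Least-squares solution of Phi W = Z via the SVD-based pseudoinverse."""
    return np.linalg.pinv(phi) @ z           # np.linalg.pinv uses SVD internally

def sigma_max(x, d):
    """Eq. (3): median, over samples, of the distance to the nearest sample
    with a different target (O(n^2) loop for clarity)."""
    nearest = []
    for i in range(len(x)):
        dists = [np.linalg.norm(x[i] - x[j])
                 for j in range(len(x)) if not np.array_equal(d[i], d[j])]
        if dists:
            nearest.append(min(dists))
    return float(np.median(nearest))

def update_widths(centers, sigmas, s_max):
    """Eqs. (4)-(5): width of the newly added center (last row of `centers`)
    and shrinking of the existing widths."""
    c_new, c_old = centers[-1], centers[:-1]
    if len(c_old):
        dist = np.linalg.norm(c_old - c_new, axis=1)
        sigma_new = min(dist.min(), s_max)    # Eq. (4)
        sigmas = np.minimum(sigmas, dist)     # Eq. (5) for j = 1..J-1
    else:
        sigma_new = s_max
    return np.append(sigmas, sigma_new)
```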
Algorithm 1 describes the method used to implement a Resource Allocating Network with Long Term Memory (RAN-LTM) [5].

Algorithm 1. RAN-LTM
for all xi ∈ X do
    z ← Σj wj ϕj(||xi − cj||) + b
    k ← argminj D(cj, xi)
    E ← d − z
    if E > ε and ||xi − ck|| > γ then
        Allocate new unit: cnew ← xi, wnew ← E
        Update widths
        Store memory item: Inew ← xi, Tnew ← d
        Increment number of memory items: M ← M + 1
    else
        Update weights using memory items
        z ← Σj wj ϕj(||xi − cj||) + b
        k ← argminj D(cj, xi)
        E ← d − z
        if E > ε then
            Allocate new unit: cnew ← xi, wnew ← E
            Update widths
            Store memory item: Inew ← xi, Tnew ← d
            Increment number of memory items: M ← M + 1
        end if
    end if
end for
2.2
Parallel Implementation
One of the most important (and basic) units in a CUDA program is the thread; threads are executed in kernels. To parallelize the RBFN and RAN-LTM algorithms, the main task was to define and implement the kernels for each algorithm.
Evaluation of a Resource Allocating Network with Long Term Memory
45
RBFN Network Kernels
1. KernelActivationMatrix. Calculates the activation between the samples and the hidden units. The threads are organized as in the standard matrix multiplication, one thread for each element of the matrix.
2. Weights calculation. Creates the pseudoinverse of the activation matrix using the CULATools SVD, and performs multiplications to obtain the final matrix with CUBLAS routines.
3. Adjust Widths. Calculates distances between all centers with one thread for each element of the matrix, then applies the RNeighbours algorithm with one thread for each row and stores the result in an array as the new width values.

KMeans. The implementation of KMeans on CUDA is depicted in Figure 2. The following kernels perform the necessary computations for finding the centers.
1. KernelEuclidianDistance. Calculates the Euclidean distance between two matrices. The threads are organized as in the standard matrix multiplication, one thread for each element of the matrix.
2. KernelCenterAttribution. Creates N threads, where N is the number of samples, one thread for each row. The index of the minimum value in the row corresponds to the nearest center.
3. KernelPrepareCenterCopy. Finds the assigned points for each center, with one thread for each center.
4. KernelCopyCenters. Averages all points attributed to a center and replaces the old centers.
5. KernelReduce. Compares the assignment array with the previous iteration. Uses a reduction pattern.

2.3
RAN-LTM Kernels
The kernels for the RAN-LTM algorithm have been carefully customized to support the GPU implementation.
1. KernelCalculateNetworkActivation. Computes the activation of one center for a given sample, one thread per center.
2. KernelSumActivations. Sums up the results of all center activations.
3. KernelFindNearestCenter. Finds the nearest center to a given sample.
4. KernelCalculateError. Calculates the error between the target and the sum of the network activations.
5. KernelUpdateWidths. Updates the widths, one thread for each unit.

The kernel for calculating the weights is similar to the one implemented for the RBFN algorithm. The RAN-LTM algorithm is able to use several parallel constructs; however, there is a large amount of data transfer from the host to the GPU card for each sample presented. Another issue is that both the error and the minimum distance to the centers must be passed back to the host, in order to decide how the algorithm proceeds.
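On the CPU side, the per-element parallelism of KernelActivationMatrix and of the K-means distance/assignment kernels corresponds to dense matrix operations; a vectorized NumPy equivalent (our sketch, not the CUDA code) is:

```python
import numpy as np

def activation_matrix(x, centers, sigmas):
    """RBF activation of every sample (rows of x) for every hidden unit;
    on the GPU one thread computes one element of this matrix."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances
    return np.exp(-d2 / sigmas ** 2)

def assign_to_centers(x, centers):
    """K-means assignment step: index of the nearest center for every sample."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```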
Fig. 2. KMeans on GPU
3
Experimental Results and Discussion
The datasets, the hardware platforms, and the performance metrics are described followed by the discussion of the results. 3.1
Experimental Setup
We ran the algorithms with the UCI [1] machine learning benchmarks downloaded from http://archive.ics.uci.edu/ml/. The datasets in Table 1 were chosen for comparison with the results in [6] with respect to accuracy, while ensuring the correctness and performance of our algorithm implementations.

Table 1. Configuration parameters for RBFN and RAN-LTM models in UCI data

                 UCI data                       RBFN                        RAN-LTM
Dataset          Samples   Features   Class     Network Size   RNeighbors   Accuracy   Distance
Satellite        6435      36         6         150            4            0.35       5
BreastCancer     569       30         2         25             2            0.5        5
Vehicle          846       18         4         30             2            0.4        5
Vowel-context    990       13         11        35             3            0.3        3
CMC              1473      9          3         10             2            0.5        10
Iris             150       4          3         6              1            0.4        5
Regarding the case study of audio steganalysis, aiming to detect and recover hidden messages from tampered media, the datasets have been arranged as follows [3]. The original medium (cover) has been imperceptibly modified to embed
Table 2. WAV audio signal datasets: (cover, class 2) and (stego, class 1)

Filename           ID   Samples   Hiding Algorithm     Class
cover6000mono      —    4390      —                    2
hide4pgp25mono     1    6000      Hide4PGP V4.0        1
invislbe50stero    2    4886      Invisible Secrets    1
lsbmatching50      3    6000      LSB matching         1
steghide1005       4    1003      Steghide             1
steghide993        5    993       Steghide             1
encrypted messages by using a shared key, and the receiver can extract and decrypt messages from the modified carriers (steganograms). In [3] feature extraction is performed and audio steganograms are created by several signal processing techniques. A total of 58 features were extracted and stored in 5 files, one for the cover class and the remaining 4 for the stego class. The data set contains 6000 mono 44.1-kHz, 16-bit, uncompressed, PCM-coded WAV audio signal files, covering different types such as digital speech, on-line broadcasts in different languages (for instance, English, Chinese, Japanese, Korean, and Spanish), and music (jazz, rock, blues). Each audio file has a duration of 19 s. The stego-audio datasets have been built by hiding different messages in the audio signals. For hiding data, several algorithms (the tools Hide4PGP V4.0, Invisible Secrets, LSB matching and Steghide) were used. These datasets are summarized in Table 2. For each algorithm two platforms, CPU and GPU, were used. The testing setup consisted of two GPUs (each with 14 streaming multiprocessors (SM)), the NVIDIA GeForce GTX470 (448 cores, processor clock 1215 MHz) and the NVIDIA GeForce 9800GT (112 cores, processor clock 1500 MHz), and an Intel Core 2 Duo E8400 processor running at 3.0 GHz. The tests were done using the Ubuntu 9.04 operating system with the CUDA Toolkit 3.1 and CULATools 2.0 libraries. The performance metrics were calculated in terms of (i) classification performance (accuracy and F-measure) and (ii) processing times, given by the speedups attained. 3.2
Results and Discussion
For statistical significance we ran the algorithms 30 times and report the mean and standard deviation of the results. All datasets were scaled with z-score normalization. Normalization is an important data transformation, since it prevents attributes with initially larger ranges from outweighing attributes with initially smaller ranges. Table 3 shows the final classification accuracies of RBFN, RAN-LTM and RAN-LTM Tabuchi [6] on the benchmarks tested. In Table 4 the processing times for both the batch learning with RBFN and the incremental learning with RAN-LTM are presented for the UCI benchmarks. We observe that for both tasks in RBFN, namely, finding the centers and adjusting the network weights, the GPU takes
Table 3. Final classification accuracy and F-measure for the RBFN, RAN-LTM and RAN-LTM Tabuchi's models. The best accuracy is written in bold.

                 RBFN                     RAN-LTM                  RAN-LTM Tabuchi [6]
Dataset          Accuracy   F-measure     Accuracy   F-measure     Accuracy
Satellite        97         90            92         76            89.5
BreastCancer     96         96            94         94            96.2
Vehicle          86         71            82         65            76.3
Vowel-context    95         73            90         44            92
CMC              65         51            59         40            48.1
Iris             89         84            94         91            na
Table 4. Performance for both learning models (batch and incremental) on UCI data. Processing times in seconds, given as mean (standard deviation).

                 RBFN - Centers                            RBFN - Weights                            RAN-LTM
Dataset          CPU           9800GT       GTX470         CPU           9800GT       GTX470         CPU              9800GT           GTX470
Satellite        17.35 (2.60)  4.68 (0.53)  0.67 (0.09)    19.45 (2.90)  9.57 (0.04)  8.37 (0.05)    656.98 (117.12)  884.05 (101.34)  803.95 (81.99)
Breastcancer     0.37 (0.01)   0.07 (0.01)  0.02 (0.00)    0.05 (0.00)   0.05 (0.00)  0.04 (0.00)    2.19 (0.26)      6.93 (0.58)      5.36 (0.34)
Vehicle          0.39 (0.03)   0.10 (0.01)  0.03 (0.00)    0.11 (0.00)   0.09 (0.00)  0.09 (0.00)    59.10 (7.22)     95.39 (7.68)     88.73 (6.43)
Vowel-context    0.39 (0.02)   0.11 (0.01)  0.03 (0.00)    0.17 (0.01)   0.16 (0.01)  0.14 (0.01)    13.25 (1.23)     57.75 (3.98)     52.63 (3.54)
CMC              0.34 (0.02)   0.15 (0.03)  0.05 (0.01)    0.03 (0.00)   0.13 (0.01)  0.13 (0.01)    167.99 (31.58)   205.07 (16.85)   178.84 (23.51)
Iris             0.27 (0.02)   0.02 (0.00)  0.01 (0.00)    0.00 (0.02)   0.01 (0.00)  0.01 (0.00)    0.81 (0.03)      1.10 (0.08)      0.90 (0.07)
advantage over the CPU by 44% on the 9800GT device and by around 56% on the GTX470. Notice that these improvements in processing time are averaged over those two tasks. Meanwhile, in the RAN-LTM the times are slightly worse, since these data sets are too small. In Figure 3 we can see that in both algorithms, for smaller network sizes, the CPU presents better results. However, with the increase in the network size, the GPU starts to gain an edge over the CPU, until finally surpassing it performance-wise. Likewise, comparing the RBFN and RAN-LTM we can observe that in case of an update to the model, the RBFN algorithm would have to rebuild the whole network, taking much more time than the RAN-LTM. Using the GPU for a network size of 100, the RBFN would take approximately 4 seconds, while the RAN-LTM would take a fraction of this time, about 0.045 seconds. Moreover, the classifier performance is competitive for the cases tested. We present the performance of the RAN-LTM in a real-world application of audio steganalysis. Steganography is the art of concealed writing, where information can be hidden in unsuspected sources, like images, video and audio. We applied our algorithm to the detection of hidden messages in audio files.
Fig. 3. Processing times for different network sizes: (a) RBFN, (b) RAN-LTM
Fig. 4. Processing times for RAN-LTM in WAV data files
The results showed competitive accuracies compared to other algorithms [3], while attaining speedups of up to 15× with the CUDA implementation, as illustrated in Figure 4. The advantage is that the approach may be useful in rapidly changing environments.
4
Conclusions and Future Work
We have implemented both the batch (RBFN) and the incremental learning with long-term memory (RAN-LTM) algorithms using the GPU graphics card. By exploiting the multi-thread capability of multi-core processors, our approach
has been tested with data sets from the UCI benchmarks and with a real-world data set for multimedia forgery detection. The GPU-based RBFN batch algorithm performed better for the smaller benchmark datasets, while for the larger (and more difficult) audio steganalysis data, the RAN-LTM parallel version yielded higher speedups than its sequential counterpart. The performances (for all cases tested) were statistically competitive with the results in the literature. Although the creation of the model is faster with the RBFN algorithm, the RAN-LTM can be useful in environments that need fast-changing models and high-dimensional data. Future work will further optimize the implementation towards better GPU support.
References
1. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
2. Carpenter, G.A., Grossberg, S.: The ART of adaptive pattern recognition by a self-organizing neural network. IEEE Computer 21, 77–88 (1988)
3. Liu, Q., Sung, A.H., Qiao, M.: Temporal derivative-based spectrum and mel-cepstrum audio steganalysis. IEEE Transactions on Information Forensics and Security 4(3), 359–368 (2009)
4. Okamoto, K., Ozawa, S., Abe, S.: A fast incremental learning algorithm of RBF networks with long-term memory. In: IJCNN 2003: Proc. of the International Joint Conference on Neural Networks, vol. 1, pp. 102–107. IEEE Computer Society, Los Alamitos (2003)
5. Platt, J.: A resource-allocating network for function interpolation. Neural Computation 3(2), 213–225 (1991)
6. Tabuchi, T., Ozawa, S., Roy, A.: An autonomous learning algorithm of resource allocating network. In: Corchado, E., Yin, H. (eds.) IDEAL 2009. LNCS, vol. 5788, pp. 134–141. Springer, Heidelberg (2009)
Gabor Descriptors for Aerial Image Classification
Vladimir Risojević, Snježana Momić, and Zdenka Babić
Faculty of Electrical Engineering, University of Banja Luka, Bosnia and Herzegovina
[email protected], {vlado,zdenka}@etfbl.net
Abstract. The amount of remotely sensed imagery that has become available by far surpasses the possibility of manual analysis. One of the most important tasks in the analysis of remotely sensed images is land use classification. This task can be recast as semantic classification of remotely sensed images. In this paper we evaluate classifiers for semantic classification of aerial images. The evaluated classifiers are based on Gabor and Gist descriptors, which have long been established in image classification tasks. We use support vector machines and propose a kernel well suited for use with Gabor descriptors. These simple classifiers achieve a correct classification rate of about 90% on two datasets. From these results it follows that, in aerial image classification, simple classifiers give results comparable to more complex approaches, and the pursuit of more advanced solutions should continue with this in mind. Keywords: Aerial image classification, Gabor filters, Gist descriptor.
1
Introduction
There is a constantly increasing number of instruments for remote sensing of the Earth. Consequently, many databases of remotely sensed data are being flooded with data. At the moment, images dominate these databases, both in variety and quantity. Remote sensing imaging of the Earth is done by a variety of airborne and space-borne imagers in various spectral bands, ranging from the visible spectrum to microwave [8]. There are many applications of remote sensing imaging, both military and civilian. Civilian applications include land use planning, weather forecasting, studying long-term climate changes, crop monitoring, studying deforestation, city planning, and many others. These applications require the development of effective means for acquisition, processing, transmission, storage, retrieval, and analysis of images. One of the key problems in aerial image analysis is the problem of semantic classification. This problem is closely related to the task of land use monitoring, which is necessary for control of environmental quality as well as for maintaining and improving living conditions and standards. The holy grail of automatic land use classification is pixel-level semantic segmentation of remotely sensed images.
The result of a pixel-level segmentation is a thematic map in which each pixel is assigned a predefined label from a finite set. However, remote sensing images are often multispectral and of high resolution, which makes detailed semantic segmentation an excessively computationally demanding task. This is the reason why some researchers decided to classify image blocks instead of individual pixels. We also adopt this approach and evaluate classifiers based on state-of-the-art image descriptors and support vector machines, which have shown good results in image classification tasks, on the task of aerial image classification. The contribution of this paper is the evaluation of Gabor and Gist descriptors for the task of aerial image classification. For the classifier based on Gabor descriptors we propose a kernel based on the distance function proposed for Gabor descriptors. In the experiments we show that the classifier based on Gabor descriptors yields similar or better performance compared to the Gist-descriptor-based classifier, despite the lower dimensionality of the former. We also show that these simple classifiers yield classification performance which is better than or comparable with some more complicated classifiers using more features. The paper is organized as follows. In Section 2 we briefly review previous related work. The image representation and classifier are described in Section 3, and experimental results are given in Section 4. In Section 5 we conclude and give ideas for future research.
2
Related Work
There has been a long history of using computer vision techniques for classification of aerial and satellite images. We briefly review here some of the methods that are relevant to our work. Ma and Manjunath [3] use Gabor descriptors for representing aerial images. Their work is centered around efficient content-based retrieval from a database of aerial images, and they did not try to automatically classify images into semantic categories. Parulekar et al. [7] classify satellite images into four semantic categories in order to enable fast and accurate browsing of the image database. Fauqueur et al. [2] classify aerial images based on color, texture and structure features. The authors tested their algorithm on a dataset of 1040 aerial images from 8 categories. In a more recent work [6], Ozdemir and Aksoy use a bag-of-words model and frequent subgraph mining to construct higher-level features for satellite image classification. The algorithm is tested on a dataset of 585 images classified into 8 semantic categories. Our work is in a similar vein, but rather than trying to construct semantic features for image classification we focus on low-level features and aerial images. Despite the wide use of the Gist descriptor [5] in general-purpose image classification, to the best of our knowledge there are not many examples of aerial image classification using the Gist descriptor. An exception is the work on tree detection by Yang et al. [10], where Gist is used for clustering of images prior to the detection phase.
3
Image Representation and Classifier
In this paper we evaluate two image descriptors, both based on Gabor filters. There is a long tradition of using Gabor descriptors in computer vision and image processing, dating back to Daugman [1] who noted similarity between low level processing in biological vision and Gabor filter banks. Subsequently, Gabor descriptors have been used for various tasks including texture segmentation, image recognition, iris recognition, registration, and motion tracking. In the context of image classification the most notable are its uses for texture classification and retrieval, pioneered by Manjunath and Ma [4], and, more recently, for scene classification using Gist descriptor, as proposed by Oliva and Torralba [5]. 3.1
Gabor Descriptor
The Gabor descriptor for an image is computed by passing the image through a filter bank of Gabor filters. A Gabor filter is a linear band-pass filter whose impulse response is defined as a Gaussian function modulated with a complex sinusoid,

g(x, y) = 1/(2π σx σy) · exp[ −(1/2)( x²/σx² + y²/σy² ) + 2πjΩx ] ,      (1)

where Ω is the frequency of the Gabor function, and σx and σy determine its bandwidth. Gabor showed that these functions are optimal in the sense of minimizing the joint two-dimensional uncertainty in space and frequency [1]. Impulse responses of the filters in a Gabor filter bank are dilated (scaled) and rotated versions of the function (1). Filters in a Gabor filter bank can be considered as edge detectors with tunable orientation and scale, so that information on texture can be derived from statistics of the outputs of those filters [4]. We can consider (1) as a mother Gabor wavelet, and the functions obtained by its dilations and rotations are Gabor wavelets. For a given image, I(x, y), (x, y) ∈ Ψ (Ψ is the set of image points), the output of a Gabor filter bank is actually the Gabor wavelet transform of that image, which can be written as

Wmn(x, y) = ∫∫_Ψ I(x1, y1) g*mn(x − x1, y − y1) dx1 dy1 ,                (2)
where gmn(x, y) are Gabor wavelets at scale m and orientation n, obtained from (1), and the asterisk denotes complex conjugation. Assuming that image regions have homogeneous texture, the means μmn and standard deviations σmn of the transform coefficients are used to represent the texture of the region:

μmn = ∫∫_Ψ |Wmn(x, y)| dx dy ,                                           (3)

σmn = ( ∫∫_Ψ ( |Wmn(x, y)| − μmn )² dx dy )^{1/2} .                      (4)
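For illustration, Eqs. (1)-(4) can be realized by generating rotated and dilated Gabor kernels and collecting the mean and standard deviation of the response magnitudes; the scale parameterization below is an assumption made for the sketch, not the exact setting used in the experiments.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(omega, theta, sigma_x, sigma_y, size=31):
    """Eq. (1) rotated by theta: Gaussian envelope times a complex sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)        # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * (xr**2 / sigma_x**2 + yr**2 / sigma_y**2))
    carrier = np.exp(2j * np.pi * omega * xr)
    return envelope * carrier / (2 * np.pi * sigma_x * sigma_y)

def gabor_statistics(image, scales=8, orientations=8):
    """Filter the image (Eq. 2) and keep (mu, sigma) per filter (Eqs. 3-4)."""
    feats = []
    for m in range(scales):
        omega, sigma = 0.4 / (2 ** m), 2.0 * (2 ** m)  # dilation per scale (assumed)
        for n in range(orientations):
            theta = n * np.pi / orientations
            w = fftconvolve(image, gabor_kernel(omega, theta, sigma, sigma), mode='same')
            mag = np.abs(w)
            feats += [mag.mean(), mag.std()]
    return np.array(feats)                             # length 2 * S * K
```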
Gabor descriptor is now formed as a vector of means and standard deviations of filter responses
x = [ μ00 σ00 μ01 σ01 · · · μ(S−1)(K−1) σ(S−1)(K−1) ] ,                  (5)

where S is the total number of scales, and K is the total number of orientations. These values are typically set heuristically, through cross-validation. In [4] a distance metric based on the weighted L1-norm is proposed for computing the dissimilarity between textures:

d(xi, xj) = Σ_m Σ_n dmn(xi, xj) ,                                        (6)

where

dmn(xi, xj) = |μmn^(i) − μmn^(j)| / α(μmn) + |σmn^(i) − σmn^(j)| / α(σmn) ,   (7)
and α (μmn ) and α (σmn ) are the standard deviations of the respective features over the entire database. 3.2
Gist Descriptor
Oliva and Torralba proposed the Gist descriptor [5] to represent the spatial envelope of the scene. The spatial envelope is a set of holistic scene properties which can be used for inferring the semantic category of the scene, without the need for recognition of the objects in the scene. The Gist descriptor of an image is computed by first filtering the image with a filter bank of Gabor filters, and then averaging the responses of the filters in each block of a 4 × 4 nonoverlapping grid. Comparing this descriptor to the Gabor descriptor, we see that the Gist descriptor is essentially a spatial layout of textures. Note that here standard deviations of the distribution of filter responses are not used. Despite its simplicity this descriptor shows very good results in natural scene classification tasks. 3.3
Classifier
As a classifier we use a support vector machine (SVM). Since distances between Gabor descriptors are computed using (6), we construct a kernel function starting from this metric as

K(xi, xj) = exp[ −d(xi, xj) ] ,                                          (8)

where d(xi, xj) is given by (6). This kernel function is essentially based on the weighted L1-norm, and it satisfies the Mercer condition [9]. For the Gist descriptor we follow the approach in [5] and use an SVM with a radial basis function kernel. We construct a multi-class classifier using N (corresponding to the number of categories) one-vs-all SVMs and selecting the class with maximal SVM output.
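A sketch of the distance of Eqs. (6)-(7) and the kernel of Eq. (8) is given below; it assumes an SVM implementation that accepts a precomputed Gram matrix (e.g., scikit-learn's 'precomputed' kernel option, which is our choice for the illustration and not necessarily the exact setup used in the experiments).

```python
import numpy as np

def weighted_l1_kernel(features):
    """Gram matrix K(xi, xj) = exp(-d(xi, xj)) with the weighted L1 distance
    of Eqs. (6)-(7); `features` holds one Gabor descriptor per row."""
    alpha = features.std(axis=0)                 # per-feature std over the database
    alpha[alpha == 0] = 1.0                      # avoid division by zero
    scaled = features / alpha
    d = np.abs(scaled[:, None, :] - scaled[None, :, :]).sum(-1)
    return np.exp(-d)

# Example use with a precomputed-kernel SVM (scikit-learn-style API assumed):
# from sklearn.svm import SVC
# clf = SVC(kernel='precomputed').fit(weighted_l1_kernel(X_train), y_train)
```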
4
Datasets and Experimental Results
We tested the described image representations and classifier on two datasets. Both datasets consist of aerial images. The first dataset is our in-house dataset and contains images of the part of Banja Luka, Bosnia and Herzegovina. The second dataset contains images used previously for aerial image classification [2], and we include it here for comparison purposes. 4.1
In-House Dataset
For the evaluation of the classifiers we used a 4500×6000 pixel multispectral (RGB) aerial image of a part of Banja Luka, Bosnia and Herzegovina. In this image there is a variety of structures, both man-made, such as buildings, factories, and warehouses, and natural, such as fields, trees and rivers. We partitioned this image into 128×128 pixel tiles, and used a total of 606 images in our experiments. We manually classified all images into 6 categories, namely: houses, cemetery, industry, field, river, and trees. Examples of images from each class are shown in Fig. 1. It should be noted that the distribution of images over these categories is highly uneven, which can be observed from the bar graph in Fig. 2. In our experiments we used half of the images for training and the other half for testing. We compute Gabor descriptors at 8 scales and 8 orientations for all images from the dataset. We also tried other combinations of numbers of scales and orientations and chose the one with the best performance. Gabor descriptors, as proposed in [4], are computed for grayscale images. Since the images are multispectral, we compute the Gabor descriptor for all 3 spectral bands of an image and concatenate the obtained vectors, which yields 3 × 8 × 8 × 2 = 384-dimensional descriptors. For comparison purposes we also compute Gabor descriptors for grayscale (panchromatic) versions of the images, which are 8 × 8 × 2 = 128-dimensional. As for Gist descriptors, we obtained the best results with the default setup, i.e., a filter bank at 4 scales and 8 orientations. For this descriptor we also compute a grayscale variant, which is 4 × 8 × 16 = 512-dimensional, and a color variant, which results in a 3 × 4 × 8 × 16 = 1536-dimensional descriptor. For testing our classifiers we used 10-fold cross validation, each time with a different random partition of the dataset, and averaged the results. Average classification accuracies over all categories are given in Table 1. In the table, Gabor (full) denotes the Gabor descriptor as given in (5), while Gabor (mean) denotes the descriptor obtained using only the means of the filter-bank responses.

Table 1. Comparison of the classification accuracies for the in-house dataset

Descriptor      Panchromatic (grayscale) (%)   Multispectral (RGB) (%)
Gabor (full)    84.5                           88.0
Gabor (mean)    80.7                           84.5
Gist            79.5                           89.3
Fig. 1. Samples of images from all classes. From left to right, column-wise: houses, cemetery, industry, field, river, trees. (Best viewed in color.)
Fig. 2. Per category distribution of images in the in-house dataset
We see that the Gist descriptor computed for all spectral bands of an RGB image has the best performance, at the cost of the high dimensionality of the descriptor. It is worth noting that the much simpler Gabor descriptor, with 4 times lower dimensionality, yields similar performance. Even more interesting is the fact that for grayscale (panchromatic) images the Gabor descriptor outperforms Gist. From these results, it is obvious that classifiers benefit from information from various spectral bands. When grayscale images are considered, the standard deviations of the Gabor filter bank responses provide richer information about the texture of the image, hence its better performance. The importance of this information can be observed from the drop in performance when only the means of the Gabor filter bank responses are used. Another conclusion is that the spatial layout of filter bank
Fig. 3. Confusion matrix for the in-house dataset using Gabor (RGB) descriptor
Fig. 4. Confusion matrix for the in-house dataset using Gist (RGB) descriptor
responses does not have a beneficial influence on the performance of the aerial image classifier, as is the case with general scenes [5]. The confusion matrix for the Gabor descriptor is given in Figure 3. We note that confusions mainly arise between categories which can be difficult even for humans. The most notable examples are houses versus cemetery, because of rectangular structures with strong oriented edges, and river versus field, because both have homogeneous, smooth texture without pronounced edges. It is also important to note that there are not many confusions between natural (river, trees, field) and man-made categories (houses, cemetery, industry). The confusion matrix for the Gist descriptor is given in Fig. 4. The same observations we made for the confusion matrix for the Gabor descriptor are also valid here.
Table 2. Comparison of the classification accuracies for the Window on the UK dataset

Method                             Accuracy (%)
SVM with Gabor descriptor (RGB)    90.8
SVM with Gist descriptor (RGB)     87.1
Algorithm from [2]                 89.4
SVM with features from [2]         92.3
Fig. 5. Confusion matrix for Window on the UK dataset using Gabor descriptor
4.2
Window on the UK Dataset
For our second experiment we chose the Window on the UK dataset which was also used in [2]. This dataset consists of 1040 64 × 64 pixel aerial images, which are manually classified into the following 8 categories: building, road, river, field, grass, tree, boat, vehicle. There are 130 images per category, so the distribution of images into categories in this dataset is uniform, in contrast to our in-house dataset. The authors of [2] also proposed a split into training and test sets of 520 images each. For images from this dataset we computed the Gabor descriptor at 8 scales and 8 orientations, as well as the Gist descriptor, and then trained a multi-class classifier as described previously. In Table 2 we give the comparison of classification accuracies for this dataset. Again, the Gabor and Gist descriptors result in comparable performances, this time with some advantage on the side of the Gabor descriptors. This supports our previous findings about the descriptive power of these two descriptors. Moreover, we can see that the performance of our classifier with Gabor descriptors is better than the performance of the algorithm proposed in [2], and only slightly worse than the performance of the SVM classifier trained with features from [2].
The confusion matrix for Gabor descriptor is shown in Fig. 5. We can see that common misclassifications again occur in cases that can also potentially confuse human subjects, such as building versus vehicle and field versus grass. It is important to note that, in this case too, misclassifications rarely occur between natural and man-made categories.
5
Conclusion
In this paper we evaluate two image descriptors, namely Gabor and Gist descriptors, and show that classifiers based on these descriptors give results comparable to or better than more complex approaches. Both descriptors have previously shown good results in texture and image classification tasks. As a classifier we use an SVM with the standard radial basis function kernel, as well as a kernel constructed using a metric function proposed for comparing Gabor descriptors. We show that, for multispectral images, lower-dimensional Gabor descriptors show similar or better performance than Gist, while, for panchromatic images, Gabor descriptors outperform Gist. This is mainly due to the fact that spatial layout is not such a strong cue for semantic classification of aerial images, whose texture regions are rather spatially homogeneous. Also, Gabor descriptors use standard deviations of filter bank responses, and the richer representation that they provide is another reason for their better performance. Despite its simplicity, the classifier based on Gabor descriptors and SVMs with the weighted L1-norm kernel achieves better performance than more complex classifiers trained with color, texture and structural descriptors. This finding calls for a more thorough investigation of the descriptors used for aerial image classification, since it is possible that state-of-the-art descriptors in other application areas do not show better performance than simpler descriptors on the task at hand. Comparing the results of this paper with the literature, we also note that using multiple features does not guarantee better results. Therefore, another important research area, stemming from these results, is feature combination. Obviously, this question needs more elaborate studies that will show what features are needed to adequately represent aerial images, and how they should be combined. Also, the whole community would benefit from more manually annotated ground truth datasets which are publicly available, so that algorithms from various groups can be compared.
References 1. Daugman, J.G.: Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech and Signal Processing 36(7), 1169–1179 (1988) 2. Fauqueur, J., Kingsbury, N.G., Anderson, R.: Semantic discriminant mapping for classification and browsing of remote sensing textures and objects. In: Proceedings of IEEE International Conference on Image Processing (ICIP 2005), pp. 846–849 (2005)
3. Ma, W.Y., Manjunath, B.S.: A texture thesaurus for browsing large aerial photographs. Journal of the American Society for Information Science 49(7), 633–648 (1998) 4. Manjunath, B.S., Ma, W.Y.: Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern analysis and Machine Intelligence 18(8), 837– 842 (1996) 5. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42(3), 145–175 (2001) 6. Ozdemir, B., Aksoy, S.: Image classification using subgraph histogram representation. In: Proceedings of 20th IAPR International Conference on Pattern Recognition, Istanbul, Turkey (2010) 7. Parulekar, A., Datta, R., Li, J., Wang, J.Z.: Large-scale satellite image browsing using automatic semantic categorization and content-based retrieval. In: IEEE International Workshop on Semantic Knowledge in Computer Vision, in Conjunction with IEEE International Conference on Computer Vision, Beijing, China, pp. 1873–1880 (2005) 8. Ramapriyan, H.K.: Satellite imagery in earth science applications. In: Castelli, V., Bergman, L.D. (eds.) Image Databases, pp. 35–82. John Wiley & Sons, Inc., Chichester (2002) 9. Vapnik, V.: Statistical Learning Theory. John Wiley, Chichester (1998) 10. Yang, L., Wu, X., Praun, E., Ma, X.: Tree detection from aerial imagery. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS 2009, New York, NY, USA, pp. 131–137 (2009)
Text Representation in Multi-label Classification: Two New Input Representations

Rodrigo Alfaro¹,² and Héctor Allende¹,³

¹ Universidad Técnica Federico Santa María, Chile
² Pontificia Universidad Católica de Valparaíso, Chile
³ Universidad Adolfo Ibáñez, Chile
[email protected], [email protected]

Abstract. Automatic text classification is the task of assigning unseen documents to a predefined set of classes. Text representation for classification purposes has been traditionally approached using a vector space model due to its simplicity and good performance. On the other hand, multi-label automatic text classification has been typically addressed either by transforming the problem under study to apply binary techniques or by adapting binary algorithms to work with multiple labels. In this paper we present two new representations for text documents based on label-dependent term-weighting for multi-label classification. We focus on modifying the input. Performance was tested with a well-known dataset and compared to alternative techniques. Experimental results based on Hamming loss analysis show an improvement against alternative approaches.

Keywords: Multi-label text classification, text modelling, problem transformation.
1 Introduction
Large amounts of text documents available in digital format on the web contain useful information for a wide variety of purposes. The amount of digital text is expected to increase significantly in the near future; thus, the need for the development of data analysis solutions becomes urgent. Text classification (or categorisation) is defined as the assignment of a Boolean value to each pair (dj, ci) ∈ D × C, where D is the domain of documents and C = {c1, ..., c|C|} is the set of predefined labels [12]. Binary classification (BC) is the simplest and most widely studied case. In BC, a document is classified into one of two mutually exclusive classes. BC can be extended to solve multi-class problems. Moreover, if a document is categorised with either one label or multiple labels at once, it is called a single-label or multi-label problem, respectively [12]. Tsoumakas and Katakis [14] present a formal description of multi-label methods. In [14], L = {λj : j = 1 . . . l}, where λj corresponds to the j-th label, is the finite set of labels in a multi-label learning task, and D = {(xi, Yi); i = 1 . . . m}
denotes a set of multi-label training data, where xi is the feature vector and Yi ⊆ L is the set of labels of the i-th example. Methods for solving this problem are grouped into two types, namely, problem transformation and algorithm adaptation. The first type of method is algorithm-independent; it transforms the multi-label learning task into one or more single-label classification tasks. Thus, this type of method can be implemented using efficient binary algorithms. The most common problem transformation method (PT4) learns |L| binary classifiers Hl : X → {l, ¬l}, one for each different label l in L. PT4 transforms the original data set into |L| data sets Dl, l = 1 . . . |L|. Each Dl labels every example in D with l if l is contained in the example's label set, or with ¬l otherwise. PT4 yields the same solution for both the single-label and multi-class problems using a binary classifier. For the classification of a new instance x, this method generates a set of labels as the union of the labels generated by the |L| classifiers, HPT4(x) = ∪_{l∈L} {l} : Hl(x) = l. The second type of method extends specific learning algorithms for handling multi-label data directly. These extensions are achieved by adjustments such as modifications to classical formulations from statistics or information theory. The pre-processing of documents for better representation can also be grouped in this type. Multi-label classification is an important problem for real applications, as can be observed in many domains, such as functional genomics, text categorisation, music mining and image classification. The purpose of this paper is to present a new representation for documents based on label-dependent term-weighting. Lan et al. [6] propose the tf−rf representation for two-class, single-label classification problems. Our representation is a generalisation of tf−rf applied to multi-label classification problems. This paper is organised as follows. In section 2, we briefly introduce multi-label text classification. In section 3, we analyse text representation. Our proposal for two new methods of representation is illustrated in section 4. In section 5, we compare the performance of our proposal with other algorithms. The last section is devoted to concluding remarks.
2 Multi-label Text Classification
The automatic classification of multi-label text has not been thoroughly addressed in the existing literature. Although many multi-label datasets are available, most of the techniques for automatic text classification consider them only as single-label datasets. One of the first approaches developed was BoosTexter, an algorithm based on boosting for the multi-label case [11]. This algorithm adjusts the weights of training examples and their labels in the training phase; labels that are hard (easy) to predict correctly get incrementally higher (lower) weights. Among the proposals presented in [14], problem transformation is the most widely used. However, the automatic classification of multi-label text has also been addressed by algorithms that directly capture the characteristics of the multi-label problem. Zhang and Zhou, for example, solved the multi-label problem using Backpropagation for Multilabel Learning (Bp-MLL), using artificial
neural networks with multiple outputs. Bp-MLL is derived from Backpropagation by employing a novel error function capturing the characteristics of multi-label learning [16]. Regardless of the solution approaches to the multi-label problem and the algorithms that solve it, according to Joachims [4], any text classification task has complexities due to the high-dimensional feature space, a heterogeneous use of terms, and a high level of redundancy. Multi-label problems have additional complexities, including a large number of tags per document. These characteristics of a multi-label problem require different methods of evaluation than those used in traditional single-label problems.
3 Problem Representation
The performance of a reasoning system depends heavily on problem representation. The same task may be easy or difficult, depending on the way it is described [3]. The explicit representation of relevant information enhances machine performance. Also, a more complex representation may work better with simpler algorithms. Document representation has a high impact on the task of classification [5]. Some elements used for representing documents include N-grams, single words, phrases, or logical terms and statements. The vector space model is one of the most widely used models for ad-hoc information retrieval, mainly because of its conceptual simplicity and the appeal of its underlying metaphor of using spatial proximity for semantic proximity [9]. Space representation can be conceived as a kernel representation. Kernel methods are an approach for solving machine learning problems. Joachims was among the first authors to use kernel-based methods to categorise text [4]. Cristianini et al. utilised the kernel-based approach for representing the vector space model and latent semantic indexing [2]. Similarly, Tsivtsivadze et al. established a mapping of input data into a feature space by means of a kernel function and then used learning algorithms to discover relationships in that space [13]. In the vector space model (VSM), the contents of a document are represented by a vector in the term space d = {w1, . . . , wk}, where k is the size of the term (or feature) set. Terms may be measured at several levels, such as syllables, words, phrases, or any other semantic and/or syntactic unit used to identify the content of a text. Different terms have different importance within a text, and thus the relevance indicator wi (usually between 0 and 1) represents how much the term ti contributes to the semantics of the document d. To weight terms in the vector space model, the frequency of occurrence of a word in the document can be used as the term weight. However, there are more effective methods for term-weighting. The basic information used to derive term weights is term frequency, document frequency, or sometimes collection frequency. There are different mappings of text to input space across different text classifications. Leopold and Kindermann, for example, combine mappings with different kernel functions in support vector machines [8]. According to Lan et al.
Table 1. Variables utilized in term-weighting in a multi-label problem for a term t with |L| labels

              t            ¬t
label_1       a_{t,λ1}     d_{t,λ1}
label_λj      a_{t,λj}     d_{t,λj}
label_|L|     a_{t,λ|L|}   d_{t,λ|L|}
[7], two important decisions for choosing a representation based on the VSM are the following. First, what should constitute a term? For example, should it be a subword, word, multi-word or meaning? Second, how should a term be weighted? Term-weighting can be a binary function, the term frequency–inverse document frequency (tf−idf) developed by Salton and Buckley [10], or based on feature selection metrics such as χ², information gain (IG), or gain ratio (GR). Term-weighting methods improve the effectiveness of text classification by assigning appropriate weights to terms. Although text classification has been studied for several decades, term-weighting methods for text classification are usually borrowed from the traditional information retrieval (IR) field, including, for example, the Boolean model, tf−idf, and its variants. Table 1 shows the variables that we will consider in a term-weighting method for multi-label problems, where a_{t,λj} is the number of documents in the class λj containing the term t and d_{t,λj} is the number of documents in the class λj that do not contain the term t.
3.1 Bag-of-Words Representation (tf−idf)
The most widely used document representation for text classification is tf−idf [12], where for a two-class problem (where label1 is class+ and label2 is class−) each component of the vector is computed as:

tf−idf_{td} = f_{t,d} · log_{10}(N / N_t) ,    (1)
where f_{t,d} is the frequency of term t in the document d, N = (a_{t,λ1} + d_{t,λ1} + a_{t,λ2} + d_{t,λ2}) is the number of documents, and N_t = (a_{t,λ1} + a_{t,λ2}) is the number of documents containing the term t.
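As a concrete illustration of Eq. (1), the following is a minimal sketch (not the authors' code) computing tf−idf weights from raw token lists; the toy corpus and function name are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights per Eq. (1): tf-idf_td = f_td * log10(N / N_t).

    docs: list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))  # N_t per term
    weights = []
    for doc in docs:
        tf = Counter(doc)  # f_td: raw term frequency in this document
        weights.append({t: f * math.log10(n_docs / doc_freq[t]) for t, f in tf.items()})
    return weights

docs = [["credit", "loan", "bank"], ["loan", "loan", "rate"], ["music", "rate"]]
print(tf_idf(docs))
```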
3.2 Relevance Frequency Representation (tf−rf)
Lan et al. [7] recently proposed tf−rf as an improved VSM representation for two-class, single-label problems (where label1 is class+ and label2 is class−):

tf−rf_{td} = f_{t,d} · log_2(2 + a_{t,λ1} / max(1, a_{t,λ2})) ,    (2)
where f_{t,d} is the frequency of term t in the document d, a_{t,λ1} is the number of documents in the positive class containing the term t, and a_{t,λ2} is the number of documents in the negative class containing the term t. The function max(1, a_{t,λ2}) in the denominator ensures that the term tf−rf_{td} is not undefined even if a_{t,λ2} is zero. According to [7], using this representation in different single-label data sets improves the performance of two-class based classifiers. For multi-class problems, [7] used a one-versus-all method. Note that the tf−rf representation is designed for single-label problems and does not consider the frequency information of the term in classes other than the one under evaluation. That is, it only considers the relationship between the appearance of the term in the class under evaluation (that is, positive) and in all the other classes (that is, negative).
4 Our Proposal for a New Representation of Multi-label Datasets
On the one hand, tf−idf as a representation of documents considers only the frequency of terms in the document (tf) and the frequency of terms in all documents (idf), disregarding the class or label to which the documents belong. On the other hand, tf−rf also considers the frequency of terms in the document (tf) and the frequency of terms in all documents of the class under evaluation (rf). That is, in tf−rf, each document is represented by a different vector when assessing whether it belongs to a particular class. From a theoretical point of view, this extension of the tf−rf representation of text changes the representation of a document according to the label under evaluation, thereby achieving larger differences between documents belonging to different labels and thus harnessing the performance of binary classifiers. Thus, important information about the frequency in other classes is used, especially when the frequency of the term shows sharp variations, as the example in Table 2 shows.

Table 2. Example of frequency of a term for each label
Label       1   2   3   4   5   6   7   8   9
Frequency  53  76  87  66  62  27  25  28  26
We propose the use of a centrality function in the μ-Relevance Frequency of a Label, tf−μrfl, defined over the frequency of a term for each label. It is derived from the term frequency and the relevance frequency of a given label; as such, it constitutes a new representation based on tf−rf for a multi-label problem:

tf−μrfl_{tdl} = f_{t,d} · log_2(2 + a_{t,l} / μ(a_{t,λj/l})) ,    (3)

where μ(a_{t,λj/l}) is a function over the set a_{t,λj/l} = {a_{t,λ1}, ..., a_{t,λl−1}, a_{t,λl+1}, ..., a_{t,λ|L|}}.
We will consider μ(a_{t,λj/l}) = max(1, mean(a_{t,λj/l})) for the tf−rfl representation and μ(a_{t,λj/l}) = max(1, median(a_{t,λj/l})) for the tf−rrfl representation. Such functions give centrality measures: the mean is a classical metric and the median is a robust metric.
4.1 Relevance Frequency of a Label
Relevance frequency of a label, tf−rfl, is derived from the μ-Relevance Frequency of a Label, tf−μrfl; as such, it constitutes a new representation for a multi-label problem:

tf−rfl_{tdl} = f_{t,d} · log_2(2 + a_{t,l} / max(1, mean(a_{t,λj/l}))) ,    (4)

In equation (4), the term mean(a_{t,λj/l}) is the average number of documents containing the term t over the labels other than l.
4.2 Robust Relevance Frequency of a Label
Robust relevance frequency of a label, tf−rrfl, is also derived from the μ-Relevance Frequency of a Label, tf−μrfl; as such, it is the second new representation for a multi-label problem:

tf−rrfl_{tdl} = f_{t,d} · log_2(2 + a_{t,l} / max(1, median(a_{t,λj/l}))) ,    (5)

The use of the median should yield more robust results in datasets containing large differences between the frequency of occurrence of a term in a given set of labels versus the other label sets under evaluation.
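To make Eqs. (4) and (5) concrete, here is a minimal sketch (not the authors' code) that computes the label-dependent weight of a single term for each label from the per-label document counts a_{t,λj}; the function and variable names are illustrative, and the counts reuse the example from Table 2.

```python
import math
from statistics import mean, median

def label_dependent_weight(f_td, counts, label, centrality=mean):
    """tf-rfl / tf-rrfl weight of a term for one label, per Eq. (4)/(5).

    f_td:       frequency of the term in document d
    counts:     dict {label: a_t_label}, documents of that label containing the term
    label:      the label l under evaluation
    centrality: statistics.mean for tf-rfl, statistics.median for tf-rrfl
    """
    others = [c for lab, c in counts.items() if lab != label]
    denom = max(1.0, centrality(others))          # mu(a_{t,lambda_j/l})
    return f_td * math.log2(2.0 + counts[label] / denom)

# Frequencies from Table 2: the term is frequent in labels 1-5 and rare in 6-9.
counts = {1: 53, 2: 76, 3: 87, 4: 66, 5: 62, 6: 27, 7: 25, 8: 28, 9: 26}
for l in (3, 7):
    print("label", l,
          "tf-rfl:", round(label_dependent_weight(2, counts, l, mean), 3),
          "tf-rrfl:", round(label_dependent_weight(2, counts, l, median), 3))
```

With these counts the weight for label 3 (where the term is relatively frequent) is clearly larger than for label 7, which is the label-discriminating behavior the representation aims for.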
4.3 Classification Method
The proposed term-weighting methods include information on the frequency of occurrence of a term t in each set of documents labelled other than the label under evaluation. It is expected that the tf−rfl and tf−rrfl weights will be higher if the term t appears more frequently in documents with label l than in documents with the other labels λj/l, and that they will be lower, in contrast, if the term t is more frequent in documents with labels other than l. Our proposal is based on the tf−rfl and tf−rrfl representations and a binary SVM ensemble. It transforms the multi-label problem into a PT4 form [14], and then, for each document d, the tf−rfl and tf−rrfl representations are derived for each label λj and classified using |L| binary classifiers; a sketch of this pipeline is given below.
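The following is a minimal sketch of such a PT4-style pipeline, assuming a generic linear binary classifier (here scikit-learn's LinearSVC rather than the LibSVM setup used in the experiments); `build_features` stands for any label-dependent representation such as tf−rfl or tf−rrfl, and all names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_pt4_ensemble(build_features, docs, label_sets, labels):
    """Train one binary classifier per label (PT4 transformation)."""
    classifiers = {}
    for l in labels:
        X = np.array([build_features(doc, l) for doc in docs])   # label-dependent vectors
        y = np.array([1 if l in Y else 0 for Y in label_sets])   # l vs. not-l
        classifiers[l] = LinearSVC().fit(X, y)
    return classifiers

def predict_pt4(classifiers, build_features, doc):
    """H_PT4(x): union of the labels predicted by the |L| binary classifiers."""
    return {l for l, clf in classifiers.items()
            if clf.predict(np.array([build_features(doc, l)]))[0] == 1}
```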
5 Experiments
The evaluation of the proposed tf−rfl and tf−rrfl representations was carried out using the Reuters-21578 collection, Distribution 1.0. The Reuters-21578 data set consists of 21,578 Reuters newswire documents that appeared in 1987, less than half of which have human-assigned topic labels.
Table 3. Characteristics of the pre-processed data set. Note that PMC denotes the percentage of documents belonging to more than one class and ANL denotes the average number of labels for each document.

Data Set   Number of Classes   Number of Documents   Vocabulary Size   PMC     ANL
First3     3                   7,258                 529               0.74%   1.0074
First4     4                   8,078                 598               1.39%   1.0140
First5     5                   8,655                 651               1.98%   1.0207
First6     6                   8,817                 663               3.43%   1.0352
First7     7                   9,021                 677               3.62%   1.0375
First8     8                   9,158                 683               3.81%   1.0396
First9     9                   9,190                 686               4.49%   1.0480
The data set and the validation mechanism used are the same as in [16]; that is, the subsets of the k classes with the largest number of articles are selected for k = 3, . . . , 9, resulting in seven different data sets denoted First3, First4, . . . , First9. Also, in this test 3-fold cross-validation is run ten times on each data set. Our classification method reports the average values over the ten runs. Table 3 shows the data set characteristics. First, we must transform the problem into a PT4 form, dividing the data into k input data sets for k = 3, . . . , 9 binary classifiers, whereby each machine classifies one label against the others. Four representations were constructed from the data set, namely, the classical tf−idf and tf−rf representations and our proposed tf−rfl and tf−rrfl representations. An ensemble of binary SVM classifiers was used. Each machine employed a linear kernel; the parameters were optimised by maximising the classification margin between each pair of classes. The ensemble was implemented with LibSVM [1], where each machine worked with random sampling. Two-thirds of the examples were used for training, and one-third was used for testing. Note that all tf−idf representations are the same regardless of the label under evaluation, while the tf−rf, tf−rfl and tf−rrfl representations are different for each label. Multi-label classification methods require different performance metrics than those used in traditional single-label classification methods. These measures can be grouped into bipartitions and rankings [15]. Since our method is not based on ranking, as in [11] and [16], the evaluation of the results in this research was performed using the Hamming loss, which considers bipartitions and evaluates how many times an instance-label pair is misclassified. This measure of error is defined as:

hloss(h) = (1/d) Σ_{i=1}^{d} (1/|L|) |h(x_i) Δ Y_i| ,    (6)
where d is the number of documents, h(x_i) is the set of labels assigned by the classifier to document x_i, Y_i is the set of original labels of the document, and Δ denotes the symmetric difference between the two sets. Performance is better when hloss(h) is close to 0.
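A minimal sketch of Eq. (6), assuming label sets are represented as Python sets (the example values are placeholders):

```python
def hamming_loss(predicted, true, n_labels):
    """Hamming loss per Eq. (6): average size of the symmetric difference between
    predicted and true label sets, normalised by |L| and by the number of documents."""
    d = len(predicted)
    return sum(len(p ^ t) for p, t in zip(predicted, true)) / (d * n_labels)

# Example with |L| = 3 labels and two documents.
pred = [{"earn"}, {"acq", "grain"}]
true = [{"earn", "acq"}, {"acq"}]
print(hamming_loss(pred, true, n_labels=3))   # (1 + 1) / (2 * 3) = 0.333...
```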
Table 4. Experimental results of SVM ensembles with tf−idf, tf−rf, tf−rfl and tf−rrfl compared with other learning algorithms in terms of Hamming loss. Bp-MLL* and BoosTexter* as reported by [16].

Data set      First3    First4    First5    First6    First7    First8    First9    Average
SVM tf-idf    0.02797   0.02641   0.02590   0.02477   0.02246   0.02083   0.01981   0.02402
SVM tf-rf     0.02814   0.02687   0.02611   0.02522   0.02287   0.02118   0.02012   0.02436
SVM tf-rfl    0.02716   0.02590   0.02526   0.02412   0.02186   0.02026   0.01930   0.02341
SVM tf-rrfl   0.02578   0.02478   0.02427   0.02321   0.02110   0.01958   0.01870   0.02249
Bp-MLL*       0.0368    0.0256    0.0257    0.0271    0.0252    0.0230    0.0231    0.02664
BoosTexter*   0.0236    0.0250    0.0260    0.0262    0.0249    0.0229    0.0226    0.02446
Table 4 shows the different representations and their performance in terms of Hamming loss. In this metric, for the data sets with fewer classes, BoosTexter is better than tf−rfl and tf−rrfl by 0.00356 and 0.00218, respectively. For the data sets with more classes (namely, First5, First6, First7, First8 and First9), tf−rfl is better than the other algorithms. Table 4 also shows that tf−rrfl is better than the other algorithms for the data sets with more classes (namely, First4, First5, First6, First7, First8 and First9). To evaluate the results, as in [16], a two-tailed paired t-test at the 5 percent significance level was used. According to these results, SVM Ens tf−rfl performs better than SVM Ens tf−idf (4.2595 × 10−6), SVM Ens tf−rf (2.0376 × 10−7) and Bp-MLL (3.74 × 10−2). In addition, SVM Ens tf−rrfl performs better than SVM Ens tf−idf (2.5368 × 10−5), SVM Ens tf−rf (4.2013 × 10−6) and Bp-MLL (1.63 × 10−2). The p-values shown in parentheses provide a further quantification of the significance level. The results in Table 5 show the level of statistical significance as compared to alternative approaches with respect to Hamming loss. We can see that the differences with respect to BoosTexter are not statistically significant for the data sets with fewer labels (First3, First4, First5), but for the data sets with more labels (First6, First7, First8 and First9), BoosTexter has the worst performance among all algorithms. Finally, in Figure 1, we show how the different weighting methods discriminate whether a term is important for a classifier or not. In this case, using rrfl and rfl the term is weighted high for labels 1, 2, 3, 4 and 5, and lower for labels 6, 7, 8 and 9. Note that idf does not discriminate when evaluating each label and rf only slightly discriminates.

Table 5. Statistical analysis of results in terms of p-values of the Student's t-test. NSS means "Is Not Statistically Significant".

              SVM tf-rfl      SVM tf-rf       SVM tf-idf      Bp-MLL        BoosT.
SVM tf-rrfl   1.0754 × 10−4   4.2013 × 10−6   2.5368 × 10−5   1.63 × 10−2   NSS
SVM tf-rfl                    2.0376 × 10−7   4.2013 × 10−6   3.74 × 10−2   NSS
SVM tf-rf                                     4.2595 × 10−6   NSS           NSS
SVM tf-idf                                                    NSS           NSS
Bp-MLL                                                                      NSS
Fig. 1. Term-weights assigned by different representations for each label
6 Remarks and Conclusions
Multi-label classification is an important topic in information retrieval and machine learning. Text representation and classification have traditionally been addressed using tf−idf due to its simplicity and good performance. Changes in the input representation can employ knowledge about the problem, a particular label, or the class to which the document belongs. Other representations can be developed for overcoming a particular problem directly, without transformation. New benchmarks should be used to validate the results; however, the pre-processing of multi-labelled texts must be standardised. In this paper, we have presented tf−μrfl as a novel text representation for the multi-label classification approach. This proposal was assessed with two new input representations, tf−rfl and tf−rrfl. These representations consider the label to which the document belongs, combining problem transformation with algorithm adaptation. The performance of these representations was tested in combination with an SVM ensemble using a known dataset. The results show a statistically significant improvement as compared to alternative approaches with respect to Hamming loss. We believe that the contribution of the proposed multi-label representation is due to a better understanding of the problem under consideration. In future studies, we plan to compare our method to other tf−idf representations and to investigate other label-dependent representations and procedures in order to reduce the dimension of the feature space depending on the relevance of each label.

Acknowledgement. This work has been partially funded by the Research Grants Fondecyt 1110854 and Basal FB0821 "Centro Científico Tecnológico de Valparaíso".
References
[1] Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[2] Cristianini, N., Shawe-Taylor, J., Lodhi, H.: Latent semantic kernels. Journal of Intelligent Information Systems 18(2-3), 127–152 (2002)
[3] Fink, E.: Automatic evaluation and selection of problem-solving methods: Theory and experiments. Journal of Experimental and Theoretical Artificial Intelligence 16(2), 73–105 (2004)
[4] Joachims, T.: Learning to classify text using support vector machines – methods, theory, and algorithms. Kluwer-Springer (2002)
[5] Keikha, M., Razavian, N.S., Oroumchian, F., Razi, H.S.: Document representation and quality of text: An analysis. In: Survey of Text Mining II: Clustering, Classification, and Retrieval, pp. 135–168. Springer, London (2008)
[6] Lan, M., Tan, C.-L., Low, H.-B.: Proposing a new term weighting scheme for text categorization. In: AAAI 2006: Proceedings of the 21st National Conference on Artificial Intelligence, pp. 763–768. AAAI Press, Menlo Park (2006)
[7] Lan, M., Tan, C.L., Su, J., Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 721–735 (2009)
[8] Leopold, E., Kindermann, J.: Text categorization with support vector machines. How to represent texts in input space? Machine Learning 46(1-3), 423–444 (2002)
[9] Manning, C., Schutze, H.: Foundations of statistical natural language processing. The MIT Press, Cambridge (1999)
[10] Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management: an International Journal 24(5), 513–523 (1988)
[11] Schapire, R.E., Singer, Y.: BoosTexter: A boosting-based system for text categorization. Machine Learning, 135–168 (2000)
[12] Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
[13] Tsivtsivadze, E., Pahikkala, T., Boberg, J., Salakoski, T.: Kernels for text analysis. Advances of Computational Intelligence in Industrial Systems 116, 81–97 (2008)
[14] Tsoumakas, G., Katakis, I.: Multi label classification: An overview. International Journal of Data Warehouse and Mining 3(3), 1–13 (2007)
[15] Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, 2nd edn. Springer, Heidelberg (2010)
[16] Zhang, M.-L., Zhou, Z.-H.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18(10), 1338–1351 (2006)
Fraud Detection in Telecommunications Using Kullback-Leibler Divergence and Latent Dirichlet Allocation

Dominik Olszewski

Faculty of Electrical Engineering, Warsaw University of Technology, Poland
[email protected]

Abstract. In this paper, a method for telecommunications fraud detection is proposed. The method is based on user profiling employing the Latent Dirichlet Allocation (LDA). The detection of fraudulent behavior is achieved with a threshold-type classification algorithm, allocating the telecommunication accounts into one of two classes: fraudulent account and non-fraudulent account. The accounts are classified with use of the Kullback-Leibler divergence (KL-divergence). Therefore, we also introduce four methods for approximating the KL-divergence between two LDAs. Finally, the results of an experimental study on KL-divergence approximation and fraud detection in telecommunications are reported.

Keywords: Fraud detection, User profiling, Kullback-Leibler divergence, Mixture models, Latent Dirichlet Allocation.
1 Introduction
There are a number of fraud detection problems, including credit card fraud, money laundering, computer intrusion, and telecommunications fraud, to name but a few. Among all of them, fraud detection in telecommunications appears to be one of the most difficult, since there is a large amount of data that needs to be analyzed, and, simultaneously, there is only a small number of samples of fraudulent calls which could be used as the learning data for learning-based methods. Consequently, this problem essentially inhibits and limits the application of learning-based techniques, like neural-network-based classifiers. The problem of fraud detection in telecommunications has been studied in [1,2,3,4,5]. In paper [1], the Gaussian Mixture Model (GMM) is applied for user profiling, and a high fraud recognition rate is reported. The paper [2] employs Latent Dirichlet Allocation (LDA) to build user profile signatures. The authors assume that any significant unexplainable deviations from the normal activity of an individual user are strongly correlated with fraudulent activity. The authors of [3] investigate the usefulness of applying different learning approaches to the problem of telecommunications fraud detection, while in work [4] an expert system is constructed, which incorporates both the network administrator's expert knowledge and knowledge derived from the application of data mining
techniques on real-world data. Finally, the recent study [5] aimed at identifying customers' subscription fraud by employing data mining techniques and adopting a knowledge discovery process; to this end, a hybrid approach consisting of pre-processing, clustering, and classification phases was applied. The Kullback-Leibler divergence (KL-divergence) between two probability measures P and Q on a continuous measurable space Ω is defined as [6,7]:

d(P, Q) = ∫_Ω p log_2 (p/q) dλ ,    (1)

where p and q are the density functions of measures P and Q, respectively, while measures P and Q are absolutely continuous with respect to the measure λ. Our approach is based on the user profiling technique utilizing LDA, and detects fraudulent behavior on the basis of binary classification, i.e., classification into one of two classes: fraudulent account and non-fraudulent account. We apply a threshold-type classification algorithm using the KL-divergence. Consequently, our method requires the computation of the KL-divergence between two LDAs, which is an unsolved problem. Therefore, this paper also focuses on the issue of approximating the KL-divergence between two LDAs, introduces four approximation methods, and chooses the most effective one. Fraudulent activity is indicated by crossing the pre-defined threshold. Our technique strongly relies on user profiling with the LDA probabilistic model. Employing LDA for fraud detection in telecommunications was first proposed in [2]; however, the difference between [2] and our paper is that we detect whole fraudulent accounts, in contrast to [2], where single fraudulent calls are detected. Consequently, we apply a different classification algorithm. This kind of approach is also useful in real-world fraud detection problems. Recapitulating, this paper proposes:
– four methods for approximating the KL-divergence between two LDAs,
– a threshold-type classification algorithm for fraud detection in telecommunications.
An advantage of our probabilistic approach is that it does not involve a learning process, this way overcoming the difficulties associated with it (insufficient learning data).
2 Using LDA for User Profiling
The choice of this specific probabilistic model of a telecommunication user was motivated by its properties, which provide an accurate description of a user profile. The model is dynamically developed for individuals within a group, and it explicitly captures the assumption of the existence of a common set of behavioral patterns, which can be estimated on the basis of all observed users, along with their user-specific proportion of participation [2]. The model itself was introduced in [8].
LDA is a generative probabilistic model for collections of discrete data. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of latent classes. The basic idea, derived from [2], is that the accounts are represented as finite mixtures over latent classes, where a class is characterized by a distribution over features of the calls made from the account. As the features, we use the destination, start-time, and duration of a call. The accounts are coded as bags of feature vectors.

Procedure 1. An account can be generated from the LDA model using the following procedure:
Step 1. Draw the number of iterations N ∼ Poisson(ξ).
Step 2. Draw the parameter for the account class distribution θ ∼ Dir(α), where α is the parameter of the prior Dirichlet distribution over the latent classes.
Step 3. For i = 1 : 1 : N :
  Step a. Draw the class zi, z ∼ Mul(θ).
  Step b. Draw the feature ai from p(a | zi, β) – a multinomial probability distribution of the vector of features a, conditioned on the class zi, which points to the row of the matrix-parameter β, i.e., β_{zi}.

The LDA model has two parameters: a vector α = [α1, . . . , αK] (the parameter of the Dirichlet distribution) and a K×V matrix β, whose rows are the parameters of the multinomial distributions. K is the number of latent classes, and V is the number of features in vector a. The variable N is independent of the other data-generating variables of the model (θ, z), and, therefore, its randomness may be ignored. For the convenience of further considerations, we will assume N ≡ K. The posterior distributions of the hidden variables θ and z are estimated using a variational approximation. The model parameters α and β are estimated using a variational EM algorithm (α and β maximize the (marginal) log-likelihood of the data). Given the parameters α and β, the joint distribution of a latent class mixture θ, a vector of K latent classes z, and a vector of V features a is given by [8]:

p(θ, z, a | α, β) = p(θ | α) Π_{i=1}^{K} p(z | θ) p(a | zi, β) ,    (2)
where p(θ | α) is the Dirichlet probability distribution of the variable θ = [θ1 , . . . , θK ], p(z | θ) is the multinomial probability distribution of the vector z of K latent classes, with vector-parameter θ, and p(a|zi , β) is the multinomial probability distribution of the vector a of V features, with the vector-parameter β zi (zi -th row of the matrix β). At this point, we note that the symbols p and q will be abused throughout this paper, i.e., they will refer to different types of probability distributions, however, always to probability distributions. This kind of notation abuse is common in probability and statistics. It is used, e.g., in [2,8].
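As a concrete illustration of Procedure 1, the following is a minimal numpy sketch (not the author's implementation) that samples one synthetic account from given LDA parameters α and β; the toy parameter values and discrete feature coding are assumptions for the example only.

```python
import numpy as np

def generate_account(alpha, beta, xi=100, rng=np.random.default_rng(0)):
    """Sample one account from the LDA generative model (Procedure 1).

    alpha: (K,) Dirichlet parameter over latent classes
    beta:  (K, V) rows are multinomial parameters over the V feature values
    xi:    Poisson rate for the number of generated call features
    """
    n = rng.poisson(xi)                           # Step 1: number of iterations
    theta = rng.dirichlet(alpha)                  # Step 2: account class proportions
    features = []
    for _ in range(n):                            # Step 3
        z = rng.choice(len(alpha), p=theta)       # Step a: latent class
        a = rng.choice(beta.shape[1], p=beta[z])  # Step b: feature value from row beta_z
        features.append(a)
    return np.array(features)

alpha = np.array([1.0, 2.0, 0.5])                 # K = 3 latent classes (toy values)
beta = np.array([[0.7, 0.2, 0.1],                 # V = 3 coded feature values per class
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])
print(generate_account(alpha, beta, xi=10))
```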
3 KL-Divergence between Multinomial Mixture Models
Since the LDA model incorporates the multinomial mixtures, it will be necessary to evaluate the KL-divergence between them, in order to approximate the KL-divergence between LDAs. We introduce the notion of a Multinomial Mixture Model (MMM) referring to the product of multinomial probability distributions. Consequently, a pair of MMMs can be described with the following formulae:

p(x) = Π_a Mul(x; N_a, P_a) = Π_a p_a(x) ,    (3)

q(x) = Π_b Mul(x; N_b, Q_b) = Π_b q_b(x) ,    (4)
where N_a, P_a and N_b, Q_b are the parameters of the distributions p(x) and q(x), respectively. The parameters N_a and N_b are the numbers of trials, while the parameters P_a and Q_b are the event probabilities. The problem of determining the KL-divergence between two MMMs is analytically intractable. This happens due to the strong statistical dependence between the random variables of each MMM's components, i.e., each component has the same variable (explained in Section 4). Therefore, an approximation needs to be employed. We propose three methods for approximating the KL-divergence between two MMMs:

1. The nearest pair method. This approach is inspired by the nearest pair method for approximating the KL-divergence between two GMMs discussed in [9]. Hence, we have:

   d^min_MMM(p, q) = min_{a,b} d(p_a, q_b) .    (5)

2. The furthest pair method. This is a method analogous to the previous one, with the difference that in this case the furthest pair is considered:

   d^max_MMM(p, q) = max_{a,b} d(p_a, q_b) .    (6)
3. The mixed sum method. In this case, the KL-divergence is computed as the sum of the divergences of each of the mixtures' components. Hence, for k-component MMMs, we get:

   d^sum_MMM(p, q) = Σ_{j=1}^{k} d(p_j, q_j) .    (7)

The drawback associated with this method is that both MMMs must have the same number of components.
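A minimal sketch of the three combination rules in Eqs. (5)-(7) is given below. It assumes, for illustration, that the component KL-divergences d(p_a, q_b) are computed between multinomials sharing the same number of trials (in which case the divergence reduces to n·Σ p log2(p/q)); the event probabilities of q are assumed strictly positive, and all names are illustrative.

```python
import numpy as np
from itertools import product

def kl_multinomial(p, q, n_trials):
    """KL (base 2) between Mul(n, p) and Mul(n, q) with the same number of trials n.
    Assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return n_trials * np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def d_mmm(p_components, q_components, mode="sum"):
    """Approximate KL between two MMMs per Eqs. (5)-(7).
    Components are (n_trials, probability_vector) pairs; the trial count of the
    p-component is used for each pairwise divergence (a simplifying assumption)."""
    pairwise = [kl_multinomial(pp, qp, n)
                for (n, pp), (_, qp) in product(p_components, q_components)]
    if mode == "min":     # nearest pair, Eq. (5)
        return min(pairwise)
    if mode == "max":     # furthest pair, Eq. (6)
        return max(pairwise)
    # mixed sum, Eq. (7): component-wise, requires equal numbers of components
    return sum(kl_multinomial(pp, qp, n)
               for (n, pp), (_, qp) in zip(p_components, q_components))

P = [(1, [0.6, 0.3, 0.1]), (1, [0.2, 0.5, 0.3])]
Q = [(1, [0.5, 0.4, 0.1]), (1, [0.3, 0.3, 0.4])]
for m in ("min", "max", "sum"):
    print(m, round(d_mmm(P, Q, m), 4))
```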
4 KL-Divergence between LDAs
We propose three methods for approximating the KL-divergence between two LDAs. Our methods are based on the KL-divergence computation between the components of the LDAs, i.e., between the Dirichlet distributions and between the MMMs. The difference between the proposed methods consists in the use of different methods for approximating the KL-divergence between MMMs. We also discuss the Monte-Carlo simulation method, which was used in our experiments as the reference method. We consider a pair of LDAs of the following form:

p(θ, z, a | α1, β1) = p(θ | α1) Π_{i=1}^{K} p(z | θ) p(a | zi, β1) ,    (8)

q(θ, z, a | α2, β2) = q(θ | α2) Π_{i=1}^{K} q(z | θ) q(a | zi, β2) .    (9)
Each LDA can be presented as a three-variable function, which, in turn, can be written as the product of three one-variable functions (a Dirichlet distribution and two MMMs):

p(θ, z, a) = p1(θ) p2(z) p3(a) ,    (10)

where p1(θ) = p(θ | α), θ ∼ Dir(α); p2(z) = p(z | θ), z ∼ MMM(θ); p3(a) = p(a | z, β), a ∼ MMM(z, β). We will use this form of LDA for the approximation of the KL-divergence. According to (1), functions of product form are mathematically convenient for the computation of the KL-divergence (logarithm of a product, integral over a density function). However, the convenience, essentially simplifying the computations, is achieved only if the product components refer to independent variables. Hence, in the case of random variables, statistical independence is expected. In the case of the LDA model, the joint distribution (2) implies statistical dependence between the random variables θ, z, and a. Therefore, the KL-divergence between two LDA models is not analytically tractable, and its determination is possible only on the basis of an approximation. Consequently, our methods can be regarded as example approaches to such an approximation, which assume the statistical independence of the random variables θ, z, and a. Assuming the random variables θ, z, and a are statistically independent, the KL-divergence between two LDAs can be written as follows:
d(p(θ, z, a), q(θ, z, a)) = ∫_θ ∫_z ∫_a p(θ, z, a) log_2 [p(θ, z, a) / q(θ, z, a)] da dz dθ
  = ∫_θ ∫_z ∫_a p1(θ) p2(z) p3(a) log_2 [p1(θ) p2(z) p3(a) / (q1(θ) q2(z) q3(a))] da dz dθ
  = ∫_θ ∫_z ∫_a p1(θ) p2(z) p3(a) log_2 [p1(θ) / q1(θ)] da dz dθ
  + ∫_θ ∫_z ∫_a p1(θ) p2(z) p3(a) log_2 [p2(z) / q2(z)] da dz dθ
  + ∫_θ ∫_z ∫_a p1(θ) p2(z) p3(a) log_2 [p3(a) / q3(a)] da dz dθ
  ≈ ∫_θ p1(θ) log_2 [p1(θ) / q1(θ)] dθ + ∫_z p2(z) log_2 [p2(z) / q2(z)] dz + ∫_a p3(a) log_2 [p3(a) / q3(a)] da
  = d(p1(θ), q1(θ)) + d(p2(z), q2(z)) + d(p3(a), q3(a))
  = d_Dir + d_MMM1 + d_MMM2 ,    (11)
where d_Dir = d(p1(θ), q1(θ)), d_MMM1 = d(p2(z), q2(z)), d_MMM2 = d(p3(a), q3(a)). On the basis of this transformation, three approximation methods are proposed. The difference between them derives from the different methods for approximating the KL-divergence between MMMs applied in these three methods.

1. The nearest pair method. In this method, the KL-divergence between MMMs is approximated according to the nearest pair method:

   d^min_LDA = d_Dir + d^min_MMM1 + d^min_MMM2 ,    (12)

where d_Dir can be calculated analytically, according to the formula given, e.g., in [10].

2. The furthest pair method. In this case, the KL-divergence between MMMs is approximated according to the furthest pair method:

   d^max_LDA = d_Dir + d^max_MMM1 + d^max_MMM2 .    (13)

3. The mixed sum method. In this case, the KL-divergence between MMMs is approximated according to the mixed sum method:

   d^sum_LDA = d_Dir + d^sum_MMM1 + d^sum_MMM2 .    (14)

4. The Monte-Carlo simulation method. In this case, the KL-divergence between two LDAs is approximated in the following way:

   d^MC_LDA(p, q) = (1/n) Σ_{i=1}^{n} log_2 [p(x_i) / q(x_i)]  →  d(p, q)  as n → ∞ .    (15)
We use n i.i.d. samples xk , k = 1, . . . , n, coming from the LDA model. Each sample xk is a vector xk = [θ, z, a]. Consequently, in each of n iterations, three random variables need to be drawn. For a large number of samples (100K or 1M) this method yields a very accurate approximation. Of course, using this number of samples is associated with a huge computational burden. However, the Monte-Carlo method can be used successfully as a reference method, allowing for evaluation of other methods, discussed in this paper.
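A minimal, generic sketch of the estimator in Eq. (15), assuming a sampler for p and log2-density functions for p and q are available; the Gaussian toy check (not part of the paper) is only there so the estimate can be compared against a known closed form.

```python
import numpy as np

def kl_monte_carlo(sample_p, log2_p, log2_q, n=100_000):
    """Monte-Carlo estimate of KL(p || q) per Eq. (15).

    sample_p: callable returning one sample x ~ p
    log2_p, log2_q: callables returning log2-densities of x under p and q
    """
    total = 0.0
    for _ in range(n):
        x = sample_p()
        total += log2_p(x) - log2_q(x)
    return total / n

# Toy check with two univariate Gaussians (closed form: 0.5 nats ≈ 0.721 bits).
rng = np.random.default_rng(1)
log2_norm = lambda x, m, s: (-0.5 * ((x - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))) / np.log(2)
est = kl_monte_carlo(lambda: rng.normal(0.0, 1.0),
                     lambda x: log2_norm(x, 0.0, 1.0),
                     lambda x: log2_norm(x, 1.0, 1.0),
                     n=50_000)
print(est)
```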
In the LDA model, the hidden random variable θ, drawn from the Dirichlet distribution with the vector-parameter α (the first parameter of LDA model), is used as the vector-parameter of the first multinomial distribution. Then, in each of N iterations, the hidden random variable z is being drawn from the first multinomial distribution, and is used to select the row of the matrix β (the second parameter of LDA model), which, in turn, will be used as the vector-parameter of the second multinomial distribution (Procedure 1). Therefore, in order to obtain the parameters of MMM1 and MMM2 , we have computed the expected values of the hidden random variables θ and z, i.e., θ = E [θ], θ ∼ Dir(α); z = E [z], z ∼ Mul(θ).
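The expected values mentioned above have simple closed forms; a brief sketch under the notation above (the α values are placeholders):

```python
import numpy as np

alpha = np.array([1.0, 2.0, 0.5])
theta_mean = alpha / alpha.sum()   # E[theta] for theta ~ Dir(alpha)
z_mean = theta_mean                # E[z] for z ~ Mul(theta), expressed as class proportions
print(theta_mean, z_mean)
```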
5 Fraud Detection in Telecommunications
Fraud detection is performed on the basis of the classification of accounts into one of two account classes: fraudulent account and non-fraudulent account. We propose a threshold-type classification algorithm for detecting fraudulent activity in telecommunications. Each account is profiled with the LDA probabilistic model described in Section 2. The detection is achieved by evaluating the KL-divergence between the reference account's model and the model of the account currently being classified. A fraud is alarmed when the pre-defined threshold is crossed. The reference account should represent the most typical telecommunication user's behavior possible. The threshold value is set arbitrarily. Our classification algorithm is illustrated in 2-dimensional space in Fig. 1, which presents ten LDA models of telecommunication accounts, among which two are detected as fraudulent, i.e., the points representing these accounts lie outside the circle determined by the reference model (center) and the threshold (radius).
Fig. 1. Graphical illustration of the proposed classification algorithm
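A minimal sketch of this threshold-type decision rule, assuming the per-account divergences from the reference model have already been computed (e.g., with the mixed sum approximation); the account identifiers and numbers are placeholders.

```python
def classify_accounts(divergences, threshold):
    """Threshold-type classification: an account is flagged as fraudulent when its
    (approximate) KL-divergence from the reference model exceeds the threshold."""
    return {acc: ("fraudulent" if d > threshold else "non-fraudulent")
            for acc, d in divergences.items()}

print(classify_accounts({"acc-01": 0.4, "acc-02": 2.9, "acc-03": 0.7}, threshold=1.5))
```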
6 Experiments
In the first part of our experiments, we investigated the accuracy of the proposed methods for approximating the KL-divergence between two LDAs. In the second part, we conducted a telecommunications fraud detection experiment.
6.1 KL-Divergence Approximation between Two LDAs
We evaluated the accuracy of the three methods proposed in this paper by comparing them with the Monte-Carlo method run for 1K samples. The parameters α and β of the simulated LDA models were generated randomly, i.e., the entries of the vector α were drawn from the uniform distribution on the interval [0, 5], while the entries of the matrix β were drawn from the uniform distribution on the interval [0, 1]. The rows of β were normalized (they are the parameters of the multinomial distributions). The experiments were conducted for three and five latent classes. For each of these cases, five LDA models were investigated (Fig. 2).
Fig. 2. Results of KL-divergence approximation
The highest approximation accuracy was obtained for the mixed sum method. The experiments have shown that the mixed sum method, for three and five latent classes, provides accuracy similar to that of the Monte-Carlo simulation method; hence, given its much lower computational complexity, we can assert that this method outperforms the Monte-Carlo method and provides an efficient and effective way of approximating the KL-divergence between two LDAs.
6.2 Fraud Detection in Telecommunications Results
In the experiments, the performance of the proposed telecommunications fraud detection method was assessed by a comparison with the GMM-based fraud detection method (Fig. 3).
(a) Our method (AUROC = 0.9833). (b) GMM-based method (AUROC = 0.9111). Axes: false alarm rate (horizontal) vs. fraud detection rate (vertical).
Fig. 3. ROC curves for our method and for GMM-based method
The GMM-based method is discussed in [1], and employs GMM probabilistic models for user profiling. As the basis of the comparison, Receiver Operating Characteristic (ROC) curves were employed. The ROC curves show the fraud detection probability (true positive rate) as a function of the false alarm probability (false positive rate). In order to evaluate a specific curve, the Area Under ROC (AUROC) metric was introduced. It simply measures the area under the curve, and a perfect value of AUROC is 1. Other important values in the assessment of a ROC curve are the highest fraud detection rate corresponding to a zero false alarm rate (HDZF) and the lowest false alarm rate corresponding to the maximal (i.e., 1) fraud detection rate (LFMD). The experiments were carried out on a data set consisting of a hundred telecommunication accounts, among which twenty were fraudulent. Each account was represented with one hundred call data records (CDRs). Each CDR contains information about a specific call made by a specific user. Hence, a CDR is a vector of features of a specific call, such as destination, start-time, or duration. For each account the LDA and GMM models profiling the telecommunication user were built on the basis of a hundred CDRs. The building of the LDA model is discussed in Section 2. Each GMM model consisted of three Gaussians, each corresponding to a different feature in the CDRs. Hence, the building of the GMM model consisted in the estimation of the parameters μ (mean) and σ² (variance) of each Gaussian component. Each CDR consists of three features. The first feature is the destination of a call (local, trunk, international, premium, toll-free, mobile), the second is the start-time (8-17, 17-22, 22-8), and the third is the duration.

In Eq. (1), αij > 0 if C(xi) = C(xj) and αij < 0 if C(xi) ≠ C(xj). Function G(x) should be localized, with a maximum for x = 0; for example, it can be a Gaussian function. Then, for a given direction w ∈ R^d, vectors xi and xj will increase the QPC value if, after projection on w, they fall close to each other and are from the same class, but if they are from different classes the QPC index is decreased by a value dependent on the distance between these vectors after projection on w. Thus maximization of Eq. (1) leads to a linear transformation that creates compact and pure clusters of vectors from the same class, well separated from other clusters, and provides a leave-one-out estimator measuring the quality of this projection. Proper choice of the constants αij and of the width of the function G(x) might force QPC optimization to prefer solutions with higher between-cluster separation over solutions characterized by better within-class purity and compactness. In all experiments presented in this paper Gaussian functions were used for localization. To normalize the QPC index value, αij = 1/(n·nj) is used for all i, j = 1, . . . , n satisfying the condition C(xi) = C(xj), and αij = 1/(n(n − nj)) if C(xi) ≠ C(xj), where nj denotes the number of instances that belong to the class associated with xj, and n is the number of all instances. Optimization of the QPC index provides solutions that might be useful in many supervised machine learning applications for data visualization and dimensionality reduction. Recently [5,6] this index was successfully applied to train and construct several neural network architectures for classification of multi-class problems. A major disadvantage of QPC (like most projection pursuit indexes) is its high computational cost. Each evaluation of Eq.
(1) has computational complexity O(dn²), where d is the number of dimensions and n is the number of instances in the training dataset, which may make this approach useless for datasets with a large number of instances, especially when many iterations are needed for convergence of the optimization process. This drawback can be overcome by using a set of prototypes T = {t1, . . . , tk} as reference points providing an estimation of the dataset class distribution. For a given set of
prototypes T, where each prototype ti is associated with a class C(ti), the approximation of the QPC index may be expressed as follows:

QPC(w) = Σ_{j=1}^{k} Σ_{i=1}^{n} αij G(w^T (x_i − t_j)) ,    (2)
where the constants αij > 0 if C(xi) = C(tj) and αij < 0 if C(xi) ≠ C(tj), accordingly. If the positions of the prototypes are not fixed, then Eq. (2) has (k+1)×d parameters to optimize (where k is the number of prototypes), while optimization of Eq. (1) must adjust only d weight components. However, if k ≪ n, then the computational cost becomes linear in the number of instances and in the number of features, O(kdn). Solutions generated by maximization of Eq. (2) strongly depend on the number of prototypes and on their initialization (position and label association). The algorithm described below allows for computing an approximation to the QPC index value for a given direction without the need of finding reference points, and might also be used for estimation of the initial positions of the prototypes. Consider the set of vectors xi ∈ R^d (i = 1, . . . , n) projected on the direction w, with the whole span of projected points divided into k equal intervals of width h:

y_min = min_i w^T x_i ,   y_max = max_i w^T x_i ,   h = (1/k)(y_max − y_min) .    (3)
i = 1, . . . , k.
(4)
For each class Ci and j-th interval the partial QPC index is defined by: ˜ Ci ,j = Q
n
αij G wT xi − βj
(5)
i=1
where αij > 0 if C(xi ) = Cj and αij < 0 if C(xi ) = Cj . Let associate interval j with class Ci that gives maximum: ˜ Ci ,j C(βj ) = arg max Q Ci
(6)
The approximate value of QPC index for direction w and k intervals is computed from: QP C(w) ≈
k n
αij G w T xi − βj
(7)
j=1 i=1
where αij > 0 if C(xi ) = C(βj ) and αij < 0 if C(xi ) = C(βj ). The computational cost of evaluation of Eq. (7) is O(kndc) where c denotes the number of classes. Eq. (7) might be directly used for searching for optimal w, however this approximation is used here only for setting initial positions of the prototypes and their labels. Direction w define line in d dimensional space y = γw + µ, where γ ∈ R and µ ∈ Rd is an arbitrary point along this line that may be taken as the center position of all data vectors
92
M. Grochowski and W. Duch
X . Then for a given direction w and k intervals with centers in βi , initial positions of prototypes ti ∈ Rd placed on this line are given by: (8) ti = βi w + µ − (w T µ)w . These prototypes are used here to initialize optimization procedure of the QPC index given by Eq. (2). Maximum number of prototypes do not exceed the number of intervals k, but might be reduced if prototypes for the same class become neighbors after projection. Additionally, the width of these intervals give a direct estimation of the spread of G(x) function. For Gaussian functions setting the standard deviation to σ = h guarantees that the par˜ Ci ,j given by Eq. (5), will depend mostly on data projected inside tial QPC function Q the i-th interval, and to a lesser extent on vectors that belong to the adjacent intervals.
3 Results 3.1 Learning Speed Comparison Tab. 1 presents comparison of time needed for training of the standard QPC index defined by Eq. 1 (denoted here as QPC1) and the approximated QPC index (denoted here as QPC2) defined by Eq. (2) for several classification problems with various size and complexity of inherent relations. Most of these datasets come from the UCI repository [7] (Abalone, Appendicitis, Australian Credit Rating, Breast Cancer Wisconsin, Glass, Heart, Ionosphere, Iris, Ljubljana Breast Cancer, Monk’s 1 training part, Congressional Voting Records, Spam and Wine). In addition two artificial dataset were used: 10-dimensional parity problem and Concentric Rings dataset containing 2 important features defining points inside 4 rings (one per class) and 2 noise variables drawn from uniform distribution. Both QPC1 and QPC2 use Gaussian function for G(x) and a gradient descent procedure with the same learning rate (0.1) and the same stop condition. Initial positions of the prototypes for QPC2 have been set according to Eq. (8) with number of intervals k = 20. To avoid occurrence of local minima each optimization process was initialized 10 times with different weight values w between [−0.5, 0.5] and after short optimization the most promising solution has been converged to the final value. Each learning procedure was repeated 10 times and the average time required for convergence, the number of iterations and the final index value are reported in Tab. 1. Value of projection index referred in Tab. 1, both for QPC1 and QPC2, have been computed according to Eq. (1). Results presented in Tab. 1 show great improvement of QPC2 performance compared to the QPC1. The Wilcoxon’s signed-rank test [8] indicates significant difference of the average time used for computation at a confidence level of 99% (p-value of 0.0061) in favor of QPC2. Reduction of computation time occurs especially for the datasets with large number of instances like Abalone and Spam. Results for those data were excluded from statistical analysis to avoid dominance of these large values. Projections obtained from QPC2 provide good approximation of solutions that might be found by the full QPC1 index. In most cases improvement of performance involves only slight loss of quality of obtained solutions. Fig. 1 presents scatter plots generated
Fast Projection Pursuit Based on QPC
93
Table 1. Comparison of performance of the full (QPC1) and approximate optimization (QPC2) of the QPC index Data Set
Vec. Feat. Class
Appendicitis 106 Monk’s 1 124 Iris 150 Wine 178 Ionosphere 200 Sonar 208 Glass 214 Heart Statlog 270 L.Breast 277 Heart Cleveland 297 Voting 435 Breast Cancer W. 683 Australian Credit 690 P.I.Diabetes 768 Concentric Rings 800 Parity 10-bits 1024 Average Wilcoxon p-value Large data Abalone Spam
4177 4601
7 6 4 13 34 60 9 13 9 13 16 9 14 8 4 10
7 57
2 2 3 3 2 2 6 2 2 2 2 2 2 2 4 2
QPC1 Index Time −2 ×10 [s] 35.5 ± 0.2 3.6 ± 0.9 15.2 ± 0.9 3.7 ± 1.6 76.5 ± 0.1 2.0 ± 0.3 64.9 ± 0.0 3.7 ± 0.4 47.1 ± 0.2 16.6 ± 11.7 37.3 ± 0.4 27.4 ± 19.7 31.2 ± 0.0 5.0 ± 0.6 29.9 ± 0.3 20.3 ± 1.9 13.5 ± 0.1 14.5 ± 3.9 29.4 ± 0.2 28.7 ± 7.1 70.5 ± 5.1 136.2 ± 9.0 66.0 ± 0.0 65.8 ± 11.1 51.2 ± 0.1 54.3 ± 7.2 17.8 ± 0.0 68.9 ± 13.3 15.7 ± 0.2 49.2 ± 11.7 26.6 ± 0.0 32.1 ± 5.6 39.3 33.3
Iterations 163.0 ± 95.5 148.0 ± 71.7 46.5 ± 12.3 77.0 ± 4.2 213.0 ± 77.3 178.0 ± 20.3 84.5 ± 19.9 238.0 ± 44.3 217.5 ± 111.5 307.5 ± 156.6 855.0 ± 322.1 119.5 ± 26.5 138.5 ± 28.9 120.0 ± 21.9 101.0 ± 62.8 22.5 ± 6.8 189.3
QPC2 Index Time Iterations −2 ×10 [s] 32.3 ± 0.5 4.3 ± 0.6 111.0 ± 47.0 12.2 ± 1.4 3.9 ± 0.4 101.0 ± 34.5 75.6 ± 0.5 2.4 ± 0.1 58.0 ± 13.8 61.8 ± 0.6 4.0 ± 0.1 109.5 ± 15.2 41.6 ± 0.9 5.0 ± 0.2 110.0 ± 24.7 32.0 ± 0.5 7.8 ± 0.1 144.0 ± 10.7 28.3 ± 1.5 5.0 ± 0.5 117.0 ± 30.0 28.3 ± 0.5 6.8 ± 0.7 170.5 ± 47.4 10.6 ± 1.3 6.7 ± 0.8 107.5 ± 54.8 27.9 ± 0.5 7.8 ± 0.9 246.0 ± 42.0 81.4 ± 0.3 10.8 ± 0.6 214.0 ± 14.5 59.9 ± 1.4 8.8 ± 0.8 172.0 ± 60.9 49.9 ± 0.4 6.6 ± 0.3 89.5 ± 21.3 17.6 ± 0.1 6.9 ± 0.3 100.5 ± 19.1 15.2 ± 0.5 5.4 ± 1.1 75.0 ± 52.2 26.6 ± 0.0 17.7 ± 4.8 209.0 ± 243.7 37.6 6.9 133.4 0.0106 0.0061 0.0879
28 18.9 ± 0.1 3148.4 ± 609.8 184.0 ± 59.6 15.2 ± 0.2 29.8 ± 1.3 73.0 ± 13.4 2 26.2 ± 0.0 5260.5 ± 105.6 105.5 ± 2.8 25.3 ± 0.2 184.7 ± 4.1 102.0 ± 3.5
by projection of the data vectors on the first two directions w1ᵀx and w2ᵀx found by optimization of QPC1 and QPC2. The second direction w2 was found in the direction orthogonal to the first one. For the Australian Credit dataset a distinct separation between two groups of vectors is obtained; the first projection on w1 is sufficient to distinguish these two clusters. The Monk's 1 problem projected on the two-dimensional space generated by QPC2 revealed inherent relations in this artificial dataset with symbolic features, leading to almost complete separation of instances with opposite labels. For the 10-bit parity problem both approaches found correct projections on the diagonals of the hypercube representing the Boolean function. In the case of the Concentric Rings data the noise has been suppressed and the two-dimensional ring structure hidden in this data was recovered.

3.2 Comparison of Generalization

The QPC projection index may be used for the generation of new features that should reveal interesting aspects of the analyzed data. Such features may be beneficial for training almost any learning machine. Tab. 2 presents results obtained by training the Naive Bayes (NB) classifier with kernel density estimation on the problems used for performance testing. The first column contains results of NB trained on the original data. Each successive column represents results for NB trained on data projected on 1, 2 and 3 directions generated by the full (QPC1) index maximization and by its fast approximation (QPC2). Classification accuracy has been estimated using 10-fold stratified cross-validation repeated 10 times for each dataset and each method. To compare the generalization of the NB classifier trained with and without the initial QPC transformation, for each dataset the corrected resampled t-test was used [9], and significant differences (at significance level 0.05) are marked with dots (see Tab. 2).
Fig. 1. Examples of the first two projections found by maximization of the full QPC1 index (left) and the approximated QPC2 index (right) for the Australian credit, the Monk’s 1 problem, the 10-bit Parity and the Concentric Rings
Table 2. Average accuracy of the Naive Bayes with kernel density estimation in the 10x10 stratified CV test for the whole dataset and after training on dataset reduced to 1, 2 and 3 dimensions using two QPC versions

Data set | Naive Bayes | QPC1+NB (1) | QPC1+NB (2) | QPC1+NB (3) | QPC2+NB (1) | QPC2+NB (2) | QPC2+NB (3)
Appendicitis | 84.4 ± 10.2 | 87.4 ± 8.2 | 86.1 ± 8.8 | 84.9 ± 9.6 | 87.1 ± 8.9 | 86.0 ± 9.2 | 86.1 ± 9.0
Monk's 1 | 71.5 ± 11.3 | 71.3 ± 11.0 | 82.7 ± 13.9 • | 89.2 ± 9.4 • | 67.2 ± 12.7 | 82.9 ± 13.0 • | 87.9 ± 11.0 •
Iris | 95.7 ± 4.9 | 98.0 ± 4.0 | 95.9 ± 5.2 | 95.8 ± 5.2 | 96.9 ± 4.6 | 95.9 ± 5.2 | 96.0 ± 5.1
Wine | 97.7 ± 3.5 | 92.5 ± 5.8 ◦ | 96.2 ± 5.2 | 97.7 ± 3.7 | 91.6 ± 6.1 ◦ | 97.4 ± 4.0 | 97.6 ± 3.8
Ionosphere | 84.4 ± 7.9 | 79.9 ± 9.1 | 84.0 ± 7.8 | 85.4 ± 7.3 | 81.7 ± 9.1 | 83.2 ± 8.0 | 85.5 ± 7.5
Sonar | 75.8 ± 10.1 | 74.1 ± 10.4 | 75.4 ± 10.1 | 75.8 ± 9.3 | 73.3 ± 10.5 | 75.9 ± 10.4 | 76.5 ± 9.0
Glass | 60.3 ± 9.9 | 55.3 ± 8.3 | 56.0 ± 8.7 | 59.9 ± 8.9 | 54.8 ± 9.8 | 56.5 ± 9.7 | 59.1 ± 9.8
Heart Statlog | 79.8 ± 7.3 | 80.2 ± 7.2 | 82.8 ± 6.8 | 82.6 ± 7.2 | 80.5 ± 7.5 | 82.7 ± 7.0 | 83.0 ± 7.1
L.Breast | 72.7 ± 6.1 | 72.3 ± 5.3 | 72.6 ± 6.6 | 73.7 ± 6.4 | 70.6 ± 6.3 | 70.8 ± 7.2 | 70.6 ± 8.0
Heart Cleveland | 79.3 ± 7.3 | 80.7 ± 7.7 | 82.8 ± 6.9 | 82.7 ± 7.4 | 80.5 ± 7.1 | 83.1 ± 7.6 | 83.5 ± 7.2 •
Voting | 89.8 ± 4.7 | 95.4 ± 2.9 • | 95.1 ± 3.1 • | 94.7 ± 3.4 • | 95.3 ± 3.0 • | 94.7 ± 3.1 • | 94.4 ± 3.2 •
Breast Cancer W. | 96.7 ± 2.0 | 96.1 ± 2.1 | 97.0 ± 1.9 | 97.0 ± 1.9 | 95.7 ± 2.3 | 96.9 ± 1.9 | 97.2 ± 1.8
Australian Credit | 68.4 ± 6.0 | 85.3 ± 4.7 • | 85.5 ± 4.4 • | 86.2 ± 4.7 • | 85.4 ± 4.5 • | 85.4 ± 4.4 • | 85.8 ± 4.4 •
P.I.Diabetes | 73.6 ± 5.1 | 76.4 ± 4.4 • | 74.9 ± 4.5 | 73.9 ± 5.2 | 76.3 ± 4.5 | 73.9 ± 4.6 | 72.7 ± 5.1
Concentric Rings | 85.9 ± 3.6 | 64.0 ± 4.3 ◦ | 86.4 ± 3.8 | 86.7 ± 3.6 | 63.3 ± 4.4 ◦ | 84.9 ± 4.9 | 85.6 ± 4.0
Parity 10 bits | 44.4 ± 6.9 | 85.5 ± 10.3 • | 90.2 ± 8.9 • | 90.9 ± 7.7 • | 89.3 ± 11.2 • | 93.3 ± 7.7 • | 94.9 ± 6.6 •
Average | 78.8 | 80.9 | 84.0 | 84.8 | 80.6 | 84.0 | 84.8
Win/Tie/Lose | | 4/10/2 | 4/12/0 | 4/12/0 | 3/11/2 | 4/12/0 | 5/11/0
Wilcoxon NB vs. QPC+NB p-value | | 0.756 | 0.049 | 0.002 | 0.918 | 0.109 | 0.039
Wilcoxon QPC1+NB vs. QPC2+NB p-value | | | | | 0.121 | 0.776 | 0.717

• - statistically significant improvement, ◦ - statistically significant degradation
Features produced by QPC2 lead to accuracy similar to that of the full QPC1. The Wilcoxon signed-rank test shows no significant difference in the accuracy of NB trained on the first three directions obtained by the two QPC optimizations, giving p-values greater than 0.1 in all three cases (Tab. 2, last row). For all datasets the t-test also shows no significant differences in NB accuracy between the QPC1 and QPC2 transformations. In most cases NB trained on data projected on the first QPC direction produces results that are not significantly different from NB trained on the original data (10 ties obtained by the corrected resampled t-test at the 5% significance level). For 2 datasets the t-test shows a difference in accuracy in favor of the original NB, but for 4 datasets the QPC transformations have improved NB generalization. For NB trained on data projected on the first two directions no significant degradation of accuracy is noted in comparison with NB trained on the original dataset. The Wilcoxon signed-rank test confirms that there is no significant difference between the accuracy of NB trained on the first QPC projection and NB trained on the original data, and that there is a significant difference in favor of NB trained on data projected onto 2 or 3 dimensions obtained from the QPC index, both for QPC1 and QPC2. Thus a great reduction in dimensionality is obtained by using QPC features.
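For completeness, the sketch below shows a generic Wilcoxon signed-rank test with a normal approximation of the two-sided p-value, in the spirit of the comparison reported in the last rows of Tab. 2. It is a textbook implementation written for illustration, not the exact routine used by the authors (which may, for example, use exact tables for small samples); the toy accuracies in main() are arbitrary.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Two-sided Wilcoxon signed-rank test (normal approximation) for paired
// samples, e.g. per-dataset accuracies of two methods.
double wilcoxon_signed_rank_p(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> d;
    for (size_t i = 0; i < a.size(); ++i) {
        const double di = a[i] - b[i];
        if (di != 0.0) d.push_back(di);         // zero differences are dropped
    }
    const size_t n = d.size();
    if (n == 0) return 1.0;
    std::vector<size_t> idx(n);
    for (size_t i = 0; i < n; ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(), [&](size_t i, size_t j) {
        return std::fabs(d[i]) < std::fabs(d[j]);
    });
    std::vector<double> rank(n, 0.0);
    for (size_t i = 0; i < n; ) {               // average ranks for ties
        size_t j = i;
        while (j + 1 < n && std::fabs(d[idx[j + 1]]) == std::fabs(d[idx[i]])) ++j;
        const double r = 0.5 * (i + j) + 1.0;   // ranks are 1-based
        for (size_t k = i; k <= j; ++k) rank[idx[k]] = r;
        i = j + 1;
    }
    double w_plus = 0.0;
    for (size_t i = 0; i < n; ++i) if (d[i] > 0.0) w_plus += rank[i];
    const double mean = n * (n + 1) / 4.0;
    const double sd = std::sqrt(n * (n + 1) * (2.0 * n + 1) / 24.0);
    const double z = (w_plus - mean) / sd;
    return std::erfc(std::fabs(z) / std::sqrt(2.0));  // two-sided p-value
}

int main() {
    // Toy example: accuracies of two methods on a few datasets.
    std::vector<double> m1 = {84.0, 86.4, 95.9, 96.2, 84.0, 75.4};
    std::vector<double> m2 = {80.6, 84.9, 95.9, 97.4, 83.2, 75.9};
    std::printf("p = %f\n", wilcoxon_signed_rank_p(m1, m2));
    return 0;
}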
4 Discussion

The approximate version of the Quality of Projected Clusters projection pursuit method introduced in this paper greatly improves performance without degrading the quality of the results. As has already been stressed [10], separability is not the best goal of learning when problems are difficult; some intermediate tasks should be defined to derive
information that may help in finding optimal solutions. Many methods fail on difficult problems, such as the parity problem or the noisy concentric rings problem, but searching for a good linear projection direction, followed by simple one-dimensional non-linear functions that distinguish pure clusters after the projection, handles such problems without much effort. Therefore we are confident that such methods provide important computational intelligence tools. Projections found by QPC may be used to enhance data representation by expanding feature spaces (this was done in [11], where remarks on relations with kernel methods may be found). Each projection may also be implemented as a node in a hidden layer of a feedforward network. This may be either followed by a simple linear layer (as in multilayer perceptrons), or used only for initialization of weights. The prototypes obtained from QPC2 training may be used directly for classification as nearest prototype vectors, or used for initialization in any radial basis function method. The full QPC index has already been successfully applied to several constructive neural network architectures, including QPC-NN [6] and QPC-LVQ [5]. The QPC-NN method builds a neural network by optimizing the QPC index within the general sequential constructive scheme proposed in [12]. QPC-LVQ combines learning vector quantization [13], which maps local relations, with linear projections given by QPC, which handle non-local relations. The modification introduced in the previous section should considerably increase the performance of QPC-based networks without loss of their generalization powers. Results of all these procedures will be presented in a longer paper in the near future.

Acknowledgment. This work was supported by the Polish Ministry of Higher Education under research grant no. N N516 500539.
References

1. Friedman, J.H., Tukey, J.W.: A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. 23(9), 881–890 (1974)
2. Friedman, J.: Exploratory projection pursuit. Journal of the American Statistical Association 82, 249–266 (1987)
3. Duch, W.: K-separability. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 188–197. Springer, Heidelberg (2006)
4. Grochowski, M., Duch, W.: Projection Pursuit Constructive Neural Networks Based on Quality of Projected Clusters. In: Kůrková, V., Neruda, R., Koutník, J. (eds.) ICANN 2008, Part II. LNCS, vol. 5164, pp. 754–762. Springer, Heidelberg (2008)
5. Grochowski, M., Duch, W.: Constrained learning vector quantization or relaxed k-separability. In: Alippi, C., Polycarpou, M., Panayiotou, C., Ellinas, G. (eds.) ICANN 2009. LNCS, vol. 5768, pp. 151–160. Springer, Heidelberg (2009)
6. Grochowski, M., Duch, W.: Constructive Neural Network Algorithms that Solve Highly Non-Separable Problems. Studies in Computational Intelligence, vol. 258, pp. 49–70. Springer, Heidelberg (2010)
7. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
8. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics 1, 80–83 (1945)
9. Nadeau, C., Bengio, Y.: Inference for the generalization error. Machine Learning 52(3), 239–281 (2003)
10. Duch, W.: Towards comprehensive foundations of computational intelligence. In: Duch, W., Mandziuk, J. (eds.) Challenges for Computational Intelligence, vol. 63, pp. 261–316. Springer, Heidelberg (2007) 11. Maszczyk, T., Duch, W.: Support feature machines: Support vectors are not enough. In: World Congress on Computational Intelligence, pp. 3852–3859. IEEE Press, Los Alamitos (2010) 12. Muselli, M.: Sequential constructive techniques. In: Leondes, C. (ed.) Optimization Techniques. Neural Network Systems, Techniques and Applications, vol. 2, pp. 81–144. Academic Press, San Diego (1998) 13. Kohonen, T.: Self-organizing maps. Springer, Heidelberg (1995)
A New N-gram Feature Extraction-Selection Method for Malicious Code

Hamid Parvin, Behrouz Minaei, Hossein Karshenas, and Akram Beigi

School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran
{Parvin,B_Minaei,karshenas,Beigi}@iust.ac.ir
Abstract. N-grams are the basic features commonly used in sequence-based malicious code detection methods in computer virology research. The empirical results from previous works suggest that, while short n-grams are easier to extract, the characteristics of the underlying executables are better represented by lengthier n-grams. However, as the length of an n-gram increases, the feature space grows exponentially and much more space and computational resources are demanded; feature selection has therefore turned out to be the most challenging step in establishing an accurate detection system based on byte n-grams. In this paper we propose an efficient feature extraction method in which, in order to gain more information, both adjacent and non-adjacent bi-grams are used. Additionally, we present a novel boosting feature selection method based on a genetic algorithm. Our experimental results indicate that the proposed detection system detects virus programs far more accurately than the best earlier known methods.

Keywords: Malicious Code, N-gram Analysis, Feature Selection.
1 Introduction

The main aim of machine learning is to enhance the performance of the tasks of interest, and to this end it tries to find and exploit regularities in training data. Machine learning has the general goal of constructing computer programs that can automatically improve with experience. Detecting fraudulent credit card transactions is one of the successful applications of machine learning, and there are many others to which machine learning can successfully be applied [1]. The promising results obtained from applying machine learning techniques in many fields, especially in intrusion detection, have encouraged researchers to utilize them in the virus detection problem as well [2, 3]. The obvious advantage is that there is no need to go through the laborious process of building a database of virus signatures. Instead, a sample set of malicious and benign codes is used to train a classifier system, and the trained classifier is then used to evaluate new executables and detect malicious ones. Although researchers have only recently begun applying machine learning and data mining techniques to this field, quite interesting results have been obtained, which raises hope for further success in the near future. Prior to the classifier's training phase, the features of the data that best discriminate the various target classes of the problem should be extracted from the
set of all available features. The principal aim of this task is to reduce the dimensionality of the feature space as much as possible while keeping the semantics of the data intact. In this context, the features that best discriminate malicious codes from benign ones should be selected and used in the training and classification process. In the literature, researchers have proposed using a variety of different features for malicious code detection, such as binary profiling of files, string sequences, hex dumps [4] or a table representation of the file [5]. N-gram analysis, initially used in the fields of natural language processing and document search, is one of the most important techniques for feature extraction. Byte n-grams are overlapping substrings, collected in a sliding-window fashion, where a window of fixed size slides one byte at a time. The huge number of n-grams that often results from the feature extraction process makes them ineffective to be used directly in classification techniques. Therefore, a feature selection mechanism is inevitable. Several feature selection techniques applicable to n-gram features have been proposed [6]; information gain [7] and class-wise document frequency [8] are among the most important techniques in related works. In this paper we propose an efficient bi-gram extraction technique in which, in contrast to previous works that only use adjacent bytes, non-adjacent bytes are considered as well in order to capture more byte-dependency information. Once the feature extraction phase is over, a novel boosting feature selection technique based on a genetic algorithm is used to gradually select the most discriminating bi-grams as the final features to be used in classifier training and classification. The paper proceeds by giving a brief theoretical background of the issues discussed here. Our proposed method is explained in Section 3. Section 4 presents the results obtained through the experiments. Section 5 concludes the paper.
2 Background

As a pioneer, Cohen [6] carried out the first major theoretical study of viruses. He defined a virus as a program that can infect other programs by modifying them to include a possibly evolved copy of itself [7, 8]. He used a Turing machine to introduce the notion of viral sets and to formalize a virus as a word on a Turing machine tape with the ability to duplicate or mutate when it is activated in a suitable environment. Using this theoretical basis he showed that virus detection is, in general, an undecidable problem [9, 10]. Adleman [10] proposed the definition of a wider class of attacks, namely computer infections or malware. He defined several properties for a program and, using these properties, defined different kinds of programs and viruses. Filiol [9] defined malware as a simple or self-reproducing offensive program that is installed in an information system without the user's knowledge in order to violate the confidentiality, integrity or availability of that system, or that is susceptible of falsely incriminating the owner or user of the program in a computer offense. Kolter and Maloof [11] referred to malicious code as any code added, changed or removed from a software system to intentionally cause harm or subvert the system's intended function. Reddy and Pujari [7] defined a computer virus to be a code that recursively replicates a possibly evolved copy of itself.
As can be observed, all of the proposed definitions share a key feature: the ability of a virus to self-replicate (hence its name, in correspondence to biological viruses). In order to enable a classifier to classify new samples, it must pass a learning phase in which it is trained with a set of training samples. Each instance in the sample set is represented by the values of a number of features. Indeed, these feature values describe different instances of the domain we are dealing with. For example, features like skin color, height, and gender with values yellow, 160 centimeters and female, respectively, possibly describe a woman living in the Far East. After the training phase, to approximate the classification accuracy of a classifier, it is evaluated using an unseen test data set. Usually a single data set is partitioned into two parts to provide data for both training and testing. K-fold cross-validation is a method that divides the data set into k separate parts, and uses k-1 of these as the training set and the remaining partition for testing. This process is repeated k times, each time with a different partition as the test set. At the end, classification accuracy is averaged over all runs. In the context of malicious code detection, n-grams are byte sequences extracted from binary executables which represent certain characteristics of the implemented code. The length of an n-gram is an important parameter that affects all phases of a classifier-based detection system, including feature extraction, feature selection and classification. For example, when n-grams of two bytes are extracted, irrespective of which executable is processed, a total of 2^16 different features are possible. Obviously, as the length of the n-gram increases, the number of different possible features grows exponentially. To cope with this difficulty a second parameter is introduced to the n-gram extraction process which controls the upper bound on the number of different n-grams that can be extracted from an input file. This upper bound is often referred to as the profile size [5, 7, 11]. The n-gram extraction process starts by sliding a window of size n over the input file, and taking the appropriate action for each value spotted in the window according to one of the following strategies:
• If the spotted value has been seen before, its count is increased.
• Otherwise, the spotted value is treated as a new n-gram and its count is set to 1.
The sliding window starts from the beginning of the input file and each time moves a single byte toward the end of the file. The same process is repeated for all of the executables given in the sample set. In this way a vast number of different n-grams with different frequencies are recorded, and thus a selection mechanism should be employed; otherwise the practicality of the final detection system will be questionable. But to allow the selection strategy to select the best discriminating n-grams among all extracted n-grams, "group of files" statistics are also needed in addition to the simple per-file statistics of n-grams. The selected n-grams will be used as sample features that feed the classifier. It is important to note that this feature selection is performed only once, during the training process of the classifier, and according to the statistical information gained from the training samples.
After that and while deploying the detection system based on the classifier, the n-gram extraction mechanism only searches for the selected features in the file under test.
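To make the sliding-window process concrete, the following sketch counts byte n-grams of a given length and admits new distinct n-grams only while a profile-size bound has not been reached; this is one plausible reading of that bound, and the code, container choices and names are purely illustrative.

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Counts byte n-grams of a given length with a sliding window that moves one
// byte at a time. profile_size caps the number of *distinct* n-grams kept
// (one plausible interpretation of the bound described in the text).
std::map<std::string, int> extract_ngrams(const std::vector<uint8_t>& bytes,
                                          size_t n, size_t profile_size) {
    std::map<std::string, int> counts;
    if (bytes.size() < n) return counts;
    for (size_t i = 0; i + n <= bytes.size(); ++i) {
        std::string gram(reinterpret_cast<const char*>(&bytes[i]), n);
        auto it = counts.find(gram);
        if (it != counts.end()) {
            ++it->second;                       // previously seen: increase count
        } else if (counts.size() < profile_size) {
            counts[gram] = 1;                   // new n-gram: count starts at 1
        }
    }
    return counts;
}

int main() {
    std::vector<uint8_t> file = {0x4D, 0x5A, 0x90, 0x00, 0x4D, 0x5A, 0x90, 0x03};
    auto bigrams = extract_ngrams(file, 2, 1000);
    std::printf("distinct bi-grams: %zu\n", bigrams.size());
    return 0;
}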
Many feature selection methods have been proposed for problems where the feature space has many dimensions. Some of these methods are particularly recognized as applicable in the virus detection field. Document frequency-based and information gain-based feature selection methods are among the most important ones proposed in recent works [7, 8]. Virus detection is often considered a two-class classification problem, i.e. with virus and benign classes. In both of the previous methods, the goal is to discriminate a favored class from the other class. The document frequency-based feature selection method tries to select features that are more frequently present in the favored class, and afterwards applies the same approach to the other class to select further features. The final set of features results from the union of the two previously selected feature sets. In the information gain-based feature selection method, features are sorted in descending order and those with the highest information gain are selected. The information gain metric measures the degree of correlation between a feature and the class labels: the higher the information gain of a feature, the better it can discriminate the classes. Our proposed method is experimentally compared with these methods and it is shown that it detects viruses far more accurately.
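The information gain referred to above is the standard IG = H(class) - H(class | feature) score; a minimal sketch for a binary (present/absent) n-gram feature and two classes is shown below. It is a generic implementation written for illustration, not code taken from the cited works, and the function names are ours.

#include <cmath>
#include <cstdio>
#include <vector>

static double entropy2(double p) {              // entropy of a Bernoulli(p) class label
    if (p <= 0.0 || p >= 1.0) return 0.0;
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// feature_present[i] = 1 if the n-gram occurs in file i; label[i] = 1 if malicious.
double information_gain(const std::vector<int>& feature_present, const std::vector<int>& label) {
    const double n = static_cast<double>(label.size());
    double n1 = 0, n_pos = 0, n1_pos = 0;
    for (size_t i = 0; i < label.size(); ++i) {
        n1 += feature_present[i];
        n_pos += label[i];
        if (feature_present[i]) n1_pos += label[i];
    }
    const double n0 = n - n1;
    const double n0_pos = n_pos - n1_pos;
    const double h_class = entropy2(n_pos / n);
    double h_cond = 0.0;
    if (n1 > 0) h_cond += (n1 / n) * entropy2(n1_pos / n1);
    if (n0 > 0) h_cond += (n0 / n) * entropy2(n0_pos / n0);
    return h_class - h_cond;                    // IG = H(class) - H(class | feature)
}

int main() {
    std::vector<int> present = {1, 1, 1, 0, 0, 0};
    std::vector<int> label   = {1, 1, 0, 0, 0, 1};
    std::printf("IG = %f\n", information_gain(present, label));
    return 0;
}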
3 The Proposed Method

As explained earlier, to obtain a good classification, representative features that can properly discriminate the data samples should be selected. When n-gram analysis is used to extract such features, the first critical decision is the proper choice of the n-gram length. Single-byte n-grams lack the information implicitly contained in combinations of bytes and thus prove ineffective at revealing the necessary information. Two-byte n-grams are able to capture the dependencies of adjacent bytes to some extent. By increasing the n-gram length, the number of different combinations of byte values that can be found grows exponentially, which allows capturing even more intrinsic characteristics of the executables; however, this exponential growth also affects the computational and memory requirements of the feature extraction process. It is believed that n-grams of smaller lengths contain the statistics of larger n-grams to a large extent and thus a length of two bytes should do the work [7]. However, in practice, using larger n-gram lengths, e.g. a length of four [5, 7], has shown better results. In reality, we are confronted with a trade-off between having better features and reducing the computational cost of the algorithm in order to make it feasible. As a way to deal with this difficulty, we propose a new method for n-gram extraction. It considerably decreases the number of potential combinations of bytes while also enabling the analysis of combinations of non-adjacent bytes. In this method only certain bytes of the sliding window used in n-gram extraction are taken as n-grams and the rest are not considered. In our experiments we have only taken the first and the last bytes of the sliding window to constitute the n-grams. This allows capturing non-adjacent dependencies between byte sequences while keeping the maximum number of possible combinations constant. To enrich the set of possible features used in the classification, different window sizes from 2 to 6 bytes are used; a small sketch of this extraction step is given below.
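The following is our own sketch of that extraction step, assuming the bi-gram is always formed from the first and last byte of each window of size 2 to 6; the container choices and names are illustrative, not the authors' implementation.

#include <array>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Extracts the five bi-gram streams described in the text: for window sizes
// 2..6 only the first and the last byte of the window form the bi-gram, so
// the gap between the two bytes ranges from 0 to 4.
std::array<std::unordered_map<uint16_t, int>, 5>
extract_gapped_bigrams(const std::vector<uint8_t>& bytes) {
    std::array<std::unordered_map<uint16_t, int>, 5> streams;
    for (size_t w = 2; w <= 6; ++w) {                 // one pass per window size
        auto& counts = streams[w - 2];
        if (bytes.size() < w) continue;
        for (size_t i = 0; i + w <= bytes.size(); ++i) {
            const uint16_t key = static_cast<uint16_t>((bytes[i] << 8) | bytes[i + w - 1]);
            ++counts[key];                            // first and last byte of the window
        }
    }
    return streams;
}

int main() {
    std::vector<uint8_t> file = {0x4D, 0x5A, 0x90, 0x00, 0x03, 0x4D, 0x5A, 0x90};
    auto streams = extract_gapped_bigrams(file);
    for (size_t g = 0; g < streams.size(); ++g)
        std::printf("gap %zu: %zu distinct bi-grams\n", g, streams[g].size());
    return 0;
}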
With window sizes from 2 to 6 bytes, the gap between the n-gram's bytes thus ranges from zero to four bytes. Clearly, extracting these n-grams with the different gap sizes requires five consecutive passes over the input file. However, because of the low memory requirement, this is much faster than extracting n-grams of larger lengths, e.g. four, as used in previous works. Having extracted five n-gram streams of length two (bi-grams), each with a different gap size, from every executable in the training set, a new selection mechanism is applied to select a subset of these n-grams as the final sample features. Because of its general-purpose functionality, the genetic algorithm [12] has proved to be a good search and optimization mechanism for domains with unknown structure. The selection strategy used in this paper employs a genetic algorithm to search the space of all possible n-gram combinations for the combination that best discriminates benign from malicious executables (the two classes that are important for us). An important notion used in this selection mechanism is the reference vector. We define a reference vector to consist of m sub-vectors, where each sub-vector is a binary vector with a size equal to the number of all possible combinations of n byte values (i.e. a size of 2^(8n) entries). This reference vector is later used as the base representation for the input samples as well as the reference "variable set" from which the chromosome's genes in the GA are taken. In this work we use a reference vector with m = 5 and n = 2, resulting in five binary sub-vectors of 2^16 entries each. The input files (samples) are initially represented by an instance of the reference vector that corresponds to the n-gram streams extracted from them. A value of 1 in an entry of this vector means that the n-gram represented by its index is present in the corresponding input file. We use a genetic algorithm with a binary representation whose variables indicate whether or not a specific bi-gram is included in the chromosome. In order to simplify the chromosomes and to increase the efficiency of the genetic algorithm, the bi-grams represented in the reference vector are processed in groups of 1000 at a time. First, the fittest bi-grams of the first 1000 bi-grams of the first sub-vector are selected by the GA; after that the first 1000 bi-grams of the second sub-vector are considered, then the first 1000 bi-grams of the third sub-vector and so on. When the last sub-vector is reached, the algorithm returns to the second 1000 bi-grams of the first sub-vector, and the whole process is repeated until the desired number of bi-grams has been obtained as selected features. The fitness function used to evaluate the population of chromosomes in the GA takes the following general form:
f_C(\text{chromosome}) = A_C - A_{C'}    (1)
where A_C is the average number of times the features included in the chromosome (those genes having the value 1) appear in the files of class C (i.e. malicious or benign), and A_{C'} is the average number of times these features appear in the files of the complement class, C'. Class C can be either benign or malicious, and C' is then the other one. Once the groups of 1000 bi-grams of each sub-vector have been processed and the optimal set of features discriminating class C from class C' has been selected, the whole process (the genetic algorithm) is run once again to obtain the optimal set of features discriminating class C' from class C.
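A hedged sketch of how the fitness of equation (1) could be evaluated is given below. We read A_C as the per-file average of the total occurrences of the selected features in class C; the paper does not spell out the averaging in more detail, so this reading, the names and the toy data are our own assumptions.

#include <cstdio>
#include <vector>

// Fitness f_C(chromosome) = A_C - A_C' (equation (1)). counts[f][j] holds how
// many times bi-gram j of the current 1000-feature block occurs in file f;
// in_class_C[f] tells whether file f belongs to the favored class C.
double fitness(const std::vector<int>& chromosome,              // 0/1 genes
               const std::vector<std::vector<int>>& counts,
               const std::vector<bool>& in_class_C) {
    double sum_C = 0.0, sum_Cp = 0.0;
    int n_C = 0, n_Cp = 0;
    for (size_t f = 0; f < counts.size(); ++f) {
        double total = 0.0;
        for (size_t j = 0; j < chromosome.size(); ++j)
            if (chromosome[j]) total += counts[f][j];           // only selected features
        if (in_class_C[f]) { sum_C += total; ++n_C; }
        else               { sum_Cp += total; ++n_Cp; }
    }
    const double A_C  = (n_C  > 0) ? sum_C  / n_C  : 0.0;
    const double A_Cp = (n_Cp > 0) ? sum_Cp / n_Cp : 0.0;
    return A_C - A_Cp;
}

int main() {
    std::vector<std::vector<int>> counts = {{3, 0, 1}, {2, 1, 0}, {0, 4, 0}};
    std::vector<bool> in_C = {true, true, false};
    std::vector<int> chrom = {1, 0, 1};
    std::printf("fitness = %f\n", fitness(chrom, counts, in_C));
    return 0;
}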
Fig. 1. The proposed Virus Detector Scheme
Therefore the role of C and C' changes in each of these runs. [By introducing a weighting mechanism, the influence of well-separated data points can be reduced in the scoring procedure. When two data points have the same weight, a positive score is added to the chromosome score if the data point belongs to the benign files, and a negative score otherwise.] To let the feature selection algorithm decide when enough features have been selected, a data clustering is performed; if the clustering accuracy reaches a predefined threshold, which we call Tc, the algorithm terminates. The best place to perform this evaluation is after each group of 1000 n-grams has been processed. For more detailed information about the proposed method, see Fig. 1 and the pseudo code of the described algorithm in Fig. 2. The main steps are:

1. All bi-gram features are extracted from both virus and benign files.
2. The selection probabilities of all initially extracted vectors are set to one.
3. The new-space (the set of selected features) is initialized to an empty set.
4. The following steps are repeated as long as the clustering accuracy in the new-space keeps increasing:
5. According to the selection probabilities and the SUS algorithm, a subset of data is selected as the input of the genetic algorithm.
6. The GA is run and some new features are selected.
7. The new-space is updated according to the features selected by the GA.
8. The data is mapped to the new-space.
9. The selection probabilities are updated according to the results of the clustering on the newly mapped data.

At the beginning, adjacent and various non-adjacent (with gaps ranging from one to four) bi-grams are extracted according to the method discussed in Section 3.2. As our proposed method tries to select the most discriminative features in an evolutionary manner, a boosting scheme has been employed. In general, boosting methods put more emphasis on data samples that have not been adequately learned in previous iterations. This method is another version of the boosting method called arc-x4, described in [13].
Feature Selection Algorithm:
  Malicious_Features = Extract_Feature(Malicious_Files)
  Benign_Features = Extract_Feature(Benign_Files)
  Data_Number = Malicious_Files_Number + Benign_Files_Number
  P_Select(1..Data_Number) = 1
  Current_Accuracy = 0
  Feature_Selected_Sofar = {}
  Repeat
    Previous_Accuracy = Current_Accuracy
    [Malicious_Subset, Benign_Subset] = SUS_Selection(Malicious_Features, Benign_Features, P_Select)
    Feature_Selected_Now = Run_GA over the next 1000 features of the selected data
    Feature_Selected_Sofar = Feature_Selected_Sofar ∪ Feature_Selected_Now
    Newdata = Project_to_New_Feature(Malicious_Features, Benign_Features, Feature_Selected_Sofar)
    [Result, Current_Accuracy] = K-means(Newdata)
    Update P_Select according to Result of K-means
  Until (Current_Accuracy <= Previous_Accuracy)
Fig. 2. Feature Selection Algorithm
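The pseudo code above calls SUS_Selection; for reference, a generic sketch of stochastic universal sampling over the selection-probability vector is shown here. It is a textbook version written for illustration and may differ from the authors' routine in its details; the weights in main() are arbitrary.

#include <cstdio>
#include <cstdlib>
#include <vector>

// Stochastic universal sampling: draws k indices with probability proportional
// to the given weights, using k evenly spaced pointers over the cumulative sum.
std::vector<size_t> sus_select(const std::vector<double>& weights, size_t k) {
    double total = 0.0;
    for (double w : weights) total += w;
    std::vector<size_t> picked;
    if (total <= 0.0 || k == 0) return picked;
    const double step = total / k;
    const double start = step * (std::rand() / (RAND_MAX + 1.0));
    double cumulative = 0.0;
    size_t i = 0;
    for (size_t p = 0; p < k; ++p) {
        const double pointer = start + p * step;
        while (cumulative + weights[i] < pointer && i + 1 < weights.size()) {
            cumulative += weights[i];
            ++i;
        }
        picked.push_back(i);
    }
    return picked;
}

int main() {
    std::vector<double> p_select = {1.0, 1.0, 0.5, 2.0, 1.5};  // boosting probabilities
    std::vector<size_t> subset = sus_select(p_select, 3);
    for (size_t idx : subset) std::printf("selected sample %zu\n", idx);
    return 0;
}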
In the proposed feature selection method, the input vectors of the GA are selected according to a boosting probability vector. As can be seen in the pseudo code above, all entries of this vector are initially set to one. The way each entry is updated for the next iteration depends on the scale of an adaptive value. The adaptive value is computed from the distance of the corresponding data point to the cluster centroid dominated by data samples having the same class label as that point. Next, a clustering is performed over the data samples selected according to the probability vector, mapped into the feature space selected so far. This clustering provides cluster centroids which are then used for the final clustering of all data samples. According to the results of the latter clustering, the clustering accuracy is estimated and the probability vector is updated to be used in the next iteration. The whole algorithm continues as long as the clustering accuracy does not decrease in the following iteration. At the end of the feature selection process, a set of promising features Xs is obtained that has resulted in a clustering accuracy higher than Tc. Using this set, each data sample is assigned a binary vector of length |Xs| that specifies which of the selected features are present in the corresponding executable. The input data set is thus simplified to a table of binary vectors. These vectors serve as the final inputs to the classifier system. Because of the inherent unstructured nature of the virus detection problem using byte sequences, we preferred to use artificial neural networks (ANNs) as our classifier system. NNs are well-recognized classifiers, known for their capability to cope with unclearly specified problem domains. Artificial neurons are presented as models of biological neurons and as conceptual components of circuits that can perform computational tasks. The training phase of a typical NN consists of adjusting the weights of the links between the neurons of the network. An algorithm based on the least mean square error is used for training, which tries to minimize the classification error of the network on the training samples. As mentioned earlier, a classifier ensemble is an approach to classification that tries to obtain better results by combining several classifiers. One of these methods is boosting, in which several versions of the same classifier are
trained on different areas of the input domain. In this method a classifier is trained on the sample data, and if its test error is not satisfactory a second classifier is trained on the erroneous data points. This process is repeated until the test error rate of the final classifier decreases to a satisfactory level. The final classifier ensemble is the combination of all trained classifiers used together.
4 Experiment Results

The data samples used in our work consist of 411 malicious and 416 benign executables. The malicious programs are Win32 malware taken from the VX Heavens site [14] and the benign executables are taken from the system32 folder of the Microsoft Windows operating system. To avoid any harmful effect, the n-gram extraction process is done on the SUSE Linux operating system. The resulting n-gram streams are then used in the Matlab software from MathWorks for selection and classification. The genetic algorithm uses binary tournament selection and uniform crossover, with each parent having a 50% chance of transmitting its genes, and a crossover probability of 0.8. Mutation is performed with a probability of 0.01. A complete replacement strategy is used to incorporate new offspring into the population. The termination criteria are set to 200 generations of a changing population or 20 generations of an almost stable population. A population size of 5000 individuals is used for evolution. We used the simple k-means algorithm as the clustering technique in this work to evaluate the features selected for classification, with an accuracy threshold (Tc) of 80%. After selecting the appropriate features, the complete data set of benign and virus executables is divided into training and test parts using 4-fold cross-validation. To examine the effect of the extracted n-grams with different gap sizes on the classification accuracy, they are compared with the case when only adjacent bytes are used. Fig. 2 shows the result of this experiment in terms of ROC curves. As can be seen, the classifier using this type of features completely dominates the one using only adjacent two-byte n-grams. The results presented in this figure are generated when we use the proposed selection mechanism. In Fig. 3 (Left), the green curve represents the ROC of classification based on adjacent bi-gram features selected by the document frequency-based method, which is known to have the best reported results. The blue curve represents the ROC of classification based on adjacent and non-adjacent bi-gram features selected by the proposed method. In both cases the same number of features is selected. As can be observed, the proposed method outperforms this method in terms of classification accuracy. Fig. 3 (Middle) depicts the ROC curves comparing the proposed method with the document frequency-based feature selection method [7] when non-adjacent features are also considered by that method. The green curve represents the ROC of classification based on both adjacent and non-adjacent bi-gram features selected by the document frequency-based method. The blue curve represents the ROC of classification based on adjacent and non-adjacent bi-gram features selected by the proposed method. In both cases the same number of features is selected. As can be observed, the proposed method outperforms this method in terms of classification accuracy. As is obvious from these figures, the inclusion of non-adjacent features significantly improves the classification accuracy of the document frequency-based method.
Fig. 3. ROCs of the proposed method versus the document frequency-based method (Left). ROCs of the proposed method versus the document frequency-based method using the same set of extracted features (Middle). ROCs of the proposed method versus the information gain-based method using the same set of extracted features (Right).
In Fig. 3 (Right), the green curve represents the ROC of classification based on both adjacent and non-adjacent bi-gram features selected by the information gain-based method, which is known as one of the most accurate methods in the virology literature [8]. The blue curve represents the ROC of classification based on adjacent and non-adjacent bi-gram features selected by the proposed method. In both cases the same number of features is selected. As can be observed, the proposed method outperforms this method in terms of classification accuracy.
5 Conclusion and Future Works

A new feature extraction technique based on n-gram analysis was proposed in this paper that uses non-adjacent sequences of bytes with different gap sizes, in addition to adjacent bytes, to extract better features for virus code detection. The proposed technique can capture important dependencies between non-adjacent byte sequences, while it does not require the high space and computational costs of extracting n-grams of larger sizes. The presented experimental results have also confirmed the usefulness of this type of feature extraction. Accompanying this feature extraction technique, a new feature selection method based on genetic algorithms was also proposed. Using a reference vector, the extracted bi-grams are processed in a predefined order. After every couple of runs of the genetic algorithm a data clustering is performed to decide whether enough features have been selected. The set of selected features is finally used to represent data samples as binary vectors. These vectors are fed to a classifier that performs the main classification of executables into benign or malicious. Several experiments were conducted to evaluate the efficiency of the proposed method. The results suggest that the proposed method significantly outperforms the best earlier known methods in the virology field. It demonstrates improvements in terms of both feature extraction and feature selection.
References 1. Mitchell, T.: Machine Learning. Prentice Hall, Englewood Cliffs (1997) 2. Schultz, M., Eskin, E., Zadok, E., Stolfo, S.: Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 38–49 (2001)
3. Abou-Assaleh, T., Cercone, N., Keselj, V., Sweidan, R.: Detection of new malicious code using n-grams signatures. In: PST, pp. 193–196 (2004) 4. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Fisher, D.H. (ed.) Proceedings of ICML 1997, 14th International Conference on Machine Learning, Nashville, pp. 412–420 (1997) 5. Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 470–478 (2004) 6. Cohen, F.: Computer Viruses - Theory and Experiments. IFIP-TC11 Computers and Security 6, 22–35 (1987) 7. Reddy, D.K.S., Pujari, A.K.: N-gram analysis for computer virus detection. Journal in Computer Virology 2(3), 231–239 (2006) 8. Morin, B., Mé, L.: Intrusion detection and virology: an analysis of differences, similarities and complementariness. Journal of Computer Virology, vol 3, 39–49 (2007) 9. Filiol, E.: Computer viruses: from theory to applications. Springer, New York (2005) 10. Adleman, L.M.: An Abstract Theory of Computer Viruses. In: Goldwasser, S. (ed.) CRYPTO 1988. LNCS, vol. 403, pp. 354–374. Springer, Heidelberg (1990) 11. Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. The Journal of Machine Learning Research 7, 2721–2744 (2006) 12. Minaei-Bidgoli, B., Kortemeyer, G., Punch, W.F.: Optimizing Classification Ensembles via a Genetic Algorithm for a Web-Based Educational System. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A.C., de Ridder, D. (eds.) SSPR&SPR 2004. LNCS, vol. 3138, pp. 397–406. Springer, Heidelberg (2004) 13. Breiman, L.: Arcing classifiers. The Annals of Statistics 26(3), 801–823 (1998) 14. http://vx.netlux.org/
A Robust Learning Model for Dealing with Missing Values in Many-Core Architectures Noel Lopes1,2 and Bernardete Ribeiro1,3 1
CISUC - Center for Informatics and Systems of University of Coimbra, Portugal 2 UDI/IPG - Research Unit, Polytechnic Institute of Guarda, Portugal 3 Department of Informatics Engineering, University of Coimbra, Portugal
[email protected],
[email protected]

Abstract. Most classification algorithms (e.g. support vector machines, neural networks) cannot directly handle Missing Values (MV). A common practice is to rely on data pre-processing techniques, using imputation or simply removing instances and/or features containing MV. This seems inadequate for various reasons: the resulting models do not preserve the uncertainty, these techniques might inject inaccurate values into the learning process, the resulting models are unable to deal with faulty sensors, and data in real-world problems is often incomplete. In this paper we look at the Missing Values Problem (MVP) by extending our recently proposed Neural Selective Input Model (NSIM), first, to a novel many-core architecture implementation and, second, by validating our method in a real-world financial application. The NSIM encompasses different transparent and bound (conceptual) models, according to the multiple combinations of missing attributes. The proposed NSIM is applied to bankruptcy prediction of (healthy and distressed) French companies, yielding much better performance than previous approaches using pre-processing techniques. Moreover, the Graphics Processing Unit (GPU) implementation drastically reduces the time spent in the learning phase, making the NSIM an excellent choice for dealing with the MVP.

Keywords: Missing values, Neural Networks, GPU.
1 Introduction
Pattern recognition is an important area of research in the Machine Learning (ML) field, with a respectable and long history [11]. In particular, classification has received a great deal of attention from researchers. As a result, a large number of algorithms and approaches have been developed [5], supporting the emergence of successful real-world applications in a wide range of domains [4,3,12]. Classification algorithms attempt to discover the underlying relationship between a set of input variables and the desired (target) classes, based on a pool of instances (samples) that typically covers only a small portion of the input space. Thus, the quality of the resulting models (classifiers) depends not only on the algorithms being used, but also on the quality and quantity of the available data. Moreover, algorithms are usually designed based on the assumption that
the data does not contain missing and/or invalid values. However, in practice, the data samples obtained for many real-world problems are often incomplete and may contain a large number of unknown (missing) values. This is backed up by the fact that almost half (45%) of the datasets in the UCI Machine Learning Repository (widely used for benchmarking) contain missing values [4]. Thus, the ability to cope with missing data has become a fundamental requirement for pattern classification. Failure to handle missing data properly will most likely result in large errors and bad generalization [4]. Examples of situations where data may contain missing values include: survey questionnaires, where people typically leave questions unanswered; industrial experiments, where mechanical/electronic failures may happen during the data acquisition process; and medical diagnosis, where different patients perform different tests [4]. The remainder of the paper is organized as follows. In the next section we describe several techniques that are being used to handle the missing values problem (MVP) and present the contributions of our approach. In Section 3 we describe the proposed NSIM method. Section 4 presents its GPU implementation and Section 5 the results obtained in a real-world problem of bankruptcy prediction of French companies, using a dataset with thousands of samples with MV. Finally, in Section 6 conclusions and future work are addressed.
2 Related Work with the Missing Values Problem (MVP)
Since many classification algorithms (e.g. Support Vector Machines (SVMs), Neural Networks (NNs)) cannot directly handle MV, a common practice is to rely on data pre-processing techniques to deal with them. Usually, this is accomplished by using imputation or simply by removing instances (samples) and/or features (attributes, variables) containing missing values [4,1,9,2,13,5]. A review of the methods and techniques to deal with this problem, including a comparison of some well-known methods, can be found in Laencina et al. [4]. Removing features and/or instances containing a large fraction of missing values is a common and appealing approach for dealing with MV, because it is a simple process and reduces the dimensionality of the data (therefore potentially reducing the complexity of the problem). However, for some problems the number of instances available is reduced and removing samples with missing values is simply not affordable. Furthermore, if the instances (with missing observations) eliminated are not similar to the remaining patterns (instances), the resulting models could present bad generalization performance [9]. Likewise, removing features assumes that their information is either irrelevant or it can be compensated by other variables. However, this is not always the case and features containing missing values may have vital (critical) information which cannot be compensated by the information contained in the other features. An alternative for deleting data containing missing values consists of estimating their values. Many algorithms have been developed for this purpose (e.g. weighted k-nearest neighbor approach, Bayesian principle component analysis, local least squares) [4,13,1]. However, wrong estimates of crucial variables can substantially weaken the capacity of generalization of the resulting model and
originate unpredicted and potentially dramatic results. Moreover, models created using imputed (estimated) data treat missing values as if they were the real ones (albeit their value is not known); therefore, the resulting conclusions do not reflect the uncertainty produced by the absence of such values. Furthermore, statistically, the variability or correlation estimates can be strongly biased [9]. Multiple imputation techniques (e.g. metric matching, Bayesian bootstrap) take into account the variability produced by the absence of missing values, by replacing every missing value by two or more acceptable (plausible) values, representing a distribution of possibilities [9]. However, multiple imputation may drastically increase the size of the datasets and therefore the complexity of the problems, in particular when the number of missing values is high. Moreover, although the variability is taken into account, missing values will still be treated as if they were real. Furthermore, imputation methods were conventionally developed and validated under the assumption that missing values occur in a random manner. However, this assumption does not always hold in practice. In particular, in microarray experiments the distribution of missing entries is highly non-random due to technical and/or experimental conditions [13]. Recently we presented a new method that empowers the well-known Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms with the capacity of directly handling missing values [8]. Instead of relying on data pre-processing techniques, the proposed method creates a Neural Selective Input Model (NSIM) that accommodates different transparent and bound NN models, providing support for handling missing values efficiently. Unlike other methods, the generated models take into account and reflect the uncertainty caused by unknown values. In this paper we extend our work by providing a Graphics Processing Unit (GPU) implementation of the NSIM. Furthermore, we also show that applying the NSIM to the financial setting of the French companies enhances the bankruptcy prediction model by increasing its performance. The motivation is twofold: first, GPUs have proven able to decrease considerably the long training times associated with NNs [7]. Thus, extending the GPU implementation described in Lopes and Ribeiro [7] is fundamental to overcome one of the main limitations of NNs (their long training times) when applying the proposed method. Second, although the results previously obtained on several benchmarks were excellent [8], applying the NSIM to a real-world problem is important not only to validate it but also to demonstrate its usefulness.
3 Neural Selective Input Model (NSIM)
The building blocks of the NSIM are the selective activation (actuation) neurons, whose importance (contribution) to the NN depends on the pattern (stimulus) being presented [6]. For each neuron k, an importance factor, m_k^p, is used to define its relevance and contribution when pattern p is presented to the network. Its output, y_k^p, is given by (1):

y_k^p = m_k^p F_k(a_k^p) = m_k^p F_k\Big( \sum_{j=1}^{N} w_{jk} y_j^p + \theta_k \Big),    (1)
where F_k is the neuron activation function, a_k^p its activation, \theta_k the bias and w_{jk} the weight of the connection between neuron j and neuron k. The farther from zero m_k^p is, the more important the neuron's contribution becomes. On the other hand, a value of zero means the neuron is completely irrelevant for the network output, and one can interpret such a value as if the neuron were not present in the network. Considering the BP algorithm, the input weights associated with these neurons are updated using the same rule that is used for standard neurons, i.e. after presenting a given pattern p to the network, the weights are adjusted by (2):

\Delta^p w_{jk} = \gamma \delta_k^p y_j^p + \alpha \Delta^q w_{jk},    (2)

where \gamma is the learning rate, \delta_k^p the local gradient of neuron k, \Delta^q w_{jk} the weight change of w_{jk} for the last pattern q, and \alpha the momentum term. However, the equations of the local gradient for the output, o, and hidden, h, neurons, given respectively by (3) and (4), differ from the standard neuron equations (unless the importance factor is considered to be constant and equal to 1):

\delta_o^p = (d_o^p - y_o^p)\, m_o^p F_o'(a_o^p),    (3)

\delta_h^p = m_h^p F_h'(a_h^p) \sum_{o=1}^{N_o} \delta_o^p w_{ho}.    (4)
Let V_i be a random variable with Bernoulli distribution representing the act of obtaining the value of x_i (V_i ~ Be(p_i)). To deal with missing values we propose transforming the values of x_i by taking V_i into account, as shown in (5):

x_i' = f(x_i, V_i).    (5)

This transformation can be carried out by a neuron, k, with selective activation (named a selective input), containing a single input, x_i, and an importance factor m_k identical to V_i, in which case (5) can be rewritten as (6) using (1):

x_i' = V_i F_k(w_{ik} x_i + \theta_k).    (6)
When a given value x_i cannot be obtained, the selective input associated with it will behave as if it did not exist, since V_i will be zero. On the other hand, if the value of x_i is available (V_i = 1), the selective input will actively participate in the determination of the network outputs. This can be viewed as if there were two different models, bound to each other and sharing information: one model for the case where the value of x_i is known and another one for the case where it cannot be obtained (is missing). Figure 1 shows the physical model (NSIM) of a network containing a selective input and the two conceptual models inherent to it. A network with N selective inputs will have 2^N different models bound to each other and constrained to share information (network weights). It is guaranteed that all the models share at least S parameters, S being equal to the number of weights that the network would have if the inputs with missing values were not considered at all [8].
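To make equations (1) and (6) concrete, the sketch below runs a forward pass through selective inputs followed by one ordinary hidden layer; missing inputs are flagged with NaN, mirroring the convention of the GPU kernel shown later in the paper. This is our own illustration, not the MBP implementation, and all weight values are arbitrary.

#include <cmath>
#include <cstdio>
#include <vector>

// Selective input: x'_i = V_i * tanh(w_i * x_i + theta_i), with V_i = 0 when
// x_i is missing (equation (6)). A NaN in the input marks a missing value.
std::vector<double> selective_inputs(const std::vector<double>& x,
                                     const std::vector<double>& w,
                                     const std::vector<double>& theta) {
    std::vector<double> xp(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        xp[i] = std::isnan(x[i]) ? 0.0 : std::tanh(w[i] * x[i] + theta[i]);
    return xp;
}

// One ordinary hidden layer on top of the transformed inputs (equation (1)
// with importance factors equal to 1 for the hidden neurons).
std::vector<double> hidden_layer(const std::vector<double>& xp,
                                 const std::vector<std::vector<double>>& W,
                                 const std::vector<double>& bias) {
    std::vector<double> y(W.size());
    for (size_t k = 0; k < W.size(); ++k) {
        double a = bias[k];
        for (size_t j = 0; j < xp.size(); ++j) a += W[k][j] * xp[j];
        y[k] = std::tanh(a);
    }
    return y;
}

int main() {
    const double missing = std::nan("");
    std::vector<double> x = {0.4, missing, -1.2};        // second input is unknown
    std::vector<double> w = {1.0, 0.5, -0.3}, theta = {0.0, 0.1, 0.2};
    std::vector<std::vector<double>> W = {{0.2, -0.7, 0.5}, {1.1, 0.3, -0.4}};
    std::vector<double> bias = {0.0, -0.1};
    std::vector<double> y = hidden_layer(selective_inputs(x, w, theta), W, bias);
    std::printf("y = (%f, %f)\n", y[0], y[1]);
    return 0;
}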
Fig. 1. Physical and conceptual models of a network with a selective input (k = 3)
Although conceptually there are multiple models, from the point of view of the training procedure there is a single model (NSIM). When a pattern is presented to the network, only the parameters (weights) directly or indirectly related to the inputs with known values are adjusted (observe equations (3) and (4)). Thus, only the relevant (conceptual) models will be adjusted [8]. The NSIM presents a high degree of robustness, since it is prepared to deal with faulty sensors. If the system in which the NSIM is integrated realizes that a given sensor has stopped working, it can easily deactivate (discard) all the models inherent to that specific sensor by setting V_i = 0. Consequently, the best model available for the remaining, properly working sensors will be used.
4 GPU Parallel Implementation
Our GPU implementation of the referred method extends the BP and MBP implementation presented in Lopes and Ribeiro [7]. A total of three new kernels (special C functions that are executed in parallel on the GPU) were added to the CUDA (Compute Unified Device Architecture) implementation. In order to calculate the outputs of the selective input neurons, x_i', a kernel, named FireSelectiveInputs, was created. This kernel, whose code is shown in Figure 2, assumes that standard inputs may coexist with selective inputs. Thus, it should be launched with one thread per input and pattern (regardless of the type of input – selective or standard). Moreover, since our implementation considers the batch training mode, the x_i' variables will be calculated simultaneously for all the training patterns (samples), and the threads should be grouped in blocks containing all the inputs of a given pattern. Of course, for standard inputs the value of x_i' must match the original input (x_i' = x_i). Therefore, to differentiate standard inputs from the selective ones, the value of the weights and bias of the
#define NEURON       threadIdx.x
#define NUM_NEURONS  blockDim.x
#define PATTERN      blockIdx.x

__global__ void FireSelectiveInputs(float * inputs, float * weights, float * bias, float * outputs) {
    int idx = PATTERN * NUM_NEURONS + NEURON;

    float o = inputs[idx];
    if (isnan(o) || isinf(o)) { // missing value
        o = 0.0;
    } else {
        float w = weights[NEURON];
        float b = bias[NEURON];
        if (w != 0.0 || b != 0.0) o = tanh(o * w + b);
    }
    outputs[idx] = o;
}
Fig. 2. FireSelectiveInputs kernel. CUDA specific keywords appear in bold.
standard inputs is set to zero – the kernel checks this condition to determine which type of input is being handled. Divergence is avoided when all the inputs are selective inputs; thus, the maximum performance of this kernel is obtained when we treat all the inputs as selective inputs. For the back-propagation phase two more kernels were created: CalcLocalGradSelectiveInputs and CorrectWeightsSelectiveInputs. The first one calculates the local gradients of the selective input neurons for all patterns and the latter is responsible for adjusting the weights of the selective input neurons. As in the case of the FireSelectiveInputs kernel, maximum performance is achieved when all the inputs are considered to be selective inputs. A complete and functional implementation of this method was integrated into the Multiple Back-Propagation software. The latest version of this software as well as its source code can be freely obtained at http://dit.ipg.pt/MBP. Moreover, the NSIM will also be included in GPUMLib – an open source GPU machine learning library (available at http://gpumlib.sourceforge.net/).
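Because the kernel detects missing values through isnan/isinf, the host side only needs to flag unknown entries before uploading the batch. A hedged host-side sketch, with names of our own choosing, could look as follows; it is not part of the MBP code.

#include <cmath>
#include <cstdio>
#include <vector>

// Flattens a batch of patterns into the pattern-major layout expected by
// FireSelectiveInputs (one float per input) and marks every unknown entry
// with NaN so the kernel sets its selective input to zero.
std::vector<float> build_input_buffer(const std::vector<std::vector<float>>& patterns,
                                      const std::vector<std::vector<bool>>& known) {
    const size_t num_patterns = patterns.size();
    const size_t num_inputs = patterns.empty() ? 0 : patterns[0].size();
    std::vector<float> buffer(num_patterns * num_inputs);
    for (size_t p = 0; p < num_patterns; ++p)
        for (size_t i = 0; i < num_inputs; ++i)
            buffer[p * num_inputs + i] = known[p][i] ? patterns[p][i]
                                                     : std::nanf("");
    return buffer;
}

int main() {
    std::vector<std::vector<float>> patterns = {{0.2f, 1.5f}, {0.7f, -0.3f}};
    std::vector<std::vector<bool>> known = {{true, false}, {true, true}};
    std::vector<float> buf = build_input_buffer(patterns, known);
    // buf would then be copied to the GPU (e.g. with cudaMemcpy) and the kernel
    // launched with one block per pattern and one thread per input.
    std::printf("buf[1] is missing: %d\n", std::isnan(buf[1]) ? 1 : 0);
    return 0;
}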
5 Financial Distress Prediction
In recent years, due to the global financial crisis (triggered by the sub-prime mortgage crisis), the rate of insolvency has risen globally. As a result investors are now more careful about entrusting their money. Moreover, determining whether or not firms are healthy is of major importance, not only to investors and stakeholders but also to everyone else that has a relationship with the analyzed firms (e.g. suppliers, workers, banks, insurance firms). Although this is a widely studied topic, estimating the real health of firms is becoming a much harder task, as companies become more complex and develop sophisticated schemes to conceal their real situation. In this context, automated pattern recognition systems that can accurately predict the risk of insolvency and warn, in advance, all those who may be affected by a bankruptcy process are of major importance. Furthermore, it is common to have incomplete observations
Table 1. Financial ratios used to create the bankruptcy model
Financial ratios:
Financial Debt / Capital Employed (%) Capital Employed / Fixed Assets Depreciation of Tangible Assets (%) Working Capital / Current Assets Current Ratio Liquidity Ratio Stock Turnover days Collection Period Credit Period Turnover per Employee Interest / Turnover Debt Period (days) Financial Debt / Equity (%) Financial Debt / Cashflow Cashflow / Turnover (%)
Working Capital / Turnover (days) Net Current Assets / Turnover (days) Working Capital Needs / Turnover (%) Export (%) Value Added per Employee Total Assets / Turnover Operating Profit Margin (%) Net Profit Margin (%) Added Value Margin (%) Part of Employees (%) Return on Capital Employed (%) Return on Total Assets (%) EBIT Margin (%) EBITDA Margin (%)
(missing data) in financial and business applications [4]. Thus, this is an interesting problem for testing (and validating) the proposed method for handling MV. In this study, we used a large database of French companies, containing information on an ample set of financial ratios spanning a period of several years. The database contains information about 107,932 companies, out of which 1,653 became insolvent in 2006. The objective consists of discriminating between healthy and distressed companies based on the record of the financial indicators from previous years. For this purpose, we considered 29 financial ratios over the three immediately preceding years (see Table 1) as well as two more features (the number of employees and the turnover), totaling 89 features. On average, each financial ratio, for a given year, contained over 4% of missing values. However, some had almost a third of the data missing. What is more interesting is that, if we consider only the data from distressed companies, the average of MV for the financial ratios rises to 42.35%. In fact, there are many features for which less than a quarter of the data is present. We are unsure why this happens, but one possible explanation is that the affected firms could be trying to hide information from the markets. Nevertheless, this highlights the fact that knowing that some information is missing could be as important as knowing the information itself. Thus, in this respect our model is advantageous, since it preserves the missing information (unlike imputation methods). As expected, when looking at the data of each company (sample) we found similar results: overall, on average only 3 or 4 ratios are missing; however, when considering only the distressed firms, roughly 37 ratios per sample are missing. Moreover, there are companies for which all the ratios are unknown. To create a workable and balanced dataset, we started by selecting all the instances of the database associated with the distressed companies whose number of unknown ratios did not exceed 70 (we considered that at least about 20% of the
Table 2. Results of the NSIM for the bankruptcy problem

Metric         Results (%)       Metric         Results (%)
Accuracy       95.70 ± 1.42      Precision      95.82 ± 1.77
Sensitivity    95.60 ± 1.61      Recall         95.60 ± 1.61
Specificity    95.80 ± 1.83      F1 measure     95.70 ± 1.35
Fig. 3. GPU speedups obtained for the bankruptcy problem (speedup (×) versus the number of hidden layer neurons)
ratios should contain information). Thus, a total of 1524 samples associated with distressed companies were chosen. Then we selected the same number of samples associated with healthy companies, in order to obtain a balanced dataset in which the missing values were uniformly distributed across all the ratios. The resulting dataset contains 3048 instances, a number over 5× greater than the number of samples we were able to obtain in previous work [14,10] using imputation methods. The resulting dataset contains on average 27.66% of missing values per ratio. Moreover, on average 24 ratios per sample are missing. Table 2 presents the results of the NSIM, with the MBP algorithm, using 10-fold cross-validation. These far exceed the results previously obtained [14,10] when imputation techniques were used and demonstrate the validity and usefulness of the NSIM in a real-world setting. One of the strengths of the NSIM lies in the possibility of using data with a large number of missing values. This is important, because better (and more accurate) models can be built by incorporating and taking advantage of extra information. Moreover, instead of injecting inaccurate values into the system, as imputation methods do, the NSIM preserves the uncertainty caused by unknown values, increasing the model's utility when relevant information is missing. Figure 3 shows the speedups obtained, for the bankruptcy problem, using a GTX 280 GPU and a Core 2 Quad Q9300 CPU (2.5 GHz). These are consistent with the results previously obtained in Lopes and Ribeiro [7] and demonstrate
the potential of the GPU to significantly reduce the long and fastidious (and expensive) training times of NNs. Moreover, the GPU implementation scales better than the standalone counterpart, by taking advantage of additional parallel operations.
6
Conclusions and Future Work
The ability to deal properly with missing values has become a fundamental issue for pattern recognition, as data samples in many real-world problems are often incomplete. Failure to correctly handle missing data will most likely result in larger errors and inaccurate models with poor performance. Moreover, there are situations where sensors may fail, yet systems are expected to take decisions based on the available data. In this paper we addressed this problem by presenting a GPU implementation of an innovative method that integrates the capacity for handling MV directly into neural networks. The NSIM has several advantages as compared to other methods: (i) it presents a higher degree of robustness, since the resulting models are able to deal with faulty sensors by selecting the best model available for the sensors working properly; (ii) it preserves the uncertainty caused by unknown values, instead of injecting inaccurate values into the system; (iii) data containing valuable information, which could otherwise be discarded due to a large number of MV, can now be incorporated into the models; and (iv) it prevents undesirable bias. This is validated in a real-world problem of bankruptcy prediction that attests to the quality and usefulness of the proposed method. Moreover, its GPU implementation is crucial to reduce the long training times associated with NNs, thus making this method more attractive. Future work will exploit selective input neurons in other types of neural networks.
Acknowledgment. FCT (Fundação para a Ciência e Tecnologia) is gratefully acknowledged for funding the first author with the grant SFRH/BD/62479/2009.
References

1. Aikl, L., Zainuddin, Z.: A comparative study of missing value estimation methods: Which method performs better? In: Proc. International Conference on Electronic Design (ICED 2008), pp. 1–5 (2008)
2. Ayuyev, V.V., Jupin, J., Harris, P.W., Obradovic, Z.: Dynamic clustering-based estimation of missing values in mixed type data. In: DaWaK 2009: Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery, pp. 366–377. Springer, Heidelberg (2009)
3. Friedman, M., Kandel, A.: Introduction to Pattern Recognition: Statistical, Structural, Neural, and Fuzzy Logic Approaches. World Scientific, Singapore (1999)
4. García-Laencina, P., Sancho-Gómez, J.L., Figueiras-Vidal, A.: Pattern classification with missing data: a review. Neural Computing & Applications 19, 263–282 (2010)
5. Kotsiantis, S.B., Zaharakis, I.D., Pintelas, P.E.: Machine learning: a review of classification and combining techniques. Artif. Intell. Rev. 26(3), 159–190 (2006)
6. Lopes, N., Ribeiro, B.: Hybrid learning in a multi-neural network architecture. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2001), vol. 4, pp. 2788–2793 (2001)
7. Lopes, N., Ribeiro, B.: GPU implementation of the multiple back-propagation algorithm. In: Corchado, E., Yin, H. (eds.) IDEAL 2009. LNCS, vol. 5788, pp. 449–456. Springer, Heidelberg (2009)
8. Lopes, N., Ribeiro, B.: A strategy for dealing with missing values by using selective activation neurons in a multi-topology framework. In: IEEE World Congress on Computational Intelligence, WCCI (2010)
9. López-Molina, T., Pérez-Méndez, A., Rivas-Echeverría, F.: Missing values imputation techniques for neural networks patterns. In: ICS 2008: Proceedings of the 12th WSEAS International Conference on Systems, pp. 290–295. World Scientific and Engineering Academy and Society, WSEAS (2008)
10. Ribeiro, B., Lopes, N., Silva, C.: High-performance bankruptcy prediction model using graphics processing units. In: IEEE World Congress on Computational Intelligence, WCCI (2010)
11. Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, New York (2008)
12. Tang, H., Tan, K.C., Yi, Z.: Neural Networks: Computational Models and Applications (Studies in Computational Intelligence). Springer-Verlag New York, Inc., Secaucus (2007)
13. Tuikkala, J., Elo, L., Nevalainen, O., Aittokallio, T.: Missing value imputation improves clustering and interpretation of gene expression microarray data. BMC Bioinformatics 9(1), 202 (2008)
14. Vieira, A.S., Duarte, J., Ribeiro, B., Neves, J.C.: Accurate prediction of financial distress of companies with machine learning algorithms. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 569–576. Springer, Heidelberg (2009)
A Model of Saliency-Based Selective Attention for Machine Vision Inspection Application

Xiao-Feng Ding1, Li-Zhong Xu1,2, Xue-Wu Zhang1, Fang Gong1, Ai-Ye Shi1,2, and Hui-Bin Wang1,2

1 College of Computer and Information Engineering, Hohai University, Nanjing 210098, China
2 Institute of Communication and Information System Engineering, Hohai University, Nanjing 210098, China
[email protected], [email protected]

Abstract. A machine vision inspection model of surface defects, inspired by the methodologies of neuroanatomy and psychology, is investigated. Firstly, the features extracted from defect images are combined into a saliency map. The bottom-up attention mechanism then obtains "what" and "where" information. Finally, the Markov model is used to classify the types of the defects. Experimental results demonstrate the feasibility and effectiveness of the proposed model, with a 94.40% probability of accurately detecting the existence of copper strip defects.

Keywords: Vision inspection, Surface defect, Saliency map, Selective attention, Markov model.
1
Introduction
Since the 1990s, with the rapid development of electronic technology and machine vision technology, machine vision inspection of surface defects has gradually become the most important non-destructive detection technology. The difficulties of machine vision inspection of surface defects lie mainly in defect feature extraction and defect classification. In traditional machine vision inspection, individual features such as gray-level features [1], geometry features [2] and texture features, as well as their combinations, are used to describe the defect images. Then, neural networks (NN) [4] or support vector machines (SVM) [5] classify the surface defects. These methods achieve surface defect detection and classification to a certain extent. However, in copper strip surface defect inspection, the copper strip surface is highly reflective and different production technologies of copper strips lead to different types of surface defects. Furthermore, some kinds of defects are small and their intensities are similar to those of non-defective copper strip surfaces. The performance of traditional methods is poor for vision inspection of copper strips and cannot meet the high demands of quality control. Human visual inspection can identify what kinds
Corresponding author.
of the defects are and where the defects are located, quickly and effectively. Human visual inspection is robust when the reflective intensity and the shapes of defects change. Machines cannot identify the differences between instances of the same defect caused by different production technologies, nor the slight defects, neither of which poses much difficulty to human visual inspection. This paper is motivated by the need for an automated inspection technique that imitates human vision to detect and locate defects in copper strips. Human vision has a strong ability in pattern recognition and image understanding. Humans have a remarkable ability to interpret complex scenes in real time. Intermediate and higher visual processes appear to select a subset of the available sensory information before further processing, most likely to reduce the complexity of scene analysis [8]. This selective attention is a vital process in vision; it facilitates the identification of important areas in a visual scene. It has been described as a spotlight, illuminating a particular region while neglecting the rest. Corbetta [12] has characterized this selection of a region as necessary because of "computational limitations in the brain's capacity to process information and to ensure that behavior is controlled by relevant information". Moreover, research done on attention mechanisms in the brain has been useful in identifying areas of the visual system as well as their behavior and function. With the development of neuroscience, computational neuroscience and anatomy, research on the human visual perception system is increasing constantly. The computational models [6,7,8,9,11,13] for selective attention, in both biological and computer vision, are particularly useful in image understanding. In this paper, we present a selective attention model combined with a saliency map for machine vision inspection of surface defects. The features of the defect images are extracted to obtain a saliency map [9]. Then, an observable Markov model is used in the task-driven attention mechanism. It combines top-down attention with bottom-up attention, takes "what" information and "where" information into account, and then completes surface defect inspection. The paper is structured as follows: in Section 2, the saliency-based selective attention model is given; then, the experimental results on copper strips are reported; in the last section, we conclude and discuss future work.
2
Saliency-Based Selective Attention Model
In this section, we start by obtaining the saliency map of defect images, then describe the acquisition of "what" and "where" information, and finally the integration of "what" and "where" information.

2.1 Saliency Map
Visual Feature Extraction. The choice of a specific set of features is not crucial. In this work we choose the feature decomposition proposed by Itti and Koch [9]. The image to process is first subject to a feature decomposition into an intensity map (I) and four broadly-tuned color channels (R, G, B and Y): I = (r + g + b)/3 for intensity, R = [r̃ − (g̃ + b̃)/2]+ for red, G = [g̃ − (r̃ + b̃)/2]+ for green, B = [b̃ − (r̃ + g̃)/2]+ for blue, and Y = [(r̃ + g̃)/2 − |r̃ − g̃|/2 − b̃]+ for yellow, where r̃ = r/I, g̃ = g/I, b̃ = b/I and [x]+ = max(x, 0). I, R, G, B and Y are used to create Gaussian pyramids I(σ), R(σ), G(σ), B(σ) and Y(σ), where σ ∈ [0, ..., 8] is the scale factor. I is also used to create Gabor pyramids O(σ, θ), where σ ∈ [0, ..., 8] is the scale and θ ∈ {0°, 45°, 90°, 135°} is the preferred orientation.

Feature Conspicuity Map. The feature maps are obtained by calculating the center-surround differences, denoted ⊖, between a "center" fine scale c and a "surround" coarser scale s for the extracted intensity, color and orientation features. The intensity feature maps relate to intensity contrast, which in mammals is detected by neurons sensitive either to dark centers on bright surrounds or to bright centers on dark surrounds. Six feature maps are calculated for intensity, I(c, s), where c ∈ {2, 3, 4}, σ ∈ {3, 4}, s = c + σ:

I(c, s) = |I(c) ⊖ I(s)|.    (1)
The color feature maps are calculated by

RG(c, s) = |(R(c) − G(c)) ⊖ (G(s) − R(s))|,    (2)

BY(c, s) = |(B(c) − Y(c)) ⊖ (Y(s) − B(s))|.    (3)
The orientation feature maps are calculated by

O(c, s, θ) = |O(c, θ) ⊖ O(s, θ)|.    (4)
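As a concrete illustration of the channel decomposition used above, the following C fragment computes the intensity and broadly-tuned color channels for a single pixel. It is only a sketch under our own naming; the pyramid construction, Gabor filtering and center-surround differencing of Eqs. (1)-(4) are not shown.

#include <math.h>

/* [x]+ = max(x, 0) */
static float pos(float x) { return x > 0.0f ? x : 0.0f; }

/* Per-pixel intensity and broadly-tuned color channels (illustrative names). */
void color_channels(float r, float g, float b,
                    float *I, float *R, float *G, float *B, float *Y)
{
    *I = (r + g + b) / 3.0f;
    if (*I <= 0.0f) { *R = *G = *B = *Y = 0.0f; return; }  /* guard division by zero */

    float rn = r / *I, gn = g / *I, bn = b / *I;           /* r~, g~, b~ */
    *R = pos(rn - (gn + bn) / 2.0f);
    *G = pos(gn - (rn + bn) / 2.0f);
    *B = pos(bn - (rn + gn) / 2.0f);
    *Y = pos((rn + gn) / 2.0f - fabsf(rn - gn) / 2.0f - bn);
}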
When the intensity, color and orientation feature maps are done, we combine the three kinds of features. Before this, a local iterative method is applied to each feature map. Concretely, each feature map is first normalized to the same range and then convolved with a large two-dimensional difference-of-Gaussians filter,

S ← |S + S ∗ DoG − C_inh|≥0,    (5)

where |·|≥0 denotes rectification (negative values are set to zero) and

DoG(x, y) = (C_ex² / (2πσ_ex²)) exp(−(x² + y²) / (2σ_ex²)) − (C_inh² / (2πσ_inh²)) exp(−(x² + y²) / (2σ_inh²)),    (6)

∗ stands for the convolution operation, DoG is the difference-of-Gaussians function, σ_ex and σ_inh are the excitation and suppression bandwidths, and C_ex and C_inh are the excitation and suppression constants. C_inh acts as an offset, so that the combination strategy broadly suppresses homogeneous regions, such as uniformly textured images. Using the difference-of-Gaussians function for local iteration, on the one hand, detects the more salient targets; on the other hand, it resembles the center self-excitation and the inhibitory long-range surround connections of the primary visual cortex of the human eye. Therefore, it
has physiological plausibility and can effectively suppress noise using multi-resolution. After obtaining the intensity, color and orientation conspicuity maps Ī, C̄ and Ō, the final saliency map is obtained through a weighted average, that is

S = (1/3)(Ī + C̄ + Ō).    (7)
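The local iterative normalization of Eqs. (5)-(6), applied to each map before the combination of Eq. (7), can be sketched as follows. This is a minimal sketch under our own assumptions: the DoG kernel is taken as precomputed, borders are handled by clamping coordinates, and the function and variable names are illustrative only; it is not the authors' implementation.

#include <stdlib.h>

/* One iteration of S <- |S + S * DoG - C_inh|>=0 on a w x h feature map.
   'dog' is a precomputed k x k difference-of-Gaussians kernel (k odd). */
void normalize_iteration(float *S, int w, int h, const float *dog, int k, float cInh)
{
    int half = k / 2;
    float *conv = (float *) malloc((size_t) w * h * sizeof(float));

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;                               /* (S * DoG)(x, y) */
            for (int dy = -half; dy <= half; dy++) {
                for (int dx = -half; dx <= half; dx++) {
                    int sx = x + dx, sy = y + dy;
                    if (sx < 0) sx = 0; else if (sx >= w) sx = w - 1;   /* clamp borders */
                    if (sy < 0) sy = 0; else if (sy >= h) sy = h - 1;
                    acc += S[sy * w + sx] * dog[(dy + half) * k + (dx + half)];
                }
            }
            conv[y * w + x] = acc;
        }
    }
    for (int i = 0; i < w * h; i++) {
        float v = S[i] + conv[i] - cInh;                    /* add and subtract offset */
        S[i] = v > 0.0f ? v : 0.0f;                         /* rectify                 */
    }
    free(conv);
}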
2.2 Acquisition of "what" and "where" Information
When the saliency map is calculated, we can get the intensity, color and orientation features, which can be directly used as "what" information. However, in order to analyze the content of the fovea centralis more effectively, an expert network composed of a single-layer perceptron in every area is used to obtain the "what" information. Its inputs are feature vectors extracted from the information captured in the fovea centralis; its outputs are the posterior probability vectors of the information category, which are treated as the "what" information required in this article. The single-layer perceptron is trained through supervised learning. Attention focus selection and diverting determine the location and importance of the regions of interest. The competition between the various targets in the image is implemented by a winner-take-all mechanism. Firstly, the winner-take-all neural network finds the attention focus in the saliency map and selects candidate regions to get the salient area; then an inhibition-of-return mechanism is applied to look for the next salient point, to which the attention focus is diverted. The points visited, ordered by scanning time, form a scanning path, which is treated as the "where" information flow.
2.3 Integration of "what" and "where" Information
A discrete observable Markov model is used to connect the saliency map with the "what" and "where" information in the combination layer module. The region visited by the attention focus, treated as "where" information, is used as the state of the Markov model, and the output of the expert network, treated as "what" information, is used as the conditional observation. The focus-diverting sequence of each sample in the training set forms a scanning path in time order, which corresponds to a Markov chain of the training sample's category. The model adjusts the probabilities of each single Markov chain based on the "what" and "where" information, thereby maximizing the likelihood of the specific scanning path of a given training sample; recognition is implemented by selecting the class with the largest posterior probability.

Observable Markov Model. In the training process, the Markov model simulates a certain number of scanning paths, so each state can be observed, and the state transition probabilities a_ij and the initial distribution probabilities π_i can be obtained by counting. Similarly, the state observation probabilities b_j(k) are obtained
by calculating the output of the expert network under each state for each sample. The three parameters are calculated by:

a_ij = (number of transitions from state s_i to s_j) / (number of transitions starting from s_i),    (8)

π_i = (number of state sequences with s_i at t = 1) / (total number of observation sequences),    (9)

b_j(k) = (number of times o_k is observed in state s_j) / (total number of times in state s_j).    (10)
The probability of the observation series is

P(O, S/λ) = π_{S_1} b_{S_1}(O_1) ∏_{i=2}^{n} a_{S_{i−1} S_i} b_{S_i}(O_i),    (11)
where S is the state sequence, O is the observation sequence, λ = {π_i, a_ij, b_j(k)} is the parameter set of the Markov chain, i, j = 1, ..., N are indices for states, and k = 1, ..., M is the index for the observation samples. Let C be the category with the highest observation probability; then

P(O, S/λ_C) = max_j P(O, S/λ_j).    (12)
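To make Eqs. (11)-(12) concrete, the following C sketch scores one scan path under each class model and picks the most likely class. The structure layout, array sizes and names are our own illustrative assumptions, not the authors' implementation.

#define MAX_STATES  16
#define MAX_SYMBOLS 32

typedef struct {
    int N, M;                                  /* number of states / observation symbols */
    double pi[MAX_STATES];                     /* initial state probabilities            */
    double a[MAX_STATES][MAX_STATES];          /* state transition probabilities         */
    double b[MAX_STATES][MAX_SYMBOLS];         /* observation probabilities per state    */
} MarkovModel;

/* Eq. (11): probability of an observed scan path (S, O) of length T under model m. */
double path_probability(const MarkovModel *m, const int *S, const int *O, int T)
{
    double p = m->pi[S[0]] * m->b[S[0]][O[0]];
    for (int t = 1; t < T; t++)
        p *= m->a[S[t - 1]][S[t]] * m->b[S[t]][O[t]];
    return p;
}

/* Eq. (12): choose the class whose model gives the highest path probability. */
int classify(const MarkovModel *models, int numClasses, const int *S, const int *O, int T)
{
    int best = 0;
    double bestP = -1.0;
    for (int c = 0; c < numClasses; c++) {
        double p = path_probability(&models[c], S, O, T);
        if (p > bestP) { bestP = p; best = c; }
    }
    return best;
}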
Dynamic Fovea. An advantage of using the Markov model is that the number of scans is controllable. In the course of identification, the images to be recognized only need to go through a limited number of focus divertings; they can be classified and judged correctly without all areas having to be attended. After each scan, the posterior probability of each class can be obtained from the Markov model. At time t, with some regions already attended, the recognition probability of a particular class of images is recorded as a_t(C). The probability of the partial sequence under the Markov model is

a_t(c) = P(O_1, ..., O_t, S_1, ..., S_t/λ_c),    (13)

where O_1, ..., O_t is the observation sequence up to time t, S_1, ..., S_t is the state sequence, and λ_c is the parameter set of the Markov model for category c. When this probability reaches the decision confidence, the focus stops diverting. At time t, the posterior probability that the image belongs to category C can be defined as

a*_t(c) = P(C/O_1, ..., O_t, S_1, ..., S_t) = a_t(c) / ∑_{j=1}^{k} a_t(j).    (14)

Let the confidence be τ. Then, the criterion for the focus to stop diverting is a*_t(c) ≥ τ, τ ∈ [0, 1].
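The stopping rule of Eq. (14) amounts to normalizing the class scores after each fixation and halting as soon as one of them exceeds τ; a minimal sketch (again with illustrative names, building on path_probability above) is:

/* Returns the winning class index if its normalized posterior reaches tau,
   or -1 if another fixation is needed (illustrative sketch). */
int stop_or_continue(const double *a_t, int numClasses, double tau)
{
    double sum = 0.0;
    for (int j = 0; j < numClasses; j++) sum += a_t[j];
    if (sum <= 0.0) return -1;                      /* nothing informative yet   */

    for (int c = 0; c < numClasses; c++)
        if (a_t[c] / sum >= tau) return c;          /* a*_t(c) >= tau: stop      */
    return -1;                                      /* keep diverting the focus  */
}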
3
Experimental Results and Analysis
To verify the feasibility of the proposed approach, experimental simulations were carried out on the image library from the XINGRONG Copper Corporation in Changzhou, Jiangsu Province. This image library contains 1600 copper strip surface images of 640 × 480 pixels. There are 6 types of defects: cracks, burrs, scratches, holes, pits and buckles. There are 200 defect images, 200 non-defect images and 200 smearing "false defect" images. In the experiments, a narrow LED lighting device (model LT-191 X 18, from Dongguan Technology Co.) and a CCD industrial camera (model JAI CV-A1) were used to collect the copper strip images. We start by obtaining the saliency maps of the defect images. Firstly, using Gaussian pyramid and Gabor pyramid decompositions at different scales, 9 brightness features, 36 color features and 36 orientation features are obtained for each defect image. From these 81 features, 42 feature maps (6 brightness feature maps, 12 color feature maps and 24 orientation feature maps) are obtained by calculating the center-surround differences between the central fine scale c and the neighboring coarse scale s. Then, the local iteration strategy is used to get the Ī, C̄ and Ō conspicuity maps, as shown in Fig. 1. Since the images used in this paper are static, the flicker feature maps do not contain any salient areas. In the experiments, the 42 feature maps are taken as the input of the local neural network (here a single-layer perceptron), and the output of the perceptron is a 10-D vector of class posterior probabilities, treated as the "what" information in this paper. The local neural network is used to reduce the complexity of the system and to improve the classification accuracy.
Fig. 1. Conspicuity maps of smearing. From left to right and top to bottom: the original image, the attention map, and the conspicuity maps for color contrasts, flicker contrasts, intensity contrasts and orientation contrasts.
Table 1. The performances of the observable Markov model and the dynamic central fovea

Method                      Average scanning number   Accuracy rate (%)
                                                      Training    Testing
Observable Markov model     5                         97.45       94.40
Dynamic central fovea       3.6                       93.56       89.52
Table 2. Classification accuracy of surface defects detection using the observable Markov model

Type of defects   Correct Number   Error Number   Accuracy rate (%)
Smearing          192              8              96.00
Cracks            186              14             93.00
Burrs             193              7              96.50
Scratch           185              15             92.50
Holes             195              5              97.50
Pits              192              8              96.00
Buckles           179              21             89.50
Total             1322             78             94.40
The number of scans of the dynamic central fovea is not fixed; it depends on the structure of the observable Markov model. The performances of the observable Markov model and the dynamic central fovea are given in Table 1. Table 1 shows that the classification accuracy of the dynamic central fovea is lower than that of the Markov model. However, the observable Markov model needs on average 5 scans, whereas the dynamic fovea completes the classification with 3.6 scans on average. Thus, using the dynamic central fovea can greatly improve the real-time performance. Table 2 presents the classification accuracy of surface defect detection using the observable Markov model. The method has a high recognition rate for all typical copper strip surface defects, ranging from 89% to 97%. Furthermore, even when the defect features differ only slightly from non-defect image features, as for scratches and buckles, the accuracy still reaches 92.5% and 89.5%, respectively.
4
Conclusions
This paper investigates a model of saliency-based selective attention for machine vision inspection of copper strip surface defects. The proposed method is capable of detecting copper strip surface defects even though the copper strip surface is highly reflective and different production technologies lead to different types of surface defects. The experimental results show that the proposed method improves the classification ability of the surface defect inspection system and achieves the required accuracy. In this paper, we only
consider static stimuli for the saliency map detector. Considering dynamic scenes for copper strip surface inspection is an interesting future direction. We are also considering applying the proposed method to other surface defect quality inspection tasks.
Acknowledgments This work is supported partly by the National Natural Science Foundation of China (No. 60872096), the National Natural Science Foundation of Jiangsu Province of China (No. BK2009352) and the Fundamental Research Funds for the Central Universities (No. 2009B20614).
References 1. Zheng, H., Kong, L., Nahavandi, S.: Automatic Inspection of Metallic Surface Defects using Genetic Algorithms. Journal of Materials Processing Tech. 125, 427–433 (2002) 2. Liang, R., Ding, Y., Zhang, X., Chen, J.: Copper Strip Surface Defects Inspection Based on SVM-RBF. In: 4th International Conference on Natural Computation, pp. 41–45. IEEE Press, New York (2008) 3. Zhong, K.-H., Ding, M.-Y., Zhou, C.-P.: Texture Defect Inspection Method using Difference Statistics Feature in Wavelet Domain. Systems Engineering and Electronics 26, 660–665 (2004) 4. Zhang, X., Liang, R., Ding, Y., Chen, J., Duan, D., Zong, G.: The System of Copper Strips Surface Defects Inspection Based on Intelligent Fusion. In: 2008 IEEE International Conference on Automation and Logistics, pp. 476–480. IEEE Press, New York (2008) 5. Li, T.-S.: Applying Wavelets Transform, Rough Set Theory and Support Vector Machine for Copper Clad Laminate Defects Classification. Expert Systems with Applications 36, 5822–5829 (2009) 6. Luo, S.-W.: Information Processing Theory of Visual Perception. publishing house of electronics industry, Beijing (2006) 7. Noton, D., Stark, L.: Eye Movements and Visual Perception. Scientific American 224, 35–43 (1971) 8. Didday, R., Arbib, M.: Eye Movements and Visual Perception: A Two Visual System Model. International Journal of Man-Machine Studies 7, 547–570 (1975) 9. Itti, L., Koch, C., Niebur, E.: A Model of Saliency-based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1254–1259 (1998) 10. Rimey, R., Brown, C.: Selective Attention as Sequential Behavior: Modeling Eye Movements with An Augmented Hidden Markov Model. Department of Computer Science, University of Rochester (1990) 11. Salah, A., Alpaydin, E., Akarun, L.: A Selective Attention-based Method for Visual Pattern Recognition with Application to Handwritten Digit Recognition and Face Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 420–425 (2002)
12. Corbetta, M.: Frontoparietal Cortical Networks for Directing Attention and The Eye to Visual locations: Identical, independent, or overlapping neural systems? Proc. Natl. Acad. Sci. USA 95, 831–838 (1998) 13. Vazquez, E., Gevers, T., Lucassen, M., Weijer, J., Baldrich, R.: Saliency of Color Image Derivatives: A Comparison between Computational Models and Human Perception. J. Opt. Soc. Am. A 27, 613–621 (2010)
Grapheme-Phoneme Translator for Brazilian Portuguese

Danilo Picagli Shibata and Ricardo Luis de Azevedo da Rocha

Escola Politécnica, Universidade de São Paulo, Brazil
[email protected],
[email protected] Abstract. This work presents an application for grapheme-phoneme translation for Portuguese language texts based on adaptive automata. The application has a module for grapheme-phoneme translation of words as its core, and input texts are transformed into sequences of words (including numbers, acronyms, etc) that are used as input for the word translation module. The word translation module separates words into sequences of tokens and analyzes their behavior considering stress and influences from adjacent tokens. The paper begins with an overview of the word translation method based on adaptive automata, presents the application for text translation and ends with results of translation tests using texts from Brazilian newspapers. Keywords: Adaptive Automata, Brazilian Portuguese, GraphemePhoneme Translation, Natural Language Processing.
1
Introduction
Text-to-Speech translation (TTS) has been an important topic of study in Natural Language Processing. TTS is often divided into two parts: text-to-phoneme translation (TTP), where the input text is translated to a phonetic representation, and speech synthesis, where the phonetic representation is transformed into speech. Multiple approaches to TTP can be found, including methods based on grapheme-phoneme translation or letter-to-phoneme (L2P) translation, in which phonetic representations are discovered from words and letters, respectively. This paper presents an application for grapheme-phoneme translation for the Portuguese language based on adaptive automata [1]. The application translates texts written in Portuguese to phonetic sequences similar to the speech of São Paulo, Brazil, but may be changed to adhere to different variations of Portuguese and, furthermore, to different languages, such as Spanish, that hold major similarities with Portuguese. Rule-based methods may not be the best fit for processing some natural languages, especially the ones that have highly irregular rules for letter-to-phoneme translation, but that is not the case for Portuguese, whose rules are quite regular. Rule-based methods yield good results for this language, as stated in [2] and shown by the results of [3], [4] for European Portuguese and [5] for Brazilian Portuguese.
1.1 Brazilian Portuguese of São Paulo (SPP)
Portuguese is a widespread language, being spoken in different countries and continents with different accents. The Portuguese of São Paulo has been chosen as the target of this study for familiarity with that language and also because it is spoken in the most populous dialectal region of Brazil [6]. São Paulo is still quite large and its population is largely composed of immigrants, therefore it was necessary to standardize the expected output. The standard output for the presented method is based on an illustration of the urban variety of the São Paulo State dialectal region [6]. The sound rules presented in this paper may give a clue to how SPP sounds, but [6] should be consulted for the complete set of rules.
2
Word Translation
The core of the application is a word translator based on adaptive automata. This word translator module is an implementation of the translation method presented in [5] and [7]. This section presents an overview of the translation method. Translations begin with lexical analysis, where words are divided into tokens similar to the syllables of the Portuguese language [8]. The tokens are then passed to the adaptive automaton, which handles the tokens and treats three issues concerning context sensitivity: the stress of the token, which defines whether the sound of a token should be emphasized over other sounds in the same word, and the influences a token receives from its previous and next tokens, which may change the sound of the token.

2.1 Lexical Analysis
Lexical analysis is the first part of the translation process. It separates input words into sequences of tokens that are handled by the adaptive automaton and translated into phonetic sequences considering the appropriate context sensitivity issues. The lexical analyzer rules are based on syllabic separation rules for the Portuguese language defined in [7]. The full set of rules can be found in [6]. Table 1 presents examples of words that are separated differently by the mentioned rules.

Table 1. Word separation example

Word     Separation    Lexical
Sabia    Sa-bi-a       Sa-bia
Piano    Pi-a-no       Pia-no
Aerado   A-e-ra-do     Ae-ra-do

The main difference between the lexical analyzer and the syllabic separation for Portuguese is in the separation of adjacent vowels when the vowels are different from each other. While the separation rules state that separation is conditional on
context and vowels are not separated if they form diphthongs or triphthongs, the lexical analyzer never separates adjacent vowels that can become diphthongs or triphthongs; the separation is made by the automaton when the context is analyzed.

2.2 Adaptive Automaton
After the completion of lexical analysis, the sequence of tokens generated is used as the input for an automaton which translates them to sequences of phonetic symbols reckoning the context sensitivity issues mentioned before. As a result of the execution process there may be one or more acceptable phonetic representations for the input sequence. Sometimes only one of these representations is actually used by Portuguese speaking people, but there are cases in which more than one representation is correct and disambiguation must be done through context. Symbols. Symbols used by the automaton are divided in three sets. Tokens are the input symbols generated by the Lexical Analyzer and represent parts of the analyzed word. Context symbols are internal symbols written and read by the automaton to treat context sensitivity issues in a word. Markup symbols are used by adaptive actions to search transitions that indicate places where other translations should be inserted. Context symbols are divided in three subgroups: forward influence symbols that define the influence a token exerts on the following token, backward influence symbols that define the influence a token exerts on the preceding token and stress symbols that define stress for a token. Forward influence symbols. define influences a token exerts on its following token. These symbols are represented by the Greek letter α and are also referred to as α-symbols. Forward influence symbols indicate whether the last character of the influencing token is a vowel or a consonant. Forward influence is not frequent in Portuguese, only tokens that begin with ’r’ or ’s’ followed by vowels suffer this type of influence. Backward influence symbols. define influences a token exerts on its preceding token. These symbols are represented by the Greek letter π and are also referred to as π-symbols. Backward influence symbols indicate the characteristic of the first sound in the influencing token as fricative, nasal, voiced or unvoiced consonants among others. Contrary to forward influence, backward influences are common and a significant number of tokens suffer it. Stress symbols. define stress for a token. These symbols are represented by the Greek letter τ and are also referred to as τ -symbols. Stress symbols indicate whether tokens are stressed or unstressed, and a special symbol indicates if the token is the last of a word and triggers the process of defining stress for all tokens in a word.
Sub Machines. The adaptive automaton is divided into two sub-machines: Recognizer and Translator. The Recognizer reads input symbols and executes adaptive actions that change the structure of the Translator in order to comply with the rules used by Portuguese speakers to read the analyzed word. When the input sequence is over, there is a sub-machine call to the Translator, which defines the valid phonetic representations for the input word.

Adaptive Functions. Changes to the Translator are executed by adaptive actions that are executed when a token is read by the Recognizer. These adaptive actions are composed of sequences of adaptive function calls. The adaptive functions were designed with the purpose of executing small changes to the Translator, such as adding or removing a particular transition, and their calls are arranged in blocks that change the Translator in a structured manner. The following adaptive functions were designed:

1. dm: indicates where to create new transitions for the automaton.
2. rt: creates transitions to analyze stress.
3. ina: creates transitions to generate backward influence.
4. inp: creates transitions to generate forward influence.
5. ida: creates transitions to read forward influence.
6. idp: creates transitions to read backward influence.
7. som: creates transitions that define the phonetic representation of a token.
8. am: erases a markup transition.
9. af: prepares the Translator to be executed.
10. ra: recognizes the existence of acute or circumflex accents in the token.
Order. Sequences of adaptive function calls are divided in two blocks. The first block creates transitions that define influences on adjacent tokens and stress, while the second block creates transitions that read influences from adjacent tokens and transitions that define sounds that represent the token. The second block is composed by multiple forward influence blocks which in turn are composed by multiple backward influence blocks. Forward and backward influence blocks refer to sequences of functions calls that create transitions that handle one specific influence value. Figure 1 presents as an example the sequence of adaptive function calls called when the token sa is read by the Recognizer. The leftmost column contains calls that compose the first block where stress and generated influence rules are defined. Calls from other columns compose the second block which contains two α-blocks and each α-block is divided into three π-blocks. Parameters. Values passed on the parameters of adaptive functions calls are in its majority context symbols that will be read or written by the created transitions. During Translator execution they define how context sensitivity issues are handled by analyzed tokens. Markup symbols and output symbols are passed to search transitions and define the sounds of tokens respectively. There are three combinations of function calls that define the stress rules for a token. The parameters in these three combinations separate tokens in three
Fig. 1. Adaptive action for token sa
sets concerning stress rules: tokens that are unstressed when they are the last of a word, tokens that are stressed when they are the last of a word without acute or circumflex accents, and tokens with acute or circumflex accents. Parameters for calls that define influences a token exerts on adjacent tokens are defined based on characteristics of the influencing token. Forward influence is defined by whether the last character in the influencing token is a vowel or a consonant, backward influence is defined by the characteristics of the first sound of the influencing token (nasal, fricative, voiced, unvoiced, etc). Parameters for calls that define influences a token receives from adjacent tokens are based on characteristics of the influenced token. For each relevant forward influence there should be an α-block that handles that influence, and for each relevant backward influence there should be a π-block that handles that influence nested inside each α-block. For the example presented in Figure 1, parameter values for calls that compose the first block define that sa is unstressed final, starts with a fricative consonant and ends with a vowel. In the second block, forward influence blocks defines that ’s’ sounds like [z] when it follows a vowel and like [s] otherwise, while backward influence blocks define that ’a’ is nasalized when token is stressed and followed by nasal consonants, it is voiceless when it is a final token and sounds like [a] otherwise. 2.3
Example
Figure 2 presents the structure of the Translator submachine during the translation of the word casa. States and transitions are represented with usual automata notation. Tags for transitions mean they consume the symbol before comma, write the symbol after comma (omitted if nothing is written) into the input and write the sequence in between brackets in the output. The structure consists of two cyclic blocks of transitions that represent the tokens ca and sa that compose the word. The execution consists of two passes in
Fig. 2. Translation of word casa

Table 2. Different contexts for token sa
each block: the first pass (below, from right to left) defines stress and backward influences, and the second pass (above, from left to right) defines forward influence and resolves stress and influences to translate the tokens. Transitions used during execution are highlighted in red. Table 2 presents the use of token sa in different contexts. Blocks are structured in rows (α) and columns (π), defining different sounds for the prefix and the nucleus of the token. Stress (τ) is also considered in the columns. The original work [5] can be consulted for a step-by-step explanation of the adaptive process that changes the Translator from its initial configuration to the configurations that represent words, and for other translation examples presenting the behavior of different types of tokens in different contexts.

2.4 Disambiguation
The rules for reading some graphemes of Portuguese may not define clearly the correct form of reading or may sometimes allow more than one correct form of reading. Words with ’x’ starting a token are not clearly defined and the correct sound depends on the origin of the word. Words with ’e’ and ’o’ in the stressed syllable may have two different readings depending on the context.
In these cases the automaton generates a set of phonetic representations that may be used for the given word and an auxiliary method is used to define which of the representations in the set will be used as the output. Two sets of disambiguation rules were used, choosing the most probable phonetic representation considering morphological characteristics of the word, with and without its part-of-speech.
3
Text Translator
The Portuguese Grapheme-Phoneme Translator (PGPT) is an application that translates texts written in Portuguese to phoneme sequences that represent the speech of a native of São Paulo reading the input text. The output from text translations is generated according to IPA standards using Unicode symbols based on the appropriate Unicode chart. The translator is based on a word translation module. The implementation of the word translation module is based on the word translation method presented in the previous section, but changes were made to increase execution performance, avoid excessive memory usage and decrease loading time. The word translation module is surrounded by other modules that treat input texts, replacing acronyms, numbers and other complex structures with sequences of words that are translated one by one into phonetic sequences. If multiple translations are generated by the automaton, a disambiguation module chooses one of these translations and sends it to the output stream. The application was implemented on the Java 5 platform with graphical and command line interfaces for translation of user input and files, respectively. The lexical analyzer was implemented with Java's regular expression package (java.util.regex). It receives words as input and splits them into substrings that represent the tokens that will be used as input for the automaton. An API for adaptive automata execution was implemented, and the adaptive actions of the Recognizer and the Transducer structure were built over this API. While the model supposes the preexistence of a Recognizer submachine that handles all possible tokens, in the application this structure is built during translations. Whenever a token is recognized, the adaptive action is built and stored in a hash map under that given token for reuse. The Transducer is the exact reproduction of the one presented in the translation methodology over the automata execution API.
4
Results
This section presents a compilation of the results obtained in [5]. The classification was slightly changed, with a reclassification of words that could fit into two categories as incorrect results. Tests were run using texts published by Folha de S. Paulo, obtained from the CHAVEFolha collection [10]; the result spreadsheets and the software used for the tests can be found in [11]. The test phase was divided into two parts. In the first part the words were translated using the automata-based method and in the second part one of
Table 3. Word Translation
the results of the set generated by the automaton was selected based on the choosing rules. The automaton was tested with a set of 7797 words. These words were taken from journalistic articles on the themes of sports, culture, politics, technology and economy. Acronyms, names, typos and foreign words were removed from the main set since they need not follow the rules of Portuguese. Table 3 presents the results of the automaton execution, which were classified as:

1. Correct: yielded and expected translations are equal.
2. Incorrect: yielded and expected translations are different, but that does not affect understandability.
3. Doubt: the yielded translation set includes the expected translation.
4. Failure: yielded and expected translations are different, and that affects understandability.

The same texts were tagged using the VLMC Tagger [12] and the generated 9100 pairs of words and tags were used as input to test the translation method composed of the automaton and the choosing rules. Table 4 presents the results of grapheme-phoneme translation.

Table 4. Text Translation

Classification   (1) Choosing     (2) Most Probable
Correct          8331 (91.55%)    8103 (89.04%)
Incorrect        769 (8.45%)      997 (10.96%)
Total            9100 (100%)      9100 (100%)
The results were classified as correct if the translation result equals the expected representation, or as incorrect if any kind of discrepancy was found. The test was repeated choosing the most probable sound (2) to verify how much the choosing rules improved the output. The 2.5% accuracy increase over the whole set turns out to be very good, since 1600 pairs were classified as doubt (about a 15% accuracy increase inside this group).
5
Conclusion
This paper presented an application for grapheme-phoneme translation for Portuguese based on adaptive automata, an implementation of the method described
in [5]. First tests have shown that the application is quite successful, translating words into their expected phonetic representations in 91.5% of the words tested, and getting results that were not expected but still acceptable in a large amount of the other 8.5%. The method may be adapted for other variations of Portuguese by changing the rules that define the sounds of a token. With changes on characteristics of tokens such as stress rules, generated influences and received influences the method might even be used for different languages. The accuracy rate found is quite good and it indicates the solution can be used as part of the core of a text-to-speech translator or at least as a method to guess the correct phonetic representation of words that are not previously known. There is still room to increase the accuracy by fine tuning the rules and studying characteristics that are not checked in the model such as second stress. The research should follow on with the improvement of rules used for Portuguese, the study of phonetic rules for variations of Portuguese language and the study of rules for Spanish language.
References

1. Neto, J.J.: Adaptive Automata for Context-Sensitive Languages. SIGPLAN Notices 29(9), 115–124 (1994)
2. Beck, J., Braga, D., Nogueira, J., Coelho, L., Dias, M.: Automatic Syllabification for Danish Text-to-Speech Systems. In: Proceedings of Interspeech 2009, Brighton, United Kingdom, September 6-10 (2009)
3. Braga, D.: Natural Language Processing Algorithms for TTS systems in Portuguese. PhD Thesis, La Coruña University, Spain (2008) (in Portuguese)
4. Oliveira, C., Moutinho, L., Teixeira, A.: On European Portuguese Automatic Syllabification. In: González, G., et al. (eds.) III Congreso Internacional de Fonética Experimental, Santiago de Compostela: Xunta de Galicia, pp. 461–473 (2007)
5. Shibata, D.P.: Tradução Grafema-Fonema para a Língua Portuguesa baseada em Autômatos Adaptativos, p. 91. Dissertação de Mestrado, Escola Politécnica, Universidade de São Paulo, São Paulo (2008)
6. Barbosa, P.A., Albano, E.C.: Brazilian Portuguese. Illustrations of the IPA. Journal of the International Phonetic Association 34(2), 227–232 (2004)
7. Shibata, D.P., Rocha, R.L.A.: An Adaptive Automata based method to improve the output of text-to-speech translators. In: Congress of Logic Applied to Technology, Santos, vol. 6 (2007)
8. Neto, P.C., Infante, U.: Gramática da Língua Portuguesa, 1ª Edição, p. 583. Editora Scipione, São Paulo (1997)
9. International Phonetic Alphabet, http://www.langsci.ucl.ac.uk/ipa/index.html
10. Linguateca, http://www.linguateca.pt
11. Shibata, D.P.: http://sites.google.com/site/daniloshibata/
12. Kepler, F.N.: Um etiquetador morfo-sintático baseado em Cadeias de Markov de tamanho variável, p. 58. Dissertação de Mestrado, Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo (2005)
Improvement of Inventory Control under Parametric Uncertainty and Constraints

Nicholas Nechval1, Konstantin Nechval2, Maris Purgailis1, and Uldis Rozevskis1

1 University of Latvia, EVF Research Institute, Statistics Department, Raina Blvd 19, LV-1050 Riga, Latvia
[email protected]
2 Transport and Telecommunication Institute, Applied Mathematics Department, Lomonosov Street 1, LV-1019 Riga, Latvia
[email protected]

Abstract. The aim of the present paper is to show how the statistical inference equivalence principle (SIEP), the idea of which belongs to the authors, may be employed in the particular case of finding the effective statistical decisions for the multi-product inventory problems with constraints. To our knowledge, no analytical or efficient numerical method for finding the optimal policies under parametric uncertainty for the multi-product inventory problems with constraints has been reported in the literature. Using the (equivalent) predictive distributions, this paper represents an extension of analytical results obtained for unconstrained optimization under parametric uncertainty to the case of constrained optimization. A numerical example is given.

Keywords: Inventory problem, parametric uncertainty, constraints, pivotal quantity, equivalent predictive inferences.
1 Introduction

The last decade has seen a substantial research focus on the modeling, analysis and optimization of complex stochastic service systems, motivated in large measure by applications in areas such as transport, computer and telecommunication networks. Optimization issues, which broadly focus on making the best use of limited resources, are recognized as of increasing importance. However, stochastic optimization in the context of systems and processes of any complexity is technically very difficult. Most stochastic models to solve the problems of control and optimization of systems and processes are developed in the extensive literature under the assumption that the parameter values of the underlying distributions are known with certainty. In actual practice, such is simply not the case. When these models are applied to solve real-world problems, the parameters are estimated and then treated as if they were the true values. The risk associated with using estimates rather than the true parameters is called estimation risk and is often ignored. When data are limited and (or) unreliable, estimation risk may be significant, and failure to incorporate it into the model design may lead to serious errors. Its explicit consideration is important since decision rules
that are optimal in the absence of uncertainty need not even be approximately optimal in the presence of such uncertainty. In this paper, we propose a new approach to solve constrained optimization problems under parametric uncertainty. This approach is based on the statistical inference equivalence principle, the idea of which belongs to the authors. It allows one to yield an operational, optimal information-processing rule and may be employed for finding the effective statistical decisions for problems such as multi-product newsboy problem with constraints, allocation of aircraft to routes under uncertainty, airline set inventory control for multi-leg flights, etc. For instance, one of the above problems can be formulated as follows. An airline company operates more than one route. It has available more than one type of airplanes. Each type has its relevant capacity and costs of operation. The demand on each route is known only in the form of the sample data, and the question asked is: which aircraft should be allocated to which route in order to minimize the total cost (performance index) of operation? This latter involves two kinds of costs: the costs connected with running and servicing an airplane, and the costs incurred whenever a passenger is denied transportation because of lack of seating capacity. (This latter cost is “opportunity” cost.) We define and illustrate the use of the loss function, the cost structure of which is piecewise linear. Within the context of this performance index, we assume that a distribution function of the passenger demand on each route is known as certain component of a given set of predictive models. Thus, we develop our discussion of the allocation problem in the presence of completely specified set of predictive demand models. We formulate this problem in a probabilistic setting. Let A1, ..., Ag be the set of airplanes which company utilize to satisfy the passenger demand for transportation en routes 1, ..., h. It is assumed that the company operates h routes which are of different lengths, and consequently, different profitabilities. Let f ij( k ) ( y ) represent the predictive probability density function of the passenger demand
Y for transportation en route j, j ∈ {1, ..., h}, at the ith stage (i ∈ {1, ..., n}) for the kth predictive model (k ∈ {1, ..., m}). It is required to minimize the expected total cost of operation (the performance index)

J_i(U_i) = ∑_{j=1}^{h} [ ∑_{r=1}^{g} w_rij u_rij + c_j ∫_{Q_ij}^{∞} (y − Q_ij) f_ij^(k)(y) dy ]    (1)

subject to

∑_{j=1}^{h} u_rij ≤ a_ri,  r = 1, ..., g,   where   Q_ij = ∑_{r=1}^{g} u_rij q_rj,  j = 1, ..., h,    (2)
U_i = {u_rij} is the g × h matrix of decision variables, u_rij is the number of units of airplane A_r allocated to the jth route at the ith stage, w_rij is the operation cost of airplane A_r for the jth route at the ith stage, c_j is the price of a one-way ticket for air travel on the jth route, q_rj is the limited seating capacity of airplane A_r for the jth route, and a_ri is the number of units of airplane A_r available at the ith stage. To use the data of observations of the real airline system more effectively, the technique proposed in this paper might be employed to optimize the statistical decisions under parametric uncertainty and constraints derived from the analytical model (1)-(2).
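To make the roles of the quantities in (1)-(2) concrete, the sketch below evaluates the performance index for one candidate allocation; the two airplane types, three routes, cost figures and the exponential demand densities are purely hypothetical illustration values, not data from the paper.

```python
import numpy as np
from scipy import integrate

g, h = 2, 3                                   # airplane types and routes
w = np.array([[4.0, 5.0, 6.0],                # w_rij: operation cost of type r on route j
              [7.0, 8.0, 9.0]])
c = np.array([1.0, 1.2, 1.5])                 # c_j: one-way ticket price on route j
q = np.array([[100.0, 100.0, 100.0],          # q_rj: seating capacity of type r on route j
              [200.0, 200.0, 200.0]])
a = np.array([4, 3])                          # a_ri: available units of each type
mean_demand = np.array([150.0, 250.0, 400.0]) # assumed exponential demand means

def expected_cost(U):
    """Performance index (1) for an allocation matrix U of shape (g, h)."""
    Q = (U * q).sum(axis=0)                   # Q_ij = sum_r u_rij q_rj
    J = 0.0
    for j in range(h):
        f = lambda y, m=mean_demand[j]: np.exp(-y / m) / m   # assumed demand density
        lost, _ = integrate.quad(lambda y: (y - Q[j]) * f(y), Q[j], np.inf)
        J += (w[:, j] * U[:, j]).sum() + c[j] * lost
    return J

U = np.array([[1, 1, 2],                      # a feasible candidate allocation
              [0, 1, 2]])
assert np.all(U.sum(axis=1) <= a)             # availability constraint (2)
print(expected_cost(U))
```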
2 Inference Equivalence Principle In the general formulation of decision theory, we observe a random variable X (which may be multivariate) with distribution function F(x|θ) where a parameter θ (in general, vector) is unknown, θ∈Θ, and if we choose decision d from the set of all possible decisions D, then we suffer a loss l(d,θ). A “decision rule” is a method of choosing d from D after observing x∈X, that is, a function u(x)=d. Our average loss (called risk) Ex{l(u(X),θ)} is a function of both θ and the decision rule u(⋅), called the risk function r(u,θ), and is the criterion by which rules are compared. Thus, the expected loss (gains are negative losses) is a primary consideration in evaluating decisions. We will now define the major quantities just introduced. A general statistical decision problem is a triplet (Θ,D,l) and a random variable X. The random variable X (called the data) has a distribution function F(x|θ) where θ is unknown but it is known that θ∈Θ. X will denote the set of possible values of the random variable X. θ is called the state of nature, while the nonempty set Θ is called the parameter space. The nonempty set D is called the decision space or action space. Finally, l is called the loss function and to each θ∈Θ and d∈D it assigns a real number l(d,θ). For a statistical decision problem (Θ,D,l), X, a (nonrandomized) decision rule is a function u(⋅) which to each x∈X assigns a member d of D: u(X)=d. The risk function r(u,θ) of a decision rule u(X) for a statistical decision problem (Θ,D,l), X (the expected loss or average loss when θ is the state of nature and a decision is chosen by rule u(⋅)) is r(u,θ)=Ex{l(u(X),θ)}. This paper is concerned with the implications of group theoretic structure for invariant loss functions. Our underlying structure consists of a class of probability models (X, A, P), a one-one mapping ψ taking P onto an index set Θ, a measurable space of actions (D, B), and a real-valued loss function
l(d, θ) = E_x{l_D(d, X)}   (3)

defined on Θ × D, where l_D(d, X) is a random loss function with a random variable X ∈ (0,∞) (or (−∞,∞)). We assume that a group G of one-one A-measurable transformations acts on X and that it leaves the class of models (X, A, P) invariant. We further assume that homomorphic images Ḡ and G̃ of G act on Θ and D, respectively. (Ḡ may be induced on Θ through ψ; G̃ may be induced on D through l.) We shall say that l is invariant if for every (θ, d) ∈ Θ × D

l(g̃d, ḡθ) = l(d, θ),  g ∈ G.   (4)

A loss function l(d, θ) can be transformed as follows:

l(d, θ) = l(g̃_θ̂⁻¹ d, ḡ_θ̂⁻¹ θ) = l#(η, V),   (5)
where V = V(θ, θ̂) is a pivotal quantity whose distribution does not depend on the unknown parameter θ; η = η(d, θ̂) is an ancillary factor; θ̂ is a maximum likelihood estimator of θ (or a sufficient statistic for θ). Then the best invariant decision rule (BIDR) is given by

u_BIDR ≡ d* = η⁻¹(η*, θ̂),  where  η* = arg inf_η E{l#(η, V)},   (6)

and a risk function

r(u_BIDR, θ) = E_θ{l(u_BIDR, θ)} = E_v{l#(η*, V)}   (7)
does not depend on θ. Consider now a situation described by one of a family of density functions f(x|μ,σ) indexed by the vector parameter θ=(μ,σ), where μ and σ (>0) are respectively parameters of location and scale. For this family, invariant under the group of positive linear transformations x → ax+b with a > 0, we shall assume that there is obtainable from some informative experiment (a random sample of observations X=(X1, …, Xn)) a sufficient statistic (M,S) for (μ,σ) with density function h(m,s|μ,σ) of the form

h(m, s | μ, σ) = σ⁻² h_•[(m − μ)/σ, s/σ]   (8)

such that

h(m, s | μ, σ) dm ds = h_•(v₁, v₂) dv₁ dv₂,   (9)
where V₁=(M−μ)/σ, V₂=S/σ. We are thus assuming that for the family of density functions an induced invariance holds under the group G of transformations: m→am+b, s→as (a>0). The family of density functions f(x|μ,σ) satisfying the above conditions is, of course, the limited one of normal, negative exponential, Weibull and gamma (with known index) density functions. The structure of the problem is, however, more clearly seen within the general framework. Suppose that we deal with a loss function l⁺(d,θ) = E_x{l_D(d,X)} = ω(σ) l(d,θ), where ω(σ) is some function of σ and ω(σ)=ω_•(V₂,S). In order to obtain an equivalent predictive loss function l_•(d, m, s), which is independent of θ and has the same optimal invariant statistical decision rule given by (6), i.e.,

arg min_d l_•(d, M, S) = d* ≡ u_BIDR,   (10)

with a risk given by

E_{m,s}{l_•(u_BIDR, M, S)} = ω(σ) r(u_BIDR, θ),   (11)

we define an equivalent predictive probability density function of a random variable X (with a probability density function f(x|μ,σ)) as
f_•(x | m, s) = ∫∫_{v₁,v₂} f(x, v₁, v₂ | m, s) h_{••}(v₁, v₂) dv₁ dv₂ ,   (12)

where

f(x, v₁, v₂ | m, s) = f(x | μ, σ),   (13)

h_{••}(v₁, v₂) = ω_•⁻¹(v₂, s) h_•(v₁, v₂) ( ∫∫_{v₁,v₂} ω_•⁻¹(v₂, s) h_•(v₁, v₂) dv₁ dv₂ )⁻¹ .   (14)
Then l_•(d, m, s) is given by

l_•(d, m, s) = E_x{l_D(d, X) | m, s} = ∫_x l_D(d, x) f_•(x | m, s) dx.   (15)
Now the predictive loss function l_•(d, m, s) can be used to obtain efficient frequentist statistical decisions under parameter uncertainty for constrained optimization problems where the known approaches are unable to do so.
3 Newsboy Problem with No Constraints
The classical newsboy problem is reflective of many real-life situations and is often used to aid decision-making in the fashion and sporting industries, both at the manufacturing and retail levels (Gallego and Moon [1]). The newsboy problem can also be used in managing capacity and evaluating advanced booking of orders in service industries such as airlines and hotels (Weatherford and Pfeifer [2]). A partial review of the newsboy problem literature has recently been conducted in a textbook by Silver et al. [3]. Researchers have followed two approaches to solving the newsboy problem. In the first approach, the expected costs of overestimating and underestimating demand are minimized. In the second approach, the expected profit is maximized. Both approaches yield the same results. We use the first approach in stating the newsboy problem. For product j, define:

X_j : quantity demanded during the period, a random variable,
f_j(x_j|μ_j,σ_j) : the probability density function of X_j,
θ_j = (μ_j,σ_j) : the parameter of f_j(x_j|μ_j,σ_j),
F_j(x_j|μ_j,σ_j) : the cumulative distribution function of X_j,
c_j^(1) : overage (excess) cost per unit,
c_j^(2) : underage (shortage) cost per unit,
d_j : inventory/order quantity, a decision variable.

The cost per period is

l_Dj(d_j, X_j) = c_j^(1)(d_j − X_j), if X_j < d_j, or c_j^(2)(X_j − d_j), if X_j ≥ d_j .   (16)
Complete information. A standard newsboy formulation (see, e.g., Nahmias [4]) is to consider each product j's cost function:

l⁺_j(d_j, θ_j) = c_j^(1) ∫_{−∞}^{d_j} (d_j − x_j) f_j(x_j | μ_j, σ_j) dx_j + c_j^(2) ∫_{d_j}^{∞} (x_j − d_j) f_j(x_j | μ_j, σ_j) dx_j .   (17)

Expanding (17) gives

l⁺_j(d_j, θ_j) = −c_j^(1) ∫_{−∞}^{d_j} x_j f_j(x_j | μ_j, σ_j) dx_j + c_j^(2) ∫_{d_j}^{∞} x_j f_j(x_j | μ_j, σ_j) dx_j + (c_j^(1) + c_j^(2)) d_j [ F_j(d_j | μ_j, σ_j) − c_j^(2)/(c_j^(1) + c_j^(2)) ] .   (18)
Let the superscript * denote optimality. Using Leibniz's rule to obtain the first and second derivatives shows that l⁺_j(d_j | θ_j) is convex. The sufficient optimality condition is the well-known fractile formula:

F_j(d*_j | μ_j, σ_j) = c_j^(2) / (c_j^(1) + c_j^(2)) .   (19)

It follows from (19) that

d*_j = F_j⁻¹[ c_j^(2)/(c_j^(1) + c_j^(2)) | μ_j, σ_j ] .   (20)
At optimality, substituting (19) into the last (bracketed) term in Eq. (18) gives

(c_j^(1) + c_j^(2)) d*_j ( F_j(d*_j | μ_j, σ_j) − c_j^(2)/(c_j^(1) + c_j^(2)) ) = 0.   (21)
Hence (18) reduces to

l⁺_j(d*_j, θ_j) = c_j^(2) E_{x_j}{X_j} − (c_j^(1) + c_j^(2)) ∫_{−∞}^{d*_j} x_j f_j(x_j | μ_j, σ_j) dx_j .   (22)
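The fractile solution (19)-(20) is straightforward to evaluate numerically. The following sketch uses a hypothetical normally distributed demand and illustrative cost values (none of which appear in the paper) and checks that the fractile point indeed minimizes the expected cost (17).

```python
import numpy as np
from scipy import stats, integrate

# Hypothetical illustration: product j with normally distributed demand.
c1, c2 = 1.0, 4.0                 # overage and underage cost per unit
mu, sigma = 100.0, 20.0           # assumed demand parameters
dist = stats.norm(mu, sigma)

# Optimal order quantity from the critical-fractile formula (19)-(20).
d_star = dist.ppf(c2 / (c1 + c2))

def expected_cost(d):             # expected cost (17) by numerical integration
    over = integrate.quad(lambda x: (d - x) * dist.pdf(x), -np.inf, d)[0]
    under = integrate.quad(lambda x: (x - d) * dist.pdf(x), d, np.inf)[0]
    return c1 * over + c2 * under

# The fractile point should (approximately) minimize the expected cost.
grid = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 601)
d_grid = grid[np.argmin([expected_cost(d) for d in grid])]
print(d_star, d_grid)             # the two values should nearly coincide
```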
Parametric Uncertainty. Let us assume that the functional form of the probability density function f_j(x_j|μ_j,σ_j) is specified but its parameter θ_j = (μ_j,σ_j) is not. Let X_j = (X_j1, …, X_jn) be a random sample of observations on a continuous random variable X_j. We shall assume that there is obtainable from this random sample a sufficient statistic (M_j, S_j) for θ_j = (μ_j,σ_j) with density function of the form (8),

h_j(m_j, s_j | μ_j, σ_j) = σ_j⁻² h_{•j}[(m_j − μ_j)/σ_j, s_j/σ_j],   (23)

and with

h_j(m_j, s_j | μ_j, σ_j) dm_j ds_j = h_{•j}(v_{1j}, v_{2j}) dv_{1j} dv_{2j},   (24)

where V_{1j} = (M_j − μ_j)/σ_j, V_{2j} = S_j/σ_j.
Using an invariant embedding technique (Nechval et al. [5-8]), we transform (17) as follows:

l⁺_j(d_j, θ_j) = ω_j(σ_j) l#_j(η_j, V_j),   (25)

where ω_j(σ_j) = σ_j,

l#_j(η_j, V_j) = c_j^(1) ∫_{−∞}^{η_j V_{2j}+V_{1j}} (η_j V_{2j} + V_{1j} − z_j) f_j(z_j) dz_j + c_j^(2) ∫_{η_j V_{2j}+V_{1j}}^{∞} (z_j − η_j V_{2j} − V_{1j}) f_j(z_j) dz_j ,   (26)

Z_j = (X_j − μ_j)/σ_j is a pivotal quantity, f_j(z_j) is defined by f_j(x_j|μ_j,σ_j), i.e.,

f_j(z_j) dz_j = f_j(x_j|μ_j,σ_j) dx_j,   (27)

V_j = (V_{1j}, V_{2j}) is a pivotal quantity, and η_j = (d_j − M_j)/S_j is an ancillary factor. It follows from (25) that the risk associated with u_j^BIDR (or η*_j) can be expressed as
r⁺_j(u_j^BIDR, θ_j) = E_{m_j,s_j}{l⁺_j(u_j^BIDR, θ_j)} = ω_j(σ_j) E_{v_j}{l#_j(η*_j, V_j)},   (28)

where

u_j^BIDR ≡ d*_j = M_j + η*_j S_j ,  η*_j = arg min_{η_j} E_{v_j}{l#_j(η_j, V_j)},   (29)

E_{v_j}{l#_j(η_j, V_j)} = ∫∫_{v_{1j},v_{2j}} l#_j(η_j; v_{1j}, v_{2j}) h_{•j}(v_{1j}, v_{2j}) dv_{1j} dv_{2j} .   (30)
The fact that (30) is independent of θ_j means that the ancillary factor η*_j, which minimizes (30), is uniformly best invariant. Thus, d*_j given by (29) is the best invariant decision rule.
4 Numerical Example
Complete Information. Assuming that the demand for product j, X_j, is exponentially distributed with the probability density function

f_j(x_j|σ_j) = (1/σ_j) exp(−x_j/σ_j)  (x_j > 0),   (31)

it follows from (17), (20) and (22) that

l⁺_j(d_j, σ_j) = c_j^(1)(d_j − σ_j) + (c_j^(1) + c_j^(2)) σ_j exp(−d_j/σ_j),   (32)

d*_j = σ_j ln(1 + c_j^(2)/c_j^(1)),  and  l⁺_j(d*_j, σ_j) = c_j^(1) σ_j ln(1 + c_j^(2)/c_j^(1)),   (33)

respectively.
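The closed forms (32)-(33) can be checked numerically; the sketch below assumes an arbitrary scale σ_j and the cost ratio c_j^(2)/c_j^(1) = 100 used later in the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

c1, c2 = 1.0, 100.0          # cost ratio c2/c1 = 100 as in the paper
sigma = 5.0                  # arbitrary (known) scale parameter

def cost(d):                 # expected cost (32) for exponential demand
    return c1 * (d - sigma) + (c1 + c2) * sigma * np.exp(-d / sigma)

d_star = sigma * np.log(1.0 + c2 / c1)          # closed form (33)
num = minimize_scalar(cost, bounds=(0.0, 50.0 * sigma), method="bounded")

print(d_star, num.x)                                     # should agree
print(cost(d_star), c1 * sigma * np.log(1.0 + c2 / c1))  # optimal cost (33)
```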
Parametric Uncertainty. Consider the case when the parameter σ_j is unknown. Let X_j = (X_j1, …, X_jn) be a random sample of observations (each with density function (31)) on a continuous random variable X_j. Then

S_j = ∑_{i=1}^{n} X_ji   (34)

is a sufficient statistic for σ_j; S_j is distributed with

h_j(s_j | σ_j) = [Γ(n) σ_j^n]⁻¹ s_j^{n−1} exp(−s_j/σ_j)  (s_j > 0),   (35)

so that

h_{•j}(v_{2j}) = [Γ(n)]⁻¹ v_{2j}^{n−1} e^{−v_{2j}}  (v_{2j} > 0).   (36)
It follows from (28) and (32) that

r⁺_j(u_j^BIDR, σ_j) = E_{s_j}{l⁺_j(u_j^BIDR, σ_j)} = σ_j ∫_0^∞ l#_j(η*_j, v_{2j}) h_{•j}(v_{2j}) dv_{2j}
 = σ_j [ c_j^(1)(n η*_j − 1) + (c_j^(1) + c_j^(2))(1 + η*_j)^{−n} ] ,   (37)

where

u_j^BIDR = η*_j S_j ,   (38)

η*_j = arg min_{η_j} σ_j [ c_j^(1)(n η_j − 1) + (c_j^(1) + c_j^(2))/(1 + η_j)^n ] = [ 1 + c_j^(2)/c_j^(1) ]^{1/(n+1)} − 1.   (39)
Comparison of Decision Rules. For comparison, consider the maximum likelihood decision rule (MLDR) that may be obtained from (33),

u_j^MLDR = σ̂_j ln(1 + c_j^(2)/c_j^(1)) = η_j^MLDR S_j ,   (40)

where σ̂_j = S_j/n is the maximum likelihood estimator of σ_j,

η_j^MLDR = ln(1 + c_j^(2)/c_j^(1)) / n .   (41)

Since u_j^BIDR and u_j^MLDR belong to the same class

C = {u_j : u_j = η_j S_j},   (42)
it follows from the above that u_j^MLDR is inadmissible in relation to u_j^BIDR. If, say, n = 1 and c_j^(2)/c_j^(1) = 100, we have that
rel.eff._{r⁺}{u_j^MLDR, u_j^BIDR, σ_j} = r⁺_j(u_j^BIDR, σ_j) / r⁺_j(u_j^MLDR, σ_j)
 = ( n η*_j − 1 + (1 + c_j^(2)/c_j^(1))/(1 + η*_j)^n ) ( n η_j^MLDR − 1 + (1 + c_j^(2)/c_j^(1))/(1 + η_j^MLDR)^n )⁻¹ = 0.838.   (43)
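The relative-efficiency figure can be reproduced directly from (37), (39) and (41): both risks share the factor σ_j c_j^(1), so only the cost ratio and the sample size matter. A short sketch:

```python
import numpy as np

n, ratio = 1, 100.0                       # sample size and c2/c1 as in the text

def risk_factor(eta):
    # bracketed factor of (37): risk divided by sigma_j * c1
    return n * eta - 1.0 + (1.0 + ratio) / (1.0 + eta) ** n

eta_bidr = (1.0 + ratio) ** (1.0 / (n + 1)) - 1.0     # (39)
eta_mldr = np.log(1.0 + ratio) / n                    # (41)

rel_eff = risk_factor(eta_bidr) / risk_factor(eta_mldr)
print(round(rel_eff, 3))     # approximately 0.838, i.e. about 16.2 % lower risk
```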
Thus, in this case, the use of u_j^BIDR leads to a reduction in the risk of about 16.2% as compared with u_j^MLDR. The absolute risk will be proportional to σ_j and may be considerable.
Equivalent Predictive Loss Function. In order to obtain an equivalent predictive loss function l•_j(d_j, S_j), which is independent of σ_j and has the same optimal invariant statistical solution given by (29), i.e.,

arg min_{d_j} l•_j(d_j, S_j) = d*_j ≡ u_j^BIDR,   (44)

with a risk given by

E_{s_j}{l•_j(u_j^BIDR, S_j)} = r⁺_j(u_j^BIDR, σ_j),   (45)
we define (on the basis of (12)) an equivalent predictive distribution of the random variable X_j as

f•_j(x_j | s_j) = ((n+1)/s_j) (1 + x_j/s_j)^{−(n+2)}  (x_j > 0),  or  F•_j(x_j | s_j) = 1 − (1 + x_j/s_j)^{−(n+1)} .   (46)
Then l•_j(d_j, s_j) is given by

l•_j(d_j, s_j) = E_{x_j}{l_Dj(d_j, X_j) | s_j} = ∫_0^∞ l_Dj(d_j, x_j) f•_j(x_j | s_j) dx_j
 = (s_j/n) [ c_j^(1)(n d_j s_j⁻¹ − 1) + (c_j^(1) + c_j^(2))(1 + d_j s_j⁻¹)^{−n} ] .   (47)
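As a quick consistency check, minimizing (47) over d_j should return exactly the best invariant rule d*_j = η*_j s_j from (38)-(39); the sketch below does this for an arbitrary observed value of S_j and the same costs as above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, c1, c2, s = 1, 1.0, 100.0, 7.0     # arbitrary observed statistic s_j

def predictive_loss(d):               # equivalent predictive loss (47)
    return (s / n) * (c1 * (n * d / s - 1.0) + (c1 + c2) * (1.0 + d / s) ** (-n))

eta_star = (1.0 + c2 / c1) ** (1.0 / (n + 1)) - 1.0   # (39)
opt = minimize_scalar(predictive_loss, bounds=(0.0, 100.0 * s), method="bounded")

print(opt.x, eta_star * s)            # the minimizer coincides with eta*_j s_j
```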
Now the equivalent predictive loss function l•_j(d_j, s_j) can be used to obtain efficient frequentist statistical solutions under parameter uncertainty for constrained optimization problems where the known approaches are unable to do so.
5 Newsboy Problem with Constraints Complete Information. Define wj (>0) as product j's per-unit requirement of a constrained resource, and wΣ as the maximum availability of the resource. The formulation for minimizing the total expected cost of N products subject to one capacity constraint is as follows:
Minimize

∑_{j=1}^{N} l⁺_j(d_j, θ_j) = ∑_{j=1}^{N} [ c_j^(1) ∫_0^{d_j} (d_j − x_j) f_j(x_j | μ_j, σ_j) dx_j + c_j^(2) ∫_{d_j}^{∞} (x_j − d_j) f_j(x_j | μ_j, σ_j) dx_j ]
 = ∑_{j=1}^{N} [ c_j^(1) ∫_{−∞}^{d_j} F_j(x_j | μ_j, σ_j) dx_j + c_j^(2) ∫_{d_j}^{∞} [1 − F_j(x_j | μ_j, σ_j)] dx_j ]   (48)

subject to

∑_{j=1}^{N} w_j d_j ≤ w_Σ .   (49)
The above problem can be solved as follows. Compute d*_j for each product j with Eq. (20) and check whether ∑_j w_j d*_j exceeds w_Σ. If it does not, the capacity constraint is not active, and the optimal order quantity is d*_j, ∀j = 1(1)N. Otherwise, the constraint is set to equality and the Lagrange function is introduced.
Parametric Uncertainty. In this case, the problem is as follows: Minimize the total equivalent predictive loss function

∑_{j=1}^{N} l•_j(d_j, m_j, s_j) = ∑_{j=1}^{N} [ c_j^(1) ∫_0^{d_j} (d_j − x_j) f•_j(x_j | m_j, s_j) dx_j + c_j^(2) ∫_{d_j}^{∞} (x_j − d_j) f•_j(x_j | m_j, s_j) dx_j ]

subject to

∑_{j=1}^{N} w_j d_j ≤ w_Σ .   (50)
Now we can obtain the effective statistical solutions under the capacity constraint and parametric uncertainty by solving this problem in the same manner as in the case of complete information, namely:

d*_j = F•_j⁻¹( [c_j^(2) − λ w_j][c_j^(1) + c_j^(2)]⁻¹ | m_j, s_j ),  ∀j = 1(1)N,   (51)

where the value of the Lagrange multiplier λ can be determined by solving the single-variable (λ) non-linear equation

∑_{j=1}^{N} w_j F•_j⁻¹( [c_j^(2) − λ w_j][c_j^(1) + c_j^(2)]⁻¹ | m_j, s_j ) − w_Σ = 0.   (52)
Consider, for instance, the case of the numerical example of Section 4, with N = 2, s_j = s, c_j^(1) = c_1, c_j^(2) = c_2, c_j^(2)/c_j^(1) = 100 and w_j = 1 for j ∈ {1, 2}. We find (with n_1 = n_2 = 1 and w_Σ = 14s) that in this case the use of u_j^BIDR (j = 1, 2) leads to a reduction in the risk of about 14% as compared with u_j^MLDR (j = 1, 2).
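For this two-product illustration the predictive quantile has the closed form F•_j⁻¹(p | s_j) = s_j[(1 − p)^{−1/(n+1)} − 1], obtained by inverting (46), so (52) reduces to one-dimensional root finding. The sketch below assumes c_j^(1) = 1 and s = 1 for concreteness; these normalizations are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import brentq

N, n, c1, c2, s = 2, 1, 1.0, 100.0, 1.0
w = np.ones(N)
w_total = 14.0 * s

def F_inv(p, s_j):
    # inverse of the predictive cdf (46): F(x|s) = 1 - (1 + x/s)^-(n+1)
    return s_j * ((1.0 - p) ** (-1.0 / (n + 1)) - 1.0)

def capacity_gap(lam):     # left-hand side of (52)
    return sum(w[j] * F_inv((c2 - lam * w[j]) / (c1 + c2), s)
               for j in range(N)) - w_total

# The unconstrained orders (lambda = 0) exceed the capacity here,
# so the constraint is active and (52) has a root in (0, c2).
lam = brentq(capacity_gap, 0.0, c2)
d_constrained = [F_inv((c2 - lam * w[j]) / (c1 + c2), s) for j in range(N)]
print(lam, d_constrained)            # each d_j comes out as 7 s in this setting
```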
6 Conclusion In this paper, we propose a new approach to solve constrained optimization problems under parametric uncertainty. It is especially efficient when we deal with asymmetric loss functions and small data samples. The results obtained in the paper agree with the computer simulation results, which confirm the validity of the theoretical predictions of performance of the suggested approach.
References 1. Gallego, G., Moon, I.: The Distribution Free Newsboy Problem: Review and Extensions. The Journal of the Operational Research Society 44, 825–834 (1993) 2. Weatherford, L.R., Pfeifer, P.E.: The Economic Value of Using Advance Booking of Orders. Omega 22, 105–111 (1994) 3. Silver, E.A., Pyke, D.F., Peterson, R.P.: Inventory Management and Production Planning and Scheduling. John Wiley, New York (1998) 4. Nahmias, S.: Production and Operations Management. Irwin, Boston (1996) 5. Nechval, N.A., Nechval, K.N., Vasermanis, E.K.: Optimization of Interval Estimators via Invariant Embedding Technique. IJCAS (The International Journal of Computing Anticipatory Systems) 9, 241–255 (2001) 6. Nechval, N.A., Nechval, K.N., Vasermanis, E.K.: Effective State Estimation of Stochastic Systems. Kybernetes (The International Journal of Systems & Cybernetics) 32, 666–678 (2003) 7. Nechval, N.A., Nechval, K.N., Vasermanis, E.K.: Prediction Intervals for Future Outcomes with a Minimum Length Property. Computer Modelling and New Technologies 8, 48–61 (2004) 8. Nechval, N.A., Berzins, G., Purgailis, M., Nechval, K.N.: Improved Estimation of State of Stochastic Systems via Invariant Embedding Technique. WSEAS Transactions on Mathematics 7, 141–159 (2008)
Modified Jakubowski Shape Transducer for Detecting Osteophytes and Erosions in Finger Joints
Marzena Bielecka¹, Andrzej Bielecki², Mariusz Korkosz³, Marek Skomorowski², Wadim Wojciechowski⁴, and Bartosz Zieliński²
¹ Department of Geoinformatics and Applied Computer Science, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, Mickiewicza 30, 30-059 Cracow, Poland
² Institute of Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348 Cracow, Poland {bielecki,skomorowski}@ii.uj.edu.pl
³ Division of Rheumatology, Department of Internal Medicine and Gerontology, Jagiellonian University Hospital, Śniadeckich 10, 31-531 Cracow, Poland
⁴ Department of Radiology, Jagiellonian University Hospital, Kopernika 19, 31-531 Cracow, Poland
Abstract. In this paper, a syntactic method of pattern recognition is applied to hand radiograph interpretation, in order to recognize erosions and osteophytes in the finger joints. It is shown that the classical Jakubowski transducer does not distinguish contours of healthy bones from contours of affected bones. Therefore, modifications of the transducer are introduced. It is demonstrated that the modified transducer correctly recognizes the classes of bone shapes obtained based on the medical classification: healthy bone class, erosion bone class and osteophyte bone class. Keywords: Syntactic method of pattern recognition, Medical imaging, Computer assisted rheumatic diagnosis.
1 Introduction
Arthritis and musculoskeletal disorders are more prevalent and frequent causes of disability than heart disease or cancer [11]. There are a number of inflammatory as well as non-inflammatory diseases within the scope of rheumatology and diagnostic radiology. It is essential to distinguish between inflammatory disorders, which can be fatal, and non-inflammatory disorders, which are relatively harmless and can occur in the majority of people aged around 65. To give a diagnosis,
an X-ray is taken of the patient's hand and symmetric metacarpophalangeal joint spaces and interphalangeal joint spaces are analyzed [14]. Thus, the changes in the borders of finger joint surfaces observed on hand radiographs are a crucial point in medical diagnosis and provide important information for estimating the efficiency of therapy. However, they are difficult to detect in an X-ray picture when examined by a human expert, due to the number of joints. On the other hand, it is extremely important to diagnose pathological changes in the early stages of a disease, which means that differences on the order of 0.5 mm between the contours of pathologically changed bones and unaffected ones need to be identified. The possibility of performing such analysis by a computer system is a key point for diagnosis support. Therefore, studies concerning the possibility of implementing such systems are the topic of numerous publications [12,13,16] (see other references in [6]). This research is part of the extensive stream of studies concerning the application of artificial intelligence methods to medical image understanding [15].
Fig. 1. Radiographs of a healthy joint (a), bones with osteophytes (b, c) and joints with erosions (d, e)
This paper is a continuation of the studies described in [2,3,4,5,6,17,18], concerning automatic hand radiograph analysis. In the previous papers the preprocessing and joint location algorithms were presented. At the beginning, the applied approach turned out to be effective in about 90% of cases [6]; the algorithm was then improved in [18] and an efficiency of 97% was achieved. Based on those locations, an algorithm identifying the borders of the upper and lower joint surfaces was proposed [5]. A preliminary analysis of such borders for erosion detection is studied in [2,4]. In this paper, a syntactic method of pattern recognition is applied to hand radiograph interpretation, in order to recognize erosions and osteophytes in the finger joints. Examples of a healthy joint radiograph and of joints with osteophytes and erosions are shown in Fig.1(a), Fig.1(b,c) and Fig.1(d,e), respectively. Possible locations of the osteophytes and erosions are shown as bold lines in Fig.2(a) and Fig.2(b), respectively. It is shown that the classical Jakubowski transducer [8] does not distinguish contours of healthy bones from contours of
Fig. 2. Contours with possible locations where osteophytes (a) and erosions (b) may occur marked by bold line
affected bones. Therefore, modifications of the transducer are introduced. It is demonstrated that the modified transducer correctly recognizes the classes of bone shapes obtained from the medical classification: the healthy bone class, the osteophyte bone class and the erosion bone class. The paper is organized in the following way. The shape description methodology is recalled in Section 2. In Section 3, the Jakubowski transducer is used for bone contour analysis and the necessary modifications are introduced.
2 Shape Description Methodology
Let us recall the formalism presented in [7,9,8,10], where the basic unit of the analysed pattern is one of the sixteen primitives from the set PRIM, being line segments or quarters of a circle (see Fig.3a). It should be mentioned that the bi-indexation enumerating the primitives plays a crucial role in the contour analysis. Let us also recall the definition of a contour k = p1 p2 ... pm, where p1, p2, ..., pm are successive primitives of the contour k. The symbol pi pi+1 denotes that pi is connected to pi+1, such that hd(pi) = tl(pi+1), where hd(pk) and tl(pk) correspond to the head and tail of the primitive pk (see Fig.3b). The characterological description of contour k is the chain of successive primitive types defined as char(k) = s_{i1 j1} s_{i2 j2} ... s_{im jm}. Moreover, Qo is defined as the set of primitives from the o-th quarter, for o = 1, 2, 3, 4, therefore: Qo = {sij : (j = o) ∨ (i = 1 ∧ j = o ⊕ 1)}, where o ⊕ 1 = 1 if o = 4 and o ⊕ 1 = o + 1 otherwise.
Fig. 3. Set PRIM (a) and construction of primitive (b)
A contour k with char(k) = v such that v ∈ Q_i^+ ∧ (length(v) > 1 ∨ (length(v) = 1 ∧ v ∈ Q_i \ (Q_{i⊕3} ∪ Q_{i⊕1}))) is said to be a contour from the singular quadrant ((i)-singuad for short). In other words, the (i)-singuad is a contour composed of primitives from the ith quadrant. Given contours k′, k′′ such that char(k′) ∈ Q_i^+, char(k′′) ∈ Q_j^+, char(first(k′′)) = b ∉ Q_i, and if j = i ⊕ 2 then b ∈ Q_j \ (Q_{j⊕3} ∪ Q_{j⊕1}). If k = k′k′′, we say that k creates the so-called (i,j)-biquad with char(k) = char(k′)·char(k′′). The first primitive of k′′, i.e. first(k′′), is called a switch, encoded by the string ij named the basic mark. Furthermore, according to definition 10 of paper [8], a transducer is a 5-tuple: T = (G, Σ, Δ, δ, G0), where G is a finite nonempty set of states, Σ is a finite nonempty input alphabet, Δ is a finite nonempty output alphabet, G0 is a finite nonempty set of start states, G0 ⊂ G, and δ is a finite subset of G × Σ* × Δ* × G. Intuitively, if (q, u, v, q′) ∈ δ, it means that if the machine is in the state q and the string u ∈ Σ* is given as an input, then the state of the machine is changed into the state q′ and v ∈ Δ* becomes the machine output.
3 Bone Contour Analysis
The transducer Tm = ({q1, q2, q3, q4}, S, {1, 2, 3, 4}, δ, {q1, q2, q3, q4}), where δ is given by the graph depicted in Fig.4, was proposed by Jakubowski in [8]. If u causes the transition from the state qi to qj, i ≠ j, then u designates the switch of an (i, j)-biquad, which simply means that there is a switch between the ith and
Fig. 4. δ function of the original transducer from paper [8], Fig.14b
jth quarter. Therefore, for each analysed contour, a chain of biquads is taken as the result of the transitions. If the transducer with the δ function is used in the case of the bone, it usually cannot distinguish the healthy bone contours from the contours of a bone with an osteophyte or erosion. As an example, let us consider the simplified contours presented in Fig.5. The contour presented in Fig.5(a) shows no pathological changes. However, the contour in Fig.5(b) is convex, which means that it contains an osteophyte. On the other hand, the contour in Fig.5(c) is concave, which is why it contains an erosion. However, it can easily be verified that all three contours are represented by the same biquad description 32.21, despite the fact that they represent a healthy bone, a bone with an osteophyte and a bone with an erosion, respectively. Therefore, the authors had to modify the transducer to differentiate these three classes of bones. For this purpose, the δ′ function was created as a modification of the original δ function. The new function behaves differently in the case of primitives placed at the border of two quarters (s11, s12, s13 and s14). To better understand the changes, let us assume that k is a fragment of the contour whose characterological description is char(k) = s_j s_{1o}, where the first primitive was already classified by the transducer to the jth quarter and the second primitive is placed at the border of two quarters. Then, in the case of the function δ the biquad value is described by the function:
Fig. 5. Example of the healthy contour (a), contours with osteophytes (b and d) and contours with erosions (c and e). The number near each primitive represents the quarter to which the primitive belongs. If the primitive's first index equals 1, there are two numbers, as such a primitive is placed between two quarters.
biquads(δ) =
  none,     if o = j or o = j⊕1,
  j(j⊕1),   if o = j⊕2,
  j(j⊕3),   if o = j⊕3.

On the other hand, the modified function δ′ works differently for the two last cases:

biquads(δ′) =
  none,     if o = j or o = j⊕1,
  j(j⊕2),   if o = j⊕2,
  j(j⊕2),   if o = j⊕3.

It can easily be verified that all three contours represented by the same chain of biquads 32.21 in the case of the δ function are represented by three different chains of biquads in the case of the δ′ function - see Tab.1. The changes in the transducer were introduced due to the fact that in a healthy bone the angles between successive primitives are bigger than 90°, which can be observed in Fig.1a. If angles are equal to or smaller than 90°, it means that the bone contour contains pathological changes - an osteophyte if an acute or right angle is inside the bone and an erosion if an acute or right angle is outside the bone - see Fig.5b and Fig.5c, respectively. The original δ function does not take this regularity into account and in many cases does not differentiate contours from different bone classes.
Application of Shape Description
153
Fig. 6. δ′ function of the transducer, created based on the original δ function from Fig.4

Table 1. δ biquad description, δ′ biquad description and medical assignment of the contours from Fig.5

Figure    δ biquad description    δ′ biquad description    osteophyte or erosion
Fig.5a    32.21                   32.21                    none
Fig.5b    32.21                   31                       osteophyte
Fig.5c    32.21                   31.13.31                 erosion
Fig.5d    31                      31                       osteophyte
Fig.5e    32.23.31                31.13.31                 erosion
Moreover, it has to be stressed that the introduced δ′ function not only differentiates contours with the same δ biquad description, but also integrates some contours with different δ biquad descriptions. The integration can be observed in the case of Fig.5b and Fig.5d, as well as in the case of Fig.5c and Fig.5e. In both pairs, the biquad description generated by the δ function is different for the two contours, but the description generated by δ′ is identical (see Tab.1). However, it turns out that this is an advantage, because the δ′ function generates the same biquad description for contours with the same pathological change - either both have an osteophyte, or both have an erosion. Naturally, the examples in Fig.5 are quite simple, due to the fact that they contain 45°, 90° and 135° angles only. However, in reality the set of angles
between parts of contours will be much bigger. Therefore, some kind of fuzzy representation of the angles might help improve robustness and portability of the proposed methodology.
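The case definitions of biquads(δ) and biquads(δ′) above translate directly into code. The sketch below implements only these switch-labelling rules (the full transducer that scans a contour primitive by primitive is not shown); the function and variable names are ours, not the authors'.

```python
def plus(q, k):
    # quarter arithmetic q (+) k on {1, 2, 3, 4}, wrapping after 4
    return (q - 1 + k) % 4 + 1

def biquad_delta(j, o):
    """Basic mark produced by the original delta for s_j followed by s_1o."""
    if o == j or o == plus(j, 1):
        return None
    if o == plus(j, 2):
        return f"{j}{plus(j, 1)}"
    return f"{j}{plus(j, 3)}"        # case o = j (+) 3

def biquad_delta_prime(j, o):
    """Basic mark produced by the modified delta' in the same situation."""
    if o == j or o == plus(j, 1):
        return None
    return f"{j}{plus(j, 2)}"        # the two remaining cases collapse to j(j (+) 2)

# Example: a primitive from quarter 3 followed by the border primitive s_11.
print(biquad_delta(3, 1), biquad_delta_prime(3, 1))   # '34' under delta, '31' under delta'
```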
4 Concluding Remarks
As has been presented, the transducer introduced by Jakubowski and modified in this paper can be used to distinguish the contour of a healthy bone from the contours of bones with an erosion or an osteophyte. That kind of diversification is required to build an intelligent system for the diagnosis of joint diseases. In such a system, the most important part will be the analysis of the highest-level features such as: the presence and location of osteophytes, the presence and location of erosions, and joint space narrowing. The first two features can be described using a special algebraic approach described in [1], which will be the topic of the next publication. To recapitulate, the final system will be a hierarchical one, with the following levels (starting from the lowest to the highest level): preprocessing [6,17,18], contour shape description and joint space width analysis [2,4], an algebraic language for coding the highest-level features in a syntactic way, and an expert system to diagnose joint diseases. It has to be noted that the system will be used as an aid in the radiological diagnosis of hand radiographs.
References 1. Bielecka, M.: Syntactic segmentation of graph function type curves. Machine Graphics and Vision 16, 39–55 (2007) 2. Bielecka, M., Bielecki, A., Korkosz, M., Skomorowski, M., Wojciechowski, W., Zieliński, B.: Application of shape description methodology to hand radiographs interpretation. In: Bolc, L., Tadeusiewicz, R., Chmielewski, L.J., Wojciechowski, K. (eds.) ICCVG 2010. LNCS, vol. 6374, pp. 11–18. Springer, Heidelberg (2010) 3. Bielecka, M., Skomorowski, M., Bielecki, A.: Fuzzy syntactic approach to pattern recognition and scene analysis. Intelligent Control Systems and Optimization, Robotics and Automation 1, 29–35 (2007) 4. Bielecka, M., Skomorowski, M., Zieliński, B.: A fuzzy shape descriptor and inference by fuzzy relaxation with application to description of bones contours at hand radiographs. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 469–478. Springer, Heidelberg (2009) 5. Bielecki, A., Korkosz, M., Wojciechowski, W., Zieliński, B.: Identifying the borders of the upper and lower metacarpophalangeal joint surfaces on hand radiographs. In: Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2010. LNCS, vol. 6113, pp. 589–596. Springer, Heidelberg (2010) 6. Bielecki, A., Korkosz, M., Zieliński, B.: Hand radiographs preprocessing, image representation in the finger regions and joint space width measurements for image interpretation. Pattern Recognition 41(12), 3786–3798 (2008) 7. Jakubowski, R.: Syntactic characterization of machine parts shapes. Cybernetics and Systems 13, 1–24 (1982) 8. Jakubowski, R.: Extraction of shape features for syntactic recognition of mechanical parts. IEEE Transactions on Systems, Man and Cybernetics 15(5), 642–651 (1985)
9. Jakubowski, R.: A structural representation of shape and its features. Information Sciences 39, 129–151 (1986) 10. Jakubowski, R., Bielecki, A., Chmielnicki, W.: Data structure for storing drawing being then analysed for purposes of CAD. Archiwa Informatyki Teoretycznej i Stosowanej 1, 51–70 (1993) 11. Liang, M., Esdaile, J., Klippel, J., Dieppe, P.: Impact and Cost Effectiveness of Rheumatologic Care in Rheumatology. Mosby International, London (1998) 12. Ogiela, M.R., Tadeusiewicz, R., Ogiela, L.: Image languages in intelligent radiological palm diagnostics. Pattern Recognition 39, 2157–2165 (2006) 13. Sharp, J., Gardner, J., Bennett, E.: Computer-based methods for measuring joint space and estimating erosion volume in the finger and wrist joints of patients with rheumatoid arthritis. Arthritis & Rheumatism 43(6), 1378–1386 (2000) 14. Szczeklik, A., Zimmermann-Górska, I.: Injury Disease (in Polish). Medycyna Praktyczna, Warszawa (2006) 15. Tadeusiewicz, R., Ogiela, M.R.: Medical image understanding technology. Studies in fuzziness and soft computing. Springer, Heidelberg (2004) 16. Tadeusiewicz, R., Ogiela, M.R.: Picture languages in automatic radiological palm interpretation. International Journal of Applied Mathematics and Computer Science 15(2), 305–312 (2005) 17. Zieliński, B.: A fully-automated algorithm dedicated to computing metacarpophalangeal and interphalangeal joint cavity widths. Schedae Informaticae 16, 47–67 (2007) 18. Zieliński, B.: Hand radiograph analysis and joint space location improvement for image interpretation. Schedae Informaticae 17/18, 45–61 (2009)
Using CMAC for Mobile Robot Motion Control Kristóf Gáti and Gábor Horváth Budapest University of Technology and Economics, Department of Measurement and Information Systems Magyar tudósok krt. 2. Budapest, Hungary H-1117 {gatikr,horvath}@mit.bme.hu http://www.mit.bme.hu
Abstract. Cerebellar Model Articulation Controller (CMAC) has some attractive features: fast learning capability and the possibility of efficient digital hardware implementation. These features make it a good choice for different control applications, like the one presented in this paper. The problem is to navigate a mobile robot (e.g. a car) from an initial state to a fixed goal state. The approach applied is backpropagation through time (BPTT). Besides its attractive features CMAC has a serious drawback: its memory complexity may be very large. To reduce the memory requirement, different variants of CMAC were developed. In this paper several variants are used for solving the navigation problem, to see whether a network with reduced memory size can solve the problem efficiently. Only those solutions are described in detail that solve the problem at an acceptable level. All of these variants of the CMAC require higher-order basis functions, as BPTT requires a continuous input-output mapping of the applied neural network. Keywords: CMAC, recurrent neural network, control, BPTT.
1 Introduction
The Cerebellar Model Articulation Controller is a special neural network architecture originally proposed by James S. Albus [1]. The network has some attractive features like fast convergence, local approximation capability and the possibility of efficient digital hardware implementation. Because of these features the CMAC is often used in control applications [2], among other areas like image and signal processing, pattern recognition and modeling. This paper deals with a navigation problem. It presents a solution for mobile robot motion control, implemented with a CMAC network. This is a highly nonlinear problem, which is hard to solve with classical control methods. There are many articles about this problem, e.g. [11]. The question is whether the advantageous properties of CMAC can be utilized in this complex navigation problem. To answer this question it should also be noted that despite the attractive features CMAC has some drawbacks. The most serious one is that its memory complexity may be huge, and that concerning its function approximation capability it may be inferior to an MLP.
Many solutions have been suggested for both problems. Hash-coding [1],[3],[5], kernel CMAC [4],[5], fuzzy CMAC [6] and SOP-CMAC [8] are some ways of reducing memory complexity. Weight-smoothing [4] and higher-order CMACs [9],[10] are proposed for improving function approximation capability. The paper is organized as follows. In Section 2 the basic principle of BPTT is summarized, in Section 3 the classical CMAC is presented, in Section 4 the extensions and variants of the CMAC are presented, while in Section 5 the partial derivatives are determined. Section 6 describes the system and the training in detail. The results may be found in Section 7, and conclusions are drawn in Section 8.
2 Backpropagation Through Time (BPTT)
BPTT is an approach proposed for training recurrent neural networks [12]. As a recurrent net is a dynamic network, it needs special training algorithms in which the temporal behaviour of the network is taken into consideration. The basic idea of BPTT is that the recurrent network is unfolded in time - as can be seen in Fig. 1 [12] - resulting in a many-stage static network, which can basically be trained using the classical backpropagation algorithm. As the operation of the recurrent network is considered in discrete time steps, the number of stages equals the number of time steps required for the operation of the network, i.e. for determining the number of stages of the static network the time window of the operation of the recurrent network must be fixed. One constraint must be noticed. As the number of weights in the unfolded static network is increased, where in the static network several weights are used instead of a single weight of the original network, these weights must be modified simultaneously and by the same amount, as they represent the same physical weight in different time steps.
Fig. 1. Simple network with feedback a.) and its unfolded equivalent b.)
3 Classical CMAC
CMAC is a basis function network where finite-support basis functions are used. The basis functions are applied in the input space in predefined positions and the
supports of the basis functions are fixed-size closed intervals - or in multidimensional cases - fixed-size hypercubes. The classical CMAC applies rectangular basis functions that take a constant value over the hypercube and zero elsewhere. The hypercube is often called the receptive field of the basis function. The network has two layers. The first layer performs a fixed nonlinear mapping, which implements the basis functions. The network output is calculated in the second layer as a weighted sum of the basis function outputs. Only the weights are trainable in the network. The fixed first layer creates a binary vector called the association vector, which consists of the outputs of the basis functions. If the input point, x ∈ R^N, is in the receptive field of a basis function then the corresponding element in the association vector will be 1, otherwise it will be 0. The width of the receptive field is finite, controlled by the generalization parameter of the CMAC. This is denoted by C. The basis functions are arranged in overlays. An overlay is a set of basis functions which covers the full input space, without any gap and overlap. Hence the number of overlays equals the number of activated basis functions. In the case of an N-dimensional problem this is C^N. The number of required basis functions is (R + C − 1)^N, where R is the size of the input space. This number can be enormous in a real-world application; for example, if R = 1024 and N = 10, then the required number of basis functions is ∼ 2^100, which cannot be implemented. To reduce the number of basis functions Albus proposed a way of using only C overlays; however, even with this reduction the network could need an extremely large weight memory that is rather hard or even impossible to implement [1]. The second layer of the CMAC calculates the output y ∈ R of the network as a scalar product of the association vector a and the weight vector w:

y(x) = a(x)^T w = ∑_{i: a_i=1} w_i   (1)
Because of the binary basis functions the product can be replaced by the sum of the weights corresponding to the activated basis functions. The weights can be trained using the LMS rule in Eq. (2):

Δw_i = μ(y_d − y),  i : a_i = 1,   (2)

where y_d is the desired output of a training data point, and μ is the learning rate.
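A minimal one-dimensional sketch of the classical CMAC described above may help fix the notation; the input resolution, generalization parameter, learning rate and target function are illustrative choices, not those used in the paper.

```python
import numpy as np

R, C = 64, 8                       # input resolution and generalization parameter
n_weights = R + C - 1              # number of basis functions in one dimension
w = np.zeros(n_weights)

def active_indices(x):
    """Indices of the C rectangular basis functions activated by input x."""
    q = int(np.clip(x, 0, R - 1))  # quantized input
    return np.arange(q, q + C)

def cmac_output(x):                # second layer: sum of the selected weights, Eq. (1)
    return w[active_indices(x)].sum()

# LMS training, Eq. (2), on a toy target function
rng = np.random.default_rng(0)
mu = 0.1 / C                       # learning step shared by the C active weights
for _ in range(2000):
    x = rng.integers(0, R)
    y_d = np.sin(2 * np.pi * x / R)          # desired output
    err = y_d - cmac_output(x)
    w[active_indices(x)] += mu * err

print(cmac_output(16), np.sin(2 * np.pi * 16 / R))   # should be close
```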
4 Variants of the CMAC
4.1 Higher-Order CMAC
For BPTT training the binary (rectangular) basis functions are not adequate, because BPTT training needs the derivative of the basis functions. Lane et al.
proposed the CMAC with B-Spline basis functions in [9]. The B-Splines are especially well suited for the CMAC with finite-support basis functions as the B-Splines are non-zero only in a finite and closed interval. Further advantages are the improved performance and the possibility of training continuous functions. The main disadvantage is the loss of the multiplication-free structure, as the association vector is not binary anymore. There are other types of basis functions, for example Gaussian, see [10].
4.2 Kernel CMAC
CMAC can be interpreted as a kernel machine [4], where instead of using the basis functions directly, we use so-called kernel functions that are constructed easily from the basis functions. In a Kernel CMAC (KCMAC) the memory complexity is upper bounded by the number of training points, independently of the dimension of the input space and the number of basis functions [4],[5]. If M is the number of basis functions, and P is the number of training samples, then the input-output mapping of a basis function network based on the basis-function representation is:

y(x) = ∑_{j=1}^{M} w_j ϕ_j(x) = w^T ϕ(x)   (3)
where ϕ(x) = [ϕ_1(x), ϕ_2(x), ..., ϕ_M(x)]^T is a vector formed from the outputs of the basis functions for the input x and w is the weight vector. The same mapping can be described using the kernel function representation as:

y(x) = ∑_{k=1}^{P} α_k K(x, x(k))   (4)
It is also a weighted sum of nonlinear functions where the K(x, x(k)) = ϕ^T(x) ϕ(x(k)), k = 1, ..., P functions are called kernel functions defined by scalar products, and the α_k coefficients serve as the weight values. In kernel CMAC the kernel functions are defined as

K(x, x(k)) = a^T(x) · a(x(k))   (5)
where a(x) is the association vector for the discrete input x and the response of a CMAC for a given x can be written as:

y(x) = a^T(x) A^T (AA^T)⁻¹ y_d = a^T(x) A^T α = k^T(x) α   (6)

where

α = (AA^T)⁻¹ y_d .   (7)
Here A is a P × M matrix constructed from the association vectors of the training points, A = [a1(x), ..., aP(x)]^T. In the kernel representation the components of α are considered as the weight values, so instead of using M weights, here only P weights will be used. As in multidimensional cases P
R(g∗) then
  g∗ ← g
end if
until g∗ is a local optimum
should be a highly multimodal one and thus being trapped in inferior local minima is a common occurrence. In these conditions, a better heuristic is provided by simulated annealing. Simulated annealing works in the same manner as the hill-climbing procedure above, except that inferior configurations are not discarded with probability one; rather, they are accepted with some positive probability that depends on the cost function difference Δ(R) = |R(g) − R(g∗)| between the old and the new configuration, and on a parameter T called "temperature" by analogy with classical physical systems obeying the Maxwell-Boltzmann statistics [8]. The higher the difference Δ(R) and the lower the temperature T, the more unlikely the transition g∗ ← g is. The probabilistic acceptance criterion together with a temperature lowering schedule are the devices that allow the search to escape from local optima. At the beginning, when T is high, this capability is maximal, whereas towards the end, when T → 0, it becomes more difficult to jump out of a local optimum and the system tends to reach equilibrium. The process may be described by the pseudo-code of Algorithm 2.
Algorithm 2. Simulated Annealing
Build graph g∗ ∈ S
Choose an initial temperature T
repeat
  Nsteps = 0
  repeat
    Nsteps = Nsteps + 1
    Compute R(g∗)
    Choose edges eij, ekl uniformly at random and swap them: g ← g∗
    Compute R(g)
    if R(g) > R(g∗) then
      g∗ ← g
    else
      g∗ ← g, with probability exp(−Δ(R)/T)
    end if
  until Nsteps > Nmax or g∗ is a local optimum
  Lower T
until T < Tmin
4 Computational Results
We used the simulated annealing heuristic described in the previous section to optimize scale-free networks of increasing sizes N, from N = 100 up to N = 300. We used two kinds of swap operators, one that is identical to that used in [13] and a second one where the two edges to be swapped are not chosen anywhere in the graph, but rather locally. This works as follows: first one chooses an edge eij uniformly at random among all the edges of graph g. Then a second edge ekl is selected among the edges belonging to a neighbor of i or j, checking that no inconsistencies arise, i.e. ekl must not be adjacent to eij and there must not be an already existing edge between the vertices that are candidates to be connected by the swap. Finally, eij and ekl are swapped as in the global move. In simulated annealing it is necessary to establish an initial temperature and a temperature schedule such that the temperature is progressively decreased during the run in order for the system to reach an equilibrium at each step of decreasing temperature. A suitable initial value for T is found by performing a random walk in the search space: the largest R difference in absolute value found is saved and the initial value of T is chosen such that the acceptance rate of all moves is at least 90% when the search starts. In the present study, we used the following rule: T0 = −|ΔR|max/ln(0.8) ≈ 4.5 × |ΔR|max, which is roughly half of the value recommended in [14] but should permit a faster convergence. As for the temperature lowering schedule, there are several possibilities, but usually a linear or geometric scheduling is used. In order to save computational time, the update rule we employed here for T at the i-th constant-temperature search cycle (see Algorithm 2) is given by T(i) = (0.8)^i × T0. For these choices we followed the rules of thumb in Chap. 15 of [14].
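A compact sketch of the optimization loop of Algorithm 2 with global edge swaps is given below. It assumes the robustness measure R of Schneider et al. [13] (the average relative size of the largest connected component over the targeted-attack sequence) and uses the networkx package, so it is an illustration of the procedure rather than the implementation used for the experiments.

```python
import math
import random
import networkx as nx

def robustness(g):
    # R of Schneider et al. [13]: average fraction of nodes in the largest
    # connected component while the highest-degree node is removed repeatedly.
    h, n, total = g.copy(), g.number_of_nodes(), 0.0
    for _ in range(n - 1):
        v = max(h.degree, key=lambda kv: kv[1])[0]
        h.remove_node(v)
        total += max(len(c) for c in nx.connected_components(h)) / n
    return total / n

def global_swap(g):
    # Swap two edges chosen uniformly at random, preserving the degree sequence.
    h = g.copy()
    (a, b), (c, d) = random.sample(list(h.edges()), 2)
    if len({a, b, c, d}) == 4 and not h.has_edge(a, d) and not h.has_edge(c, b):
        h.remove_edges_from([(a, b), (c, d)])
        h.add_edges_from([(a, d), (c, b)])
    return h

def anneal(g, t0, alpha=0.8, n_cycles=10, n_steps=50):
    cur, r_cur, t = g, robustness(g), t0
    for _ in range(n_cycles):
        for _ in range(n_steps):
            cand = global_swap(cur)
            r_cand = robustness(cand)
            delta = abs(r_cur - r_cand)
            if r_cand > r_cur or random.random() < math.exp(-delta / t):
                cur, r_cur = cand, r_cand
        t *= alpha                      # geometric cooling: T(i) = alpha^i * T0
    return cur, r_cur

g0 = nx.barabasi_albert_graph(60, 2, seed=1)   # small instance for illustration
print(robustness(g0), anneal(g0, t0=0.05)[1])
```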
Numerical results for network robustness R in the optimized networks are reported in Fig. 2. The best results were obtained with simulated annealing and global edge swaps. The advantage with respect to the simple hill-climbing search is particularly clear for relatively small network sizes. As N increases, the results become more similar, although simulated annealing maintains the advantage. It is also clear from the figure that local edge swaps yield somewhat inferior results. The critical parameters in our simulated annealing are the initial temperature T0, the number of steps performed at a given temperature (Nsteps in Algorithm 2), and the geometric cooling factor α such that T(i) = α^i × T0. For larger network sizes these parameters should be suitably tuned for best results. However, because of time and resource limitations, we initially kept the parameters that worked well for smaller sizes in order to limit the computational expense; recent further tests with a slower cooling procedure (α = 0.9 instead of α = 0.8) indeed already improved on the global edge swaps (see Fig. 2).
Fig. 2. Robustness R as a function of the network size. Simulated annealing results are averaged over 100 independent and randomly generated networks for each size; hill-climbing results are redrawn from [13] and C. Schneider, personal communication.
Figures 3 and 4 show the results of the optimization process on two particular instances of size 300. The figures have been produced with the igraph [6] package in the R statistical environment [11] and depict nodes according to their coreness. The k-core of a network is the maximal subgraph in which every vertex has degree at least k within the subgraph. The left images of Figs. 3 and 4 are the original Barabási-Albert networks, while the right images are the result of the optimization process with global and local edge swaps, respectively. We observe that, whilst the original networks have a single core, the optimized ones, in spite of maintaining an identical degree distribution by construction, are more hierarchical. For the global
Fig. 3. Robustness optimization with global rewiring. Vertex color is related to vertex coreness; vertex size is proportional to the logarithm of vertex degree; edges connecting vertices having the same degree are highlighted. The final network has R = 0.227.
Fig. 4. Robustness optimization with local rewiring. Vertex color is related to vertex coreness; vertex size is proportional to the logarithm of vertex degree; edges connecting vertices having the same degree are highlighted. The final network has R = 0.215.
rewiring, the resulting graph has three cores and shows the typical “onion-like” structure first found by Schneider et al. [13]. It thus appears that this topology is highly conducive to good robustness properties as it has been found by two rather different optimization techniques. In the local case, Fig. 4, the trend is similar but restricting swaps to the locale of a link produces less radical changes in the
final topology, which could be advantageous in some real-life situations, although the robustness is slightly lower.
5 Summary and Conclusions
In this work we have performed an investigation of the robustness of Barabási-Albert scale-free networks under attacks targeted at highly connected nodes. Following previous work by Schneider et al. [13], we have optimized the networks against this kind of perturbation without changing the degree sequence. To that effect, the move operator is a 2-swap of non-adjacent edges. Two versions of the swap were used, one in which the swap is global as in [13], and a second one in which the swapped edges belong to the neighborhood of the concerned nodes. Since the problem is a computationally hard one, we have used the simulated annealing heuristic to perform the optimization. The results are promising. Although we could only study networks up to size N = 300 because of time limitations, simulated annealing gave better results than the straightforward hill-climbing optimization used in [13]. The gain is larger when global edge swaps are allowed but, even with local swaps, the results are encouraging. In the latter case the resulting networks require less actual rewiring to be produced from the original ones. Work is ongoing on larger networks, up to N = 500 at least, with a better adapted choice of the simulated annealing parameters. As future work, we believe that it would be interesting to study other types of attacks, as well as other network robustness measures such as network efficiency.
References 1. Albert, R., Barabasi, A.L.: Statistical mechanics of complex networks. Reviews of Modern Physics 74, 47–97 (2002) 2. Albert, R., Jeong, H., Barabasi, A.L.: Error and attack tolerance of complex networks. Nature 406, 378–382 (2000) 3. Amaral, L.A.N., Scala, A., Barth´elemy, M., Stanley, H.E.: Classes of small-world networks. Proc. Natl. Acad. Sci. USA 97, 11149–11152 (2000) 4. Bollob´ as, B.: Modern Graph Theory. Springer, Heidelberg (1998) 5. Cohen, R., Erez, K., Avraham, D.B., Havlin, S.: Breakdown of the Internet under intentional attack. Phys. Rev. Lett. 86, 3682–3685 (2001) 6. Csardi, G., Nepusz, T.: The igraph software package for complex network research. Inter Journal Complex Systems, 1695 (2006) 7. Holme, P., Kin, B.J., Yoon, C.N., Han, S.K.: Attack vulnerability of complex networks. Phys. Rev. E 65, 056109 (2002) 8. Kirkpatrick, S., Gelatt, C.D., Vecchi, P.: Optimization by simulated annealing. Science 220, 671–680 (1983) 9. Latora, V., Marchiori, M.: Efficient behavior of small-world networks. Phys. Rev. Lett. 87, 198701 (2001) 10. Newman, M.E.J.: Networks: An Introduction. Oxford University Press, Oxford (2010) 11. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2010)
12. Schneider, C.M., Andrade, J.S., Shinbrot, T., Herrmann, H.J.: Protein interaction networks are fragile against random attacks and robust against malicious attacks. Tech. rep. (2010) 13. Schneider, C.M., Moreira, A., Andrade, J.S., Havlin, S., Herrmann, H.J.: Onionlike network topology enhances robustness against malicious attacks. J. Stat. Mech. (2010) (to appear) 14. Schneider, J.J., Kirkpatrck, S.: Stochastic Optimization. Springer, Berlin (2006) 15. Valente, A.X.C.N., Sarkar, A., Stone, H.: 2-peak and 3-peak optimal complex networks. Phys. Rev. Lett. 92, 118702 (2004)
Numerically Efficient Analytical MPC Algorithm Based on Fuzzy Hammerstein Models Piotr M. Marusak Institute of Control and Computation Engineering, Warsaw University of Technology, ul. Nowowiejska 15/19, 00–665 Warszawa, Poland
Abstract. A numerically efficient analytical MPC (Model Predictive Control) algorithm based on fuzzy Hammerstein models is proposed in the paper. Thanks to the form of the model the prediction can be described by analytical formulas and the proposed algorithm is numerically efficient. It is shown that, thanks to a clever tuning of the controller, most of the calculations needed to derive the control value can be performed off-line. Thus, the proposed algorithm has an advantage reserved so far for analytical MPC algorithms based on linear models. At the same time, the algorithm offers practically the same performance as the MPC algorithm in which a nonlinear optimization problem must be solved at each iteration. The efficiency of the algorithm is demonstrated in the control system of a nonlinear control plant with delay. Keywords: fuzzy control, fuzzy systems, predictive control, nonlinear control, constrained control.
1 Introduction
The MPC algorithms use a model of the control plant to predict the behavior of the control system during generation of the control signals. Therefore, the MPC algorithms can be successfully used in control systems of processes with difficult dynamics (e.g. with large delays) and constraints [2,6,12,14]. In the standard MPC algorithms linear control plant models are used. However, such an approach applied in the case of a nonlinear control plant may give unsatisfactory results, especially if the control system should be able to work at different operating points. In such a case operation of the control system may be improved using an MPC algorithm in which prediction is based on a nonlinear model. Straightforward utilization of a nonlinear process model causes, however, the necessity of solving a nonlinear (often non-convex) optimization problem at each iteration of the algorithm; see e.g. [1,3]. Such an optimization problem is usually hard to solve (numerical problems often occur). Moreover, the time needed to find the solution is difficult to predict. These are the reasons why MPC algorithms with linear approximations of the control plant models, obtained at each iteration, are often used [5,7,8,9,10,14]. Among the many types of nonlinear models, Hammerstein models have interesting properties. They are composed of a linear dynamic block which follows
Fig. 1. Structure of the Hammerstein model; u – inputs, y – outputs, z – outputs of the nonlinear static block
They are composed of a linear dynamic block which follows a static nonlinearity (Fig. 1). Such models can be used for many processes, for example distillation columns or chemical reactors [4]. The static nonlinearity can be modeled in different ways. However, as fuzzy models offer many advantages [11,13], such as relatively easy model identification and simple derivation of a linear approximation, Hammerstein models with a fuzzy static part are considered in the paper. An efficient method of prediction generation using a fuzzy Hammerstein model and its linear approximation was proposed in [9]. A numerically efficient fuzzy MPC algorithm, formulated as a standard quadratic programming problem, was also proposed there. In this paper it is shown that the discussed prediction can be used to formulate a numerically efficient analytical fuzzy MPC algorithm. This algorithm is formulated in such a way that the main part of the calculations needed to derive the control value is performed off-line. Therefore, even solving a quadratic programming problem is avoided and the algorithm can be applied to fast control plants. In the next section the standard analytical MPC algorithm based on linear models is described. In Sect. 3 the proposed analytical fuzzy MPC algorithm is detailed. Example results illustrating the excellent performance offered by the proposed algorithm are presented in Sect. 4. The paper is summarized in Sect. 5.
2 Analytical MPC Algorithm Based on Linear Models (LMPC)
Control signals are generated in the Model Predictive Control (MPC) algorithms using a prediction of the future behavior of the control plant many sampling instants ahead. The prediction is obtained using a process model. The values of the control variables are calculated in such a way that the prediction fulfills assumed criteria. Usually, minimization of the following performance function is demanded [2,6,12,14]:
\[
\min_{\Delta u} \; J_{\mathrm{MPC}} = \sum_{i=1}^{p} \left( \bar{y}_k - y_{k+i|k} \right)^2 + \lambda \cdot \sum_{i=0}^{s-1} \left( \Delta u_{k+i|k} \right)^2 , \qquad (1)
\]
where $\bar{y}_k$ is a set-point value, $y_{k+i|k}$ is the value of the output for the $(k+i)$th sampling instant, predicted at the $k$th sampling instant, $\Delta u_{k+i|k}$ are future changes
in the manipulated variable, $\Delta u = \left[ \Delta u_{k|k}, \ldots, \Delta u_{k+s-1|k} \right]^{T}$, $\lambda \geq 0$ is a weighting coefficient; $p$ and $s$ denote the prediction and control horizons, respectively.
The predicted values of the output variable $y_{k+i|k}$ are derived using a dynamic control plant model. If this model is linear then the superposition principle can be applied and the vector of predicted output values $y$ is described by the following formula:
\[
y = \widetilde{y} + A \cdot \Delta u , \qquad (2)
\]
where $y = \left[ y_{k+1|k}, \ldots, y_{k+p|k} \right]^{T}$; $\widetilde{y} = \left[ \widetilde{y}_{k+1|k}, \ldots, \widetilde{y}_{k+p|k} \right]^{T}$ is the free response of the plant, which contains the future values of the output variable calculated assuming that the control signal does not change in the prediction horizon; $A \cdot \Delta u$ is the forced response, which depends only on future changes of the control signal;
\[
A = \begin{bmatrix}
a_1 & 0 & \cdots & 0 & 0 \\
a_2 & a_1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_p & a_{p-1} & \cdots & a_{p-s+2} & a_{p-s+1}
\end{bmatrix} \qquad (3)
\]
is a matrix composed of the coefficients of the control plant step response $a_i$. It is called the dynamic matrix.
Introduce the vector $\bar{y} = \left[ \bar{y}_k, \ldots, \bar{y}_k \right]^{T}$ of length $p$. The performance function from (1), rewritten in matrix–vector form, is as follows:
\[
J_{\mathrm{MPC}} = (\bar{y} - y)^{T} \cdot (\bar{y} - y) + \Delta u^{T} \cdot \Lambda \cdot \Delta u , \qquad (4)
\]
where $\Lambda = \lambda \cdot I$ is an $s \times s$ matrix. After application of the prediction (2) to the performance function (4) one obtains:
\[
J_{\mathrm{LMPC}} = (\bar{y} - \widetilde{y} - A \cdot \Delta u)^{T} \cdot (\bar{y} - \widetilde{y} - A \cdot \Delta u) + \Delta u^{T} \cdot \Lambda \cdot \Delta u , \qquad (5)
\]
which depends quadratically on the decision variables $\Delta u$. Thus, if the problem without constraints is considered, the vector minimizing the performance function (5) is described by the following formula:
\[
\Delta u = \left( A^{T} \cdot A + \lambda \cdot I \right)^{-1} \cdot A^{T} \cdot (\bar{y} - \widetilde{y}) . \qquad (6)
\]
The matrix $K = \left( A^{T} \cdot A + \lambda \cdot I \right)^{-1} \cdot A^{T}$ depends only on the matrix $A$, which is constant. Thus the most complex part of the calculations can be performed off-line.
Remark. In the analytical MPC algorithms the control constraints are taken into consideration by using a mechanism of projection of the controls onto the constraint set; see e.g. [14]. The mechanism is simple, as it consists in applying the following rules of modification of the increments of the manipulated variable:
• for changes of the manipulated variable:
  — if $\Delta u_{k|k} < \Delta u_{\min}$, then $\Delta u_{k|k} = \Delta u_{\min}$,
  — if $\Delta u_{k|k} > \Delta u_{\max}$, then $\Delta u_{k|k} = \Delta u_{\max}$;
• for values of the manipulated variable:
  — if $u_{k-1} + \Delta u_{k|k} < u_{\min}$, then $\Delta u_{k|k} = u_{\min} - u_{k-1}$,
  — if $u_{k-1} + \Delta u_{k|k} > u_{\max}$, then $\Delta u_{k|k} = u_{\max} - u_{k-1}$.
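For illustration, the control law (6) together with the projection remark could be realised along the following lines. This is a minimal NumPy sketch assuming a single-input plant; the function names, the step-response vector `a` (with `a[0]` holding $a_1$) and the clipping of only the first move are illustrative choices rather than part of the original formulation.

```python
import numpy as np

def dynamic_matrix(a, p, s):
    # Dynamic matrix (3): p x s, with a_1 on the main diagonal,
    # a_2 just below it, and so on; a[0] is assumed to hold a_1.
    A = np.zeros((p, s))
    for i in range(p):
        for j in range(s):
            if i >= j:
                A[i, j] = a[i - j]
    return A

def offline_gain(A, lam):
    # K = (A^T A + lambda I)^{-1} A^T, computed once, off-line (cf. (6)).
    s = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(s), A.T)

def lmpc_step(K, y_sp, y_free, u_prev, du_min, du_max, u_min, u_max):
    # On-line part: unconstrained increment (6), then projection of the
    # first move onto the constraints, as in the remark above.
    du = K @ (y_sp - y_free)
    du0 = min(max(du[0], du_min), du_max)
    du0 = min(max(u_prev + du0, u_min), u_max) - u_prev
    return du0
```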
3 Analytical MPC Algorithm Based on Fuzzy Hammerstein Models (FMPC)
It is assumed that the process model has the Hammerstein structure (i.e. the nonlinear static block is followed by the linear dynamic block) with a fuzzy Takagi–Sugeno static part:
\[
z_k = f(u_k) = \sum_{j=1}^{l} w_j(u_k) \cdot z_k^{j} = \sum_{j=1}^{l} w_j(u_k) \cdot \left( b_j \cdot u_k + c_j \right) , \qquad (7)
\]
where $z_k$ is the output of the static block, $w_j(u_k)$ are weights obtained using fuzzy reasoning, $z_k^{j}$ are the outputs of the local models in the fuzzy static model, $l$ is the number of fuzzy rules in the model, and $b_j$ and $c_j$ are the parameters of the local models in the fuzzy static part of the model. It is also assumed that the dynamic part of the model has the form of the step response:
\[
y_k = \sum_{n=1}^{p_d - 1} a_n \cdot \Delta z_{k-n} + a_{p_d} \cdot z_{k-p_d} , \qquad (8)
\]
where $y_k$ is the output of the fuzzy Hammerstein model, $a_i$ are the coefficients of the step response, and $p_d$ is the horizon of the process dynamics (equal to the number of sampling instants after which the step response can be considered as settled).
The proposed algorithm is based on prediction obtained in the way described in [9]. The idea of this prediction is to use the Hammerstein model (8) to obtain the free response. It is expressed by the following analytical formula:
\[
\widetilde{y}_{k+i|k} = \sum_{n=i+1}^{p_d - 1} a_n \cdot \Delta z_{k-n+i} + a_{p_d} \cdot z_{k-p_d+i} + d_k , \qquad (9)
\]
where $d_k = y_k - \widehat{y}_k$ is the DMC-type disturbance model, i.e. it is assumed to be the same for all instants in the prediction horizon. Next, the dynamic matrix, needed to predict the influence of the future control changes, is derived using, at each algorithm iteration, a linear approximation of the fuzzy Hammerstein model (8):
\[
y_k^{L} = dz_k \cdot \left( \sum_{n=1}^{p_d - 1} a_n \cdot \Delta u_{k-n} + a_{p_d} \cdot u_{k-p_d} \right) , \qquad (10)
\]
where $dz_k$ is the slope of the static characteristic near $z_k$. It can be calculated numerically using the formula
\[
dz_k = \sum_{j=1}^{l} \bigl( w_j(u_k + du) \cdot ( b_j \cdot (u_k + du) + c_j ) - w_j(u_k) \cdot ( b_j \cdot u_k + c_j ) \bigr) \big/ du , \qquad (11)
\]
where $du$ is a small number. Thus, finally, the following dynamic matrix is obtained:
\[
A_k = dz_k \cdot \begin{bmatrix}
a_1 & 0 & \cdots & 0 & 0 \\
a_2 & a_1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_p & a_{p-1} & \cdots & a_{p-s+2} & a_{p-s+1}
\end{bmatrix} . \qquad (12)
\]
The prediction is therefore described by:
\[
y = \widetilde{y} + A_k \cdot \Delta u . \qquad (13)
\]
After application of the prediction (13) to the performance function (4) one obtains:
\[
J_{\mathrm{FMPC}} = (\bar{y} - \widetilde{y} - A_k \cdot \Delta u)^{T} \cdot (\bar{y} - \widetilde{y} - A_k \cdot \Delta u) + \Delta u^{T} \cdot \Lambda \cdot \Delta u . \qquad (14)
\]
The performance function (14) depends quadratically on the decision variables $\Delta u$. Thus, if constraints are not taken into consideration, the vector minimizing the performance function (14) at each iteration is described by the following formula:
\[
\Delta u = \left( A_k^{T} \cdot A_k + \lambda \cdot I \right)^{-1} \cdot A_k^{T} \cdot (\bar{y} - \widetilde{y}) . \qquad (15)
\]
This time, however, contrary to the MPC algorithm based on a linear model, the dynamic matrix changes at each iteration. Fortunately, thanks to the form of the dynamic matrix obtained from the Hammerstein model and a clever tuning of the controller, the number of on-line calculations can be reduced largely, as in the case of the LMPC algorithm. Assume that the matrix which contains the weighting coefficient can be changed at each iteration; then
\[
\Delta u = \left( A_k^{T} \cdot A_k + \lambda_k \cdot I \right)^{-1} \cdot A_k^{T} \cdot (\bar{y} - \widetilde{y}) . \qquad (16)
\]
Assume also that $\lambda_k = dz_k^{2} \cdot \lambda$; in such a case, after using (12), one obtains:
\[
\Delta u = \left( dz_k^{2} \cdot A^{T} \cdot A + dz_k^{2} \cdot \lambda \cdot I \right)^{-1} \cdot dz_k \cdot A^{T} \cdot (\bar{y} - \widetilde{y}) , \qquad (17)
\]
and finally:
\[
\Delta u = \frac{1}{dz_k} \cdot \left( A^{T} \cdot A + \lambda \cdot I \right)^{-1} \cdot A^{T} \cdot (\bar{y} - \widetilde{y}) , \qquad (18)
\]
which can be written as:
\[
\Delta u = \frac{1}{dz_k} \cdot K \cdot (\bar{y} - \widetilde{y}) . \qquad (19)
\]
Thus, as in the case of the LMPC algorithm, the main part of the calculations can be performed off-line, as the matrix $K = \left( A^{T} \cdot A + \lambda \cdot I \right)^{-1} \cdot A^{T}$ does not change. Therefore, despite the fact that the nonlinear fuzzy Hammerstein model is used, the main advantage of the analytical LMPC algorithm is retained.
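The resulting on-line computation is very light, as the following sketch indicates; it reuses the `static_block` helper and the off-line matrix K from the earlier sketches, and the function names are again illustrative.

```python
def slope(u_k, b, c, weights, du=1e-6):
    # Numerical slope (11), equivalent to (f(u_k + du) - f(u_k)) / du
    # with f given by (7); static_block is the helper sketched above.
    return (static_block(u_k + du, b, c, weights)
            - static_block(u_k, b, c, weights)) / du

def fmpc_step(K, dz_k, y_sp, y_free):
    # Analytical fuzzy MPC control law (19): only the scalar slope dz_k
    # is computed on-line; K = (A^T A + lambda I)^{-1} A^T is the same
    # constant matrix pre-computed off-line as in the LMPC case.
    return (1.0 / dz_k) * (K @ (y_sp - y_free))
```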
4 Simulation Experiments

4.1 Control Plant
The experiments were made in the control system of the ethylene distillation column used for tests in [9]. The control plant is highly nonlinear and has a large time delay. The output y is the impurity of the product. The manipulated variable u is the reflux. During the experiments it was assumed that the reflux is constrained to 4.05 ≤ u ≤ 4.4. The Hammerstein model of the plant is depicted in Fig. 2a and the steady-state characteristic in Fig. 2b.
Fig. 2. a) Hammerstein model of the distillation column; b) Steady–state characteristic of the plant
Fig. 3. Membership functions of the static part of the Hammerstein model
The static part of the fuzzy Hammerstein model has the form of a Takagi–Sugeno model with three local models of the form $z_k^{j} = b_j \cdot u_k + c_j$, where $b_1 = -2222.4$, $b_2 = -1083.2$, $b_3 = -534.4$, $c_1 = 9486$, $c_2 = 4709.3$, $c_3 = 2408.7$. The assumed membership functions are shown in Fig. 3.

4.2 Results
In order to test the properties of the proposed approach, three MPC algorithms were designed:
• an NMPC one (with nonlinear optimization),
• an LMPC one (analytical, based on a linear model), and
• an FMPC one (the proposed analytical algorithm based on the fuzzy Hammerstein model).
The sampling period was assumed equal to Ts = 20 min; the tuning parameters of all three algorithms were as follows: prediction horizon p = 44, control horizon s = 20. During the experiments the performance of the control systems with all three algorithms was compared. Example responses of the control systems to changes of the set-point value are shown in Fig. 4. It was assumed that the weighting coefficient λ = 2 × 10^6. This was done because for λ = 10^6 numerical problems occurred in the NMPC algorithm. In contrast to the NMPC algorithm, the proposed FMPC algorithm did not have any problems with control calculation for λ = 10^6 (solid lines in Fig. 4) and generated the fastest responses. Slightly slower responses were obtained with the FMPC algorithm for λ = 2 × 10^6 (dash-dotted lines in Fig. 4).
Fig. 4. Responses of the control systems to the change of the set-point values to ȳ_1 = 200 ppm, ȳ_2 = 300 ppm and ȳ_3 = 400 ppm; NMPC – dotted lines, LMPC – dashed lines, FMPC with λ = 2 × 10^6 – dash-dotted lines, FMPC with λ = 10^6 – solid lines
It is also worth noticing that in both cases the control signal in response to the set-point change to ȳ_3 = 400 ppm hits the constraint. Despite that, the responses are practically the same as in the case of the NMPC algorithm (dotted lines in Fig. 4), in which a constrained optimization problem is solved at each iteration. Both the FMPC and NMPC algorithms give satisfactory responses. The overshoot is very small and the character of these responses is the same for different set-points. Unfortunately, the LMPC algorithm (dashed lines in Fig. 4) operates almost as well as the other algorithms only for the set-point change to ȳ_1 = 200 ppm. When the set-point changes to ȳ_2 = 300 ppm or to ȳ_3 = 400 ppm, the LMPC algorithm works unacceptably badly, which is caused by the significant nonlinearity of the control plant.
5 Summary
An efficient analytical MPC algorithm based on the fuzzy Hammerstein model was proposed in the paper. The nonlinear model is used to derive the free response of the control plant, and its linear approximation is used to calculate the influence of the future control action. Thanks to such an approach and a clever tuning of the controller, the most computationally demanding part of the calculations needed to derive the control value is performed off-line in the proposed algorithm. Despite its significant simplicity, the algorithm outperforms its counterpart based on linear process models and offers control performance comparable with that offered by algorithms with nonlinear optimization. Moreover, the proposed algorithm is more computationally robust than the algorithm with nonlinear optimization.
Acknowledgment. This work was supported by the Polish national budget funds for science 2009–2011 as a research project.
References 1. Babuska, R., te Braake, H.A.B., van Can, H.J.L., Krijgsman, A.J., Verbruggen, H.B.: Comparison of intelligent control schemes for real–time pressure control. Control Engineering Practice 4, 1585–1592 (1996) 2. Camacho, E.F., Bordons, C.: Model Predictive Control. Springer, Heidelberg (1999) 3. Fink, A., Fischer, M., Nelles, O., Isermann, R.: Supervision of nonlinear adaptive controllers based on fuzzy models. Control Engineering Practice 8, 1093–1105 (2000) 4. Janczak, A.: Identification of nonlinear systems using neural networks and polynomial models: a block–oriented approach. Springer, Heidelberg (2005) 5. Lawrynczuk, M.: A family of model predictive control algorithms with artificial neural networks. International Journal of Applied Mathematics and Computer Science 17, 217–232 (2007) 6. Maciejowski, J.M.: Predictive control with constraints. Prentice Hall, Harlow (2002) 7. Marusak, P.: Advantages of an easy to design fuzzy predictive algorithm in control systems of nonlinear chemical reactors. Applied Soft Computing 9, 1111–1125 (2009)
8. Marusak, P.: Efficient model predictive control algorithm with fuzzy approximations of nonlinear models. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 448–457. Springer, Heidelberg (2009) 9. Marusak, P.: On prediction generation in efficient MPC algorithms based on fuzzy Hammerstein models. In: Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2010. LNCS (LNAI), vol. 6113, pp. 136–143. Springer, Heidelberg (2010) 10. Morari, M., Lee, J.H.: Model predictive control: past, present and future. Computers and Chemical Engineering 23, 667–682 (1999) 11. Piegat, A.: Fuzzy Modeling and Control. Physica-Verlag, Berlin (2001) 12. Rossiter, J.A.: Model-Based Predictive Control. CRC Press, Boca Raton (2003) 13. Takagi, T., Sugeno, M.: Fuzzy identification of systems and its application to modeling and control. IEEE Trans. Systems, Man and Cybernetics 15, 116–132 (1985) 14. Tatjewski, P.: Advanced Control of Industrial Processes; Structures and Algorithms. Springer, London (2007)
Online Adaptation of Path Formation in UAV Search-and-Identify Missions Willem H. van Willigen1,2 , Martijn C. Schut1 , A.E. Eiben1 , and Leon J.H.M. Kester2 1
VU University Amsterdam (NL) TNO Defence, Security and Safety, The Hague (NL) {willem,mc.schut,ae.eiben}@few.vu.nl,
[email protected] 2
Abstract. In this paper, we propose a technique for optimisation and online adaptation of search paths of unmanned aerial vehicles (UAVs) in search-and-identify missions. In these missions, a UAV has the objective to search for targets and to identify those. We extend earlier work that was restricted to offline generation of search paths by enabling the UAVs to adapt the search path online (i.e., at runtime). We let the UAV start with a pre-planned search path, generated by a Particle Swarm Optimiser, and adapt it at runtime based on expected value of information that can be acquired in the remainder of the mission. We show experimental results from 3 different types of UAV agents: two benchmark agents (one without any online adaptation that we call ‘naive’ and one with predefined online behaviour that we call ‘exhaustive’) and one with adaptive online behaviour, that we call ‘adaptive’. Our results show that the adaptive UAV agent outperforms both the benchmarks, in terms of jointly optimising the search and identify objectives. Keywords: adaptive algorithm, design and engineering for self-adaptive systems, unmanned aerial vehicles, search and identify.
1 Introduction
One of the most prevalent and important issues in reconnaissance, surveillance, and target acquisition (RSTA) flight missions is the ability to adapt one's flight path based on acquired information. In such (often military) missions, planes acquire information about a specific territory by first exploring it, followed by surveillance and finally obtaining information about possible targets in the area. While some information about the territory may be available beforehand (making a priori planning possible), it is increasingly important to do the planning during the mission itself because of the very dynamic nature of present-day RSTA missions (e.g., unknown territory, rapidly moving targets). The possibility of such automated adaptability during the mission becomes very important when we take the human out of the loop, as we employ unmanned aerial vehicles (UAVs) in RSTA missions. The problem that we address in this paper concerns the programming of such UAVs in situations where some information is available beforehand (for example, some knowledge about possible target
locations throughout the territory), but where substantial performance may be gained by equipping the UAVs with online (in-flight) adaptation of the flight path based on collected real-time information. We employ a machine-learning approach to accomplish this. Machine learning has been used to deal with different issues in UAV research and development. For example, Berger et al. [2] use a co-evolutionary algorithm for information gathering in UAV teams; Allaire et al. [1] have used genetic algorithms for UAV real-time path planning; and Sauter et al. [7,5] have used a swarming approach (for which a ground sensor network for coordination purposes is needed). Recently, Pitre et al. [6] introduced a new measurement for (UAV) search and track missions. The introduced metric jointly optimises the objectives to 1) detect new targets, and 2) track previously detected targets. This particular metric has some desirable properties with respect to search-and-tracking: jointly optimises detection and tracking; easily compares different solutions; promotes early detection; encourages repeated observations of the same targets; and it is useful for resource management. However, this approach does not yet allow for online adaptation of the search path during the flight. In this paper, we provide a method for doing this. We build further on the work of Pitre et al. with two important differences: 1) we use the metric and calculations also for in-flight coordination and adaptation (whereas the original metric has reportedly only been used for off-line generation of paths), and 2) in our case study, the second objective (besides search) is to identify targets rather than tracking these. This paper is structured as follows. In Section 2, we present the details of our adaptive algorithm. We report on the conducted simulation study in Section 3. Finally, Section 4 concludes and provides some pointers for future work.
2 Model
In this section, we describe the model that we used in terms of (1) the problem setting (i.e., search-and-identification of targets in some terrain with UAVs), and (2) our solution approach (i.e., objective function and adaptive behaviour of the UAV). We describe both these aspects in detail below. Our solution approach enables a UAV to jointly optimise the objectives of searching and identification by a UAV in a given terrain. Although we have no exact knowledge on where targets are in the terrain (because that would render the search-aspect of the mission pointless), we have some a priori knowledge in terms of probability distributions over the terrain cells on whether a target could be there. Before the mission, we compute an optimal flight path for the UAV. When the UAV is in-flight, it is possible to adapt this path. The beforemission calculation of the optimal search path as well as the in-flight decision to-adapt-or-not is based on a number of value functions that are described in detail below.
2.1 Problem Setting
Terrain. The terrain to-be-searched is 60 by 60 nautical miles (nmi). This consists of a mountainous area, a desert, a small forest and some roads. In Figures 1a and 1b, two maps of the terrain show the different types of terrain, and the different altitudes (that ranges from 856m to 2833m), respectively1 . In both figures, the straight lines depict roads in the terrain.
(a) Terrain types
(b) Altitude map
Fig. 1. Scenario Maps (Taken from [6])
A UAV that flies over the terrain cannot detect targets equally well in all types of terrain. We represent the ability-to-detect by means of a detection probability, denoted by pdot, where dot means detection-on-terrain. In Table 1a, the detection probabilities for the different types of terrain are shown. The right column of this table shows that the detection probabilities increase when targets are on a road.

Table 1. Scenario Assumptions
(a) Detection probabilities for different types of terrain:
  Terrain    pdot   pdot on road
  Desert     0.90   0.95
  Mountain   0.50   0.75
  Forest     0.10   0.50
(b) Percentage of targets per terrain type:
  Terrain type   % Targets
  Mountain       90
  Desert          7
  Road            2
  Forest          1
Targets. In this scenario, targets are stationary (i.e., non-mobile) objects located throughout the searched terrain. We consider all targets to be equally important (i.e., we do not prioritise with respect to a specific aim of a mission). Targets can be identified better when they are observed longer. We represent this gradually improving identification by means of a single scalar value, which increases as a UAV observes the object longer.
1 These maps are the same as those used in [6].
2 In [6], extensions are introduced that allow for varying the target importance.
UAVs. The UAVs in our model are planes that fly with a constant speed of 100 knots (kt) at a constant altitude of 3,000 meters above sea level. As previously mentioned, the UAV flies a particular search path that was determined beforehand. The adaptability of the UAV is that, upon observation of a target, it may decide to fly a circle over the target, enabling better identification. This decision depends on the objective function presented later in this section. After finishing the circle, it continues its original search path. A UAV has only limited resources (e.g., fuel); thus, when it decides to fly a circle, the path is shortened at its tail (details follow below).
How much a UAV can see on the ground depends on the altitude of the terrain. The detection range is defined as range(alt) = −6.5 · 10^−4 · alt + 1.96, where alt is the altitude of the terrain. We assume a viewing angle of about 51 deg in every direction. In the lowest regions of the terrain, the detection range is 1.4 nmi, while in the higher regions this number drops to just 0.1 nmi. The probability that a UAV detects a target on the ground, denoted by pdet(·), is determined by the detection range:
\[
p_{det}(cell) = \begin{cases} p_{dot} & \text{if within } range(alt) \\ 0 & \text{otherwise} \end{cases} \qquad (1)
\]
where cell is a single location in the terrain.
The UAV sensor automatically takes a picture every 30 seconds. In our scenario, a mission takes 2 hours, thus resulting in a total of 240 pictures taken and analysed. Finally, the maximum turning rate of the UAV is 2 degrees per second, which means that if the UAV wants to fly a circle above a certain object, this takes 3 minutes, or 6 pictures. Flying a circle above a target also means that the end of the search path is shortened by 3 minutes, or 6 pictures.
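A minimal sketch of this detection model is given below; the `terrain` object (with altitude and pdot lookups) and the `distance_nmi` helper are assumed interfaces introduced only for the example.

```python
def detection_range(alt):
    # range(alt) in nautical miles, for terrain altitude alt in metres.
    return -6.5e-4 * alt + 1.96

def p_det(cell, uav_position, terrain, distance_nmi):
    # Equation (1): the terrain-dependent probability p_dot if the cell is
    # within range(alt) of the UAV position, and 0 otherwise.
    if distance_nmi(uav_position, cell) <= detection_range(terrain.altitude(cell)):
        return terrain.p_dot(cell)
    return 0.0
```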
2.2 Solution Approach
We evaluate search paths by means of an objective function, based on (expected) value functions. This evaluation is needed 1) for the a priori calculations for determining optimal search paths, and 2) for in-flight adaptation of a search path. For the former (a priori search process), we provide more details in the following section. For the latter (in-flight adaptation), we provide details in this section after explaining the value functions used.
We employ two different functions for evaluation: first, the value function, which computes the total value of a path after flying; and second, the expected value, which estimates the value of a (partial) path before flying and, in the case of the adaptive agent, during the flight.
The value function is defined as
\[
V = \sum_{t=1}^{T} \sum_{n=1}^{N} utilityGain(n, t) ,
\]
where T is the number of discrete time intervals during the mission, N is the number of detected targets at time t, and utilityGain(n, t) is the gain in utility of information for target n at time t.
The utility gain function utilityGain(n, t) can be interpreted as the number of points scored for observing a target. Upon first observation of a target, the utility gain is 1. This increases linearly with time for the duration of observation of this target, with a maximum utility gain of 6 per target. The reason for this maximum is that identification cannot improve after 6 detections. However, after 6 consecutive non-detections (when a target seen before is now undetected), known information about that target is reset, which means that when the UAV encounters that target after that time, new information can be gained yet again for that target.
We define the expected value function of a UAV search path as
\[
E(V) = \sum_{t=1}^{T} \sum_{c=1}^{C} p_{det}(c) \, p_{target}(c) ,
\]
where T is the number of discrete time intervals during the mission, C is the number of cells within the detection range of the UAV at time t, p_det(c) is the probability of detection (which depends on the type of terrain at cell c), and p_target(c) is the probability of a target being present at cell c. We assume this information to be available and, because of the high resolution of the terrain, we also assume that no more than one target can be present at each cell. This formula thus estimates the number of targets that will be detected during the length of the mission, based on the probabilities of 1) the presence of a target and 2) detection by the UAV.
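The two evaluation functions could be computed along the following lines; the callbacks `detected_targets`, `utility_gain`, `cells_in_view`, `p_det` and `p_target` are assumed interfaces, and the range form of the expected value is included because the adaptive agent below evaluates E(V) over partial horizons.

```python
def value(T, detected_targets, utility_gain):
    # V = sum over timesteps t and over targets n detected at t of utilityGain(n, t).
    return sum(utility_gain(n, t)
               for t in range(1, T + 1)
               for n in detected_targets(t))

def expected_value(t_start, t_end, cells_in_view, p_det, p_target):
    # E(V) accumulated from timestep t_start to t_end (inclusive); with
    # t_start = 1 and t_end = T this is the expected value of the whole path.
    return sum(p_det(c) * p_target(c)
               for t in range(t_start, t_end + 1)
               for c in cells_in_view(t))
```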
2.3 UAV Adaptive Agent
The UAV agent determines the behaviour of the UAV in terms of adapting the flight path or not. The online adaptive agent decides on flying a circle above a detected target based on the expected value of the remaining search path. Pseudocode for this agent is given in Algorithm 1, which runs at each timestep of the flight, when a picture has been taken.

/* The UAV starts flying the predetermined search path. At each timestep t when
   a picture is taken and analysed, the following code is executed: */
if the UAV detects a target that has not been seen before then
    /* Determine the expected value of the rest of the search path
       (from the current timestep t until the final timestep T) */
    expValueWithout = E(V)_{t,T};
    /* Determine the expected value of the search path when a circle is made.
       To do this, the expected value gets 6 points for the circle (unless the
       expected value of the circle is greater than 6), and the rest of the
       path has been made 6 steps shorter. */
    expValueWith = E(V)_{t+6,T} + max(6, E(V)_{t,t+5});
    if expValueWith > expValueWithout then
        flyCircle();
    else
        keepFollowingOriginalPath();
    end
end

Algorithm 1. The algorithm for the online adaptive UAV agent
When the UAV is currently not flying a circle (because otherwise the UAV could start flying circles within circles and this would increase the complexity of getting back on the original path significantly), and a new target has been observed, two values are computed: the expected value of the rest of the search path without flying a circle (expValueWithout), and the expected value of the rest of the search path with the certainty of observing a certain target during the circle (expValueWith).
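The comparison itself reduces to a few lines. The following sketch mirrors Algorithm 1, with `expected_value_between(t0, t1)` assumed to return E(V) accumulated over timesteps t0 to t1 (for instance, the `expected_value` helper above with the appropriate cell sets).

```python
def decide_circle(t, T, expected_value_between, circle_steps=6, circle_gain=6.0):
    # Decision rule of Algorithm 1: compare the expected value of continuing
    # the path with the expected value of first flying a circle.
    exp_without = expected_value_between(t, T)
    exp_with = (expected_value_between(t + circle_steps, T)
                + max(circle_gain, expected_value_between(t, t + circle_steps - 1)))
    return exp_with > exp_without   # True -> flyCircle(), False -> keep path
```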
3 Experiments
In this section, we describe the experimental design and setup, the results we obtained and an analysis of these results.

3.1 Design and Setup
The main objective of this research is to investigate whether our online adaptive UAV agent improves the value of a predefined search path. To this end, we compare our agent, as described in Section 2, to two benchmark agents: the Naive Agent, in which the UAV has a predefined search path and just follows this path without doing anything differently, and the Exhaustive Agent, which has predefined online behaviour: the UAV starts flying the predefined search path and each time it detects a target, it always decides to fly a circle around that target before continuing its path. This agent is necessary in our experiments, because if we want to show that it is beneficial to sometimes fly a circle, we also need to show that it is not a good idea to always fly a circle.
Our experimental design has 3 independent variables that we systematically vary to investigate their effects: 1) target distribution, 2) search path, and 3) agent type.
– Target distributions: We have generated 10 different target distributions, each consisting of 1,000 targets, placed in the terrain using the distribution shown in Table 1b. For each type of terrain, the targets are normally distributed.
– Search paths: We run the experiments on 10 different search paths. We generated search paths by hand and ran a simple Particle Swarm Optimisation (PSO) technique [3] to optimise these search paths based on their expected value. This work closely resembles the work described in [6]. After we ran the PSO algorithm for a fixed amount of time, we picked the 10 best paths for use in our experiments.
– Type of agent: As explained above, there are three types of agents: the naive agent (without any online adaptation), the exhaustive agent (which will always fly a circle upon detection of a new target) and the adaptive online agent (which will base its decision to fly a circle on expected value calculations).
The main measurable is the obtained value of a search path given a type of agent. The higher the value of a search path, the better. For each combination of a search path and a target distribution, we measure the value of the paths that are generated by the three different agents. We hypothesise that the values of the paths generated by the adaptive agent are better than the values of the paths generated by the naive and the exhaustive agents. We also measure the number of detected targets and the total number of detections. Using these two metrics, we can see to what extent the different agents are better at searching, identification, or both.
The different types of terrain and the detection probabilities of the different types of terrain were explained above in Section 2. The UAV starts flying in the bottom right corner of the world.

3.2 Results
Before we present the results of our simulations, we give some illustrative screenshots of the simulation, showing different kinds of search paths (albeit somewhat simplified for reasons of clarity). Here, the UAV starts in the bottom right corner of the terrain, and each green dot is a location at which the UAV takes a picture which is then analysed using one of the three agents. An example flight is shown in Figure 2. In Figures 3a and 3b, the results for every run are shown in terms of value differences between the adaptive and the naive/exhaustive agents, respectively. On the x-axis of these charts are the 10 different target distributions.
Fig. 2. (a) shows an example naive path, without online adaption; (b) shows an example exhaustive path, with many circles during the flight; and (c) shows an example adaptive path, with some circles here and there
(a) V(adaptive) - V(naive). Positive values mean that the adaptive agent has outperformed the naive agent.
(b) V(adaptive) - V(exhaustive). Positive values mean that the adaptive agent has outperformed the exhaustive agent.
Fig. 3. Differences between the adaptive agent and the benchmark agents
For all these 10 target distributions, the results for the 10 different search paths that we used are shown. On the y-axis, the difference in value is shown. Figures 4a and 4b are two histograms of the data from Figures 3a and 3b. From these histograms, it becomes clear that the data is not normally distributed, but slightly positively skewed. In the next section, we analyse this skewness. We have also included an example graph in Figure 5. The figure shows, for each timestep, the value of the search path up until that point. All lines are non-descending, since the value will only increase over time. In Table 2, the mean values for the total number of detections per run of the different agents are shown, as well as the mean number of uniquely detected targets per run. The ratio between these two values, which gives an indication of how well the identification objective is executed, is also included in this table.

3.3 Analysis
From Figures 3a and 3b, we can see that the adaptive agent generally performs better than the naive method, and much better than the exhaustive method. Some exceptions occur, for instance for distribution 7. We analysed these exceptions; these UAV paths do not encounter as many targets as expected. The differences between the exhaustive and the adaptive agent are much larger. When many circles are flown in a short period of time, many targets will be detected far more than 6 times, which yields no further utility gain. The histograms in Figure 4 are positively skewed. Using the Wilcoxon Signed-Rank test, we found that the adaptive agent is significantly better than the naive and exhaustive agents at a significance level of p = 0.05, which validates our hypothesis. Figure 5 depicts an example run. In this figure, we observe that the naive agent does not differ significantly from the expected value. The exhaustive agent starts out well, but is outperformed by the other agents after some time. Note that Figure 5 is an example of one single run. Plots of other runs look different. This can also be derived from the other plots; sometimes the naive or exhaustive agents are better. But generally, the plots follow this pattern.
(a) Histogram of Value(adaptive) − Value(naive)
(b) Histogram of Value(adaptive) − Value(exhaustive)
Fig. 4. Histograms of the differences between the adaptive agent and the benchmark agents
[Figure 5 plot: (Expected) Utility versus Time for the Expected Utility, Naive Controller, Exhaustive Controller and Adaptive Controller.]
Fig. 5. Value increase over time (example)

Table 2. The mean values for the number of uniquely detected targets, the total number of detections and the ratio between these values
                        Naive    Exhaustive   Adaptive
  # targets             171.04    68.44       148.99
  # detections          315.04   362.22       347.09
  detections / targets    1.84     5.29         2.33
Our second metric, i.e., the number of detections versus the number of uniquely detected targets, is depicted in Table 2. Using the numbers from this table, we can say something about strengths and weaknesses of each agent. We expected the naive agent to be the best in searching, the exhaustive agent to be the best in identification and the adaptive agent to be the best in jointly optimising these objectives. The naive agent has the highest mean number of uniquely detected targets, while the exhaustive agent has the highest ratio between the number of detections and the number of targets. The adaptive agent is best in jointly optimising these objectives.
4 Conclusions
In this paper, we propose a UAV agent that adapts its predefined search path online, according to actual observations during the mission. The adaptive agent flies a circle above a detected target when it expects that this will improve the total value of the search path. Our results show that our agent significantly outperforms both a naive and an exhaustive agent. However, the adaptive agent does not outperform the other two in every instance; in some cases one of the benchmarks is better. This result can be attributed to unexpected situations during the flight. We also conclude that each agent has its own strength. It depends on the user's goal which agent is best. In our scenario, we want to jointly optimise
the search and identification objectives. Using these objectives jointly, our adaptive agent outperforms the benchmarks. But if searching were the only objective, the naive agent would be better; likewise, if identification were the only objective, the exhaustive agent would be the better one. As a future research path, we will generalise the model further by introducing different kinds of vehicles with different kinds of capabilities (e.g., helicopters, ground vehicles, underwater vehicles). We will investigate how to model different capabilities and how the different vehicles in the field can make use of other vehicles' capabilities. Related work in this direction has been done by Kester et al. [4] to find a unifying way of designing Networked Adaptive Interactive Hybrid Systems.
References 1. Allaire, F.C.J., Tarbouchi, M., Labonte, G., Fusina, G.: Fpga implementation of genetic algorithm for uav real-time path planning. Intelligent Robot Systems 54, 495–510 (2009) 2. Berger, J., Happe, J., Gagne, C., Lau, M.: Co-evolutionary information gathering for a cooperative unmanned aerial vehicle team. In: 12th International Conference on Information Fusion, FUSION 2009, pp. 347–354 (2009) 3. Eberhart, R., Kennedy, J.: Particle swarm optimization. In: Proceeding of IEEE International Conference on Neural Networks, vol. 4 (1995) 4. Kester, L.J.M.H.: Designing networked adaptive interactive hybrid systems. In: Proceedings of IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, August 20-22, pp. 516–521 (2008) 5. Legras, F., Glad, A., Simonin, O., Charpillet, F.: Authority Sharing in a Swarm of UAVs: Simulation and Experiments with Operators. In: Carpin, S., Noda, I., Pagello, E., Reggiani, M., von Stryk, O. (eds.) SIMPAR 2008. LNCS (LNAI), vol. 5325, pp. 293–304. Springer, Heidelberg (2008) 6. Pitre, R.R., Li, X.R., DelBalzo, D.: A new performance metric for search and track missions 2: Design and application to UAV search. In: Proceedings of the 12th International Conference on Information Fusion, pp. 1108–1114 (2009) 7. Sauter, J.A., Matthews, R., Van Dyke Parunak, H., Brueckner, S.A.: Performance of digital pheromones for swarming vehicle control. In: AAMAS 2005: Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 903–910. ACM, New York (2005)
Reconstruction of Causal Networks by Set Covering Nick Fyson1,2 , Tijl De Bie1 , and Nello Cristianini1 1
Intelligent Systems Laboratory, Bristol University, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK 2 Bristol Centre for Complexity Sciences, Bristol University, Queen’s Building, University Walk, Bristol, BS8 1TR, UK http://patterns.enm.bris.ac.uk
Abstract. We present a method for the reconstruction of networks, based on the order of nodes visited by a stochastic branching process. Our algorithm reconstructs a network of minimal size that ensures consistency with the data. Crucially, we show that global consistency with the data can be achieved through purely local considerations, inferring the neighbourhood of each node in turn. The optimisation problem solved for each individual node can be reduced to a set covering problem, which is known to be NP-hard but can be approximated well in practice. We then extend our approach to account for noisy data, based on the Minimum Description Length principle. We demonstrate our algorithms on synthetic data, generated by an SIR-like epidemiological model. Keywords: machine learning, network inference, data mining, complex systems, minimum description length.
1 Introduction
There has been increasing interest over recent years in the problem of reconstructing complex networks from the streams of dynamic data they produce. Such problems can be found in a highly diverse range of fields, for example in determining Gene Regulatory Networks (GRNs) from expression measurements [1], or the connectivity of neuronal systems from spike train data [2]. All share the similar challenge of extracting the causal structure of a complex dynamical system from streams of temporal data. We here address the distinct challenge of reconstructing networks from data corresponding to stochastic branching processes, occurring on directed networks and where a discrete ‘infection’ is propagated from node to node. The clearest analogy lies in the field of epidemiology, where instances of infection begin at particular nodes, before propagating stochastically along edges until the infection dies out. Recent work has seen a new approach to this problem based on a Maximum Likelihood framework, in which the approach was applied to meme data extracted from the international media and blogs [3,4]. We here outline our own approach to this problem, developed in parallel to that of Rodriguez et al., but a direct comparison is beyond the scope of this paper.
2 Network Reconstruction
In this paper we consider two networks over the same set of nodes, the true underlying network $G_T = (V, E_T)$ and the reconstructed network $G_R = (V, E_R)$. We assume a dynamic branching process occurs on the network $G_T$, in which the transfer of ‘markers’ occurs. Markers originate at a particular node in the network, and then propagate stochastically from node to adjacent node, ‘traversing’ along only those edges that exist in the set $E_T$. Due to the range of potential applications our description of the method remains intentionally abstract, but with analogy to epidemiology we use terms such as infection, infectious and carrier. Each marker that is propagated through the network generates a ‘marker trace’ $M^i$, and the set of all marker traces is denoted by $\mathcal{M} = \{ M^i \}$. The marker trace is represented by an ordered set of the nodes that carried that marker, in the order in which they became infected. We will use subscripts to refer to individual nodes in a marker trace. We formally define the notion of a marker trace as follows.

Definition 1 (Marker Trace, $M^i$). A Marker Trace $M^i$ is an ordered set of $n_i$ distinct nodes $w^i_j \in V$, and we denote it as $M^i = (w^i_1, w^i_2, \ldots, w^i_{n_i})$.

Each marker trace defines a total order over the reporting nodes, and we use the notation $v_i <_{M^i} v_j$ to state that the node $v_i$ appears before node $v_j$ in the marker trace $M^i$. For clarity in future definitions we also formally define a path from one node to another within a network.

Definition 2 (Path in a network $G = (V, E)$). A sequence $U = u_1, \ldots, u_k$ of nodes $u_i \in V$ is a path in $G = (V, E)$ if $\forall\, 1 \leq i < k$, $(u_i, u_{i+1}) \in E$.

2.1 Problem Formulation and Global Consistency
Problem 1 (Informal Description). Given a set $\mathcal{M}$ of Marker Traces, construct a network $G_R$ approximating the true network $G_T$ that generated $\mathcal{M}$.

Intuitively, it makes sense to choose $G_R$ such that it is capable of generating $\mathcal{M}$. Given our assumptions on the mechanism of data generation, this requires that for each marker a path exists from the originator to all other carrier nodes, passing only through nodes that have been previously infected. We will refer to this as ‘global consistency’ and formalise the intuitive notion as follows.

Definition 3 (Globally Consistent, GC). $G_R$ is GC with $M^i \iff \forall\, w^i_j$ with $j > 1$ $\exists$ a path $w^i_1, \ldots, w^i_j$ in $G_R$.

Trivially, it is clear that a completely connected network is consistent with all possible data, and hence we aim to reconstruct a consistent set $E_R$ of minimal size. Combining the above allows us to formalise our goal in terms of an optimisation problem.

Problem 2.
\[
\operatorname{argmin}_{E_R} |E_R| \quad \text{subject to} \quad G_R = (V, E_R) \text{ is GC with } M^i \;\; \forall\, M^i \in \mathcal{M}
\]
2.2 Local Consistency
For a reconstruction to make intuitive sense we require global consistency between network and data, but this is computationally impractical. Below, we demonstrate the equivalence of global consistency with ‘local consistency’, an alternative that allows us to consider the immediate neighbourhood of each node in turn. Local consistency requires that, for each node reporting a particular marker, the node must have at least one incoming edge from a node that has reported the marker at an earlier time. This concept is formalised as follows.

Definition 4 (Locally Consistent, LC). $G_R$ is LC with $M^i \iff \forall\, w^i_j$ with $j > 1$ $\exists\, w^i_k$ with $k < j$ : $(w^i_k, w^i_j) \in E_R$.
Theorem 1 (LC $\iff$ GC). Demonstrating local consistency between $G_R$ and $M^i$ is necessary and sufficient to ensure global consistency.
argminER |ER | GR = (V, ER ) is LC with M i ∀ Mi ∈ M
Crucially, to establish local consistency we need only consider the immediate neighbourhood of each node in turn. This explanation of reports is necessarily performed for each node individually, and in each of these subproblems we establish the minimal set of incoming edges required to explain all the reported markers. From now on, unless otherwise specified, we describe approaches as applied to discovering the parents of a particular node, which can then be applied to each node in turn. 2.3
Formulation in Terms of Set Covering
We treat the reconstruction on a node-by-node basis and denote the node under consideration as v. As specified by local consistency, in considering the incoming edges for a particular node we must include at least one edge from a node that has reported each marker at an earlier time. Each edge therefore ‘explains’ the
Reconstruction of Causal Networks by Set Covering
199
presence of a subset of the reported markers, and if the set of all incoming edges together explains all the reported markers, we ensure local consistency. This problem of ‘explaining’ marker reports may be neatly expressed as a set covering problem. We first formally state the set covering optimisation problem: Given a universe A and a family B of subsets of A, the task is to find the smallest subfamily C ⊆ B such that C = A. This subfamily C is then the ‘minimal cover’ of A. Given this formal framework, we now define how these sets relate to our reconstruction problem. Definition 5 (Universe, Av ). The set of all markers that have been reported by the node v, given by Av = {i : v ∈ M i }. The node v can have an incoming edge from any other node, and hence the space of potential incoming edges is F v = (V /v) × v. As stated above, each potential incoming edge will ‘explain’ a subset of the markers reported by v, and therefore every edge fjv ∈ F v corresponds to one element Bjv in the family of subsets B v . Definition 6 (Family of subsets, B v = {Bjv }). Each subset Bjv is defined by a potential incoming edge (vj , v) = fjv ∈ F v , where i is in Bjv if and only if vj appears earlier than v in the marker trace M i , given by Bjv = {i : vj <M i v}. The set covering problem then requires us to find a subfamily C v ⊆ B v such that C v = Av , and this subfamily C v directly corresponds to a set of incoming edges for the node v. v ). The set of Definition 7 (Reconstructed Incoming Edges, ER v reconstructed edges, ER , consists of the set of all elements in F that correspond v to elements of C, given by ER = {fjv ∈ F v : Bjv ∈ C v }.
This then allows us to make a final definition of our optimisation problem, this time in terms of the set covering problem. The following problem is defined for each node v ∈ V . Problem 4. subject to
Av =
v | argminERv |ER
v v C v where C v = {Bjv : fjv ∈ ER } and ER ⊆ Fv
Finally, v repeating this optimisation for all nodes in the network, we get ER = v ER , allowing us to reconstruct the entire network through only local considerations. 2.4
Greedy Approximation to Set Covering
The set covering problem is known to be NP-hard, but in practice may be well approximated using a greedy approach. This algorithm is well documented for set covering, including bounds on worst-case performance [5,6], but here we briefly outline the approach. We wish to cover the set A by selecting from the family of subsets B. We first select the subset Bj ∈ B that covers the greatest number of
200
N. Fyson, T. De Bie, and N. Cristianini
elements in A, ie. such as to maximise |Bj |. The corresponding edge fj is then v added to the set of reconstructed edges ER . A subset of A has now been covered, and hence these elements are removed both from A and all subsets in the family B. This process is repeated until A = ∅.
3
Reconstruction from Noisy Data
The assumption thus far of noise-free data is unrealistic, and hence we define an extension of our approach. The basic approach assumes that every report of a marker is due to direct infection from a previously infected node, and hence requires that every report be explained. When noise is present these assumptions do not hold, and each missing marker report may imply false-positive edges in the reconstruction, a problem that will only be exacerbated by using more data. In executing the greedy approximation to set covering, we first select those subsets that cover the greatest number of remaining elements. While the noise level remains low, therefore, we will first select true edges, since they will tend to explain relatively more reports. We can therefore expect that the noise-induced false positives will be added toward the end. This is demonstrated empirically in Fig. 2a, and motivates our defining a criterion to halt the covering early. 3.1
Minimum Description Length
In selecting the optimal point to halt the set covering we appeal to the Minimum Description Length (MDL) principle [7]. This states that in model selection one should prefer models that are able to communicate the data in the lowest number of bits. This is in principle equivalent to considering Maximum Likelihood Estimation [8], but our case lends itself particularly well to the use of MDL. For the network, all edges are explicitly assigned 0 or 1 meaning the description is of fixed length, and hence does not in itself favour sparsity in the reconstructed network. The Description Length (DL) is dependent only on how efficiently the set of markers can be expressed in light of the network structure. Coding Scheme. To describe a marker trace we must specify in order all those nodes that are members of the set M i . A simple ordered list requires log N bits per node, where N is the number of nodes in the network. This is straightforward, but using the framework of the underlying network may allow us to describe this same information in a compressed form. Instead of simply listing the reporting nodes, we describe the progression of the marker through the network. To allow coding of markers that are not consistent with the network, we also define a ‘supernode’ in addition to the standard network, which is the originator of all markers, and by definition a parent of every other node. Each report in a trace requires us to first identify the infecting parent, and then which of its children is the new carrier. All markers start at the ‘supernode’, so the cost of describing the first report is log 1 = 0 to identify the parent and log dp1 = log N to specify the child (since the ‘supernode’ has an out-degree of N ). For the second report there are now two potential parents, so the cost is
Reconstruction of Causal Networks by Set Covering
201
(log 2 + log dp2 ), where dp2 is the out-degree of the cheapest possible infecting parent (which where possible will be a standard node, otherwise the more costly ‘supernode’). Similarly, the cost for the third report is (log 3 + log dp3 ) bits, the fourth (log 4 + log dp4 ) and so on. 3.2
MDL as Stopping Criterion
While there is no explicit cost to defining edges, nodes of higher degree are more expensive to use as the parent of a report. Adding a new outgoing edge to a node may make the description of one report cheaper, but will increase the cost for all other reports using that node as a parent. In general, therefore, the network that allows the shortest description of all marker traces will lie at some point between completely disconnected and completely connected. To use MDL as a stopping criterion requires the final addition of edges to be considered globally. We still perform greedy set covering for each node in turn, but make a note of each edge and the number of additional elements covered when it is selected. After doing this for all nodes we have a list of edges across the whole network, along with their explanatory power within the greedy set covering framework. We then rank them all by the elements covered, and follow this order in adding edges to the network. We can therefore calculate the change in DL as edges are added, and subsequently select the network ER that gives the lowest total description length.
4
Empirical Evaluation
To assess the quality of our reconstruction we use both precision-recall curves and the Jaccard Distance (JD) between the set of true and reconstructed edges. For identical sets this has a value zero, and a value of one if the two sets have no elements in common at all. The JD is given by JD = ( |ET ∪ ER | − |ET ∩ ER | ) / |ET ∪ ER | . 4.1
(1)
Generation of Synthetic Data
We use two models for directed networks of N nodes. The first is an Erd˝osR´enyi model in which each edge exists with probability p = 2/N , resulting in a sparse network that is likely to be dominated by a single weakly-connected giant component. The second is generated through a preferential attachment model, in which each new node forms incoming edges from two existing nodes [9]. This results in a network of the same average degree as the ER model, but with a scale-free topology. We generate markers with a process based on the SIR epidemiological model. Each node can be in one of three states; Susceptible (S), Infected (I) or Recovered (R), and all nodes begin in state S. One randomly selected node is set to state ‘I’, and the state of each node in all subsequent time steps is determined
202
N. Fyson, T. De Bie, and N. Cristianini
stochastically from its current state and that of all its parents. In each time step there is a probability pI = 0.1 that the infection will propagate along any edge leading from an infected to susceptible node, and a probability pR = 0.1 that any given infected node will recover. Parameters were selected to produce marker traces of reasonable length given network size. Additionally, we include a noise parameter, where ploss = 0.05 means 5% of all reports are lost. 4.2
Naive Approaches
We define two naive algorithms for network reconstruction, to which it will be instructive to compare our set covering approach. The most immediately obvious explanation for the creation of a marker trace is that each node became infected by the node immediately preceding it in time. Simply taking the union of all these edges results in a network that is consistent with the data, and is given by i EN1 = (wni , wn+1 ) ∀n . (2) M i ∈M
In the second naive approach we consider only the first two reports from each trace. This does not use all the available information, but in the noise-free case guarantees no false positives, and is given by (w1i , w2i ) . (3) EN 2 = M i ∈M
1
1
0.8
0.8
0.6 0.4 0.2 0 0 (a)
False +ve rate
True +ve rate
Jaccard Distance
1
0.6 0.4 0.2
200 400 600 800 # markers naive 1
0 (b) 0 naive 2
200
400 600 # markers
800
set covering
0.1 0.01 0.001 0.0001 0 (c)
200
400 600 # markers
800
set covering bound
Fig. 1. Performance of set covering reconstruction, relative to naive approaches and theoretical bounds. For TPR, the data for naive 2 and set covering bound coincide. FPR for naive 2 is always zero, and hence not shown. Results are shown for Erdős–Rényi networks of 100 nodes.
4.3 Reconstruction from Noise-Free Data
Figure 1 shows the results of network reconstruction using our set covering algorithm. Our algorithm clearly exceeds the worst-case bounds and the naive approaches, indicating that the greedy heuristic performs well, and that we make
efficient use of the information. Figures 1b and 1c clearly show that false positives are the cause of the poor performance of the first naive approach. This is entirely expected, but illustrates that it is not sufficient to simply find any network that is consistent with the data. In contrast, given infinite data the set covering approach tends toward perfect performance, and at a faster rate than the second naive approach.
4.4 Reconstruction from Noisy Data
Figure 2a gives the empirical motivation for using a halting criterion, showing that for noisy data the closest match to the true network is obtained before the set covering is complete. The circles plotted in Fig. 2 indicate the point at which the minimum description length was obtained. Figures 2c and 2d demonstrate that halting using MDL includes the majority of true positives, but limits the inclusion of false edges. This is demonstrated even more clearly in Fig. 3, where results are shown both with and without the use of MDL stopping. Figure 3c shows that MDL halting bounds the rate of false positives as the amount of data increases, in stark contrast to results for the basic set covering approach.
[Figure 2: four panels plotted against the number of edges added — (a) Jaccard Distance, (b) Description Length, (c) True +ve rate and (d) False +ve rate — for ploss = 0.00, 0.05 and 0.10.]
Fig. 2. Plots showing variation of JD, DL, TPR and FPR with the progress of set covering. Circles indicate the point at which the MDL criterion would have halted the covering. Each line shows reconstruction of a 100 node Erdős–Rényi network, using 1000 markers.
Finally, in Figure 4 we demonstrate the performance of our approach on much larger networks, using 3000 markers to reconstruct ER and scale-free networks of size 1000 nodes. As one would expect, the performance becomes progressively worse with higher noise levels, but as indicated by the symbols in Fig. 4a, our MDL stopping allows us to maintain good precision even with relatively high levels of noise. In Fig. 4b we see that the set covering reconstruction can achieve high performance for scale-free networks, and is in fact more robust to noise, but it is important to note that the MDL stopping does not work for this network topology. This is likely due to the presence of hubs in scale-free networks, and will require an alternative coding scheme to account for this.
[Figure 3: panels (a) Jaccard Distance, (b) True +ve rate and (c) False +ve rate against the number of markers, with and without MDL stopping.]
Fig. 3. Plots showing clear benefit of MDL stopping for noisy data. Solid line is with MDL stopping, dotted line without. Network was an Erdős–Rényi model of 100 nodes and ploss = 0.05.
[Figure 4: precision plotted against recall in panels (a) and (b), with curves for ploss = 0.00, 0.01, 0.05, 0.10 and 0.20.]
Fig. 4. Precision-recall curves for reconstruction under various noise conditions, with symbols indicating the MDL halting point. Subfig (a) shows results for an Erdős–Rényi network, while (b) shows results for a scale-free network. Networks are of size 1000 nodes, reconstructed from 3000 markers.
5 Conclusions
Our work demonstrates a novel approach to the reconstruction of causal networks underlying stochastic branching processes, from data representing, for example, information flow or the spread of an epidemic on a network. Using the intuitive notion of consistency between a network and such data, we demonstrated that the entire network can be reconstructed node by node, using only local considerations. In this way, we were able to reformulate the problem in terms of the set covering problem, which is NP-hard but can be approximated well using an efficient greedy algorithm. The extension of our approach using the Minimum Description Length as a criterion for halting the covering allows reconstruction to be performed on noisy data, and we demonstrated this for large networks of different topologies and with various noise levels. In further work we plan to investigate direct optimisation of the MDL cost function, as well as alternative coding schemes to account for different network topologies. Another avenue for extending the approach is the use of exact time information, rather than considering only the order of reports. Finally, we intend
to apply our methods to various real-life data sets, such as the propagation of memes on the media network [10,11], and fault propagation data [12]. Acknowledgments. Nick Fyson is supported by the Bristol Centre for Complexity Sciences (EPSRC grant EP/5011214) and Nello Cristianini is supported by a Royal Society Wolfson Merit Award. This work is partially supported by EPSRC grant EP/G056447/1 and by the European Commission through the PASCAL2 Network of Excellence (FP7-216866).
References
1. Sprinzak, D., Elowitz, M.B.: Reconstruction of genetic circuits. Nature 438(7067), 443–448 (2005)
2. Brown, E.N., Kass, R.E., Mitra, P.P.: Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience 7(5), 456–461 (2004)
3. Leskovec, J., McGlohon, M., Faloutsos, C., Glance, N., Hurst, M.: Cascading behavior in large blog graphs. In: SDM 2007 (2007)
4. Rodriguez, M., Leskovec, J., Krause, A.: Inferring networks of diffusion and influence. In: KDD 2010 (2010)
5. Chvátal, V.: A greedy heuristic for the set-covering problem. Mathematics of Operations Research 4(3), 233–235 (1979)
6. Slavík, P.: Improved performance of the greedy algorithm for partial cover. Information Processing Letters 64(5), 251–254 (1997)
7. Wallace, C.S., Boulton, D.M.: An information measure for classification. Computer Journal 11(2), 185–194 (1968)
8. MacKay, D.J.C.: Information Theory, Inference & Learning Algorithms, 1st edn. Cambridge University Press, Cambridge (2002)
9. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
10. Leskovec, J., Backstrom, L., Kleinberg, J.: Meme-tracking and the dynamics of the news cycle. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 497–506 (2009)
11. Snowsill, T., Nicart, F., Stefani, M., De Bie, T., Cristianini, N.: Finding surprising patterns in textual data streams. In: Proceedings of Cognitive Information Processing 2010 (April 2010)
12. Nageswara Rao, S., Viswanadham, N.: Fault diagnosis in dynamical systems: a graph theoretic approach. International Journal of Systems Science 18(4), 687–695 (1987)
The Noise Identification Method Based on Divergence Analysis in Ensemble Methods Context

Ryszard Szupiluk¹,², Piotr Wojewnik¹,², and Tomasz Zabkowski¹,³

¹ Polska Telefonia Cyfrowa Ltd., Al. Jerozolimskie 181, 02-222 Warsaw, Poland
² Warsaw School of Economics, Al. Niepodleglosci 162, 02-554 Warsaw, Poland
³ Warsaw University of Life Sciences, ul. Nowoursynowska 159/34, Warsaw, Poland
{rszupiluk,pwojewnik,tzabkowski}@era.pl
Abstract. In this paper we propose a divergence-based method for noise detection in the ensemble methods context, where the prediction results from different models are treated as a multidimensional variable that contains constructive and destructive latent components. The crucial stage is the proper classification of the destructive and constructive components. We propose to calculate the noisiness of a particular latent component as its divergence from a chosen reference noise. This allows us to identify a wide range of noises besides the typical signals with a closed analytical form, such as Gaussian or uniform. A real data experiment with energy load prediction confirms the presented methodology. Keywords: Ensemble methods, Noise detection, Statistical divergences.
1 Introduction
The random noise detection is one of the fundamental tasks in signal and data processing [19]. It can be found in the context of ensemble methods based on signal separation. In this approach, the prediction results from different models are treated as a multidimensional variable containing hidden constructive and destructive components [17]. The destructive components result from errors in data (measurement, input, edition etc.), model misspecification or suboptimal model training. On the other hand, the constructive components are the elements of the predicted value reproduced by particular models. These latent components are identified using the methods of blind signal separation [4,5,9,15]. In this case, the method of model building is not crucial, because we do not formulate any assumptions in this regard. From this point of view, the presented approach differs from other ensemble methods, where the model results are averaged and the ensemble methods might be successful only if the assumptions about the aggregated models' structure are met [3,18]. One of the key issues in this method is the correct classification and distinction between destructive and constructive components. The latent components, both destructive and constructive, may have many different forms and characteristics. They can be described in terms
of regularity, variability, sparseness or randomness and some of these properties appear jointly. We will focus here only on the destructive components which can be treated as random noises, taking into account some of the above mentioned properties.
2 Data Decomposition for Prediction Ensemble
The ensemble method via blind signal separation is stated as follows. After the learning process we collect the particular primary model results x_i together in one multivariate variable X = [x_1, x_2, ..., x_n]^T, X ∈ R^{n×N}, where N is the number of observations. We assume that these prediction results contain certain latent components S = [s_1, s_2, ..., s_n]^T. A latent component can be constructive or destructive for the prediction results. The constructive components s_j = ŝ_j are associated with the true predicted value, whereas the destructive components s_l = s̃_l are responsible for errors. Next we assume that each prediction result is a linear mixture of the latent components, where the relation can be described in matrix form as

X = AS ,    (1)

where S = [s_1, s_2, ..., s_n]^T = [ŝ_1, ŝ_2, ..., ŝ_k, s̃_{k+1}, ..., s̃_n]^T, S ∈ R^{n×N}, and the matrix A ∈ R^{n×n} represents the mixing system. The relation (1) means a factorization of the matrix X by the latent components matrix S and the mixing matrix A. Our aim is to find the latent components, reject the destructive ones (replace them with zero) and then mix the constructive components back to obtain improved prediction results as

X̂ = AŜ = A [ŝ_1, ŝ_2, ..., ŝ_k, 0_{k+1}, ..., 0_n]^T .    (2)

The above ensemble methodology faces us with two fundamental problems: how to estimate the components S (or, equivalently, the matrix A), and how to choose the destructive ones. Some solutions to the first problem are proposed by blind signal separation (BSS) methods aiming at the identification of unknown signals mixed in an unknown system [4,5,9]. The process of estimating the matrix A with BSS methods can, under several conditions, explore different properties of the data like: independence, decorrelation, sparsity, smoothness, non-negativity. Many of those approaches are tested and successfully applied in practice for the model aggregation problem via BSS [17]. For the present consideration we can assume that the matrix A and a set of latent components S are obtained from an ICA or PCA algorithm [4,9]. But it is still an open question how to indicate the destructive components properly. This can be a difficult task in general, because the obtained components might not be purely constructive or destructive, due to many reasons like an improper linear transformation assumption or statistical characteristics other than those explored by the chosen BSS method. Consequently, it is possible that some component has a constructive impact on one model and a destructive one on another. There may also exist components that are destructive individually but constructive in a group. As a simple solution we might eliminate each signal subset of S and check
the impact on the final results. This procedure is quite simple and often works well, but for a higher number of components the method is computationally expensive. Therefore, we try to find another method that can be used to classify the basis latent components. One way is to associate the destructive components with random noises. This leads us to the noise detection problem.
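The brute-force variant mentioned above (eliminating every subset of latent components and checking the effect) can be written down in a few lines. The sketch below uses a plain SVD/PCA factorisation of X as a stand-in for the BSS stage and an arbitrary user-supplied error function; both are assumptions of the example rather than part of the method.

```python
import itertools
import numpy as np

def best_subset_elimination(X, error):
    """X: n x N matrix of stacked model predictions. Try zeroing every subset of
    latent components and keep the remix with the smallest error(X_hat)."""
    # PCA-like factorisation X = A @ S (stand-in for a BSS method)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    A, S = U * s, Vt                      # mixing matrix and latent components

    best_err, best_Xhat = np.inf, X
    n = S.shape[0]
    for subset in itertools.chain.from_iterable(
            itertools.combinations(range(n), r) for r in range(n + 1)):
        S_hat = S.copy()
        S_hat[list(subset), :] = 0.0      # treat these components as destructive
        X_hat = A @ S_hat                 # remix, Eq. (2)
        err = error(X_hat)
        if err < best_err:
            best_err, best_Xhat = err, X_hat
    return best_Xhat, best_err
```

As the text notes, the loop over all 2^n subsets is exactly what becomes expensive for larger n, which motivates the divergence-based classification developed next.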
3 Random Noise Detection Problem
The term random has a wide-ranging physical, epistemological, philosophical and mathematical meaning, and typically a deeper insight into randomness leads us to such notions as uncertainty, complexity, predictability and even logic [14]. Consequently, random noise can be defined, described and analyzed in many ways, but the probabilistic approach seems to be the most natural [18,19]. In such an approach the typical noise model is described by the Gaussian distribution, which is strongly supported from a theoretical point of view by the Central Limit Theorem [10]. Of course there are many other noise models based on the uniform distribution, mixtures of Gaussians, alpha-stable distributions, 1/f noises etc. [11,12,16]. One of their common features is the closed analytical form of the noise distributions, expressed by a probability distribution function, cumulative distribution function or characteristic function. Starting from such theoretical assumptions we can perform real empirical research like correlation analysis or R/S analysis [10,13,15]. The problem with noise is the fact that the mathematical definition and interpretation can be far from physical reality, especially when "physical" means economic, social or medical phenomena [13,16]. A well-known example is the mathematical white noise, defined as a sequence of random variables with independent identical distribution (IID) [6,18,19]. In technological problems like signal filtration or system identification it is very convenient to model the random distortions as white IID noises, especially when the useful signals are regular and easily recognizable [7]. In other words, we have some a priori information about how the noise and the desired signal look. Thus, the white noise description might be a fine approximation of the signal disruptions, even if they are colored noises. The situation is more complicated if the mathematical description does not reflect the usefulness of the signal. We can observe the problem on the financial markets, where the logarithmic returns are described by stationary stochastic processes [16]. In practice, it means the assumption that the rate of return is generated from a series of identically distributed random variables. Moreover, the variables are uncorrelated or are transformed with the Box-Jenkins methodology to obtain this property [1,2,16]. As a consequence, a signal fulfilling the mathematical definition of white noise might be both a disruption (in technological applications) and useful information (in financial time series). We can conclude that the mathematical definition of random noise does not necessarily imply its positive or negative role in the context of the analyzed problem. The situation becomes more complicated if both
the useful and the disrupting signals reveal similarity. We can observe this situation in the case of model aggregation via BSS methods. The constructive and destructive components are mixtures of random and deterministic signals with unknown distributions. We can distinguish the noise only if we define its properties a priori. But how can we formulate such a priori information, especially if the data distributions are far from closed analytical distributions like Gaussian or uniform?
4 Reference Noise
To solve the above problem we can find some reference noise, or a set of reference noises, and compare them with the analyzed signal. The reference noise can be easily obtained from random number generators if we have a priori information. This seems especially adequate for non-Gaussian or multimodal distributions, for which it can be more effective to obtain some reference noise signal than to assume and check its statistical properties theoretically. The most interesting case is the reference noise identified from the data of the analyzed problem. In the prediction task the signal might be found from a decomposition of the target value or the input data. There are many ways to acquire a reference noise from the target, e.g. signal decomposition, filtration or differentiation. In our approach, to find a reference noise z from the target d we propose a simple formula:

z(t) = d(t) − MA_d(n) ,    (3)

where MA_d(n) denotes the n-point moving average of d. In this way we formulate the problem of random noise detection as the question whether a certain signal is similar to the reference random noise. The main question now is how to measure the similarity between a particular signal and the reference noise. For this task we apply a divergence approach.
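Formula (3) amounts to subtracting a moving average from the target. A minimal numpy sketch is given below; the centring of the averaging window (via `mode='same'`) is an assumption of this example, not something the text specifies.

```python
import numpy as np

def reference_noise(d, n):
    """z(t) = d(t) - MA_d(n): residual of an n-point moving average, Eq. (3)."""
    kernel = np.ones(n) / n
    ma = np.convolve(d, kernel, mode='same')   # n-point moving average of the target
    return d - ma
```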
5 Similarity Measure and Divergences
The similarity measure between two nonnegative sequences or patterns can be taken as a divergence function [5]. A divergence function D(y‖z) between two variables z and y fulfills the following conditions: D(y‖z) ≥ 0 and D(y‖z) = 0 only if y = z. However, the triangular inequality D(y‖z) ≤ D(y‖x) + D(x‖z) is not a necessary condition for a divergence. Such divergence functions are usually used on a manifold of probability distributions but can also be used for general data analysis. The most popular types of divergences are the following [5]:

1. Kullback-Leibler divergence

D_KL(z‖y) = Σ_{i=1}^{N} z_i ln( z_i / y_i ) ,    (4)

2. Csiszar divergence, defined as

D_C(z‖y) = Σ_{t=1}^{N} ϕ( y(t) / z(t) ) ,    (5)
where z(t) ≥ 0, y(t) ≥ 0 and the function ϕ : [0, ∞) → (−∞, ∞) is convex on (0, ∞) and continuous at zero. The divergence function (5), under the additional restrictions that ϕ(1) = 0 and strict convexity at 1, can be used as a distance measure with several special cases like:

(a) if ϕ(u) = (u − 1)² then we obtain Pearson's distance,
(b) if ϕ(u) = u(u^{δ−1} − 1)/(δ² − δ) + (1 − u)/δ, where δ = (1 + α)/2, then we obtain the family of Amari's alpha divergences.
In the general case, before applying a divergence measure we should preprocess the data into nonnegative vectors summing up to one. Since we look for similarity in the signal shapes, this preprocessing is not a significant limitation. Now we assume that the primary prediction results X are decomposed into basis components S by one of the BSS methods. The basis components are constructive or destructive. To find the destructive ones we make a comparison with the reference noise z from Section 4. The chosen divergence is computed between z and every signal s_i. If the divergence is symmetric or close to symmetric, D(z‖s_i) = D(s_i‖z), then the signal s_i is similar to the noise z extracted from the target variable and s_i should be removed. As a practical similarity measure we can take the symmetry factor q:

q = | log( D(z‖s_i) / D(s_i‖z) ) | .    (6)

The rest of the signals are mixed back with the system inverse to the BSS decomposition. The resulting values are the improved predictions, see Fig. 1. The full ensemble method with noise extraction can be presented as the following algorithm:

1. Collect the predictive models' results into the multivariate variable X.
2. Decompose the matrix X with BSS methods into signals S = [s_i], i = 1, ..., n.
3. Extract the random noise characterization z from the target value, e.g. with filtration methods, or derive it from a random generator.
4. Measure the divergence values D(z‖s_i), D(s_i‖z), i = 1, ..., n.
5. Classify as destructive the components s̃_m whose divergence to the noise is symmetric or close to symmetric, D(z‖s_i) = D(s_i‖z), and obtain S = [s_1, s_2, ..., s_n]^T = [ŝ_1, ŝ_2, ..., ŝ_k, s̃_{k+1}, ..., s̃_n]^T, S ∈ R^{n×N}.
6. Eliminate the destructive signals and mix the remaining ones with the system inverse to the decomposition, X̂ = AŜ.
7. The resulting values X̂ ∈ R^{n×N} are the improved version of the predictions.
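Steps 4–5 and the symmetry factor (6) can be implemented directly once the signals are preprocessed to nonnegative vectors summing to one. In the sketch below the shift-and-normalise preprocessing and the small epsilon used to avoid division by zero are assumptions of the example.

```python
import numpy as np

def to_distribution(x, eps=1e-12):
    # shift to nonnegative values and normalise to sum one (preprocessing)
    p = np.asarray(x, dtype=float) - np.min(x) + eps
    return p / p.sum()

def kl(p, q):
    # Kullback-Leibler divergence, Eq. (4)
    return float(np.sum(p * np.log(p / q)))

def symmetry_factor(z, s):
    # q = |log( D(z||s) / D(s||z) )|, Eq. (6); a small q means s resembles the noise z
    pz, ps = to_distribution(z), to_distribution(s)
    return abs(np.log(kl(pz, ps) / kl(ps, pz)))
```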
[Figure 1 is a block diagram: input data feed the models M1, ..., Mn, whose results X1, ..., Xn enter the BSS stage producing the basic components S1, ..., Sn; a decision system, using a reference noise extracted from the input data, selects components for remixing (BSS⁻¹) into the improved results X̂1, ..., X̂n.]
Fig. 1. Ensemble method with random noise identification and removal
6 Practical Experiment
The validation test of the proposed concept with noise detection was performed on the real problem of load prediction in the Polish power system. Our task is to forecast the hourly energy consumption in 24 hours based on the energy demand from the last 24 hours and calendar variables: month, day of the month, day of the week, and a holiday indicator. We train six MLP neural networks with one hidden layer (with 12, 18, 24, 27, 30, 33 neurons, respectively). The quality of the results is measured with the MAPE criterion for the following neural networks: M1:MLP12, M2:MLP18, M3:MLP24, M4:MLP27, M5:MLP30, M6:MLP33. For these primary models we perform their ensemble with BSS methods. Table 1 presents the final results. The best prediction improvement is obtained after eliminating the components (s4, s5, s6) for PCA and the components (s4, s6) for ICA. Fig. 2a shows the basis components after PCA decomposition. Histograms and autocorrelation functions for the PCA components are presented in Fig. 2b. In the same way, Fig. 3a and Fig. 3b present the components, their histograms and autocorrelation functions after ICA separation.

Table 1. Prediction results for primary models and after BSS
Methods           M1     M2     M3     M4     M5     M6
Primary Results   2.392  2.365  2.374  2.402  2.409  2.361
PCA               2.304  2.256  2.283  2.274  2.255  2.234
ICA               2.410  2.248  2.395  2.401  2.423  2.384
Fig. 2. a) Basis latent components after PCA; b) histograms and autocorrelation functions of the basis latent components after PCA
As we can see, the visual recognition of the noises seems to be a difficult task. In this experiment we want to examine whether it is possible to identify the destructive components with the reference noise and the divergence symmetry method. The reference noise was the residual obtained after decomposition (smoothing) of the target with a moving average (k = 168 = 7 days × 24 hours). In Fig. 4 we can observe the histogram and the autocorrelation function of the target and of the residuals from the smoothing.
Fig. 3. a) Basis latent components after ICA; b) histograms and autocorrelation functions of the basis latent components after ICA
Fig. 4. Histograms and autocorrelation functions of the target and reference noise

Table 2. Similarity factor q for PCA components and reference noises by KL and Pearson divergence

              PCA component
Noise model   S1     S2     S3     S4     S5     S6
KL            0.261  0.149  0.175  0.411  0.212  0.547
Pearson       1.889  0.579  0.430  2.008  1.178  2.667
Table 3. Similarity factor q for ICA components and reference noises by KL and Pearson divergence

              ICA component
Noise model   S1     S2     S3     S4     S5     S6
KL            0.431  0.308  0.489  0.178  0.182  0.116
Pearson       4.827  3.111  1.165  4.591  3.387  4.420
We applied the Kullback-Leibler and Pearson's divergences to measure the similarity between the signals. The symmetry is measured with the factor q. The final results are presented in Tables 2-3. A low value of the q factor indicates a high similarity of the signals. The Pearson measure is ineffective both for PCA and ICA, which can be explained by the strongly non-Gaussian signals. However, we can observe relatively good quality results both for PCA and ICA aggregations with the Kullback-Leibler divergence. The result seems reasonable, because BSS methods, and especially ICA, are addressed mainly at non-Gaussian signals.
7 Conclusions
The detection of destructive components using the divergence between a latent component and the reference noise can be applied in the context of predictive models
aggregation. The approach is based on the assumption that it is difficult to find a closed analytical form for the distributions of the destructive components. In particular, a Gaussian white noise characteristic might not be adequate for the ICA decomposition, which in general is addressed at non-Gaussian signals (except at most one). The experiments confirmed the validity of the proposed solution. However, a number of research and methodological issues are still open. The most important include the proper identification and estimation of the reference noise. Another question is the choice of divergence; for the moment the Kullback-Leibler divergence, the most popular in statistical analysis, seems adequate.
References 1. Bollerslev, T., Chou, R.Y., Kroner, K.: ARCH Modeling in Finance. Journal of Econometrics 52, 5–59 (1992) 2. Box, G.E.P., Muller, M.E., Jenkins, G.M.: Time Series Analysis Forecasting and Control, 2nd edn. Holden Day, San Francisco (1976) 3. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996) 4. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley, Chichester (2002) 5. Cichocki, A., Zdunek, R., Phan, A.-H., Amari, S.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley, Chichester (2009) 6. Hamilton, J.D.: Time series analysis. Princeton University Press, Princeton (1994) 7. Haykin, S.: Adaptive filter theory, 3rd edn. Prentice-Hall, Upper Saddle River (1996) 8. Hoeting, J., Mdigan, D., Raftery, A., Volinsky, C.: Bayesian model averaging: a tutorial. Statistical Science 14, 382–417 (1999) 9. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley, Chichester (2001) 10. MacDonough, R.N., Whalen, A.D.: Detection of signals in noise, 2nd edn. Academic Press, San Diego (1995) 11. Mandelbrot, B.: Multifractals and 1/f noise. Springer, Heidelberg (1997) 12. Nikias, C.L., Shao, M.: Signal Processing with Alpha-Stable Distributions and Applications. John Wiley and Son, Chichester (1995) 13. Peters, E.: Fractal market analysis. John Wiley and Son, Chichester (1996) 14. Popper, K.R.: The Logic of Scientific Discovery. Hutchinson, London (1959) 15. Samorodnitskij, G., Taqqu, M.S.: Stable non-Gaussian random processes: stochastic models with infinitive variance. Chapman and Hall, N.York (1994) 16. Shiryaev, A.N.: Essentials of stochastic finance: facts, models, theory. World Scientific, Singapore (1999) 17. Szupiluk, R., Wojewnik, P., Zabkowski, T.: Prediction improvement via smooth component analysis and neural network mixing. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4132, pp. 133–140. Springer, Heidelberg (2006) 18. Therrien, C.W.: Discrete Random Signals and Statistical Signal Processing. Prentice Hall, New Jersey (1992) 19. Vaseghi, S.V.: Advanced signal processing and digital noise reduction. John Wiley and Sons, Stuttgart, Chichester (1997)
Efficient Predictive Control and Set–Point Optimization Based on a Single Fuzzy Model Piotr M. Marusak Institute of Control and Computation Engineering, Warsaw University of Technology, ul. Nowowiejska 15/19, 00–665 Warszawa, Poland
[email protected] Abstract. The idea proposed in the paper consists in significant simplification of the control structure with a predictive control algorithm and a steady–state target optimization. It is done by application of only one fuzzy (nonlinear) dynamic control plant model for both: predictive control and set–point calculation. The approach exploits possibilities offered by a fuzzy model used by the predictive control algorithm. The fuzzy model is of Takagi–Sugeno type with step responses used as the local models. Such a model can be obtained relatively easy and well tuned using neural networks. The proposed approach, despite simplification of the control system, offers very good control performance. It is demonstrated using an example of a control system of a nonlinear chemical reactor with inverse response. Keywords: fuzzy control, fuzzy models, predictive control, nonlinear control, constraints.
1 Introduction
The classical, multilayer control structure in which set–point values for predictive control algorithms are generated by the optimization layer solving a nonlinear optimization problem may be inefficient when the variability of disturbances is comparable with the dynamics of the control plant; see e.g. [1,4,5,12,17,18]. This is caused by the relatively low frequency of intervention of the optimization layer, which is a result of the computational burden of the nonlinear optimization problem. This drawback of the classical control structure can be eliminated by its appropriate modification. There are, in general, three approaches to do it. The first approach consists in supplementing the classical, multilayer control structure with the steady–state target optimization (SSTO), which is a linear programming problem based on linear approximations of a nonlinear steady–state control plant model, executed as often as the predictive control algorithm, recalculating the set–point values; see e.g. [1,4,5,12,17,18]. In the second approach, the optimization problem solved by the predictive control algorithm is integrated with the set–point optimization; see e.g. [5,17,18,19,20]. The third approach consists in using the predictive set–point optimizer; see e.g. [7,16,18]. This paper is focused on the SSTO approach.
The basic version of the SSTO is based on the steady–state linear model derived from the dynamic linear model used in the predictive controller [1,4,12]. In the second version, the nonlinear steady–state process model is linearized at each iteration [5,12,17,18]. The second version of the SSTO usually gives better results than the first one. In the paper, it is shown that it can be simplified without harm to the performance of the control system. The idea of the proposed approach consists in clever usage of a dynamic fuzzy (Takagi–Sugeno) control plant model. This model can be tuned, using e.g. neural networks, to contain information about the steady–state properties of the control plant. Then, using the tuned fuzzy model, linear approximations of both the dynamic and the steady–state process models are obtained and used for both control generation and set–point optimization. Thanks to such an approach the control system is simplified without loss of performance, and may work faster compared to the case when a nonlinear steady–state model is linearized at each iteration. In the next section the idea of predictive control algorithms is briefly recalled. The Takagi–Sugeno (TS) fuzzy models used in the proposed approach are briefly described in Sect. 3. The set–point optimization problem solved in the optimization layer of the classical multilayer control structure is recalled in Sect. 4. Section 5 is dedicated to the formulation of the SSTO problem. In Sect. 6 example results of the experiments performed in a control system of a nonlinear control plant with inverse response (a CSTR with the van de Vusse reaction) are described, illustrating the efficiency of the proposed approach. The paper is summarized in the last section.
2 Predictive Control Algorithms
The Model Predictive Control (MPC) algorithms during control generation use information about the predicted behavior of the control system many sampling instants ahead. Moreover, constraints existing in the control system can be also taken into consideration. Thanks to such an approach it is possible to use all available knowledge about the process and the conditions of its operation during control action calculation. Predictive control algorithms are usually formulated as the following optimization problem [2,6,14,17]:

min_{Δu} { Σ_{j=1}^{n_y} Σ_{i=1}^{p} κ_j ( ȳ_k^j − y_{k+i|k}^j )² + Σ_{m=1}^{n_u} Σ_{i=0}^{s−1} λ_m ( Δu_{k+i|k}^m )² }    (1)

subject to:

Δu_min ≤ Δu ≤ Δu_max ,    (2)
u_min ≤ u ≤ u_max ,    (3)
y_min ≤ y ≤ y_max ,    (4)

where ȳ_k^j is a set–point value for the j-th output, y_{k+i|k}^j is a value of the j-th output for the (k+i)-th sampling instant predicted at the k-th sampling instant
using a control plant model, Δu_{k+i|k}^m are future changes in the manipulated variables, κ_j ≥ 0 and λ_m ≥ 0 are weighting coefficients for the predicted control errors of the j-th output and for the changes of the m-th manipulated variable, respectively, p and s denote the prediction and control horizons, respectively, n_y, n_u denote the number of output and manipulated variables, respectively; y is a vector of (p · n_y) elements, composed of the predicted output values y_{k+i|k}^j, Δu is a vector of (s · n_u) elements, composed of the future increments of manipulated variables Δu_{k+i|k}^m, u is a vector of (s · n_u) elements, composed of future values of manipulated variables u_{k+i|k}^m, and Δu_min, Δu_max, u_min, u_max, y_min, y_max are vectors of lower and upper bounds of changes and values of the control signals and of the values of output variables, respectively. As a solution to the optimization problem (1-4) the optimal vector of changes in the manipulated variables is obtained. From this vector, the Δu_{k|k}^m elements are applied in the control system. Then, the optimization problem is solved again at the next sampling instant. The optimization problem (1-4), in the case when the future output values y are predicted using a linear model, is a well-known, easy to solve quadratic optimization problem. Application of an algorithm based on a linear model to a nonlinear control plant may, however, give unsatisfactory results; the operation of the control system may then be improved by using a nonlinear model to predict the behavior of the control plant. Unfortunately, after direct usage of a nonlinear control plant model the problem (1-4) becomes a nonlinear, usually non-convex, optimization problem, solving of which is difficult and time consuming, without a guarantee of finding the global optimum. Taking the facts given above into consideration, one may apply approaches that consist in obtaining, at each algorithm iteration, a linear approximation of the nonlinear control plant model. Then, it is used to formulate a quadratic programming problem; see e.g. [8,9,11,17].
3 Takagi–Sugeno Models for Efficient Predictive Control and Set–Point Optimization
Obtaining a linear approximation of a dynamic control plant model is especially easy in the case when fuzzy TS models are used; see e.g. [8,10]. The approach proposed in the paper exploits a TS fuzzy model with step responses used as the local models:

Rule f:
if y_k^{j_y} is B_1^{f,j_y} and ... and y_{k−n+1}^{j_y} is B_n^{f,j_y} and u_k^{j_u} is C_1^{f,j_u} and ... and u_{k−m+1}^{j_u} is C_m^{f,j_u}
then y_k^{j,f} = Σ_{m=1}^{n_u} [ Σ_{n=1}^{p_d−1} a_n^{j,m,f} · Δu_{k−n}^m + a_{p_d}^{j,m,f} · u_{k−p_d}^m ] ,    (5)

where y_k^{j_y} is the j_y-th output variable value at the k-th sampling instant, u_k^{j_u} is the j_u-th manipulated variable value at the k-th sampling instant, B_1^{f,j_y}, ..., B_n^{f,j_y},
C_1^{f,j_u}, ..., C_m^{f,j_u} are fuzzy sets, a_n^{j,m,f} are the coefficients of the step responses in the f-th local model, j_y = 1, ..., n_y, j_u = 1, ..., n_u, f = 1, ..., l, and l is the number of rules. The output of the fuzzy model is calculated using the following formula:
y_k^j = Σ_{m=1}^{n_u} [ Σ_{n=1}^{p_d−1} a_n^{j,m} · Δu_{k−n}^m + a_{p_d}^{j,m} · u_{k−p_d}^m ] ,    (6)
where a_n^{j,m} = Σ_{f=1}^{l} w^f · a_n^{j,m,f}, and w^f are the normalized weights calculated using fuzzy reasoning for the current values of the inputs and outputs of the control plant, see e.g. [15]. The model (5) can be obtained in a relatively simple way. It is sufficient to collect a few step responses of the control plant for a few operating points. The premise part of the model can be designed using expert knowledge, simulation experiments, fuzzy neural networks or all the mentioned techniques combined. It is worth noticing that the model (6) may be interpreted as a step response describing the behavior of the control plant near the current operating point. This model may be used to predict the behavior of the control plant in the same way as in the DMC algorithm [2,6,14,17]. Such an approach leads to the formulation of the optimization problem (1-4), solved at each iteration by the algorithm, as a quadratic optimization problem [8].
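A minimal sketch of the model output (6) for a single-output, single-input case is given below. The rule weights w^f are assumed to be supplied by an external fuzzy-reasoning routine, and the array layout is an assumption of the example, not part of the paper.

```python
import numpy as np

def ts_model_output(step_responses, weights, du_hist, u_old):
    """Eq. (6) for one output and one manipulated variable.
    step_responses: l x p_d array, f-th row = [a_1^f, ..., a_{p_d}^f]
    weights:        normalized rule activations w^f (length l, summing to one)
    du_hist:        [du_{k-1}, ..., du_{k-(p_d-1)}] past control increments
    u_old:          u_{k-p_d}, the control value p_d steps back
    """
    a = weights @ step_responses            # fuzzy blending: a_n = sum_f w^f * a_n^f
    return float(a[:-1] @ du_hist + a[-1] * u_old)
```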
4 Optimization Layer
In the classical multilayer control system the desired set–points are calculated by the optimization layer. The optimization problem solved in this layer usually has the following form (having an economic meaning) [17]:

min_{y,u} J_E(y, u) = c_u^T u − c_y^T y    (7)

subject to:

u_min ≤ u ≤ u_max ,    (8)
y_min ≤ y ≤ y_max ,    (9)
y = F(u, d) ,    (10)
where F : R^{n_u} × R^{n_d} → R^{n_y}, F ∈ C¹ is a nonlinear, steady–state control plant model, c_u ∈ R^{n_u} and c_y ∈ R^{n_y} are prices, n_d is the number of disturbances affecting the control plant, y is a vector of length n_y of the set–point values, u is a vector of length n_u of control values corresponding to the set–point values y, calculated using the steady–state plant model, d is an estimate of the disturbances, u_min, u_max are vectors of lower and upper bounds of the manipulated variables, y_min, y_max are vectors of lower and upper bounds of the output values, and J_E(y, u) is a performance function. The optimal solution of the optimization problem (7-10) is passed to the control algorithms as the desired set–point values.
The problem presented above is usually nonlinear, and thus its solution is usually time consuming. Therefore, it is repeated less often than the controllers act.
5 Steady–State Target Optimization
In the case when the variability of disturbances is comparable with the dynamics of the control plant, application of the classical multilayer control structure, with a low frequency of intervention of the optimization layer, may bring results far from optimal. One of the solutions to this problem is to supplement the control structure with a steady–state target optimization (SSTO) [1,4,5,12,17]. The first version of the SSTO was based on a linear approximation of the steady–state process model obtained from the linear dynamic control plant model used by the predictive control algorithm [1,4,12]:

min_{y,u} J_E(y, u) = c_u^T u − c_y^T y    (11)

subject to:

u_min ≤ u ≤ u_max ,    (12)
y_min ≤ y ≤ y_max ,    (13)
y = H u + C(k) ,    (14)
where C(k) = y⁰(k+N|k) − H u(k−1) and y⁰(k+N|k) is the value of the predicted free output trajectory at the end of the prediction horizon. It should be noticed that the matrix H, in the linear approximation of the steady–state process model (14), is constant in time. It is thus intuitive that for highly nonlinear processes the performance of the control system may be improved using a linear approximation of the steady–state process model obtained by linearization of the original, nonlinear steady–state process model [5,17]:

y = H(k) u + C(k) ,    (15)

where the matrix H(k) = [ ∂F(u,d)/∂u_1 ... ∂F(u,d)/∂u_{n_u} ] is updated at each SSTO iteration, and C(k) = F(u(k−1), d) − H(k) u(k−1). The derivatives being elements of the matrix H(k) are usually computed numerically. The other version of SSTO is proposed in the paper. It is based on a linear approximation of the steady–state process model obtained at each iteration from a dynamic fuzzy TS model of the control plant. Using this approach the numerical linearization (calculation of the derivatives) of the nonlinear steady–state process model is avoided. Thanks to the usage of the fuzzy model with local models in the form of step responses, derivation of the H(k) matrix is especially easy. It is because
the elements of the H(k) matrix are the last elements of the step responses, a_{p_d}^{j,m}, which are in fact the gain coefficients of the control plant. The matrix of gains is described by the following formula:

H(k) = [ a_{p_d}^{1,1} ... a_{p_d}^{1,n_u+n_d} ; ... ; a_{p_d}^{n_y,1} ... a_{p_d}^{n_y,n_u+n_d} ] .    (16)

The matrix H(k) is obtained as a result of the calculation of a linear approximation of the dynamic process model for the predictive control algorithm. In order to obtain proper values of the gains (a good approximation of the steady–state properties of the process), the dynamic fuzzy model should be tuned as well as possible, using e.g. fuzzy neural networks, as was done in the current research and discussed in the next section.
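To make the mechanics concrete, the sketch below solves one SSTO iteration (11)-(14) as a linear program once the gain matrix H(k) has been assembled from the last step-response coefficients. It is a simplified illustration, not the paper's implementation: measured disturbances are omitted from H(k), and the use of scipy.optimize.linprog, the variable names, and the handling of C(k) as a fixed vector are assumptions of the example.

```python
import numpy as np
from scipy.optimize import linprog

def ssto_step(H, c_u, c_y, C_k, u_min, u_max, y_min, y_max):
    """One SSTO iteration, problem (11)-(14), with H(k) built from the last
    step-response coefficients a_{p_d}^{j,m} of the fuzzy model (cf. Eq. (16))."""
    H = np.asarray(H)
    # objective c_u^T u - c_y^T y with y = H u + C(k); the constant term is dropped
    cost = c_u - H.T @ c_y
    # output constraints y_min <= H u + C(k) <= y_max rewritten as A_ub u <= b_ub
    A_ub = np.vstack([H, -H])
    b_ub = np.concatenate([y_max - C_k, C_k - y_min])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=list(zip(u_min, u_max)))
    return res.x, H @ res.x + C_k     # optimal inputs and corresponding set-points
```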
6 Simulation Experiments
The control plant under consideration is an isothermal continuous stirred tank reactor (CSTR) in which a van de Vusse reaction is carried out (Fig. 1a) [3]. Steady–state characteristics of the control plant are shown in Fig. 1b. The process model of the reactor contains two composition balance equations:

dC_A/dt = −k_1 · C_A − k_3 · C_A² + (F/V)(C_Af − C_A) ,
dC_B/dt = k_1 · C_A − k_2 · C_B − (F/V) · C_B ,    (17)
where CA , CB are the concentrations of components A and B, respectively, F is the inlet flow rate (equal to the outlet flow rate), V is the volume in which the reaction takes place (it is assumed constant and V = 1 l), CAf is the concentration
Fig. 1. a) Diagram of an isothermal CSTR with van de Vusse reaction; b) Steady–state characteristics of the control plant
of component A in the inlet flow stream (it is assumed that C_Af = 10 mol/l). The values of the parameters are: k_1 = 50 1/h, k_2 = 100 1/h, k_3 = 10 l/(h·mol). The output variable is the concentration C_B of substance B, the manipulated variable is the inlet flow rate F of the raw substance, and the concentration C_Af is the disturbance variable; it is assumed that it changes according to the formula:

C_Af = C_Af0 − sin( 2π t / 0.4 ) ,    (18)

where C_Af0 = 10 mol/l. The following performance index of the set–point optimization problem was assumed:

J_E = −F .    (19)

The manipulated variable is constrained:

0 l/h ≤ F ≤ 60 l/h .    (20)

It was also assumed that the product should fulfill the following purity criteria:

1.1 mol/l ≤ C_B ≤ 1.2 mol/l .    (21)
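For readers who wish to reproduce the plant behavior, the balances (17) with the disturbance (18) can be integrated with a simple explicit Euler scheme. The step size, initial conditions and the assumption that time is measured in hours are choices of this sketch, not values given in the paper.

```python
import numpy as np

def simulate_cstr(F_profile, dt=0.001, CA0=3.0, CB0=1.12,
                  k1=50.0, k2=100.0, k3=10.0, V=1.0, CAf0=10.0):
    """Euler integration of the van de Vusse balances (17) with the sinusoidal
    disturbance (18); F_profile gives the inlet flow rate F [l/h] at every step."""
    CA, CB = CA0, CB0
    out = []
    for k, F in enumerate(F_profile):
        t = k * dt
        CAf = CAf0 - np.sin(2.0 * np.pi * t / 0.4)             # disturbance, Eq. (18)
        dCA = -k1 * CA - k3 * CA ** 2 + (F / V) * (CAf - CA)    # Eq. (17)
        dCB = k1 * CA - k2 * CB - (F / V) * CB
        CA, CB = CA + dt * dCA, CB + dt * dCB
        out.append(CB)
    return np.array(out)
```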
For the presented control plant a fuzzy predictive algorithm was designed. The sampling period was assumed equal to Ts = 3.6 s; the tuning parameters were as follows: p = 70, s = 35, λ = 0.001. The fuzzy TS process model is based on step responses obtained near the following operating points: P1) C_B0 = 0.91 mol/l, C_A0 = 2.18 mol/l, F = 20 l/h; P2) C_B0 = 1.12 mol/l, C_A0 = 3 mol/l, F = 34.3 l/h; P3) C_B0 = 1.22 mol/l, C_A0 = 3.66 mol/l, F = 50 l/h. The fuzzy TS model was tuned using a fuzzy neural network (FNN). It was done in such a way that the obtained dynamic fuzzy model can also be used to obtain a good linear approximation of the nonlinear steady–state process model. The premises were modeled in a standard way; see e.g. [13]. The consequents were simply the gain coefficients of the obtained step responses. The structure of the applied FNN is thus very simple. The idea is to tune the membership functions taking into consideration the steady–state characteristic of the control plant. The value of the performance index J = (y_FNN − y_SSM)^T · (y_FNN − y_SSM) (where y_FNN and y_SSM are vectors of 400 elements containing the output values obtained from the FNN and from the steady–state process model, respectively), calculated after the tuning, was equal to J = 0.1456. The obtained, normalized membership functions are shown in Fig. 2. The experiments were performed in three control systems: 1) with the MPC algorithm and SSTO based on a linear model (LSSTO), 2) with the fuzzy model used in the MPC algorithm and with successive nonlinear steady–state model linearization (NSSTO),
3) with the fuzzy TS model used in both the MPC algorithm and SSTO, exploiting the approach proposed in the paper (FSSTO). The obtained results are shown in Figs. 3 and 4. Usage of the steady–state model approximation obtained from fuzzy dynamic control plant model brought results very close to those obtained with linearization of the nonlinear steady–state process model. The difference between responses generated by the control systems with NSSTO and FSSTO is small. Values of the performance index (calculated as the sum of temporary values of the assumed economic performance index) obtained in these two control systems are equal to −9582.71 and −9589.99, respectively (the smaller the value – the better). Moreover, calculation of set–points in the structure based only on the fuzzy process model is significantly simplified. The control system with LSSTO (the standard SSTO) and the MPC algorithm based on a linear model offers
Fig. 2. Normalized membership functions of the fuzzy model
Fig. 3. Results of an example experiment obtained in the control system with: LSSTO – dotted line, NSSTO – dashed line, FSSTO – solid line; left – output variable CB , right – manipulated variable F
Fig. 4. Results of an example experiment obtained in the control system with: LSSTO – dotted line, NSSTO – dashed line, FSSTO – solid line; left – output variable reference signals C B , right – disturbance variable CAf
the worst control performance (the value of the performance index is equal to −9404.17 in this case). Thus, usage of the well tuned fuzzy model is very important for the performance of the control system.
7 Summary
The problem of simplification of control systems with SSTO was addressed in the paper. The proposed solution consists in using the dynamic fuzzy control plant model for both control generation and set–point optimization. The linear approximation of the steady–state process model is obtained at each iteration using the dynamic fuzzy TS control plant model. Using the example control system of a control plant with difficult dynamics it is shown that the proposed approach may generate solutions very close to those obtained in control systems in which, at each SSTO iteration, linearization of the original nonlinear steady–state process model is performed. In the proposed approach the fuzzy Takagi–Sugeno models with step responses used as the local models are exploited. Such models can be obtained relatively easily and tuned using fuzzy neural networks, offering the possibility of obtaining a linear approximation of the steady–state process model in a very easy way. Thus, the proposed approach may be relatively easily applied in existing control systems to improve their performance. Acknowledgment. This work was supported by the Polish national budget funds for science 2009-2011 as a research project.
References 1. Blevins, T.L., McMillan, G.K., Wojsznis, W.K., Brown, M.W.: Advanced Control Unleashed. ISA (2003) 2. Camacho, E.F., Bordons, C.: Model Predictive Control. Springer, Heidelberg (1999)
3. Doyle, F., Ogunnaike, B.A., Pearson, R.K.: Nonlinear model–based control using second–order Volterra models. Automatica 31, 697–714 (1995) 4. Kassmann, D.E., Badgwell, T.A., Hawkins, R.B.: Robust Steady-State Target Calculation for Model Predictive Control. AIChE Journal 46, 1007–1024 (2000) 5. Lawrynczuk, M., Marusak, P., Tatjewski, P.: Cooperation of model predictive control with steady–state economic optimisation. Control and Cybernetics 37, 133–158 (2008) 6. Maciejowski, J.M.: Predictive control with constraints. Prentice Hall, Harlow (2002) 7. Marusak, P., Tatjewski, P.: Actuator fault tolerance in control systems with predictive constrained set-point optimizers. International Journal of Applied Mathematics and Computer Science 18, 539–551 (2008) 8. Marusak, P.: Advantages of an easy to design fuzzy predictive algorithm in control systems of nonlinear chemical reactors. Applied Soft Computing 9, 1111–1125 (2009) 9. Marusak, P., Tatjewski, P.: Effective dual–mode fuzzy DMC algorithms with online quadratic optimization and guaranteed stability. International Journal of Applied Mathematics and Computer Science 19, 127–141 (2009) 10. Marusak, P.: Efficient model predictive control algorithm with fuzzy approximations of nonlinear models. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 448–457. Springer, Heidelberg (2009) 11. Morari, M., Lee, J.H.: Model predictive control: past, present and future. Computers and Chemical Engineering 23, 667–682 (1999) 12. Qin, S.J., Badgwell, T.: A survey of industrial model predictive control technology. Control Engineering Practice 11, 733–764 (2003) 13. Piegat, A.: Fuzzy modelling and control. Physica-Verlag, Heidelberg (2001) 14. Rossiter, J.A.: Model–Based Predictive Control. CRC Press, Boca Raton (2003) 15. Takagi, T., Sugeno, M.: Fuzzy identification of systems and its application to modeling and control. IEEE Trans. Systems, Man and Cybernetics 15, 116–132 (1985) 16. Saez, D., Cipriano, A., Ordys, A.W.: Optimisation of Industrial Processes at Supervisory Level: Application to Control of Thermal Power Plants. Springer, London (2002) 17. Tatjewski, P.: Advanced Control of Industrial Processes; Structures and Algorithms. Springer, London (2007) 18. Tatjewski, P.: Supervisory predictive control and on–line set–point optimization. International Journal of Applied Mathematics and Computer Science 20, 483–495 (2010) 19. Zanin, A., Tvrzska de Gouvea, M., Odloak, D.: Industrial implementation of a real–time optimization strategy for maximizing production of LPG in a FCC unit. Computers and Chemical Engineering 24, 525–531 (2000) 20. Zanin, A., Tvrzska de Gouvea, M., Odloak, D.: Integrating real–time optimization into model predictive controller of the FCC system. Computers and Chemical Engineering 10, 819–831 (2002)
Wind Turbines States Classification by a Fuzzy-ART Neural Network with a Stereographic Projection as a Signal Normalization

Tomasz Barszcz¹, Marzena Bielecka², Andrzej Bielecki³, and Mateusz Wójcik⁴
¹ Chair of Robotics and Mechatronics, Faculty of Mechanical Engineering and Robotics, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland
² Chair of Geoinformatics and Applied Computer Science, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland
³ Institute of Computer Science, Faculty of Mathematics and Computer Science, Jagiellonian University, Nawojki 11, 30-072 Kraków, Poland
⁴ Department of Computer Design and Graphics, Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Reymonta 4, 30-059 Kraków, Poland
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. In this paper the classification of wind turbine operational states is considered. A fuzzy-ART neural network is proposed as the classifying system. The application of the stereographic projection as an input signal normalization procedure is introduced. Both the theoretical justification and the results of experiments are presented. It turns out that the introduced normalization procedure improves the classification results. Keywords: wind turbines operational states classification, fuzzy-ART neural network, input signals normalization, stereographic projection.
1 Introduction
In recent years wind energy has been the fastest growing branch of the power generation industry, not only in the world [7] but also in the European Union [1], including Poland [13]. The largest cost for a wind turbine is its maintenance.
The paper was supported by the Polish Ministry of Science and Higher Education under Grant No. N504 147838.
A common technique to decrease this cost is remote monitoring [11]. The growing number of monitored turbines requires an automated way of supporting diagnostic experts. Classification of wind turbine operational states is the basic stage of analyzing data obtained from monitoring. Very often, the data describing the features taken into consideration by a classification algorithm must be preprocessed. Input signal normalization is an example of such preprocessing. Classification algorithms are nontrivial in the case when the number of classes is unknown a priori. In such a case, certain types of artificial neural networks (ANNs) can be applied to solve the classification problem. In ART-type neural networks, introduced by Carpenter and Grossberg [4,5], the learning process is not separated from operation. Furthermore, this sort of ANN is capable of adding new states when necessary. Therefore, they were tested as a tool for classification of operational states in continuous monitoring of wind turbines [3]. This paper is a continuation of the studies presented there, where the obtained results were promising but not fully satisfactory. It turns out that input signal normalization based on the stereographic projection, proposed in [2], improves the results obtained in [3]. In this paper we consider the possibility of applying a fuzzy-ART artificial neural network to find clusters in data describing wind turbine operational states. This type of ANN was proposed by Carpenter, Grossberg and Rosen [6], and was then intensively studied [10,12], including its clustering capabilities [8]. The stereographic projection was used by us as the input signal preprocessing. The paper is organized in the following way. In Section 2 the classification problem of wind turbine states is discussed. The architecture of the fuzzy ART neural network is recalled in Section 3, whereas in Section 4 the stereographic projection is presented. Results of experiments are presented in Section 5.
2 Classification Problem of Wind Turbine States
In recent years a large development of monitoring and diagnostic technologies for wind turbines has taken place [11]. The growing number of installed systems created the need for the analysis of gigabytes of data created every day by these systems. Apart from the development of several advanced diagnostic methods for this type of machinery, there is a need for a group of methods which will act as an "early warning". The idea of this approach could be based on a data-driven algorithm which would decide on the similarity of current data to data which are already known. In other words, the data from the turbine should be assigned to one of the known states. If this is a state describing a failure, the human expert should be alarmed. If this is an unknown state, the expert should be informed about the situation and asked for a definition of such a new state. It seems that there are no works, apart from ours [3], which consider the application of ANNs to the classification of wind turbine states. As ART neural networks are capable of performing efficient classification and of recognizing new states when necessary [4,5], we performed research on an initial
classification task. The goal of the experiment was verification of the ART classification capabilities in comparison to a human expert. This type of data is acquired in the majority of cases, and a successful classifier should create a reasonable number of classes, similar to those created by a human expert. According to the results presented here - see Section 5 - the input signal transformation using the stereographic projection improves the results of classification. Having an accurate classification, it is thus possible to filter out states which are known to be correct. The expert can then focus only on "suspicious" states returned by the algorithm.
3 Fuzzy ART Network
The organization of a fuzzy ART network, introduced by Carpenter et al. [6], is presented in Fig. 1. In comparison with a classical ART-2 network, the fuzzy ART network has an additional layer F0, which transforms input vectors using so-called complement coding. The fuzzy ART network has a single weight matrix Z, which processes signals sent both from the F1 layer to the F2 layer and vice versa. The operations performed by the network are based on fuzzy logic; the fuzzy AND operator is used to compare two vectors. Signal processing in a fuzzy ART network is similar to processing in an ART-2 network. The input signal vector, say X, is transformed by the F0 and F1 layers, producing signals Tj which are fed to the inputs of the F2 layer. For the neuron which is excited most strongly, a vigilance parameter ρ is used to check the similarity between the output pattern from the F2 layer and the input pattern. The vigilance parameter has a considerable influence on the system: higher vigilance produces highly detailed memories (many fine-grained categories), while lower vigilance results in more general memories (fewer, more general categories). If the degree of similarity p between the current input pattern and the best fitting prototype is at least as high as the vigilance, this prototype is chosen to represent the cluster containing the input. This means that if p < ρ then the winning neuron is inhibited and another neuron in the F2 layer is searched for. Otherwise, the weight matrix Z is modified in order to store the features of the recognized vector. The learning process is continued until the values of the matrix Z stabilize.
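A minimal sketch may make this processing concrete. The code below is only an illustration of the operations described above (complement coding, the fuzzy AND as a component-wise minimum, the vigilance test with ρ, and fast learning of the weight matrix Z); the class name, the choice parameter and the learning-rate convention are our own assumptions and not taken from the paper.

    import numpy as np

    def complement_code(x):
        # F0 layer: append the complement 1 - x to the input vector
        return np.concatenate([x, 1.0 - x])

    class FuzzyART:
        def __init__(self, dim, rho=0.7, alpha=0.001, beta=1.0):
            self.rho, self.alpha, self.beta = rho, alpha, beta
            self.Z = np.empty((0, 2 * dim))      # one prototype (row) per category

        def train_sample(self, x):
            I = complement_code(np.asarray(x, dtype=float))
            if self.Z.shape[0] == 0:
                self.Z = I[np.newaxis, :].copy()
                return 0
            and_IZ = np.minimum(I, self.Z)       # fuzzy AND with every prototype
            T = and_IZ.sum(axis=1) / (self.alpha + self.Z.sum(axis=1))
            for j in np.argsort(-T):             # search F2 neurons, strongest first
                if and_IZ[j].sum() / I.sum() >= self.rho:   # vigilance test passed
                    self.Z[j] = self.beta * and_IZ[j] + (1 - self.beta) * self.Z[j]
                    return j
            self.Z = np.vstack([self.Z, I])      # no category similar enough: add one
            return self.Z.shape[0] - 1

Raising rho in this sketch reproduces the behavior described in the text: more categories are created for the same data.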
4 Input Signals Normalization
A normalization procedure corresponds to finding a mapping F : R^n ⊃ A ∋ x → x̂ ∈ R^k, where ||x̂|| = 1. The most commonly used normalization is done according to the formula x̂ = x / ||x||. This formula defines the projection Π : R^n \ {0} → S^{n−1} ⊂ R^n,
Fig. 1. Fuzzy ART architecture
Fig. 2. Simple projection of R2
call it a simple projection, of R^n \ {0} onto the (n − 1)-dimensional sphere S^{n−1}; see Fig. 2 for n = 2. A simple projection has crucial drawbacks. First of all, the dimension is reduced. Secondly, the projection is not defined on the whole space: the mapping is undefined for 0. Furthermore, the space R^n, having an infinite measure, is projected onto a sphere having a finite measure. Additionally, the projection is not an injective mapping: if two points, say u and w, lie on the same radial line, then Π(u) = Π(w); see Fig. 2. Referring to the considered problem, this means that if two data clusters are situated along the same radial direction then, after normalization, they cannot be separated even if they were well separated before normalization. Therefore, this method should be used only in cases where it is known a priori that the clusters in the input signal space are situated in different radial directions.
Fig. 3. Stereographic projection of R2
Therefore, a normalization which does not reduce the dimension of the input signal space is sometimes applied. The stereographic projection S : R^n → S^n ⊂ R^{n+1} is an example of such a mapping. A geometric interpretation of the stereographic projection is visualized in Fig. 3 for the two-dimensional case. The stereographic projection is given explicitly by algebraic formulae for each natural n (see [9], page 73). Let P = (x_1, ..., x_n). Then S(P) = P̃ = (x̃_1, ..., x̃_{n+1}) is given by x̃_i = 4x_i / (4 + s) for i = 1, ..., n and x̃_{n+1} = (s − 4) / (4 + s), where s := Σ_{i=1}^{n} x_i². As has already been mentioned, the stereographic projection preserves the dimension of the transformed space and is defined on the whole R^n. Furthermore, it is an injective mapping, i.e. if u ≠ v, u, v ∈ R^n, then S(u) ≠ S(v). However, it transforms a space having infinite measure into a space having finite measure. This implies, among other things, that points which are far from each other in R^n can be close to each other on S^n. Therefore, two clusters which are well separated in R^n may be hardly separable after normalization. However, such a case can only take place if the clusters are far from the origin of the coordinate system, since they are then transformed near to the north pole of the sphere. Since, in practice, the norms of the transformed vectors are limited, the minimal distance between clusters after signal normalization can be estimated.
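A direct implementation of the above formulae is straightforward. The sketch below is an illustration only; the function name is ours. The resulting vector has unit norm, so it indeed lies on S^n, and for inputs far from the origin it approaches the north pole (0, ..., 0, 1), as discussed above.

    import numpy as np

    def stereographic(x):
        # Maps a point x in R^n onto the sphere S^n in R^(n+1):
        # x~_i = 4*x_i / (4 + s),  x~_(n+1) = (s - 4) / (4 + s),  s = sum_i x_i^2
        x = np.asarray(x, dtype=float)
        s = np.dot(x, x)
        return np.append(4.0 * x / (4.0 + s), (s - 4.0) / (4.0 + s))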
5 Results
The practical case study was performed on data from one wind turbine. The data, covering the period from 11.09.2009 to 30.09.2009, were recorded every 10 minutes by the online monitoring system. The recorded data were current values and were not averaged. The data set included 2869 measurements. As the main goal of the work was to test the applicability of ART-type ANNs for classification, in the first step we tried to use the network to achieve results similar to a human
expert. Other training was not possible, as this type of network performs only unsupervised learning. The data set contained the most fundamental values deciding about the operational state of the machine. These were: wind speed (variable x1), rotational speed of the generator (variable x2) and power generated by the turbine (variable x3, the vertical axis in Figures 4-7). They are related, but only to some extent, and in fact they are all independent variables. The selection of variables is the same as a human expert would use. Typically the operation of a wind turbine can be divided into a few distinct states: stopped, transient between 0 and 1000 rotations per minute (rpm), idle load (about 1000 rpm, no load), low power, and high power. Sometimes it is not necessary to distinguish all of them, and the first two pairs are sometimes regarded as only two states (i.e. "stopped or transient" and "low power", the latter including also the idle mode). A very important advantage of the chosen set is that it has only 3 variables and can be presented graphically. Thus, it can be easily understood and compared with a human expert. The results create the basis and give some intuition for more advanced research. The main idea of the research was to apply the recorded data to the fuzzy-ART network and to investigate its behavior, i.e. how many states will be created and how this depends on the network parameters. In all experiments the vigilance parameter ρ belongs to the interval [0, 1]. As has already been mentioned, this paper is a continuation of the studies described in [3], where ART-2 and fuzzy-ART neural networks were used for classification of wind turbine operational states. Examples of clustering performed by the fuzzy-ART network for vigilance parameters ρ = 0.55 and ρ = 0.7, without transforming the input signals using the stereographic projection, are shown in Fig. 4 and Fig. 5, respectively. The clustering is roughly correct, but essential drawbacks can be observed. For a low value of the vigilance parameter (ρ = 0.55) three classes were created (the fourth one, marked by diamonds, consists of a small number of points, so it can be neglected). The low power and high power states are classified as a single class (marked by "x" in Fig. 4). The idle load class (marked by squares in Fig. 4) also contains the transient state and, partially, the stopped state. Thus it seems that the vigilance parameter should be increased. However, increasing the vigilance causes the creation of a fake class, marked by black "snowflakes" in Fig. 5. Furthermore, the classes representing higher and higher powers (squares, reverse triangles and grey stars in Fig. 5) are not separated clearly. Moreover, the higher the vigilance, the greater the number of fake classes. The result of classification by a fuzzy-ART network with the stereographic projection used as a preprocessing procedure, but without signal scaling, is shown in Fig. 6. Although the vigilance parameter was equal to 0.99, which is extremely high, only two classes have been found. It is obvious that the classification is not correct, because only a small part of the data representing the stopped cluster has been separated as a distinct class, whereas all other data have been assigned to the common class. The fact that the data points from R^3 have been projected near to the north pole of the sphere S^3 is the reason that, after projection, the clusters could not be separated even though the vigilance parameter was extremely
Wind Turbines States Classification by a Fuzzy-ART Neural Network
231
Fig. 4. Input data clustering by the fuzzy-ART neural network without stereographic projection, ρ = 0.55
Fig. 5. Input data clustering by the fuzzy-ART neural network without stereographic projection, ρ = 0.7
high (see the discussion at the end of Section 4). This means that the input signals should be scaled before the stereographic transformation in order to reduce their distance from the origin of the coordinate system. Thus, the input vector components were scaled using the transformation [x1, x2, x3] → [x̃1, x̃2, x̃3] = [x1/x1MAX, x2/x2MAX, x3/x3MAX]. Then x̃i ∈ [0, 1], because xi ≥ 0 and xiMAX > 0, where i ∈ {1, 2, 3}. The result of classification by a fuzzy-ART network with the input signals transformed using the stereographic projection and with the signal scaling is shown in Fig. 7. The
Fig. 6. Input data clustering by the fuzzy-ART neural network with stereographic projection, without input signal scaling
Fig. 7. Input data clustering by the fuzzy-ART neural network with stereographic projection and input signal scaling, ρ = 0.85
vigilance parameter ρ is equal to 0.85, which is high. It can be observed that if scaling and the stereographic projection of the input signals are applied, the vigilance parameter can be increased without the appearance of fake classes. The stopped class (grey points), idle load class (circles), low power class (black "snowflakes"), middle power class (squares) and high power class (stars) are identified correctly and clearly. An interesting case is another class, marked by triangles, describing strong gusts of wind while the turbine is stopped. It is very important, because (apart from a small number of cases, when the turbine controller
was not yet able to respond to the wind gust) it signals a lost opportunity for energy production. This information is very important to the turbine operator. The only drawback of the proposed classification method is that the transition between the stopped state and the idle load state has not been detected as a separate cluster. However, it should be stressed that this is an extremely difficult task, because the density of data points in this class is far lower than in the other ones. Moreover, these states are very short, do not contribute to the diagnostics, and are typically discarded by monitoring systems. It is worth mentioning that this behavior is very different from the monitoring of large conventional power generators, where this operational state is the source of very important diagnostic information.
6 Concluding Remarks
The presented results belong to a broader research activity aimed at automatic monitoring of rotating machinery. We are interested in investigating several approaches which can be applied in engineering practice. Thus, one has to assume that learning sets are not available or cover only a part of the machine states. The problem then becomes largely one of classifying the current state into one of the previously known states or detecting a new state. Ideally, such a new state should be included in further classifications. It was shown that ART-2 networks, both classical and fuzzy, were capable of classifying typical states of a wind turbine roughly correctly [3]. The results presented in this paper show that scaling the input signals and transforming them using the stereographic projection significantly improves the data clustering. The fuzzy-ART neural network with the mentioned input signal preprocessing properly allocated the classes corresponding to the stopped state, the idle load state, and the low, middle and high generated power states. Moreover, the state corresponding to strong wind gusts while the turbine is stopped was distinguished as well. Such a classification is practically identical to the one done by a human expert.
References 1. Banakar, H., Ooi, B.T.: Clustering of wind farms and its sizing impact. IEEE Transactions on Energy Conversion 24, 935–942 (2009) 2. Bielecki, A., Bielecka, M., Chmielowiec, A.: Input signals normalization in Kohonen neural networks. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 3–10. Springer, Heidelberg (2008) 3. Barszcz, T., Bielecki, A., Wójcik, M.: ART-type artificial neural networks applications for classification of operational states in wind turbines. In: Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2010. LNCS (LNAI), vol. 6114, pp. 11–18. Springer, Heidelberg (2010) 4. Carpenter, G.A., Grossberg, S.: A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing 37, 54–115 (1987)
5. Carpenter, G.A., Grossberg, S.: ART2: self-organization of stable category recognition codes for analog input pattern. Applied Optics 26, 4919–4930 (1987) 6. Carpenter, G.A., Grossberg, S., Rosen, D.B.: Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks 4, 759–771 (1991) 7. Ezio, S., Claudio, C.: Exploitation of wind as an energy source to meet the worlds electricity demand. Wind Engineering 74-76, 375–387 (1998) 8. Frank, T., Kraiss, K.F., Kuhlen, T.: Comparative analysis of fuzzy ART and ART2A network clustering performance. IEEE Transactions on Neural Networks 9, 544–559 (1998) 9. Gancarzewicz, J.: Differential Geometry, PWN, Warszawa (1987) (in Polish) 10. Georgiopoulos, M., Fernlund, H., Bebis, G., Heileman, G.L.: Order of search in fuzzy ART and fuzzy ARTMAP: effect of the choice parameter. Neural Networks 9, 1541–1559 (1996) 11. Hameeda, Z., Honga, Y.S., Choa, T.M., Ahnb, S.H., Son, C.K.: Condition monitoring and fault detection of wind turbines and related algorithms: A review. Renewable and Sustainable Energy Reviews 13, 1–39 (2009) 12. Huang, J., Georgiopoulos, M., Heileman, G.L.: Fuzzy ART properties. Neural Networks 8, 203–213 (1995) 13. Paska, J., Salek, M., Surma, T.: Current status and perspectives of renewable energy sources in Poland. Renewable and Sustainable Energy Reviews 13, 142–154 (2009)
Binding and Cross-Modal Learning in Markov Logic Networks Alen Vrečko, Danijel Skočaj, and Aleš Leonardis Faculty of Computer and Information Science, University of Ljubljana, Slovenia {alen.vrecko,danijel.skocaj,ales.leonardis}@fri.uni-lj.si
Abstract. Binding, the ability to combine two or more modal representations of the same entity into a single shared representation, is vital for every cognitive system operating in a complex environment. In order to successfully adapt to changes in a dynamic environment, the binding mechanism has to be supplemented with cross-modal learning. In this paper we define the problems of high-level binding and cross-modal learning. Based on these definitions we model a binding mechanism and a cross-modal learner in a Markov logic network and test the system on a synthetic object database. Keywords: Binding, Cross-modal learning, Graphical models, Markov logic networks, Cognitive systems.
1 Introduction
One of the most important abilities of any cognitive system operating in a real world environment is to be able to relate and merge information from different modalities. For example, when hearing a sudden, unexpected sound, humans automatically try to visually locate its source in order to relate the audio perception of the sound to the visual perception of the source. The process of combining two or more modal representations (grounded in different sensorial inputs) of the same entity into a single multimodal representation is called binding. While the term binding has many different meanings across various scientific fields, a very similar definition comes from neuroscience, where it denotes the ability of the brain to converge perceptual data processed in different brain parts and segregate it into distinct elements [2] [14]. The binding process can operate on different types and levels of cues. In the above example the direction that the human perceives the sound from is an important cue, but sometimes this is not enough. If there are several potential sound sources in the direction of the percept, the human may have to relate higher level audio and visual properties. A knowledge base that associates the higher level perceptual features across different modalities is therefore critical for a successful binding process in any cognitive system. In order to function properly in a dynamic environment, a cognitive system should also be able to learn and adapt in a continuous, open-ended manner.
The ability to update the cross-modal knowledge base online, i.e. cross-modal learning, is therefore vital for any kind of binding process in such an environment. Many of the past attempts at binding information within cognitive systems were restricted to associating linguistic information to lower level perceptual information. Roy et al. tried to ground the linguistic descriptions of objects and actions in visual and sound perceptions and to generate descriptions of previously unseen scenes based on the accumulated knowledge [12] [13]. This is essentially the symbol grounding problem first defined by Harnad [6]. Chella et al. proposed a three-layered cognitive architecture around the visual system, with the middle, conceptual layer bridging the gap between the linguistic and sub-symbolic (visual) layers [4]. Related problems were also often addressed by Steels [15]. Jacobsson et al. approached the binding problem in a more general way [8] [7], developing a cross-modal binding system that could form associations between multiple modalities and could be part of a wider cognitive architecture. The cross-modal knowledge was represented as a set of binary functions comparing binding attributes in a pair-wise fashion. A cognitive architecture using this system for linguistic reference resolution was presented in [16]. This system was capable of learning visual concepts in interaction with a human tutor. Recently, a probabilistic binding system was developed within the same group that encodes cross-modal knowledge into a Bayesian graphical model [17]. The need for a more flexible, but still probabilistic, representation of cross-modal knowledge led our research efforts in the direction of Markov graphical models, as presented in this paper. In the next section we define the problems of cross-modal learning and binding. In Section 3 we first briefly describe the basics of Markov logic networks (MLNs) and then describe our binding and cross-modal learning model, which is based on MLNs. Section 4 describes the experiments performed on the prototype system. We end the paper with the conclusion and future work.
2 The Problem Definition
The main idea of cross-modal learning is to use successful bindings of modal percepts as learning samples for the cross-modal learner. The improved cross-modal knowledge thus enhances the power of the binding process, which is then able to bind together new combinations of percepts, i.e. new learning samples for the learner. For example, if a cognitive system is currently capable of binding an utterance describing something blue and round to a perceived blue ball only by color association, this particular instance of binding could teach the system to also associate the visual shape of the ball with the linguistic concept of roundness. We see that, at least on this level, the binding process depends on the ability to associate between modal features (in this example the visual concepts of blue and shape are features of the visual modality, while the linguistic concepts of blue and ball belong to the linguistic modality). We assume an open world in terms of modal features (new features can be added, old ones retracted). The cross-modal learner starts with just some basic
prior knowledge of how to associate between a few basic features, which is then gradually expanded to other features and to the new ones that are created. The high-level cross-modal learning problem is closely related to the association rule learning problem in data mining, which was first defined by Agrawal et al. [1]. Therefore, we will base our learning problem definition on Agrawal's definition and expand it with the notion of modality. We have a set of n binary attributes called features F = {f1, f2, ..., fn} and a set of rules called a knowledge database K = {t1, t2, ..., tm}. A rule is defined as an implication over two subsets of features: ti : X ⇒ Y
(1)
where X, Y ⊆ F and X ∩ Y = ∅. The features represent various higher level modal properties based on the sensorial input. We introduce the notion of modality to our definition, restricting each feature to exactly one modality:
M1 = {f11, f12, ..., f1n1}
M2 = {f21, f22, ..., f2n2}
...
Mk = {fk1, fk2, ..., fknk}
(2)
F = M1 ∪ M2 ∪ ... ∪ Mk. We modify the rule-making restrictions of (1) accordingly:
1. N = Mm1 ∪ Mm2 ∪ ... ∪ Mmr, m1, ..., mr ∈ {1, 2, ..., k}, r < k
2. Y ⊆ N
3. X ⊆ F \ N
(3)
Next, we define the binding problem. Percepts are collections of features from a single modality. A percept acts as a modal representation of a perceived entity. Let P be the set of current percepts, i.e. the percept configuration: P = {P1, P2, ..., Pn}, Pi ⊆ Mj.
(4)
Percept unions are collections of percepts from different modalities. A percept union acts as a shared representation of a perceived entity, grounded through its percepts in different modalities. Given a percept configuration P, U(P) is the set of current percept unions, i.e. the union configuration: U(P) = {U1, U2, ..., Um}, Ui ⊆ P.
(5)
The binding process is then defined as a mapping between a percept configuration and one of the possible union configurations: β : P → U(P),
(6)
238
A. Vrečko, D. Skočaj, and A. Leonardis
where the following restrictions apply:
1. U1 ∪ U2 ∪ ... ∪ Um = P
2. ∀Ui, Uj ∈ U(P), i ≠ j : Ui ∩ Uj = ∅
3. ∀Pi, Pj ∈ Uk, i ≠ j : Pi ⊆ Ml ∧ Pj ⊆ Mm ⇒ l ≠ m.
(7)
The first two restrictions assign each percept in the configuration to exactly one union, while the third restricts the number of percepts per modality in a union to one. To make the binding process plausible, we also introduce a measure of confidence in a union configuration based on the knowledge K: the binding confidence bconfK(U). Given the percept configuration P and the current knowledge base K, the task of the binding process is to find the optimal union configuration:

Uopt(P) = argmax_{U(P)} bconfK(U(P)).   (8)
In this sense, i.e. considering bconfK(U) as a predictor based on K, we can consider high-level cross-modal learning as a regression problem. Therefore, the aim of the cross-modal learner is to maintain and improve the cross-modal knowledge base, thus providing an increasingly reliable measure of binding confidence.
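For small percept configurations the optimal union configuration in (8) can be illustrated by exhaustive search over all partitions that satisfy the restrictions in (7). The sketch below is only meant to make the definitions concrete; the pair_score function stands in for the knowledge base K and is an assumption of ours, as are all names.

    from itertools import combinations

    def partitions(items):
        # Enumerate all set partitions of a list of percept ids.
        if not items:
            yield []
            return
        first, rest = items[0], items[1:]
        for smaller in partitions(rest):
            for i, block in enumerate(smaller):
                yield smaller[:i] + [[first] + block] + smaller[i + 1:]
            yield [[first]] + smaller

    def best_union_configuration(percepts, modality, pair_score):
        # modality[p]: modality of percept p; pair_score(p, q): assumed confidence
        def valid(config):
            # restriction 3: at most one percept per modality in each union
            return all(len({modality[p] for p in b}) == len(b) for b in config)
        def confidence(config):
            return sum(pair_score(p, q) for b in config for p, q in combinations(b, 2))
        return max((c for c in partitions(percepts) if valid(c)), key=confidence)

Restrictions 1 and 2 are satisfied automatically because every candidate is a partition of the percept set.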
3 Implementation in MLN
Markov logic networks [10] combine first-order logic and probabilistic graphical models in a single representation. (We used Alchemy [9], a software package providing various inference and learning algorithms based on MLNs, to implement the prototype of our cross-modal binding and cross-modal learning mechanisms.) An MLN knowledge base consists of a set of first-order logic formulae (rules) with a weight attached:

weight   first-order logic formula

The weight is a real number which determines how strong a constraint each rule is: the higher the weight, the less likely the world is to violate that rule. Together with a finite set of constants, the MLN defines a Markov network (MN) (or Markov random field). An MN is an undirected graph where each possible grounding of a predicate (all predicate variables replaced with constants) represents a node, while the rules define the edges between the nodes. Each rule grounding defines a clique in the graph. An MLN can thus be viewed as a template for constructing the MN. The probability distribution over possible worlds x defined by an MN is given by

P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) ),   (9)
where n_i(x) is the number of true groundings of rule i, w_i is the weight of rule i, and Z is the partition function, defined as Z = Σ_x exp( Σ_i w_i n_i(x) ). Inference in an MN is a #P-complete problem [11]. Methods for approximating the inference include various Markov chain Monte Carlo sampling methods [5] and belief propagation [18].
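Equation (9) can be illustrated on a toy scale by enumerating the possible worlds explicitly. This is of course not how Alchemy performs inference (it is only feasible for a handful of ground atoms), but it shows how the weights and the counts n_i(x) determine the distribution; all names below are illustrative.

    import itertools, math

    def world_distribution(n_atoms, rules):
        # rules: list of (weight, count_fn); count_fn(world) returns n_i(x),
        # the number of true groundings of rule i in the world (tuple of booleans).
        worlds = list(itertools.product([False, True], repeat=n_atoms))
        scores = [math.exp(sum(w * count(x) for w, count in rules)) for x in worlds]
        Z = sum(scores)                      # partition function
        return {x: s / Z for x, s in zip(worlds, scores)}

    # Two ground atoms a0, a1 and one soft rule "a0 => a1" with weight 1.5:
    dist = world_distribution(2, [(1.5, lambda x: int((not x[0]) or x[1]))])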
3.1 Cross-Modal Knowledge Base
We have two types of templates for the binding rules in our cross-modal knowledge base. The template for the aggregative rule is defined as perPart(p1 , f1 ) ∧ uniPart(u, p1 ) ∧ perPart(p2 , f2 ) ⇒ uniPart(u, p2 ),
(10)
where the predicate perPart(percept, feature) denotes that the feature feature is part of the percept percept, while uniPart(union, percept) denotes that union includes percept. In a very similar manner the template for the segregative rule is defined: perPart(p1, f1) ∧ uniPart(u, p1) ∧ perPart(p2, f2) ⇒ ¬uniPart(u, p2).
(11)
The aggregative rules are used to merge percepts into common percept unions, while the segregative rules separate them into distinct unions. The template rules are equivalent to a subset of the association rules (1), where each side is limited to one feature. We also define the binding domain that we will use to ground the network. An example binding domain with two modalities is: modality = {Language, Vision}, feature = {Red, Green, Blue, Compact, Flat, Elongated, Color1, Color2, Color3, Shape1, Shape2, Shape3}.
(12)
Based on this example domain, a small set of grounded and weighted binding rules could look like this:

2.5   perPart(p1, Red) ∧ uniPart(u, p1) ∧ perPart(p2, Color1) ⇒ uniPart(u, p2)
1.9   perPart(p1, Red) ∧ uniPart(u, p1) ∧ perPart(p2, Color2) ⇒ ¬uniPart(u, p2),   (13)
The predicates forming the binding rules are not fully grounded yet. They are grounded on the conceptual level only, with known features like Red, Color1, etc., while the instances (objects) are still represented with variables. The predicates get fully grounded each time an inference is performed, when based on the current situation (e. g. perceived objects that form the scene) an MN is constructed. This principle could be very beneficial for a cognitive system, since while decoupling the general from the specific, it allows for the application and adaptation of general concepts learned over longer periods of time to the current situation in a very flexible fashion.
Using the example domain in (12) an example percept configuration could look like perPart(1, Color1) ∧ perPart(1, Shape2) ∧ perPart(2, Color2) ∧ perPart(2, Shape3) ∧ perPart(3, Red).
(14)
From (13) and (14) we could infer the following union configuration: uniPart(1, 1) ∧ uniPart(1, 3) ∧ uniPart(2, 2). Besides the binding rules, our database also contains feature priors in the following form:

weight   perPart(p, ColorGrounding)

A feature's prior denotes the default probability of the feature belonging to a percept (if there is no positive or negative evidence about it). In addition, we use a special predicate to determine the partition of features between modalities in the sense of (2) (e.g. modPart(Language, Red), modPart(Vision, Color1)).

3.2 Learning
After the rules and priors are grounded within the binding domain, we need to learn their weights. We use the generative learning method described in [10]. The learner computes a gradient with respect to the weights, based on the number of true groundings n_i(x) in the learning database and the expected number of true groundings according to the MLN, E_w[n_i(x)]:

∂ log P_w(x) / ∂w_i = n_i(x) − E_w[n_i(x)],
(15)
and optimizes the weights accordingly. Since the expectations E_w[n_i(x)] are very hard to compute, the method uses the pseudo-likelihood to approximate them [3]. Continuous learning is performed by feeding the learning samples to the system in small batches (3-6 percept unions). Each mini batch represents a scene the system has resolved, described with perPart and uniPart predicates. In each learning step the learner takes the rule's old weight in the knowledge database as the mean of a Gaussian prior, which it then adjusts based on the new training mini batch. By setting the dispersion of the weight's Gaussian prior to an adequate value, we ensure that the learning rate of each mini batch is proportional to the batch size.
3.3 The Binding Process
The binding process translates to inferring over the knowledge base based on some evidence. In order for the binding inference to function properly we have
Binding and Cross-Modal Learning in Markov Logic Networks
241
to define some hard rules (formulae with infinite weight) that apply the binding restrictions in (7):
1. ∀p ∃u : uniPart(u, p)
2. uniPart(u1, p) ∧ uniPart(u2, p) ⇒ u1 = u2
3. perPart(p1, f1) ∧ perPart(p2, f2) ∧ (p1 ≠ p2) ∧ modPart(m, f1) ∧ modPart(m, f2) ∧ uniPart(u, p1) ⇒ ¬uniPart(u, p2).
Usually the inference consists of querying for the predicate uniPart, where the evidence typically includes the description of the current percept configuration (using the predicate perPart), a list of known and potential unions, and the description of the current partial union configuration (some percepts are already assigned to known unions). The binding result is expressed as a probability distribution for each unassigned percept over the known and potential unions.
4 Experimental Results
We experimented with our system on a database of 42 synthetic objects. Objects had percepts from three modalities: vision, language and affordance. The visual modality had 13 features in total: 6 for object color and 7 for the shape. Language had 13 features matching the visual features and 8 features for object type (e.g. book, box, apple). The affordance modality had 3 features describing the possible outcomes of pushing an object. Mini batches were designed to mimic robot interaction with a human tutor, where the tutor showed objects to the robot, describing their properties. Typically a mini batch contained 5-6 objects. The learning sequence consisted of 80 mini batches. We designed 30 test-cases for evaluating the binding process. In each test-case we had three visual percepts and one non-visual percept. The binder had to determine which visual percept, if any at all, the non-visual percept belonged to (i.e. four possible choices: one for each visual percept and one for no corresponding percept). Of the four possible choices there was always one that was more obvious than the others and deemed correct. The possibility that the system inferred as the most probable was considered to be its binding choice. The test-cases varied in their level of difficulty: the easiest featured distinct features for the visual percepts and complete information for all percepts (all percepts had a value for each feature type belonging to its modality, see Figure 1), while more difficult cases could have features shared by more percepts and incomplete percept information (see Figure 2). The tests were performed several times during the learning process, in intervals of four batches.

union = {1, 2, 3, 4}
perPart(1, VisRed), perPart(1, VisFlat), perPart(1, VisCylindrical)
perPart(2, VisBlue), perPart(2, VisCompact), perPart(2, VisSpherical)
perPart(3, VisGreen), perPart(3, VisElongated), perPart(3, VisConical)
uniPart(1, 1), uniPart(2, 2), uniPart(3, 3)
perPart(4, LinRed), perPart(4, LinFlat), perPart(4, LinCylindrical)
uniPart(u, 4)?

Fig. 1. An example of an easy test-case. We can see that the objects represented with visual percepts (1, 2 and 3) differ in all types of visual features. The system needs to determine which union the fourth, linguistic percept belongs to.

union = {1, 2, 3, 4}
perPart(1, VisRed), perPart(1, VisCompact)
perPart(2, VisGreen), perPart(2, VisCompact), perPart(2, VisSpherical)
perPart(3, VisGreen), perPart(3, VisFlat), perPart(3, VisConical)
uniPart(1, 1), uniPart(2, 2), uniPart(3, 3)
perPart(4, LinApple)
uniPart(u, 4)?

Fig. 2. An example of a difficult test-case. We can see that the objects represented with visual percepts (1, 2 and 3) are less distinct than in the easier test-case (Fig. 1), and some of the information is incomplete. The system has to find out which visual percept could be an apple. The visual training samples for apples consisted of compact and spherical percepts of red or green color.
Fig. 3. Experimental results: the average rate of correct binding choices relative to the number of training batches (10 randomly generated learning sequences were used). The green, yellow and red lines denote the easy, medium and hard test samples respectively, while the blue line denotes the overall success percentage.
Figure 3 shows the average success rate over 10 randomly generated learning sequences. We see that with the growing number of samples the binding success rate tends to grow and converge, though with some oscillations. The oscillations tend to be more pronounced for the difficult samples. Analyzing the results example by example, we saw that the test-cases with the most oscillations were the ones that depended on many-to-one feature associations (e.g. red, compact, cylindrical ⇒ cola can). This can be explained by the current structure of our binding rules, which directly supports only one-to-one feature associations.
5 Conclusion
In this paper we defined the problems of high-level binding and cross-modal learning. Based on these definitions we modeled a binding mechanism and a cross-modal learner in MLNs. We tested the system on a synthetic object database and showed how the binding power of the system increases with the number of learned samples. In the future we will apply our binding and cross-modal learning models to a real cognitive architecture that includes visual and communication subsystems, thus gaining a platform for experiments on real-world data. We will also extend the structure of our database to more complex rules (or perhaps include a structure learning mechanism in our system) and improve and extend our experiments to better simulate the robot-tutor interaction.
Acknowledgment This work was supported by the EC FP7 IST project CogX-215181.
References 1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proc. of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., pp. 207–216 (May 1993) 2. Bartels, A., Zeki, S.: The temporal order of binding visual attributes. Vision Research 46(14), 2280–2286 (2006) 3. Besag, J.: Statistical analysis of non-lattice data. Journal of the Royal Statistical Society. Series D (The Statistician) 24(3), 179–195 (1975) 4. Chella, A., Frixione, M., Gaglio, S.: A cognitive architecture for artificial vision. Artif. Intell. 89(1-2), 73–111 (1997) 5. Gilks, W.R., Spiegelhalter, D.J.: Markov chain Monte Carlo in practice. Chapman & Hall/CRC (1996) 6. Harnad, S.: The symbol grounding problem. Physica D: Nonlinear Phenomena 42, 335–346 (1990) 7. Jacobsson, H., Hawes, N., Kruijff, G.-J., Wyatt, J.: Crossmodal content binding in information-processing architectures. In: Proc. of the 3rd ACM/IEEE International Conference on Human-Robot Interaction, Amsterdam (March 2008)
8. Jacobsson, H., Hawes, N., Skočaj, D., Kruijff, G.-J.: Interactive learning and crossmodal binding - a combined approach. In: Symposium on Language and Robots, Aveiro, Portugal (2007) 9. Kok, S., Marc Sumner, M., Richardson, M., Singla, P., Poon, H., Lowd, D., Wang, J., Domingos, P.: The alchemy system for statistical relational ai. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA (2009) 10. Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62(1-2), 107– 136 (2006) 11. Roth, D.: On the hardness of approximate reasoning. Artif. Intell. 82(1-2), 273–302 (1996) 12. Roy, D.: Learning visually-grounded words and syntax for a scene description task. Computer Speech and Language 16(3-4), 353–385 (2002) 13. Roy, D.: Grounding words in perception and action: computational insights. TRENDS in Cognitive Sciences 9(8), 389–396 (2005) 14. Singer, W.: Consciousness and the binding problem. Annals of the New York Academy of Sciences 929, 123–146 (2001) 15. Steels, L.: The Talking Heads Experiment. Words and Meanings, vol. 1. Laboratorium, Antwerpen (1999) 16. Vrečko, A., Skočaj, D., Hawes, N., Leonardis, A.: A computer vision integration model for a multi-modal cognitive system. In: Proc. of the 2009 IEEE/RSJ Int. Conf. on Intelligent RObots and Systems, St. Louis, pp. 3140–3147 (October 2009) 17. Wyatt, J., Aydemir, A., Brenner, M., Hanheide, M., Hawes, N., Jensfelt, P., Kristan, M., Kruijff, G.-J., Lison, P., Pronobis, A., Sjöö, K., Skočaj, D., Vrečko, A., Zender, H., Zillich, M.: Self-understanding & self-extension: A systems and representational approach (2010) (accepted for publication) 18. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Understanding belief propagation and its generalizations. Morgan Kaufmann Publishers Inc., San Francisco (2003)
Chaotic Exploration Generator for Evolutionary Reinforcement Learning Agents in Nondeterministic Environments Akram Beigi, Nasser Mozayani, and Hamid Parvin School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran {Beigi,Mozayani,Parvin}@iust.ac.ir
Abstract. In the exploration phase of reinforcement learning, it is necessary to introduce a process of trial and error in order to discover better rewards obtained from the environment. To this end, one usually uses a uniform pseudorandom number generator in the exploration phase. However, it is known that a chaotic source also provides a random-like sequence similar to a stochastic source. In this paper we have employed a chaotic generator in the exploration phase of reinforcement learning in a nondeterministic maze problem, and we obtained promising results. Keywords: Reinforcement Learning, Evolutionary Q-Learning, Chaotic Exploration.
1 Introduction

In reinforcement learning, agents learn their behaviors by interacting with an environment [1]. An agent senses and acts in its environment in order to learn to choose optimal actions for achieving its goal. It has to discover by trial-and-error search how to act in a given environment. For each action the agent receives feedback (also referred to as a reward or reinforcement) to distinguish what is good and what is bad. The agent's task is to learn a policy or control strategy for choosing actions that achieve its goal in the long run. For this purpose the agent stores a cumulative reward for each state or state-action pair. The ultimate objective of a learning agent is to maximize the cumulative reward it receives in the long run, from the current state and all subsequent states up to the goal state. Reinforcement learning systems have four main elements [2]: a policy, a reward function, a value function and a model of the environment. A policy defines the behavior of the learning agent; it consists of a mapping from states to actions. A reward function specifies how good the chosen actions are; it maps each perceived state-action pair to a single numerical reward. In the value function, the value of a given state is the total reward accumulated in the future, starting from that state. The model of the environment simulates the environment's behavior and may predict the next environment state from the current state-action pair; it is usually represented as a Markov Decision Process (MDP) [1, 3, 4]. In the MDP
model, the agent senses the state of the world and then takes an action which leads it to a new state. The choice of the new state depends on the agent's current state and its action. An MDP is defined as a 4-tuple <S,A,T,R> characterized as follows: S is the set of states of the environment, A is the set of actions available in the environment, T is the state transition function for state s and action a, and R is the reward function. The optimal solution of an MDP is to take the best action available in a state, i.e. the action that collects as much reward as possible over time. In reinforcement learning, it is necessary to introduce a process of trial and error to maximize the rewards obtained from the environment. This trial-and-error process is called environment exploration. Because there is a trade-off between exploration and exploitation, balancing them is very important. This is known as the exploration-exploitation dilemma. The scheme of the exploration is called a policy. There are many kinds of policies, such as ε-greedy, softmax, weighted roulette and so on. In these existing policies, exploration is decided by using stochastic numbers from a random generator. It is common to use a uniform pseudorandom number generator as the generator employed in the exploration phase. However, it is known that a chaotic source also provides a random-like sequence similar to a stochastic source. Employing a chaotic generator based on the logistic map in the exploration phase gives better performance than employing a stochastic random generator in a nondeterministic maze problem. Morihiro et al. [5] proposed the usage of a chaotic pseudorandom generator instead of a stochastic random generator in an environment with changing goals or solution paths along with exploration. That algorithm is severely sensitive to ε in ε-greedy. It is important to note that they did not use a chaotic random generator in nondeterministic environments. From that work it can be inferred that a stochastic random generator has better performance when random action selection is used instead of ε-greedy. On the other hand, because of the slowness of learning in reinforcement learning, evolutionary computation techniques are applied to improve learning in nondeterministic environments. In this work we propose a modified reinforcement learning algorithm that applies a population-based evolutionary computation technique and uses the random-like feature of deterministic chaos as the random generator employed in its exploration phase, to improve learning in multi-task agents. To sum up, our contributions are:
• Employing evolutionary strategies in the reinforcement learning algorithm to increase both the speed and the accuracy of the learning phase,
• Using a chaotic generator instead of a uniform pseudorandom number generator in the exploration phase of evolutionary reinforcement learning,
• Multi-task learning in nondeterministic environments.
2 Chaotic Exploration

Chaos theory studies the behavior of certain dynamical systems that are highly sensitive to initial conditions. Small differences in initial conditions (such as those due to rounding errors in numerical computation) result in widely diverging outcomes for chaotic systems, and consequently long-term prediction is impossible in general. This happens even though these systems are deterministic, meaning that their future dynamics are fully determined by their initial conditions, with no random elements involved. In other words, the deterministic nature of these systems does not make them predictable if the initial condition is unknown [6, 7]. As mentioned above, there are many kinds of exploration policies in reinforcement learning, such as ε-greedy, softmax and weighted roulette. It is common to use a uniform pseudorandom number as the stochastic exploration generator in each of the mentioned policies. Another way to deal with the problem of exploration generators is to utilize a chaotic deterministic generator as the stochastic exploration generator [5]. As the chaotic deterministic generator, a logistic map, which generates a value in the closed interval [0, 1] according to equation (1), is used as the stochastic exploration generator in this paper:

x_{t+1} = α x_t (1 − x_t).
(1)
In equation (1), x_0 is a uniform pseudorandom number generated in the [0, 1] interval and α is a constant in the interval [0, 4]. It can be shown that the sequence x_i remains in the [0, 1] interval provided that the coefficient α is a number near to and below 4 [8, 9]. It is important to note that the sequence may diverge for α greater than 4. The closer α is to 4, the more distinct values the sequence visits. If α is set to 4, the widest range of values (possibly all points in the [0, 1] interval) is covered across different initializations of the sequence. Therefore, α is chosen here as 4 to make the output of the sequence as similar as possible to a uniform pseudorandom number.
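A chaotic exploration generator based on (1) can be packaged as a drop-in replacement for a uniform pseudorandom source. The class below is a sketch under our own naming; α is fixed to 4 as argued above, and the sequence is seeded once with a uniform pseudorandom number.

    import random

    class LogisticMapGenerator:
        # Chaotic random-like source: x_{t+1} = alpha * x_t * (1 - x_t)
        def __init__(self, alpha=4.0, seed=None):
            rng = random.Random(seed)
            self.alpha = alpha
            self.x = rng.uniform(1e-6, 1.0 - 1e-6)   # avoid degenerate seeds 0 and 1

        def next(self):
            self.x = self.alpha * self.x * (1.0 - self.x)
            return self.x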
3 Population Based Evolutionary Computation

One line of research in Evolutionary Computation was introduced by Handa [10]. It uses a kind of memory in Evolutionary Computation for storing past optimal solutions. In that work, each individual in the population denotes a policy for a routine task. The best individual in the current population is selected and inserted into an archive when the environment changes. After that, individuals in the archive are randomly selected to be moved back into the population. The algorithm, called Memory-based Evolutionary Programming, is depicted in Fig. 1. A large number of studies concerning dynamic or uncertain environments have used Evolutionary Computation algorithms [11]. These problems try to reach their goal as soon as possible, and the significant issue is that the robots can get assistance from their previous experiences. In this paper a population-based chaotic evolutionary computation for multi-task reinforcement learning problems is examined.
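A minimal sketch of the memory mechanism described above (our own simplification of Handa's scheme, not the exact algorithm of Fig. 1): when the environment changes, the best individual of the current population is archived, and archived individuals are occasionally re-injected into the population. The names and the re-injection policy are assumptions.

    import random

    def on_environment_change(population, archive, fitness):
        # store the best policy of the task that has just finished
        archive.append(max(population, key=fitness))

    def reinject(population, archive, rate=0.1):
        # replace a fraction of the population with randomly chosen archived policies
        if not archive:
            return
        for i in random.sample(range(len(population)), int(rate * len(population))):
            population[i] = random.choice(archive)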
Fig. 1. Diagram of Handa's algorithm for evolutionary computation
4 Q-Learning

Among reinforcement learning algorithms, the Q-learning method is considered one of the most important [1]. It builds a Q-mapping from state-action pairs using the rewards obtained from the interaction with the environment. In this case, the learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed. This simplifies the analysis of the algorithm and enabled early convergence proofs. The pseudocode of the Q-learning algorithm is shown in Fig. 2.

Q-Learning Algorithm:
  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
      Choose a from s using policy derived from Q (e.g., ε-greedy)
      Take action a, observe r, s'
      Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
      s ← s'
    Until s is terminal

Fig. 2. Q-Learning Algorithm
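Combining Fig. 2 with the chaotic source of Section 2 gives the exploration variant studied here. The sketch below replaces the uniform draws used by ε-greedy with values from the logistic map; the environment interface (env.reset, env.step) and the LogisticMapGenerator class from the earlier sketch are our assumptions, not part of the paper.

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes,
                   alpha=0.1, gamma=0.95, eps=0.1, chaos=None):
        Q = np.zeros((n_states, n_actions))
        draw = chaos.next if chaos is not None else np.random.rand  # chaotic or stochastic
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if draw() < eps:                        # explore
                    a = int(draw() * n_actions) % n_actions
                else:                                   # exploit
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)
                Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
                s = s_next
        return Q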
5 Evolutionary Reinforcement Learning

Evolutionary Reinforcement Learning (ERL) is a method of searching for the best policy in an RL problem by applying a GA. In this case, the potential solutions are the policies; they are represented as chromosomes, which can be modified by genetic operators such as crossover and mutation [12].
A GA can directly learn decision policies without studying the model and state space of the environment in advance. The fitness values of different potential policies are used by the GA. In many cases, the fitness function can be computed as the sum of rewards, which are used to update the Q-values. We use a modified Q-learning algorithm that applies a memory-based Evolutionary Computation technique to improve learning in multi-task agents [13].
6 Chaotic Based Evolutionary Q-Learning

By applying Genetic Algorithms to reinforcement learning in nondeterministic environments, we propose a Q-learning method called Evolutionary Q-learning. The algorithm is presented in Fig. 3.

Chaotic Based Evolutionary Q-Learning (CEQL):
  Initialize Q(s,a) by zero
  Repeat (for each generation):
    Repeat (for each episode):
      Initialize s
      Repeat (for each step of episode):
        Initiate(Xcurrent) by Rnd[0,1]
        Repeat
          Xnext = 4 * Xcurrent * (1 - Xcurrent)
        Until (Xnext - Xcurrent

ϕ > 0 for an original training sample, and translated by −ϕ for a duplicated training sample.
2. Every training sample ai is converted to a classification sample by incorporating the output as an additional feature and setting class 1 for original training samples and class −1 for duplicated training samples.
3. SVC is run with the classification mappings.
4. The solution of SVC is converted to a regression form.
The above procedure is repeated for different values of ϕ; for a particular ϕ it is depicted in Fig. 1. The best solution among the various ϕ is found based on the mean squared error (MSE) measure. The result of the first step is a set of training mappings for i ∈ {1, ..., 2n}:

b_i = (a_{i,1}, ..., a_{i,m}) → y_i + ϕ for i ∈ {1, ..., n}
b_i = (a_{i−n,1}, ..., a_{i−n,m}) → y_{i−n} − ϕ for i ∈ {n + 1, ..., 2n}
Fig. 1. In the left figure, there is an example of regression data for the function y = 0.5 sin 10x + 0.5 with Gaussian noise; in the right figure, the regression data are translated to classification data. The '+' marks depict translated original samples, and the 'x' marks depict translated duplicated samples.
for ϕ > 0. We call the parameter ϕ the translation parameter. The result of the second step is a set of training mappings for i ∈ {1, ..., 2n}:

c_i = (b_{i,1}, ..., b_{i,m}, y_i + ϕ) → 1 for i ∈ {1, ..., n}
c_i = (b_{i,1}, ..., b_{i,m}, y_{i−n} − ϕ) → −1 for i ∈ {n + 1, ..., 2n}

for ϕ > 0. The dimension of the c_i samples equals m + 1. The set of a_i mappings is called a regression data setting; the set of c_i mappings is called a classification data setting. In the third step, we solve OP 2 with the c_i samples. Note that h*(x) is in an implicit form with respect to the last coordinate of x. In the fourth step, we have to find an explicit form of the last coordinate. The explicit form is needed, for example, for testing new samples. The w_c variable of the primal problem for the simple linear kernel case is found from the solution of the dual problem in the following way:

w_c = Σ_{i=1}^{2n} y_c^i α_i c_i.
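For the linear case the whole procedure can be sketched in a few lines: duplicate the samples, translate the outputs by ±ϕ, append the output as an extra feature with class labels ±1, train a linear SVC, and convert the separating hyperplane back to a regression function as described in the text. The sketch uses scikit-learn purely for illustration; it is not the implementation used in the paper (which is based on LibSVM), and all names are ours.

    import numpy as np
    from sklearn.svm import LinearSVC

    def svcr_linear(A, y, phi, C=1.0):
        # Steps 1-2: build the classification data setting from the regression data
        X = np.vstack([np.column_stack([A, y + phi]),    # originals, class +1
                       np.column_stack([A, y - phi])])   # duplicates, class -1
        t = np.hstack([np.ones(len(A)), -np.ones(len(A))])
        # Step 3: run SVC on the classification mappings
        clf = LinearSVC(C=C).fit(X, t)
        w, b = clf.coef_[0], clf.intercept_[0]
        # Step 4: convert the hyperplane w.x + b = 0 to y = wr.x + br
        wr, br = -w[:-1] / w[-1], -b / w[-1]
        return wr, br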
For the simple linear kernel the explicit form of (2) is

x^{m+1} = (−Σ_{j=1}^{m} w_c^j x_j − b_c) / w_c^{m+1}.

The regression solution is g*(x) = w_r · x + b_r, where w_r^i = −w_c^i / w_c^{m+1} and b_r = −b_c / w_c^{m+1} for i = 1..m. For nonlinear kernels, conversion to the explicit form has some limitations. First, a decision curve could have more than one value
of the last coordinate for specific values of the remaining coordinates of x, so it cannot be converted unambiguously to a function (e.g. a polynomial kernel with dimension equal to 2). Second, even when the conversion to a function is possible, there may be no explicit analytical formula (e.g. a polynomial kernel with dimension greater than 4), or it may not be easy to find one. Therefore, a special method for finding an explicit formula for the coordinate should be used, e.g. a bisection method. The disadvantage of this solution is a longer time for testing new samples. To overcome these problems, we propose a new kernel type in which the last coordinate is placed only inside a linear term. A new kernel is constructed from an original kernel by removing the last coordinate and adding a linear term with the last coordinate. For the most popular kernels, polynomial, radial basis function (RBF) and sigmoid, the conversions are, respectively,
(x · y)^d → (Σ_{i=1}^{m} x_i y_i)^d + x_{m+1} y_{m+1},   (3)

exp(−||x − y||² / (2σ²)) → exp(−Σ_{i=1}^{m} (x_i − y_i)² / (2σ²)) + x_{m+1} y_{m+1},   (4)

tanh(x · y) → tanh(Σ_{i=1}^{m} x_i y_i) + x_{m+1} y_{m+1}.   (5)
The proposed method of constructing new kernels always generates a function fulfilling Mercer's condition, because it generates a function which is a sum of two kernels. For the new kernel type, the explicit form of (2) is

x^{m+1} = −( Σ_{i=1}^{2n} y_c^i α_i K_r(c_r^i, x_r) ) / ( Σ_{i=1}^{2n} y_c^i α_i c_i^{m+1} ),
where c_r^i = (c_i^1, ..., c_i^m) and x_r = (x^1, ..., x^m).
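The modified kernels (3)-(5) can be used directly with an SVC implementation that accepts a callable kernel. The sketch below builds the modified RBF kernel of (4) as a callable for scikit-learn; this is our own illustration under that assumption, not the implementation used in the paper.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics.pairwise import rbf_kernel

    def modified_rbf(sigma):
        # K((x, x_{m+1}), (y, y_{m+1})) = exp(-||x - y||^2 / (2 sigma^2)) + x_{m+1} * y_{m+1},
        # i.e. an RBF kernel on the first m coordinates plus a linear term in the last one.
        gamma = 1.0 / (2.0 * sigma ** 2)
        def kernel(X, Y):
            return rbf_kernel(X[:, :-1], Y[:, :-1], gamma=gamma) + np.outer(X[:, -1], Y[:, -1])
        return kernel

    # Example use (X, t hypothetical): clf = SVC(C=1.0, kernel=modified_rbf(0.5)).fit(X, t)

Because the result is a sum of two kernels, it satisfies Mercer's condition, as noted above.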
2.1 Support Vectors
The SVCR method runs SVC on a duplicated number of samples. Thus, the maximal number of support vectors of SVC is 2n. The SVCR algorithm is constructed in such a way that, while searching for the best value of ϕ, the cases for which the number of SVC support vectors is bigger than n are omitted. We prove that in this case the set of SVC support vectors does not contain any two training samples where one of them is a duplicate of the other. Therefore, the set of SVC support vectors is a subset of the a_i set of training samples. Let us call a vector lying on the margin boundaries or outside the margin boundaries an essential margin vector, and denote the set of such vectors EMV.

Theorem 1. If the a_i samples are not collinear and |EMV| ≤ n, then EMV does not contain duplicates.
Proof (sketch). Let us assume that EMV contains a duplicate a't of a sample at. Let p(·) = 0 be a hyperplane parallel to the margin boundaries and containing at. Then the set of EMV samples for which p(·) ≥ 0 has r elements, where r ≥ 1. Let p'(·) = 0 be a hyperplane parallel to the margin boundaries and containing a't. Then the set of EMV samples for which p'(·) ≤ 0 has at least n − r + 1 elements. So |EMV| ≥ n + 1, which contradicts the assumption.

For the nonlinear case the same theorem applies in the induced kernel feature space. It can be proved that the set of support vectors is a subset of EMV; therefore the same theorem applies to the set of support vectors. Experiments show that it is rare for the set of support vectors to have more than n elements for every value of ϕ checked by SVCR. In such a situation the best solution among those violating the constraint is chosen. Here we consider how changes in the value of ϕ influence the number of support vectors. First, we can see that for ϕ = 0, n ≤ |EMV| ≤ 2n. When, for a particular value of ϕ, both classes are separable, then 0 ≤ |EMV| ≤ 2n. By a configuration of essential margin vectors we mean a list of essential margin vectors, each with its distance to one of the margin boundaries.

Theorem 2. For two values of ϕ, ϕ1 > 0 and ϕ2 > 0, where ϕ1 > ϕ2, for every margin boundaries for ϕ2 there exist margin boundaries for ϕ1 with the same configuration of essential margin vectors.
Comparison with -SVR
Both methods have the same number of free parameters. For -SVR: C, kernel parameters, and . For SVCR: C, kernel parameters and ϕ. When using a particular kernel function for -SVR and a related kernel function for SVCR, both methods have the same hypothesis space. Both parameters and ϕ control a number of support vectors. There is a slightly difference between these two methods when we compare configurations of essential margin vectors. For the case of -SVR, we define margin boundaries as a lower and upper tube boundaries. Among various values of the , every configuration of essential margin vectors is unique. In the SVCR, based on Thm. 2, configurations of essential margin vectors are repeated while a value of ϕ increases. This suggest that for particular values of ϕ and a set of configurations of essential margin vectors is richer for SVCR than for -SVR.
3 Experiments
First, we compare the performance of SVCR and ε-SVR on synthetic data and on publicly available regression data sets. Second, we show that by using SVCR, a priori knowledge in the form of detractors, introduced in [5] for classification problems, can be applied to regression problems. For the first part, we use the LibSVM [1] implementation of ε-SVR, and we use LibSVM for solving the SVC problems inside SVCR; we use a version ported to Java. For the second part, we use the author's implementation of SVC with detractors. For all data sets, every feature, including the output, is scaled linearly to [0, 1]. For variable parameters such as C, σ for the RBF kernel, ϕ for SVCR, and ε for ε-SVR, we use a grid search to find the best values. The number of values searched by the grid method is a trade-off between accuracy and simulation speed. Note that for particular data sets it is possible to use a finer grid search than for massive tests with a large number of simulations. The preliminary tests confirm that as ϕ is increased, the number of support vectors decreases.
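As an illustration of this protocol (not the author's code), the following sketch scales every feature and the output to [0, 1] and grid-searches C, σ and ε for ε-SVR using the LIBSVM-backed implementation in scikit-learn; the data generator and the grid values are assumptions chosen only for the example.

```python
# Illustrative sketch of the evaluation protocol (assumed grid values, not the paper's).
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR  # scikit-learn wraps the LIBSVM solver

rng = np.random.default_rng(0)
X = rng.uniform(size=(390, 4))
y = X.sum(axis=1) + rng.normal(scale=0.04, size=390)   # noisy linear target, similar to y1

# Scale every feature and the output linearly to [0, 1].
X = MinMaxScaler().fit_transform(X)
y = MinMaxScaler().fit_transform(y.reshape(-1, 1)).ravel()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=90, random_state=0)

grid = {"C": [0.1, 1, 10, 100],
        "gamma": [0.1, 1, 10],        # gamma = 1/(2*sigma^2) for the RBF kernel
        "epsilon": [0.01, 0.05, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), grid, scoring="neg_mean_squared_error", cv=3)
search.fit(X_tr, y_tr)
print("best parameters:", search.best_params_)
print("test MSE:", np.mean((search.predict(X_te) - y_te) ** 2))
```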
3.1 Synthetic Data Tests
We compare the SVCR and ε-SVR methods on data generated from particular functions with added Gaussian noise on the output values. We perform tests with a linear kernel on the linear functions, with a polynomial kernel on the polynomial function, and with the RBF kernel on the sine function. The tests with results are presented in Tab. 1. We can notice generally slightly worse training performance for the SVCR. The reason is that ε-SVR directly minimizes the MSE. We can notice fairly good generalization performance for the SVCR, which is slightly better than for the ε-SVR. We can notice fewer support vectors for the SVCR method for linear kernels. For the RBF kernel the SVCR is slightly worse.
3.2 Real World Data Sets
The real world data sets were taken from the LibSVM site [1] [4] except the stock price data. The stock price data consist of monthly prices of the index DJIA from 1898 up to 2010. We generated the sample data set as follows: for every month the output value is the growth/fall compared to the next month. Every feature i is the percentage price change between the month and the i-th previous month. In every simulation, training data are randomly chosen and the remaining samples become test data. The tests with results are presented in Tab. 2. For linear kernels, the tests show better generalization performance of the SVCR method; the performance gain on testing data ranges from 0–2%. For the polynomial kernel, we can notice better generalization performance of the SVCR (performance gain from 68–80%). The number of support vectors is comparable for both
Table 1. Description of test cases with results for synthetic data. Column descriptions: a function – a function used for generating data, y1 = Σ_{i=1}^{dim} x_i, y4 = Π_{i=1}^{dim} x_i, y5 = 0.5 Σ_{i=1}^{dim} sin(10 x_i) + 0.5, simC – a number of simulations, results are averaged, σ – a standard deviation used for generating noise in output, ker – a kernel (pol – a polynomial kernel), kerP – a kernel parameter (for a polynomial kernel it is a dimension, for the RBF kernel it is σ), trs – a training set size, tes – a testing set size, dim – a dimension of the problem, tr12M – a percentage average difference in MSE for training data, if greater than 0 then SVCR is better, te12M – the same as tr12M, but for testing data, tr12MC – a percentage average difference in number of tests for training data in which SVCR is better (SVCR is better when a value is greater than 50%), te12MC – the same as tr12MC, but for testing data, s1 – an average number of support vectors for ε-SVR, s2 – an average number of support vectors for SVCR. The value 'var' means that we search for the best value comparing the training data MSE.
function     simC  σ     ker  kerP  trs  tes  dim  tr12M   te12M   tr12MC  te12MC  s1  s2
y1           100   0.04  lin  –     90   300  4    0%      0.5%    20%     58%     50  46
y2 = 3y1     100   0.04  lin  –     90   300  4    −0.4%   −0.4%   10%     40%     74  49
y3 = 1/3y1   100   0.04  lin  –     90   300  4    0%      1%      50%     80%     50  40
y4           100   0.04  pol  3     90   300  4    −2%     10%     2%      80%     61  61
y5           20    0.04  rbf  var   90   300  4    −500%   −10%    30%     20%     90  90
methods. For the RBF kernel, results strongly depend on the data: for two test cases the SVCR has better generalization performance (10%). Generally the tests show that the new method SVCR has good generalization performance on the synthetic and real world data sets used in the experiments, and it is often better than for the ε-SVR.
3.3 Incorporating a Priori Knowledge in the Form of Detractors to SVCR
In the article [5], a concept of detractors was proposed for the classification case. Detractors were used for incorporating a priori knowledge in the form of a lower bound (a detractor parameter b) on the distance from a particular point (called a detractor point) to the decision surface. We show that we can use the concept of detractors directly in the regression case by using the SVCR method. We define a detractor for the SVCR method as a point with a parameter d and a side (1 or −1). We modify the SVCR method in the following way: the detractor is added to the training data set and transformed to the classification setting such that when the side is 1: d = b + ϕ, and for the duplicate d = 0; when the side is −1: d = 0, and for the duplicate d = b − ϕ. The primary application of detractors was to model a decision function (i.e., moving it away from a detractor). A synthetic test shows that we can indeed use detractors for modeling a regression function. In Fig. 2, we can see that adding a detractor pushes the regression function away from the detractor.
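As an illustration only, the following sketch encodes the mapping quoted above literally; the function name, the tuple layout, and the example values are hypothetical and are not taken from the author's implementation.

```python
# Hypothetical helper: the detractor parameters assigned to the added point and to its
# duplicate in the classification setting, following the rule stated in the text above.
def detractor_params(b, side, phi):
    """Return (d for the added point, d for its duplicate)."""
    if side == 1:
        return b + phi, 0.0
    if side == -1:
        return 0.0, b - phi
    raise ValueError("side must be +1 or -1")

# Example: a detractor with bound b = 10.0 (cf. Fig. 2) and an arbitrary translation phi.
print(detractor_params(b=10.0, side=1, phi=0.05))   # (10.05, 0.0)
```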
Table 2. Description of test cases with results for real world data. Column descriptions: name – a name of the test, simT – a number of random simulations, where training data are randomly selected, results are averaged, ker – a kernel (pol – a polynomial kernel), kerP – a kernel parameter (for a polynomial kernel it is a dimension, for the RBF kernel it is σ), trs – a training set size, all – a number of all data, it is a sum of training and testing data, dim – a dimension of the problem, tr12M – a percentage average difference in MSE for training data, if greater than 0 then SVCR is better, te12M – the same as tr12M, but for testing data, tr12MC – a percentage average difference in number of tests for training data in which SVCR is better (SVCR is better when a value is greater than 50%), te12MC – the same as tr12MC, but for testing data, s1 – an average number of support vectors for ε-SVR, s2 – an average number of support vectors for SVCR. The value 'var' means that we search for the best value comparing the training data MSE.
name      simT  ker  kerP  trs  all   dim  tr12M    te12M  tr12MC  te12MC  s1  s2
abalone1  100   lin  –     90   4177  8    −0.2%    2%     20%     70%     35  38
abalone2  100   pol  5     90   4177  8    −90%     80%    0%      100%    78  73
abalone3  20    rbf  var   90   4177  8    70%      10%    90%     65%     90  90
caData1   100   lin  –     90   4424  4    −1.5%    2%     1%      55%     41  44
caData2   100   pol  5     90   4424  2    −105%    68%    0%      100%    79  75
caData3   20    rbf  var   90   4424  2    −25%     10%    50%     50%     90  90
stock1    100   lin  –     90   1351  4    0%       0%     40%     55%     35  32
stock2    100   pol  5     90   1351  2    −4500%   78%    0%      100%    90  87
stock3    20    rbf  var   90   1351  2    76%      −6%    100%    25%     90  90
Fig. 2. In the left figure, the best SVCR translation for particular regression data is depicted, in the right figure, the best SVCR translation for the same data, but with a detractor in a point (0.2, 0.1) and d = 10.0 is depicted. We can see that the detractor causes moving the regression function far away from it. Note that the best translation parameter ϕ is different for both cases.
4 Conclusions
The SVCR method is an alternative to the ε-SVR. We focus on two advantages of the new method: first, the generalization performance of the SVCR is comparable to or better than that of the ε-SVR in the conducted experiments. Second, we show, on the example of a priori knowledge in the form of detractors, that a priori knowledge already incorporated into SVC can be used for a regression problem solved by the SVCR. In such a case, we do not have to analyze and implement the incorporation of a priori knowledge into other regression methods (e.g. into the ε-SVR). Further analysis of the SVCR will concentrate on analysing and comparing the generalization performance of the proposed method in the framework of statistical learning theory. Just before submitting this paper, we found a very similar idea in [2]. However, the authors solve an additional optimization problem in the testing phase to find a root of a nonlinear equation. Therefore two problems arise: multiple solutions and lack of a solution. Instead, we propose a special type of kernels (3)(4)(5), which overcomes these difficulties. In [2], the authors claim that by modifying the ϕ parameter for every sample, so that the samples with the lowest and highest values of yi have smaller values of ϕ than the middle ones, a solution with fewer support vectors can be obtained. However, this modification requires tuning an additional parameter during the training phase.
Acknowledgments. The research is financed by the Polish Ministry of Science and Higher Education project No NN519579338. I would like to express my sincere gratitude to Professor Witold Dzwinel (AGH University of Science and Technology, Department of Computer Science) for contributing ideas, discussion and useful suggestions.
References
1. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
2. Fuming Lin, J.G.: A novel support vector machine algorithm for solving nonlinear regression problems based on symmetrical points. In: Proceedings of the 2010 2nd International Conference on Computer Engineering and Technology, ICCET (2010)
3. Lauer, F., Bloch, G.: Incorporating prior knowledge in support vector machines for classification: A review. Neurocomput. 71(7-9), 1578–1594 (2008)
4. Libsvm data sets, http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
5. Orchel, M.: Incorporating detractors into svm classification. In: Kacprzyk, P.J. (ed.) Man-Machine Interactions; Advances in Intelligent and Soft Computing, pp. 361–369. Springer, Heidelberg (2009)
6. Vapnik, V.N.: Statistical Learning Theory. Wiley Interscience, Hoboken (1998)
7. Wu, C.A., Liu, H.B.: An improved support vector regression based on classification. In: Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering, MUE 2007, pp. 999–1003. IEEE Computer Society, Washington, DC (2007)
Two One-Pass Algorithms for Data Stream Classification Using Approximate MEBs
Ricardo Ñanculef¹, Héctor Allende¹, Stefano Lodi², and Claudio Sartori²
¹ Dept. of Informatics, Federico Santa María University, Chile {hallende,jnancu}@inf.utfsm.cl
² Dept. of Electronics, Computer Science and Systems, University of Bologna, Italy {claudio.sartori,stefano.lodi}@unibo.it
Abstract. It has been recently shown that the quadratic programming formulation underlying a number of kernel methods can be treated as a minimal enclosing ball (MEB) problem in a feature space where data has been previously embedded. Core Vector Machines (CVMs) in particular, make use of this equivalence in order to compute Support Vector Machines (SVMs) from very large datasets in the batch scenario. In this paper we study two algorithms for online classification which extend this family of algorithms to deal with large data streams. Both algorithms use analytical rules to adjust the model extracted from the stream instead of recomputing the entire solution on the augmented dataset. We show that these algorithms are more accurate than the current extension of CVMs to handle data streams using an analytical rule instead of solving large quadratic programs. Experiments also show that the online approaches are considerably more efficient than periodic computation of CVMs even though warm start is being used. Keywords: Data stream mining, Online learning, Kernel methods, Minimal enclosing balls.
1 Introduction
Datasets which continuously grow over time are referred to as data streams. Data mining operations such as classification, clustering and frequent pattern mining are considerably more challenging in data streams applications because frequently the volume of data is too large to be stored on disk or to be analyzed using multiple scans [1]. Approximate solutions to standard data mining problems can thus be reasonable alternatives if they provide a near-optimal answer in a timely and computationally efficient manner. In this paper we focus on the problem of online approximation of SVM classifiers from data streams using a single pass over the data. In contrast to batch algorithms where data is supposed to be available all in advance and allowed to be used as many times as desired along the model computation process, online learning takes place in a sequence of consecutive rounds
This work was supported by Research Grants 1110854 Fondecyt and Basal FB0821, “Centro Científico-Tecnológico de Valparaíso”, UTFSM.
in which the learner observes a new example, provides a prediction, receives feedback about the correct outcome and finally has the chance to update its prediction mechanism in order to make better predictions on subsequent rounds [14]. One-pass methods in addition process new data items at most once [1]. Online learners avoiding multiple thorough passes are highly desired in real-time applications in which the model extracted from data needs to be frequently adjusted to achieve more accurate results. These algorithms are expected to exhibit restricted memory requirements as well as fast prediction and model-computation times and thus can also be used to deal with very large datasets effectively. Our method is based on the equivalence between a class of SVM classifiers (L2SVMs) and the problem of computing the minimal enclosing ball (MEB) of a set of points in a dot-product space. This equivalence, originally presented in [17] for the construction of the so called Core Vector Machines (CVMs), has motivated several approaches to speed up kernel methods on large datasets [18] [15] [3] [10]. Up to our knowledge however only [12] has previously examined the use of this equivalence for the design of single-pass online classifiers. This method in turn is based on a method to estimate the MEB of a streaming sequence proposed in [21]. Although recently [19] has also addressed the computation of CVMs from data streams, this method is based on the periodic resolution of large quadratic programs which mostly require several passes through the dataset (see [11] for a survey on methods for SVM computation). As the method presented in [12], our method keeps track of a ball which reasonably approximates the MEB corresponding to the sequence of examples coming up to a given round. We study two novel analytical rules to adjust such a ball from new coming observations. Our simulations on two medium scale and three large scale classification datasets show that the obtained classifiers are more accurate than the ones proposed in [12] to handle data streams. Experiments also show that all the single-pass approaches studied in this paper are considerably more efficient than periodic computation of CVMs even though warm start is being used.
2 Pattern Classification Using MEBs
Given a set of items Sx = {xk : k ∈ I := {0, 1, . . . , T}} associated with an outcome sequence {yk : k ∈ I}, a typical machine learning task consists in designing a prediction mechanism h(x), termed hypothesis, capable of mapping an input to a given outcome. In pattern classification this outcome represents a category or class that needs to be associated to a given item. In binary classification yk ∈ {+1, −1}, xk ∈ X ⊂ R^N and h : X → {+1, −1}.
2.1 Kernel Methods
Kernel methods model the prediction mechanism h using functions from the space of linear classifiers, that is, the predictions are computed using only dot-products in a feature space. Since in realistic problems the configuration of the
data can be highly non-linear, kernel methods build the linear model not in the original space X of the data but in a high-dimensional dot-product space Z = Lin(φ(X)) named the feature space, where the decision function can be linearly represented [8]. The feature space is related to the data space X by means of a function k : X × X → R called the kernel, which computes the dot products zi^T zj in Z directly from the points in the input space as k(xi, xj), avoiding the explicit computation of the mapping φ [13]. In this paper we will use z to refer to a generic element of the feature space Z obtained as the image of an observation x under the mapping φ. In the feature space Z, the classification hypothesis takes the form h(z) = sgn(f(z)), where the discriminant function f(z) is represented as a separating hyperplane f(z) = w^T φ(x) + b defined by means of a normal vector w ∈ Z and a position parameter b ∈ R. The vector w is by construction [13,8] equivalent to a superposition of featured items,
w = Σ_i λi φ(xi) ,   (1)
such that the prediction mechanism can be implemented using only the kernel and the original data items:
h(x) = sgn( w^T φ(x) + b ) = sgn( Σ_i yi λi k(xi, x) + b ) .   (2)
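As a minimal illustration of the decision rule (2) (a sketch, not tied to any particular library), the expansion can be evaluated directly; the RBF kernel and all variable names below are assumptions made only for the example.

```python
# Sketch of the kernel decision rule (2): h(x) = sgn(sum_i y_i * lambda_i * k(x_i, x) + b).
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / sigma ** 2)

def predict(x, items, labels, lambdas, bias, kernel=rbf_kernel):
    f = sum(y * lam * kernel(xi, x) for xi, y, lam in zip(items, labels, lambdas)) + bias
    return 1 if f >= 0 else -1
```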
The weights λi which determine the prediction mechanism (2) are defined as the solution to an optimization problem which incorporates in the objective a measure of error or loss l(ŷk, yk) on the dataset and other theoretically founded criteria, such as the sparseness of the solution, which determines the memory required to store the model. Note that for a hypothesis of the form of (2), y f(x) > 0 if and only if the decision of the classifier and the true outcome coincide. The margin y f(x) thus provides a measure of confidence of the prediction. Kernel methods are usually built by using the soft-margin loss [9]:
lρ(h(x), y) = max(0, ρ − y f(x)) .   (3)
2.2 L2-SVM Classification
The model of classification we will consider in the rest of this paper is that of L2-SVM classification [17]. In L2-SVM classification, the optimal separating hyperplane (w, b) for the dataset S is obtained as the solution to the problem
min(w, b, ρ, ξ) :  ½ ( ‖w‖² + b² + C Σ_i ξi² ) − ρ   (4)
s.t.  yi f(zi) ≥ ρ − ξi  ∀i ∈ I ,
which aims to simultaneously maximize the largest margin lρ(h(x), y) attained on the dataset and the margin parameter ρ of the soft-margin loss function (3).
The parameter C is a regularization parameter used by the model to handle noisy data [9]. It can be shown that the Lagrange dual of this problem is
max(α) :  − Σ_{i,j∈I} αi αj ( yi yj (k(xi, xj) + 1) + δ(i, j)/C )   (5)
s.t.  Σ_{i∈I} αi = 1,  αi ≥ 0  ∀i ∈ I .
From strong duality it can be shown that the parameters (w, b) for the L2-SVM model are
w = Σ_{i∈I} yi αi φ(xi) ,   b = Σ_{i∈I} yi αi ,   (6)
and the parameters λi of the expansion (1) are given by λi = yi αi. The margin parameter is additionally given by ρ = Σ_{i,j∈I} αi αj ( yi yj k(xi, xj) + yi yj + δ(i, j)/C ).
2.3 Minimal Enclosing Balls (MEBs)
As shown first by [17] and then generalized in [18], several kernel methods can be formulated as the problem of computing the minimal enclosing ball (MEB) of a set of feature points D = {zi : i ∈ I} in a dot-product space Z. The MEB of D, denoted by BS(c∗, r∗), is defined as the smallest ball in Z containing D. As shown in [20], the Lagrange dual of the quadratic programming formulation of this problem is
max(α) :  Φ(α) := Σ_{i∈I} αi zi^T zi − Σ_{i,j∈I} αi αj zi^T zj   (7)
s.t.  Σ_{i∈I} αi = 1,  αi ≥ 0  ∀i ∈ I .
Suppose now that the set D corresponds to the image of a set S = {x̃k = (xk, yk) : k ∈ I := {1, 2, . . . , T}} under a mapping ϕ : X × Y → Z, such that zi^T zj = ϕ(x̃i)^T ϕ(x̃j) = kϕ(x̃i, x̃j) ∀i, j ∈ I for a given kernel function kϕ defined on the pairs x̃ := (x, y). Problem (7) now looks as follows:
max(α) :  Φ(α) := Σ_{i∈I} αi kϕ(x̃i, x̃i) − Σ_{i,j∈I} αi αj kϕ(x̃i, x̃j)   (8)
s.t.  Σ_{i∈I} αi = 1,  αi ≥ 0  ∀i ∈ I ,
which only differs from the L2-SVM problem by the linear term Σ_{i∈I} αi kϕ(x̃i, x̃i). If the kernel k used in L2-SVM classification satisfies the normalization condition¹ k(xi, xi) = Δ² = constant, it follows that kϕ(x̃i, x̃i) = Δ² + 1 + 1/C := Δϕ² is also constant. Since Σ_{i∈I} αi = 1, the first term of the objective function (7) becomes a constant and the problem of finding the optimal classifier becomes equivalent to the problem of computing a MEB by setting the kernel kϕ in (7) to
kϕ(x̃i, x̃j) = yi yj (k(xi, xj) + 1) + δ(i, j)/C .   (9)
¹ Note that this condition is straightforward for kernels of the form k(xi, xj) = g(xi − xj), such as a RBF kernel which is commonly used in practice. See [18] for constructions which do not require the normalization condition.
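For concreteness, a short sketch of the induced kernel (9) built on top of a base kernel; the RBF base kernel and the value of C below are example choices, not values from the paper.

```python
# Sketch of the induced kernel (9): k_phi((xi,yi),(xj,yj)) = yi*yj*(k(xi,xj) + 1) + delta(i,j)/C.
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / sigma ** 2)

def k_phi(xi, yi, xj, yj, same_index, base_kernel, C):
    return yi * yj * (base_kernel(xi, xj) + 1.0) + (1.0 / C if same_index else 0.0)

# Diagonal entries are constant (Delta^2 + 1 + 1/C), as required for the MEB equivalence.
print(k_phi([0.1, 0.2], +1, [0.1, 0.2], +1, same_index=True, base_kernel=rbf, C=10.0))
```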
3 Classification of Data Streams Using MEBs
Online learners are mechanisms designed to learn continuously from a stream of data which can neither be predicted in advance nor completely stored before the learning process starts [4,5,9]. This data stream can be modeled as a sequence of input observations {xk : k ∈ I} indexed on I = {0, 1, . . . , T} and associated with an outcome sequence {yk : k ∈ I} which the learner aims to predict. In contrast to the batch model presented previously, online learning takes place in a sequence of rounds. On each round, the learner observes an example xk and makes a prediction ŷk = hk−1(xk) using the current hypothesis hk−1. The learner then has the chance of updating the current hypothesis by using information about the correct outcome yk, usually presented in the form of a loss l(ŷk, yk). An online kernel classifier hence generates a sequence of decision functions {hk} with parameters {wk, bk} which are updated according to the loss lk suffered by the algorithm at each round. Since the goal of an online learner is to make accurate predictions on the newly arriving inputs, online learners are typically designed to minimize the cumulated hinge loss Lc({hk}, S) = Σk lρk−1(hk−1(xk), yk) [9,4] along the sequence of observations, where {hk} denotes the sequence of hypotheses generated by the algorithm, ρk−1 is the margin parameter used by the algorithm before observing xk, and lρ is defined in equation (3). Note that in this framework the loss of the algorithm is computed before the information about the correct outcome is revealed to the learner.
3.1 General Structure of the Method
Let Ik = {0, 1, . . . , k} and Sk = {x̃i = (xi, yi) : i ∈ Ik} be the subset of items revealed to the learner up to round k, and ϕ(Sk) = {zi : i ∈ Ik} the corresponding image of Sk under the mapping induced by the L2-SVM kernel defined at equation (9). A naive approach to keep a classifier from the data stream may be to periodically compute the L2-SVM on Sk or (equivalently) the MEB of ϕ(Sk) by solving
max(α) :  Φ(α) := Σ_{i∈Ik} αi kϕ(x̃i, x̃i) − Σ_{i,j∈Ik} αi αj kϕ(x̃i, x̃j)   (10)
s.t.  Σ_{i∈Ik} αi = 1,  αi ≥ 0  ∀i ∈ Ik .
This approach requires however the full storage of the stream and several passes through the data stream on the augmented dataset when new observations become available. The basic idea is hence to provide an efficient mechanism to approximate the MEB {B(c∗k, r∗k)} of ϕ(Sk) and recover a classifier from the sequence of approximating balls {Bk}. Denote by α∗k ∈ R^{k+1} the solution of (10) and by α∗k,i one of its coordinates. As shown in [20], the primal variables c∗k and r∗k² are hence given by c∗k = Σ_{i∈Ik} α∗k,i zi and r∗k² = Φ(α∗k). Given an approximation αk to α∗k ∈ R^{k+1} we can thus define the approximating ball Bk = B(ck, rk) at round k by setting
ck = Σ_{i∈Ik} αk,i zi = Σ_{i∈Ik} αk,i ϕ(x̃i) ,   rk² = Φ(αk) ,   (11)
and the corresponding SVM classifier using equation (6). It should be noted that if at a given round k, zk is already contained in the MEB of ϕ(Sk−1), the current MEB is optimal, that is c∗k = c∗k−1 and r∗k = r∗k−1. We could hence implement the following test in order to decide when the current approximation Bk−1 needs to be updated: if zk ∈ Bk−1 we set Bk = Bk−1, otherwise we take a step to improve the current approximation. However, following recent advances in algorithms to compute MEBs, we build our approximations under the concept of the (1 + ε)-MEB [20,3] and initiate an update if and only if zk ∉ B(ck−1, (1 + ε)rk−1) for some predefined ε > 0. Algorithm (1) summarizes the procedure. Note that the approximating ball is initialized as the true MEB of a small subset of s + 1 observations. If s = 1, this MEB can be easily computed by setting c1 = ½ z0 + ½ z1 and r1² = ¼ ‖z0 − z1‖².
Data: A stream {z0, z1, . . .} of featured observations zk = φ(x̃k) = φ(xk, yk); an approximation tolerance ε > 0.
Result: A sequence of approximating balls B1, B2, . . ..
1: Δϕ² ← ‖z0‖²
2: Set Bs = B(cs, rs) to the MEB of the first s + 1 observations
3: for k = s + 1, s + 2, . . . do
4:   if ‖zk − ck−1‖² ≥ (1 + ε)² r²k−1 then
5:     Call an updating rule to compute ck and rk
6:   else
7:     Set ck = ck−1 and rk = rk−1
8:   end
9: end
Algorithm 1. Online Approximating Balls
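A compact sketch of this loop in plain Python (an illustration, not the authors' implementation): it works with explicit feature vectors zk instead of kernel expansions, assumes s = 1 as in the text above, and leaves the updating rule abstract (the two rules are derived in the next subsections).

```python
# Sketch of Algorithm 1 with explicit feature vectors; update_rule(c, r, z) returns (c_new, r_new).
import numpy as np

def online_approximating_balls(stream, update_rule, eps):
    """Online (1+eps)-MEB approximation; assumes s = 1, i.e. the ball is seeded from z0, z1."""
    z0, z1 = (np.asarray(z, dtype=float) for z in stream[:2])
    c = 0.5 * (z0 + z1)                         # MEB of the first two points
    r2 = 0.25 * np.sum((z0 - z1) ** 2)
    balls = [(c, np.sqrt(r2))]
    for z in stream[2:]:
        z = np.asarray(z, dtype=float)
        if np.sum((z - c) ** 2) >= (1.0 + eps) ** 2 * r2:   # z falls outside the (1+eps)-ball
            c, r = update_rule(c, np.sqrt(r2), z)           # OFW or CNP rule (next subsections)
            r2 = r ** 2
        balls.append((c, np.sqrt(r2)))
    return balls
```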
3.2 Derivation of the First Rule (OFW)
Our first approximating rule is an online adaptation of the Frank-Wolfe optimization method, a very general procedure to find the optimum of a concave function by using a constrained form of gradient ascent. This method has been studied in [20] and [6] for the fast computation of MEBs and SVMs, respectively. At the beginning we initialize αk to α⁰k = (α^T_{k−1}, 0)^T, which is equivalent to preserving the current approximating ball Bk−1. Then the rule looks for the best improvement of the quadratic objective function Φ(α⁰k) in the new direction k + 1 given by the last featured observation, that is
ηk = argmax_{η∈[0,1]} Φ( (1 − η)α⁰k + η e^{k+1} ) ,   (12)
where ej denotes the j-th unit vector, that is the vector with all the components equal to zero, except the j-th component. Vector αk is then updated as αk = (1 − η)α0k + ηek+1 . Note that αk is always on the feasible space of (10), that is
Σ_{i∈Ik} αk,i = 1 for any k. On the other hand, kϕ(x̃i, x̃i) = Δϕ² for any i, so the objective function Φ(αk−1) can be written as
Φ(αk−1) = Δϕ² − Σ_{i,j∈Ik−1} αk−1,i αk−1,j kϕ(x̃i, x̃j) = Δϕ² − ‖ck−1‖² ,   (13)
and similarly, Φ((1 − η)α⁰k + η e^{k+1}) = Δϕ² − ‖ck‖², by setting
ck = (1 − ηk) ck−1 + ηk zk ,   (14)
αk = (1 − η) α⁰k + η e^{k+1} .   (15)
Note that equation (14) gives an explicit rule to update the center of the current ball. On the other hand, we have by construction r²k−1 = Φ(αk−1), and thus the first derivative of Φ((1 − η)α⁰k + η e^{k+1}) equals zero by setting η to
ηk^ofw := ( ‖ck−1 − zk‖² − r²k−1 ) / ( 2 ‖ck−1 − zk‖² ) = ( ‖ck−1‖² − zk^T ck−1 ) / ‖ck−1 − zk‖² .   (16)
Since the second derivative equals 2‖ck−1 − zk‖² > 0 and ηk ∈ [0, 1], the value of ηk given above is the solution of (12). Plugging this value of ηk into equation (14) hence defines our first method to update the current approximating ball:
ck = (1 − ηk^ofw) ck−1 + ηk^ofw zk ,   rk² = Φ(αk) = Δϕ² − ‖ck‖² .   (17)
3.3 Derivation of the Second Rule (CNP)
Our second rule corresponds to a relaxation of the quadratic program which represents the optimal classifier at round k. We aim to determine Bk = B(ck, rk) by first computing the minimal change in position of Bk−1 = B(ck−1, rk−1) that puts the coming observation zk inside B(ck, rk−1), and then updating the radius to keep the primal-dual equation rk² = Φ(αk) = Δϕ² − ‖ck‖². This formulation thus replaces the quadratic program (10) by the simpler problem
min(ck) :  ‖ck − ck−1‖²   s.t.  ‖zk − ck‖² ≤ r²k−1 .   (18)
The Lagrangian of this problem is given by L(ck, γk) = ‖ck − ck−1‖² + γk ( ‖zk − ck‖² − r²k−1 ) with multiplier γk ≥ 0. From the Karush-Kuhn-Tucker conditions [13] for optimality (dual feasibility, δL/δck = 0) we have that
ck = (1 − ηk) ck−1 + ηk zk ,   (19)
with ηk = γk/(1 + γk); that is, the new rule has the same form as the rule previously introduced. Note now that γk ≠ 0, because the point zk is not included in the current approximating ball. Thus the Karush-Kuhn-Tucker conditions (vanishing KKT gap: γ · δL/δγ = 0) imply that the solution of (18) is obtained by setting ηk to
ηk^cnp := ( ‖ck−1 − zk‖ − rk−1 ) / ( 2 ‖ck−1 − zk‖ ) .   (20)
Plugging this value of ηk into equation (19) hence defines our second method to update the current approximating ball:
ck = (1 − ηk^cnp) ck−1 + ηk^cnp zk ,   (21)
rk² = Φ(αk) = Δϕ² − ‖ck‖² .
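Under the assumption that all featured points satisfy ‖z‖² = Δϕ² (the normalized-kernel setting above), the two analytical updates can be sketched in a few lines; this is an illustration with explicit feature vectors, not the authors' code.

```python
# Sketch of the two update rules; points z are assumed to satisfy ||z||^2 = delta_phi_sq,
# so that r_k^2 = delta_phi_sq - ||c_k||^2 as in (17) and (21).
import numpy as np

def ofw_update(c_prev, r_prev, z, delta_phi_sq):
    d2 = np.sum((c_prev - z) ** 2)
    eta = (d2 - r_prev ** 2) / (2.0 * d2)        # equation (16)
    c = (1.0 - eta) * c_prev + eta * z           # equation (17)
    return c, np.sqrt(delta_phi_sq - np.sum(c ** 2))

def cnp_update(c_prev, r_prev, z, delta_phi_sq):
    d = np.sqrt(np.sum((c_prev - z) ** 2))
    eta = (d - r_prev) / (2.0 * d)               # equation (20)
    c = (1.0 - eta) * c_prev + eta * z           # equation (21)
    return c, np.sqrt(delta_phi_sq - np.sum(c ** 2))

# Either rule can be plugged into the Algorithm 1 sketch above, e.g.
# balls = online_approximating_balls(stream, lambda c, r, z: ofw_update(c, r, z, dps), eps=1e-3)
```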
4 Simulation Results
We simulate the task of data stream classification by sequentially presenting unseen data to the algorithm. This data was obtained from the following datasets: pendigits (7.4e+03 items, 10 classes), usps (7.2e+03 examples, 10 classes), Kdd-full (4.9e+06 items, 2 classes), Ijcnn (4.9e+04 items, 2 classes) and extended Usps (2.6e+05 items, 10 classes). Datasets Kdd-full, Ijcnn and extended Usps (abbreviated as Usps-ext) were used as in previous research to test the large-scale capabilities of CVMs [17] and are available at [16]. The other problems are available at [7] or [2]. SVMs were trained using a Gaussian kernel k(x1, x2) = exp(−‖x1 − x2‖²/σ²). Multicategory problems are addressed using a OVO scheme [13]. For datasets Kdd-full, Usps-ext and Ijcnn we used the hyper-parameter values reported in [17,15]. For the smaller datasets (≤ 10⁴ examples) hyper-parameters were set according to the values reported in [6]. Algorithm (1) was initialized by randomly extracting a subset of s items corresponding to 1 percent of the stream. The same criterion was used to simulate alternative algorithms. The method proposed in [12] was implemented and is abbreviated here as CPB. The method based on the periodic computation of a new L2-SVM from the union of old and newly arrived observations is denoted as PB. Since this approach needs to solve large quadratic programs on the augmented datasets, we only include this algorithm in the results for medium-scale problems. Following this approach, the model is computed again after s new items have arrived to the system, corresponding to 1 percent of the stream size. Note that a finer period would considerably increase time complexity, since each time the model needs to be recomputed considering the complete sequence of previous observations. Additionally, we allow it to use warm start: each time the model needs to be recomputed, the starting approximating ball is set to the previous approximating ball available in the system. Naturally, this should improve time complexity.
Tables (1) and (2) show the results obtained with the different problems and algorithms. The third column corresponds to the number of classification mistakes cumulated by the algorithms along the sequence of prediction/adjustment rounds. In order to assess computational complexity, we report the total number of kernel evaluations carried out by each algorithm, that is, the number of times that the kernel function kϕ is evaluated on a pair of examples in order to make predictions and compute adjustments. Since this variable is platform independent, it is frequently employed to assess the algorithmic complexity of kernel methods. The last column finally shows the total running times obtained on a 2.40 GHz Intel Core 2 Duo with 2 GB RAM running openSUSE 11.1.

Table 1. Results obtained on medium-scale datasets

Dataset    Rule  Cumulated Errors  Stream size (T)  Kernel Evals  Time (secs)
pendigits  OFW   242               7415             1.56e+07      2.57
pendigits  CNP   599               7415             1.83e+07      3
pendigits  CPB   754               7415             1.72e+06      0.24
pendigits  PB    59                7415             9.59e+09      2541.56
usps       OFW   437               7219             1.55e+07      15.96
usps       CNP   790               7219             2.04e+07      22.65
usps       CPB   917               7219             1.72e+06      1.22
usps       PB    185               7219             2.23e+10      22120.3
Table 2. Results obtained on large-scale datasets

Dataset   Rule  Cumulated Errors  Stream size (T)  Kernel Evals  Time (secs)
kdd-full  OFW   2                 4.89e+06         1.49e+08      28.64
kdd-full  CNP   2                 4.89e+06         1.49e+08      27.50
kdd-full  CPB   7037              4.89e+06         1.30e+08      24.94
usps-ext  OFW   1                 264748           7.82e+07      61.79
usps-ext  CNP   1                 264748           8.04e+07      65.29
usps-ext  CPB   340               264748           7.26e+07      59.41
ijcnn     OFW   2133              49490            3.83e+08      69.21
ijcnn     CNP   2399              49490            3.55e+08      63.04
ijcnn     CPB   3808              49490            1.98e+07      2.28

5 Conclusions
We have introduced two algorithms based on minimal enclosing balls to approximate SVM classifiers from streaming data using a single pass over each incoming item. According to the results of tables (1) and (2) the proposed methods are considerably more accurate than the single-pass method presented in [12] in all cases, at the price of a slightly greater computational complexity. Table (1) shows that the accuracy of the first method proposed in this paper (OFW) becomes
particularly closer to the accuracy obtained from the periodic recomputation of the model. This method is however based on multiple passes through the data items and its computational complexity is, as expected, several orders of magnitude worse than the complexity of single-pass methods.
References 1. Aggarwal, C. (ed.): Data Streams, Models and Algorithms. Springer, Heidelberg (2007) 2. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2010) 3. Clarkson, K.: Coresets, sparse greedy approximation, and the frank-wolfe algorithm. In: Proceedings of SODA 2008, pp. 922–931. SIAM, Philadelphia (2008) 4. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passiveaggressive algorithms. J. of Machine Learning Research 7, 551–585 (2006) 5. Dekel, O., Shalev-Shwartz, S., Singer, Y.: The forgetron: A kernel-based perceptron on a budget. SIAM Journal of Computing 37(5), 1342–1372 (2008) 6. Frandi, E., Gasparo, M.-G., Lodi, S., Ñanculef, R., Sartori, C.: A new algorithm for training sVMs using approximate minimal enclosing balls. In: Bloch, I., Cesar Jr., R.M. (eds.) CIARP 2010. LNCS, vol. 6419, pp. 87–95. Springer, Heidelberg (2010) 7. Hettich, S., Bay, S.: The UCI KDD Archive (2010), http://kdd.ics.uci.edu 8. Kivinen, J.: Online learning of linear classifiers, pp. 235–257 (2003) 9. Kivinen, J., Smola, A.J., Williamson, R.C.: Online learning with kernels. IEEE Transactions on Signal Processing 52(8), 2165–2176 (2004) 10. Lodi, S., Ñanculef, R., Sartori, C.: Single-pass distributed learning of multi-class svms using core-sets. In: Proceedings of the SDM 2010, pp. 257–268. SIAM, Philadelphia (2010) 11. Léon Bottou, D.D., Chapelle, O., Weston, J. (eds.): Large Scale Kernel Machines. MIT Press, Cambridge (2007) 12. Rai, P., Daumé, H., Venkatasubramanian, S.: Streamed learning: one-pass svms. In: IJCAI 2009: Proceedings of the 21st International Jont Conference on Artifical Intelligence, pp. 1211–1216. Morgan Kaufmann Publishers, San Francisco (2009) 13. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001) 14. Shalev-Shwartz, S., Singer, Y.: A primal-dual perspective of online learning algorithms. Machine Learning 69(2-3), 115–142 (2007) 15. Tsang, I., Kocsor, A., Kwok, J.: Simpler core vector machines with enclosing balls. In: ICML 2007, pp. 911–918. ACM, New York (2007) 16. Tsang, I., Kocsor, A., Kwok, J.: LibCVM Toolkit (2009) 17. Tsang, I., Kwok, J., Cheung, P.-M.: Core vector machines: Fast svm training on very large data sets. Journal of Machine Learning Research 6, 363–392 (2005) 18. Tsang, I., Kwok, J., Zurada, J.: Generalized core vector machines. IEEE Transactions on Neural Networks 17(5), 1126–1140 (2006) 19. Wang, D., Zhang, B., Zhang, P., Qiao, H.: An online core vector machine with adaptive meb adjustment. Pattern Recognition 43(10), 3468–3482 (2010) 20. Yildirim, E.A.: Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization 19(3), 1368–1391 (2008) 21. Zarrabi-Zadeh, H., Chan, T.M.: A simple streaming algorithm for minimum enclosing balls. In: Proceedings of the CCCG 2006 (2006)
X-ORCA - A Biologically Inspired Low-Cost Localization System
Enrico Heinrich, Marian Lüder, Ralf Joost, and Ralf Salomon
University of Rostock, 18051 Rostock, Germany {enrico.heinrich,marian.lueder,ralf.joost,ralf.salomon}@uni-rostock.de
Abstract. In nature, localization is a very fundamental task for which natural evolution has come up with many powerful solutions. In technical applications, however, localization is still quite a challenge, since most ready-to-use systems are not satisfactory in terms of costs, resolution, and effective range. This paper proposes a new localization system that is largely inspired by the auditory system of the barn owl. A first prototype has been implemented on a low-cost field-programmable gate array and is able to determine the time difference of two 300 MHz signals with a resolution of about 0.02 ns, even though the device is clocked as slow as 85 MHz. X-ORCA is able to achieve this performance by adopting some of the core properties of the biological role model. Keywords: hardware implementation, robotics, architecture.
1 Introduction
Localization is a process in which some reference points, angles, and distances are used in order to determine the coordinates of new, so-far unknown points. For this task, nature provides several quite powerful solutions. One particularly interesting solution is provided by the auditory system of the barn owl [7]. This solution propagates the sensory information along some neural pathways across the owl's brain. Since the two “wires” are anti-parallel, the attached phase (or correlation) detectors all observe different time delays between the two acoustic signals that originate from the owl's ears. Section 2 proposes a technical model, called X-ORCA, that adopts some of the main properties of the biological role model. Conceptually, the correlation neurons are modeled by phase detectors. Each phase detector consists of a simple XOR gate and a counter. The counter value represents the average firing rate of the modeled neuron, and is displayed as a simple number. Internally, these phase detectors are placed along two anti-parallel “delay wires”. Since these wires go along opposite directions, all the phase detectors observe different signal phases, just as in the barn owl's auditory system. In the domain of electrical engineering, electromagnetic signals are often preferred over acoustic ones, since they travel very large distances with high reliability and low energy consumption. However, electromagnetic signals travel with the speed of light c ≈ 3·10⁸ m/s, which makes them quite challenging for every digital
system, when it comes to high resolutions: a difference in length of Δx = 1 cm, for example, corresponds to a time difference of Δt ≈ 33 ps. Because the X-ORCA system is intended to detect signal delays in the range of a few picoseconds, the aforementioned delay wires are made of regular passive wires, as can be found inside any digital circuit. A first prototype has been implemented on an Altera Cyclone II field programmable gate array (FPGA) [2]. Such an FPGA is a digital device, which consists of a very large number of simple logical gates. These gates can be properly interconnected by using a hardware description language. Because of this hardware-oriented realization approach, such a system can be operated in situ. Section 3 provides all the technical implementation details as well as the experimental setup. The practical experiments are summarized in Section 4, and show that already this first X-ORCA prototype yields a resolution of about 0.02 ns. Finally, Section 5 concludes this paper with a brief discussion.
2 The X-ORCA Localization System
This section presents the X-ORCA architecture in three parts. The first part starts off by clarifying the physical setup and all the assumptions made in this paper. Then, the second part explains X-ORCA's core principles. In so doing, it makes a few assumptions that might seem practically implausible to some readers. However, the third part elaborates on how the X-ORCA architecture and the assumptions made in the second part can be fully realized on standard circuits.
2.1 Physical Setup and Preliminaries
Since the aim of a single X-ORCA instance is to determine the phase shift Δϕ between two incoming signals, it can be used as the core of a one-dimensional localization system. It thus adopts a standard setup (see also Fig. 1) in which a transmitter T emits a signal s(t) = A sin(2πf(t − t0)) with frequency f, amplitude A, and time offset t0. Since this signal travels with the speed of light c ≈ 3 · 10⁸ m/s, it arrives at the receivers R1 and R2 after some delays Δt1 = (L + Δx)/c and Δt2 = (L − Δx)/c. Both receivers employ an amplifier and a Schmitt trigger, and thus feed the X-ORCA system with the two rectangular signals r1(t − t0) and r2(t − t0) that both have frequency f. By estimating the phase shift Δϕ between these two signals r1(t − t0) and r2(t − t0), X-ORCA then determines the time difference Δt = t1 − t2 = Δϕ/(2πf), in order to arrive at the transmitter's off-center position Δx = Δt·c/2. It might be, though, that both the physical setup and the X-ORCA system have further internal delays, such as switches, cables of different lengths, repeaters, and further logical gates. However, these internal delays are all omitted, since they can be easily eliminated in a proper calibration process. Furthermore, for a real-world three-dimensional scenario, the X-ORCA system simply has to be duplicated twice.
Fig. 1. X-ORCA assumes a standard, one-dimensional setup in which the time difference Δt = t1 − t2 = 2Δx/c is a result of the transmitter’s off-center position Δx. It indirectly determines Δt = Δϕ/(2πf ) by estimating the phase shift Δϕ between the two incoming signals r1 (t) and r2 (t).
2.2 The System Core
Essentially, the X-ORCA core consists of a large number of independently operating phase detectors. One of these phase detectors is illustrated in Fig. 2. It consists of a logical XOR and a counter. The XOR “mixes” the two input signals s1 and s2 , and yields a logical 1 or a logical 0 on whether the two signals differ or not. In other words, the degree of how both signals differ from each other corresponds to the phase shift Δϕ, and is represented as the proportion
Fig. 2. An X-ORCA phase detector consists of a logical XOR (or any other suitable binary logic function), which “mixes” the two input signals s1 and s2 , and an additional counter to actually determine the phase shift Δϕ
Fig. 3. X-ORCA places all phase detectors along two reciprocal (anti-parallel) “delay” wires w1 and w2 on which the two signals r1 (t) and r2 (t) travel with approximately two third of the speed of light cw ≈ 2/3c. Because the two wires w1 and w2 are reciprocal, all phase detectors have different internal delays τi .
of logical 1's per time unit. This proportion is evaluated by the counter that is attached to the XOR gate. For example, let us assume an input signal with a frequency of f = 100 MHz and a phase shift of Δϕ = π/4 = 45°. Then, if the counter is clocked at a rate of 10 GHz over a signal's period T = 1/(100 MHz) = 10 ns, the counter will assume a value of v = 25. At this point, three practical remarks should be made: (1) The XOR gate has been chosen for purely educational purposes; any other suitable binary logic function, such as AND, NAND, OR, and NOR, could have been chosen as well. (2) A counter clock rate of 10 GHz is quite unrealistic for technical reasons, but Subsection 2.3 shows how such clock rates can be virtually achieved. (3) A result of a phase shift Δϕ = π/4 = 45°, for example, is intrinsically ambiguous, since the system cannot distinguish between Δϕ = π/4 = 45° and Δϕ = −π/4 = −45°. In order to resolve the ambiguity of a single phase detector, X-ORCA simply employs more than just one. Figure 3 shows that X-ORCA places all phase detectors along two reciprocal (anti-parallel) “delay” wires w1 and w2 on which the two signals r1(t) and r2(t) travel with approximately two thirds of the speed of light, cw ≈ 2/3 c. Because the two wires w1 and w2 are reciprocal, all phase detectors have different internal delays τi, which always add to the external delay Δt = 2Δx/c that is due to the transmitter's off-center position Δx. As a consequence, each phase detector i observes an effective time delay Δt + τi and thus a phase shift Δϕi = 2πf(Δt + τi). Further post-processing stages become particularly easy if the internal delays τimax − τimin = T = 1/f span the entire range of a period T of the localization signal s(t). For a first estimate of the transmitter's off-center position Δx it would suffice to determine the phase detector i that has the smallest counter value vimin = min{vi}; only those phase detectors i have a counter value close to zero for which the condition τi ≈ −Δt holds.
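A small software model of one such phase detector (a sketch, not the FPGA design) reproduces the numbers of this example: two rectangular 100 MHz signals with a 45° phase shift, sampled at 10 GHz, yield close to 25 XOR counts per 100-sample period.

```python
# Software model of one XOR phase detector (a sketch, not the FPGA implementation).
import numpy as np

f = 100e6             # localization signal frequency (Hz)
clk = 10e9            # counter clock (Hz): 100 samples per signal period
phase = np.pi / 4     # 45 degree phase shift between the two inputs
periods = 196
t = np.arange(int(periods * clk / f)) / clk

# Rectangular (Schmitt-triggered) versions of the two incoming signals.
r1 = np.sin(2 * np.pi * f * t) >= 0
r2 = np.sin(2 * np.pi * f * t - phase) >= 0

counter = np.sum(np.logical_xor(r1, r2))
print(counter / periods)                 # close to 25 counts per 100-sample period
print(np.mean(np.logical_xor(r1, r2)))   # close to phase / pi = 0.25
```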
Fig. 4. Due to the inherent rise and fall times, a change in a gate’s output requires some time. Therefore, if the input frequency increases too much or if the input edges come too close together, the gate cannot properly change its output (right-hand-side).
Furthermore, in case all phase detectors are sorted in ascending order, i.e., τi ≤ τi+1, the counter values vi assume a V-shaped curve. Thus, X-ORCA might also utilize all phase detectors for reconstructing Δx by, for example, calculating the best-fitting curve.
2.3 Real-World Implementation Details
The description presented in Subsection 2.2 has made a few, practically unrealistic assumptions, which are more or less concerned with the maximal frequency f that can be processed by the phase detectors. First of all, the X-ORCA concept has assumed that the clock frequency clk ≥ 100 × f is at least 100 times higher than the frequency of the localization signal s(t) in order to achieve a practically relevant resolution. A signal frequency of f = 100 MHz, for example, would require a clock frequency of at least clk = 10 GHz. Such a clock frequency, however, would be way too unrealistic for low-cost devices, such as FPGAs. In case of periodic localization signals, however, a virtually very high frequency can be achieved by a technique, known as unfolding-in-time [6]. Let us assume, for example, a signal with frequency f and thus a period of T = 1/f . Then, the samples could be taken at 0, t, 2t, . . . , (n − 1)t, with t = T /n denoting the interval between two consecutive samples, and n denoting the number of samples per signal period T . Then, unfolding-in-time means that the samples are taken at 0, (t+T ), 2(t+T ), . . . , (n−1)(t+T ). That is, the sampling process is expanded over an extended interval with duration nT . Moreover, unfolding-in-time does not necessarily stick to an increment of “t + T ”. For example, the samples can also be taken at 0, (kt + T ), 2(kt + T ), . . . , (n − 1)(kt + T ), with k denoting a constant that is prime to n. The second assumption concerns the electrical transition behavior of the XOR gates as well as the counters. The conceptual description of Subsection 2.2 implicitly assumes that gates and counters are fast enough to properly process signals that travel along the internal wires with about two third of the speed of light. The technical suitability of this approach might be surprising to some readers but has already been shown by previous research [8]. That research has also shown that due to technical reasons, such as thermal noise, the logic gates do not yield exact results but that they exhibit a rather random behavior if, for
example, set and hold time requirements are not met. This random effect can be statistically compensated, for example, by a large number of processing elements, which is another reason for employing a large number of phase detectors in the X-ORCA architecture. The third implementation remark concerns the processing speed of the gates and the input parts of the counters. Figure 4 shows that if the phase shift gets too small (or too close to 180°), the rise and fall times prevent the gate from properly switching its output state. These effects lead to small errors in the counter values if the phase shift Δϕ is close to zero or 180°; as a result, the expected V-shaped curve of the counter values (Subsection 2.2) might change to a U-shape.
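The following sketch illustrates the unfolding-in-time scheme described above: the sampling instants 0, (kt + T), 2(kt + T), . . . taken modulo the signal period T still visit all n sub-period sampling slots whenever k is prime to n; the concrete values of n and k are assumptions chosen only for the example.

```python
# Sketch of unfolding-in-time: sample instants 0, (k*t + T), 2(k*t + T), ... cover the period T.
from fractions import Fraction

n = 100                  # desired samples per signal period
k = 7                    # increment factor, chosen prime to n
T = Fraction(1, 100)     # signal period (arbitrary units); t = T/n is the sub-period step
t = T / n

positions = sorted(((i * (k * t + T)) % T) / t for i in range(n))
print(positions == [Fraction(j) for j in range(n)])   # True: all n sub-period slots are hit
```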
3 Methods
The first X-ORCA prototype was implemented on an Altera Cyclone II FPGA [2]. This device offers 33,216 logic elements and can only be clocked at about 85 MHz. The chosen FPGA development board is a low-cost device that costs about 500 USD. On the top-level view, the X-ORCA prototype consists of 140 phase detectors, a common data bus, a Nios II soft core processor [3], and a system PLL that runs at 85 MHz. The Nios II processor manages all the counters of the phase detectors, and reports the results via an interface to a PC. Due to the limited laboratory equipment, the transmitter, its localization signal s(t), the two receivers R1 and R2, and their distances to the transmitter are all emulated on the very same development board. The transmitter and its localization signal s(t) are realized by means of a second PLL, which runs at 300 MHz, whereas the receivers and physical distances are realized by means of some active delay lines. It should be noted, though, that X-ORCA's internal “delay wires” w1 and w2 are realized as pure passive internal wires, connecting the device's logic elements, as previously announced in Subsection 2.2. In a second experiment, the prototype utilized an external 19 MHz signal and emulated the transmitter-to-receiver distances by external line stretchers [1].
4 Results
Figures 5-8 summarize the experimental results that the first X-ORCA prototype has achieved under different configurations. Unless otherwise stated, the figures present the counter values vi of n = 140 different phase detectors, which were clocked at a rate of 85 MHz. In Fig. 5, the prototype was exposed to two 300 MHz (localization) signals that have a zero phase shift Δϕ = 0. The input signals were sampled 1,000,000 times, which corresponds to an averaging over 196 periods, with virtually 5100 samples per period of the localization signal (please, see also the discussion presented in Subsection 2.3). It can be clearly seen that the minimum is at counter #31 and that the counters to the left and right have larger values as can be expected from
Fig. 5. The figure shows the counter values vi of n = 140 phase detectors when fed with two 300 MHz signals with zero phase shift Δϕ = 0
Fig. 6. The figure shows the counter values vi of n = 140 phase detectors when fed with two 300 MHz signals with zero phase shift Δϕ = 0 (solid line), with −43◦ phase shift Δϕ = −43◦ (dotted line), and with +43◦ phase shift Δϕ = +43◦ (dashed line)
X-ORCA’s internal architecture. In addition, Fig. 5 reveals some technological FPGA internals that might be already known to the expert readers: neighboring logic elements do not necessarily have equivalent technical characteristics and are not interconnected by a regular wire grid. As a consequence, the counter values vi and vi+1 of two neighboring phase detectors do not steadily increase or decrease, which makes the curve look a bit rough. Figure 6 shows the results of the prototype when the two input signals have one of the following three time delays Δt = t1 − t2 ∈ {−0.4 ns, 0 ns, +0.4 ns}. It can be clearly seen that a time delay of 0.4 ns shifts the “counter curve” by about 20 counters. This observation suggests that the prototype would be able to detect a time delay as small as Δt = 0.02 ns.
Fig. 7. The figure shows the counter values vi of n = 140 phase detectors when fed with two 19 MHz signals with zero phase shift Δϕ = 0 (dashed line), with about −0.3° phase shift Δϕ ≈ −0.3° (solid line), and with about +0.3° phase shift Δϕ ≈ +0.3° (dashed line)
Fig. 8. The figure shows the delay value indicator resulting from adjustable delay line lengths when fed with two 19 MHz signals
A closer look at Figs. 5 and 6 reveals that the graphs are not exactly V-shaped but rather U-shaped at the very bottom. This is because the effects already discussed in Fig. 4 come into play. Figure 7 shows the behavior of the X-ORCA architecture when using the external 19 MHz localization signal. In this experiment, one of the connections from the function generator to the input pad of the development board was established by a line stretcher [1], whereas the other one was made of a regular copper wire. Figure 7 shows the values vi of the n = 140 counters, which were still clocked at 85 MHz over a measurement period of 10,000,000 ticks. The three
graphs refer to a phase shift of Δϕ ∈ {−0.3°, 0°, +0.3°}, which corresponds to time delays Δt ∈ {−0.15 ns, 0 ns, +0.15 ns}. It should be noted that the graph of this figure appears as a straight line, since the internal time delays τi span much less than an entire period of the 19 MHz signal, which is significantly lower than the previously used 300 MHz signal (both experiments used exactly the same X-ORCA system). Figure 8 presents a different view of Figure 7: in the graph, every dot represents the sum vtot = Σi vi of all n = 140 counter values vi; that is, an entire graph of Fig. 7 is collapsed into one single dot. The graph shows 29 measurements in which the line stretcher was extended step by step by 1 cm. It can be seen that a length difference of Δx = 1 cm decreases vtot by about 20. This result suggests that with a 19 MHz localization signal, X-ORCA is able to detect a length difference of about Δx = 1 mm, which equals a time resolution of about 0.015 ns.
5 Discussion
This paper has presented a new localization architecture, called X-ORCA. Its main purpose is the localization of transmitters, such as WLAN network cards or Bluetooth dongles, that emit electromagnetic signals. At its core, X-ORCA consists of a large number of very simple phase detectors, which are mounted along two passive wires with very small but finite internal time delays. This large number of rather unreliable phase detectors allows X-ORCA to perform a rather reliable statistical evaluation. The X-ORCA architecture has been heavily inspired by the biological role model, i.e., the auditory system of the barn owl. In this adaptation process, X-ORCA relies on a large number of rather unreliable simple phase detectors, which exhibit rather unreliable results. However, by averaging over a large number of entities, as the role model suggests, X-ORCA arrives at a quite reliable and accurate result. Since the role model's neurons were emulated in re-configurable, physical hardware, the system is able to process electromagnetic signals rather than acoustic signals. The switch in the utilized media is of practical importance for many real-world applications, such as the localization of persons and/or objects in laboratory environments. Unfortunately, the available laboratory equipment did not allow us to test the true limits of the first prototype. This particularly applies to the maximal frequency f of the localization signal and to the achievable resolution with respect to Δx. These tests will certainly be the subject of future research. Future research will also be devoted to the integration of wireless communication modules. The best option seems to be the utilization of a software-defined radio module, such as the Universal Software Radio Peripheral 2 (USRP2) [5]. Finally, future research will port the first prototype onto more state-of-the-art development boards, such as an Altera Stratix V FPGA [4].
Acknowledgements. The authors gratefully thank Volker Kühn and Sebastian Vorköper for their helpful discussions. This work was supported in part by the DFG graduate school 1424. Special thanks are due to Matthias Hinkfoth for valuable comments on draft versions of the paper.
References
1. Microlab: Line Stretchers, SR series. Datasheet, Microlab Company (2008)
2. Altera Corp., San Jose, CA: Nios Development Board, Cyclone II Edition, Reference Manual. Altera Document MNLN051805-1.3 (2007)
3. Altera Corp., San Jose, CA: Nios II Processor Reference Handbook. Altera Document NII5V1-7.2 (2007)
4. Altera Corp., San Jose, CA: Stratix V Device Handbook. Altera Document SV5V11.0 (2010)
5. Ettus Research LLC, http://www.ettus.com
6. Hertz, J., Krogh, A., Palmer, R.G.: Introduction to the Theory of Neural Computation. Addison-Wesley Pub. Co., Redwood City (1991)
7. Kempter, R., Gerstner, W., van Hemmen, J.L.: Temporal coding in the submillisecond range: Model of barn owl auditory pathway. Advances in Neural Information Processing Systems 8, 124–130 (1996)
8. Salomon, R., Joost, R.: Bounce: A new high-resolution time-interval measurement architecture. IEEE Embedded Systems Letters (ESL) 1(2), 56–59 (2009)
On the Origin and Features of an Evolved Boolean Model for Subcellular Signal Transduction Systems
Branko Šter¹, Monika Avbelj², Roman Jerala², and Andrej Dobnikar¹
¹ Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, 1000 Ljubljana, Slovenia
[email protected]
² National Institute of Chemistry, Hajdrihova 19, 1001 Ljubljana, Slovenia
Abstract. In this paper we deal with an evolved Boolean model of the subcellular network for a hypothetical subcellular task that performs some of the basic cellular functions. The Boolean network is trained with a genetic algorithm and the obtained results are analyzed. We show that the size of the evolved Boolean network relates strongly to the task, that the number of output combinations is decreased, which is in concordance with biological (measured) networks, and that the number of non-canalyzing inputs is increased, which indicates its specialization to the task. We conclude that the structure of the evolved network is biologically relevant, since it incorporates properties of evolved biological systems. Keywords: Subcellular networks, Simulation, Genetic algorithms, Regression.
1 Introduction
Recent studies in biochemistry, molecular biology and information processing networks have opened an important area of research: analysis and modeling of intracellular signal-transduction networks [1,2,3,4]. The main goal is to understand the origin, the features and the information processing of subcellular networks of genes or proteins. It has already been shown with the help of simulations that a discrete Boolean model of the signal-transduction network is able to simulate intracellular mappings from surface receptors to an output set of genes or proteins [5]. In that case, however, the logic tables of the nodes within the Boolean model and the interactions between the nodes and/or input receptors were taken from extensive experimental work and a huge set of network elements performing a simple classification task [6]. In this paper we show some preliminary results of an evolved Boolean model of the subcellular network for a hypothetical subcellular task. We found that a) the number of nodes of the evolved Boolean network and its number of inputs per node k are considerably related to the size of the task, b) the number of
output attractors decreases significantly with the evolution of the model, c) the number of non-canalyzing combinations is greatly increased in the evolved model, and d) the structure of the evolved model is biologically reasonable. Our results are in concordance with those obtained experimentally [5]. The origin of the Boolean network for subcellular tasks via evolution and its information-processing function of nontrivial clustering are the main contributions of the paper. The resulting structure for the pre-defined task also gives some insight into the natural features of the network. The paper is organized as follows. In Section 2 we give some background on the Boolean model of the subcellular biological signal-transduction network and describe an example subcellular task. Section 3 details the evolution of the model and gives the main results of the procedure related to the case-study task, together with the features of the evolved structure. In the conclusion, we comment on the results and open some new ideas for future work.
2 Boolean Model of Subcellular Signal-Transduction Network
The main objective of the Boolean network modeling is to study generic coarse-grained properties of large subcellular signal-transduction networks. In particular, the logical functions of nodes (genes or proteins) and their interactions are investigated via 'goal-oriented evolution', where the 'goal' is a subcellular task to be performed by the network and the evolution describes some natural search procedure for the proper structure of the model. The functions of the nodes and their connections are unknown (random) at the beginning of the evolution. The result of the procedure is a set of node functions and interconnections such that the task is performed correctly. Searching for the right structure of the Boolean model is a huge combinatorial problem, even for a rather small task or a correspondingly small network. Considering that only input receptors and output nodes are known for some realistic subcellular task, the obvious unknowns are: the number of hidden nodes, the set of possible node functions, the number of inputs to the nodes, the topology of the network or connectivity plan, etc. Fortunately, some simplifications that do not significantly change the nature of the problem are possible. Instead of the complete set of possible functions of the nodes, only the set of biologically relevant [7] canalyzing functions is considered, which results in a substantial reduction of the set, from 2^(2^n) to 2 · 2^n (Table 1a), where n is the number of inputs to the nodes. It is well known that NAND and NOR are universal logical functions and are also canalyzing. A function is canalyzing if, in all but one input combination, a single input variable alone defines the output value. Table 1b lists all possible canalyzing functions with two input variables, n = 2, where active inputs and outputs take all possible combinations. Obviously c1 = NOR and c8 = NAND. For example, in c1 = i1 ↓ i2 = ¬(i1 ∨ i2) = ¬i1 · ¬i2, the active value for either input is 1, which activates the majority output value 0; in c8 = ¬(i1 · i2) = ¬i1 ∨ ¬i2, the active value for both inputs is 0, but the activated output is 1.
Table 1. Number of possible canalyzing functions (a) and the set for n = 2 (b)

(a)
  n   2^(2^n)   2 · 2^n
  1        4         4
  2       16         8
  3      256        16
  4      64k        32

(b)
  i1  i2 | c1  c2  c3  c4  c5  c6  c7  c8
   0   0 |  1   0   0   0   0   1   1   1
   0   1 |  0   1   0   0   1   0   1   1
   1   0 |  0   0   1   0   1   1   0   1
   1   1 |  0   0   0   1   1   1   1   0
In c3 = i1 · ¬i2, the active value of i1 is 0 and the active value of i2 is 1, and the active output is 0. Canalyzing functions can also be described as the functions that are closest to the constant functions, as only a single input combination of all input variables (also called the non-canalyzing input combination) leads to the other (non-active) function value. Another simplification of the large combinatorial problem follows from the reduced set of possible functions of the nodes in the Boolean network. As only one input variable to a node (with a canalyzing function) defines the output value in most cases, we can limit our search to networks with a constant number of input variables for all nodes, denoted by k, which is clearly the important parameter of the evolving procedure. A Boolean network can be described as a directed graph G = (V, E), where V is a set of vertices or nodes and E a set of oriented edges, where each edge is an ordered pair of nodes. It is convenient to label the set of nodes with integers, V = (1, 2, .., v) for a graph of v nodes, and (j, i) represents a directed link from node j to node i. A graph with v nodes is completely specified by a v × v matrix, C = (cij), which is called the adjacency matrix of the graph. cij is the element in the i-th row and j-th column of C and is equal to unity if E contains a directed link (j, i), and zero otherwise. The adjacency matrix C is non-negative because it has no negative entries, which implies the existence of a real eigenvalue λ (a root of the characteristic equation of the adjacency matrix, |C − λI| = 0) with an eigenvector x = (x1, .., xv) of C such that Cx = λx. It is possible to study the presence or absence of closed paths in a graph from the largest real eigenvalue λ1 (Perron-Frobenius theorem) [8] in the following way: 1. there is no closed path if λ1(C) = 0; 2. there is a closed path if λ1(C) ≥ 1. Node i in the graph performs a particular Boolean function fi from the list of all possible logical (canalyzing) functions with only two possible values (states), True (1) or False (0). The global state of the network in discrete time t is represented by the set of all current function values of the nodes, F(t) = (f1(t), ..., fv(t)). The dynamics of the network are given by the sequence F(t), F(t+1), F(t+2), ..., which is the consequence of the current state F(t) and the current value of the input (receptor) vector. Because of the general topology of the network, with possible closed loops (cycles) between nodes, it is possible that the network responds to different inputs with sequences of different lengths, where the length is the number of responding global states from the starting state to the attractor state or cycle.
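To make the node model concrete, the following minimal Python sketch represents each node by its non-canalyzing input combination and the corresponding output value, and performs one synchronous update step F(t) → F(t+1). The class and function names are illustrative assumptions rather than part of the original model.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CanalyzingNode:
    inputs: Tuple[int, ...]          # indices of the k source signals (receptors or nodes)
    non_canalyzing: Tuple[int, ...]  # the single input combination with the non-active output
    non_active_output: int           # output for that combination; all others give 1 - this value

    def evaluate(self, signals: List[int]) -> int:
        combo = tuple(signals[i] for i in self.inputs)
        if combo == self.non_canalyzing:
            return self.non_active_output
        return 1 - self.non_active_output

def update(nodes: List[CanalyzingNode], receptors: List[int], state: List[int]) -> List[int]:
    """One synchronous step: every node reads the receptor inputs and the previous
    global state F(t) and produces its value in F(t+1)."""
    signals = receptors + state      # global input vector followed by node outputs
    return [node.evaluate(signals) for node in nodes]

# Example: c1 = NOR of receptors 0 and 1 (non-canalyzing combination 00 -> output 1).
nor_node = CanalyzingNode(inputs=(0, 1), non_canalyzing=(0, 0), non_active_output=1)
print(update([nor_node], receptors=[0, 0], state=[0]))   # [1]
print(update([nor_node], receptors=[1, 0], state=[0]))   # [0]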
The attractor state is the global state that no longer changes, provided the input is not changed. The attractor cycle is a sequence of several global states that continues to change periodically. By considering responses to different global states at different inputs, one can observe some interesting information processing (nontrivial clustering), which differs significantly between the starting (random) and the evolved Boolean network. For the purpose of illustrating the subcellular modeling, we use a hypothetical system, mimicking important properties of an organism, that should be executed by the unknown network and can be described as follows. The network has nine receptor inputs, five of them representing different danger signals: D1 and D2 - bacterial infection, D3 and D4 - viral infection, D5 - cellular injury, and four representing different sources of energy (food): F1 - proteins, F2 - carbohydrates, F3 - lipids, F4 - sugar. In this way the system is equipped with the possibility of either increasing or decreasing its energy according to fitness. The system can respond to the input signals through seven outputs for activating different metabolic and defence genes that can help the organism to respond to the danger and utilize the available food sources: MG - general metabolic gene, PG - protein metabolism, CG - metabolism of carbohydrates, TG - lipid metabolism, DG - generalized defense against danger, BG - defense against bacteria, VG - defense against viruses. The logical mapping of the task is shown in Table 2, together with the probabilities, which are based on the energy consumption/acquisition of the network. There are 2^9 = 512 possible inputs in the table. Only nine of them are basic (first group) and have biologically established outputs. For all other input combinations, the reasonable outputs are superimposed relative to the basic entries (second group), with an exception in the case of simultaneously active Fs and Ds. In that case, only the influence of Ds is considered. For example, if D1 and D3 are active (1), then DG, BG and VG are set, while if D1, D2 and F1 are active, then only DG and BG are taken into account, while MG and PG are ignored.

Table 2. Input output table of the network for the task under discussion; p is the probability of the entry in the table, based on energy consumption/acquisition

  D1 D2 D3 D4 D5 F1 F2 F3 F4 | DG BG VG MG PG CG TG |  p
   1  0  0  0  0  0  0  0  0 |  1  1  0  0  0  0  0 |
   0  1  0  0  0  0  0  0  0 |  1  1  0  0  0  0  0 |
   0  0  1  0  0  0  0  0  0 |  1  0  1  0  0  0  0 |
   0  0  0  1  0  0  0  0  0 |  1  0  1  0  0  0  0 |
   0  0  0  0  1  0  0  0  0 |  1  1  1  0  0  0  0 | 0.6
   0  0  0  0  0  1  0  0  0 |  0  0  0  1  1  0  0 |
   0  0  0  0  0  0  1  0  0 |  0  0  0  1  0  1  0 |
   0  0  0  0  0  0  0  1  0 |  0  0  0  1  0  0  1 |
   0  0  0  0  0  0  0  0  1 |  0  0  0  1  0  0  0 |
   more than 1 danger        | superposition of outputs |
   more than 1 food          | superposition of outputs | 0.4
   combinations of Ds & Fs   | Fs are ignored           |
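The superposition rule described above can be written compactly; the sketch below builds the target output vector for an arbitrary input combination (OR of the basic responses, with food inputs ignored whenever a danger signal is present). The single-input responses follow Table 2; the function name and representation are our own illustrative choices.

# Basic single-input responses from Table 2; outputs are (DG, BG, VG, MG, PG, CG, TG).
BASIC = {
    "D1": (1, 1, 0, 0, 0, 0, 0), "D2": (1, 1, 0, 0, 0, 0, 0),
    "D3": (1, 0, 1, 0, 0, 0, 0), "D4": (1, 0, 1, 0, 0, 0, 0),
    "D5": (1, 1, 1, 0, 0, 0, 0),
    "F1": (0, 0, 0, 1, 1, 0, 0), "F2": (0, 0, 0, 1, 0, 1, 0),
    "F3": (0, 0, 0, 1, 0, 0, 1), "F4": (0, 0, 0, 1, 0, 0, 0),
}

def target_outputs(active_inputs):
    """Superimpose (OR) the basic responses; ignore food inputs if any danger is active."""
    if any(name.startswith("D") for name in active_inputs):
        active_inputs = [name for name in active_inputs if name.startswith("D")]
    out = [0] * 7
    for name in active_inputs:
        out = [a | b for a, b in zip(out, BASIC[name])]
    return tuple(out)

print(target_outputs(["D1", "D3"]))        # DG, BG and VG are set
print(target_outputs(["D1", "D2", "F1"]))  # F1 is ignored: only DG and BG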
For sustainable performance (in the sense of retained energy) of the network, the probabilities for the two groups of entries in the truth table are derived. They are used for proper selection of the entries from the table during evolution and operation. The probabilities are related to the energy consumption/acquisition, which is based on biological reasoning and therefore has a direct influence on the resulting (evolved) networks. In the case of food (without danger), any active food input Fi, i = 1, ..., 4, increases the energy of the network by 10 if the appropriate gene to utilize this resource is activated, while any output activated due to some danger decreases the energy by 5 unless the appropriate danger response is activated, in which case the energy does not decrease. Using the input-output table for the given task (Table 2), a training set was constructed as follows. With probability p1, only a single input was active at a time, and with probability 1 − p1 several random inputs were activated, each appearing with a probability of 0.25. In the latter case, due to the binomial distribution, mostly two or three inputs were activated. When two or more inputs were activated at the same time, the target outputs were obtained by superposition of the individual outputs (OR function). In addition, when a danger was present, the food inputs were ignored (Table 2). We must find the value of p1. The conservation of energy can be written as

P_fo ΔE_fo + P_d ΔE_d = 0 ,    (1)
where P_fo is the probability of food only (without danger), and P_d = 1 − P_fo is the probability of danger (food may also be present, but is ignored). ΔE_fo = 10 and ΔE_d = −5. P_fo may be written as

P_fo = (4/9) p1 + (1 − p1) Σ_{k=1}^{4} P_9(k) (4/9)^k ,    (2)
where p1 is the unknown probability of a single active input and 1 − p1 is the probability of several inputs, each active with probability 0.25. Since in general

P_n(k) = C(n, k) p^k q^(n−k) ,    (3)

where C(n, k) is the binomial coefficient, we have

P_9(k) = C(9, k) 0.25^k 0.75^(9−k) .    (4)
From Eq. 2 we find p1 = 0.6 and 1 − p1 = 0.4 for the two groups in Table 2, respectively. For different energy values these probabilities would be different.
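The binomial term in Eq. (4) also explains the remark above that mostly two or three inputs are active in the multi-input group. A short check (an illustrative sketch, not the authors' code):

from math import comb

def P9(k, p=0.25):
    # Eq. (4): probability that exactly k of the 9 inputs are active.
    return comb(9, k) * p**k * (1 - p)**(9 - k)

probs = {k: round(P9(k), 3) for k in range(10)}
print(probs)                       # k = 2 and k = 3 carry the largest probabilities
print(max(probs, key=probs.get))   # 2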
3 Evolution and Experimental Results
A genetic algorithm was applied to search for a Boolean network that responds with correct outputs, given the inputs from the training set. It was assumed that each processing element (node) has a unit delay. Due to possible delays in
the network, the output stabilizes only after some time. For this reason, the output was considered after a delay corresponding to the number of nodes on the path from input to output (five in the case of a network with N = 15, six if N = 20 or N = 25; see Figures 2, 3 and 4, respectively). In addition, due to possible attractor loops of length greater than one, the output is checked as many times as there are global states in the maximal attractor cycle. The evaluation function of the genetic algorithm was simply the number of errors at the (binary) outputs, that is, the number of incorrect classifications. Each node had k inputs taken from the global inputs or from other nodes, and a logic value of a canalyzing function. If k = 3, this means that the part of the genotype that relates to the node contains the 3 input values of the non-canalyzing combination and the corresponding output function value. For example, when k = 3, the combination 0001 means that only for the inputs 000 is the output 1 (for every other combination the output is 0). An individual chromosome consisted of this information for all the nodes (this encoding and the genetic operators are sketched below, after Table 3). In the genetic algorithm we applied roulette-wheel parent selection. The crossover was uniform with a probability of 0.2, while the mutation inverted individual bits with a probability of 0.01. Two input parameters of the network were varied: the number of nodes (N) and the number of inputs to a node (k). For each combination of selected N and k, 10 repetitions gave 10 Boolean networks, each evolved over 20,000 generations of the genetic algorithm. Table 3 shows the output errors.

Table 3. Average absolute error (standard deviation) over 10 separately evolved Boolean networks after 20,000 generations of the genetic algorithm. For N = 15 the output is considered after a delay of 5. N is the number of nodes and k is the number of inputs to a node.

  k \ N        10           15           20           25           30           35
    2     226 (111)    15.0 (0.0)   15.0 (0.0)   12.0 (6.3)   13.1 (4.8)   13.5 (4.7)
    3     96.0 (42.3)  13.5 (4.7)    7.2 (7.6)   10.5 (7.2)    7.5 (7.9)   13.5 (4.7)
    4     89.0 (45.0)  13.5 (4.7)    9.0 (7.7)   12.0 (6.3)   16.5 (4.7)   28.5 (4.7)
    5     109 (49.0)   32.5 (14.0)  25.0 (7.1)   30.0 (0.0)   34.5 (12.6)  48.0 (11.1)
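The chromosome encoding and the genetic operators described above can be sketched as follows. This is a minimal illustration with our own function names, interpreting the 0.2 as a per-bit swap probability of the uniform crossover (one possible reading of the text); the connection topology and the error evaluation are not shown.

import random

K = 3                      # inputs per node
GENE_BITS = K + 1          # non-canalyzing input combination + its output value

def random_chromosome(n_nodes):
    # One gene per node: k bits of the non-canalyzing combination, 1 bit of its output.
    return [random.randint(0, 1) for _ in range(n_nodes * GENE_BITS)]

def roulette_select(population, fitnesses):
    total = sum(fitnesses)
    pick = random.uniform(0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return individual
    return population[-1]

def uniform_crossover(parent_a, parent_b, p_swap=0.2):
    child = parent_a[:]
    for i in range(len(child)):
        if random.random() < p_swap:
            child[i] = parent_b[i]
    return child

def mutate(chromosome, p_flip=0.01):
    return [bit ^ 1 if random.random() < p_flip else bit for bit in chromosome]

# Fitness for roulette selection could be, e.g., (worst_error - output_errors), so that
# fewer output errors give a larger slice of the wheel.
population = [random_chromosome(15) for _ in range(50)]
fitnesses = [1.0 for _ in population]        # placeholder values
parent_a = roulette_select(population, fitnesses)
parent_b = roulette_select(population, fitnesses)
child = mutate(uniform_crossover(parent_a, parent_b))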
It is clear from Table 3 that the lowest error was obtained with the combination of k = 3 and N = 20. However, since we are interested in solving the task completely, it is also interesting to know how many of these networks have zero error (Table 4). This can also be represented graphically (Fig. 1). The more successful networks have a greater probability of being 'implemented' in the cells than others. The organisms with the evolved feature will be more frequent and will therefore 'survive'. Biologically, networks with a higher number of nodes (proteins or genes) and high interconnectivity (a) use a lot of time to solve a simple task and (b) involve the synthesis of unnecessary inner nodes (proteins), which represents an unnecessary energetic burden to the cell. Therefore, networks with high N and k are eliminated. On the other hand, networks with a smaller number of nodes are unable to fulfill the task at all and are also eliminated. For an organism to evolve, energy conservation and survival are important, yet it must still retain the ability to adapt to environmental changes.
Table 4. Number of fully successful Boolean networks (i.e. with zero error) out of 10
  k \ N   10   15   20   25   30   35
    2      0    0    0    2    1    1
    3      0    1    5    3    5    1
    4      0    1    4    2    0    0
    5      0    0    0    0    0    0
Fig. 1. Number of fully successful Boolean networks (spline interpolation), plotted as a function of the number of inputs per node k and the number of nodes N
Therefore, the best networks solve the task with a minimal number of nodes, while retaining enough redundancy to overcome errors. The smallest Boolean network in our simulations had N = 15 nodes, 7 outputs and 8 internal units (= 15 − 7), from C0 to C7 (Fig. 2). The logical equations that show the non-canalyzing input combinations of all 15 nodes are given in Table 5.

Table 5. Logical equations showing non-canalyzing input combinations of all 15 nodes

  DG = C4 C1 C6        C0 = F3 F4 C7
  BG = C4 C2 C2        C1 = D3 D4 D5
  VG = C6 C1 C6        C2 = F3 C1 D1
  MG = C4 C0 C1        C3 = C6 C6 F2
  PG = C4 C0 C5        C4 = D1 D2 D5
  CG = C0 C4 C3        C5 = C1 F3 F1
  TG = C1 C2 C0        C6 = F2 F3 F2
                       C7 = F2 C5 C4
The Boolean network with N = 15, k = 3 is shown in Fig. 2. Internal nodes are structured into layers, in accordance with the cumulative delay from the input nodes.
Fig. 2. Boolean network with N = 15 nodes (8 internal nodes)
Fig. 3. Boolean network with N = 20 nodes (13 internal nodes)
Fig. 4. Boolean network with N = 25 nodes (18 internal nodes). Nodes C0 and C17 have no outputs.
The largest eigenvalue λ1 of the adjacency matrix C for the network in Fig. 2 is 0, which means, according to the Perron-Frobenius theorem, that there are no closed paths in the network. We were also interested in the proportion of non-canalyzing input combinations during the processing of the network. The greater this number, the more restricted or specialized the network is. For this network, it was found to be 0.228 (standard deviation 0.033), i.e. on average 22.8% of all the combinations in the network were non-canalyzing. For comparison, initial random networks had 0.111 (standard deviation 0.054), i.e. 11.1%, non-canalyzing combinations. It is obvious that successfully trained networks have a much larger proportion of non-canalyzing combinations than networks with randomly connected canalyzing-function nodes. We also compared the ratio between the number of different inputs and the number of different outputs (regression ratio) for trained and for random networks. This ratio was 8.1 for the trained networks and 0.93 for the random networks; hence the evolution of our networks increased regression. Higher regression means that the network was performing the classification task by mapping different input patterns into the same output pattern (label). This feature is comparable to the measurements of real subcellular structures [5], which means that it is biologically relevant. In summary, evolved networks show specialization and the ability to filter a larger number of stimuli (inputs) into one response (output), characteristics significant for biological systems. Fig. 3 shows a larger network with N = 20, k = 3 and 13 internal units, from C0 to C12. It still contains no loops (λ1 = 0). The regression ratio is the same as before. Fig. 4 shows a network with N = 25 nodes, k = 3. This network contains many loops and λ1 = 1.84. The regression ratio is again the same.
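The loop test based on the Perron-Frobenius theorem is straightforward to reproduce; the sketch below computes λ1 for a small adjacency matrix with numpy. The example matrices are made up and are not one of the evolved networks.

import numpy as np

def largest_real_eigenvalue(adjacency):
    """λ1 of a non-negative adjacency matrix; 0 means no closed path, >= 1 means a cycle."""
    eigenvalues = np.linalg.eigvals(np.asarray(adjacency, dtype=float))
    return max(eigenvalues.real)

# A 3-node chain (no cycle) and the same chain with a feedback edge added.
chain = [[0, 0, 0],
         [1, 0, 0],
         [0, 1, 0]]
loop = [[0, 0, 1],
        [1, 0, 0],
        [0, 1, 0]]
print(round(largest_real_eigenvalue(chain), 3))  # 0.0 -> no closed path
print(round(largest_real_eigenvalue(loop), 3))   # 1.0 -> closed path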
4 Conclusion
In this paper a Boolean model of a subcellular signal-transduction system has been presented. The network was evolved using a genetic algorithm. The example task was a hypothetical subcellular task involving responses to food and danger, with energy-increasing and energy-decreasing inputs, respectively. We have shown that the number of non-canalyzing combinations in the evolved models is greatly increased, which indicates their specialization, and that the structures exhibit the classification feature typical of real subcellular networks. The evolved models therefore have a biological grounding. In future work, we would like to investigate the structures of the evolved models and compare them with experimentally determined subcellular networks, which, however, often cannot be completely isolated from the rest of the system. Our goal is to investigate the evolution of biological system networks and to find conditions within the evolving procedure and an evaluation (fitness) function that would ensure a one-to-one mapping between the two structures.
References
1. Kauffman, S.A.: The Origins of Order. Oxford Univ. Press, Oxford (1993)
2. Aldana, M., Cluzel, P.: A natural class of robust networks. Proceedings of the National Academy of Sciences USA 100(15), 8710–8714 (2003)
3. Shmulevich, I., Dougherty, E.R., Zhang, W.: From Boolean to Probabilistic Boolean Networks as Models of Genetic Regulatory Networks. Proc. of IEEE 90(11), 1778–1792 (2002)
4. Shmulevich, I., Dougherty, E.R., Kim, S., Zhang, W.: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18(2), 261–274 (2002)
5. Helikar, T., Konvalina, J., Heidel, J., Rogers, J.A.: Emergent decision-making in biological signal transduction networks. Proceedings of the National Academy of Sciences USA 105(6), 1913–1918 (2008)
6. SI.txt: http://mathbio.unomaha.edu/Database
7. Kauffman, S., Petersen, C., Samuelsson, B., Troein, C.: Random Boolean network models and the yeast transcriptional network. Proceedings of the National Academy of Sciences USA 100(14), 14796–14799 (2003)
8. Jain, S., Krishna, S.: Graph theory and the evolution of autocatalytic networks. In: Bornholdt, S., Schuster, H.G. (eds.) Handbook of Graphs and Networks. Wiley, Chichester (2002)
Similarity of Transcription Profiles for Genes in Gene Sets
Marko Toplak¹, Tomaž Curk¹, and Blaž Zupan¹,²
¹ Faculty of Computer and Information Sciences, University of Ljubljana, Slovenia
² Dept. of Human and Mol. Genetics, Baylor College of Medicine, Houston, USA
Abstract. In gene set focused knowledge-based analysis we assume that genes from the same functional gene set have similar transcription profiles. We compared the distributions of similarity scores of gene transcription profiles between genes from the same gene sets and genes chosen at random. In line with previous research, our results show that transcription profiles of genes from the same gene sets are on average indeed more similar than random transcription profiles, although the differences are slight. We performed the experiments on 35 human cancer data sets, with KEGG pathways and BioGRID interactions as gene set sources. Pearson correlation coefficient and interaction gain were used as association measures. Keywords: gene transcription profile, association, interaction gain, gene sets, KEGG, BioGRID.
1 Introduction
Much of the current data analysis in bioinformatics relies on existing knowledge about groupings of objects of interest. For instance, Gene Ontology [2] annotates genes with terms from the ontology, and a group of interest may simply be a set of genes tagged with the same term. Among others, the Kyoto Encyclopedia of Genes and Genomes (KEGG) [11] lists metabolic pathways and identifies genes that belong to the same pathway. BioGRID [17], on the other hand, provides information on protein-protein and genetic interactions. Genes encoding the proteins may be grouped together if their proteins interact. Such groups of objects, which are most commonly genes, proteins, chemicals, and metabolic products, enable various knowledge-based data analysis techniques [4]. Typical analyses of this kind are gene set enrichment [15] and classification based on gene set signatures [12,14]. Both are useful for gene transcription profile analysis, where the task is either to find whether a chosen gene group has a specific transcription response, or to predict responses for uncharacterized samples after transforming the data set into gene set space. The backing for such knowledge-based data analysis approaches is the assumption that genes belonging to the same group have similar transcription profiles. Genes encoding interacting proteins are more similar than random genes if the Pearson correlation coefficient is used to measure association [5,7,9]. It was shown
on data of baker's yeast that transcription profiles of genes encoding interacting proteins behave similarly and that genes encoding proteins of permanent complexes, such as the ribosome or the proteasome, have particularly similar transcription profiles [9]. Other studies on a small number of data sets confirmed these findings while focusing on coevolution of gene expression [7] or on comparisons between multiple species [5]. Another study reports no difference in similarity between genes in KEGG pathways and random genes [10]. A study on 60 data sets looked at patterns of correlating genes across data sets and compared the aggregated results with background knowledge from Gene Ontology, but did not evaluate individual data sets [13]. In this paper we present a computational analysis of the association between gene transcription profiles for genes in gene sets on a wide array of data sets. To measure gene profile association, we used the Pearson correlation coefficient and interaction gain [8], which is an information-theory-based supervised measure of association. Compared to related work, we performed the same test over a wide array of data sets and, additionally, used interaction gain to measure association.
2 Data
Gene expression data. Gene expression microarray data consists of mRNA levels for thousands of genes for each biological sample. We used 35 human cancer gene expression data sets from the Gene Expression Omnibus (GEO) [3] and the Broad Institute. All data sets have two diagnostic classes and include at least 20 instances, where each class was represented by at least 8 data instances. On average, the data sets include 44 instances (s.d.= 29.6). GDS data sets with the following ID numbers were used: 806, 971, 1059, 1062, 1209, 1210, 1220, 1221, 1282, 1329, 1375, 1390, 1562, 1618, 1650, 1667, 1714, 1887, 2113, 2201, 2250, 232, 2415, 2489, 2520, 2609, 2735, 2771, 2785 and 2842. The Broad Institute data sets are described on the supplemental page of our previous paper (http://www.ailab.si/supp/bi-cancer/projections/index.htm); we used leukemia, DLBCL, prostate, GSE412, and GSE3726 data sets. Where the array contained multiple probes for the same gene, they were averaged. Gene sets. BioGRID [17] version 2.0.51 was used as a source of gene sets for protein-protein interactions. Pathways from KEGG [11] were obtained on 16 August 2010.
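Averaging multiple probes that map to the same gene, as mentioned above, is a one-line operation with pandas; the column names and the tiny example matrix below are illustrative assumptions.

import pandas as pd

# Hypothetical expression matrix: one row per probe, columns = samples, plus a 'gene' column.
expr = pd.DataFrame({
    "gene":    ["TP53", "TP53", "BRCA1"],
    "sample1": [5.1, 4.9, 7.2],
    "sample2": [5.3, 5.1, 6.8],
})

# Collapse probes to genes by averaging their expression values.
gene_expr = expr.groupby("gene").mean(numeric_only=True)
print(gene_expr)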
3 Methods
In this section we describe the measures used to evaluate transcription profile associations and the experimental methodology.

3.1 Transcription Profile Association Measures
Pearson correlation. The Pearson product-moment correlation coefficient [16] was used as a gene transcription profile association measure in many related
studies [5,7,9,13]. It determines the degree of linear relationship between two transcription profiles.

Interaction gain. The interaction gain, also known as bivariate synergy, estimates the information about the class that is gained by considering two transcription profiles together, as compared to when they are considered separately [1,8]. Two similar gene transcription profiles will have a negative interaction gain, as both carry approximately the same class information. The interaction gain of two transcription profiles X and Y with respect to class C is defined as

IntGain_C(X, Y) = Gain_C(X × Y) − Gain_C(X) − Gain_C(Y),

where Gain_C(X) denotes the information gain of profile X with respect to class C and X × Y is the Cartesian product of the transcription profiles. Information gain is defined as

Gain_C(X) = − Σ_{c∈D_C} p(c) log2 p(c) + Σ_{v∈D_X} p(v) Σ_{c∈D_C} p(c|v) log2 p(c|v),

where D_C and D_X denote the sets of class and attribute values. Gene expressions were discretized into three intervals with equal frequencies prior to the computation of interaction gain.
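A direct implementation of these two formulas on discretized profiles might look as follows. This is our own illustrative sketch, using equal-frequency binning into three intervals as stated above; it is not the code used in the study, and the example data are random placeholders.

import numpy as np

def discretize(profile, bins=3):
    # Equal-frequency binning into `bins` intervals.
    ranks = np.argsort(np.argsort(profile))
    return (ranks * bins // len(profile)).astype(int)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, classes):
    # Gain_C(X) = H(C) - H(C | X)
    gain = entropy(classes)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(classes[mask])
    return gain

def interaction_gain(x, y, classes):
    xy = x * (y.max() + 1) + y   # Cartesian product of the two discretized profiles
    return information_gain(xy, classes) - information_gain(x, classes) - information_gain(y, classes)

rng = np.random.default_rng(0)
classes = rng.integers(0, 2, 40)
gene_a = classes + rng.normal(0, 0.5, 40)   # both profiles track the class ...
gene_b = classes + rng.normal(0, 0.5, 40)   # ... so their interaction gain should be negative
xa, xb = discretize(gene_a), discretize(gene_b)
print(interaction_gain(xa, xb, classes))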
3.2 Experimental Methodology
For each data set we measured the degree of association between pairs of gene transcription profiles where both genes were in the same gene set, that is, either in a protein-protein interaction (BioGRID) or in a biological pathway (KEGG). The scores obtained were compared to the scores between random gene pairs (in the same data set) with a two-sample Kolmogorov-Smirnov test, as in [5,7]. The two-sample Kolmogorov-Smirnov test is a nonparametric test which quantifies whether two samples of values come from the same underlying distribution. It measures the maximum distance between the cumulative distributions of the samples' values and takes the sample sizes into account in the p-value computation [16]. The Orange data mining environment [6] was used to perform the analysis.
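The comparison of the two score distributions can be reproduced with the two-sample Kolmogorov-Smirnov test from scipy; the score arrays below are random placeholders standing in for the within-gene-set and random-pair association scores.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Placeholder association scores: pairs from the same gene set vs. random pairs.
scores_gene_set = rng.normal(loc=0.10, scale=0.30, size=5000)
scores_random = rng.normal(loc=0.00, scale=0.30, size=5000)

statistic, p_value = ks_2samp(scores_gene_set, scores_random)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.2e}")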
4 Results and Discussion
Table 1 presents the two-sample Kolmogorov-Smirnov p-values for all data sets. For Pearson correlation, 32 data sets have p-values lower than 0.001 for KEGG pathways and 31 for BioGRID interactions, while for interaction gain the numbers are 20 and 14, respectively. Association score distributions for three data sets are shown in Figure 1. Gene transcription profiles of genes in gene sets are more correlated than those of random genes, which extends previous studies focused on protein-protein interactions [5,7,9]. The differences in the distributions of correlation coefficients are slight,
Fig. 1. Histograms showing the degree of association between genes in KEGG pathways (yellow) and random genes (blue). Pearson correlations are shown in the left column, while interaction gains are shown in the right column.
Table 1. Two-sample Kolmogorov-Smirnov p-values for all data sets
  Data set    Pearson (BioGRID)   Pearson (KEGG)    IntGain (BioGRID)   IntGain (KEGG)
  DLBCL       6.7·10−87           3.5·10−183        8.4·10−5            2.4·10−86
  GDS1059     2.0·10−5            1.1·10−4          1.3·10−1            2.4·10−2
  GDS1062     1.5·10−14           6.2·10−11         4.7·10−1            1.1·10−15
  GDS1209     1.5·10−66           4.6·10−86         2.6·10−15           2.0·10−1
  GDS1210     7.4·10−4            7.8·10−32         4.5·10−1            4.5·10−2
  GDS1220     2.7·10−1            2.2·10−2          1.1·10−5            3.9·10−16
  GDS1221     5.5·10−16           8.0·10−36         6.8·10−1            2.3·10−2
  GDS1282     4.4·10−28           5.3·10−19         6.4·10−57           1.6·10−5
  GDS1329     3.7·10−15           1.6·10−7          9.2·10−5            8.0·10−27
  GDS1375     2.9·10−30           1.0·10−46         2.1·10−39           6.8·10−113
  GDS1390     8.6·10−3            2.3·10−6          2.9·10−1            1.2·10−4
  GDS1562     9.8·10−2            4.1·10−3          7.3·10−1            1.6·10−1
  GDS1618     1.9·10−111          1.2·10−277        4.8·10−114          < 1.0·10−318
  GDS1650     4.9·10−37           3.3·10−138        2.2·10−18           3.2·10−49
  GDS1667     1.9·10−37           1.8·10−66         3.6·10−21           5.9·10−3
  GDS1714     5.9·10−47           1.9·10−119        4.4·10−1            4.7·10−1
  GDS1887     1.0·10−3            3.6·10−7          4.2·10−1            2.6·10−6
  GDS2113     1.6·10−4            1.1·10−7          3.3·10−1            3.1·10−6
  GDS2201     5.6·10−6            3.6·10−4          1.2·10−2            8.3·10−2
  GDS2250     8.3·10−64           5.6·10−90         6.3·10−2            6.9·10−3
  GDS232      6.0·10−2            5.1·10−5          8.2·10−2            7.6·10−1
  GDS2415     6.0·10−41           < 1.0·10−318      6.6·10−1            3.3·10−2
  GDS2489     1.2·10−2            7.7·10−3          1.4·10−1            6.5·10−3
  GDS2520     7.2·10−6            1.1·10−37         4.7·10−2            3.9·10−5
  GDS2609     2.1·10−72           < 1.0·10−318      3.9·10−87           < 1.0·10−318
  GDS2735     7.9·10−40           4.8·10−96         1.0                 6.1·10−2
  GDS2771     9.8·10−9            1.7·10−32         2.5·10−1            2.7·10−13
  GDS2785     3.4·10−233          1.5·10−48         6.3·10−7            1.4·10−7
  GDS2842     7.2·10−6            1.1·10−4          7.8·10−2            1.5·10−1
  GDS806      7.8·10−50           6.0·10−299        9.2·10−2            5.3·10−1
  GDS971      1.9·10−6            2.4·10−18         6.3·10−10           2.6·10−30
  GSE3726     2.6·10−238          1.2·10−318        5.1·10−8            1.3·10−32
  GSE412      4.9·10−163          2.7·10−28         3.5·10−3            1.5·10−11
  leukemia    9.3·10−57           6.0·10−72         1.7·10−2            5.8·10−9
  prostata    8.9·10−5            5.5·10−6          4.0·10−4            8.1·10−9
as noted in [9], although the p-values are very small due to the large number of scores in the distribution samples. The absolute values of pairwise correlations between genes from KEGG were slightly higher than those from BioGRID, which is in contrast with [10], who did not find genes from KEGG pathways noticeably more correlated than genes chosen at random. Positive correlation between genes from the evaluated gene set sources is more common than negative correlation, which could be due to biological reasons [13]. The distribution of interaction gain scores for gene pairs from the evaluated gene sets was shifted slightly towards negative scores, which means that such pairs of gene transcription profiles provide overlapping information about the class. On average, the p-values were higher than with Pearson correlation. This might be due to the small number of biological samples in the data sets, since more samples are needed to measure interaction gain accurately. While negative Pearson correlation is more common in the tested gene groups than between random gene pairs, positive interaction gain is not. This was expected, because if it were more common in general, this would imply that we need completely different knowledge-based analysis techniques. We hypothesize that positive interaction gain is more common between different gene groups.
5 Conclusion
Our analysis confirms that gene transcription profiles of genes from gene sets from KEGG or BioGRID are more related than those of arbitrarily defined gene groups, which is in line with previous research [5,7,9]. Our contributions to the topic are the large number of data sets evaluated and the use of an additional association metric. While we were able to consistently detect the differences between the distributions of association scores of genes from the same gene sets and of genes chosen at random, the differences were only slight. This may be one of the reasons for the relatively disappointing results of classification methods based on gene set signatures, where higher prediction accuracies were expected [14].
References
1. Anastassiou, D.: Computational analysis of the synergy among multiple interacting genes. Mol. Syst. Biol. 3(83) (February 2007)
2. Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J., et al.: Gene ontology: tool for the unification of biology. Nature Genetics 25(1), 25–29 (2000)
3. Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M., Edgar, R.: NCBI GEO: mining tens of millions of expression profiles – database and tools update. Nucl. Acids Res. 35, D760–765 (2007)
4. Bellazzi, R., Zupan, B.: Towards knowledge-based gene expression data mining. Journal of Biomedical Informatics 40(6), 787–802 (2007)
5. Bhardwaj, N., Lu, H.: Correlation between gene expression profiles and protein–protein interactions within and across genomes. Bioinformatics 21(11), 2730 (2005)
6. Demšar, J., Zupan, B., Leban, G.: Orange: From experimental machine learning to interactive data mining, white paper (2004)
7. Fraser, H., Hirsh, A., Wall, D., Eisen, M.: Coevolution of gene expression among interacting proteins. Proceedings of the National Academy of Sciences of the United States of America 101(24), 9033 (2004)
8. Jakulin, A., Bratko, I.: Analyzing attribute dependencies. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 229–240. Springer, Heidelberg (2003)
9. Jansen, R., Greenbaum, D., Gerstein, M.: Relating whole-genome expression data with protein-protein interactions. Genome Research 12(1), 37 (2002)
10. Jelizarow, M., Guillemot, V., Tenenhaus, A., Strimmer, K., Boulesteix, A.: Over-optimism in bioinformatics: an illustration. Bioinformatics 26(16), 1990 (2010)
11. Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M., Hirakawa, M.: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research 38(Database issue), D355 (2010)
12. Lee, E., Chuang, H.Y., Kim, J.W., et al.: Inferring pathway activity toward precise disease classification. PLoS Comput. Biol. 4(11), e1000217 (2008)
13. Lee, H., Hsu, A., Sajdak, J., Qin, J., Pavlidis, P.: Coexpression analysis of human genes across many microarray data sets. Genome Research 14(6), 1085 (2004)
14. Mramor, M., Toplak, M., Leban, G., Curk, T., Demšar, J., Zupan, B.: On utility of gene set signatures in gene expression-based cancer class prediction. In: Machine Learning in Systems Biology, p. 65 (2009)
15. Nam, D., Kim, S.Y.: Gene-set approach for expression pattern analysis. Brief Bioinform 9(3), 189–197 (2008)
16. Sheskin, D.: Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press (2004)
17. Stark, C., Breitkreutz, B., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M.: BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34(suppl. 1), 535 (2006)
Author Index
´ Abrah´ am, Erika I-190 Abundez B., Itzel M. I-51 Alfaro, Rodrigo II-61 Allende, H´ector II-61, II-363 Antunes, M´ ario II-342 Avbelj, Monika II-383 ¨ am¨ Ayr¨ o, Sami I-361 Babi´c, Zdenka II-51 Bakirov, Murat B. I-150 Barszcz, Tomasz II-225 Baumann, Martin R.K. I-140 Beigi, Akram I-391, II-98, II-245 Beliczynski, Bartlomiej I-130 Bielecka, Marzena II-147, II-225 Bielecki, Andrzej II-147, II-225 Bratko, Ivan I-1 Buesser, Pierre II-167 Buli´c, Patricio I-158 Campos, Jo˜ ao I-300 C´ ardenas-Montes, Miguel I-310, I-371 Carvalho, Rui I-300 Constantinopoulos, Constantinos I-169 Correia, Manuel II-342 Costa, Ernesto I-300 Cristianini, Nello II-196, II-322 Cruz R., Rafael I-51 Curk, Tomaˇz II-393 Daolio, Fabio II-167 Daryabari, Mojtaba I-381 Datadien, Arvind I-90 de Almeida, Ana II-31, II-295 de Azevedo da Rocha, Ricardo Luis II-127, II-275 De Bie, Tijl II-196 Deng, Jianming I-320 Ding, Xiao-Feng II-118 Dobnikar, Andrej II-11, II-383 Dokur, Z¨ umray II-81 Donnarumma, Francesco I-250 Duch, Wlodzislaw II-89
Eiben, A.E. II-186 El-Dahb, Mona A. I-400 Ferariu, Lavinia I-290 Figueiredo, Marisa B. II-31 Filipiˇc, Bogdan I-420 Flaounas, Ilias II-322 Frolov, Alexander A. I-100 F´ uster-Sabater, Amparo II-285 Fyson, Nick II-196 Gasca A., Eduardo I-51 G´ ati, Krist´ of II-156 G´ omez-Iglesias, Antonio I-310, I-371 Gong, Fang II-118 Govekar, Edvard I-270 Grochowski, Marek II-89 Groˇselj, Ciril I-80 Haselager, Pim I-90 Hashemi, Ali B. I-340 Heinrich, Enrico II-373 Helmi, Hoda I-391 Hensinger, Elena II-322 Horv´ ath, G´ abor II-156 Husek, Dusan I-100 Ilc, Nejc II-11 ˙ scan, Zafer II-81 I¸ J¨ arvelin, Kalervo I-260 Jerala, Roman II-383 Joost, Ralf II-373 Juhola, Martti I-260 Kaczorek, Tadeusz II-305 Kainen, Paul C. I-12 K¨ arkk¨ ainen, Tommi I-240 Karshenas, Hossein II-98 Kester, Leon J.H.M. II-186 Kiselev, Mikhail I-120 Kocijan, Juˇs I-420, II-312 Kolodziej, Marcin I-280 Kononenko, Igor I-22, I-169, II-21 Korkosz, Mariusz II-147
K¨ oster, Frank I-140 Kotulski, Leszek II-254 Kovord´ anyi, Rita I-200 Kruglov, Igor A. I-150 Kukar, Matjaˇz I-80 K˚ urkov´ a, Vˇera I-12 Laurikkala, Jorma I-260 L awry´ nczuk, Maciej I-31, I-230 Lemmer, Karsten I-140 Leonardis, Aleˇs II-235 Lethaus, Firas I-140 Likas, Aristidis I-169 Lipi´ nski, Piotr I-330 Lodi, Stefano II-363 Lopes, Noel II-41, II-108 Lotriˇc, Uroˇs I-158 Loyola, Diego I-70 L¨ uder, Marian II-373 Luostarinen, Kari I-240 Majkowski, Andrzej I-280 Marusak, Piotr M. II-177, II-215 Matos, Lu´ıs I-410 Meybodi, Mohammad Reza I-340 Minaei, Behrouz I-381, I-391, II-98 Mishulina, Olga A. I-150 Momi´c, Snjeˇzana II-51 Montone, Guglielmo I-250 Morin, Gabriel I-190 Mozayani, Nasser II-245 Muhonen, Jukka I-240 ˜ Nanculef, Ricardo II-363 Nechval, Konstantin II-136 Nechval, Nicholas II-136 Neme, Antonio I-210 Neruda, Roman I-180 Neto, Jo˜ ao Pedro I-61 Neumann, Heiko I-110 Ni, Qingjian I-320 Nido, Antonio I-210 Nieminen, Paavo I-240 Noroozi, Vahid I-340 Novo, Jorge I-350 Nunes, Jorge I-410 Olszewski, Dominik II-1, II-71 Orchel, Marcin II-332, II-353 Ortman, Robert L. I-220
Osowski, Stanislaw I-41 ¨ ¨ Ozkaya, Ozen II-81 Parsa, Saeed I-381 Parvin, Hamid I-381, I-391, II-98, II-245 Patelli, Alina I-290 Pazo-Robles, Maria Eugenia II-285 Penedo, Manuel G. I-350 Petelin, Dejan I-420, II-312 Pevec, Darko I-22 Polyakov, Pavel Yu. I-100 Potoˇcnik, Primoˇz I-270 Potter, Steve M. I-220 Prevete, Roberto I-250 Purgailis, Maris II-136 Quintas, Ricardo
II-41
Rak, Remigiusz J. I-280 Rend´ on L., Er´endira I-51 Ribeiro, Bernardete II-31, II-41, II-108, II-342 Richter, Pascal I-190 Ringbauer, Stefan I-110 Risojevi´c, Vladimir II-51 ˇ Robnik-Sikonja, Marko I-169 Rodr´ıguez-V´ azquez, Juan Jos´e I-310 Rozevskis, Uldis II-136 Saarikoski, Jyri I-260 Saifullah, Mohammad I-200 Sait, Sadiq M. I-400 Salda˜ na T., Sergio I-51 Salomon, Ralf II-373 S´ anchez G., Jos´e S. I-51 Santos, Jos´e I-350 Sartori, Claudio II-363 Schuessler, Olena I-70 Schut, Martijn C. II-186 S¸edziwy, Adam II-254 Shi, Ai-Ye II-118 Shibata, Danilo Picagli II-127 Shiraishi, Yoichi I-400 Siddiqi, Umair F. I-400 Silva, Catarina II-342 Silva, Fernando I-61 Silva Filho, Reginaldo Inojosa II-275 Sim˜ oes, Anabela I-300 Siwek, Krzysztof I-41 Skoˇcaj, Danijel II-235 Skomorowski, Marek II-147
Vel´ asquez G., Valent´ın I-51 Venayagamoorthy, Kumar I-220 Vidnerov´ a, Petra I-180 Vreˇcko, Alen II-235 Wang, Hui-Bin II-118 Weber, Matthieu I-361 Wojciechowski, Wadim II-147 W´ ojcik, Mateusz II-225 Wojewnik, Piotr II-206 Xu, Li-Zhong
Unold, Olgierd
II-118
II-265
Valdovinos R., Rosa M. I-51 van Willigen, Willem H. II-186 Vega-Rodr´ıguez, Miguel A. I-310, I-371
Zabkowski, Tomasz II-206 Zhang, Xue-Wu II-118 Zieli´ nski, Bartosz II-147 Zupan, Blaˇz II-393