Asli Celikyilmaz and I. Burhan Türksen
Modeling Uncertainty with Fuzzy Logic
Studies in Fuzziness and Soft Computing, Volume 240

Editor-in-Chief: Prof. Janusz Kacprzyk
Systems Research Institute, Polish Academy of Sciences
ul. Newelska 6, 01-447 Warsaw, Poland
E-mail:
[email protected]
Asli Celikyilmaz and I. Burhan Türksen
Modeling Uncertainty with Fuzzy Logic
With Recent Theory and Applications
Authors

Prof. I. Burhan Türksen
University of Toronto, Mechanical & Industrial Engineering
170 College St., Haultain Building, Toronto, Ontario, M5S 3G8, Canada
E-mail: [email protected]
and
TOBB Economy and Technology University
Mühendislik Fakültesi, Endüstri Mühendisligi Bölümü
Sögütözü Caddesi No. 43, 06560 Ankara, Turkey
E-mail: [email protected]

Dr. Asli Celikyilmaz
University of California, Berkeley
BISC - The Berkeley Initiative in Soft Computing
Electrical Eng. and Computer Sciences Department
415 Soda Hall, Berkeley, CA, 94709-7886, USA
E-mail: [email protected], [email protected], [email protected]

ISBN 978-3-540-89923-5
e-ISBN 978-3-540-89924-2
DOI 10.1007/978-3-540-89924-2
Studies in Fuzziness and Soft Computing
ISSN 1434-9922
Library of Congress Control Number: 2008941674
© 2009 Springer-Verlag Berlin Heidelberg
To: Fuzzy Logic Research Fuzzy Data Mining Society and Prof. L.A. Zadeh
Foreword
The world we live in is pervaded with uncertainty and imprecision. Is it likely to rain this afternoon? Should I take an umbrella with me? Will I be able to find parking near the campus? Should I go by bus? Such simple questions are a common occurrence in our daily lives. Less simple examples: What is the probability that the price of oil will rise sharply in the near future? Should I buy Chevron stock? What are the chances that a bailout of GM, Ford and Chrysler will not succeed? What will be the consequences? Note that the examples in question involve both uncertainty and imprecision. In the real world, this is the norm rather than the exception.

There is a deep-seated tradition in science of employing probability theory, and only probability theory, to deal with uncertainty and imprecision. The monopoly of probability theory came to an end when fuzzy logic made its debut. However, this is by no means a widely accepted view. The belief persists, especially within the probability community, that probability theory is all that is needed to deal with uncertainty. To quote a prominent Bayesian, Professor Dennis Lindley, “The only satisfactory description of uncertainty is probability. By this I mean that every uncertainty statement must be in the form of a probability; that several uncertainties must be combined using the rules of probability; and that the calculus of probabilities is adequate to handle all situations involving uncertainty…probability is the only sensible description of uncertainty and is adequate for all problems involving uncertainty. All other methods are inadequate…anything that can be done with fuzzy logic, belief functions, upper and lower probabilities, or any other alternative to probability can better be done with probability.”

What can be said about such views is that they reflect unfamiliarity with fuzzy logic. The book “Modeling Uncertainty with Fuzzy Logic,” co-authored by Dr. A. Celikyilmaz and Professor I.B. Turksen, may be viewed as a convincing argument to the contrary. In effect, what this book documents is that in the realm of uncertainty and imprecision fuzzy logic has much to offer.

There are many misconceptions about fuzzy logic. Fuzzy logic is not fuzzy. Like traditional logical systems and probability theory, fuzzy logic is precise. However, there is an important difference. In fuzzy logic, the objects of discourse are allowed to be much more general and much more complex than the objects of
discourse in traditional logical systems and probability theory. In particular, fuzzy logic provides many more tools for dealing with second-order uncertainty, that is, uncertainty about uncertainty, than those provided by probability theory. Imprecise probabilities, fuzzy sets of Type 2 and vagueness are instances of second-order uncertainty. In many real-world settings, and especially in the context of decision analysis, the complex issue of second-order uncertainty has to be addressed. At this juncture, decision-making under second-order uncertainty is far from being well understood.

“Modeling Uncertainty with Fuzzy Logic” begins with an exposition of the basics of fuzzy set theory and fuzzy logic. In this part of the book as well as in other parts, there is much that is new and unconventional. Particularly worthy of note is the authors' extensive use of the formalism of so-called fuzzy functions as an alternative to the familiar formalism of fuzzy if-then rules. The formalism of fuzzy functions was introduced by Professor M. Demirci about a decade ago, and in recent years was substantially extended by the authors. The authors employ their version of the formalism to deal with fuzzy sets of Type 2, that is, fuzzy sets with fuzzy grades of membership.

To understand the authors' approach, it is helpful to introduce what I call the concept of cointension. Informally, cointension is a measure of the closeness of fit of a model to the object of modeling. A model is cointensive if its degree of cointension is high. In large measure, scientific progress is driven by a quest for cointensive models of reality. In the context of modeling with fuzzy logic, the use of fuzzy sets of Type 2 makes it possible to achieve higher levels of cointension. The price is higher computational complexity. On balance, the advantages of using fuzzy sets of Type 2 outweigh the disadvantages. For this reason, modeling with fuzzy sets of Type 2 is growing in visibility and importance.

A key problem in applications of fuzzy logic is that of construction of the membership function of a fuzzy set. There are three principal approaches. In the declarative approach, membership functions are specified by the designer of a system. This is the approach that is employed in most of the applications of fuzzy logic in the realms of industrial control and consumer products. In the computational approach, the membership function of a fuzzy set is expressed as a function of the membership functions of one or more fuzzy sets with specified membership functions. In the modelization/elicitation approach, membership functions are computed through the use of cointension-enhancement techniques. In using such techniques, successive outputs of a model are compared with desired outputs, and parameters in membership functions are adjusted to maximize cointension. For this purpose, the authors make skillful use of a wide variety of techniques centering on cluster analysis, pattern classification and evolutionary algorithms. They employ simulation to validate their results. In sum, the authors develop an effective approach to modeling of uncertainty using fuzzy sets of Type 2 employing various parameter-identification formalisms.

“Modeling Uncertainty with Fuzzy Logic” is an important contribution to the development of a better understanding of how to deal with second-order uncertainty.
The issue of second-order uncertainty has received relatively little attention so far, but its intrinsic importance is certain to make it an object of growing attention in coming years. Through their work, the authors have opened a door to wide-ranging applications. They deserve our compliments and congratulations.
Berkeley, CA November 30, 2008
Lotfi A. Zadeh
Preface
A representation of a system with a model, and an implementation of that model to reason with and provide solutions, are central to many disciplines of science. An essential concern of system models is to establish the relationships between system input variables and output variables. In cases of complex systems, conventional modeling approaches usually do not perform well because it is difficult to find a global function or analytic structure for such systems. In this regard, fuzzy system modeling (FSM) – meaning the construction of representations of fuzzy system models with fuzzy techniques and theories – provides an efficient alternative which has proven to be quite successful. In one of his latest works¹, Professor Lotfi A. Zadeh describes the remarkable capabilities of fuzzy logic as follows:

“..Fuzzy logic may be viewed as an attempt at formalization/mechanization of two remarkable human capabilities. First, the capability to converse, reason and make rational decisions in an environment of imprecision, uncertainty, incompleteness of information, conflicting information, partiality of truth and partiality of possibility – in short, in an environment of imperfect information. And second, the capability to perform a wide variety of physical and mental tasks without any measurements and any computations.”

The capabilities of fuzzy logic open possibilities for a wide range of theoretical and application problem domains. In spite of its success, implementing fuzzy techniques and theories to represent complex systems and build fuzzy system models has been a difficult task: it requires the identification of many parameters. For instance, an important problem in the development of fuzzy system models is the generation of fuzzy if-then rules. These rules may be constructed by extracting knowledge from human experts and constructing suitable membership functions. However, information supplied by humans suffers from certain serious problems. Firstly, human knowledge is usually incomplete or not organized, since different experts usually make different decisions. Even the same expert may have
¹ L.A. Zadeh, “Is there a need for fuzzy logic”, Information Sciences, vol. 178, issue 13, July 2008.
different interpretations of the same observation at different times. Furthermore, knowledge acquisition from experts is neither systematic nor efficient. These problems have led researchers to build automated algorithms for modeling systems using fuzzy theories via machine learning and data mining techniques.

With the above problems in hand, in this book we propose novel and unique fuzzy-modeling approaches to remedy such deficiencies. We mainly focus on algorithms based on the novel fuzzy functions method. The new fuzzy functions approach employs membership values differently from any other fuzzy system model implemented to date. The membership values can be thought of as ‘atoms’ that hold potential information about a system's behaviour, waiting to be activated to release their power. Such potential information obtained from membership values is captured in local fuzzy functions as predictors of the system behaviour. Instead and in place of fuzzy if-then rule base structures, fuzzy functions are implemented to build models of a given system. This book presents essential alternatives of the fuzzy functions approaches and their advancements. The aim is to build more powerful fuzzy models via autonomous approaches that can identify hidden structures through the optimization of their system parameters.

The term “fuzzy functions” has been used by researchers to define different things. The building blocks of fuzzy set theory were proposed by Professor L.A. Zadeh in 1965, especially the fuzzy extensions of classical basic notions such as logical connectives, quantifiers, inference rules, relations, arithmetic operations, etc. Hence, these constitute the initial definitions of fuzzy functions. Later, different forms of fuzzy functions were presented. In 1969, Marinos introduced well-known conventional switching theory techniques into the design of fuzzy logic systems, based on the fuzzy set theory and fuzzy operations of Professor L.A. Zadeh. In their 1972 paper, Siy and Chen explored the simplification of fuzzy functions and defined fuzzy functions as polynomials formed by possible combinations of operations on fuzzy sets. Hence, fuzzy functions are defined as relations between fuzzy variables. Other researchers have also defined fuzzy functions as a special case of fuzzy relations. Sasaki in 1993, and later Demirci in 1999, defined fuzzy functions as a special case of fuzzy relations and explored their mathematical representations. Thus, the fuzzy functions we introduce in this book are different from the latter fuzzy functions. The idea of the fuzzy functions of this study emerged from the idea of representing each unique rule of fuzzy rule bases in terms of functions.

The main structure identification tools of this book capture hidden structures via pattern recognition methods. We also present a new improved fuzzy clustering algorithm that helps identify the relationships between the input variables and the output variable in local models, mainly by regression type system development techniques. We also focus on classification problem domains by building multiple fuzzy classifiers for each hidden pattern identified with the improved fuzzy clustering method. We present a new fuzzy cluster validity method to demonstrate how the presented methodologies fit the estimation approaches.
Later in the book, we incorporate one of the soft computing tools to optimize the parameters of the fuzzy function models. We focus on a novel evolutionary fuzzy functions approach: the design of “improved fuzzy functions” system models with the use of evolutionary algorithms. Evolutionary algorithms are a broad class of stochastic optimization tools inspired by biology. An evolutionary algorithm maintains a population of candidate solutions for the problem at hand and makes it evolve by iteratively applying a set of stochastic operators known as mutation, recombination, and selection. Given enough time, the resulting process tends to find globally optimal solutions to the problem, much in the same way as populations of organisms in nature tend to adapt to their surrounding environment. Hence, using evolutionary algorithms as the optimization tool, the local structures of the given database are identified with a new improved fuzzy clustering method and represented with novel “fuzzy functions”.

Although it has been investigated for many years, the problem of uncertainty modeling is yet to be satisfactorily resolved in system modeling communities. In engineering problems, building reliable models depends on the identification of important values of the variables of model equations. However, in real life cases, these important values may not be obtainable due to the imprecise, noisy, vague, or incomplete nature of the available information. The goal of this book is to build an uncertainty modeling architecture to represent and handle the uncertainty in the parameters and structure of the fuzzy functions, so as to capture the most available information. The uncertainty in systems can be captured with higher order fuzzy sets, viz. interval valued type-2 fuzzy sets, which were first introduced by Professor Lotfi A. Zadeh. Type-2 fuzzy systems implement type-2 fuzzy sets to capture the higher order imprecision inherent in systems. In particular, this book introduces the formation of type-2 fuzzy functions to capture uncertainties associated with system behaviour.

The central contribution of this work is to expose the fuzzy functions approach and enhance it to capture imprecision in setting system model parameters by constructing a new uncertainty modeling tool. Replacing the standard fuzzy rule bases with the new improved fuzzy functions succeeds in capturing essential relationships in structure identification processes and overcomes limitations exhibited by earlier fuzzy inference systems based on if-then rule methods, which face an abundance of fuzzy operations and hence the difficulty of choosing among t-norms and co-norms and methods of fuzzification and defuzzification. Designing an autonomous and robust fuzzy system model and reasoning with it is the prime goal of this approach. This new fuzzy system modeling (FSM) approach implements higher-level fuzzy sets to identify the uncertainties in: (1) the system parameters, and (2) the structure of the fuzzy functions. With the identification of these parameters, interval valued fuzzy sets and fuzzy functions are identified. Finally, an evolutionary computing approach is combined with the proposed uncertainty identification strategy to build fuzzy system models that can automatically identify these uncertainty intervals.
After testing the new fuzzy functions tools on various benchmark problems, the algorithms are successfully applied to model decision processes in two real problem domains: the desulphurization process in steel making and stock price prediction activities. For both of these problems, the proposed methods produce robust, high-performance results, which are comparable to (if not better than) the best system modeling approaches known in the current literature. Several aspects of the proposed methodologies are thoroughly analyzed to provide a deeper understanding. These analyses show the consistency of the results. As a final note, the fuzzy modeling approaches demonstrated in this book may supply a suitable framework for engineers and practitioners for the design of complex systems with fuzzy functions instead of crisp functions, that is to say, for the design of artificial systems based on well known functions, viz. regression, classification, etc. Although in this book we only show examples from economics and the production industry, we believe that fuzzy modeling can and should be utilized in many fields of science, including biology, economics, psychology, sociology, history and more. In such fields, many models exist in the current research literature, and they can be directly transformed into mathematical models using the methodologies presented herein.
September 2008
Asli Celikyilmaz
University of California, Berkeley, USA

I. Burhan Turksen
University of Toronto, Canada
TOBB Economy and Technology University, Turkey
Contents

1 Introduction
  1.1 Motivation
  1.2 Contents of the Book
  1.3 Outline of the Book

2 Fuzzy Sets and Systems
  2.1 Introduction
  2.2 Type-1 Fuzzy Sets and Fuzzy Logic
    2.2.1 Characteristics of Fuzzy Sets
    2.2.2 Operations on Fuzzy Sets
  2.3 Fuzzy Logic
    2.3.1 Structure of Classical Logic Theory
    2.3.2 Relation of Set and Logic Theory
    2.3.3 Structure of Fuzzy Logic
    2.3.4 Approximate Reasoning
  2.4 Fuzzy Relations
    2.4.1 Operations on Fuzzy Relations
    2.4.2 Extension Principle
  2.5 Type-2 Fuzzy Sets
    2.5.1 Type-2 Fuzzy Sets
    2.5.2 Interval Valued Type-2 Fuzzy Sets
    2.5.3 Type-2 Fuzzy Set Operations
  2.6 Fuzzy Functions
  2.7 Fuzzy Systems
  2.8 Extensions of Takagi-Sugeno Fuzzy Inference Systems
    2.8.1 Adaptive-Network-Based Fuzzy Inference System (ANFIS)
    2.8.2 Dynamically Evolving Neuro-Fuzzy Inference Method (DENFIS)
    2.8.3 Genetic Fuzzy Systems (GFS)
  2.9 Summary

3 Improved Fuzzy Clustering
  3.1 Introduction
  3.2 Fuzzy Clustering Algorithms
    3.2.1 Fuzzy C-Means Clustering Algorithm
    3.2.2 Classification of Objective Based Fuzzy Clustering Algorithms
    3.2.3 Fuzzy C-Regression Model (FCRM) Clustering Algorithm
    3.2.4 Variations of Combined Fuzzy Clustering Algorithms
  3.3 Improved Fuzzy Clustering Algorithm (IFC)
    3.3.1 Motivation
    3.3.2 Improved Fuzzy Clustering Algorithm for Regression Models (IFC)
    3.3.3 Improved Fuzzy Clustering Algorithm for Classification Models (IFC-C)
    3.3.4 Justification of Membership Values of the IFC Algorithm
  3.4 Two New Cluster Validity Indices for IFC and IFC-C
    3.4.1 Overview of Well-Known Cluster Validity Indices
    3.4.2 The New Cluster Validity Indices
    3.4.3 Simulation Experiments [Celikyilmaz and Turksen, 2007i; 2008c]
    3.4.4 Discussions on Performances of New Cluster Validity Indices Using Simulation Experiments
  3.5 Summary

4 Fuzzy Functions Approach
  4.1 Introduction
  4.2 Motivation
  4.3 Proposed Type-1 Fuzzy Functions Approach Using FCM – T1FF
    4.3.1 Structure Identification of FF for Regression Models (T1FF)
    4.3.2 Structure Identification of the Fuzzy Functions for Classification Models (T1FF-C)
    4.3.3 Inference Mechanism of T1FF for Regression Models
    4.3.4 Inference Mechanism of T1FF for Classification Models
  4.4 Proposed Type-1 Improved Fuzzy Functions with IFC – T1IFF
    4.4.1 Structure Identification of T1IFF for Regression Models
    4.4.2 Structure Identification of T1IFF-C for Classification Models
    4.4.3 Inference Mechanism of T1IFF for Regression Problems
    4.4.4 Inference with T1IFF-C for Classification Problems
  4.5 Proposed Evolutionary Type-1 Improved Fuzzy Function Systems
    4.5.1 Genetic Learning Process: Genetic Tuning of Improved Membership Functions and Improved Fuzzy Functions
    4.5.2 Inference Method for ET1IFF and ET1IFF-C
    4.5.3 Reduction of Structure Identification Steps of T1IFF Using the Proposed ET1IFF Method
  4.6 Summary

5 Modeling Uncertainty with Improved Fuzzy Functions
  5.1 Motivation
  5.2 Uncertainty
  5.3 Conventional Type-2 Fuzzy Systems
    5.3.1 Generalized Type-2 Fuzzy Rule Bases Systems (GT2FRB)
    5.3.2 Interval Valued Type-2 Fuzzy Rule Bases Systems (IT2FRB)
    5.3.3 Most Common Type-Reduction Methods
    5.3.4 Discrete Interval Type-2 Fuzzy Rule Bases (DIT2FRB)
  5.4 Discrete Interval Type-2 Improved Fuzzy Functions
    5.4.1 Background of Type-2 Improved Fuzzy Functions Approaches
    5.4.2 Discrete Interval Type-2 Improved Fuzzy Functions System (DIT2IFF)
  5.5 The Advantages of Uncertainty Modeling
  5.6 Discrete Interval Type-2 Improved Fuzzy Functions with Evolutionary Algorithms
    5.6.1 Motivation
    5.6.2 Architecture of the Evolutionary Type-2 Improved Fuzzy Functions
    5.6.3 Reduction of Structure Identification Steps of DIT2IFF Using New EDIT2IFF Method
  5.7 Summary

6 Experiments
  6.1 Experimental Setup
    6.1.1 Overview of Experiments
    6.1.2 Three-Way Sub-sampling Cross Validation Method
    6.1.3 Measuring Models' Prediction Performance
      6.1.3.1 Performance Evaluations of Regression Experiments
      6.1.3.2 Performance Evaluations of Classification Experiments
  6.2 Parameters of Benchmark Algorithms
    6.2.1 Support Vector Machines (SVM)
    6.2.2 Artificial Neural Networks (NN)
    6.2.3 Adaptive-Network-Based Fuzzy Inference System (ANFIS)
    6.2.4 Dynamically Evolving Neuro-Fuzzy Inference Method (DENFIS)
    6.2.5 Discrete Interval Valued Type-2 Fuzzy Rule Base (DIT2FRB)
    6.2.6 Genetic Fuzzy System (GFS)
    6.2.7 Logistic Regression (LR), Fuzzy K-Nearest Neighbor (FKNN)
  6.3 Parameters of Proposed Fuzzy Functions Algorithms
    6.3.1 Fuzzy Functions Methods
    6.3.2 Improved Fuzzy Functions Methods
  6.4 Analysis of Experiments – Regression Domain
    6.4.1 Friedman's Artificial Domain
    6.4.2 Auto-mileage Dataset
    6.4.3 Desulphurization Process Dataset
    6.4.4 Stock Price Analysis
    6.4.5 Proposed Fuzzy Cluster Validity Index Analysis for Regression
  6.5 Analysis of Experiments – Classification (Pattern Recognition) Domains
    6.5.1 Classification Datasets from UCI Repository
    6.5.2 Classification Dataset from StatLib
    6.5.3 Results from Classification Datasets
    6.5.4 Proposed Fuzzy Cluster Validity Index Analysis for Classification
    6.5.5 Performance Comparison Based on Elapsed Times
  6.6 Overall Discussions on Experiments
    6.6.1 Overall Comparison of System Modeling Methods on Regression Datasets
    6.6.2 Overall Comparison of System Modeling Methods on Classification Datasets
  6.7 Summary of Results and Discussions

7 Conclusions and Future Work
  7.1 General Conclusions
  7.2 Future Work

References

Appendix
  A.1 Set and Logic Theory – Additional Information
  A.2 Fuzzy Relations (Composition) – An Example
  B.1 Proof of Fuzzy c-Means Clustering Algorithm
  B.2 Proof of Improved Fuzzy Clustering Algorithm
  C.1 Artificial Neural Networks (ANNs)
  C.2 Support Vector Machines
  C.3 Genetic Algorithms
  C.4 Multiple Linear Regression Algorithms with Least Squares Estimation
  C.5 Logistic Regression
  C.6 Fuzzy K-Nearest Neighbor Approach
  D.1 T-Test Formula
  D.2 Friedman's Artificial Dataset: Summary of Results
  D.3 Auto-mileage Dataset: Summary of Results
  D.4 Desulphurization Dataset: Summary of Results
  D.5 Stock Price Datasets: Summary of Results
  D.6 Classification Datasets: Summary of Results
  D.7 Cluster Validity Index Graphs
  D.8 Classification Datasets – ROC Graphs
List of Tables

Table 2.1 Some well known t-norms and t-conorms
Table 2.2 The AGE and SALARY attributes of Employees
Table 3.1 Distance Measures
Table 3.2 Membership values as input variables in Fuzzy Function parameter estimations
Table 3.3 Correlation Analysis of FCM clustering and IFC membership values with the output variable
Table 3.4 Significance Test Results of fuzzy functions using membership values obtained from FCM clustering and IFC fuzzy functions
Table 3.5 Functions used to generate Artificial Datasets
Table 3.6 Optimum number of clusters of artificial datasets for m ∈ {1.3, 2.0}
Table 3.7 Optimum Number of Clusters of IFC models of stock price dataset identified by different validity indices
Table 3.8 Optimum Number of Clusters of IFC models of Ionosphere dataset indicated by different validity indices
Table 4.1 Number of parameters of a Type-1 Improved Fuzzy Functions (T1IFF) experiment
Table 4.2 The number of parameters of an Evolutionary Type-1 Improved Fuzzy Functions experiment
Table 5.1 Improved Fuzzy Functions Parameter Representation
Table 5.2 Differences between the … and Earlier Type-2 Fuzzy System Modeling Approach
Table 5.3 The steps of the Genetic Learning Process of EDIT2IFF
Table 5.4 Number of parameters of Discrete Interval Type-2 Improved Fuzzy Functions (DIT2IFF)
Table 5.5 The number of parameters of Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions (EDIT2IFF)
Table 6.1 Overview of Datasets used in the experiments
Table 6.2 Calculation of overall performance of a method based on three-way cross validation results. The overall performance is represented with the tuple 〈PM, stPM〉
Table 6.3 Contingency Table to calculate accuracy
Table 6.4 Learning parameters of Support Vector Machines for classification and regression methods
Table 6.5 Learning parameters of 1-Layer Neural Networks
Table 6.6 Learning parameters of Adaptive Network Fuzzy Inference Systems – ANFIS (Takagi-Sugeno) Subtractive Clustering Method
Table 6.7 Learning parameters of Dynamically Evolving Neuro-Fuzzy Inference System – DENFIS Online Learning with Higher order Takagi-Sugeno (TS) inference
Table 6.8 Learning parameters of Type-2 Fuzzy Rule Base Approach – DIT2FRB
Table 6.9 Initial Parameters of Genetic Fuzzy System
Table 6.10 The Parameters of Type-1 and Type-2 Fuzzy Functions Methods for Regression Problems
Table 6.11 The Parameters of Type-1 and Type-2 Fuzzy Functions Methods for Classification Problems
Table 6.12 The Parameters of Type-1 and Type-2 Improved Fuzzy Functions Methods for Regression
Table 6.13 The Parameters of Type-1 and Type-2 Improved Fuzzy Functions Methods for Classification Problems
Table 6.14 R² values obtained from the application of Type-1 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of Friedman's Artificial Dataset
Table 6.15 Optimum Parameters of Variations of Type-1 Fuzzy Functions Approach
Table 6.16 R² values obtained from the application of Benchmark Approaches on Training-Validation-Testing Datasets of Friedman's Artificial Dataset and their optimum model parameters
Table 6.17 R² values obtained from the application of Type-2 Fuzzy Functions Approaches and their variations on Training-Validation-Testing Datasets of Friedman's Artificial Dataset
Table 6.18 Optimum Parameters of Variations of Type-2 Fuzzy Functions Approach
Table 6.19 R² values obtained from the application of the Earlier Type-2 Fuzzy Rule Base (DIT2FRB) Approach on Training-Validation-Testing Datasets of Friedman's Artificial Dataset
Table 6.20 Two sample left-tailed t-test results (p…

3.2 Fuzzy Clustering Algorithms

The objective function will be 0 when all data objects coincide with the cluster centers, i.e., when c = n. On the other hand, when data objects are farther away from the cluster centers, υi, the objective function gets larger. The location and the number of cluster centers affect the value of the objective function. The objective function criterion is minimized at the optimum solution, so one should search for the global minimum. In order to avoid trivial solutions, two constraints are imposed on the partition matrix U, as follows:
$$\sum_{i=1}^{c} \mu_{ik} = 1, \quad \forall k \qquad (3.4)$$

$$0 < \sum_{k=1}^{n} \mu_{ik} < n, \quad \forall i \qquad (3.5)$$

The constraint in (3.4) implies that each row of the partition matrix in (3.2) adds up to 1⁴. The constraint in (3.5) implies that the column total of the membership values can neither exceed the number of data vectors, n, nor be zero; this indicates that there is at least one member assigned to each cluster. However, neither of these constraints forces the membership values of each cluster to have a certain distribution. The general formula of the distance measure is given by:
$$d^2(x_k, \upsilon_i) = (x_k - \upsilon_i)^T A_i (x_k - \upsilon_i) \geq 0 \qquad (3.6)$$
³ Fuzziness is a type of uncertainty (of imprecision) accepted in uncertainty theory [Zadeh, 1965, 1975a]. Various functions have been proposed to measure the degree of fuzziness. In fuzzy clustering algorithms, the overlapping constant, m, is used as the degree of fuzziness. In later chapters, m will be used as a parameter to define uncertainty in the proposed fuzzy functions approach, along with other measures.
⁴ In some research, such as Krishnapuram and Keller (1993), the constraint (3.4) is relaxed in the possibilistic approach to clustering.
In (3.6), the norm matrix Ai, i = 1,…,c, is a positive definite symmetric matrix. Other distance measures can also be used in fuzzy clustering algorithms; a short list of different distance measures is given in Table 3.1. The FCM clustering algorithm uses the Euclidean distance, so the norm matrix, Ai, is equal to the identity matrix (A = I), since the input matrix is scaled to standard deviation 1 and mean 0. On the other hand, Gustafson and Kessel [1979] use the Mahalanobis distance, in which case the norm matrix of each cluster is equal to the inverse of the covariance matrix of that cluster, i.e., Ai = Ci⁻¹.

Table 3.1 Distance Measures

  Euclidean Distance:    $d_2(a,b) = \left[\sum_{i=1}^{nv} (a_i - b_i)^2\right]^{1/2}$
  Minkowski Distance:    $d_p(a,b) = \left[\sum_{i=1}^{s} |a_i - b_i|^p\right]^{1/p}, \quad p > 0$
  Maximum Distance:      $d_\infty(a,b) = \max_i |a_i - b_i|$
  Mahalanobis Distance:  $d_A(a,b) = (a - b)^T A (a - b)$
From (3.3)-(3.6), one can see that the FCM clustering algorithm is a constrained optimization problem, which should be minimized in order to obtain optimum results. Therefore, the FCM clustering algorithm can be written as a single optimization structure as follows:

$$
\begin{aligned}
\min \; & J(X; U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (\mu_{ik})^m \, d_A^2(x_k, \upsilon_i) \\
\text{s.t.} \; & 0 \leq \mu_{ik} \leq 1, \quad \forall i, k \\
& \sum_{i=1}^{c} \mu_{ik} = 1, \quad \forall k \\
& 0 < \sum_{k=1}^{n} \mu_{ik} < n, \quad \forall i
\end{aligned} \qquad (3.7)
$$
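As a sketch of how (3.7) can be evaluated in practice, the following hypothetical helpers compute the objective J for the Euclidean case (A = I) and verify the constraints; here U is an n × c partition matrix whose rows (data objects) sum to 1.

```python
import numpy as np

def fcm_objective(X, U, V, m):
    """J(X; U, V) = sum_i sum_k (mu_ik)^m d^2(x_k, v_i), squared Euclidean distance."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # d2[k, i] = ||x_k - v_i||^2
    return ((U ** m) * d2).sum()

def check_constraints(U):
    n = U.shape[0]
    assert np.all((U >= 0) & (U <= 1))           # 0 <= mu_ik <= 1
    assert np.allclose(U.sum(axis=1), 1.0)       # each row adds up to 1, constraint (3.4)
    totals = U.sum(axis=0)                       # column total of each cluster
    assert np.all((totals > 0) & (totals < n))   # constraint (3.5)
```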
The constrained optimization model in (3.7) [Bezdek, 1981a] can be solved using a well-known method in mathematics, namely the Lagrange multiplier method [Khuri, 2003], whereby the model is converted into an unconstrained optimization problem with one objective function. In order to obtain an equality-constrained problem, the primal constrained optimization problem is first converted into an equivalent unconstrained problem with the help of unspecified parameters known as Lagrange multipliers, λ:

$$\max \; W(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (\mu_{ik})^m \, d_A^2(x_k, \upsilon_i) - \lambda \left( \sum_{i=1}^{c} \mu_{ik} - 1 \right) \qquad (3.8)$$
According to the Lagrangian method, the Lagrangian function must be minimized with respect to the primal parameters and maximized with respect to the dual parameters. The derivatives of the Lagrangian function in (3.8) with respect to the original model parameters, U and V, should vanish. Hence, by taking the derivative of the objective function in (3.8) with respect to the cluster centers, V, and the membership values, U, the optimum membership value calculation equation and cluster centers are formulated by:
$$\mu_{ik}^{(t)} = \left[ \sum_{j=1}^{c} \left( \frac{d(x_k, \upsilon_i^{(t-1)})}{d(x_k, \upsilon_j^{(t-1)})} \right)^{\frac{2}{m-1}} \right]^{-1} \qquad (3.9)$$

$$\upsilon_i^{(t)} = \frac{\sum_{k=1}^{n} \left(\mu_{ik}^{(t)}\right)^m x_k}{\sum_{k=1}^{n} \left(\mu_{ik}^{(t)}\right)^m}, \quad \forall i = 1, \ldots, c \qquad (3.10)$$
In (3.9), υi(t−1) represents the cluster center vector of cluster i obtained in the (t−1)th iteration. Similarly, in (3.9) and (3.10), μik(t) denotes the optimum membership values calculated at the tth iteration. The proofs of the membership value calculation formula in (3.9) and the cluster center function in (3.10) can be found in Appendix B.1. This result shows that the membership values and cluster centers are dependent on each other, so Bezdek [1981a] proposed an iterative algorithm to calculate membership values and cluster centers. The objective function at each iteration, t, is measured by

$$J^{(t)} = \sum_{i=1}^{c} \sum_{k=1}^{n} \left(\mu_{ik}^{(t)}\right)^m d^2\left(x_k, \upsilon_i^{(t)}\right) > 0 \qquad (3.11)$$
The FCM algorithm stops according to a termination criterion, e.g., either after a certain number of iterations, or when the magnitude of separation of the two nearest clusters is less than a pre-determined value (ε), etc. The iterative FCM clustering algorithm is shown in ALGORITHM 3.1. The effect of the fuzziness value, m, can be analyzed by taking the limit of the membership value calculation equation in (3.9) at the boundaries, as follows:

$$\lim_{m \to \infty} \mu_{ik}(x) = \lim_{m \to \infty} \left[ \sum_{j=1}^{c} \left( \frac{d^2(x_k, \upsilon_i)}{d^2(x_k, \upsilon_j)} \right)^{\frac{1}{m-1}} \right]^{-1} = \frac{1}{c}, \quad \forall i, j = 1, \ldots, c. \qquad (3.12)$$
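A quick numeric illustration of (3.12), with hypothetical squared distances to c = 3 cluster centers: as m grows, the memberships of (3.9) flatten toward 1/c.

```python
import numpy as np

def memberships(d2, m):
    # mu_i = [ sum_j (d2_i / d2_j)^(1/(m-1)) ]^(-1) for one data object, eq (3.9)
    ratios = (d2[:, None] / d2[None, :]) ** (1.0 / (m - 1))
    return 1.0 / ratios.sum(axis=1)

d2 = np.array([0.5, 2.0, 4.0])            # hypothetical squared distances, c = 3
for m in (1.1, 2.0, 10.0, 100.0):
    print(m, memberships(d2, m))          # crisp near m = 1; -> [1/3, 1/3, 1/3] as m grows
```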
ALGORITHM 3.1 Fuzzy C-Means Clustering Algorithm (FCM)

Given data vectors, X = {x1,…,xn}, the number of clusters, c, the degree of fuzziness, m, and a termination constant, ε (or a maximum iteration number), initialize the partition matrix, U, randomly.
Step 1: Find initial cluster centers using (3.10), with the membership values of the initial partition matrix as inputs.
Step 2: Start iteration t = 1,…,max-iteration:
  Step 2.1. Calculate the membership value of each input data object k in cluster i, μik(t), using the membership value calculation equation in (3.9), where xk are the input data vectors and υi(t−1) are the cluster centers from the (t−1)th iteration.
  Step 2.2. Calculate the cluster center of each cluster i at iteration t, υi(t), using the cluster center function in (3.10), where the inputs are the input data matrix, xk, and the membership values of iteration t, μik(t).
  Step 2.3. Stop if the termination condition is satisfied, e.g., |υi(t) − υi(t−1)| ≤ ε. Otherwise go to Step 2.1.
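A compact NumPy sketch of ALGORITHM 3.1 follows, assuming Euclidean distance (A = I) and a center-shift termination test; it is an illustration, not the authors' reference implementation.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means: alternate membership update (3.9) and center update (3.10).

    X: (n, nv) data matrix; returns partition matrix U (n, c) and centers V (c, nv)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)          # random partition matrix, rows sum to 1
    V = centers(X, U, m)                       # Step 1: initial centers via (3.10)
    for _ in range(max_iter):                  # Step 2
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)  # (3.9)
        V_new = centers(X, U, m)               # (3.10)
        if np.abs(V_new - V).max() <= eps:     # Step 2.3: |v(t) - v(t-1)| <= eps
            return U, V_new
        V = V_new
    return U, V

def centers(X, U, m):
    Um = (U ** m).T                            # (c, n)
    return (Um @ X) / Um.sum(axis=1, keepdims=True)
```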
Under the assumption that no two cluster centers are alike, taking the limit of (3.9) at the other boundary, m → 1, gives the crisp assignment

$$\lim_{m \to 1} \mu_{ik}(x) = \begin{cases} 1, & \text{if } d^2(x_k, \upsilon_i) < d^2(x_k, \upsilon_j), \; \forall j \neq i \\ 0, & \text{otherwise.} \end{cases}$$

3.3.2 Improved Fuzzy Clustering Algorithm for Regression Models (IFC)

Given the data, the degree of fuzziness, m > 1.1, the number of clusters, c > 1, a termination constant, ε > 0, and a maximum number of iterations (max-iter), specify the structure of the regression models, such as in (3.27), for each cluster i, i = 1,…,c, k = 1,…,n, to create the interim input matrix, τi. Using the FCM clustering algorithm, initialize the partition matrix, U0. Then, for each iteration, t = 1,…,max-iter:
(1) Populate c input matrices, τi(t−1), one for each cluster, using the membership values (U(t−1)) from the (t−1)th iteration and their selected user-defined transformations.
(2) Approximate c interim fuzzy functions, such as in (3.27), h(τik(t−1)).
(3) Update the membership values for iteration t using (3.30).
(4) Calculate the cluster centers for iteration t using (3.31).
(5) If (obj(t) − obj(t−1)) < ε, stop; otherwise continue with the next iteration.
5.6 Discrete Interval Type-2 Improved Fuzzy Functions with Evolutionary Algorithms

We tried different sets of models for linear LSE and non-linear SVM to measure their performance differences based on error reduction. The value 0 in a control gene token indicates that the corresponding membership value transformation will not be used in the model. The length of the fuzzy function structures, nm, viz. the collection of membership value forms that shape the fuzzy functions, is determined prior to chromosome formation. As mentioned earlier, the length of the genes may differ based on the fuzzy function approximator's type; therefore the genetic structure has a dynamic length. Hence, the chromosome in Figure 5.17 (bottom) is an example extracted from any tth iteration of the GLP when support vector regression is used to approximate the fuzzy functions. The chromosome structure indicates that the m interval is [1.45, 1.75], the number of fuzzy functions is 3, the alpha-cut value is 0.1, and type = 2, which means the SVM will be used to approximate the fuzzy functions. In turn, Creg = 54.4, ε = 0.115, the kernel function indicated by the first control gene implies that a non-linear rbf kernel will be implemented (= 1), and only the exponential transformation of the membership values will be used to shape the system fuzzy function parameters. The alpha-cut indicates that, in each cluster, the interim and local input matrices will be determined based on the constraint μi(x) > α-cut, e.g., equation (4.1); that is, the data vectors with improved membership values less than 0.1 will be discarded when approximating the interim and local fuzzy functions.
The purpose of the new genetic learning process of EDIT2IFF is to find the optimum m interval and the optimum list of membership value transformations to structure the fuzzy functions, i.e., {τs, Φψ}, such as in equations (5.41) and (5.43), where s and ψ represent different fuzzy function structures, as well as the parameters and type of the fuzzy functions, e.g., Creg, ε, K(⋅), so that the optimum model can be captured. The strength of this approach is that each individual in the population can construct different structures, e.g., linear or non-linear, based on the fuzzy function approximation type. The algorithm determines the optimum structure through a probabilistic search method and decides which type of function is better to use for a particular model. The initial population is randomly generated. The fitness function is defined based on the combined performance of the two type-1 improved fuzzy function (T1IFF) models built for the two m-bounds on the validation dataset (as shown in Figure 5.16). It is calculated with the defined performance indicator (PI), where the global minimum value of the PI is searched for selection purposes. Then, the surviving individual is selected using the fitness function by:

$$PI_{pop} = PI_{pop}^{m\text{-}upper} + PI_{pop}^{m\text{-}lower}, \quad pop = 1, \ldots, \text{population-size} \qquad (5.55)$$
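In code, the fitness of (5.55) and the tournament selection mentioned below reduce to a few lines; this is a sketch with names of our choosing.

```python
import numpy as np

def fitness(pi_m_upper, pi_m_lower):
    # PI_pop = PI_pop^(m-upper) + PI_pop^(m-lower), eq (5.55); lower is better
    return pi_m_upper + pi_m_lower

def tournament_select(population, fitnesses, rng, k=2):
    # best (minimum-PI) of k randomly drawn entrants survives
    idx = rng.choice(len(population), size=k, replace=False)
    return population[min(idx, key=lambda i: fitnesses[i])]
```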
The algorithm searches for the optimum model parameters and the m-bound so that the two T1IFF models constructed with the upper and lower boundaries of the degree of fuzziness variable have the minimum error. The algorithm starts with a larger m-bound and gradually shifts to where PIpop is minimized. Different genetic operators, e.g., arithmetic and simple crossover, and non-uniform or uniform mutation operators, are utilized for the parameter and control genes, since these are real and binary numbers, respectively. For the parameter genes, we used arithmetic and simple crossover, and non-uniform and uniform mutation operators. For the control genes, simple crossover and shift mutation operators are utilized. Tournament selection is used for the population selection. An elitist strategy is employed to ensure that the fitness function decreases monotonically: the best candidate solution in each generation enters the next generation directly. The definitions of the genetic operators are described in Appendix C.3.

The genetic learning process of the new EDIT2IFF, as shown in Phase 1 of Figure 5.16, is displayed in Table 5.3. In Table 5.3, each chromosome in the gene pool, i.e., each individual model, is denoted by chrpop, pop = 1,…,total number of chromosomes; the parameters of the DIT2IFF models being optimized take on this subscript to identify a chromosome. In sequence, for each chromosome pop, m-lowerpop and m-upperpop represent the lower and upper values of the fuzziness parameter, e.g., m-lowerpop = 1.1 and m-upperpop = 3.5; cpop represents the number of clusters; α-cutpop represents the alpha-cut ∈ [0,1], used to eliminate anomalies in the membership values; typepop represents the type of the function approximation method, e.g., LSE, SVM, etc. The remaining parameter genes are specific to the type of the function approximation method used.
Table 5.3 The steps of the Genetic Learning Process of EDIT2IFF

GA initializes chromosomes to form the initial population (g = 0).
For each generation g = 1,…,max-number-iterations:
  For each chromosome chrpop in the population, with parameters m-lowerpop, m-upperpop, cpop, α-cutpop, typepop, Cregpop, εpop, Kernel-typepop {K(⋅)}, and the list of membership value transformations used to construct the interim matrix, e.g., {τspop}, identifying the interim fuzzy functions, and the system input matrix, e.g., {Φppop = {Φ1ψ,…,Φc*ψ}}, identifying the system fuzzy functions of each cluster:
    If chrpop has not been used in past iterations:
      - Compute Improved Fuzzy Clustering with the parameters from chrpop using the training data.
      - Approximate the fuzzy functions, fipop(x, Φiψ), for each cluster i = 1,…,cpop using the chrpop parameters.
      - Find the improved membership values of the validation data and infer their output values using each fuzzy function, fipop(x, Φiψ).
      - Measure the fitness value based on PIpop of the validation data.
  GA generates the next population (g + 1) by means of crossover and mutation operations and proceeds to the next generation (g = g + 1).
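A schematic rendering of Table 5.3 follows; the genetic operators and the fitness evaluator (IFC, fuzzy function approximation, and validation PI via (5.55)) are passed in as callables, since their details live elsewhere in the chapter.

```python
import numpy as np

def genetic_learning_process(evaluate, crossover, mutate, select, population, generations):
    """Sketch of the GLP loop in Table 5.3; `evaluate` maps a chromosome to PI_pop."""
    cache = {}                                    # skip chromosomes used in past iterations
    best = None
    for _ in range(generations):
        fitnesses = []
        for chrom in population:
            key = tuple(chrom)
            if key not in cache:
                cache[key] = evaluate(chrom)      # IFC + fuzzy functions + validation PI
            fitnesses.append(cache[key])
        best = population[int(np.argmin(fitnesses))]
        # elitist strategy: the best candidate enters the next generation directly
        population = [best] + [
            mutate(crossover(select(population, fitnesses), select(population, fitnesses)))
            for _ in range(len(population) - 1)
        ]
    return best
```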
These function-specific genes identify the fuzzy function parameters: Cregpop, the regularization parameter; εpop, the error margin; and Kernel-typepop {K(⋅)}, representing the nonlinear transformation of the dataset. The rest of the parameters of the chromosomes are control genes, which take on only 0 or 1 to indicate the type of membership value transformation to be used to identify the fuzzy functions, i.e., interim or system. Each chromosome includes the same list of membership value transformations, which usually includes a long list of different transformations so that the optimum can be identified from within. Thus, the optimum interim matrix, {τ*pop}, used to identify the optimum interim fuzzy function parameters, ŵi*, i = 1,…,c, which is composed of the membership values and their transformations, as well as the system input matrix of each cluster i, i = 1,…,c*, Φ*pop = {Φ1ψ,…,Φc*ψ}, used to identify the local fuzzy function parameters, is identified from this list of possible transformations of membership values. For example, a sample chromosome is shown at the bottom of Figure 5.17. If this were one of the optimum chromosomes with the best fitness function, then the optimum fuzzy functions would be identified from an optimized pool of membership value transformations defined by only an exponential transformation formula, i.e., it is the only token that takes on the value 1, indicating that this particular membership value transformation should be used as an additional input to identify the interim fuzzy functions and system fuzzy functions. If additional transformations were identified as '1', then in the analysis one could use any combination of the membership value transformations to identify as many different fuzzy functions as possible, yielding an uncertainty interval of embedded type-1 improved fuzzy functions (T1IFF) models.
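For illustration, the sample chromosome of Figure 5.17 (bottom) could be encoded as follows; the field names are ours, not the book's, and the control-gene list is shortened.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chromosome:
    m_lower: float                    # lower bound of the degree of fuzziness
    m_upper: float                    # upper bound
    c: int                            # number of clusters (fuzzy functions)
    alpha_cut: float                  # membership threshold in [0, 1]
    ftype: int                        # 1: LSE, 2: SVM fuzzy function approximator
    c_reg: float                      # SVM regularization constant Creg
    eps: float                        # SVM error margin epsilon
    kernel: int                       # control gene: linear vs. non-linear rbf kernel
    mv_tokens: List[int] = field(default_factory=list)  # 0/1 per membership transformation

# the example of Figure 5.17 (bottom): only the exponential transformation is switched on
sample = Chromosome(1.45, 1.75, 3, 0.1, 2, 54.4, 0.115, 1, [0, 0, 1, 0])
```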
Fig. 5.19 Decision surfaces obtained from the GLP for changing fuzzy function structures of chromosomes: (a) fuzzy function surface f(x, eμ), SVM-Linear, kernel token = 0; (b) fuzzy function surface f(x, eμ), SVM-Gaussian, kernel token = 1. K(.) = {Linear or Non-linear} and (mlow, mup, Creg, ε) = {1.75, 2.00, 54.5, 0.115}, c* = 3. uclusi represents the improved membership values of the corresponding cluster i.
The purpose of the genetic learning process (GLP) is to identify the uncertainty interval of the type-2 fuzzy membership values and the list of possible structures of the fuzzy functions (as shown in Figure 5.18). The algorithm tries to find the optimum forms of the membership values to construct the input matrices, {τ*, Φ*}, that identify the interim and system fuzzy functions, e.g., such as equations (5.41) and (5.43), and the parameters and structure of the fuzzy functions (Creg, ε, K(⋅)), such that the estimated output of the optimum model is as close as possible to the identified system. In Figure 5.19, the three different decision surfaces of a single-input (x), single-output (y), Z = {x, y}, 100-data-point, 3-clustered non-linear artificial dataset are shown. The upper and lower graphs are two models identified by two different chromosomes. In this structure, the only difference between the chromosomes of the two models, as shown in the upper and lower graphs, is the kernel type token (token #7 in Figure 5.20), which determines the non-linearity of the fuzzy functions. In this sample, the interim matrix, τ, constructed to identify the interim fuzzy functions in the Improved Fuzzy Clustering (IFC) algorithm, is identified by the exponential transformation of the membership values, eμ, as shown in Figure 5.20, as follows:

$$h_i(\tau_i, \hat{w}_i) = \hat{w}_{0,i} + \hat{w}_{1,i}\, e^{\mu_i^{imp}}; \quad \tau_i \in \mathbb{R}^{n \times 2} = \begin{bmatrix} 1 & e^{\mu_{i,1}^{imp}} \\ \vdots & \vdots \\ 1 & e^{\mu_{i,n}^{imp}} \end{bmatrix}, \quad \hat{w}_i \in \mathbb{R}^{2 \times 1} = [\hat{w}_{0,i} \;\; \hat{w}_{1,i}]^T \qquad (5.56)$$
The 'imp' indicates that the membership values are calculated by the IFC method. In addition, the list of input matrices, Φ = {Φ1¹, Φ2¹, Φ3¹}, used for each cluster to formulate the system fuzzy functions of the corresponding cluster, is identified using only the exponential transformation of the corresponding improved membership values, eμ, as additional parameters to the original input variables, the same for each cluster, as follows:
$$\hat{y}_i^{r,s,\psi} = f_i\left(\Phi_i^{\psi}, \hat{W}_i\right) = \hat{W}_{0,i} + \hat{W}_{1,i}\, e^{\mu_i^{imp}} + \hat{W}_{2,i}\, x; \quad \Phi_i^{\psi} \in \mathbb{R}^{n \times 3} = \begin{bmatrix} 1 & e^{\mu_{i,1}^{imp}} & x_1 \\ \vdots & \vdots & \vdots \\ 1 & e^{\mu_{i,n}^{imp}} & x_n \end{bmatrix}, \quad \hat{W}_i \in \mathbb{R}^{3 \times 1} = [\hat{W}_{0,i} \;\; \hat{W}_{1,i} \;\; \hat{W}_{2,i}]^T \qquad (5.57)$$
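As a sketch under the assumptions of (5.56) and (5.57), the two design matrices and their least-squares parameter estimates can be built as follows; the data values are hypothetical, and the book also fits these functions with SVM regression.

```python
import numpy as np

def interim_matrix(mu_i):
    # tau_i in R^{n x 2} of eq (5.56): columns [1, exp(mu_i^imp)]
    return np.column_stack([np.ones(len(mu_i)), np.exp(mu_i)])

def system_matrix(mu_i, x):
    # Phi_i^psi in R^{n x 3} of eq (5.57): columns [1, exp(mu_i^imp), x]
    return np.column_stack([np.ones(len(mu_i)), np.exp(mu_i), x])

mu_i = np.array([0.9, 0.4, 0.7])                 # hypothetical improved membership values
x = np.array([1.0, 2.0, 3.0]); y = np.array([1.2, 0.8, 1.1])
w_hat, *_ = np.linalg.lstsq(interim_matrix(mu_i), y, rcond=None)   # [w0_i, w1_i] of (5.56)
W_hat, *_ = np.linalg.lstsq(system_matrix(mu_i, x), y, rcond=None) # [W0_i, W1_i, W2_i] of (5.57)
```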
The upper and lower graphs of Figure 5.19 demonstrate two embedded T1IFF model decision surfaces obtained from the two different chromosomes of Figure 5.20, which result from the first step of the re-shaping process of the membership values and fuzzy
functions of the proposed 3-phase EDIT2IFF. The embedded T1IFF models obtained using the parameters denoted by the two different chromosomes have the same parameters except for the kernel type, which identifies the linear or non-linear property of the SVM when SVM is the model approximation function chosen by the genetic algorithm.
Fig. 5.20 Two different chromosomes from the GLP algorithm of the EDIT2IFF modeling approach applied to the Artificial Dataset. The dark-colored token is the only difference between the two chromosomes. '1': Linear Kernel Model, '2': Non-Linear Kernel Model.
The top embedded model in Figure 5.19 uses the linear model, i.e., the 'Kernel Type' token of its chromosome is equal to 1, and the bottom figure is formed with a nonlinear model, i.e., K(.) = '2'. Furthermore, each embedded model has three cluster structures, which identify three different fuzzy functions, one for each of the three clusters. It should be noted that each of the three graphs in the top figure (as well as the bottom figure) contains two decision surfaces, which are formed using two different m values, i.e., m-lower and m-upper, identified by the corresponding chromosome (the first two tokens). If any other m′ value, m-lower < m′ < m-upper, is used from within this m-bound, [m-lower, m-upper], then the decision surface obtained using this m′ would lie between the two decision surfaces obtained using m-lower and m-upper, as shown in Figure 5.19, provided the rest of the parameters indicated by the corresponding chromosome are kept intact.

In Phase 1, the GLP is employed based on T1IFF modeling to optimize the parameters. For each cluster, a different fuzzy decision surface is approximated based on the parameter and control genes of the chromosome structure, using the corresponding cluster's membership values as inputs. The GLP searches for the best fuzzy decision surfaces based on the parameters represented by each chromosome. The gap between the surfaces represents the uncertainty interval that the GLP tries to minimize based on the chromosome structures.

Phase 2: Structure Identification with Discrete Interval Type-2 IFF (DIT2IFF)
In Phase 1 of EDIT2IFF, the GLP captures the uncertainty interval by identifying:

- an optimum m interval, [m-low*, m-up*];
- a list of optimum membership value transformations, e.g., powers of the membership values, exponentials, and logarithmic forms, $\{(\mu)^{p>0}, (e^{\mu})^{p>0}, (\ln(1-\mu)/\mu), \ldots\}$, which are used to form different combinations of fuzzy function structures;
- the optimum list of function parameters, viz. any other fuzzy regression function parameters, e.g., Creg*, ε*, K(⋅), necessary for the model execution.
These parameters are represented by the resulting chromosomes with the best fitness function, i.e., minimum error. Using the identified uncertainty interval of the membership functions and the optimum values of the particular parameters, the new evolutionary method implements the discrete type-2 fuzzy functions method for reasoning. Therefore, in this step, the uncertainty interval identified in the previous step, induced by the change in the degree of fuzziness and the fuzzy function structures identified with a list of optimum membership value transformations, is discretized to find as many embedded T1IFF models as feasible. Here we apply the DIT2IFF; however, this time we have shifted the uncertainty interval towards where the optimum model parameters should reside. The DIT2IFF models in the previous section apply an exhaustive search method to identify the uncertainty interval and optimum parameters, starting with a large uncertainty interval of parameter values and a long list of fuzzy function structures, which takes a longer time to converge. With the new approach, after Phase 1, we have a preconception about the boundaries of the parameters, i.e., whereabouts their optimum values should be searched.

The new approach has a unique property that we should emphasize once more at this step. Previous type-2 fuzzy logic systems [Mendel et al., 2006; Uncu et al., 2004a; Uncu and Turksen, 2007] construct a general fuzzy function structure for a system model and use this structure to construct each rule. A model that is represented with different fuzzy function structures (characteristics) for each cluster (rule) has a better chance of identifying the optimum model than a method that builds a model with a single function structure. In order to capture uncertainty, in the new model we utilize the best fuzzy function structures and fuzziness values at the cluster level, considering that the cluster center representatives are kept the same. Hence, the algorithm captures the best local function structures based on the training and validation datasets and preserves them in a matrix (collection table) to be used by the inference, e.g., equations (5.49) and (5.50). This way, the system model is able to have different local fuzzy models. Identifying a list of best models, in other words an interval of possible solutions instead of one particular optimum model, may increase the ability of the models to capture structure uncertainties in the T1IFF system.

As a result of the genetic learning process in Phase 1, we obtain the optimum learning parameters, PL* = {[m-lower*, m-upper*], c*, Creg*, ε*, K(⋅), the list of optimum membership value transformations}, from the surviving chromosome with the best PI value. It should be noted from PL* that the optimum fuzziness parameter, m, is defined as an interval, [m-lower*, m-upper*], which is used as an input of the DIT2IFF method in Phase 2 of the EDIT2IFF strategy. In addition, the optimum list of membership value transformations identified by the GLP is later used to identify the optimum interim fuzzy function parameters by constructing as many different
interim matrices, τs, s = 1,…,nif, to build nif different IFC models, and as many local input matrices for each cluster, Φp = {Φ1ψ,…,Φc*ψ}, ψ = 1,…,nf, p = 1,…,(nf)^c*. Thus, the list of optimum membership value transformations identified by the GLP determines the optimum uncertainty interval of the membership values and fuzzy functions. The rest of the parameters in PL* are crisp optimum values.

In Phase 2, the optimum parameters denoted by PL* are used to build a DIT2IFF system model. The interval identified by the optimum upper and lower values of the fuzziness variable, [m-lower*, m-upper*], which defines the uncertainty interval, is converted into a list of embedded membership values using {mr}, r = 1,…,nr. In addition, all possible combinations of the list of optimum membership value transformations identified in the first phase of the algorithm are used to form the list of optimum fuzzy function structures from which the embedded models are built. For this discrete parameter set we construct (1) an interim matrix, τs, to identify the interim fuzzy functions for the IFC clustering, which comprises the membership values and their transformations only, and (2) the local input matrix to identify the (system) local fuzzy function structures, Φp, which comprises the original input variables, the membership values, and their transformations. One can define different matrix structures for the fuzzy functions of each cluster, Φiψ, using the list of possible fuzzy function structures. The optimum local fuzzy function structures of the optimum DIT2IFF models are then represented with Φp = {Φ1ψ,…,Φc*ψ}, ψ = 1,…,nf, p = 1,…,(nf)^c*, one for each cluster, i = 1,…,c*, where Φiψ represents one form of fuzzy function structure used to identify the local fuzzy function of the ith cluster. Examples of different types of fuzzy functions are given in equations (5.49) and (5.50). For each discrete value of these parameters, 〈mr, τs, Φp〉, one embedded T1IFF model is constructed (see the sketch below). Since some of the parameters are already optimized and the uncertainty intervals of others are reduced, there are fewer discrete embedded T1IFF models to build for EDIT2IFF structure identification than for the previous DIT2IFF models based on exhaustive search, in which one has to search all combinations to find the optimum ones. In EDIT2IFF, the initial GLP step helps to eliminate unpromising values of some of the parameters. It should be pointed out that the second phase of EDIT2IFF is the same as the DIT2IFF strategy, except that the parameters are pre-determined in the first phase of the EDIT2IFF strategy.

Figure 5.21 summarizes the identification of the optimum uncertainty interval of membership values based on the three phases of the EDIT2IFF model. The interval identified at the beginning of the algorithm is reduced in the first phase of EDIT2IFF; in the second phase, this interval is discretized, i.e., converted into a list of discrete membership values, to find the embedded type-1 improved membership values, i.e., the scatter clouds in the upper right graph. The gray area in the upper left graph indicates the range of membership values that could be defined using these parameters, i.e., the discretized degree-of-fuzziness interval, determined by the upper and lower fuzziness degrees, and the list of different membership value transformations used to identify the interim fuzzy functions.
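As a rough illustration of how the discretized tuples 〈mr, τs, Φp〉 enumerate embedded type-1 models, consider the following sketch. All concrete values (nr, the placeholder structure names, c*) are hypothetical, and in practice each tuple would trigger a full T1IFF build rather than just being counted:

```python
from itertools import product

# Discretize the optimized fuzziness interval [m_lower*, m_upper*] into nr values
m_lower, m_upper, nr = 1.4, 2.2, 5
m_values = [m_lower + t * (m_upper - m_lower) / (nr - 1) for t in range(nr)]

# Optimum interim structures (tau_s) and local fuzzy function structures
# (Phi_psi) surviving the GLP phase -- placeholder names here.
taus = ["tau_1", "tau_2"]           # nif = 2 interim matrices
phis = ["Phi_a", "Phi_b", "Phi_c"]  # nf = 3 candidate structures per cluster
c_star = 2                          # optimum number of clusters

# One embedded T1IFF model per tuple <m_r, tau_s, Phi_p>, where Phi_p assigns
# one structure to each of the c* clusters: (nf)^c* assignments.
embedded = [
    (m_r, tau_s, phi_p)
    for m_r, tau_s, phi_p in product(m_values, taus,
                                     product(phis, repeat=c_star))
]
print(len(embedded))  # nr * nif * nf^c* = 5 * 2 * 9 = 90 embedded models
```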
The algorithm narrows down this uncertainty interval in the initial step and shifts it to where the optimum interval should be. In the second phase, the identified optimum parameter values are discretized to form an optimum interval of membership values, as shown in the top right graph of Figure 5.21. In the
magnified view on the bottom graph, the interval valued membership values for a given data point, x′, are shown; they are identified with t = 1,…,(nr×nif) discrete membership values in each cluster, each denoted with μi(x′, c*, mr, τs), i = 1,…,c*, r = 1,…,nr, s = 1,…,nif.
Fig. 5.21 Interval valued membership values of the evolutionary type-2 fuzzy functions strategy. The uncertainty interval is represented with the membership value dispersion induced by each tuple 〈mr, τs〉. (a) Optimized uncertainty interval from GLP – Phase 1 of EDIT2IFF, (b) discrete improved membership values – Phase 2 of EDIT2IFF, (c) magnified view of different discrete membership values for any x′.
The identification of the optimum uncertainty interval of the fuzzy functions, as shown in Figure 5.21, proceeds in a manner analogous to the uncertainty identification of the membership values explained in the preceding paragraphs. At the start of the GLP process, a wide list of possible fuzzy function structures is introduced to the system. The genetic algorithm identifies the optimum fuzzy function structures by identifying the optimum forms of the membership value transformations. Hence, this corresponds to identifying the optimum list of fuzzy function structures, viz., reducing the uncertainty interval of the fuzzy functions down to where the optimum values can be found, as shown in Figure 5.22.
This could mean identifying an optimum list of membership value transformations containing anywhere from a single transformation to all of them. Any possible combination of membership values identifies a different fuzzy function structure, from which a different estimated output value can be extracted. Thus, this forms the uncertainty interval of the fuzzy functions, which includes the embedded fuzzy functions. During structure identification of the DIT2IFF system, one embedded fuzzy function, f(Φi^{r,s,ψ}), such as in (5.57), is approximated for each cluster i using each set 〈c*, mr, τs〉, producing as many output values as possible for each data point, as shown in Figure 5.22.
Fig. 5.22 Uncertainty interval of fuzzy function structures. (a) Uncertainty interval represented with the different output values obtained from the list of fuzzy function structures induced by each tuple 〈mr, τs, Φp〉, (b) optimized uncertainty interval from GLP – Phase 1 of EDIT2IFF, (c) magnified view of the different output values, ŷik^{r,s,ψ}, for a specific x′ vector obtained from the optimized list of fuzzy functions.
The top-left graph in Figure 5.22 represents the initial fuzzy functions at the start of the algorithm, i.e., the output values of the fuzzy functions obtained from each embedded model at the start of the GLP. The GLP identifies the optimum fuzzy function structures by identifying the optimum forms of membership value transformations to be used to approximate the local fuzzy
function structures. Hence, the algorithm narrows down this uncertainty interval in the initial step and shifts it to where the optimum interval should be, by identifying selected fuzzy function structures that may be the optimum ones, as shown in the top-right graph of Figure 5.22. In the second stage, the DIT2IFF method uses only these selected membership value transformations to identify fuzzy functions and obtain different outputs for a given input data point. This way, the number of embedded fuzzy function models is reduced, and the system only deals with candidate embedded models that were optimized in the first phase. The interval valued estimated output values of a given data point, x′, are displayed in the magnified view on the bottom graph of Figure 5.22. They are obtained from each local fuzzy function, identified with nf discrete fuzzy functions in a particular cluster, denoted with

$$\hat{y}_{ik}^{r,s,\psi} = f_i^{r,s,\psi}(x', c^*, m_r, \tau_s, \Phi_\psi), \quad i = 1,\ldots,c^*,\ r = 1,\ldots,n_r,\ s = 1,\ldots,n_{if},\ \psi = 1,\ldots,n_f,$$

e.g., such as in (5.49), (5.50), and calculated using equations like (5.57). The fuzzy output value from each function is weighted with its corresponding membership value to calculate a single crisp output value using
$$\hat{y}_k^q = \frac{\sum_{i=1}^{c^*} \mu_{ik}^{r,s}\,\hat{y}_{ik}^{r,s,\psi}}{\sum_{i=1}^{c^*} \mu_{ik}^{r,s}} \qquad (5.58)$$
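In code, the defuzzification in (5.58) is simply a membership-weighted average of the local fuzzy function outputs for one data point; a minimal sketch with toy values:

```python
import numpy as np

def crisp_output(mu_k, y_hat_k):
    """Eq. (5.58): membership-weighted average for one data point k.
    mu_k    : membership values of point k in each of the c* clusters
    y_hat_k : local fuzzy function outputs of point k, one per cluster
    """
    mu_k, y_hat_k = np.asarray(mu_k), np.asarray(y_hat_k)
    return float(np.sum(mu_k * y_hat_k) / np.sum(mu_k))

# Toy example: three clusters for one embedded model <m_r, tau_s>
print(crisp_output([0.7, 0.2, 0.1], [10.0, 14.0, 20.0]))  # -> 11.8
```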
It should be noted from Figures 5.21 and 5.22 that we deal with the membership value scatter matrices and obtain the fuzzy function outputs as a scatter diagram, in order to define the interval valued membership values and fuzzy functions for the identified optimum parameter lists. Based on the optimum performance measure, the optimum model parameters for each training data point are captured and retained in collection tables such as in equations (5.49), (5.50), to be used to infer the output values of the testing and validation data samples as follows:
$$\arg\min_{q}\left(y_k-\hat{y}_k^{\,q}\right)^2 \in \left\{\, q \;\middle|\; \exists\, q',\ \left(y_k-\hat{y}_k^{\,q}\right)^2 < \left(y_k-\hat{y}_k^{\,q'}\right)^2 \right\} \qquad (5.59)$$
The rest of the structure identification is the same as in the DIT2IFF structure identification method described in section 5.3.
Phase 3: Inference Method for Evolutionary Discrete Interval Type-2 IFF (EDIT2IFF)

The inference methodology of the new strategy is similar to the DIT2IFF approach. For each testing data sample, one crisp output value is obtained by applying the inference mechanism of the DIT2IFF method. The collection tables constructed in Phase 2 of the algorithm, e.g., equations (5.49), (5.50), are used to infer crisp output values for the testing cases.
One final note on the new type-2 models: the DIT2IFF and its extension using a stochastic search technique, EDIT2IFF, were presented for regression type problem domains. We also adapted the two algorithms for classification problems:

• Discrete Interval Type-2 Improved Fuzzy Functions for Classification (DIT2IFF-C),
• Evolutionary Design of Discretized Interval Type-2 Improved Fuzzy Functions for Classification (EDIT2IFF-C).
The classification extensions differ from the regression strategies presented in this chapter in that the fuzzy function approximators are replaced with classification fuzzy functions, and the fitness evaluations are replaced with performance evaluation criteria for classification problems, such as the Area Under the ROC (Receiver Operating Characteristic) curve (AUC) or the classification recognition percentage, to be discussed in the experiments section. In these classification extensions, we implemented the IFC-C classification method to find the improved membership values. Since the structure of the system modeling and inference modules is not affected by this change, we do not present these methodologies in detail.
5.6.3 Reduction of Structure Identification Steps of DIT2IFF Using the New EDIT2IFF Method
In the discrete interval valued type-2 improved fuzzy functions (DIT2IFF) systems, the initial parameters are optimized with an exhaustive search based on the supervised learning method, iterating over the list of parameters: the degree of fuzziness (m), the number of clusters (c), the types of membership value transformations used to construct the matrices (τ, Φ) that identify the interim and local fuzzy functions, the alpha-cut constant, and the fuzzy function approximator parameters, e.g., when SVM is used, the regularization constant (C-reg), the kernel type, and the error margin (ε). For each parameter set, one T1IFF model is built, and the optimum set is determined based on cross validation analysis. At the beginning of the DIT2IFF algorithm, we implemented an exhaustive search using T1IFF to optimize some of the parameters, i.e., the fuzzy function approximators' parameters and the number of clusters. Assuming each parameter takes N different values (except the degree of fuzziness, which is set to m = 2, with one set of fuzzy functions used for the initial T1IFF models) and two different kernel types are iterated, the number of iterations of the initial exhaustive search is 2N^5 (see the sketch below). Then the boundaries of the degree of fuzziness, [m-lower, m-upper], are specified by the user, to be discretized in the search for the optimum values for each input data point. In addition, a list of possible fuzzy function structures is identified. Assuming the m interval is discretized into N values, there are N different fuzzy function structures, and the optimum values of the remaining parameters from the initial T1IFF exhaustive search are used, the total number of T1IFF executions during structure identification is 2N^5 + N^(2+c).
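A sketch of this initial exhaustive grid, with illustrative parameter values (N = 3 here), shows where the 2N^5 count comes from; each combination would require building one T1IFF model:

```python
from itertools import product

# Illustrative discrete grids (N = 3 values per parameter here)
grid = {
    "c":       [3, 5, 7],            # number of clusters
    "alpha":   [0.0, 0.1, 0.2],      # alpha-cut constant
    "tau_phi": ["s1", "s2", "s3"],   # structure of fuzzy functions
    "C_reg":   [1.0, 10.0, 100.0],   # SVR regularization constant
    "eps":     [0.01, 0.1, 0.5],     # SVR error margin
    "kernel":  ["rbf", "poly"],      # 2 kernel types
}
# m is fixed to 2 and one set of fuzzy functions is used in this initial search.
combos = list(product(*grid.values()))
print(len(combos))  # 2 * N^5 = 2 * 3^5 = 486 T1IFF models to build
```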
Table 5.4 Number of parameters of Discrete Interval Type-2 Improved Fuzzy Functions (DIT2IFF)

Initial parameter optimization based on exhaustive search using the T1IFF method (number of discrete values):
  c: number of clusters                        N
  m: degree of fuzziness                       1
  α-cut                                        N
  {τ, Φ}: structure of fuzzy functions         N
  Creg: regularization constant of SVR         N
  ε: error margin constant of SVR              N
  Kernel type                                  2
  Subtotal                                     2N^5

DIT2IFF optimization:
  Discrete values of m                         N
  {τ}: interim fuzzy function types            N
  {Φ}: system fuzzy function types             N^c
  Total                                        2N^5 + N^(2+c)
On the other hand, for the proposed EDIT2IFF, let the total number of iterations of the genetic algorithm be N². In each iteration, one T1IFF is executed for each of the two child chromosomes produced by a crossover operation, and for the one child produced by mutation from the two selected parents. This is repeated for each m-value in the chromosome, i.e., m-lower and m-upper. Roughly speaking, we set the number of iterations to 100, so the correspondence to the N different values of a T1IFF parameter listed in Table 5.4 is ~N². The population size is also set to N², since we use 50–100 different populations in our experiments. Therefore, the initial number of iterations of EDIT2IFF is 2N² + 2·2N² = 6N². Then, using the optimum parameters, DIT2IFF is executed for the optimum m interval, which is discretized into N values. In addition, the combinations of the list of possible fuzzy function structures are used as N separate structures to execute the DIT2IFF method, even though there should be fewer than N values for these parameters, since the uncertainty interval is already optimized in the genetic learning process of EDIT2IFF. Nonetheless, we converted the reduced uncertainty interval into the same number of discrete values. The total number of iterations of EDIT2IFF is shown in Table 5.5. It is evident from Table 5.4 and Table 5.5 that the number of iterations is reduced when EDIT2IFF is used instead of the DIT2IFF approach, i.e., 6N² + N^(2+c) < 2N^5 + N^(2+c), which holds whenever 3N² < N^5, i.e., for N ≥ 2 (see the check below).
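As a quick numerical check of the counts in Tables 5.4 and 5.5 (the values of N and c are illustrative):

```python
def dit2iff_iters(N, c):
    return 2 * N**5 + N**(2 + c)   # exhaustive initial search + discrete models

def edit2iff_iters(N, c):
    return 6 * N**2 + N**(2 + c)   # GLP phase + discrete models

for N, c in [(5, 3), (10, 3)]:
    print(N, c, dit2iff_iters(N, c), edit2iff_iters(N, c))
# e.g., N=10, c=3: 300,000 vs 100,600 T1IFF executions
```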
Table 5.5 The number of parameters of Evolutionary Discrete Interval Type-2 Improved Fuzzy Functions (EDIT2IFF)

Genetic Learning Process – Phase 1 of EDIT2IFF (number of discrete values):
  Initial run: population size = N², # of T1IFF models per chromosome
  (one for each m value)                                              2N²
  Secondary run: evaluation of 1 crossover and 1 mutation operation   2N² × 2
  Subtotal                                                            6N²

DIT2IFF optimization:
  Discrete values of m                         N
  {τ}: interim fuzzy function types            N
  {Φ}: system fuzzy function types             N^c
  Total                                        6N² + N^(2+c)
In the experiments chapter, real datasets are used to measure the elapsed time of these two methods under the same initial parameter sets.
5.7 Summary

This chapter presented two novel Discrete Interval Type-2 Improved Fuzzy Functions strategies to identify uncertainties in system models. Structurally, the novel strategies differ from traditional type-2 fuzzy rule base approaches: they employ a new method to identify uncertainty in the learning parameters and the function structures. Two different types of uncertainty are taken into consideration, namely the uncertainty in the selection of the improved fuzzy clustering parameters and the uncertainty in determining the mathematical model structure of each local fuzzy function. In the second novel strategy, the optimum parameters are captured and the uncertainty interval of fuzziness is identified with a genetic learning algorithm. This reduces the number of steps needed to identify the optimum parameters compared to the Discrete Interval Type-2 Improved Fuzzy Functions method based on exhaustive search. The heterogeneous, dynamic-length chromosome structure makes it possible to optimize parameters with different domains in the same model, utilizing their cross-combination effects. Additionally, the new type-2 inference schema allows different membership functions and fuzzy function structures in different local structures, which helps to identify the uncertainty of the system model.
Chapter 6

Experiments
This chapter presents the results of experiments on benchmark and real life datasets, carried out to evaluate the performance of the proposed algorithms. The performance of the proposed Fuzzy Functions approaches is analyzed against that of other well-known soft computing methods of system modeling on several datasets. Our goal is to assess the prediction performance and robustness of the proposed methodologies on real life datasets under a variety of scenarios, altering the system parameters using cross validation analysis.
6.1 Experimental Setup

6.1.1 Overview of Experiments

Information about the datasets used to test the performance of the proposed approaches against other well-known approaches is listed in Table 6.1. The datasets are classified into two groups based on their structure. Datasets 1 through 4 are regression type datasets, where the output variable has a continuous domain, y∈ℜ, and datasets 5 through 10 are classification datasets, where the output variable has a discrete domain. In these experiments only binary classification datasets with a dichotomous output variable are used, i.e., y∈{0,1}. Dataset 3 of the regression type includes five different stock price datasets; these are analyzed differently from the rest of the regression datasets (Datasets 1, 2, and 4). In the next section, the sub-sampling cross validation method applied in each experiment is presented. The performance measures listed in Table 6.1 are then explained in more detail.

The parameters are optimized by exhaustive search or by genetic algorithms, depending on the methodology used. Exhaustive search verifies all possible combinations of the optimized parameters, thus ensuring that the best possible solution will be found. Genetic algorithms are search algorithms based on the mechanics of natural selection and natural genetics. They combine the survival-of-the-fittest rule with structured yet randomized information exchange. Genetic algorithms possess the best characteristics of other optimization methods, such as robustness and fast convergence, and do not depend on properties of the optimization criteria (for instance, on smoothness).
Table 6.1 Overview of datasets used in the experiments

No  Dataset                      Type*  OBS**    #Var§  Training  Validation  Testing  Performance Measure Usedð
1   Friedman Artificial          R       9,791     5      500        250       9,000   RMSE, MAPE, R²
2   Auto-Mileage (UCI)           R         398     8      125         45         100   RMSE, MAPE, R²
3   Stock Price Prediction:
      TD                         R         389    16      120         90         100   RSTB
      BMO                        R         445    16      200        144         100   RSTB
      Enbridge                   R         445    16      200        144         100   RSTB
      Loblaws                    R         445    16      200        144         100   RSTB
      Sun Life                   R         445    16      200        144         100   RSTB
4   Desulphurization Process:
      Reagent1                   R      10,000    11      250        750       8,000   RMSE, MAPE, R²
      Reagent2                   R      10,000    11      250        750       8,000   RMSE, MAPE, R²
5   Liver Disorder (UCI)         C         345     6      175         75          50   Accuracy, ROC/AUC, ranking
6   Ionosphere (UCI)             C         349    34      150        120          80   Accuracy, ROC/AUC, ranking
7   Breast Cancer (UCI)          C         277     9      130         70          50   Accuracy, ROC/AUC, ranking
8   Diabetes (UCI)               C         768     8      125         75          50   Accuracy, ROC/AUC, ranking
9   Credit Scoring (UCI)         C         690    15      150         75          50   Accuracy, ROC/AUC, ranking
10  California Housing           C      20,640     9      500        500      12,600   Accuracy, ROC/AUC, ranking

* R: Regression, C: Classification type datasets.
** OBS: Total number of cases (i.e., instances, objects, data points, observations).
§ Var: Total number of attributes/features/variables in the dataset.
ð Performance measures used to evaluate each model performance in the comparative analysis: RMSE, MAPE, R², and the Robust Simulated Trading Benchmark (RSTB) for regression datasets; Accuracy, ROC curve/AUC, and several ranking methods for classification datasets.
UCI: University of California, Irvine, Real Dataset Repository.
6.1.2 Three-Way Sub-sampling Cross Validation Method

In this section, numerical examples illustrate how the system modeling methods are applied using the three-way cross validation method [Rowland, 2003]. In each experiment, the entire dataset is randomly separated into three parts: training, validation, and testing. The training dataset is used to find the parameters of a model, the validation dataset tunes these parameters and finds the optimum model, and the testing dataset is used to assess the performance of the optimum model (no tuning is done on the testing dataset). Figure 6.1 displays the general framework of the three-way cross validation method, which is used in building every algorithm presented in this work, i.e., both the proposed and the benchmark algorithms.
Fig. 6.1 General Framework of the Three-Way Cross Validation Method
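A minimal sketch of one repetition of the random three-way split; the sizes follow the Auto-Mileage row of Table 6.1, and the helper name is our own:

```python
import numpy as np

def three_way_split(n_obs, n_train, n_val, n_test, rng):
    """Randomly partition indices 0..n_obs-1 into train/validation/test."""
    idx = rng.permutation(n_obs)
    train = idx[:n_train]
    val   = idx[n_train:n_train + n_val]
    test  = idx[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

rng = np.random.default_rng(0)
# e.g., Auto-Mileage (UCI): 398 observations -> 125 / 45 / 100
train, val, test = three_way_split(398, 125, 45, 100, rng)
print(len(train), len(val), len(test))
```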
Figure 6.2 shows the same three-way cross validation method specialized to the fuzzy function methodologies of this work. The three-way cross validation method applied to the stock price estimation datasets (Dataset 3 in Table 6.1) differs slightly from the rest of the datasets, especially in the construction of the testing dataset. Initially, stock prices within the studied period are divided into two parts. Specific to the stock price estimation models, the data is not divided randomly, since these are time series data and the analysis requires continuity from one data vector to the next. The first period is used for constructing five different training and validation datasets. The last period, 100 trading days of each stock price, is used for testing purposes. An example of the sampling method is illustrated in Figure 6.3 using an artificial stock price dataset. Stock prices from the first part of the selected period are used for learning and optimizing the model parameters. Random samples are selected to construct training and validation datasets at each repetition. The performance of the optimum model is evaluated on the last part of the time series, which we call the testing dataset.
Fig. 6.2 Three-way cross validation used in the Fuzzy Functions approaches

Fig. 6.3 Schematic view of the three-way cross validation process used for stock price estimation models
Each experiment is repeated k times, e.g., k ∈ {5, 10, …}, by selecting different samples of different sizes from the pool of vectors to create the training and validation datasets (the process in Figure 6.1 is repeated k times). Table 6.1 displays the number of instances of the training, validation, and testing datasets used when building the models of each experiment. It should be stressed that, in order to make a fair comparison between the proposed and benchmark methods, the very same training, validation, and testing datasets are used to build the models for each algorithm. In particular, the training, validation, and testing data samples used to evaluate the proposed fuzzy functions approaches are also used in the benchmark methods, e.g., SVM regression, ANFIS, DENFIS, etc., to learn, validate, and test their performance. Similarly, each algorithm evaluates its optimum parameters
using the same testing datasets. Let the performance of each method be represented with a tuple ⟨P̄M, sdPM⟩, where P̄M is the average of the performances obtained from the testing datasets over the k repetitions and sdPM is their standard deviation. The calculation of these tuples for each methodology is shown in Table 6.2.

Table 6.2 Calculation of the overall performance of a method based on three-way cross validation results. The overall performance is represented with the tuple ⟨P̄M, sdPM⟩.
Cross validation repetition   Performance measure obtained from testing dataset
1                             PM₁
2                             PM₂
…                             …
k                             PMₖ

$$\overline{PM} = \frac{1}{k}\sum_{k} PM_k, \qquad sdPM = \sqrt{\frac{1}{k}\sum_{k}\left(PM_k - \overline{PM}\right)^2}$$
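Computing the tuple ⟨P̄M, sdPM⟩ from the k testing-set performances is then straightforward; the PM values below are toy numbers:

```python
import numpy as np

pm = np.array([0.81, 0.79, 0.84, 0.80, 0.82])   # toy PM_k values, k = 5
pm_mean = pm.mean()
pm_sd   = np.sqrt(np.mean((pm - pm_mean) ** 2))  # population form, as in Table 6.2
print(round(pm_mean, 4), round(pm_sd, 4))
```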
6.1.3 Measuring Models' Prediction Performance

In the experiments, different types of evaluation criteria, listed next, are used to analyze the prediction performances of the proposed fuzzy system modeling strategies in comparison to the benchmark methods. The type of evaluation criterion (performance measure) depends mostly on the structure of the system domain. Therefore, we separate the performance measures of the regression and classification problem domains. Additionally, a new performance measure is introduced specifically for stock price estimation models.

6.1.3.1 Performance Evaluations of Regression Experiments

Let yk and ŷk∈ℜ represent the actual and model output values of a datum k, respectively. In this work, to evaluate the performance of each methodology, four different functions are used for the regression type datasets, where the observed output variable has a continuous domain, viz. y∈ℜ:
1. Root mean square error, $RMSE = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(y_k - \hat{y}_k\right)^2}$

2. Mean absolute percentage error, $MAPE = \frac{1}{n}\sum_{k=1}^{n}\frac{\left|y_k - \hat{y}_k\right|}{y_k}\cdot 100$

3. Coefficient of determination, $R^2 = 1 - \frac{SS_E}{SS_T}$, where $SS_T = \sum_k \left(y_k - \bar{y}\right)^2$ and $SS_E = \sum_k \left(y_k - \hat{y}_k\right)^2$ ¹

4. Robust Simulated Trading Benchmark (RSTB) – to be explained later.

Here yk is the actual output and ŷk is the predicted output.

¹ SST: Total Sum of Squares, SSE: Error Sum of Squares.
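As a compact sketch, the three regression measures above can be computed as follows (toy values; MAPE assumes non-zero actual outputs):

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y)) * 100  # assumes y != 0

def r2(y, y_hat):
    ss_e = np.sum((y - y_hat) ** 2)           # error sum of squares
    ss_t = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1.0 - ss_e / ss_t

y, y_hat = np.array([3.0, 5.0, 8.0]), np.array([2.5, 5.5, 7.0])
print(rmse(y, y_hat), mape(y, y_hat), r2(y, y_hat))
```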
RMSE – one of the most commonly used performance measures; it captures the deviation between the predicted output and the actual observed output. It is widely accepted for performance evaluation, and recent publications, e.g., [Salakhutdinov et al., 2007], continue to use RMSE as an acceptable performance measure. Thus, we use this measure in the experiments where applicable. RMSE = 0 means that the model output exactly matches the observed output.

MAPE – Mean Absolute Percentage Error is a commonly used statistical measure of goodness of fit in quantitative forecasting methods. It produces a measure of relative overall fit. MAPE is a normalized measure; MAPE = 0 means that the model output exactly matches the observed output.

R² – the coefficient of determination. Regardless of the structure of a model, one can always compute the total variance of the dependent variable (total sum of squares, SST), the proportion of variance due to the residuals (error sum of squares, SSE), and the proportion of variance due to the regression model (regression sum of squares, SSR = SST − SSE). The ratio of the regression sum of squares to the total sum of squares, SSR/SST = 1 − (SSE/SST), explains the proportion of variance in the dependent variable (y) accounted for by a model; this ratio is the R-square (0 ≤ R² ≤ 1), the coefficient of determination. This measure helps to evaluate how well the model fits the data. R² = 1 indicates that the model can explain all the variability of the output variable, while R² = 0 indicates that it explains none.

RSTB – a new performance measure, the Robust Simulated Trading Benchmark, is introduced specifically for stock price prediction problems. In any model of a trading system, the main goal is to improve profitability. A profitable prediction is a better prediction even if it has lower accuracy according to other criteria, e.g., accuracy in predicting the next-day direction of a stock. In [Deboeck, 1992] it was shown that a neural network that correctly predicted the next-day direction 85% of the time consistently lost money: although the system correctly predicted the market direction, its price prediction accuracy was low. Hence, the evaluation of trading models should not be based solely on the predicted directions of stock price movements. In addition, as will be shown in the results of the stock price predictions in the next section, the accuracies of the benchmarked methods are not always significantly different from one another, which makes it difficult to identify a single best model for the estimation of stock prices. Since the aim of stock trading models is to return a profit, profitability should be the performance measure. For these reasons, on top of the well-known performance
measures for regression models, here we introduce a new criterion, the Robust Simulated Trading Benchmark (RSTB), based on the profitability of the models used to predict stock prices. The RSTB combines three different properties into one performance measure, namely the market directions, the prediction accuracy, and the robustness of the models. RSTB is driven by a conservative trading approach; the higher the RSTB, the better the profitability of the model. The details of the new RSTB are presented in the analysis of stock prices in section 6.4.4.

6.1.3.2 Performance Evaluations of Classification Experiments

The classification type datasets used in this work have output variables with a dichotomous structure, e.g., y∈{0, 1}. To evaluate the performance on classification datasets, three different criteria are used:

• Accuracy,
• Area Under the ROC Curve (AUC),
• Ranking methods: Average Rank (AR), Success Rate Ratio (SRR), Significance Win Ratio (SWR), Percent Improvement Ratio (PIR).
Accuracy – Classification accuracies are measured based on the contingency table as follows:

Table 6.3 Contingency table used to calculate accuracy
                  Predicted Positive   Predicted Negative
Actual Positive   True Positives       False Negatives
Actual Negative   False Positives      True Negatives
$$accuracy\ (\%) = \frac{(\text{True Positives}) + (\text{True Negatives})}{\text{number of data}\ (n_d)} \qquad (6.1)$$
The maximum accuracy a test can have is 1, the minimum 0; ideally, we want a test's accuracy to be close to 1. Since the classification model outputs are probabilities, different threshold values (discerning between the two classes) are varied to obtain the optimum True Positives (TPs) and True Negatives (TNs) during the learning stage of each modeling approach. The threshold values identified by the structure identification are then used during inference to estimate the class labels of the testing datasets.

ROC – Receiver Operating Characteristics uses the prediction probabilities directly in model performance evaluations. With most algorithms, such as logistic regression and support vector machines, we obtain prediction probabilities instead of prediction labels. Accuracy measures do not directly consider these probabilities. In addition, accuracy is not a good measure when there is a large difference between the number
of positive and negative instances. However, in many data mining applications, such as ranking customers, we need predicted probabilities rather than crisp predictions. The probabilities show the true ranking, and in this way the possible information loss due to discretization of the predicted output at an unknown threshold is prevented. Thus, it is more appropriate to use a very common validation technique, the receiver operating characteristic (ROC) curve [Swets, 1995; Bradley, 1997], which uses probabilities as inputs to evaluate models, instead of the accuracy measure. In [Huang and Ling, 2005], the performance of ROC analysis is discussed in comparison to the simple accuracy measure; it was mathematically shown that the area under the ROC curve (AUC), to be discussed next, should replace accuracy in measuring and comparing classification methods. Their argument originated from the fact that the accuracy and AUC measures obtained from the same methods were not always correlated. In the classification experiments of this work, we encountered the same situation, where the accuracy measures are not correlated with the AUC values. Even though both performance measures are listed in the Appendix, the analysis of the classification datasets is based on the AUC performance measure.

Fig. 6.4 A sample Receiver Operating Characteristic (ROC) curve
A sample ROC curve is shown in Figure 6.4. The idea behind the ROC curve is that one defines various possible cut-off points Cj and classifies each data vector with a probability higher than Cj as a potential success (positive) and lower than Cj as a potential failure (negative). Thus, at each cut-off point a hit rate (True Positive Rate), TPR(Cj) = TP(Cj)/NP, is identified, where TP(Cj) is the number of correctly predicted positives at the given cut-off Cj and NP is the total number of actual positive outputs. Also, a false alarm rate (False Positive Rate), FPR(Cj) = FP(Cj)/NN, is defined, where FP(Cj) is the number of negative output instances incorrectly predicted as positive at the given cut-off Cj and NN is the total number of negative instances. The ROC curve is thus the plot of the TPRs as a function of the FPRs, and the performance measure to evaluate is the area under the curve above the diagonal, as depicted in Figure 6.4. As the area under the ROC curve increases (towards the perfect classifier), the AUC and the prediction power increase.
6.1 Experimental Setup
225
The TPR and FPR values used to obtain the ROC curve are calculated as follows. The scores produced by a classifier represent the probability of each datum belonging to one class (for binary classifiers, usually one class becomes the base class, e.g., y = 1); these are sorted in descending order. A threshold must be determined in order to predict the class label of each datum. By varying the threshold, different values of TPR and FPR are obtained, and the ROC curve can be used to show the trade-off of errors at different thresholds. Figure 6.5 shows an example of a ROC curve on a dataset of twenty instances, ten positive and ten negative, which are also shown. In the table on the right-hand side of Figure 6.5, the instances are sorted by their scores (class probabilities), and each point on the ROC graph is labeled by the threshold that produces it. A threshold of +∞ produces the point (0, 0). As we lower the threshold to 0.9, the first instance is classified positive and the rest negative; at this point TPR = 1/10 = 0.1 and FPR = 0/10 = 0, yielding the point (0, 0.1) on the ROC curve. As the threshold is lowered further, the TPR and FPR values are re-calculated to obtain a point for each possible threshold value. The ideal point on the ROC curve is (0, 1): all positive examples are classified correctly and no negative examples are misclassified as positive.
Fig. 6.5 The ROC "curve" created by varying threshold values. The table at right shows 20 data points and the score (probability) assigned to each of them. The graph on the left shows the corresponding ROC curve, with each point labeled by the threshold that produces it.
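The threshold sweep described above, together with the trapezoidal AUC of equation (6.2) discussed next, can be sketched as follows; the scores and labels are toy values, and ties between scores are not handled:

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep thresholds from high to low, collecting (FPR, TPR) points."""
    order = np.argsort(-scores)               # descending by score
    labels = labels[order]
    n_pos, n_neg = labels.sum(), len(labels) - labels.sum()
    tpr, fpr = [0.0], [0.0]
    tp = fp = 0
    for y in labels:                          # lower threshold one datum at a time
        tp, fp = tp + y, fp + (1 - y)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    # Trapezoidal integration, cf. eq. (6.2)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

scores = np.array([0.9, 0.8, 0.7, 0.55, 0.54, 0.51, 0.4, 0.38, 0.35, 0.3])
labels = np.array([1,   1,   0,   1,    1,    0,    1,   0,    0,    0])
fpr, tpr = roc_points(scores, labels)
print(round(auc_trapezoid(fpr, tpr), 3))      # -> 0.84 for these toy values
```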
AUC – By comparing ROC curves, one can analyze the differences in classification performance between two or more classifiers. The higher the curve, i.e., the nearer to the perfect classifier, the higher the accuracy. Sometimes the curve of one classifier is superior to that of another, i.e., one curve is higher than the other throughout the diagram; a summary measure is therefore given by the Area Under the ROC curve (denoted AUC ∈ [0, 1]) [Breiman et al., 1984]. The curve with the higher AUC is better than the one with the smaller AUC. If two ROC curves intersect, which makes them hard to differentiate, the AUC provides the comparison between the models. The simplest way to calculate the AUC is by trapezoidal integration of the ROC curve, such as in Figure 6.5, as follows:
226
6 Experiments
$$AUC = \sum_{k=2}^{n_d} \left[ (1-\beta_k)\,\Delta\alpha + \tfrac{1}{2}\,\Delta(1-\beta)\,\Delta\alpha \right] \qquad (6.2)$$
where Δ(1 − β) = (1 − βk) − (1 − βk−1) and Δα = αk − αk−1. Here β = 1 − TPR, α = FPR, and n_d denotes the number of data points.

Ranking – Identifying the most adequate classification algorithm for a problem is usually a very difficult task, since many different classification algorithms are available, originating from different areas such as statistics, machine learning, and soft computing. In this respect, additional ranking methods, discussed next, are used as performance measures for the benchmark analysis on classification problem domains. Among the many ranking methods, average ranks (AR), success rate ratios ranking (SRR), and significance win ranking (SWR) [Brazdil and Soares, 2000] are used to generate an ordering of the different algorithms based on the experimental results obtained from the different datasets. A new ranking method, the percent improvement ratio (PIR), is presented to rank the performance improvements of each methodology based on the AUC values. The best algorithm is determined based on the average results obtained from these ranking algorithms.

AR – Average Rank uses individual rankings to derive an overall ranking. This simple ranking method orders the values of the performance measures, where each value refers to the average of the measures over all folds of the cross-validation procedure. The best algorithm is assigned rank 1, the runner-up rank 2, and so on. Let $r_j^i$ be the rank of algorithm j on dataset i. The average rank of each algorithm is calculated as $\bar{r}_j = \left(\sum_i r_j^i\right)/n_d$, where $n_d$ is the total number of datasets. The final
ranking is obtained by ordering the average ranks and assigning ranks to the algorithms accordingly.

SRR – Success Rate Ratios ranking measures the ratios of success rates between pairs of algorithms on each dataset. Let $PM_j^i$ be the measured performance of method j on dataset i. The SRR between two methodologies j and k on dataset i is $SRR_{j,k}^i = PM_j^i / PM_k^i$; the higher the $SRR_{j,k}^i$, the better methodology j compared to methodology k. We then calculate the pairwise mean success rate ratio, $\overline{SRR}_{j,k} = \left(\sum_i SRR_{j,k}^i\right)/n_d$, for each pair of methodologies j and k, where $n_d$ is the number of datasets. This measure estimates the general advantage or disadvantage of methodology j over methodology k.
Finally, the overall mean success rate ratio for each methodology is measured by $\overline{SRR}_j = \left(\sum_k \overline{SRR}_{j,k}\right)/(m-1)$, where m is the number of methods. The ranking is then derived from this measure.
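A compact sketch of the AR and SRR computations for a small result matrix (rows are methods, columns are datasets; the AUC values are toy numbers, and ties in ranks are not handled):

```python
import numpy as np

# AUC of m = 3 methods (rows) on nd = 4 datasets (columns) -- toy values
pm = np.array([[0.91, 0.85, 0.78, 0.88],
               [0.89, 0.87, 0.80, 0.84],
               [0.85, 0.82, 0.75, 0.86]])
m, nd = pm.shape

# AR: rank methods on each dataset (1 = best AUC), then average per method
ranks = (-pm).argsort(axis=0).argsort(axis=0) + 1
ar = ranks.mean(axis=1)

# SRR: pairwise mean success rate ratios, then overall mean excluding j == k
srr_pair = np.array([[np.mean(pm[j] / pm[k]) for k in range(m)]
                     for j in range(m)])
srr = (srr_pair.sum(axis=1) - 1.0) / (m - 1)   # drop the j == k ratio (= 1)

print("AR :", np.round(ar, 2))    # lower is better
print("SRR:", np.round(srr, 3))   # higher is better
```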
SWR – Significance Win Ratio ranking measures the significance of the differences in performance between the algorithms. In this work, the paired Student's t-test is used, because the numbers of datasets and algorithms are small. First, the significance of the difference in performance between each pair of algorithms is measured individually for all datasets. We say that an algorithm j is significant over algorithm k on dataset i when the probability of the t-test is less than p