Ildar Batyrshin, Janusz Kacprzyk, Leonid Sheremetov, Lotfi A. Zadeh (Eds.) Perception-based Data Mining and Decision Making in Economics and Finance
Studies in Computational Intelligence, Volume 36

Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail:
[email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 19. Ajita Ichalkaranje, Nikhil Ichalkaranje, Lakhmi C. Jain (Eds.), Intelligent Paradigms for Assistive and Preventive Healthcare, 2006, ISBN 978-3-540-31762-3
Vol. 20. Wojciech Penczek, Agata Półrola, Advances in Verification of Time Petri Nets and Timed Automata, 2006, ISBN 978-3-540-32869-8
Vol. 21. Cândida Ferreira, Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence, 2006, ISBN 978-3-540-32796-7
Vol. 22. N. Nedjah, E. Alba, L. de Macedo Mourelle (Eds.), Parallel Evolutionary Computations, 2006, ISBN 978-3-540-32837-7
Vol. 23. M. Last, Z. Volkovich, A. Kandel (Eds.), Algorithmic Techniques for Data Mining, 2006, ISBN 978-3-540-33879-6
Vol. 24. Alakananda Bhattacharya, Amit Konar, Ajit K. Mandal, Parallel and Distributed Logic Programming, 2006, ISBN 978-3-540-33458-3
Vol. 25. Zoltán Ésik, Carlos Martín-Vide, Victor Mitrana (Eds.), Recent Advances in Formal Languages and Applications, 2006, ISBN 978-3-540-33460-6
Vol. 26. Nadia Nedjah, Luiza de Macedo Mourelle (Eds.), Swarm Intelligent Systems, 2006, ISBN 978-3-540-33868-0
Vol. 27. Vassilis G. Kaburlasos, Towards a Unified Modeling and Knowledge-Representation based on Lattice Theory, 2006, ISBN 978-3-540-34169-7
Vol. 28. Brahim Chaib-draa, Jörg P. Müller (Eds.), Multiagent based Supply Chain Management, 2006, ISBN 978-3-540-33875-8
Vol. 29. Sai Sumathi, S.N. Sivanandam, Introduction to Data Mining and its Applications, 2006, ISBN 978-3-540-34350-9
Vol. 30. Yukio Ohsawa, Shusaku Tsumoto (Eds.), Chance Discoveries in Real World Decision Making, 2006, ISBN 978-3-540-34352-3
Vol. 31. Ajith Abraham, Crina Grosan, Vitorino Ramos (Eds.), Stigmergic Optimization, 2006, ISBN 978-3-540-34689-0
Vol. 32. Akira Hirose, Complex-Valued Neural Networks, 2006, ISBN 978-3-540-33456-9
Vol. 33. Martin Pelikan, Kumara Sastry, Erick Cantú-Paz (Eds.), Scalable Optimization via Probabilistic Modeling, 2006, ISBN 978-3-540-34953-2
Vol. 34. Ajith Abraham, Crina Grosan, Vitorino Ramos (Eds.), Swarm Intelligence in Data Mining, 2006, ISBN 978-3-540-34955-6
Vol. 35. Ke Chen, Lipo Wang (Eds.), Trends in Neural Computation, 2007, ISBN 978-3-540-36121-3
Vol. 36. Ildar Batyrshin, Janusz Kacprzyk, Leonid Sheremetov, Lotfi A. Zadeh (Eds.), Perception-based Data Mining and Decision Making in Economics and Finance, 2007, ISBN 978-3-540-36244-9
Ildar Batyrshin Janusz Kacprzyk Leonid Sheremetov Lotfi A. Zadeh (Eds.)
Perception-based Data Mining and Decision Making in Economics and Finance With 95 Figures and 37 Tables
Ildar Batyrshin
Mexican Petroleum Institute
Eje Central Lazaro Cardenas 152
Col. San Bartolo Atepehuacan
07730 Mexico, Mexico
and
Institute of Problems of Informatics
Academy of Sciences of Tatarstan
Mushtari st., 20, Kazan, 420012
Russia
E-mail: [email protected]

Leonid Sheremetov
Mexican Petroleum Institute
Eje Central Lazaro Cardenas 152
Col. San Bartolo Atepehuacan
07730 Mexico, Mexico
and
St. Petersburg Institute for Informatics and Automation
Russian Academy of Sciences
39, 14th Line, St. Petersburg, 199178
Russia
E-mail: [email protected]

Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
Newelska 6
01-447 Warszawa
Poland
E-mail: [email protected]

Lotfi A. Zadeh
University of California
Computer Science Division
387 Soda Hall
Berkeley, CA 94720-1776
USA
E-mail: [email protected]
Library of Congress Control Number: 2006939141
ISSN print edition: 1860-949X
ISSN electronic edition: 1860-9503
ISBN-10 3-540-36244-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-36244-9 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2007

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: deblik, Berlin
Typesetting by the editors and SPi using a Springer LaTeX macro package
Printed on acid-free paper
SPIN: 11793922 89/SPi 543210
Preface
The primary goal of this book is to present to the scientific and management communities a selection of applications of recent Soft Computing (SC) and Computing with Words and Perceptions (CWP) models and techniques to the solution of economic and financial problems. The selected examples can also serve as a starting point for applying SC and CWP techniques to a wider range of problems in economics and finance. Decision making in the present world is becoming ever more sophisticated, time consuming and difficult for human beings, who therefore require more and more computational support. This book addresses the significant increase in recent years in research on, and applications of, Soft Computing and Computing with Words and Perceptions for decision making in economics and finance.

Decision making is heavily based on information and knowledge, usually extracted from the analysis of large amounts of data. Data mining techniques able to integrate human experience can be used for more realistic business decision support. Computing with Words and Perceptions, introduced by Lotfi Zadeh, can serve as a basis for such an extension of traditional data mining and decision making systems. Fuzzy logic, as a main constituent of CWP, provides powerful tools for modeling and processing linguistic information defined on numerical domains. Decision making techniques based on fuzzy logic have in many cases demonstrated better performance than competing approaches. The reason is that traditional, bivalent-logic-based approaches are not a good fit to reality: the reality of pervasive imprecision, uncertainty and partiality of truth. On the other hand, the traditional probabilistic interpretation of uncertainty does not always correspond in practice to the nature of uncertainties that often arise from subjective estimations. The list of practical situations in which it seems better to avoid the traditional probabilistic interpretation of uncertainty is very long. The centrepiece of fuzzy logic, namely that everything is, or is allowed to be, a matter of degree, makes it possible to deal better with perception-based information. Such information plays an essential role in economics, finance and, more generally, in all domains in which human perceptions and emotions are in evidence. This is the case, for instance, in studies of capital markets and financial engineering, including financial time series modeling; price projections for stocks; volatility analysis and the pricing of options and derivatives; and risk management, to mention a few.

The book consists of two parts: Data Mining and Decision Making. An introductory chapter by Lotfi A. Zadeh called "Precisiated Natural Language" describes the conceptual structure of Precisiated Natural Language (PNL), which
can be employed as a basis for computation with perceptions. PNL attempts to make it possible to treat propositions drawn from a natural language as objects of computation, capturing two fundamental facets of human cognition: (a) partiality (of understanding, truth, possibility, etc.) and (b) granularity (clumping of values of attributes into granules with words as labels). The chapter shows that PNL has much higher expressive power than existing approaches to natural language processing based on bivalent logic. Its high expressiveness rests on the concept of a generalized constraint, which represents the meaning of propositions drawn from a natural language while capturing their partiality and granularity. This chapter establishes the conceptual basis for the rest of the book.

The first part of the book presents novel techniques of Data Mining. Researchers in the data mining field have traditionally focused their efforts on algorithms for dealing with huge amounts of data. It is nevertheless true that the results obtained with these algorithms are often of limited use in practice, which limits the spread and acceptance of data mining in many real-world situations. One of the reasons is that purely statistical approaches, which do not take the experience of experts into account, often fail to address the actual problem. Perception-based data mining should be able to handle linguistic information, fuzzy concepts and perception-based patterns of time series. That is why, in addition to classic data mining techniques and classification algorithms such as decision trees or Bayesian classifiers, the subsequent chapters study other data mining operations, such as clustering, moving approximations and fuzzy association rule generation, that are more suitable for working with perceptual patterns. The subsequent chapters describe these novel techniques of perception-based data mining together with their applications to typical problems in economics and finance.

The chapter titled "Towards Human-Consistent Data-Driven Decision Support Systems via Fuzzy Linguistic Data Summaries" by Janusz Kacprzyk and Sławomir Zadrożny focuses on the construction of linguistic summaries of data. Summarization, as one of the typical tasks of data mining, provides efficient and human-consistent means for the analysis of large amounts of data to be used for more realistic business decision support. The chapter shows how to embed data summarization within a fuzzy querying environment for an effective and efficient computer implementation. The realization of Zadeh's computing with words and perceptions paradigm through fuzzy linguistic database summaries, and indirectly through fuzzy querying, can open new vistas in data-driven and also, to some extent, knowledge-driven and Web-based Decision Support Systems.

The next chapter, titled "Moving Approximation Transform and Local Trend Associations in Time Series Data Bases" by Batyrshin I., Herrera-Avelar R., Sheremetov L., and Panova A., describes a new technique of time series analysis based on replacing a time series by the sequence of slopes of the linear functions approximating it in sliding windows. Based on the Moving Approximation (MAP) Transform, several measures of local trend association can be introduced which are invariant under linear transformations of time series. Due to this very important property, local trend association measures can serve as basic measures of similarity of time series and time series patterns in most problems of time series data mining.
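As a rough illustration of the idea (a minimal sketch under assumptions of our own, not the authors' implementation; the window length and the cosine-type association measure below are illustrative choices), the MAP transform of a series can be computed as the sequence of least-squares slopes in sliding windows, and two series can then be compared through these slope sequences:

    import numpy as np

    def map_transform(y, window=5):
        # sequence of least-squares slopes of y in sliding windows of the given length
        y = np.asarray(y, dtype=float)
        t = np.arange(window, dtype=float)
        slopes = []
        for i in range(len(y) - window + 1):
            slopes.append(np.polyfit(t, y[i:i + window], 1)[0])
        return np.array(slopes)

    def trend_association(y1, y2, window=5):
        # cosine-type association of local trends; unchanged under y -> a*y + b with a > 0
        s1, s2 = map_transform(y1, window), map_transform(y2, window)
        return float(np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2)))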
The chapter considers several examples of
application of local trend association measures to the construction of association networks for systems described by time series data bases. MAP can also be used as a basis for defining perception-based trend patterns like "quickly increasing" or "very slowly decreasing" in intelligent decision making systems that combine expert knowledge with time series data bases. Nowadays the Discrete Fourier Transform is the main technique for the analysis of time series describing signal propagation and oscillating processes, where the concept of frequency plays a key role. The MAP transform can serve as a main instrument for the analysis of local trends and tendencies of non-oscillating processes, which is important for economic and financial applications.

Most of the information used in economics and finance is stored in time series; that is why the book pays special attention to time series data mining (TSDM). The development of intelligent question answering systems supporting decision making procedures related to time series data bases requires the formalization of human perceptions about time, time series values, patterns and shapes, associations between patterns and time series, and so on. The chapter called "Perception Based Patterns in Time Series Data Mining" by Batyrshin I., Sheremetov L., and Herrera-Avelar R. presents an overview of current techniques and analyses them from the point of view of their contribution to perception-based TSDM. The survey considers different approaches to the description of perception-based patterns which use signs of derivatives, scaling of trends and shapes, linguistic interpretation of patterns obtained as a result of clustering, a grammar for generating complex patterns from shape primitives, and temporal relations between patterns. These descriptions can be extended by fuzzy granulation of time series patterns to make them more adequate to the perceptions used in human reasoning. Several approaches to relating the linguistic descriptions given by experts to automatically generated summary texts and linguistic forecasts are considered. Finally, the role of perception-based time series data mining and computing with words and perceptions in the construction of intelligent systems that use expert knowledge and decision making procedures in time series data base domains is discussed.

The next chapter, titled "Perception Based Functions in Qualitative Forecasting" by Batyrshin I. and Sheremetov L., discusses the application of fuzzy perception-based functions (PBF) to the qualitative forecasting of a new product life cycle. A PBF is given by a sequence of rules Rk: If T is Tk then S is Sk, where the Tk are perception-based intervals defined on the domain of the independent variable T, and the Sk are perception-based shape patterns of the variable S on the interval Tk. The intervals Tk can be expressed by words like Between N and M, Approximately M, Middle of the Day, End of the Week, etc. The shape patterns Sk can be expressed linguistically, e.g. as Very Large, Increasing, Quickly Decreasing and Slightly Concave, etc. The authors consider new parametric patterns used for modeling convex-concave shapes of PBF and propose a method for reconstructing a PBF from these shape patterns. These patterns can also be used for time series segmentation in perception-based time series data mining.

The chapter titled "Towards Automated Share Investment System" by Dymitr Ruta describes a classification model that learns transaction patterns from optimally labelled historical data presented as time series and accordingly gives
the profit-driven decision for the current-day transaction. In contrast to traditional regression-based approaches, the proposed model facilitates the job of a busy investor who prefers a simple decision on the current-day transaction (buy, wait or sell) that would maximise his return from the investment. The model is embedded into an automated client-server platform which handles data collection automatically and maintains the client models in a database. A prototype of the system was tested over 20 years of NYSE:CSC share price history, showing a substantial improvement in long-term profit compared to a passive long-term investment.

The Decision Tree (DT) is one of the most popular classification algorithms currently used in Data Mining. The chapter called "Estimating Classification Uncertainty of Bayesian Decision Tree Technique on Financial Data" by Schetinin et al. studies interpretability issues of classification models, which are crucial for experts responsible for making reliable classifications. The Decision Tree classification model is combined with Bayesian model averaging over all possible DTs, thus achieving the required diversity of the DT ensemble. The authors explore the classification uncertainty of Bayesian Markov Chain Monte Carlo techniques on some datasets from the StatLog Repository and on real financial data. The classification uncertainty is examined within an Uncertainty Envelope technique dealing with the class posterior distribution and a given confidence probability. This technique provides realistic estimates of the classification uncertainty which can be easily interpreted in statistical terms with the aim of risk evaluation.

Another important method frequently used in data mining is Cluster Analysis. The chapter "Invariant Hierarchical Clustering Schemes" by Batyrshin I. and Rudas T. studies the properties of a general scheme of parametric invariant clustering procedures based on the transformation of a proximity function into a fuzzy equivalence relation. The scheme makes it possible to build clustering procedures that are invariant to the numeration of objects and to monotone transformations of the proximity values between objects. Several examples illustrate the application of the proposed clustering procedures to the analysis of similarity structures in data.

The second part of the book focuses on the problems of perceptual Decision Making in Economics and Finance. As shown in the previous chapters, time series analysis has three goals: forecasting (also called prediction), modeling, and characterization. Almost all managerial decisions are based on forecasting and modelling. The ability to model and to perform decision modeling and analysis is an essential feature of many real-world applications. The second part opens with the chapter called "Fuzzy Components of Cooperative Markets" by Milan Mareš, which deals with the Walras equilibrium model and its cooperative modification, and also analyzes some possibilities of its fuzzification. The main attention is focused on the vagueness of utility functions and of prices, which can be considered the most subjective (utilities) or the most unpredictable (prices) components of the model. The elementary properties of the fuzzified model are presented, and the adequacy of the suggested fuzzy set theoretical methods to the specific properties of real market models is briefly discussed.
The chapter titled "Possibilistic-Probabilistic Models and Methods of Portfolio Optimization" by Alexander Yazenin considers a generalization of Markowitz models with fuzzy random variables characterized by corresponding possibility distributions. Such a situation is typical of financial assets, particularly on the Russian market; in this case the profitability of a financial asset can be represented by a fuzzy random variable. Approaches to the definition of numerical characteristics of fuzzy random variables are analyzed and proposed, and appropriate calculation methods, in particular within the framework of the shift-scaled representation, are obtained. Principles of decision making in a fuzzy random environment using these possibilistic-probabilistic models are formulated.

In the chapter "Towards Graded and Nongraded Variants of Stochastic Dominance" by Bernard De Baets and Hans De Meyer, a pairwise comparison method for random variables is established. This comparison results in a probabilistic relation on a given set of random variables. The transitivity of this probabilistic relation is investigated, which allows identifying appropriate strict or weak thresholds, depending upon the copula involved, that turn the probabilistic relation into a strict order relation. The proposed method can also be seen as a way of generating graded as well as nongraded variants of the concept of stochastic dominance.

The chapter titled "Option Pricing in the Presence of Uncertainty" by Silvia Muzzioli and Huguette Reynaerts studies the derivation of the European option price in the Cox-Ross-Rubinstein (1979) binomial model in the presence of uncertainty about the volatility of the underlying asset. Two different approaches, which concentrate on the fuzzification of one or both of the two jump factors, are proposed. The first approach assumes that both jump factors are represented by triangular fuzzy numbers. The second approach is derived under the assumption that only the up jump factor is uncertain.

In the chapter titled "Non-Stochastic-Model Based Finance Engineering" by Toshihiro Kaino and Kaoru Hirota, a new corporate evaluation model and an option pricing model based on fuzzy measures are proposed, which deal with the ambiguous subjective evaluations made by humans in the real world.

The chapter called "Collective Intelligence in Multiagent Systems: Interbank Payment Systems Application" by Luis Rocha-Mier, Leonid Sheremetov and Francisco Villarreal describes a new approach to modeling interbank net settlement payment systems (NSPS) in order to analyze the actions of individual depositors. The model is developed within the framework of COllective INtelligence (COIN). This framework focuses on the interactions at the local and global levels among the consumer-agents (which have no global knowledge of the environment model) in order to ensure the optimization of the global utility function (GUF). A COIN is defined as a large Multi-Agent System (MAS) with no centralized control or communication, but with a global task to complete. Reinforcement learning algorithms are used at the local level, while techniques based on the COIN theory are used to optimize the global behavior. The proposed framework was implemented using NetLogo, an agent-based parallel modeling and simulation environment. The results demonstrate that the inter-
bank NSPS is a good experimental field for the application of the COIN theory and show how the behavior of the consumer-agents converges to the Nash equilibrium as they adapt their actions to optimize the GUF.

Finally, the chapter titled "Fuzzy Models in Credit Risk Analysis" by Antonio Carlos Pinto Dias Alves presents some concepts guiding credit risk analysis using fuzzy logic systems. Fuzzy quantification theory is used to perform a kind of multivariate analysis that gives more usable answers than traditional Logit or Probit analysis. The analysis employs some interesting accounting indicators that can efficiently indicate the financial health of a company.

At the moment, most models in the field of finance engineering are based on stochastic theory. As shown in this book, predictions based on these models sometimes fail to match the actual problem. One of the reasons is that they assume a known probability distribution and try to describe a system with the precision obtained by means of exact probability densities. Nevertheless, as shown in chapter 4, for example, a classification model that learns transaction patterns by applying a regression model to support or even make investment decisions is inappropriate: on top of being uncertain and unnecessarily complex, it requires a lot of investor attention and further analysis before an investment decision can be made. In many cases it is impossible or unnecessary to describe the behavior of the modeled system with the precision that may be obtained by means of exact probability densities. The SC models and decision making procedures described in this book overcome these drawbacks, which makes them more suitable for real-world situations. Perception-based models can be a powerful tool helping to develop a new generation of human-consistent, natural-language-based and easy-to-use decision support systems.

Finally, we would like to thank all the contributing authors for their excellent research papers, their institutions for the support provided for research work in the field of soft computing, and the Studies in Fuzziness and Soft Computing Series Editorial Board and Springer-Verlag for their interest in the topic and for making the publication of this book possible.

March 2006
I. Batyrshin J. Kacprzyk L. Sheremetov L.A. Zadeh
Contents
Precisiated Natural Language (PNL) .......... 1
Lotfi A. Zadeh
1. Data Mining

Towards Human-Consistent Data-Driven Decision Support Systems via Fuzzy Linguistic Data Summaries .......... 37
Janusz Kacprzyk and Sławomir Zadrożny

Moving Approximation Transform and Local Trend Associations in Time Series Data Bases .......... 55
Ildar Batyrshin, Raul Herrera-Avelar, Leonid Sheremetov, and Aleksandra Panova

Perception Based Patterns in Time Series Data Mining .......... 85
Ildar Batyrshin, Leonid Sheremetov and Raul Herrera-Avelar

Perception-Based Functions in Qualitative Forecasting .......... 119
Ildar Batyrshin and Leonid Sheremetov

Towards Automated Share Investment System .......... 135
Dymitr Ruta

Estimating Classification Uncertainty of Bayesian Decision Tree Technique on Financial Data .......... 155
Vitaly Schetinin, Jonathan E. Fieldsend, Derek Partridge, Wojtek J. Krzanowski, Richard M. Everson, Trevor C. Bailey, and Adolfo Hernandez

Invariant Hierarchical Clustering Schemes .......... 181
Ildar Batyrshin and Tamas Rudas
2. Decision Making

Fuzzy Components of Cooperative Markets .......... 209
Milan Mareš
Possibilistic-Probabilistic Models and Methods of Portfolio Optimization .......... 241
Alexander V. Yazenin

Toward Graded and Nongraded Variants of Stochastic Dominance .......... 261
Bernard De Baets and Hans De Meyer

Option Pricing in the Presence of Uncertainty .......... 275
Silvia Muzzioli and Huguette Reynaerts

Nonstochastic Model-Based Finance Engineering .......... 303
Toshihiro Kaino and Kaoru Hirota

Collective Intelligence in Multiagent Systems: Interbank Payment Systems Application .......... 331
Luis Rocha-Mier, Leonid Sheremetov and Francisco Villarreal

Fuzzy Models in Credit Risk Analysis .......... 353
Antonio Carlos Pinto Dias Alves
List of Contributors
T.C. Bailey School of Engineering, Computer Science and Mathematics, University of Exeter UK e-mail:
[email protected] I. Batyrshin Mexican Petroleum Institute, Mexico e-mail:
[email protected] A. Carlos Pinto Dias Alves Unidade Gestão de Riscos - Banco do Brasil S.A. Brasil e-mail:
[email protected] B. De Baets Department of Applied Mathematics, Biometrics and Process Control, Ghent University Belgium H. De Meyer Department of Applied Mathematics and Computer Science, Ghent University Belgium R.M. Everson School of Engineering, Computer Science and Mathematics, University of Exeter UK e-mail:
[email protected]
J.E. Fieldsend School of Engineering, Computer Science and Mathematics, University of Exeter UK e-mail:
[email protected] A. Hernandez School of Engineering, Computer Science and Mathematics, University of Exeter UK e-mail:
[email protected] R. Herrera-Avelar Mexican Petroleum Institute, Mexico K. Hirota Department of Computational Intelligence and Systems Science, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology Japan e-mail:
[email protected] J. Kacprzyk Systems Research Institute, Polish Academy of Sciences Poland e-mail:
[email protected] T. Kaino School of Business Administration, Aoyama Gakuin University Japan e-mail:
[email protected] W.J. Krzanowski School of Engineering, Computer Science and Mathematics, University of Exeter UK e-mail:
[email protected]
M. Mareš Institute of Information Theory and Automation (UTIA) Czech Republic e-mail:
[email protected] S. Muzzioli Department of Economics, University of Modena and Reggio Emilia Italy A. Panova Kazan Power Engineering Institute Russia D. Partridge School of Engineering, Computer Science and Mathematics, University of Exeter UK e-mail:
[email protected] H. Reynaerts Department of Applied Mathematics and Computer Science Belgium L. Rocha-Mier Mexican Petroleum Institute Mexico T. Rudas Eötvös Loránd University Hungary D. Ruta British Telecom Group, Research & Venturing UK e-mail:
[email protected] V. Schetinin School of Engineering, Computer Science and Mathematics, University of Exeter UK e-mail:
[email protected]
L. Sheremetov Mexican Petroleum Institute Mexico e-mail:
[email protected] F. Villarreal Mexican Petroleum Institute Mexico A.V. Yazenin Computer Science Department, Tver State University Russia L.A. Zadeh University of California USA S. Zadrożny Systems Research Institute, Polish Academy of Sciences Poland e-mail:
[email protected]

Precisiated Natural Language1 (PNL)
Lotfi A. Zadeh
1 Reprinted with permission from AI Magazine, 25(3), Fall 2004, 74–91.
L.A. Zadeh: Precisiated Natural Language, Studies in Computational Intelligence (SCI) 36, 1–33 (2007) www.springerlink.com © Springer-Verlag Berlin Heidelberg 2007
Abstract This article is a sequel to an article titled “A New Direction in AI – Toward a Computational Theory of Perceptions,” which appeared in the Spring 2001 issue of AI Magazine (volume 22, No. 1, 73–84) [47]. The concept of precisiated natural language (PNL) was briefly introduced in that article, and PNL was employed as a basis for computation with perceptions. In what follows, the conceptual structure of PNL is described in greater detail, and PNL’s role in knowledge representation, deduction, and concept definition is outlined and illustrated by examples. What should be understood is that PNL is in its initial stages of development and that the exposition that follows is an outline of the basic ideas that underlie PNL rather than a definitive theory. A natural language is basically a system for describing perceptions. Perceptions, such as perceptions of distance, height, weight, color, temperature, similarity, likelihood, relevance, and most other attributes of physical and mental objects are intrinsically imprecise, reflecting the bounded ability of sensory organs, and ultimately the brain, to resolve detail and store information. In this perspective, the imprecision of natural languages is a direct consequence of the imprecision of perceptions [1, 2]. How can a natural language be precisiated – precisiated in the sense of making it possible to treat propositions drawn from a natural language as objects of computation? This is what PNL attempts to do. In PNL, precisiation is accomplished through translation into what is termed a precisiation language. In the case of PNL, the precisiation language is the generalized-constraint language (GCL), a language whose elements are so-called generalized constraints and their combinations. What distinguishes GCL from languages such as Prolog, LISP, SQL, and, more generally, languages associated with various logical systems, for example, predicate logic, modal logic, and so on, is its much higher expressive power. The conceptual structure of PNL mirrors two fundamental facets of human cognition (a) partiality and (b) granularity [3]. Partiality relates to the fact that most human concepts are not bivalent, that is, are a matter of degree. Thus, we have partial understanding, partial truth, partial possibility, partial certainty, partial similarity, and partial relevance, to cite a few examples. Similarly, granularity and granulation relate to clumping of values of attributes, forming granules with words as labels, for example, young, middle-aged, and old as labels of granules of age. Existing approaches to natural language processing are based on bivalent logic – a logic in which shading of truth is not allowed. PNL abandons bivalence. By so doing, PNL frees itself from limitations
imposed by bivalence and categoricity, and opens the door to new approaches for dealing with long-standing problems in AI and related fields [4, 50, 52]. At this juncture, PNL is in its initial stages of development. As it matures, PNL is likely to find a variety of applications, especially in the realms of world knowledge representation, concept definition, deduction, decision, search, and question answering.
1 Introduction

Natural languages (NLs) have occupied, and continue to occupy, a position of centrality in AI. Over the years, impressive advances have been made in our understanding of how natural languages can be dealt with on processing, logical, and computational levels. A huge literature is in existence. Among the important contributions that relate to the ideas described in this article are those of Biermann and Ballard [5], Klein [6], Barwise and Cooper [7], Sowa [8, 9], McAllester and Givan [10], Macias and Pulman [11], Mani and Maybury [12], Allan [13], Fuchs and Schwertelm [14], and Sukkarieh [15]. When a language such as precisiated natural language (PNL) is introduced, a question that arises at the outset is: What can PNL do that cannot be done through the use of existing approaches? A simple and yet important example relates to the basic role of quantifiers such as all, some, most, many, and few in human cognition and natural languages. In classical, bivalent logic the principal quantifiers are all and some. However, there is a literature on so-called generalized quantifiers exemplified by most, many, and few [7, 16]. In this literature, such quantifiers are treated axiomatically, and logical rules are employed for deduction. By contrast, in PNL quantifiers such as many, most, few, about 5, close to 7, much larger than 10, and so on are treated as fuzzy numbers and are manipulated through the use of fuzzy arithmetic [17–19]. For the most part, inference is computational rather than logical. Following are a few simple examples. First, let us consider the Brian example [17]:

Brian is much taller than most of his close friends.
How tall is Brian?
At first glance it may appear that such questions are unreasonable. How can one say something about Brian’s height if all that is known is that he is much taller than most of his close friends? Basically, what PNL provides is a system for precisiation of propositions expressed in a
natural language through translation into the generalized-constraint language (GCL). Upon translation, the generalized constraints (GCs) are propagated through the use of rules governing generalized-constraint propagation, inducing a generalized constraint on the answer to the question. More specifically, in the Brian example, the answer is a generalized constraint on the height of Brian. Now let us look at the balls-in-box problem: A box contains balls of various sizes and weights. The premises are:

Most are large.
Many large balls are heavy.

What fraction of balls are large and heavy?
The PNL answer is: most × many, where most and many are fuzzy numbers defined through their membership functions, and most × many is their product in fuzzy arithmetic [18]. This answer is a consequence of the general rule

Q1 As are Bs
Q2 (A and B)s are Cs
-------------------------------
(Q1 × Q2) As are (B and C)s
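A small numerical sketch of this computation (the trapezoidal calibrations of most and many below are illustrative assumptions, not values taken from the article): each quantifier is represented by the alpha-cuts of a trapezoidal fuzzy number on [0, 1], and the product is computed cut by cut with interval arithmetic.

    import numpy as np

    def trapezoid_cut(a, b, c, d, alpha):
        # alpha-cut [lo, hi] of a trapezoidal fuzzy number (a, b, c, d)
        return (a + alpha * (b - a), d - alpha * (d - c))

    def fuzzy_product(q1, q2, levels=11):
        # alpha-cuts of the product of two nonnegative trapezoidal fuzzy quantifiers
        cuts = []
        for alpha in np.linspace(0.0, 1.0, levels):
            lo1, hi1 = trapezoid_cut(*q1, alpha)
            lo2, hi2 = trapezoid_cut(*q2, alpha)
            cuts.append((alpha, lo1 * lo2, hi1 * hi2))
        return cuts

    most = (0.5, 0.7, 0.9, 1.0)   # assumed calibration of "most"
    many = (0.3, 0.5, 0.7, 0.9)   # assumed calibration of "many"
    for alpha, lo, hi in fuzzy_product(most, many):
        print(f"alpha={alpha:.1f}: [{lo:.2f}, {hi:.2f}]")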
Another simple example is the tall Swedes problem (version 1): Swedes who are more than twenty years old range in height from 140 centimeters to 220 centimeters. Most are tall. What is the average height of Swedes over twenty?
A less simple version of the problem (version 2) is the following (a* denotes “approximately a”). Swedes over twenty range in height from 140 centimeters to 220 centimeters. Over 70* percent are taller than 170* centimeters; less than 10* percent are shorter than 150* centimeters, and less than 15 percent are taller than 200* centimeters. What is the average height of Swedes over twenty?
A PNL-based answer is given in Appendix. There is a basic reason that generalized quantifiers do not have an ability to deal with problems of this kind. The reason is that in the theory of generalized quantifiers there is no concept of the count of elements in a fuzzy set. How do you count the number of tall Swedes if tallness is a matter of degree? More generally, how do you define the probability measure of a fuzzy event [20]?
What should be stressed is that the existing approaches and PNL are complementary rather than competitive. Thus, PNL is not intended to be used in applications such as text processing, summarization, syntactic analysis, discourse analysis, and related fields. The primary function of PNL is to provide a computational framework for precisiation of meaning rather than to serve as a means of meaning understanding and meaning representation. By its nature, PNL is maximally effective when the number of precisiated propositions is small rather than large and when the chains of reasoning are short rather than long. The following is intended to serve as a backdrop. It is a deep-seated tradition in science to view the use of natural languages in scientific theories as a manifestation of mathematical immaturity. The rationale for this tradition is that natural languages are lacking in precision. However, what is not recognized to the extent that it should be is that adherence to this tradition carries a steep price. In particular, a direct consequence is that existing scientific theories do not have the capability to operate on perception-based information – information exemplified by “Most Swedes are tall,” “Usually Robert returns from work at about 6 PM,” “There is a strong correlation between diet and longevity,” and “It is very unlikely that there will be a significant increase in the price of oil in the near future” (Fig. 1).
Fig. 1. Modalities of measurement-based and perception-based information
Such information is usually described in a natural language and is intrinsically imprecise, reflecting a fundamental limitation on the cognitive ability of humans to resolve detail and store information. Due to their imprecision, perceptions do not lend themselves to meaning representation and inference through the use of methods based on bivalent logic. To illustrate the point, consider the following simple examples. The balls-in-box example: A box contains balls of various sizes. My perceptions of the contents of the box are: – There are about twenty balls. – Most are large. – There are several times as many large balls as small balls. The question is: What is the number of small balls?
The Robert example (a): My perception is: – Usually Robert returns from work at about 6 PM. The question is: What is the probability that Robert is home at about 6:15 PM?
The Robert example (b): – Most tall men wear large-sized shoes. – Robert is tall. – What is the probability that Robert wears large-sized shoes?
An immediate problem that arises is that of meaning precisiation. How can the meaning of the perception “There are several times as many large balls as small balls” or “Usually Robert returns from work at about 6 PM” be defined in a way that lends itself to computation and deduction? Furthermore, it is plausible, on intuitive grounds, that “Most Swedes are tall” conveys some information about the average height of Swedes. But what is the nature of this information, and what is its measure? Existing bivalent-logic-based methods of natural language processing provide no answers to such questions.
The incapability of existing methods to deal with perceptions is a direct consequence of the fact that the methods are based on bivalent logic – a logic that is intolerant of imprecision and partial truth. The existing methods are categorical in the sense that a proposition, p, in a natural language, NL, is either true or not true, with no shades of truth allowed. Similarly, p is either grammatical or ungrammatical, either ambiguous or unambiguous, either meaningful or not meaningful, either relevant or not relevant, and so on. Clearly, categoricity is in fundamental conflict with reality – a reality in which partiality is the norm rather than an exception. But what is much more important is that bivalence is a major obstacle to the solution of such basic AI problems as commonsense reasoning and knowledge representation [8, 9, 21–25], nonstereotypical summarization [12], unrestricted question answering [26], and natural language computation [5]. PNL abandons bivalence. Thus, in PNL everything is, or is allowed to be, a matter of degree. It is somewhat paradoxical, and yet is true, that precisiation of a natural language cannot be achieved within the conceptual structure of bivalent logic. By abandoning bivalence, PNL opens the door to a major revision of concepts and techniques for dealing with knowledge representation, concept definition, deduction, and question answering. A concept that plays a key role in this revision is that of a generalized constraint [27]. The basic ideas underlying this concept are discussed in the following section. It should be stressed that what follows is an outline rather than a detailed exposition.
2 The Concepts of Generalized Constraint and Generalized-Constraint Language A conventional, hard constraint on a variable, X, is basically an inelastic restriction on the values that X can take. The problem is that in most realistic settings – and especially in the case of natural languages – constraints have some degree of elasticity or softness. For example, in the case of a sign in a hotel saying “Checkout time is 1 PM,” it is understood that 1 PM is not a hard constraint on checkout time. The same applies to “Speed limit is 65 miles per hour” and “Monika is young.” Furthermore, there are many different ways, call them modalities, in which a soft constraint restricts the values that a variable can take. These considerations suggest the following expression as the definition of generalized constraint (Fig. 2):
Fig. 2. Generalized constraint
X isr R, where X is the constrained variable; R is the constraining relation; and r is a discrete-valued modal variable whose values identify the modality of the constraint [1]. The constrained variable may be an n-ary variable, X = (X1,…,Xn); a conditional variable, X|Y; a structured variable, as in Location(Residence(X)); or a function of another variable, as in f(X ). The principal modalities are possibilistic (r = blank), probabilistic (r = p), veristic (r = v), usuality (r = u), random set (r = rs), fuzzy graph (r = fg), bimodal (r = bm), and Pawlak set (r = ps). More specifically, in a possibilistic constraint, X is R, R is a fuzzy set that plays the role of the possibility distribution of X. Thus, if U = {u} is the universe of discourse in which X takes its values, then R is a
Fig. 3. Trapezoidal membership function of “small number” (“small number” is context dependent)
fuzzy subset of U and the grade of membership of u in R, µR (u), is the possibility that X = u: µR(u) = Poss{X = u}. For example, the proposition p: X is a small number is a possibilistic constraint in which “small number” may be represented as, say, a trapezoidal fuzzy number (Fig. 3), that represents the possibility distribution of X. In general, the meaning of “small number” is context dependent. In a probabilistic constraint: X isp R, X is a random variable and R is its probability distribution. For example, X isp N(m, σ 2) means that X is a normally distributed random variable with mean m and variance σ 2. In a veristic constraint, R is a fuzzy set that plays the role of the verity (truth) distribution of X. For example, the proposition “Alan is half German, a quarter French, and a quarter Italian,” would be represented as the fuzzy set
Ethnicity (Alan) isv (0.5 | German + 0.25 | French + 0.25 | Italian),
in which Ethnicity (Alan) plays the role of the constrained variable, 0.5 | German means that the verity (truth) value of “Alan is German” is 0.5, and + plays the role of a separator. In a usuality constraint, X is a random variable, and R plays the role of the usual value of X. For example, X isu small means that usually X is small. Usuality constraints play a particularly important role in commonsense knowledge representation and perception-based reasoning. In a random set constraint, X is a fuzzy-set valued random variable and R is its probability distribution. For example, X isrs (0.3\small + 0.5\medium + 0.2\large), means that X is a random variable that takes the fuzzy sets small, medium, and large as its values with respective probabilities 0.3, 0.5, and 0.2. Random set constraints play a central role in the Dempster–Shafer theory of evidence and belief [28]. In a fuzzy graph constraint, the constrained variable is a function, f, and R is its fuzzy graph (Fig. 4). A fuzzy graph constraint is represented as F isfg (Σi Ai × Bj(i)),
Fig. 4. Fuzzy graph of a function
in which the fuzzy sets Ai and Bj(i), with j dependent on i, are the granules of X and Y, respectively, and Ai × Bj(i) is the Cartesian product of Ai and Bj(i). Equivalently, a fuzzy graph may be expressed as a collection of fuzzy if-then rules of the form

if X is Ai then Y is Bj(i), i = 1, …, m; j = 1, …, n.

For example, F isfg (small × small + medium × large + large × small) may be expressed as the rule set:

if X is small then Y is small
if X is medium then Y is large
if X is large then Y is small
Such a rule set may be interpreted as a description of a perception of f. A bimodal constraint involves a combination of two modalities: probabilistic and possibilistic. More specifically, in the generalized constraint X isbm R, X is a random variable, and R is what is referred to as a bimodal distribution, P, of X, with P expressed as P: Σi Pj(i)\Ai, in which the Ai are granules of X, and the Pj(i), with j dependent on i, are the granules of probability (Fig. 5). For example, if X is a real-valued random variable with granules labeled small, medium, and large and probability granules labeled low, medium, and high, then

X isbm (low\small + high\medium + low\large),

which means that

Prob {X is small} is low
Prob {X is medium} is high
Prob {X is large} is low
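To make this concrete, the following hedged sketch (the granules and the probability distribution are made-up assumptions) computes the probability of the fuzzy event "X is small" as Σ p(u)·µsmall(u) and then evaluates the degree to which that probability is low:

    import numpy as np

    def trap(u, a, b, c, d):
        # trapezoidal membership function with support [a, d] and core [b, c]
        if u <= a or u >= d:
            return 0.0
        if b <= u <= c:
            return 1.0
        return (u - a) / (b - a) if u < b else (d - u) / (d - c)

    xs = np.arange(0, 11)                               # assumed domain of X
    p = np.array([0.01, 0.02, 0.03, 0.06, 0.15, 0.30,   # assumed probability distribution of X
                  0.20, 0.10, 0.07, 0.04, 0.02])
    small = [trap(x, -1, 0, 2, 4) for x in xs]          # assumed granule "small"

    prob_small = float(np.dot(p, small))                # probability of the fuzzy event "X is small"
    degree = trap(prob_small, -0.1, 0.0, 0.2, 0.4)      # degree of "Prob{X is small} is low"
    print(prob_small, degree)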
Fig. 5. Bimodal distribution: perception-based probability distribution
In effect, the bimodal distribution of X may be viewed as a description of a perception of the probability distribution of X. As a perception of likelihood, the concept of a bimodal distribution plays a key role in perception-based calculus of probabilistic reasoning [29]. The concept of a bimodal distribution is an instance of combination of different modalities. More generally, generalized constraints may be combined and propagated, generating generalized constraints that are composites of other generalized constraints. The set of all such constraints together with deduction rules – rules that are based on the rules governing generalized-constraint propagation – constitutes the generalized-constraint language (GCL). An example of a generalized constraint in GCL is (X isp A) and ((X, Y ) is B), where A is the probability distribution of X and B is the possibility distribution of the binary variable (X,Y). Constraints of this form play an important role in the Dempster-Shafer theory of evidence [28].
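As a purely illustrative software representation (not part of PNL itself), a generalized constraint can be carried as a triple consisting of the constrained variable, the constraining relation, and the modality symbol r:

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class GeneralizedConstraint:
        # X isr R: constrained variable X, constraining relation R, modality r
        variable: str                 # e.g. "Age(Monika)"
        relation: Callable[..., Any]  # e.g. a membership function or a probability distribution
        modality: str = ""            # "" possibilistic, "p" probabilistic, "v" veristic, "u" usuality, ...

    # possibilistic constraint for "Monika is young" with an assumed membership function
    young = lambda u: max(0.0, min(1.0, (35.0 - u) / 10.0))
    gc = GeneralizedConstraint("Age(Monika)", young, "")
    print(gc.relation(28))   # possibility that Age(Monika) = 28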
3 The Concepts of Precisiability and Precisiation Language Informally, a proposition, p, in a natural language, NL, is precisiable if its meaning can be represented in a form that lends itself to computation and deduction. More specifically, p is precisiable if it can be translated into what may be called a precisiation language, PL, with the understanding that the elements of PL can serve as objects of computation and deduction. In this sense, mathematical languages and the languages associated with propositional logic, first-order and higher-order predicate logics, modal logic, LISP, Prolog, SQL, and related languages may be viewed as precisiation languages. The existing PL languages are based on bivalent logic. As a direct consequence, the languages in question do not have sufficient expressive power to represent the meaning of propositions that are descriptors of perceptions. For example, the proposition “All men are mortal” can be precisiated by translation into the language associated with first-order logic, but “Most Swedes are tall” cannot. The principal distinguishing feature of PNL is that the precisiation language with which it is associated is GCL. It is this feature of PNL that makes it possible to employ PNL as a meaning-precisiation language for perceptions. What should be understood, however, is that not all perceptions or, more precisely, propositions that describe perceptions, are precisiable through translation into GCL. Natural languages are basically systems for describing and reasoning with perceptions, and many perceptions are much too complex to lend themselves to precisiation. The key idea in PNL is that the meaning of a precisiable proposition, p, in a natural language is a generalized constraint X isr R. In general, X, R, and r are implicit, rather than explicit, in p. Thus, translation of p into GCL may be viewed as explicitation of X, R, and r. The expression X isr R will be referred to as the GC form of p, written as GC(p). In PNL, a proposition, p, is viewed as an answer to a question, q. To illustrate, the proposition p: Monika is young may be viewed as the answer to the question q: How old is Monika? More concretely: p: Monika is young → p*: Age (Monika) is young q: How old is Monika? → q*: Age (Monika) is ?R where p* and q* are abbreviations for GC(p) and GC(q), respectively. In general, the question to which p is an answer is not unique. For example, p: Monika is young could be viewed as an answer to the question
q: Who is young? In most cases, however, among the possible questions there is one that is most likely. Such a question plays the role of a default question. The GC form of q is, in effect, the translation of the question to which p is an answer. The following simple examples are intended to clarify the process of translation from NL to GCL. p: Tandy is much older than Dana → (Age(Tandy), Age(Dana)) is much.older,
where much.older is a binary fuzzy relation that has to be calibrated as a whole rather than through composition of much and older. p: Most Swedes are tall
To deal with the example, it is necessary to have a means of counting the number of elements in a fuzzy set. There are several ways in which this can be done, the simplest of which relates to the concept of ΣCount (sigma count). More specifically, if A and B are fuzzy sets in a space U = {u1, …, un}, with membership functions µA and µB, respectively, then ΣCount(A) = Σi µA(ui), and the relative ΣCount, that is, the relative count of elements of A that are in B, is defined as ΣCount(A/B) = ΣCount(A ∩ B)/ΣCount(B), in which the membership function of the intersection A ∩ B is defined as µA∩B(u) = µA(u) ∧ µB(u), where ∧ is min or, more generally, a t-norm [30, 31]. Using the concept of sigma count, the translation in question may be expressed as

Most Swedes are tall → ΣCount(tall.Swedes/Swedes) is most
where most is a fuzzy number that defines most as a fuzzy quantifier [32, 33] (Fig. 6).
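A minimal executable sketch of this translation (the membership function of tall, the sample of heights, and the calibration of most are all illustrative assumptions):

    import numpy as np

    def tall(h_cm):
        # assumed membership of "tall": 0 below 160 cm, 1 above 180 cm, linear in between
        return float(np.clip((h_cm - 160.0) / 20.0, 0.0, 1.0))

    def most(r):
        # assumed shoulder-shaped calibration of "most" on proportions in [0, 1]
        return float(np.clip((r - 0.5) / 0.25, 0.0, 1.0))

    heights = [162, 170, 174, 178, 181, 183, 186, 190]   # assumed sample of Swedes
    sigma_count_tall = sum(tall(h) for h in heights)     # SigmaCount(tall.Swedes)
    relative = sigma_count_tall / len(heights)           # SigmaCount(tall.Swedes/Swedes)
    truth = most(relative)                               # truth degree of "Most Swedes are tall"
    print(relative, truth)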
Fig. 6. Calibration of most and usually represented as trapezoidal fuzzy numbers

p: Usually Robert returns from work at about 6 PM
q: When does Robert return from work?
X: Time of return of Robert from work, Time(Return)
R: about 6 PM (6* PM)
r: u (usuality)
p*: Prob {Time(Return) is 6* PM} is usually.
A less simple example is: p: It is very unlikely that there will be a significant increase in the price of oil in the near future.
In this example, it is expedient to start with the semantic network representation [8] of p that is shown in Fig. 7. In this representation, E is the main event and E* is a subevent of E:

E: significant increase in the price of oil in the near future
E*: significant increase in the price of oil

Thus, near future is the epoch of E*.
The GC form of p may be expressed as Prob(E) is R, where R is the fuzzy probability, very unlikely, whose membership function is related to that of likely (Fig. 8) by

µvery.unlikely(u) = (1 – µlikely(u))2,
Fig. 7. Semantic network of p. (It is very unlikely that there will be a significant increase in the price of oil in the near future)
Fig. 8. Precisiation of very unlikely
where it is assumed for simplicity that very acts as an intensifier that squares the membership function of its operand, and that the membership function of unlikely is the mirror image of that of likely. Given the membership functions of significant increase and near future (Fig. 9), we can compute the degree to which a specified time function that represents a variation in the price of oil satisfies the conjunction of the constraints significant increase and near future. This degree may be employed to compute the truth value of p as a function of the probability
Fig. 9. Computation of degree of compatibility
distribution of the variation in the price of oil. In this instance, the use of PNL may be viewed as an extension of truth-conditional semantics [34, 13]. What should be noted is that precisiation and meaning representation are not coextensive. More specifically, precisiation of a proposition, p, assumes that the meaning of p is understood and that what is involved is a precisiation of the meaning of p.
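The modifier conventions used above can be sketched directly; in the following hedged example the shape of likely is itself an assumption, very squares the membership function of its operand, and very unlikely is evaluated with the formula given above:

    def likely(p):
        # assumed shoulder-shaped membership of "likely" over probability values in [0, 1]
        return min(1.0, max(0.0, (p - 0.4) / 0.4))

    def very_unlikely(p):
        # "very" squares the membership function of its operand, giving (1 - likely(p))**2
        return (1.0 - likely(p)) ** 2

    for p in (0.05, 0.3, 0.6, 0.9):
        print(p, round(very_unlikely(p), 3))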
4 The Concept of a Protoform and the Structure of PNL A concept that plays a key role in PNL is that of a protoform – an abbreviation of prototypical form. Informally, a protoform is an abstracted summary of an object that may be a proposition, command, question, scenario, concept, decision problem, or, more generally, a system of such objects. The importance of the concept of a protoform derives from the fact that it places in evidence the deep semantic structure of the object to which it applies. For example, the protoform of the proposition p: Monika is young is PF( p): A(B) is C, where A is abstraction of the attribute Age, B is abstraction of Monika, and C is abstraction of young. Conversely, Age is instantiation of A, Monika is instantiation of B, and young is instantiation of C. Abstraction may be annotated, for example, A/Attribute, B/Name, and C/Attribute.value. A few examples are shown in Fig. 10. Basically, abstraction is a means of generalization. Abstraction has levels, just as summarization does. For example, successive abstractions of p: Monika is young are
A(Monika) is young, A(B) is young, and A(B) is C, with the last abstraction resulting in the terminal protoform, or simply the protoform. With this understanding, the protoform of p: Most Swedes are tall is Q As are Bs, or
Fig. 10. Examples of translation from NL to PFL
equivalently, Count(B/A) is Q, and the protoform of p: Usually Robert returns from work at about 6 PM, is Prob(X is A) is B, where X, A, and B are abstractions of “Time (Robert.returns.from work),” “About 6 PM,” and “Usually.” For simplicity, the protoform of p may be written as p**. Abstraction is a familiar concept in programming languages and programming systems. As will be seen in the following, the role of abstraction in PNL is significantly different and more essential because PNL abandons bivalence. The concept of a protoform has some links to other basic concepts such as ontology [9, 35–37] conceptual graph [38] and Montague grammar [39]. However, what should be stressed is that the concept of a protoform is not limited – as it is in the case of related concepts – to propositions whose meaning can be represented within the conceptual structure of bivalent logic. As an illustration, consider a proposition, p, which was dealt with earlier:
p: It is very unlikely that there will be a significant increase in the price of oil in the near future.
With reference to the semantic network of p (Fig. 7), the protoform of p may be expressed as:

Prob(E) is A      (A: very unlikely)
E: B(E*) is C     (B: epoch; C: near.future)
E*: F(D)          (F: significant increase; D: price of oil)
D: G(H)           (G: price; H: oil)
Using the protoform of p and calibrations of significant increase, near future, and likely (Fig. 9), we can compute, in principle, the degree to which any given probability distribution of time functions representing the price of oil satisfies the generalized constraint, Prob(E) is A. As was pointed out earlier, if the degree of compatibility is interpreted as the truth value of p, computation of the truth value of p may be viewed as a PNL-based extension of truth-conditional semantics. By serving as a means of defining the deep semantic structure of an object, the concept of a protoform provides a platform for a fundamental mode of classification of knowledge based on protoform equivalence, or PF equivalence for short. More specifically, two objects are protoform equivalent at a specified level of summarization and abstraction if at that level they have identical protoforms. For example, the propositions p: Most Swedes are tall, and q: Few professors are rich, are PF equivalent since their common protoform is Q As are Bs or, equivalently, Count (B/A) is Q. The same applies to the propositions p: Oakland is near San Francisco, and q: Rome is much older than Boston. A simple example of PF equivalent concepts is: cluster and mountain. A less simple example involving PF equivalence of scenarios of decision problems is the following. Consider the scenarios of two decision problems, A and B:

Scenario A: Alan has severe back pain. He goes to see a doctor. The doctor tells him that there are two options: (1) do nothing and (2) do surgery. In the case of surgery, there are two possibilities: (a) surgery is successful, in which case Alan will be pain-free, and (b) surgery is not successful, in which case Alan will be paralyzed from the neck down. Question: Should Alan elect surgery?
Scenario B: Alan needs to fly from San Francisco to St. Louis and has to get there as soon as possible. One option is to fly to St. Louis via Chicago, and the other is to go through Denver. The flight via Denver is scheduled to arrive in St. Louis at time a. The flight via Chicago is scheduled to arrive in St. Louis at time b, with a < b. However, the connection time in Denver is short. If the connection flight is missed, then the time of arrival in St. Louis will be c, with c > b. Question: Which option is best? The common protoform of A and B is shown in Fig. 11. What this protoform means is that there are two options, one that is associated with a certain gain or loss and another that has two possible outcomes whose probabilities may not be known precisely. The protoform language, PFL, is the set of protoforms of the elements of the generalized-constraint language, GCL. A consequence of the concept of PF equivalence is that cardinality of PFL is orders of magnitude lower than that of GCL or, equivalently, the set of precisiable propositions in NL. As will be seen in the sequel, the low cardinality of PFL plays an essential role in deduction.
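To make the notion of PF equivalence concrete, the following is a minimal Python sketch (not part of Zadeh's exposition) of how two parsed propositions might be abstracted level by level and compared for protoform equality; the proposition classes, field names, and abstraction levels are illustrative assumptions rather than an implementation of PNL.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Quantified:           # "Q A's are B's", e.g., "most Swedes are tall"
    quantifier: str
    restriction: str
    summarizer: str

@dataclass
class Attributive:          # "A(B) is C", e.g., "Age(Monika) is young"
    attribute: str
    name: str
    value: str

Proposition = Union[Quantified, Attributive]

def protoform(p: Proposition, level: int = 3) -> str:
    """Return the abstracted summary of p; level counts how many
    constituents are abstracted (3 = terminal protoform)."""
    if isinstance(p, Quantified):
        q = "Q" if level >= 1 else p.quantifier
        a = "A" if level >= 2 else p.restriction
        b = "B" if level >= 3 else p.summarizer
        return f"{q} {a}'s are {b}'s"
    if isinstance(p, Attributive):
        a = "A" if level >= 1 else p.attribute
        b = "B" if level >= 2 else p.name
        c = "C" if level >= 3 else p.value
        return f"{a}({b}) is {c}"
    raise TypeError("unsupported proposition type")

def pf_equivalent(p: Proposition, q: Proposition, level: int = 3) -> bool:
    # Two objects are PF-equivalent at a given level if their protoforms coincide.
    return protoform(p, level) == protoform(q, level)

p = Quantified("most", "Swedes", "tall")
q = Quantified("few", "professors", "rich")
r = Attributive("Age", "Monika", "young")
print(pf_equivalent(p, q))   # True  -> common protoform "Q A's are B's"
print(pf_equivalent(p, r))   # False -> different deep structures
```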
Fig. 11. Protoform equivalence of scenarios A and B
The principal components of the structure of PNL (Fig. 12) are (1) a dictionary from NL to GCL; (2) a dictionary from GCL to PFL (Fig. 13); (3) a multiagent, modular deduction database, DDB; and (4) a world knowledge database, WKDB. The constituents of DDB are modules, with a module consisting of a group of protoformal rules of deduction, expressed in PFL (Fig. 14), that are drawn from a particular domain, for example, probability, possibility, usuality, fuzzy arithmetic [18], fuzzy logic, search, and so on. For example, a rule drawn from fuzzy logic is the compositional rule of inference, expressed in Fig. 14, where A°B is the composition of A and B, defined in the computational part, in which µA, µB, and µA°B are the membership functions of A, B, and A°B, respectively. Similarly, a rule drawn from probability is shown in Fig. 15, where D is defined in the computational part. The rules of deduction in DDB are, basically, the rules that govern propagation of generalized constraints. Each module is associated with an agent whose function is that of controlling execution of rules and performing embedded computations. The top-level agent controls the passing of results of computation from a module to other modules. The structure of protoformal, that is, protoform-based, deduction is shown in Fig. 16. A simple example of protoformal deduction is shown in Fig. 17.

Fig. 12. Basic structure of PNL

Fig. 13. Structure of PNL dictionaries
Fig. 14. Compositional rule of inference
Fig. 15. Rule drawn from probability
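As a concrete illustration of the computational part of the compositional rule of inference (Fig. 14), the following sketch applies the standard max–min composition, µ_{A∘B}(v) = max_u min(µ_A(u), µ_B(u, v)), over small discretized universes; the membership values are invented for illustration and are not taken from the figure.

```python
# Max-min composition of a fuzzy set A on U with a fuzzy relation B on U x V:
#   mu_{A o B}(v) = max_u min(mu_A(u), mu_B(u, v))
def compose(mu_A, mu_B, U, V):
    return {v: max(min(mu_A[u], mu_B[(u, v)]) for u in U) for v in V}

U = ["low", "medium", "high"]      # discretized universe of X
V = ["small", "large"]             # discretized universe of Y

mu_A = {"low": 0.2, "medium": 1.0, "high": 0.4}    # constraint "X is A"
mu_B = {                                            # relation "(X, Y) is B"
    ("low", "small"): 1.0, ("low", "large"): 0.1,
    ("medium", "small"): 0.6, ("medium", "large"): 0.7,
    ("high", "small"): 0.2, ("high", "large"): 1.0,
}

print(compose(mu_A, mu_B, U, V))   # induced constraint "Y is A o B"
```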
Fig. 16. Structure of protoform-based deduction
Fig. 17. Example of protoformal reasoning
The world knowledge database (WKDB) consists of propositions that describe world knowledge, for example, Parking near the campus is hard to find on weekdays between 9 and 4; Big cars are safer than small cars; If
A/person works in B/city then it is likely that A lives in or near B; If A/person is at home at time t then A has returned from work at t or earlier, on the understanding that A stayed home after returning from work. Much, perhaps most, of the information in WKDB is perception based. World knowledge – and especially world knowledge about probabilities – plays an essential role in almost all search processes, including searching the Web. Semantic Web and related approaches have contributed to a significant improvement in performance of search engines. However, for further progress it may be necessary to add to existing search engines the capability to operate on perception-based information. It will be a real challenge to employ PNL to add this capability to sophisticated knowledge-management systems such as the Web Ontology Language (OWL) [36], Cyc [40], WordNet [41], and ConceptNet [42].

An example of PFL-based deduction in which world knowledge is used is the so-called Robert example. A simplified version of the example is the following. The initial data set is the proposition (perception) p: Usually Robert returns from work at about 6 PM. The question is q: What is the probability that Robert is home at 6:15 PM? The first step in the deduction process is to use the NL to GCL dictionary for deriving the generalized-constraint forms, GC(p) and GC(q), of p and q, respectively. The second step is to use the GCL to PFL dictionary to derive the protoforms of p and q. The forms are:

p*: Prob(Time(Robert.returns.from.work) is about 6 PM) is usually
q*: Prob(Time(Robert is home) is 6:15 PM) is ?E

and

p**: Prob(X is A) is B
q**: Prob(Y is C) is ?D
The third step is to refer the problem to the top-level agent with the query: Is there a rule or a chain of rules in DDB that leads from p** to q**? The top-level agent reports a failure to find such a chain, but success in finding a proximate rule of the form

Prob(X is A) is B
Prob(X is C) is D.
The fourth step is to search the WKDB for a proposition or a chain of propositions that allow Y to be replaced by X. A proposition that makes
this possible is (A/person is in B/location) at T/time if A arrives at B before T, with the understanding that A stays at B after arrival. The last step involves the use of the modified form of q**: Prob(X is E) is ?D, in which E is "before 6:15 PM." The answer to the initial query is given by the solution of the variational problem associated with the rule that was described earlier (Fig. 15):

Prob(X is A) is B
Prob(X is C) is D
The value of D is the desired probability. What is important to observe is that there is a tacit assumption that underlies the deduction process, namely, that the chains of deduction are short. This assumption is a consequence of the intrinsic imprecision of perception-based information. Its further implication is that PNL is likely to be effective, in the main, in the realm of domain-restricted systems associated with small universes of discourse.
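The following is a rough, illustrative sketch of the variational problem behind the rule Prob(X is A) is B ⟹ Prob(X is C) is D in the Robert example. The calibrations of "about 6 PM," "usually," and "before 6:15 PM" are invented assumptions, and the supremum over probability distributions is explored only over a simple two-point family, so the output is an approximation of µ_D, not an exact solution.

```python
def tri(x, a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def ramp(p, a, b):
    """Nondecreasing ramp: 0 below a, 1 above b."""
    return 0.0 if p <= a else 1.0 if p >= b else (p - a) / (b - a)

TIMES = list(range(0, 181, 5))            # return times, minutes after 5 PM
mu_A = lambda t: tri(t, 40, 60, 80)       # "about 6 PM"      (assumed calibration)
mu_B = lambda p: ramp(p, 0.6, 0.85)       # "usually"         (assumed calibration)
mu_C = lambda t: 1.0 if t <= 75 else 0.0  # "before 6:15 PM"  (crisp, for simplicity)

def mu_D(v):
    """Approximate mu_D(v) = sup_g mu_B(sum_t mu_A(t) g(t))  s.t.  sum_t mu_C(t) g(t) = v,
    restricting g to two-point distributions: mass v before 6:15 PM, mass 1 - v after it."""
    best = 0.0
    early = [t for t in TIMES if mu_C(t) == 1.0]
    late = [t for t in TIMES if mu_C(t) == 0.0]
    for t1 in early:
        for t2 in late:
            prob_A = v * mu_A(t1) + (1 - v) * mu_A(t2)
            best = max(best, mu_B(prob_A))
    return best

for v in (0.5, 0.7, 0.85, 1.0):
    # fuzzy probability that Robert is home at 6:15 PM
    print(f"mu_D({v}) = {mu_D(v):.2f}")
```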
5 PNL as a Definition Language

As we move further into the age of machine intelligence and automated reasoning, a problem that is certain to grow in visibility and importance is that of definability – that is, the problem of defining the meaning of a concept or a proposition in a way that can be understood by a machine. It is a deeply entrenched tradition in science to define a concept in a language that is based on bivalent logic [43–45]. Thus defined, a concept, C, is bivalent in the sense that every object, X, is either an instance of C or it is not, with no degrees of truth allowed. For example, a system is either stable or unstable, a time series is either stationary or nonstationary, a sentence is either grammatical or ungrammatical, and events A and B are either independent or not independent. The problem is that bivalence of concepts is in conflict with reality. In most settings, stability, stationarity, grammaticality, independence, relevance, causality, and most other concepts are not bivalent. When a concept that is not bivalent is defined as if it were bivalent, the ancient Greek sorites (heap) paradox comes into play. As an illustration, consider the standard bivalent definition of independence of events, say A and B. Let P(A), P(B), and P_A(B) be the probabilities of A, B, and B given A, respectively. Then A and B are independent if and only if P_A(B) = P(B).
Now assume that the equality is not satisfied exactly, with the difference between the two sides being ε. As ε increases, at which point will A and B cease to be independent? Clearly, independence is a matter of degree, and furthermore the degree is context dependent. For this reason, we do not have a universally accepted definition of degree of independence [46].

One of the important functions of PNL is that of serving as a definition language. More specifically, PNL may be employed as a definition language for two different purposes: first, to define concepts for which no general definitions exist, for example, causality, summary, relevance, and smoothness; and second, to redefine concepts for which universally accepted definitions exist, for example, linearity, stability, independence, and so on. In what follows, the concept of independence of random variables will be used as an illustration. For simplicity, assume that X and Y are random variables that take values in the interval [a, b]. The interval is granulated as shown in Fig. 18, with S, M, and L denoting the fuzzy intervals small, medium, and large. Using the definition of relative ΣCount, we construct a contingency table, C, of the form shown in Fig. 18, in which an entry such as ΣCount(S/L) is a granulated fuzzy number that represents the relative ΣCount of occurrences of Y, which are small, relative to occurrences of X, which are large.
Fig. 18. PNL-based definition of statistical independence
Based on the contingency table, the degree of independence of Y from X may be equated to the degree to which the columns of the contingency table are identical. One way of computing this degree is, first, to compute the distance between two columns and then aggregate the distances between all pairs of columns. PNL would be used for this purpose. An important point that this example illustrates is that, typically, a PNL-based definition involves a general framework with a flexible choice of details governed by the context or a particular application. In this sense, the use of PNL implies an abandonment of the quest for universality, or, to put it more graphically, of the one-size-fits-all modes of definition that are associated with the use of bivalent logic.

Another important point is that PNL suggests an unconventional approach to the definition of complex concepts. The basic idea is to define a complex concept in a natural language and then employ PNL to precisiate the definition. More specifically, let U be a universe of discourse and let C be a concept that I wish to define, with C relating to elements of U. For example, U is a set of buildings, and C is the concept of tall building. Let p(C) and d(C) be, respectively, my perception and my definition of C. Let I(p(C)) and I(d(C)) be the intensions of p(C) and d(C), respectively, with intension used in its logical sense [34, 43], that is, as a criterion or procedure that identifies those elements of U that fit p(C) or d(C). For example, in the case of tall buildings, the criterion may involve the height of a building. Informally, a definition, d(C), is a good fit or, more precisely, is co-intensive, if its intension coincides with the intension of p(C). A measure of goodness of fit is the degree to which the intension of d(C) coincides with that of p(C). In this sense, co-intension is a fuzzy concept. As a high-level definition language, PNL makes it possible to formulate definitions whose degree of co-intensiveness is higher than that of definitions formulated through the use of languages based on bivalent logic.
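A minimal sketch of the PNL-based definition of independence discussed above might look as follows; the granule calibrations, the use of min for the conjunction inside the relative ΣCount, and the choice of "one minus the average absolute column difference" as the aggregated similarity are all illustrative assumptions, since the chapter deliberately leaves these details to the context.

```python
import random

def tri(x, a, b, c):
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Granulation of [0, 100] into fuzzy intervals S, M, L (assumed calibrations).
GRANULES = {
    "S": lambda x: 1.0 if x <= 20 else 0.0 if x >= 50 else (50 - x) / 30,
    "M": lambda x: tri(x, 20, 50, 80),
    "L": lambda x: 1.0 if x >= 80 else 0.0 if x <= 50 else (x - 50) / 30,
}

def contingency_table(samples):
    """C[u][v] = relative SigmaCount of (Y is u) given (X is v), per sample pair (x, y)."""
    table = {}
    for u, mu_u in GRANULES.items():        # granule of Y (rows)
        table[u] = {}
        for v, mu_v in GRANULES.items():    # granule of X (columns)
            num = sum(min(mu_u(y), mu_v(x)) for x, y in samples)
            den = sum(mu_v(x) for x, _ in samples)
            table[u][v] = num / den if den else 0.0
    return table

def degree_of_independence(table):
    """One minus the average absolute difference between column entries,
    aggregated over all pairs of columns (one simple choice among many)."""
    cols = list(GRANULES)
    dists = [sum(abs(table[u][cols[i]] - table[u][cols[j]]) for u in GRANULES) / len(GRANULES)
             for i in range(len(cols)) for j in range(i + 1, len(cols))]
    return 1.0 - sum(dists) / len(dists)

random.seed(0)
independent = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(500)]
dependent = [(x, x) for x, _ in independent]        # Y = X: fully dependent
for name, data in (("independent", independent), ("dependent", dependent)):
    print(name, round(degree_of_independence(contingency_table(data)), 2))
```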
6 Concluding Remarks

Existing theories of natural languages are based, anachronistically, on Aristotelian logic – a logical system whose centerpiece is the principle of the excluded middle: Truth is bivalent, meaning that every proposition is either true or not true, with no shades of truth allowed. The problem is that bivalence is in conflict with reality – the reality of pervasive imprecision of natural languages. The underlying facts are
(a) a natural language, NL, is, in essence, a system for describing perceptions and (b) perceptions are intrinsically imprecise, reflecting the bounded ability of sensory organs, and ultimately the brain, to resolve detail and store information. PNL abandons bivalence. What this means is that PNL is based on fuzzy logic – a logical system in which everything is, or is allowed to be, a matter of degree. Abandonment of bivalence opens the door to exploration of new directions in theories of natural languages. One such direction is that of precisiation. A key concept underlying precisiation is the concept of a generalized constraint. It is this concept that makes it possible to represent the meaning of a proposition drawn from a natural language as a generalized constraint. Conventional, bivalent constraints cannot be used for this purpose. The concept of a generalized constraint provides a basis for construction of GCL – a language whose elements are generalized constraints and their combinations. Within the structure of PNL, GCL serves as a precisiation language for NL. Thus, a proposition in NL is precisiated through translation into GCL. Not every proposition in NL is precisiable. In effect, the elements of PNL are precisiable propositions in NL. What should be underscored is that in its role as a high-level definition language, PNL provides a basis for a significant enlargement of the role of natural languages in scientific theories.
Dedication

This article is dedicated to Noam Chomsky.
References 1. Zadeh, L. A. 1999. From computing with numbers to computing with words – From manipulation of measurements to manipulation of perceptions. IEEE Transactions on Circuits and Systems 45(1): 105–119 2. Zadeh, L. A. 2000. Toward a logic of perceptions based on fuzzy logic. In Discovering the World with Fuzzy Logic: Studies in Fuzziness and Soft Computing (57), eds. V. Novak and I. Perfilieva, pp. 4–25. Heidelberg: Physica-Verlag
3. Zadeh, L. A. 1997. Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems 90(2): 111–127 4. Novak, V. 1991. Fuzzy logic, fuzzy sets, and natural languages. International Journal of General Systems 20(1): 83–97 5. Biermann, A. W. and Ballard, B. W. 1980. Toward natural language computation. American Journal of Computational Linguistics (6)2: 71–86 6. Klein, E. 1980. A semantics for positive and comparative adjectives. Linguistics and Philosophy 4(1): 1–45 7. Barwise, J. and Cooper, R. 1981. Generalized quantifiers and natural language. Linguistics and Philosophy 4(1): 159–209 8. Sowa, J. F. 1991. Principles of Semantic Networks: Explorations in the Representation of Knowledge. San Francisco: Morgan Kaufmann 9. Sowa, J. F. 1999. Ontological categories. In Shapes of Forms: From Gestalt Psychology and Phenomenology to Ontology and Mathematics, ed. L. Albertazzi, pp. 307–340. Dordrecht, The Netherlands: Kluwer. 10. McAllester, D. A. and Givan, R. 1992. Natural language syntax and first-order inference. Artificial Intelligence 56(1): 1–20 11. Macias, B. and Stephen G. Pulman. 1995. A method for controlling the production of specifications in natural language. The Computing Journal 38(4): 310–318 12. Mani, I. and Maybury M. T., eds. 1999. Advances in Automatic Text Summarization. Cambridge, MA: The MIT Press 13. Allan, K. 2001. Natural Language Semantics. Oxford: Blackwell 14. Fuchs, N. E. and Schwertel, U. 2003. Reasoning in Attempto Controlled English. In Proceedings of the Workshop on Principles and Practice of Semantic Web Reasoning (PPSWR 2003), pp. 174–188. Lecture Notes in Computer Science. Berlin: Springer 15. Sukkarieh, J. 2003. Mind Your Language! Controlled Language for Inference Purposes. Paper presented at the Joint Conference of the Eighth International Workshop of the European Association for Machine Translation and the Fourth Controlled Language Applications Workshop, Dublin, Ireland, 15–17 May 16. Peterson, P. 1979. On the Logic of Few, Many and Most. Journal of Formal Logic 20(1–2): 155–179 17. Zadeh, L. A. 1983. A computational approach to fuzzy quantifiers in natural languages. Computers and Mathematics 9: 149–184 18. Kaufmann, A. and Gupta, M. M. 1985. Introduction to Fuzzy Arithmetic: Theory and Applications. New York: Van Nostrand 19. Hajek, P. 1998. Metamathematics of Fuzzy Logic: Trends in Logic (4). Dordrecht, The Netherlands: Kluwer 20. Zadeh, L. A. 1968. Probability measures of fuzzy events. Journal of Mathematical Analysis and Applications 23: 421–427
21. McCarthy, J. 1990. Formalizing Common Sense, eds. V. Lifschitz and J. McCarthy. Norwood, New Jersey: Ablex 22. Davis, E. 1990. Representations of Common-sense Knowledge. San Francisco: Morgan Kaufmann 23. Yager, R. R. 1991. Deductive approximate reasoning systems. IEEE Transactions on Knowledge and Data Engineering 3(4): 399–414 24. Sun, R. 1994. Integrating Rules and Connectionism for Robust Commonsense Reasoning. New York: Wiley 25. Dubois, D. and Prade, H. 1996. Approximate and commonsense reasoning: From theory to practice. In Proceedings of the Foundations of Intelligent Systems. Ninth International Symposium, pp. 19–33. Berlin: Springer 26. Lehnert, W. G. 1978. The Process of Question Answering – A Computer Simulation of Cognition. Hillsdale, New Jersey: Lawrence Erlbaum 27. Zadeh, L. A. 1986. Outline of a computational approach to meaning and knowledge representation based on the concept of a generalized assignment statement. In Proceedings of the International Seminar on Artificial Intelligence and Man-Machine Systems, eds. M. Thoma and A. Wyner, pp. 198–211. Heidelberg: Springer 28. Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton, New Jersey: Princeton University Press 29. Zadeh, L. A. 2002. Toward a perception-based theory of probabilistic reasoning with imprecise probabilities. Journal of Statistical Planning and Inference 105(1): 233–264 30. Pedrycz, W. and F. Gomide. 1998. Introduction to Fuzzy Sets. Cambridge, MA: MIT. 31. Klement, P., Mesiar, R., and Pap, E. 2000. Triangular norms – Basic properties and representation theorems. In Discovering the World with Fuzzy Logic: Studies in Fuzziness and Soft Computing (57), eds. V. Novak and I. Perfilieva, pp. 63–80. Heidelberg: Physica-Verlag 32. Zadeh, L. A. 1984. Syllogistic reasoning in fuzzy logic and its application to reasoning with dispositions. In Proceedings International Symposium on Multiple-Valued Logic, pp. 148–153. Los Alamitos, CA: IEEE Computer Society 33. Mesiar, R. and H. Thiele. 2000. On T-quantifiers and S-quantifiers. In Discovering the World with Fuzzy Logic: Studies in Fuzziness and Soft Computing (57), eds. V. Novak and I. Perfilieva, pp. 310–318. Heidelberg: Physica-Verlag 34. Cresswell, M. J. 1973. Logic and Languages. London: Methuen 35. Smith, B. and C. Welty. 2002. What is ontology? Ontology: Towards a new synthesis. In Proceedings of the Second International Conference on Formal Ontology in Information Systems. New York: Association for Computing Machinery 36. Smith, M. K., C. Welty, and D. McGuinness, eds. 2003. OWL Web Ontology Language Guide. W3C Working Draft 31. Cambridge, MA: World Wide Web Consortium (W3C)
37. Corcho, O., Fernandez-Lopez, M., and Gomez-Perez, A. 2003. Methodologies, tools and languages for building ontologies. Where is their meeting point? Data and Knowledge Engineering 46(1): 41–64 38. Sowa, J. F. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, MA: Addison-Wesley 39. Partee, B. 1976. Montague Grammar. New York: Academic 40. Lenat, D. B. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM 38(11): 32–38 41. Fellbaum, C., ed. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT 42. Liu, H. and Singh, P. 2004. Commonsense reasoning in and over natural language. In Proceedings of the Eighth International Conference on Knowledge-Based Intelligent Information and Engineering Systems Brighton, U.K.: KES Secretariat, Knowledge Transfer Partnership Centre. 43. Gamat, T. F. 1996. Language, Logic and Linguistics. Chicago: University of Chicago Press 44. Gerla, G. 2000. Fuzzy metalogic for crisp logics. In Discovering the World with Fuzzy Logic: Studies in Fuzziness and Soft Computing (57), eds. V. Novak and I. Perfilieva, pp. 175–187. Heidelberg: Physica-Verlag 45. Hajek, P. 2000. Many. In Discovering the World with Fuzzy Logic: Studies in Fuzziness and Soft Computing (57), eds. V. Novak and I. Perfilieva, pp. 302–304. Heidelberg: Physica-Verlag 46. Klir, G. J. 2000. Uncertainty-based information: A critical review. In Discovering the World with Fuzzy Logic: Studies in Fuzziness and Soft Computing (57), eds. V. Novak and I. Perfilieva, pp. 29–50. Heidelberg: Physica-Verlag 47. Zadeh, L. A. 2001. A new direction in AI – Toward a computational theory of perceptions. AI Magazine 22(1): 73–84 48. Lehmke, S. 2000. Degrees of truth and degrees of validity. In Discovering the World with Fuzzy Logic: Studies in Fuzziness and Soft Computing (57), eds. V. Novak and I. Perfilieva, pp. 192–232. Heidelberg: Physica-Verlag 49. Novak, V., and I. Perfilieva, eds. 2000. Discovering the World with Fuzzy Logic: Studies in Fuzziness and Soft Computing. Heidelberg: Physica-Verlag
Appendix

The Tall Swedes Problem (Version 2)

In the following, a* denotes "approximately a." Swedes more than 20 years of age range in height from 140 to 220 cm. Over 70*% are taller than 170* cm; less than 10*% are shorter than 150* cm; and less than 15*% are taller than 200* cm. What is the average height of Swedes over 20?
Fuzzy Logic Solution

Consider a population of Swedes over 20, S = {Swede_1, Swede_2, …, Swede_N}, with h_i, i = 1, …, N, being the height of Swede_i. The datum "Over 70*% of S are taller than 170* cm" constrains the h_i in h = (h_1, …, h_N). The constraint is precisiated through translation into GCL. More specifically, let X denote a variable taking values in S, and let X|(h(X) is ≥ 170*) denote a fuzzy subset of S induced by the constraint h(X) is ≥ 170*. Then

Over 70*% of S are taller than 170* → (GCL): (1/N) ΣCount(X | h(X) is ≥ 170*) is ≥ 0.7*

where ΣCount is the sigma-count of the Xs that satisfy the fuzzy constraint h(X) is ≥ 170*. Similarly,

Less than 10*% of S are shorter than 150* → (GCL): (1/N) ΣCount(X | h(X) is ≤ 150*) is ≤ 0.1*

and

Less than 15*% of S are taller than 200* → (GCL): (1/N) ΣCount(X | h(X) is ≥ 200*) is ≤ 0.15*

A general deduction rule in fuzzy logic is the following. In this rule, X is a variable that takes values in a finite set U = {u_1, u_2, …, u_N}, and a(X) is a real-valued attribute of X, with a_i = a(u_i) and a = (a_1, …, a_N):

(1/N) ΣCount(X | a(X) is C) is B
Av(X) is ?D
where Av(X ) is the average value of X over U. Thus, computation of the average value, D, reduces to the solution of the nonlinear programming problem
\mu_D(v) = \max_{a}\ \mu_B\!\left(\frac{1}{N}\sum_i \mu_C(a_i)\right)

subject to

v = \frac{1}{N}\sum_i a_i \quad (\text{average value})

where µD, µB, and µC are the membership functions of D, B, and C, respectively. To apply this rule to the constraints in question, it is necessary to form their conjunction. Then, the fuzzy logic solution of the problem may be reduced to the solution of the nonlinear programming problem

\mu_D(v) = \max_{h}\left[\mu_{\ge 0.7^*}\!\left(\frac{1}{N}\sum_i \mu_{\ge 170^*}(h_i)\right) \wedge \mu_{\le 0.1^*}\!\left(\frac{1}{N}\sum_i \mu_{\le 150^*}(h_i)\right) \wedge \mu_{\le 0.15^*}\!\left(\frac{1}{N}\sum_i \mu_{\ge 200^*}(h_i)\right)\right]

subject to

v = \frac{1}{N}\sum_i h_i

Note that computation of D requires calibration of the membership functions of ≥ 170*, ≥ 0.7*, ≤ 150*, ≤ 0.1*, ≥ 200*, and ≤ 0.15*.
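A crude numerical sketch of the Tall Swedes computation is given below. The ramp-shaped calibrations of the fuzzy constraints are assumed, and the maximization over height vectors h is replaced by a random search over populations built from four height groups, so the printed values only roughly trace the shape of µ_D rather than solve the nonlinear programming problem exactly.

```python
import random

def ramp_up(x, a, b):     # 0 below a, 1 above b   (models "is >= ...*")
    return 0.0 if x <= a else 1.0 if x >= b else (x - a) / (b - a)

def ramp_down(x, a, b):   # 1 below a, 0 above b   (models "is <= ...*")
    return 1.0 if x <= a else 0.0 if x >= b else (b - x) / (b - a)

# Assumed calibrations of the fuzzy constraints (illustrative only).
mu_ge_170 = lambda h: ramp_up(h, 165, 175)
mu_le_150 = lambda h: ramp_down(h, 145, 155)
mu_ge_200 = lambda h: ramp_up(h, 195, 205)
mu_ge_07 = lambda p: ramp_up(p, 0.65, 0.75)
mu_le_01 = lambda p: ramp_down(p, 0.08, 0.12)
mu_le_015 = lambda p: ramp_down(p, 0.13, 0.17)

def objective(h):
    """Conjunction (min) of the three precisiated constraints for a height vector h."""
    n = len(h)
    f1 = mu_ge_07(sum(map(mu_ge_170, h)) / n)
    f2 = mu_le_01(sum(map(mu_le_150, h)) / n)
    f3 = mu_le_015(sum(map(mu_ge_200, h)) / n)
    return min(f1, f2, f3)

# Random search over populations built from four height groups
# (a crude substitute for a nonlinear-programming solver).
random.seed(1)
N = 200
GRID = list(range(150, 211, 5))
mu_D = {v: 0.0 for v in GRID}
for _ in range(20000):
    f_short, f_vtall = random.uniform(0, 0.2), random.uniform(0, 0.2)
    f_tall = random.uniform(0, 1 - f_short - f_vtall)
    f_mid = 1 - f_short - f_vtall - f_tall
    groups = ((f_short, 145), (f_mid, 160), (f_tall, 185), (f_vtall, 210))
    h = [height for frac, height in groups for _ in range(int(round(frac * N)))]
    if not h:
        continue
    v = sum(h) / len(h)                         # average height of this candidate population
    v_bin = min(GRID, key=lambda g: abs(g - v))
    mu_D[v_bin] = max(mu_D[v_bin], objective(h))

print({v: round(m, 2) for v, m in mu_D.items()})  # approximate membership function of D
```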
Towards Human-Consistent Data-Driven Decision Support Systems via Fuzzy Linguistic Data Summaries

Janusz Kacprzyk and Sławomir Zadrożny
Summary. We present the use of fuzzy logic for the derivation of linguistic summaries of data (databases), providing efficient and human-consistent means for the analysis of large amounts of data to be used for more realistic business decision support. We concentrate on the issue of how to measure the goodness of a linguistic summary, and on how to embed data summarization within the fuzzy querying environment for an effective and efficient computer implementation. Finally, we present an implementation for deriving linguistic summaries of a sales database at a small-to-medium size computer retailer. By analyzing the linguistic summaries obtained we indicate how they can help the business owner make decisions.
1 Introduction

Decision making in the present world is becoming more and more sophisticated, time consuming, and difficult for human beings who require some "scientific" support. There is a long tradition of scientific attempts to formalize, solve, and implement decision making. For a long time those attempts have concentrated on the development of mathematical models that would try to describe the situation under consideration (preferences, mathematical models, performance functions, solution concepts, etc.). This development has resulted in a huge number of different models, both descriptive and prescriptive, involving single and multiple criteria and decision makers (actors), dynamics, etc.

Modern approaches to real world decision making go further. Basically, they speak about good decisions (not optimal as in most traditional
approaches), but above all about a decision-making process. A decision-making process involves more factors and aspects than traditional decision-making models, notably:

– Use of own and external knowledge
– Involvement of various "actors," aspects, etc.
– Individual habitual domains
– Non-trivial rationality
– Different paradigms, when appropriate
A good example of such a decision-making process is Peter Checkland's (1975–1999) so-called deliberative decision making (which is an important element of his soft approach to systems analysis). The essence of deliberative (soft) decision making may be subsumed as that while trying to solve a (real world) decision-making problem we should

– Perceive the whole picture
– Observe it from all angles (actors, criteria, etc.)
– Find a good decision using knowledge and intuition

Further, it is emphasized in modern approaches that the decision-making process involves:

– Recognition
– Deliberation and analysis
– Gestation and enlightenment (the so-called "eureka!," "aha" effects)
– Rationalization
– Implementation
and is always

– Heavily based on data, information and knowledge, and human specific characteristics (intuition, attitude, natural language for communication and articulation, etc.)
– In need of number crunching, but also of more "delicate" and sophisticated "intelligent" analyses
– Heavily reliant on computer systems, and capable of a synergistic human–computer interaction, notably using (quasi)natural language

It is easy to see that modern real world decision-making processes should be supported by some computerized systems, called decision support systems (DSSs). It is obvious that in the development of such systems emphasis should be on

– Ill/semi/un-structured questions and problems
– Non-routine, one of a kind answers
– A flexible combination of analytical models and data
– Various kinds of data, e.g., numeric, textual, verbal, etc.
– Interactive interface (e.g., GUI, LUI)
– Iterative operation ("what if")
– Support of various decision-making styles
– Support of alternate decision-making passes, etc.
All the above-mentioned phases are based on data, information, and knowledge, meant here as:

– Data – raw facts
– Information – data in a context relevant to an individual, team, or organization
– Knowledge – an individual's utilization of information and data complemented by an unarticulated expertise, skills, competencies, intuitions, experience, and motivations

It is clear that knowledge is most relevant, and it can be:

– Explicit, expressed in words or numbers, and shared as data, equations, specifications, documents, and reports; it can be transmitted between individuals and formally recorded
– Tacit, highly personal, hard to formalize, and difficult to communicate or share with others; technical (skills or crafts) and cognitive (perceptions, values, beliefs, and mental models)

Both types are relevant for decision-making processes and hence for the DSSs. DSSs practically appeared in the mid-1960s with the development of the IBM 360 and a wider use of distributed, time-sharing computing, and have been since that time a topic of intensive research and development. Basically, one can distinguish the following basic types of DSSs:

– Data driven
– Communication driven and group DSSs
– Document driven
– Model driven
– Knowledge driven
– Web based and interorganizational
Roughly speaking:

– Data Driven DSSs emphasize access to and manipulation of internal company data and sometimes external data, and may be based – from the low to the high level – first on simple file systems with query and retrieval tools, then data warehouses, and finally with On-line Analytical Processing (OLAP) or data mining tools.
– Communications Driven DSSs use network and communications technologies to facilitate collaboration and communication.
– Group DSSs (GDSSs) are interactive, computer-based systems that facilitate solution of unstructured problems by a set of decision makers working together as a group.
– Document Driven DSSs integrate a variety of storage and processing technologies for a complete document retrieval and analysis; documents may contain numbers, text, and multimedia.
– Model Driven DSSs emphasize access to and manipulation of a model, e.g., statistical, financial, optimization, and/or simulation; they use data and parameters, but are not usually data intensive.
– Knowledge Driven DSSs are interactive systems with specialized problem-solving expertise consisting of knowledge about a particular domain, understanding of problems within that domain, and "skill" at solving some of these problems.
– Web-based DSSs are computerized systems that deliver decision support related information and/or tools to a manager/analyst using a "thin-client" Web browser (Explorer), TCP/IP protocol, etc.

One should bear in mind that this classification should not be considered as a chronology of development of DSSs but as a classification with respect to what a particular system is meant for. In this paper we concentrate on the data driven DSSs, and in particular show how the use of Zadeh's computing with words and perception paradigm (cf. [1, 2]), through fuzzy linguistic database summaries and, indirectly, fuzzy querying, can open new vistas in data driven DSSs (and also, to some extent, in knowledge driven and Web-based DSSs).

Basically, the role of a data driven DSS is to help decision makers make rational use of the (vast) amounts of data that exist in their company or institution as well as in the environment within which they operate. Clearly, from those data relevant, nontrivial dependencies should be found. Unfortunately, they are usually hidden, and their discovery is not a trivial task and requires some intelligence.
One of the interesting and promising approaches to discovering such dependencies in an effective, efficient, and human consistent way is to derive linguistic summaries of a set of data (database). Here we discuss linguistic summarization of data sets in the sense of Yager [3–9] (for some extensions and other related issues, see, e.g., [10–15, 42]). In this approach linguistic summaries are derived as linguistically quantified propositions, exemplified – when the data in question concern employees – by "most of the employees are young and well paid," with which a degree of validity is associated. Basically, in Yager's source works [3–9] that degree of validity was meant to be the degree of truth of a linguistically quantified proposition that constitutes a summary. This was shown to be not enough, and other validity (quality) indicators were proposed, also in the above works of Yager. As a relevant further attempt, we can mention George and Srikanth's [16] solution in which a compromise between the specificity and generality of a summary is sought, and then an extension in which a weighted sum of five quality indicators is employed, as given in [10, 11].

In this paper we also follow Kacprzyk and Zadrożny's [17–21], Kacprzyk's [22], and Zadrożny and Kacprzyk's [23] idea of an interactive approach to linguistic summaries. Basically, since a fully automatic generation of linguistic summaries is not feasible at present, an interaction with the user is assumed for the determination of a class of summaries of interest. This is done via Kacprzyk and Zadrożny's [24–27] fuzzy querying add-on to Microsoft Access. We show that the approach proposed is implementable, and we present an implementation for a sales database of a computer retailer. We show that the linguistic summaries obtained may be very useful for supporting decision making by the management.
2 Idea of a Linguistic Summary Using Fuzzy Logic with Linguistic Quantifiers

First, we will briefly present Yager's [3] basic approach to the linguistic summarization of sets of data. We have:

– V is a quality (attribute) of interest, e.g., salary in a database of workers
– Y = {y_1, …, y_n} is a set of objects (records) that manifest quality V, e.g., the set of workers; hence V(y_i) are the values of quality V for object y_i
– D = {V(y_1), …, V(y_n)} is a set of data ("database")
A summary of a data set consists of:

– A summarizer S (e.g., young)
– A quantity in agreement Q (e.g., most)
– Truth (validity) T – e.g., 0.7, as in T(most of employees are young) = 0.7

Given a set of data D, we can hypothesize any appropriate summarizer S and any quantity in agreement Q, and the assumed measure of truth (validity) will indicate the truth (validity) of the statement that Q data items satisfy the statement (summarizer) S.

Since the only fully natural and human consistent means of communication for humans is natural language, we assume that the summarizer S is a linguistic expression semantically represented by a fuzzy set. For instance, in our example a summarizer like "young" would be represented as a fuzzy set in the universe of discourse, say, {1, 2, ..., 90}. Such a simple one-attribute-related summarizer exemplified by "young" serves well the purpose of introducing the concept of a linguistic summary, hence it was assumed by Yager [3]. However, it is of lesser practical relevance. It can be extended to some confluence of attribute values as, e.g., "young and well paid," and then to more complicated combinations. Clearly, when we try to linguistically summarize data, the most interesting are non-trivial, human-consistent summarizers (concepts) as, e.g.,

– Productive workers
– Stimulating work environment
– Difficult orders, etc.

involving complicated combinations of attributes, e.g., a hierarchy (not all attributes are of the same importance), the attribute values are ANDed and/or ORed, k out of n, most, etc. of them should be accounted for, etc. The generation and processing of such non-trivial summarizers needs some specific tools and techniques that will be discussed to some extent later.

The quantity in agreement, Q, is a proposed indication of the extent to which the data satisfy the summary. Once again, a precise indication is not human consistent, and a linguistic term represented by a fuzzy set is employed. Basically, two types of such a linguistic quantity in agreement can be used:

– Absolute as, e.g., "about 5," "more or less 100," "several," and
– Relative as, e.g., "a few," "more or less a half," "most," "almost all," etc.
Notice that the above linguistic expressions are the so-called fuzzy linguistic quantifiers (cf. [28, 29]) that can be handled by fuzzy logic. As for the fuzzy summarizer, also in the case of a fuzzy quantity in agreement, its form is subjective, and can be either predefined or elicited from the user when needed.

The calculation of the truth (validity) of the basic type of a linguistic summary considered in this section is equivalent to the calculation of the truth value (from the unit interval) of a linguistically quantified statement (e.g., "most of the employees are young"). This may be done by the two most relevant techniques, using either Zadeh's [28] calculus of linguistically quantified statements (cf. [30]) or Yager's [31] OWA operators (cf. [32]); for a survey, see [33].

A linguistically quantified proposition, exemplified by "most experts are convinced," is written as "Q y's are F," where Q is a linguistic quantifier (e.g., most), Y = {y} is a set of objects (e.g., experts), and F is a property (e.g., convinced). Importance B may be added yielding "QB y's are F," e.g., "most (Q) of the important (B) experts (y's) are convinced (F)." The problem is to find truth(Qy's are F) or truth(QBy's are F), respectively, provided we know truth(y is F), ∀ y ∈ Y, which is done here using Zadeh's fuzzy-logic-based calculus of linguistically quantified propositions [28]. First, property F and importance B are fuzzy sets in Y, and a (proportional, nondecreasing) linguistic quantifier Q is assumed to be a fuzzy set in [0,1] as, e.g., for Q = "most":

\mu_Q(x) = \begin{cases} 1 & \text{for } x \ge 0.8 \\ 2x - 0.6 & \text{for } 0.3 < x < 0.8 \\ 0 & \text{for } x \le 0.3 \end{cases}    (1)
Then, due to [28], we have:

\mathrm{truth}(Q\,y\text{'s are } F) = \mu_Q\left[\frac{1}{n}\sum_{i=1}^{n}\mu_F(y_i)\right]    (2)

\mathrm{truth}(QB\,y\text{'s are } F) = \mu_Q\left[\sum_{i=1}^{n}\bigl(\mu_B(y_i)\wedge\mu_F(y_i)\bigr)\Big/\sum_{i=1}^{n}\mu_B(y_i)\right]    (3)
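A minimal sketch of how (1)–(3) might be evaluated in code is given below; the "most" quantifier follows (1), the conjunction in (3) is taken as min, and the membership degrees standing for "young" and for the importance B are invented illustrative data.

```python
def mu_most(x):
    """Membership function of the relative quantifier 'most', as in (1)."""
    if x >= 0.8:
        return 1.0
    if x <= 0.3:
        return 0.0
    return 2 * x - 0.6

def truth_q(quantifier, degrees):
    """truth(Q y's are F) per (2): degrees[i] = mu_F(y_i)."""
    return quantifier(sum(degrees) / len(degrees))

def truth_qb(quantifier, degrees_f, degrees_b):
    """truth(QB y's are F) per (3), with min as the conjunction."""
    num = sum(min(f, b) for f, b in zip(degrees_f, degrees_b))
    den = sum(degrees_b)
    return quantifier(num / den) if den else 0.0

# Degrees to which each employee is "young" (invented data).
young = [1.0, 0.8, 0.9, 0.2, 0.7, 1.0, 0.6, 0.9]
print(truth_q(mu_most, young))              # truth of "most employees are young"

# Importance B, e.g., degree to which each employee is "key staff" (invented).
important = [1.0, 0.3, 1.0, 0.9, 0.2, 1.0, 0.5, 0.8]
print(truth_qb(mu_most, young, important))  # "most important employees are young"
```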
An OWA operator [31, 32] of dimension p is a mapping F: [0,1]^p → [0,1] if associated with F is a weighting vector W = [w_1, …, w_p]^T, w_i ∈ [0,1], w_1 + ⋯ + w_p = 1, and

F(x_1, \dots, x_p) = w_1 b_1 + \dots + w_p b_p = W^T B    (4)
where b_i is the i-th largest element among x_1, …, x_p, and B = [b_1, …, b_p]. The OWA weights may be found from the membership function of Q (cf. [31]):

w_i = \begin{cases}\mu_Q(i) - \mu_Q(i-1) & \text{for } i = 1,\dots,p \\ \mu_Q(0) & \text{for } i = 0\end{cases}    (5)
The OWA operators can model a wide array of aggregation operators (including linguistic quantifiers), from w_1 = ⋯ = w_{p−1} = 0 and w_p = 1, which corresponds to "all," to w_1 = 1 and w_2 = ⋯ = w_p = 0, which corresponds to "at least one," through all intermediate situations.

An important issue is related to the OWA operators for importance-qualified data. Suppose that we have A = [a_1, …, a_p] and a vector of importances V = [v_1, …, v_p] such that v_i ∈ [0,1] is the importance of a_i, i = 1, …, p, and v_1 + ⋯ + v_p = 1. The case of an OWA operator with importance qualification, denoted OWA_Q, is not trivial. In Yager's [31] approach to be used here, which seems to be highly plausible (though it is sometimes criticized), some redefinition of the OWA weights w_i is performed, and (4) becomes

F_I(x_1, \dots, x_p) = w_1 b_1 + \dots + w_p b_p = W^T B    (6)

where

w_j = \mu_Q\left(\frac{\sum_{k=1}^{j} u_k}{\sum_{k=1}^{p} u_k}\right) - \mu_Q\left(\frac{\sum_{k=1}^{j-1} u_k}{\sum_{k=1}^{p} u_k}\right)    (7)

and u_k is the importance of b_k, that is, the k-th largest element of A (i.e., the corresponding v_k).

The basic validity criterion introduced in Yager's source work [3] and employed by many authors later on, i.e., the truth of a linguistically quantified statement given by (2) and (3), is certainly the most important in the general framework. However, it does not grasp all aspects of a linguistic summary. Some attempts at devising other quality (validity) criteria have
been proposed by Kacprzyk and Yager [10] and Kacprzyk, Yager and Zadrożny (2001). Basically, they proposed the following five criteria:

– Truth value [which basically corresponds to the degree of truth of a linguistically quantified proposition representing the summary given by, say, (2) or (3)]
– Degree of imprecision (fuzziness)
– Degree of covering
– Degree of appropriateness, and
– Length of a summary

and the (total) degree of validity, T, of a particular linguistic summary is defined as the weighted average of the above five degrees of validity, i.e.,

T = T(T_1, T_2, T_3, T_4, T_5; w_1, w_2, w_3, w_4, w_5) = \sum_{i=1}^{5} w_i T_i    (8)

and the problem is to find an optimal summary, S* ∈ {S}, such that

S^* = \arg\max_S \sum_{i=1}^{5} w_i T_i    (9)

where w_1, ..., w_5 are weights assigned to the particular degrees of validity, with values from the unit interval (the higher, the more important), such that \sum_{i=1}^{5} w_i = 1. The definition of the weights w_1, ..., w_5 is a problem in itself, and will not be dealt with here in more detail. The weights can be predefined or elicited from the user. In the case study presented later, the weights are determined by using Saaty's well-known [34] AHP (analytical hierarchy process) approach, which works well in the problem considered.
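The following sketch illustrates the OWA machinery of (4)–(5) and the total validity (8); the quantifier is "most" from (1), the weights in (5) are obtained by evaluating the quantifier at i/p (the usual normalization when Q is a relative quantifier on [0, 1], which is assumed here), and both the matching degrees and the five quality-indicator values are invented for illustration.

```python
def mu_most(x):
    """The relative quantifier 'most' from (1)."""
    return 1.0 if x >= 0.8 else 0.0 if x <= 0.3 else 2 * x - 0.6

def owa_weights(quantifier, p):
    """OWA weights per (5); Q is evaluated at i/p since it is a relative quantifier on [0, 1]."""
    return [quantifier(i / p) - quantifier((i - 1) / p) for i in range(1, p + 1)]

def owa(values, weights):
    """OWA aggregation per (4): weights applied to values sorted in descending order."""
    return sum(w * b for w, b in zip(weights, sorted(values, reverse=True)))

# Degrees to which a record matches 5 query conditions (invented data).
matches = [0.9, 0.7, 1.0, 0.4, 0.8]
w = owa_weights(mu_most, len(matches))
print(w, owa(matches, w))   # aggregated matching degree for "most conditions are met"

# Total validity (8): weighted average of the five quality indicators T1..T5 of a
# candidate summary (both indicator values and weights are purely illustrative).
T = [0.8, 0.6, 0.7, 0.9, 0.5]
weights = [0.4, 0.1, 0.2, 0.2, 0.1]   # must sum to 1
print(sum(wi * ti for wi, ti in zip(weights, T)))
```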
3 Derivation of Linguistic Summaries via a Fuzzy Logic-Based Database Querying Interface

The roots of the approach are our previous papers on the use of fuzzy logic in database querying (cf. [24–27, 35, 40]), in which we argued that the formulation of a precise query is often difficult for the end user [see also 41]. For example, a customer of a real-estate agency looking for a house would rather express his or her criteria using imprecise descriptions such as cheap, large garden, etc. Also, to specify which combination of the criteria fulfillment would be satisfactory, he or she would often use, say, most of them or almost all. All such vague terms may be relatively easily interpreted using fuzzy
logic. This has motivated the development of a whole family of fuzzy querying interfaces, notably our FQUERY for Access package (cf. [24–27]). The same arguments apply, to an even higher degree, when one tries to summarize the content of a database in a short (linguistic) statement. For example, a summary like "most of our customers are reliable" may very often be more useful than, say, "65% of our customers have paid at least 70% of their duties in less than 10 days."

In Sect. 2 we studied the summarization independently, and here we will restate it in the fuzzy querying context. We start with the reinterpretation of (2) and (3). Thus, (2) formally expresses the statement:

"Most records match query S"    (10)
where S replaces F in (2) since we refer here directly to the concept of a summarizer. We assume a standard meaning of the query as a set of conditions on the values of fields from the database tables, connected with AND and OR. We allow for fuzzy terms in a query, which implies a degree of matching from [0,1] rather than a yes/no matching. So, a query S defines a fuzzy subset of the set of records, with their membership determined by their matching degree with the query. Similarly, (3) may be interpreted as expressing a statement of the following type:

"Most records meeting conditions F match query S"    (11)
Thus, (11) says something only about a subset of the records taken into account by (10). That is, in database terminology, F corresponds to a filter and (11) claims that most records passing through F match query S. Moreover, since the filter may be fuzzy, a record may pass through it to a degree from [0,1].

We seek, for a given database, propositions of the type (3), interpreted as (11), that are highly true; they contain three elements: a fuzzy filter F (optional), a query S, and a linguistic quantifier Q. There are two limit cases, where we:

– Do not assume anything about the form of any of these elements
– Assume fixed forms of a fuzzy filter and query, and look only for a linguistic quantifier Q

In the first case data summarization will be extremely time-consuming but may produce interesting results. In the second case the user has to
guess a good candidate formula for summarization, but the evaluation is fairly simple, being equivalent to the answering of a (fuzzy) query. Thus, the second case refers to the summarization known as ad hoc queries, extended with an automatic determination of a linguistic quantifier. In between these two extreme cases there are different types of summaries with various assumptions on what is given and what is sought. We use the following notation to describe what is given or what is sought with respect to the fuzzy filter F and query S (A will stand below for either F or S):

– A: all is given (or sought), i.e., attributes, values, and the structure
– A_fc: attributes and structure are given but values are left out
– A_v: denotes the sought, left-out values referred to above, and
– A_f: only a set of attributes is given and the other elements are sought
Using such a notation we may propose a classification of linguistic summaries as shown in Table 1. The summaries of Types 1 and 3 have been implemented as an extension to our FQUERY for Access. FQUERY for Access is an add-in that makes it possible to use fuzzy terms in queries (cf. [24–27], Zadrożny and Kacprzyk (1995)). Briefly speaking, the following types of fuzzy terms are available:

– Fuzzy values, exemplified by low in "profitability is low,"
– Fuzzy relations, exemplified by much greater than in "income is much greater than spending," and
– Linguistic quantifiers, exemplified by most in "most conditions have to be met."

The elements of the first two types are elementary building blocks of fuzzy queries in FQUERY for Access. They are meaningful in the context of numerical fields only. There are also other fuzzy constructs allowed which may be used with scalar fields.

Table 1. Classification of linguistic summaries

type | given           | sought  | remarks
1    | S               | Q       | simple summaries through ad hoc queries
2    | S B             | Q       | conditional summaries through ad hoc queries
3    | Q S_structure   | S_value | simple value oriented summaries
4    | Q S_structure B | S_value | conditional value oriented summaries
5    | nothing         | S B Q   | general fuzzy rules

where S_structure denotes that the attributes and their connection in a summary are known, while S_value denotes a summarizer sought.
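As an illustration of a Type 1 summary from Table 1 (summarizer S given, quantifier Q sought), the following sketch scores a fixed summarizer against a small dictionary of quantifiers using (2) and keeps the best one; the toy data, the calibration of "young," and the quantifier dictionary are all assumptions, not part of FQUERY for Access.

```python
# Toy data standing in for database records (ages of employees, invented).
ages = [24, 29, 31, 27, 45, 52, 26, 33, 38, 23, 58, 30]

def mu_young(age):
    """Assumed calibration of the summarizer 'age is young'."""
    return 1.0 if age <= 25 else 0.0 if age >= 40 else (40 - age) / 15

# A small dictionary of relative quantifiers (assumed calibrations).
QUANTIFIERS = {
    "almost all": lambda x: 0.0 if x <= 0.8 else min(1.0, (x - 0.8) / 0.15),
    "most":       lambda x: 1.0 if x >= 0.8 else 0.0 if x <= 0.3 else 2 * x - 0.6,
    "about half": lambda x: max(0.0, 1.0 - abs(x - 0.5) / 0.25),
    "a few":      lambda x: 1.0 if x <= 0.1 else 0.0 if x >= 0.4 else (0.4 - x) / 0.3,
}

def best_quantifier(match_degrees):
    """Type 1 summary: S is fixed (here 'age is young'); Q is sought as the
    quantifier with the highest truth per (2)."""
    r = sum(match_degrees) / len(match_degrees)
    scored = {name: q(r) for name, q in QUANTIFIERS.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

q, t = best_quantifier([mu_young(a) for a in ages])
print(f"'{q} of the employees are young' with truth {t:.2f}")
```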
If a field is to be used in a query in connection with a fuzzy value, it has to be defined as an attribute. The definition of an attribute consists of two numbers: the lower (LL) and upper (UL) limits of the attribute's values. They set the interval that the field's values are assumed to belong to. This interval depends on the meaning of the given field. For example, for the age (of a person), a reasonable interval would be, e.g., [18, 65], in a particular context, i.e., for a specific group. Such a concept of an attribute makes it possible to universally define fuzzy values. Fuzzy values are defined, for technical reasons, as fuzzy sets on [-10, +10]. Then, the matching degree md(⋅,⋅) of a simple condition referring to attribute AT and fuzzy value FV against a record R is calculated by:

md(AT = FV, R) = \mu_{FV}(\tau(R(AT)))    (12)
where R(AT) is the value of attribute AT in record R, µ_FV is the membership function of the fuzzy value FV, and τ: [LL_AT, UL_AT] → [-10, 10] is the mapping from the interval defining AT onto [-10, 10], so that we may use the same fuzzy values for different fields. A meaningful interpretation is secured by τ, which makes it possible to treat all field domains as ranging over the unified interval [-10, 10]. For simplicity, it is assumed that the membership functions of fuzzy values are trapezoidal.

Linguistic quantifiers provide for a flexible aggregation of simple conditions. In FQUERY for Access the fuzzy linguistic quantifiers are defined in Zadeh's [28] sense (see Sect. 1) as fuzzy sets on the [0, 10] interval instead of the original [0, 1]. They may be interpreted either using Zadeh's original approach or via the OWA operators (cf. [31, 32]); Zadeh's interpretation will be used here. The membership functions of fuzzy linguistic quantifiers are assumed piecewise linear, hence two numbers from [0, 10] are needed. Again, a mapping from [0, N], where N is the number of conditions aggregated, to [0, 10] is employed to calculate the matching degree of a query. More precisely, the matching degree md(⋅,⋅) for the query "Q of N conditions are satisfied" for record R is equal to

md(Q, \mathrm{condition}_i, R) = \mu_Q\left[\tau\left(\sum_i md(\mathrm{condition}_i, R)\right)\right]    (13)
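A minimal sketch of how matching degrees in the spirit of (12)–(13) might be computed is shown below; the attribute interval, the trapezoidal shape of "young" on [-10, 10], the [0, 10] calibration of "most," and the record itself are invented assumptions and do not reproduce FQUERY for Access internals.

```python
def tau(x, lower, upper, out_lo=-10.0, out_hi=10.0):
    """Linear mapping from an attribute's interval [LL, UL] onto [-10, 10] (or [0, 10])."""
    x = min(max(x, lower), upper)
    return out_lo + (x - lower) * (out_hi - out_lo) / (upper - lower)

def trapezoid(a, b, c, d):
    """Trapezoidal membership function: 0 outside (a, d), 1 on [b, c]."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

# Attribute 'age' defined on [18, 65]; fuzzy value 'young' on [-10, 10] (assumed shapes).
AGE_LL, AGE_UL = 18, 65
mu_young = trapezoid(-11.0, -10.0, -6.0, -2.0)

def md_condition(age):
    """Matching degree (12): md(age = young, R) = mu_young(tau(R(age)))."""
    return mu_young(tau(age, AGE_LL, AGE_UL))

def mu_most_0_10(x):
    """'most' as a piecewise linear fuzzy set on [0, 10] (assumed: 0 up to 3, 1 from 8 on)."""
    return 0.0 if x <= 3.0 else 1.0 if x >= 8.0 else (x - 3.0) / 5.0

def md_query(condition_degrees):
    """Matching degree (13) for 'most of the N conditions are satisfied'."""
    n = len(condition_degrees)
    return mu_most_0_10(tau(sum(condition_degrees), 0, n, 0.0, 10.0))

record = {"age": 27}
degrees = [md_condition(record["age"]), 0.8, 0.4]   # degrees of three query conditions
print(round(md_condition(record["age"]), 2), round(md_query(degrees), 2))
```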
We can also assign different importance degrees for particular conditions. Then, the aggregation formula is equivalent to (3). The importance is identified with a fuzzy set on [0,1], and then treated as property B in (3). To be able to use a fuzzy term in a query, it has to be defined using the toolbar provided by FQUERY for Access and stored internally. This
feature, i.e., the maintenance of dictionaries of fuzzy terms defined by users, strongly supports our approach to data summarization to be discussed next. In fact, the package comes with a set of predefined fuzzy terms, but the user may enrich the dictionary too. When the user initiates the execution of a query, it is automatically transformed by appropriate FQUERY for Access routines and then run as a native query of Access. The transformation consists primarily in the replacement of parameters referring to fuzzy terms by calls to functions implemented in the package that secure a proper interpretation of these fuzzy terms. Then, the query is run by Access as usual. Details can be found in [24–27] and Zadrożny and Kacprzyk (1995).
5 Implementation for a Sales Database at a Computer Retailer

The proposed data summarization procedure was implemented on a sales database of a computer retailer in Southern Poland (cf. [22, 36–38]). The basic structure of the database is as shown in Table 2. In the beginning, after some initialization, we provide some parameters concerning mainly: the definition of attributes and the subject, the definition of how the results should be presented, and the definition of the parameters of the method (i.e., a genetic algorithm or, seldom, full search). Then, we initialize the search and obtain the results shown in Tables 3–5. Their consecutive columns contain: a linguistic summary, the values of the four indicators, i.e., the degrees of appropriateness, covering, truth, and fuzziness (the length is not accounted for in our simple case), and finally the weighted average. The weights have been determined by employing Saaty's AHP procedure. Some simple learning and fine-tuning has also been employed, taking into account experience gained in previous sessions with the users. The summaries shown are the most valid ones, and they give the user much insight into the relations between the attributes chosen; moreover, they are simple and human consistent.

Table 2. Structure of the database

attribute name         | attribute type | description
date                   | date           | date of sale
time                   | time           | time of sale (transaction)
name                   | text           | name of the product
amount (number)        | numeric        | number of products sold in the transaction
price                  | numeric        | unit price
commission             | numeric        | commission (in %) on sale
value                  | numeric        | value = amount (number) × price of the product
discount               | numeric        | discount (in %) for transaction
group                  | text           | product group to which the product belongs
transaction value      | numeric        | value of the whole transaction
total sale to customer | numeric        | total value of sales to the customer in fiscal year
purchasing frequency   | numeric        | number of purchases by customer in fiscal year
town                   | text           | town where the customer lives
Table 3. Linguistic summaries expressing relations between the group of products and commission

summary
about 1/2 of sales of network elements is with a high commission
about 1/2 of sales of computers is with a medium commission
much sales of accessories is with a high commission
much sales of components is with a low commission
about 1/2 of sales of software is with a low commission
about 1/2 of sales of computers is with a low commission
a few sales of components is without commission
a few sales of computers is with a high commission
very few sales of printers is with a high commission
Table 4. Linguistic summaries expressing relations between the groups of products and times of sale

summary
about 1/3 of sales of computers is by the end of year
about 1/2 of sales in autumn is of accessories
about 1/3 of sales of network elements is in the beginning of year
very few sales of network elements is by the end of year
very few sales of software is in the beginning of year
about 1/2 of sales in the beginning of year is of accessories
about 1/3 of sales in the summer is of accessories
about 1/3 of sales of peripherals is in the spring period
about 1/3 of sales of software is by the end of year
about 1/3 of sales of network elements is in the spring period
about 1/3 of sales in the summer period is of components
very few sales of network elements is in the autumn period
a few sales of software is in the summer period
Table 5. Linguistic summaries expressing relations between the attributes: size of customer, regularity of customer (purchasing frequency), date of sale, time of sale, commission, group of product, and day of sale

summary
much sales on saturday is about noon with a low commission
much sales on saturday is about noon for bigger customers
much sales on saturday is about noon
much sales on saturday is about noon for regular customers
a few sales for regular customers is with a low commission
a few sales for small customers is with a low commission
a few sales for one-time customers is with a low commission
much sales for small customers is for nonregular customers
Table 6. Linguistic summaries expressing relations between group of products, time of sale, temperature, precipitation, and type of customers

summary
very few sales of software is in hot days to individual customers
about 1/2 of sales of accessories is in rainy days on weekends by the end of the year
about 1/3 of sales of computers is in rainy days to individual customers
Notice that these summaries concern data from the company's own database. However, companies operate in an environment (economic, climatic, social, etc.), and aspects of this environment may be relevant because they may greatly influence the operation, economic results, etc. of a particular company. A notable example here may be the case of climatic data that can be fetched from some sources, for instance from paid or free climatic data services. The inclusion of such data may be implemented, but its description is beyond the scope of this paper. We can just mention that one can obtain, for instance, the linguistic summaries shown in Table 6 in the case when we are interested in relations between the group of products, time of sale, temperature, precipitation, and type of customers. It is easy to see that the contents of all the linguistic summaries obtained do give much insight to the user (analyst) into what is happening in the company and its operation, and can be very useful.
6 Concluding Remarks

In this paper we presented how the concept of a fuzzy linguistic database summary (in the sense of Yager) can be a very powerful tool for gaining insight into what relations exist within data in a particular company. We have indicated that such relations that are derived just from data can be valuable clues for the decision maker to make decisions pertaining to the operation of the company. Clearly, the philosophy and paradigm presented follow those of a data driven DSS, and one can see that fuzzy linguistic database summaries,
or – more generally the computing with words and perception paradigm – can be a powerful tool that can help develop a new generation of human consistent, natural language based and easy to use data driven DSSs.
References

1. Zadeh L. and Kacprzyk J. (Eds.) (1999) Computing with Words in Information/Intelligent Systems 1. Foundations. Physica-Verlag, Heidelberg and New York
2. Zadeh L. and Kacprzyk J. (Eds.) (1999) Computing with Words in Information/Intelligent Systems 2. Applications. Physica-Verlag, Heidelberg and New York
3. Yager R.R. (1982) A new approach to the summarization of data. Information Sciences 28, 69–86
4. Yager R.R. (1989) On linguistic summaries of data, Proceedings of IJCAI Workshop on Knowledge Discovery in Databases, Detroit, pp. 378–389
5. Yager R.R. (1991) On linguistic summaries of data, in: G. Piatetsky-Shapiro and B. Frawley (Eds.), Knowledge Discovery in Databases. MIT, Cambridge, MA, pp. 347–363
6. Yager R.R. (1994) Linguistic summaries as a tool for database discovery, Proceedings of Workshop on Flexible Query-Answering Systems, Roskilde University, Denmark, pp. 17–22
7. Yager R.R. (1995) Linguistic summaries as a tool for database discovery, Proceedings of Workshop on Fuzzy Database Systems and Information Retrieval at FUZZ-IEEE/IFES, Yokohama, pp. 79–82
8. Yager R.R. (1995) Fuzzy summaries in database mining, Proceedings of the 11th Conference on Artificial Intelligence for Applications (Los Angeles, USA), pp. 265–269
9. Yager R.R. (1996) Database discovery using fuzzy sets. International Journal of Intelligent Systems 11, 691–712
10. Kacprzyk J. and Yager R.R. (2001) Linguistic summaries of data using fuzzy logic. International Journal of General Systems 30, 33–154
11. Kacprzyk J., Yager R.R. and Zadrożny S. (2001) Fuzzy linguistic summaries of databases for an efficient business data analysis and decision support, in: W. Abramowicz and J. Zurada (Eds.), Knowledge Discovery for Business Information Systems, Kluwer, Boston, pp. 129–152
12. Rasmussen D. and Yager R.R. (1996) Using summarySQL as a tool for finding fuzzy and gradual functional dependencies, Proceedings of IPMU'96 (Granada, Spain), pp. 275–280
13. Rasmussen D. and Yager R.R. (1997) A fuzzy SQL summary language for data discovery, in: D. Dubois, H. Prade and R.R. Yager (Eds.), Fuzzy Information Engineering: A Guided Tour of Applications, Wiley, New York, pp. 253–264
14. Rasmussen D. and Yager R.R. (1997) SummarySQL – A fuzzy tool for data mining. Intelligent Data Analysis – An International Journal 1 (Electronic Publication), URL-http//:www-east.elsevier.com/ida/browse/96-6/ida96-6.htm
15. Rasmussen D. and Yager R.R. (1999) Finding fuzzy and gradual functional dependencies with summarySQL. Fuzzy Sets and Systems 106, 131–142
16. Yager R.R. and Rubinson T.C. (1981) Linguistic summaries of data bases, Proceedings of IEEE Conference on Decision and Control (San Diego, USA), pp. 1094–1097
17. George R. and Srikanth R. (1996) Data summarization using genetic algorithms and fuzzy logic, in: F. Herrera and J.L. Verdegay (Eds.), Genetic Algorithms and Soft Computing. Physica-Verlag, Heidelberg, pp. 599–611
18. Kacprzyk J. and Zadrożny S. (1998) Data mining via linguistic summaries of data: An interactive approach, in: T. Yamakawa and G. Matsumoto (Eds.), Methodologies for the Conception, Design and Application of Soft Computing (Proceedings of IIZUKA’98, Iizuka, Japan), pp. 668–671
19. Kacprzyk J. and Zadrożny S. (1999) On interactive linguistic summarization of databases via a fuzzy-logic-based querying add-on to Microsoft Access, in: W. Bernd Reusch (Ed.), Computational Intelligence: Theory and Applications. Springer-Verlag, Heidelberg, pp. 462–472
20. Kacprzyk J. and Zadrożny S. (2000) On combining intelligent querying and data mining using fuzzy logic concepts, in: G. Bordogna and G. Pasi (Eds.), Recent Research Issues on the Management of Fuzziness in Databases. Physica-Verlag, Heidelberg and New York
21. Kacprzyk J. and Zadrożny S. (2000) Data mining via fuzzy querying over the Internet, in: O. Pons, M.A. Vila and J. Kacprzyk (Eds.), Knowledge Management in Fuzzy Databases, Physica-Verlag, Heidelberg and New York, pp. 211–233
22. Kacprzyk J. and Zadrożny S. (2000) Using fuzzy querying over the Internet to browse through information resources, in: P.P. Wang (Ed.), Computing with Words, Wiley, New York
23. Kacprzyk J. (1999) An interactive fuzzy logic approach to linguistic data summaries, Proceedings of NAFIPS’99 – 18th International Conference of the North American Fuzzy Information Processing Society – NAFIPS, IEEE, pp. 595–599
24. Zadrożny S. and Kacprzyk J. (1999) On database summarization using a fuzzy querying interface, Proceedings of IFSA’99 World Congress (Taipei, Taiwan, R.O.C.), Vol. 1, pp. 39–43
25. Kacprzyk J. and Zadrożny S. (1994) Fuzzy querying for Microsoft Access, Proceedings of FUZZ-IEEE’94 (Orlando, USA), Vol. 1, pp. 167–171
26. Kacprzyk J. and Zadrożny S. (1995) Fuzzy queries in Microsoft Access v. 2, Proceedings of FUZZ-IEEE/IFES’95 (Yokohama, Japan), Workshop on Fuzzy Database Systems and Information Retrieval, pp. 61–66
27. Kacprzyk J. and Zadrożny S. (1995) FQUERY for Access: fuzzy querying for a Windows-based DBMS, in: P. Bosc and J. Kacprzyk (Eds.), Fuzziness in Database Management Systems. Physica-Verlag, Heidelberg, pp. 415–433
28. Kacprzyk J. and Zadrożny S. (1996) A fuzzy querying interface for a WWW-server-based relational DBMS. Proceedings of IPMU’96 (Granada, Spain), Vol. 1, pp. 19–24
29. Zadeh L.A. (1983) A computational approach to fuzzy quantifiers in natural languages. Computers and Mathematics 9, 149–184
30. Zadeh L.A. (1985) Syllogistic reasoning in fuzzy logic and its application to usuality and reasoning with dispositions. IEEE Transactions on Systems, Man and Cybernetics, SMC-15, 754–763
31. Zadeh L.A. and Kacprzyk J. (Eds.) (1992) Fuzzy Logic for the Management of Uncertainty. Wiley, New York
32. Yager R.R. (1988) On ordered weighted averaging operators in multicriteria decision making. IEEE Transactions on Systems, Man and Cybernetics, SMC-18, 183–190
33. Yager R.R. and Kacprzyk J. (1997) The Ordered Weighted Averaging Operators: Theory and Applications. Kluwer, Boston
34. Liu Y. and Kerre E.E. (1998) An overview of fuzzy quantifiers (I). Interpretations, Fuzzy Sets and Systems 95, 1–21
35. Saaty T.L. (1980) The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. McGraw-Hill, New York
36. Kacprzyk J., Zadrożny S. and Ziółkowski A. (1989) FQUERY III+: a ‘human consistent’ database querying system based on fuzzy logic with linguistic quantifiers. Information Systems 6, 443–453
37. Kacprzyk J. (1999) A new paradigm shift from computation on numbers to computation on words on an example of linguistic database summarization, in: N. Kasabov (Ed.), Emerging Knowledge Engineering and Connectionist-Based Information Systems – Proceedings of ICONIP/ANZIIS/ANNES’99, University of Otago, pp. 179–179
38. Kacprzyk J. and Strykowski P. (1999) Linguistic data summaries for intelligent decision support, in: R. Felix (Ed.), Fuzzy Decision Analysis and Recognition Technology for Management, Planning and Optimization – Proceedings of EFDAN’99, pp. 3–12
39. Kacprzyk J. and Strykowski P. (1999) Linguistic summaries of sales data at a computer retailer: A case study. Proceedings of IFSA’99 (Taipei, Taiwan R.O.C), Vol. 1, pp. 29–33
40. Kacprzyk J. and Ziółkowski A. (1986) Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man and Cybernetics, SMC-16, 474–479
41. Zemankova M. and Kacprzyk J. (1993) The roles of fuzzy logic and management of uncertainty in building intelligent information systems. Journal of Intelligent Information Systems 2, 311–317
42. Yager R.R. and Kacprzyk J. (1999) Linguistic Data Summaries: A Perspective. Proceedings of IFSA’99 Congress (Taipei), Vol. 1, 44–48
Moving Approximation Transform and Local Trend Associations in Time Series Data Bases Ildar Batyrshin, Raul Herrera-Avelar, Leonid Sheremetov, and Aleksandra Panova
Summary. The properties of the moving approximation (MAP) transform and its application to time series data mining are discussed. The MAP transform replaces time series values by the slope values of the lines approximating the time series data in a sliding window. A simple method of MAP transform calculation for time series with a fixed time step is proposed. Based on MAP, measures of local trend associations and local trend distances are introduced. These measures are invariant under independent linear transformations and normalizations of the time series values. The measure of local trend associations defines an association function and a measure of association between time series. Methods for applying the association measure to the construction of association networks of time series and to clustering are proposed and illustrated by examples of economic, financial, and synthetic time series.
1 Introduction

Contemporary economic and financial systems are characterized by the quick dynamics of their elements. Understanding the relationships existing between the system elements is very important for adequate decision making in rapidly changing markets. In many cases, the relationships between the dynamics of system elements cannot be formulated directly and can be evaluated only approximately. Human expertise plays the determinative role in ascertaining the relationships existing in financial and economic systems, but the development of formal methods to discover and evaluate such relationships would be very helpful for supporting human decisions. Time series data bases (TSDB) in economics and finance usually contain information about the dynamics of a system of financial or economic indicators.
Analysis of the relationships between the dynamics of time series can give useful information about the relationships existing between the elements of these systems and can be used for their modelling and macro-analysis and, finally, for well-grounded decisions. In this chapter we propose a new technique of time series data mining based on the moving approximation (MAP) transform introduced and studied in [3, 5]. The MAP transform replaces time series (TS) values by the slope values of the lines approximating the time series data in a sliding window. These slope values, called local trends, describe the local dynamics of the time series. The end points of a moving linear regression are used in time series analysis for forecasting [20], but the slope values obtained by MAP can be used for the calculation of associations between time series. Similarity in the local trends of two time series can give important information about the relationships existing between the dynamics of elements of economic or financial systems. We discuss here the properties of the MAP transform with respect to linear transformations of the TS. A simple method of MAP transform calculation for time series with a fixed time step is proposed. Based on MAP, measures of local trend associations and local trend distances are introduced. These measures are invariant under independent linear transformations and normalizations of the time series values, which gives them an advantage over the distance and similarity measures used in time series data mining, which usually depend on such transformations. The measure of local trend associations defines, for each pair of time series, an association function as a function of the sliding window size. Based on the values of this function, measures of short-term, medium-term, or long-term associations between time series can be introduced. Methods for applying the association measures to the construction of association networks of time series and to clustering are proposed and illustrated by examples of economic, financial, and synthetic time series. In the more traditional approach to time series data mining, similarity measures between time series are used to search for time series similar to a given TS, to cluster time series or TS patterns, to find typical shapes of TS, to separate a TSDB into TS with different shapes, etc. [7, 13–15, 18, 22]. Often the time series in a TSDB are considered as measurements of some time-varying parameter for different objects that are not mutually related, e.g. a set of ECGs, or for the same object in different time periods. In this chapter we consider multivariate time series that describe the dynamics of the elements of some system, and methods for analyzing the associations between the dynamics of the system elements by means of the association measures defined by MAP. Examples of such systems include sets of countries, companies,
securities, or currencies described by sets of economic or financial indicators, prices, rates, etc. measured during the same time period. The chapter is organized as follows. In Sect. 2 the properties of the MAP transform are studied and a simple method of MAP transform calculation is considered. Measures of local trend associations and distances between time series based on the MAP transform are discussed in Sect. 3, where the invariance of the association and distance measures under linear transformations and normalizations of time series is also considered. In Sect. 4 we consider the association function and demonstrate its properties on a synthetic example of time series obtained as the aggregation of three components with different short-term, medium-term, and long-term movements. In Sects. 5–7 we propose methods for the construction of association networks of time series based on the association function and association measures. These methods are illustrated by examples of economic and financial time series. In Sect. 8 we discuss the main results and possible applications of the proposed methods. The Appendix contains examples of the construction of association networks and of the clustering of time series used for testing time series data mining approaches.
2 Moving Approximation Transform

A time series (y,t) is a sequence (yi,ti), i ∈ I = (1,…,n), such that ti < ti+1 for all i = 1,…,n−1, where yi and ti are real-valued time series values and time points, correspondingly. A time series (y,t) will be denoted also as y. A window Wi of a length k > 1 is a sequence of indexes Wi = (i, i+1,…, i+k−1), i ∈ {1,…,n−k+1}. Denote yWi = (yi, yi+1,…, yi+k−1) the sequence of corresponding values of time series y. A sequence J = (W1, W2,…, Wn−k+1) of all windows of the length k, (1 < k ≤ n), is called a sliding (moving) window. A function fi = ai t + bi with parameters {ai, bi} minimizing the criterion

Q(f_i, y_{W_i}) = \sum_{j=i}^{i+k-1} (f_i(t_j) - y_j)^2 = \sum_{j=i}^{i+k-1} (a_i t_j + b_i - y_j)^2,

is called a moving (least squares) approximation of yWi or a linear regression, and the optimal values of ai, bi are calculated as follows [16, 17]:

a_i = \frac{\sum_{j=i}^{i+k-1} (t_j - \bar{t}_i)(y_j - \bar{y}_i)}{\sum_{j=i}^{i+k-1} (t_j - \bar{t}_i)^2}, \qquad b_i = \bar{y}_i - a_i \bar{t}_i,

where \bar{t}_i = \frac{1}{k}\sum_{j=i}^{i+k-1} t_j and \bar{y}_i = \frac{1}{k}\sum_{j=i}^{i+k-1} y_j. The slope values of the moving approximations may also be calculated as follows:

a_i = \frac{k \sum_{j=i}^{i+k-1} t_j y_j - \left(\sum_{j=i}^{i+k-1} t_j\right) \sum_{j=i}^{i+k-1} y_j}{k \sum_{j=i}^{i+k-1} t_j^2 - \left(\sum_{j=i}^{i+k-1} t_j\right)^2}.
Definition 1. Suppose a = (a1,…,an−k+1) is the sequence of slope values of the moving approximations of time series (y,t) in a sliding window of size k. The transformation MAPk(y,t) = a is called a moving approximation (MAP) transform of time series y. The slope values a = (a1,…,an−k+1) are called local trends.

Suppose p, q, r, s are real values, r ≠ 0, and y, z are time series given in the same time points t = (t1,…,tn). Denote py+q = (py1+q,…, pyn+q) and y+z = (y1+z1,…, yn+zn).

Theorem 2. The MAP transform satisfies for all real values p, q, r, s (r ≠ 0) the following properties:
1) MAPk(py+q, t) = p MAPk(y,t);
2) MAPk(y, rt+s) = (1/r) MAPk(y,t);
3) MAPk(y+z, t) = MAPk(y,t) + MAPk(z,t).

Note that from 1) it follows: MAPk(−y,t) = −MAPk(y,t).

Corollary 3. The MAP transform is invariant under equal simultaneous linear transformations of time and time series values, i.e. MAPk(ry+s, rt+s) = MAPk(y,t).

Corollary 4. If the time points t = (t1,…,tn) are increasing with a constant step h such that ti+1 − ti = h for all i = 1,…,n−1, then in the MAP transform t can be replaced by the indexes I = (1,…,n) as follows:
MAPk(y,t) = (1/h) MAPk(y,I).

The properties considered above show how operations on time series can be replaced by corresponding operations on their MAP transforms and vice versa.

Theorem 5. Suppose the time points t = (t1,…,tn) are increasing with a constant step h such that ti+1 − ti = h for all i = 1,…,n−1. Then the values of the MAP transform MAPk(y,t) can be calculated as follows:

a_i = \frac{6 \sum_{j=0}^{k-1} (2j - k + 1)\, y_{i+j}}{h\, k\, (k^2 - 1)}, \qquad i \in (1, 2, \ldots, n-k+1).
For the most common case of time series with a fixed step between time points, Theorem 5 gives a simple method to calculate the slope values of the moving approximation for any size k of the sliding window. Since the conditions of Corollary 4 and Theorem 5 are usually fulfilled for the time series considered in applications, the time points may be replaced by the indexes t = I = (1,…,n) and the step value h = 1 may be used. Further we denote a time series also as y = (y1,…,yn) and use the notation MAPk(y) for k ∈ {2,…,n−1}.
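As an illustration, the following minimal Python sketch (not the authors' code) computes the MAP transform of a fixed-step series directly from the closed-form slope expression of Theorem 5.

```python
# A minimal sketch of the MAP transform for a series sampled with a fixed
# time step h, using the closed-form slope of Theorem 5.
import numpy as np

def map_transform(y, k, h=1.0):
    """Return the local trends (a_1, ..., a_{n-k+1}) of series y for window size k."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    if not 1 < k <= n:
        raise ValueError("window size k must satisfy 1 < k <= n")
    w = 2 * np.arange(k) - k + 1                 # weights (2j - k + 1), j = 0..k-1
    denom = h * k * (k**2 - 1)
    return np.array([6.0 * np.dot(w, y[i:i + k]) / denom for i in range(n - k + 1)])

# Sanity check: a perfectly linear series has a constant local trend equal to its slope.
t = np.arange(20)
print(map_transform(3.0 * t + 5.0, k=4))         # every slope is 3.0
```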
3 Measures of Local Trend Associations and Distances

Definition 6. Suppose y = (y1,…,yn) and x = (x1,…,xn) are two time series, and MAPk(y) = (ay1,…,aym), MAPk(x) = (ax1,…,axm), k ∈ {2,…,n−1}, m = n−k+1. A function

coss_k(y,x) = \frac{\sum_{i=1}^{m} a_{yi}\, a_{xi}}{\sqrt{\sum_{i=1}^{m} a_{yi}^2} \cdot \sqrt{\sum_{i=1}^{m} a_{xi}^2}}

is called a measure of local trend associations.

The measure of local trend associations was introduced in [3] and denoted as mlta, but further we will denote it as coss because it equals the cosine of the angle between the two vectors of slope values MAPk(y) and MAPk(x):

coss_k(y,x) = \cos(\angle(\mathrm{MAP}_k(y), \mathrm{MAP}_k(x))).

One can also introduce the correlation of slope values, cors_k(y,x) = corr(MAPk(y), MAPk(x)), and a distance diss_k(y,x) between slope values, which will be considered below. Denote

\tilde{a}_{yi} = \frac{a_{yi}}{\sqrt{\sum_{i=1}^{m} a_{yi}^2}}, \qquad \tilde{a}_{xi} = \frac{a_{xi}}{\sqrt{\sum_{i=1}^{m} a_{xi}^2}}, \qquad \tilde{a}_y = (\tilde{a}_{y1},\ldots,\tilde{a}_{ym}), \qquad \tilde{a}_x = (\tilde{a}_{x1},\ldots,\tilde{a}_{xm}).

From \sum_{i=1}^{m} \tilde{a}_{yi}^2 = 1 and \sum_{i=1}^{m} \tilde{a}_{xi}^2 = 1 it follows:

coss_k(y,x) = \cos(\angle(\mathrm{MAP}_k(y), \mathrm{MAP}_k(x))) = \cos(\angle(\tilde{a}_y, \tilde{a}_x)),

i.e. the measure of local trend associations may be considered as the cosine of the angle between the unit vector representations of the time series on a multidimensional unit sphere.

Definition 7. Suppose y = (y1,…,yn) and x = (x1,…,xn) are two time series, and MAPk(y) = (ay1,…,aym), MAPk(x) = (ax1,…,axm), (k ∈ {2,…,n−1}, m = n−k+1). A function

diss_k(y,x) = \sqrt{\sum_{i=1}^{m} (\tilde{a}_{yi} - \tilde{a}_{xi})^2} = \sqrt{\sum_{i=1}^{m} \left( \frac{a_{yi}}{\sqrt{\sum_{i=1}^{m} a_{yi}^2}} - \frac{a_{xi}}{\sqrt{\sum_{i=1}^{m} a_{xi}^2}} \right)^2},
is called a measure of local trend distances.

From the definition of the introduced measures it follows:

coss_k(y,x) = coss_k(x,y);   coss_k(y,y) = 1;   −1 ≤ coss_k(y,x) ≤ 1;
coss_k(y,−x) = −coss_k(y,x);   coss_k(y,−y) = −1;   coss_k(−y,−x) = coss_k(y,x);
diss_k(y,x) = diss_k(x,y);   diss_k(y,y) = 0;   0 ≤ diss_k(y,x) ≤ 2;
diss_k(y,−y) = 2;   diss_k(y,−x) = \sqrt{4 − diss_k^2(y,x)};   diss_k(−y,−x) = diss_k(y,x);
diss_k(y,x) ≤ diss_k(y,z) + diss_k(z,x).
The measure dissk(y,x) is a quasi-metric on the set of time series with n elements because it satisfies the properties 0 ≤ dissk(y,x), dissk(y,y) = 0 and the triangle inequality. Generally the axiom of a metric, dissk(y,x) = 0 if and only if y = x, is not fulfilled on the set of all possible time series of length n. Based on the distance measure diss we can introduce a similarity measure between the vectors of slope values of time series y and x:

sims_k(y,x) = 1 − diss_k(y,x).

Like coss, this measure satisfies the following properties:

sims_k(y,x) = sims_k(x,y);   sims_k(y,y) = 1;   −1 ≤ sims_k(y,x) ≤ 1;
sims_k(y,−y) = −1;   sims_k(−y,−x) = sims_k(y,x).
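The following short Python sketch (an illustration, not the authors' implementation) computes coss, diss, and sims from two slope vectors produced by the MAP transform, and numerically checks one of the identities listed above.

```python
# A minimal sketch of the local trend association (coss), distance (diss) and
# similarity (sims) measures computed from two MAP slope vectors of equal length.
import numpy as np

def coss(ay, ax):
    ay, ax = np.asarray(ay, float), np.asarray(ax, float)
    return float(np.dot(ay, ax) / (np.linalg.norm(ay) * np.linalg.norm(ax)))

def diss(ay, ax):
    ay, ax = np.asarray(ay, float), np.asarray(ax, float)
    return float(np.linalg.norm(ay / np.linalg.norm(ay) - ax / np.linalg.norm(ax)))

def sims(ay, ax):
    return 1.0 - diss(ay, ax)

# Example slope vectors (arbitrary illustrative values):
ay, ax = np.array([1.0, 0.5, -0.2]), np.array([0.8, 0.1, 0.3])
print(coss(ay, ax), diss(ay, ax), sims(ay, ax))
# One of the identities listed above: diss(y, -x)^2 = 4 - diss(y, x)^2
print(np.isclose(diss(ay, -ax)**2, 4 - diss(ay, ax)**2))
```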
Generally, the similarity and distance measures will also be referred to as measures of local trend associations.

Theorem 8. Suppose (y,t) and (x,t) are two time series and L1(y,t) = (p1y+q1, r1t+s1) and L2(y,t) = (p2y+q2, r2t+s2), where p1, p2, r1, r2 ≠ 0, are two linear transformations of time series. Then

cossk(L1(y,t), L2(x,t)) = sign(p1)⋅sign(r1)⋅sign(p2)⋅sign(r2)⋅cossk((y,t), (x,t)).

If sign(p1)⋅sign(r1) = sign(p2)⋅sign(r2) then

dissk(L1(y,t), L2(x,t)) = dissk((y,t), (x,t)),   simsk(L1(y,t), L2(x,t)) = simsk((y,t), (x,t)).

Corollary 9. The measure |coss| is invariant under independent linear transformations of time series.

Theorem 8 and Corollary 9 show very useful invariance properties of the introduced association measures. Corollary 9 means, in particular, that the time series may be normalized independently and the absolute measure of local trend associations will not change. Additionally, the following time transformations may be applied: a change of the units of time, e.g. minutes to
seconds; a movement of the zero time point; a replacement of time points by indexes if the step of the time points is fixed, etc.

Measures of local trend distances and similarities are invariant under most types of normalizations of time series. For example, if the time series values are non-negative, then these measures are invariant under the following most commonly used normalizations:

\bar{y}_i = \frac{y_i - y_{\min}}{y_{\max} - y_{\min}}, \qquad \bar{y}_i = y_i - \bar{y}, \qquad \bar{y}_i = K \cdot \frac{y_i - \bar{y}}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},

where \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i and K is a positive coefficient. In this case the following equalities are fulfilled: diss_k(y,x) = diss_k(\bar{y},\bar{x}) and sims_k(y,x) = sims_k(\bar{y},\bar{x}).
These invariance properties of the local trend distance and similarity measures ensure that the values of the distance or similarity measure remain unchanged whether or not a normalization of the time series is applied. If these measures are used for clustering of time series, the clustering results are therefore also invariant under normalization of the TS. This invariance property does not hold for the Euclidean metric applied directly to time series values: there, normalization of the TS changes the distance values and the clustering result. We can illustrate this with the following simple synthetic example of three time series: y1 = cos(t); y2 = y1 + e, where e = 5 if t < −4 and e = 0 otherwise; y3 = cos(t + π), where the time points run from −6 to 6 with step 0.2. The shapes of these time series are shown in Fig. 1a. Figure 1b shows these time series after the normalization yi − mean(y). Figure 1c shows them after the normalization (yi − mean(y))/s, where s is the standard deviation of y. Table 1 gives the Euclidean distances between the time series before and after normalization, together with the local trend distances dissk(yi,yj) and associations cossk(yi,yj) for window sizes 3 and 4.

[Fig. 1a: the three non-normalized series y1 (.), y2 (o), y3 (--); Fig. 1b: the series normalized as y − mean(y); Fig. 1c: the series normalized as (y − mean(y))/std(y).]

Fig. 1. Example of three synthetic time series

Table 1. Distances between time series before and after normalization

                               d(y1,y2)   d(y1,y3)   d(y2,y3)
non-normalized TS               15.81      10.88      20.79
first type normalization        14.46      10.87      19.92
second type normalization        7.57      15.49      13.52
diss, nw = 3                     0.96       2          1.76
diss, nw = 4                     0.90       2          1.79
coss, nw = 3                     0.54      -1         -0.54
coss, nw = 4                     0.59      -1         -0.59

In this simple example we see that the pairs of time series with the minimum and maximum Euclidean distances differ depending on the type of normalization. The distance between time series y1 and y3 is minimal when the TS are not normalized or when the first type of normalization is used, but it is maximal for the second type of normalization. As a result, clustering of these time series will differ depending on how they were normalized. For this reason, combining time series normalization with the Euclidean metric applied directly to time series values should be done warily. The values of the local trend distances and associations between time series do not change under any of these normalizations; they depend only on the size of the sliding window. The high (greater than 0.5) positive association between time series y1 and y2 has a clear explanation: these time series differ only on a small time interval. When the window size nw increases, the deviation e in time series y2 is smoothed more strongly and, as a result, the association value coss between y1 and y2 becomes larger and the distance value diss decreases. The local trend association measure also detects the strongest possible negative association between y1 and y3, because the values of one time series are inverted with respect to the other. If such time series are to be considered similar, for example when they correspond to opposite indicators like "Employment" and "Unemployment", coss may be transformed into the similarity S = |coss| or dissimilarity D = 1 − |coss| measures, which may then be used for clustering of time series. If time series with high negative association are to be considered highly dissimilar, e.g. when they describe "increasing" and "decreasing" patterns, then similarity and dissimilarity measures may be defined as S = M(1 + coss) and D = M(1 − coss), where M equals 0.5 or 1. Below we consider more detailed examples of the use of the local trend association measure.
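A minimal Python sketch in the spirit of this example is given below; it is not the authors' code, and the exact printed numbers depend on floating point details, but it reproduces the qualitative picture of Table 1: the Euclidean distances change under z-score normalization while coss and diss do not.

```python
# A minimal sketch of the Fig. 1 / Table 1 experiment: Euclidean distances change
# under z-score normalization, while coss and diss (window size 3) do not.
import numpy as np

def map_transform(y, k, h=1.0):
    w = 2 * np.arange(k) - k + 1
    return np.array([6.0 * np.dot(w, y[i:i + k]) / (h * k * (k**2 - 1))
                     for i in range(len(y) - k + 1)])

def coss(y, x, k):
    ay, ax = map_transform(y, k), map_transform(x, k)
    return float(np.dot(ay, ax) / (np.linalg.norm(ay) * np.linalg.norm(ax)))

def diss(y, x, k):
    ay, ax = map_transform(y, k), map_transform(x, k)
    return float(np.linalg.norm(ay / np.linalg.norm(ay) - ax / np.linalg.norm(ax)))

t = np.arange(-30, 31) * 0.2              # time points -6, -5.8, ..., 6
y1 = np.cos(t)
y2 = y1 + np.where(t < -4, 5.0, 0.0)
y3 = np.cos(t + np.pi)

# coss and diss do not depend on the time step h, so the default h = 1 is used.
zscore = lambda y: (y - y.mean()) / y.std()
for name, (a, b) in {"(y1,y2)": (y1, y2), "(y1,y3)": (y1, y3)}.items():
    print(name,
          "Euclidean raw: %.2f" % np.linalg.norm(a - b),
          "Euclidean z-scored: %.2f" % np.linalg.norm(zscore(a) - zscore(b)),
          "coss (k=3): %.2f" % coss(a, b, 3),
          "diss (k=3): %.2f" % diss(a, b, 3))
```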
4 Association Function and Association Measure

A high value of the local trend association between two time series may be caused by different reasons: by the dependence of one time series on the other, by dependence on the same hidden variable, by the same errors or small movements superposed on the time series, by synchronous cyclic or seasonal changes, etc. In the presence of random fluctuations in the time series values, the window size k should not be too small, so that these fluctuations are smoothed. However, real financial time series are often obtained as a result of averaging data over some time period; in this case it may be reasonable to consider small windows for the analysis of local trend associations. More complete information about the associations between time series y, x is given by the sequence of association values, which will be called an association function: AF(y,x) = (coss2(y,x),…, cossn(y,x)). The values of the association function depend on the window size k ∈ (2,…,n). The graph of this function can give useful information about the associations between time series y and x. Generally, some subset K ⊆ {2,…,n} of all possible window sizes may be considered; in this case we will speak about the function AFK(y,x) defined on the set of windows K. The average
or maximal value of this function may be used as a measure of association between the time series:

AM(y,x) = mean(AF_K(y,x)) = \overline{AF_K(y,x)} = \frac{1}{|K|} \sum_{k \in K} coss_k(y,x).
Let us illustrate the properties of the association function by a synthetic example of two time series:

y = 2t + 20 + 7sin(t) + 2sin(8t),   x = 3t + 15 − 10sin(t) + 4sin(8t),

obtained as the sum of three components, y = y1 + y2 + y3, where y1 = 2t + 20, y2 = 7sin(t), y3 = 2sin(8t), and x = x1 + x2 + x3, where x1 = 3t + 15, x2 = −10sin(t), x3 = 4sin(8t). Figure 2 shows the shapes of the functions y and x. Figure 3 shows the values of the local trend association cossk(y,x) and local trend similarity simsk(y,x) measures for all possible values of the window size k. The sequence of values of the first measure defines the association function AF. The sequence of values of the similarity measure simsk(y,x) will also be called a similarity function SF. The association function shows high associations between all three components of the time series: positive local trend associations on small windows, negative associations on medium-size windows, and positive global trend associations on large windows. We see that for the considered example the functions cossk(y,x) and simsk(y,x) have similar shapes, but for association analysis we will use cossk(y,x) because it has a clearer interpretation. For windows of small size k = 2,…,5 this measure shows high (greater than 0.5) positive associations between the time series x and y. This is because for small windows the approximating lines follow the movements of the function components y3 and x3, which have the same phase. For larger window sizes k the approximating lines smooth these movements and follow the movements of the function components y2 and x2, which are in opposite phase, and the slope values of the lines approximating the functions y and x have opposite signs.
Fig. 2. Example of time series composed of three components with different movements
For this reason, cossk(y,x) has large negative values, less than −0.5, for window sizes k = 10,…,34. The function values were calculated with time step 0.1, so these window sizes correspond to time intervals of 1,…,3.4. These time intervals are greater than the period of the functions y3 and x3, and for this reason the small oscillations determined by these functions are smoothed. On the other hand, the slope values on these intervals follow the fluctuations of the function components y2 and x2. For window sizes essentially greater than π the moving approximations begin to smooth the oscillations of the functions y and x caused by y2 and x2, and cossk(y,x) shows small association values, between −0.5 and 0.5. The association values become highly positive (greater than 0.5) for window sizes greater than 68, corresponding to a time interval of 6.8, which is greater than the period 2π of the functions y2 and x2. Such window sizes provide full smoothing of the oscillations of y and x caused by y2 and x2. The slope values of large windows follow the general movement of the functions y and x determined by the linear components y1 and x1, which have positive slopes. Empirically, based on this and other examples, we will consider association values greater than 0.5 or smaller than −0.5 as considerable.
Fig. 3. The values of association (•) and similarity (+) functions calculated for all possible window sizes k for time series from Fig. 2
This synthetic example of two time series composed of three components with different movements gives a clear explanation of the influence of the window size on the values of the association function. For real time series such dependences are not as clear, due to random influences on the time series values, the absence of strong associations between the movements of the time series, etc. In this case the analysis of associations may be based on a subset K of windows. High associations between financial time series on small windows can indicate a similarly high sensitivity of the financial indicators to short-term market changes, while high associations on medium-size or large windows can indicate a mutual dependence of the financial indicators or their dependence on common financial or economic factors. In the following sections such association analysis is illustrated on real economic and financial time series.
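A compact Python sketch of this experiment is given below (an illustration only; the sampling grid 0 ≤ t < 12 with step 0.1 is an assumption consistent with the window sizes discussed above).

```python
# A minimal sketch of the association function AF(y,x) = (coss_2, ..., coss_{n-1})
# for the two composed series; the sampling grid is an assumption.
import numpy as np

def map_transform(y, k, h=1.0):
    w = 2 * np.arange(k) - k + 1
    return np.array([6.0 * np.dot(w, y[i:i + k]) / (h * k * (k**2 - 1))
                     for i in range(len(y) - k + 1)])

def coss(y, x, k):
    ay, ax = map_transform(y, k), map_transform(x, k)
    return float(np.dot(ay, ax) / (np.linalg.norm(ay) * np.linalg.norm(ax)))

t = np.arange(0, 120) * 0.1
y = 2 * t + 20 + 7 * np.sin(t) + 2 * np.sin(8 * t)
x = 3 * t + 15 - 10 * np.sin(t) + 4 * np.sin(8 * t)

AF = {k: coss(y, x, k) for k in range(2, len(t))}
# Expected per the discussion above: positive, negative, positive.
print(round(AF[3], 2), round(AF[20], 2), round(AF[80], 2))
AM_short = max(AF[k] for k in (2, 3))                         # short-term association measure
AM_long = float(np.mean([AF[k] for k in range(68, len(t))]))  # long-term association measure
print(round(AM_short, 2), round(AM_long, 2))
```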
5 Example 1: Level of Unemployment

In this section we apply association analysis to the set of time series of the unemployment level in five countries, 1. Canada, 2. Italy, 3. Sweden, 4. UK, 5. US, downloaded from [8] (thousands; not seasonally adjusted; years 1959–2003). The normalized time series are shown in Fig. 4. Two association measures, AM(y,x) = max(AFK(y,x)), K = {2,3}, and AM(y,x) = mean(AFK(y,x)), K = {5,…,41}, evaluating local trend associations and global trend associations between the time series, were calculated. Based on these measures the corresponding association networks were constructed; they are presented in Fig. 5. In these networks only association values greater than α = 0.5 and α = 0.75, respectively, are shown. Both association networks define a clear separation into two classes: {1. Canada, 4. UK, 5. US} and {2. Italy, 3. Sweden}. This clustering is in good accordance with an intuitive evaluation of the similarity between the time series based on a visual analysis of their shapes (Fig. 4). Figure 5a shows that the association between the unemployment levels of Canada and the US has the maximum value, which might be
Fig. 4. Level of unemployment in 1. Canada, 2. Italy, 3. Sweden, 4. UK, 5. US
[Fig. 5a: local trend association network over the five countries, with the shown edge values ranging from 0.55 to 0.73; Fig. 5b: global trend association network, with the shown edge values ranging from 0.89 to 0.93.]
Fig. 5. Associations between the unemployment level in five countries. Only the values of the association measure AM greater than α are shown: (a) AM = max(AFK), K = {2,3}, α = 0.5; (b) AM = mean(AFK), K = {5,…,41}, α = 0.75. The Appendix contains examples of association networks constructed for time series from the UCR time series data mining archive [12]; for those examples the association networks define a clear clustering of the time series
expected before the association analysis; however, the finding that the unemployment level of the UK has higher associations with Canada and the US than with the European countries seems unexpected. In this simple example, the association networks were separated into two clusters by a suitable choice of the association levels shown in the networks. In general, such a procedure need not yield a non-trivial clustering of time series. If the goal of association analysis is clustering of time series, then, e.g., the connected sub-graphs obtained at some level of association values may be considered as clusters. Such a clustering procedure is equivalent to single linkage clustering [10] of the time series followed by the selection of a suitable partition at one of the levels of the resulting dendrogram. One can apply other clustering procedures if the absolute value of the association measure is used as a similarity function. We think, however, that simply clustering the set of time series describing the dynamics of some economic or financial system, if such a clustering does not naturally exist in the data set, can give an inadequate interpretation of the mutual relationships existing in the system. For example, clustering can discard important associations between some time series by separating them into different clusters, or can aggregate into one cluster time series with low association values. For this reason, a visualization of the association network that preserves the high association values gives more useful information about the structure of the analyzed system. In such visualizations of association networks we will select the value of the association level as a trade-off between the attempt to obtain connected sub-graphs or a non-trivial partition into connected components and the attempt to show only high association values. In any case, the full information about the association values between the time series may additionally be analyzed.
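The thresholding step itself can be sketched in a few lines of Python. The following is an illustration only: the unemployment data are not reproduced, so hypothetical random-walk series stand in for them, and the function simply keeps every pair whose association measure exceeds the level alpha in absolute value.

```python
# A minimal sketch of building a thresholded association network: nodes are the
# series, and an edge is kept whenever |AM(y, x)| exceeds the level alpha.
import numpy as np
from itertools import combinations

def map_transform(y, k):
    w = 2 * np.arange(k) - k + 1
    return np.array([6.0 * np.dot(w, y[i:i + k]) / (k * (k**2 - 1))
                     for i in range(len(y) - k + 1)])

def coss(y, x, k):
    ay, ax = map_transform(y, k), map_transform(x, k)
    return float(np.dot(ay, ax) / (np.linalg.norm(ay) * np.linalg.norm(ax)))

def association_network(series, K, alpha, aggregate=max):
    """series: dict name -> 1-D array; returns {(name1, name2): AM} for |AM| > alpha."""
    edges = {}
    for a, b in combinations(series, 2):
        am = aggregate(coss(series[a], series[b], k) for k in K)
        if abs(am) > alpha:
            edges[(a, b)] = round(am, 2)
    return edges

# Hypothetical usage with synthetic data standing in for the unemployment series:
rng = np.random.default_rng(0)
data = {name: np.cumsum(rng.normal(size=45)) for name in
        ["Canada", "Italy", "Sweden", "UK", "US"]}
print(association_network(data, K={2, 3}, alpha=0.5))
```

For the global trend measure one would pass, e.g., aggregate=lambda vals: float(np.mean(list(vals))) with K = {5,…,41}.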
6 Example 2: Gross Internal Product of Mexico

In this section we apply association analysis to 10 time series of the gross internal product of Mexico, recorded quarterly over the period 1980–2003 [1]. Each time series contains 96 data points. Figure 6 shows the shapes of the time series:

1. Farming, Forestry and Fishes
2. Mining
3. Manufacturing Industry
4. Construction
5. Electricity, Gas and Water
6. Commerce, Restaurants and Hotels
7. Transport, Storage and Communications
8. Financial Services, Insurance, Real Estate Activities and Rent
9. Social and Personal Communal Services
10. Liability to Banking Services Allocate
Fig. 6. Time series of the gross internal product of Mexico. The values of time series 10 are multiplied by −1 and normalized to demonstrate the similarity of time series 8 and 10, which have a high negative mutual association value
It is clear from Fig. 6 that the time series have different small fluctuations but the same general tendency. Also, the graphs of the association functions AF(x,y) for all pairs x,y of time series have more or less the same shapes and differ mainly for small windows. For this reason, only local trend associations AFK for the small window sizes K = {2,3} were calculated. The resulting association value between time series was determined as AM = max(AFK), K = {2,3}. Figure 7 shows the association network of the considered economic indicators, where only association values greater than or equal to 0.5 are shown. In fact, the values 0.5 shown in Fig. 7 were obtained by rounding association values slightly below 0.5.
[Fig. 7: association network over the 10 indicators; the shown edge values range from 0.50 to 0.73 for positive associations and include the negative associations −0.81 and −0.99.]
Fig. 7. Association network of indicators of Mexican economics
If we remove from Fig. 7 the associations with absolute values less than 0.55, the association network is covered by the classes {8,10}, {1,7,9}, {5,9}, {3,4,6}, {2}. Time series belonging to the same class have a high absolute association value. The time series "8: Financial Services, Insurance, Real Estate Activities and Rent" and "10: Liability to Banking Services Allocate" have high negative local and global associations (for small and large window sizes). For this reason, in Fig. 6 the values of time series 10 are multiplied by −1 and normalized to demonstrate the similarity of time series 8 and 10. Unlike time series 8 and 10, the time series "5: Electricity, Gas and Water" and "9: Social and Personal Communal Services" have high negative local associations but high positive global associations. This is an interesting example of time series whose movements are highly associated on both the local and the global level, but with different signs. From Figs. 6 and 7 it can be seen that time series with high local associations have similar shapes. The obtained association network also has a natural explanation. A comparison of such association networks for different countries can give useful information about specific features of the dynamics of their economies.
7 Example 3: Foreign Exchange Rates

In this example association analysis is applied to time series of foreign exchange rates (units of national currency per US $1) measured daily from 2004-09-02 to 2004-10-15 [9]. The normalized time series are shown in Fig. 8. In contrast to the example of Sect. 6, the association function AF(y,x) has noticeably different shapes for different pairs of time series. For this reason two association measures were considered: AM1 = max(AFK), K = {2,3}, and AM2 = mean(AFK), K = {5,…,26}. The first measure evaluates local trend associations between local movements and can be considered as a measure of the association between the sensitivities of currencies to the fluctuations of the market. The second measure evaluates long-term associations between currencies, i.e. associations between global trends. The association networks for these two measures are shown in Figs. 9 and 10, where only associations with values greater than 0.6 and greater than 0.75, respectively, are shown. For the measure AM1, associations greater than 0.75 exist only within the classes DSSN = {1. Denmark, 13. Switzerland, 16. Sweden, 17. Norway}, JS = {8. Japan, 15. Singapore}, and IT = {7. India, 12. Thailand}. In Fig. 10 the dashed lines show high negative associations. The comparison of these association networks shows that two classes of currencies, DSSN and IT, have both high local and high global associations within the classes, i.e. the same sensitivity and the same general trends.
[Fig. 8 plots the 17 normalized exchange-rate series: 1. Denmark, 2. China, 3. Canada, 4. Brazil, 5. Hong Kong, 6. Mexico, 7. India, 8. Japan, 9. South Korea, 10. Taiwan, 11. South Africa, 12. Thailand, 13. Switzerland, 14. Sri Lanka, 15. Singapore, 16. Sweden, 17. Norway.]
Fig. 8. Time series of foreign exchange rates
From Fig. 8 it is easy to see that the corresponding time series have very similar shapes. It is interesting to note that the Japanese currency (8) has highly positive local but highly negative global associations with the currencies {1. Denmark, 16. Sweden}. Such opposite associations of local and global trends show that, in spite of a similar sensitivity of the corresponding currencies, they have different global trends. This example shows the usefulness of both local and global trend analysis. Where local trend analysis indicates a similar sensitivity of currencies, global trend analysis can find deeper associations between them. For example, the global trend association network shows regional currency clusters such as {Canada, Brazil, Mexico} and {Japan, South Korea, Taiwan} in spite of the absence of high local trend associations between these currencies. Association analysis of currency exchange rates may be used as supporting information in the solution of portfolio optimization problems. For example, it seems clear that a diversified portfolio of currencies should contain only one of the currencies of Denmark, Switzerland, Sweden, and Norway, due to the high local and global associations existing between these currencies. Also, a short-term or long-term forecast of the possible movements of some currency can be extended to other currencies that have high local-trend or global-trend associations with the first one. Generally, association
Fig. 9. Local trend association network of currency exchange rates
measures may be used for estimating the functional relationship between dependent and independent variables in causal forecasting models [6], when the predicted future values of the independent variable are used for predicting the values of the dependent variable.
8 Conclusions

In this chapter we considered new methods of time series data mining that make it possible to analyze the relationships between the dynamics of the elements of economic or financial systems described by sets of time series. These methods calculate associations between the elements of the considered systems as associations between the corresponding time series. These associations may be visualized as an association network between the system elements. As such systems,
Fig. 10. Global trend association network of currency exchange rates
one may consider countries, companies, or stock markets described by time series of economic or financial indicators, securities prices, exchange rates, etc. This association network may be compared with the spatial, causal, etc. relations existing between the system elements, and it can give useful additional information about the relationships between the elements of the system. New measures of associations between time series, based on the moving approximation transform, are considered. These measures have several good properties. First, they are invariant under independent linear transformations and normalizations of time series. For example, the Euclidean metric, applied directly to time series values and often used for evaluating similarity between time series, is not invariant under normalization of time series. As a result, the presence of large local deviations in time series values superimposed by external influences can cause non-proportional changes in the distance values between different time series obtained after
normalization. The new measures are free from such problems. Second, the measures of local trend associations considered here depend parametrically on the size of the sliding window. This makes it possible to change the point of view on the concept of association between time series and to measure these associations at different levels of refinement, for example, to analyse short-term and long-term associations. Such flexibility gives the new measures an advantage over the non-parametric similarity measures often used in time series data mining, where the concept of similarity is not clearly defined. Third, the measure of local trend associations can evaluate both positive and negative associations between time series, which makes it possible to discriminate negative associations from the absence of any association. For example, a high negative association between time series of employment and unemployment levels measured during some time interval is easily detected by the measure of local trend associations without any transformation of the time series, such as inversion. The Euclidean distance between time series values usually used in TSDM cannot detect negative associations. Fourth, the new association measures based on the MAP transform by definition include a smoothing of the time series as a result of the linear approximation of the time series values in sliding windows. This implies an insensitivity of the new measures to fluctuations in the data superimposed by errors or external influences; changing the size of the sliding window changes the level of this insensitivity. Association networks can give important information for understanding the mutual relationships existing between the dynamics of the elements of financial or economic systems. For example, information about local and global trend associations between currency exchange rates or securities prices may be used as supporting information for decision making in portfolio optimization problems. A comparison of the association networks of economic indicators constructed for different countries can give important information on the dynamics of the economic development of these countries. The absence of expected associations or the presence of unexpected associations between the elements of dynamic systems, discovered by association analysis, could change expert knowledge about the relationships existing between the elements of economic or financial systems. The method of MAP transform calculation presented in Theorem 5 makes it possible to implement effective procedures of time series data mining based on the MAP transform. The invariance of the introduced association measures under independent linear transformations of time series makes it possible to consider numerical dependencies of a very general nature as time series and to apply the developed technique to the analysis of associations between them. The proposed association measures can be considered as similarity and dissimilarity measures and used for clustering time series or time series patterns. Several
examples of such applications were considered in the text of the chapter and in the Appendix. The proposed methods of local trend analysis can be integrated with perception-based fuzzy granulation [4, 21] of the set of slope values and used for generating linguistic descriptions of time series with perceptual patterns like "slowly increasing", "very quickly decreasing", etc. Rules and descriptions of this type can further be used for perception-based reasoning about systems described by sets of time series. A method of forecasting based on MAP is described in [19]. Moving approximations and the related association measures are studied in this chapter from the data mining point of view, but it would also be interesting to study them using a statistical approach [11].
Acknowledgements

This research work was supported by the IMP within the projects D.00006 and D.00322. We thank Dr. E. Keogh for providing us with some examples of time series databases.
References

1. Banco de Información Económica. URL: http://dgcnesyp.inegi.gob.mx/bdine/bancos.htm
2. Bastogne T., Noura H., Richard A., Hittinger J.M. (1997). Application of subspace methods to the identification of a winding process. In: Proceedings of the Fourth European Control Conference, Vol. 5, Brussels
3. Batyrshin I., Herrera-Avelar R., Sheremetov L., Suarez R. (2004). Moving approximations in time series data mining. In: Proceedings of International Conference on Fuzzy Sets and Soft Computing in Economics and Finance, FSSCEF 2004, St. Petersburg, Russia, Vol. I, pp. 62–72
4. Batyrshin I., Herrera-Avelar R., Sheremetov L., Suarez R. (2004). On qualitative description of time series based on moving approximations. In: Proceedings of International Conference on Fuzzy Sets and Soft Computing in Economics and Finance, FSSCEF 2004, St. Petersburg, Russia, Vol. I, 73–80
5. Batyrshin I., Herrera-Avelar R., Sheremetov L., Panova A. (2005). Association networks in time series data mining. In: NAFIPS 2005. Soft Computing for Real World Applications, Ann Arbor, Michigan, USA, 754–759
6. Boverman B.L., O’Connell R.T. (1979). Time Series and Forecasting. Duxbury, Massachusetts
7. Das G., Lin K.I., Mannila H., Renganathan G., Smyth P. (1998). Rule Discovery from Time Series. Knowledge Discovery and Data Mining, 16–22
8. Economagic.com: Economic Time Series Page, URL: http://economiccharts.com/blsint.htm
9. Economic Research. Federal Reserve Bank of St. Louis. URL: http://research.stlouisfed.org/fred2/categories/15/downloaddata
10. Everitt B.S., Landau S., Leese M. (2001). Cluster Analysis. Fourth Edition. Arnold, London
11. Friedman, J.H. (1997). Data Mining and Statistics: What’s the Connection? URL: http://www-stat.stanford.edu/~jhf/ftp/dm-stat.ps
12. Keogh, E., Folias, T. (2002). The UCR Time Series Data Mining Archive. [http://www.cs.ucr.edu/~eamonn/TSDMA/index.html], University of California – Computer Science & Engineering Department, Riverside, CA
13. Keogh, E., Kasetty, S. (2002). On the need for time series data mining benchmarks: A survey and empirical demonstration. In: SIGKDD’02
14. Keogh, E., Lonardi, S., Ratanamahatana, C.A. (2004). Towards parameter-free data mining. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA
15. Last, M., Klein, Y., Kandel, A. (2001). Knowledge discovery in time series databases. IEEE Transactions on Systems, Man, and Cybernetics, 31B
16. Least Squares Fitting. Wolfram Research, Mathworld, URL: http://mathworld.wolfram.com/LeastSquaresFitting.html
17. Linear regression lines. MarketScreen. URL: http://www.marketscreen.com/help/AtoZ/default.asp?hideHF=&Num=58
18. Möller-Levet, C.S., Klawonn, F., Cho, K.H., Wolkenhauer, O. (2003). Fuzzy clustering of short time-series and unevenly distributed sampling points. IDA 2003, 330–340
19. Sheremetov L., Rocha L., Batyrshin I. (2005). Towards a Multi-agent Dynamic Supply Chain Simulator for Analysis and Decision Support. In: NAFIPS 2005. Soft Computing for Real World Applications, Ann Arbor, Michigan, USA, June 22–25, 263–286
20. Time Series Forecast. MarketScreen. URL: http://www.marketscreen.com/help/AtoZ/default.asp?hideHF=&Num=102
21. Zadeh, L.A. (1997). Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, 90, 111–127
22. Zhang, P., Huang, Y., Shekhar, S., Kumar, V. (2003). Correlation analysis of spatial time series datasets: a filter-and-refine approach. Proc. Seventh Pacific-Asia Conf. Knowledge Discovery Data Mining (PAKDD ’03), Lecture Notes in Artificial Intelligence Vol. 2637, Springer-Verlag, Seoul, Korea, 532–544
Appendix

Here we demonstrate the results of association analysis of time series from the UCR time series data mining archive [12]. Figure A1 shows the time series from the file RealityCheck.dat.
[Fig. A1 plots the 14 time series (numbered 1–14) over 1000 time points.]
Fig. A1. Time series from file RealityCheck.dat
Figure A2 shows the association network of these time series corresponding to the association measure AM = max(AFK), K = {2,…,200}, and containing association values greater than 0.70. We selected this association measure in order to find high local trend associations between the time series. The maximal period of oscillations of the time series is approximately equal to 300 (for time series 11 and 12). For this reason it is not reasonable to consider window sizes much greater than half of this period, because these oscillations would be totally smoothed for window sizes near or greater than this period. We did not evaluate global trend associations because it is clear from Fig. A1 that they are more or less similar. It is easy to see from Figs. A1 and A2 that time series with similar shapes obtain high positive associations. The time series {1,4} have high negative associations with time series {9,10} and would have similar shapes to
Fig. A2. Association network for RealityCheck.dat time series with association measure AM=max(AFK), K={2,…,200}
them if they were inverted. The association network in Fig. A2 shows a clear clustering of the time series, which gives the possibility to join together time series with high negative associations. Figure A3 shows the clustering of these time series presented on the data mining web page [12]. We see there that, in spite of the high similarity between time series {9,10} and the inverted time series {1,4}, they are separated into different classes. The measure of local trend associations can detect high negative associations without transformations of the time series such as inversion of values. If the absolute value of our association measure is used, then {1,4} have high positive association values with {9,10}, and the cluster {1,4,9,10} will be constructed by any reasonable clustering algorithm. Figure A4 shows a hierarchical clustering of these time series by the single linkage algorithm, which uses as similarity values between time series the absolute values of the association measure AM(x,y) = max(AFK(x,y)), K = {2,…,200}. This measure may be transformed into a dissimilarity measure d(y,x) = 1 − |AM(x,y)| in order to apply the clustering algorithm. The absolute value of the association measure may be used as a similarity measure between time series when a high negative association is considered as the presence of a real association between the time series.
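The clustering step described here can be sketched as follows. This is a minimal illustration, not the authors' code: the RealityCheck.dat data are not reproduced, so toy series are used, and scipy's single linkage routine stands in for the clustering algorithm.

```python
# A minimal sketch of single linkage clustering driven by the dissimilarity
# d(y, x) = 1 - |AM(y, x)|, with AM(y, x) = max over K of coss_k(y, x).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def map_transform(y, k):
    w = 2 * np.arange(k) - k + 1
    return np.array([6.0 * np.dot(w, y[i:i + k]) / (k * (k**2 - 1))
                     for i in range(len(y) - k + 1)])

def am(y, x, K):
    def coss(k):
        ay, ax = map_transform(y, k), map_transform(x, k)
        return np.dot(ay, ax) / (np.linalg.norm(ay) * np.linalg.norm(ax))
    return max(coss(k) for k in K)

def single_linkage_clusters(series_list, K, n_clusters):
    n = len(series_list)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = max(0.0, 1.0 - abs(am(series_list[i], series_list[j], K)))
            D[i, j] = D[j, i] = d
    Z = linkage(squareform(D), method="single")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Hypothetical usage with toy series standing in for the archive data:
t = np.linspace(0, 10, 300)
toy = [np.sin(t), -np.sin(t), np.cos(t), np.sin(t) + 0.1 * t]
print(single_linkage_clusters(toy, K=range(2, 50), n_clusters=2))
```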
Fig. A3. Hierarchical clustering of RealityCheck data shown in [12]
Fig. A4. Hierarchical clustering of RealityCheck data by single linkage clustering algorithm when absolute values of association measure are used as values of similarity between time series
Figure A5 shows the time series from the file Winding.dat [12]. Figure A6 shows the association network of these time series corresponding to the association measure AM = max(AFK), K = {2,…,100}, and containing association values greater than 0.5. The values of the time series oscillate strongly; for this reason we do not consider windows of size greater than 100, because the oscillations would be totally smoothed. A visual explanation of the obtained high association values between the time series is very difficult due to the strong oscillations of the data; only for time series 1 and 3 can the similarity of shapes be seen visually. This example is interesting for analysis because the time series oscillate strongly and have many time points, but nevertheless the applied method could find high positive and negative associations between the time series, which seems surprising. A description of the time series can be found in [2, 12].
Fig. A5. Time series from Winding.dat file [12]
Fig. A6. Association network for Winding.dat time series with association measure AM = max(AFK), K = {2,…,100}. Only associations with absolute values greater than 0.50 are shown
Perception Based Patterns in Time Series Data Mining I. Batyrshin, L. Sheremetov, and R. Herrera-Avelar
Summary. Importing intelligent features into systems that support human decisions in problems related to the analysis of time series databases is a promising research field. Such systems should be able to operate with fuzzy perception-based information about time moments and time intervals; about time series values, trends, and shapes; about associations between time series and time series patterns, etc., in order to formalize human knowledge, to simulate human reasoning, and to answer human questions. The chapter discusses methods developed in TSDM to describe linguistic perception-based patterns in time series databases. The survey considers different approaches to the description of such patterns, which use the sign of derivatives, scaling of trends and shapes, linguistic interpretation of patterns obtained as a result of clustering, a grammar for the generation of complex patterns from shape primitives, and temporal relations between patterns. These descriptions can be extended by using fuzzy granulation of time series patterns to make them more adequate to the perceptions used in human reasoning. Several approaches to relating the linguistic descriptions of experts to automatically generated texts of summaries and linguistic forecasts are considered. Finally, we discuss the role of perception-based time series data mining and computing with words and perceptions in the construction of intelligent systems that use expert knowledge and decision making procedures in time series database domains.
1 Introduction

Until now, most of the decision making procedures in problems related to time series (TS) analysis in economics and finance have been based on human decisions
supported by statistical, data mining, or data processing software. Importing intelligent features into these systems, including the possibility of operating with linguistic information, reasoning, and answering questions, is a promising research field. The computational theory of perceptions (CTP) [1–3] can serve as a basis for such an extension of these systems. Fuzzy logic, as a main constituent of CTP, gives powerful tools for modeling and processing linguistic information defined on numerical domains. The methodology of computing with words and perceptions proposes methods of reasoning with linguistic information based on fuzzy models. The success of fuzzy logic applications in control, technical systems modeling, and pattern recognition is based on a synergy of the linguistic descriptions and the numerical data available in these application areas. Fuzzy logic serves here as a bridge between linguistic and numerical information. One of the prerequisites for fuzzy logic applications in these areas is the existence of regular sources of numerical data, obtained from traditional mathematical models, experiments, or measurements, which can be used as a basis for the construction, examination, and tuning of fuzzy models. In his recent works and lectures, Lotfi Zadeh called attention to decision making applications of fuzzy logic in economics, finance, Earth sciences, etc., in which human perceptions play the central role. Perception-based propositions like "The price of gas is low and declining" or "It is very unlikely that there will be a significant increase in the price of oil in the near future" are usually used by people in decision making procedures. Perceptions like low, declining, very unlikely, significant increase, near future, etc. usually use fuzzy granulation of information [4] obtained from observations, measurements, life experience, mathematical analysis, visual perceptions of curves, etc. The formation of perceptions is a process of knowledge extraction from different sources. The development of intelligent question answering systems [5] supporting decision making procedures related to time series data bases (TSDB) requires the formalization of human perceptions about time, time series values, patterns and shapes, associations between patterns and time series, etc. These perceptions can be represented by words whose meaning is defined on the domains of time series data bases:
1. On the time domain:
   – Time intervals: one–two weeks, several days, end of the day
   – Absolute position on the time scale: approximately on September 20
   – Relative position: after one month, near future
   – Periodic or seasonal time intervals: end of the day, several weeks before Christmas, in summer
2. On the domain of TS values: large price; very low level of production
3. As a perception-based function or pattern of TS shape: slowly decreasing, quickly increasing, slightly concave
4. On the set of time series, attributes, or system elements: stocks of new companies
5. On the set of relations between TS, attributes, or elements: highly associated
6. On the set of possibility or probability values: unlikely, very probable

Most of such perceptions can be represented as fuzzy sets defined on the corresponding domain. Figure 1 depicts an example of the fuzzy set SEVERAL DAYS defined on the time domain. This fuzzy set reflects the perception that 3, 4, and 5 days definitely correspond to the term SEVERAL DAYS, whereas 2 and 6 days correspond to this term only partially.
Fig. 1. Fuzzy set SEVERAL DAYS
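As a minimal illustration of how such a granule might be encoded (our own sketch, not code from the cited works), the following snippet defines a trapezoidal membership function whose plateau covers 3–5 days, roughly matching Fig. 1; the breakpoints are illustrative assumptions.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Perception SEVERAL DAYS: full membership for 3-5 days, partial for 2 and 6 days.
def several_days(t):
    return trapezoid(t, 1, 3, 5, 7)

for t in range(1, 11):
    print(t, round(several_days(t), 2))
```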
Figure 2a depicts an example of a time series of PRICE values; Fig. 2b depicts a fuzzy set LARGE PRICE defined on the domain of price values y. This fuzzy set, together with the time series of price values, defines by Zadeh's extension principle a fuzzy set DAYS WITH LARGE PRICE on the time domain (Fig. 2c).
Fig. 2. (a) Time series of PRICE values; (b) fuzzy set LARGE PRICE defined on domain of price values y; (c) fuzzy set DAYS WITH LARGE PRICE induced on time domain
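A sketch of how the induced fuzzy set of Fig. 2c can be computed: for a crisp price series y(t), the membership of day t in DAYS WITH LARGE PRICE is simply the membership of y(t) in LARGE PRICE. The price values and the piecewise-linear definition of LARGE PRICE below are illustrative assumptions, not the data of Fig. 2.

```python
# Illustrative daily price series (values invented for the example).
price = [4, 5, 6, 8, 10, 11, 12, 11, 9, 7, 6, 5, 4, 4, 3, 4, 5, 7, 9, 10]

def mu_large_price(y, lo=6.0, hi=10.0):
    """Membership in LARGE PRICE: 0 below lo, 1 above hi, linear in between (assumed shape)."""
    if y <= lo:
        return 0.0
    if y >= hi:
        return 1.0
    return (y - lo) / (hi - lo)

# Composition in the spirit of the extension principle: membership of each day
# in the induced fuzzy set DAYS WITH LARGE PRICE.
days_with_large_price = [mu_large_price(y) for y in price]
print([round(m, 2) for m in days_with_large_price])
```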
Figure 3 depicts an example of fuzzy sets defined on a domain of slope values and corresponding to the perception-based trend patterns Quickly Decreasing, Slowly Decreasing, Constant, Slowly Increasing, and Quickly Increasing. An exact definition of the membership values of the fuzzy sets used in models of computing with words is often not critical, because the input and output of the model are words [1], which are insensitive to small changes in the membership values of the fuzzy sets obtained as a result of the translation of words.
Fig. 3. Example of fuzzy perception-based slopes defined on a domain of slope values
This situation differs from fuzzy modeling based on Mamdani or Sugeno models, where the initial definition of fuzzy sets usually does not play a large role in the construction of the final fuzzy model, since the membership functions are tuned in the presence of training input–output data [6, 7]. Human perceptions are intrinsically imprecise and granular, such that the boundaries of perceived classes are unsharp and the values of attributes are granulated [4]. Such a granulation may be crisp or fuzzy. The CTP gives a conceptual framework and a methodology for computing and reasoning with perceptions. The basis of CTP is the methodology of computing with words (CW) [1]. In this case, computing and reasoning with perceptions is reduced to computing and reasoning with words. Granular perceptions can be represented by fuzzy sets, and computing with perceptions can be realized by the methods of fuzzy set theory [8]. Computing with words and perceptions can serve as a basis for the insertion of a deduction capability [5] into decision support systems related to economic and financial time series data bases. An intelligent question answering system based on financial or economic time series data bases should give replies to fuzzy questions, realize perception-based inference, and do perception-based forecasting. Below are examples of questions for such systems:
1. Find:
   – Wells with a high level of water production
   – Securities quickly increasing in price at the end of the day
   – Highly mutually associated currencies
   – The most promising securities
2. Forecast:
   – The price of sugar in the middle of July if we know this price at the beginning of May. Additional information: the price of sugar increases slowly in spring and more quickly in summer.
   – The prices of cosmetics if the oil price will be greater than 75 dollars per barrel.
   – The sales of a new product after six months.
3. Optimize:
   – Which commodities to buy, when, and in what amounts, to obtain maximal profit during the next year?
   – What production capacity is needed to produce a new product?
It is very difficult to give an exact reply to most of these questions without additional information about environment conditions and requirements; only qualitative, perception-based evaluations may be given. These replies will depend on the evaluation of the current situation, tendencies, existing associations between system elements, expert knowledge, etc. The currently developed methods of time series analysis, forecasting, and mathematical programming may be useful for replying to some of these questions if the exact information required to apply these methods is known. But very often such information is fuzzy, absent, or scarce. In contrast to time series of physical variables, e.g., electromagnetic waves, which find applications in technical systems as a result of the application of the methods of mathematical physics, economic and financial time series are often evaluated and used in economic and financial decisions based on human perceptions, expertise, and knowledge. Linguistic rules and perception-based descriptions are an intrinsic part of such human solutions. Moreover, economic and financial systems are very complex, and it is impossible to take into account all information influencing the solution of decision making problems. To these systems the Principle of Incompatibility of Zadeh can be applied [9]: "As the complexity of a system increases, our ability to make precise and yet significant statements about its behavior diminishes until a threshold is reached beyond which precision and significance (or relevance) become almost mutually exclusive characteristics." For this reason, often only qualitative, perception-based solutions make sense in decision making in complex systems. Realization of a system supporting perception-based decision making procedures in problems related to the analysis of time series data bases requires extending the methods of time series data mining (TSDM) to give them the possibility to operate with perceptions. The goal of data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [10]. The following list contains the main time series data mining tasks [11–17]:
Segmentation. Split a time series into a number of "meaningful" segments. Possible representations of a segment include an approximating line, a perceptual pattern, a word, etc.
Clustering. Find natural groupings of time series or time series patterns.
Classification. Assign a given time series or time series pattern to one of several predefined classes.
Indexing. Realize efficient execution of queries.
Summarization. Give a short description of a time series (or multivariate time series) which retains its essential features in the considered problem.
Anomaly Detection. Find surprising and unexpected patterns.
Motif Discovery. Find frequently occurring patterns.
Forecasting. Forecast time series values based on the time series history or human expertise.
Discovery of Association Rules. Find rules relating patterns in time series (e.g., patterns that occur frequently in the same or in neighboring time segments).
These tasks are mutually related; for example, segmentation can be used for indexing, clustering, summarization, etc. Perception-based time series data mining systems should be able to manipulate linguistic information, fuzzy concepts, and perception-based patterns of time series in order to support human decision making in problems related to time series data bases. Development of such systems requires extending the methods of time series data mining (TSDM) [12–20] to give them the possibility to operate with perceptions. Fortunately, a number of methods for manipulating such information have recently been developed in time series data mining. A survey of perception-based patterns used in TSDM is given in the following sections. The goal of this survey is not to review all papers in TSDM and time series analysis which use perception-based patterns, but to consider the main types of such patterns that can be useful for perception-based time series data mining (PTSDM). Usually the patterns used in TSDM are crisp, but they may easily be generalized to represent fuzzy patterns. Possible applications of fuzzy models to signal processing, data mining, and knowledge management in data bases were discussed, e.g., in [35, 57–61]. The linguistic description of time series and of solutions of time series data mining tasks can have different forms depending on the goal of the linguistic description. Such a description can be given as a sequence of perception-based patterns (A1, A2, …, An), as a sequence of rules "If T is Tk then A is Ak," k = 1, …, n, where Tk are crisp or fuzzy intervals and Ak are linguistic shape descriptors, or as a less formal text generated as a result of summarization of multivariate time series. Due to space limits we are not going to discuss the
methods and algorithms for the extraction of these patterns. The necessary details can be found in the cited papers. Most of the discussed approaches can be used in different time series data mining tasks. In Sect. 2 we start with the perception-based patterns considered mainly in qualitative process analysis and used for process monitoring, fault detection, and qualitative reasoning about processes. Time series are divided into episodes described by temporal patterns or primitives defined by the signs of the first and the second derivatives. In Sect. 3 we consider patterns based on scaling of trends and shapes. Elementary patterns or primitives can be used for the generation of more complicated patterns based on a suitable grammar. In Sect. 4 we consider the shape definition language [14], which can be used for the generation of composed shape patterns. The language facilitates the generation (and execution) of queries to discover important information in time series. In Sect. 5 we consider an approach to the definition of patterns and rules which is based on clustering of shapes and linguistic interpretation of clusters. Once the patterns are identified, the next step is finding rules relating patterns in a time series to other patterns in that series, or patterns in one series to patterns in another series. One specific type of such relationship, considered in Sect. 6, is a temporal one. Integration of a TSDM system with the experience of human experts and the generation of summaries and textual descriptions of data sets require defining patterns in expert knowledge and relating them to patterns in time series. These approaches are discussed in Sect. 7. The Conclusion contains a discussion of possible directions of research in perception-based time series data mining.
2 Patterns Based on Signs of Derivatives

A triangular episodes representation language was formulated in [21, 22] for the representation and extraction of temporal patterns. Figure 4 shows seven temporal episodes used for the description of temporal patterns. These episodes can be linguistically described as A: Increasing and Concave; B: Decreasing and Concave; C: Decreasing and Convex; D: Increasing and Convex; E: Linearly Increasing; F: Linearly Decreasing; G: Constant.
Fig. 4. Temporal episodes
Fig. 5. Representation of time series by temporal episodes. Adopted from [23]
Figure 5 depicts an example of time series representation by temporal episodes. Episodes are separated by vertical dashed lines. This representation generates a segmentation of the time series into temporal episodes and codes it by the sequence of temporal patterns ABCDABCDAB. Such a coded representation of a time series can be used for dimensionality reduction, indexing, clustering of time series, etc. Possible applications of such a representation are process monitoring, diagnosis, and control [21–25]. An extension of this approach to the description of temporal episodes is considered in [24, 25]. Connected temporal episodes A, B and C, D with the same sign of the second derivative are joined together into composite episodes AB and CD. Such episodes are classified by the sign of the slope of the line joining the boundary points of the episode. In the proposed approach the new episodes AB↓, AB↑, AB=, CD↓, CD↑, CD= are added. Figure 6 depicts the proposed classification of episodes.
Fig. 6. Extended set of episodes. Adopted from [24]
A more extended dictionary of temporal patterns defined by the signs of the first (sd1) and the second (sd2) derivatives within the pattern is considered in [26]. This dictionary includes the perceptual patterns PassedOverMaximum, IncreasingConcavely, StartedToIncrease, ConvexMaximum, etc. (see Fig. 7). Figure 8 depicts an example of the transformation of the profile of a measured variable xj(t) into qualitative form as a result of approximation of xj(t) by a proper analytical function from which the signs of the derivatives are extracted [26]. The methods of application of temporal episodes to the description of noisy time series by means of approximation and smoothing are also discussed in [27]. The paper [26] introduces a method for reasoning about the form of the recent temporal profiles of process variables, which carry important information about the process state and the underlying process phenomena. The method is illustrated by the example of control of fermentation processes. Below are examples of the shape-analyzing rules used by the decision making system [26]:
Fig. 7. Elements of shape library. Adopted from [26]
Fig. 8. Example of transformation of the temporal profile into the qualitative form qshape = {sd1; sd2}. Adopted from [26]
IF (DOincrement > 5%) and (DuringTheLast30sec DO has been Increasing)
THEN (Report: Glucose depletion) and (Activate glucose feeding);
IF (DuringTheLast1hr DO has been DecreasingConcavelyConvexly)
THEN (Report: Foaming) and (Feed antifoam agent),
where DO denotes Dissolved Oxygen. The qualitative description of time series and processes can be used in qualitative reasoning about processes [28, 29], which takes into account the change of the signs of the derivatives of the considered processes. The temporal patterns considered in this section take into account only the signs of the first and second derivatives of a function representing a process. This property may be considered a positive feature of the temporal pattern representation language because (1) the time series representation is invariant to transformations of the time and time series value domains, and (2) these patterns give a qualitative description of processes or time series. But these representations do not use the scaling of patterns typical of human perceptions. Such perception-based patterns are considered in the following section.
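A rough sketch of how a qualitative shape in the spirit of this section could be extracted from a (smoothed) data window: the signs of the average first and second differences play the roles of sd1 and sd2. The smoothing-free differencing and the episode names are our simplifications, not the exact procedures of [21–26].

```python
def sign(x, eps=1e-6):
    return 0 if abs(x) < eps else (1 if x > 0 else -1)

def qualitative_shape(y):
    """Return (sd1, sd2): signs of the average first and second differences of a window."""
    d1 = [y[i + 1] - y[i] for i in range(len(y) - 1)]
    d2 = [d1[i + 1] - d1[i] for i in range(len(d1) - 1)]
    sd1 = sign(sum(d1) / len(d1))
    sd2 = sign(sum(d2) / len(d2)) if d2 else 0
    return sd1, sd2

# Mapping of sign pairs to labels similar to the episodes A-G of Fig. 4 (naming is ours).
LABELS = {(1, -1): "Increasing and Concave (A)", (-1, -1): "Decreasing and Concave (B)",
          (-1, 1): "Decreasing and Convex (C)", (1, 1): "Increasing and Convex (D)",
          (1, 0): "Linearly Increasing (E)", (-1, 0): "Linearly Decreasing (F)",
          (0, 0): "Constant (G)"}

window = [1.0, 2.0, 2.8, 3.4, 3.8, 4.0]   # rising and flattening out
print(LABELS.get(qualitative_shape(window), "mixed"))   # Increasing and Concave (A)
```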
3 Scaling of Perception-Based Patterns

A scaling of perception-based patterns is used in many papers. This scaling can be applied to time series values, to slope values, to convex–concave shapes, etc. The method of symbolic representation of time series called SAX was proposed in [12]. This method divides the domain of time series values in the considered window into intervals, and time series values are replaced by the codes of the respective intervals containing them. An example of time series representation by SAX is shown in Fig. 9. The authors use a scale with grades a, b, c, d, e, f for coding time series values. A generalization of this method can be based on a suitable replacement of these symbols by linguistic labels like very small, medium, large, etc. and on fuzzy granulation of the linguistic grades, e.g., as was shown in Fig. 2b. The paper [12] also gives a classification of various time series representations based on PLR, wavelets, the Discrete Fourier Transform, etc. Granulation of slope values of functional dependencies and time series is used in [30], where a system is described that generates linguistic descriptions of time series in the form of rules Rk: If T is Tk then Y is Ak, where Tk are fuzzy intervals, like Between A and B, Small, and Ak are linguistic descriptions of trends, like Very Quickly Increasing, Slowly Decreasing, etc. An evolutionary procedure is used to find an optimal partition of the time domain into fuzzy intervals on which the time series values are approximated by linear functions. The paper discusses methods of retranslation of the obtained piecewise linear approximation of time series into linguistic form.
Fig. 9. Example of time series representation by SAX as a string ffffffeeeddcbaabceedcbaaaaacddee. Adopted from [12]
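The sketch below shows the core of a SAX-like coding: the window is z-normalized and each value is mapped to a symbol according to the interval it falls into. Real SAX [12] additionally averages over segments (PAA) and uses breakpoints derived from the Gaussian distribution; the equal-width bins here are a simplification.

```python
import statistics

def sax_like(series, alphabet="abcdef"):
    """Simplified SAX-style coding: z-normalize, then map each value to a symbol
    by equal-width intervals (real SAX uses PAA and Gaussian breakpoints)."""
    mu = statistics.mean(series)
    sigma = statistics.pstdev(series) or 1.0
    z = [(x - mu) / sigma for x in series]
    lo, hi = min(z), max(z)
    n = len(alphabet)
    symbols = []
    for v in z:
        k = int((v - lo) / (hi - lo + 1e-12) * n)
        symbols.append(alphabet[min(k, n - 1)])
    return "".join(symbols)

ts = [10, 10.2, 9.8, 9.0, 7.5, 6.0, 4.5, 3.0, 2.5, 2.8, 4.0, 6.5, 9.0, 10.5]
print(sax_like(ts))   # a string over a..f, higher values coded by later letters
```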
Another approach to the linguistic description of time series was proposed in [31]. The method is based on the Moving Approximation Transform [32], which replaces a time series by a sequence of slope values of linear functions approximating the time series values in a sliding window. Figure 10 shows an example of a piecewise linear representation of a time series obtained by this method. The time series in this example contains the Industrial Production Index published by the Board of Governors of the Federal Reserve System [33]. It includes monthly data for the period from 1940-01-01 to 2003-07-01. The part of the linguistic description of the time series corresponding to the last two segments can be presented, e.g., in the following form: "During the last 2.5 years the Index Slowly Decreased, whereas during the previous 8 years it Quickly Increased." Similar linguistic descriptions of time series are reported in [34], where a scaling of trends is used in a system that detects and linguistically describes significant trends in time-series data, applying wavelets and scale space theory. In that work some experimental results of the application of this system to the summarization of weather data are presented. Below is an example of a description generated automatically by the system, which uses perception-based patterns like dropped, dropped sharply, etc.:
Fig. 10. Piece-wise linear representation of time series. Adopted from [31]
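A sketch of the basic step behind the Moving Approximation Transform [32]: the series is replaced by the least-squares slopes of lines fitted in a sliding window. The window length and the example data are illustrative assumptions.

```python
def ls_slope(y):
    """Least-squares slope of y against t = 0, 1, ..., len(y) - 1."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    num = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def moving_approximation(series, window):
    """Sequence of sliding-window least-squares slopes."""
    return [ls_slope(series[i:i + window]) for i in range(len(series) - window + 1)]

series = [3, 3.5, 4.2, 5.0, 5.9, 6.1, 6.0, 5.7, 5.2, 4.6, 4.1]
print([round(s, 2) for s in moving_approximation(series, window=4)])
# positive slopes while the series rises, negative ones while it falls
```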
“The temperature dropped sharply between the 3rd and the 6th. Then it rose between the 6th and the 11th, dropped sharply between the 14th and 16th and dropped between the 23rd and 28th.”
The linguistic description of time series based on a piecewise linear representation (PLR) can also build on the optimal PLR algorithms developed in time series data mining [17]. In this case, PLR algorithms should be modified to take into account a linguistic scaling of the set of possible slope values. The resulting algorithm should also avoid situations where neighboring intervals have the same shape descriptors. Possible applications of fuzzy models to signal processing were discussed in [35]. Table 1 gives a list of primitives used for the description of carotid waveforms [35, 36].

Table 1. List of primitives used for description of carotid waveforms [35, 36]
Up    upslope         long line, large positive slope
LP    large-pos       medium line, large positive slope
LN    large-neg       medium line, large negative slope
MP    med-pos         medium line, positive slope
MN    med-neg         medium line, negative slope
TE    trailing edge   long line, medium negative slope
Cup   parabola        opening up
Cap   parabola        opening down

Figure 11 shows an example of segmentation of a part of a carotid pulse wave by means of these primitives. The sequence of primitives L = (Up, Cap, Cup, Cup, LN, Cup, Cap, TE, Up, Cap, Cup, Cap, LN) can be used in syntactic pattern recognition of systolic and diastolic epochs. The possibility of introducing fuzziness into syntactic descriptions of digital signals in some very natural ways is discussed in [35]. The two regions in Fig. 11 marked F denote fuzzy boundaries between the systolic and diastolic regimes and between the primitives Up and Cap of this signal [35]. The functions µT(t) and µU(t) denote the respective transition membership functions. The paper [37] presents an approach to modeling time series datasets using linguistic shape descriptors. A simple linguistic term such as “rising” indicates that the series at this point is changing, i.e., y_{k+1} > y_k. Such terms are a measure of the first-derivative trend of the series. A more complex term such as “rising more steeply” indicates that the trend of the series is changing, i.e., y_{k+2} − y_{k+1} > y_{k+1} − y_k. Such terms are a measure of the second-derivative trend of the series. Parametric prototype trend shapes are given by the following functions:
Fig. 11. Carotid waveform representation as a sequence of primitives. Adopted from [35]
falling less steeply:  $f(t) = 1 - \sqrt{1 - (1-t)^{\alpha}}$,
falling more steeply:  $f(t) = \sqrt{1 - t^{\alpha}}$,
rising more steeply:   $f(t) = 1 - \sqrt{1 - t^{\alpha}}$,
rising less steeply:   $f(t) = \sqrt{1 - (1-t)^{\alpha}}$,
crest:                 $f(t) = 1 - \alpha^{2}(t - 0.5)^{\alpha}$,
trough:                $f(t) = \alpha^{2}(t - 0.5)^{\alpha}$.
Figure 12 depicts these prototype shapes for different values of the parameter α. Figure 13 depicts how trend shapes can be matched with a series. Each region of interest can have membership in several trend shapes. The trend concepts are used to build fuzzy rules of the form [37]:
If trend is F then next point is Y,
If trend is F then next point is current point + dY,
Fig. 12. Examples of prototype shapes (a) “falling less steeply,” α = 6; 3; 1.5; (b) “falling more steeply,” α = 1.5; 3; 6; (c) “rising more steeply,” α = 6; 3; 1.5; (d) “rising less steeply,” α = 1.5; 3; 6; (e) “crest,” α = 2; (f) “trough,” α = 2
Fig. 13. Shape matching. Adopted from [37]
where F is a trend fuzzy set, such as “rising more steeply,” and Y, dY are fuzzy time series values. Prediction using these trend fuzzy sets is performed using the Fril evidential logic rule [38]. The approach also uses a fuzzy scaling of trends similar to the method depicted in Fig. 3, with the linguistic patterns falling fast, falling slowly, constant, rising slowly, and rising fast.
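The sketch below evaluates prototype trend shapes of the kind listed above and scores how well a normalized data window matches each of them. The square-root form of the prototypes follows our reconstruction of the formulas, and the similarity score (one minus the mean absolute deviation) is our own simplification of the fuzzy matching used in [37].

```python
def prototypes(alpha=3.0):
    """Parametric prototype trend shapes on t in [0, 1] (as reconstructed above)."""
    return {
        "falling less steeply": lambda t: 1 - (1 - (1 - t) ** alpha) ** 0.5,
        "falling more steeply": lambda t: (1 - t ** alpha) ** 0.5,
        "rising more steeply":  lambda t: 1 - (1 - t ** alpha) ** 0.5,
        "rising less steeply":  lambda t: (1 - (1 - t) ** alpha) ** 0.5,
    }

def match_scores(window, alpha=3.0):
    """Normalize a window to the unit square and score it against each prototype."""
    n = len(window)
    lo, hi = min(window), max(window)
    norm = [(v - lo) / (hi - lo + 1e-12) for v in window]
    scores = {}
    for name, f in prototypes(alpha).items():
        proto = [f(i / (n - 1)) for i in range(n)]
        scores[name] = 1 - sum(abs(a - b) for a, b in zip(norm, proto)) / n
    return scores

window = [1.0, 1.1, 1.3, 1.7, 2.4, 3.5]    # rising with increasing slope
scores = match_scores(window)
print(max(scores, key=scores.get))          # rising more steeply
```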
More extended methods of scaling and fuzzy granulation of time series shape patterns are considered in [63].
4 Shape Definition Language

A shape definition language (SDL) was developed in [14] for retrieving objects based on shapes contained in the histories associated with these objects. For example, in a stock database, the histories of the opening price, closing price, the high for the day, the low for the day, and the trading volume may be associated with each stock. SDL allows a variety of queries about the shapes of histories. It performs “blurry” matching [14], where the user cares about the overall shape but does not care about specific details. SDL has an efficient implementation based on an index structure for speeding up the execution of SDL queries. Table 2 gives an illustrative alphabet of SDL, where lb and ub are the lower and upper bounds, respectively, of the allowed variation from the initial value to the final value of the transition.

Table 2. An illustrative alphabet A
symbol      description                                          lb     ub
up          slightly increasing transition                       .05    .19
Up          highly increasing transition                         .20    1.0
down        slightly decreasing transition                       –.19   –.05
Down        highly decreasing transition                         –1.0   –.19
appears     transition from a zero value to a nonzero value      0      1.0
disappears  transition from a nonzero value to a zero value      –1.0   0
stable      the final value nearly equal to the initial value    –.04   .04
zero        both the initial and final values are zero           0      0
Figure 14 shows an example of a time sequence. Given the alphabet A, this time sequence may be described, e.g., by two different transition sequences:
(zero appears up up up down stable Down down disappears)
(zero stable up up up down stable Down down stable)
This alphabet can be used for the definition of shapes as follows:
(shape name(parameters) descriptor)
Fig. 14. Time sequence H=(0,0,.02,.17,.35,.50,.45,.43,.15,.03,0). Adopted from [14]
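A sketch of how the transition labels of Table 2 could be assigned to consecutive pairs of values; applied to the sequence H of Fig. 14 it yields the first of the two transition sequences quoted above. The order in which the overlapping conditions are tested is our assumption.

```python
def transition(y0, y1, eps=1e-9):
    """Label one transition using the (lb, ub) bounds of Table 2."""
    d = y1 - y0
    if abs(y0) < eps and abs(y1) < eps:
        return "zero"
    if abs(y0) < eps:
        return "appears"
    if abs(y1) < eps:
        return "disappears"
    if 0.20 <= d <= 1.0:
        return "Up"
    if 0.05 <= d < 0.20:
        return "up"
    if -0.19 <= d <= -0.05:
        return "down"
    if -1.0 <= d < -0.19:
        return "Down"
    if -0.04 <= d <= 0.04:
        return "stable"
    return "?"

H = [0, 0, .02, .17, .35, .50, .45, .43, .15, .03, 0]   # sequence from Fig. 14
print(" ".join(transition(a, b) for a, b in zip(H, H[1:])))
# zero appears up up up down stable Down down disappears
```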
For example, “a spike” can be defined as (shape spike ( ) (concat Up up down Down)), where the list of parameters is empty and concat denotes concatenation. All the symbols of the alphabet correspond to elementary shapes. Complex shapes can be derived by recursive combination of elementary and previously defined shapes. A set of available operators provides multiple choice, concatenation, and multiple and bounded occurrences of shapes in complex shape descriptions. SDL is a natural and powerful language for expressing shape queries with the following syntax [19]: (query (shape history-spec)). Here, shape is the descriptor of the shape to be matched. The history-spec is of the form: history-name, start-time, and end-time. Here history-name specifies the name of the history in which the shape should be matched, and start-time and end-time define the interval on which matching occurs. The
result of the execution of a query is the set of all rules that contain the desired shape in the specified history. In addition, the result also contains the list of subsequences of the history that matched the shape. The approach gives the possibility to retrieve combinations of several shapes in different histories by using the logical operators “and” and “or.” The query language provides the capability to discover important information in time series data bases.
5 Patterns with Human Interpretation

The paper [13] studies the problem of finding rules relating patterns in a time series to other patterns in that series, or patterns in one series to patterns in another series. The patterns are formed from data. The method first forms subsequences by sliding a window through the time series, and then clusters these subsequences using a suitable measure of time series similarity. The discretized version of the time series is obtained by taking the cluster identifiers corresponding to the subsequences. Then rule finding methods are used to obtain rules from the discretized sequences. Figure 15 depicts a simplified example for the time series s = (1, 2, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4). The set of subsequences formed by a sliding window of width 3 is clustered by some clustering method based on a given distance measure between subsequences. The right side of Fig. 15 shows three primitive shapes a1, a2, and a3 obtained after clustering. Replacement of the subsequences by the corresponding names of shapes gives the discretized series D(s) = (a1, a2, a1, a2, a3, a1, a2, a3, a1, a2). The discretization process depends on the choice of window size, on the choice of distance measure, and on the type of clustering algorithm used. The simplest rules discovered from a set of discretized sequences have the format:
if A occurs, then B occurs within time T,
Fig. 15. Example of time series discretization. Adopted from [13]
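A rough sketch of the discretization step of [13]: width-w subsequences are extracted, grouped by a simple offset-invariant distance, and replaced by cluster labels. The greedy threshold clustering below is our own stand-in for the clustering algorithm of the paper; on the toy series of Fig. 15 it happens to reproduce the discretized series D(s) quoted above.

```python
def subsequences(s, w):
    return [s[i:i + w] for i in range(len(s) - w + 1)]

def centred(sub):
    m = sum(sub) / len(sub)
    return [x - m for x in sub]

def dist(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def discretize(s, w=3, threshold=1.0):
    """Greedy, offset-invariant clustering of width-w subsequences; each window
    is replaced by the label of the first cluster centre within the threshold."""
    centres, labels = [], []
    for sub in map(centred, subsequences(s, w)):
        for j, c in enumerate(centres):
            if dist(sub, c) <= threshold:
                labels.append("a%d" % (j + 1))
                break
        else:
            centres.append(sub)
            labels.append("a%d" % len(centres))
    return labels

s = [1, 2, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4]   # series from Fig. 15
print(discretize(s))
# ['a1', 'a2', 'a1', 'a2', 'a3', 'a1', 'a2', 'a3', 'a1', 'a2']
```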
where A and B are identifiers of patterns (clusters of patterns) discovered in the time series. In short form, this rule may be written as $A \stackrel{T}{\Rightarrow} B$. Frequency, confidence, and the J-measure were used for selecting interesting rules [13, 39]. The approach was applied to several data sets. As an example, from the daily closing share prices of ten database companies traded on the NASDAQ, the following significant rule was found: $s_{18} \stackrel{20}{\Rightarrow} s_{4}$. The patterns s18 and s4 are shown in Fig. 16. These patterns were obtained for window size w = 13. An interpretation of the rule is that a stock which follows the 2.5-week declining pattern s18, “sharper decrease and then leveling out,” will likely incur a “short sharp fall” within 4 weeks (20 days) before leveling out again (the shape of s4).
Fig. 16. Example of patterns of significant rule. Adopted from [13]
As stressed by the authors, the proposed technique is essentially intended as an exploratory method; thus, iterative and interactive application of the method, coupled with human interpretation of the rules, is likely to lead to more useful results than any fully automated approach [13].
6 Temporal Relationships Between Patterns

The approach to knowledge discovery from multivariate time series developed in [40] consists of several stages. Initially, the time series are segmented and transformed into sequences of state intervals (b_i, s_i, f_i), i = 1, …, n. Here, s_i are time series states like increasing, decreasing, constant, highly increasing, and convex, holding during the time periods (b_i, f_i), where b_i ≤ b_{i+1} and b_i < f_i. It is required that every state is maximal in the sense that there are no state intervals in the series with the same state which overlap or meet each other. The temporal relationships between state intervals are described by the 13 temporal relationships of Allen's interval logic [41] shown in Fig. 17.
Fig. 17. Allen’s interval relationships. Adopted from [27]
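A sketch of how the basic Allen relations between two state intervals (b, f) can be tested; the seven "forward" relations are covered, and the inverse relations follow by swapping the arguments. The representation of state intervals as closed number pairs is an assumption.

```python
def allen_relation(i, j):
    """Allen relation of interval i = (b1, f1) with respect to j = (b2, f2);
    covers the seven forward relations, inverses follow by swapping arguments."""
    b1, f1 = i
    b2, f2 = j
    if f1 < b2:
        return "before"
    if f1 == b2:
        return "meets"
    if b1 == b2 and f1 == f2:
        return "equals"
    if b1 == b2 and f1 < f2:
        return "starts"
    if b1 > b2 and f1 == f2:
        return "finishes"
    if b1 > b2 and f1 < f2:
        return "during"
    if b1 < b2 < f1 < f2:
        return "overlaps"
    return "inverse of one of the above"

# Toy state intervals, e.g. 'convex air pressure' vs 'highly decreasing air pressure'.
print(allen_relation((0, 6), (4, 9)))    # overlaps
print(allen_relation((4, 9), (4, 9)))    # equals
print(allen_relation((4, 9), (9, 12)))   # meets
```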
Finally, rules with frequent temporal patterns in the premise and conclusion are derived. The method was applied to time series of air-pressure and wind strength/wind direction data [40]. The smoothed time series have been partitioned into segments with primitive patterns like very highly increasing, constant, and decreasing. Below is an example of an association rule generated by the proposed approach:
convex air pressure, highly decreasing air pressure, decreasing air pressure → highly increasing wind strength,
where the following temporal relationships take place:
1. Convex air pressure "overlaps" highly decreasing air pressure
2. Highly decreasing air pressure "equals" decreasing air pressure
3. Decreasing air pressure "meets" highly increasing wind strength
The proposed methodology may support a human in learning from temporal data. The meaningful rules obtained by the described technique can be used together with expert knowledge in the construction of an expert system [40]. A generalization of the proposed approach may be based on a fuzzy extension of Allen's interval algebra and on a formalization of relations between fuzzy time intervals [42–50].
7 Perception-Based Patterns in Expert Knowledge, Summaries, and Forecasting Texts

In this section we consider systems which use a preliminary analysis of human expert rules, forecasts, and summary texts to relate human perceptions to time series patterns and, finally, to generate texts in a form similar to human descriptions in the considered problem area. General questions of the generation of fuzzy linguistic summaries in data bases are discussed in [59–61]. The problem of analysis of intentions in utterances is discussed in [62]. A rule-based fuzzy expert system, WXSYS, attempts to predict local weather based on conventional wisdom [51]. Below are examples of expert rules used in the system:
Weather will be generally clear when the wind shifts to a westerly direction. The greatest change occurs when the wind shifts from east through south to west.
Generally, if the barometer falls steadily and the wind comes from an easterly quarter, expect foul weather.
If the current wind is blowing from S to SW and the current barometric pressure is rising from 30.00 or below, then the weather will be clearing within a few hours, then fair for several days.
This system is realized in FuzzyClips and contains formalized descriptions of fuzzy expert rules. For pattern formalization, it is necessary to relate the rules to fuzzy concepts and to patterns from time series of weather parameters like pressure, wind, etc. Another approach uses a special grammar for the generation of summaries based on patterns retrieved from summaries generated by human experts. The first system that generated descriptions of stock market behavior, called Ana, was described in [52]. Data from a Dow Jones stock quotes database serve as input to the system, and the opening paragraphs of a stock market summary are produced as output. As more semantic and linguistic knowledge about the stock market is added to the system, it is able to generate longer, more informative reports. Figure 18 depicts a portion of the real data submitted to Ana for January 12, 1983. The following text sample is one of the possible interpretations of the data generated by Ana:
Fig. 18. Example of stock data used by Ana
Wall Street’s securities markets rose steadily through most of the morning, before sliding downhill late in the day. The stock market posted a small loss yesterday, with the indexes finishing with mixed results in active trading. The Dow Jones average of 30 industrials surrendered a 16.28 gain at 4pm and declined slightly, to finish at 1083.61, off 0.18 points.
The more extended system called StockReporter is discussed in [53, 54]. This system is one of a number of online text generation systems developed to produce textual descriptions of numeric data sets. The StockReporter project is heavily influenced by Karen Kukich's work [52]. In contrast to Ana, StockReporter produces reports that incorporate both text and graphics. It reports on the behavior of any one of 100 US stocks and on how that stock's behavior compares with the overall behavior of the Dow Jones Index or the NASDAQ. StockReporter takes numeric data that describe the performance of a particular stock and produces from these data a textual summary that describes how the stock performed over a user-specified reporting period. It can generate a text like the following:
Microsoft avoided the downwards trend of the Dow Jones average today. Confined trading by all investors occurred today. After shooting to a high of $104.87, its highest price so far for the month of April, Microsoft stock eased to finish at an enormous $104.37. The Dow closed after trading at a weak 5682, down 6 points.
Another system generating short (a few sentences) summaries of large (100 KB or more) time-series data sets is described in [55]. The architecture integrates pattern recognition, pattern abstraction, selection of the most
significant patterns, microplanning (especially aggregation), and realization. SumTime-Turbine is a prototype system which uses this architecture to generate textual summaries of sensor data from gas turbines. Figure 19 shows the ontology of patterns used by this system. The goal is to classify patterns into the ontology, not to identify specific types of patterns. SumTime-Turbine's pattern analysis components can operate with different temporal granularities, e.g., 1, 5, and 10 s. For example, when using a temporal granularity of 1 s, the pattern concepts (such as "dip with oscillatory recovery") used by experts while examining data visualized at a 1-s time scale are applied. When using a temporal granularity of 5 s, the concepts used by experts while examining data visualized at a 5-s time scale are applied [55]. The pattern recognition algorithm is composed of a pattern locator and a pattern classifier. The algorithm uses the shape description language described in [14] (see Sect. 4 of this chapter). Below is an example of sentences generated to describe patterns:
There were large erratic oscillations with short period in all channels at 18:17, large spikes in all channels at 18:40, 18:48 and 20:21. There were variant patterns in all channels at 18:03, 19:30 and 20:44.
Finally, we cite some perception-based weather forecasts generated by weather.com [56]: "Scattered thunderstorms ending during the evening," "Skies will become partly cloudy after midnight," "Occasional showers possible." Such perception-based forecasts support the main idea of Computing with Words and Perceptions (CWP), where the inputs and/or outputs of a decision making system are words [1]. It would be interesting to find application areas in economics and finance where such perception-based forecasting plays an important role.
Fig. 19. Patterns ontology in the gas turbine domain. Adopted from [55]
8 Toward Perception-Based Time Series Data Mining

Many decision making procedures in economics and finance use expert knowledge defined on time series data base domains. This knowledge can serve as a basis for the development of intelligent decision making systems integrating expert knowledge with computing with words and perceptions [1] and perception-based time series data mining (see Fig. 20). The advantage of such a system over a human expert will consist in the capability of real-time processing of gigabytes of information in the permanently changing situations typical of economics and financial markets. The role of CWP in such systems is to realize the human decision making procedures and reasoning mechanisms given in expert knowledge. Expert knowledge usually uses fuzzy perceptions, and the role of perception-based time series data mining (PTSDM) is to support CWP by extracting from the TSDB perception-based patterns and associations relevant to decision making
Fig. 20. Architecture of an intelligent decision making system based on expert knowledge in time series data base domains: Expert Knowledge, Computing with Words and Perceptions, and Perception-Based Time Series Data Mining operating over a Time Series Data Base
procedures. As shown in this chapter, perception-based patterns are considered in many papers, and the developed approaches for manipulating such patterns can be used as a basis of PTSDM. Some of these approaches and efficient TSDM algorithms should be adapted for extracting fuzzy perception-based information useful in the decision making models of computing with words and perceptions.
9 Conclusions

In spite of the growing number of applications of TSDM, we have only begun to scratch the surface in realizing the full benefits of these technologies using perception-based information. We showed the role of linguistic perception-based patterns defined on time series domains in the representation of expert knowledge in a wide range of application areas. Different approaches to the description of such patterns use the signs of derivatives, scaling of trends and shapes, linguistic interpretation of patterns obtained as a result of clustering, a grammar for the generation of complex patterns from shape primitives, and temporal relations between patterns. Several approaches to relating linguistic descriptions of experts to automatically generated texts of summaries and linguistic forecasts were considered. Semantic imprecision of natural languages is a concomitant of the imprecision of perceptions [64]. For this reason, the approaches considered in this chapter may be extended by fuzzy granulation of time series patterns to make them more adequate to the perceptions used in human reasoning. Perception-based time series data mining, together with CWP and natural language computation [64], can serve as a basis for the construction of intelligent decision making systems that use expert knowledge in time series data base domains.
10 Acknowledgment

The support for this research work has been provided by the IMP, projects D.00006 and D.00322.
References 1. Zadeh L.A. (1999) From computing with numbers to computing with words – from manipulation of measurements to manipulation of perceptions. IEEE Transactions on Circuits and Systems 1: Fundamental Theory and Applications, vol. 45, 105–119 2. Zadeh L.A. (2001) A new direction in AI: Toward a computational theory of perceptions. AI Magazine, Spring 2001, 73–84 3. Zadeh L.A. (2002) Toward a perception-based theory of probabilistic reasoning with imprecise probabilities. Journal of Statistical Planning and Inference vol. 105, 233–264 4. Zadeh L.A. (1997) Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, vol. 90, 111–127 5. Zadeh L.A. (2003) Web intelligence and fuzzy logic – the concept of Web IQ (WIQ). WI’03 and IAT’03 Keynote Talk, Halifax, Canada, October 2003 6. Jang J.-S.R., Sun C.T., Mizutani E. (1997) Neuro-Fuzzy and Soft Computing. A Computational Approach to Learning and Machine Intelligence. Prentice-Hall, NJ, USA 7. Kosko B. (1997) Fuzzy Engineering. Prentice-Hall, NJ, USA 8. Klir G.J., Clair U.S., Yuan B. (1997) Fuzzy Set Theory: Foundations and Applications, Prentice Hall, NJ, USA 9. Zadeh L.A. (1973) Outline of a new approach to the analysis of complex systems and decision processes. IEEE Transactions on Systems, Man and Cybernetics SMC-3, 28–44 10. Hand D., Manilla H., Smyth P. (2001) Principles of Data Mining. MIT, Cambridge 11. KDnuggets: Polls: Time-Series Data Mining (Nov 2004) What Types of TimeSeries Data Mining You’ve Done? http://www.kdnuggets.com/polls/2004/ time_series_data_mining.htm 12. Lin J., Keogh E., Lonardi S., Chiu B. (2003) A symbolic representation of time series, with implications for streaming algorithms. Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. San Diego, CA 13. Das G., Lin K.I., Mannila H., Renganathan G., Smyth P. (1998) Rule discovery from time series. Proceedings of KDD98, 16–22 14. Agrawal R., Psaila G., Wimmers E.L., Zait M. (1995) Querying shapes of histories. Proceedings of the 21st International Conference on Very Large Databases, VLDB ’95, Zurich, Switzerland, 502–514
15. Sripada S.G., Reiter E., Hunter J., Yu J. (2002) Segmenting time series for weather forecasting. Proceedings of ES2002, 193–206 16. Cohen P., Adams N. (2001) An algorithm for segmenting categorical time series into meaningful episodes. Proceedings of the Fourth International Symposium on Intelligent Data Analysis, Lisbon Portugal 17. Keogh E.J., Chu S., Hart D., Pazzani M. (2001) An online algorithm for segmenting time series. Proceedings of IEEE International Conference on Data Mining, 289–296 18. Agrawal R., Faloutsos C., Swami A. (1993) Efficient similarity search in sequence databases. Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, Chicago, 69–84 19. Agrawal R., Psaila G. (1995) Active data mining. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montreal 20. Last M., Klein Y., Kandel A. (2001) Knowledge discovery in time series databases. IEEE Transactions on Systems, Man, and Cybernetics, vol. 31B, 160– 169 21. Cheung J.T., Stephanopoulos G. (1990) Representation of process trends. Part I. A formal representation framework. Computers and Chemical Engineering, vol. 14, 495–510 22. Cheung J.T. (1992) Representation and extraction of trends from process data. D.Sci.Th., Massachusetts Institute of Technology, Cambridge/MA, USA 23. Kivikunnas S. (1999) Overview of process trend analysis methods and applications. Proceedings of Workshop on Applications in Chemical and Biochemical Industry. Aachen, Germany 24. Colomer J., Melendez J., De la Rosa J.L., Aguilar J. (1997) A qualitative/ quantitative representation of signals for supervision of continuous systems. Proceedings of European Control Conference-ECC97, Brussels 25. Colomer J. (1998) Representacio Qualitativa Asincrona de Senyals Per a la Supervisio Experta de Processos, Ph.D. dissertation, University of Girona (UdG), Catalonia, Spain 26. Konstantinov K.B., Yoshida T. (1992) Real-time qualitative analysis of the temporal shapes of (bio) process variables. American Institute of Chemical Engineers Journal vol. 38, no. 11, 1703–1715 27. Höppner F. (2003) Knowledge Discovery from Sequential Data. Dissertation. Braunschweig University 28. Forbus K.D. (1984) Qualitative process theory. Artificial Intelligence, vol. 24, 85–168 29. Kuipers B. (1984) Commonsense reasoning about causality: deriving behavior from structure. Artificial Intelligence, vol. 24, 169–203
30. Batyrshin I., Wagenknecht M. (2002) Towards a linguistic description of dependencies in data. International Journal of Applied Mathematics and Computer Science. Special Issue on Computing with Words and Perceptions (ed. by D. Rutkowska, J. Kacprzyk, L.A. Zadeh), vol. 12, no. 3, 391–401 31. Batyrshin I., Herrera-Avelar R., Sheremetov L., Suarez R. (2004) On qualitative description of time series based on moving approximations. Proceedings of the International Conference on Fuzzy Sets and Soft Computing in Economics and Finance, FSSCEF 2004, St. Petersburg, Russia, vol. I, 73–80 32. Batyrshin I., Herrera-Avelar R., Sheremetov L., Panova A. Moving approximation transform and local trend associations in time series data bases. In this book. 33. Federal Reserve Board, http://www.federalreserve.gov/rnd.htm 34. Boyd S. (1998) TREND: A system for generating intelligent descriptions of timeseries data. In Proceedings of the IEEE International Conference on Intelligent Processing Systems (ICIPS1998) 35. Bezdek J.C. (1993) Fuzzy models and digital signal processing (for pattern recognition): Is this a good marriage?. Digital Signal Processing, vol. 3, no. 4, 253–270 36. Stockman G., Kanal L., Kyle M.C. (1976) Structural pattern recognition of carotid pulse waves using a general waveform parsing system. CACM 19, 2, 688–695 37. Baldwin J.F., Martin T.P., Rossiter J.M. (1998) Time series modelling and prediction using fuzzy trend information. Proceedings of the Fifth International Conference on Soft Computing and Information/Intelligent Systems, 499–502 38. Baldwin J.F., Martin T.P., Pilsworth B.W. (1995) Fril – Fuzzy and Evidential Reasoning in Artificial Intelligence. Research Studies Press Ltd 39. Smyth P., Goodman R. M. (1991) Rule induction using information theory. In: Knowledge Discovery in Databases, MIT, Cambridge, MA, Chapter 9, 159–176 40. Höppner F. (2001) Learning temporal rules from state sequences. IJCAI Workshop on Learning from Temporal and Spatial Data, Seattle, USA, 25–31 41. Allen J.F. (1983) Maintaining knowledge about temporal intervals. Communications of the ACM, vol. 26, no. 11, 832–843 42. Ohlbach H.J. (2004) Relations between fuzzy time intervals. Proceedings of 11th International Symposium on Temporal Representation and Reasoning, Tatihoui, Normandie, France 43. Nagypál G., Motik B. (2003) A fuzzy model for representing uncertain, subjective, and vague temporal knowledge in ontologies. Proceedings of the International Conference on Ontologies, Databases and Applications of Semantics, (ODBASE), volume 2888 of LNCS. Springer, Berlin Heidelberg New York, 906–923
44. Dubois D., Prade H. (1989) Processing fuzzy temporal knowledge. IEEE Transactions on Systems, Man and Cybernetics, vol. 19, 729–744 45. Dubois D., Prade H. (1986) Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum, New York 46. Kurutach W. (1995) Modelling fuzzy interval-based temporal information: a temporal database perspective. Proceedings of 1995 IEEE International Conference on Fuzzy Systems, Yokohama, Japan, 741–748 47. Godo L., Vila L. (1995) Possibilistic temporal reasoning based on fuzzy temporal constraints. IJCAI’95: Proceedings International Joint Conference on Artificial Intelligence, Montreal 48. Dutta S. (1988) An event-based fuzzy temporal logic. Proceedings of the 18th IEEE International Symposium on Multiple-Valued Logic, Palma de Mallorca, Spain, 64–71 49. Badaloni S., Giacomin M. (2000) A fuzzy extension of Allen’s interval algebra. In E. Lamma, P. Mello (Eds.), AI*IA99: Advances in Artificial Intelligence, Selected Papers – Lecture Notes in Artificial Intelligence, 1792, 155–165, Springer, Berlin Heidelberg New York 50. Badaloni S., Giacomin M. (2006) The algebra IAfuz: a framework for qualitative fuzzy temporal reasoning. Artificial Intelligence, vol. 170, 872–908, Elsevier 51. Maner W., Joyce S. (1997) WXSYS: Weather Lore + Fuzzy Logic = Weather Forecasts. Presented at the 1997 CLIPS Virtual Conference (http://web.cs.bgsu.edu/maner/wxsys/wxsys.htm) 52. Kukich K. (1983) Design of a knowledge-based report generator. Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics (ACL1983), 145–150 53. StockReporter. http://www.ics.mq.edu.au/~ltgdemo/StockReporter/about.html 54. Reiter E., Dale R. (2000) Building Natural Language Generation Systems, (Studies in Natural Language Processing). Cambridge University Press, Cambridge 55. Yu J., Reiter E., Hunter J., Mellish C. (2007) Choosing the content of textual summaries of large time-series data sets. Natural Language Engineering. (To appear) 56. www.weather.com 57. Pons O., Vila M. A., Kacprzyk J. (Eds.). (2000) Knowledge Management in Fuzzy Databases, Physica, Wurzburg 58. Kandel A., Last M., Bunke H. (Eds). (2001) Data Mining and Computational Intelligence, Studies in Fuzziness and Soft Computing, vol. 68, Physica, Wurzburg 59. Kacprzyk J., Zadrożny S. (1998) Data Mining via Linguistic Summaries of Data: An Interactive Approach, In T. Yamakawa and G. Matsumoto (Eds.):
Methodologies for the Conception, Design and Application of Soft Computing (Proceedings of IIZUKA’98, Iizuka, Japan), 668–671
60. Yager R.R. (1991) On linguistic summaries of data, In: Piatetsky-Shapiro G. and Frawley B. (Eds.): Knowledge Discovery in Databases, MIT, Cambridge, MA, 347–363
61. Yager R.R. (1995) Fuzzy summaries in database mining. Proceedings of the 11th Conference on Artificial Intelligence for Applications, Los Angeles, USA, 265–269
62. Allen J.F., Perrault C.R. (1980) Analyzing intention in utterances. Artificial Intelligence, vol. 15, 143–178
63. Batyrshin I., Sheremetov L. Perception-based functions in qualitative forecasting. In this book
64. Zadeh L.A. Computation with information described in natural language – the concept of generalized-constraint-based computation. International Conference on Computational Intelligence for Modelling Control and Automation – CIMCA’2005, Vienna, Austria, http://csdl2.computer.org/comp/proceedings/cimca/2005/2504/01/25041xxx.pdf
Perception-Based Functions in Qualitative Forecasting

I. Batyrshin and L. Sheremetov
Summary. A perception-based function (PBF) is a fuzzy function obtained as a result of the reconstruction of human judgments given by a sequence of rules R_k: If T is T_k then S is S_k, where T_k are perception-based intervals defined on the domain of the independent variable T, and S_k are perception-based shape patterns of the variable S on the interval T_k. The intervals T_k can be expressed by words like Between N and M, Approximately M, Middle of the Day, End of the Week, etc. The shape patterns S_k can be expressed linguistically, e.g., as follows: Very Large, Increasing, Quickly Decreasing and Slightly Concave, etc. A PBF differs from the Mamdani fuzzy model, which defines a crisp function usually obtained as a result of tuning of function parameters in the presence of training crisp data. A PBF is used for the reconstruction of human judgments when testing data are absent or scarce. Such a reconstruction is based mainly on scaling and granulation of human knowledge. PBFs can be used in Computing with Words and Perceptions for qualitative evaluation of relations between variables. In this chapter we discuss the application of PBFs to qualitative forecasting of a new product life cycle. We consider new parametric patterns used for modeling convex–concave shapes of PBFs and propose a method of reconstruction of a PBF with these shape patterns. These patterns can also be used for time series segmentation in perception-based time series data mining.
1 Introduction

Expert evaluations of tendencies, shapes of curves, etc. are widely used in different domains when it is necessary to predict the change in time of parameters of dynamic processes, such as technological, financial, economic, meteorological, etc. Several qualitative forecasting methods were developed to use the opinions of experts to subjectively predict future
events [7]: subjective curve fitting [7], the Delphi method [8], time-independent technological comparisons [9], the cross-impact method [10], the relevance tree method [13], the morphological research method [18], etc. These methods are usually used when historical data either are not available or are scarce, for example to forecast sales of a new product or to predict if and when new technologies will be discovered and adopted [7]. In this chapter we propose methods of modeling expert knowledge and judgments about shapes of functions and time series by a fuzzy perception-based function (PBF) [2, 3, 15, 16]. A PBF is a fuzzy function obtained as a result of the reconstruction of human judgments given by a sequence of rules R_k: If T is T_k then S is S_k, where T_k are perception-based intervals defined on the domain of the independent variable T, and S_k are perception-based shape patterns of the variable S on the interval T_k. The intervals T_k can be expressed, for example, by words like Between N and M, Approximately M, Middle of the Day, End of the Week, etc. The shape patterns S_k can be expressed linguistically, e.g., as follows: Very Large, Increasing, Quickly Decreasing and Slightly Concave, etc. Human decision making is often based on perception-based evaluations of tendencies of prices, sale volumes, oil or gas production of reservoir wells, weather forecasts, etc. Some of such evaluations are based on statistical data and can be modeled by statistical methods, but some of them use expert opinions, commonsense knowledge, perceptions based on analogy, etc. As a simple example we can consider the forecasting problem of a retail dealer who uses the following perception-based information:
In spring the sugar price increases slowly. In summer the sugar price increases rapidly. What will be the sugar price at the end of July if on the 5th of April it is equal to X?
The methods of modeling perception-based information by a PBF and the solution of a perception-based initial value problem were considered in [3]. A PBF differs from the Mamdani fuzzy model [11], which defines a crisp function and is usually obtained as a result of tuning of function parameters in the presence of training crisp data. A PBF is used for the reconstruction of human judgments when testing data are absent or scarce. Such a reconstruction is based mainly on scaling and granulation of human knowledge. The methods of reconstruction of PBFs were developed in the framework of the Computational Theory of Perceptions (CTP) and Computing with Words and Perceptions (CWP) [2–5, 15–17] for the description of qualitative relations between variables. A set of rules with perception-based trends like Increasing, Slowly Decreasing, etc. is considered as a linguistically given derivative. A reconstruction of a fuzzy functional
dependence based on such rules is considered as the solution of a granular initial value problem [3]. Generally, boundary conditions or additional information can be defined [4]. In this chapter we discuss the application of PBFs to modeling a qualitative forecast of a new product life cycle when experimental data are absent and only expert opinion is used. In Sect. 2 we consider methods of scaling and fuzzy granulation of trends and directions. In Sect. 3 we discuss new methods of parametric granulation of convex–concave patterns. In Sect. 4 we propose a new method of reconstruction of a PBF. The application of PBFs to qualitative forecasting of new product sales is considered in Sect. 5. In the Conclusions we discuss the advantages of applying PBFs to qualitative forecasting.
2 Granulation of Trends and Directions in Perception-Based Functions

A perception-based function (PBF) is given by a set of rules [4]:
R_k: If T is T_k then Y is S_k, (k = 1, …, n),
where T_k are fuzzy intervals defined on the domain of the independent variable T, and S_k are perception-based shape patterns of the variable Y on the intervals T_k. As basic shape patterns we will consider perception-based trends like QUICKLY DECREASING, SLOWLY INCREASING, etc., which define directions of function change on given time intervals. Based on such trend patterns we will introduce convex–concave shape patterns like QUICKLY DECREASING AND SLOWLY CONCAVE. Reconstruction of human evaluations of trends requires their scaling and fuzzy granulation. As an example, the following scale of trends and directions may be considered: LD = <EXTREMELY QUICKLY DECREASING, QUICKLY DECREASING, DECREASING, SLOWLY DECREASING, CONSTANT, SLOWLY INCREASING, INCREASING, QUICKLY INCREASING, EXTREMELY QUICKLY INCREASING>. In an abbreviated form we will write this scale as LD = <1, 2, …, 9>. Figure 1 shows the possible axes of directions corresponding to this scale.
Fig. 1. Axis of directions
A fuzzy granulation of directions can be done by several methods [2, 3]. Denote the angle of the axis of the direction d_i as ϕ_i. For each value of the increment ∆t > 0 the corresponding value of ∆y located on the axis of the direction d_i is equal to ∆y_i = ∆t · tan(ϕ_i). Then the fuzzy set of directions at the point ∆t associated with d_i may be calculated, for example, as a generalized bell membership function [11]:

µ_{d_i}(∆y) = \frac{1}{1 + \left| \frac{∆y − ∆y_i}{a} \right|^{2b}},
where a and b are parameters of the fuzzy set. This method of construction of a fuzzy granular direction is called a cylindrical extension [3, 15]. Another method of fuzzy granulation of directions, called a proportional extension, is based on the extension principle of Zadeh [3]:

µ^{prop}_{d_i}(∆t, ∆y) = µ_{P_i}\!\left( \frac{∆y}{∆t} \right),

where P_i is a fuzzy set of slope values corresponding to the fuzzy granulation of the direction d_i. Examples of fuzzy proportional and cylindrical granulation of the directions Constant and Decreasing based on generalized bell membership functions are shown in Fig. 2a, b, respectively.

Fig. 2. (a) Proportional and (b) cylindrical extensions of directions Constant and Decreasing
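To make the two granulation methods concrete, the following MATLAB sketch computes both membership surfaces for a single direction d_i; the angle ϕ_i, the bell parameters a and b, and the grid of increments are hypothetical values chosen only for this illustration.

phi_i = pi/8;                   % assumed direction angle (e.g. a slowly increasing trend)
a = 0.5; b = 2;                 % assumed parameters of the generalized bell function
bell = @(u, c) 1 ./ (1 + abs((u - c)./a).^(2*b));    % generalized bell membership function
dt = 0.1:0.1:5;                 % positive increments of the independent variable
dy = -5:0.1:5;                  % increments of the dependent variable
[DT, DY] = meshgrid(dt, dy);
mu_cyl  = bell(DY, DT.*tan(phi_i));   % cylindrical extension: centred on dy_i = dt*tan(phi_i)
mu_prop = bell(DY./DT, tan(phi_i));   % proportional extension: depends only on the slope dy/dt
surf(DT, DY, mu_cyl); xlabel('\Delta t'); ylabel('\Delta y'); zlabel('\mu');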
3 Granulation of Convex–Concave Patterns

Perceptual patterns like QUICKLY INCREASING AND SLOWLY CONCAVE, SLOWLY DECREASING AND STRONGLY CONVEX, etc. require scaling and granulation of convex–concave (CC-) patterns. The following is an example of a linguistic scale of CC-patterns: LCC = <STRONGLY CONCAVE, CONCAVE, SLIGHTLY CONCAVE, LINEAR, SLIGHTLY CONVEX, CONVEX, STRONGLY CONVEX>. In [5] these grades were correspondingly represented by the codes of CC-degree: CCD = <−3, −2, −1, 0, +1, +2, +3>. The sign of the CCD code denotes (1) the sign of the second derivative of the corresponding pattern of the function and (2) the type of concave (−) or convex (+) modification applied to the direction of function change.
The absolute value of the CCD code denotes the intensity of the CC-modification. The following CC-modifications are based on Zadeh's operation of contrast intensification:

Conc(y) = y_2 − \frac{(y_2 − y)^2}{y_2 − y_1},   Conv(y) = y_1 + \frac{(y − y_1)^2}{y_2 − y_1},
where Conc(y) is a concave modification and Conv(y) is a convex modification of the function y, and y_1, y_2 are the minimal and the maximal values of y(t) on the considered interval of the input variable t. These two functions will be further called BZ-modifications. For example, the pattern f(t) corresponding to the term STRONGLY CONCAVE has the code −3 and will be obtained from a linear function y(t) corresponding to the direction of function change as a result of a triple application of the concave modification: f = Conc(Conc(Conc(y))). Figure 3a shows CC-patterns obtained by this method from linear functions corresponding to the directions 7:INCREASING and 4:SLOWLY DECREASING. For example, the perception-based pattern INCREASING AND STRONGLY CONCAVE is represented by the uppermost curve in Fig. 3a and is calculated by f = Conc(Conc(Conc(y))), where y is the linear function corresponding to the direction 7:INCREASING. The perception-based pattern SLOWLY DECREASING AND STRONGLY CONVEX is represented by the lowermost curve in Fig. 3a and is calculated by f = Conv(Conv(Conv(y))), where y is the linear function corresponding to the direction 4:SLOWLY DECREASING. Here we consider two new, more flexible parametric methods of convex–concave granulation [5]. The first method uses the following concave Conc_p and convex Conv_p modifications of a linear function y:

Conc_p(y) = y_2 − \left( (y_2 − y_1)^p − (y − y_1)^p \right)^{1/p},   Conv_p(y) = y_1 + \left( (y_2 − y_1)^p − (y_2 − y)^p \right)^{1/p},
where p is a parameter, p∈(0,1]. It is clear that Conc1 = Conv1= I, where I is the identity function: I(y)=y. This method is called BY-modification. Figure 3b shows CC-patterns in directions 7:INCREASING and 4:SLOWLY DECREASING corresponding to all grades of the linguistic scale LCC and obtained by functions , respectively.
Fig. 3. Granulation of convex–concave patterns in directions 7:INCREASING (dashed line) and 4:SLOWLY DECREASING (dotted line) obtained by (a) BZmodification, (b) BY-modification, and (c) BS-modification
The second method is called BS-modification and consists of the following concave Conc_s and convex Conv_s modifications of a linear function y:

Conc_s(y) = y_1 + \frac{(y_2 − y_1)(y − y_1)}{(y_2 − y_1) + s(y_2 − y)},   Conv_s(y) = y_2 − \frac{(y_2 − y_1)(y_2 − y)}{(y_2 − y_1) + s(y − y_1)},
where s ∈ (−1, 0]. We have Conc_0 = Conv_0 = I. Figure 3c shows CC-patterns of BS-modifications in the directions 7:INCREASING and 4:SLOWLY DECREASING corresponding to all grades of the linguistic scale LCC and obtained by functions , respectively. As we can see, the BY- and BS-modifications for the chosen values of the parameters p and s give more or less similar CC-curves. The tuning of these parameters may be used in fuzzy modeling by PBF. The linear and convex–concave patterns may be used for crisp and fuzzy modeling of time series and perception-based shape patterns. In comparison with the methods of granulation of time series shape patterns falling less steeply, rising more steeply, etc., considered in [1, 6], the
methods of granulation of convex–concave patterns considered in this section give the possibility to modify any linear function defined on any interval. Such flexibility gives these methods an advantage not only in modeling perception-based functions but also in modeling and segmentation of time series. Fuzzy granulation of CC-patterns can be based on the extension principle of Zadeh or on a cylindrical extension of the shape patterns. This granulation can be defined parametrically, depending on the type and parameters of the fuzzy set used for fuzzification. The methods of reconstruction of a PBF given by a set of rules with perception-based shape patterns are considered in the following section.
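The three modification families can be illustrated with a short MATLAB sketch; the interval [y1, y2], the parameter values p and s, and the sampling grid are assumptions made only for the example.

y1 = 16; y2 = 30;                                    % assumed range of the segment
t  = linspace(-5, 5, 101);
y  = y1 + (y2 - y1)*(t - t(1))/(t(end) - t(1));      % linear INCREASING pattern
% BZ-modification (Zadeh contrast intensification)
conc_bz = @(y) y2 - (y2 - y).^2/(y2 - y1);
conv_bz = @(y) y1 + (y - y1).^2/(y2 - y1);
% BY-modification with parameter p in (0,1]
p = 0.5;
conc_by = @(y) y2 - ((y2 - y1)^p - (y - y1).^p).^(1/p);
conv_by = @(y) y1 + ((y2 - y1)^p - (y2 - y).^p).^(1/p);
% BS-modification with parameter s in (-1,0]
s = -0.5;
conc_bs = @(y) y1 + (y2 - y1)*(y - y1)./((y2 - y1) + s*(y2 - y));
conv_bs = @(y) y2 - (y2 - y1)*(y2 - y)./((y2 - y1) + s*(y - y1));
% STRONGLY CONCAVE (CCD code -3): triple application of the BZ concave modification
f3 = conc_bz(conc_bz(conc_bz(y)));
plot(t, y, t, f3, t, conc_by(y), t, conc_bs(y), t, conv_by(y));
legend('linear', 'BZ concave x3', 'BY concave', 'BS concave', 'BY convex');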
4 Reconstruction of PBF

The methods of reconstruction of perception-based functions were proposed in [2, 3]. In these methods it was supposed that, in addition to the rules Rk: If T is Tk then Y is Sk, (k = 1, …, n), there is given an initial value If T is T0 then Y is Y0, where T0 and Y0 are fuzzy sets defined on the domains of the argument and the function values. Also, a fuzzy set defining a fuzzy granulation of the pattern was associated with each shape pattern. The initial value is sequentially extended from rule to rule along the shape patterns given in the consequents of the rules. These fuzzy shape patterns are finally aggregated into a fuzzy function describing the whole set of rules. Figure 4 depicts an example of reconstruction of a perception-based function given by the following set of rules:
R0: If X is Approx. 0 then Y is Approx. 10
R1: If X is Small then Y is Increasing and Convex
R2: If X is Medium then Y is Slowly Increasing and Concave
R3: If X is Large then Y is Quickly Decreasing and Slightly Concave.
This reconstruction uses a cylindrical extension of the fuzzy initial value given in the rule R0.
Fig. 4. Steps of reconstruction of perception-based function given by three rules and fuzzy initial value (a) fuzzy time intervals; (b–d) perception-based patterns defined on these intervals, (e) aggregated PBF
Here we propose a new method of reconstruction of PBF when the initial value is given by a crisp number (T0, Y0) and the fuzzification of the PBF is performed after concatenation of all crisp shape patterns given in the consequents of the rules. Without loss of generality we suppose that T0 coincides with the left border of the time domain. We suppose that the fuzzy intervals Tk in the premises of the fuzzy rules define a fuzzy partition of the time domain. This means that they are represented by normalized convex fuzzy sets with non-intersecting cores [3, 12]. The membership functions of the fuzzy intervals are monotone increasing and monotone decreasing on the left and on the right sides of the cores, respectively, and equal to 1 in the core points (see, for example, Fig. 4a). Such fuzzy intervals have a natural ordering corresponding to the ordering of the cores, and we will suppose that this ordering corresponds also to the ordering of the indexes of the fuzzy intervals Tk. The fuzzy partition defines a set of knots, i.e., the points where the intersection of
membership functions of neighboring intervals has a maximal value. The borders of the time domain are also considered as knots. The method starts with the rule R1 and constructs the corresponding CC-pattern on the interval defined by the first two knots, starting from the initial value (T0, Y0). The second perception-based pattern, given in the consequent of the second rule, is constructed on the interval given by the next pair of knots. The final value of the previous pattern is used as the initial value for this pattern. The process is repeated until all patterns given in the consequents of the rules have been constructed. For the construction of CC-patterns one of the methods considered in Sect. 3 can be used. Aggregation of all constructed patterns gives a crisp perception-based function. As the aggregation, in the simplest case we can use a concatenation of the shape patterns. A fuzzy granulation of the resulting function can be obtained by a cylindrical extension of some fuzzy set along this function. This fuzzy set can evaluate the uncertainty in the reconstruction of the PBF. An example of such a reconstruction of PBF is considered in Sect. 5.
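A minimal MATLAB sketch of this crisp reconstruction step is given below; it assumes the knots have already been extracted from the fuzzy partition and encodes each shape pattern simply by a slope and a CC-degree realised through repeated BZ-modification, which is a simplification of the parametric granulation of Sect. 3.

function y = ReconstructPBF(knots, slopes, ccd, y0, t)
% Crisp reconstruction of a PBF given by n rules (T0 is assumed to coincide with knots(1)).
% knots  - n+1 borders of the crisp intervals (cores of the fuzzy partition and domain borders)
% slopes - slope of the linear trend pattern of each rule
% ccd    - convex-concave degree of each rule (negative: concave, positive: convex, 0: linear)
% y0     - crisp initial value; t - points where the function is evaluated
y  = zeros(size(t));
yk = y0;                                       % running initial value of the current segment
for k = 1:numel(slopes)
    idx = t >= knots(k) & t <= knots(k+1);
    seg = yk + slopes(k)*(t(idx) - knots(k));  % linear pattern of rule k
    y1 = min(seg); y2 = max(seg);
    for m = 1:abs(ccd(k))                      % apply the BZ-modification |ccd| times
        if y2 > y1 && ccd(k) < 0
            seg = y2 - (y2 - seg).^2/(y2 - y1);    % concave modification
        elseif y2 > y1 && ccd(k) > 0
            seg = y1 + (seg - y1).^2/(y2 - y1);    % convex modification
        end
    end
    y(idx) = seg;                              % concatenation of the shape patterns
    yk = seg(end);                             % final value becomes the next initial value
end
end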
5 Representation of Curves in Subjective Qualitative Forecasting by PBF

Qualitative forecasting methods use the opinions of experts to subjectively predict future events [7]. These methods are usually used when historical data either are not available or are scarce, for example to forecast sales of a new product or to predict if and when new technologies will be discovered and adopted. These methods are also used to predict changes in historical data patterns. The method of technological comparisons involves predicting changes in one area by monitoring changes that take place in another area [7, 9]. The forecaster tries to determine a pattern of change in one area, called the primary trend, which will result in new developments being made in the area of interest. The forecaster must determine the relationship between the primary trend and the events to be forecast. After this, forecasts in the area of interest can be made by monitoring the primary trend [7]. In subjective curve fitting applied to predicting sales of a new product, the product life cycle is usually thought of as consisting of several stages: "growth", "maturity" and "decline" [7]. Each stage is represented by qualitative patterns of sales:
– "Growth" stage: Start slowly, then Increase rapidly, and then Continue to increase at a slower rate
– "Maturity" stage (sales of the product stabilize): Increase slowly, Reach a plateau, and then Decrease slowly
– "Decline" stage: Decline at an increasing rate
Figure 5 depicts a typical curve of the product life cycle based on subjective curve fitting. The "growth" stage is subjectively represented as an S-curve, as shown in Fig. 6, which could then be used to forecast sales during this stage.
Fig. 5. Product life cycle. Adapted from [7]
Fig. 6. S-Curve. Adapted from [7]
To predict the time intervals for each step of the "growth" stage the company uses expert knowledge and its experience with other products. While an S-curve may be appropriate for prediction of the product life cycle, exponential or logarithmic curves may be used in other situations. Subjective curve fitting is very difficult and requires a great deal of expertise and judgment [7]. The methods of reconstruction of perception-based functions can be used to support the process of qualitative forecasting. The qualitative patterns of sales can be represented by perceptual patterns of trends, and each stage
Fig. 7. Steps of reconstruction of crisp perception-based function
can be represented by a sequence of fuzzy time intervals with different trend patterns. For example, the S-curve can be modeled by three rules using the convex–concave perceptual patterns considered earlier:
R1: If T is Start of Growth Stage then S is Slowly Increasing and Convex.
R2: If T is Middle of Growth Stage then S is Quickly Increasing.
R3: If T is End of Growth Stage then S is Slowly Increasing and Concave.
In these rules, T denotes time intervals and S denotes the sales volume. The time intervals "Start of Growth Stage" etc. define fuzzy intervals, and the corresponding subshapes of the S-curve are defined by perception-based patterns. The reconstruction of the perception-based function given by these three rules is shown in Fig. 7. Figure 8 depicts the fuzzy PBF obtained as a result of a cylindrical extension of the crisp PBF from Fig. 7 by means of a generalized bell membership function. This fuzzification can be done by different types of parameterized membership functions such as trapezoidal, Gaussian, etc. [11]. A fuzzy perception-based function gives the possibility to model uncertainty in forecasting. Generally, a proportional extension of fuzzy sets in the direction given by the crisp PBF can be applied
Fig. 8. Fuzzy granulation of PBF shown in Fig. 7
instead of the cylindrical extension. Such a proportional extension can model an increase of the forecasting uncertainty with the increase of the forecasting time. A reconstruction of the PBF corresponding to the whole product life cycle shown in Fig. 5 can be done similarly to the reconstruction of the "Growth" stage. The resulting perception-based function can be used in forecasting of sales.
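Assuming hypothetical borders and slopes for the three intervals of the growth stage, the crisp S-curve of the three rules above can be obtained with the reconstruction sketch from Sect. 4:

knots  = [0 30 70 100];           % assumed borders: Start / Middle / End of the growth stage
slopes = [0.2 1.0 0.2];           % Slowly Increasing, Quickly Increasing, Slowly Increasing
ccd    = [1 0 -1];                % Convex, Linear, Concave
t = 0:100;
sales = ReconstructPBF(knots, slopes, ccd, 0, t);    % crisp S-curve of the sales
plot(t, sales); xlabel('Time'); ylabel('Sales');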
6 Conclusions

Perception-based functions give natural and flexible tools for modeling human expert knowledge about function shapes. They have the following advantages over logarithmic, exponential and other mathematical functions usually used in such modeling:
– PBF can be easily composed from different segments
– Each segment has a natural linguistic description corresponding to human perceptions
– Each perception-based pattern can be easily tuned, due to its parametric definition, to better match expert opinion and training data if they are available
– The fuzzy function obtained as a result of reconstruction of a PBF gives the possibility to model different types of uncertainty present in forecasting
In comparison with the methods of granulation of time series shape patterns falling less steeply, rising more steeply, etc., considered in [1, 6], the methods of granulation of convex–concave patterns considered in this chapter give the possibility to modify any linear function defined on any interval. Such flexibility gives these methods an advantage not only in modeling perception-based functions but also in modeling and segmentation of time series.
7 Acknowledgment

The support for this research work has been provided by the IMP, projects D.00006 and D.00322.
References
1. Baldwin J.F., Martin T.P., Rossiter J.M. (1998) Time series modelling and prediction using fuzzy trend information. Proc. Fifth Intern. Conf. Soft Comput. Inf./Intell. Syst., 499–502
2. Batyrshin I., Panova A. (2001) On granular description of dependencies. In: Proc. 9th Zittau Fuzzy Colloquium 2001, Zittau, Germany, 1–8
3. Batyrshin I. (2002) On granular derivatives and the solution of a granular initial value problem. Intern. J. Appl. Math. Comput. Sci., Special Issue on Computing with Words and Perceptions (ed. by D. Rutkowska, J. Kacprzyk, L.A. Zadeh), 12(3), 403–410
4. Batyrshin I. (2003) Perception based functions with boundary conditions. In: Proc. Third Conf. Eur. Soc. Fuzzy Logic Technol., EUSFLAT 2003, Zittau, Germany, 491–496
5. Batyrshin I.Z. (2004) On reconstruction of perception based functions with convex-concave patterns. Proc. Int. Conf. Comput. Intell. ICCI 2004, Nicosia, North Cyprus, Near East University Press, 30–34
6. Batyrshin I., Sheremetov L., Herrera-Avelar R. Perception based patterns in time series data mining. In this book.
7. Bowerman B.L., O'Connell R.T. (1979) Time Series and Forecasting. An Applied Approach. Duxbury Press, Massachusetts
8. Brown B.B. (1968) Delphi Process: A Methodology Used for the Elicitation of Opinion of Experts, P-3925, RAND Corp., Santa Monica, California
9. Gerstenfeld A. (1971) Technological forecasting. J. Business, 44(1), 10–18
10. Gordon T.J., Hayward H. (1968) Initial experiments with the cross-impact method of forecasting. Futures, 1(2), 100–116
11. Jang J.-S.R., Sun C.T., Mizutani E. (1997) Neuro-Fuzzy and Soft Computing. A Computational Approach to Learning and Machine Intelligence. Prentice-Hall, NJ, USA
12. Klir G.J., Clair U.S., Yuan B. (1997) Fuzzy Set Theory: Foundations and Applications. Prentice-Hall, NJ, USA
13. Sigford J.V., Parvin R.H. (1965) Project PATTERN: A methodology for determining relevance in complex decision making. IEEE Transactions in Engineering Management, 12(1)
14. Zadeh L.A. (1997) Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems, 90, 111–127
15. Zadeh L.A. (1999) From computing with numbers to computing with words – from manipulation of measurements to manipulation of perceptions. IEEE Trans. Circuits Syst. I: Fundamental Theory and Applications, 45, 105–119
16. Zadeh L.A. (2001) A new direction in AI: Toward a computational theory of perceptions. AI Mag., 73–84
17. Zadeh L.A. (2002) Toward a perception-based theory of probabilistic reasoning with imprecise probabilities. J. Stat. Plan. Inf., 105, 233–264
18. Zwicky F. (1962) Morphology of Propulsive Power. Monographs on Morphological Research, no. 1. Society of Morphological Research, Pasadena, California
Towards Automated Share Investment System
Dymitr Ruta
Summary. Predictability of financial time series (FTS) is a well-known dilemma. A typical approach to this problem is to apply a regression model built on the historical data and then further extend it into the future. If however the goal for FTS prediction would be to support or even make investment decisions, predictions generated by regression-based models are inappropriate as on top of being uncertain and excessively complex, they require a lot of investor attention and further analysis to make an investment decision. Rather than precise time series prediction, a busy investor might prefer a simple decision on the current day transaction: buy, wait, sell, that would maximise his return on investment. Based on such assumptions a classification model is proposed that learns the transaction patterns from optimally labelled historical data and accordingly gives the profit-driven decision for the current-day transaction. The model is embedded into an automated client–server platform which automatically handles data collection and maintains client models on the database. The prototype of the system was tested over 20 years of NYSE:CSC share price historical data showing substantial improvement of the long-term profit compared to a passive long-term investment strategy.
1 Introduction

Prediction of financial time series (FTS) represents a very challenging signal processing problem. Many scientists consider FTS a very noisy, non-stationary and non-linear signal but believe that it is at least to a certain degree predictable [3, 6]. Other analyses suggest that a financial market is self-guarded against predictability: whenever it shows some signs of apparent predictability, investors immediately attempt to exploit the trading opportunities, thereby affecting the series and turning it unpredictable [2]. Stable forecasting of FTS therefore seems unlikely to persist for longer periods of time and will self-destruct when discovered by a large number of investors. The only prediction model that could be successful and sustainable seems to be the one that exploits the supportive evidence either hidden to other investors or the
evidence that is available but is highly dispersed among many sources and therefore considered irrelevant, too difficult or too costly to be incorporated in the prediction model. Despite this seemingly obvious rationale, a number of techniques are being developed in an attempt to predict what seems unpredictable: tomorrow's share price based on historical data. Starting from simple linear autoregressive moving average (ARMA) models [3], through conditional heteroscedastic models like ARCH or GARCH [3], up to complex non-linear models [3, 4], the idea is similar: establish a regression-based description of the future samples based on the historical data series. More recently a number of machine learning techniques started to be applied to financial forecasting and on a number of occasions showed considerable improvement compared to traditional regression models [5–7]. Neural networks are shown to be particularly good at capturing complex non-linear characteristics of FTS [5, 6]. Support vector machines represent another powerful regression technique that immediately found applications in financial forecasting [7, 8]. While there is already extensive knowledge available in the pattern recognition domain, it has rarely been used for FTS prediction. The major problem lies in the fact that a classification model learns to categorise patterns into crisp classes rather than numerical values of the series. A temporal classification model would have to provide a specific definition of classes or obtain it from the series by discretisation. Although some work has already been done in this field [9–11, 14], there is still a lack of pattern recognition based models that would offer immediate investment applications surpassing the traditional regression models in functionality and performance. While the reason for using prediction models for share market investment is obvious, given the future price prediction an investor still needs time to analyze this information before making an investment decision. Moreover, it is well known in share market technical analysis that humans tend to be inconsistent in following fixed rules due to emotional factors that affect their investment decisions [13]. Addressing these weaknesses, rather than predicting the share price, prInvestor uses a classification model that learns from expandable historical evidence how to categorise the future series into investment actions: buy, wait or sell. The proposed prInvestor prototype is a step towards a fully automated pattern recognition based investment system. It is designed as a fully automated client–server application where the server handles all analytical requests and maintains multiple client profiles, while the client, supported with a relevant presentation layer, only sends the requests and collects responses. prInvestor is tested on 20 years of daily share price series and the results are analyzed to give recommendations towards prospective fully automated platform development. The remainder of the paper is organised as follows. Section 2 provides a detailed analysis of the proposed temporal classification with prInvestor, specifying the investment cycle, an algorithm for optimal labelling of training data,
feature extraction process and the classification model used. Section 3 presents the results of experiments evaluating the performance of the prInvestor system. The concluding remarks and some suggestions for model refinement are shown in the closing Section 4.
2 Temporal Classification with prInvestor

Classification represents a supervised learning technique that tries to correctly label patterns based on a multi-dimensional set of features [1]. The model is fully built in the training process carried out on a labelled dataset with a set of discriminative features. Based on the knowledge gained from the training process, a classifier assigns a label to a new, previously unseen pattern. Adapting the pattern recognition methodology, prInvestor has to generate the action label buy, sell or wait as a response to the current-day feature values, based on the knowledge learnt from historical data. To achieve this goal, the training data has to be optimally labelled such that the investments corresponding to the sequence of labels generate the maximum return possible. Furthermore, to maximise the discrimination among classes, prInvestor should exploit the scalability of pattern recognition models and use as many relevant features as possible, far beyond just the historical share price series. All the properties mentioned above, along with some mechanisms controlling model flexibility and adaptability, are addressed in the presented prInvestor system.

2.1 Investment Cycle

In the simplified investment cycle considered in this paper, the investor uses all his assets during each transaction, which means he buys shares using all the available cash and always sells all shares. Assuming this simplification, there are four different states an investor can fall into during the investment cycle. He enters the cycle at the state "wait with money" (WM), where the investor is assumed to possess financial assets and is holding on in preparation for a "good buy". From there he can either wait in the same state WM or purchase shares at the actual price of the day, thereby entering the "buy" state (B). From state B the investor can progress to two different states: either he enters the "wait with shares" state (WS), preparing for a good selling moment, or he sells all the shares immediately (the day after the purchase), transferring them back to money at the "sell" (S) state. If the investor chooses to wait with shares (WS), he can stay in this state or may progress only to the state S. The sell state has to be followed by either the starting WM state or directly the buy state (B), which launches a new investment cycle. The complete investment cycle is summarised by the conceptual diagram and the accompanying directed cyclic graph shown in Fig. 1.
Fig. 1. Visualisation of the investment cycle applied in prInvestor. The wait state has been separated into two states as they have different preceding and following states
An immediate consequence of the investment cycle is that the labelling sequence is highly constrained by the allowed transaction paths, i.e. it obeys the following sequentiality rules:
– WM may be followed only by WM or B.
– B may be followed only by WS or S.
– WS may be followed only by WS or S.
– S may be followed only by WM or B.
The above rules imply that for the current-day sample the system has to pick only one out of two labels, depending on the label from the previous step. That way the 4-class problem is locally simplified to a 2-class classification problem. In a real-time scenario it means that to classify a sample, the model has to be trained on dynamically filtered training data from only two valid classes at each data point. If the computational complexity is of concern, retraining at each new data vector can be replaced by four fixed models built on the training data subsets corresponding to the four combinations of valid pairs of classes, as stated in the above sequentiality rules.

2.2 Training Set Labelling

Classification, representing a supervised learning model, requires labelled training data for model building and classifies incoming data using the available class labels. In our share investment model, the training data initially represents an unlabelled daily time series that has to be labelled using the four available states WM, B, S and WS, subject to the sequentiality rules. A sequence of labels generated that way determines the transaction history and allows for a calculation of the key performance measure – the profit.
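The sequentiality rules can be encoded as a small successor map; the MATLAB sketch below uses the label coding of Appendix A (0 – WM, 1 – B, 2 – WS, 3 – S), and the sample label sequence is hypothetical.

% Sequentiality rules of the investment cycle as a successor map
valid_next = {[0 1], [2 3], [2 3], [0 1]};   % valid_next{label+1} lists the allowed next labels
labels = [0 0 1 2 2 3 0 1 3 0];              % hypothetical label sequence to be checked
ok = true;
for t = 2:numel(labels)
    if ~ismember(labels(t), valid_next{labels(t-1)+1})
        ok = false; break;
    end
end
fprintf('sequence obeys the rules: %d\n', ok);
% At classification time the previous label leaves only two valid classes:
prev = 3;                                    % yesterday the shares were sold (S)
candidates = valid_next{prev+1}              % today the choice is between WM and B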
In a realistic scenario each transaction is subject to a commission charge, typically set as a fixed fraction of the transaction value. Let x_t for t ∈ {0, 1, . . . , N} be the original share price series and let c stand for the transaction commission rate. Assuming that b and s = b + p, where p ∈ {1, . . . , N − b}, denote the buying and selling indices, the relative capital after a single buy–sell investment cycle is defined by:

C_b^s = \frac{x_s}{x_b} \cdot \frac{1 − c}{1 + c}    (1)

Note that the same equation would hold if the relative capital was calculated in numbers of shares resulting from a sell–buy transaction. Assuming that there are T cycles in the series, let b(j) and s(j) for j = 1, . . . , T denote the indices of buying and selling in the jth cycle, such that x_{b(j)} and x_{s(j)} stand for the buy and sell prices in the jth cycle. Then the relative capital after k cycles (0 < k ≤ T) can be easily calculated by:

C_{b(1)}^{s(k)} = \prod_{j=1}^{k} C_{b(j)}^{s(j)}    (2)

The overall performance measure related to the whole series would then be the closing relative capital, which means the relative capital after T transactions:

C_T = C_{b(1)}^{s(T)}    (3)

Given C_T, the absolute value of the closing profit can be calculated by:

P = C_0 (C_T − 1)    (4)

where C_0 represents the absolute value of the starting capital. Finally, to be consistent with the investment terminology, one can devise the return on investment performance measure, which is an annual average profit related to the initial investment capital:

R = \frac{P}{C_0} \cdot \frac{t^{ANN}}{s(T) − b(1)} = \left( C_{b(1)}^{s(T)} − 1 \right) \frac{t^{ANN}}{s(T) − b(1)}    (5)

where t^{ANN} stands for the average number of samples in 12 months.
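For illustration, the following MATLAB sketch evaluates (1)–(5) for a labelled series; the price series, the labels, the commission rate and the starting capital are hypothetical values, and t^ANN is assumed to be 250 business days.

x      = [10 10.5 11 10.8 11.5 12 11.9 12.4];   % hypothetical share price series
labels = [ 0  1    2  2    3    0  1    3 ];    % hypothetical labels (0-WM, 1-B, 2-WS, 3-S)
c      = 0.01;                                  % transaction commission rate
tann   = 250;                                   % assumed number of samples in 12 months
C0     = 1000;                                  % starting capital
b = find(labels == 1); s = find(labels == 3);   % buy and sell indices of the cycles
CT = prod((x(s)./x(b))*(1 - c)/(1 + c));        % closing relative capital, (1)-(3)
P  = C0*(CT - 1);                               % closing profit, (4)
R  = (CT - 1)*tann/(s(end) - b(1));             % annualised return on investment, (5)
fprintf('CT = %.3f, P = %.2f, R = %.3f\n', CT, P, R);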
An original optimal labelling algorithm is proposed here. The algorithm is scanning the sequence of prices and subsequently finds the best buy and sell indices labelling the corresponding samples with B and S labels, respectively. All the samples in between B and S labels are labelled with WS label and all the samples in between S and B labels are labelled with WM label as required by the sequentiality rules. The following rules are used to determine whether the scanned sample is identified as optimal buy or sell label: – Data sample xb is classified with the label B (optimal buy) if for all data points xt (b < t < s) between xb and the nearest future data point xs b, s ∈ 1, . . . , N ∩ b < s, at which the shares would be sold with a profit (Cbs > 1), the capital Cts < Cbs . – Data sample xs is classified with the label S (optimal sell) if for all data points xt (s < t < b) between sample xs and the nearest future sample xb b, s ∈ 1, . . . , N ∩ s < b, at which the shares would be bought increasing the original number of shares (Cbs > 1), the capital Cbt < Cbs . The Matlab implementation of the above rules together with complete code for optimal labelling algorithm can be found in Appendix 7. It has to be made clear at this point that the optimal labelling algorithm is used for only one but key purpose: To label a training dataset. Labelled training dataset defines the whole learning process in which classifier tries to establish the relationship between the action labels and the historical data.
3 Feature Extraction

The data in its original form represents only a share price time series. The optimal labelling process adds labels on top of the prices. Extensive research dedicated to time series prediction [3] shows that building the model solely on the basis of the historical data of the FTS is very uncertain, as it exhibits a considerable proportion of a random walk. At the same time, an attractive property of classification systems is that in most cases they are scalable, which means they can process a large number of features in a non-conflicting, complementary learning process [1]. Making use of these attributes, prInvestor takes the original share price series, the average transaction volume series as well as the label series as the basis for the feature generation process. Details of the family of features used in the prInvestor model are listed in Table 1. Apart from various mixtures of typical moving average and various-order differencing features, there are new features (plf, ppf) that exploit the labels of past samples in their definition. The past label series is used as a feature in raw form (plf) as well as in the generation of the prospective transaction profit feature (ppf). ppf determines the share price difference between the current day and the latest buy or sell transaction made by the investor. The use of labels as features might draw some controversy, as it imposes that current model outputs depend on its previous outputs. This is, however, truly a
Table 1. A list of features used in the prInvestor model

name       description
prc        average daily share price
vol        daily transactions volume
mva_i(x)   moving average – mean of the i last samples of the series x
atd_i(x)   average difference between the current value of x and mva_i(x)
dif_i(x)   series x differenced at the ith order
plf_i      the past label of the sample taken i steps before the current sample
ppf        difference between the current price and the price at B or S labels
reflection of the fact that investment actions strongly depend on the previous actions: for example, if a good buy moment was missed, the following good sell point could no longer be good or could even fall into the WM or B class. Incorporation of the dependency on previous system outputs (labels) also injects a needed element of flexibility into the model, such that after a wrong decision the system can quickly recover rather than make further losses. Another consequence of using past labels as features is the high non-linearity and indeterminism of the model and hence its limited predictability. Various configurations of features from the families listed in Table 1 are evaluated in Sect. 6, and the optimal subset is filtered out as a result of a feature selection procedure, details of which are presented in Sect. 6. It is important to note that the features proposed in the prototype of the prInvestor model are just a proposition of a simple, cheap and available set of features which by no means form the optimal set of features. In fact, as the series is time related, countless features, starting from the company's P/E ratio or economy strength indicators up to the type of weather outside or the investment mood, could be incorporated. The problem of generation and selection of the most efficient features for the prInvestor model remains open and will be considered in more detail in later versions of prInvestor.

3.1 Feature Selection

Even in its initial prototype stage, preliminary experiments indicated very high sensitivity of performance to the selection of feature subsets. Moreover, future additions of different classification models to the system typically work well with different subsets of features. Hence, the system requires incorporation of a simple evaluative feature selection method that optimises the selection for the classifier model for which it selects the features. One of the powerful algorithms fulfilling these requirements is the probability based incremental learning (PBIL) method [12, 15]. It shares some similarities with evolutionary algorithms by operating on a population of enumerated solutions – chromosomes, which are vectors of 0–1 incidences, where a 1 on the kth position means
that the kth feature is selected, while a 0 means it is not. The chromosomes are sampled from a special probability vector, which is updated at each step according to the fittest chromosomes (best performing sets of features). The update process of the probability vector is performed according to a standard supervised learning method. Given the probability vector p = [p_1, . . . , p_M]^T and a population of chromosomes P = [v_1, . . . , v_C], where v_j = [ω_{j1}, . . . , ω_{jM}]^T and ω_{ji} ∈ {0, 1}, each probability bit is updated as in the following expression:

p_i^{new} = p_i^{old} + ∆p_i,   ∆p_i = η \left( \frac{1}{C} \sum_{j=1}^{C} ω_{ji} − p_i \right)    (6)
where j = 1, . . . , C and i = 1, . . . , M, C refers to the number of fittest chromosomes found, and η controls the magnitude of the update. The number of best chromosomes taken to update the probability vector, together with the magnitude factor η, controls the balance between the speed of reaching convergence and the ability to explore the whole search space. According to the standard algorithm, the only information that remains after each step is the probability vector, from which the chromosomes are generated. Convergence, achieved when ∆p_i → 0, implies that for all the chromosomes in the population p_i → ω_{ji}, which means that the probability vector becomes the final solution of the search process. The complete PBIL algorithm can be described in the following steps:
1. Create a probability vector of the same length as the required chromosome and initialise it with values of 0.5 at each bit.
2. Create a population of chromosomes according to the probability vector.
3. Evaluate the fitness of the samples by calculating the classification error for the combinations of features defined by the chromosomes.
4. Update the probability vector using (6).
5. If all elements in the probability vector are 0 or 1 then finish, else go to step 2.
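A compact MATLAB sketch of this search is given below; the classifier-error function passed as fitnessfcn is left abstract, and the threshold and iteration cap used as a practical stopping rule are assumptions not present in the original step 5.

function best = PBILSelect(M, popsize, nbest, eta, fitnessfcn)
% PBIL feature selection, cf. Eq. (6). fitnessfcn(mask) returns the
% classification error of the feature subset given by the 0-1 mask.
p = 0.5*ones(1, M);                                   % step 1: initial probability vector
maxit = 200; it = 0;                                  % practical iteration cap (assumption)
while any(p > 0.02 & p < 0.98) && it < maxit          % step 5: stop when all bits are (almost) 0 or 1
    it = it + 1;
    pop = rand(popsize, M) < repmat(p, popsize, 1);   % step 2: sample the chromosomes
    err = zeros(popsize, 1);
    for j = 1:popsize
        err(j) = fitnessfcn(pop(j, :));               % step 3: evaluate the fitness
    end
    [~, idx] = sort(err);                             % the C = nbest fittest chromosomes
    fittest = pop(idx(1:nbest), :);
    p = p + eta*(mean(fittest, 1) - p);               % step 4: update the vector, Eq. (6)
    p = min(max(p, 0), 1);
end
best = p > 0.5;                                       % final feature subset
end

% Usage with a hypothetical error function, e.g. a cross-validated error of a chosen classifier:
% selected = PBILSelect(30, 50, 5, 0.1, @(mask) myClassifierError(mask));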
4 Classification Model Given the set of features, labelled training set and the performance measure, the model needs only a relevant classifier that could learn to label the samples based on optimal labels and the corresponding historical features available in the training set. Before the decision on the classifier is made, it is reasonable to consider the complexity and adaptability issues related to prInvestor working in a real-time mode. Depending on the choice of training data there are three different modes prInvestor can operate in. In the most complex complete mode the model is always trained on all available data to date. At each new day the previous day would have to be added to the training set and the model retrained on typically immense dataset covering all available historical evidence. In the fixed mode the model is trained only once and then used in such fixed form day by day without retraining that could incorporate the
new data. Finally in the window mode the model is retrained on the same number of past samples shifted each day as the new data comes in. Undoubtedly the model is fastest in fixed mode, which could be a good short-term solution particularly if complexity is of concern. Complete mode offers the most comprehensive learning, however at huge computational costs and poor adaptability capabilities. The most suitable seems to be the window mode in which the model is fully adaptable and its complexity can be controlled by the window width. Given relatively large datasets and the necessity of retraining, it seems reasonable to select a rather simple easily scalable classifier that would not be severely affected by the increase in sizes of both feature and data sets. Initially prInvestor prototype is featured with the simple quadratic discriminant analysis (QDA), decision tree (DT) and the k-nearest neighbours classifiers details of which can be found in [1]. These classifiers seem to accommodate the simplicity properties mentioned above while being still capable of capturing some complex data structures. As classification is a relatively consistent process, this set of classifier models can be appended by more complex models at latter stage. It is important to note that given a day lag of the series, there is plenty of time for retraining even using large datasets. The model is therefore open for more complex classifiers or even the mixture of classifiers that could potentially improve its performance. The three simple classifiers used in this work, due to their simplicity are particularly useful in this early prototyping stage where the experimentation comprehensiveness takes the priority over the maximum possible performance. Moreover, simple classifiers are also preferred for shorter-lag series where the retraining might be necessary every hour or minute.
5 Client–Server Platform Implementation From the technical point of view prInvestor’s capability can be efficiently exploited in a relevant client–server architecture. The server traditionally handles all analytical tasks and the communication with a database where the data is read from and processed results written to. Multiple remote clients can be served simultaneously by means of widespread HTTP protocol inline with the internet communication channel. The schematic architecture of the prInvestor system is shown in Fig. 2. The Data Mining Server (DMS) performs all computational and database interaction realised by the prInvestor system. It maintains client accounts which can have various privilege levels with respect to the access to shares data and data processing requests. A user logged on to administrator account can manage all the client accounts which includes, adding and removing accounts, resetting their usernames and passwords and setting various privilege levels. DMS has a direct, secure and fast connection with the relational database
Fig. 2. Client–server architecture of the prInvestor system
containing daily shares data updated overnight. If necessary on client request the user can perform various pre-processing procedures like data cleaning and aggregation supported by the relevant GUI. Given complete source data the user has to decide upon the set of features he wants or let prInvestor to select the best set of features found using the algorithm described in Sect. 3.1 from the list of available features offered by the system. The list of selected features is then fixed to the user until he decides to change it. The user also has the facility to redesign or filter the data to his individual needs by means of relevant SQL queries which are sent to and run by the server and then saved either locally or on the database depending on privilege profile of the user. The clients can request analytical task at any time. The server manages them via priority queuing system taking into account estimated task duration, resource usage, user priority level etc. For each user the server creates a table in a database where the history of his transactions is kept. If no transactions is made by the user in particular day, the system automatically updates wait action states in the user tables. During the periods of less active server usage, the server automatically retrains the models for the users according to their priority and stores the predicted action for the next day to be proposed to users if they log in on that day. Before the model training the server pulls the shares data from the database and joins them with optimal labels returned as a result of the optimal labelling algorithm. To minimise analytical effort, optimal label is appended each day after the database is updated with new data. After retraining the models are stored within particular user schemas and are ready to handle client requests. In case of extremely high server usage during the week, retraining of models can be postponed to the weekend, which would mean that prInvestor model would operate in the fixed mode on the short-term and moving window mode on the long-term.
In response to a client's prediction request the server performs the following actions:
– Retrieves the shares data from the database or receives them in the request from the user.
– Appends the data with the features, using the user's action labels.
– Retrieves the user's classification model.
– Applies the model to the joined data point(s).
– Sends the predicted action label(s) to the client.
6 Experiments Extensive experimentation work has been carried out to evaluate prInvestor. Specifically prInvestor was assessed in terms of the relative closing capital compared to the relative closing capital of the passive investment strategy of buying the shares at the beginning and selling at the end of the experimental series. Rather than a large number of various datasets only one dataset has been used but covering almost 20 years of daily average price and volume information. The dataset represents Computer Sciences Corporation average daily share price and volume series from years 1964–1984, available at the corporation website (www.csc.com). Initially the dataset has been optimally labelled for many different commission rates, just to investigate what is the level of maximum possible oracletype return from the share market investment. The experimental results shown in Fig. 3b reveal surprisingly large double-figure return for small commission charges (10%). Relating this information to the plot of the original series shown in Fig. 3a it is clear that to generate such a huge return the algorithm has to exploit all possible price rises that exceed the profitability threshold determined by the commission rate. It also indicates that the most of the profit is generated on small but frequent price variations (±2%), which in real-life could be considered as noise and may not be possible to predict from the historical data. The maximum possible annual return or the profile presented in Fig. 3b can also be considered as a measure describing potential speculative investment attractiveness of the corresponding company. High values of such measure would give a founded hope that even a small fraction of the optimal transactions if learned from the historical evidence would still generate a considerable profit. Shape analysis of the return profile from Fig. 3b could provide even more detailed information on this issue. Due to many varieties of prInvestor setup and a lack of presentation space in this paper the experiments have been carried out in the moving-window mode as a balanced option featuring reasonable flexibility and adaptability mechanisms. The choice of optimal width of the moving window was aided by a specifically designed experiment. All the features were tested individually
Fig. 3. Visualisation of the optimal labelling algorithm capability. (a) CSC daily share price since 1964. (b) Annual return as a function of transaction commission rate
by a simulation of the real-time investment over 20 years of data for training windows fixed at 6 months, 1, 2 and 4 years. The resulting profit performances are compared with the passive investment strategy in Table 2. The expectation of poor performance of the classification system with just a single feature has been confirmed for most of the cases. Nevertheless, for a number of features the closing capital was comparable to the passive investor's closing capital, and for the differenced moving average applied to the share price, dif(mva(p20)), prInvestor doubled the closing capital of the passive investor. Then, for each of the window widths fixed at 6 months, 1, 2 and 4 years, prInvestor has been run 100 times using the QDC classifier and a random subset of features. The discriminant function of the QDC classifier is visualised for two sample features in Fig. 4. The plot shows data points projected on the superposition of the class discriminant functions and the resulting class boundaries on the bottom plane. Details of the visualisation method used in Fig. 4 can be found in [16]. The performances obtained in the form of returns related to the passive investment indicated that the width of 2 years (around 500 business days) is optimal for this particular dataset–classifier pair. Having decided on the 2-year moving window mode, prInvestor was further tuned by selection of the most relevant features. The PBIL algorithm described in Sect. 3.1 was applied to search for the optimal subset of the features listed in Table 2. PBIL has been run with a 50-element population of randomly initialised chromosomes and after around 40 generations the algorithm converged, resulting in an optimal subset of 21 features, which are indicated by the + mark preceding the feature names in Table 2. The model was then trained on the first 500 days (2-year window) that have been optimally labelled. In the next step the investment simulation was launched, in which at each next day the model generated the transaction label based on training on the preceding 500 optimally labelled samples. The resulting sequence of transaction labels represents the complete output of the model. Figure 5a illustrates the transactions generated by the prInvestor model, while Table 3 shows the performance results. An important point is that most of the investment cycles generated by prInvestor were profitable. Moreover, the occasional loss cycles occur mostly during bear markets (bessa) and are relatively short in duration. The model is quite eager to invest during bull markets (hossa), which seemed to be at least partially picked up from the historical data. Numerical evaluation of prInvestor, depicted in Fig. 5b, shows a more than five times higher closing capital than for the case of the passive investor who buys at the beginning and sells at the end of the 20-year period. Such remarkable results correspond on average to almost 20% annual return from the investment and give a good prospect for the development of the complete investment platform with a number of carefully developed features and the data incoming automatically to the system for daily processing resulting in a final investment decision.
Table 2. Average annual returns (ret), closing capitals (rc) and closing capitals related to the passive investment closing capitals obtained for the 20-year investment simulation with individual features only

Feat.             ret.5  rcp.5  rc.5    ret1   rcp1   rc1     ret2   rcp2   rc2     ret4   rcp4   rc4
+prc               0.09   0.36   5.16    0.07   0.37   3.91    0.07   0.77   3.49    0.00   1.11   1.00
+vol               0.00   0.07   0.95   −0.01   0.08   0.81   −0.02   0.16   0.71   −0.13   0.11   0.10
−mva(p1)           0.07   0.24   3.43    0.07   0.34   3.54    0.06   0.68   3.10    0.00   1.11   1.00
−mva(v1)           0.07   0.25   3.48    0.05   0.25   2.65   −0.02   0.15   0.67   −0.15   0.08   0.08
+mva(p5)           0.00   0.07   1.01    0.05   0.25   2.68    0.06   0.59   2.67    0.00   1.11   1.00
−mva(v5)          −0.01   0.06   0.82    0.03   0.17   1.82   −0.03   0.12   0.56   −0.12   0.14   0.13
−mva(p20)          0.06   0.21   2.94   −0.01   0.09   0.90    0.05   0.57   2.59    0.00   1.11   1.00
+mva(v20)          0.00   0.07   1.01    0.07   0.32   3.38    0.01   0.26   1.17   −0.08   0.31   0.28
+mva(p50)          0.02   0.11   1.55    0.04   0.19   2.02   −0.02   0.15   0.69    0.00   1.11   1.00
+mva(v50)          0.08   0.30   4.28    0.01   0.11   1.21    0.00   0.23   1.05    0.00   1.05   0.94
+atd(p1)          −0.14   0.00   0.06   −0.05   0.03   0.35   −0.06   0.08   0.36   −0.14   0.10   0.09
+atd(v1)           0.12   0.70   9.89   −0.11   0.01   0.12    0.04   0.43   1.97   −0.13   0.13   0.12
+atd(p5)          −0.07   0.02   0.26   −0.11   0.01   0.11   −0.05   0.10   0.44   −0.11   0.16   0.15
+atd(v5)          −0.05   0.03   0.39   −0.01   0.08   0.89   −0.07   0.06   0.27   −0.13   0.12   0.11
+atd(p20)         −0.01   0.06   0.79    0.07   0.33   3.51   −0.01   0.20   0.90   −0.13   0.13   0.11
+atd(v20)         −0.12   0.01   0.09   −0.02   0.06   0.63   −0.01   0.20   0.90   −0.05   0.50   0.45
−atd(p50)          0.07   0.27   3.83    0.02   0.15   1.55   −0.13   0.02   0.08   −0.04   0.59   0.54
−atd(v50)         −0.02   0.05   0.74    0.00   0.10   1.05    0.01   0.26   1.19   −0.02   0.75   0.68
+dif(p1)          −0.14   0.00   0.06   −0.05   0.03   0.35   −0.06   0.08   0.36   −0.14   0.10   0.09
−dif(v1)           0.12   0.70   9.89   −0.11   0.01   0.12    0.04   0.43   1.97   −0.13   0.13   0.12
+dif[mva(p20)]     0.04   0.14   2.04    0.08   0.39   4.07    0.08   0.85   3.86    0.04   2.05   1.84
+dif[mva(v20)]    −0.01   0.06   0.85   −0.07   0.02   0.24   −0.06   0.07   0.33   −0.12   0.13   0.12
+dif(p2)           0.10   0.45   6.40    0.00   0.09   0.99   −0.12   0.02   0.10   −0.15   0.08   0.07
+dif(v2)           0.07   0.29   4.05    0.05   0.25   2.63   −0.06   0.07   0.31   −0.13   0.12   0.10
+dif[mva2(p20)]   −0.02   0.05   0.72    0.02   0.13   1.33   −0.02   0.16   0.75   −0.04   0.54   0.49
+dif[mva2(v20)]    0.04   0.14   2.03    0.06   0.29   3.08   −0.01   0.17   0.79   −0.10   0.22   0.19
+plf(1)           −0.09   0.01   0.16   −0.05   0.04   0.39    0.00   0.22   1.00    0.00   1.11   1.00
−plf(2)            0.02   0.11   1.60    0.00   0.09   1.00    0.00   0.22   1.00    0.00   1.11   1.00
−plf(3)            0.03   0.12   1.76   −0.05   0.04   0.37    0.00   0.22   1.00    0.00   1.11   1.00
+ppf              −0.15   0.00   0.04   −0.10   0.01   0.13   −0.13   0.02   0.08   −0.15   0.08   0.07

The experiments show the results for many fixed training window widths: 125, 250, 500 and 1,000 days, corresponding roughly to 0.5, 1, 2 and 4 years. The + or − at the front of a feature name indicates whether the feature is included or not in the optimal subset of features selected by the PBIL algorithm.
Fig. 4. Visualisation of the quadratic discriminant classifier generating the boundaries among classes. (a) Plot of the Gaussian-based discriminative function for the pair of features atd(p50) and ppf. (b) Discriminative function for features dif[mva(p20)] and plf(1). The resulting class boundaries are visualised by the thick lines at the bottom of the plots
Fig. 5. prInvestor in action: the transactions returned as a result of learning to invest with profit from historical data. (a) Visualisation of the transactions generated by prInvestor. (b) Comparison of the relative capital evolutions of the prInvestor and passive investor always keeping shares
Table 3. Performance comparison between passive investment strategy and prInvestor obtained over investment simulation on 20 years of CSC share price history

investor            annual return (%)   relative closing capital (rcc) (%)
passive investor     8.9                 463.8
prInvestor          19.1                2310.7
7 Conclusions

prInvestor is a proposition of an intelligent share investment system. Based on extensive historical evidence, it uses pattern recognition mechanisms to generate a transaction decision for the current day: buy, sell or wait. The advantage of this model over the existing techniques is that it is capable of incorporating, in a complementary, non-conflicting manner, various types of evidence beyond just the historical share price data. The proposed system benefits further from the optimal labelling algorithm developed to assign the transaction labels to the training series such that the maximum possible return on investment is achieved. The model features basic flexibility and adaptability mechanisms such that it can quickly recover from bad decisions and adapt to novel trend behaviour that may suddenly start to appear. The robustness of prInvestor is demonstrated on just a few simple feature types generated upon the share price and volume series. With such a simple setup the model hugely outperformed the passive investment strategy of buying at the beginning and selling at the end of the series, and brings on average a 20% annual return on investment. Despite tremendous average results, the model is not always consistent and occasionally generates losses. Full understanding of this phenomenon requires deeper analysis of the role of each individual feature in the decision process. In addition, there are plenty of unknowns related to the choice of features, classifiers and the real-time classification mode. Moreover, although a prototype of the client–server implementation of prInvestor has already been proposed in this work, the system is still far from commercial use, which would require portfolio management of multiple investments at the same time. The system also needs flexibility to accommodate more action types beyond just simple buy, sell and wait. Prospective real-time application of prInvestor may require a drastic decrease of time granularity, which in turn could impose rethinking and possibly a redesign of the retraining process. All these problems and doubts will be the subject of further investigation towards a fully automated, robust and commercially feasible investment platform, which should be flexible enough to work with multiple classification models. For that to happen the system would have to meet very restrictive reliability requirements confirmed by extensive testing across different company shares, markets, and time.
Appendix A: Implementation of Optimal Labelling Algorithm

The following set of Matlab functions realises the complete algorithm for optimal labelling of the training set. Given the price time series x and the broker handling fee charge (e.g. charge=0.01 means 1%), the function OptimalLabelling finds the optimal sequence of labels: 0 – wait with money (WM), 1 – buy (B), 2 – wait with shares (WS), 3 – sell (S), that generates the highest possible profit as a result of the corresponding buy–sell transactions. The function returns the label vector labels.

function [labels]=OptimalLabelling(x,charge)
new_buy_id=1;
n=length(x);
labels=zeros(n,1);
charge=(1+charge)/(1-charge);
while new_buy_id ...

… S_k, we can see that the value of R → c, where c is a constant lying between 0 and 1. Alternatively, for the death moves the proposal ratio is written as
R = \frac{b_k D_Q S_k}{d_{k-1} (k-1) S_{k-1}},    (12)
and we can see that, under the assumptions considered for the birth moves, R ≥ 1.

2.3 The Difficulties of Sampling Decision Trees

The RJ MCMC technique starts drawing samples from a DT consisting of one splitting node whose parameters are randomly assigned within the predefined priors. The Markov chain therefore has to be run while the DT grows and its likelihood is unstable. This phase is called burn-in, and it should be preset long enough for the Markov chain to stabilize. Once the Markov chain is sufficiently stable, sampling can start. This phase is called post burn-in.

It is important to note that the DTs grow very quickly during the first burn-in samples. This happens because the increase in log likelihood value for a birth move is much larger than that for the other moves. For this reason almost every new partition of the data is accepted. Once a DT has grown, the change moves are accepted with a very small probability and, as a result, the MCMC algorithm tends to get stuck at a particular DT structure instead of exploring all possible structures.

The size of the DTs can be kept reasonable by defining a minimal number of data points, pmin, allowed to be in the splitting nodes [3–5]. If the number of data points in the new partitions made after a birth or change move becomes less than the given number pmin, such moves are assigned unavailable, and the RJ MCMC algorithm resamples them. However, when moves are assigned unavailable, this distorts the proposal probabilities pb, pd, and pc given for the birth, death, and change moves, respectively. The larger the DT, the smaller the number of data
points falling in the splitting nodes, and correspondingly the larger the probability with which moves become unavailable. Resampling the unavailable moves makes the balance between the proposal probabilities biased.

To show that the balance of proposal probabilities can be biased, let us consider an example with probabilities pb, pd, and pc set equal to 0.2, 0.2, and 0.6, respectively; note that pb + pd + pc = 1. Let the DTs be large, so that the birth and change moves are assigned unavailable with probabilities pbu and pcu equal to 0.1 and 0.3, respectively. As a result, the birth and change moves are made with probabilities equal to (pb – pbu) and (pc – pcu), respectively. Let us now emulate 10,000 moves with the given proposal probabilities. The resultant probabilities are shown in Fig. 2. From Fig. 2 we can see that after resampling the unavailable proposals the probabilities of the birth and death moves become approximately 0.17 and 0.32, i.e. the death moves are made with a probability significantly larger than the 0.2 originally specified.
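This emulation can be reproduced with a few lines of code. The following sketch is not taken from the chapter; it simply redraws unavailable moves using the probabilities assumed above (pb = pd = 0.2, pc = 0.6, pbu = 0.1, pcu = 0.3) and counts the moves actually made.

% Sketch: emulate resampling of unavailable birth/change moves.
p  = [0.2 0.2 0.6];              % proposal probabilities: birth, death, change
pu = [0.1 0.0 0.3];              % probabilities of a move being unavailable
nMoves = 10000;
made = zeros(1, 3);              % counts of moves actually made
for i = 1:nMoves
    while true
        m = find(rand <= cumsum(p), 1);   % draw a move type
        if rand >= pu(m) / p(m)           % the drawn move is available
            break;                        % otherwise it is resampled
        end
    end
    made(m) = made(m) + 1;
end
made / nMoves                    % roughly [0.17 0.33 0.50] for these settings

The death moves, proposed with probability 0.2, are thus actually made in about a third of all moves, reproducing the bias illustrated in Fig. 2.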
Fig. 2. The standard strategy: The proposal probabilities for the birth, death, and change moves presented by the three groups. The left-hand bars in each group denote the proposal probabilities. The right-hand bars denote the resultant probabilities with which the birth, death, and change moves are made in reality if the birth and change moves were assigned unavailable with probabilities 0.1 and 0.3, respectively
The disproportion in the balance between the probabilities of the birth and death moves depends on the size of the DTs averaged over samples. Clearly, at the beginning of the burn-in phase the disproportion is close to zero, and towards the end of the burn-in phase, when the size and form of the DTs have stabilized, its value becomes maximal.

Because DTs are hierarchical structures, changes at nodes located at the upper levels can significantly change the location of data points at the lower levels. For this reason there is a very small probability of changing, and then accepting, a DT split located near the root node. Therefore the RJ MCMC algorithm collects DTs in which the splitting nodes located far from the root node were changed. These nodes typically contain small numbers of data points. Consequently, the value of the log likelihood does not change much, and such moves are frequently accepted. As a result, the RJ MCMC algorithm cannot explore the full posterior distribution properly.

One way to extend the search space is to restrict DT sizes during a given number of the first burn-in samples, as described in [7]. Indeed, under such a restriction, this strategy gives more chances of finding DTs of a smaller size which could be competitive, in terms of log likelihood values, with the larger DTs. The restricting strategy, however, requires setting up in an ad hoc manner additional parameters such as the size of the DTs and the number of the first burn-in samples. Sadly, in practice, it often happens that after the limitation period the DTs grow quickly again and this strategy does not improve the performance. Alternatively to the explicit limitation of DT size, the search space can be extended by using a restarting strategy, as Chipman et al. have suggested in [6]. Clearly, neither of these strategies can guarantee that most of the DTs will be sampled from a model space region with a maximal posterior. In Sect. 3 we describe our approach based on sweeping the DTs.
3 The Bayesian Averaging with a Sweeping Strategy

In this section we describe our approach to decreasing the uncertainty of classification outcomes within the Bayesian averaging over DT models. The main idea of this approach is to make the prior probability of further splitting DT nodes dependent on the range of values within which the number of data points will be not less than a given number of points, pmin. Such a prior is implicit, because at the current partition the range of such values is unknown.
Formally, the probability Ps(i, j) of further splitting at the ith partition level and variable j can be written as Ps (i, j ) =
(i, j ) (i , j ) − x min x max , (1, j ) (1, j ) x max − x min
(13)
where x_min^{(i,j)} and x_max^{(i,j)} are the minimal and maximal values of variable j at the ith partition level.

Observing (13), we can see that x_max^{(i,j)} ≤ x_max^{(1,j)} and x_min^{(i,j)} ≥ x_min^{(1,j)} for all the partition levels i > 1. On the other hand, there is a partition level k at which the number of data points becomes less than the given number pmin. Therefore, we can conclude that the prior probability of splitting P_s ranges between 0 and 1 for any variable j and the partition levels i: 1 ≤ i < k.

From (13) it follows that for the first level of partition the probability P_s is equal to 1.0 for any variable j. Let us now assume that the first partition splits the original data set into two non-empty parts. Each of these parts contains fewer data points than the original data set, and consequently for the (i = 2)th partition either x_max^{(i,j)} < x_max^{(1,j)} or x_min^{(i,j)} > x_min^{(1,j)} for the new splitting variable j. In either case the numerator in (13) decreases, and the probability P_s becomes less than 1.0. We can see that each new partition makes the value of the numerator, and consequently the probability (13), smaller. So the probability of further splitting a node depends on the level i of partitioning the data set.

The above prior favours splitting the terminal nodes which contain a large number of data points. This is clearly a desirable property of the RJ MCMC technique because it accelerates the convergence of the Markov chain. As a result of using prior (13), the RJ MCMC technique of sampling DTs can explore an area of maximal posterior in more detail. However, prior (13) depends not only on the level of partition but also on the distribution of data points in the partitions. Analyzing the data set at the ith partition, we can see that the value of the probability P_s depends on the distribution of these data. For this reason the prior (13) cannot be implemented explicitly without estimates of the distribution of data points in each partition.

To make the birth and change moves within prior (13), the new splitting value s_i^{rule,new} for the ith node and variable j is assigned as follows. For the birth and change-split moves the new value s_i^{rule,new} is drawn from a uniform distribution:
s_i^{rule,new} \sim U(x_{\min}^{(1,j)}, x_{\max}^{(1,j)}).    (14)
The above prior is "uninformative" and is used when no information on preferable values of s_i^{rule} is available. As we can see, the use of a uniform distribution for drawing the new rule s_i^{rule,new}, proposed at a level i > 1, can cause partitions containing fewer data points than pmin. However, within our technique such proposals can be avoided.

For the change-split moves, drawing s_i^{rule,new} follows after taking a new variable s_i^{var,new}:

s_i^{var,new} \sim U\{S_k\},    (15)

where S_k = \{1, \ldots, m\} \setminus s_i^{var} is the set of features excluding the variable s_i^{var} currently used at the ith node. For the change-rule moves, the value s_i^{rule,new} is drawn from a Gaussian with a given variance σ_j:

s_i^{rule,new} \sim N(s_i^{rule}, \sigma_j),    (16)
where j = s_i^{var} is the variable used at the ith node.

Because DTs have a hierarchical structure, the change moves (especially change-split moves) applied at the first partition levels can heavily modify the shape of the DT and, as a result, its bottom partitions can contain fewer data points than pmin. As mentioned in Sect. 2, within the Bayesian DT techniques [6, 7] such moves are assigned unavailable. Within our approach, after a birth or change move three cases can arise. In the first case, the number of data points in each new partition is larger than pmin. In the second case, the number of data points in one new partition is less than pmin. In the third case, the number of data points in two or more new partitions is less than pmin. These three cases are processed as follows. For the first case, no further actions are made, and the RJ MCMC algorithm runs as usual. For the second case, the node containing the unacceptable number of data points is removed from the resultant DT. If the move was of birth type, then the RJ MCMC algorithm resamples the DT. Otherwise, the algorithm performs the death move. For the last case, the RJ MCMC algorithm resamples the DT.
As we can see, within our approach a terminal node which, after a birth or change move, contains fewer than pmin data points is removed from the DT. Clearly, removing such unacceptable nodes turns the random search in a direction in which the RJ MCMC algorithm has more chances to find a maximum of the posterior amongst shorter DTs. As the unacceptable nodes are removed in this process, we have named this strategy sweeping.

After a change move the resultant DT can contain more than one node splitting fewer than pmin data points. However, this can happen only at the beginning of the burn-in phase, while the DTs grow, and is unlikely once the DTs have grown.

As an example, Fig. 3 provides the resultant probabilities estimated over 10,000 moves for a case when the original probabilities of the birth, death, and change moves were set equal to 0.2, 0.2, and 0.6, respectively, as assumed in the example given in Sect. 2.
Fig. 3. The shrinking strategy: The proposal probabilities for the birth, death, and change moves presented by the three groups. The left-hand bars in each group denote the proposal probabilities. The right-hand bars denote the resultant probabilities with which the birth, death, and change moves are made in reality if the birth and change moves were assigned unavailable with probabilities 0.07 and 0.2, respectively
The probabilities of the unacceptable birth and change moves were set equal to 0.07 and 0.2, respectively. These values are smaller than those set in the previous example because the DTs induced with the sweeping strategy are shorter than those induced with the standard strategy: the shorter the DTs, the more data points fall in their splitting nodes, and the smaller the probabilities pbu and pcu. In addition, 1/10th of the unacceptable change moves was assigned to the third option mentioned above, for which two or more new partitions contain fewer than pmin data points.

From Fig. 3 we can see that after resampling the unacceptable birth moves and reassigning the unacceptable change moves, the resultant probabilities of the birth and death moves become approximately 0.17 and 0.3, i.e. the values of these probabilities are very similar to those shown in Fig. 2.

Next we describe the Uncertainty Envelope technique suggested for estimating the classification uncertainty of multiple classifier systems, the details of which are described in [13]. This technique allows us to compare the performance of the Bayesian strategies of averaging over the DTs in terms of classification uncertainty.
4 The Uncertainty Envelope Technique

In general, the Bayesian DT strategies described in Sects. 2 and 3 allow sampling the DTs induced from data independently. In such a case, we can naturally assume that the inconsistency of the classifiers on a given datum x is proportional to the uncertainty of the DT ensemble. Let the value of the class posterior probability P(c_j|x) calculated for class c_j be an average over the class posterior probabilities P(c_j|K_i, x) given by the classifiers K_i:

P(c_j \mid x) = \frac{1}{N} \sum_{i=1}^{N} P(c_j \mid K_i, x),    (17)
where N is the number of classifiers in the ensemble. As the classifiers K_1, …, K_N are independent of each other and their values P(c_j|K_i, x) range between 0 and 1, the probability P(c_j|x) can be approximated as follows:

P(c_j \mid x) \approx \frac{1}{N} \sum_{i=1}^{N} I(y_i, t_i \mid x),    (18)
where I(y_i, t_i) is the indicator function assigned to be 1 if the output y_i of the ith classifier corresponds to the target t_i, and 0 if it does not. The larger the number of classifiers N, the smaller the error of the approximation (17). For example, when N = 500, the approximation error is equal to 1%, and when N = 5,000, it becomes equal to 0.4%.

It is important to note that the right side of (18) can be considered as the consistency of the outcomes of the DT ensemble. Clearly, the values of the consistency, γ = (1/N) \sum_{i=1}^{N} I(y_i, t_i \mid x), lie between 1/C and 1.
Analyzing (18), we can see that if all the classifiers are degenerate, i.e., P(c_j|K_i, x) ∈ {0, 1}, then the values of P(c_j|x) and γ become equal. The outputs of the classifiers can be equal to 0 or 1, for example, when the data points of the two classes do not overlap. In other cases, the class posterior probabilities of the classifiers range between 0 and 1, and P(c_j|x) ≈ γ. So we can conclude that the classification confidence of an outcome is characterized by the consistency of the DT ensemble calculated on a given datum. Clearly, the values of γ depend on how representative the training data are, what classification scheme is used, how well the classifiers were trained within the classification scheme, how close the datum x is to the class boundaries, how the data are corrupted by noise, and so on.

Let us now consider a simple example of a DT ensemble consisting of N = 1,000 classifiers in which 2 classifiers give a classification of a given datum x conflicting with that of the other 998. Then the consistency γ = 1 – 2/1,000 = 0.998, and we can conclude that the DT ensemble was trained well and/or the data point x lies far from the class boundaries. For a new datum appearing in some neighbourhood of x, the classification uncertainty, as the probability of misclassification, is expected to be 1 – γ = 1 – 0.998 = 0.002. This inference holds for a neighbourhood within which the prior probabilities of the classes remain the same. When the value of γ is close to γmin = 1/C, the classification uncertainty is highest and the datum x can be misclassified with a probability 1 – γ = 1 – 1/C.

From the above consideration, we can assume that there is some value of consistency γ0 for which the classification outcome is confident, that is, the probability with which a given datum x could be misclassified is small enough to be acceptable. Given such a value, we can now specify the uncertainty of classification outcomes in statistical terms. The classification outcome is said to be confident and correct when the probability of misclassification is acceptably small and γ ≥ γ0.
In addition to the confident and correct output, we can specify a confident but incorrect output, referring to a case when almost all the classifiers assign a datum x to a wrong class whilst γ ≥ γ0. Such outcomes tell us that the majority of the classifiers fail to classify a datum x correctly. The confident but incorrect outcomes can happen for different reasons; for example, the datum x could be mislabelled or corrupted, or the classifiers within the selected scheme cannot distinguish the data x properly. The remaining cases, for which γ < γ0, are regarded as uncertain classifications. In such cases the classification outcomes cannot be accepted with a given confidence probability γ0 and the DT ensemble labels them as uncertain.

Figure 4 gives a graphical illustration for a simple two-class problem formed by two Gaussians, N(0, 1) and N(2, 0.75), for variable x. As the class probability distributions are given, an optimal decision boundary can easily be calculated in this case. For a given confident consistency γ0, the integration over the class posterior distribution gives boundaries B1 and B2 within which the outcomes of the DT ensemble are assigned within the Uncertainty Envelope technique to be confident and correct (CC), confident but incorrect (CI) or uncertain (U).
Fig. 4. Uncertainty Envelope characteristics for an example of two-class problem
If a decision boundary within a selected classification scheme is not optimal, the classification error becomes higher than the minimal Bayes error. So, for the Bayesian classifier and a given consistency γ0, the probabilities of the CI and U outcomes on the given data are minimal, as depicted in Fig. 4.

The above three characteristics, the confident and correct, confident but incorrect, and uncertain outcomes, provide a practical way of evaluating different types of DT ensembles on the same data sets. Comparing the ratios of the data points assigned to each of these three types of classification outcomes, we can quantitatively evaluate the classification uncertainty of the DT ensembles. Depending on the costs of the types of misclassification in real-world applications, the value of the confidence consistency γ0 should be given, say, equal to 0.99.
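As a small illustration (not taken from the chapter), the consistency γ and the resulting outcome type can be computed for a single datum from the ensemble outputs along the following lines; the variable names are ours and the numbers reproduce the N = 1,000 example above.

% Sketch: Uncertainty Envelope categorisation of one datum.
y = [ones(998,1); 2; 2];            % hypothetical ensemble outputs (N = 1000)
t = 1;                              % true class of the datum
gamma0 = 0.99;                      % confidence threshold
yMaj  = mode(y);                    % class supported by the majority
gamma = mean(y == yMaj);            % consistency of the ensemble, here 0.998
if gamma >= gamma0 && yMaj == t
    outcome = 'confident and correct (CC)';
elseif gamma >= gamma0
    outcome = 'confident but incorrect (CI)';
else
    outcome = 'uncertain (U)';
end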
Next we describe the experimental results obtained with the sweeping strategy of Bayesian averaging over DTs. These results are then compared with those obtained with the standard Bayesian DT technique described in [7].

5 Experiments and Results

This section describes the experimental results of the comparison of the Bayesian DT techniques with the standard and sweeping strategies described in the above sections. The experiments were conducted first on a synthetic dataset, and then on real financial datasets: the Australian and German Credit datasets available at the StatLog Repository [11], as well as the Company Liquidity Data recently presented by the German Classification Society at [12]. The performance of the Bayesian techniques is evaluated within the Uncertainty Envelope technique described in Sect. 4.

5.1 The Characteristics of Datasets and Parameters of MCMC Sampling

The synthetic data are related to an exclusive OR problem (XOR3) with the output y = sign(x1 x2) and three input variables x1, x2 ~ U(−0.5, 0.5) and x3 ~ N(0, 0.2), which is Gaussian noise. Table 1 lists the total number of input variables, m, including the number of nominal variables, m0, the number of examples, n, and the proportion of examples of class 1, r. All four datasets present two-class problems.
Table 1. The characteristics of the data sets

#   data                 m    m0   n        r, %
1   XOR3                 3    0    1,000    50.0
2   Australian Credit    14   13   690      55.5
3   German Credit        20   20   1,000    70.0
4   Company Liquidity    26   15   20,000   88.8
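For reference, the XOR3 set described above can be generated along the following lines (a sketch, not taken from the chapter; whether 0.2 is the standard deviation or the variance of the noise is not stated, so the standard deviation is assumed here).

% Sketch: generating the synthetic XOR3 data.
n  = 1000;                         % number of examples, as in Table 1
x1 = rand(n,1) - 0.5;              % x1 ~ U(-0.5, 0.5)
x2 = rand(n,1) - 0.5;              % x2 ~ U(-0.5, 0.5)
x3 = 0.2 * randn(n,1);             % Gaussian noise, assumed sigma = 0.2
X  = [x1 x2 x3];                   % input variables
y  = sign(x1 .* x2);               % class labels y = sign(x1*x2)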
Variables with an enumerated number of values were assigned to be nominal. None of the above data sets contains missing values. However, the Company Liquidity Data contain many values marked by 9999999, which we interpreted as unimportant under the given circumstances. The fraction of such values is large, equal to 24%.

For all the above domain problems, no prior information on the preferable DT shape and size was available. The pruning factor, or the minimal number of data points allowed to be in the splits, pmin, was set between 3 and 50 depending on the size of the data. The proposal probabilities for the death, birth, change-split and change-rule moves were set to 0.1, 0.1, 0.2, and 0.6, respectively. The numbers of burn-in and post burn-in samples were also dependent on the problem, while the sampling rate for all the domain problems was set equal to 7. Note that all the parameters of MCMC sampling were set the same for both Bayesian techniques. The performance of the Bayesian MCMC techniques was evaluated within the Uncertainty Envelope technique using fivefold cross-validation and 2σ intervals. The average size of the induced DTs is an important characteristic of the Bayesian techniques and it was also evaluated in our experiments.

5.2 Experimental Results

5.2.1 Performance on XOR3 Data

Both Bayesian DT techniques, with the standard (BDT1) and the sweeping (BDT2) strategies, perform quite well on the XOR3 data, recognizing 99.7% and 100.0% of the test examples, respectively. The acceptance rate was 0.49 for the BDT1 and 0.12 for the BDT2 strategy. The average number of DT nodes was 11.3 and 3.4 for these strategies, respectively; see Table 2. Both the BDT1 and the BDT2 strategies ran with the value pmin = 5. The
numbers of burn-in and post burn-in samples were set equal to 50,000 and 10,000, respectively. The proposal variance was set equal to 1.0.

Figures 5 and 6 depict samples of the log likelihood and the numbers of DT nodes, as well as the densities of DT nodes, for the burn-in and post burn-in phases of the BDT1 and BDT2 strategies. From the top left plots of these figures we can see that the Markov chain very quickly converges to a stationary value of log likelihood near zero. During post burn-in the values of log likelihood oscillate slightly around zero.

As we can see from Table 2, both the BDT1 and the BDT2 strategies reveal the same performance on the test data. However, the number of DT nodes induced by the BDT2 strategy is much smaller than that induced by the BDT1 strategy.

Table 2. Comparison between BDT1 and BDT2 on the XOR3 Data

strategy   number of DT nodes   perform, %   sure correct, %   uncertain, %   sure incorrect, %
BDT1       11.3±7.0             99.7±0.9     96.0±7.4          4.0±7.4        0.0±0.0
BDT2       3.4±0.2              100.0±0.0    99.5±1.2          0.5±1.2        0.0±0.0
Fig. 5. The Bayesian DT technique with the standard strategy on the XOR3 data: Samples of log likelihood and DT size during burn-in and post burn-in. The bottom plots are the distributions of DT sizes
Fig. 6. The Bayesian DT technique with the sweeping strategy on XOR3 problem: Samples of log likelihood and DT size during burn-in and post burn-in. The bottom plots are the distributions of DT sizes
It is very important that on this test the BDT2 strategy has found a true classification model consisting of the two variables. Besides, the BDT2 strategy provides more sure and correct classifications than the BDT1 strategy.

5.2.2 Performance on Australian Credit Data

On these data, both the BDT1 and the BDT2 strategies ran with the value pmin = 3. The numbers of burn-in and post burn-in samples were set equal to 100,000 and 10,000, respectively. The proposal variance was set equal to 1.0. Both the standard BDT1 and the sweeping BDT2 strategies correctly recognized 85.4% of the test examples. The acceptance rate was 0.5 for the BDT1 and 0.23 for the BDT2 strategy. The average number of DT nodes was 25.8 and 8.3 for these strategies, respectively; see Table 3.
Table 3. Comparison between BDT1 and BDT2 on the Australian Credit Data

strategy   number of DT nodes   perform, %   sure correct, %   uncertain, %   sure incorrect, %
BDT1       25.8±2.3             85.4±4.0     55.1±9.5          42.0±9.1       2.9±2.9
BDT2       8.3±0.9              85.4±4.2     65.4±9.7          30.3±8.9       4.3±2.3
Table 3 shows that both the BDT1 and the BDT2 strategies reveal the same performance on the test data. However, the number of DT nodes induced by the BDT2 strategy is much smaller than that induced by the BDT1 strategy. Additionally, the BDT2 strategy provides more sure and correct classifications than the BDT1 strategy, and its rate of uncertain classifications is also lower.

5.2.3 Performance on German Credit Data

Both Bayesian strategies ran with the value pmin = 3. The numbers of burn-in and post burn-in samples were set equal to 100,000 and 10,000, respectively. The proposal variance was set equal to 2.0 to achieve better performance on these data. The standard BDT1 and the sweeping BDT2 strategies correctly recognized 72.5% and 74.3% of the test examples, respectively. The acceptance rate was 0.36 for the BDT1 and 0.3 for the BDT2 strategy. The average number of DT nodes was 18.5 and 3.8 for these strategies, respectively; see Table 4.

As we can see from Table 4, the BDT2 strategy slightly outperforms the BDT1 strategy on the test data. At the same time, the number of DT nodes induced by the BDT2 strategy is smaller than that induced by the BDT1 strategy, and the BDT2 strategy provides more sure and correct classifications.

Table 4. Comparison between BDT1 and BDT2 on the German Credit Data

strategy   number of DT nodes   perform, %   sure correct, %   uncertain, %   sure incorrect, %
BDT1       27.3±2.8             72.5±6.8     32.8±7.2          62.5±11.4      4.7±4.4
BDT2       20.7±1.1             74.3±5.9     39.4±9.2          54.4±10.5      6.2±3.6
5.2.4 Performance on Company Liquidity Data

Due to the large amount of training data, the BDT1 and the BDT2 strategies ran with the value pmin = 50. The numbers of burn-in and post burn-in samples were set equal to 50,000 and 5,000, respectively. The proposal variance was set equal to 5.0, which, as we found in our experiments, provides the best performance. Both Bayesian DT strategies perform quite well, recognizing 91.5% of the test examples. The acceptance rate was 0.36 for the BDT1 and 0.3 for the BDT2 strategy. The average number of DT nodes was 68.5 and 34.2 for these strategies, respectively; see Table 5. Figures 7 and 8 depict samples of the log likelihood and the numbers of DT nodes, as well as the densities of DT nodes, for the burn-in and post burn-in phases of the BDT1 and BDT2 strategies.

Table 5. Comparison between BDT1 and BDT2 on the Company Liquidity Data

strategy   number of DT nodes   perform, %   sure correct, %   uncertain, %   sure incorrect, %
BDT1       68.5±5.2             91.5±0.3     89.8±1.4          2.9±2.1        7.2±0.8
BDT2       34.2±3.3             91.5±0.5     90.2±1.1          2.5±1.7        7.3±0.8
Fig. 7. The Bayesian DT technique with the standard strategy on the Company Liquidity data
Fig. 8. The Bayesian DT technique with the sweeping strategy on the Company Liquidity Data
From Table 5 we can see that both the BDT1 and the BDT2 strategies reveal the same performance on the test data. However, the number of DT nodes induced by the BDT2 strategy is much smaller than that induced by the BDT1 strategy.
6 Conclusion

The use of the RJ MCMC methodology of stochastic sampling from the posterior distribution makes Bayesian DT techniques feasible. However, when exploring the space of DT parameters, existing techniques may prefer sampling DTs from local maxima of the posterior instead of representing the posterior properly. This affects the evaluation of the posterior distribution and, as a result, causes an increase in the classification uncertainty. This negative effect can be reduced by averaging the DTs obtained in different starts or by restricting the size of the DTs during the burn-in phase. As an alternative way of reducing the classification uncertainty, we have suggested the Bayesian DT technique using the sweeping strategy.
Within this strategy, DTs are modified after birth or change moves by removing the splitting nodes containing fewer data points than acceptable. The performances of the Bayesian DT techniques with the standard and the sweeping strategies have been compared on a synthetic dataset as well as on some datasets from the StatLog Repository and on real financial data. Quantitatively evaluating the uncertainty within the Uncertainty Envelope technique, we have found that our Bayesian DT technique using the sweeping strategy is superior to the standard Bayesian DT technique. Both Bayesian DT techniques reveal rather similar average classification accuracy on the test datasets; however, the Bayesian averaging technique with the sweeping strategy makes more sure and correct classifications. We also observe that the sweeping strategy provides much shorter DTs. Thus we conclude that our Bayesian strategy of averaging over DTs using a sweeping strategy is able to decrease the classification uncertainty without affecting the classification accuracy on the problems examined. Clearly, this is a very desirable property for classifiers used in critical systems, in which classification uncertainty may be of crucial importance for risk evaluation.
Acknowledgements This research was supported by the EPSRC, grant GR/R24357/01.
References

1. Duda R, Hart P, Stork D (2000) Pattern Classification. New York: Wiley-Interscience
2. Kuncheva L (2004) Combining Pattern Classifiers: Methods and Algorithms. New York: Wiley-Interscience
3. Dietterich T (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) Multiple Classifier Systems. Lecture Notes in Computer Science, Berlin Heidelberg New York: Springer, pp 1–15
4. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and Regression Trees. Belmont, CA: Wadsworth
5. Buntine W (1992) Learning classification trees. Statistics and Computing 2: 63–73
6. Chipman H, George E, McCulloch R (1998) Bayesian CART model search. Journal of the American Statistical Association 93: 935–960
7. Denison D, Holmes C, Mallick B, Smith A (2002) Bayesian Methods for Nonlinear Classification and Regression. New York: Wiley-Interscience
8. Green P (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82: 711–732
9. Domingos P (2000) Bayesian averaging of classifiers and the overfitting problem. In: Langley P (ed) Proceedings of the 17th International Conference on Machine Learning. Stanford, CA: Morgan Kaufmann, pp 223–230
10. Schetinin V, Fieldsend JE, Partridge D, Krzanowski WJ, Everson RM, Bailey TC, Hernandez A (2004) The Bayesian decision tree technique with a sweeping strategy. In: Proceedings of the International Conference on Advances in Intelligent Systems – Theory and Applications (AISTA'2004), in Cooperation with the IEEE Computer Society, Luxembourg
11. The StatLog Data (1994) Available at http://www.liacc.up.pt/ML/statlog
12. The 29th Annual Conference of the German Classification Society (2005) Available at http://omen.cs.uni-magdeburg.de/itikmd/gfkl2005/
13. Fieldsend JE, Bailey TC, Everson RM, Krzanowski WJ, Partridge D, Schetinin V (2003) Bayesian inductively learned modules for safety critical systems. In: Proceedings of the 35th Symposium on the Interface: Computing Science and Statistics, Salt Lake City
14. Schetinin V, Partridge D, Krzanowski WJ, Everson RM, Fieldsend JE, Bailey TC, Hernandez A (2004) Experimental comparison of classification uncertainty for randomized and Bayesian decision tree ensembles. In: Yang ZR, Yin H, Everson R (eds) Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'04), pp 726–732
Invariant Hierarchical Clustering Schemes

Ildar Batyrshin and Tamas Rudas
Summary. A general parametric scheme of hierarchical clustering procedures with invariance under monotone transformations of similarity values and invariance under the numeration of objects is described. This scheme consists of two steps: correction of the given similarity values between objects and transitive closure of the obtained valued relation. Some theoretical properties of the considered scheme are studied. Different parametric classes of clustering procedures from this scheme, based on perceptions like "keep similarity classes," "break bridges between clusters," etc., are considered. Several examples are used to illustrate the application of the proposed clustering procedures to the analysis of similarity structures of data.
1 Introduction

At least two goals can be associated with cluster analysis of a set of objects based on information about similarity values between the objects: (1) decomposition of the set of objects into classes of similar objects and (2) analysis of the similarity structure of this set. Unfortunately, many clustering algorithms seeking decomposition of a given set of objects into a given number of classes of similar objects do not bring out the underlying structure but fit the data to some preconceived model [15, 26]. A user of cluster analysis packages can be very happy with the good clusters obtained for his data by some standard clustering procedure, but it is quite possible that the obtained structure of clusters does not reflect the intrinsic structure of the data and is instead imposed by the specifications of the clustering algorithm. One of the reasons for this disadvantage of many popular clustering procedures is their noninvariance under the numeration of objects. A permutation of the numeration of objects at the input of a noninvariant clustering procedure often changes the results of clustering.
This means that the clustering obtained for a given numeration of objects does not reflect the structure of the set of objects. A simple example of such noninvariance of classical algorithms is considered in Sect. 2. The requirement of invariance of clustering algorithms under the numeration (permutation, ordering) of objects is considered in cluster analysis as a most important requirement [1, 6, 8, 21], but, unfortunately, an overwhelming majority of popular clustering algorithms do not satisfy this property. The property is fulfilled for the single linkage (also called nearest neighbor) algorithm discussed in many papers [15, 21–23]. This algorithm builds chains of clusters and for this reason reflects only a specific point of view on a "cluster," which is not always acceptable. In this paper we consider a parametric scheme of invariant clustering procedures which can vary the point of view on a "cluster" and includes the single linkage algorithm as a particular case.

Another important requirement on clustering algorithms is invariance under monotone transformations of similarity values between objects [18, 21, 22, 24]. This is a necessary requirement on a clustering algorithm if similarity values are evaluated by experts in an ordinal scale. This requirement is also desirable for insensitivity of the results of clustering to the choice of similarity or dissimilarity measure.

In this chapter we study a general scheme of hierarchical clustering procedures satisfying both invariance requirements considered above. This scheme, initially proposed by Batyrshin [2–4], is based on the concept of a fuzzy equivalence relation introduced and studied in [28, 30]. A clustering procedure in this scheme consists of two steps: correction of the given similarity values between objects and max–min transitive closure of the obtained valued (fuzzy) relation. When no correction of similarity values is used and only the transitive closure of the given similarity relation is applied, the clustering scheme gives the clustering procedure proposed in [28], which is similar to the single linkage algorithm [16]. Since the transitive closure is invariant under the numeration of objects and under monotone transformations of similarity values, the clustering procedure will satisfy both types of invariance if the correction procedure satisfies them. Several schemes of such invariant parametric correction procedures are considered in this chapter.

To build a rational clustering procedure in the considered scheme, it is necessary to propose a suitable correction procedure. The chapter studies the properties of similarity relations and correction procedures related to the perceptions of a "natural" cluster and a "rational" clustering. Such relationships are formulated as propositions, with the main results given in Theorem 2 and Proposition 5. Theorem 2 gives reasons for the construction of a general
class of correction procedures as transformations decreasing similarity values in the initially given similarity relation. Proposition 5 says that for some class of such transformations the resulting clustering procedure will satisfy the property "keep similarity classes." This result is used further for the construction of clustering procedures "breaking" similarity classes considered as "bridges" between clusters.

Some basic definitions and properties of valued relations are discussed in Sect. 3. In Sect. 4 we consider a general scheme of clustering procedures and its relation to the solution of a problem of approximation of valued similarity relations by valued equivalence relations. Section 5 discusses theoretical properties of the first version of this scheme [3], based on identity neighborhood functions and on the perception "keep similarity classes." Section 6 discusses methods of extraction of valuable clusters from parametric dendrograms constructed by the clustering scheme. The application of clustering procedures with identity neighborhood functions to the clustering of Windham's data [13] is considered in Sect. 7. Section 8 considers a clustering scheme with nonidentity neighborhood functions based on the perception "break bridges between clusters." This scheme is illustrated on "butterfly" data. An example of clustering of time series from economics is considered in Sect. 9. In Sect. 10 we summarize the results of the chapter and discuss possible extensions of the considered clustering schemes.
2 Invariance and Noninvariance of Clustering Procedures

Let us consider a very simple example of a set of seven points symmetrically located on a circle (Fig. 1a). The initial information for clustering is given as a matrix of distances between the objects (see Appendix 1). For the numeration of objects considered in Fig. 1b, the average linkage clustering algorithm realized in Matlab 6.5 builds the dendrogram shown in Fig. 1d. We can extract from this dendrogram a partition, for example, into two clusters. At the highest level of the dendrogram we obtain the clusters {1,2,7} and {3,4,5,6}, which corresponds to a partition of the set of objects into the subsets {a,b,g} and {c,d,e,f}. The explorer, if he does not know the geometrical structure of the data, could be very happy to obtain such a clear partition of the objects into two clusters. But if we change the numeration of objects as shown in Fig. 1c, then due to the symmetry of the data the input matrix will not change and average linkage will give the same dendrogram as in Fig. 1d, but now the partition {1,2,7} and {3,4,5,6} will correspond to the partition of the objects into {b,c,a} and {d,e,f,g}.
Fig. 1. Noninvariance of average linkage algorithm under initial numeration of objects: (a) seven objects symmetrically located on a circle; (b–c) two different numerations of objects; (d) dendrogram obtained for these numerations and resulting clustering of objects: {{a,b},g},{{c,d},{e,f}} and {{b,c},a},{{d,e},{f,g}}
Sequentially rotating the numeration of objects, we can obtain five new partitions of the objects into two clusters. It is clear that none of these partitions by itself reflects the symmetric structure of the data. Only the two trivial partitions A = {{a},{b},{c},{d},{e},{f},{g}} and B = {{a,b,c,d,e,f,g}} correspond to the symmetry in the data. These partitions are constructed by the single linkage clustering algorithm, which is invariant under the numeration of objects. Most known clustering algorithms can also decompose this set of objects into 2, 3, 4, 5 or 6 clusters; these decompositions can be optimal if some optimality criterion is used in the clustering algorithm, but none of them will reflect the similarity structure of the data.

This simple example shows that an explorer should be wary if he wants to use clustering algorithms for the analysis of the similarity structure of data. Most of the popular clustering procedures are noninvariant under the numeration of objects. They can give a "good" partition of the data into clusters, but this partition, even if it optimizes some optimality criterion, can be useless for analyzing the
similarity structure of the data. The following sections consider a parametric scheme of clustering procedures which are invariant under the numeration of objects and have shown good results on test and experimental data.
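The configuration of Fig. 1 can be reproduced along the following lines (a sketch, not taken from the chapter; the radius of the circle is assumed to be 1, which does not affect the symmetry argument).

% Sketch: seven points symmetrically placed on a circle and their distances.
ang = 2*pi*(0:6)'/7;                  % seven equally spaced angles
P   = [cos(ang), sin(ang)];           % coordinates of the objects a,...,g
n   = size(P, 1);
D   = zeros(n);                       % matrix of pairwise Euclidean distances
for i = 1:n
    for j = 1:n
        D(i, j) = norm(P(i,:) - P(j,:));
    end
end

Renumbering the objects (i.e., permuting the rows and columns of D) leaves the matrix unchanged up to the symmetry of the configuration, yet the partition produced by a noninvariant algorithm such as average linkage becomes attached to different objects, which is exactly the effect discussed above.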
3 Basic Definitions

Denote by X a finite set of objects and by R a set of non-negative real values. A function S: X × X → R satisfying the symmetry condition S(x,y) = S(y,x) will be called a proximity relation. A proximity relation D is called a dissimilarity relation if D(x,x) = 0 for all x from X. In this case D(x,y) usually denotes the dissimilarity or distance value between objects x and y. A proximity relation S is called a similarity relation if S satisfies the reflexivity condition S(x,x) = I, where I = max_{y,z}(S(y,z)), for all x, y, z from X. A similarity relation S and a dissimilarity relation D can be obtained one from another, e.g., as follows: D(x,y) = I − S(x,y).

Note that in [30] a similarity relation denotes a reflexive and symmetric valued relation satisfying on X the (∨,∧)-transitivity condition S(x,y) ≥ min{S(x,z), S(z,y)}. Such a relation will be called here a valued equivalence relation. The properties of valued equivalence relations were studied in [28, 30]. The property of (∨,∧)-transitivity is dual to the ultrametric inequality D(x,y) ≤ max{D(x,z), D(z,y)}. If S is a valued equivalence relation then D(x,y) = I − S(x,y) is an ultrametric, and vice versa. The properties of ultrametrics were studied in many works [1, 17, 19, 20, 22, 29].

For any value (level) a from R a valued relation S defines an ordinary relation S[a] and a valued relation S_a as follows:
S[a] = {(x,y) ∈ X × X | S(x,y) ≥ a};  S_a(x,y) = 1 if S(x,y) ≥ a, and S_a(x,y) = 0 if S(x,y) < a.

The valued relation S_a may be considered as a characteristic function of the ordinary relation S[a]. From a < b it follows that S[b] ⊆ S[a] and S_b ⊆ S_a. From the reflexivity and symmetry of S it follows that for all a ∈ R the ordinary relations S[a] are also reflexive and symmetric. If S is (∨,∧)-transitive then all S[a] are transitive relations. As a result, a valued equivalence relation defines a nested set of ordinary equivalence relations and hence a nested partition of X into equivalence classes.

A subset A of X will be called a similarity class of a similarity relation S on X if S(x,y) > S(x,z) for all x, y ∈ A and all z ∉ A. A similarity class A may be considered as a natural cluster in the set X. The value s = min_{x,y∈A}{S(x,y)} will be called the strength of the similarity class A.

Proposition 1. The set of similarity classes of a valued equivalence relation S coincides with the set of equivalence classes of the relations S[a], a ∈ R.

The set S(X) of all similarity relations defined on X is a partially ordered set with the ordering relation ⊆ given as follows: S ⊆ T iff S(x,y) ≤ T(x,y) for all x, y from X. We will write S ⊂ T if and only if S ⊆ T and S ≠ T. S(X) is a distributive lattice [14] with the operations ∩ and ∪ defined on S(X) as follows: (S∩T)(x,y) = min(S(x,y), T(x,y)), (S∪T)(x,y) = max(S(x,y), T(x,y)). Note that the intersection of valued equivalence relations is a valued equivalence relation, but for the union operation a similar property generally does not hold.

The (∨,∧)-composition S ∘ T of valued relations S and T on X is defined as follows:
(S ∘ T)(x,y) = ∨_{z∈X} (S(x,z) ∧ T(z,y)).

The (∨,∧)-transitivity of S can be written as S ⊇ S ∘ S. The (∨,∧)-transitive closure Ŝ of S is defined as Ŝ = \bigcup_{k=1}^{\infty} S^k, where S^k = S^{k−1} ∘ S for all k > 1, and S^1 = S. From the reflexivity of S and from |X| = n it follows that S ⊆ S^2 ⊆ S^3 ⊆ … ⊆ S^{n−1} = S^n = … and hence Ŝ = S^{n−1}. The transitive closure Ŝ of S will also be denoted by TC(S). For the transitive closure of a similarity relation S the following properties are fulfilled:

(1) Ŝ is a valued equivalence relation, i.e., Ŝ is transitive;
(2) S is transitive if and only if S = Ŝ;
(3) if S ⊆ T then Ŝ ⊆ T̂;
(4) S ⊆ Ŝ and Ŝ is the least transitive valued relation containing S, i.e., if S ⊆ T and T is transitive then Ŝ ⊆ T.
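For illustration, the max–min transitive closure of a similarity matrix can be computed as follows (a sketch, not taken from the chapter; S is assumed to be a reflexive, symmetric n × n matrix of similarity values, and the function name is ours).

% Sketch: (max-min) transitive closure TC(S) of a similarity matrix S.
function E = TransitiveClosure(S)
    n = size(S, 1);
    E = S;
    for iter = 1:n-1                       % S^(n-1) is reached after n-1 compositions
        T = E;
        for x = 1:n
            for y = 1:n
                % (max-min) composition of the current closure with S
                T(x, y) = max(min(E(x, :), S(:, y)'));
            end
        end
        if isequal(T, E), break; end       % closure reached early
        E = T;
    end
end

By property (4), the result is the least valued equivalence relation containing S.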
4 General Scheme of Hierarchical Clustering Procedures

A hierarchical clustering procedure can be considered as a transformation of a given similarity relation S into a valued equivalence relation E which defines a nested partition of X into equivalence classes. In terms of ultrametrics, a clustering procedure can be considered as a transformation of a dissimilarity relation into an ultrametric [17, 20, 22]. In terms of valued equivalence relations, there exists a natural relationship between the concepts of equivalence relation, partition and clustering. This approach was used in [28], where the transitive closure of the given similarity relation was used as such a transformation. The method proposed in [28] is equivalent to the single linkage clustering algorithm [16]. We will use here a more general approach, where the transitive closure is applied to a corrected similarity relation. We will consider the following general scheme of clustering procedures [3, 4]:

E = Q(S) = TC(F(S)) = \widehat{F(S)},    (1)

where F is some "correction" of the given similarity relation S and TC is the procedure of transitive closure of valued similarity relations. The
procedure of transitive closure is studied in the theory of fuzzy relations, in graph theory and in cluster analysis, and may be realized by the single linkage clustering method [16] or by special algorithms [25, 27]. This procedure possesses both types of invariance discussed above. When the correction procedure F also exhibits both types of invariance, the clustering procedure Q will satisfy both invariance properties as well. A clustering procedure consisting of these two procedures, F and TC, will be called a relational clustering procedure. In [3, 4] it was also required that a reasonable correction procedure F should satisfy the following constraint:

F(S) ⊆ S,    (2)
where ⊆ is the partial ordering of valued relations. This constraint follows from the following formal considerations. It is desirable to use a correction procedure F such that the distance between the initial similarity relation S and the final equivalence relation E is small. A small transformation of the initial similarity relation produced by the clustering algorithm gives reasons to suppose that the clusters corresponding to the final valued equivalence relation reflect the intrinsic structure of the data. Of course, for some reasonable clustering algorithms that extract clusters of a specific form this distance may be sufficiently large. Nevertheless, a small distance between the initial and final valued relations may be considered as a desirable property for any clustering algorithm. Formally this requirement can be formulated as follows:

Find E* ∈ E(X): d(S, E*) = \min_{E∈E(X)} d(S, E),    (3)
where S is a given similarity relation on X, E(X) is the set of all possible valued equivalence relations defined on X, and d is some distance measure defined on the set S(X) of all similarity relations on X. The problem (3) is studied in a more general form in [3, 5] as a problem of approximation in a partially ordered set with a closure operation.

A function d: S(X) × S(X) → R is called a positive distance function on S(X) if it satisfies on S(X) the following properties:

A1. d(S,S) = 0.
A2. d(S,T) = d(S∩T, S∪T).
A3a. If P ⊆ S ⊂ T then d(P,S) < d(P,T).
A3b. If P ⊂ S ⊆ T then d(S,T) < d(P,T).
It is easy to see that d also satisfies the properties d(S,T) = d(T,S) and d(S,T) > 0 if and only if S ≠ T. A function d will be called an isotonic distance function on S(X) if it satisfies the properties A1, A2 and the property

A3*. If P ⊆ S ⊆ T then max(d(P,S), d(S,T)) ≤ d(P,T).

As an example of a positive distance function on S(X) we can use any metric d defined as d(S,T) = v(S∪T) − v(S∩T), where v is a positive valuation on S(X) [14], i.e., v is a real-valued function v: S(X) → R satisfying the properties v(S∪T) + v(S∩T) = v(S) + v(T), and if S ⊂ T then v(S) < v(T). For example, the function v(S) = Σ_x Σ_y S(x,y) is a positive valuation on the set of all similarity relations, and hence the metric d(S,T) = Σ_x Σ_y |S(x,y) − T(x,y)| defined by this valuation is a positive distance function on S(X). Most of the known metrics are positive distance functions, but the metric d(S,T) = max_{x,y} |S(x,y) − T(x,y)| is only an isotonic one.

Theorem 2. If d is a positive distance function on S(X) then the solution of (3) has the representation

E* = TC(S_c),    (4)

where S_c is some element of S(X) such that

S_c ⊆ S.    (5)
Theorem 2 gives reasons for the constraint (2) on the correction procedure F in the general scheme of clustering procedures (1). Several parametric correction procedures F satisfying condition (2) were proposed in [3, 4, 7], and the resulting clustering procedures showed good results on many real and test data [4, 7, 9]. From F(S) ⊆ S it follows that the correction procedure should decrease some similarity values S(x,y). To be invariant under the numeration of objects, a correction procedure should be applied to all pairs of objects (x,y) simultaneously and independently of their numeration. To be invariant
under monotone transformations of similarity values, a correction procedure should take into account only the mutual linear ordering of the similarity values S(x,y). Of course, when the latter invariance is not required of the clustering algorithm, the correction procedure can use quantitative measures depending on the similarity values S(x,y).

Below is a description of the parameterized correction procedure, given in a more general form in [11] than it was proposed initially in [3]. Suppose f1, f2, f3: R → R are monotone functions. The correction procedure depends on the following sets and functions:

V_y(x) = {z ∈ X \ {x,y} | S(x,z) ≥ f1(S(x,y))},
V_x(y) = {z ∈ X \ {x,y} | S(y,z) ≥ f1(S(x,y))}.

The sets V_y(x) and V_x(y) denote the sets of objects "similar" to x and to y, respectively, when the value f1(S(x,y)) serves as a criterion of this similarity. The set

V(x,y) = {z ∈ X \ {x,y} | max{S(x,z), S(y,z)} ≥ f2(S(x,y))}

contains the objects from X which are "similar" to at least one of the objects x and y. When f1 ≡ f2 we have V(x,y) = V_y(x) ∪ V_x(y). This set will be considered as the set of "neighbors" of x and y. The objects in V(x,y) are taken into account when the decision about correction of the value S(x,y) is made. The set

W(x,y) = {z ∈ V(x,y) | min{S(x,z), S(y,z)} ≥ f3(S(x,y))}

denotes the set of "strong" or "common" neighbors, i.e., objects which are "similar" to both objects x and y. The objects from W(x,y) "support" the value S(x,y). When f1 ≡ f3 we have W(x,y) = V_y(x) ∩ V_x(y). The functions f1, f2, f3 used in the clustering procedure will be called neighborhood functions. The decision about correction of the value S(x,y) will depend on the relative part of objects "supporting" the similarity value S(x,y). One can consider the following methods to calculate, for each pair of objects x and y, this relative part, denoted h_i:

h_1 = \frac{|W(x,y)|}{\min(|V_y(x)|, |V_x(y)|)}, \quad h_2 = \frac{|W(x,y)|}{\max(|V_y(x)|, |V_x(y)|)}, \quad h_3 = \frac{|W(x,y)|}{|V(x,y)|}, \quad h_4 = \frac{|X \setminus V| + |W| - 2}{|X| - 2}, \quad \text{etc.,}
where, by definition, h_i = 1 if the denominator of h_i is equal to 0. The correction procedure F(S) in the clustering procedure Q may be defined as follows:

F(S(x,y)) = \begin{cases} S(x,y) & \text{if } h_i \geq p, \\ F_j(x,y) & \text{otherwise,} \end{cases}

where p ∈ [0,1] and j are parameters, and F_j(x,y) is a corrected value such that F_j(x,y) ≤ S(x,y). We suppose that F_j(x,y) depends on the values S(x,z), S(y,z) for all objects z belonging to the sets of neighbors V_y(x), V_x(y) and V(x,y). We also require that F_j(x,y) ≥ min_{z∈V}{S(x,z), S(y,z)}, where V = V_y(x) ∪ V_x(y) ∪ V(x,y). The specific definition of F_j(x,y) will be discussed later. When p = 0, from h_i ≥ 0 it follows that F(S(x,y)) = S(x,y), i.e., for all x, y from X the values S(x,y) remain uncorrected, and Q(S) = TC(F(S)) = TC(S), i.e., the clustering procedure coincides with the single linkage method and the method considered in [16, 28].

Instead of the relative part of supporting neighbors h_i, it is possible to consider the number of supporting neighbors, which can be calculated as g_1 = |W(x,y)| or g_2 = |W(x,y)| + |X \setminus V| − 2. In this case the correction procedure can be defined as follows:

F(S(x,y)) = \begin{cases} S(x,y) & \text{if } g_i \geq t, \\ F_j(x,y) & \text{otherwise,} \end{cases}

with parameter t ∈ {0, 1, …, n−2}, n = |X|. To be invariant under the numeration of objects, a correction procedure should contain the same parameters for all pairs of objects, or these parameters should be independent of the numeration of the objects.
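As an illustration of the scheme, the following sketch (not taken from the chapter) implements one variant of the correction with identity neighborhood functions f1 = f2 = f3, the relative part h3, and, as the corrected value, the minimum similarity over the neighborhood, which is one of the variants discussed in the next section. The function name and loop structure are ours.

% Sketch: one pass of the correction F(S) with identity neighborhood
% functions, the relative part h3, and the minimum over V as corrected value.
function Sc = CorrectSimilarity(S, p)
    n  = size(S, 1);
    Sc = S;
    for x = 1:n-1
        for y = x+1:n
            z   = setdiff(1:n, [x y]);             % candidate neighbors
            inV = max(S(x,z), S(y,z)) >= S(x,y);   % "similar" to x or y
            inW = min(S(x,z), S(y,z)) >= S(x,y);   % "similar" to both x and y
            if sum(inV) == 0
                h3 = 1;                            % by definition when |V| = 0
            else
                h3 = sum(inW) / sum(inV);
            end
            if h3 < p                              % S(x,y) not supported enough
                v   = z(inV);
                Fxy = min(min(S(x,v)), min(S(y,v)));
                Sc(x,y) = Fxy;  Sc(y,x) = Fxy;     % decrease S(x,y)
            end
        end
    end
end

Combining it with the transitive closure sketched earlier, E = TransitiveClosure(CorrectSimilarity(S, p)) would realize the clustering relation Q(S) of (1) under these assumptions.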
5 Clustering Procedures with Identity Neighborhood Functions

Clustering procedures with identity functions f1–f3 were considered in the first clustering scheme [3, 4], where the relative part h3 and the correction procedure F1(x,y) = min_{z∈V} min{S(x,z), S(y,z)} were considered. More generally, one can consider some aggregation of the values S(x,z), S(y,z), (z ∈ V), that are less than S(x,y). As such an aggregation function one can use a correction
procedure F_j(x,y) equal to the mean or the max [7]. These procedures were introduced heuristically and yielded good results on different experimental and testing data [4, 7, 9]. The ordinal versions of the correction procedures F_j were considered in [11].

The correction procedure formalizes the following idea. We say that two objects x and y are identical in S if S(x,y) = I and S(x,z) = S(y,z) for all objects z from X \ {x,y}. More generally, two objects x and y will be called indistinguishable on the level a ∈ R if S(x,y) ≥ a and, for any z ∈ X, S(x,z) ≥ a whenever S(y,z) ≥ a. It is clear that two objects indistinguishable on some level a will be identical in the similarity relation S_a. It is also clear that all objects are indistinguishable on the minimal possible level 0. Two objects x and y will be called indistinguishable in S if they are indistinguishable on the level a = S(x,y).

Proposition 3. A similarity relation S defined on X is a valued equivalence relation if and only if all objects of X are indistinguishable in S.

From the properties of the transitive closure procedure it follows that TC transforms any similarity relation S into a valued equivalence relation E such that S ⊆ E and E is the minimal valued equivalence relation including S. Hence the transitive closure procedure produces a minimal increase of the values S(x,y) when it transforms S into the valued equivalence relation E. From Proposition 3 we can conclude that this procedure transforms nonindistinguishable pairs of objects into indistinguishable ones. Hence we can suppose that the total value of the transformation of S into E produced by TC depends on the number of nonindistinguishable pairs of elements in S and on the "degree of indistinguishability" of these elements, if we can measure it. Hence, the correction procedure F, decreasing the similarity values S(x,y), should produce such minimal corrections of these values as will increase the number of indistinguishable pairs of objects or increase the "degree of indistinguishability" of pairs of objects. In this case the transformation TC(F(S)) produced by the transitive closure will be small.

For the construction of a suitable correction procedure it is desirable to decide for what pairs of objects (x,y) the similarity values S(x,y) should be
corrected and how these values should be decreased. For these purposes the following evaluation of indistinguishability may be used. We will say that two objects x and y are indistinguishable with respect to an object z if S(x,z) ≥ S(x,y) implies S(y,z) ≥ S(x,y). In this case we will say that the object z “supports” the similarity value S(x,y). The more objects in X support the similarity value S(x,y), the greater the degree of indistinguishability of x and y. Our goal is to change the value of S(x,y) so that the number of objects supporting the similarity between x and y, and hence the degree of indistinguishability of these objects, increases. If the objects x and y are indistinguishable only with respect to a small part of the objects, and hence show different behavior on a large part of the objects, then the similarity value S(x,y) is not confirmed or supported by the objects of the set X and, as a result, it can be corrected (decreased). This idea of the correction procedure is illustrated by Fig. 2, where the nodes of the graph denote the objects of a set X and the presence of an edge between two nodes denotes that these objects are “similar” with degree I. Reflexive edges in nodes are omitted. For simplicity we consider an ordinary relation in which all edge weights equal 0 or I. The objects x, z and v, w are identical in the relation presented by the graph in Fig. 2. We have Vu(y) = {x,z}, Vy(u) = {v,w}, W(y,u) = ∅, i.e., the similarity between the objects y and u is not supported by neighboring objects, hence the respective edge can be deleted. The graph of the valued equivalence relation nearest to S can be obtained by deleting the edge (y,u) and then taking the transitive closure of the resulting graph, i.e., by adding the edge (u,t). The resulting equivalence relation, presented by the graph in Fig. 2b, contains two equivalence classes {x,y,z} and {u,v,w,t}. Since the correction procedure considered in Sect. 4 depends on the parameter p (or t), for some values of this parameter the correction procedure can also delete in the initial similarity relation the edges (u,v) and (u,w) (Fig. 2c), or even delete all edges except the edges (x,z) and (v,w) between identical objects (Fig. 2d). If no correction is applied, then all objects are joined in one cluster. Analysis of all possible similarity structures generated by the clustering procedure for different parameter values gives the following nontrivial clusters: {x,y,z}, {u,v,w,t}, {v,w,t}, {v,w}, {x,z}. All of these clusters describe similarity structures existing in the data, but different arguments in favor of these clusters can be used.
Fig. 2. (a) Graph of the initial similarity relation S; (b–d) graphs of possible equivalence relations E obtained from S by the clustering procedure. The selection of edges for correction depends on the value of the parameter p and hence on the requirement on the “similarity” of objects joined in one cluster
For valued relations and graphs the situation is more complicated, because the correction procedure decreases the weights of edges instead of deleting them. Some methods of analysis of similarity structures generated by the parametric scheme are discussed in Sect. 6. One of the desirable properties of a clustering procedure is “to keep equivalence relations” and “to keep similarity classes”: if such clusters exist in the initially given similarity relation S, then these clusters should also exist in the clustering obtained by the clustering procedure. It can be proved that a clustering procedure Q with identity neighborhood functions f1–f3 “keeps equivalence relations” and “keeps similarity classes.” Proposition 4. For clustering procedures Q with identity functions f1–f3 it holds that Q(S) = S if and only if S is a valued equivalence relation. Proposition 5. Clustering procedures from the proposed scheme “keep similarity classes” if the neighborhood functions f1 and f2 used in these procedures are identity functions. Let LV(x,y) denote the list of all values S(x,z), S(y,z), (z∈V), which are less than S(x,y), ordered in descending order. Denote the number of elements in LV(x,y) by m = |LV(x,y)| and the elements of LV(x,y) by lk (k = 1, …, m). If m > 1
then lk ≥ lk+1 for all k = 1, …, m−1. Consider the ordinal generalizations of the correction procedures Fj(x,y) proposed in [11]. When m > 1, possible corrections are defined by the parameter j:
j=1: Fj(x,y) = lm, the minimal value of LV(x,y);
j=2: Fj(x,y) = l1, the maximal value of LV(x,y);
j=3: Fj(x,y) = (∑ lk)/m, the mean of all values from LV(x,y);
j=4: Fj(x,y) = lk, where k∈{1, …, m} is a parameter; F2 is a special case of F4;
j=5: Fj(x,y) = median(LV(x,y)).
All correction procedures Fj(x,y) for j = 1,…,5 are invariant under the numeration of objects, and the correction procedures Fj(x,y) for j = 1, 2, 4, 5 are invariant under monotone transformations of similarity values.
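A small sketch of these five ordinal corrections, assuming the list LV(x,y) has already been computed and is given in descending order; the function name and signature are illustrative, not part of the original scheme.

```python
def ordinal_correction(lv, j, k=None):
    """Return Fj(x, y) for a nonempty descending list lv = LV(x, y)."""
    m = len(lv)
    if j == 1:                      # minimal value of LV(x, y)
        return lv[-1]
    if j == 2:                      # maximal value of LV(x, y)
        return lv[0]
    if j == 3:                      # mean of LV(x, y)
        return sum(lv) / m
    if j == 4:                      # k-th value, k in {1, ..., m}
        return lv[k - 1]
    if j == 5:                      # median of LV(x, y)
        mid = m // 2
        return lv[mid] if m % 2 == 1 else (lv[mid - 1] + lv[mid]) / 2
    raise ValueError("j must be in 1..5")

# Example: for LV(x, y) = [0.8, 0.7, 0.4],
# ordinal_correction([0.8, 0.7, 0.4], j=5) returns 0.7.
```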
6 Selection of Valuable Clusters

The considered clustering scheme, for given values of its parameters, defines some hierarchical clustering procedure. Generally, the hierarchy constructed by a clustering procedure is considered as the sought similarity structure of the data. If the goal of the analysis is a partition of the data, then in the more traditional approach some level of the hierarchy is selected and the clusters on this level define a partition of the data. In naive approaches the number of clusters in the partition is fixed a priori, and the level of the dendrogram is selected so that the corresponding partition contains the desired number of clusters. The level-based approach to the selection of clusters has the following weakness. Frequently, natural clusters existing in the data are generated on different levels of the hierarchy. For this reason, on high levels of the hierarchy small natural clusters can disappear as a result of their union into large conglomerates. Correspondingly, on low levels large natural clusters can be separated into small non-natural fragments. Other approaches extract “valuable” clusters from the hierarchy, e.g., clusters existing on a large number of levels or clusters constructed on a high level of similarity (on a small level of dissimilarity). We use the “structural” approach to the selection of valuable clusters from a hierarchy proposed in [4]. Suppose that on some level of the hierarchy two clusters A and B are joined together in a cluster C = A∪B. Then the importance m of these clusters is calculated as follows:
m(A) = m(B) = min(NA, NB), where NA and NB are the numbers of objects in the clusters A and B, respectively. We say that a cluster A is a “valuable” cluster if m(A) ≥ M, where M is a given number greater than 1. The level M can be selected adaptively, depending on the number of valuable clusters extracted for different values of M. The reasons for considering such a measure of importance of clusters are the following. Suppose NB < M. It means that the set A is joined with a “nonvaluable” amount of objects and hence A is still “in the process of formation of a cluster.” For this reason the cluster A, even if it contains a large number of objects, receives a small importance value. But if NA, NB ≥ M, then we say that both A and B are “valuable” clusters. If more than two clusters are joined on some level, then the importance value of all such clusters is determined by the two of them that have the maximal numbers of objects. Usually, real data do not contain clear-cut clusters. As in a desert, where sand dunes have different forms and mutual locations and some small dunes can be considered as parts of large dunes or as separate dunes depending on the “definition” of the concept “dune,” in the considered parametric clustering scheme a change of the parameters of the clustering procedure changes the concept of “similarity” or “indistinguishability” and, hence, causes the construction of slightly different hierarchical similarity structures. Analysis of the hierarchical structures obtained by the clustering procedure for different values of the parameters can be used for the extraction of all possible valuable clusters in the data or for the selection of “stable” clusters present in most of the hierarchies. For the example presented in Fig. 2 we can say that the similarity structure of the data contains the clusters {x,y,z}, {u,v,w,t}, {v,w,t}, {x,z} and {v,w}. Another approach to the selection of the “best clustering” of the data uses a distance measure between similarity relations. The change of parameters in the clustering procedure can be used for the selection of the hierarchical structure corresponding to the valued equivalence relation E on the output of the clustering procedure with minimal distance from the initial similarity relation S. For the example presented in Fig. 2 the minimal distance is achieved by the partition in Fig. 2b. Note that for crisp equivalence relations the hierarchy of partitions is reduced to one partition.
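The importance measure can be sketched as follows, under the simplifying assumptions that merges are binary and are represented as pairs of frozensets (an illustrative data structure, not the authors' one):

```python
def valuable_clusters(merges, M):
    """merges: list of (A, B) pairs of frozensets merged at successive levels.
    Both merged clusters receive importance min(|A|, |B|); clusters with
    importance >= M are returned as "valuable"."""
    importance = {}
    for A, B in merges:
        importance[A] = importance[B] = min(len(A), len(B))
    return [C for C, m_C in importance.items() if m_C >= M]

# Example with the clusters of Fig. 2:
left, right = frozenset("xyz"), frozenset("uvwt")
merges = [(frozenset("xz"), frozenset("y")),
          (frozenset("vw"), frozenset("t")),
          (frozenset("vwt"), frozenset("u")),
          (left, right)]
print(valuable_clusters(merges, M=2))   # {x,y,z} and {u,v,w,t}, importance 3 each
```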
7 Example of Windham’s Data

Clustering procedures with identity neighborhood functions f1–f3 are illustrated here on Windham’s data [13] shown in Fig. 3. Dissimilarity values (squares of distances) for these data are given in Appendix 2. The single linkage method gives a trivial clustering of these objects: either all objects are joined in one cluster or there are 11 clusters containing one object each. It is clear that similarity classes are absent in the given data, but the two clusters {1,2,3,4,5} and {7,8,9,10,11} can be considered as “natural.” These clusters are constructed by relational clustering procedures based on the procedures Fj with parameters h3 and j = 2, j = 4 (k = 2), j = 5 for all values of the parameter p∈{0.1,0.2,…,1}, and for the parameter j = 3 for almost all values of the parameter p on the higher levels of the dendrogram. In addition to the “natural clusters,” relational clustering procedures with parameter values j = 1, 3, 4 (k > 2) construct dendrograms with the following nontrivial clusters and partitions: {1,2,4}, {8,10,11}, {5,6,7}; {{{{1,2,4},3},5}, {{{8,10,11},9},7}}; {{1,2,3,4},{5,6,7},{8,9,10,11}}. Invariance of the constructed clusters under the numeration of objects can be easily seen from the symmetry of the data. These clusters describe the symmetric structure of the considered set of objects. Note that most known clustering algorithms cannot extract such a symmetric structure from the data.
Fig. 3. Windham’s data [13]
The distance d(D,U) between the initial dissimilarity function D corresponding to Windham’s data and the ultrametrics U corresponding to the dendrograms constructed by the clustering algorithms was minimal for the ultrametric U corresponding to the clustering into the “natural clusters.” As distances d we have used the distances ds(S,T) = (∑x ∑y |S(x,y) − T(x,y)|^s)^(1/s), with s = 1 and s = 2.
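For matrices stored as NumPy arrays, this distance can be computed directly; the snippet below is only an illustration of the formula.

```python
import numpy as np

def d_s(S, T, s):
    """Distance d_s(S, T) between two relation matrices; s = 1 is the
    city-block distance, s = 2 the Euclidean one."""
    return float((np.abs(S - T) ** s).sum() ** (1.0 / s))

# d_s(D, U, 1) and d_s(D, U, 2) compare a dissimilarity matrix D
# with the ultrametric U of a dendrogram.
```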
8 Clustering Procedures with Nonidentity Neighborhood Functions

Nonidentity neighborhood functions f1–f3 in the considered scheme of clustering procedures may be used for the construction of clusters based on the perception “break bridges between clusters.” This approach is, from a certain point of view, opposite to the approach “keep similarity classes,” because some similarity clusters considered as “bridges” between natural clusters can be broken down. Figure 4 shows an example of “butterfly” data, which may be considered as a junction of two clusters forming the “wings” of a butterfly. The two central points, forming a “similarity class,” should belong to different clusters corresponding to the “wings.” The clustering procedure
Fig. 4. “Butterfly” data with 18 points
should be able “to break” this “bridge” between the clusters. Note that the distances between points increase when moving away from the origin of coordinates along the x-axis. The coordinates of the points are given in Appendix 3. Relational clustering procedures with nonidentity neighborhood functions make it possible to classify the “butterfly” data into two clusters. Euclidean distances between objects were considered as ordinal dissimilarity values, and a clustering procedure with ordinal nonidentity functions f2 and f3 was used. These functions were defined by f2(D(x,y)) = lr, f3(D(x,y)) = lq, where lr and lq are the “rth” and “qth” dissimilarity values chosen from the ordered list of dissimilarity values D(y,z), D(x,z) that are greater than D(x,y), where z∈X\{x,y} for f2 and z∈V(x,y) for f3. Similarity values can be obtained from dissimilarity values by S(x,y) = maxu,v D(u,v) − D(x,y). The clustering into two wings on the higher levels of the dendrograms was obtained, for example, for the following values of the parameters (j = 1 and h3 were used): (1) (r = 1, q = 1,2), (r = 2, q = 1,2), (r = 3, q = 1,…,4), p = 0.1,…,0.5 or p = 0.1,…,0.7; (2) (r = 3, q = 5,6), p = 0.1,…,1. As one can see, for case (2) the two wings were constructed for all values of the parameter p (chosen with step 0.1) greater than 0.
9 Example of Clustering of Time Series

Consider the application of relational clustering procedures to the clustering of time series of economic data. We use data from [31], which contain time series of world per capita gross domestic product using market exchange rates, 1980–2002. The time series used in the example are presented in Fig. 5. As a dissimilarity measure between time series we used the measure of local trend associations [12], calculated as d(y,x) = 0.5(1 − AM(y,x)), where AM(y,x) = max(AFK(y,x)), K = {2,3}. The values of this dissimilarity measure used in the clustering procedure are given in Appendix 4. We scanned the similarity structures of these data by parametric clustering procedures with the following values of the parameters: f1, f2, f3 = 1, 1.5; j = 1,2,3; p = 0, 0.1,…,1. The minimal distance between the initial similarity relation and the final valued equivalence relation was obtained for the following values of the parameters: f1, f2, f3 = 1.5; j = 1; p = 1. The respective dendrogram is shown in Fig. 6.
Fig. 5. Time series of world per capita gross domestic product using market exchange rates, 1980–2002. Series: 1. Mexico, 2. USA, 3. Cuba, 4. Argentina, 5. Brazil, 6. Venezuela, 7. Bulgaria, 8. Poland, 9. Iraq, 10. UAE, 11. Madagascar, 12. South Africa, 13. Australia, 14. China, 15. India, 16. Japan
Fig. 6. Clustering of time series for parameters: f1, f2, f3 = 1.5; j = 1; p = 1
Fig. 7. Clustering of time series for parameters: f1, f2, f3 = 1; j = 2; p = 0.2,…,1
If we consider clusters containing at least two objects (see Sect. 6) as “valuable,” then we obtain the following valuable clusters: C1 = {13.Australia, 14.China, 2.USA, 16.Japan, 15.India, 5.Brazil, 1.Mexico}, C2 = {3.Cuba, 11.Madagascar}, C3 = {7.Bulgaria, 8.Poland}, C4 = {6.Venezuela, 4.Argentina}, C5 = C1∪C2, C6 = X\C4. Analysis of the similarity structures generated for other values of the parameters also gives, for example, the following valuable clusters: {13.Australia, 14.China, 2.USA, 16.Japan}, {3.Cuba, 11.Madagascar, 7.Bulgaria}, {12.South Africa, 10.United Arab Emirates}, {5.Brazil, 8.Poland}. Figure 7 gives another example of a hierarchical clustering of the data, obtained by clustering procedures for the parameter values f1, f2, f3 = 1; j = 2; p = 0.2,…,1.
10 Conclusion

In this chapter we studied a general scheme of clustering procedures based on correction and transitive closure of the given similarity relation. The main properties of these procedures are invariance under the numeration of objects and invariance under monotone transformations of similarity
values. Another important property of the general clustering scheme is its parametric definition. Different choices of these parameters make it possible to change the meaning of “indistinguishability” of objects and to analyze possible similarity structures of the data in an exploratory manner, from different points of view. Several of the considered parametric clustering procedures are based on the perceptions “keep similarity classes” and “break bridges between clusters.” Several propositions and test examples illustrate the properties of the clustering scheme. The proposed scheme can be extended in several directions; for example, new types of correction procedures can be proposed. In [10] such an extension of the clustering scheme based on the perception “break bridges between clusters” was considered.
Acknowledgments This work was supported in part by the Research Fellowship Program of the Open Society Institute and by the IMP projects D.00006 and D.00322.
References

1. Barthelemy J.P. & Guenoche A. (1991). Trees and Similarity Representations. Chichester: Wiley
2. Batyrshin I.Z. (1980). Clustering based on fuzzy similarity relations. In: Third Workshop “Control with Presence of Fuzzy Categories”, Perm, Russia, pp. 25–27 (in Russian)
3. Batyrshin I.Z. (1982). Methods of Systems Analysis Based on Valued Relations. PhD Thesis. Moscow Power Engineering Institute (in Russian)
4. Batyrshin I.Z. & Shuster V.A. (1984). The structure of semantic spaces of verbal estimates of actions. Acta et Commentationes Universitas Tartuensis, Transactions on Artificial Intelligence, Principle Questions of Knowledge Theory, Tartu, 688, 20–38 (in Russian)
5. Batyrshin I.Z. (1985). About approximation task in a partially ordered set. In: Mathematical Methods of Optimization and Control in Systems. Kalinin: Kalinin State University, pp. 50–56
6. Batyrshin I. (1994). Errors of type 2 in cluster analysis and invariant cluster procedures based on similarity relations. In: Application of Fuzzy Systems, ICAFS-94 (Ed. by R. Aliev and R. Kenarangui). Iran: University Press of Tabriz, pp. 374–378
7. Batyrshin I.Z. & Khabibulin R.Ph. (1995). Attribution of pseudonymous works of literature based on invariant relational clustering algorithms. In: Computational Linguistics and its Applications, Proceedings of the International Workshop, Kazan, pp. 43–55 (in Russian)
8. Batyrshin I. & Khabibulin R. (1998). On invariance of clustering procedures. Journal of Fuzzy Mathematics, 6(3), 721–733
9. Batyrshin I., Khabibulin R., Fatkullina R. (1996). Application of fuzzy relational clustering algorithms to ecological data. In: ICAFS-96, Second International Conference on Application of Fuzzy Systems and Soft Computing (Ed. by R.A. Aliev et al.). Siegen, Germany, pp. 115–117
10. Batyrshin I. & Klimova A. (2002). New invariant relational clustering procedures. In: Proceedings of East West Fuzzy Colloquium 2002, 10th Zittau Fuzzy Colloquium, Zittau, Germany, pp. 264–269
11. Batyrshin I. & Rudas T. (2000). Invariant clustering procedures based on corrections of similarities. In: Proceedings of East West Fuzzy Colloquium, Zittau, Germany, pp. 302–309
12. Batyrshin I., Herrera-Avelar R., Sheremetov L., Panova A. Moving approximation transform and local trend associations in time series data bases. In this book
13. Bezdek J.C. (1990). A note on two clustering algorithms for relational network data. SPIE, Vol. 1293, Applications of Artificial Intelligence, VIII, 268–277
14. Birkhoff G. (1967). Lattice theory. Providence, RI: American Mathematical Society
15. Duda R.O. & Hart P.E. (1973). Pattern Classification and Scene Analysis. New York: Wiley
16. Dunn J.C. (1974). A graph-theoretic analysis of pattern classification via Tamura’s fuzzy relation. IEEE Transactions on Systems, Man and Cybernetics, SMC-4, 310–313
17. Hartigan J.A. (1967). Representation of similarity matrices by trees. Journal of the American Statistical Association, 62, 1140–1158
18. Hubert L.J. (1973). Monotone invariant clustering procedures. Psychometrika, 38(1), 47–62
19. Jambu M. (1978). Classification automatique pour l’analyse des donnees. Paris, France: Dunod
20. Jardine C.J., Jardine N., Sibson R. (1967). The structure and construction of taxonomic hierarchies. Mathematical Biosciences, 1, 173–179
21. Jardine N. & Sibson R. (1971). Mathematical taxonomy. London: Wiley
22. Johnson S.C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3), 241–254
23. Lance G.N. & Williams W.T. (1969). A general theory of classificatory sorting strategies. I. Hierarchical systems. The Computer Journal, 9(4), 373–380
24. Matula D.W. (1977). Graph theoretic techniques for cluster analysis algorithms. In: Classification and Clustering (Ed. by J. Van Ryzin). New York: Academic, pp. 95–129
25. Naessens H., De Meyer H., De Baets B. (1999). Novel algorithms for the computation of transitive closures and openings of proximity relations. In: Proceedings of EUROFUSE-SIC’99, pp. 200–203
26. Sokal R.R. (1977). Clustering and classification: background and current directions. In: Classification and Clustering (Ed. by J. Van Ryzin). New York: Academic, pp. 1–15
27. Swamy M.N.S. & Thulasiraman K. (1981). Graphs, Networks, and Algorithms. New York: Wiley
28. Tamura S., Higuchi S., Tanaka K. (1971). Pattern classification based on fuzzy relations. IEEE Transactions on Systems, Man and Cybernetics, SMC-1, 61–66
29. Young M.R. & DeSarbo W.S. (1995). A parametric procedure for ultrametric tree estimation from conditional rank order proximity data. Psychometrika, 60(1), 47–75
30. Zadeh L.A. (1973). Similarity relations and fuzzy orderings. Information Sciences, 3, 177–200
31. International Gross Domestic Product, Population, and General Conversion Factors Information. http://www.eia.doe.gov/emeu/international/other.html. Energy Information Administration. Official Energy Statistics from the U.S. Government
Appendix 1

Table 1. Distances between seven points symmetrically located on a circle (Fig. 1)

      a       b       c       d       e       f       g
a  0       0.8678  1.5637  1.9499  1.9499  1.5637  0.8678
b  0.8678  0       0.8678  1.5637  1.9499  1.9499  1.5637
c  1.5637  0.8678  0       0.8678  1.5637  1.9499  1.9499
d  1.9499  1.5637  0.8678  0       0.8678  1.5637  1.9499
e  1.9499  1.9499  1.5637  0.8678  0       0.8678  1.5637
f  1.5637  1.9499  1.9499  1.5637  0.8678  0       0.8678
g  0.8678  1.5637  1.9499  1.9499  1.5637  0.8678  0
Appendix 2

Table 2. Dissimilarity values for Windham’s data (Fig. 3)

n/n    1    2    3    4    5    6    7    8    9   10   11
  1    0    6    3    6   11   25   44   72   69   72  100
  2    6    0    3   11    6   14   28   56   47   44   72
  3    3    3    0    3    3   11   25   47   44   47   69
  4    6   11    3    0    6   14   28   44   47   56   72
  5   11    6    3    6    0    3   11   28   25   28   44
  6   25   14   11   14    3    0    3   14   11   14   25
  7   44   28   25   28   11    3    0    6    3    6   11
  8   72   56   47   44   28   14    6    0    3   11    6
  9   69   47   44   47   25   11    3    3    0    3    3
 10   72   44   47   56   28   14    6   11    3    0    6
 11  100   72   69   72   44   25   11    6    3    6    0

Appendix 3

Table 3. Coordinates of “Butterfly” data (Fig. 4)

   x    y      x    y
 -10    4      1    0
 -10    2      5    2
 -10    0      5    0
 -10   -2      5   -2
 -10   -4     10    4
  -5    2     10    2
  -5    0     10    0
  -5   -2     10   -2
  -1    0     10   -4
Appendix 4

Table 4. Dissimilarity values between time series (Fig. 5)

       1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16
 1   0.000  0.387  0.553  0.437  0.537  0.532  0.731  0.601  0.677  0.345  0.549  0.424  0.310  0.315  0.283  0.404
 2   0.387  0.000  0.450  0.457  0.225  0.681  0.383  0.206  0.507  0.667  0.577  0.531  0.066  0.168  0.320  0.184
 3   0.553  0.450  0.000  0.770  0.431  0.706  0.282  0.301  0.565  0.547  0.155  0.317  0.484  0.573  0.573  0.475
 4   0.437  0.457  0.770  0.000  0.479  0.238  0.762  0.468  0.372  0.540  0.724  0.684  0.554  0.451  0.602  0.584
 5   0.537  0.225  0.431  0.479  0.000  0.547  0.378  0.174  0.458  0.545  0.462  0.413  0.270  0.272  0.596  0.376
 6   0.532  0.681  0.706  0.238  0.547  0.000  0.723  0.677  0.612  0.429  0.594  0.468  0.630  0.547  0.698  0.530
 7   0.731  0.383  0.282  0.762  0.378  0.723  0.000  0.239  0.405  0.708  0.338  0.385  0.433  0.546  0.445  0.465
 8   0.601  0.206  0.301  0.468  0.174  0.677  0.239  0.000  0.223  0.525  0.402  0.313  0.301  0.359  0.438  0.546
 9   0.677  0.507  0.565  0.372  0.458  0.612  0.405  0.223  0.000  0.298  0.585  0.588  0.630  0.602  0.695  0.714
10   0.345  0.667  0.547  0.540  0.545  0.429  0.708  0.525  0.298  0.000  0.433  0.232  0.618  0.598  0.614  0.695
11   0.549  0.577  0.155  0.724  0.462  0.594  0.338  0.402  0.585  0.433  0.000  0.295  0.607  0.689  0.658  0.565
12   0.424  0.531  0.317  0.684  0.413  0.468  0.385  0.313  0.588  0.232  0.295  0.000  0.369  0.500  0.450  0.597
13   0.310  0.066  0.484  0.554  0.270  0.630  0.433  0.301  0.630  0.618  0.607  0.369  0.000  0.056  0.266  0.180
14   0.315  0.168  0.573  0.451  0.272  0.547  0.546  0.359  0.602  0.598  0.689  0.500  0.056  0.000  0.261  0.262
15   0.283  0.320  0.573  0.602  0.596  0.698  0.445  0.438  0.695  0.614  0.658  0.450  0.266  0.261  0.000  0.433
16   0.404  0.184  0.475  0.584  0.376  0.530  0.465  0.546  0.714  0.695  0.565  0.597  0.180  0.262  0.433  0.000
Fuzzy Components of Cooperative Markets Milan Mareš
Summary. The models of the free exchange market belong to the basic contributions of mathematics and, especially, operations research to the theoretical investigation of economic phenomena. In this chapter we deal with the Walras equilibrium model and its more cooperative modification and analyze some possibilities of its fuzzification. The main attention is focused on the vagueness of utility functions and of prices, which can be considered the most subjective (utilities) or the most unpredictable (prices) components of the model. Some marginal comments deal with the sense and possibility of fuzzification of the cooperating coalitions. The elementary properties of the fuzzified model are presented, and the adequacy of the suggested fuzzy set theoretical methods to the specific properties of real market models is briefly discussed.
1 Introduction

The models of market equilibrium, here especially the Walras equilibrium, are deeply investigated in the literature on operations research. Their close relation to cooperative game theory belongs to the most significant results of that investigation, and its different modifications are still thoroughly analyzed. The classical market models are deterministic, respecting the paradigm that all input parameters are exactly known. It is evident that this presumption is not correct in many real situations in which the exchange of goods (in a very wide sense) is realized. Namely, a market, as an environment in which subjective preferences, intuitive expectations, and rather chaotic behavior of individual agents are typical phenomena, is an object of investigation in which some vagueness is to be expected and included in the theoretical models. It means that the substitution of some of its components by their fuzzy counterparts is quite desirable. Most of this chapter is devoted to the fuzzification of two quantitative data – namely, the individual utilities and the prices – and to their representation by fuzzy quantities (see, e.g., [10, 11]). In this respect,
the fuzzified cooperative game model presented in [12] and developed in some other papers appears to be a useful analogy. Moreover, some sections of this work freely use ideas which were briefly developed in [7, 8] and which regard more intensive cooperation among participants of the market: there exist groups of agents behaving like homogeneous blocks, respecting the standard market behavior in their market activities toward other partners, but whose members among themselves use much more liberal limitations of their exchange (e.g., they do not respect prices) in order to maximize the total profit of the block. The cooperative behavior is, on a general level, modeled by so-called cooperative games. For our purposes, we focus our attention on the games with transferable utility, briefly TU-games (see, e.g., [6, 16, 17]). Their close relation to the market models is well known and is formulated, e.g., in [5]. It means that even the fuzzification of the market and its equilibrium can be inspired by well-elaborated approaches to the fuzzification of TU-games. One of them, consisting in the fuzzification of some quantitative components of the model, was already mentioned above, and it is investigated in Sect. 3 of this chapter. Its main concepts are analogous to the fuzzy cooperative games model [12], use the methodology summarized in [10] and [11], and further develop the model briefly suggested in [9]. There exists also another approach to the fuzzification of TU-games, consisting in the fuzzification of the structure of the cooperation. For games, it was formulated in [1, 2], and it is still being developed (cf. [3, 4, 13, 14]). Its transformation to the market model is not simple. It demands a deep analysis of the particular forms of participation of agents (players) in coalitions, especially in the cases where their market activities are to be distributed among several coalitions. In Sect. 4 of this paper, we briefly discuss this topic from the point of view of its eventual development in other papers. The fuzzification of the market equilibrium suggested here opens a new field of investigation which appears promising and which can lead to some inspiring results on the behavior of agents in free exchange markets connected with the uncertainty and vagueness typical of realistic market situations.
2 Preliminaries

Before the presentation of the analyzed model, it is useful to recollect some concepts which are used in the following sections. They regard, especially, the theory of fuzzy quantities, deterministic cooperative games with transferable utility, and deterministic market equilibria. In this and in all following sections we denote by R the set of all real numbers, by R^m the set of m-dimensional real vectors, and by R+ and R+^m their subsets with non-negative components. Moreover, if M is a set, then we denote by F(M) the set of all fuzzy subsets of M.
2.1 Fuzzy Quantities

The vague quantitative components of the models presented below are characterized by fuzzy quantities. Due to [10, 11] and other works, a fuzzy quantity is any fuzzy subset a of R, i.e., a ∈ F(R), with membership function µ_a : R → [0, 1] such that:
– there exists r0 ∈ R such that µ_a(r0) = 1 (r0 is called a modal value of a);
– the support set of µ_a is limited.
Fuzzy quantities represent vague numerical values and they can be processed analogously to their deterministic counterparts. For our purposes, we need the algebraic operations of summation and of multiplication by a crisp real number, and an ordering relation over the set of fuzzy quantities. If a, b ∈ F(R) are fuzzy quantities with membership functions µ_a, µ_b, respectively, then the fuzzy quantity a ⊕ b with µ_{a⊕b} : R → [0, 1], where for r ∈ R

µ_{a⊕b}(r) = sup_{s∈R} [min(µ_a(s), µ_b(r − s))],   (1)

is called the sum of a and b. Moreover, if r ∈ R, then the fuzzy quantity r · a with µ_{r·a} : R → [0, 1] such that for s ∈ R

µ_{r·a}(s) = µ_a(s/r) if r ≠ 0;  µ_{0·a}(0) = 1, µ_{0·a}(s) = 0 if s ≠ 0,   (2)

is called the product of r and a. The properties of the sum (1) and the product (2) are summarized in [10, 11]. Effective economic or optimization models, including the market equilibrium, demand good handling of an ordering relation on the set of numerically represented outputs. In the fuzzified models, it means defining an ordering relation over the set of fuzzy quantities. Numerous definitions of such a relation exist in the literature (some of them are recollected in [11]). Here, we use the one which is based on the paradigm that a relation between vague (i.e., fuzzy) quantities is to be fuzzy, as well. The fuzzy ordering relation ≽ used in the following sections is represented by a fuzzy subset of F(R) × F(R) with membership function ν(·, ·). For every pair of fuzzy quantities a, b with µ_a, µ_b, the value ν(a, b) represents the possibility that a ≽ b, and

ν(a, b) = sup [min(µ_a(r), µ_b(s)) : r, s ∈ R, r ≥ s].   (3)
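As an illustration, the operations (1)–(3) can be sketched for fuzzy quantities represented on a finite support as Python dictionaries mapping values to membership degrees; the discretisation is an assumption made only for the example and is not part of the model.

```python
def fuzzy_sum(a, b):
    """Sum a (+) b by the sup-min convolution (1)."""
    c = {}
    for s, mu_s in a.items():
        for t, mu_t in b.items():
            c[s + t] = max(c.get(s + t, 0.0), min(mu_s, mu_t))
    return c

def fuzzy_scale(r, a):
    """Product r . a of a crisp number r and a fuzzy quantity a, see (2)."""
    if r == 0:
        return {0.0: 1.0}
    return {r * s: mu for s, mu in a.items()}

def nu(a, b):
    """Possibility nu(a, b) that a is greater than or equal to b, see (3)."""
    return max((min(mu_r, mu_s) for r, mu_r in a.items()
                for s, mu_s in b.items() if r >= s), default=0.0)

# Example with "about 2" and "about 3":
a = {1.5: 0.5, 2.0: 1.0, 2.5: 0.5}
b = {2.5: 0.5, 3.0: 1.0, 3.5: 0.5}
print(fuzzy_sum(a, b)[5.0])   # 1.0, the modal value 2 + 3
print(nu(a, b), nu(b, a))     # 0.5, 1.0
```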
The above concepts of the theory of fuzzy quantities are sufficient for the presentation of the fuzzy market model suggested below.

2.2 Deterministic Cooperative Game

The cooperative game itself is not adequate for the representative description of the structure of market activities but it is closely related to the market
model and the game theoretical concepts represent a pattern for some analogous components of the market (cf. [5, 6, 16, 17] and also [7, 8]). Let us denote by I the (nonempty and finite) set of players. To simplify some notations, we “name” the players by natural numbers, hence I = {1, 2, . . . , n}. Every subset of I is called a coalition. By K we denote the set of all coalitions. A mapping v : K → R such that v(∅) = 0 for the empty coalition is called the characteristic function of the considered game. For every coalition K ∈ K the value v(K) represents its expected total output. The cooperative game with transferable utility (briefly, TU-game) is represented by the pair (I, v). The TU-game (I, v) is said to be superadditive if for every pair of disjoint coalitions K, L ⊂ I

v(K ∪ L) ≥ v(K) + v(L).   (4)

The concept of TU-game is based on the idea that every realized coalition K expects to win a pay-off v(K) which is distributed among its members. Such a distribution is described by a real-valued vector r_K = (r_i)_{i∈K}. For the coalition of all players I, every vector r = (r_i)_{i∈I} is called an imputation. We say that an imputation r is accessible for I if ∑_{i∈I} r_i ≤ v(I), and we say that it is blocked by a coalition K ⊂ I iff ∑_{i∈K} r_i < v(K). The set C of all imputations which are accessible for I and are not blocked by any coalition K, i.e.,

C = { r ∈ R^n : ∑_{i∈I} r_i ≤ v(I), ∀ K ∈ K, ∑_{i∈K} r_i ≥ v(K) },   (5)
is called a core of the game (I, v).
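A direct check of the core condition (5) for a small TU-game can be sketched as follows; the dictionary-based encoding of v and of the imputation is an illustrative choice, not taken from the chapter.

```python
from itertools import combinations

def in_core(v, players, r):
    """v: dict mapping frozensets of players to payoffs; r: dict player -> payoff."""
    if sum(r[i] for i in players) > v[frozenset(players)]:
        return False                      # not accessible for I
    for size in range(1, len(players) + 1):
        for K in combinations(players, size):
            if sum(r[i] for i in K) < v.get(frozenset(K), 0.0):
                return False              # blocked by coalition K
    return True

# Three-player example:
players = [1, 2, 3]
v = {frozenset(K): val for K, val in
     [((1,), 0), ((2,), 0), ((3,), 0),
      ((1, 2), 4), ((1, 3), 4), ((2, 3), 4), ((1, 2, 3), 6)]}
print(in_core(v, players, {1: 2, 2: 2, 3: 2}))   # True
```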
2.3 Competitive Deterministic Market

The basic market model, the fuzzification of which will be investigated in further sections, is defined as follows. Also in this case we denote by I the finite and nonempty set of players; in the market model, they are usually called agents. We suppose that there exist m sorts of goods which are somehow distributed among the agents. By the symbol x_{ij} we denote the amount of the good j ∈ {1, . . . , m} owned by agent i ∈ I. We suppose that x_{ij} ≥ 0. The values x_{ij} form a distribution matrix x with columns x^i = (x_{ij})_{j=1,...,m}, where the column x^i characterizes the structure of the property owned by agent i ∈ I. There exists a special distribution matrix, let us denote it a, with elements a_{ij} and columns
a^i, i ∈ I, j = 1, . . . , m, which is called the initial distribution matrix and which represents the distribution of goods at the very beginning of the bargaining and exchange process. It is useful to denote by X the set of all distribution matrices achievable in the considered market by means of redistribution of a, i.e.,

X = { x = (x^i)_{i∈I} : ∀ i ∈ I, x^i ∈ R+^m, ∀ j = 1, . . . , m, ∑_{i∈I} x_{ij} ≤ ∑_{i∈I} a_{ij} }.   (6)

To simplify some notations, we denote for every coalition K ⊂ I

X^K = { x ∈ X : ∀ j = 1, . . . , m, ∑_{i∈K} x_{ij} ≤ ∑_{i∈K} a_{ij} }   (7)

as the set of all distribution matrices which are accessible by redistribution of goods inside the coalition K.

Remark 1. It is evident that we may put, without loss of validity of the defining formula (7), X^∅ = X for the empty coalition ∅.
Remark 2. It is also evident that X^I = X and that for a one-agent coalition {i}, i ∈ I,

X^{i} = { x ∈ X : ∀ j = 1, . . . , m, x_{ij} ≤ a_{ij} }.

Finally, we admit that every agent i ∈ I evaluates the achieved distribution matrix x by a utility function u_i : X → R which depends exclusively on the vector of goods x^i, i.e., for any x, y ∈ X

u_i(x) = u_i(y) if x^i = y^i,   (8)
and is nondecreasing and concave. It is natural to suppose that u_i(x) = 0 if x_{ij} = 0 for all j = 1, . . . , m. Then we call the ordered quadruple

M = (I, m, a, (u_i)_{i∈I})   (9)

a free exchange market. The exchange of goods in a market respects the prices and, vice versa, the relation between demand and supply influences their structure. The prices of goods form a real-valued vector p = (p_1, p_2, . . . , p_m), where p_j > 0 for any j = 1, . . . , m and, of course, p_j is the price of good j. The set of admissible prices will be denoted by P (let us note that the notion of “admissibility” is sometimes quite significant; some price regulations do exist in numerous real economies). The prices p ∈ P are supposed to be row vectors so that the product p · x^i makes sense and its result is a scalar. For every agent i ∈ I and price vector p ∈ P we denote by B_i(p) the set of distribution matrices

B_i(p) = { x ∈ X : p · x^i ≤ p · a^i },   (10)

which is called the budget set of agent i.
In the next sections, we call a pair (x, p), where x ∈ X and p ∈ P, a state of the market M. Some states of the market, respecting the balance between demand and supply, deserve special attention. A state of the market (x, p) ∈ X × P is called a competitive equilibrium iff for every agent i ∈ I

x ∈ B_i(p),   (11)
u_i(x) ≥ u_i(y) for every y ∈ B_i(p).   (12)
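Conditions (10)–(12) can be checked directly when only a finite list of candidate distribution matrices is considered; the finite candidate list and the data structures below are simplifying assumptions made for illustration.

```python
import numpy as np

def in_budget(x, a, p, i):
    """Condition (10): p . x^i <= p . a^i for agent i; x, a map agents to good vectors."""
    return float(np.dot(p, x[i])) <= float(np.dot(p, a[i]))

def is_competitive_equilibrium(x, p, X, a, utils):
    """Check (11) and (12) for the state (x, p) against the candidates in X.
    utils maps each agent to a utility function over distribution matrices."""
    for i in utils:
        if not in_budget(x, a, p, i):
            return False
        for y in X:
            if in_budget(y, a, p, i) and utils[i](y) > utils[i](x):
                return False
    return True
```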
Each market M is connected with a cooperative TU-game (I, v), where the characteristic function v is defined by

v(K) = max { ∑_{i∈K} u_i(x) : x ∈ X^K }.   (13)
The pair (I, v) defined by (13) is called the market game of the market M. The most important results of the market equilibrium theory specify the relation between the core of the market game and the vector of utilities (u_i(x))_{i∈I}, where (x, p) is the competitive equilibrium of M; namely, the fact that (u_i(x))_{i∈I} ∈ C under easily fulfilled assumptions.

2.4 Coalitional Competitive Market

The classical model of the market and its equilibrium, briefly recollected in Sect. 2.3, was extended in [7, 8] and several related papers. The cooperative extension is based on the idea that there exist two qualitatively different levels of cooperation. One of them represents the strictly market relations – the agents exchange the goods in order to maximize their subjective utility, respecting the existing prices. But some groups of agents reflect close relations among their members. They act as one compact participant of the market aiming to maximize the total group utility (as the sum of the individual utilities of its members). That utility, connected with some total coalitional ownership of goods, is distributed among the agents forming the group in order to achieve the maximal sum of individual utilities. Even if the “external” exchange of the common goods of the group respects the prices, its “internal” redistribution does not. Nevertheless, in spite of this “collectivism” the motivation of each agent is individual – he aims to maximize his own profit under the limitations given by the compromise group agreement maximizing the sum of profits. This structure of the market demands the existence of a very specific good, called “money” and intermediating the redistribution of utility among agents. In this simplified model, we do not use a separate denotation for this specific good but it is useful to register its
hidden existence (more attention is paid to money, e.g., in [7, 8, 16, 17]). Such groups of close and, in some sense, “altruistic” cooperation do exist. They are formed, e.g., by families, economic concerns, or cooperatives. The framework idea of the “close groups” can also be used for modeling the market behavior of an agent (economic subject) which diversifies his property into several separated blocks treated in the market as independent distributions of goods (such an approach resembles the parallel participation of a player in different coalitions dealt with, e.g., in [1, 2, 4, 13, 14]). For every coalition of agents K ⊂ I we define the coalitional utility function u_K : X → R by

u_K(x) = ∑_{i∈K} u_i(x),  x ∈ X,   (14)
and for the vector of prices p ∈ P also the coalitional budget set B^K(p) by

B^K(p) = { x ∈ X : ∑_{i∈K} p · x^i ≤ ∑_{i∈K} p · a^i }.   (15)

Let us consider a class of coalitions M ⊂ 2^I which covers the set I; in formulas, ∪_{K∈M} K = I. If (x, p) ∈ X × P is a state of the market M, then we say that it is a cooperative M-equilibrium iff for all K ∈ M

x ∈ B^K(p),   (16)
u_K(x) ≥ u_K(y) for all y ∈ B^K(p).   (17)
includes the cooperative behavior of agents and it can be easily used even in this case. The specific structure of cooperative M-equilibrium can be reflected by an analogous modification of core. Namely, the set of real-valued vectors CM , defined by C M = r = (ri )i∈I :
i∈I
≤ v(I), for all K ∈ M
i∈K
ri ≥ v(K)
(18) is called the M-core of the game (I, v). Relation between cooperative M-equilibrium of a market M and the M-core of corresponding market game is dealt in [7, 8], and it is consistent with the relations valid for the not cooperative case.
216
M. Mareš
3 Fuzzy Utilities and Prices The market model (9), both modifications of its equilibria, (11), (12) and (16), (17), as well as the corresponding market game (13) can be fuzzified. This fuzzification is natural – the reality of the market is closely connected with subjectivity of stand-points, vagueness of information, and approximity (eventually other deformation) of data, which are typical for real economic situations. In this chapter, we are interested in the fuzzy modifications of two components of the market which have quantitative character, namely of the utilities and prices. Let us note that utilities represent the preferences, which means essentially qualitative element of the model, but they are described by a quantitative scale of utilities. Both of the fuzzified components, the utilities, and the prices, are modeled by fuzzy quantities. It is also rational to admit that other components of the model – the set of agents, the number of goods, and their initial distribution – are usually well known and their fuzzification would not be adequate to the analyzed situation. With regard to the analogy with coalitional games, it is interesting to mention, at least briefly, another possibility of fuzzification, namely the existence of fuzzy coalitions as fuzzy subsets of I. In the reality of market (or cooperative game) they reflect the possibility that the individual agents may distribute their participation among several coalitions. This may be quite possible but it is not considered in this section. The fuzzification dealt here aims to extend the above mentioned components of the market (9) M = (I, m, a, (ui )i∈I ) , with the set P described in Sect. 2.3. 3.1 Fuzzy Competitive Market Let us consider fuzzy functions uF i : X → F(R), i ∈ I, such that for every (x) is a fuzzy quantity with membership function µi,x : x ∈ X the value uF i R → [0, 1]. Let, µi,x (ui (x)) = 1 for any i ∈ I, x ∈ X, Analogously to (8) we suppose that µu,x (r) = µi,y (r)
for all r ∈ R if xi = y i .
(19)
Moreover, if xij = 0 for all j = 1, 2, . . . , m then µi,x (0) = 1,
µu,x (r) = 0 r ∈ R, r = 0.
Then we call the mappings uF i , i ∈ I, fuzzy utility functions and their values (x) fuzzy utilities. uF i
Fuzzy Components of Cooperative Markets
217
Remark 3. Fuzzy utilities uF i are fuzzy extensions of the utilities ui , i ∈ I, in (9) in the sense that if µi,x (ui (x)) = 1 and µi,x (r) = 0 for r = ui (x) then uF i fulfil the properties of the crisp utility functions formulated in Sect. 2.3. Let us consider for every j = 1, . . . , m a fuzzy quantity qj with membership function πj : R → [0, 1] such that πj (pj ) = 1 for some (pj )j=1,...,m ∈ P , and πj (r) = 0 for r ≤ 0. Then fuzzy quantities qj are called fuzzy prices and the vector (qj )j=1,...,m = q is called fuzzy price vector. Remark 4. The fuzzy prices qj are fuzzy extensions of the crisp prices pj in the sense that if the fuzzy quantities are reduced into single possible value, i.e., πj (r) = 0 for r = pj , then the vector q has the properties of the deterministic price vector p ∈ P dealt in Sect. 2.3. Let us denote the set of all fuzzy price vectors by P F . The quadruple
and the set P F M F = I, m, a, (uF i )i∈I
(20)
is called fuzzy competitive market extending the deterministic market M . The fuzziness of some components of the model means that some of the concepts derived from it are fuzzy, as well. This consequent fuzziness will be the main topic of our analysis in the remaining part of this section. For every agent i ∈ I, every structure of his property xi (which is the relevant column of the distribution matrix x ∈ X), and for any vector of fuzzy prices q ∈ P F we may easily operate with the scalar product q · xi = q1 · xi1 ⊕ q2 · xi2 ⊕ · · · ⊕ qm · xim ,
(21)
where each product on the right-hand side of (21) is defined by (2) and the sums of fuzzy quantities in (21) are defined by (1). Hence, (21) defines a fuzzy quantity. The same is correct for the scalar product q · ai where ai is the relevant column of the initial distribution matrix. Let us note that, using (3), we are able to compare these two fuzzy quantities and that the value ν (q ·ai , q · xi ) of the membership function ν specifies the possibility that the fuzzy ordering relation q · ai q · xi is valid. The above operations justify the definition of the fuzzy subset FB i (q) of X with membership function βi,q : X → [0, 1] defined by βi,q (x) = ν (q · ai , q · xi ),
(22)
called fuzzy budget set of agent i and fuzzy prices q. It is easy to verify that the concept of fuzzy budget set is a fuzzy extension of the deterministic budget set Bi (p) defined by (10), as follows from the next statement.
218
M. Mareš
Remark 5. If q, q ′ ∈ P F , and πj , πj′ , j = 1, . . . , m, are corresponding membership functions such that for any r ∈ R, πj (r) ≥ πj′ (r) for all j = 1, . . . , m, then FB i (q) ⊃ FB i (q ′ ) in the fuzzy set theoretical sense, i.e., βi,q (x) ≥ βi,q′ (x) for all x ∈ X. Analogously to the deterministic market model we call the pair (x, q) ∈ X × P F a state of the fuzzy market M F . The construction of the above concepts is motivated by the endeavor to introduce the concept of market equilibrium adequate to the considered type of market. It is intuitively evident that the equilibrium of fuzzy market is to be a vague, i.e., fuzzy, concept. It means that such equilibria form a fuzzy subset of the Cartesian product X × P F . We denote its membership function by ρ : X × P F → [0, 1] and its value ρ(x, q), denoting the possibility that the state of fuzzy market (x, q) is an equilibrium, is defined by ρ(x, q) = min [β(x, q), δ(x, q)] ,
(23)
β(x, q) = min (βi,q (x) : i ∈ I)
(24)
where denotes the possibility that x belongs to all fuzzy budget sets FB i (q), see (22), and
F (25) δ(x, q) = min β(y, q), min ν (uF i (x), ui (y)) : i ∈ I : y ∈ X ,
denotes the possibility that the fuzzy utility of x is greater than the fuzzy utility of y for any y which may belong to the fuzzy budget sets FB i (q) for all agents i ∈ I. The fuzzy subset of X × P F with membership function ρ is called fuzzy competitive equilibrium of M F . It is not difficult to conclude from the above definitions that the fuzzy equilibrium extends the deterministic concept of equilibrium (described in Sect. 2.3) and that the increasing fuzziness of the input components increases the fuzziness of equilibria. This heuristic conclusion can be formulated in the following statements.
Lemma 1. If q, q ′ ∈ P F are two fuzzy prices vectors with πj , πj′ , j = 1, . . . , m, respectively, if πj (r) ≥ πj′ (r) for all r ∈ R then ρ(x, q) ≥ ρ(x, q ′ ) for all x ∈ X. Proof. The statement follows from (24) and (25). If πj (r) ≥ πj′ (r) for all r ∈ R ′ the membership functions and j = 1, . . . , m and if we denote by πj,x and πj,x i ′ i of fuzzy quantities qj · xj and qj · xj for arbitrary i ∈ I. Then (2) implies that ′ (r) for any r ∈ R. Hence, due to (3) and (22), βi,q (x) ≥ βj,q′ (x) πj,x (r) ≥ πj,x and (24), together with (23), implies the statement. ⊓ ⊔ F Lemma 2. Let us consider uF i , ui for some i ∈ I, with membership functions µi,x , µi,x , x ∈ X, such that for all r ∈ R
µ1,x (r) ≥ µi,x (r). Let us denote
M F = I, m, a, uF i i∈I ,
M
F
= I, m, a, uF i i∈I
and by ρ, ρ the membership functions of fuzzy competitive equilibria of the F fuzzy markets M F and M , respectively for some fuzzy prices q ∈ P F . Then ρ(x, q) ≥ ρ(x, q). Proof. The statement follows from (23) and related definitions, especially from (25). Under the assumptions of this lemma, mappings δ(x, q) and δ(x, q) fulfil the inequality δ(x, q) ≥ δ(x, q), which, together with (23) implies the statement.
⊓ ⊔
Theorem 1. The fuzzy market M F = (I, m, a, (uF i )i∈I ) with set of fuzzy prices P F is an extension of the deterministic market M = (I, m, a, (ui )i∈I ) with set of prices P , and fuzzy equilibria (x, q) of M F are fuzzy extensions of deterministic equilibria (x, q) of M if q is fuzzy extension of p (i.e., πj (r) = 1 iff r = pj for all j = 1, . . . , m and µi,x (r) = 1 iff r = ui (x)). Proof. The theorem follows from the previous statements, namely Lemma 1, 2, and Remarks 3, 4, 6, immediately. ⊓ ⊔ Corollary 1. The previous theorem together with Remark 5 implies that if fuzzy quantities uF i (x) and qj for i ∈ I, x ∈ X, and j = 1, . . . , m condensate into single possible values ui (x), pj , i.e., µi,x (r) = 0 for r = ui (x) and πj (r) = 0 for r = pj , i ∈ I, x ∈ X and j = 1, . . . , m, then the market M F is identical with M and fuzzy equilibrium (x, q) condensates into crisp equilibrium (x, p) in the sense that ρ(y, q ′ ) = 0 for y = x or q ′ = q. Theorem 2. The fuzzy competitive equilibrium (x, q) ∈ X ×P F can be transformed into a fuzzy subset of X × P , i.e., to a fuzzy set of deterministic equilibria. Proof. Due to the above definitions, every fuzzy equilibrium is a fuzzy subset of X × P F with membership function ρ having values ρ(x, q). Distribution matrices are crisp objects but each q ∈ P F is a vector of fuzzy subsets of R+ with memberships πj . Let us define a membership function π : P → [0, 1] as π(p) = min (πj (pj ) : j = 1, 2, . . . , m) . Then it is possible to define a fuzzy subset of X ×P with membership function ρ∗ : X × P → [0, 1], where for x ∈ X and p ∈ P ρ∗ (x, p) = min (ρ(x, q), π(p) : p ∈ P ) .
⊓ ⊔
3.2 Fuzzy Market Game It is possible to proceed analogously to the deterministic model and to derive a coalitional TU-game in some sense connected with the fuzzy market M F . It means that it is necessary to define its characteristic function, by means of a procedure modifying (13) for the environment of fuzzy utilities. First of all, we define for any coalition K ⊂ I, K = ∅, and any x ∈ X the fuzzy quantity uF K (x) by ⊕ uF (26) uF K (x) = i (x), i∈K
⊕
means the fuzzy sum using the operation ⊕ (see (1)). More where precisely, if K = {i1 , i2 , . . . , ik } then F F F uF K (x) = ui1 (x) ⊕ ui2 (x) ⊕ · · · ⊕ uik (x).
By µK,x : R → [0, 1] we denote the membership function of uF K (x). Due to [10], uK (x) is a fuzzy quantity, as well. Then we may, using (3), construct for every K ⊂ I, K = ∅, the maximum of fuzzy quantities uF K (x) for all x ∈ X K (see (7)) as a fuzzy quantity w(K) by the following procedure. For every x ∈ X K we find K F (27) min ν uF K (x), uK (y) : y ∈ X K F as the possibility that uF K (x) uK (y) for all y ∈ X , and then put
w(K) = uF K (x)
(28)
for the x ∈ X K for which the value (27) is maximal. Let us note that the closedness of X K following from (7) implies the correctness of the above maxima and minima. In the following text, we denote the membership function of w(K) by χK . For empty coalition ∅ we put χ∅ (0) = 1, χ∅ (r) = 0 for r ∈ R, r = 0. In the terms of [12], it is easy and natural to interpret the pair (I, w) as a cooperative game with transferable utility and with fuzzy pay-offs. This game will be called fuzzy market game. The theory of TU-games with fuzzy pay-offs is relatively new. It is developed since the 1990s and the elementary or basic concepts and results are summarized in [12]. The model is further investigated and some of its modifications are suggested (cf. [15]). Theorem 3. Let M be a competitive market and M F be its fuzzy extension in the above sense. If (I, v) is the market game of M and (I, w) is the fuzzy market game of M F then w is a fuzzy extension of v, i.e., χK (v(K)) = 1 for any coalition K ⊂ I.
Fuzzy Components of Cooperative Markets
221
Proof. The assumption that M F is fuzzy extension of M means that µi,x (ui (x)) = 1
(29)
for all i ∈ I, x ∈ X K . It means that if x ∈ X K is the distribution matrix for which uK (x) ≥ uk (x) for all x ∈ X K . Then (29) means that K F ν (uF K (x), uk (y)) = 1 for all y ∈ X , and, consequently, v(K) = uK (x),
χK (v(K)) = 1.
⊓ ⊔
Corollary 2. If M F is a fuzzy market the fuzziness of which is condensed into a competitive market M , i.e., µi,x (ui (x)) = 1, µi,x (r) = 0 if r = ui (x), for all i ∈ I, x ∈ X then the fuzziness of the fuzzy market game (I, w) of M F is condensed into the market χK (v(K)) = 1, χK (r) = 0 if r = v(K), for all K ⊂ I, as follows from Theorem 2 and previous definitions. The relation between fuzzy competitive market and its fuzzy market game represents an inspirative topic for detailed research, most of which is to be done, yet. It is unavoidable to respect the fact that TU-games with fuzzy characteristic function (I, w) do not fully copy some of the useful properties of the deterministic TU-games (I, v), however desirable it could be. This discrepancy follows from the algebraic properties of fuzzy quantities which are not identical with the properties of crisp real or integer numbers. Significant differences are connected with the notions of fuzzy zero and opposite element which are self-evident for groups of deterministic numbers. These differences cause some essential complications, especially, regarding the relations between convexity of characteristic function and existence of core, as well as relations between superadditivity and convexity. Deeper analysis of these consequences of all these specific features of fuzziness in TU-games can be found in [12]. We may note that the relation between fuzzy competitive equilibrium and core (it means fuzzy core) of the fuzzy market game is not simple and that it cannot be a simple analogy of its deterministic counterpart. On the other hand, the situation in a fuzzy market M F with fuzzy prices F P and its fuzzy market game (I, w) becomes more lucid if we accept the fact that many concepts which are in the deterministic case strictly limited are in the fuzzy case related (in various degree of possibility) to all relevant objects. It regards, e.g., the prices – a fuzzy price q ∈ P F may represent, with possibility π(p) any crisp price p ∈ P . Similarly, fuzzy equilibrium (x, q) represents a fuzzy subset of X × P F , it means any state of fuzzy market (x, q) with possibility ρ(x, q). Together with the fuzziness of q it means that fuzzy equilibrium of M F can be interpreted as fuzzy set of crisp equilibria in M (see Theorem 2). Using the concepts and results summarized in [12], we analyze at least the basic properties of the fuzzy market game (I, w) where for every K ⊂ I, w(K) is a fuzzy quantity with modal value v(K). In the first step we verify the superadditivity. In TU-games with fuzzy pay-offs of coalitions, the
222
M. Mareš
superadditivity is a fuzzy property, as well. It means that if we denote by ΓI the set of all coalitional games with fuzzy characteristic functions, then the fuzzy superadditive games form a fuzzy subset of ΓI with membership function σ : ΓI → [0, 1]. For every game (I, w) the value σ(w) is defined by σ(w) = min [ν (w(K ∪ L), w(K) ⊕ w(L)) : K, L ⊂ I, K ∩ L = ∅]
(30)
where definition (3) was used, and σ(w) determines the possibility that the game (I, w) is superadditive. Lemma 3. If M is a deterministic competitive market and (I, v) its market game then (I, v) is superadditive. Proof. Let us consider a market (9) and its game (13). Let us consider disjoint K, L ⊂ I and sets X K , X L , X K∪L , defined by (7). Then it is easy to see i that X K∪L ⊃ X K ∩ X L . Moreover, as for all i ∈ I ui (x) = ui (y) if xi=y (see Sect. 2.3) then also uK (x) = uK (y) if xi = y i for all i ∈ K and the same is valid for L. It means that ui (x) : x ∈ X K∪L ≥ v(K ∪ L) = max i∈K∪L ≥ max ui (x) : x ∈ X K ∩ X L = ui (x) + i∈L i∈K = max ui (x) : x ∈ X K + max ui (x) : x ∈ X L = i∈K
i∈L
= v(K) + v(L).
F
⊓ ⊔
Theorem 4. Let fuzzy competitive market M be a fuzzy extension of deterministic market M , and let (I, w) be its fuzzy market game. Then (I, w) is certainly fuzzy superadditive, i.e., σ(w) = 1. Proof. If (I, v) is the market game of M then, due to Theorem 3, (I, w) is fuzzy extension of (I, v). It is easy to verify (the formal statement is presented, e.g., in [12]) that the superadditivity of (I, v) implies the fuzzy superadditivity of (I, w) with maximal possibility. It means that σ(w) = 1. ⊓ ⊔
Corollary 3. The fuzzy superadditivity of (I, w) derived from M^F is a fuzzy extension of the deterministic superadditivity of (I, v) derived from M, if M^F is a fuzzy competitive market extending M.

Let us consider a set of coalitions K = {K_1, K_2, ..., K_m}. If K_j ∩ K_ℓ = ∅ for j ≠ ℓ, j, ℓ = 1, 2, ..., m, and if K_1 ∪ K_2 ∪ ··· ∪ K_m = I, then we say that K is a coalitional structure.

Remark 6. If (I, w) is the fuzzy market game of a fuzzy market and if K = {K_1, ..., K_m} is a coalitional structure, then w(I) certainly dominates w(K_1) ⊕ ··· ⊕ w(K_m), i.e., ν(w(I), w(K_1) ⊕ ··· ⊕ w(K_m)) = 1, as follows immediately from Theorem 4.
A very important result, probably the crucial one, of the theory of deterministic exchange market equilibria regards their relation to the core of the respective market game (see, e.g., [5, 16, 17], and also [6] or [7, 8]). The analogous relation for fuzzy markets is not obvious and it deserves deeper investigation in the future. Nevertheless, here it is desirable to formulate at least the concept of the core of a fuzzy market game and its basic properties. The results presented below are based on the general properties of cooperative TU-games with fuzzy pay-offs summarized in [12].

Let us start with the concept of fuzzy core. In this work, we respect the methodological paradigm due to which the properties and elements derived from the fuzzified TU-game model are to be fuzzy. Related to the concept of core, this principle means that the core is to be defined as a fuzzy subset of imputations. Analogously to Sect. 2.2, we suppose that I = {1, 2, ..., n} and by imputation we mean any real-valued vector r = (r_1, r_2, ..., r_n) ∈ R^n. Analogously to (5), we characterize the core as a set of imputations which are accessible for the maximal coalition I, and which cannot be blocked by any coalition K ⊂ I (including K = I). The accessibility and blocking are defined by means of relations between the components of the considered imputations and the values of the characteristic function for the relevant coalitions. If the values of the characteristic function are fuzzy quantities, as is the case in this subsection, then both concepts – accessibility and blocking – are fuzzy and the inequalities in question are evaluated by the membership function ν : F(R) × F(R) → [0, 1] (cf. (3)).

First, we simplify the notation. Every real number r ∈ R may be considered a fuzzy quantity with the possibility concentrated in a single value. To distinguish between these two interpretations of a real number, we denote the “concentrated” fuzzy quantity by ⟨r⟩, with membership function

µ_⟨r⟩(r) = 1,  µ_⟨r⟩(r′) = 0 for r′ ≠ r.    (31)
Let us note that, e.g., the second part of (2) can be reformulated as 0 · a = ⟨0⟩ for any a ∈ F(R).

Remark 7. It is easy to see that for r ∈ R and ⟨r⟩, a ∈ F(R), (3) can be simplified as

ν(⟨r⟩, a) = max( µ_a(r′) : r ≥ r′ ),  ν(a, ⟨r⟩) = max( µ_a(r′) : r′ ≥ r ).

The fuzzy core will be defined as a fuzzy subset of R^n, denoted C_w, with membership function γ : R^n → [0, 1] defined for every r ∈ R^n by

γ(r) = min( ν(w(I), Σ_{i∈I} r_i), γ̄(r) ),    (32)

where

γ̄(r) = min[ ν(Σ_{i∈K} r_i, w(K)) : K ⊂ I ].
In (32), ν(w(I), Σ_{i∈I} r_i) denotes the possibility that r is accessible for the coalition I, i.e., the possibility that Σ_{i∈I} r_i does not exceed w(I), and γ̄(r) denotes the possibility that r cannot be blocked by any coalition K ⊂ I, i.e., the possibility that Σ_{i∈K} r_i is not below w(K) for all K ⊂ I.
Theorem 5. Let M be a deterministic market with market game (I, v) and let M^F be its fuzzy extension with fuzzy market game (I, w), where χ_K, K ⊂ I, are the membership functions of w(K). Let us denote by C_v the core of (I, v) and by C_w the fuzzy core of (I, w) with membership function γ : R^n → [0, 1]. Then r ∈ C_v implies γ(r) = 1 for r ∈ R^n.

Proof. Let r ∈ C_v. Due to (5),

Σ_{i∈I} r_i ≤ v(I)  and  Σ_{i∈K} r_i ≥ v(K)  for all K ⊂ I.

Using (28), (26) and the assumption that µ_{K,x}(v(K)) = 1 (as follows from the definition of u_i^F, i ∈ I) it is easy to see that

ν(w(I), Σ_{i∈I} r_i) = 1  and  ν(Σ_{i∈K} r_i, w(K)) = 1,  K ⊂ I,

which means that γ̄(r) = 1. Hence, γ(r) = 1. ⊓⊔
Theorem 6. Let us preserve the notation of Theorem 5 and suppose that for all i ∈ I, u_i^F(x) = u_i(x) for each x ∈ X. Then for any r ∈ R^n, γ(r) = 0 if r ∉ C_v.

Proof. The assumption, together with (26), immediately means that u_K^F(x) = u_K(x) for all x ∈ X, K ⊂ I. It means, due to (28), that w(K) = v(K) for all K ⊂ I and, consequently, γ(r) = 0 iff r ∉ C_v, as follows from the definition of the fuzzy core. ⊓⊔
Corollary 4. Theorems 5 and 6 immediately imply that if the fuzzy quantities uF i (x) are concentrated in a single possible value (i.e., under the assumption of Theorem 6) then γ(r) ∈ {0, 1} for any r, more exactly, γ(r) = 1 if r ∈ Cv and γ(r) = 0 if r ∈ / Cv . Corollary 5. The previous Theorems 5 and 6 imply that the fuzzy core Cw of fuzzy market game (I, w) is a fuzzy extension of the deterministic core Cv of market game (I, v) if the fuzzy market M F is a fuzzy extension of the market M . The relation between fuzzy core of fuzzy market game and the fuzzy equilibrium of the fuzzy market offers an inspirative topic for more detailed investigation. As the formal structure of fuzzy concepts is more complex than the structure of their deterministic counterparts, it is realistic to expect that even the relation between equilibria and core is in the fuzzified model more complicated and more varied than those derived for the classical deterministic market. It is also necessary to respect certain disproportion between the mathematical structures describing the utilities, budget sets, core, characteristic function, and equilibrium (which are fuzzy) and the individual imputations forming the core (which are crisp vectors even in the fuzzified model). The type of open problems which could be solved regards the relation between fuzziness of equilibria and core. Let us consider a competitive market M = (I, m, a, (ui )i∈I ) with a space of prices P , and its fuzzy extension
M^F = (I, m, a, (u_i^F)_{i∈I})

with the space of fuzzy prices P^F. Preserving the notations used in this section, the relation between the membership functions

ρ(x, q)  and  γ(r),  x ∈ X, q ∈ P^F, r ∈ R^n,

is to be the focus of eventual future investigation. The principally different structure of R^n on one side and of X × P^F on the other shows that this relation will not be direct or immediate. It is rational to expect rather results in which this relation is mediated by some other components of the market model.

Theorem 7. Let (x, p) be a competitive equilibrium of the market M. Let us denote u = (u_i(x))_{i∈I}. Then γ(u) = 1 in the fuzzy market M^F extending M and its fuzzy market game (I, w).

Proof. If (x, p) is an equilibrium then, due to [5, 17], u ∈ C_v, where C_v is the core of (I, v). By Theorem 5, γ(u) = 1 for the fuzzy core C_w of (I, w). ⊓⊔
3.3 Fuzzy Cooperative Market

The competitive fuzzy market concept can be generalized to its cooperative version, analogously to the procedure used in Sect. 2.4. Even its interpretation is analogous to the one given in Sect. 2. Let us recollect (26), by which the fuzzy quantities u_K^F(x) are defined for any x ∈ X and K ⊂ I, K ≠ ∅, as sums of the fuzzy quantities u_i^F(x) for i ∈ K. The membership function of u_K^F(x) will be denoted by µ_{K,x} : R → [0, 1] and constructed by repetitive application of (1), which is correct as the operation ⊕ is associative on F(R) (see [10]). Then it is possible to extend (15) to the environment of a fuzzy market by means of the following procedure. For every nonempty coalition K ⊂ I, every player i ∈ K and every fuzzy price vector q = (q_j)_{j=1,...,m} we use (21) to define the fuzzy quantity q · x^i for each x = (x^i)_{i∈I} ∈ X. If K = {i_1, i_2, ..., i_k} then we denote

⊕_{i∈K} q · x^i = q · x^{i_1} ⊕ q · x^{i_2} ⊕ ··· ⊕ q · x^{i_k},    (33)

and analogously for ⊕_{i∈K} q · a^i (cf. (26)). Having introduced the above symbols, we may define the coalitional fuzzy budget set FB^K(q) for q ∈ P^F as a fuzzy subset of X with membership function β_{K,q} : X → [0, 1], where

β_{K,q}(x) = ν( ⊕_{i∈K} q · a^i, ⊕_{i∈K} q · x^i ),  x ∈ X.    (34)

It means that β_{K,q}(x) denotes the possibility that ⊕_{i∈K} q · x^i does not exceed ⊕_{i∈K} q · a^i (cf. (3), (22) and (15)).
Remark 8. If q ∈ P^F is a fuzzy extension of p ∈ P (i.e., π_j(p_j) = 1 for all j = 1, ..., m) then Σ_{i∈K} p · x^i and Σ_{i∈K} p · a^i are the modal values of the fuzzy quantities ⊕_{i∈K} q · x^i and ⊕_{i∈K} q · a^i, respectively. It means that these values are achieved with possibility 1 (see Sect. 2.1).

Lemma 4. If M^F is a fuzzy extension of M and q ∈ P^F a fuzzy extension of p ∈ P, then for any K ⊂ I, K ≠ ∅, FB^K(q) is a fuzzy extension of B^K(p), i.e., β_{K,q}(x) = 1 for any x ∈ B^K(p).
Proof. Due to (33) and Remark 8,

ν( ⊕_{i∈K} q · a^i, ⊕_{i∈K} q · x^i ) = 1  if  Σ_{i∈K} p · a^i ≥ Σ_{i∈K} p · x^i.

The validity of the statement follows from this implication. ⊓⊔
Lemma 5. Under the assumptions of the previous Lemma 4, if x ∉ B^K(p) then β_{K,q}(x) < 1.

Proof. If x ∉ B^K(p) then Σ_{i∈K} p · x^i > Σ_{i∈K} p · a^i. It means, due to (3), that ν( ⊕_{i∈K} q · a^i, ⊕_{i∈K} q · x^i ) < 1. ⊓⊔

Definition 4. The r-level set of a fuzzy variable Z is given by
Z_r = { z ∈ E¹ | µ_Z(z) ≥ r },  r ∈ (0, 1].
A necessity measure ν is the notion dual to a possibility measure, defined as ν(A) = 1 − π(A^c), where “c” denotes the complement of a set A ∈ P(Γ). Taking into consideration the results of [7, 10], we give the definition of a fuzzy random variable and its interpretation. Let (Ω, B, P) be a probability space.

Definition 5. A fuzzy random variable X is a real function X(·, ·) : Ω × Γ → E¹ such that for any fixed γ ∈ Γ, X_γ = X(ω, γ) is a random variable on (Ω, B, P).

Two interpretations follow from this definition. For a fixed ω ∈ Ω we get a fuzzy variable X_ω = X(ω, γ), so the values of the fuzzy random variable are fuzzy variables with distributions µ_X(x, ω).
For a fixed γ, X_γ can be considered as a random variable whose possibility is defined by a possibility measure. Everything becomes clear when the distribution µ_X(x, ω) is defined as in the case of a fuzzy variable:

µ_X(x, ω) = π{ γ ∈ Γ : X(ω, γ) = x }  ∀x ∈ E¹.

To each ω there corresponds a possibilistic distribution, i.e., a random choice of an expert who gives an imprecise subjective estimate. X_γ is a random variable for a fixed γ, but we are not certain about its distribution. In the context of decision making, the expected value plays a crucial role in summarizing random information. The expected value E{X(ω, γ)} of a fuzzy random variable X(ω, γ) can be defined in different ways. We define the distribution of the expected value of a fuzzy random variable according to [7] through the averaged random variable:

µ_EX(x) = π{ γ ∈ Γ : E{X(ω, γ)} = x }  ∀x ∈ E¹.

It is easy to show that the expected value of a fuzzy random variable defined in this way has the basic properties of the expected value of a random variable.
3 Fuzzy Random Variables: Presentation and Calculation of their Characteristics

Let us consider a fuzzy random variable X(ω, γ). The presentation [11]

X(ω, γ) = a(ω) + σ(ω) X₀(γ),    (1)

is interesting for applications; here a(ω), σ(ω) are random variables defined on the probability space (Ω, B, P) with finite second-order moments, and X₀(γ) is a fuzzy (possibilistic) variable defined on the possibilistic space (Γ, P(Γ), π). To simplify the demonstration of the basic ideas, suppose X₀ ∈ Tr(0, 1), that is, X₀ has the triangular distribution function

µ_X₀(t) = 1 − |t|  if |t| ≤ 1,  and  0  if |t| > 1.    (2)

From (2) it follows that the fuzzy variable X₀ has modal value 0 and fuzziness coefficient 1. Presentation (1) is the shift-scale presentation of a fuzzy random variable: a(ω), σ(ω) are shift and scale parameters, which are the modal value and the fuzziness coefficient of the fuzzy variable X_ω = X(ω, γ) ∈ Tr(a(ω), σ(ω)).
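The shift-scale presentation lends itself to a direct numerical illustration. The following Python sketch (ours, not part of the original text; the particular distributions of a(ω) and σ(ω) are illustrative assumptions) builds the triangular membership (2) and evaluates the membership function of X_ω = Tr(a(ω), σ(ω)) for one sampled ω.

```python
import numpy as np

def mu_X0(t):
    """Triangular membership (2): 1 - |t| on [-1, 1], zero outside."""
    t = np.asarray(t, dtype=float)
    return np.maximum(0.0, 1.0 - np.abs(t))

def mu_X_omega(x, a_omega, sigma_omega):
    """Membership of X_omega = a(omega) + sigma(omega) * X0,
    i.e. a triangular fuzzy number Tr(a(omega), sigma(omega))."""
    return mu_X0((np.asarray(x, dtype=float) - a_omega) / sigma_omega)

rng = np.random.default_rng(0)
# Illustrative assumption: random shift (modal value) and positive random scale (fuzziness).
a_omega = rng.normal(loc=0.10, scale=0.02)
sigma_omega = np.exp(rng.normal(-3.0, 0.3))

xs = np.linspace(a_omega - 2 * sigma_omega, a_omega + 2 * sigma_omega, 9)
print("modal value a(omega) =", round(a_omega, 4),
      " fuzziness sigma(omega) =", round(sigma_omega, 4))
print("membership values:", np.round(mu_X_omega(xs, a_omega, sigma_omega), 3))
```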
Let E(a) = a₀, E(σ) = σ₀. Then, according to [6, 7], E(X) = a₀ + σ₀ X₀(γ) and µ_EX(t) = µ_X₀((t − a₀)/σ₀) for all t ∈ E¹.

In applied problems we are interested not only in the expected value but also in the variance and covariance of fuzzy random variables. There exist at least two approaches to their definition: the characteristics are fuzzy within the first approach [11] and nonfuzzy within the second one [12].

Consider the first approach. Variance and covariance are defined by the usual probability-theoretic formulas. Using presentation (1) we obtain the variance D(X) of a fuzzy random variable X(ω, γ) as a function of the fuzzy variable X₀(γ):

D(X) = D(a + σX₀) = E(a + σX₀ − a₀ − σ₀X₀)² = E(a − a₀ + (σ − σ₀)X₀)²
     = D(a) + 2 cov(a, σ) X₀ + D(σ) X₀²
     = D(σ)[X₀ + cov(a, σ)/D(σ)]² + [D(a)D(σ) − cov²(a, σ)]/D(σ).    (3)

Let C₁² = D(σ), C₂ = cov(a, σ)/D(σ), C₃ = [D(a)D(σ) − cov²(a, σ)]/D(σ). Then formula (3) reads D(X) = C₁²[X₀ + C₂]² + C₃. By force of the Cauchy–Bunyakovsky inequality, C₃ ≥ 0. The complexity of the variance depends on the distribution of the fuzzy variable. The case C₂ = C₃ = 0 will be considered for illustration; then D(X) = C₁²X₀². The following result is valid for the triangular symmetric possibilistic distribution.

Theorem 1. [11] Let X₀ ∈ Tr(0, 1). Then

µ_D(X)(t) = 1 − √t / C₁  if 0 < t < C₁²,  and  0  if t ∉ (0, C₁²).

The proof of the theorem is based on the transformation formula for fuzzy variables. To describe the collective behavior of E(X) and D(X) it is convenient to introduce a parametric description. Let the parameter t ∈ supp(X₀). Then the pair (E(X), D(X)) takes the value (a₀ + σ₀t, C₁²(t + C₂)² + C₃) with possibility µ_X₀(t).

Based on the results obtained, we describe the collective behavior of fuzzy random variables X₁, X₂, ..., Xₙ. We come to the following model:

X_k(ω, γ) = a_k(ω) + σ_k(ω) X_k⁰(γ),

where (X₁⁰, ..., Xₙ⁰) is a fuzzy vector. Introduce the following notations: a_k⁰ = E(a_k), σ_k⁰ = E(σ_k),
C_k² = D(σ_k),  C_ij = cov(σ_i, σ_j),  f_ij = − cov(σ_i, a_j)/cov(σ_i, σ_j),
d_ij = cov(a_i, a_j) − cov(σ_i, a_j)·cov(σ_j, a_i)/cov(σ_i, σ_j).

Applying the ordinary rules for calculating the numerical characteristics of random variables, we obtain formulas that represent the characteristics of the fuzzy random variables X₁, ..., Xₙ:

m_k = E(X_k) = a_k⁰ + σ_k⁰ · X_k⁰;    (4)
D_k² = D(X_k) = C_k²(X_k⁰ − f_kk)² + d_kk;    (5)
Σ_ij = cov(X_i, X_j) = C_ij(X_i⁰ − f_ij)(X_j⁰ − f_ji) + d_ij.    (6)

Let m = (m₁, ..., mₙ) be the mean vector and Σ = (Σ_ij) the covariance matrix of the fuzzy vector (X₁(ω, γ), ..., Xₙ(ω, γ)). Let us calculate the collective possibilistic distribution of m and Σ. Let t = (t₁, ..., tₙ) be a point from the set of possible values of the fuzzy vector X⁰ = (X₁⁰, ..., Xₙ⁰). According to the results above, the pair (m, Σ) takes the value (m(t), Σ(t)) with possibility µ_X⁰(t). The elements m_k(t) and Σ_ij(t) can be calculated by formulas (4)–(6). Actually, as X_i⁰ = t_i, X_j⁰ = t_j, then

m_k(t) = a_k⁰ + σ_k⁰ · t_k;  Σ_ij(t) = C_ij(t_i − f_ij)(t_j − f_ji) + d_ij.

If the elements of the vector X⁰ are min-related [10] then

µ_X⁰(t) = min_{1≤i≤n} µ_{X_i⁰}(t_i).
The second approach. Omitting all the technical details connected with the definition of the values of a fuzzy random variable in the space L₂ [12], in the accepted notation the corresponding formula is:

cov(X, Y) = (1/2) ∫₀¹ [ cov(X_ω⁻(r), Y_ω⁻(r)) + cov(X_ω⁺(r), Y_ω⁺(r)) ] dr,

where X_ω⁻(r), Y_ω⁻(r), X_ω⁺(r), Y_ω⁺(r) are the r-level set endpoints of the fuzzy variables X_ω, Y_ω, respectively. Obviously the variance is D(X) = cov(X, X), and the second-order moments are without fuzziness. It is important that the definition methods for the second-order moments in the first and second approaches differ in principle: in the first approach we identify a possibility distribution, in the second one we make numeric calculations.
4 Possibility–Probability Optimization Models and Decision Making

Within fuzzy random data, the functions that form the goals and restrictions of a decision-making problem are mappings R_i(·, ·, ·) : W × Ω × Γ → E¹, i = 0, ..., m, where W ⊂ Eⁿ is a set of acceptable solutions. Thus, a set of acceptable outcomes is obtained by combining elements of the solution set with elements of the random and fuzzy parameter sets. That is why a concrete solution cannot be directly connected either with a goal achievement degree or with a restriction system execution degree. The presence of two different types of uncertainty in the efficiency function complicates the reasoning behind, and the formalization of, solution selection optimality principles. However, a decision-making procedure based on the expected possibility principle is quite natural. Its content is the elimination of the two types of uncertainty, that is, the realization of two decision-making principles [10, 13, 14]:

– averaging of the fuzzy random data, which leads to a decision-making problem with fuzzy data;
– choice of an optimal solution with the most possible values of the fuzzy parameters, or with possibility not lower than a preset level.

An adequate means of formalizing the suggested optimality principle is the mathematical apparatus of fuzzy random variables. Let τ be a possibility or necessity measure, that is, τ ∈ {π, ν}. Taking into consideration the stated decision-making principles, we come to the following optimization problem settings under fuzzy random factors.

Problem of maximizing the goal achievement measure with linear possibility (necessity) restrictions:

τ{ E R₀(w, ω, γ) ℜ₀ 0 } → max,
subject to  τ{ E R_i(w, ω, γ) ℜ_i 0 } ≥ α_i, i = 1, ..., m,  w ∈ W.

Problem of level optimization with linear possibility (necessity) restrictions:

k → max,
subject to  τ{ E R₀(w, ω, γ) ℜ₀ k } ≥ α₀,  τ{ E R_i(w, ω, γ) ℜ_i 0 } ≥ α_i, i = 1, ..., m,  w ∈ W.
In the stated problems ℜ₀, ℜ_i are binary relations, ℜ₀, ℜ_i ∈ {≤, ≥, =}, α_i ∈ (0, 1], and k is an additional (level) variable. The possibility–probability optimization models introduced above define an approach to the construction of portfolio analysis models that combine fuzzy and random uncertainties.
5 Models and Methods of Portfolio Analysis in a Fuzzy Random Environment

The portfolio selection problem [1] is a central problem of financial and investment analysis, and it remains of interest to researchers. As some researchers rightly note, the main drawback of the Markowitz approach to the portfolio selection problem is the absence of the statistical data that are used for estimating the model parameters; expert estimates are used in such situations. Possibility and fuzzy set theory gave further impetus to the development of the problem [2, 3]. In [4, 5], portfolio analysis problems are studied when the appropriate probabilistic characteristics of the Markowitz financial market model [1] are replaced by fuzzy expert estimates. However, the financial market is unstable and changeable, so investment decision making leans both on expert estimates, which are tolerant and fuzzy, and on statistical information. In some instances the profitabilities and prices of individual financial assets are characterized by tolerant time series. In this case a fuzzy random variable is an adequate model of profitability.

5.1 Expected Value and Risk of a Portfolio with Fuzzy Random Data

Let R_i(·, ·) : Ω × Γ → E¹ be a fuzzy random variable representing the profitability of the i-th asset, where ω and γ are elements of the probability space (Ω, B, P) and of the possibilistic space (Γ, P(Γ), π), respectively. Then the profitability of the portfolio is a fuzzy random variable

R_p(w, ω, γ) = Σ_{i=1}^{n} w_i R_i(ω, γ).

Here w = (w₁, ..., wₙ) is the vector representing the portfolio: w ≥ 0, Σ_{i=1}^{n} w_i = 1. The expected profit and risk of the portfolio under fixed w are represented by the following fuzzy variables:

R̂_p(w, γ) = E R_p(w, ω, γ),
V̂_p(w, γ) = E( R_p(w, ω, γ) − R̂_p(w, γ) )².

Hereinafter we assume that the fuzzy random variables under consideration can be presented in the following form:

R_i(ω, γ) = a_i(ω) + σ_i(ω) X_i(γ),

where a_i(ω), σ_i(ω) are random variables defined on the probability space (Ω, B, P) with mathematical expectations a_i⁰, σ_i⁰, and X_i(γ) is a fuzzy variable defined on the possibilistic space (Γ, P(Γ), π). After calculation of the mathematical expectations, the characteristics of the portfolio profit (expected profit and risk) take the form

R̂_p(w, γ) = Σ_{i=1}^{n} (a_i⁰ + σ_i⁰ X_i(γ)) w_i,

V̂_p(w, γ) = Σ_{i=1}^{n} D(R_i) w_i² + Σ_{k=1}^{n} Σ_{j=1, j≠k}^{n} w_k w_j cov(R_k, R_j),

where D(R_i) is the variance of the fuzzy random variable R_i(ω, γ) and cov(R_k, R_j) is the covariance of the fuzzy random variables R_k(ω, γ), R_j(ω, γ). In accordance with the results obtained earlier, these characteristics are functions of fuzzy variables and are therefore fuzzy variables themselves. Let below

R̂_i(γ) = a_i⁰ + σ_i⁰ X_i(γ);
d_i(γ) = D(R_i) = C_i²[X_i − f_ii]² + d_ii;
Θ_kj(γ) = cov(R_k, R_j) = C_kj(X_k − f_kj)(X_j − f_jk) + d_kj.

5.2 Basic Models of Portfolio Analysis in a Probabilistic–Possibilistic Context

Based on the results presented in Sect. 4 and on the classical results [1], we propose generalized models of portfolio analysis oriented towards the processing of fuzzy random data. Consider the following as the basic ones.

Maximum effectiveness portfolio:

k → max,
τ{ R̂_p(w, γ) ℜ₀ k } ≥ π₀,
τ{ V̂_p(w, γ) ℜ₁ r_p(γ) } ≥ π₁,  Σ_{j=1}^{n} w_j = 1,  w₁, ..., wₙ ≥ 0.
Minimal risk portfolio:

k → min,
τ{ V̂_p(w, γ) ℜ₀ k } ≥ π₀,
τ{ R̂_p(w, γ) ℜ₁ m_p(γ) } ≥ π₁,  Σ_{j=1}^{n} w_j = 1,  w₁, ..., wₙ ≥ 0.

Maximization of the possibility (necessity) of achieving an acceptable portfolio profitability level:

τ{ R̂_p(w, γ) ℜ₀ m_p(γ) } → max,
τ{ V̂_p(w, γ) ℜ₁ r_p(γ) } ≥ π₀,  Σ_{i=1}^{n} w_i = 1,  w₁, ..., wₙ ≥ 0.
In the presented models τ ∈ {π, ν}, m_p is a fuzzy profitability level acceptable for the investor, r_p(γ) is a level of possible risk, and π₀, π₁ ∈ (0, 1] are given levels of possibility (necessity).

5.3 Solving Methods

In the papers [15–17], solving methods are developed for the portfolio analysis problems corresponding to the models of Sect. 5.2 with the possibility measure. These models are the following.

Maximum effectiveness portfolio:

k → max,    (7)
π{ R̂_p(w, γ) = k } ≥ π₀,    (8)
π{ V̂_p(w, γ) ≤ r_p(γ) } ≥ π₁,  Σ_{i=1}^{n} w_i = 1,  w₁, ..., wₙ ≥ 0.    (9)
Minimal risk portfolio:

k → min,    (10)
π{ V̂_p(w, γ) = k } ≥ π₀,    (11)
π{ R̂_p(w, γ) ≥ m_p(γ) } ≥ π₁,  Σ_{j=1}^{n} w_j = 1,  w₁, ..., wₙ ≥ 0.    (12)
Maximization of the possibility of achieving an acceptable portfolio profitability level:

π{ R̂_p(w, γ) = m_p(γ) } → max,    (13)
π{ V̂_p(w, γ) ≥ r_p(γ) } ≥ π₀,  Σ_{i=1}^{n} w_i = 1,  w₁, ..., wₙ ≥ 0.    (14)
The essence of the solution methods developed in [15–17] for these portfolio analysis problems consists in the construction of equivalent deterministic analogues. Such methods can be classified as indirect ones. Below we formulate the most important results of these works.
Theorem 2. Let in problem (7)–(9) the random variables be unrelated and characterized by the covariance matrix {Θ_kl(γ)}_{k,l=1}^{n}, and let the fuzzy variables X_i(γ) ∈ Tr(0, 1), i = 1, ..., n, and r_p(γ) ∈ Tr(m̄, d̄) be min-related. Then problem (7)–(9) is equivalent to

Σ_{i=1}^{n} X_i^+(π₀) w_i → max,

subject to
Σ_{i=1}^{n} d_i^-(π₁) w_i² + Σ_{k,l=1, k≠l}^{n} Θ_kl^-(π₁) w_k w_l ≤ r_p^+(π₁),
Σ_{i=1}^{n} w_i = 1,  w₁, ..., wₙ ≥ 0,

where d_i^-(π₁), Θ_kl^-(π₁) are the left endpoints of the level sets of the appropriate fuzzy variables,

r_p^+(π₁) = m̄ + d̄(1 − π₁),  X_i^+(π₀) = a_i⁰ + σ_i⁰(1 − π₀).
Theorem 3. Let in problem (10)–(12) the probabilistic variables be unrelated and characterized by the covariance matrix {Θ_kl(γ)}_{k,l=1}^{n}, and let the fuzzy variables X_i(γ) ∈ Tr(0, 1), i = 1, ..., n, and m_p(γ) ∈ Tr(m̄, d̄) be min-related. Then problem (10)–(12) is equivalent to

Σ_{i=1}^{n} d_i^-(π₀) w_i² + Σ_{k,j=1, k≠j}^{n} Θ_kj^-(π₀) w_k w_j → min,

subject to
Σ_{i=1}^{n} X_i^+(π₁) w_i ≥ m_p^-(π₁),
Σ_{i=1}^{n} w_i = 1,  w₁, ..., wₙ ≥ 0,

where d_i^-(π₀), Θ_kj^-(π₀) are the left endpoints of the level sets of the appropriate fuzzy variables,

m_p^-(π₁) = m̄ + d̄(π₁ − 1),  X_i^+(π₁) = a_i⁰ + σ_i⁰(1 − π₁).
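As a rough illustration of how the deterministic equivalent of the minimal risk problem in Theorem 3 can be handled in practice, the sketch below feeds illustrative level-set endpoints (all numbers are invented for the example, not taken from the chapter) into a standard constrained quadratic optimizer from SciPy.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed data: a symmetric matrix Q with Q[i, i] = d_i^-(pi0) and Q[k, j] = Theta_kj^-(pi0),
# adjusted returns X_i^+(pi1) = a_i^0 + sigma_i^0 (1 - pi1), and the threshold m_p^-(pi1).
Q = np.array([[0.010, 0.002, 0.001],
              [0.002, 0.030, 0.004],
              [0.001, 0.004, 0.020]])
x_plus = np.array([0.08, 0.12, 0.10])
m_minus = 0.10

risk = lambda w: w @ Q @ w              # objective of the deterministic analogue
cons = [{"type": "eq",   "fun": lambda w: np.sum(w) - 1.0},
        {"type": "ineq", "fun": lambda w: x_plus @ w - m_minus}]
w0 = np.full(3, 1.0 / 3.0)
res = minimize(risk, w0, constraints=cons, bounds=[(0.0, 1.0)] * 3, method="SLSQP")

print("weights:", np.round(res.x, 3), " risk:", round(res.fun, 5))
```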
Theorem 4. Let in problem (13)–(14) the fuzzy variables m_p(γ), r_p(γ), X_i(γ) be convex, min-related, and characterized by upper semicontinuous distributions with finite supports. Then problem (13)–(14) is equivalent to

x₀ → max,

subject to
x₀ ≤ µ_{R̂_i}(ν_i), i = 1, ..., n;
x₀ ≤ µ_{m_p}(t),
Σ_{i=1}^{n} w_i ν_i = t,
Σ_{i=1}^{n} d_i^-(π₀) w_i² + Σ_{k,l=1, k≠l}^{n} Θ_kl^-(π₀) w_k w_l ≤ r_p^+(π₀),
Σ_{i=1}^{n} d_i^+(π₀) w_i² + Σ_{k,l=1, k≠l}^{n} Θ_kl^+(π₀) w_k w_l ≥ r_p^-(π₀),
Σ_{i=1}^{n} w_i = 1,  w₁, ..., wₙ ≥ 0,

where µ_{R̂_i}, µ_{m_p} are the distribution functions of the appropriate fuzzy variables, and d_i^-(π₀), d_i^+(π₀), Θ_kl^-(π₀), Θ_kl^+(π₀), r_p^-(π₀), r_p^+(π₀) are the endpoints of the π₀-level sets of the fuzzy variables d_i(γ), Θ_ij(γ) and r_p(γ).
To prove the theorems formulated above, the mathematical apparatus developed in [18, 19] can be used. The results presented in the paper are obtained for the case τ = π; they admit an interpretation within the optimistic decision-making model [19]. Similarly, the corresponding theorems can be proved for τ = ν, and the results can then be interpreted within the pessimistic decision-making model for π₀, π₁ = 0.5. Let us now consider some portfolio models and methods presented in [20, 21] under the second approach to the definition of the second-order moments.

5.4 Nonfuzzy Moments of the Second Order

The portfolio optimization models introduced below generalize the classical Markowitz models to the case of fuzzy random data. In what follows the Black constraint system is considered. Let us turn to its formal description.
Let ϖ_i be the part of the capital spent on buying a security of the i-th type, Σ_{i=1}^{n} ϖ_i = 1, and let X¹, ..., Xⁿ be fuzzy random variables representing the values of the securities. The portfolio value is then a fuzzy random variable

X(ϖ, ω, γ) = ϖ₁ · X¹(ω, γ) + ··· + ϖₙ · Xⁿ(ω, γ).
The expected value E(X(ϖ, ω, γ)) is the portfolio expected value, and the variance D(X(ϖ, ω, γ)) is the portfolio risk. According to the classical approach it is natural to consider the following models.

Model of maximizing the expected income under a preset level of possible risk:

E(X(ϖ, ω, γ)) → max,
subject to  D(X(ϖ, ω, γ)) = r,  Σ_{i=1}^{N} ϖ_i = 1,

where r is the acceptable risk level.
The presented model is not absolutely correct, as the expected value is a fuzzy variable. Its further refinement requires the use of proper decision-making principles [10]. If the fuzzy variables X_ω^i(γ) = E{X^i(ω, γ)} are unimodal then, according to [6],

M{E(X(ϖ, ω, γ))} = Σ_{i=1}^{n} m_i ϖ_i,

where m_i is the modal value of the fuzzy variable X_ω^i(γ). Let X_ω^i(γ) be triangular fuzzy variables with parameters (ξ_i + η_i, η_i, ζ_i), where ξ_i + η_i is the modal value and η_i, ζ_i are the left and right fuzziness coefficients. According to [18] the model of maximizing the expected income is the following:

Σ_{i=1}^{N} m_i ϖ_i → max,

subject to

Σ_{i=1}^{N} ϖ_i² ( Dξ_i + (2/3)Dη_i + (1/6)Dζ_i + (3/2)Cov(ξ_i, η_i) + (1/2)Cov(ξ_i, ζ_i) + (1/2)Cov(η_i, ζ_i) )
  + 2 Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} ϖ_i ϖ_j [ Cov(ξ_i, ξ_j) + (3/4)Cov(ξ_i, η_j) + (3/4)Cov(η_i, ξ_j) + (2/3)Cov(η_i, η_j)
  + (1/4)Cov(ξ_i, ζ_j) + (1/4)Cov(η_i, ζ_j) + (1/4)Cov(ξ_j, ζ_i) + (1/4)Cov(η_j, ζ_i) + (1/6)Cov(ζ_i, ζ_j) ] = r,

Σ_{i=1}^{N} ϖ_i = 1.
To model the fuzzy parameters, (L, R)-type distributions were used, with L(t) = R(t) = max{0, 1 − t}, t ≥ 0.

Model of minimizing risk under a preset income level:

D(X(ϖ, ω, γ)) → min,
subject to  M{E(X(ϖ, ω, γ))} = E∗,  Σ_{i=1}^{N} ϖ_i = 1,

where E∗ is the preset profitability level.
This problem can be solved using the Lagrange multiplier method, which allows one to express the solution in analytic form:

ϖ∗ = { [(mᵀC⁻¹m) − E∗(eᵀC⁻¹m)] C⁻¹e + [E∗(eᵀC⁻¹e) − (mᵀC⁻¹e)] C⁻¹m } / { (eᵀC⁻¹e)(mᵀC⁻¹m) − (mᵀC⁻¹e)² },

assuming the existence of the matrix C⁻¹. The elements of C are the covariance coefficients of the fuzzy random variables, C_ip = C_pi = Cov(X_i, X_p) = Cov(X_p, X_i), C_ii = Cov(X_i, X_i) = D(X_i). Under the assumptions made, according to [12]:

C_ip = C_pi = Cov(ξ_i, ξ_p) + (3/4)Cov(ξ_i, η_p) + (3/4)Cov(η_i, ξ_p) + (2/3)Cov(η_i, η_p)
  + (1/4)Cov(ξ_i, ζ_p) + (1/4)Cov(η_i, ζ_p) + (1/4)Cov(ξ_p, ζ_i) + (1/4)Cov(η_p, ζ_i) + (1/6)Cov(ζ_i, ζ_p),

C_ii = Dξ_i + (2/3)Dη_i + (1/6)Dζ_i + (3/2)Cov(ξ_i, η_i) + (1/2)Cov(ξ_i, ζ_i) + (1/2)Cov(η_i, ζ_i).
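The fractional coefficients above can be checked against the level-set covariance formula of Sect. 3 (second approach). The sketch below (ours, not from the chapter) does this numerically for two assets whose parameter covariance matrix is an invented example; the r-level endpoints of a triangular (L, R) fuzzy variable with parameters (ξ + η, η, ζ) are ξ + ηr and ξ + η + (1 − r)ζ.

```python
import numpy as np

# Assumed covariance matrix of the random parameters (xi_i, eta_i, zeta_i, xi_p, eta_p, zeta_p).
S = np.array([
    [1.0e-4, 2.0e-5, 1.0e-5, 3.0e-5, 1.0e-5, 5.0e-6],
    [2.0e-5, 4.0e-5, 8.0e-6, 1.0e-5, 6.0e-6, 2.0e-6],
    [1.0e-5, 8.0e-6, 3.0e-5, 5.0e-6, 2.0e-6, 4.0e-6],
    [3.0e-5, 1.0e-5, 5.0e-6, 2.0e-4, 3.0e-5, 1.5e-5],
    [1.0e-5, 6.0e-6, 2.0e-6, 3.0e-5, 5.0e-5, 9.0e-6],
    [5.0e-6, 2.0e-6, 4.0e-6, 1.5e-5, 9.0e-6, 4.0e-5]])

def cov_by_level_integral(S, n=2001):
    """Cov(X_i, X_p) = 1/2 * integral over r of [cov of lower endpoints + cov of upper endpoints]."""
    rs = np.linspace(0.0, 1.0, n)
    vals = []
    for r in rs:
        lo_i = np.array([1.0, r, 0.0, 0.0, 0.0, 0.0])        # X_i^-(r) = xi_i + r*eta_i
        lo_p = np.array([0.0, 0.0, 0.0, 1.0, r, 0.0])
        up_i = np.array([1.0, 1.0, 1.0 - r, 0.0, 0.0, 0.0])  # X_i^+(r) = xi_i + eta_i + (1-r)*zeta_i
        up_p = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0 - r])
        vals.append(0.5 * (lo_i @ S @ lo_p + up_i @ S @ up_p))
    vals = np.array(vals)
    return float(np.sum((vals[1:] + vals[:-1]) * np.diff(rs)) / 2.0)   # trapezoid rule

def cov_closed_form(S):
    """Closed-form coefficients quoted in the text (indices: 0 xi_i, 1 eta_i, 2 zeta_i, 3 xi_p, 4 eta_p, 5 zeta_p)."""
    return (S[0, 3] + 0.75 * S[0, 4] + 0.75 * S[1, 3] + (2 / 3) * S[1, 4]
            + 0.25 * S[0, 5] + 0.25 * S[1, 5] + 0.25 * S[3, 2] + 0.25 * S[4, 2]
            + (1 / 6) * S[2, 5])

print("level-set integral :", cov_by_level_integral(S))
print("closed form        :", cov_closed_form(S))
```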
6 Model Example

Let us demonstrate the possibilities of the approach to portfolio optimization on real data [22]. We form a portfolio using the securities of three companies, which are among the largest in Russia. In the model example, information about the securities of these companies over the period 25 October 2000–25 October 2001 is used. Tables 1–3 contain the auction information: auction date, average-weighted price, minimum deal price, maximum deal price, opening purchase price, opening sale price, closing purchase price, closing sale price. All prices are in dollars. Fragments of the tables are given below. On the basis of the existing information we get the maximum and minimum asset prices, and the average-weighted (most possible) price of an asset. So the fuzzy variable
Table 1. Auction archive of the First Company security, 25 October 2000–25 October 2001

auction date  avg.-weighted  purchase  sale     purchase  sale     min. deal  max. deal
              price          opening   opening  closing   closing  price      price
25.10.00      0.1258         0.1275    0.1285   0.1251    0.1255   0.1230     0.1285
26.10.00      0.1246         0.1225    0.1245   0.1247    0.1250   0.1225     0.1259
27.10.00      0.1262         0.1250    0.1270   0.1262    0.1267   0.1250     0.1273
...           ...            ...       ...      ...       ...      ...        ...
23.10.01      0.0982         0.0965    0.0971   0.0988    0.0991   0.0963     0.0995
24.10.01      0.0989         0.0982    0.0990   0.0980    0.0984   0.0977     0.0998
25.10.01      0.0970         0.0980    0.0995   0.0950    0.0955   0.0951     0.0994
Table 2. Auction archive of the Second Company security, 25 October 2000–25 October 2001

auction date  avg.-weighted  purchase  sale     purchase  sale     min. deal  max. deal
              price          opening   opening  closing   closing  price      price
25.10.00      14.07          14.25     14.45    14.00     14.05    13.85      14.30
26.10.00      13.69          13.80     14.05    13.62     13.70    13.60      13.80
27.10.00      13.52          13.65     13.85    13.64     13.75    13.45      13.75
...           ...            ...       ...      ...       ...      ...        ...
23.10.01      10.59          10.60     10.65    10.61     10.66    10.52      10.71
24.10.01      10.64          10.60     10.68    10.57     10.63    10.55      10.73
25.10.01      10.54          10.56     10.72    10.42     10.46    10.42      10.74
Table 3. Auction archive of the Third Company security, 25 October 2000–25 October 2001

auction date  avg.-weighted  purchase  sale     purchase  sale     min. deal  max. deal
              price          opening   opening  closing   closing  price      price
25.10.00      0.2756         0.2790    0.2820   0.2755    0.2770   0.2725     0.2815
26.10.00      0.2677         0.2710    0.2750   0.2675    0.2690   0.2660     0.2730
27.10.00      0.2672         0.2695    0.2725   0.2692    0.2715   0.2650     0.2710
...           ...            ...       ...      ...       ...      ...        ...
23.10.01      0.2560         0.2525    0.2550   0.2590    0.2599   0.2530     0.2600
24.10.01      0.2578         0.2580    0.2595   0.2550    0.2557   0.2550     0.2615
25.10.01      0.2508         0.2550    0.2597   0.2460    0.2475   0.2460     0.2580
Table 4. Modal values of the expected values

m1           m2            m3
0.102954365  10.69961905   0.240925
with a triangular distribution function can act as the price value. As a result, we can forecast the values of the possibilistic distribution parameters, and thereby the prices of a financial asset, using time series. Using known formulas of mathematical statistics, the modal values of the possibilistic distributions, the covariance coefficients and the variances were estimated. The results are represented in Tables 4 and 5. Using Table 5 we get the covariance matrix of the fuzzy random variables, Table 6. The optimal portfolio under the preset level of expected value E∗ = 0.3 is ϖ∗ = (ϖ1, ϖ2, ϖ3) = (0.05; 0.01; 0.94). Thus, it is better to invest the main part of the capital into the securities of the third company. The risk of the formed portfolio is R = 0.000853628.
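This result can be re-derived from the closed-form solution of Sect. 5.4. The following sketch (ours, not part of the original study) plugs the mean vector of Table 4 and the covariance matrix of Table 6 into that formula; small deviations from the rounded values quoted above are to be expected because the table entries are themselves rounded.

```python
import numpy as np

m = np.array([0.102954365, 10.69961905, 0.240925])      # modal values, Table 4
C = np.array([[0.000143870, 0.010888659, 0.000173476],  # covariance matrix, Table 6
              [0.010888659, 1.566455369, 0.024170827],
              [0.000173476, 0.024170827, 0.000538912]])
E_star = 0.3
e = np.ones(3)

Ci = np.linalg.inv(C)
A, B, D = e @ Ci @ e, m @ Ci @ e, m @ Ci @ m
w = ((D - E_star * B) * (Ci @ e) + (E_star * A - B) * (Ci @ m)) / (A * D - B * B)

print("weights:", np.round(w, 2))
print("expected value:", round(m @ w, 3), " risk:", round(w @ C @ w, 9))
```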
Table 5. Covariance matrix of the random parameters of the possibilistic distributions

Cov        ξ1            η1            ζ1            ξ2            η2            ζ2            ξ3            η3            ζ3
ξ1    0.00014831   −0.00000316   −0.00000316    0.01136370   −0.00024537   −0.00013500    0.00018232   −0.00000419   −0.00000630
η1   −0.00000316    0.00000174    0.00000096   −0.00032322    0.00006207    0.00005233   −0.00000560    0.00000160    0.00000143
ζ1   −0.00000316    0.00000096    0.00000147   −0.00038400    0.00005898    0.00007158   −0.00000833    0.00000147    0.00000195
ξ2    0.01136370   −0.00032322   −0.00038400    1.56824690   −0.00851037   −0.00552300    0.02461910   −0.00030990   −0.00060289
η2   −0.00024537    0.00006207    0.00005898   −0.00851037    0.01149926    0.00782606   −0.00030062    0.00016854    0.00016061
ζ2   −0.00013500    0.00005233    0.00007158   −0.00552300    0.00782606    0.01293850   −0.00023300    0.00014283    0.00018176
ξ3    0.00018232   −0.00000560   −0.00000833    0.02461910   −0.00030062   −0.00023300    0.00055000   −0.00000751   −0.00001660
η3   −0.00000419    0.00000160    0.00000147   −0.00030990    0.00016854    0.00014283   −0.00000751    0.00000683    0.00000490
ζ3   −0.00000630    0.00000143    0.00000195   −0.00060289    0.00016061    0.00018176   −0.00001660    0.00000490    0.00000762
Table 6. Covariance matrix of the fuzzy random variables

Cov(Xi, Xj)   X1            X2            X3
X1            0.000143870   0.010888659   0.000173476
X2            0.010888659   1.566455369   0.024170827
X3            0.000173476   0.024170827   0.000538912
7 Conclusion

In the present paper an approach to the analysis of portfolio selection problems based on possibilistic–probabilistic optimization is described. Principles of decision making in a fuzzy random environment are formulated; they provide the grounds for the generalized models of portfolio analysis developed here. Indirect methods of portfolio optimization with these models in a fuzzy random environment are presented. They are based on the construction of equivalent deterministic analogues, whose realization can be carried out within the frameworks of quadratic and, in some cases, separable programming. The representation of fuzzy random data is implemented on the basis of a shift-scaled family of distributions of possibilistic variables; it allows the explication of the probabilistic part on the level of the shift and scale parameters. This model of a fuzzy random variable is convenient for applications. The calculus of fuzzy random variables is presented with the second-order moments defined in fuzzy form. However, within the proposed scheme of possibilistic–probabilistic optimization, the models of portfolio analysis and the optimization methods can also be developed for the case when the second-order moments of fuzzy random variables are defined in nonfuzzy form. As a direction of further research, a comparative investigation of these two approaches and the determination of the bounds of their adequate application can be considered.
Acknowledgments

This work was carried out with the financial support of RFBR (Project No. 02-01-011137 and partially Project No. 04-01-96720).
References

1. H. Markowitz. Portfolio selection: efficient diversification of investments. Wiley, New York, 1959
2. L. A. Zadeh. Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems 1 (1978) 3–28
3. D. Dubois, H. Prade. Possibility theory: an approach to computerized processing of uncertainty. Plenum, New York, 1988
4. M. Inuiguchi, J. Ramik. Possibilistic linear programming: a brief review of fuzzy mathematical programming and a comparison with stochastic programming in portfolio selection problem, Fuzzy Sets and Systems 111 (2000) 3–28
5. M. Inuiguchi, T. Tanino. Portfolio selection under independent possibilistic information, Fuzzy Sets and Systems 115 (2001) 83–92
6. S. Nahmias. Fuzzy variables, Fuzzy Sets and Systems 1 (1978) 97–110
7. S. Nahmias. Fuzzy variables in a random environment, In: M. M. Gupta et al. (eds.). Advances in Fuzzy Set Theory. NHCP, 1979
8. H. Kwakernaak. Fuzzy random variables – 1. Definitions and theorems, Information Sciences 15 (1978) 1–29
9. M. L. Puri, D. A. Ralescu. Fuzzy random variables, Journal of Mathematical Analysis and Applications 114 (1986) 409–422
10. A. V. Yazenin, M. Wagenknecht. Possibilistic optimization. Brandenburgische Technische Universität, Cottbus, Germany, 1996
11. M. Yu. Khokhlov, A. V. Yazenin. The calculation of numerical characteristics of fuzzy random data, Vestnik TvGU, No. 2, Series “Applied Mathematics,” 2003, 39–43
12. Y. Feng, L. Hu, H. Shu. The variance and covariance of fuzzy random variables, Fuzzy Sets and Systems 120 (2001) 487–497
13. A. V. Yazenin. Linear programming with fuzzy random data, Izv. AN SSSR, Tekhn. kibernetika 3 (1991) 52–58
14. A. V. Yazenin. On the method of solving the linear programming problem with fuzzy random data, Izv. RAN, Teoriya i sistemy upravleniya 5 (1997) 91–95
15. I. A. Yazenin. Minimal risk portfolio and maximum effectiveness portfolio in fuzzy random environment, Slozhnye sistemy: modelirovanie i optimizatsiya, Tver, TvGU, 2001, 59–63
16. I. A. Yazenin. On the methods of investment portfolio optimization in fuzzy random environment, Slozhnye sistemy: modelirovanie i optimizatsiya, Tver, TvGU, 2002, 130–135
17. I. A. Yazenin. On the model of investment portfolio optimization, Vestnik TvGU, No. 2, Series “Applied Mathematics,” 2003, 102–105
18. A. V. Yazenin. On the problem of maximization of attainment of fuzzy goal possibility, Izv. RAN, Teoriya i sistemy upravleniya 4 (1999) 120–123
19. A. V. Yazenin. On the problem of possibilistic optimization, Fuzzy Sets and Systems 81 (1996) 133–140
20. E. N. Grishina. On one method of portfolio optimization with fuzzy random data, Proceedings of the International Conference on Fuzzy Sets and Soft Computing in Economics and Finance (June 17–20, Saint-Petersburg, Russia), 2004, v. 2, 493–498
21. E. N. Grishina, A. V. Yazenin. About one approach to portfolio optimization, Proceedings of the 11th Zittau Colloquium (September 8–10, Zittau, Germany), 2004, 219–226
22. http://www.quote.ru
Toward Graded and Nongraded Variants of Stochastic Dominance

Bernard De Baets and Hans De Meyer
Summary. We establish a pairwise comparison method for random variables. This comparison results in a probabilistic relation on a given set of random variables. The transitivity of this probabilistic relation is investigated in the case of independent random variables, as well as when these random variables are pairwisely coupled by means of a copula, more in particular the minimum operator or the Łukasiewicz t-norm. A deeper understanding of this transitivity, which can be captured only in the framework of cycle-transitivity, allows to identify appropriate strict or weak thresholds, depending upon the copula involved, turning the probabilistic relation into a strict order relation. Using 1/2 as a fixed weak threshold does not guarantee an acyclic relation, but is always one-way compatible with the classical concept of stochastic dominance. The proposed method can therefore also be seen as a way of generating graded as well as nongraded variants of that popular concept.
1 Introduction

We denote the joint cumulative distribution function (c.d.f.) of a random vector (X1, X2, ..., Xm) as FX1,X2,...,Xm. This joint c.d.f. characterizes the random vector almost completely. Nevertheless, it is known from probability theory and statistics that practical considerations often lead one to capture the properties of the random vector and its joint c.d.f. as much as possible by means of a restricted number of (numerical) characteristics. The expected value, variance and other (central) moments of the components Xi belong to the family of characteristics that can be computed from the marginal c.d.f. FXi solely. A second family consists of characteristics that measure dependence or association between the components of the random vector. Well-known members of this family are the correlation coefficient, also known as Pearson's product–moment correlation coefficient, Kendall's τ and Spearman's ρ. In general, their computation only requires the knowledge of the bivariate c.d.f. FXi,Xj. The function C that joins the one-dimensional marginal c.d.f. FXi
and FXj into the bivariate marginal c.d.f. FXi ,Xj is known as a copula [10]: FXi ,Xj = C(FXi , FXj ) . The relation of stochastic dominance is one of the fundamental concepts of decision theory, and has been widely used in economics and financial mathematics [9]. It introduces a partial order on a given set of random variables. The random variables are compared two by two by pointwise comparison of some performance functions constructed from their (marginal) distribution functions. Our goal in this contribution is to establish a new method for comparing the components of a random vector in a pairwise manner based on the bivariate marginal distributions, rather than the univariate ones. More in particular, with any given random vector we will associate a so-called probabilistic relation. Our main concern is to study the type of transitivity exhibited by this probabilistic relation and to analyze to what extent it depends upon the copula that pairwisely couples the components of the random vector. To that end, we need a framework that allows to describe a sufficiently broad range of types of transitivity. The one that will prove to be the best suited is the framework of cycle-transitivity, which has been laid bare by the present authors [3]. This chapter is organized as follows. In Sect. 2, we propose a new method for generating a probabilistic relation from a given random vector and indicate in what sense this relation generalizes the concept of stochastic dominance [9]. One of our aims is to characterize the type of transitivity exhibited by this relation. To that end, we give a brief introduction to the framework of cycletransitivity in Sect. 3. In Sect. 4, we consider a random vector with pairwise independent components and analyze the transitivity of the generated probabilistic relation, while in Sect. 5 we are concerned with random vectors that have dependent components. In the latter section, we first briefly review the important concept of a copula. Then we study two extreme types of couplings between the components of a random vector, namely by means of one of the copulas in between which all other copulas are situated, i.e., the minimum operator and the Łukasiewicz t-norm [7, 10]. Finally, in Sects. 6 and 7, we explain how the results presented lead to a whole range of methods for comparing probability distributions and identify proper ways of defining a strict order on them, thus offering valuable alternatives to the usual notion of stochastic dominance.
2 A Method for Comparing Random Variables

An immediate way of comparing two real random variables X and Y is to consider the probability that the first one takes a value greater than the second one. Proceeding in this way, a random vector (X1, X2, ..., Xm) generates a probabilistic relation (also called reciprocal relation or ipsodual relation), as follows.
Definition 1. Given a random vector (X1, X2, ..., Xm), the binary relation Q defined by:

Q(Xi, Xj) = Prob{Xi > Xj} + (1/2) Prob{Xi = Xj}    (1)

is a probabilistic relation, i.e., for all (i, j) it holds that: Q(Xi, Xj) + Q(Xj, Xi) = 1.

Note that Q(X, Y) is not the probability that X takes a greater value than Y, since in order to make Q a probabilistic relation, we also take half of the probability of a tie into account. It is clear from this definition that the relation Q can be computed immediately from the bivariate joint cumulative distributions FXi,Xj as:

Q(Xi, Xj) = ∫∫_{x>y} dFXi,Xj(x, y) + (1/2) ∫∫_{x=y} dFXi,Xj(x, y).    (2)

If we want to further simplify (2), it is appropriate to distinguish between the following two cases. If the random vector is a discrete random vector, then

Q(Xi, Xj) = Σ_{k>l} pXi,Xj(k, l) + (1/2) Σ_k pXi,Xj(k, k),    (3)

with pXi,Xj the joint probability mass function of (Xi, Xj), and if it is a continuous random vector, then

Q(Xi, Xj) = ∫_{−∞}^{+∞} dx ∫_{−∞}^{x} fXi,Xj(x, y) dy,    (4)

with fXi,Xj the joint probability density function of (Xi, Xj). Note that in the transition from the discrete to the continuous case, the second contribution to Q(Xi, Xj) in (2) has disappeared in (4), since in the latter case it holds that Prob{Xi = Xj} = 0.

The probabilistic relation Q generated by a random vector yields a recipe for comparison that takes into account the bivariate joint probability distribution function, hence to some extent the pairwise dependence of the components. The information contained in the probabilistic relation is therefore much richer than if for the pairwise comparison of Xi and Xj we would have used, for instance, only their expected values E[Xi] and E[Xj].

For two random variables X and Y, one says that X is weakly statistically preferred to Y, denoted as X ⊴ Y, if Q(X, Y) ≥ 1/2; if Q(X, Y) > 1/2, then one says that X is statistically preferred to Y, denoted X ⊲ Y. Of course, we would like to know whether the relations ⊴ or ⊲ are transitive. To that aim, let us consider the following example of a discrete random vector (X, Y, Z) with three pairwise independent components, uniformly distributed over

DX = {1, 3, 4, 15, 16, 17},  DY = {2, 10, 11, 12, 13, 14},  DZ = {5, 6, 7, 8, 9, 18}.
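To make the comparison concrete, here is a short script (ours, not part of the chapter) that evaluates (3) for independent components uniformly distributed on finite integer multisets; the results it prints are discussed in the text that follows.

```python
from fractions import Fraction
from itertools import product

def Q(A, B):
    """Probabilistic relation (3) for independent, uniformly distributed
    discrete variables on the multisets A and B."""
    wins = sum(1 for a, b in product(A, B) if a > b)
    ties = sum(1 for a, b in product(A, B) if a == b)
    return Fraction(wins, len(A) * len(B)) + Fraction(ties, 2 * len(A) * len(B))

DX = [1, 3, 4, 15, 16, 17]
DY = [2, 10, 11, 12, 13, 14]
DZ = [5, 6, 7, 8, 9, 18]
print(Q(DX, DY), Q(DY, DZ), Q(DX, DZ))   # 5/9, 25/36, 5/12  (= 20/36, 25/36, 15/36)
```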
We can apply (3) with all joint probability masses equal to 1/36. More precisely, we obtain Q(X, Y) = 20/36, Q(Y, Z) = 25/36, and Q(X, Z) = 15/36, from which it follows that X ⊲ Y, Y ⊲ Z, and Z ⊲ X. It turns out that in this case the relation ⊲ (and hence also ⊴) forms a cycle, and hence is not transitive.

An alternative concept for comparing two random variables, or equivalently, two probability distributions, is that of stochastic dominance [9].

Definition 2. A random variable X with c.d.f. FX weakly stochastically dominates in first degree a random variable Y with c.d.f. FY, denoted as X ≽1 Y, if it holds that FX ≤ FY. If, moreover, it holds that FX(t) < FY(t) for some t, then it is said that X stochastically dominates in first degree Y, denoted as X ≻1 Y.

Note that, as for any comparison method that relies only upon characteristics of the marginal distributions, the stochastic dominance relation ≽1 does not take into account any effects of the possible pairwise dependence of the random variables. Moreover, the condition for first-degree stochastic dominance is rather severe, as it requires that the graph of the c.d.f. FX lies beneath the graph of the c.d.f. FY. The need to relax this condition has led to other types of stochastic dominance, such as second-degree and third-degree stochastic dominance. We will not go into more details here, since we just want to emphasize the following relationship between weak first-degree stochastic dominance and the relation ⊴.
Proposition 1. For any two random variables X and Y it holds that weak stochastic dominance implies weak statistical preference, i.e., X ≽1 Y implies X ⊴ Y.

The relation ⊴ therefore generalizes weak first-degree stochastic dominance ≽1. Note that the same implication is not true in general for the strict versions ⊲ and ≻1. Since the probabilistic relation Q is a graded alternative to the crisp relation ⊴, we can interpret it as a graded generalization of weak first-degree stochastic dominance. Unfortunately, as shown in the example above, the relation ⊴ is not necessarily transitive, while this is obviously the case for the relation ≽1. Further on in this chapter, we will show how this shortcoming can be resolved.
3 Cycle-Transitivity

Cycle-transitivity was proposed recently by the present authors as a general framework for studying the transitivity of probabilistic relations [1, 3]. The key feature is the cyclic evaluation of transitivity: triangles (i.e., any three points) are visited in a cyclic manner. An upper bound function acting upon the ordered weights encountered provides an upper bound for the “sum minus 1” of these weights. Cycle-transitivity incorporates various types of fuzzy transitivity and stochastic transitivity.
For a probabilistic relation Q on A, we define for all a, b, c the following quantities:

αabc = min(Q(a, b), Q(b, c), Q(c, a)),
βabc = median(Q(a, b), Q(b, c), Q(c, a)),
γabc = max(Q(a, b), Q(b, c), Q(c, a)).

Let us also denote ∆ = {(x, y, z) ∈ [0, 1]³ | x ≤ y ≤ z}.

Definition 3. A function U : ∆ → R is called an upper bound function if it satisfies:

(i) U(0, 0, 1) ≥ 0 and U(0, 1, 1) ≥ 1;
(ii) for any (α, β, γ) ∈ ∆:

U(α, β, γ) + U(1 − γ, 1 − β, 1 − α) ≥ 1.    (5)

The function L : ∆ → R defined by

L(α, β, γ) = 1 − U(1 − γ, 1 − β, 1 − α)    (6)

is called the dual lower bound function of a given upper bound function U. Inequality (5) then simply expresses that L ≤ U.

Definition 4. A probabilistic relation Q on A is called cycle-transitive w.r.t. an upper bound function U if for any (a, b, c) ∈ A³ it holds that

L(αabc, βabc, γabc) ≤ αabc + βabc + γabc − 1 ≤ U(αabc, βabc, γabc),    (7)

where L is the dual lower bound function of U.

Due to the built-in duality, it holds that if (7) is true for some (a, b, c), then this is also the case for any permutation of (a, b, c). In practice, it is therefore sufficient to check (7) for a single permutation of any (a, b, c) ∈ A³. Alternatively, due to the same duality, it is also sufficient to verify the right-hand inequality (or equivalently, the left-hand inequality) for two permutations of any (a, b, c) ∈ A³ (not being cyclic permutations of one another), e.g., (a, b, c) and (c, b, a).

Proposition 2. A probabilistic relation Q on A is cycle-transitive w.r.t. an upper bound function U if for any (a, b, c) ∈ A³ it holds that

αabc + βabc + γabc − 1 ≤ U(αabc, βabc, γabc).    (8)
4 The Case of Independent Random Variables

In this section, we consider the case of a random vector with pairwise independent components Xi. In fact, as is well known in probability theory, this
does not mean that the random variables Xi are necessarily mutually independent. However, since our comparison method only involves the bivariate c.d.f., the distinction between pairwisely and mutually independent random variables is superfluous in the present discussion. An important consequence of the assumed pairwise independence is that the bivariate distribution functions become factorizable into the univariate marginal distributions, in particular FXi,Xj = FXi FXj, for a discrete random vector pXi,Xj = pXi pXj, and for a continuous random vector fXi,Xj = fXi fXj. In this case, (3) and (4) can be simplified into

Q(Xi, Xj) = Σ_{k>l} pXi(k) pXj(l) + (1/2) Σ_k pXi(k) pXj(k)    (9)

and

Q(Xi, Xj) = ∫_{−∞}^{+∞} fXi(xi) dxi ∫_{−∞}^{xi} fXj(xj) dxj.    (10)
The first case in which we have been able to determine the type of transitivity of the probabilistic relation is that of a discrete random vector with pairwise independent components that are uniformly distributed on arbitrary integer (multi)sets. In this case the components Xi of the random vector can be regarded as hypothetical dice (with as many faces as elements in the corresponding multiset), whereas Q(Xi , Xj ) can then be seen as the probability that dice Xi wins from dice Xj . Such a discrete uniformly distributed random vector, together with the generated probabilistic relation, has therefore been called a standard discrete dice model [5]. The type of transitivity of the probabilistic relation Q can be very neatly described in the framework of cycle-transitivity [3]. Proposition 3. The probabilistic relation generated by a random vector with pairwise independent components that are uniformly distributed on finite integer multisets is cycle-transitive w.r.t. the upper bound function UD defined by: UD (α, β, γ) = β + γ − βγ . This type of transitivity is called dice-transitivity. In [6], the transitivity was investigated in the general case of discrete or continuous random vectors with arbitrary independent components. Somewhat surprisingly, the above case turned out, as far as transitivity of the probabilistic relation is concerned, to be generic for the most general situation. Proposition 4. A random vector with arbitrary pairwise independent components generates a probabilistic relation that is dice-transitive.
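Dice-transitivity can be checked numerically for any triple. The helper below (ours, not code from the chapter) implements condition (8) with the upper bound function UD and applies it to both cyclic orientations of the dice example above, using Q(Z, X) = 1 − Q(X, Z) = 21/36.

```python
def dice_transitive(q_ab, q_bc, q_ca, eps=1e-12):
    """Check condition (8) with U_D(alpha, beta, gamma) = beta + gamma - beta * gamma."""
    alpha, beta, gamma = sorted([q_ab, q_bc, q_ca])
    return alpha + beta + gamma - 1 <= beta + gamma - beta * gamma + eps

# Cyclic orientation (X, Y, Z): Q(X, Y), Q(Y, Z), Q(Z, X).
print(dice_transitive(20 / 36, 25 / 36, 21 / 36))   # True
# Reverse orientation (Z, Y, X): Q(Z, Y), Q(Y, X), Q(X, Z).
print(dice_transitive(11 / 36, 16 / 36, 15 / 36))   # True
```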
5 The Case of Dependent Random Variables

5.1 Joint Distribution Functions and Copulas

In this section, we focus on dependent random variables. We consider the general case of a random vector (X1, X2, ..., Xm) with joint c.d.f. FX1,X2,...,Xm, to which we associate a probabilistic relation Q, as defined in (1) or, equivalently, in (2). Sklar's theorem [10, 11] tells us that if a joint c.d.f. FXi,Xj has marginal c.d.f. FXi and FXj, then there exists a copula C such that for all x, y:

FXi,Xj(x, y) = C(FXi(x), FXj(y)).    (11)
Let us recall that a copula is a binary operation C : [0, 1]² → [0, 1] that has neutral element 1 and absorbing element 0 and that satisfies the property of moderate growth [10]: for any (x1, x2, y1, y2) ∈ [0, 1]⁴ it holds that

(x1 ≤ x2 ∧ y1 ≤ y2) ⇒ C(x1, y1) + C(x2, y2) ≥ C(x1, y2) + C(x2, y1).

A copula C is called stable if for all (x, y) ∈ [0, 1]² it holds that [8]:

C(x, y) + 1 − C(1 − x, 1 − y) = x + y.

If the random variables Xi and Xj are continuous, then the copula C in (11) is unique; otherwise, C is uniquely determined on Ran(FXi) × Ran(FXj). Conversely, if C is a copula and FXi and FXj are c.d.f., then the function defined by (11) is a joint c.d.f. with marginal c.d.f. FXi and FXj. For independent random variables, the copula C is the product copula TP (TP(x, y) = xy). In this section we will consider the two extreme copulas in between which all other copulas are situated, i.e., the Łukasiewicz copula TL (TL(x, y) = max(x + y − 1, 0), also called the Fréchet–Hoeffding lower bound) and the minimum operator TM (TM(x, y) = min(x, y), also called the Fréchet–Hoeffding upper bound). It should be remarked that in the literature on copulas, these copulas are usually denoted W and M instead of their t-norm equivalents TL and TM [7].

5.2 The Compatibility Problem for Copulas and Artificial Coupling

The fact that for a given random vector not all copulas underlying the bivariate c.d.f. are necessarily the same turns the characterization of the transitivity of the probabilistic relation Q into an unfeasible problem. It is therefore natural to assume in first instance that all these copulas coincide. However, we should be cautious here, as it is not guaranteed that there exists an FX1,X2,...,Xm compatible with that assumption. This is closely related to a famous open problem in the theory of copulas, also known as the compatibility problem [10].
In order to circumvent this problem, we will lower our ambitions and reconsider the marginal distributions FXi only. However, for the computation of Q, we will artificially couple them into a bivariate distribution by means of a fixed copula C and denote the corresponding Q as QC. In Sect. 4, we have studied the case of independent random variables, which, in the spirit of this section, means that the marginal distributions are linked through C = TP. These results clearly stay valid when coupling arbitrary random variables (i.e., not necessarily pairwisely independent) by means of C = TP. In that case, the computation of QP can be done by means of (9) (discrete case) or (10) (continuous case). Obviously, one can pose the question whether for the two presently considered extreme cases of C, again simple formulas can be derived that facilitate the computation of QC.

5.3 The Copula TM

We first consider the pairwise coupling of components by means of the copula TM. We have demonstrated that [4]:

Proposition 5. Let (X1, X2, ..., Xm) be a continuous random vector. Then the probabilistic relation QM can be computed as:

QM(Xi, Xj) = ∫_{x: FXi(x) < FXj(x)} fXi(x) dx + (1/2) ∫_{x: FXi(x) = FXj(x)} fXi(x) dx.    (12)

One can easily verify that QM(X, Y) = 1/2 when FX = FY. In case of samples of the same size, the following proposition can be invoked for computing QM on the empirical distributions.

Proposition 6. Let Xi, i = 1, ..., m, be discrete r.v. uniformly distributed on (non-necessarily disjoint) finite multisets Ai with the same cardinality n. Let x_i^(1) ≤ x_i^(2) ≤ ··· ≤ x_i^(n) denote the elements of Ai in increasing order. If the r.v. are coupled by TM, then the probabilistic relation QM is given by:

QM(Xi, Xj) = (1/n) Σ_{k=1}^{n} δ_k^M,    (13)

where

δ_k^M = 1 if x_i^(k) > x_j^(k),  1/2 if x_i^(k) = x_j^(k),  0 if x_i^(k) < x_j^(k).    (14)
From (14) it is obvious that the computation of QM(Xi, Xj) requires one to compare the element x_i^(k) at the k-th position in the ordered list of the elements of Ai to the element x_j^(k) at the k-th position in the ordered list of the elements of Aj. Let us illustrate this procedure on the following example. Consider X and Y uniformly distributed on the integer sets AX and AY:

AX = {1, 2, 5, 8},  AY = {2, 3, 5, 7}.    (15)
Fig. 2. Comparison of the r.v. X and Y coupled by TM
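To make the comparison procedure of Proposition 6 concrete, here is a small Python sketch (an illustrative addition, not part of the original chapter; the function name q_m and the use of exact fractions are our own choices). It sorts both multisets and applies (13)-(14) to the sets in (15):

```python
from fractions import Fraction

def q_m(a, b):
    """Probabilistic relation Q^M(X, Y) of (13)-(14): X, Y uniform on the
    multisets a and b (same cardinality n), coupled by the copula T_M.
    The k-th smallest element of a is compared with the k-th smallest of b."""
    a, b = sorted(a), sorted(b)
    n = len(a)
    delta = [Fraction(1) if x > y else Fraction(1, 2) if x == y else Fraction(0)
             for x, y in zip(a, b)]
    return sum(delta) / n

# Example (15): A_X = {1, 2, 5, 8}, A_Y = {2, 3, 5, 7}
print(q_m([1, 2, 5, 8], [2, 3, 5, 7]))   # 3/8, as in Fig. 2
```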
or, equivalently:

Q^L(X_i, X_j) = F_{X_j}(u) \quad \text{with } u \text{ such that } F_{X_i}(u) + F_{X_j}(u) = 1 .   (17)
One can again easily verify that QL(X, Y) = 1/2 when FX = FY. Note that u in (17) may not be unique, in which case any u fulfilling the right equality may be considered. Then QL(X, Y) is simply the height of FXj in u. This is illustrated in Fig. 3, where QL(X, Y) = FY(u) = t1, since t1 + t2 = 1. In case of samples of the same size, the following proposition can be invoked for computing QL on the empirical distributions.

Proposition 8. Let Xi, i = 1, ..., m, be discrete r.v. uniformly distributed on (non-necessarily disjoint) finite multisets Ai with same cardinality n. Let x_i^{(1)} \le x_i^{(2)} \le \cdots \le x_i^{(n)} denote the elements of Ai in increasing order. If the r.v. are coupled by TL, then the probabilistic relation QL is given by:

Q^L(X_i, X_j) = \frac{1}{n} \sum_{k=1}^{n} \delta_k^L ,   (18)

\delta_k^L = \begin{cases} 1 , & \text{if } x_i^{(k)} > x_j^{(n-k+1)} ,\\ 1/2 , & \text{if } x_i^{(k)} = x_j^{(n-k+1)} ,\\ 0 , & \text{if } x_i^{(k)} < x_j^{(n-k+1)} . \end{cases}   (19)

The computation of QL(Xi, Xj) requires to compare element x_i^{(k)} at the kth position in the ordered list of elements of Ai to element x_j^{(n-k+1)} at the (n − k + 1)-th position in the ordered list of the elements of Aj. Consider again the example above concerning AX and AY. The way the ordered lists should be compared is now depicted in Fig. 4. Clearly, QL(X, Y) = 1/2.
Fig. 3. Comparison of two continuous random variables coupled by TL
Fig. 4. Comparison of the r.v. X and Y coupled by TL
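Analogously, a minimal sketch of (18)-(19) (again an illustrative addition of ours): the only difference with the TM case is that the second list is traversed in reverse order.

```python
from fractions import Fraction

def q_l(a, b):
    """Probabilistic relation Q^L(X, Y) of (18)-(19): the k-th smallest element
    of a is compared with the (n-k+1)-th smallest (i.e. k-th largest) element
    of b, which corresponds to coupling by the Lukasiewicz copula T_L."""
    a, b = sorted(a), sorted(b, reverse=True)
    n = len(a)
    delta = [Fraction(1) if x > y else Fraction(1, 2) if x == y else Fraction(0)
             for x, y in zip(a, b)]
    return sum(delta) / n

print(q_l([1, 2, 5, 8], [2, 3, 5, 7]))   # 1/2, as in Fig. 4
```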
6 Graded Variants of Stochastic Dominance

In Sect. 4, we have indicated that the probabilistic relation of a discrete or continuous dice model is dice-transitive and we have demonstrated that the framework of cycle-transitivity is very well suited to express this type of transitivity in a concise manner. Having in mind that the purpose of stochastic dominance was to define an order relation (in essence determined by its transitivity) on a set of random variables, the construction of a probabilistic relation exhibiting dice-transitivity on this set of random variables could be seen as a graded variant of stochastic dominance. The question naturally arises whether such an interpretation is also available for the extreme copulas. The answer is positive, as we have shown in [2]:

Theorem 1. Let C be a commutative copula such that for any n > 1 and for all 0 ≤ x1 ≤ x2 ≤ · · · ≤ xn ≤ 1 and 0 ≤ y1 ≤ y2 ≤ · · · ≤ yn ≤ 1, it holds that

\sum_i C(x_i, y_i) - \sum_j C(x_{n-2j}, y_{n-2j-1}) - \sum_j C(x_{n-2j-1}, y_{n-2j})
\le C\Big(x_n + \sum_j C(x_{n-2j-2}, y_{n-2j-1}) - \sum_j C(x_{n-2j}, y_{n-2j-1}),\; y_n + \sum_j C(x_{n-2j-1}, y_{n-2j-2}) - \sum_j C(x_{n-2j-1}, y_{n-2j})\Big) ,   (20)
where the sums extend over all integer values that lead to meaningful indices of x and y. Then the probabilistic relation Q generated by a collection of random variables pairwisely coupled by C is cycle-transitive w.r.t. the upper bound function U^C defined by:

U^C(\alpha, \beta, \gamma) = \max(\beta + C(1 - \beta, \gamma),\, \gamma + C(\beta, 1 - \gamma)) .   (21)

If C is stable, then

U^C(\alpha, \beta, \gamma) = \beta + C(1 - \beta, \gamma) = \gamma + C(\beta, 1 - \gamma) .   (22)
Note that without the framework of cycle-transitivity, it would be extremely difficult to describe this type of transitivity in a compact manner. The three main copulas considered in this chapter, TL, TP and TM, fulfil the conditions of Theorem 1 and are all stable. Let us discuss them in detail:
(i) For C = TL, we obtain from (22) that Q is cycle-transitive w.r.t. the upper bound function U^L given by: U^L(α, β, γ) = max(β, γ) = γ. Since γ ≤ β + γ − βγ, this type of transitivity is obviously stronger than dice-transitivity.
(ii) For C = TP, we retrieve the well-known case of independent variables, with U^P(α, β, γ) = β + γ − βγ = U_D(α, β, γ).
(iii) For C = TM , we obtain from (22) that Q is cycle-transitive w.r.t. the upper bound function U M given by: U M (α, β, γ) = min(β + γ, 1) , a type of transitivity that is obviously weaker than dice-transitivity.
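As a small numerical illustration of (21)-(22) (an addition of ours, not from the original text), the three specialized upper bound functions can be coded directly; for any ordered triple α ≤ β ≤ γ they are nested, which reflects that cycle-transitivity w.r.t. U^L is stronger than dice-transitivity while cycle-transitivity w.r.t. U^M is weaker:

```python
def u_l(a, b, c):   # C = T_L (Lukasiewicz copula): max(beta, gamma)
    return max(b, c)

def u_p(a, b, c):   # C = T_P (product copula): dice-transitivity bound
    return b + c - b * c

def u_m(a, b, c):   # C = T_M (minimum operator): min(beta + gamma, 1)
    return min(b + c, 1.0)

alpha, beta, gamma = 0.2, 0.5, 0.8   # any ordered triple in [0, 1]
print(u_l(alpha, beta, gamma) <= u_p(alpha, beta, gamma) <= u_m(alpha, beta, gamma))  # True
```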
7 Nongraded Variants of Stochastic Dominance

The results from the foregoing sections can also be exploited to come up with nongraded alternatives to the concept of stochastic dominance. The idea here is to transform the probabilistic relation QC into a crisp relation by setting an appropriate (strict) threshold (greater than 1/2). The question that arises is whether the knowledge of the type of cycle-transitivity can help in determining appropriate thresholds resulting in a strict order relation.

Theorem 2. Let X1, X2, ..., Xm be m random variables. For the copula C = TL, it holds that the binary relation >L defined by

X_i >_L X_j \iff Q^L(X_i, X_j) > \tfrac{1}{2}

is a strict order relation.

For the probabilistic relations QP and QM things are more complicated.
Theorem 3. Let X1, X2, ..., Xm be m random variables and consider the copula TP. Let k ∈ N, k ≥ 2.
(i) The binary relation >_P^k defined by

X_i >_P^k X_j \iff Q^P(X_i, X_j) > 1 - \frac{1}{4\cos^2(\pi/(k+2))}

is an asymmetric relation without cycles of length k.
(ii) The binary relation >_P^\infty defined by

X_i >_P^\infty X_j \iff Q^P(X_i, X_j) \ge \tfrac{3}{4}

is an asymmetric acyclic relation.
(iii) The transitive closure >_P of >_P^\infty is a strict order relation.

Note that Theorem 3 resolves the dice problem in Sect. 2. Indeed, for the given example it only holds that Y >_P^3 Z, since 22/36 < (\sqrt{5} - 1)/2 < 23/36, and there is no longer a cycle. The appropriate threshold in this case is nothing else but the golden section (\sqrt{5} - 1)/2. As can be expected from the above results, it is not easy to identify the appropriate threshold for a given copula C leading to an acyclic relation. Theorem 3 expresses that for the product copula TP there exists a sequence of thresholds converging to 3/4 and guaranteeing that the corresponding relation >_P^k contains no cycles of length k. Although >_P^\infty is not transitive in general, its transitive closure yields a strict order relation. The same can be done for the copula TM, but the results are less exciting.

Theorem 4. Let X1, X2, ..., Xm be m random variables and consider the copula TM. Let k ∈ N, k ≥ 2.
(i) The binary relation >_M^k defined by

X_i >_M^k X_j \iff Q^M(X_i, X_j) > \frac{k-1}{k}

is an asymmetric relation without cycles of length k.
(ii) The binary relation >_M defined by

X_i >_M X_j \iff Q^M(X_i, X_j) = 1

is a strict order relation.

Theorem 4 shows that also for TM there exists a sequence of thresholds. Unfortunately, here it converges to 1. It is easily seen that >_M is even more demanding than ≻1. Finally, note that none of the relations >_L, >_P and >_M generalizes the relation ≻1.
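The thresholds appearing in Theorems 3 and 4 are easy to tabulate; the following sketch (our own, with illustrative values of k) shows the sequence for TP converging to 3/4, with the golden section appearing at k = 3, and the sequence (k − 1)/k for TM converging to 1:

```python
import math

def threshold_p(k):
    """Threshold of Theorem 3(i) for the product copula T_P."""
    return 1 - 1 / (4 * math.cos(math.pi / (k + 2)) ** 2)

def threshold_m(k):
    """Threshold of Theorem 4(i) for the copula T_M."""
    return (k - 1) / k

for k in (2, 3, 5, 10, 100):
    print(k, round(threshold_p(k), 4), round(threshold_m(k), 4))
# threshold_p(3) = 1 - 1/(4*cos^2(pi/5)) = (sqrt(5) - 1)/2, the golden section;
# threshold_p(k) -> 3/4 and threshold_m(k) -> 1 as k grows.
```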
8 Conclusion We have developed a general framework for the pairwise comparison of the components of a random vector, expressed in terms of a probabilistic relation. The framework of cycle-transitivity has proven extremely suitable for characterizing the transitivity of this probabilistic relation. This transitivity has been studied for probabilistic relations generated by pairwise independent random variables as well as in the case of dependent random variables, although most of the discussion was focused on coupling by TL or TM . This study has led to graded as well as nongraded alternatives to the classical concept of stochastic dominance.
Acknowledgments H. De Meyer is a Research Director of the Fund for Scientific Research – Flanders. This work is supported in part by the Bilateral Scientific and Technological Cooperation Flanders–Hungary BIL00/51 (B-08/2000). Special thanks also goes to EU COST Action 274 named TARSKI: “Theory and Applications of Relational Structures as Knowledge Instruments”.
References 1. B. De Baets and H. De Meyer, Transitivity frameworks for reciprocal relations: cycle-transitivity versus F G-transitivity, Fuzzy Sets and Systems 152 (2005), 249–270 2. B. De Baets and H. De Meyer, On the cycle-transitive comparison of artificially coupled random variables, Journal of Approx. Reas. (2007), in press. 3. B. De Baets, H. De Meyer, B. De Schuymer, and S. Jenei, Cyclic evaluation of transitivity of reciprocal relations, Social Choice and Welfare, 26 (2006), 217–238 4. H. De Meyer, B. De Baets, and B. De Schuymer, Extreme copulas and the comparison of ordered lists, Theory and Decision (2006), online DOI 10.1007/511238-006-9018-y. 5. B. De Schuymer, H. De Meyer, B. De Baets, and S. Jenei, On the cycletransitivity of the dice model, Theory and Decision 54 (2003), 264–285 6. B. De Schuymer, H. De Meyer, and B. De Baets, Cycle-transitive comparison of independent random variables, Journal of Multivariate Analysis 96 (2005), 352–373 7. E. Klement, R. Mesiar, and E. Pap, Triangular Norms, Trends in Logic, Studia Logica Library, Vol. 8, Kluwer, Dordrecht, 2000 8. E. Klement, R. Mesiar, and E. Pap, Invariant copulas, Kybernetika 38 (2002), 275–285 9. H. Levy, Stochastic Dominance, Kluwer, MA, 1998 10. R. Nelsen, An Introduction to Copulas, Lecture Notes in Statistics, Vol. 139, Springer, Berlin Heidelberg New York, 1998 11. A. Sklar, Fonctions de répartition à n dimensions et leurs marges, Publ. Inst. Statist. Univ. Paris 8 (1959), 229–231
Option Pricing in the Presence of Uncertainty S. Muzzioli and H. Reynaerts
Summary. In this chapter we investigate the derivation of the European option price in the Cox–Ross–Rubinstein [Cox, J., Ross, S., Rubinstein, S., J. Finan. Econ. 7 (1979) 229] binomial model in the presence of uncertainty in the volatility of the underlying asset. We propose two different approaches to the issue that concentrate on the fuzzification of one or both the two jump factors. The first approach is derived by assuming that both the jump factors are represented by triangular fuzzy numbers. We start from the derivation of the risk neutral probabilities, a problem that boils down to the solution of a linear system of equations with fuzzy coefficients that has no solution using standard fuzzy arithmetics. We recall the vector solution proposed by Buckley et al. [Buckley, J.J., Eslami, E., Feuring, T., Studies in Fuzziness and Soft Computing, Physica Verlag (2002); Buckley, J.J., Qu, Y., Fuzzy Sets and Systems 43 (1991) 33] that applies to that kind of systems and we give the conditions under which the solution exists and is unique for a broader class of equivalent fuzzy linear systems. As a last step, we apply the risk neutral probabilities to the valuation of the option. In the second part of the chapter we present a second approach to the option pricing problem, that is derived under the assumption that only the up jump factor is uncertain. We analyse the derivative of the option price in the Cox–Ross–Rubinstein binomial model with respect to the up jump factor. This method of investigation is very general since it is consistent both with the assumption of an interval value for the up factor, as well as with the assumption of a fuzzy value for that factor. Differently from the continuous time model in which the option price is an increasing function of the volatility, in this discrete time binomial model the call option is a non-decreasing function of the up jump factor and in turn of the volatility.
1 Introduction

The aim of this chapter is to derive the price of a European option when there is uncertainty in the volatility of the underlying asset. Starting from the Cox–Ross–Rubinstein [5] binomial model, in which the option has a well-known valuation formula, we investigate the effect
on the option price of assuming the volatility as an uncertain parameter. In fact, in real markets it is usually hard to precisely estimate the volatility of the underlying asset, therefore it is convenient to let it take interval values, whereby central members of the interval have a higher possibility than members near the boarders. In the binomial model of Cox–Ross–Rubinstein [5] the volatility is provided by two jump factors, up and down, that describe the possible moves of the underlying asset in the next time period. Therefore, the two different approaches to the derivation of the option price that we analyse in the following concentrate on the fuzzification of one or both the two jump factors. In the first part of the chapter we present a first approach to the problem of option pricing, that is based on Muzzioli and Torricelli [8], that is derived by assuming that both the jump factors are represented by triangular fuzzy numbers. In order to compute the option price we first show how to derive the risk neutral probabilities, i.e. the probabilities of an up and a down move of the underlying asset in the next time period in a risk neutral world. The risk neutral probabilities derivation is a fundamental problem in finance since they are necessary for the pricing of any derivative security. The problem boils down to the solution of a linear system of equations with fuzzy coefficients. In particular we show that the resulting system has no solution using standard fuzzy arithmetics. Therefore, we recall the vector solution proposed by Buckley et al. [2, 3] that applies to that kind of systems and, following Muzzioli and Reynaerts [10,11], we give the conditions under which the solution exists and is unique for a broader class of equivalent fuzzy linear systems. Once we derived the risk neutral probabilities we apply them to the valuation of the option. In the second part of the chapter we present a second approach to the option pricing problem, that is based on Reynaerts and Vanmaele [12], that is derived under the assumption that only the up jump factor is uncertain. In this approach we analyse the derivative of the option price in the standard Cox–Ross–Rubinstein [5] binomial model with respect to the up jump factor. This method of investigation is very general since it is consistent both with the assumption of an interval value for the up factor, as well as with the assumption of a fuzzy value for the up factor. The plan of the chapter is the following. The first part of the chapter covers Sects. 2–5, the second part Sect. 6. In Sect. 2 we recall the financial problem and we explain how we include uncertainty on the underlying asset moves. In Sect. 3 we make a digression on the fuzzy linear systems solution problem, we recall the Buckley et al. [2, 3] solution and we give the conditions under which the solution exists and is unique for a broader class of equivalent fuzzy linear systems. In Sect. 4 we apply the methodology explained in Sect. 3 to find the solution to the fuzzy linear system that provides us with the risk neutral probabilities and in Sect. 5 we derive the option price. In Sect. 6 we illustrate the second approach to the pricing problem. The last section concludes.
2 European-Style Plain Vanilla Options and the Uncertainty in the Volatility

2.1 European-Style Plain Vanilla Options

A European-style plain vanilla option is a financial security that provides its holder, in exchange for the payment of a premium (the price of the option), the right but not the obligation to either buy (call option) or sell (put option) a certain underlying asset by a certain date in the future (maturity date) for a pre-specified price K (strike price). The binary tree model of Cox–Ross–Rubinstein [5] can be considered as a discrete-time version of the Black and Scholes [1] model. The following assumptions are made:
– (A1) The markets have no transaction costs, no taxes, no restrictions on short sales, and assets are infinitely divisible.
– (A2) The lifetime T of the option is divided into N time steps of length T/N.
– (A3) The market is complete.
– (A4) No arbitrage opportunities are allowed, which implies for the risk-free interest factor, 1 + r, over one step of length T/N, that d < 1 + r < u, where u is the up factor and d is the down factor.
The European call option price at time zero has a well-known formula in this model:

EC(K, T) = \frac{1}{(1+r)^N} \sum_{j=0}^{N} \binom{N}{j} p_u^j p_d^{N-j} \left( S(0) u^j d^{N-j} - K \right)^+ ,

where K is the exercise (or strike) price, S(0) is the price of the underlying asset at the time the contract begins, pu (respectively, pd) is the risk-neutral probability that the price goes up (respectively down). It is known that pu and pd are solutions to the system:

\begin{cases} p_u + p_d = 1 \\ u p_u + d p_d = 1 + r . \end{cases}   (1)

The solution is given by:

p_u = \frac{(1+r) - d}{u - d} \qquad p_d = \frac{u - (1+r)}{u - d} .   (2)

The standard methodology (see Cox et al. [5]) leads to set:

u = e^{\sigma\sqrt{T/N}} , \qquad d = e^{-\sigma\sqrt{T/N}} ,

where σ is the volatility of the underlying asset. If there is some uncertainty about the value of the volatility, then it is also impossible to precisely estimate the up and down factors. Muzzioli and Torricelli [7, 8] suggest to model the up and down jump factors by triangular fuzzy numbers.
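For reference, a minimal sketch of the crisp Cox–Ross–Rubinstein valuation just described (our own illustrative code, not the authors'; the parameter values are hypothetical):

```python
import math

def crr_call(S0, K, r, sigma, T, N):
    """Crisp CRR price of a European call; r is the per-period risk-free rate
    over a step of length T/N, so discounting uses (1+r)**N."""
    u = math.exp(sigma * math.sqrt(T / N))
    d = math.exp(-sigma * math.sqrt(T / N))
    pu = ((1 + r) - d) / (u - d)          # risk-neutral probabilities (2)
    pd = 1 - pu
    payoff = 0.0
    for j in range(N + 1):
        payoff += (math.comb(N, j) * pu**j * pd**(N - j)
                   * max(S0 * u**j * d**(N - j) - K, 0.0))
    return payoff / (1 + r) ** N

print(crr_call(S0=100, K=100, r=0.01, sigma=0.2, T=1.0, N=4))
```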
2.2 Triangular Fuzzy Numbers

A triangular real fuzzy number (or quantity) is uniquely defined by three numbers (f1, f2, f3). Alternatively, one can write a triangular fuzzy number in terms of its α-cuts, f(α), α in [0, 1]:

f(\alpha) = [\underline{f}(\alpha), \overline{f}(\alpha)] = [f_1 + \alpha(f_2 - f_1),\, f_3 - \alpha(f_3 - f_2)] .

For simplicity of the notations the α-cuts will also be noted by [\underline{f}, \overline{f}]. Since the α-cuts of a triangular fuzzy number are compact intervals of the set of real numbers, the interval calculus of Moore [6] can be applied on them.

2.3 Fuzzy Binary Tree Model

Fuzzy versions of the linear systems should now be introduced. In this setting the up and down factors are represented by the triangular fuzzy numbers: u = (u1, u2, u3) and d = (d1, d2, d3). Assumptions (A1), (A2) and (A3) are still valid, while assumption (A4) changes as follows:

d_1 \le d_2 \le d_3 < 1 + r < u_1 \le u_2 \le u_3 .

Note that this condition guarantees that the resulting fuzzy matrix

\begin{pmatrix} 1 & 1 \\ d & u \end{pmatrix}

has always full rank for all d ∈ (d1, d2, d3) and for all u ∈ (u1, u2, u3). There is no fuzziness in the risk-free rate of interest, since it is given at time zero. A fuzzy version of the two equations of system (1) can be given (for each equation) in two different ways, since for an arbitrary fuzzy number f there exists no fuzzy number g such that f + g = 0 and for all non-crisp fuzzy numbers f + (−f) ≠ 0:

p_u + p_d = (1, 1, 1) \qquad\text{respectively}\qquad p_u = (1, 1, 1) - p_d ,

u p_u + d p_d = (1 + r, 1 + r, 1 + r) \qquad\text{respectively}\qquad u p_u = (1 + r, 1 + r, 1 + r) - d p_d ,
where pu and pd are the fuzzy up and down probabilities. Therefore, the linear system (1) can be rewritten in four different ways:

\begin{cases} p_u + p_d = 1 \\ u p_u + d p_d = 1 + r , \end{cases}   (3)

\begin{cases} p_u = 1 - p_d \\ u p_u + d p_d = 1 + r , \end{cases}   (4)

\begin{cases} p_u = 1 - p_d \\ d p_d = (1 + r) - u p_u , \end{cases}   (5)

\begin{cases} p_u + p_d = 1 \\ d p_d = (1 + r) - u p_u . \end{cases}   (6)
280
S. Muzzioli and H. Reynaerts
where Revenues are imposed equal to Total Costs and x is the quantity produced. This figure is important for anyone that manages a business since the BEP is the lower limit of profit. Because of the averaging of many different products into a single estimate of per-unit variable cost and per-unit revenue, or because of sales discounts and special offers, sometimes it is difficult to exactly compute the coefficients of the system. Example 2. The market price of a good and the quantity produced are determined by the equality between supply and demand. Demand is the amount of a good that consumers are willing and able to buy at a given price. Supply is the amount of a good producers that are willing and able to sell at a given price. For example: $ qs = a ∗ p + b qd = c ∗ p + d where qs is the quantity supplied, required to be equal to qd , the quantity requested, p is the price and a, b, c, and d are coefficients to be estimated, on which we may have some expert judgement. When the estimation of the system parameters is difficult, it is convenient to represent some of the system parameters by fuzzy numbers rather than by crisp numbers. 3.2 The Vector Solution of Buckley et al. We )n solution of Buckley et al.2 [2, 3]. Define aα = )n first present the vector a (α) and b = ij α i,j=1 i=1 bi (α), α ∈ [0, 1]. Each n × 1-vector v in a0 determines a crisp n × n-matrix Ac . The first row of Ac consists of the first n elements of v, the second row contains the following n elements of v, and so on. Assume that Ac is nonsingular for all vectors v ∈ a0 . Buckley et al. [2] proposed the following procedure to solve the equation: – Solve the linear system by using fuzzy number arithmetic (solution Xc ), – If no such solution exists use the vector solution XJ , with XJ (α) = {x | Aα x = bα , (Aα )ij ∈ aij (α), (bα )i ∈ bi (α)}, – If the vector solution is too difficult to investigate use XE or XI , which are both found by using Cramer’s rule to solve for each unknown. XE is investigated by using the extension principle, XI by fuzzifying ex post the crisp solution. The main drawback of such a choice is that the solution bounds do not have any crisp system that supports them. Buckley et al. show that Xc ≤ XJ ≤ XE ≤ XI . Example 3. Buckley and Qu [3] prove that the system $ (−4, −2, 0)x1 + (1, 2, 3)x2 = (−1, 0, 1) (−3, −2, −1)x1 + 0x2 = (−1, 0, 1)
(7)
has no solution in the set of fuzzy numbers if one solves it using fuzzy arithmetic.
Option Pricing in the Presence of Uncertainty
281
For this system one has: (Aα )1,1 = (−4 + 2α) − 4(α − 1)λ1 , λ1 ∈ [0, 1] (Aα )1,2 = (1 + α) − 2(α − 1)λ2 , λ2 ∈ [0, 1] (Aα )2,1 = (−3 + α) − 2(α − 1)λ3 , λ3 ∈ [0, 1] (Aα )2,2 = 0 (bα )1 = (−1 + α) − 2(α − 1)λ4 , λ4 ∈ [0, 1] (bα )2 = (−1 + α) − 2(α − 1)λ5 ,
λ5 ∈ [0, 1]
All matrices Ac , v ∈ a0 , are nonsingular: * * * * *−4 + 4λ1 1 + 2λ2 * * * det(Ac ) = * * *−3 + 2λ3 0 *
= −(−3 + 2λ3 )(1 + 2λ2 ) = 0,
∀λ1 , λ2 , λ3 ∈ [0, 1] ⎛ ⎞ x1 Thus the system (6) has a vector solution, with α-cuts XJ (α) = ⎝ ⎠, x2 (−1 + α) − 2(α − 1)λ5 (−3 + α) − 2(α − 1)λ3 x2 = t/n, t = ((−4 + 2α) − 4(α − 1)λ1 )((−1 + α) − 2(α − 1)λ5 ) −((−3 + α) − 2(α − 1)λ3 )((−1 + α) − 2(α − 1)λ4 ) x1 =
n = −((−3 + α) − 2(α − 1)λ3 )((1 + α) − 2(α − 1)λ2 ).
For α = 1 this set becomes a singleton, since then x1 = x2 = 0. For α = 0 one obtains: −1 + 2λ5 −3 + 2λ3 (−3 + 2λ3 )(−1 + 2λ4 ) − (−4 + 4λ1 )((−1 + 2λ5 ) . x2 = (−3 + 2λ3 )(1 + 2λ2 )
x1 =
By minimizing and maximizing those functions over λ1 , . . . , λ5 one obtains the marginals: [−1, 1] and [−5, 5]. 3.3 The Vector Solution to the Fuzzy Linear System A 1 x + b 1 = A2 x + b 2 Suppose that the equivalent fuzzy system A1 x + b1 = A2 x + b2 has no solutions in the set of fuzzy numbers when using fuzzy number arithmetic. In the following we give the conditions under which the system has a vector solution and we show that the linear systems Ax = b and A1 x+b1 = A2 x+b2 , with A = A1 − A2 and b = b2 − b1 , have the same vector solutions.
282
S. Muzzioli and H. Reynaerts
Theorem 1. The equivalent fuzzy system A1 x + b1 = A2 x + b2 has a vector solution XJ∗ , with α-cuts XJ∗ (α) = {x | A1α x + b1α = A2α x + b2α , (A1α )ij ∈ a1,ij (α), (A2α )ij ∈ a2,ij (α), (b1α )i ∈ b1,i (α), (b2α )i ∈ b2,i (α)}, if all matrices A1,0 − A2,0 = [a01,ij − a02,ij ], with a1,ij ∈ a1,ij (0) and a2,ij ∈ a2,ij (0), are non-singular. Proof. The statement follows by an argument analogous to the one of the proof given in Buckley and Qu [3] when defining their notations a(α) and b(α) (notations used in the proof of Buckley and Qu), respectively, as a∗α = b∗α =
n
(a1,ij (α) − a2,ij (α)),
i,j=1 n
i=1
(b2,i (α) − b1,i (α)),
α ∈ [0, 1].
⊓ ⊔
Theorem 2. The linear systems Ax = b and A1 x + b1 = A2 x + b2 , with A, A1 , A2 , b, b1 , b2 matrices with elements which are fuzzy numbers and A = A1 − A2 and b = b2 − b1 , have the same vector solutions XJ . Proof. It is evident that XJ (α) ⊆ XJ∗ (α), since clearly aα ⊆ a∗α and bα ⊆ b∗α . Consider fuzzy numbers a, b, c, such that a = b−c, then x ∈ b(α) and y ∈ c(α) ⊔ implies that x − y ∈ a(α). Therefore XJ (α) ⊇ XJ∗ (α). ⊓ Example 4. Consider the following fuzzy linear system: $ (−2, −1, 0)x1 + (0, 1, 2)x2 = (0, 1, 2)x1 − x2 + (−1, 0, 1) (0, 1, 1)x1 + 2x2 = (2, 3, 3)x1 + 2x2 + (−1, 0, 1)
(8)
which has no solution in the set of fuzzy numbers if one solves it using fuzzy arithmetic. Remark that it is equivalent to system (6) since ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ (−4, −2, 0) (1, 2, 3) (−2, −1, 0) (0, 1, 2) (0, 1, 2) −1 ⎣ ⎦=⎣ ⎦−⎣ ⎦. (−3, −2, −1) 0 (0, 1, 1) 2 (2, 3, 3) 2 For this system one has:
(A1α )1,1 = (−2 + α) − 2(α − 1)λ1 , λ1 ∈ [0, 1] (A1α )1,2 = α − 2(α − 1)λ2 , λ2 ∈ [0, 1] (A1α )2,1 = α − (α − 1)λ3 , λ3 ∈ [0, 1] (A1α )2,2 = 2
Option Pricing in the Presence of Uncertainty
283
(A2α )1,1 = α − 2(α − 1)λ4 , λ4 ∈ [0, 1] (A2α )1,2 = −1 (A2α )2,1 = (2 + α) − (α − 1)λ5 , λ5 ∈ [0, 1] (A2α )2,2 = 2 (bα )1 = (−1 + α) − 2(α − 1)λ6 , (bα )2 = (−1 + α) − 2(α − 1)λ7 ,
λ6 ∈ [0, 1] λ7 ∈ [0, 1]
All matrices A1,0 − A2,0 , are non-singular since: * * * * *−2 + 2λ1 − 2λ4 1 + 2λ2 * * * det(A1,0 − A2,0 ) = * * * −2 + λ3 − λ5 0 *
= −(−2 + λ3 − λ5 )(1 + 2λ2 ) = 0,
∀λ1 , λ2 , λ3 , λ4 , λ5 ∈ [0, 1] ⎛ ⎞ x1 Thus the system (7) has a vector solution with α-cuts XJ (α) = ⎝ ⎠, x2 x1 =
−(α + 1 − 2(α − 1)λ2 )(α − 1 − 2(α − 1)λ7 ) (2 + (α − 1)(λ3 − λ5 ))(α + 1 − 2(α − 1)λ2 )
x2 = (α−1)((−2 − 2(α−1)(λ1 − λ4 ))(1 − 2λ7 )+(2 + (α − 1)(λ3 − λ5 )(1 − 2λ6 )) . (2 + (α − 1)(λ3 − λ5 ))(α + 1 − 2(α − 1)λ2 ) For α = 1 this set becomes a singleton, since then x1 = x2 = 0. For α = 0 one obtains: 1 − 2λ7 2 − (λ3 − λ5 ) −((−2 + 2(λ1 − λ4 ))(1 − 2λ7 ) + (2 − (λ3 − λ5 ))(1 − 2λ6 )) . x2 = (2 − (λ3 − λ5 ))(1 + 2λ2 )
x1 =
By minimizing and maximizing those functions for λ1 , . . . , λ7 one obtains the marginals: [−1, 1] and [−5, 5]. The vector solution is the same as the vector solution in Sect. 3.2.
3.4 An Algorithm to Find the Solution to the Fuzzy System Ax = b The following simple algorithm finds directly the marginals for each unknown. First of all, the remarks that for any crisp system Ac x = b, with A a nonsingular n × n-matrix, each unknown xj , j = 1 . . . n is either an increasing or a decreasing function of each aij and of each bi . In order to find the bounds of the solution interval for each α-cut of the unknown xj each bound of the
284
S. Muzzioli and H. Reynaerts
α-cuts aij (α) and bi (α) should be used. Therefore one has to solve 2n(n+1) systems. Each element of the extended coefficient matrix of those systems is either the lower or the upper bound of the α-cut of the corresponding element of the original fuzzy extended coefficient matrix. Consider for example the following linear fuzzy system with two unknowns x1 and x2 : $ a1,1 x1 + a1,2 x2 = b1 a2,1 x1 + a2,2 x2 = b2 . In this case one should solve 26 linear systems, e.g. one of the 26 systems is: $ a1,1 x1 + a1,2 x2 = b1 a2,1 x1 + a2,2 x2 = b2 . The final solution is investigated by taking the minimum and the maximum of the solutions found in each system for each unknown. This procedure ensures that all possible solutions, consistent with the parameters of the system, are taken. A simplification of the previous method is to find the solutions for α = 1 and α = 0 and impose ex post a triangular form on the solution. In order to find xj (1), for all j, one just solves the crisp system, substituting α = 1 in the fuzzy system. In order to find [xj (0), xj (0)], for all j, one applies the algorithm for α = 0. Then one takes as solution the triangular fuzzy numbers (xj (0), xj (1), xj (0)). Example 5. If one applies the simplified version of the algorithm proposed in this section to the system (1), one should solve, in order to find [xj (0), xj (0)], for j = 1, 2, the linear systems $ a1,1 x1 + a1,2 x2 = b1 (9) a2,1 x1 + a2,2 x2 = b2 with coefficients and solutions in Table 1. The marginals of the final solution are obtained by taking the minimum and maximum for each unknown: $ x1 (0) = 1 x1 (0) = −1 x2 (0) = 5 x2 (0) = −5 Remark that, as one could expect, the same marginals are found as in Sect. 2.1. If one puts α = 1 the following system should be solved: $ −2x1 (1) + 2x2 (1) = 0 −3x1 (1) + 0x2 (1) = 0. The solution of this system is: x1 (1) = 0 and x2 (1) = 0. Clearly those values belong to the respective intervals [−1, 1] and [−5, 5] and thus the triangular fuzzy numbers (−1, 0, 1) and (−5, 0, 5) are the solutions to the system. For more details we refer the reader to Muzzioli and Reynaerts [10].
Option Pricing in the Presence of Uncertainty Table 1. Coefficients and solutions for Systems (2)
a1,1
a1,2
a2,1
a2,2
b1
b2
x1
x2
−4
1
−3
0
−1
−1
1/3
1/3
1
−1
−1/3
−7/3
−1/3
−1/3
−1
−5
−1
−3
−1 1
−4
1
−1
0
3
−3
−1
1
−1
−1
0
3
−1
1
−1
1
−3
1
−1
1
−1
1
−1
3
−3
1
−1
3
−1
1
−1
1
−1
−1
0
1
−1
1 0
1
−1
−1
0
1
−1
1 0
1
−1
−1
0
1
−1
1 0
1
−1
−1
0
1
−1
1 0
1
−1
−1
0
1
−1
1 −4
1
−1
1 −4
1
1 1
−1
−1
1
−1
−1 1
1
1
1/3 1
1
7/3
3
5
1/3
1/9
−1/3
−7/9
−1/3
−1/9
−1
−5/3
−1
1
1/3 1
1
1/3 −1/3
1/3
−1/3 1
−1 1
−1
1/3
−1/3
1/3
−1/3 1
−1 1
−1
7/9
1
5/3
−1
−1
1 1
−1
−1
1
1 −1/3
−1/3 1/3
1/3 −1/3
−1/3 1/3
1/3
4 Fuzzy Up and Down Probabilities 4.1 Solution Using Fuzzy Arithmetic In this section we apply the methodology introduced in Sect. 3 to the solution of the fuzzy linear system that involves the risk neutral probabilities derivation. The first step when solving fuzzy linear systems, is to use fuzzy arithmetic. Therefore the systems are expressed in α-cuts. This method is applied for each system: – System (2) [pu , pu ] + [pd , pd ] = [1, 1] [u, u][pu , pu ] + [d, d][pd , pd ] = [1 + r, 1 + r]. By applying the arithmetic of fuzzy numbers and keeping in mind that the coefficients as well as the unknowns are positive, one gets: ⎧ pu + pd = 1 ⎪ ⎪ ⎪ ⎨ pu + pd = 1 ⎪ upu + dpd = 1 + r ⎪ ⎪ ⎩ upu + dpd = 1 + r
which leads to the following solution: (1 + r) − d u−d u − (1 + r) pd = u−d
pu =
(1 + r) − d u−d u − (1 + r) pd = . u−d
pu =
Clearly, pu < pu . The conclusion is that the system has no solution in the set of the fuzzy numbers if one applies fuzzy arithmetic. Note that if one poses pu 1 = pu and pu 1 = pu , one obtains the solution given in Muzzioli and Torricelli [7, 8], i.e. (1 + r) − d (1 + r) − d ] , u−d u−d u − (1 + r) u − (1 + r) , ]. p1d = [pd 1 , pd 1 ] = [ u−d u−d
p1u = [pu 1 , pu 1 ] = [
– System (3) [pd , pd ] = [1, 1] − [pu , pu ]
[u, u][pu , pu ] + [d, d][pd , pd ] = [1 + r, 1 + r].
By applying the arithmetic of fuzzy numbers and keeping in mind that the coefficients as well as the unknowns are positive, one gets: ⎧ pd = 1 − pu ⎪ ⎪ ⎪ ⎨p = 1 − p d u ⎪ upu + dpd = 1 + r ⎪ ⎪ ⎩ upu + dpd = 1 + r which leads to the solution proposed by Reynaerts and Vanmaele [12]: (1 + r)(u + d) − d(d + u) uu − dd (1 + r)(u + d) − d(d + u) pu = uu − dd pu =
p d = 1 − pu p d = 1 − pu ,
which is different from what was found for System (3). One can show that pu < pu (and thus pd < pd ). The conclusion is that the system has no solution in the set of the fuzzy numbers if one applies fuzzy arithmetic. – System (4) [pd , pd ] = [1, 1] − [pu , pu ]
[d, d][pd , pd ] = [1 + r, 1 + r] − [u, u][pu , pu ]. By applying the arithmetic of fuzzy numbers and keeping in mind that the coefficients as well as the unknowns are positive, one gets: ⎧ p d = 1 − pu ⎪ ⎪ ⎪ ⎨p = 1 − p d u ⎪ dpd = (1 + r) − upu ⎪ ⎪ ⎩ dpd = (1 + r) − upu which leads to the solution:
(1 + r) − d u−d (1 + r) − d pu = u−d pu =
p d = 1 − pu pd = 1 − pu .
Remark that, if one considers the solutions for α = 0 and α = 1, one can conclude that pu (1) = pu (1) < pu (0). Thus the system has no solution in the set of the fuzzy numbers if one applies fuzzy arithmetic.
– System (5) [pd , pd ] + [pu , pu ] = [1, 1] [d, d][pd , pd ] = [1 + r, 1 + r] − [u, u][pu , pu ]. By applying the arithmetic of fuzzy numbers and keeping in mind that the coefficients as well as the unknowns are positive, one gets: ⎧ p d = 1 − pu ⎪ ⎪ ⎪ ⎨p = 1 − p d u ⎪ = (1 + r) − upu dp d ⎪ ⎪ ⎩ dpd = (1 + r) − upu which leads to the solution:
(1 + r)(u + d) − d(d + u) uu − dd (1 + r)(u + d) − d(u + d) pu = uu − dd pu =
p d = 1 − pu pd = 1 − pu .
Depending on the values of u and d, pu < pu or pd < pd . Thus the system has no solution in the set of the fuzzy numbers if one applies fuzzy arithmetic. More details on the properties of the solutions to the Systems (2–5) can be found in Muzzioli [9]. 4.2 Vector Solution Since none of the four equivalent fuzzy linear systems has a solution if one applies fuzzy arithmetic one should investigate the vector solution XJ . This solution is obtained as follows: (Aα )1,1 = (Aα )1,2 = (bα )1 = 1 (bα )2 = 1 + r (Aα )2,1 = u1 + α(u2 − u1 ) + λ1 (u3 − u1 − α(u3 − u1 )), λ1 ∈ [0, 1] (Aα )2,2 = d1 + α(d2 − d1 ) + λ2 (d3 − d1 − α(d3 − d1 )), λ2 ∈ [0, 1]. The vector solution is: (1 + r) − (d1 + α(d2 − d1 ) + λ2 (d3 − d1 )(1 − α)) f u1 + α(u2 − u1 ) + λ1 (u3 − u1 )(1 − α) − (1 + r)) , pd (α, λ1 , λ2 ) = f λ1 , λ2 ∈ [0, 1],
pu (α, λ1 , λ2 ) =
f = u1 − d1 + α((u2 − u1 ) − (d2 − d1 ))+(1 − α)(λ1 (u3 − u1 )−λ2 (d3 − d1 )).
The marginals are obtained by solving the non-linear optimisation problem: pu (α) = minλ1 ,λ2 (pu (α, λ1 , λ2 )) pd (α) = minλ1 ,λ2 (pd (α, λ1 , λ2 ))
pu (α) = maxλ1 ,λ2 (pu (α, λ1 , λ2 )) pd (α) = maxλ1 ,λ2 (pd (α, λ1 , λ2 ))
namely: pu (α) = 1 + r − d1 − α(d2 − d1 ) 1 + r − d3 + α(d3 − d2 ) , ] u3 − α(u3 − u2 ) − d3 + α(d3 − d2 ) u1 + α(u2 − u1 ) − d1 − α(d2 − d1 ) pd (α) = u3 − α(u3 − u2 ) − (1 + r) u1 + α(u2 − u1 ) − (1 + r) , ]. [ u1 + α(u2 − u1 ) − d1 − α(d2 − d1 ) u3 − α(u3 − u2 ) − d3 + α(d3 − d2 ) [
Thus for α = 1 one gets: pu (1) =
(1 + r) − d2 u 2 − d2
pd (1) =
u2 − (1 + r) u 2 − d2
and for α = 0: (1 + r) − d1 − λ2 (d3 − d1 ) , u1 − d1 + λ1 (u3 − u1 ) − λ2 (d3 − d1 ) (1 + r) − d3 (1 + r) − d1 , ] =[ u 3 − d3 u 1 − d1 u1 + λ1 (u3 − u1 ) − (1 + r) , pd (0) = u1 − d1 + λ1 (u3 − u1 ) − λ2 (d3 − d1 ) u1 − (1 + r) u3 − (1 + r) , ] =[ u 1 − d1 u 3 − d3
pu (0) =
λ1 , λ2 ∈ [0, 1]
λ1 , λ2 ∈ [0, 1]
Note that this solution is the same solution found in Muzzioli and Torricelli [7, 8]. 4.3 Solution Using the Algorithm If one applies this algorithm to the financial example, one should solve the following systems: $ $ p u + pd = 1 pu + pd = 1 upu + dpd = 1 + r. upu + dpd = 1 + r. $
pu + pd = 1 upu + dpd = 1 + r.
$ p u + pd = 1 upu + dpd = 1 + r.
The solutions to those systems are, respectively, $ $ (1+r)−d (1+r)−d pu = u−d pu = u−d pd =
u−(1+r) u−d .
⎧ ⎨pu =
(1+r)−d u−d u−(1+r) . u−d
⎩pd =
pd =
⎧ ⎨p = u ⎩pd =
u−(1+r) u−d . (1+r)−d u−d u−(1+r) . u−d
The final solution is obtained by taking the minimum and maximum for each unknown: ⎧ ⎪ pu = (1+r)−d ⎪ u−d ⎪ ⎪ ⎨p = (1+r)−d u
⎪ pd = ⎪ ⎪ ⎪ ⎩ pd =
u−d u−((1+r)) u−d u−(1+r) . u−d
Since pu (1) and pd (1) belong to the respective solution intervals, the vector of fuzzy numbers, ⎞ ⎛ (1+r)−d , ] [ (1+r)−d u−d u−d ⎠, ⎝ u−(1+r) ] [ u−d , u−(1+r) u−d
is a solution to the system. Remark that one obtains the same solution as the vector solution in Sect. 4.2. More details can be found in Muzzioli and Reynaerts [11].
5 The Pricing of an Option in an Uncertain World In this section the risk neutral probabilities obtained in the previous section are used in order to obtain the price of a European option. We introduce the issue by a simple example in a one period model, then we present the more general case of a multiperiod one. 5.1 The Pricing of an Option in a One Period Model The price of the underlying asset at the end of one period is given either by S(0)u or S(0)d. Since u and d are triangular fuzzy numbers, it follows that the price of the underlying asset is represented by a triangular fuzzy number in each state: S(0)u = (S(0)u1 , S(0)u2 , S(0)u3 ), respectively, S(0)d = (S(0)d1 , S(0)d2 , S(0)d3 ). We assume that the strike price is between the highest value of the price of the underlying asset in state down and the lowest value of that price in state up: S(0)d3 ≤ K ≤ S(0)u1 .
(10)
Option Pricing in the Presence of Uncertainty
291
The case in which the assumption is not verified is discussed in the following section. We denote the call payoff in state “up” with C(u) and in state “down” with C(d). In the case we consider, it follows that C(d) = 0 and C(u) = S(0) − K = (S(0)u1 − K, S(0)u2 − K, S(0)u3 − K). We now determine the price EC(K, T ) by means of the risk neutral valuation approach, i.e. by computing the expected value of the option payoff under the risk neutral probabilities and discounting it at the risk-free rate: 1 (pd C(d) + pu C(u)) 1+r 1 pu C(u) = 1+r
EC(K, T ) =
Substituting the risk neutral probabilities in this expression we obtain the fuzzy option price: EC(K, T )(α) = (1 + r) − d3 + α(d3 − d2 ) 1 (S(0)u1 − K + αS(0)(u2 − u1 )) , [ 1+r u3 − d3 − α(u3 − u2 − d3 + d2 ) (1 + r) − d1 + α(d2 − d1 ) 1 (S(0)u3 − K − αS(0)(u3 − u2 )) ] 1+r u1 − d1 − α(u2 − u1 − d2 + d1 ) It is easy to prove that as α increases the above interval shrinks. It is the largest for α = 0: EC(K, T )(0) =
1 (1 + r) − d1 (1 + r) − d3 [(S(0)u1 − K) , (S(0u3 ) − K) ]. 1+r u 3 − d3 u 1 − d1
If α = 1, the interval collapses into one single value, namely: EC(K, T )(1) =
S(0)u2 − K (1 + r) − d2 1+r u 2 − d2
(11)
More details can be found in Muzzioli and Torricelli [8]. 5.2 The Multiperiod Case The fuzzy version of the standard Cox–Ross–Rubinstein binomial model is obtained by using the standard rules of addition and multiplication between fuzzy numbers, i.e. EC(K, T )(α) = N N j N −j pu pd max(S(0)uj dN −j − K, 0), (1 + r)−N [ j j=0
N N j=0
j
pju pdN −j max(S(0)uj d
N −j
− K, 0)]
292
S. Muzzioli and H. Reynaerts
In order to work with fuzzy numbers and speed up the computation, we use the following simplifications. First, we approximately compute the underlying asset price at each final node. Second, we calculate the call payoff, that, depending on the strike price value, may need to be approximated by a triangular fuzzy number. Finally we multiply the call payoff by the risk neutral probabilities and discount at the risk-free rate. The price of the underlying asset at each node j = 0, . . . , N of time T is equal to: S(0)uj dN −j . Due to the rules of multiplication among triangular fuzzy numbers, the resulting numbers are not anymore triangular. To simplify the algebra, we can approximate them, by imposing a triangular form as follows: S(0)uj dN −j = (S(0)uj1 d1N −j , S(0)uj2 d2N −j , S(0)uj3 d3N −j ) with j = 0 . . . N . The call option payoff C(j) at node j of the maturity date T is given by: C(j) = max((S(0)uj1 d1N −j − K, S(0)uj2 d2N −j − K, S(0)uj3 d3N −j − K), 0). We proceed as follows for each node j, j = 0 . . . N . If the following condition is satisfied, N −j−1 ≤ K ≤ S(0)dj1 u1N −j S(0)dj+1 3 u3
then C(j) is either equal to zero, or equal to the triangular fuzzy number: −j − K) C(j) = (S(0)uj1 d1N −j − K, S(0)uj2 d2N −j − K, S(0)uj3 dN 3
If the condition is not satisfied, then we can approximate C(j) with a triangular fuzzy number as follows: – if S(0)dj1 u1N −j ≤ K ≤ S(0)dj2 u2N −j , then −j C(j) = (0, S(0)uj2 dN − K, S(0)uj3 d3N −j − K) 2
– if S(0)dj2 u2N −j ≤ K ≤ S(0)dj3 u3N −j , then −j C(j) = (0, 0, S(0)uj3 dN − K). 3
Finally, in order to get the call price, we use the standard rules of addition and multiplication between fuzzy numbers to get: ⎡ ⎤ N N N N C(j)pju pdN −j , C(j)pju pdN −j ⎦ EC(K, T )(α) = (1 + r)−N ⎣ j j j=0 j=0
6 Sensitivity Analysis of the Option Price As we already mentioned, uncertainty in the volatility implies uncertainty in the up (and down) factor. This feature can be taken into account either by
Option Pricing in the Presence of Uncertainty
293
considering a confidence interval for u or by describing u by a fuzzy number. Under the assumptions of Sect. 2.3, the α-cuts of the fuzzy up factor or the confidence interval are subsets of ](1+r)T /N , +∞[. We now consider the special case where d = 1/u. If we invoke (2), the risk-neutral probability, pu , that the price goes up, is (1 + r)T /N u − 1 . pu = u2 − 1 We study the behaviour of the price of a European call option for all possible values of the up factor. In Sects. 6.1–6.3, we also need to include the border case where the up factor equals (1 + r)T /N . Therefore we define the up factor as uλ : λ ∈ R+ . uλ = (1 + r)T /N + λ, Again by invoking (2), the risk-neutral probability, pλ , that the price goes up, is (1 + r)T /N uλ − 1 . pλ = u2λ − 1 The price Cλ (K) of the option is: Cλ (K) = =
1 E[(SλT − K)+ ] (1 + r)T N 1 (S0 uλ2j−N − K)+ · P [XλN = j] (1 + r)T j=0
N 1 N j 2j−N pλ (1 − pλ )N −j = (S0 uλ − K) j (1 + r)T ∗
(12)
j=jλ
where XλN is the number of up jumps in the lifetime T and S0 uλ2j−N − K is positive for j ≥ jλ∗ . Consider a confidence interval, [u0 , u1 ] ⊂](1 + r)T /N , +∞[, of the up factor with u0 = (1 + r)T /N + λ0 u1 = (1 + r)T /N + λ1 . If uµ ∈ [u0 , u1 ], µ ∈ [0, 1] then uµ = µu0 + (1 − µ)u1
= µ((1 + r)T /N + λ0 ) + (1 − µ)((1 + r)T /N + λ1 ) = (1 + r)T /N + [µλ0 + (1 − µ)λ1 ] = (1 + r)T /N + λ∗ (µ).
294
S. Muzzioli and H. Reynaerts
The price of the option belongs to the interval [ min Cλ∗ (µ) (K), max Cλ∗ (µ) (K)]. µ∈[0,1]
µ∈[0,1]
Suppose the imprecise volatility is described by using a fuzzy quantity, (u1 , u2 , u3 ), u1 , u2 , u3 ∈](1 + r)T /N , +∞[, with u1 = (1 + r)T /N + λ1 u2 = (1 + r)T /N + λ2 u3 = (1 + r)T /N + λ3 , for the up factor. An α-cut, α ∈ [0, 1], is the interval: [u1 + (u2 − u1 )α, u3 + (u2 − u3 )α] = [(1 + r)T /N + λ1 + α(λ2 − λ1 ), (1 + r)T /N + λ3 + α(λ2 − λ3 )]. An element of this interval can be described by µ[(1 + r)T /N + (λ1 + α(λ2 − λ1 ))] + (1 − µ)[(1 + r)T /N + λ3 + α(λ2 − λ3 )] = (1 + r)T /N + µ(λ1 + α(λ2 − λ1 )) + (1 − µ)(λ3 + α(λ2 − λ3 )) = (1 + r)T /N + λ∗α (µ), µ ∈ [0, 1]. The α-cut, α ∈ [0, 1], of the option price is: [ min Cλ∗α (µ) (K), max Cλ∗α (µ) (K)]. µ∈[0,1]
µ∈[0,1]
(13)
It is clear that, for the method with confidence intervals as well as for the method using fuzzy quantities, the behaviour of Cλ (K) as function of uλ should be studied. This is the subject of the following sections. 6.1 Definitions, Notations and Lemmas The function is broken up in its basic elements: first the (up and down) probabilities are considered, then their products and finally their products with the up and down factors. The risk-neutral probability, pλ , is a decreasing function of uλ . For uλ = (1 + r)T /N this probability is one and lim
uλ →+∞
pλ = 0.
And one obtains p∗ = 0.5 for uλ = u∗ = (1 + r)T /N +
+
(1 + r)2T /N − 1.
The probability 1 − pλ is an increasing function of uλ . The function pλ (1 − pλ ) has a maximum for uλ = u∗ . It is zero for uλ = (1 + r)T /N and in the limit for uλ → +∞.
Option Pricing in the Presence of Uncertainty
295
The function uλ pλ attains a minimum for uλ = u∗ . It is equal to (1+r)T /N for uλ = (1 + r)T /N and in the limit for uλ → +∞. ∗ The function u−1 λ (1 − pλ ) attains a maximum for uλ = u . It is zero for T /N and in the limit for uλ → +∞. uλ = (1 + r) One can prove that, (uλ pλ )′ =
1 − 2pλ ′ = −(u−1 λ (1 − pλ )) . u2λ − 1
6.2 Functional Behaviour of the Functions C1 (λ, j) and C2 (λ, j, K) In the next section we will examine the functional behaviour of each term in the sum (6). Those terms consist of two parts, C1 (λ, j) =
N j namely N 2j−N N j N −j S0 uλ and C2 (λ, j, K) = −K j pλ (1 − pλ ) −j . Those j pλ (1 − pλ ) functions are first examined separately, regardless the sign of their sum. The derivative of the function C1 (λ, j) with respect to uλ , uλ ∈ [(1 + r)T /N , +∞], is: (14) (C1 (λ, j))′
S0 Nj 1 − 2pλ N −j−1 (uλ pλ)j−1 (u−1 (j(1 − pλ + u2λ pλ ) − N u2λ pλ ) 2 = λ (1 − pλ )) uλ uλ − 1 which implies that: – If j ≤ N/2 then j(1 − pλ + u2λ pλ ) − N u2λ pλ < 0 and the function C1 (λ, j) attains a maximum for uλ = u∗ . It is zero for uλ = (1 + r)T /N and in the limit for uλ → +∞. – If N/2 < j < N then the expression j(1 − pλ + u2λ pλ ) − N u2λ pλ and the – is negative for all uλ if moreover (1 + r)T /N ≤ √ N 2
function C1 (λ, j) attains a maximum for uλ = u∗ . , j = floor(N/2 + 1) – if (1 + r)T /N > √ N 2
(N −j)j
(N −j)j
∗
Nu (a) the expression is negative for j > 2(1+r) T /N and the function ∗ C1 (λ, j) attains a maximum for uλ = u N u∗ (b) the expression has two roots for j ≤ 2(1+r) T /N Those roots are: , N + N 2 − 4(N − j)j(1 + r)2T /N u1 (j) = 2(N − j)(1 + r)T /N , N − N 2 − 4(N − j)j(1 + r)2T /N u2 (j) = 2(N − j)(1 + r)T /N
with u1 (j) ≥ u∗ ≥ u2 (j) ≥ (1 + r)T /N and if u1 (j) = u2 (j) then u1 (j) = u2 (j) = u∗ . The function C1 (λ, j) attains a maximum for uλ = u2 (j) and for uλ = u1 (j). It attains a minimum for uλ = u∗ .
296
S. Muzzioli and H. Reynaerts
– The function C1 (λ, j) is zero for uλ = (1 + r)T /N and in the limit for uλ → +∞. – if j = N then the function equals S0 (uλ pλ )N and it attains a minimum for uλ = u∗ . The function C1 (λ, j) is equal to S0 (1 + r)T for uλ = (1 + r)T /N and in the limit for uλ → +∞.
The derivative of the function C2 (λ, j, K), 0 < j < N , with respect to uλ is: N (j − N pλ )pλj−1 (1 − pλ ))N −j−1 (pλ )′ (15) (C2 (λ, j, K))′ = −K j The factor (j − N pλ ) has two roots for all j: u∗1 (j) u∗2 (j)
= =
N (1 + r)T /N + N (1 + r)T /N −
,
,
N 2 (1 + r)2T /N − 4j(N − j) 2j N 2 (1 + r)2T /N − 4j(N − j) 2j
but u∗2 (j) < (1 + r)T /N . The function attains a minimum for uλ = u∗1 (j). If j ≤ N/2 then u∗1 (j) > ∗ u and if j ≥ N/2 then u∗1 (j) < u∗ . The function is zero for uλ = (1 + r)T /N and in the limit for uλ → +∞. The function is decreasing for j = 0. It is zero for uλ = (1 + r)T /N and equal to −K in the limit for uλ → +∞. The function C2 (λ, j, K) increases for j = N . It is equal to −K for uλ = (1 + r)T /N and zero in the limit for uλ → +∞. 6.3 The Branches of the Binary Tree Considered Separately Each term in the sum corresponds to a branch in the binary tree. Such a term depends on j, j = 0, . . . , N , and K, and is function of λ: N j 2j−N pλ (1 − pλ )N −j . − K) Cλ (j, K) = (S0 u j The functional behaviour of Cλ (j, K) is examined regardless of its sign. Noting that Cλ (j, K) = C1 (λ, j) + C2 (λ, j, K) the derivative of Cλ (j, K) with respect to uλ can be calculated by invoking (14) and (15): S0
N 1 − 2pλ N −j−1 −1 (uλ pλ)j−1 (u−1 uλ (j(1 − pλ + u2λ pλ ) − N u2λ pλ ) 2 λ (1 − pλ )) j uλ − 1 −K
N j−1 (j − N pλ )pλ (1 − pλ ))N −j−1 (pλ )′ . j
Option Pricing in the Presence of Uncertainty
297
In those intervals where both derivatives (14) and (15) have the same sign or for those values of uλ where one of the derivatives is zero, one can immediately conclude from Sect. 6.2 if the term is decreasing or increasing. On the other hand we can draw conclusions about the functional behaviour of the term by remarking that we studied the functional behaviour of
S0 uλ2j−N multiplied by Nj pjλ (1 − pλj−N ) and that the term can be calculated in two steps: first subtract K from S0 uλ2j−N and then multiply the result by
N j j−N ). j pλ (1 − pλ This leads to the following conclusions: j=0 – C0 (0, K) = 0 and Cλ (0, 0) > 0, lim Cλ (0, K) = −K. uλ →+∞
– If K ≥ S0 (1 + r)−T then Cλ (0, K) is negative for all uλ . 1 – If 0 < K < S0 (1 + r)−T then Cλ (0, K) has a root, u∗ (0) = ( Sk0 ) N . The ∗ function is negative for all uλ > u (0), it attains a maximum in the interval ](1+r)T /N , u∗ [. The root and the maximum decrease as K increases.
0<j
0, lim Cλ (j, K) = 0 uλ →+∞
(2j−N )T
– If K ≥ S0 (1 + r) N then Cλ (j, K) is negative for all uλ . (2j−N )T 1 – If K < S0 (1 + r) N then Cλ (j, K) has a root, u∗ (j) = ( SK0 ) 2j−N . The ∗ function is negative for all uλ > u (j), it attains a maximum in the interval ](1 + r)T /N , u∗ [. The root and the maximum decrease as K increases. – Since the function converges to zero it attains a minimum in the interval ]u∗ (j), +∞[. j = N/2 , j is odd
Since Cλ (j, K) = (S0 − K) Nj (pλ (1 − pλ ))N/2 , – C0 (j, K) = 0 and
lim
uλ →+∞
Cλ (j, K) = 0.
– The function is positive for all uλ if S0 > K and negative for all uλ if S0 < K. – The function attains a maximum for uλ = u∗ if S0 > K and a minimum for uλ = u∗ if S0 < K.
298
S. Muzzioli and H. Reynaerts
N <j √ 2
N (N −j)j
Cλ (j, K) = 0
N 2
and
<j≤
N u∗ 2(1+r)T /N
then (2j−N )T the function is positive for all uλ and attains – If K ≤ S0 (1 + r) N a maximum, larger then u∗ . The maximum increases as K increases. (2j−N )T 1 the function has a root, u∗ (j) = ( SK0 ) 2j−N . The – If K > S0 (1+r) N function is positive for all uλ > u∗ (j). It attains a maximum, larger then u∗ . The maximum and the root increase as K increases. It attains a minimum, smaller then u∗1 (j), between (1+r)T /N and the root. N (N −j)j
– (1 + r)T /N > √ 2
and
N u∗ 2(1+r)T /N
<j u∗ (j). The root decreases when K decreases. The function increases. 2T – If S0 (1 + r)T − N ≤ K < S0 (1 + r)T then the function is positive and increasing for all uλ . 2T – If 0 ≤ K < S0 (1 + r)T − N then the function is positive for all uλ and attains a minimum in ](1 + r)T /N , u∗ [. The minimum decreases as K increases. 6.4 Procedure for the Pricing of the European Call Option – An Example Suppose that K is such that Cλ (j, K) is positive for all j, then the price EC(K, T ) reads
Option Pricing in the Presence of Uncertainty
299
N 1 N j 2j−N pλ (1 − pλ )N −j Cλ (K) = (S0 uλ − K) j (1 + r)T j=0 =
N S0 N −j (uλ pλ )j (u−1 λ (1 − pλ )) (1 + r)T j=0
−
N K pj (1 − pλ )N −j (1 + r)T j=0 λ
S0 K N (uλ pλ + u−1 λ (1 − pλ )) − T (1 + r) (1 + r)T K . = S0 − (1 + r)T
=
This case is only possible if K < S0 , since otherwise the terms for j < N/2 are not in the sum. If this condition is fulfilled for K, then all terms for j ≥ N/2 are in the sum. Therefore we concentrate on the terms with j < N/2. The 1 expression Cλ (j, K) is positive for all uλ < ( SK0 ) N −2j . 1 The smallest root is ( SK0 ) N . This root is larger then (1 + r)T /N if 0 < K ≤ S0 (1 + r)T /N . If, in this case, S0 1 )N K then all terms are in the sum and Cλ (K) is constant for those values of uλ , namely Cλ (K) = S0 − K(1 + r)−T . If uλ increases: S0 1 S0 1 ( ) N ≤ uλ < ( ) N −2 K K then Cλ (0, K) < 0 and the corresponding term is not in the sum. Thus Cλ (K) = S0 − K(1 + r)−T − Cλ (0, K)(1 + r)−T . Since Cλ (0, K) is nega1 1 tive and decreasing for ( SK0 ) N ≤ uλ < ( SK0 ) N −2 , Cλ (K) increases for those values of uλ . This procedure can be extended for all values of uλ and K. Finally, we illustrate the procedure by an example in the case the imprecise volatility is described by a fuzzy quantity. Let 0 < K ≤ S0 (1 + r)T /N and (1 + r)T /N < uλ < (
u1 ∈ ](1 + r)T /N , (
S0 1 )N [ K
S0 1 )N K S0 1 S0 1 u3 ∈ ]( ) N , ( ) N −2 [. K K then, by applying (13), the α-cuts of the option price are u2 = (
[S0 −
Cλ∗α (1) (0, K) K K , S0 − − ]. T T (1 + r) (1 + r) (1 + r)T
300
S. Muzzioli and H. Reynaerts
7 Conclusions In this chapter we investigated the derivation of the European option price in the Cox–Ross–Rubinstein [5] binomial model in the presence of uncertainty on the volatility of the underlying asset. We proposed two different approaches to the issue that concentrate on the fuzzification of one or both the two jump factors. The first approach is derived by assuming that both the jump factors are represented by triangular fuzzy numbers. We started from the derivation of the risk neutral probabilities, a problem that boils down to the solution of a linear system of equations with fuzzy coefficients that has no solution using standard fuzzy arithmetics. We recalled the vector solution proposed by Buckley et al. [2, 3] that applies to that kind of systems and we gave the conditions under which the solution exists and is unique for a broader class of equivalent fuzzy linear systems. As a last step, we applied the risk neutral probabilities to the valuation of the option. In the second part of the chapter we presented a second approach to the option pricing problem, that is derived under the assumption that only the up jump factor is uncertain. We analysed the derivative of the option price in the Cox–Ross–Rubinstein [5] binomial model with respect to the up jump factor. This method of investigation is very general since it is consistent both with the assumption of an interval value for the up factor, as well as with the assumption of a fuzzy value for the up factor. Differently from the continuous time model in which the option price is an increasing function of the volatility, in this discrete time binomial model the call option is a non-decreasing function of the up jump factor and in turn of the volatility.
References 1. Black, F., Scholes, M., 1973. The Pricing of Options and Corporate Liabilities, Journal of Political Economy 2. Buckley, J.J., Eslami, E., Feuring, T., 2002. Fuzzy mathematics in economics and engineering, Studies in Fuzziness and Soft Computing, Physica Verlag 3. Buckley, J.J., Qu, Y., 1991. Solving systems of linear fuzzy equations, Fuzzy Sets and Systems 43, 33–43 4. Ming Ma, Friedman, M., Kandel, A., 2000. Duality in Fuzzy linear systems. Fuzzy sets and systems, 109, 55–58 5. Cox, J., Ross, S., Rubinstein, S., 1979. Option pricing, a simplified approach, Journal of Financial Economics 7, 229–263 6. Moore, R.E., 1966. Methods and applications of interval analysis, SIAM Studies in Applied Mathematics 7. Muzzioli, S., Torricelli, C., 2001. A multiperiod binomial model for pricing options in a vague world, Proceedings of the Second International Symposium on Imprecise Probabilities and Their Applications, 255–264 8. Muzzioli, S., Torricelli, C., 2004. A Multiperiod Binomial Model for Pricing Options in a Vague World, Journal of Economic Dynamics and Control, 28, 861–887
Option Pricing in the Presence of Uncertainty
301
9. Muzzioli, S., 2003. A note on fuzzy linear systems, Discussion Paper no. 447, Economics Department, University of Modena and Reggio Emilia 10. Muzzioli, S., Reynaerts, H., 2006a. Fuzzy linear systems of the form A1 x + b1 = A2 x + b2 , Fuzzy Sets and Systems, 157(7), 939–951 11. Muzzioli, S., Reynaerts, H., 2006b. The solution of fuzzy linear systems by nonlinear programming: a financial application, European Journal of Operational Research, 177(2), 1218–1231 12. Reynaerts, H., Vanmaele, M., 2003. A sensitivity analysis for the pricing of European call options in a binary tree model, Proceedings of the third International Symposium on Imprecise Probabilities and Their Applications, 467–481
Nonstochastic Model-Based Finance Engineering Toshihiro Kaino and Kaoru Hirota
Summary. Most of the models in the field of finance engineering are proposed based on the stochastic theory, e.g., the well-known option pricing model proposed by Black and Scholes is premised on following log-normal distribution by the underlying price. In the former stochastic theory, it is also a fact that the prediction sometime does not hit in the actual problem because it assumes a known probability distribution. Then, we propose research and development of the new corporate evaluation model and option pricing model based on fuzzy measures, which deal with the ambiguous subjectivity evaluation of man in the real world well. Especially, this system will support venture, small and medium companies.
1 Introduction

Most of the models in the field of finance engineering are based on stochastic theory; e.g., the well-known option pricing model proposed by Black and Scholes [1] in 1973 is premised on the assumption that the underlying price follows a log-normal distribution. Many researchers have pointed out that this assumption is not always valid for real-world financial problems. Although various kinds of improvements have been made, there still exists an application limit with respect to the statistical distribution and the additivity of the probability measure; e.g., in the evaluation of venture, small and medium-sized companies, the underlying asset is a company or an enterprise, and the distribution of the value of the underlying assets is not a probability distribution. To overcome this gap, a new corporate evaluation model and option pricing model that is better able to deal with ambiguous and discrete data is proposed based on the Choquet [2] integral. First, the differentiation of the Choquet integral of a nonnegative measurable function with respect to a fuzzy measure on a fuzzy measure space is proposed and applied to the capital investment problem [5]. Then the differentiation of the Choquet integral of a nonnegative measurable function is extended to the differentiation of the Choquet integral of a
measurable function. The Choquet integral is applied to the long-term debt ratings model, where the input is qualitative and quantitative data of the corporations, and the output is the Moody's long-term debt ratings. The fuzzy measure, which represents the importance of each qualitative and quantitative item, is derived by a neural network method. Moreover, the differentiation of the Choquet integral is applied to the long-term debt ratings [7], where this differentiation indicates how much the evaluation of each item influences the rating of the corporation. Secondly, in applications using the Choquet integral limited by domain, the problem arises of how to evaluate the in-between intervals, each of which is characterized by a fuzzy measure. So, a composite fuzzy measure built up from fuzzy measures defined on fuzzy measurable spaces is proposed using composite fuzzy weights, where the measurable space of this composite fuzzy measure is the direct sum of measurable spaces [9]. Thirdly, two differentiations of a real interval domain limited Choquet integral for a nonnegative measurable function are proposed, where the domain is a subset of the real numbers. One is the fuzzy measure shift differentiation of the Choquet integral, which is applied to financial option trading [6]. The other is the differentiation of the Choquet integral limited by domain, which is applied to the automobile factory capital investment decision making problem [8].
2 Differentiation of the Choquet Integral of a Nonnegative Measurable Function

2.1 Interval Limited Choquet Integral

Definition 1. Let (X, F, µ) be a fuzzy measure space and let f be a nonnegative measurable function. The Choquet integral of the function f with respect to the fuzzy measure µ is expressed as:

(c)∫_X f dµ ≜ ∫_0^{+∞} µ({x | f(x) ≥ r}) dr.   (1)
Accordingly, the Choquet integral of a nonnegative simple function is given by the following proposition.

Proposition 1. Let (X, F, µ) be a fuzzy measure space and let f be a nonnegative simple function

f(x) = Σ_{i=1}^{n} r_i χ_{D_i}(x),   (2)

where 0 < r_1 < r_2 < ... < r_n and the D_i are pairwise disjoint measurable sets. Writing A_i = D_i ∪ D_{i+1} ∪ ... ∪ D_n and r_0 = 0, the Choquet integral of f is

(c)∫_X f dµ = Σ_{i=1}^{n} (r_i − r_{i−1}) µ(A_i).   (3)

Definition 2. Let (X, F, µ) be a fuzzy measure space and let f be a nonnegative measurable function. If ∀r* > 0, then f ∧ r* is also measurable. The [0, r*] limited Choquet integral of a nonnegative measurable function f with respect to a fuzzy measure µ is defined as:

F(r*) ≜ (c)∫_X (f ∧ r*) dµ = ∫_0^{+∞} µ({x | f(x) ∧ r* ≥ r}) dr = ∫_0^{r*} µ({x | f(x) ≥ r}) dr.   (4)

Let 0 < ∀a ≤ ∀c ≤ ∀b (a, b, c ∈ R). Then the [a, b] limited Choquet integral of a nonnegative measurable function f with respect to the fuzzy measure µ is defined as:

∫_a^b µ({x | f(x) ≥ r}) dr ≜ F(b) − F(a).   (5)

Now, the following property holds true:

∫_a^b µ({x | f(x) ≥ r}) dr = ∫_a^c µ({x | f(x) ≥ r}) dr + ∫_c^b µ({x | f(x) ≥ r}) dr.   (6)
2.2 Differentiation of the Choquet Integral of a Nonnegative Measurable Function

Differentiation of the Choquet integral based on a nonnegative measurable function is defined as follows:

Definition 3. Let (X, F, µ) be a fuzzy measure space. For any nonnegative measurable function f, if

D⁺F(r) = F'₊(r) ≜ lim_{∆r→+0} [F(r + ∆r) − F(r)] / ∆r   (7)

exists, this limit is called an upper differential coefficient of the [0, r] limited Choquet integral F of f with respect to µ at r. Similarly, if

D⁻F(r) = F'₋(r) ≜ lim_{∆r→−0} [F(r + ∆r) − F(r)] / ∆r   (8)

exists, this limit is called a lower differential coefficient of the [0, r] limited Choquet integral F of f with respect to µ at r. If and only if both the upper differential coefficient and the lower differential coefficient of F at r exist and are equal, they are denoted by:

DF(r) = F'(r) = dF(r)/dr,   (9)

and this is called a differential coefficient of the [0, r] limited Choquet integral F of f with respect to µ at r. If dF(r)/dr exists on a certain interval of r, then F(r) is said to be differentiable on this interval and the derived function dF(r)/dr is defined. Moreover, if dF(r)/dr exists, then

dF(r)/dr = µ({x | f(x) ≥ r}).
2.3 Application to the Capital Investment Problem

The differentiation of the [0, r] limited Choquet integral F of f with respect to µ at r is applied to a computer center investment decision. The evaluated specifications are the following items: (1) aseismatic structure, (2) power distribution and air conditioning equipment, (3) internal and external appearance, and (4) available space. Assume that a designer explains the following specifications:

1. Aseismatic structure. This building is 1.2 times stronger than an ordinary building.
2. Power distribution and air conditioning equipment. A complete double bus bar system, two diesel generation systems, and UPS are equipped, but the air-handling units have only one cooling coil.
3. Internal and external appearance. The internal appearance is pleasant. The external appearance is a brave show.
4. Available space. 65% of the total floor area is available as computer room and office area. The remaining area is used as lobby, stairs, rest rooms, etc. The available area is not so functional.

Now, let the set of specifications be X = {Aseismatic structure, Power distribution and air conditioning equipment, Internal and external appearance, Available space}. When decision makers evaluate each specification out of 100 points, the mean evaluations are given as f(Aseismatic structure) = 60 points, f(Power distribution and air conditioning equipment) = 20 points, f(Internal and external appearance) = 65 points, f(Available space) = 40 points. Hence, suppose that x1 = "Power distribution and air conditioning equipment," x2 = "Available space," x3 = "Aseismatic structure," x4 = "Internal and external appearance," where f(x1) ≤ f(x2) ≤ f(x3) ≤ f(x4). And suppose that some specialists determine the importance of each specification for the total evaluation of the computer center by using a λ-fuzzy measure (λ = −0.65) as follows, where the results are rounded to three decimal places:
µ({x1}) = 0.3, µ({x2}) = 0.5, µ({x3}) = 0.4, µ({x4}) = 0.2,
µ({x1, x2}) = 0.7, µ({x1, x3}) = 0.62, µ({x1, x4}) = 0.46,
µ({x2, x3}) = 0.77, µ({x2, x4}) = 0.64, µ({x3, x4}) = 0.55,
µ({x1, x2, x3}) = 0.92, µ({x2, x3, x4}) = 0.87, µ({x1, x2, x4}) = 0.81,
µ({x1, x3, x4}) = 0.74, µ(X) = 1.   (10)
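The values in (10) are consistent with the singleton importances and λ = −0.65 via the usual λ-fuzzy measure recursion µ(A ∪ B) = µ(A) + µ(B) + λ µ(A) µ(B). The short check below is illustrative only (the code is ours); after rounding it reproduces (10), with µ(X) ≈ 1.000.

```python
from itertools import combinations

densities = {'x1': 0.3, 'x2': 0.5, 'x3': 0.4, 'x4': 0.2}
lam = -0.65

def lambda_measure(subset):
    """Sugeno lambda-fuzzy measure built from the singleton densities:
    mu(A + {x}) = mu(A) + g(x) + lam * mu(A) * g(x) (order-independent)."""
    mu = 0.0
    for x in subset:
        mu = mu + densities[x] + lam * mu * densities[x]
    return mu

for k in range(1, 5):
    for subset in combinations(sorted(densities), k):
        print(set(subset), round(lambda_measure(subset), 3))
```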
Now, the total evaluation F(∞) of this design is derived, using the Choquet integral of the simple function, as follows:

F(∞) = (C)∫_X f dµ ≜ Σ_{i=1}^{n} (r_i − r_{i−1}) µ(A_i)
     = r_1 µ(A_1) + (r_2 − r_1) µ(A_2) + (r_3 − r_2) µ(A_3) + (r_4 − r_3) µ(A_4)
     = 20 × 1 + 0.87 × (40 − 20) + 0.55 × (60 − 40) + 0.2 × (65 − 60) = 49.4.   (11)

Thus, the upper differential coefficient and the lower differential coefficient at each r_i are calculated as follows:

D⁺F(r_1) = µ(A_2) = µ({x2, x3, x4}) = 0.87,  D⁻F(r_1) = µ(A_1) = µ(X) = 1,
D⁺F(r_2) = µ(A_3) = µ({x3, x4}) = 0.55,  D⁻F(r_2) = µ(A_2) = µ({x2, x3, x4}) = 0.87,
D⁺F(r_3) = µ(A_4) = µ({x4}) = 0.2,  D⁻F(r_3) = µ(A_3) = µ({x3, x4}) = 0.55,
D⁺F(r_4) = µ(∅) = 0,  D⁻F(r_4) = µ(A_4) = µ({x4}) = 0.2.   (12)

From (11), the following formula can be derived:

F(∞) = {D⁻F(r_1) − D⁺F(r_1)} r_1 + {D⁻F(r_2) − D⁺F(r_2)} r_2 + {D⁻F(r_3) − D⁺F(r_3)} r_3 + {D⁻F(r_4) − D⁺F(r_4)} r_4.   (13)
Suppose that a decision maker determines which specification should be improved so that the total evaluation of the computer center is maximized. If the decision maker decides based on the importance µ of each specification, i.e., µ(Available space) > µ(Aseismatic structure) > µ(Power distribution and air conditioning equipment) > µ(Internal and external appearance), then the best solution is to improve the available space. Hence, consider how much a tiny change of a specification will influence
the total evaluation of the computer center. In formula (13), it is noticed that the coefficient of each r_i is "the lower differential coefficient (D⁻F(r_i)) minus the upper differential coefficient (D⁺F(r_i))." So this coefficient can be regarded as the change of the total evaluation of the computer center as the corresponding specification changes slightly. Hence, let V(x_i) be the change of the total evaluation of the computer center as the evaluation of a specification changes slightly:

V(x_i) ≜ D⁻F(r_i) − D⁺F(r_i).   (14)

Now, each V(x_i) is calculated as follows:

V(x_1) = D⁻F(r_1) − D⁺F(r_1) = 1 − 0.87 = 0.13,
V(x_2) = D⁻F(r_2) − D⁺F(r_2) = 0.87 − 0.55 = 0.32,
V(x_3) = D⁻F(r_3) − D⁺F(r_3) = 0.55 − 0.2 = 0.35,
V(x_4) = D⁻F(r_4) − D⁺F(r_4) = 0.2 − 0 = 0.2.   (15)

For instance, V(x_1) = 0.13 means that the total evaluation of the computer center will increase by 0.13 points as the evaluation of the power distribution and air conditioning equipment increases from 20 points to 21 points, with the evaluations of the other specifications fixed. As the result, V(Aseismatic structure) > V(Available space) > V(Internal and external appearance) > V(Power distribution and air conditioning equipment) is derived. There is always a limit to the budget and schedule of capital investments. It will be best to improve "Aseismatic structure" among all specifications so that the total evaluation of the computer center is maximized.
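The numbers in (11)–(15) can be reproduced with a few lines of code (an illustrative sketch reusing the measure values of (10); the variable names are ours):

```python
r = {'x1': 20.0, 'x2': 40.0, 'x3': 60.0, 'x4': 65.0}      # evaluations, ascending
mu = {frozenset({'x1', 'x2', 'x3', 'x4'}): 1.0,           # mu(A_1) = mu(X)
      frozenset({'x2', 'x3', 'x4'}): 0.87,                # mu(A_2)
      frozenset({'x3', 'x4'}): 0.55,                      # mu(A_3)
      frozenset({'x4'}): 0.2,                             # mu(A_4)
      frozenset(): 0.0}

order = ['x1', 'x2', 'x3', 'x4']
A = [frozenset(order[i:]) for i in range(4)] + [frozenset()]

# Total evaluation (11): F(inf) = sum_i (r_i - r_{i-1}) * mu(A_i)
total, prev = 0.0, 0.0
for i, x in enumerate(order):
    total += (r[x] - prev) * mu[A[i]]
    prev = r[x]
print(total)                                               # 49.4

# Sensitivities (14)-(15): V(x_i) = D-F(r_i) - D+F(r_i) = mu(A_i) - mu(A_{i+1})
for i, x in enumerate(order):
    print(x, round(mu[A[i]] - mu[A[i + 1]], 2))            # 0.13, 0.32, 0.35, 0.2
```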
3 Differentiation of the Choquet Integral of a Measurable Function

3.1 Interval Limited Schmeidler and Šipoš Choquet Integral

The Schmeidler [3] Choquet integral limited by an interval is defined as:

Definition 4. Let (X, F, µ) be a fuzzy measure space and let f be a measurable function. If ∀r* > 0, then f ∧ r* is also measurable. The [0, r*] limited Schmeidler Choquet integral of a measurable function f with respect to a fuzzy measure µ is defined as:

F⁺(r*) ≜ (C)∫_X (f ∧ r*) dµ = ∫_0^{+∞} µ({x | f(x) ∧ r* ≥ r}) dr = ∫_0^{r*} µ({x | f(x) ≥ r}) dr.   (16)

And, if ∀r* < 0, then f ∨ r* is also measurable. The [r*, 0] limited Schmeidler Choquet integral of a measurable function f with respect to a fuzzy measure µ is defined as (µ(X) < ∞):

F⁻(r*) ≜ (C)∫_X (f ∨ r*) dµ = ∫_{−∞}^{0} [µ({x | f(x) ∨ r* ≥ r}) − µ(X)] dr = ∫_{r*}^{0} [µ({x | f(x) ≥ r}) − µ(X)] dr.   (17)

The Schmeidler Choquet integral is expressed by (16) and (17) as:

(C)∫_X f dµ ≜ F⁺(+∞) + F⁻(−∞).   (18)

Let ∀a ≤ ∀c ≤ ∀b (a, b, c ∈ R). Then the [a, b] limited Schmeidler Choquet integral of a measurable function f with respect to a fuzzy measure µ is defined as:

F([a, b]) ≜ F⁻(a) − F⁻(b)   (∀a ≤ ∀b ≤ 0),
F([a, b]) ≜ F⁻(a) + F⁺(b)   (∀a ≤ 0 ≤ ∀b),
F([a, b]) ≜ F⁺(b) − F⁺(a)   (0 ≤ ∀a ≤ ∀b).   (19)

Now, the following property holds true:

F([a, b]) = F([a, c]) + F([c, b]).   (20)

F(r) is expressed by F⁻(r) and F⁺(r) as

F(r) = F⁻(r) (r < 0),  F(r) = 0 (r = 0),  F(r) = F⁺(r) (r > 0).   (21)
Definition 5. Let (X, F, µ) be a fuzzy measure space and let f be a measurable function. If ∀r* > 0, then f ∧ r* is also measurable. The [0, r*] limited Šipoš [4] Choquet integral of a measurable function f with respect to a fuzzy measure µ is defined as:

F_S⁺(r*) ≜ (S)∫_X (f ∧ r*) dµ = ∫_0^{+∞} µ({x | f(x) ∧ r* ≥ r}) dr = ∫_0^{r*} µ({x | f(x) ≥ r}) dr.   (22)

Similarly, if ∀r* < 0, then f ∨ r* is also measurable. The [r*, 0] limited Šipoš Choquet integral of a measurable function f with respect to a fuzzy measure µ is defined as (µ(X) < ∞):

F_S⁻(r*) ≜ (S)∫_X (f ∨ r*) dµ = −∫_{−∞}^{0} µ({x | f(x) ∨ r* ≤ r}) dr = −∫_{r*}^{0} µ({x | f(x) ≤ r}) dr.   (23)

The Šipoš Choquet integral is expressed by (22) and (23) as:

(S)∫_X f dµ ≜ F_S⁺(+∞) + F_S⁻(−∞).   (24)

Let ∀a ≤ ∀c ≤ ∀b (a, b, c ∈ R). Then the [a, b] limited Šipoš Choquet integral of a measurable function f with respect to a fuzzy measure µ is defined as:

F_S([a, b]) ≜ F_S⁻(a) − F_S⁻(b)   (∀a ≤ ∀b ≤ 0),
F_S([a, b]) ≜ F_S⁻(a) + F_S⁺(b)   (∀a ≤ 0 ≤ ∀b),
F_S([a, b]) ≜ F_S⁺(b) − F_S⁺(a)   (0 ≤ ∀a ≤ ∀b).   (25)

Now, the following property holds true:

F_S([a, b]) = F_S([a, c]) + F_S([c, b]).   (26)

F_S(r) is expressed by F_S⁻(r) and F_S⁺(r) as

F_S(r) = F_S⁻(r) (r < 0),  F_S(r) = 0 (r = 0),  F_S(r) = F_S⁺(r) (r > 0).   (27)
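The difference between the two integrals shows up as soon as f takes negative values. The sketch below (illustrative; the function names and the toy data are ours) computes the Schmeidler integral via (18) and the Šipoš integral via (24) for a function on a finite set:

```python
def choquet_nonneg(f, mu):
    """Choquet integral of a nonnegative function on a finite set (eq. (1))."""
    items = sorted(f.items(), key=lambda kv: kv[1])
    total, prev = 0.0, 0.0
    for i, (_, r) in enumerate(items):
        total += (r - prev) * mu(frozenset(x for x, _ in items[i:]))
        prev = r
    return total

def schmeidler_integral(f, mu, X):
    """Asymmetric (Schmeidler) integral, eq. (18): F+(+inf) + F-(-inf)."""
    pos = choquet_nonneg({x: max(v, 0.0) for x, v in f.items()}, mu)
    neg, prev = 0.0, 0.0
    for r in sorted((v for v in f.values() if v < 0.0), reverse=True):
        level = frozenset(x for x, v in f.items() if v > r)      # {f >= r'} for r' in (r, prev]
        neg += (mu(level) - mu(frozenset(X))) * (prev - r)
        prev = r
    return pos + neg

def sipos_integral(f, mu):
    """Symmetric (Sipos) integral, eq. (24): Choquet(f+) - Choquet(f-)."""
    return (choquet_nonneg({x: max(v, 0.0) for x, v in f.items()}, mu)
            - choquet_nonneg({x: max(-v, 0.0) for x, v in f.items()}, mu))

X = {'a', 'b'}
mu = {frozenset(): 0.0, frozenset({'a'}): 0.3, frozenset({'b'}): 0.6, frozenset(X): 1.0}
f = {'a': -1.0, 'b': 2.0}
print(schmeidler_integral(f, mu.get, X), sipos_integral(f, mu.get))   # 0.8 0.9
```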
3.2 Differentiation of the Schmeidler and Šipoš Choquet Integral

Differentiation of the Schmeidler and Šipoš Choquet integral based on a measurable function is defined as:

Definition 6. Let (X, F, µ) be a fuzzy measure space with µ(X) < ∞. For any measurable function f, if

D⁺F(r) = F'₊(r) ≜ lim_{∆r→+0} [F(r + ∆r) − F(r)] / ∆r   (28)

exists, this limit is called an upper differential coefficient of the [0, r] limited Schmeidler Choquet integral F of f with respect to µ at r. Similarly, if

D⁻F(r) = F'₋(r) ≜ lim_{∆r→−0} [F(r + ∆r) − F(r)] / ∆r   (29)

exists, this limit is called a lower differential coefficient of the [0, r] limited Schmeidler Choquet integral F of f with respect to µ at r. Similarly, if

D⁺F_S(r) = F_S'₊(r) ≜ lim_{∆r→+0} [F_S(r + ∆r) − F_S(r)] / ∆r   (30)

exists, this limit is called an upper differential coefficient of the [0, r] limited Šipoš Choquet integral F_S of f with respect to µ at r. Similarly, if

D⁻F_S(r) = F_S'₋(r) ≜ lim_{∆r→−0} [F_S(r + ∆r) − F_S(r)] / ∆r   (31)

exists, this limit is called a lower differential coefficient of the [0, r] limited Šipoš Choquet integral F_S of f with respect to µ at r. If and only if both the upper differential coefficient and the lower differential coefficient of F and F_S at r exist and are equal, they are denoted by:

DF(r) = F'(r) = dF(r)/dr,  DF_S(r) = F_S'(r) = dF_S(r)/dr,   (32)
and these are called differential coefficients of the [0, r] limited Schmeidler Choquet integral F and the Šipoš Choquet integral F_S of f with respect to µ at r.

3.3 Application to the Long-Term Debt Ratings

Firstly, the long-term debt ratings model is identified by the real interval limited Choquet integral. Here, the input of this model is the quantitative and qualitative indices of each corporation and the output of this model is Moody's long-term debt rating of each corporation. The importance of each index, the fuzzy measure µ, is determined by the neural network method, where the open financial statements [13] and the analysis data [14] are used as input data, and the Moody's debt ratings [10] are used as output data. Generally, the long-term debt ratings of each rating institution are determined by the analysts' experience and know-how. It is therefore very difficult for a rated corporation to find out clearly how to raise its rating. So, after the identification of the long-term debt ratings model using the real interval limited Choquet integral, an advisory system
to raise the rating using differentiation of the Choquet integral is proposed (Fig. 1).

Fig. 1. Debt ratings analysis model

Now, the following ten quantitative indices and four qualitative indices, which are often used in practice, are selected as the input data [11, 12]. The total evaluation of each corporation is given by the importance µ of each index, which is determined by the neural network method using the qualitative and quantitative indices, as in (33) (Fig. 1):

F(∞) = (C)∫_X f dµ = Σ_{i=1}^{14} {D⁻F(r_i) − D⁺F(r_i)} r_i.   (33)
Now, in (33), it is noticed that the coefficient of each r_i is "the lower differential coefficient (D⁻F(r_i)) minus the upper differential coefficient (D⁺F(r_i))." So this coefficient can be regarded as the change of the total evaluation of a corporation as the evaluation of the corresponding index changes slightly. Hence, let V(x_i) be the change of the total evaluation of the corporation as the evaluation of index x_i changes slightly:

V(x_i) ≜ D⁻F(r_i) − D⁺F(r_i).   (34)
In the case of SONY, if the importance of each index is applied to decide which index values should be improved in order to raise the total evaluation (rating), then the following results are obtained:
µ(share) > µ(turnover of receivables) > µ(interest coverage) > µ(business profit) > µ(leverage) > µ(cash flow) > µ(current ratio) > µ(ROA) > µ(management) > µ(inventory turnover) > µ(regulation) > µ(quick ratio) > µ(ROE) > µ(organization).

If the V function is applied to decide it, then the following results will be given:

V(organization) > V(management) > V(inventory turnover) > V(cash flow) > V(current ratio) > V(ROE) > V(business profit) > V(ROA) > V(turnover of receivables) > V(regulation) > V(share) > V(interest coverage) > V(quick ratio) > V(leverage).
It will be best to improve the evaluation of "organization" and "management" so that the total evaluation of the corporation is maximized.

4 Composite Fuzzy Measure Built up from Fuzzy Measures

In applications using a fuzzy measure on the real line, the problem arises of how to evaluate the inside of and the in-between intervals characterized by fuzzy measures. Then, a composite fuzzy measure built up from two fuzzy measures using composite fuzzy weights is proposed. Firstly, a composite measure is shown, where each interval is evaluated by a fuzzy measure. Before formally proposing a composite fuzzy measure, it is shown in Proposition 1 that a composite measurable space formed as the direct sum of two measurable spaces is a measurable space.

Proposition 1. (X1 ⊕ X2, F1 ⊕ F2) is a measurable space, where

F1 ⊕ F2 = {E1 ⊕ E2 | Ei ∈ Fi (i = 1, 2)},   (35)

and ⊕ indicates a direct sum.
Proof. It will be enough to show that F1 ⊕ F2 is a σ-field on X1 ⊕ X2.

(a) ∅ = ∅ ⊕ ∅ ∈ F1 ⊕ F2.   (36)

(b) ∀E ∈ F1 ⊕ F2, ∃Ei ∈ Fi (i = 1, 2) s.t. E = E1 ⊕ E2, and

E^c = E1^c ⊕ E2^c ∈ F1 ⊕ F2.   (37)

(c) ∀n ∈ N, ∀En ∈ F1 ⊕ F2, ∃Eni ∈ Fi (i = 1, 2) s.t. En = En1 ⊕ En2, and

∪_{n∈N} En = ∪_{n∈N} (En1 ⊕ En2) = (∪_{n∈N} En1) ⊕ (∪_{n∈N} En2) ∈ F1 ⊕ F2.   (38)

In the same way, it is proved that a composite measurable space formed as the direct sum of several measurable spaces is a measurable space.

Proposition 2. (⊕_{i=1}^{n} Xi, ⊕_{i=1}^{n} Fi) is a measurable space, where

⊕_{i=1}^{n} Fi = {⊕_{i=1}^{n} Ei | Ei ∈ Fi (i = 1, 2, ..., n)}   (39)

(⊕ denotes a direct sum).

Proof. It is proved in the same way as Proposition 1.

Now, a composite fuzzy measure built up from two fuzzy measures on the direct sum of two fuzzy measure spaces is given in Theorem 1.

Theorem 1. Let (Xi, Fi, µi) (i = 1, 2) be fuzzy measure spaces. Then (X1 ⊕ X2, F1 ⊕ F2, µ_{1⊕2}) is a fuzzy measure space, where ∀E = E1 ⊕ E2 ∈ F1 ⊕ F2 and ∀λ1, λ2 ∈ [0, 1],

µ_{1⊕2}(E1 ⊕ E2) ≜ λ1 µ1(E1) + λ2 µ2(E2) + (1 − λ1 − λ2) µ1(E1) µ2(E2).   (40)

Proof.
(a) µ_{1⊕2}(X1 ⊕ X2) = λ1 µ1(X1) + λ2 µ2(X2) + (1 − λ1 − λ2) µ1(X1) µ2(X2) = λ1 + λ2 + (1 − λ1 − λ2) = 1.   (41)
(b) Monotonicity. ∀E, ∀F ∈ F1 ⊕ F2 with E ⊂ F, ∃Ei, Fi ∈ Fi (i = 1, 2) s.t. E = E1 ⊕ E2, F = F1 ⊕ F2, E1 ⊂ F1, E2 ⊂ F2. Then

µ_{1⊕2}(F) − µ_{1⊕2}(E) = µ_{1⊕2}(F1 ⊕ F2) − µ_{1⊕2}(E1 ⊕ E2)
= λ1 µ1(F1) + λ2 µ2(F2) + (1 − λ1 − λ2) µ1(F1) µ2(F2) − [λ1 µ1(E1) + λ2 µ2(E2) + (1 − λ1 − λ2) µ1(E1) µ2(E2)]
= (λ1 + (1 − λ1 − λ2) µ2(E2)) (µ1(F1) − µ1(E1)) + (λ2 + (1 − λ1 − λ2) µ1(F1)) (µ2(F2) − µ2(E2))
= ((1 − µ2(E2)) λ1 + (1 − λ2) µ2(E2)) (µ1(F1) − µ1(E1)) + ((1 − µ1(F1)) λ2 + (1 − λ1) µ1(F1)) (µ2(F2) − µ2(E2))
≥ 0   (∵ 0 ≤ λi ≤ 1 (i = 1, 2)).   (42)
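Theorem 1 is easy to check numerically. The sketch below (illustrative; the two toy measures and the weights are ours) builds µ_{1⊕2} by (40) and verifies the normalization (41) and the monotonicity (42):

```python
from itertools import product

# Two toy fuzzy measures on X1 = {a, b} and X2 = {c, d} (values are illustrative).
mu1 = {frozenset(): 0.0, frozenset({'a'}): 0.4, frozenset({'b'}): 0.3, frozenset({'a', 'b'}): 1.0}
mu2 = {frozenset(): 0.0, frozenset({'c'}): 0.6, frozenset({'d'}): 0.5, frozenset({'c', 'd'}): 1.0}
lam1, lam2 = 0.3, 0.5

def mu_composite(E1, E2):
    """Composite fuzzy measure of Theorem 1, eq. (40)."""
    m1, m2 = mu1[frozenset(E1)], mu2[frozenset(E2)]
    return lam1 * m1 + lam2 * m2 + (1.0 - lam1 - lam2) * m1 * m2

print(mu_composite({'a', 'b'}, {'c', 'd'}))   # 1.0: normalization (41)

# Monotonicity (42): enlarging E1 or E2 never decreases the composite measure.
subsets1 = [frozenset(s) for s in ([], ['a'], ['b'], ['a', 'b'])]
subsets2 = [frozenset(s) for s in ([], ['c'], ['d'], ['c', 'd'])]
ok = all(mu_composite(E1, E2) <= mu_composite(F1, F2)
         for E1, F1 in product(subsets1, repeat=2) if E1 <= F1
         for E2, F2 in product(subsets2, repeat=2) if E2 <= F2)
print(ok)  # True
```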
Moreover, a composite measure built up from two fuzzy measures is recursively extended to a composite measure built up from a finite number of fuzzy measures.

Theorem 2. Let (Xi, Fi, µi) (i = 1, 2, ..., n) be fuzzy measure spaces. Then ((...((X1 ⊕ X2) ⊕ X3)...) ⊕ Xn, (...((F1 ⊕ F2) ⊕ F3)...) ⊕ Fn, µ_{(...((1⊕2)⊕3)...)⊕n}) is a fuzzy measure space, where

∀(...((E1 ⊕ E2) ⊕ E3)...) ⊕ En ∈ (...((F1 ⊕ F2) ⊕ F3)...) ⊕ Fn,
∀λ_{1⊕2}, λ_{(1⊕2)⊕3}, λ_{(...((1⊕2)⊕3)...)⊕i} (i = 4, 5, ..., n) ∈ (0, 1),
∀λ1, λ2, ..., λn ∈ [0, 1],

µ_{(...((1⊕2)⊕3)...)⊕n}((...((E1 ⊕ E2) ⊕ E3)...) ⊕ En)
≜ λ_{(...((1⊕2)⊕3)...)⊕(n−1)} µ_{(...((1⊕2)⊕3)...)⊕(n−1)}((...((E1 ⊕ E2) ⊕ E3)...) ⊕ E_{n−1}) + λn µn(En)
+ (1 − λ_{(...((1⊕2)⊕3)...)⊕(n−1)} − λn) µ_{(...((1⊕2)⊕3)...)⊕(n−1)}((...((E1 ⊕ E2) ⊕ E3)...) ⊕ E_{n−1}) µn(En).   (43)

(Q.E.D.)

And the associative composite fuzzy measure built up from three fuzzy measures is introduced.
Theorem 3. Let (Xi, Fi, µi) (i = 1, 2, 3) be fuzzy measure spaces. Let (X1 ⊕ X2, F1 ⊕ F2, µ_{1⊕2}) be a fuzzy measure space as shown in Theorem 1, and let ((X1 ⊕ X2) ⊕ X3, (F1 ⊕ F2) ⊕ F3, µ_{(1⊕2)⊕3}) be a fuzzy measure space, where ∀E = (E1 ⊕ E2) ⊕ E3 ∈ (F1 ⊕ F2) ⊕ F3, ∀λ_{1⊕2}, λ1, λ2, λ3 ∈ (0, 1),

λ1 = λ'1 / λ_{1⊕2},  λ2 = λ'2 / λ_{1⊕2},  λ3 = λ'3,  A = (1 − λ1 − λ2) / (λ1 λ2 λ_{1⊕2}),

µ_{(1⊕2)⊕3}((E1 ⊕ E2) ⊕ E3) = λ'1 µ1(E1) + λ'2 µ2(E2) + λ'3 µ3(E3)
+ A(λ'1 λ'2 µ1(E1) µ2(E2) + λ'2 λ'3 µ2(E2) µ3(E3) + λ'1 λ'3 µ1(E1) µ3(E3))
+ A² λ'1 λ'2 λ'3 µ1(E1) µ2(E2) µ3(E3).   (44)

Then this construction process does not depend on the order, i.e.,

µ_{(1⊕2)⊕3}((E1 ⊕ E2) ⊕ E3) = µ_{1⊕2⊕3}(E1 ⊕ E2 ⊕ E3).   (45)

Here, put x = λ_{1⊕2}; then

λ1 = λ'1 / x,  λ2 = λ'2 / x,  λ3 = λ'3.   (46)

Since A = B, where B = (1 − λ_{1⊕2} − λ3) / (λ_{1⊕2} λ3),

λ'3 x² − (λ'1 λ'3 + λ'2 λ'3 − λ'1 λ'2) x + λ'1 λ'2 λ'3 − λ'1 λ'2 = 0.   (47)

x = [λ'1 λ'3 + λ'2 λ'3 − λ'1 λ'2 ± √D] / (2 λ'3),
where D = (λ'1 λ'3 + λ'2 λ'3 − λ'1 λ'2)² − 4 λ'3 (λ'1 λ'2 λ'3 − λ'1 λ'2).   (48)
So λ_{1⊕2}, λ1 and λ2 are easily calculated from (48). Here, Theorem 3 is rearranged into Theorem 4 so that it is easier to use.

Theorem 4. Let (Xi, Fi, µi) (i = 1, 2, 3) be fuzzy measure spaces. Then (X1 ⊕ X2 ⊕ X3, F1 ⊕ F2 ⊕ F3, µ_{1⊕2⊕3}) is a fuzzy measure space, where

µ_{1⊕2⊕3}(E1 ⊕ E2 ⊕ E3)
= λ1 µ1(E1) + λ2 µ2(E2) + λ3 µ3(E3)
+ A(λ1 λ2 µ1(E1) µ2(E2) + λ2 λ3 µ2(E2) µ3(E3) + λ3 λ1 µ3(E3) µ1(E1))
+ A² λ1 λ2 λ3 µ1(E1) µ2(E2) µ3(E3),   (49)

for ∀E1 ⊕ E2 ⊕ E3 ∈ F1 ⊕ F2 ⊕ F3 and ∀λ1, λ2, λ3 ∈ [0, 1] such that

λ1 + λ2 + λ3 + A(λ1 λ2 + λ2 λ3 + λ3 λ1) + A² λ1 λ2 λ3 = 1   (50)

holds true and A is the greater solution of (50).
(X1 ⊕ X 2 ⊕ X 3 ⊕
, n) be fuzzy measure spaces. Then ⊕ Fn, µ1⊕ 2⊕3 n ) is a fuzzy
⊕ X n , F1 ⊕ F2 ⊕
measure space, where
µ1⊕ 2⊕3 n ( E1 ⊕ E 2 ⊕ E3 ⊕
⊕ En )
n
= ∑ λi µ i ( Ei ) + A∑ λi λ j µ i ( Ei ) µ j ( E j ) i =1
i≠ j
∑ λi λ j λ k µ i ( Ei ) µ j ( E j ) µ k ( E k ) +
+ A2
i≠ j≠k
for ∀ E1 ⊕ E 2 ⊕ E3 ⊕
⊕ E n ∈ F1 ⊕ F2 ⊕
n
+ A n −1 ∏ λi µ i ( Ei ),
(51)
i =1
∀ ⊕ Fn , and λ1 , λ 2 ,
, λ n ∈ [0,1] ,
such that n
∑λ i =1
i
+ A∑ λ i λ j + A 2 i≠ j
∑λ λ λ
i≠ j≠k
i
j
k
+
n
+ A n −1 ∏ λi = 1,
(52)
i =1
holds true and A+ (≥ −1) exists for the solution of (52).
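Given the weights λ_i, the constant A in (49)–(52) can be computed numerically as a root of the polynomial condition (52); (50) is its n = 3 special case. A sketch (illustrative; it uses numpy, and the example weights are ours):

```python
import numpy as np
from itertools import combinations

def solve_A(lambdas):
    """Return the solution A >= -1 of eq. (52):
    e1 + A*e2 + ... + A^{n-1}*en = 1, where ek are the elementary
    symmetric sums of the lambdas."""
    n = len(lambdas)
    e = [sum(np.prod(c) for c in combinations(lambdas, k)) for k in range(1, n + 1)]
    # Polynomial in A: e_n*A^{n-1} + ... + e_2*A + (e_1 - 1) = 0
    coeffs = list(reversed(e[1:])) + [e[0] - 1.0]
    roots = np.roots(coeffs)
    real = [z.real for z in roots if abs(z.imag) < 1e-9 and z.real >= -1.0]
    return max(real)

print(round(solve_A([0.3, 0.5, 0.4]), 4))   # about -0.4516 (negative, since sum of lambdas > 1)
```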
5 Fuzzy Measure Shift Differentiation of the Choquet Integral

Hereafter we consider the case

X ⊆ [0, ∞] ⊂ R, and F ≜ B_X (the Borel sets limited to X).   (53)
5.1 Choquet Integral Limited by Domain

The Choquet integral limited by a real interval of X is defined as:

Definition 7. Let (X, F, µ) (X ⊆ R) be a fuzzy measure space and let f be a nonnegative measurable function. The X axis [0, x*] limited Choquet integral of a nonnegative measurable function f with respect to a fuzzy measure µ is defined as:

F_X(x*, µ) ≜ ∫_0^{+∞} µ({x | f(x) ≥ r} ∩ [0, x*] ∩ X) dr.   (54)

Let 0 ≤ ∀a ≤ ∀c ≤ ∀b (a, b, c ∈ R). Then the X axis [a, b] limited Choquet integral of a nonnegative measurable function f with respect to the fuzzy measure µ is defined as:

F_X([a, b], µ) ≜ ∫_0^{+∞} µ({x | f(x) ≥ r} ∩ [a, b] ∩ X) dr.   (55)

Now, it should be noted that, for any 0 ≤ a ≤ c ≤ b,

F_X([a, b], µ) ≠ F_X([a, c], µ) + F_X([c, b], µ).   (56)
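In contrast with (6) and (20), the domain-limited integral is in general not additive over adjacent intervals, which is exactly what (56) states. A tiny numerical illustration follows (the discrete grid and the possibility-type measure are assumptions of ours, not part of the chapter):

```python
# Assumed setup: X is a small grid and mu(A) = max over A of a possibility degree pi.
grid = [1.0, 2.0, 3.0]
pi = {1.0: 0.5, 2.0: 1.0, 3.0: 0.5}
f = {1.0: 1.0, 2.0: 1.0, 3.0: 1.0}              # constant payoff 1 on X

def F_X(a, b, n_steps=1000):
    """Discretized version of eq. (55) on the grid above."""
    r_max, total = max(f.values()), 0.0
    dr = r_max / n_steps
    for i in range(n_steps):
        r = (i + 1) * dr
        level = [x for x in grid if f[x] >= r and a <= x <= b]
        total += (max(pi[x] for x in level) if level else 0.0) * dr
    return total

print(round(F_X(1, 3), 3), round(F_X(1, 2) + F_X(2, 3), 3))   # 1.0 vs 2.0: eq. (56)
```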
5.2 Fuzzy Measure Shift Differentiation of the Choquet Integral Limited by Domain

Definition 8. Let µ*(·, ∆x) be the ∆x shifted fuzzy measure of µ

(∆x > 0: ∆x right shift;  ∆x < 0: |∆x| left shift),   (57)

∀F ∈ F,  µ*(F, ∆x) ≜ µ(F − ∆x),  where F − ∆x ≜ {f − ∆x | f ∈ F, f − ∆x ∈ X}.   (58)

A fuzzy measure shift differentiation of the X axis real interval limited Choquet integral for a nonnegative measurable function is defined as:

Definition 9. For any nonnegative measurable function f, if

D⁺F_X([a, b], µ) ≜ lim_{∆x→+0} [F_X([a, b], µ*(·, ∆x)) − F_X([a, b], µ)] / ∆x
= lim_{∆x→+0} [∫_0^{+∞} µ*({x | f(x) ≥ r} ∩ [a, b] ∩ X, ∆x) dr − ∫_0^{+∞} µ({x | f(x) ≥ r} ∩ [a, b] ∩ X) dr] / ∆x   (59)

exists, this limit is called a fuzzy measure right-shift differential coefficient of the X axis [a, b] limited Choquet integral F for a nonnegative measurable function f with respect to µ. Similarly, if

D⁻F_X([a, b], µ) ≜ lim_{∆x→−0} [F_X([a, b], µ*(·, ∆x)) − F_X([a, b], µ)] / ∆x   (60)

exists, this limit is called a fuzzy measure left-shift differential coefficient of the X axis [a, b] limited Choquet integral F for a nonnegative measurable function f with respect to µ. If and only if both a fuzzy measure right-shift differential coefficient and a fuzzy measure left-shift differential coefficient of F_X([a, b], µ) exist and are equal, then they are denoted by:

DF_X([a, b], µ) = F_X'([a, b], µ) = dF_X([a, b], µ)/dµ,   (61)

and this is called a fuzzy measure shift differential coefficient of the X axis [a, b] limited Choquet integral F of f with respect to µ (Fig. 2).
Fig. 2. Fuzzy measure right-shift differential coefficient of the X axis [a, b] limited Choquet integral
5.3 Application to Financial Option Trading

An option is a security giving the right to buy or sell an asset, subject to certain conditions, within a specified period of time. The Black–Scholes [1] pricing model gives the premium on European call and put options and has been generally adopted in real financial option markets. The basic assumption employed by Black and Scholes was that the value of the underlying security follows a log-normal diffusion process. However, it has been pointed out that, because of traders' bias, the real distribution of the underlying security's value at the maturity date does not always follow a log-normal diffusion process [15–17]. Therefore, a financial option pricing model based on a subjective fuzzy distribution is proposed in this chapter by using the X axis real interval limited Choquet integral. Now, a European call option premium (a right to buy) is presented as:
C(K) = F_X([K/(1 + r), S_MAX/(1 + r)], µ)
     = ∫_0^{+∞} µ({S/(1 + r) | f(S/(1 + r)) ≥ q} ∩ [K/(1 + r), S_MAX/(1 + r)] ∩ X) dq,   (62)

f(s/(1 + r)) = s/(1 + r) − K/(1 + r).   (63)

Likewise, a European put option premium (a right to sell) is calculated as:

P(K) = F_X([S_MIN/(1 + r), K/(1 + r)], µ)
     = ∫_0^{+∞} µ({S/(1 + r) | f(S/(1 + r)) ≥ q} ∩ [S_MIN/(1 + r), K/(1 + r)] ∩ X) dq,   (64)

f(s/(1 + r)) = K/(1 + r) − s/(1 + r),   (65)

where C is the call option premium, P the put option premium, K the striking price, s the price of the underlying security at maturity, r the risk-less interest rate, µ the subjective fuzzy distribution, s* the market price of the underlying security at the trade date, and

S_MAX/(1 + r) ≜ max({S/(1 + r) | µ(S/(1 + r)) = 0}),   (66)

S_MIN/(1 + r) ≜ min({S/(1 + r) | µ(S/(1 + r)) = 0}).   (67)
Secondly, the delta is the ratio of the change in the price of an option to the change in the price of the underlying security. The value of the delta of a call option can be calculated by the fuzzy measure shift differentiation of the X axis [a, b] limited Choquet integral as:

Delta of call = D⁺F_X([K/(1 + r), S_MAX/(1 + r)], µ).   (68)

Delta of call = D⁻F_X([K/(1 + r), S_MAX/(1 + r)], µ).   (69)

And the value of the delta of a put option can be calculated by the fuzzy measure shift differentiation of the X axis [a, b] limited Choquet integral as:

Delta of put = D⁺F_X([S_MIN/(1 + r), K/(1 + r)], µ).   (70)

Delta of put = D⁻F_X([S_MIN/(1 + r), K/(1 + r)], µ).   (71)
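A minimal numerical sketch of the call premium (62)–(63) and of the right-shift delta (68) follows. It is not the authors' implementation: as assumptions of ours, the subjective fuzzy distribution is taken to be a triangular possibility distribution π with µ(A) = max over A of π, the q-integral is discretized, and all parameter values are purely illustrative.

```python
import numpy as np

def call_premium(K, r, s_grid, poss, dq=1.0):
    """Call premium C(K) as the X axis [K/(1+r), S_MAX/(1+r)] limited Choquet
    integral (62) of the discounted payoff f(s/(1+r)) = s/(1+r) - K/(1+r),
    with the possibility-type fuzzy measure mu(A) = max of poss over A."""
    disc = 1.0 + r
    payoff = np.maximum(s_grid / disc - K / disc, 0.0)
    premium = 0.0
    for q in np.arange(dq, payoff.max() + dq, dq):
        level = poss[payoff >= q]                      # support of {x | f(x) >= q}
        premium += (level.max() if level.size else 0.0) * dq
    return premium

def call_delta(K, r, s_grid, poss, dx=10.0):
    """Difference quotient approximating the right-shift delta of (68):
    mu*(F, dx) = mu(F - dx) (Definition 8) amounts to shifting poss right by dx."""
    poss_shifted = np.interp(s_grid - dx, s_grid, poss, left=0.0, right=0.0)
    return (call_premium(K, r, s_grid, poss_shifted) - call_premium(K, r, s_grid, poss)) / dx

# Assumed triangular possibility distribution for the index at maturity; its support
# end points play the role of S_MIN and S_MAX in (66)-(67).
s_grid = np.arange(14000.0, 18500.0 + 1.0, 1.0)
s_min, mode, s_max = 14000.0, 16200.0, 18500.0
poss = np.where(s_grid <= mode,
                (s_grid - s_min) / (mode - s_min),
                (s_max - s_grid) / (s_max - mode))

print(round(call_premium(K=17000.0, r=0.0, s_grid=s_grid, poss=poss), 1))
print(round(call_delta(K=17000.0, r=0.0, s_grid=s_grid, poss=poss), 3))
```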
Now, our sample consists of daily closing prices of all call options and put options traded on the Osaka Stock Exchange (OSE) for the NIKKEI 225 stock index from April 1, 1998 to May 21, 1999. Option prices, prices of the NIKKEI 225 index, and the risk-less interest rates are taken from the Nippon Keizai Shimbun.
Table 1. Call prices and put prices

strike price | call: Choquet | call: market price | call: B&S | put: Choquet | put: market price | put: B&S
14,500 | – | – | – | 0 | 20 | 4
15,000 | – | – | – | 0 | 45 | 21
15,500 | – | – | – | 0 | 110 | 82
16,000 | 1,020 | 455 | 434 | 16 | 230 | 230
16,500 | 603 | 240 | 200 | 105 | 480 | 496
17,000 | 301 | 100 | 75 | – | – | –
17,500 | 131 | 40 | 23 | – | – | –
18,000 | 33 | 15 | 6 | – | – | –
For the strike price of 17,000, the premium given by (62)–(65) is 301; one contract would cost 301,000 yen. The theoretical call price given by the Black–Scholes model with historical volatility is 75, and its real market price was 100 (Tables 1 and 2). The call premium given by (62)–(65) is larger than the theoretical price given by the Black–Scholes model. The reason for this difference is that the fuzzy distribution used in (62)–(65) is skewed to the right of the log-normal distribution (whose mean is 16,199.99) underlying the Black–Scholes model. The theoretical prices given by the Black–Scholes model are very close to the market prices, because most traders use the Black–Scholes model for verifying the market prices. But the basic assumption employed by Black and Scholes is that the value of the underlying security follows a log-normal diffusion process, and it has been pointed out that, because of traders' bias, the real distribution of the underlying security's value at the maturity date does not always follow a log-normal diffusion process. The financial option pricing model based on the X axis real interval limited Choquet integral makes it possible for traders to flexibly verify their own scenarios about the future distribution of the underlying security at maturity. The financial option pricing model is a fundamental instrument in the field of modern finance. The model based on the X axis real interval limited Choquet integral provides a key for applying fuzzy set theory to financial engineering, which involves various kinds of ambiguity arising from human factors and social complexity.
Table 2. Deltas of call and put

strike price | call delta: Choquet | call delta: B&S | put delta: Choquet | put delta: B&S
14,500 | – | – | 0.00 | 0.01
15,000 | – | – | 0.00 | 0.06
15,500 | – | – | 0.00 | 0.18
16,000 | 0.62 | 0.61 | 0.06 | 0.39
16,500 | 0.56 | 0.37 | 0.22 | 0.63
17,000 | 0.37 | 0.18 | – | –
17,500 | 0.33 | 0.07 | – | –
18,000 | 0.14 | 0.02 | – | –
6 Differentiation of the Choquet Integral Limited by Domain

Hereafter we consider the case

X ⊆ [0, ∞] ⊂ R and F = B_X (the Borel sets limited to X).   (72)
6.1 Differentiation of the Choquet Integral Limited by Domain

Differentiation of the Choquet integral for a nonnegative measurable function is defined as:

Definition 10. For any nonnegative measurable function f, if

D⁺F_X(x) ≜ lim_{∆x→+0} F_X([x, x + ∆x], µ) / ∆x   (73)

exists, this limit is called a right differential coefficient of the X axis [0, x] limited Choquet integral F of f with respect to µ at x. Similarly, if

D⁻F_X(x) ≜ lim_{∆x→−0} F_X([x + ∆x, x], µ) / (−∆x)   (74)

exists, this limit is called a left differential coefficient of the X axis [0, x] limited Choquet integral F of f with respect to µ at x.
6.2 Application to the Capital Investment Decision Making Problem

The differentiation of the X axis [0, x] limited Choquet integral is applied to an automobile factory capital investment decision making problem. An automobile company has a sales plan for a new car. The current factory line has the capacity to manufacture 3,200 new cars in addition to the current car lines. So, if the company plans to sell more than 3,200 new cars annually, it must invest in a new factory. The marketing staff estimate the possibility of the annual sales amount of the new car as shown in Table 3.

Table 3. Annual sales estimate of a new car

amount     | [0, 1000) | [1000, 2000) | [2000, 3000) | [3000, 4000) | [4000, 5000]
evaluation | 0.0       | 1/3          | 4/9          | 4/9          | 1/3

Now, the evaluation of the inside of each interval is given by an additive measure, and the evaluation of the in-between intervals is given by a fuzzy measure µ_λ (where µ_λ(X) = 1 and λ = −3/4). Then, assume that the sales amount (million yen) of a new car is given as:

f(x) = x   (0 ≤ x < 2000),
f(x) = 0.9(x − 2000) + 2000   (2000 ≤ x ≤ 4000),
f(x) = 0.8(x − 4000) + 3800   (4000 ≤ x ≤ 5000),   (75)

after consideration of discount. The expected sales revenue of the new car is derived from the [0, x] limited Choquet integral as:

F_X(x) = 0   (0 ≤ x < 1000),
F_X(x) = x²/6000 − 500/3   (1000 ≤ x < 2000),
F_X(x) = 0.0002x² − (7/90)x − 1300/9   (2000 ≤ x < 3000),
F_X(x) = 0.0002x² − (52/135)x + 7000/9   (3000 ≤ x < 4000),
F_X(x) = x²/7500 − (221/540)x + 52400/27   (4000 ≤ x < 5000).   (76)
Fig. 3. Relation between number of car sales and expected sales amount
The differentiation of the [0, x] limited Choquet integral is given as:

F_X'(x) = 0   (0 ≤ x < 1000),
F_X'(x) = x/3000   (1000 ≤ x < 2000),
F_X'(x) = 0.0004x − 7/90   (2000 ≤ x < 3000),
F_X'(x) = 0.0004x − 52/135   (3000 ≤ x < 4000),
F_X'(x) = (2/7500)x − 221/540   (4000 ≤ x < 5000).   (77)
The differentiation of the [0, x] limited Choquet integral can be regarded as the rate of increase or decrease of the expected sales amount per car. Now, assume that the cost of each new car is given as in Fig. 3, because of the additional automobile factory capital investment required to produce more than 3,200 cars annually. Then, the differentiation of the X axis real interval limited Choquet integral over X gives the increase or decrease of the expected sales revenue (and hence profit) per new car. The expected profit per new car is given by the difference between the expected revenue per car and the cost per car. For production in excess of 3,200 cars annually, the profit per car increases slightly, but soon decreases. In this case, it is difficult to justify investing in a new factory considering the ratio of profit per car against the risk (Fig. 4).
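The piecewise expressions (76)–(77) can be tabulated directly; the short sketch below (illustrative; the chosen production levels are ours) prints the expected sales revenue and the expected revenue increment per additional car:

```python
def expected_revenue(x):
    """Piecewise expected sales revenue F_X(x) of eq. (76) (million yen)."""
    if x < 1000:
        return 0.0
    if x < 2000:
        return x**2 / 6000.0 - 500.0 / 3.0
    if x < 3000:
        return 0.0002 * x**2 - 7.0 / 90.0 * x - 1300.0 / 9.0
    if x < 4000:
        return 0.0002 * x**2 - 52.0 / 135.0 * x + 7000.0 / 9.0
    return x**2 / 7500.0 - 221.0 / 540.0 * x + 52400.0 / 27.0

def revenue_per_extra_car(x):
    """Derivative F_X'(x) of eq. (77): expected revenue increment per car."""
    if x < 1000:
        return 0.0
    if x < 2000:
        return x / 3000.0
    if x < 3000:
        return 0.0004 * x - 7.0 / 90.0
    if x < 4000:
        return 0.0004 * x - 52.0 / 135.0
    return 2.0 * x / 7500.0 - 221.0 / 540.0

for cars in (1000, 2000, 3200, 4000, 5000):
    print(cars, round(expected_revenue(cars), 1), round(revenue_per_extra_car(cars), 3))
```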
Fig. 4. Relation between the number of car sales and the increase and decrease of the expected sales amount per car
Recently, return on investment (ROI) management has become popular with globalization and capitalization in Japan. Investors attach more importance to the return on investment than to the sales amount, so managers require a more detailed ratio of profit for each investment, and it is used as an important index for capital investment decision making. This application was formulated from the ideas of actual management staff. With this formulation, it is possible to decide on the investment theoretically from the ambiguous evaluation over the real interval.
7 Summary

A new corporate evaluation model and option pricing model based on fuzzy measures has been proposed; these models are better able to deal with ambiguous and discrete data and overcome the gap mentioned earlier. Firstly, the differentiation of the Choquet integral of a nonnegative measurable function with respect to a fuzzy measure on a fuzzy measure space has been proposed and applied to the capital investment problem. Then the differentiation of the Choquet integral of a nonnegative measurable function was extended to the differentiation of the Choquet integral of a measurable function. The Choquet integral was applied to the long-term debt ratings model, where the input was qualitative and quantitative data of the corporations, and the output was the Moody's long-term debt ratings. Moreover, differentiation of the Choquet integral has been applied to the long-term debt ratings. Secondly, in the application using the
Choquet integral limited by domain, the problem was how to evaluate the in-between intervals, each of which is characterized by a fuzzy measure. So, a composite fuzzy measure built up from fuzzy measures defined on fuzzy measurable spaces has been proposed. Thirdly, two differentiations of a real interval domain limited Choquet integral for a nonnegative measurable function have been proposed, where the domain is a subset of the real numbers. One is the fuzzy measure shift differentiation of the Choquet integral, which was applied to option pricing. The other is the differentiation of the Choquet integral limited by domain, which was applied to the automobile factory capital investment decision making problem. Recently, real option theory for the evaluation of nonfinancial assets (venture businesses, real estate, the weather, etc.) based on stochastic models has begun to be applied in all industrial fields. However, it is difficult to assume a log-normal distribution of the underlying prices, because real asset markets are not rational markets. Our models can evaluate not only financial assets but also nonfinancial assets.
References
1. F. Black and M. Scholes: "The pricing of options and corporate liabilities", J. Polit. Econ., Vol. 81, 1973, pp. 637–654
2. G. Choquet: "Theory of capacities", Ann. Inst. Fourier, Vol. 5, 1953, pp. 131–295
3. D. Schmeidler: "Subjective probability and expected utility without additivity", Econometrica, Vol. 57, 1989, pp. 571–587
4. J. Šipoš: "Integral with respect to a pre-measure", Math. Slovaca, Vol. 29, 1979, pp. 141–155
5. T. Kaino and K. Hirota: "Differentiation of the Choquet integral of a nonnegative measurable function", Proc. FUZZ-IEEE'99, Seoul, 1999, Vol. III, pp. 1322–1327
6. T. Kaino and K. Hirota: "Differentiation of nonnegative measurable function Choquet integral over real fuzzy measure space and its application to financial option trading model", Proc. IEEE SMC'99, Tokyo, 1999, Vol. III, pp. 73–78
7. T. Kaino and K. Hirota: "Differentiation of the Choquet integral and its application to long-term debt ratings", J. Adv. Comput. Intell., Vol. 4, No. 1, 2000, pp. 66–75
8. T. Kaino and K. Hirota: "Differentiation of Choquet integral for nonnegative measurable function and its application to capital investment decision making problem", Proc. FUZZ-IEEE 2000, Texas, 2000, pp. 89–93
9. T. Kaino and K. Hirota: "Composite fuzzy measure and its application to investment decision making problem", J. Adv. Comput. Intell., Vol. 3, No. 1, 2003
10. Moody’s Japan KK: The Moody’s ratings list [Japanese corporations/sovereign], 1999 11. Moody’s investors service: Global Credit Analysis Institute for Financial Affairs Inc., 1995. (in Japanese) 12. T. Okahigashi: The analysis of Bond rating, Chuokeizai-sha, 1999. (in Japanese) 13. Toyo Keizai Inc.: “Toyokeizai ’99 Data Bank”, Toyo Keizai Inc, 1998 (in Japanese) 14. Daiamond Inc.: “Ranking of Japanese Corporations”, DAIAMOND Inc., Vol. 18, 1997, pp. 26–74 (in Japanese) 15. J. Hull, Mitsubishi Bank (Translation): Introduction to Futures and Options Markets, Institute for Financial Affairs Inc., 1994, pp. 487–502 (in Japanese) 16. H. Usami: World of Futures and Options, Jiji, 1989, pp. 207–212, 239–240 (in Japanese) 17. Bank of Japan: Entirely Option Trading, Institute for Financial Affairs Inc., 1995, pp. 265–266 (in Japanese)
Collective Intelligence in Multiagent Systems: Interbank Payment Systems Application

Luis Rocha-Mier, Leonid Sheremetov, and Francisco Villarreal
Summary. An interbank payment system (IPS) is defined as the set of rules, institutions, and technical mechanisms by which the transfer of funds between banks is carried out. Traditional models of IPS assume that the economic agents are completely aware of their environment, which may not be the case in real-life scenarios with different sources of uncertainty. To overcome this drawback, in this chapter a novel framework for modeling an interbank net settlement payment system (NSPS) and analyzing the effect of the individual actions of the economic agents in the context of the Collective Intelligence (COIN) theory is proposed. Based on Soft-computing techniques, this framework focuses on the interactions at the local and the global levels among the consumer-agents (which have no global knowledge of the environment model) in order to optimize a global utility function (GUF). A COIN is defined as a large Multiagent System (MAS) with no centralized control and communication, but where there is a global task to complete. Reinforcement learning (RL) algorithms are used at the local level, while mechanisms inspired by the COIN theory are used to optimize global behavior. The proposed framework was implemented using NetLogo (an agent-based parallel modeling and simulation environment). The results demonstrated that the interbank NSPS is a good experimental field for the application of the COIN theory and showed how the consumer-agents adapt their behavior, converging to the efficient Nash equilibrium and thereby optimizing the GUF.
1 Introduction

The liberalization of financial markets around the world in the last three decades, coupled with advances in computing and telecommunication technology, has been the driving force behind fundamental changes in the world financial system. Among these changes, perhaps the most important are the growing integration of international markets, the innovation in financial instruments, and the resulting increase in the volume of funds transferred between agents on a daily basis.
This last change has raised concern among central banks around the world regarding the risks associated with the systems used to handle these transfers. Issues regarding payment and settlement are of particular importance to central banks for three reasons. First, the central bank's main policy objective is to promote macroeconomic stability through the instruments of monetary policy, which is effectively carried out through the payment system. Second, aside from monetary policy considerations, payment systems are the means by which transactions in the real economy take place; thus the smooth functioning of payment and settlement systems is fundamental to economic stability. Finally, the liquidity transformation activity of banks makes them vulnerable to runs which, due to the interdependence created by a payment system, give rise to the possibility of systemic risk. The last two reasons provide the rationale for the objective of central bank policy in the role of lender of last resort, which is to ensure financial stability. Increasingly, central banks are being made explicitly responsible for the supervision of the financial system and the oversight of the payment system. However, until relatively recently the issues regarding payment system design were regarded as technical in nature. Recent contributions to the study of various issues regarding payment systems are reported in [1, 3, 7, 9, 11, 17, 18]. These research works model both gross settlement (operated by the central bank with or without explicit intraday credit) and deferred net multilateral settlement systems, as well as their integration. One of the main assumptions of these models is a complete awareness of the environment, which may not be the case in many real scenarios. Therefore, in this work the net settlement payment system problem (NSPSP) is addressed within the context of the Collective Intelligence (COIN) theory [8, 15, 23], where complete awareness of the environment is not required. Being part of the Soft-computing technologies, the COIN framework permits capturing the main properties of distributed problem domains in general and of the NSPSP in particular, such as the following:

– Distributed: The different entities in the NSPS, which might be distributed across different geographical locations, must be considered as a whole by using a distributed knowledge-based system. For example, the economic agents might be located in different cities.
– Dynamic: The NSPS environment is changing; there are no obligations for the agents to be part of a bank for a certain period of time, and the agents may join or leave the bank based on their own interest. Moreover, the variables of the environment can change continuously (e.g., returns, risk, etc.).
– On-line adaptation: The agents must adapt their behavior to be able to respond to the dynamics of the environment in real time.
– Cooperation: Greedy behavior of the consumer-agents leads to the degradation of the whole system; therefore the agents in the NSPS must cooperate in order to achieve the global objective (e.g., global utility function (GUF) optimization).
According to our conceptualization of the NSPSP within the framework of the COIN theory, a payment system is represented as a Multiagent System (MAS) where:

– One of the main objectives is the decentralization of control and communication.
– Each agent of the PS (depositor, consumer, or bank) has an autonomous behavior and a local utility function (LUF).
– The agents execute Reinforcement Learning (RL) algorithms at the local level.
– The learning process consists of adapting the "local behavior" of each agent with the aim of optimizing a given "global behavior."

In this work, we developed an agent-based COIN framework to analyze the consumer-agents' behavior in the NSPSP, studying different strategies to define the time consumer-agents must wait before consuming their goods. Mechanisms inspired by the COIN theory were developed and studied to determine how the agents' local decisions contribute to optimizing the global utility function. The chapter is organized as follows: Section 2 discusses the institutional aspects of payment systems. Section 3 describes the model proposed in this work within the framework of the NSPSP. Section 4 describes the RL methods used at the local level of every agent. Section 5 describes the limitations of Soft-computing methods for the optimization of distributed systems. Section 6 explains the COIN theory as a framework to address the NSPSP. The algorithm "Q-net" developed in this work based on the RL methods and COIN mechanisms is adapted to the scenario of the net settlement payment system problem in Section 7. Section 8 presents a case study description and the NetLogo simulation results. Section 9 concludes by discussing some of the ways in which our model can be improved upon, and some directions in which it could be extended.
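As a deliberately simplified illustration of "RL algorithms at the local level", the sketch below implements a single-state tabular Q-learning consumer-agent choosing between two actions. It is not the Q-net algorithm of Section 7; the reward function is a placeholder of ours that a real NSPS simulation would replace with the agent's local utility function.

```python
import random

class ConsumerAgent:
    """Minimal tabular Q-learning agent; a stand-in for the local learners
    described above (actions and reward are illustrative placeholders)."""
    ACTIONS = ("wait", "run")

    def __init__(self, alpha=0.1, gamma_discount=0.9, epsilon=0.1):
        self.q = {a: 0.0 for a in self.ACTIONS}    # single-state Q-table
        self.alpha = alpha                          # learning rate
        self.gamma_discount = gamma_discount        # discount factor
        self.epsilon = epsilon                      # exploration rate

    def act(self):
        if random.random() < self.epsilon:
            return random.choice(self.ACTIONS)
        return max(self.q, key=self.q.get)

    def learn(self, action, reward):
        target = reward + self.gamma_discount * max(self.q.values())
        self.q[action] += self.alpha * (target - self.q[action])

# Placeholder environment: waiting pays more on average unless a "run" cascade
# (systemic event) occurs; real local utility functions come from the NSPS model.
def placeholder_reward(action):
    if action == "wait":
        return 1.5 if random.random() > 0.1 else 0.5
    return 1.0

agent = ConsumerAgent()
for _ in range(1000):
    a = agent.act()
    agent.learn(a, placeholder_reward(a))
print(agent.q)
```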
2 Aspects of the Interbank Payment Systems (IPS)

An IPS is defined as the set of rules, institutions and technical mechanisms by which the transfer of funds between banks is carried out for the purpose of settling obligations assumed by economic agents¹, which may be the banks themselves or their clients. In general, we can divide the participants of the payment system into four categories: the depositors, the consumers, the banks, and the regulators.²

¹ In this work, the terms "agent", "economic agent", "bank", "depositor", and "consumer" are used interchangeably.
² For a detailed definition of payment systems see [5].
According to [11], large-value interbank payment systems can be grouped into three general models: (i) gross settlement operated by the central bank with explicit intraday credit, (ii) gross settlement operated by the central bank without intraday credit, and (iii) deferred net multilateral settlement systems. As an example of the third type of system, one can mention the Clearing House Interbank Payment System (CHIPS) operating in the USA, which is the largest private net settlement payment system in the world. In Mexico, the Interbank Payment System known as Sistema de Pagos Electrónico de Uso Ampliado (SPEUA) and operated by Centro de Compensación Bancaria (CECOBAN) was based originally on the CHIPS system, designed to be independent from the credit of the Federal Reserve Bank. However, as the banks were accustomed to having unlimited overdraft capacity in the previous system, several changes were made to the SPEUA to make it more attractive to the banks. As a result, these changes produced a different system in which the Bank of Mexico committed to granting a large amount of credit and to guaranteeing settlement, thereby assuming an important part of the credit risk. Payment systems are important given the value of transfers that are made through them, the speed with which they happen, and because they are a major channel through which endogenous and exogenous shocks are transmitted across financial systems. For example, according to [1], CHIPS processes among 77 participants approximately 225,000 payment messages worth an average of $1.2 trillion every day. On peak days, this can exceed 400,000 payments worth $2 trillion. In general, a potentially large amount of resources must be dedicated to reducing the effect of the shocks being transmitted to the system. Next we discuss the most important risks and costs in a payment system, as well as the time dimension feature.

2.1 Risks in the Payment System

Although the risks involved arise primarily from the explicit or implicit extension of credit made among agents, for completeness the risks include:³

(i) Credit Risk, which refers to the possibility that a transaction will not be realized at full value due to the failure of an agent to meet its financial obligations.
(ii) Liquidity Risk, which refers to the possibility of a transaction not taking place at the desired time, due to a party having insufficient funds to meet its financial obligations at a given time.
(iii) Operational Risk, which refers to the possibility of delays or failures to settle obligations due to breakdowns in computer or telecommunications systems.
³ See [16, 17] for a detailed discussion of the risks associated with payment systems.
(iv) Legal Risk, which refers to the possibility that an inadequate legal framework will exacerbate the effects of the other risks.
(v) Systemic Risk, which in its broadest sense refers to the possibility that the failure of a single agent in the system to meet its obligations, or a shock to the system itself, could result in the inability of other agents in the system to meet their obligations.

Systemic risk is explicitly considered in the model defined below because of its importance. Given the volumes operated by CHIPS, systemic risk from a settlement failure has been an ongoing concern for both the participants and their central bank regulators. In response to this concern, the "Clearing House" has made several major changes over the last 20 years to protect the system and its participants from settlement risk. During this period, CHIPS moved from next-day to same-day settlement in 1981, and finally from an end-of-day multilateral net payment system to one that provides intraday finality in 2001, achieving a number of important risk-reducing objectives. The definition of systemic risk given above is very broad; thus it is necessary to discuss the different aspects that the notion of systemic risk encompasses. The most widely used aspect of systemic risk regarding payment systems refers to the possibility of a propagation of liquidity problems due to coordination problems regarding how to deal with failures or liquidity shortages in the system.⁴ The second aspect of systemic risk refers to the way risk is shared among agents when the system itself is subject to exogenous shocks, such as a generalized fall in asset prices. A third aspect to which systemic risk refers is what Rochet and Tirole [16] call "learning related contagion". The basic idea is that agents can observe signals about the solvency of banks, which may induce depositors to withdraw their deposits early, causing a run. These aspects of systemic risk are studied in our model in the following sections and are taken into account through the coefficient of risk aversion of the agents (γ in our model).

2.2 Costs in the Payment System

The costs associated with payment systems include:⁵

– The costs associated with the resources used in processing payments, such as the costs of setting up and maintaining communication and safeguard systems, as well as the explicit cost of processing and settling payment orders.
– The financial costs associated with maintaining an investment portfolio different from the optimal portfolio that would be kept under the classical assumption that there exists a perfect payment system implying negligible transaction costs. In particular, these are the costs of holding more
⁴ See [18] for a discussion of the transmission of shocks through the payment system.
⁵ See [3] for a discussion of the costs in payment systems.
liquid portfolios than optimal, in order to be able to face liquidity needs as they arise and not have to liquidate assets at disadvantageous prices to generate liquidity.
– The costs associated with identifying and reducing risks in payment systems. These include, for example, the cost of assessing credit and liquidity risks, and the costs of holding collateral.
– The explicit costs associated with delays or failures in the settlement procedure, such as the opportunity cost implied by any time lags between the time a transaction is agreed and the time when it is finally settled.

These costs are not necessarily independent of each other, as some of them are consequences of the others. However, listing them in this manner is useful for exposition purposes. In our model, we will mainly be concerned with the financial costs (denoted by L) associated with maintaining an investment portfolio different from the optimal portfolio, given excessive early withdrawals.

2.3 The Time Dimension

There are a number of ways in which payment system designs can be categorized; for example, it can be done depending (i) on whether they handle large-value or low-value payments, (ii) on whether they are electronic or paper based, or (iii) on the way settlement is carried out [7]. For the purposes of this work we will classify them on the basis of the way settlement takes place. In reality, there is a continuum of ways in which the settlement of obligations can take place in a payment system, with a pure Real Time Gross Settlement (RTGS) system at one end of this continuum, and a pure Deferred Net Settlement (DNS) system at the other end. The discussion that follows will focus on the polar cases, although in our model we will examine the interaction of agents under a pure net settlement payment system. With respect to the settlement of obligations, when they are settled on a gross basis it means that each transaction is settled individually; under an RTGS system, transmission and settlement of obligations occur simultaneously as soon as orders are accepted by the system. Under a system without intraday credit, the payments of a bank with insufficient funds to cover its obligations are rejected by the system or queued until funds become available. Without intraday credit, funds can come either from incoming payments or from the liquidation of illiquid assets. Under a DNS system, banks send their payment orders to a central location, usually the central bank in the case of large-value payment systems. However, these orders are not settled continuously but at specific points in time, when the net position of each bank with respect to the central location is calculated. Banks only transfer their net obligation to the central location at the end of the netting process, which usually occurs at the end of the day.
3 Model of the Net Settlement Payment System Problem (NSPSP)

In this section, the model proposed within the framework of the net settlement payment system problem is described. As shown in Fig. 1, we have n island economies, and there is a single consumption good. On each island there is a large number m of identical risk-averse agents (which for simplicity we normalize to one), with a coefficient of risk aversion γ affecting the decision-making of every agent, and a single mutual bank. Agents live for many episodes composed of three periods t = 0, 1, 2. At t = 0, the planning period, each agent is endowed with one unit of the good. The agents face uncertainty about the time when they must consume. With probability α they must consume at t = 1; we will call this type of agent the "impatient consumers." With probability (1 − α) they must consume at t = 2; we will call this type of agent the "patient consumers." Agents learn their types privately at t = 1. In the first episode, this type is learned arbitrarily; in the subsequent episodes, however, the type is learned from the knowledge acquired from the environment, as shown in the next sections. In order to move the good across time, agents and their banks can store it from one period to the next at no cost; the storage technology yields the riskless rate, which we will assume is equal to zero. In the context of IPS, we can think of the storage technology as being analogous to maintaining liquid funds in the form of reserves. Alternatively, the good can be invested; however, agents cannot invest by themselves but must do so by depositing their endowment at time zero at their bank, which has exclusive access to the investment technology. The investment technology is illiquid in the following sense: it yields a random return Re
0 < Re < 2
0 < L 2
L
Punishment
0.8
γ
Coefficient of relative risk aversion
0.1; 0.5; 0.9
leave their deposit in the bank (Wait action) and have the bank transfer their claims to the other island, where their consumption must take place at t = 2. However, with a high coefficient of relative risk aversion (γ = 0.9), the patient consumer-agents prefer to withdraw their deposit at t = 1 (Running action) and carry the good themselves to the other island, storing it there until its consumption at t = 2. Therefore, the patient consumer-agents learned to deal with the coefficient of relative risk aversion: when the risk aversion is high, they prefer to pay the financial cost L due to the early withdrawal rather than incur higher costs because of the insolvency of the bank (systemic risk).
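The chapter's Q-net algorithm and its COIN-based reward shaping are not reproduced here; the sketch below only illustrates, under assumed payoff values, how a tabular Q-learning rule of the kind cited in the references ([19, 21]) could let a patient agent learn the Wait/Run choice. The reward numbers, run probability, and learning constants are hypothetical and do not include the risk-aversion weighting used in the chapter.

```python
import random

# Hypothetical payoffs for a patient agent: Wait earns the investment return
# unless a bank run makes the bank insolvent; Run pays a sure amount reduced
# by the financial cost L of early withdrawal. All numbers are illustrative.
R_WAIT_OK, R_WAIT_RUN, L_COST = 1.4, 0.3, 0.8
P_RUN = 0.25                      # assumed probability that a run occurs
ALPHA, EPSILON = 0.1, 0.1         # learning rate and exploration rate

Q = {"Wait": 0.0, "Run": 0.0}     # action values (single-state problem)
random.seed(0)

for episode in range(2000):
    action = random.choice(list(Q)) if random.random() < EPSILON \
        else max(Q, key=Q.get)
    if action == "Wait":
        reward = R_WAIT_RUN if random.random() < P_RUN else R_WAIT_OK
    else:
        reward = 1.0 - L_COST     # certain payoff of withdrawing early
    # One-step Q-learning update; no successor state, so no discounted term.
    Q[action] += ALPHA * (reward - Q[action])

print(Q)   # with these illustrative payoffs the agent learns to prefer Wait
```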
Fig. 3. Learning of the action Wait (1 − µ) by the patient consumer-agents for γ = 0.1; 0.5; 0.9 (vertical axis: (1 − α)(1 − µ); horizontal axis: episode)
Fig. 4. Global utility function optimization (vertical axis: GUF; horizontal axis: episode)
In Fig. 4 we show the optimization of the global utility function. As γ increases, the relative risk aversion of the agents increases, making them more averse to proportional losses as their wealth increases. In other words, the agents would be willing to pay a higher premium to insure themselves against losses, in this case due to bank runs. This is why the global utility function increases as γ decreases. The results obtained are consistent with those reported in [6], where, in a similar environment, two Nash equilibria can be supported: one where all the patient consumer-agents decide to wait and another one where all patient consumer-agents decide to run. It is obvious that the optimal solution is the one that maximizes the global utility function: the state where all patient consumer-agents decide to wait with a low or medium coefficient of relative risk aversion, or the state where all patient consumer-agents decide to run with a high coefficient of relative risk aversion.
9 Conclusions and Future Work

In this chapter, the NSPSP was addressed within the framework of the COIN theory. The proposed conceptualization makes it possible to handle the NSPSP within the following context (usually absent in the classic approaches):
– The model of the behavior of the environment is unknown at the beginning.
– The global behavior of the system is composed of individual behaviors and is modeled as the interaction between them (bottom-up approach).
– In this way, the individual behavior of each entity affects the total behavior of the system to some degree, depending upon the degree of responsibility of each entity, which is known as the "Credit Assignment Problem."

In the presented NSPSP model, random returns across the islands were introduced to evaluate the effect of the individual behaviors of the agents, using the developed learning mechanisms and considering this source of uncertainty. In order to optimize the global utility function, an algorithm (Q-net) based on learning processes using RL algorithms and mechanisms inspired by the COIN theory was developed. A model of the net settlement payment system problem was proposed and implemented in the agent-based parallel modeling and simulation environment over the NetLogo platform. The algorithm developed can easily be applied to the simulation of real-world scenarios thanks to the learning abilities of the agents, without needing knowledge of the environment model.

It is common knowledge that in Soft Computing there is a lack of learning mechanisms for systems comprising a large number of agents. In our research work, this problem was addressed by using the theory of Collective Intelligence. We think that the issue of collective learning will become one of the principal questions to be solved in the years to come. We conclude that the net settlement payment system problem is well suited for the application of the COIN theory. In addition, the adaptive algorithm
presented, Q-net, provides GUF optimization while avoiding problems like the "Tragedy of the Commons." Possible extensions of this work include the application of the COIN framework to an economy where net and gross settlement payment systems coexist, a common feature of modern economies. Within this framework, the issue of the tradeoff between costs and systemic risk could be addressed. In addition, more complicated punishment algorithms will be developed to adjust the local utility functions. Finally, we intend to compare our algorithms with other classical optimization methods from game theory.
Acknowledgments Partial support for this research work has been provided by the IMP, within the project D.00006. The authors would like to thank Ana Luisa Hernandez for her contribution to the implementation of the developed algorithms.
References
1. Intraday liquidity management in the evolving payment system: A study of the impact of the euro, CLS Bank, and CHIPS finality. Payments Risk Committee, Intraday Liquidity Management Task Force, NY, 2000. http://www.newyorkfed.org/prc/intraday.htm
2. R. Bellman. Dynamic Programming. Princeton University Press, NJ, 1957
3. A. Berger, D. Hancock, and J. Marquardt. Framework for Analyzing Efficiency, Risks, Costs, and Innovations in the Payments System. Journal of Money, Credit and Banking, 28(4):696–732, 1996
4. D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996
5. BIS. Bank for International Settlements, Core Principles for Systemically Important Payment Systems. Basel, 2001
6. D. Diamond and P. Dybvig. Bank Runs, Deposit Insurance and Liquidity. Journal of Political Economy, 91(3):401–419, 1983
7. W. Emmons. Recent Developments in Wholesale Payment Systems. Review of Federal Reserve Bank of St. Louis, pp. 23–43, November/December 1997
8. J. Ferber. Les Systèmes Multi-Agents : Vers Une Intelligence Collective. InterEditions, Paris, 1997
9. X. Freixas and B. Parigi. Contagion and Efficiency in Gross and Net Interbank Payment Systems. Journal of Financial Intermediation, 7:3–31, 1998
10. G. Hardin. The Tragedy of the Commons. Science, 1968
11. A. Horii and B.J. Summers. Large-value transfer systems. In: International Monetary Fund, editor, The Payment System: Design, Management and Supervision, pp. 73–78, Washington, DC, 1994
12. J.-S.R. Jang, C.-T. Sun, and E. Mizutani. Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Prentice-Hall, Englewood Cliffs, NJ, 1997
13. N.R. Jennings, K. Sycara, and M. Wooldridge. A roadmap of agent research and development. Autonomous Agents and Multi-Agent Systems, 1(1):7–38, 1998
14. T.M. Mitchell. Machine Learning. McGraw-Hill, NY, 1997
15. L.E. Rocha-Mier. Apprentissage dans une Intelligence Collective Neuronale : Application au routage de paquets sur Internet. Ph.D. thesis, Institut National Polytechnique de Grenoble, 2002
16. J.-C. Rochet and J. Tirole. Controlling Risk in Payment Systems. Journal of Money, Credit and Banking, 28(4):832–860, November 1996
17. J.-C. Rochet and J. Tirole. Interbank Lending and Systemic Risk. Journal of Money, Credit and Banking, 28(4):733–762, November 1996
18. M. Sbracia and A. Zaghini. Crises and Contagion: The Role of the Banking System. In: Marrying the Macro and Micro Prudential Dimensions of Financial Stability, pp. 241–260, 2001
19. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998
20. R.M. Turner. The Tragedy of the Commons and Distributed AI Systems. 12th International Workshop on Distributed Artificial Intelligence, 1993
21. C. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989
22. G. Weiss. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. MIT Press, Cambridge, MA, 1999
23. D. Wolpert and K. Tumer. An Introduction to Collective Intelligence. Technical Report NASA-ARC-IC-99-63, NASA Ames Research Center, 1999
Fuzzy Models in Credit Risk Analysis Antonio Carlos Pinto Dias Alves
Summary. The goal of this chapter is to present some concepts that can guide the use of fuzzy logic systems in credit risk analysis. Here we use fuzzy quantification theory to perform a kind of multivariate analysis. We will see how it can be used to give more usable answers than traditional Logit or Probit analysis. To carry out the analysis we will use some interesting accounting indicators that can efficiently indicate the financial health of a company.
1 Introduction

Credit risk analysis has been growing in the financial literature. In general, models begin with subjective data and try to give a scientific shape to the classifications they pose. Those classifications are criticized either for being tendentious or for not taking into account important, although subjective, aspects of the realities of people, enterprises, and governments. It would be desirable to have systems that are completely independent of psychological aspects and that, at the same time, take into account the peculiarities of those who are looking for credit.

In Brazil, credit risk has become a major concern not only for the risk managers in financial institutions but also for the regulators. There are some reasons for that. The first one we can point to is the Real Plan of 1994, which has brought inflation rates to tolerable levels since then. Nowadays Brazilian market risk is much better understood and tracked. Another reason is that the larger part of banks' economic capital is generally used for credit. The sophistication of traditional standard methods of measurement, analysis, and management of credit risk might, therefore, not be in line with such significance. A third reason points to the fact that a great number of insolvencies and restructurings of banks all over the world were influenced by prior bankruptcies of creditors.
The credit risk of a single client is the basis of all subsequent risk analysis in portfolio credit risk modeling. A concept of default can be client oriented, i.e., the status of default is a state of a counterpart such as insolvency or bankruptcy. This work analyses the short- to mid-term probability of insolvency for a Brazilian company. It has evolved from a previous work [1] that used the traditional Z-score technique.

The focus of traditional credit analysis has been on a company's fundamental accounting information in order to determine whether the company has the ability to generate sufficient cash flow to service its debt obligations. Evaluations of the industrial environment, investment plans, balance sheet data, and management skills serve as primary input for the assessment of the company's likelihood of survival over a certain time horizon. In this work we assume multivariate normality to guarantee that our results are adequate.

Concerning traditional techniques, Logit and Probit models [10, 11] are widely used in multivariate data analysis. In credit risk analysis these models have been used in most financial institutions to verify whether a company is able to meet its payment obligations. In these models one selects some accounting variables and applies them to a function. These models have a great shortcoming since the dependent variable is binary, having only two possible states. In discriminant analysis, the nonmetric character of a dependent variable is accommodated by making predictions of group membership based on discriminant Z scores [10]. This requires the calculation of cutting scores and the assignment of observations to groups. Logit (logistic regression) models approach this task in a manner more similar to that found in multiple regression. They differ from multiple regression, however, in that they directly predict the probability of an event occurring. Although the probability value is a metric measure, there are fundamental differences between multiple and logistic regression. Probability values can be any value between zero and one, but the predicted value must be bounded to fall within the range zero to one [10].

To define a relationship bounded by zero and one, logistic regression uses an assumed relationship between the independent and dependent variables that resembles an S-shaped curve. At very low levels of the independent variable, the probability approaches zero. As the independent variable increases, the probability increases up the curve, but then the slope starts decreasing, so that at high levels of the independent variable the probability approaches one but never exceeds it. Linear regression models cannot accommodate such a relationship, as it is inherently nonlinear [10]. Moreover, such situations cannot be studied with ordinary regression, because doing so would violate several assumptions. First, the error term of a discrete variable follows the binomial distribution instead of the
normal distribution, thus invalidating all statistical testing based on the assumptions of normality. Second, the variance of a dichotomous variable is not constant, creating instances of heteroscedasticity as well. Logistic regression was developed to deal with these issues. Its relationship between dependent and independent variables requires a somewhat different approach in estimating and interpreting the coefficients [10]. Logit models consider the following representation of a probability.
\[ P_i = E(Y = 1 \mid X_i) = \frac{1}{1 + e^{-(\beta_1 + \beta_2 X_i)}} \tag{1} \]
For ease of exposition we write (1) as
\[ P_i = \frac{1}{1 + e^{-Z_i}} \tag{2} \]
where
\[ Z_i = \beta_1 + \beta_2 X_i \tag{3} \]
Equation (2) is what is known as the cumulative logistic distribution function. It is easy to verify that as Z_i ranges from −∞ to +∞, P_i ranges between 0 and 1, and that P_i is nonlinearly related to Z_i, thus satisfying the requirements considered earlier. But it seems that in satisfying these requirements we have created an estimation problem, because P_i is nonlinear not only in Z (and X) but also in the βs, as can be seen clearly from (2). This means that we cannot use the ordinary least squares procedure to estimate the βs. Nevertheless, (2) is intrinsically linear, because if we let the probability of success P_i be given by (2), then (1 − P_i) is [11]
\[ 1 - P_i = \frac{1}{1 + e^{Z_i}} \tag{4} \]
Therefore we can write
\[ \frac{P_i}{1 - P_i} = \frac{1 + e^{Z_i}}{1 + e^{-Z_i}} = e^{Z_i} \tag{5} \]
So Pi/(1-Pi) is simply the odds ratio in favor of probability [11]. Now if we take the natural log of (5) we obtain
\[ L_i = \ln\!\left(\frac{P_i}{1 - P_i}\right) = Z_i = \beta_1 + \beta_2 X_i \tag{6} \]
One can see that L, the log of the odds ratio, is not only linear in the Xs but also linear in the parameters. L is called the Logit function. For estimation purposes we can write (6) as
\[ L_i = \ln\!\left(\frac{P_i}{1 - P_i}\right) = \beta_1 + \beta_2 X_i + u_i \tag{7} \]
where u_i is the disturbance term. As we have noted, to explain the behavior of a dichotomous dependent variable we will have to use a suitably chosen cumulative distribution function (CDF). The Logit model uses the cumulative logistic function as shown in (2), but in some applications the normal CDF has been found useful. The estimating model that emerges from the normal CDF is known as the Probit model (also known as the Normit model) [11]. Briefly, if a variable Z follows the normal distribution with mean µ_z and variance σ², its PDF is
\[ f(Z) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(Z - \mu_z)^2 / 2\sigma^2} \tag{7} \]
and its CDF is
\[ F(Z) = \int_{-\infty}^{Z_0} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(Z - \mu_z)^2 / 2\sigma^2}\, dZ \tag{8} \]
where Z_0 is some specified value of Z [11]. In principle one could substitute the normal CDF in place of the logistic CDF in (2) and proceed as before. Probit and especially Logit models are nowadays widely used in credit risk analysis. Nevertheless, they have several shortcomings. Some of these are:
(a) As P goes from 0 to 1, the Logit L goes from −∞ to +∞. That is, although the probabilities lie between 0 and 1, the logits are not so bounded.
(b) A Logit or Probit model presumes binary dependent variables: success can only occur or not.
(c) Although L is linear in X, the probabilities themselves are not.
(d) The disturbance term in the Logit (or Probit) model is heteroscedastic. Thus, instead of using ordinary least squares, one will have to use weighted least squares.
(e) The intercept term is almost always meaningless.
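To make equations (2)–(6) concrete, the short sketch below evaluates the logistic probability, the odds ratio, and the logit for a single explanatory variable; the coefficient values β1 and β2 and the inputs X are hypothetical and are not estimates from the chapter.

```python
import numpy as np

beta1, beta2 = -2.0, 1.5                     # hypothetical coefficients of eq. (3)
X = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])     # hypothetical values of an accounting ratio

Z = beta1 + beta2 * X                        # eq. (3)
P = 1.0 / (1.0 + np.exp(-Z))                 # eq. (2): probability of the event
odds = P / (1.0 - P)                         # eq. (5): odds ratio in favor of the event
L = np.log(odds)                             # eq. (6): the logit, linear in X

for x, p, logit in zip(X, P, L):
    print(f"X = {x:+.1f}  ->  P = {p:.3f}, logit = {logit:+.3f}")
```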
2 Fuzzy Quantification Theory

Since sample sets are commonly called groups in multivariate analysis, we call the fuzzy sets that form the samples here simply fuzzy groups. We now write the probability P(A) of a fuzzy event determined by a fuzzy set A over an n-dimensional interval R^n, which is defined by the probability P, as [2]
\[ P(A) = \int_{R^n} \mu_A(x)\, dP = E(\mu_A) \tag{9} \]
where E(µ_A) is the expected value of the membership function µ_A. From this, we can calculate the fuzzy mean and fuzzy variance for variable x as [2]
\[ m_A = \frac{1}{P(A)} \int_{R^n} x\, \mu_A(x)\, dP \tag{10} \]
\[ \sigma_A^2 = \frac{1}{P(A)} \int_{R^n} (x - m_A)^2\, \mu_A(x)\, dP \tag{11} \]
When fuzzy event A occurs, the probability P_A(B) of fuzzy event B occurring is
\[ P_A(B) = \frac{1}{P(A)} \int_{R^n} \mu_B(x)\, \mu_A(x)\, dP = E_A(\mu_B) \tag{12} \]
and so the following relationship concerning the fuzzy event arises:
\[ \sigma_A^2 = E_A\!\left[(x - m_A)^2\right] = E_A(x^2) - E_A(x)^2 \tag{13} \]
For a given sample (x₁,…,xₙ), when we are concerned with the fuzzy event A, we can now define the sample mean and variance. The size of the fuzzy set can be expressed, using the elements of the set, as
\[ N(A) = \sum_{\omega=1}^{n} \mu_A(x_\omega) \tag{14} \]
Applying this idea of the size of a fuzzy set N(A) to the sample, one can define the sample mean m_A and variance σ²_A as [2]
\[ m_A = \frac{1}{N(A)} \sum_{\omega=1}^{n} x_\omega\, \mu_A(x_\omega) \tag{15} \]
\[ \sigma_A^2 = \frac{1}{N(A)} \sum_{\omega=1}^{n} (x_\omega - m_A)^2\, \mu_A(x_\omega) \tag{16} \]
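As a small illustration of (14)–(16), the sketch below computes the fuzzy cardinality, sample mean, and sample variance of a fuzzy group from arbitrary sample values and membership degrees; the numbers are made up for the example.

```python
import numpy as np

x  = np.array([0.8, 1.9, 3.0, 4.6, 5.1])        # sample values x_1..x_n (hypothetical)
mu = np.array([1.0, 0.8, 0.5, 0.2, 0.0])        # membership degrees mu_A(x_omega)

N_A   = mu.sum()                                # eq. (14): size of the fuzzy group
m_A   = (x * mu).sum() / N_A                    # eq. (15): fuzzy sample mean
var_A = (((x - m_A) ** 2) * mu).sum() / N_A     # eq. (16): fuzzy sample variance

print(f"N(A) = {N_A:.2f}, m_A = {m_A:.3f}, sigma_A^2 = {var_A:.3f}")
```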
Using the definitions above, we can now explain variation between groups, variation within groups, and total variation for fuzzy groups. Let sample x_ω (ω = 1,…,n) be given and the membership function of fuzzy group A_i (i = 1,…,K) be defined by µ_{A_i}(x_ω). In this instance, the total mean m and the mean m_{A_i} of fuzzy group A_i are expressed by
\[ m = \frac{1}{N} \sum_{i=1}^{K} \sum_{\omega=1}^{n} x_\omega\, \mu_{A_i}(x_\omega) \tag{17} \]
\[ m_{A_i} = \frac{1}{N(A_i)} \sum_{\omega=1}^{n} x_\omega\, \mu_{A_i}(x_\omega) \tag{18} \]
where we have
\[ N = \sum_{i=1}^{K} N(A_i) \tag{19} \]
The total variation T, the variation between fuzzy groups B and the variation within fuzzy groups E are then, respectively, defined as
\[ T = \sum_{\omega=1}^{n} \sum_{i=1}^{K} (x_\omega - m)^2\, \mu_{A_i}(x_\omega) \tag{20} \]
\[ B = \sum_{\omega=1}^{n} \sum_{i=1}^{K} (m_{A_i} - m)^2\, \mu_{A_i}(x_\omega) \tag{21} \]
\[ E = \sum_{\omega=1}^{n} \sum_{i=1}^{K} (x_\omega - m_{A_i})^2\, \mu_{A_i}(x_\omega) \tag{22} \]
and
\[ T = B + E \tag{23} \]
The relationship above shows that ideas from multivariate analysis derived from relationships such as the maximum variance ratio can easily be extended to fuzzy events [2]. The object of fuzzy quantification theory as it is used in this work is to express several fuzzy groups in terms of qualitative descriptive variables that take values on [0,1] [2, 3]. Table 1 shows the data handled. We try to express, as well as possible, the structure of the external standard fuzzy groups on the real number axis using a linear equation with category weights a_i for the categories A_i:
\[ y(\omega) = \sum_{i=1}^{K} a_i\, \mu_i(\omega), \qquad \omega = 1, \ldots, n \tag{24} \]
So the problem is to determine the a_i that give the best separation of the external standard fuzzy groups on the real number axis. The degree of separation of the fuzzy groups is defined as the fuzzy variance ratio η², which is the ratio of the variation between fuzzy groups B to the total variation T from (20) and (21):
\[ \eta^2 = \frac{B}{T} \tag{25} \]

Table 1. Data used in fuzzy quantification theory
ω | fuzzy external standard: B_1 … B_M | category: A_1 … A_i … A_K
1 | µ_{B1}(1) … µ_{BM}(1) | µ_1(1) … µ_i(1) … µ_K(1)
2 | µ_{B1}(2) … µ_{BM}(2) | µ_1(2) … µ_i(2) … µ_K(2)
⋮
ω | µ_{B1}(ω) … µ_{BM}(ω) | µ_1(ω) … µ_i(ω) … µ_K(ω)
⋮
n | µ_{B1}(n) … µ_{BM}(n) | µ_1(n) … µ_i(n) … µ_K(n)
Now we determine the a_i for the linear equation (24) which maximize the fuzzy variance ratio η². The fuzzy mean ȳ_{B_r} of the value y(ω) of the linear equation within fuzzy group B_r, and the total fuzzy mean ȳ, come out as
\[ \bar{y}_{B_r} = \frac{1}{N(B_r)} \sum_{\omega=1}^{n} y(\omega)\, \mu_{B_r}(\omega), \qquad r = 1, \ldots, M \tag{26} \]
\[ \bar{y} = \frac{1}{N} \sum_{r=1}^{M} \bar{y}_{B_r}\, N(B_r) \tag{27} \]
The fuzzy mean µ̄_i^r of the membership value of category A_i within each fuzzy group B_r, and the total fuzzy mean µ̄_i, are expressed as
\[ \bar{\mu}_i^{\,r} = \frac{1}{N(B_r)} \sum_{\omega=1}^{n} \mu_i(\omega)\, \mu_{B_r}(\omega), \qquad i = 1, \ldots, K;\; r = 1, \ldots, M \tag{28} \]
\[ \bar{\mu}_i = \frac{1}{N} \sum_{r=1}^{M} \bar{\mu}_i^{\,r}\, N(B_r), \qquad i = 1, \ldots, K \tag{29} \]
Now, to simplify the notation, we define the (Mn, K) matrices A, A_G, and Ā for µ_i(ω), µ̄_i^r, and µ̄_i as
\[ A = \begin{bmatrix} \mu_1(1) & \cdots & \mu_i(1) & \cdots & \mu_K(1) \\ \vdots & & \vdots & & \vdots \\ \mu_1(n) & \cdots & \mu_i(n) & \cdots & \mu_K(n) \\ \vdots & & \vdots & & \vdots \\ \mu_1(1) & \cdots & \mu_i(1) & \cdots & \mu_K(1) \\ \vdots & & \vdots & & \vdots \\ \mu_1(n) & \cdots & \mu_i(n) & \cdots & \mu_K(n) \end{bmatrix}, \quad
A_G = \begin{bmatrix} \bar{\mu}_1^{\,1} & \cdots & \bar{\mu}_i^{\,1} & \cdots & \bar{\mu}_K^{\,1} \\ \vdots & & \vdots & & \vdots \\ \bar{\mu}_1^{\,1} & \cdots & \bar{\mu}_i^{\,1} & \cdots & \bar{\mu}_K^{\,1} \\ \bar{\mu}_1^{\,2} & \cdots & \bar{\mu}_i^{\,2} & \cdots & \bar{\mu}_K^{\,2} \\ \vdots & & \vdots & & \vdots \\ \bar{\mu}_1^{\,M} & \cdots & \bar{\mu}_i^{\,M} & \cdots & \bar{\mu}_K^{\,M} \end{bmatrix}, \quad
\bar{A} = \begin{bmatrix} \bar{\mu}_1 & \cdots & \bar{\mu}_i & \cdots & \bar{\mu}_K \\ \vdots & & \vdots & & \vdots \\ \bar{\mu}_1 & \cdots & \bar{\mu}_i & \cdots & \bar{\mu}_K \end{bmatrix} \tag{30} \]
In addition, the K-dimensional row vector a of the category weights a_i and the (Mn, Mn) diagonal matrix G formed from the membership values µ_{B_r} are defined as
\[ a' = \begin{bmatrix} a_1 & \cdots & a_i & \cdots & a_K \end{bmatrix} \tag{31a} \]
\[ G = \mathrm{diag}\bigl(\mu_{B_1}(1), \ldots, \mu_{B_1}(n), \mu_{B_2}(1), \ldots, \mu_{B_M}(n)\bigr) \tag{31b} \]
In this way, the total variation T and the variation between fuzzy groups B from (20) and (21) can be written in matrix form as [2, 3]
\[ T = a' (A - \bar{A})'\, G\, (A - \bar{A})\, a \tag{32} \]
\[ B = a' (A_G - \bar{A})'\, G\, (A_G - \bar{A})\, a \tag{33} \]
If we now substitute (32) and (33) in (25) and partially differentiate with respect to a, we obtain the relationship
\[ \bigl[G^{1/2}(A_G - \bar{A})\bigr]'\,\bigl[G^{1/2}(A_G - \bar{A})\bigr]\, a = \eta^2 \bigl[G^{1/2}(A - \bar{A})\bigr]'\,\bigl[G^{1/2}(A - \bar{A})\bigr]\, a \tag{34} \]
Now, defining the (K, K) matrices S_G and S as
\[ S_G = \bigl[G^{1/2}(A_G - \bar{A})\bigr]'\,\bigl[G^{1/2}(A_G - \bar{A})\bigr] \tag{35a} \]
\[ S = \bigl[G^{1/2}(A - \bar{A})\bigr]'\,\bigl[G^{1/2}(A - \bar{A})\bigr] \tag{35b} \]
it is possible to decompose S using a triangular matrix ∆ as S = ∆′∆, getting
\[ \bigl[(\Delta')^{-1} S_G\, \Delta^{-1}\bigr]\, \Delta a = \eta^2\, \Delta a \tag{36} \]
Because of that, the category weights a for (24), which maximize the fuzzy variance ratio η², can be obtained from the eigenvector ∆a that corresponds to the maximum eigenvalue η² of the matrix (∆′)⁻¹ S_G ∆⁻¹.
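A minimal numerical sketch of the computation implied by (28)–(36) follows; it builds the matrices from arbitrary membership data and recovers the category weights a. The function and variable names are my own, and the implementation assumes S is positive definite so that the Cholesky factorization corresponding to (36) exists.

```python
import numpy as np

def category_weights(memberships, group_memberships):
    """Category weights a maximizing the fuzzy variance ratio, per eqs. (28)-(36)."""
    n, K = memberships.shape                 # mu_i(omega): n samples, K categories
    M = group_memberships.shape[1]           # mu_Br(omega): M external standard groups
    N_B = group_memberships.sum(axis=0)      # group sizes N(B_r)
    mu_grp = (group_memberships.T @ memberships) / N_B[:, None]   # eq. (28)
    mu_tot = (N_B @ mu_grp) / N_B.sum()                           # eq. (29)
    A  = np.tile(memberships, (M, 1))                   # eq. (30), shape (M*n, K)
    AG = np.repeat(mu_grp, n, axis=0)
    Ab = np.tile(mu_tot, (M * n, 1))
    g  = group_memberships.T.reshape(-1)                # diagonal of G, eq. (31b)
    SG = (AG - Ab).T @ (g[:, None] * (AG - Ab))         # eq. (35a)
    S  = (A  - Ab).T @ (g[:, None] * (A  - Ab))         # eq. (35b)
    Delta = np.linalg.cholesky(S).T                     # S = Delta' Delta
    Dinv  = np.linalg.inv(Delta)
    eta2, vecs = np.linalg.eigh(Dinv.T @ SG @ Dinv)     # eq. (36), ascending eigenvalues
    a = Dinv @ vecs[:, -1]                              # map Delta a back to a, largest eta^2
    return a, eta2[-1]

# Arbitrary illustrative data: 6 samples, K = 3 categories, M = 2 external groups.
rng = np.random.default_rng(0)
mu_cat = rng.uniform(size=(6, 3))
mu_ext = rng.uniform(size=(6, 2))
a, eta2 = category_weights(mu_cat, mu_ext)
print("category weights a:", np.round(a, 3), " fuzzy variance ratio:", round(float(eta2), 3))
```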
3 A Practical Application

The theory shown in Sect. 2 can be used to develop a fuzzy system for credit risk analysis. In fact, a prototype was built [9] and applied to simulation data based on a work that analyzed the probability of insolvency of Brazilian enterprises using traditional multivariate analysis [1]. That work selected some key accounting variables that can efficiently pinpoint an enterprise's financial health. These variables are here named X1–X5 and their significance is as follows. Table 2 explains the meaning of the abbreviations used in the variables.

Table 2. Abbreviations used for accounting data
LS – liquid sales
CP – circulating passive
T – treasury balance
LA – liquid asset
TSI – tributary and social insurance obligations
SGC – sold goods costs
OI – operational investment
MMS – monthly medium sales
(CP/LA)/Sector Median – X1: This is an indicator of the financial structure that represents short-term indebtedness. The interpretation of the results reached with this index confirms that heavy use of third-party money by an enterprise carries it toward insolvency. It is very common in Brazil that enterprises finance their projects with credit lines whose terms and characteristics are incompatible with their cash flows, leading, in this way, to the intensive use of short-term resources. As a result, a company can incur major financial costs and, in consequence, not only its liquidity but also its return can be affected. One should note that this variable is not used here in its absolute form. Instead, it is computed relative to the index of the sector in which the enterprise operates.

OI/LS – X2: When positive, OI refers to the short-term liquid investment. In a static situation, it represents those resources necessary to keep the enterprise's actual level of operational activity. Most of the time, those resources are obtained by means of financing from onerous sources with equivalent or even longer terms. One can consider that this element keeps proportionality to the financial cycle and to the volume of sales. The sample's results show great difficulties for insolvent enterprises to finance their operations by means of natural sources linked to their activities.

T/LS – X3: This indicator is the essence of a Brazilian solvency model [4] and describes, from a more dynamic point of view than traditional balance sheet analysis, the financial situation of an enterprise. The treasury balance signals the enterprise's financial policy. If it is positive, it represents availability of resources that guarantees the enterprise's short-term liquidity. However, when it is negative, it is important to relate it to the sales level, because the size of the index can reveal imminent financial difficulties, especially when negative treasury balances are kept over successive periods of time and/or are growing. Sample data clearly reflect major difficulties for insolvent enterprises in obtaining operational financing, which leads them to use erratic sources.

Stores/SGC – X4: The industrial sector has been the object of major changes regarding the productive chain, adopting various logistic methods that invariably lead to the reduction or elimination of stores (inventories). Stores should nowadays be seen as an application of resources. Given the high financial costs in Brazil, this turnover coefficient, when high, affects the profitability and mainly the liquidity of an enterprise. A question that requires great attention from the analyst is when enterprises start holding very high stores. That can mean either a strategic policy guided by predictive premises regarding the market, although with remarkable risk, or the formation of stores that can allow the enterprise to keep producing for a given amount of time in case it faces imminent insolvency. When the market identifies such a situation, the regular credit offer is normally affected.
For the reasons pointed out above, one can see the coherence of selecting this index as a discriminating element.

TSI/MMS – X5: This coefficient provides a liquidity measure. Brazil's confusing and onerous fiscal policy, allied to conjunctural economic problems, has been driving the growth of taxes relative to enterprises' invoices, increasingly binding their operational margins. This situation has led a very significant number of enterprises not to acknowledge these obligations in their books. It is also true that the number of tax assessments has been growing, resulting in a nonspontaneous acknowledgment of these obligations, driving enterprises to pay them in installments and, as a consequence, making the amounts accounted under this title grow even more. Thus, this indicator gained importance in differentiating healthy enterprises from those with social insurance and/or tributary problems in Brazil.

In [9] each variable was assigned three membership functions, namely insolvent, undetermined, and solvent. Table 3 shows, for each variable, the means that were chosen as centers of the membership functions (Fig. 1). Based on a previous analysis regarding the use of different curves as membership functions [8], triangles were chosen for the membership functions because of their ease of use. Also, the partition of unity between neighboring functions was followed. Those mean values came from simulation data on the status of enterprises and follow real data analyzed in a previous work [1]. The model was then tuned by finding the weights a_i that give the desired responses to the training data according to (24) and (25).

Table 3. Centers of the membership functions used for the variables

variable      X1    X2     X3    X4     X5
insolvent     5.2   −3.5   3.2   −2.9   0.4
undetermined  3     −1.35  1.85  −1.45  0.3
solvent       0.8   0.8    0.5   0      0.2
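The triangular membership functions with partition of unity can be sketched as follows; the interpolation scheme (each function equal to 1 at its own center from Table 3 and falling linearly to 0 at the neighboring centers) is my reading of Fig. 1 and is not code from the chapter.

```python
import numpy as np

# Centers taken from Table 3, in the order (insolvent, undetermined, solvent).
CENTERS = {"X1": (5.2, 3.0, 0.8), "X2": (-3.5, -1.35, 0.8), "X3": (3.2, 1.85, 0.5),
           "X4": (-2.9, -1.45, 0.0), "X5": (0.4, 0.3, 0.2)}

def memberships(x, centers):
    """Triangular memberships (insolvent, undetermined, solvent) forming a partition of unity."""
    c = np.asarray(centers, dtype=float)
    order = np.argsort(c)                   # work along an increasing axis
    cs = c[order]
    m = np.zeros(3)
    if x <= cs[0]:
        m[0] = 1.0                          # saturate beyond the outer centers
    elif x >= cs[2]:
        m[2] = 1.0
    else:
        j = 0 if x < cs[1] else 1           # segment [cs[j], cs[j+1]] containing x
        w = (x - cs[j]) / (cs[j + 1] - cs[j])
        m[j], m[j + 1] = 1.0 - w, w         # neighboring degrees sum to 1
    out = np.zeros(3)
    out[order] = m                          # back to (insolvent, undetermined, solvent) order
    return out

print(memberships(4.0, CENTERS["X1"]))      # e.g. partly insolvent, partly undetermined
```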
As we have seen in Table 1, there are no theoretical limits on the number of fuzzy external standards B1,…,BM. However, in practical computations a large number of external groups can prevent a system from running properly. The decision regarding the adequate number of fuzzy external groups is up to the analyst.
Fig. 1. Partition of unity between membership functions
4 Results and Conclusion

From a group of 1,500 companies, 500 were randomly chosen to tune the model and the remaining 1,000 were used to test it in a cross-validation style. Of these, 150 were undetermined. Once the model had been tuned, the values of variables X1–X5 were applied for each company of the validation group, so that a crisp classification of the companies is found as the result of (28). Tables 4 and 5 show the results reached. As we can see from Table 4, the results achieved show that the model's efficiency is very high. For comparison, the model that used a Z-score [1] reached an overall precision of 98.45%.

Although the results reached are excellent by themselves, the fuzzy model developed here is much more understandable, since it shows how the state of solvency of an enterprise can be seen either from the individual accounting variables or from their combined result. Moreover, linguistic variables show exactly to which group a company belongs, whereas a Z-score model attributes to the variables weights that cannot be easily understood. One can also see that the variables chosen for the model are very representative of the health of a Brazilian enterprise. None of them alone can be said to be highly representative of the state of solvency of a company but, as we have seen, taken together they can be of great help in analyzing a company's financial health. We then conclude that the model is very usable for anyone who wants to know whether or not a Brazilian company is facing future difficulties.
Table 4. Individual results reached using fuzzy quantification theory
company | X1 (solv., undet., insolv.) | X2 (solv., undet., insolv.) | … | X5 (solv., undet., insolv.)
1       | …                           | …                           | … | …
2       | …                           | …                           | … | …
⋮
n       | …                           | …                           | … | …
The values found for the coefficients were a1 = 0.2242, a2 = 0.1336, a3 = 0.3105, a4 = 0.1763, a5 = 0.1554, giving the function
y(x) = 0.2242 X1 + 0.1336 X2 + 0.3105 X3 + 0.1763 X4 + 0.1554 X5
Just for comparison purposes, the model obtained in [1] gave the following logistic regression function:
g(x) = 4.4728 − 1.659 X1 − 1.2182 X2 + 4.1434 X3 + 6.1519 X4 − 1.885 X5
It is clear that this function is much less understandable than the one we get by applying fuzzy quantification methods. One should keep in mind that, once g(x) has been found with the Logit model, it should be applied to the equation
\[ P_i = \frac{1}{1 + e^{-Z_i}} \]
in order to obtain the solvency status of a company.

Table 5. Results reached in [9]

             solvent        insolvent      total
tuning       300 (100%)     200 (100%)     500 (100%)
validation   450 (100%)     400 (100%)     850 (100%)
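As a usage illustration, the sketch below evaluates the published fuzzy score y(x) and, for comparison, pushes the Z-score function g(x) of [1] through the logistic equation above. The five input values in each case are hypothetical and invented for the example, and the sign of the last coefficient of g(x) follows the reconstruction given above.

```python
import math

# Published category weights of the fuzzy model (this chapter).
a = [0.2242, 0.1336, 0.3105, 0.1763, 0.1554]
# Hypothetical per-variable scores for one company (illustrative values only).
x_fuzzy = [0.7, 0.4, 0.8, 0.5, 0.6]
y = sum(ai * xi for ai, xi in zip(a, x_fuzzy))             # fuzzy score y(x)

# Logistic-regression function g(x) of [1]; the raw ratios X1..X5 below are invented.
X = [1.2, -0.4, 0.3, -0.8, 0.25]
g = 4.4728 - 1.659*X[0] - 1.2182*X[1] + 4.1434*X[2] + 6.1519*X[3] - 1.885*X[4]
P = 1.0 / (1.0 + math.exp(-g))                             # probability via the equation above

print(f"fuzzy score y = {y:.3f}, logit score g = {g:.3f}, P = {P:.3f}")
```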
References
1. Minussi, J. A., Damacena, C., Ness Jr., W. L.; Um modelo de Previsão de Solvência Utilizando Regressão Logística. Revista de Administração Contemporânea, V6, 3, pp. 109–128, Curitiba, Brasil, 2002
2. Terano, T., Asai, K., Sugeno, M., Aschmann, C.; Fuzzy Systems Theory and its Applications. San Diego, CA, Academic, 1992
3. Watada, J., Tanaka, E., Asai, K.; Analysis of purchasing factors using fuzzy quantification theory type II. Journal of the Japan Industrial Management Association, 32(5), 51–65, 1981
4. Fleuriet, M., Kehdy, R., Blanc, G.; A dinâmica financeira das empresas brasileiras. Belo Horizonte, Brasil, Fundação Dom Cabral, 1980
5. Caouette, J. B., Altman, E. I., Narayanan, P.; Managing Credit Risk. New York, Wiley, 1998
6. Saunders, Anthony; Medindo o Risco de Crédito. Rio de Janeiro, Qualitymark Editora, 2000
7. Zimmermann, H. J.; Fuzzy Set Theory and its Applications. Massachusetts, Kluwer, 1994
8. Alves, A. C. P. D.; Modelos Neuro-Nebulosos para Identificação de Sistemas Aplicados à Operação de Centrais Nucleares. PhD Thesis, COPPE-UFRJ, Rio de Janeiro, Brasil, 2000
9. Alves, A. C. P. D.; Analyzing the Solvency Status of Brazilian Enterprises with Fuzzy Models. Proceedings of the International Conference on Fuzzy Sets and Soft Computing in Economics and Finance, V2, 446–456, St. Petersburg, Russia, 2004
10. Hair, J. F., Anderson, R. E., Tatham, R. L., Black, W. C.; Multivariate Data Analysis. 5th ed., New Jersey, Prentice Hall, 1998
11. Gujarati, Damodar N.; Basic Econometrics. 3rd ed., Singapore, McGraw-Hill, 1995