Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5946
James F. Peters Andrzej Skowron (Eds.)
Transactions on Rough Sets XI
Editors-in-Chief

James F. Peters
University of Manitoba
Department of Electrical and Computer Engineering
Winnipeg, Manitoba, R3T 5V6, Canada
E-mail: [email protected]

Andrzej Skowron
Warsaw University
Institute of Mathematics
Banacha 2, 02-097 Warsaw, Poland
E-mail: [email protected]

Library of Congress Control Number: 2009943065
CR Subject Classification (1998): F.4.1, F.1.1, H.2.8, I.5, I.4, I.2
ISSN 0302-9743 (Lecture Notes in Computer Science)
ISSN 1861-2059 (Transactions on Rough Sets)
ISBN-10 3-642-11478-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-11478-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12837151 06/3180 5 4 3 2 1 0
Preface
Volume XI of the Transactions on Rough Sets (TRS) provides evidence of further growth in the rough set landscape, both in terms of its foundations and applications. This volume provides further evidence of the number of research streams that were either directly or indirectly initiated by the seminal work on rough sets by Zdzislaw Pawlak (1926-2006)^1. Evidence of the growth of various rough set-based research streams can be found in the rough set database^2.

This volume contains articles introducing advances in the foundations and applications of rough sets. These advances include: a calculus of attribute-value pairs useful in mining numerical data, definability and coalescence of approximations, a variable consistency generalization approach to bagging controlled by measures of consistency, classical and dominance-based rough sets in the search for genes, judgement about satisfiability under incomplete information, irreducible descriptive sets of attributes for information systems useful in the design of concurrent data models, the computational theory of perceptions (CTP), its characteristics and its relation to fuzzy granulation, methods and algorithms of the NetTRS system, a recursive version of the apriori algorithm designed for parallel processing, and a decision table reduction method based on fuzzy rough sets.

The editors and authors of this volume extend their gratitude to the reviewers of the articles in this volume, to Alfred Hofmann, Ursula Barth, Christine Reiss and the LNCS staff at Springer for their support in making this volume of the TRS possible. The editors of this volume were supported by the Ministry of Science and Higher Education of the Republic of Poland, research grants N N516 368334 and N N516 077837, the Natural Sciences and Engineering Research Council of Canada (NSERC) research grant 185986, the Canadian Network of Excellence (NCE), and a Canadian Arthritis Network (CAN) grant SRI-BIO-05.

October 2009

James F. Peters
Andrzej Skowron

^1 See, e.g., Peters, J.F., Skowron, A.: Zdzislaw Pawlak: Life and Work, Transactions on Rough Sets V (2006) 1-24; Pawlak, Z.: A Treatise on Rough Sets, Transactions on Rough Sets IV (2006) 1-17. See also Pawlak, Z., Skowron, A.: Rudiments of rough sets, Information Sciences 177 (2007) 3-27; Pawlak, Z., Skowron, A.: Rough sets: Some extensions, Information Sciences 177 (2007) 28-40; Pawlak, Z., Skowron, A.: Rough sets and Boolean reasoning, Information Sciences 177 (2007) 41-73.
^2 http://rsds.wsiz.rzeszow.pl/rsds.php
LNCS Transactions on Rough Sets
The Transactions on Rough Sets has as its principal aim the fostering of professional exchanges between scientists and practitioners who are interested in the foundations and applications of rough sets. Topics include foundations and applications of rough sets as well as foundations and applications of hybrid methods combining rough sets with other approaches important for the development of intelligent systems. The journal includes high-quality research articles accepted for publication on the basis of thorough peer reviews. Dissertations and monographs up to 250 pages that include new research results can also be considered as regular papers. Extended and revised versions of selected papers from conferences can also be included in regular or special issues of the journal.

Editors-in-Chief: James F. Peters, Andrzej Skowron
Managing Editor: Sheela Ramanna
Technical Editor: Marcin Szczuka
Editorial Board

Mohua Banerjee
Jan Bazan
Gianpiero Cattaneo
Mihir K. Chakraborty
Davide Ciucci
Chris Cornelis
Ivo Düntsch
Anna Gomolińska
Salvatore Greco
Jerzy W. Grzymala-Busse
Masahiro Inuiguchi
Jouni Järvinen
Richard Jensen
Bożena Kostek
Churn-Jung Liau
Pawan Lingras
Victor Marek
Mikhail Moshkov
Hung Son Nguyen
Ewa Orlowska
Sankar K. Pal
Lech Polkowski
Henri Prade
Sheela Ramanna
Roman Słowiński
Jerzy Stefanowski
Jaroslaw Stepaniuk
Zbigniew Suraj
Marcin Szczuka
Dominik Ślęzak
Roman Świniarski
Shusaku Tsumoto
Guoyin Wang
Marcin Wolski
Wei-Zhi Wu
Yiyu Yao
Ning Zhong
Wojciech Ziarko
Table of Contents
Mining Numerical Data – A Rough Set Approach
  Jerzy W. Grzymala-Busse ... 1

Definability and Other Properties of Approximations for Generalized Indiscernibility Relations
  Jerzy W. Grzymala-Busse and Wojciech Rząsa ... 14

Variable Consistency Bagging Ensembles
  Jerzy Błaszczyński, Roman Słowiński, and Jerzy Stefanowski ... 40

Classical and Dominance-Based Rough Sets in the Search for Genes under Balancing Selection
  Krzysztof A. Cyran ... 53

Satisfiability Judgement under Incomplete Information
  Anna Gomolińska ... 66

Irreducible Descriptive Sets of Attributes for Information Systems
  Mikhail Moshkov, Andrzej Skowron, and Zbigniew Suraj ... 92

Computational Theory Perception (CTP), Rough-Fuzzy Uncertainty Analysis and Mining in Bioinformatics and Web Intelligence: A Unified Framework
  Sankar K. Pal ... 106

Decision Rule-Based Data Models Using TRS and NetTRS – Methods and Algorithms
  Marek Sikora ... 130

A Distributed Decision Rules Calculation Using Apriori Algorithm
  Tomasz Strąkowski and Henryk Rybiński ... 161

Decision Table Reduction in KDD: Fuzzy Rough Based Approach
  Eric Tsang and Suyun Zhao ... 177

Author Index ... 189
Mining Numerical Data – A Rough Set Approach

Jerzy W. Grzymala-Busse

Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA
and Institute of Computer Science, Polish Academy of Sciences, 01-237 Warsaw, Poland
[email protected], http://lightning.eecs.ku.edu/index.html

Abstract. We present an approach to mining numerical data based on rough set theory using the calculus of attribute-value blocks. An algorithm implementing these ideas, called MLEM2, induces high quality rules in terms of both simplicity (number of rules and total number of conditions) and accuracy. MLEM2 induces rules not only from complete data sets but also from data with missing attribute values, with or without numerical attributes. Additionally, we present experimental results on a comparison of three commonly used discretization techniques: equal interval width, equal interval frequency and minimal class entropy (all three methods were combined with the LEM2 rule induction algorithm) with MLEM2. Our conclusion is that even though MLEM2 was most frequently a winner, the differences between all four data mining methods are statistically insignificant.
1 Introduction
For knowledge acquisition (or data mining) from data with numerical attributes special techniques are applied [13]. Most frequently, an additional step, taken before the main step of rule induction or decision tree generation and called discretization, is used. In this preliminary step numerical data are converted into symbolic data or, more precisely, the domain of a numerical attribute is partitioned into intervals. Many discretization techniques, using principles such as equal interval width, equal interval frequency, minimal class entropy, minimum description length, clustering, etc., were explored, e.g., in [1,2,3,5,6,8,9,10,20,23,24,25,26], and [29]. Discretization algorithms which operate on the set of all attributes and which do not use information about the decision (concept membership) are called unsupervised, as opposed to supervised, where the decision is taken into account [9]. Methods processing the entire attribute set are called global, while methods working on one attribute at a time are called local [8]. In all of these methods discretization is a preprocessing step and is undertaken before the main process of knowledge acquisition. Another possibility is to discretize numerical attributes during the process of knowledge acquisition. Examples of such methods are MLEM2 [14] and MODLEM [21,31,32] for rule induction and C4.5 [30] and CART [4] for decision tree
generation. These algorithms deal with original, numerical data, and the processes of knowledge acquisition and discretization are conducted at the same time. The MLEM2 algorithm produces better rule sets, in terms of both simplicity and accuracy, than clustering methods [15]. However, discretization is an art rather than a science, and for a specific data set it is advantageous to use as many discretization algorithms as possible and then select the best approach. In this paper we present the MLEM2 algorithm, one of the most successful approaches to mining numerical data. This algorithm uses rough set theory and the calculus of attribute-value pair blocks. A similar approach is represented by MODLEM. Both the MLEM2 and MODLEM algorithms are outgrowths of the LEM2 algorithm. However, in MODLEM the most essential part, selecting the best attribute-value pair, is conducted using entropy or Laplacian conditions, while in MLEM2 this selection uses the most relevant attribute-value pair, just like in the original LEM2. Additionally, we present experimental results on a comparison of three commonly used discretization techniques: equal interval width, equal interval frequency and minimal class entropy (all three methods combined with the LEM2 rule induction algorithm) with MLEM2. Our conclusion is that even though MLEM2 was most frequently a winner, the differences between all four data mining methods are statistically insignificant. A preliminary version of this paper was presented at the International Conference on Rough Sets and Emerging Intelligent Systems Paradigms, Warsaw, Poland, June 28–30, 2007 [19].
2 Discretization Methods
For a numerical attribute a with an interval [a, b] as its range, a partition of the range into n intervals

{[a_0, a_1), [a_1, a_2), ..., [a_{n-2}, a_{n-1}), [a_{n-1}, a_n]},

where a_0 = a, a_n = b, and a_i < a_{i+1} for i = 0, 1, ..., n − 1, defines a discretization of a. The numbers a_1, a_2, ..., a_{n-1} are called cut-points. The simplest and most commonly used discretization methods are the local methods called Equal Interval Width and Equal Frequency per Interval [8,13]. Another local discretization method [10] is called Minimal Class Entropy. The conditional entropy, defined by a cut-point q that splits the set U of all cases into two sets S_1 and S_2, is defined as follows

E(q, U) = \frac{|S_1|}{|U|} E(S_1) + \frac{|S_2|}{|U|} E(S_2),

where E(S) is the entropy of S and |X| denotes the cardinality of the set X. The cut-point q for which the conditional entropy E(q, U) has the smallest value is selected as the best cut-point. If k intervals are required, the procedure is applied recursively k − 1 times. Let q_1 and q_2 be the best cut-points for sets S_1 and S_2, respectively. If E(q_1, S_1) > E(q_2, S_2) we select q_1 as the next cut-point, if not, we select q_2.

In our experiments three local methods of discretization: Equal Interval Width, Equal Interval Frequency and Minimal Class Entropy were converted to global methods using an approach of globalization presented in [8]. First, we discretize all attributes, one at a time, selecting the best cut-point for all attributes. If the level of consistency is sufficient, the process is completed. If not, we further discretize, selecting an attribute a for which the following expression has the largest value

M_a = \frac{\sum_{B \in \{a\}^*} \frac{|B|}{|U|} E(B)}{|\{a\}^*|}.

In all six discretization methods discussed in this paper, the stopping condition was the level of consistency [8], based on rough set theory introduced by Z. Pawlak in [27]. Let U denote the set of all cases of the data set. Let P denote a nonempty subset of the set of all variables, i.e., attributes and a decision. Obviously, set P defines an equivalence relation ℘ on U, where two cases x and y from U belong to the same equivalence class of ℘ if and only if both x and y are characterized by the same values of each variable from P. The set of all equivalence classes of ℘, i.e., a partition on U, will be denoted by P*. Equivalence classes of ℘ are called elementary sets of P. Any finite union of elementary sets of P is called a definable set in P. Let X be any subset of U. In general, X is not a definable set in P. However, set X may be approximated by two definable sets in P; the first one is called a lower approximation of X in P, denoted by \underline{P}X and defined as follows

\underline{P}X = ∪ {Y ∈ P* | Y ⊆ X}.

The second set is called an upper approximation of X in P, denoted by \overline{P}X and defined as follows

\overline{P}X = ∪ {Y ∈ P* | Y ∩ X ≠ ∅}.

The lower approximation of X in P is the greatest definable set in P contained in X. The upper approximation of X in P is the least definable set in P containing X. A rough set of X is the family of all subsets of U having the same lower and the same upper approximations of X. A level of consistency [8], denoted L_c, is defined as follows

L_c = \frac{\sum_{X \in \{d\}^*} |\underline{A}X|}{|U|}.

Practically, the requested level of consistency for discretization is 100%, i.e., we want the discretized data set to be consistent.
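The minimal class entropy criterion above can be sketched in a few lines of Python. This is our own illustrative reconstruction (function names and the toy data are not from the paper); it simply evaluates E(q, U) at every candidate cut-point and keeps the smallest.

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of decision values.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    # Pick the cut-point q minimizing E(q, U) = |S1|/|U| E(S1) + |S2|/|U| E(S2).
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    candidates = sorted({(pairs[i][0] + pairs[i + 1][0]) / 2
                         for i in range(n - 1)
                         if pairs[i][0] != pairs[i + 1][0]})
    best_q, best_e = None, float("inf")
    for q in candidates:
        s1 = [d for v, d in pairs if v < q]
        s2 = [d for v, d in pairs if v >= q]
        e = (len(s1) / n) * entropy(s1) + (len(s2) / n) * entropy(s2)
        if e < best_e:
            best_q, best_e = q, e
    return best_q, best_e

# A toy numerical attribute with a binary decision.
print(best_cut_point([180, 240, 280, 240, 280, 320],
                     ["no", "yes", "yes", "no", "no", "yes"]))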
3 MLEM2
The MLEM2 algorithm is a part of the LERS (Learning from Examples based on Rough Sets) data mining system. Rough set theory was initiated by Z. Pawlak
[27,28]. LERS uses two different approaches to rule induction: one is used in machine learning, the other in knowledge acquisition. In machine learning, or more specifically, in learning from examples (cases), the usual task is to learn the smallest set of minimal rules describing the concept. To accomplish this goal, LERS uses two algorithms: LEM1 and LEM2 (LEM1 and LEM2 stand for Learning from Examples Module, version 1 and 2, respectively) [7,11,12].

The LEM2 algorithm is based on the idea of an attribute-value pair block. For an attribute-value pair (a, v) = t, the block of t, denoted by [t], is the set of all cases from U that for attribute a have value v. For a set T of attribute-value pairs, the intersection of blocks for all t from T will be denoted by [T]. Let B be a nonempty lower or upper approximation of a concept represented by a decision-value pair (d, w). Set B depends on a set T of attribute-value pairs t = (a, v) if and only if

∅ ≠ [T] = \bigcap_{t ∈ T} [t] ⊆ B.

Set T is a minimal complex of B if and only if B depends on T and no proper subset T' of T exists such that B depends on T'. Let 𝒯 be a nonempty collection of nonempty sets of attribute-value pairs. Then 𝒯 is a local covering of B if and only if the following conditions are satisfied:

– each member T of 𝒯 is a minimal complex of B,
– \bigcup_{T ∈ 𝒯} [T] = B, and
– 𝒯 is minimal, i.e., 𝒯 has the smallest possible number of members.

The user may select an option of LEM2 with or without taking into account attribute priorities. The procedure LEM2 with attribute priorities is presented below. The option without taking into account priorities differs from the one presented below in the selection of a pair t ∈ T(G) in the inner WHILE loop. When LEM2 is not to take attribute priorities into account, the first criterion is ignored. In our experiments all attribute priorities were equal to each other.

Procedure LEM2
(input: a set B, output: a single local covering 𝒯 of set B);
begin
  G := B;
  𝒯 := ∅;
  while G ≠ ∅
    begin
      T := ∅;
      T(G) := {t | [t] ∩ G ≠ ∅};
      while T = ∅ or [T] ⊄ B
        begin
          select a pair t ∈ T(G) with the highest attribute priority;
          if a tie occurs, select a pair t ∈ T(G) such that |[t] ∩ G| is maximum;
          if another tie occurs, select a pair t ∈ T(G) with the smallest cardinality of [t];
          if a further tie occurs, select the first pair;
          T := T ∪ {t};
          G := [t] ∩ G;
          T(G) := {t | [t] ∩ G ≠ ∅};
          T(G) := T(G) − T;
        end {while}
      for each t ∈ T do
        if [T − {t}] ⊆ B then T := T − {t};
      𝒯 := 𝒯 ∪ {T};
      G := B − \bigcup_{T ∈ 𝒯} [T];
    end {while};
  for each T ∈ 𝒯 do
    if \bigcup_{S ∈ 𝒯 − {T}} [S] = B then 𝒯 := 𝒯 − {T};
end {procedure}.
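For readers who prefer running code, a compact Python sketch of the LEM2 loop (without attribute priorities) might look as follows. This is our own illustrative reconstruction, not the LERS implementation; blocks are represented as sets of case identifiers, and ties beyond the two criteria are broken by dictionary order ("first pair"). The usage example anticipates the attribute-value blocks of the worked example in the next section and returns the same two minimal complexes derived there.

def lem2(B, blocks):
    # Compute a single local covering of B (a lower or upper approximation of
    # a concept), given blocks: a dict mapping each attribute-value pair t to
    # its block [t] (a set of case ids).  Sketch only, no attribute priorities.
    B = set(B)
    covering = []                      # the local covering
    G = set(B)
    while G:
        T = []                         # the minimal complex under construction
        TG = [t for t, blk in blocks.items() if blk & G]
        block_T = None                 # [T], intersection of the chosen blocks
        while not T or not (block_T <= B):
            # highest |[t] & G|, then smallest |[t]|, then first pair
            t = max(TG, key=lambda t: (len(blocks[t] & G), -len(blocks[t])))
            T.append(t)
            block_T = blocks[t] if block_T is None else block_T & blocks[t]
            G = blocks[t] & G
            TG = [s for s, blk in blocks.items() if blk & G and s not in T]
        for t in list(T):              # drop redundant attribute-value pairs
            rest = [s for s in T if s != t]
            if rest and set.intersection(*[blocks[s] for s in rest]) <= B:
                T = rest
        covering.append(T)
        covered = set().union(*[set.intersection(*[blocks[s] for s in C])
                                for C in covering])
        G = B - covered
    for C in list(covering):           # drop redundant minimal complexes
        rest = [D for D in covering if D is not C]
        if rest and set().union(*[set.intersection(*[blocks[s] for s in D])
                                  for D in rest]) == B:
            covering.remove(C)
    return covering

blocks = {
    ("Gender", "man"): {1, 2, 3},                 ("Gender", "woman"): {4, 5, 6},
    ("Cholesterol", "180..210"): {1},             ("Cholesterol", "210..320"): {2, 3, 4, 5, 6},
    ("Cholesterol", "180..260"): {1, 2, 4},       ("Cholesterol", "260..320"): {3, 5, 6},
    ("Cholesterol", "180..300"): {1, 2, 3, 4, 5}, ("Cholesterol", "300..320"): {6},
}
print(lem2({1, 4, 5}, blocks))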
For a set X, |X| denotes the cardinality of X. Rules induced from raw, training data are used for classification of unseen, testing data. The classification system of LERS is a modification of the bucket brigade algorithm. The decision to which concept a case belongs is made on the basis of three factors: strength, specificity, and support. They are defined as follows. Strength is the total number of cases correctly classified by the rule during training. Specificity is the total number of attribute-value pairs on the left-hand side of the rule; matching rules with a larger number of attribute-value pairs are considered more specific. The third factor, support, is defined as the sum of scores of all matching rules from the concept. The concept C for which the support (i.e., the sum of all products of strength and specificity, for all rules matching the case) is the largest is the winner and the case is classified as being a member of C.

MLEM2, a modified version of LEM2, categorizes all attributes into two categories: numerical attributes and symbolic attributes. For numerical attributes MLEM2 computes blocks in a different way than for symbolic attributes. First, it sorts all values of a numerical attribute. Then it computes cutpoints as averages of any two consecutive values of the sorted list. For each cutpoint x MLEM2 creates two blocks: the first block contains all cases for which values of the numerical attribute are smaller than x, the second block contains the remaining cases, i.e., all cases for which values of the numerical attribute are larger than x. The search space of MLEM2 is the set of all blocks computed this way, together with blocks defined by symbolic attributes. Starting from that point, rule induction in MLEM2 is conducted the same way as in LEM2.

Let us illustrate the MLEM2 algorithm using the example from Table 1. Rows of the decision table represent cases, while columns are labeled by variables. The set of all cases will be denoted by U. In Table 1, U = {1, 2, ..., 6}.
Table 1. An example of the decision table

         Attributes               Decision
Case     Gender     Cholesterol   Stroke
1        man        180           no
2        man        240           yes
3        man        280           yes
4        woman      240           no
5        woman      280           no
6        woman      320           yes
Independent variables are called attributes and a dependent variable is called a decision and is denoted by d. The set of all attributes will be denoted by A. In Table 1, A = {Gender, Cholesterol}. Any decision table defines a function ρ that maps the direct product of U and A into the set of all values. For example, in Table 1, ρ(1, Gender) = man. The decision table from Table 1 is consistent, i.e., there are no conflicting cases in which all attribute values are identical yet the decision values are different. Subsets of U with the same decision value are called concepts. In Table 1 there are two concepts: {1, 4, 5} and {2, 3, 6}.

Table 1 contains one numerical attribute (Cholesterol). The sorted list of values of Cholesterol is 180, 240, 280, 320. The corresponding cutpoints are: 210, 260, 300. Since our decision table is consistent, the input sets to be applied to MLEM2 are concepts. The search space for MLEM2 is the set of all blocks for all possible attribute-value pairs (a, v) = t. For Table 1, the set of all attribute-value pair blocks is

[(Gender, man)] = {1, 2, 3},
[(Gender, woman)] = {4, 5, 6},
[(Cholesterol, 180..210)] = {1},
[(Cholesterol, 210..320)] = {2, 3, 4, 5, 6},
[(Cholesterol, 180..260)] = {1, 2, 4},
[(Cholesterol, 260..320)] = {3, 5, 6},
[(Cholesterol, 180..300)] = {1, 2, 3, 4, 5},
[(Cholesterol, 300..320)] = {6}.
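A small Python sketch of how these numerical blocks can be generated (our own illustration; the function and variable names are not from the paper):

def numerical_blocks(attribute, values):
    # MLEM2-style blocks for a numerical attribute: sort the distinct values,
    # take cutpoints as averages of consecutive values, and for every cutpoint
    # create a 'below' block and an 'above' block (case ids are 1-based).
    distinct = sorted(set(values))
    lo, hi = distinct[0], distinct[-1]
    blocks = {}
    for a, b in zip(distinct, distinct[1:]):
        cut = (a + b) / 2
        below = {i + 1 for i, v in enumerate(values) if v < cut}
        above = {i + 1 for i, v in enumerate(values) if v > cut}
        blocks[(attribute, f"{lo}..{cut:g}")] = below
        blocks[(attribute, f"{cut:g}..{hi}")] = above
    return blocks

# The Cholesterol column of Table 1 reproduces the blocks listed above.
print(numerical_blocks("Cholesterol", [180, 240, 280, 240, 280, 320]))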
Let us start running MLEM2 for the concept {1, 4, 5}. Thus, initially this concept is equal to B (and to G). The set T(G) is equal to {(Gender, man), (Gender, woman), (Cholesterol, 180..210), (Cholesterol, 210..320), (Cholesterol, 180..260), (Cholesterol, 260..320), (Cholesterol, 180..300)}. For the attribute-value pair (Cholesterol, 180..300) from T(G) the value |[(attribute, value)] ∩ G| is maximum. Thus we select our first attribute-value pair t = (Cholesterol, 180..300). Since [(Cholesterol, 180..300)] ⊄ B, we have to perform the next iteration of the inner WHILE loop. This time T(G) =
{(Gender, man), (Gender, woman), (Cholesterol, 180..210), (Cholesterol, 210..320), (Cholesterol, 180..260), (Cholesterol, 260..320)}. For three attribute-value pairs from T(G): (Gender, woman), (Cholesterol, 210..320) and (Cholesterol, 180..260) the value of |[(attribute, value)] ∩ G| is maximum (and equal to two). The second criterion, the smallest cardinality of [(attribute, value)], indicates (Gender, woman) and (Cholesterol, 180..260) (in both cases that cardinality is equal to three). The last criterion, "first pair", selects (Gender, woman). Moreover, the new T = {(Cholesterol, 180..300), (Gender, woman)} and the new G is equal to {4, 5}. Since [T] = [(Cholesterol, 180..300)] ∩ [(Gender, woman)] = {4, 5} ⊆ B, the first minimal complex is computed. Furthermore, we cannot drop any of these two attribute-value pairs, so 𝒯 = {T}, and the new G is equal to B − {4, 5} = {1}. During the second iteration of the outer WHILE loop, the next minimal complex T is identified as {(Cholesterol, 180..210)}, so 𝒯 = {{(Cholesterol, 180..300), (Gender, woman)}, {(Cholesterol, 180..210)}} and G = ∅. The remaining rule set, for the concept {2, 3, 6}, is induced in a similar manner. Eventually, the rules in the LERS format (every rule is equipped with three numbers: the total number of attribute-value pairs on the left-hand side of the rule, the total number of examples correctly classified by the rule during training, and the total number of training cases matching the left-hand side of the rule) are:

2, 2, 2
(Gender, woman) & (Cholesterol, 180..300) -> (Stroke, no)
1, 1, 1
(Cholesterol, 180..210) -> (Stroke, no)
2, 2, 2
(Gender, man) & (Cholesterol, 210..320) -> (Stroke, yes)
1, 1, 1
(Cholesterol, 300..320) -> (Stroke, yes)
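The LERS classification scheme described earlier (strength, specificity, support) can be sketched as follows. This is our own illustration under simplifying assumptions: complete matching only, with each induced rule encoded by hand as a set of predicates; it is not the LERS implementation.

def classify(case, rules):
    # Each rule is (conditions, concept, specificity, strength); the concept
    # with the largest support, i.e. the sum of strength * specificity over
    # all rules matching the case, wins.
    support = {}
    for conditions, concept, specificity, strength in rules:
        if all(test(case[attr]) for attr, test in conditions.items()):
            support[concept] = support.get(concept, 0) + strength * specificity
    return max(support, key=support.get) if support else None

# The rules induced above, with numerical conditions turned into interval tests.
rules = [
    ({"Gender": lambda v: v == "woman",
      "Cholesterol": lambda v: 180 <= v <= 300}, ("Stroke", "no"), 2, 2),
    ({"Cholesterol": lambda v: 180 <= v <= 210}, ("Stroke", "no"), 1, 1),
    ({"Gender": lambda v: v == "man",
      "Cholesterol": lambda v: 210 <= v <= 320}, ("Stroke", "yes"), 2, 2),
    ({"Cholesterol": lambda v: 300 <= v <= 320}, ("Stroke", "yes"), 1, 1),
]
print(classify({"Gender": "man", "Cholesterol": 250}, rules))  # ('Stroke', 'yes')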
4 Numerical and Incomplete Data
Input data for data mining are frequently affected by missing attribute values. In other words, the corresponding function ρ is incompletely specified (partial). A decision table with an incompletely specified function ρ will be called incompletely specified, or incomplete. Though four different interpretations of missing attribute values were studied [18], in this paper, for simplicity, we will consider only two: lost values (the values that were recorded but currently are unavailable) and "do not care" conditions (the original values were irrelevant). For the rest of the paper we will assume that all decision values are specified, i.e., they are not missing. Also, we will assume that all missing attribute values are denoted either by "?" or by "*": lost values will be denoted by "?", "do not care" conditions will be denoted by "*". Additionally, we will assume that for each case at least one attribute value is specified.
Incomplete decision tables are described by characteristic relations instead of indiscernibility relations. Also, elementary blocks are replaced by characteristic sets, see, e.g., [16,17,18]. An example of an incomplete table is presented in Table 2.

Table 2. An example of the incomplete decision table

         Attributes               Decision
Case     Gender     Cholesterol   Stroke
1        ?          180           no
2        man        *             yes
3        man        280           yes
4        woman      240           no
5        woman      ?             no
6        woman      320           yes
For incomplete decision tables the definition of a block of an attribute-value pair must be modified. If for an attribute a there exists a case x such that ρ(x, a) = ?, i.e., the corresponding value is lost, then the case x is not included in the block [(a, v)] for any value v of attribute a. If for an attribute a there exists a case x such that the corresponding value is a "do not care" condition, i.e., ρ(x, a) = *, then the case x should be included in the blocks [(a, v)] for all values v of attribute a. This modification of the definition of the block of an attribute-value pair is consistent with the interpretation of missing attribute values, lost and "do not care" conditions.

Numerical attributes should be treated in a slightly different way than symbolic attributes. First, for computing characteristic sets, numerical attributes should be considered as symbolic. For example, for Table 2 the blocks of attribute-value pairs are:

[(Gender, man)] = {2, 3},
[(Gender, woman)] = {4, 5, 6},
[(Cholesterol, 180)] = {1, 2},
[(Cholesterol, 240)] = {2, 4},
[(Cholesterol, 280)] = {2, 3},
[(Cholesterol, 320)] = {2, 6}.

The characteristic set K_B(x) is the intersection of blocks of attribute-value pairs (a, v) for all attributes a from B for which ρ(x, a) is specified and ρ(x, a) = v. The characteristic sets K_A(x) for Table 2 and B = A are:

K_A(1) = U ∩ {1, 2} = {1, 2},
K_A(2) = {2, 3} ∩ U = {2, 3},
K_A(3) = {2, 3} ∩ {2, 3} = {2, 3},
K_A(4) = {4, 5, 6} ∩ {2, 4} = {4},
K_A(5) = {4, 5, 6} ∩ U = {4, 5, 6},
K_A(6) = {4, 5, 6} ∩ {2, 6} = {6}.

For incompletely specified decision tables lower and upper approximations may be defined in a few different ways [16,17,18]. We will quote only one type of approximations for incomplete decision tables, called concept approximations. A concept B-lower approximation of the concept X is defined as follows:

\underline{B}X = ∪ {K_B(x) | x ∈ X, K_B(x) ⊆ X}.

A concept B-upper approximation of the concept X is defined as follows:

\overline{B}X = ∪ {K_B(x) | x ∈ X, K_B(x) ∩ X ≠ ∅} = ∪ {K_B(x) | x ∈ X}.

For Table 2, the concept lower and upper approximations are:

\underline{A}{1, 4, 5} = {4},
\underline{A}{2, 3, 6} = {2, 3, 6},
\overline{A}{1, 4, 5} = {1, 2, 4, 5, 6},
\overline{A}{2, 3, 6} = {2, 3, 6}.

For inducing rules from data with numerical attributes, blocks of attribute-value pairs are defined differently than in computing characteristic sets. Blocks of attribute-value pairs for numerical attributes are computed in a similar way as for complete data, but for every cutpoint the corresponding blocks are computed taking into account the interpretation of missing attribute values. Thus,

[(Gender, man)] = {2, 3},
[(Gender, woman)] = {4, 5, 6},
[(Cholesterol, 180..210)] = {1, 2},
[(Cholesterol, 210..320)] = {2, 3, 4, 6},
[(Cholesterol, 180..260)] = {1, 2, 4},
[(Cholesterol, 260..320)] = {2, 3, 6},
[(Cholesterol, 180..300)] = {1, 2, 3, 4},
[(Cholesterol, 300..320)] = {2, 6}.
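The characteristic sets and concept approximations above can be reproduced with a short Python sketch. This is our own illustration, not code from the paper; it treats Cholesterol symbolically for the characteristic sets, exactly as described in the text.

def characteristic_set(case_id, table, attributes, blocks, universe):
    # K_B(x): intersect the blocks [(a, v)] over all attributes a in B for
    # which the value v of case x is specified ('?' and '*' are skipped).
    k = set(universe)
    for a in attributes:
        v = table[case_id][a]
        if v not in ("?", "*"):
            k &= blocks[(a, v)]
    return k

def concept_approximations(concept, char_sets):
    # Concept lower and upper approximations built from characteristic sets.
    concept = set(concept)
    lower = set().union(*[char_sets[x] for x in concept
                          if char_sets[x] <= concept] or [set()])
    upper = set().union(*[char_sets[x] for x in concept])
    return lower, upper

universe = {1, 2, 3, 4, 5, 6}
table = {1: {"Gender": "?", "Cholesterol": 180},   2: {"Gender": "man", "Cholesterol": "*"},
         3: {"Gender": "man", "Cholesterol": 280}, 4: {"Gender": "woman", "Cholesterol": 240},
         5: {"Gender": "woman", "Cholesterol": "?"}, 6: {"Gender": "woman", "Cholesterol": 320}}
blocks = {}
for a in ("Gender", "Cholesterol"):
    for v in {table[x][a] for x in universe if table[x][a] not in ("?", "*")}:
        blocks[(a, v)] = {x for x in universe if table[x][a] in (v, "*")}
char_sets = {x: characteristic_set(x, table, ("Gender", "Cholesterol"), blocks, universe)
             for x in universe}
print(concept_approximations({1, 4, 5}, char_sets))  # ({4}, {1, 2, 4, 5, 6})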
Using the MLEM2 algorithm, the following rules are induced:

certain rule set (induced from the concept lower approximations):

2, 1, 1
(Gender, woman) & (Cholesterol, 180..260) -> (Stroke, no)
1, 3, 3
(Cholesterol, 260..320) -> (Stroke, yes)

possible rule set (induced from the concept upper approximations):

1, 2, 3
(Gender, woman) -> (Stroke, no)
1, 1, 3
(Cholesterol, 180..260) -> (Stroke, no)
1, 3, 3
(Cholesterol, 260..320) -> (Stroke, yes)

5 Experiments

Our experiments, aimed at a comparison of three commonly used discretization techniques with MLEM2, were conducted on seven data sets, summarized in Table 3. All of these data sets, with the exception of bank and globe, are available at the University of California at Irvine Machine Learning Repository. The bank data set is a well-known data set used by E. Altman to predict bankruptcy of companies. The globe data set describes global warming and was presented in [22].

Table 3. Data sets

Data set   Number of cases   Number of attributes   Number of concepts
Bank       66                5                      2
Bupa       345               6                      2
Glass      214               9                      6
Globe      150               4                      3
Image      210               19                     7
Iris       150               4                      3
Wine       178               13                     3

The following three discretization methods were used in our experiments:

– Equal Interval Width method, combined with the LEM2 rule induction algorithm, coded as 11,
– Equal Frequency per Interval method, combined with the LEM2 rule induction algorithm, coded as 12,
– Minimal Class Entropy method, combined with the LEM2 rule induction algorithm, coded as 13.

All discretization methods were applied with the level of consistency equal to 100%. For any discretized data set, except bank and globe, ten-fold cross-validation experiments were used for determining an error rate, where rule sets were induced using the LEM2 algorithm [7,12]. The remaining two data sets, bank and globe, were subjected to the leave-one-out validation method because of their small size. Results of the experiments are presented in Table 4.
Table 4. Results of validation Data set
Bank
6
Error rate 11
12
13
MLEM2
9.09%
3.03%
4.55%
4.55%
Bupa
33.33%
39.71%
44.06%
34.49%
Glass
32.71%
35.05%
41.59%
29.44%
Globe
69.70%
54.05%
72.73%
63.64%
Image
20.48%
20.48%
52.86%
17.14%
Iris
5.33%
10.67%
9.33%
4.67%
Wine
11.24%
6.18%
2.81%
11.24%
Conclusions
We demonstrated that both rough set theory and calculus of attribute-value pair blocks are useful tools for data mining from numerical data. The same idea of an attribute-value pair block may be used in the process of data mining not only for computing elementary sets (for complete data sets) but also for rule induction. The MLEM2 algorithm induces rules from raw data with numerical attributes, without any prior discretization, and MLEM2 provides the same results as LEM2 for data with all symbolic attributes. As follows from Table 4, even though MLEM2 was most frequently a winner, using the Wilcoxon matched-pairs signed rank test we may conclude that the differences between all four data mining methods are statistically insignificant. Thus, for a specific data set with numerical attributes the best approach to discretization should be selected on a case by case basis.
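The statistical comparison mentioned above can be approximated with a few lines of Python. This sketch is our own illustration, not the authors' script; it assumes SciPy is available and simply pairs the per-data-set error rates from Table 4 for Wilcoxon matched-pairs signed rank tests against MLEM2.

from scipy.stats import wilcoxon

# Error rates (%) from Table 4, in the order Bank, Bupa, Glass, Globe, Image, Iris, Wine.
errors = {
    "11":    [9.09, 33.33, 32.71, 69.70, 20.48, 5.33, 11.24],
    "12":    [3.03, 39.71, 35.05, 54.05, 20.48, 10.67, 6.18],
    "13":    [4.55, 44.06, 41.59, 72.73, 52.86, 9.33, 2.81],
    "MLEM2": [4.55, 34.49, 29.44, 63.64, 17.14, 4.67, 11.24],
}

for method in ("11", "12", "13"):
    stat, p = wilcoxon(errors[method], errors["MLEM2"])
    print(f"MLEM2 vs {method}: statistic = {stat}, p-value = {p:.3f}")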
References

1. Bajcar, S., Grzymala-Busse, J.W., Hippe, Z.S.: A comparison of six discretization algorithms used for prediction of melanoma. In: Proc. of the Eleventh International Symposium on Intelligent Information Systems, IIS 2002, Sopot, Poland, pp. 3–12. Physica-Verlag (2002)
2. Bay, S.D.: Multivariate discretization of continuous variables for set mining. In: Proc. of the 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, MA, pp. 315–319 (2000)
3. Biba, M., Esposito, F., Ferilli, S., Mauro, N.D., Basile, T.M.A.: Unsupervised discretization using kernel density estimation. In: Proc. of the 20th Int. Conf. on AI, Hyderabad, India, pp. 696–701 (2007)
4. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth & Brooks, Monterey (1984)
5. Catlett, J.: On changing continuous attributes into ordered discrete attributes. In: Kodratoff, Y. (ed.) EWSL 1991. LNCS (LNAI), vol. 482, pp. 164–178. Springer, Heidelberg (1991)
6. Chan, C.C., Batur, C., Srinivasan, A.: Determination of quantization intervals in rule based model for dynamic systems. In: Proc. of the IEEE Conference on Systems, Man, and Cybernetics, Charlottesville, VA, pp. 1719–1723 (1991)
7. Chan, C.C., Grzymala-Busse, J.W.: On the attribute redundancy and the learning programs ID3, PRISM, and LEM2. Department of Computer Science, University of Kansas, TR-91-14, December 1991, 20 p. (1991)
8. Chmielewski, M.R., Grzymala-Busse, J.W.: Global discretization of continuous attributes as preprocessing for machine learning. Int. Journal of Approximate Reasoning 15, 319–331 (1996)
9. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretization of continuous features. In: Proc. of the 12th Int. Conf. on Machine Learning, Tahoe City, CA, July 9–12, pp. 194–202 (1995)
10. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Proc. of the 13th Int. Joint Conference on AI, Chambery, France, pp. 1022–1027 (1993)
11. Grzymala-Busse, J.W.: LERS—A system for learning from examples based on rough sets. In: Slowinski, R. (ed.) Intelligent Decision Support. Handbook of Applications and Advances of the Rough Set Theory, pp. 3–18. Kluwer Academic Publishers, Dordrecht (1992)
12. Grzymala-Busse, J.W.: A new version of the rule induction system LERS. Fundamenta Informaticae 31, 27–39 (1997)
13. Grzymala-Busse, J.W.: Discretization of numerical attributes. In: Klösgen, W., Zytkow, J. (eds.) Handbook of Data Mining and Knowledge Discovery, pp. 218–225. Oxford University Press, New York (2002)
14. Grzymala-Busse, J.W.: MLEM2: A new algorithm for rule induction from imperfect data. In: Proc. of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU 2002, Annecy, France, pp. 243–250 (2002)
15. Grzymala-Busse, J.W.: A comparison of three strategies to rule induction from data with numerical attributes. In: Proc. of the Int. Workshop on Rough Sets in Knowledge Discovery (RSKD 2003), in conjunction with the European Joint Conferences on Theory and Practice of Software, Warsaw, pp. 132–140 (2003)
16. Grzymala-Busse, J.W.: Rough set strategies to data with missing attribute values. In: Workshop Notes, Foundations and New Directions of Data Mining, in conjunction with the 3rd International Conference on Data Mining, Melbourne, FL, pp. 56–63 (2003)
17. Grzymala-Busse, J.W.: Data with missing attribute values: Generalization of indiscernibility relation and rule induction. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B., Świniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 78–95. Springer, Heidelberg (2004)
18. Grzymala-Busse, J.W.: Incomplete data and generalization of indiscernibility relation, definability, and approximations. In: Ślęzak, D., Wang, G., Szczuka, M.S., Düntsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3641, pp. 244–253. Springer, Heidelberg (2005)
19. Grzymala-Busse, J.W.: Mining numerical data—A rough set approach. In: Kryszkiewicz, M., Peters, J.F., Rybiński, H., Skowron, A. (eds.) RSEISP 2007. LNCS (LNAI), vol. 4585, pp. 12–21. Springer, Heidelberg (2007)
20. Grzymala-Busse, J.W., Stefanowski, J.: Discretization of numerical attributes by direct use of the rule induction algorithm LEM2 with interval extension. In: Proc. of the Sixth Symposium on Intelligent Information Systems (IIS 1997), Zakopane, Poland, pp. 149–158 (1997)
21. Grzymala-Busse, J.W., Stefanowski, J.: Three discretization methods for rule induction. Int. Journal of Intelligent Systems 16, 29–38 (2001)
22. Gunn, J.D., Grzymala-Busse, J.W.: Global temperature stability by rule induction: An interdisciplinary bridge. Human Ecology 22, 59–81 (1994)
23. Kerber, R.: ChiMerge: Discretization of numeric attributes. In: Proc. of the 10th National Conf. on AI, San Jose, CA, pp. 123–128 (1992)
24. Kohavi, R., Sahami, M.: Error-based and entropy-based discretization of continuous features. In: Proc. of the 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, pp. 114–119 (1996)
25. Liu, H., Hussain, F., Tan, C.L., Dash, M.: Discretization: An enabling technique. Data Mining and Knowledge Discovery 6, 393–423 (2002)
26. Nguyen, H.S., Nguyen, S.H.: Discretization methods for data mining. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery, pp. 451–482. Physica, Heidelberg (1998)
27. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11, 341–356 (1982)
28. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
29. Pensa, R.G., Leschi, C., Besson, J., Boulicaut, J.F.: Assessment of discretization techniques for relevant pattern discovery from gene expression data. In: Proc. of the 4th ACM SIGKDD Workshop on Data Mining in Bioinformatics, pp. 24–30 (2004)
30. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993)
31. Stefanowski, J.: Handling continuous attributes in discovery of strong decision rules. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 394–401. Springer, Heidelberg (1998)
32. Stefanowski, J.: Algorithms of Decision Rule Induction in Data Mining. Poznan University of Technology Press, Poznan (2001)
Definability and Other Properties of Approximations for Generalized Indiscernibility Relations

Jerzy W. Grzymala-Busse^{1,2} and Wojciech Rząsa^3

^1 Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA
^2 Institute of Computer Science, Polish Academy of Sciences, 01-237 Warsaw, Poland
[email protected]
^3 Department of Computer Science, University of Rzeszow, 35-310 Rzeszow, Poland
[email protected]

Abstract. In this paper we consider a generalization of the indiscernibility relation, i.e., a relation R that is not necessarily reflexive, symmetric, or transitive. On the basis of granules, defined by R, we introduce the idea of definability. We study 28 basic definitions of approximations; two approximations are introduced for the first time. Furthermore, we introduce an additional 8 new approximations. Our main objective is to study definability and coalescence of approximations. We study definability of all 28 basic approximations for reflexive, symmetric, and transitive relations. In particular, for reflexive relations, the set of 28 approximations is reduced, in general, to the set of 16 approximations.
1 Introduction
One of the basic ideas of rough set theory, introduced by Z. Pawlak in 1982, is the indiscernibility relation [17, 18] defined on the finite and nonempty set U called the universe. An ordered pair (U, R), where R denotes a binary relation, is called an approximation space. This idea is essential for rough sets. The set U represents cases that are characterized by the relation R. If R is an equivalence relation then it is called an indiscernibility relation. Two cases being in the relation R are indiscernible or indistinguishable. The original idea of the approximation space was soon generalized. In 1983 W. Zakowski redefined the approximation space as a pair (U, Π), where Π was a covering of the universe U [33]. Another example of the generalization of the original idea of approximation space was a tolerance approximation space, presented by A. Skowron and J. Stepaniuk in 1996 [21]. The tolerance approximation space was introduced as a four-tuple (U, I, ν, P), where U denotes the universe, I : U → 2^U represents uncertainty, ν : 2^U × 2^U → [0, 1] is a vague inclusion, and P : I(U) → {0, 1} is yet another function. In 1989 T.Y. Lin introduced a
neighborhood system, yet another generalization of the approximation space [13]. A few other authors [19, 22, 27–29] considered some extensions of the approximation space. Some extensions of the approximation space were considered in papers dealing with data sets with missing attribute values [1–5, 10, 11, 24, 25]. In this paper we will discuss a generalization of the indiscernibility relation, an arbitrary binary relation R. Such a relation R does not need to be reflexive, symmetric, or transitive. Our main objective was to study the definability and coalescence of approximations of any subset X of the universe U. A definable set is a union of granules, defined by R, that are also known as R-successor or R-predecessor sets or as neighborhoods. In this paper 28 definitions of approximations are discussed. These definitions were motivated by a variety of reasons. Some of the original approximations do not satisfy, in general, the inclusion property (Section 3, Properties 1a, 1b), hence modified approximations were introduced. In [5] the two most accurate approximations were defined. In this paper, using duality (Section 3, Properties 8a, 8b), we define two extra approximations for the first time. Additionally, we show that it is possible to define an additional 8 approximations, thus reaching 36 approximations altogether. Such generalizations of the indiscernibility relation have immediate application to data mining (machine learning) from incomplete data sets. In these applications the binary relation R, called the characteristic relation and describing such data, is reflexive. For reflexive relations the system of 28 approximations is reduced to 16 approximations. Note that some of these 16 approximations are not useful for data mining from incomplete data [2–5, 8, 9]. A preliminary version of this paper was presented at the IEEE Symposium on Foundations of Computational Intelligence (FOCI'2007), Honolulu, Hawaii, April 1–5, 2007 [6].
2 Basic Definitions
First we will introduce the basic granules (or neighborhoods) defined by a relation R. Such granules are called here R-successor and R-predecessor sets. In this paper R is a generalization of the indiscernibility relation. The relation R, in general, does not need to be reflexive, symmetric, or transitive, while the indiscernibility relation is an equivalence relation. Let U be a finite nonempty set, called the universe, let R be a binary relation on U, and let x be a member of U. The R-successor set of x, denoted by R_s(x), is defined as follows

R_s(x) = {y | xRy}.

The R-predecessor set of x, denoted by R_p(x), is defined as follows

R_p(x) = {y | yRx}.

R-successor and R-predecessor sets are used to form larger sets that are called R-successor and R-predecessor definable.
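As a quick illustration (our own, not from the paper), successor and predecessor sets can be computed directly from a relation given as a set of ordered pairs:

def successor_set(R, x):
    # R_s(x) = {y | x R y} for a relation R given as a set of pairs (a, y).
    return {y for (a, y) in R if a == x}

def predecessor_set(R, x):
    # R_p(x) = {y | y R x}.
    return {y for (y, b) in R if b == x}

# A small relation that is neither reflexive, symmetric, nor transitive.
R = {(1, 1), (1, 2), (2, 3), (3, 1)}
print(successor_set(R, 1), predecessor_set(R, 1))  # {1, 2} {1, 3}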
Let X be a subset of U. A set X is R-successor definable if and only if X = ∅ or there exists a subset Y of U such that X = ∪ {R_s(y) | y ∈ Y}. A set X is R-predecessor definable if and only if X = ∅ or there exists a subset Y of U such that X = ∪ {R_p(y) | y ∈ Y}. Note that definability is described differently in [19, 22]. It is not difficult to recognize that a set X is R-successor definable if and only if X = ∪ {R_s(x) | R_s(x) ⊆ X}, while a set X is R-predecessor definable if and only if X = ∪ {R_p(x) | R_p(x) ⊆ X}.

It will be convenient to define a few useful maps with some interesting properties. Let U be a finite nonempty set and let f : 2^U → 2^U be a map. A map f is called increasing if and only if for any subsets X and Y of U

X ⊆ Y ⇒ f(X) ⊆ f(Y).

Theorem 1. Let X and Y be subsets of U and let f : 2^U → 2^U be increasing. Then f(X ∪ Y) ⊇ f(X) ∪ f(Y) and f(X ∩ Y) ⊆ f(X) ∩ f(Y).

Proof. An easy proof is based on the observation that if both sets f(X) and f(Y) are subsets of f(X ∪ Y) (since the map f is increasing), then the union of f(X) and f(Y) is also a subset of f(X ∪ Y). By analogy, since f(X ∩ Y) is a subset of both f(X) and f(Y), then f(X ∩ Y) is also a subset of f(X) ∩ f(Y).

Again, let U be a finite and nonempty set and let f : 2^U → 2^U be a map. A map f is non-decreasing if and only if there do not exist subsets X and Y of U such that X ⊂ Y and f(X) ⊃ f(Y). Let U be a finite nonempty set and let f : 2^U → 2^U and g : 2^U → 2^U be maps defined on the power set of U. Maps f and g will be called dual if for any subset X of U the sets f(X) and g(¬X) are complementary. The symbol ¬X means the complement of the set X.

Theorem 2. For a finite and nonempty set U and subsets X, Y and Z of U, if sets X and Y are complementary then sets X ∪ Z and Y − Z are complementary.
Proof is based on de Morgan laws.

Let U be a finite nonempty set and let f : 2^U → 2^U. The map f is called idempotent if and only if f(f(X)) = f(X) for any subset X of U. A mixed idempotency property is defined in the following way. Let U be a finite nonempty set and let f : 2^U → 2^U and g : 2^U → 2^U be maps defined on the power set of U. A pair (f, g) has the Mixed Idempotency Property if and only if f(g(X)) = g(X) for any subset X of U.

Theorem 3. Let U be a finite nonempty set and let f : 2^U → 2^U and g : 2^U → 2^U be dual maps. Then:
(a) for any X ⊆ U, f(X) ⊆ X if and only if for any X ⊆ U, X ⊆ g(X),
(b) for any X ⊆ U, f(X) ⊆ X if and only if ¬X ⊆ g(¬X),
(c) f(∅) = ∅ if and only if g(U) = U,
(d) the map f is increasing if and only if the map g is increasing,
(e) the map f is non-decreasing if and only if the map g is non-decreasing,
(f) for any X, Y ⊆ U, f(X ∪ Y) = f(X) ∪ f(Y) if and only if g(X ∩ Y) = g(X) ∩ g(Y),
(g) for any X, Y ⊆ U, f(X ∪ Y) ⊇ f(X) ∪ f(Y) if and only if g(X ∩ Y) ⊆ g(X) ∩ g(Y),
(h) f is idempotent if and only if g is idempotent,
(i) the pair (f, g) has the Mixed Idempotency Property if and only if the pair (g, f) has the same property.

Proof. For properties (a)–(i) only sufficient conditions will be proved. Proofs for the necessary conditions may be achieved by replacing all symbols f with g, ∪ with ∩, ⊆ with ⊇ and ⊂ with ⊃, respectively, and by replacing symbols ¬X with X in the proof of property (b) and symbols ∅ with U in the proof of property (c).
For (a) let us observe that f(X) ⊆ X for any subset X of U if and only if ¬X ⊆ ¬f(X), so ¬X ⊆ g(¬X) for any subset ¬X of U.
For the proof of (b), if f(X) ⊆ X then ¬X ⊆ g(¬X), see the proof for (a).
A brief proof for (c) is based on de Morgan laws: f(∅) = ¬g(¬∅) = ¬g(U) = ∅.
For (d) let us assume that X and Y are such subsets of U that X ⊆ Y. If the map f is increasing then f(X) ⊆ f(Y). Thus f(¬Y) ⊆ f(¬X), or ¬f(¬Y) ⊇ ¬f(¬X), or g(Y) ⊇ g(X).
For (e) let us assume that X and Y are subsets of U such that X ⊂ Y. If the map f is non-decreasing then ∼(f(X) ⊃ f(Y)). Thus ∼(f(¬X) ⊂ f(¬Y)) and ∼(¬f(¬X) ⊃ ¬f(¬Y)), or ∼(g(X) ⊃ g(Y)).
For (f) and (g) the proofs are very similar. First, g(X ∩ Y) = ¬f(¬(X ∩ Y)) = ¬f((¬X) ∪ (¬Y)). Then, by the assumption about f, ¬f((¬X) ∪ (¬Y)) = ¬(f(¬X) ∪ f(¬Y)) = g(X) ∩ g(Y) for (f), or ¬f((¬X) ∪ (¬Y)) ⊆ ¬(f(¬X) ∪ f(¬Y)) = g(X) ∩ g(Y) for (g).
For (h) first observe that g(g(X)) = g(¬f(¬X)) = ¬f(f(¬X)). The last set, by the idempotency of f, is equal to ¬f(¬X) = g(X).
For (i) let us assume that the pair (f, g) has the Mixed Idempotency Property. Then g(f(X)) = ¬f(¬f(X)) = ¬f(g(¬X)), since f and g are dual. The set ¬f(g(¬X)) is equal to ¬g(¬X), since the pair (f, g) has the Mixed Idempotency Property.

Theorem 4. Let U be a finite nonempty set, f : 2^U → 2^U and g : 2^U → 2^U be maps defined on the power set of U, and F, G be maps dual to f and g, respectively. If f(X) ⊆ g(X) for some subset X of U then F(X) ⊇ G(X).

Proof. Let X be a subset of U. Then G(X) = U − g(¬X) by the definition of dual maps. Let x be an element of U. Then x ∈ G(X) if and only if x ∉ g(¬X). Additionally, f(¬X) ⊆ g(¬X), hence x ∉ f(¬X), or x ∈ F(X).
3 Set Approximations in the Pawlak Space
Let (U, R) be an approximation space, where R is an equivalence relation. Let ℛ be the family of R-definable sets (we may ignore adding successor or predecessor since R is symmetric). A pair (U, ℛ) is a topological space, called the Pawlak space, where ℛ is the family of all open and closed sets [17]. Let us recall that Z. Pawlak defined lower and upper approximations [17, 18], denoted by \underline{appr}(X) and \overline{appr}(X), in the following way

\underline{appr}(X) = ∪ {[x]_R | x ∈ U and [x]_R ⊆ X},
\overline{appr}(X) = ∪ {[x]_R | x ∈ U and [x]_R ∩ X ≠ ∅},

where [x]_R denotes an equivalence class containing an element x of U. The maps \underline{appr} and \overline{appr} are the operations of interior and closure in the topology defined by R-definable sets. As observed by Z. Pawlak [18], the same maps \underline{appr} and \overline{appr} may be defined using different formulas

\underline{appr}(X) = {x ∈ U | [x]_R ⊆ X} and
\overline{appr}(X) = {x ∈ U | [x]_R ∩ X ≠ ∅}.
These approximations have the following properties. For any X, Y ⊆ U:

1. (a) \underline{appr}(X) ⊆ X, inclusion property for the lower approximation,
   (b) X ⊆ \overline{appr}(X), inclusion property for the upper approximation,
2. (a) \underline{appr}(∅) = ∅,
   (b) \overline{appr}(∅) = ∅,
3. (a) \underline{appr}(U) = U,
   (b) \overline{appr}(U) = U,
4. (a) X ⊆ Y ⇒ \underline{appr}(X) ⊆ \underline{appr}(Y), monotonicity of the lower approximation,
   (b) X ⊆ Y ⇒ \overline{appr}(X) ⊆ \overline{appr}(Y), monotonicity of the upper approximation,
5. (a) \underline{appr}(X ∪ Y) ⊇ \underline{appr}(X) ∪ \underline{appr}(Y),
   (b) \overline{appr}(X ∪ Y) = \overline{appr}(X) ∪ \overline{appr}(Y),
6. (a) \underline{appr}(X ∩ Y) = \underline{appr}(X) ∩ \underline{appr}(Y),
   (b) \overline{appr}(X ∩ Y) ⊆ \overline{appr}(X) ∩ \overline{appr}(Y),
7. (a) \underline{appr}(\underline{appr}(X)) = \underline{appr}(X) = \overline{appr}(\underline{appr}(X)),
   (b) \overline{appr}(\overline{appr}(X)) = \overline{appr}(X) = \underline{appr}(\overline{appr}(X)),
8. (a) \underline{appr}(X) = ¬\overline{appr}(¬X), duality property,
   (b) \overline{appr}(X) = ¬\underline{appr}(¬X), duality property.

Remark. Due to Theorem 3 we may observe that all Properties 1a–8b, for dual approximations, can be grouped into the following sets: {1(a), 1(b)}, {2(a), 3(b)}, {2(b), 3(a)}, {4(a), 4(b)}, {5(a), 6(b)}, {5(b), 6(a)}, {7(a), 7(b)}, {8(a), 8(b)}. In all of these eight sets the first property holds if and only if the second property holds.

A problem is whether conditions 1–8 uniquely define the maps \underline{appr} : 2^U → 2^U and \overline{appr} : 2^U → 2^U in the approximation space (U, R). We assume that definitions of both \underline{appr} and \overline{appr} should be constructed only from information about R, i.e., we can test, for any x ∈ U, whether it belongs to a lower and upper approximation only on the basis of its membership in some equivalence class of R. Obviously, for any relation R different from the set of all ordered pairs from U and from the set of all pairs (x, x), where x ∈ U, the answer is negative, since we may define an equivalence relation S such that R is a proper subset of S and for any x ∈ U a membership of x in an equivalence class of S can be decided on the basis of a membership of x in an equivalence class of R (any equivalence class of R is a subset of some equivalence class of S).
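For the classical Pawlak space these two approximations are easy to compute from the partition into equivalence classes; the following short Python sketch is our own illustration (names and data are not from the paper):

def pawlak_approximations(partition, X):
    # Lower and upper approximations of X, given the partition of U into
    # equivalence classes of the indiscernibility relation.
    X = set(X)
    lower = set().union(*[c for c in partition if c <= X] or [set()])
    upper = set().union(*[c for c in partition if c & X] or [set()])
    return lower, upper

# U = {1,...,6} partitioned into three equivalence classes.
partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
print(pawlak_approximations(partition, {1, 2, 3}))  # ({1, 2}, {1, 2, 3, 4})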
4 Subset, Singleton and Concept Approximations
In this paper we will discuss only nonparametric approximations. For approximations depending on some parameters see [21, 32]. In this and the following sections we will consider a variety of approximations for which we will test which of
Properties 1a–8b from the previous section are satisfied. Proofs will be restricted only to R-successor sets, since the corresponding proofs for R-predecessor sets can be obtained from the former proofs by replacing the relation R by the converse relation R^{-1} and using the following equality: R_s(x) = R^{-1}_p(x), for any relation R and element x of U. Unless it is openly stated, we will not assume special properties for the relation R. For an approximation space (U, R) and X ⊆ U, the set X_s^{cov} will be defined as follows

X_s^{cov} = (∪ {R_s(x) | x ∈ U}) ∩ X.

By analogy, the set X_p^{cov} will be defined as

X_p^{cov} = (∪ {R_p(x) | x ∈ U}) ∩ X.

Let (U, R) be an approximation space. Let X be a subset of U. The R-subset successor lower approximation of X, denoted by \underline{appr}^{subset}_s(X), is defined as follows

∪ {R_s(x) | x ∈ U and R_s(x) ⊆ X}.

The subset successor lower approximations were introduced in [1, 2]. The R-subset predecessor lower approximation of X, denoted by \underline{appr}^{subset}_p(X), is defined as follows

∪ {R_p(x) | x ∈ U and R_p(x) ⊆ X}.

The subset predecessor lower approximations were studied in [22]. The R-subset successor upper approximation of X, denoted by \overline{appr}^{subset}_s(X), is defined as follows

∪ {R_s(x) | x ∈ U and R_s(x) ∩ X ≠ ∅}.

The subset successor upper approximations were introduced in [1, 2]. The R-subset predecessor upper approximation of X, denoted by \overline{appr}^{subset}_p(X), is defined as follows

∪ {R_p(x) | x ∈ U and R_p(x) ∩ X ≠ ∅}.

The subset predecessor upper approximations were studied in [22]. Sets \underline{appr}^{subset}_s(X) and \overline{appr}^{subset}_s(X) are R-successor definable, while sets \underline{appr}^{subset}_p(X) and \overline{appr}^{subset}_p(X) are R-predecessor definable for any approximation space (U, R), see, e.g., [1, 3].

R-subset successor (predecessor) lower approximations of X have the following Properties: 1a, 2a, the generalized Property 3a, i.e., \underline{appr}^{subset}_s(U) = U_s^{cov} (\underline{appr}^{subset}_p(U) = U_p^{cov}), 4a, 5a and the first equality of 7a.

Proof. Proofs for Properties 1a, 2a, the generalized Property 3a and 4a will be skipped since they are elementary. Property 5a follows from Property 4a, as it was shown in Theorem 1, Section 2.
For the proof of the first part of Property 7a, i.e., idempotency of the map \underline{appr}^{subset}_s,

\underline{appr}^{subset}_s(\underline{appr}^{subset}_s(X)) = \underline{appr}^{subset}_s(X),

let us observe that from Properties 1a and 4a we may conclude that \underline{appr}^{subset}_s(\underline{appr}^{subset}_s(X)) ⊆ \underline{appr}^{subset}_s(X). For the proof of the reverse inclusion let x ∈ \underline{appr}^{subset}_s(X). Thus, there exists some set R_s(y) ⊆ X such that x ∈ R_s(y). The set R_s(y) is a subset of \underline{appr}^{subset}_s(X), from the definition of \underline{appr}^{subset}_s. Hence x ∈ \underline{appr}^{subset}_s(\underline{appr}^{subset}_s(X)).

R-subset successor upper approximations of X have the following Properties: a generalized Property 1b, i.e., \overline{appr}^{subset}_s(X) ⊇ X_s^{cov} (\overline{appr}^{subset}_p(X) ⊇ X_p^{cov}), 2b, a generalized Property 3b, i.e., \overline{appr}^{subset}_s(U) = U_s^{cov} (\overline{appr}^{subset}_p(U) = U_p^{cov}), 4b, 5b and 6b.

Proof. Proofs for a generalized Property 1b, 2b, a generalized Property 3b and 4b will be skipped since they are elementary. The inclusion \overline{appr}^{subset}_s(X ∪ Y) ⊇ \overline{appr}^{subset}_s(X) ∪ \overline{appr}^{subset}_s(Y) in Property 5b follows from Property 4b, as it was explained in Theorem 1, Section 2. To show the reverse inclusion, i.e.,

\overline{appr}^{subset}_s(X ∪ Y) ⊆ \overline{appr}^{subset}_s(X) ∪ \overline{appr}^{subset}_s(Y),

let us consider any element x ∈ \overline{appr}^{subset}_s(X ∪ Y). There exists R_s(y) such that x ∈ R_s(y) and R_s(y) ∩ (X ∪ Y) ≠ ∅. Hence x is a member of the set \overline{appr}^{subset}_s(X) ∪ \overline{appr}^{subset}_s(Y). To show Property 6b it is enough to apply Theorem 1 and Property 4b.

The R-singleton successor lower approximation of X, denoted by \underline{appr}^{singleton}_s(X), is defined as follows

{x ∈ U | R_s(x) ⊆ X}.

The singleton successor lower approximations were studied in many papers, see, e.g., [1, 2, 10, 11, 13–15, 20, 22–26, 28–31]. The R-singleton predecessor lower approximation of X, denoted by \underline{appr}^{singleton}_p(X), is defined as follows

{x ∈ U | R_p(x) ⊆ X}.

The singleton predecessor lower approximations were studied in [22]. The R-singleton successor upper approximation of X, denoted by \overline{appr}^{singleton}_s(X), is defined as follows

{x ∈ U | R_s(x) ∩ X ≠ ∅}.

The singleton successor upper approximations, like the singleton successor lower approximations, were also studied in many papers, e.g., [1, 2, 10, 11, 22–26, 28–31].
The R-singleton predecessor upper approximation of X, denoted by \overline{appr}^{singleton}_p(X), is defined as follows

{x ∈ U | R_p(x) ∩ X ≠ ∅}.

The singleton predecessor upper approximations were introduced in [22]. In general, for any approximation space (U, R), the sets \underline{appr}^{singleton}_s(X) and \underline{appr}^{singleton}_p(X) are neither R-successor definable nor R-predecessor definable, while the set \overline{appr}^{singleton}_s(X) is R-predecessor definable and \overline{appr}^{singleton}_p(X) is R-successor definable, see, e.g., [1, 3, 16].

R-singleton successor (predecessor) lower approximations of X have the following Properties: 3a, 4a, 5a, 6a and 8a [29]. R-singleton successor (predecessor) upper approximations of X have the following Properties: 2b, 4b, 5b, 6b and 8b [29].

The R-concept successor lower approximation of X, denoted by \underline{appr}^{concept}_s(X), is defined as follows

∪ {R_s(x) | x ∈ X and R_s(x) ⊆ X}.

The concept successor lower approximations were introduced in [1, 2]. The R-concept predecessor lower approximation of X, denoted by \underline{appr}^{concept}_p(X), is defined as follows

∪ {R_p(x) | x ∈ X and R_p(x) ⊆ X}.

The concept predecessor lower approximations were introduced, for the first time, in [6]. The R-concept successor upper approximation of X, denoted by \overline{appr}^{concept}_s(X), is defined as follows

∪ {R_s(x) | x ∈ X and R_s(x) ∩ X ≠ ∅}.

The concept successor upper approximations were studied in [1, 2, 15]. The R-concept predecessor upper approximation of X, denoted by \overline{appr}^{concept}_p(X), is defined as follows

∪ {R_p(x) | x ∈ X and R_p(x) ∩ X ≠ ∅}.

The concept predecessor upper approximations were studied in [22]. Sets \underline{appr}^{concept}_s(X) and \overline{appr}^{concept}_s(X) are R-successor definable, while sets \underline{appr}^{concept}_p(X) and \overline{appr}^{concept}_p(X) are R-predecessor definable for any approximation space (U, R), see, e.g., [1, 3].

R-concept successor (predecessor) lower approximations of X have the following Properties: 1a, 2a, generalized 3a, i.e., \underline{appr}^{concept}_s(U) = U_s^{cov} (\underline{appr}^{concept}_p(U) = U_p^{cov}), 4a and 5a.

Proof. For a concept successor (predecessor) lower approximation, proofs of Properties 1a, 2a, generalized Property 3a and 4a are elementary and will be ignored.
Moreover, proofs for Properties 5a and 7a are almost the same as for subset lower approximations. Note that in the proof of 7a, besides $R_s(y) \subseteq X$ and $x \in R_s(y)$, we additionally know that $y \in X$, but this does not affect the proof.
R-concept successor (predecessor) upper approximations of X have the following Properties: 2b, generalized Property 3b, i.e., $\overline{appr}^{concept}_s(U) = U^{cov}_s$ ($\overline{appr}^{concept}_p(U) = U^{cov}_p$), 4b and 6b.
Proof. Proofs of Property 2b, generalized Property 3b, and 4b are elementary and are omitted. The proof for Property 6b is a consequence of Theorem 1.
5
Modified Singleton Approximations
Definability and duality of lower and upper approximations of a subset X of the universe U are basic properties of rough approximations defined for the indiscernibility relation originally formulated by Z. Pawlak [17, 18]. Inclusion between the set and its approximations (Properties 1a and 1b) is worth some attention. For a reflexive relation R, all subset, singleton, and concept predecessor (successor) lower and upper approximations satisfy Properties 1a and 1b. However, for a non-reflexive relation R, in general, if the family of sets $\{R_s(x) \mid x \in U\}$ ($\{R_p(x) \mid x \in U\}$) is not a covering of U, we have $X \not\subseteq \overline{appr}^{subset}_s(X)$ (and $X \not\subseteq \overline{appr}^{subset}_p(X)$).
For R-subset successor (predecessor) upper approximations we have $X^{cov}_s \subseteq \overline{appr}^{subset}_s(X)$ (and $X^{cov}_p \subseteq \overline{appr}^{subset}_p(X)$). On the other hand, for R-concept successor (predecessor) upper approximations, in general, not only $X \not\subseteq \overline{appr}^{concept}_s(X)$ (and $X \not\subseteq \overline{appr}^{concept}_p(X)$) but also $X^{cov}_s \not\subseteq \overline{appr}^{concept}_s(X)$ (and $X^{cov}_p \not\subseteq \overline{appr}^{concept}_p(X)$), as follows from the following example.
Example. Let U = {1, 2}, X = {1}, R = {(1, 2), (2, 1)}. Then the family of two sets $R_s(1)$ and $R_s(2)$ is a covering of U, $X^{cov}_s = \{1\}$ and $\overline{appr}^{concept}_s(X) = \emptyset$.
For R-singleton successor (predecessor) approximations the following situations may happen, no matter whether R is symmetric or transitive:
$\underline{appr}^{singleton}_s(X) \not\subseteq X$ (and $\underline{appr}^{singleton}_p(X) \not\subseteq X$),
$X \not\subseteq \overline{appr}^{singleton}_s(X)$ (and $X \not\subseteq \overline{appr}^{singleton}_p(X)$),
and even $\overline{appr}^{singleton}_s(X) \subset X \subset \underline{appr}^{singleton}_s(X)$.
The following example shows all three situations for a symmetric and transitive relation R.
Example. Let U = {1, 2, 3, 4, 5}, X = {1, 2}, R = {(1, 1), (3, 3), (3, 4), (4, 3), (4, 4)}. Then $\underline{appr}^{singleton}_s(X) = \{1, 2, 5\}$ and $\overline{appr}^{singleton}_s(X) = \{1\}$, so that $\underline{appr}^{singleton}_s(X) \not\subseteq X$ and $X \not\subseteq \overline{appr}^{singleton}_s(X)$, and $\overline{appr}^{singleton}_s(X) \subset X \subset \underline{appr}^{singleton}_s(X)$.
To avoid the situation described by the last inclusion for singleton approximations, the following modifications of the corresponding definitions were introduced. The R-modified singleton successor lower approximation of X, denoted by $\underline{appr}^{modsingleton}_s(X)$, is defined as follows
$$\{x \in U \mid R_s(x) \subseteq X \text{ and } R_s(x) \neq \emptyset\}.$$
The R-modified singleton predecessor lower approximation of X, denoted by $\underline{appr}^{modsingleton}_p(X)$, is defined as follows
$$\{x \in U \mid R_p(x) \subseteq X \text{ and } R_p(x) \neq \emptyset\}.$$
The R-modified singleton successor upper approximation of X, denoted by $\overline{appr}^{modsingleton}_s(X)$, is defined as follows
$$\{x \in U \mid R_s(x) \cap X \neq \emptyset \text{ or } R_s(x) = \emptyset\}.$$
The R-modified singleton predecessor upper approximation of X, denoted by $\overline{appr}^{modsingleton}_p(X)$, is defined as follows
$$\{x \in U \mid R_p(x) \cap X \neq \emptyset \text{ or } R_p(x) = \emptyset\}.$$
These four approximations were introduced, for the first time, in [6].
In an arbitrary approximation space, R-modified singleton successor (predecessor) lower and upper approximations have Properties 8a and 8b.
Proof. Since $\underline{appr}^{modsingleton}_s(X) = \underline{appr}^{singleton}_s(X) - \{x \in U : R_s(x) = \emptyset\}$ and $\overline{appr}^{modsingleton}_s(X) = \overline{appr}^{singleton}_s(X) \cup \{x \in U : R_s(x) = \emptyset\}$, Properties 8a and 8b for R-modified singleton successor (predecessor) lower and upper approximations follow from Theorem 2 and the duality of R-singleton successor (predecessor) lower and upper approximations [29].
R-modified singleton successor (predecessor) lower approximations have the following Properties: 2a, 4a, 5a and 6a.
Proof. Proofs of Properties 2a and 4a are elementary and will be skipped. Property 5a and a part of 6a, namely $\underline{appr}^{modsingleton}_s(X \cap Y) \subseteq \underline{appr}^{modsingleton}_s(X) \cap \underline{appr}^{modsingleton}_s(Y)$, are simple consequences of Theorem 1. The second part of the proof for 6a, i.e., the reverse inclusion, is almost identical with the proof of this property for an R-singleton successor lower approximation (compare with [29]). Here we assume that the set $R_s(y)$ is nonempty, but this does not change the proof.
R-modified singleton successor (predecessor) upper approximations have the following Properties: 3b, 4b, 5b and 6b.
Proof. The proof follows immediately from Theorem 3, since we have proved that the maps $\underline{appr}^{modsingleton}_s$ and $\overline{appr}^{modsingleton}_s$ are dual.
In general, for any approximation space (U, R), the sets $\underline{appr}^{modsingleton}_s(X)$, $\underline{appr}^{modsingleton}_p(X)$, $\overline{appr}^{modsingleton}_s(X)$ and $\overline{appr}^{modsingleton}_p(X)$ are neither R-successor definable nor R-predecessor definable.
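As an illustration only (not the authors' code), the modified singleton approximations can be sketched by extending the hypothetical helper functions introduced after Section 4; the empty-granule test is the only change with respect to the plain singleton approximations.

# Sketch (continuing the helpers above): modified singleton successor approximations.
def mod_singleton_lower(U, R, X):
    """{x in U : R_s(x) subset of X and R_s(x) nonempty}."""
    return frozenset(x for x in U
                     if successor(R, x) and successor(R, x) <= X)

def mod_singleton_upper(U, R, X):
    """{x in U : R_s(x) intersects X or R_s(x) is empty}."""
    return frozenset(x for x in U
                     if (successor(R, x) & X) or not successor(R, x))

On the example above (U = {1, 2, 3, 4, 5}, X = {1, 2}), this sketch gives the lower approximation {1} and the upper approximation {1, 2, 5}, so the anomalous inclusion of the plain singleton approximations disappears.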
6
Largest Lower and Smallest Upper Approximations
Properties 1a, 3a, 6a and the first equality of 7a indicate that a lower approximation, in the Pawlak space, is an interior operation [12]. Properties 1b, 2b, 5b and the first equality of 7b indicate that an upper approximation, in the Pawlak space, is a closure operation [12].
For any relation R, the R-subset successor (predecessor) lower approximation of X is the largest R-successor (predecessor) definable set contained in X. This follows directly from the definition. On the other hand, if R is not both reflexive and transitive, then the family of R-successor (predecessor) definable sets is not necessarily a topology, and then none of the upper approximations of X defined so far needs to be the smallest R-successor (predecessor) definable set containing X. This was observed, for the first time, in [5]. In that paper it was also shown that a smallest R-successor definable set containing X is not unique.
An R-smallest successor upper approximation, denoted by $\overline{appr}^{smallest}_s(X)$, is defined as an R-successor definable set with the smallest cardinality containing X. An R-smallest successor upper approximation does not need to be unique. An R-smallest predecessor upper approximation, denoted by $\overline{appr}^{smallest}_p(X)$, is defined as an R-predecessor definable set with the smallest cardinality containing X. Likewise, an R-smallest predecessor upper approximation does not need to be unique.
Let $\overline{appr}^{smallest}_s$ ($\overline{appr}^{smallest}_p$) be a map that for any subset X of U determines one of the possible R-smallest successor (predecessor) upper approximations of X. Additionally, if $X^{cov}_s \neq X$ ($X^{cov}_p \neq X$) then we assume that $X^{cov}_s \subseteq \overline{appr}^{smallest}_s(X)$ ($X^{cov}_p \subseteq \overline{appr}^{smallest}_p(X)$). Such a map $\overline{appr}^{smallest}_s$ ($\overline{appr}^{smallest}_p$) has the following properties [7]:
1. $\overline{appr}^{smallest}_s(\emptyset) = \emptyset$ ($\overline{appr}^{smallest}_p(\emptyset) = \emptyset$),
2. $\overline{appr}^{smallest}_s(U) = U^{cov}_s$ ($\overline{appr}^{smallest}_p(U) = U^{cov}_p$),
3. the map $\overline{appr}^{smallest}_s$ ($\overline{appr}^{smallest}_p$) is non-decreasing,
4. if for any $X \subseteq U$ there exists exactly one $\overline{appr}^{smallest}_s(X)$ ($\overline{appr}^{smallest}_p(X)$), then for any subset Y of U
$\overline{appr}^{smallest}_s(X \cup Y) \subseteq \overline{appr}^{smallest}_s(X) \cup \overline{appr}^{smallest}_s(Y)$
($\overline{appr}^{smallest}_p(X \cup Y) \subseteq \overline{appr}^{smallest}_p(X) \cup \overline{appr}^{smallest}_p(Y)$),
otherwise
$card(\overline{appr}^{smallest}_s(X \cup Y)) \leq card(\overline{appr}^{smallest}_s(X)) + card(\overline{appr}^{smallest}_s(Y - X))$
($card(\overline{appr}^{smallest}_p(X \cup Y)) \leq card(\overline{appr}^{smallest}_p(X)) + card(\overline{appr}^{smallest}_p(Y - X))$),
5. $\overline{appr}^{smallest}_s(\overline{appr}^{smallest}_s(X)) = \overline{appr}^{smallest}_s(X) = \underline{appr}^{subset}_s(\overline{appr}^{smallest}_s(X))$
($\overline{appr}^{smallest}_p(\overline{appr}^{smallest}_p(X)) = \overline{appr}^{smallest}_p(X) = \underline{appr}^{subset}_p(\overline{appr}^{smallest}_p(X))$),
6. X is R-successor (predecessor) definable if and only if $X = \overline{appr}^{smallest}_s(X)$ ($X = \overline{appr}^{smallest}_p(X)$).
Proof. Properties 1 and 2 follow directly from the definition of the smallest upper approximation. For the proof of Property 3 let us suppose that the map $\overline{appr}^{smallest}_s$ is not non-decreasing. Then there exist $X, Y \subseteq U$ such that $X \subset Y$ and $\overline{appr}^{smallest}_s(X) \supset \overline{appr}^{smallest}_s(Y)$. Consider a map defined for any subset Z of U in the following way:
$$appr(Z) = \begin{cases} \overline{appr}^{smallest}_s(Z) & \text{if } Z \neq X, \\ \overline{appr}^{smallest}_s(Y) & \text{if } Z = X. \end{cases}$$
Then $\overline{appr}^{smallest}_s(X) \supset appr(X)$, the set $appr(X)$ is R-successor definable and, since $X \subset Y \subseteq \overline{appr}^{smallest}_s(Y) = appr(X)$, it contains X, a contradiction with the minimality of $card(\overline{appr}^{smallest}_s(X))$.
For the proof of Property 4 let us assume that $\overline{appr}^{smallest}_s(X)$ is uniquely determined for any $X \subseteq U$. For all subsets Y and Z of U with $X = Y \cup Z$, the set $\overline{appr}^{smallest}_s(Y) \cup \overline{appr}^{smallest}_s(Z)$ is R-successor definable and $X^{cov}_s \subseteq \overline{appr}^{smallest}_s(Y) \cup \overline{appr}^{smallest}_s(Z)$. Let us suppose that for subsets X, Y, and
Z of U, with $X = Y \cup Z$, the first inclusion of 4 does not hold. Due to the fact that there exists exactly one $\overline{appr}^{smallest}_s(X)$, we have
$$card(\overline{appr}^{smallest}_s(Y \cup Z)) > card(\overline{appr}^{smallest}_s(Y) \cup \overline{appr}^{smallest}_s(Z)),$$
a contradiction with the assumption that $\overline{appr}^{smallest}_s(X)$ is the R-smallest successor upper approximation of X. In particular, the set $X = Y \cup Z$ may be presented as a union of two disjoint sets, e.g., $X = Y \cup (Z - Y)$. Then the inequality of 4 for $card(\overline{appr}^{smallest}_s(Y \cup Z))$ has the following form:
$$card(\overline{appr}^{smallest}_s(Y \cup Z)) \leq card(\overline{appr}^{smallest}_s(Y) \cup \overline{appr}^{smallest}_s(Z - Y)) \leq card(\overline{appr}^{smallest}_s(Y)) + card(\overline{appr}^{smallest}_s(Z - Y)).$$
If the set $\overline{appr}^{smallest}_s(X)$ is not unique for some $X \subseteq U$, then by a similar argument as in the first part of the proof of 4,
$$card(\overline{appr}^{smallest}_s(Y \cup Z)) \leq card(\overline{appr}^{smallest}_s(Y) \cup \overline{appr}^{smallest}_s(Z - Y)),$$
hence
$$card(\overline{appr}^{smallest}_s(Y \cup Z)) \leq card(\overline{appr}^{smallest}_s(Y)) + card(\overline{appr}^{smallest}_s(Z - Y)).$$
Property 5 follows from the following facts: $\overline{appr}^{smallest}_s(X)$ is R-successor definable, $\underline{appr}^{subset}_s(X)$ is the largest R-successor definable lower approximation, and $\overline{appr}^{smallest}_s(X)$ is the smallest R-successor definable upper approximation.
For the proof of Property 6 let us suppose that X is R-successor definable, i.e., X is equal to a union of R-successor sets. This union is the R-smallest successor upper approximation of X. The converse is true since any R-smallest upper approximation of X is R-successor definable.
Definitions of the smallest successor and predecessor approximations are not constructive in the sense that (except for brute force) there is no indication of how to determine these sets. Note that the definitions of the other approximations indicate how to compute the corresponding sets. Moreover, the definitions of the smallest successor and predecessor approximations are the only definitions of approximations that refer to cardinalities.
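Since the text stresses that the smallest upper approximations are defined non-constructively, a brute-force sketch may help fix the idea (illustrative code only, not from the paper, and exponential in the number of granules; Section 9 recalls that the problem is NP-hard). It reuses the hypothetical successor helper from the earlier sketch and, when X is not fully coverable, requires only that the covered part of X be included.

# Sketch: brute-force search for an R-smallest successor upper approximation.
from itertools import chain, combinations

def smallest_successor_upper(U, R, X):
    granules = {successor(R, y) for y in U}                   # candidate building blocks
    x_cov = frozenset(X) & frozenset(chain.from_iterable(granules))
    best = None
    for k in range(len(granules) + 1):                        # all 2^n families of granules
        for family in combinations(granules, k):
            union = frozenset(chain.from_iterable(family))
            if x_cov <= union and (best is None or len(union) < len(best)):
                best = union                                  # keep the smallest-cardinality cover
    return best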
7
Dual Approximations
As was shown in [29], singleton approximations are dual for any relation R. In Section 5 we proved that modified singleton approximations are also dual. On the other hand, it was shown in [29] that if R is not an equivalence relation then subset approximations are not dual. Moreover, concept approximations are not dual either, unless R is reflexive and transitive [6].
All approximations discussed in this section are dual to some approximations from Sections 4 and 6. By Theorem 3, properties of dual approximations follow from the corresponding properties of the original approximations. Two additional approximations were defined in [29]. The first approximation, denoted by $\underline{appr}^{dualsubset}_s(X)$, was defined by
$$\neg(\overline{appr}^{subset}_s(\neg X)),$$
while the second one, denoted by $\overline{appr}^{dualsubset}_s(X)$, was defined by
$$\neg(\underline{appr}^{subset}_s(\neg X)).$$
These approximations are called the R-dual subset successor lower and R-dual subset successor upper approximations, respectively. Obviously, we may define as well the R-dual subset predecessor lower approximation
$$\neg(\overline{appr}^{subset}_p(\neg X))$$
and the R-dual subset predecessor upper approximation
$$\neg(\underline{appr}^{subset}_p(\neg X)).$$
R-dual subset successor (predecessor) lower approximations have the following Properties: 3a, 4a, 5a, 6a. R-dual subset successor (predecessor) upper approximations have the following Properties: 1b, 3b, 4b, 6b, 7b.
By analogy we may define dual concept approximations. Namely, the R-dual concept successor lower approximation of X, denoted by $\underline{appr}^{dualconcept}_s(X)$, is defined by
$$\neg(\overline{appr}^{concept}_s(\neg X)).$$
The R-dual concept successor upper approximation of X, denoted by $\overline{appr}^{dualconcept}_s(X)$, is defined by
$$\neg(\underline{appr}^{concept}_s(\neg X)).$$
The set denoted by $\underline{appr}^{dualconcept}_p(X)$ and defined by the formula
$$\neg(\overline{appr}^{concept}_p(\neg X))$$
will be called an R-dual concept predecessor lower approximation, while the set $\overline{appr}^{dualconcept}_p(X)$ defined by the formula
$$\neg(\underline{appr}^{concept}_p(\neg X))$$
will be called an R-dual concept predecessor upper approximation. These four R-dual concept approximations were introduced in [6]. R-dual concept successor (predecessor) lower approximations have the following Properties: 3a, 4a, 5a.
The R-dual concept successor (predecessor) upper approximations have the following Properties: 1b, 3b, 4b, 6b.
Again, by analogy, we may define dual approximations for the smallest upper approximations. The set denoted by $\underline{appr}^{dualsmallest}_s(X)$ and defined by
$$\neg(\overline{appr}^{smallest}_s(\neg X))$$
will be called an R-dual smallest successor lower approximation of X, while the set denoted by $\underline{appr}^{dualsmallest}_p(X)$ and defined by
$$\neg(\overline{appr}^{smallest}_p(\neg X))$$
will be called an R-dual smallest predecessor lower approximation of X. These two approximations are introduced in this work for the first time.
The R-dual smallest successor (predecessor) lower approximations have the following Properties: 3a, the first equality of 7a, and the map $\underline{appr}^{dualsmallest}_s$ is non-decreasing.
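A small sketch (again not the paper's code) of how dual approximations are obtained mechanically by complementation, reusing the hypothetical subset approximation helpers assumed in the earlier sketches.

# Sketch: dual subset approximations via complementation.
def dual_subset_lower(U, R, X):
    """neg(upper_subset_s(neg X))."""
    return frozenset(U) - subset_upper(U, R, frozenset(U) - frozenset(X))

def dual_subset_upper(U, R, X):
    """neg(lower_subset_s(neg X))."""
    return frozenset(U) - subset_lower(U, R, frozenset(U) - frozenset(X))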
8
Approximations with Mixed Idempotency
Smallest upper approximations, introduced in Section 6, and subset lower approximations are the only approximations discussed so far that satisfy the Mixed Idempotency Property, i.e.,
$$\underline{appr}_s(X) = \overline{appr}_s(\underline{appr}_s(X)) \quad (\underline{appr}_p(X) = \overline{appr}_p(\underline{appr}_p(X))) \qquad (1)$$
and
$$\overline{appr}_s(X) = \underline{appr}_s(\overline{appr}_s(X)) \quad (\overline{appr}_p(X) = \underline{appr}_p(\overline{appr}_p(X))). \qquad (2)$$
For lower and upper approximations that satisfy conditions 1 and 2, for any subset X of U, both approximations of X are fixed points. For $\underline{appr}^{subset}$ and $\overline{appr}^{smallest}$, definable sets (successor or predecessor) are fixed points. For the following upper approximations definable sets are fixed points (and they are computable in polynomial time).
An upper approximation, denoted by $\overline{appr}^{subset\text{-}concept}_s(X)$ and defined as follows
$$\underline{appr}^{subset}_s(X) \cup \bigcup \{R_s(x) \mid x \in X - \underline{appr}^{subset}_s(X) \text{ and } R_s(x) \cap X \neq \emptyset\}$$
will be called the R-subset-concept successor upper approximation of X. An upper approximation, denoted by $\overline{appr}^{subset\text{-}concept}_p(X)$ and defined as follows
$$\underline{appr}^{subset}_p(X) \cup \bigcup \{R_p(x) \mid x \in X - \underline{appr}^{subset}_p(X) \text{ and } R_p(x) \cap X \neq \emptyset\}$$
will be called the R-subset-concept predecessor upper approximation of X. An upper approximation, denoted by $\overline{appr}^{subset\text{-}subset}_s(X)$ and defined as follows
$$\underline{appr}^{subset}_s(X) \cup \bigcup \{R_s(x) \mid x \in U - \underline{appr}^{subset}_s(X) \text{ and } R_s(x) \cap X \neq \emptyset\}$$
will be called the R-subset-subset successor upper approximation of X.
An upper approximation, denoted by $\overline{appr}^{subset\text{-}subset}_p(X)$ and defined as follows
$$\underline{appr}^{subset}_p(X) \cup \bigcup \{R_p(x) \mid x \in U - \underline{appr}^{subset}_p(X) \text{ and } R_p(x) \cap X \neq \emptyset\}$$
will be called the R-subset-subset predecessor upper approximation of X.
Any of these four upper approximations, together with $\underline{appr}^{subset}_s$ (or $\underline{appr}^{subset}_p$, respectively), satisfies the Mixed Idempotency Property (both second equalities of 7a and 7b). The upper approximations presented in this section do not preserve many of the Properties listed as 1–8 in Section 3.
Approximations $\overline{appr}^{subset\text{-}concept}_s(X)$ and $\overline{appr}^{subset\text{-}concept}_p(X)$ have only Properties 2b and 7b. For these approximations Properties 4b, 5b, and 6b do not hold even if the relation R is reflexive. For the approximation $\overline{appr}^{subset\text{-}concept}_s(X)$ this can be shown by the following example. Let U = {1, 2, 3, 4}, R = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1), (3, 3), (4, 4)}. Then $R_s(1) = U$, $R_s(2) = \{1, 2\}$, $R_s(3) = \{1, 3\}$, $R_s(4) = \{4\}$. Property 4b is not satisfied for the sets X = {1} and Y = {1, 2}, Property 5b is not satisfied for the sets X = {1} and Y = {2}, and Property 6b is not satisfied for the sets X = {1, 2} and Y = {1, 3}.
For the approximations $\overline{appr}^{subset\text{-}subset}_s(X)$ and $\overline{appr}^{subset\text{-}subset}_p(X)$, the generalized Properties 1b and 3b, as well as Properties 2b and 7b, hold. By analogy with Section 7, for these four upper approximations we may define corresponding dual lower approximations. Thus in this section we introduced 8 new approximations.
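The two new successor upper approximations can likewise be sketched in code, again only as an illustration under the definitions written above (not an implementation from the paper), reusing the hypothetical helpers introduced earlier.

# Sketch: subset-concept and subset-subset successor upper approximations.
def subset_concept_upper(U, R, X):
    low = subset_lower(U, R, X)
    extra = chain.from_iterable(successor(R, x) for x in frozenset(X) - low
                                if successor(R, x) & frozenset(X))
    return low | frozenset(extra)

def subset_subset_upper(U, R, X):
    low = subset_lower(U, R, X)
    extra = chain.from_iterable(successor(R, x) for x in frozenset(U) - low
                                if successor(R, x) & frozenset(X))
    return low | frozenset(extra)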
9
Coalescence of Rough Approximations
Finding the $\overline{appr}^{smallest}_s(X)$ and $\overline{appr}^{smallest}_p(X)$ approximations is NP-hard. In real-life applications such approximations should be replaced by other approximations that are easier to compute. Therefore it is important to know under which conditions approximations are equal or comparable. Figures 1–8 show relations between lower and upper approximations. Sets in the same box are identical. Arrows indicate inclusions between the corresponding sets. Information about successor (predecessor) definability is added as well. For a symmetric relation R the corresponding sets $R_s$ and $R_p$ are identical, so the corresponding successor (predecessor) approximations are also identical; for a symmetric relation R the corresponding approximations have double subscripts, s and p. Some inclusions between subset, concept and singleton approximations discussed in this section were previously presented in [1, 2, 22, 24, 29].
Remark. For any approximation space, relations between successor approximations are the same as for predecessor approximations. Therefore, we will prove only properties for successor approximations. Due to Theorem 4, we restrict our attention to proofs for lower approximations and skip proofs for dual upper approximations.
Fig. 1. Inclusions for lower approximations and an approximation space with an arbitrary relation
Proofs for equalities and inclusions from Figures 1 and 2. The inclusion $\underline{appr}^{concept}_p(X) \subseteq \underline{appr}^{subset}_p(X)$ follows from the definitions of both approximations. Indeed, $\underline{appr}^{concept}_p(X)$ is the union of some subsets of U that are included in $\underline{appr}^{subset}_p(X)$.
The inclusion $\underline{appr}^{modsingleton}_p(X) \subseteq \underline{appr}^{singleton}_p(X)$ follows directly from the definitions of both approximations.
For the proof of the inclusion $\underline{appr}^{singleton}_p(X) \subseteq \underline{appr}^{dualconcept}_s(X)$ let x be an element of U. If $x \in \underline{appr}^{singleton}_p(X)$ then $R_p(x) \subseteq X$. Thus for any $y \in U$ such that $x \in R_s(y)$ we have $y \notin \neg X$. Hence $x \notin \overline{appr}^{concept}_s(\neg X)$ and $x \in \underline{appr}^{dualconcept}_s(X)$.
The inclusion $\underline{appr}^{dualsubset}_s(X) \subseteq \underline{appr}^{dualconcept}_s(X)$ follows from $\overline{appr}^{concept}_s(X) \subseteq \overline{appr}^{subset}_s(X)$ and from the duality of the corresponding lower and upper approximations (Theorem 4). Similarly, $\underline{appr}^{dualsubset}_s(X) \subseteq \underline{appr}^{dualsmallest}_s(X)$ follows from $\overline{appr}^{smallest}_s(X) \subseteq \overline{appr}^{subset}_s(X)$ and from the duality of the corresponding lower and upper approximations (Theorem 4).
Fig. 2. Inclusions for upper approximations and an approximation space with an arbitrary relation
Remaining equalities and inclusions from Figures 1 and 2 follow from the Remark of this section and the transitivity of inclusion.
Proofs for equalities and inclusions from Figures 3 and 4. The proof of $\underline{appr}^{subset}_s(X) = \underline{appr}^{concept}_s(X)$ follows from the fact that if R is reflexive, then for any $x \notin X$ we have $R_s(x) \not\subseteq X$. The equality of $\underline{appr}^{modsingleton}_s(X)$, $\underline{appr}^{singleton}_s(X)$ and $\underline{appr}^{dualconcept}_p(X)$ will be shown by showing the equality of the respective upper approximations. The equality of $\overline{appr}^{modsingleton}_s(X)$ and $\overline{appr}^{singleton}_s(X)$ follows from their definitions, since for a reflexive relation R we have $R_s(x) \neq \emptyset$ for any $x \in U$. The proof of the equality of $\overline{appr}^{singleton}_s(X)$ and $\overline{appr}^{concept}_p(X)$ is in [22] (formula 5).
Fig. 3. Inclusions for lower approximations and an approximation space with a reflexive relation
Fig. 4. Inclusions for upper approximations and an approximation space with a reflexive relation
For the proof of $\underline{appr}^{singleton}_s(X) \subseteq \underline{appr}^{subset}_s(X)$ it is enough to observe that any element x is a member of $R_s(x)$. Hence for any element $x \in \underline{appr}^{singleton}_s(X)$ we have $R_s(x) \subseteq X$, so $x \in \underline{appr}^{subset}_s(X)$.
The proof of $\underline{appr}^{dualsubset}_s(X) \subseteq \underline{appr}^{singleton}_s(X)$ follows from the fact that the inclusion $\overline{appr}^{singleton}_s(X) \subseteq \overline{appr}^{subset}_s(X)$ is true (for the proof, see [22] (formula 15) and [29] (Theorem 7)).
Remaining equalities and inclusions from Figures 3 and 4 follow from Figures 1 and 2, the Remark of this section and the transitivity of inclusion.
Proofs for equalities and inclusions from Figures 5 and 6. All equalities and inclusions from Figures 5 and 6 were proved for a reflexive R or follow from the fact that R is symmetric.
Proofs for equalities and inclusions from Figures 7 and 8. Since R is reflexive (Figures 3 and 4), it is enough to show that $\underline{appr}^{concept}_s(X) \subseteq \underline{appr}^{singleton}_s(X)$ and $\underline{appr}^{dualsmallest}_s(X) = \underline{appr}^{dualconcept}_s(X)$.
Fig. 5. Inclusions for lower approximations and an approximation space with a reflexive and symmetric relation
Fig. 6. Inclusions for upper approximations and an approximation space with a reflexive and symmetric relation
Fig. 7. Inclusions for lower approximations and an approximation space with a reflexive and transitive relation
Fig. 8. Inclusions for upper approximations and an approximation space with a reflexive and transitive relation
For the proof of $\underline{appr}^{concept}_s(X) \subseteq \underline{appr}^{singleton}_s(X)$ let $x \in \underline{appr}^{concept}_s(X)$. Hence there exists $y \in X$ such that $R_s(y) \subseteq X$ and $x \in R_s(y)$. Since R is transitive, for any $z \in R_s(x)$ also $z \in R_s(y)$, so $R_s(x) \subseteq R_s(y) \subseteq X$. Hence $x \in \underline{appr}^{singleton}_s(X)$.
Table 1. Conditions for definability

Approximation                              R-successor def.   R-predecessor def.
$\underline{appr}^{singleton}_s(X)$        r∧t                r∧s∧t
$\underline{appr}^{singleton}_p(X)$        r∧s∧t              r∧t
$\overline{appr}^{singleton}_s(X)$         s                  any
$\overline{appr}^{singleton}_p(X)$         any                s
$\underline{appr}^{modsingleton}_s(X)$     r∧t ∨ s∧t          s∧t
$\underline{appr}^{modsingleton}_p(X)$     s∧t                r∧t ∨ s∧t
$\overline{appr}^{modsingleton}_s(X)$      r∧s                r
$\overline{appr}^{modsingleton}_p(X)$      r                  r∧s
$\underline{appr}^{subset}_s(X)$           any                s
$\underline{appr}^{subset}_p(X)$           s                  any
$\overline{appr}^{subset}_s(X)$            any                s
$\overline{appr}^{subset}_p(X)$            s                  any
$\underline{appr}^{dualsubset}_s(X)$       r∧s∧t              r∧t
$\underline{appr}^{dualsubset}_p(X)$       r∧t                r∧s∧t
$\overline{appr}^{dualsubset}_s(X)$        r∧s∧t              r∧t
$\overline{appr}^{dualsubset}_p(X)$        r∧t                r∧s∧t
$\underline{appr}^{concept}_s(X)$          any                s
$\underline{appr}^{concept}_p(X)$          s                  any
$\overline{appr}^{concept}_s(X)$           any                s
$\overline{appr}^{concept}_p(X)$           s                  any
$\underline{appr}^{dualconcept}_s(X)$      r∧s∧t              r∧t
$\underline{appr}^{dualconcept}_p(X)$      r∧t                r∧s∧t
$\overline{appr}^{dualconcept}_s(X)$       r∧s∧t              r∧t
$\overline{appr}^{dualconcept}_p(X)$       r∧t                r∧s∧t
$\overline{appr}^{smallest}_s(X)$          any                s
$\overline{appr}^{smallest}_p(X)$          s                  any
$\underline{appr}^{dualsmallest}_s(X)$     r∧s∧t              r∧t
$\underline{appr}^{dualsmallest}_p(X)$     r∧t                r∧s∧t
Instead of showing that $\underline{appr}^{dualsmallest}_s(X) = \underline{appr}^{dualconcept}_s(X)$ we will first show that $\overline{appr}^{smallest}_s(X) = \overline{appr}^{concept}_s(X)$. In our proof we will first show that if $R_s(x) \subseteq \overline{appr}^{smallest}_s(X)$ then $R_s(x) \subseteq \overline{appr}^{concept}_s(X)$, for any $x \in U$. To do so, we will show that if $R_s(x) \subseteq \overline{appr}^{smallest}_s(X)$ then $x \in X$ or $R_s(x) = \bigcup\{R_s(y) \mid y \in R_s(x) \cap X\}$. For all these cases, taking into account that
$$R_s(x) \subseteq \overline{appr}^{smallest}_s(X) \Rightarrow R_s(x) \cap X \neq \emptyset,$$
our proof is completed. Is it possible that $R_s(x) \subseteq \overline{appr}^{smallest}_s(X)$ and $x \notin X$? Since R is transitive, $R_s(y) \subseteq R_s(x)$ for any $y \in R_s(x)$. Additionally, due to the fact that R is reflexive,
$$\bigcup\{R_s(y) \mid y \in R_s(x) \cap X\} \cap X = R_s(x) \cap X.$$
On the other hand,
$$\bigcup\{R_s(y) \mid y \in R_s(x) \cap X\} \cap \neg X = R_s(x) \cap \neg X,$$
since $R_s(x) \subseteq \overline{appr}^{smallest}_s(X)$, and no family of R-successor definable sets that covers at least the same elements of X as $R_s(x)$ can cover fewer elements of $\neg X$. Thus, our assumption that $R_s(x) \subseteq \overline{appr}^{smallest}_s(X)$ and $x \notin X$ implies that $R_s(x) = \bigcup\{R_s(y) \mid y \in R_s(x) \cap X\}$, and this observation ends the proof of $\overline{appr}^{smallest}_s(X) \subseteq \overline{appr}^{concept}_s(X)$.
For the proof of the converse inclusion let us take an element $x \in \neg X$ such that $x \in \overline{appr}^{concept}_s(X)$. We will show that $x \in \overline{appr}^{smallest}_s(X)$. Let $\mathcal{R}_s = \{R_s(y) \mid y \in X \text{ and } x \in R_s(y)\}$ and let $Y = \{y \mid R_s(y) \in \mathcal{R}_s\}$. R is reflexive and transitive. Thus, $R_s(z)$ is a minimal R-successor definable set containing z for any $z \in U$, and for any R-successor definable set V contained in X we have $V \cap Y = \emptyset$. Therefore there is no family of R-successor definable sets that covers the set Y and does not contain x.
Table 1 summarizes the conditions for R-successor and R-predecessor definability. The following notation is used: r denotes reflexivity, s denotes symmetry, t denotes transitivity, and "any" denotes the lack of constraints on the relation R needed to guarantee the given kind of definability of an arbitrary subset of the universe.
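For readers who want to apply Table 1 to a concrete relation, the conditions r, s and t can be tested directly; the following is an illustrative sketch (not from the paper), with hypothetical function names.

# Sketch: checking reflexivity, symmetry and transitivity of a relation R on U.
def relation_properties(U, R):
    r = all((x, x) in R for x in U)                                   # reflexivity
    s = all((y, x) in R for (x, y) in R)                              # symmetry
    t = all((x, z) in R for (x, y) in R for (w, z) in R if y == w)    # transitivity
    return {"r": r, "s": s, "t": t}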
10
Conclusions
In this paper we studied 28 approximations defined for any binary relation R on a universe U, where R is not necessarily reflexive, symmetric or transitive. Additionally, we showed that it is possible to define 8 additional approximations. Our main focus was on the coalescence and definability of lower and upper approximations of a subset X of U. We checked which approximations of X are, in general, definable. We discussed the special cases of R being reflexive, reflexive and symmetric, and reflexive and transitive.
Note that these special cases have immediate applications to data mining (or machine learning). Indeed, if a data set contains missing attribute values in the form of both lost values (missing attribute values that were, e.g., erased; they were given in the past but currently are not available) and "do not care" conditions (missing attribute values that can be replaced by any attribute value, e.g., the respondent refused to give an answer), then the corresponding characteristic relation [1–4, 7] is reflexive. If the data set contains some missing attribute values and all of them are "do not care" conditions, the corresponding characteristic relation is reflexive and symmetric. Finally, if all missing attribute values of the data set are lost values, then the corresponding characteristic relation is reflexive and transitive.
Acknowledgements This research has been partially supported by the Ministry of Science and Higher Education of the Republic of Poland, grant N N206 408834.
References
1. Grzymala-Busse, J.W.: Rough set strategies to data with missing attribute values. In: Proc. Foundations and New Directions of Data Mining, the 3rd International Conference on Data Mining, pp. 56–63 (2003)
2. Grzymala-Busse, J.W.: Data with missing attribute values: Generalization of indiscernibility relation and rule induction. In: Peters, J.F., Skowron, A., Grzymala-Busse, J.W., Kostek, B., Swiniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 78–95. Springer, Heidelberg (2004)
3. Grzymala-Busse, J.W.: Three approaches to missing attribute values – A rough set perspective. In: Proc. Workshop on Foundation of Data Mining, within the Fourth IEEE International Conference on Data Mining, pp. 55–62 (2004)
4. Grzymala-Busse, J.W.: Incomplete data and generalization of indiscernibility relation, definability, and approximations. In: Slezak, D., Wang, G., Szczuka, M.S., Düntsch, I., Yao, Y. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3641, pp. 244–253. Springer, Heidelberg (2005)
5. Grzymala-Busse, J.W., Rzasa, W.: Local and global approximations for incomplete data. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Slowinski, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 244–253. Springer, Heidelberg (2006)
6. Grzymala-Busse, J.W., Rzasa, W.: Definability of approximations for a generalization of the indiscernibility relation. In: Proceedings of the IEEE Symposium on Foundations of Computational Intelligence (FOCI 2007), Honolulu, Hawaii, pp. 65–72 (2007)
7. Grzymala-Busse, J.W., Rzasa, W.: Approximation space and LEM2-like algorithms for computing local coverings. Accepted to Fundamenta Informaticae (2008)
8. Grzymala-Busse, J.W., Santoso, S.: Experiments on data with three interpretations of missing attribute values – A rough set approach. In: Proc. IIS 2006 International Conference on Intelligent Information Systems, New Trends in Intelligent Information Processing and WEB Mining, pp. 143–152. Springer, Heidelberg (2006)
9. Grzymala-Busse, J.W., Siddhaye, S.: Rough set approaches to rule induction from incomplete data. In: Proc. IPMU 2004, the 10th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, vol. 2, pp. 923–930 (2004)
10. Kryszkiewicz, M.: Rough set approach to incomplete information systems. In: Proc. Second Annual Joint Conference on Information Sciences, pp. 194–197 (1995)
11. Kryszkiewicz, M.: Rules in incomplete information systems. Information Sciences 113, 271–292 (1999)
12. Kuratowski, K.: Introduction to Set Theory and Topology. PWN, Warszawa (1977) (in Polish)
13. Lin, T.Y.: Neighborhood systems and approximation in database and knowledge base systems. In: Fourth International Symposium on Methodologies of Intelligent Systems, pp. 75–86 (1989)
14. Lin, T.Y.: Chinese Wall security policy – An aggressive model. In: Proc. Fifth Aerospace Computer Security Application Conference, pp. 286–293 (1989)
15. Lin, T.Y.: Topological and fuzzy rough sets. In: Slowinski, R. (ed.) Intelligent Decision Support. Handbook of Applications and Advances of the Rough Sets Theory, pp. 287–304. Kluwer Academic Publishers, Dordrecht (1992)
16. Liu, G., Zhu, W.: Approximations in rough sets versus granular computing for coverings. In: RSCTC 2008, the Sixth International Conference on Rough Sets and Current Trends in Computing, Akron, OH, October 23–25 (2008); a presentation at the Panel on Theories of Approximation, 18 p.
17. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11, 341–356 (1982)
18. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
19. Pomykala, J.A.: On definability in the nondeterministic information system. Bulletin of the Polish Academy of Sciences. Mathematics 36, 193–210 (1988)
20. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: Slowinski, R. (ed.) Handbook of Applications and Advances of the Rough Sets Theory, pp. 331–362. Kluwer Academic Publishers, Dordrecht (1992)
21. Skowron, A., Stepaniuk, J.: Tolerance approximation space. Fundamenta Informaticae 27, 245–253 (1996)
22. Slowinski, R., Vanderpooten, D.: A generalized definition of rough approximations based on similarity. IEEE Transactions on Knowledge and Data Engineering 12, 331–336 (2000)
23. Stefanowski, J.: Algorithms of Decision Rule Induction in Data Mining. Poznan University of Technology Press, Poznan (2001)
24. Stefanowski, J., Tsoukias, A.: On the extension of rough sets under incomplete information. In: Zhong, N., Skowron, A., Ohsuga, S. (eds.) RSFDGrC 1999. LNCS (LNAI), vol. 1711, pp. 73–81. Springer, Heidelberg (1999)
25. Stefanowski, J., Tsoukias, A.: Incomplete information tables and rough classification. Computational Intelligence 17, 545–566 (2001)
26. Wang, G.: Extension of rough set under incomplete information systems. In: Proc. IEEE International Conference on Fuzzy Systems, vol. 2, pp. 1098–1103 (2002)
27. Wybraniec-Skardowska, U.: On a generalization of approximation space. Bulletin of the Polish Academy of Sciences. Mathematics 37, 51–62 (1989)
28. Yao, Y.Y.: Two views of the theory of rough sets in finite universes. International Journal of Approximate Reasoning 15, 291–317 (1996)
29. Yao, Y.Y.: Relational interpretations of neighborhood operators and rough set approximation operators. Information Sciences 111, 239–259 (1998)
30. Yao, Y.Y.: On the generalizing rough set theory. In: Proc. 9th Int. Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, pp. 44–51 (2003)
31. Yao, Y.Y., Lin, T.Y.: Generalization of rough sets using modal logics. Intelligent Automation and Soft Computing 2, 103–119 (1996)
32. Ziarko, W.: Variable precision rough set model. Journal of Computer and System Sciences 46, 39–59 (1993)
33. Zakowski, W.: Approximations in the space (U, Π). Demonstratio Mathematica 16, 761–769 (1983)
Variable Consistency Bagging Ensembles
Jerzy Błaszczyński 1, Roman Słowiński 1,2, and Jerzy Stefanowski 1
1 Institute of Computing Science, Poznań University of Technology, 60-965 Poznań, Poland
2 Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland
{jurek.blaszczynski,roman.slowinski,jerzy.stefanowski}@cs.put.poznan.pl
Abstract. In this paper we claim that the classification performance of a bagging classifier can be improved by drawing to bootstrap samples objects that are more consistent with their assignment to decision classes. We propose a variable consistency generalization of the bagging scheme where such sampling is controlled by two types of measures of consistency: rough membership and a monotonic measure. The usefulness of this proposal is experimentally confirmed with various rule and tree base classifiers. The results of experiments show that variable consistency bagging improves classification accuracy on inconsistent data.
1
Introduction
In the last decade, a growing interest has been observed in integrating several base classifiers into one classifier in order to increase classification accuracy. Such classifiers are known as multiple classifiers, ensembles of classifiers or committees [15,31]. Ensembles of classifiers usually perform better than their component classifiers used independently. Previous theoretical research (see, e.g., its summary in [10,15]) clearly indicated that combining several classifiers is effective only if there is a substantial level of disagreement among them, i.e., if they make errors independently of one another. Thus, a necessary condition for efficient integration is diversification of the component base classifiers. Several methods have been proposed to obtain diverse base classifiers inside an ensemble of classifiers, e.g., by changing the distributions of examples in the learning set, manipulating the input features, or using different learning algorithms on the same data – for comprehensive reviews see again [15,31]. The best known methods are bagging and boosting, which modify the set of objects by sampling or weighting particular objects and use the same learning algorithm to create base classifiers. Multiple classifiers have also attracted the interest of rough set researchers; however, these works were mainly focused on rather simple and partly evident solutions, such as applying various sets of attributes, e.g. reducts, in the context of rule induction [29,22], or using rough set based rule classifiers inside the framework of some ensembles, see e.g. [25,24]. One can also notice this type of inspiration in constructing hierarchical classifiers using specific features [18]. In this study, we consider one of the basic concepts of rough sets, i.e., a measure of consistency of objects. The research question is whether it is possible, while
constructing multiple classifiers, to take the consistency of objects into account. Our research hypothesis is that not all objects in the training data may be equally useful for inducing accurate classifiers. Usually noisy or inconsistent objects are a source of difficulties that may lead to overfitting of standard, single classifiers and decrease their classification performance. Pruning techniques in machine learning and variable consistency [3,4] or variable precision [33] generalizations of rough sets are applied to reduce this effect. As a multiple classifier we choose bagging, mainly because it is easier to generalize according to our needs than boosting. Moreover, it has already been successfully studied with rule induction algorithms [25]. This is the first motivation of our current research. The other comes from analysing some related modifications of bagging, such as Breiman's proposals of Random Forests [7] or Pasting Small Votes [6]. Let us remark that the main idea of the standard version of the bagging method [5] is quite simple and appealing – the ensemble consists of several classifiers induced by the same learning algorithm over several different distributions of input objects, and the outputs of base classifiers are aggregated by equal weight voting. The base classifiers used in bagging are expected to have sufficiently high predictive accuracy apart from being diversified [7]. The key issue in standard bagging concerns bootstrap sampling – drawing many different bootstrap samples by uniform sampling with replacement, so that each object is assigned the same probability of being sampled. In this context the research question is whether the consistency of objects is worth taking into account while sampling training objects to the bootstrap samples used in bagging. Giving more chance to consistent objects (i.e., objects which certainly belong to a class) and decreasing the chance of selecting borderline or noisy ones may lead to more accurate and diversified base classifiers in the bagging scheme. Of course, such modifications should be addressed to data that are sufficiently inconsistent. Following these motivations, let us remark that rough set approaches can provide useful information about the consistency of an assignment of an object to a given class. This consistency results from granules (classes) defined by a given relation (e.g. indiscernibility, dominance) and from the dependency between granules and decision categories. Objects that belong to the lower approximation of a class are consistent, otherwise they are inconsistent. The consistency of objects can be measured by consistency measures, such as the rough membership function. The rough membership function, introduced by Ziarko et al., is used in variable precision and variable consistency generalizations of rough sets [3,33]. Yet other monotonic measures of consistency were introduced by Błaszczyński et al. [4]. These measures allow defining variable consistency rough set approaches that preserve basic properties of rough sets and also simplify the construction of classifiers. The main aim of our study is to propose a new generalization of the bagging scheme, called variable consistency bagging, where the sampling of objects is controlled by the rough membership or monotonic consistency measure. That is why we call this approach variable consistency bagging (VC-bagging). We will
consider rule and tree base classifiers learned by different algorithms, including those adapted to the dominance-based rough set approach. Another aim is to evaluate experimentally the usefulness of variable consistency bagging on data sets characterized by different levels of consistency. The paper is organized as follows. In the next section, we recall the bagging scheme and present sampling algorithms based on consistency measures. In Section 3, we describe variable consistency bagging and the learning algorithms used to create base classifiers. In Section 4, results of experiments are presented. We conclude by giving remarks and recommendations for applications of the presented techniques.
2
Consistency Sampling Algorithms
2.1
Bagging Scheme
The Bagging approach (an acronym for Bootstrap aggregating) was introduced by Breiman [5]. It aggregates, by voting, classifiers generated from different bootstrap samples. A bootstrap sample is obtained by uniformly sampling with replacement objects from the training set. Let the training set consist of m objects. Each sample contains n ≤ m objects (usually it has the same size as the original set); however, some objects do not appear in it, while others may appear more than once. The same probability 1/m of being sampled is assigned to each object. The probability of an object being selected at least once is $1 - (1 - 1/m)^m$. For a large m, this is about $1 - 1/e$. Each bootstrap sample contains, on average, 63.2% unique objects from the training set. Given the parameter T, which is the number of repetitions, T bootstrap samples $S_1, S_2, \ldots, S_T$ are generated. From each sample $S_i$ a classifier $C_i$ is induced by the same learning algorithm and the final classifier $C^*$ is formed by aggregating the T classifiers. A final classification of object x is obtained by a uniform voting scheme on $C_1, C_2, \ldots, C_T$, i.e., x is assigned to the class predicted most often by these base classifiers, with ties broken arbitrarily. The approach is presented briefly below. For more details see [5].

input: LS – learning set; T – number of bootstrap samples; LA – learning algorithm
output: C* – classifier
begin
  for i = 1 to T do
  begin
    S_i := bootstrap sample from LS;   {sample with replacement}
    C_i := LA(S_i);                    {generate a base classifier}
  end;  {end for}
  C*(x) := arg max_{y ∈ K_j} Σ_{i=1}^{T} (C_i(x) = y)   {the most often predicted class}
end
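The scheme above translates directly into code; the following sketch is not the authors' implementation and assumes scikit-learn-style base learners exposing fit and predict methods, which is an assumption of this illustration rather than something the paper prescribes.

# Sketch: standard bagging with equal-weight voting.
import random
from collections import Counter

def bagging_train(learning_set, labels, learning_algorithm, T):
    m = len(learning_set)
    classifiers = []
    for _ in range(T):
        idx = [random.randrange(m) for _ in range(m)]        # sample with replacement
        sample_x = [learning_set[i] for i in idx]
        sample_y = [labels[i] for i in idx]
        classifiers.append(learning_algorithm().fit(sample_x, sample_y))
    return classifiers

def bagging_predict(classifiers, x):
    votes = Counter(c.predict([x])[0] for c in classifiers)  # equal-weight voting
    return votes.most_common(1)[0][0]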
Experimental results show a significant improvement of the classification accuracy, in particular when using decision tree classifiers. An improvement is also observed when using rule classifiers, as shown in [25]. However, the choice of a base classifier is not without importance. According to Breiman [5], what makes a
base classifier suitable is its instability. A base classifier is unstable when small changes in the learning set cause major changes in the classifier. For instance, decision tree and rule classifiers are unstable, while k-Nearest Neighbor classifiers are not. For a more theoretical discussion on the justification of the problem "why bagging works" the reader is referred to [5,15]. Since bagging is an effective and open framework, several researchers have proposed its variants, some of which have turned out to have a lower classification error than the original version proposed by Breiman. Some of them are summarized in [15]. Let us mention only Breiman's proposals of Random Forests [7] and Pasting Small Votes [6], or the integration of bootstrap sampling with feature selection [16,26].
2.2
Variable Consistency Sampling
The goal of variable consistency sampling is to increase the predictive accuracy of bagged classifiers by using additional information that reflects the consistency of objects and can easily be computed from the training data. The resulting bagged classifiers are trained on bootstrap samples slightly shifted towards more consistent objects. In general, the above idea is partly related to some earlier proposals of changing probability distributions while constructing bagging-inspired ensembles. In particular, Breiman proposed a variant called Pasting Small Votes [6]. Its main motivation was to handle massive data, which do not fit into the computer operating memory, by sampling much smaller bootstraps. Drawing objects to these small samples is the key point of the modification. In Ivotes they are sampled sequentially, where at each step the probability of selecting a given object is modified by its importance. This importance is estimated at each step by the accuracy of a new base classifier. Importance sampling is known to provide better results than the simple bagging scheme. Variable consistency sampling could be seen as similar to the above importance sampling because it is also based on a modification of the probability distribution. However, the difference between these two approaches lies in the fact that consistency is evaluated in the pre-processing, before learning of the base classifiers takes place. Our expectation is that drawing bootstrap samples from a distribution that reflects the consistency of objects will not decrease the diversity of the samples. The reader familiar with ensemble classifiers may notice other solutions that take into account misclassification of training objects by base classifiers. In boosting, more focus is given to objects that are difficult to classify by an iteratively extended set of base classifiers. We argue that these ensembles are based on a different principle of stepwise adding classifiers and using the accuracy from the previous step of learning while changing the weights of objects. In typical bagging there is no such information and all bootstrap samples are independent. Our other observation is that estimating the role of objects in the pre-processing of training data is more similar to previous works on improving rough set approaches by variable precision models or some other works on edited k-nearest neighbor classifiers, see e.g. the review in [30]. For instance, in the IBL family of
algorithms proposed by Aha et al. in [1], the IBL3 version, which kept the most useful objects for correct classification and removed noisy or borderline examples, was more accurate than IBL2 version, which focused on difficult examples from border between classes. Stefanowski et al. also observed in [28] similar performance of another variant of nearest neighbor cleaning rule in a specific approach to pre-process of imbalanced data. Let us comment more precisely the meaning of consistency. Object x is consistent if a base classifier trained on a sample that includes this object is able to re-classify this object correctly. Otherwise, object x is inconsistent. Remark that lower approximations of sets include consistent objects only. Calculating consistency measures of objects is sufficient to detect consistent objects in the pre-processing and thus learning is not required. In variable consistency sampling, the procedure of bootstrap sampling is different than in the bagging scheme described in section 2.1. The rest of the bagging scheme remains unchanged. When sampling with replacement from the training set is performed, a measure of consistency c(x) is calculated for each object x from the training set. A consistent object x has c(x) = 1, inconsistent objects have 0 ≤ c(x) < 1. The consistency measure is used to tune the probability of object x being sampled to a bootstrap sample, e.g. by calculating a product of c(x) and 1/m. Thus, objects that are inconsistent have decreased probability of being sampled. Objects that are more consistent (i.e., have higher value of a consistency measure) are more likely to appear in the bootstrap sample. Different measures of consistency may result in different probability of inconsistent object x being sampled. 2.3
Consistency Measures
To present a consistency measure, we first recall basic notions from rough set theory [19], namely the relations used to compare objects and the elementary classes (also called atoms or granules) defined by these relations. Consideration of the indiscernibility relation is meaningful when the set of attributes A is composed of regular attributes only. The indiscernibility relation makes a partition of the universe U into disjoint blocks of objects that have the same description and are considered indiscernible. Let $V_{a_i}$ be the value set of attribute $a_i$ and $f : U \times A \to V_{a_i}$ be a total function such that $f(x, a_i) \in V_{a_i}$. The indiscernibility relation $I_P$ is defined for a subset of attributes $P \subseteq A$ as
$$I_P = \{(x, y) \in U \times U : f(x, a_i) = f(y, a_i) \text{ for all } a_i \in P\}. \qquad (1)$$
If an object x belongs to a class of the relation $I_P$ in which all objects are assigned to the same decision class, it is consistent. The indiscernibility relation is not the only possible relation between objects. When attributes from A have preference-ordered value sets they are called criteria. In order to make meaningful classification decisions on criteria, one has to consider the dominance relation instead of the indiscernibility relation. The resulting approach, called the Dominance-based Rough Set Approach (DRSA), has been presented in [11,21]. The dominance relation makes a partition of the universe U
into granules being dominance cones. The dominance relation $D_P$ is defined for criteria $P \subseteq A$ as
$$D_P = \{(x, y) \in U \times U : f(x, a_i) \succeq f(y, a_i) \text{ for all } a_i \in P\}, \qquad (2)$$
where $f(x, a_i) \succeq f(y, a_i)$ means "x is at least as good as y w.r.t. criterion $a_i$". The dominance relation $D_P$ is a partial preorder (i.e., reflexive and transitive). For each object x two dominance cones are defined: the cone $D_P^+(x)$ composed of all objects that dominate x, and the cone $D_P^-(x)$ composed of all objects that are dominated by x. While in the indiscernibility-based rough set approach the decision classes $X_i$, $i = 1, \ldots, n$, are not necessarily ordered, in DRSA they are ordered, such that if $i < j$, then class $X_i$ is considered to be worse than $X_j$. In DRSA, unions of decision classes are approximated: upward unions $X_i^{\geq} = \bigcup_{t \geq i} X_t$, $i = 2, \ldots, n$, and downward unions $X_i^{\leq} = \bigcup_{t \leq i} X_t$, $i = 1, \ldots, n-1$. Considering the above concepts allows us to handle another kind of inconsistency of object description, manifested by a violation of the dominance principle (i.e., object x having not worse evaluations on criteria than y cannot be assigned to a worse class than y), which is not discovered by the standard rough sets with the indiscernibility relation.
To simplify notation, we define $E_P(x)$ as the granule defined for object x by the indiscernibility or dominance relation. For the same reason, consistency measures are presented here in the context of sets X of objects, being either a given class $X_i$, or a union $X_i^{\geq}$ or $X_i^{\leq}$. In fact, this is not ambiguous if we specify that $D_P^+(x)$ is used when $X_i^{\geq}$ are considered and that $D_P^-(x)$ is used when $X_i^{\leq}$ are considered.
The rough membership consistency measure was introduced in [32]. It is used to control positive regions in the Variable Precision Rough Set (VPRS) model [33]. The rough membership of $x \in U$ to $X \subseteq U$ is defined for $P \subseteq A$ as
$$\mu_X^P(x) = \frac{|E_P(x) \cap X|}{|E_P(x)|}. \qquad (3)$$
Rough membership captures the ratio of the number of objects belonging both to granule $E_P(x)$ and to the considered set X to the number of all objects belonging to granule $E_P(x)$. This measure is an estimate of the conditional probability $Pr(x \in X \mid x \in E_P(x))$. Measure $\epsilon_X^P(x)$ was applied in monotonic variable consistency rough set approaches [4]. For $P \subseteq A$, $X \subseteq U$, $\neg X = U - X$, $x \in U$, it is defined as
$$\epsilon_X^P(x) = \frac{|E_P(x) \cap \neg X|}{|\neg X|}. \qquad (4)$$
In the numerator of (4) is the number of objects in U that do not belong to set X and belong to granule $E_P(x)$; in the denominator, the number of objects in U that do not belong to set X. The ratio $\epsilon_X^P(x)$ is an estimate of the conditional probability $Pr(x \in E_P(x) \mid x \in \neg X)$, called also a catch-all likelihood. To use the measures $\mu_X^P(x)$ and $\epsilon_X^P(x)$ in consistency sampling we need to transform them into a measure c(x), defined for a given object x, a fixed set of attributes P and a fixed set of objects X as
$$c(x) = \mu_X^P(x) \quad \text{or} \quad c(x) = 1 - \epsilon_X^P(x). \qquad (5)$$
For DRSA, the higher of the values of consistency c(x) calculated for the unions $X^{\geq}$ and $X^{\leq}$ is taken.
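The following sketch (illustrative code only, not the authors' implementation) computes formulas (3)–(5) for the indiscernibility case, where the granule E_P(x) is the set of objects with the same description on attributes P, and then uses the resulting c(x) values to draw a bootstrap sample with probabilities proportional to c(x)·1/m as described in Section 2.2; all function and variable names are hypothetical.

# Sketch: consistency measures (indiscernibility case) and VC bootstrap sampling.
import random

def granule(data, P, x):
    """E_P(x): indices of objects indiscernible from object x on attributes P."""
    return {i for i, row in enumerate(data) if all(row[a] == data[x][a] for a in P)}

def mu(data, labels, P, x):
    """Rough membership (3), with X taken as the decision class of object x."""
    E = granule(data, P, x)
    X = {i for i, y in enumerate(labels) if y == labels[x]}
    return len(E & X) / len(E)

def epsilon(data, labels, P, x):
    """Measure (4): |E_P(x) ∩ ¬X| / |¬X|, with X the decision class of object x."""
    E = granule(data, P, x)
    not_X = {i for i, y in enumerate(labels) if y != labels[x]}
    return len(E & not_X) / len(not_X) if not_X else 0.0

def consistency_mu(data, labels, P, x):       # c(x) from (5), rough membership variant
    return mu(data, labels, P, x)

def consistency_eps(data, labels, P, x):      # c(x) = 1 - epsilon from (5)
    return 1 - epsilon(data, labels, P, x)

def vc_bootstrap_indices(consistency, n=None):
    """Draw a bootstrap sample with probabilities proportional to c(x) * 1/m."""
    m = len(consistency)
    n = n or m
    weights = [consistency[i] / m for i in range(m)]
    return random.choices(range(m), weights=weights, k=n)   # with replacement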
3
Experimental Setup for Variable Consistency Bagging with Rules and Trees Base Classifiers
The key issue in variable consistency bagging is a modification of the probability of sampling objects into bootstraps with regard to either the rough membership $\mu$ or the consistency measure $\epsilon$. To evaluate whether this modification may improve the classification performance of bagging, we planned an experiment where standard bagging is compared against the two variants of sampling, called $\mu$ bagging and $\epsilon$ bagging, and against a single classifier. In all of the compared bagging versions, base classifiers are learned using the same learning algorithms tuned with the same set of parameters. The same concerns learning single classifiers. First, we chose the ModLEM algorithm [23], as it can induce a minimal set of rules from rough approximations, it may lead to efficient classifiers, and it has already been successfully applied inside a few multiple classifiers [25]. Its implementation was run with the following parameters: evaluation measures based on entropy information gain, no pruning of rules, and application of the classification strategy for solving ambiguous, multiple or partial matches proposed by Grzymala-Busse et al. [13]. The Ripper algorithm [8] was also considered, to study the performance of another popular rule induction algorithm based on different principles. We also used it with no pruning of rules. Additionally, for data where attributes have preference-ordered scales we studied the DomLEM algorithm, specific for inducing rules in DRSA [12]. Also in this case no pruning of rules is made and a classification strategy suitable for preference-ordered classes is applied. Finally, the comparison was also extended to using Quinlan's C4.5 algorithm to generate unpruned decision tree base classifiers, as bagging is known to work efficiently for decision trees. Predictions of base classifiers were always aggregated into the final classification decision by majority voting. The classification accuracy was the main evaluation criterion and it was estimated by 10-fold stratified cross validation repeated 10 times. We evaluated performance for seven data sets listed in Table 1. They come mainly from the UCI repository [2], with one exception, acl, from our own clinical applications. We were able to find orders of preference for attributes of three of these data sets: car, denbosch and windsor. Four of the considered sets (car, bupa, ecoli, pima) were originally too consistent (i.e., they had too high a quality of classification), so we modified them to make them more inconsistent. Three data sets (bupa, ecoli, pima) included real-valued attributes, so they were discretized by a local approach based on minimizing entropy. Moreover, we also used a reduced set of attributes to decrease the level of consistency. The final characteristic of the data is summarized in Table 1. The data sets that were modified are marked with 1 .
Modified data set.
Table 1. Characteristic of data sets

data set    # objects   # attributes   # classes   quality of classification
acl         140         6              2           0.88
bupa1       345         6              2           0.72
car1        1296        6              4           0.61
denbosch    119         8              2           0.9
ecoli1      336         7              8           0.41
pima1       768         8              2           0.79
windsor     546         10             4           0.35

4
4 Results of Experiments
Table 2 summarizes the consistency of bootstrap samples obtained by bagging and by VC-bagging with measures μ and ε. For each data set, the average percentage of inconsistent objects and the average consistency of a sample (calculated over 1000 samples) are reported. The average consistency of a sample is calculated as the average value of the given consistency measure.

Table 2. Consistency of bootstrap samples resulting from bagging and VC-bagging

data set   type of sampling   % inconsistent objects   average consistency
acl        bagging                 12.161                   0.943
           μ bagging                7.7                     0.966
           ε bagging               11.987                   0.997
bupa       bagging                 28.265                   0.873
           μ bagging               20.244                   0.916
           ε bagging               28.1                     0.996
car        bagging                 38.96                    0.95
           μ bagging               36.601                   0.958
           ε bagging               38.298                   0.988
denbosch   bagging                 10.175                   0.994
           μ bagging                9.8                     0.994
           ε bagging               10.023                   0.997
ecoli      bagging                 59.617                   0.799
           μ bagging               52.041                   0.873
           ε bagging               59.472                   0.992
pima       bagging                 20.783                   0.907
           μ bagging               14.389                   0.941
           ε bagging               20.766                   0.999
windsor    bagging                 65.183                   0.921
           μ bagging               63.143                   0.929
           ε bagging               64.637                   0.979
The results show that VC-bagging produces more consistent samples (i.e., samples with fewer inconsistent objects and a higher average consistency of the objects in the sample). However, the magnitude of this effect depends on the measure used and the data set.
Table 3. Classification accuracy resulting from 10 x 10-fold cross-validation of a single classifier and an ensemble of 50 classifiers resulting from standard bagging and VC-bagging. The rank of the result for the same data set and classifier is given in brackets.

data set   classifier   single            bagging           μ bagging         ε bagging
acl        C4.5         84.64±0.92 (4)    85.21±1.1 (1)     84.71±1.1 (3)     84.79±0.9 (2)
acl        ModLEM       86.93±0.96 (4)    88.21±0.86 (1)    87.36±1.6 (3)     88.14±0.73 (2)
acl        Ripper       85.79±1.4 (2)     85.64±0.81 (3)    86.07±0.66 (1)    85.57±0.62 (4)
bupa       C4.5         66.67±0.94 (4)    70.35±0.93 (3)    71.01±0.53 (1)    70.55±0.96 (2)
bupa       ModLEM       68.93±1.0 (4)     69.28±0.85 (2.5)  71.07±1 (1)       69.28±1.0 (2.5)
bupa       Ripper       65.22±1.4 (4)     67.22±1.1 (3)     71.77±1.0 (1)     67.3±1.0 (2)
car        C4.5         78±0.42 (2.5)     77.97±0.46 (4)    79.76±0.42 (1)    78±0.46 (2.5)
car        ModLEM       69.64±0.33 (3)    69.44±0.2 (4)     69.88±0.31 (1)    69.75±0.21 (2)
car        Ripper       69.96±0.23 (4)    70.86±0.39 (3)    72.56±0.37 (1)    71.94±0.44 (2)
car        DomLEM       81.69±0.17 (4)    82.32±0.17 (3)    82.5±0.18 (2)     82.55±0.29 (1)
denbosch   C4.5         80.92±2.3 (4)     85.46±1.2 (3)     85.63±1.4 (1.5)   85.63±1.7 (1.5)
denbosch   ModLEM       79.33±2.3 (4)     85.04±1.6 (3)     85.55±1.0 (1)     85.46±1.1 (2)
denbosch   Ripper       81.93±2.3 (4)     86.8±1.5 (2)      86.72±1.2 (3)     87.23±1.3 (1)
denbosch   DomLEM       83.95±1.27 (4)    86.77±1.6 (1)     86.6±1.2 (2)      86.53±1.3 (3)
ecoli      C4.5         81.4±0.48 (4)     81.37±0.6 (3)     81.52±0.57 (1)    81.46±0.58 (2)
ecoli      ModLEM       49.23±0.58 (2)    49.08±0.64 (3)    68.96±0.87 (1)    49.05±0.76 (4)
ecoli      Ripper       77.59±0.61 (4)    78.72±0.68 (3)    81.31±0.51 (1)    78.75±0.65 (2)
pima       C4.5         72.21±1.1 (4)     72.88±1.2 (1)     72.8±1.1 (2)      72.5±1.2 (3)
pima       ModLEM       72.98±0.76 (4)    73.95±0.7 (3)     75.01±0.47 (1)    74.02±0.75 (2)
pima       Ripper       71.3±0.84 (4)     72.64±0.71 (3)    73.84±0.6 (1)     72.83±0.61 (2)
windsor    C4.5         45.53±1.5 (4)     48±1.2 (2)        48.15±0.85 (1)    47.91±0.74 (3)
windsor    ModLEM       41.03±1.6 (4)     42.89±1.4 (2)     42.77±1.3 (3)     43.81±1.1 (1)
windsor    Ripper       40.05±1.4 (4)     49.16±1.7 (1)     48.24±1.4 (2)     46.78±1.3 (3)
windsor    DomLEM       50.11±0.75 (4)    53.1±0.68 (3)     53.37±1.0 (2)     53.68±0.92 (1)
average rank            3.729             2.521             1.562             2.188
It follows that μ bagging leads to more consistent samples, while ε bagging is less restrictive in introducing inconsistent objects into a sample. The value of c(x) involving the ε measure is usually higher than c(x) involving the μ measure. As follows from formula (4), ε relates the number of inconsistent objects to the whole number of objects in the data set that may cause inconsistencies. On the other hand, from (3), μ measures inconsistency more locally: it relates the number of consistent objects in the granule to the number of objects in the granule. A significant increase of consistency in variable consistency samples is visible for data sets like bupa, car, ecoli and pima. In the case of the other data sets (acl, denbosch, windsor) this effect is not visible; moreover, the consistency of a sample is high for these data sets even when the standard bagging scheme is applied. To explore this point it is useful to look at the quality of classification of the whole data set shown in Table 1. The less consistent data sets are those for which variable consistency sampling worked better. Naturally, this is also reflected by the results of VC-bagging.
Another issue is the size of the data sets used in the experiment. Two data sets, acl and denbosch, are considerably smaller than the others, which can affect the results. The experimental comparison of the classification performance of the standard bagging scheme against VC-bagging is summarized in Table 3. We measured the average classification accuracy and its standard deviation. Each classifier used in the experiment is an ensemble of 50 base classifiers. The results presented in Table 3 show that the two VC-bagging variants do well against standard bagging. VC-bagging with the μ measure (μ bagging) almost always improves the results. Generally, application of VC-bagging never decreased the predictive accuracy of the compared classifiers, with the one exception of ModLEM on the data set ecoli. VC-bagging gives less visible improvements for the data sets acl, denbosch and windsor; this can be explained by the smallest increase of consistency in samples observed for these data sets. One can also notice that ε bagging seems to give better results than the other bagging techniques when it is used with DomLEM; however, this point needs further research. Application of DomLEM significantly increased the results in the case of the car and windsor data sets. To compare the performance of all classifiers on multiple data sets more formally, we follow the statistical approach proposed in [9]. More precisely, we apply the Friedman test, which tests the null hypothesis that all classifiers perform equally well. It uses the ranks of each classifier on each of the data sets. In this test, we compare the performance of the single classifier, bagging, μ bagging and ε bagging with different base classifiers. Following the principles of the Friedman test, it is assumed that the results of base classifiers on the same data set are independent. We are aware that this assumption is not completely fulfilled in our case; however, we claim that the compared base classifiers are substantially different (and in this sense independent). In our experiment, we use one tree-based classifier and three rule classifiers, and each of the rule classifiers employs a different strategy of rule induction. The Friedman statistic for the data in Table 3 is 35.82, which exceeds the critical value 7.82 (for confidence level 0.05), so we can reject the null hypothesis. To compare the performance of classifiers, we use a post-hoc analysis and calculate the critical difference (CD) according to the Nemenyi statistic [17]. In our case, for confidence level 0.05, CD = 0.95. Classifiers whose difference in average ranks is higher than CD are significantly different.
Fig. 1. Critical difference for data from Table 3 (CD = 0.957); the average ranks of the single classifier, bagging, μ bagging and ε bagging are marked on the rank axis
We present the results of the post-hoc analysis in Figure 1. Average ranks are marked on the bottom line, and groups of classifiers that are not significantly different are connected. Single classifiers are significantly worse than any bagging variant. Among the bagging variants, μ bagging is significantly better than standard bagging. ε bagging leads to a lower value of the average rank than standard bagging; however, this difference turns out not to be significant in the post-hoc analysis.
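As an illustration of this post-hoc procedure, the sketch below recomputes the Friedman statistic and the Nemenyi critical difference from a rank matrix such as the one in Table 3. It is a generic reimplementation of the standard formulas from [9,17]; the Nemenyi critical value q_0.05 = 2.569 for four classifiers is taken from the usual table.

```python
import math

def friedman_statistic(ranks):
    """ranks: list of per-data-set rank lists, one rank per compared classifier."""
    n, k = len(ranks), len(ranks[0])
    avg = [sum(row[j] for row in ranks) / n for j in range(k)]            # average ranks
    chi2 = 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4.0)
    return chi2, avg

def nemenyi_cd(k, n, q_alpha=2.569):
    """Critical difference between average ranks at alpha = 0.05 for k classifiers, n data sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# with the 24 rank rows of Table 3 this gives chi2 of about 35.8 and CD of about 0.96
print(nemenyi_cd(k=4, n=24))
```

Two classifiers are then declared significantly different whenever their average ranks differ by more than the returned CD, which is how the groups in Figure 1 are formed.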
5 Conclusions
In this study, we have considered the question whether the consistency of objects should be taken into account while drawing training objects into the bootstrap samples used in bagging. We claim that increasing the chance of selecting more consistent learning objects, and reducing the probability of selecting too inconsistent objects, leads to creating more accurate and still sufficiently diversified base classifiers. The main methodological contribution of this paper is the proposal of variable consistency bagging, where such sampling is controlled by the rough membership or by monotonic consistency measures. The statistical analysis of the experimental results, discussed in the previous section, has shown that VC-bagging performed better than standard bagging, although only μ bagging is significantly better. One may also notice that the improvement of the classification performance depends on the quality of classification of the original data set: better performance of VC-bagging could be observed for the more inconsistent data sets. To sum up, the proposed approach does not decrease the accuracy of classification for consistent data sets, while it improves the accuracy for inconsistent data sets. Moreover, its key concept is highly consistent with the principles of rough set theory and can be easily implemented without much additional computational cost. It can thus be considered an out-of-the-box method to improve the prediction of classifiers.
Acknowledgments The authors wish to acknowledge financial support from the Ministry of Science and Higher Education, grant N N519 314435.
References 1. Aha, D.W., Kibler, E., Albert, M.K.: Instance-based learning algorithms. Machine Learning Journal 6, 37–66 (1991) 2. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine (2007), http://www.ics.uci.edu/~mlearn/MLRepositoru.html 3. Błaszczyński, J., Greco, S., Słowiński, R., Szeląg, M.: On Variable Consistency Dominance-based Rough Set Approaches. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Słowiński, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 191–202. Springer, Heidelberg (2006)
4. Błaszczyński, J., Greco, S., Słowiński, R., Szeląg, M.: Monotonic Variable Consistency Rough Set Approaches. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Śl¸ezak, D. (eds.) RSKT 2007. LNCS (LNAI), vol. 4481, pp. 126–133. Springer, Heidelberg (2007) 5. Breiman, L.: Bagging predictors. Machine Learning Journal 24(2), 123–140 (1996) 6. Breiman, L.: Pasting small votes for classification in large databases and on-line. Machine Learning Journal 36, 85–103 (1999) 7. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001) 8. Cohen, W.W.: Fast effective rule induction. In: Proc. of the 12th Int. Conference on Machine Learning, pp. 115–123 (1995) 9. Demsar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006) 10. Dietrich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000) 11. Greco, S., Matarazzo, B., Słowiński, R.: Rough sets theory for multicriteria decision analysis. European Journal of Operational Research 129, 1–47 (2001) 12. Greco, S., Matarazzo, B., Słowiński, R., Stefanowski, J.: An algorithm for induction of decision rules consistent with dominance principle. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 304–313. Springer, Heidelberg (2001) 13. Grzymala-Busse, J.W.: Managing uncertainty in machine learning from examples. In: Proc. 3rd Int. Symp. in Intelligent Systems, pp. 70–84 (1994) 14. Grzymala-Busse, J.W., Stefanowski, J.: Three approaches to numerical attribute discretization for rule induction. International Journal of Intelligent Systems 16(1), 29–38 (2001) 15. Kuncheva, L.: Combining Pattern Classifiers. In: Methods and Algorithms, Wiley, Chichester (2004) 16. Latinne, P., Debeir, O., Decaestecker, Ch.: Different ways of weakening decision trees and their impact on classification accuracy of decision tree combination. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, p. 200. Springer, Heidelberg (2000) 17. Nemenyi, P.B.: Distribution free multiple comparison. Ph.D. Thesis, Princenton Univeristy (1963) 18. Hoa, N.S., Nguyen, T.T., Son, N.H.: Rough sets approach to sunspot classification problem. In: Ślęzak, D., Yao, J., Peters, J.F., Ziarko, W.P., Hu, X. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3642, pp. 263–272. Springer, Heidelberg (2005) 19. Pawlak, Z.: Rough sets. International Journal of Information & Computer Sciences 11, 341–356 (1982) 20. Quinlan, J.R.: Bagging, boosting and C4.5. In: Proc. of the 13th National Conference on Artificial Intelligence, pp. 725–730 (1996) 21. Słowiński, R., Greco, S., Matarazzo, B.: Rough set based decision support. In: Burke, E.K., Kendall, G. (eds.) Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques, pp. 475–527. Springer, Heidelberg (2005) 22. Slezak, D.: Approximate entropy reducts. Fundamenta Informaticae 53(3/4), 365– 387 (2002) 23. Stefanowski, J.: The rough set based rule induction technique for classification problems. In: Proc. of 6th European Conference on Intelligent Techniques and Soft Computing. EUFIT 1998, pp. 109–113 (1998)
24. Stefanowski, J.: Multiple and hybrid classifiers. In: Polkowski, L. (ed.) Formal Methods and Intelligent Techniques in Control, Decision Making, Multimedia and Robotics, Post-Proceedings of 2nd Int. Conference, Warszawa, pp. 174–188 (2001) 25. Stefanowski, J.: The bagging and n2-classifiers based on rules induced by MODLEM. In: Tsumoto, S., Słowiński, R., Komorowski, J., Grzymała-Busse, J.W. (eds.) RSCTC 2004. LNCS (LNAI), vol. 3066, pp. 488–497. Springer, Heidelberg (2004) 26. Stefanowski, J., Kaczmarek, M.: Integrating attribute selection to improve accuracy of bagging classifiers. In: Proc. of the AI-METH 2004 Conference - Recent Developments in Artificial Intelligence Methods, Gliwice, pp. 263–268 (2004) 27. Stefanowski, J., Nowaczyk, S.: An experimental study of using rule induction in combiner multiple classifier. International Journal of Computational Intelligence Research 3(4), 335–342 (2007) 28. Stefanowski, J., Wilk, S.: Improving Rule Based Classifiers Induced by MODLEM by Selective Pre-processing of Imbalanced Data. In: Proc. of the RSKD Workshop at ECML/PKDD, Warsaw, pp. 54–65 (2007) 29. Suraj, Z., Gayar Neamat, E., Delimata, P.: A Rough Set Approach to Multiple Classifier Systems. Fundamenta Informaticae 72(1-3), 393–406 (2006) 30. Wilson, D.R., Martinez, T.: Reduction techniques for instance-based learning algorithms. Machine Learning Journal 38, 257–286 (2000) 31. Valentini, G., Masuli, F.: Ensembles of Learning Machines. In: Marinaro, M., Tagliaferri, R. (eds.) WIRN 2002. LNCS, vol. 2486, pp. 3–19. Springer, Heidelberg (2002) 32. Wong, S.K.M., Ziarko, W.: Comparison of the probabilistic approximate classification and the fuzzy set model. Fuzzy Sets and Systems 21, 357–362 (1987) 33. Ziarko, W.: Variable precision rough sets model. Journal of Computer and Systems Sciences 46(1), 39–59 (1993)
Classical and Dominance-Based Rough Sets in the Search for Genes under Balancing Selection

Krzysztof A. Cyran

Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
[email protected]

Abstract. Since the time of Kimura's theory of neutral evolution at the molecular level, the search for genes under natural selection has been one of the crucial problems in population genetics. Quite a number of statistical tests have been designed for it; however, the interpretation of the results is often hard due to the existence of extra-selective factors, such as population growth, migration and recombination. The author, in his earlier work, proposed the idea of a multi-null-hypotheses methodology applied to testing for selection in the ATM, RECQL, WRN and BLM genes, the foursome implicated in human familial cancer. However, because of the high computational effort required for estimating the critical values under nonclassical null hypotheses, that strategy is not an appropriate tool for selection screening. The current article presents a novel, rough set based methodology, helpful in the interpretation of the outcomes of tests applied only against classical nulls. The author considers for this purpose both the classical and the dominance-based rough set frameworks. Neither of the rough set based methods requires long-lasting simulations and, as shown in the paper, both give reliable results. The advantage of the dominance-based approach over the classical one is a more natural treatment of statistical test outcomes, resulting in better generalization without the necessity of manually incorporating domain-dependent reasoning into the process of knowledge processing. However, in testing, this gain in generalization proved to come at the price of a slight loss of accuracy.

Keywords: classical rough sets approach, dominance-based rough sets approach, natural selection, balancing selection, ATM, BLM, RECQL, WRN, neutrality tests.
1 Introduction
According to Kimura's neutral model of evolution [1], the majority of genetic variation at the molecular level is caused by selectively neutral forces. These include genetic drift and silent mutations. Although the role of selection is thus reduced as compared to the selection-driven model of evolution, it is obvious that some mutations must be deleterious and some are selectively positive (the ASPM locus, a major contributor to brain size in primates [2,3],
is a well-known example). Other examples of research on the detection of natural selection can be found in [4,5,6,7]. There exists one more kind of natural selection, called balancing selection, whose detection with the use of rough set methods is described in this paper. The search for balancing selection is a crucial problem in contemporary population genetics, because such selection is associated with serious genetic disorders. Many statistical tests [8,9,10,11] have been proposed for the detection of such selection. Yet the interpretation of the outcomes of the tests is not a trivial task, because factors such as population growth, migration and recombination can produce similar results of the tests [12]. The author, in his earlier work (published in part in [13] and in part unpublished), proposed the idea of a multi-null-hypotheses methodology used for the detection of balancing selection in genes implicated in human familial cancer. One of the genes was ATM (ataxia-telangiectasia mutated) and the three others were DNA helicases involved in DNA repair. The names of the helicases are RECQL, WRN (Werner's syndrome, see [14]) and BLM (Bloom's syndrome, see [15]). However, the methodology proposed earlier (even when formalized) is not appropriate as a screening tool, because long-lasting simulations are required for computing the critical values of the tests under nonclassical null hypotheses. To avoid the need for extensive computer simulations, the author proposed an artificial intelligence based methodology. In particular, rough set theory was studied in the context of its applicability to the aforementioned problem. As the result of this research, the author presents in the current paper the application of two rough set approaches: the classical [16,17] and the dominance-based model [18,19,20]. Instead of trying to incorporate all feasible demography and recombination based parameters into the null hypotheses, the newly proposed methodology relies on the assumption that extra-selective factors influence different neutrality tests in different ways. Therefore it should be possible to detect the selection signal from a battery of neutrality tests even using classical null hypotheses. For this purpose an expert knowledge acquisition method should be used, where the expert knowledge can be obtained from the multi-null-hypotheses technique applied to some small set of genes. The decision algorithm obtained in this way can subsequently be applied to the search for selection in genes which were not the subject of a study with the multi-null-hypotheses methodology. Since the critical values of the neutrality tests for classical null hypotheses are known, the aforementioned strategy does not require time-consuming computer simulations. The use of classical rough set theory was dictated by the fact that test outcomes are naturally discretized to a few values only; this approach was presented in [21]. The current paper, which is an extended and corrected version of that article, also deals with dominance-based rough sets. The dominance-based approach has not been considered in this context before, neither in [21] nor elsewhere,
and a comparison of the applicability of the two approaches to this particular case study is done for the first time.
2 Genetic Data
Single nucleotide polymorphism (SNP) data taken from the intronic regions of the target genes were used as the genetic material for this study. The SNPs form so-called haplotypes of the mentioned loci. A number of interesting problems concerning these genes were addressed by the author and his co-workers, including the question of selection signatures and other haplotype-related problems [13,21,22,23,24]. The first gene analyzed is ataxia-telangiectasia mutated (ATM) [25,26,27,28,29]. The product of this gene is a large protein implicated in the response to DNA damage and regulation of the cell cycle. The other three genes are human helicases [30]: RECQL [24], Bloom's syndrome (BLM) [31,32,33,34] and Werner's syndrome (WRN) [35,36,37,38]. The products of these three genes are enzymes involved in various types of DNA repair, including direct repair, mismatch repair and nucleotide excision repair. The locations of the genes are as follows. The ATM gene is located in human chromosomal region 11q22-q23 and spans 184 kb of genomic DNA. The WRN locus spans 186 kb at 8p12-p11.2 and its intron-exon structure includes 35 exons, with the coding sequence beginning in the second exon. The RECQL locus contains 15 exons, spans 180 kb and is located at 12p12-p11, whereas BLM is mapped to 15q26.1, spans 154 kb, and is composed of 22 exons. For this study, blood samples were taken and genotyped from residents of Houston, TX. The blood donors belonged to four major ethnic groups: Caucasians, Asians, Hispanics, and African-Americans, and therefore they are representative of the human population.
3 Methodology
To detect departures from the neutral model, the author used the following tests: Tajima's (1989) T (for uniformity, the author follows here the nomenclature of Fu [9] and Wall [11]), Fu and Li's (1993) F* and D*, Wall's (1999) Q and B, Kelly's (1997) ZnS and Strobeck's S test. The definitions of these tests can be found in the original works of their inventors and, in a summarized form, in the article by Cyran et al. (2004) [13]. Tests such as McDonald-Kreitman's (1991) [39], Akashi's (1995) [40], Nielsen and Weinreich's (1999) [41], as well as Hudson, Kreitman and Aguade's (1987) HKA test [42], were excluded because of the form of genetic data they require. The haplotypes for particular loci were inferred and their frequencies were estimated using the Expectation-Maximization algorithm [43,44,45]. The rough set methods are used to simplify the search for balancing selection. Such selection (if present) should be reflected by statistically significant departures from the null of Tajima's and Fu's tests towards positive values. Since not all such departures are indeed caused by balancing selection
[12], but can be the result of factors such as population change in time, migration between subpopulations and recombination, a wide battery of tests was used. The problem of interpreting their combinations was solved with the use of rough set methods: the Classical Rough Set Approach (CRSA) and the Dominance-based Rough Set Approach (DRSA) were applied for this purpose. The decision table contained the test results treated as conditional attributes and a decision about balancing selection treated as the decision attribute. The author was able to determine the value of this decision attribute for a given combination of conditional attributes, based on previous studies and heavy computer simulations. The goal of this work is to compare two methodologies used for automatic interpretation of the battery of tests; with either of them the interpretation can be done without time-consuming simulations. In order to find the required set of tests which is informative about the problem, the notion of a relative reduct with respect to the decision attribute was applied. In the case of classical rough sets, in order to obtain decision rules as simple as possible, relative value reducts were also used for particular elements of the universe. To study the generalization properties and to estimate the decision error, the jack-knife cross-validation technique was used, known to generate an unbiased estimate of the decision error. The search for reducts and value reducts and the rule generation were performed by software written by the author, based on discernibility matrices and minimization of positively defined Boolean functions. The search for reducts and the rule generation process in DRSA were conducted with the use of the 4eMka System, a rule system for multicriteria decision support integrating the dominance relation with rough approximation [46,47]. This software is available on the web page of the Laboratory of Intelligent Decision Support Systems, Institute of Computing Science, Poznan University of Technology (Poznan 2000, http://www-idss.cs.put.poznan.pl/).
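For readers unfamiliar with reduct computation from discernibility matrices, the sketch below shows a simple greedy (Johnson-style) heuristic that selects attributes until every non-empty discernibility entry is covered. It is a generic illustration of the technique mentioned above, not the author's software, and the example entries are hypothetical.

```python
def greedy_relative_reduct(discernibility):
    """discernibility: list of attribute sets, one per pair of objects
    from different decision classes that must be discerned."""
    entries = [set(e) for e in discernibility if e]
    reduct = set()
    while entries:
        # pick the attribute that discerns the largest number of remaining pairs
        counts = {}
        for e in entries:
            for a in e:
                counts[a] = counts.get(a, 0) + 1
        best = max(counts, key=counts.get)
        reduct.add(best)
        entries = [e for e in entries if best not in e]
    return reduct

# hypothetical discernibility entries for a table of test outcomes
cells = [{"T", "D*"}, {"T", "ZnS"}, {"D*", "ZnS", "F*"}, {"T"}]
print(greedy_relative_reduct(cells))   # a small covering set such as {'T', 'D*'}
```

Value reducts for individual objects can be obtained in the same spirit by restricting the entries to the row of the discernibility matrix associated with that object.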
4 Results and Discussion
The results of the tests T, D*, F*, S, Q, B and ZnS are given in Table 1, together with the decision attribute denoting the evidence of balancing selection based on computer simulations. The values of the tests are: non-significant (NS) when p > 0.05, significant (S) if 0.01 < p < 0.05, and strongly significant (SS) when p < 0.01. The last column indicates the evidence or lack of evidence of balancing selection, based on the detailed analysis according to the multi-null methodology. The CRSA-based analysis of Decision Table 1 revealed that there exist two relative reducts: RED1 = {D*, T, ZnS} and RED2 = {D*, T, F*}. It is clearly visible that the core set is composed of the tests D* and T, whereas the tests ZnS and F* can in general be chosen arbitrarily. However, since both Fu's tests F* and D* belong to the same family and their outcomes can be strongly correlated, it seems advantageous to choose Kelly's ZnS instead of the F* test.
Table 1. Statistical tests results for the classical null hypothesis. Adopted from the article [21].

Gene    Population   D*   B    Q    T    S    ZnS   F*   Balancing selection
ATM     AfAm         S    NS   NS   S    NS   NS    S    Yes
ATM     Cauc         S    NS   NS   SS   SS   S     SS   Yes
ATM     Asian        NS   NS   NS   S    NS   S     NS   Yes
ATM     Hispanic     S    NS   NS   SS   NS   S     S    Yes
RECQL   AfAm         NS   NS   NS   SS   NS   NS    NS   Yes
RECQL   Cauc         S    NS   NS   SS   NS   NS    SS   Yes
RECQL   Asian        NS   S    S    S    NS   S     NS   Yes
RECQL   Hispanic     S    NS   NS   SS   NS   NS    S    Yes
WRN     AfAm         NS   NS   NS   NS   NS   NS    NS   No
WRN     Cauc         S    NS   NS   NS   NS   NS    NS   No
WRN     Asian        S    NS   NS   NS   NS   NS    NS   No
WRN     Hispanic     NS   NS   NS   NS   NS   NS    NS   No
BLM     AfAm         NS   NS   NS   NS   NS   NS    NS   No
BLM     Cauc         NS   NS   NS   S    NS   NS    S    No
BLM     Asian        NS   NS   NS   NS   NS   NS    NS   No
BLM     Hispanic     NS   NS   NS   NS   NS   NS    NS   No
Probably the ZnS outcomes are (at least theoretically) less correlated with the outcomes of test D*, which belongs to the core and is therefore required in each reduct. In DRSA only one reduct was found and, interestingly, it was the one preferred by geneticists, i.e. RED1. Decision Table 1 with the set of conditional attributes reduced to the set RED1 is presented in Table 2.

Table 2. The Decision Table, in which the set of tests is reduced to the relative reduct RED1. Adopted from [21].

Gene    Population   D*   T    ZnS   Balancing selection
ATM     AfAm         S    S    NS    Yes
ATM     Cauc         S    SS   S     Yes
ATM     Asian        NS   S    S     Yes
ATM     Hispanic     S    SS   S     Yes
RECQL   AfAm         NS   SS   NS    Yes
RECQL   Cauc         S    SS   NS    Yes
RECQL   Asian        NS   S    S     Yes
RECQL   Hispanic     S    SS   NS    Yes
WRN     AfAm         NS   NS   NS    No
WRN     Cauc         S    NS   NS    No
WRN     Asian        S    NS   NS    No
WRN     Hispanic     NS   NS   NS    No
BLM     AfAm         NS   NS   NS    No
BLM     Cauc         NS   S    NS    No
BLM     Asian        NS   NS   NS    No
BLM     Hispanic     NS   NS   NS    No
After the reduction of the informative tests to the set RED1 = {D*, T, ZnS}, the problem of the coverage of the (discrete) space generated by these statistics was considered. Since the reduct was the same for classical and dominance-based rough sets, the coverage of the space by the examples included in the training set was identical in both cases. The results are presented in Table 3. The domain of each test outcome (coordinate) is composed of three values: SS (strong statistical significance, p < 0.01), S (statistical significance, 0.01 < p < 0.05), and NS (no significance, p > 0.05). A given point in the space is assigned to Sel (evidence of balancing selection), NSel (no evidence of balancing selection) or an empty cell (point not covered by the training data). The assignment is done based on the raw training data with the conditional part reduced to the relative reduct RED1. Note that the fraction of points covered by training examples is only 30%. The next step, available only in the case of classical rough sets, was the computation of the relative value reducts for the particular decision rules in Decision Table 2. The new decision table with relative value reducts is presented in Table 4. This table is the basis for the Classical Rough Sets (CRS) Decision Algorithm 1.

CRS Algorithm 1, adopted from [21]
BALANCING_SELECTION If: T = SS or (T = S and D* = S) or ZnS = S
NO_SELECTION If: T = NS or (T = S and D* = NS and ZnS = NS)

Table 3. The discrete space of three tests: D*, T and ZnS. Adopted from [21].
                T = SS                  T = S                   T = NS
           ZnS: SS   S    NS       ZnS: SS   S    NS       ZnS: SS   S    NS
D* = SS
D* = S               Sel  Sel                     Sel                     NSel
D* = NS                   Sel                Sel  NSel                    NSel
Certainly this algorithm is simplified and more general compared to the algorithm that corresponds directly to Decision Table 2. The increase of generality can be observed by comparison of Table 5 with Table 3. In Table 5 the domain of each test outcome (coordinate) is also composed of three values: SS (strong statistical significance, p < 0.01), S (statistical significance, 0.01 < p < 0.05), and NS (no significance, p > 0.05). A given point in the space is assigned to Sel or NSel (with the meaning identical to that in Table 3), or to "-", which denotes a contradiction between evidence and no evidence of balancing selection. The coverage of points is based on the number of points which are classified with the use of CRS Algorithm 1. Observe that the fraction of points covered by this algorithm is 74%; however, since 11% of the points are classified as both with and without selection, only 63% of the points can really be treated as covered.
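A direct, executable reading of CRS Algorithm 1 may help here; the sketch below classifies a single (D*, T, ZnS) outcome triple and reports a contradiction when both rules fire. It is only a literal transcription of the two rules above, with names chosen by us.

```python
def crs_algorithm_1(d_star, t, zns):
    """Classify one combination of test outcomes ('NS', 'S' or 'SS')."""
    selection = (t == "SS") or (t == "S" and d_star == "S") or (zns == "S")
    no_selection = (t == "NS") or (t == "S" and d_star == "NS" and zns == "NS")
    if selection and no_selection:
        return "-"        # contradiction: both rules fire
    if selection:
        return "Sel"
    if no_selection:
        return "NSel"
    return None           # point not covered by any rule

print(crs_algorithm_1("S", "SS", "S"))   # 'Sel', as for ATM in Caucasians in Table 2
```

Running such a function over all 27 points of the discrete space reproduces the kind of coverage analysis summarized above.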
Table 4. The set of tests with relative value reducts used in classical rough sets. Adopted from [21].

Gene    Population   D*   T    ZnS   Balancing selection
ATM     AfAm         S    S          Yes
ATM     Cauc              SS         Yes
ATM     Asian                   S    Yes
ATM     Hispanic          SS         Yes
RECQL   AfAm              SS         Yes
RECQL   Cauc              SS         Yes
RECQL   Asian                   S    Yes
RECQL   Hispanic          SS         Yes
WRN     AfAm              NS         No
WRN     Cauc              NS         No
WRN     Asian             NS         No
WRN     Hispanic          NS         No
BLM     AfAm              NS         No
BLM     Cauc         NS   S    NS    No
BLM     Asian             NS         No
BLM     Hispanic          NS         No
In the case of the classical approach, CRS Algorithm 1 is the final result of a purely automatic knowledge processing technique. It can be further improved manually by supplying it with additional information concerning the domain under study, but such a solution is not elegant. The elegant solution uses dominance-based rough sets, capable of automatic knowledge processing for domains with ordered outcomes (or preferences). It is clearly true that if balancing selection is indicated by the statistical significance of a given test, then such selection is even more probable when the outcome of this test is strongly statistically significant. Application of dominance-based rough sets results in the Dominance Rough Set (DRS) Algorithm, presented below.

DRS Algorithm
at least BALANCING_SELECTION If: T >= SS or (T >= S and D* >= S) or ZnS >= S
at most NO_SELECTION If: T <= NS or (T <= S and D* <= NS and ZnS <= NS)
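To show how such dominance ("at least") conditions are evaluated on the ordered scale NS < S < SS, the sketch below encodes the scale numerically and checks the "at least BALANCING_SELECTION" rule stated above; only that rule is implemented, the encoding and function names are ours, and this is an illustration rather than the 4eMka implementation.

```python
ORDER = {"NS": 0, "S": 1, "SS": 2}   # preference-ordered outcome scale

def at_least(outcome, level):
    """True when the test outcome is at least as significant as the given level."""
    return ORDER[outcome] >= ORDER[level]

def drs_at_least_balancing_selection(d_star, t, zns):
    return (at_least(t, "SS")
            or (at_least(t, "S") and at_least(d_star, "S"))
            or at_least(zns, "S"))

print(drs_at_least_balancing_selection("NS", "SS", "NS"))  # True, e.g. RECQL in African-Americans
```

Because the conditions are monotone in the ordered scale, a point classified as selected keeps that classification whenever any test outcome becomes more significant, which is the generalization advantage discussed above.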
IB(x) = {y ∈ U : <x, y> ∈ τB(εa1, εa2, ..., εak)}    (2)
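As an illustration of the tolerance set defined in (2), the sketch below collects, for an object x, all objects whose attribute-wise distances to x do not exceed the corresponding thresholds. The distance functions, thresholds and data are hypothetical stand-ins and not part of the TRS library.

```python
def tolerance_set(x, universe, attributes, distance, thresholds):
    """I_B(x): objects y whose distance to x on every attribute a in B is <= eps_a."""
    return [y for y in universe
            if all(distance[a](x[a], y[a]) <= thresholds[a] for a in attributes)]

# hypothetical numeric attributes with absolute-difference distances
dist = {"a1": lambda u, v: abs(u - v), "a2": lambda u, v: abs(u - v)}
eps = {"a1": 0.5, "a2": 1.0}
U = [{"a1": 1.0, "a2": 3.0}, {"a1": 1.4, "a2": 3.5}, {"a1": 2.5, "a2": 3.0}]
print(tolerance_set(U[0], U, ("a1", "a2"), dist, eps))   # the first two objects are tolerant to U[0]
```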
In other words we can say that IB (x) set is a tolerance set of the element x. When tolerance thresholds are determined it is possible to determine a set of decision rules [54,31] that create descriptions of decision classes. As a decision class we mean the set Xv = {x ∈ U : d(x) = v}, where v ∈ Dd . Accuracy and generality (coverage) of obtained rules is dependent on the distance measure δa and tolerance thresholds values εa . One can choose the form of the δa function arbitrarily, however selection of the εa thresholds is not obvious. If there are decision table DT = (U, A ∪ {d})) and distance function δa for each a ∈ A defined, the new decision table DT’ = (U , A ∪ {d}) can be created, and: U = {< x, y >: (< xi , yj >∈ U × U ) ∧ (i ≤ j)}, A = {(a : U → R+ ): a (< x, y >) = δa (a(x), a(y))}, 0 for d(x) = d(y) d (< x, y >) = 1 otherwise.
Decision Rule-Based Data Models Using TRS and NetTRS
133
It can be easily noticed, that sorting objects from U ascending according to values of any a ∈ A , one receives every possible value for which the power of set Ia (x), x ∈ U varies. These values are all reasonable εa values that should be considered during tolerance thresholds searching. Choosing εa for each a ∈ A one receives a tolerance thresholds vector. Because the set of all possible vectors is very large 12 (|Da |2 − |Da |) + 1 [54], to estblish desired tolerance thresholds a∈A
values, heuristic [29,31] or evolutionary [54,55] strategies are used. In the TRS library a evolutionary strategy was applied. In the paper [55] Stepaniuk presented a tolerance threshold vector searching method. A genetic algorithm was chosen as a solution set searching strategy. A single specimen was represented as a tolerance thresholds vector. For every a ∈ A, reasonable εa values were numerated, obtaining a sequence of integers, out of which every single integer corresponded to one of the possible values of tolerance threshold. A single specimen was then a vector of integers. Such representation of the specimen led to a situation in which a given, limited to the population size and the mutation probability, set of all possible tolerance threshold vectors was searched through. In other words, mutation was the only factor capable of introducing new tolerance threshold values, not previously drawn to the starting population, into the population. In our library searching for good tolerance thresholds vectors, we assumed a binary representation of a single specimen. We used so-called block positional notation with standardization [16]. Every attribute was assigned a certain number of bits in a binary vector representing the specimen that was necessary to encode all numbers of tolerance thresholds acceptable for that attribute. Such representation of a specimen makes possible to consider every possible thresholds vector. In each case user can choose one from various criteria of thresholds optimality, applying standard criterion given by Stepaniuk [54] or criteria adapted from rules quality measures. For a given decision table DT , standard tolerance thresholds vector quality measure is defined in the following way: wγ(d) + (1 − w)vSRI (Rd , RIA ),
(3)
where γ(d) = |P OS(d)| , P OS(d) is is a positive region [33] of the decision table |U| DT , Rd = {< x, y >∈ U × U: d(x) = d(y)}, RIA = {< x, y >∈ U × U: y ∈ IA (x)}, vSRI (X, Y ) is a standard rough inclusion [48]. We expect from tolerance thresholds vector that most of all objects from the same decision class (Rd ) will be admitted as similar (RIA ). The above mentioned function makes possible to find such tolerance thresholds that as many as possible objects of the same decision stay in a mutual relation, concurrently limiting to minimum cases in which the relation concerns objects of different decisions.
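A small sketch of how a thresholds vector can be scored with the standard criterion (3) is given below. It assumes that γ(d) is the fraction of objects whose whole tolerance set shares their decision (the positive region) and that the standard rough inclusion of Rd in R_IA is the fraction of same-decision pairs that are also tolerant; this is an illustrative reading of the formula under those assumptions, not the TRS implementation.

```python
def quality_of_thresholds(universe, decision, attributes, distance, thresholds, w=0.5):
    """Score a tolerance thresholds vector with w*gamma(d) + (1-w)*nu_SRI(Rd, R_IA)."""
    n = len(universe)
    tol = [set(j for j in range(n)
               if all(distance[a](universe[i][a], universe[j][a]) <= thresholds[a]
                      for a in attributes))
           for i in range(n)]                      # tolerance set of every object
    # gamma(d): objects whose whole tolerance set shares their decision
    pos = sum(1 for i in range(n) if all(decision[j] == decision[i] for j in tol[i]))
    gamma = pos / n
    # fraction of same-decision pairs (i != j) that are also tolerant
    rd = [(i, j) for i in range(n) for j in range(n) if i != j and decision[i] == decision[j]]
    inclusion = (sum(1 for i, j in rd if j in tol[i]) / len(rd)) if rd else 1.0
    return w * gamma + (1 - w) * inclusion
```

Such a scoring function can serve directly as the fitness function of the genetic search over threshold vectors described above.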
Since the (3) criterion not always allow to obtain high accuracy of classification and small decision rules set, other optimality criteria adapted from rules quality measures were also implemented in the TRS library. The rules quality measures allow us to evaluate a given rule, taking into account its accuracy and coverage [9,61]. The accuracy of a rule usually decreases when its coverage increases and vice-versa, the similar dependence appears also among components of the (3) measure. The properly adapted rules quality measures may be used to evaluate the tolerance thresholds vector [40]. It is possible to determine the following matrix for DT’ table and any tolerance thresholds vector ε = (εa1 , εa2 , . . . , εan ) nRd RIA n¬Rd RIA nRIA
nRd ¬RIA n¬Rd ¬RIA n¬RIA
nRd n¬Rd
where: nRd = nRd RIA + nRd ¬RIA is the number of object pairs with the same value of the decision attribute; n¬Rd = n¬Rd RIA + n¬Rd ¬RIA is the number of object pairs with different values of the decision attribute; nRIA = nRd RIA + n¬Rd RIA is the number of object pairs staying in the relation τA ; n¬RIA = nRd ¬RIA + n¬Rd ¬RIA is the number of object pairs not staying in the relation τA ; nRd RIA is the number of object pairs with the same value of the decision attribute, staying in the relation τA . The n¬Rd RIA , nRd ¬RIA , n¬Rd ¬RIA values we define analogously to nRd RIA . Two rules quality evaluation measures (WS, J-measure [53]) which were adapted to evaluation tolerance thresholds vector are presented below. nRd RIA nRd RIA ∗w+ ∗ (1 − w), w ∈ (0, 1; nRd nRIA 1 nR R |U | nR ¬R |U | q J−measure (ε) = nRd RIA ln d IA + nRd ¬RIA ln d IA . |U | nRd nRIA nRd n¬RIA q W S (ε) =
Modified versions of the following measures: Brazdil, Gain, Cohen, Coleman, IKIB, Chi-2 are also available in the TRS library. Analytic formulas of the mentioned measures can be found, among others, in [2,3,9,40,61]. Algorithm of searching the vector of identical tolerance thresholds value for each attribute is the other algorithm implemented in TRS library. The searched vector has the form (ε, . . . , ε). As the initial (minimal) value of ε we take 0 and the final (maximal) value is ε = 1. Additionally, the parameter k is defined to increase ε on each iteration (usually k = 0.1). For each vector, from (0, . . . , 0) to (1, . . . , 1), the tolerance thresholds vector quality measure is calculated (it
is possible to use identical evaluation measures as for genetic algorithm). The vector with highest evaluation is admitted as the optimal one. 2.2
Decision Rules
TRS library generates decision rules in the following form (4): IF a1 ∈ Va1 and . . . and ak ∈ Vak THEN d = vd ,
(4)
where: {a1 , . . . , ak } ⊆ A, d is the decision attribute, Vai ⊆ Dai , vd ∈ Dd . Expression a ∈ V is called a conditional descriptor. For fixed distance functions and tolerance thresholds it is possible to make the set of all minimal decision rules [33,47]. The process of making minimal decision rules equals finding object-related relative reducts [47]. Finding object-related relative reducts is one of the methods of rules determining in the TRS library, in this case, decision rules are generated on the basis of discernibility matrix [47]. There are three different algorithm variants: all rules (from each objects all relative reducts are generated and from each reduct one minimal decision rule is obtained), one rule (from each object only one relative reduct is generated (the quasi-shortest one) that is used for the rule creation, the reduct is determined by the modified Johnson algorithm [27]); from each object given by the user rules number is generated, in this case the randomized algorithm is used [27]. Rules determined from object-related relative reducts have enough good classification abilities but they have (at least many of them) poor descriptive features. It happens because minimal decision rules obtained from object-related relative reducts express local regularities, only few of them reflect general (global) regularities in data. Because of this reason, among others, certain heuristic attitudes [17,18,19,28,30,57,58,60] or further generalizations of rough sets [50] are used in practice. The heuristic algorithm of rules induction RMatrix that exploits both information included in discernibility matrix [47] and rule quality measure values was implemented in the TRS library. Definition 1. Generalized discernibility matrix modulo d Let DT = (U, A ∪ {d}) be a decision table, where U = {x1 , x2 , . . . , xn }. The generalized discernibility matrix modulo d for the table DT we called the square matrix Md (DT) = {ci,j : 1 ≤ i, j ≤ n} which elements defined as follows: cij = {a ∈ A: (< xi , xj >∈ / τa (εa )) ∧ (d(xi ) = d(xj )}, cij = ∅
if
d(xi ) = d(xj ).
Each object xi ∈ U matches i-th row (i-th column) in discernibility matrix. The attribute most frequently appearing in the i-th row lets discern objects with decisions other than the xi object’s decision in the best way.
RMatrix algorithm input: DT = (U, A ∪ {d}), the tolerance thresholds vector (εa1 , εa2 , ..., εam ), q – quality evaluation measure, x – object, rule generator, an order of conditional attributes (ai1 , ai2 , ..., aim ) so as the attribute the most frequently appearing in cx is the first (attribute appearing the most rarely is the last) begin create the rule r, which has the decision descriptor d = d(x) only; rbest := r; for every j := 1, ..., m add the descriptor aij ∈ Vaij to conditional part of the rule r (where Vaij = {aij (y) ∈ Daij : y ∈ Iaij (x)}) if q(r) > q(rbest ) then rbest := r return rbest RMatrix algorithm generates one rule from each object x ∈ U . Adding succeeding descriptor makes the rule more accurate and simultaneously limits rule coverage. Rule quality measure ensures the output rule not to be too fitting to training data. Using the algorithm it is possible to define one rule for each object or define only those rules which will be sufficient to cover the training set. Considering the number of the rules obtained it is good to get all the rules and then, starting with the best rules, construct a coverage of each decision class. Computational complexity of determining one rule by RMatrix algorithm is equal to O(m2 n), where m = |A|, n = |U |. The detailed description of this algorithm can be found in [40,42]. Both RMatrix algorithm and methods of determining rules from reducts require information about tolerance thresholds values. The MODLEM algorithm was also implemented in the TRS library. The MODLEM algorithm was proposed by Stefanowski [51,52] as the generalization of the LEM2 algorithm [18]. The author’s intention was to make strong decision rules from numeric data without their prior discretization. There is a short description of the MODLEM algorithm working below. For each conditional attribute and each value of the attribute appearing in the currently examined set of objects Ub (at first it is the whole training set U ) successive values of conditional attributes are tested previously sorted in not decreasing order) in search of the limit point ga . The limit point is in the middle, between two successive values of the attribute a (for example va < ga < wa ), and divides the current range of the attribute a values into two parts (values bigger than ga and smaller than ga ). Such a division also establishes the division of the current set of training objects Ub into two subsets U1 and U2 . Of course, only those values ga , which lie between two values of the attribute a, characterizing objects from different decision classes are taken into consideration as the limit points. The optimum is the limit point, which minimizes the value of the conditional |U2 | 1| entropy |U |Ub | Entr(U1 ) + |Ub | Entr(U2 ), where Entr(Ui ) means entropy of the set Ui .
As the conditional descriptor we choose the interval (out of two) for which in adequate sets U1 , U2 there are more objects from the decision class pointed by the rule. The created descriptor limits the set of objects examined in the later steps of algorithm. The descriptor is added to those created earlier and together with them (in the form of the conjunction of conditions) makes the conditional part of the rule. As it is easily noticeable, if for an attribute a two limit points will be chosen in the successive steps of the algorithm, we can obtain the descriptor in the form [va1 , va2 ]. If there will be one such limit point, the descriptor will have the nature of inequality, and if a given attribute will not generate any limit point then the attribute will not appear in the conditional part of the rule. The process of creating the conditional part of the rule finishes when the set of objects Ub is included in the lower (or upper) approximation of the decision class being described. In other words, the process of creating the rule finishes when it is accurate or accurate as much as it is possible in the training set. The algorithm creates coverage of a given decision class and hence after generating the rule all objects supporting the created rule are deleted from the training set, and the algorithm is used for the remaining objects from the class being described. The stop criterion which demands from the objects recognized by the conditional part of the rule to be included in the proper approximation of a given decision class creates two unfavorable situations. Firstly, the process of making rules is longer than in the case of making a decision tree. Secondly, the number of the rules obtained, though smaller than in the case of induction methods based on rough sets methodology or the LEM2 algorithm, is still relatively big. Moreover, rules tend to match to data and hence we deal with the phenomenon of overlearning. The simplest way of dealing with the problem is arbitrary establishment of the minimal accuracy of the rule and after reaching this accuracy the algorithm finishes working. However, as numerous research studies show (among others [2,3,9,40,45,52]) it is better to use rules quality measures, which try to simultaneously estimate two most important rule features – accuracy and rule coverage. We apply rules evaluation measures in the implemented in the TRS version of the MODLEM algorithm. We do not interfere in the algorithm, only after adding (or modifying) successive conditional descriptor we evaluate the current form of the rule, remembering as the output rule the one with the best evaluation (not necessarily the one which fulfills the stop conditions). As it is shown in the research results [41,45], such attitude makes possible to significantly reduce the number of rules being generated. Rules generated in such a way are characterized by smaller accuracy (about 70-100%) but are definitely more general, such rules have also good classifying features. During experiments conducting we observed that the quality of the rule being made increases until it reaches a certain maximal value and then decreases (the quality function has one maximum). This observation allowed to modify stop criterion in the MODLEM algorithm so as to stop process of rule conditional
part creation at the moment while a value of used in the algorithm quality measure becomes decrease. The presented modification allows to limit the number of the rules being generated and makes the algorithm work faster. 2.3
Rules Generalization and Filtration
Independently of the method of rules generating (either all minimal rules are obtained or the heuristic algorithms are used) the set of generated rules still can be large, which decreases its description abilities. The TRS library owns some algorithms implemented in, which are responsible for the generated rules set postprocessing. The main target of postprocessing is to limit the number of decision rules (in other words: to increase their description abilities) but with keeping their good classification quality simultaneously. The TRS library realizes postprocessing in two ways: rules generalization (rules shortening and rules joining) and rules filtration (rules that are not needed in view of certain criterion, are removed from the final rules set). Rules shortening Shortening a decision rule consists in removing some conditional descriptors from the conditional part of the rule [5]. Every unshortened decision rule has an assigned quality measure value. The shortening process longs until the quality of the shortened rule decreases below the defined threshold. The rules shortening algorithm was applied, among others, in the paper [4], where increasing classification accuracy of minimal decision rules induction algorithms was the main purpose of the shortening. Authors propose to create a classifiers hierarchy, in which a classifier composed of minimal rules is placed on the lowest level and classifiers composed of shortened rules (by 5%, 10% etc with relation to their original accuracy) are located on higher levels. The classification process runs from the lowest (the accurate rules) to the highest level. The presented proposition enables to increase classification accuracy, which was probably the main authors intention. Authors did not consider description power of obtained rules set, that undoubtedly significantly worsen in this case, for the sake of increase of rules number taking part in the classification (theoretically one can consider some method of filtration of determined rules). In the TRS library the standard not hierarchical shortening algorithm was implemented, quality threshold is defined for each decision class separately. The order of descriptors removing is set in accordance with a hill climbing strategy. Computational complexity of the algorithm that shortens a rule composed of l conditional descriptors is O(l2 mn, ), where m = |A|, n = |U |. Rules joining Rules joining consists in obtaining one more general rule from two (or more) less general ones [24,26,35,40,43]. The joining algorithm implemented in the TRS library bases on following assumptions: only rules from the same decision class can be joined, two rules can
be joined if their conditional parts are built from the same conditional attributes or if the conditional attributes set of one rule is a subset of the conditional attributes set of the second one. Rules joining process consists in joining sets of values of corresponding conditional descriptors. If the conditional descriptor (a, Va1 ) occurs in the conditional part of the φ1 → ψ rule and the descriptor (a, Va2 ) occurs in the conditional part of the φ2 → ψ rule, then – as the result of joining process – the final conditional descriptor (a, Va ) has the following properties: Va1 ⊆ Va and Va2 ⊆ Va . After joining the descriptors the rules representation language does not change, that is depending on the type of the conditional attribute a the set Va of values of the descriptor created after joining is defined in the following way: – if the attribute is of the symbolic type and joining concerns descriptors (a, Va1 ) and (a, Va2 ), then Va = Va1 ∪ Va2 , – if is of the numeric type and joining concerns descriptors the1 attribute a, [va , va2 ] and (a, [va3 , va4 ]), then Va = [f vmin , vmax ], where vmax = max{vai : i = 2, 4}, vmin = min{vai : i = 1, 3}. Therefore, the situation in which after joining the conditional descriptor will be in the form a, [va1 , va2 ] ∨ [va3 , va4 ] for va2 < va3 is impossible. In other words, appearing of so-called internal alternative in descriptors is not possible. Other assumptions concern the control over the rules joining process. They can be presented as follows: – From two rules r1 and r2 with the properties qr1 > qr2 the rule r1 is chosen as the "base" on the basis of which a new joined rule will be created. The notation qr1 means the value of the rule r1 quality measure; – Conditional descriptors are joined sequentially, the order of descriptors joining depends on the value of the measure which evaluates a newly created rule r (qr ). In order to point the descriptor which is the best for joining the climbing strategy is used. – The process of joining is finished when the new rule r recognizes all positive training examples recognized by rules r1 and r2 ; – If r is the joined rule and qr ≥ λ, then in the place of rules r1 and r2 the rule r is inserted into the decision class description; otherwise rules r1 and r2 cannot be joined. The parameter λ defines the minimal rules quality value. In particular, before beginning the joining process one can create a table which contains "initial" values of the rules quality measure and afterwards use the table as the table of threshold values for rules joining; then for all joined rules r1 and r2 , λ = max{qr1 , qr2 }. Computational complexity of the algorithms that joins two rules r1 and r2 (with the property: qr1 > qr2 , number of descriptors in the rule r1 is equal to l) equals O(l2 mn), where m = |A|, n = |n|. The detailed description of this algorithm can be found in [43]. Another joining algorithm was proposed by Mikołajczyk [26]. In the algorithm rules from each decision class are grouped with respect to similarity of their
structure. Due to computational complexity of the algorithm, iterative approach was proposed in [24] (rules are joined in pairs, beginning from the most similar rules). In contrast to the joining algorithm presented in the paper, rules built around a certain number of different attributes (i.e. occurring in one rule and not occurring in another one) can be joined. Ranges of descriptors values in a joined rule are the set sum of joined rules ranges. The algorithm does not verify quality of joined rules. Initially [26] the algorithm was described for rules with descriptors in the form attribute=value, then [24] occurrence of values set in a descriptor was admitted. In each approach value can be a symbolical value or a code of some interval of discretization (for numerical attributes). Thus, the algorithm admits introducing so-called internal alternatives into description of a joined rule in the case of numerical data, because in joined rule description, codes of two intervals that do not lie side by side can be found. The joining algorithm proposed by Pindur, Susmaga and Stefanowski [35] operates on rules determined by dominance based rough sets model. Rules that lie close to each other are joined if their joint cause no deterioration of accuracy of obtained rule. In resultant joined rule, new descriptor which is linear combination of already existing descriptors appears. In this way a rule can describe a solid different from hypercube in the features space. Therefore, it is possible to describe the space occupied by examples from a given decision class by means of less rules number. Rules filtration Let us assume that there is accessible a set of rules determined from a decision table DT , such a set we will be denote RU LDT. An object x ∈ U recognizes the rule of the form (4) if and only if ∀i ∈ {1, . . . , k} ai (x) ∈ Vai . If the rule is recognized by the object x and d(x) = vd , then the object x support the rule of the form (4). The set of objects from the table DT which recognize the rule r will be marked as matchDT (r). Definition 2. Description of a decision class. IF DT = (U, A ∪ {d}) is any decision table, v ∈ Dd and Xv = {x ∈ U : d(x) = v} is the decision class, then each set of rules RU LvDT ⊂ RU LDT satisfying the following conditions: v 1. If r ∈ RU φ → (d, v), LDT then r has the form 2. ∀x ∈ U d(x) = v ⇒ ∃ r ∈ RU LvDT (x ∈ matchDT (r)) is called the description of the decision class Xv .
Each of presented above rules induction algorithms create so-called descriptions of decision classes which are compatible with Definition 2. There are usually distinguished: minimal, satisfying, and complete descriptions. Rejection any rule from minimal description of the decision class Xv , causes the situation in which the second condition of Definition 2 is no more satisfied. This means that after removing any rule from the minimal description, there exists at least one object x ∈ Xv that is recognized by none of rules remaining in the description.
The complete description is a rules set in which all decision rules that can be determined from a given decision table are included. Sometimes the definition of the complete description is constrained to a set of all minimal decision rules, so to rules determined from object-related relative reducts. So far, the satisfying description was defined as an intermediate description between minimal and complete description. The satisfying description can be obtained by determining some number of rules from the training table (for example, a subset of the minimal decision rules set) or by assumption that a subset of the rules set that create another description (for example, the complete description) is the description of the decision class. In particular, the RMatrix and MODLEM algorithms enable to define satisfying descriptions. Opinions about quality of individual types of descriptions differ. From objects classification point of view, the complete description can seem to be the best description of test objects. Big number of rules preserves great possibilities of recognizing test objects. However, for complete descriptions we often meet with overlearning effect which, in the case of uncertainty of data, leads to worse classification results of the description. From the knowledge discovery in databases point of view, the complete description is usually useless, since big number of rules is, in practice, impossible to interpretation. Besides, in the complete description there is certainly a big number of redundant rules and rules excessively matched to training data. Satisfying and minimal descriptions have the most wide application in data exploration tasks. But constructing a minimal description or quasi-minimal one we employ some heuristic solutions which can cause that a part of dependences in data interesting for a user will not be note. Filtration of the rules set can be a certain solution of the problem mentioned above. Having any rules set which is too big from the point of view of its interpretability, we can remove from the set these rules which are redundant for the sake of some criterion [2,39,40]. The criterion is usually a quality measure that make allowances for accuracy and coverage of a rule [3,9], and also its length, strength [60] and so on. There are two kinds of algorithms implemented in the TRS library: considering rules quality only and considering rules quality and classification accuracy of filtered rules set. First approach is represented by the Minimal quality and From coverage algorithms. The Minimal quality algorithm removes from decision classes descriptions these rules which quality is worst then a fixed for a given decision class rule acceptation threshold. The algorithm gives no guarantee that the result rules set will fulfill conditions of Definition 2. Determination a rules qualities ranking is the first step of the From coverage algorithm, then building of the training set coverage starts from the best rule. The following rules are added according to their position in the ranking. When the rules set covers all training examples, all remaining rules are rejected.
The second approach is represented by two algorithms: Forward and Backwards. Both of them, besides the rule quality ranking, take into consideration the classification accuracy of the whole rule set. To guarantee the independence of the filtration result, a separate tuning set of examples is applied. In the case of the Forward algorithm, each initial decision class description contains only one decision rule – the best one. Then single rules are added to each decision class description. If the decision class accuracy increases, the added rule remains in the description; otherwise the next decision rule is considered. The process of adding rules to the decision class description stops when the obtained rule set has the same classification accuracy as the initial one, or when all rules have already been considered. The Backwards algorithm is based on the opposite conception: the weakest rules are removed from each decision class description. Keeping the difference between the most accurate and the least accurate decision class guarantees that the filtered rule set keeps the same sensitivity as the initial one. The Forward and Backwards algorithms, like the Minimal quality algorithm, give no guarantee that the filtered rule set will fulfill Definition 2. Both filtration algorithms give rules that are more specific (in the light of the training set) but contribute to raising the classification accuracy on the tuning set a chance of finding their way into the filtered set.
To estimate the computational complexity of the presented algorithms we adopt the following denotations: L is the number of rules subject to filtration, m = |A| is the number of conditional attributes, and n = |U| is the number of objects in the training table. It is readily noticeable that the computational complexity of the Minimal quality algorithm is O(Lmn). The From coverage algorithm requires, as in the previous case, determining the quality of all rules (O(Lnm)), then sorting them (O(L log L)) and checking which examples from the training set support the successively considered rules. In the extreme case, such verification is performed L times (O(Ln)). To recapitulate, the computational complexity of the whole algorithm is of order O(Lnm). The complexity of the Forward and Backwards algorithms depends on the chosen classification method. If we assume that the whole training set is the tuning set and we use the classification algorithm presented in the next section, the complexity analysis runs as follows. To prepare the classification, the values of the rule quality measure for each rule must be determined (O(Lnm)) and the rules sorted (O(L log L)). After adding (removing) a rule to (from) a decision class description, the classification process is carried out, which requires checking which rules support each training example (O(Lmn)). In the extreme case we add (remove) rules L − 1 times, and classification must be carried out just as many times. Hence, the computational complexity of the Forward and Backwards filtration algorithms is of order O(L^2 nm). All mentioned filtration algorithms are described in [40].
In the literature, besides the quality-based filtering approach, a genetic-algorithm methodology for limiting the rule set [1] is also encountered. A population consists of specimens that
are rule classifiers. Each specimen consists of a rule set that is a subset of the input rule set. The specimen fitness function is the classification accuracy obtained on the tuning set; it is also possible to apply a function which is a weighted sum of the classification accuracy and the number of rules that form the classifier. In [1,2] a quality-based filtration algorithm is also described. It likewise applies a rule ranking established by a selected quality measure, but filtration does not take place for each decision class separately.
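Returning to the TRS filtration algorithms, the Forward procedure described earlier in this subsection can be outlined as below. This is only an illustrative sketch under the assumption that classify_accuracy(rule_set, tuning_set) stands for the voting classifier of the next subsection; it is not the library implementation:

    def forward_filter(ranked_rules, tuning_set, classify_accuracy):
        # Start from the best rule of a decision class description and add further
        # rules only when they improve accuracy on the tuning set; stop once the
        # accuracy of the full (initial) description is reached.
        if not ranked_rules:
            return []
        baseline = classify_accuracy(ranked_rules, tuning_set)
        kept = [ranked_rules[0]]
        best = classify_accuracy(kept, tuning_set)
        for rule in ranked_rules[1:]:
            if best >= baseline:
                break
            accuracy = classify_accuracy(kept + [rule], tuning_set)
            if accuracy > best:
                kept.append(rule)
                best = accuracy
        return kept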
2.4 Classification
Classification is the process of assigning objects to the corresponding decision classes. The TRS library uses a "voting mechanism" to classify objects. Each decision rule has an assigned confidence grade (simply the value of its quality measure). Classification in TRS consists in summing up the confidence grades of all rules from each decision class that recognize a test object (5). The test object is assigned to the decision class with the highest value of this sum. Sometimes an object is recognized by no rule from the given decision class descriptions; in this case it is possible to calculate a distance between the object and a rule and to accept that rules close enough to the object recognize it.
conf(X_v, u) = Σ_{r ∈ RUL_{X_v}(DT), dist(r,u) ≤ ε} (1 − dist(r, u)) · q(r).   (5)
In formula (5), dist(r, u) is the distance between the test object u and the rule r (Euclidean or Hamming), ε is the maximal acceptable distance between the object and a rule (in particular, for ε = 0 classification is performed only by the rules that exactly recognize the test object), and q(r) is the voting strength of the rule.
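A minimal sketch of the voting classification of formula (5) is given below. The functions dist(r, u) and q(r) are assumed to be the rule–object distance and the rule quality measure used as the voting strength; the sketch is illustrative and does not reproduce the TRS implementation:

    def classify(test_object, class_descriptions, dist, q, eps=0.0):
        # class_descriptions maps a decision value to the list of its rules.
        # Each rule within distance eps votes with strength (1 - dist) * q(r).
        conf = {}
        for decision, rules in class_descriptions.items():
            conf[decision] = sum((1.0 - dist(r, test_object)) * q(r)
                                 for r in rules
                                 if dist(r, test_object) <= eps)
        return max(conf, key=conf.get)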
2.5 Rules Quality Measures
All of the algorithms mentioned above exploit measures that decide either about the form of a determined rule or about which of the already determined rules may be removed or generalized. These measures are called rule quality measures, and their main goal is to steer the induction and/or reduction processes so that the output rule set contains rules of the best quality. Values of the most frequently applied rule quality measures [9,20,45,61] can be determined by analyzing a contingency table that describes the behavior of a rule with respect to the training set. The contingency table for the rule r ≡ (ϕ → ψ) is defined in the following way:

              ψ             ¬ψ
   ϕ          n_{ϕψ}        n_{ϕ¬ψ}        n_{ϕ}
   ¬ϕ         n_{¬ϕψ}       n_{¬ϕ¬ψ}       n_{¬ϕ}
              n_{ψ}         n_{¬ψ}         n
where: n_{ϕ} = n_{ϕψ} + n_{ϕ¬ψ} = |U_ϕ| is the number of objects that recognize the rule ϕ → ψ; n_{¬ϕ} = n_{¬ϕψ} + n_{¬ϕ¬ψ} = |U_¬ϕ| is the number of objects that do not recognize the rule ϕ → ψ; n_{ψ} = n_{ϕψ} + n_{¬ϕψ} = |U_ψ| is the number of objects that belong to the decision class described by the rule ϕ → ψ; n_{¬ψ} = n_{ϕ¬ψ} + n_{¬ϕ¬ψ} = |U_¬ψ| is the number of objects that do not belong to the decision class described by the rule ϕ → ψ; n_{ϕψ} = |U_ϕ ∩ U_ψ| is the number of objects that support the rule ϕ → ψ; n_{ϕ¬ψ} = |U_ϕ ∩ U_¬ψ|; n_{¬ϕψ} = |U_¬ϕ ∩ U_ψ|; n_{¬ϕ¬ψ} = |U_¬ϕ ∩ U_¬ψ|.
Using the information included in the contingency table and the fact that for a known decision rule ϕ → ψ the values |U_ψ| and |U_¬ψ| are known, the value of each measure can be determined from n_{ϕψ} and n_{ϕ¬ψ}. It can also be observed that for any rule ϕ → ψ the inequalities 1 ≤ n_{ϕψ} ≤ |U_ψ| and 0 ≤ n_{ϕ¬ψ} ≤ |U_¬ψ| hold. Hence, a quality measure is a function of two variables, q(ϕ → ψ): {1, . . . , |U_ψ|} × {0, . . . , |U_¬ψ|} → R.
The two basic quality measures are the accuracy of a rule, q^{acc}(ϕ → ψ) = n_{ϕψ} / n_{ϕ}, and its coverage, q^{cov}(ϕ → ψ) = n_{ϕψ} / n_{ψ}. According to the principle of enumerative induction, rules with high accuracy and coverage reflect real dependences, and these dependences also hold for objects from outside the analyzed dataset. It is easy to show that as accuracy increases, rule coverage decreases. Therefore, attempts are made to define quality measures that respect the accuracy and the coverage of a rule simultaneously.
Empirical research on the generalization abilities of classifiers obtained with various rule quality measures used during rule induction was reported in [3]. The influence of the applied quality measure on the number of discovered rules was also considered [45,52]. This is of special weight in the context of knowledge discovery, since a user is usually interested in discovering a model that can be interpreted, or intends to find several rules that describe the most important dependences. In the quoted research, some quality measures achieved good results both in classification accuracy and in the size of the classifiers (number of rules). These measures are: WS proposed by Michalski, C2 proposed by Bruha [9], the IREP measure [12], and a suitably adapted Gain measure used in decision tree induction [37]. The WS measure respects rule accuracy as well as rule coverage:

q^{WS}(ϕ → ψ) = q^{acc}(ϕ → ψ) · w_1 + q^{cov}(ϕ → ψ) · w_2,   w_1, w_2 ∈ [0, 1].   (6)
In the rule induction system YAILS, the values of the parameters w_1, w_2 for the rule ϕ → ψ are calculated as w_{1,2} = 0.5 ± 0.25 · q^{acc}(ϕ → ψ). The measure is monotone with respect to each of the variables n_{ϕψ} and n_{ϕ¬ψ} and takes values from the interval [0, 1]. The measure C2 is described by the formula:

q^{C2}(ϕ → ψ) = ((n · q^{acc}(ϕ → ψ) − n_{ψ}) / (n − n_{ψ})) · ((1 + q^{cov}(ϕ → ψ)) / 2).   (7)

The first component of the product in formula (7) is a separate measure known as the Coleman measure. This measure evaluates the dependence between the events "the object u recognizes the rule" and "the object u belongs to
the decision class described by the rule". The modification proposed by Bruha [9] (the second component of formula (7)) takes into account the fact that the Coleman measure puts too little emphasis on rule coverage; therefore application of the Coleman measure in the induction process leads to a large number of rules [40]. The measure C2 is monotone with respect to the variables n_{ϕψ} and n_{ϕ¬ψ}; its range is the interval (−∞, 1], and for a fixed rule the measure takes its minimum if n_{ϕψ} = 1 and n_{ϕ¬ψ} = n_{¬ψ}. The IREP program uses the following measure for rule evaluation:

q^{IREP}(ϕ → ψ) = (n_{ϕψ} + (n_{¬ψ} − n_{ϕ¬ψ})) / n.   (8)
The value of this measure depends on both the accuracy and the coverage of the evaluated rule. If the rule is accurate, then its coverage is evaluated. If the rule is approximate, the number n_{ϕ¬ψ} prevents two rules with the same coverage and different accuracy from getting the same evaluation. The function is monotone with respect to the variables n_{ϕψ} and n_{ϕ¬ψ} and takes values from the interval [0, 1]. The Gain measure has its origin in information theory. It was adapted to rule evaluation from decision tree methods (the so-called Limited-Gain criterion):

q^{Gain}(ϕ → ψ) = Info(U) − Info_{ϕ→ψ}(U).   (9)
In formula (9), Info(U) is the entropy of the training examples and Info_{ϕ→ψ}(U) = (n_{ϕ}/|U|) · Info(ϕ → ψ) + ((|U| − n_{ϕ})/|U|) · Info(¬(ϕ → ψ)), where Info(ϕ → ψ) is the entropy of the examples covered by the rule ϕ → ψ and Info(¬(ϕ → ψ)) is the entropy of the examples not covered by the rule ϕ → ψ. The measure is not monotone with respect to the variables n_{ϕψ} and n_{ϕ¬ψ} and takes values from the interval [0, 1]. If the accuracy of a rule is less than the accuracy of the decision class described by the rule (the accuracy that results from the distribution of examples in the training set), then q^{Gain} is a decreasing function of both variables n_{ϕψ} and n_{ϕ¬ψ}; otherwise q^{Gain} is an increasing function.
Apart from the presented measures, the following quality measures are implemented in the TRS library: Brazdil [8]; J-measure [53]; IKIB [21]; Cohen, Coleman and Chi-2 [9]. Recently, the so-called Bayesian confirmation measure (denoted by f) was proposed as an alternative for rule accuracy evaluation. In [10,15] a theoretical analysis of the measure f is presented and it is shown, among others, that the measure is monotone with respect to rule accuracy (so, in the terminology adopted in this paper, with respect to the variable n_{ϕψ}). In standard notation the Bayesian confirmation measure is defined by the formula q^{f}(ϕ → ψ) = (P(ϕ|ψ) − P(ϕ|¬ψ)) / (P(ϕ|ψ) + P(ϕ|¬ψ)), where P(ϕ|ψ) denotes the conditional probability that objects belonging to the set U and having the property ψ also have the property ϕ. It is easy to see that q^{f} can also be written as follows:

q^{f}(ϕ → ψ) = (n_{¬ψ} · n_{ϕψ} − n_{ψ} · n_{ϕ¬ψ}) / (n_{¬ψ} · n_{ϕψ} + n_{ψ} · n_{ϕ¬ψ}).   (10)
It can be noticed that the measure q^{f} does not take into consideration the coverage of the evaluated rule; this can be seen most clearly for two rules with identical accuracy and different coverage. If the rules r1 and r2 are accurate, then for both of them the equality n_{ϕ¬ψ} = 0 holds. Hence, the formula for the value of q^{f} reduces to 1 (a constant function), and independently of the number of objects that support the rules r1 and r2, the value of q^{f} is equal to one for both rules. Interest in the measure q^{f} is justified by the fact that, besides rule accuracy, it takes into consideration the probability distribution of the training examples among the decision classes. Because of the above argumentation, for those quality measures implemented in the TRS library that use rule accuracy, replacing the accuracy with the measure q^{f} can be proposed. In particular, such modifications were introduced in the WS and C2 measures [45].
The number of conditional descriptors occurring in a rule is an important property from the point of view of the interpretability of the dependences the rule represents. Obviously, the fewer the descriptors, the easier it is to understand the dependence presented by the rule. Let us assume that for the decision rule r the set descr(r) contains all attributes that create conditional descriptors in this rule. A formula evaluating the rule with respect to the number of conditional descriptors, q^{length}: RUL → [0, 1), was defined in the TRS library in the following way (11):

q^{length}(r) = 1 − |descr(r)| / |A|.   (11)
The bigger the value of q^{length}, the simpler the rule, i.e., the fewer conditional descriptors it has. A rule can also be evaluated taking into consideration both its quality and its length: q^{rule_quality_measure}(r) · q^{length}(r). In particular, the formula q^{accuracy}(r) · q^{length}(r) allows the accuracy and the complexity of the rule ϕ → ψ to be evaluated simultaneously.
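For illustration only, the sketch below computes the contingency counts of a rule and a few of the measures defined above (accuracy, coverage, WS and IREP). The helpers matches and obj_class are assumptions introduced for the example and do not come from the TRS library:

    def contingency(objects, matches, target_class, obj_class):
        # Counts from the contingency table of a rule phi -> psi:
        # n_pp = |U_phi ∩ U_psi|, n_pn = |U_phi ∩ U_not_psi|, n_psi = |U_psi|, n = |U|.
        n_pp = sum(1 for u in objects if matches(u) and obj_class(u) == target_class)
        n_pn = sum(1 for u in objects if matches(u) and obj_class(u) != target_class)
        n_psi = sum(1 for u in objects if obj_class(u) == target_class)
        return n_pp, n_pn, n_psi, len(objects)

    def q_acc(n_pp, n_pn):                        # accuracy: n_phi_psi / n_phi
        return n_pp / (n_pp + n_pn)

    def q_cov(n_pp, n_psi):                       # coverage: n_phi_psi / n_psi
        return n_pp / n_psi

    def q_ws(n_pp, n_pn, n_psi, w1=0.5, w2=0.5):  # formula (6)
        return q_acc(n_pp, n_pn) * w1 + q_cov(n_pp, n_psi) * w2

    def q_irep(n_pp, n_pn, n_psi, n):             # formula (8)
        return (n_pp + ((n - n_psi) - n_pn)) / n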
3 Selected Results Obtained by Algorithms Included in the TRS Library
To present the efficiency of the methods included in the TRS library, selected results obtained on various benchmark data sets are presented below. The results were obtained using the 10-fold cross-validation testing methodology (except for the Monks and Satimage data sets) and were rounded to natural numbers. Since all applied data sets are commonly known, we do not give their characteristics.
3.1 Searching Tolerance Thresholds
Results of applying the genetic algorithm and various quality measures for determining the tolerance threshold values are presented in Table 1. For the
standard measure, two values of the weight w (w = 0.5, w = 0.75) were considered, and the better of the results is presented in the table. The Coleman, IKIB, J-measure, Cohen, Gain, Chi-2, IREP and WS measures were considered; besides the results for the standard measure, the best result among these measures is also given in Table 1. After establishing the tolerance threshold values, rules were determined from the object-related quasi-shortest relative reducts. Classification was performed by exactly matching rules; hence Table 1 contains, besides the classification accuracy (Accuracy) and the number of generated rules (Rules), also the degree of recognition of test objects (Match). Unrecognized objects were counted as wrongly classified.

Table 1. Searching tolerance thresholds
Dataset                 Quality measure   Accuracy (%)   Rules   Match
Australian              Std               60             564     1.00
                        Chi-2             86             429     1.00
Breast                  Std               67             143     1.00
                        J-measure         72             59      1.00
Iris                    Std               95             59      1.00
                        J-measure         95             41      1.00
Iris + discretization   Std               97             9       1.00
                        J-measure         97             9       1.00
Glass                   Std               27             178     0.43
                        J-measure         61             185     0.90
Heart                   Std               65             193     0.92
                        J-measure         78             182     1.00
Lymphography            Std               72             57      1.00
                        Coleman           80             66      1.00
Monk I                  Std               100            10      1.00
                        Coleman           97             17      1.00
Monk II                 Std               73             54      1.00
                        Gain              79             64      1.00
Monk III                Std               95             35      1.00
                        J-measure         97             12      1.00
Pima                    Std               51             597     0.73
                        J-measure         75             675     1.00
Satimage                Std               76             437     0.95
                        Gain              75             717     1.00
The obtained results are consistent with the results presented in earlier papers [39,40]. The standard measure can yield good classification accuracy, but a proper value of the weight occurring in the measure has to be established. With regard to classification quality, the adapted rule quality measures, especially the J-measure, perform better. The J-measure leads to tolerance thresholds with values higher than for the standard measure; for that reason, the obtained rules are somewhat less accurate but more general. A similar
situation was observed in the case of the Gain, Chi-2 and Cohen measures. In some cases other measures turned out to be better, in particular the Coleman measure, which puts greater emphasis on consistency in a decision table (like the IKIB and Gain measures). This shows clearly on the example of the Monk I set, for which this measure makes it possible to obtain maximal classification accuracy; as is known, this set is easily classifiable and the classification can be done by accurate rules. Probably, by searching more values of the weight w for the standard measure, the same result would be obtained. To recapitulate, the adapted quality measures enable better evaluation of tolerance threshold vectors than the standard measure: better accuracy and a smaller number of rules are obtained. Application of the adapted quality measures leads to approximate rule induction; however, in the case of noisy data, approximate rules reflect the dependences in the set better than accurate ones and classify unknown objects better. It is also important to observe that for sets with a larger number of decision classes (Glass, Lymphography, Satimage) the method of searching for a global tolerance threshold vector does not achieve satisfactory results. Better results could be gained by looking for a tolerance vector for each decision class separately (such an approach was postulated, i.a., in [31]); the TRS library does not have this functionality at present.
3.2 Rules Induction
The efficiency of the RMatrix algorithm can be compared with the methods generating all minimal decision rules and quasi-shortest minimal decision rules, since the idea of RMatrix joins the method of quasi-shortest object-related minimal rule generation with loosening the conditions concerning the accuracy of a determined rule. The comparison was done for the same tolerance threshold values, determined by the heuristic algorithm or by the genetic algorithm (for smaller data sets). Results of the comparison are presented in Table 2. During rule induction the following rule quality measures were used: Accuracy, IKIB, Coleman, Brazdil, Cohen, IREP, Chi-2, WSY (the WS measure from the YAILS system), C2, Gain, C2F (the C2 measure with the Bayesian confirmation measure F instead of Accuracy). For the RMatrix algorithm, the measure that gives the best results is reported. RMatrix achieves classification accuracies similar to (in some cases better than) the quasi-shortest minimal decision rule induction algorithm. It is essential that the number of generated rules is smaller (in some cases much smaller); the rules are more general and less accurate, which is sometimes a desirable property. Information about the classification accuracy and the number of rules obtained by the MODLEM algorithm and its version using quality measures in the stop criterion is presented in Table 3. Loosening the requirements concerning the accuracy of a rule created by MODLEM and stopping the rule generation process when its quality begins to decrease made it possible to get better results than with the standard version of MODLEM. Moreover, the number of rules is smaller. Depending on the analyzed data set, the best results are obtained by applying various quality measures.
Table 2. Results of the RMatrix algorithm

Dataset                 Algorithm             Accuracy (%)   Rules
Australian              RMatrix (Gain)        86             32
                        Shortest dec. rules   86             111
                        All minimal rules     86             144
Breast                  RMatrix (Gain)        73             31
                        Shortest dec. rules   70             50
                        All minimal rules     71             57
Iris + discretization   RMatrix (Gain)        97             7
                        Shortest dec. rules   97             9
                        All minimal rules     97             17
Heart                   RMatrix (Chi-2)       80             24
                        Shortest dec. rules   80             111
                        All minimal rules     80             111
Lymphography            RMatrix (Gain)        82             46
                        Shortest dec. rules   81             62
                        All minimal rules     84             2085
Monk I                  RMatrix (C2)          100            10
                        Shortest dec. rules   100            10
                        All minimal rules     93             59
Monk II                 RMatrix (IREP)        75             39
                        Shortest dec. rules   79             64
                        All minimal rules     74             97
Monk III                RMatrix (IREP)        97             9
                        Shortest dec. rules   97             12
                        All minimal rules     97             12
Table 3. Results of the MODLEM algorithm

Dataset         MODLEM: Accuracy (%)   Rules   MODLEM modif.: Accuracy (%)   Rules
Australian      77                     274     86 (Brazdil)                  29
Breast          68                     101     73 (Brazdil)                  64
Iris            92                     18      92 (C2F)                      9
Heart           65                     90      80 (Brazdil)                  20
Lymphography    74                     39      81 (IREP)                     15
Monk I          92                     29      100 (IREP)                    9
Monk II         65                     71      66 (IREP)                     18
Monk III        92                     29      97 (Gain)                     13
3.3 Rules Joining
The operation of the joining algorithm was verified in two ways. In the first case, rules were determined from the quasi-shortest object-related relative reduct, with the tolerance threshold values found earlier by a genetic algorithm; the results are presented in Table 4. In the second case (Table 5), rules obtained
Table 4. Results of using the decision rules joining algorithm

Dataset (algorithm parameters)   Accuracy (%)   Rules   Reduction rate (%)   Accuracy changes (%)
Australian (IREP, 0%)            86             130     62                   0
Breast (IREP, 10%)               73             25      25                   0
Iris (IREP, 0%)                  95             19      53                   0
Heart (IREP, 0%)                 79             56      69                   +1
Lymphography (IREP, 0%)          82             43      9                    0
Monk I (C2, 0%)                  100            8       27                   0
Monk II (Acc., 0%)               80             49      22                   +1
Monk III                         97             9       0                    0
Table 5. Results of joining of decision rules obtained by the MODLEM algorithm

Dataset (algorithm parameters)   Accuracy (%)   Rules   Reduction rate (%)   Accuracy changes (%)
Australian (IREP, 10%)           85             24      17                   −1
Breast (Brazdil, 30%)            73             55      14                   0
Iris (IREP, 0%)                  94             8       11                   +1
Heart                            80             20      0                    0
Lymphography (IREP, 0%)          80             13      13                   −1
Monk I                           92             9       0                    0
Monk II (Brazdil, 10%)           67             17      10                   +1
Monk III (Gain, 0%)              97             6       53                   0
by MODLEM exploiting quality measures were subjected to joining. Various rule quality measures were applied, and at most a thirty percent decrease of rule quality was admitted during joining. As Table 4 shows, the algorithm makes it possible to limit a rule set by 38% on average (median 27%) without causing considerable increases or decreases of classification accuracy. The method of rule generation and data preparation (tolerance threshold establishing) causes approximate rules to be subjected to joining; therefore a further decrease of their quality is pointless. In [43] experiments with accurate rules were also carried out; it was shown there that in the case of accurate rules it is worth admitting a decrease of rule quality of at most 30%. The IREP and C2 measures give the best results in most cases. The results presented in Table 5 suggest that in the case of the modified version of MODLEM, application of the joining algorithm leads to an insignificant reduction of the number of rules. A general (and obvious) conclusion can also be drawn that the compression degree is the bigger, the more numerous the input rule set is.
Table 6. Results of rules filtration algorithms

Dataset                            Algorithm       Accuracy (%)   Rules   Reduction rate (%)   Accuracy changes (%)
Australian (R-Gain, F-Gain)        From coverage   86             18      37                   0
                                   Forward         86             2       93                   0
                                   Backwards       86             2       93                   0
Breast (R-Gain, F-Gain)            From coverage   24             12      52                   −49
                                   Forward         73             9       64                   0
                                   Backwards       68             8       68                   −5
Iris (R-Gain, F-Gain)              From coverage   97             7       0                    0
                                   Forward         97             5       28                   0
                                   Backwards       97             4       42                   0
Heart (R-Chi-2, F-Chi-2)           From coverage   80             12      50                   0
                                   Forward         82             15      37                   +2
                                   Backwards       81             9       62                   +1
Lymphography (R-Gain, F-IREP)      From coverage   79             14      65                   −3
                                   Forward         82             16      62                   0
                                   Backwards       82             13      69                   0
Monk I (R-C2, F-C2)                From coverage   100            10      0                    0
                                   Forward         100            10      0                    0
                                   Backwards       100            10      0                    0
Monk II (Q-Accuracy, F-Accuracy)   From coverage   78             42      14                   −2
                                   Forward         79             36      26                   −1
                                   Backwards       75             32      35                   −5
Monk III (R-IREP, F-IREP)          From coverage   97             6       14                   0
                                   Forward         97             3       57                   0
                                   Backwards       97             3       57                   0
3.4 Rules Filtration
In order to illustrate the operation of the filtration algorithms, the rules that gave the highest classification accuracy in the research so far were selected, and it was verified whether the filtration algorithms make it possible to limit the number of rules. The method of rule induction and the name of the rule quality measure used in the algorithm are given next to the name of the data set in Table 6. The following denotations were used: the RMatrix algorithm – R, the MODLEM algorithm – M, the quasi-shortest minimal decision rules algorithm – Q, rules filtration – F. The same measure as in the filtration algorithm, or one of the basic measures (accuracy, coverage), was applied during classification. The whole training set was used as the tuning set in the Forward and Backwards algorithms. The results obtained by the filtration algorithms show that, as for the joining algorithms, the compression degree is the higher, the bigger the input rule set is. Filtration significantly limits the rule set, but the From coverage algorithm often leads to a decrease of classification accuracy. Considering the decrease of classification accuracy, the Forward algorithm behaves best: it enables a meaningful restriction of the rule set without losing classification abilities.
3.5 TRS Results – Comparison with Other Methods and a Real-Life Data Analysis Example
The results obtained by the methods included in TRS were compared with several decision tree and decision rule induction algorithms. Algorithms which, like TRS, use neither constructive induction nor soft computing solutions were selected. These algorithms were: CART [7], C5 [37], SSV [11], LEM2 [19] and the RIONA algorithm, which joins rule induction with the k-NN method [17,58]. All of the quoted algorithms (including the algorithms contained in TRS) try to create descriptions of data, in the form of decision trees or rules, that are as synthetic as possible. RIONA, which joins the two approaches, is the exception here. For some value of k, set by an adaptive method, the k nearest neighbors of a fixed test object tst are found in the training set. Next, for each example trn included in the selected training example set, a so-called maximal local decision rule covering trn and tst is determined. If the rule is consistent in the set of selected training examples, then the example trn is added to the support set of the appropriate decision. Finally, the RIONA algorithm selects the decision with the support set of the highest cardinality.
All presented results come from experiments that were carried out with the following software: RSES (RIONA, LEM2), CART-Salford-Systems (CART), See5 (C5) and GhostMiner (SSV). Optimal values of the parameters were matched on the training set; after establishing the optimal parameters, the efficiency of the algorithms was tested. For the presented algorithms the parameters were:
– TRS – the algorithm and the tolerance threshold quality measure; the rule quality measure in the algorithms of rule induction, rule joining, rule filtration and classification. If a quality measure was selected during rule induction, then the same measure or one of the two basic measures (accuracy or coverage) was used during the further postprocessing and classification stages. The order of postprocessing was always the same and consisted in applying the joining algorithm and then filtration;
– RIONA – the optimal number of neighbors;
– CART – the criterion of partition optimality in a node (gini, entropy, class probability);
– SSV – the searching strategy (best first or beam search – cross-validation on the whole training set);
– LEM2 – the rule shortening ratio;
– See5 – the tree pruning degree.
Apart from the 10-fold cross-validation classification accuracy (for TRS the standard deviation is also given), the classification accuracy and the number of rules obtained on the whole available data set (for the Monks and Satimage sets – on the training sets) are also given. An analysis of the obtained results enables the conclusion that applying rule quality measures at each stage of induction makes it possible to achieve good
classification results by the induction algorithm. Applying the postprocessing algorithms (especially the filtration ones) makes it possible to reduce the number of rules in the classifier, which is important from the knowledge discovery point of view. The results obtained by TRS have the feature that the differences between the accuracy and the number of rules determined on the whole data set and in the cross-validation mode are not big. Since the majority of the applied algorithms create approximate rules, overlearning manifests itself merely in generating too big a rule set, which can still be effectively filtered. The classification accuracy obtained by TRS is comparable with the other presented algorithms. In some cases (Satimage, Monk II, Lymphography) the RIONA algorithm achieves classification accuracy higher than the other systems that use the rule approach only.
Finally, a method of solving the real-life problem of monitoring seismic hazards in hard-coal mines with the algorithms included in TRS is presented. Seismic hazard is monitored by seismic and seismoacoustic measuring instruments placed in the mine's underground. The acquired data are transmitted to the Hestia computer system [44], where they are aggregated; the evaluation of bump risk in a given excavation is calculated based on the aggregated data. Unfortunately, the presently used methods of risk evaluation are not very accurate, therefore new methods enabling warning of a coming hazard are sought. In the research carried out, a
Table 8. TRS results – comparison with other methods (values in parentheses: accuracy and number of rules obtained on the whole available data set; for TRS the ± value is the standard deviation in 10-fold cross-validation)

Dataset        TRS                  RIONA   CART   SSV              LEM2             C5
Australian     86 ± 0.2 (88, 11)    86      86     86 (86, 2)       87 (88, 126)     86 (92, 11)
Breast         73 ± 0.3 (75, 6)     73      67     76 (76, 3)       74 (78, 96)      73 (76, 3)
Iris           97 ± 0.2 (97, 4)     95      94     95 (98, 4)       94 (95, 7)       97 (97, 4)
Glass          66 ± 0.4 (87, 40)    71      70     71 (89, 21)      67 (91, 89)      70 (91, 12)
Heart          82 ± 0.3 (86, 11)    83      79     77 (86, 7)       81 (84, 43)      77 (90, 11)
Lymphography   82 ± 0.3 (90, 19)    85      82     76 (91, 16)      82 (100, 32)     80 (96, 12)
Pima           75 ± 0.2 (77, 11)    75      75     74 (76, 5)       73 (82, 194)     74 (81, 11)
Monk I         100 (100, 7)         96      87     100 (100, 11)    88 (98, 17)      84 (84, 5)
Monk II        80 (85, 49)          83      79     67 (85, 20)      73 (86, 68)      67 (66, 2)
Monk III       97 (93, 3)           94      97     97 (95, 6)       96 (94, 23)      97 (97, 6)
Satimage       83 (87, 92)          91      84     82 (83, 9)       82 (85, 663)     87 (95, 96)
decision table was created that contains information, registered during successive shifts, about the aggregated values of the seismoacoustic and seismic energy emitted in the excavation, the number of tremors in individual energy classes (from 1·10^2 J to 9·10^7 J), the risk evaluation generated by the classic evaluation methods, and the category of the shift (mining or maintenance). Data from the hazardous longwall SC508 in the Polish coal mine KWK Wesoła were collected. The data set numbered 864 objects; the data were divided into two decision classes reflecting the summary seismic energy that will be emitted in the excavation during the next shift. The limiting value for the decision classes was the energy 1·10^5 J, and the number of objects belonging to the decision class Energy > 1·10^5 J amounts to 97. With respect to the irregular distribution of the decision classes, the global strategy of determining tolerance threshold values was inadvisable; therefore the modified version of the MODLEM algorithm was used for rule induction. The best results were obtained for the Cohen measure and the Forward filtration algorithm: classification accuracy in the cross-validation test was 77%, and the accuracies of the individual classes were 76% and 77%. Applying other quality measures made it possible to achieve better classification accuracy, reaching 89%, but at the cost of the ability to recognize seismic hazards. In each experiment 22 rules were obtained on average; the decision class describing the potential hazard was described by two rules. The Forward filtration algorithm enabled removing 18 rules and obtaining decision class descriptions composed of two rules. An analysis of the accuracy of the 22 rules made it possible to establish an accuracy threshold of 0.9 for rules describing the bigger decision class (non-hazardous states).
By applying the algorithm of arbitrary rule filtration included in TRS (the Minimal quality algorithm), five decision rules were obtained on average (3 describing the "non-hazardous state" and 2 describing the "hazardous state"). The premises of the determined rules involve the following conditional attributes: the maximal seismoacoustic energy registered during a shift by any geophone, the average seismoacoustic energy registered by the geophones, the maximal number of impulses registered by any geophone, and the average number of impulses registered by the geophones. Exemplary rules are presented below:
IF avg_impulses < 2786 THEN "non-hazardous state",
IF max_energy < 161560 THEN "non-hazardous state",
IF avg_energy < 48070 THEN "non-hazardous state",
IF max_energy > 218000 THEN "hazardous state",
IF max_impulses > 1444 THEN "hazardous state".
It is also important that the determined rules are consistent with the intuition of the geophysicist working in the mine's geophysical station. The geophysicist confirmed the reasonableness of the dependences reflected by the rules after viewing them; earlier, he could not specify which attributes and which of their ranges have a decisive influence on the possibility of hazard occurrence. To recapitulate, we have here a typical example of knowledge discovery in databases. It is worth stressing that the analysis of the whole rule set (without filtration) is not so simple. At present, research is carried out on verifying whether a classification system enabling hazard prediction in any mine excavation can be developed.
4 Conclusions
Several modifications of the standard methods of global establishing of tolerance threshold values, decision rule determination and rule postprocessing have been presented in the paper. The methods included in the TRS library are available on the Internet, in the NetTRS service. The reduct calculation and minimal decision rule induction algorithms and the RMatrix algorithm included in the TRS library use the discernibility matrix [47]; therefore, their application is possible for data sets composed of at most several thousand objects (in the paper, the biggest data set was the Satimage set, for which searching for a tolerance threshold vector and rules by the RMatrix algorithm lasted a few minutes). Methods that avoid the necessity of using the discernibility matrix are described in the literature: for the standard rough set model such propositions are presented in [27], and for the tolerance model in [54]. The other algorithms of shortening, joining and filtration, and the MODLEM algorithm, can be applied to the analysis of larger data sets composed of a few dozen thousand objects (the main operations performed by these algorithms are rule quality calculation and classification). However, the efficiency of the TRS library is lower than that of commercial solutions (CART, GhostMiner) or the noncommercial program RSES [6], developed over many years. The exemplary results of the experimental research show that the algorithms implemented in the library can be useful for obtaining small rule sets with good generalization
abilities (especially when applying the sequence: induction, generalization, filtration). Small differences between the results obtained on training and test sets make it possible to match the most suitable quality measures to a given data set based on an analysis of the training set. The results of the tests also suggest that good results can be obtained by using the same quality measure at both the rule induction and postprocessing stages; this observation significantly restricts the space of possible solutions. The measure of confidence in a rule used during classification is an important parameter that greatly influences classification accuracy. An adaptive method of confidence measure selection, limited to one of the two standard measures (accuracy, coverage) or the measure used during rule induction, was applied in the paper. Well-matched values of tolerance thresholds make it possible to generate rules with better generalization and description abilities than the MODLEM algorithm offers; applying the RMatrix or the quasi-shortest minimal decision rule determination algorithm is a good solution. The postprocessing algorithms, especially the Forward filtration algorithm, enable a meaningful limitation of the set of rules used in classification, which significantly improves the descriptive abilities of the obtained rule set. A detailed specification of the experimental results illustrating the effectiveness of the majority of the algorithms presented in the paper can be found, among others, in [39,40,41,42,43,45,46].
As reported in much research [2,3,9,39,40], it is impossible to point to a measure that always gives the best results, but it is possible to distinguish two groups of measures. One of them contains measures that put emphasis on rule coverage (among others, the measures IREP, Gain, Cohen, C2, WS); the second group includes measures that put greater emphasis on rule accuracy, which leads to determining many more rules (among others, the measures Accuracy, Brazdil, Coleman, IKIB). Obviously, application of rule quality measures is sensible only if we admit approximate rule generation. If we determine accurate rules, then the only rule quality measure worth using is rule coverage (alternatively, rule strength [60], length, or the LEF criterion [22], if the accuracy and coverage of the compared rules are identical).
Further work will concentrate on adaptive quality measure selection during rule induction. In [3] a suggestion of such a solution is presented; however, the authors carried out no experimental research and proposed no algorithm solving the problem. In our research we want to apply an approach in which simple characteristics of each decision class are calculated separately before the main process of rule induction. Based on these characteristics, a quality measure is selected that enables describing a given decision class by a small number of rules with good generalization abilities. A meta-classifier, created on the basis of an analysis of as many benchmark data sets as possible, will point to the potentially best quality measure to apply at a given induction stage. Moreover, since during the application of coverage algorithms (e.g., MODLEM) the training set changes, we assume that the quality measure used by the algorithm will also change during rule induction.
It would also be interesting to verify to what degree applying filtration to a set of rules obtained by the method described in [4], in which a hierarchy of more and more general rules is built, would reduce the number of rules used in classification.
References
1. Agotnes, T.: Filtering large propositional rule sets while retaining classifier performance. MSc Thesis. Norwegian University of Science and Technology, Trondheim, Norway (1999)
2. Agotnes, T., Komorowski, J., Loken, T.: Taming Large Rule Models in Rough Set Approaches. In: Żytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 193–203. Springer, Heidelberg (1999)
3. An, A., Cercone, N.: Rule quality measures for rule induction systems – description and evaluation. Computational Intelligence 17, 409–424 (2001)
4. Bazan, J., Skowron, A., Wang, H., Wojna, A.: Multimodal classification: case studies. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 224–239. Springer, Heidelberg (2006)
5. Bazan, J.: A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision tables. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery 1: Methodology and Applications, pp. 321–365. Physica, Heidelberg (1998)
6. Bazan, J., Szczuka, M., Wróblewski, J.: A new version of rough set exploration system. In: Alpigini, J.J., Peters, J.F., Skowron, A., Zhong, N. (eds.) RSCTC 2002. LNCS (LNAI), vol. 2475, pp. 397–404. Springer, Heidelberg (2002)
7. Breiman, L., Friedman, J., Olshen, R., Stone, R.: Classification and Regression Trees. Wadsworth, Pacific Grove (1984)
8. Brazdil, P.B., Torgo, L.: Knowledge acquisition via knowledge integration. In: Current Trends in Knowledge Acquisition. IOS Press, Amsterdam (1990)
9. Bruha, I.: Quality of Decision Rules: Definitions and Classification Schemes for Multiple Rules. In: Nakhaeizadeh, G., Taylor, C.C. (eds.) Machine Learning and Statistics, The Interface, pp. 107–131. Wiley, NY (1997)
10. Brzezińska, I., Greco, S., Słowiński, R.: Mining Pareto-optimal rules with respect to support and confirmation or support and anti-support. Engineering Applications of Artificial Intelligence 20, 587–600 (2007)
11. Duch, W., Adamczak, K., Grąbczewski, K.: Methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks 12, 277–306 (2001)
12. Furnkranz, J., Widmer, G.: Incremental Reduced Error Pruning. In: Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, USA, pp. 70–77 (1994)
13. Greco, S., Matarazzo, B., Słowiński, R.: The use of rough sets and fuzzy sets in MCDM. In: Gal, T., Hanne, T., Stewart, T. (eds.) Advances in Multiple Criteria Decision Making, pp. 1–59. Kluwer Academic Publishers, Dordrecht (1999)
14. Greco, S., Matarazzo, B., Słowiński, R.: Rough sets theory for multicriteria decision analysis. European Journal of Operational Research 129, 1–47 (2001)
15. Greco, S., Pawlak, Z., Słowiński, R.: Can Bayesian confirmation measures be useful for rough set decision rules? Engineering Applications of Artificial Intelligence 17, 345–361 (2004)
16. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Company Inc., Boston (1989)
17. Góra, G., Wojna, A.: RIONA: A new classification system combining rule induction and instance-based learning. Fundamenta Informaticae 51(4), 369–390 (2002)
18. Grzymała-Busse, J.W.: LERS – a system for learning from examples based on rough sets. In: Słowiński, R. (ed.) Intelligent Decision Support. Handbook of Applications and Advances of the Rough Set Theory, pp. 3–18. Kluwer Academic Publishers, Dordrecht (1992)
19. Grzymała-Busse, J.W., Ziarko, W.: Data mining based on rough sets. In: Wang, J. (ed.) Data Mining Opportunities and Challenges, pp. 142–173. IGI Publishing, Hershey (2003)
20. Guillet, F., Hamilton, H.J. (eds.): Quality Measures in Data Mining. Computational Intelligence Series. Springer, Heidelberg (2007)
21. Kononenko, I., Bratko, I.: Information-based evaluation criterion for classifier's performance. Machine Learning 6, 67–80 (1991)
22. Kaufman, K.A., Michalski, R.S.: Learning in Inconsistent World, Rule Selection in STAR/AQ18. Machine Learning and Inference Laboratory Report P99-2 (February 1999)
23. Kubat, M., Bratko, I., Michalski, R.S.: Machine Learning and Data Mining: Methods and Applications. Wiley, NY (1998)
24. Latkowski, R., Mikołajczyk, M.: Data Decomposition and Decision Rule Joining for Classification of Data with Missing Values. In: Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Świniarski, R.W., Szczuka, M.S. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 299–320. Springer, Heidelberg (2004)
25. Michalski, R.S., Carbonell, J.G., Mitchell, T.M.: Machine Learning, vol. I. Morgan Kaufmann, Los Altos (1983)
26. Mikołajczyk, M.: Reducing number of decision rules by joining. In: Alpigini, J.J., Peters, J.F., Skowron, A., Zhong, N. (eds.) RSCTC 2002. LNCS (LNAI), vol. 2475, pp. 425–432. Springer, Heidelberg (2002)
27. Nguyen, H.S., Nguyen, S.H.: Some Efficient Algorithms for Rough Set Methods. In: Proceedings of the Sixth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Granada, Spain, pp. 1451–1456 (1996)
28. Nguyen, H.S., Nguyen, T.T., Skowron, A., Synak, P.: Knowledge discovery by rough set methods. In: Callaos, N.C. (ed.) Proc. of the International Conference on Information Systems Analysis and Synthesis, ISAS 1996, Orlando, USA, July 22-26, pp. 26–33 (1996)
29. Nguyen, H.S., Skowron, A.: Searching for relational patterns in data. In: Komorowski, J., Żytkow, J.M. (eds.) PKDD 1997. LNCS, vol. 1263, pp. 265–276. Springer, Heidelberg (1997)
30. Nguyen, H.S., Skowron, A., Synak, P.: Discovery of data patterns with applications to decomposition and classification problems. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery 2: Applications, Case Studies and Software Systems, pp. 55–97. Physica, Heidelberg (1998)
31. Nguyen, H.S.: Data regularity analysis and applications in data mining. Doctoral Thesis, Warsaw University. In: Polkowski, L., Tsumoto, S., Lin, T.Y. (eds.) Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems, pp. 289–378. Physica-Verlag/Springer, Heidelberg (2000), http://logic.mimuw.edu.pl/
32. Ohrn, A., Komorowski, J., Skowron, A., Synak, P.: The design and implementation of a knowledge discovery toolkit based on rough sets: The ROSETTA system. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery 1: Methodology and Applications, pp. 376–399. Physica, Heidelberg (1998)
33. Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
34. Pednault, E.: Minimal-Length Encoding and Inductive Inference. In: Piatetsky-Shapiro, G., Frawley, W.J. (eds.) Knowledge Discovery in Databases, pp. 71–92. MIT Press, Cambridge (1991)
35. Pindur, R., Susmaga, R., Stefanowski, J.: Hyperplane aggregation of dominance decision rules. Fundamenta Informaticae 61, 117–137 (2004)
36. Podraza, R., Walkiewicz, M., Dominik, A.: Credibility coefficients in ARES Rough Set Exploration System. In: Ślęzak, D., Yao, J., Peters, J.F., Ziarko, W.P., Hu, X. (eds.) RSFDGrC 2005. LNCS (LNAI), vol. 3642, pp. 29–38. Springer, Heidelberg (2005)
37. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)
38. Prędki, B., Słowiński, R., Stefanowski, J., Susmaga, R.: ROSE – Software implementation of the rough set theory. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, p. 605. Springer, Heidelberg (1998)
39. Sikora, M., Proksa, P.: Algorithms for generation and filtration of approximate decision rules, using rule-related quality measures. In: Proceedings of the International Workshop on Rough Set Theory and Granular Computing (RSTGC 2001), Matsue, Shimane, Japan, pp. 93–98 (2001)
40. Sikora, M.: Rules evaluation and generalization for decision classes descriptions improvement. Doctoral Thesis, Silesian University of Technology, Gliwice, Poland (2001) (in Polish)
41. Sikora, M., Proksa, P.: Induction of decision and association rules for knowledge discovery in industrial databases. In: International Conference on Data Mining, Alternative Techniques for Data Mining Workshop, Brighton, UK (2004)
42. Sikora, M.: Approximate decision rules induction algorithm using rough sets and rule-related quality measures. Theoretical and Applied Informatics 4, 3–16 (2004)
43. Sikora, M.: An algorithm for generalization of decision rules by joining. Foundations of Computing and Decision Sciences 30, 227–239 (2005)
44. Sikora, M.: System for geophysical station work supporting – exploitation and development. In: Proceedings of the 13th International Conference on Natural Hazards in Mining, Central Mining Institute, Katowice, Poland, pp. 311–319 (2006) (in Polish)
45. Sikora, M.: Rule quality measures in creation and reduction of data rule models. In: Greco, S., Hata, Y., Hirano, S., Inuiguchi, M., Miyamoto, S., Nguyen, H.S., Słowiński, R. (eds.) RSCTC 2006. LNCS (LNAI), vol. 4259, pp. 716–725. Springer, Heidelberg (2006)
46. Sikora, M.: Adaptative application of quality measures in rules induction algorithms. In: Kozielski, S. (ed.) Databases, New Technologies, vol. I. Transport and Communication Publishers (Wydawnictwa Komunikacji i Łączności), Warsaw (2007) (in Polish)
47. Skowron, A., Rauszer, C.: The Discernibility Matrices and Functions in Information Systems. In: Słowiński, R. (ed.) Intelligent Decision Support. Handbook of Applications and Advances of the Rough Set Theory, pp. 331–362. Kluwer Academic Publishers, Dordrecht (1992)
48. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets V. LNCS, vol. 4100, pp. 224–239. Springer, Heidelberg (2006)
49. Skowron, A., Wang, H., Wojna, A., Bazan, J.: Multimodal Classification: Case Studies. Fundamenta Informaticae 27, 245–253 (1996)
50. Słowiński, R., Greco, S., Matarazzo, B.: Mining decision-rule preference model from rough approximation of preference relation. In: Proceedings of the 26th IEEE Annual Int. Conf. on Computer Software and Applications, Oxford, UK, pp. 1129–1134 (2002)
51. Stefanowski, J.: Rough set based rule induction techniques for classification problems. In: Proceedings of the 6th European Congress of Intelligent Techniques and Soft Computing, Aachen, Germany, pp. 107–119 (1998)
52. Stefanowski, J.: Algorithms of rule induction for knowledge discovery. Poznań University of Technology, Thesis series 361, Poznań, Poland (2001) (in Polish)
53. Smyth, P., Goodman, R.M.: Rule induction using information theory. In: Piatetsky-Shapiro, G., Frawley, W.J. (eds.) Knowledge Discovery in Databases, pp. 159–176. MIT Press, Cambridge (1991)
54. Stepaniuk, J.: Knowledge Discovery by Application of Rough Set Models. Institute of Computer Science, Polish Academy of Sciences, Reports 887, Warsaw, Poland (1999)
55. Stepaniuk, J., Krętowski, M.: Decision System Based on Tolerance Rough Sets. In: Proceedings of the 4th International Workshop on Intelligent Information Systems, Augustów, Poland, pp. 62–73 (1995)
56. Ślęzak, D., Wróblewski, J.: Classification Algorithms Based on Linear Combination of Features. In: Żytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 548–553. Springer, Heidelberg (1999)
57. Wang, H., Duentsch, I., Gediga, G., Skowron, A.: Hyperrelations in version space. International Journal of Approximate Reasoning 36(3), 223–241 (2004)
58. Wojna, A.: Analogy based reasoning in classifier construction. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets IV. LNCS, vol. 3700, pp. 277–374. Springer, Heidelberg (2005)
59. Ziarko, W.: Variable precision rough sets model. Journal of Computer and System Sciences 46, 39–59 (1993)
60. Zhong, N., Skowron, A.: A rough set-based knowledge discovery process. International Journal of Applied Mathematics and Computer Science 11, 603–619 (2001)
61. Yao, Y.Y., Zhong, N.: An Analysis of Quantitative Measures Associated with Rules. In: Zhong, N., Zhou, L. (eds.) PAKDD 1999. LNCS (LNAI), vol. 1574, pp. 479–488. Springer, Heidelberg (1999)
A Distributed Decision Rules Calculation Using Apriori Algorithm

Tomasz Strąkowski and Henryk Rybiński

Warsaw University of Technology, Poland
[email protected], [email protected]

Abstract. Calculating decision rules is a very important process. There are many solutions for computing decision rules, but algorithms that compute the complete set of rules are time consuming. We propose a recursive version of the well-known apriori algorithm [1], designed for parallel processing. We present how to decompose the problem of calculating decision rules so that the parallel calculations are efficient.

Keywords: Rough set theory, decision rules, distributed computing.
1 Introduction
The number of applications of Rough Set Theory (RST) in the field of Data Mining is growing, and there is a lot of research in which advanced tools for RST are developed. The most valuable feature of this theory is that it discovers knowledge from vague data. Decision rules are one of the fundamental concepts in RST, and their discovery is one of the main research areas. This paper refers to the problem of generating the complete set of decision rules. One of the most popular methods of finding all rules is the RS-apriori algorithm proposed by Kryszkiewicz [1], based on the idea of the apriori algorithm proposed in [2]. Among other techniques solving this problem it is worth noting LEM [3] and incremental learning [4]. The LEM algorithm gives a restricted set of rules which covers all objects from the decision table (DT), but does not necessarily discover all the rules. Incremental learning is used to efficiently update the set of rules when new objects are added to DT. The main problem with discovering decision rules is that the process is very time consuming. There are generally two ways of speeding it up: the first is to use some heuristics, the second consists in distributing the computation among a number of processors. In this paper we focus on the second approach. We present a modification of the apriori algorithm, designed for parallel computing.
The research has been partially supported by grant No. 3 T11C 002 29 received from the Polish Ministry of Education and Science, and partially by a grant of the Rector of Warsaw University of Technology, No. 503/G/1032/4200/000.
The problem of distributed generation of rules was presented in [5] in the form of a multi-agent system. The idea of the multi-agent system consists in splitting the information system into many independent subsystems; each subsystem computes rules for its part, and a specialized agent merges the partial decision rules. Unfortunately, this method does not provide the same results as the sequential version. Our aim is to provide a parallel algorithm which gives the same result as a sequential algorithm finding all rules, but in a more efficient way.
The paper is composed as follows. In Section 2 we recall basic notions related to the rough set theory, referring to the concept of decision rules. Section 3 presents the original version of the rough set adaptation of the apriori algorithm for rule calculation [1]. Then we present our proposal for storing rules in a tree form, which is slightly different from the original hash tree proposed in [6,2] and makes it possible to define recursive calculation of the rules; we present such a recursive version. Section 4 is devoted to presenting two effective ways of parallel computing of the rules with a number of processors. Section 5 provides experimental results for the presented algorithms. We conclude the paper with a discussion of the effectiveness of the proposed approaches.
2 Computing Rules and Rough Sets Basis
Let us start by recalling basic notions of the rough set theory. An information system IS is a pair IS = (U, A), where U is a finite set of elements and A is a finite set of attributes describing the elements. Every a ∈ A is a function a: U → V_a assigning a value v ∈ V_a of the attribute a to the objects u ∈ U, where V_a is the domain of a. The indiscernibility relation is defined as follows:
IND(A) = {(u, v) : u, v ∈ U, ∀a ∈ A  a(u) = a(v)}.
The IND relation can also be defined for a particular attribute:
IND(a) = {(u, v) : u, v ∈ U, a(u) = a(v)}.
Informally speaking, two objects u and v are indiscernible with respect to the attribute a if they have the same value of that attribute. One can show that the indiscernibility relation for a set of attributes B ⊆ A can be expressed as IND(B) = ∩_{a∈B} IND(a). In the sequel, I_B(x) denotes the set of objects indiscernible with x wrt B. IND(B) is an equivalence relation and it splits IS into abstraction classes. The set of abstraction classes wrt B will further be denoted by AC_B(IS), so AC_B(IS) = {Y_i | ∀y ∈ Y_i, Y_i = I_B(y)}.
Given X ⊆ U and B ⊆ A, we say that B(X) is the lower approximation of X if B(X) = {x ∈ U | I_B(x) ⊆ X}. In addition, B̄(X) is the upper approximation of X if B̄(X) = {x ∈ U | I_B(x) ∩ X ≠ ∅}. B(X) is the set of objects that certainly belong to X, B̄(X) is the set of objects that possibly belong to X; hence B(X) ⊆ X ⊆ B̄(X).
A decision table (DT) is an information system DT = (U, A ∪ D), where D ∩ A = ∅. The attributes from D are called decision attributes, the attributes
from A are called condition attributes. The set AC_D(DT) will be called the set of decision classes. A decision rule has the form t → s, where t = ∧_i (c_i, v_i) with c_i ∈ A and v_i ∈ V_{c_i}, and s = ∧_{d_i ∈ D} (d_i, w_i) with w_i ∈ V_{d_i}; t is called the antecedent and s the consequent. By ||t|| we denote the set of objects described by the conjunction of (attribute, value) pairs t. The cardinality of ||t|| is called the support of t and will be denoted by sup(t). We say that an object u ∈ U supports the rule t → s if u belongs both to ||t|| and to ||s||, i.e. u ∈ ||t ∧ s||. Given a threshold, we say that the rule t → s is frequent if its support is higher than the threshold. The rule is called certain if ||t|| ⊆ ||s||, and possible if ||t|| ⊆ Ā(||s||), where Ā denotes the upper approximation wrt A. The confidence of the rule t → s is defined as sup(t ∧ s)/sup(t); for certain rules the confidence is 100%. Any rule with k pairs (attribute, value) in the antecedent is called a k-ant rule. Any rule t ∧ p → s will be called an extension of the rule t → s, and t → s will be called a seed of t ∧ p → s. A certain/possible rule is called optimal if there is no other certain/possible rule with fewer conditions in the antecedent.
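To make these notions concrete, the short sketch below computes abstraction classes, the lower and upper approximations of a decision class, and the support and confidence of a rule; the toy table and helper names are illustrative and are not taken from the paper.

```python
# Illustrative sketch only: toy data, assumed helper names.

def abstraction_classes(U, table, attrs):
    """Group the objects of U that are indiscernible wrt the attributes attrs."""
    classes = {}
    for u in U:
        classes.setdefault(tuple(table[u][a] for a in attrs), set()).add(u)
    return list(classes.values())

def lower_approx(U, table, attrs, X):
    """Objects whose abstraction class is fully contained in X."""
    return {u for c in abstraction_classes(U, table, attrs) if c <= X for u in c}

def upper_approx(U, table, attrs, X):
    """Objects whose abstraction class intersects X."""
    return {u for c in abstraction_classes(U, table, attrs) if c & X for u in c}

def support_confidence(U, table, antecedent, consequent):
    """sup(t) and sup(t ^ s) / sup(t) for the rule t -> s."""
    t = {u for u in U if all(table[u][a] == v for a, v in antecedent)}
    ts = {u for u in t if all(table[u][d] == v for d, v in consequent)}
    return len(t), (len(ts) / len(t) if t else 0.0)

# toy decision table: condition attributes a, b and decision attribute d
table = {1: {'a': 0, 'b': 1, 'd': 1}, 2: {'a': 0, 'b': 1, 'd': 1},
         3: {'a': 1, 'b': 0, 'd': 0}, 4: {'a': 1, 'b': 1, 'd': 0}}
U = set(table)
D1 = {u for u in U if table[u]['d'] == 1}
print(lower_approx(U, table, ['a', 'b'], D1))                # {1, 2}
print(support_confidence(U, table, [('a', 0)], [('d', 1)]))  # (2, 1.0): certain
```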
3 Apriori Algorithm
RS-apriori [1] was invented as a way to compute rules from large databases. A drawback of the approach is that the resulting set of rules cannot be too large, otherwise memory problems emerge with storing all the candidates. To this end we attempt to reduce the memory requirements and achieve better time efficiency. First, let us recall the original version of RS-apriori [1] (Algorithm 3.1). The algorithm is run for a certain threshold defined by the user. As the first step, the set of rules with one element in the antecedent (C1) is created with INIT RULES, which is performed by a single pass over the decision table (Algorithm 3.2). Then the loop starts. Every iteration consists of the following three phases:

1. counting the support and confidence of all candidate rules;
2. selecting the rules with support higher than the threshold;
3. building (k+1)-ant rules based on the k-ant rules.

The third step of the iteration performs pruning. The idea of pruning is based on the fact that every seed of a frequent rule has to be frequent. Using this property, if some k-ant rule has a (k-1)-ant seed rule which is not frequent, we can delete the k-ant rule without counting its support. Thus pruning reduces the number of candidate rules. The stop condition can be defined as follows:

1. after selecting rule candidates the number of rules is less than 2 (so it is not possible to perform the next step), or
2. the number k at the k-ant rule level is greater than, or equal to, the number of condition attributes.
Algorithm 3.1: RS-Apriori CERTAIN(DT, minSupport)

  procedure APRIORI RULE GEN(Ck)
    insert into Ck+1 (ant, cons, antCount, ruleCount)
      select f.ant[1], f.ant[2], ..., f.ant[k], c.ant[k], f.cons, 0, 0
      from Ck f, Ck c
      where f.cons = c.cons and f.ant[k-1] = c.ant[k-1]
            and (f.ant[k]).attribute < (c.ant[k]).attribute
    for all c ∈ Ck+1                              /* rules pruning */
      do if |{f ∈ Ck : f.cons = c.cons and f.ant ⊂ c.ant}| < k + 1
           then delete c from Ck+1
    return (Ck+1)

  main
    C1 = INIT RULES(DT)
    k = 1
    while Ck ≠ ∅ and k ≤ |A|
      do Rk = ∅
         for all candidates c ∈ Ck
           do c.antCount = 0; c.ruleCount = 0
         for all objects u ∈ U
           do for all candidates c ∈ Ck
                do if c.ant ⊂ IA(u)
                     then c.antCount = c.antCount + 1
                          if c.cons ∈ IA(u)
                            then c.ruleCount = c.ruleCount + 1
         for all candidates c ∈ Ck
           do if c.ruleCount ≤ minSupport
                then delete c from Ck
                else move c from Ck to Rk
         Ck+1 = APRIORI RULE GEN(Ck)
         k = k + 1
    return (∪k Rk)

Algorithm 3.2: INIT RULES(DT)

  main
    C1 = ∅
    for all objects x ∈ DT
      do C1 = C1 ∪ {c ∉ C1 | c.cons = (d, v) ∈ IA(x) and c.ant = (a, v) ∈ IA(x) and a ∈ A}
    return (C1)

Already for this algorithm there is a possibility to perform some parts in parallel. In particular, instead of building the set C1 based on the whole decision table, we could run it separately for each decision class. The process of generating rules for a decision class is independent of the other classes and can be run in parallel with the other processes. If we look for certain rules we use lower approximations
of the decision class to build C1. If we look for possible rules, we use the whole decision class.
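The following Python sketch shows one iteration of this scheme (support counting, selection of frequent rules, and pruned candidate generation) in a simplified form; the data layout and names are assumptions for illustration, not the authors' implementation.

```python
from itertools import combinations

def matches(obj, conds):                      # obj: dict attribute -> value
    return all(obj[a] == v for a, v in conds)

def rs_apriori_step(Ck, objects, cons, min_support):
    """Ck: set of frozensets of (attribute, value) pairs (k-ant antecedents)."""
    counts = {ant: sum(1 for o in objects if matches(o, ant) and matches(o, cons))
              for ant in Ck}
    Rk = {ant for ant in Ck if counts[ant] > min_support}   # frequent k-ant rules
    Ck1 = set()
    for a1, a2 in combinations(Rk, 2):        # join rules differing in one pair
        cand = a1 | a2
        if len(cand) != len(a1) + 1:
            continue
        # pruning: every k-ant seed of the candidate must itself be frequent
        if all(frozenset(seed) in Rk for seed in combinations(cand, len(cand) - 1)):
            Ck1.add(frozenset(cand))
    return Rk, Ck1
```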
3.1 Tree Structure for Keeping Candidate Rules
In the original algorithm [6] candidate rules are stored in a hash tree for the two following reasons:

1. it guarantees more efficient counting of the support of the candidates;
2. it makes it easier and more efficient to prune the candidate set.

In our approach we also build a tree for storing the partial results. It has the same properties as the original Agrawal tree [6], but additionally only the rules that are in one node may be parents of the rules in the node's subtree. This property is very important, as it makes it possible to perform the process of searching for rules independently for particular subtrees, and in a recursive way. Let us call a particular pair (attribute, value) a single condition. All the k-ant rules are kept on the k-th level of the tree. Every rule that can potentially be joined into a new candidate rule is kept in the same node of the tree.
Fig. 1. Tree for keeping candidate rules (the root holds the 1-ant rules (a=1)→s, (b=1)→s, (c=1)→s; its subtrees hold the extensions (a=1 ∧ b=1)→s, (a=1 ∧ c=1)→s, (b=1 ∧ c=1)→s and (a=1 ∧ b=1 ∧ c=1)→s)
Let us assume that the condition parts of all the rules stored in the tree are always in the same order (i.e. attribute A always comes before B, and B before C). At the first level the tree starts with a root, where all 1-ant rules are stored together. On the second level of the tree each node stores the rules with the same attribute in the first single condition, so the number of nodes on the second level is equal to the number of different attributes used in the first conditions of the rules. Each of the third-level nodes of the tree contains the rules that start
with the same attributes in the first two conditions of the antecedents. Generally speaking, on the k-th level of the tree each node stores the rules with the same attributes in the first k-1 conditions, counted from the beginning. A sketch of such a tree structure is given below, and the proposed algorithm is presented in more detail in the next subsection.
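A minimal Python sketch of such a tree is given below; the class and field names are illustrative, not taken from the paper.

```python
class RuleTreeNode:
    """Level-k node: holds k-ant rules whose first k-1 conditions use exactly
    the attributes in `prefix`; children are keyed by the next attribute."""
    def __init__(self, level=1, prefix=()):
        self.level = level
        self.prefix = prefix
        self.rules = []            # list of (antecedent, consequent) pairs
        self.children = {}         # attribute -> RuleTreeNode

    def child(self, attribute):
        if attribute not in self.children:
            self.children[attribute] = RuleTreeNode(self.level + 1,
                                                    self.prefix + (attribute,))
        return self.children[attribute]

# all 1-ant rules live in the root; (a=1 ^ b=1) -> s goes below the child 'a'
root = RuleTreeNode()
root.rules.append(((('a', 1),), 's'))
root.child('a').rules.append(((('a', 1), ('b', 1)), 's'))
root.child('a').child('b').rules.append(((('a', 1), ('b', 1), ('c', 1)), 's'))
```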
3.2 Recursive Version of Apriori
For the tree structure described above we can define an algorithm which recursively performs the process of discovering rules from a decision table. The proposed algorithm is presented as Algorithm 3.3. It consists of the definitions of the following procedures:

1. AprioriRuleGen, which for a tree node t in T generates all possible (k+1)-ant candidate rules based on the k-ant rules stored in the node;
2. CheckRules, which checks the support of all the candidate rules in a given node t and removes the rules with support less than a threshold;
3. CountSupportForNode, which counts the support and confidence of all candidate rules in a node t;
4. ComputeRules, which for a decision table DT, minSupport and a given node t computes the rules by calling CountSupportForNode in the first step, then CheckRules in the second step, and then recursively calling itself to generate the next-level rules.

Having defined the four procedures, we can now perform the calculation process as follows, for each decision class:

1. with InitRules in the first phase of the cycle, the root of the tree T is initialized by calculating the 1-ant rules, as in Algorithm 3.1;
2. having initialized the root for the i-th decision class, we calculate the decision rules by calling ComputeRules, which recursively finds the rules for the decision class and accumulates them in results.

The recursive approach gives us a worse computation time, but has lower memory requirements. Another advantage of this version is that at any moment of its execution we can send a branch t of the tree T and a copy of DT to another processor, and start processing this branch independently. Having completed the branch calculations we join the particular results R into one set and remove duplicates. This way we can decompose the problem even into massively parallel computing (note that each processor receiving a branch of the tree and a copy of DT can compute by itself, or can share the work with other agents). In the next section we explain in more detail how the recursiveness can be used for splitting the calculations into separate tasks, so that the whole process can be performed in parallel.
Algorithm 3.3: RS-APRIORI-RECURSIVE(DT, minSupport)

  procedure AprioriRuleGen(T, t)
    insert into T (ant, cons, antCount, ruleCount)
      select f.ant[1], f.ant[2], ..., f.ant[k], c.ant[k], f.cons, 0, 0
      from t, t                               // f ∈ t and c ∈ t
      where f.cons = c.cons and f.ant[k-1] = c.ant[k-1]
            and (f.ant[k]).attribute < (c.ant[k]).attribute

  procedure CheckRules(t, R, minSupport)
    for all candidates c ∈ t
      do if c.ruleCount ≤ minSupport
           then delete c from t
           else if c.ruleCount = c.antCount
                  then move c from t to R

  procedure CountSupportForNode(t)
    for all candidates c ∈ t
      do c.antCount = 0; c.ruleCount = 0
    for all objects u ∈ U
      do for all candidates c ∈ t
           do if c.ant ⊂ IA(u)
                then c.antCount = c.antCount + 1
                     if c.cons ∈ IA(u)
                       then c.ruleCount = c.ruleCount + 1

  procedure ComputeRules(t, R, DT, minSupport)
    if t.level = |A|
      then return (R)
      else CountSupportForNode(t)
           CheckRules(t, R, minSupport)
           for all ti that are children of t
             do ComputeRules(ti, R, DT, minSupport)

  main
    D is the set of decision classes
    result = ∅
    for i ← 1 to |D|
      do T is a tree
         T.root = INIT RULES(A(Di))
         result = result ∪ ComputeRules(T, result, DT, minSupport)
    return (result)
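For readability, the sketch below renders the recursion of Algorithm 3.3 in Python in a simplified form: candidates are plain tuples of (attribute, value) pairs grouped by their attribute prefix, and the extension step enumerates the attribute values seen in the data rather than joining candidates. It is an illustration only, not the authors' code.

```python
def count_support(node, objects, cons):
    """antCount and ruleCount for every candidate antecedent in the node."""
    counts = {}
    for ant in node:
        ant_objs = [o for o in objects if all(o[a] == v for a, v in ant)]
        rule_objs = [o for o in ant_objs if all(o[d] == v for d, v in cons)]
        counts[ant] = (len(ant_objs), len(rule_objs))
    return counts

def check_rules(node, counts, results, min_support):
    kept = []
    for ant in node:
        ant_cnt, rule_cnt = counts[ant]
        if rule_cnt <= min_support:
            continue                     # infrequent: delete
        if rule_cnt == ant_cnt:
            results.append(ant)          # certain rule: move to the results
        else:
            kept.append(ant)             # frequent, not yet certain: extend
    return kept

def rule_gen(node, attrs, objects):
    """Children keyed by the attribute prefix of the extended antecedents."""
    children = {}
    for ant in node:
        last = max(attrs.index(a) for a, _ in ant)
        prefix = tuple(a for a, _ in ant)
        for a in attrs[last + 1:]:
            for v in {o[a] for o in objects}:
                children.setdefault(prefix, []).append(ant + ((a, v),))
    return children

def compute_rules(node, objects, cons, attrs, results, min_support):
    if not node or len(node[0]) >= len(attrs):
        return results
    kept = check_rules(node, count_support(node, objects, cons),
                       results, min_support)
    for child in rule_gen(kept, attrs, objects).values():
        compute_rules(child, objects, cons, attrs, results, min_support)
    return results

# usage: start from the 1-ant candidates of one decision class, e.g.
# compute_rules([(('a', 1),), (('b', 0),)], objects, (('d', 1),),
#               ['a', 'b', 'c'], [], min_support=0)
```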
4 Parallel Computations
The approach presented in this Section is an upgraded version of an earlier approach, presented in [7], where we have proven that by generating separately
sets of rules from disjoint parts of DT and then joining them, we obtain the same result as computing the rules by the sequential apriori. The rationalization of the computations in the current solution consists in the fact that now every processor has its own copy of the whole decision table (even if the first k-ant rules are computed from a disjoint subset). The version described in [7] suffered from a large number of messages exchanged between the processors. It is therefore justified to provide the whole table to every processor, which drastically reduces the number of network messages. In addition, in the version described in [7] each processor used the sequential version of apriori, whereas in the presented approach we use the recursive algorithm, which gives rise to a better use of the processors. It will be described in more detail below in this section. As already mentioned, even the original version of the apriori algorithm can be performed in parallel. The main constraint in this case is the limited number of decision classes of DT. Actually, in the majority of practical cases the number of decision classes in decision tables is rather limited, thus the level of distribution of the calculations is not too high. The presented recursive version of Algorithm 3.3 gives us the possibility to split the process into many parts. However, the problem is that in the first phase of the algorithm (initialization) there is nothing to split. Therefore we propose to split DT into disjoint subsets of the objects and start parallel recursive calculations for each subset. In the sequel the starting sets will be called initial sets. Based on an assigned initial set, the processor first generates the 1-ant rules; then the computations continue on this processor recursively, based on the k-ant rules and the whole DT (which, as mentioned above, is also available to the processor). An important advantage of the algorithm is its scalability. There is, though, a problem with the maximum number of processors that can start computations in parallel. It is limited by the number of abstraction classes in DT. Actually, if we split one abstraction class and process it by a few processors, we receive identical sets of rules already in the first phase of the algorithm, so it would give us only a redundancy of computing. Therefore for the first phase we do not split the abstraction classes. On the other hand, after completing the first phase of the algorithm it is possible to use many more processors. Actually, after computing the first level every processor can share its job with co-working processors, assigning them the branches of the tree computed locally. Below we consider two versions of the algorithm:

1. one consisting in generating initial sets by randomly choosing abstraction classes belonging to an approximation of a selected decision class;
2. one which additionally minimizes the redundancy of 1-ant rules among the processors.

Let us present the first version of the algorithm (Algorithm 4.1). We have defined here two procedures:

1. DistributeXintoY, which randomly splits n objects into l groups, where X1 ... Xn are the objects and Y1 ... Yl are the groups;
2. Associate, which splits X into l initial sets Y1 ... Yl, where X = ∪ Xi, X is a lower or upper approximation of a decision class, and Xi is an abstraction class.
Algorithm 4.1: SplitDT(DT, N)

  procedure distributeXintoY(X, Y)
    X is a set of sets, where ∀i Xi ≠ ∅
    Y is a set of sets, where ∀i Yi = ∅
    for i ← 1 to k
      do Y_((i mod l)+1) = {Y_((i mod l)+1), Xi}

  procedure associate(X, Y, l)
    X is a set {Cg, ..., Ch}, where Ci is an abstraction class and C is the set of all
      abstraction classes, so X ⊂ C and ∪_{i=g..h} Ci is a lower or upper approximation
      of a decision class;
    Y is a set {Sk, ..., Sl}, where Si is a particular initial set and Y is a subset of the
      set of all initial sets, so Y ⊂ S
    distributeXintoY(X, Y, |Y|)

  main
    N - the number of processors
    S = {S1, ..., SN} - the set of initial sets
    split DT into decision classes DC(DT) = {D1, ..., Dn}
    if n ≥ N
      then distributeXintoY({A(D1), ..., A(Dn)}, S, N)
      else Y = {Y1, ..., Yn}
           for i ← 1 to n
             do Yi = ∅
           distributeXintoY(S, Y, n)
           for i ← 1 to n
             do associate(A(Di), Yi)

As we start in the procedure with lower approximations, it calculates certain rules. If we want to find possible rules, we should use here Ā(Di) instead of A(Di). With the two procedures, Algorithm 4.1 prepares tasks for N processors in the following way:

1. first, the decision classes are calculated, and then
2. the number of decision classes n is compared with the number of processors N;
3. if n ≥ N, the algorithm splits the n decision classes into N groups, so that some processors may have more than one decision class to work with;
4. if n < N, we first split the N processors into n groups, so that each group of processors is dedicated to a particular decision class. If a decision class is assigned to more than one processor, this class should be split into k initial sets, where k is the number of processors in the group. Then each initial set in the group will be computed by a separate processor.
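A compact Python rendering of this splitting step is given below; the data layout (approximations and abstraction classes as Python sets) is assumed for illustration only.

```python
def distribute(parts, n_groups):
    """Round-robin distribution of parts (sets of objects) into n_groups groups."""
    groups = [set() for _ in range(n_groups)]
    for i, part in enumerate(parts):
        groups[i % n_groups] |= part
    return groups

def split_dt(class_approxs, abstraction_classes, n_proc):
    """class_approxs: lower (or upper) approximation of every decision class."""
    n = len(class_approxs)
    if n >= n_proc:                          # some processors get several classes
        return distribute(class_approxs, n_proc)
    # fewer classes than processors: dedicate a group of processors to each class
    proc_groups = distribute([{p} for p in range(n_proc)], n)
    initial_sets = [set() for _ in range(n_proc)]
    for ci, procs in enumerate(proc_groups):
        procs = sorted(procs)
        parts = [c for c in abstraction_classes if c <= class_approxs[ci]]
        for j, part in enumerate(distribute(parts, len(procs))):
            initial_sets[procs[j]] = part
    return initial_sets

# e.g. two processors and the two lower approximations of Scenario 1 below:
# split_dt([{1, 2, 7}, {3, 4, 5, 6}], [], 2)  ->  [{1, 2, 7}, {3, 4, 5, 6}]
```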
Having split the problem into N processors, Algorithm 3.3 can be started on each processor, so that the process of calculating the decision rules is performed in parallel on N processors. Let us illustrate the algorithm for the decision table presented in Table 1.

Table 1. Decision Table

         a  b  c  d
    u1   0  1  1  1
    u2   0  1  0  1
    u3   1  0  0  0
    u4   1  0  1  0
    u5   0  0  0  0
    u6   0  0  0  0
    u7   1  1  0  1
    u8   1  1  1  1
    u9   1  1  1  0
We consider four scenarios of splitting the table into initial sets. The table contains two decision classes:

– D1 = {1, 2, 7, 8}
– D2 = {3, 4, 5, 6, 9}

and 8 abstraction classes: C1 = {1}, C2 = {2}, C3 = {3}, C4 = {4}, C5 = {5}, C6 = {6}, C7 = {7}, C8 = {8, 9}; hence the maximal number of processors for the first phase is 8.

Scenario 1. We plan to compute certain rules with 2 processors. The first step is to find the decision classes. As we have two decision classes, we assign one class to each processor. We create two initial sets S1 and S2, where S1 = A(D1) = {1, 2, 7} and S2 = A(D2) = {3, 4, 5, 6}.

Scenario 2. We plan to compute certain rules with 6 processors. Having found 2 decision classes for 6 processors, we have to perform Step 4 (n < N) and generate initial sets. Given A(D1) = {1, 2, 7} and A(D2) = {3, 4, 5, 6}, we can have the following initial sets: S1 = {1}, S2 = {2}, S3 = {7}, S4 = {3}, S5 = {4}, S6 = {5, 6}. The objects 5 and 6 belong to the same lower approximation of a decision class, thus they can belong to the same initial set.

Scenario 3. Again we plan to use 6 processors, but this time for computing possible rules. For the decision classes D1 and D2 we compute the upper approximations Ā(D1) and Ā(D2). Again we have more processors than classes. Given the calculated Ā(D1) = {1, 2, 7, 8, 9} and Ā(D2) = {3, 4, 5, 6, 8, 9}, we may have the following 6 initial sets: S1 = {1, 8, 9}, S2 = {2}, S3 = {7}, S4 = {3, 6}, S5 = {4, 8, 9}, S6 = {5}, to be assigned to the 6 processors for further calculations.

Scenario 4. Let us plan to compute certain rules with 3 processors. Again, having 2 decision classes, we have to use A(D1) and A(D2) (as in Scenarios 1 and 2) to generate 3 initial sets: S1 = {1, 2, 7}, S2 = {3, 5}, S3 = {4, 6}.
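The sketch below recomputes the approximations used in these scenarios directly from Table 1 (conditions a, b, c, decision d); it is a verification aid, not part of the authors' method.

```python
table = {1: (0, 1, 1, 1), 2: (0, 1, 0, 1), 3: (1, 0, 0, 0), 4: (1, 0, 1, 0),
         5: (0, 0, 0, 0), 6: (0, 0, 0, 0), 7: (1, 1, 0, 1), 8: (1, 1, 1, 1),
         9: (1, 1, 1, 0)}                      # columns: a, b, c, d
U = set(table)
def abstraction_class(u):                       # wrt the conditions a, b, c
    return {v for v in U if table[v][:3] == table[u][:3]}
D1 = {u for u in U if table[u][3] == 1}         # {1, 2, 7, 8}
D2 = U - D1                                     # {3, 4, 5, 6, 9}
lower = lambda X: {u for u in U if abstraction_class(u) <= X}
upper = lambda X: {u for u in U if abstraction_class(u) & X}
print(sorted(lower(D1)), sorted(lower(D2)))   # [1, 2, 7] [3, 4, 5, 6]
print(sorted(upper(D1)), sorted(upper(D2)))   # [1, 2, 7, 8, 9] [3, 4, 5, 6, 8, 9]
```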
It may happen that for the generated initial sets we receive redundancy in the partial results. Clearly, if two initial sets X1 and X2 have objects belonging to the same decision class and they generate the same sets of 1-ant rules, then the two processors will give exactly the same partial results. The proof is obvious: if we have two identical sets of 1-ant rules resulting from two different initial sets (and the two processors have identical copies of DT), the results will be the same. Scenario 4 is such an example. Actually, for the generated initial sets S2 and S3 assigned to two different processors we obtain identical sets of 1-ant rules from the two processors. This leads to 100% redundancy in the two partial results. For this reason, the process of choosing initial sets is of crucial importance. In the next subsection we discuss in more detail how to overcome the problem.
4.1 The Problem of Redundancy of Partial Results
An optimal solution for splitting the calculation process among N processors would be to have the N sets of 1-ant rules disjoint; unfortunately, this restriction is in practice impossible to achieve. Let us notice that if n is the number of abstraction classes, X1, ..., XN are the initial sets and Y(Xi) is the set of 1-ant rules generated from the i-th initial set Xi, then for a given N and a growing number of abstraction classes n, the probability that the sets of 1-ant rules overlap grows, which increases the redundancy of the partial results. Let us refer back to the example of Scenario 4 above. The objects of the initial sets S2 and S3 belong to the same decision class. The 1-ant rules generated from S2 are {(a = 1 → d = 0), (a = 0 → d = 0), (b = 0 → d = 0), (c = 0 → d = 0), (c = 1 → d = 0)}. The same set of 1-ant rules will be generated from the initial set S3. Concluding, processors 2 and 3 will give us the same results, so we have 100% redundancy in the calculations and no gain from using an extra processor. To avoid such situations, in the sequel we propose modifying the algorithm for generating initial sets. Actually, only a minor modification is needed in the function Associate; we show it in Algorithm 4.2. First, let us note that if we use a particular attribute to split DT into initial sets by means of discernibility, in the worst case we have a guarantee that every two initial sets differ in at least one pair (attribute, value). Hence, the modification guarantees that every two initial sets Si, Sj which contain objects from the same decision class have sets of 1-ant rules that differ in at least one rule; therefore for any two processors we already have redundancy lower than 100% by construction (still, it may happen that some rules appear in the partial results of both processors).
Algorithm 4.2: associate(X, Y)

  procedure associate(X, Y)
    Y is {S1, ..., Sk}, where Si is a particular initial set, Y ⊂ S
    a = findAttribute(X, k)
    if |ACa(X)| ≥ k
      then distributeXintoY(ACa(X), Y, |Y|)
      else l = k − |ACa(X)| − 1
           split Y into two subsets Y1 and Y2 such that |Y1| = |ACa(X)| − 1 and |Y2| = l
           split ACa(X) = {C1, ..., Ch} into two subsets XX1 and XX2:
             XX1 = {C1, ..., Ch−1}, XX2 = Ch
           distributeXintoY(XX1, Y1, |Y1|)
           associate(XX2, Y2)

  procedure findAttribute(X, k)
    Xi ⊂ U, k is an integer
    find the first attribute a ∈ A for which |ACa(X)| ≥ k, or the one for which |ACa(X)| is maximal

Having defined the splitting process (Algorithms 4.1 and 4.2), we can summarize the whole process of the parallel computation of rules as follows (a sketch of this pipeline is given below):

1. compute the decision classes;
2. depending on the number of decision classes and the number of processors, either we compute initial sets according to SplitDT (N > n), or we assign the whole classes to the processors as the initial sets (N ≤ n);
3. every processor is given an initial set and a copy of DT;
4. each processor calculates the 1-ant rules based on its assigned initial set;
5. the n-ant rules are generated locally at each processor, according to the recursive apriori; every processor has its own tree structure for keeping candidate rules;
6. after the local calculations at the processors have been completed, the results go to the central processor, where we calculate the final result set by merging all the partial results, removing duplicate rules, and then removing extensions of the rules.
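The sketch below renders this pipeline with a standard process pool; the worker is only a stand-in for the recursive apriori of Section 3, so the names and the returned rule format are assumptions.

```python
from multiprocessing import Pool

def worker(args):
    initial_set, decision_table, min_support = args
    # stand-in: here the processor would generate the 1-ant rules from its
    # initial set and run the recursive RS-apriori (Algorithm 3.3) on its own
    # copy of the decision table
    return {("rule-from", frozenset(initial_set))}

def parallel_rules(initial_sets, decision_table, min_support, n_proc):
    jobs = [(s, decision_table, min_support) for s in initial_sets]
    with Pool(n_proc) as pool:
        partial_results = pool.map(worker, jobs)
    merged = set().union(*partial_results)   # remove duplicates across processors
    # the real merge step also removes rules that are extensions of other rules
    return merged

# if __name__ == "__main__":
#     print(parallel_rules([{1, 2, 7}, {3, 4, 5, 6}], {}, 1, 2))
```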
4.2 Optimal Usage of Processors
For the approach presented above, the processor usage can be depicted as in Fig. 2. It is hard to predict how much time it will take to compute all the rules from the constructed initial sets; usually it differs between the particular processors. Hence, we propose the following solution: having evaluated the initial sets, the central node initializes computations on all the processors.
Fig. 2. Processor usage (the diagram distinguishes, over time, the phases of the four processors and the central node: init sets, rules generation, summing rules, and inactive time)
At any time every node can ask the central node whether there are any processors that have finished their tasks and are free. If so, the processor refers to a free one and sends it a request for remote calculations, starting with the current node of its tree. This action can be repeated as long as a free processor can be assigned (or until all the local nodes have been assigned). Having initiated the remote computations, the processor goes back to the next node to be computed. The processor should remember all its co-workers, as they have to give back their results. Having completed its own computations, the processor summarizes its results together with the results from the co-workers and sends them back to the requestor of the given task. At the end, all results return to the starting node (which is the central one). This approach guarantees a massive use of all involved processors and a better utilization of the processing power. Additionally, we can utilize more processors than the number of initial sets. An example diagram of the cooperation of the processors is presented in Fig. 3. A problem is how to distribute information on free processors in an efficient way, so that the communication cost in the network does not influence the total cost of the calculations. In our experiments we used the following strategy:

1. the central processor has knowledge of all free processors (all the requests for a co-worker go through it);
2. every processor periodically receives a broadcast from the central node with information on how many processors are free (m) and how many are busy (k);
3. based on m and k the processor calculates the probability of successfully receiving a free processor according to the formula p = min(m/k, 1), and then with probability p it issues a request to the central node for a free processor;
4. when a processor completes a task, it sends a message to the central processor, so that the next broadcast of the central processor provides updated values of m and k.

A minimal sketch of this request strategy is given below.
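In the sketch, request_coworker is a placeholder for the actual network call to the central node.

```python
import random

def maybe_request_coworker(m_free, k_busy, request_coworker):
    """m_free and k_busy come from the central node's latest broadcast."""
    if k_busy <= 0:
        return False
    p = min(m_free / k_busy, 1.0)          # probability of getting a free node
    if random.random() < p:
        request_coworker()                 # ask the central node for a co-worker
        return True
    return False
```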
Fig. 3. Processor usage with co-working processors (as in Fig. 2, the diagram distinguishes init sets, rules generation, summing rules and inactive time, and additionally rule generation as a co-worker)
4.3 Results of Experiments
In the literature there are some measures for the evaluation of distributed algorithms. In our experiments we use two indicators:

– speedup;
– efficiency.

Following [8] we define speedup as Sp = T1/Tp and efficiency as Ep = Sp/p, where T1 is the execution time of the algorithm on one processor and Tp is the time needed by p processors. For our experiments we have used three base data sets: (a) 4000 records and 20 condition attributes (generated data); (b) 5500 records and 22 condition attributes (the mushrooms set without the records having missing values on the stalk-root attribute); (c) 8000 records and 21 condition attributes (the mushrooms set without the stalk-root attribute). For each of the databases we have prepared a number of data sets, namely 4 sets for set (a), 7 sets for (b), and 12 sets for set (c). Every data set was prepared by a random selection of objects from the base sets. For each series of data sets we have performed one experiment for the sequential algorithm, one for two processors and one for three processors. Below we present the results of the experiments. Table 2 contains the execution times of the sequential version of the algorithm for each of the 3 test series; column 3 shows the number of rules and column 4 the total execution time.
Table 2. Sequential computing

  elements number   condition attributes   rules   time [ms]
  the first set of data
      2500                 20                40      1570
      3000                 20                23      1832
      3500                 20                17      2086
      4000                 20                14      2349
  the second set of data
      2500                 22                26      1327
      3000                 22                18      1375
      3500                 22                23      1470
      4000                 22                44      2112
      4500                 22                82      2904
      5000                 22                91      2509
      5500                 22               116      3566
  the third set of data
      2500                 21                23      1188
      3000                 21                16      1247
      3500                 21                21      1355
      4000                 21                27      1572
      4500                 21                67      2210
      5000                 21                76      1832
      5500                 21                99      2307
      6000                 21                89      2956
      6500                 21               112      3001
      7000                 21               131      2965
      7500                 21               126      3010
      8000                 21               177      3542
Table 3. Parallel computing

                      two processors              three processors
  elements number   time   speedup  efficiency   time   speedup  efficiency
  the first set of data
      2500           805    1.95     0.97         778    2.02     0.67
      3000           929    1.97     0.98         908    2.02     0.767
      3500          1062    1.96     0.98         960    2.17     0.72
      4000          1198    1.96     0.98        1140    2.06     0.69
  the second set of data
      2500           701    1.89     0.94         631    2.10     0.70
      3000           721    1.90     0.95         663    2.07     0.69
      3500           778    1.88     0.94         739    1.98     0.66
      4000          1184    1.78     0.89        1081    1.95     0.65
      4500          1765    1.66     0.83        1460    2.01     0.67
      5000          1526    1.64     0.82        1287    1.94     0.65
      5500          2139    1.66     0.83        1766    2.01     0.67
  the third set of data
      2500           620    1.91     0.97         601    1.97     0.66
      3000           652    1.91     0.96         628    1.98     0.66
      3500           716    1.89     0.96         713    1.90     0.63
      4000           820    1.91     0.85         854    1.84     0.61
      4500          1296    1.70     0.85        1474    1.49     0.49
      5000          1090    1.68     0.84        1256    1.45     0.48
      5500          1377    1.67     0.84        1599    1.44     0.48
      6000          1795    1.64     0.82        2128    1.38     0.46
      6500          1816    1.65     0.83        1182    2.53     0.84
      7000          1799    1.64     0.82        1509    1.96     0.65
      7500          1816    1.65     0.82        1458    2.06     0.68
      8000          2000    1.77     0.88        1633    2.16     0.72
Table 3 contains the results for parallel computing of the algorithm. In the case of two processors the efficiency is near 100% because all the sets have 2 decision classes and the initial sets have completely different 1-ant rules (there was no redundancy). In the case of three processors the speedup coefficient is better than in the previous case, but worse than 3, hence there is a redundancy, though less than 100%.
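As a spot check of the two measures, the largest case of data set (b) gives the following values (the small differences with Table 3 come only from rounding):

```python
T1, T2, T3 = 3566, 2139, 1766          # times [ms] for 5500 records, set (b)
S2, S3 = T1 / T2, T1 / T3              # speedup:    about 1.67 and 2.02
E2, E3 = S2 / 2, S3 / 3                # efficiency: about 0.83 and 0.67
print(round(S2, 2), round(E2, 2), round(S3, 2), round(E3, 2))
```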
5 Conclusions and Future Work
We have presented in this paper a recursive version of apriori and have shown its suitability for distributed computation of decision rules. The performed experiments have shown that the algorithm substantially reduces the processing time, depending on the number of processors. The experiments have also shown that the relative speedup decreases with a growing number of processors; obviously, with a growing number of processors there is a growing overhead due to the redundancy of partial rules, as well as to the communication between the processors. Nevertheless, the proposed algorithm is scalable and can be used for splitting large tasks among a number of processors, giving rise to reasonable computation times for larger problems.
References

1. Kryszkiewicz, M.: Strong rules in large databases. In: Proceedings of the 6th European Congress on Intelligent Techniques and Soft Computing EUFIT 1998, vol. 1, pp. 85–89 (1998)
2. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (eds.) Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 26-28, pp. 207–216. ACM Press, New York (1993)
3. Dean, J., Grzymala-Busse, J.: An overview of the learning from examples module LEM1. Technical Report TR-88-2, Department of Computer Science, University of Kansas (1988)
4. Lingyun, T., Liping, A.: Incremental learning of decision rules based on rough set theory. In: Proceedings of the 4th World Congress on Intelligent Control and Automation, vol. 1, pp. 420–425 (2001)
5. Geng, Z., Zhu, Q.: A multi-agent method for parallel mining based on rough sets. In: Proceedings of the Sixth World Congress on Intelligent Control and Automation, vol. 2, pp. 826–850 (2006)
6. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.: Fast discovery of association rules. In: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI, Menlo Park (1996)
7. Strąkowski, T., Rybiński, H.: A distributed version of apriori rule generation based on rough set theory. In: Lindemann, G., Burkhard, H.-D., Skowron, L.C., Schlingloff, A., Suraj, H.Z. (eds.) Proceedings of the CS&P 2004 Workshop, vol. 2, pp. 390–397 (2004)
8. Karbowski, A. (ed.): Obliczenia równoległe i rozproszone (Parallel and Distributed Computations). Oficyna Wydawnicza Politechniki Warszawskiej (2001) (in Polish)
Decision Table Reduction in KDD: Fuzzy Rough Based Approach

Eric Tsang and Zhao Suyun

Department of Computing, The Hong Kong Polytechnic University, Hong Kong
[email protected],
[email protected]

Abstract. Decision table reduction in KDD refers to the problem of selecting those input feature values that are most predictive of a given outcome, by reducing a decision-table-like database in both the vertical and horizontal directions. Fuzzy rough sets have been proven to be a useful tool for attribute reduction (i.e., reducing a decision table in the vertical direction). However, relatively little research on decision table reduction using fuzzy rough sets has been performed. In this paper we focus on decision table reduction with fuzzy rough sets. First, we propose attribute-value reduction with fuzzy rough sets. The structure of the proposed value reduction is then investigated by the approach of the discernibility vector. Second, a rule covering system is described to reduce the value-reduced decision table in the horizontal direction. Finally, a numerical example illustrates the proposed method of decision table reduction. The main contribution of this paper is that the decision table reduction method is well combined with the knowledge representation of fuzzy rough sets through the fuzzy rough approximation value. Strict mathematical reasoning shows that the fuzzy rough approximation value is a reasonable criterion for keeping the information invariant in the process of decision table reduction.

Keywords: Fuzzy Rough Sets, Decision Table Reduction, Discernibility Vector.
1 Introduction
The decision-table-like database is one important type of knowledge representation system in Knowledge Discovery in Databases (KDD); it is represented by a two-dimensional table with rows labeled by objects and columns labeled by attributes (composed of condition and decision attributes). Most decision problems can be formulated employing a decision table formalism; therefore, this tool is particularly useful in decision making. At present, large-scale problems are becoming more common, which raises efficiency problems in many areas such as machine learning, data mining and pattern recognition. It is not surprising that reducing the decision table has received attention from many researchers. However, the existing work tends to reduce the decision table in the vertical direction, i.e. dimensionality reduction (transformation-based approaches [7], fuzzy rough
approach). However, a technique that can reduce the decision table in both the horizontal and vertical directions, using only the information contained within the dataset and needing no additional information (such as expert knowledge or data distribution), is clearly desirable. Rough set theory can be used as such a tool to reduce the decision table. However, one obvious limitation of traditional rough sets is that they only work effectively on symbolic problems. As a result, one type of extension of RS, fuzzy rough sets, has been proposed and studied, which is used to handle problems with real-valued attributes [10]. Owing to their wide application background, fuzzy rough sets have attracted more and more attention from both theoretical and practical fields [2,8-10,13-16,18,22,26,28-29,32]. Recently, some effort has been put into attribute reduction with fuzzy rough sets [12,14,24]. Shen et al. first proposed a method of attribute reduction based on fuzzy rough sets [14]. Their key idea of attribute reduction is to keep the dependency degree invariant. Unlike [14], the authors in [12] proposed an attribute reduction method based on fuzzy rough sets that adopts information entropy to measure the significance of the attributes. Their work performs well on some practical problems, but one obvious limitation is that their algorithm lacks a mathematical foundation and theoretical analysis; many interesting topics related to attribute reduction, e.g. the core and the structure of attribute reductions, are not discussed. In consideration of the above limitation, a unified framework of attribute reduction was then proposed in [24], which not only proposed a formal notion of attribute reduction based on fuzzy approximation operators, but also analyzed the mathematical structure of attribute reduction by employing the discernibility matrix. However, all these approaches do not mention another application of fuzzy rough sets, decision table reduction (which means reducing the decision table in both the vertical and horizontal dimensions). Till now, there has been a gap between decision table reduction and fuzzy rough sets. Decision table reduction, also called rule induction, from real-valued datasets using rough set techniques has been studied less [25,27,30]. For most datasets containing real-valued features, some rule induction methods performed a discretization step beforehand and then designed the rule induction algorithm using the rough set technique [25][30]. Unlike [25][30], by using the rough set technique the reference [27] designed a method of learning fuzzy rules from a database without discretization. This method performed well on some datasets, whereas its theoretical foundation [27] is rather weak. For example, only the lower and upper approximation operators are proposed, whereas their theoretical structure, such as topological and algebraic properties, is not studied [27]. Furthermore, the lower and upper approximations are not even used in the process of knowledge discovery; that is, knowledge representation and knowledge discovery are unrelated. All this shows that there exists a gap: fuzzy rough sets, as a well-defined generalization of rough sets, have been studied less for the design of decision table reduction methods. It is now necessary to propose a method of decision table reduction (a method which can induce a set of rules) in which the knowledge representation part and the knowledge reduction part are well combined.
In this paper, we propose a method of decision table reduction using fuzzy rough sets. First, we give some definitions of attribute-value reduction. The key idea of attribute-value reduction is to keep the critical value, i.e. the fuzzy lower approximation value, invariant before and after reduction. Second, the structure of attribute-value reduction is studied completely by using the discernibility vector approach; with this vector approach all the attribute-value reductions of each of the initial objects can be computed. Thus a solid mathematical foundation is set up for decision table reduction with fuzzy rough sets. After the description of the rule covering system, a reduced fuzzy decision table is obtained which keeps the information contained in the original decision table invariant. This reduced decision table corresponds to a set of rules which covers all the objects in the original dataset. Finally, a numerical example is presented to illustrate the proposed method. The rest of this paper is structured as follows. In Section 2 we review basic concepts of fuzzy rough sets. In Section 3 we propose the concept of attribute-value reduction with fuzzy rough sets and study its structure using the discernibility vector. Also, the rule covering system is described, which is helpful for inducing a set of rules from a fuzzy decision table. In Section 4 an illustrative example is given to show the feasibility of the proposed method. The last section concludes the paper.
2 Review of Fuzzy Rough Sets
In this section we only review some basic concepts of fuzzy rough sets found in [16]; a detailed review of existing fuzzy rough set models can be found in [32][6]. Given a triangular norm T, the binary operation on I defined by ϑT(α, γ) = sup{θ ∈ I : T(α, θ) ≤ γ}, α, γ ∈ I, is called the R-implicator based on T. If T is lower semi-continuous, then ϑT is called the residuation implication of T, or the T-residuated implication. The properties of the T-residuated implication are listed in [16]. Suppose U is a nonempty universe. A T-fuzzy similarity relation R is a fuzzy relation on U which is reflexive (R(x, x) = 1), symmetric (R(x, y) = R(y, x)) and T-transitive (R(x, y) ≥ T(R(x, z), R(y, z)) for every x, y, z ∈ U). If ϑ is the T-residuated implication of a lower semi-continuous T-norm T, then the lower and upper approximation operators are defined, for every A ∈ F(U), as

  Rϑ A(x) = inf_{u∈U} ϑ(R(u, x), A(u));   RT A(x) = sup_{u∈U} T(R(u, x), A(u)).

In [16][32] these two operators were studied in detail from constructive and axiomatic approaches; here we only list their properties.

Theorem 2.1 [4][16][32]. Suppose R is a fuzzy T-similarity relation. The following statements hold:
1) Rϑ(Rϑ A) = Rϑ A, RT(RT A) = RT A;
2) both Rϑ and RT are monotone;
3) RT(Rϑ A) = Rϑ A, Rϑ(RT A) = RT A;
4) RT(A) = A, Rϑ(A) = A;
5) Rϑ(A) = ∪{RT xλ : RT xλ ⊆ A}, RT(A) = ∪{RT xλ : xλ ⊆ A}, where xλ is the fuzzy singleton defined by xλ(y) = λ if y = x and xλ(y) = 0 if y ≠ x.
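As a small numeric illustration of these operators, consider the sketch below; the toy similarity relation and fuzzy set are invented for the example, and the Lukasiewicz connectives anticipate the choice made in Section 4.

```python
def t_luk(a, b):                 # Lukasiewicz t-norm: max(0, a + b - 1)
    return max(0.0, a + b - 1.0)

def imp_luk(a, b):               # its residual implicator: min(1 - a + b, 1)
    return min(1.0 - a + b, 1.0)

def lower(R, A, x, U):           # R_theta A(x) = inf_u theta(R(u, x), A(u))
    return min(imp_luk(R[u][x], A[u]) for u in U)

def upper(R, A, x, U):           # R_T A(x) = sup_u T(R(u, x), A(u))
    return max(t_luk(R[u][x], A[u]) for u in U)

U = [0, 1, 2]
R = [[1.0, 0.7, 0.2],            # a reflexive, symmetric, T_L-transitive toy relation
     [0.7, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
A = {0: 1.0, 1: 0.8, 2: 0.1}     # a fuzzy set on U
print(lower(R, A, 0, U), upper(R, A, 0, U))   # approximately 0.9 and 1.0
```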
3 Decision Table Reduction with Fuzzy Rough Sets
In this section we design a method of decision table reduction with fuzzy rough sets. We first formulate a fuzzy decision table. Then we propose the concept of attribute-value reduction (corresponding to one reduction rule) and design a method (i.e. the discernibility vector) to compute the attribute-value reductions. Strict mathematical reasoning shows that by using the discernibility vector approach we can find all the attribute-value reductions of each original object in a fuzzy decision table. Finally, we propose a rule covering system, in which each decision rule can cover several objects. As a result, a reduced decision table is obtained without losing the information contained in the original decision table, and this reduced decision table is equivalent to a set of rules. A fuzzy decision table, denoted by FDT = (U, C, D), consists of three parts: a finite universe U, a family of condition fuzzy attributes C and a family of symbolic decision attributes D. For every fuzzy attribute (i.e. an attribute with real values), a fuzzy T-similarity relation can be employed to measure the degree of similarity between every pair of objects [10]. In the remainder of this paper we use RC to represent the fuzzy similarity relation defined by the condition attribute set C. One symbolic decision attribute corresponds to one equivalence relation, which generates a partition on U, denoted by U/P = {[x]P | x ∈ U}, where [x]P is the equivalence class containing x ∈ U. Given a fuzzy decision table FDT, with every x ∈ U we associate fdx to represent the corresponding decision rule of x ∈ U. The restriction of fdx to C, denoted by fdx|C = {a(x)/a | a ∈ C}, and the restriction of fdx to D, denoted by fdx|D = {d(x)/d | d ∈ D}, will be called the condition and the decision of fdx respectively. This decision rule can be denoted by fdx|C → fdx|D.
3.1 Attribute-Value Reduction
The concept of attribute-value reduction is a preliminary for rule induction with fuzzy rough sets. The key idea of attribute-value reduction in this paper is to keep the information invariant before and after the reduction. It is therefore necessary to find the critical value that keeps the information of each object invariant. Let us consider the following theorem.

Theorem 3.1: Given an object x and its corresponding decision rule fdx|C → fdx|D in a fuzzy decision table FDT = (U, C, D), for an object y ∈ U, if ϑ(RC(x, y), 0) < inf_{u∈U} ϑ(RC(x, u), [x]D(u)), then [x]D(y) = 1.

Proof: We prove it by contradiction. Assume [x]D(y) = 0; then
inf_{u∈U} ϑ(RC(x, u), [x]D(u)) ≤ ϑ(RC(x, y), [x]D(y)) = ϑ(RC(x, y), 0). This contradicts the given condition, so we get [x]D(y) = 1.

Theorem 3.1 shows that if ϑ(RC(x, y), 0) > inf_{u∈U} ϑ(RC(x, u), [x]D(u)), then [x]D(y) = 0 may happen. That is to say, inf_{u∈U} ϑ(RC(x, u), [x]D(u)) is the maximum value guaranteeing that two objects are consistent (have the identical decision class). When ϑ(RC(x, y), 0) < inf_{u∈U} ϑ(RC(x, u), [x]D(u)), the objects x and y are always consistent; otherwise they may be inconsistent. In view of this theorem, we define the consistence degree of an object in FDT.

Definition 3.1 (consistence degree): Given an arbitrary object x in FDT = (U, C, D), let ConC(D)(x) = inf_{u∈U} ϑ(RC(x, u), [x]D(u)); then ConC(D)(x) is called the consistence degree of x in FDT. Since fdx|C → fdx|D is the corresponding decision rule of x, ConC(D)(x) is also called the consistence degree of fdx|C → fdx|D.

It is important to remove the superfluous attribute values in FDT. Similarly to the key idea of attribute reduction, the key idea of reducing attribute values is to keep the information invariant, that is, to keep the consistence degree of each fuzzy decision rule invariant.

Definition 3.2 (attribute-value reduction, i.e. reduction rule): Given an arbitrary object x in FDT = (U, C, D), if the subset B(x) ⊆ C(x) satisfies the following two formulae
(D1) ConC(D)(x) = ConB(D)(x);
(D2) ∀b ∈ B, ConC(D)(x) > Con{B−b}(D)(x),
then the attribute-value subset B(x) is an attribute-value reduction of x. We also say that the fuzzy rule fdx|B → fdx|D is a reduction rule of fdx|C → fdx|D. The attribute value a(x) ∈ B(x) ⊆ C(x) is dispensable in the attribute-value set B(x) if ConC(D)(x) = Con{B−a}(D)(x); otherwise it is indispensable.

Definition 3.3 (attribute-value core): Given an object x in FDT = (U, C, D), the collection of the indispensable attribute values is the value core of x, denoted by Core(x).

Theorem 3.2: For A ⊆ U in FDT = (U, C, D), let λ = Rϑ A(x); then RT xλ ⊆ A for x ∈ U. Here R is the T-similarity relation corresponding to the attribute set C.

Proof: By the granular representation Rϑ A = ∪{RT xλ : (RT xλ)α ⊆ A}, we get that β = (∪{RT xλ : (RT xλ)α ⊆ A})(z). For any x ∈ U there exist t ∈ (0, 1] and y ∈ U satisfying β = RT yt(z) and (RT yt)α ⊆ A. Then T(R(y, z), t) = β and T(R(y, x), t) ≤ A(x) for any x ∈ U hold. Thus the statement ∀x ∈ U, RT zβ(x) = T(R(z, x), β) = T(R(z, x), T(R(y, z), t)) ≤ T(R(y, x), t) holds. If T(R(x, z), t) > α, then T(R(y, z), t) > α. Thus we have T(R(x, z), β) ≤ A(x). Hence RT zβ ⊆ A.

Theorem 3.3: Given an object x in FDT = (U, C, D), the following two formulae are equivalent.
(T1): B(x) contains an attribute-value reduction of x.
(T2): B(x) ⊆ C(x) satisfies T(RB(x, y), λ) = 0 for every y ∉ [x]D, where λ = inf_{u∈U} ϑ(RC(x, u), [x]D(u)).

Proof: (T1)⇒(T2): Assume that B(x) contains an attribute-value reduction of x. By Definition 3.2, we have ConC(D)(x) = ConB(D)(x). Letting λ = inf_{u∈U} ϑ(RC(x, u), [x]D(u)), we have λ = inf_{u∈U} ϑ(RB(x, u), [x]D(u)) by the definition of the consistence degree. By Theorem 3.2, we have (RB)T xλ ⊆ [x]D. Thus we have T(RB(x, y), λ) = 0 for every y ∉ [x]D.
(T2)⇒(T1): Clearly, λ = inf_{u∈U} ϑ(RC(x, u), [x]D(u)) ≥ inf_{u∈U} ϑ(RB(x, u), [x]D(u)). We have (RB)T xλ ⊆ [x]D by the condition T(RB(x, y), λ) = 0 for every y ∉ [x]D. By Theorem 2.1, we get that (RB)T xλ ⊆ (RB)ϑ [x]D. Then (RB)T xλ(x) ≤ (RB)ϑ [x]D(x) ⇒ λ ≤ (RB)ϑ [x]D(x) ⇒ λ = (RC)ϑ [x]D(x) ≤ (RB)ϑ [x]D(x). Thus λ = (RC)ϑ [x]D(x) = (RB)ϑ [x]D(x) holds. By the definition of the consistence degree, we have ConC(D)(x) = ConB(D)(x). By the definition of attribute-value reduction, we conclude that B(x) contains a condition attribute-value reduction of x.

Using Theorem 3.3, we construct the discernibility vector as follows. Suppose U = {x1, x2, x3, ..., xn}. By Vector(U, C, D, xi) we denote an n × 1 vector (cj), called the discernibility vector of xi ∈ U, such that
(V1) cj = {a : T(a(xi, xj), λ) = 0}, where λ = inf_{u∈U} ϑ(RC(xi, u), [xi]D(u)), for D(xi, xj) = 0;
(V2) cj = ∅, for D(xi, xj) = 1.

A discernibility function fx(FDT) for x in FDT is a Boolean function of m Boolean variables a1, ..., am corresponding to the attributes a1, ..., am, respectively, defined as follows: fx(FDT)(a1, ..., am) = ∧{∨(cj) : 1 ≤ j ≤ n}, where ∨(cj) is the disjunction of all variables a such that a ∈ cj. Let gx(FDT) be the reduced disjunctive form of fx(FDT), obtained from fx(FDT) by applying the multiplication and absorption laws as many times as possible. Then there exist l and Reductk(x) ⊆ C(x) for k = 1, ..., l such that gx(FDT) = (∧Reduct1(x)) ∨ ... ∨ (∧Reductl(x)), where every element in Reductk(x) appears only once.

Theorem 3.4: RedD(C)(x) = {Reduct1(x), ..., Reductl(x)}, where RedD(C)(x) is the collection of all attribute-value reductions of x. The proof is omitted since this theorem is similar to the one in [24].
3.2 Rule Covering System
Since each attribute-value reduction corresponds to a reduction rule, it is necessary to discuss the relation between decision rules and objects. In the following we propose a concept named rule covering.

Definition 3.4 (rule covering): Given a fuzzy decision rule fdx|C → fdx|D and an object y in FDT, the fuzzy decision rule fdx|C → fdx|D is said to cover the object y if ϑ(RC(x, y), 0) < ConC(D)(x) and [x]D(y) = 1.
One may note that when the fuzzy decision rule fdx|C → fdx|D covers the object y, it may happen that the fuzzy decision rule fdy|C → fdy|D does not cover the object x, since ϑ(RC(x, y), 0) < ConC(D)(x) does not imply that ϑ(RC(x, y), 0) < ConC(D)(y) holds.

Corollary 3.1: Given a fuzzy decision rule fdx|C → fdx|D and an object y in FDT, if ϑ(RC(x, y), 0) < ConC(D)(x), then the fuzzy decision rule fdx|C → fdx|D covers the object y. This is straightforward from Theorem 3.1 and Definition 3.4.

We now describe several theorems about the covering power of reduction rules.

Theorem 3.5: Given a fuzzy decision rule fdx|C → fdx|D and an object y in FDT, if the fuzzy decision rule fdx|C → fdx|D covers the object y, then the reduction rule of fdx|C → fdx|D also covers the object y.

Proof: Assume fdx|B → fdx|D is the reduction rule of fdx|C → fdx|D. According to the definition of a reduction rule, we get that ConB(D)(x) = ConC(D)(x). According to the definition of rule covering, we get that ϑ(RC(x, y), 0) < ConC(D)(x) and fdx|D = fdy|D. By the monotonicity of the residual implicator, we get that ϑ(RB(x, y), 0) ≤ ϑ(RC(x, y), 0) < ConC(D)(x), i.e. ϑ(RB(x, y), 0) < ConB(D)(x). By the definition of rule covering, we get that the reduction rule of fdx|C → fdx|D covers the object y.

Theorem 3.5 shows that the covering power of decision rules in the fuzzy decision table does not change after value reduction. That is to say, the reduction rule keeps the information contained in the original decision rule invariant.

Theorem 3.6: Suppose fdx|B → fdx|D is the reduction rule of fdx|C → fdx|D in FDT. For an object y ∈ U, if ϑ(RB(x, y), 0) < ConB(D)(x), then [x]D(y) = 1.

Proof: This is straightforward from Theorem 3.1.
By using the method of attribute-value reduction together with the rule covering system, a set of rules can be found which covers all the objects in a fuzzy decision table. This result can be seen as a decision table reduced in both the vertical and horizontal directions.
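The paper specifies when a rule covers an object, but not how a covering subset of rules is selected; one straightforward (assumed) choice is a greedy set cover, sketched below.

```python
def greedy_rule_cover(objects, covered_by):
    """covered_by[r]: set of objects covered by reduction rule r (Definition 3.4)."""
    uncovered, chosen = set(objects), []
    while uncovered and covered_by:
        best = max(covered_by, key=lambda r: len(covered_by[r] & uncovered))
        if not covered_by[best] & uncovered:
            break                          # no rule covers the remaining objects
        chosen.append(best)
        uncovered -= covered_by[best]
    return chosen
```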
4 An Illustrative Example
As pointed out in [19], in crisp rough sets the computation of reductions by the discernibility matrix is an NP-complete problem. Similarly, the computation of reduction rules using the discernibility vector approach is also NP-complete. In this paper we do not discuss the efficient computation of reduction rules with fuzzy rough sets; this will be our future work. In the following we employ an example to illustrate the idea of this paper. Our method is specified on the Lukasiewicz T-norm to design the example, since many discussions on the selection of the triangular norm T emphasize the well-known Lukasiewicz triangular norm as a suitable selection
184
E. Tsang and S. Zhao
[1][10][23]. After specification of the triangular T-norm, the discernibility vector based on the Lukasiewicz T-norm is specified as follows. Suppose U = {x1, ..., xn}. By Vector(U, C, D, xi) we denote a 1 × n vector (cj), called the discernibility vector of the fuzzy decision rule of xi, such that
(V3) cj = {a : (a(xi, xj) + λ) ≤ 1 + α}, for D(xi, xj) = 0, where λ = inf_{u∈U} ϑ(RC(xi, u), [xi]D(u));
(V4) cj = ∅, for D(xi, xj) = 1.

Example 4.1: Given the simple decision table with 10 objects shown in Table 1, there are 12 condition fuzzy attributes R = {a, b, c, d, e, f, g, h, i, j, k, l} and one symbolic decision attribute {D}. There are two decision classes: 0 and 1. The objects {1, 2, 6, 8} belong to class 0 and {3, 4, 5, 7, 9, 10} belong to class 1.

Table 1. One simple fuzzy decision table
  Objects   a    b    c    d    e    f    g    h    i    j    k    l    D
   x1      0.9  0.1  0    0.9  0.1  0    0.8  0.2  0    0.7  0.4  0    0
   x2      0.9  0.1  0.1  0.8  0.2  0.1  0.9  0.2  0    0.1  0.8  0    0
   x3      0.1  0.9  0.2  0.9  0.1  0.1  0.9  0.1  0    0.9  0.1  0    1
   x4      0    0.1  0.9  0.1  0.9  0    0.6  0.5  0    0.8  0.3  0    1
   x5      0.1  0    0.9  0    0.1  0.9  0    1    0    0.8  0.2  0    1
   x6      0.1  0.1  0.9  0    0.2  0.9  0.1  0.9  0    0.1  0.9  0    0
   x7      0    1    0    0    0.1  0.9  0.1  0.9  0    0.2  0.9  0    1
   x8      0.9  0.1  0    0.3  0.9  0.1  0.9  0.1  0    1    0    0    0
   x9      0.8  0.2  0    0    0.4  0.6  0    1    0    1    0    0    1
   x10     0    0.1  0.9  0    1    0    0    1    0    0.9  0.1  0    1
In this example, the Lukasiewicz T-norm TL(x, y) = max(0, x + y − 1) is selected as the T-norm to construct the S-lower approximation operator. Since the dual conorm of the Lukasiewicz T-norm is SL(x, y) = min(1, x + y), the lower approximation is specified as RS A(x) = inf_{u∈U} min(1 − R(x, u) + A(u), 1). The lower approximations of each decision class are listed in Table 2. The consistence degree of each object, computed as ConR(D)(x) = RS [x]D(x), is listed in Table 3. The consistence degree is the critical value used to reduce the redundant attribute values. Based on the strict mathematical reasoning above, the discernibility vector is designed to compute all the attribute-value reductions of each of the objects in the fuzzy decision table.
Table 2. The lower approximation of each decision class

  Objects                           X1   X2   X3   X4   X5   X6   X7   X8   X9   X10
  Lower approximation of class 0   0.8  0.8  0    0    0    0.7  0    0.8  0    0
  Lower approximation of class 1   0    0    0.8  0.9  0.7  0    0.9  0    0.9  0.9
Table 3. The consistence degree of each object

  Objects             X1   X2   X3   X4   X5   X6   X7   X8   X9   X10
  Consistence degree  0.8  0.8  0.8  0.9  0.7  0.7  0.9  0.8  0.9  0.9
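The values in Tables 2 and 3 can be recomputed from Table 1. The paper does not state the similarity relation used in the example, so the sketch below assumes the common choice R(x, y) = min over the attributes of 1 − |a(x) − a(y)|, which is TL-transitive and reproduces the tables.

```python
rows = """0.9 0.1 0 0.9 0.1 0 0.8 0.2 0 0.7 0.4 0 0
0.9 0.1 0.1 0.8 0.2 0.1 0.9 0.2 0 0.1 0.8 0 0
0.1 0.9 0.2 0.9 0.1 0.1 0.9 0.1 0 0.9 0.1 0 1
0 0.1 0.9 0.1 0.9 0 0.6 0.5 0 0.8 0.3 0 1
0.1 0 0.9 0 0.1 0.9 0 1 0 0.8 0.2 0 1
0.1 0.1 0.9 0 0.2 0.9 0.1 0.9 0 0.1 0.9 0 0
0 1 0 0 0.1 0.9 0.1 0.9 0 0.2 0.9 0 1
0.9 0.1 0 0.3 0.9 0.1 0.9 0.1 0 1 0 0 0
0.8 0.2 0 0 0.4 0.6 0 1 0 1 0 0 1
0 0.1 0.9 0 1 0 0 1 0 0.9 0.1 0 1"""
data = [list(map(float, line.split())) for line in rows.splitlines()]
cond = [r[:12] for r in data]              # condition attributes a..l (Table 1)
dec = [int(r[12]) for r in data]           # decision attribute D
n = len(data)

def R(i, j):                               # assumed fuzzy similarity relation
    return min(1 - abs(p - q) for p, q in zip(cond[i], cond[j]))

def lower(i, cls):                         # R_S [cls](x_i), Lukasiewicz implicator
    return min(min(1 - R(i, u) + (1.0 if dec[u] == cls else 0.0), 1.0)
               for u in range(n))

print([round(lower(i, 0), 2) for i in range(n)])       # class-0 row of Table 2
print([round(lower(i, 1), 2) for i in range(n)])       # class-1 row of Table 2
print([round(lower(i, dec[i]), 2) for i in range(n)])  # consistence degrees (Table 3)
```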
The specified formulae to compute the discernibility vector are: cj = {a : (a(xi, xj) + λ) ≤ 1 + α}, for D(xi, xj) = 0, where λ = ConR(D)(xi); and cj = ∅, for D(xi, xj) = 1. The discernibility vectors of the objects are listed in Table 4. Each column in Table 4 corresponds to one discernibility vector; as a result, all the discernibility vectors together compose a matrix with 10 rows and 10 columns.

Table 4. The discernibility vectors
  Objects   X1          X2           X3           X4     X5           X6
   x1       ∅           ∅            {ab}         {ac}   {acdfgh}     ∅
   x2       ∅           ∅            {abj}        {a}    {acdfghj}    ∅
   x3       {ab}        {abj}        ∅            ∅      ∅            {bcdfghjk}
   x4       {acde}      {ac}         ∅            ∅      ∅            {efj}
   x5       {acdfgh}    {acdfgh}     ∅            ∅      ∅            {jk}
   x6       ∅           ∅            {bdfghjk}    {f}    {jk}         ∅
   x7       {abdf}      {abdfg}      ∅            ∅      ∅            {bc}
   x8       ∅           ∅            {abe}        {ac}   {acefgh}     ∅
   x9       {dgh}       {dghjk}      ∅            ∅      ∅            {acjk}
   x10      {acdegh}    {acdeghj}    ∅            ∅      ∅            {efjk}

  Objects   X7        X8            X9       X10
   x1       {abdf}    ∅             {d}      {acde}
   x2       {ab}      ∅             {gj}     {ag}
   x3       ∅         {abe}         ∅        ∅
   x4       ∅         {ac}          ∅        ∅
   x5       ∅         {acefgh}      ∅        ∅
   x6       {bc}      ∅             {cjk}    {f}
   x7       ∅         {abefghjk}    ∅        ∅
   x8       {abk}     ∅             {gh}     {acgh}
   x9       ∅         {gh}          ∅        ∅
   x10      ∅         {acgh}        ∅        ∅
The attribute-value core is the collection of the most important attribute values of the corresponding object; it corresponds to the union of the entries with a single element in the discernibility vector. For example, the attribute-value core of object x4 is the collection of the values of the attributes a and f on the object x4, i.e. {(a, 0), (f, 0.1)}. The attribute-value core of each of the objects is listed in Table 5.

Table 5. The attribute-value core of every object

  Objects  X1  X2  X3  X4    X5  X6  X7  X8  X9   X10
  Core     ∅   ∅   ∅   {af}  ∅   ∅   ∅   ∅   {d}  {f}
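A pointwise check of one Table 4 entry, under the same assumed per-attribute similarity 1 − |difference| and with the α in the formula above taken as 0: for the pair (x1, x3) and λ = 0.8 (from Table 3), exactly the attributes a and b satisfy the condition, matching the entry {ab}.

```python
vals_x1 = [0.9, 0.1, 0, 0.9, 0.1, 0, 0.8, 0.2, 0, 0.7, 0.4, 0]      # row x1 of Table 1
vals_x3 = [0.1, 0.9, 0.2, 0.9, 0.1, 0.1, 0.9, 0.1, 0, 0.9, 0.1, 0]  # row x3 of Table 1
x1 = dict(zip("abcdefghijkl", vals_x1))
x3 = dict(zip("abcdefghijkl", vals_x3))
lam = 0.8                                  # consistence degree of x1 (Table 3)
entry = {a for a in x1 if (1 - abs(x1[a] - x3[a])) + lam <= 1}
print(sorted(entry))                       # ['a', 'b']
```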
One attribute-value reduction of each object has been calculated; all the value reductions are listed in Table 6. The notation ∗ indicates that the attribute value in the corresponding position has been reduced. Table 6 shows that the fuzzy decision table has been significantly reduced after attribute-value reduction. According to the analysis and discussion in Section 3, each attribute-value reduction corresponds to one reduction rule. By using the rule covering system, we can find a set of rules which covers all the objects in the fuzzy decision table. One such set of rules is listed in Table 7. This set of rules is the result of decision table reduction. The rule set can be seen as a rule-based classifier learned from Table 1 by using fuzzy rough sets. Each row in Table 7 corresponds to one if-then production rule. For example, the first row corresponds to the following rule:
Table 6. Attribute-value reduction of every object

  Objects   a    b    c    d    e  f  g    h  i  j    k  l  D
   x1      0.9   ∗    ∗   0.9   ∗  ∗  ∗    ∗  ∗  ∗    ∗  ∗  0
   x2      0.9   ∗    ∗   0.8   ∗  ∗  ∗    ∗  ∗  ∗    ∗  ∗  0
   x3      0.1  0.9   ∗    ∗    ∗  ∗  ∗    ∗  ∗  ∗    ∗  ∗  1
   x4      0     ∗   0.9   ∗    ∗  0  ∗    ∗  ∗  ∗    ∗  ∗  1
   x5      0.1   ∗    ∗    ∗    ∗  ∗  ∗    ∗  ∗  0.8  ∗  ∗  1
   x6      ∗     ∗   0.9   ∗    ∗  ∗  ∗    ∗  ∗  0.1  ∗  ∗  0
   x7      0     1    ∗    ∗    ∗  ∗  ∗    ∗  ∗  ∗    ∗  ∗  1
   x8      0.9   ∗    ∗    ∗    ∗  ∗  0.9  ∗  ∗  ∗    ∗  ∗  0
   x9      ∗     ∗    0    0    ∗  ∗  0    ∗  ∗  ∗    ∗  ∗  1
   x10     0     ∗    ∗    ∗    ∗  0  ∗    ∗  ∗  ∗    ∗  ∗  1
Table 7. A reduced decision table (which can also be seen as a set of decision rules)

  Objects   a    b    c    d    e  f  g  h  i  j    k  l  D  Consistence degree
   x5      0.1   ∗    ∗    ∗    ∗  ∗  ∗  ∗  ∗  0.8  ∗  ∗  1  0.7
   x6      ∗     ∗   0.9   ∗    ∗  ∗  ∗  ∗  ∗  0.1  ∗  ∗  0  0.7
   x1      0.9   ∗    ∗   0.9   ∗  ∗  ∗  ∗  ∗  ∗    ∗  ∗  0  0.8
   x3      0.1  0.9   ∗    ∗    ∗  ∗  ∗  ∗  ∗  ∗    ∗  ∗  1  0.8
If N(P1(object_y, object_x)) < 0.7, with x given by (a, 0.1), (j, 0.8) and P1 = {a, j}, then the object y belongs to decision class 1. This example shows that the proposed method of rule induction is feasible. It is promising that the idea of building a classifier presented in this paper can be used to handle real classification problems.
5 Conclusion
In this paper a method of decision table reduction with fuzzy rough sets is proposed. The method has two components: attribute-value reduction and a rule covering system. The key idea of attribute-value reduction is to keep the information invariant before and after the reduction. Based on this idea, a discernibility-vector approach is proposed, by which all attribute-value reductions of each object can be found. After designing the rule covering system, a set of rules can be induced from a fuzzy decision table. This set of rules is equivalent to a decision table reduced in both the horizontal and vertical directions. Finally, an illustrative example shows that the proposed method is feasible. For real applications, our future work is to extend the proposed method into a robust framework.
Acknowledgements

This research has been supported by the Hong Kong RGC CERG research grants PolyU 5273/05E (B-Q943) and PolyU 5281/07E (B-Q06C).
Author Index
Blaszczyński, Jerzy   40
Cyran, Krzysztof A.   53
Gomolińska, Anna   66
Grzymala-Busse, Jerzy W.   1, 14
Moshkov, Mikhail   92
Pal, Sankar K.   106
Rybiński, Henryk   161
Rząsa, Wojciech   14
Sikora, Marek   130
Skowron, Andrzej   92
Slowiński, Roman   40
Stefanowski, Jerzy   40
Strąkowski, Tomasz   161
Suraj, Zbigniew   92
Tsang, Eric   177
Zhao, Suyun   177