Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1704
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Jan M. Żytkow, Jan Rauch (Eds.)
Principles of Data Mining and Knowledge Discovery Third European Conference, PKDD’99 Prague, Czech Republic, September 15-18, 1999 Proceedings
Series Editors Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA Jörg Siekmann, University of Saarland, Saarbrücken, Germany Volume Editors Jan M. Żytkow University of North Carolina, Department of Computer Science Charlotte, NC 28223, USA E-mail:
[email protected] Jan Rauch University of Economics, Faculty of Informatics and Statistics Laboratory of Intelligent Systems W. Churchill Sq. 4, 13067 Prague, Czech Republic E-mail:
[email protected] Cataloging-in-Publication data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Principles of data mining and knowledge discovery : third European conference ; proceedings / PKDD ’99, Prague, Czech Republic, September 15 - 18, 1999. Jan M. Zytkow ; Jan Rauch (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1999 (Lecture notes in computer science ; Vol. 1704 : Lecture notes in artificial intelligence) ISBN 3-540-66490-4
CR Subject Classification (1998): I.2, H.3, H.5, G.3, J.1 ISBN 3-540-66490-4 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. c Springer-Verlag Berlin Heidelberg 1999 Printed in Germany Typesetting: Camera-ready by author SPIN 10704907 06/3142 – 5 4 3 2 1 0
Printed on acid-free paper
Preface
This volume contains papers selected for presentation at PKDD’99, the Third European Conference on Principles and Practice of Knowledge Discovery in Databases. The first meeting was held in Trondheim, Norway, in June 1997, the second in Nantes, France, in September 1998. PKDD’99 was organized in Prague, Czech Republic, on September 15-18, 1999. The conference was hosted by the Laboratory of Intelligent Systems at the University of Economics, Prague. We wish to express our thanks to the sponsors of the Conference, the Komerˇcn´ı banka, a.s. and the University of Economics, Prague, for their generous support. Knowledge discovery in databases (KDD), also known as data mining, provides tools for turning large databases into knowledge that can be used in practice. KDD has been able to grow very rapidly since its emergence a decade ago, by drawing its techniques and data mining experiences from a combination of many existing research areas: databases, statistics, mathematical logic, machine learning, automated scientific discovery, inductive logic programming, artificial intelligence, visualization, decision science, and high performance computing. The strength of KDD came initially from the value added to the creative combination of techniques from the contributing areas. In order to establish its identity, KDD has to create its own theoretical principles and to demonstrate how they stimulate KDD research, facilitate communication and guide practitioners towards successful applications. Seeking the principles that can guide and strengthen practical applications has been always a part of the European research tradition. Thus “Principles and Practice of KDD” (PKDD) make a suitable focus for annual meetings of the KDD community in Europe. The main long-term interest is in theoretical principles for the emerging discipline of KDD and in practical applications that demonstrate utility of those principles. Other goals of the PKDD series are to provide a European-based forum for interaction among all theoreticians and practitioners interested in data mining and knowledge discovery as well as to foster interdisciplinary collaboration. A Discovery Challenge hosted at PKDD’99 is a new initiative promoting cooperative research on new real-world databases, supporting a broad and unified view of knowledge and methods of discovery, and emphasizing business problems that require an open-minded search for knowledge in data. Two multi-relational databases, in banking and in medicine, were widely available. The Challenge was born out of the conviction that knowledge discovery in real-world databases requires an open-minded discovery process rather than application of one or another tool limited to one form of knowledge. A discoverer should consider a broad scope of techniques that can reach many forms of knowledge. The discovery process cannot be rigid and selection of techniques must be driven by knowledge hidden in the data, so that the most and the best of knowledge can be reached.
The contributed papers were selected from 106 full papers (45% growth over PKDD’98) by the following program committee: Pieter Adriaans (Syllogic, Netherlands), Petr Berka (U. Economics, Czech Rep.), Pavel Brazdil (U. Porto, Portugal), Henri Briand (U. Nantes, France), Leo Carbonara (British Telecom, UK), David L. Dowe (Monash U., Australia), A. Fazel Famili (IIT-NRC, Canada), Ronen Feldman (Bar Ilan U., Israel), Alex Freitas (PUC-PR, Brazil), Patrick Gallinari (U. Paris 6, France), Jean Gabriel Ganascia (U. Paris 6, France), Attilio Giordana (U. Torino, Italy), Petr H´ ajek (Acad. Science, Czech Rep.), Howard Hamilton (U. Regina, Canada), David Hand (Open U., UK), Bob Henery (U. Strathclyde, UK), Mikhail Kiselev (Megaputer Intelligence, Russia), Willi Kloesgen (GMD, Germany), Yves Kodratoff (U. Paris 11, France), Jan Komorowski (Norwegian U. Sci. & Tech.), Jacek Koronacki (Acad. Science, Poland), Nada Lavrac (Josef Stefan Inst., Slovenia), Heikki Manilla (Microsoft Research, Finland), Gholamreza Nakhaeizadeh (DaimlerChrysler, Germany), Gregory Piatetsky-Shapiro (Knowledge Stream, Boston, USA), Jaroslav Pokorn´ y (Charles U., Czech Rep.), Lech Polkowski (U. Warsaw, Poland), Mohamed Quafafou (U. Nantes, France), Jan Rauch (U. Economics, Czech Rep.), Zbigniew Ras (UNC Charlotte, USA), Wei-Min Shen (USC, USA), Arno Siebes (CWI, Netherlands), Andrzej Skowron (U. Warsaw, Poland), Derek Sleeman (U. Aberdeen, UK), Nicolas Spyratos (U. ˇ ep´ Paris 11, France), Olga Stˇ ankov´ a (Czech Tech. U.), Shusaku Tsumoto (Tokyo U., Japan), Raul Valdes-Perez (CMU, USA), Rudiger Wirth (DaimlerChrysler, Germany), Stefan Wrobel (GMD, Germany), Ning Zhong (Yamaguchi U., Japan), Wojtek Ziarko (U. Regina, Canada), Djamel A. ˙ Zighed (U. Lyon 2, France), Jan Zytkow (UNC Charlotte, USA). The following colleagues also reviewed for the conference and are due our special thanks: Thomas ˚ Agotnes, Mirian Halfeld Ferrari Alves, Joao Gama, A. Giacometti, Claire Green, Alipio Jorge, P. Kuntz, Dominique Laurent, Terje Løken, Aleksander Øhrn, Tobias Scheffer, Luis Torgo, and Simon White. Classified according to the first author’s nationality, papers submitted to PKDD’99 came from 31 countries on 5 continents (Europe: 71 papers; Asia: 15; North America: 12; Australia: 5; and South America: 3), including Australia (5 papers), Austria (2), Belgium (3), Brazil (3), Bulgaria (1), Canada (2), Czech Republic (4), Finland (3), France (10), Germany (12), Greece (1), Israel (3), Italy (6), Japan (8), Korea (1), Lithuania (1), Mexico (1), Netherlands (3), Norway (2), Poland (4), Portugal (2), Russia (3), Slovak Republic (1), Slovenia (1), Spain (1), Switzerland (1), Taiwan (1), Thailand (2), Turkey (1), United Kingdom (9), and USA (9). Further authors represent: Australia (6 authors), Austria (1), Belgium (5), Brazil (5), Canada (4), Colombia (2), Czech Republic (3), Finland (2), France (12), Germany (9), Greece (1), Israel (13), Italy (9), Japan (15), Korea (1), Mexico (1), Netherlands (3), Norway (3), Poland (4), Portugal (4), Russia (1), Slovenia (3), Spain (4), Switzerland (1), Taiwan (1), Thailand (1), Turkey (1), Ukraine (1), United Kingdom(9), and USA (9).
Many thanks to all who submitted papers for review and for publication in the proceedings. The accepted papers were divided into two categories: 28 oral presentations and 48 poster presentations. In addition to the poster sessions, each poster paper has been allocated a 3-minute highlight presentation at a plenary session. Invited speakers included Rudiger Wirth (DaimlerChrysler, Germany) and Wolfgang Lehner (IBM Almaden Research Center, USA). Six tutorials were offered to all Conference participants on 15 September: (1) Data Mining for Robust Business Intelligence Solutions by Jan Mrazek; (2) Query Languages for Knowledge Discovery Processes by Jean-François Boulicaut; (3) The ESPRIT Project CreditMine and its Relevance for the Internet Market by Michael Krieger and Susanne Köhler; (4) Logics and Statistics for Association Rules and Beyond by Petr Hájek and Jan Rauch; (5) Data Mining for the Web by Myra Spiliopoulou; and (6) Relational Learning and Inductive Logic Programming Made Easy by Luc De Raedt and Hendrik Blockeel. Members of the PKDD'99 organizing committee have done an enormous amount of work and deserve the special gratitude of all participants: Petr Berka – Discovery Challenge Chair, Leonardo Carbonara – Industrial Program Chair, Jiří Ivánek – Local Arrangement Chair, Vojtěch Svátek – Publicity Chair, and Jiří Kosek, Marta Sochorová, and Dalibor Šrámek. Special gratitude is also due Milena Zeithamlová and Lucie Váchová, and their organizing agency, Action M Agency. Special thanks go to Alfred Hofmann of Springer-Verlag for his continuous help and support.
July 1999
Jan Rauch and Jan Żytkow, PKDD'99 Program Co-Chairs
Table of Contents
Session 1A – Time Series Scaling up Dynamic Time Warping to Massive Dataset . . . . . . . . . . . . . . . . . E.J. Keogh, M.J. Pazzani
1
The Haar Wavelet Transform in the Time Series Similarity Paradigm . . . . . 12 Z.R. Struzik, A. Siebes Rule Discovery in Large Time-Series Medical Databases . . . . . . . . . . . . . . . . 23 S. Tsumoto
Session 1B – Applications Simultaneous Prediction of Multiple Chemical Parameters of River Water Quality with TILDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 H. Blockeel, S. Dˇzeroski, J. Grbovi´c Applying Data Mining Techniques to Wafer Manufacturing . . . . . . . . . . . . . . 41 E. Bertino, B. Catania, E. Caglio An Application of Data Mining to the Problem of the University Students’ Dropout Using Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 S. Massa, P.P. Puliafito
Session 2A – Taxonomies and Partitions Discovering and Visualizing Attribute Associations Using Bayesian Networks and Their Use in KDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 G. Masuda, R. Yano, N. Sakamoto, K. Ushijima Taxonomy Formation by Approximate Equivalence Relations, Revisited . . . 71 ˙ F.A. El-Mouadib, J. Koronacki, J.M. Zytkow On the Use of Self-Organizing Maps for Clustering and Visualization . . . . . 80 A. Flexer Speeding Up the Search for Optimal Partitions . . . . . . . . . . . . . . . . . . . . . . . . 89 T. Elomaa, J. Rousu
Session 2B – Logic Methods Experiments in Meta-level Learning with ILP . . . . . . . . . . . . . . . . . . . . . . . . . . 98 L. Todorovski, S. Dˇzeroski
Boolean Reasoning Scheme with Some Applications in Data Mining . . . . . . 107 A. Skowron, H.S. Nguyen On the Correspondence between Classes of Implicational and Equivalence Quantifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 J. Iv´ anek Querying Inductive Databases via Logic-Based User-Defined Aggregates . . 125 F. Giannotti, G. Manco
Session 3A – Distributed and Multirelational Databases Peculiarity Oriented Multi-database Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 N. Zhong, Y.Y. Yao, S. Ohsuga Knowledge Discovery in Medical Multi-databases: A Rough Set Approach . 147 S. Tsumoto Automated Discovery of Rules and Exceptions from Distributed Databases Using Aggregates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 R. P´ airc´eir, S. McClean, B. Scotney
Session 3B – Text Mining and Feature Selection Text Mining via Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 R. Feldman, Y. Aumann, M. Fresko, O. Liphstat, B. Rosenfeld, Y. Schler TopCat: Data Mining for Topic Identification in a Text Corpus . . . . . . . . . . 174 C. Clifton, R. Cooley Selection and Statistical Validation of Features and Prototypes . . . . . . . . . . 184 M. Sebban, D.A. Zighed, S. Di Palma
Session 4A – Rules and Induction Taming Large Rule Models in Rough Set Approaches . . . . . . . . . . . . . . . . . . . 193 T. ˚ Agotnes, J. Komorowski, T. Løken Optimizing Disjunctive Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 D. Zelenko Contribution of Boosting in Wrapper Models . . . . . . . . . . . . . . . . . . . . . . . . . . 214 M. Sebban, R. Nock Experiments on a Representation-Independent ”Top-Down and Prune” Induction Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 R. Nock, M. Sebban, P. Jappy
Session 5A – Interesting and Unusual Heuristic Measures of Interestingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 R.J. Hilderman, H.J. Hamilton Enhancing Rule Interestingness for Neuro-fuzzy Systems . . . . . . . . . . . . . . . . 242 T. Wittmann, J. Ruhland, M. Eichholz Unsupervised Profiling for Identifying Superimposed Fraud . . . . . . . . . . . . . . 251 U. Murad, G. Pinkas OPTICS-OF: Identifying Local Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 M.M. Breunig, H.-P. Kriegel, R.T. Ng, J. Sander
Posters Selective Propositionalization for Relational Learning . . . . . . . . . . . . . . . . . . . 271 ´ Alphonse, C. Rouveirol E. Circle Graphs: New Visualization Tools for Text-Mining . . . . . . . . . . . . . . . . 277 Y. Aumann, R. Feldman, Y.B. Yehuda, D. Landau, O. Liphstat, Y. Schler On the Consistency of Information Filters for Lazy Learning Algorithms . . 283 H. Brighton, C. Mellish Using Genetic Algorithms to Evolve a Rule Hierarchy . . . . . . . . . . . . . . . . . . 289 R. Cattral, F. Oppacher, D. Deugo Mining Temporal Features in Association Rules . . . . . . . . . . . . . . . . . . . . . . . . 295 X. Chen, I. Petrounias The Improvement of Response Modeling: Combining Rule-Induction and Case-Based Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 F. Coenen, G. Swinnen, K. Vanhoof, G. Wets Analyzing an Email Collection Using Formal Concept Analysis . . . . . . . . . . 309 R. Cole, P. Eklund Business Focused Evaluation Methods: A Case Study . . . . . . . . . . . . . . . . . . . 316 P. Datta Combining Data and Knowledge by MaxEnt-Optimization of Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 W. Ertel, M. Schramm Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329 A. Feelders
Rough Dependencies as a Particular Case of Correlation: Application to the Calculation of Approximative Reducts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 M.C. Fernandez-Baiz´ an, E. Menasalvas Ruiz, J.M. Pe˜ na S´ anchez, S. Mill´ an, E. Mesa A Fuzzy Beam-Search Rule Induction Algorithm . . . . . . . . . . . . . . . . . . . . . . . 341 C.S. Fertig, A.A. Freitas, L.V.R. Arruda, C. Kaestner An Innovative GA-Based Decision Tree Classifier in Large Scale Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348 Z. Fu Extension to C-means Algorithm for the Use of Similarity Functions . . . . . . 354 J.R. Garc´ıa-Serrano, J.F. Mart´ınez-Trinidad Predicting Chemical Carcinogenesis Using Structural Information Only . . . 360 C.J. Kennedy, C. Giraud-Carrier, D.W. Bristol LA - A Clustering Algorithm with an Automated Selection of Attributes, which Is Invariant to Functional Transformations of Coordinates . . . . . . . . . 366 M.V. Kiselev, S.M. Ananyan, S.B. Arseniev Association Rule Selection in a Data Mining Environment . . . . . . . . . . . . . . . 372 M. Klemettinen, H. Mannila, A.I. Verkamo Multi-relational Decision Tree Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378 A.J. Knobbe, A. Siebes, D. van der Wallen Learning of Simple Conceptual Graphs from Positive and Negative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384 S.O. Kuznetsov An Evolutionary Algorithm Using Multivariate Discretization for Decision Rule Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392 W. Kwedlo, M. Kr¸etowski ZigZag, a New Clustering Algorithm to Analyze Categorical Variable Cross-Classification Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398 S. Lallich Efficient Mining of High Confidence Association Rules without Support Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406 J. Li, X. Zhang, G. Dong, K. Ramamohanarao, Q. Sun A Logical Approach to Fuzzy Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 412 C.-J. Liau, D.-R. Liu AST: Support for Algorithm Selection with a CBR Approach . . . . . . . . . . . . 418 G. Lindner, R. Studer Efficient Shared Near Neighbours Clustering of Large Metric Data Sets . . . 424 S. Lodi, L. Reami, C. Sartori
Discovery of ”Interesting” Data Dependencies from a Workload of SQL Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430 S. Lopes, J.-M. Petit, F. Toumani Learning from Highly Structured Data by Decomposition . . . . . . . . . . . . . . . 436 R. Mac Kinney-Romero, C. Giraud-Carrier Combinatorial Approach for Data Binarization . . . . . . . . . . . . . . . . . . . . . . . . 442 E. Mayoraz, M. Moreira Extending Attribute-Oriented Induction as a Key-Preserving Data Mining Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448 M.K. Muyeba, J.A. Keane Automated Discovery of Polynomials by Inductive Genetic Programming . 456 N. Nikolaev, H. Iba Diagnosing Acute Appendicitis with Very Simple Classification Rules . . . . . 462 A. Øhrn, J. Komorowski Rule Induction in Cascade Model Based on Sum of Squares Decomposition 468 T. Okada Maintenance of Discovered Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476 ˇ ep´ M. Pˇechouˇcek, O. Stˇ ankov´ a, P. Mikˇsovsk´ y A Divisive Initialization Method for Clustering Algorithms . . . . . . . . . . . . . . 484 C. Pizzuti, D. Talia, G. Vonella A Comparison of Model Selection Procedures for Predicting Turning Points in Financial Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492 T. Poddig, K. Huber Mining Lemma Disambiguation Rules from Czech Corpora . . . . . . . . . . . . . . 498 L. Popel´ınsk´y, T. Pavelek Adding Temporal Semantics to Association Rules . . . . . . . . . . . . . . . . . . . . . . 504 C.P. Rainsford, J.F. Roddick Studying the Behavior of Generalized Entropy in Induction Trees Using a M-of-N Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510 R. Rakotomalala, S. Lallich, S. Di Palma Discovering Rules in Information Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518 Z.W. Ras Mining Text Archives: Creating Readable Maps to Structure and Describe Document Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524 A. Rauber, D. Merkl
Neuro-fuzzy Data Mining for Target Group Selection in Retail Banking . . . 530 J. Ruhland, T. Wittmann Mining Possibilistic Set-Valued Rules by Generating Prime Disjunctions . . 536 A.A. Savinov Towards Discovery of Information Granules . . . . . . . . . . . . . . . . . . . . . . . . . . . 542 A. Skowron, J. Stepaniuk Classification Algorithms Based on Linear Combinations of Features . . . . . . 548 ´ ezak, J. Wr´ D. Sl¸ oblewski Managing Interesting Rules in Sequence Mining . . . . . . . . . . . . . . . . . . . . . . . . 554 M. Spiliopoulou Support Vector Machines for Knowledge Discovery . . . . . . . . . . . . . . . . . . . . . 561 S. Sugaya, E. Suzuki, S. Tsumoto Regression by Feature Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568 ˙ Uysal, H.A. G¨ I. uvenir Generating Linguistic Fuzzy Rules for Pattern Classification with Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574 N. Xiong, L. Litz
Tutorials Data Mining for Robust Business Intelligence Solutions . . . . . . . . . . . . . . . . . 580 J. Mrazek Query Languages for Knowledge Discovery in Databases . . . . . . . . . . . . . . . . 582 J.-F. Boulicaut The ESPRIT Project CreditMine and Its Relevance for the Internet Market . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584 S. K¨ ohler, M. Krieger Logics and Statistics for Association Rules and Beyond . . . . . . . . . . . . . . . . . 586 P. H´ ajek, J. Rauch Data Mining for the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588 M. Spiliopoulou Relational Learning and Inductive Logic Programming Made Easy . . . . . . . 590 L. De Raedt, H. Blockeel
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
Scaling up Dynamic Time Warping to Massive Datasets Eamonn J. Keogh and Michael J. Pazzani Department of Information and Computer Science University of California, Irvine, California 92697 USA {eamonn, pazzani}@ics.uci.edu Abstract. There has been much recent interest in adapting data mining algorithms to time series databases. Many of these algorithms need to compare time series. Typically some variation or extension of Euclidean distance is used. However, as we demonstrate in this paper, Euclidean distance can be an extremely brittle distance measure. Dynamic time warping (DTW) has been suggested as a technique to allow more robust distance calculations, however it is computationally expensive. In this paper we introduce a modification of DTW which operates on a higher level abstraction of the data, in particular, a piecewise linear representation. We demonstrate that our approach allows us to outperform DTW by one to three orders of magnitude. We experimentally evaluate our approach on medical, astronomical and sign language data.
1 Introduction

Time series are a ubiquitous form of data occurring in virtually every scientific discipline and business application. There has been much recent work on adapting data mining algorithms to time series databases. For example, Das et al (1998) attempt to show how association rules can be learned from time series. Debregeas and Hebrail (1998) demonstrate a technique for scaling up time series clustering algorithms to massive datasets. Keogh and Pazzani (1998) introduced a new, scaleable time series classification algorithm. Almost all algorithms that operate on time series data need to compute the similarity between time series. Euclidean distance, or some extension or modification thereof, is typically used. However, Euclidean distance can be an extremely brittle distance measure. Consider the clustering produced by Euclidean distance in Fig. 1. Sequence 3 is judged as most similar to the line in sequence 4, yet it appears more similar to 1 or 2. The reason why Euclidean distance may fail to produce an intuitively correct measure of similarity between two sequences is because it is
Fig. 1. An unintuitive clustering produced by the Euclidean distance measure. Sequences 1, 2 and 3 are astronomical time series (Derriere 1998). Sequence 4 is simply a straight line with the same mean and variance as the other sequences
very sensitive to small distortions in the time axis. Consider Fig 2.A. The two sequences have approximately the same overall shape, but those shapes are not exactly aligned in the time axis. The nonlinear alignment shown in Fig 2.B would
Fig. 2. Two sequences that represent the Y-axis position of an individual’s hand while signing the word "pen" in Australian Sign Language. Note that while the sequences have an overall similar shape, they are not aligned in the time axis. Euclidean distance, which assumes the ith point on one sequence is aligned with ith point on the other (A), will produce a pessimistic dissimilarity measure. A nonlinear alignment (B) allows a more sophisticated distance measure to be calculated
allow a more sophisticated distance measure to be calculated. A method for achieving such alignments has long been known in the speech processing community (Sakoe and Chiba 1978). The technique, Dynamic Time Warping (DTW), was introduced to the data mining community by Berndt and Clifford (1994). Although they demonstrate the utility of the approach, they acknowledge that the algorithms time complexity is a problem and that "…performance on very large databases may be a limitation". As an example of the utility of DTW compare the clustering shown in Figure 1 with Figure 3. In this paper we introduce a technique which speeds up DTW by a large constant. The value of the constant is data dependent but is typically one to three orders of magnitude. The algorithm, Segmented Dynamic Time Warping (SDTW), takes advantage of the fact that we can efficiently approximate most time series by a set of piecewise linear segments.
Fig. 3. When the dataset used in Fig. 1 is clustered using DTW the results are much more intuitive
The rest of this paper is organized as follows. Section 2 contains a review of the classic DTW algorithm. Section 3 introduces the piecewise linear representation and SDTW algorithm. In Section 4 we experimentally compare DTW, SDTW and Euclidean distance on several real world datasets. Section 5 contains a discussion of related work. Section 6 contains our conclusions and areas of future research.
2 The Dynamic Time Warping Algorithm

Suppose we have two time series Q and C, of length n and m respectively, where:
Q = q1, q2, …, qi, …, qn    (1)

C = c1, c2, …, cj, …, cm    (2)

To align two sequences using DTW we construct an n-by-m matrix where the (i-th, j-th) element of the matrix contains the distance d(qi, cj) between the two points qi and cj (with Euclidean distance, d(qi, cj) = (qi − cj)²). Each matrix element (i,j) corresponds to the alignment between the points qi and cj. This is illustrated in Figure 4. A warping path W is a contiguous (in the sense stated below) set of matrix elements that defines a mapping between Q and C. The k-th element of W is defined as wk = (i,j)k so we have:
W = w1, w2, …, wk, …, wK    max(m,n) ≤ K < m+n−1    (3)
The warping path is typically subject to several constraints.
Boundary Conditions: w1 = (1,1) and wK = (m,n), simply stated, this requires the warping path to start and finish in diagonally opposite corner cells of the matrix.
Continuity: Given wk = (a,b) then wk−1 = (a′,b′) where a−a′ ≤ 1 and b−b′ ≤ 1. This restricts the allowable steps in the warping path to adjacent cells (including diagonally adjacent cells).

Monotonicity: Given wk = (a,b) then wk−1 = (a′,b′) where a−a′ ≥ 0 and b−b′ ≥ 0. This forces the points in W to be monotonically spaced in time.
There are exponentially many warping paths that satisfy the above conditions, however we are interested only in the path which minimizes the warping cost:

DTW(Q, C) = min{ √( Σ_{k=1}^{K} wk ) / K }    (4)

The K in the denominator is used to compensate for the fact that warping paths may have different lengths.
Fig. 4. An example warping path
This path can be found very efficiently using dynamic programming to evaluate the following recurrence, which defines the cumulative distance g(i,j) as the distance d(i,j) found in the current cell plus the minimum of the cumulative distances of the adjacent elements:

g(i,j) = d(qi, cj) + min{ g(i−1, j−1), g(i−1, j), g(i, j−1) }    (5)
The Euclidean distance between two sequences can be seen as a special case of DTW where the k-th element of W is constrained such that wk = (i,j)k, i = j = k. Note that it is only defined in the special case where the two sequences have the same length. The time complexity of DTW is O(nm). However this is just for comparing two sequences. In data mining applications we typically have one of the following two situations (Agrawal et al. 1995). 1) Whole Matching: We have a query sequence Q, and X sequences of approximately the same length in our database. We want to find the sequence that is most similar to Q. 2) Subsequence Matching: We have a query sequence Q, and a much longer sequence R of length X in our database. We want to find the subsection of R that is most similar to Q. To find the best match we "slide" the query along R, testing every possible subsection of R. In either case the time complexity is O(n²X), which is intractable for many real-world problems. This review of DTW is necessarily brief; we refer the interested reader to Kruskall and Liberman (1983) for a more detailed treatment.
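To make the recurrence concrete, the following is a minimal Python sketch of the classic DTW computation described above. It is not the authors' implementation; tracking the warping path length in a parallel array, so that the final cost can be normalised by K as in Eq. (4), is just one simple way to realise the definition.

```python
import numpy as np

def dtw_distance(Q, C):
    """Classic dynamic time warping between sequences Q and C.

    Fills the n-by-m cumulative distance matrix g of Eq. (5) and
    normalises the final cost by the warping path length K (Eq. 4)."""
    n, m = len(Q), len(C)
    g = np.full((n, m), np.inf)
    steps = np.zeros((n, m))              # warping path length ending at each cell
    g[0, 0] = (Q[0] - C[0]) ** 2
    steps[0, 0] = 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            d = (Q[i] - C[j]) ** 2        # d(q_i, c_j)
            candidates = []               # admissible predecessors
            if i > 0 and j > 0:
                candidates.append((g[i-1, j-1], steps[i-1, j-1]))
            if i > 0:
                candidates.append((g[i-1, j], steps[i-1, j]))
            if j > 0:
                candidates.append((g[i, j-1], steps[i, j-1]))
            best, best_steps = min(candidates, key=lambda t: t[0])
            g[i, j] = d + best
            steps[i, j] = best_steps + 1
    return np.sqrt(g[n-1, m-1]) / steps[n-1, m-1]

# toy usage: two sequences with the same shape, slightly shifted in time
q = np.sin(np.linspace(0, 3, 50))
c = np.sin(np.linspace(-0.3, 2.7, 60))
print(dtw_distance(q, c))
```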
3 Exploiting a Higher Level Representation

Because working with raw time series is computationally expensive, several researchers have proposed using higher level representations of the data. In previous work we have championed a piecewise linear representation, demonstrating that the linear segment representation can be used to allow relevance feedback in time series databases (Keogh and Pazzani 1998) and that it allows a user to define probabilistic queries (Keogh and Smyth 1997).

3.1 Piecewise Linear Representation

We will use the following notation throughout this paper. A time series, sampled at n points, is represented as an italicized uppercase letter such as A. The segmented version of A, containing N linear segments, is denoted as a bold uppercase letter such as A, where A is a 4-tuple of vectors of length N:

A = {AXL, AXR, AYL, AYR}

The i-th segment of sequence A is represented by the line between (AXLi, AYLi) and (AXRi, AYRi). Figure 5 illustrates this notation.
We will denote the ratio n/N as c, the compression ratio. We can choose to set this ratio to any value, adjusting the tradeoff between compactness and fidelity. For brevity we omit details of how we choose the compression ratio and how the segmented representation is obtained, referring the interested reader to Keogh and Smyth (1997) instead. We do note however that the segmentation can be obtained in linear time.

Fig. 5. We represent a time series by a sequence of straight segments

3.2 Warping with the Piecewise Linear Representation
To align two sequences using SDTW we construct an N-by-M matrix where the (i-th, j-th) element of the matrix contains the distance d(Qi, Cj) between the two segments Qi and Cj. The distance between two segments is defined as the square of the distance between their means:

d(Qi, Cj) = [((QYLi + QYRi) / 2) − ((CYLj + CYRj) / 2)]²    (6)
Apart from this modification the matrix-searching algorithm is essentially unaltered. Equation 5 is modified to reflect the new distance measure:

g(i,j) = d(Qi, Cj) + min{ g(i−1, j−1), g(i−1, j), g(i, j−1) }    (7)
When reporting the DTW distance between two time series (Eq. 4) we compensated for different length paths by dividing by K, the length of the warping path. We need to do something similar for SDTW but we cannot use K directly, because different elements in the warping matrix correspond to segments of different lengths and therefore K only approximates the length of the warping path. Additionally we would like SDTW to be measured in the same units as DTW to facilitate comparison. We measure the length of SDTW's warping path by extending the recurrence shown in Eq. 7 to return and recursively sum an additional variable, max([QXRi − QXLi], [CXRj − CXLj]), with the corresponding element from min{ g(i−1, j−1), g(i−1, j), g(i, j−1) }. Because the length of the warping path is measured in the same units as DTW we have:

SDTW(Q, C) ≈ DTW(Q, C)    (8)
Figure 6 shows strong visual evidence that SDTW finds alignments that are very similar to those produced by DTW. The time complexity for SDTW is O(MN), where M = m/c and N = n/c. This means that the speedup obtained by using SDTW should be approximately c², minus some constant factors because of the overhead of obtaining the segmented representation.
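Below is a hedged sketch of the segmented variant. The segment helper is only a stand-in: the paper relies on the segmentation of Keogh and Smyth (1997), which is not reproduced here, so an equal-width split is used purely for illustration. Segment distances follow Eq. (6) (squared difference of segment means), the recurrence follows Eq. (7), and the path length is accumulated from segment time-spans in the spirit of the normalisation described above.

```python
import numpy as np

def segment(series, N):
    """Stand-in piecewise linear segmentation: N equal-width pieces,
    each kept as (x_left, x_right, y_left, y_right)."""
    bounds = np.linspace(0, len(series) - 1, N + 1).astype(int)
    return [(l, r, series[l], series[r]) for l, r in zip(bounds[:-1], bounds[1:])]

def sdtw_distance(Qseg, Cseg):
    """Segmented DTW on two segment lists, following Eqs. (6)-(7)."""
    mean = lambda s: (s[2] + s[3]) / 2.0       # mean of a segment's endpoints
    span = lambda s: s[1] - s[0]               # time span of a segment
    N, M = len(Qseg), len(Cseg)
    g = np.full((N, M), np.inf)
    length = np.zeros((N, M))                  # approximate warping path length
    for i in range(N):
        for j in range(M):
            d = (mean(Qseg[i]) - mean(Cseg[j])) ** 2
            step = max(span(Qseg[i]), span(Cseg[j]))
            if i == 0 and j == 0:
                g[i, j], length[i, j] = d, step
                continue
            prev = []
            if i > 0 and j > 0: prev.append((g[i-1, j-1], length[i-1, j-1]))
            if i > 0:           prev.append((g[i-1, j],   length[i-1, j]))
            if j > 0:           prev.append((g[i, j-1],   length[i, j-1]))
            best, best_len = min(prev, key=lambda t: t[0])
            g[i, j] = d + best
            length[i, j] = best_len + step
    return np.sqrt(g[N-1, M-1]) / length[N-1, M-1]

q = np.sin(np.linspace(0, 3, 500))
c = np.sin(np.linspace(-0.3, 2.7, 600))
print(sdtw_distance(segment(q, 20), segment(c, 20)))
```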
Fig. 6. A and B both show two similar time series and the alignment between them, as discovered by DTW. A’ and B’ show the same time series in their segmented representation, and the alignment discovered by SDTW. This presents strong visual evidence that SDTW finds approximately the same warping as DTW
4 Experimental Results We are interested in two properties of the proposed approach. The speedup obtained over the classic DTW algorithm and the quality of the alignment. In general, the quality of the alignment is subjective, so we designed experiments that indirectly, but objectively measure it. 4.1 Clustering For our clustering experiment we utilized the Australian Sign Language Dataset from the UCI KDD archive (Bay 1999). The dataset consists of various sensors that measure the X-axis position of a subject’s right hand while signing one of 95 words in Australian Sign Language (There are other sensors in the dataset, which we ignored in this work). For each of the words, 5 recordings were made. We used a subset of the database which corresponds to the following 10 words, "spend", "lose", "forget", "innocent", "norway", "happy", "later", "eat", "cold" and "crazy". For every possible pairing of words, we clustered the 10 corresponding sequences, using group average hierarchical clustering. At the lowest level of the corresponding dendogram, the clustering is subjective. However, the highest level of the dendogram (i.e. the first bifurcation) should divide the data into the two classes. There are 34,459,425 possible ways to cluster 10 items, of which 11,025 of them correctly partition the two classes, so the default rate for an algorithm which guesses randomly is only 0.031%. We compared three distance measures: 1) DTW: The classic dynamic time warping algorithm as presented in Section 2. 2) SDTW: The segmented dynamic time warping algorithm proposed here. 3) Euclidean: We also tested Euclidean to facilitate comparison to the large body of literature that utilizes this distance measure. Because the Euclidean distance is only defined for sequences of the same length, and there is a small variance in the length of the sequences in this dataset, we did the following. When comparing sequences of different lengths, we "slid" the shorter of the two sequences across the longer and recorded the minimum distance.
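The default rate quoted above can be checked with a few lines of arithmetic, assuming the counts refer to distinct binary dendrogram topologies over labelled leaves ((2n−3)!! for n leaves) and that a correct clustering is one that keeps each 5-element class in its own subtree.

```python
from math import prod

def dendrograms(n):
    """Number of rooted binary tree shapes (dendrograms) over n labelled leaves."""
    return prod(range(1, 2 * n - 2, 2))          # (2n-3)!!

total = dendrograms(10)                          # 34,459,425
correct = dendrograms(5) * dendrograms(5)        # each class kept intact: 11,025
print(total, correct, f"{100 * correct / total:.4f}%")   # roughly 0.03%, as quoted
```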
Figure 7 shows an example of one experiment and Table 1 summarizes the results.
Fig. 7. An example of a single clustering experiment. The time series 1 to 5 correspond to 5 different readings of the word "norway", the time series 6 to 10 correspond to 5 different readings of the word "later". Euclidean distance is unable to differentiate between the two words. Although DTW and SDTW differ at the lowest levels of the dendrogram, where the clustering is subjective, they both correctly divide the two classes at the highest level
Distance measure | Mean Time (Seconds) | Correct Clusterings (Out of 45)
Euclidean        | 3.23                | 2
DTW              | 87.06               | 22
SDTW             | 4.12                | 21

Table 1: A comparison of three distance measures on a clustering task

Although the Euclidean distance can be quickly calculated, its performance is only slightly better than random. DTW and SDTW have essentially the same accuracy but SDTW is more than 20 times faster.

4.2 Query by Example

The clustering example in the previous section demonstrated the ability of SDTW to do whole matching. Another common task for time series applications is subsequence matching, which we consider here. Assume that we have a query Q of length n, and a much longer reference sequence R, of length X. The task is to find the subsequence of R which best matches Q, and report its offset within R. If we use the Euclidean distance as our distance measure, we can use an indexing technique to speed up the search (Faloutsos et al. 1994, Keogh & Pazzani 1999). However, DTW does not obey the triangular inequality and this makes
it impossible to utilize standard indexing schemes. Given this, we are resigned to using sequential search, "sliding" the query along the reference sequence, repeatedly recalculating the distance at each offset. Figure 8 illustrates the idea.
Fig. 8. Subsequence matching involves sequential search, "sliding" the query Q against the reference sequence R, repeatedly recalculating the distance measure at each offset.
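A brute-force version of this scan is easy to sketch. The distance function is passed in as a parameter, so the dtw_distance sketch from Section 2 (or any other measure) can be plugged in; the plain Euclidean distance in the example is used only for brevity.

```python
import numpy as np

def best_subsequence_offset(Q, R, dist):
    """Brute-force subsequence matching as in Fig. 8: slide the query Q
    along the reference R and return the offset of the best match."""
    n = len(Q)
    scores = np.array([dist(Q, R[off:off + n]) for off in range(len(R) - n + 1)])
    best = int(np.argmin(scores))
    return best, float(scores[best])

# example with a plain Euclidean distance for brevity
euclid = lambda a, b: float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))
R = np.sin(np.linspace(0, 20, 500))
Q = R[123:173] + 0.01
print(best_subsequence_offset(Q, R, euclid))
```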
Berndt and Clifford (1994) suggested the simple optimization of skipping every second datapoint in R, noting that as Q is slid across R, the distance returned by DTW changes slowly and smoothly. We note that sometimes it would be possible to skip much more than 1 datapoint, because the distance will only change dramatically when a new feature (i.e. a plateau, one side of a peak or valley etc.) from R falls within the query window. The question then arises of how to tell where features begin and end in R. The answer to this problem is given automatically, because the process of obtaining the linear segmentation can be considered a form of feature extraction (Hagit & Zdonik 1996). We propose searching R by anchoring the leftmost segment in Q against the left edge of each segment in R. Each time we slide the query to measure the distance at the next offset, we effectively skip as many datapoints as are represented by the last anchor segment. As noted in Section 3 the speedup for SDTW over DTW is approximately c², however this is for whole matching; for subsequence matching the speedup is approximately c³. For this experiment we used the EEG dataset from the UCI KDD repository (Bay 1999). This dataset contains 10,240 datapoints. In order to create queries with objectively correct answers, we extracted a 100-point subsection of data at random, then artificially warped it. To warp a sequence we begin by randomly choosing an anchor point somewhere on the sequence. We randomly shifted the anchor point W time-units left or right (with W = 10, 20, 30). The other datapoints were moved to compensate for this shift by an amount that depended on their inverse squared distance to the anchor point, thus localizing the effect. After this transformation we interpolated the data back onto the original, equi-spaced X-axis. The net effect of this transformation is a smooth local distortion of the original sequence, as shown in Figure 9. We repeated this ten times for each W.

Fig. 9. An example of an artificially warped time series used in our experiments. An anchor point (black dot) is chosen in the original sequence (solid line). The anchor point is moved W units (here W = 10) and the neighboring points are also moved by an amount related to the inverse square of their distance to the anchor point. The net result is that the transformed sequence (dashed line) is a smoothly warped version of the original sequence
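The warping procedure just described can be sketched as follows. The exact decay constant of the inverse-square weighting and the interpolation routine are assumptions, since the text does not spell them out.

```python
import numpy as np

def warp(series, W, rng=np.random.default_rng(0)):
    """Smoothly distort a sequence: pick an anchor point, shift it W time
    units, move neighbours by an amount decaying with the inverse square of
    their distance to the anchor, then re-interpolate onto the original
    equi-spaced axis. The decay constant is an illustrative assumption."""
    n = len(series)
    x = np.arange(n, dtype=float)
    anchor = rng.integers(1, n - 1)
    shift = W * rng.choice([-1.0, 1.0])
    dist = np.abs(x - anchor)
    x_new = x + shift / (1.0 + dist ** 2)      # inverse-square localisation
    order = np.argsort(x_new)
    return np.interp(x, x_new[order], np.asarray(series)[order])

eeg_like = np.cumsum(np.random.default_rng(1).normal(size=100))
query = warp(eeg_like, W=10)
```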
As before, we compared three distance measures, measuring both accuracy and time. The results are presented in Table 2.

Distance measure | Mean Accuracy (W = 10) | Mean Accuracy (W = 20) | Mean Accuracy (W = 30) | Mean Time (Seconds)
Euclidean        | 20%                    | 0%                     | 0%                     | 147.23
DTW              | 100%                   | 90%                    | 60%                    | 15064.64
SDTW             | 100%                   | 90%                    | 50%                    | 26.16

Table 2: A comparison of three distance measures on query by example

Euclidean distance is fast to compute, but its performance degrades rapidly in the presence of time axis distortion. Both DTW and SDTW are able to detect matches in spite of warping, but SDTW is approximately 575 times faster.
5 Related Work Dynamic time warping has enjoyed success in many areas where it’s time complexity is not an issue. It has been used in gesture recognition (Gavrila & Davis 1995), robotics (Schmill et. al 1999), speech processing (Rabiner & Juang 1993), manufacturing (Gollmer & Posten 1995) and medicine (Caiani et. al 1998). Conventional DTW, however, is much too slow for searching large databases. For this problem, Euclidean distance, combined with an indexing scheme is typically used. Faloutsos et al, (1994) extract the first few Fourier coefficients from the time series and use these to project the data into multi-dimensional space. The data can then be indexed with a multi-dimensional indexing structure such as a R-tree. Keogh and Pazzani (1999) address the problem by de-clustering the data into bins, and optimizing the data within the bins to reduce search times. While both these approaches greatly speed up query times for Euclidean distance queries, many real world applications require non-Euclidean notions of similarity. The idea of using piecewise linear segments to approximate time series dates back to Pavlidis and Horowitz (1974). Later researchers, including Hagit and Zdonik (1996) and Keogh and Pazzani (1998) considered methods to exploit this representation to support various non-Euclidean distance measures, however this paper is the first to demonstrate the possibility of supporting time warped queries with linear segments.
6 Conclusions and Future Work We demonstrated a modification of DTW that exploits a higher level representation of time series data to produce one to three orders of magnitude speed-up with no appreciable decrease in accuracy. We experimentally demonstrated our approach on several real world datasets. Future work includes a detailed theoretical examination of SDTW, and extensions to multivariate time series.
References Agrawal, R., Lin, K. I., Sawhney, H. S., & Shim, K. (1995). Fast similarity search in the presence of noise, scaling, and translation in times-series databases. In VLDB, September. Bay, S. (1999). UCI Repository of Kdd databases [http://kdd.ics.uci.edu/]. Irvine, CA: University of California, Department of Information and Computer Science. Berndt, D. & Clifford, J. (1994) Using dynamic time warping to find patterns in time series. AAAI-94 Workshop on Knowledge Discovery in Databases (KDD-94), Seattle, Washington. Caiani, E.G., Porta, A., Baselli, G., Turiel, M., Muzzupappa, S., Pieruzzi, F., Crema, C., Malliani, A. & Cerutti, S. (1998) Warped-average template technique to track on a cycle-by-cycle basis the cardiac filling phases on left ventricular volume. IEEE Computers in Cardiology. Vol. 25 Cat. No.98CH36292, NY, USA. Das, G., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule discovery form time series. Proceedings of the 4rd International Conference of Knowledge Discovery and Data Mining. pp 16-22, AAAI Press. Debregeas, A. & Hebrail, G. (1998). Interactive interpretation of Kohonen maps applied to curves. Proceedings of the 4rd International Conference of Knowledge Discovery and Data Mining. pp 179-183, AAAI Press. Derriere, S. (1998) D.E.N.I.S strasbg.fr/DENIS/qual_gif/cpl3792.dat]
strip 3792: [http://cdsweb.u-strasbg.fr/DENIS/qual_gif/cpl3792.dat]
Scaling up Dynamic Time Warping to Massive Dataset
11
Keogh, E., & Pazzani, M. (1998). An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. rd Proceedings of the 4 International Conference of Knowledge Discovery and Data Mining. pp 239-241, AAAI Press. Keogh, E., & Pazzani, M. (1999). An indexing scheme for fast similarity search in th large time series databases. To appear in Proceedings of the 11 International Conference on Scientific and Statistical Database Management. Keogh, E., Smyth, P. (1997). A probabilistic approach to fast pattern matching in time rd series databases. Proceedings of the 3 International Conference of Knowledge Discovery and Data Mining. pp 24-20, AAAI Press. Kruskall, J. B. & Liberman, M. (1983). The symmetric time warping algorithm: From continuous to discrete. In Time Warps, String Edits and Macromolecules: The Theory and Practice of String Comparison. Addison-Wesley. Pavlidis, T., Horowitz, S. (1974). Segmentation of plane curves. IEEE Transactions on Computers, Vol. C-23, NO 8, August. Rabiner, L. & Juang, B. (1993). Fundamentals of speech recognition. Englewood Cliffs, N.J, Prentice Hall. Sakoe, H. & Chiba, S. (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoustics, Speech, and Signal Proc., Vol. ASSP-26, 43-49. Schmill, M., Oates, T. & Cohen, P. (1999). Learned models for continuous planning. In Seventh International Workshop on Artificial Intelligence and Statistics.
The Haar Wavelet Transform in the Time Series Similarity Paradigm Zbigniew R. Struzik, Arno Siebes Centre for Mathematics and Computer Science (CWI) Kruislaan 413, 1098 SJ Amsterdam The Netherlands email:
[email protected] Abstract. Similarity measures play an important role in many data mining algorithms. To allow the use of such algorithms on non-standard databases, such as databases of financial time series, their similarity measure has to be defined. We present a simple and powerful technique which allows for the rapid evaluation of similarity between time series in large data bases. It is based on the orthonormal decomposition of the time series into the Haar basis. We demonstrate that this approach is capable of providing estimates of the local slope of the time series in the sequence of multi-resolution steps. The Haar representation and a number of related represenations derived from it are suitable for direct comparison, e.g. evaluation of the correlation product. We demonstrate that the distance between such representations closely corresponds to the subjective feeling of similarity between the time series. In order to test the validity of subjective criteria, we test the records of currency exchanges, finding convincing levels of correlation.
1 Introduction
Explicitly or implicitly, record similarity is a fundamental aspect of most data mining algorithms. For traditional, tabular data the similarity is often measured by attributevalue similarity or even attribute-value equality. For more complex data, e.g., financial time series, such simple similarity measures do not perform very well. For example, assume we have three time series A, B, and C, where B is constantly 5 points below A, whereas C is randomly 2 points below or above A. Such a simple similarity measure would rate C as far more similar to A than B, whereas a human expert would rate A and B as very similar because they have the same shape. This example illustrates that the similarity of time series data should be based on certain characteristics of the data rather than on the raw data itself. Ideally, these characteristics are such that the similarity of the time series is simply given by the (traditional) similarity of the characteristics. In that case, mining a database of time series is reduced to mining the database of characteristics using the traditional algorithms. This observation is not new, but can also (implicitly) be found in papers such as [1-7]. Which characteristics are computed depends very much on the application one has in mind. For example, many models and paradigms of similarity introduced to date are unnecesarily complex because they are designed to suit too large a spectrum of applications. The context of data mining applications in which matching time series are required often involves a smaller number of degrees of freedom than assumed. For •
example, in comparing simultanous financial time series, the time variable is explicitly known and time and scale shift are not applicable. In addition, there are strong heuristics which can be applied to these time series. For example, the concern in trading is usually to reach a certain level of index or currency exchange within a certain time. This is nothing else than increase rate or simply slope of the time series in question. Consider a financial record over one year which we would like to compare with another such record from another source. The values of both are unrelated, the sampling density may be different or vary with time. Nevertheless, it ought to be possible to state how closely the two are related. If we were to do it in as few steps as possible, the first to ask would probably be about the increase/decrease in (log)value over the year. In fact, just a sign of a change over the year may be sufficient, showing whether there has been a decrease or an increase in the stock value. Given this information the next question might be what the increase/decrease was in the first half of the year and what it was in the second half. The reader will not be surprised if we suggest that perhaps the next question might be related to the increase/decrease in each quarter of the year. This is exactly the strategy we are going to follow. The wavelet transform using the Haar wavelet (the Haar WT for short) will provide exactly the kind of information we have used in the above example, through the decomposition of the time series in the Haar basis. In section 2, we will focus on the relevant aspects of the wavelet transformation with the Haar wavelet. From the hierarchical scale-wise decomposition provided by the wavelet transform, we will next select a number of interesting representations of the time series in section 3. In section 4, these time series’ representations will be subject to evaluation of their correlation products. Section 5 gives a few details on the computational efficiency of the convolution product. This is followed by several test cases of correlating examples of currency exchange rates in section 6. Section 7 closes the paper with conclusions and suggestions for future developments.
2 The Haar Wavelet Transform

As already mentioned above, the recently introduced Wavelet Transform (WT), see e.g. Ref. [9, 10], provides a way of analysing local behaviour of functions. In this, it fundamentally differs from global transforms like the Fourier Transform. In addition to locality, it possesses the often very desirable ability of filtering the polynomial behaviour to some predefined degree. Therefore, correct characterisation of time series is possible, in particular in the presence of non-stationarities like global or local trends or biases. Conceptually, the wavelet transform is an inner product of the time series with the scaled and translated wavelet ψ(x), usually an n-th derivative of a smoothing kernel φ(x). The scaling and translation actions are performed by two parameters; the scale parameter s 'adapts' the width of the wavelet to the microscopic resolution required, thus changing its frequency contents, and the location of the analysing wavelet is determined by the parameter b:
Wψ f(s, b) = ⟨f, ψ⟩(s, b) = (1/s) ∫ dx f(x) ψ((x − b)/s) ,    (1)
where s, b ∈ ℝ and s > 0 for the continuous version (CWT), or are taken on a discrete, usually hierarchical (e.g. dyadic) grid of values si, bj for the discrete version (DWT, or just WT), defined over the support of f(x), i.e. the length of the time series. The choice of the smoothing kernel φ(x) and the related wavelet ψ(x) depends on the application and on the desired properties of the wavelet transform. In [6, 7, 11], we used the Gaussian as the smoothing kernel. The reason for this was the optimal localisation both in frequency and position of the related wavelets, and the existence of derivatives of any degree n. In this paper, for the reasons which will become apparent later, see section 3, we will use a different smoothing function, namely a simple block function:
φ(x) = 1 for 0 ≤ x < 1, and 0 otherwise.    (2)

The related Haar wavelet is the difference of two such blocks of half width:

ψ(x) = 1 for 0 ≤ x < 1/2, −1 for 1/2 ≤ x < 1, and 0 otherwise.    (3)

Dilated and translated versions ψm,l of the Haar wavelet form the basis of the decomposition, and each coefficient cm,l of the representation can be obtained as cm,l = ⟨f, ψm,l⟩. In particular, the approximations f^j of the time series f with the smoothing kernel φj,k form a 'ladder' of multi-resolution approximations:
f^(j−1) = f^j + Σ_{k=0}^{2^j} ⟨f, ψj,k⟩ ψj,k ,    (6)

where f^j = ⟨f, φj,k⟩ and ψj,k = 2^(−j) ψ(2^(−j) x − k).
It is thus possible to 'move' from one approximation level j − 1 to another level j by simply adding (subtracting for the j to j − 1 direction) the detail contained in the corresponding wavelet coefficients cj,k, k = 0 … 2^j. In figure 1, we show an example decomposition and reconstruction with the Haar wavelet. The time series analysed is f1..4 = {9, 7, 3, 5}.
Fig. 1. Decomposition of the example time series into Haar components. Right: reconstruction of the time series from the Haar components.
Note that the set of wavelet coefficients can be represented in a hierarchical (dyadic) tree structure, through which it is obtained. In particular, the reconstruction of each single point fi of the time series is possible (without reconstructing all the fj, j ≠ i), by following a single path along the tree, converging to the point fi in question. This path determines a unique 'binary address' of the point fi.
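The decomposition of Fig. 1 amounts to the standard averages-and-differences cascade, sketched below for the example series {9, 7, 3, 5}. The helper names are illustrative, the input length is assumed to be a power of two, and the per-scale normalisation of the coefficients is glossed over.

```python
def haar_decompose(f):
    """One full Haar cascade: repeatedly replace pairs by their average and
    keep the half-differences as detail coefficients, coarsest level last."""
    details, level = [], list(f)
    while len(level) > 1:
        averages = [(a + b) / 2.0 for a, b in zip(level[0::2], level[1::2])]
        diffs    = [(a - b) / 2.0 for a, b in zip(level[0::2], level[1::2])]
        details.append(diffs)
        level = averages
    return level[0], details               # overall average, details per level

def haar_reconstruct(average, details):
    level = [average]
    for diffs in reversed(details):
        level = [v for a, d in zip(level, diffs) for v in (a + d, a - d)]
    return level

avg, det = haar_decompose([9, 7, 3, 5])
print(avg, det)                            # 6.0, [[1.0, -1.0], [2.0]], as in Fig. 1
print(haar_reconstruct(avg, det))          # [9.0, 7.0, 3.0, 5.0]
```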
3 Time Series Representations with Haar Family

Note that the Haar wavelet implements the operation of derivation at the particular scale at which it operates. From the definition of the Haar wavelet ψ (eq. 3, see also figure 2) we have:

ψ(x) = (Dφ ∗ φ)(2x) ,

where D is the derivative operator

D(x) = 1 for x = 0, −1 for x = 1, and 0 otherwise.    (7)

For the wavelet transform of f, we have the following:

⟨f(x), ψl,n(x)⟩ = ⟨f(x), ⟨Dφl,n(2x), φl,n(2x)⟩⟩
               = ⟨f(x), 2^(−1) ⟨Dφl−1,n(x), φl−1,n(x)⟩⟩
               = …
               = 2^(−1) ⟨Dfm,n(x), φm,n(x)⟩ ,    (8)
Fig. 2. Convolution of the block function with the derivative operator gives the Haar wavelet after rescaling the time axis x → x/2. ∗ stands for the convolution product.
The most direct representation of the time series with the Haar decomposition scheme would be encoding a certain predefined, highest, i.e. most coarse, resolution level s_max, say one year resolution, and the details at the lower scales: half (a year), quarter (of a year) etc., down to the minimal (finest) resolution of interest s_min, which would often be defined by the lowest sampling rate of the signals.¹ The coefficients of the Haar decomposition between scales s_max..s_min will be used for the representation: Haar(f) = {c_{i,j} : i = s_max..s_min, j = 1..2^i}.
The Haar representation is directly suitable to serve for comparison purposes when the absolute (i.e. not relative) values of the time series (and the local slope) are relevant. In many applications one would, however, rather work with value independent, scale invariant representations. For that purpose, we will use a number of different, special representations derived from the Haar decomposition WT. To begin with, we will use the sign based representation. It uses only the sign of the wavelet coefficient and it has been shown to work in the CWT based decomposition, see [6]: s_{i,j} = sgn(c_{i,j}).

¹ In practice one may need to interpolate and re-sample signals in order to arrive at a certain common or uniform sampling rate. This is, however, a problem of the implementation and not of the representation, and it is related to how the convolution operation is implemented.
Here sgn(x) = 1 for x ≥ 0 and −1 for x < 0.
This representation resembles the Hölder exponent approximation of the time series' local roughness at the particular scale of resolution i as introduced in [7].
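The following sketch (again ours, assuming a series of dyadic length and using un-normalised pairwise differences as the detail coefficients) extracts the multi-scale Haar detail coefficients and derives the sign-based representation s_{i,j} = sgn(c_{i,j}) from them.

    def haar_details(series):
        """Pairwise (a - b) / 2 differences at every scale; coarsest level first."""
        approx, levels = list(series), []
        while len(approx) > 1:
            levels.append([(a - b) / 2.0 for a, b in zip(approx[0::2], approx[1::2])])
            approx = [(a + b) / 2.0 for a, b in zip(approx[0::2], approx[1::2])]
        return list(reversed(levels))

    def sign_representation(levels):
        """s[i][j] = sgn(c[i][j]) with sgn(x) = 1 for x >= 0 and -1 otherwise."""
        return [[1 if c >= 0 else -1 for c in level] for level in levels]

    series = [9, 7, 3, 5, 4, 8, 10, 2]
    levels = haar_details(series)        # [[0.0], [2.0, 0.0], [1.0, -1.0, -2.0, 4.0]]
    print(levels)
    print(sign_representation(levels))   # [[1], [1, 1], [1, -1, -1, 1]]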
4 Distance Evaluation with Haar Representations

The measure of the correlation between the components c^f_{i,j} and c^g_{k,l} of two respective time series f and g can be put as:

  C(f, g) = Σ_{i,j,k,l} w_i c^f_{i,j} w_k c^g_{k,l} δ_{i,j;k,l} ,

where δ_{i,j;k,l} = 1 iff i = k and j = l, and the (optional) weights w_i and w_k depend on their respective scales i and k. In our experience the orthogonality of the coefficients is best employed without weighting. Normalisation is necessary in order to arrive at a correlation product within [0, 1] and will simply take the form of

  C_normalised(f, g) = C(f, g) / √(C(f, f) C(g, g)) .

The distance of two representations can be easily obtained as

  Distance(f, g) = − log(|C_normalised(f, g)|) .
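As an illustration of the above, a short sketch (ours; it assumes two representations on the same dyadic grid and, following the authors' recommendation, uses no weighting) of the correlation product, its normalisation, and the resulting distance:

    import math

    def correlation(rep_f, rep_g):
        """Unweighted correlation product over matching scale/position coefficients."""
        return sum(cf * cg for lf, lg in zip(rep_f, rep_g) for cf, cg in zip(lf, lg))

    def normalised_correlation(rep_f, rep_g):
        return correlation(rep_f, rep_g) / math.sqrt(
            correlation(rep_f, rep_f) * correlation(rep_g, rep_g))

    def distance(rep_f, rep_g):
        return -math.log(abs(normalised_correlation(rep_f, rep_g)))

    rep_f = [[0.0], [2.0, 0.0], [1.0, -1.0, -2.0, 4.0]]   # Haar details of one series
    rep_g = [[0.5], [1.5, 0.5], [1.0, -0.5, -2.5, 3.0]]   # Haar details of another
    print(normalised_correlation(rep_f, rep_g))            # about 0.96
    print(distance(rep_f, rep_g))                          # about 0.04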
Fig. 3. Top plot contains the input signal (a Brownian walk sample). The top colour (gray-scale) panel contains the Haar decomposition with six scale levels from i = 1 to i = 6; the smoothed component is not shown. The colour (gray shade) encodes the value of the decomposition from dark blue (white) for −1 to dark red (black) for 1. The centre panel shows the sign of the decomposition coefficients, i.e. dark blue (white) for c_{i,j} ≥ 0 and dark red (black) for c_{i,j} < 0. The bottom colour (gray-scale) panel contains the Hölder decomposition with five scale levels i = 2 … 6.
5 Incremental Calculation of the Decomposition Coefficients and the Correlation Product

One of the severe disadvantages of the Haar WT is the lack of translation invariance; when the input signal shifts by Δt (e.g. as the result of acquiring some additional input samples), the coefficients of the Haar wavelet transform need to be recalculated. This is rather impractical when one considers systematically updated inputs like financial records. When the representation is to be updated on each new sample, little can be done other than to recalculate the coefficients. The cost of this resides mainly in the cost of calculating the inner product. Direct calculation is of nm complexity, where n = 2^N is the length of the time series and m is the length of the wavelet. The cost of calculating the inner product therefore grows quickly with the length of the wavelet, and for the largest scale it is n^2. The standard way to deal with this problem is to use the Fast Fourier Transform for calculating the inner product of two time series, which in the case of equal length reduces the complexity to n log(n). Additional savings can be obtained if the update of the WT does not have to be performed on every new input sample, but can be done periodically on each new n samples (corresponding with some Δt time period). In this case, when the Δt coincides with the working scale of the wavelet at a given resolution, a particular situation arises:
– only the coefficients at scales larger than the Δt scale have to be recalculated;
– coefficients of f|_{x0}^{x0+Δt} must be calculated anew;
– other coefficients have to be re-indexed or removed.
This is also illustrated in figure 4.
Fig. 4. Representation update scheme in the case of the shift of the input time series by Δt = working scale of the wavelet.
As expected, the larger the time shift Δt, the fewer the coefficients which have to be recalculated and the larger the number of coefficients which have to be reindexed (plus, of course, the number of coefficients which have to be calculated from
f|_{x0}^{x0+Δt}). For the full details of the incremental calculation of coefficients the reader may wish to consult [8].
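The bookkeeping of figure 4 can be sketched as follows; this is our illustration only, and it assumes a window of 2^N samples on a dyadic grid that slides forward by Δt = 2^m samples, with level j covering blocks of 2^j samples.

    def classify_updates(N, m):
        """Bucket the Haar coefficients of a 2**N-sample window when it slides
        forward by delta_t = 2**m samples (cf. figure 4)."""
        n, shift = 2 ** N, 2 ** m
        actions = {"recalculate": [], "remove": [], "reindex": [], "new": []}
        for j in range(1, N + 1):              # level j works at scale 2**j samples
            block = 2 ** j
            for k in range(n // block):        # coefficients of the old window
                if block > shift:              # scales larger than delta_t
                    actions["recalculate"].append((j, k))
                elif (k + 1) * block <= shift: # support falls off the left edge
                    actions["remove"].append((j, k))
                else:                          # survives, only shifted in time
                    actions["reindex"].append((j, k))
            if block <= shift:                 # count of coefficients over the new segment
                actions["new"].append((j, shift // block))
        return actions

    print(classify_updates(N=4, m=2))          # 16-sample window shifted by 4 samples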
6 Experimental Results
We took the records of the exchange rate with respect to USD over the period 01/06/73 - 21/05/87. The data contain daily records of the exchange rates of five currencies with respect to USD: Pound Sterling, Canadian Dollar, German Mark, Japanese Yen and Swiss Franc. (Some records were missing - we used the last known value to interpolate missing values.) Below, in figure 5, we show the plots of the records.
Fig. 5. Left above, all the records of the exchange rate used, with respect to USD over the period 01/06/73 - 21/05/87. In small inserts, single exchange rates renormalised, from top right to bottom left (clockwise): Pound Sterling, Canadian Dollar, German Mark, Japanese Yen and Swiss Franc, all with respect to USD.
All three representation types were made for each of the time series: the Haar, sign and Hölder representation. Only six scale levels (64 values) of the representation (five for Hölder, 63 points) were retained. These were next compared for each pair to give the correlation product.
In figure 6, we plot the values of the correlation for each of the pairs compared. The reader can visually compare the Haar representation results with his/her own ‘visual estimate’ of the degree of (anti-)correlation for pairs of plots in figure 5.
Fig. 6. The values of the correlation products for each of the pairs compared, obtained with the Haar representation, the sign representation, and the Hölder representation.
One can verify that the results obtained with the sign representation follow those obtained with the Haar representation but are weaker in their discriminating power (a flatter plot). Also, the Hölder representation is practically independent of the sign representation. In terms of the correlation product, its distance to the sign representation approximately equals the distance of the Haar representation to the sign representation, but with the opposite sign. This confirms the fact that the correlation in the Hölder exponent captures the value oriented, sign independent features (roughness exponent) of the time series.
7 Conclusions

We have demonstrated that the Haar representation and a number of related representations derived from it are suitable for providing estimates of similarity between time series in a hierarchical fashion. In particular, the correlation obtained with the local slope of the time series (or its sign) in the sequence of multi-resolution steps closely corresponds to the subjective feeling of similarity between the example financial time series. Larger scale experiments with one of the major Dutch banks confirm these findings. The next step is the design and development of a module which will compute and update these representations for the 2.5 million time series which this bank maintains. Once this module is running, mining on the database of time series representations will be the next step.
References 1. R. Agrawal, C. Faloutsos, A. Swami. Efficient Similarity Search in Sequence Databases, In Proc. of the Fourth International Conference on Foundations of Data Organization and
Algorithms, Chicago, (1993).
2. R. Agrawal, K.-I. Lin, H.S. Sawhney, K. Shim, Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time Series Databases, in Proceedings of the 21st VLDB Conference, Zürich, (1995).
3. G. Das, D. Gunopulos, H. Mannila, Finding Similar Time Series, In Principles of Data Mining and Knowledge Discovery, Lecture Notes in Artificial Intelligence 1263, Springer, (1997).
4. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, Eds., Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, (1996).
5. C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast Subsequence Matching in Time-Series Databases, in Proc. ACM SIGMOD Int. Conf. on Management of Data, (1994).
6. Z.R. Struzik, A. Siebes, Wavelet Transform in Similarity Paradigm I, CWI Report, INS-R9802, (1998), also in Research and Development in Knowledge Discovery and Data Mining, Xindong Wu, Ramamohanarao Kotagiri, Kevin B. Korb, Eds, Lecture Notes in Artificial Intelligence 1394, 295-309, Springer (1998).
7. Z.R. Struzik, A. Siebes, Wavelet Transform in Similarity Paradigm II, CWI Report, INS-R9815, CWI, Amsterdam (1998), also in Proc. 10th Int. Conf. on Database and Expert System Applications (DEXA'99), Florence, (1999).
8. Z.R. Struzik, A. Siebes, The Haar Wavelet Transform in Similarity Paradigm, CWI Report, INS-R99xx, CWI, Amsterdam (1999). http://www.cwi.nl/htbin/ins1/publications
9. I. Daubechies, Ten Lectures on Wavelets, S.I.A.M. (1992).
10. M. Holschneider, Wavelets - An Analysis Tool, Oxford Science Publications, (1995).
11. Z.R. Struzik, 'Local Effective Hölder Exponent Estimation on the Wavelet Transform Maxima Tree', Fractals: Theory and Applications in Engineering, Michel Dekking, Jacques Lévy Véhel, Evelyne Lutton, Claude Tricot, Eds, Springer (1999).
Rule Discovery in Large Time-Series Medical Databases

Shusaku Tsumoto
Department of Medicine Informatics, Shimane Medical University, School of Medicine, 89-1 Enya-cho, Izumo City, Shimane 693-8501, Japan
E-mail: [email protected]

Abstract. Since hospital information systems have been introduced in large hospitals, a large amount of data, including laboratory examinations, has been stored as temporal databases. The characteristics of these temporal databases are: (1) each record is inhomogeneous with respect to time-series, including short-term and long-term effects; (2) each record has more than 1000 attributes when a patient is followed for more than one year; (3) when a patient is admitted for a long time, a large amount of data is stored in a very short term. Since even medical experts cannot deal with these large databases, the interest in mining useful information from the data is growing. In this paper, we introduce a combination of an extended moving average method and a rule induction method, called CEARI, to discover new knowledge in temporal databases. CEARI was applied to a medical dataset on Motor Neuron Diseases, the results of which show that interesting knowledge is discovered from each database.
1 Introduction
Since hospital information systems have been introduced in large hospitals, a large amount of data, including laboratory examinations, has been stored as temporal databases [11]. For example, in a university hospital, where more than 1000 patients visit from Monday to Friday, a database system stores more than 1 GB of numerical laboratory examination data. Thus, it is highly expected that data mining methods will find interesting patterns from databases, because medical experts cannot deal with such large amounts of data. The characteristics of these temporal databases are: (1) each record is inhomogeneous with respect to time-series, including short-term and long-term effects; (2) each record has more than 1000 attributes when a patient is followed for more than one year; (3) when a patient is admitted for a long time, a large amount of data is stored in a very short term. Since even medical experts cannot deal with these large temporal databases, the interest in mining useful information from the data is growing. In this paper, we introduce a combination of an extended moving average method and a rule induction method, called CEARI, to discover new knowledge in temporal databases. In the system, the extended moving average method is used
for preprocessing, to deal with the irregularity of the temporal data. Using several parameters for time-scaling, given by users, this moving average method generates a new database for each time scale with summarized attributes. Then, the rule induction method is applied to each new database with summarized attributes. CEARI was applied to two medical datasets, the results of which show that interesting knowledge is discovered from each database.
2 Temporal Databases in Hospital Information Systems
Since incorporating temporal aspects into databases is still an ongoing research issue in the database area [1], temporal data are stored as a table in hospital information systems (H.I.S.). Table 1 shows a typical example of medical data retrieved from an H.I.S. The first column denotes the ID number of each patient, and the second one denotes the date when the data in this row were examined. Each row with the same ID number describes the results of laboratory examinations, which were taken on the date in the second column. For example, the second row shows the data of the patient with ID 1 on 04/19/1986. This simple database shows the following characteristics of medical temporal databases.

(1) The number of attributes is too large. Even though the dataset of a patient focuses on the transition of each examination (attribute), it is difficult to see its trend when the patient is followed for a long time. If one wants to see the long-term interaction between attributes, it is almost impossible. In order to solve this problem, most H.I.S. systems provide several graphical interfaces to capture temporal trends [11]. However, the interactions among more than three attributes are difficult to study even if visualization interfaces are used.

(2) Irregularity of temporal intervals. Temporal intervals are irregular. Although most of the patients will come to the hospital every two weeks or one month, physicians may not make laboratory tests at each visit. When a patient has an acute fit or suffers from an acute disease, such as pneumonia, laboratory examinations will be made every one to three days. On the other hand, when his/her status is stable, these tests may not be made for a long time. Patient ID 1 is a typical example. Between 04/30 and 05/08/1986, he suffered from pneumonia and was admitted to a hospital. Then, during the therapeutic procedure, laboratory tests were made every few days. On the other hand, when he was stable, such tests were ordered every one or two years.

(3) Missing values. In addition to the irregularity of temporal intervals, datasets have many missing values. Even though medical experts will make laboratory examinations, they may not take the same tests at each instant. Patient ID 1 in Table 1 is a typical example. On 05/06/1986, the physician selected a specific test to confirm his diagnosis, so he did not choose other tests. On 01/09/1989, he focused only on GOT, not other tests. In this way, missing values are observed very often in clinical situations.

These characteristics have already been discussed in the KDD area [5]. However, in real-world domains, especially domains in which follow-up studies are crucial, such as medical domains, these ill-posed situations are especially prominent.
Table 1. An Example of Temporal Database

ID  Date      GOT  GPT  LDH  γ-GTP  TP   edema  ...
1   19860419  24   12   152  63     7.5         ...
1   19860430  25   12   162  76     7.9  +      ...
1   19860502  22   8    144  68     7.0  +      ...
1   19860506                                    ...
1   19860508  22   13   156  66     7.6         ...
1   19880826  23   17   142  89     7.7         ...
1   19890109  32                                ...
1   19910304  20   15   369  139    6.9  +      ...
2   19810511  20   15   369  139    6.9         ...
2   19810713  22   14   177  49     7.9         ...
2   19880826  23   17   142  89     7.7         ...
2   19890109  32                                ...
...
If one wants to describe each patient (record) as one row, then each row has too many attributes, the number of which depends on how many times laboratory examinations were made for each patient. It is notable that although the above discussion is based on medical situations, similar situations may occur in other domains with long-term follow-up studies.
3 Extended Moving Average Methods

3.1 Moving Average Methods
Averaging mean methods have been introduced in statistical analysis [6]. Temporal data often suffer from noise, which is observed as a spike or sharp wave during a very short period, typically at one instant. Averaging mean methods remove such an incidental effect and make temporal sequences smoother. With one parameter w, called the window, the moving average ŷ_w is defined as follows:

  ŷ_w = (1/w) Σ_{j=1}^{w} y_j .
For example, in the case of GOT of patient ID 1, ŷ_5 is calculated as: ŷ_5 = (24 + 25 + 22 + 22 + 22)/5 = 23.0. It is easy to see that ŷ_w will remove noise effects which last fewer than w points. The advantage of the moving average method is that it makes it possible to remove the noise effect when inputs are given periodically [6]. For example, when some tests are measured every several days¹, the moving average method is useful to remove the noise and to extract periodical domains. However, in real-world domains,
¹ This condition guarantees that the measurement is approximately continuous.
inputs are not always periodical, as shown in Table 1. Thus, when the time series to which they are applied are irregular or discrete, ordinary moving average methods are powerless. Another disadvantage of this method is that it is not applicable to categorical attributes. In the case of numerical attributes, the average can be used as a summarizing statistic. On the other hand, such an average cannot be defined for categorical attributes. Thus, we introduce the extended averaging method to solve these two problems in the subsequent subsections.
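For illustration, a minimal sketch (ours) of the plain moving average; it reproduces the value ŷ_5 = 23.0 computed above from the five GOT values used in the example.

    def moving_average(values, w):
        """Moving average over a window of w consecutive values."""
        return sum(values[:w]) / float(w)

    print(moving_average([24, 25, 22, 22, 22], 5))   # 23.0, as in the example above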
3.2 Extended Moving Average for Continuous Attributes
In this extension, we first focus on how moving average methods remove noise. The key idea is that the window parameter w is closely related to periodicity. If w is larger, then the periodical behavior whose time-constant is lower than w will be removed. Usually, a spike caused by noise is observed as a single event, and this effect will be removed when w is taken as a large value. Thus, the choice of w separates different kinds of time-constant behavior in each attribute, and in the extreme case when w is equal to the total number of temporal events, all the temporal behavior will be removed. We refer to this extreme case as w = ∞. The extended moving average method is executed as follows: it first calculates ŷ_∞ for an attribute y. Second, the method outputs its maximum and minimum values. Then, according to the selected values for w, a set of sequences {y_w(i)} for each record is calculated. For example, if {w} is equal to {10 years, 5 years, 1 year, 3 months, 2 weeks}, then for each element in {w}, the method uses the time-stamp attribute for the calculation of each {y_w(i)} in order to deal with irregularities. In the case of Table 1, when w is taken as 1 year, all the rows are aggregated into several components as shown in Table 2. From this aggregation, a sequence y_w for each attribute is calculated as in Table 3.

Table 2. Aggregation for w = 1 (year)

ID  Date      GOT  GPT  LDH  γ-GTP  TP   edema  ...
1   19860419  24   12   152  63     7.5         ...
1   19860430  25   12   162  76     7.9  +      ...
1   19860502  22   8    144  68     7.0  +      ...
1   19860506                                    ...
1   19860508  22   13   156  66     7.6         ...
1   19880826  23   17   142  89     7.7         ...
1   19890109  32                                ...
1   19910304  20   15   369  139    6.9  +      ...
...
Table 3. Moving Average for w = 1 (year)

ID  Period  GOT    GPT    LDH    γ-GTP  TP    edema  ...
1   1       23.25  11.25  153.5  68.25  7.5   ?      ...
1   2       23     17     142    89     7.7   ?      ...
1   3       32                                ?      ...
1   4                                         ?      ...
1   5       20     15     369    139    6.9   ?      ...
1   ∞       24     12.83  187.5  83.5   7.43  ?      ...
...
3.3 Categorical Attributes
One of the disadvantages of the moving average method is that it cannot deal with categorical attributes. To solve this problem, we classify categorical attributes into three types, whose information should be given by users. The first type is constant, which will not change during the follow-up period. The second type is ranking, which is used to rank the status of a patient. The third type is variable, which will change temporally, but for which ranking is not useful. For the first type, the extended moving average method will not be applied. For the second one, an integer is assigned to each rank and the extended moving average method for continuous attributes is applied. On the other hand, for the third one, the temporal behavior of the attribute is transformed into statistics as follows. First, the occurrence of each category (value) is counted for each window. For example, in Table 2, edema is a binary attribute and variable. In the first window, the attribute edema takes {-, +, +, -}.² So, the occurrences of − and + are 2 and 2, respectively. Then, each conditional probability is calculated. In the above example, the probabilities are equal to p(−|w1) = 2/4 and p(+|w1) = 2/4. Finally, for each probability, a new attribute is appended to the table (Table 4).
Table 4. Final Table with Moving Average for w = 1 (year)

ID  Period  GOT    GPT    LDH    γ-GTP  TP    edema(+)  edema(-)  ...
1   1       23.25  11.25  153.5  68.25  7.5   0.5       0.5       ...
1   2       23     17     142    89     7.7   0.0       1.0       ...
1   3       32                                0.0       1.0       ...
1   4                                         0.0       1.0       ...
1   5       20     15     369    139    6.9   1.0       0.0       ...
1   ∞       24     12.83  187.5  83.5   7.43  0.43      0.57      ...
...
² Missing values are ignored for counting.
Summary of Extended Moving Average. The whole process of the extended moving average is used to construct a new table for each window parameter as the first preprocessing step. Then, the second preprocessing method is applied to the newly generated tables. The first preprocessing method is summarized as follows (a code sketch is given after the list).

1. Repeat for each w in the list L_w:
   a) Select an attribute from the list L_a;
      i. If the attribute is numerical, then calculate the moving average for w;
      ii. If the attribute is constant, then break;
      iii. If the attribute is rank, then assign an integer to each ranking and calculate the moving average for w;
      iv. If the attribute is variable, calculate the frequency of each category;
   b) If L_a is not empty, go to (a).
   c) Construct a new table with each moving average.
2. Construct a table for w = ∞.
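As an illustration of this first preprocessing step, the sketch below (ours) aggregates the records of patient ID 1 with calendar years as the window; the record layout is an assumption made for the example, but the resulting values match the period-1 entries of Tables 3 and 4 (GOT 23.25, edema(+) 0.5, edema(-) 0.5).

    from collections import defaultdict

    records = [  # (year, {attribute: value}) for patient ID 1; None marks missing
        (1986, {"GOT": 24, "edema": "-"}), (1986, {"GOT": 25, "edema": "+"}),
        (1986, {"GOT": 22, "edema": "+"}), (1986, {"GOT": None, "edema": None}),
        (1986, {"GOT": 22, "edema": "-"}), (1988, {"GOT": 23, "edema": "-"}),
        (1989, {"GOT": 32, "edema": "-"}), (1991, {"GOT": 20, "edema": "+"}),
    ]

    def aggregate_by_year(records, numeric=("GOT",), categorical=("edema",)):
        windows = defaultdict(list)
        for year, row in records:
            windows[year].append(row)
        table = {}
        for year, rows in sorted(windows.items()):
            summary = {}
            for attr in numeric:                     # moving average, missing ignored
                vals = [r[attr] for r in rows if r[attr] is not None]
                summary[attr] = sum(vals) / float(len(vals)) if vals else None
            for attr in categorical:                 # frequency of each category
                vals = [r[attr] for r in rows if r[attr] is not None]
                for cat in set(vals):
                    summary["%s(%s)" % (attr, cat)] = vals.count(cat) / float(len(vals))
            table[year] = summary
        return table

    print(aggregate_by_year(records)[1986])   # GOT: 23.25, edema(+): 0.5, edema(-): 0.5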
4 Second Preprocessing and Rule Discovery

4.1 Summarizing Temporal Sequences
From the data table obtained by the extended moving average methods, several preprocessing methods may be applied in order for users to detect the temporal trends in each attribute. One way is discretization of time-series by clustering, introduced by Das [4]. This method transforms time-series into symbols representing qualitative trends by using a similarity measure. Then, the time-series data is represented as a symbolic sequence. After this preprocessing, a rule discovery method is applied to this sequential data. Another way is to find auto-regression equations from the sequence of averaging means. Then, these quantitative equations can be directly used to extract knowledge, or their qualitative interpretation may be used, and rule discovery [3], other machine learning methods [7], or rough set methods [9] can be applied to extract qualitative knowledge. In this research, we adopt two modes and transform the databases into two forms: one mode applies the temporal abstraction method [8] as second preprocessing and transforms all continuous attributes into temporal sequences; the other mode applies rule discovery to the data after the first preprocessing, without the second one. The reason why we adopted these two modes is that we focus not only on the temporal behavior of each attribute, but also on associations among several attributes. Although Miksch's method [8] and Das's approach [4] are very efficient at extracting knowledge about transitions, they cannot focus on associations between attributes in an efficient way. For the latter purpose, a much simpler rule discovery algorithm is preferred.
4.2 Continuous Attributes and Qualitative Trend
To characterize the deviation and temporal change of continuous attributes, we introduce standardization of continuous attributes. For this, we only need the
total average ŷ_∞ and its standard deviation σ_∞. With these parameters, the standardized value is obtained as:

  z_w = (y_w − ŷ_∞) / σ_∞ .

The reason why standardization is introduced is that it makes comparison between continuous attributes much easier and clearer; especially, statistical theory guarantees that the coefficients of an auto-regression equation can be compared with those of another equation [6]. After the standardization, an extraction algorithm for qualitative trends is applied [8]. This method is processed as follows. First, the method uses data smoothing with window parameters. Secondly, the smoothed values of each attribute are classified into seven categories given as domain knowledge about laboratory test values: extremely low, substantially low, slightly low, normal range, slightly high, substantially high, and extremely high. With these categories, qualitative trends are calculated and classified into the following ten categories by using guideline rules: decrease too fast (A1), normal decrease (A2), decrease too slow (A3), zero change (ZA), dangerous increase (C), increase too fast (B1), normal increase (B2), increase too slow (B3), dangerous decrease (D). For example, if the value of some laboratory test changes from substantially high to normal range within a very short time, the qualitative trend will be classified into A1 (decrease too fast). For further information, please refer to [8].
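A minimal sketch (ours) of this standardization; the value of σ_∞ below is assumed purely for illustration, while ŷ_∞ = 24 and y_w = 23.25 are taken from Table 3.

    def standardize(y_w, y_inf, sigma_inf):
        """z_w = (y_w - y_inf) / sigma_inf"""
        return (y_w - y_inf) / sigma_inf

    # GOT: window average 23.25, overall average 24, sigma assumed to be 3.8
    print(standardize(23.25, 24.0, 3.8))   # about -0.20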
4.3 Rule Discovery Algorithm
For rule discovery, a simple rule induction algorithm discussed in [10] is applied, where continuous attributes are transformed into categorical attributes with a cut-off point. As discussed in Section 3, the moving average method removes temporal effects shorter than the window parameter. Thus, w = ∞ removes all the temporal effects, so this moving average can be viewed as data without any temporal characteristics. If rule discovery is applied to this data, it will generate rules which represent non-temporal associations between attributes. In this way, data after processing the w-moving average is used to discover associations with a time-effect of w or longer. Ideally, from w = ∞ down to w = 1, we decompose all the independent time-effect associations between attributes. However, the time-constants in which users are interested will be limited, and the moving average method shown in Section 3 uses a set of w given by users. Thus, application of rule discovery to each table will generate a sequence of temporal associations between attributes. If some temporal associations differ from the associations obtained with w = ∞, then these specific relations may correspond to a new discovery.
4.4 Summary of Second Preprocessing and Rule Discovery
The second preprocessing method and rule discovery are summarized as follows.
1. Calculate ŷ_∞ and σ_∞ from the table of w = ∞;
2. Repeat for each w in the list L_w (w is sorted in descending order):
   a) Select the table of w: T_w;
      i. Standardize continuous and ranking attributes;
      ii. Calculate qualitative trends for continuous and ranking attributes;
      iii. Construct a new table for the qualitative trends;
      iv. Apply the rule discovery method for temporal sequences;
   b) Apply rule induction methods to the original table T_w;
5 Experimental Results
The above rule discovery system is implemented in CEARI (Combination of Extended Moving Average and Rule Induction). CEARI was applied to a clinical database on motor neuron diseases, which consists of 1477 samples and 3 classes. Each patient is followed for 15 years. The list of w, {w}, was set to {10 years, 5 years, 1 year, 3 months, 2 weeks}, and the thresholds δ_P(D|R) and δ_P(R|D) were set to 0.60 and 0.30, respectively. One of the most interesting problems of motor neuron diseases (MND) is how long it takes each patient to suffer from respiratory failure, which is the main cause of death.³ It is empirically known that some types of MND are more progressive than other types and that their survival period is much shorter than others. The database for this analysis describes all the data of patients suffering from MND.

Non-temporal Knowledge. The most interesting discovered rules are:
  [Major Pectolis < 3] → [PaCO2 > 50]  (P(D|R): 0.87, P(R|D): 0.57),
  [Minor Pectolis < 3] → [PaO2 < 61]  (P(D|R): 0.877, P(R|D): 0.65).
Both rules mean that if some of the muscles of the chest, called Major Pectolis and Minor Pectolis, are weak, then the respiratory function is low, which suggests that the muscle power of the chest is closely related to the respiratory function, although these muscles are not directly used for respiration.

Short-Term Effect. Several interesting rules are discovered:
  [Major Pectolis = 2] → [PaO2 : D]  (P(D|R): 0.72, P(R|D): 0.53, w = 3 months),
  [Biceps < 3] → [PaO2 : A2]  (P(D|R): 0.82, P(R|D): 0.62, w = 3 months),
  [Biceps > 4] → [PaO2 : ZA]  (P(D|R): 0.88, P(R|D): 0.72, w = 3 months).
³ The prognosis of MND is generally not good, and most of the patients will die within ten years because of respiratory failure. The only way to survive is to use an automatic ventilator [2].
These rules suggest that if the power of the muscles around the chest is low, then the respiratory function will decrease within one year, and that if the power of the muscles in the arms is low, then the respiratory function will decrease within a few years.

Long-Term Effect. The following interesting rules are discovered:
  [Major Pectolis : A3] ∧ [Quadriceps : A3] → [PaO2 : A3]  (P(D|R): 0.85, P(R|D): 0.53, w = 1 year),
  [Gastro : A3] → [PaO2 : A3]  (P(D|R): 0.87, P(R|D): 0.52, w = 1 year).
These rules suggest that if the power of the muscles of the legs changes very slowly, then the respiratory function will also decrease very slowly.

In summary, the system discovers that the power of the muscles around the chest and its chronological characteristics are very important for predicting the respiratory function and how long it takes for a patient to reach respiratory failure.
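The two numbers attached to each rule can be read as the precision P(D|R) and the coverage P(R|D) of the rule. The following sketch (ours, over a small invented table; the attribute name "muscle" and the values are illustrative only) shows how they are obtained by counting.

    def rule_quality(rows, condition, decision):
        """P(D|R): fraction of rows satisfying the rule that belong to class D.
           P(R|D): fraction of class-D rows that satisfy the rule."""
        covered = [r for r in rows if condition(r)]
        positives = [r for r in rows if decision(r)]
        p_d_given_r = sum(decision(r) for r in covered) / float(len(covered))
        p_r_given_d = sum(condition(r) for r in positives) / float(len(positives))
        return p_d_given_r, p_r_given_d

    toy = [{"muscle": 2, "PaCO2": 55}, {"muscle": 2, "PaCO2": 48},
           {"muscle": 4, "PaCO2": 41}, {"muscle": 1, "PaCO2": 52},
           {"muscle": 5, "PaCO2": 39}]
    print(rule_quality(toy,
                       condition=lambda r: r["muscle"] < 3,
                       decision=lambda r: r["PaCO2"] > 50))
    # (0.666..., 1.0): two of the three covered rows have PaCO2 > 50,
    # and both rows with PaCO2 > 50 are covered by the rule.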
References
1. Abiteboul, S., Hull, R., and Vianu, V. Foundations of Databases, Addison-Wesley, New York, 1995.
2. Adams, R.D. and Victor, M. Principles of Neurology, 5th edition, McGraw-Hill, NY, 1993.
3. Agrawal, R., Imielinski, T., and Swami, A., Mining association rules between sets of items in large databases, in Proceedings of the 1993 International Conference on Management of Data (SIGMOD 93), pp. 207-216, 1993.
4. Das, G., Lin, K.I., Mannila, H., Renganathan, G. and Smyth, P. Rule discovery from time series. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 16-22, 1998.
5. Fayyad, U.M., et al. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996.
6. Hamilton, J.D. Time Series Analysis, Princeton University Press, 1994.
7. Langley, P. Elements of Machine Learning, Morgan Kaufmann, CA, 1996.
8. Miksch, S., Horn, W., Popow, C., and Paky, F. Utilizing temporal data abstraction for data validation and therapy planning for artificially ventilated newborn infants. Artificial Intelligence in Medicine, 8, 543-576, 1996.
9. Tsumoto, S. and Tanaka, H., PRIMEROSE: Probabilistic Rule Induction Method based on Rough Sets and Resampling Methods. Computational Intelligence, 11, 389-405, 1995.
10. Tsumoto, S. Knowledge Discovery in Medical MultiDatabases: A Rough Set Approach, Proceedings of PKDD'99 (in this issue), 1999.
11. Van Bemmel, J. and Musen, M.A. Handbook of Medical Informatics, Springer-Verlag, New York, 1997.
Simultaneous Prediction of Multiple Chemical Parameters of River Water Quality with TILDE

Hendrik Blockeel (1), Sašo Džeroski (2), and Jasna Grbović (3)

(1) Katholieke Universiteit Leuven, Dept. of Computer Science, Celestijnenlaan 200A, B-3001 Heverlee, Belgium, [email protected]
(2) Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia, [email protected]
(3) Hydrometeorological Institute, Vojkova 1b, SI-1000 Ljubljana, Slovenia, [email protected]

Abstract. Environmental studies form an increasingly popular application domain for machine learning and data mining techniques. In this paper we consider two applications of decision tree learning in the domain of river water quality: a) the simultaneous prediction of multiple physico-chemical properties of the water from its biological properties using a single decision tree (as opposed to learning a different tree for each different property) and b) the prediction of past physico-chemical properties of the river water from its current biological properties. We discuss some experimental results that we believe are interesting both to the application domain experts and to the machine learning community.
1 Introduction
The quality of surface waters, including rivers, depends on their physical, chemical and biological properties. The latter are reflected by the types and densities of living organisms present in the water. Based on the above properties, surface waters are classified into several quality classes which indicate the suitability of the water for different kinds of use (drinking, swimming, . . . ). It is well known that the physico-chemical properties give a limited picture of water quality at a particular point in time, while living organisms act as continuous monitors of water quality over a period of time [6]. This has increased the relative importance of biological methods for monitoring water quality, and many different methods for mapping biological data to discrete quality classes or continuous scales have been developed [7]. Most of these approaches use indicator organisms (bioindicator taxa), which have well known ecological requirements and are selected for their sensitivity / tolerance to various kinds of pollution. Given a biological sample, information on the presence and density of all indicator organisms present in the sample is usually combined to derive a biological index that reflects the quality of the water at the site where the sample was taken. Examples are the Saprobic Index [14], used in many countries of Central Europe, and the Biological Monitoring Working Party Score (BMWP) [13] and its derivative Average Score Per Taxon (ASPT), used in the United Kingdom.
The main problem with the biological indices described above is their subjectivity [18]. The computation of these indices makes use of weights and other numbers that were assigned to individual bioindicators by (committees of) expert biologists and ecologists and are based on the experts’ knowledge about the ecological requirements of the bioindicators, which is not always complete. The assigned bioindicator values are thus subjective and often inappropriate [19]. An additional layer of subjectivity is added by combining the scores of the individual bioindicators through ad-hoc procedures based on sums, averages, and weighted averages instead of using a sound method of combination. While a certain amount of subjectivity cannot be avoided (water quality itself is a subjective measure, tuned towards the interests humans have in river water), this subjectivity should only appear at the target level (classification) and not at the intermediate levels described above. This may be achieved by gaining insight into the relationships between biological, physical and chemical properties of the water and its overall quality, which is currently a largely open research topic. To this aim data mining techniques can be employed [18,11,9]. The importance of gaining such insight stretches beyond water quality prediction. E.g., the problem of inferring chemical parameters from biological ones is practically relevant, especially in countries where extensive biological monitoring is conducted. Regular monitoring for a very wide range of chemical pollutants would be very expensive, if not impossible. On the other hand, biological samples may, for example, reflect an increase in pollution and indicate likely causes or sources of (chemical) pollution. The work described in this paper is situated at this more general level. The remainder of the paper is organized as follows. Section 2 describes the goals of this study and the difference with earlier work. Section 3 describes the available data and the experimental setup. Section 4 describes the machine learning tool that was used in these experiments. Section 5 presents in detail the experiments and their results and in Sect. 6 we conclude.
2 Goals of This Study
In earlier work [10,11] machine learning techniques have been applied to the task of inferring biological parameters from physico-chemical ones by learning rules that predict the presence of individual bioindicator taxa from the values of physico-chemical measurements, and to the task of inferring physico-chemical parameters from biological ones [9]. Dˇzeroski et al. [9] discuss the construction of predictive models that allow prediction of a specific physico-chemical parameter from biological data. For each parameter a different regression tree is built using Quinlan’s M5 system [17]. A comparison with nearest neighbour and linear regression methods shows that the induction of regression trees is competitive with the other approaches as far as predictive accuracy is concerned, and moreover has the advantage of yielding interpretable theories. A comparison of the different trees shows that the trees for different target variables are often similar, and that some of the taxa occur in many trees (i.e.,
they are sensitive to many physico-chemical properties). This raises the question whether it would be possible to predict many or all of the properties with only one (relatively simple) tree, and without significant loss in predictive accuracy. As such, this application seems a good test case for recent research on simultaneous prediction of multiple variables [1]. A second extension with respect to the previous work is the prediction of past physico-chemical properties of the water; more specifically, the maximal, minimal and average values of these properties over a period of time. As mentioned before, physico-chemical properties of water give a very momentary view of the water quality; watching these properties over a longer period of time may alleviate this problem. This is the second scientific issue we investigate in this paper.
3 The Data
The data set we have used is the same one as used in [9]. The data come from the Hydrometeorological Institute of Slovenia (HMZ) that performs water quality monitoring for Slovenian rivers and maintains a database of water quality samples. The data cover a six year period (1990–1995). Biological samples are taken twice a year, once in summer and once in winter, while physical and chemical samples are taken more often (periods between measurements varying from one to several months) for each sampling site. The physical and chemical samples include the measured values of 16 different parameters: biological oxygen demand (BOD), electrical conductivity, chemical oxygen demand (K2 Cr2 O7 and KMnO4 ), concentrations of Cl, CO2 , NH4 , PO4 , SiO2 , NO2 , NO3 and dissolved oxygen (O2 ), alkalinity (pH), oxygen saturation, water temperature, and total hardness. The biological samples include a list of all taxa present at the sampling site and their density. The frequency of occurrence (density) of each present taxon is recorded by an expert biologist at three different qualitative levels: 1=incidentally, 3=frequently and 5=abundantly. Our data are stored in a relational database represented in Prolog; in Prolog terminology each relation is a predicate and each tuple is a fact. The following predicates are relevant for this text: – chem(Site, Year, Month, Day, ListOf16Values) : this predicate contains all physico-chemical measurements. It consists of 2580 facts. – bio(Site, Day, Month, Year, ListOfTaxa): this predicate lists the taxa that occur in a biological sample; ListOfTaxa is a list of couples (taxon, abundancelevel) where the abundance level is 1, 3 or 5 (taxa that do not occur are simply left out of the list). This predicate contains 1106 facts. Overall the data set is quite clean, but not perfectly so. 14 physico-chemical measurements have missing values; moreover, although biological measurements are usually taken on exactly the same day as some physico-chemical measurement, for 43 biological measurements no physico-chemical data for the same day are available. Since this data pollution is very limited, we have just disregarded the examples with missing values in our experiments. This leaves a total of 1060
water samples for which complete biological and physico-chemical information is available; our experiments are conducted on this set.
4 Predictive Clustering and TILDE
Building a model for simultaneous prediction of many variables is strongly related to clustering. Indeed, clustering systems are often evaluated by measuring the average predictability of attributes, i.e., how well the attributes of an object can be predicted given that it belongs to a certain cluster (see, e.g., [12]). In our context, the predictive modelling can then be seen as clustering the training examples into clusters with small intra-cluster variance, where this variance is measured as the sum of the variances of the individual variables that are to be predicted, or equivalently: as the mean squared Euclidean distance of the instances to their mean in the prediction space. More formally: given a cluster C consisting of n examples e_i that are each labelled with a target vector x_i ∈ IR^D, the intra-cluster variance of C is defined as

  σ²_C = (1/n) · Σ_{i=1}^{n} (x_i − x̄)′(x_i − x̄) ,    (1)

where x̄ = (1/n) Σ_{i=1}^{n} x_i. (We assume the target vector to have only numerical components here, as is the case in our application; in general however predictive clustering can also be used for nominal targets (i.e., classification), see [1].)

In our experiments we used the decision tree learner TILDE [2,3]. TILDE is an ILP system¹ that induces first-order logical decision trees (FOLDTs). Such trees are the first-order equivalent of classical decision trees [2]. TILDE can induce classification trees, regression trees and clustering trees and can handle both attribute-value data and structural data. It uses the basic TDIDT algorithm [16], in its clustering or regression mode employing as heuristic the variance described above. The system is fit for our experiments for the following reasons:
– Most machine learning and data mining systems that induce predictive models can handle only single target variables (e.g., C4.5 [15], CART [5], M5 [17], . . . ). Building a predictive model for a multi-dimensional prediction space can be done using clustering systems, but most clustering systems consider clustering as a descriptive technique, where evaluation criteria are still slightly different from the ones we have here. (Using terminology from [12], descriptive systems try to maximise both predictiveness and predictability of attributes, whereas predictive systems maximise predictability of the attributes belonging to the prediction space.)
Inductive logic programming (ILP) is a subfield of machine learning where first order logic is used to represent data and hypotheses. First order logic is more expressive than the attribute value representations that are classically used by machine learning and data mining systems. From a relational database point of view, ILP corresponds to learning patterns that extend over multiple relations, whereas classical (propositional) methods can find only patterns that link values within the same tuple of a single relation to one another. We refer to [8] for details.
– Although the problem at hand is not, strictly speaking, an ILP problem (i.e., it can be transformed into attribute-value format; the number of different attributes would become large but not unmanageable for an attribute-value learner), the use of an ILP learner has several advantages: – No data preprocessing is needed: the data can be kept in their original, multi-relational format. This was especially advantageous for us because the experiments described here are part of a broader range of experiments, many of which would demand different and extensive preprocessing steps. – Prolog offers the same querying capabilities as relational databases, which allows for non-trivial inspection of the data (e.g., counting the number of times a biological measurement is accompanied by at least 3 physicochemical measurements during the last 2 months, . . . ) The main disadvantage of ILP systems, compared to attribute-value learners, is their low efficiency. For our experiments however this inefficiency was not prohibitive and amply compensated by the additional flexibility ILP offers.
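A short sketch (ours) of the heuristic of Eq. (1): the intra-cluster variance in the prediction space, i.e. the mean squared Euclidean distance of the target vectors to their mean, which equals the sum of the per-variable variances.

    def intra_cluster_variance(targets):
        """targets: list of equally long target vectors (one per example in the cluster)."""
        n, d = len(targets), len(targets[0])
        mean = [sum(x[j] for x in targets) / float(n) for j in range(d)]
        return sum(sum((x[j] - mean[j]) ** 2 for j in range(d)) for x in targets) / float(n)

    cluster = [[0.2, -1.0, 0.5], [0.4, -0.8, 0.7], [0.0, -1.2, 0.3]]
    print(intra_cluster_variance(cluster))   # 0.08: the three per-variable variances summed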
5 Experiments
TILDE was consistently run with default parameters, except for one parameter controlling the minimal number of instances in each leaf, which was set to 20. From preliminary experiments this value is known to combine high accuracy with reasonable tree size. All results are obtained using 10-fold cross-validation.
5.1 Multi-valued Predictions
For this experiment we have run TILDE with two settings: predicting a single variable at a time (the results of which serve as a reference for the other setting), and predicting all variables simultaneously. When predicting all variables at once, the variables were first standardised (z_x = (x − µ_x)/σ_x with µ_x the mean and σ_x the standard deviation); this ensures that all target variables will be considered equally important for the prediction.² As a bonus the results are more interpretable for non-experts; e.g., "BOD=16.0" may not tell a non-expert much, but a standardised score of +1 always means "relatively high". The predictive quality of the tree for each single variable is measured as the correlation of the predictions with the actual values. Table 1 shows these correlations; correlations previously obtained with M5.1 [9] are given as reference. It is clear from the table that overall, the multi-prediction tree performs approximately as well as the set of 16 single trees. For a few variables there is a clear decrease in predictive performance (T, NO2, NO3), but surprisingly this effect is compensated for by a gain in accuracy for other variables (conductivity, CO2,
Since the system minimises total variance, i.e. the sum of the variances of each single variable, the “weight” of a single variable is proportional to its variance; standardisation gives all variables an equal variance of 1.
Table 1. Comparison of predictive quality of a single tree predicting all variables at once with that of a set of 16 different trees, each predicting one variable.

variable   TILDE, all variables (r)  TILDE, single variable (r)  M5.1, single variable (r)
T          0.482                     0.563                       0.561
pH         0.353                     0.356                       0.397
conduct.   0.538                     0.464                       0.539
O2         0.513                     0.523                       0.484
O2-sat.    0.459                     0.460                       0.424
CO2        0.407                     0.335                       0.405
hardness   0.496                     0.475                       0.475
NO2        0.330                     0.417                       0.373
NO3        0.265                     0.349                       0.352
NH4        0.500                     0.489                       0.664
PO4        0.441                     0.445                       0.461
Cl         0.603                     0.602                       0.570
SiO2       0.369                     0.400                       0.411
KMnO4      0.509                     0.435                       0.546
K2Cr2O7    0.561                     0.514                       0.602
BOD        0.640                     0.605                       0.652
avg        0.467                     0.465                       0.498

[Not reproduced: an example multi-prediction tree with tests on the abundance of Chironomus thummi and Chlorella vulgaris, whose leaves predict standardised values for all 16 physico-chemical variables.]
[Not fully reproduced: the rules induced by FOIL and FOIL-ND for appl c45, appl cn2 and appl knn, expressed in terms of dataset properties such as kurtosis, attribute entropy, class entropy, the percentage of missing values, mutual information, and the numbers of (discrete) attributes and examples.]

Table 5. Concepts induced with TILDE.

appl c45
  … > 0.991231 ?
  +--yes: yes [9 / 9]
  +--no:  num of bin attrs(A,D), D > 13 ?
          +--yes: yes [2 / 2]
          +--no:  no [9 / 9]

appl cn2
  attr kurtosis(A,C,D), D > 22.7079 ?
  +--yes: no [5 / 5]
  +--no:  attr class mutual inf(A,E,F), F > 0.576883 ?
          +--yes: kurtosis(A,G), G > 3.87752 ?
          |       +--yes: yes [7 / 7]
          |       +--no:  num of examples(A,H), H > 270 ?
          |               +--yes: yes [3 / 3]
          |               +--no:  no [3 / 3]
          +--no:  no [2 / 2]

appl knn
  num of attrs(A,C), C > 19 ?
  +--yes: no [4 / 4]
  +--no:  num of examples(A,D), D > 57 ?
          +--yes: num of bin attrs(A,E), E > 15 ?
          |       +--yes: no [1 / 1]
          |       +--no:  yes [12 / 13]
          +--no:  no [2 / 2]
The concepts induced with the ILP system TILDE are presented in Table 5. The only concept based on the property of a single attribute (kurtosis of a single attribute and mutual information between the class and the attribute) is the one for the applicability of CN2.

Table 6. Accuracy of the meta-level models measured using the leave-one-out method.

Dataset    C4.5   CN2    k-NN   FOIL   FOIL-ND  TILDE  default
appl c45   16/20  16/20  14/20  16/20  7/20     18/20  11/20
appl cn2   9/20   5/20   11/20  9/20   13/20    9/20   0/20
appl knn   9/20   11/20  9/20   10/20  11/20    14/20  12/20
Sum        34/60  32/60  34/60  35/60  31/60    41/60  23/60
Finally, the results of the leave-one-out experiments are summarized in Table 6. Note that the model induced in each leave-one-out experiment can differ from the others (and from the ones presented in Tables 4 and 5), but the accuracy of the classifiers was our primary interest in these experiments. It can
be seen from the table that FOIL has slightly better, and FOIL-ND comparable, accuracy with respect to the propositional machine learning systems. TILDE outperforms the other machine learning systems on two out of three meta-learning tasks.
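A generic sketch (ours) of the leave-one-out protocol behind Table 6: each of the twenty dataset descriptions is held out once, a meta-level model is induced from the remaining nineteen, and the prediction for the held-out dataset is scored. The train and predict callables stand for whichever learner (C4.5, CN2, k-NN, FOIL, FOIL-ND or TILDE) is being evaluated and are placeholders here.

    def leave_one_out_accuracy(examples, labels, train, predict):
        """examples: meta-level dataset descriptions; labels: applicability (yes/no)."""
        correct = 0
        for i in range(len(examples)):
            train_x = examples[:i] + examples[i + 1:]
            train_y = labels[:i] + labels[i + 1:]
            model = train(train_x, train_y)
            if predict(model, examples[i]) == labels[i]:
                correct += 1
        return correct, len(examples)     # e.g. (18, 20) for TILDE on appl c45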
4 Discussion
The work presented in this paper extends the work already done in the area of meta-learning in several ways. First, an ILP framework for meta-level learning is introduced. It extends the methodology for dataset description used in [3] with non-propositional constructs which are not allowed when using propositional classification systems for meta-level learning. The ILP framework incorporates measures for individual attributes in the dataset description. The ILP framework is also open to incorporating prior expert knowledge about the applicability of classification algorithms. Also, all the datasets used in the experiments are in the public domain, so the experiments can be repeated. This was not the case with the StatLog dataset repository, where more than half of the datasets used are not publicly available. Another improvement is the use of a unified methodology for measuring the error rate of the different classification algorithms and the optimization of their parameters.

The ILP framework used in this paper was built to include the measures used in the state-of-the-art meta-learning studies. It can be extended in several different ways. Besides including other, more complex statistical and information-theoretic measures, it can also be extended with properties measured for any subset of attributes or examples in the dataset. Individual examples or sets of examples from the dataset can also be included in the description.

From the preliminary results, based on experiments with only twenty datasets, it is hard to draw strong conclusions about the usability of the ILP framework for meta-level learning. The obtained models may capture some chance regularities besides the relevant ones. However, the results of the leave-one-out evaluation show a slight improvement in classification accuracy when using an ILP description of the datasets. This improvement should be further investigated and tested for statistical significance by performing experiments on other datasets from the UCI repository. To obtain a larger dataset for meta-level learning, experiments with artificial datasets should also be performed in the future.
Acknowledgments This work was supported in part by the Slovenian Ministry of Science and Technology and in part by the European Union through the ESPRIT IV Project 20237 Inductive Logic Programming 2. We greatly appreciate the comments of two anonymous reviewers of the proposed version of the paper.
References 1. Aha, D. (1992) Generalising case studies: a case study. In Proceedings of the 9th International Conference on Machine Learning, pages 1–10. Morgan Kaufmann. 2. Blockeel, H. and De Raedt, L. (1998) Top-down induction of first order logical decision trees. Artificial Intelligence, 101(1–2): 285–297. 3. Brazdil, P. B. and Henery, R. J. (1994) Analysis of Results. In Michie, D., Spiegelhalter, D. J., and Taylor, C. C., editors: Machine learning, neural and statistical classification. Ellis Horwood. 4. Clark, P. and Boswell, R. (1991) Rule induction with CN2: Some recent improvements. In Proceedings of the Fifth European Working Session on Learning, pages 151–163. Springer. 5. Dˇzeroski, S., Cestnik, B. and Petrovski, I. (1993) Using the m-estimate in rule induction. Journal of Computing and Information Technology, 1:37–46. 6. Kalousis, A. and Theoharis, T. (1999) NEOMON: An intelligent assistant for classifier selection. In Proceedings of the ICML-99 Workshop on Recent Advances in Meta-Level Learning and Future Work, pages 28–37. 7. Murphy, P. M. and Aha, D. W. (1994) UCI repository of machine learning databases [http://www.ics.uci.edu/˜mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science. 8. Quinlan, J. R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann. 9. Quinlan, J. R. and Cameron-Jones, R. M. (1993) FOIL: A midterm report. In Brazdil, P., editor: Proceedings of the 6th European Conference on Machine Learning, volume 667 of Lecture Notes in Artificial Intelligence, pages 3–20. Springer-Verlag. 10. Wettschereck, D. (1994) A study of distance-based machine learning algorithms. PhD Thesis, Department of Computer Science, Oregon State University, Corvallis, OR.
Boolean Reasoning Scheme with Some Applications in Data Mining

Andrzej Skowron and Hung Son Nguyen
Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
Email: {skowron,son}@mimuw.edu.pl
Abstract. We present a general encoding scheme for a wide class of problems (including, among others, problems such as data reduction, feature selection, feature extraction, decision rule generation, pattern extraction from data, and conflict resolution in multi-agent systems) and we show how to combine it with propositional (Boolean) reasoning to develop efficient heuristics searching for (approximate) solutions of these problems. We illustrate our approach by examples, we show some experimental results and compare them with those reported in the literature. We also show that association rule generation is strongly related to reduct approximation.
1 Introduction
We discuss a representation scheme for a wide class of problems, including problems from such areas as decision support [14], [9], machine learning, data mining [4], and conflict resolution in multi-agent systems [10]. On the basis of the representation scheme we construct (monotone) Boolean functions with the following property: their prime implicants [3] (minimal valuations satisfying propositional formulas) directly correspond to the problem solutions (compare the idea of George Boole from 1848, discussed e.g. in [3]). In all these cases the implicants close to prime implicants define approximate solutions of the considered problems (compare the discussion on Challenge 9 in [12]). The results show that efficient heuristics for feature selection, feature extraction, and pattern extraction from data can be developed using Boolean propositional reasoning. Moreover, the experiments show that these heuristics can give better results, with respect to classification quality and/or the time necessary for learning (discovery), than those derived using other methods. Our experience shows that formulating problems in the Boolean reasoning framework provides a promising methodology for developing very efficient heuristics for solving real-life problems in many areas. Let us also mention applications of Boolean reasoning in other areas such as negotiation and conflict resolution in multi-agent systems [10]. Because of the lack of space, we illustrate the approach using two examples related to symbolic value grouping and association rule extraction in data mining (or machine learning) problems.
2 Basic Notions
An information system is a pair S = (U, A), where U is a non-empty, finite set called the universe and A is a non-empty, finite set of attributes, i.e., a : U → Va for a ∈ A, where Va is called the value set of a. Elements of U are called situations and are interpreted as, e.g., cases, states, patients, or observations. The set V = ⋃_{a∈A} Va is said to be the domain of A. A decision table is any information system of the form S = (U, A ∪ {d}), where d ∉ A is a distinguished attribute called the decision. The elements of A are called conditional attributes (conditions). In a given information system we are, in general, not able to distinguish all pairs of situations (objects) using the attributes of the system: different situations can have the same values on the considered attributes. Hence, any set of attributes divides the universe U into classes which establish a partition [9] of the set of all objects U. With any subset of attributes B ⊆ A we associate a binary relation ind(B), called an indiscernibility relation, defined by ind(B) = {(u, u′) ∈ U × U : a(u) = a(u′) for every a ∈ B}. The B-discernibility relation is defined to be the complement of ind(B) in U × U.
Let S = (U, A) be an information system, where A = {a1, ..., am}. Pairs (a, v) with a ∈ A, v ∈ V are called descriptors. By DESC(A, V) we denote the set of all descriptors over A and V. Instead of (a, v) we also write a = v or a_v. One can assign a Boolean variable to any descriptor. The set of terms over A and V is the least set containing the descriptors (over A and V) and closed with respect to the classical propositional connectives ¬ (negation), ∨ (disjunction), and ∧ (conjunction), i.e.:
1. Any descriptor (a, v) ∈ DESC(A, V) is a term over A and V.
2. If τ, τ′ are terms over A and V, then ¬τ, (τ ∨ τ′), (τ ∧ τ′) are terms over A and V too.
The meaning ‖τ‖_S (or, in short, ‖τ‖) of τ in S is defined inductively as follows: ‖(a, v)‖ = {u ∈ U : a(u) = v} for a ∈ A and v ∈ Va; ‖(τ ∨ τ′)‖ = ‖τ‖ ∪ ‖τ′‖; ‖(τ ∧ τ′)‖ = ‖τ‖ ∩ ‖τ′‖; ‖¬τ‖ = U − ‖τ‖. Two terms τ and τ′ are equivalent, τ ⇔ τ′, if and only if ‖τ‖ = ‖τ′‖. In particular we have: ¬(a = v) ⇔ ∨{a = v′ : v′ ≠ v and v′ ∈ Va}.
Information systems (decision tables) are representations of the knowledge bases discussed in the Introduction: rows correspond to consistent sets of propositional variables defined by all descriptors a = v, where v is the value of attribute a in a given situation; conflicting pairs are, in the case of information systems, all pairs of situations which are discernible by some attributes.
Let S = (U, A) be an information system, where U = {u1, ..., un} and A = {a1, ..., am}. By M(S) we denote an n × n matrix (c_ij), called the discernibility matrix of S, such that c_ij = {a ∈ A : a(ui) ≠ a(uj)} for i, j = 1, ..., n. With every discernibility matrix M(S) one can associate a discernibility function f_M(S), defined as follows. A discernibility function f_M(S) for an information system S is a Boolean function of m propositional variables a*_1, ..., a*_m (where a_i ∈ A for i = 1, ..., m) defined as the conjunction of all expressions ∨ c*_ij, where ∨ c*_ij is the disjunction of all elements of c*_ij = {a* : a ∈ c_ij}, for 1 ≤ j < i ≤ n and c_ij ≠ ∅. In the sequel we write a instead of a*. One can show that every prime implicant of f_M(S)(a*_1, ..., a*_m) corresponds exactly to one reduct of S. One can also see that a set B ⊂ A is a reduct if B has a non-empty intersection with every non-empty set c_ij, i.e.,

B is a reduct of S  iff  ∀i,j (c_ij = ∅) ∨ (B ∩ c_ij ≠ ∅).
One can show that the prime implicants of the discernibility function correspond exactly to the reducts of the information system [9], [14]. Hence, Boolean reasoning can be used for information reduction. This can be extended to feature selection and decision rule synthesis (see e.g. [2], [9]). One can show that the problem of finding a minimal (with respect to cardinality) reduct is NP-hard [14]. In general, the number of reducts of a given information system can be exponential with respect to the number of attributes (more exactly, any information system S has at most C(m, ⌊m/2⌋) reducts, where m = card(A)). Nevertheless, existing procedures for reduct computation are efficient in many applications, and in many cases one can apply efficient heuristics (see e.g. [2]). Moreover, in some applications (see [13]), instead of reducts we prefer to use their approximations, called α-reducts, where α ∈ [0, 1] is a real parameter. A set of attributes B ⊂ A is called an α-reduct in S iff

|{c_ij : B ∩ c_ij ≠ ∅}| / |{c_ij : c_ij ≠ ∅}| ≥ α.

One can show that, for a given α, the problems of searching for the shortest α-reduct and for all α-reducts are also NP-hard. Let us note that, e.g., the simple greedy Johnson strategy for computing implicants close to prime implicants of the discernibility function has time complexity of order O(k²n³), where n is the number of objects and k is the number of attributes. Hence, for large n this heuristic will not be feasible. We will show how to construct more efficient heuristics in the case when some additional knowledge is given about the problem encoded by the information system or decision table.
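To make the reduct search concrete, the sketch below (not from the paper; a minimal Python illustration under the assumption that the discernibility matrix fits in memory) builds the non-empty discernibility entries and applies the greedy Johnson strategy: repeatedly pick the attribute occurring in the largest number of still-uncovered entries, stopping when the required fraction α of entries is covered (α = 1 gives an ordinary reduct candidate).

```python
from itertools import combinations

def discernibility_entries(objects, attributes):
    """Non-empty entries c_ij = {a : a(u_i) != a(u_j)} of the discernibility matrix."""
    entries = []
    for u, v in combinations(objects, 2):
        cell = frozenset(a for a in attributes if u[a] != v[a])
        if cell:
            entries.append(cell)
    return entries

def johnson_alpha_reduct(entries, attributes, alpha=1.0):
    """Greedy Johnson strategy: cover at least alpha * |entries| discernibility entries."""
    needed = alpha * len(entries)
    uncovered = list(entries)
    chosen = set()
    while len(entries) - len(uncovered) < needed and uncovered:
        # pick the attribute hitting the largest number of uncovered entries
        best = max(attributes, key=lambda a: sum(a in c for c in uncovered))
        chosen.add(best)
        uncovered = [c for c in uncovered if best not in c]
    return chosen

# toy usage: objects described by attribute dictionaries
objects = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 1}]
entries = discernibility_entries(objects, ["a", "b"])
print(johnson_alpha_reduct(entries, ["a", "b"], alpha=1.0))
```

The attribute set returned by the greedy loop is only a heuristic cover of the discernibility entries, not necessarily a minimal reduct, which matches the role of the Johnson strategy described above.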
3 Feature Extraction by Grouping of Symbolic Values
In the case of symbolic value attributes (i.e., without a pre-assumed order on the values of the attributes), the problem of searching for new features of the form a ∈ V is, in a sense, more complicated from the practical point of view than for real value attributes. However, it is possible to develop efficient heuristics for this case using Boolean reasoning. Let S = (U, A ∪ {d}) be a decision table. Any function P_a : V_a → {1, ..., m_a} (where m_a ≤ card(V_a)) is called a partition of V_a. The rank of P_a is the value rank(P_a) = card(P_a(V_a)). A family of partitions {P_a}_{a∈B} is consistent with B (B-consistent) iff the condition "(u, u′) ∉ ind(B/{d}) implies ∃a∈B [P_a(a(u)) ≠ P_a(a(u′))]" holds for any (u, u′) ∈ U × U. This means that if two objects u, u′ are discerned by B and d, then they must be discerned by the partition attributes defined by {P_a}_{a∈B}. We consider the following optimization problem.
SYMBOLIC VALUE PARTITION PROBLEM: Given a decision table S = (U, A ∪ {d}) and a set of attributes B ⊆ A, search for a minimal B-consistent family of partitions (i.e., a B-consistent family {P_a}_{a∈B} such that Σ_{a∈B} rank(P_a) is minimal).
To discern between pairs of objects we will use new binary features a_v^{v′} (for v ≠ v′) defined by a_v^{v′}(x, y) = 1 iff a(x) = v ≠ v′ = a(y). One can apply Johnson's heuristic to the new decision table with these attributes in order to search for a minimal set of new attributes that discerns all pairs of objects from different decision classes. After extracting this set, for each attribute a we construct a graph Γ_a = ⟨V_a, E_a⟩, where E_a is the set of all new attributes (propositional variables) found for the attribute a. Any vertex coloring of Γ_a defines a partition of V_a. The graph k-colorability problem is solvable in polynomial time for k = 2, but remains NP-complete for all k ≥ 3. However, similarly to discretization [7], one can apply an efficient heuristic searching for an optimal partition. Let us consider the example decision table presented in Figure 1, together with (a reduced form of) its discernibility matrix (Figure 1). From the Boolean function f_A, with Boolean variables of the form a_{v1}^{v2}, one can find the shortest prime implicant: a_{a1}^{a2} ∧ a_{a2}^{a3} ∧ a_{a1}^{a4} ∧ a_{a3}^{a4} ∧ b_{b1}^{b4} ∧ b_{b2}^{b4} ∧ b_{b2}^{b3} ∧ b_{b1}^{b3} ∧ b_{b3}^{b5}, which can be treated as the graphs presented in Figure 2. We can color the vertices of those graphs as shown in Figure 2. The colors correspond to the partitions: P_a(a1) = P_a(a3) = 1; P_a(a2) = P_a(a4) = 2; P_b(b1) = P_b(b2) = P_b(b5) = 1; P_b(b3) = P_b(b4) = 2. At the same time one can construct the new, reduced decision table (Figure 2).
S      a    b    d
u1     a1   b1   0
u2     a1   b2   0
u3     a2   b3   0
u4     a3   b1   0
u5     a1   b4   1
u6     a2   b2   1
u7     a2   b1   1
u8     a4   b2   1
u9     a3   b4   1
u10    a2   b5   1

[The reduced discernibility matrix M(S), which compares the objects u5-u10 with the objects u1-u4 of the opposite decision class in terms of the binary features a_v^{v′} and b_v^{v′}, is not legible in the source and is not reproduced here.]

Fig. 1. The decision table and the discernibility matrix
[Graphs Γ_a (vertices a1-a4) and Γ_b (vertices b1-b5) with the 2-colorings described above; only the resulting reduced table is reproduced.]

aPa   bPb   d
1     1     0
2     2     0
1     2     1
2     1     1

Fig. 2. Coloring of attribute value graphs and the reduced table.
One can extend the presented approach (see e.g. [6]) to the case when both nominal and numeric attributes appear in a given decision system. The resulting heuristics are of very good quality.
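As an illustration of the grouping step, the following sketch (not part of the paper; a simple greedy coloring in Python, with the edge lists taken from the prime implicant of the worked example above) colors each attribute value graph and reads the colors off as a partition of the value set.

```python
def greedy_partition(values, edges):
    """Greedily color the conflict graph (values, edges); the colors define a partition."""
    color = {}
    for v in values:
        # colors already used by neighbors of v
        used = {color[u] for a, b in edges for u in (a, b)
                if u in color and (a == v or b == v) and u != v}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color

# edges taken from the prime implicant of the example (pairs of values that must be discerned)
edges_a = [("a1", "a2"), ("a2", "a3"), ("a1", "a4"), ("a3", "a4")]
edges_b = [("b1", "b4"), ("b2", "b4"), ("b2", "b3"), ("b1", "b3"), ("b3", "b5")]

print(greedy_partition(["a1", "a2", "a3", "a4"], edges_a))
print(greedy_partition(["b1", "b2", "b3", "b4", "b5"], edges_b))
```

On this example the greedy order reproduces the partitions given above (a1, a3 vs. a2, a4 and b1, b2, b5 vs. b3, b4); in general a greedy coloring only approximates the minimal-rank partition.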
Experiments with classification methods (see [6]) have been carried out on decision systems using two techniques, called "train-and-test" and "n-fold cross-validation". Table 1 shows some results of experiments obtained by testing the proposed methods MD (using only discretization based on the MD-heuristic [7] with the Johnson approximation strategy) and MD-G (using discretization together with symbolic value grouping) for classification quality on well-known data tables from the UC Irvine repository. The results reported in [5] are summarized in the columns labeled S-ID3 and C4.5 in Table 1. It is interesting to compare these results with respect to classification quality. Let us note that the heuristics MD and MD-G are also very efficient with respect to time complexity.
Names of Tables   S-ID3   C4.5    MD      MD-G
Australian        78.26   85.36   83.69   84.49
Breast (L)        62.07   71.00   69.95   69.95
Diabetes          66.23   70.84   71.09   76.17
Glass             62.79   65.89   66.41   69.79
Heart             77.78   77.04   77.04   81.11
Iris              96.67   94.67   95.33   96.67
Lympho            73.33   77.01   71.93   82.02
Monk-1            81.25   75.70   100     93.05
Monk-2            69.91   65.00   99.07   99.07
Monk-3            90.28   97.20   93.51   94.00
Soybean           100     95.56   100     100
TicTacToe         84.38   84.02   97.70   97.70
Average           78.58   79.94   85.48   87.00

Table 1. The quality comparison (classification accuracies) between decision tree methods. MD: MD-heuristic; MD-G: MD-heuristic with symbolic value partition
4 Association Rule Generation
Let A = (U, A) be an information table. By descriptors we mean terms of the form (a = v), where a ∈ A is an attribute and v ∈ Va is a value in the domain of a (see [8]). The notion of descriptor can be generalized by using terms of the form (a ∈ S), where S ⊆ Va is a set of values. By a template we mean a conjunction of descriptors, i.e., T = D1 ∧ D2 ∧ ... ∧ Dm, where D1, ..., Dm are either simple or generalized descriptors. We denote by length(T) the number of descriptors in T. An object u ∈ U satisfies the template T = (a_{i1} = v1) ∧ ... ∧ (a_{im} = vm) if and only if ∀j a_{ij}(u) = vj. Hence the template T describes the set of objects having the common property: "the values of attributes a_{i1}, ..., a_{im} on these objects are equal to v1, ..., vm, respectively". The support of T is defined by support(T) = |{u ∈ U : u satisfies T}|. Long templates with large support are preferred in many data mining tasks. The problems of finding optimal large templates (for many optimization functions) are known to be NP-hard with respect to the number of attributes
involved in the descriptors (see e.g. [8]). Nevertheless, large templates can be found quite efficiently by the Apriori and AprioriTid algorithms (see [1,15]). A number of other methods for large template generation have been proposed, e.g., in [8]. Association rules and their generation can be defined in many ways (see [1]). Here, according to the presented notation, association rules can be defined as implications of the form (P ⇒ Q), where P and Q are different simple templates, i.e., formulas of the form

(a_{i1} = v_{i1}) ∧ ... ∧ (a_{ik} = v_{ik}) ⇒ (a_{j1} = v_{j1}) ∧ ... ∧ (a_{jl} = v_{jl})    (1)
These implications can be called generalized association rules, because association rules were originally defined as formulas P ⇒ Q where P and Q are sets of items (i.e., goods or articles in stock), e.g., {A, B} ⇒ {C, D, E} (see [1]). One can see that this form can be obtained from (1) by replacing the values in the descriptors by 1, i.e.: (A = 1) ∧ (B = 1) ⇒ (C = 1) ∧ (D = 1) ∧ (E = 1). Usually, for a given information table A, the quality of an association rule R = P ⇒ Q is evaluated by two coefficients, called support and confidence with respect to A. The support of the rule R is defined by the number of objects from A satisfying the condition (P ∧ Q), i.e., support(R) = support(P ∧ Q). The second coefficient, the confidence of R, is the ratio between the support of (P ∧ Q) and the support of P, i.e., confidence(R) = support(P ∧ Q) / support(P). The following problem has been investigated by many authors (see e.g. [1,15]): for a given information table A, an integer s, and a real number c ∈ [0, 1], find as many association rules R = (P ⇒ Q) as possible such that support(R) ≥ s and confidence(R) ≥ c. All existing association rule generation methods consist of two main steps:
1. Generate as many templates T = D1 ∧ D2 ∧ ... ∧ Dk as possible such that support(T) ≥ s and support(T ∧ D) < s for any descriptor D (i.e., maximal templates among those which are supported by at least s objects).
2. For any such template T, search for a decomposition T = P ∧ Q such that support(P) < support(T)/c and P is the smallest template satisfying this condition.
In this paper we show that the second step can be solved using rough set methods and the Boolean reasoning approach.
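To make the two-step scheme concrete, here is a small sketch (not from the paper; plain Python over an in-memory list of records, with hypothetical helper names) that evaluates the support and confidence of a decomposition T = P ∧ Q of a template into an antecedent P and a consequent Q.

```python
def satisfies(obj, template):
    """template is a dict {attribute: required value}."""
    return all(obj.get(a) == v for a, v in template.items())

def support(table, template):
    return sum(satisfies(obj, template) for obj in table)

def confidence(table, antecedent, consequent):
    """Confidence of (antecedent => consequent): support(P and Q) / support(P)."""
    both = {**antecedent, **consequent}
    sup_p = support(table, antecedent)
    return support(table, both) / sup_p if sup_p else 0.0

# toy usage: rule (a1=0 and a3=2) => (a4=1)
table = [{"a1": 0, "a3": 2, "a4": 1}, {"a1": 0, "a3": 2, "a4": 0}, {"a1": 1, "a3": 2, "a4": 1}]
print(confidence(table, {"a1": 0, "a3": 2}, {"a4": 1}))  # 0.5
```

Step 2 above then amounts to looking, among the sub-templates P of T, for the smallest ones whose support stays below support(T)/c, which is exactly what the Boolean reasoning reformulation of the next subsection exploits.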
4.1 Boolean Reasoning Approach for Association Rule Generation
Let us assume that a template T = D1 ∧ D2 ∧ ... ∧ Dm, supported by at least s objects, has been found. For a given confidence threshold c ∈ (0, 1), the decomposition T = P ∧ Q is called c-irreducible if confidence(P ⇒ Q) ≥ c and, for any decomposition T = P′ ∧ Q′ such that P′ is a sub-template of P, confidence(P′ ⇒ Q′) < c. One can prove the following.

Theorem 1. Let c ∈ [0, 1]. The problem of searching for the shortest association rule from the template T for a given table S with confidence bounded by c (the Optimal c-Association Rules Problem) is NP-hard.
To solve this problem, we show that the problem of searching for optimal association rules from a given template is equivalent to the problem of searching for local α-reducts of a decision table, a well-known problem in rough set theory. We construct a new decision table S|T = (U, A|T ∪ {d}) from the original information table S and the template T as follows:
1. A|T = {a_{D1}, a_{D2}, ..., a_{Dm}} is a set of attributes corresponding to the descriptors of T, such that a_{Di}(u) = 1 if the object u satisfies Di, and a_{Di}(u) = 0 otherwise;
2. the decision attribute d determines whether the object satisfies the template T, i.e., d(u) = 1 if the object u satisfies T, and d(u) = 0 otherwise.
The following theorem describes the relationship between the association rule problem and the reduct searching problem.

Theorem 2. For a given information table S = (U, A), a template T, and a set of descriptors P, the implication ∧_{Di∈P} Di ⇒ ∧_{Dj∉P} Dj is
1. a 100%-irreducible association rule from T if and only if P is a reduct in S|T;
2. a c-irreducible association rule from T if and only if P is an α-reduct in S|T, where α = 1 − (1/c − 1)/(n/s − 1), n is the total number of objects from U, and s = support(T).

Searching for minimal α-reducts is a well-known problem in rough set theory. One can show that the problem of searching for all α-reducts, as well as the problem of searching for the shortest α-reduct, is NP-hard. Great effort has been devoted to solving these problems. In forthcoming papers we present rough-set-based algorithms for association rule generation for large data tables using SQL queries.
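The reduction of Theorem 2 is easy to state operationally. The sketch below (not from the paper; a minimal Python illustration with hypothetical names) builds the decision table S|T for a template given as a list of descriptors and computes the threshold α = 1 − (1/c − 1)/(n/s − 1).

```python
def build_decision_table(table, descriptors):
    """descriptors: list of (attribute, value) pairs D1..Dm taken from the template T."""
    rows = []
    for obj in table:
        bits = [1 if obj.get(a) == v else 0 for a, v in descriptors]
        rows.append((bits, int(all(bits))))  # decision d = 1 iff the object satisfies T
    return rows

def alpha_threshold(n, s, c):
    """Alpha of Theorem 2 for n objects, template support s and confidence threshold c."""
    return 1.0 - (1.0 / c - 1.0) / (n / s - 1.0)

# with n = 18 objects, support s = 10 and c = 0.9 this gives roughly 0.86,
# as in the worked example that follows
print(round(alpha_threshold(18, 10, 0.9), 2))
```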
4.2 The Example
The following example illustrates the main idea of our method. Let us consider the following information table A with 18 objects and 9 attributes (Table 2). Assume that the template T = (a1 = 0) ∧ (a3 = 2) ∧ (a4 = 1) ∧ (a6 = 0) ∧ (a8 = 1) has been extracted from the information table A. One can see that support(T) = 10 and length(T) = 5. The newly constructed decision table A|T is also presented in Table 2. The discernibility function for A|T can be described as follows:

f(D1, D2, D3, D4, D5) = (D2 ∨ D4 ∨ D5) ∧ (D1 ∨ D3 ∨ D4) ∧ (D2 ∨ D3 ∨ D4) ∧ (D1 ∨ D2 ∨ D3 ∨ D4) ∧ (D1 ∨ D3 ∨ D5) ∧ (D2 ∨ D3 ∨ D5) ∧ (D3 ∨ D4 ∨ D5) ∧ (D1 ∨ D5)

After its simplification we obtain six reducts for the decision table A|T: f(D1, D2, D3, D4, D5) = (D3 ∧ D5) ∨ (D4 ∧ D5) ∨ (D1 ∧ D2 ∧ D3) ∨ (D1 ∧ D2 ∧ D4) ∨ (D1 ∧ D2 ∧ D5) ∨ (D1 ∧ D3 ∧ D4). Thus, we have found from T six association rules with 100% confidence. For c = 90%, we would like to find the α-reducts of the decision table A|T, where α = 1 − (1/c − 1)/(n/s − 1) = 0.86. Hence we search for sets of descriptors covering at least ⌈(n − s) · α⌉ = ⌈8 · 0.86⌉ = 7 entries of the discernibility matrix M(A|T). One can see that the following sets of descriptors: {D1, D2}, {D1, D3}, {D1, D4}, {D1, D5}, {D2, D3}, {D2, D5}, {D3, D4} have a non-empty intersection with exactly 7 entries of the discernibility matrix M(A|T). In Table 3 we present all association rules constructed from these sets.
A     a1  a2  a3  a4  a5  a6  a7  a8  a9
u1     0   1   1   1  80   2   -   2   -
u2     0   1   2   1  81   0   -   1   -
u3     0   2   2   1  82   0   -   1   -
u4     0   1   2   1  80   0   -   1   -
u5     1   1   2   2  81   1   -   1   -
u6     0   2   1   2  81   1   -   1   -
u7     1   2   1   2  83   1   -   1   -
u8     0   2   2   1  81   0   -   1   -
u9     0   1   2   1  82   0   -   1   -
u10    0   3   2   1  84   0   -   1   -
u11    0   1   3   1  80   0   -   2   -
u12    0   2   2   2  82   0   -   2   -
u13    0   2   2   1  81   0   -   1   -
u14    0   3   2   2  81   2   -   2   -
u15    0   4   2   1  82   0   -   1   -
u16    0   3   2   1  83   0   -   1   -
u17    0   1   2   1  84   0   -   1   -
u18    1   2   2   1  82   0   -   2   -

A|T   D1(a1=0)  D2(a3=2)  D3(a4=1)  D4(a6=0)  D5(a8=1)  d
u1       1         0         1         0         0      0
u2       1         1         1         1         1      1
u3       1         1         1         1         1      1
u4       1         1         1         1         1      1
u5       0         1         0         0         1      0
u6       1         0         0         0         1      0
u7       0         0         0         0         1      0
u8       1         1         1         1         1      1
u9       1         1         1         1         1      1
u10      1         1         1         1         1      1
u11      1         0         1         1         0      0
u12      1         0         0         1         0      0
u13      1         1         1         1         1      1
u14      1         1         0         0         0      0
u15      1         1         1         1         1      1
u16      1         1         1         1         1      1
u17      1         1         1         1         1      1
u18      0         1         1         1         0      0

Table 2. An example of information table A and template T supported by 10 objects, and the new decision table A|T constructed from A and the template T (the values of attributes a7 and a9 are not legible in the source).
M(A|T)    u2, u3, u4, u8, u9, u10, u13, u15, u16, u17
u1        D2 ∨ D4 ∨ D5
u5        D1 ∨ D3 ∨ D4
u6        D2 ∨ D3 ∨ D4
u7        D1 ∨ D2 ∨ D3 ∨ D4
u11       D1 ∨ D3 ∨ D5
u12       D2 ∨ D3 ∨ D5
u14       D3 ∨ D4 ∨ D5
u18       D1 ∨ D5

Association rules with 100% confidence:
D3 ∧ D5 ⇒ D1 ∧ D2 ∧ D4        D4 ∧ D5 ⇒ D1 ∧ D2 ∧ D3
D1 ∧ D2 ∧ D3 ⇒ D4 ∧ D5        D1 ∧ D2 ∧ D4 ⇒ D3 ∧ D5
D1 ∧ D2 ∧ D5 ⇒ D3 ∧ D4        D1 ∧ D3 ∧ D4 ⇒ D2 ∧ D5

Association rules with confidence at least 90%:
D1 ∧ D2 ⇒ D3 ∧ D4 ∧ D5        D1 ∧ D3 ⇒ D2 ∧ D4 ∧ D5
D1 ∧ D4 ⇒ D2 ∧ D3 ∧ D5        D1 ∧ D5 ⇒ D2 ∧ D3 ∧ D4
D2 ∧ D3 ⇒ D1 ∧ D4 ∧ D5        D2 ∧ D5 ⇒ D1 ∧ D3 ∧ D4
D3 ∧ D4 ⇒ D1 ∧ D2 ∧ D5

Table 3. The simplified version of the discernibility matrix M(A|T) and the association rules.
5 Conclusions

We have presented a general scheme for encoding a wide class of problems. This encoding scheme has proven to be very useful for solving many problems using propositional reasoning, e.g., information reduction, decision rule generation, feature extraction and feature selection, and conflict resolution in multi-agent systems. Our approach can be used to consider only discernible pairs with a sufficiently large discernibility degree. Another possible extension is related to
the extension of our knowledge bases by adding a new component corresponding to concordant (indiscernible) pairs of situations, and by requiring that some constraints described by this component be preserved. We also plan to extend the approach using rough mereology [10].

Acknowledgement: This work was partially supported by the Research Program of the European Union - ESPRIT-CRIT2 No. 20288.
References 1. Agrawal R., Mannila H., Srikant R., Toivonen H., Verkamo A.I., 1996. Fast discovery of assocation rules. In V.M. Fayad, G.Piatetsky Shapiro, P. Smyth, R. Uthurusamy (eds): Advanced in Knowledge Discovery and Data Mining, AAAI/MIT Press, pp. 307-328. 2. J. Bazan. A comparison of dynamic non-dynamic rough set methods for extracting laws from decision tables. In: L. Polkowski and A. Skowron (Eds.), Rough Sets in Knowledge Discovery 1: Methodology and Applications, Physica-Verlag, Heidelberg, 1998, 321–365. 3. E.M. Brown. Boolean Reasoning, Kluwer Academic Publishers, Dordrecht, 1990. 4. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.). Advances in Knowledge Discovery and Data Mining. MIT/AAAI Press, Menlo Park, 1996. 5. J. Friedman, R. Kohavi, Y. Yun. Lazy decision trees, Proc. AAAI-96, 717–724. 6. H.S. Nguyen and S.H. Nguyen. Pattern extraction from data, Fundamenta Informaticae 34, 1998, pp. 129–144. 7. H.S. Nguyen and A. Skowron. Boolean reasoning for feature extraction problems, Proc. ISMIS’97, LNAI 1325, Springer–verlag, Berlin, 117–126. 8. Nguyen S. Hoa, A. Skowron, P. Synak. Discovery of data pattern with applications to Decomposition and classification problems. In L. Polkowski, A. Skowron (eds.): Rough Sets in Knowledge Discovery 2. Physica-Verlag, Heidelberg 1998, pp. 55–97. 9. Z. Pawlak. Rough Sets – Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht, 1991. 10. L. Polkowski and A. Skowron. Rough sets: A perspective. In: L. Polkowski and A. Skowron (Eds.). Rough Sets in Knowledge Discovery 1: Methodology and Applications. Physica-Verlag, Heidelberg, 1998, 31–56. 11. J.R. Quinlan. C4.5. Programs for machine learning, Morgan Kaufmann, San Mateo, CA, 1993. 12. B. Selman, H. Kautz and D. McAllester. Ten Challenges in Propositional Reasoning and Search, Proc. IJCAI’97, Japan. 13. Skowron A. Synthesis of adaptive decision systems from experimental data. In A. Aamodt, J. Komorowski (eds), Proc. of the 5th Scandinavian Conference on AI (SCAI’95), IOS Press, May 1995, Trondheim, Norway, 220–238. 14. A. Skowron and C. Rauszer. The discernibility matrices and functions in information systems, in: R. Slowi´ nski (Ed.), Intelligent decision support: Handbook of applications and advances of the rough sets theory, Kluwer Academic Publishers, Dordrecht, 1992, 331-362. 15. Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li. New Parallel Algorithms for Fast Discovery of Association Rules. In Data Mining and Knowledge Discovery : An International Journal, special issue on Scalable High-Performance Computing for KDD, Vol. 1, No. 4, Dec. 1997, pp 343-373.
On the Correspondence between Classes of Implicational and Equivalence Quantifiers
Jiří Ivánek
Laboratory of Intelligent Systems, Faculty of Informatics and Statistics, University of Economics, W. Churchill Sq. 4, 130 67 Prague, Czech Republic, e-mail:
[email protected] Abstract. Relations between two Boolean attributes derived from data can be quantified by truth functions defined on four-fold tables corresponding to pairs of the attributes. In the paper, several classes of such quantifiers (implicational, double implicational, equivalence ones) with truth values in the unit interval are investigated. The method of construction of the logically nearest double implicational and equivalence quantifiers to a given implicational quantifier (and vice versa) is described and approved.
1 Introduction
The theory of observational quantifiers was established within the framework of the GUHA method of mechanized hypothesis formation [4], [5]. It should be stressed that this method is one of the earliest methods of data mining [9]. The method has been developed over the years, and various procedures have been implemented, e.g., in the systems PC-GUHA [6], Knowledge Explorer [3], and 4FT-Miner [12]. Further investigations of its mathematical and logical foundations are going on nowadays [7], [10], [11]. We concentrate on the most widely used observational quantifiers, called in [11] four-fold table quantifiers. So far these quantifiers have been treated in classical logic as 0/1-truth functions. Some possibilities of a fuzzy logic approach are now being discussed [7]. In the paper, several classes of quantifiers (implicational, double implicational, equivalence ones) with truth values in the unit interval are investigated. This type of quantification of rules derived from databases is used in modern methods of knowledge discovery in databases (see e.g. [13]). On the other hand, there is a connection between four-fold table quantifiers and measures of resemblance or similarity applied to Boolean vectors [2]. In Section 2, basic notions and classes of quantifiers are defined, and some examples of quantifiers of different types are given. In Section 3, the method of construction of double implicational quantifiers from implicational ones (and vice versa) is described. This method provides a logically strong one-to-one correspondence between the classes of implicational and so-called Σ-double implicational quantifiers. An analogous construction is used in Section 4 to introduce a similar correspondence between the classes of Σ-double implicational and Σ-equivalence quantifiers. Several theorems on these constructions are proved. As a conclusion, triads of affiliated quantifiers are introduced, and their importance in data mining applications is discussed.
2 Classes of Quantifiers
For two Boolean attributes ϕ and ψ (derived from given data), the corresponding four-fold table ⟨a, b, c, d⟩ (Table 1) is composed of the numbers of objects in the data satisfying the four different Boolean combinations of the attributes: a is the number of objects satisfying both ϕ and ψ, b is the number of objects satisfying ϕ and not satisfying ψ, c is the number of objects not satisfying ϕ and satisfying ψ, and d is the number of objects not satisfying ϕ and not satisfying ψ.
        ψ    ¬ψ
ϕ       a    b
¬ϕ      c    d

Table 1. Four-fold table of ϕ and ψ
To avoid degenerate situations, we shall assume that all marginals of the four-fold table are non-zero: a + b > 0, c + d > 0, a + c > 0, b + d > 0.

Definition 1. A 4FT quantifier ∼ is a [0, 1]-valued function defined for all four-fold tables ⟨a, b, c, d⟩. We shall write ∼(a, b) if the value of the quantifier ∼ depends only on a, b; ∼(a, b, c) if the value of the quantifier ∼ depends only on a, b, c; and ∼(a, b, c, d) if the value of the quantifier ∼ depends on all of a, b, c, d.

For simplicity, we shall omit the specification 4FT in this paper. The most common examples of quantifiers are the following ones:

Example 1. Quantifier ⇒ of basic implication (corresponds to the notion of the confidence of an association rule, see [1], [4], [5]): ⇒(a, b) = a / (a + b).

Example 2. Quantifier ⇔ of basic double implication (Jaccard 1900, [2], [5]): ⇔(a, b, c) = a / (a + b + c).

Example 3. Quantifier ≡ of basic equivalence (Kendall, Sokal-Michener 1958, [2], [5]): ≡(a, b, c, d) = (a + d) / (a + b + c + d).

If the four-fold table ⟨a, b, c, d⟩ represents the behaviour of the derived attributes ϕ and ψ in the given data, then we can interpret the above quantifiers in the following way. The quantifier of basic implication calculates the relative frequency of objects satisfying ψ among all objects satisfying ϕ, so it measures in a simple way the validity of the implication ϕ ⇒ ψ in the data: the higher a and the smaller b, the better the validity ⇒(a, b) = a / (a + b). The quantifier of basic double implication calculates the relative frequency of objects satisfying ϕ ∧ ψ among all objects satisfying ϕ ∨ ψ, so it measures in a simple way the validity of the bi-implication (ϕ ⇒ ψ) ∧ (ψ ⇒ ϕ) in the data: the higher a and the smaller b, c, the better the validity ⇔(a, b, c) = a / (a + b + c). The quantifier of basic equivalence calculates the relative frequency of objects supporting the correlation of ϕ and ψ among all objects, so it measures in a simple way the validity of the equivalence ϕ ≡ ψ in the data: the higher a, d and the smaller b, c, the better the validity ≡(a, b, c, d) = (a + d) / (a + b + c + d).

The properties of the basic quantifiers are at the core of the general definition of several useful classes of quantifiers [4], [5], [11]:
(1) I - the class of implicational quantifiers,
(2) DI - the class of double implicational quantifiers,
(3) ΣDI - the class of Σ-double implicational quantifiers,
(4) E - the class of equivalence quantifiers,
(5) ΣE - the class of Σ-equivalence quantifiers.
Each class of quantifiers ∼ is characterized in the following definition by a special truth preservation condition of the form: the fact that the four-fold table ⟨a′, b′, c′, d′⟩ is in some sense (implicational, ...) better than ⟨a, b, c, d⟩ implies that ∼(a′, b′, c′, d′) ≥ ∼(a, b, c, d).

Definition 2. Let a, b, c, d, a′, b′, c′, d′ denote frequencies from an arbitrary pair of four-fold tables ⟨a, b, c, d⟩ and ⟨a′, b′, c′, d′⟩.
(1) A quantifier ∼(a, b) is implicational, ∼ ∈ I, if a′ ≥ a ∧ b′ ≤ b always implies ∼(a′, b′) ≥ ∼(a, b).
(2) A quantifier ∼(a, b, c) is double implicational, ∼ ∈ DI, if a′ ≥ a ∧ b′ ≤ b ∧ c′ ≤ c always implies ∼(a′, b′, c′) ≥ ∼(a, b, c).
(3) A quantifier ∼(a, b, c) is Σ-double implicational, ∼ ∈ ΣDI, if a′ ≥ a ∧ b′ + c′ ≤ b + c always implies ∼(a′, b′, c′) ≥ ∼(a, b, c).
(4) A quantifier ∼(a, b, c, d) is equivalence, ∼ ∈ E, if a′ ≥ a ∧ b′ ≤ b ∧ c′ ≤ c ∧ d′ ≥ d always implies ∼(a′, b′, c′, d′) ≥ ∼(a, b, c, d).
(5) A quantifier ∼(a, b, c, d) is Σ-equivalence, ∼ ∈ ΣE, if a′ + d′ ≥ a + d ∧ b′ + c′ ≤ b + c always implies ∼(a′, b′, c′, d′) ≥ ∼(a, b, c, d).

Example 4. ⇒ ∈ I, ⇔ ∈ ΣDI, ≡ ∈ ΣE.

Proposition 3. I ⊂ DI ⊂ E, ΣDI ⊂ DI, ΣE ⊂ E.

In the original GUHA method [4], [5], some statistically motivated quantifiers were introduced. They are based on hypothesis testing, e.g.: given 0 < p < 1, the question is whether the conditional probability corresponding to the examined relation of the Boolean attributes ϕ and ψ is ≥ p. This question leads to a test of the null hypothesis that the corresponding conditional probability is ≥ p, against the alternative hypothesis that this probability is < p. The following quantifiers are derived from the appropriate statistical test.
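The basic quantifiers of Examples 1-3 and the monotonicity conditions of Definition 2 are easy to check numerically. The sketch below (not from the paper; plain Python) does so on concrete four-fold tables before we turn to the statistically motivated quantifiers.

```python
def imp(a, b):            # basic implication (confidence)
    return a / (a + b)

def dbl_imp(a, b, c):     # basic double implication (Jaccard)
    return a / (a + b + c)

def equiv(a, b, c, d):    # basic equivalence
    return (a + d) / (a + b + c + d)

# spot-check of the implicational condition: a' >= a and b' <= b must not decrease the value
assert imp(12, 3) >= imp(10, 5)
print(imp(10, 5), dbl_imp(10, 5, 5), equiv(10, 5, 5, 10))
```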
Example 5. The quantifier ⇒?p of upper critical implication,

⇒?p(a, b) = Σ_{i=0}^{a} ((a + b)! / (i! (a + b − i)!)) · p^i (1 − p)^{a+b−i},

is implicational [4], [5].

Example 6. The quantifier ⇔?p of upper critical double implication,

⇔?p(a, b, c) = Σ_{i=0}^{a} ((a + b + c)! / (i! (a + b + c − i)!)) · p^i (1 − p)^{a+b+c−i},

is Σ-double implicational [5], [11].

Example 7. The quantifier ≡?p of upper critical equivalence,

≡?p(a, b, c, d) = Σ_{i=0}^{a+d} ((a + b + c + d)! / (i! (a + b + c + d − i)!)) · p^i (1 − p)^{a+b+c+d−i},

is Σ-equivalence [5], [11].

Let us note that all the above-mentioned quantifiers are used (among others) in the GUHA procedure 4FT-Miner [12]. Some more examples of double implicational and equivalence quantifiers can be derived from the list of association coefficients (resemblance measures on Boolean vectors) included in [2]. In the next sections, a one-to-one correspondence with strong logical properties will be shown
i) between the classes of quantifiers I and ΣDI, by means of the relation ⇔*(a, b, c) = ⇒*(a, b + c), and, analogously,
ii) between the classes of quantifiers ΣDI and ΣE, by means of the relation ≡*(a, b, c, d) = ⇔*(a + d, b, c).
First, let us prove the following auxiliary propositions.

Lemma 4. A quantifier ⇔* is Σ-double implicational iff the following conditions hold:
(i) ⇔*(a, b′, c′) = ⇔*(a, b, c) for all a, b, c, b′, c′ such that b′ + c′ = b + c;
(ii) the quantifier ⇒* defined by ⇒*(a, b) = ⇔*(a, b, 0) is implicational.

Proof. For Σ-double implicational quantifiers, (i) and (ii) are clearly true. Let ⇔* be a quantifier satisfying (i), (ii), and let a′ ≥ a ∧ b′ + c′ ≤ b + c. Then ⇔*(a′, b′, c′) = ⇔*(a′, b′ + c′, 0) = ⇒*(a′, b′ + c′) ≥ ⇒*(a, b + c) = ⇔*(a, b + c, 0) = ⇔*(a, b, c).
Example 8. The quantifier ⇔+ (Kulczynski 1927, see [2]), ⇔+(a, b, c) = ½ (a/(a+b) + a/(a+c)), is double implicational but not Σ-double implicational, ⇔+ ∈ DI − ΣDI; for instance, ⇔+(1, 1, 1) = ⇔+(1, 2, 0) does not hold.

Lemma 5. A quantifier ≡* is Σ-equivalence iff the following conditions hold:
(i) ≡*(a′, b′, c′, d′) = ≡*(a, b, c, d) for all a, b, c, d, a′, b′, c′, d′ such that a′ + d′ = a + d and b′ + c′ = b + c;
(ii) the quantifier ⇔* defined by ⇔*(a, b, c) = ≡*(a, b, c, 0) is Σ-double implicational.

Proof. For Σ-equivalence quantifiers, (i) and (ii) are clearly true. Let ≡* be a quantifier satisfying (i), (ii), and let a′ + d′ ≥ a + d ∧ b′ + c′ ≤ b + c. Then ≡*(a′, b′, c′, d′) = ≡*(a′ + d′, b′, c′, 0) = ⇔*(a′ + d′, b′, c′) ≥ ⇔*(a + d, b, c) = ≡*(a + d, b, c, 0) = ≡*(a, b, c, d).

Example 9. The quantifier ≡+ (Sokal, Sneath 1963, see [2]), ≡+(a, b, c, d) = ¼ (a/(a+b) + a/(a+c) + d/(d+b) + d/(d+c)), is equivalence but not Σ-equivalence, ≡+ ∈ E − ΣE; for instance, ≡+(1, 1, 1, 1) = ≡+(2, 1, 1, 0) does not hold.

We shall use the following definition to state relations between different quantifiers.

Definition 6. A quantifier ∼1 is less strict than ∼2 (or ∼2 is more strict than ∼1) if ∼1(a, b, c, d) ≥ ∼2(a, b, c, d) for all four-fold tables ⟨a, b, c, d⟩.

From the (fuzzy) logic point of view, this means that in all models (data) the formula ϕ ∼1 ψ is at least as true as the formula ϕ ∼2 ψ, i.e., the deduction rule (ϕ ∼2 ψ) / (ϕ ∼1 ψ) is correct.

Example 10. ⇔ is more strict than ⇒ and less strict than ≡; ⇔+ is more strict than ⇔.
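A quick numerical check of Example 8 (not from the paper; a Python snippet): the Kulczynski quantifier violates condition (i) of Lemma 4, which is why it is double implicational but not Σ-double implicational.

```python
kulczynski = lambda a, b, c: 0.5 * (a / (a + b) + a / (a + c))

# condition (i) of Lemma 4 requires equal values whenever b + c is unchanged
print(kulczynski(1, 1, 1))  # 0.5
print(kulczynski(1, 2, 0))  # 0.666..., so the two tables are valued differently
```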
3 Correspondence between Classes of Σ-Double Implicational Quantifiers and Implicational Ones
Let ⇒* be an implicational quantifier. There is a natural task: to construct a Σ-double implicational quantifier ⇔* such that both implications ϕ ⇒* ψ and ψ ⇒* ϕ logically follow from the formula ϕ ⇔* ψ, i.e., the deduction rules (ϕ ⇔* ψ) / (ϕ ⇒* ψ) and (ϕ ⇔* ψ) / (ψ ⇒* ϕ) are correct. Such a quantifier ⇔* should be as little strict as possible, so as to be near to ⇒*. The following two theorems show how to construct the logically nearest Σ-double implicational quantifier from a given implicational quantifier, and vice versa.
Theorem 7. Let ⇒* be an implicational quantifier and let ⇔* be the quantifier constructed from ⇒* for all four-fold tables ⟨a, b, c, d⟩ by the formula ⇔*(a, b, c) = ⇒*(a, b + c). Then ⇔* is the Σ-double implicational quantifier which is the least strict in the class of all Σ-double implicational quantifiers ∼ satisfying, for all four-fold tables ⟨a, b, c, d⟩, the property ∼(a, b, c) ≤ min(⇒*(a, b), ⇒*(a, c)).

Remark. Let us mention that this means the following:
(1) the deduction rules (ϕ ⇔* ψ) / (ϕ ⇒* ψ) and (ϕ ⇔* ψ) / (ψ ⇒* ϕ) are correct;
(2) if ∼ is a Σ-double implicational quantifier such that the deduction rules (ϕ ∼ ψ) / (ϕ ⇒* ψ) and (ϕ ∼ ψ) / (ψ ⇒* ϕ) are correct, then ∼ is more strict than ⇔*, i.e., the rule (ϕ ∼ ψ) / (ϕ ⇔* ψ) is also correct.
Proof. Since ⇒* is an implicational quantifier, ⇔* is a Σ-double implicational quantifier; moreover, ⇔*(a, b, c) = ⇒*(a, b + c) ≤ min(⇒*(a, b), ⇒*(a, c)) for all four-fold tables ⟨a, b, c, d⟩. Let ∼ be a Σ-double implicational quantifier satisfying the property ∼(a, x, y) ≤ min(⇒*(a, x), ⇒*(a, y)) for all four-fold tables ⟨a, x, y, d⟩. Then, using Lemma 4, we obtain ∼(a, b, c) = ∼(a, b + c, 0) ≤ ⇒*(a, b + c) = ⇔*(a, b, c) for all four-fold tables ⟨a, b, c, d⟩, which means that ∼ is more strict than ⇔*.

Example 11. (1) For the basic implication ⇒(a, b) = a / (a + b), the basic double implication ⇔(a, b, c) = a / (a + b + c) is the least strict Σ-double implicational quantifier satisfying the deduction rules (ϕ ⇔* ψ) / (ϕ ⇒ ψ) and (ϕ ⇔* ψ) / (ψ ⇒ ϕ).
(2) For the upper critical implication ⇒?p(a, b) = Σ_{i=0}^{a} ((a + b)! / (i! (a + b − i)!)) · p^i (1 − p)^{a+b−i}, the upper critical double implication ⇔?p(a, b, c) = Σ_{i=0}^{a} ((a + b + c)! / (i! (a + b + c − i)!)) · p^i (1 − p)^{a+b+c−i} is the least strict Σ-double implicational quantifier satisfying the deduction rules (ϕ ⇔* ψ) / (ϕ ⇒?p ψ) and (ϕ ⇔* ψ) / (ψ ⇒?p ϕ).
Theorem 8. Let ⇔* be a Σ-double implicational quantifier and let ⇒* be the quantifier constructed from ⇔* for all four-fold tables ⟨a, b, c, d⟩ by the formula ⇒*(a, b) = ⇔*(a, b, 0). Then ⇒* is the implicational quantifier which is the most strict in the class of all implicational quantifiers ∼ satisfying, for all four-fold tables ⟨a, b, c, d⟩, the property min(∼(a, b), ∼(a, c)) ≥ ⇔*(a, b, c).
Remark. Let us mention that this means the following:
(1) the deduction rules (ϕ ⇔* ψ) / (ϕ ⇒* ψ) and (ϕ ⇔* ψ) / (ψ ⇒* ϕ) are correct;
(2) if ∼ is an implicational quantifier such that the deduction rules (ϕ ⇔* ψ) / (ϕ ∼ ψ) and (ϕ ⇔* ψ) / (ψ ∼ ϕ) are correct, then ∼ is less strict than ⇒*, i.e., the rule (ϕ ⇒* ψ) / (ϕ ∼ ψ) is also correct.

Proof. Since ⇔* is a Σ-double implicational quantifier, ⇒* is an implicational quantifier; moreover, ⇔*(a, b, c) = ⇔*(a, b + c, 0) ≤ min(⇔*(a, b, 0), ⇔*(a, c, 0)) = min(⇒*(a, b), ⇒*(a, c)) for all four-fold tables ⟨a, b, c, d⟩. Let ∼ be an implicational quantifier satisfying the property min(∼(a, b), ∼(a, c)) ≥ ⇔*(a, b, c) for all four-fold tables ⟨a, b, c, d⟩. Then we obtain ∼(a, b) ≥ ⇔*(a, b, 0) = ⇒*(a, b) for all four-fold tables ⟨a, b, c, d⟩, which means that ∼ is less strict than ⇒*.
4 Correspondence between Classes of Σ-Equivalence Quantifiers and Σ-Double Implicational Ones
This section is a clear analogy of the previous one. Let ⇔* be a Σ-double implicational quantifier. There is a natural task: to construct a Σ-equivalence ≡* such that the formula ϕ ≡* ψ logically follows both from the formula ϕ ⇔* ψ and from the formula ¬ϕ ⇔* ¬ψ, i.e., the deduction rules (ϕ ⇔* ψ) / (ϕ ≡* ψ) and (¬ϕ ⇔* ¬ψ) / (ϕ ≡* ψ) are correct. Such a quantifier ≡* should be as strict as possible, so as to be near to ⇔*. The following theorems show how to construct the logically nearest Σ-equivalence quantifier from a given Σ-double implicational quantifier, and vice versa. The proofs of these theorems are similar to the proofs of Theorems 7 and 8, so we omit them for lack of space.

Theorem 9. Let ⇔* be a Σ-double implicational quantifier and let ≡* be the quantifier constructed from ⇔* for all four-fold tables ⟨a, b, c, d⟩ by the formula ≡*(a, b, c, d) = ⇔*(a + d, b, c). Then ≡* is the Σ-equivalence which is the most strict in the class of all Σ-equivalences ∼ satisfying, for all four-fold tables ⟨a, b, c, d⟩, the property ∼(a, b, c, d) ≥ max(⇔*(a, b, c), ⇔*(d, b, c)).

Example 12. (1) For the basic double implication ⇔(a, b, c) = a / (a + b + c), the basic equivalence ≡(a, b, c, d) = (a + d) / (a + b + c + d) is the most strict Σ-equivalence satisfying the deduction rules (ϕ ⇔ ψ) / (ϕ ≡* ψ) and (¬ϕ ⇔ ¬ψ) / (ϕ ≡* ψ).
(2) For the upper critical double implication ⇔?p(a, b, c) = Σ_{i=0}^{a} ((a + b + c)! / (i! (a + b + c − i)!)) · p^i (1 − p)^{a+b+c−i}, the upper critical equivalence ≡?p(a, b, c, d) = Σ_{i=0}^{a+d} ((a + b + c + d)! / (i! (a + b + c + d − i)!)) · p^i (1 − p)^{a+b+c+d−i} is the most strict Σ-equivalence satisfying the deduction rules (ϕ ⇔?p ψ) / (ϕ ≡* ψ) and (¬ϕ ⇔?p ¬ψ) / (ϕ ≡* ψ).
Theorem 10. Let ≡* be a Σ-equivalence quantifier and let ⇔* be the quantifier constructed from ≡* for all four-fold tables ⟨a, b, c, d⟩ by the formula ⇔*(a, b, c) = ≡*(a, b, c, 0). Then ⇔* is the Σ-double implicational quantifier which is the least strict in the class of all Σ-double implicational quantifiers ∼ satisfying, for all four-fold tables ⟨a, b, c, d⟩, the property max(∼(a, b, c), ∼(d, b, c)) ≤ ≡*(a, b, c, d).
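To illustrate the constructions of Theorems 7 and 9 (not from the paper; a Python sketch), any implicational quantifier given as a function of (a, b) can be lifted to its affiliated Σ-double implicational and Σ-equivalence quantifiers; applying this to the basic implication reproduces the basic triad of Examples 1-3.

```python
def to_sigma_double_implicational(imp_q):
    """Theorem 7: the least strict Sigma-double implicational quantifier bounded by imp_q."""
    return lambda a, b, c: imp_q(a, b + c)

def to_sigma_equivalence(dbl_q):
    """Theorem 9: the most strict Sigma-equivalence >= max(dbl_q(a,b,c), dbl_q(d,b,c))."""
    return lambda a, b, c, d: dbl_q(a + d, b, c)

basic_imp = lambda a, b: a / (a + b)
basic_dbl = to_sigma_double_implicational(basic_imp)   # a / (a + b + c)
basic_eq = to_sigma_equivalence(basic_dbl)             # (a + d) / (a + b + c + d)

print(basic_dbl(10, 5, 5), basic_eq(10, 5, 5, 10))     # 0.5 and 0.666...
```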
5 Conclusions
The theorems proved in this paper show that quantifiers from the classes I, ΣDI, ΣE compose logically affiliated triads ⇒*, ⇔*, ≡*, where ⇒* is an implicational quantifier, ⇔* is a Σ-double implicational quantifier, and ≡* is a Σ-equivalence. Examples of such triads included in this paper are:

Example 13. The triad of basic quantifiers ⇒, ⇔, ≡, where ⇒(a, b) = a / (a + b), ⇔(a, b, c) = a / (a + b + c), ≡(a, b, c, d) = (a + d) / (a + b + c + d).

Example 14. The triad of statistically motivated upper critical quantifiers ⇒?p, ⇔?p, ≡?p, where
⇒?p(a, b) = Σ_{i=0}^{a} ((a + b)! / (i! (a + b − i)!)) · p^i (1 − p)^{a+b−i},
⇔?p(a, b, c) = Σ_{i=0}^{a} ((a + b + c)! / (i! (a + b + c − i)!)) · p^i (1 − p)^{a+b+c−i},
≡?p(a, b, c, d) = Σ_{i=0}^{a+d} ((a + b + c + d)! / (i! (a + b + c + d − i)!)) · p^i (1 − p)^{a+b+c+d−i}.

Let us stress that such a triad can be constructed for each given quantifier from the classes I, ΣDI, ΣE. This can naturally extend the methodological approach used in the definition of a particular quantifier to cover all three types of relations (implication, double implication, equivalence). We proved that the following deduction rules are correct for the triads:
(ϕ ⇔* ψ) / (ϕ ⇒* ψ),  (ϕ ⇔* ψ) / (ψ ⇒* ϕ),  (ϕ ⇔* ψ) / (ϕ ≡* ψ),  (¬ϕ ⇔* ¬ψ) / (ϕ ≡* ψ).
These deduction rules can be used in knowledge discovery and data mining methods in various ways:
(1) to organize the search for rules in databases effectively (discovering some rules is a reason to skip parts of the search, because some other rules simply follow from the discovered ones; non-validity of some rules means that some others are also not valid, ...);
(2) to filter the results of a data mining procedure (results which follow from others are not so interesting for users);
(3) to order rules according to different (but affiliated) quantifications.
In practice, some of the above-described ideas were used in the systems Combinational Data Analysis, ESOD [8], Knowledge Explorer [3], and 4FT-Miner [12].

This research has been supported by grant VS96008 of the Ministry of Education, Youth and Sports of the Czech Republic. The author is grateful to J. Rauch and R. Jiroušek for their valuable comments on the preliminary version of the paper.
References 1. Aggraval, R. et al.: Fast Discovery of Association Rules. In Fayyad, V.M. et al.: Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press 1996, p.307-328. 2. Batagelj, V., Bren, M.: Comparing Resemblance Measures. J. of Classification 12 (1995), p. 73-90. 3. Berka, P., Iv´ anek, J.: Automated Knowledge Acquisition for PROSPECTOR-like Expert Systems. In Machine Learning. ECML-94 Catania (ed. Bergadano, Raedt). Springer 1994, p.339-342. 4. H´ ajek,P., Havr´ anek,T.: Mechanising Hypothesis Formation - Mathematical Foundations for a General Theory. Springer-Verlag, Berlin 1978, 396 p. 5. H´ ajek,P., Havr´ anek,T., Chytil M.: Metoda GUHA. Academia, Praha 1983, 314 p. (in Czech) 6. H´ ajek, P., Sochorov´ a, A., Zv´ arov´ a, J.: GUHA for personal computers. Computational Statistics & Data Analysis 19 (1995), p. 149 - 153 7. H´ ajek, P., Holeˇ na, M.: Formal Logics of Discovery and Hypothesis Formation by Machine. In Discovery Science (Arikawa,S. and Motoda,H., eds.), Springer-Verlag, Berlin 1998, p.291-302 8. Iv´ anek, J., Stejskal, B.: Automatic Acquisition of Knowledge Base from Data without Expert: ESOD (Expert System from Observational Data). In Proc. COMPSTAT’88 Copenhagen. Physica-Verlag, Heidelberg 1988, p.175-180. 9. Rauch,J.: GUHA as a Data Mining Tool. In: Practical Aspects of Knowledge Management. Schweizer Informatiker Gesellshaft Basel, 1996 10. Rauch, J.: Logical Calculi for Knowledge Discovery in Databases. In Principles of Data Mining and Knowledge Discovery, (Komorowski,J. and Zytkow,J., eds.), Springer-Verlag, Berlin 1997, p. 47-57. 11. Rauch,J.: Classes of Four-Fold Table Quantifiers. In Principles of Data Mining and Knowledge Discovery, (Quafafou,M. and Zytkow,J., eds.), Springer Verlag, Berlin 1998, p. 203-211. 12. Rauch,J.: 4FT-Miner - popis procedury. Technical Report LISp-98-09, Praha 1999. 13. Zembowicz,R. - Zytkow,J.: From Contingency Tables to Various Forms of Knowledge in Databases. In Fayyad, U.M. et al.: Advances in Knowledge Discovery and Data Mining. AAAI Press/ The MIT Press 1996, p. 329-349.
Querying Inductive Databases via Logic-Based User-Defined Aggregates
Fosca Giannotti and Giuseppe Manco
CNUCE - CNR, Via S. Maria 36, 56125 Pisa - Italy
fF.Giannotti,
[email protected] Abstract. We show how a logic-based database language can support
the various steps of the KDD process by providing: a high degree of expressiveness, the ability to formalize the overall KDD process and the capability of separating the concerns between the speci cation level and the mapping to the underlying databases and datamining tools. We generalize the notion of Inductive Data Bases proposed in [4, 12] to the case of Deductive Databases. In our proposal, deductive databases resemble relational databases while user de ned aggregates provided by the deductive database language resemble the mining function and results. In the paper we concentrate on association rules and show how the mechanism of user de ned aggregates allows to specify the mining evaluation functions and the returned patterns.
1
Introduction
The rapid growth and spread of knowledge discovery techniques has highlighted the need to formalize the notion of knowledge discovery process. While it is clear which are the objectives of the various steps of the knowledge discovery process, little support is provided to reach such objectives, and to manage the overall process. The role of domain, or background, knowledge is relevant at each step of the KDD process: which attributes discriminate best, how can we characterize a correct/useful pro le, what are the interesting exception conditions, etc., are all examples of domain dependent notions. Notably, in the evaluation phase we need to associate with each inferred knowledge structure some quality function [HS94] that measures its information content. However, while it is possible to de ne quantitative measures for certainty (e.g., estimated prediction accuracy on new data) or utility (e.g., gain, speed-up, etc.), notions such as novelty and understandability are much more subjective to the task, and hence dicult to de ne. Here, in fact, the speci c measurements needed depend on a number of factors: the business opportunity, the sophistication of the organization, past history of measurements, and the availability of data. The position that we maintain in this paper is that a coherent formalism, capable of dealing uniformly with induced knowledge and background, or domain, •
J.M. Zytkow and J. Rauch (Eds.): PKDD’99, LNAI 1704, pp. 125−135, 1999. Springer−Verlag Berlin Heidelberg 1999
126
F. Giannotti and G. Manco
knowledge, would represent a breakthrough in the design and development of decision support systems, in several challenging application domains. Other proposal in the current literature have given experimental evidence that the knowledge discovery process can take great advantage of a powerful knowledge-representation and reasoning formalism [14, 11, 15, 5]. In this context, the notion of inductive database, proposed in [4, 12], is a rst attempt to formalize the notion of interactive mining process. An inductive database provides a uni ed and transparent view of both inferred (deductive) knowledge, and all the derived patterns, (the induced knowledge) over the data. The objective of this paper is to demonstrate how a logic-based database language, such as LDL++ [17], can support the various steps of the KDD process by providing: a high degree of expressiveness, the ability to formalize the overall KDD process and the capability of separating the concerns between the speci cation level and the mapping to the underlying databases and data mining tools. We generalize the notion of Inductive Databases proposed in [4, 12] to the case of Deductive Databases. In our proposal, deductive databases resemble relational databases while user de ned aggregates provided by LDL++ resemble the mining function and results. Such mechanism provides a exible way to customize, tune and reason on both the evaluation function and the extracted knowledge. In the paper we show how such a mechanism can be exploited in the task of association rules mining. The interested reader is referred to an extended version [7] of this paper, which covers the bayesian classi cation data mining task. 2
Logic Database Languages
Deductive databases are database management systems whose query languages and storage structures are designed around a logical model of data. The underlying technology is an extension to relational databases that increases the power of the query language. Among the other features, the rule-based extensions support the speci cation of queries using recursion and negation. We adopt the LDL++ deductive database system, which provides, in addition to the typical deductive features, a highly expressive query language with advanced mechanisms for non-deterministic, non-monotonic and temporal reasoning [9, 18]. In deductive databases, the extension of a relation is viewed as a set of facts, where each fact corresponds to a tuple. For example, let us consider the predicate assembly(Part Subpart) containing parts and their immediate subparts. The predicate partCost(BasicPart Supplier Cost) describes the basic parts, i.e., parts bought from external suppliers rather than assembled internally. Moreover, for each part the predicate describes the supplier, and for each supplier the price charged for it. Examples of facts are: ;
;
;
assembly(bike; frame): partCost(top tube; reed; 20): assembly(bike; wheel): partCost(fork; smith; 10): assembly(wheel; nipple):
Querying Inductive Databases via Logic−Based User−Defined Aggregates
127
Rules constitute the main construct of LDL++ programs. For instance, the rule multipleSupp(S) partCost(P1 S ) partCost(P2 S ) P1 6= P2 describes suppliers that sell more than one part. The rule corresponds to the SQL join query ;
;
;
;
;
;
:
SELECT P1.Supplier FROM partCost P1, partCost P2 WHERE P1.Supplier = P2.Supplier AND P1.BasicPart P2.BasicPart
In addition to the standard relational features, LDL++ provides recursion and negation. For example, the rule allSubparts(P S) assembly(P S) allSubparts(P S) allSubparts(P S1) assembly(S1 S) computes the transitive closure of the relation assembly. The following rule computes the least cost for each basic part by exploiting negation: cheapest(P C) partCost(P C) :cheaper(P C) cheaper(P C) partCost(P C1) C1 C ;
;
;
:
;
;
; ;
;
;
;
;
; ;
;
;
;
;
rules(L; R; S; C)
:
:
frequentPatterns(A; S); frequentPatterns(R; S1 ); subset(R; A); difference(A; R; L); C = S=S1:
( 1) r
Notice, however, that such an approach, though semantically clean, is very inecient, because of the large amount of computations needed at each step2 . In [10] we propose a technique which allows a compromise between loose and tight coupling, by adopting external specialized algorithms (and hence specialized data structures), but preserving the integration with the features of the language. In such proposal, inductive computations may be considered as aggregates, so that the proposed representation formalism is unaected. However, the inductive task is performed by an external ad-hoc computational engine. Such an approach has the main advantage of ensuring ad-hoc optimizations concerning the mining task transparently and independently from the deductive engine. In our case the patterns aggregate is implemented with some typical algorithm for the computation of the association rules. (e.g., Apriori algorithm [2]). The aggregation speci cation can hence be seen as a middleware between the core algorithm and the data set (de ned by the body of the rule) against which the algorithm is applied. The rest of the section shows some examples of complex queries whithin the resulting logic language. In the following we shall refer to the table with schema and contents exempli ed in 1. Example 4. \Find patterns with at least 3 occurrences from the daily transactions of each customer": frequentPatterns(patternsh(3 S)i) transSet(D C S) transSet(D C hIi) transaction(D C I P Q) By querying frequentPatterns(F S) we obtain, among the answers, the tuples (f g 3) and (f g 3). ut ;
;
;
;
;
;
;
;
;
:
:
;
pasta ;
pasta; wine ;
Again, in LDL++ the capability of de ning set-structures (and related operations) is guaranteed by the choice construct and by XY-strati cation. 2 Practically, the aggregate computation generates 2 I sets of items, where is the set of dierent items appearing in the tuples considered during the computation. Pruning of unfrequent subsets is made at the end of the computation of all subsets. Notice, however, that clever strategies can be de ned (e.g., computation of frequent maximal patterns [3]). 1
j j
I
132
F. Giannotti and G. Manco
transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(12-2-97, transaction(13-2-97, transaction(13-2-97, transaction(13-2-97, transaction(13-2-97, transaction(13-2-97, transaction(13-2-97, transaction(15-2-97, transaction(15-2-97,
cust1, beer, 10, 10). cust1, chips, 3, 20). cust1, wine, 20, 2). cust2, wine, 20, 2). cust2, beer, 10, 10). cust2, pasta, 2, 10). cust2, chips, 3, 20). cust2, jackets, 100, 1). cust2, col shirts, 30, 3). cust3, wine, 20, 1). cust3, beer, 10, 5). cust1, chips, 3, 20). cust1, beer,10,2). cust1,pasta,2,10). cust1,chips,3,10). Table 1.
transaction(16-2-97, transaction(16-2-97, transaction(16-2-97, transaction(16-2-97, transaction(16-2-97, transaction(16-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97, transaction(18-2-97,
A sample transaction table.
cust1,jackets,120,1). cust2,wine,20,1). cust2,pasta,4,8). cust3, chips, 3, 20). cust3,col shirts,25,3). cust3,brown shirts,40,2). cust2,beer,8,12). cust2,beer,10,10). cust2,chips,3,20). cust2,chips,3,20). cust3,pasta,2,10). cust1,pasta,3,5). cust1,wine,25,1). cust1, chips, 3, 20). cust1, beer, 10, 10).
Example 5. "Find patterns with at least 3 occurrences from the transactions of each customer":

    frequentPatterns(patterns⟨(3, S)⟩) ← transSet(C, S).
    transSet(C, ⟨I⟩) ← transaction(D, C, I, P, Q).

Differently from the previous example, where transactions were grouped by customer and by date, the above rules group transactions by customer only. We then compute the frequent patterns on the restructured transactions

    transSet(cust1, {beer, chips, jackets, pasta, wine})
    transSet(cust2, {beer, chips, col shirts, jackets, pasta, wine})
    transSet(cust3, {beer, brown shirts, chips, col shirts, pasta, wine})

obtaining, e.g., the pattern ({beer, chips, pasta, wine}, 3). □

Example 6. "Find association rules with a minimum support of 3 from the daily transactions of each customer". This can be formalized by rule (r1). Hence, by querying rules(L, R, S, C), we obtain the association rule ({pasta}, {wine}, 3, 0.75). We can further postprocess the results of the aggregation query. For example, the query rules({A, B}, {beer}, S, C) computes the "two-to-one" rules whose consequent is the beer item. An answer is ({chips, wine}, {beer}, 3, 1). □

Example 7. The query "find patterns from the daily transactions of high-spending customers (i.e., customers with a total expense of at least 70 and at most 3 items bought), such that each pattern has at least 3 occurrences" can be formalized as follows:

    frequentPatterns(patterns⟨(3, S)⟩) ← transSet(D, C, S, I, V), V > 70, I ≤ 3.
    transSet(D, C, ⟨I⟩, count⟨I⟩, sum⟨V⟩) ← transaction(D, C, I, P, Q), V = P × Q.

The query frequentPatterns(F, S) returns the patterns ({beer}, 3), ({chips}, 4) and ({beer, chips}, 3) that characterize the class of high-spending customers. □
Example 8 ([10]). The query "find patterns from the daily transactions of each customer, at each generalization level, such that each pattern has a given occurrency depending on the generalization level" is formalized as follows:

    itemsGeneralization(0, D, C, I, P, Q) ← transaction(D, C, I, P, Q).
    itemsGeneralization(I + 1, D, C, AI, P, Q) ← itemsGeneralization(I, D, C, S, P, Q), category(S, AI).
    itemsGeneralization(I, D, C, ⟨S⟩) ← itemsGeneralization(I, D, C, S, P, Q).
    freqAtLevel(I, patterns⟨(Supp, S)⟩) ← itemsGeneralization(I, D, C, S), suppAtLevel(I, Supp).

where the suppAtLevel predicate tunes the support threshold at a given level of the item hierarchy. The query is the result of a tighter coupling of data preprocessing and result interpretation and postprocessing: we investigate the behaviour of rules over an item hierarchy. Suppose that the following tuples define a part-of hierarchy:

    category(beer, drinks)          category(wine, drinks)
    category(pasta, food)           category(chips, food)
    category(jackets, wear)         category(col shirts, wear)
    category(brown shirts, wear)

Then, by querying freqAtLevel(I, F, S) we obtain, e.g., (0, {beer, chips, wine}, 3), (1, {food}, 9), (1, {drinks}, 7) and (1, {drinks, food}, 6). □
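A small Python sketch, not from the paper, of the item-generalization step used by Example 8: each item of a transaction set is mapped to its ancestor in the part-of hierarchy, after which the same frequent-pattern counting as before can be applied per level. The dictionary and function names are illustrative assumptions.

```python
# Hypothetical sketch: levelwise generalization of items via the category/2
# part-of hierarchy, in the spirit of Example 8.
category = {"beer": "drinks", "wine": "drinks", "pasta": "food",
            "chips": "food", "jackets": "wear", "col shirts": "wear",
            "brown shirts": "wear"}

def generalize(itemsets, level):
    """Map every item of every transaction set to its ancestor `level` steps up."""
    out = []
    for s in itemsets:
        gen = set(s)
        for _ in range(level):
            gen = {category.get(item, item) for item in gen}
        out.append(gen)
    return out

# freqAtLevel(1, ...) then corresponds to
# frequent_patterns(generalize(trans_sets(transactions), 1), min_occ)
```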
Example 9. The query "find rules that are interestingly preserved by drilling down an item hierarchy" is formalized as follows:

    rulesAtLevel(I, L, R, S, C) ← freqAtLevel(I, A, S), freqAtLevel(I, R, S1), subset(R, A), difference(A, R, L), C = S/S1.
    preservedRules(L, R, S, C) ← rulesAtLevel(I + 1, L1, R1, S1, C1), rulesAtLevel(I, L, R, S, C), setPartOf(L, L1), setPartOf(R, R1), C > C1.

Preserved rules are defined as those rules valid at any generalization level, such that their confidence is greater than that of their generalization³. □
5 Final Remark

We have shown that the mechanism of user-defined aggregates is powerful enough to model the notion of inductive database, and to specify flexible query answering capabilities.

³ The choice of such an interest measure is clearly arbitrary and subjective. Other significant interest measures can be specified (e.g., the interest measure defined in [16]).
A major limitation of the proposal is efficiency: it has been experimentally shown that specialized algorithms (on specialized data structures) perform better than database-oriented approaches (see, e.g., [1]). Hence, in order to improve performance considerably, a thorough modification of the underlying database abstract machine should be investigated. Notice in fact that, with respect to ad-hoc algorithms, when the programs specified in the previous sections are executed on a Datalog++ abstract machine, the only available optimizations for such programs are the traditional deductive database optimizations [8]. Such optimization techniques, however, need to be further improved by adding ad-hoc optimizations. For the purposes of this paper, we have accepted a reasonable worsening in performance, by describing the aggregation formalism as a semantically clean representation formalism and demanding the computational effort to external ad-hoc engines [10]. This, however, is only a partial solution to the problem, in that more refined optimization techniques can be adopted. For example, in Example 6, we can optimize the query by observing that directly computing rules with three items (even by counting the transactions with at least three items) is less expensive than computing the whole set of association rules and then selecting those with three items. Some interesting steps in this direction have been made: e.g., [13] proposes an approach to the optimization of Datalog aggregation-based queries, together with a detailed discussion of the optimized computation of constrained association rules. However, the computational feasibility of the proposed approach in more general cases is an open problem.
References

1. R. Agrawal, S. Sarawagi, and S. Thomas. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. In Procs. of ACM-SIGMOD'98, 1998.
2. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, 1994.
3. R. Bayardo. Efficiently Mining Long Patterns from Databases. In Proc. ACM Conf. on Management of Data (SIGMOD98), pages 85-93, 1998.
4. J-F. Boulicaut, M. Klemettinen, and H. Mannila. Querying Inductive Databases: A Case Study on the MINE RULE Operator. In Proc. 2nd European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD98), volume 1510 of Lecture Notes in Computer Science, pages 194-202, 1998.
5. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, 1996.
6. F. Giannotti, D. Pedreschi, and C. Zaniolo. Semantics and Expressive Power of Non Deterministic Constructs for Deductive Databases. To appear in Journal of Logic Programming.
7. F. Giannotti and G. Manco. Querying inductive databases via logic-based user-defined aggregates. Technical report, CNUCE-CNR, June 1999. Available at http://www-kdd.di.unipi.it.
8. F. Giannotti, G. Manco, M. Nanni, and D. Pedreschi. Nondeterministic, Nonmonotonic Logic Databases. Technical report, Department of Computer Science, Univ. Pisa, September 1998. Submitted for publication.
9. F. Giannotti, G. Manco, M. Nanni, and D. Pedreschi. Query Answering in Nondeterministic, Nonmonotonic, Logic Databases. In Procs. of the Workshop on Flexible Query Answering, number 1395 in Lecture Notes in Artificial Intelligence, March 1998.
10. F. Giannotti, G. Manco, M. Nanni, D. Pedreschi, and F. Turini. Integration of deduction and induction for mining supermarket sales data. In Proceedings of the International Conference on Practical Applications of Knowledge Discovery (PADD99), April 1999.
11. J. Han. Towards On-Line Analytical Mining in Large Databases. SIGMOD Record, 27(1):97-107, 1998.
12. H. Mannila. Inductive databases and condensed representations for data mining. In International Logic Programming Symposium, pages 21-30, 1997.
13. R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory Mining and Pruning Optimizations of Constrained Associations Rules. In Proc. ACM Conf. on Management of Data (SIGMOD98), June 1998.
14. R. Meo, G. Psaila, and S. Ceri. A New SQL-Like Operator for Mining Association Rules. In Proceedings of the Conference on Very Large Databases, pages 122-133, 1996.
15. W. Shen, K. Ong, B. Mitbander, and C. Zaniolo. Metaqueries for Data Mining. In Advances in Knowledge Discovery and Data Mining, pages 375-398. AAAI Press/The MIT Press, 1996.
16. R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proc. of the 21st Int'l Conference on Very Large Databases, 1995.
17. C. Zaniolo, N. Arni, and K. Ong. Negation and Aggregates in Recursive Rules: The LDL++ Approach. In Proc. 3rd Int. Conf. on Deductive and Object-Oriented Databases (DOOD93), volume 760 of Lecture Notes in Computer Science, 1993.
18. C. Zaniolo and H. Wang. Logic-Based User-Defined Aggregates for the Next Generation of Database Systems. In The Logic Programming Paradigm: Current Trends and Future Directions. Springer Verlag, 1998.
Peculiarity Oriented Multi-database Mining

Ning Zhong¹, Y.Y. Yao², and Setsuo Ohsuga³

¹ Dept. of Computer Science and Sys. Eng., Yamaguchi University
² Dept. of Computer Science, University of Regina
³ Dept. of Information and Computer Science, Waseda University
Abstract. The paper proposes a way of mining peculiarity rules from multiple statistical and transaction databases. We introduce peculiarity rules as a new type of association rules, which can be discovered from a relatively small number of peculiar data by searching the relevance among the peculiar data. We argue that peculiarity rules represent a typically unexpected, interesting regularity hidden in statistical and transaction databases. We describe how to mine peculiarity rules in the multi-database environment and how to use the RVER (Reverse Variant Entity-Relationship) model to represent the result of multi-database mining. Our approach is based on the database reverse engineering methodology and granular computing techniques.

Keywords: Multi-Database Mining, Peculiarity Oriented, Relevance, Database Reverse Engineering, Granular Computing (GrC).
1 Introduction

Recently, it has been recognized in the KDD (Knowledge Discovery and Data Mining) community that multi-database mining is an important research topic [3, 14, 19]. So far most of the KDD methods that have been developed work at the single universal relation level. Although, theoretically, any multi-relational database can be transformed into a single universal relation, practically this can lead to many issues such as universal relations of unmanageable size, infiltration of uninteresting attributes, loss of useful relation names, unnecessary join operations, and inconvenience for distributed processing. In particular, some concepts, regularities, causal relationships, and rules cannot be discovered if we just search a single database, since the knowledge basically hides in multiple databases. Multi-database mining involves many related topics including interestingness checking, relevance, database reverse engineering, granular computing, and distributed data mining. Liu et al. proposed an interesting method for relevance measure and an efficient implementation for identifying relevant databases as the first step for multi-database mining [10]. Ribeiro et al. described a way of extending the INLEN system for multi-database mining by the incorporation of primary and foreign keys as well as the development and processing of knowledge segments [11]. Wrobel extended the concept of foreign keys into foreign links because multi-database mining is also interested in getting to non-key attributes
[14]. Aronis et al. introduced a system called WoRLD that uses spreading activation to enable inductive learning from multiple tables in multiple databases spread across the network [3]. Database reverse engineering is a research topic that is closely related to multi-database mining. The objective of database reverse engineering is to obtain the domain semantics of legacy databases in order to provide the meaning of their executable schemas' structure [6]. Although database reverse engineering has been investigated recently, it has not been researched in the context of multi-database mining. In this paper we take a unified view of multi-database mining and database reverse engineering. We use the RVER (Reverse Variant Entity-Relationship) model to represent the result of multi-database mining. The RVER model can be regarded as a variant of semantic networks, which are a well-known method for knowledge representation. From this point of view, multi-database mining can be regarded as a kind of database reverse engineering. A challenge in multi-database mining is semantic heterogeneity among multiple databases, since usually no explicit foreign key relationships exist among them. Hence, the key issue is how to find/create the relevance among different databases. In our methodology, we use granular computing techniques based on semantics, approximation, and abstraction [7, 18]. Granular computing techniques provide a useful tool to find/create the relevance among different databases by changing information granularity. In this paper, we propose a way of mining peculiarity rules from multiple statistical and transaction databases, which is based on the database reverse engineering methodology and granular computing techniques.
2 Peculiarity Rules and Peculiar Data

In this section, we first define peculiarity rules as a new type of association rules and then describe a way of finding peculiarity rules.

2.1 Association Rules vs. Peculiarity Rules

Association rules are an important class of regularity hidden in transaction databases [1, 2]. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y. So far, two categories of association rules, the general rule and the exception rule, have been investigated [13]. A general rule is a description of a regularity for numerous objects and represents a well-known fact with common sense, while an exception rule is for a relatively small number of objects and represents exceptions to the well-known fact. Usually, the exception rule should be associated with a general rule as a set of rule pairs. For example, the rule "using a seat belt is risky for a child" represents an exception to the general rule with common sense "using a seat belt is safe". The peculiarity rules introduced in this paper can be regarded as a new type of association rules for a different purpose. A peculiarity rule is discovered from
the peculiar data by searching the relevance among the peculiar data. Roughly speaking, a datum is peculiar if it represents a peculiar case described by a relatively small number of objects and is very different from the other objects in a data set. Although it looks like the exception rule from the viewpoint of describing a relatively small number of objects, the peculiarity rule represents a well-known fact with common sense, which is a feature of the general rule. We argue that peculiarity rules are a typical regularity hidden in statistical and transaction databases. Sometimes the general rules that represent well-known facts with common sense cannot be found from numerous statistical or transaction data, or, even if they can be found, the rules may be uninteresting to the user, since in most organizations data are rarely collected/stored in a database specifically for the purpose of mining knowledge. Hence, the evaluation of interestingness (including surprisingness, unexpectedness, peculiarity, usefulness, novelty) should be done before and/or after knowledge discovery [5, 9, 12]. In particular, unexpected (common sense) relationships/rules may be hidden in a relatively small number of data. Thus, we may focus on some interesting data (the peculiar data), and then find more novel and interesting rules (peculiarity rules) from the data. For example, the following rules are peculiarity rules that can be discovered from a relation called Japan-Geography (see Table 1) in a Japan-Survey database:
    rule1: ArableLand(large) & Forest(large) → PopulationDensity(low).
    rule2: ArableLand(small) & Forest(small) → PopulationDensity(high).

Table 1. Japan-Geography

Region   | Area     | Population | PopulationDensity | PeasantFamilyN | ArableLand | Forest | ...
Hokkaido | 82410.58 | 5656       | 67.8              | 93             | 1209       | 5355   | ...
Aomori   | 9605.45  | 1506       | 156.8             | 87             | 169        | 623    | ...
...      | ...      | ...        | ...               | ...            | ...        | ...    | ...
Tiba     | 5155.64  | 5673       | 1100.3            | 116            | 148        | 168    | ...
Tokyo    | 2183.42  | 11610      | 5317.2            | 21             | 12         | 80     | ...
Osaka    | 1886.49  | 8549       | 4531.6            | 39             | 18         | 59     | ...
...      | ...      | ...        | ...               | ...            | ...        | ...    | ...
In order to discover the rules, we first need to search for the peculiar data in the relation Japan-Geography. From Table 1, we can see that the values of the attributes ArableLand and Forest for Hokkaido (i.e. 1209 Kha and 5355 Kha) and for Tokyo and Osaka (i.e. 12 Kha, 18 Kha, and 80 Kha, 59 Kha) are very different from the other values in these attributes. Hence, these values are regarded as the peculiar data. Furthermore, rule1 and rule2 are generated by searching the relevance among the peculiar data. Note that we use the qualitative representation for
the quantitative values in the above rules. The transformation of quantitative to qualitative values can be done by using the following background knowledge on information granularity:

Basic granules:
    bg1 = {high, low};   bg2 = {large, small};   bg3 = {many, little};
    bg4 = {far, close};  bg5 = {long, short};    ...

Specific granules:
    biggest-cities = {Tokyo, Osaka};   kanto-area = {Tokyo, Tiba, Saitama, ...};
    kansei-area = {Osaka, Kyoto, Nara, ...};   ...
That is, ArableLand = 1209, Forest = 5355 and PopulationDensity = 67.8 for Hokkaido are replaced by the granules "large" and "low", respectively. Furthermore, Tokyo and Osaka are regarded as a neighborhood (i.e. the biggest cities in Japan). Hence, rule2 is generated by using the peculiar data for both Tokyo and Osaka as well as their granules (i.e. "small" for ArableLand and Forest, and "high" for PopulationDensity).

2.2 Finding the Peculiar Data
There are many ways of finding the peculiar data. In this section, we describe an attribute-oriented method. Let X = {x1, x2, ..., xn} be a data set related to an attribute in a relation, where n is the number of different values in the attribute. The peculiarity of xi can be evaluated by the Peculiarity Factor, PF(xi):

    PF(xi) = Σ_{j=1}^{n} sqrt( N(xi, xj) )                         (1)
It evaluates whether xi occurs in relatively small numbers and is very different from the other data xj by calculating the sum of the square root of the conceptual distance between xi and xj. The reason why the square root is used in Eq. (1) is that we prefer to weight nearer distances for a relatively large number of data, so that the peculiar data can be found from a relatively small number of data. The major merits of the method are:

- It can handle both continuous and symbolic attributes based on a unified semantic interpretation;
- Background knowledge represented by binary neighborhoods can be used to evaluate the peculiarity if such background knowledge is provided by a user.

If X is a data set of a continuous attribute and no background knowledge is available, in Eq. (1),

    N(xi, xj) = |xi − xj|.                                          (2)

Table 2 shows an example of the calculation. On the other hand, if X is a data set of a symbolic attribute and/or the background knowledge for representing the
conceptual distances between xi and xj is provided by a user, the peculiarity factor is calculated with the conceptual distances N(xi, xj). Table 3 shows an example in which the binary neighborhoods shown in Table 4 are used as the background knowledge for representing the conceptual distances between different types of restaurants [7, 15]. However, all the conceptual distances default to 1 if background knowledge is not available.

Table 2. An example of the peculiarity factor for a continuous attribute

Region     | ArableLand | PF
Hokkaido   | 1209       | 134.1
Tokyo      | 12         | 60.9
Osaka      | 18         | 60.3
Yamaguchi  | 162        | 60.5
Okinawa    | 147        | 59.4

Table 3. An example of the peculiarity factor for a symbolic attribute

Restaurant | Type     | PF
Wendy      | American | 2.2
Le Chef    | French   | 2.6
Great Wall | Chinese  | 1.6
Kiku       | Japanese | 1.6
South Sea  | Chinese  | 1.6

Table 4. The binary neighborhoods for a symbolic attribute

Type     | Type     | N
Chinese  | Japanese | 1
Chinese  | American | 3
Chinese  | French   | 4
American | French   | 2
American | Japanese | 3
French   | Japanese | 3
After the evaluation of the peculiarity, the peculiar data are elicited by using a threshold value,

    threshold = mean of PF(xi) + α × variance of PF(xi)             (3)

where α can be specified by a user. That is, if PF(xi) is over the threshold value, xi is a peculiar datum. Based on the preparation stated above, the process of finding the peculiar data can be outlined as follows:

Step 1. Calculate the peculiarity factor PF(xi) in Eq. (1) for all values in a data set (i.e. an attribute).
Step 2. Calculate the threshold value in Eq. (3) based on the peculiarity factors obtained in Step 1.
Step 3. Select the data that are over the threshold value as the peculiar data.
Step 4. If the current peculiarity level is enough, then go to Step 6.
Step 5. Remove the peculiar data from the data set, thus obtaining a new data set; then go back to Step 1.
Step 6. Change the granularity of the peculiar data by using background knowledge on information granularity if such background knowledge is available.
Furthermore, the process can be carried out in a parallel-distributed mode for multiple attributes, relations and databases, since this is an attribute-oriented finding method.
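A minimal Python sketch, not part of the original paper, of the attribute-oriented search of Section 2.2 for a continuous attribute, using Eqs. (1)-(3); the coefficient name alpha and the function names are illustrative assumptions.

```python
# Hypothetical sketch: peculiarity factors and peculiar-data selection
# for a continuous attribute, with N(xi, xj) = |xi - xj| (Eq. 2).
from math import sqrt

def peculiarity_factors(values):
    """PF(xi) = sum_j sqrt(N(xi, xj)) with N(xi, xj) = |xi - xj| (Eqs. 1-2)."""
    return [sum(sqrt(abs(x - y)) for y in values) for x in values]

def peculiar_data(values, alpha=1.0):
    """Select values whose PF exceeds mean(PF) + alpha * variance(PF) (Eq. 3)."""
    pf = peculiarity_factors(values)
    mean = sum(pf) / len(pf)
    variance = sum((p - mean) ** 2 for p in pf) / len(pf)
    threshold = mean + alpha * variance
    return [x for x, p in zip(values, pf) if p > threshold]

# ArableLand column of Table 2: Hokkaido stands out as peculiar.
arable_land = [1209, 12, 18, 162, 147]
print(peculiarity_factors(arable_land))        # ~[134.1, 60.9, 60.3, 60.5, 59.4]
print(peculiar_data(arable_land, alpha=0.01))  # -> [1209] for a small alpha
```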
2.3 Relevance among the Peculiar Data

A peculiarity rule is discovered from the peculiar data by searching the relevance among the peculiar data. Let X(x) and Y(y) be the peculiar data found in two attributes X and Y respectively. We deal with the following two cases:

- If X(x) and Y(y) are found in the same relation, the relevance between X(x) and Y(y) is evaluated by the following equation:

      R1 = P1(X(x) | Y(y)) P2(Y(y) | X(x)).                         (4)

  That is, the larger the product of the probabilities P1 and P2, the stronger the relevance between X(x) and Y(y).
- If X(x) and Y(y) are found in two different relations, we need to use a value (or its granule) of a key (or foreign key/link) as the relevance factor, K(k), to find the relevance between X(x) and Y(y). Thus, the relevance between X(x) and Y(y) is evaluated by the following equation:

      R2 = P1(K(k) | X(x)) P2(K(k) | Y(y)).                         (5)

Furthermore, Eq. (4) and Eq. (5) are suitable for handling more than two peculiar data found in more than two attributes if X(x) (or Y(y)) is a granule of the peculiar data.
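A small Python sketch, not from the paper, of the relevance measure R1 of Eq. (4), computed from empirical conditional probabilities over the rows of one relation; the row values and the Yamaguchi Forest figure are illustrative assumptions.

```python
# Hypothetical sketch: relevance R1 of Eq. (4) between peculiar values found
# in two attributes X and Y of the same relation.
def relevance_r1(rows, x_attr, y_attr, x_peculiar, y_peculiar):
    """R1 = P(X in x_peculiar | Y in y_peculiar) * P(Y in y_peculiar | X in x_peculiar)."""
    x_hits = [r for r in rows if r[x_attr] in x_peculiar]
    y_hits = [r for r in rows if r[y_attr] in y_peculiar]
    both = [r for r in x_hits if r[y_attr] in y_peculiar]
    if not x_hits or not y_hits:
        return 0.0
    return (len(both) / len(y_hits)) * (len(both) / len(x_hits))

rows = [
    {"Region": "Hokkaido", "ArableLand": 1209, "Forest": 5355},
    {"Region": "Tokyo", "ArableLand": 12, "Forest": 80},
    {"Region": "Osaka", "ArableLand": 18, "Forest": 59},
    {"Region": "Yamaguchi", "ArableLand": 162, "Forest": 437},  # Forest value illustrative
]
print(relevance_r1(rows, "ArableLand", "Forest", {1209}, {5355}))  # -> 1.0
```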
3 Mining Peculiarity Rules in Multi-Database

Building on the preparation in Section 2, this section describes a methodology for mining peculiarity rules in a multi-database environment.
3.1 Multi-Database Mining at Different Levels

Generally speaking, the task of multi-database mining can be divided into two levels:
1. Mining from multiple relations in a database.
2. Mining from multiple databases.

First, we need to extend the concept of foreign keys into foreign links, because we are also interested in getting to non-key attributes for data mining from multiple relations in a database. A major task is to find the peculiar data in multiple relations for a given discovery task where foreign link relationships exist. In other words, our task is to select the n relations that contain the peculiar data among m relations (m ≥ n) with foreign links.
We again use the Japan-Survey database as an example. There are many relations (tables) in this database, such as Japan-Geography, Economy, Alcoholic-Sales, Crops, Livestock-Poultry, Forestry, Industry, and so on. Table 5 and Table 6 show two of them as examples (Table 1, Japan-Geography, is another one). The method for selecting n relations among m relations can be briefly described as follows.

Table 5. Economy

Region   | PrimaryInd | SecondaryInd | TertiaryInd | ...
Hokkaido | 9057       | 34697        | 96853       | ...
Aomori   | 2597       | 6693         | 22722       | ...
...      | ...        | ...          | ...         | ...
Tiba     | 3389       | 44257        | 76277       | ...
Tokyo    | 839        | 187481       | 484294      | ...
Osaka    | 397        | 99482        | 209492      | ...
...      | ...        | ...          | ...         | ...
Table 6. Alcoholic-Sales

Region   | Sake   | Beer   | ...
Hokkaido | 42560  | 257125 | ...
Aomori   | 18527  | 60425  | ...
...      | ...    | ...    | ...
Tiba     | 47753  | 205168 | ...
Tokyo    | 150767 | 838581 | ...
Osaka    | 100080 | 577790 | ...
...      | ...    | ...    | ...
Step 1. Focus on a relation as the main table and find the peculiar data in this table. Then elicit the peculiarity rules from the peculiar data by using the methods stated in Sections 2.2 and 2.3. For example, if we select the relation Japan-Geography shown in Table 1 as the main table, rule1 and rule2 stated in Section 2.1 are a result of this step.
Step 2. Find the value(s) of the focused key corresponding to the peculiarity rule mined in Step 1, and change the granularity of the value(s) of the focused key if background knowledge on information granularity is available. For example, "Tokyo" and "Osaka", which are values of the key attribute Region, can be changed into a granule, "biggest cities".
Step 3. Find the peculiar data in the other relations (or databases) corresponding to the value (or its granule) of the focused key.
Step 4. Select the n relations that contain the peculiar data among m relations (m ≥ n). In other words, we just select the relations that contain peculiar data relevant to the peculiarity rules mined from the main table.
Here we need to find the related relations by using foreign keys (or foreign links). For example, since the (foreign) key attribute for the relations in the Japan-Survey database is Region, and the key value Region = Hokkaido is related to the mined rule1, we search for the peculiar data in the other relations that are relevant to rule1 by using Region = Hokkaido as a relevance factor. The basic method for searching for the peculiar data is similar to the one stated in Section 2.2. However, we just check the peculiarity of the data that are relevant to the value (or its granule) of the focused key in the relations. Furthermore, selecting n relations among m relations can be done in a parallel-distributed cooperative mode. Let "|" denote a relevance among the peculiar data (not a rule for now; it can be used to induce rules as stated in Section 3.2). Thus, we can see that peculiar data are found in the relations Crops, Livestock-Poultry, Forestry, and Economy, corresponding to the value of the focused key, Region = Hokkaido:

In the relation Crops:
    Region(Hokkaido) | (WheatOutput(high) & RiceOutput(high)).
In the relation Livestock-Poultry:
    Region(Hokkaido) | (MilchCow(many) & MeatBull(many) & MilkOutput(many) & Horse(many)).
In the relation Forestry:
    Region(Hokkaido) | (TotalOutput(high) & SourceOutput(high)).
In the relation Economy:
    Region(Hokkaido) | PrimaryIndustry(high).

Hence the relations Crops, Livestock-Poultry, Forestry, and Economy are selected. On the other hand, peculiar data are also found in the relations Alcoholic-Sales and Economy, corresponding to the value of the focused key, Region = biggest-cities:

In the relation Alcoholic-Sales:
    Region(biggest-cities) | (Sake-sales(high) & RiceOutput(high)).
In the relation Economy:
    Region(biggest-cities) | TertiaryIndustry(high).

Furthermore, the methodology stated above can be extended for mining from multiple databases. For example, if we find that the turnover dropped markedly on some day in a supermarket transaction database, we may not be able to understand why. However, if we search a weather database, we may find that there was a violent typhoon on the day on which the turnover of the supermarket dropped. Hence, we can discover the reason why the turnover dropped. A challenge in multi-database mining is semantic heterogeneity among multiple databases, since usually no explicit foreign key relationships exist among them. Hence, the key issue is how to find/create the relevance among different databases. In our methodology, we use granular computing techniques based on semantics, approximation, and abstraction to solve this issue [7, 18].
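A hypothetical Python sketch, not from the paper, of Steps 3-4 of the method above: each candidate relation is kept only if its rows for the focused key value contain peculiar data. It reuses peculiarity_factors() from the earlier sketch; the relation format, key names, and threshold choice are illustrative assumptions.

```python
# Hypothetical sketch: select the relations whose numeric attributes are
# peculiar for the focused key value (e.g. Region = "Hokkaido").
def relations_with_peculiar_key_value(relations, key, key_value, alpha=0.01):
    """relations: {name: list of row dicts}. Return the names of relations in
    which a row with the focused key value carries a peculiar attribute value."""
    selected = []
    for name, rows in relations.items():
        for attr in rows[0]:
            if attr == key or not isinstance(rows[0][attr], (int, float)):
                continue
            values = [r[attr] for r in rows]
            pf = peculiarity_factors(values)
            mean = sum(pf) / len(pf)
            variance = sum((p - mean) ** 2 for p in pf) / len(pf)
            threshold = mean + alpha * variance
            if any(r[key] == key_value and p > threshold for r, p in zip(rows, pf)):
                selected.append(name)
                break
    return selected
```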
3.2 Representation and Re-learning
We use the RVER (Reverse Variant Entity-Relationship) model to represent the peculiar data and the conceptual relationships among the peculiar data discovered from multiple relations (databases). Figure 1 shows the general framework of the RVER model. The RVER model can be regarded as a variant of semantic networks, which are a well-known method for knowledge representation. From this point of view, multi-database mining can be regarded as a kind of database reverse engineering. Figure 2 shows a result mined from the Japan-Survey database; Figure 3 shows the result mined from two databases on the supermarkets of Yamaguchi prefecture and the weather of Japan. The point in which the RVER model differs from an ordinary ER model is that we only represent the attributes that are relevant to the peculiar data, and the related peculiar data (or their granules), in the RVER model. Thus, the RVER model provides all interesting information relevant to some focusing (e.g. Region = Hokkaido and Region = biggest-cities in the Japan-Geography relation) for learning advanced rules among multiple relations (databases). Re-learning means learning advanced rules (e.g., if-then rules and first-order rules) from the RVER model. For example, the following rules can be learned from the RVER models shown in Figure 2 and Figure 3:

    rule3: ArableLand(large) & Forest(large) → PrimaryIndustry(high).
    rule4: Weather(typhoon) → Turnover(very-low).
Fig. 1. The RVER model (general framework, linking the focused key value to a peculiarity rule)
Fig. 2. The RVER model related to Region = Hokkaido (nodes include PopulationDensity(low) and ArableLand(large) & Forest(large))

Fig. 3. The RVER model mined from two databases
4 Conclusion

We presented a way of mining peculiarity rules from multiple statistical and transaction databases. The peculiarity rules are defined as a new type of association rules. We described a variant of the ER model and semantic networks as a way to represent peculiar data and their relationships among multiple relations (databases). We can change the granularity of the peculiar data dynamically in the discovery process. Several databases, such as Japan-Survey, web-log, weather, and supermarket databases, have been or are being tested with our approach.
Since this project is very new, we have just finished the first step. Our future work includes developing a systematic method to mine rules from multiple databases where there are no explicit foreign key (link) relationships, and to induce the advanced rules from the RVER models discovered from multiple databases.
References

1. Agrawal, R. et al. "Database Mining: A Performance Perspective", IEEE Trans. Knowl. Data Eng., 5(6) (1993) 914-925.
2. Agrawal, R. et al. "Fast Discovery of Association Rules", Advances in Knowledge Discovery and Data Mining, AAAI Press (1996) 307-328.
3. Aronis, J.M. et al. "The WoRLD: Knowledge Discovery from Multiple Distributed Databases", Proc. 10th International Florida AI Research Symposium (FLAIRS-97) (1997) 337-341.
4. Fayyad, U.M., Piatetsky-Shapiro, G. et al. (eds.) Advances in Knowledge Discovery and Data Mining. AAAI Press (1996).
5. Freitas, A.A. "On Objective Measures of Rule Surprisingness", J. Zytkow and M. Quafafou (eds.) Principles of Data Mining and Knowledge Discovery, LNAI 1510, Springer-Verlag (1998) 1-9.
6. Chiang, R.H.L. et al. "A Framework for the Design and Evaluation of Reverse Engineering Methods for Relational Databases", Data & Knowledge Engineering, Vol. 21 (1997) 57-77.
7. Lin, T.Y. "Granular Computing on Binary Relations 1: Data Mining and Neighborhood Systems", L. Polkowski and A. Skowron (eds.) Rough Sets in Knowledge Discovery 1, Studies in Fuzziness and Soft Computing, Vol. 18, Physica-Verlag (1998) 107-121.
8. Lin, T.Y., Zhong, N., Dong, J., and Ohsuga, S. "Frameworks for Mining Binary Relations in Data", L. Polkowski and A. Skowron (eds.) Rough Sets and Current Trends in Computing, LNAI 1424, Springer-Verlag (1998) 387-393.
9. Liu, B., Hsu, W., and Chen, S. "Using General Impressions to Analyze Discovered Classification Rules", Proc. Third International Conference on Knowledge Discovery and Data Mining (KDD-97), AAAI Press (1997) 31-36.
10. Liu, H., Lu, H., and Yao, J. "Identifying Relevant Databases for Multidatabase Mining", X. Wu et al. (eds.) Research and Development in Knowledge Discovery and Data Mining, LNAI 1394, Springer-Verlag (1998) 210-221.
11. Ribeiro, J.S., Kaufman, K.A., and Kerschberg, L. "Knowledge Discovery from Multiple Databases", Proc. First Inter. Conf. on Knowledge Discovery and Data Mining (KDD-95), AAAI Press (1995) 240-245.
12. Silberschatz, A. and Tuzhilin, A. "What Makes Patterns Interesting in Knowledge Discovery Systems", IEEE Trans. Knowl. Data Eng., 8(6) (1996) 970-974.
13. Suzuki, E. "Autonomous Discovery of Reliable Exception Rules", Proc. Third Inter. Conf. on Knowledge Discovery and Data Mining (KDD-97), AAAI Press (1997) 259-262.
14. Wrobel, S. "An Algorithm for Multi-relational Discovery of Subgroups", J. Komorowski and J. Zytkow (eds.) Principles of Data Mining and Knowledge Discovery, LNAI 1263, Springer-Verlag (1997) 367-375.
15. Yao, Y.Y. "Granular Computing using Neighborhood Systems", Roy, R., Furuhashi, T., and Chawdhry, P.K. (eds.) Advances in Soft Computing: Engineering Design and Manufacturing, Springer-Verlag (1999) 539-553.
16. Yao, Y.Y. and Zhong, N. "An Analysis of Quantitative Measures Associated with Rules", Zhong, N. and Zhou, L. (eds.) Methodologies for Knowledge Discovery and Data Mining, LNAI 1574, Springer-Verlag (1999) 479-488.
17. Yao, Y.Y. and Zhong, N. "Potential Applications of Granular Computing in Knowledge Discovery and Data Mining", Proc. 5th International Conference on Information Systems Analysis and Synthesis (IASA'99), invited session on Intelligent Data Mining and Knowledge Discovery (1999) (in press).
18. Zadeh, L.A. "Toward a Theory of Fuzzy Information Granulation and Its Centrality in Human Reasoning and Fuzzy Logic", Fuzzy Sets and Systems, Elsevier, 90 (1997) 111-127.
19. Zhong, N. and Yamashita, S. "A Way of Multi-Database Mining", Proc. of the IASTED International Conference on Artificial Intelligence and Soft Computing (ASC'98), IASTED/ACTA Press (1998) 384-387.
Knowledge Discovery in Medical Multi-databases: A Rough Set Approach

Shusaku Tsumoto

Department of Medicine Informatics, Shimane Medical University, School of Medicine, 89-1 Enya-cho, Izumo City, Shimane 693-8501, Japan
E-mail: [email protected]

Abstract. Since the early 1980's, due to the rapid growth of hospital information systems (HIS), electronic patient records have been stored as huge databases at many hospitals. One of the most important problems is that the rules induced from one hospital may be different from those induced from other hospitals, which is very difficult even for medical experts to interpret. In this paper, we introduce rough set based analysis in order to solve this problem. Rough set based analysis interprets the conflicts between rules from the viewpoint of supporting sets, which are closely related with Dempster-Shafer theory (evidence theory), and outputs an interpretation of rules with evidential degree. The proposed method was evaluated on two medical databases, the experimental results of which show that several interesting relations between rules, including interpretations of differences and the resolution of conflicts between induced rules, are discovered.
1 Introduction

Since the early 1980's, due to the rapid growth of hospital information systems (HIS), electronic patient records have been stored as huge databases at many hospitals. One of the most important problems is that the rules induced from one hospital may be different from those induced from other hospitals, which is very difficult even for medical experts to interpret. In this paper, we introduce rough set based analysis in order to solve this problem. Rough set based analysis interprets the conflicts between rules from the viewpoint of supporting sets, which are closely related with Dempster-Shafer theory (evidence theory), and outputs an interpretation of rules with evidential degree. The proposed method was evaluated on two medical databases, the experimental results of which show that several interesting relations between rules, including interpretations of differences and the resolution of conflicts between induced rules, are discovered. The paper is organized as follows: Section 2 gives a brief description of distributed data analysis. Sections 3 and 4 discuss the definition of rules and the rough set model of distributed data analysis. Section 5 gives experimental results. Section 6 discusses the problems of our work and related work, and finally, Section 7 concludes the paper.
2 Distributed Data Analysis
In distributed rule induction, the following three cases should be considered. (1) One database induces rules whose attribute-value pairs do not appear in other databases (independent type). (2) Rules induced from one database overlap with rules induced from other databases (boundary type). (3) Rules induced from one database are described by a subset of the attribute-value pairs which are used in rules induced from other databases (subcategory type). In the first case, it would be very difficult to interpret all the results because the databases do not share the regularities with other databases. In the second case, shared information will be much more important than other information. In the third case, subset information will be important. It is notable that this classification of distributed data analysis can also be applied to the discussion on collaboration between domain experts and rule discovery methods: empirical studies on medical data mining [2,11] show that medical experts try to interpret unexpected patterns with their domain knowledge, which can be viewed as hypothesis generation. In [2], gender is an attribute unexpected by experts, which led to a new hypothesis that body size will be closely related with complications of angiography. In [11,12], gender and age are unexpected attributes, which triggered re-examination of datasets and generated a hypothesis that immunological factors will be closely related with meningitis. These actions can be summarized into the following three patterns:

1. If induced patterns are completely equivalent to domain knowledge, then the patterns are commonsense.
2. If induced patterns partially overlap with domain knowledge, then the patterns may include unexpected or interesting subpatterns.
3. If induced patterns are completely different from domain knowledge, then the patterns are difficult to interpret.

Then, the next step will be validation of a generated hypothesis: a dataset will be collected under the hypothesis in a prospective way. After the data collection, statistical analysis will be applied to detect the significance of this hypothesis. If the hypothesis is confirmed with statistical significance, the results will be reported. Thus, such a kind of interaction between human experts and rule discovery methods can be viewed as distributed data analysis.
3 Probabilistic Rules

3.1 Accuracy and Coverage

In the subsequent sections, we adopt the following notation, which is introduced in [8]. Let U denote a nonempty, finite set called the universe and A denote a nonempty, finite set of attributes, i.e., a : U → Va for a ∈ A, where Va is called the domain of a, respectively. Then, a decision table is defined as an information
system, A = (U, A ∪ {d}). The atomic formulas over B ⊆ A ∪ {d} and V are expressions of the form [a = v], called descriptors over B, where a ∈ B and v ∈ Va. The set F(B, V) of formulas over B is the least set containing all atomic formulas over B and closed with respect to disjunction, conjunction and negation. For each f ∈ F(B, V), fA denotes the meaning of f in A, i.e., the set of all objects in U with property f, defined inductively as follows.

1. If f is of the form [a = v] then fA = {s ∈ U | a(s) = v}
2. (f ∧ g)A = fA ∩ gA; (f ∨ g)A = fA ∪ gA; (¬f)A = U − fA

By the use of this framework, classification accuracy and coverage (or true positive rate) are defined as follows.

Definition 1. Let R and D denote a formula in F(B, V) and a set of objects which belong to a decision d. Classification accuracy and coverage (true positive rate) for R → d are defined as:

    αR(D) = |RA ∩ D| / |RA|  (= P(D|R)),   and   κR(D) = |RA ∩ D| / |D|  (= P(R|D)),

where |A| denotes the cardinality of a set A, αR(D) denotes the classification accuracy of R as to classification of D, and κR(D) denotes the coverage, or true positive rate, of R to D, respectively. It is notable that these two measures are equal to conditional probabilities: accuracy is the probability of D under the condition of R, and coverage is that of R under the condition of D.

3.2 Definition of Rules
By the use of accuracy and coverage, a probabilistic rule is defined as:

    R →(α,κ) d   s.t.   R = ∧j ∨k [aj = vk],  αR(D) ≥ δα,  κR(D) ≥ δκ.
This rule is a kind of probabilistic proposition with two statistical measures, which is an extension of Ziarko's variable precision rough set model (VPRS) [14].¹
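The following Python sketch, not from the paper, illustrates Definition 1 and the rule thresholds of Section 3.2: accuracy and coverage are computed as the conditional probabilities above, and a candidate rule is accepted only if both exceed the user-supplied thresholds. Function names and default threshold values are illustrative assumptions.

```python
# Hypothetical sketch: accuracy alpha_R(D), coverage kappa_R(D), and
# probabilistic-rule filtering by delta_alpha and delta_kappa.
def accuracy_and_coverage(rows, condition, decision):
    """condition/decision are predicates over a row; returns (alpha, kappa)."""
    r_a = [r for r in rows if condition(r)]        # R_A: objects satisfying R
    d = [r for r in rows if decision(r)]           # D: objects of the decision class
    overlap = sum(1 for r in r_a if decision(r))   # |R_A ∩ D|
    alpha = overlap / len(r_a) if r_a else 0.0     # P(D | R)
    kappa = overlap / len(d) if d else 0.0         # P(R | D)
    return alpha, kappa

def is_probabilistic_rule(rows, condition, decision, delta_alpha=0.7, delta_kappa=0.5):
    alpha, kappa = accuracy_and_coverage(rows, condition, decision)
    return alpha >= delta_alpha and kappa >= delta_kappa
```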
4 Rough Set Model of Distributed Data Analysis

4.1 Definition of Characterization Set
In order to model these three reasoning types, a statistical measure, the coverage κR(D), plays an important role in modeling; it is the conditional probability of a condition (R) under the decision D (P(R|D)).

¹ This probabilistic rule is also a kind of Rough Modus Ponens [6].
Let us define a characterization set of D, denoted by L(D), as a set each element of which is an elementary attribute-value pair R with coverage larger than a given threshold δκ. That is,

    Lδκ(D) = {[ai = vj] | κ[ai=vj](D) > δκ}.

Then, according to the descriptions in Section 2, three types of differences are defined as below:

1. Independent type: Lδκ(Di) ∩ Lδκ(Dj) = φ,
2. Boundary type: Lδκ(Di) ∩ Lδκ(Dj) ≠ φ, and
3. Subcategory type: Lδκ(Di) ⊆ Lδκ(Dj),

where i and j denote tables i and j. The three definitions correspond to the negative region, boundary region, and positive region [4], respectively, if the set of all elementary attribute-value pairs is taken as the universe of discourse. Thus, here we can apply a technique similar to the induction of decision rules from the partition of equivalence relations. In the cases of the boundary and subcategory types, the lower and upper limits of characterization are defined as:

    lower limit:  L_δκ(D) = ∩i Lδκ(Di)
    upper limit:  L̄δκ(D) = ∪i Lδκ(Di)

Concerning the independent type, the lower limit is empty, L_δκ(D) = φ, and only the upper limit of characterization is defined. The lower limit of characterization is a set whose elements are included in all the databases, which can be viewed as information shared by all the datasets. The upper limit of characterization is a set whose elements are included in at least one database, which can be viewed as possible information shared by the datasets. It is notable that the size of these limits depends on the choice of the threshold δκ.
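A minimal Python sketch, not part of the original paper, of the characterization set Lδκ(D) and of the lower/upper limits across several databases; the data representation (row dictionaries, attribute-value tuples) is an illustrative assumption.

```python
# Hypothetical sketch: characterization sets and their lower/upper limits
# across several databases (Section 4.1).
def characterization(rows, decision_attr, decision_value, delta_kappa=0.5):
    """All elementary pairs [a = v] whose coverage of the class D exceeds delta_kappa."""
    d_rows = [r for r in rows if r[decision_attr] == decision_value]
    pairs = {(a, v) for r in rows for a, v in r.items() if a != decision_attr}
    result = set()
    for a, v in pairs:
        coverage = sum(1 for r in d_rows if r[a] == v) / len(d_rows) if d_rows else 0.0
        if coverage > delta_kappa:
            result.add((a, v))
    return result

def limits(characterizations):
    """Lower limit = pairs shared by every database; upper limit = pairs in at least one."""
    lower = set.intersection(*characterizations) if characterizations else set()
    upper = set.union(*characterizations) if characterizations else set()
    return lower, upper
```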
4.2 Characterization as Exclusive Rules

The characteristics of a characterization set depend on the value of δκ. If the threshold is set to 1.0, then a characterization set is equivalent to the set of attributes in exclusive rules [9]. That is, the meaning of each attribute-value pair in L1.0(D) covers all the examples of D. In other words, examples which do not satisfy any pair in L1.0(D) will not belong to class D. Construction of rules based on L1.0 is discussed in Subsection 4.4, and can also be found in [10,12]. The differences between these two papers are the following: in the former paper, the independent type and the subcategory type for L1.0 are focused on to represent diagnostic rules, and applied to the discovery of decision rules in medical databases. In the latter paper, a boundary type for L1.0 is focused on and applied to the discovery of plausible rules.
4.3 Rough Inclusion
Concerning the boundary type, it is important to consider the similarities between classes. In order to measure the similarity between classes with respect to characterization, we introduce a rough inclusion measure µ, which is defined as follows:

    µ(S, T) = |S ∩ T| / |S|.

It is notable that if S ⊆ T, then µ(S, T) = 1.0, which shows that this relation extends the subset and superset relations. This measure was introduced by Polkowski and Skowron in their study on rough mereology [7], which focuses on set inclusion to characterize a hierarchical structure based on a relation between a subset and a superset. Thus, applying rough inclusion to capture the relations between classes is equivalent to constructing a rough hierarchical structure between classes, which is also closely related with information granulation proposed by Zadeh [13].
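A short Python sketch, not from the paper, of the rough inclusion measure applied to two characterization sets; the example attribute-value pairs are illustrative, loosely echoing the CVD rules of Section 6.

```python
# Hypothetical sketch: the rough inclusion measure mu of Section 4.3.
def rough_inclusion(s, t):
    """mu(S, T) = |S ∩ T| / |S|; equals 1.0 whenever S is a subset of T."""
    return len(s & t) / len(s) if s else 0.0

l_d1 = {("Sex", "Female"), ("Hemiparesis", "Left"), ("LOC", "positive")}
l_d2 = {("Sex", "Female"), ("Hemiparesis", "Left")}
print(rough_inclusion(l_d2, l_d1))  # 1.0: subcategory type (l_d2 is a subset of l_d1)
print(rough_inclusion(l_d1, l_d2))  # 0.666...: boundary-type overlap
```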
5 An Algorithm for Analysis

Algorithms for searching for the lower and upper limits of characterization and for inducing rules based on these limits are given in Fig. 1 and Fig. 2. Since the subcategory type and the independent type can be viewed as special cases of the boundary type with respect to rough inclusion, rule induction algorithms for the subcategory type and the independent type are obtained by setting the threshold for µ to 1.0 and 0.0, respectively. Rule discovery (Fig. 1) consists of the following three procedures. First, the characterization of each given class is extracted from each database and the lower and upper limits of characterization are calculated. Second, from these limits, the rule induction method (Fig. 2) is applied. Finally, all the characterizations are classified into several groups with respect to rough inclusion and the degree of similarity is output.
6 Experimental Results

6.1 Applied Datasets

For experimental evaluation, a new system, called PRIMEROSE-REX5 (Probabilistic Rule Induction Method for Rules of Expert Systems ver 5.0), was developed with the algorithms discussed above. PRIMEROSE-REX5 was applied to the following three medical domains, whose information is shown in Table 1.
procedure Rule_Discovery (Total Process);
var
  i : integer;
  M, L, R : List;
  LD : List;  /* A list of all databases */
begin
  Calculate αR(Di) and κR(Di) for each elementary relation R and each class Di;  (i: a dataset i)
  Make a list L(Di) = {R | κR(Di) ≥ δκ} for Di;
  Calculate L(D) = ∩ L(Di) and L̄(D) = ∪ L(Dj);
  Apply the rule induction method to L(D) and L̄(D);
  while (LD ≠ φ) do
  begin
    i := first(LD); M := LD − {i};
    while (M ≠ φ) do
    begin
      j := first(M);
      if (µ(L(Dj), L(Di)) ≥ δµ) then L2(D) := L2(D) + {(i, j, δµ)};
      M := M − Dj;
    end
    Store L2(D) as a similarity of datasets with respect to δµ;
    LD := LD − i;
  end
end {Rule_Discovery};

Fig. 1. An Algorithm for Rule Discovery

Table 1. Databases

Domain     | Tables | Samples | Classes | Attributes
Headache   | 10     | 52119   | 45      | 147
CVD        | 4      | 7620    | 22      | 285
Meningitis | 5      | 1211    | 4       | 41
6.2 Discovery in Experiments

Characterization of Headache. Although the rules from the lower and upper limits themselves were not interesting for domain experts, several interesting and unexpected relations on the degree of similarity were found in the characterization sets. The ten hospitals are grouped into three groups. Table 2 shows some information about these groups, the differentiating factor of which is the region. The first group is mainly located in the countryside, where most of the people are farmers. The second one is mainly located in a housing area. Finally, the third group is in a business area. These groups included several interesting features for the differential diagnosis of headache. In the first group, hypertension was one of the most important attributes for differential diagnosis. In the housing area, the nature of the headache
procedure Induction_of_Classification_Rules;
var
  i : integer;
  M, Li : List;
begin
  L1 := Ler;  /* Ler: list of elementary relations ((L) or (L̄)) */
  i := 1; M := {};
  for i := 1 to n do  /* n: total number of attributes */
  begin
    while (Li ≠ {}) do
    begin
      Select one pair R = ∧[ai = vj] from Li;
      Li := Li − {R};
      if (αR(D) ≥ δα) and (κR(D) ≥ δκ)
        then do Sir := Sir + {R};  /* Include R as Inclusive Rule */
        else M := M + {R};
    end
    Li+1 := (a list of the whole combination of the conjunction formulae in M);
  end
end {Induction_of_Classification_Rules};

Fig. 2. An Algorithm for Classification Rules
was important for differential diagnosis. Finally, in the business area, the location of the headache was important. According to domain experts' comments, these attributes are closely related with working environments. This analysis suggests that the differences between the upper limit and the lower limit also include information which leads to knowledge discovery.

Table 2. Characterization in Headache

   | Location    | Important Features in Upper Limit
G1 | Countryside | Hypertension = yes
G2 | Housing     | Nature = chronic, acute
G3 | Business    | Location = neck, occipital
Rules of CVD. Concerning the database on CVD, several interesting rules were derived both from the lower limit and from the upper limit. The most interesting results from the lower limit are the following rules for thalamus hemorrhage:

    [Sex = Female] ∧ [Hemiparesis = Left] ∧ [LOC: positive] → Thalamus
    ¬[Risk: Hypertension] ∧ ¬[Sensory = no] → ¬Thalamus
Interestingly, LOC (loss of consciousness) under the condition of [Sex = Female] ∧ [Hemiparesis = Left] is an important factor for diagnosing thalamic damage. In this domain, no strong correlations between these attributes and others, unlike the database of meningitis, have been found yet. It will be our future work to find what factor lies behind these rules.

Rules of Meningitis. In the domain of meningitis, the following rules, which medical experts did not expect, were obtained from the lower limit of characterization:

    [WBC < 12000] ∧ [Sex = Female] ∧ [Age < 40] ∧ [CSF_CELL < 1000] → Virus
    [Age ≥ 40] ∧ [WBC ≥ 8000] ∧ [Sex = Male] ∧ [CSF_CELL ≥ 1000] → Bacteria

The most interesting points are that these rules carry information about age and sex, which often seem to be unimportant attributes for differential diagnosis. The first discovery is that women do not often suffer from bacterial infection, compared with men, since such relationships between sex and meningitis have not been discussed in the medical context [1]. Examining the database of meningitis closely, it is found that most of the above patients suffer from chronic diseases, such as DM, LC, and sinusitis, which are risk factors of bacterial meningitis. The second discovery is that [Age < 40] is also an important factor not to suspect viral meningitis, which also matches the fact that most old people suffer from chronic diseases. These results were also re-evaluated in medical practice. Recently, the above two rules were checked on an additional 21 cases of meningitis (15 cases: viral and 6 cases: bacterial meningitis) at a hospital different from the hospitals where the datasets were collected. Surprisingly, the above rules misclassified only three cases (two viral, one bacterial), that is, the total accuracy is equal to 18/21 = 85.7% and the accuracies for viral and bacterial meningitis are equal to 13/15 = 86.7% and 5/6 = 83.3%. The reasons for misclassification are the following: the case of bacterial infection is a patient who has a severe immunodeficiency, although he is very young; the two cases of viral infection are patients who have also suffered from herpes zoster. It is notable that even these misclassified cases can be explained from the viewpoint of immunodeficiency: that is, it is confirmed that immunodeficiency is a keyword for meningitis. The validation of these rules is still ongoing and will be reported in the near future.
7 Discussion: Conflict Analysis
It is easy to see the relations of independent type and subcategory type. While independent type suggests different mechanisms of diseases, subcategory type
suggests the same etiology. The difficult one is the boundary type, where several symptoms overlap in each Lδκ(D). In this case, the relations between Lδκ(Di) and Lδκ(Dj) should be examined. One approach to these complicated relations is conflict analysis [5]. In this analysis, several concepts which share several attribute-value pairs are analyzed with respect to a qualitative similarity measure that can be viewed as an extension of rough inclusion. It will be our future work to introduce this methodology to analyze relations of the boundary type and to develop induction algorithms for these relations.
References

1. Adams, R.D. and Victor, M. Principles of Neurology, 5th edition. McGraw-Hill, New York, 1993.
2. Harris, J.M. Coronary Angiography and Its Complications: The Search for Risk Factors. Archives of Internal Medicine, 144, 337-341, 1984.
3. Lin, T.Y. Fuzzy Partitions: Rough Set Theory. In Proceedings of the Seventh International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU'98), Paris, pp. 1167-1174, 1998.
4. Pawlak, Z. Rough Sets. Kluwer Academic Publishers, Dordrecht, 1991.
5. Pawlak, Z. Conflict analysis. In Proceedings of the Fifth European Congress on Intelligent Techniques and Soft Computing (EUFIT'97), pp. 1589-1591, Verlag Mainz, Aachen, 1997.
6. Pawlak, Z. Rough Modus Ponens. Proceedings of IPMU'98, Paris, 1998.
7. Polkowski, L. and Skowron, A. Rough mereology: a new paradigm for approximate reasoning. Intern. J. Approx. Reasoning 15, 333-365, 1996.
8. Skowron, A. and Grzymala-Busse, J. From rough set theory to evidence theory. In Yager, R., Fedrizzi, M. and Kacprzyk, J. (eds.) Advances in the Dempster-Shafer Theory of Evidence, pp. 193-236, John Wiley & Sons, New York, 1994.
9. Tsumoto, S. Automated Induction of Medical Expert System Rules from Clinical Databases based on Rough Set Theory. Information Sciences 112, 67-84, 1998.
10. Tsumoto, S. Extraction of Experts' Decision Rules from Clinical Databases using Rough Set Model. Journal of Intelligent Data Analysis, 2(3), 1998.
11. Tsumoto, S., Ziarko, W., Shan, N., Tanaka, H. Knowledge Discovery in Clinical Databases based on Variable Precision Rough Set Model. In Proceedings of the Eighteenth Annual Symposium on Computer Applications in Medical Care, Journal of the American Medical Informatics Association 2 (supplement), pp. 270-274, 1995.
12. Tsumoto, S. Knowledge Discovery in Clinical Databases: An Experiment with Rule Induction and Statistics. In Ras, Z. (ed.) Proceedings of the Eleventh International Symposium on Methodologies for Intelligent Systems (ISMIS'99), Springer-Verlag, 1999 (in press).
13. Zadeh, L.A. Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems 90, 111-127, 1997.
14. Ziarko, W. Variable Precision Rough Set Model. Journal of Computer and System Sciences 46, 39-59, 1993.
Automated Discovery of Rules and Exceptions from Distributed Databases Using Aggregates

Rónán Páircéir, Sally McClean and Bryan Scotney

School of Information and Software Engineering, Faculty of Informatics, University of Ulster, Cromore Road, Coleraine, BT52 1SA, Northern Ireland.
{r.pairceir, si.mcclean, bw.scotney}@ulst.ac.uk
Abstract. Large amounts of data pose special problems for Knowledge Discovery in Databases. More efficient means are required to ease this problem, and one possibility is the use of sufficient statistics or "aggregates" rather than low level data. This is especially true for Knowledge Discovery from distributed databases. The data of interest are of a similar type to those found in OLAP data cubes and the Data Warehouse. These data are numerical and are described in terms of a number of categorical attributes (Dimensions). Few algorithms to date carry out knowledge discovery on such data. Using aggregate data and accompanying meta-data returned from a number of distributed databases, we use statistical models to identify and highlight relationships between a single numerical attribute and a number of Dimensions. These are initially presented to the user via a graphical interactive middleware, which allows drilling down to a more detailed level. On the basis of these relationships, we induce rules in conjunctive normal form. Finally, exceptions to these rules are discovered.
1 Introduction
The evolution of database technology has resulted in the development of efficient tools for manipulating and integrating data. Frequently these data are distributed on different computing systems in various sites. Distributed Database Management Systems provide a superstructure which integrates either homogeneous or heterogeneous DBMS [1]. In recent years, there has been a convergence between Database Technology and Statistics, partly through the emerging field of Knowledge Discovery in Databases. In Europe this development has been particularly encouraged by the EU Framework IV initiative, with DOSIS projects IDARESA [2] and ADDSIA [3], which retrieve aggregate data from distributed statistical databases via the internet. In order to alleviate some of the problems associated with mining large sets of low level data, one option is to use a set of sufficient statistics in place of the data itself [4]. In this paper we show how the same results can be obtained by replacing the low level data with our aggregate data. This is especially important in the distributed database situation, where issues associated with slow data transfer and privacy may preclude the transfer of the low level data [5]. The type of data we deal with here is
very similar to the multidimensional data stored in the Data Warehouse (DW) [6, 7]. These data consists of two attribute value types: Measures or numerical data, and Dimensions or categorical data. Some of the Dimensions may have an associated hierarchy to specify grouping levels. This paper deals with such data in statistical databases, but should be easily adapted to a distributed DW implementation [8]. In our statistical databases, aggregate data is stored in the form of Tandem Objects [9], consisting of two parts: a macro relation and its corresponding meta relations (containing statistical metadata for tasks such as attribute value re-classification and currency conversion). Using this aggregate data, it is possible, with models taken from the field of statistics, to study the relation between a response attribute and one or more explanatory attributes. We use Analysis of Variance (ANOVA) models [10] to discover rules and exceptions from aggregate data retrieved from a number of distributed statistical databases. Paper Layout Section 2 contains an extended example. Section 3 shows how the data are retrieved and integrated for final use. The statistical modelling and computation are discussed in section 4, along with the method of displaying the resulting discovered knowledge. Section 5 concludes with a summary and possibilities for further work.
2
An Extended Example
Within our statistical database implementation, the user selects a single Measure and a number of Dimensions from a subject domain for inclusion in the modelling process. The user may restrict the attribute values from any attribute domain, for example, GENDER= Male. In this example the Measure selected is COST (of Insurance Claim) and the Dimensions of interest are COUNTRY {Ireland, England, France}, REGION {City, County}, GENDER {Male, Female} and CAR-CLASS {A, B, C}. A separate distributed database exists for each country. Once the Measure and Dimensions have been entered, the query is sent to the domain server, where it is decomposed in order to retrieve the aggregate data from the distributed databases. As part of the IDARESA project [2], operators have been developed to create, retrieve and harmonise the aggregate data in the Tandem Objects (See Section 3). The Macro relation in the Tandem Object consists of the Dimensions and the single Measure (in this case COST), which is summarised within the numerical attributes N, S and SS. S contains the sum of COST values aggregated over the Dimension set, SS is the equivalent for sums of squares of COST values and N is the count of low level tuples involved in the aggregate. Once the retrieved data have been returned to the domain server and integrated into one Macro relation, the final operation on the data before the statistical analysis is the DATA CUBE operator [11]. Some example tuples from the final Macro relation are shown in Table 2.1.
COST_{ijkln} = µ + G_i + P_j + C_k + R(C)_{l(k)} + GP_{ij} + GC_{ik} + GR(C)_{il(k)} + PC_{jk} + PR(C)_{jl(k)} + GPC_{ijk} + ε_{ijkln}    (1)
Table 2.1. Example tuples from Final Macro relation

COUNTRY | REGION | GENDER | CAR-CLASS | COST_N | COST_S | COST_SS
Ireland | City   | Male   | A         | 12000  | 0.730  | 43.21
England | County | Female | B         | 10000  | 0.517  | 25.08
All     | All    | Male   | A         | 72000  | 4.320  | 261.23
Ireland | City   | All    | All       | 54000  | 2.850  | 161.41
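The sketch below is a rough, plain-Python illustration (not the IDARESA/MIMAD implementation) of how rows such as those in Table 2.1, including the "All" roll-ups, can be derived from low-level tuples: every combination of rolled-up Dimensions is aggregated into the summary attributes N, S and SS, in the spirit of the DATA CUBE operator applied later in Section 3. The claim records and column names are invented for the example.

from itertools import combinations

# Illustrative low-level claim records (COUNTRY, REGION, GENDER, CAR_CLASS, COST);
# the values are made up for the example.
claims = [
    ("Ireland", "City", "Male", "A", 61.0),
    ("Ireland", "City", "Female", "B", 45.5),
    ("England", "County", "Female", "B", 51.7),
]

DIMS = ("COUNTRY", "REGION", "GENDER", "CAR_CLASS")

def aggregate(records, rollup_dims):
    """Aggregate COST into (N, S, SS), replacing each dimension in
    rollup_dims by the value 'All' (a simplified DATA CUBE roll-up)."""
    cells = {}
    for rec in records:
        *dims, cost = rec
        key = tuple("All" if DIMS[i] in rollup_dims else dims[i]
                    for i in range(len(DIMS)))
        n, s, ss = cells.get(key, (0, 0.0, 0.0))
        cells[key] = (n + 1, s + cost, ss + cost * cost)
    return cells

# Build every combination of rolled-up dimensions, as a DATA CUBE operator would.
cube = {}
for r in range(len(DIMS) + 1):
    for rollup in combinations(DIMS, r):
        cube.update(aggregate(claims, set(rollup)))

for key, (n, s, ss) in sorted(cube.items()):
    print(key, "N =", n, "S =", round(s, 2), "SS =", round(ss, 2))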
The relevant Meta-data retrieved indicates that all the Dimensions are fixed variables for the statistical model, and that a hierarchy exists from REGION → COUNTRY. This information is required to automatically fit the correct ANOVA model. For our illustrative example, the model is shown above in (1).
[Figure 2.1 is a bar chart plotting, on a scale from 0 to 0.05, the significance of the effects Gender, Region(Country), Country, Car-class, Gender/Country, Gender/Car-class, Gender/Region(Country), Car-class/Country and Car-class/Region(Country).]

Fig. 2.1. Significant Effects graph for the Insurance example

Once the model parameters have been calculated and validated for appropriateness, the results are presented to the user. The first step involves a graph showing attribute-level relationships between the Dimensions and the COST Measure. These relationships (also known as effects) are presented in terms of main Dimension effects and two- and three-way interaction effects. Only those relationships (effects) that are statistically significant are shown in the graph, with the height of each bar representing the significance of the corresponding effect. The legend contains an entry for all effects, so that the user may drill down on any one desired. In the Insurance example, GENDER, COUNTRY and REGION within COUNTRY each show a statistically significant relationship with COST, as can be seen from the Significant Effects graph in Figure 2.1. None of the three-way effects (e.g. GENDER/REGION(COUNTRY)) has a statistically significant relationship with the COST Measure. The user can interact with this graphical representation. By clicking on a particular bar or effect in the legend of the graph, the user can view a breakdown of COST values for that effect, either in a table or a graphical format. This illustrates to the
user, at a more detailed level, the relationship between an attribute's domain values and the COST Measure. These are conveyed in terms of deviations from the overall mean, in descending order. In this way, the user guides what details he wants to look at, from a high-level attribute view to lower, more detailed levels. A graph of the breakdown of attribute values for GENDER is shown in Figure 2.2. From this it can be seen that there is a large difference between COST claims for Males and Females.

[Figure 2.2 is a bar chart headed "Breakdown for Attribute Gender": Male +7.59 and Female -5.15, plotted on a scale from -10 to 10 as deviations from the overall mean of 51.34.]

Fig. 2.2. Deviations from mean for GENDER values

On the basis of these relationships, rules in conjunctive normal form (CNF) are constructed. The rules involving GENDER are shown in (2) and (3) below. Based on the records in the databases, we can state with 95% statistical confidence that the true COST lies within the values shown in the rule consequent.

GENDER{Male}   → COST between {57.63} and {60.23}    (2)
GENDER{Female} → COST between {44.63} and {47.75}    (3)
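As a hedged sketch of how such a consequent range can be obtained from the stored summary attributes alone (this is an assumption about the computation, using a normal approximation, and not necessarily the exact interval the system computes):

import math

def confidence_interval(n, s, ss, z=1.96):
    """95% confidence interval for a group mean, computed only from the
    summary attributes: N (count), S (sum) and SS (sum of squares)."""
    mean = s / n
    # Sample variance recovered from the sufficient statistics.
    variance = (ss - n * mean * mean) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean - half_width, mean + half_width

# Illustrative numbers only (not taken from the paper's data).
low, high = confidence_interval(n=36000, s=2_120_000.0, ss=130_000_000.0)
print(f"COST between {low:.2f} and {high:.2f}")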
The final step involves presenting to the user any attribute value combinations at aggregate levels which deviate from the high level rules discovered. For example, a group of 9,000 people represented by the following conjunction of attribute values (4) represents an exception to the high level rules:

COUNTRY{Ireland} ∧ GENDER{Female} ∧ REGION{City}
   → ACTUAL VALUE:   COST between {50.12} and {57.24}
   → EXPECTED VALUE: COST between {41.00} and {48.12}    (4)
This can be seen to be an exception, as the corresponding Expected and Actual COST ranges do not overlap. The information in this exception rule may be of interest for example in setting prices for Insurance for Females. Before making any decisions, this exception should be investigated in detail. We find such exceptions at aggregate levels only. It is not possible at this stage to study exceptions for low-level values as
these are resident at the different distributed databases, and in many situations privacy issues prevent analysis at this level in any case.
3
Aggregate Data Retrieval and Integration
The data at any one site may consist of low level “micro” data and/or aggregate “macro” data, along with accompanying statistical metadata (required for example for harmonisation of the data at the domain server). This view of micro and macro data is similar to the base data and materialised views held in the Data Warehouse [7]. In addition, textual (passive) metadata for use in documentation are held in an objectoriented database. An integrated relational strategy for micro and macrodata is provided by the MIMAD model [9] which is used in our implementation. To retrieve aggregate data from the distributed data sites, IDARESA has developed a complete set of operators to work with Tandem Objects [2]. Within a Tandem Object, a macro relation R describes a set of macro objects (statistical tables) where C1,..Cn represent n Dimensions and S1,…Sm are m summary attributes (N, S and SS) which summarise an underlying Measure. The IDARESA operators are implemented using SQL which operates simultaneously on a Macro relation and on its accompanying meta relations. In this way, whenever a macro relation is altered by an operator, the accompanying meta relations are always adjusted appropriately. The summary attributes in the macro relation form a set of “sufficient statistics” in the form of count (N), sum (S) and sums of squares (SS) for the desired aggregate function. An important concept is the additive property of these summary attributes [9] defined as follows: σ ( α UNION β) = σ (α) + σ (β)
(5)
where α and β are macro relations which are macro compatible and σ() is an application of a summary attribute function (e.g. SUM) over the Measure in α and β. Using the three summary attributes, it is possible to compute a large number of statistical procedures [9], including ANOVA models. Thus it is possible to combine aggregates over these summary statistics at a central site for our knowledge discovery purposes. The user query is decomposed by a Query Agent which sends out Tandem Object requests to the relevant distributed sites. If the data at a site is in the micro data format an IDARESA operator called MIMAC (Micro to Macro Create) is used to construct a Tandem Object with the required Measure and Dimensions, along with accompanying meta relations. If the data are already in a macro data format, IDARESA operators TAP (Tandem Project) and TASEL (Tandem Select) are used to obtain the required Tandem Object. Once this initial Tandem Object has been created at each site, operators TAREC (Tandem Reclassify) and TACO (Tandem Convert) may be applied to the macro relations using information in the meta relations. TAREC can be used in two ways: the first is in translating attribute domain values to a single common language for all distributed macro relations (e.g. changing French words for male and female in the GENDER attribute to English).; the second use is on reclassi-
fying a Dimension’s domain values so that all the macro relations contain attributes with the same domain set (e.g. the French database might classify Employed as “Parttime” and “Full-time” separately. These need to be reclassified and aggregated to the value “Employed” which is the appropriate classification used by the other Countries involved in the query). The operator TACO is used to convert the Measure summary attributes to a common scale for all sites (e.g. converting COST from local currency to ECU for each site using conversion information in the meta relations). The final harmonised Tandem Object from each site is communicated to the Domain Server. The Macro relations are now Macro compatible [2] and can therefore be integrated into a single aggregate macro relation using the TANINT (Tandem Integration) operator. The meta relations are also integrated accordingly. The final task is to apply the DATA CUBE operator [11] to the Macro relation. The data is now in a suitable format for the statistical modelling. 3.1
Implementation Issues
In our prototype the micro data and Tandem Objects are stored in MS SQL Server. Access to remote distributed servers is achieved via the Internet in a Java environment. A well acknowledged three tier architecture has been adopted for the design. The logical structure consists of a front-end user (the client), a back-end user (the server), and middleware which maintains communication between the client and the server. The distributed computing middleware capability called remote method invocation (RMI) is used here. A query is transformed into a series of nested IDARESA operators and passed to the Query Agent for assembly into SQL and execution.
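To make the additive property (5) concrete, the following is a minimal sketch, in plain Python rather than the IDARESA operators, of integrating macro-compatible macro relations from several sites by adding their N, S and SS attributes per Dimension tuple; the site relations and their values are invented.

from collections import defaultdict

def integrate(*site_relations):
    """Integrate macro-compatible macro relations by adding the summary
    attributes (N, S, SS) of tuples that share the same Dimension values."""
    merged = defaultdict(lambda: (0, 0.0, 0.0))
    for relation in site_relations:
        for dims, (n, s, ss) in relation.items():
            tn, ts, tss = merged[dims]
            merged[dims] = (tn + n, ts + s, tss + ss)
    return dict(merged)

# Hypothetical harmonised relations returned by two sites,
# keyed by (COUNTRY, REGION, GENDER, CAR_CLASS).
ireland = {("Ireland", "City", "Male", "A"): (12000, 730_000.0, 43_210_000.0)}
england = {("England", "County", "Female", "B"): (10000, 517_000.0, 25_080_000.0),
           ("Ireland", "City", "Male", "A"): (500, 31_000.0, 1_950_000.0)}

print(integrate(ireland, england))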
4
Statistical Modelling and Results Display
ANOVA models [10] are versatile statistical tools for studying the relation between a numerical attribute and a number of explanatory attributes. Two factors have enabled us to construct these models from our distributed data. The first is the fact that we can combine distributed primitive summary attributes (N, S and SS) from each distributed database seamlessly using the MIMAD model and IDARESA operators described in Section 3. The second factor is that it is possible to use these attributes to compute the coefficients of an ANOVA model in a computationally efficient way. The ANOVA model coefficients also enable us to identify exceptions in the aggregate data. The term “exception” here is defined as an aggregate Measure value which differs in a statistically significant manner from its expected value calculated from the model. While it is not the focus of this paper to detail the ANOVA computations, a brief description follows. The simplest example of an ANOVA model is shown in equation (6), similar to the model in equation (1) which contains more Dimensions and a hierarchy between the Dimensions COUNTRY and REGION. In equation (6), Measureijk represents a numerical Measure value corresponding to th valuei of Dimension A and valuej of Dimension B. k represents the k example or replicate for this Dimension set. The µ term in the model represents the overall aver-
age or mean value for the Measure. The A and B single Dimension terms are used in the model to see if these Dimensions have a relationship (Main effect) with the Measure. The (AB) term, representing a 2-way interaction effect between Dimensions A and B, is used to see if there is a relationship between the Measure and values for Dimension A, which hold only when Dimension B has a certain value. The final term in the model is an error term, which is used to see if any relationships are real in a statistically significant way.
Measure_{ijk} = µ + A_i + B_j + (AB)_{ij} + ε_{ijk}    (6)
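A minimal sketch of how the effects of a two-way model such as (6) can be estimated from cell aggregates (N, S, SS) alone, assuming a balanced design and invented data; it illustrates the idea only and is not the authors' implementation.

# Each cell (value of A, value of B) holds the aggregates (N, S, SS).
cells = {
    ("Male", "A"):   (100, 5800.0, 350_000.0),
    ("Male", "B"):   (100, 6100.0, 385_000.0),
    ("Female", "A"): (100, 4500.0, 215_000.0),
    ("Female", "B"): (100, 4700.0, 230_000.0),
}

total_n = sum(n for n, _, _ in cells.values())
grand_mean = sum(s for _, s, _ in cells.values()) / total_n

def level_mean(index, value):
    """Mean of the Measure over all cells where dimension `index` equals `value`."""
    n = sum(c[0] for k, c in cells.items() if k[index] == value)
    s = sum(c[1] for k, c in cells.items() if k[index] == value)
    return s / n

# Main effects are deviations of level means from the grand mean (cf. Figure 2.2).
a_effect = {a: level_mean(0, a) - grand_mean for a, _ in cells}
b_effect = {b: level_mean(1, b) - grand_mean for _, b in cells}

# Interaction effect: what is left of a cell mean after µ and the main effects.
ab_effect = {(a, b): s / n - grand_mean - a_effect[a] - b_effect[b]
             for (a, b), (n, s, _) in cells.items()}

print(grand_mean, a_effect, b_effect, ab_effect)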
In order to discover exceptions at aggregate levels, the expected value for a particular Measure value as calculated from the model, is subtracted from the actual Measure value. If the difference is statistically significant in terms of the model error, this value is deemed to be an exception. When calculating an expected value for Measureijk, the model reduces nicely to the average of the k values where A=i and B=j, saving considerable computing time in the calculation of exceptions. The model reduces similarly in calculating an exception at any aggregate level (e.g. the expected Measure value for aggregate GENDER{Male} and COUNTRY {Ireland} is simply the average over all tuples with these attribute values). It is important to note that if an interaction effect (e.g. AB) is deemed to be statistically significant, then the main effects involved in this interaction effect (A and B) are disregarded and all the focus centers on the interaction effect. In such a situation, when effects are converted to CNF rules, main effects based on a significant interaction effect are not shown. In our ANOVA model implementations, we do not model higher than 3-way interaction effects as these are seldom if ever significant [10]. 4.1 Presentation of Results The first step in the results presentation is at the attribute level based on the statistically significant Main and Interaction Effects. Statistical packages present ANOVA results in a complicated table suitable for statisticians. Our approach summarises the main details of this output in a format more suited to a user not overly familiar with statistical modelling and analysis. We present the statistically significant effects in an interactive graphical way, as shown in Figure 2.1. The scale of the graph is the probability that an effect is real. Only those effects significant above a 95% statistical level are shown. The more significant an effect, the stronger the relationship between Dimensions in the effect and the Measure. As a drill-down step from the attribute level, the user can interact with the graph to obtain a breakdown of Measure value means for any effect. This allows the user to understand an effect’s relationship with the Measure in greater detail. The user can view this breakdown either graphically, as shown in Figure 2.2. or in a table format. The breakdown consists of the mean Measure deviation values from the overall Measure mean, for the corresponding effect’s Dimension values (e.g. Figure 2.2 shows that the mean COST for GENDER{Male} deviates from the overall mean of 51.34 by +7.59 units). Showing the breakdown as deviations from the overall mean
facilitates easy comparison of the different Measure means. The significant effects are next converted into a set of rules in conjunctive normal form, with an associated range within which we can statistically state that the true Measure value lies. This range is based on a statistical confidence interval. This set of rules in CNF summarises the knowledge discovered using the ANOVA analysis. The final pieces of knowledge which are automatically presented to the user, are the exceptions to the discovered rules. These are Measure values corresponding to all the different Dimension sets at the aggregate level, which differ in a statistically significant way from their expected ANOVA model values. An example of an exception is (4) in Section 2. These are also presented in CNF, with their expected and actual range values. One factor which is also important to a user interested in finding exceptions, is to know in what way they are exceptions. This is possible through an examination of the rules which are relevant to the exception. For the example in Section 2, assume that (2) and (3) are the only significant rules induced. In order to see why (4) is an exception, we look at rules which are related to it. We define a rule and an exception to be related if the rule antecedent is nested within the exception antecedent. In this case the antecedent in rule (3) GENDER{Female} is nested in the exception (4). Comparing the Measure value range for the rule {44.6 - 47.75} with that of the exception {50.12 - 57.24}, it can be seen that they do not overlap. Therefore it can be stated in this simple illustration that GENDER is in some sense a cause of exception (4). This conveys more knowledge to the user about the exception. Further work is required on this last concept to automate the process in some suitable way. 4.2
Related Work
In the area of supervised learning, a lot of research has been carried out on the discovery of rules in CNF, and some work is proceeding on the discovery of exceptions and deviations for this type of data [13, 14]. A lot less work in the knowledge discovery area has been carried out in relation to a numerical attribute described in terms of categorical attributes. Some closely related research involves a paper on exploring exceptions in OLAP data cubes [14]. The authors there use an ANOVA model to enable a user to navigate through exceptions using an OLAP tool, highlighting drilldown options which contain interesting exceptions. Their work bears similarity only to the exception part of our results presentation, whereas we present exceptions to our rules at aggregate levels in CNF. Some work on knowledge discovery in distributed databases has been carried out in [5, 15].
5 Summary and Further Work Using aggregate data and accompanying meta-data returned from a number of distributed databases, we used ANOVA models to identify and highlight relationships between a single numerical attribute and a number of Dimensions. On the basis of these relationships which are presented to the user in a graphical fashion, rules were induced in conjunctive normal form and exceptions to these rules were discovered.
Further work can be carried out on the application of Aggregate data to other knowledge discovery techniques applied to the distributed setting, conversion of our rules into linguistic summaries of the relationships and exceptions and investigation of models which include a mix of Measures and Dimensions.
References

1. Bell, D., Grimson, J.: Distributed Database Systems. Addison-Wesley, Wokingham (1992)
2. McClean, S., Grossman, W. and Froeschl, K.: Towards Metadata-Guided Distributed Statistical Processing. NTTS'98, Sorrento, Italy (1998): 327-332
3. Lamb, J., Hewer, A., Karali, I., Kurki-Suonio, M., Murtagh, F., Scotney, B., Smart, C., Pragash, K.: The ADDSIA (Access to Distributed Databases for Statistical Information and Analysis) Project. DOSIS project paper 1, NTTS-98, Sorrento, Italy, 1-20 (1998)
4. Graefe, G., Fayyad, U., Chaudhuri, S.: On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases. KDD (1998): 204-208
5. Aronis, J., Kolluri, V., Provost, F. and Buchanan, B.: The WoRLD: Knowledge Discovery from Multiple Distributed Databases. In Proc. FLAIRS'97 (1997)
6. Chaudhuri, S., Dayal, U.: An Overview of Data Warehousing and OLAP Technology. SIGMOD Record 26(1): 65-74 (1997)
7. Shoshani, A.: OLAP and Statistical Databases: Similarities and Differences. PODS 97: 185-196 (1997)
8. Albrecht, J. and Lehner, W.: On-Line Analytical Processing in Distributed Data Warehouses. International Database Engineering and Applications Symposium (IDEAS'98), Cardiff, Wales, U.K. (1998)
9. Sadreddini, M.H., Bell, D. and McClean, S.I.: A Model for Integration of Raw Data and Aggregate Views in Heterogeneous Statistical Databases. Database Technology 4(2), 115-127 (1991)
10. Neter, J.: Applied Linear Statistical Models, 3rd ed. Irwin, Chicago, Ill.; London (1996)
11. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE 1996: 152-159 (1996)
12. Liu, H., Lu, H., Feng, L. and Hussain, F.: Efficient Search of Reliable Exceptions. PAKDD 99, Beijing, China (1999)
13. Arning, A., Agrawal, R. and Raghavan, P.: A Linear Method for Deviation Detection in Large Databases. KDD, Portland, Oregon, USA (1996)
14. Sarawagi, S., Agrawal, R., Megiddo, N.: Discovery-Driven Exploration of OLAP Data Cubes. EDBT 98: 168-182 (1998)
15. Ras, Z., Zytkow, J.: Discovery of Equations and the Shared Operational Semantics in Distributed Autonomous Databases. PAKDD99, Beijing, China (1999)
Text Mining via Information Extraction Ronen Feldman , Yonatan Aumann, Moshe Fresko, Orly Liphstat, Binyamin Rosenfeld, Yonatan Schler Department of Mathematics and Computer Science Bar-Ilan University Ramat-Gan, ISRAEL Tel: 972-3-5326611 Fax: 972-3-5326612
[email protected] Abstract. Knowledge Discovery in Databases (KDD), also known as data mining, focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Given a collection of text documents, most approaches to text mining perform knowledge-discovery operations on labels associated with each document. At one extreme, these labels are keywords that represent the results of non-trivial keyword-labeling processes, and, at the other extreme, these labels are nothing more than a list of the words within the documents of interest. This paper presents an intermediate approach, one that we call text mining via information extraction, in which knowledge discovery takes place on a more focused collection of events and phrases that are extracted from and label each document. These events plus additional higher-level entities are then organized in a hierarchical taxonomy and are used in the knowledge discovery process. This approach was implemented in the Textoscope system. Textoscope consists of a document retrieval module which converts retrieved documents from their native formats into SGML documents used by Textoscope; an information extraction engine, which is based on a powerful attribute grammar which is augmented by a rich background knowledge; a taxonomy-creation tool by which the user can help specify higher-level entities that inform the knowledge-discovery process; and a set of knowledge-discovery tools for the resulting event-labeled documents. We evaluate our approach on a collection of newswire stories extracted by Textoscope’s own agent. Our results confirm that Text Mining via information extraction serves as an accurate and powerful technique by which to manage knowledge encapsulated in large document collections.
1 Introduction Traditional databases store information in the form of structured records and provide methods for querying them to obtain all records whose content satisfies the user’s query. More recently however, researchers in Knowledge Discovery in Databases (KDD) have provided a new family of tools for accessing information in databases. The goal of such work, often called data mining, has been defined as •
“the nontrivial extraction of implicit, previously unknown, and potentially useful information from given data. Work in this area includes applying machinelearning and statistical-analysis techniques towards the automatic discovery of patterns in databases, as well as providing user-guided environments for exploration of data. Most efforts in KDD have focused on data mining from structured databases, despite the tremendous amount of online information that appears only in collections of unstructured text. This paper focuses on the problem of text mining, performing knowledge discovery from collections of unstructured text. One common technique [3,4,5] has been to assume that associated with each document is a set of labels and then perform knowledge-discovery operations on the labels of each document. The most common version of this approach has been to assume that labels correspond to keywords, each of which indicates that a given document is about the topic associated with that keyword. However, to be effective, this requires either: manual labeling of documents, which is infeasible for large collections; hand-coded rules for recognizing when a label applies to a document, which is difficult for a human to specify accurately and must be repeated anew for every new keyword; or automated approaches that learn from labeled documents rules for labeling future documents, for which the state of the art can guarantee only limited accuracy and which also must be repeated anew for every new keyword. A second approach has been to assume that a document is labeled with each of the words that occurs within it. However, as was shown by Rajman and Besançon [6] and is further supported by the results presented here, the results of the mining process are often rediscoveries of compound nouns (such as that “Wall” and “Street” or that “Ronald” and “Reagan” often co-occur) or of patterns that are at too low a level (such as that “shares” and “securities” cooccur). In this paper we instead present a middle ground, in which we perform Information extraction on each document to find events and entities that are likely to have meaning in the domain, and then perform mining on the extracted events labeling each document. Unlike word-based approaches, the extracted events are fewer in number and tend to represent more meaningful concepts and relationships in the domain of the document. A possible event can be that a company did a joint venture with a group of companies or that a person took position at a company. Unlike keyword approaches, our information-extraction method eliminates much of the difficulties in labeling documents when faced with a new collection or new keywords. While we rely on a generic capability of recognizing proper names which is mostly domain-independent, when the system is to be used in new domains, some work is needed for defining additional event schemas. Textoscope provides a complete editing/compiling/debugging environment for defining the new event schemas. This environment enables easy creation and manipulation of information extraction rules. This paper describes Textoscope, a system that embodies this approach to text mining via information extraction. The overall structure of Textoscope is shown in Figure 1. The first step is to convert documents (either internal documents or
external documents fetched by using the Agent) into an SGML format understood by Textoscope. The resulting documents are then processed to provide additional linguistic information about the contents of each document – such as through partof-speech tagging. Documents are next labeled with terms extracted directly from the documents, based on syntactic analysis of the documents as well as on their patterns of occurrence in the overall collection. The terms and additional higherlevel entities are then placed in a taxonomy through interaction with the user as well as via information provided when documents are initially converted into Textoscope’s SGML format. Finally, KDD operations are performed on the event-labeled documents.
[Figure 1 diagram: boxes labelled Agent, FTP, Other Online Sources, Reader/SGML Converter, Information Extraction, Taxonomy Editor, Text Mining ToolBox and Visualization Tools.]
Fig. 1. Textoscope architecture. Examples of document collections suitable for text mining are documents on the company’s Intranet, patent collections, newswire streams, results returned from a search engine, technical manuals, bug reports, and customer surveys. In the remainder of this paper we describe Textoscope’s various components. The the linguistic preprocessing steps, Textoscope’s Information extraction engine, its tool for creating a taxonomic hierarchy for the extracted events, and, finally, a sample of its suite of text mining tools. We give examples of mining results on a collection of newswire stories fetched by our agent.
2 Information Extraction Information Extraction (IE) aims at extracting instances of predefined templates from textual documents. IE has grown to be a very active field of research thanks to the MUC (Message Understanding Conference) initiative. MUC was initiated by DARPA in the late 80’s in response to the information overload of on-line texts. One of the popular uses of IE is proper name extraction, i.e., extraction of company names, personal names, locations, dates, etc. The main components of an IE system are tokenization, zoning (recognizing paragraph and sentence limits),
morphological and lexical processing, parsing and domain semantics [1,7]. Typically, IE systems do not use full parsing of the document since that is too time consuming and error prone. The methods typically used by IE systems are based on shallow parsing and use a set of predefined parsing rules. This “knowledgebased” approach may be very time consuming and hence a good support environment for writing the rules is needed. Textoscope preprocesses the documents by using its own internal IE engine. The IE engine makes use of a set of predefined extraction rules. The rules can make use of a rich set of functions that are used for string manipulation, set operations and taxonomy construction. We have three major parts to the rules file. First we define all the events that we want to extract from the text. An example of an event is “Company1 Acquired Company2”, or “Person has Position in Company”. The second part are word classes, collections of words that have a similar semantic property. Examples of word classes are company extensions (like “inc”, “corporation” “gmbh” “ag” etc.) and a list of common personal first names. The third and last part are rules that are used to extract events out of the documents. There are two types of rules, event-generation rules and auxiliary rules. Each event-generating rule has three parts, a pattern, a set of constraints (on components of the pattern), and a set of events that are generated from the pattern. An auxiliary rule contains just a pattern. The system supports three types of patterns, AND-patterns , sequential pattern (which has a similar semantics to a prolog DCG rule), and skip patterns. Skip patterns enable the IE engine to skip a series of tokens until a member of a word class is found. Here is an example of an event generating rule that uses an auxiliary rule: @ListofProducts = ( @ProductList are [ registered ] trademarks of @Company @! ) > ProductList: Products = 0. @ProductList = ( @Product , @ProductList1 @!). @ProductList1 = ( @Product, @ProductList1 @!). @ProductList1 = ( @Product [ , ] and @Product @! ). In this case we look for a list of entities that is followed by the string “are registered trademarks” or “are trademarks”. Each of the entities must conform to the syntax of a @Product. We have used many resources found on the WWW to acquire lists of common objects such as countries, states, cities, business titles (e.g., CEO, VP of Product Development, etc.), technology terms etc. Technology terms for instance were extracted from publicly available glossaries. We have used our IE engine (with a specially designed rule-set) to automatically extract the terms from the HTML source of the glossaries. In addition, we have used words lists of the various part of speech categories (nouns, verbs, adjectives, etc.). These word lists are used inside the rules to direct the parsing.
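The fragment below is only a rough analogue, in ordinary Python regular expressions, of what such a rule does: a small word class of company suffixes and a pattern for lists of the form "X, Y and Z are [registered] trademarks of Company". It is not the Textoscope rule language, and the suffix list and test sentence are invented.

import re

# A tiny "word class" of company suffixes, standing in for the much larger
# lists described in the text.
COMPANY_SUFFIX = r"(?:Inc|Corp|Corporation|GmbH|AG|Ltd)"

PATTERN = re.compile(
    r"(?P<products>[A-Z][\w ]*?(?:,\s*[A-Z][\w ]*?)*(?:,?\s+and\s+[A-Z][\w ]*?)?)"
    r"\s+are\s+(?:registered\s+)?trademarks\s+of\s+"
    r"(?P<company>[A-Z][\w. ]*?" + COMPANY_SUFFIX + r")"
)

def extract_trademark_events(sentence):
    """Return ListofProducts-style events found in one sentence."""
    events = []
    for m in PATTERN.finditer(sentence):
        products = re.split(r",\s*|\s+and\s+", m.group("products"))
        events.append({"Products": [p.strip() for p in products if p.strip()],
                       "Company": m.group("company")})
    return events

print(extract_trademark_events(
    "Nucleus and DataProbe are registered trademarks of Acme Software Corp"))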
Each document is processed using the IE engine and the generated events are inserted into the document repository. In addition to the events inserted, each document is annotated with terms that are generated by using term extraction algorithms [2,5]. This enables the system to use co-occurrence between terms to infer relations that were missed by the IE engine. The user can select the granularity level of the co-occurrence computation, either document level, paragraph level or sentence level. Clearly, if the granularity level is selected to be document-level, the precision will decrease, while the recall will increase. On the other hand, selecting a sentence-level granularity will yield higher precision and lower recall. The default granularity level is the sentence level, terms will be considered to have relationship only if they co-occur within the same sentence. In all the analysis modules of the Textoscope system the user can select whether relationships will be based solely on the events extracted by the IE engine, on the term extraction, or a combination of two. One of the major issues that we have taken into account while designing the IE Rule Language was allowing the specification of common text processing actions within the language rather than resorting to external code written in C/C++. In addition to recognizing events, the IE engine allows the additional analysis of text fractions that were identified as being of interest. For instance, if we have identified that a given set of tokens is clearly a company name (by having as a suffix one of the predefined company extensions), we can insert into a dynamic set called DCompanies the full company name and any of its prefixes that still constitute a company name. Consider the string “Microsoft Corporation”, we will insert to DCompanies both “Microsoft Corporation”, and “Microsoft”. Dynamic sets are handled at five levels: there are system levels sets, there are corpus level sets, there are document level sets, paragraph level sets and sentence level sets. System level sets enable knowledge transfer between corpuses, while corpus level sets enable knowledge transfer between documents in the same corpus. Document level sets are used in cases where the knowledge acquired should be used just for the analysis of the rest of the document and it is not applicable to other documents. Paragraph and sentence level sets are used in discourse analysis, and event linking. The IE engine can learn the type of an entity by the context in which the entity appears. As an example, consider a list of entities some of which are unidentified. If the engine can determine the type of at least one of them, then the types of all other entities are determined to be the same. For instance, given the string “joint venture among affiliates of Time Warner, MediaOne Group, Microsoft, Compaq and Advance/Newhouse.”, since the system has already identified Microsoft as being a company, it determined that Time Warner, MediaOne Group, Compaq and Advance/Newhouse are companies as well. The use of the list-processing rules provided a considerable boost the accuracy of the IE engine. For instance, in the experiment described in Section 4, it caused recall to increase from 82.3% to 92.6% while decreasing precision from 96.5% to 96.3%.
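As a simplified stand-in for the dynamic-set and list-processing behaviour just described (it is not the actual engine; the seed set, pattern and sentence are illustrative), the idea of propagating an entity type across the members of a coordinated list can be sketched as:

import re

# Corpus-level "dynamic set" of known companies (seeded, then grown).
known_companies = {"Microsoft"}

def propagate_company_types(sentence):
    """If one entity in a coordinated list is a known company, record the
    remaining list members as companies too (a simplified version of the
    list-processing behaviour described above)."""
    match = re.search(r"among affiliates of (.+?)\.", sentence)
    if not match:
        return set()
    members = [m.strip() for m in re.split(r",\s*|\s+and\s+", match.group(1))]
    if any(m in known_companies for m in members):
        known_companies.update(members)
        return set(members)
    return set()

text = ("The deal is a joint venture among affiliates of Time Warner, "
        "MediaOne Group, Microsoft, Compaq and Advance/Newhouse.")
print(propagate_company_types(text))
print(known_companies)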
Textoscope provides a rich support environment for editing and debugging the extraction rules. On the editing front, Textoscope provides a visual editor for
building the rules that enables the user to create rules without having to memorize the exact syntax. On the debugging front, Textoscope provides two main utilities. First, it provides a visual tool that enables one to see all the events that were extracted from the document. The user can click on any of the events and then see the exact text where this event was extracted from. In addition the system provides an interactive derivation tree of the event, so that the user can explore exactly how the event was generated. An example of such a derivation tree is shown in Figure 2. Here we parsed the sentence “We see the Nucleus Prototype Mart as the missing link to quickly deploying high value business data warehouse solutions, said David Rowe, Director of Data Warehousing Practice at GE Capital Consulting”, and extracted the event that David Rowe is the Director of Data Warehousing Practice at a company called GE Capital Consulting. Each node in the derivation tree is annotated by an icon that symbolizes the nature of the associated grammar feature. The second debugging tool provides the user with the ability to use a tagged training set and rate each of the rules according to their contribution to the precision and recall of the system. Rules that cause precision to be lower and do not contribute towards a higher recall can be either deleted or modified.
Fig. 2. An Interactive Derivation Tree of an Extracted Event

The events that were generated by the IE engine are used also for the automatic construction of the taxonomy. Each field in each of the events is used as a source of values for the corresponding node in the taxonomy. For instance, we use the Company field from the event "Person, Position, Company" to construct the Company node in the taxonomy. The system contains several meta rules that enable the construction of a multi-level taxonomy. Such a rule can be, for instance, that Banks are Companies and hence the Bank node will be placed under the Company node in the Taxonomy.
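A hedged sketch of this taxonomy-construction step; the event records and the meta-rule are invented, and the real system's rule machinery is certainly richer.

from collections import defaultdict

# Hypothetical extracted events of the form "Person, Position, Company".
events = [
    {"Person": "David Rowe", "Position": "Director of Data Warehousing Practice",
     "Company": "GE Capital Consulting"},
    {"Person": "Jane Doe", "Position": "CEO", "Company": "First National Bank"},
]

# Illustrative meta-rule: names ending in "Bank" are Banks, and Banks are Companies.
def refine(node, value):
    if node == "Company" and value.endswith("Bank"):
        return ("Company", "Bank")
    return (node,)

taxonomy = defaultdict(set)   # maps a taxonomy path to the values stored under it
for event in events:
    for node, value in event.items():
        taxonomy[refine(node, value)].add(value)

for path, values in sorted(taxonomy.items()):
    print(" > ".join(path), ":", sorted(values))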
Textoscope constructs a thesaurus that contains lists of synonyms. The thesaurus is constructed by using co-reference and a set of rules for deciding that two terms actually refer to the same entity. Example of a synonym list that is constructed by the system is { “IBM”, “International Business Machines Corp” and “Big Blue” }. Textoscope also includes a synonym editor that enables the user to add/modify/delete synonym lists. This enables the user to change the automatically created thesaurus and customize it to her own needs.
3 Results We tested the accuracy of the IE engine by analyzing collections of documents that were extracted by the Agent from MarketWatch.com. We started by extracting 810 articles from MarketWatch.com which mentioned "ERP". We have created 30 different events focused around companies, technologies, products and alliances. We have defined more than 250 word classes and have used 750 rules to extract those 30 event types. The rule scoring tool described in Section 2 proved to be very useful in the debugging and refinement of the rule set. After the construction of the initial rule set we were able to achieve an F-Score of 89.3%. Using the rule scoring utility enabled us to boost the F-Score to 96.7% in several hours. In order to test the rule set, we used our agent again to extract 2780 articles that mentioned "joint venture" from MarketWatch.com. We were able to extract 15,713 instances of these events. We achieved 96.3% precision and 92.6% recall on the company, people, technology and product categories, and hence an F-Score of 94.4% (β = 1), where

F = (β² + 1)PR / (β²P + R).

These results are on par with the
results achieved by the FASTUS system [1] and the NETOWL system (www.netowl.com). We will now show how Textoscope enables us to analyze the events and terms that were extracted from the 2780 articles. Textoscope provides a set of visual maps that depict the relationship between entities in the corpus. The context graph shown in Figure 3 depicts the relationship between “technologies”. The weights of the edges (number of documents in which the technologies appear in the same context) are coded by the color of the edge, the darker the color, the more frequent the connection. The graph clearly reveals the main technology clusters, which are shown as disconnected components of the graph: a security cluster and internet technologies cluster. We can see strong connections between electronic commerce and internet security, between ERP and data warehousing, and between ActiveX and internet security. In Figure 4, we can view some of the company clusters that were involved in some sort of alliance (“joint venture”, “strategic alliance”, “commercial alliance”, etc. ).
The Context Graph provides a powerful way to visualize relationships encapsulated in thousands of documents.
Fig. 3. Context Graph (technologies)
Fig. 4. Joint Venture Clusters
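A short sketch of how such edge weights (the number of documents in which two terms appear in the same context) can be counted; the per-document term sets below are invented.

from itertools import combinations
from collections import Counter

# Hypothetical "technology" terms extracted per document by the IE engine.
docs = [
    {"electronic commerce", "internet security"},
    {"ERP", "data warehousing"},
    {"internet security", "ActiveX"},
    {"electronic commerce", "internet security", "ActiveX"},
]

# Edge weight = number of documents in which the two terms co-occur.
edge_weights = Counter()
for terms in docs:
    for a, b in combinations(sorted(terms), 2):
        edge_weights[(a, b)] += 1

for (a, b), w in edge_weights.most_common():
    print(f"{a} -- {b}: {w}")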
4 Summary Text mining based on Information Extraction attempts to hit a midpoint, reaping some benefits from each of these extremes while avoiding many of their pitfalls. On the one hand, there is no need for human effort in labeling documents, and we
are not constrained to a smaller set of labels that lose much of the information present in the documents. Thus the system has the ability to work on new collections without any preparation, as well as the ability to merge several distinct collections into one (even though they might have been tagged according to different guidelines which would prohibit their merger in a tagged-based system). On the other hand, the number of meaningless results is greatly reduced and the execution time of the mining algorithms is also reduced relative to pure wordbased approaches. Text mining using Information Extraction thus hits a useful middle ground on the quest for tools for understanding the information present in the large amount of data that is only available in textual form. The powerful combination of precise analysis of the documents and a set of visualization tools enable the user to easily navigate and utilize very large document collections.
References

1. Appelt, D.E., Hobbs, J.R., Bear, J., Israel, D. and Tyson, M.: FASTUS: A Finite-State Processor for Information Extraction from Real-World Text. Proceedings of IJCAI-93, Chambery, France, August 1993.
2. Daille, B., Gaussier, E. and Lange, J.M.: Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of the International Conference on Computational Linguistics, COLING'94, pages 515-521, 1994.
3. Feldman, R. and Hirsh, H.: Exploiting Background Information in Knowledge Discovery from Text. Journal of Intelligent Information Systems, 1996.
4. Feldman, R., Aumann, Y., Amir, A., Klösgen, W. and Zilberstien, A.: Maximal Association Rules: a New Tool for Mining for Keyword Co-occurrences in Document Collections. In Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA, 1997.
5. Feldman, R. and Dagan, I.: KDT – Knowledge Discovery in Texts. In Proceedings of the First International Conference on Knowledge Discovery, KDD-95, 1995.
6. Rajman, M. and Besançon, R.: Text Mining: Natural Language Techniques and Text Mining Applications. In Proceedings of the Seventh IFIP 2.6 Working Conference on Database Semantics (DS-7), Chapman & Hall IFIP Proceedings series, Leysin, Switzerland, Oct 7-10, 1997.
7. Soderland, S., Fisher, D., Aseltine, J. and Lehnert, W.: Issues in Inductive Learning of Domain-Specific Text Extraction Rules. Proceedings of the Workshop on New Approaches to Learning for Natural Language Processing at the Fourteenth International Joint Conference on Artificial Intelligence, 1995.
TopCat: Data Mining for Topic Identification in a Text Corpus? Chris Clifton1 and Robert Cooley2
??
1 2
The MITRE Corporation, 202 Burlington Rd, Bedford, MA 01730-1420 USA
[email protected] University of Minnesota, 6-225D EE/CS Building, Minneapolis, MN 55455 USA
[email protected] Abstract. TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized “ground truth” news corpus showing this technique is effective in identifying topics in collections of news articles.
1
Introduction
Data mining has emerged to address problems of understanding ever-growing volumes of information for structured data, finding patterns within the data that are used to develop useful knowledge. On-line textual data is also growing rapidly, creating needs for automated analysis. There has been some work in this area [14,10,16], focusing on tasks such as: association rules among items in text [9], rules from semi-structured documents [18], and understanding use of language [5,15]. In this paper the desired knowledge is major topics in a collection; data mining is used to discover patterns that disclose those topics. The basic problem is as follows: Given a collection of documents, what topics are frequently discussed in the collection? The goal is to help a human understand the collection, so a good solution must identify topics in some manner that is meaningful to a human. In addition, we want results that can be used for further exploration. This gives a requirement that we be able to identify source texts relevant to a given topic. This is related to document clustering [21], but the requirement for a topic identifier brings it closer to rule discovery mechanisms. The way we apply data mining technology on this problem is to treat a document as a “collection of entities”, allowing us to map this into a market ? ??
This work supported by the Community Management Staff’s Massive Digital Data Systems Program. This work was performed while the author was at the MITRE Corporation.
basket problem. We use natural language technology to extract named entities from a document. We then look for frequent itemsets: groups of named entities that commonly occurred together. Next, we further cluster on the groups of named entities; capturing closely-related entities that may not actually occur in the same document. The result is a refined set of clusters. Each cluster is represented as a set of named entities and corresponds to an ongoing topic in the corpus. An example topic is: ORGANIZATION Justice Department, PERSON Janet Reno, ORGANIZATION Microsoft. This is recognizable as the U.S. antitrust case against Microsoft. Although not as informative as a narrative description of the topic, it is a compact, human-understandable representation. It also meets our “find the original documents” criteria, as the topic can used as a query to find documents containing some or all of the extracted named entities (see Section 3.4).
2
Problem Statement
The TopCat project started with a specific user need. The GeoNODE project at MITRE [12] is developing a system for analysis of news in a geographic context. One goal is to visualize ongoing topics in a geographic context; this requires identifying ongoing topics. We had experience with identifying association rules among entities/concepts in text, and noticed that some of the rules were recognizable as belonging to major news topics. This led to the effort to develop a topic identification mechanism based on data mining techniques. There are related topic-based problems being addressed. The Topic Detection and Tracking (TDT) project [1] looks at clustering and classifying news articles. Our problem is similar to the Topic Detection (clustering) problem, except that we must generate a human-understandable “label” for a topic: a compact identifier that allows a person to quickly see what the topic is about. Even though our goals are slightly different, the test corpus developed for the TDT project (a collection of news articles manually classified into topics) provides a basis for us to evaluate our work. A full description of the corpus can be found in [1]. For this evaluation, we use the topic detection criteria developed for TDT2 (described in Section 4). This requires that we go beyond identifying topics, and also match documents to a topic. One key item missing from the TDT2 evaluation criteria is that the T opicID must be useful to a human. This is harder to evaluate, as not only is it subjective, but there are many notions of “useful”. We later argue that the T opicID produced by TopCat is useful to and understandable by a human.
3
Process
TopCat follows a multi-stage process, first identifying key concepts within a document, then grouping these to find topics, and finally mapping the topics back to documents and using the mapping to find higher-level groupings. We identify key concepts within a document by using natural language techniques to extract
named people, places, and organizations. This gives us a structure that can be mapped into a market basket style mining problem.1 We then generate frequent itemsets, or groups of named entities that commonly appear together. Further clustering is done using a hypergraph splitting technique to identify groups of frequent itemsets that contain considerable overlap, even though not all of the items may appear together often enough to qualify as a frequent itemset. The generated topics, a set of named entities, can be used as a query to find documents related to the topic (Section 3.4). Using this, we can identify topics that frequently occur in the same document to perform a further clustering step (identifying not only topics, but also topic/subtopic relationships). We will use the following cluster, capturing professional tennis stories, as an example throughout this section. PERSON Andre Agassi PERSON Pete Sampras PERSON Marcelo Rios
PERSON Martina Hingis PERSON Venus Williams PERSON Anna Kournikova
PERSON Mary Pierce PERSON Serena
This is a typical cluster (in terms of size, support, etc.) and allows us to illustrate many of the details of the TopCat process. It comes from merging two subsidiary clusters (described in Section 3.5), formed from clustering seven frequent itemsets (Section 3.3). 3.1
Data Preparation
TopCat starts by identifying named entities in each article (using the Alembic[7] system). This serves several purposes. First, it shrinks the data set for further processing. It also gives structure to the data, allowing us to treat documents as a set of typed and named entities. This gives us a natural database schema for documents that maps into the traditional market basket data mining problem. Third, and perhaps most important, it means that from the start we are working with data that is rich in meaning, improving our chances of getting human understandable results. We eliminate frequently occurring terms (those occurring in over 10% of the articles, such as United States), as these are used across too many topics to be useful in discriminating between topics. We also face a problem with multiple names for the same entity (e.g., Marcelo Rios and Rios). We make use of coreference information from Alembic to identify different references to the same entity within a document. From the group of references for an entity within a document, we use the globally most common version of the name where most groups containing that name contain at least one other name within the current group. Although not perfect, this does give a global identifier for an entity that is both reasonably global and reasonably unique. We eliminate composite articles (those about multiple unrelated topics, such as daily news summaries). We found most composite articles could be identified 1
Treating a document as a “basket of words” did not produce as meaningful topics. Named entities stand alone, but raw words need sequence.
by periodic recurrence of the same headline; we ignore any article with a headline that occurs at least monthly.
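The following is a minimal sketch of the two data-preparation filters just described (entities occurring in over 10% of the articles, and composite articles identified by recurring headlines). The article structure, the date format, and the way "occurs at least monthly" is approximated as a headline seen in more than one month are assumptions made only for the illustration.

from collections import Counter, defaultdict

def prepare(articles, frequent_entity_cutoff=0.10):
    """articles: list of dicts with 'headline', 'date' (YYYY-MM-DD) and 'entities'.
    Returns the articles kept after the two filters described above
    (an illustrative reading of the paper, not its code)."""
    # Drop composite articles: headlines that recur in more than one month.
    months_per_headline = defaultdict(set)
    for art in articles:
        months_per_headline[art["headline"]].add(art["date"][:7])
    kept = [a for a in articles if len(months_per_headline[a["headline"]]) == 1]

    # Drop entities that appear in more than the cutoff fraction of kept articles.
    doc_freq = Counter()
    for art in kept:
        doc_freq.update(set(art["entities"]))
    limit = frequent_entity_cutoff * len(kept)
    for art in kept:
        art["entities"] = [e for e in art["entities"] if doc_freq[e] <= limit]
    return kept

sample = [
    {"headline": "NEWS SUMMARY", "date": "1998-01-05", "entities": ["United States"]},
    {"headline": "NEWS SUMMARY", "date": "1998-02-02", "entities": ["United States"]},
    {"headline": "Yankees sign pitcher", "date": "1998-02-10",
     "entities": ["David Cone", "United States"]},
    {"headline": "Spring training opens", "date": "1998-02-20",
     "entities": ["Joe Torre", "Tampa", "United States"]},
    {"headline": "Stadium repairs continue", "date": "1998-03-01",
     "entities": ["George Steinbrenner", "Yankee Stadium", "United States"]},
]
# 10% is the paper's threshold; a larger cutoff is used here only because this
# sample corpus is tiny.
print(prepare(sample, frequent_entity_cutoff=0.5))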
3.2
Frequent Itemsets
The foundation of the topic identification process is frequent itemsets. In our case, a frequent itemset is a group of named entities that occur together in multiple articles. What this really gives us is correlated items, rather than any notion of a topic. However, we found that correlated named entities frequently occurred within a recognizable topic. Discovery of frequent itemsets is a well-understood data mining problem, arising in the market basket association rule problem [4]. A document can be viewed as a market basket of named entities; existing research in this area applies directly to our problem. (We use the query flocks technology of [20] for finding frequent itemsets using the filtering criteria below). One problem with frequent itemsets is that the items must co-occur frequently, causing us to ignore topics that occur in only a few articles. To deal with this, we use a low support threshold of 0.05% (25 occurrences in the TDT corpus). Since we are working with multiple sources, any topic of importance is mentioned multiple times; this level of support captures all topics of any ongoing significance. However, this gives too many frequent itemsets (6028 2-itemsets in the TDT corpus). We need additional filtering criteria to get just the “important” itemsets.2 We use interest[6], a measure of correlation strength (specifically, the ratio of the probability of a frequent itemset occurring in a document to the multiple of the independent probabilities of occurrence of the individual items) as an additional filter. This emphasizes relatively rare items that generally occur together, and de-emphasizes common items. We select all frequent itemsets where either the support or interest are at least one standard deviation above the average, or where both support and interest are above average (note that this is computed independently for 2-itemsets, 3-itemsets, etc.) For 2-itemsets, this brings us from 6028 to 1033. We also use interest to choose between “contained” and “containing” itemsets (i.e., any 3-itemset contains three 2-itemsets with the required support.) An n−1itemset is used only if it has greater interest than the corresponding n-itemset, and an n-itemset is used only if it has greater interest than at least one of its contained n − 1-itemsets. This brings us to 416 (instead of 1033) 2-itemsets. The difficulty with using frequent itemsets for topic identification is that they tend to be over-specific. For example, the “tennis player” frequent itemsets consist of the following: 2
2 The problems with traditional data mining measures for use with text corpuses have been noted elsewhere as well; see [8] for another approach.
Type1    Value1            Type2    Value2            Support  Interest
PERSON   Andre Agassi      PERSON   Marcelo Rios      .00063   261
PERSON   Andre Agassi      PERSON   Pete Sampras      .00100   190
PERSON   Anna Kournikova   PERSON   Martina Hingis    .00070   283
PERSON   Marcelo Rios      PERSON   Pete Sampras      .00076   265
PERSON   Martina Hingis    PERSON   Mary Pierce       .00057   227
PERSON   Martina Hingis    PERSON   Serena Williams   .00054   228
PERSON   Martina Hingis    PERSON   Venus Williams    .00063   183
These capture individual matches of significance, but not the topic of "championship tennis" as a whole.
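To make the filtering rule above concrete, the following is a minimal Python sketch (not the query-flocks implementation TopCat actually uses); all function and variable names are illustrative. It computes support and interest for co-occurring entity pairs and then applies the minimum-support and one-standard-deviation filters described above.

from itertools import combinations
from collections import Counter

def pair_statistics(docs):
    """Support and interest of 2-itemsets of named entities.
    docs: list of sets of named entities (one set per article)."""
    n = len(docs)
    item_count, pair_count = Counter(), Counter()
    for entities in docs:
        item_count.update(entities)
        pair_count.update(combinations(sorted(entities), 2))
    stats = {}
    for (a, b), c in pair_count.items():
        support = c / n
        # interest = P(a,b) / (P(a) * P(b))
        interest = support / ((item_count[a] / n) * (item_count[b] / n))
        stats[(a, b)] = (support, interest)
    return stats

def filter_pairs(stats, min_support=0.0005):
    """Keep pairs meeting the support threshold whose support or interest is
    one standard deviation above the mean, or both are above the mean."""
    kept = {k: v for k, v in stats.items() if v[0] >= min_support}
    if not kept:
        return kept
    sup = [v[0] for v in kept.values()]
    intr = [v[1] for v in kept.values()]
    ms, mi = sum(sup) / len(sup), sum(intr) / len(intr)
    ss = (sum((x - ms) ** 2 for x in sup) / len(sup)) ** 0.5
    si = (sum((x - mi) ** 2 for x in intr) / len(intr)) ** 0.5
    return {k: (s, i) for k, (s, i) in kept.items()
            if s >= ms + ss or i >= mi + si or (s >= ms and i >= mi)}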
3.3 Clustering
We experimented with different frequent itemset filtering techniques, but were always faced with an unacceptable tradeoff between the number of itemsets and our ability to capture a reasonable breadth of topics. Further investigation showed that some named entities we should group as a topic would not show up as a frequent itemset under any measure; no article contained all of the entities. Therefore, we chose to perform clustering of the named entities in addition to the discovery of frequent itemsets.

The hypergraph clustering method of [11] takes a set of association rules and declares the items in the rules to be vertices, and the rules themselves to be hyperedges. Clusters can be quickly found by using a hypergraph partitioning algorithm such as hMETIS [13]. We adapted the hypergraph clustering algorithm described in [11] in several ways to fit our particular domain. Because TopCat discovers frequent itemsets instead of association rules, the rules do not have any directionality and therefore do not need to be combined prior to being used in a hypergraph. The interest of each itemset was used as the weight of each edge. Since interest tends to increase dramatically as the number of items in a frequent itemset increases, the log of the interest was used in the clustering algorithm to prevent the larger itemsets from completely dominating the process.

Upon investigation, we found that the stopping criterion presented in [11] only works for domains that form very highly connected hypergraphs. Their algorithm continues to recursively partition a hypergraph until the weight of the edges cut compared to the weight of the edges left in either partition falls below a set ratio (referred to as fitness). This criterion has two fundamental problems: it will never divide a loosely connected hypergraph into the appropriate number of clusters, as it stops as soon as it finds a partition that meets the fitness criterion; and it always performs at least one partition (even if the entire hypergraph should be left together). To solve these problems, we use the cut-weight ratio (the weight of the cut edges divided by the weight of the uncut edges in a given partition). This is defined as follows. Let P be a partition with a set of m edges e, and c the set of n edges cut in the previous split of the hypergraph:

  cutweight(P) = ( Σ_{i=1}^{n} Weight(c_i) ) / ( Σ_{j=1}^{m} Weight(e_j) )
Fig. 1. Hypergraph of New York Yankees Baseball Frequent Itemsets (vertices: David Cone, Yankee Stadium, George Steinbrenner, Joe Torre, Daryl Strawberry, Tampa; edge weights 473, 441, 191, 162, 161).
A hyperedge remains in a partition if 2 or more vertices from the original edge are in the partition. For example, a cut-weight ratio of 0.5 means that the weight of the cut edges is half of the weight of the remaining edges. The algorithm assumes that natural clusters will be highly connected by edges. Therefore, a low cut-weight ratio indicates that hMETIS made what should be a natural split between the vertices in the hypergraph. A high cut-weight ratio indicates that the hypergraph was a natural cluster of items and should not have been split. Once the stopping criterion has been reached, vertices are "added back in" to clusters if they are contained in an edge that "overlaps" to a significant degree with the vertices in the cluster. The minimum amount of overlap required is defined by the user. This allows items to appear in multiple clusters. For our domain, we found that the results were fairly insensitive to the cutoff criteria. Cut-weight ratios from 0.3 to 0.8 produced similar clusters, with the higher ratios partitioning the data into a few more clusters than the lower ratios.

The TDT data produced one huge hypergraph containing half the clusters. Most of the rest are independent hypergraphs that become single clusters. One that does not become a single cluster is shown in Figure 1. Here, the link between Joe Torre and George Steinbrenner (shown dashed) is cut. Even though this is not the weakest link, the attempt to balance the graphs causes this link to be cut, rather than producing a singleton set by cutting a weaker link. This is a sensible distinction. During spring 1999, the Yankees manager (Torre) and players were in Tampa, Florida for spring training, while the owner (Steinbrenner) was handling repairs to a crumbling Yankee Stadium in New York.
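A minimal sketch of this stopping test, under the simplifying assumption that an edge counts as cut when exactly one of its vertices remains in the partition; the function and parameter names are ours, not the paper's.

def cut_weight_ratio(partition, hyperedges):
    """Weight of cut edges divided by weight of edges kept in the partition.
    partition: set of vertices; hyperedges: list of (vertex_set, weight)."""
    kept = cut = 0.0
    for vertices, weight in hyperedges:
        inside = len(vertices & partition)
        if inside >= 2:          # edge remains: 2+ of its vertices are kept
            kept += weight
        elif inside == 1:        # edge was severed by the split
            cut += weight
    return float('inf') if kept == 0 else cut / kept

def keep_split(partition, hyperedges, cutoff=0.5):
    """Low ratio: the split looks natural; high ratio: keep the cluster whole."""
    return cut_weight_ratio(partition, hyperedges) < cutoff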
3.4 Mapping to Documents
The preceding process gives us reasonable topics. However, to evaluate this with respect to the TDT2 instrumented corpus, we must map the identified topics back to a set of documents. We use the fact that the topic itself, a set of named entities, looks much like a boolean query. We use the TFIDF metric [17] to generate a distance measure between a document and a topic, then choose the closest topic for each document. This is a flexible measure; if desired, we can use cutoffs (a document isn't close to any topic), or allow multiple mappings.
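As a rough sketch of this mapping step (the paper uses the TFIDF metric of [17]; the scoring below is a simplification and the names are illustrative):

import numpy as np

def assign_documents(doc_vectors, topic_entity_ids, cutoff=None):
    """Assign each document to its closest topic, treating a topic (a set of
    named entities) as a boolean query scored with TFIDF weights.
    doc_vectors: (n_docs, n_entities) TFIDF matrix.
    topic_entity_ids: one list of entity column indices per topic."""
    assignments = []
    for doc in doc_vectors:
        scores = [doc[ids].sum() for ids in topic_entity_ids]
        best = int(np.argmax(scores))
        if cutoff is not None and scores[best] < cutoff:
            assignments.append(None)   # document is not close to any topic
        else:
            assignments.append(best)
    return assignments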
3.5 Combining Clusters Based on Document Mapping
Although the clustered topics appeared reasonable, we were over-segmenting with respect to the TDT "ground truth" criteria. For example, we separated men's and women's tennis; the TDT human-defined topics had this as a single topic. We found that the topic-to-document mapping provided a means to deal with this. Many documents were close to multiple topics. In some cases, this overlap was common and repeated; many documents referenced both topics (the tennis example was one of these). We used this to merge topics, giving the final "tennis" topic shown in Section 1.

There are two types of merge. In the first (marriage), the majority of documents similar to either topic are similar to both. In the second (parent/child), the documents similar to the child are also similar to the parent, but the reverse does not necessarily hold. (The tennis clusters were a marriage merge.) The marriage similarity between clusters a and b is defined as:

  Marriage_ab = ( Σ_{i∈documents} TFIDF_ia · TFIDF_ib / N ) / ( (Σ_{i∈documents} TFIDF_ia / N) · (Σ_{i∈documents} TFIDF_ib / N) )

Based on the TDT2 training set, we chose a cutoff of 30 (Marriage_ab ≥ 30) for merging clusters. Similar clusters are merged by taking a union of their named entities. The parent/child relationship is calculated as follows:

  ParentChild_pc = ( Σ_{i∈documents} TFIDF_ip · TFIDF_ic / N ) / ( Σ_{i∈documents} TFIDF_ic / N )

We calculate the parent/child relationship after the marriage clusters have been merged. In this case, we used a cutoff of 0.3. Merging the groups is again accomplished through a union of the named entities. Note that there is nothing document-specific about these methods. The same approach could be applied to any market basket problem.
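Both merge measures can be computed directly from the per-document TFIDF similarities of each topic; a small sketch with illustrative names, assuming each topic's similarity to every document has been precomputed as a vector:

import numpy as np

def merge_similarities(tfidf_a, tfidf_b):
    """Marriage and parent/child similarities between two topics, given the
    per-document TFIDF similarity vectors of each topic (length-N arrays);
    here b plays the role of the child topic."""
    N = len(tfidf_a)
    joint = np.sum(tfidf_a * tfidf_b) / N
    mean_a = np.sum(tfidf_a) / N
    mean_b = np.sum(tfidf_b) / N
    marriage = joint / (mean_a * mean_b)
    parent_child = joint / mean_b
    return marriage, parent_child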
4 Experimental Results
The TDT2 evaluation criterion is based on the probability of failing to retrieve a document that belongs with the topic, and the probability of erroneously matching a document to the topic. These are combined into a single number CDet as described in [3]. The mapping between TopCat-identified topics and reference topics is defined to be the mapping that minimizes CDet for that topic (as specified by the TDT2 evaluation process).
Using the TDT2 evaluation data (May and June 1998), the CDet score was 0.0055. This was comparable to the results from the TDT2 topic detection participants[2], which ranged from 0.0040 to 0.0129, although they are not directly comparable (as the TDT2 topic detection is on-line, rather than retrospective). Of note is the low false alarm probability we achieved (0.002); further improvement here would be difficult. The primary impediment to a better overall score is the miss probability of 0.17. The primary reason for the high miss probability is the difference in specificity between the human-defined topics and the TopCat-discovered topics. (Only two topics were missed entirely; one contained a single document, the other three documents.) Many TDT2-defined topics matched multiple TopCat topics. Since the TDT2 evaluation process only allows a single system-defined topic to be mapped to the human-defined topic, over half the TopCat-discovered topics were not used (and any document associated with those topics was counted as a “miss” in the scoring). TopCat often identified separate topics, such as (for the conflict with Iraq) Madeleine Albright/Iraq/Middle East/State, in addition to the “best” topic (lowest CDet score) shown at the top of Table 1. Although various TopCat parameters could be changed to merge these, many similar topics that the “ground truth” set considers separate (such as the world ice skating championships and the winter Olympics) would be merged as well. The miss probability is a minor issue for our problem. Our goal is to identify important topics, and to give a user the means to follow up on that topic. The low false alarm probability means that a story selected for follow-up will give good information on the topic. For the purpose of understanding general topics and trends in a corpus, it is more important to get all topics and a few good articles for each topic than to get all articles for a topic.
5 Conclusions and Future Work
We find the identified topics both reasonable in terms of the TDT2 defined accuracy, and understandable identifiers for the subject. For example, the three most important topics (based on the support of the frequent itemsets used to generate the topics) are shown in Table 1. The first (Iraqi arms inspections) also gives information on who is involved (although knowing that Richard Butler was head of the arms inspection team, Bill Richardson is the U.S. Ambassador to the UN, and Saddam Hussein is the leader of Iraq may require looking at the documents; this shows the usefulness of mapping the topic identifier to documents). The third is also reasonably understandable: events in and around Yugoslavia. The second is an amusing proof of the first half of the adage "Everybody talks about the weather, but nobody does anything about it."

Table 1. Top 3 Topics for January through June 1998

Topic 1:
  LOCATION: Baghdad, Britain, China, Iraq, Russia, Kuwait, France
  ORG.: Security Council, United Nations, U.N.
  PERSON: Kofi Annan, Saddam Hussein, Richard Butler, Bill Richardson

Topic 2:
  LOCATION: Alaska, Anchorage, Caribbean, Great Lakes, Gulf Coast, Hawaii, New England, Northeast, Northwest, Ohio Valley, Pacific Northwest, Plains, Southeast, West
  PERSON: Byron Miranda, Karen Mcginnis, Meteorologist Dave Hennen, Valerie Voss

Topic 3:
  LOCATION: Albania, Macedonia, Belgrade, Bosnia, Pristina, Yugoslavia, Serbia
  PERSON: Slobodan Milosevic, Ibrahim Rugova
  ORG.: Nato, Kosovo Liberation Army

The clustering methods of TopCat are not limited to topics in text; any market-basket-style problem is amenable to the same approach. For example, we could use the hypergraph clustering and relationship clustering on mail-order purchase data. This extends association rules to higher-level "related purchase" groups. Association rules provide a few highly-specific actionable items, but are not as useful for high-level understanding of general patterns. The methods presented here can be used to give an overview of patterns and trends of related purchases, to use (for example) in assembling a targeted specialty catalog.

The cluster merging of Section 3.5 defines a topic relationship. We are exploring how this can be used to browse news sources by topic. Another issue is the use of information other than named entities to identify topics. One possibility is to add actions (e.g., particularly meaningful verbs such as "elected"). We have made little use of the type of named entity. However, what the named entity processing really gives us is a typed market basket (e.g., LOCATION or PERSON as types). Another possibility is to use generalizations (e.g., a geographic "thesaurus" equating Prague and Brno with the Czech Republic) in the mining process [19]. Further work on expanded models for data mining could have significant impact on data mining of text.
References 1. 1998 topic detection and tracking project (TDT-2). http://www.nist.gov/speech/tdt98/tdt98.htm. 2. The topic detection and tracking phase 2 (TDT2) evaluation. ftp://jaguar.ncsl.nist.gov/tdt98/tdt2 dec98 official results 19990204/index.htm. 3. The topic detection and tracking phase 2 (TDT2) evaluation plan. http://www.nist.gov/speech/tdt98/doc/tdt2.eval.plan.98.v3.7.pdf. 4. Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., May 26–28 1993. 5. Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri Verkamo. Mining in the phrasal frontier. In 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD’97), Trondheim, Norway, June 25–27 1997.
6. Sergey Brin, Rajeev Motwani, and Craig Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the 1997 ACM SIGMOD Conference on Management of Data, Tucson, AZ, May 13-15 1997. 7. David Day, John Aberdeen, Lynette Hirschman, Robyn Kozierok, Patricia Robinson, and Marc Vilain. Mixed initiative development of language processing systems. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 1997. 8. Ronen Feldman, Yonatan Aumann, Amihood Amir, Amir Zilberstein, and Wiolli Kloesgen. Maximal association rules: a new tool for mining for keyword cooccurrences in document collections. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 167–170, August 14– 17 1997. 9. Ronen Feldman and Haym Hirsh. Exploiting background information in knowledge discovery from text. Journal of Intelligent Information Systems, 9(1):83–97, July 1998. 10. Ronen Feldman and Haym Hirsh, editors. IJCAI’99 Workshop on Text Mining, Stockholm, Sweden, August 2 1999. 11. Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar. Clustering based on association rule hypergraphs. In Proceedings of the SIGMOD’97 Workshop on Research Issues in Data Mining and Knowledge Discovery. ACM, 1997. 12. Rob Hyland, Chris Clifton, and Rod Holland. GeoNODE: Visualizing news in geospatial context. In Proceedings of the Federal Data Mining Symposium and Exposition ’99, Washington, D.C., March 9-10 1999. AFCEA. 13. George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekar. Multilevel hypergraph partitioning: Applications in VLSI domain. In Proceedings of the ACM/IEEE Design Automation Conference, 1997. 14. Yves Kodratoff, editor. European Conference on Machine Learning Workshop on Text Mining, Chemnitz, Germany, April 1998. 15. Brian Lent, Rakesh Agrawal, and Ramakrishnan Srikant. Discovering trends in text databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 227–230, August 14–17 1997. 16. Dunja Mladeni´c and Marko Grobelnik, editors. ICML-99 Workshop on Machine Learning in Text Data Analysis, Bled, Slovenia, June 30 1999. 17. Gerard Salton, James Allan, and Chris Buckley. Automatic structuring and retrieval of large text files. Communications of the ACM, 37(2):97–108, February 1994. 18. Lisa Singh, Peter Scheuermann, and Bin Chen. Generating association rules from semi-structured documents using an extended concept hierarchy. In Proceedings of the Sixth International Conference on Information and Knowledge Management, Las Vegas, Nevada, November 1997. 19. Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Databases, Zurich, Switzerland, September 23-25 1995. 20. Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, and Arnon Rosenthal. Query flocks: A generalization of association rule mining. In Proceedings of the 1998 ACM SIGMOD Conference on Management of Data, pages 1–12, Seattle, WA, June 2-4 1998. 21. Oren Zamir, Oren Etzioni, Omid Madan, and Richard M. Karp. Fast and intuitive clustering of web documents. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 287–290, August 14–17 1997.
Selection and Statistical Validation of Features and Prototypes

M. Sebban (1,2), D.A. Zighed (2) & S. Di Palma (2)

1: R.A.P.I.D. laboratory - West Indies and Guiana University, France. Marc.Sebban@univ-ag.fr
2: E.R.I.C. laboratory - Lyon 2 University, France. {zighed,sebban,sdipalma}@univ-lyon2.fr

Abstract. Features and prototypes selection are two major problems in data mining, especially for machine learning algorithms. The goal of both selections is to reduce storage complexity, and thus computational costs, without sacrificing accuracy. In this article, we present two incremental algorithms using geometrical neighborhood graphs and a new statistical test to select, step by step, relevant features and prototypes for supervised learning problems. The feature selection procedure we present could be applied before any machine learning algorithm is used.
1 Introduction

We deal in this paper with learning from examples ω described by pairs [X(ω), Y(ω)], where X(ω) is a vector of p feature values and Y(ω) is the corresponding class label. The goal of a learning algorithm is to build a classification function φ from a sample Ωa of n examples ωj (j = 1...n). From a theoretical standpoint, the selection of a good subset of features X is of little interest: a Bayesian classifier (based on the true distributions) is monotonic, i.e., adding features can not decrease the model's performance [10]. This task has however received plenty of attention from statisticians and researchers in Machine Learning since the monotonicity assumption rarely holds in practical situations where the true distributions are unknown. Irrelevant or weakly relevant features may thus reduce the accuracy of the model. Thrun et al. [18] showed that the C4.5 algorithm generates deeper decision trees with lower performances when weakly relevant features are not deleted. Aha [1] also showed that the storage of the IB3 algorithm increases exponentially with the number of irrelevant features.

Selection of relevant prototype subsets has also been much studied in Machine Learning. This technique is of particular interest when using non parametric classification methods such as k-nearest-neighbors [8], Parzen's windows [12] or, more generally, methods based on geometrical models that have a reputation for having high computational and storage costs. In fact, the classification of a new example often requires distance calculations with all points stored in
memory. This led researchers to build strategies to reduce the size of the learning sample (selecting only the best examples, which will be called prototypes), keeping and perhaps increasing classification performances [8], [7] and [17]. We present in this article two hill climbing algorithms to select relevant features and prototypes, using models from computational geometry. The first algorithm step by step selects relevant features independently of a given learning algorithm (the classification accuracy is not used to identify the best features but only to stop the selection algorithm). This feature selection technique is based on the idea that performances of a learning algorithm, whatever the algorithm may be, necessarily depend on the geometrical structures of the classes to learn. We propose characterizing these structures in IR^p using models inspired from computational geometry. At each stage, we statistically measure the separability of these structures in the current representation space, and verify if the kept features allow to build a model more efficient than the previous one. Unlike the first, the second algorithm uses the classification function to select prototypes in the learning sample. It tests the quality of the selected examples, verifying on the one hand that they allow to obtain on a validation sample a success rate significantly close to the one obtained with the full sample, and on the other hand that they constitute one of the best learning subsets of this size.
2 Definitions in Computational Geometry

The approach we propose in this article uses neighborhood graphs. Interested readers will find many models of neighborhood graphs in [13], such as Delaunay's Triangulation, the Relative Neighborhood Graph, and the Minimum Spanning Tree (Fig. 1).
Fig. 1. Neighborhood Structures (Minimum Spanning Tree, Gabriel's Graph, Relative Neighborhood Graph, Delaunay's Triangulation).
Definition 1: A graph G(Ω, A) is composed of a set of vertices noted Ω linked by a set of edges noted A. NB: In the case of an oriented graph, A will be the set of arcs. In our paper, we only consider non-oriented graphs, i.e. a link between two points defines an edge. This choice makes every neighborhood relation symmetrical.
3 Selection of Relevant Features

3.1 Introduction
Given a representation space X constituted by p features X1, X2, ..., Xp, and a sample of n examples noted ω1, ω2, ..., ωn, a learning method allows to build a classification function φ1 to predict the state of Y. Consider now a subset X′ = {X1, X2, ..., Xp′} of all features, with p′ < p, and note φ2 the classification function built in this new representation space. If the classification performances of φ1 and φ2 are equivalent, we will always prefer the model using fewer features for the construction of φ. Two reasons justify this choice:

1. The choice of X′ reduces overfitting risks.
2. The choice of X′ reduces computational and storage costs.

Generalization performances of φ2 may sometimes be better than those obtained with φ1, because some features can be noised in the original space. Nevertheless, we can not test all combinations of features, i.e. build and test 2^p − 1 classification functions.

Constructive methods (decision trees, fuzzy trees, induction graphs, etc.) select features step by step when they improve the performances of a given criterion (classification success rate, homogeneity criterion). In these methods, the construction of the φ function is done simultaneously with the features choice. Among works using the estimation of the classification success rate, we can cite the cross-validation procedure [10], and the Bootstrap procedure [5]. Nevertheless, even if these methods allow to obtain an unbiased estimation of this rate, calculation costs seem prohibitive to justify these procedures at each stage of the feature selection process. Methods using a homogeneity criterion often propose simple indicators fast to compute, such as entropy measures, uncertainty measures, separability measures like the Λ of Wilks [14] or Mahalanobis's distance. But results also depend on the current φ function. We propose in the next section a new feature selection approach, applied before the construction of the φ classification function, independently of the learning method used. To estimate the quality of a feature, we propose to estimate the quality of the representation space with this feature.
3.2 How to Evaluate the Quality of a Representation Space?
We consider that m different classes are well represented by p features, if the representation space (characterized by p dimensions) shows wide geometrical structures of points belonging to these classes. In fact, when we build a model, we always search for the representation space farthest from the situation where each point of each class constitutes one structure. Thus, the quality of a representation space can be estimated by the distance to the worst situation, characterised by the equality of the density functions of the classes. To solve this problem, we can use one of the numerous statistical tests of population homogeneity. Unfortunately, none of these tests is both nonparametric and applicable in IR^p. In Sebban [15], we built a new statistical test (called the test of edges), which does not suffer from these constraints. Under the null hypothesis H0:

  H0 : F1(x) = F2(x) = ... = Fm(x) = F(x), where Fi(x) corresponds to the repartition function of the class i

The construction of this test uses some contributions of computational geometry. Our approach is based on the search for geometrical structures, called homogeneous subsets, joining points that belong to the same class. To obtain these homogeneous subsets and evaluate the quality of the representation space, we propose the following procedure:

1. Construct a related geometrical graph, such as the Delaunay Triangulation, the Gabriel's Graph, etc. [13].
2. Construct homogeneous subsets, deleting edges connecting points which belong to different classes.
3. Compare the proportion of deleted edges with the probability obtained under the null hypothesis.

The critical threshold of this test is used to search for the representation space which is the farthest from the H0 hypothesis. Actually, the smaller this risk is, the further from the H0 hypothesis we are. Two strategies are possible to find a good representation space:

1. Search for the representation space which minimizes the critical threshold of the test, i.e. which is the farthest from the H0 hypothesis. Later on, we will use this approach to tackle this problem.
2. Search for a way to minimize the size of the representation space (with the advantage of reducing storage and computing costs), without reducing the quality of the initial space.
3.3 Algorithm
Let X = {X1, X2, ..., Xp} be the representation of a given learning sample Ωa. Among these p features, we search for the p′ most discriminant ones (p′ < p) using the following algorithm:
1. Compute the critical threshold α0 of the test of edges in the initial representation space X.
2. Compute, for each combination of p − 1 features taken among the p current ones, the critical threshold αc.
3. Select the feature which minimizes the critical threshold αc.
4. If αc < α0 then delete the selected feature, p ← p − 1, and return to step 1; else p′ = p and stop.
This algorithm is a hill climbing method. It does not search for an optimal classification function, in accordance with a criterion based on an uncertainty measure, but rather aims at finding a representation space that allows to build a better model.
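A sketch of this selection loop in Python. The paper's test of edges compares the proportion of deleted (class-crossing) edges to its distribution under H0; since that analytic test is not reproduced here, the sketch substitutes a permutation p-value on a Gabriel graph, which plays the same role of a critical threshold. All names are illustrative.

import numpy as np

def gabriel_edges(X):
    """Edges of the Gabriel graph: (i, j) is an edge iff no other point falls
    inside the hypersphere whose diameter is the segment [x_i, x_j]."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            mid = (X[i] + X[j]) / 2.0
            r2 = d2[i, j] / 4.0
            others = ((X - mid) ** 2).sum(-1)
            others[i] = others[j] = np.inf
            if (others > r2).all():
                edges.append((i, j))
    return edges

def cut_edge_pvalue(X, y, n_perm=200, rng=None):
    """Permutation p-value for the proportion of edges joining different
    classes being unusually small (stand-in for the test of edges)."""
    rng = rng or np.random.default_rng(0)
    edges = gabriel_edges(X)
    if not edges:
        return 1.0
    cut = lambda labels: np.mean([labels[i] != labels[j] for i, j in edges])
    observed = cut(y)
    perms = [cut(rng.permutation(y)) for _ in range(n_perm)]
    return float(np.mean([p <= observed for p in perms]))

def backward_feature_selection(X, y):
    """Hill-climbing deletion of features while the test statistic improves."""
    features = list(range(X.shape[1]))
    alpha0 = cut_edge_pvalue(X[:, features], y)
    while len(features) > 1:
        candidates = [(cut_edge_pvalue(X[:, [f for f in features if f != g]], y), g)
                      for g in features]
        alpha_c, worst = min(candidates)
        if alpha_c < alpha0:
            features.remove(worst)
            alpha0 = alpha_c
        else:
            break
    return features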
3.4 Simulated Example
To illustrate our approach, we apply in this section our algorithm to a simulated example. Let Ωa be a learning sample composed of 100 examples belonging to two classes. Each example is represented in IR^3 by 3 features (noted X1, X2, X3). The two classes are statistically different, i.e. characterised by two different probability densities. For instance,

  Normal law N(μ1, σ1) for examples of class y1,
  Normal law N(μ2, σ2), where μ2 > μ1, for examples of class y2.

To estimate the capacity of our algorithm to find the best representation space, we generate 3 new noised features (noted X4, X5, X6). Each feature is generated identically for the whole sample. The first α0 risk in IR^6 is about 1.10^−8. Applying our algorithm, we obtain the following results (Table 1).
Table 1. Application of the feature selection algorithm (for each step i = 1, ..., 4: the critical thresholds αc obtained when deleting each candidate feature, the retained threshold, and the Continue/Stop decision).
During step 1, deletion of the X4 feature allows to reduce the critical threshold (from 1.10^−8 to 2.10^−13). Steps 2 and 3 lead to the suppression of X6 and X5. At the fourth step, the value

the second applies random mutation hill climbing, where the fitness function is the classification success rate on the learning sample. Yet, this approach is limited to simple problems where classes of patterns are easily separable, since the author a priori defines the number of prototypes as the number of classes. We can easily imagine some problems when classes are mixed. In our mind, we could improve this algorithm using as the number of prototypes the number of homogeneous subsets described in the previous section. Other works about prototype selection can be found in [

Let the weight π(ωj, ω) of the ωj voter, neighbor of ω, be defined as:

  π : Ωa × Ω → [0, 1],   π(ωj, ω) = Pr(ω′ ∉ S_{ωj,ω})

where S_{ωj,ω} is the hypersphere with the diameter (ωj, ω).

Definition 3: Covering space. We define the covering space D, containing all possible memberships of the Ω set, as the hypercube covering the union of hyperspheres of neighbors in the learning sample.
Fig. 2. Example of covering space (learning examples ω1, ..., ω6 and a new example ω in a rectangle with sides d1 = 8 and d2 = 10; the hypersphere shown has radius R = 1.5).
From D, we calculate the probability

  Pr(ω′ ∉ S_{ωj,ω}) = (V_D − V_{S_{ωj,ω}}) / V_D

where V_D is the volume of D and V_{S_{ωj,ω}} is the volume of the hypersphere with diameter (ωj, ω).

Definition 4: We define V_{S_{ωj,ω}}, the volume of a given hypersphere in IR^p with diameter (ωj, ω), as:

  V_{S_{ωj,ω}} = 2 π^{p/2} r^p_{ωj,ω} / (p Γ(p/2))

where r_{ωj,ω} is the radius of the hypersphere with diameter (ωj, ω) and Γ(x) is the Gamma function. V_D is obtained by multiplication of the lengths of the hypercube's sides.

Example: Given a Gabriel's Graph built from a learning sample Ωa = {ω1, ω2, ω3, ω4, ω5, ω6} (Fig. 2), and ω a new example to label, we can calculate the weight π(ω1, ω) of ω1:

  π(ω1, ω) = (V_D − V_{S_{ω1,ω}}) / V_D = (d1·d2 − π·r²_{ω1,ω}) / (d1·d2) ≈ 0.91
4.3 Prototype Selection Algorithm
Two types of algorithms exist for the building of geometrical graphs [3]:

1. Total algorithms: in this case, neighborhood structures (Gabriel, Relative neighbors or Delaunay's Triangles) are applied on the whole sample. To build a new edge, some conditions must be imposed on the whole set. Thus, when a neighborhood is built, it is never suppressed.
2. Constructive algorithms: in this case, the graph is built point by point, step by step. Each point is inserted, generating some neighborhoods, deleting others. Thus, only a local update of the graph is necessary [4].

For these two types of algorithms, the label of points to insert is not used. The prototype selection algorithm presented in this section belongs to the second category but takes into account the label of points already inserted in the graph. It may thus only be used with supervised learning. Its principle is summarized by the following pseudo-code.

  Let Ωa be the original training sample and Ωp be the set of selected prototypes
  Initially, Ωp contains one randomly selected example
  Repeat
    Classify Ωa with the Probabilistic Vote using the examples in Ωp.
    Move misclassified examples into Ωp.
  until all examples remaining in Ωa are well classified.

Thus, the pertinence of an example is defined as follows: a point is pertinent if it brings information about its class. Interested readers may find the results of an application of our prototype selection technique on the well-known Breiman wave forms problem [2] in [16]. These results show that the selection technique allows, on this problem, to cut by more than half the size of the learning sample without lowering the generalisation accuracy of the built classification function.
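A simplified Python sketch of this loop. It approximates the covering space D by the bounding box of the sample and lets every current prototype vote with the hypersphere-based weight defined above (the paper restricts voters to graph neighbors); names are illustrative.

import numpy as np
from math import pi, gamma

def voter_weight(x_j, x, volume_D, p):
    """Probability that a random point of D does NOT fall in the hypersphere
    of diameter (x_j, x): closer voters get weights nearer to 1."""
    r = np.linalg.norm(x_j - x) / 2.0
    v_sphere = (pi ** (p / 2) / gamma(p / 2 + 1)) * r ** p
    return max(0.0, (volume_D - v_sphere) / volume_D)

def probabilistic_vote(prototypes, labels, x, volume_D):
    """Weighted vote of the prototypes (here: all of them, a simplification)."""
    p = len(x)
    scores = {}
    for x_j, y_j in zip(prototypes, labels):
        scores[y_j] = scores.get(y_j, 0.0) + voter_weight(x_j, x, volume_D, p)
    return max(scores, key=scores.get)

def select_prototypes(X, y, rng=None):
    """Grow the prototype set until every remaining example is well classified."""
    rng = rng or np.random.default_rng(0)
    volume_D = float(np.prod(X.max(axis=0) - X.min(axis=0)))
    proto_idx = {int(rng.integers(len(X)))}
    while True:
        idx = list(proto_idx)
        misclassified = [i for i in range(len(X)) if i not in proto_idx and
                         probabilistic_vote(X[idx], y[idx], X[i], volume_D) != y[i]]
        if not misclassified:
            return sorted(proto_idx)
        proto_idx.update(misclassified)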
5 Conclusion

The growing size of modern databases makes feature selection and prototype selection crucial issues. We have proposed in this article two algorithms to reduce the dimensionality of the representation space and to reduce the number of examples of a learning sample. Our approach is currently limited in that it supposes that examples are only described by numerical features. We are now working on new neighborhood structures to take into account symbolic data, without using euclidean distances.
References
1. Aha, D.W., Kibler, D., & Albert, M.K. Instance-based learning algorithms. Machine Learning 6(1):37-66, 1991.

Contribution of Boosting in Wrapper Models
M. Sebban and R. Nock

... 2 classes, k binary classifiers are built, each of them used for the discrimination of one class against all others.

1 http://www.ics.uci.edu/~mlearn/MLRepository.html
AdaBoost(LS = {(x_i, y(x_i))}, i = 1..|LS|)
  Initialize distribution D_1(x_i) = 1/|LS|;
  For t = 1, 2, ..., T
    Build weak hypothesis h_t using D_t;
    Compute the confidence α_t:
      α_t = (1/2) log((1 + r_t)/(1 − r_t))              (1)
      r_t = Σ_{i=1..m} D_t(x_i) y(x_i) h_t(x_i)         (2)
    Update: D_{t+1}(x_i) = D_t(x_i) e^{−α_t y(x_i) h_t(x_i)} / Z_t
      /* Z_t is a normalization coefficient */
  endFor
  Return the classifier H(x) = sign(Σ_{t=1..T} α_t h_t(x))

Fig. 1. Pseudocode for AdaBoost.
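For reference, a compact Python rendering of the procedure in Fig. 1 for the two-class case; the weak_learner interface is an assumption of this sketch, not part of the paper.

import numpy as np

def adaboost(X, y, weak_learner, T=50):
    """Confidence-rated AdaBoost as in Fig. 1. y takes values in {-1, +1};
    weak_learner(X, y, D) must return a hypothesis h with h(X) in [-1, +1]."""
    m = len(X)
    D = np.full(m, 1.0 / m)
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)
        r = float(np.sum(D * y * h(X)))          # Eq. (2)
        r = np.clip(r, -1 + 1e-10, 1 - 1e-10)    # guard against division by zero
        alpha = 0.5 * np.log((1 + r) / (1 - r))  # Eq. (1)
        D = D * np.exp(-alpha * y * h(X))
        D /= D.sum()                             # Z_t normalization
        hypotheses.append(h)
        alphas.append(alpha)

    def H(Xnew):
        return np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hypotheses)))
    return H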
The classifier returning the greatest value gives the class of the observation. Boosting has been shown theoretically or empirically to satisfy particularly interesting properties. Among them, it was remarked [5] that boosting is sometimes immune to overfitting, a classical problem in machine learning. Moreover, it considerably reduces the representational bias in relevance estimation we pointed out before. Define the function F(x) = Σ_{t=1..T} α_t h_t(x) to avoid problems with the "sign" expression in H(x). [12] have proven that using AdaBoost is equivalent to optimizing a criterion which is not the accuracy, but precisely the normalization factor Z_t as presented in figure 1. Using a more synthetic notation, [5] have proven that AdaBoost repetitively optimizes the following criterion:

  Z = E_{(x,y(x))} [ e^{−y(x) F(x)} ]

In a first step, we decided then to use this criterion in a forward selection algorithm that we called FS2 Boost (figure 2). We show in the next section the interest of this new optimized criterion thanks to experimental results.
FS2 Boost(LS = {(x_i, y(x_i))}, i = 1..|LS|)
  1  Z_0 ← +∞; E ← ∅; S ← {s_1, s_2, ..., s_p};
  2  ForEach s_j ∈ S
       H ← AdaBoost(LS, E ∪ {s_j}); Z_j ← Z_{E∪{s_j}}(H);
     endFor
     select s_min for which Z_min = min_j Z_j;
  3  If Z_min < Z_0 then S ← S \ {s_min}; E ← E ∪ {s_min}; Z_0 ← Z_min; Goto step 2;
     Else return E;

Fig. 2. Pseudocode for FS2 Boost. S is the set of features.

3 Experimental Results: Z versus Accuracy

In this section, the goal is to test the effect of the criterion optimized in the wrapper model. We propose to compare the selected feature relevance using either Z or the accuracy, on synthetic or natural databases. Nineteen problems were chosen, among them the majority was taken from the UCI repository. A database was generated synthetically with some irrelevant features (called Artificial). Hard is a hard problem consisting of two classes and 10 features per instance. There are five irrelevant features. The class is given by the XOR of the five relevant features. Finally, each feature has 10% noise. The Xd6 problem was previously used by [3]: it is composed of 10 attributes, one of which is irrelevant. The target concept is a disjunctive normal form over the nine other attributes. There is also classification noise.

Since we know for artificial problems the relevance degree of each feature, we can easily evaluate the effectiveness of our selection method. The problem is more difficult for natural domains. An adequate solution consists in running on each feature subset an induction algorithm (kNN in our study), and comparing the "qualities" of the feature subsets with respect to the a posteriori accuracies. Accuracies are estimated by a leave-one-out cross-validation. On each dataset, we used the following experimental set-up:

1. The Simple Forward Selection (SFS) algorithm is applied, optimizing the accuracy during the selection. We then compute the accuracy by cross-validation in the selected subspace.
2. FS2 Boost is run (T = 50). We also compute the a posteriori accuracy.
3. We compute the accuracy in the original space with all the attributes.

Results are presented in Table 1. First, FS2 Boost works well on datasets for which we knew the nature of features: relevant attributes are almost always selected, even if irrelevant attributes are sometimes also selected. On these problems, the expected effects of FS2 Boost are then confirmed. Second, FS2 Boost almost always obtains a better accuracy rate on the selected subset than on the subset chosen by the simple forward selection algorithm. Third, in the majority of cases, accuracy estimates on feature subsets after FS2 Boost are better than on the whole set of attributes.
Database        SFS    FS2 Boost   All Attributes
Monks 1         97.9   97.9        81
Monks 2         67.2   67.2        68.3
Monks 2         94.4   99.0        99.0
Artificial      84.7   86.4        84
LED             81.4   90.2        90.2
LED24           81.4   87.2        77.9
Credit          86.1   87.1        76.8
EchoCardio      73     74.8        66.9
Glass2          62.5   73.2        72.0
Heart           82.2   81.7        82.8
Hepatitis       78.7   81.9        82.4
Horse           77.6   86.3        72.2
Breast Cancer   96.4   96.4        96.5
Xd6             79.9   79.9        78.1
Australian      83.8   81.6        78.7
White House     91.5   95.7        95.7
Pima            73.2   73.3        73.0
Hard            58.7   58.7        59.0
Vehicle         72.9   73.7        71.6

Table 1. Accuracy comparisons between three feature sets: (i) the subset obtained by optimizing the accuracy, (ii) the subset deduced by FS2 Boost, and (iii) the whole set of features. Best results are underlined.

Despite these interesting results, FS2 Boost has a shortcoming: its computational cost. In the next section, after some definitions, we will show that instead of minimizing Z, we can speed up the boosting convergence by optimizing another criterion, Z′.
4 Speeding-up Boosting Convergence
Let S = {(x1, y(x1)), (x2, y(x2)), ..., (xm, y(xm))} be a sequence of training examples, where each observation belongs to X, and each label yi belongs to a finite label space Y. In order to handle observations which can belong to different classes, for any description xp over X, define |xp+| (resp. |xp−|) to be the cardinality of positive (resp. negative) examples having the description xp; note that |xp| = |xp−| + |xp+|. We make large use of three quantities, |xp max| = max(|xp+|, |xp−|), |xp min| = min(|xp+|, |xp−|) and Δ(xp) = |xp max| − |xp min|. The optimal prediction for some description x is the class hidden in the "max" of |xp max|, which we write y(xp) for short. Finally, for some predicate P, define [[P]] to be 1 if P holds, and 0 otherwise; define π(x, x′) to be the predicate "x′ and x share identical descriptions", for arbitrary descriptions x and x′.

We give here indications on speeding-up Boosting convergence for the biclass setting. In the multiclass case, the strategy remains the same. The idea is to replace Schapire-Singer's Z criterion by another one, which integrates the notion of similar descriptions belonging to different classes. This kind of situation often appears in feature selection, notably at the beginning of the SFS algorithm, or when the number of features is small relative to a high cardinality of the learning set. More precisely, we use

  E_{x′∼D′_t} [ e^{−y(x′)(α_t h_t(x′))} ]

with

  D′_t(x′) = ( Σ_{x_p} D_t(x′) [[π(x_p, x′)]] Δ(x_p)/|x_p| ) / ( Σ_{x″} Σ_{x_p} D_t(x″) [[π(x_p, x″)]] Δ(x_p)/|x_p| )

In other words, we minimize a weighted expectation whose distribution favors the examples for which the conditional distribution of the observations projecting onto their description is greatly in favor of one class against the others. Note that when each possible observation belongs to one class (i.e. no information is lost among the examples), the expectation is exactly Schapire-Singer's Z. As [12] suggest, for the sake of simplicity, we can fold temporarily α_t into h_t so that the weak learner scales its votes up to IR. Removing the t subscript, we obtain the following criterion which the weak learner should strive to optimize:

  Z′ = E_{x′∼D′} [ e^{−y(x′) h(x′)} ]

Optimizing Z′ instead of Z at each round of Boosting is equivalent to (i) keeping strictly AdaBoost's algorithm while optimizing Z′, or (ii) modifying AdaBoost's initial distribution, or its update rule. With the new Z′ criterion, we have to choose in our extended algorithm iFS2 Boost,

  α′_t = (1/2) log((1 + r′_t)/(1 − r′_t))

where

  r′_t = Σ_{x′} D′_t(x′) y(x′) h_t(x′) = E_{x′∼D′_t} [ y(x′) h_t(x′) ]
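Assuming the reconstruction of D′_t above, one way to compute the reweighting is sketched below, with Δ(xp)/|xp| estimated from the class counts of each description; this is illustrative code, not the authors' implementation, and descriptions are assumed hashable (e.g. tuples of selected feature values).

from collections import defaultdict

def dprime_weights(descriptions, labels, D):
    """Reweight the boosting distribution D so that examples whose description
    is strongly dominated by one class (large Delta(x_p)/|x_p|) get more mass."""
    counts = defaultdict(lambda: defaultdict(int))
    for desc, y in zip(descriptions, labels):
        counts[desc][y] += 1
    factor = {}
    for desc, per_class in counts.items():
        sizes = sorted(per_class.values(), reverse=True)
        delta = sizes[0] - (sizes[1] if len(sizes) > 1 else 0)
        factor[desc] = delta / sum(sizes)
    weights = [D[i] * factor[descriptions[i]] for i in range(len(D))]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else list(D)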
5 Experimental Results: Z′ versus Z
We tested here 12 datasets with the following experimental set-up:

1. The FS2 Boost algorithm is run with T base learners. We test the algorithm with different values of T (T = 1, ..., 100), and we search for the minimal number T_Z which provides a stabilized feature subset FS_stabilized, i.e. for which the feature subset is the same for T = T_Z, ..., 100.
2. iFS2 Boost is also run with different values of T, and we search for the number T_Z′ which provides a stabilized feature subset FS′_stabilized.

For ten datasets, we note that the use of Z′ in iFS2 Boost allows to save some weak hypotheses, without modifying the selected features (i.e. FS_stabilized = FS′_stabilized). On average, our new algorithm requires 3.5 learners less than FS2 Boost. These results confirm the faster convergence of iFS2 Boost, without alteration of the selected subspace. What is more surprising is that for two databases (Glass2 and LED24), iFS2 Boost needs more learners than FS2 Boost and we obtain FS_stabilized ≠ FS′_stabilized. We could intuitively think that, Z′ converging faster than Z, we should not meet such a situation. In fact, we can explain this phenomenon by analyzing the speed-up factor of iFS2 Boost. Actually, the number |xp| of instances sharing a same description and belonging to different classes is not independent from one subset to another, and the gain G = T_Z − T_Z′ is directly dependent on |xp|. Thus, at a given step of the selection, iFS2 Boost can exceptionally select a weakly relevant feature for which the speed-up factor is higher than for a strongly relevant one. In that case, iFS2 Boost will require supplementary weak hypotheses to correctly update the instance distribution. Nevertheless, this phenomenon seems to be quite marginal.

Improvements of iFS2 Boost can be more dramatically presented by computing the relative gain of weak learners G_rel = (T_Z − T_Z′) / T_Z′. Results are presented in figure 3. In that case, we notice that iFS2 Boost requires on average 22.5% fewer learners than FS2 Boost, which confirms the positive effects of our new approach, without challenging the subset selected by FS2 Boost.

Fig. 3. Relative Gain G_rel of weak learners. The dotted line presents the average gain.
6 Conclusion
In this article, we linked two central problems in machine learning and data mining: feature selection and boosting. Even if these two fields have the common aim to deduce from feature sets powerful classifiers, as far as we know few works tried to share their interesting properties. Replacing the accuracy by another performance criterion, Z, optimized by a boosting algorithm, we obtained better results for feature selection, despite high computational costs. To reduce this complexity we tried to improve the proposed FS2 Boost algorithm, introducing a speed-up factor in the selection. In the majority of cases, improvements are significant, allowing to save some weak learners. The experimental gain represents on average more than 20% of the running time. Following a remark of [10] on Boosting, improvements of this magnitude without degradation of the solution would be well worth the choice of iFS2 Boost, particularly on large domains where feature selection becomes essential. We still think however that time improvements are possible, but with possibly slight modifications of the solutions. In particular, investigations on computationally efficient estimators of boosting coefficients are sought. This shall be the subject of future work in the framework of feature selection.
References 1. D. Aha and R. Bankert. A comparative evaluation of sequential feature selection algorithms. In Fisher and Lenz Edts, Artificial intelligence and Statistics, 1996. 2. A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Issue of Artificial Intelligence, 1997. 3. W. Buntine and T. Niblett. A further comparison of splitting rules for decision tree induction. Machine Learning, pages 75–85, 1992. 4. Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, pages 119–139, 1997. 5. J. Friedman, T. Hastie, and R. Tibshirani. Additive Logistic Regression : a Statistical View of Boosting. draft, July 1998. 6. G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Eleventh ICML conference, pages 121–129, 1994. 7. R. Kohavi. Feature subset selection as search with probabilistic estimates. AAAI Fall Symposium on Relevance, 1994. 8. D. Koller and R. Sahami. Toward optimal feature selection. In Thirteenth International Conference on Machine Learning (Bari-Italy), pages 284–292, 1996. 9. P. Langley and S. Sage. Oblivious decision trees and abstract cases. In Working Notes of the AAAI94 Workshop on Case-Based Reasoning, pages 113–117, 1994. 10. J. Quinlan. Bagging, boosting and c4.5. In AAAI96, pages 725–730, 1996. 11. C. Rao. Linear statistical inference and its applications. Wiley New York, 1965. 12. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidencerated predictions. In Proceedings of the Eleventh Annual ACM Conference on Computational Learning Theory, pages 80–91, 1998. 13. M. Sebban. On feature selection: a new filter model. In Twelfth International Florida AI Research Society Conference, pages 230–234, 1999. 14. D. Skalak. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In 11th International Conference on Machine Learning, pages 293–301, 1994.
Experiments on a Representation-Independent “Top-Down and Prune” Induction Scheme Richard Nock1 , Marc Sebban1 , and Pascal Jappy2
1 Univ. Antilles-Guyane, Dept of Maths and CS, Campus de Fouillole, 97159 Pointe-à-Pitre, France {rnock,msebban}@univ-ag.fr
2 Leonard's Logic, 20 rue Thérèse, 75001 Paris, France
[email protected] Abstract. Recently, some methods for the induction of Decision Trees have received much theoretical attention. While some of these works focused on efficient top-down induction algorithms, others investigated the pruning of large trees to obtain small and accurate formulae. This paper discusses the practical possibility of combining and generalizing both approaches, to use them on various classes of concept representations, not strictly restricted to decision trees or formulae built from decision trees. The algorithm, Wirei, is able to produce decision trees, decision lists, simple rules, disjunctive normal form formulae, a variant of multilinear polynomials, and more. This shifting ability allows to reduce the risk of deviating from valuable concepts during the induction. As an example, in a previously used simulated noisy dataset, the algorithm managed to find systematically the target concept itself, when using an adequate concept representation. Further experiments on twenty-two readily available datasets show the ability of Wirei to build small and accurate concept representations, which lets the user choose his formalism to best suit his interpretation needs, in particular for mining purposes.
1 Introduction
Many of the classical problems in designing machine learning (ML) algorithms can be understood by means of accuracy, time/space complexity, size and intelligibility issues. Generally, satisfying most of them is essentially a matter of compromises. In such cases, the problem is to transform rapidly enough the dataset to a useful compact representation that, while capturing most of the generalizable knowledge of the original data, will stay sufficiently small to be intelligible and interpretable. While the rapid increase in computer performance has somewhat de-emphasized the time requirements to obtain the algorithm's outputs [Qui96], the other requirements cannot be easily solved. As an example, it was recently observed that the end user of ML algorithms is likely to prefer some output types over others; that is, no single concept representation class fits all users, and the ability to shift the output type in practice is of great importance. This is also important from a theoretical viewpoint.
Some problems admit a small coding on some concept representation, but lead to overly large representations on other classes. There has been recently much work to establish sound theoretical bases for the induction of decision trees, to explain and improve the behavior of algorithms such as C4.5 [KM96, KM98, SS98]. Such algorithms proceed by a "top-down and prune" scheme: a large formula is induced, which is pruned in a later step, to obtain a small and accurate final output. While [KM96, SS98] have focused on improving the top-down induction, [KM98] have established the theoretical bases of a new pruning scheme, with theoretically proven near-optimal behavior. These schemes, though initially focused on decision trees, have remarkable general properties, which can be applied outside the class of decision trees. A previous study [NJ98] shows that the class of decision lists, which shares close properties with decision trees, can benefit from principles closely related to the top-down induction.

In this paper, we are concerned with the generalization of the whole "top-down and prune" scheme to a very large scope of concept representations. More precisely, we propose a general principle derived from the weak learning framework of [SS98] and the pruning framework of [KM98], to which we refer as Wirei (for Weak Induction REpresentation-independent). Wirei is able to induce on any problem formulae such as Decision Lists (DL), Decision Committees (DC, a variant of multilinear polynomials), Decision Trees (DT), Disjunctive Normal Form formulae (DNF), simple monomials, and more. Wirei is much different from approaches such as C4.5rules, which proposes to induce rules from DT. Indeed, in C4.5rules, a DT is always primarily induced, which in a subsequent step is transformed into a set of rules. Wirei, on the other hand, processes formulae directly inside the chosen class.

Experiments carried out on twenty-two publicly available domains reveal that on each dataset, concept representations built from various classes can be much different from each other while still being small and accurate. Wirei was also able to exhibit on runs over noisy domains the target formula itself, thus achieving an optimal compromise between accuracy and size. The time complexity of Wirei compares favorably to that of classical approaches such as C4.5. After a general presentation of Wirei, and its applications to a large scope of concept representation classes, we relate experiments conducted using Wirei on twenty-two domains, almost all of which can be found on the UCI repository of machine learning databases [BKM98].
2 Wirei
Throughout the paper, the following notations are used: LS denotes the set of examples used for training, each of which is described with n attributes, and belongs to one class among c. The following subsections present the basis of the growing and pruning algorithms. For the sake of clarity, an applicative example (generally on DT) is provided for each, and the specific applications to other classes are presented in a subsequent devoted part.
2.1 A General Top-down Induction Algorithm
The principle is to repeatedly optimize, in a top-down manner, a particular Z criterion over the partition induced by the current formula f on LS. This partition into subsets LS1, LS2, ..., LSk satisfies the following two axioms:

1. ∀1 ≤ j ≤ k, any two examples of LSj are classified exactly in the same fashion,
2. ∀1 ≤ i < j ≤ k, any two examples of respectively LSi and LSj are not classified in the same fashion.

It is important to note the term "fashion" instead of "class". Two examples classified in the same fashion follow exactly the same path in the formula, e.g. the same leaf in a DT. To each example is associated a weight, which mimics its appearance probability inside LS (if uniform, all weights equal 1/|LS|, where |.| is the cardinality function). We adopt the convention that examples are described using couples of the type (o, c_o), where o is an observation, and c_o its corresponding class; its weight is written w((o, c_o)). Fix as [[π]] the function returning the truth value of a predicate π. Define for any class 1 ≤ l ≤ c and any subset LSj of the partition the following quantities:

  W_+^{j,l} = Σ_{(o,c_o)∈LSj} w((o, c_o)) [[c_o = l]] ;   W_−^{j,l} = Σ_{(o,c_o)∈LSj} w((o, c_o)) [[c_o ≠ l]]

In other words, W_+^{j,l} represents the fraction of examples of class l present in subset LSj, and W_−^{j,l} represents the fraction of examples of classes ≠ l present in subset LSj. The Z criterion of [SS98] is the following:

  Z = 2 Σ_j Σ_l sqrt( W_+^{j,l} · W_−^{j,l} )
The core of procedure TDbuild simply consists in repeatedly optimizing the decrease of the current Z, until either no decrease is possible, or some upper bound Imax on the formula's size is reached. In order to keep a fast procedure, in any rule-based formula (e.g. DL, DNF), the current search is focused on a currently grown rule, until a new one is grown when no addition to the current rule decreases Z.
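A small sketch of the Z computation over a partition, with each subset given as (class, weight) pairs; names are illustrative and class labels are assumed to be integers 0..c−1.

from collections import defaultdict
from math import sqrt

def z_criterion(partition, num_classes):
    """Z = 2 * sum_j sum_l sqrt(W+_{j,l} * W-_{j,l}) over the partition.
    partition: list of subsets; each subset is a list of (class, weight)."""
    z = 0.0
    for subset in partition:
        per_class = defaultdict(float)
        for label, weight in subset:
            per_class[label] += weight
        subset_weight = sum(per_class.values())
        for l in range(num_classes):
            w_plus = per_class[l]
            w_minus = subset_weight - w_plus
            z += sqrt(w_plus * w_minus)
    return 2.0 * z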
2.2 A General Pruning Algorithm
The objective is to test exactly once the removal of each of the subparts of the formula f obtained from TDbuild. The test is bottom-up for all formulae based on literal or rule ordering, such as decision trees or decision lists. For other formulae without ordering, such as DNF, it "simulates" a bottom-up scanning of the formulae. The ordering of the former formulae supposes that some parts Q of the formula may only be reached after having reached another part P. In the case of decision trees, P is an internal node, and all possible Q are internal nodes belonging to the subtree rooted at P. The test evaluates the possible removal of all Q before testing P, and whenever P is removed, all depending Q are also removed, leading to the entire pruning of the subtree rooted at P, itself replaced by the best leaf over the examples reaching P. Tests for other formulae will be detailed in their devoted subsections.

Algorithm 1: BUprune(LS, f, δ, s_eq)
  Input: sample LS, formula f, real 0 < δ < 1, integer s_eq
  Output: a formula f
  foreach P ∈ f scanned bottom-up do
    C := Ways(P); H := Sons(P);
    s_loc := |Reach(P, f, LS)| × s_eq / (|LS| (c − 1)^2);
    α := sqrt( (log C + log H + 2 log(s_eq/δ)) / s_loc );
    ε_P := lError(f, Reach(P, f, LS));
    ε_∅ := lError(f \ P, Reach(P, f, LS));
    if ε_P + α ≥ ε_∅ then Remove(P, f);
  return f

Algorithm 1 presents
BUprune. We emphasize the fact that BUprune is an application of the theoretical results of [KM98]. The parameters used are the following ones. Ways(.) returns the number of distinct formulae which could replace in f the series of tests to reach P . In a decision tree, this represents the number of distinct monomials whose length equal the depth of P . Sons(.) returns the number of distinct subformulae in f that could be placed after P , without changing the size of f . In a decision tree, this represents the number of distinct subtrees that can be rooted at P without changing the whole number of internal nodes of f . Reach(.,.,.) returns the subset of examples from LS reaching P in f . In a decision tree, this represents the subset of LS reaching the internal node P . lError(.,.) returns the local error over Reach(.,.,.), in the formula f (for P ), or f to which P and all subformulae of P are removed (for ∅ ). In the case of a decision tree, the latter quantity corresponds to the local error of the best leaf rooted at P . The term “local error” is very important: in particular, the distribution used to calculate lError(.,.) is such that all examples from LS\Reach(P, f, LS) have zero weight. seq is a correction factor, which is not in [KM98]. We now explain its use. The test to remove P is optimistic, in that we face the possibility to overprune the formula, all the more if LS is not sufficiently large. For example, consider the case c = 2, |LS| = 2000, |Reach(P, f, LS)| = 100, δ < .20 and seq = |LS|. Then we obtain α > .40, even when considering C = H = 1. Experimentally, this shortcoming may lead to an empty formula, by pruning all parts of the initial formula. In order to overcome this difficulty, we have chosen to “mimic” the re-sampling of LS into another set of size seq > |LS|, in which examples would have exactly the same distribution as in LS. In our experiments, the values of C and H, since having a hard fast calculation, were approximated with upperbounds as large as possible, still in order not to face this possibility of overpruning. The bounds are not as tight as one could expect, yet they gave
We now detail the algorithms TDbuild and BUprune for various kinds of formalisms.
3 Applications of Wirei to Specific Classes
Fix u(k) = 2^k × n!/((n − k)!k!) (fast approximations of u(.) can be obtained with Stirling's formula). This is the number of Boolean monomials of length k over n variables. The application of Wirei to DT follows mainly from our preceding comments and from the results of [SS98, KM98, Qui96]. Due to the lack of space, we only detail results on other formalisms.

The simplest case is that of monomials. When a single monomial f is needed, associated with a fixed class to which we refer as the positive class, only algorithm TDbuild is used. There are only two subsets LS1 and LS2 in the partition of LS, containing respectively the examples satisfying the monomial and those which do not satisfy it. We additionally put the following constraint: each test added keeps the positive class as the majority class for the examples satisfying f. This gives the algorithm Wirei(Rule).

Decision Lists: Wirei(DL). TDbuild: for a DL with m monomials, the partition of LS contains m + 1 subsets. The first m subsets are those corresponding to a monomial, and the (m + 1)th corresponds to the default class. Optimize(.) proceeds as follows. Each possible test is added to the last rule of the decision list. When no further addition of a test decreases the Z value, a new rule, created in the last position, is investigated. BUprune: for a DL with m monomials, each P is a monomial, and the monomials are tested from the last monomial of the DL to the first one. Reach(.,.,.) returns the subset of examples reaching P. When pruning a monomial P, all monomials following P (that were not pruned) are removed with P. The best default class over the training sample replaces P. Fix as l the position of P inside the DL. We then choose C = u(l − 1). Fix as t the average number of literals of the monomials following P. Then we fix H = (m − l)u(t).

DNF: Wirei(DNF) is used when c = 2. TDbuild: for a DNF with m monomials, the partition can contain up to min{|LS|, 2^m} subsets (this quantity is never greater than |LS|, which guarantees efficient processing time). Each subset contains the examples satisfying exactly the same subset of monomials. Although there is no ordering on the monomials, algorithm TDbuild still grows one monomial at a time. Each test is added to a current monomial. When no further addition of a test into this monomial decreases the Z value, a new monomial is created, initialized to ∅, and treated as the current monomial. The same constraint as for monomials is used when minimizing Z: each test added to a monomial keeps the positive class as the majority class for all examples satisfying this monomial. BUprune: while there is no ordering on monomials, the bottom-up fashion is still preserved for the formula f. Each P represents a monomial of the DNF, and when removing P, no other monomial is removed. Reach(.,.,.) returns the
subset of examples satisfying P. Fix as l the total number of monomials ≠ P inside f that could be satisfied while satisfying P, and t their average length. In other words, each of these monomials must not have a test contradictory with P. Fix as |P| the number of literals of P. We choose C = u(|P|) and H = l × u(t). In addition, s_loc is the number of examples satisfying P.

Decision Committees: Wirei(DC). We use DC with constrained vectors. Such a DC [NG95, NJ99] contains two parts:
– a set of unordered couples (or rules) {(m_i, v_i)}, where each m_i is a monomial and each v_i is a vector in {−1, 0, 1}^c (the values correspond to the natural interpretation "is in disfavor of", "is neutral w.r.t.", "is in favor of" one class);
– a default vector D in [0, 1]^c.
For any observation o we calculate V_o, the sum of all vectors whose monomials are satisfied by o. The index of the maximal component of V_o gives the class assigned to o. If it is not unique, we take the index of the maximal component of D corresponding to the maximal components of V_o. Algorithm BUprune has the same structure as for DNF. However, in order not to artificially increase the power of the vectors by multiplying the appearances of some monomials, we do not authorize the addition of multiple copies of a single monomial, a case which can only occur when the current Z is not decreased. TDbuild: it is the same as for DNF, except that we remove the constraint on choosing monomials discriminating the positive class. Before executing algorithm BUprune, we calculate the components of each v_i. To do so, we use the algorithm of [NJ99], which proceeds by minimizing Ranking Loss as defined by [SS98].
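A minimal sketch of this decision-committee classification scheme, with monomials simplified to attribute-value tests and all names and data hypothetical:

```python
def classify_dc(observation, rules, default_vector):
    """Sum the vectors of all rules whose monomial is satisfied by the observation,
    take the maximal component, and break ties with the default vector D."""
    c = len(default_vector)
    v = [0] * c
    for monomial, vector in rules:                 # vector entries are in {-1, 0, 1}
        if all(observation.get(attr) == val for attr, val in monomial.items()):
            v = [vi + wi for vi, wi in zip(v, vector)]
    best = max(v)
    tied = [i for i, vi in enumerate(v) if vi == best]
    return tied[0] if len(tied) == 1 else max(tied, key=lambda i: default_vector[i])

# Two-class toy committee: one rule in favor of class 0, one in favor of class 1.
rules = [({"x1": 1}, [1, -1]), ({"x2": 0}, [-1, 1])]
print(classify_dc({"x1": 1, "x2": 1}, rules, default_vector=[0.4, 0.6]))   # -> 0
```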
4 Experimental Results
Wirei was evaluated on a representative collection of twenty-two problems, most of which can be found in the UCI repository [BKM98]. The only exceptions were the "LEDeven" and "XD6" domains. "LEDeven" is the noisy ten-class problem "LED10" with the classes merged into odd and even classes. "XD6" is a two-class problem with ten description variables for each example. The target concept, from which all examples are uniformly sampled, is a DNF with three variables in each of its monomials, described over the first nine variables; the tenth variable is irrelevant in the strongest sense. A 10% classification noise is added, which also corresponds to the Bayes optimum. References for all the datasets, omitted due to space constraints, can be found in [BN92, Hol93, Qui96] or in the UCI repository [BKM98]. All algorithms are run using δ = 15%, s_eq = 10000 and Imax = 40, in order to make comparisons clear. Ten complete 10-fold stratified cross-validations were carried out on each database. In a ten-fold cross-validation, the training instances are partitioned into 10 equal-sized subsets with similar class distributions. Each subset in turn is used for testing
while the remaining nine are used for training. Due to the lack of space, only experiments with Wirei(DC), Wirei(DNF) and Wirei(DL) are shown in Table 1. For each algorithm, the first column shows error rates averaged over the 10-fold cross-validations, the second column the average number of monomials, and the third column the average total number of literals (if a literal appears k times, it is counted k times). Column "Others" reports various results among the best we know of, for which the experiments were carried out under a setting similar to ours.

Over the 22 datasets, Wirei outperforms many of the traditional approaches. Comparing the errors already gives an advantage to Wirei (particularly Wirei(DL)), but the improvements become more noticeable when the errors are compared in the light of the corresponding formula sizes. Size reductions, while still preserving in many cases a better error, can reach an order of magnitude of twenty or more. In particular, Wirei(DL) is a clear winner against CN2 when considering both errors and sizes.

But there is more to say when comparing the approaches head to head. Beyond comparing accuracies, we compared the classifiers themselves on specific problems. On "XD6", we observed that the classifiers built by both Wirei(DNF) and Wirei(DL) are always exactly the target formula, beating classical DT induction approaches [BN92] in both accuracy and size. Coding the target formula with a DT requires comparatively large trees, which is riskier when building the formula in a top-down fashion: the chances are indeed larger that the formula built deviates from the optimal one. This clearly speaks for the representation shift that Wirei proposes. On "Vote0", we again obtained exactly the same classifiers for Wirei(DNF) and Wirei(DL), with one literal. This problem is known to have one attribute which makes a very reliable test [BN92], and this attribute is precisely the one always selected by Wirei(DNF) and Wirei(DL). In order to cope with this problem, [BN92] propose to remove this attribute, which gives the "Vote1" problem. While DT induction algorithms give much larger formulae, Wirei(DL) always manages to find a two-test rule which still gives very good results, and might contain useful information for data mining purposes. The problem does seem more difficult, however, since Wirei(DC) finds more complex formulae with an average error slightly below 10%, a rare result if we refer to the collection of studies reported in [Hol93], none of which break the 10% barrier. This stability property was also observed on the "Horse-Co" problem, where both Wirei(DL) and Wirei(DNF) even surpassed DT approaches using a very simple concept.

On the "LED10" domain, Wirei(DC) obtained on average a result a little above the 24% Bayes error rate, but Wirei(DL) performed very poorly (while DT give intermediate results). Interestingly, when transforming the problem into "LEDeven", Wirei(DL) achieved near-optimal prediction with a completely stable classifier, but Wirei(DC)'s prediction degraded with respect to Bayes. A simple explanation for this behavior is that "LED10" is a problem which can be encoded very efficiently using simple linear frontiers around classes [NG95], and it was proven that linear separators, while being DC with one-literal monomials
(remark that Wirei(DC)'s monomials contain on average 1.43 literals), are very difficult to encode using simple DLs [NG95]. On the other hand, "LEDeven" can be related to much simpler concepts, which can be coded very efficiently using DL. We remarked that the DLs obtained by Wirei(DL) were very accurate in that, among their two rules, the first contained a test which discriminates all but one (the "4") even digits against all odd digits, and the second, coupled with the default class, led to an efficient test to discriminate under noise the "4" digit against all odd digits. When comparing "Glass" and "Glass2", which is a modified version of "Glass" [CB91], the interest of the hypothesis concept shift between the two problems is clear, as Wirei(DC) performed well on "Glass2", while Wirei(DL) gave the best results on "Glass".

Following all these observations, we can say that Wirei provides experimental evidence of the power of simple induction schemes such as "top-down and prune", which have recently received much attention in efforts to establish sound theoretical foundations. Though many works were primarily based on decision trees [KM96, KM98], the theoretical results appear to scale in practice to various representation formalisms, three of which were explored in depth in our experiments. Additionally, the experimental results reveal that applications of the generic algorithm Wirei to specific classes can exhibit even better behavior than algorithms specifically dedicated to the same classes.
References
[BKM98] C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[BN92] W. Buntine and T. Niblett. A further comparison of splitting rules for decision-tree induction. Machine Learning, pages 75–85, 1992.
[CB91] P. Clark and R. Boswell. Rule induction with CN2: some recent improvements. In Proc. of the 6th European Working Session on Learning, pages 151–161, 1991.
[Dom98] P. Domingos. A process-oriented heuristic for model selection. In Proc. of the 15th International Conference on Machine Learning, pages 127–135, 1998.
[FW98] E. Frank and I. Witten. Using a permutation test for attribute selection in decision trees. In Proc. of the 15th International Conference on Machine Learning, pages 152–160, 1998.
[Hol93] R.C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, pages 63–91, 1993.
[KM96] M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, pages 459–468, 1996.
[KM98] M.J. Kearns and Y. Mansour. A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In Proc. of the 15th International Conference on Machine Learning, 1998.
[NG95] R. Nock and O. Gascuel. On learning decision committees. In Proc. of the 12th International Conference on Machine Learning, pages 413–420, 1995.
[NJ98] R. Nock and P. Jappy. On the power of decision lists. In Proc. of the 15th International Conference on Machine Learning, pages 413–420, 1998.
Table 1. Comparisons of various approaches of Wirei (whose least errors are underlined for each domain; a "/" mark for DNF denotes a domain with c > 2 classes).

Domain      Wirei(DC)                Wirei(DNF)                 Wirei(DL)
            err(%)  mDC   lDC        err(%)   mDNF   lDNF       err(%)    mDL    lDL
Balance     22.24   6.2   13.7       /        /      /          23.01     4.2    13.5
Breast-W     4.08   5.4   22.8       13.80    1.6    2.9         6.90     1.7     6.7
Echo        28.57   2.0    3.9       36.42    1.3    2.3        30.00     0.7     1.6
Glass       53.91   1.3    1.8       /        /      /          38.69     6.0    19.2
Glass2      21.10   6.6   18.2       23.52    5.7    15.4       22.35    11.2    27.1
Heart-St    22.96   3.9   11.7       34.44    2.0     6.6       23.53     9.4    22.6
Heart-C     24.87   4.4   14.3       23.87    2.0     5.7       21.93     2.8     7.7
Heart-H     20.67   5.2   13.8       24.00    1.3     5.1       23.00     1.2     4.5
Hepatitis   21.76   3.9   10.1       24.70    3.2     6.5       18.82     1.7     4.2
Horse-Co    22.31   9.5   27.0       △13.68   △1.0    △2.0      △13.68    △1.0    △2.0
Iris         5.33   1.9    4.6       /        /      /           2.67     2.0     4.8
Labor       15.00   4.1    9.5       43.33    1.2     1.9       16.67     3.1     5.6
Lung        42.50   1.3    3.8       /        /      /          47.50     2.0     5.9
LED10      †26.95  10.1   14.4       /        /      /          58.26     4.1     9.1
LEDeven     19.91   5.7   10.3       42.50    1.9     3.9      †,△12.08   △2.0    △6.0
Monk1       15.00   4.1    9.5       35.44    2.8     3.9        4.11     9.4    18.3
Monk2       24.26   8.3   35.5       48.69    2.4     5.3       21.97    10.8    44.4
Monk3        3.93   4.3    6.4       44.11    4.0     4.3       △3.57     △2.0    △2.0
Pima        28.52   3.3    7.1       38.44    0.7     1.8       25.71     2.4     6.5
Vote0        7.70   2.9    4.4        7.73    1.4     1.7       △5.68     △1.0    △1.0
Vote1        9.95   4.5   10.8       18.86    2.4     3.2      △10.23     △1.0    △2.0
XD6         20.34   9.4   18.4      †,△9.84  †,△3.0  †,△9.0    †,△9.84   †,△3.0  †,△9.0

Others column (best results we know of, obtained under a similar setting; entries as printed, in reading order): 32.1 b | 4.0 b | 32.335.4 c | 41.5032.8 c | 20.3 b | 21.5 b | 22.552.0 c | 21.860.3 c | 19.234.0 c | 14.92 b | 4.93.5 a | 32.8913.0 a | 25.9 b | 4.349.6 c | 12.798.9 a | 22.0614.8 a.

†: near-optimal or optimal results. △: the same classifier is produced at each fold. m_F: average number of monomials (see text). l_F: average number of literals (see text). References: a [BN92], best reported results on DT induction (the small number indicates the least number of leaves, equivalent to a number of monomials). b [FW98, Qui96], C4.5's error. c [Dom98, CB91], various improved CN2 errors (small numbers indicate the whole number of literals).
[NJ99] R. Nock and P. Jappy. A top-down and prune induction scheme for constrained decision committees. In Proc. of the 3rd International Symposium on Intelligent Data Analysis, 1999. Accepted.
[Qui96] J.R. Quinlan. Bagging, boosting and C4.5. In Proc. of AAAI-96, pages 725–730, 1996.
[SS98] R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual ACM Conference on Computational Learning Theory, pages 80–91, 1998.
Heuristic Measures of Interestingness
Robert J. Hilderman and Howard J. Hamilton
Department of Computer Science, University of Regina
Regina, Saskatchewan, Canada S4S 0A2
{hilder,hamilton}@cs.uregina.ca
Abstract. The tuples in a generalized relation (i.e., a summary generated from a database) are unique, and therefore, can be considered to be a population with a structure that can be described by some probability distribution. In this paper, we present and empirically compare sixteen heuristic measures that evaluate the structure of a summary to assign a single real-valued index that represents its interestingness relative to other summaries generated from the same database. The heuristics are based upon well-known measures of diversity, dispersion, dominance, and inequality used in several areas of the physical, social, ecological, management, information, and computer sciences. Their use for ranking summaries generated from databases is a new application area. All sixteen heuristics rank less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as most interesting. We demonstrate that for sample data sets, the order in which some of the measures rank summaries is highly correlated.
1 Introduction
Techniques for determining the interestingness of discovered knowledge have previously received some attention in the literature. For example, in [5], a measure is proposed that determines the interestingness (called surprise there) of discovered knowledge via the explicit detection of Simpson’s paradox. Also, in [22], information-theoretic measures for evaluating the importance of attributes are described. And in previous work, we proposed and evaluated four heuristics, based upon measures from information theory and statistics, for ranking the interestingness of summaries generated from databases [8,9]. Ranking summaries generated from databases is useful in the context of descriptive data mining tasks where a single data set can be generalized in many different ways and to many levels of granularity. Our approach to generating summaries is based upon a data structure called a domain generalization graph (DGG) [7,10]. A DGG for an attribute is a directed graph where each node represents a domain of values created by partitioning the original domain for the attribute, and each edge represents a generalization relation between these domains. Given a set of DGGs corresponding to a set of attributes, a generalization space can be defined as all possible combinations of domains, where one
domain is selected from each DGG for each combination. This generalization space describes, then, all possible summaries consistent with the DGGs that can be generated from the selected attributes. When the number of attributes to be generalized is large or the DGGs associated with the attributes are complex, the generalization space can be very large, resulting in the generation of many summaries. If the user must manually evaluate each summary to determine whether it contains an interesting result, inefficiency results. Thus, techniques are needed to assist the user in identifying the most interesting summaries. In this paper, we introduce and evaluate twelve new heuristics based upon measures from economics, ecology, and information theory, in addition to the four previously mentioned in [8] and [9], and present additional experimental results describing the behaviour of these heuristics when used to rank the interestingness of summaries. Together, we refer to these sixteen measures as the HMI set (i.e., heuristic measures of interestingness). Although our measures were developed and utilized for ranking the interestingness of generalized relations using DGGs, they are more generally applicable to other problem domains. For example, alternative methods could be used to guide the generation of summaries, such as Galois lattices [6], conceptual graphs [3], or formal concept analysis [19]. Also, summaries could more generally include views generated from databases or summary tables generated from data cubes. However, we do not dwell here on the methods or technical aspects of deriving summaries, views, or summary tables. Instead, we simply refer collectively to these objects as summaries, and assume that some collection of them is available for ranking. The heuristics in the HMI set were chosen for evaluation because they are well-known measures of diversity, dispersion, dominance, and inequality that have previously been successfully applied in several areas of the physical, social, ecological, management, information, and computer sciences. They share three important properties. First, each heuristic depends only on the probability distribution of the data to which it is being applied. Second, each heuristic allows a value to be generated with at most one pass through the data. And third, each heuristic is independent of any specific units of measure. Since the tuples in a summary are unique, they can be considered to be a population with a structure that can be described by some probability distribution. Thus, utilizing the heuristics in the HMI set for ranking the interestingness of summaries generated from databases is a natural and useful extension into a new application domain.
2 The HMI Set
A number of variables will be used in describing the HMI set, which we define as follows. Let m be the total number of tuples in a summary. Let n_i be the value contained in the Count attribute for tuple t_i (all summaries contain a derived attribute called Count; see [8] or [9] for more details). Let N = \sum_{i=1}^{m} n_i be the total count. Let p be the actual probability distribution of the tuples based upon the values n_i. Let p_i = n_i / N be the actual probability for tuple t_i. Let q be a
uniform probability distribution of the tuples. Let \bar{u} = N/m be the count for tuple t_i, i = 1, 2, ..., m, according to the uniform distribution q. Let \bar{q} = 1/m be the probability for tuple t_i, for all i = 1, 2, ..., m, according to the uniform distribution q. Let r be the probability distribution obtained by combining the values n_i and \bar{u}. Let r_i = (n_i + \bar{u})/2N be the probability for tuple t_i, for all i = 1, 2, ..., m, according to the distribution r. So, given the sample summary shown in Table 1, for example, we have m = 4, n_1 = 3, n_2 = 1, n_3 = 1, n_4 = 2, N = 7, p_1 = 0.429, p_2 = 0.143, p_3 = 0.143, p_4 = 0.286, \bar{u} = 1.75, \bar{q} = 0.25, r_1 = 0.339, r_2 = 0.196, r_3 = 0.196, and r_4 = 0.268.

Table 1. A sample summary

Tuple ID   Colour   Shape    Count
t1         red      round    3
t2         red      square   1
t3         blue     square   1
t4         green    round    2
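For illustration, a short Python sketch deriving these quantities from the Count column of Table 1 (variable names are ours):

```python
counts = [3, 1, 1, 2]                       # Count column of Table 1
m = len(counts)                             # 4 tuples
N = sum(counts)                             # 7
p = [n / N for n in counts]                 # actual distribution
u_bar = N / m                               # 1.75
q_bar = 1 / m                               # 0.25
r = [(n + u_bar) / (2 * N) for n in counts]

print([round(x, 3) for x in p])             # [0.429, 0.143, 0.143, 0.286]
print(u_bar, q_bar)                         # 1.75 0.25
print([round(x, 3) for x in r])             # [0.339, 0.196, 0.196, 0.268]
```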
We now describe the sixteen heuristics in the HMI set. Examples showing the calculation of each heuristic are not provided due to space limitations.

I_Variance. Based upon sample variance from classical statistics [15], I_Variance measures the weighted average of the squared deviations of the probabilities p_i from the mean probability \bar{q}, where the weight assigned to each squared deviation is 1/(m − 1).

    I_{Variance} = \frac{\sum_{i=1}^{m} (p_i - \bar{q})^2}{m - 1}

I_Simpson. A variance-like measure based upon the Simpson index [18], I_Simpson measures the extent to which the counts are distributed over the tuples in a summary, rather than being concentrated in any single one of them.

    I_{Simpson} = \sum_{i=1}^{m} p_i^2

I_Shannon. Based upon a relative entropy measure from information theory (known as the Shannon index) [17], I_Shannon measures the average information content in the tuples of a summary.

    I_{Shannon} = -\sum_{i=1}^{m} p_i \log_2 p_i

I_Total. Based upon the Shannon index from information theory [23], I_Total measures the total information content in a summary.

    I_{Total} = m \cdot I_{Shannon}

I_Max. Based upon the Shannon index from information theory [23], I_Max measures the maximum possible information content in a summary.

    I_{Max} = \log_2 m
I_McIntosh. Based upon a heterogeneity index from ecology [14], I_McIntosh views the counts in a summary as the coordinates of a point in a multidimensional space and measures the modified Euclidean distance from this point to the origin.

    I_{McIntosh} = \frac{N - \sqrt{\sum_{i=1}^{m} n_i^2}}{N - \sqrt{N}}

I_Lorenz. Based upon the Lorenz curve from statistics, economics, and social science [20], I_Lorenz measures the average value of the Lorenz curve derived from the probabilities p_i associated with the tuples in a summary. The Lorenz curve is a series of straight lines in a square of unit length, starting from the origin and going successively to the points (p_1, q_1), (p_1 + p_2, q_1 + q_2), .... When the p_i's are all equal, the Lorenz curve coincides with the diagonal that cuts the unit square into equal halves. When the p_i's are not all equal, the Lorenz curve is below the diagonal.

    I_{Lorenz} = \bar{q} \sum_{i=1}^{m} (m - i + 1) p_i

I_Gini. Based upon the Gini coefficient [20], which is defined in terms of the Lorenz curve, I_Gini measures the ratio of the area between the diagonal (i.e., the line of equality) and the Lorenz curve, and the total area below the diagonal.

    I_{Gini} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} |p_i \bar{q} - p_j \bar{q}|}{2 m^2 \bar{q}}

I_Berger. Based upon a dominance index from ecology [2], I_Berger measures the proportional dominance of the tuple in a summary with the highest probability p_i.

    I_{Berger} = \max(p_i)

I_Schutz. Based upon an inequality measure from economics and social science [16], I_Schutz measures the relative mean deviation of the actual distribution of the counts in a summary from a uniform distribution of the counts.

    I_{Schutz} = \frac{\sum_{i=1}^{m} |p_i - \bar{q}|}{2 m \bar{q}}

I_Bray. Based upon a community similarity index from ecology [4], I_Bray measures the percentage of similarity between the actual distribution of the counts in a summary and a uniform distribution of the counts.

    I_{Bray} = \frac{\sum_{i=1}^{m} \min(n_i, \bar{u})}{N}

I_Whittaker. Based upon a community similarity index from ecology [21], I_Whittaker measures the percentage of similarity between the actual distribution of the counts in a summary and a uniform distribution of the counts.

    I_{Whittaker} = 1 - 0.5 \sum_{i=1}^{m} |p_i - \bar{q}|
I_Kullback. Based upon a distance measure from information theory [11], I_Kullback measures the distance between the actual distribution of the counts in a summary and a uniform distribution of the counts.

    I_{Kullback} = \log_2 m - \sum_{i=1}^{m} p_i \log_2 \frac{p_i}{\bar{q}}

I_MacArthur. Based upon the Shannon index from information theory [13], I_MacArthur combines two summaries, and then measures the difference between the amount of information contained in the combined distribution and the amount contained in the average of the two original distributions.

    I_{MacArthur} = \left( -\sum_{i=1}^{m} r_i \log_2 r_i \right) - \frac{\left( -\sum_{i=1}^{m} p_i \log_2 p_i \right) + \log_2 m}{2}

I_Theil. Based upon a distance measure from information theory [20], I_Theil measures the distance between the actual distribution of the counts in a summary and a uniform distribution of the counts.

    I_{Theil} = \frac{\sum_{i=1}^{m} |p_i \log_2 p_i - \bar{q} \log_2 \bar{q}|}{m \bar{q}}

I_Atkinson. Based upon a measure of inequality from economics [1], I_Atkinson measures the percentage to which the population in a summary would have to be increased to achieve the same level of interestingness if the counts in the summary were uniformly distributed.

    I_{Atkinson} = 1 - \prod_{i=1}^{m} \left( \frac{p_i}{\bar{q}} \right)^{\bar{q}}
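A small sketch computing a few of these measures for the sample summary of Table 1 (a subset only; the formulas follow the definitions above):

```python
import math

counts = [3, 1, 1, 2]                               # Table 1 sample summary
m, N = len(counts), sum(counts)
p = [n / N for n in counts]
q_bar, u_bar = 1 / m, N / m

i_variance = sum((pi - q_bar) ** 2 for pi in p) / (m - 1)
i_simpson  = sum(pi ** 2 for pi in p)
i_shannon  = -sum(pi * math.log2(pi) for pi in p)
i_total    = m * i_shannon
i_max      = math.log2(m)
i_mcintosh = (N - math.sqrt(sum(n ** 2 for n in counts))) / (N - math.sqrt(N))
i_berger   = max(p)
i_bray     = sum(min(n, u_bar) for n in counts) / N
i_kullback = math.log2(m) - sum(pi * math.log2(pi / q_bar) for pi in p)

# I_Shannon and I_Kullback coincide here (about 1.842) because q is uniform.
print(round(i_shannon, 3), round(i_kullback, 3))
```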
3 Experimental Results
To generate summaries, a series of seven discovery tasks were run: three on the NSERC Research Awards Database (a database available in the public domain) and four on the Customer Database (a confidential database supplied by an industrial partner). These databases have been frequently used in previous data mining research [8,9,12] and will not be described again here. We present the results of the three NSERC discovery tasks, which we refer to as N-2, N-3, and N-4, where 2, 3, and 4 correspond to the number of attributes selected in each discovery task. Similar results were obtained from the Customer Database. Typical results are shown in Tables 2 through 5, where the 22 summaries generated from the N-2 discovery task are ranked by the various measures. In Tables 2 through 5, the Summary ID column describes a unique summary identifier (for reference purposes), the Non-ANY Attributes column describes the number of non-ANY attributes in the summary (i.e., attributes that have not
Table 2. Ranks assigned by I_Variance, I_Simpson, I_Shannon, and I_Total from N-2 (each line below lists the values for the 22 summaries, in the order given by the Summary ID line).

Summary ID: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Non-ANY Attributes: 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
No. of Tuples: 2 3 4 5 6 9 10 2 4 5 9 9 10 11 16 17 21 21 30 40 50 67
I_Variance score: 0.377595 0.128641 0.208346 0.024569 0.018374 0.017788 0.041606 0.377595 0.208346 0.079693 0.018715 0.050770 0.041606 0.013534 0.010611 0.012575 0.008896 0.011547 0.006470 0.002986 0.002078 0.001582
I_Variance rank: 1.5 5.0 3.5 10.0 12.0 13.0 8.5 1.5 3.5 6.0 11.0 7.0 8.5 14.0 17.0 15.0 18.0 16.0 19.0 20.0 21.0 22.0
I_Simpson score: 0.877595 0.590615 0.875039 0.298277 0.258539 0.253419 0.474451 0.877595 0.875039 0.518772 0.260833 0.517271 0.474451 0.226253 0.221664 0.260017 0.225542 0.278568 0.220962 0.141445 0.121836 0.119351
I_Simpson rank: 1.5 5.0 3.5 10.0 14.0 15.0 8.5 1.5 3.5 6.0 12.0 7.0 8.5 16.0 18.0 13.0 17.0 11.0 19.0 20.0 21.0 22.0
I_Shannon score: 0.348869 0.866330 0.443306 1.846288 2.125994 2.268893 1.419260 0.348869 0.443306 1.215166 2.194598 1.309049 1.419260 2.473949 2.616697 2.288068 2.567410 2.282864 2.710100 3.259974 3.538550 3.679394
I_Shannon rank: 1.5 5.0 3.5 10.0 11.0 13.0 8.5 1.5 3.5 6.0 12.0 7.0 8.5 16.0 18.0 15.0 17.0 14.0 19.0 20.0 21.0 22.0
I_Total score: 0.697738 2.598990 1.773225 9.231440 12.755962 20.420033 14.192604 0.697738 1.773225 6.075830 19.751385 11.781437 14.192604 27.213436 41.867161 38.897160 53.915619 47.940136 81.302986 130.39897 176.92749 246.51939
I_Total rank: 1.5 5.0 3.5 7.0 9.0 13.0 10.5 1.5 3.5 6.0 12.0 8.0 10.5 14.0 16.0 15.0 18.0 17.0 19.0 20.0 21.0 22.0
Table 3. Ranks assigned by I_Max, I_McIntosh, I_Lorenz, and I_Berger from N-2 (same layout as Table 2).

Summary ID: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Non-ANY Attributes: 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
No. of Tuples: 2 3 4 5 6 9 10 2 4 5 9 9 10 11 16 17 21 21 30 40 50 67
I_Max score: 1.000000 1.584963 2.000000 2.321928 2.584963 3.169925 3.321928 1.000000 2.000000 2.321928 3.169925 3.169925 3.321928 3.459432 4.000000 4.087463 4.392317 4.392317 4.906891 5.321928 5.643856 6.066089
I_Max rank: 1.5 3.0 4.5 6.5 8.0 10.0 12.5 1.5 4.5 6.5 10.0 10.0 12.5 14.0 15.0 16.0 17.5 17.5 19.0 20.0 21.0 22.0
I_McIntosh score: 0.063874 0.233956 0.065254 0.458697 0.496780 0.501894 0.314518 0.063874 0.065254 0.282728 0.494505 0.283782 0.314518 0.529937 0.534837 0.495313 0.530693 0.477246 0.535592 0.630569 0.657900 0.661515
I_McIntosh rank: 1.5 5.0 3.5 10.0 14.0 15.0 8.5 1.5 3.5 6.0 12.0 7.0 8.5 16.0 18.0 13.0 17.0 11.0 19.0 20.0 21.0 22.0
I_Lorenz score: 0.532746 0.429060 0.277279 0.402945 0.379616 0.261123 0.165982 0.532746 0.277279 0.283677 0.253015 0.166537 0.165982 0.236883 0.175297 0.142521 0.132651 0.118036 0.100625 0.108058 0.102211 0.083496
I_Lorenz rank: 1.5 3.0 7.5 4.0 5.0 9.0 14.5 1.5 7.5 6.0 10.0 13.0 14.5 11.0 12.0 16.0 17.0 18.0 21.0 19.0 20.0 22.0
I_Berger score: 0.934509 0.712931 0.934509 0.393841 0.393841 0.393841 0.603704 0.934509 0.934509 0.666853 0.365614 0.666853 0.603704 0.365614 0.365614 0.365614 0.365614 0.420841 0.365614 0.234297 0.234297 0.234297
I_Berger rank: 2.5 5.0 2.5 12.0 12.0 12.0 8.5 2.5 2.5 6.5 16.5 6.5 8.5 16.5 16.5 16.5 16.5 10.0 16.5 21.0 21.0 21.0
been generalized to the level of the most general node in the associated DGG that contains the default description “ANY”), the No. of Tuples column describes the number of tuples in the summary, and the Score and Rank columns describe the calculated interestingness and the assigned rank, respectively, as determined by the corresponding measure. Some measures are ranked by score in descending order and some in ascending order (this is easily determined by examining the ranks assigned in Tables 2 through 5). This is done so that each measure ranks the less complex summaries (i.e., those with few tuples and/or few non-ANY attributes) as more interesting. Tables 2 through 5 do not show any single-tuple summaries (e.g., a single-tuple summary where both attributes are generalized to ANY and a single-tuple summary that was an artifact of the DGGs used), as these summaries are considered to contain no information and are, therefore, uninteresting by definition. The summaries in Tables 2 through 5 are shown in increasing order of the number of non-ANY attributes and the number of tuples in each summary, respectively.
Table 4. Ranks assigned by I_Schutz, I_Bray, I_Whittaker, and I_Kullback from N-2 (same layout as Table 2).

Summary ID: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Non-ANY Attributes: 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
No. of Tuples: 2 3 4 5 6 9 10 2 4 5 9 9 10 11 16 17 21 21 30 40 50 67
I_Schutz score: 0.434509 0.379598 0.684509 0.310744 0.294042 0.466300 0.734509 0.434509 0.684509 0.534397 0.516940 0.712175 0.734509 0.486637 0.600273 0.699103 0.696302 0.743921 0.723102 0.734397 0.734397 0.742610
I_Schutz rank: 4.5 3.0 11.5 2.0 1.0 6.0 19.5 4.5 11.5 9.0 8.0 15.0 19.5 7.0 10.0 14.0 13.0 22.0 16.0 17.5 17.5 21.0
I_Bray score: 0.565491 0.620402 0.315491 0.689256 0.705958 0.533700 0.265491 0.565491 0.315491 0.465603 0.483060 0.287825 0.265491 0.513363 0.399727 0.300897 0.303698 0.256079 0.276898 0.265603 0.265603 0.257390
I_Bray rank: 4.5 3.0 11.5 2.0 1.0 6.0 19.5 4.5 11.5 9.0 8.0 15.0 19.5 7.0 10.0 14.0 13.0 22.0 16.0 17.5 17.5 21.0
I_Whittaker score: 0.565491 0.620402 0.315491 0.689256 0.705958 0.533700 0.265491 0.565491 0.315491 0.465603 0.483060 0.287825 0.265491 0.513363 0.399727 0.300897 0.303698 0.256079 0.276898 0.265603 0.265603 0.257390
I_Whittaker rank: 4.5 3.0 11.5 2.0 1.0 6.0 19.5 4.5 11.5 9.0 8.0 15.0 19.5 7.0 10.0 14.0 13.0 22.0 16.0 17.5 17.5 21.0
I_Kullback score: 0.348869 0.866330 0.443306 1.846288 2.125994 2.268893 1.419260 0.348869 0.443306 1.215166 2.194598 1.309049 1.419260 2.473949 2.616697 2.288068 2.567410 2.282864 2.710100 3.259974 3.538550 3.679394
I_Kullback rank: 1.5 5.0 3.5 10.0 11.0 13.0 8.5 1.5 3.5 6.0 12.0 7.0 8.5 16.0 18.0 15.0 17.0 14.0 19.0 20.0 21.0 22.0
Table 5. Ranks assigned by I_MacArthur, I_Theil, I_Atkinson, and I_Gini from N-2 (same layout as Table 2).

Summary ID: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Non-ANY Attributes: 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
No. of Tuples: 2 3 4 5 6 9 10 2 4 5 9 9 10 11 16 17 21 21 30 40 50 67
I_MacArthur score: 0.184731 0.218074 0.399511 0.144729 0.132377 0.243857 0.457814 0.184731 0.399511 0.298402 0.264620 0.452998 0.457814 0.260255 0.342143 0.441534 0.440642 0.487441 0.494412 0.479347 0.482560 0.515363
I_MacArthur rank: 3.5 5.0 11.5 2.0 1.0 6.0 16.5 3.5 11.5 9.0 8.0 15.0 16.5 7.0 10.0 14.0 13.0 20.0 21.0 18.0 19.0 22.0
I_Theil score: 0.651131 0.718633 1.556694 0.757153 0.777902 1.710559 2.508888 0.651131 1.556694 1.195810 1.898130 2.249471 2.508888 2.025527 2.939297 3.512838 3.890191 3.982314 4.485426 5.317662 5.751495 6.181546
I_Theil rank: 1.5 3.0 7.5 4.0 5.0 9.0 13.5 1.5 7.5 6.0 10.0 12.0 13.5 11.0 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0
I_Atkinson score: 0.505218 0.914901 0.792127 0.759314 0.693136 0.765973 0.821439 0.505218 0.792127 0.859044 0.759162 0.884562 0.821439 0.727091 0.797472 0.860465 0.852812 0.862917 0.894697 0.854864 0.854329 0.885877
I_Atkinson rank: 1.5 22.0 8.5 6.0 3.0 7.0 11.5 1.5 8.5 16.0 5.0 19.0 11.5 4.0 10.0 17.0 13.0 18.0 21.0 15.0 14.0 20.0
I_Gini score: 0.217254 0.158404 0.173861 0.078822 0.067906 0.065429 0.076804 0.217254 0.173861 0.126529 0.067231 0.086449 0.076804 0.056104 0.044494 0.045517 0.037253 0.038645 0.027736 0.020222 0.016312 0.012656
I_Gini rank: 1.5 5.0 3.5 8.0 11.0 13.0 9.5 1.5 3.5 6.0 12.0 7.0 9.5 14.0 16.0 15.0 18.0 17.0 19.0 20.0 21.0 22.0
Tables 2 through 5 show similarities in how some of the sixteen measures rank summaries. For example, the six most interesting summaries (i.e., 1, 2, 3, 8, 9, and 10) are ranked identically by I_Variance, I_Simpson, I_Shannon, I_Total, I_McIntosh, and I_Kullback, while the four least interesting summaries (i.e., 19, 20, 21, and 22) are ranked identically by I_Variance, I_Simpson, I_Shannon, I_Total, I_Max, I_McIntosh, I_Kullback, I_Theil, and I_Gini. To quantify the extent of the ranking similarities between the sixteen measures across all seven discovery tasks, we calculated the Gamma correlation coefficient for each pair of measures and found that 86.4% of the coefficients are highly significant with a p-value below 0.005. We also found that the ranks assigned to the summaries have a high positive correlation for some pairs of measures. For the purpose of this discussion, we considered a pair of measures to be highly correlated when the average coefficient is greater than 0.85. Thus, 35% of the pairs (i.e., 42 of 120 pairs) are highly correlated using the 0.85 threshold. Following careful examination of the 42 highly correlated pairs, we found two distinct groups of measures within which summaries are ranked similarly. One group consists of the measures I_Variance, I_Simpson, I_Shannon, I_Total, I_Max, I_McIntosh, I_Berger, I_Kullback, and I_Gini. The other group consists of the measures I_Schutz, I_Bray, I_Whittaker, and I_MacArthur. There are no similarities (i.e., no high positive correlations) shared between the two groups. Of the remaining three measures, I_Theil, I_Lorenz, and I_Atkinson, I_Theil is only highly correlated with I_Max, while I_Lorenz and I_Atkinson are not highly correlated with any of the other measures. There were no highly negative correlations between any of the pairs of measures.
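For illustration, a sketch of such a rank comparison using the I_Variance and I_Simpson rank columns of Table 2; we assume here that the Gamma coefficient used is the Goodman–Kruskal gamma:

```python
def gamma(ranks_a, ranks_b):
    """Goodman-Kruskal gamma: (concordant - discordant) / (concordant + discordant),
    ignoring pairs that are tied in either ranking."""
    conc = disc = 0
    n = len(ranks_a)
    for i in range(n):
        for j in range(i + 1, n):
            da = ranks_a[i] - ranks_a[j]
            db = ranks_b[i] - ranks_b[j]
            if da * db > 0:
                conc += 1
            elif da * db < 0:
                disc += 1
    return (conc - disc) / (conc + disc)

# I_Variance and I_Simpson ranks for the 22 summaries of Table 2.
variance_ranks = [1.5, 5.0, 3.5, 10.0, 12.0, 13.0, 8.5, 1.5, 3.5, 6.0, 11.0,
                  7.0, 8.5, 14.0, 17.0, 15.0, 18.0, 16.0, 19.0, 20.0, 21.0, 22.0]
simpson_ranks  = [1.5, 5.0, 3.5, 10.0, 14.0, 15.0, 8.5, 1.5, 3.5, 6.0, 12.0,
                  7.0, 8.5, 16.0, 18.0, 13.0, 17.0, 11.0, 19.0, 20.0, 21.0, 22.0]
print(round(gamma(variance_ranks, simpson_ranks), 3))   # high positive value
```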
One way to analyze the measures is to determine the complexity of summaries considered to be of high, moderate, and low interest (i.e., the relative interestingness). These results are shown in Table 6. In Table 6, the values in the H, M, and L columns describe the complexity index for a group of summaries considered to be of high, moderate, and low interest, respectively. The complexity index for a group of summaries is defined as the product of the average number of tuples and the average number of non-ANY attributes contained in the group of summaries. For example, the complexity index for summaries determined to be of high interest by the I_Variance index for discovery task N-2 is 4.5 (i.e., 3 × 1.5, where 3 and 1.5 are the average number of tuples and average number of non-ANY attributes, respectively). High, moderate, and low interest summaries were considered to be the top, middle, and bottom 20%, respectively, of summaries. The N-2, N-3, and N-4 discovery tasks generated sets containing 22, 70, and 214 summaries, respectively. Thus, the complexity index of the summaries from the N-2, N-3, and N-4 discovery tasks is based upon the averages for four, 14, and 43 summaries, respectively.

Table 6. Relative interestingness of summaries from the NSERC discovery tasks
Interestingness      N-2                    N-3                     N-4
Measure           H     M      L       H      M       L        H       M        L
I_Variance       4.5  11.3    93.6    9.0    64.7   520.3     34.6   430.5   3212.9
I_Simpson        4.5  20.3    93.6    9.0    72.9   477.4     38.0   447.8   3163.1
I_Shannon        4.5  11.3    93.6    9.0    72.9   520.3     29.8   430.2   3210.2
I_Total          4.5  13.2    93.6    8.1    65.8   545.5     27.2   423.6   3220.5
I_Max            3.6  14.0    93.6    8.3    63.7   545.5     27.0   424.2   3221.6
I_McIntosh       4.5  20.3    93.6    9.0    72.9   477.4     38.0   447.8   3163.1
I_Lorenz         3.9  20.3    93.6   21.1   104.8   249.3    133.6  1373.9    482.6
I_Berger         4.5  15.8    93.6    9.6    86.6   457.5     48.8   587.8   2807.2
I_Schutz         4.0  13.1    48.6   23.4   367.9   146.7    289.8  1242.2    227.0
I_Bray           4.0  13.1    48.6   23.4   367.9   146.7    289.8  1242.2    227.0
I_Whittaker      4.0  13.1    48.6   23.4   367.9   146.7    289.8  1242.2    227.0
I_Kullback       4.5  11.3    93.6    9.0    72.9   520.3     29.8   430.2   3210.2
I_MacArthur      4.9  13.1    84.0   23.2   251.4   220.8    249.5  1210.3    233.2
I_Theil          3.9  17.1    93.6    9.1    66.2   533.3     33.8   558.9   2668.4
I_Atkinson       8.0  18.0    49.1   31.5   270.5   103.7    531.1   555.6   1611.1
I_Gini           4.5  13.2    93.6    9.0    60.5   537.7     27.9   425.1   3220.5

Table 6 shows that in most cases the complexity index is lowest for the most interesting summaries and highest for the least interesting summaries. For example, the complexity indexes for summaries determined by the I_Variance index to be of high, moderate, and low interest are 4.5, 11.3, and 93.6 from N-2, respectively, 9.0, 64.7, and 520.3 from N-3, respectively, and 34.6, 430.5, and 3212.9 from N-4, respectively. The only exceptions occurred in the results for the I_Lorenz, I_Schutz, I_Bray, I_Whittaker, I_MacArthur, and I_Atkinson indexes from the N-3 and N-4 discovery tasks.
A comparison of the summaries with high relative interestingness from the N-2, N-3, and N-4 discovery tasks is shown in the graph of Figure 1. In Figure 1, the horizontal and vertical axes describe the measures and the complexity indexes, respectively. Horizontal rows of bars correspond to the complexity indexes of summaries from a particular discovery task. The back-most horizontal row of bars corresponds to the average complexity index for a particular measure. Figure 1 shows a maximum complexity index on the vertical axis of 60.0 (although the complexity indexes for I_Lorenz, I_Schutz, I_Bray, I_Whittaker, I_MacArthur, and I_Atkinson from the N-4 discovery task each exceed this value by a minimum of 189.5). The measures, listed in ascending order of the complexity index, are (position in parentheses): I_Max (1), I_Total (2), I_Gini (3), I_Shannon and I_Kullback (4), I_Theil (5), I_Variance (6), I_Simpson and I_McIntosh (7), I_Berger (8), I_Lorenz (9), I_MacArthur (10), I_Schutz, I_Bray, and I_Whittaker (11), and I_Atkinson (12).
[Figure 1: bar chart. Vertical axis: Complexity Index (0.0 to 60.0); horizontal axis: Interestingness Measures (the sixteen HMI measures); rows of bars for N-2, N-3, N-4, and Average.]
Fig. 1. Relative complexity of summaries from the NSERC discovery tasks
4 Conclusion and Future Research
We described the HMI set of heuristics for ranking the interestingness of summaries generated from databases. Although the heuristics have previously been applied in several areas of the physical, social, ecological, management, information, and computer sciences, their use for ranking summaries generated from databases is a new application area. The preliminary results presented here show that the order in which some of the measures rank summaries is highly correlated, resulting in two distinct groups of measures in which summaries are ranked similarly. Highly ranked, concise summaries provide a reasonable starting point for further analysis of discovered knowledge. That is, other highly ranked summaries that are nearby in the generalization space will probably contain information at useful and appropriate levels of detail. Future research will focus on determining the specific response of each measure to different population structures.
References
1. A.B. Atkinson. On the measurement of inequality. Journal of Economic Theory, 2:244–263, 1970.
2. W.H. Berger and F.L. Parker. Diversity of planktonic foraminifera in deep-sea sediments. Science, 168:1345–1347, 1970.
3. I. Bournaud and J.-G. Ganascia. Accounting for domain knowledge in the construction of a generalization space. In Proceedings of the Third International Conference on Conceptual Structures, pages 446–459. Springer-Verlag, August 1997.
4. J.R. Bray and J.T. Curtis. An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 27:325–349, 1957.
5. A.A. Freitas. On objective measures of rule surprisingness. In J. Zytkow and M. Quafafou, editors, Proceedings of the Second European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'98), pages 1–9, Nantes, France, September 1998.
6. R. Godin, R. Missaoui, and H. Alaoui. Incremental concept formation algorithms based on Galois (concept) lattices. Computational Intelligence, 11(2):246–267, 1995.
7. H.J. Hamilton, R.J. Hilderman, L. Li, and D.J. Randall. Generalization lattices. In J. Zytkow and M. Quafafou, editors, Proceedings of the Second European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'98), pages 328–336, Nantes, France, September 1998.
8. R.J. Hilderman and H.J. Hamilton. Heuristics for ranking the interestingness of discovered knowledge. In N. Zhong and L. Zhou, editors, Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99), pages 204–209, Beijing, China, April 1999.
9. R.J. Hilderman, H.J. Hamilton, and B. Barber. Ranking the interestingness of summaries from data mining systems. In Proceedings of the 12th International Florida Artificial Intelligence Research Symposium (FLAIRS'99), pages 100–106, Orlando, Florida, May 1999.
10. R.J. Hilderman, H.J. Hamilton, R.J. Kowalchuk, and N. Cercone. Parallel knowledge discovery using domain generalization graphs. In J. Komorowski and J. Zytkow, editors, Proceedings of the First European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'97), pages 25–35, Trondheim, Norway, June 1997.
11. S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
12. H. Liu, H. Lu, and J. Yao. Identifying relevant databases for multidatabase mining. In X. Wu, R. Kotagiri, and K. Korb, editors, Proceedings of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'98), pages 210–221, Melbourne, Australia, April 1998.
13. R.H. MacArthur. Patterns of species diversity. Biological Review, 40:510–533, 1965.
14. R.P. McIntosh. An index of diversity and the relation of certain concepts to diversity. Ecology, 48(3):392–404, 1967.
15. W.A. Rosenkrantz. Introduction to Probability and Statistics for Scientists and Engineers. McGraw-Hill, 1997.
16. R.R. Schutz. On the measurement of income inequality. American Economic Review, 41:107–122, March 1951.
17. C.E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, 1949.
18. E.H. Simpson. Measurement of diversity. Nature, 163:688, 1949.
19. G. Stumme, R. Wille, and U. Wille. Conceptual knowledge discovery in databases using formal concept analysis methods. In J. Zytkow and M. Quafafou, editors, Proceedings of the Second European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'98), pages 450–458, Nantes, France, September 1998.
20. H. Theil. Economics and Information Theory. Rand McNally, 1970.
21. R.H. Whittaker. Evolution and measurement of species diversity. Taxon, 21(2/3):213–251, May 1972.
22. Y.Y. Yao, S.K.M. Wong, and C.J. Butz. On information-theoretic measures of attribute importance. In N. Zhong and L. Zhou, editors, Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99), pages 133–137, Beijing, China, April 1999.
23. J.F. Young. Information Theory. John Wiley & Sons, 1971.
Enhancing Rule Interestingness for Neuro-fuzzy Systems
Thomas Wittmann, Johannes Ruhland, and Matthias Eichholz
Lehrstuhl für Wirtschaftsinformatik, Friedrich Schiller Universität Jena
Carl-Zeiß-Str. 3, 07743 Jena, Germany
[email protected], [email protected], [email protected]

Abstract. Data Mining Algorithms extract patterns from large amounts of data. But these patterns will yield knowledge only if they are interesting, i.e. valid, new, potentially useful, and understandable. Unfortunately, during pattern search most Data Mining Algorithms focus on validity only, which also holds true for Neuro-Fuzzy Systems. In this paper we introduce a method to enhance the interestingness of a rule base as a whole. In the first step, we aggregate the rule base through amalgamation of adjacent rules and elimination of redundant attributes. Supplementing this rather technical approach, we next sort rules with regard to their performance, as measured by their evidence. Finally, we compute reduced evidences, which penalize rules that are very similar to rules with a higher evidence. Rules sorted on reduced evidence are fed into an integrated rulebrowser, to allow for manual rule selection according to personal and situation-dependent preference. This method was applied successfully to two real-life classification problems, the target group selection for a retail bank, and fault diagnosis for a large car manufacturer. Explicit reference is taken to the NEFCLASS algorithm, but the procedure is easily generalized to other systems.
1 Introduction
Data Mining Algorithms extract patterns from large amounts of data. But patterns are only interesting if they are valid, new, potentially useful, and understandable. Unfortunately, most Data Mining Algorithms only refer to validity in their search for patterns. This also holds for Neuro-Fuzzy Systems, a promising new development for classification learning. Empirical studies (e.g. [11]) have proven their ability to combine automatic learning, as attributed to neural networks, with classification quality comparable to other data mining methods. But when it comes to analyzing real-life problems, the main advantage over pure neural networks, the ease of understandability and interpretation, remains a much-acclaimed desire rather than a proven fact. Hence, pattern post-processing is necessary to identify interesting rules in the rule base output of a Neuro-Fuzzy System. This can also be called "data mining of second order" [6]. Different methods can be used to enhance the interestingness of rules
according to the four criteria mentioned above. The aim is to concentrate on the most valid, new and useful rules and thus aggregate the rule base. Finally, a lean rule base is easier to understand. In this paper we refer especially to the NEFCLASS algorithm, but the procedure we develop can be generalized easily. In the first step, we aggregate the rule base with respect to adjacent rules and redundant attributes; in the second step we order the rules with regard to their performance. These sorted rules are fed into an integrated rulebrowser, which gives the user the opportunity to select rules according to his/her interests. As a first test, we successfully applied this method to two real-life classification problems, the target group selection for a bank and fault diagnosis for a large car manufacturer.
2 The Neuro-fuzzy System NEFCLASS
In NEFCLASS (NEuroFuzzyCLASSification) a fuzzy system is mapped onto a neural network, a feed-forward three-layered multilayer perceptron. The (crisp) input pattern is presented to the neurons of the first layer. The fuzzification takes place when the input signals are propagated to the hidden layer, because the weights of the connections are modeled as membership functions of linguistic terms. The neurons of the hidden layer represent the rules. They are fully connected to the input layer, with connection weights being interpretable as fuzzy sets. A hidden neuron's response is connected to one output neuron only. With output neurons being associated with an output class on a 1:1 basis, each hidden neuron serves as a classification detector for exactly one output class. Hidden-to-output layer connections do not carry connection weights, reflecting the lack of rule confidences in the approach. The learning phase is divided into two steps. In the first step the system learns the number of rule units and their connections, i.e. the rules and their antecedents, and in the second step it learns the optimal fuzzy sets, that is, their membership functions [9]. NEFCLASS is available as freeware from the Institute of Knowledge Processing and Language Engineering, University of Magdeburg, Germany, http://fuzzy.cs.uni-magdeburg.de/welcome.html.
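As an illustration of this kind of architecture, a minimal sketch of fuzzy rule evaluation in such a network; the triangular membership functions and the min/max aggregation are simplifying assumptions rather than NEFCLASS's exact operators, and all names and data are hypothetical:

```python
def triangular(x, left, peak, right):
    """Membership of x in a triangular fuzzy set (an assumed shape)."""
    if x == peak:
        return 1.0
    if x < peak:
        return max(0.0, (x - left) / (peak - left)) if peak > left else 0.0
    return max(0.0, (right - x) / (right - peak)) if right > peak else 0.0

def classify(pattern, rules, fuzzy_sets):
    """Rule activations as the minimum of the antecedent memberships; each output
    class collects its rules' activations (maximum aggregation assumed)."""
    scores = {}
    for antecedent, class_label in rules:          # antecedent: {attribute: term}
        activation = min(triangular(pattern[a], *fuzzy_sets[a][term])
                         for a, term in antecedent.items())
        scores[class_label] = max(scores.get(class_label, 0.0), activation)
    return max(scores, key=scores.get)

# Hypothetical one-attribute example with three linguistic terms.
fuzzy_sets = {"x1": {"small": (0.0, 0.0, 0.5), "medium": (0.0, 0.5, 1.0),
                     "large": (0.5, 1.0, 1.0)}}
rules = [({"x1": "small"}, "c1"), ({"x1": "large"}, "c2")]
print(classify({"x1": 0.8}, rules, fuzzy_sets))    # -> c2
```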
3 Rule Aggregation
Next, we describe a prototype for rule post-processing, called Rulebrowser. In the first step, Rulebrowser aggregates the rule base with respect to redundant rules and attributes. It joins adjacent rules, i.e. those containing the same conclusion while being based on adjacent fuzzy sets, and eliminates attributes that become irrelevant afterwards (i.e., when all fuzzy sets of a variable have been joined). Figure 1 shows the basic idea for a simple case with two attributes (each containing three fuzzy sets) and nine rules classifying the cases into two classes (class 1: c1 and class 2: c2). In Figure 1 we are able to join the rules r1 to r3, thus making attribute x1 superfluous for this rule ra. Furthermore, e.g. r5 and r8 can be joined, but this will not allow us to eliminate any attribute.
[Figure 1: two 3×3 grids over the attributes x1 (small, medium, large) and x2 (small, medium, large). The left grid shows the original nine rules r1–r9 with their classes (r1, r2, r3, r4, r9: c1; r5, r6, r7, r8: c2); the right grid shows the aggregated rule base, in which r1–r3 are merged into ra (class c1, attribute x1 eliminated) and r5 and r8 are merged into rb (class c2), while r4, r6, r7, and r9 remain.]
Fig. 1. Rule Aggregation Example
As seen from Figure 1, joining rules in one dimension can inhibit rule aggregation along another attribute. Hence, when confronted with a high-dimensional rule space, rule aggregation is no trivial task, but a complex search problem [7]. Different algorithms can be used to search for suitable candidate rules for aggregation. Rulebrowser relies on a similarity measure, to be defined below. For each rule we determine the most similar rule and check the position of both rules in rule space. If we can join them, the first rule's premise is extended to the range of the second rule, and the second rule is eliminated. If the premises cannot be joined, we examine the next most similar rules, until we find two adjacent rules or all rules have been examined. An attribute becomes superfluous if, after rule aggregation, the antecedent containing this attribute covers all possible terms of this attribute, e.g. ra: 'x1 = small or medium or large' in Figure 1. In pseudo-code notation this is:

program Rule Aggregation
  determine similarities between rules
  for all classes do begin
    for all rules in a class do begin
      eliminate superfluous attributes in rules
      repeat
        for all rules in a class in order of similarity to the rule in focus
          if both rules can be joined then
            extend first rule's premise
            eliminate second rule
      until adjacent rule found
    end
  end
sim r1,r2 r1, r2 = n= disti =
 =1-
n i =1
dist i
n
.
compared rules, total number of attributes in r1 and r2, distance measure for attribute i: Ïx ,x if attribute i exists in both rules dist i = Ì r1 r 2 , else Ó1
(1)
Enhancing Rule Interestingness for Neuro-fuzzy Systems
245
|xr ,xr | = distance between the fuzzy sets in r1 and r2, measured by their rank, i.e. for an attribute with 5 fuzzy sets {very small, small, medium, large, very large} |large, very small| = 4-1 = 3. For compound fuzzy sets like ‘low or medium’ the mean of the ranks is used for computation. 1
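A small sketch of formula (1); attribute and term names are hypothetical, and we follow the formula literally, without any additional normalization of the rank distance:

```python
def rule_similarity(rule1, rule2, rank):
    """Similarity of two rules following formula (1): one minus the mean attribute
    distance; rank() maps a (possibly compound) fuzzy term to its rank, compound
    terms being averaged as described above."""
    attributes = set(rule1) | set(rule2)
    n = len(attributes)
    total = 0.0
    for a in attributes:
        if a in rule1 and a in rule2:
            total += abs(rank(rule1[a]) - rank(rule2[a]))
        else:
            total += 1.0          # attribute present in only one of the rules
    return 1.0 - total / n

# Hypothetical 5-term attribute; 'or'-compounds are averaged.
ranks = {"very small": 1, "small": 2, "medium": 3, "large": 4, "very large": 5}
rank = lambda term: sum(ranks[t] for t in term.split(" or ")) / len(term.split(" or "))
r1 = {"income": "large", "age": "medium"}
r2 = {"income": "medium", "age": "small or medium"}
print(rule_similarity(r1, r2, rank))   # 0.25
```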
4 Rule Sorting and Browsing
Rule aggregation is a more or less technical approach, aiming only at understandability (a lean rule base) and utility of rules (relevant attributes). In the second step, Rulebrowser sorts the rules with regard to a user-defined performance criterion and gives the user the opportunity to select the rules that comply with his/her interests. According to the user's data mining target, different foci on the same rule base may be realized by such a system. Possible aims of analysis are, for example:
1. In database marketing, the user looks for a specified number of addresses which have a high potential for becoming customers. The goal is a number of rules with a high coverage, ordered by decreasing validity. Not a single justification of a decision is important, but the reliable selection of a high number of potential customers.
2. In checking creditworthiness, the user searches for rules with a high validity, but not necessarily a high coverage, that tell him whether the applicant is a high or a low risk customer.
This is to illustrate our belief that there is no single 'most interesting' rule base, but that the user's interests are the prime criteria for interestingness [1][2]. In detail, this approach takes into consideration the single rule confidence, as well as extended utility and novelty aspects.

4.1 Selecting High-Performance Rules
First, we determine the high-performance rules and sort the rules according to their power. This can be operationalized in various ways. Besides the well-known criteria of rule confidence and rule support, rule evidence is a relevant feature. A rule's support is measured by the number of patterns that fulfill the rule's premise; in other words, it is the number of times the rule 'fires'. For this we emulate the signal propagation within NEFCLASS. Rules with compound antecedents have to be split up into 'pure' rules combined by a disjunction. The proportion of patterns for which classification by the rule is done correctly determines its confidence. Evidence is a composite measurement, depending on rule confidence and rule support. The following equation is based on suggestions of Gebhardt [3] and the 'average error' proposed by Jim/Wuthrich [5]:
    evidence_i = \begin{cases} 1 - \frac{a + b}{n} & \text{if support} > 0 \\ 0 & \text{else} \end{cases}     (2)

a = recognized patterns of the wrong class, b = not recognized patterns of the right class, and n = number of patterns.
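For illustration, a minimal sketch of these three measures in terms of the 2×2 counts shown in Fig. 2 below (the counts themselves are hypothetical):

```python
def rule_statistics(c, a, b, d):
    """Confidence, support, and evidence of a rule from the contingency counts of
    Fig. 2: c/a = right/wrong class among patterns where the rule fires, b/d among
    patterns where it does not; n = a + b + c + d."""
    n = a + b + c + d
    support = c + a
    confidence = c / support if support > 0 else 0.0
    evidence = 1 - (a + b) / n if support > 0 else 0.0
    return confidence, support, evidence

# The rule fires for 40 class-1 and 10 class-2 patterns, misses 5 class-1
# patterns, and correctly stays silent for 45 class-2 patterns.
print(rule_statistics(c=40, a=10, b=5, d=45))   # (0.8, 50, 0.85)
```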
                      class 1    class 2
rule fires               c          a
rule does not fire       b          d

confidence = c/(c+a)     n = number of patterns     support = c + a
evidence = 1 - (a+b)/n = (c+d)/n

Fig. 2. Interrelation between the concepts of confidence, support and evidence.
Figure 2 visualizes the interrelation between confidence, support and evidence for a simple example with two classes; the rule's conclusion is class 1. While confidence measures how valid a rule is and support measures how often it 'fires', evidence makes a more complex proposition: it penalizes rules that misfire (the rule fires although it should not) as well as rules that fire too infrequently (the rule does not fire although it should). Rulebrowser sorts the rules based on the chosen performance measure (evidence is the default option, but confidence and support can be chosen). In each iteration these measures are calculated only for the patterns not already covered by rules that have entered the rule base in previous steps. Hence, just the incremental value of the rule to the rule base is taken into account. This is based on the idea of the 'rule cover' algorithm by Toivonen [13]. The user defines a minimum support of the rules as a stop criterion (if this support is not reached, the algorithm stops after 100 iterations). This leads to a selection of the most relevant rules. Defining a high support threshold will often shrink a rule base drastically, but must be weighed against the ensuing information loss.

program Rule Sort
  repeat
    determine performance for each rule in search space
    for all classes do begin
      select rule with maximum evidence and support > 0
      delete all cases covered by this rule
      delete rule from search space
    end
  until stop-criterion fulfilled
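A small Python sketch may clarify how evidence and the greedy, rule-cover style selection interact; the representation of patterns as (features, class) pairs and of rules as (predicate, predicted class) pairs is an assumption for illustration, and the per-class loop of the pseudo-code is collapsed into a single ranking.

def counts(rule, patterns):
    # rule: (predicate, predicted_class); patterns: list of (features, true_class)
    predicate, predicted = rule
    c = a = b = d = 0
    for features, true_class in patterns:
        fires, right = predicate(features), true_class == predicted
        if fires and right:
            c += 1               # correct firings
        elif fires:
            a += 1               # misfirings (patterns of the wrong class recognized)
        elif right:
            b += 1               # patterns of the right class not recognized
        else:
            d += 1
    return c, a, b, d

def evidence(rule, patterns):
    c, a, b, d = counts(rule, patterns)
    n = c + a + b + d
    return 1.0 - (a + b) / n if c + a > 0 else 0.0       # formula (2); support = c + a

def rule_sort(rules, patterns, min_support=1, max_iterations=100):
    # greedy rule-cover selection: each step scores rules only on the still-uncovered patterns
    remaining, selected = list(patterns), []
    for _ in range(max_iterations):
        candidates = [r for r in rules if r not in selected]
        if not candidates or not remaining:
            break
        best = max(candidates, key=lambda r: evidence(r, remaining))
        covered = [p for p in remaining if best[0](p[0])]
        if len(covered) < min_support:
            break
        selected.append(best)
        remaining = [p for p in remaining if p not in covered]
    return selected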
4.2
Devaluation of Similar Rules
To this point, we have only judged rules based on their performance considered in isolation. We may now manipulate the evidence formula to account for novelty of
the candidate entering the rule base. We compute reduced evidences, which penalize rules that are very similar to rules with a higher evidence [3].

program Devaluate Similar Rules
  for all rules do begin
    reduced evidence := evidence
    rule := not marked
  end
  repeat
    for all classes do begin
      determine not-marked rule with highest reduced evidence and mark it
      for all not-marked rules do begin
        compute new reduced evidence according to (3)
      end
    end
  until all rules are marked
V_red^new(R1) = min{ V_red(R1), V(R1) * [ V(R1) / V_red(R2) ]^(d * sim_{R1,R2}^k) }        (3)

where
  V_red^new(Ri) = new reduced evidence of rule i,
  V_red(Ri)     = reduced evidence of rule i,
  V(Ri)         = evidence of rule i,
  d             = strength of devaluation,
  sim_{R1,R2}   = similarity of rule 1 and rule 2 according to formula (1),
  k             = relevance of similarity.
This reduced evidence is again used to sort the rules [3]. In the resulting, rather user-friendly overview of the rules, the user can set the parameters of the devaluation with a scroll-bar, revaluate and devaluate rules manually, and cut off rules with a low reduced evidence. In addition he or she can change the sort criterion. In doing so the user can find his or her optimal position in the trade-off between the different facets of interestingness, especially validity and simplicity of the rule base.
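A sketch of the devaluation loop follows; note that the devaluation factor is only reconstructed from (3) above, so both the factor and the parameter defaults are assumptions rather than the published formula.

def devaluate_similar_rules(evidences, sim, d=1.0, k=2.0):
    # evidences: dict rule_id -> evidence V; sim(r1, r2): similarity per formula (1);
    # d and k are the devaluation strength and the relevance of similarity (defaults assumed)
    reduced = dict(evidences)                  # reduced evidence starts out equal to V
    marked = set()
    while len(marked) < len(evidences):
        best = max((r for r in evidences if r not in marked), key=lambda r: reduced[r])
        marked.add(best)
        for r in evidences:
            if r in marked or reduced[best] <= 0:
                continue
            # devaluation factor as reconstructed in (3) -- an assumption, not the exact published form
            factor = (evidences[r] / reduced[best]) ** (d * sim(r, best) ** k)
            reduced[r] = min(reduced[r], evidences[r] * factor)
    return reduced

ev = {'A': 0.9, 'B': 0.85, 'C': 0.4}
pairs = {('A', 'B'): 0.9, ('A', 'C'): 0.1, ('B', 'C'): 0.2}
sim = lambda x, y: pairs.get((x, y), pairs.get((y, x), 0.0))
print(devaluate_similar_rules(ev, sim))        # B, being similar to A, is devalued most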
5
Empirical Evaluation of Rulebrowser
Rulebrowser has been successfully applied to two real-life classification problems: fault diagnosis for a large car manufacturer and target group selection for a bank. The first database (car data) contains 18,000 records on cars recently sold, their respective characteristics (21 standard equipment characteristics such as number of cylinders plus 221 on optional equipment such as ABS) and data on faults detected (e.g. engine breakdown). The aim of the analysis was: how do car characteristics influence certain fault frequencies? The second database (bank data) is based on a mailing campaign to
convince customers of a bank to buy a credit card. It consists of about 180,000 cases, 21 attributes and 2 classes (respondents/non-respondents). For the car data, figure 3 shows the considerable reduction in the number of rules and in the average number of attributes per rule after rule aggregation. Figure 4 shows the further reduction in the number of rules for a minimum support of 95%, 90% and 80%.
Fig. 3. Reduction of the number of rules and average number of attributes in one rule.
Fig. 4. Further reduction in the number of rules.

Looking at the bank data, which, in contrast to the car data, were only moderately preprocessed with respect to the number of attributes, we can see even better results. Rule aggregation reduced the average number of attributes per rule from 21 to 17.5, with the smallest rule containing only 11 attributes. The number of rules dropped from 242 to 48 after rule aggregation. With a minimum support of 90% we even managed to decrease the number of rules to 16. In both cases, rule aggregation leads to no loss of validity, as no information is deleted. With the minimum support criterion, we are confronted with a validity versus simplicity trade-off. But the loss of validity is small, since only the least powerful rules are eliminated, which cover only a few cases and describe no new structures.
6
Related Work
Pattern postprocessing, in contrast to the various preprocessing tasks such as attribute selection or the treatment of missing values, has received little research attention. Only a few approaches have been developed to solve the problem of enhancing rule interestingness. Most of them are found in the field of association analysis, where the phenomenon of 'exploding' rule bases is fundamental. Due to space restrictions we can only mention a selection of the approaches. Most methods rely on objective measures of rule interestingness, like the 'Neighborhood-Based Unexpectedness' of Dong/Li [2], the use of rule covers by Toivonen [13] or the rule aggregation approach of Major/Mangano [8]. Only few approaches integrate the user, usually by requiring predefined rule templates (Klemettinen et al. [6]) or belief systems (Silberschatz/Tuzhilin [12]). But this is often unfeasible in practice, due to the strong involvement of the user it requires. Few authors propose combined approaches for the evaluation of interestingness. Gebhardt suggests evidence and affinity of rules for a composite measure of interestingness. He proposes several ways to quantify them and discusses pros and cons [3]. Hausdorf/Müller have developed a complex system for evaluating interestingness based on different facets [4]. The developers of NEFCLASS themselves have proposed a rule aggregation method, based on four different steps of attribute, rule and fuzzy set elimination [10]. But these steps mainly aim at the validity of the rules and suffer from serious problems. Another algorithm by Klose et al. aggregates the rule base in three steps, 'input pruning on data set level' (attribute selection), 'input pruning on rule level' and 'simple rule merging' [7]. The last two steps correspond to the rule aggregation proposed in this paper.
7
Conclusions
Rule post-processing is an essential step in the Knowledge Discovery in Databases process. This holds in particular for neuro-fuzzy systems. In this paper we have proposed a prototype of a tool that aggregates and sorts rules according to different facets of interestingness. The results are promising, but further research is needed. For example, more sophisticated search strategies to identify candidate rules for aggregation might improve the resulting rule base. Visualization techniques, which have not been discussed in this paper, could enhance the understandability of the rules. However, the quality of a post-processing method is very hard to quantify, as it strongly depends on the system's user. Hence, only further practical applications of our methods can evaluate their effectiveness. Research on this subject was funded in part by the Thuringian Ministry for Science, Research and Culture. The authors are responsible for the content of this publication.
References
1. Brachman, R. J., Anand, T.: The Process of Knowledge Discovery in Databases. In: Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. (Eds.), Advances in Knowledge Discovery and Data Mining, Menlo Park, CA 1996, pp. 37-58
2. Dong, G., Li, J.: Interestingness of Discovered Association Rules in Terms of Neighborhood-Based Unexpectedness. In: Wu, X., Kotagiri, R., Korb, E. (Eds.), Research and Development in Knowledge Discovery and Data Mining (Proceedings of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining), Heidelberg 1998, pp. 72-86
3. Gebhardt, F.: Discovering Interesting Statements from a Database. In: Applied Stochastic Models and Data Analysis 1/1994, pp. 1-14
4. Hausdorf, C., Müller, M.: A Theory of Interestingness for Knowledge Discovery in Databases Exemplified in Medicine. In: Lavrac, N., Keravnou, E., Zupan, B. (Eds.), First International Workshop on Intelligent Data Analysis in Medicine and Pharmacology
5. Jim, K., Wuthrich, B.: Rule Discovery: Error Measures and Conditional Rule Probabilities. In: KDD: Techniques and Applications. Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore 1997, pp. 82-89
6. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., Verkamo, A. I.: Finding Interesting Rules from Large Sets of Discovered Association Rules. In: Adam, N. R., Bhargava, B. K., Yesha, Y. (Eds.), Proceedings of the Third International Conference on Information and Knowledge Management (CIKM'94), Gaithersburg, Maryland, November 29 - December 2, 1994, ACM Press 1994, pp. 401-407
7. Klose, A., Nürnberger, A., Nauck, D.: Some Approaches to Improve the Interpretability of Neuro-Fuzzy Classifiers. In: Zimmermann, H.-J. (Ed.), 6th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, September 7-10, 1998, Aachen 1998, pp. 629-633
8. Major, J. A., Mangano, J. J.: Selecting Among Rules Induced from a Hurricane Database. In: Piatetsky-Shapiro, G. (Ed.), Knowledge Discovery in Databases, Papers from the 1993 AAAI Workshop, Menlo Park, CA 1993, pp. 28-44
9. Nauck, D., Kruse, R.: NEFCLASS - A Neuro-Fuzzy Approach for the Classification of Data. Paper of the Symposium on Applied Computing 1995 (SAC'95), Nashville
10. Nauck, D., Kruse, R.: New Learning Strategies for NEFCLASS. In: Proc. Seventh International Fuzzy Systems Association World Congress IFSA'97, Vol. IV, Academia Prague, 1997, pp. 50-55
11. Ruhland, J., Wittmann, T.: Neurofuzzy Systems in Large Databases - A Comparison of Alternative Algorithms for a Real-Life Classification Problem. In: Proceedings 5th European Congress on Intelligent Techniques and Soft Computing (EUFIT'97), Aachen, Germany, September 8-11, 1997, pp. 1517-1521
12. Silberschatz, A., Tuzhilin, A.: On Subjective Measures of Interestingness in Knowledge Discovery. In: Proc. of the 1st International Conference on Knowledge Discovery and Data Mining, Montreal, August 1995, pp. 275-281
13. Toivonen, H., Klemettinen, M., Ronkainen, P., Hätönen, K., Mannila, H.: Pruning and Grouping Discovered Association Rules. In: Kodratoff, Y., Nakhaeizadeh, G., Taylor, C. (Eds.), Workshop Notes Statistics, Machine Learning, and Knowledge Discovery in Databases, MLNet Familiarization Workshop, Heraklion, Crete, April 1995, pp. 47-52
Unsupervised Profiling for Identifying Superimposed Fraud

Uzi Murad and Gadi Pinkas

Tel Aviv University, Ramat-Aviv 69978, Israel
Amdocs (Israel) Ltd., 8 Hapnina St., Ra'anana 43000, Israel
{uzimu, gadip}@amdocs.com
Abstract. Many fraud analysis applications try to detect “probably fraudulent” usage patterns, and to discover these patterns in historical data. This paper builds on a different detection concept; there are no fixed “probably fraudulent” patterns, but any significant deviation from the normal behavior indicates a potential fraud. In order to detect such deviations, a comprehensive representation of “customer behavior” must be used. This paper presents such representation, and discusses issues derived from it: a distance function and a clustering algorithm for probability distributions.
1 Introduction

The telecommunications industry regularly suffers major losses due to fraud. The various types of fraud may be classified into two categories:
Subscription fraud. Fraudsters obtain an account without intention to pay the bill. In such cases, abnormal usage occurs throughout the active period of the account.
Superimposed fraud [3]. Fraudsters 'take over' a legitimate account. In such cases, the abnormal usage is 'superimposed' upon the normal usage of the legitimate customers. Examples of such cases include cellular cloning and calling card theft.
Data mining can be used to learn what situations constitute fraud. However, the nature of the problem makes standard pattern recognition algorithms impractical. Following are the unique characteristics of the superimposed fraud detection problem.
Context. Customers differ in calling behaviors (usage patterns and volume). A usage pattern may be normal for one customer and abnormal for another. For example, calls from New York are suspicious if the customer lives and works in Boston, but perfectly normal for New York residents. Sometimes a single customer demonstrates different types of behavior on different occasions. For example, business customers make many calls on business days, but no calls on weekends. Finally, normal behavior may change over time. It is impossible to define global fraud criteria that would be valid for all customers, all the time.
Changing fraud patterns. Following the progress of technology, fraudsters adopt new fraud techniques, which may result in new usage patterns. A set of previously observed fraud-related patterns might not suffice to detect new instances of fraud.
Following this discussion, a fraud detection system should be (1) sensitive to customer-unique behavior, (2) adaptable to changes in customer behavior, (3) sensitive to rare yet normal customer behavior and (4) adaptable to new fraud patterns.
2 Related Work Several techniques concerning telecommunication and credit card fraud detection are described in literature. The majority of techniques monitor customer behavior regarding specific usage patterns. [10] describes a rule based system, which accumulates number or duration of calls that match specific patterns (e.g., international calls) in one day and calculates the average and standard deviation of the daily values. Then it compares new values against a user-defined threshold in terms of standard deviations from the average. [9] describes a neural network based system, which uses similar parameters, but learns from known cases what situations (combinations of current value, average value and standard deviation) are fraudulent. In this approach, the threshold is determined from known cases, rather than being user-defined. [3] uses supervised learning to discover from historical data what usage patterns are „probably fraudulent“, but still deals with specific patterns. In general, any supervised learning algorithm has to receive a significant number of cases of a pattern in order to learn it. It might be difficult to obtain enough known cases of each pattern. In addition, new patterns of fraud must be discovered by other means (such as customer complaints), before being presented to the learning algorithm. A long time may pass from the first occurrence of the pattern until it is used for detection. Finally, accurately classified data may be hard to get. A different approach, the one this paper follows, is to create a general model of behavior, without using predefined patterns. [7] deals with credit card fraud, and profiles customers by their typical transactions. A new transaction is examined against the profile, to see whether similar transactions have appeared in the past. There is no element of usage volume, or frequency of transactions. [1] represents short-term behavior by the probability distribution of calls made by a customer in one day. This „Current User Profile“ (CUP) is examined against the „User Profile History“ (UPH), which is the average of the CUPs generated by that user. Call volume is not taken into consideration, and, since the UPH is an average, it does not reflect rare yet normal patterns.
3 Three Level Profiling The technique for fraud detection suggested in this paper does not search for specific fraud patterns; instead, it points out any significant deviation from the customer’s normal behavior as a potential fraud. In order to detect such deviations, a comprehensive representation of behavior, which can capture a variety of behavior patterns, must be used, and the definition of „deviation“ should reflect the actual degree of dissimilarity between „behavior instances“. To meet these goals, three profile levels are used:
Fig. 1. Three Level Profiling
Call Profile – represents a single call and includes all fields of the Call Detail Record (CDR) that are relevant to behavior. Daily profile – represents the short-term behavior of a customer. The daily profile consists of two parts: The qualitative profile describes the kind of calls (when, from where, to where) the customer made during the day, and the quantitative profile describes the usage volume. Each attribute in the call profile is seen as a random variable, and the qualitative profile is the empirical multi-dimensional probability distribution of calls on a given day. Since the space of call profiles is huge, it is partitioned into a finite number of subspaces, each of which is represented by a „prototypical call“. The qualitative profile is a vector, which contains an entry for each „call prototype“ with the percentage of calls (on a given day) of that prototype. Overall Profile – represents the long-term „normal“ behavior of a customer, and reflects the daily behaviors normally exhibited by the customer. Since the space of Daily Profiles is infinite, a clustering algorithm is applied to the daily profiles, which extracts „prototypical days“. The overall profile is a vector with entries for all „daily prototypes“, containing information about the „normal“ usage level for that customer in days of that type. Since customers may behave differently on different „types of
days“, a separate vector is kept for each type. In the current framework there are two types: weekdays and weekends. Fig. 1 illustrates the concept of Three Level Profiling.
4
The Call Profile
The call profile c = (c1, c2, ..., cd) includes all those features of the Call Detail Record (CDR) that are relevant to behavior. For the data set used for the first implementation (wireline data), the following attributes were used: call start time, call duration, destination type (local, international, premium rate service or toll-free) and call type (voice or data). Other attributes should be considered for different data sets. In a cellular context, for example, the originating location would probably be included. Each attribute ci corresponds to a random variable Xi, which takes values from a domain Di. The domain of call profiles D = D1 x D2 x ... x Dd is huge; call duration may be any number of seconds, and start time may be any time (in seconds) in the day. The huge domain of CDR profiles should be represented by a relatively small number of call prototypes. For our purpose, a good set of prototypes is one in which any new call has a prototype similar enough to it, and different prototypes are dissimilar enough. We extract call prototypes as follows. For continuous or ordinal attributes (call duration and start time) the domain Di of attribute i is split into ni ranges, resulting in a discrete domain D*_i = { x^i_1, x^i_2, ..., x^i_{ni} }. For discrete non-ordinal attributes D*_i = Di. In our framework, we partitioned the start time into 12 two-hour time windows. Call duration was partitioned into 12 five-minute ranges, which cover calls up to an hour long. There is, of course, inaccuracy for calls longer than one hour. However, since 99.9% of the calls are shorter than one hour, and long calls can be monitored easily by other means, this compromise is reasonable. The resulting discrete domain contains l = n1 · n2 · ... · nd values.
5 The Daily Profile and the Curse of Dimensionality

The daily profile DP consists of a quantitative profile DP.n (the number of calls made on that day) and a qualitative profile DP.q = (q1, q2, ..., ql), where 0 ≤ qi ≤ 1, Σqi = 1 and l is the number of call prototypes. Each entry qi corresponds to a call prototype and contains the percentage of calls of that type in the given day. The huge dimension of the vector (12 · 12 · 4 · 2 = 1152 in our case) seems to be problematic in terms of both storage space and computation time. However, only a few prototypes actually appear in a single daily profile. The average number of non-zero call prototypes in a daily profile (calculated using over 400,000 daily profiles) is 2.8, and 99% of the daily profiles have 13 or fewer non-zero call prototypes. If time and space complexity depend only on the number of non-zero prototypes, then the dimensionality problem ceases to exist. To achieve this, daily profiles are stored in variable-length arrays. Computations with daily profiles are discussed next.
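The sparse representation can be illustrated with a short Python sketch; the record layout of a call and the flattening of the 12 x 12 x 4 x 2 prototype grid into a single index are assumptions chosen to match the attributes and partitions described above.

DEST_TYPES = ['local', 'international', 'prs', 'toll-free']
CALL_TYPES = ['voice', 'data']

def call_prototype(call):
    # call: dict with 'start' (seconds since midnight), 'duration' (seconds),
    # 'dest_type' and 'call_type'; the record layout is an illustrative assumption
    time_slot = min(int(call['start'] // 7200), 11)        # 12 two-hour windows
    dur_slot = min(int(call['duration'] // 300), 11)       # 12 five-minute ranges
    dest = DEST_TYPES.index(call['dest_type'])
    ctype = CALL_TYPES.index(call['call_type'])
    # flatten the 12 x 12 x 4 x 2 = 1152 cells into one prototype index
    return ((time_slot * 12 + dur_slot) * 4 + dest) * 2 + ctype

def daily_profile(calls):
    # qualitative part: sparse distribution over prototypes; quantitative part: call count
    counts = {}
    for c in calls:
        p = call_prototype(c)
        counts[p] = counts.get(p, 0) + 1
    n = len(calls)
    return {'n': n, 'q': {p: k / n for p, k in counts.items()}}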
6 Distances Between Qualitative Profiles

A distance function between daily profiles is required for clustering daily profiles and for the detection of deviations. It is extremely important that the distance function reflect the actual level of similarity between two daily profiles. We examined several known distance functions. The Euclidean distance is not adequate for the problem. Consider, for example, the three following call prototypes:
Call prototype 1: 5-minute local voice call at 2:00PM
Call prototype 2: 10-minute local voice call at 2:00PM
Call prototype 3: 60-minute international data call at 2:00AM
and three daily profiles, each of which contains only calls of prototype 1, 2 and 3, respectively (i.e., each daily profile has value 1 in one entry and 0 in all others). While it is obvious that the first daily profile is similar to the second, and absolutely different from the third, the Euclidean distances are identical and equal to √2, which is the maximal distance. The reason for this distortion is that the Euclidean distance does not take into account the similarity between attributes. Each attribute is treated as totally different in meaning from all other attributes. In our case, however, the attributes represent points in the d-dimensional space of call profiles and, as such, some attributes are 'closer' (in meaning) to each other than others. The Hellinger distance (suggested by [1]) and the Mahalanobis distance [5] suffer from the same distortion. Since the qualitative profiles represent multidimensional probability distributions, distance functions between probability distributions should be considered. Distance functions discussed in [6] (for example, Patrick-Fisher, Matusita and Divergence) share the same problem. An obvious solution is to compare the probabilities of neighborhoods of values. However, the performance depends on the neighborhood size. Small neighborhoods will not be sensitive enough in cases where the values are distant, and large neighborhoods will lose information. Moreover, large neighborhoods may require many more prototypes to be handled than the non-zero ones.
Cumulative Distribution Based Distance. We propose the CD-distance, which is based on the cumulative distribution. The cumulative distribution makes it possible to capture the 'closeness' of values, in addition to their probabilities. The distance function sums the squared differences of the cumulative distribution functions (instead of the density functions). Formally, in the one-dimensional case we define the distance as follows. Let f1(x) and f2(x) be two continuous probability distributions of a random variable X. The distance between f1(x) and f2(x), denoted by d(f1, f2), is
d(f1, f2) = (1 / (xmax − xmin)) ∫_{xmin}^{xmax} (F1(x) − F2(x))^2 dx                        (1)
where F(x) is the cumulative distribution ( F ( x ) = P( X ≤ x ) ), and xmax and xmin are the maximum and minimum values, respectively, of the random variable X. We define xmax
and xmin such that the probability of values outside (xmin, xmax) is negligible. xmax and xmin do not depend on the two density functions being compared, but solely on the random variable X. In order to obtain distance values on a normalized scale, the sum is divided by (xmax − xmin), so we get 0 ≤ d(f1, f2) ≤ 1. In the discrete, ordinal case, the domain of the random variable X is an ascending list of values (D = {x1, x2, ..., xn}, xi > xj ⇔ i > j), and the distance between two probability distributions is
d(f1, f2) = (1 / (xn − x1)) Σ_{j=1}^{n−1} (F1(xj) − F2(xj))^2 δj                           (2)

where δj = x_{j+1} − xj. In this case, the cumulative distribution is a step function, and (2) is a straightforward simplification of (1) for the subset of discrete probability distributions. x1 and xn, the smallest and largest discrete values, respectively, serve as xmin and xmax. The δj are the differences between consecutive discrete values. Note that the δj are not necessarily equal; therefore, to improve computation efficiency, we can consider only the non-zero entries (entries k such that Pi(X = xk) ≠ 0, i = 1, 2). Binary random variables are treated as discrete ordinal random variables with the domain {0, 1}. In the discrete, non-ordinal case (e.g., call type), the domain consists of non-numeric values. We assume that the similarity between each pair of values is equal. In this case, we replace the call profile attribute i with ni binary attributes, where ni is the cardinality of domain Di. The generalized distance function in the multidimensional case is:
d(f1, f2) = Σ_{i=1}^{d} wi · (1 / (x^i_{ni} − x^i_1)) Σ_{j=1}^{ni−1} (F1(x^i_j) − F2(x^i_j))^2 δ^i_j                    (3)
where x^i_j is the jth prototype of call profile attribute i, F(x^i_j) = P(Xi ≤ x^i_j), and wi is a weight for call profile attribute i, with Σwi = 1. It can be shown [11] that the CD-distance function satisfies, for any probability distribution functions f1, f2 and f3:
(1) 0 ≤ d(f1, f2) ≤ 1
(2) d(f1, f1) = 0
(3) d(f1, f2) = d(f2, f1)                    (symmetry)
(4) d(f1, f2) + d(f2, f3) ≥ d(f1, f3)        (triangle inequality)
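For the one-dimensional discrete case (2), a direct Python sketch looks as follows; the sparse dictionary input format is an assumption chosen to match the variable-length daily profiles.

def cd_distance_1d(f1, f2, values):
    # f1, f2: dicts mapping a value to its probability (missing keys mean 0);
    # values: ascending list of all discrete values x_1 < ... < x_n of the attribute
    total, cum1, cum2 = 0.0, 0.0, 0.0
    span = values[-1] - values[0]
    for j in range(len(values) - 1):
        cum1 += f1.get(values[j], 0.0)           # F1(x_j)
        cum2 += f2.get(values[j], 0.0)           # F2(x_j)
        delta = values[j + 1] - values[j]
        total += (cum1 - cum2) ** 2 * delta
    return total / span

# all mass on value 0 vs. all mass on the neighbouring value 1, over 12 duration ranks
print(cd_distance_1d({0: 1.0}, {1: 1.0}, list(range(12))))   # small, roughly 0.09
print(cd_distance_1d({0: 1.0}, {11: 1.0}, list(range(12))))  # maximal, 1.0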
7 Extracting Daily Prototypes

The space of daily qualitative profiles is infinite, and we need to represent it by K 'prototypical' qualitative profiles, which represent prototypical behaviors. This
discretization is important not only for performance, but also to facilitate the investigation of alerts by the human analyst. We use a clustering algorithm to extract such prototypes. Since the distance function is not Euclidean, a proximity-matrix-based algorithm seems to be essential. However, such algorithms are restricted to small sample sets, due to their space and time complexity. In our case, however, extracting a sufficiently small sample may result in losing unique prototypes. The partitional algorithm K-means is based on the Euclidean distance and does not require a proximity matrix. The algorithm seeks to minimize the sum of squared distances between samples and their associated cluster centers:

Σ_{i=1}^{N} d^2(p_i, c_{p_i})                                                              (4)
Recalculating the new cluster center as the Euclidean centroid of the samples assigned to it locally minimizes this criterion. It can be shown [11] that this method of recalculating cluster centers also minimizes the criterion function with the CD-distance; therefore the K-means algorithm can be used, with the CD-distance replacing the Euclidean distance. An 'adaptive' version of K-means [8] is used in order to determine the number of clusters dynamically.
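A compact sketch of this K-means variant with a pluggable distance function is given below; the dense-vector representation of the profiles, the fixed number of iterations and the omission of the adaptive choice of K are simplifications for illustration.

import random

def kmeans(profiles, k, distance, iterations=20):
    # profiles: list of equal-length vectors (daily qualitative profiles);
    # distance: any distance function, e.g. a CD-distance on these vectors
    centers = random.sample(profiles, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in profiles:
            nearest = min(range(k), key=lambda i: distance(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:
                # recalculated center = component-wise mean of the assigned samples
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers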
8 The Overall Profile The overall profile OP is an array containing an entry for each daily prototype. Each entry OPi contains the number of days of prototype i observed for the account (OPi.n), the sum (OPi .sn) and sum of squares (OPi.ssn) of number of calls in these days. These components are later used to calculate the average Mi and standard deviation σi of the number of calls per day of a prototype, as described in [10]. In order to adapt to changes in customer behavior, the components OPi.x with x = n, sn, ssn may be updated using a decay function, like the one used in [10]. In our current prototype, two such arrays are used; one for business days and another for weekends.
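The running sums translate directly into the per-prototype mean and standard deviation of the daily call count; a minimal sketch follows (the decay function and the weekday/weekend split are omitted).

import math

def update_entry(entry, n_calls):
    # accumulate one observed day of this prototype: OPi.n, OPi.sn, OPi.ssn
    entry['n'] += 1
    entry['sn'] += n_calls
    entry['ssn'] += n_calls ** 2

def mean_and_std(entry):
    # average M_i and standard deviation sigma_i of calls per day for the prototype
    mean = entry['sn'] / entry['n']
    variance = max(entry['ssn'] / entry['n'] - mean ** 2, 0.0)
    return mean, math.sqrt(variance)

entry = {'n': 0, 'sn': 0, 'ssn': 0}
for calls_per_day in (12, 15, 9, 14):
    update_entry(entry, calls_per_day)
print(mean_and_std(entry))             # roughly (12.5, 2.29)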
9 Deviation Detection

Matching a daily profile to the overall profile includes qualitative and quantitative checks. The qualitative profile matches the overall profile if it is closer than a threshold Tqualitative to the nearest non-zero daily prototype of that customer. The quantitative profile matches the overall profile if (DP.n − Mi)/σi ≤ Tquantitative, where i is the prototype closest to DP and Tquantitative is a threshold in terms of standard deviations. In order to reduce false alerts on daily profiles in which the deviating activity is low, we wish to ignore such profiles, which are of no interest to investigate. We assign a value to each daily profile based on call durations and destinations, which represents the 'interestingness level' of that day (not necessarily the cost of the calls). Daily profiles
with a value smaller than Tvalue are not checked, but still update the overall profiles. Finally, quantitative deviations on days with less than Tncalls calls do not issue alerts. A detection process constantly reads CDRs and updates the daily profiles. Once a day it performs the deviation detection and updates the overall profiles.
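Putting the thresholds together, one day's check can be sketched as follows, reusing mean_and_std and a CD-distance style function from the sketches above; the threshold names and profile layout follow the text, while the nearest-prototype search and the computation of the interestingness value are left as assumptions.

def check_day(dp, overall, prototypes, distance, thresholds):
    # dp: {'q': qualitative profile, 'n': number of calls, 'value': interestingness value}
    # overall: {prototype index: {'n': ..., 'sn': ..., 'ssn': ...}} for this customer
    if dp['value'] < thresholds['T_value']:
        return None                                    # day too uninteresting to check
    # qualitative check: distance to the nearest non-zero daily prototype of the customer
    nearest = min(overall, key=lambda i: distance(dp['q'], prototypes[i]))
    if distance(dp['q'], prototypes[nearest]) > thresholds['T_qualitative']:
        return 'qualitative deviation'
    # quantitative check: call count versus the usual level for that prototype
    mean, std = mean_and_std(overall[nearest])
    if dp['n'] >= thresholds['T_ncalls'] and std > 0 and \
            (dp['n'] - mean) / std > thresholds['T_quantitative']:
        return 'quantitative deviation'
    return None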
10 Evaluation

To test the technique, we used the data set of wireline CDRs. The data set covered three months' usage of about 7,000 accounts. We used the first two months' calls only to learn overall profiles. The third month's data was checked for behavioral changes and used to update the profiles. We considered only customers with more than 20 active days in the first two months and without fraudulent usage during this period. A total of 6,334 accounts met these conditions, out of which 82 accounts (1.3%) included superimposed fraudulent usage in the third month. We ran the system with 240 combinations of the four thresholds. Alerts on one of the first two fraudulent days are considered 'hits'. For comparison we took the widely used rule-based method (for example, [10]). This method is also based on unsupervised learning, and attempts to detect significant changes (increases) in usage. It accumulates usage (number or duration of calls of certain types) and calculates the average and standard deviation of the daily values of each accumulator. The average and standard deviation are updated with each newly introduced accumulator value. An alert is issued whenever the value of a certain accumulator exceeds a threshold Tstdevs, defined in terms of standard deviations from the average. We used accumulators of voice, data, international, PRS, toll-free and nightly calls. We accumulated both the number and the duration of calls of each type. Here also, only the first two months' calls were used for learning. In order to prevent alerts on days with low activity, we also used Tvalue (where the daily value was calculated as in our method), Tncalls and Tduration thresholds. We ran the algorithm with 240 different combinations of values for these four thresholds.
Evaluation Technique. For our purpose, we cannot compare classifiers using accuracy alone, since the class distribution is not constant and the costs of false negative errors and false positive errors are not equal. In addition, the class distribution is very skewed [4]; in our case, a 'do nothing' strategy will give 98.7% accuracy. In both methods, the thresholds control the number of alerts: the lower the thresholds, the more true positives and the more false positives are produced. Therefore, to evaluate the system's performance, we use two measures:
• True positive (hit) rate: the ratio of detected fraud cases out of all fraud cases.
• False positive (false alarm) rate: the ratio of non-fraud cases classified as fraud out of all non-fraud cases.
We then compare the hit rates achieved by each method, given a fixed false alarm rate.
  Allowed false        Possible true positive rate
  positive rate           3L           RB
  1%                   65.85%       25.61%
  2%                   93.90%       50.00%
  3%                   95.12%       59.76%
  4%                   95.12%       63.41%
  5%                   96.34%       68.29%
  10%                  97.56%       81.71%
  15%                 100.00%       86.59%

Fig. 2. Performance comparison on real data. Hit rate vs. false alarm rate of all 240 runs of each method, with the non-decreasing hulls (graph), and a summary of best results (table).
The graph in Fig. 2 depicts the results of the 240 runs of each algorithm, with the corresponding non-decreasing hulls. The non-decreasing hulls reflect the best results of each method. The best results are also summarized in the table in Fig. 2. It can be seen that the 3L method outperforms the naive RB approach, as even the worst results of 3L dominate the best results of RB. 3L reaches a high detection rate at low false alarm rates, where RB performs poorly.
Semi-Synthetic Data. In order to evaluate the 3L method on a larger set of fraud cases, and to analyze its sensitivity to various fraud patterns and to different volumes of fraud, we use semi-synthetic data. For this purpose, we added synthetically generated fraudulent usage to the original usage of 10% (625) of the fraud-free accounts. To each of the selected accounts we added 'fraudulent' usage covering three consecutive days and matching one of six fraud patterns. The patterns differ in quantity of calls and in structure and are distributed equally over the 625 cases. Table 1 shows the total hit rate, as well as the rates of the detected cases of 4 selected patterns, given various false alarm rates. The hit rates correspond to the configuration which generated the best total hit rate (rather than the best hit rate for each pattern separately); therefore the hit rates of the patterns are not always increasing.
Table 1. Performance comparison on semi-synthetic data. The average daily number of added CDRs is shown in parentheses in the titles, followed by the pattern's description. Allowed False Positive Rate 1% 1% 2% 3% 4% 5% 10%
Pattern 1 (7) Long PRS 3L RB 90.7% 91.6% 23.4% 91.6% 100.0% 91.6% 100.0% 91.6% 100.0% 91.6% 100.0% 92.5% 100.0%
Possible True Positive Rate Pattern 2 (100) Pattern 3 (4) Pattern 4 (25) Call reselling Intl. In evenings No pattern 3L RB 3L RB 3L RB 100.0% 68.7% 70.5% 100.0% 100.0% 71.7% 1.0% 72.4% 56.2% 100.0% 100.0% 71.7% 1.0% 81.0% 99.0% 100.0% 100.0% 72.7% 3.0% 89.5% 99.0% 100.0% 100.0% 72.7% 3.0% 89.5% 100.0% 100.0% 100.0% 73.7% 25.3% 90.5% 99.0% 100.0% 100.0% 73.7% 41.4% 95.2% 100.0%
Total 3L 78.6% 81.9% 85.1% 88.0% 88.0% 90.9% 93.4%
RB 50.1% 77.6% 82.2% 83.5% 86.4% 89.8%
Fig. 3. Sensitivity to pattern size. The points on each line represent the highest hit rate achieved under the given false alarm rate. Panels: (a) pattern 2, 1% false alarm rate; (b) pattern 2, 3% false alarm rate; (c) pattern 3, 1% false alarm rate; (d) pattern 3, 3% false alarm rate. Each panel plots the best hit rate of 3L and RB against the average number of added CDRs.
On the 'intensive' patterns (especially pattern 2) both systems perform well, with the rule-based method performing slightly better. On the other hand, the 3L method is more sensitive to 'less obvious' patterns (such as pattern 4). The 3L method performs better under a lower false positive rate, but gradually loses its superiority when given higher false alarm rates. We have also examined the sensitivity of the technique to the 'size' of the fraud (i.e., the number of fraudulent calls). To do this, we used fraud patterns 2 and 3. Pattern 2 ('call reselling') is 'general': all types of voice calls, throughout the day, with various durations. Pattern 3 is 'specific': international data calls in the evenings. For each pattern, we added fraudulent calls to the selected accounts, gradually increasing the number of added calls per day from 3 to 60. For each number of added calls, we ran both methods, each with 240 combinations of thresholds. We then checked how the hit rate increases with the number of added calls, given a fixed false alarm rate. Fig. 3 shows the results. On the general pattern (Fig. 3(a) and 3(b)), the 3L method is more sensitive to small cases of fraud, and gradually loses its superiority when the number of added calls increases. The difference is more significant under lower false alarm rates. On the specific pattern (Fig. 3(c) and 3(d)) 3L reaches a relatively high hit rate for small numbers of added calls, significantly better than that of RB.
11 Conclusion and Future Work Three Level Profiling provides a comprehensive representation of customer behavior. Using this profiling method, rather than profiles based on predefined usage patterns, the system can cope with the dynamic nature of telecommunication fraud.
Initial experiments show a superiority of this method over the rule-based method. In the future, we intend to apply the profiling technique to identify changes in calling behavior for other purposes, such as identifying marketing opportunities and offering new incentives or price plans to customers following a behavioral change.
Acknowledgements This paper is based on an M.Sc. thesis written by Uzi Murad under the supervision of Professor Victor Brailovsky at Tel Aviv University. We thank Victor Brailovsky for insightful discussions and Saharon Rosset for helpful comments and suggestions.
References
1. Burge, P., Shawe-Taylor, J.: Detecting Cellular Fraud Using Adaptive Prototypes. In: Proceedings of the AAAI-97 Workshop on AI Approaches to Fraud Detection and Risk Management. Providence, RI (1997) 9-13
2. Duda, R. O., Hart, P. E.: Pattern Classification and Scene Analysis. John Wiley & Sons, Inc., New York, NY (1973)
3. Fawcett, T., Provost, F.: Adaptive Fraud Detection. In: Fayyad, U., Mannila, H., Piatetsky-Shapiro, G. (Eds.): Data Mining and Knowledge Discovery, Vol. 1. Kluwer Academic Publishers, Boston, MA (1997) 291-316
4. Fawcett, T., Provost, F.: Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions. In: Agrawal, R., Stolorz, P., Piatetsky-Shapiro, G. (Eds.): Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA (1997) 43-48
5. Jain, A. K., Dubes, R. C.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ (1988)
6. Kittler, J.: Feature Selection and Extraction. In: Young, T. Y., Fu, K. (Eds.): Handbook of Pattern Recognition and Image Processing. Academic Press Inc., Orlando, FL (1986) 59-83
7. Kokkinaki, A. I.: On Atypical Database Transactions: Identification of Probable Fraud using Machine Learning for User Profiling. In: Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop (1997) 107-113
8. McQueen, J. B.: Some Methods for Classification and Analysis of Multivariate Observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (1967) 281-297
9. Moreau, Y., Vandewalle, J.: Detection of Mobile Phone Fraud using Supervised Neural Networks: A First Prototype. Available via ftp://ftp.esat.kuleuven.ac.be/pub/SISTA/moreau/reports/icann97_TR97-44.ps (1997)
10. Moreau, Y., Preneel, B., Burge, P., Shawe-Taylor, J., Stoermann, C., Cook, C.: Novel Techniques for Fraud Detection in Mobile Telecommunication Networks. In: ACTS Mobile Summit, Granada, Spain (1997)
11. Murad, U.: Three Level Profiling for Telecommunication Fraud Detection. M.Sc. thesis, Tel Aviv University, Israel (1999)
OPTICS-OF: Identifying Local Outliers

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng (1), Jörg Sander

Institute for Computer Science, University of Munich
Oettingenstr. 67, D-80538 Munich, Germany
{breunig | kriegel | ng | sander}@dbs.informatik.uni-muenchen.de
phone: +49-89-2178-2225, fax: +49-89-2178-2192
Abstract: For many KDD applications finding the outliers, i.e. the rare events, is more interesting and useful than finding the common cases, e.g. detecting criminal activities in E-commerce. Being an outlier, however, is not just a binary property. Instead, it is a property that applies to a certain degree to each object in a data set, depending on how ‘isolated’ this object is, with respect to the surrounding clustering structure. In this paper, we formally introduce a new notion of outliers which bases outlier detection on the same theoretical foundation as density-based cluster analysis. Our notion of an outlier is ‘local’ in the sense that the outlier-degree of an object is determined by taking into account the clustering structure in a bounded neighborhood of the object. We demonstrate that this notion of an outlier is more appropriate for detecting different types of outliers than previous approaches, and we also present an algorithm for finding them. Furthermore, we show that by combining the outlier detection with a density-based method to analyze the clustering structure, we can get the outliers almost for free if we already want to perform a cluster analysis on a data set.
1.
Introduction
Larger and larger amounts of data are collected and stored in databases, increasing the need for efficient and effective analysis methods to make use of the information contained implicitly in the data. Knowledge discovery in databases (KDD) has been defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [9]. Corresponding to the kind of patterns to be discovered, several KDD tasks can be distinguished. Most research in KDD and data mining is concerned with identifying patterns that apply to a large percentage of objects in a data set. For example, the goal of clustering is to identify a set of categories or clusters that describes the structure of the whole data set. The goal of classification is to find a function that maps each data object into one of several given classes. On the other hand, there is another important KDD task applying only to very few objects deviating from the majority of the objects in a data set. Finding exceptions and outliers has not yet received much attention in the KDD area (cf. section 2). However, for applications such as detecting criminal activities of various kinds (e.g. in electronic commerce), finding rare events, deviations from the majority, or exceptional cases may be more interesting and useful than the common cases. 1. On sabbatical from: Dept. of CS, University of British Columbia, Vancouver, Canada,
[email protected].
Outliers and clusters in a data set are closely related: outliers are objects deviating from the major distribution of the data set; in other words: being an outlier means not being in or close to a cluster. However, being an outlier is not just a binary property. Instead, it is a property that applies to a certain degree to each object, depending on how ‘isolated’ the object is. Formalizing this intuition leads to a new notion of outliers which is ‘local’ in the sense that the outlier-degree of an object takes into account the clustering structure in a bounded neighborhood of the object. Thus, our notion of outliers is strongly connected to the notion of the density-based clustering structure of a data set. We show that both the cluster-analysis method OPTICS (“Ordering Points To Identify the Clustering Structure”), which has been proposed recently [1], as well as our new approach to outlier detection, called OPTICS-OF (“OPTICS with Outlier Factors”), are based on a common theoretical foundation. The paper is organized as follows. In section 2, we will review related work. In section 3, we show that global definitions of outliers are inadequate for finding all points that we wish to consider as outliers. This observation leads to a formal and novel definition of outliers in section 4. In section 5, we give an extensive example illustrating the notion of local outliers. We propose an algorithm to mine these outliers in section 6 including a comprehensive discussion of performance issues. Conclusions and future work are given in section 7.
2.
Related Work
Most of the previous studies on outlier detection were conducted in the field of statistics. These studies can be broadly classified into two categories. The first category is distribution-based, where a standard distribution (e.g. Normal, Poisson, etc.) is used to fit the data best. Outliers are defined based on the distribution. Over one hundred tests of this category, called discordancy tests, have been developed for different scenarios (see [4]). A key drawback of this category of tests is that most of the distributions used are univariate. There are some tests that are multivariate (e.g. multivariate normal outliers). But for many KDD applications, the underlying distribution is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results. The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-d space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths. Many definitions of depth have been proposed (e.g. [13], [15]). In theory, depth-based approaches could work for large values of k. However, in practice, while there exist efficient algorithms for k = 2 or 3 ([13], [11]), depth-based approaches become inefficient for large data sets for k ≥ 4. This is because depth-based approaches rely on the computation of k-d convex hulls, which has a lower bound complexity of Ω(n^(k/2)). Recently, Knorr and Ng proposed the notion of distance-based outliers [12]. Their notion generalizes many notions from the distribution-based approaches, and enjoys better computational complexity than the depth-based approaches for larger values of k. Later, in section 3, we will discuss in detail how their notion differs from the notion of local outliers proposed in this paper. Given the importance of the area, fraud detection has received more attention than the general area of outlier detection. Depending on the specifics of the application
domains, elaborate fraud models and fraud detection algorithms have been developed (e.g. [8], [6]). In contrast to fraud detection, the kinds of outlier detection work discussed so far are more exploratory in nature. Outlier detection may indeed lead to the construction of fraud models.
3.
Problems of Current (non-local) Approaches
As we have seen in section 2, most of the existing work in outlier detection lies in the field of statistics. Intuitively, outliers can be defined as given by Hawkins [10]. Definition 1: (Hawkins-Outlier) An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. This notion is formalized by Knorr and Ng [12] in the following definition of outliers. Definition 2: (DB(p,d)-Outlier) An object o in a data set D is a DB(p,d)-outlier if at least fraction p of the objects in D lies greater than distance d from o. Below, we will show that definition 2 captures o1 only certain kinds of outliers. Its shortcoming is C1 that it takes a global view of the data set. The fact that many interesting real-world data sets exhibit a more complex structure, in which objects are only outliers relative to their local, surrounding object distribution, is ignored. We give an examples of a data set containing objects that are outliers accordo3 o2 ing to Hawkins’ definition for which no values for C2 p and d exist such that they are DB(p,d)-outliers. Figure 1 shows a 2-d dataset containing 43 objects. Fig. 1. 2-d dataset DS1 It consists of 2 clusters C1 and C2, each consisting of 20 objects, and there are 3 additional objects o1, o2 and o3. Intuitively, and according to definition 1, o1, o2 and o3 are outliers, and the points belonging to the clusters C1 and C2 are not. For an object o and a set of objects S, let d(o,S) = min{ d(o,s) | s ∈ S }. Let us consider the notion of outliers according to definition 2: • o1: For every d ≤ d(o1,C1) and p ≤ 42/43, o1 is a DB(p,d) outlier. For smaller values of p, d can be even larger. • o2: For every d ≤ d(o2, C1) and p ≤ 42/43, o2 is a DB(p,d) outlier. Again, for smaller values of p, d can be even larger. • o3: Assume that for every point q in C1, the distance from q to its nearest neighbor is larger than d(o3, C2). In this case, no combination of p and d exists such that o3 is an DB(p,d) outlier and the points in C1 are not: - For every d ≤ d(o3, C2), p=42/43 percent of all points are further away from o3 than d. However, this condition also holds for every point q ∈ C1. Thus, o3 and all q ∈ C1 are DB(p,d)-outliers. - For every d > d(o3, C2), the fraction of points further away from o3 is always smaller than for any q ∈ C1, so either o3 and all q ∈ C1 will be considered outliers or (even worse) o3 is not an outlier and all q ∈ C1 are outliers.
From this example, we infer that definition 2 is only adequate under certain, limited conditions, but not for the general case that clusters of different densities exist. In these cases definition 2 will fail to find the local outliers, i.e. outliers that are outliers relative to their local surrounding data space.
4.
Formal Definition of Local Outliers
In this section, we develop a formal definition of outliers that more truly corresponds to the intuitive notion of definition 1, avoiding the shortcomings presented in section 3. Our definition will correctly identify local outliers, such as o3 in figure 1. To achieve this, we do not explicitly label the objects as “outlier” or “not outlier”; instead we compute the level of outlier-ness for every object by assigning an outlier factor. Definition 3: (ε-neighborhood and k-distance of an object p) Let p be an object from a database D, let ε be a distance value, let k be a natural number and let d be a distance metric on D. Then: • the ε-neighborhood Nε(p) are the objects x with d(p,x)≤ε: Nε(p) = { x∈D | d(p,x)≤ε}, • the k-distance of p, k-distance(p), is the distance d(p,o) between p and an object o ∈D such that at least for k objects o’∈D it holds that d(p,o’) ≤ d(p,o), and for at most k-1 objects o’∈D it holds that d(p,o’) < d(p,o). Note that k-distance(p) is unique, although the object o which is called ‘the’ k-nearest neighbor of p may not be unique. When it is clear from the context, we write Nk(p) as a shorthand for Nk-distance(p)(p), i.e. Nk(p) = { x ∈ D | d(p,x) ≤ k-distance(p)}. The objects in the set Nk(p) are called the “k-nearest-neighbors of p” (although there may be more than k objects in Nk(p) if the k-nearest neighbor of p is not unique). Before we can formally introduce our notion of outliers, we have to introduce some basic notions related to the density-based cluster structure of the data set. In [7] a formal notion of clusters based on point density is introduced. The point density is measured by the number of objects within a given area. The basic idea of the clustering algorithm DBSCAN is that for each object of a cluster the neighborhood of a given radius (ε) has to contain at least a minimum number of objects (MinPts). An object p whose ε-neighborhood contains at least MinPts objects is said to be a core object. Clusters are formally defined as maximal sets of density-connected objects. An object p is density-connected to an object q if there exists an object o such that both p and q are density-reachable from o (directly or transitively). An object p is said to be directly density-reachable from o if p lies in the neighborhood of o and o is a core object [7]. A ‘flat’ partitioning of a data set into a set of clusters is useful for many applications. However, an important property of many real-world data sets is that their intrinsic cluster structure cannot be characterized by global density parameters. Very different local densities may be needed to reveal and describe clusters in different regions of the data space. Therefore, in [1] the density-based clustering approach is extended and generalized to compute not a single flat density-based clustering of a data set, but to create an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. This cluster-ordering of a data set is based on the notions of core-distance and reachability-distance.
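Definition 3 above translates into a few lines of Python; the brute-force neighbour search and the Euclidean metric are illustrative choices only.

import math

def k_distance_and_neighbors(p, data, k, dist=math.dist):
    # Definition 3: k-distance(p) is the distance to the k-th nearest neighbour of p;
    # N_k(p) holds every object within that distance (possibly more than k objects)
    others = [o for o in data if o != p]
    kdist = sorted(dist(p, o) for o in others)[k - 1]
    neighbors = [o for o in others if dist(p, o) <= kdist]
    return kdist, neighbors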
Definition 4: (core-distance of an object p) Let p be an object from a database D, let ε be a distance value and let MinPts be a natural number. Then, the core-distance of p is defined as

core-distance_{ε,MinPts}(p) = UNDEFINED,           if |N_ε(p)| < MinPts
                              MinPts-distance(p),  otherwise
The core-distance of object p is the smallest distance ε’ ≤ ε such that p is a core object with respect to ε’ and MinPts if such an ε’ exists, i.e. if there are at least MinPts objects within the ε-neighborhood of p. Otherwise, the core-distance is UNDEFINED. Definition 5: (reachability-distance of an object p w.r.t. object o) Let p and o be objects from a database D, p ∈ Nε(o), let ε be a distance value and let MinPts be a natural number. Then, the reachability-distance of p with respect to o is defined as reachability-distanceε,MinPts(p, o) = UNDEFINED, if |N ε(o) | < MinPts max ( core-distance ε, MinPts ( o ), d ( o, p ) ) otherwise
core (o
)
The reachability-distance of an object p with respect to object o is the smallest distance such that p is directly density-reachable from o if o is a core object within p’s ε-neighborhood. To capture this idea, the reachability-distance of p with respect to o cannot be smaller than the core-distance of o since for smaller distances no object is directly density-reachable from o. Otherwise, if o is not a core object, the reachability-distance is UNDEFINED. Figure 2 illustrates the core-distance and the reachability-distance. The core-distance and reachability-distance were originally introduced for the OPTICS-algorithm [1]. The OPTICS-algorithm computes a “walk” through ε p the data set, and calculates for each object o the corer(p 1 distance and the smallest reachability-distance with o respect to an object considered before o in the walk. Such a walk through the data satisfies the following r(p 2 condition: Whenever a set of objects C is a densitybased cluster with respect to MinPts and a value ε’ p smaller than the value ε used in the OPTICS algoFig. 2. Core-distance(o), rithm, then a permutation of C (possibly without a reachability-distances r(p1,o), few border objects) is a subsequence in the walk. r(p2,o) for MinPts=4 Therefore, the reachability-plot (i.e. the reachability values of all objects plotted in the OPTICS ordering) yields an easy to understand visualization of the clustering structure of the data set. Roughly speaking, a low reachability-distance indicates an object within a cluster, and a high reachability-distance indicates a noise object or a jump from one cluster to another cluster. The reachability-plot for our dataset DS1 is depicted in figure 3 (top). The global structure revealed shows that there are the two clusters, one of which is more dense than the other, and a few objects outside the clusters. Another example of a reachability-plot for the more complex data set DS2 (figure 4) containing hierarchical clusters is depicted in figure 5.
Definition 6: (local reachability density of an object p) Let p be an object from a database D and let MinPts be a natural number. Then, the local reachability density of p is defined as

lrdMinPts(p) = 1 / ( ( Σo∈NMinPts(p) reachability-distance∞,MinPts(p, o) ) / |NMinPts(p)| )
The local reachability density of an object p is the inverse of the average reachability-distance from the MinPts-nearest-neighbors of p. The reachability-distances occurring in this definition are all defined, because ε=∞. The lrd is ∞ if all reachability-distances are 0. This may occur for an object p if there are at least MinPts objects, different from p, but sharing the same spatial coordinates, i.e. if there are at least MinPts duplicates of p in the data set. For simplicity, we will not handle this case explicitly but simply assume that there are no duplicates. (To deal with duplicates, we can base our notion of neighborhood on a k-distinct-distance, defined analogously to k-distance in definition 3 with the additional requirement that there be at least k different objects.)

The reason for using the reachability-distance instead of simply the distance between p and its neighbors o is that it will significantly weaken statistical fluctuations of the inter-object distances: lrds for objects which are close to each other in the data space (whether in clusters or noise) will in general be equaled by using the reachability-distance, because it is at least as large as the core-distance of the respective object o. The strength of the effect can be controlled by the parameter MinPts. The higher the value for MinPts, the more similar the reachability-distances for objects within the same area of the space. Note that there is a similar ‘smoothing’ effect for the reachability-plot produced by the OPTICS algorithm, but in this case of clustering we also weaken the so-called ‘single-link effect’ [14].

Fig. 3. Reachability-plot and outlier factors for DS1

Definition 7: (outlier factor of an object p) Let p be an object from a database D and let MinPts be a natural number. Then, the outlier factor of p is defined as

OFMinPts(p) = ( Σo∈NMinPts(p) lrdMinPts(o) / lrdMinPts(p) ) / |NMinPts(p)|
The outlier factor of the object p captures the degree to which we call p an outlier. It is the average of the ratios of the lrds of the MinPts-nearest-neighbors and of p. If these lrds are identical, which we expect for objects in clusters of uniform density, the outlier factor is 1. If the lrd of p is only half of the lrds of p’s MinPts-nearest-neighbors, the outlier factor of p is 2. Thus, the lower p’s lrd is and the higher the lrds of p’s MinPts-nearest-neighbors are, the higher is p’s outlier factor.
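To make Definitions 6 and 7 concrete, here is a short Python sketch (ours, not the authors' code) for the case ε = ∞, in which every reachability-distance is defined; it assumes, as the text does, that the data contains no duplicate points:

from math import dist

def neighborhood(p, data, min_pts):
    # N_MinPts(p): objects within the MinPts-distance of p (ties kept)
    kd = sorted(dist(p, o) for o in data if o != p)[min_pts - 1]
    return [o for o in data if o != p and dist(p, o) <= kd]

def core_dist(p, data, min_pts):
    return sorted(dist(p, o) for o in data if o != p)[min_pts - 1]

def lrd(p, data, min_pts):
    # Definition 6: inverse of the average reachability-distance from p's neighbors
    nbrs = neighborhood(p, data, min_pts)
    reach = [max(core_dist(o, data, min_pts), dist(o, p)) for o in nbrs]
    return len(reach) / sum(reach)

def outlier_factor(p, data, min_pts):
    # Definition 7: average ratio of the neighbors' lrds to p's lrd
    nbrs = neighborhood(p, data, min_pts)
    return sum(lrd(o, data, min_pts) for o in nbrs) / (len(nbrs) * lrd(p, data, min_pts))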
Figure 3 (top) shows the reachability-plot for DS1 generated by OPTICS [1]. Two clusters are visible: first the dense cluster C2, then points o3 and o1 (larger reachability values) and - after the large reachability value indicating a jump - all of cluster C1 and finally o2. Depicted below the reachability-plot are the corresponding outlier factors (the objects are in the same order as in the reachability-plot). Object o1 has the largest outlier factor (3.6), followed by o2 (2.0) and o3 (1.4). All other objects are assigned outlier factors between 0.993 and 1.003. Thus, our technique successfully highlights not only the global outliers o1 and o2 (which are also DB(p,d)-outliers), but also the local outlier o3 (which is not a reasonable DB(p,d)-outlier).
5. An Extensive Example

Fig. 4. Example dataset DS2
In this section, we demonstrate the effectiveness of the given definition using a complex 2-d example data set (DS2, figure 4, 473 points), containing most characteristics of real-world data sets, i.e. hierarchical/overlapping clusters and clusters of widely differing densities and arbitrary shapes. We give a small 2-d example to make it easier to understand the concepts. Our approach, however, works equally well in higher dimensional spaces. DS2 consists of 3 clusters of uniform (but different) densities and one hierarchical cluster of a low density containing 2 small and 1 bigger subcluster. The data set also contains 12 outliers.
Fig. 5. Reachability-plot (ε=50, MinPts=10) and outlier factors OF10 for DS2

Figure 5 (top) shows the reachability-plot generated by OPTICS. We see 3 clusters with different, but uniform densities in areas 1, 2 and 3, a large, hierarchical cluster in area 4 and its subclusters in areas 4.1, 4.2 and 4.3. The noise points (outliers) have to be located in areas N1, N2, N3 and N4. Figure 5 (bottom) shows the outlier factors for MinPts=10 (objects in the same order). Most objects are assigned outlier factors of around 1. In areas N3 and N4 there is one point each (o1 and o2) with outlier factors of 3.0 and 2.7 respectively, characterizing outliers with local reachability densities about half to one third of the surrounding space. The most interesting area is N1. The outlier factors are between 1.7 and 6.3. The first two points with high outlier factors 5.4 and 6.3 are o3 and o4. Both only have one close neighbor (the other one) and all other neighbors are far away in the cluster in area 3, which has a high density (recall that for MinPts=10 we are looking at the 10-nearest-neighbors). Thus, o3 and o4 are assigned large outlier factors. The other points in N1, however, are assigned much smaller
(but still significantly larger than 1) outlier factors between 1.7 and 2.4. These are the points surrounding o5, which can either be considered a small, low density cluster or outliers, depending on one’s viewpoint. Object o5, as the center point of this low density cluster, is assigned the lowest outlier factor of 1.7, because it is surrounded by points of equal local reachability density. We also see that, from the reachability-plot, we can only infer that all points in N1 are in an area of low density, because the reachability values are high. However, no evaluation concerning their outlierness is possible.
6. Mining Local Outliers - Performance Considerations
To compute the outlier-factors OFMinPts(p) for all objects p in a database, we have to perform three passes over the data. In the first pass, we compute NMinPts(p) and core-distance∞,MinPts(p). In the second pass, we calculate the reachability-distance∞,MinPts(p,o) of p with respect to its neighboring objects o∈NMinPts(p) and lrdMinPts(p). In the third pass, we compute the outlier factors OF(p). The runtime of the whole procedure is heavily dominated by the first pass over the data, since we have to perform k-nearest-neighbor queries in a multidimensional database, i.e. the runtime of the algorithm is O(n * runtime of a MinPts-nearest-neighbor query). Obviously, the total runtime depends on the runtime of the k-nearest-neighbor query. Without any index support, a scan through the whole database has to be performed to answer a k-nearest-neighbor query. In this case, the runtime of our outlier detection algorithm would be O(n²). If a tree-based spatial index can effectively be used, the runtime is reduced to O(n log n), since k-nearest-neighbor queries are supported efficiently by spatial access methods such as the R*-tree [3] or the X-tree [2] for data from a vector space, or the M-tree [5] for data from a metric space. The height of such a tree-based index is O(log n) for a database of n objects in the worst case and, at least in low-dimensional spaces, a query with a reasonable value for k has to traverse only a limited number of paths.

If the OPTICS algorithm is also applied to the data set, i.e. if we also want to perform some kind of cluster analysis, we can drastically reduce the cost of the outlier detection. The OPTICS algorithm retrieves the ε-neighborhood Nε(p) for each object p in the database, where ε is an input parameter. These ε-neighborhoods can be utilized in the first pass over the data for our outlier detection algorithm: only if the neighborhood Nε(p) of p does not already contain MinPts objects do we have to perform a MinPts-nearest-neighbor query for p to determine NMinPts(p). Otherwise, we can retrieve NMinPts(p) from Nε(p), since then it holds that NMinPts(p) ⊆ Nε(p). Our experiments indicate that in real applications, for reasonable values of ε and MinPts, the second case is much more frequent than the first.
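The three passes can be sketched as follows (our illustration under our own naming, using the index-free O(n²) brute-force neighborhood query discussed above and ε = ∞ as in definition 6):

from math import dist

def outlier_factors(data, min_pts):
    n = len(data)
    nbrs, core = [], []
    # Pass 1: MinPts-nearest-neighborhoods and core-distances (brute force, O(n^2))
    for i, p in enumerate(data):
        order = sorted((dist(p, data[j]), j) for j in range(n) if j != i)
        kd = order[min_pts - 1][0]
        nbrs.append([j for d, j in order if d <= kd])
        core.append(kd)
    # Pass 2: local reachability densities (assumes no duplicate points)
    lrd = []
    for i, p in enumerate(data):
        reach = [max(core[j], dist(data[j], p)) for j in nbrs[i]]
        lrd.append(len(reach) / sum(reach))
    # Pass 3: outlier factors
    return [sum(lrd[j] for j in nbrs[i]) / (len(nbrs[i]) * lrd[i]) for i in range(n)]

grid = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
print(max(outlier_factors(grid + [(4.0, 4.0)], min_pts=4)))  # the isolated point scores highest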
7. Conclusions
Finding outliers is an important task for many KDD applications. All proposals so far have considered ‘being an outlier’ a binary property. We argue instead that it is a property that applies to a certain degree to each object in a data set, depending on how ‘isolated’ the object is with respect to the surrounding clustering structure. We formally defined
the notion of an outlier factor, which captures exactly this relative degree of isolation. The outlier factor is local by taking into account the clustering structure in a bounded neighborhood of the object. We demonstrated that this notion is more appropriate for detecting different types of outliers than previous approaches. Our definitions are based on the same theoretical foundation as density-based cluster analysis and we show how to analyze the cluster structure and the outlier factors efficiently at the same time. In ongoing work, we are investigating the properties of our approach in a more formal framework, especially with regard to the influence of the MinPts value. Future work will include the development of a more efficient and an incremental version of the algorithm based on the results of this analysis.
References
1. Ankerst M., Breunig M. M., Kriegel H.-P., Sander J.: “OPTICS: Ordering Points To Identify the Clustering Structure”, Proc. ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, PA, 1999.
2. Berchtold S., Keim D., Kriegel H.-P.: “The X-Tree: An Index Structure for High-Dimensional Data”, Proc. 22nd Int. Conf. on Very Large Data Bases, Bombay, India, 1996, pp. 28-39.
3. Beckmann N., Kriegel H.-P., Schneider R., Seeger B.: “The R*-tree: An Efficient and Robust Access Method for Points and Rectangles”, Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, ACM Press, New York, 1990, pp. 322-331.
4. Barnett V., Lewis T.: “Outliers in Statistical Data”, John Wiley, 1994.
5. Ciaccia P., Patella M., Zezula P.: “M-tree: An Efficient Access Method for Similarity Search in Metric Spaces”, Proc. 23rd Int. Conf. on Very Large Data Bases, Athens, Greece, 1997, pp. 426-435.
6. DuMouchel W., Schonlau M.: “A Fast Computer Intrusion Detection Algorithm Based on Hypothesis Testing of Command Transition Probabilities”, Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, AAAI Press, 1998, pp. 189-193.
7. Ester M., Kriegel H.-P., Sander J., Xu X.: “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.
8. Fawcett T., Provost F.: “Adaptive Fraud Detection”, Data Mining and Knowledge Discovery, Kluwer Academic Publishers, Vol. 1, No. 3, pp. 291-316.
9. Fayyad U., Piatetsky-Shapiro G., Smyth P.: “Knowledge Discovery and Data Mining: Towards a Unifying Framework”, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 1996, pp. 82-88.
10. Hawkins D.: “Identification of Outliers”, Chapman and Hall, London, 1980.
11. Johnson T., Kwok I., Ng R.: “Fast Computation of 2-Dimensional Depth Contours”, Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, AAAI Press, 1998, pp. 224-228.
12. Knorr E. M., Ng R. T.: “Algorithms for Mining Distance-Based Outliers in Large Datasets”, Proc. 24th Int. Conf. on Very Large Data Bases, New York, NY, 1998, pp. 392-403.
13. Preparata F., Shamos M.: “Computational Geometry: An Introduction”, Springer, 1988.
14. Sibson R.: “SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method”, The Computer Journal, Vol. 16, No. 1, 1973, pp. 30-34.
15. Tukey J. W.: “Exploratory Data Analysis”, Addison-Wesley, 1977.
Selective Propositionalization for Relational Learning

Érick Alphonse, Céline Rouveirol
Inference and Learning Group, LRI - UMR 8623 CNRS
Bâtiment 490, Université Paris-Sud, 91405 Orsay Cedex (France)
{alphonse,celine}@lri.fr

(This work has been partially supported by ESPRIT through LTR ILP 2 n. 20237.)

Abstract. A number of Inductive Logic Programming (ILP) systems have addressed the problem of learning First Order Logic (FOL) discriminant definitions by first reformulating the problem expressed in a FOL framework into an attribute-value problem and then applying efficient algebraic learning techniques. The complexity of such propositionalization methods is now in the size of the reformulated problem, which can be exponential. We propose a method that selectively propositionalizes the FOL training set by interleaving boolean reformulation and algebraic resolution. It avoids, as much as possible, the generation of redundant boolean examples, and still ensures that explicit correct and complete definitions are learned.
1 Introduction
Learning relational concepts from examples stored in a multi-relational database has been identified as a challenge for Inductive Logic Programming (ILP) techniques by both the KDD and ILP communities [3]. However, it is a well-known fact that the counterpart of learning in restrictions of FOL, even relational ones, is the dramatic complexity of the coverage test between a hypothesis and an example. Here, we address discriminant concept learning in Datalog target concept languages (Horn clause languages without function symbols other than constants). In such languages, the exponential complexity of subsumption (classically θ-subsumption [9]) is inherent to the non-determinacy of the computation of “matching” substitutions between a hypothesis and an example. This can happen when the Entity-Relationship schema of the target relational database contains 1-n or n-n associations. While a number of specific biases have been developed directly in an FOL framework to control this indeterminacy by restricting the target concept language (see for instance, ij-determination [8]), a family of ILP methods (among others, LINUS [6], STILL [13], REPART [15], SP [5]) have addressed this problem by propositionalizing the ILP problem, i.e., by reformulating the ILP learning problem into an attribute-value or even boolean one, which can then be
handled by learning techniques dedicated to this simpler formalism. Once the representation change has been performed, robust and efficient algorithms can be successfully applied, provided that the discriminant features of the FOL learning problem are preserved by propositionalization. Propositionalizations in those systems all adopt the same schema: given a pattern P, FOL examples are reformulated into their (potentially multiple) matchings with P, yielding a tabular representation. Of course, the subsumption test being of exponential complexity in an unrestricted Datalog language, the reformulated problem can be exponential in size [1] as well as highly redundant, and cannot be directly addressed as such for complex relational learning problems. This paper presents a selective propositionalization that controls the size of the reformulated problem: instead of generating the whole boolean reformulation of the FOL problem before resolution, this method interleaves boolean reformulation and algebraic resolution. Information gathered during algebraic resolution is used to constrain the generation of the reformulated boolean problem to the boolean vectors that are useful for the next refinement step(s) only. In doing so, it avoids, as much as possible, the generation of redundant boolean examples, enables partial storing of positive boolean instances only, and still ensures that correct and complete definitions are learned.
2 Background
After [12, 15, 11], a learning problem can be decomposed into two subproblems, a relational (or structural) one and a functional one. To illustrate this decomposition, consider learning from examples stored in a multi-relational database. Here, learning from literals representing the multiple foreign key links [14] among tuples of different relations is a structural learning problem, whereas learning on the other (mono-valued) attributes of those relations is a functional one. Consequently, this paper focuses on relational learning, which is typically a non-determinate learning problem, within a Datalog target concept language without constants and without restriction on the depth or level of “indeterminacy” of existential variables [8]. In such a language, the propositionalization process is described as follows:

Definition 1. The pattern P is built from a seed positive example e, as the maximal generalization of e plus equality constraints between pairs of variables in the pattern which are satisfied by e (see example below). Each training example is then translated into a set of boolean vectors. For each matching σi of P variables onto constants of e, the attributes of the boolean vector associated to a FOL example indicate which constraints of the pattern (presence/absence of a literal, links between variables of the pattern) are satisfied by σi.

Thus, the FOL search space is shifted to a boolean lattice ordered by boolean inclusion, denoted ≺b. The search space of the reformulated problem is then that of concepts more general than or equal to the seed example: a partial mapping of P literals to FOL example literals yields a more general boolean vector than P, whereas a complete mapping yields a boolean vector equivalent to P. For instance, if E, CE are a positive and a negative example of the target concept and E′ is the seed example, the obtained tabular representation is:
E : c(a) ← p(a, b), p(b, c), q(c), q(a).
E′ : c(a) ← p(a, b), q(b), q(a), r(c).
CE : c(a) ← p(a, b), p(b, c), q(b), q(c).

P        c(U)  p(V,W)  q(X)  q(Y)  r(Z)  U=V  U=Y  V=Y  W=X
θE,1      1     1       1     1     0     1    0    0    0
θE,2      1     1       1     1     0     1    1    1    0
θE,i      1     1       1     1     0     0    0    0    1
θCE,1     1     1       1     1     0     1    0    0    1
θCE,2     1     1       1     1     0     1    0    0    1
θCE,j     1     1       1     1     0     0    0    1    1

Fig. 1. Excerpt of the boolean representation of a FOL problem
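The reformulation of Definition 1 can be illustrated with a few lines of Python (a toy sketch with our own encoding, not the authors' system); the pattern of Fig. 1 is hard-coded, and every assignment of its variables to constants of an example yields one boolean vector over the pattern attributes:

from itertools import product

literals = [("c", ("U",)), ("p", ("V", "W")), ("q", ("X",)), ("q", ("Y",)), ("r", ("Z",))]
equalities = [("U", "V"), ("U", "Y"), ("V", "Y"), ("W", "X")]
variables = ["U", "V", "W", "X", "Y", "Z"]

def boolean_vectors(example):
    # every matching of the pattern variables onto the constants of the example
    facts = set(example)
    constants = sorted({c for _, args in example for c in args})
    vectors = set()
    for values in product(constants, repeat=len(variables)):
        sigma = dict(zip(variables, values))
        vec = tuple(int((pred, tuple(sigma[v] for v in args)) in facts)
                    for pred, args in literals)
        vec += tuple(int(sigma[a] == sigma[b]) for a, b in equalities)
        vectors.add(vec)
    return vectors

E = [("c", ("a",)), ("p", ("a", "b")), ("p", ("b", "c")), ("q", ("c",)), ("q", ("a",))]
print(len(boolean_vectors(E)))  # number of distinct boolean vectors obtained from E

For instance, the matching {U/a, V/a, W/b, X/c, Y/a} reproduces the row θE,2 of Fig. 1.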
Moreover, as pointed out in [15], the learning task is no longer to induce a Datalog concept consistent with all boolean vectors, but what is referred to as the multi-part problem (as noted by the authors, this learning problem is closely related to what Dietterich termed the multi-instance problem [2]):

Definition 2. (after [15]) The multi-part problem consists of finding a description that covers, for all FOL positive examples, at least one of their associated boolean vectors (completeness) and none of the boolean vectors associated to any FOL negative example (correctness).
3 State of the Art
Although the reformulated problem can be delegated to efficient and robust algorithms (SP [5] with C4.5 [10], REPART with CHARADE [4]), the space complexity becomes intractable, as does the time complexity. As pointed out by [1], boolean learners working on the reformulated problem must deal with data of exponential size wrt the FOL problem. Indeed, following definition 1, each positive or negative FOL example is described by a set of boolean vectors (termed in the remainder of the paper positive and negative boolean vectors respectively), the cardinality of which is equal to the number of its multiple matchings with the propositionalization pattern. As far as we know, two learning systems have addressed this problem: STILL [13] and REPART [15]. For the former, propositionalization is performed through a stochastic selection of η example matchings with the pattern (η is a system parameter), which allows for bounding the size of the reformulated problem, yielding a polynomial generalization process. To offset the imperfection of such generalizations as “standalone” classifiers, STILL learns a committee of them, that classifies unseen examples in a nearest neighbor-like way. For the latter, restriction of the reformulated problem is performed through the choice of a relevant propositionalization pattern. The user/expert must provide a pattern as a (strong) bias which allows him to drastically decrease the matching space. The validity of the method relies on the assumption that the selected pattern preserves the discrimination information sought for. As the FOL learning problem is propositionalized before resolution, this system nevertheless has to cope with the size of the reformulated data. We propose an alternative method in order to both reduce the data size of the reformulated problem and avoid data storing as much as possible.
4 Selective Propositionalization
build P from a seed positive example (see def. 1)
initialize G as the universal element of the search space
For each ce ∈ CE do
  Repeat
    Select g ∈ G
    (1) Compute a boolean vector b from ce   (* P ≺b b ≺b g *)
    If b is equal to P Then
      no structural discrimination is possible
    Else
      Specialize G to discriminate b   (* algebraic resolution *)
      (2) Evaluate each element of G wrt positive example coverage
      Update G   (* beam search strategy *)
    Endif
  Until all elements of G are correct
Endfor
return(G)

Fig. 2. Computation of n elements of G
The overall structure of our algorithm is quite classical. It is based on the Candidate Elimination Algorithm [7] and implements a covering method for learning disjunctive concepts. The algorithm computes a set of maximally general and correct solutions by a top-down search in a boolean search space. The two original ideas of the algorithm stem from the fact that the boolean examples handled by the algorithm are not generated before learning proceeds, but during resolution. Therefore, as opposed to classical propositionalization methods (see sec. 1) which compute as many boolean examples as the number of matchings between the pattern and FOL examples, this algorithm constructively exploits: i) information gathered during resolution to only generate boolean examples that are useful for (in)validating the current specialization step; ii) the partial ordering on the instance space in order to generate useful examples, that is, the “close to” most specific ones.

4.1 Exploiting Current Resolution Information
In classical propositionalization techniques, all (positive and negative) FOL examples are reformulated into their multiple matchings with a given pattern P . In contrast, our method only looks for boolean vectors that may invalidate the current hypothesis g and therefore yields a specialization of g. At each search step, it therefore attempts to build a negative boolean vector more specific than g, i.e., which contains at least all boolean attributes of g. For a Datalog language, several tentative matchings may be necessary (in the worst case, an exponential number) in order to build a matching substitution σ. However, the benefit of selective propositionalization wrt a classical propositionalization, in terms of the matching space explored, is theoretically (and empirically, as shown in our first experiments, sec. 5) substantial: the space of matching substitutions to be searched is induced by literals belonging to g as opposed to P , that is, by relevant predicates wrt the current discriminant task.
4.2 Partial Ordering on the Instance Space
The size of the reformulated boolean problem is upper-bounded by the number of matching substitutions between the pattern and the FOL examples (positive and negative), but a large fraction of these boolean vectors are redundant (see in fig. 1 θE,1 wrt θE,2 and θCE,1 wrt θCE,2) in that they do not directly take part in the process of building a correct and complete discriminant solution. Such redundant data occur when propositionalizing both negative and positive examples (respectively, steps 1 and 2 of the algorithm). Indeed, as far as negative examples are concerned, and after [12], there is a partial ordering (nearest-miss) of the negative instance space and it has been shown that only maximally specific negative examples wrt this partial ordering are sufficient for solving the discriminant learning problem. For positive examples, if we refer to definition 2, a FOL example is covered in the boolean search space if at least one of its corresponding boolean vectors is covered. Therefore, only the most specific ones wrt boolean inclusion are sufficient for our learning problem. After computing the matching substitution σ as stated above, the propositionalization will be as efficient as the extracted boolean vector is specific. We therefore complete σ by deterministically (and therefore with polynomial time complexity) matching literals of P with the FOL example. In doing so, we cannot ensure that the extracted boolean vector is a most specific one, which would require an exponential complexity; it is only a “close to” most specific one.
5 Experimentations
The efficiency of our approach is evaluated by two measures: the percentage of boolean vectors computed and the percentage of the matching space explored by our approach wrt classical propositionalization methods as presented in section 2. The former reflects the amount of non-redundant boolean vectors empirically computed by the selective propositionalization, that is, the complexity of the learning problem resolution. The latter reflects the complexity of the selective propositionalization itself. As a learning database, we have used a hard artificial problem derived from Michalski’s trains, involving an intractable amount of data (about one hundred million) for classical propositionalization methods. As a result, only 0.0018% of the boolean vectors were computed (with a standard deviation of 0.0017%), by exploring 1.62% of the whole matching space (with a standard deviation of 1.69%). As a corollary, learning methods implementing selective propositionalization are empirically about 62 times faster than classical propositionalization methods.
6 Conclusion
We have proposed an original propositionalization method which benefits from the advantages of both generate and test methods, using a “more general than or equal to” partial ordering, and from a sound and efficient algebraic specialization
operator. The selective propositionalization method has been validated on an artificial, yet complex, relational problem involving a huge matching space, and seems well-suited for handling highly indeterminate FOL learning problems. The generation of a large amount of redundant data is avoided. On the other hand, the Version Space approach allows for storing just a few boolean vectors computed from positive FOL examples only. Finally, this selective propositionalization technique can be adapted to any subsumption relation in the original FOL search space, and it can be combined with additional biases that can further improve the overall efficiency. For instance, user bias [15] can be incorporated in the pattern definition to further decrease the size of the matching space.
References
1. L. De Raedt. Attribute-value learning versus inductive logic programming: The missing link. In D. Page, editor, Proc. of the 8th International Workshop on ILP, pages 1–8. Springer Verlag, 1998.
2. T. Dietterich, R. Lathrop, and T. Lozano-Perez. Solving the multi-instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1996.
3. U. Fayyad. Knowledge discovery in databases: An overview. In N. Lavrač and S. Džeroski, editors, Proc. of the 7th International Workshop on ILP, pages 1–16. Springer Verlag, 1997.
4. J.-G. Ganascia. A rule system learning system. In R. Bajcsy, editor, Proc. of the International Joint Conference on Artificial Intelligence, pages 432–438. Morgan Kaufmann, 1993.
5. S. Kramer, B. Pfahringer, and C. Helma. Stochastic propositionalization of non-determinate background knowledge. In D. Page, editor, Proc. of the 8th International Workshop on ILP, pages 80–94. Springer Verlag, 1998.
6. N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
7. T. M. Mitchell. Generalization as search. Artificial Intelligence, 18:203–226, 1982.
8. S. Muggleton and C. Feng. Efficient induction in logic programs. In S. Muggleton, editor, International Workshop on ILP, pages 281–298. Academic Press, 1992.
9. G. Plotkin. A note on inductive generalization. In Machine Intelligence, volume 5. Edinburgh University Press, 1970.
10. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
11. M. Sebag. Resource bounded induction and deduction in FOL. In Proc. on Multi Strategy Learning, 1998.
12. M. Sebag and C. Rouveirol. Induction of maximally general clauses consistent with integrity constraints. In S. Wrobel, editor, Proc. of the 4th International Workshop on ILP, pages 195–216, 1994.
13. M. Sebag and C. Rouveirol. Tractable induction and classification in first order logic via stochastic matching. In 15th Int. Joint Conf. on Artificial Intelligence (IJCAI’97), Nagoya, Japan, pages 888–893. Morgan Kaufmann, 1997.
14. S. Wrobel. An algorithm for multi-relational discovery of subgroups. In Proc. of PKDD, pages 78–87. Springer Verlag, 1997.
15. J.-D. Zucker and J.-G. Ganascia. Learning structurally indeterminate clauses. In D. Page, editor, Proc. of the 8th International Workshop on ILP, pages 235–244. Springer Verlag, 1998.
Circle Graphs: New Visualization Tools for Text-Mining

Yonatan Aumann, Ronen Feldman, Yaron Ben Yehuda, David Landau, Orly Liphstat, Yonatan Schler
Department of Mathematics and Computer Science, Bar-Ilan University, Ramat-Gan, ISRAEL
Tel: 972-3-5326611  Fax: 972-3-5326612
[email protected]

Abstract. The proliferation of digitally available textual data necessitates automatic tools for analyzing large textual collections. Thus, in analogy to data mining for structured databases, text mining is defined for textual collections. A central tool in text-mining is the analysis of concept relationships, which discovers connections between different concepts, as reflected in the collection. However, discovering these relationships is not sufficient, as they also have to be presented to the user in a meaningful and manageable way. In this paper we introduce a new family of visualization tools, which we coin circle graphs, which provide means for visualizing concept relationships mined from large collections. Circle graphs allow for instant appreciation of multiple relationships gathered from the entire collection. A special type of circle graphs, called Trend Graphs, allows tracking of the evolution of relationships over time.
1 Introduction

Most informal definitions [2] introduce knowledge discovery in databases (KDD) as the extraction of useful information from databases by large-scale search for interesting patterns. The vast majority of existing KDD applications and methods deal with structured databases, for example, client data stored in a relational database, and thus exploit data organized in records structured by categorical, ordinal, and continuous variables. However, a tremendous amount of information is stored in documents that are essentially unstructured. The availability of document collections and especially of online information is rapidly growing, so that an analysis bottleneck often arises also in this area. Thus, in analogy to data mining for structured data, text mining is defined for textual data. Text mining is the science of extracting information from hidden patterns in large textual collections. Text mining shares many characteristics with classical data mining, but also differs in some. Thus, it is necessary to provide special tools geared specifically to text mining. A central tool, found valuable in text mining, is the analysis of concept
relationship [3,4], defined as follows. Large textual corpuses are most commonly composed of a collection of separate documents (e.g. news articles, web pages). Each document refers to a set of concepts (terms). Text mining operations consider the distribution of concepts on the inter-document level, seeking to discover the nature and relationships of concepts as reflected in the collection as a whole. For example, in a collection of news articles, a large number of articles on politician X and "scandal" may indicate a negative image of the character, and alert for a new PR campaign. Or, for another example, a growing number of articles on both company Y and product Z may indicate a shift of focus in the company’s interests, a shift which should be noted by its competitors. Notice that in both of these cases, the information is not provided by any single document, but rather from the totality of the collection. Thus, concept relationship analysis seeks to discover the relationship between concepts, as reflected by the totality of the corpus at hand. Clearly, discovering the concept relationships is only useful insofar as this information can be conveyed to the end-user. In practice, even medium-sized collections tend to give rise to a very large number of relationships. Thus, a mere listing of the discovered relationships is of little practical use for the end-user, as it is too large to comprehend. In addition, a linear list fails to show the structure arising from the entirety of relationships. Thus, we find that in order for mining of concept-relationship to be a useful tool for text-analysis, proper visualization techniques are necessary for presenting the results to the end user in a meaningful and manageable form. In this paper we introduce a new family of visualization tools for text-mining, which we call Circle Graphs. Circle graphs prove to be an effective tool for visualizing concept relationships discovered in text-mining. The graphs provide the user with an instant overall view of many relationships at once. Thus, circle graphs provide the extra benefit of surfacing the overall structure emerging from the multitude of relationships. We describe two specific types of circle graphs: 1. Category Connection Graphs: Provide a graphic representation of relationships between concept in different categories. 2. Context Connection Circle Graphs: Provide the user with a graphic representation of the connection between entities in a give context.
2 Circle Graphs

We now describe the circle graphs visualization. We first give some basic definitions and notations.
2.1 Definitions and Notations

Let T be a taxonomy. T is represented as a DAG (Directed Acyclic Graph), with the terms at the leaves. For a given node v∈T, we denote by Terms(v) the terms which are descendants of v. Let D be a collection of documents. For terms e1 and e2 we denote by supD(e1,e2) the number of documents in D which indicate a relationship between e1 and e2. The nature of the indication can be defined individually according to the context. In the current implementation we say that a document indicates a relationship if both terms appear in the document in the same sentence. This has proved to be a strong indicator. Similarly, for a collection D and terms e1, e2 and c, we denote by supD(e1,e2,c) the number of documents which indicate a relationship between e1 and e2 in the context of c (e.g. a relationship between the UK and Ireland in the context of peace talks). Again, the nature of indication may be determined in many ways. In the current implementation we require that they all appear in the same sentence.
2.2 Category Connection Maps

Category Connection Maps provide a means for concise visual representation of connections between different categories, e.g. between companies and technologies, countries and people, or drugs and diseases. In order to define a category connection map, the user chooses any number of categories from the taxonomy. The system finds all the connections between the terms in the different categories. To visualize the output, all the terms in the chosen categories are depicted on a circle, with each category placed on a separate part of the circle. A line is depicted between terms of different categories which are related. A color coding scheme represents stronger links with darker colors.
Formally, given a set C={v1,v2,…,vk} of taxonomy nodes and a document collection D, the category connection map is the weighted graph G defined as follows. The nodes of the graph are the set V = terms(v1) ∪ terms(v2) ∪ … ∪ terms(vk). Nodes u,w∈V are connected by an edge if: 1. u and w are from different categories, and 2. supD(u,w) > 0. The weight of the edge (u,w) is supD(u,w).

An important point to notice regarding Category Connection Maps is that the map presents in a single picture information from the entire collection of documents. In the specific example, there is no single document that has the relationship between all the companies and the technologies. Rather, the graph depicts aggregate knowledge from hundreds of documents. Thus, the user is provided with a bird’s-eye summary view of data from across the collection.
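A minimal sketch of this construction in Python (our own data layout, assuming each document is given as a list of sentences and each sentence as a set of terms; not the authors' system) counts, for every cross-category pair, the number of documents with a sentence containing both terms:

from itertools import combinations
from collections import Counter

def category_connection_map(documents, categories):
    # categories: dict mapping a category name to the set of its terms
    term_category = {t: c for c, terms in categories.items() for t in terms}
    support = Counter()
    for doc in documents:
        pairs = set()
        for sentence in doc:
            hits = sorted(t for t in sentence if t in term_category)
            pairs.update((a, b) for a, b in combinations(hits, 2)
                         if term_category[a] != term_category[b])
        support.update(pairs)        # each document counted at most once per pair
    return support                   # edge weights sup_D(e1, e2)

docs = [[{"IBM", "Java"}, {"Sun", "Java"}], [{"IBM", "XML"}, {"Microsoft", "XML"}]]
cats = {"companies": {"IBM", "Sun", "Microsoft"}, "technologies": {"Java", "XML"}}
print(category_connection_map(docs, cats))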
The Category Connection Maps are dynamic in several ways. Firstly, the user can choose any node in the graph and the links from this node are highlighted. In addition, a double-click on any of the edges brings up the list of documents which support the given relationship, together with the most relevant sentence in each document. Thus, in a way, the system is the opposite of search engines. Search engines point to documents, in the hope that the user will be able to find the necessary information. Circle category connection maps present the user with the information itself, which can then be backed by a list of documents.
Figure 1 – Context Circle Graph. The graph presents the connections between companies in the context of “joint venture”. Clusters are depicted separately. Color coded lines represent the strength of the connection. The information is based on 5,413 news articles obtained from Marketwatch.com. Only connections with weight 2 or more are depicted.
2.3 Context Circle-Graphs

Context Circle-Graphs provide a visual means for concise representation of the relationship between many terms in a given context. In order to define a context circle graph the user defines:
1. A taxonomy category (e.g. “companies”), which determines the nodes of the circle graph (e.g. companies).
2. An optional context node (e.g. “joint venture”), which will determine the type of connection we wish to find among the graph nodes.

Formally, for a set of taxonomy nodes vs and a context node C, the Context Circle Graph is a weighted graph on the node set V = terms(vs). For each pair u,w∈V there is an edge between u and w if there exists a context term c∈C such that supD(u,w,c) > 0. In this case the weight of the edge is Σc∈C supD(u,w,c). If no
context node is defined, then the connection can be in any context. Formally, in this case the root of the taxonomy is considered as the context. A Context Circle Graph for “companies” in the context of “joint venture” is depicted in Figure 1. In this case, the graph is clustered, as described below. The graph is based on 5,413 news documents downloaded from Marketwatch.com. The graph gives the user a summary of the entire collection in one visualization. The user can appreciate the overall structure of the connections between companies in this context, even before reading a single document!
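Under the same assumed document layout as in the earlier sketch, the context-dependent edge weights could be computed as follows (again an illustration, not the authors' implementation); each document contributes at most once to supD(u,w,c):

from itertools import combinations
from collections import Counter

def context_graph(documents, node_terms, context_terms):
    weight = Counter()
    for doc in documents:
        triples = set()                      # (u, w, c) relationships found in this document
        for sentence in doc:
            hits = sorted(sentence & node_terms)
            for c in sentence & context_terms:
                triples.update((u, w, c) for u, w in combinations(hits, 2))
        for u, w, c in triples:              # sup_D counts documents, not sentences
            weight[(u, w)] += 1              # edge weight = sum over context terms c
    return weight

docs = [[{"IBM", "Sun", "joint venture"}], [{"Sun", "Microsoft", "joint venture"}]]
print(context_graph(docs, {"IBM", "Sun", "Microsoft"}, {"joint venture"}))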
2.3.1 Clustering

For Context Circle Graphs we use clustering to identify clusters of nodes which are strongly inter-related in the given context. In the example of figure 1, the system identified six separate clusters. The edges between members of each cluster are depicted in a separate small Context Circle Graph, adjacent to the center graph. The center graph shows connections between terms of different clusters, and those with terms which are not in any cluster. We now describe the algorithm for determining the clusters. Note that the clustering problem here is different from the classic clustering problem. In the classic problem we are given points in some space, and seek to find clusters of points which are close to each other. Here, we are given a graph in which we seek to find dense sub-graphs. Thus, a different type of clustering algorithm is necessary. The algorithm is composed of two main steps. In the first step we assign weights to edges in the graph. The weight of an edge reflects the strength of the connection between the vertices. Edges incident to vertices which are in the same cluster should be associated with high weights. In the next step we identify sets of vertices which are dense sub-graphs. This step uses the weights assigned to the edges in the previous one. We first describe the weight-assignment method. In order to evaluate the strength of a link between a pair of vertices u and v, we consider two criteria. Let u be a vertex in the graph. We use the notation Γ(u) to represent the neighborhood of u. The cluster weight of (u,v) is affected by the similarity of Γ(u) and Γ(v). We assume that vertices within the same cluster have many common neighbors. Existence of many common neighbors is not a sufficient condition, since in dense graphs any two vertices may have some common neighbors. Thus, we emphasize the neighbors which are close to u and v in the sense of cluster weight. Suppose x∈Γ(u)∩Γ(v); if the cluster weights of (x,u) and (x,v) are high, there is a good chance that x belongs to the same cluster as u and v.
We can now define an update operation on an edge (u,v) which takes into account both criteria:

w(u,v) = Σx∈Γ(u)∩Γ(v) w(x,u) + Σx∈Γ(u)∩Γ(v) w(x,v)
The algorithm starts by initializing all weights to be equal, w(u,v)=1 for all u,v. Next, the update operation is applied to all edges iteratively. After a small number of iterations (set to 5 in the current implementation) it stops and outputs the values associated with each edge. We call this the cluster weight of the edge. The cluster weight has the following characteristic. Consider two vertices u and v within the same dense sub-graph. The edges within this sub-graph mutually affect each other. Thus the iterations drive the cluster weight w(u,v) up. If, however, u and v do not belong to the dense sub-graph, the majority of edges affecting w(u,v) will have lower weights, resulting in a low cluster weight assigned to (u,v). After computing the weights, the second step of the algorithm finds the clusters. We define a new graph with the same set of vertices. In the new graph we consider only a small subset of the original edges, whose weights were the highest. In our experiments we took the top 10% of the edges. Since now almost all of the edges are likely to connect vertices within the same dense sub-graph, we separate the vertices into clusters by computing the connected components of the new graph and considering each component as a cluster. Figure 1 shows a circle context graph with six clusters. The clusters are depicted around the center graph. Each cluster is depicted in a different color. Nodes which are not in any cluster are colored gray. Note that the nodes of a cluster appear both in the central circle and in the separate cluster graph. Edges within a cluster are depicted in the separate cluster graph. Edges between clusters are depicted in the central circle.
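The two steps can be sketched as follows (our reconstruction in Python with hypothetical names, not the authors' implementation; a plain union-find gives the connected components):

from collections import defaultdict

def cluster_graph(edges, iterations=5, keep_fraction=0.10):
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    key = lambda a, b: frozenset((a, b))
    w = {key(u, v): 1.0 for u, v in edges}            # all weights start equal
    for _ in range(iterations):                        # iterative cluster-weight update
        new_w = {}
        for e in w:
            u, v = tuple(e)
            common = nbrs[u] & nbrs[v]
            new_w[e] = sum(w[key(x, u)] + w[key(x, v)] for x in common)
        w = new_w
    # keep only the strongest edges and take connected components as clusters
    top = sorted(w, key=w.get, reverse=True)[:max(1, int(len(w) * keep_fraction))]
    parent = {}
    def find(x):
        while parent.setdefault(x, x) != x:
            x = parent[x]
        return x
    for e in top:
        u, v = tuple(e)
        parent[find(u)] = find(v)
    groups = defaultdict(set)
    for x in parent:
        groups[find(x)].add(x)
    return [g for g in groups.values() if len(g) > 1]

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("d", "f")]
print(cluster_graph(edges, keep_fraction=0.5))        # fraction enlarged for this toy graph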
References
1. Eick, S.G. and Wills, G.J.: Navigating Large Networks with Hierarchies. Visualization ’93, pp. 204-210, 1993.
2. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P.: Knowledge Discovery and Data Mining: Towards a Unifying Framework. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 82-88, 1996.
3. Feldman, R. and Dagan, I.: KDT - Knowledge Discovery in Texts. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining, 1995.
4. Feldman, R., Klosgen, W., and Zilberstein, A.: Visualization Techniques to Explore Data Mining Results for Document Collections. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 16-23, 1997.
5. Hendley, R.J., Drew, N.S., Wood, A.M., and Beale, R.: Narcissus: Visualizing Information. In Proc. Int. Symp. on Information Visualization, pp. 90-94, 1995.
On the Consistency of Information Filters for Lazy Learning Algorithms

Henry Brighton¹ and Chris Mellish²
¹ SHARP Laboratories of Europe Ltd., Oxford Science Park, Oxford, England, UK
[email protected]
² Department of Artificial Intelligence, The University of Edinburgh, Scotland, UK
[email protected]

Abstract. A common practice when filtering a case-base is to employ a filtering scheme that decides which cases to delete, as well as how many cases to delete, such that the storage requirements are minimized and the classification competence is preserved or improved. We introduce an algorithm that rivals the most successful existing algorithm in the average case when filtering 30 classification problems. Neither algorithm consistently outperforms the other, with each performing well on different problems. Consistency over many domains, we argue, is very hard to achieve when deploying a filtering algorithm.
1 Introduction
Information filtering is an attractive proposition when working with lazy learning algorithms. The lazy learning paradigm is characterized by the indiscriminate storage of training cases during the training stage. To classify an unseen query case a lazy learner applies the nearest neighbor algorithm [4]. We focus on how the size of the database holding the cases can be minimised such that the classification response time can be improved. By removing harmful cases we can also increase the overall classification competence of the learner. In this paper we introduce a new algorithm for filtering case-bases used by lazy learning algorithms. After comparing the algorithm with the most successful existing filter on 30 datasets from the UCI repository for machine learning databases [5], we conclude that neither approach is consistently superior.
2 Issues in Case Filtering
By removing a set of cases from a case-base the response time for classification decisions will decrease, as fewer cases are examined when a query case is presented. The removal can also lead to either an increase or decrease in classification competence. Therefore, when applying a filtering scheme to a case-base we must be clear about the degree to which we are willing to let the original classification competence depreciate. Typically, the principal objective of a filtering algorithm is unintrusive storage reduction. Here, classification competence is primary: we
desire the same (or higher) learning competence, but we require it to be faster and to take up less space. Ideally, competence should not suffer at the expense of improved performance. If our filtering decisions are not to harm the classification competence of the learner, we must be clear about the kind of deletion decisions that introduce misclassifications. Consider the following reasons why a k-nearest neighbor classifier might misclassify an unseen query case:
1. When noise is present in the locality of the query case. The noisy case(s) win the majority vote, resulting in the incorrect class being predicted.
2. When the query case occupies a position close to an inter-class border, where discrimination is harder due to the presence of multiple classes.
3. When the region defining the class, or fragment of the class, is so small that cases belonging to the class that surrounds the fragment win the majority vote. This situation depends on the value of k being large.
4. When the problem is unsolvable by a lazy learner. This may be due to the nature of the underlying function, or due to the sparse data problem.

In the context of filtering, we can address point (1) and try to improve classification competence by removing noise. We can do nothing about (4) as this situation is a given and defines the intrinsic difficulty of the problem. However, issues (2) and (3) should guide our removal decisions. Removing cases close to borders is not recommended as these cases are relevant to discrimination between classes. We should be aware of point (3), but as k is typically small, the occurrence of such a problem is likely to be rare. Consider our example dataset shown in Figure 1(a): we can imagine that removing the interior of the class regions would not lead to a misclassification of a query case at these points; the border cases still supply the required information.

Fig. 1. (a) The 2d-dataset. (b) The cases remaining from the 2d-dataset after 5 iterations of the ICF algorithm.
3 Review
Filtering the set of stored instances has been an issue since the early work on nearest neighbor (NN) classification [4]. The early schemes typically concentrate on either competence enhancement (noise removal) [8] or competence preservation [6]. More recent schemes attempt both [1,9]. A novel approach to competence preservation is the Footprint Deletion policy of Smyth and Keane [6], which is a filtering scheme designed for use within the paradigm of Case-Based Reasoning (CBR). In previous work [3] we have shown that some of the concepts introduced by Smyth and Keane transfer to the simpler context of lazy learning. Much of Smyth and Keane’s work relies on the notion of case adaptation. They use the property Adaptable(c, c′) to mean case c can be adapted to c′. Generally speaking, we can delete a case for which there are many other cases that can be adapted to it. In our previous work we introduced a Lazy Learning parallel termed the Local-Set of a case c to capture this property [2]. We define the Local-set of a case c as: the set of cases contained in the largest hypersphere centered on c such that only cases in the same class as c are contained in the hypersphere. The novelty of Smyth and Keane’s work stems from their proposed taxonomy of case groups. By defining four case categories, which reflect the contribution to overall competence the case provides, we gain an insight into the effect of removing a case. We define these categories in terms of two properties: Reachability and Coverage. These properties are important, as the relationship between them has been used in crucial work which we discuss later. For a case-base CB = {c1, c2, . . . , cn}, we define Coverage and Reachability as follows:

Coverage(c) = { c′ ∈ CB : Adaptable(c, c′) }   (1)
Reachable(c) = { c′ ∈ CB : Adaptable(c′, c) }   (2)
Using these two properties we can define the four groups using set theory. For example, a case in the pivotal group is defined as a case with an empty reachable set. For a more thorough definition we refer the reader to the original article. Our investigation into the Lazy Learning parallel of Footprint Deletion differs only in the replacement of Adaptable with the Local-set property. Whether a case c can be adapted to a case c′ relies on whether c is relevant to the solution of c′. In lazy learning this means that c is a member of the nearest neighbors of c′. However, we cannot assume that a case of a differing class is relevant to the solution (correct prediction) of c′. We therefore bound the neighborhood of c′ by the first case of a differing class. Armed with this parallel we found that Footprint Deletion performed well [2]. Perhaps more interestingly, we found that a simpler method which uses only the local-set property, and not the case taxonomies, performs just as well. With local-set deletion, we choose to delete cases with large local-sets, as these are cases located at the interior of class regions. Local-set deletion has subsequently been employed in the context of natural language processing [7].
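As a hedged illustration of these notions (our own code and naming; Adaptable(c, c′) is read here as “c lies inside the local-set of c′”, following the lazy-learning parallel described above):

from math import dist

def local_set(c, cases):
    # cases: list of (point, label) pairs; returns the cases closer to c than
    # c's nearest case of a different class
    p, y = c
    enemy = min(dist(p, q) for q, z in cases if z != y)
    return [(q, z) for q, z in cases if (q, z) != c and dist(p, q) < enemy]

def reachable(c, cases):
    return local_set(c, cases)                        # cases that can "solve" c

def coverage(c, cases):
    return [d for d in cases if d != c and c in local_set(d, cases)]

data = [((0.0, 0.0), "A"), ((0.2, 0.0), "A"), ((0.4, 0.1), "A"),
        ((2.0, 0.0), "B"), ((2.2, 0.1), "B")]
c = data[1]
print(len(reachable(c, data)), len(coverage(c, data)))   # 2 2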
ICF(T)
 1  /* Perform Wilson Editing */
 2  for all x ∈ T do
 3    if x classified incorrectly by k nearest neighbours then
 4      flag x for removal
 5  for all x ∈ T do
 6    if x flagged for removal then T = T − {x}
 7  /* Iterate until no cases flagged for removal */
 8  repeat
 9    for all x ∈ T do
10      compute reachable(x)
11      compute coverage(x)
12    progress = false
13    for all x ∈ T do
14      if |reachable(x)| > |coverage(x)| then
15        flag x for removal
16        progress = true
17    for all x ∈ T do
18      if x flagged for removal then T = T − {x}
19  until not progress
20  return T

Fig. 2. The Iterative Case Filtering Algorithm.
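A runnable Python re-implementation of Fig. 2 might look as follows (a sketch under our own assumptions: Euclidean distance, k = 3 for the Wilson editing step, and the reachable set taken as the neighbourhood bounded by the nearest case of a different class):

from math import dist

def wilson_edit(points, labels, k=3):
    # remove cases misclassified by their k nearest neighbours (lines 1-6)
    keep = set()
    for i in range(len(points)):
        nn = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: dist(points[i], points[j]))[:k]
        votes = [labels[j] for j in nn]
        if max(set(votes), key=votes.count) == labels[i]:
            keep.add(i)
    return keep

def icf(points, labels, k=3):
    alive = wilson_edit(points, labels, k)
    progress = True
    while progress:                                   # lines 8-19
        progress = False
        pts = sorted(alive)
        reach = {}
        for i in pts:
            others = sorted((j for j in pts if j != i),
                            key=lambda j: dist(points[i], points[j]))
            enemy = next((r for r, j in enumerate(others) if labels[j] != labels[i]),
                         len(others))
            reach[i] = set(others[:enemy])            # bounded by the nearest enemy
        cover = {i: {j for j in pts if i in reach[j]} for i in pts}
        for i in pts:
            if len(reach[i]) > len(cover[i]):         # deletion rule (line 14)
                alive.discard(i)
                progress = True
    return [points[i] for i in sorted(alive)], [labels[i] for i in sorted(alive)]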
4 An Iterative Case Filtering Algorithm
We now present a new algorithm which uses an iterative approach to case deletion. We term the algorithm the Iterative Case Filtering Algorithm (ICF). The ICF algorithm uses the lazy learning parallels of case coverage and reachability we developed when transferring the CBR Footprint Deletion policy, discussed above. We apply a rule which identifies cases that should be deleted. These cases are then removed, and the rule is applied again, iteratively, until no more cases fulfil the pre-conditions of the rule. The ICF algorithm uses the reachable and coverage sets described above, which we can liken to the neighborhood and associate sets used by [9]. An important difference is that the reachable set is not fixed in size but rather bounded by the nearest case of a different class. This difference is crucial as our algorithm relies on the relative sizes of these sets. Our deletion rule is simple: we remove cases which have a reachable set size greater than the coverage set size. A more intuitive reading of this rule is that a case c is removed when more cases can solve c than c itself can solve. After removing these cases the case-space will typically contain thick bands of cases on either side of class borders. The algorithm is depicted in Figure 2. We also employ the noise filtering scheme based on Wilson Editing and adopted by [9]. Lines 2-6 of the algorithm perform this task. Figure 1(b) depicts the 2d-dataset, introduced earlier, after 5 iterations of the
ICF algorithm. The deletion rule above is the criterion the algorithm uses; the algorithm proceeds by repeatedly computing these properties after filtering has occurred. Usually, additional cases will begin to fulfil the criterion as thinning proceeds and the bands surrounding the class boundaries narrow. After a few iterations of removing cases and recomputing, the criterion no longer holds. We evaluated the ICF algorithm on 30 datasets found at the UCI repository of machine learning databases [5]. The maximum number of iterations performed, over the 30 datasets, was 17. This number of iterations was required for the switzerland database, where the algorithm removed an average of 98% of cases. However, a number of the datasets consistently require as little as 3 iterations. Examining each iteration of the algorithm, specifically the percentage of cases removed after each iteration, provides us with an important insight into how the algorithm is working. We call this the reduction profile, and it is a characteristic of the case-base. Profiles exhibiting a short series of iterations, each one removing a large number of cases, would indicate a simple case-base structure containing little inter-dependency between regions. The most problematic of case-base structures would be characterised by a long series resulting in few cases being removed. Comparing the ICF algorithm with RT3, the most successful of Wilson and Martinez’s algorithms, we found that the average case behaviors over the 30 datasets were very similar (see Table 1). Neither algorithm consistently outperformed the other. More interestingly, the behavior of the two algorithms differs considerably on some problems. We also found that the domains which suffer a competence degradation as a result of filtering using ICF and RT3 are exactly those in which competence degrades as a result of noise removal. This would indicate that noise removal is sometimes harmful, and both ICF and RT3 suffer as a consequence. To summarize, we have presented an algorithm which iteratively filters a case-base using a lazy learning parallel of the two case properties used in the CBR Footprint Deletion policy. Due to the iterative nature of the algorithm, we have gained an insight into how the deletions of different regions depend on each other. The point at which our deletion criterion ceases to hold can result in improved generalization accuracy and storage reduction.
5 Conclusion
We have introduced the ICF algorithm, which supports the argument that consistency is hard. We compared the ICF algorithm with a recent successful algorithm, RT3, and found that their average case performance is very similar, but on individual problems they can differ considerably. Each algorithm can outperform the other on certain problems, both in terms of competence and storage reduction. Consistency is therefore a problem in the deployment of filtering schemes. One advantage of our algorithm is that it provides us with a reduction profile. The profile tells us how different regions are dependent on each other. This provides us with a useful degree of perspicuity in understanding the structure of the case-base. Ultimately, the choice of filter we deploy must be informed by any insights we have into the structure of the case-space.
Table 1. The classification accuracy and storage requirements for a sample of the datasets mentioned. The benchmark competence, which is the accuracy achieved without any filtering, is compared with Wilson Editing, RT3, and ICF.
    Dataset          Benchmark        Wilson Editing    RT3               ICF
                     Acc.    Stor.    Acc.    Stor.     Acc.    Stor.     Acc.    Stor.
    abalone          19.53   100      22.01   19.64     22.11   40.95     22.74   15.11
    balance-scale    77.36   100      86.04   77.48     83.40   18.23     81.47   14.67
    cleveland        77.67   100      78.67   77.39     78.89   20.92     72.08   15.60
    ecoli            81.94   100      86.27   81.77     82.84   15.76     81.34   14.06
    glass            71.43   100      69.05   70.17     69.05   23.26     69.64   31.40
    hungarian        76.55   100      79.91   77.03     80.17    9.81     78.30   12.15
    led              63.77   100      68.27   66.11     69.62   18.04     71.74   41.81
    led-17           42.82   100      43.00   43.09     41.48   46.78     42.33   27.50
    lymphography     77.59   100      76.38   79.41     72.70   26.73     77.59   25.63
    pima-indians     69.54   100      71.27   69.20     71.08   22.38     69.17   17.22
    primary-tumor    36.57   100      36.57   35.81     39.43   30.76     37.06   18.32
    switzerland      92.08   100      93.54   90.45     91.67    2.15     92.28    2.02
    thyroid          90.93   100      89.30   91.48     77.91   16.23     86.63   21.85
    waveform         75.36   100      76.62   76.37     76.14   22.79     73.93   18.98
    wine             84.57   100      86.43   85.17     86.43   15.37     83.81   12.00
    yeast            52.70   100      55.39   52.97     55.32   27.03     52.25   16.62
    zoo              95.50   100      96.25   95.31     87.08   26.13     92.42   52.78
    average          75.75   100      77.52   75.98     76.59   19.29     76.13   19.73
References
1. D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6(1):37-66, 1991.
2. H. Brighton. Experiments in case-based learning. Undergraduate Dissertation, Department of Artificial Intelligence, University of Edinburgh, Scotland, 1996.
3. H. Brighton. Information filtering for lazy learning algorithms. Masters Thesis, Centre for Cognitive Science, University of Edinburgh, Scotland, 1997.
4. T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13:21-27, 1967.
5. C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html], 1996. Irvine, CA: University of California, Department of Information and Computer Science.
6. B. Smyth and M. T. Keane. Remembering to forget. In C. S. Mellish, editor, IJCAI-95: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, volume 1, pages 377-382. Morgan Kaufmann Publishers, 1995.
7. A. van den Bosch and W. Daelemans. Do not forget: Full memory in memory-based learning of word pronunciation. In Proceedings of NeMLaP3/CoNLL98, pages 195-204, Sydney, Australia, 1998.
8. D. L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(3):408-421, June 1972.
9. D. R. Wilson and A. R. Martinez. Instance pruning techniques. In D. Fisher, editor, Machine Learning: Proceedings of the Fourteenth International Conference, San Francisco, CA, 1997. Morgan Kaufmann.
Using Genetic Algorithms to Evolve a Rule Hierarchy

Robert Cattral, Franz Oppacher, and Dwight Deugo
Intelligent Systems Research Unit, School of Computer Science, Carleton University, Ottawa, ON K1S 5B6
{rcattral,oppacher,deugo}@scs.carleton.ca
Abstract. This paper describes the implementation and the functioning of RAGA (Rule Acquisition with a Genetic Algorithm), a genetic-algorithm-based data mining system suitable for both supervised and certain types of unsupervised knowledge extraction from large and possibly noisy databases. The genetic engine is modified through the addition of several methods tuned specifically for the task of association rule discovery. A set of genetic operators and techniques are employed to efficiently search the space of potential rules. During this process, RAGA evolves a default hierarchy of rules, where the emphasis is placed on the group rather than each individual rule. Rule sets of this type are kept simple in both individual rule complexity and the total number of rules that are required. In addition, the default hierarchy deals with the problem of overfitting, particularly in classification tasks. Several data mining experiments using RAGA are described.
1 Introduction

Data mining, also known as KDD, or Knowledge Discovery in Databases, refers to the attempt to extract previously unknown and potentially useful relations and other information from databases and to present the acquired knowledge in a form that is easily comprehensible to humans (for example, see [1]). It differs from classical machine learning mainly in the fact that the training set is a database stored for purposes unrelated to training a learning algorithm. Consequently, data mining algorithms must cope with large amounts of data, various forms of noise and often unfavorable representations. Because of the requirement of comprehensibility, i.e., that the system be able to communicate the results of its learning in operationally effective and easily understood symbolic form, many approaches to data mining favor symbolic machine learning techniques, typically variants of AQ learning and decision tree induction [2]. RAGA meets the comprehensibility requirement by working with a population of variable-length, symbolic rule structures that can accommodate not just feature-value
pairs but arbitrary n-place predicates (n ≥ 0), while exploiting the proven ability of the Genetic Algorithm (e.g. [3]) to efficiently search large spaces. Most extant data mining systems perform supervised learning, where the system attempts to find concept descriptions for classes that are, together with preclassified examples, supplied to it by a teacher. The task of unsupervised learning is much more demanding because here the system is only directed to search the data for interesting associations, and attempts to find the classes by itself by postulating class descriptions for sufficiently many classes to cover all items in the database. We would like to point out that the usual characterization of unsupervised learning as learning without preclassified examples conflates a variety of increasingly difficult learning tasks. These tasks range from detecting potentially useful regularities among the data couched in the provided description language to the discovery of concepts through conceptual clustering and constructive induction, and to the further discovery of empirical laws relating concepts constructed by the system. As will be shown in section 6 below, RAGA is capable of both supervised and (the simplest type of) unsupervised learning. (This is the only type of unsupervised learning with which we have experimented thus far.) However, since we wish to compare our system to others we emphasize in this paper its use in supervised learning. Sections 2 and 3 briefly describe the rules acquired by RAGA and its major parameters, respectively. Section 5 characterizes the system's peculiar type of evolution, section 6 reports some experimental results and Section 7 concludes.
2 Representation of Data: If-Then Rules

An important type of knowledge acquired by many data mining systems takes the form of if-then rules. Such rules state that the presence of one or more items implies or predicts the presence of other items. A typical rule has the form: If X1 ∧ X2 ∧ … ∧ Xn, then Y1 ∧ Y2 ∧ … ∧ Yn. The data stored internally by RAGA represents rules of this type, with a varying number of conjuncts in the antecedent and/or the consequent (see section 3). Each part of the antecedent as well as the expression in the consequent can contain n-place predicates. If n = 0, the expression is a propositional constant; if n = 1, the expression has the widely used form of attribute-value pairs. However, RAGA can handle predicates of any arity. The antecedents and consequents in association rules can be conjunctions (∧) and negations (¬) of expressions that are built up from predicates and comparison operators (=, <, >, …). Component expressions can involve boolean variables (e.g. If X ∧ Y, then ¬Z), integer and real variables and constants (e.g. If X > 98.6, then Y = 1), and percentiles and percentage constants (e.g. If X > 85% ∧ Y < X, then Z > 10%).
The ability to generate negated expressions is not enabled by default because uncontrolled use of negations not only increases the search space but also often leads to the production of useless rules. For example, while If X ∧ ¬Y, then Z may be a good rule, the fitness function should penalize If X, then ¬Y ∧ ¬Z as useless in many situations where most items are absent in any given transaction. Although the introduction of constraints and rules governing the use of negation may improve the quality and speed of learning, this has not been explored for the experiments reported here.
2.1 Confidence and Support Factors

In general, a rule will be more relevant and useful the higher its confidence and support are. Considered in isolation, i.e., outside a default hierarchy, rules with low confidence are useless because they are frequently wrong, and rules with low support are useless because they report uncommon combinations of items and are frequently inapplicable. It is important to note, however, that if the targets for support and confidence are set too high in unsupervised data mining, useful rules will be missed. Thus, when the confidence target is set too high, redundancies and tautologies will crowd out potentially useful rules. The proper support level is best determined by varying the settings over several analyses on the same data.
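As a concrete illustration of these two measures (and not of RAGA's internal rule encoding, which is not detailed here), the following Python sketch computes the support and confidence of a single if-then rule over a list of transactions, with each predicate represented as a boolean function on an attribute dictionary:

    def support_confidence(rule, transactions):
        """rule = (antecedent, consequent), each a list of predicates; a predicate
        maps a transaction (a dict of attribute values) to True or False."""
        antecedent, consequent = rule
        n_body = 0    # transactions satisfying the antecedent
        n_both = 0    # transactions satisfying antecedent and consequent
        for t in transactions:
            if all(p(t) for p in antecedent):
                n_body += 1
                if all(q(t) for q in consequent):
                    n_both += 1
        support = n_both / len(transactions) if transactions else 0.0
        confidence = n_both / n_body if n_body else 0.0
        return support, confidence

    # Hypothetical rule and data: If X > 98.6 then Y = 1.
    rule = ([lambda t: t["X"] > 98.6], [lambda t: t["Y"] == 1])
    data = [{"X": 99.1, "Y": 1}, {"X": 97.0, "Y": 0}, {"X": 99.5, "Y": 0}]
    print(support_confidence(rule, data))    # (0.333..., 0.5)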
3 User Interface and System Configuration

RAGA is configured through a graphical interface, and can be operated by a novice computer user. It is important, however, that a data mining technician familiar with the domain specifics configure each project. Before RAGA can perform a rule analysis, the variables and predicates with which rules will be built must be defined, and a number of options controlling the Genetic Algorithm component of the system must be chosen. Unlike other classification algorithms, RAGA supports comparison of attributes. An example of this would be comparing the length of a rectangle with its width. Queries of this type enable the classification algorithm to search beyond single dimensional vectors. However, comparison of attributes is not always desirable. An example of this would be the comparison of dollar amounts in a sales transaction. While it may be useful to compare the prices of different items in a purchase, it is probably not useful to compare the price of a single item to the cost of the entire order (e.g. is the cost of the soda less than the total bill). In cases where the comparison of attributes is deemed fruitless, variable class restrictions can be imposed to prevent variables of certain types from being compared to one another. This has the additional benefit of further narrowing search spaces. Rule position conditions fix numbers or types of elements in the antecedent or consequent. This is used in classification tasks, where the attributes are always in the
antecedent and the class is alone in the consequent. In the absence of these conditions the search is undirected.
4 Evolving a Default Hierarchy

A default hierarchy is a collection of rules that are executed in a particular order. When testing a particular data item against a hierarchy of rules, the rule at the top of the list is tried first. If its antecedent correctly matches the conditions in the element being tested, this top rule is used. If a rule does not apply, then the element is matched against the rule at the next lower level of the hierarchy. This continues until the element matches a rule or the bottom of the hierarchy is reached. Rules that are incorrect by themselves can be protected by rules preceding them in the default hierarchy, and play a useful coverage-extending role, as in the following example:

    If (numberOfSides = 4) ^ (length = width) then class = square
    If (numberOfSides = 3) then class = triangle
    If (numberOfSides > 2) ^ (numberOfSides < 5) then class = rectangle

If the last rule were used out of order, many instances would be improperly classified. In the current position it covers the remaining data items accurately. Experimentation has shown that rules at the top of the evolved hierarchy cover most of the data, and rules near the bottom often handle exceptional cases. There is a certain amount of overlap between members of the hierarchy, as opposed to having a mutually exclusive set of rules. The evolving hierarchy tends to produce fewer and less complex rules. In RAGA, the problem of over-fitting is addressed by deciding how to handle the rules at the bottom of the hierarchy. Often these special cases handle only 1 or 2 out of perhaps 5000 data elements. It is important to note, however, that this type of overfitting is harmless because these rules are only tried as a last resort. If improper classification is considered more costly than leaving the class unknown, the user would simply ignore, after visual inspection, the lower levels of the rule hierarchy.
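Read procedurally, a default hierarchy is an ordered rule list in which the first matching rule fires. A minimal Python sketch using the shape rules above (the record encoding is invented for illustration):

    rules = [
        # (antecedent, predicted class), tried from the top of the hierarchy down
        (lambda r: r["numberOfSides"] == 4 and r["length"] == r["width"], "square"),
        (lambda r: r["numberOfSides"] == 3, "triangle"),
        (lambda r: 2 < r["numberOfSides"] < 5, "rectangle"),
    ]

    def classify(record, hierarchy, default=None):
        """Return the class assigned by the first rule whose antecedent matches."""
        for antecedent, label in hierarchy:
            if antecedent(record):
                return label
        return default    # the bottom of the hierarchy was reached

    print(classify({"numberOfSides": 4, "length": 2, "width": 5}, rules))   # rectangle
    print(classify({"numberOfSides": 4, "length": 3, "width": 3}, rules))   # square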
5 The Genetic Engine Used in RAGA

In order to apply the Genetic Algorithm (GA) to the task of data mining for rules we found it desirable to modify the traditional GA in a number of respects. Accordingly, the genetic engine used by RAGA is a hybrid GA. Perhaps the most drastic modification concerns our choice of representation. Unlike the traditional GA whose chromosomes are fixed-length binary strings, the GA in our system accommodates rules of varying length and complexity. These rules are expressed in a non-binary alphabet of user-defined symbols.
The system operates differently depending on whether the current task is classification or otherwise. The primary difference is determining how useful a rule might be, namely its fitness. The algorithm reads as follows. Processing of one generation in RAGA involves three steps:

(i) Controlled by two parameters, ordinary elitism copies one or more of the current best individuals into the next population to guarantee that the top fitness levels will not drop between generations. Classification elitism copies every rule that uniquely covers at least one data item and thus contributes, even if only in a small way, to the set of final rules.

(ii) Next, fitness proportional selection, crossover and (macro and micro) mutations are applied. Until the new population is complete, rule pairs are repeatedly selected and possibly crossed over. Crossover splits rules only between conjunctions. Because of this (and also because of macromutations) rules can grow or shrink during this process. Before the two child rules enter the next phase, they - like all other rules except those copied under elitism - are subjected to micro and macro mutation with bounded rates. Since all rules with positive confidence are macro-mutated, the population size grows during generations.

(iii) Finally, intergeneration processing takes place to ensure validity and nonredundancy. Rules may have several comparisons deleted before conforming to what is allowed. If after this point a rule has become invalid or identical to one that already exists in the new population, it is discarded. After enough valid rules have been selected, modified, and inserted into the new population, the evolution for the current generation is complete.
6 Some Experimental Results

The data set tested contains 8124 sample descriptions of 23 species of gilled mushrooms in the Agaricus and Lepiota family (drawn from [4] and presented in [5]). The data set uses 22 attributes, and classifies each mushroom as either edible (51.8%) or poisonous. Each of 9 test runs produced between 14 and 25 rules. (Five test runs used 1000 training instances and the remaining four runs used 7124 instances.) Each rule set yields 100% accuracy for the entire set. This compares favorably with STAGGER [5] and HILLARY [6], which approach 95% classification accuracy after training on 1000 instances. Several unsupervised tests were also run on the mushroom data set in an attempt to discover information that is not necessarily related to the predefined classes. When looking for domain specific information that may have nothing to do with edibility (by using rules with 100% confidence and 100% support), we found several rules like the following:
If the Gill Attachment is not Descending (either: attached, free, or notched), then the Veil Type is Partial. Unfortunately, we lack the expertise in the given domain to distinguish between interesting domain specific information and well-known facts. In an attempt to automatically discover some facts about edibility, we reduced support to 50%. These tests are difficult to interpret because we lack a tool to compare the results of an undirected and a directed search. We did notice, however, that many of the same attributes used to describe classes are used similarly in the two sets of rules.
7 Conclusion

We have described a flexible new data mining system based on a modified GA. Preliminary experiments show that RAGA's performance compares favorably with that of other approaches to data mining. Unlike the latter, RAGA is also capable of simple forms of unsupervised learning. In the space of evolutionary approaches, RAGA seems to lie 'half way' between Genetic Algorithms and Genetic Programming: like GP, it uses a variable-length, albeit restricted, representation with a non-binary alphabet, a typed crossover and a macromutation that shares some of the effects with GP crossover; like GA, it uses mutation, and it does not evolve programs. Unlike both GP and GA, it promotes validity and nonredundancy by intergenerational processing on fluctuating numbers of individuals, it implements a form of elitism that causes a wide exploration of the data set, and, by making data coverage a component of fitness, it automatically evolves default hierarchies of rules.
References
1. Berry, Michael J.A., Linoff, Gordon: Data Mining Techniques. J. Wiley & Sons (1997).
2. Michalski, R., Bratko, I., Kubat, M.: Machine Learning and Data Mining. Wiley, New York (1998).
3. Mitchell, Melanie: An Introduction to Genetic Algorithms. MIT Press, Mass. (1996).
4. Lincoff, G. H.: The Audubon Society Field Guide to North American Mushrooms. Alfred A. Knopf, New York (1981).
5. Schlimmer, J. S.: Concept Acquisition through Representational Adjustment (TR87-19). Computer Science, University of California, Irvine (1987).
6. Iba, W., Wogulis, J., Langley, P.: Trading off Simplicity and Coverage in Incremental Concept Learning. Proceedings of the 5th International Conference on Machine Learning, 73-39. Morgan Kaufmann, Ann Arbor, Michigan (1988).
Mining Temporal Features in Association Rules

Xiaodong Chen (1) and Ilias Petrounias (2)

(1) Department of Computing & Mathematics, Manchester Metropolitan University, Manchester M1 5GD, U.K., e-mail: [email protected]
(2) Department of Computation, UMIST, PO Box 88, Manchester M60 1QD, U.K., e-mail: [email protected]

Abstract: In real world applications, the knowledge that is used for aiding decision-making is always time-varying. However, most of the existing data mining approaches rely on the assumption that discovered knowledge is valid indefinitely. People who expect to use the discovered knowledge may not know when it became valid, whether it is still valid in the present, or if it will be valid sometime in the future. To support better decision making, it is desirable to be able to identify the temporal features associated with the interesting patterns or rules. The major concerns in this paper are the identification of the valid period and periodicity of patterns, and more specifically of association rules.
1. Introduction

The problem of association rules was introduced in [1] and has been extended in different ways. Most existing work overlooks any time components, which are usually attached to transactions in databases. Without this knowledge most of the information resulting from data mining activities is not of great use. For example, it is not useful to look at all supermarket transactions that have taken place over the years in order to identify patterns. Most of this information will be outdated. Temporal issues of association rules have been recently addressed in [2] and [4]. [2] focuses on the discovery of association rules with known valid periods and periodicities. The valid period shows the absolute time interval during which an association is valid, while the periodicity conveys when and how often an association is repeated. Valid period and periodicity are specified by calendar time expressions in [2]. In [4], the concept of calendric association rules is defined, where the rule is combined with a calendar that is a set of time intervals and is described by a calendar algebra. Here we focus on two mining problems for temporal features of some known/given association: 1) finding all interesting contiguous time intervals during which a specific association holds (section 2); 2) finding all interesting periodicities that a specific association has (section 3).

2. Discovery of Longest Intervals

Given a time-stamped database and a known association, one of our interests is to find all possible time intervals during which this association holds. Those intervals are composed of a totally ordered set of contiguous constructive intervals (called granular intervals) with a given granularity representing a non-decomposable interval of some fixed length. The interval granularity is the size of each granular interval (e.g. Hour, Day, etc.). Each expected time interval is denoted by {Gi, Gi+1, ..., Gj}, where Gk (i ≤ k ≤ j) is a granular interval, and the time domain can also be represented by a totally ordered set of all contiguous granular intervals. We define LENGTH(ITVL, GC) as the number of intervals of granularity GC in ITVL.
Definition 2.1: Given an association AR, an interval ITVL is valid with respect to AR if the temporal association rule (AR, ITVL) satisfies min_supp and min_conf.

More often than not people are just interested in intervals whose duration is long enough, since some short intervals may not be periods of particular interest.

Definition 2.2: Given an association AR and an interval granularity GC, an interval ITVL is long with respect to AR if: ITVL is valid with respect to AR, and LENGTH(ITVL, GC) ≥ min_ilen (the minimal interval length).

Consider a long interval ITVL with respect to AR. It is possible that there exists ITVL' ⊂ ITVL with LENGTH(ITVL', GC) ≥ min_ilen such that ITVL' is not a long interval with respect to AR, since AR may have low support and/or confidence during ITVL', but very high support and confidence during the rest of the period(s) in ITVL.

Definition 2.3: Given an association AR and an interval granularity GC, an interval ITVL is strictly long with respect to AR if every ITVL' ⊂ ITVL with LENGTH(ITVL', GC) ≥ min_ilen is long with respect to AR. With respect to a given association AR, for any two strictly long intervals ITVL1 and ITVL2, if ITVL1 ⊂ ITVL2, we say that ITVL2 is strictly longer than ITVL1.

Definition 2.4: Given an association AR and an interval granularity GC, an interval ITVL is longest with respect to AR if: 1) the interval ITVL is strictly long with respect to AR, and 2) there is no ITVL'' ⊃ ITVL such that ITVL'' is strictly long with respect to AR. With respect to a given association AR, there may be a series of different longest intervals existing along the time line.

Definition 2.5: Given a set of time-stamped transactions (D) over a time domain (T), a known association (AR), minimum support (min_supp), minimum confidence (min_conf), and minimum interval length (min_ilen), the problem of mining valid time periods is to find all possible longest intervals with respect to the association AR.

Suppose time domain T = {G1, G2, ..., Gn}, where Gi (1 ≤ i ≤ n) is a granular interval. The set of time-stamped transactions D is ordered by timestamps and is partitioned into {D(G1), D(G2), ..., D(Gn)}. The search problem can be considered as successively looking for all longest sequences along the time domain sequence {G1, G2, ..., Gn}. For each possible longest interval, the search can be performed in two steps: 1) find its seed interval; 2) extend this seed interval to the corresponding longest interval.

Definition 2.6: Let ITVL = {Gi, Gi+1, ..., Gj} be an interval. ITVL is called a seed interval if it satisfies the following conditions: 1) it is a strictly long interval; 2) no strictly long interval starting before Gi covers ITVL; and 3) no other interval covered by ITVL satisfies the previous two conditions. For example, let min_ilen be 3 and assume that ITVL1 = {G5, G6, G7, G8, G9} and ITVL2 = {G7, G8, G9, G10, G11, G12} are two longest intervals; then {G7, G8, G9, G10} could be a seed interval of ITVL2 if there is no other strictly long interval covering it. However, although {G7, G8, G9} is strictly long, it cannot be a seed interval of any longest interval since ITVL1 covers it (condition 2). Also, {G7, G8, G9, G10, G11} is not regarded as a seed interval because {G7, G8, G9, G10} is a seed one (condition 3).

Proposition 2.1: Let ITVL = {Gi, Gi+1, ..., Gj} be an interval. If ITVL is a seed interval, there must be one and only one longest interval that covers ITVL, and this longest interval starts from Gi (this holds due to Definitions 2.4 and 2.6).
This says that if we can find all the seed intervals, we can extend them to get all the longest intervals. The questions are: how to find the seed interval and how to extend it to the longest interval. Let’s answer the second question first. If ITVL is a seed interval, then the corresponding longest interval can be derived from ITVL as follows:
1) If the last granular interval of ITVL is the last granular interval along the time domain, output ITVL (which is obviously a longest interval) and terminate the search.

    ... p j - min_ilen + 1 ) break;                /* found a seed {Gptr1, ..., Gptr2} */
    (11) else if ( i = j - min_ilen + 1 && j = n ) exit;   /* no any more seed */
    (12) else {
    (13)   for (k = ptr1; k ≤ i; k++) do OUT(G_QUEUE);
    (14)   if ( i = j - min_ilen + 1 )
    (15)     { j++; SCAN(D[Gj]); IN(G_QUEUE, Gj); }
    (16)   ptr1 = i + 1; ptr2 = j;
    (17) }}
    (18) for ( ptr2 ≤ n ) do {                     /* looking for the next longest interval */
    (19)   if (ptr2 = n) {
    (20)     OUTPUT({Gptr1, ..., Gptr2});           /* found a longest interval */
    (21)     exit;
    (22)   }
    (23)   i = ptr1; j = ptr2 + 1;
    (24)   for ( i ≤ j - min_ilen + 1 ) do {
    (25)     if ( NotValid({Gi, ..., Gj}) ) break;
    (26)     i++;
    (27)   }
    (28)   if ( i ≤ j - min_ilen + 1 ) {
    (29)     OUTPUT({Gptr1, ..., Gptr2});           /* found a longest interval */
    (30)     for (k = ptr1; k ≤ i; k++) do OUT(G_QUEUE);
    (31)     if ( i = j - min_ilen + 1 )
    (32)       { j++; SCAN(D[Gj]); IN(G_QUEUE, Gj); }
    (33)     ptr1 = i + 1; ptr2 = j;
    (34)     break;
    (35)   }
    (36)   else {                                   /* extending {Gptr1, ..., Gptr2+1} with Gj */
    (37)     SCAN(D[Gj]); IN(G_QUEUE, Gj);
    (38)     ptr2 = j;
    (39)   }
    }}
Figure 2.3 Search Algorithm for Longest Intervals (LISeeker)
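The listing relies on the functions SCAN and NotValid, described next, together with the queue G_QUEUE of per-granule counts. A minimal Python sketch of the counting and of the validity test they implement (the function names, the item-set encoding of transactions, and the example rule are assumptions of the sketch, not the authors' code):

    def granule_counts(partition, rule):
        """One pass over the transactions of a single granular interval (the role
        of SCAN): count all transactions, those containing the rule body, and
        those containing both body and head."""
        body, head = rule
        trans_num = body_num = rule_num = 0
        for items in partition:
            trans_num += 1
            if body <= items:          # the body is a subset of the transaction
                body_num += 1
                if head <= items:
                    rule_num += 1
        return trans_num, body_num, rule_num

    def interval_is_valid(counts, min_supp, min_conf):
        """Validity of an interval {Gi, ..., Gj} from the per-granule counts kept
        in the queue (NotValid is simply the negation of this test)."""
        trans = sum(c[0] for c in counts)
        body = sum(c[1] for c in counts)
        rule = sum(c[2] for c in counts)
        support = rule / trans if trans else 0.0
        confidence = rule / body if body else 0.0
        return support >= min_supp and confidence >= min_conf

    # Hypothetical usage: rule "bread -> butter" over two daily partitions.
    rule = (frozenset({"bread"}), frozenset({"butter"}))
    day1 = [{"bread", "butter"}, {"bread"}, {"milk"}]
    day2 = [{"bread", "butter"}, {"butter"}]
    queue = [granule_counts(day1, rule), granule_counts(day2, rule)]
    print(interval_is_valid(queue, min_supp=0.3, min_conf=0.6))   # True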
Function SCAN passes over all the transactions in D[Gi] counting the number of those transactions, the number of the transactions containing the body of AR, and the number of transactions containing both the body and head of AR. Function NotValid
checks if the interval {Gi, ..., Gj} is valid in terms of the given minimum support and confidence. Since the relevant counts (trans_num, body_num, rule_num) in each data partition D[Gk] (i ≤ k ≤ j) have been recorded in G_QUEUE, the support and confidence of AR in D[{Gi, ..., Gj}] can be computed from the sums of those relevant counts. The function OUTPUT converts the longest interval that was found in the form {Gptr1, ..., Gptr2} into a time period described by an understandable representation.

3. Discovery of Longest Periodicities

Given a time-stamped database and a known association, another temporal feature is a set of regular intervals in cycles, during each of which this association exists. A periodic time can be represented as a triplet <CY, GR, RR>. Cycle (CY) is the length (given by a calendar) of a cycle, Granule (GR) is the duration (given by a calendar) of a granular interval, and Range (RR) is a pair of numbers <x, y> which give the position of regular intervals in the cycles. Given a periodic time PT = <CY, GR, <x, y>>, its interpretation Φ(PT) = {P1, P2, ..., Pj, ...} is regarded as a set of intervals consisting of the x-th to y-th granular intervals of GR, in all the cycles of CY. If we partition the time domain T by CY and express it as {C1, C2, ..., Cj, ...} (where Cj is an interval of CY), we have Pj ⊆ Cj (for any j > 0). For example, let PT = <Year, Quarter, <4, 4>>; then T can be expressed as {year1, year2, ..., yearj, ...} and Φ(PT) as {Q1, Q2, ..., Qj, ...} (Qj is the last quarter of year j).

Definition 3.1: Given an association AR, a periodic time PT = <CY, GR, RR> is valid with respect to AR if no less than min_freq% of the intervals in Φ(PT) are strictly long with respect to AR.

Definition 3.2: Given an association AR, a periodic time PT = <CY, GR, RR> is longest with respect to AR if PT is valid with respect to AR, and there is no PT' = <CY, GR, RR'> such that RR' ⊃ RR and PT' is valid with respect to AR.

Definition 3.3: Given a set of time-stamped transactions (D) over a time domain (T), a minimum support (min_supp), a minimum confidence (min_conf), a minimum frequency (min_freq), a minimum interval length (min_ilen), as well as the cycle of interest (CY) and granularity (GR), the problem of mining the periodicities of a known association (AR) is to find all possible periodic times <CY, GR, RR> which are longest with respect to the association AR. Here, RR is expected to be discovered.

According to the above, the cyclicity (CY) and granularity (GR) of the periodic time that are of interest are given. So, we can suppose time domain T = {C1, C2, ..., Cm}, where Ci (1 ≤ i ≤ m) is a cycle, so that the data set D can be partitioned into {D[C1], D[C2], ..., D[Cm]}. The search can be decomposed into two sub-problems: 1) search for all the longest intervals over each Ci from D[Ci]; 2) derive the possible periodicities from all longest intervals found in each cycle Ci. The algorithm in section 2 can be used for the search for the longest intervals over each Ci from dataset D[Ci]. We only focus on the second sub-problem: how to derive the periodic time from all longest intervals that are found in each cycle Ci. We use Ci.ITVLSET to express the set of all longest intervals found in each cycle Ci. The algorithm (PIDeriver) used for the derivation is based on the following steps:

1) Scanning each Ci.ITVLSET and adding all longest intervals that are found into an ordered list A_LIST, which is ordered by the starting point and the ending point of the interval. Intervals in A_LIST are called essential intervals.

2) Looking for all candidate intervals by splitting essential intervals in A_LIST.
If any two intervals in A_LIST intersect and the intersection is long enough, then the intersection is added into the candidate interval set C_LIST.
3) For each candidate interval in C_LIST, counting the number of cycles in which there exists a longest interval covering this interval, computing the frequency for this candidate interval, and removing it from C_LIST if it does not satisfy the minimum frequency (min_freq%). For each interval ITVLi in C_LIST, removing it from C_LIST if there is another interval ITVLj in C_LIST with ITVLi ⊆ ITVLj.

4. Implementation and Experimental Results

The algorithms described have been implemented in a prototype mining system [3]. The kernel of the system is a temporal mining language [3], which has been integrated with SQL on the basis of ORACLE. For testing the performance of the algorithms, we generated three datasets that mimic the transactions within one year in a retailing application. Each transaction is stamped with the time instant at which it occurs. We ran the algorithm LISeeker to look for longest intervals of a given association of items with fixed interval granularity, minimum support and minimum confidence, but different minimum interval lengths. The results show that no matter how large the given minimum interval length is, the elapsed CPU times differ only slightly. The expense for the search is mostly spent on scanning the database, and the database is scanned only once regardless of the minimum interval length. Therefore, the search time depends almost exclusively on the size of the dataset. The elapsed CPU time rises almost linearly with the sizes of the datasets. Since the search for longest periodicities is based on algorithm LISeeker and the cost of running PIDeriver is almost negligible compared with the cost of running LISeeker, its performance is very similar to that of the search for longest intervals.

5. Conclusions and Future Work

This paper concentrated on the identification of interesting temporal features (valid period and periodicity) of association rules. Based on the concepts of long intervals and longest periodicities, the mining problems were defined and the search techniques were discussed with the corresponding algorithms. We believe that the identification of similar temporal features of other types of patterns can occur naturally within the same framework. Work is now concentrating on the development of algorithms for the identification of similar temporal features for the different types of patterns. An interactive temporal data mining system for supporting the described tasks has been developed with an appropriate SQL-based language [3]. It is currently being extended to support other mining tasks.

References
1. Agrawal, R., Imielinski, T., and Swami, A., Mining Associations between Sets of Items in Massive Databases, Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington D.C., May 1993.
2. Chen, X., Petrounias, I., and Heathfield, H., Discovering Temporal Association Rules in Temporal Databases, Proceedings of the International Workshop on Issues and Applications of Database Technology (IADT'98), Berlin, Germany, July 1998.
3. Chen, X. and Petrounias, I., A Framework for Temporal Data Mining, Proceedings of the 9th International Conference on Database and Expert Systems Applications (DEXA'98), Vienna, 1998.
4. Ramaswamy, S., Mahajan, S., and Silberschatz, A., On the Discovery of Interesting Patterns in Association Rules, Proceedings of the 24th VLDB Conference, New York, pp. 368-379, 1998.
The Improvement of Response Modeling: Combining Rule-Induction and Case-Based Reasoning

F. Coenen, G. Swinnen, K. Vanhoof and G. Wets
Limburg University Centre, Department of Applied Economics, B-3590 Diepenbeek, Belgium
{filip.coenen;gilbert.swinnen;koen.vanhoof;geert.wets}@luc.ac.be
Abstract. Direct mail is a typical example of an application in which response modeling is used. In order to decide which people will receive the mailing, the potential customers are divided into two groups or classes (buyers and non-buyers) and a response model is created. Since the improvement of response modeling is the purpose of this paper, we suggest a combined approach of rule-induction and case-based reasoning. The initial classification of buyers and non-buyers is done by means of the C5-algorithm. To improve the ranking of the classified cases, we introduce in this research rule-predicted typicality. The combination of these two approaches is tested for synergy by elaborating a direct mail example.
1 Introduction
One of the most typical examples where response modeling comes into play is direct mail. This marketing application goes further than just sending product information to randomly chosen people, as mass marketing does. A key characteristic of direct mail is that a specific market or geographic location is targeted, while selecting receptors by age, buying habits, interests, income, etc. In order to decide which people will receive the mailing, the potential customers are divided into two groups or classes: buyers and non-buyers. This division, which is based upon the above-mentioned socio-demographic and/or economic information of the potential customers, is called response modeling and can be realized by means of artificial intelligence. A learning algorithm is then applied to predict the class of unseen cases or records, i.e. possible customers. As known from the literature, the accuracy of the prediction never reaches 100%, as there are always cases attributed to the wrong class. When applied to the direct mail example again, this means that, at a given mailing depth, there are always people receiving mail concerning products that appear uninteresting to them while buyers are left out of the mailing. As a consequence, costs are made that can be avoided, e.g. by creating a better response model. One way to come to a better response model would be by choosing a better classifier [8]. In this paper, however, we suggest another approach, i.e. the combination of multiple classification methods. The remainder of this paper is organized as follows. In the next section, a theoretical background of the performed approach will be discussed and a following
section deals with the empirical evaluation of the suggested approach. To illustrate this, a direct mail example is further elaborated. The last section will be reserved for conclusions and topics for future research.
2 Suggested Approach

2.1 Classifiers
C5. One of the possible classifiers that can be used in the response modeling of a data set is the C5-algorithm, the more recent version of C4.5 [10]. The reason that we preferred this algorithm is based upon previous research by Van den Poel and Wets [14]. They used the same data set as we did to provide a comparison between a number of classification techniques. They selected techniques in the field of statistical, machine learning and neural network applications, and compared them by means of the overall accuracy on the data set. We preferred to use the C5-algorithm to do the initial classification, since this algorithm attained the highest accuracy on the test set (see also section 3.2). The goal of response modeling is to rank the cases by probability of response. Since each case is classified with a certain confidence by C5, the most trivial way to rank the cases would be by the confidence figure of the applied rule. This means that when the assigned class label is the non-responding class, the complement of the confidence should be taken before sorting the whole data set on this confidence. However, as mentioned before, we propose in this paper another method to improve response modeling, as will be explained.

Case-Based Reasoning. Case-based reasoning methods are based on similarity and try to use the total information of a given unknown case. In our research, we used typicality as similarity measure. To determine the typicality of each case in the context of this research, the following approach was used.

a) Firstly, for each case i a distance measure dist(i, j) is determined as follows: the attribute values of i are compared with the attribute values of a case j ≠ i. If the values of the considered cases differ, dist(i, j) is increased by one (independent of the size of the difference).

b) After determining dist(i, j) according to the above-mentioned method, the class value of the cases i and j is compared. If i belongs to the same class as j, a measure intra(i) is increased by (1 - (dist(i, j) / number of attributes)). If, on the other hand, both cases belong to a different class, a measure inter(i) is increased by (1 - (dist(i, j) / number of attributes)). The above calculations are made for all the cases j ≠ i. This is the point where the global character of our approach comes into play, since all other cases j ≠ i are taken into account in the calculation of the typicality of just one case i.

c) In a next step the measure intra(i) is divided by the number of cases that belong to the same class as i, and inter(i) is divided by the number of cases that belong to the other class.
d) Finally, the typicality of case i is determined by dividing intra(i) by inter(i). For each case the typicality was calculated, allowing these cases to be ranked by this measure. The above steps lead to the following definition:

    Typicality(i) = (intra(i) / p) / (inter(i) / n)    (1)

with p the number of cases that belong to the same class as case i, and n the number of cases that belong to the other class. The cases with typicality higher than 1 are considered as typical cases for the class they belong to. Used as a classifier, this method looks at the similarity between the considered case and the different classes and assigns the label of the most similar class to the case. In response modeling, however, it is sufficient to look at the similarity between the considered case and the responding class, in order to use this similarity as a ranking criterion.

A Combined Approach. As is known from previous research, the accuracy of such a model almost never reaches 100% for real world cases. Also in our direct mail example, the classification of buyers and non-buyers was not completely correct; some non-buyers were classified in the class of the buyers, and vice versa. Since the accuracy of our model attained 76.32%, 23.68% of the cases were misclassified. The fact that errors are made implies that there is room left for improvement if we choose not to write to all persons in the data set, as is often the case in direct mailing. This will further be explained in section 3.3. In order to upgrade the response model and have more control over these mislabeled instances, we decided to rank the classified cases. Empirical results taught us that C5 is a better classifier than typicality on the one hand, and also better than other considered classifiers on the other hand. This is why we opted for this algorithm to do the initial classification. By applying C5, a case obtains a response probability from just one rule, i.e. the rule with the highest confidence that meets the case. The other rules or cases are not taken into account. The classification by C5 can thus be considered as a local approach; only a part of the information carried by the case and the rule-base is used. In contrast to this, a case-based reasoning method displays a global character; a case obtains a response probability by looking at the total data set. Empirical results (see section 3.2) showed us that typicality outperforms confidence in ranking the cases. That is why we opted for this method to improve the ranking of the classified cases. By combining the strengths of both methods, i.e. C5 as the best classifier and typicality as the best ranker, we could investigate the effects of the combination between a global and a local approach. The new response modeling method that is suggested in this paper can then be described as follows. An unknown case obtains the class label from the C5 classifier and obtains as response measure the typicality for the given class label. The latter has as a consequence that the calculation of the assigned typicality is based on the predicted class label of the case. This typicality will further be called rule-predicted typicality. Thus, the cases are firstly ranked by class label and secondly by rule-predicted typicality.
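A direct Python transcription of the typicality measure defined in steps a)-d) and equation (1) could look as follows; the categorical attribute encoding and the toy data are illustrative assumptions, not the data set used in the paper:

    def typicality(cases, labels, i):
        """Typicality of case i following steps a)-d): (intra(i)/p) / (inter(i)/n),
        where the distance between two cases is the number of differing attribute
        values (equation (1))."""
        n_attr = len(cases[i])
        intra = inter = 0.0
        p = n = 0
        for j, (case, label) in enumerate(zip(cases, labels)):
            if j == i:
                continue
            dist = sum(a != b for a, b in zip(cases[i], case))
            similarity = 1.0 - dist / n_attr
            if label == labels[i]:
                intra += similarity
                p += 1
            else:
                inter += similarity
                n += 1
        return (intra / p) / (inter / n) if inter else float("inf")

    # Hypothetical three-attribute cases with classes 1 (buyer) and 0 (non-buyer).
    cases = [("a", "x", 1), ("a", "x", 2), ("b", "y", 1), ("b", "y", 2)]
    labels = [1, 1, 0, 0]
    print(round(typicality(cases, labels, 0), 3))   # 4.0: typical of its own class

For rule-predicted typicality, labels[i] would simply be replaced by the class label predicted by C5 for case i.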
2.2 Evaluation

To compare the ranking by typicality on the one hand with the ranking by confidence and the original situation on the other hand, we selected the Coefficient of Concordance (CoC) [6] and the cumulative response rate as objective measures, and graphs as a visualization tool. The CoC takes into account the ranking of the cases and gives a percentage as outcome. The higher the percentage, the better the sorting. The main reason for choosing this measure is that it looks at the distribution of the cases in the predicted class as a whole. Therefore, the distribution is calibrated on a 10-class rating scale. This means that the distribution is split up into 10 intervals, each with a score higher than the previous interval. The CoC is defined as follows:

    CoC = ( Σ_{i = min score}^{max score} nb_i · ng'_i  +  0.5 · Σ_{i = min score}^{max score} nb_i · ng_i ) / (ng · nb)    (2)

with nb_i, respectively ng_i, the number of badly, respectively well, classified cases with a score equal to i, and ng'_i the number of well classified cases with a score better than i. With a given mailing depth, we know how many cases will be mailed, and the different methods can be evaluated with the cumulative response rate. Further, the graphs can help us discover in which range a certain method is superior.
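Under this reading of equation (2), the coefficient can be computed with a few lines of Python; the score bins and counts below are hypothetical:

    def coefficient_of_concordance(good_by_score, bad_by_score):
        """good_by_score / bad_by_score map a score bin to the number of well /
        badly classified cases in that bin (equation (2))."""
        scores = sorted(set(good_by_score) | set(bad_by_score))
        ng = sum(good_by_score.get(s, 0) for s in scores)
        nb = sum(bad_by_score.get(s, 0) for s in scores)
        total = 0.0
        for pos, s in enumerate(scores):
            nb_i = bad_by_score.get(s, 0)
            ng_i = good_by_score.get(s, 0)
            ng_better = sum(good_by_score.get(t, 0) for t in scores[pos + 1:])
            total += nb_i * ng_better + 0.5 * nb_i * ng_i
        return total / (ng * nb)

    # Three hypothetical score bins; well classified cases concentrate in high bins.
    good = {1: 2, 2: 5, 3: 8}
    bad = {1: 6, 2: 3, 3: 1}
    print(round(coefficient_of_concordance(good, bad), 3))   # about 0.797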
3 Empirical Validation

3.1 The Data Set
The data set that was used for empirical validation was collected from an anonymous mail-order company and consists of 6800 records or cases, each record described by 15 attributes. These records are equally divided between the classes 0 (non-buyers) and 1 (buyers). The information that was available concerns transactional data, as well as socio-demographic information of the customers. All variables were categorized after careful consideration with the mail-order company. They provided the data to us at the level of the individual customer. The specific model that we have built is based on all available data, and predicts whether a person is a possible buyer or not. The outcome is a binary response variable (0/1) representing buying or not buying. Before the induction of the C5 classifier, a training set was composed by randomly selecting approximately 2/3 of the cases from the original data set. The remaining part was used for purposes of testing.
3.2 Results
Evaluation of the Classification. As mentioned in the section concerning the suggested approach, the C5-algorithm was used to classify the cases in a first step. By
applying C5, an accuracy of 76.32% was obtained on the test set, which consisted of 2052 cases (1018 buyers and 1034 non-buyers). Since the accuracy did not reach 100%, and the cases were randomly divided into a training and a test set, there are a number of incorrectly classified cases randomly divided among the predicted ones. In order to implement our approach, we separated the cases that were predicted to belong to class 0 (1120 cases) from the cases that were predicted to belong to class 1 (932 cases). This means that our model considered 1120 out of 2052 persons as non-buyers, and 932 persons as buyers. An overview of the situation in the test set after classifying by C5 is shown in Table 3.

Table 3. Confusion Matrix of the Test Set

                   Real 0   Real 1   Total
    Predicted 0       834      286    1120
    Predicted 1       200      732     932
    Total            1034     1018    2052
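As a quick consistency check, the overall accuracy quoted above can be recomputed from this matrix:

    correct = 834 + 732              # correctly predicted non-buyers and buyers (Table 3)
    total = 2052
    print(round(100 * correct / total, 2))   # 76.32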
In order to deduce a better response model, we improved the ranking of the cases by sorting them by typicality within the predicted class, under the assumption that the cases that were misclassified would have a lower typicality. This means that after sorting by typicality, the misclassified cases would appear lower in the rank than the correctly classified ones.

Evaluation of the Ranking. To compare the outcome of our experiments with the initially unsorted situation on the one hand and the sorting by confidence and rule-predicted typicality on the other hand, we used the Coefficient of Concordance. As mentioned in section 2.2, the distribution has to be ranked on a 10-class rating scale to be evaluated. To evaluate the ranking by rule-predicted typicality in the context of this research, we decided to use the rule-predicted typicality of the cases as score. This means that if the highest rule-predicted typicality of a case in the set attains 1.5, and the lowest rule-predicted typicality equals 0.5, the cases with rule-predicted typicality between 0.5 and 0.6 will be considered as belonging to the same group, and thus have the same score. This implies that the score in the definition (see section 3.2) is replaced by rule-predicted typicality. The same method is used to evaluate the sorting by confidence. The exact results of these calculations can be found in Table 4.

Table 4. The Coefficient of Concordance

                                            Predicted Class 0   Predicted Class 1
    Sorted by Confidence                                55.2%               65.9%
    Sorted by Rule-Predicted Typicality                 62.9%               65.0%
Table 4 shows us that the ranking of the test cases that were predicted to belong to class 0, as well as the test cases that were predicted to belong to class 1, becomes better after sorting by rule-predicted typicality or by confidence. In both cases the coefficient of concordance is higher than 50%, i.e. the percentage that can be
expected by a random division of the misclassified cases among the correctly classified cases. If the sorting by rule-predicted typicality is compared with the sorting by confidence, a difference between the predicted class 0 and the predicted class 1 can be noticed. For the predicted class 0, the rule-predicted typicality produces a better result since the coefficient of concordance equals 62.9% against 55.2% after sorting by confidence. This observation is further illustrated by Figure 1; the rule-predicted typicality curve is less steep over a larger distance than the confidence curve. The X-axis shows the number of cases in the predicted class 0, whereas the Y-axis shows the number of misclassifications as they appear gradually among the considered cases.

Fig. 1. The appearance of the errors among the cases that are predicted to belong to class 0. The gray colored graph represents the occurrence of errors among the cases that were predicted to belong to class 0 for the unsorted situation. The black and the bold black graph describe the same after sorting by confidence, respectively rule-predicted typicality.
3.3 Application on the Direct Mail Example
In normal circumstances, a mail-order company will try to cut off between 10% and 40% of the unattractive part of its mailing list. This means that between 60% and 90% of all the persons in the data set will receive the mailing. Often, a mailing depth of 75% is used [14]. The reason for this is that the profit generated by converting a non-buyer into a buyer is considered higher than the cost of sending a letter to a person that is not interested in the products that are the subject of the mail. In our further calculations, we will also consider a mailing depth of 75% of the test set (0.75 * 2052 = 1539 persons). To reach these people, we will direct a letter to all the persons that are considered as buyers by our system, i.e. 932 persons, of which 732 are classified right and thus are buyers in reality. 1539 - 932 = 607 persons from the predicted class 0 will complete this number so that a total of 1539 persons are reached. Applied to this direct mail example, the sorting by confidence and rule-predicted typicality produced the following results.
Sorting by Confidence. The predicted non-buyers were sorted by increasing confidence. As the non-buyers with low confidence are more likely to be misclassified than the ones that were predicted to be non-buyers with a high confidence, the 607 non-buyers with the lowest confidence are included in the mailing list. Among these 607 persons there were 169 buyers. This means that by mailing 1539 persons, we would reach 732 + 169 = 901 buyers out of the 1018 buyers (88.5%) that are present in the test set.

Sorting by Rule-Predicted Typicality. Analogously to the sorting by confidence, we sorted the cases that were predicted to belong to class 0 by increasing rule-predicted typicality and included the 607 persons with the lowest rule-predicted typicality in the mailing list. Among these 607 persons there were 194 buyers, so that we would reach 732 + 194 = 926 buyers out of 1018 (91%) by mailing 1539 persons.

Unsorted Situation. To illustrate the improvement that is made by sorting the cases of the predicted class, we finally give an overview of the situation as it would be without any sorting. Among the 607 persons of the predicted class 0, there would be approximately 0.26 * 607 = 158 buyers, since 286 out of 1120 (about 26%) cases were misclassified and the errors are randomly divided in the predicted class. This means that we would reach 732 + 158 = 890 buyers out of 1018, or 87.4%. An overview of the results can be found in Table 5.

Table 5. The number of reached buyers

    Unsorted   Sorted by Confidence   Sorted by Rule-Predicted Typicality
      87.40%                 88.50%                                91.00%
The fact that the improvement after ranking by confidence is limited to 1.10% shows us that the sorting of the classified cases is a difficult topic. Our approach proved to be a useful one, since it outperforms sorting by confidence by an improvement more than twice as high (2.50%) as the existing improvement of 1.10%.
4 Conclusions
This article describes a method for improving response modeling by using a combined approach of rule-induction and case-based reasoning. The proposed approach consists of classifying the cases by means of the C5-algorithm in a first step, and ranking the classified cases by a typicality measure in a second step. In this way, we could test the combination of local and global information for synergy. Based on empirical results we decided that the C5-algorithm was the best classifier to do the initial classification. This algorithm provides the local aspect of our approach, since it classifies each case by just one rule, i.e. the rule with the highest confidence that meets the case. The other rules or cases are not taken into account. In contrast to this, a case-based reasoning approach displays a global character, since a case obtains a response probability by looking at the total data set. Empirical results showed us that sorting by typicality was the best method to improve the ranking of the
classified cases. To do so, we introduced the concept of rule-predicted typicality, as the calculation of the typicality of a test case is based on the predicted class value of the considered case. Finally, the application of our approach to a direct mail example has shown this method to be a promising one. It proves to yield an improvement of 2.50% over the improvement of 1.10% that is generated by the ranking of the classified cases by the existing confidence figures. This implies that we were able to reach 91% of the buyers in our test set, under the consideration of a mailing depth of 75%. Although this is only a small improvement in absolute terms, the total success of a direct mail can depend on it. Since we were not able to test this approach on more than one data set so far, an opportunity for future work lies within this topic.
References
1. Aijun, A., Cercone, N.: Multimodal Reasoning with Rule Induction and Case-Based Reasoning, in Multimodal Reasoning, AAAI Press (1998)
2. Bayer, J.: Automated Response Modeling System for Targeted Marketing (1998)
3. Brodley, C.E. and Friedl, M.A.: Identifying and eliminating mislabeled training instances, in Proceedings of the Thirteenth Nat. Conference on Artificial Intelligence, AAAI Press (1996)
4. Domingos, P.: Knowledge Discovery via Multiple Methods, IDA, Elsevier Science (1997)
5. Domingos, P.: Multimodal Inductive Reasoning: Combining Rule-Based and Case-Based Learning, in Multimodal Reasoning, AAAI Press (1998)
6. Goonatilake, S., Treleaven, P.: Intelligent Systems for Finance and Business, Wiley (1995) 42-45
7. Holte, R., Acker, L.E., Porter, B.W.: Concept learning and the problem of small disjuncts, in Proceedings of the Eleventh Int. Joint Conference on AI, Morgan Kaufmann (1989) 813-818
8. Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms, Thirteenth National Conference on Artificial Intelligence (1996)
9. Ling, C.X., Li, C.: Data Mining for Direct Marketing: Problems and Solutions, in Proceedings of the Fourth Int. Conference on Knowledge Discovery and Data Mining, AAAI Press (1998) 73-79
10. Quinlan, J.R.: C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo (1993)
11. Sabater, J., Arcos, J.L. and Lopez de Mantaras, R.: Using Rules to Support Case-Based Reasoning for Harmonizing Melodies, in Multimodal Reasoning, AAAI Press (1998)
12. Surma, J. and Vanhoof, K.: Integrating Rules and Cases for the Classification Task, Case-Based Reasoning, Research and Development, First International Case-Based Reasoning Conference - ICCBR'95, Springer Verlag (1995) 325-334
13. Surma, J., Vanhoof, K. and Limere, A.: Integrating Rules and Cases for Data Mining in Financial Databases, in Proceedings of the Ninth Int. Conference on AI Applications - EXPERSYS'97, IIIT-International (1997)
14. Van den Poel, D., Wets, G.: Data Mining for Database Marketing: a mail-order company application, in Proceedings of the Fourth International Workshop on Rough Sets, Fuzzy Sets and Machine Discovery, RSFD '96 (1996) 383-389
15. Zhang, J.: Selecting Typical Instances in Instance-Based Learning, in Proceedings of the Ninth Int. Conference on Machine Learning, Morgan Kaufmann (1992) 470-479
Analyzing an Email Collection Using Formal Concept Analysis Richard Cole1 and Peter Eklund1 School of Information Technology, Griffith University PMB 50 GOLD COAST MC, QLD 9217, Australia
[email protected],
[email protected] Abstract. We demonstrate the use of a data analysis technique called formal concept analysis (FCA) to explore information stored in a set of email documents. The user extends a pre-defined taxonomy of classifiers, designed to extract information from email documents, with her own specialized classifiers. The classifiers extract information from (i) the email headers, providing structured information such as the date received and the from:, to: and cc: lists, (ii) the email body containing free English text, and (iii) conjunctions of the two sources.
1 Formal Concept Analysis
Formal Concept Analysis (FCA) [8,3] is a mathematical framework for performing data analysis that has as its fundamental intuition the idea that a concept is described by its intent and its extent. FCA models the world as being composed of objects and attributes. The choice of what is an object and what is an attribute depends on the domain in which FCA is applied. Information about a domain is captured in a formal context, which is a triple K = (G, M, I) in which G is a set of objects, M is a set of attributes, and I ⊆ G × M is a relation saying which objects possess which attributes. A formal concept is a pair (A, B) where A is a set of objects called the extent, and B is a set of attributes called the intent. A must be the largest set of objects for which each object in the set possesses all the attributes in B; the reverse must also be true of B. More precisely, a formal concept of the context (G, M, I) is a pair (A, B), with A ⊆ G, B ⊆ M, A = {a ∈ G | ∀ b ∈ B: (a, b) ∈ I} and B = {b ∈ M | ∀ a ∈ A: (a, b) ∈ I}. The fundamental theorem of FCA states that the set of formal concepts of a formal context forms a complete lattice. This complete lattice is called a concept lattice, and is usually denoted B(K). A complete lattice is a special type of partial order in which the greatest lower bound and least upper bound of any subset of the elements of the lattice must exist. A lattice may be drawn via a line diagram (see Figures 1 and 4) [7]. For each attribute m ∈ M there is a maximal concept that has m in its intent. We shall use the function γ : M → B(K) to denote the mapping from attributes to their maximal concepts.
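To make the two derivation operators concrete, the following small Python sketch (our own illustration, not part of the paper or its software; the objects, attributes and incidence relation are invented) enumerates the formal concepts of a toy context.

from itertools import combinations

# toy formal context K = (G, M, I); names are purely illustrative
G = ["e1", "e2", "e3"]                                   # objects (e.g. emails)
M = ["from Melfyn", "mentions DSTC", "year 1994"]        # attributes
I = {("e1", "from Melfyn"), ("e1", "mentions DSTC"),
     ("e2", "mentions DSTC"), ("e3", "year 1994")}

def extent(B):
    """Objects possessing every attribute in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

def intent(A):
    """Attributes shared by every object in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

# every formal concept (A, B) satisfies A = extent(B) and B = intent(A), so
# closing each attribute subset yields the full set of concepts
concepts = set()
for r in range(len(M) + 1):
    for B in combinations(M, r):
        A = extent(B)
        concepts.add((A, intent(A)))

for A, B in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(A), sorted(B))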
Fig. 1. (a) A concept lattice with the implication m1 ∧ m2 → m3. (b) A concept lattice with the partial implication Pr(m1 | m2) = 15/44.
A concept lattice, generally denoted B(K), is representationally equivalent to the attribute logic that exists over the attributes in the context. For example (see Fig. 1(a)), the proposition ∀g ∈ G: p1(g) ∧ p2(g) → p3(g), where pi(g) means that object g has attribute mi, is true if and only if the greatest lower bound of γ(m1) and γ(m2) is greater than or equal to γ(m3) in the concept lattice. Since this information is represented diagrammatically it is more accessible than a list of conditional probabilities. Labels attached above each concept in the lattice (see Fig. 1(b)) show the introduced intent¹, while the labels attached below show the size of the extent of the concept. It is possible to determine the intent of a concept by collecting all intent labels on an upward path from the concept. For example the concept labeled M3 in Fig. 1(b) also has M2 and M1 in its intent. The concept lattice can also represent partial implications or conditional probabilities [5]. For example, if we wanted to know the probability that an object from G has m1 given that it has m2, denoted Pr(m1 | m2), this would be given by |Ext(γ(m1) ∧ γ(m2))| / |Ext(γ(m2))| = 15/44. Both numbers in this ratio are present in the diagram (see Fig. 1(b)).
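As a worked illustration of reading such a partial implication off the diagram (using only the extent sizes printed as labels in Fig. 1(b); nothing here comes from real data):

ext_m2 = 44          # |Ext(gamma(m2))|, label below the concept introducing m2
ext_m1_m2 = 15       # |Ext(gamma(m1) AND gamma(m2))|, label below their meet
print(ext_m1_m2 / ext_m2)   # Pr(m1 | m2) = 15/44, about 0.34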
2 Background
Previous applications of FCA may be divided into two categories: those that generate a large concept lattice of all terms and documents (the number of concepts is roughly the square of the number of documents) and those that employ conceptual scaling. Godin et al. [4] proposed navigating through a set of text documents via a large concept lattice. Each concept visited was presented in a window listing its intent, and the user moved to subsequent nodes via selection of additional terms. Carpineto and Romano [1] proposed navigation in a concept lattice of all terms using a fish-eye viewer.
1 The introduced intent is that part of the intent that is not found in the intent of any more general concepts.
Rock and Wille [6] implemented a system for a library catalogue, using a software system called TOSCANA, in which a large number of sub-lattices were designed by a subject librarian. A visitor to the library could choose a "theme" previously defined by the librarian. This was seen as an advantage in the library environment, since the users of the system are generally unfamiliar with reading lattice diagrams. In a sense these approaches represent different extremes: in the first the user has a maximum number of choices when navigating, while in the second the user is presented with carefully constructed views of the data. The approach outlined in this paper attempts to strike a middle road, allowing the user to construct and modify scales in response to learning information about the data. It is novel in that it allows the user to define a hierarchy over the search terms and presents a dynamic environment for the creation and modification of scales.
3 Hierarchy of Classifiers
The task of extracting information from text documents is a difficult one. The language used in email documents is often informal, makes extensive use of abbreviations, and is highly contextualized. For this reason we do not attempt to do any deep extraction of information from email texts but rather recognize key terms. We experimented with classifiers that recognize regular expressions, either from email headers, or within the body of the email itself.
 1  CLASSIFICATION "Email-Analysis"
 2    ...
 3    "Year 1994 - Sep:Nov"
 4      Date: "^[A-Z][a-z]+, [0-9]+ (Sep|Nov) 1994" ;
 5    ...
 6    "Melfyn mentions Barbagello"
 7      From: melfyn Body: David ;
 8    "Mentions Melfyn"
 9      Body: [mM]elfyn ;
10    ...
11  END CLASSIFICATION

(a)

1  BEGIN ORDER "Email-Analysis"
2    ...
3    "Melfyn mentions Barbagello" < "Mentions Barbagello" ;
4    "From Melfyn" < "From DSTC" ;
5    "From DSTC" < "DSTC" ;
6    ...
7  END ORDER

(b)
Fig. 2. (a) Classifiers: a file expressing classifiers for the terms of interest via regular expressions. (b) Hierarchy: a file expressing the hierarchical ordering of classifiers.
The example in Figure 2(a) shows a portion of the classifier file, generated by our taxonomy editor. Lines 3 and 4 show the definition of a classifier. The classifier recognizes the attribute whose name is “Year 1994 - Sep:Nov”. It matches the date field of an email message with a regular expression that recognizes
dates between September and November of 1994. Lines 6 and 7 show a classifier that detects emails sent by "Melfyn" in which "David" is mentioned within the text of the email. The classifiers associate attributes with emails, and the result is stored in an inverted file index. A hierarchy is defined by a set of subsumption rules specified by the user. For instance, a portion of the example file produced by our taxonomy editor is presented in Fig. 2(b). Line 3 introduces the implication that an email with "Melfyn mentions Barbagello" implies that the email "Mentions Barbagello". Associated with each attribute is a primary name, e.g. "Melfyn mentions Barbagello", and a set of descriptions; for example "Melfyn mentions Barbagello" might also be recognized by "Melfyn Lloyd mentions David Barbagello". These extra descriptions are used in the next section in which the data is explored.
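The following Python sketch illustrates the general mechanism of applying such regular-expression classifiers to email fields and recording the results in an inverted index. It is our own illustration of the idea, not the authors' implementation; the field names, classifier table and example emails are assumptions made for the example.

import re

# classifier name -> (email field, regular expression); patterns follow Fig. 2(a)
classifiers = {
    "Year 1994 - Sep:Nov": ("Date", r"^[A-Z][a-z]+, [0-9]+ (Sep|Nov) 1994"),
    "Mentions Melfyn":      ("Body", r"[mM]elfyn"),
}

def classify(email):
    """Return the set of attribute names recognized for a single email."""
    return {name for name, (field, pattern) in classifiers.items()
            if re.search(pattern, email.get(field, ""))}

def build_inverted_index(emails):
    """Map each attribute to the set of email ids that possess it."""
    index = {}
    for eid, email in emails.items():
        for attr in classify(email):
            index.setdefault(attr, set()).add(eid)
    return index

emails = {1: {"Date": "Mon, 12 Sep 1994", "Body": "Lunch with Melfyn tomorrow."},
          2: {"Date": "Tue, 3 Mar 1995",  "Body": "Minutes of the KVO meeting."}}
print(build_inverted_index(emails))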
4 Conceptual Scaling
After defining a taxonomy of attributes which are associated with email documents by classifiers defined in the previous section, it is necessary for the user to choose a small number of attributes, usually less than 10 [2]. The user does this with a specific question in mind and with the aid of the program depicted in Fig. 3.
Fig. 3. Conceptual Scale Creation Tool
The user is interested in the ways in which the attributes combine in the email collection. For example she might be interested in emails "from Melfyn", "about the DSTC", and "mentioning Barbagello". She searches for appropriate attributes based either (i) on their description or (ii) on their location in the hierarchy.
To locate an attribute via its description, the user may enter a text search. For instance "Melfyn" would match all attributes having a description containing a word with the prefix "Melfyn". Alternatively, the attribute "From Melfyn" might be located as an element immediately below "From DSTC Personnel". In either case, the results of the search operation are displayed in the lower right hand panel of the tool shown in Fig. 3, and clicking on an attribute adds it to the diagram. The conceptual scale is a subset of the attributes selected by the user, displayed in the left hand panel of the tool shown in Fig. 3. It is desirable for the user to gain an impression of how the attributes are related with respect to the taxonomical ordering defined in the previous section. We represent the attributes using a Hasse diagram (see Figure 1(a)). Using a Hasse diagram to represent diagrammatically the relative ordering of a subset of elements from an ordered set raises a number of questions: should we preserve (i) the covering relation, (ii) the ordering relation, and (iii) meets and joins where they exist? Preserving the covering relation, while straightforward, is cumbersome since it produces long chains in the diagram and introduces a large number of extra elements. Preserving the ordering relation by itself would, for many queries, produce a single anti-chain². We preserve the ordering relation, which we then close under join. This induces a new covering relation computed in response to updates to the diagram. The diagram is drawn automatically using a force-directed placement heuristic that attempts to minimize change to the diagram. All changes to the diagram are animated to help the user preserve their mental map of the diagram. The user can remove join-irreducible elements (those not required by the join-closure requirement) from the diagram. An attempt to remove a join-reducible element results in user feedback showing the elements preventing its removal. After the user has selected attributes to the required level of specificity, the scale may be used to construct a concept lattice showing the concepts generated by the emails and the attributes selected by the user.
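A rough Python sketch of the "keep the ordering and close it under join" step follows. It is our own reconstruction of the idea, not the tool's code: the taxonomy, the artificial top element "Email" and the attribute names are assumptions made for the example, and joins are computed naively as unique minimal upper bounds.

# user-defined subsumption pairs (lower, upper); invented for illustration
taxonomy = {
    ("Melfyn mentions Barbagello", "Mentions Barbagello"),
    ("From Melfyn", "From DSTC"),
    ("From DSTC", "DSTC"),
    ("Mentions Barbagello", "Email"),   # "Email" is an assumed artificial top
    ("DSTC", "Email"),
}
elements = {a for pair in taxonomy for a in pair}

def leq(a, b):
    """a <= b in the reflexive-transitive closure of the taxonomy."""
    return a == b or any(leq(c, b) for (x, c) in taxonomy if x == a)

def join(a, b):
    """Least upper bound of a and b, or None if it does not exist."""
    ubs = {u for u in elements if leq(a, u) and leq(b, u)}
    minimal = [u for u in ubs if not any(v != u and leq(v, u) for v in ubs)]
    return minimal[0] if len(minimal) == 1 else None

def close_under_join(selection):
    closed = set(selection)
    added = True
    while added:
        added = False
        for a in list(closed):
            for b in list(closed):
                j = join(a, b)
                if j is not None and j not in closed:
                    closed.add(j)
                    added = True
    return closed

print(close_under_join({"From Melfyn", "Melfyn mentions Barbagello"}))
# adds "Email", the join of the two selected attributes in this toy taxonomy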
5 Analyzing the Lattice
Fig. 4 shows the concept lattice resulting from the terms identified in the scaling process shown in Fig. 3. The diagram is expressed as a lattice product. Each of the small black circles represents a concept; all the concepts in the bottom large oval have "Philippe" in their intent. In other words, these concepts refer to emails associated with Philippe via one or more of the classifiers defined in Fig. 2(a). Compare the numeric labels of the top two concepts in each of the ovals. "793" is the number of emails associated with the term "Philippe", while "5022" is the total number of emails in the test set. We infer from this that 793/5022, or 15.7%, of all emails are associated with "Philippe". Moving down from the top numeric label "5022" to its child "4108" (also labeled "Groups") reveals that 81% of all emails in the test set are associated with the term "Groups".
2 An anti-chain is a set of elements that have no relative ordering.
Fig. 4. Analysis of the relationships between the terms "Mention Stephen", "KVO", "DSTC", "DSTO", "Melfyn" and the target "Philippe".
Similarly, if we move down from the top label "793" in the lower oval to its child "382" we see that only 48% of emails associated with "Philippe" concern "Groups"; a possible reason is that much of the email traffic associated with "Philippe" is not group related. "Groups" are further divided into group categories: these can be read from the top oval, identified with the labels "KVO", "DSTC", and "DSTO". Note that "KVO" accounts for the majority of group email traffic, with 3589/4108 = 87.3%. Emails associated with both of the group labels "DSTO" and "DSTC" are read from the label "44" in the top oval. Finding the corresponding circle in the lower oval we notice that it is grey. This represents an implication. We move from the grey circle down through the lattice to the label "24". This point includes the extra attribute "KVO". The inference is that emails associated with "DSTC", "DSTO" and "Philippe" are all associated with the term "KVO". The diagram can be viewed as a three-dimensional space containing thematic planes. Consider the plane defined by the labels 109, 17, 12, 11, 14, 57 in the upper oval of Fig. 4. This plane represents the impact of the term "Mention Stephen" on the other named terms "KVO", "DSTC", "DSTO", "Melfyn". The plane is parallel to two other planes (those above it, to the right) and by considering corresponding points in each of these planes we can measure the influence of the term "Mention Stephen" on the way emails in the test set are partitioned by that term. A more specific inference concerns the points labeled "11" and "12" in the upper oval. "12" is the number of emails associated with the combination of "Mention Stephen" and "Mention Melfyn". Therefore, 11/12 of these emails
also mention the term "KVO". The inference is that there is a high correspondence between the use of "KVO" and emails involving "Melfyn" and "Stephen". Moving to the bottom circle in the lower oval, labeled "10", we infer that 10/11 of these emails also mention "Philippe". In summary, less than half of the email associated with "Philippe" is group related (382/793). When "Philippe" is mentioned in the context of a group it mostly concerns the "KVO" group (344/382 = 90%). Emails associated with both the "DSTC" and "DSTO" groups that are "Philippe" related always mention the "KVO" group (24/24, inferred via the grey circle). Finally, "Philippe" is mentioned in 24/44 (55%) of emails involving both "DSTC" and "DSTO", but is mentioned in 83% of correspondence mentioning "Stephen" and "Melfyn". We can draw the inference from the analysis of this email data that "Philippe" is the important factor of common interest between "Stephen" and "Melfyn". It is also clear from the email analysis that "Philippe" is a more important topic of discussion for the "KVO" group (344/382) than for "DSTC" (53/382) and "DSTO" (106/382) generally.
6 Conclusions
This paper has described the use of a suite of tools designed to allow an investigation of data retrieved from email. The data is retrieved from the emails with the aid of a hierarchy of classifiers that extract useful terms and encode known implications. Further implications, both complete and partial, are then investigated by means of a nested line diagram.
References
1. C. Carpineto and G. Romano. A lattice conceptual clustering system and its application to browsing retrieval. Machine Learning, 24:95-122, 1996.
2. R. Cole and P. Eklund. Scalability of formal concept analysis. Computational Intelligence, 15(1):11-27, 1999.
3. B. Ganter and R. Wille. Formal Concept Analysis: Logical Foundations. Springer Verlag, 1999.
4. R. Godin, J. Gecsei, and C. Pichet. Design of a browsing interface for information retrieval. SIG-IR, pages 246-267, 1987.
5. M. Luxenburger. Implications, dependencies and Galois drawings. Technical report, TH Darmstadt, 1993.
6. T. Rock and R. Wille. Ein TOSCANA-Erkundungssystem zur Literatursuche. In G. Stumme and R. Wille, editors, Begriffliche Wissensverarbeitung: Methoden und Anwendungen. Springer-Verlag, Berlin-Heidelberg, 1997.
7. F. Vogt and R. Wille. TOSCANA: a graphical tool for analyzing and exploring data. In Graph Drawing '94, LNAI 894, pages 226-223. Springer Verlag, 1995.
8. R. Wille. Concept lattices and conceptual knowledge systems. In Semantic Networks in Artificial Intelligence. Pergamon Press, Oxford, 1992. Also appeared in Comp. & Math. with Applications, 23(2-9), 1992, pp. 493-515.
Business Focused Evaluation Methods: A Case Study
Piew Datta
GTE Laboratories Incorporated, 40 Sylvan Rd., Waltham, Massachusetts USA 02451
[email protected] Abstract. Classification accuracy and similar metrics have long been the measures used by researchers in machine learning and data mining to compare methods and show their usefulness. Although these metrics are essential to show the predictive ability of the methods, they are not sufficient. In a business setting other business processes must be taken into consideration. This paper describes additional evaluations we provided to potential users of our churn prediction prototype, CHAMP, to better define the characteristics of its predictions.
1. Introduction
As data mining and machine learning techniques move from research algorithms to business applications, it is becoming obvious that the acceptance of data mining systems for practical business problems relies heavily on their integration into business processes. One critical aspect of building a practical and useful system is showing that the techniques can tackle the business problem. Traditionally, the machine learning and data mining research areas have used classification accuracy in some form to show that the techniques can predict better than chance. The evaluation methods need to more closely resemble how the system will work once it is in place. This paper focuses on the experimental evaluations we performed on a prototype called CHAMP (Churn Analysis, Modeling and Prediction), developed for GTE Wireless (GTEW). CHAMP is a data mining tool used to predict which of GTEW's customers will churn within the following two months. Although we were able to show that CHAMP was considerably more accurate at identifying churners than existing processes at GTEW, we needed to provide additional evaluations to persuade potential users of its benefits. In the next section of this paper we provide some background about GTEW. Section 3 briefly describes each of CHAMP's components. Section 4 discusses several criteria we used to describe CHAMP's benefits to the GTEW marketing department. Many of these experiments are non-traditional methods for evaluating CHAMP.
2. GTEW and Its Data Warehouse GTEW provides cellular service to customers from various geographically diverse markets within the United States of America. GTEW currently has about 5 million customers in about 100 markets and is growing annually. As in all businesses, customers will sometimes terminate their service or switch providers for a variety of reasons. In the telecommunications industry this is referred to as churn. Although industry wide churn rates are only about 2% to 3% per month, this results in a considerable number of subscribers discontinuing service. Currently GTEW accumulates information for its cellular customers in a relational data warehouse that collects data from many regional database sources. The warehouse contains over 200 fields consisting of billing and service data for each customer on a monthly basis and stores historical data going back for two years. Each month CHAMP analyzes this information to predict the possibility that any particular customer will churn based on historical data. Knowledge discovered by analyzing the characteristics of churners is used to guide marketing retention campaigns.
3. CHAMP: A Brief Overview
Members of the Knowledge Discovery in Databases project at GTE Laboratories developed CHAMP to help GTEW reduce customer churn. Since GTEW's data warehouse is updated monthly and since GTEW has a diverse set of markets, we decided to build models monthly for each of GTEW's top 20 markets. This decision requires CHAMP to be fully automated. Another goal was to ensure that CHAMP is scalable with respect to the number of customer records and fields. The last goal was for the models to remain valid for a period of 60 days, to allow the marketing department time to develop any desired campaigns. We developed CHAMP's overall design with these and other goals at the forefront. Readers interested in details should refer to Datta et al. (1999) [1]. There are essentially two phases in applying data mining methods to identify churners: building models and applying models. For model building, the date and the market are initially provided to the prototype, which retrieves relevant historical data from the remote data warehouse to create a local extract. The model building component uses customer billing and usage information from three months previous, together with the dependent binary variable denoting whether the customer churned in the previous 2 months. CHAMP's modeling method employs a hybrid of machine learning techniques. Initially we use a decision tree method (Quinlan, 1993 [2]) to rank fields according to their predictive capability and then use a cascade neural network (Fahlmann & Lebiere, 1988 [3]; Puskorius et al. 1991 [4]; Rumelhart, Hinton, & Williams, 1986 [5]) with the 30 highest ranked fields. The neural network uses a genetic algorithm (Koza, 1993 [6]) to find transformations and groupings of fields for increased model accuracy. Once the model is built, it is applied to current data. This data contains only customer billing and usage information and does not have customer churn information, since we do not know whether a customer will churn until the end of the month. The churn score generator uses the learned model and current data to produce a churn score for each customer, predicting whether the customer will churn in the next 60 days.
The churn score, ranging from 0 to 100, describes an individual customer's propensity to churn; customers with a higher churn score have a higher propensity to churn.
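As a hedged sketch of this two-stage idea (our own reconstruction using scikit-learn as a stand-in; CHAMP's actual decision tree, cascade-correlation network and genetic-algorithm components are not reproduced here, and the parameter values below are illustrative assumptions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def train_churn_model(X, y, n_top=30):
    """Stage 1: rank fields with a tree; stage 2: fit a network on the top fields."""
    tree = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X, y)
    top = np.argsort(tree.feature_importances_)[::-1][:n_top]
    net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000,
                        random_state=0).fit(X[:, top], y)
    return top, net

def churn_scores(model, X):
    """Scale the churn probability to the 0-100 score described in the text."""
    top, net = model
    return 100 * net.predict_proba(X[:, top])[:, 1]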
4. Empirical Evaluations
In this section we describe the empirical evaluations we applied to better understand the characteristics of CHAMP on several differing markets. Some of these methods are applied traditionally, such as computing the lift and payoff of the learned models. Marketing professionals at GTEW suggested business-oriented experiments aimed at taking some of the marketing processes currently in place and seeing how CHAMP will operate with regard to these constraints. Generally, the data is prepared by randomly separating the entire dataset into two distinct sets: training and testing. The testing dataset is roughly 50% of the entire dataset. All experimental results are reported on the held-aside testing set. We use five markets which vary considerably in size and geographic location, showing the generality of our results. We use these markets to demonstrate the performance of CHAMP across six different types of evaluation methods.
4.1 Traditional Evaluation Methods
We have validated models using both the lift¹ and payoff² metrics (Datta et al., 1999 [1]; Masand et al., 1999 [7]; Masand & Piatetsky-Shapiro, 1996 [8]). An example of the lifts for different percentages of the sorted list is shown in Figure 1 for Markets 1, 2, and 3. The largest gain in lift for all three markets occurs for the first 5% to 10%, as shown by the slope of the curve at these points. The first (top) decile is the first 10% of the sorted list. A lift of 1 means that the model predicts churn no better than chance, and the lift eventually becomes 1 as the entire sorted list is used. These results show that CHAMP can predict churn behavior more accurately than chance. Figure 2 shows the cumulative payoff as incremental percentages of the sorted scores list are used for Markets 1, 2, and 3. The highest point in the curve, where payoff is maximized, varies dramatically for each market. If the customers falling to the left of the highest point are contacted, this results in the highest payoff for the market. The highest payoff has a large range, from $40,000 to $85,000 per month, depending on the market.
4.2 Business Oriented Evaluation Experiments
In this section, we describe evaluations of CHAMP behavior of interest to marketing professionals. We typically run experiments on the first decile, i.e. the top 10% of the sorted churn scores. As shown in Figure 1, this is where CHAMP has the largest lift.
1 The prediction module produces a score for each customer and sorts customers according to score. The lift metric computes the gain in predictiveness for subsets of the sorted list over the base churn rate (i.e. churn as it is currently occurring in the market).
2 We used a probability of 50% that a customer will continue service after being contacted, a contact cost of $7 per customer, and the assumption that a retained customer will remain for 6 months. The numbers used to calculate the payoff are for illustrative purposes only and do not necessarily reflect actual numbers used in the business process.
Fig. 1. Lift for Markets 1, 2, and 3. The highest gain in lift is between the top 1-10% of the sorted scores list.
Fig. 2. Simulated cumulative payoff for Markets 1, 2, and 3. For these markets the highest payoff occurs at less than 50% of the sorted scores list.
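A small Python sketch of how lift and cumulative payoff curves such as those in Figs. 1 and 2 can be computed from a list of churn scores follows. This is our own illustration, not GTE code; the monthly revenue figure is an invented placeholder, while the save probability, contact cost and retention horizon are the illustrative values of footnote 2.

import numpy as np

def lift_curve(scores, churned, percents):
    """Lift of the top p% of the score-sorted list over the base churn rate."""
    order = np.argsort(scores)[::-1]          # highest churn scores first
    churned = np.asarray(churned, float)[order]
    base_rate = churned.mean()
    return [churned[:max(1, int(len(churned) * p / 100))].mean() / base_rate
            for p in percents]

def cumulative_payoff(scores, churned, monthly_revenue=50.0,
                      save_prob=0.5, contact_cost=7.0, months=6):
    """Cumulative payoff of contacting customers in score order."""
    order = np.argsort(scores)[::-1]
    churned = np.asarray(churned, float)[order]
    # a contact pays off only if the customer would otherwise have churned
    gain = churned * save_prob * monthly_revenue * months - contact_cost
    return np.cumsum(gain)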
Percentage of Churners Identified
From a marketing point of view it is important to know the percentage of the actual churners that are captured in the highest decile, decile 1. In addition, it is important to know how much lead time the marketing department will have before a customer with a high propensity to churn actually churns. We went back to our historical data and chose to look at the churners at a point in time, namely February 1998. We calculated the number of churners that appeared in decile 1 during the previous month, January 1998, and also looked at the number of customers that appeared at least once in decile 1 during the three months previous to February. Table 1 shows the results for 3 markets. CHAMP identified a fairly large percentage (28% to 36%) of churners (i.e. the customer appeared in the top decile at least once in the previous three months). These results also indicate that CHAMP can pick up clear signs of churning well before customers actually churn. A smaller percentage of decile 1 customers churned in the following month (first column), although it is a larger percentage than uniform for Markets 2 and 3.
Table 1. Percent of churners identified by CHAMP scores for January 1998.

Market      Percent of churners in decile 1    Percent of churners in decile 1 at least
            in previous month                  once in previous three months
Market 1    17%                                28%
Market 2    19.5%                              31%
Market 3    28%                                36%
Estimated Overlap among High Propensity Churn Customers
Another aspect marketing professionals were interested in was the number of contacts they would have to make if they used CHAMP scores. There is some indication that some percentage of those that appeared in decile 1 in one month would also appear in the next month. GTEW has policies restricting the number of times a customer can be contacted within a specified period of time. We conducted the following experiment. We identified the unique customers from decile 1 for two consecutive months; that is, if a customer appeared in decile 1 for both months, the customer was only counted once. We also conducted the same experiment for three consecutive months. Table 2 shows the results. The percentages were computed by dividing the number of unique customers for the period by the total number of customers in decile 1 over the period. For example, if we identify 50 unique customers over two months and decile 1 has a size of 30 customers a month, then we divide 50 by (30*2) and get 83%. These results show that a sizeable number of unique customers appear in decile 1 in consecutive months, showing the stability of the model. Depending on the marketing department's policy for contacting customers, this gives them some idea of the number of contacts they will need to make monthly.
Table 2. Percent of unique customers in decile 1 for consecutive months.
Market      Unique customers in 2 months    Unique customers in 3 months
Market 1    80%                             69%
Market 2    75.5%                           69%
Market 3    72%                             59%
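A minimal sketch of this overlap computation (our own illustration; the customer ids below are simulated to reproduce the worked example of 50 unique customers over two months of 30 each):

def unique_customer_pct(decile1_by_month):
    """decile1_by_month: one set of customer ids per month; returns the percent
    of unique customers relative to the total number of decile-1 slots."""
    unique = set().union(*decile1_by_month)
    total_slots = sum(len(d) for d in decile1_by_month)
    return 100.0 * len(unique) / total_slots

month1 = set(range(0, 30))     # 30 decile-1 customers in month 1
month2 = set(range(20, 50))    # 30 in month 2, 10 of whom overlap with month 1
print(unique_customer_pct([month1, month2]))   # 50 / 60, about 83%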
Aging Experiments
With dramatic changes in the cellular industry, an important issue related to modeling behavior is understanding the lifetime of the learned models and the decline in their predictive capability. In addition, it is also important to understand how long individual customer scores remain valid. In this section we discuss two experiments focused on evaluating the lifetime of the models and of the scores.
We ran the first experiment, model aging, with models learned monthly, starting in March 1997. We created 4 different models, one for each successive month, and evaluated the models on customer data from June 1997. Figure 3 shows the lift of the first decile for three markets. As can be seen from the graph, although the lift decreases slightly, there is no statistically significant difference when the older models are run on more recent customer data. Market 1, however, did have a significant drop in lift for the 3- and 4-month-old models. One possibility is that the data representing that time of the year did not include any major seasonal changes or new competitive offers for Markets 4 and 5, but some external factor such as new competition could have made the older models less accurate in Market 1. An experiment looking at longer delay periods between when the model is built and when it is applied may better reflect seasonal trends.
Fig. 3. Lift slowly decreases for Markets 1, 4, and 5 as the model ages.

In the second experiment we considered the lifetime of the generated scores, that is, when customers in decile 1 with a high propensity to churn actually churn. To conduct this experiment, we followed a group of customers scored by CHAMP from June 1997 until January 1998 to see whether they churned during the period and, if so, during which month. The results for Market 1 are illustrated in Figure 4. Decile 1 has a larger number of customers churning over the time period compared to the other deciles. In addition, in decile 1 the customers that churn tend to do so within the first few months. The lift for decile 1 in June 1997 is 4.07, which means customers in decile 1 are about four times as likely to churn as the background churn rate. The lifts for July and August are 2.35 and 1.93 respectively. Decile 10 should contain the customers least likely to churn. This is confirmed by the lifts for decile 10, which are 0.22, 0.52, and 0.61 for June, July and August respectively. Although only about 40% of the customers in decile 1 churned over the 6-month period, this is still much higher than the background churn over the same period, about 16%-24% (assuming the industry average of about 2%-3% per month). The remaining markets have similar lift characteristics but are not shown for space considerations.
5. Summary and Discussion
The traditional lift experiments we conducted on CHAMP indicated that the learned models could predict churners in the upcoming months more effectively than the current methods used by GTEW. We conducted additional experiments, described in Section 4.2, that focus on these
questions. These experiments illustrated benefits of using CHAMP that were not obvious to us initially. For example, Figure 4 shows that the effectiveness of CHAMP customer scores extends beyond the 60-day period for which we initially built the models, and Figure 3 shows that the models decline only slowly in predictive capability over several months. These experiments not only helped explain CHAMP's characteristics to users, but also helped CHAMP developers and researchers. We expect end users of any data mining prototype or system to have a wide variety of questions regarding performance and applicability. This paper takes a first step in describing some of the questions not addressed by simple accuracy measurements.
Fig. 4. Score aging results for Market 1. Those in decile 1 tend to churn at a higher rate not only for the next 2 months, but for the next 6 months. Note that the top of the decile bars have been cut off for space considerations. The bars reach 100%.
References
1. Datta, P., Masand, B., Mani, D. R. & Li, B.: Automated Cellular Modeling and Prediction on a Large Scale. Artificial Intelligence Review: Special Issue on Data Mining Applications. Kluwer Academic Publishers. To appear Oct. 1999.
2. Quinlan, J. R.: C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann (1993).
3. Fahlmann, S. E. & Lebiere, C.: The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, volume 2. Morgan Kaufmann (1988).
4. Puskorius, G. & Feldkamp, L.: Decoupled Extended Kalman Filter Training of Feedforward Layered Networks. Proceedings of the International Joint Conference on Neural Networks, IEEE (1996).
5. Rumelhart, D. E., Hinton, G. E. & Williams, R. J.: Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume I: Foundations. Cambridge, MA: MIT Press/Bradford Books (1986) pp. 318-362.
6. Koza, J.: Genetic Programming. MIT Press (1993).
7. Masand, B., Datta, P., Mani, D. R. & Li, B.: CHAMP: A Prototype for Automated Cellular Churn Prediction. Data Mining and Knowledge Discovery. Kluwer Academic Publishers (1999).
8. Masand, B. & Piatetsky-Shapiro, G.: A comparison of approaches for maximizing business payoff of prediction models. Proceedings of the Second International Conference on Knowledge Discovery & Data Mining. Seattle, WA (1996) pp. 195-201.
Combining Data and Knowledge by MaxEnt-Optimization of Probability Distributions Wolfgang Ertel and Manfred Schramm Fachhochschule Ravensburg-Weingarten, Postfach 1261, 88241 Weingarten, GERMANY
<ertel|schramma>@fh-weingarten.de www.fh-weingarten.de/~ertel http://ti-voyager.fbe.fh-weingarten.de/schramma
Abstract. We present a project for probabilistic reasoning based on the concept of maximum entropy and the induction of probabilistic knowledge from data. The basic knowledge source is a database of 15000 patient records which we use to compute probabilistic rules. These rules are combined with explicit probabilistic rules from medical experts which cover cases not represented in the database. Based on this set of rules the inference engine Pit (Probability Induction Tool), which uses the well-known principle of Maximum Entropy [5], provides a unique probability model while keeping the necessary additional assumptions as minimal and clear as possible. Pit is used in the medical diagnosis project Lexmed [4] for the identification of acute appendicitis. Based on the probability distribution computed by Pit, the expert system proposes treatments with minimal average cost. First clinical performance results are very encouraging.
1 Introduction
Probabilities deliver a well-researched method of reasoning with uncertain knowledge. They form a unified language to express knowledge inductively generated from data as well as expert knowledge. To build a system for reasoning with probabilities based on data and expert knowledge, we have to solve different problems:
- To infer a set of rules (probabilistic constraints) from data, where the number of rules has to be small enough to avoid over-fitting and large enough to avoid under-fitting. For this task we use an algorithm which generates a probabilistic network.
- To find probabilistic rules for groups of patients not present in our database. In cooperation with our medical experts, we collect rules describing patients with acute abdominal pain who were not taken to the theatre under the diagnosis of acute appendicitis and who are therefore not present in our database.
- To construct a unique probability model from all our constraints. If we express our knowledge in a set of rules, this set usually does not allow us to generate a unique probability model, which is necessary to get a definite answer for every probabilistic query (of our domain). We solve this task by the use of MaxEnt (see Sec. 4), which delivers a precise semantics for completing probability distributions.
- A typical problem for classification (e.g. in medicine) is that different classification errors cause different costs. We solve this task by asking the experts to define the costs of wrong decisions, where these costs are not meant to be local to the hospital, but global in the sense of including all consequences the patient has to suffer from. The decisions of our system are found by minimizing these costs under a given probability distribution.
2 Automatic Generation of Rules from Data
From a database of 15000 patient records from all hospitals in Baden-Württemberg in 1995¹, the following procedure generates a set of probabilistic rules. To facilitate the description we simplify our application to the two-class problem of deciding between appendicitis (App = true) and not appendicitis (App = false), assuming the basic condition of acute abdominal pain to be true in all our rules. In order to be independent of the environment of the particular clinic, the rules are conditioned on the diagnosis variable App, i.e. rules will have the form P(A = ai | App = true, B = bj, ...) = x, where ai, bj are values of the binary symptom variables A and B (binary for simplification) and x is a real number or a real interval². In order to abstract the data into probabilities, we use the concept of (conditional) independence³ as widely known and accepted. We therefore draw an independence map ([6]) of the variables, i.e. an undirected graph G where the nodes represent variables and the edges and paths represent dependencies between variables. A missing edge between 2 variables denotes a conditional independence of the two variables, given the union of all other variables (see e.g. Sec. 4.5 in [9]). For most real world applications, however, the number of elementary events induced by the union of variables is larger than the available data (in our application, the variables span 10^9 events where 'only' 15000 patient records are available). In order to avoid over-fitting, we have to use a 'local' approach for building an independence map, i.e. an approach which works on a small set of variables rather than the union. Our procedure works as follows:
1 We are grateful to the ARGE - Qualitätssicherung der Landesärztekammer Baden-Württemberg for providing this database.
2 In case of an interval, P(x) = [a, b] expresses the uncertainty that P(x) can be any value in [a, b] ⊆ [0, 1].
3 Variables A and B are conditionally independent given C iff (knowing the value of C) the knowledge of the value of B has no influence on deciding about the value of A. In technical terms: P(A = ai | B = bj, C = ck) = P(A = ai | C = ck) for all i, j, k.
1. For variables A and B and a vector of variables S (with a vector of values s), with A, B ∉ S, let D(A, B | S) denote the degree of dependence between the variables A and B given S, which we calculate as the 'distance'⁴ between X and Y, where Y_ijs := P(A = ai | S = s) · P(B = bj | S = s) and X_ijs := P(A = ai, B = bj | S = s). Let D(A, B | ∅) denote the (unconditional) degree of dependence between A and B, which we calculate as the distance between X and Y, where Y_ij := P(A = ai) · P(B = bj) and X_ij := P(A = ai, B = bj).
2. We build an undirected graph by the following rules: Draw a node for every variable A, including the special diagnosis variable App (with values 'true' and 'false'). Draw an edge (A, App) iff D(A, App | ∅) is above a heuristically determined value t (see below). For the pair of variables A and B with the largest value of D(A, B | App) we add an edge (A, B) to the graph if, for the minimal separating set S for the nodes A and B⁵, the distance D(A, B | S) is above t.
3. When the procedure is completed, a graph G has been generated. As already mentioned, medical knowledge is typically conditioned on illnesses, expressing the assumption that this type of rule is more context independent than others (see footnote 8 in Sec. 3). We therefore adapt the graph to this type of rule. For this goal we direct the edges (App, A) towards A and calculate rules of the form P(A = ai | App = true) = x. Directions for the remaining edges are selected arbitrarily, with the result of defining a Bayesian network⁶ of rules like P(A = ai | App = true, B = bj, ...) = x, where App, B and possibly other variables are 'inputs' to the variable A. Remember that the number of edges is limited by the size of the threshold t: if the number of variables in a rule is too large in relation to the available data, t has to be increased (to avoid over-fitting); if the density of edges is too small (if the inductive power of the probabilistic rules is too weak), t has to be decreased (to avoid under-fitting). This set of rules is incomplete (i.e. it does not specify a unique probability model) because we do not construct rules for the class (App = false) from our database (see Sec. 3). Additional rules are specified by our experts. But as the resulting set of rules is still incomplete (e.g. because intervals are used in our rules), we need the method of Maximum Entropy (see Sec. 4) to complete the probability model.
4 We use the cross-entropy function for this task, which is similar, but not equivalent, to the correlation coefficient: it is defined as CR(x, y) = Σ_i x_i log(x_i / y_i).
5 A separating set S for the nodes A and B disconnects A and B, i.e. there are no paths between A and B if the variables in S and their edges are removed from the graph. A minimal separating set is minimal in the number of variables it contains. If there is more than one minimal separating set S, we take the set S with the lowest distance D(A, B | S). Remark: by construction, the minimal separating set will always contain App.
6 The missing distribution of App is given by our experts.
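As a small illustration of the dependence measure defined in footnote 4 (our own sketch, not the Pit implementation; the toy joint distribution is invented), the cross entropy between the observed joint distribution and the product of its marginals can be computed as follows.

import numpy as np

def cross_entropy(x, y):
    """CR(x, y) = sum_i x_i * log(x_i / y_i); terms with x_i = 0 contribute 0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mask = x > 0
    return float(np.sum(x[mask] * np.log(x[mask] / y[mask])))

def dependence(joint):
    """D(A, B | empty set): distance between the joint distribution P(A, B)
    and the product of the marginals P(A) * P(B)."""
    pa = joint.sum(axis=1)
    pb = joint.sum(axis=0)
    return cross_entropy(joint.ravel(), np.outer(pa, pb).ravel())

joint = np.array([[0.30, 0.10],      # toy joint distribution of two binary variables
                  [0.05, 0.55]])
print(dependence(joint))             # 0 would indicate A and B look independent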
3 Expert-Rules
All patients in our database have been operated on under the diagnosis 'acute appendicitis', suffering from 'acute abdominal pain'. Thus our database cannot provide a model of patients who have been sent home (with the diagnosis 'non-specific abdominal pain') or who have been forwarded to other departments (assuming other causes for their pain). In order to get a model of these classes of patients, we use the explicit knowledge of our medical experts⁷ and the literature (see e.g. [1]) to obtain rules like⁸
P(A = ai | App = false ∧ ...) = [x, y].
4 Generating a unique probability distribution from rules by the method of Maximum Entropy
In order to support interesting decisions in cases of incomplete knowledge, we have to add more constraints. In order to add no (false) ad hoc knowledge, the constraints have to be selected such that they maximize the ability to decide and minimize the probability of an error. The method of Maximum Entropy, which chooses the probability model with maximal entropy H, H(v) := − Σ_i v_i log(v_i), is known to solve these problems:
- it maximizes the ability to decide, because it is known to choose a single (unique) probability model in the case of linear constraints;
- it minimizes the probability of an error, because the distribution of models is known to be concentrated around the MaxEnt model ([3]).
Computing the MaxEnt model is not a new idea, but it is very expensive in the worst case. The main problem is that the number of interpretations (elementary events) grows exponentially with the number of variables. To avoid this effect in the average case, the principles of independence and indifference are used to reduce the complexity of the computations. These two principles are both used in our system Pit (Probability Induction Tool) for a more efficient calculation of the MaxEnt model ([8]).
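To make the idea concrete, here is a hedged numerical sketch (ours, not the Pit system, which exploits independence and indifference rather than a generic optimizer): the maximum-entropy distribution over a tiny domain is computed under one linear constraint, using scipy.

import numpy as np
from scipy.optimize import minimize

# four elementary events over two binary variables, ordered (A,B) = 00, 01, 10, 11
A1 = np.array([0.0, 0.0, 1.0, 1.0])      # indicator of the event A = 1

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))   # minimizing -H(p) maximizes the entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},        # normalization
    {"type": "eq", "fun": lambda p: float(A1 @ p) - 0.3},  # constraint P(A = 1) = 0.3
]
res = minimize(neg_entropy, np.full(4, 0.25), method="SLSQP",
               bounds=[(0.0, 1.0)] * 4, constraints=constraints)
print(res.x)    # MaxEnt model: [0.35, 0.35, 0.15, 0.15]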
7 We are grateful to our medical experts Dr. W. Rampf and Dr. B. Hontschik for their support in the knowledge acquisition and their patience in answering our questions.
8 This type of knowledge surely depends on the particular application scenario. For example, in a pediatric clinic in Germany there are other typical causes of abdominal pain than in a hospital in a tropical country or in a military hospital.

5 Generating Decisions from Probabilities
Once the rule base is constructed, a run of Pit computes the MaxEnt model and any query can be answered by standard probabilistic computations. However, the expected result of reasoning in Lexmed is not a probability but a decision (diagnosis). How are probabilities related to decisions? In our application
(as in many others), misclassifications do have very different consequences. The diagnosis 'perforated appendicitis', where 'no appendicitis' would be correct, is very different from the diagnosis 'no appendicitis' where 'perforated appendicitis' would be correct. The latter case is of course a much bigger mistake, or in other words, much more expensive. Therefore we are interested in a diagnosis which causes minimum overall cost. Including such a cost calculation in the diagnosis process is very simple (cf. Figure 1). Let Cij be the additional cost incurred if the real diagnosis is class i, but the physician decides for class j. Given a matrix Cij of such misclassification costs and the probability pi for each real diagnosis i, the query evaluation of Lexmed computes the average misclassification cost Cj as
Cj = Σ_{i=1}^{n} Cij · pi

and then selects the class j = argmin{ Cj | j = 1, ..., n } with minimum average cost.
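A minimal sketch of this decision rule (ours; the probabilities and the cost matrix below are invented for illustration and are not the Lexmed values):

import numpy as np

def decide(p, C):
    """p[i]  : probability that the real diagnosis is class i
       C[i,j]: cost if the real diagnosis is i but the decision is j
       Returns the class j minimizing the average cost Cj = sum_i C[i,j] * p[i]."""
    avg_cost = p @ C                 # one entry Cj per possible decision j
    return int(np.argmin(avg_cost)), avg_cost

p = np.array([0.2, 0.8])             # e.g. P(appendicitis), P(no appendicitis)
C = np.array([[0.0, 10.0],           # real appendicitis: missing it is expensive
              [1.0,  0.0]])          # real 'no appendicitis': operating costs less
print(decide(p, C))                  # decides to treat as appendicitis despite p = 0.2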
6 Diagnosis of Acute Appendicitis

Figure 1: Overview of the Lexmed architecture.

During the last twenty years the diagnosis of acute appendicitis has been improved with respect to the misclassification rate [2,1]. However, depending on the particular way of sampling and on the hospital, the rate of misclassification among surgeons still ranges between 15 and 30%, which is not satisfactory [2]. A number of expert systems for this task have been realized, some with high accuracy [1], but there is still no breakthrough in clinical applications of such systems. Lexmed is a learning expert system for medical diagnosis based on the MaxEnt method. Viewed as a black box, Lexmed maps a vector of clinical symptoms (discrete variable values) to probabilities for the different diagnoses. The central component inside Lexmed is the rule base containing a set of probabilistic rules, as shown in Figure 1. The acquisition of rules is performed by the inductive part (see Sec. 2) and by the acquisition of explicit knowledge (see Sec. 3). The integration of knowledge from two different sources in one rule base may cause severe problems, at least if the formal knowledge representation of the two sources is different, for example if the inductive component is a neural net and the explicit knowledge is represented in first-order logic. In our system, however, the language of probabilities provides
a uniform and powerful knowledge representation mechanism. And MaxEnt is an inference engine which does not require a complete⁹ set of rules.
7 Results
Running times of the system for query evaluation on the appendicitis application with about 400 probabilistic rules are about 1-2 seconds. The average cost of the decisions was measured for Lexmed without expert rules (as described in Section 2) and for the decision tree induction system C5.0 [7], with 10-fold cross-validation on the database (Table 1). For completeness reasons we also performed runs of both systems without cost information and computed the classification error.

                        Lexmed     C5.0
Average cost            1196 DM    1292 DM
Classification error    22.6 %     23.5 %

Table 1. Average cost and error of C5.0 and Lexmed without expert rules.
The figures show that on the database the purely inductive part of Lexmed and C5.0 have similar performance. However, in a real test in the hospital the expert rules, in addition to the inductively generated rules, will be very important for good performance, because the database is not representative of the patients in the hospital, as mentioned in Section 3. Thus, in the real application we expect Lexmed to perform much better than a purely inductive learner like C5.0. Apart from the numeric performance results, the first presentation of the system in the municipal hospital of Weingarten was very encouraging. Since June 1999 the doctors in the hospital have used Lexmed via the internet [4].
References
1. De Dombal. Diagnosis of Acute Abdominal Pain. Churchill Livingstone, 1991.
2. B. Hontschik. Theorie und Praxis der Appendektomie. Mabuse Verlag, 1994.
3. E.T. Jaynes. Concentration of distributions at entropy maxima. In Rosenkrantz, editor, Papers on Probability, Statistics and Statistical Physics. D. Reidel Publishing Company, 1982.
4. Homepage of Lexmed. http://lexmed.fh-weingarten.de, 1999.
5. J.B. Paris and A. Vencovska. A Note on the Inevitability of Maximum Entropy. International Journal of Approximate Reasoning, 3:183-223, 1990.
6. J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
7. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993. C5.0 online available at www.rulequest.com.
8. M. Schramm and W. Ertel. Reasoning with Probabilities and Maximum Entropy: The System PIT and its Application in LEXMED. Accepted at: Symposium on Operations Research 1999, 1999.
9. J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley, 1990.
9 'Complete' means that the rules are sufficient to induce a unique probability distribution.
Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation? Ad Feelders Tilburg University CentER for Economic Research PO Box 90153 5000 LE Tilburg, The Netherlands e-mail:
[email protected] Abstract. In many applications of data mining a - sometimes considerable - part of the data values is missing. Despite the frequent occurrence of missing data, most data mining algorithms handle missing data in a rather ad-hoc way, or simply ignore the problem. We investigate simulation-based data augmentation to handle missing data, which is based on filling-in (imputing) one or more plausible values for the missing data. One advantage of this approach is that the imputation phase is separated from the analysis phase, allowing for different data mining algorithms to be applied to the completed data sets. We compare the use of imputation to surrogate splits, such as used in CART, to handle missing data in tree-based mining algorithms. Experiments show that imputation tends to outperform surrogate splits in terms of predictive accuracy of the resulting models. Averaging over M > 1 models resulting from M imputations yields even better results as it profits from variance reduction in much the same way as procedures such as bagging.
1 Introduction
The quality of knowledge extracted with data mining algorithms is evidently largely determined by the quality of the underlying data. One important aspect of data quality is the proportion of missing data values. In many applications of data mining a - sometimes considerable - part of the data values is missing. This may occur because they were simply never entered into the operational systems, or because for example simple domain checks indicate that entered values are incorrect. Another common cause of missing data is the joining of not entirely matching data sets, which tends to give rise to monotone missing data patterns. Despite the frequent occurrence, many data mining algorithms handle missing data in a rather ad-hoc way, or simply ignore the problem. In this paper we focus on the well-known tree-based algorithm CART [3], that handles missing data by so called surrogate splits1 . As an alternative we 1
In fact we used the S program RPART that reimplements many of the ideas of CART, in particular the way it handles missing data.
investigate more principled simulation-based approaches to handle missing data, based on filling-in (imputing) one or more plausible values for the missing data. One advantage of this approach is that the imputation phase is separated from the analysis phase, allowing for different data mining algorithms to be applied to the completed data sets.
2 Multiple Imputation
Multiple imputation [5,4] is a simulation-based approach where a number of complete data sets are created by filling in alternative values for the missing data. The completed data sets may subsequently be analyzed using standard complete-data methods, after which the results of the individual analyses are combined in the appropriate way. The advantage, compared to using missing-data procedures tailored to a particular algorithm, is that one set of imputations can be used for many different analyses. The hard part of this exercise is to generate the imputations, which may require computationally intensive algorithms such as data augmentation and Gibbs sampling [5,7]. In our experiments we used software for data augmentation written in S-plus by J.L. Schafer2 to generate the imputations. Since the examples we consider in this section contain both categorical and continuous variables, imputations are based on the general location model (see [5], chapter 9). The Bayesian nature of multiple imputation requires the specification of a prior distribution for the parameters of the imputation model. We used a non-informative prior, i.e. a prior corresponding to a state of prior ignorance about the model parameters. One of the critical parts of using multiple imputation is to assess the convergence of data augmentation. In our experiments we used a rule of thumb suggested by Schafer [6]. Experience shows that data augmentation nearly always converges in fewer iterations than EM. Therefore we first computed the EM-estimates of the parameters, and recorded the number of iterations, say k, required. Then we performed a single run of the data augmentation algorithm of length 2M k, using the EM-estimates as starting values, where M is the number of imputations required. Just to be on the "safe side", we used the completed data sets from iterations 2k, 4k, . . . , 2M k.
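The following Python sketch illustrates the overall multiple-imputation workflow of imputing M times, fitting a tree on each completed data set and averaging the predictions (as done later in the paper). It is our own illustration: scikit-learn's IterativeImputer and decision trees serve as stand-ins for Schafer's data augmentation under the general location model and for RPART.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeClassifier

def multiple_imputation_trees(X_train, y_train, X_test, M=5):
    """Impute M times, fit one tree per completed data set, average predictions."""
    probs = []
    for m in range(M):
        imputer = IterativeImputer(sample_posterior=True, random_state=m)
        X_completed = imputer.fit_transform(X_train)
        tree = DecisionTreeClassifier(random_state=m).fit(X_completed, y_train)
        probs.append(tree.predict_proba(X_test))    # X_test assumed complete
    return np.mean(probs, axis=0)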
3 Waveform Recognition Data
To compare the performance of imputation with surrogate splits, we first consider the waveform recognition data used extensively in [3]. The only categorical variable is the class label (with 3 possible values), and all 21 covariates are continuous, so imputation is based on the well-known linear discriminant model. Note that the assumptions of the linear discriminant model are not correct here, because the distribution of the covariates within each class is not multivariate 2
this software is available at http://stat.psu.edu/∼jls/misoftwa.html
normal and furthermore the covariance structure differs between the classes. Still, the model may be "good enough" to generate the imputations. In the experiments, we generated 300 observations (100 from each class) to be used as a training set, with different percentages of missing data in the covariates. Then we built trees as follows:
1. on the incomplete training set, using surrogate splits;
2. on one or more completed data sets using (multiple) imputation.
In both cases the trees were built using 10-fold cross-validation to determine the optimal value for the complexity parameter (the amount of pruning), using the program RPART3. The error rate of the trees was estimated on an independent test set containing 3000 complete observations (1000 from each class). To estimate the error rate at each percentage of missing data, the above procedure was repeated 10 times and the error rates were averaged over these 10 trials. In a first experiment, each individual data item had a fixed probability of being missing. Table 1 summarizes the comparison of surrogate splits and single imputation at different fractions of missing data. Single imputations are drawn from the predictive distribution of the missing data given the observed data and the EM-estimates for the model parameters. Looking at the difference between the error rates one can see that imputation gains an advantage when the level of missing data becomes higher. However, at a moderate level of missing data (say 10% or less) it does not seem worth the extra effort of generating imputations. The same trend is also clear from rows four (p+_imp) and five (p−_imp) of the table. p+_imp (p−_imp) indicates the number of times, out of the ten trials, that the error rate of imputation was higher (lower) and the difference was significant at the 5% level.
10 29.8% 29.8% 0% 1 1
20 30.9% 29.2% 1.7% 0 4
30 32.2% 30.6% 1.6% 0 4
40 32.4% 30.0% 2.4% 0 6
45 34.3% 30.4% 3.9% 0 7
Table 1. Estimated error rate of surrogate splits and single imputation at different fractions of missing data (estimates are averages of 10 trials)
In a second experiment we used multiple imputation with M = 5, and averaged the predictions of the 5 resulting trees. The results are given in table 2.
The performance of multiple imputation is clearly better than both single imputation and surrogate splits. Presumably, this gain comes from the variance reduction resulting from averaging a number of trees, as is done in bagging [2].

% Missing        10      20      30      40      45
ê_sur          28.9%   30.1%   30.0%   33.3%   35.6%
ê_imp          26.0%   26.1%   25.5%   25.7%*  26.0%*
ê_sur - ê_imp   2.9%    4.0%    4.5%    7.6%    9.6%
p+_imp          0       0       0       0       0
p-_imp          9       8       9       10      10
Table 2. Estimated error rate of surrogate splits and multiple imputation at different fractions of missing data. ∗ : here we ran into problems with data augmentation and used EM-estimates only to generate the imputations
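To make the prediction-averaging step concrete, the following sketch grows M trees, one per completed data set, and averages their class-probability forecasts. It uses scikit-learn decision trees purely as a stand-in for the RPART trees used in the paper, and omits the cross-validated pruning described above.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def predict_with_multiple_imputation(completed_sets, y_train, X_test):
        """completed_sets: list of M fully imputed covariate matrices for the training data."""
        probas = []
        for X_imp in completed_sets:
            tree = DecisionTreeClassifier()        # pruning via cross-validation omitted here
            tree.fit(X_imp, y_train)
            probas.append(tree.predict_proba(X_test))
        avg = np.mean(probas, axis=0)              # average the M probability forecasts
        return np.argmax(avg, axis=1)              # averaged vote, as in bagging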
4 Pima Indians Database
In this section we perform a comparison of surrogate splits and imputation on a real life data set that has been used quite extensively in the machine learning literature. It is known as the Pima Indians Diabetes Database, and is available at the UCI machine learning repository [1]. The class label indicates whether the patient shows signs of diabetes according to WHO criteria. Although the description of the dataset says there are no missing values, there are quite a number of observations with "zero" values that most likely indicate a missing value. In table 3 we summarize the content of the dataset, where we have replaced zeroes by missing values for x3, . . . , x7. The dataset contains a total of 768 observations, of which 500 are of class 0 and 268 of class 1.

Variable   Description                     Missing values
y          Class label (0 or 1)            0
x1         Number of times pregnant        0
x2         Age (in years)                  0
x3         Plasma glucose concentration    5
x4         Diastolic blood pressure        35
x5         Triceps skin fold thickness     227
x6         2-hour serum insulin            374
x7         Body mass index                 11
x8         Diabetes pedigree function      0
Table 3. Overview of missing values in pima indians database
In our experiment the test set consists of the 392 complete observations, and the training set consists of the remaining 376 observations with one or more values missing. Of these 376 records, 374 have a missing value for x6 (serum insulin), so we removed this variable. Furthermore, we changed x1 (number of times pregnant) into a binary variable indicating whether or not the person had ever been pregnant (the entire dataset consists of females at least 21 years old, so this variable is always applicable). This leaves us with a dataset containing two binary variables (y and x1) and six numeric variables (x2, . . . , x5, x7 and x8), with 278/2632 ≈ 10% missing values in the covariates. Although x2 and x8 are clearly skewed to the right, we did not transform them to make them appear more normal, in order to get an impression of the robustness of imputation under the general location model.

The first experiment compares the use of surrogate splits to imputation of a single value based on the EM-estimates. Of course the tree obtained after single imputation depends on the values imputed. Therefore we performed ten independent draws, to get an estimate of the average performance of single imputation. The results are summarized in table 4.

Draw      1      2      3      4      5      6      7      8      9      10
ê_imp   22.7%  30.6%  25.3%  26.0%  30.0%  24.5%  26.8%  24.7%  27.8%  29.3%
p-value .0002  1      .0075  .0114  .7493  .0097  .0237  .0038  .2074  .6908

Table 4. Estimated error rates of ten single imputation-trees and the corresponding p-values of H0 : e_imp = e_sur, with ê_sur = 30.6%
For each single imputation-tree, we compared the performance on the test set with that of the tree built using surrogate splits, which had an error rate of 120/392 ≈ 30.6%. Tests of H0 : e_sur = e_imp against a two-sided alternative, using an exact binomial test, yield the p-values listed in the second row of table 4. On average the single imputation-tree has an error rate of 26.8%, which compares favourably to the error rate of 30.6% of the tree based on the use of surrogate splits.

In a second experiment we used multiple imputation (M = 5) and averaged the predictions of the 5 trees so obtained. Table 5 summarizes the results of 10 independent trials. The average error rate of the multiple imputation-trees over these 10 trials is approximately 25.2%. This compares favourably to both the single tree based on surrogate splits, and the tree based on single imputation.

Trial     1      2      3      4      5      6      7      8      9      10
ê_imp   27.3%  24.5%  25.8%  26.8%  23.7%  24.2%  24.0%  25.5%  24.7%  25.5%
p-value .1048  .0015  .0295  .0357  .0003  .0026  .0022  .0105  .0027  .0119

Table 5. Estimated error rates of 10 multiple imputation-trees (M = 5), and the corresponding p-values of H0 : e_imp = e_sur, with ê_sur = 30.6%
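The preprocessing of the Pima data described in this section can be sketched as follows with pandas. The column names and file name are illustrative assumptions, not taken from the paper.

    import numpy as np
    import pandas as pd

    # Hypothetical column names for the UCI Pima file (x1..x8 and the class label y).
    cols = ["x1_pregnancies", "x2_age", "x3_glucose", "x4_blood_pressure",
            "x5_skin_fold", "x6_insulin", "x7_bmi", "x8_pedigree", "y"]
    df = pd.read_csv("pima.csv", names=cols)          # assumed local copy of the data

    # Zero values of x3..x7 most likely indicate missing values.
    zero_as_missing = ["x3_glucose", "x4_blood_pressure", "x5_skin_fold",
                       "x6_insulin", "x7_bmi"]
    df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

    df = df.drop(columns=["x6_insulin"])              # 374 of 376 incomplete records miss x6
    df["x1_pregnancies"] = (df["x1_pregnancies"] > 0).astype(int)   # ever pregnant (binary)

    test = df[df.notna().all(axis=1)]                 # the 392 complete observations
    train = df[df.isna().any(axis=1)]                 # the 376 observations with missing values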
5 Discussion and Conclusions
The use of statistical imputation to handle missing data in data mining has a number of attractive properties. First of all, the imputation phase and analysis phase are separated. Once the imputations have been generated the completed
data sets may be analysed with any appropriate data mining algorithm. The imputation model does not have to be the "true" model (otherwise why not stick to that model for the complete analysis?) but should merely be good enough for generating the imputations. We have not performed systematic robustness studies, but in both data sets analysed the assumptions of the general location model were violated to some extent. Nevertheless, the results obtained with imputation were nearly always better than those with surrogate splits. Despite these theoretical advantages, one should still consider whether they outweigh the additional effort of specifying an appropriate imputation model and generating the imputations.

From the experiments we performed, some tentative conclusions may be drawn. For the waveform data, single imputation tends to outperform surrogate splits as the amount of missing data increases. At moderate amounts of missing data (say 10% or less) one can avoid generating imputations and just use surrogate splits. For the pima indians data, with about 10% missing data in the training set, single imputation already shows a somewhat better predictive performance. Multiple imputation shows a consistently superior performance, as it profits from the variance reduction achieved by averaging the resulting trees. For high variance models such as trees and neural networks multiple imputation may therefore yield a substantial performance improvement.
References
1. C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences, 1999. http://www.ics.uci.edu/~mlearn/MLRepository.html.
2. L. Breiman. Bagging predictors. Machine Learning, 26(2):123-140, 1996.
3. L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth, Belmont, California, 1984.
4. D.B. Rubin. Multiple imputation after 18+ years. Journal of the American Statistical Association, 91:473-489, 1996.
5. J.L. Schafer. Analysis of Incomplete Multivariate Data. Chapman & Hall, London, 1997.
6. J.L. Schafer and M.K. Olsen. Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research, 33(4):545-571, 1998.
7. M.A. Tanner. Tools for Statistical Inference (third edition). Springer, New York, 1996.
Rough Dependencies as a Particular Case of Correlation: Application to the Calculation of Approximative Reducts*

María C. Fernández-Baizán¹, Ernestina Menasalvas Ruiz¹, José M. Peña Sánchez¹, Socorro Millán², Eloina Mesa²
¹ Departamento de Lenguajes y Sistemas Informáticos e Ingeniería del Software, Facultad de Informática, U.P.M., Campus de Montegancedo, Madrid
² Universidad del Valle, Cali, Colombia
{cfbaizan, emenasalvas}@fi.upm.es,
[email protected],
[email protected],
[email protected]

Abstract. Rough Sets Theory provides a sound basis for the extraction of qualitative knowledge (dependencies) from very large relational databases. Dependencies may be expressed by means of formulas (implications) in the following way: {x1, . . . , xn} ⇒ρ {y}, where {x1, . . . , xn} are attributes that induce partitions into equivalence classes on the underlying population. The coefficient ρ is the dependency degree; it establishes the percentage of objects that can be correctly assigned to classes of y, taking into account the classification induced by {x1, . . . , xn}. Dealing with decision tables, it is important to determine ρ and to eliminate redundant attributes from {x1, . . . , xn}, to obtain minimal reducts having the same classification power as the original set. The problem of reduct extraction is NP-hard; thus, approximative reducts are often determined. Reducts have the same classification power as the original set of attributes, which quite often contains redundant attributes. The main idea developed in this paper is that attributes considered as random variables related by means of a dependency are also correlated (the opposite, in general, is not true). From this fact we try to find, making use of well-stated and widely used statistical methods, only the most significant variables, that is to say, the variables that contribute the most (in a quantitative sense) to determining y. The set of attributes (in general a subset of {x1, x2, . . . , xn}) obtained by means of well-founded, sound statistical methods could be considered as a good approximation of a reduct.
Keywords: Rough Sets, Rough Dependencies, Multivariate Analysis, Multiple Regression.
* This work is supported by the Spanish Ministry of Education under project PB950301.
1 Rough Dependencies Reducts
Let U = {1, 2, . . . , n} be a non-empty set of objects that will be called the universe. Objects of the universe are described by means of a set of attributes T = {x1, x2, . . . , xk}. If we assume all these attributes to be mono-valued functions of the elements of U, then they can be seen as equivalence relations on U, the corresponding quotient sets being

    U/xj = { [i]xj : i ∈ U }                                  (1)

where [i]xj stands for the equivalence class (with respect to xj) including the element i. Let P ⊂ T be a subset of T. The indiscernibility relation with respect to P, IND(P), is defined as follows:

    U/IND(P) = ⋂_{xj ∈ P} [i]xj                               (2)

The indiscernibility relation is an equivalence relation. Let us now consider the following sets: P ⊆ T and Q ⊆ T. We say that Q depends on P, P ⇒ Q, if and only if IND(P) ⊆ IND(Q) (every class of IND(P) is included in a class of IND(Q)). In general, due both to the random nature of data and the inherent imprecision of the measures, from a table of observations we cannot infer exact dependencies. All that can be obtained are expressions of the form P ⇒ρ Q, ρ being the dependency degree, 0 ≤ ρ ≤ 1, where 1 corresponds to total dependency and 0 to total independence of Q with respect to P.

    POS_P(Q) = ⋃_{X ∈ U/Q} IND(P)X                            (3)

    IND(P)X = ∪ { Y ∈ U/IND(P) : Y ⊆ X }                      (4)

We can now define ρ as (card POS_P(Q) / card U) × 100. The meaning of the dependency P ⇒ρ Q is that ρ% of the elements of U can be correctly assigned to classes of Q, given the classification P. If, deleting xj ∈ P, the equality POS_{P-{xj}}(Q) = POS_P(Q) holds, then we say that xj is Q-redundant in P and it may be suppressed while preserving the classification power of the set. If P' ⊂ P is such that POS_P(Q) = POS_{P'}(Q) and P' does not contain Q-redundant elements, then we say that P' is a Q-reduct of P.

Dealing with decision tables, with C = {x1, x2, . . . , xk} (condition attributes) and D = {y} (decision attribute), the dependency C ⇒ρ D holds, and we must: (a) determine ρ, and (b) minimise C (eliminating redundancies by means of extracting reducts from it).
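As a minimal illustration of the dependency degree just defined (my own sketch, not code from the paper), ρ can be computed by grouping the objects on the condition attributes and counting those whose equivalence class falls entirely inside one decision class:

    from collections import defaultdict

    def dependency_degree(rows, condition_idx, decision_idx):
        """rows: list of tuples (a decision table); returns rho as a percentage."""
        classes = defaultdict(list)
        for row in rows:
            key = tuple(row[i] for i in condition_idx)   # IND(C) equivalence class of the row
            classes[key].append(row[decision_idx])
        positive = sum(len(vals) for vals in classes.values()
                       if len(set(vals)) == 1)           # class lies inside one D-class
        return 100.0 * positive / len(rows)              # rho = card POS_C(D) / card U * 100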
2 Techniques for the Multidimensional Analysis of Data: Correlation and Multiple Regression
The choice of a statistical technique for the multidimensional analysis of data depends on the nature of the data as well as on the desired objective: description or prediction. Dealing with decision tables, the problem can be seen as the prediction of the decision attribute making use of the condition attributes. We can distinguish two different cases to which the regression technique is applicable:
– When the predictive variables (in our case condition attributes) are quantitative ones and the predicted variable (the decision attribute in our case) is also quantitative.
– When the predictive variables are quantitative and the predicted variable is qualitative but can be expressed by means of a numerical value with a logical order.
3 Multiple Regression
In simple correlation there is only one predictive variable and one predicted variable. The n available examples constitute a cloud of dots in the two-dimensional plane (X, Y) through which the least-squares straight line is drawn. In multiple regression this procedure is generalised. Having k predictive variables, we have to calculate k coefficients A1, A2, . . . , Ak as well as a constant term y0 that allow us to form the equation

    y = y0 + A1·x1 + A2·x2 + . . . + Ak·xk                    (5)

of the regression hyperplane that best approximates the n examples. Assuming n >> k, the k coefficients determine a vector A = (A1, . . . , Ak)', the values of x1, . . . , xk constitute an n × k matrix X, and the n values of Y form a column vector Y = (Y1, . . . , Yn)'   (6). We then get

    A = (X'X)^(-1) X'Y                                        (7)

(X' being the transpose of X). In order to calculate these coefficients the method of centred variables is applied. To evaluate the quality of the approximation, the difference between the observed and the predicted values is calculated. Let s be the sum of the squares of these differences. We then define

    σ² = s / (n - k - 1)                                      (8)
When n >> k this value is approximately the sample variance of the n examples. The correlation coefficient r is then

    r = sqrt( 1 - s / Σ (yi - ȳ)² )                           (9)

with -1 ≤ r ≤ 1. We consider values |r| > 0.8.
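A compact numerical sketch of equations (5)-(9), written here in Python/NumPy as an illustration (not part of the original paper):

    import numpy as np

    def multiple_regression(X, y):
        """X: n x k matrix of predictive variables, y: n-vector; n is assumed >> k."""
        n, k = X.shape
        Xc = np.column_stack([np.ones(n), X])             # constant term y0 plus A1..Ak
        coeffs = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)     # A = (X'X)^-1 X'y, eq. (7)
        residuals = y - Xc @ coeffs
        s = np.sum(residuals ** 2)                        # sum of squared differences
        sigma2 = s / (n - k - 1)                          # eq. (8)
        r = np.sqrt(1.0 - s / np.sum((y - y.mean()) ** 2))  # eq. (9)
        return coeffs, sigma2, r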
4 Stepwise Regression
We are interested only in the most significant variables "explaining" or "predicting" Y. To eliminate the less significant ones, we follow an iterative process of stepwise regression. The steps are the following:
– Carry out the simple regression process with every variable under consideration. Then retain the one giving the maximal value of r (or the minimal value of s).
– Carry out the double regression process with the selected variable and any other one. Retain the one giving the minimal value of s.
– We continue in this way (triple regression, ...).
In each step there is a decrement δ of s. We calculate

    F = δ / σ²                                                (10)

We compare this result with the value given by a Fisher table for (n - k - 1) and 1 degrees of freedom. We finish when the result of this test is negative (F_calculated < F_table). The set of condition variables selected in this way is an approximative reduct.
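The stepwise procedure above can be sketched as follows. This is an illustrative outline (assuming the multiple_regression helper from the previous sketch and SciPy's F quantile function), not the implementation used by the authors.

    import numpy as np
    from scipy.stats import f as f_dist

    def stepwise_selection(X, y, alpha=0.05):
        n, total_vars = X.shape
        ss_total = np.sum((y - y.mean()) ** 2)
        selected, s_prev = [], ss_total                   # start from the empty model
        while len(selected) < total_vars:
            best_j, best_s, best_sigma2 = None, None, None
            for j in range(total_vars):
                if j in selected:
                    continue
                coeffs, sigma2, r = multiple_regression(X[:, selected + [j]], y)
                s = (1 - r ** 2) * ss_total               # residual sum of squares
                if best_s is None or s < best_s:
                    best_j, best_s, best_sigma2 = j, s, sigma2
            delta = s_prev - best_s                       # decrement of s
            k = len(selected) + 1
            F = delta / best_sigma2                       # eq. (10)
            if F < f_dist.ppf(1 - alpha, 1, n - k - 1):   # Fisher table comparison
                break                                     # stop: no significant gain
            selected.append(best_j)
            s_prev = best_s
        return selected                                   # approximative reduct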
5 Correlation vs Rough Dependencies
Correlation does not mean causality. If two independent variables depend on a third one, they will be strongly correlated. Correlation does not imply dependency. But if y depends on {x1, x2, ..., xk}, then {y} will be correlated with x1, x2, ..., xk.
6 Stepwise Regression as a Foundation for the Calculation of Approximative Reducts
When dependencies such as {x1, x2, ..., xk} ⇒ {y} are simplified, we consider possible dependencies existing between subsets of {x1, x2, ..., xk}, thus eliminating redundancy. The statistical approach is similar: if there is a set of variables strongly correlated in the implicant, there is redundant use of the less significant ones, which may be eliminated by means of stepwise regression. The subset of condition variables obtained in this way is an approximative reduct.
Fig. 1. Approximative reducts by means of stepwise regression. [The figure shows a decision table (C, D), with C = {x1, x2, ..., xk} and D = {y}, processed along two paths: (i) a DISCRETISER followed by a LOWER module computing ρ = card POS_C(D) / card U, and (ii) STEPWISE REGRESSION yielding an approximative reduct C' ⊆ C, giving (C', D), followed by the DISCRETISER and LOWER module computing ρ' = card POS_C'(D) / card U.]
7 Calculating Approximative Reducts by Means of Stepwise Linear Regression
In a decision table, when (i) the number of cases (rows) is much greater than the number of attributes (columns), (ii) the condition attributes are quantitative, and (iii) the decision attribute is either quantitative or qualitative but susceptible of being expressed as a numerical value with an order, the dependency between condition and decision may be analytically approached by means of a linear regression model¹:

    y = y0 + A1·x1 + ... + Ak·xk                              (11)
The stepwise regression process provides for the elimination of the less significant condition attributes, thus obtaining an approximative reduct, whose quality may be tested by comparing the percentage of objects classified using the reduct against that obtained using the whole set of conditions.
8 Application to Randomly Generated Data
A table containing 10,000 tuples has been generated in a random way. The table corresponds to a decision table composed of 4 condition attributes x1, x2, x3, x4 and one decision attribute y. The correlation matrix of x1, x2, x3, x4 and y is given in (12); besides the unit diagonal, its off-diagonal entries are 0.731, 0.816, 0.229, -0.535, -0.824, -0.139, -0.821, -0.245, -0.973 and -0.029.
¹ If |r| < 0.8 the linear model is not adequate and other approaches should be used.
The selection of variables (in the order x4, x1, x2, x3) results in:

    y = 117.57 - 0.738·x4                         (r = -0.821; δ = 0.674)        (13)
    y = 103.1 - 0.614·x4 + 1.44·x1                (r = 0.986; δ = 0.297)         (14)
    y = 71.65 - 0.237·x4 + 1.452·x1 + 0.416·x2    (r = 0.991; δ: non-sensitive)  (15)
The correlation matrix indicates that there is a high correlation between x2 and x4. The minimal distance between the corresponding coefficient and its standard deviation pointed out the need to eliminate x4. Thus, the following result is obtained as a linear model of y:

    y = 52.58 + 1.468·x1 + 0.662·x2               (r = 0.989)                    (16)
Considering the possible existence of a dependency between {x1, x2, x3, x4} and {y}, we get that an approximative reduct is {x1, x2}. This method has the advantage, from the point of view of minimising the error, of calculating approximative reducts from raw (non-discrete) data. However, if we apply a discretising method and then calculate the dependency degree making use of rough sets, we obtain the following result:

    {x1, x2, x3, x4} ⇒ρ1 {y},   ρ1 = 88.27%                  (17)
    {x1, x2} ⇒ρ2 {y},           ρ2 = 86.74%                  (18)
From this result we can conclude that the power of classification remains almost unaltered.
A Fuzzy Beam-Search Rule Induction Algorithm

Cristina S. Fertig¹, Alex A. Freitas², Lucia V. R. Arruda¹, and Celso Kaestner¹,²

¹ CEFET-PR, CPGEI. Av. Sete de Setembro, 3165 Curitiba – PR. 80230-901. Brazil.
[email protected],
[email protected],
[email protected]
² PUC-PR, PPGIA-CCET. Rua Imaculada Conceição, 1155 Curitiba – PR. 80215-901. Brazil. {alex, kaestner}@ppgia.pucpr.br http://www.ppgia.pucpr.br/~alex
Abstract. This paper proposes a fuzzy beam search rule induction algorithm for the classification task. The use of fuzzy logic and fuzzy sets not only provides us with a powerful, flexible approach to cope with uncertainty, but also allows us to express the discovered rules in a representation more intuitive and comprehensible for the user, by using linguistic terms (such as low, medium, high) rather than continuous, numeric values in rule conditions. The proposed algorithm is evaluated in two public domain data sets.
1 Introduction

This paper addresses the classification task. In this task the goal is to discover a relationship between a goal attribute, whose value is to be predicted, and a set of predicting attributes. The system discovers this relationship by using known-class examples, and the discovered relationship is then used to predict the goal-attribute value (or the class) of unknown-class examples. There are numerous rule induction algorithms for the classification task. However, the vast majority of them work within the framework of classic logic. In contrast, this paper proposes a fuzzy rule induction algorithm for the classification task. The motivation for this work is twofold. First, fuzzy logic is a powerful, flexible method to cope with uncertainty. Second, fuzzy rules are a natural way to express rules involving continuous attributes. Actually, rule induction algorithms implicitly perform a kind of 'hard' (rather than soft) discretization when they cope with continuous attributes. For instance, within the framework of classic logic a rule induction algorithm discovers rules containing conditions such as 'age < 25', which has the obvious disadvantage of not coping well with the borderline ages of 24 and 25. In contrast, a fuzzy rule can contain a condition such as 'age = young', where young is a fuzzy attribute describing persons around 25 years old. This concept is more natural and more comprehensible for the user. In principle any rule induction method can be 'fuzzyfied'. Indeed, some fuzzy decision tree algorithms have been proposed in the past [3], [2]. In this paper we have chosen, as the underlying data mining algorithm to be fuzzyfied, a variant of the beam-search rule induction algorithm described in [8]. Our motivation for this choice
is as follows. Despite their popularity, most decision tree algorithms perform a kind of hill-climbing search for rules, by exploring just one alternative at a time, which makes them very sensitive to the problem of local maxima. Beam search algorithms have the advantage of being less sensitive to this problem, since they explore w alternatives at a time, where w is the beam width [13]. To the best of our knowledge, this paper is the first work to propose a fuzzy beam search-based rule induction algorithm.
2 Beam Search-Based Rule Induction

The basic idea of a beam search-based rule induction algorithm [8] is described in Figure 1. The algorithm receives two arguments as input, namely max_depth (the maximum depth of the search tree) and w, the beam width (the number of rules or tree paths being explored by the algorithm at a given time). Both parameters are small integer numbers. In Figure 1, Ri denotes the i-th rule, i=1,…,w, Aj denotes the j-th predicting attribute, vjk denotes the k-th value of the j-th predicting attribute, and Rijk denotes the new rule created by adding condition Aj = vjk to the rule Ri. This is a top-down algorithm, which starts with an 'empty rule' with no condition, and iteratively adds one condition at a time to all the current rules, so specializing the rules. Each condition added to a rule is of the form 'Attribute = Value', where Value is a value belonging to the domain of the attribute. (The operator could also be '>' or '<'.)

Let A → C denote a rule, where A is the rule antecedent (a conjunction of conditions) and C is the rule consequent (the value predicted for the goal attribute). The CF measure is simply |A & C| / |A|, where |x| denotes the cardinality of set x. In other words, CF is the ratio of the number of examples that both satisfy the conditions in the rule antecedent and have the goal-attribute value predicted by the rule consequent over the number of examples satisfying the conditions in the rule antecedent. We borrowed from [10] the idea of using a variant of this measure defined as: (|A & C| - ½) / |A|. The motivation for subtracting ½ from the numerator is to favor the discovery of more general rules, by avoiding the overfitting of the rules to the data. For instance, consider two rules R1 and R2, where R1 has |A & C| = |A| = 1 and R2 has |A & C| = |A| = 100. Without the ½ correction, both rules have a CF = 100%. However, rule R1 is probably overfitting the data, and
rule R2 is more likely to be accurate on unseen data. With the ½ correction we achieve the more realistic CF measures of 50% and 99.5% for rules R1 and R2, respectively.

Input: max_depth, w;
depth := 0;
RuleSet = a single rule with no conditions;
REPEAT
  FOR EACH rule Ri in RuleSet
    FOR EACH attribute Aj not used yet in rule Ri
      Specialize Ri by adding a condition Aj = vjk to the rule, and call the new rule Rijk;
      Compute a rule-quality measure for Rijk;
    END FOR
  END FOR
  RuleSet = the best w rules among all current rules;
  depth := depth + 1;
UNTIL (no rule created in this iteration is better than any rule of previous iteration) OR (depth = max_depth)
Fig. 1. Overview of a Beam Search-based Rule Induction Algorithm
Finally, note that the algorithm of Figure 1 is searching for conditions to add to the rule antecedent only. It does not specify how to form the rule consequent. This part of the rule contains the value (class) predicted for the goal attribute. The choice of the class predicted by a rule can be made in several ways. One approach would be to let the algorithm automatically choose the best class to form the rule consequent, by picking the class to which the majority of the examples satisfying the rule antecedent belong. Another approach would be to run the algorithm k times for a k-class problem, where in each run the algorithm searches only for rules predicting one of the k classes. Yet another approach, assuming a two-class problem, is to run the algorithm to discover only rules predicting one class. In this case, if an example satisfies any of the discovered rules it is assigned the corresponding class; otherwise it is assigned the other class (the 'default' class). We have opted for this latter approach. In our work the algorithm searches only for rules predicting the minority class (i.e. the class with smaller relative frequency in the data being mined), whereas the majority class is the default class. We have chosen this approach for two reasons. First, its simplicity. Second, focusing on the discovery of rules predicting the minority class has the advantage that these rules tend to be more interesting to the user than rules predicting the majority class [5]. For instance, rules predicting a rare disease tend to be more interesting than rules predicting that the patient does not have that disease. (Of course, one of the reasons why minority-class rules tend to be more interesting is that it is more difficult to discover them, in comparison with majority class rules.)
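A minimal sketch of the rule-quality computation and beam selection described in this section, written as my own Python illustration with a simplified crisp rule representation (it is not the authors' implementation):

    def rule_quality(rule, examples, target_class):
        """rule: list of (attribute, value) conditions; examples: list of dicts with a 'class' key."""
        covered = [e for e in examples if all(e[a] == v for a, v in rule)]
        if not covered:
            return 0.0
        correct = sum(1 for e in covered if e["class"] == target_class)
        return (correct - 0.5) / len(covered)        # CF with the 1/2 correction

    def beam_step(rules, attributes, values, examples, target_class, w):
        """One specialization step: extend every rule with every unused condition, keep the best w."""
        candidates = []
        for rule in rules:
            used = {a for a, _ in rule}
            for a in attributes:
                if a in used:
                    continue
                for v in values[a]:
                    new_rule = rule + [(a, v)]
                    quality = rule_quality(new_rule, examples, target_class)
                    candidates.append((quality, new_rule))
        candidates.sort(key=lambda t: t[0], reverse=True)
        return [r for _, r in candidates[:w]]        # beam of width w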
3 Fuzzyfying a Beam Search-Based Rule Induction Algorithm

In this section we describe how we have fuzzyfied the algorithm described in the previous section. First of all, as a pre-processing step, for each continuous attribute in the data being mined we must associate: (1) a set of fuzzy linguistic terms (such as high, medium, low), which are essentially labels for fuzzy sets; and (2) a fuzzy membership function, which specifies the degree to which each attribute's original value belongs to the attribute's linguistic terms. Loosely speaking, this can be regarded as a kind of 'discretization', since the originally-continuous attribute can now take on just a few linguistic terms as its value. However, this is a very flexible discretization, since each of the now 'discrete' values of the attribute is actually a flexible linguistic term corresponding to a fuzzy set. The continuous attributes were fuzzyfied by using trapezoidal membership functions [1], since this kind of function often leads to a data modeling closer to reality. Note that it is necessary to fuzzyfy only continuous attributes, and not categorical ones.

In addition to the above-described fuzzyfication of continuous attributes, we also need to fuzzyfy the computation of the rule-quality measure. In our case this corresponds to fuzzyfying the formula (|A & C| - ½) / |A|, defined in the previous section. In our first attempt to define a fuzzy version of |A|, this term was the summation, over all training examples, of the degree to which the example satisfies the rule antecedent. For each example, this degree is computed by a fuzzy AND of the degree to which the example satisfies each condition. We have used the conventional definition of the fuzzy AND as the minimum operator. For instance, suppose a rule antecedent contains the following three conditions 'age = young and salary = low and sex = male', where age and salary are fuzzyfied (originally continuous) attributes and sex is a categorical attribute. Suppose that a given training example satisfies the condition 'age = young' to a degree of 0.8, the condition 'salary = low' to a degree of 0.6 and the condition 'sex = male' to a degree of 1. In this case the example satisfies the rule antecedent to a degree of 0.6 (minimum value among 0.8, 0.6 and 1). Note that if the example had 'sex = female' it would satisfy the above rule antecedent to a degree of 0. (Conditions involving categorical attributes are satisfied to a degree of either 0 or 1.)

However, the above fuzzyfication of |A|, although theoretically sound, has a disadvantage. In practice, many training examples can satisfy a rule antecedent to a small degree, and the summation of all these small degrees of membership can undermine the reliability of the CF measure. To avoid this, our summation of the degree to which an example satisfies the rule antecedent includes only the examples with a degree of membership greater than or equal to a predefined threshold (set to 0.5 in our experiments), rather than all training examples. This operation is based on the alpha-cut technique used in fuzzy arithmetic [9]. In our fuzzy version, |A & C| is the summation, over all examples that have the class predicted by the rule, of the degree to which the example satisfies the rule antecedent. Analogously to the computation of |A|, this summation considers only examples which satisfy the rule antecedent to a degree greater than or equal to the above-mentioned membership threshold.
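The fuzzy quantities described above can be sketched as follows (an illustrative Python version under the stated assumptions of trapezoidal membership, minimum as fuzzy AND, and an alpha-cut at 0.5; it is not the authors' code):

    def trapezoid(x, a, b, c, d):
        """Trapezoidal membership function with support [a, d] and core [b, c]."""
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)

    def antecedent_degree(example, conditions):
        """conditions: callables, each returning the degree to which the example satisfies one condition."""
        return min(cond(example) for cond in conditions)      # fuzzy AND = minimum

    def fuzzy_cf(examples, conditions, target_class, alpha=0.5):
        degrees = [(antecedent_degree(e, conditions), e) for e in examples]
        degrees = [(d, e) for d, e in degrees if d >= alpha]  # alpha-cut: ignore weak matches
        A = sum(d for d, _ in degrees)                        # fuzzy |A|
        A_and_C = sum(d for d, e in degrees
                      if e["class"] == target_class)          # fuzzy |A & C|
        return (A_and_C - 0.5) / A if A > 0 else 0.0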
4 Computational Results

The above described fuzzy rule induction algorithm was evaluated on two public domain data sets, available from http://www.ics.uci.edu/~mlearn/MLRepository.html. The first data set used in our experiments was the Pima Indians Diabetes Database [11], which contains 9 continuous attributes and 768 examples. The second data set was the Boston Housing Data [7], which contains 13 continuous attributes and 1 binary attribute. The latter was removed from the data set for the purposes of our experiments, since only continuous attributes are fuzzyfied by our algorithm. This data set contains 506 examples. The goal attribute for this data set (median value of owner-occupied homes in $1000's) was originally continuous. To adapt this data set to the classification task, the goal attribute was discretized, so that it can take on only two classes (cheap and expensive).

Each data set was divided into 5 partitions, and a cross-validation procedure [6] was then performed. For each data set, this corresponds to running the algorithm 5 times, where each time a different partition is used as the test set and all the remaining four partitions are used as the training set. In all our experiments the maximum depth of the tree search was set to 2 and the beam width w was set to 10. Almost all continuous attributes were fuzzyfied into 3 linguistic values, each represented by a trapezoidal membership function. The only exceptions were the attributes Rad and Tax of the housing data set, which were fuzzyfied into two linguistic values.

The results of our experiments are reported in Table 1. Each cell of the table refers to the average results over the 5 runs of the algorithm associated with the cross-validation procedure. The first row indicates the baseline accuracy of each data set, that is the relative frequency of the majority class. A basic requirement for the predictive accuracy of a classifier is that it be greater than the baseline accuracy, since it would be trivial to achieve such accuracy by always predicting the majority class. The second row indicates the overall predictive accuracy achieved by our fuzzy algorithm. This was computed as the ratio of the number of correctly classified test examples over the total number of test examples. Note that for both data sets the algorithm achieved an accuracy rate significantly better than the baseline accuracy.

Note that the above overall accuracy rate only takes into account whether the classification of an example was correct or wrong, regardless of which kind of rule (a discovered rule or the default-class rule) was used to classify the example. To analyze in more detail the quality of the discovered rules, we also measured separately the accuracy rate of the discovered rules and the accuracy rate of the default-class rule. Recall that all discovered rules predict the same class, namely the minority class, whereas the default-class rule, which is applied whenever the test example does not satisfy any discovered rule, simply predicts the majority class. Recall also that an example is considered to satisfy a discovered rule only if the rule antecedent belongs to the alpha-cut membership function describing the linguistic terms (with alpha = 0.5 in our experiments), as explained in section 3. The third row of Table 1 indicates the accuracy rate of the discovered rules, computed as the ratio of the number of examples correctly classified by the discovered rules over the number of examples satisfying any of the discovered rules.
(Since all discovered rules predict the same class, it is irrelevant which rule is actually classifying the example.)
The fourth row indicates the accuracy rate of the default-class rule, computed as the ratio of the number of examples correctly classified by this rule over the number of examples classified by this rule (i.e. the number of examples that do not satisfy any of the discovered rules). As can be seen in the table, in both data sets the default rule has a predictive accuracy significantly better than the discovered rules. This was somewhat expected, given the difficulty of predicting the minority class in both data sets. Actually, if we were to compute the 'baseline accuracy of the minority class', we would find the values 0.349 (1 – 0.651) and 0.245 (1 – 0.755) for the Diabetes and Housing data sets, respectively. From this point of view, the discovered rules can still be considered as good-quality ones with respect to predictive accuracy, since their accuracy is 0.711 and 0.838 for the Diabetes and Housing data sets, respectively.

Table 1. Computational Results

                               Diabetes data set   Housing data set
Baseline accuracy                    0.651               0.755
Overall accuracy                     0.711               0.838
Accuracy of discovered rules         0.641               0.725
Accuracy of default rule             0.730               0.863
Another issue to be considered is the comprehensibility of the discovered rules. Although this is a subjective criterion, it is common in the literature to evaluate comprehensibility in terms of the syntactical simplicity of the discovered rules. In this case, the smaller the number of rules and the smaller the number of conditions per rule, the more comprehensible the discovered rule set is. With this definition of comprehensibility we can say that the rules discovered by our algorithm are comprehensible, almost by definition of the algorithm and a suitable choice of its parameters. Firstly, the algorithm discovers only a small set of rules, whose number is the user-specified beam width. Secondly, the algorithm discovers only relatively short rules, where the number of conditions of the discovered rules is at most the user-specified maximum depth of the search tree. In addition, and more important, the use of linguistic terms associated with our fuzzy algorithm can be considered as a form of improving rule comprehensibility, in comparison with continuous, numeric values, as argued in the introduction. It should be noted, however, that high accuracy and comprehensibility do not necessarily imply interestingness. To consider a classical example, in a hospital database we can easily mine a rule such as ‘if (patient is pregnant) then (patient sex is female)’. Although this rule is highly accurate and comprehensible, it is obviously uninteresting, since it states the obvious. Our algorithm discovers only rules predicting the minority class, which, as argued in section 3, tend to be more interesting rules for the user. However, a more detailed analysis of the degree of interestingness of the discovered rules is beyond the scope of this paper. Although the literature proposes several methods to evaluate the degree of interestingness of the discovered rules – see e.g. [4], [12] – it does not seem trivial to adapt these methods to the context of fuzzy rules.
5 Conclusion

We have proposed a fuzzy version of a beam search-based rule induction algorithm. We have also evaluated the algorithm on two data sets. Overall, the results are good, not only with respect to predictive accuracy, but also (and more importantly, in data mining) with respect to the comprehensibility of the discovered rules, which is intuitively improved by the use of linguistic terms associated with fuzzy rules. However, the computational results reported in this paper are still somewhat preliminary, since the algorithm has been applied to only two data sets. Future research will include a more extensive evaluation of the algorithm on other data sets.

A disadvantage of our fuzzy approach is that it requires a preprocessing phase to specify the fuzzy membership for each continuous attribute. To solve this problem two approaches can be used: (1) this specification is done with the help of the user (an expert in the meaning of the data), which requires valuable human time; (2) a fuzzy clustering algorithm can be used to get the membership function [9]. Hopefully, in the future fuzzy databases will be more commonplace, so that the fuzzy values and fuzzy membership functions that our algorithm requires might be already available in the underlying fuzzy database, so avoiding the need for the above preprocessing phase.
References
1. Bojadziev, G., Bojadziev, M.: Fuzzy Sets, Fuzzy Logic, Applications. World Scientific (1995)
2. Chi, Z., Yan, H.: ID3-derived fuzzy rules and optimized defuzzification for handwritten numeral recognition. IEEE Trans. on Fuzzy Systems, 4(1), Feb. (1996) 24-31
3. Cios, K.J., Sztandera, L.M.: Continuous ID3 algorithm with fuzzy entropy measures. Proc. IEEE Int. Conf. Fuzzy Systems, San Diego (1992) 469-476
4. Freitas, A.A.: On objective measures of rule surprisingness. Principles of Data Mining and Knowledge Discovery (Proc. 2nd European Symp., PKDD-98). Lecture Notes in Artificial Intelligence 1510. Springer-Verlag (1998) 1-9
5. Freitas, A.A.: On rule interestingness measures. To appear in Knowledge-Based Systems journal (1999)
6. Hand, D.: Construction and Assessment of Classification Rules. John Wiley & Sons (1997)
7. Harrison, D., Rubinfeld, D.L.: UCI Repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Dept. of Information and Computer Science (1993)
8. Holsheimer, M., Kersten, M., Siebes, A.: Data surveyor: searching the nuggets in parallel. In: U.M. Fayyad et al. (Eds.) Advances in Knowledge Discovery and Data Mining. AAAI Press (1996) 447-467
9. Pedrycz, W., Gomide, F.: An Introduction to Fuzzy Sets: Analysis and Design. MIT Press (1998)
10. Quinlan, J.R.: Generating production rules from decision trees. Proc. Int. Joint Conf. AI (IJCAI-87) (1987) 304-307
11. Sigillito, V.: UCI Repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: Univ. of California, Dept. of Information and Computer Science (1990)
12. Suzuki, E., Kodratoff, Y.: Discovery of surprising exception rules based on intensity of implication. Principles of Data Mining and Knowledge Discovery (Proc. 2nd European Symp., PKDD-98). Lecture Notes in Artificial Intelligence 1510. Springer-Verlag (1998) 10-18
13. Winston, P.H.: Artificial Intelligence. 3rd Ed. Addison-Wesley (1992)
An Innovative GA-Based Decision Tree Classifier in Large Scale Data Mining

Zhiwei Fu
Robert H. Smith School of Business, University of Maryland, College Park, 20742, USA
[email protected]

Abstract. A variety of techniques have been developed to scale decision tree classifiers in data mining to extract valuable knowledge. However, these approaches either cause a loss of accuracy or cannot effectively uncover the data structure. We explore a more promising GA-based decision tree classifier, OOGASC4.5, to integrate the strengths of decision tree algorithms with statistical sampling and genetic algorithms. The proposed program not only enhances classification accuracy but also has the potential advantage of extracting valuable rules. The computational results are provided along with analysis and conclusions.
1 Introduction
Data mining has become an attractive discipline within the last few years [4][7]. Its goal is to extract pieces of previously unknown, but valuable knowledge or patterns from large data sets. Data mining over large data sets is important due to its commercial potential. Numerous algorithms have been developed with regard to handling large data sets, such as distributed algorithms, restricted search, parallel algorithms and data reduction algorithms. However, the computational costs, the available storage and the retrieval of large data sets are still serious concerns for large-scale data mining. While certain data mining algorithms show consistent performances for some data sets, it is not necessarily true across all problem domains. For example, the performance of different extracted "ideal" schemes for induction-based classification/decision problems varies greatly with the characteristics of the data sets to which the algorithms apply. On the other hand, techniques such as discretization can cause a loss of accuracy when scaling up to large data sets [1][8]. In this paper we explore a more promising algorithm using a genetic algorithm (GA) and SampleC4.5 for classification problems. The paper begins with introducing the problem and reviewing the decision tree algorithm and GA, followed by the design of Object Oriented Genetic Algorithm with SampleC4.5 (OOGASC4.5). We then show the computational results. After comparison and analysis, we end up with conclusions in the final section.
2 The OOGASC4.5 Program
2.1 C4.5 and SampleC4.5

Decision tree algorithms have long been recognized as a powerful tool in data mining to represent schemes in the studied data sets according to values of variables. Among them, C4.5 [9] has been widely implemented and tested. It uses the top-down induction approach to build the decision tree, which is fitted to training samples by recursively partitioning the data into increasingly homogeneous subsets, based on the values of one variable at a time. It starts at the root of the tree and moves through it until a leaf is encountered, or no more improvement can be made. However, C4.5 pursues a local greedy search which can converge quickly, but at the expense of a higher possibility of getting trapped in local optima. To efficiently uncover the data set, we develop SampleC4.5 by instilling statistical sampling methods into C4.5. The SampleC4.5 algorithm starts with the full original set of variables and a training set of a certain starting percentage (p = p0) extracted from the raw data. The remaining data (1 - p0)N form the test set, where N is the size of the original data set. In the case of large data sets, we would form a separate test set, and an "untouched" validation set, beforehand rather than the dynamic (1 - p0)N test sets. SampleC4.5 is run on the training set for a number of iterations (n = n0). The classification accuracy on the test sets over all n iterations is averaged, and the standard error is calculated, after which the algorithm increments p by some step percentage s0. The process repeats on a new training set of (p0 + s0)N, and a new test set of (1 - (p0 + s0))N or the pre-specified test/validation set for large data sets. Two statistical sampling approaches, a simple random sampler and a stratified random sampler, are provided. The pseudo code for the algorithm is shown as follows:

begin
  initialize n=1;
  initialize p=p0;
  while (0 [...]

A Comparison of Model Selection Procedures for Predicting Turning Points in Financial Time Series
T. Poddig and C. Huber

[...] The peak indicator

    z_t^P = 1, if x_t > x_{t+i} for i = -τ, -τ+1, . . . , -1, 1, . . . , τ-1, τ;   z_t^P = 0, otherwise   (1)

is defined as a local extreme value of τ preceding and succeeding datapoints. The trough indicator z_t^T is defined in an analogous way. As we investigate monthly time series, we define τ = 2. At time t the economist knows only the current and past datapoints x_t, x_{t-1}, . . . , x_{t-τ+1}, x_{t-τ}. The future values x_{t+1}, . . . , x_{t+τ-1}, x_{t+τ} have to be estimated using e.g. ARMA- and VAR-models. With those estimates we applied a Monte-Carlo based procedure developed by Wecker [1] and Kling [2] to obtain probabilistic statements about near-by TPs. A TP is detected if the probability reaches or exceeds a certain threshold θ, e.g. θ = .5. A participant in the financial markets usually is not interested in MSE, MAE, etc. but in economic performance. Since our models do not produce return forecasts but probabilities for TPs, we have to measure performance indirectly by generating trading signals from those probabilities: a short position is taken when a peak is detected (implying the market will fall, trading signal s = -1), a long
position in the case of a trough (s = +1), and the position of the previous period is maintained if there is no TP. With the actual period-to-period return r_{actual,t} we can calculate the return r_{m,t} from a TP forecast of our model: r_{m,t} = s · r_{actual,t}. In this paper we deal with log-differenced data, so the Cumulative Wealth can be computed by adding the returns over T periods: CW = Σ_{t=1}^{T} r_{m,t}.

To test the ability of the ARMA and VAR models to predict TPs, we investigate nine financial time series, namely DMDOLLAR, YENDOLLAR, BD10Y (performance index for the 10 year German government benchmark bond), US10Y, JP10Y, MSWGR (performance index for the German stock market), MSUSA, MSJPA, and the CRB-Index. The data was available in monthly periodicity from 83.12 to 97.12, equalling 169 datapoints. To allow for the possibility of structural change in the data, we implemented rolling regressions: after estimating the models with the first 100 datapoints and forecasting the τ succeeding datapoints, the data-window of the fixed size of 100 datapoints was put forth by one period, and the estimation procedure as well as the Monte-Carlo simulations were repeated until the last turning point was predicted for 97.10. Thereby we obtained 68 out-of-sample turning point forecasts. We estimated a multitude of models for each model class: 15 ARMA-models from (1,0), (0,1), (1,1), ..., to (3,3) and 3 VAR models VAR(1), (2), and (3) comprising all nine variables. We do not specify one model and estimate all rolling regressions with this model. Rather, we specify a class of models (ARMA and VAR). Within a class the best model is selected for forecasting. As an extreme case, a different model specification could be chosen for every datapoint (within the ARMA class e.g. the ARMA(1,0) model for the first rolling regression, ARMA(2,2) for the second etc.).

Popular in-sample model selection criteria are AIC and SIC. Applying AIC and SIC for model selection within the first rolling regression, we estimated a multitude of e.g. ARMA-models with 100 datapoints and chose the model with the lowest AIC to forecast the τ future datapoints. In contrast to the simple implementation of AIC and SIC, the out-of-sample procedure for model selection is more complicated. Therefore we divided the training data into two subsequent, disjoint parts: an estimation (= training) subset (70 datapoints) and a validation subset (30 datapoints, see figure 1). The first 70 datapoints from t-99 to t-30 were used to estimate the models, which were validated with respect to their abilities to predict TPs on the following 30 datapoints from t-29 to t. The decision which model is the "best" within the out-of-sample selection procedure was made with respect to CW: the model with the highest CW was selected. The specification of this model, e.g. ARMA(2,2), was then re-estimated with the 100 datapoints from t-99 to t to forecast the τ values of the time series that are unknown at time t and that are necessary to decide whether there is a turning point at time t. As a result of model selection with the two in-sample criteria AIC, SIC, and the out-of-sample procedure with regard to CW, we obtain three sequences of TP forecasts each for ARMA- and VAR-models for the out-of-sample backtesting period of the 68 months. Two ARMA-sequences with a threshold θ = .5 could look like table 1.
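As a simple illustration (my own sketch, not code from the paper) of how trading signals and Cumulative Wealth are derived from the turning point probabilities described above:

    import numpy as np

    def cumulative_wealth(peak_prob, trough_prob, actual_returns, theta=0.5):
        """peak_prob, trough_prob: TP probabilities per period; actual_returns: log-differenced returns."""
        s = 1.0                                   # assumed initial position (long); not stated in the paper
        model_returns = []
        for p_peak, p_trough, r in zip(peak_prob, trough_prob, actual_returns):
            if p_peak >= theta:
                s = -1.0                          # peak detected -> short position
            elif p_trough >= theta:
                s = +1.0                          # trough detected -> long position
            model_returns.append(s * r)           # r_m,t = s * r_actual,t
        return float(np.sum(model_returns))       # CW = sum of r_m,t over the backtest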
Fig. 1. Division of the database
The first four columns refer to the number of the rolling regressions and the training, validation, and forecast period, respectively. For AIC and SIC, model selection was performed on the 100 datapoints of the training and validation subset as a whole. The 5th (7th) column gives the specification of the ARMA-model selected by CW (AIC); the 6th (8th) column gives the corresponding CW (AIC) value.

Table 1. ARMA-sequence as an example for the rolling regressions

RR    training     validation    forecast      CW Spec.  CW-value  AIC Spec.  AIC-value
1     83.12-89.9   89.10-92.3    92.4-92.5     (2,2)     .179      (3,3)      -5.326
2     84.1-89.10   89.11-92.4    92.5-92.6     (1,0)     .253      (1,1)      -5.417
...   ...          ...           ...           ...       ...       ...        ...
68    89.8-95.4    95.5-97.10    97.11-97.12   (3,0)     .815      (2,3)      -5.482
The first TP forecast was produced for 92.4 (with the unknown values of 92.5 and 92.6), the last for 97.10. The 68 out-of-sample forecasts of the model sequences generated this way are finally evaluated with respect to CW . To judge whether the econometric models are valuable forecasting tools, one would like to test if the model class under consideration is able to outperform a simple benchmark in the backtesting period. When forecasting economic time series, a simple benchmark is the naive forecast. Using the last certain TP statement can be regarded as a benchmark in this sense. As τ =2, the last certain TP statement can be made for t-2, using the datapoints from t-4 to t. A valuable forecasting model should be able to outperform this Naive TP Forecast (NTPF) in the backtesting period. In order to produce a statistically significant result when comparing the model sequences generated by the different model selection criteria, we apply Analysis of Variance (ANOVA). The forecasts for ARMA-models, θ=.5, with respect to the evaluation criterion CW can be exhibited as in table 2 (the last column
contains the NTPF-results). The entry -.115 in the 3rd column of row 3 means that ARMA-models selected by AIC produced a CW of -.115 in the backtesting period when predicting turning points for MSWGR. Looking at the last row, column 3 reveals that the mean CW over all nine time series from the ARMA forecasts is -.192.

Table 2. Example for the exhibition of the results from the ARMA turning point forecasts

                Selection criteria
          CW      AIC     SIC     NTPF
MSWGR   -.262   -.115   -.020   -.029
BD10Y   -.856   -.515   -.898   -.291
...      ...     ...     ...     ...
mean:   -.232   -.192   -.264   -.196
The block experiment of ANOVA can be used to test if the means of the columns (here the means from the TP predictions) and the means of the rows (the means from TP predictions for one of the time series) are identical. Thereby it is possible to compare the performance of the different model selection criteria. Additionally, the NTPF is included in the test to make sure that the models outperform the benchmark. The basic model of ANOVA is: y_ij = µ + α_i + β_j + e_ij, where y_ij represents the element in row i and column j of table 2, µ is the common mean of all y_ij, α_i is the block effect due to the analysis of different time series in the r rows of table 2, β_j the treatment effect of the p selection criteria (incl. NTPF) in the columns of table 2, and e_ij an iid, N(0; σ²) random factor. We want to test whether the treatment effects β_j are zero: β_1 = β_2 = ... = β_p = 0. In other words, we want to test the null hypothesis that there are no statistically significant effects due to the use of different model selection criteria on the TP forecasts from ARMA- and VAR-models. An F test statistic is based on the idea that the total variation SST of the elements in table 2 can be decomposed into the variation between the blocks SSA, the variation between the selection criteria SSB, and the random variation SSE: SST = SSA + SSB + SSE. Estimators for SST, SSA, SSB and SSE can be computed as shown in [3]. Then an F-statistic can be computed (see [3]). The null is rejected if F exceeds its critical value. The next section presents empirical results.
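A sketch of the block-design F test described above (standard textbook formulas, written here in NumPy as an illustration; [3] gives the exact estimators used by the authors):

    import numpy as np
    from scipy.stats import f as f_dist

    def block_anova_F(Y, alpha=0.05):
        """Y: r x p table (rows = time series blocks, columns = selection criteria incl. NTPF)."""
        r, p = Y.shape
        grand = Y.mean()
        SST = np.sum((Y - grand) ** 2)
        SSA = p * np.sum((Y.mean(axis=1) - grand) ** 2)    # block (row) variation
        SSB = r * np.sum((Y.mean(axis=0) - grand) ** 2)    # treatment (column) variation
        SSE = SST - SSA - SSB                              # random variation
        F = (SSB / (p - 1)) / (SSE / ((r - 1) * (p - 1)))  # F-statistic for the treatment effect
        critical = f_dist.ppf(1 - alpha, p - 1, (r - 1) * (p - 1))
        return F, F > critical                             # reject H0 if F exceeds its critical value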
3 Empirical Results and Conclusion
The following table 3 exhibits the empirical results from the TP forecasts with ARMA- and VAR-models. The 2nd and 3rd column show the value for the F statistic and its corresponding p-value. The 4th to 7th column contain the means
of the model sequence created by the selection criterion under consideration. E.g. the entry ".32" in row 2, column 2 gives the F-statistic for the null that the mean CW of ARMA-models, created by the use of the selection criteria AIC, SIC, CW, and the NTPF, are all the same. The p-value of .8087 indicates that the null cannot be rejected at the usual levels of significance (e.g. .10). Thus we have to conclude that there are no differences between TP forecasts from ARMA-models generated by different model selection criteria. Moreover, the ARMA forecasts do not differ significantly from the NTPF. Columns 4 to 7 exhibit the mean CW over all nine time series. The ARMA models selected by e.g. AIC managed to produce an average CW of .059 in the simulation period. This is only marginally higher than the mean CW from NTPF (.057).

Table 3. Empirical ANOVA results from the TP predictions

Model    F     p       AIC     SIC     CW      NTPF
ARMA    .32   .8087   .059   -.022   -.010    .057
VAR     .16   .9242  -.002   -.002    .015    .057
In general, the results indicate that there are no statistically significant differences between TP predictions from ARMA- and VAR-models (p-values .8087 and .9242). With regard to ARMA-models, AIC seems to be the best selection criterion with respect to CW (mean CW = 0.059). This is only slightly better than the benchmark NTPF (mean CW = 0.057) and cannot be considered a reliable result. The other selection criteria even led to underperformance vs. NTPF. Results are even worse for VAR-models. All VARs underperformed the NTPF. Thus it must be doubted that ARMA- and VAR-models are valuable tools for predicting TPs in financial time series. If they are employed despite the results achieved here, it might be a good choice to make use of the in-sample selection criteria AIC and SIC. They led to results comparable to the out-of-sample validation procedure suggested in this paper and are less expensive to implement. Whether those results hold for other forecasting problems, evaluation criteria, and selection procedures as well has to be investigated by further research.
References

1. Wecker, W. (1979): Predicting the turning points of a time series; in: Journal of Business, Vol. 52, No. 1, 35-50
2. Kling, J.L. (1987): Predicting the turning points of business and economic time series; in: Journal of Business, Vol. 60, No. 2, 201-238
3. Poddig, T.; Huber, C. (1999): A Comparison of Model Selection Procedures for Predicting Turning Points in Financial Time Series - Full Version; Discussion Papers in Finance No. 3, University of Bremen, available at: www1.uni-bremen.de/~fiwi/
Mining Lemma Disambiguation Rules from Czech Corpora
Luboš Popelínský and Tomáš Pavelek
Natural Language Processing Laboratory, Faculty of Informatics, Masaryk University in Brno, Czech Republic
{popel,xpavelek}@fi.muni.cz
Abstract. Lemma disambiguation means finding a basic word form, typically the nominative singular for nouns or the infinitive for verbs. In Czech corpora it was observed that 10% of word positions have at least 2 lemmata. We developed a method for lemma disambiguation when no expert domain knowledge is available, based on a combination of ILP and kNN techniques. We propose a way to use lemma disambiguation rules learned with the ILP system Progol so as to minimise the number of incorrectly disambiguated words. We present results for the most important subtasks of lemma disambiguation for Czech. Although no knowledge of Czech grammar has been used, the accuracy reaches 93% with a small fraction of words remaining ambiguous.
1 Disambiguation in Czech
Disambiguation in inflective languages, of which Czech is a very good instance, is a very challenging task because of both its usefulness and its complexity. DESAM, a corpus of Czech newspaper texts that is now being built at the Faculty of Informatics, Masaryk University, contains more than 1 000 000 word positions, about 130 000 different word forms, about 65 000 of them occurring more than once, and 1665 different tags. DESAM is now being tagged – partially manually, partially by means of different disambiguators – with 66 grammatical categories such as part-of-speech, gender, case, number etc., and about 2 000 tags, i.e. combinations of category-value couples. E.g. for substantives, adjectives and numerals there are 4 basic grammatical categories; for pronouns there are 5 categories, for verbs 7 and for adverbs 3, plus some number of subcategories. The large number of tags arises from combinations of those categories. It was observed [11] that there are on average 4.21 possible tags per word. It is impossible to perform the disambiguation task manually, and any tool that can decrease the amount of human work is welcome. DESAM is still not large enough. It does not contain all Czech word forms – compare 132 000 different word forms in DESAM with more than 160 000 stems of Czech words that morphological analysers are able to recognise (each of them can have a number of both prefixes and suffixes). Thus DESAM does not
contain a representative set of Czech sentences. In addition, DESAM contains some errors, i.e. incorrectly tagged words. Another problem is that a significant amount of word positions (words as well as punctuation) are untagged. For the word form "se" nearly one fifth of the occurrences are untagged (16.8%) and 93.4% of contexts contain an untagged word. The situation is similar for other classes of words with an ambiguous lemma. It should also be noticed that the disambiguation task is much more complex in Czech than in, e.g., English for another reason. For English there are tagged corpora covering a majority of common English sentences, and the known grammar rules cover a significant part of English sentence syntax. Unfortunately, neither of these statements holds for Czech. This makes our task quite difficult.
2 Lemma Disambiguation
Lemma disambiguation, which we address here, means assigning to each word form its basic form – the nominative singular for nouns, adjectives, pronouns and numerals, the infinitive for verbs. E.g. in the sentence Od rána je má Ivana se ženou. (literally: since (the) morning my Ivana (female) has been with (my) wife), each of the words except the preposition "od" has two basic forms. E.g. "rána" can be the genitive of "ráno" (morning) as well as the nominative of the substantive "rána" (bang). In Czech corpora it was observed that 10% of word positions – i.e. every 10th word of a text – have at least 2 lemmata and about 1% of the word forms of the Czech vocabulary have at least 2 lemmata. The most frequent ambiguous word forms are se and je. Disambiguation of the word "se" would be welcome as it is the 3rd most frequent word in the DESAM corpus. Actually, lemma disambiguation (almost always) leads to a disambiguation of sense. In the example, má means either "my" (daughter) or "has" (s/he has), and se is either the preposition "with" (my wife) or the reflexive pronoun "self" (as in "elle se lave" in French). We use here a novel approach to lemma disambiguation based on a combination of memory-based learning, namely the weighted k-Nearest Neighbour method [8], and inductive logic programming (ILP) [9]. Inductive logic programming aims at finding first-order logic rules that cover positive examples (and do not cover negative ones) using given domain knowledge predicates. The rest of the paper is organised as follows. In Section 3 we explain how to build basic domain knowledge if no sufficient linguistic knowledge is available. In Section 4 we present the results obtained with the ILP system Progol for the most frequent lemma-ambiguous word form "se". Rule set accuracy on a disambiguated context is displayed. Section 5 brings the results of disambiguation when the correct tags in a context are unknown. We conclude with a discussion of the results and a summary of related work.
3 Domain Knowledge
There is no complete formal description of Czech grammar. We therefore decided to build domain knowledge predicates without any need for deep linguistic knowledge. We only exploit information about particular tags in a context. The general form of a domain knowledge predicate is p(Context, Focus, Condition), where Context is a variable bound either to the left context in reverse order or to the right context, and Focus and Condition are terms. Focus defines a subpart of the Context. It has the form first(N) (N = 1..max_length), i.e. the sublist of the Context of length N adjacent to the word, where max_length is the maximal length of a context. Condition says what condition must hold on the Focus. Condition is a unary term of the form somewhere(List) (tags from the List appear somewhere in the focused part of the Context) or always(List) (tags from the List appear in all positions of the focused part of the Context). E.g. a goal p(X, first(2), always([c7,nS])) succeeds if the tags c7, nS appear in each of the first two words of the context X – e.g. a pronoun and a noun in singular instrumental as in "(se) svou sestrou" – "(with) his sister".
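The authors use these predicates as ILP background knowledge; purely as an illustration, the check they describe can also be sketched procedurally. The following Python sketch is our reconstruction, not part of the paper: the function names are invented, each word position is assumed to be represented by its set of tags, and any tag names other than c7 and nS are illustrative.

def focus_first(context, n):
    # Sublist of the context of length n adjacent to the ambiguous word;
    # the left context is assumed to be given in reverse order, so taking
    # the first n elements works for both directions.
    return context[:n]

def holds(context, n, condition, tags):
    # Procedural reading of p(Context, first(N), Condition(Tags)).
    # Each element of `context` is the set of tags of one word position.
    focus = focus_first(context, n)
    required = set(tags)
    if condition == "somewhere":
        # the listed tags appear together at some position of the focus
        return any(required <= word_tags for word_tags in focus)
    if condition == "always":
        # the listed tags appear at every position of the focus
        return all(required <= word_tags for word_tags in focus)
    raise ValueError(condition)

# "(se) svou sestrou": pronoun and noun, both singular instrumental (tags c7, nS)
right_context = [{"k3", "c7", "nS"}, {"k1", "c7", "nS"}, {"kZ"}]
print(holds(right_context, 2, "always", ["c7", "nS"]))   # True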
4 Learning Disambiguation Rules with Progol
We will demonstrate our method on the disambiguation of the word form "se". It may have either the lemma "s" (a preposition like "with" in English) or the lemma "sebe" (the reflexive pronoun "self"). For the generation of the learning sets we use the part of the DESAM corpus which was manually disambiguated (about 250 000 word positions). The left and right contexts have been set to 5 words. Untagged words in a context have been tagged as 'unknown part-of-speech' (tag kZ). Negative examples have been built from sentences where the word has the other lemma. Using P-Progol [10] version 2.2 we have learned rules for both of the two lemmata. This means that for each task we obtained two rule sets that should be complementary; however, we have found it useful to use both of them. The number of sentences was 232 (preposition) and 2935 (pronoun). 80% of the examples were used for learning, and we tested each rule set on the rest of the data. The learning time reached 14 hours, which is caused by the enormous number of 4536 literals that may appear in a rule body. It must be mentioned that the default accuracy, i.e. assigning the reflexive pronoun lemma to each occurrence of "se", is 92.7%. The rule accuracies reached, 92.84% (pronoun) and 94.48% (preposition), are therefore not too impressive. In the next section we will show that even such "poor" rule sets are usable for lemma disambiguation.
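As an illustration of the example-generation step just described, a sketch is given below. It is not the authors' code: only the context width of 5 and the tag kZ for untagged positions are taken from the text; the data representation and the function name are assumptions.

def make_examples(sentences, target="se", window=5, unknown_tag="kZ"):
    # Build (left context, right context, lemma) triples for `target`.
    # `sentences` is assumed to be a list of sentences, each a list of
    # (word_form, tag, lemma) triples from the manually disambiguated part
    # of the corpus; untagged positions carry None as their tag and are
    # mapped to the 'unknown part-of-speech' tag kZ.
    examples = []
    for sentence in sentences:
        for i, (form, _, lemma) in enumerate(sentence):
            if form.lower() != target:
                continue
            tags = [t if t is not None else unknown_tag for _, t, _ in sentence]
            left = list(reversed(tags[max(0, i - window):i]))   # left context in reverse order
            right = tags[i + 1:i + 1 + window]
            examples.append((left, right, lemma))
    return examples

# Occurrences with lemma "s" serve as positive examples when learning the
# preposition rule set and as negative examples for the pronoun rule set,
# and vice versa.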
5 Disambiguation
The goal then was to find a criterion that would allow the correct lemma of the word "se" to be determined. The learning and testing sets contained sentences not used for learning the disambiguation rules. We limited both the left and the right context to a length of 3 words. We then removed all sentences that contained commas, dots, parentheses etc. 50% of the sentences were used for the estimation of parameters, the rest for testing. All possible grammatical categories were found for each sentence employing the LEMMA morphological analyser¹. Then all variations of categories were generated for each sentence. Both theories learned by Progol were run on those data, so that for each sentence we had two success rates, i.e. the relative number of correctly covered positive examples and correctly uncovered negative examples to the number of all examples. The time needed for the disambiguation of one sentence was 6 seconds on average; very rarely was it more than 10 seconds. If the disambiguation lasted more than 30 seconds (because of an enormous number of variations of tags), the process was killed. This concerned less than 2% of the cases. The two success rates obtained for a sentence serve as (x,y)-coordinates. A new example is then classified into the class (lemma) of the nearest neighbour(s) in the learning set. We computed the distance between two instances (x1, y1) and (x2, y2) as the Euclidean distance. As mentioned above, 50% of the new sentences were used for building the set of instances and for parameter estimation. On this learning set we tried values of k (the number of neighbours) in the range 1..10. It was observed that increasing the value of k did not increase the accuracy of disambiguation; therefore k was set to 1 for all experiments below. Then we found the nearest point (xi, yi). Let s1, s2 be the number of instances with the lemma "s" and the number of instances with the lemma "sebe" for this point. If si is greater than sj, we would expect the i-th lemma to be the right one. We also observed that if a success rate for a word in a particular context is smaller than a threshold, the word cannot be disambiguated. Thus the correct lemma was assigned using the rules in Fig. 1.

lemma := if s1 > s2 ∧ successRate_lemma1 > t_lemma1 then lemma1
         else if s1 < s2 ∧ successRate_lemma2 > t_lemma2 then lemma2
         else unresolved

Fig. 1. kNN algorithm

The values of (t_lemma1, t_lemma2) were tested in the range (0,0)..(1,1). The best setting of the thresholds on the learning set was t_lemma1 = 0, t_lemma2 = 0.8. The results of the disambiguation are given in Table 1.
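A minimal sketch of the decision procedure of Fig. 1 combined with the 1-NN search might look as follows. This is our reconstruction, not the authors' implementation; the data representation, names and the toy instances are illustrative, and the mapping of the two success rates to the preposition and pronoun rule sets is an assumption.

import math

def classify_se(success_rates, instances, thresholds=(0.0, 0.8)):
    # success_rates -- (x, y): success rates of the preposition and pronoun
    #                  rule sets on the sentence being disambiguated
    # instances     -- list of ((x, y), (s1, s2)) pairs from the learning set,
    #                  where s1, s2 count instances with lemma "s" and "sebe"
    # thresholds    -- (t_lemma1, t_lemma2); the best values found on the
    #                  learning set were (0, 0.8)
    x, y = success_rates
    # nearest neighbour (k = 1) under Euclidean distance
    _, (s1, s2) = min(instances, key=lambda inst: math.dist((x, y), inst[0]))
    t1, t2 = thresholds
    if s1 > s2 and x > t1:
        return "s"        # preposition
    if s1 < s2 and y > t2:
        return "sebe"     # reflexive pronoun
    return None           # unresolved

instances = [((0.95, 0.40), (3, 1)), ((0.30, 0.97), (0, 5))]
print(classify_se((0.91, 0.45), instances))   # -> "s"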
¹ Copyright Lingea Brno 1995
                         disambiguation                    unresolved
               #ex   correct   wrong   accuracy(%)       #       %
preposition
  learn         99      80       4        97.5           17     17.2
  test         112      93       7        93.0           14     12.5
pronoun
  learn        297     214       2        99.1           82     27.6
  test         310     236       6        97.5           44     14.2

Table 1. Results of kNN algorithm
6 Conclusion
The presented results are, as far as we know, the first obtained by ILP techniques in the disambiguation of inflective languages. It must be stressed that the Czech corpus is under development and therefore contains about 17% untagged words as well as incorrectly tagged words. Moreover, there were no usable formal grammar rules for Czech that would make the building of domain knowledge easier. We described a systematic way of building domain knowledge when no sufficient linguistic knowledge is available. A new method for lemma disambiguation was introduced that reached an accuracy of 93%, leaving a small part of the words ambiguous. Similar accuracy was obtained for the Prague Tree Bank corpus [13]. The lemma disambiguation task is not completely solved here. The main reason is that the Czech corpora are still too small and therefore the cardinality of the learning sets is not sufficient for most of the tasks. Our approach was also used for the disambiguation of unknown words (those not present in the corpus). We defined similarity classes for lemma-ambiguous words in terms of grammatical categories; first results can be found in [12]. Results obtained by ILP for tag disambiguation can be found in [13]. So far, statistical techniques (accuracy 81.64%) and neural nets (75.47%) have been applied to DESAM [11]. See also [5,6,14] for other results with another Czech corpus. It should be pointed out that our results are not directly comparable as we focus only on lemma disambiguation. In the past, ILP has been applied to inflective languages in the field of morphology. LAI Ljubljana [2] applied ILP to generating the lemma from the oblique form of nouns as well as to generating the correct oblique form from the lemma, with an average accuracy of 91.5%. Learning nominal inflections for Czech and Slovene (among others) is described in [7]. James Cussens [1] developed a POS tagger for English that achieved a per-word accuracy of 96.4%. Martin Eineborg and Nikolaj Lindberg [3,4] induced constraint-grammar-like disambiguation rules for Swedish with an accuracy of 98%. Our approach differs significantly in two points. We do not exploit any information on particular words as in [3]; such knowledge would improve accuracy significantly. Nor do we use any hand-coded grammatical domain knowledge as in [1].
Our method, although developed for the Czech language, is actually language-independent except for the set of tags. This means that it is possible to use our approach for other languages as well.

Acknowledgements. This paper is a brief version of [13]. We thank the anonymous referees of the LLL workshop for their comments. We would like to thank Karel Pala and Olga Štěpánková for their help with earlier versions of this paper. We also thank Tomáš Ptáčník, Pavel Rychlý, Radek Sedláček and Robert Král for fruitful discussions and assistance. This work has been partially supported by the VS97028 grant of the Ministry of Education of the Czech Republic "Natural Language Processing Laboratory" and the ESPRIT ILP2 Project.
References

1. Cussens J.: Part-of-Speech Tagging using Progol. In Proc. of ILP'97, LNAI 1297, Springer-Verlag 1997.
2. Džeroski S., Erjavec T.: Induction of Slovene Nominal Paradigms. In Proc. of ILP'97, LNAI 1297, Springer-Verlag 1997.
3. Eineborg M., Lindberg N.: Induction of Constraint Grammar rules using Progol. In Proc. of ILP'98, 1998.
4. Lindberg N., Eineborg M.: Learning Constraint Grammar-style disambiguation rules using Inductive Logic Programming. In: COLING/ACL'98.
5. Hajič J., Hladká B.: Probabilistic and rule-based tagger of an inflective language – a comparison. In Proceedings of the 5th Conf. on Applied Natural Language Processing, 111-118, Washington D.C., 1997.
6. Hajič J., Hladká B.: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of EACL 1998.
7. Manandhar S., Džeroski S., Erjavec T.: Learning multilingual morphology with CLOG. In Proc. of ILP'98, 1998.
8. Mitchell, T.M.: Machine Learning. McGraw Hill, New York, 1997.
9. Muggleton S., De Raedt L.: Inductive Logic Programming: Theory and Methods. J. Logic Programming 1994:19,20:629-679.
10. Muggleton S.: Inverse Entailment and Progol. New Generation Computing Journal, 13:245-286, 1995.
11. Pala K., Rychlý P., Smrž P.: DESAM – annotated corpus for Czech. In Plášil F., Jeffery K.G. (eds.): Proceedings of SOFSEM'97, Milovy, Czech Republic. LNCS 1338, Springer-Verlag 1997.
12. Pavelek T., Popelínský L.: Towards lemma disambiguation: Similarity classes (submitted)
13. Popelínský L., Pavelek T., Ptáčník T.: Towards disambiguation in Czech corpora. Workshop Notes of the "Learning Language in Logic" (LLL) ICML'99 Workshop, Bled, Slovenia, 1999.
14. Zavrel J., Daelemans W.: Recent Advances in Memory-Based Part-of-Speech Tagging. ILK/Computational Linguistics, Tilburg University, 1998.
Adding Temporal Semantics to Association Rules
Chris P. Rainsford¹ and John F. Roddick²
¹ Defence Science and Technology Organisation, DSTO C3 Research Centre, Fernhill Park, Canberra, 2600, Australia. [email protected]
² Advanced Computing Research Centre, School of Computer and Information Science, University of South Australia, The Levels, Adelaide, 5095, Australia.
[email protected]
Abstract. The development of systems for knowledge discovery in databases, including the use of association rules, has become a major research issue in recent years. Although initially motivated by the desire to analyse large retail transaction databases, the general utility of association rules makes them applicable to a wide range of different learning tasks. However, association rules do not accommodate the temporal relationships that may be intrinsically important within some application domains. In this paper, we present an extension to association rules to accommodate temporal semantics. By finding associated items first and then looking for temporal relationships between them, it is possible to incorporate potentially valuable temporal semantics. Our approach to temporal reasoning accommodates both point-based and interval-based models of time simultaneously. In addition, the use of a generalised taxonomy of temporal relationships supports the generalisation of temporal relationships and their specification at different levels of abstraction. This approach also facilitates the possibility of reasoning with incomplete or missing information.
1 Introduction
Association rules have been widely investigated within the field of knowledge discovery, q.v. ([1],[2],[7],[8],[9],[12],[14]). In this paper we present an extension to association rules that allows them to exploit the semantics associated with temporal data, and particularly temporal interval data. Some work on the discovery of common sequences of events has been conducted, q.v. ([3],[10],[15]). However, these algorithms are aimed at finding commonly occurring sequences rather than associations. Moreover, the algorithms only accommodate point-based events, and this restricts both the potential semantics of the knowledge that may be discovered and the data that can be learnt from. Other investigations have examined the discovery of association rules from temporal data, such that each discovered rule is weakened by a temporal dependency, q.v. [5]. Özden et al. extend this to cyclic domains to describe associations that are strong during particular parts of a specified cycle [11]. This may be used to describe the behaviour of rules that only hold true in summer or winter or during some other part of a given cycle.
Conventional association rules do not accommodate the temporal relationships that may be intrinsically important in some application domains. Importantly, each basket of items is treated individually, with no record of the associated customer or client who purchased these goods. However, where client histories exist, temporal patterns may be associated with their purchasing behaviour over time. It would therefore be useful to provide organisational decision-makers with this temporal information. Existing association rule algorithms do not support such temporal semantics. This paper addresses this issue by presenting an extension to association rules that accommodates both point-based and interval-based models of time simultaneously. In addition, the use of a generalised taxonomy of relationships supports the generalisation of temporal relationships and their specification at different levels of abstraction. This approach also facilitates the possibility of reasoning with incomplete or missing information. This flexibility makes the proposed approach applicable to a wide array of application domains. Although initially motivated by the desire to analyse large retail transaction databases, the general utility of association rules makes them applicable to a wide range of different learning tasks. For consistency, the entities with which a history of transactions is associated are described here as clients. Clients may be associated with non-temporal properties such as sex, with specific events such as item purchases, or with attributes manifested over intervals such as bank balances, outstanding debts or insurance classifications. These properties are described as items, and the set of items associated with a client is said to be their basket of items. Association rules may be able to tell us that Investment_X is associated with Insurance_Y. Temporal associations may then tell us that Investment_X usually occurs after the start of Insurance_Y. This may indicate that customers start with an insurance policy and that this becomes a gateway for other services; a campaign marketing Investment_X to holders of Insurance_Y may then be suggested. In the next section temporal association rules are formally defined. Section 3 then discusses the temporal logic that underlies the proposed approach to learning temporal association rules. An overview of the learning algorithm is provided in Section 4. In Section 5 a summary of this paper and a discussion of future research are provided.
2 Temporal Association Rules
A temporal association rule can be considered a conventional association rule that includes a conjunction of one or more temporal relationships between items in the antecedent or consequent. Building upon the original formalism in [1], temporal association rules can be defined as follows: Let I = I1, I2, ..., Im be a set of binary attributes or items and T be a database of tuples. Association rules were first proposed for use within transaction databases, where each transaction t is recorded with a corresponding tuple. Hence attributes represented items and were limited to a binary domain where t(k) = 1 indicated that the item Ik had been purchased as part of the transaction, and t(k) = 0 indicated that it had not. However, in a more general context t may be any tuple with binary domain attributes, which need not represent a transaction but may simply represent the presence of some attribute value or range of
values. Temporal attributes are defined as attributes with associated temporal points or intervals that record the time for which the item or attribute was valid in the modelled domain. Let X be a set of some attributes in I. A transaction t is said to satisfy X if for all attributes Ik in X, t(k) = 1. Consider a conjunction of binary temporal predicates P1 ∧ P2 ∧ ... ∧ Pn defined on attributes contained in either X or Y, where n ≥ 0. Then by a temporal association rule we mean an implication of the form X ⇒ Y ∧ P1 ∧ P2 ∧ ... ∧ Pn, where X, the antecedent, is a set of attributes in I and Y, the consequent, is a set of attributes in I that is not present in X. The rule X ⇒ Y ∧ P1 ∧ P2 ∧ ... ∧ Pn is satisfied in the set of transactions T with the confidence factor 0 ≤ c ≤ 1 iff at least c% of the transactions in T that satisfy X also satisfy Y. Likewise, each predicate Pi is satisfied with a temporal confidence factor 0 ≤ tc_Pi ≤ 1 iff at least tc% of the transactions in T that satisfy X and Y also satisfy Pi. The notation X ⇒ Y |c ∧ P1|tc ∧ P2|tc ∧ ... ∧ Pn|tc is adopted to specify that the rule X ⇒ Y ∧ P1 ∧ P2 ∧ ... ∧ Pn has a confidence factor of c and temporal confidence factors of tc. As an illustration, consider the following simple example rule:

policyC ⇒ investA, productB | 0.87 ∧ during(investA, policyC) | 0.79 ∧ before(productB, investA) | 0.91

This rule can be read as follows: The purchase of investment A and product B are associated with insurance policy C with a confidence factor of 0.87. The investment in A occurs during the period of policy C with a temporal confidence factor of 0.79, and the purchase of product B occurs before investment A with a temporal confidence factor of 0.91. Binary temporal predicates are defined using Allen's thirteen interval-based relations and Freksa's neighbourhood relations, and these will be discussed in the next section.
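Purely as an illustration of how the two kinds of factors relate, the confidence and temporal confidence of a candidate rule can be computed from client baskets as sketched below. This is not the authors' discovery algorithm: the data representation is assumed, the item names follow the example rule, and the predicate evaluation is reduced to a single hard-coded during check.

def rule_factors(baskets, antecedent, consequent, predicates):
    # baskets    -- list of client baskets; each basket maps item -> (start, end) interval
    # antecedent -- set of items X
    # consequent -- set of items Y
    # predicates -- list of functions basket -> bool, one per temporal predicate P_i
    has_x = [b for b in baskets if antecedent <= b.keys()]
    has_xy = [b for b in has_x if consequent <= b.keys()]
    confidence = len(has_xy) / len(has_x) if has_x else 0.0
    temporal_confidence = [
        (sum(1 for b in has_xy if pred(b)) / len(has_xy)) if has_xy else 0.0
        for pred in predicates
    ]
    return confidence, temporal_confidence

# policyC => investA, productB  with  during(investA, policyC)
def during(b):
    (a0, a1), (c0, c1) = b["investA"], b["policyC"]
    return c0 < a0 and a1 < c1   # investA strictly inside policyC

baskets = [
    {"policyC": (0, 10), "investA": (2, 4), "productB": (1, 1)},
    {"policyC": (0, 10), "investA": (11, 12), "productB": (3, 3)},
    {"policyC": (0, 10)},
]
print(rule_factors(baskets, {"policyC"}, {"investA", "productB"}, [during]))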
3 Temporal Logic
The expressiveness of temporal association rules is determined by the set of temporal predicates available to describe relationships between items. For our work, Allen’s taxonomy of temporal relationships is adopted to describe the basic relationships between intervals [4]. These relationships become the basis for binary temporal predicates. Using these relations we are able to treat points as a special case of intervals where begin and end points are equal. To add extra expressive capability, Freksa’s generalised relationships have also been adopted [6]. Freksa’s neighbourhood relations generalise over Allen’s relations and this allows the proposed algorithm to describe temporal relationships at multiple levels. Therefore several commonly occurring relationships can be summarised into single strong relationships. Both of these taxonomies are depicted in Figure 1.
[Figure 1 depicts Allen's interval relationships and Freksa's neighbourhood relationships; the legend gives each relationship's name and label, e.g. X older than Y (ol), X succeeds Y (sd), X surviving contemporary of Y (sc), X contemporary of Y (ct), X survived by and contemporary of Y (bc), X older contemporary of Y (oc), X died after birth of Y (db), X younger and survives Y (ys).]
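As an illustration of how the basic relation between two intervals (with points as degenerate intervals with equal endpoints) can be determined, a minimal sketch is given below. The relation names follow Allen's taxonomy; the mapping onto Freksa's neighbourhood relations is omitted, and the code is ours rather than part of the paper.

def allen_relation(x, y):
    # Return Allen's basic relation holding between intervals x and y.
    # Intervals are (begin, end) pairs with begin <= end; points are
    # represented as intervals with begin == end.
    (xb, xe), (yb, ye) = x, y
    if xe < yb:  return "before"
    if ye < xb:  return "after"
    if xe == yb and xb < xe and yb < ye:  return "meets"
    if ye == xb and yb < ye and xb < xe:  return "met-by"
    if xb == yb and xe == ye:             return "equals"
    if xb == yb:  return "starts" if xe < ye else "started-by"
    if xe == ye:  return "finishes" if xb > yb else "finished-by"
    if yb < xb and xe < ye:  return "during"
    if xb < yb and ye < xe:  return "contains"
    if xb < yb:  return "overlaps"
    return "overlapped-by"

print(allen_relation((2, 4), (0, 10)))   # during
print(allen_relation((0, 3), (3, 8)))    # meets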