Lecture Notes in Artificial Intelligence Edited by J. G. Carbonell and J. Siekmann
Subseries of Lecture Notes in Computer Science
3571
Lluís Godo (Ed.)
Symbolic and Quantitative Approaches to Reasoning with Uncertainty 8th European Conference, ECSQARU 2005 Barcelona, Spain, July 6-8, 2005 Proceedings
Series Editors Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA Jörg Siekmann, University of Saarland, Saarbrücken, Germany Volume Editor Lluís Godo Institut d’Investigació en Intel.ligència Artificial (IIIA) Consejo Superior de Investigaciones Científicas (CSIC) Campus UAB s/n, 08193 Bellaterra, Spain E-mail:
[email protected] Library of Congress Control Number: 2005928377
CR Subject Classification (1998): I.2, F.4.1
ISSN 0302-9743
ISBN-10 3-540-27326-3 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-27326-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springeronline.com © Springer-Verlag Berlin Heidelberg 2005 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 11518655 06/3142 543210
Preface
These are the proceedings of the 8th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, ECSQARU 2005, held in Barcelona (Spain), July 6–8, 2005. The ECSQARU conferences are biennial and have become a major forum for advances in the theory and practice of reasoning under uncertainty. The first ECSQARU conference was held in Marseille (1991), followed by Granada (1993), Fribourg (1995), Bonn (1997), London (1999), Toulouse (2001) and Aalborg (2003). The papers gathered in this volume were selected out of 130 submissions, after a strict review process by the members of the Program Committee, to be presented at ECSQARU 2005. In addition, the conference included invited lectures by three outstanding researchers in the area: Serafín Moral (Imprecise Probabilities), Rudolf Kruse (Graphical Models in Planning) and Jérôme Lang (Social Choice). Moreover, the application of uncertainty models to real-world problems was addressed at ECSQARU 2005 by a special session devoted to successful industrial applications, organized by Rudolf Kruse. Both the invited lectures and the papers of the special session contribute to this volume. On the whole, the programme of the conference provided a broad, rich and up-to-date perspective of the current high-level research in the area, which is reflected in the contents of this volume. I would like to warmly thank the members of the Program Committee and the additional referees for their valuable work, as well as the invited speakers and the invited session organizer. I also want to express my gratitude to all of my colleagues and friends on the Executive Committee for their excellent work and unconditional support, dedicating a lot of their precious time and energy to making this conference successful. Finally, the sponsoring institutions are also gratefully acknowledged for their support.
May 2005
Lluís Godo
Organization
ECSQARU 2005 was organized by the Artificial Intelligence Research Institute (IIIA), belonging to the Spanish Scientific Research Council (CSIC).
Executive Committee

Conference Chair
Lluís Godo (IIIA, Spain)
Organizing Committee
Teresa Alsinet (University of Lleida, Spain)
Carlos Chesñevar (University of Lleida, Spain)
Francesc Esteva (IIIA, Spain)
Josep Puyol-Gruart (IIIA, Spain)
Sandra Sandri (IIIA, Spain)
Technical Support
Francisco Cruz (IIIA, Spain)
Program Committee Teresa Alsinet (Spain) John Bell (UK) Isabelle Bloch (France) Salem Benferhat (France) Philippe Besnard (France) Gerd Brewka (Germany) Luis M. de Campos (Spain) Claudette Cayrol (France) Carlos Ches˜ nevar (Spain) Agata Ciabattoni (Austria) Giulianella Coletti (Italy) Fabio Cozman (Brazil) Adnan Darwiche (USA) James P. Delgrande (Canada) Thierry Denœux (France) Javier Diez (Spain) Marek Druzdzel (USA) Didier Dubois (France) Francesc Esteva (Spain) H´el`ene Fargier (France) Linda van der Gaag (Netherlands)
Hector Geffner (Spain) Angelo Gilio (Italy) Michel Grabisch (France) Petr H´ajek (Czech Republic) Andreas Herzig (France) Eyke Huellermeier (Germany) Anthony Hunter (UK) Manfred Jaeger (Denmark) Gabriele Kern-Isberner (Germany) J¨ urg Kohlas (Switzerland) Ivan Kramosil (Czech Republic) Rudolf Kruse (Germany) J´erˆome Lang (France) Jonathan Lawry (UK) Daniel Lehmann (Israel) Pedro Larra˜ naga (Spain) Churn-Jung Liau (Taiwan) Weiru Liu (UK) Thomas Lukasiewicz (Italy) Pierre Marquis (France) Khaled Mellouli (Tunisia)
Seraf´ın Moral (Spain) Thomas Nielsen (Denmark) Kristian Olesen (Denmark) Ewa Orlowska (Poland) Odile Papini (France) Simon Parsons (USA) Lu´ıs Moniz Pereira (Portugal) Ramon Pino-P´erez (Venezuela) David Poole (Canada) Josep Puyol-Gruart (Spain) Henri Prade (France) Maria Rifqi (France) Alessandro Saffiotti (Sweden) Sandra Sandri (Spain)
Ken Satoh (Japan) Torsten Schaub (Germany) Romano Scozzafava (Italy) Prakash P. Shenoy (USA) Guillermo Simari (Argentina) Philippe Smets (Belgium) Claudio Sossai (Italy) Milan Studen´ y (Czech Republic) Leon van der Torre (Netherlands) Enric Trillas (Spain) Emil Weydert (Luxembourg) Mary-Anne Williams (Australia) Nevin L. Zhang (Hong Kong, China)
Additional Referees
David Allen, Fabrizio Angiulli, Cecilio Angulo, Nahla Ben Amor, Guido Boella, Jesús Cerquides, Mark Chavira, Gaetano Chemello, Petr Cintula, Francisco A.F.T. da Silva, Christian Döring, Zied Elouedi, Enrique Herrera-Viedma, Thanh Ha Dang, Jinbo Huang, Joris Hulstijn, Germano S. Kienbaum, Beata Konikowska, Vítor H. Nascimento, Giovanni Panti, Witold Pedrycz, André Ponce de Leon, Guilin Qi, Jordi Recasens, Rita Rodrigues, Ikuo Tahara, Vicenç Torra, Suzuki Yoshitaka

Sponsoring Institutions
Artificial Intelligence Research Institute (IIIA)
Spanish Scientific Research Council (CSIC)
Generalitat de Catalunya, AGAUR
Ministerio de Educación y Ciencia
MusicStrands, Inc.
Table of Contents
Invited Papers Imprecise Probability in Graphical Models: Achievements and Challenges Seraf´ın Moral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
Knowledge-Based Operations for Graphical Models in Planning J¨ org Gebhardt, Rudolf Kruse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
Some Representation and Computational Issues in Social Choice J´erˆ ome Lang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
Bayesian Networks Nonlinear Deterministic Relationships in Bayesian Networks Barry R. Cobb, Prakash P. Shenoy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
Penniless Propagation with Mixtures of Truncated Exponentials Rafael Rum´ı, Antonio Salmer´ on . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
Approximate Factorisation of Probability Trees Irene Mart´ınez, Seraf´ın Moral, Carmelo Rodr´ıguez, Antonio Salmer´ on . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
Abductive Inference in Bayesian Networks: Finding a Partition of the Explanation Space M. Julia Flores, Jos´e A. G´ amez, Seraf´ın Moral . . . . . . . . . . . . . . . . . . . .
63
Alert Systems for Production Plants: A Methodology Based on Conflict Analysis Thomas D. Nielsen, Finn V. Jensen . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
76
Hydrologic Models for Emergency Decision Support Using Bayesian Networks Martin Molina, Raquel Fuentetaja, Luis Garrote . . . . . . . . . . . . . . . . . . .
88
Graphical Models Probabilistic Graphical Models for the Diagnosis of Analog Electrical Circuits Christian Borgelt, Rudolf Kruse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Qualified Probabilistic Predictions Using Graphical Models Zhiyuan Luo, Alex Gammerman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 A Decision-Based Approach for Recommending in Hierarchical Domains Luis M. de Campos, Juan M. Fern´ andez-Luna, Manuel G´ omez, Juan F. Huete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Learning Causal Networks Scalable, Efficient and Correct Learning of Markov Boundaries Under the Faithfulness Assumption Jose M. Pe˜ na, Johan Bj¨ orkegren, Jesper Tegn´er . . . . . . . . . . . . . . . . . . . 136 Discriminative Learning of Bayesian Network Classifiers via the TM Algorithm Guzm´ an Santaf´e, Jose A. Lozano, Pedro Larra˜ naga . . . . . . . . . . . . . . . . 148 Constrained Score+(Local)Search Methods for Learning Bayesian Networks Jos´e A. G´ amez, J. Miguel Puerta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 On the Use of Restrictions for Learning Bayesian Networks Luis M. de Campos, Javier G. Castellano . . . . . . . . . . . . . . . . . . . . . . . . . 174 Foundation for the New Algorithm Learning Pseudo-Independent Models Jae-Hyuck Lee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Planning Optimal Threshold Policies for Operation of a Dedicated-Platform with Imperfect State Information - A POMDP Framework Arsalan Farrokh, Vikram Krishnamurthy . . . . . . . . . . . . . . . . . . . . . . . . . 198 APPSSAT: Approximate Probabilistic Planning Using Stochastic Satisfiability Stephen M. Majercik . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Causality and Independence Racing for Conditional Independence Inference Remco R. Bouckaert, Milan Studen´y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 Causality, Simpson’s Paradox, and Context-Specific Independence Manon J. Sanscartier, Eric Neufeld . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 A Qualitative Characterisation of Causal Independence Models Using Boolean Polynomials Marcel van Gerven, Peter Lucas, Theo van der Weide . . . . . . . . . . . . . . 244
Preference Modelling and Decision On the Notion of Dominance of Fuzzy Choice Functions and Its Application in Multicriteria Decision Making Irina Georgescu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257 An Argumentation-Based Approach to Multiple Criteria Decision Leila Amgoud, Jean-Francois Bonnefon, Henri Prade . . . . . . . . . . . . . . . 269 Algorithms for a Nonmonotonic Logic of Preferences Souhila Kaci, Leendert van der Torre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 Expressing Preferences from Generic Rules and Examples – A Possibilistic Approach Without Aggregation Function Didier Dubois, Souhila Kaci, Henri Prade . . . . . . . . . . . . . . . . . . . . . . . . . 293 On the Qualitative Comparison of Sets of Positive and Negative Affects Didier Dubois, H´el`ene Fargier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Argumentation Systems Symmetric Argumentation Frameworks Sylvie Coste-Marquis, Caroline Devred, Pierre Marquis . . . . . . . . . . . . . 317 Evaluating Argumentation Semantics with Respect to Skepticism Adequacy Pietro Baroni, Massimiliano Giacomin . . . . . . . . . . . . . . . . . . . . . . . . . . . 329 Logic of Dementia Guidelines in a Probabilistic Argumentation Framework Helena Lindgren, Patrik Eklund . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
Argument-Based Expansion Operators in Possibilistic Defeasible Logic Programming: Characterization and Logical Properties Carlos I. Ches˜ nevar, Guillermo R. Simari, Lluis Godo, Teresa Alsinet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353 Gradual Valuation for Bipolar Argumentation Frameworks Claudette Cayrol, Marie Christine Lagasquie-Schiex . . . . . . . . . . . . . . . . 366 On the Acceptability of Arguments in Bipolar Argumentation Frameworks Claudette Cayrol, Marie Christine Lagasquie-Schiex . . . . . . . . . . . . . . . . 378
Inconsistency Handling A Modal Logic for Reasoning with Contradictory Beliefs Which Takes into Account the Number and the Reliability of the Sources Laurence Cholvy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390 A Possibilistic Inconsistency Handling in Answer Set Programming Pascal Nicolas, Laurent Garcia, Igor St´ephan . . . . . . . . . . . . . . . . . . . . . . 402 Measuring the Quality of Uncertain Information Using Possibilistic Logic Anthony Hunter, Weiru Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415 Remedying Inconsistent Sets of Premises Philippe Besnard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427 Measuring Inconsistency in Requirements Specifications Kedian Mu, Zhi Jin, Ruqian Lu, Weiru Liu . . . . . . . . . . . . . . . . . . . . . . . 440
Belief Revision and Merging Belief Revision of GIS Systems: The Results of REV!GIS Salem Benferhat, Jonathan Bennaim, Robert Jeansoulin, Mahat Khelfallah, Sylvain Lagrue, Odile Papini, Nic Wilson, Eric W¨ urbel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452 Multiple Semi-revision in Possibilistic Logic Guilin Qi, Weiru Liu, David A. Bell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465 A Local Fusion Method of Temporal Information Mahat Khelfallah, Bela¨ıd Benhamou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
Mediation Using m-States Thomas Meyer, Pilar Pozos Parra, Laurent Perrussel . . . . . . . . . . . . . . 489 Combining Multiple Knowledge Bases by Negotiation: A Possibilistic Approach Guilin Qi, Weiru Liu, David A. Bell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501 Conciliation and Consensus in Iterated Belief Merging Olivier Gauwin, S´ebastien Konieczny, Pierre Marquis . . . . . . . . . . . . . . . 514 An Argumentation Framework for Merging Conflicting Knowledge Bases: The Prioritized Case Leila Amgoud, Souhila Kaci . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Belief Functions Probabilistic Transformations of Belief Functions Milan Daniel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539 Contextual Discounting of Belief Functions David Mercier, Benjamin Quost, Thierry Denœux . . . . . . . . . . . . . . . . . 552
Fuzzy Models Bilattice-Based Squares and Triangles Ofer Arieli, Chris Cornelis, Glad Deschrijver, Etienne Kerre . . . . . . . . 563 A New Algorithm to Compute Low T-Transitive Approximation of a Fuzzy Relation Preserving Symmetry. Comparisons with the T-Transitive Closure Luis Garmendia, Adela Salvador . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576 Computing a Transitive Opening of a Reflexive and Symmetric Fuzzy Relation Luis Garmendia, Adela Salvador . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587 Generating Fuzzy Models from Deep Knowledge: Robustness and Interpretability Issues Raffaella Guglielmann, Liliana Ironi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600 Analysis of the TaSe-II TSK-Type Fuzzy System for Function Approximation Luis Javier Herrera, H´ector Pomares, Ignacio Rojas, Alberto Guill´en, Mohammed Awad, Olga Valenzuela . . . . . . . . . . . . . . . . 613
Many-Valued Logical Systems Non-deterministic Semantics for Paraconsistent C-Systems Arnon Avron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625 Multi-valued Model Checking in Dense-Time Ana Fern´ andez Vilas, Jos´e J. Pazos Arias, A. Bel´en Barrag´ ans Mart´ınez, Mart´ın L´ opez Nores, Rebeca P. D´ıaz Redondo, Alberto Gil Solla, Jorge Garc´ıa Duque, Manuel Ramos Cabrer . . . . . . . . . . . . . . . . . . . . . . . 638 Brun Normal Forms for Co-atomic L ukasiewicz Logics Stefano Aguzzoli, Ottavio M. D’Antona, Vincenzo Marra . . . . . . . . . . . 650 Poset Representation for G¨ odel and Nilpotent Minimum Logics Stefano Aguzzoli, Brunella Gerla, Corrado Manara . . . . . . . . . . . . . . . . . 662
Uncertainty Logics Possibilistic Inductive Logic Programming Mathieu Serrurier, Henri Prade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675 Query Answering in Normal Logic Programs Under Uncertainty Umberto Straccia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687 A Logical Treatment of Possibilistic Conditioning Enrico Marchioni . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701 A Zero-Layer Based Fuzzy Probabilistic Logic for Conditional Probability Tommaso Flaminio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714 A Logic with Coherent Conditional Probabilities Nebojˇsa Ikodinovi´c, Zoran Ognjanovi´c . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726 Probabilistic Description Logic Programs Thomas Lukasiewicz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
Probabilistic Reasoning Coherent Restrictions of Vague Conditional Lower-Upper Probability Extensions Andrea Capotorti, Maroussa Zagoraiou . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
Type Uncertainty in Ontologically-Grounded Qualitative Probabilistic Matching David Poole, Clinton Smyth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763 Some Theoretical Properties of Conditional Probability Assessments Veronica Biazzo, Angelo Gilio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775 Unifying Logical and Probabilistic Reasoning Rolf Haenni . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
Reasoning Models Under Uncertainty Possibility Theory for Reasoning About Uncertain Soft Constraints Maria Silvia Pini, Francesca Rossi, Brent Venable . . . . . . . . . . . . . . . . . 800 About the Processing of Possibilistic and Probabilistic Queries Patrick Bosc, Olivier Pivert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 812 Conditional Deduction Under Uncertainty Audun Jøsang, Simon Pope, Milan Daniel . . . . . . . . . . . . . . . . . . . . . . . . 824 Heterogeneous Spatial Reasoning Haibin Sun, Wenhui Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
Uncertainty Measures A Notion of Comparative Probabilistic Entropy Based on the Possibilistic Specificity Ordering Didier Dubois, Eyke H¨ ullermeier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 848 Consonant Random Sets: Structure and Properties Enrique Miranda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 860 Comparative Conditional Possibilities Giulianella Coletti, Barbara Vantaggi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872 Second-Level Possibilistic Measures Induced by Random Variables Ivan Kramosil . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
Probabilistic Classifiers Hybrid Bayesian Estimation Trees Based on Label Semantics Zengchang Qin, Jonathan Lawry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 896
Selective Gaussian Na¨ıve Bayes Model for Diffuse Large-B-Cell Lymphoma Classification: Some Improvements in Preprocessing and Variable Elimination Andr´es Cano, Javier G. Castellano, Andr´es R. Masegosa, Seraf´ın Moral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 908 Towards a Definition of Evaluation Criteria for Probabilistic Classifiers Nahla Ben Amor, Salem Benferhat, Zied Elouedi . . . . . . . . . . . . . . . . . . 921 Methods to Determine the Branching Attribute in Bayesian Multinets Classifiers Andr´es Cano, Javier G. Castellano, Andr´es R. Masegosa, Seraf´ın Moral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 932
Classification and Clustering Qualitative Inference in Possibilistic Option Decision Trees Ilyes Jenhani, Zied Elouedi, Nahla Ben Amor, Khaled Mellouli . . . . . . 944 Partially Supervised Learning by a Credal EM Approach Patrick Vannoorenberghe, Philippe Smets . . . . . . . . . . . . . . . . . . . . . . . . . 956 Default Clustering from Sparse Data Sets Julien Velcin, Jean-Gabriel Ganascia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 968 New Technique for Initialization of Centres in TSK Clustering-Based Fuzzy Systems Luis Javier Herrera, H´ector Pomares, Ignacio Rojas, Alberto Guill´en, Jes´ us Gonz´ alez . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 980
Industrial Applications Learning Methods for Air Traffic Management Frank Rehm, Frank Klawonn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 992 Molecular Fragment Mining for Drug Discovery Christian Borgelt, Michael R. Berthold, David E. Patterson . . . . . . . . . 1002 Automatic Selection of Data Analysis Methods Detlef D. Nauck, Martin Spott, Ben Azvine . . . . . . . . . . . . . . . . . . . . . . . 1014 Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1027
Imprecise Probability in Graphical Models: Achievements and Challenges (Extended Abstract)
Serafín Moral
Departamento de Ciencias de la Computación e I.A., Universidad de Granada, 18071 Granada, Spain
[email protected]

This talk will review the basic notions of imprecise probability following Walley's theory [1] and their application to graphical models, which have usually assumed precise Bayesian probabilities [2]. The first approaches to imprecision were robustness studies: analyses of the sensitivity of the outputs to variations of the network parameters [3, 4]. However, we will show that the role of imprecise probability in graphical models can be more important, providing alternative methodologies for learning and inference. One key problem of current methods for learning Bayesian networks from data is the following: with short samples obtained from a very simple model it is possible to learn complex models which are far from reality [5]. The main aim of the talk will be to show that with imprecise probability we can transform lack of information into indeterminacy, and thus the possibilities of obtaining unsupported outputs are much lower. The following points will be considered:
1. A review of imprecise probability concepts, showing the duality between the sets-of-probabilities and sets-of-desirable-gambles representations. Most of the present work in graphical models has been expressed in terms of sets of probabilities, but the desirable gambles representation is simpler in many situations [6]. This will be the first challenge we propose: to develop a methodology for graphical models based on the sets of desirable gambles representation.
2. We will show that independence can have different generalizations in imprecise probability, giving rise to different interpretations of graphical models [7]. We will consider the most important ones: epistemic independence and strong independence.
3. Given a network structure, the estimation of conditional probabilities in a Bayesian network poses important problems. Usually, Bayesian methods are used in this task, but we will show that the selection of concrete 'a priori' distributions in conjunction with the design of the network can have important consequences for the probabilities we compute with the network. Then, we will introduce the imprecise Dirichlet model [8] and discuss how it can be applied to estimate interval probabilities in a dependence graph. Its use allows sensible conclusions (non-vacuous intervals) to be obtained under weaker assumptions than precise Bayesian models.
4. In general, there are no methods based on imprecise probability to learn a dependence graph. This is another important challenge for the future. In [5] we
have introduced a new score to decide between dependence and independence, taking the imprecise Dirichlet model as its basis, which can be used for the design of a genuine imprecise probability learning procedure. Bayesian scores always decide for one of the options (dependence or independence), even for very short samples. The main novelty of the imprecise probability score is that in some situations it will determine that there is no evidence to support either of the options. This will have important consequences on the behaviour of the learning algorithms and the strategy for searching for a good model.
5. We will review algorithms for inference in graphical models with imprecise probability, showing the different optimization problems associated with the different independence concepts and estimation procedures [9]. One of the most challenging current problems is the development of inference algorithms when probabilities are estimated under a global application of the imprecise Dirichlet model.
6. Finally we will consider the problem of supervised classification, making a survey of existing approaches [10, 11] and pointing out the necessity of developing a fair comparison procedure between the outputs of precise and imprecise models.
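As a concrete illustration of the interval estimation mentioned in point 3, the following small sketch (editorial addition with hypothetical counts, not taken from the talk) computes the lower and upper probabilities the imprecise Dirichlet model assigns to the values of a single variable: with value count n_x out of N observations and prior strength s, the bounds are n_x/(N+s) and (n_x+s)/(N+s) [8].

```python
def idm_intervals(counts, s=2.0):
    """Lower/upper probability bounds under the imprecise Dirichlet model.

    counts: dict mapping each category to its observed frequency n_x.
    s: prior strength; a larger s widens the intervals (more imprecision).
    Returns a dict category -> (lower, upper).
    """
    n = sum(counts.values())
    return {x: (nx / (n + s), (nx + s) / (n + s)) for x, nx in counts.items()}

# Hypothetical counts for one variable of a dependence graph.
counts = {"a1": 14, "a2": 5, "a3": 1}
for value, (lo, hi) in idm_intervals(counts).items():
    print(f"P({value}) in [{lo:.3f}, {hi:.3f}]")

# With no data at all the intervals are vacuous ([0, 1]): lack of information
# becomes indeterminacy instead of an unsupported point estimate.
print(idm_intervals({"a1": 0, "a2": 0}))
```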
References

1. Walley, P.: Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London (1991)
2. Jensen, F.: Bayesian Networks and Decision Graphs. Springer-Verlag, New York (2002)
3. Fagin, R., Halpern, J.: A new approach to updating beliefs. In Bonissone, P., Henrion, M., Kanal, L., Lemmer, J., eds.: Uncertainty in Artificial Intelligence, 6. North-Holland, Amsterdam (1991) 347–374
4. Breese, J., Fertig, K.: Decision making with interval influence diagrams. In P.P. Bonissone, M. Henrion, L.K., ed.: Uncertainty in Artificial Intelligence, 6. Elsevier (1991) 467–478
5. Abellán, J., Moral, S.: A new imprecise score measure for independence. Submitted to the Fourth International Symposium on Imprecise Probability and Their Applications (ISIPTA '05) (2005)
6. Walley, P.: Towards a unified theory of imprecise probability. International Journal of Approximate Reasoning 24 (2000) 125–148
7. Couso, I., Moral, S., Walley, P.: A survey of concepts of independence for imprecise probabilities. Risk, Decision and Policy 5 (2000) 165–181
8. Walley, P.: Inferences from multinomial data: learning about a bag of marbles (with discussion). Journal of the Royal Statistical Society, Series B 58 (1996) 3–57
9. Cano, A., Moral, S.: Algorithms for imprecise probabilities. In Kohlas, J., Moral, S., eds.: Handbook of Defeasible and Uncertainty Management Systems, Vol. 5. Kluwer Academic Publishers, Dordrecht (2000) 369–420
10. Zaffalon, M.: The naive credal classifier. Journal of Statistical Planning and Inference 105 (2002) 5–21
11. Abellán, J., Moral, S.: Upper entropy of credal sets. Applications to credal classification. International Journal of Approximate Reasoning (2005). To appear.
Knowledge-Based Operations for Graphical Models in Planning
Jörg Gebhardt (1) and Rudolf Kruse (2)
(1) Intelligent Systems Consulting (ISC), Celle, Germany
[email protected]
(2) Dept. of Knowledge Processing and Language Engineering (IWS), Otto-von-Guericke-University of Magdeburg, Magdeburg, Germany
Abstract. In real world applications planners are frequently faced with complex variable dependencies in high dimensional domains. In addition to that, they typically have to start from a very incomplete picture that is expanded only gradually as new information becomes available. In this contribution we deal with probabilistic graphical models, which have successfully been used for handling complex dependency structures and reasoning tasks in the presence of uncertainty. The paper discusses revision and updating operations in order to extend existing approaches in this field, where in most cases a restriction to conditioning and simple propagation algorithms can be observed. Furthermore, it is shown how all these operations can be applied to item planning and the prediction of parts demand in the automotive industry. The new theoretical results, modelling aspects, and their implementation within a software library were delivered by ISC Gebhardt and then involved in an innovative software system realized by Corporate IT for the world-wide item planning and parts demand prediction of the whole Volkswagen Group.
1 Introduction
Complex products like automobiles are usually assembled from a number of prefabricated modules and parts. Many of these components are produced in specialised facilities not necessarily located at the final assembly site. An on-time delivery failure of only one of these components can severely lower production efficiency. In order to efficiently plan the logistical processes, it is essential to give acceptable parts demand estimations at an early stage of planning. One goal of the project described in this paper was to develop a system which plans parts demand for production sites of the Volkswagen Group. The market strategy of the Volkswagen Group is strongly customer-focused, based on adaptable designs and special emphasis on variety. Consequently, when ordering an automobile, the customer is offered several options of how each feature should be realised. The consequence is a very large number of possible car variants. Since the particular parts required for building an automobile depend on the variant of the car, the overall parts demand cannot be successfully estimated from total production numbers alone.
The modelling of domains with such a large number of possible states is very complex. For many practical purposes, modelling problems are simplified by introducing strong restrictions, e.g. fixing the value of some variables, assuming simple functional relations and applying heuristics to eliminate presumably less informative variables. However, as these restrictions can be in conflict with accuracy requirements or flexibility, it is rewarding to look into methods for solving the original task. Since working with complete domains seems to be infeasible, decomposition techniques are a promising approach to this kind of problem. They are applied for instance in graphical models (Lauritzen and Spiegelhalter, 1988; Pearl, 1988; Lauritzen, 1996; Borgelt and Kruse, 2002; Gebhardt, 2000), which rely on marginal and conditional independence relations between variables to achieve a decomposition of distributions. In addition to a compact representation, graphical models allow reasoning on high dimensional spaces to be implemented using operations on lower dimensional subspaces and propagating information over a connecting structure. This results in a considerable efficiency gain. In this paper we will show how a graphical model, when combined with certain operators, can be applied to flexibly plan parts demand in the automotive industry. We will furthermore demonstrate that such a model offers additional benefits, since it can be used for item planning, and it also provides a useful tool to simulate parts demand and capacity usage in projected market development scenarios.
2 Probabilistic Graphical Models
Graphical Models have often and successfully been applied with regard to probability distributions. The term "graphical model" is derived from an analogy between stochastic independence and node separation in graphs. Let V = {A1, . . . , An} be a set of random variables. If the underlying distribution fulfils certain criteria (see e.g. Castillo et al., 1997), then it is possible to capture some of the independence relations between the variables in V using a graph G = (V, E).

2.1 Bayesian Networks
In the case of Bayesian networks, G is a directed acyclic graph (DAG). Conditional independence between variables Vi and Vj (i ≠ j; Vi, Vj ∈ V) given the value of other variables S ⊆ V is expressed by Vi and Vj being d-separated by S in G (Pearl, 1988; Geiger et al., 1990); i.e. there is no sequence of edges (of any directionality) between Vi and Vj such that:
1. every node of that sequence with converging edges is an element of S or has a descendant in S,
2. every other node is not in S.
Probabilistic Bayesian networks are based on the idea that the common probability distribution of several variables can be written as a product of marginal and conditional distributions. Independence relations allow for a simplification of these products. For distributions such a factorisation can be described by a graph. Any independence map of the original distribution that is also a DAG provides a valid factorisation. If such a graph G is known, it is sufficient to store a conditional distribution for each node attribute given its direct predecessors in G (a marginal distribution if there are no predecessors) to represent the complete distribution pV, i.e.
graph. Any independence map of the original distribution that is also a DAG provides a valid factorisation. If such a graph G is known, it is sufficient to store a conditional distribution for each node attribute given its direct predecessors in G (marginal distribution if there are no predecessors) to represent the complete distribution pV , i.e.
pV
Ã
V
Ai ∈V
2.2
∀a1 ∈ ! dom(A1 ) : .Ã . . ∀an ∈ dom(An ) : ! V Q Ai = ai = p Ai = ai | Aj = aj . Ai ∈V
(Aj ,Ai )∈E
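To make the factorisation tangible, here is a minimal sketch (an editorial addition with toy variables and invented numbers, not taken from the paper) that evaluates the joint probability of a complete assignment as the product of the conditional distributions stored at the nodes of a DAG.

```python
# Toy Bayesian network A -> B -> C: p(A, B, C) = p(A) * p(B | A) * p(C | B).
# Each node stores p(node = value | parent values) in a plain dictionary.
parents = {"A": (), "B": ("A",), "C": ("B",)}
cpts = {
    "A": {(): {"a0": 0.6, "a1": 0.4}},
    "B": {("a0",): {"b0": 0.7, "b1": 0.3}, ("a1",): {"b0": 0.2, "b1": 0.8}},
    "C": {("b0",): {"c0": 0.9, "c1": 0.1}, ("b1",): {"c0": 0.5, "c1": 0.5}},
}

def joint_probability(assignment):
    """Product of p(X = x | parents(X)) over all nodes, per the factorisation above."""
    p = 1.0
    for node, pars in parents.items():
        parent_values = tuple(assignment[q] for q in pars)
        p *= cpts[node][parent_values][assignment[node]]
    return p

print(joint_probability({"A": "a0", "B": "b1", "C": "c0"}))  # 0.6 * 0.3 * 0.5 = 0.09
```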
Markov Networks
Markov networks are based on similar principles, but rely on undirected graphs and the u-separation criterion instead. Two nodes are considered separated by a set S if all paths connecting the nodes contain an element from S. If G is an independence map of a given distribution, then any separation of two nodes given a set of attributes S corresponds to a conditional independence of the two given values of the attributes in S. As shown by Hammersley and Clifford (1971) a strictly positive probability distribution is factorisable w.r.t. its undirected independence graph, with the factors being nonnegative functions on the maximal cliques C = {C1 . . . Cm } break in G. ! ! 1 ) : . . . ∀an ∈Ãdom(An ) : Ã ∀a1 ∈ dom(A V V Q Aj = aj . Ai = ai = φC i pV Ai ∈V
Ci ∈C
Aj ∈Ci
A detailed discussion of this topic, which includes the choice of the factor potentials \(\phi_{C_i}\), is given e.g. in Borgelt and Kruse (2002). It is worth noting that graphical models can also be used in the context of possibility distributions. The product in the probabilistic formulae is then replaced by the minimum.
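A matching sketch for the undirected case (again an editorial addition with invented potentials): the joint value of a complete assignment is the normalised product of the clique potentials; replacing the product by a minimum would give the possibilistic reading mentioned above.

```python
from itertools import product

domains = {"A": ["a0", "a1"], "B": ["b0", "b1"], "C": ["c0", "c1"]}
# Maximal cliques {A,B} and {B,C} with nonnegative factor potentials (invented numbers).
cliques = {
    ("A", "B"): {("a0", "b0"): 3.0, ("a0", "b1"): 1.0, ("a1", "b0"): 1.0, ("a1", "b1"): 2.0},
    ("B", "C"): {("b0", "c0"): 2.0, ("b0", "c1"): 1.0, ("b1", "c0"): 1.0, ("b1", "c1"): 4.0},
}

def unnormalised(assignment):
    """Product of phi_C over all maximal cliques C for one full assignment.
    In the possibilistic reading, this product would become a minimum."""
    value = 1.0
    for scope, phi in cliques.items():
        value *= phi[tuple(assignment[v] for v in scope)]
    return value

# Normalising constant: a sum over the full joint domain (feasible only for toy examples).
variables = list(domains)
Z = sum(unnormalised(dict(zip(variables, combo))) for combo in product(*domains.values()))
print(unnormalised({"A": "a0", "B": "b0", "C": "c0"}) / Z)
```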
3 Analysis of the Planning Problem
The models offered by the Volkswagen Group are typically highly flexible and therefore very rich in variants. In fact many of the assembled cars are unique with respect to the variant represented by them. It should be obvious that under these circumstances a car cannot be described by general model parameters alone. For that reason, model specifications list so called item variables {Fi : i = 1 . . . n; i, n ∈ IN }. Their domains dom(Fi ) are called item families. The item variables refer to various attributes like for example ‘exterior colour’, ‘seat covering’, ‘door layout’ or ‘presence of vanity mirror’ and serve as placeholders for features of individual vehicles. The elements of the respective domains are called items. We will use capital letters to denote item variables and indexed lower case letters for items in the associated family. A variant specification is
Table 1. Vehicle specification, Class: 'Golf'

Item family: body variant | engine           | radio      | door layout | vanity mirror | ...
Item:        short back   | 2.8L 150kW spark | Type alpha | 5           | no            | ...
obtained when a model specification is combined with a vector providing exactly one element for each item family (Table 1). For the 'Golf' class there are approximately 200 item families—each consisting of at least two, but up to 50, items. The set of possible variants is the product space dom(F1) × . . . × dom(Fn) with a cardinality of more than 2^200 (about 10^60) elements. Not every combination of items corresponds to a valid variant specification (see Sec. 3.1), and it is certainly not feasible to explicitly specify variant-part lists for all possible combinations. Apart from that, there is the manufacturing point of view. It focuses on automobiles being assembled from a number of prefabricated components, which in turn may consist of smaller units. Identifying the major components—although useful for many other tasks—does not provide sufficient detail for item planning. However, the introduction of additional structuring layers, i.e. 'components of components', leads to a refinement of the descriptions. This way one obtains a tree structure with each leaf representing an installation point for alternative parts. Depending on which alternative is chosen, different vehicle characteristics can be obtained. Part selection is therefore based on the abstract vehicle specification, i.e. on the item vector. At each installation point only a subset of item variables is relevant. Using this connection, it is possible to find partial variant specifications (item combinations) that reliably indicate whether a component has to be used or not. At the level of whole planning intervals this allows total parts demand to be calculated as the product of the relative frequency of these relevant item combinations and the projected total production for that interval. Thus the problem of estimating parts demand is reduced to estimating the frequency of certain relevant item combinations.
3.1 Ensuring Variant Validity
When combining parts, some restrictions have to be considered. For instance, a given transmission t1 may only work with a specific type of engine e3 . Such relations are represented in a system of technical and marketing rules. For better readability the item variables are assigned unique names, which are used as a synonym for their symbolic designation. Using the item variables T and E (‘transmission’ and ‘engine’), the above example would be represented as: if ‘transmission’ = t1 then ‘engine’ = e3
The antecedent of a rule can be composed from a combination of conditions, and it is possible to present several alternatives in the consequent part:

if 'engine' = e2 and 'auxiliary heater' = h3 then 'generator' ∈ {g3, g4, g5}

Many rules state engineering requirements and are known in advance. Others refer to market observations and are provided by experts (e.g. a vehicle that combines sportive gadgets with a weak motor and automatic gear will not be considered valid, even though technically possible). The rule system covers explicit dependencies between item variables and ensures that only valid variants are considered. Since it already encodes dependence relations between item variables, it also provides an important data source for the model generation step.
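A minimal sketch of how such rules can be checked against a variant specification (the rules and items below are invented for illustration and are not taken from the project's rule system R): each rule is a set of antecedent conditions plus the set of admissible values for the consequent variable.

```python
# Each rule: (antecedent conditions, consequent variable, admissible consequent values).
rules = [
    ({"transmission": {"t1"}}, "engine", {"e3"}),
    ({"engine": {"e2"}, "auxiliary heater": {"h3"}}, "generator", {"g3", "g4", "g5"}),
]

def violations(spec, rules):
    """Return the rules whose antecedent matches spec but whose consequent is violated."""
    bad = []
    for antecedent, variable, allowed in rules:
        fires = all(spec.get(var) in values for var, values in antecedent.items())
        if fires and variable in spec and spec[variable] not in allowed:
            bad.append((antecedent, variable, allowed))
    return bad

spec = {"transmission": "t1", "engine": "e2", "auxiliary heater": "h3", "generator": "g1"}
print(violations(spec, rules))  # both rules fire here and both are violated
```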
3.2 Additional Data Sources
In addition to the rule system it is possible to access data on previously produced automobiles. This data provides a large set of examples, but in order to use it for market oriented estimations, it has to be cleared of production-driven influences first. Temporary capacity restrictions, for example, usually only affect some item combinations and lead to their underrepresentation at one time. The converse effect will be observed, when production is back to normal, so that the deferred orders can be processed. In addition to that, the effect of starting times and the production of special models may superpose the statistics. One also has to consider that the rule system, which was valid upon generation of the data, is not necessarily identical to the current one. For that reason the production history data is used only from relatively short intervals known to be free of major disturbances (like e.g. the introduction of a new model design or supply shortages). When intervals are thus carefully selected, the data is likely to be ‘sufficiently representative’ to quantify variable dependences and can thus provide important additional information. Considering that most of the statistical information obtained from the database would be tedious to state as explicit facts, it is especially useful for initialising planning models. Finally we want experts to be able to integrate their own observations or predictions into the planning model. Knowledge provided by experts is considered of higher priority than that already represented by the model. In order to deal with possible conflicts it is necessary to provide revision and updating mechanisms.
4 Generation of the Markov Network Model
It was decided to employ a probabilistic Markov network to represent the distribution of item combinations. Probabilities are thus interpreted in terms of estimated relative frequencies for item combinations. But since there are very good predictions for the total production numbers, conversion of facts based on absolute frequency is well possible. In order to create the model itself one still has to find an appropriate decomposition. When generating the model there are two data sources available, namely a rule system R, and the production history.
4.1 Transformation of the Rule System
The dependencies between item variables as expressed in the rule system are relational. While this allows us to exclude some item combinations that are inconsistent with the rules, it does not distinguish between the remaining item combinations, even though there may be significant differences in terms of their frequency. Nevertheless the relational information is very helpful in the way that it rules out all item combinations that are inconsistent with the rule system. In addition to that, each rule scheme (the set of item variables that appear in a given rule) explicitly supplies a set of interacting variables. For our application it is also reasonable to assume that item variables are at least approximately independent of one another given all other families, if there is no common appearance of them in any rule (unless explicitly stated otherwise, interior colour is expected to be independent of the presence of a trailer hitch). Using the above independence assumption we can compose the relation of 'being consistent with the rule system'. The first step consists in selecting the maximal rule schemes with respect to the subset relation. For the joint domain over the variables in each maximal rule scheme the relation can directly be obtained from the rules. For efficient reasoning with Markov networks it is desirable that the underlying clique graph has the hypertree property. This can be ensured by triangulating the graph (Figure 1c). An algorithm that performs this triangulation is given e.g. in Pearl (1988). However, introducing additional edges comes at the cost of losing some more independence information. The maximal cliques in the triangulated independence graph correspond to the nodes of a hypertree (Figure 1d); a small construction sketch follows Figure 1.
[Fig. 1. Transformation into hypertree structure — (a) unprocessed interaction graph over the variables A–G, (b) rule schemes {ABC}, {BDE}, {CFG}, {EF}, (c) triangulated graph, (d) hypertree representation with the maximal cliques ABC, BDE, BCE, CEF, CFG]
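Following up on the construction described before Figure 1, here is a small prototype sketch (editorial addition): the rule schemes are those of the figure, while the greedy min-fill elimination used for triangulation is only one common heuristic and not necessarily the one used in the EPL system.

```python
from itertools import combinations

schemes = [{"A", "B", "C"}, {"B", "D", "E"}, {"C", "F", "G"}, {"E", "F"}]

# Interaction graph: connect every pair of variables that share a rule scheme.
nodes = set().union(*schemes)
edges = {frozenset(p) for s in schemes for p in combinations(s, 2)}

def neighbours(v, edges):
    return {w for e in edges for w in e if v in e and w != v}

# Greedy elimination with a min-fill heuristic; the fill-in edges triangulate the
# graph and the eliminated neighbourhoods yield the cliques of the hypertree.
def triangulate(nodes, edges):
    edges, remaining, cliques = set(edges), set(nodes), []
    while remaining:
        def fill_in(v):
            nb = neighbours(v, edges) & remaining
            return [frozenset(p) for p in combinations(nb, 2) if frozenset(p) not in edges]
        v = min(remaining, key=lambda u: len(fill_in(u)))
        cliques.append({v} | (neighbours(v, edges) & remaining))
        edges |= set(fill_in(v))
        remaining.remove(v)
    # Keep only the maximal cliques.
    return [c for c in cliques if not any(c < d for d in cliques)]

print(triangulate(nodes, edges))  # e.g. ABC, BDE, CFG, BCE, CEF as in Figure 1d
```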
To complete the model we still need to assign a local distribution (i.e. relation) to each of the nodes. For those nodes that represent the original maximal cliques in the independence graph they can be obtained from the rules that work with these item variables or a subset of them (see above). Those that use edges introduced in the triangulation process can be computed from them by combining projections, i.e. applying the conditional independence relations that have been removed from the graph when the additional edges were introduced. Since we are dealing with the relational case here this amounts to calculating a join operation. Although such a representation is useful to distinguish valid vehicle specifications from invalid ones, the relational framework alone cannot supply us with sufficient information to estimate item rates. Therefore it is necessary to investigate a different approach. 4.2
Learning from Historical Data
A different available data source consists of variant descriptions from previously produced vehicles. However, predicting item frequencies from such data relies on the assumption that the underlying distribution does not change all too sudden. In section 3.2 considerations have been provided how to find ‘sufficiently representative’ data. Again we can apply a Markov network to capture the distributions using the probabilistic framework this time. One can distinguish between several approaches to learn the structure of probabilistic graphical models from data. Performing an exhaustive search of possible graphs is a very direct approach. Unfortunately this method is extremely costly and infeasible for complex problems like the one given here. Many algorithms are based on dependency analysis (Sprites and Glymour, 1991; Steck, 2000; Verma and Pearl, 1992) or Bayesian statistics, e.g. K2 (Cooper and Herskovits, 1992), K2B (Khalfallah and Mellouli, 1999), CGH (Chickering et al., 1995) and the structural EM algorithm (Friedman, 1998). Combined algorithms usually use heuristics to guide the search. Algorithms for structure learning in probabilistic graphical models typically consist of a component to generate candidate graphs for the model structure, and a component to evaluate them so that the search can be directed (Khalfallah and Mellouli, 1999; Singh and Valtorta, 1995). However even these methods are still costly and do not guarantee a result that is consistent to the rule system of our application. Our approach is based on the fact that we do not need to rely on the production history for learning the model structure. Instead we can make use of the relational model derived from the rule system. Using the structure of the relational model as a basis and combining it with probability distributions estimated from the production history constitutes an efficient way to construct the desired probabilistic model. Once the hypergraph is selected, it is necessary to find the factor potentials for the Markov network. For this purpose a frequentistic interpretation is assumed, i.e. estimates for the local distributions for each of the maximal cliques are ob-
tained directly from the database. In the probabilistic case there are several choices for the factor potentials because probability mass associated with the overlap of maximal cliques (separator sets) can be assigned in different ways. However for fast propagation it is often useful to store both local distributions for the maximal cliques and the local distributions for the separator sets (junction tree representation). Having copied the model structure from the relational model also provides us with additional knowledge of forbidden combinations. In the probability distributions these item combinations should be assigned a zero probability. While the model generation based on both rule system and samples is fast, it does not completely rule out inconsistencies. One reason for that is the continuing development of the rule system. The rule system is subject to regular updates in order to allow for changes in marketing programs or composition of the item families themselves. These problems, including the redistribution of probability mass, can be solved using belief change operations (Gebhardt and Kruse, 1998), which are described in the next section.
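As an illustration of this estimation step, the following sketch (an editorial addition with invented sample records, not actual production data) computes the local distribution of one clique as relative frequencies from historical variant descriptions and keeps rule-forbidden combinations at probability zero.

```python
from collections import Counter

# Invented production-history records: one dict of item-variable values per built vehicle.
history = [
    {"engine": "e1", "transmission": "t1", "radio": "r2"},
    {"engine": "e1", "transmission": "t2", "radio": "r1"},
    {"engine": "e2", "transmission": "t2", "radio": "r1"},
    {"engine": "e1", "transmission": "t1", "radio": "r1"},
]
cliques = [("engine", "transmission"), ("transmission", "radio")]
forbidden = {("engine", "transmission"): {("e2", "t1")}}  # derived from the rule system

def clique_marginal(history, scope, forbidden_combinations=frozenset()):
    """Relative frequencies of the item combinations on one clique scope."""
    counts = Counter(tuple(record[v] for v in scope) for record in history)
    total = sum(counts.values())
    marginal = {combo: n / total for combo, n in counts.items()}
    for combo in forbidden_combinations:   # invalid variants stay at probability zero
        marginal[combo] = 0.0
    return marginal

for scope in cliques:
    print(scope, clique_marginal(history, scope, forbidden.get(scope, frozenset())))
```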
5 Planning Operations
A planning model that was generated using the above method, usually does not reflect the whole potential of available knowledge. For instance, experts are often aware of differences between the production history and the particular planning interval the model is meant to be used with. Thus a mechanism to modify the represented distribution is required. In addition to that we have already mentioned possible inconsistencies that arise from the use of different data sources in the learning process itself. Planning operators have been developed to efficiently handle this kind of problem, so modification of the distribution and restoration of a consistent state can be supported. 5.1
Updating
Let us now consider the situation where previously forbidden item combinations become valid. This can result for instance from changes in the rule system. In this case neither quantitative nor qualitative information on variable interaction can be obtained from the production history. A more complex version of the same problem occurs when subsets of cliques are to be altered while the information in the remaining parts of the network is retained, for instance after the introduction of rules with previously unused schemes (Gebhardt et al., 2003). In both cases it is necessary to provide the probabilistic interaction structure—a task performed with the help of the updating operation. The updating operation marks these combinations as valid by assigning a positive near zero probability to their respective marginals in the local distributions. Since the replacement value is very small compared to the true item frequencies obtained from the data, the quality of estimation is not affected by this alteration. Now instead of using the same initialisation for all new item
combinations, the proportion of the values is chosen in accordance with an existing combination, i.e. the probabilistic interaction structure is copied from reference item combinations. This also explains why it is not convenient to use zero itself as an initialisation: the positive values are necessary to carry qualitative dependency information. For illustration, consider the introduction of a new value t4 to the item family 'transmission'. The planners predict that the new item distributes similarly to the existing item t3. If they specify t3 as a reference, the updating operation will complete the local distributions that involve T such that the marginals for the item combinations that include t4 are in the same ratio to each other as their respective counterparts with t3 instead. Since updating only provides the qualitative aspect of the dependency structure, it is usually followed by the subsequent application of the revision operation, which can be used to reassign probability mass to the new item combinations.
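A sketch of the updating step for the t4 example above (editorial addition, all numbers invented): the marginals involving the new item are initialised with a small positive mass whose proportions are copied from the reference item t3, so only the qualitative interaction structure is carried over.

```python
# Local distribution of the clique (transmission, engine) before updating (invented numbers).
local = {("t1", "e1"): 0.30, ("t1", "e2"): 0.10, ("t3", "e1"): 0.15, ("t3", "e2"): 0.45}

def update_with_new_item(local, variable_index, new_item, reference_item, epsilon=1e-6):
    """Add combinations with the new item, copying the proportions of the reference item.

    variable_index: position of the item variable inside the clique's tuple keys.
    The new entries carry only a tiny total mass epsilon; the subsequent revision
    step is what actually assigns a meaningful rate to them.
    """
    reference = {k: v for k, v in local.items() if k[variable_index] == reference_item}
    ref_total = sum(reference.values())
    updated = dict(local)
    for combo, mass in reference.items():
        new_combo = combo[:variable_index] + (new_item,) + combo[variable_index + 1:]
        updated[new_combo] = epsilon * mass / ref_total
    total = sum(updated.values())
    return {combo: mass / total for combo, mass in updated.items()}  # renormalise

print(update_with_new_item(local, variable_index=0, new_item="t4", reference_item="t3"))
```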
5.2 Revision
After the model has been generated, it is further adapted to the requirements of the particular planning interval. The information used at this stage is provided by experts and includes marketing and sales stipulations. It is usually specific to the planning interval. Such additional information can be integrated into the model using the revision operator. The input data consists of predictions or restrictions for installation rates of certain items, item combinations or even sets of either. It also covers the issue of unexpected capacity restrictions, which can be expressed in this form. Although the new information is frequently in conflict with prior knowledge, i.e. the distribution previously represented in the model, it usually has an important property—namely that it is compatible with the independence relations, which are represented in the model structure. The revision operation, while preserving the network structure, serves to modify quantitative knowledge in such a way that the revised distribution becomes consistent with the new specialised information. There is usually no unique solution to this task. However, it is desirable to retain as much of the original distribution as possible so the principle of minimal change (G¨ardenfors, 1988) should be applied. Given that, a successful revision operation holds a unique result (Gebhardt et al., 2004). The operation itself starts by modifying a single marginal distribution. Using the iterative proportional fitting method, first the local clique and ultimately the whole network is adapted to the new information. Since revision relies on the qualitative dependency structure already present, one can construct cases where revision is not possible. In such cases an updating operation is required before revision can be applied. In addition to that the supplied information can be contradictory in itself. Such situations are sometimes difficult to recognise. Criteria that entail a successful revision and proves for the maximum preservation of previous knowledge have been provided in Gebhardt et al. (2004). Gebhardt (2001) deals with the problem of inconsistent information and how the revision operator itself can help dealing with it.
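The quantitative core of such a revision can be sketched for a single clique as follows (editorial addition, numbers invented): iterative proportional fitting rescales the distribution so that the stipulated marginal is met, while the conditional distributions given the revised variable, and hence as much of the prior knowledge as possible, are preserved.

```python
def ipf_revise(joint, targets, sweeps=20):
    """Revise a discrete distribution so that given single-variable marginals hold.

    joint:   dict mapping assignments (tuples) to probabilities.
    targets: dict variable_index -> {value: stipulated marginal probability}.
    Each sweep rescales the entries proportionally (iterative proportional fitting),
    i.e. a minimal-change adjustment in the sense of relative entropy.
    """
    joint = dict(joint)
    for _ in range(sweeps):
        for index, target in targets.items():
            marginal = {}
            for combo, mass in joint.items():
                marginal[combo[index]] = marginal.get(combo[index], 0.0) + mass
            for combo in joint:
                value = combo[index]
                if marginal[value] > 0.0:
                    joint[combo] *= target[value] / marginal[value]
    return joint

# Invented prior for the clique (navigation, drive): revise P(navigation = yes) to 0.30.
prior = {("yes", "awd"): 0.08, ("yes", "fwd"): 0.12, ("no", "awd"): 0.20, ("no", "fwd"): 0.60}
revised = ipf_revise(prior, {0: {"yes": 0.30, "no": 0.70}})
print(revised)  # P(yes) becomes 0.30 while P(drive | navigation) is unchanged
```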
Depending on circumstances human experts may want to specify their knowledge in different ways. Sometimes it is more convenient to give an estimation of future item frequency in absolute numbers, while at a different occasion it might be preferable to specify item rates or a relative increase. With the help of some readily available data and the information which is already represented in the network before revision takes place, such inputs can be transformed to item rates. From the operator’s point of view this can be very useful. As an example for a specification using item rates experts might predict a rise of the popularity of a recently introduced navigation system and set the relative frequency of this respective item from 20% to 30%. Sometimes the stipulations are embedded in a context as in “The frequency of air conditioning for Golfs with all wheel drive in France will increase by 10%”. In such cases the statements can be transformed and amount to a changing the ratio of the rates for the combination of all items in the statement (air conditioning present, all wheel drive, France) to the rates of that, which only includes the items from the context (all wheel drive, France).
5.3 Focussing
While revision and updating are essential operations for building and maintaining a distribution model, it is a much more common activity to apply the model for the exploration of the represented knowledge and its implications with respect to user decisions. Typically users would want to concentrate on those aspects of the represented knowledge that fall into their domain of expertise. Moreover, when predicting parts demand from the model, one is only interested in estimated rates for particular item combinations (see Sec. 3). Such activities require a focussing operation. It is achieved by performing evidence-driven conditioning on a subset of variables and distributing the information through the network. The well-known variable instantiation can be seen as a special case of focussing where all probability is assigned to exactly one value per input variable. As with revision, context dependent statements can be obtained by returning conditional probabilities. Furthermore, item combinations with compatible variable schemes can be grouped at the user interface providing access to aggregated probabilities. Apart from predicting parts demand, focussing is often employed for market analyses and simulation. By analysing which items are frequently combined by customers, experts can tailor special offers for different customer groups. To support planning of buffer capacities, it is necessary to deal with the eventuality of temporal logistic restrictions. Such events would entail changes in short term production planning so that the consumption of the concerned parts is reduced. This in turn affects the overall usage of other parts. The model can be used to simulate scenarios defined by different sets of frame conditions, to test adapted production strategies and to assess the usage of all parts.
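A final sketch (editorial addition, with invented rates and production volume) connecting focussing with the parts demand calculation of Section 3: conditioning on a context answers analysts' queries, while the unconditional rate of a part-relevant item combination, multiplied by the planned total production, yields the expected parts demand.

```python
# Invented joint rates over (market, drive, air conditioning) for one planning interval.
joint = {
    ("france", "awd", "ac"): 0.02, ("france", "awd", "no_ac"): 0.03,
    ("france", "fwd", "ac"): 0.10, ("france", "fwd", "no_ac"): 0.15,
    ("export", "awd", "ac"): 0.05, ("export", "awd", "no_ac"): 0.05,
    ("export", "fwd", "ac"): 0.30, ("export", "fwd", "no_ac"): 0.30,
}
variables = ("market", "drive", "air conditioning")

def focus(joint, evidence):
    """Evidence-driven conditioning: restrict to the given context and renormalise."""
    index = {v: i for i, v in enumerate(variables)}
    kept = {c: p for c, p in joint.items()
            if all(c[index[v]] == value for v, value in evidence.items())}
    total = sum(kept.values())
    return {c: p / total for c, p in kept.items()}

# Parts demand: rate of the part-relevant item combination times planned production.
relevant_rate = sum(p for c, p in joint.items() if c[1] == "awd" and c[2] == "ac")
planned_production = 200000
print("expected parts demand:", round(relevant_rate * planned_production))

# Focussing answers context-dependent queries, e.g. the air-conditioning rate
# among all-wheel-drive vehicles for the French market.
french_awd = focus(joint, {"market": "france", "drive": "awd"})
print("P(ac | france, awd) =", sum(p for c, p in french_awd.items() if c[2] == "ac"))
```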
6 Application
The results obtained in this paper have contributed to the development of the planning system EPL (EigenschaftsPLanung, item planning). It was initiated in 2001 by Corporate IT, Sales, and Logistics of the Volkswagen Group. The aim was to establish for all trademarks a common item planning system that reflects the presented modelling approach based on Markov networks. System design and most of the implementation work of EPL is currently done by Corporate IT. The mathematical modelling, theoretical problem solving, and the development of efficient algorithms, extended by the implementation of a new software library called MARNEJ (MARkov NEtworks in Java) for the representation and the presented functionalities on Markov networks have been entirely provided by ISC Gebhardt. Since 2004 the system EPL is being rolled out to all trademarks of the Volkswagen group and step by step replaces the previously used planning systems. In order to promote acceptance and to help operators adapt to the new software and its additional capabilities, the user interface has been changed gradually. In parallel planners have been introduced to the new functionality, so that EPL can be applied efficiently. In the final configuration the system will have 6 to 8 Hewlett Packard Machines running Linux with 4 AMD Opteron 64-Bit CPUs and 16 GB of main memory each. With the new software, the increasing planning quality, based on the many innovative features and the appropriateness of the chosen model of knowledge representation, as well as a considerable reduction of calculation time turned out to be essential prerequisites for advanced item planning and calculation of parts demand in the presence of structured products with an extreme number of possible variants.
References
C. Borgelt and R. Kruse. Graphical Models—Methods for Data Analysis and Mining. J. Wiley & Sons, Chichester, 2002.
E. Castillo, J.M. Gutiérrez, and A.S. Hadi. Expert Systems and Probabilistic Network Models. Springer-Verlag, New York, 1997.
D.M. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks from data. Machine Learning, 20(3):197–243, 1995.
G.F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
N. Friedman. The Bayesian structural EM algorithm. In Proc. of the 14th Conference on Uncertainty in AI, pages 129–138, 1998.
P. Gärdenfors. Knowledge in Flux—Modeling the Dynamics of Epistemic States. MIT Press, Cambridge, MA, 1988.
J. Gebhardt. The revision operator and the treatment of inconsistent stipulations of item rates. Project EPL: Internal Report 9. ISC Gebhardt and Volkswagen Group, GOB-11, 2001.
J. Gebhardt. Learning from data: Possibilistic graphical models. In D.M. Gabbay and P. Smets, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems, volume 4: Abductive Reasoning and Learning, pages 314–389. Kluwer Academic Publishers, Dordrecht, 2000.
J. Gebhardt and R. Kruse. Parallel combination of information sources. In D.M. Gabbay and P. Smets, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems, volume 3: Belief Change, pages 393–439. Kluwer Academic Publishers, Dordrecht, 1998.
J. Gebhardt, H. Detmer, and A.L. Madsen. Predicting parts demand in the automotive industry – an application of probabilistic graphical models. In Proc. Int. Joint Conf. on Uncertainty in Artificial Intelligence (UAI'03, Acapulco, Mexico), Bayesian Modelling Applications Workshop, 2003.
J. Gebhardt, C. Borgelt, and R. Kruse. Knowledge revision in Markov networks. Mathware and Soft Computing, 11(2-3):93–107, 2004.
D. Geiger, T.S. Verma, and J. Pearl. Identifying independence in Bayesian networks. Networks, 20:507–534, 1990.
J.M. Hammersley and P.E. Clifford. Markov fields on finite graphs and lattices. Cited in Isham (1981), 1971.
V. Isham. An introduction to spatial point processes and Markov random fields. Int. Statistical Review, 49:21–43, 1981.
F. Khalfallah and K. Mellouli. Optimized algorithm for learning Bayesian networks from data. In Proc. 5th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU'99), pages 221–232, 1999.
S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157–224, 1988.
S.L. Lauritzen. Graphical Models. Oxford University Press, 1996.
J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, USA, 1988 (2nd edition 1992).
M. Singh and M. Valtorta. Construction of Bayesian network structures from data: Brief survey and efficient algorithm. Int. Journal of Approximate Reasoning, 12:111–131, 1995.
P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computing Review, 9(1):62–72, 1991.
H. Steck. On the use of skeletons when learning Bayesian networks. In Proc. of the 16th Conference on Uncertainty in AI, pages 558–565, 2000.
T. Verma and J. Pearl. An algorithm for deciding whether a set of observed independencies has a causal explanation. In Proc. 8th Conference on Uncertainty in AI, pages 323–330, 1992.
Some Representation and Computational Issues in Social Choice Jérôme Lang IRIT - Université Paul Sabatier and CNRS, 31062 Toulouse Cedex (France)
[email protected] Abstract. This paper briefly considers several research issues, some of which are ongoing and some of which are left for further research. The starting point is that many AI topics, especially those related to the ECSQARU and KR conferences, can bring a lot to the representation and the resolution of social choice problems. I surely do not claim to make an exhaustive list of problems; rather, I list some problems that I find important, give some relevant references, and point out some potential research issues.¹
1
Introduction
For a few years, Artificial Intelligence has been taking more and more interest in collective decision making. There are two main reasons for that, leading to two different lines of research. Roughly speaking, the first one is concerned with importing concepts and procedures from social choice theory for solving questions that arise in AI application domains. This is typically the case for managing societies of autonomous agents, which calls for negotiation and voting procedures. The second line of research, which is the focus of this position paper, goes the other way round: it is concerned with importing notions and methods from AI for solving questions originally stemming from social choice. Social choice is concerned with designing and evaluating methods of collective decision making. However, it somewhat neglects computational issues: the problem is generally considered to be solved when the existence (or the nonexistence) of a procedure meeting some requirements has been shown; more precisely, knowing that the procedure can be computed is generally enough; now, how hard this computation is, and how the procedure should be implemented, have received less attention in the social choice community. This is where AI (and operations research, and more generally computer science) comes into play. As often when bringing together two traditions, AI probably raises more new
¹ Writing a short survey is a difficult task, especially because it always leads to leaving some relevant references aside. I'll maintain a long version of this paper, accessible at http://www.irit.fr/recherches/RPDMP/persos/JeromeLang/papers/ecsqaru05-long.pdf, and I'll express my gratitude to everyone who points out to me any missing relevant reference.
questions pertaining to collective decision making than it solves old ones. One of the most relevant of these issues consists in considering group decision making problems where the set of alternatives is finite and has a combinatorial structure. This paper gives a brief overview of some research issues along this line. Section 2 starts with the crucial problem of eliciting and representing the individuals' preferences over the possible alternatives. Section 3 focuses on preference aggregation, Section 4 on voting, and Section 5 on fair division. Section 6 evokes other directions deliberately ignored in this short paper.
2
Elicitation and Compact Representation of Preference
Throughout the paper, N = {1, . . . , n} is the (finite) set of agents involved in the collective choice and X is the finite set of alternatives on which the decision process bears. Any individual or collective decision making problem needs some description (at least partial) of the preferences of each of the agents involved over the possible alternatives. A numerical preference structure is a utility function u : X → ℝ. An ordinal preference structure is a preorder R on X, called a preference relation; R(x, y) is alternatively denoted by x ⪰ y. ≻ denotes strict preference (x ≻ y if and only if x ⪰ y and not y ⪰ x) and ∼ denotes indifference (x ∼ y if and only if x ⪰ y and y ⪰ x). An intermediate model between purely ordinal and purely numerical models is that of qualitative preferences, consisting of (qualitative) utility functions u : X → L, where L is a totally ordered (yet not numerical) scale. Unlike ordinal preferences, qualitative preferences allow commensurability between uncertainty and preference scales as well as interagent comparison of preferences (see [22] for discussions on ordinality in decision making). The choice of a model, i.e. a mathematical structure, for preference does not tell how agents' preferences are obtained from them, stored, and handled by algorithms. Preference representation consists in choosing a language for encoding preferences so as to spare computational resources. The choice of a language is guided by two tasks: upstream, preference elicitation consists in interacting with the agent so as to obtain her preferences over X, while optimization consists in finding nondominated alternatives from a compactly represented input. As long as the set of alternatives has a small size, the latter problems are computationally easy. Unfortunately, in many concrete problems the set of alternatives has a combinatorial structure. A combinatorial domain is a Cartesian product of finite value domains for each one of a set of variables: an alternative in such a domain is a tuple of values. Clearly, the size of such domains grows exponentially with the set of variables and quickly becomes very large, which makes explicit representations and straightforward elicitation and optimization no longer reasonable. Logical or graphical compact representation languages allow for representing in as little space as possible a preference structure whose size would be prohibitive if it were represented explicitly. The literature on preference elicitation and representation for combinatorial domains has been growing rapidly for a few years, and due to the lack of space I omit giving references here.
The criteria one can use for choosing a compact preference language include, at least, the following ones:
– cognitive relevance: a language should be as close as possible to the way human agents "know" their preferences and express them in natural language;
– elicitation-friendliness: it should be easy to design algorithms to elicit preferences from an agent so as to get an output expressed in the given language;
– expressivity: find out the set of preference relations or utility functions that are expressible in a given language;
– complexity: given an input consisting of a compactly represented preference structure in a given language, determine the computational complexity of finding a non-dominated alternative, checking whether an alternative is preferred to another one, whether an alternative is non-dominated, etc.;
– comparative succinctness: given two languages L and L′, determine whether every preference structure that can be expressed in L can also be expressed in L′ without a significant (suprapolynomial) increase of size, in which case L′ is said to be at least as succinct as L.
Cognitive relevance is somewhat hard to assess, due to its non-technical nature, and has rarely been studied. Complexity has been studied in [35] for logic-based languages. Expressivity and comparative succinctness have been systematically investigated in [19] for ordinal preference representation. Although these languages have been designed for single agents, they can be extended to multiple agents without much difficulty; [34] and [44] are two examples of such extensions.
3
Preference Aggregation
Preference aggregation, even on simple domains, raises challenging computational issues that have recently been investigated by AI researchers. Aggregating preferences consists in mapping a collection ⟨P1, . . . , Pn⟩ of preference relations (or profiles) into a collective preference relation P* (which implies circumventing Arrow's impossibility theorem [2] by relaxing one of its applicability conditions). Now, even on simple domains, some aggregation functions raise computational difficulties. This is notably the case for Kemeny's aggregation rule, consisting in aggregating the profiles into a profile (called a Kemeny consensus) that is closest to the n profiles, with respect to a distance which, roughly speaking, is the sum, over all agents, of the numbers of pairs of alternatives on which the aggregated profile disagrees with the agent's profile. Computing a Kemeny consensus is NP-hard; [21] addresses its practical computation. When the set of alternatives has a combinatorial structure, things get much worse. Moreover, since in that case preferences are often described in a compact representation language, aggregation should ideally operate directly on this language, without generating the individual nor the aggregated preferences explicitly. A common way of aggregating compactly represented preferences is (logical) merging. The common point of logic-based merging approaches is that
the set of alternatives corresponds to a set of propositional worlds; the logic-based representation of agents' preferences (or beliefs) then induces a cardinal function (using ranks or distances) on worlds, and these cardinal preferences are aggregated. These functions are not necessarily on a numerical scale, but the scale has to be common to all agents. We do not have the space to give all relevant references to logic-based merging here, but we give a few of them, which explicitly mention some social choice theoretic issues: [33, 40, 13, 39]. See also [34, 6] for preference aggregation from logically expressed preferences.
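As a concrete illustration of the computational issue raised above for Kemeny's rule, the following Python sketch computes a Kemeny consensus by brute force over all rankings of a tiny invented profile; the exponential enumeration over rankings is exactly what becomes infeasible on larger domains, consistently with the NP-hardness mentioned above. The example data and function names are ours.

from itertools import permutations, combinations

candidates = ["a", "b", "c"]
# Each profile is a ranking from most to least preferred (toy data).
profiles = [("a", "b", "c"), ("b", "c", "a"), ("a", "c", "b")]

def disagreements(r1, r2):
    """Number of candidate pairs ordered differently by the two rankings."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    return sum(1 for x, y in combinations(r1, 2)
               if (pos1[x] < pos1[y]) != (pos2[x] < pos2[y]))

def kemeny_consensus(profiles, candidates):
    """Return a ranking minimising the summed pairwise disagreement."""
    return min(permutations(candidates),
               key=lambda r: sum(disagreements(r, p) for p in profiles))

print(kemeny_consensus(profiles, candidates))   # ('a', 'b', 'c') for this profile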
4
Vote
Voting is one of the most popular ways of reaching common decisions. Researchers in social choice theory have studied extensively the properties of various families of voting rules but, again, have neglected computational issues. A voting rule maps each collection of individual preference profiles, generally consisting of linear orders over the set of candidates, to a nonempty subset of the set of candidates; if the latter subset is always a singleton, then the voting rule is said to be deterministic.² For a panorama of voting rules see for instance [10]. We just give here a few of them. A positional scoring rule is defined from a scoring vector, i.e. a vector s = (s1, . . . , sm) of integers such that s1 ≥ s2 ≥ . . . ≥ sm and s1 > sm. Let ranki(x) be the rank of x in ≻i (1 if it is the favorite candidate for voter i, 2 if it is the second favorite, etc.); then the score of x is S(x) = ∑_{i=1}^{N} s_{ranki(x)}. Two well-known examples of positional scoring procedures are the Borda rule, defined by sk = m − k for all k = 1, . . . , m, and the plurality rule, defined by s1 = 1 and sk = 0 for all k > 1. Moreover, a Condorcet winner is a candidate preferred to any other candidate by a strict majority of voters (it is well known that there are some profiles for which no Condorcet winner exists). Obviously, when there exists a Condorcet winner, it is unique. A Condorcet-consistent rule is a voting rule electing the Condorcet winner whenever there is one. The first question that comes to mind is whether determining the outcome of an election, for a given voting procedure, is computationally challenging (which is all the more relevant as electronic voting becomes more and more popular).
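The definitions above translate directly into code. The following Python sketch, with an invented three-candidate profile, computes positional scores for the Borda and plurality vectors and checks for a Condorcet winner; it is an illustration only, not a general voting library.

def positional_scores(profiles, scoring_vector):
    """profiles: list of rankings (best first); scoring_vector: s_1 >= ... >= s_m."""
    scores = {}
    for ranking in profiles:
        for rank, candidate in enumerate(ranking):
            scores[candidate] = scores.get(candidate, 0) + scoring_vector[rank]
    return scores

def condorcet_winner(profiles, candidates):
    """Return the candidate beating every other one by a strict majority, if any."""
    for x in candidates:
        if all(sum(r.index(x) < r.index(y) for r in profiles) > len(profiles) / 2
               for y in candidates if y != x):
            return x
    return None

profiles = [("a", "b", "c"), ("a", "c", "b"), ("b", "c", "a")]
m = 3
borda = positional_scores(profiles, [m - 1 - k for k in range(m)])   # s_k = m - k
plurality = positional_scores(profiles, [1] + [0] * (m - 1))         # s_1 = 1, s_k = 0 otherwise
print(borda, plurality, condorcet_winner(profiles, ["a", "b", "c"]))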
4.1
Computing the Outcome of Voting Rules: Small Domains
Most voting rules among those that are practically used are computable in linear or quadratic time in the number of candidates (and almost always linear in the number of voters); therefore, when the number of candidates is small (which is typically the case for political elections where a single person has to be elected), computing the outcome of a voting rule does not need any sophisticated algorithm. However, a few voting rules are computationally complex. Here are three
² The literature of social choice theory rather makes use of the terminology "voting correspondences" and "deterministic voting rules", but for the sake of simplicity we will make use of the terminology "voting rules" in a uniform way.
of them: Dodgson's rule and Young's rule both consist in electing candidates that are closest to being a Condorcet winner: each candidate is given a score that is the smallest number of elementary changes in the voters' preference orders needed to make the candidate a Condorcet winner. Whichever candidate (or candidates, in the case of a tie) has the lowest score is the winner. For Dodgson's rule, an elementary change is an exchange of adjacent candidates in a voter's preference profile, while for Young's rule it is the removal of a voter. Lastly, Kemeny's voting rule elects a candidate if and only if it is the preferred candidate in some Kemeny consensus (see Section 3). Deciding whether a given candidate is a winner for any of the latter three voting rules is ∆P2(O(log n))-complete (for Dodgson's rule, NP-hardness was shown in [5] and ∆P2(O(log n))-completeness in [30]; ∆P2(O(log n))-completeness was shown in [45] for Young's and in [31] for Kemeny's).
4.2
Computing the Outcome of Voting Rules: Combinatorial Domains
Now, when the set of candidates has a combinatorial structure, even simple procedures such as plurality and Borda become hard. Consider an example where agents have to agree on a common menu to be composed of a first course dish, a main course dish, a dessert and a wine, with a choice of 6 items for each. This makes 6⁴ candidates. This would not be a problem if the four items to be chosen were independent of one another: in this case, this vote problem over a set of 6⁴ candidates would come down to four independent problems over sets of 6 candidates each, and any standard voting rule could be applied without difficulty. But things get complicated if voters express dependencies between variables, such as "I prefer white wine if one of the courses is fish and none is meat, red wine if one of the courses is meat and none is fish, and in the remaining cases I would like red or white wine equally", etc. Obviously, the prohibitive number of candidates makes it hard, or even practically impossible, to apply voting rules in a straightforward way. The computational complexity of some voting procedures when applied to compactly represented preferences on a combinatorial set of candidates has been investigated in [35]; however, this paper does not address the question of how the outcome can be computed in a reasonable amount of time. When the domain is large enough, computing the outcome by first generating the whole preference relations on the combinatorial domain from their compact representation is infeasible. A first way of coping with the problem consists in contenting oneself with an approximation of the outcome of the election, using incomplete and/or randomized algorithms, possibly making use of heuristics. This is an open research issue. A second way consists in decomposing the vote into local votes on individual variables (or small sets of variables), and gathering the results. However, as soon as variables are not preferentially independent, this is generally a bad idea: "multiple election paradoxes" [11] show that such a decomposition leads to suboptimal choices, and give real-life examples of such paradoxes, including simultaneous
referenda on related issues. We give here a very simple example of such a paradox. Suppose 100 voters have to decide whether to build a swimming pool or not (S), and whether to build a tennis court or not (T). 49 voters would prefer a swimming pool and no tennis court (ST̄), 49 voters prefer a tennis court and no swimming pool (S̄T), and 2 voters prefer to have both (ST). Voting separately on each of the issues gives the outcome ST, although it received only 2 votes out of 100 – and it might even be the most disliked outcome by 98 of the voters (for instance because building both raises local taxes too much). Now, voting separately did not work in the latter example because there is a preferential dependence between S and T. A simple idea then consists in exploiting preferential independencies between variables; this is all the more relevant as graphical languages, evoked in Section 2, are based on such structural properties. The question now is to what extent we may use these preferential independencies to decompose the computation of the outcome into smaller problems. However, again this does not work so easily: several well-known voting rules (such as plurality or Borda) cannot be decomposed, even when the preferential structure is common to all voters. Most of them fail to be decomposable even when all variables are mutually independent for all voters. We give below an example of this phenomenon. Consider 7 voters, a domain with two variables x and y, whose domains are respectively {x, x̄} and {y, ȳ}, and the following preference relations, where each agent expresses his preference relation by a CP-net [7] corresponding to the following fixed preferential structure: preference on x is unconditional and preference on y may depend on the value given to x.

3 voters: x̄ ≻ x;  x : ȳ ≻ y;  x̄ : y ≻ ȳ
2 voters: x ≻ x̄;  x : y ≻ ȳ;  x̄ : ȳ ≻ y
2 voters: x ≻ x̄;  x : ȳ ≻ y;  x̄ : y ≻ ȳ

For instance, the first CP-net says that the voters prefer x̄ to x unconditionally, prefer ȳ to y when x = x and y to ȳ when x = x̄. This corresponds to the following preference relations:

3 voters: x̄y ≻ x̄ȳ ≻ xȳ ≻ xy
2 voters: xy ≻ xȳ ≻ x̄ȳ ≻ x̄y
2 voters: xȳ ≻ xy ≻ x̄y ≻ x̄ȳ
The winner for the plurality rule is x̄y. Now, the sequential approach gives the following outcome: first, because 4 agents out of 7 unconditionally prefer x over x̄, applying plurality (as well as any other voting rule, since all reasonable voting rules coincide with the majority rule when there are only 2 candidates)
locally on x leads to choose x = x. Now, given this choice of x, 5 agents out of 7 prefer ȳ to y, which leads to choose y = ȳ. Thus, the sequential plurality winner is (x, ȳ) – whereas the direct plurality winner is (x̄, y). Such counterexamples can be found for many other voting rules. This raises the question of finding voting rules which can be decomposed into local rules (possibly under some domain restrictions), following the preferential independence structure of the voters' profiles – which is an open issue.
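The 7-voter example can be checked numerically. The following Python sketch (ours, purely illustrative) compares the direct plurality winner on the full domain with the sequential, variable-by-variable plurality winner, writing x̄ and ȳ as "nx" and "ny".

from collections import Counter

# Preference relations induced by the CP-nets, best first.
profile = (3 * [[("nx", "y"), ("nx", "ny"), ("x", "ny"), ("x", "y")]]
           + 2 * [[("x", "y"), ("x", "ny"), ("nx", "ny"), ("nx", "y")]]
           + 2 * [[("x", "ny"), ("x", "y"), ("nx", "y"), ("nx", "ny")]])

# Direct plurality on the full combinatorial domain: count top-ranked outcomes.
direct_winner = Counter(r[0] for r in profile).most_common(1)[0][0]

# Sequential plurality: first vote on x (unconditional preferences), then on y
# given the chosen x-value, reading each voter's conditional preference off the
# ranking restricted to that x-value.
x_winner = Counter(r[0][0] for r in profile).most_common(1)[0][0]
y_winner = Counter(next(y for (x, y) in r if x == x_winner)
                   for r in profile).most_common(1)[0][0]

print(direct_winner)            # ('nx', 'y'): the direct plurality winner
print((x_winner, y_winner))     # ('x', 'ny'): the sequential plurality winner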
4.3
Manipulation
Manipulating a voting rule consists, for a given voter or coalition of voters, in expressing an insincere preference profile so as to give a preferred candidate a better chance of being elected. Gibbard and Satterthwaite's theorem [29, 47] states that if the number of candidates is at least 3, then any nondictatorial voting procedure is manipulable for some profiles. Consider again the example above with the 7 voters³, and the plurality rule, whose outcome is x̄y. The two voters whose true preference is xy ≻ xȳ ≻ x̄ȳ ≻ x̄y have an incentive to report an insincere preference profile with xȳ on top, that is, to vote for xȳ – in that case, the winner is xȳ, which these two voters prefer to the winner obtained if they express their true preferences, namely x̄y. Since it is theoretically not possible to make manipulation impossible, one can try to make it less efficient or more difficult. Making manipulation less efficient can consist in making as little as possible of the others' votes known to the would-be manipulating voter – which may be difficult in some contexts. Making it more difficult to compute is a way followed recently by [4, 3, 15, 14, 17]. The line of argumentation is that if finding a successful manipulation is extremely hard computationally, then the voters will give up trying to manipulate and express sincere preferences. Note that, for once, the higher the complexity, the better. Randomization can play a role not only in making manipulation less efficient but also more complex to compute [17]. In a logical merging context (see Section 3), [27] investigate the manipulation of merging processes in propositional logic. The notion of a manipulation is however more complex to define (and several competing notions are indeed discussed), since the outcome of the process is a full preference relation.
4.4
Incomplete Knowledge and Communication Complexity
Given some incomplete description of the voters' preferences, is the outcome of the vote determined? If not, whose preferences are to be elicited, and which parts of them are relevant for computing the outcome? Assume, for example, that we have 4 candidates A, B, C, D and 9 voters, 4 of which vote C ≻ D ≻ A ≻ B, 2 of which vote A ≻ B ≻ D ≻ C and 2 of which vote B ≻ A ≻ C ≻ D, the last vote being still unknown. If the plurality rule is chosen then the outcome is already known (the winner is C) and there is no need to elicit the last voter's profile. If the Borda rule is used then the partial scores are A: 14, B: 10, C: 14, D: 10,
³ I thank Patrice Perny, from whom I borrowed this example.
therefore the outcome is not determined; however, we do not need to know the totality of the last vote, but only whether the last voter prefers A to C or C to A. This vote elicitation problem is investigated from the point of view of computational complexity in [16]. More generally, communication complexity is concerned with the amount of information that has to be communicated so that the outcome of the voting procedure is determined: since the outcome of a voting rule is sometimes determined even if not all votes are known, this raises the question of designing protocols for gathering the information needed so as to communicate as little information as possible [18]. For example, plurality only needs to know the top-ranked candidates, while plurality with run-off needs the top-ranked candidates and then, after communicating the names of the two finalists to the voters, which one they prefer between these two.
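The partial scores in the elicitation example above are easy to reproduce. The following Python sketch (ours) tallies the plurality and Borda scores of the eight known votes.

known_votes = (4 * [("C", "D", "A", "B")]
               + 2 * [("A", "B", "D", "C")]
               + 2 * [("B", "A", "C", "D")])

m = 4
borda = {c: 0 for c in "ABCD"}
plurality = {c: 0 for c in "ABCD"}
for vote in known_votes:
    plurality[vote[0]] += 1
    for rank, c in enumerate(vote):
        borda[c] += m - 1 - rank          # Borda weights 3, 2, 1, 0

print(plurality)   # C already has 4 first places; no single remaining vote can catch up
print(borda)       # {'A': 14, 'B': 10, 'C': 14, 'D': 10}: A versus C is still open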
5
Fair Division
Resource allocation of indivisible goods aims at assigning to each of a set of agents N some items from a finite set R, given the agents' preferences over all possible combinations of objects. For the sake of simplicity, we assume here that each resource must be given to one and only one agent⁴. In centralized allocation problems, the assignment is determined by a central authority to which the agents have given their preferences beforehand. As it stands, a centralized fair division problem is clearly a group decision making problem on a combinatorial domain, since the number of allocations grows exponentially with the number of resources. Since the description of a fair division problem needs the specification of the agents' preferences over the set of all possible combinations of objects, elicitation and compact representation issues are highly relevant here as well. Now, is a fair division problem a vote problem, where candidates are possible allocations? Not quite, because a usual assumption is made, stating that the preferences expressed by agents depend only on their own share, that is, agent i is indifferent between two allocations as soon as they give her the same share. Furthermore, as seen below, some specific notions for fair division problems, such as envy-freeness, have no counterpart in terms of voting. Two classes of criteria are considered in centralized resource allocation, namely efficiency and equity (or fairness). At one extremity, combinatorial auctions consist in finding an allocation maximizing the revenue of the seller, where this revenue is the sum, over all agents, of the price that the agent is willing to pay for the combination of objects he receives in the allocation (given that these price functions are not necessarily additive). Combinatorial auctions are a very
⁴ More generally, an object could be allocated to zero, one, or more agents of N. Even if most applications require the allocation to be preemptive (an object cannot be allocated to more than one agent), some problems do not require it. An example of such preemption-free problems is the exploitation of shared Earth observation satellites described in [36, 8].
specific, purely utilitarianistic class of allocation problems, in which considerations such as equity and fairness are not relevant. They have received enormous attention in the last few years (see [20]). Here we rather focus on allocation problems where fairness is involved – in which case we speak of fair division. The weakest efficiency requirement is that allocations should not be Pareto-dominated: an allocation π : N → 2^R is Pareto-efficient if and only if there is no allocation π′ such that (a) for all i, π′(i) ⪰i π(i) and (b) there exists an i such that π′(i) ≻i π(i). Pareto-efficiency is purely ordinal, unlike the utilitarianistic criterion, applicable only when preferences are numerical, under which an allocation π is preferred to an allocation π′ if and only if ∑_{i∈N} ui(π(i)) > ∑_{i∈N} ui(π′(i)). None of the latter criteria deals with fairness or equity. The most usual way of measuring equity is egalitarianism, which compares allocations with respect to the leximin ordering which, informally, works by comparing first the utilities of the least satisfied agents, and when these utilities coincide, compares the utilities of the next least satisfied agents, and so on (see for instance Chapter 1 of [41]). The leximin ordering does not need preferences to be numerical but only interpersonally comparable, that is, expressed on a common scale. A purely ordinal fairness criterion is envy-freeness: an allocation π is envy-free if and only if π(i) ⪰i π(j) holds for all i and all j ≠ i, or in informal terms, each agent is at least as happy with his share as with any other one's share. It is well known that there exist allocation problems for which there exists no allocation that is both Pareto-efficient and envy-free. In distributed allocation problems, agents negotiate, communicate, exchange or trade goods in a multilateral way. Works along this line have addressed the conditions of convergence towards allocations that are optimal from a social point of view, depending on the acceptability criteria used by agents when deciding whether or not to agree on a proposed exchange of resources, and on the constraints allowed on deals – see e.g. [46, 26, 24, 23, 12]. The notion of communication complexity is revisited in [25] and reinterpreted as the minimal sequence of deals between agents (where minimality is with respect to a criterion that may vary, and which takes into account the number of deals and the number of objects exchanged in deals). See [38] for a survey on these issues. Whereas social choice theory has developed an important literature on fair division, and artificial intelligence has devoted much work to the computational aspects of combinatorial auctions, computational issues in fair division have only recently started to be investigated. Two works addressing envy-freeness from a computational perspective are [37], who compute approximately envy-free solutions (by first making it a graded notion, suitable for optimization), and [9], who relate the search for envy-free and efficient allocations to some well-known problems in knowledge representation. A more general review of complexity results for centralized allocation problems is in [8]. Complexity issues for distributed allocation problems are addressed in [24].
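As a small illustration of the ordinal fairness criterion just defined, the following Python sketch tests envy-freeness of an allocation under invented additive utilities; it is not tied to any of the cited frameworks, and the agent names, items and numbers are ours.

utilities = {  # invented additive utilities u_i(object)
    "agent1": {"o1": 4, "o2": 1, "o3": 2},
    "agent2": {"o1": 3, "o2": 3, "o3": 1},
}

def share_value(agent, share):
    return sum(utilities[agent][o] for o in share)

def is_envy_free(allocation):
    """pi(i) is at least as good for i as pi(j), for all i and all j != i."""
    return all(share_value(i, allocation[i]) >= share_value(i, allocation[j])
               for i in allocation for j in allocation if j != i)

allocation = {"agent1": {"o1"}, "agent2": {"o2", "o3"}}
print(is_envy_free(allocation))
# agent1 values its own share at 4 and agent2's share at 3; agent2 values its own
# share at 4 and agent1's share at 3, so this allocation is envy-free.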
Clearly, many models developed in the AI community should have an impact on modelling, representing compactly and solving fair division problems. Moreover, some issues addressed for voting problems and/or combinatorial auctions, such as the computational aspects of elicitation and manipulation and the role of incomplete knowledge, are still to be investigated for fair division problems.
6
Conclusion
There are many more issues for further research than those that we have briefly evoked. Models and techniques from artificial intelligence should play an important role, for (at least) the following reasons:
– the importance of ordinal and qualitative models in preference aggregation, voting and fair division (no need to recall that the AI research community has contributed a lot to the study of these models). Ordinality is perhaps even more relevant in social choice than in decision under uncertainty and multicriteria decision making, due to equity criteria and the difficulty of interpersonal comparison of preferences.
– the role of incomplete knowledge, and the need to reason about agents' beliefs, especially in utility elicitation and communication complexity issues. Research issues include various ways of applying voting and allocation procedures under incomplete knowledge, and the study of communication protocols for these issues, which may call for multiagent models of beliefs, including mutual and common belief (see e.g. [28]). Models and algorithms for group decision under uncertainty are a promising topic as well.
– the need for compact (logical and graphical) languages for preference elicitation and representation, and for measuring their spatial efficiency. These languages need to be extended to multiple agents (such as in [44]), and aggregation should be performed directly in the language (e.g., aggregating CP-nets into a new CP-net without generating the preference relations explicitly).
– the high complexity of the tasks involved leads to interesting algorithmic problems such as finding tractable subclasses, efficient algorithms and approximation methods, using classical AI and OR techniques.
– one more relevant issue is sequential group decision making and planning with multiple agents. For instance, [42] address the search for an optimal path for several agents (or criteria), with respect to an egalitarianistic aggregation policy.
– measuring and localizing inconsistency among a group of agents – especially when preferences are represented in a logical form – could be investigated by extending inconsistency measures (see [32]) to multiple agents.
References
1. H. Andreka, M. Ryan, and P.-Y. Schobbens. Operators and laws for combining preference relations. Journal of Logic and Computation, 12(1):13–53, 2002.
2. K. Arrow. Social Choice and Individual Values. John Wiley and Sons, 1951. Revised edition 1963.
3. J.J. Bartholdi and J.B. Orlin. Single transferable vote resists strategic voting. Social Choice and Welfare, 8(4):341–354, 1991.
4. J.J. Bartholdi, C.A. Tovey, and M.A. Trick. The computational difficulty of manipulating an election. Social Choice and Welfare, 6(3):227–241, 1989.
5. J.J. Bartholdi, C.A. Tovey, and M.A. Trick. Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6(3):157–165, 1989.
6. S. Benferhat, D. Dubois, S. Kaci, and H. Prade. Bipolar representation and fusion of preference in the possibilistic logic framework. In Proceedings of KR2002, pages 421–429, 2002.
7. C. Boutilier, R. Brafman, C. Domshlak, H. Hoos, and D. Poole. CP-nets: a tool for representing and reasoning with conditional ceteris paribus statements. Journal of Artificial Intelligence Research, 21:135–191, 2004.
8. S. Bouveret, H. Fargier, J. Lang, and M. Lemaître. Allocation of indivisible goods: a general model and some complexity results. In Proceedings of AAMAS 05, 2005. Long version available at http://www.irit.fr/recherches/RPDMP/persos/JeromeLang/papers/aig.pdf.
9. S. Bouveret and J. Lang. Efficiency and envy-freeness in fair division of indivisible goods: logical representation and complexity. In Proceedings of IJCAI-05, 2005.
10. S. Brams and P. Fishburn. Voting procedures. In K. Arrow, A. Sen, and K. Suzumura, editors, Handbook of Social Choice and Welfare, chapter 4. Elsevier, 2004.
11. S. Brams, D.M. Kilgour, and W. Zwicker. The paradox of multiple elections. Social Choice and Welfare, 15:211–236, 1998.
12. Y. Chevaleyre, U. Endriss, and N. Maudet. On maximal classes of utility functions for efficient one-to-one negotiation. In Proceedings of IJCAI-2005, 2005.
13. S. Chopra, A. Ghose, and T. Meyer. Social choice theory, belief merging, and strategy-proofness. Int. Journal on Information Fusion, 2005. To appear.
14. V. Conitzer, J. Lang, and T. Sandholm. How many candidates are required to make an election hard to manipulate? In Proceedings of TARK-03, pages 201–214, 2003.
15. V. Conitzer and T. Sandholm. Complexity of manipulating elections with few candidates. In Proceedings of AAAI-02, pages 314–319, 2002.
16. V. Conitzer and T. Sandholm. Vote elicitation: complexity and strategy-proofness. In Proceedings of AAAI-02, pages 392–397, 2002.
17. V. Conitzer and T. Sandholm. Universal voting protocols to make manipulation hard. In Proceedings of IJCAI-03, 2003.
18. V. Conitzer and T. Sandholm. Communication complexity of common voting rules. In Proceedings of EC-05, 2005.
19. S. Coste-Marquis, J. Lang, P. Liberatore, and P. Marquis. Expressive power and succinctness of propositional languages for preference representation. In Proceedings of KR-2004, pages 203–212, 2004.
20. P. Cramton, Y. Shoham, and R. Steinberg, editors. Combinatorial Auctions. MIT Press, 2005. To appear.
21. A. Davenport and J. Kalagnanam. A computational study of the Kemeny rule for preference aggregation. In Proceedings of AAAI-04, pages 697–702, 2004.
22. D. Dubois, H. Fargier, and P. Perny. On the limitations of ordinal approaches to decision-making. In Proceedings of KR2002, pages 133–146, 2002.
23. P. Dunne. Extremal behaviour in multiagent contract negotiation. Journal of Artificial Intelligence Research, 23:41–78, 2005.
24. P. Dunne, M. Wooldridge, and M. Laurence. The complexity of contract negotiation. Artificial Intelligence, 164(1-2):23–46, 2005.
25. U. Endriss and N. Maudet. On the communication complexity of multilateral trading: Extended report. Journal of Autonomous Agents and Multiagent Systems, 2005. To appear.
26. U. Endriss, N. Maudet, F. Sadri, and F. Toni. On optimal outcomes of negotiations over resources. In Proceedings of AAMAS-03, 2003.
27. P. Everaere, S. Konieczny, and P. Marquis. On merging strategy-proofness. In Proceedings of KR-2004, pages 357–368, 2004.
28. R. Fagin, J. Halpern, Y. Moses, and M. Vardi. Reasoning about Knowledge. MIT Press, 1995.
29. A. Gibbard. Manipulation of voting schemes. Econometrica, 41:587–602, 1973.
30. E. Hemaspaandra, L. Hemaspaandra, and J. Rothe. Exact analysis of Dodgson elections: Lewis Carroll's 1876 system is complete for parallel access to NP. JACM, 44(6):806–825, 1997.
31. E. Hemaspaandra, H. Spakowski, and J. Vogel. The complexity of Kemeny elections. Technical report, Jenaer Schriften zur Mathematik und Informatik, October 2003.
32. A. Hunter and S. Konieczny. Approaches to measuring inconsistent information, pages 189–234. Springer LNCS 3300, 2004.
33. S. Konieczny and R. Pino Pérez. Propositional belief base merging or how to merge beliefs/goals coming from several sources and some links with social choice theory. European Journal of Operational Research, 160(3):785–802, 2005.
34. C. Lafage and J. Lang. Logical representation of preferences for group decision making. In Proceedings of KR2000, pages 457–468, 2000.
35. J. Lang. Logical preference representation and combinatorial vote. Annals of Mathematics and Artificial Intelligence, 42(1):37–71, 2004.
36. M. Lemaître, G. Verfaillie, and N. Bataille. Exploiting a common property resource under a fairness constraint: a case study. In Proceedings of IJCAI-99, pages 206–211, 1999.
37. R. Lipton, E. Markakis, E. Mossel, and A. Saberi. On approximately fair allocations of indivisible goods. In Proceedings of EC'04, 2004.
38. AgentLink technical forum group on multiagent resource allocation. http://www.doc.ic.ac.uk/~ue/MARA/, 2005.
39. P. Maynard-Zhang and D. Lehmann. Representing and aggregating conflicting beliefs. Journal of Artificial Intelligence Research, 19:155–203, 2003.
40. T. Meyer, A. Ghose, and S. Chopra. Social choice, merging, and elections. In Proceedings of ECSQARU-01, pages 466–477, 2001.
41. H. Moulin. Axioms of Cooperative Decision Making. Cambridge University Press, 1988.
42. P. Perny and O. Spanjaard. On preference-based search in state space graphs. In Proceedings of AAAI-02, pages 751–756, 2002.
43. M.S. Pini, F. Rossi, K. Venable, and T. Walsh. Aggregating partially ordered preferences: possibility and impossibility results. In Proceedings of TARK-05, 2005.
44. F. Rossi, K. Venable, and T. Walsh. mCP nets: representing and reasoning with preferences of multiple agents. In Proceedings of AAAI-04, pages 729–734, 2004.
45. J. Rothe, H. Spakowski, and J. Vogel. Exact complexity of the winner for Young elections. Theory of Computing Systems, 36(4):375–386, 2003.
46. T. Sandholm. Contract types for satisficing task allocation: I theoretical results. In Proc. AAAI Spring Symposium: Satisficing Models, 1998.
47. M. Satterthwaite. Strategyproofness and Arrow's conditions. Journal of Economic Theory, 10:187–217, 1975.
Nonlinear Deterministic Relationships in Bayesian Networks Barry R. Cobb and Prakash P. Shenoy University of Kansas School of Business, 1300 Sunnyside Ave., Summerfield Hall, Lawrence, KS 66045-7585, USA {brcobb, pshenoy}@ku.edu
Abstract. In a Bayesian network with continuous variables containing one or more variables that are conditionally deterministic functions of their continuous parents, the joint density function does not exist. Conditional linear Gaussian distributions can handle such cases when the deterministic function is linear and the continuous variables have a multi-variate normal distribution. In this paper, operations required for performing inference with nonlinear conditionally deterministic variables are developed. We perform inference in networks with nonlinear deterministic variables and non-Gaussian continuous variables by using piecewise linear approximations to nonlinear functions and modeling probability distributions with mixtures of truncated exponentials (MTE) potentials.
1
Introduction
An important class of Bayesian networks with continuous variables are those that have conditionally deterministic variables (a variable that is a deterministic function of its parents). Conditional linear Gaussian (CLG) distributions (Lauritzen and Jensen 2001) can handle such cases when the deterministic function is linear and variables are normally distributed. In models with nonlinear deterministic relationships and non-Gaussian distributions, Monte Carlo methods may be required to obtain an approximate solution. General purpose solution algorithms, e.g., the Shenoy-Shafer architecture, have not been adapted to such models, primarily because the joint density for the variables in models with deterministic variables does not exist and these methods involve propagation of probability densities. Approximate inference in Bayesian networks with continuous variables can be performed using mixtures of truncated exponentials (MTE) potentials (Moral et al. 2001). Cobb and Shenoy (2004) define operations which allow the distributions of linear deterministic variables to be determined when the continuous variables are modeled with MTE potentials. This allows MTE potentials to be used for inference in any continuous CLG model, as well as other models that have non-Gaussian and conditionally deterministic variables. This paper extends these methods to continuous Bayesian networks with nonlinear deterministic variables.
The remainder of this paper is organized as follows. Section 2 introduces notation and definitions used throughout the paper. Section 3 describes a method for approximating a nonlinear function with a piecewise linear function. Section 4 defines operations required for inference in Bayesian networks with conditionally deterministic variables. Section 5 contains examples of determining the distributions of nonlinear conditionally deterministic variables. Section 6 summarizes and states directions for future research. This paper is based on a longer, unpublished working paper (Cobb and Shenoy 2005).
2
Notation and Definitions
This section contains notation and definitions used throughout the paper.
2.1
Notation
Random variables will be denoted by capital letters, e.g., A, B, C. Sets of variables will be denoted by boldface capital letters, e.g., X. All variables are assumed to take values in continuous state spaces. If X is a set of variables, x is a configuration of specific states of those variables. The continuous state space of X is denoted by ΩX. In graphical representations, continuous nodes are represented by double-border ovals, whereas nodes that are deterministic functions of their parents are represented by triple-border ovals.
2.2
Mixtures of Truncated Exponentials
A mixture of truncated exponentials (MTE) (Moral et al. 2001) potential has the following definition.

MTE potential. Let X = (X1, . . . , Xn) be an n-dimensional random variable. A function φ : ΩX → R+ is an MTE potential if one of the next two conditions holds:

1. The potential φ can be written as

φ(x) = a0 + Σ_{i=1}^{m} ai exp{ Σ_{j=1}^{n} bi^(j) xj }     (1)

for all x ∈ ΩX, where ai, i = 0, . . . , m, and bi^(j), i = 1, . . . , m, j = 1, . . . , n, are real numbers.

2. The domain of the variables, ΩX, is partitioned into hypercubes {ΩX1, . . . , ΩXk} such that φ is defined as

φ(x) = φi(x)   if x ∈ ΩXi, i = 1, . . . , k,     (2)

where each φi, i = 1, . . . , k, can be written in the form of equation (1) (i.e. each φi is an MTE potential on ΩXi).
In the definition above, k is the number of pieces and m is the number of exponential terms in each piece of the MTE potential. We will refer to φi as the i-th piece of the MTE potential φ and ΩXi as the portion of the domain of X approximated by φi. In this paper, all MTE potentials are equal to zero in unspecified regions.
2.3
Conditional Mass Functions (CMF)
When relationships between continuous variables are deterministic, the joint probability density function (PDF) does not exist. If Y is a deterministic function of the variables in X, i.e. y = g(x), the conditional mass function (CMF) for {Y | x} is defined as

pY|x = 1{y = g(x)} ,     (3)

where 1{A} is the indicator function of the event A, i.e. 1{A}(B) = 1 if B = A and 0 otherwise.
3
Piecewise Linear Approximations to Nonlinear Functions
3.1
Dividing the Domain
Suppose that a random variable Y is a deterministic function of a single variable X, Y = g(X). The function Y = g(X) can be approximated by a piecewise linear function. Define a set of ordered points x = (x0, ..., xn) in the domain of X, with x0 and xn defined as the endpoints of the domain. A corresponding set of points y = (y0, ..., yn) is determined by calculating the value of the function y = g(x) at each point xi, i = 0, ..., n. The piecewise linear function (with n pieces) approximating Y = g(X) is the function Y^(n) = g^(n)(X) defined as follows:

g^(n)(x) =
  y0 − ((y1 − y0)/(x1 − x0)) · x0 + ((y1 − y0)/(x1 − x0)) · x                             if x0 ≤ x < x1
  y1 − ((y2 − y1)/(x2 − x1)) · x1 + ((y2 − y1)/(x2 − x1)) · x                             if x1 ≤ x < x2
  ...
  y_{n−2} − ((y_{n−1} − y_{n−2})/(x_{n−1} − x_{n−2})) · x_{n−2} + ((y_{n−1} − y_{n−2})/(x_{n−1} − x_{n−2})) · x   if x_{n−2} ≤ x < x_{n−1}
  y_{n−1} − ((y_n − y_{n−1})/(x_n − x_{n−1})) · x_{n−1} + ((y_n − y_{n−1})/(x_n − x_{n−1})) · x                   if x_{n−1} ≤ x ≤ x_n .     (4)

Let g_i^(n)(x) denote the i-th piece of the piecewise linear function in (4). We refer to g^(n) as an n-point (piecewise linear) approximation of g. In this paper, all piecewise linear functions equal zero in unspecified regions. If a variable is a deterministic function of multiple variables, the definition in (4) can be extended by dividing the domain of the parent variables into hypercubes and creating an approximation of each function in each hypercube.
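The construction in (4) is straightforward to implement. The following Python sketch (ours, not taken from the paper) evaluates an n-point approximation from the breakpoints (xi, yi) and follows the paper's convention that the approximation is zero outside [x0, xn]; the example call reuses breakpoints reported later in Section 5.1.

def piecewise_linear(xs, ys):
    """Return g_n(x), the piecewise linear interpolant through (x_i, y_i),
    equal to zero outside [x_0, x_n]."""
    def g_n(x):
        if x < xs[0] or x > xs[-1]:
            return 0.0
        # find the piece [x_{i-1}, x_i] containing x
        i = next(k for k in range(1, len(xs)) if x <= xs[k])
        slope = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + slope * (x - xs[i - 1])
    return g_n

# Example: an 8-point approximation of y = x**3 on [-3, 3].
xs = [-3.0, -2.4, -1.74, -1.02, 0.0, 1.02, 1.74, 2.4, 3.0]
g8 = piecewise_linear(xs, [x ** 3 for x in xs])
print(g8(2.0))   # about 8.64, versus the exact value 2**3 = 8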
3.2
Algorithm for Splitting Regions
An initial piecewise approximation is defined (minimally) by splitting the domain of X at extreme points and points of change in concavity and convexity in the function y = g(x), and at endpoints of pieces of the MTE potential for X. This initial set of bounds on the pieces of the approximation is defined as x = (xS0, ..., xSℓ). The absolute value of the difference between the approximation and the function will increase, then eventually decrease within each region of the approximation. This is due to the fact that the approximation in (4) always lies "inside" the actual function. Additional pieces may be added to improve the fit between the nonlinear function and the piecewise approximation. Define an allowable error bound, ǫ, for the distance between the function g(x) and its piecewise linear approximation. Define an interval η used to select the next point at which to test the distance between g(x) and the piecewise approximation. The piecewise linear approximation in (4) is completely defined by the sets of points x = (x0, ..., xn) and y = (y0, ..., yn). The following procedure in pseudo-code determines the sets of points x and y which define the piecewise linear approximation when a deterministic variable has one parent.

INPUT: xS0, ..., xSℓ, g(x), ǫ, η
OUTPUT: x = (x0, ..., xn), y = (y0, ..., yn)
INITIALIZATION
  x ← {(xS0, ..., xSℓ)}   /* Endpoints, extrema, and inflection points in ΩX */
  y ← {(g(xS0), ..., g(xSℓ))}
  i = 0   /* Index for the intervals in the domain of X */
DO WHILE i < |x|   /* Continue until all intervals are refined */
  j = 1   /* Index for number of test points in an interval */
  a = 0   /* Previous distance between g(x) and approximation */
  b = 0   /* Current distance between g(x) and approximation */
  FOR j = 1 : (xi+1 − xi)/η
    b = g(xi + (j − 1)·η) − ( yi − ((yi+1 − yi)/(xi+1 − xi))·xi + ((yi+1 − yi)/(xi+1 − xi))·(xi + (j − 1)·η) )
    IF |b| ≥ a   /* Compare current and previous distance */
      a = |b|   /* Distance increased; test next point */
    ELSE
      BREAK   /* Distance did not increase; break loop */
    END IF
  END FOR
  IF a > ǫ   /* Test max. distance versus allowable error bound */
    x ← Rank(x ∪ {xi + (j − 2)·η})   /* Update x and re-order */
    y ← Rank(y ∪ {g(xi + (j − 2)·η)})   /* Update y and re-order */
  END IF
  i = i + 1
END DO
The algorithm refines the piecewise approximation to the function y = g(x) until the maximum distance between the function and the piecewise approximation is no larger than the specified error bound. A smaller error bound, ǫ, produces more pieces in the linear approximation and a closer fit in the theoretical and approximate density functions for the deterministic variable (see, e.g., Section 5.1 of (Cobb and Shenoy 2005)). A closer approximation using more pieces, however, requires greater computational expense in the inference process.
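For concreteness, the following Python sketch mirrors the refinement procedure above under one simplification: instead of stopping as soon as the gap between g and the approximation stops increasing, it scans every test point in the interval. The function names and the example call are ours, not part of the paper.

def refine_breakpoints(g, xs_init, eps, eta):
    """Insert breakpoints until the piecewise linear approximation of g is
    within eps of g at every test point spaced eta apart inside each interval."""
    xs = sorted(xs_init)
    i = 0
    while i < len(xs) - 1:
        lo, hi = xs[i], xs[i + 1]
        slope = (g(hi) - g(lo)) / (hi - lo)
        # test points strictly inside the interval
        tests = [lo + k * eta for k in range(1, int((hi - lo) / eta))]
        gaps = [(abs(g(t) - (g(lo) + slope * (t - lo))), t) for t in tests]
        worst_gap, worst_t = max(gaps) if gaps else (0.0, None)
        if worst_gap > eps:
            xs.insert(i + 1, worst_t)      # split and re-examine the left part
        else:
            i += 1
    return xs, [g(x) for x in xs]

xs, ys = refine_breakpoints(lambda x: x ** 3, [-3.0, 0.0, 3.0], eps=1.0, eta=0.06)
print(xs)   # comparable to the nine breakpoints reported in Section 5.1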
4
Operations with Linear Deterministic Variables
Consider a random variable Y which is a monotonic function, Y = g(X), of a random variable X. The joint cumulative distribution function (CDF) for {X, Y} is given by FX,Y(x, y) = FX(g⁻¹(y)) if g(X) is monotonically increasing and FX,Y(x, y) = FX(x) − FX(g⁻¹(y)) if g(X) is monotonically decreasing. The CDF of Y is determined as FY(y) = lim_{x→∞} FX,Y(x, y). Thus, FY(y) = FX(g⁻¹(y)) if g(X) is monotonically increasing and FY(y) = 1 − FX(g⁻¹(y)) if g(X) is monotonically decreasing. By differentiating the CDF of Y, the PDF of Y is obtained as

fY(y) = (d/dy) FY(y) = fX(g⁻¹(y)) · | (d/dy)(g⁻¹(y)) | ,     (5)

when Y = g(X) is monotonic. If Y is a conditionally deterministic linear function of X, i.e. Y = g(x) = ax + b, a ≠ 0, the following operation can be used to determine the marginal PDF for Y:

fY(y) = (1/|a|) · fX((y − b)/a) .     (6)

The following definition extends the operation defined in (6) to accommodate piecewise linear functions. Suppose Y is a conditionally deterministic piecewise linear function of X, Y = g(X), where gi(x) = ai x + bi, with each ai ≠ 0, i = 1, ..., n. Assume the PDF for X is an MTE potential φ with k pieces, where the j-th piece is denoted φj for j = 1, ..., k. Let nj denote the number of linear segments of g that intersect with the domain of φj and notice that n = n1 + . . . + nj + . . . + nk. The CMF pY|x represents the conditionally deterministic relationship of Y on X. The following definition will be used to determine the marginal PDF for Y (denoted χ = (φ ⊗ pY|x)↓Y):

χ(y) = (φ ⊗ pY|x)↓Y(y) =
  1/a1 · φ1((y − b1)/a1)              if y0 ≤ y < y1
  1/a2 · φ1((y − b2)/a2)              if y1 ≤ y < y2
  ...
  1/a_{n1} · φ1((y − b_{n1})/a_{n1})  if y_{n1−1} ≤ y < y_{n1}
  ...
  1/an · φk((y − bn)/an)              if y_{n−1} ≤ y < yn ,     (7)
with φj being the piece of φ whose domain is a superset of the domain of gi. The normalization constant for each piece of the resulting MTE potential ensures that the CDF of the resulting MTE potential matches the CDF of the theoretical MTE potential at the endpoints of the domain of the resulting PDF. From Theorem 3 in (Cobb and Shenoy 2004), it follows that the class of MTE potentials is closed under the operation in (7); thus, the operation can be used for inference in Bayesian networks with deterministic variables. Note that the class of MTE potentials is not closed under the operation in (5), which is why we approximate nonlinear functions with piecewise linear functions.
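As an illustration of the linear change of variables in (6), here is a small Python sketch; the standard normal density stands in for an MTE piece, the piece y = 1.040x is taken from the later example, and all names are ours.

import math

def transform_piece(f_x, a, b):
    """Density contributed to Y on the piece y = a*x + b: (1/|a|) * f_X((y - b)/a)."""
    return lambda y: (1.0 / abs(a)) * f_x((y - b) / a)

# Standard normal density as a stand-in for an MTE piece of f_X.
f_x = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
f_y_piece = transform_piece(f_x, a=1.040, b=0.0)
print(f_y_piece(0.5))   # density of Y at y = 0.5 contributed by this piece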
5
Examples
The following examples illustrate determination of the distributions of random variables which are nonlinear deterministic functions of their parents, as well as inference in a simple Bayesian network with a nonlinear deterministic variable.
5.1
Example One
Suppose X is normally distributed with a mean of 0 and a standard deviation of 1, i.e. X ∼ N(0, 1²), and Y is a conditionally deterministic function of X, y = g(x) = x³. The distribution of X is modeled with a two-piece, three-term MTE potential as defined in (Cobb et al. 2003). The MTE potential is denoted by φ and its two pieces are denoted φ1 and φ2, with ΩX1 = {x : −3 ≤ x < 0} and ΩX2 = {x : 0 ≤ x ≤ 3}.

Piecewise Approximation. Over the region [−3, 3], the function y = g(x) = x³ has an inflection point at x = 0, which is also an endpoint of a piece of the MTE approximation to the PDF of X. To initialize the algorithm in Sect. 3.2, we define x = (xS0, xS1, xS2) = (−3, 0, 3) and y = (yS0, yS1, yS2) = (−27, 0, 27). For this example, define ǫ = 1 and η = 0.06 (which divides the domain of X into 100 equal intervals). The procedure in Sect. 3.2 terminates after finding sets of points x = (x0, ..., x8) and y = (y0, ..., y8) as follows:

x = (−3.00, −2.40, −1.74, −1.02, 0.00, 1.02, 1.74, 2.40, 3.00) ,
y = (−27.000, −13.824, −5.268, −1.061, 0.000, 1.061, 5.268, 13.824, 27.000) .

The function representing the eight-point linear approximation is defined as

g^(8)(x) =
  21.960x + 38.880   if −3.00 ≤ x < −2.40
  12.964x + 17.289   if −2.40 ≤ x < −1.74
  5.843x + 4.898     if −1.74 ≤ x < −1.02
  1.040x             if −1.02 ≤ x < 0
  1.040x             if 0 ≤ x < 1.02
  5.843x − 4.898     if 1.02 ≤ x < 1.74
  12.964x − 17.289   if 1.74 ≤ x < 2.40
  21.960x − 38.880   if 2.40 ≤ x ≤ 3.00 .     (8)
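The coefficients in (8) can be checked directly from the breakpoints. The following short Python sketch (ours, purely illustrative) recomputes each slope and intercept from consecutive breakpoints (xi, yi) with yi = xi³.

xs = [-3.0, -2.40, -1.74, -1.02, 0.0, 1.02, 1.74, 2.40, 3.0]
ys = [x ** 3 for x in xs]
for i in range(1, len(xs)):
    slope = (ys[i] - ys[i - 1]) / (xs[i] - xs[i - 1])
    intercept = ys[i - 1] - slope * xs[i - 1]
    print(f"[{xs[i-1]:+.2f}, {xs[i]:+.2f}): {slope:.3f} x + {intercept:+.3f}")
# The first line prints 21.960 x + 38.880, matching the first piece of g^(8).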
Fig. 1. The piecewise linear approximation g^(8)(x) overlayed on the function y = g(x)
The piecewise linear approximation g^(8)(x) is shown in Fig. 1, overlayed on the function y = g(x). The conditional distribution for Y is represented by a CMF as follows: ψ^(8)(x, y) = pY|x(y) = 1{y = g^(8)(x)}.

Determining the Distribution of Y. The marginal distribution for Y is determined by calculating χ^(8) = (φ ⊗ ψ^(8))↓Y. The MTE potential for Y is

χ^(8)(y) =
  (1/21.960) · φ1(0.0455y − 1.7705)   if −27.000 ≤ y < −13.824
  (1/12.964) · φ1(0.0771y − 1.3336)   if −13.824 ≤ y < −5.268
  (1/5.843) · φ1(0.1712y − 0.8384)    if −5.268 ≤ y < −1.061
  (1/1.040) · φ1(0.9612y)             if −1.061 ≤ y ≤ 0.000
  (1/1.040) · φ2(0.9612y)             if 0.000 ≤ y < 1.061
  (1/5.843) · φ2(0.1712y + 0.8384)    if 1.061 ≤ y < 5.268
  (1/12.964) · φ2(0.0771y + 1.3336)   if 5.268 ≤ y < 13.824
  (1/21.960) · φ2(0.0455y + 1.7705)   if 13.824 ≤ y ≤ 27.000 .
The CDF associated with the eight-piece MTE approximation is shown in Fig. 2, overlayed on the CDF associated with the PDF from the transformation

fY(y) = fX(g1⁻¹(y)) · (d/dy)(g1⁻¹(y)) .     (9)
Fig. 2. CDF for the eight-piece MTE approximation to the distribution for Y overlayed on the CDF created using the transformation in (9)
5.2
Example Two
The Bayesian network in this example (see Fig. 3) contains one variable (X) with a non-Gaussian potential, one variable (Z) with a Gaussian potential, and one variable (Y) which is a nonlinear deterministic function of its parent. The probability distribution for X is a beta distribution, i.e. ℒ(X) ∼ Beta(α = 2.7, β = 1.3). The PDF for X is approximated (using the methods described in (Cobb et al. 2003))
Fig. 3. The Bayesian network for Example Two
Fig. 4. The MTE potential for X overlayed on the actual Beta(2.7, 1.3) distribution
Nonlinear Deterministic Relationships in Bayesian Networks
35
0.5 0.4 0.3 0.2 0.1
0.2
0.4
0.6
0.8
1
Fig. 5. The piecewise linear approximation g (5) (x) overlayed on the function g(x) in Example Two
by a three-piece, two-term MTE potential. The MTE potential φ for X is shown graphically in Figure 4, overlayed on the actual Beta(2.7, 1.3) distribution. The variable Y is a conditionally deterministic function of X, y = g(x) = −0.5x3 + x2 . The five-point linear approximation is characterized by points x = (x0 , ..., x5 )=(0, 0.220, 0.493, 0.667, 0.850, 1) and y = (y0 , ..., y5 )=(0, 0.043, 0.183, 0.296, 0.415, 0.500). The points x0 , x2 , x3 , and x5 are defined according to the endpoints of the pieces of φ. The point x4 is an inflection point in the function g(x) and the point x1 = 0.220 is found by the algorithm in Sect. 3.2 with ǫ = 0.015 and η = 0.01. The function representing the five-piece linear approximation (denoted as g (5) ) is shown graphically in Fig. 5 overlayed on g(x). The conditional distribution for Y given X is represented by a CMF as follows: ψ (5) (x, y) = pY |x (y) = 1{y = g (5) (x)} . The probability distribution for Z is defined as £(Z | y) ∼ N (2y + 1, 1) and is approximated by χ, which is a two-piece, three-term MTE approximation to the normal distribution (Cobb et al. 2003). 5.3
Computing Messages
The join tree for the example problem is shown in Fig. 6. The messages required to calculate posterior marginals for each variable in the network without evidence are as follows:

1) φ from {X} to {X, Y}
2) (φ ⊗ ψ⁽⁵⁾)↓Y from {X, Y} to {Y} and from {Y} to {Y, Z}
3) ((φ ⊗ ψ⁽⁵⁾)↓Y ⊗ χ)↓Z from {Y, Z} to {Z}
Fig. 6. The join tree for the example problem, connecting {X}, {X, Y}, {Y}, {Y, Z} and {Z}, with the potentials φ, ψ⁽⁵⁾ and χ attached
5.4 Posterior Marginals
The posterior marginal distribution for Y is the message sent from {X, Y} to {Y} and is calculated using the operation in (7). The expected value and variance of this distribution are calculated as 0.3042 and 0.0159, respectively. The posterior marginal distribution for Z is the message sent from {Y, Z} to {Z} and is calculated by point-wise multiplication of MTE functions, followed by marginalization (see the operations defined in (Moral et al. 2001)). The expected value and variance of this distribution are calculated as 1.6084 and 1.0455, respectively.
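As a rough cross-check of these values, the exact model can be simulated directly. The Python sketch below uses plain Monte Carlo on the original beta/Gaussian model, so its estimates differ slightly from the MTE-based figures reported above, which rely on the piecewise-linear and MTE approximations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the exact model of Example Two (no MTE approximation).
x = rng.beta(2.7, 1.3, size=1_000_000)   # X ~ Beta(2.7, 1.3)
y = -0.5 * x**3 + x**2                    # Y = g(X), deterministic
z = rng.normal(2 * y + 1, 1.0)            # Z | y ~ N(2y + 1, 1)

print(y.mean(), y.var())  # compare with the reported 0.3042 and 0.0159
print(z.mean(), z.var())  # compare with the reported 1.6084 and 1.0455
```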
5.5 Entering Evidence
Suppose we observe evidence that Z = 0 and let e_Z denote this evidence. Define ϕ = (φ ⊗ ψ⁽⁵⁾)↓Y and ψ′⁽⁵⁾(x, y) = 1{x = g⁽⁵⁾⁻¹(y)} as the potentials resulting from the reversal of the arc between X and Y (Cobb and Shenoy 2004). The evidence e_Z is passed from {Z} to {Y, Z} in the join tree, where the existing potential is restricted to χ(y, 0). This likelihood potential is passed from {Y, Z} to {Y} in the join tree. Denote the unnormalized posterior marginal distribution for Y as ξ′(y) = ϕ(y) · χ(y, 0). The normalization constant is calculated as K = ∫_y (ϕ(y) · χ(y, 0)) dy = 0.0670. Thus, the normalized marginal distribution for Y is found as ξ(y) = K⁻¹ · ξ′(y).
Fig. 7. The posterior marginal CDF for Y considering the evidence Z = 0
Fig. 8. The posterior marginal CDF for X considering the evidence (Z = 0)
The expected value and variance of this distribution (whose CDF is displayed in Fig. 7) are calculated as 0.2560 and 0.0167, respectively. Using the operation in (7), we determine the posterior marginal distribution for X as ϑ = (ξ ⊗ ψ′⁽⁵⁾)↓X. The expected value and variance of this distribution are calculated as 0.5942 and 0.0480, respectively. The posterior marginal CDF for X considering the evidence is shown graphically in Fig. 8.
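These posterior figures can also be cross-checked by likelihood weighting on the exact model: weight each prior sample by the Gaussian likelihood of the observed Z = 0. This is only a Monte Carlo sanity check, not the MTE computation of the paper, so small discrepancies are expected.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.beta(2.7, 1.3, size=1_000_000)
y = -0.5 * x**3 + x**2
w = np.exp(-0.5 * (0.0 - (2 * y + 1)) ** 2)  # N(0; 2y+1, 1) up to a constant
w /= w.sum()

ey, ex = np.sum(w * y), np.sum(w * x)
print(ey, np.sum(w * (y - ey) ** 2))  # compare with the reported 0.2560 and 0.0167
print(ex, np.sum(w * (x - ex) ** 2))  # compare with the reported 0.5942 and 0.0480
```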
6 Summary and Conclusions
This paper has described the operations required for inference in Bayesian networks containing variables that are nonlinear deterministic functions of their continuous parents. Since the joint PDF for a network with deterministic variables does not exist, the required operations are based on the method of convolutions from probability theory. By estimating nonlinear functions with piecewise linear approximations, we ensure that the class of MTE potentials is closed under these operations. The Bayesian networks in this paper contain only continuous variables. In future work, we plan to design a general inference algorithm for Bayesian networks that contain a mixture of discrete and continuous variables, with some continuous variables defined as deterministic functions of their continuous parents.
References

Cobb, B.R. and P.P. Shenoy: Inference in hybrid Bayesian networks with deterministic variables. In P. Lucas (ed.): Proceedings of the Second European Workshop on Probabilistic Graphical Models (PGM-04) (2004) 57–64, Leiden, Netherlands.
Cobb, B.R. and P.P. Shenoy: Modeling nonlinear deterministic relationships in Bayesian networks. School of Business Working Paper No. 310, University of Kansas, Lawrence, Kansas (2005). Available for download at: http://www.people.ku.edu/~brcobb/WP310.pdf
Cobb, B.R., Shenoy, P.P. and R. Rumí: Approximating probability density functions in hybrid Bayesian networks with mixtures of truncated exponentials. Working Paper No. 303, School of Business, University of Kansas, Lawrence, Kansas (2003). Available for download at: http://www.people.ku.edu/~brcobb/WP303.pdf
Kullback, S. and R.A. Leibler: On information and sufficiency. Annals of Mathematical Statistics 22 (1951) 79–86.
Larsen, R.J. and M.L. Marx: An Introduction to Mathematical Statistics and its Applications (2001) Prentice Hall, Upper Saddle River, N.J.
Lauritzen, S.L. and F. Jensen: Stable local computation with conditional Gaussian distributions. Statistics and Computing 11 (2001) 191–203.
Moral, S., Rumí, R. and A. Salmerón: Mixtures of truncated exponentials in hybrid Bayesian networks. In P. Besnard and S. Benferhat (eds.): Symbolic and Quantitative Approaches to Reasoning under Uncertainty, Lecture Notes in Artificial Intelligence 2143 (2001) 156–167, Springer-Verlag, Heidelberg.
Penniless Propagation with Mixtures of Truncated Exponentials⋆

Rafael Rumí and Antonio Salmerón

Dept. Estadística y Matemática Aplicada, Universidad de Almería, 04120 Almería, Spain
{rrumi, Antonio.Salmeron}@ual.es

⋆ This work has been supported by the Spanish Ministry of Science and Technology, project Elvira II (TIC2001-2973-C05-02) and by FEDER funds.
Abstract. Mixtures of truncated exponentials (MTE) networks are a powerful alternative to discretisation when working with hybrid Bayesian networks. One of the features of the MTE model is that standard propagation algorithms can be used. In this paper we propose an approximate propagation algorithm for MTE networks which is based on the Penniless propagation method already known for discrete variables. The performance of the proposed method is analysed in a series of experiments with random networks.
1 Introduction
A Bayesian network is an efficient representation of a joint probability distribution over a set of variables, where the network structure encodes the independence relations among the variables. Bayesian networks are commonly used to make inferences about the probability distribution of some variables of interest, given that the values of some other variables are known. This task is usually called probabilistic inference or probability propagation. Much attention has been paid to probability propagation in networks where the variables are discrete with a finite number of possible values. Several exact methods have been proposed in the literature for this task [8, 13, 14, 20], all of them based on local computation. Local computation means calculating the marginals without actually computing the joint distribution, and it is described in terms of a message passing scheme over a structure called a join tree. Approximate methods have also been developed with the aim of dealing with complex networks [2, 3, 4, 7, 18, 19].

In mixed Bayesian networks, where both discrete and continuous variables appear simultaneously, it is possible to apply local computation schemes similar to those for discrete variables. However, the correctness of exact inference depends on the model. This problem has been studied in depth before, but the only general solution is the discretisation of the continuous variables [5, 11], which are then treated as if they were discrete, and therefore the results obtained are approximate. Exact propagation can be carried out over mixed networks when the model is a conditional Gaussian distribution [12, 17], but in this case discrete variables are not allowed to have continuous parents. This restriction was overcome in [10] using a mixture of exponentials to represent the distribution of discrete nodes with continuous parents, but the price to pay is that propagation cannot be carried out using exact algorithms: Monte Carlo methods are used instead. The Mixture of Truncated Exponentials (MTE) model [15] provides the advantages of the traditional methods and the added feature that discrete variables with continuous parents are allowed. Exact standard propagation algorithms can be performed over MTE networks [6], as well as approximate methods. In this work, we introduce an approximate propagation algorithm for MTEs based on the idea of Penniless propagation [2], which is actually derived from the Shenoy-Shafer [20] method.

This paper continues with a description of the MTE model in Section 2. The representation based on mixed trees can be found in Section 3. Section 4 contains the application of the Shenoy-Shafer algorithm to MTE networks, while in Section 5 the Penniless algorithm is presented and illustrated with some experiments reported in Section 6. The paper ends with conclusions in Section 7.
2 The MTE Model
Throughout this paper, random variables will be denoted by capital letters, and their values by lowercase letters. In the multi-dimensional case, boldfaced characters will be used. The domain of the variable X is denoted by Ω_X. The MTE model is defined by its corresponding potential and density as follows [15]:

Definition 1. (MTE potential) Let X be a mixed n-dimensional random vector. Let Y = (Y₁, ..., Y_d) and Z = (Z₁, ..., Z_c) be the discrete and continuous parts of X, respectively, with c + d = n. We say that a function f : Ω_X → R₀⁺ is a Mixture of Truncated Exponentials potential (MTE potential) if one of the following conditions holds:

i. Y = ∅ and f can be written as

   f(x) = f(z) = a₀ + Σ_{i=1}^{m} a_i exp( Σ_{j=1}^{c} b_i^{(j)} z_j )          (1)

for all z ∈ Ω_Z, where a_i, i = 0, ..., m and b_i^{(j)}, i = 1, ..., m, j = 1, ..., c are real numbers.

ii. Y = ∅ and there is a partition D₁, ..., D_k of Ω_Z into hypercubes such that f is defined as f(x) = f(z) = f_i(z) if z ∈ D_i, where each f_i, i = 1, ..., k can be written in the form of (1).

iii. Y ≠ ∅ and for each fixed value y ∈ Ω_Y, f_y(z) = f(y, z) can be defined as in ii.
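To make the functional form in (1) concrete, here is a minimal Python sketch of a single-piece MTE potential; the coefficients are invented purely for illustration.

```python
import numpy as np

def mte_potential(z, a0, a, B):
    """One-piece MTE potential of form (1): f(z) = a0 + sum_i a_i * exp(b_i . z).
    z: point of dimension c; a0: scalar; a: (m,) weights; B: (m, c) exponents."""
    z = np.asarray(z, dtype=float)
    return a0 + np.sum(a * np.exp(B @ z))

# Illustrative evaluation with made-up coefficients (c = 1, m = 2).
print(mte_potential([0.5], a0=0.1,
                    a=np.array([0.3, -0.2]),
                    B=np.array([[1.0], [-2.0]])))
```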
Definition 2. (MTE density) An MTE potential f is an MTE density if

   Σ_{y∈Ω_Y} ∫_{Ω_Z} f(y, z) dz = 1.
In a Bayesian network, we find two types of densities:

1. For each variable X which is a root of the network, a density f(x) is given.
2. For each variable X with parents Y, a conditional density f(x|y) is given.

A conditional MTE density f(x|y) is an MTE potential f(x, y) such that, fixing y to each of its possible values, the resulting function is a density for X.
3 Mixed Trees
In [15] a data structure was proposed to represent MTE potentials: the so-called mixed probability trees, or mixed trees for short. The formal definition is as follows:

Definition 3. (Mixed tree) We say that a tree T is a mixed tree if it meets the following conditions:

i. Every internal node represents a random variable (either discrete or continuous).
ii. Every arc outgoing from a continuous variable Z is labeled with an interval of values of Z, so that the domain of Z is the union of the intervals corresponding to the arcs outgoing from Z.
iii. Every discrete variable has a number of outgoing arcs equal to its number of states.
iv. Each leaf node contains an MTE potential defined on the variables in the path from the root to that leaf.
Uncertainty degrees are taken from a finite, totally ordered scale L such that 1 = α₀ > α₁ > ... > α_n > α_{n+1} = 0. If δ is a set of uncertainty degrees, we define min(δ) = α_j (resp. max(δ) = α_j) such that α_j ∈ δ and there is no α_k ∈ δ with α_k < α_j (resp. α_k > α_j). A qualitative possibility distribution (QPD) is a function which associates to each element ω of the universe of discourse Ω an element from L, thus enabling us to express that some states are more plausible than others without referring to any numerical value. The QPD covers all the properties of the quantitative possibility distributions mentioned in this section.
3.2 Building Possibilistic Option Decision Trees
Recall that the heart of any decision tree algorithm is the attribute selection measure used to build the tree. The standard building procedure [11] chooses at each decision node the attribute having the maximum or the minimum value (according to the context) of this measure, assuming that it leads to the smallest tree, and the remaining attributes are rejected: at this point, Ockham's razor is applied. For instance, suppose that at a node n we find that Gr(T, A₁) = 0.87 and Gr(T, A₂) = 0.86. In the standard decision tree building procedure, the node n will be split according to the values of A₁ whereas A₂ is rejected, in spite of the fact that the two values are almost equal. Looking at the second part of the assumption underlying Ockham's razor, "It does not guarantee that the simplest model will be correct, it merely establishes priorities", one should, after computing the gain ratios of the different attributes, establish priorities between these candidate attributes according to the obtained values and select the attributes that appear possible to a certain extent as well, instead of choosing only the one with the highest gain ratio and rejecting all the remaining attributes. Thus, the idea is to assign to each decision node n a normalized possibility distribution π_{A_n} over the set of remaining attributes at this node, based on the set of gain ratios of the different attributes GR = {Gr(T_n, A_k) s.t. A_k ∈ A_n}, where T_n denotes the training subset relative to the node n. Let A_n be the set of remaining attributes at a decision node n and GR the set of their gain ratios. We define a quantitative possibility distribution π_{A_n} by the following equation:

   π_{A_n}(A_k) =  0                      if Gr(A_k) ≤ 0
                   1                      if Gr(A_k) = max(GR)
                   Gr(A_k) / Gr(A_k*)     otherwise.          (9)
We interpret π_{A_n}(A_k) as the possibility degree that a given attribute A_k is reliable for the node n. An alternative manner of quantifying the attributes was proposed by Hüllermeier in [6], but the characteristic of our possibility distribution is that it proportionally preserves the gap between the different attributes according to their gain ratios and it does not use any additional parameter. Once possibility degrees are generated for each attribute, we use the option technique [4], i.e., a decision node n will not only be split according to the best attribute A_k* but rather according to all the attributes in the set A_n*, which we define by:

   A_n* = {A_k ∈ A_n s.t. distance(A_k*, A_k) ≤ ∆}.          (10)
where distance(A∗k , Ak ) = πAn (A∗k )− πAn (Ak ), An denotes the set of candidate attributes at the node n and ∆ represents an arbitrary threshold varying in the interval [0, 1]. The fixed value of ∆ has a direct effect on the size of the tree. In fact, for a large (resp. small) value of ∆, the number of the selected attributes, at each node, will increase (resp. decrease) and hence, the tree will have a larger (resp. smaller) size. The extreme cases occur when:
– ∆ = 0: we recover a standard decision tree, as in Quinlan's C4.5.
– ∆ = 1: we obtain a huge decision tree composed of all the combinations of the different attribute values. This case is not interesting because it increases the time and space complexity. In addition, selecting attributes with low possibility degrees of being reliable at a given option node is nonsensical.

Since we can have more than one attribute at a given decision node n (an option node), the partitioning is realized as follows: for each attribute A_k ∈ A_n* and each value v ∈ D(A_k), one outgoing edge is added to n. This edge is labeled with the value v and the possibility degree π_{A_n}(A_k), which is interpreted as the reliability degree of that edge. Obviously, we keep the same stopping criteria as in standard decision trees.

Example 1. Let us use the golf data set [8] to illustrate the induction of a possibilistic option decision tree (PODT). Let T be the training set composed of fourteen objects which are characterized by four attributes:

- Outlook: sunny, overcast or rain.
- Temp: hot, mild or cool.
- Humidity: high or normal.
- Wind: weak or strong.
Two classes are possible: either C1 (play) or C2 (don't play). The training set T is given in Table 1. Assume ∆ = 0.4 in Equation (10). Let us compute the gain ratios of the different attributes at the root node n = 0:

Gr(T₀, Outlook)  = Gain(T₀, Outlook) / Split Info(T₀, Outlook)   = 0.246 / 1.577 = 0.156;
Gr(T₀, Temp)     = Gain(T₀, Temp) / Split Info(T₀, Temp)         = 0.029 / 1.556 = 0.018;
Gr(T₀, Humidity) = Gain(T₀, Humidity) / Split Info(T₀, Humidity) = 0.151 / 1     = 0.151;
Gr(T₀, Wind)     = Gain(T₀, Wind) / Split Info(T₀, Wind)         = 0.048 / 0.985 = 0.048.

We remark that the attribute "Outlook" has the highest gain ratio. Let us now compute the possibility degrees of the different attributes, using Equation (9), in order to define the set A₀*:

π_{A₀}(Outlook)  = 1;
π_{A₀}(Temp)     = Gr(T₀, Temp) / Gr(T₀, Outlook)     = 0.018 / 0.156 = 0.12;
π_{A₀}(Humidity) = Gr(T₀, Humidity) / Gr(T₀, Outlook) = 0.151 / 0.156 = 0.97;
π_{A₀}(Wind)     = Gr(T₀, Wind) / Gr(T₀, Outlook)     = 0.048 / 0.156 = 0.31.
Given ∆ = 0.4, the set of attributes which will be assigned to the root n₀ of the possibilistic option tree is A₀* = {Outlook, Humidity}. The possibilistic option tree induced from the training set T (∆ = 0.4 in Equation (10)), which we denote by PODT₀.₄, is given in Fig. 1. For clarity, abbreviations of the attribute values are used instead of complete words (e.g. "ho" for the value "hot", "hi" for "high", "we" for "weak", etc.).
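A small Python sketch of Equations (9) and (10), assuming the gain ratios have already been computed, reproduces the attribute selection made at the root node of this example.

```python
def attribute_possibilities(gain_ratios):
    """Possibility degrees over candidate attributes, following Equation (9)."""
    best = max(gain_ratios.values())
    return {a: 0.0 if gr <= 0 else (1.0 if gr == best else gr / best)
            for a, gr in gain_ratios.items()}

def option_attributes(pi, delta):
    """Attributes kept at an option node, following Equation (10)."""
    best = max(pi.values())
    return {a for a, p in pi.items() if best - p <= delta}

# Gain ratios computed at the root node in Example 1.
gr0 = {"Outlook": 0.156, "Temp": 0.018, "Humidity": 0.151, "Wind": 0.048}
pi0 = attribute_possibilities(gr0)
print(pi0)                           # Outlook 1.0, Humidity ~0.97, Wind ~0.31, Temp ~0.12
print(option_attributes(pi0, 0.4))   # {'Outlook', 'Humidity'}
```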
Table 1. Training set

Outlook   Temp  Humidity  Wind    Class
sunny     hot   high      weak    C2
sunny     hot   high      strong  C2
overcast  hot   high      weak    C1
rain      mild  high      weak    C1
rain      cool  normal    weak    C1
rain      cool  normal    strong  C2
overcast  cool  normal    strong  C1
sunny     mild  high      weak    C2
sunny     cool  normal    weak    C1
rain      mild  normal    weak    C1
sunny     mild  normal    strong  C1
overcast  mild  high      strong  C1
overcast  hot   normal    weak    C1
rain      mild  high      strong  C2
Fig. 1. Final possibilistic option tree
4 Qualitative Inference with Possibilistic Option Trees
In this section, we are interested in how to classify objects characterized by uncertain or missing attribute values within possibilistic option trees. Uncertainty here is handled in a qualitative possibilistic framework. For each attribute, we assign a qualitative possibility distribution (QPD) to express the uncertainty on the real value of that attribute. Given the set of attributes A, the instance to classify is described by a vector of possibility distributions i = (π_{A₁}, ..., π_{A_n}). An attribute A_k whose value is known with certainty has exactly one value v ∈ D(A_k) such that π_{A_k}(v) = 1, and for all other values v′ ∈ D(A_k) − {v}, π_{A_k}(v′) = 0. An attribute A_k whose value is missing is represented by a uniform possibility distribution, i.e., ∀ v ∈ D(A_k), π_{A_k}(v) = 1. Table 2 gives an example of an uncertain instance i₁ to classify. Note that 1 > α₁ > α₂ > α₃ > α₄ > α₅. In order to classify an uncertain instance (e.g. i₁) within a possibilistic option tree PODT, we need to carry out the following steps:

Table 2. Instance i₁

π_outlook:  sunny α₄, overcast α₁, rain 1
π_temp:     hot 1, mild 1, cool α₃
π_humidity: high 1, normal α₂
π_wind:     strong 1, weak α₅
Step One: The Instance Propagation. At each option node of a possibilistic option tree, the instance to classify can branch in different directions depending on the chosen attribute to test on. To each of these attributes, we have assigned a possibility degree π_{A_n}(A_k) (Equation (9)) indicating the possibility that a given attribute is reliable for a given option node n. Thus, throughout a given PODT, whenever an instance follows an attribute A_k, the related QPD in the instance to classify (π_{A_k}) should be discounted according to the possibility degree of the followed attribute (π_{A_n}(A_k)) using Equation (8). The discounted possibility degrees will replace the degrees labeling the PODT.

Step Two: Exploring the Paths. Once the propagation is made within the PODT (step 1), we should explore all its paths in order to determine their corresponding possibility degrees based on the 'new' discounted possibility degrees labeling the tree. Since we deal with qualitative possibility distributions, we have chosen the minimum operator to define the possibility degree of a path p = (n₀, ..., n_k) as

   π_path(p) = the minimum of the discounted possibility degrees labeling the edges of p.          (11)

Step Three: Exploring the Classes. Each class c_k is then associated with the vector of the possibility degrees of the paths leading to its leaves, sorted in decreasing order (vectors of possibly different lengths, denoted n and m below). Classes are compared as follows:

– c_k is preferred to c_l, denoted by c_k >_p c_l:
  • if there exists i ∈ {1, ..., min(n, m)} such that π(p_{c_k,i}) > π(p_{c_l,i}) and ∀ j < i, π(p_{c_k,j}) = π(p_{c_l,j});
  • or if ∀ i ∈ {1, ..., min(n, m)}, π(p_{c_k,i}) = π(p_{c_l,i}) and m > n.
– c_k is equal to c_l, denoted by c_k =_p c_l, iff n = m and ∀ i, π(p_{c_k,i}) = π(p_{c_l,i}).

In the case of equally preferred vectors, we choose a class at random.

Example 2. Suppose we have to classify the instance i₁ given in Table 2 within the induced PODT₀.₄ of Example 1. Assume α₁ = 0.8, α₂ = 0.5, α₃ = 0.4, α₄ = 0.2 and α₅ = 0.1. The assigned values only preserve the ranking between the αᵢ and have no meaning in themselves. So we get the following instance:
Table 3. Instance i₁

π_outlook:  sunny 0.2, overcast 0.8, rain 1
π_temp:     hot 1, mild 1, cool 0.4
π_humidity: high 1, normal 0.5
π_wind:     strong 1, weak 0.1
STEP 1: Instance Propagation. Starting from the root node of the PODT₀.₄ (see Fig. 1), the instance i₁ can follow both the 'Outlook' attribute and the 'Humidity' attribute, whose reliability degrees are respectively 1 and 0.97. According to the reliability of each followed attribute A_k, we discount the corresponding possibility distribution π_{A_k} as mentioned above. The different edges of the PODT₀.₄ will be labeled by the discounted QPDs of the instance to classify. We do not show the resulting figure here for reasons of space.
STEP 2: Exploring the Paths. Let us compute the possibility degree relative to each path using Equation (11):

P1: 0.8 ⇒ (C1, 0.8); P2: min(0.2, 1) = 0.2 ⇒ (C2, 0.2); P3: min(0.2, 0.5) = 0.2 ⇒ (C1, 0.2); P4: min(1, 1) = 1 ⇒ (C2, 1); P5: min(1, 0.1) = 0.1 ⇒ (C1, 0.1); P6: min(1, 0.2) = 0.2 ⇒ (C2, 0.2); P7: min(1, 0.8) = 0.8 ⇒ (C1, 0.8); P8: min(0.5, 0.4) = 0.4 ⇒ (C1, 0.4); P9: min(0.5, 0.8) = 0.5 ⇒ (C1, 0.5); P10: min(0.5, 0.1) = 0.1 ⇒ (C1, 0.1); P11: min(1, 1, 1) = 1 ⇒ (C2, 1); P12: min(1, 1, 0.1) = 0.1 ⇒ (C1, 0.1); P13: min(0.5, 1, 1) = 0.5 ⇒ (C2, 0.5); P14: min(0.5, 1, 0.1) = 0.1 ⇒ (C1, 0.1); P15: min(0.5, 1, 0.2) = 0.2 ⇒ (C1, 0.2); P16: min(0.5, 1, 0.8) = 0.5 ⇒ (C1, 0.5); P17: min(0.5, 0.8, 1) = 0.5 ⇒ (C2, 0.5).

STEP 3: Exploring the Classes. Collecting these results and using the preference relation above, we get:

C1 = {0.8, 0.8, 0.5, 0.5, 0.4, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1},
C2 = {1, 1, 0.5, 0.5, 0.2, 0.2}.

Then the class assigned to the instance i₁ is C2, since C2 >_p C1. Note that the classification method described in this section collapses to the standard classification procedure when testing instances are certain and ∆ = 0.
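The class comparison of Step Three can be sketched in Python as follows; the length-based tie-break of the second bullet is omitted because it is not needed in this example.

```python
def prefers(ck, cl):
    """True if the class with path degrees ck is preferred to cl: sort the
    degrees decreasingly and compare lexicographically (first bullet above)."""
    a, b = sorted(ck, reverse=True), sorted(cl, reverse=True)
    for x, y in zip(a, b):
        if x != y:
            return x > y
    return None  # tie on the common prefix: fall back to the tie-break rules

# Possibility degrees of the paths labelled with C1 and C2 in Example 2.
c1 = [0.8, 0.2, 0.1, 0.8, 0.4, 0.5, 0.1, 0.1, 0.1, 0.2, 0.5]
c2 = [0.2, 1.0, 0.2, 1.0, 0.5, 0.5]
print(prefers(c2, c1))  # True -> C2 is chosen, as in the example
```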
5 Experimental Results
For the evaluation of the possibilistic option tree approach, we have developed programs in Matlab V6.5 implementing both the building and the classification procedures relative to the PODT. We have then applied our approach to two real databases obtained from the U.C.I. repository of machine learning databases [8]. A brief description of these nominal-valued databases is presented in Table 4; #Tr, #Ts, #attributes and #classes denote respectively the number of training instances, the number of testing instances, the number of attributes and the number of classes. For the testing sets, we have generated uncertainty relative to the attribute values of the different testing instances in an artificial manner. In this experimentation, we were interested in the impact of varying ∆ in Equation (10) on the PCC (= number of well classified instances / total number of classified instances), considering parameters relative to the tree size (#nodes, #leaves) and temporal parameters (time relative to the building phase (T. build.) and to the classification phase (T. classif.)). Table 5 and Table 6 summarize the results relative to the Wisconsin breast cancer and Nursery databases, respectively. Note that the experiments were performed using a Centrino 1.4 GHz PC with 512 MB of RAM running Windows XP. It is important to mention that, during the experiments, we varied ∆ from 0 to 0.5. We stopped at 0.5 since it is not interesting to consider attributes whose reliability is less than 0.5, i.e., attributes that seem to be far from the fully reliable one.
Table 4. Description of databases

Database                  #Tr   #Ts   #attributes   #classes
Wisconsin Breast Cancer   629   70    8             2
Nursery                   750   75    8             5

Table 5. The experimental measures (W. breast cancer)

∆     #nodes   #leaves   T. build. (s)   T. classif. (s)   PCC (%)
0     101      168       15.27           55.42             81.42
0.1   154      259       17.5            96.54             88.57
0.2   320      550       27.27           204.38            80.00
0.3   529      933       38.89           366.15            80.00
0.4   879      1602      59.41           673.62            78.57
0.5   1802     3263      110.0           1635.98           75.71

Table 6. The experimental measures (Nursery)

∆     #nodes   #leaves   T. build. (s)   T. classif. (s)   PCC (%)
0     60       108       12.34           17.84             88.00
0.1   107      197       13.55           32.61             90.66
0.2   176      333       16.25           57.81             92.00
0.3   224      424       18.86           72.88             86.66
0.4   294      554       21.05           98.34             86.66
0.5   401      781       26.26           134.87            84.00
As shown in Table 5 and Table 6, the PCC increases progressively and then begins to decrease once a specific value of ∆ is reached. For instance, in the W. breast cancer database, the PCC increases from 81.42% to 88.57% when varying ∆ from 0 to 0.1 and then decreases from 88.57% to 75.71% for ∆ ∈ [0.1, 0.5]. The value of ∆ for which we obtain the most accurate PODT (0.1 for the W. breast cancer database and 0.2 for the Nursery database) is determined experimentally and depends on the training set used. These results confirm the results obtained in [9]: smaller trees are not necessarily more accurate than slightly larger ones. It is important to note that the PODT approach has the advantage of classifying instances having uncertain or missing attribute values.
6 Conclusion
In this paper, we have developed a new approach, the so-called possibilistic option decision tree (PODT). This approach has two advantages. The first is that it considers more than one attribute at a given decision node, breaking with Ockham's razor principle. The second advantage is the ability to classify instances characterized by uncertain or missing attribute values. The experimental results presented
in this paper are encouraging. In fact, the classification accuracy of the PODT increases when varying ∆ until reaching a specific value, which is determined purely experimentally. This value is relatively small and hence the time and space complexity remain reasonable. We believe that the pruning issue should be investigated, and we aim to extend our approach to handle continuous attributes in the future.
References

1. Ben Amor, N., Benferhat, S., Elouedi, Z.: Qualitative classification and evaluation in possibilistic decision trees, FUZZ-IEEE'2004.
2. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Occam's razor, Information Processing Letters, 24, 377–380, 1987.
3. Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J.: Classification and regression trees, Monterey, CA: Wadsworth & Brooks, 1984.
4. Buntine, W.: Learning classification trees, Statistics and Computing, 63–73, 1990.
5. Dubois, D., Prade, H.: Possibility theory: An approach to computerized processing of uncertainty, Plenum Press, New York, 1988.
6. Hüllermeier, E.: Possibilistic induction in decision tree learning, ECML'2002.
7. Kohavi, R., Kunz, C.: Option decision trees with majority votes, ICML'97.
8. Murphy, P. M., Aha, D. W.: UCI repository of machine learning databases, 1996.
9. Murphy, P. M., Pazzani, M. J.: Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction, JAIR, 257–275, 1994.
10. Quinlan, J. R.: Induction of decision trees, Machine Learning, 1, 81–106, 1986.
11. Quinlan, J. R.: C4.5: Programs for machine learning, Morgan Kaufmann, 1993.
12. Weiss, S. M., Kulikovski, C. A.: Computer systems that learn, Morgan Kaufmann, San Mateo, California, 1991.
Partially Supervised Learning by a Credal EM Approach

Patrick Vannoorenberghe¹ and Philippe Smets²

¹ PSI, FRE 2645 CNRS, Université de Rouen, Place Emile Blondel, 76821 Mont Saint Aignan cedex, France
[email protected]
² IRIDIA, Université Libre de Bruxelles, 50, av. Roosevelt, 1050 Bruxelles, Belgique
[email protected]

Abstract. In this paper, we propose a Credal EM (CrEM) approach for partially supervised learning. The uncertainty is represented by belief functions as understood in the transferable belief model (TBM). This model relies on a non-probabilistic formalism for representing and manipulating imprecise and uncertain information. We show how the EM algorithm can be applied within the TBM framework for the classification of objects when the learning set is imprecise (the actual class of each object is only known to belong to a subset of classes), and/or uncertain (the knowledge about the actual class is represented by a probability function or by a belief function).

Keywords: Learning, belief functions, EM, transferable belief model.
1 Introduction
Supervised learning consists in assigning an input pattern x to a class, given a learning set L composed of N patterns x_i with known classification. Let Ω = {ω₁, ω₂, ..., ω_K} be the set of K possible classes. Each pattern in L is represented by a p-dimensional feature vector x_i and its corresponding class label y_i. When the model generating the data is known, the classical methods of discriminant analysis (DA) permit the estimation of the parameters of the model. Still, these methods assume in practice that the actual class y_i of each case in the learning set is well known. Suppose instead that the data of the learning set are only partially observed, i.e., the actual class of a given object is only known to be one of those in a given subset C of Ω. Classical methods for parametric learning then encounter serious problems. One of the solutions is based on the EM algorithm (Dempster, Laird, & Rubin, 1977; McLachlan & Krishnan, 1997). Parametric learning requires a model of the generation of the data and an algorithm for estimating the parameters of this model using the available information contained in the learning set. A major drawback of many parametric methods is their lack of flexibility when compared with nonparametric methods. However, this problem can be circumvented using mixture models, which
combine much of the flexibility of nonparametric methods with certain of the analytic advantages of parametric methods. In this approach, we assume that the data X = {x₁, ..., x_N} are generated independently from a mixture density model whose probability density function (pdf) is given by:

   f(x_i ; y_i = ω_k, θ) = Σ_{g=1}^{G_k} π_{kg} f_{kg}(x_i ; α_{kg})          (1)
where G_k is the number of components in the mixture for the cases in class ω_k, π_{kg} are the mixing proportions, f_{kg} denotes a component, i.e. a probability distribution function parametrized by α_{kg}, and θ = {(π_{kg}, α_{kg}) : g = 1, ..., G_k; k = 1, ..., K} are the model parameters to be estimated. For a mixture of Gaussian pdfs, the function f_{kg}(x_i ; α_{kg}) is a Gaussian pdf and α_{kg} is a set of parameters α_{kg} = (µ_{kg}, Σ_{kg}), where µ_{kg} is the mean and Σ_{kg} the variance-covariance matrix of the Gaussian pdf f_{kg}. Generally, the maximum likelihood estimate of the parameters of this model cannot be obtained analytically, but learning θ could easily be achieved if the particular component f_{kg} responsible for the existence of each observation x_i were known. In reality, this ideal situation is hardly encountered. Several real-world contexts can be described:

1. The precise teacher case. For each learning case, we know the actual class to which it belongs. The missing information is the g value for each case. The classical approach to solve this problem is the EM algorithm.
2. The imprecise teacher case. For each learning case, we only know that the actual class belongs to a subset of Ω. The missing information is the k and the g values for each case, where k is constrained to a subset of 1, ..., K. The EM algorithm can be extended to such a case (Hastie & Tibshirani, 1996; Ambroise & Govaert, 2000).
3. The precise and uncertain teacher case. For each learning case, we only have some beliefs about what is the actual class to which the case belongs. The uncertainty is represented by a probability function on Ω. The uncertainty concerns the k value, and the g values are still completely unknown.
4. The imprecise and uncertain teacher case. For each learning case, we only have some beliefs about what is the actual class to which the case belongs. The uncertainty is represented by a belief function on Ω. The uncertainty and imprecision concern the k value, and the g values are still completely unknown. The EM algorithm can be further extended to such a case, as done here.
Previous work on comparing a TBM classifier with an EM-based classifier was performed in (Ambroise, Denoeux, Govaert, & Smets, 2001). Performance was analogous, but the TBM classifier was much simpler to use. The TBM classifier used in that comparison was based on non-parametric methods as developed by (Denœux, 1995; Zouhal & Denœux, 1998). Here the TBM is used for parameter estimation and the final TBM classifier is based on a parametric method. This paper is organized as follows. The basic concepts of belief function theory are briefly introduced in Section 2. The notion of likelihood is extended to the TBM in Section 3. The principle of parameter estimation via the EM algorithm is recalled in Section 4. The proposed algorithm is presented in Section 5. Finally, Section 6 gives some experimental results using synthetic data.
2 Background Materials on Belief Functions
Let Ω be a finite space, and let 2^Ω be its power set. A belief function defined on Ω can be mathematically defined by introducing a set function, called the basic belief assignment (bba), m^Ω : 2^Ω → [0, 1], which satisfies:

   Σ_{A⊆Ω} m^Ω(A) = 1.          (2)

Each subset A ⊆ Ω such that m^Ω(A) > 0 is called a focal element of m^Ω. Given this bba, a belief function bel^Ω and a plausibility function pl^Ω can be defined, respectively, as:

   bel^Ω(A) = Σ_{∅≠B⊆A} m^Ω(B),   ∀ A ⊆ Ω.          (3)

   pl^Ω(A) = Σ_{A∩B≠∅} m^Ω(B),   ∀ A ⊆ Ω.          (4)

The three functions bel^Ω, pl^Ω and m^Ω are in one-to-one correspondence and represent three facets of the same piece of information. We can retrieve each function from the others using the fast Möbius transform (Kennes, 1992). Let m₁^Ω and m₂^Ω be two bbas defined on the same frame Ω. Suppose that the two bbas are induced by two distinct pieces of evidence. Then the joint impact of the two pieces of evidence can be expressed by the conjunctive rule of combination, which results in the bba:

   m₁₂^Ω(A) = (m₁^Ω ∩○ m₂^Ω)(A) = Σ_{B∩C=A} m₁^Ω(B) · m₂^Ω(C).          (5)

In the TBM, we distinguish the credal level, where beliefs are entertained (formalized, revised and combined), and the pignistic level, used for decision making. Based on rationality arguments developed in the TBM, Smets proposes to transform m^Ω into a probability function BetP on Ω (called the pignistic probability function) defined for all ω_k ∈ Ω as:

   BetP(ω_k) = Σ_{A∋ω_k} m^Ω(A) / (|A| · (1 − m^Ω(∅)))          (6)
where |A| denotes the cardinality of A ⊆ Ω and BetP(A) = Σ_{ω∈A} BetP(ω), ∀ A ⊆ Ω. In this transformation, the mass of belief m(A) is distributed equally among the elements of A (Smets & Kennes, 1994; Smets, 2005). Let us suppose the two finite spaces X, the observation space, and Θ, the unordered parameter space. The Generalized Bayesian Theorem (GBT), an extension of Bayes' theorem within the TBM (Smets, 1993), consists in defining a belief function on Θ given an observation x ⊆ X, the set of conditional bbas m^X[θ_i] over X, one for each θ_i ∈ Θ¹, and a vacuous a priori on Θ. Given this set of bbas (which can be associated with their related belief or plausibility functions), then for x ⊆ X and ∀ A ⊆ Θ, we have:

   pl^Θ[x](A) = 1 − Π_{θ_i ∈ A} (1 − pl^X[θ_i](x)).          (7)
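The following Python sketch illustrates the basic operations (2), (5) and (6) on a small frame; the masses are invented purely for illustration.

```python
from itertools import product

def conjunctive(m1, m2):
    """Conjunctive combination (5) of two bbas given as {frozenset: mass}."""
    out = {}
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        out[a & b] = out.get(a & b, 0.0) + va * vb
    return out

def pignistic(m):
    """Pignistic transformation (6): spread each mass over its focal set."""
    k = 1.0 - m.get(frozenset(), 0.0)  # renormalise away the conflict
    betp = {}
    for a, v in m.items():
        for w in a:
            betp[w] = betp.get(w, 0.0) + v / (len(a) * k)
    return betp

omega = frozenset({"w1", "w2", "w3"})
m1 = {frozenset({"w1"}): 0.5, omega: 0.5}
m2 = {frozenset({"w1", "w2"}): 0.7, omega: 0.3}
m12 = conjunctive(m1, m2)
print(m12)             # masses on {w1}, {w1,w2} and the whole frame
print(pignistic(m12))  # a probability distribution over w1, w2, w3
```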
3 Explaining the Likelihood Maximization Within the TBM
Suppose a random sample of a distribution with parameters θ ∈ Θ, and let X = {x₁, ..., x_N : x_i ∈ IR^p} be the set of observations. In probability theory, many estimation procedures for θ are based on the maximization of the likelihood, i.e. P^{IR^p}(X|θ) considered as a function of θ. How do we generalize this procedure within the TBM? We reconsider the issue. For each θ ∈ Θ, we have a conditional bba on IR^p, denoted m^{IR^p}[θ]. We observe x ⊆ IR^p. This induces a bba on Θ by the application of the GBT. So we get the bba m^Θ[x]. How do we estimate θ₀, the actual value of Θ? We could select the θ that maximizes BetP^Θ[x], thus the most 'probable' value of Θ. This solution means finding the modal value of BetP^Θ[x]. We feel this principle fits with the idea underlying maximum likelihood estimators. So we must find the θ ∈ Θ such that BetP^Θ[x](θ) ≥ BetP^Θ[x](θ_i), ∀ θ_i ∈ Θ. This maximization seems hard to solve, but we can use Theorem III.1 in (Delmotte & Smets, 2004), which states that the θ that maximizes BetP^Θ[x] is the same as the one that maximizes the plausibility function pl^Θ[x](θ), provided the a priori belief on Θ is vacuous, as is the case here.

Theorem 1. Given x ⊆ X and pl^X[θ](x) for all θ ∈ Θ, let pl^Θ[x] be the plausibility function defined on Θ and computed by the GBT, and BetP^Θ[x] be the pignistic probability function constructed on Θ from pl^Θ[x]; then:

   BetP^Θ[x](θ_i) > BetP^Θ[x](θ_j)   iff   pl^X[θ_i](x) > pl^X[θ_j](x).          (8)
In the TBM, pl^Θ[x](θ) is equal to pl^X[θ](x). Furthermore, when N i.i.d. data x_i, i = 1, ..., N, are observed, we get pl^X[θ](x₁, ..., x_N) = Π_{i=1}^{N} pl^X[θ](x_i).
¹ We use the following notational convention for the indices and [·]: m^D[u](A) denotes the mass given to the subset A of the domain D by the conditional bba m^D[u] defined on D given that u is accepted as true.
This last term is easy to compute and leads thus to applicable algorithms. Maximizing the likelihood over θ turns out to mean maximizing over θ the conditional plausibilities of the data given θ.
4 Parameter Estimation by EM Algorithm
We introduce the classical EM approach to find the parameters of a mixture model from a data set X = {x₁, ..., x_N} made of cases which belong to the same class. The aim is to estimate the posterior distribution of the variable y which indicates the component of the mixture that generated x_i, taking into account the available information L. For simplicity's sake, we do not indicate the class index k. For that estimation, we need to know π_g, f_g and α_g for g = 1, ..., G. For their estimation, we use the EM algorithm to maximize with respect to θ the log likelihood:

   L(θ; X) = log( Π_{i=1}^{N} f(x_i ; θ) ) = Σ_{i=1}^{N} log( Σ_{g=1}^{G} π_g f_g(x_i ; α_g) ).          (9)

In order to solve this problem, the idea is that if one had access to a hidden random variable z that indicates which data point was generated by which component, then the maximization problem would decouple into a set of simple maximizations. Using this indicator variable z, relation (9) can be written as the following complete-data log likelihood function:

   L_c(θ; X, z) = Σ_{i=1}^{N} Σ_{g=1}^{G} z_{ig} log( π_g f_g(x_i ; α_g) )          (10)

where z_{ig} = 1 if the Gaussian pdf having generated the observation x_i is f_g, and 0 otherwise. Since z is unknown, L_c cannot be used directly, so we usually work with its expectation, denoted Q(θ|θ^l), where l is used as the iteration index. As shown in (Dempster et al., 1977), L(θ; X) can be maximized by iterating the following two steps:

– E step: Q(θ|θ^l) = E[L_c(θ; X, z) | X, θ^l]
– M step: θ^{l+1} = arg max_θ Q(θ|θ^l)

The E (Expectation) step computes the expected complete-data log likelihood and the M (Maximization) step finds the parameters that maximize that likelihood. Q(θ|θ^l) can be rewritten as

   Q(θ|θ^l) = Σ_{i=1}^{N} Σ_{g=1}^{G} E[z_{ig} | X, θ^l] log( π_g f_g(x_i ; α_g) ).          (11)
In a probabilistic framework, E[zig |X, θl ] is nothing else than P (zig = 1|X, θl ), the posterior distribution easily computed from the observed data.
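For concreteness, here is a minimal Python sketch of EM for a one-dimensional Gaussian mixture, where the E step computes exactly these posteriors; the data and initial values are invented for illustration.

```python
import numpy as np

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E step: r[i, g] = E[z_ig | X, theta^l] = P(z_ig = 1 | X, theta^l)
    dens = np.stack([p * gauss(x, m, s) for p, m, s in zip(pi, mu, sigma)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M step: parameters maximizing Q(theta | theta^l)
    nk = r.sum(axis=0)
    pi, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(np.round(mu, 2), np.round(sigma, 2), np.round(pi, 2))  # close to the generating parameters
```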
5 CrEM: The Credal Solution
In this section, we introduce a credal EM approach for partially supervised learning. The imprecision and/or uncertainty on the observed labels are represented by belief functions (cf. Section 5.1). We consider the imprecise and uncertain teacher case (Section 5.2).

5.1 Partially Observed Labels
Thanks to its flexibility, a belief function can represent different forms of labels, including hard labels (HL), imprecise labels (IL), probabilistic labels (PrL), possibilistic labels (PoL) and credal labels (CrL). Table 1 illustrates an example of the bbas that characterize the knowledge about the labels on a three-class frame. Note that a possibility measure is known to be formally equivalent to a consonant belief function, i.e., a belief function with nested focal elements (Denœux & Zouhal, 2001). Unlabeled samples (UL) can be encoded using the vacuous belief function m_v defined as m_v(Ω) = 1. This shows that handling the general case based on belief functions covers all cases of imperfect teacher (imprecise and/or uncertain). Of course, the TBM covers the HL, IL, PrL and CrL cases. For the PoL, the CrEM algorithm presented here has to be adapted, as we use the GBT and other combination rules that differ from their possibilistic counterparts.

Table 1. Example of imprecise and uncertain labeling with belief functions
A ⊆ Ω        HL    IL    PrL   PoL   CrL   UL
{ω1}         0     0     0.2   0     0.1   0
{ω2}         1     0     0.6   0     0     0
{ω1, ω2}     0     1     0     0     0.2   0
{ω3}         0     0     0.2   0.7   0.3   0
{ω1, ω3}     0     0     0     0.2   0.3   0
{ω2, ω3}     0     0     0     0     0     0
Ω            0     0     0     0.1   0.1   1

5.2 The Imprecise and Uncertain Teacher Case
Let Ω = {ω₁, ..., ω_K} be a set of K mutually exclusive classes². Let L be a set of N observed cases, called the learning set. For i = 1, ..., N, let c_i denote the i-th case. For case c_i, we collect a feature vector x_i taking values in IR^p, and a bba m_i^Ω that represents all we know about the actual class y_i ∈ Ω to which case c_i belongs. We then assume that the probability density function (pdf) of x_i is given by the following mixture of pdfs:

   f(x_i ; y_i = ω_k, θ_k) = Σ_{g=1}^{G_k} π_{kg} f_{kg}(x_i ; α_{kg})          (12)

where f_{kg} is the p-dimensional Gaussian pdf with parameters α_{kg} = (µ_{kg}, Σ_{kg}). Let the available data be {(x₁, m₁^Ω), ..., (x_N, m_N^Ω)}, where X = (x₁, ..., x_N) is an i.i.d. sample. Let Y = (y₁, ..., y_N) be the unobserved labels and m^Ω = (m₁^Ω, ..., m_N^Ω) the bbas representing our beliefs about the actual values of the y_i's. For the estimation of the parameters θ = ({α_{kg} : g = 1, ..., G_k, k = 1, ..., K}, Y), we use the EM algorithm to maximize the log likelihood given by:

   L(θ; L) = log( Π_{i=1}^{N} f(x_i ; y_i = ω_k, θ_k) ) = Σ_{i=1}^{N} log( Σ_{g=1}^{G_k} π_{kg} f_{kg}(x_i ; α_{kg}) ).          (13)

² In the TBM, we do not require Ω to be exhaustive, but one could add this requirement innocuously.
We can rephrase the relation by considering all the Gaussian pdfs. There are G = Σ_{k=1}^{K} G_k Gaussian pdfs. Let J_k be the indexes, in the new ordering, of the components of the class ω_k. So J_k = {j : Σ_{ν=1}^{k−1} G_ν < j ≤ Σ_{ν=1}^{k} G_ν}, where Σ_{ν=1}^{0} G_ν = 0. This reindexing is analogous to a refinement R of the classes in Ω = {ω_k : k = 1, ..., K} into a set of new 'classes' Ω* = {ω_j* : j = 1, ..., G}, where ω_k is mapped onto {ω_j* : j ∈ J_k}. The bba m_i^Ω can be refined on Ω* as m_i^{Ω*}, where

   m_i^{Ω*}(R(A)) = m_i^Ω(A)   ∀ A ⊆ Ω,
   m_i^{Ω*}(B) = 0             otherwise.          (14)
For each case c_i, we must find out which of the G pdfs generated its data x_i. So, equation (13) can be written as:

   L(θ; L) = Σ_{i=1}^{N} log( Σ_{j=1}^{G} π_j f_j(x_i ; α_j) )          (15)
where the π_j taken on the j indexes corresponding to the possible classes of c_i must add to 1, all others being 0. We now reconsider the EM algorithm when the teacher is imperfect. We need, for each case c_i, the plausibility of x_i given the bba m_i^{Ω*} about its class in Ω*. If the actual class is ω_j*, then pl^{IR^p}[ω_j*](x_i) is given by f_j(x_i, α_j). If x_i is a singleton (as usual and assumed hereafter), then pl^{IR^p}[ω_j*](x_i) = f_j(x_i, α_j) dx, where we put dx to recall that a plausibility is a set function whereas f itself is a density. This dx term will cancel when normalizing. Let A ⊆ Ω*; then from the disjunctive rule of combination associated with the GBT we get:

   pl^{IR^p}[A](x_i) = 1 − Π_{j : ω_j* ∈ A} (1 − pl^{IR^p}[ω_j*](x_i)).          (16)
We then assess the bba on Ω* given θ^l and x_i. From the GBT, we get m^{Ω*}[x_i, θ^l]. We combine this bba with the prior bba given by m_i^{Ω*} by the conjunctive combination rule. The term to maximize is then:

   Q(θ|θ^l) = Σ_{i=1}^{N} Σ_{A⊆Ω*} (m^{Ω*}[x_i, θ^l] ∩○ m_i^{Ω*})(A) · log( pl^{IR^p}[A](x_i) )          (17)

where pl^{IR^p}[A](x_i) is given by relation (16).
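A small Python sketch of relation (16), with invented Gaussian components, shows how the plausibility of a subset A of refined classes is obtained from the component densities; it is only an illustration of the formula, not of the full CrEM iteration.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Invented 2-D Gaussian components standing in for the refined classes omega*_j.
components = {
    1: (np.array([0.0, 0.0]), np.eye(2)),
    2: (np.array([3.0, 0.0]), np.eye(2)),
    3: (np.array([0.0, 5.0]), np.eye(2)),
}

def pl_singleton(j, x):
    mu, cov = components[j]
    return mvn.pdf(x, mu, cov)  # f_j(x; alpha_j); the dx factor cancels on normalisation

def pl_subset(A, x):
    """Relation (16): pl[A](x) = 1 - prod_{j in A} (1 - pl[{omega*_j}](x))."""
    prod = 1.0
    for j in A:
        prod *= 1.0 - pl_singleton(j, x)
    return 1.0 - prod

x = np.array([0.5, 0.2])
print(pl_subset({1}, x), pl_subset({1, 2}, x), pl_subset({1, 2, 3}, x))
```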
6 Simulations Results
In this section, we propose to illustrate the performance of the CrEM algorithm described in the previous sections using two learning tasks.

6.1 Learning Task 1: Isosceles Triangles
In this task, we have three classes, Ω = {ω₁, ω₂, ω₃}, and two-dimensional data. In each class, there are 2 components (G_k = 2, k = 1, 2, 3). For a given subset, each vector x is generated from a Gaussian f(x|ω_g) ∼ N(µ_g, Σ_g) where Σ_g = σI. The parameters for the 6 pdfs are presented in Table 2. The pdfs correspond to 3 largely spread clusters (σ = 2) located at the 3 corners of an isosceles triangle, and to 3 tightly clustered ones (σ = 0.5) located at the 3 corners of another isosceles triangle. The pair of pdfs corresponding to one class are thus located at one corner and half way on the line between the other 2 corners. In Fig. 1, we illustrate an example of such a learning set with its respective isosceles triangles (fine lines). We generate a sample of 50 cases from each of the 6 pdfs. Labels for each case can be of two types, either imprecise (IL) or credal (CrL). In the IL case, the labels for the 50 cases from the largely spread data (those at the corners) are precise. The other 50 cases are randomly split into two groups of 25 cases. Their labels are imprecise and made of 2 classes, the actual class being one of them. So for the 50 cases in subset 2 of class ω₁, 25 are labeled {ω₁, ω₂} and 25 are labeled {ω₁, ω₃}.

Table 2. Parameters of the learning set for task 1 with imprecise labels (IL) and the estimations obtained with the CrEM for one run
            ω1(+) s1    ω1(+) s2      ω2(×) s1    ω2(×) s2      ω3(·) s1    ω3(·) s2
µa          17.5        10            15          15            12.5        20
µb          14.3        10            10          18.6          14.3        10
σ           0.5         2             0.5         2             0.5         2
IL cases    50 ω1       25 ω1,ω2      50 ω2       25 ω1,ω2      50 ω3       25 ω1,ω3
                        25 ω1,ω3                  25 ω2,ω3                  25 ω2,ω3
ma          17.54       9.13          14.92       15.60         12.42       20.36
mb          14.32       10.35         10.12       18.95         14.35       9.86
s           0.38        2.57          0.37        1.85          0.35        3.24
r           0.152       0.185         0.148       0.178         0.154       0.179
Fig. 1. Learning set in the feature space
Table 3. Percentage of correct classification for classical EM and CrEM algorithms
Triangles   1     2     3     4     5     6     7     8     9     10    mean   std
EM          85.3  84.3  86.3  88.0  86.7  87.0  83.3  85.7  90.7  88.0  86.5   2.1
CrEM IL     86.3  85.3  88.0  90.3  88.0  87.3  84.0  88.0  91.0  88.0  87.6   2.0
CrEM CrL    87.0  86.6  87.6  90.0  87.6  88.0  85.3  88.3  91.3  86.7  87.8   1.7
In the CrL case, the labels are subsets of Ω randomly generated, and each one receives a random mass. We thus generate imprecise and uncertain learning sets as they can be encountered in real-world applications. We run 10 simulations. For each of them, we generate the labels for the IL and CrL cases. In Fig. 1, we present the data for one simulation. The bold-line triangle illustrates the result of applying the CrEM in the IL case. As can be seen, the means (the corners of the triangles) are well located. The estimated parameters are listed at the bottom of Table 2. On the IL data, we apply both a classical EM algorithm and the CrEM. On the CrL data, we apply only the CrEM algorithm, as the classical one does not seem fitted to this type of data. In Table 3, we present the Percentage of Correct Classification (PCC) obtained for each of the 10 independent training sets. Each method produces very similar results, but only the CrEM algorithm is able to use credal labels, which are much more flexible information than that of the IL case.
6.2 Learning Task 2: Qualitative Example
This learning set is drawn using three bi-dimensional Gaussian classes of standard deviation 1.5, respectively centered on (3, 0), (0, 5) and (0, 0). Fig. 2 illustrates this learning task together with the decision regions computed using the parameters of the CrEM algorithm learnt from credal labels (CrL).
Fig. 2. Maximum pignistic probabilities as grey-level values (panel: learning with unlabeled data and partially observed labels)

Table 4. Estimated parameters of the learning task 2
                ω1(+)             ω2(×)             ω3(·)
                µa      µb        µa      µb        µa      µb
Real values     3.00    0.00      0.00    0.00      0.00    5.00
Training set 1  3.52    −0.10     0.96    −0.45     −0.00   5.18
Training set 2  2.99    −0.19     −0.07   −0.40     −0.00   5.14
A very important, but classical, feature of EM and mixture model algorithms is the ability to cope with unlabeled samples. The first intuition is that these unlabeled data do not bring any information for learning the parameters of the distributions that generated the data. Contrary to this idea, we can show on this illustrative example that unlabeled data clearly give a more precise idea of the real distributions. To highlight this issue, two training sets were considered: a training set (set 1) which contains all the data except that we randomly removed 40 cases (80%) of class ω₂, and a training set (set 2) with all the data (150 cases). In this second learning set, we replaced the credal labels generated for the 40 previous cases with vacuous belief functions (UL) before applying the CrEM classifier. Table 4 shows the estimated parameters for these two learning tasks. Additionally, the estimated means are illustrated with grey-level disks in Fig. 2. This capacity makes CrEM a very suitable algorithm for cluster analysis, which is under study. In all these simulations, the estimation of the number of components G_k is a difficult model choice problem for which there are a number of possible solutions (Figueiredo & Jain, 2002). This problem is left for future work.
7 Conclusion
In this paper, a credal approach for partially supervised learning has been presented. The proposed methodology uses a variant of the EM algorithm to estimate
the parameters of mixture models and can cope with learning sets where the knowledge about the actual class is represented by a belief function. Several simulations have demonstrated the good performance of this CrEM algorithm compared to classical EM estimation in learning mixtures of Gaussians. Numerous applications of this approach can be mentioned. As an example, consider Bayesian networks, which use EM algorithms to estimate the parameters of unknown distributions; the CrEM algorithm can be a good alternative for belief networks. Future work is concerned with the model selection issue, which includes the choice of the number of components and the shape of each component. Another important issue is the detection of outliers, which can be solved by adding an extra component (uniform, for example) to the mixture.
References

Ambroise, C., Denoeux, T., Govaert, G., & Smets, P. (2001). Learning from an imprecise teacher: probabilistic and evidential approaches. In Proceedings of ASMDA'2001 (Vol. 1, pp. 100–105). Compiègne, France.
Ambroise, C., & Govaert, G. (2000). EM algorithm for partially known labels. In Proceedings of IFCS'2000 (Vol. 1). Namur, Belgium.
Delmotte, F., & Smets, P. (2004). Target identification based on the transferable belief model interpretation of Dempster-Shafer model. IEEE Transactions on Systems, Man and Cybernetics, A 34, 457–471.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Denœux, T. (1995). A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Transactions on Systems, Man and Cybernetics, 25(5), 804–813.
Denœux, T., & Zouhal, L. M. (2001). Handling possibilistic labels in pattern classification using evidential reasoning. Fuzzy Sets and Systems, 122, 47–62.
Figueiredo, M. A. T., & Jain, A. K. (2002). Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell., 24(3), 381–396.
Hastie, T., & Tibshirani, R. J. (1996). Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society B, 58, 155–176.
Kennes, R. (1992). Computational aspects of the Möbius transform of a graph. IEEE-SMC, 22, 201–223.
McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm and extensions. New York: John Wiley.
Smets, P. (1993). Belief functions: the disjunctive rule of combination and the generalized Bayesian theorem. Int. J. Approximate Reasoning, 9, 1–35.
Smets, P. (1998). The transferable belief model for quantified belief representation. In D. M. Gabbay & P. Smets (Eds.), Handbook of Defeasible Reasoning and Uncertainty Management Systems (Vol. 1, pp. 267–301). Kluwer, Dordrecht, The Netherlands.
Smets, P. (2005). Decision making in the TBM: the necessity of the pignistic transformation. Int. J. Approximate Reasoning, 38, 133–147.
Smets, P., & Kennes, R. (1994). The transferable belief model. Artificial Intelligence, 66, 191–234.
Zouhal, L. M., & Denœux, T. (1998). An evidence theoretic k-NN rule with parameter optimisation. IEEE Transactions on Systems, Man and Cybernetics – Part C, 28, 263–271.
Default Clustering from Sparse Data Sets

J. Velcin and J.-G. Ganascia

LIP6, Université Paris VI, 8 rue du Capitaine Scott, 75015 Paris, France
{julien.velcin, jean-gabriel.ganascia}@lip6.fr
Abstract. Categorization with a very high missing data rate is seldom studied, especially from a non-probabilistic point of view. This paper proposes a new algorithm called default clustering that relies on default reasoning and uses the local search paradigm. Two kinds of experiments are considered: the first one presents the results obtained on artificial data sets, the second uses an original and real case where political stereotypes are extracted from newspaper articles at the end of the 19th century.
Introduction

Missing values are of great interest in a world in which information flows play a key role. Most data analysis today has to deal with a lack of data due to voluntary omissions, human error, broken equipment, etc. [1]. Three kinds of strategies are generally used to handle such data: ignoring the incomplete observations (the so-called "list-wise deletion"), estimating the unknown values with other variables (single or multiple imputation, k-nearest-neighbors [2], maximum likelihood approaches [3]), or using background knowledge to complete the "gaps" automatically with default values (arbitrary values, default rules). The present work proposes a strategy which is not based on information completion but on default reasoning. The goal is to extract a set of very complete descriptions that summarize the whole data set as well as possible. For this purpose, a clustering algorithm is proposed that is based on local search techniques and constraints specific to the context of sparse data. Section 1 presents a new approach to conceptual clustering when missing information exists. Section 2 proposes a general framework, applied to the attribute-value formalism. The new notion of default subsumption is introduced, before seeing how the concept of stereotype makes it possible to name clusters. A stereotype set extraction algorithm is then presented. Section 3 concerns experiments, first on artificial data sets and secondly on a real data case generated from newspaper articles.
1 Dealing with Missing Values

1.1 Missing Values and Clustering
Generally, in Data Analysis, missing values are primarily solved just before starting the "hard" analysis itself (e.g. Multiple Correspondence Analysis [4]). But
this sort of pre-processing method is not really flexible for classification purposes, especially with a high rate of missing values. This paper addresses the problem in a non-supervised way, as with the well-known clustering algorithms k-means (or its categorical version, k-modes) and EM (Expectation-Maximization). But contrary to these algorithms, which can easily lead to local optima, we have chosen to achieve the clustering using a combinatorial optimization approach, as in [5] or [6]. Our goal here is not only to cluster examples but also, and mainly, to describe the clusters easily and in an understandable way. The problem can thus be stated as finding readable, understandable, consistent and rich descriptions. Each of these descriptions covers part of the data set. The examples belonging to a part can be considered as equivalent according to the covering description. Note that our interest is focused on the similarity between the examples and the cluster descriptions, and not between the examples themselves.

1.2 Default Clustering
E. Rosch saw categorization itself as one of the most important issues in cognitive science [7]. She introduced the concept of prototype as the ideal member of a category. Whereas categorization makes similar observations fit together and dissimilar observations be well separated, clustering is the induction process in data mining that actually builds such categories. More specifically, conceptual clustering is a machine learning task defined by R. Michalski [8] which does not require a teacher and uses an evaluation function to discover classes named with appropriate conceptual descriptions. Conceptual clustering was principally studied in a probabilistic context (see, for instance, D. Fisher's Cobweb algorithm [9]) and seldom applied to really sparse data sets. For instance, the experiments done by J.H. Gennari do not exceed 30% of missing values [10]. This paper proposes a new technique called default clustering which is inspired by the default logic of R. Reiter [11]. We use a similar principle, but for induction, when missing information exists. The main assumption is the following: if an observation is grouped with other similar observations, these observations can be used to complete unknown information in the original fact, as long as this remains consistent with the current context. Whereas default logic needs implicit knowledge expressed by default rules, default clustering only uses information available in the data set. The next section presents this new framework. It shows how to extract stereotype sets from very sparse data sets: first it extends the classical subsumption, next it discusses stereotype choice, and finally it proposes a local search strategy to find the best solution.
2 Logical Framework
This section presents the logical framework of default clustering in the attribute-value formalism (an adaptation to conceptual graphs can be found in [12]). The description space is noted D, the attribute space A, the descriptor space (i.e. the values the attributes can take) V, and the example set E. The function δ maps
each example e ∈ E to its description δ(e) ∈ D. Note that this logical framework only presents categorical attributes, but it has been easily extended to ordinal attributes.

2.1 Default Subsumption
Contrary to default logic, the problem here is not to deduce, but to induce knowledge from data sets in which most of the information is unknown. Therefore, we put forward the notion of default subsumption, which is the equivalent for subsumption of the default rule for deduction. Saying that a description d ∈ D subsumes d′ ∈ D by default means that there exists an implicit description d′′ such that d′ completed with d′′, i.e. d′ ∧ d′′, is more specific than d in the classical sense, which signifies that d′ ∧ d′′ entails d. The exact definition follows:

Definition 1. d subsumes d′ by default (noted d ≤D d′) iff ∃dc such that dc ≠ ⊥ and d ≤ dc and d′ ≤ dc, where t ≤ t′ stands for "t subsumes t′" in the classical sense. dc is a minorant of d and d′ in the subsumption lattice.

To illustrate our definition, here are some descriptions based on binary attributes that can be compared with respect to the default subsumption:

d1 = {(Traitor=yes), (Internationalist=yes)}
d2 = {(Traitor=yes), (Connection with jews=yes)}
d3 = (Patriot=yes)
d1 ≤D d2 and d2 ≤D d1 because ∃dc such that d1 ≤ dc and d2 ≤ dc : dc = {(Traitor=yes),(Internationalist=yes),(Connection with jews=yes)}
However, considering that a patriot cannot be an internationalist and vice-versa, i.e. ¬((Patriot=yes) ∧ (Internationalist=yes)), which was an implicit statement for many people living in France at the end of the 19th century, d1 does not subsume d3 by default, i.e. ¬(d1 ≤D d3).

Property 1. The notion of default subsumption is more general than classical subsumption since, if d subsumes d′, i.e. d ≤ d′, then d subsumes d′ by default, i.e. d ≤D d′. The converse is not true: if d ≤D d′, we do not know if d ≤ d′.

Property 2. The default subsumption relationship is symmetrical, i.e. ∀d ∀d′, if d ≤D d′ then d′ ≤D d.

Note that the notion of default subsumption may appear strange for people accustomed to classical subsumption because of the symmetrical relationship. As a consequence, it does not define an ordering relationship on the description space D. The notation ≤D may be confusing with respect to this symmetry, but it is relative to the underlying idea of generality.
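To make the definition concrete, here is a small illustrative sketch (not the authors' code) in which descriptions are partial attribute-value dictionaries and the implicit background knowledge is reduced to a list of mutually exclusive descriptor pairs; default subsumption then amounts to checking that the two descriptions can be merged consistently:

```python
# Illustrative sketch only: descriptions are partial dicts attribute -> value;
# "exclusions" encodes implicit knowledge such as not(Patriot=yes and Internationalist=yes).

def subsumes_by_default(d1, d2, exclusions=()):
    """True iff d1 <=_D d2, i.e. some completion dc refines both d1 and d2."""
    merged = dict(d1)
    for attr, val in d2.items():
        if attr in merged and merged[attr] != val:
            return False          # direct contradiction: no common minorant
        merged[attr] = val
    for (a1, v1), (a2, v2) in exclusions:
        if merged.get(a1) == v1 and merged.get(a2) == v2:
            return False          # the completion dc would be inconsistent (dc = ⊥)
    return True

d1 = {"Traitor": "yes", "Internationalist": "yes"}
d2 = {"Traitor": "yes", "Connection with jews": "yes"}
d3 = {"Patriot": "yes"}
excl = [(("Patriot", "yes"), ("Internationalist", "yes"))]

print(subsumes_by_default(d1, d2, excl))  # True: d1 <=_D d2 (and symmetrically)
print(subsumes_by_default(d1, d3, excl))  # False: patriot excludes internationalist
```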
2.2 Concept of Stereotype
In the literature of categorization, Rosch introduced the concept of prototype [7, 13] inspired by the family resemblance notion of Wittgenstein [14] (see [15]
for an electronic version and [16] for an analysis focused on family resemblance). Even if our approach and the original idea behind the concept of prototype have several features in common, we prefer to refer to the older concept of stereotype that was introduced by the publicist W. Lippman in 1922 [17]. For him, stereotypes are perceptive schemas (a structured association of characteristic features) shared by a group about other person or object categories. These simplifying and generalizing images about reality affect human behavior and are very subjective. Below are three main reasons to make such a choice. First of all, the concept of prototype is often misused in data mining techniques. It is reduced to either an average observation of the examples or an artificial description built on the most frequent shared features. Nevertheless, both of them are far from the underlying idea in family resemblance. Especially in the context of sparse data, it seems more correct to speak about a combination of features found in different example descriptions than about average or mode selection. The second argument is that the notion of stereotype is often defined as an imaginary picture that distorts the reality. Our goal is precisely to generate such pictures even if they are caricatural of the observations. Finally, these specific descriptions are better adapted for fast classification (we can even say discrimination) and prediction than prototypes, which is closely linked to Lippman's definition. In order to avoid ambiguities, we restrict the notion to a specific description d ∈ D associated to (we can say "covering") a set of descriptions D ⊂ D. However, the following subsection does not deal just with stereotypes but with stereotype sets to cover a whole description set. The objective is therefore to automatically construct stereotype sets, whereas most of the studies are focused on already fixed stereotype usage [18, 19]. Keeping this in mind, the space of all the possible stereotype sets is browsed in order to discover the best one, i.e. the set that best covers the examples of E with respect to some similarity measure. But just before addressing the search itself, we should consider both the relation of relative cover and the similarity measure used to build the categorization from stereotype sets.

2.3 Stereotype Sets and Relative Cover
Given an example e characterized by its description d = δ(e) ∈ D, consider the following statement: the stereotype s ∈ D is allowed to cover e if and only if s subsumes d by default. It means that in the context of missing data each piece of information is so crucial that even a single contradiction prevents the stereotype from being a correct generalization. Furthermore, since there is no contradiction between this example and its related stereotype, the stereotype may be used to complete the example description. In order to perform the clustering, a very general similarity measure Msim has been defined, which counts the number of common descriptors of V belonging to two descriptions, ignores the unknown values and takes into account the default subsumption relationship:
Msim : D × D → N+
(di, dj) ↦ Msim(di, dj) = |{v ∈ d : d = di ∧ dj}|  if di ≤D dj,
           Msim(di, dj) = 0  if ¬(di ≤D dj),

where di ∧ dj is the least minorant of di and dj in the subsumption lattice. Let us now consider a set S = {s∅, s1, s2 . . . sn} ⊂ D of stereotypes, where s∅ is the absurd-stereotype linked to the set E∅. Then, a categorization of E can be calculated using S with an affectation function which we call the relative cover:

Definition 2. The relative cover of an example e ∈ E, with respect to a set of stereotypes S = {s∅, s1, s2 . . . sn}, noted CS(e), is the stereotype si if and only if:
1. si ∈ S,
2. Msim(δ(e), si) > 0,
3. ∀k ∈ [1, n], k ≠ i, Msim(δ(e), si) > Msim(δ(e), sk).

It means that an example e ∈ E is associated to the most similar and "covering-able" stereotype relative to the set S. If there are two competing stereotypes with an equally high score, or if there is no covering stereotype, then the example is associated to the absurd-stereotype s∅. In this case, no completion can be calculated for e. Note that CS defines an equivalence relation on E. Given an example e, consider now the projection of its description δ(e) on the descriptors belonging to CS(e). This projection, noted δ(e)|CS, naturally subsumes the original description δ(e). If ei and ej are covered by the same stereotype, i.e. CS(ei) = CS(ej), then the projection of ei can be subsumed by default by the projection of ej. More formally:

Property 3. ∀(ei, ej) ∈ E², CS(ei) = CS(ej) ⇒ δ(ei)|CS ≤D δ(ej)|CS.

This means that the examples covered by the same stereotype are considered equivalent if we consider as negligible the descriptors that do not belong to this stereotype. This shows that, beyond the use of stereotypes, it is the examples themselves that are used to complete the sparse descriptions.
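Continuing the same illustrative dictionary representation, a simplified reading of Msim (here counting the descriptors shared by the two descriptions and returning 0 on any contradiction, background exclusions omitted) and of the relative cover of Definition 2 could look as follows; the helper names are hypothetical:

```python
# Sketch only: descriptions and stereotypes are partial attribute -> value dicts.

def compatible(d1, d2):
    """No attribute receives two different values in d1 and d2."""
    return all(d2.get(a, v) == v for a, v in d1.items())

def msim(d1, d2):
    """Simplified similarity: number of shared descriptors, 0 on contradiction."""
    if not compatible(d1, d2):
        return 0
    return sum(1 for a, v in d1.items() if d2.get(a) == v)

def relative_cover(example, stereotypes):
    """Index of the covering stereotype, or None for the absurd-stereotype s_empty."""
    scores = [msim(example, s) for s in stereotypes]
    best = max(scores) if scores else 0
    if best == 0 or scores.count(best) > 1:     # no cover or tie -> s_empty
        return None
    return scores.index(best)

stereotypes = [{"a": 0, "b": 1, "d": 5}, {"a": 0, "f": 0, "h": 0}]
example = {"a": 0, "d": 5}                      # a sparse observation
print(relative_cover(example, stereotypes))     # 0: two shared descriptors with s1
```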
2.4 Stereotype Extraction
In this paper, default reasoning is formalized using the notions of both default subsumption and stereotype set. Up to now, these stereotype sets were supposed to be given. This section shows how the classification can be organized into such sets in a non-supervised learning task. It can be summarized as follows. Given:
1. An example set E.
2. A description space D.
3. A description function δ : E → D which associates a description δ(e) ∈ D to each example belonging to the training set E.
The function of a non-supervised learning algorithm is to organize the initial set of individuals E into a structure (for instance a hierarchy, a lattice or a pyramid).
In the present case, the structure is limited to partitions of the training set, which corresponds to searching for stereotype sets as discussed above. These partitions may be generated by (n + 1) stereotypes S = {s∅, s1, s2 . . . sn}: it is sufficient to associate to each si the set Ei of examples e belonging to E and covered by si relative to S. The examples that cannot be covered by any stereotype are put into the E∅ cluster and associated to s∅. To choose from among the numerous possible partitions, which is a combinatorial problem, a non-supervised algorithm requires a function for evaluating stereotype set relevance. Because of the categorical nature of data and the previous definition of relative cover, it appears natural to make use of the similarity measure Msim. This is exactly what we do by introducing the following cost function hE:

Definition 3. E being an example set, S = {s∅, s1, s2 . . . sn} a stereotype set and CS the function that associates to each example e its relative cover, i.e. its closest stereotype with respect to Msim and S, the cost function hE is defined as follows:

    hE(S) = Σ_{e∈E} Msim(δ(e), CS(e))
While the k-modes and EM algorithms are straightforward, i.e. each step leads to the next one until convergence, we reduce here the non-supervised learning task to an optimization problem. This approach offers several interesting features: avoiding local optima (especially with categorical and sparse data), providing "good" solutions even if not the best ones, and better control of the search. In addition, it is not necessary to specify the number of expected stereotypes, which is also discovered during the search process. There are several methods for exploring such a search space (hill-climbing, simulated annealing, etc.), but we have chosen the meta-heuristic called tabu search, which improves the local search algorithm. Remember that the local search process can be schematized as follows:

1. An initial solution Sini is given (for instance at random).
2. A neighborhood is calculated from the current solution Si with the assistance of permitted movements. These movements can be of low influence (enrich one stereotype with a descriptor, remove a descriptor from another) or of high influence (add or retract one stereotype to or from the current stereotype set).
3. The best movement, relative to the evaluation function hE, is chosen and the new current solution Si+1 is computed.
4. The process is iterated a specific number of times and the best solution discovered so far is recorded.

Then, the solution is the stereotype set Smax that maximizes hE over all the visited sets. As in almost all local search techniques, there is a trade-off between exploitation, i.e. choosing the best movement, and exploration, i.e. choosing a non-optimal state to reach completely different areas. The tabu search extends basic local search by manipulating short- and long-term memories which are used to avoid loops and to explore the search space intelligently. We shall not detail this meta-heuristic here, but refer the reader to the book by Glover and Laguna [20].
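The search itself can be caricatured by the following sketch, which keeps only the "low influence" moves, uses the simplified Msim above and plain hill-climbing instead of the tabu memories; it is meant to illustrate the optimisation view, not to reproduce PRESS:

```python
import random

def msim(d1, d2):
    # simplified similarity: shared descriptors, 0 on contradiction
    if any(d2.get(a, v) != v for a, v in d1.items()):
        return 0
    return sum(1 for a, v in d1.items() if d2.get(a) == v)

def h(S, examples):
    # cost function h_E: each example contributes its best similarity (0 -> absurd stereotype)
    return sum(max((msim(e, s) for s in S), default=0) for e in examples)

def neighbours(S, vocabulary):
    # "low influence" moves: enrich one stereotype with a descriptor, or remove one
    for i, s in enumerate(S):
        for a, v in vocabulary:
            if s.get(a) != v:
                yield [{**s, a: v} if j == i else t for j, t in enumerate(S)]
        for a in s:
            yield [{k: w for k, w in s.items() if k != a} if j == i else t
                   for j, t in enumerate(S)]

def local_search(examples, vocabulary, n=2, iters=20, seed=0):
    rng = random.Random(seed)
    S = [dict([rng.choice(vocabulary)]) for _ in range(n)]
    best, best_score = [dict(s) for s in S], h(S, examples)
    for _ in range(iters):
        S = max(neighbours(S, vocabulary), key=lambda cand: h(cand, examples))
        if h(S, examples) > best_score:
            best, best_score = [dict(s) for s in S], h(S, examples)
    return best, best_score

examples = [{"a": 0, "h": 0}, {"a": 0, "b": 1}, {"d": 5}, {"b": 1, "d": 5}, {"a": 0, "d": 5}]
vocabulary = [("a", 0), ("b", 1), ("d", 5), ("f", 0), ("h", 0)]
print(local_search(examples, vocabulary))
```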
2.5 Constraints on Stereotypes
A "no-redundancy" constraint has been added in order to obtain a perfect separation between the stereotypes. In the context of sparseness, it seems really important to extract contrasted descriptions, which are used to quickly classify the examples, as does the concept of stereotype introduced by Lippman. A new constraint called cognitive cohesion is now defined. It verifies cohesion within a cluster, i.e. an example set Ej ⊂ E, relative to the corresponding stereotype sj ∈ S. Cognitive cohesion is verified if and only if, given two descriptors v1 and v2 ∈ V of sj, it is always possible to find a series of examples that make it possible to pass by correlation from v1 to v2. Below are two example sets with their covering stereotype. The example on the left verifies the constraint, the one on the right does not.

    s1  : a0 , b1 , d5 , f0 , h0        s2   : a0 , b1 , d5 , f0 , h0
    e1  : a0 , ?  , ?  , ?  , h0        e0   : a0 , b1 , ?  , ?  , ?
    e2  : a0 , b1 , ?  , ?  , ?         e8   : ?  , ?  , ?  , f0 , ?
    e6  : ?  , ?  , d5 , ?  , ?         e9   : a0 , b1 , ?  , ?  , ?
    e8  : ?  , b1 , d5 , f0 , ?         e51  : ?  , ?  , d5 , ?  , h0
    e42 : a0 , ?  , d5 , ?  , ?         e101 : ?  , ?  , d5 , ?  , h0
Hence, with s2 it is never possible to pass from a0 to d5, whereas it is allowed by s1 (for instance with e2 and then e8). In the case of s1, you are always able to find a "correlation path" from one descriptor of the description to another, i.e. examples explaining the relationship between the descriptors in the stereotype. For instance, there is a path between the descriptor h0 and the descriptor f0 using e1, e42 and e8: h0 is linked to a0 through e1, a0 to d5 through e42, and d5 to f0 through e8.
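The constraint can be checked as a connectivity test: take the descriptors of the stereotype as nodes, link two of them whenever some covered example contains both, and require a single connected component. A sketch (using the example sets above, with examples reduced to sets of descriptors):

```python
# Sketch of the cognitive cohesion test as graph connectivity over the
# descriptors of a stereotype, linked through the covered examples.

def cognitive_cohesion(stereotype, examples):
    nodes = set(stereotype)
    if len(nodes) <= 1:
        return True
    parent = {v: v for v in nodes}          # union-find over the descriptors
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for e in examples:
        present = [v for v in nodes if v in e]
        for a, b in zip(present, present[1:]):
            parent[find(a)] = find(b)       # the example correlates a and b
    return len({find(v) for v in nodes}) == 1

s = {"a0", "b1", "d5", "f0", "h0"}
left = [{"a0", "h0"}, {"a0", "b1"}, {"d5"}, {"b1", "d5", "f0"}, {"a0", "d5"}]
right = [{"a0", "b1"}, {"f0"}, {"a0", "b1"}, {"d5", "h0"}, {"d5", "h0"}]
print(cognitive_cohesion(s, left))   # True: e.g. h0 - a0 - d5 - f0
print(cognitive_cohesion(s, right))  # False: a0 and d5 are never correlated
```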
3 Experiments
This section presents experiments performed on artificial data sets. This is followed by an original comparison in a real data case using three well-known clusterers. Default clustering was implemented in a Java program called PRESS (Programme de Reconstruction d'Ensembles de Stéréotypes Structurés). All the experiments for k-modes, EM and Cobweb were performed using the Weka platform [21]. Note that the data sets used in the following correspond to the default clustering assumptions.
3.1 Validation on Artificial Data Sets
These experiments use artificial data sets to validate the robustness of our algorithm. The first step is to give some contrasted descriptions of D. Let us note ns the number of these descriptions. Next, these initial descriptions are duplicated nd times. Finally, missing data are artificially simulated by removing a percentage p of descriptors at random from these ns × nd artificial examples. The evaluation is carried out by testing different clusterers on these data and comparing the discovered cluster representatives with the initial descriptions. We verify what we call recovered descriptors, i.e. the proportion of initial descriptors that are found. This paper presents the results obtained with ns = 5 and nd = 50 over 50 runs. The number of examples is 250 and the descriptions are built using a language of 30 binary attributes. Note that these experiments are placed in the Missing Completely At Random (MCAR) framework.
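The generation protocol can be sketched as follows (here the seed descriptions are drawn at random rather than deliberately contrasted, and the attribute names are invented):

```python
# Possible generation of the artificial benchmark: ns seed descriptions over
# 30 binary attributes, duplicated nd times, then a proportion p of the
# descriptors is deleted completely at random (MCAR).
import random

def make_sparse_dataset(ns=5, nd=50, n_attrs=30, p=0.75, seed=0):
    rng = random.Random(seed)
    seeds = [{f"att{i}": rng.randint(0, 1) for i in range(n_attrs)} for _ in range(ns)]
    examples = []
    for s in seeds:
        for _ in range(nd):
            examples.append({a: v for a, v in s.items() if rng.random() > p})
    return seeds, examples

seeds, examples = make_sparse_dataset()
print(len(examples))                                  # 250 examples
print(sum(len(e) for e in examples) / (250 * 30))     # observed descriptor rate ≈ 1 - p
```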
Fig. 1. Proportion of recovered descriptors
Fig. 1 firstly shows that the results of PRESS are very good, reflecting a robust learning process. The stereotypes discovered correspond very well to the original descriptions up to 75% of missing data. In addition, this score remains good (nearly 50%) up to 90%. Whereas Cobweb seems stable relative to the increase in the number of missing values, the results of EM rapidly get worse above 80%. Those obtained using k-modes are the worst, although the number of expected medoids has to be specified.

3.2 Studying Social Misrepresentation
The second part of the experiments deals with real data extracted from a newspaper called “Le Matin” from the end of the 19th century in France. The purpose is to automatically discover stereotype sets from events related to the political disorder in the first ten days of September 1893. The results of PRESS are
compared to those of the three clusterers k-modes, EM and Cobweb. It should be pointed out that our interest focuses on the cluster descriptions, which we call representatives to avoid any ambiguity, rather than on the clusters themselves. The articles linked to the chosen theme were gathered and represented using a language with 33 attributes. The terms of this language, i.e. attributes and associated values, were extracted manually. Most of the attributes are binary, some accept more than two values and some are ordinal. The number of extracted examples is 63 and the rate of missing descriptors is nearly 87%, which is most unusual.

3.3 Evaluation of Default Clustering
In order to evaluate PRESS, a comparison was made with three classical clusterers: k-modes, EM and Cobweb. Hence, a non-probabilistic description of the clusters built by these algorithms was extracted using four techniques: (1) using the most frequent descriptors (mode approach); (2) the same as (1) but forbidding contradictory features between the examples and their representative; (3) dividing the descriptors between the different representatives; (4) the same as (3) but forbidding contradictory features. Two remarks need to be made. Firstly, the cluster descriptions resulting from k-modes correspond to technique (1). Nevertheless, we tried the other three techniques exhaustively. Secondly, representatives resulting from extraction techniques (3) and (4) validate the no-redundancy constraint by construction.

The comparison was made according to the following three points. The first point considers the contradictions between an example and its representative. The example contradiction is the percentage of examples containing at least one descriptor in contradiction with their covering representative. In addition, if we consider one of these contradictory examples, the average contradiction is the percentage of its descriptors in contradiction with its representative. This facet of conceptual clustering is very important, especially in the sparse data context. Secondly, we check whether the constraints described in Section 2.5 (i.e. cognitive cohesion and no-redundancy) are verified. They are linked to the concept of stereotype and to the sparse data context. Finally, we consider the degree of similarity between the examples and their covering representatives. This corresponds to the notion of compactness within clusters, but without penalizing stereotypes with many descriptors. The function hE seems well adapted to account for representative relevance. In fact, we used a version of hE normalized between 0 and 1 by dividing by the total number of descriptors.

3.4 Results
Fig. 2 gives the results obtained from the articles published in Le Matin. Experiments for the k-modes algorithm were carried out with N = 2 . . . 8 clusters, but only the N = 6 results are presented in this comparison. The rows of the table show the number n of extracted representatives, the two scores concerning contradiction, the result of hE, the redundancy score and whether or not the cognitive cohesion constraint is verified. The columns represent each type of experiment (k-modes associated with techniques (1) to (4), EM and Cobweb as well, and finally our algorithm PRESS).

                       k-Modes                 EM                      Cobweb                PRESS
                     (1)  (2)  (3)  (4)      (1)  (2)  (3)  (4)      (1)  (2)  (3)  (4)
n                      6    6    6    6        2    2    2    2        2    2    2    2        6
ex. contradiction     27    0   27    0       48    0   48    0       56    0   57    0        0
av. contradiction     42    0   44    0       56    0   56    0       52    0   51    0        0
hE                   .89  .60  .74  .50      .85  .66  .83  .65      .82  .56  .68  .46      .79
redundancy            70   63    0    0       17    7    0    0       72   55    0    0        0
cog. cohesion          ×    ×    ×    ×        ×    ×    ×    ×        ×    ×    ×    ×        √

Fig. 2. Comparative results on Le Matin

Let us begin by considering the contradiction scores. They highlight a principal result of default clustering: using PRESS, the percentage of examples having contradictory features with their representative is always equal to 0%. In contrast, the descriptions built using techniques (1) and (3) (whatever the clusterer used) possess at least one contradictory descriptor with 27% to 57% of the examples belonging to the cluster. Furthermore, around 50% of the descriptors of these examples are in contradiction with the covering description, and that can in no way be considered as negligible noise. This is the reason why processes (1) and (3) must be avoided, especially in the sparse data context, when building such representatives from k-modes, EM or Cobweb clustering. Hence, we only consider techniques (2) and (4) in the following experiments.

Let us now study the results concerning clustering quality. This quality can be expressed thanks to the compactness function hE, the redundancy rate and cognitive cohesion. PRESS achieved the best score (0.79) for cluster compactness with six stereotypes. That means a very good homogeneity between the stereotypes and the examples covered. It is perfectly consistent since our algorithm tries to maximize this function. The redundant descriptor rate is equal to 0%, according to the no-redundancy constraint. Furthermore, PRESS is the only algorithm that is able to verify cognitive cohesion. EM obtains the second best score and its redundant descriptor rate remains acceptable. However, the number of expected classes must be given or guessed using a cross-validation technique, for instance. K-modes and Cobweb come third and fourth and also have to use an external mechanism to discover the final number of clusters. Note that the stereotypes extracted using PRESS correspond to the political leanings of the newspaper. For instance, the main stereotype depicts a radical, socialist politician, corrupted by foreign money and Freemasonry, etc. It corresponds partly to the difficulty in accepting the major changes proposed by the radical party and to the fear caused in France since 1880 by the theories of Karl Marx. We cannot explain the semantics of the discovered stereotypes in more detail here, but these first results are really promising.
4 Conclusion
Sparse data clustering is seldom studied in a non-probabilistic way and with such a high number of missing values. However, it is really important to be able to extract readable, understandable descriptions from such data in order to complete information, to classify new observations quickly and to make predictions. In this way, the default clustering presented in this paper tries to provide an alternative to the usual clusterers. This algorithm relies on local optimization techniques that implement a very basic version of the tabu search meta-heuristic. Part of our future work will be to extend these techniques for stereotype set discovery. Hence, an efficient tabu search has to develop a long-term memory and to use more appropriate intensification and diversification strategies (e.g. a path-relinking strategy). The results obtained, on both artificial data sets and a real case extracted from newspaper articles, are really promising and should lead to other historical studies concerning social stereotypes. Another possible extension is to apply these techniques to the study of social representations, a branch of social psychology introduced by S. Moscovici in 1961 [22]. More precisely, this approach is really useful for press content analysis, which up to now has been done manually by experts. Here it would be a question of choosing key dates of the Dreyfus affair and automatically extracting stereotypical characters from different newspapers. These results will then be compared and contrasted with the work of sociologists and historians of this period.
Acknowledgments

The authors would particularly like to thank Rosalind Greenstein for reading and correcting the manuscript.
References

1. Newgard, C.D., Lewis, R.J.: The Imputation of Missing Values in Complex Sampling Databases: An Innovative Approach. In: Academic Emergency Medicine, Volume 9, Number 5484. Society for Academic Emergency Medicine (2002).
2. Huang, C.-C., Lee, H.-M.: A Grey-Based Nearest Neighbor Approach for Missing Attribute-Value Prediction. In: Applied Intelligence, Volume 20. Kluwer Academic Publishers (2004) pp. 239–252.
3. Ghahramani, Z., Jordan, M.-I.: Supervised learning from incomplete data via an EM approach. In: Advances in Neural Information Processing Systems, Volume 6. Morgan Kaufmann Publishers (1994), San Francisco.
4. Benzecri, J.P.: Correspondence Analysis Handbook. New York: Marcel Dekker (1992).
5. Figueroa, A., Borneman, J., Jiang, T.: Clustering binary fingerprint vectors with missing values for DNA array data analysis (2003).
6. Sarkar, M., Leong, T.Y.: Fuzzy K-means clustering with missing values. In: Proc. AMIA Symp. PubMed (2001) pp. 588–592.
7. Rosch, E.: Cognitive representations of semantic categories. In: Journal of Experimental Psychology: General, Number 104 (1975) pp. 192–232.
8. Michalski, R.S.: Knowledge acquisition through conceptual clustering: A theoretical framework and algorithm for partitioning data into conjunctive concepts. In: International Journal of Policy Analysis and Information Systems, 4 (1980) pp. 219–243.
9. Fisher, D.H.: Knowledge Acquisition Via Incremental Conceptual Clustering. In: Machine Learning, Number 2 (1987) pp. 139–172.
10. Gennari, J.H.: An experimental study of concept formation. Doctoral dissertation (1990), Department of Information & Computer Science, University of California, Irvine.
11. Reiter, R.: A logic for default reasoning. In: Artificial Intelligence, Number 13 (1980) pp. 81–132.
12. Velcin, J., Ganascia, J.-G.: Modeling default induction with conceptual structures. In: ER 2004 Conference Proceedings. Lu, Atzeni, Chu, Zhou, and Ling, editors. Springer-Verlag (2004), Shanghai, China.
13. Rosch, E.: Principles of categorization. In: Cognition and Categorization. NJ: Lawrence Erlbaum, Hillsdale (1978) pp. 27–48.
14. Wittgenstein, L.: Philosophical Investigations. Blackwell (1953), Oxford, UK.
15. Shawver, L.: Commentary on Wittgenstein's Philosophical Investigations. In: http://users.rcn.com/rathbone/lw65-69c.htm.
16. Narboux, J.-P.: Ressemblances de famille, caractères, critères. In: Wittgenstein : métaphysique et jeux de langage. PUF (2001) pp. 69–95.
17. Lippman, W.: Public Opinion. Ed. MacMillan (1922), NYC.
18. Rich, E.: User Modeling via Stereotypes. In: International Journal of Cognitive Science, 3 (1979) pp. 329–354.
19. Amossy, R., Herschberg Pierrot, A.: Stéréotypes et clichés : langues, discours, société. Nathan Université (1997).
20. Glover, F., Laguna, M.: Tabu Search. Kluwer Academic Publishers (1997).
21. Garner, S.R.: WEKA: The Waikato environment for knowledge analysis. In: Proc. of the New Zealand Computer Science Research Students Conference (1995) pp. 57–64.
22. Moscovici, S.: La psychanalyse : son image et son public. PUF (1961), Paris.
New Technique for Initialization of Centres in TSK Clustering-Based Fuzzy Systems

Luis Javier Herrera, Héctor Pomares, Ignacio Rojas, Alberto Guillén, and Jesús González

University of Granada, Department of Computer Architecture and Technology, E.T.S. Computer Engineering, 18071 Granada, Spain
http://atc.ugr.es
Abstract. Several methodologies for function approximation using TSK systems make use of clustering techniques to place the rules in the input space. Nevertheless, classical clustering algorithms are more related to unsupervised learning, and thus the output of the training data is not taken into account or, simply, the characteristics of the function approximation problem are not considered. In this paper we propose a new approach for the initialization of centres in clustering-based TSK systems for function approximation that takes into account the expected output error distribution in the input space to place the fuzzy system rule centres. The advantage of the proposed algorithm compared to other input clustering and input/output clustering techniques is shown through a significant example.
1 Introduction
The problem of function approximation deals with estimating an unknown function f from samples of the form {(xm ; z m ); m = 1, 2, . . . , M ; with z m = f (xm ) ∈ IR, and xm ∈ IRm } and is a crucial problem for a number of scientific and engineering areas. The main goal is thus to learn an unknown functional mapping between the input vectors and their corresponding output values, using a set of known training samples. Later, this generated mapping will be used to obtain the expected output given any new input data. Regression or function approximation problems deal with continuous input/output data in contrast to classification problems that deal with discrete, categorical output data. Fuzzy Systems are widely applied for both classification and Function Approximation problems. Specifically, for function approximation problems, two main techniques appear in the literature, Grid-Based Fuzzy Systems (GBFSs) [5] and Clustering-Based Fuzzy Systems (CBFSs) [6], whose main difference is the type of partitioning of the input space. GBFS have the advantage that they perform a thorough coverage of the input space, but at the expense of suffering from the curse of dimensionality that makes them inapplicable for problems with moderate complexity. In contrast, Clustering-Based Fuzzy System (CBFSs) techniques place the rules in the zones of the input space in which they are needed,
thus being more suitable, for example, for time series prediction problems in which the input data is concentrated in some regions of the input space, or for problems with moderate complexity and a higher number of input variables. CBFS techniques usually utilize a clustering approach [3] for the initialization of the rule centres and afterwards perform an optimization process in order to obtain the pseudo-optimal rule parameters (centres and weights) using gradient descent, constrained optimization [4], etc. The use of clustering approaches for the initialization of the rule centres is mainly based on the idea of performing clustering in the input space and associating a weight or functional value to each region of the input space. Nevertheless, this idea might be more appropriate for classification problems; in function approximation problems, input-space cluster interrelation does not necessarily carry over to output cluster interrelation. Input/output clustering techniques [1, 2] partially solve this problem since they consider the output variable(s) in the clustering process. The input/output CFA clustering algorithm [1], for example, performs an output-variance-weighted input-space clustering according to a modified distortion measure. In this paper we present a new approach for rule centre initialization that does not minimize a classical clustering distortion function, but instead uses the final function approximation error

    J = Σ_{m∈D} (f(x^m) − z^m)²    (1)
to place the centres pseudo-optimally. The idea of our approach is to place the centres so that the estimated error along each corresponding input space region is similar; equivalently, each centre is forced to have a similar error, according to Eq. 1, on each side of every input dimension. The rest of the paper is organized as follows. Section 2 presents and discusses our Error Equidistribution Method (EEM) for the initialization of centres in CBFS. Section 3 presents an example and compares our EEM approach with other previous clustering methodologies. Finally, in Section 4 we present the conclusions obtained from this work.
2 Error Equidistribution Method for Initialization of Rule Centres in CBFS for Function Approximation
In this section we present the new methodology proposed for the initialization of rule centres in CBFS for function approximation, in the context of a general learning methodology. Typically, the structure of a multiple-input single-output (MISO) Takagi-Sugeno-Kang (TSK) fuzzy system and its associated fuzzy inference method comprises a set of K IF-THEN rules of the form

    Rule_k : IF x1 is µ_1^k AND . . . AND xn is µ_n^k THEN y = Rk    (2)
where the µ_i^k are fuzzy sets characterized by membership functions µ_i^k(xi) in universes of discourse Ui (in which variables xi take their values), and where Rk are the consequents of the rules. The output of a fuzzy system with rules in the form shown in Eq. 2 can be expressed (using weighted average aggregation) as
    F(x) = ( Σ_{k=1}^{K} µ_k(x) y_k ) / ( Σ_{k=1}^{K} µ_k(x) )    (3)

provided that µ_k(x) is the activation value for the antecedent of rule k, which can be expressed as

    µ_k(x) = µ_1^k(x1) µ_2^k(x2) . . . µ_n^k(xn)    (4)
Given this formulation, the learning process in a CBFS with a fixed number of fuzzy rules can be subdivided into two main steps: optimization of the rule consequents and optimization of the rule antecedents, i.e. optimization of the membership function (MF) parameters.

Optimization of Fuzzy Rule Consequents. Given a fixed membership function configuration, we can obtain the rule consequents optimally (no matter the degree of the polynomial rule consequent). The Least Squares approach (LSE), by taking the partial derivatives of J (see Eq. 1) with respect to each of the consequent coefficients, yields a linear equation system that can be solved using any of the well-known mathematical methods for this purpose. In particular we will use Singular Value Decomposition (SVD), since it allows us to detect redundancies (that make the problem ill-conditioned) in the equation system and easily remove them.

Optimization of Fuzzy Rule Antecedents. Given a fixed number of rules, according to the function approximation problem formulation, we wish to minimize the error function J (see Eq. 1), but in this case the rule antecedent parameters (membership function parameters) cannot be expressed as a linear function with respect to J. Thus a gradient descent or a constrained optimization could be applied, making use of the optimal rule consequent coefficients calculation. But these techniques have the drawback that they can easily fall into local minima. Therefore, several approaches have been proposed for CBFS in order to find a good starting point for the rule centres, most of them based on clustering techniques. Traditional clustering algorithms used in CBFS attempt to place the rule centres according to the set of vectors selected by a clustering technique, typically a fuzzy clustering algorithm [3]. These clustering algorithms can be divided into two conceptually different families [1]: input clustering and input/output clustering. Here we present a novel approach, more intuitive from the point of view of the function approximation problem formulation, that is based on a previous work for GBFS [5].
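A minimal sketch of the consequent step, for zero-order (constant) consequents Rk and Gaussian MFs: with fixed membership functions, F(x) of Eq. 3 is linear in the Rk, so the minimisers of J solve a least-squares problem. NumPy's lstsq is SVD-based and also exposes rank deficiencies; all data and parameter values below are made up:

```python
import numpy as np

def optimise_consequents(X, z, centres, widths):
    """X: (M, n) inputs, z: (M,) targets.  Returns the optimal constant consequents R (K,)."""
    mu = np.exp(-0.5 * ((X[:, None, :] - centres) / widths) ** 2).prod(axis=2)  # (M, K)
    phi = mu / mu.sum(axis=1, keepdims=True)      # normalised firing strengths
    R, *_ = np.linalg.lstsq(phi, z, rcond=None)   # SVD-based least squares
    return R

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
z = np.sin(3 * X[:, 0]) + X[:, 1]
centres = np.array([[0.25, 0.25], [0.75, 0.75]])
widths = np.full((2, 2), 0.3)
print(optimise_consequents(X, z, centres, widths))
```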
2.1 Initialization of the Rule Centres Using the Error Equidistribution Method
For the general model we present in this paper, we will make use of Gaussian membership functions. Thus, the parameters to be optimized for each MF would be the centre (composed of one centre value per input dimension) and the width; for the sake of simplicity of our initialization approach, we will use one width per centre for every dimension, which will be automatically calculated using the nearest centre criterion [8]. Therefore, the only parameters that our initialization process will obtain are the rule (cluster) centres. The main purpose of our approach, instead of trying to minimize a classical distortion measure based on the distance of the training data points to the rule centres, is to place the rule centres so that the errors (according to Eq. 1) are homogeneously distributed over the whole output range. The methodology to obtain such a distribution of rule centres is as follows. Starting from a randomly initialized distribution of rule centres (or one obtained using any simple clustering approach like k-means [10]), we consider that a rule centre k is responsible for the error corresponding to each training point x^m according to the following formula

    J^k(x^m) = ( µ_k(x^m) / Σ_{j=1}^{K} µ_j(x^m) ) · (f(x^m) − z^m)²    (5)

being thus

    J = Σ_{m∈D} Σ_{k=1}^{K} J^k(x^m)    (6)
Every rule centre k will have associated parameters S_{i−}^k and S_{i+}^k that reflect the error according to Eq. 5 on the "left" (minus sign) and on the "right" (plus sign) of the centre c_i^k (i.e. the centre of the MF in rule k in dimension i):

    S_{i+}^k = Σ_{m∈D, x_i^m ≥ c_i^k} J^k(x^m)    (7)

    S_{i−}^k = Σ_{m∈D, x_i^m < c_i^k} J^k(x^m)    (8)
3 The SPIDA Wizard for Analysis Model Selection
Based on the techniques described in the preceding sections, we implemented a wizard for our data analysis tool SPIDA. In a series of dialogs the user specifies the data analysis problem (prediction, grouping, dependencies), chooses the data source and gives his preferences regarding the solution (explanation facility, type of explanation, simplicity of explanation, facility to take prior knowledge, adaptability, accuracy etc.). Figure 2 shows the dialog for specifying requirements for an explanation facility. Possible selections are a mixture of fuzzy terms like ’at least medium’ or ’simple’ for simplicity, and crisp terms like ’Rules’ and ’Functions’ for type of explanation. The dialogs for other preferences look very similar. A typical ranking of data analysis methods according to user preferences is shown in Fig. 3, where the match or compatibility of method properties with preferences is given as suitability. At this stage, no models have been created, so model properties like accuracy and simplicity are not taken into account for the suitability.
Fig. 2. Specifying preferences, here regarding an explanation facility
Fig. 3. Ranking of analysis models
The user can preselect the most suitable methods and trigger the creation of models for them. As already mentioned in Section 2, the wizard will then create models for each selected method, evaluate model properties afterwards and try to improve on the match with the respective desired properties. This is achieved by changing learning parameters of the methods, which have been collected from experts in the field. If no improvement can be achieved anymore, the final overall suitability can be shown. Figure 4 shows five different models of the Neuro-Fuzzy classifier Nefclass [9]. The user has asked for a simple model, so the wizard tried to force Nefclass to produce a simple solution while keeping the accuracy up. As can be seen in the figure, SPIDA produced three models with high simplicity,
but considerably different accuracy – in this case between 44% and 55% (the actual values for accuracy can be revealed in tool tips). The user can balance the importance of simplicity against accuracy as one of the preferences, so the wizard decides on the best model according to this. Nevertheless, the user can pick a different model based on the information in Fig. 4.

Fig. 4. Accuracy, simplicity and overall suitability of different Nefclass models

3.1 User Preferences and Method Properties
In the current version of the SPIDA wizard, we measure the suitability of a data analysis method according to the following method properties
– type of analysis problem (classification, function approximation, clustering, dependency analysis etc.)
– if an explanation facility exists
– type of explanation (rules or functions)
– adaptability to new data
– if prior knowledge can be integrated
and model properties
– simplicity of an explanation
– accuracy
Another conceivable model property is execution time, which can be crucial for real-time applications. Examples for property profiles are shown in Table 1.

Table 1. Property profiles for decision trees, neural networks and Nefclass
Method           Problem                          Explain   Adapt    Prior Knowl.
Decision Tree    classification                   rules     no       no
Neural Network   classification, func. approx.    no        medium   no
Nefclass         classification                   rules     high     yes
The method properties above are symbolic, whereas the model properties are numeric. In general, of course, this is not necessarily the case. For all numeric properties, fuzzy sets have to be defined as a granularisation of the underlying domain. For example, if accuracy is measured as a value in [0, 1], fuzzy sets for 'high', 'medium' and 'low' accuracy could be defined on [0, 1] as fuzzy values for accuracy. Since accuracy is heavily dependent on the application, the definition of the fuzzy terms is as well. We ask users to specify a desired accuracy and the lowest acceptable accuracy whenever they use the wizard. These two crisp accuracy values are then used as cross-over points for three trapezoidal membership functions for 'high', 'medium' and 'low'. In case the user cannot specify accuracy due to a lack of knowledge, accuracy will simply not be used to determine the suitability of an analysis model. For other properties, fuzzy sets can be defined accordingly, either by the user or by the expert who designs the wizard. Fuzzy sets can even be adapted by user feedback. If the wizard, for instance, recommends a supposedly simple model that is not simple at all from the user's perspective, the underlying fuzzy set can be changed accordingly (user profiling).

In the current version of the wizard, user preferences are specified at a similar level as desired method and model properties. They include
– type of analysis problem (classification, function approximation, clustering, dependency analysis etc.)
– importance of an explanation facility (do not care, nice to have, important)
– type of explanation (do not care, rules, functions)
– adaptability to new data (do not care, nice to have, important)
– integration of prior knowledge (do not care, nice to have, important)
– simplicity of an explanation
– accuracy
– balance importance of accuracy and simplicity

The mapping from user preferences onto desired properties is therefore quite simple, in some cases like accuracy almost a one-to-one relation like 'If accuracy preference is at least medium, then desired accuracy is medium or high'. For others like simplicity it is slightly more complicated, with rules like 'If simplicity preference is high and an explanation is important, then desired simplicity is medium (0.6) + high (1.0)'. The balance for the importance of accuracy and simplicity is not used to compute the suitability of models, since we can assume that the user has specified his preferences regarding these properties. It is only taken into account if several models of the same analysis method get the same suitability score, so the wizard can decide on the better one. The balance is also used when the wizard decides to rerun an analysis method with different learning parameters because accuracy and/or simplicity are not satisfactory. Depending on a combination of the accuracy and simplicity scores and their balance, the wizard changes parameters in order to improve on either accuracy or simplicity.

Some properties like the level of accuracy can easily be measured and compared for all models, whereas others like the level of simplicity are more difficult. In [10] we proposed a way to measure the interpretability of rule sets (crisp or fuzzy), which
can be used as a measure of simplicity for most rule-based models. Measuring the simplicity of models which are based on functions is more difficult, especially since we require such a measure to be comparable with a measure for rule sets (commensurability). Nevertheless, heuristically defined measures that take into account the number of arguments (as in rule sets) and the complexity of a function usually work well enough, in particular, since we finally evaluate simplicity on the basis of a handful of fuzzy values and not on the underlying continuous domain.
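As an illustration of the accuracy granularisation described above, the two user-supplied values could induce the three trapezoidal fuzzy sets as in the following sketch; the width of the transition zones (±0.05 around each cross-over point) is an assumption, since the exact shapes are not specified here:

```python
# Sketch: build 'low', 'medium', 'high' accuracy fuzzy sets from the lowest
# acceptable and the desired accuracy, used as cross-over points.
def accuracy_fuzzy_sets(lowest_acceptable, desired, slope=0.05):
    def trap(a, b, c, d):
        def mu(x):
            if x <= a or x >= d:
                return 0.0
            if b <= x <= c:
                return 1.0
            return (x - a) / (b - a) if x < b else (d - x) / (d - c)
        return mu
    low = trap(-1.0, -1.0, lowest_acceptable - slope, lowest_acceptable + slope)
    medium = trap(lowest_acceptable - slope, lowest_acceptable + slope,
                  desired - slope, desired + slope)
    high = trap(desired - slope, desired + slope, 2.0, 2.0)
    return {"low": low, "medium": medium, "high": high}

sets = accuracy_fuzzy_sets(lowest_acceptable=0.7, desired=0.9)
print(sets["low"](0.7), sets["medium"](0.8), sets["high"](0.95))  # 0.5 1.0 1.0
```

A fuzzy rule such as 'If accuracy preference is at least medium, then desired accuracy is medium or high' would then operate on the fuzzy values produced by such sets.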
4 Conclusion
As a new direction in automating data analysis, we introduced the concept of using soft constraints for the selection of an appropriate data analysis method. These constraints represent the user’s requirements regarding the analysis problem in terms of the actual problem (like prediction, clustering or finding dependencies) and preferences regarding the solution. Requirements can potentially be defined at any level of abstraction. Expert knowledge in terms of a fuzzy rule base maps high-level requirements onto required properties of data analysis methods which will then be matched to actual properties of analysis methods. As a result of our work, we introduced a new measure for the compatibility of fuzzy requirements with fuzzy properties that can be applied to other problems in the area of multi-criteria decision making. The methods presented above have been implemented as a wizard for our data analysis tool SPIDA, which has been successfully used to produce solutions to a variety of problems within BT, e.g. fraud detection, travel time prediction and customer satisfaction analysis.
References

1. Nauck, D., Spott, M., Azvine, B.: Spida – a novel data analysis tool. BT Technology Journal 21 (2003) 104–112
2. Spott, M.: Combining fuzzy words. In: Proc. of FUZZ-IEEE 2001, Melbourne, Australia (2001)
3. Spott, M.: Efficient reasoning with fuzzy words. In: Halgamuge, S.K., Wang, L., eds.: Computational Intelligence for Modelling and Predictions. Springer Verlag (2004) (to appear)
4. Gebhardt, J., Kruse, R.: The context model—an integrating view of vagueness and uncertainty. Intern. Journal of Approximate Reasoning 9 (1993) 283–314
5. Zadeh, L.A.: Fuzzy sets. Information and Control 8 (1965) 338–353
6. Sinha, D., Dougherty, E.: Fuzzification of set inclusion: theory and applications. FSS 55 (1993) 15–42
7. Cornelis, C., Van der Donck, C., Kerre, E.: Sinha-Dougherty approach to the fuzzification of set inclusion revisited. FSS 134 (2003) 283–295
8. Bouchon-Meunier, B., Rifqi, M., Bothorel, S.: Towards general measures of comparison of objects. Fuzzy Sets and Systems 84 (1996) 143–153
9. Nauck, D., Kruse, R.: A neuro-fuzzy method to learn fuzzy classification rules from data. FSS 89 (1997) 277–288
10. Nauck, D.: Measuring interpretability in rule-based classification systems. In: Proc. IEEE Int. Conf. on Fuzzy Systems 2003, St. Louis (2003) 196–201
Author Index
Aguzzoli, Stefano 650, 662 Alsinet, Teresa 353 Amgoud, Leila 269, 527 Arieli, Ofer 563 Avron, Arnon 625 Awad, Mohammed 613 Azvine, Ben 1014 Baroni, Pietro 329 Barrag´ ans Mart´ınez, A. Bel´en 638 Bell, David A. 465, 501 Ben Amor, Nahla 921, 944 Benferhat, Salem 452, 921 Benhamou, Bela¨ıd 477 Bennaim, Jonathan 452 Berthold, Michael R. 1002 Besnard, Philippe 427 Biazzo, Veronica 775 Bj¨ orkegren, Johan 136 Bonnefon, Jean-Francois 269 Borgelt, Christian 100, 1002 Bosc, Patrick 812 Bouckaert, Remco R. 221 Cano, Andr´es R. 908, 932 Capotorti, Andrea 750 Castellano, Javier G. 174, 908, 932 Cayrol, Claudette 366, 378 Ches˜ nevar, Carlos 353 Cholvy, Laurence 390 Cobb, Barry R. 27 Coletti, Giulianella 872 Cornelis, Chris 563 Coste-Marquis, Sylvie 317 Daniel, Milan 539, 824 D’Antona, Ottavio M. 650 de Campos, Luis M. 123, 174 Denœux, Thierry 552 Deschrijver, Glad 563 Devred, Caroline 317 D´ıaz Redondo, Rebeca P. 638 Dubois, Didier 293, 305, 848
Eklund, Patrik 341 Elouedi, Zied 921, 944 Fargier, H´el`ene 305 Farrokh, Arsalan 198 Fern´ andez Vilas, Ana 638 Fern´ andez-Luna, Juan M. 123 Flaminio, Tommaso 714 Flores, M. Julia 63 Fuentetaja, Raquel 88 G´ amez, Jos´e A. 63, 161 Gammerman, Alex 111 Ganascia, Jean-Gabriel 968 Garc´ıa Duque, Jorge 638 Garcia, Laurent 402 Garmendia, Luis 576, 587 Garrote, Luis 88 Gauwin, Olivier 514 Gebhardt, J¨ org 3 Georgescu, Irina 257 Gerla, Brunella 662 Giacomin, Massimiliano 329 Gil Solla, Alberto 638 Gilio, Angelo 775 Godo, Llu´ıs 353 G´ omez, Manuel 123 Gonz´ alez, Jes´ us 980 Guglielmann, Raffaella 600 Guill´en, Alberto 613, 980 Haenni, Rolf 788 Herrera, Luis Javier 613, 980 Huete, Juan F. 123 H¨ ullermeier, Eyke 848 Hunter, Anthony 415 Ikodinovi´c, Nebojˇsa Ironi, Liliana 600
726
Jeansoulin, Robert 452 Jenhani, Ilyes 944 Jensen, Finn V. 76 Jin, Zhi 440 Jøsang, Audun 824
Kaci, Souhila 281, 293, 527 Kerre, Etienne 563 Khelfallah, Mahat 452, 477 Klawonn, Frank 992 Konieczny, S´ebastien 514 Kramosil, Ivan 884 Krishnamurthy, Vikram 198 Kruse, Rudolf 3, 100 Lagasquie-Schiex, Marie Christine 378 Lagrue, Sylvain 452 Lang, J´erˆ ome 15 Larra˜ naga, Pedro 148 Lawry, Jonathan 896 Lee, Jae-Hyuck 186 Li, Wenhui 836 Lindgren, Helena 341 Liu, Weiru 415, 440, 465, 501 L´ opez Nores, Mart´ın 638 Lozano, Jose A. 148 Lu, Ruqian 440 Lucas, Peter 244 Lukasiewicz, Thomas 737 Luo, Zhiyuan 111 Majercik, Stephen M. 209 Manara, Corrado 662 Marchioni, Enrico 701 Marquis, Pierre 317, 514 Marra, Vincenzo 650 Mart´ınez, Irene 51 Masegosa, Andr´es R. 908, 932 Mellouli, Khaled 944 Mercier, David 552 Meyer, Thomas 489 Miranda, Enrique 860 Molina, Martin 88 Moral, Seraf´ın 1, 51, 63, 908, 932 Mu, Kedian 440 Nauck, Detlef D. 1014 Neufeld, Eric 233 Nicolas, Pascal 402 Nielsen, Thomas D. 76 Ognjanovi´c, Zoran Papini, Odile 452 Patterson, David E. Pazos Arias, Jos´e J. Pe˜ na, Jose M. 136
726 1002 638
366,
Perrussel, Laurent 489 Pini, Maria Silvia 800 Pivert, Olivier 812 Pomares, H´ector 613, 980 Poole, David 763 Pope, Simon 824 Pozos Parra, Pilar 489 Prade, Henri 269, 293, 675 Puerta, J. Miguel 161 Qi, Guilin 465, 501 Qin, Zengchang 896 Quost, Benjamin 552 Ramos Cabrer, Manuel 638 Rehm, Frank 992 Rodr´ıguez, Carmelo 51 Rojas, Ignacio 613, 980 Rossi, Francesca 800 Rum´ı, Rafael 39 Salmer´ on, Antonio 39, 51 Salvador, Adela 576, 587 Sanscartier, Manon J. 233 Santaf´e, Guzm´ an 148 Serrurier, Mathieu 675 Shenoy, Prakash P. 27 Simari, Guillermo 353 Smets, Philippe 956 Smyth, Clinton 763 Spott, Martin 1014 St´ephan, Igor 402 Straccia, Umberto 687 Studen´ y, Milan 221 Sun, Haibin 836 Tegn´er, Jesper
136
Valenzuela, Olga 613 van der Torre, Leendert 281 van der Weide, Theo 244 van Gerven, Marcel 244 Vannoorenberghe, Patrick 956 Vantaggi, Barbara 872 Velcin, Julien 968 Venable, Brent 800 W¨ urbel, Eric 452 Wilson, Nic 452 Zagoraiou, Maroussa
750