Subseries of Lecture Notes in Computer Science
3
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo
Discovery Science 6th International Conference, DS 2003 Sapporo, Japan, October 17-19, 2003 Proceedings
13
Volume Editors Gunter Grieser Technical University Darmstadt Alexanderstr. 10, 64283 Darmstadt, Germany E-mail:
[email protected] Yuzuru Tanaka Akihiro Yamamoto Hokkaido University MemeMedia Laboratory N-13, W-8, Sapporo, 060-8628, Japan E-mail: {tanaka;yamamoto}@meme.hokudai.ac.jp
Cataloging-in-Publication Data applied for A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at .
CR Subject Classification (1998): H.2.8, I.2, H.3, J.1, J.2 ISSN 0302-9743 ISBN 3-540-20293-5 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin GmbH Printed on acid-free paper SPIN: 10964132 06/3142 543210
Preface
This volume contains the papers presented at the 6th International Conference on Discovery Science (DS 2003) held at Hokkaido University, Sapporo, Japan, during 17–19 October 2003. The main objective of the discovery science (DS) conference series is to provide an open forum for intensive discussions and the exchange of new information among researchers working in the area of discovery science. It has become a good custom over the years that the DS conference is held in parallel with the International Conference on Algorithmic Learning Theory (ALT). This combination of ALT and DS allows for a comprehensive treatment of the whole range, from theoretical investigations to practical applications. Continuing the good tradition, DS 2003 was co-located with the 14th ALT conference (ALT 2003). The proceedings of ALT 2003 were published as a twin volume 2842 of the LNAI series. The DS conference series has been supervised by the international steering committee chaired by Hiroshi Motoda (Osaka University, Japan). The other members are Alberto Apostolico (Univ. of Padova, Italy and Purdue University, USA), Setsuo Arikawa (Kyushu University, Japan), Achim Hoffmann (UNSW, Australia), Klaus P. Jantke (DFKI, Germany), Masahiko Sato (Kyoto University, Japan), Ayumi Shinohara (Kyushu University, Japan), Carl H. Smith (University of Maryland, USA), and Thomas Zeugmann (University of L¨ ubeck, Germany). We received 80 submissions, out of which 18 long and 29 short papers were selected by the program committee, based on clarity, significance, and originality, as well as on relevance to the field of discovery science. The DS 2003 conference had two paper categories, long and short papers. Long papers were presented as 25-minute talks, while short papers were presented as 5-minute talks accompanied by poster presentations. Due to the limited time of the conference, some long submissions could only be accepted as short papers. Some authors of those papers decided not to submit final versions or to just publish abstracts of their papers. This volume consists of three parts. The first part contains the invited talks of ALT 2003 and DS 2003. These talks were given by Thomas Eiter (Technische Universit¨ at Wien, Austria), Genshiro Kitagawa (The Institute of Statistical Mathematics, Japan), Akihiko Takano (National Institute of Informatics, Japan), Naftali Tishby (The Hebrew University, Israel), and Thomas Zeugmann (University of L¨ ubeck, Germany). Because the invited talks were shared by the DS 2003 and ALT 2003 conferences, this volume contains the full versions of Thomas Eiter’s, Genshiro Kitagawa’s, and Akihiko Takano’s talks, as well as abstracts of the talks by the others. The second part of this volume contains the accepted long papers, and the third part contains the accepted short papers.
VI
Preface
We would like to express our gratitude to our program committee members and their subreferees who did great jobs in reviewing and evaluating the submissions, and who made the final decisions through intensive discussions to ensure the high quality of the conference. Furthermore, we thank everyone who led this conference to a great success: the authors for submitting papers, the invited speakers for accepting our invitations and giving stimulating talks, the steering committee and the sponsors for their support, the ALT Chairpersons Klaus P. Jantke, Ricard Gavalda, and Eiji Takimoto for their collaboration, and, last but not least, Makoto Haraguchi and Yoshiaki Okubo (both Hokkaido University, Japan) for their local arrangement of the twin conferences.
October 2003
Gunter Grieser Yuzuru Tanaka Akihiro Yamamoto
Organization
Conference Chair Yuzuru Tanaka
Hokkaido University, Japan
Program Committee Gunter Grieser (Co-chair) Akihiro Yamamoto (Co-chair) Simon Colton Vincent Corruble Johannes F¨ urnkranz Achim Hoffmann Naresh Iyer John R. Josephson Eamonn Keogh Mineichi Kudo Nicolas Lachiche Steffen Lange Lorenzo Magnani Michael May Hiroshi Motoda Nancy J. Nersessian Vladimir Pericliev Jan Rauch Henk W. de Regt Ken Satoh Tobias Scheffer Einoshin Suzuki Masayuki Takeda Ljupˇco Todorovski Gerhard Widmer
Technical University, Darmstadt, Germany Hokkaido University, Japan Imperial College London, UK Universit´e P. et M. Curie Paris, France Research Institute for AI, Austria University of New South Wales, Australia GE Global Research Center, USA Ohio State University, USA University of California, USA Hokkaido University, Japan University of Strasbourg, France DFKI GmbH, Germany University of Pavia, Italy Fraunhofer AIS, Germany Osaka University, Japan Georgia Institute of Technology, USA Academy of Sciences, Sofia, Bulgaria University of Economics, Prague, Czech Republic Free University Amsterdam, The Netherlands National Institute of Informatics, Japan Humboldt University, Berlin, Germany Yokohama National University, Japan Kyushu University, Japan Joˇzef Stefan Institute, Ljubljana, Slovenia University of Vienna, Austria
Local Arrangements Makoto Haraguchi (Chair) Yoshiaki Okubo
Hokkaido University, Japan Hokkaido University, Japan
VIII
Organization
Subreferees Hideo Bannai Rens Bod Marco Chiarandini Nigel Collier Pascal Divoux Thomas G¨artner Peter Grigoriev Makoto Haraguchi Katsutoshi Hirayama Akira Ishino Takayuki Ito Ai Kawazoe Nikolay Kirov Edda Leopold
Tsuyoshi Murata Atsuyoshi Nakamura Luis F. Paquete Lourdes Pe˜ na Castillo Johann Petrak Detlef Prescher Ehud Reither Alexandr Savinov Tommaso Schiavinotto Alexander K. Seewald Esko Ukkonen Serhiy Yevtushenko Kenichi Yoshida Sandra Zilles
Sponsors Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) The Suginome Memorial Foundation, Japan
Table of Contents
Invited Talks Abduction and the Dualization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thomas Eiter, Kazuhisa Makino
1
Signal Extraction and Knowledge Discovery Based on Statistical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Genshiro Kitagawa
21
Association Computation for Information Access . . . . . . . . . . . . . . . . . . . . . . Akihiko Takano
33
Efficient Data Representations That Preserve Information . . . . . . . . . . . . . . Naftali Tishby
45
Can Learning in the Limit Be Done Efficiently? . . . . . . . . . . . . . . . . . . . . . . . Thomas Zeugmann
46
Long Papers Discovering Frequent Substructures in Large Unordered Trees . . . . . . . . . . Tatsuya Asai, Hiroki Arimura, Takeaki Uno, Shin-ichi Nakano
47
Discovering Rich Navigation Patterns on a Web Site . . . . . . . . . . . . . . . . . . . Karine Chevalier, C´ecile Bothorel, Vincent Corruble
62
Mining Frequent Itemsets with Category-Based Constraints . . . . . . . . . . . . Tien Dung Do, Siu Cheung Hui, Alvis Fong
76
Modelling Soil Radon Concentration for Earthquake Prediction . . . . . . . . . Saˇso Dˇzeroski, Ljupˇco Todorovski, Boris Zmazek, Janja Vaupotiˇc, Ivan Kobal
87
Dialectical Evidence Assembly for Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Alistair Fletcher, John Davis Performance Analysis of a Greedy Algorithm for Inferring Boolean Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 Daiji Fukagawa, Tatsuya Akutsu Performance Evaluation of Decision Tree Graph-Based Induction . . . . . . . . 128 Warodom Geamsakul, Takashi Matsuda, Tetsuya Yoshida, Hiroshi Motoda, Takashi Washio
X
Table of Contents
Discovering Ecosystem Models from Time-Series Data . . . . . . . . . . . . . . . . 141 Dileep George, Kazumi Saito, Pat Langley, Stephen Bay, Kevin R. Arrigo An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets and Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 153 Xiaoshu Hang, Honghua Dai Extraction of Coverings as Monotone DNF Formulas . . . . . . . . . . . . . . . . . . 166 Kouichi Hirata, Ryosuke Nagazumi, Masateru Harao What Kinds and Amounts of Causal Knowledge Can Be Acquired from Text by Using Connective Markers as Clues? . . . . . . . . . . . . . . . . . . . . . . . . . 180 Takashi Inui, Kentaro Inui, Yuji Matsumoto Clustering Orders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 Toshihiro Kamishima, Jun Fujiki Business Application for Sales Transaction Data by Using Genome Analysis Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 Naoki Katoh, Katsutoshi Yada, Yukinobu Hamuro Improving Efficiency of Frequent Query Discovery by Eliminating Non-relevant Candidates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 J´erˆ ome Maloberti, Einoshin Suzuki Chaining Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Taneli Mielik¨ ainen An Algorithm for Discovery of New Families of Optimal Regular Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 Oleg Monakhov, Emilia Monakhova Enumerating Maximal Frequent Sets Using Irredundant Dualization . . . . . 256 Ken Satoh, Takeaki Uno Discovering Exceptional Information from Customer Inquiry by Association Rule Miner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 Keiko Shimazu, Atsuhito Momma, Koichi Furukawa
Short Papers Automatic Classification for the Identification of Relationships in a Meta-data Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 Gerd Beuster, Ulrich Furbach, Margret Gross-Hardt, Bernd Thomas Effects of Unreliable Group Profiling by Means of Data Mining . . . . . . . . . 291 Bart Custers
Table of Contents
XI
Using Constraints in Discovering Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . 297 Saˇso Dˇzeroski, Ljupˇco Todorovski, Peter Ljubiˇc SA-Optimized Multiple View Smooth Polyhedron Representation NN . . . . 306 Mohamad Ivan Fanany, Itsuo Kumazawa Elements of an Agile Discovery Environment . . . . . . . . . . . . . . . . . . . . . . . . . 311 Peter A. Grigoriev, Serhiy A. Yevtushenko Discovery of User Preference in Personalized Design Recommender System through Combining Collaborative Filtering and Content Based Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320 Kyung-Yong Jung, Jason J. Jung, Jung-Hyun Lee Discovery of Relationships between Interests from Bulletin Board System by Dissimilarity Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328 Kou Zhongbao, Ban Tao, Zhang Changshui A Genetic Algorithm for Inferring Pseudoknotted RNA Structures from Sequence Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336 Dongkyu Lee, Kyungsook Han Prediction of Molecular Bioactivity for Drug Design Using a Decision Tree Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344 Sanghoon Lee, Jihoon Yang, Kyung-whan Oh Mining RNA Structure Elements from the Structure Data of Protein-RNA Complexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352 Daeho Lim, Kyungsook Han Discovery of Cellular Automata Rules Using Cases . . . . . . . . . . . . . . . . . . . . 360 Ken-ichi Maeda, Chiaki Sakama Discovery of Web Communities from Positive and Negative Examples . . . . 369 Tsuyoshi Murata Association Rules and Dempster-Shafer Theory of Evidence . . . . . . . . . . . . 377 Tetsuya Murai, Yasuo Kudo, Yoshiharu Sato Subgroup Discovery among Personal Homepages . . . . . . . . . . . . . . . . . . . . . . 385 Toyohisa Nakada, Susumu Kunifuji Collaborative Filtering Using Projective Restoration Operators . . . . . . . . . 393 Atsuyoshi Nakamura, Mineichi Kudo, Akira Tanaka, Kazuhiko Tanabe Discovering Homographs Using N-Partite Graph Clustering . . . . . . . . . . . . 402 Hidekazu Nakawatase, Akiko Aizawa
XII
Table of Contents
Discovery of Trends and States in Irregular Medical Temporal Data . . . . . 410 Trong Dung Nguyen, Saori Kawasaki, Tu Bao Ho Creating Abstract Concepts for Classification by Finding Top-N Maximal Weighted Cliques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418 Yoshiaki Okubo, Makoto Haraguchi Content-Based Scene Change Detection of Video Sequence Using Hierarchical Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426 Jong-Hyun Park, Soon-Young Park, Seong-Jun Kang, Wan-Hyun Cho An Appraisal of UNIVAUTO – The First Discovery Program to Generate a Scientific Article . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434 Vladimir Pericliev Scilog: A Language for Scientific Processes and Scales . . . . . . . . . . . . . . . . . 442 Joseph Phillips Mining Multiple Clustering Data for Knowledge Discovery . . . . . . . . . . . . . 452 Thanh Tho Quan, Siu Cheung Hui, Alvis Fong Bacterium Lingualis – The Web-Based Commonsensical Knowledge Discovery Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460 Rafal Rzepka, Kenji Araki, Koji Tochinai Inducing Biological Models from Temporal Gene Expression Data . . . . . . . 468 Kazumi Saito, Dileep George, Stephen Bay, Jeff Shrager Knowledge Discovery on Chemical Reactivity from Experimental Reaction Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470 Hiroko Satoh, Tadashi Nakata A Method of Extracting Related Words Using Standardized Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478 Tomohiko Sugimachi, Akira Ishino, Masayuki Takeda, Fumihiro Matsuo Discovering Most Classificatory Patterns for Very Expressive Pattern Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486 Masayuki Takeda, Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, Setsuo Arikawa Mining Interesting Patterns Using Estimated Frequencies from Subpatterns and Superpatterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494 Yukiko Yoshida, Yuiko Ohta, Ken’ichi Kobayashi, Nobuhiro Yugami
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
Abduction and the Dualization Problem Thomas Eiter1 and Kazuhisa Makino2 1
2
Institut f¨ur Informationssysteme, Technische Universit¨at Wien, Favoritenstraße 9-11, A-1040 Wien, Austria,
[email protected] Division of Mathematical Science for Social Systems, Graduate School of Engineering Science, Osaka University, Toyonaka, Osaka, 560-8531, Japan,
[email protected] Abstract. Computing abductive explanations is an important problem, which has been studied extensively in Artificial Intelligence (AI) and related disciplines. While computing some abductive explanation for a literal χ with respect to a set of abducibles A from a Horn propositional theory Σ is intractable under the traditional representation of Σ by a set of Horn clauses, the problem is polynomial under model-based theory representation, where Σ is represented by its characteristic models. Furthermore, computing all the (possibly exponentially) many explanations is polynomial-time equivalent to the problem of dualizing a positive CNF, which is a well-known problem whose precise complexity in terms of the theory of NP-completeness is not known yet. In this paper, we first review the monotone dualization problem and its connection to computing all abductive explanations for a query literal and some related problems in knowledge discovery. We then investigate possible generalizations of this connection to abductive queries beyond literals. Among other results, we find that the equivalence for generating all explanations for a clause query (resp., term query) χ to the monotone dualization problem holds if χ contains at most k positive (resp., negative) literals for constant k, while the problem is not solvable in polynomial total-time, i.e., in time polynomial in the combined size of the input and the output, unless P=NP for general clause resp. term queries. Our results shed new light on the computational nature of abduction and Horn theories in particular, and might be interesting also for related problems, which remains to be explored. Keywords: Abduction, monotone dualization, hypergraph transversals, Horn functions, model-based reasoning, polynomial total-time computation, NPhardness.
1
Introduction
Abduction is a fundamental mode of reasoning, which was extensively studied by C.S. Peirce [54]. It has taken on increasing importance in Artificial Intelligence (AI) and
This work was supported in part by the Austrian Science Fund (FWF) Project Z29-N04, by a TU Wien collaboration grant, and by the Scientific Grant in Aid of the Ministry of Education, Science, Sports, Culture and Technology of Japan.
G. Grieser et al. (Eds.): DS 2003, LNAI 2843, pp. 1–20, 2003. c Springer-Verlag Berlin Heidelberg 2003
2
T. Eiter and K. Makino
related disciplines, where it has been recognized as an important principle of commonsense reasoning (see e.g. [9]). Abduction has applications in many areas of AI and Computer Science including diagnosis, database updates, planning, natural language understanding, learning etc. (see e.g. references in [21]), where it is primarily used for generating explanations. Specialized workshops have been held un the recent years, in which the nature and interrelation with other modes of reasoning, in particular induction and deduction, have been investigated. In a logic-based setting, abduction can be seen as the task to find, given a set of formulas Σ (the background theory), a formula χ (the query), and a set of formulas A (the abducibles or hypotheses), a minimal subset E of A such that Σ plus E is satisfiable and logically entails χ (i.e., an explanation). A frequent scenario is where Σ is a propositional Horn theory, χ is a single literal or a conjunction of literals, and A contains literals. For use in practice, the computation of abductive explanations in this setting is an important problem, for which well-known early systems such as Theorist [55] or ATMS solvers [13,56] have been devised. Since then, there has been a growing literature on this subject. Computing some explanation for a query literal χ from a Horn theory Σ w.r.t. assumptions A is a well-known NP-hard problem [57], even if χ and A are positive. Much effort has been spent on studying various input restrictions, cf. [29,11,27,16,15,21, 57,58,59], in order to single out tractable cases of abduction. For example, the case where A comprises all literals is tractable; such explanations are assumption-free explanations. It turned out that abduction is tractable in model-based reasoning, which has been proposed as an alternative form of representing and accessing a logical knowledge base, cf. [14,34,35,36,42,43]. Model-based reasoning can be seen as an approach towards Levesque’s notion of “vivid” reasoning [44], which asks for a more straight representation of the background theory Σ from which common-sense reasoning is easier and more suitable than from the traditional formula-based representation. In model-based reasoning, Σ is represented by a subset S of its models, which are commonly called characteristic models, rather than by a set of formulas. Given a suitable query χ, the test for Σ |= χ becomes then as easy as to check whether χ is true in all models of S, which can be decided efficiently. We here mention that formula-based and the model-based approach are orthogonal, in the sense that while a theory may have small representation in one formalism, it has an exponentially larger representation in the other. The intertranslatability of the two approaches, in particular for Horn theories, has been addressed in [34,35,36,40,42]. Several techniques for efficient model-based representation of various fragments of propositional logic have been devised, cf. [35,42,43]. As shown by Kautz et al., an explanation for a positive literal χ = q w.r.t. assumptions A from a Horn theory Σ, represented by its set of characteristic models, char(Σ), can be computed in polynomial time [34,35,42]; this results extends to negative literal queries χ = q as well, and has been generalized by Khardon and Roth [42] to other fragments of propositional logic. Hence, model-based representation is attractive from this view point of finding efficiently some explanation. 
While computing some explanation of a query χ has been studied extensively in the literature, computing multiple or even all explanations for χ has received less attention. However, this problem is important, since often one would like to select one out of a
Abduction and the Dualization Problem
3
set of alternative explanations according to a preference or plausibility relation; this relation may be based on subjective intuition and thus difficult to formalize. As easily seen, exponentially many explanations may exist for a query, and thus computing all explanations inevitably requires exponential time in general, even in propositional logic. However, it is of interest whether the computation is possible in polynomial total-time (or output-polynomial time), i.e., in time polynomial in the combined size of the input and the output. Furthermore, if exponential space is prohibitive, it is of interest to know whether a few explanations (e.g., polynomially many) can be generated in polynomial time, as studied by Selman and Levesque [58]. In general, computing all explanations for a literal χ (positive as well as negative) w.r.t. assumptions A from a Horn theory Σ is under formula-based representation not possible in polynomial total-time unless P=NP; this can be shown by standard arguments appealing to the NP-hardness of deciding the existence of some explanation. For generating all assumption-free explanations for a positive literal, a resolution-style procedure has been presented in [24] which works in polynomial total-time, while for a negative literal no polynomial total-time algorithm exists unless P=NP [25]. However, under model-based representation, such results are not known. It turned out that generating all explanation for a literal is polynomial-time equivalent to the problem of dualizing a monotone CNF expression (cf. [2,20,28]), as shown in [24]. Here, polynomial-time equivalence means mutual polynomial-time transformability between deterministic functions, i.e., A reduces to B, if there are polynomial-time functions f, g such that for any input I of A, f (I) is an input of B, and if O is the output for f (I), then g(O) is the output of I, cf. [52]; moreover, O is requested to have size polynomial in the size of the output for I (otherwise, trivial reductions may exist). This result, for definite Horn theories and positive literals, is implicit also in earlier work on dependency inference [49,50], and is closely related to results in [40]. The monotone dualization problem is an interesting open problem in the theory of NP-completeness (cf. [45,53]), which has a number of applications in different areas of Computer Science, [2,19], including logic and AI [22]; the problem is reviewed in Section 2.2 where also briefly related problems in knowledge discovery are mentioned. In the rest of this paper, we first review the result on equivalence between monotone dualization and generating all explanations for a literal under model-based theory representation. We then consider possible generalizations of this result for queries χ beyond literals, where we consider DNF, CNF and important special cases such as a clause and a term (i.e., a conjunction of literals). Note that the explanations for single clause queries correspond to the minimal support clauses for a clause in Clause Management Systems [56,38,39]. Furthermore, we shall consider on the fly also some of these cases under formula-based theory representation. Our aim will be to elucidate the frontier of monotone dualization equivalent versus intractable instances, i.e., not solvable in polynomial total-time unless P=NP, of the problem. It turns out that indeed the results in [24] generalize to clause and term queries under certain restrictions. 
In particular, the equivalence for generating all explanations for a clause query (resp., term query) χ to the monotone dualization holds if χ contains at most k positive (resp., negative) literals for constant k, while the problem is not solvable in polynomial total-time unless P=NP for general clause (resp., term) queries.
4
T. Eiter and K. Makino
Our results shed new light on the computational nature of abduction and Horn theories in particular, and might be interesting also for related problems, which remains to be explored.
2
Notation and Concepts
We assume a propositional (Boolean) setting with atoms x1 , x2 , . . . , xn from a set At, where each xi takes either value 1 (true) or 0 (false). Negated atoms are denoted by xi , and the opposite of a literal by . Furthermore, we use A = { | ∈ A} for any set of literals A and set Lit = At ∪ At. A any finite set of formulas. theory Σ is A clause is a disjunction c = p∈P (c) p ∨ p∈N (c) p of literals, where P (c) and N (c) are respectively the sets of atoms occurring positively andnegatively in c and P (c) ∩ N (c) = ∅. Dually, a term is a conjunction t = p∈P (t) p ∧ p∈N (t) p of literals, where P (t) and N (t) are similarly defined. A conjunctive normal form (CNF) is a conjunction of clauses, and a disjunctive normal form (DNF) is a disjunction of terms. As common, we also view clauses c and terms t as the sets of literals they contain, and similarly CNFs ϕ and DNFs ψ as sets of clauses and terms, respectively, and write ∈ c, c ∈ ϕ etc. A clause c is prime w.r.t. theory Σ, if Σ |= c but Σ |= c for every c ⊂ c. A CNF ϕ is prime, if each c ∈ ϕ is prime, and irredundant, if ϕ \ {c} ≡ ϕ for every c ∈ ϕ. Prime terms and irredundant prime DNFs are defined analogously. A clause c is Horn, if |P (c)| ≤ 1 and negative (resp., positive), if |P (c)| = 0 (resp., |N (c)| = 0). A CNF is Horn (resp., negative, positive), if it contains only Horn clauses (resp., negative, positive clauses). A theory Σ is Horn, if it is a set of Horn clauses. As usual, we identify Σ with ϕ = c∈Σ c. Example 1. The CNF ϕ = (x1 ∨x4 )∧(x4 ∨x3 )∧(x1 ∨x2 )∧(x4 ∨x5 ∨x1 )∧(x2 ∨x5 ∨x3 ) over At = {x1 , x2 , . . . , x5 } is Horn. The following proposition is well-known. Proposition 1. Given a Horn CNF ϕ and a clause c, deciding whether ϕ |= c is possible in polynomial time (in fact, in linear time, cf. [18]). Horn theories have a well-known semantic characterization. A model is a vector v ∈ {0, 1}n , whose i-th component is denoted by vi . For B ⊆ {1, . . . , n}, we let xB be the model v such that vi = 1, if i ∈ B and vi = 0, if i ∈ / B, for i ∈ {1, . . . , n}. The notions of satisfaction v |= ϕ of a formula ϕ and consequence Σ |= ϕ, ψ |= ϕ etc. are as usual; the set of models of ϕ (resp., theory Σ), is denoted by mod(ϕ) (resp., mod(Σ)). In the example above, the vector u = (0, 1, 0, 1, 0) is a model of ϕ, i.e., u |= ϕ. For models v, w, we denote by v ≤ w the usual componentwise ordering, i.e., vi ≤ wi for all i = 1, 2, . . . , n, where 0 ≤ 1; v < w means v = w and v ≤ w. For any set of models M , we denote by max(M ), (resp., min(M )) the set of all maximal (resp., minimal) models in M . We denote by v w componentwise AND of vectors v, w ∈{0, 1}n (i.e., their intersection), and by Cl∧ (S) the closure of S ⊆ {0, 1}n under . Then, a theory Σ is Horn representable, iff mod(Σ) = Cl∧ (mod(Σ)).
Abduction and the Dualization Problem
5
Example 2. Consider M1 = {(0101), (1001), (1000)} and M2 = {(0101), (1001), (1000),(0001), (0000)}. Then, for v = (0101), w = (1000), we have w, v ∈ M1 , while v w = (0000) ∈ / M1 ; hence M1 is not the set of models of a Horn theory. On the other hand, Cl∧ (M2 ) = M2 , thus M2 = mod(Σ2 ) for some Horn theory Σ2 . As discussed by Kautz et al. [34], a Horn theory Σ is semantically represented by its characteristic models, where v ∈ mod(Σ) is called characteristic (or extreme [14]), if v ∈ Cl∧ (mod(Σ) \ {v}). The set of all such models, the characteristic set of Σ, is denoted by char(Σ). Note that char(Σ) is unique. E.g., (0101) ∈ char(Σ2 ), while (0000) ∈ / char(Σ2 ); we have char(Σ2 ) = M1 . The following proposition is compatible with Proposition 1 Proposition 2. Given char(Σ), and a clause c, deciding whether Σ |= c is possible in polynomial time (in fact, in linear time, cf. [34,26]). The model-based reasoning paradigm has been further investigated e.g. in [40,42], where also theories beyond Horn have been considered [42]. 2.1 Abductive Explanations The notion of an abductive explanation can be formalized as follows. Definition 1. Given a (Horn) theory Σ, called the background theory, a CNF χ (called query), and a set of literals A ⊆ Lit (called abducibles), an explanation of χ w.r.t. A is a minimal set of literals E over A such that (i) Σ ∪ E |= χ, and (ii) Σ ∪ E is satisfiable. If A = Lit, then we call E simply an explanation of χ. The above definition generalizes the assumption-based explanations of [58], which emerge as A=P ∪ P where P ⊆ P (i.e., A contains all literals over a subset P of the letters) and χ = q for some atom q. Furthermore, in some texts (e.g., [21]) explanations must be sets of positive literals, and χ is restricted to a special form; in [21], χ is requested to be a conjunction of atoms. The following characterization of explanations is immediate from the definition. Proposition 3. For any theory Σ, any query χ, and any E ⊆ A(⊆ Lit), E is an explanation for χ w.r.t. A from Σ iff the following conditions hold: (i) Σ ∪E is satisfiable, (ii) Σ ∪ E |= χ, and (iii) Σ ∪ (E \ {}) |= χ, for every ∈ E. Example 3. Reconsider the Horn CNF ϕ = (x1 ∨ x4 ) ∧ (x4 ∨ x3 ) ∧ (x1 ∨ x2 ) ∧ (x4 ∨ x5 ∨ x1 ) ∧ (x2 ∨ x5 ∨ x3 ) from above. Suppose we want to explain χ = x2 from A = {x1 , x4 }. Then, we find that E = {x1 } is an explanation. Indeed, Σ ∪ {x1 } |= x2 , and Σ ∪ {x1 } is satisfiable; moreover, E is minimal. On the other hand, E = {x1 , x4 } satisfies (i) and (ii) for χ = x2 , but is not minimal. We note that there is a close connection between the explanations of a literal and the prime clauses of a theory.
6
T. Eiter and K. Makino
Proposition 4 (cf. [56,32]). For any theory Σ and literalχ, a set E ⊆ A(⊆ Lit) with E ={χ} is an explanation of χ w.r.t. A, iff the clause c = ∈E ∨ χ is a prime clause of Σ. Thus, computing explanations for a literal is easily seen to be polynomial-time equivalent to computing prime clauses of a certain form. We refer here to [51] for an excellent survey of techniques for computing explanations via computing prime implicates and related problems. 2.2
Dualization Problem
Dualization of Boolean functions (i.e., given a formula ϕ defining a function f , compute a formula ψ for the dual function f d ) is a well-known computational problem. The problem is trivial if ψ may be any formula, since we only need to interchange ∨ and ∧ in ϕ (and constants 0 and 1, if occurring). The problem is more difficult if ψ should be of special form. In particular, if ϕ is a CNF and ψ should be a irredundant prime CNF (to avoid redundancy); this problem is known as Dualization [22]. For example, if ϕ = (x1 ∨ x3 )(x2 ∨ x3 ), then a suitable ψ would be (x2 ∨ x3 )(x1 ∨ x3 ), since (x1 ∧ x3 ) ∨ (x2 ∧ x3 ) ≡ (x1 ∨ x2 )(x2 ∨ x3 )(x1 ∨ x3 ) simplifies to it. Clearly, ψ may have size exponential in the size of ϕ, and thus the issue is here whether a polynomial total-time algorithms exists (rather than one polynomial in the input size). While it is easy to see that the problem is not solvable in polynomial total unless P=NP, this result could neither be established for the important class of positive (monotone) Boolean functions so far, nor is a polynomial total-time algorithm known to date, cf. [23,28,37]. Note that for this subproblem, denoted Monotone Dualization, the problem looks simpler: all prime clauses of a monotone Boolean function f are positive and f has a unique prime CNF, which can be easily computed from any given CNF (just remove all negative literals and non-minimal clauses). Thus, in this case the problem boils down to convert a prime positive DNF ϕ constructed from ϕ into the equivalent prime (monotone) CNF. An important note is that Monotone Dualization is intimately related to its decisional variant, Monotone Dual, since Monotone Dualization is solvable in polynomial total-time iff Monotone Dual is solvable in polynomial time cf. [2]. Monotone Dual consists of deciding whether a pair of CNFs ϕ, ψ whether ψ is the prime CNF for the dual of the monotone function represented by ϕ (strictly speaking, this is a promise problem [33], since valid input instances are not recognized in polynomial time. For certain instances such as positive ϕ, this is ensured). A landmark result on Monotone Dual was [28], which presents an algorithm solving the problem in time no(log n) . More recently, algorithms have been exhibited [23,37] which show that the complementary problem can be solved with limited nondeterminism in polynomial time, i.e, by a nondeterministic polynomial-time algorithm that makes only a poly-logarithmic number of guesses in the size of the input. Although it is still open whether Monotone Dual is polynomially solvable, several relevant tractable classes were found by various authors (see e.g. [8,12,17,20,30,47,47] and references therein). A lot of research efforts have been spent on Monotone Dualization and Monotone Dual (see survey papers, e.g. [45,53,22]), since a number of problems turned
Abduction and the Dualization Problem
7
out to be polynomial-time equivalent to this problem; see e.g. [2,19,20] and the more paper [22]. Polynomial-time equivalence of computation problems Π and Π is here understood in the sense that problem Π reduces to Π and vices versa, where Π reduces to Π , if there a polynomial functions f, g s.t. for any input I of Π, f (I) is an input of Π , and if O is the output for f (I), then g(O) is the output of I, cf. [52]; moreover, O is requested to have size polynomial in the size of the output for I (if not, trivial reductions may exist). Of the many problems to which Monotone Dualization is polynomially equivalent, we mention here computing the transversal hypergraph of a hypergraph (known as Transversal Enumeration (TRANS-ENUM)) [22]. A hypergraph H = (V, E) is a collection E of subsets e ⊆ V of a finite set V , where the elements of E are called hyperedges (or simply edges). A transversal of H is a set t ⊆ V that meets every e ∈ E, and is minimal, if it contains no other transversal properly. The transversal hypergraph of H is then the unique hypergraph Tr (H) = (V, T ) where T are all minimal transversals of H. Problem TRANS-ENUM is then, given a hypergraph H = (V, E), to generate all the edges of Tr (H); TRANS-HYP is deciding, given H = (V, E) and a set of minimal transversals T , whether Tr (H) = (V, T ). There is a simple correspondence between Monotone Dualization and TRANSENUM: For any positive CNF ϕ on At representing a Boolean function f , the prime CNF ψ for the dual of f consists of all clauses c such that c ∈ Tr (At, ϕ) (viewing ϕ as set of clauses). E.g., if ϕ = (x1 ∨ x2 )x3 , then ψ = (x1 ∨ x3 )(x2 ∨ x3 ). As for computational learning theory, Monotone Dual resp. Monotone Dualization are of relevance in the context of exact learning, cf. [2,31,47,48,17], which we briefly review here. Let us consider the exact learning of DNF (or CNF) formulas of monotone Boolean functions f by membership oracles only, i.e., the problem of identifying a prime DNF (or prime CNF) of an unknown monotone Boolean function f by asking queries to an oracle whether f (v)=1 holds for some selected models v. It is known [1] that monotone DNFs (or CNFs) are not exact learnable with membership oracles alone in time polynomial in the size of the target DNF (or CNF) formula, since information theoretic barriers impose a |CNF(f )| + |DNF(f )| lower bound on the number of queries needed, where |CNF(f )| and |DNF(f )| denote the numbers of prime implicates and prime implicants of f , respectively. This fact raises the following question: – Can we identity both the prime DNF and CNF of an unknown monotone function f by membership oracles alone in time polynomial in |CNF(f )| + |DNF(f )| ? Since the prime DNF (resp., prime CNF) corresponds one-to-one to the set of all minimal true models (resp., all maximal false models) of f , the above question can be restated in the following natural way [2,47]: Can we compute the boundary between true and false areas of an unknown monotone function in polynomial total-time ? There should be a simple algorithm for the problem as follows, which uses a DNF h and a CNF h consisting of some prime implicants and prime implicates of f , respectively, such that h |= ϕ and ϕ |= h , for any formula ϕ representing f : Step 1. Set h and h to be empty (i.e., falsity and truth).
8
T. Eiter and K. Makino
Step 2. while h ≡ h do Take a counterexample x of h ≡ h ; if f (x) = 1 then begin Minimize t = i:xi =1 xi to a prime implicant t∗ of f ; h := h ∨ t∗ (i.e., add t∗ to h); end else /* f (x) = 0 */ begin Minimize c = i:xi =0 xi to a prime implicate c∗ of f ; h := h ∧ c∗ (i.e., add c∗ to h ); end Step 3. Output h and h . This algorithm needs O(n(|CNF(f )| + |DNF(f )|)) many membership queries. If h ≡ h (i.e., the pair (hd , h) is a Yes instance of Monotone Dual) can always be decided in polynomial time, then the algorithm is polynomial in n, CNF(f ), and DNF(f ). (The converse is also known [2], i.e., if the above exact learning problem is solvable in polynomial total time, then Monotone Dual is polynomially solvable.) Of course, other kinds of algorithms exist; for example, [31] derived an algorithm with different behavior and query bounds. Thus, for the classes C of monotone Boolean functions which enjoying that (i) Monotone Dual is polynomially solvable and a counterexample is found in polynomial time in case (which is possible under mild conditions, cf. [2]), and (ii) the family of prime DNFs (or CNFs) is hereditary, i.e., if a function with the prime DNF φ = i∈I ti is in C, then any function with the prime DNF φS = i∈S ti , where S ⊆ I, is in C, the above is a simple polynomial time algorithm which uses polynomially many queries within the optimal bound (assuming that |CNF(f )| + |DNF(f )| is at least of order n). For many classes of monotone Boolean functions, we thus can get the learnability results from the results of Monotone Dual, e.g., k-CNFs, k-clause CNFs, read-k CNFs, and k-degenerate CNFs [20,23,17]. In knowledge discovery, Monotone Dualization and Monotone Dual are relevant in the context of several problems. For example, it is encountered in computing maximal frequent and minimal infrequent sets [6], in dependency inference and key computation from databases [49,50,19], which we will address in Sections 3 and 4.2 below, as well as in translating between models of a theory and formula representations [36,40]. Moreover, their natural generalizations have been studied to model various interesting applications [4,5,7].
3
Explanations and Dualization
Deciding the existence of some explanation for a literal χ = w.r.t. an assumption set A from a Horn Σ is NP-complete under formula representation (i.e., Σ is given by a
Abduction and the Dualization Problem
9
Horn CNF), for both positive and negative , cf. [57,24]; hence, generating some resp. all explanations is intractable in very elementary cases (however, under restrictions such as A = Lit for positive , the problem is tractable [24]). In the model-based setting, matters are different, and there is a close relationship between abduction and monotone dualization. If we are given char(Σ) of a Horn theory Σ and an atom q, computing an explanation E for q from Σ amounts to computing a minimal set E of letters such that (i) at least one model of Σ (and hence, a model in char(Σ)) satisfies E and q, and that (ii) each model of Σ falsifying q also falsifies E; this is because an atom q has only positive explanations E, i.e., it contains only positive literals (see e.g. [42] for a proof). Viewing models as v = xB , then (ii) means that E is a minimal transversal of the hypergraph (V, M ) where V corresponds to the set of the variables and M consists of all V − B such that xB ∈ char(Σ) and xB |= q. This led Kautz et al. [34] to an algorithm for computing an explanation E for χ = q w.r.t. a set of atoms A ⊆ At which essentially works as follows: 1. Take a model v ∈ char(Σ) such that v |= q. / B }, where v = xB . 2. Let V := A ∩ B and M = {V \ B | xB ∈ char(Σ), q ∈ 3. if ∅ ∈ / M , compute a minimal transversal E of H = (V, M ), and output E; otherwise, select another v in Step 1 (if no other is left, terminate with no output). In this way, some explanation E for q w.r.t. A can be computed in polynomial time, since computing some minimal transversal of a hypergraph is clearly polynomial. Recall that under formula-based representation, this problem is NP-hard [57,58]. The method above has been generalized to arbitrary theories represented by models using Bshouty’s Monotone Theory [10] and positive abducibles A, as well as for other settings, by Khardon and Roth [42] (cf. also Section 4.2). Also all explanations of q can be generated in the way above, by taking all models v ∈ char(Σ) and all minimal transversals of (V, M ). In fact, in Step 1 v can be restricted to the maximal vectors in char(Σ). Therefore, computing all explanations reduces to solving a number of transversal computation problems (which trivially amount to monotone dualization problems) in polynomial time. As shown in [24], the latter can be polynomially reduced to a single instance. Conversely, monotone dualization can be easily reduced to explanation generation, cf. [24]. This established the following result. Theorem 1. Given char(Σ) of a Horn theory Σ, a query q, and A ⊆ Lit, computing the set of all explanations for q from Σ w.r.t. A is polynomial-time equivalent to Monotone Dualization. A similar result holds for negative literal queries χ = q as well. Also in this case, a polynomial number of transversal computation problems can be created such that each minimal transversal corresponds to some explanation. However, matters are more complicated here since a query q might also have non-positive explanations. This leads to a case analysis and more involved construction of hypergraphs. We remark that a connection between dualization and abduction from a Horn theory represented by char(Σ) is implicit in dependency inference from relational databases. An important problem there is to infer, in database terminology, a prime cover of the
10
T. Eiter and K. Makino
set Fr+ of all functional dependencies (FDs) X→A which hold on an instance r of a relational schema U = {A1 , . . . , An } where the Ai are the attributes. A functional dependency X→A, X ⊆ U , A ∈ U , is a constraint which states that for every tuples t1 and t2 occurring in the same relation instance r, it holds that t1 [A] = t2 [A] whenever t1 [X] = t2 [X], i.e., coincidence of t1 and t2 on all attributes in X implies that t1 and t2 also coincide on A. A prime cover is a minimal (under ⊆) set of non-redundant FDs X→A (i.e., X →A is violated for each X ⊂ X) which is logically equivalent to Fr+ . In our terms, a non-redundant FD X→A corresponds to a prime clause B∈X B ∨A of the CNF ϕFr+ , where for any set of functional dependencies F , ϕF is the CNF + ϕF = X→A∈F B∈X B ∨A , where the attributes in U are viewed as atoms and Fr is the set of all FDs which hold on r. Thus, by Proposition 4, the set X is an explanation of A from ϕr . As shown in [49], so called max sets for all attributes A are polynomialtime computable from r, which in totality correspond to the characteristic models of the (definite) Horn theory Σr defined by ϕFr+ [41]. Computing the explanations for A is then reducible to an instance of Trans-Enum [50], which establishes the result for generating all assumption-free explanations from definite Horn theories. We refer to [41] for an elucidating discussion which reveals the close correspondence between concepts and results in database theory on Armstrong relations and in model-based reasoning, which can be exploited to derive results about space bounds for representation and about particular abduction problems. The latter will be addressed in Section 4.2. Further investigations on computing prime implicates from model-based theory representation appeared in [36] and in [40], which exhibited further problems equivalent to Monotone Dualization. In particular, Khardon has shown, inspired by results in [49, 50,19], that computing all prime implicates of a Horn theory Σ represented by its characteristic models is, under Turing reducibility (which is more liberal than the notion of reducibility we employ here in general), polynomial-time equivalent to TRANS-ENUM. Note, however, that by Proposition 4, we are here concerned with computing particular prime implicates rather than all.
4
Possible Generalizations
The results reported above deal with queries χ which are a single literal. As already stated in the introduction, often queries will be more complex, however, and consist of a conjunction of literals, of clauses [56], etc. This raises the question about possible extensions of the above results for queries of a more general form, and in particular whether we encounter other kinds of problem instances which are equivalent to Monotone Dualization. 4.1
General Formulas and CNFs
Let us first consider how complex finding abductive explanations can grow. It is known [21] that deciding the existence of an explanation for literal query χ w.r.t. a set A is Σ2P -complete (i.e., complete for NPNP ), if the background theory Σ is a set of arbitrary clauses (not necessarily Horn). For Horn Σ, we get a similar result if the query χ is an arbitrary formula.
Abduction and the Dualization Problem
11
Proposition 5. Given a Horn CNF ϕ, a set A ⊆ Lit, and a query χ, deciding whether χ has some explanation w.r.t A from ϕ is (i) Σ2P -complete, if χ is arbitrary (even if it is a DNF), (ii) NP-complete, if χ is a CNF, and (iii) NP-complete, if A = Lit. Intuitively, an explanation E for χ can be guessed and then, by exploiting Proposition 3, be checked in polynomial time with an oracle for propositional consequence. The Σ2P -hardness in case (i) can be shown by an easy reduction from deciding whether a quantified Boolean formula (QBF) of form F = ∃X∀Y α, where X and Y are disjoint sets of Boolean variables and α is a DNF over X ∪ Y . Indeed, just let Σ = ∅, and A = {x, x | x ∈ X}. Then, χ = α has an explanation w.r.t. A iff formula F is valid. On the other hand, if χ is a CNF, then deciding consequence Σ ∪ S |= χ is polynomial for every set of literals S; hence, in case (ii) the problem has lower complexity and is in fact in NP. As for case (iii), if A = Lit, then an explanation exists iff Σ ∪ {χ} has a model, which can be decided in NP. The hardness parts for (ii) and (iii) are immediate by a simple reduction from SAT (given a CNF β, let Σ=∅, χ=β, and A=Lit). We get a similar picture under model-based representation. Here, inferring a clause c from char(Σ) is feasible in polynomial time, and hence also inferring a CNF. On the other hand, inferring an arbitrary formula (in particular, a DNF) α, is intractable, since to witness Σ |= α we intuitively need to find proper models v1 , . . . , vl ∈ char(Σ) such |= α. that i vi Proposition 6. Given char(Σ), a set A ⊆ Lit, and a query χ, deciding whether χ has some explanation w.r.t A from Σ is (i) Σ2P -complete, if χ is arbitrary (even if it is a DNF), (ii) NP-complete, if χ is a CNF, and (iii) NP-complete, if A = Lit. As for (iii), we can guess a model v of χ and check whether v is also a model of Σ from char(Σ) in polynomial time (indeed, check whether v = {w ∈ char(Σ) | v ≤ w} holds). The hardness parts can be shown by slight adaptations of the constructions for the formula based case, since char(Σ) for the empty theory is easily constructed (it consists of xAt and all xAt\{i} , i ∈ {1, . . . , n}). So, like in the formula-based case, also in the model-based case we loose the immediate computational link of computing all explanations to Monotone Dualization if we generalize queries to CNFs and beyond. However, it appears that there are interesting cases between a single literal and CNFs which are equivalent to Monotone Dualization. As for the formula-based representation, recall that generating all explanations is polynomial total-time for χ being a positive literal (thus, tractable and “easier” than Monotone Dualization), while it is coNP-hard for χ being a CNF (and in fact, a negative literal); so, somewhere between the transition from tractable to intractable might pass instances which are equivalent to Monotone Dualization. Unfortunately, restricting to positive CNFs (this is the first idea) does not help, since the problem remains coNP-hard, even if all clauses have small size; this can be shown by a straightforward reduction from the well-known EXACT HITTING SET problem [25]. However, we encounter monotone dualization if Σ is empty. Proposition 7. Given a set A ⊆ Lit, and a positive CNF χ, generating all explanations of χ w.r.t A from Σ = ∅ is polynomial-time equivalent to dualizing a positive CNF, under both model-based and formula based representation.
12
T. Eiter and K. Makino
This holds since, as easily seen, every explanation E must be positive, and moreover, must be a minimal transversal of the clauses in χ. Conversely, every minimal transversal T of χ after removal of all positive literals that do not belong to A (viewed as hypergraphs), is an explanation. Note that this result extends to those CNFs which are unate, i.e., convertible to a positive CNF by flipping the polarity of some variables. 4.2
Clauses and Terms
Let us see what happens if we move from general CNFs to the important subcases of a single clause and a single term, respectively.As shown in [25], generating all explanations remains under formula-based polynomial total-time for χ being a positive clause or a positive term, but is intractable as soon as we allow a negative literal. Hence, we do not see an immediate connection to monotone dualization. More promising is model-based representation, since for a single literal query, equivalence to monotone dualization is known. It appears that we can extend this result to clauses and terms of certain forms. Clauses. Let us first consider the clause case, which is important for Clause Management Systems [56,38,39]. Here, the problem can be reduced to the special case of a single literal query as follows. Given a clause c, introduce a fresh letter q. If we would add the formula c ⇒ q to Σ, then the explanations of q would be, apart from the trivial q in case, just explanation the explanations of c. We can rewrite c ⇒ q to a Horn CNF α = x∈P (c) (x ∨ q) ∧ x∈N (c) (x ∨ q), so adding α maintains Σ Horn and thus the reduction works in polynomial time under formula-based representation. (Note, however, that we reduce this to a case which is intractable in general.) Under model-based representation, we need to construct char(Σ ∪ {α}), however, and it is not evident that this is always feasible in polynomial time. We can establish the following relationship, though. Let P (c) = {q1 , . . . , qk } and N (c) = {qk+1 , . . . , qm }; thus, α is logically equivalent to q ⇒ q 1 ∧ · · · q k ∧ qk+1 ∧ · · · ∧ qm . Claim. For Σ = Σ ∪ {α}, we have char(Σ ) ⊆ {v@(0) | v ∈ char(Σ)} ∪ m (v1 ∧· · · ∧ vk )@(1) | vi ∈ char(Σ), vi |= q i ∧ j=k+1 qj , for 1 ≤ i ≤ k ∪ (v0 ∧ v1 ∧· · · ∧ vk )@(1) | vi ∈ char(Σ), m for 0 ≤ i ≤ k, . m v0 |= i=1 qi , vi |= q i ∧ j=k+1 qj , for 1 ≤ i ≤ k where “@” denotes concatenation and q is assumed to be the last in the variable ordering. Indeed, each model of Σ, extended to q, is a model of Σ if we set q to 0; all models of Σ of this form can be generated by intersecting characteristic models of Σ extended in this way. On the other hand, any model v of Σ in which q is 1 must have also qk+1 , . . . , qm set to 1 and q1 , . . . , qk set to 0. We can generate such v by expanding the intersection of some characteristic models v1 , . . . , vl of Σ (where l ≤ |char(Σ)|) in which qk+1 , . . . , qm are set to 1, and where each of q1 , q2 , . . . , qk is made 0 by
Abduction and the Dualization Problem
13
intersection with at least one of these vectors. By associativity and commutativity of intersection, we can generate v then by expanding the intersection of vectors of the form given in the above equation. From the set RHS on the right hand side, char(Σ ) can be easily computed by eliminating those vectors v which are generated by the intersection of other vectors (i.e., such that v = {w ∈ RHS | v < w}). In general, RHS will have exponential size; however, computing RHS is polynomial if k is bounded by a constant; clearly, computing char(Σ ) from RHS is feasible in time polynomial in the size of RHS, and hence in polynomial time in the size of char(Σ). Thus, the reduction from clause explanation to literal explanation is computable in polynomial time. We thus obtain the following result. Theorem 2. Given char(Σ) ⊆ {0, 1}n of a Horn theory Σ, a clause query c, and A ⊆ Lit, computing all explanations for c from Σ w.r.t. A is polynomial-time equivalent to Monotone Dualization, if |P (c)| ≤ k for some constant k. Note that the constraint on P (c) is necessary, in the light of the following result. Theorem 3. Given char(Σ) ⊆ {0, 1}n of a Horn theory Σ, a positive clause query c, and some explanations E1 , . . . , El for c from Σ, deciding whether there exists some further explanation is NP-complete. The NP-hardness can be shown by a reduction from the well-known 3SAT problem. By standard arguments, we obtain from this the following result for computing multiple explanations. Corollary 1. Given char(Σ) ⊆ {0, 1}n of a Horn theory Σ, a set A ⊆ Lit and a clause c, computing a given number resp. all explanations for c w.r.t. A from Σ is not possible in polynomial total-time unless P=NP. The hardness holds even for A = Lit, i.e., for assumption-free explanations. Terms. Now let us finally turn to queries which are given by terms, i.e., conjunctions of literals. With a similar technique as for clause queries, explanations for a term t can be reduced to explanations for a positive literal query in some cases. Indeed, introduce a fresh atom q and consider t ⇒ q; this formula is equivalent to a Horn clause α if |N (t)| ≤ 1 (in particular, if t is positive). Suppose t = q 0 ∧ q1 ∧ · · · ∧ qm , and let Σ = Σ ∪ {α} (where α = q ∨ q 1 ∨ · · · ∨ q m ∨ q0 ). Then we have char(Σ ) ⊆ {v@(0) | v ∈ char(Σ)} ∪ { v@(1) | v ∈ char(Σ), v |= q0 ∨ q 1 ∨ · · · ∨ q m } ∪ {(v ∧ v )@(1) | v, v ∈ char(Σ), v |= q 1 ∨ · · · ∨ q m , v |= q 0 ∧ q1 ∧ · · · ∧ qm }, from which char(Σ ) is easily computed. The explanations for t from Σ then correspond to the explanations for q from Σ modulo the trivial explanation q in case. This implies polynomial-time equivalence of generating all explanations for a term t to Monotone Dualization if t contains at most one negative literal. In particular, this holds for the case of positive terms t.
Note that the latter case is intimately related to another important problem in knowledge discovery, namely inferring the keys of a relation r, i.e., the minimal (under ⊆) sets of attributes K ⊆ U = {A_1, A_2, ..., A_n} whose values uniquely identify the rest of any tuple in the relation. The keys for a relation instance r over attributes U amount to the assumption-free explanations of the term t = A_1 A_2 ⋯ A_n from the Horn CNF $\varphi_{F_r^+}$ defined as above in Section 2.2. Thus, abduction procedures can be used for generating keys. Note that char(F_r^+) is computable in polynomial time from r (cf. [41]), and hence generating all keys polynomially reduces to Monotone Dualization; the converse has also been shown [19]. Hence, generating all keys is polynomially equivalent to Monotone Dualization and also to generating all explanations for a positive term from char(Σ). A similar reduction can be used to compute all keys of a generic relation scheme (U, F) of attributes U and functional dependencies F, which amount to the explanations for the term t = A_1 A_2 ⋯ A_n from the CNF $\varphi_F$. Note that Khardon et al. [41] investigate computing keys for given F and more general Boolean constraints ψ by a simple reduction to computing all nontrivial explanations E (i.e., E ≠ {q}) of a fresh letter q from $\psi \wedge (\bigvee_{A_i \in U} \bar{A}_i \vee q)$, which can thus be done in polynomial total-time using results from [24]; for FDs (i.e., ψ = φ_F) this is a classic result in database theory [46]. Furthermore, [41] also shows how to compute an abductive explanation using an oracle for key computation.

We turn back to abductive explanations for a term t, and consider what happens if we have multiple negative literals in t. Without constraints on N(t), we face intractability, since deciding the existence of a nontrivial explanation for t is already difficult, where an explanation E for t is nontrivial if $E \neq P(t) \cup \{\bar{q} \mid q \in N(t)\}$.

Theorem 4. Given char(Σ) ⊆ {0,1}^n of a Horn theory Σ and a term t, deciding whether (i) there exists a nontrivial assumption-free explanation for t from Σ is NP-complete; (ii) there exists an explanation for t w.r.t. a given set A ⊆ Lit from Σ is NP-complete. In both cases, the NP-hardness holds even for negative terms t.

The hardness parts of this theorem can be shown by a reduction from 3SAT.

Corollary 2. Given char(Σ) ⊆ {0,1}^n of a Horn theory Σ, a set A ⊆ Lit and a term t, computing a given number resp. all explanations for t w.r.t. A from Σ is not possible in polynomial total-time, unless P=NP. The hardness holds even for A = Lit, i.e., for assumption-free explanations.

While the above reduction technique works for terms t with a single negative literal, it is not immediate how to extend it to multiple negative literals so that we can derive polynomial equivalence of generating all explanations to Monotone Dualization if |N(t)| is bounded by a constant k. However, this can be shown by a suitable extension of the method presented in [24], transforming the problem into a polynomial number of instances of Monotone Dualization, which can be polynomially reduced to a single instance [24].
Proposition 8. For any Horn theory Σ and any set $E = \{\bar{x}_1, \ldots, \bar{x}_k, x_{k+1}, \ldots, x_m\} \subseteq Lit$, char(Σ ∪ E) is computable from char(Σ) by char(Σ ∪ E) = char(M_1 ∪ M_2), where
$$M_1 = \big\{\, v_1 \wedge \cdots \wedge v_k \;\big|\; v_i \in \mathrm{char}(\Sigma),\ v_i \models \bar{x}_i \wedge \textstyle\bigwedge_{j=k+1}^{m} x_j \text{ for } 1 \le i \le k \,\big\}, \qquad M_2 = \big\{\, v \wedge v_0 \;\big|\; v \in M_1,\ v_0 \in \mathrm{char}(\Sigma),\ v_0 \models \textstyle\bigwedge_{j=1}^{m} x_j \,\big\},$$
and char(S) = {v ∈ S | v ∉ Cl_∧(S \ {v})} for every S ⊆ {0,1}^n. This can be done in polynomial time if the number k of negative literals in E is bounded by some constant.

For any model v and any V ⊆ At, let v[V] denote the projection of v on V, and for any theory Σ and any V ⊆ At, let Σ[V] denote the projection of Σ on V, i.e., mod(Σ[V]) = {v[V] | v ∈ mod(Σ)}.

Proposition 9. For any Horn theory Σ and any V ⊆ At, char(Σ[V]) can be computed from char(Σ) in polynomial time by char(Σ[V]) = char(char(Σ)[V]).

For any model v and any set of models M, let max_v(M) denote the set of all models in M that are maximal with respect to ≤_v. Here, for models w and u, we write w ≤_v u if w_i ≤ u_i whenever v_i = 0, and w_i ≥ u_i whenever v_i = 1.

Proposition 10. For any Horn theory Σ and any model v, max_v(Σ) can be computed from char(Σ) by $\max_v\big(\bigcup \{\, M_S \mid S \subseteq \{x_i \mid v_i = 1\} \,\}\big)$, where M_∅ = char(Σ) and, for S = {x_1, ..., x_k}, $M_S = \{\, v_1 \wedge \cdots \wedge v_k \mid v_i \in \mathrm{char}(\Sigma),\ v_i \models \bar{x}_i \text{ for } 1 \le i \le k \,\}$. This can be done in polynomial time if |{x_i | v_i = 1}| is bounded by some constant.

Theorem 5. Given char(Σ) ⊆ {0,1}^n of a Horn theory Σ, a term query t, and A ⊆ Lit, computing all explanations for t from Σ w.r.t. A is polynomial-time equivalent to Monotone Dualization, if |N(t)| ≤ k for some constant k.

Proof. (Sketch) We consider the following algorithm.

Algorithm Term-Explanations
Input: char(Σ) ⊆ {0,1}^n of a Horn theory Σ, a term t, and A ⊆ Lit.
Output: All explanations for t from Σ w.r.t. A.
Step 1. Let $\Sigma' = \Sigma \cup P(t) \cup \{\bar{q} \mid q \in N(t)\}$. Compute char(Σ') from char(Σ).
Step 2. For each x_i ∈ N(t), let $\Sigma_{x_i} = \Sigma \cup \{x_i\}$, and for each x_i ∈ P(t), let $\Sigma_{\bar{x}_i} = \Sigma \cup \{\bar{x}_i\}$. Compute char(Σ_{x_i}) from char(Σ) for x_i ∈ N(t), and compute char(Σ_{\bar{x}_i}) from char(Σ) for x_i ∈ P(t).
Step 3. For each B = B_− ∪ B_+, where B_− is a set of negative literals from A with |B_−| ≤ |N(t)| and $B_+ = (A \cap At) \setminus \{q \mid \bar{q} \in B_-\}$, let $C = B_+ \cup \{q \mid \bar{q} \in B_-\}$.
(3-1) Compute char(Σ'[C]) from char(Σ').
(3-2) Let v ∈ {0,1}^C be the model with v_i = 1 if $\bar{x}_i \in B_-$, and v_i = 0 if x_i ∈ B_+. Compute max_v(Σ'[C]) from char(Σ'[C]).
(3-3) For each w ∈ max_v(Σ'[C]), let C_{v,w} = {x_i | v_i = w_i} and let v* = v[C_{v,w}].
(3-3-1) Compute max_{v*}(Σ_{x_i}[C_{v,w}]) for x_i ∈ N(t) and max_{v*}(Σ_{\bar{x}_i}[C_{v,w}]) for x_i ∈ P(t) from char(Σ_{x_i}) and char(Σ_{\bar{x}_i}), respectively. Let
$$M_{v,w} = \bigcup_{x_i \in N(t)} \max_{v^*}\big(\Sigma_{x_i}[C_{v,w}]\big) \;\cup\; \bigcup_{x_i \in P(t)} \max_{v^*}\big(\Sigma_{\bar{x}_i}[C_{v,w}]\big).$$
(3-3-2) Dualize the CNF $\varphi_{v,w} = \bigwedge_{u \in M_{v,w}} c_u$, where
$$c_u = \bigvee_{i:\, u_i = v_i = 0} x_i \;\vee\; \bigvee_{i:\, u_i = v_i = 1} \bar{x}_i.$$
Each prime clause c of $\varphi^d_{v,w}$ corresponds to an explanation $E = P(c) \cup \{\bar{x}_j \mid x_j \in N(c)\}$, which is output. Note that φ_{v,w} is unate, i.e., convertible to a positive CNF by flipping the polarity of some variables. Informally, the algorithm works as follows. The theory Σ' is used for generating candidate sets of variables C on which explanations can be formed; this corresponds to condition (i) of an explanation, which combined with condition (ii) amounts to consistency of Σ ∪ E ∪ {t}. These sets of variables C are made concrete in Step 3 via B, where the easy fact is taken into account that any explanation of a term t can contain at most |N(t)| negative literals. The projection of char(Σ') to the relevant variables C, computed in Step 3-1, then serves as the basis for a set of variables, C_{v,w}, which is a largest subset of C on which some vector in Σ'[C] is compatible with the selected literals B; any explanation must be formed on variables included in some C_{v,w}. Here the ordering of vectors under ≤_v is relevant, which respects negative literals. The explanations over C_{v,w} are then found by excluding every countermodel of t, i.e., all the models of Σ_{x_i}, x_i ∈ N(t), resp. Σ_{\bar{x}_i}, x_i ∈ P(t), with a smallest set of literals. This amounts to computing minimal transversals (where only maximal models under ≤_{v*} need to be considered), or equivalently, to dualization of the given CNF φ_{v,w}. More formally, it can be shown that the algorithm above computes all explanations. Moreover, from Propositions 8, 9, and 10, we obtain that computing all explanations reduces in polynomial time to (parallel) dualization of positive CNFs if |N(t)| ≤ k, which can be polynomially reduced to dualizing a single positive CNF [24]. Since, as already known, Monotone Dualization reduces in polynomial time to computing all explanations of a positive literal, the result follows. We remark that algorithm Term-Explanations is closely related to results by Khardon and Roth [42] on computing some abductive explanation for a query χ from a (not necessarily Horn) theory represented by its characteristic models, which are defined using Monotone Theory [10]. In fact, Khardon and Roth established that computing some abductive explanation for a Horn CNF query χ w.r.t. a set A containing at most k negative literals from a theory Σ is feasible in polynomial time, provided that Σ is represented by an appropriate type of characteristic models (for Horn Σ, the characteristic models char_{k+1}(Σ) with respect to (k+1)-quasi-Horn functions will do, which are those functions with a CNF φ such that |P(c)| ≤ k+1 for every c ∈ φ). Proposition 10 implies that char_{k+1}(Σ) can be computed in polynomial time from char(Σ). Hence, by a detour through characteristic models conversion, some explanation for a
[Table 1. Complexity of computing all abductive explanations for a query χ from a Horn theory Σ. The table has one row for Σ represented by a Horn CNF φ and one row for Σ represented by char(Σ), and columns for the query classes literal, clause (positive, negative, Horn, general), term (positive, negative, general), and general CNF/DNF; its entries range over coNP, nPTT, Dual and Π2P as explained in the text below. Table notes: (a) polynomial total-time for assumption-free explanations (A = Lit); (b) Dual for k-positive clauses resp. k-negative terms, k bounded by a constant; (c) nPTT for assumption-free explanations (A = Lit); (d) coNP (resp. nPTT) for assumption-free explanations (A = Lit) and general χ (resp. DNF χ).]
Horn CNF w.r.t. A as above can be computed from a Horn Σ represented by char(Σ) in polynomial time using the method of [42]. This can be extended to computing all explanations for χ and, by exploiting the nature of explanations for terms, to an algorithm similar to Term-Explanations. Furthermore, the results of [42] provide a basis for obtaining further classes of abduction instances Σ, A, χ that are polynomially equivalent to Monotone Dualization where Σ is not necessarily Horn. However, this is not easy to accomplish, since, roughly speaking, non-Horn theories in general lack the useful property that every prime implicate can be made monotone by flipping the polarity of some variables, where the admissible flipping sets induce a class of theories in Monotone Theory. Explanations corresponding to such prime implicates might not be covered by a simple generalization of the above methods.
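To make the Monotone Dualization problem referred to throughout this section concrete: dualizing a positive CNF amounts to computing all minimal transversals (minimal hitting sets) of the hypergraph formed by its clauses. The following brute-force sketch is our own illustration under that reading; it is exponential and is not the output-polynomial machinery discussed above.

```python
from itertools import combinations

def minimal_transversals(hyperedges, universe):
    """All inclusion-minimal sets that hit every hyperedge.
    For a positive CNF whose clauses are the hyperedges, these are
    exactly the prime implicants of the dual function."""
    hyperedges = [set(e) for e in hyperedges]
    hitting = []
    for r in range(len(universe) + 1):          # enumerate by increasing size
        for cand in combinations(sorted(universe), r):
            s = set(cand)
            if all(s & e for e in hyperedges):
                if not any(t <= s for t in hitting):   # keep only minimal ones
                    hitting.append(s)
    return hitting

# dualize the positive CNF (x1 v x2) & (x2 v x3)
print(minimal_transversals([{'x1', 'x2'}, {'x2', 'x3'}], {'x1', 'x2', 'x3'}))
# -> [{'x2'}, {'x1', 'x3'}]
```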
5 Conclusion
In this paper, we have considered the connection between abduction and the well-known dualization problem, where we have reviewed some results from recent work and added some new ones; a summary picture is given in Table 1. In this table, "nPTT" stands for "not polynomial total-time unless P=NP," while "coNP" resp. "Π2P" means that deciding whether the output is empty (i.e., no explanation exists) is coNP-complete resp. Π2P-complete (which trivially implies nPTT); "Dual" denotes polynomial-time equivalence to Monotone Dualization. In order to elucidate the role of abducibles, the table also highlights results for assumption-free explanations (A = Lit) when they deviate from those for an arbitrary set A of abducibles. As can be seen from the table, there are several important classes of instances which are equivalent to Monotone Dualization. In particular, this holds for generating all explanations for a clause query (resp., term query) χ if χ contains at most k positive (resp., negative) literals for constant k. It remains to be explored how these results can, via the applications of abduction, lead to improvements for problems that appear in practice. In particular, the connections to problems in knowledge discovery remain to be explored. Furthermore, an implementation of the algorithms and experiments are left for future work.
We close by pointing out that, besides Monotone Dualization, there are related problems whose precise complexity status in the theory of NP-completeness is not known to date. A particularly interesting one is the dependency inference problem which we mentioned above, i.e., computing a prime cover of the set F_r^+ of all functional dependencies (FDs) X→A which hold on an instance r of a relational schema [49] (recall that a prime cover is a minimal (under ⊆) set of non-redundant FDs which is logically equivalent to F_r^+). There are other problems which are polynomial-time equivalent to this problem [40] under the more liberal notion of Turing reduction used there; for example, one of these problems is computing the set of all characteristic models of a Horn theory Σ from a given Horn CNF φ representing it. Dependency inference contains Monotone Dualization as a special case (cf. [20]), and is thus at least as hard, but to our knowledge there is no strong evidence that it is indeed harder; in particular, it is yet unknown whether a polynomial total-time algorithm for this problem implies P=NP. It would be interesting to see progress on the status of this problem, as well as possible connections to abduction.
References 1. D. Angluin. Queries and Concept Learning. Machine Learning, 2:319–342, 1996. 2. C. Bioch and T. Ibaraki. Complexity of identification and dualization of positive Boolean functions. Information and Computation, 123:50–63, 1995. 3. E. Boros, Y. Crama, and P. L. Hammer. Polynomial-time inference of all valid implications for Horn and related formulae. Ann. Mathematics and Artificial Intelligence, 1:21–32, 1990. 4. E. Boros, V. Gurvich, L. Khachiyan and K. Makino. Dual-bounded generating problems: Partial and multiple transversals of a hypergraph. SIAM J. Computing, 30:2036–2050, 2001. 5. E. Boros, K. Elbassioni, V. Gurvich, L. Khachiyan and K. Makino. Dual-bounded generating problems: All minimal integer solutions for a monotone system of linear inequalities. SIAM Journal on Computing, 31:1624-1643, 2002. 6. E. Boros, V. Gurvich, L. Khachiyan, and K. Makino. On the complexity of generating maximal frequent and minimal infrequent sets. In Proc. 19th Annual Symposium on Theoretical Aspects of Computer Science (STACS-02), LNCS 2285, pp. 133–141, 2002. 7. E. Boros, K. Elbassioni, V. Gurvich, L. Khachiyan, and K. Makino. An intersection inequality for discrete distributions and related generation problems. In Proc. 30th Int’l Coll. on Automata, Languages and Programming (ICALP 2003), LNCS 2719, pp. 543-555, 2003. 8. E. Boros, V. Gurvich, and P. L. Hammer. Dual subimplicants of positive Boolean functions. Optimization Methods and Software, 10:147–156, 1998. 9. G. Brewka, J. Dix, and K. Konolige. Nonmonotonic Reasoning – An Overview. Number 73 in CSLI Lecture Notes. CSLI Publications, Stanford University, 1997. 10. N. H. Bshouty. Exact Learning Boolean Functions via the Monotone Theory. Information and Computation, 123:146–153, 1995. 11. T. Bylander. The monotonic abduction problem: A functional characterization on the edge of tractability. In Proc. 2nd International Conference on Principles of Knowledge Representation and Reasoning (KR-91), pp. 70–77, 1991. 12. Y. Crama. Dualization of regular boolean functions. Discrete App. Math., 16:79–85, 1987. 13. J. de Kleer. An assumption-based truth maintenance system. Artif. Int., 28:127–162, 1986. 14. R. Dechter and J. Pearl. Structure identification in relational data. Artificial Intelligence, 58:237–270, 1992.
15. A. del Val. On some tractable classes in deduction and abduction. Artificial Intelligence, 116(1-2):297–313, 2000. 16. A. del Val. The complexity of restricted consequence finding and abduction. In Proc. 17th National Conference on Artificial Intelligence (AAAI-2000), pp. 337–342, 2000. 17. C. Domingo, N. Mishra, and L. Pitt. Efficient read-restricted monotone CNF/DNF dualization by learning with membership queries. Machine Learning, 37:89–110, 1999. 18. W. Dowling and J. H. Gallier. Linear-time algorithms for testing the satisfiability of propositional Horn theories. Journal of Logic Programming, 3:267–284, 1984. 19. T. Eiter and G. Gottlob. Identifying the minimal transversals of a hypergraph and related problems. Technical Report CD-TR 91/16, Christian Doppler Laboratory for Expert Systems, TU Vienna, Austria, January 1991. 20. T. Eiter and G. Gottlob. Identifying the minimal transversals of a hypergraph and related problems. SIAM Journal on Computing, 24(6):1278–1304, December 1995. 21. T. Eiter and G. Gottlob. The complexity of logic-based abduction. Journal of the ACM, 42(1):3–42, January 1995. 22. T. Eiter and G. Gottlob. Hypergraph transversal computation and related problems in logic and AI. In Proc. 8th European Conference on Logics in Artificial Intelligence (JELIA 2002), LNCS 2424, pp. 549–564. Springer, 2002. 23. T. Eiter, G. Gottlob, and K. Makino. New results on monotone dualization and generating hypergraph transversals. SIAM Journal on Computing, 32(2):514–537, 2003. Preliminary paper in Proc. ACM STOC 2002. 24. T. Eiter and K. Makino. On computing all abductive explanations. In Proc. 18th National Conference on Artificial Intelligence (AAAI ’02), pp. 62–67, 2002. Preliminary Tech. Rep. INFSYS RR-1843-02-04, Institut f¨ur Informationssysteme, TU Wien, April 2002. 25. T. Eiter and K. Makino. Generating all abductive explanations for queries on propositional Horn theories. In Proc. 12th Annual Conference of the EACSL (CSL 2003), August 25-30 2003, Vienna, Austria. LNCS, Springer, 2003. 26. T. Eiter, T. Ibaraki, and K. Makino. Computing intersections of Horn theories for reasoning with models. Artificial Intelligence, 110(1-2):57–101, 1999. 27. K. Eshghi. A tractable class of abduction problems. In Proc. 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pp. 3–8, 1993. 28. M. Fredman and L. Khachiyan. On the complexity of dualization of monotone disjunctive normal forms. Journal of Algorithms, 21:618–628, 1996. 29. G. Friedrich, G. Gottlob, and W. Nejdl. Hypothesis classification, abductive diagnosis, and therapy. In Proc. International Workshop on Expert Systems in Engineering, LNCS/LNAI 462, pp. 69–78. Springer, 1990. 30. D.R. Gaur and R. Krishnamurti. Self-duality of bounded monotone boolean functions and related problems. In Proc. 11th International Conference on Algorithmic Learning Theory (ALT 2000), LNCS 1968, pp. 209-223, 2000. 31. D. Gunopulos, R. Khardon, H. Mannila, and H. Toivonen. Data mining, hypergraph transversals, and machine learning. In Proc. 16th ACM Symposium on Principles of Database Systems (PODS-96), pp. 209–216, 1993. 32. K. Inoue. Linear resolution for consequence finding. Artif. Int., 56(2-3):301–354, 1992. 33. D. S. Johnson. A Catalog of Complexity Classes. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, chapter 2. Elsevier, 1990. 34. H. Kautz, M. Kearns, and B. Selman. Reasoning With Characteristic Models. In Proc. 11th National Conference on Artificial Intelligence (AAAI-93), pp. 
34–39, 1993. 35. H. Kautz, M. Kearns, and B. Selman. Horn approximations of empirical data. Artificial Intelligence, 74:129–245, 1995.
36. D. Kavvadias, C. Papadimitriou, and M. Sideri. On Horn envelopes and hypergraph transversals. In Proc. 4th International Symposium on Algorithms and Computation (ISAAC-93), LNCS 762, pp. 399–405, Springer, 1993. 37. D. J. Kavvadias and E. C. Stavropoulos. Monotone Boolean dualization is in co-NP[log2 n]. Information Processing Letters, 85:1–6, 2003. 38. A. Kean and G. Tsiknis. Assumption based reasoning and Clause Management Systems. Computational Intelligence, 8(1):1–24, 1992. 39. A. Kean and G. Tsiknis. Clause Management Systems (CMS). Computational Intelligence, 9(1):11–40, 1992. 40. R. Khardon. Translating between Horn representations and their characteristic models. Journal of Artificial Intelligence Research, 3:349–372, 1995. 41. R. Khardon, H. Mannila, and D. Roth. Reasoning with examples: Propositional formulae and database dependencies. Acta Informatica, 36(4):267–286, 1999. 42. R. Khardon and D. Roth. Reasoning with models. Artif. Int., 87(1/2):187–213, 1996. 43. R. Khardon and D. Roth. Defaults and relevance in model-based reasoning. Artificial Intelligence, 97(1/2):169–193, 1997. 44. H. Levesque. Making believers out of computers. Artificial Intelligence, 30:81–108, 1986. 45. L. Lov´asz. Combinatorial optimization: Some problems and trends. DIMACS Technical Report 92-53, RUTCOR, Rutgers University, 1992. 46. C. L. Lucchesi and S. Osborn. Candidate Keys for Relations. Journal of Computer and System Sciences, 17:270–279, 1978. 47. K. Makino and T. Ibaraki. The maximum latency and identification of positive Boolean functions. SIAM Journal on Computing, 26:1363–1383, 1997. 48. K. Makino and T. Ibaraki, A fast and simple algorithm for identifying 2-monotonic positive Boolean functions. Journal of Algorithms, 26:291–305, 1998. 49. H. Mannila and K.-J. R¨aih¨a. Design by Example: An application of Armstrong relations. Journal of Computer and System Sciences, 22(2):126–141, 1986. 50. H. Mannila and K.-J. R¨aih¨a. Algorithms for inferring functional dependencies. Technical Report A-1988-3, University of Tampere, CS Dept., Series of Publ. A, April 1988. 51. P. Marquis. Consequence FindingAlgorithms. In D. Gabbay and Ph.Smets, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems, volume V: Algorithms for Uncertainty and Defeasible Reasoning, pp. 41–145. Kluwer Academic, 2000. 52. C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994. 53. C. H. Papadimitriou. NP-Completeness:A retrospective, In Proc. 24th Int’l Coll. on Automata, Languages and Programming (ICALP 1997), pp. 2–6, LNCS 1256. Springer, 1997. 54. C. S. Peirce. Abduction and Induction. In J. Buchler, editor, Philosophical Writings of Peirce, chapter 11. Dover, New York, 1955. 55. D. Poole. Explanation and prediction: An architecture for default and abductive reasoning. Computational Intelligence, 5(1):97–110, 1989. 56. R. Reiter and J. de Kleer. Foundations of assumption-based truth maintenance systems: Preliminary report. In Proc. 6th National Conference on Artificial Intelligence (AAAI-87), pp. 183–188, 1982. 57. B. Selman and H. J. Levesque. Abductive and default reasoning: A computational core. In Proc. 8th National Conference on Artificial Intelligence (AAAI-90), pp. 343–348, July 1990. 58. B. Selman and H. J. Levesque. Support set selection for abductive and default reasoning. Artificial Intelligence, 82:259–272, 1996. 59. B. Zanuttini. New polynomial classes for logic-based abduction. Journal of Artificial Intelligence Research, 2003. To appear.
Signal Extraction and Knowledge Discovery Based on Statistical Modeling Genshiro Kitagawa The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku Tokyo 106-8569 Japan
[email protected] http://www.ism.ac.jp/˜kitagawa
Abstract. In the coming post-IT era, the problems of signal extraction and knowledge discovery from huge data sets will become very important. For these problems, the use of a good model is crucial, and thus statistical modeling will play an important role. In this paper, we show two basic tools for statistical modeling, namely information criteria for the evaluation of statistical models and the generic state space model, which provides a very flexible tool for modeling complex and time-varying systems. As examples of these methods we show some applications in seismology and macroeconomics.
1 Importance of Statistical Modeling in the Post-IT Era
Once the model is specified, various types of inference and prediction can be deduced from it. Therefore, the model plays a crucial role in scientific inference and in signal extraction and knowledge discovery from data. In scientific research, it is frequently assumed that there exists a known or unknown "true" model. In the statistical community as well, from the age of Fisher, statistical theories have been developed for the situation where we estimate a true model with a small number of parameters from a limited amount of data. In recent years, however, models are rather considered as tools for extracting useful information from data. This view is motivated by the information criterion AIC, which revealed that, when estimating a model for prediction, we may obtain a good model by selecting a simple one even though it may have some bias. On the other hand, if the model is considered as just a tool for signal extraction, the model cannot be uniquely determined and there exist many possible models depending on the viewpoint of the analyst. This means that the results of inference and decision making depend on the model used. It is obvious that a good model yields a good result and a poor model yields a poor result. Therefore, in statistical modeling, the objective is not to find the unique "true" model, but to obtain a "good" model based on the characteristics of the object and the objective of the analysis. To obtain a good model, we need a criterion to evaluate the goodness of a model. Akaike (1973) proposed to evaluate the model by the goodness of its
predictive ability and to measure this by the Kullback-Leibler information. As is well known, minimizing the Kullback-Leibler information is equivalent to maximizing the expected log-likelihood of the model. Further, as a natural estimate of the expected log-likelihood, we can define the log-likelihood, and are thus led to the maximum likelihood estimator. However, in comparing models whose parameters are estimated by the method of maximum likelihood, there arises a bias in the log-likelihood as an estimate of the expected log-likelihood. By correcting for this bias, we obtain the Akaike information criterion, AIC. After the derivation of AIC, various modifications and extensions of AIC, such as TIC, GIC and EIC, have been proposed. The information criteria suggest various things that should be taken into account in modeling. Firstly, since the data are finite, models with too many free parameters may have less ability for prediction. There are two alternatives to mitigate this difficulty. One way is to restrict the number of free parameters, which is realized by minimizing the AIC. The other way is to obtain a good model with a huge number of parameters by imposing restrictions on the parameters. For this purpose, we need to combine not only the information from the data but also that coming from knowledge about the object and the objective of the analysis. Therefore, Bayes models play an important role, since this integration of information can be realized by a Bayes model with properly defined prior information and the data. With the progress of information technology, the information infrastructure in research and society is becoming fully equipped, and the data environment has changed very rapidly. For example, it has become possible to obtain huge amounts of data from moment to moment in various fields of scientific research and technology, for example CCD images of the night sky, POS data in marketing, high-frequency data in finance, and the huge sets of observations obtained in environmental measurement or in studies for disaster prevention. In contrast with conventional, well-designed statistical data, the special feature of these data sets is that they can be obtained comprehensively. Therefore, it is one of the most important problems of the post-IT era to extract useful information or discover knowledge from such not-so-well-designed massive data. For the analysis of such huge amounts of data, automatic treatment of the data is inevitable, and a new facet of difficulty in modeling arises. Namely, in the classical framework of modeling, the precision of the model increases as the data increase. In actuality, however, the model changes with time due to changes in the structure of the object. Further, as the information criteria suggest, the complexity of the model increases with the amount of data. Therefore, for the analysis of huge data sets, it is necessary to develop flexible models that can handle various types of nonstationarity, nonlinearity and non-Gaussianity. It is also important to remember that the information criteria are relative criteria. This means that selection by any information criterion is nothing but selection within the pre-assigned model class. This suggests that the process of modeling is an everlasting improvement of the model based on the
increase of the data and of the knowledge about the object. Therefore, it is very important to prepare flexible models that can fully utilize the knowledge of the analyst. In this paper, we present two basic tools for statistical modeling. Firstly, we review the information criteria AIC, TIC, GIC and EIC. We then present the general state space model as a generic time series model for signal extraction. Finally, we show some applications in seismology and macroeconomics.
2 Information Criteria for Statistical Modeling
Assume that the observations are generated from an unknown "true" distribution function G(x) and that the model is characterized by a density function f(x). In the derivation of AIC (Akaike (1973)), the expected log-likelihood $E_Y \log f(Y) = \int \log f(y)\, dG(y)$ is used as the basic measure to evaluate the similarity between two distributions, which is equivalent to the Kullback-Leibler information. In actual situations, G(x) is unknown and only a sample X = {X_1, ..., X_n} is given. We then use the log-likelihood $\ell = n \int \log f(x)\, d\hat{G}_n(x) = \sum_{i=1}^{n} \log f(X_i)$ as a natural estimator of (n times) the expected log-likelihood. Here $\hat{G}_n(x)$ is the empirical distribution function defined by the data. When a model contains an unknown parameter θ and its density is of the form f(x|θ), this naturally leads to the use of the maximum likelihood estimator $\hat\theta$.

2.1 AIC and TIC
For a statistical model f(x|θ) fitted to the data, however, the log-likelihood $n^{-1}\ell(\hat\theta) = n^{-1}\sum_{i=1}^{n}\log f(X_i|\hat\theta) \equiv n^{-1}\log f(X|\hat\theta)$ has a positive bias as an estimator of the expected log-likelihood $E_G \log f(Y|\hat\theta)$, and it cannot be directly used for model selection. By correcting the bias

$$b(G) = n\, E_X\Big[\frac{1}{n}\log f(X|\hat\theta(X)) - E_Y \log f(Y|\hat\theta(X))\Big], \qquad (1)$$

an unbiased estimator of the expected log-likelihood is given by

$$IC = -2n\Big\{\frac{1}{n}\log f(X|\hat\theta(X)) - \frac{1}{n}\, b(G)\Big\} = -2\log f(X|\hat\theta(X)) + 2\, b(G). \qquad (2)$$

Since it is very difficult to obtain the bias b(G) in closed form, it is usually approximated by an asymptotic bias. Akaike (1973) approximated b(G) by the number of parameters, $b_{AIC} = m$, and proposed the AIC criterion,

$$AIC = -2\log f(X|\hat\theta_{ML}) + 2m, \qquad (3)$$

where $\hat\theta_{ML}$ is the maximum likelihood estimate. On the other hand, Takeuchi (1976) showed that the asymptotic bias is given by $b_{TIC} = \mathrm{tr}\{I(\hat G)\, J(\hat G)^{-1}\}$, where $I(\hat G)$ and $J(\hat G)$ are the estimates of the Fisher information matrix and the expected Hessian matrix, respectively.
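As a concrete illustration of (3), the sketch below (our own, not taken from the paper) selects the order of a polynomial trend model with Gaussian noise by AIC; the parameter count m consists of the polynomial coefficients plus the noise variance, and all function names are ours.

```python
import numpy as np

def gaussian_loglik(residuals, sigma2):
    n = len(residuals)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(residuals**2) / sigma2

def aic_poly(x, y, order):
    """AIC = -2 log f(X | theta_ML) + 2 m for a polynomial trend model."""
    coef = np.polyfit(x, y, order)
    resid = y - np.polyval(coef, x)
    sigma2 = np.mean(resid**2)             # ML estimate of the noise variance
    m = (order + 1) + 1                    # coefficients + variance
    return -2 * gaussian_loglik(resid, sigma2) + 2 * m

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(100)
for k in range(5):
    print(k, round(aic_poly(x, y, k), 2))  # the quadratic model should give the smallest AIC
```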
2.2 General Information Criterion, GIC
The above method of bias correction for the log-likelihood can be extended to a general model constructed by using a statistical functional such as $\hat\theta = T(\hat{G}_n)$. For such a general statistical model, Konishi and Kitagawa (1996) derived the asymptotic bias

$$b_{GIC}(G) = \mathrm{tr}\, E_Y\Big[ T^{(1)}(Y; G)\, \frac{\partial \log f(Y|T(G))}{\partial \theta} \Big], \qquad (4)$$

and proposed GIC (Generalized Information Criterion). Here $T^{(1)}(Y; G)$ is the first derivative of the statistical functional T(G), which is usually called the influence function. The information criteria obtained so far can be generally expressed as $\log f(X|\hat\theta) - b_1(\hat{G}_n)$, where $b_1(\hat{G}_n)$ is a first-order bias correction term such as (4). The second-order bias-corrected information criterion can be defined by

$$GIC_2 = -2\log f(X|\hat\theta) + 2\Big\{ b_1(\hat{G}_n) + \frac{1}{n}\big( b_2(\hat{G}_n) - \Delta b_1(\hat{G}_n) \big) \Big\}. \qquad (5)$$

Here b_2(G) is defined by the expansion

$$b(G) = E_X\big[\log f(X|\hat\theta)\big] - n\, E_Y\big[\log f(Y|\hat\theta)\big] = b_1(G) + \frac{1}{n} b_2(G) + O(n^{-2}), \qquad (6)$$

and the bias of the first-order bias correction term, Δb_1(G), is defined by

$$E_X\big[b_1(\hat{G})\big] = b_1(G) + \frac{1}{n} \Delta b_1(G) + O(n^{-2}). \qquad (7)$$

2.3 Bootstrap Information Criterion, EIC
The bootstrap method (Efron 1979) provides us with an alternative way of bias correction of the log-likelihood. In this method, the bias b(G) in (1) is estimated by

$$b_B(\hat{G}_n) = E_{X^*}\big\{ \log f(X^*|\hat\theta(X^*)) - \log f(X|\hat\theta(X^*)) \big\}, \qquad (8)$$

and the EIC (Extended Information Criterion) is defined by using this (Ishiguro et al. (1997)). In actual computation, the bootstrap bias correction term $b_B(\hat{G}_n)$ is approximated by bootstrap resampling. The variance of the bootstrap estimate of the bias defined in (8) can be reduced automatically without any analytical arguments (Konishi and Kitagawa (1996), Ishiguro et al. (1997)). Let $D(X; G) = \log f(X|\hat\theta) - n E_Y[\log f(Y|\hat\theta)]$. Then D(X; G) can be decomposed into

$$D(X; G) = D_1(X; G) + D_2(X; G) + D_3(X; G), \qquad (9)$$

where $D_1(X; G) = \log f(X|\hat\theta) - \log f(X|T(G))$, $D_2(X; G) = \log f(X|T(G)) - n E_Y[\log f(Y|T(G))]$ and $D_3(X; G) = n E_Y[\log f(Y|T(G))] - n E_Y[\log f(Y|\hat\theta)]$. For a general estimator defined by a statistical functional $\hat\theta = T(\hat{G}_n)$, it can be shown that the bootstrap estimate of $E_X[D_1 + D_3]$ is the same as that of $E_X[D]$, but Var{D} = O(n) while Var{D_1 + D_3} = O(1). Therefore, by estimating the bias by

$$b^*_B(\hat{G}_n) = E_{X^*}[D_1 + D_3], \qquad (10)$$

a significant reduction of the variance can be achieved for any estimator defined by a statistical functional, especially for large n.
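A minimal sketch of the bootstrap bias correction (8), assuming a simple Gaussian model with ML-estimated mean and variance; the resampling loop approximates b_B, and for this model the estimated bias should come out close to the asymptotic AIC value m = 2. The code and its names are our own illustration, not the authors' implementation.

```python
import numpy as np

def loglik(data, mu, sigma2):
    n = len(data)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((data - mu)**2) / sigma2

def eic(data, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    mu, s2 = data.mean(), data.var()          # ML estimates on the original sample
    bias = 0.0
    for _ in range(n_boot):
        boot = rng.choice(data, size=n, replace=True)
        mu_b, s2_b = boot.mean(), boot.var()  # ML estimates on the bootstrap sample
        # log f(X*|theta(X*)) - log f(X|theta(X*)),  cf. (8)
        bias += loglik(boot, mu_b, s2_b) - loglik(data, mu_b, s2_b)
    bias /= n_boot
    return -2 * loglik(data, mu, s2) + 2 * bias, bias

data = np.random.default_rng(1).standard_normal(200) * 1.5 + 3.0
ic, b = eic(data)
print(ic, b)   # b should be close to the number of parameters, m = 2
```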
3 State Space Modeling

3.1 Smoothness Prior Modeling
A smoothing approach attributed to Whittaker [21] is as follows. Let

$$y_n = f_n + \varepsilon_n, \qquad n = 1, \ldots, N, \qquad (11)$$

denote the observations, where f_n is an unknown smooth function of n and ε_n is an i.i.d. normal random variable with zero mean and unknown variance σ². The problem is to estimate f_n, n = 1, ..., N, from the observations y_n, n = 1, ..., N, in a statistically reasonable way. However, in this problem, the number of parameters to be estimated is equal to or even greater than the number of observations. Therefore, the ordinary least squares method or the maximum likelihood method yields meaningless results. Whittaker [21] suggested that the solution f_n, n = 1, ..., N, balance a tradeoff between infidelity to the data and infidelity to a smoothness constraint. Namely, for a given tradeoff parameter λ² and difference order k, the solution satisfies

$$\min_{f} \Big\{ \sum_{n=1}^{N} (y_n - f_n)^2 + \lambda^2 \sum_{n=1}^{N} (\Delta^k f_n)^2 \Big\}. \qquad (12)$$

Whittaker left the choice of λ² to the investigator.
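Since (12) is a quadratic criterion, its minimizer can be written in closed form as f = (I + λ² D_kᵀ D_k)⁻¹ y, where D_k is the k-th order difference matrix. The following sketch (our own, for k = 2) solves this system directly; the argument lam plays the role of λ².

```python
import numpy as np

def whittaker_smooth(y, lam, k=2):
    """Minimize sum (y_n - f_n)^2 + lam * sum (Delta^k f_n)^2,
    solved as (I + lam * D_k' D_k) f = y."""
    N = len(y)
    D = np.diff(np.eye(N), n=k, axis=0)     # (N-k) x N difference matrix
    A = np.eye(N) + lam * D.T @ D
    return np.linalg.solve(A, y)

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
y = np.sin(t) + 0.3 * rng.standard_normal(200)
f = whittaker_smooth(y, lam=100.0)          # larger lam gives a smoother curve
```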
3.2 State Space Modeling
It can be seen that the minimization of the criterion (12) is equivalent to assuming the following linear Gaussian model:

$$y_n = f_n + w_n, \qquad f_n = c_{k1} f_{n-1} + \cdots + c_{kk} f_{n-k} + v_n, \qquad (13)$$

where $w_n \sim N(0, \sigma^2)$, $v_n \sim N(0, \tau^2)$, $\lambda^2 = \sigma^2/\tau^2$, and $c_{kj}$ is the j-th binomial coefficient.
Therefore, the models (13) can be expressed in a special form of the state space model

$$x_n = F x_{n-1} + G v_n \quad \text{(system model)}, \qquad y_n = H x_n + w_n \quad \text{(observation model)}, \qquad (14)$$

where $x_n = (t_n, \ldots, t_{n-k+1})$ is a k-dimensional state vector, and F, G and H are k × k, k × 1 and 1 × k matrices, respectively. For example, for k = 2, they are given by

$$x_n = \begin{bmatrix} t_n \\ t_{n-1} \end{bmatrix}, \quad F = \begin{bmatrix} 2 & -1 \\ 1 & 0 \end{bmatrix}, \quad G = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad H = [\,1,\ 0\,]. \qquad (15)$$

One of the merits of using this state space representation is that we can use the computationally efficient Kalman filter for state estimation. Since the state vector contains the unknown trend component, by estimating the state vector x_n the trend is automatically estimated. Unknown parameters of the model, such as the variances σ² and τ², can also be estimated by the maximum likelihood method. In general, the likelihood of the time series model is given by

$$L(\theta) = p(y_1, \ldots, y_N \mid \theta) = \prod_{n=1}^{N} p(y_n \mid Y_{n-1}, \theta), \qquad (16)$$

where $Y_{n-1} = \{y_1, \ldots, y_{n-1}\}$ and each component $p(y_n|Y_{n-1}, \theta)$ can be obtained as a byproduct of the Kalman filter [6]. It is interesting to note that the tradeoff parameter λ² in the penalized least squares method (12) can be interpreted as the ratio of the system noise variance to the observation noise variance, or the signal-to-noise ratio. The individual terms in (16) are given, in the general p-dimensional observation case, by

$$p(y_n \mid Y_{n-1}, \theta) = \frac{1}{(\sqrt{2\pi})^p}\, \big|W_{n|n-1}\big|^{-\frac{1}{2}} \exp\Big\{ -\frac{1}{2}\, \varepsilon_{n|n-1}^{T}\, W_{n|n-1}^{-1}\, \varepsilon_{n|n-1} \Big\}, \qquad (17)$$

where $\varepsilon_{n|n-1} = y_n - y_{n|n-1}$ is the one-step-ahead prediction error of the time series, and $y_{n|n-1}$ and $W_{n|n-1}$ are the mean and the variance covariance matrix of the observation y_n, respectively, defined by

$$y_{n|n-1} = H x_{n|n-1}, \qquad W_{n|n-1} = H V_{n|n-1} H^{T} + \sigma^2. \qquad (18)$$

Here $x_{n|n-1}$ and $V_{n|n-1}$ are the mean and the variance covariance matrix of the state vector given the observations $Y_{n-1}$, and can be obtained by the Kalman filter [6]. If there are several candidate models, the goodness of fit of the models can be evaluated by the AIC criterion defined by

$$AIC = -2 \log L(\hat\theta) + 2\,(\text{number of parameters}). \qquad (19)$$
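A compact sketch (our own) of the Kalman filter recursion used to evaluate (16)-(19) for the second-order trend model (15) with univariate observations; in practice τ² and σ² would be chosen by maximizing this likelihood, whereas here they are simply plugged in for illustration.

```python
import numpy as np

def kalman_loglik(y, F, G, H, tau2, sigma2, x0, V0):
    """Log-likelihood (16)-(17) of a linear Gaussian state space model
    via the Kalman filter (univariate observations)."""
    x, V = x0.copy(), V0.copy()
    ll = 0.0
    for yn in y:
        # one-step-ahead prediction
        x = F @ x
        V = F @ V @ F.T + tau2 * (G @ G.T)
        # innovation and its variance, cf. (17)-(18)
        e = yn - (H @ x)[0]
        W = (H @ V @ H.T)[0, 0] + sigma2
        ll += -0.5 * (np.log(2 * np.pi * W) + e**2 / W)
        # filtering update
        K = V @ H.T / W
        x = x + (K * e).ravel()
        V = V - K @ H @ V
    return ll

# second-order trend model, cf. (15)
F = np.array([[2.0, -1.0], [1.0, 0.0]])
G = np.array([[1.0], [0.0]])
H = np.array([[1.0, 0.0]])

rng = np.random.default_rng(0)
y = np.cumsum(0.05 * rng.standard_normal(300)) + 0.2 * rng.standard_normal(300)
ll = kalman_loglik(y, F, G, H, tau2=1e-3, sigma2=0.04,
                   x0=np.zeros(2), V0=np.eye(2) * 10.0)
aic = -2 * ll + 2 * 2     # two variance parameters, cf. (19)
print(ll, aic)
```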
3.3 General State Space Modeling
Consider a nonlinear non-Gaussian state space model for the time series y_n:

$$x_n = F_n(x_{n-1}, v_n), \qquad y_n = H_n(x_n, w_n), \qquad (20)$$

where x_n is an unknown state vector, and v_n and w_n are the system noise and the observation noise with densities q_n(v) and r_n(w), respectively. The first and the second model in (20) are called the system model and the observation model, respectively. The initial state x_0 is assumed to be distributed according to the density p_0(x). F_n(x, v) and H_n(x, w) are possibly nonlinear functions of the state and the noise inputs. This model is an extension of the ordinary linear Gaussian state space model (14). The above nonlinear non-Gaussian state space model specifies the conditional density of the state given the previous state, p(x_n|x_{n-1}), and that of the observation given the state, p(y_n|x_n). These are the essential features of the state space model, and it is sometimes convenient to express the model in this general form based on conditional distributions:

$$x_n \sim Q_n(\,\cdot \mid x_{n-1}), \qquad y_n \sim R_n(\,\cdot \mid x_n). \qquad (21)$$

With this model, it is possible to treat discrete processes such as Poisson models.

3.4 Nonlinear Non-Gaussian Filtering
The most important problem in state space modeling is the estimation of the state vector xn from the observations, Yt ≡ {y1 , . . . , yt }, since many important problems in time series analysis can be solved by using the estimated state vector. The problem of state estimation can be formulated as the evaluation of the conditional density p(xn |Yt ). Corresponding to the three distinct cases, n > t, n = t and n < t, the conditional distribution, p(xn |Yt ), is called the predictor, the filter and the smoother, respectively. For the standard linear-Gaussian state space model, each density can be expressed by a Gaussian density and its mean vector and the variance-covariance matrix can be obtained by computationally efficient Kalman filter and smoothing algorithms [6]. For general state space models, however, the conditional distributions become non-Gaussian and their distributions cannot be completely specified by the mean vectors and the variance covariance matrices. Therefore, various types of approximations to the densities have been used to obtain recursive formulas for state estimation, e.g., the extended Kalman filter [6], the Gaussian-sum filter [5] and the dynamic generalized linear model [20]. However, the following non-Gaussian filter and smoother [11] can yield an arbitrarily precise posterior density.
[Non-Gaussian Filter]
$$p(x_n \mid Y_{n-1}) = \int p(x_n \mid x_{n-1})\, p(x_{n-1} \mid Y_{n-1})\, dx_{n-1}, \qquad p(x_n \mid Y_n) = \frac{p(y_n \mid x_n)\, p(x_n \mid Y_{n-1})}{p(y_n \mid Y_{n-1})}, \qquad (22)$$
where $p(y_n \mid Y_{n-1}) = \int p(y_n \mid x_n)\, p(x_n \mid Y_{n-1})\, dx_n$.

[Non-Gaussian Smoother]
$$p(x_n \mid Y_N) = p(x_n \mid Y_n) \int \frac{p(x_{n+1} \mid x_n)\, p(x_{n+1} \mid Y_N)}{p(x_{n+1} \mid Y_n)}\, dx_{n+1}. \qquad (23)$$
However, the direct implementation of the formula requires computationally very costly numerical integration and can be applied only to lower dimensional state space models.
3.5 Sequential Monte Carlo Filtering
To mitigate the computational burden, numerical methods based on Monte Carlo approximation of the distributions have been proposed [9,12]. In Monte Carlo filtering [12], we approximate each density function by many particles that can be considered as realizations from that distribution. Specifically, assume that each distribution is expressed by using m particles as follows: $\{p_n^{(1)}, \ldots, p_n^{(m)}\} \sim p(x_n \mid Y_{n-1})$ and $\{f_n^{(1)}, \ldots, f_n^{(m)}\} \sim p(x_n \mid Y_n)$. This is equivalent to approximating the distributions by the empirical distributions determined by the m particles. It can then be shown that a set of realizations expressing the one-step-ahead predictor $p(x_n|Y_{n-1})$ and the filter $p(x_n|Y_n)$ can be obtained recursively as follows.

[Monte Carlo Filter]
1. Generate random numbers $f_0^{(j)} \sim p_0(x)$ for j = 1, ..., m.
2. Repeat the following steps for n = 1, ..., N.
   a) Generate random numbers $v_n^{(j)} \sim q(v)$ for j = 1, ..., m.
   b) Compute $p_n^{(j)} = F(f_{n-1}^{(j)}, v_n^{(j)})$ for j = 1, ..., m.
   c) Compute $\alpha_n^{(j)} = p(y_n \mid p_n^{(j)})$ for j = 1, ..., m.
   d) Generate $f_n^{(j)}$, j = 1, ..., m, by resampling $p_n^{(1)}, \ldots, p_n^{(m)}$ with weights proportional to $\alpha_n^{(1)}, \ldots, \alpha_n^{(m)}$.

The above algorithm for Monte Carlo filtering can be extended to smoothing by a simple modification. The details of the derivation of the algorithm are given in [12].
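A direct transcription (our own, with hypothetical parameter values) of steps 1 and 2(a)-(d) for the simplest random-walk trend model x_n = x_{n-1} + v_n, y_n = x_n + w_n, so that the result can be compared with the Kalman filter.

```python
import numpy as np

def monte_carlo_filter(y, m, tau, sigma, seed=0):
    """Bootstrap-type Monte Carlo filter for the random-walk trend model."""
    rng = np.random.default_rng(seed)
    f = rng.normal(0.0, 10.0, size=m)               # step 1: particles from p0(x)
    means = []
    for yn in y:
        p = f + rng.normal(0.0, tau, size=m)        # (a)-(b): predict particles
        alpha = np.exp(-0.5 * (yn - p)**2 / sigma**2)   # (c): likelihood weights
        alpha /= alpha.sum()
        idx = rng.choice(m, size=m, p=alpha)        # (d): resample with weights
        f = p[idx]
        means.append(f.mean())                      # filtered mean E[x_n | Y_n]
    return np.array(means)

rng = np.random.default_rng(1)
x = np.cumsum(0.1 * rng.standard_normal(200))
y = x + 0.3 * rng.standard_normal(200)
xhat = monte_carlo_filter(y, m=1000, tau=0.1, sigma=0.3)
```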
3.6 Self-Organizing State Space Model
If the non-Gaussian filter is implemented by the Monte Carlo filter, the sampling error sometimes renders the maximum likelihood method impractical. In this case, instead of estimating the parameter θ by the maximum likelihood method, we consider Bayesian estimation by augmenting the state vector as $z_n = [x_n^T, \theta^T]^T$. The state space model for this augmented state vector z_n is given by

$$z_n = F^*(z_{n-1}, v_n), \qquad y_n = H^*(z_n, w_n), \qquad (24)$$

where the nonlinear functions F*(z, v) and H*(z, w) are defined by $F^*(z, v) = [F(x, v), \theta]^T$ and $H^*(z, w) = H(x, w)$. Assume that we obtain the posterior distribution $p(z_n|Y_N)$ given the entire set of observations $Y_N = \{y_1, \ldots, y_N\}$. Since the original state vector x_n and the parameter vector θ are included in the augmented state vector z_n, this immediately yields the marginal posterior densities of the parameter and of the original state. This method of Bayesian simultaneous estimation of the parameter and the state of a state space model can easily be extended to the time-varying parameter situation where the parameter θ = θ_n evolves with time n. It should be noted that in this case we need a proper model for the time evolution of the parameter.
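A brief sketch (ours) of the augmentation (24) on top of the Monte Carlo filter above: each particle carries the state together with log τ, and a small random-walk step on log τ is one possible choice of evolution model for the time-varying-parameter case mentioned in the text.

```python
import numpy as np

def self_organizing_filter(y, m, sigma, seed=0):
    """Particles carry z = (x, log_tau); the parameter is estimated
    jointly with the state, cf. the augmented model (24)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 10.0, size=m)
    log_tau = rng.uniform(np.log(0.01), np.log(1.0), size=m)
    for yn in y:
        log_tau = log_tau + 0.01 * rng.standard_normal(m)  # slow evolution of the parameter
        x = x + np.exp(log_tau) * rng.standard_normal(m)   # system model with particle-wise tau
        w = np.exp(-0.5 * (yn - x)**2 / sigma**2)
        w /= w.sum()
        idx = rng.choice(m, size=m, p=w)
        x, log_tau = x[idx], log_tau[idx]
    return np.exp(log_tau).mean()    # posterior mean of the system noise scale
```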
4 Examples

4.1 Extraction of Seismic Waves
The earth's surface is under continuous disturbance due to a variety of natural forces and human-induced sources. Therefore, if the amplitude of an earthquake signal is very small, it is quite difficult to distinguish it from the background noise. In this section, we consider a method of extracting small seismic signals (P-wave and S-wave) from relatively large background noise [13], [17]. For the extraction of the small seismic signal from the background noise, we consider the model

$$y_n = r_n + s_n + \varepsilon_n, \qquad (25)$$

where r_n, s_n and ε_n denote the background noise, the signal and the observation noise, respectively. To separate these three components, it is assumed that the background noise r_n is expressed by the autoregressive model

$$r_n = \sum_{i=1}^{m} c_i r_{n-i} + u_n, \qquad (26)$$

where the AR order m and the AR coefficients c_i are unknown, and u_n and ε_n are white noise sequences with $u_n \sim N(0, \tau_1^2)$ and $\varepsilon_n \sim N(0, \sigma^2)$.
Seismograms are actually records of seismic waves in 3-dimensional space, and the seismic signal is composed of the P-wave and the S-wave. Hereafter the East-West, North-South and Up-Down components are denoted as $y_n = [x_n, y_n, z_n]^T$. The P-wave is a compression wave and it moves along the wave direction. Therefore it can be approximated by a one-dimensional model,

$$p_n = \sum_{j=1}^{m} a_j p_{n-j} + u_n. \qquad (27)$$

On the other hand, the S-wave moves on a plane perpendicular to the wave direction and thus can be expressed by a 2-dimensional model,

$$\begin{bmatrix} q_n \\ r_n \end{bmatrix} = \sum_{j=1}^{m} \begin{bmatrix} b_{j11} & b_{j12} \\ b_{j21} & b_{j22} \end{bmatrix} \begin{bmatrix} q_{n-j} \\ r_{n-j} \end{bmatrix} + \begin{bmatrix} v_{n1} \\ v_{n2} \end{bmatrix}. \qquad (28)$$

Therefore, the observed three-variate time series can be expressed as

$$\begin{bmatrix} x_n \\ y_n \\ z_n \end{bmatrix} = \begin{bmatrix} \alpha_{1n} & \beta_{1n} & \gamma_{1n} \\ \alpha_{2n} & \beta_{2n} & \gamma_{2n} \\ \alpha_{3n} & \beta_{3n} & \gamma_{3n} \end{bmatrix} \begin{bmatrix} p_n \\ q_n \\ r_n \end{bmatrix} + \begin{bmatrix} w_n^x \\ w_n^y \\ w_n^z \end{bmatrix}. \qquad (29)$$
In this approach, the crucial problem is the estimation of the time-varying wave direction, i.e., of α_{jn}, β_{jn} and γ_{jn}. They can be estimated by principal component analysis of the 3D data. These models can be combined in state space model form. Note that the variances of the component models correspond to the amplitudes of the seismic signals and are actually time-varying. These variance parameters play the role of signal-to-noise ratios, and their estimation is the key problem for the extraction of the seismic signal. A self-organizing state space model can be applied to the estimation of the time-varying variances [13].

4.2 Seasonal Adjustment
The standard model for seasonal adjustment is given by

$$y_n = t_n + s_n + w_n, \qquad (30)$$

where t_n, s_n and w_n are the trend, seasonal and irregular components. A reasonable solution to this decomposition was given by the use of smoothness priors for both t_n and s_n [14]. The trend component t_n and the seasonal component s_n are assumed to follow

$$t_n = 2 t_{n-1} - t_{n-2} + v_n, \qquad s_n = -(s_{n-1} + \cdots + s_{n-11}) + u_n, \qquad (31)$$
where v_n, u_n and w_n are Gaussian white noise with $v_n \sim N(0, \tau_t^2)$, $u_n \sim N(0, \tau_s^2)$ and $w_n \sim N(0, \sigma^2)$. However, by using a more sophisticated model, we can extract more information from the data. For example, many economic time series related to sales or production are affected by the number of days of the week. For instance, the sales of a department store will be strongly affected by the number of Sundays and Saturdays in each month. This kind of effect is called the trading day effect. To extract the trading day effect, we consider the decomposition

$$y_n = t_n + s_n + td_n + w_n, \qquad (32)$$

where t_n, s_n and w_n are as above, and the trading day effect component td_n is assumed to be expressed as

$$td_n = \sum_{j=1}^{7} \beta_j d_{jn}, \qquad (33)$$

where d_{jn} is the number of occurrences of the j-th day of the week (e.g., j = 1 for Sunday, j = 2 for Monday, etc.) in the n-th time period and β_j is the unknown trading day effect coefficient. To ensure identifiability, it is necessary to impose the constraint β_1 + ⋯ + β_7 = 0. Since the numbers of each day of the week are completely determined by the calendar, good estimates of the trading day effect coefficients will greatly contribute to increasing the precision of prediction.
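The regressors d_{jn} in (33) are simple calendar counts, so they can be generated directly; the sketch below (ours) builds them with Python's calendar module (which numbers Monday as 0) and imposes the identifiability constraint by setting the seventh coefficient to minus the sum of the other six.

```python
import calendar
import numpy as np

def trading_day_counts(year, month):
    """d_{jn}: number of occurrences of each day of the week in the month
    (here j=0 is Monday, ..., j=6 is Sunday, following Python's calendar)."""
    counts = np.zeros(7, dtype=int)
    _, ndays = calendar.monthrange(year, month)
    for day in range(1, ndays + 1):
        counts[calendar.weekday(year, month, day)] += 1
    return counts

def trading_day_effect(beta6, year, month):
    """Trading day component (33) with the constraint sum(beta) = 0 imposed
    by setting the 7th coefficient to minus the sum of the first six."""
    beta = np.append(beta6, -np.sum(beta6))
    return float(beta @ trading_day_counts(year, month))

print(trading_day_counts(2003, 10))   # October 2003 has five Wednesdays, Thursdays and Fridays
```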
4.3 Analysis of Exchange Rate Data
We consider multivariate time series of exchange rates between the US dollar and other foreign currencies. By using proper smoothness prior models, we try to decompose the change of each exchange rate into two components, one expressing the effect of the US economy and the other the effect of the other country. By this decomposition, it is possible to determine, for example, whether a decrease of the Yen/USD exchange rate at a certain time is due to a weak Yen or a strong US dollar.
References 1. Akaike, H.: Information Theory and an Extension of the Maximum Likelihood Principle. In: Petrov, B.N., Csaki, F. (eds.): 2nd International Symposium in Information Theory. Akademiai Kiado, Budapest, (1973) 267–281. 2. Akaike, H.: A new look at the statistical model identification, IEEE Transactions on Automatic Control, AC-19, 716–723 (1974) 3. Akaike, H.: Likelihood and the Bayes procedure (with discussion), In Bayesian Statistics, edited by J.M. Bernardo, M.H. De Groot, D.V. Lindley and A.F.M. Smith, University press, Valencia, Spain, 143–166 (1980)
4. Akaike, H., and Kitagawa, G. eds.: The Practice of Time Series Analysis, SpringerVerlag New York (1999) 5. Alspach, D.L., Sorenson, H.W.: Nonlinear Bayesian Estimation Using Gaussian Sum Approximations. IEEE Transactions on Automatic Control, AC-17 (1972) 439–448. 6. Anderson, B.D.O., Moore, J.B.: Optimal Filtering, New Jersey, Prentice-Hall (1979). 7. Doucet, A., de Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer-Verlag, NewYork (2000). 8. Efron, B.: Bootstrap methods: Another look at the jackknife. Ann. Statist. 7, (1979) 1–26. 9. Gordon, N.J., Salmond, D.J., Smith, A.F.M., Novel approach to nonlinear /nonGaussian Bayesian state estimation, IEE Proceedings-F, 140, (2) (1993) 107–113. 10. Ishiguro, M., Sakamoto, Y.,Kitagawa, G.: Bootstrapping log-likelihood and EIC, an extension of AIC. Annals of the Institute of Statistical Mathematics, 49 (3), (1997) 411–434. 11. Kitagawa, G.: Non-Gaussian state-space modeling of nonstationary time series. Journal of the American Statistical Association, 82 (1987) 1032–1063. 12. Kitagawa, G.: Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5 (1996) 1–25. 13. Kitagawa, G.: Self-organizing State Space Model. Journal of the American Statistical Association. 93 (443) (1998) 1203–1215. 14. Kitagawa, G., Gersch, W.: A Smoothness Priors-State Space Approach to the Modeling of Time Series with Trend and Seasonality. Journal of the American Statistical Association, 79 (386) (1984) 378–389. 15. Kitagawa, G. and Gersch, W.: Smoothness Priors Analysis of Time Series, Lecture Notes in Statistics, No. 116, Springer-Verlag, New York (1996). 16. Kitagawa, G. and Higuchi, T.: Automatic transaction of signal via statistical modeling, The Proceedings of The First International Conference on Discovery Science, Springer-Verlag Lecture Notes in Artificial Intelligence Series, 375–386 (1998). 17. Kitagawa, G., Takanami, T., Matsumoto, N.: Signal Extraction Problems in Seismology, Intenational Statistical Review, 69 (1), (2001) 129–152. 18. Konishi, S., Kitagawa, G.: Generalised information criteria in model selection. Biometrika, 83, (4), (1996) 875–890. 19. Sakamoto, Y., Ishiguro, M. and Kitagawa, G.: Akaike Information Criterion Statistics, D-Reidel, Dordlecht, (1986) 20. West, M., Harrison, P.J., Migon,H.S.: Dynamic generalized linear models and Bayesian forecasting (with discussion). Journal of the American Statistical Association. 80 (1985) 73–97. 21. Whittaker, E.T: On a new method of graduation, Proc. Edinborough Math. Assoc., 78, (1923) 81–89.
Association Computation for Information Access Akihiko Takano National Institute of Informatics Hitotsubashi, Chiyoda, Tokyo 101-8430 Japan
[email protected] Abstract. GETA (Generic Engine for Transposable Association) is a software system that provides efficient generic computation for association. It enables the quantitative analysis of various proposed methods based on association, such as measuring similarity among documents or words. The scalable implementation of GETA can handle large corpora of twenty million documents, and provides the implementation basis for the effective information access of the next generation. DualNAVI is an information retrieval system which is a successful example of the power and the flexibility of GETA-based computation for association. It provides users with rich interaction both in document space and in word space. Its dual view interface always returns the retrieved results in two views: a list of titles for document space and a "Topic Word Graph" for word space. The two views are tightly coupled by their cross-reference relation, and inspire the user to further interaction. The two-stage approach in associative search, which is the key to its efficiency, also facilitates content-based correlation among databases. In this paper we describe the basic features of GETA and DualNAVI.
1 Introduction
Since the last decade of the twentieth century, we have experienced an unusual expansion of the information space. Virtually all kinds of documents, including encyclopaedias, newspapers, and daily information within industries, have become available in digital form. The information we can access is literally exploding in amount and variation. The information space we face in our daily life is rapidly losing coherence of any kind, and this has brought many challenges to the field of information access research. Effective access to such information is crucial to our intellectual life. The productivity of each individual in this new era can be redefined as the power to recover order from the chaos left by this information flood. It requires the ability to collect appropriate information, the ability to analyze and discover the order within the collected information, and the ability to make proper judgements based on the analysis. This leads to the following three requirements for the effective information access we need:
– Flexible methods for collecting relevant information.
– Extracting mutual association (correlation) within the collected information.
– Interaction with the user's intention (in mind) and with the archival system of knowledge (e.g. ontology).
We need a swift and reliable method to collect relevant information from millions of documents. But the currently available methods are mostly based on simple keyword search, which suffers from low precision and low recall. We strongly believe that an important clue for attacking these challenges lies in the metrication of the information space. Once we have proper metrics for measuring similarity or correlation in the information space, it should not be difficult to recover some order through these metrics. We looked for a candidate for this metrication in the accumulation of previous research, and found that statistical (or probabilistic) measures of document similarity are promising candidates. It is inspiring to realize that these measures establish a duality between document space and word space. Following this guiding principle, we have developed the information retrieval system DualNAVI [8,10,13]. It provides users with rich interaction both in document space and in word space. Its dual view interface always returns the retrieved results in two views: a list of titles for document space and a "Topic Word Graph" for word space. The two views are tightly coupled by their cross-reference relation, and inspire the user to further interaction. DualNAVI supports two kinds of search facilities, document associative search and keyword associative search, both of which are similarity-based searches starting from given items as a query. The dual view provides a natural interface for invoking these two search functions. The two-stage approach in document associative search is the key to the efficiency of DualNAVI. It also facilitates content-based correlation among databases which are maintained independently and distributively. Our experience with DualNAVI tells us that association computation based on mathematically sound metrics is the crucial part of realizing new-generation IR (Information Retrieval) technologies [12]. DualNAVI has been commercially used in an Encyclopaedia search service over the internet since 1998, and in a BioInformatics DB search service [1] since 2000. Its effectiveness has been confirmed by many real users. It has also been discussed as one of the promising search technologies for scientists in Nature magazine [2]. The main reason why the various proposed methods have not been used in practice is that they are not scalable in nature. Information access based on similarity between documents or words is promising as an intuitive way to overview large document sets. But the heavy computing cost of evaluating similarity prevents such methods from being practical for large corpora of millions of documents. DualNAVI was no exception. To overcome this scalability problem, we have developed a software system called GETA (Generic Engine for Transposable Association) [3], which provides efficient generic computation for association. It enables the quantitative analysis of various proposed methods based on association, such as measuring similarity among documents or words. The scalable implementation of GETA, which works on various PC clusters, can handle large corpora of twenty million documents, and
Fig. 1. Navigation Interface of DualNAVI (Nishioka et al., 1997)
provides the implementation basis for the effective information access of the next generation [1,14]. In this paper we first give an overview of the design principles and basic functions of DualNAVI. Two important features of DualNAVI, the topic word graph and associative search, are discussed in detail. We also explain how it works in a distributed setting and realizes cross-database associative search. Finally, we describe the basic features of GETA.
2 DualNAVI: IR Interface Based on Duality

2.1 DualNAVI Interaction Model
DualNAVI is an information retrieval system which provides users with rich interaction based on two kinds of duality: dual view and dual query types. The dual view interface is composed of two views of the retrieved results: one in document space and the other in word space (see Fig. 1). Titles of the retrieved documents are listed on the left-hand side of the screen (document space), and summarizing information is shown as a "Topic Word Graph" on the right of the screen (word space). Topic word graphs are dynamically generated by analyzing the retrieved set of documents.
Fig. 2. Dual view bridges dual query types
A set of words characterizing the retrieved results is shown, and the links between them represent statistically meaningful co-occurrence relations among them. Connected subgraphs are expected to include good potential keywords with which to refine searches. The two views are tightly coupled with each other through their cross-reference relation: select some topic words, and the documents which include them are highlighted in the list of documents, and vice versa. Dual query types, on the other hand, mean that DualNAVI supports two kinds of search facilities. Document associative search finds documents related to a given set of key documents. Keyword associative search finds documents related to a given set of key words. The dual view interface provides a natural way of indicating the key objects for these search methods: the user simply selects the relevant documents or words within the previous search result and starts a new associative search. This enables easy and intuitive relevance feedback to refine searches effectively. Search by documents is especially helpful when users have some interesting documents but find it difficult to select proper keywords. The effectiveness of these two types of feedback in DualNAVI has been evaluated in [10]; the results were significantly positive for both types of interaction.
2.2 Dual View Bridges Dual Query Types
The dual view and the dual query types are not just two isolated features; dual query types can work effectively only within the dual view framework. Figure 2 illustrates how they relate to each other. We can start with either a search by keywords or a search by a document, and the retrieved results are shown in the dual view. If the title list includes interesting articles, we can proceed to the next associative search using these articles as key documents. If some words in the topic word graph are interesting, we can start a new keyword search using these topic words as keys. Another advantage of the dual view interface is that a cross-checking function is naturally realized. If a user selects some articles of interest, he can easily find which topic words appear in them. Dually, it is easy to find related articles by selecting topic words. If multiple topic words are selected, the thickness of the checkmark (see Fig. 1) indicates the number of selected topic words included in each article. The user can sort the title list by this thickness, which approximates the relevance of each article to the topic suggested by the selected words.
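The cross-checking behaviour just described boils down to counting, for each listed article, how many of the selected topic words it contains and sorting by that count. A minimal sketch of this bookkeeping (the function name, the data layout, and the use of plain Python sets are illustrative assumptions, not part of DualNAVI's actual interface):

```python
def rank_by_checkmark(articles, selected_words):
    """Sort articles by how many of the selected topic words each contains.

    `articles` is assumed to be a list of (title, set_of_words) pairs; the
    count ("thickness" of the checkmark) approximates relevance to the topic
    suggested by the selected words.
    """
    selected = set(selected_words)
    ranked = [(len(words & selected), title) for title, words in articles]
    ranked.sort(reverse=True)          # thickest checkmarks first
    return ranked
```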
3 Association Computation for DualNAVI
3.1 Generation of Topic Word Graph
Topic word graphs summarize the search results and suggest proper words for further refining searches. The method of generating topic word graphs is fully described in [9]; here we give a brief summary. The process consists of three steps (see Fig. 3). The first step is the extraction of topic words based on a word frequency analysis over the retrieved set of documents. The next step is to generate links between the extracted topic words based on a co-occurrence analysis. The last step assigns each topic word an xy-coordinate position in the display area.

The score for selecting topic words is given by

  score(w) = df(w) in the retrieved documents / df(w) in the whole database,

where df(w) is the document frequency of word w, i.e. the number of documents containing w. In general, it is difficult to keep the balance between high frequency words (common words) and low frequency words (specific words) using a single score. In order to make a balanced selection, we adopted the frequency-class method, where all candidate words are first roughly classified by their frequencies, and then a proper number of topic words is picked from each frequency class.

A link (an edge) between two words means that they are strongly related, that is, they co-appear in many documents in the retrieved results. In the link generation step, each topic word X is linked to the topic word Y that maximizes the co-occurrence strength df(X & Y) / df(Y) with X, among those having higher document frequency than X.
Fig. 3. Generating Topic Word Graph (Niwa et al., 1997)
Here df(X & Y) denotes the number of retrieved documents which contain both X and Y. The length of a link has no specific meaning, although it might be natural to expect a shorter link to mean a stronger relation. In the last step, which gives the two-dimensional arrangement of the topic word graph, the y-coordinate (vertical position) is decided according to the document frequency of each word within the retrieved set. Common words are placed in the upper part, and specific words are placed in the lower part. The graph can therefore be read as a hierarchical map of the topics appearing in the retrieved set of documents. The x-coordinate (horizontal position) has no specific meaning; it is assigned simply so as to avoid overlapping of nodes and links.
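The three steps can be summarised in the following sketch. It assumes each retrieved document is given as a set of words and that document frequencies over the whole database are precomputed; the function names, the fixed number of frequency classes, and the crude banding used to emulate the frequency-class method are illustrative choices, not details taken from [9].

```python
from collections import Counter

def topic_word_graph(retrieved_docs, df_whole, per_class=10, n_classes=5):
    """retrieved_docs: list of sets of words; df_whole: word -> doc frequency in the whole DB."""
    # Step 1: topic word extraction, balanced over rough frequency classes.
    df_ret = Counter(w for doc in retrieved_docs for w in doc)
    score = {w: df_ret[w] / df_whole[w] for w in df_ret if df_whole.get(w)}
    ordered = sorted(score, key=df_ret.get, reverse=True)      # by frequency in the result set
    band = max(1, len(ordered) // n_classes)
    topics = []
    for start in range(0, len(ordered), band):                 # one band per frequency class
        cls = ordered[start:start + band]
        topics += sorted(cls, key=score.get, reverse=True)[:per_class]

    # Step 2: link each topic word X to the more frequent Y maximising df(X & Y) / df(Y).
    links = []
    for x in topics:
        candidates = [y for y in topics if df_ret[y] > df_ret[x]]
        if candidates:
            y = max(candidates,
                    key=lambda y: sum(1 for d in retrieved_docs if x in d and y in d) / df_ret[y])
            links.append((x, y))

    # Step 3: vertical position by document frequency (common words placed higher).
    ypos = {w: df_ret[w] for w in topics}
    return topics, links, ypos
```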
3.2 Associative Search
Associative search is a new type of information retrieval method based on the similarity between documents. It can be considered as searching for documents by examples. It is useful when the user's intention cannot be clearly expressed by one or a few keywords, but the user has some documents that (partly) match his intention. Associative search is also a powerful tool for relevance feedback: if you find interesting items in the search results, an associative search with these items as search keys may bring more related items which were not previously retrieved. Associative search in DualNAVI consists of the following steps:
Fig. 4. Associative Search
– Extraction of characterizing words from the selected documents. The default number of characterizing words to be extracted is 200. For each word w which appears at least once in the selected documents, a score is calculated by score(w) = tf(w) / TF(w), where tf(w) and TF(w) are the term frequencies of w in the selected documents and in the whole database, respectively. Then the words with the highest scores, up to the above number, are selected.
– The extracted words are used as a query, and the relevance of each document d in the target database to this query q is calculated by sim(d, q), as described in Fig. 5 [11]. Here DF(w) is the number of documents containing w, and N is the total number of documents in the database.
– The documents in the target database are sorted by this similarity score, and the top-ranked documents are returned as the search results.

In theory, associative search should not limit the size of the query: the merged selected documents should be used as the query for an associative search from a set of documents. But if we do not reduce the number of distinct words in the query, we end up calculating sim(d, q) for almost all the documents in the target database, even when we request just ten documents. In fact, this extensive computational cost had prevented associative search from being used in practical systems. This is why we reduce the query to a manageable size in the first step. Most common words, which appear in many documents, are dropped in this step. We have to do this carefully so as not to drop important (e.g. informative) words within the query.
sim(d, q) = ρ(d) · Σ_{w∈q} σ(w) · ν(w, d) · ν(w, q)
ρ(d) = 1 / (1 + θ (length(d) − ℓ))   (ℓ: average document length, θ: slope constant)
σ(w) = log ( N / DF(w) )
ν(w, X) = (1 + log tf(w|X)) / (1 + log (average_{ω∈X} tf(ω|X)))
Fig. 5. Similarity measure for associative search (Singhal et al., 1996)
In the above explanation we simply adopted tf(w) / TF(w) for measuring the importance of words, but any other method can be used for this filtering. In [5], Hisamitsu et al. discuss the representativeness of words, which is a promising candidate for this task; it is more complex but theoretically sound. The other breakthrough we made for taming this computational cost is, of course, the development of GETA, a high-speed engine for this kind of generic association computation. In GETA, the computations realizing the above two steps are supported in their most abstract form:
– a method to extract summarizing words from a given set of documents,
– a method to collect the documents from the target database which are most relevant to a given key document (a set of words).
With these improvements, associative search in DualNAVI becomes efficient enough for many practical applications.
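Putting the two steps together, a minimal in-memory sketch of the associative search could look as follows. The corpus layout, the parameter values, and the function names are assumptions made for illustration; the actual system performs the same computation over GETA's compressed sparse indices, and the similarity follows the formula of Fig. 5.

```python
import math
from collections import Counter

def characterizing_words(selected_docs, corpus_tf, n_words=200):
    """selected_docs: list of word lists; corpus_tf: term frequencies over the whole database.
    Returns the query as a dict word -> term frequency within the selected documents."""
    tf = Counter(w for doc in selected_docs for w in doc)
    scored = sorted(tf, key=lambda w: tf[w] / corpus_tf[w] if corpus_tf.get(w) else 0.0,
                    reverse=True)
    return {w: tf[w] for w in scored[:n_words]}

def associative_search(query_tf, corpus_docs, df, slope=0.2, top=10):
    """Score every document in `corpus_docs` (lists of words) against the query
    using the sim(d, q) of Fig. 5; df maps a word to its document frequency."""
    N = len(corpus_docs)
    avg_len = sum(len(d) for d in corpus_docs) / N
    theta = slope / avg_len            # slope constant θ, scaled so the denominator stays positive

    def nu(w, tf_x):                   # normalised term frequency ν(w, X)
        avg_tf = sum(tf_x.values()) / len(tf_x)
        return (1 + math.log(tf_x[w])) / (1 + math.log(avg_tf))

    results = []
    for i, doc in enumerate(corpus_docs):
        tf_d = Counter(doc)
        rho = 1.0 / (1 + theta * (len(doc) - avg_len))   # pivoted length normalisation ρ(d)
        s = sum(math.log(N / df[w]) * nu(w, tf_d) * nu(w, query_tf)
                for w in query_tf if w in tf_d and df.get(w))
        results.append((rho * s, i))
    return sorted(results, reverse=True)[:top]
```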
3.3 Cross DB Associative Search
This two-step procedure of associative search is the key to the distributed architecture of the DualNAVI system. Associative search is divided into two subtasks, summarizing and similarity evaluation. Summarizing needs only the source DB, and the target DB is used only in the similarity evaluation. If these two functions are provided for each DB, it is not difficult to realize cross-DB associative search. Figure 6 shows the structure of cross-DB associative search between two physically distributed DualNAVI servers, one for an encyclopaedia and the other for newspapers. A user can select some articles in the encyclopaedia and search for related articles in the other DBs, say the newspapers. The user can thus search between physically distributed DBs associatively and seamlessly. We call such a set of DBs a "Virtual Database", because it can be accessed as one large database.
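In code, the split means that a source server only has to expose a summarizing operation and a target server only a search-by-word-vector operation; only the small word vector crosses the network. The sketch below reuses the two helper functions from the previous sketch and uses hypothetical class and method names:

```python
class DualNaviServer:
    """One independently maintained DB with the two operations needed for cross-DB search."""
    def __init__(self, docs, corpus_tf, df):
        self.docs, self.corpus_tf, self.df = docs, corpus_tf, df

    def summarize(self, doc_ids, n_words=200):          # runs on the source DB only
        return characterizing_words([self.docs[i] for i in doc_ids], self.corpus_tf, n_words)

    def search_by_words(self, query_tf, top=10):        # runs on the target DB only
        return associative_search(query_tf, self.docs, self.df, top=top)

def cross_db_associative_search(source, target, selected_doc_ids):
    query = source.summarize(selected_doc_ids)   # step 1: summarizing on the source server
    return target.search_by_words(query)         # step 2: similarity evaluation on the target server
```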
Fig. 6. Cross DB associative search with DualNAVI
4 GETA: Generic Engine for Association Computation
GETA (Generic Engine for Transposable Association) is a software library that provides efficient generic computation for association [3]. It is designed as a tool for manipulating very large sparse matrices, which typically appear as index files for large scale text retrieval. By providing the basic operations on such matrices, GETA enables the quantitative analysis of various proposed methods based on association, such as measuring similarity among documents or words. Using GETA, it is almost trivial to realize associative search functions which accept a set of documents as a query and return a set of highly related documents in relevance order.

The computation by GETA is very efficient: it is typically 50 times or more faster than a quick-and-dirty implementation of the same task. The key to the performance of GETA is its representation of the matrices. GETA compresses the matrix information to the extreme, which allows it to keep the matrices in memory almost all the time. An experimental associative search system using GETA can handle one million documents on an ordinary PC (single CPU with 1GB memory). It has been verified that the standard response time for associative search is less than a few seconds, which is fast enough to be practical.

To achieve higher scalability, we have also developed a parallel processing version of GETA. It divides and distributes the matrix information over the nodes of a PC cluster, which collaboratively calculate the same results as the monolithic version of GETA.
Fig. 7. Experimental Associative Search Interface using GETA
Thanks to the inner product structure of most statistical measures, the speedup obtained by this parallelization is significant. With this version of GETA, it has been confirmed that real-time associative search becomes feasible for about 20 million documents using an 8 to 16-node PC cluster.

The use of GETA is not limited to associative search. It can be applied to a large variety of text processing techniques, such as text categorization, text clustering, and text summarization. We believe GETA will be an essential tool for accelerating research on, and practical application of, these and other text processing techniques. GETA was released as open source software in July 2002. The major part of the design and implementation of GETA was done by Shingo Nishioka. The development of GETA has been supported by the Advanced Software Technology project under the auspices of the Information-technology Promotion Agency (IPA) in Japan.
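GETA itself is a C library with its own compressed index format, so the following sketch only illustrates the kind of computation it accelerates, using an ordinary SciPy sparse term-document matrix as an assumed stand-in, not GETA's API. The comment at the end points at why the inner-product structure makes the cluster parallelisation straightforward.

```python
import numpy as np
from scipy import sparse

def association_scores(doc_term, query_vec, top=10):
    """doc_term: CSR matrix (documents x terms), e.g. TF-IDF weighted.
    query_vec: 1 x terms sparse row representing the key document(s).
    Returns the indices of the most strongly associated documents."""
    scores = np.asarray((doc_term @ query_vec.T).todense()).ravel()  # one sparse mat-vec product
    return np.argsort(-scores)[:top]

# Toy usage: 4 documents over 5 terms; "find documents like document 2".
X = sparse.csr_matrix(np.array([[1., 0., 2., 0., 0.],
                                [0., 1., 0., 1., 0.],
                                [1., 1., 0., 0., 3.],
                                [0., 0., 1., 0., 1.]]))
print(association_scores(X, X[2], top=2))

# Because every score is an inner product, the matrix can be split row-wise over
# the nodes of a PC cluster: each node scores its own block of documents and only
# the small per-node top lists need to be merged, mirroring the parallel GETA.
```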
4.1 Basic Features of GETA
Some of the characterizing features of GETA are as follows:
– It provides efficient and generic computation for association using highly compressed indices.
– It is portable: it works on various UNIX systems (e.g. FreeBSD, Linux, Solaris) on PC servers or PC clusters.
– It is scalable: associative document search over 20 million documents can be done within a few seconds using an 8 to 16-node PC cluster.
– It is flexible: the similarity measures among documents or words can be switched dynamically during computation, and users can easily define their own measures using macros.
– Most functions of GETA are accessible from the Perl environment through its Perl interface, which is useful for implementing experimental systems that compare various statistical measures of similarity.
To demonstrate this flexibility of GETA, we have implemented an associative search interface for document search (see Fig. 7). It can be used for quantitative comparison among various measures or methods.
4.2 Document Analysis Methods Using GETA
Various methods for document analysis have been implemented using GETA and are included in the standard distribution of the GETA system.
– Tools for dynamic document clustering: Several methods for dynamic document clustering are implemented using GETA:
• An efficient implementation of the HBC (Hierarchical Bayesian Clustering) method [6,7] is provided. It takes a few seconds to cluster 1000 documents on an ordinary PC.
• Representative terms of each cluster are available.
• For comparative studies among different methods, most major existing clustering methods (e.g. the single-link, complete-link, group average, and Ward methods) are also available, as sketched below.
– Tools for evaluating word representativeness [4,5]: Representativeness is a new measure for evaluating the power of words to represent some topic. It provides a quantitative criterion for selecting effective words to summarize the content of a given set of documents. A new measure of word representativeness is proposed together with an efficient implementation for evaluating it using GETA. It is also possible to apply it to the automatic selection of important compound words.
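As a point of comparison only, the classical agglomerative methods mentioned in the list are also available in standard libraries; a small sketch with SciPy follows (the document vectors, the cosine metric, and the choice of the group-average method are assumptions for illustration, not how GETA's tools are invoked):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_documents(doc_vectors, n_clusters=5):
    """doc_vectors: (n_docs x n_terms) array of document feature vectors."""
    Z = linkage(doc_vectors, method='average', metric='cosine')  # group-average method
    return fcluster(Z, t=n_clusters, criterion='maxclust')

# toy usage on random vectors
labels = cluster_documents(np.random.rand(100, 20), n_clusters=4)
```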
5 Conclusions
We have shown how association computation, such as evaluating the similarity among documents or words, is essential for new generation technologies in information access.
We believe that DualNAVI, which is based on this principle, opens a new horizon for interactive information access, and that GETA will serve as an implementation basis for these new generation systems.

Acknowledgements. The work reported in this paper is an outcome of joint research efforts with my former colleagues at Hitachi Advanced/Central Research Laboratories: Yoshiki Niwa, Toru Hisamitsu, Makoto Iwayama, Shingo Nishioka, Hirofumi Sakurai and Osamu Imaichi. This research is partly supported by the Advanced Software Technology Project under the auspices of the Information-technology Promotion Agency (IPA), Japan. It is also partly supported by the CREST Project of Japan Science and Technology.
References
1. BACE (Bio Association CEntral). http://bace.ims.u-tokyo.ac.jp/, August 2000.
2. D. Butler. Souped-up search engines. Nature, 405, pages 112–115, 2000.
3. GETA (Generic Engine for Transposable Association). http://geta.ex.nii.ac.jp/, July 2002.
4. T. Hisamitsu, Y. Niwa, and J. Tsujii. Measuring Representativeness of Terms. In Proceedings of IRAL'99, pages 83–90, 1999.
5. T. Hisamitsu, Y. Niwa, and J. Tsujii. A Method of Measuring Term Representativeness. In Proceedings of COLING 2000, pages 320–326, 2000.
6. M. Iwayama and T. Tokunaga. Hierarchical Bayesian Clustering for Automatic Text Classification. In Proceedings of IJCAI'95, pages 1322–1327, 1995.
7. M. Iwayama. Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering. In Proceedings of ACM SIGIR 2000, pages 10–16, 2000.
8. S. Nishioka, Y. Niwa, M. Iwayama, and A. Takano. DualNAVI: An information retrieval interface. In Proceedings of JSSST WISS'97, pages 43–48, 1997. (in Japanese).
9. Y. Niwa, S. Nishioka, M. Iwayama, and A. Takano. Topic graph generation for query navigation: Use of frequency classes for topic extraction. In Proceedings of NLPRS'97, pages 95–100, 1997.
10. Y. Niwa, M. Iwayama, T. Hisamitsu, S. Nishioka, A. Takano, H. Sakurai, and O. Imaichi. Interactive Document Search with DualNAVI. In Proceedings of NTCIR'99, pages 123–130, 1999.
11. A. Singhal, C. Buckley, and M. Mitra. Pivoted Document Length Normalization. In Proceedings of ACM SIGIR'96, pages 21–29, 1996.
12. A. Takano, Y. Niwa, S. Nishioka, M. Iwayama, T. Hisamitsu, O. Imaichi, and H. Sakurai. Information Access Based on Associative Calculation. In Proceedings of SOFSEM 2000, LNCS Vol. 1963, pages 187–201, Springer-Verlag, 2000.
13. A. Takano, Y. Niwa, S. Nishioka, T. Hisamitsu, M. Iwayama, and O. Imaichi. Associative Information Access using DualNAVI. In Proceedings of NLPRS 2001, pages 771–772, 2001.
14. Webcat Plus (Japanese Books Information Service). http://webcatplus.nii.ac.jp/, October 2002.
Efficient Data Representations That Preserve Information
Naftali Tishby
School of Computer Science and Engineering and Center for Neural Computation, The Hebrew University, Jerusalem 91904, Israel
[email protected] Abstract. A fundamental issue in computational learning theory, as well as in biological information processing, is the best possible relationship between model representation complexity and its prediction accuracy. Clearly, we expect more complex models that require longer data representation to be more accurate. Can one provide a quantitative, yet general, formulation of this trade-off? In this talk I will discuss this question from Shannon’s Information Theory perspective. I will argue that this trade-off can be traced back to the basic duality between source and channel coding and is also related to the notion of ”coding with side information”. I will review some of the theoretical achievability results for such relevant data representations and discuss our algorithms for extracting them. I will then demonstrate the application of these ideas for the analysis of natural language corpora and speculate on possibly-universal aspects of human language that they reveal. Based on joint works with Ran Bacharach, Gal Chechik, Amir Globerson, Amir Navot, and Noam Slonim.
Can Learning in the Limit Be Done Efficiently?
Thomas Zeugmann
Institut für Theoretische Informatik, Universität zu Lübeck, Wallstraße 40, 23560 Lübeck, Germany
[email protected] Abstract. Inductive inference can be considered as one of the fundamental paradigms of algorithmic learning theory. We survey results recently obtained and show their impact to potential applications. Since the main focus is put on the efficiency of learning, we also deal with postulates of naturalness and their impact to the efficiency of limit learners. In particular, we look at the learnability of the class of all pattern languages and ask whether or not one can design a learner within the paradigm of learning in the limit that is nevertheless efficient. For achieving this goal, we deal with iterative learning and its interplay with the hypothesis spaces allowed. This interplay has also a severe impact to postulates of naturalness satisfiable by any learner. Finally, since a limit learner is only supposed to converge in the limit, one never knows at any particular learning stage whether or not the learner did already succeed. The resulting uncertainty may be prohibitive in many applications. We survey results to resolve this problem by outlining a new learning model, called stochastic finite learning. Though pattern languages can neither be finitely inferred from positive data nor PAC-learned, our approach can be extended to a stochastic finite learner that exactly infers all pattern languages from positive data with high confidence.
The full version of this paper is published in the Proceedings of the 14th International Conference on Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence Vol. 2842
Discovering Frequent Substructures in Large Unordered Trees
Tatsuya Asai 1, Hiroki Arimura 1, Takeaki Uno 2, and Shin-ichi Nakano 3
1 Kyushu University, Fukuoka 812–8581, Japan, {t-asai,arim}@i.kyushu-u.ac.jp
2 National Institute of Informatics, Tokyo 101–8430, Japan, [email protected]
3 Gunma University, Kiryu-shi, Gunma 376–8515, Japan
[email protected] Abstract. In this paper, we study a frequent substructure discovery problem in semi-structured data. We present an efficient algorithm Unot that computes all frequent labeled unordered trees appearing in a large collection of data trees with frequency above a user-specified threshold. The keys of the algorithm are efficient enumeration of all unordered trees in canonical form and incremental computation of their occurrences. We then show that Unot discovers each frequent pattern T in O(kb2 m) per pattern, where k is the size of T , b is the branching factor of the data trees, and m is the total number of occurrences of T in the data trees.
1 Introduction
With the rapid progress of network and storage technologies, huge amounts of electronic data have become available in various enterprises and organizations. Such weakly-structured data are well modeled by graphs or trees, where a data object is represented by a node and a connection or relationship between objects is encoded by an edge between them. There have been increasing demands for efficient methods of graph mining, the task of discovering patterns in large collections of graph and tree structures [1,3,4,7,8,9,10,13,15,17,18,19,20].

In this paper, we present an efficient algorithm for discovering frequent substructures in large graph-structured data, where both the patterns and the data are modeled by labeled unordered trees. A labeled unordered tree is a rooted directed acyclic graph in which all but the root node have exactly one parent and each node is labeled by a symbol drawn from an alphabet (Fig. 1). Such unordered trees can be seen either as a generalization of the labeled ordered trees extensively studied in semi-structured data mining [1,3,4,10,13,18,20], or as an efficient specialization of the attributed graphs studied in graph mining research [7,8,9,17,19]. They are also useful for modeling various types of unstructured or semi-structured data such as chemical compounds, dependency structures in discourse analysis, and the hyperlink structure of Web sites.

On the other hand, difficulties arise in the discovery of trees and graphs, such as the combinatorial explosion of the number of possible patterns and the isomorphism problem among many semantically equivalent patterns.
Fig. 1. A data tree D and a pattern tree T
There are also other difficulties, such as the computational complexity of detecting the embeddings or occurrences of a pattern in trees. We tackle these problems by introducing novel definitions of the support and of the canonical form for unordered trees, and by developing techniques for efficiently enumerating all unordered trees in canonical form without duplicates and for incrementally computing the embeddings of each pattern in the data trees. Interestingly, these techniques can be seen as instances of the reverse search technique, known as a powerful design tool for combinatorial enumeration problems [6,16]. Combining these techniques, we present an efficient algorithm Unot that computes all labeled unordered trees appearing in a collection of data trees with frequency above a user-specified threshold. The algorithm Unot has a provable performance in terms of the output size, unlike the other graph mining algorithms presented so far. It enumerates each frequent pattern T in O(kb²m) time per pattern, where k is the size of T, b is the branching factor of the data trees, and m is the total number of occurrences of T in the data trees.

Termier et al. [15] developed the algorithm TreeFinder for discovering frequent unordered trees. The major difference from Unot is that TreeFinder is not complete, i.e., it finds only a subset of the actually frequent patterns, whereas Unot computes all the frequent unordered trees. Another difference is that the matching functions preserve the parent relationship in Unot, whereas they preserve the ancestor relationship in TreeFinder. Very recently, Nijssen et al. [14] independently proposed an algorithm for the frequent unordered tree discovery problem with an enumeration technique essentially the same as ours.

This paper is organized as follows. In Section 2, we prepare basic definitions on unordered trees and introduce our data mining problems. In Section 3, we define the canonical form for unordered trees. In Section 4, we present the efficient algorithm Unot for finding all frequent unordered trees in a collection of semi-structured data. In Section 5, we conclude.
2 Preliminaries
In this section, we give basic definitions on unordered trees according to [2] and introduce our data mining problems. For a set A, |A| denotes the size of A. For a binary relation R ⊆ A² on A, R* denotes the reflexive transitive closure of R.
2.1 The Model of Semi-structured Data
We introduce the class of labeled unordered trees as a model of semi-structured data and patterns, according to [2,3,12]. Let L = {ℓ, ℓ1, ℓ2, . . .} be a countable set of labels with a total order ≤_L on L. A labeled unordered tree (an unordered tree, for short) is a directed acyclic graph T = (V, E, r, label), with a distinguished node r called the root, satisfying the following: V is a set of nodes, E ⊆ V × V is a set of edges, and label : V → L is the labeling function for the nodes in V. If (u, v) ∈ E then we say that u is a parent of v, or v is a child of u. Each node v ∈ V except r has exactly one parent, and the depth of v is defined by dep(v) = d, where (v0 = r, v1, . . . , vd = v) is the unique path from the root r to v. A labeled ordered tree (an ordered tree, for short) T = (V, E, B, r, label) is defined in a similar manner, except that for each internal node v ∈ V its children are ordered from left to right by the sibling relation B ⊆ V × V [3]. We denote by U and T the classes of unordered and ordered trees over L, respectively. For a labeled tree T = (V, E, r, label), we write V_T, E_T, r_T and label_T for V, E, r and label if it is clear from the context.

The following notions are common to both unordered and ordered trees. Let T be an unordered or ordered tree and u, v ∈ T be its nodes. If there is a path from u to v, then we say that u is an ancestor of v, or v is a descendant of u. For a node v, we denote by T(v) the subtree of T rooted at v, i.e. the subgraph of T induced by the set of all descendants of v. The size of T, denoted by |T|, is defined by |V|. We define the special tree ⊥ of size 0, called the empty tree.

Example 1. In Fig. 1, we show examples of labeled unordered trees T and D over the alphabet L = {A, B, C} with the total ordering A > B > C. In the tree T, the root is node 1, labeled with A, and the leaves are nodes 3, 4, and 6. The subtree T(2) rooted at node 2 consists of nodes 2, 3, and 4. The size of T is |T| = 6.

Throughout this paper, we assume that for every labeled ordered tree T = (V, E, B, r, label) of size k ≥ 1, its nodes are exactly {1, . . . , k}, numbered consecutively in preorder. Thus, the root and the rightmost leaf of T are root(T) = 1 and rml(T) = k, respectively. The rightmost branch of T is the path RMB(T) = (r0, . . . , rc) (c ≥ 0) from the root r to the rightmost leaf of T.
2.2 Patterns, Tree Matching, and Occurrences
For k ≥ 0, a k-unordered pattern (k-pattern, for short) is a labeled unordered tree having exactly k nodes, that is, V_T = {1, . . . , k} with root(T) = 1. An unordered database (database, for short) is a finite collection D = {D1, . . . , Dn} ⊆ U of (ordered) trees, where each Di ∈ D is called a data tree. We denote by V_D the set of the nodes of D and let ||D|| = |V_D| = Σ_{Di∈D} |V_Di|.

The semantics of unordered and ordered tree patterns are defined through tree matching [3]. Let T and D ∈ U be labeled unordered trees over L, called the pattern tree and the data tree, respectively. Then, we say that T occurs in D as an unordered tree if there is a mapping ϕ : V_T → V_D satisfying the following conditions (1)–(3) for every x, y ∈ V_T:
(1) ϕ is one-to-one, i.e., x ≠ y implies ϕ(x) ≠ ϕ(y).
(2) ϕ preserves the parent relation, i.e., (x, y) ∈ E_T iff (ϕ(x), ϕ(y)) ∈ E_D.
(3) ϕ preserves the labels, i.e., label_T(x) = label_D(ϕ(x)).
The mapping ϕ is called a matching from T into D. A matching ϕ : V_T → V_D into a data tree extends in the obvious way to a matching from T into a database. M_D(T) denotes the set of all matchings from T into D. Then, we define four types of occurrences of T in D as follows:

Definition 1. Let k ≥ 1, T ∈ U be a k-unordered pattern, and D be a database. For any matching ϕ : V_T → V_D ∈ M_D(T) from T into D, we define:
1. The total occurrence of T is the k-tuple Toc(ϕ) = ⟨ϕ(1), . . . , ϕ(k)⟩ ∈ (V_D)^k.
2. The embedding occurrence of T is the set Eoc(ϕ) = {ϕ(1), . . . , ϕ(k)} ⊆ V_D.
3. The root occurrence of T is the node Roc(ϕ) = ϕ(1) ∈ V_D.
4. The document occurrence of T is the index Doc(ϕ) = i such that Eoc(ϕ) ⊆ V_Di for some 1 ≤ i ≤ |D|.
Example 2. In Fig. 1, the pattern tree T has eight total occurrences in the data tree D: ϕ1 = ⟨1, 2, 3, 4, 10, 11⟩, ϕ2 = ⟨1, 2, 4, 3, 10, 11⟩, ϕ3 = ⟨1, 2, 3, 4, 10, 13⟩, ϕ4 = ⟨1, 2, 4, 3, 10, 13⟩, ϕ5 = ⟨1, 10, 11, 13, 2, 3⟩, ϕ6 = ⟨1, 10, 13, 11, 2, 3⟩, ϕ7 = ⟨1, 10, 11, 13, 2, 4⟩, and ϕ8 = ⟨1, 10, 13, 11, 2, 4⟩, where we identify the matching ϕi and Toc(ϕi). On the other hand, there are four embedding occurrences Eoc(ϕ1) = Eoc(ϕ2) = {1, 2, 3, 4, 10, 11}, Eoc(ϕ3) = Eoc(ϕ4) = {1, 2, 3, 4, 10, 13}, Eoc(ϕ5) = Eoc(ϕ6) = {1, 2, 3, 10, 11, 13}, and Eoc(ϕ7) = Eoc(ϕ8) = {1, 2, 4, 10, 11, 13}, and there is one root occurrence ϕ1(1) = ϕ2(1) = · · · = ϕ8(1) = 1 of T in D.

Now, we analyze the relationship among the above definitions of occurrences by introducing an ordering ≥occ on them. For any types of occurrences τ, π ∈ {Toc, Eoc, Roc, Doc}, we say π is stronger than or equal to τ, denoted by π ≥occ τ, iff for every pair of matchings ϕ1, ϕ2 ∈ M_D(T) from T into D, π(ϕ1) = π(ϕ2) implies τ(ϕ1) = τ(ϕ2). For an unordered pattern T ∈ U, we denote by TO_D(T), EO_D(T), RO_D(T), and DO_D(T) the sets of the total, embedding, root, and document occurrences of T in D, respectively. The first lemma describes a linear ordering on the classes of occurrences, and the second lemma gives the relation between the relative sizes of the occurrence sets.

Lemma 1. Toc ≥occ Eoc ≥occ Roc ≥occ Doc.

Lemma 2. Let D be a database and T be a pattern. Then,
|TO_D(T)| = k^Θ(k) |EO_D(T)|  and  |EO_D(T)| = n^Θ(k) |RO_D(T)|
over all patterns T ∈ U and all databases D ∈ 2^U satisfying k ≤ cn for some 0 < c < 1, where k is the size of T and n is the size of the database.

Proof. Omitted. For the proof, please consult the technical report [5].
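All four occurrence types of Definition 1 can be derived from the set of total occurrences; the following small sketch makes the relationship of Lemma 1 concrete (the data layout, in particular the map from data-tree nodes to document indices, is an assumption for illustration):

```python
def occurrence_views(total_occurrences, doc_of_node):
    """total_occurrences: list of tuples (phi(1), ..., phi(k)) of data-tree nodes.
    doc_of_node: dict mapping every data-tree node to the index of its document."""
    eoc = {frozenset(phi) for phi in total_occurrences}        # embedding occurrences
    roc = {phi[0] for phi in total_occurrences}                # root occurrences
    doc = {doc_of_node[phi[0]] for phi in total_occurrences}   # document occurrences
    return eoc, roc, doc
```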
Fig. 2. The depth-label sequences of labeled ordered trees
For any of the four types of occurrences τ ∈ {Toc, Eoc, Roc, Doc}, the τ-count of an unordered pattern T in a given database D is |τ_D(T)|. The (relative) τ-frequency is the ratio freq_D(T) = |τ_D(T)| / ||D|| for τ ∈ {Toc, Eoc, Roc}, and freq_D(T) = |Doc_D(T)| / |D| for τ = Doc. A minimum frequency threshold is any number 0 ≤ σ ≤ 1. Then, we state our data mining problems as follows.

Frequent Unordered Tree Discovery with Occurrence Type τ: Given a database D ⊆ U and a positive number 0 ≤ σ ≤ 1, find all unordered patterns T ∈ U appearing in D with relative τ-frequency at least σ, i.e., freq_D(T) ≥ σ.

In what follows, we concentrate on the frequent unordered tree discovery problem with embedding occurrences Eoc, although Toc is the more natural choice from the viewpoint of data mining. We note, however, that it is easy to extend the method and the results of this paper to the coarser occurrence types Roc and Doc by simple preprocessing. The following substructure enumeration problem is a special case of the frequent unordered tree discovery problem with embedding occurrences where σ = 1/||D||.

Substructure Discovery Problem for Unordered Trees: Given a data tree D ∈ U, enumerate all the labeled unordered trees T ∈ U embedded in D, that is, occurring in D at least once.

Throughout this paper, we adopt the first-child next-sibling representation [2] for unordered and ordered trees in the implementation. For details of this representation, see a standard textbook, e.g., [2].
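The first-child next-sibling representation mentioned above keeps, for every node, only a pointer to its leftmost child and a pointer to its next sibling to the right; a minimal sketch (the field and function names are illustrative):

```python
class Node:
    """A node of a labeled ordered tree in first-child next-sibling representation."""
    __slots__ = ("label", "first_child", "next_sibling")

    def __init__(self, label):
        self.label = label
        self.first_child = None
        self.next_sibling = None

def children(v):
    """Yield the children of v from left to right."""
    c = v.first_child
    while c is not None:
        yield c
        c = c.next_sibling
```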
3 Canonical Representation for Unordered Trees
In this section, we give the canonical representation for unordered tree patterns according to Nakano and Uno [12].
3.1 Depth Sequence of a Labeled Unordered Tree
First, we introduce some technical definitions on ordered trees. We use labeled ordered trees in T as the representation of labeled unordered trees in U, where U ∈ U can be represented by any T ∈ T such that U is obtained from T by ignoring its sibling relation BT . Two ordered trees T1 and T2 ∈ T are equivalent
to each other as unordered trees, denoted by T1 ≡ T2, if they represent the same unordered tree. We encode a labeled ordered tree of size k as follows [4,11,20]. Let T be a labeled ordered tree of size k. The depth-label sequence of T is the sequence C(T) = ((dep(v1), label(v1)), . . . , (dep(vk), label(vk))) ∈ (N×L)*, where v1, . . . , vk is the list of the nodes of T ordered by the preorder traversal of T, and each (dep(vi), label(vi)) ∈ N×L is called a depth-label pair. Since trees and their depth-label sequences are in one-to-one correspondence, we identify them in what follows. See Fig. 2 for examples of depth-label sequences.

Next, we introduce a total ordering ≥ over depth-label sequences as follows. For depth-label pairs (d1, ℓ1), (d2, ℓ2) ∈ N×L, we define (d1, ℓ1) > (d2, ℓ2) iff either (i) d1 > d2, or (ii) d1 = d2 and ℓ1 > ℓ2. Then, C(T1) = (x1, . . . , xm) is heavier than C(T2) = (y1, . . . , yn), denoted by C(T1) ≥lex C(T2), iff C(T1) is lexicographically larger than or equal to C(T2) as a sequence over the alphabet N×L. That is, C(T1) ≥lex C(T2) iff there exists some k such that (i) xi = yi for each i = 1, . . . , k − 1 and (ii) either xk > yk or m > k − 1 = n. By identifying ordered trees and their depth-label sequences, we often write T1 ≥lex T2 instead of C(T1) ≥lex C(T2). Now, we give the canonical representation for labeled unordered trees as follows.

Definition 2 ([12]). A labeled ordered tree T is in canonical form, or is a canonical representation, if its depth-label sequence C(T) is heaviest among all ordered trees over L equivalent to T, i.e., C(T) = max{ C(S) | S ∈ T, S ≡ T }. The canonical ordered tree representation (or canonical representation, for short) of an unordered tree U ∈ U, denoted by COT(U), is the labeled ordered tree T ∈ T in canonical form that represents U as an unordered tree. We denote by C the class of canonical ordered tree representations of labeled unordered trees over L.

The next lemma gives a characterization of the canonical representations of unordered trees [12].

Lemma 3 (Left-heavy condition [12]). A labeled ordered tree T is the canonical representation of some unordered tree iff T is left-heavy, that is, for any nodes v1, v2 ∈ V, (v1, v2) ∈ B implies C(T(v1)) ≥lex C(T(v2)).

Example 3. The three ordered trees T1, T2, and T3 in Fig. 2 represent the same unordered tree, but differ as ordered trees. Among them, T1 is left-heavy and thus it is the canonical representation of the labeled unordered tree, under the assumption that A > B > C. On the other hand, T2 is not canonical since the depth-label sequence C(T2(2)) = (1B, 2A, 2B, 3A) is lexicographically smaller than C(T2(6)) = (1A, 2B, 2A), which violates the left-heavy condition. T3 is not canonical since B < A implies that C(T3(3)) = (2B, 3A) is lexicographically smaller than the depth-label sequence of the subtree to its right, again violating the left-heavy condition.

For a pattern tree T ∈ T, we sometimes write Li and Ri for Li(T) and Ri(T) to indicate the pattern tree T. By Lemma 3, an ordered tree is in canonical form iff it is left-heavy. The next lemma states that, to decide whether a tree is in canonical form, it suffices to compare the left trees and the right trees along the rightmost branch.

Lemma 5 ([12]). Let S ∈ C be a canonical representation and T be a child tree of S with the rightmost branch (r0, . . . , rg), where g ≥ 0. Then, T is in canonical form iff Li ≥lex Ri holds in T for every i = 0, . . . , g − 1.
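The depth-label sequence and the left-heavy test of Lemma 3 translate almost directly into code. The sketch below works on the first-child next-sibling nodes defined earlier; it compares Python lists of (depth, label) pairs, which matches ≥lex provided the labels themselves compare according to ≤L (e.g. are encoded as integers), and it uses depths relative to each subtree root, which is equivalent when comparing sibling subtrees.

```python
def depth_label_sequence(root):
    """Return C(T) as the list of (depth, label) pairs of T in preorder."""
    seq = []
    def visit(v, d):
        seq.append((d, v.label))
        for c in children(v):
            visit(c, d + 1)
    visit(root, 0)
    return seq

def is_left_heavy(root):
    """Lemma 3: T is canonical iff every node's children appear in
    non-increasing order of their depth-label sequences."""
    seqs = [depth_label_sequence(c) for c in children(root)]
    if any(seqs[i] < seqs[i + 1] for i in range(len(seqs) - 1)):
        return False
    return all(is_left_heavy(c) for c in children(root))
```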
Procedure Expand(S, O, c, α, F)
Input: a canonical representation S ∈ U, its embedding occurrences O = EO_D(S), the copy depth c, a nonnegative integer α, and the set F of frequent patterns.
Method:
– If |O| < α then return; else F := F ∪ {S};
– For each ⟨S·(i, ℓ), c_new⟩ ∈ FindAllChildren(S, c), do:
  • T := S·(i, ℓ);
  • P := UpdateOcc(T, O, (i, ℓ));
  • Expand(T, P, c_new, α, F);
Fig. 6. A depth-first search procedure Expand
Let T be a labeled ordered tree with the rightmost branch RMB(T) = (r0, r1, . . . , rg). We say C(Li) and C(Ri) have a disagreement at position j if j ≤ min(|C(Li)|, |C(Ri)|) and the j-th components of C(Li) and C(Ri) are different pairs. Suppose that T is in canonical form. During a sequence of rightmost expansions of T, the i-th right tree Ri grows as follows.
1. Firstly, when a new node v is attached to ri as a rightmost child, the sequence is initialized to C(Ri) = (v) = ((dep(v), label(v))).
2. Whenever a new node v of depth d = dep(v) > i comes to T, the right tree Ri grows. In this case, v is attached as the rightmost child of r_{d−1}. The following cases arise:
(i) Suppose that there exists a disagreement between C(Li) and C(Ri). If r_{dep(v)} ≥ v then the rightmost expansion with v does not violate the left-heavy condition of T, where r_{dep(v)} is the node preceding v in the new tree.
(ii) Otherwise, we know that C(Ri) is a prefix of C(Li). In this case, we say Ri is copying Li. Let m = |C(Ri)| < |C(Li)| and let w be the m-th component of C(Li). For every new node v, T·v is a valid expansion if w ≥ v and r_{dep(v)} ≥ v; otherwise, it is invalid.
(iii) In cases (i) and (ii) above, if r_{dep(v)−1} is the leaf of the rightmost branch of T then r_{dep(v)} is undefined. In this case, we define r_{dep(v)} = ∞.
3. Finally, T reaches C(Li) = C(Ri). Then no further rightmost expansion of Ri is possible.
If we expand a given unordered pattern T so that all the right trees R0, . . . , Rg satisfy the above conditions, then the resulting tree is in canonical form. Let RMB(T) = (r0, r1, . . . , rg) be the rightmost branch of T. For every i = 0, 1, . . . , g − 1, the internal node ri is said to be active at depth i if C(Ri) is a prefix of C(Li). The copy depth of T is the depth of the highest active node in T. To deal with special cases, we introduce the following trick: we define the leaf rg to be always active. Thus we have that if all nodes but rg are not active
Procedure FindAllChildren(T, c):
Method: Return the set Succ of all pairs ⟨S, c'⟩, where S is a canonical child tree of T and c' is its copy depth, generated by the following cases:
Case I: If C(Lk) = C(Rk) for the copy depth k:
– The canonical child trees of T are T·(1, ℓ1), . . . , T·(k+1, ℓk+1), where label(ri) ≥ ℓi for every i = 1, . . . , k+1. The trees T·(k+2, ℓk+2), . . . , T·(g+1, ℓg+1) are not canonical.
– The copy depth of T·(i, ℓi) is i−1 if label(ri) = ℓi, and i otherwise, for every i = 1, . . . , k+1.
Case II: If C(Lk) ≠ C(Rk) for the copy depth k:
– Let m = |C(Rk)|+1 and let w = (d, ℓ) be the m-th component of C(Lk) (the next position to be copied). The canonical child trees of T are T·(1, ℓ1), . . . , T·(d, ℓd), where label(ri) ≥ ℓi for every i = 1, . . . , d−1 and ℓ ≥ ℓd holds.
– The copy depth of T·(i, ℓi) is i−1 if label(ri) = ℓi, and i otherwise, for every i = 1, . . . , d−1. The copy depth of T·(d, ℓd) is k if w = (d, ℓd), and d otherwise.
Fig. 7. The procedure FindAllChildren
then its copy-depth is g. This trick greatly simplifies the description of the update below.

Now, we explain how to generate all child trees of a given canonical representation T ∈ C. In Fig. 7, we show the algorithm FindAllChildren, which computes the set of all canonical child trees of a given canonical representation as well as their copy depths. The algorithm is almost the same as the algorithm for unlabeled unordered trees described in [12]; the update of the copy depth is slightly different because of the presence of labels. Let T be a labeled ordered tree with the rightmost branch RMB(T) = (r0, r1, . . . , rg) and copy depth k ∈ {−1, 0, 1, . . . , g−1}. Note that in the procedure FindAllChildren, the case where all nodes but rg are inactive, including the case of chain trees, is implicitly treated in Case I.

To implement the algorithm FindAllChildren so that it enumerates the canonical child trees in O(1) time per tree, we have to perform the following operations in O(1) time: updating a tree, accessing the sequences of the left and right trees, maintaining the position of the shorter prefix at the copy depth, retrieving the depth-label pair at that position, and deciding the equality C(Li) = C(Ri). To do this, we represent a pattern T by the structure shown in Fig. 4:
– an array code : [1..size] → (N × L) of depth-label pairs that stores the depth-label sequence of T, of length size ≥ 0;
– a stack RMB : [0..top] → (N × N × {=, ≠}) of triples (left, right, cmp). For each (left, right, cmp) = RMB[i], left and right are the starting positions in code of the subsequences that represent the left tree Li and the right tree Ri, and the flag cmp ∈ {=, ≠} indicates whether Li = Ri holds. The length of the rightmost branch is top ≥ 0.
It is not difficult to see that all the operations in FindAllChildren of Fig. 7 can be implemented to run in O(1) time, provided that an entire tree is not output but only the difference from the previous tree.
Algorithm UpdateOcc(T, O, d, ℓ)
Input: the rightmost expansion T of a pattern S, the embedding occurrence list O = EO_D(S), and the depth d ≥ 1 and label ℓ ∈ L of the rightmost leaf of T.
Output: the new list P = EO_D(T).
Method:
– P := ∅;
– For each ϕ ∈ O, do:
  + x := ϕ(r_{d−1});  /* the image of the parent of the new node rd = (d, ℓ) */
  + For each child y of x do:
    − If label_D(y) = ℓ and y ∉ Eoc(ϕ) then ξ := ϕ·y and flag := true;
    − Else, skip the rest and continue the for-loop;
    − For each i = 1, . . . , d−1, do: If C(Li) = C(Ri) but ξ(left_i) < ξ(right_i) then flag := false, and break the inner for-loop;
    − If flag = true then P := P ∪ {ξ};
– Return P;
Fig. 8. An algorithm for updating the embedding occurrence lists of a pattern
The proof of the next lemma is almost the same as in [12], except for the handling of labels.

Lemma 6 ([12]). For every canonical representation T and its copy depth c ≥ 0, FindAllChildren of Fig. 7 computes the set of all canonical child trees of T in O(1) time per tree, when only the differences from T are output.

The time complexity of O(1) per tree of the above algorithm is significantly better than the O(k²) time per tree of the straightforward algorithm based on Lemma 3, where k is the size of the computed tree.
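The enumeration skeleton that FindAllChildren prunes is the plain rightmost expansion: every child tree of T is obtained by attaching one new rightmost leaf at some depth along the rightmost branch. A sketch of this unrestricted generator on the depth-label sequence representation is shown below; FindAllChildren's contribution is to emit, in O(1) time each, only those expansions that remain canonical.

```python
def rightmost_expansions(code, labels):
    """code: depth-label sequence of a tree as a list of (depth, label) pairs (preorder).
    Yields the codes of all trees obtained by adding one new rightmost leaf,
    with no canonical-form filtering."""
    if not code:                                   # the empty tree: the only child is a root
        for l in labels:
            yield [(0, l)]
        return
    rightmost_leaf_depth = code[-1][0]
    for d in range(1, rightmost_leaf_depth + 2):   # attach below any node on the rightmost branch
        for l in labels:
            yield code + [(d, l)]
```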
4.3 Updating the Occurrence List
In this subsection, we give a method for incrementally computing the embedding occurrences EO_D(T) of a child tree T from the occurrences EO_D(S) of its canonical parent S. In Fig. 8, we show the procedure UpdateOcc that, given a canonical child tree T and the occurrences EO_D(S) of the parent tree S, computes the embedding occurrences EO_D(T). Let T be a canonical representation of a labeled unordered tree over L with domain {1, . . . , k}, and let ϕ ∈ M_D(T) be a matching from T into D. Recall that the total and the embedding occurrences of T associated with ϕ are TO(ϕ) = ⟨ϕ(1), . . . , ϕ(k)⟩ and EO(ϕ) = {ϕ(1), . . . , ϕ(k)}, respectively. For convenience, we identify ϕ with TO(ϕ). We encode an embedding occurrence EO by one of the total occurrences ϕ with EO = EO(ϕ). Since there are many total occurrences corresponding to EO, we introduce a canonical representation for embedding occurrences, similarly as in Section 3.
Fig. 9. A search tree for labeled unordered trees
Two total occurrences ϕ1 and ϕ2 are equivalent to each other if EO(ϕ1) = EO(ϕ2). The occurrence ϕ1 is heavier than ϕ2, denoted by ϕ1 ≥lex ϕ2, if ϕ1 is lexicographically larger than ϕ2 as a sequence in N*. We now give the canonical representation for embedding occurrences.

Definition 4. Let T be the canonical form of a labeled unordered tree and EO ⊆ V_D be one of its embedding occurrences in D. The canonical representation of EO, denoted by CR(EO), is the total occurrence ϕ ∈ M_D(T) that is the heaviest tuple in the equivalence class { ϕ' ∈ M_D(T) | ϕ' ≡ ϕ }.

Let ϕ = ⟨ϕ(1), . . . , ϕ(k)⟩ be a total occurrence of T over D. We denote by P(ϕ) the unique total occurrence of length k − 1 derived from ϕ by removing the last component ϕ(k). We say P(ϕ) is the parent occurrence of ϕ. For a node v ∈ V_T of T, we denote by ϕ(T(v)) = ⟨ϕ(i), ϕ(i+1), . . . , ϕ(i+|T(v)|−1)⟩ the restriction of ϕ to the subtree T(v), where i, i+1, . . . , i+|T(v)|−1 are the nodes of T(v) in preorder. Now, we consider the incremental computation of the embedding occurrences.

Lemma 7. Let k ≥ 1 be any positive integer, S be a canonical tree, and ϕ be a canonical occurrence of S in D. For a node w ∈ V_D, let T = S·v be a child tree of S with the rightmost branch (r0, . . . , rg). Then, the mapping φ = ϕ·w is a canonical total occurrence of T in D iff the following conditions (1)–(4) hold.
(1) label_D(w) = label_T(v).
(2) w ≠ ϕ(i) for every i = 1, . . . , k − 1.
(3) w is a child of ϕ(r_{d−1}), where r_{d−1} ∈ RMB(S) is the node of depth d − 1 on the rightmost branch RMB(S) of S.
(4) C(Li) = C(Ri) implies φ(root(Li)) > φ(root(Ri)) for every i = 0, . . . , g − 1.

Proof. For any total occurrence ϕ ∈ M_D(T), if ϕ is in canonical form then so is its parent P(ϕ). Moreover, ϕ is in canonical form iff ϕ is partially left-heavy, that is, for any nodes v1, v2 ∈ V_T, both (v1, v2) ∈ B and T(v1) = T(v2) imply
ϕ(T (v1 )) ≥lex ϕ(T (v2 )). Thus the lemma holds.
Lemma 7 ensures the correctness of the procedure UpdateOcc of Fig. 8. Regarding the running time, note that the test C(Li) = C(Ri) can be decided in O(1) time using the structure shown in Fig. 4. Note also that every canonical tree has at least one canonical child tree. Thus, we obtain the main theorem of this paper.

Theorem 1. Let D be a database and 0 ≤ σ ≤ 1 be a threshold. Then, the algorithm Unot of Fig. 5 computes all the canonical representations of the frequent unordered trees w.r.t. embedding occurrences in O(kb²m) time per pattern, where b is the maximum branching factor in V_D, k is the maximum size of the patterns enumerated, and m is the number of embeddings of the enumerated pattern.

Proof. Let S be a canonical tree and T be a child tree of S. The procedure UpdateOcc of Fig. 8 computes the list of all canonical total occurrences of T in O(k'bm') time, where k' = |T| and m' = |EO(S)|. From Lemma 6 and the fact that |EO(T)| = O(b|EO(S)|), we have the result.
Fig. 9 illustrates the computation of the algorithm Unot enumerating a subset of the labeled unordered trees of size at most 4 over L = {A, B}. The arrows indicate the parent-child relation, and the crossed trees are the non-canonical ones.
4.4 Comparison to a Straightforward Algorithm
We compare our algorithm Unot to the following straightforward algorithm Naive. Given a database D and a threshold σ, Naive enumerates all labeled ordered trees over L using the rightmost expansion, and then for each tree checks whether it is in canonical form by applying Lemma 3. Since the check takes O(k²) time per tree, this stage takes O(|L|k³) time. It takes O(n^k) time to compute all the embedding occurrences in a database D of size n. Thus, the overall time is O(|L|k³ + n^k). On the other hand, Unot computes the canonical representations in O(kb²m) time, where the total number m of embedding occurrences of T is m = O(n^k) in the worst case. However, m will be much smaller than n^k as the size of the pattern T grows. Thus, if b is a small constant and m is much smaller than n^k, then our algorithm will be faster than the straightforward one.
5 Conclusions
In this paper, we presented an efficient algorithm Unot that computes all frequent labeled unordered trees appearing in a collection of data trees. The algorithm has a provable performance in terms of the output size, unlike previous graph mining algorithms: it enumerates each frequent pattern T in O(kb²m) time per pattern, where k is the size of T, b is the branching factor of the data trees, and m is the total number of occurrences of T in the data trees. We are implementing a prototype system and planning computer experiments on synthesized and real-world data to give an empirical evaluation of the algorithm. The results will be included in the full paper.
Some graph mining algorithms such as AGM [8], FSG [9], and gSpan [19] use various types of the canonical representation for general graphs similar to our canonical representation for unordered trees. AGM [8] and FSG [9] employ the adjacent matrix with the lexicographically smallest row vectors under the permutation of rows and columns. gSpan [19] uses as the canonical form the DFS code generated with the depth-first search over a graph. It is a future problem to study the relationship among these techniques based on canonical coding and to develop efficient coding scheme for restricted subclasses of graph patterns. Acknowledgement. Tatsuya Asai and Hiroki Arimura would like to thank Ken Satoh, Hideaki Takeda, Tsuyoshi Murata, and Ryutaro Ichise for the fruitful discussions on Semantic Web mining, and to thank Takashi Washio, Akihiro Inokuchi, Michihiro Kuramochi, and Ehud Gudes for the valuable discussions and comments on graph mining. Tatsuya Asai is grateful to Setsuo Arikawa for his encouragement and support for this work.
References
1. K. Abe, S. Kawasoe, T. Asai, H. Arimura, and S. Arikawa. Optimized Substructure Discovery for Semi-structured Data. In Proc. PKDD'02, pages 1–14, LNAI 2431, 2002.
2. A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.
3. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient Substructure Discovery from Large Semi-structured Data. In Proc. SIAM SDM'02, pages 158–174, 2002.
4. T. Asai, H. Arimura, K. Abe, S. Kawasoe, and S. Arikawa. Online Algorithms for Mining Semi-structured Data Stream. In Proc. IEEE ICDM'02, pages 27–34, 2002.
5. T. Asai, H. Arimura, T. Uno, and S. Nakano. Discovering Frequent Substructures in Large Unordered Trees. DOI Technical Report DOI-TR 216, Department of Informatics, Kyushu University, June 2003. http://www.i.kyushu-u.ac.jp/doitr/trcs216.pdf
6. D. Avis and K. Fukuda. Reverse Search for Enumeration. Discrete Applied Mathematics, 65(1–3), 21–46, 1996.
7. L. B. Holder, D. J. Cook, and S. Djoko. Substructure Discovery in the SUBDUE System. In Proc. KDD'94, pages 169–180, 1994.
8. A. Inokuchi, T. Washio, and H. Motoda. An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. In Proc. PKDD'00, pages 13–23, LNAI, 2000.
9. M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. In Proc. IEEE ICDM'01, 2001.
10. T. Miyahara, Y. Suzuki, T. Shoudai, T. Uchida, K. Takahashi, and H. Ueda. Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents. In Proc. PAKDD'02, pages 341–355, LNAI, 2002.
11. S. Nakano. Efficient generation of plane trees. Information Processing Letters, 84, 167–172, 2002.
12. S. Nakano and T. Uno. Efficient Generation of Rooted Trees. NII Technical Report NII-2003-005E, ISSN 1346-5597, National Institute of Informatics, July 2003.
13. S. Nestrov, S. Abiteboul, and R. Motwani. Extracting Schema from Semistructured Data. In Proc. SIGKDD'98, pages 295–306, ACM, 1998.
14. S. Nijssen and J. N. Kok. Efficient Discovery of Frequent Unordered Trees. In Proc. MGTS'03, September 2003.
15. A. Termier, M. Rousset, and M. Sebag. TreeFinder: a First Step towards XML Data Mining. In Proc. IEEE ICDM'02, pages 450–457, 2002.
16. T. Uno. A Fast Algorithm for Enumerating Bipartite Perfect Matchings. In Proc. ISAAC'01, LNCS, pages 367–379, 2001.
17. N. Vanetik, E. Gudes, and E. Shimony. Computing Frequent Graph Patterns from Semistructured Data. In Proc. IEEE ICDM'02, pages 458–465, 2002.
18. K. Wang and H. Liu. Schema Discovery from Semistructured Data. In Proc. KDD'97, pages 271–274, 1997.
19. X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. In Proc. IEEE ICDM'02, pages 721–724, 2002.
20. M. J. Zaki. Efficiently Mining Frequent Trees in a Forest. In Proc. SIGKDD 2002, ACM, 2002.
Discovering Rich Navigation Patterns on a Web Site
Karine Chevalier 1,2, Cécile Bothorel 1, and Vincent Corruble 2
1 France Telecom R&D (Lannion), France  {karine.chevalier, cecile.bothorel}@rd.francetelecom.com
2 LIP6, Pole IA, Université Pierre et Marie Curie (Paris VI), France  {Karine.Chevalier, Vincent.Corruble}@lip6.fr
Abstract. In this paper, we describe a method for discovering knowledge about users on a web site from data composed of demographic descriptions and site navigations. The goal is to obtain knowledge that is useful to answer two types of questions: (1) how do site users visit a web site? (2) Who are these users? Our approach is based on the following idea: the set of all site users can be divided into several coherent subgroups; each subgroup shows both distinct personal characteristics, and a distinct browsing behaviour. We aim at obtaining associations between site usage patterns and personal user descriptions. We call this combined knowledge 'rich navigation patterns'. This knowledge characterizes a precise web site usage and can be used in several applications: prediction of site navigation, recommendations or improvement in site design.
1 Introduction
The World Wide Web is a powerful medium through which individuals or organizations can convey all sorts of information. Many attempts have been made to find ways to automatically describe web users (or more generally Internet users) and how they use the Internet. This paper focuses on the study of web users at the level of a given web site: are there several consistent groups of site users based on demographic descriptions? If so, does each group show a distinct way of visiting the web site? These questions are important for site owners and advertisers, but also from a social research perspective: it is interesting to test whether there is some dependence between demographic descriptions and ways of navigating a site.

Our project addresses the discovery of knowledge about users and their different site usage patterns for a given site. We aim at obtaining associations between site usage patterns (through navigation patterns) and personal user descriptions. We call this combined knowledge 'rich navigation patterns'. These patterns highlight, for a given site, the different ways in which specific groups of users (users that share similar personal descriptions) visit the site. Our aim is to test the assumption that there are links between navigation on a site and users' characteristics, and to study the relevance of correlating these two very different types of data. If our results confirm that there are relations between users'
Discovering Rich Navigation Patterns on a Web Site
63
personal characteristics and site navigations, this knowledge would help to describe in a rich manner site visitors, and open avenues to assist site navigation. For instance, it can be helpful when we want to assist a user for which no information (i.e. personal information or last visits on the site) is known. Based on his current navigation on a web site, and a set of rich navigation patterns obtained before, we can infer some personal information about him. We can then recommend him to visit some documents adapted to his inferred profile. This paper is organized in the following way: the second section presents some tools and methods to understand and describe site users, the third section describes rich navigation patterns and a method to discover them and the fourth section shows our evaluations performed on the rich navigation patterns extracted from several web sites.
2 Knowledge Acquisition about Site Users
There are many methods to measure and describe the audience and the traffic on a site. A first way to know the users who access a given site is to use surveys produced by organizations like NetValue [13], Media Metrix [12] and Nielsen//NetRating [14]. Their method consists in analyzing the Internet activities of a panel over a long period of time and inferring knowledge about the entire population. This is a user-centric approach. Panels are built up in order to represent as well as possible the current population of Internet users. Some demographic data on each user of the panel (like age, gender or level of Internet practice) are collected and all their activities on the Internet are recorded. The analysis of these data provides a qualification and quantification of Internet usage. We will consider here only the information related to users and web usage. This approach gives general trends: it indicates, for instance, who uses the web, what sorts of sites they visit, etc., but no processing is performed to capture precisely the usage patterns on a given site. One advantage of this approach is that site owners can get a description of their site users, but this is true only for sites with a large audience; other sites have little chance of having their typical users within the panel, and so cannot obtain a meaningful description of their users. We can point out several interesting aspects of the user-centric approach. Firstly, those methods are based on the extrapolation of observations made on a panel of users to the entire set of users. This means that it can be sufficient to make an analysis on only a part of the users. Secondly, the approach relies on the assumption that there are some links between features of a user's profile and his Internet usage.
The second way to know the users who access a given site is to perform an analysis at the site level. This is the site-centric approach. It consists in collecting all site navigations and then analysing these data in order to obtain traffic measures on the site and retrieve the statistically dominant paths or usage patterns from the set of site sessions. A session corresponds to a user's visit to the site; a session can be considered as a page sequence (in chronological order). Users' sessions are extracted from log files that contain all HTTP requests made on the site. Further information on problems and techniques to retrieve sessions from log files can be found in [5]. There are many industrial tools (WebTrends [16], Cybermetrie [7]) that implement the site-centric approach.
Here, we focus our attention on methods that retrieve site usage patterns automatically. Most of these methods are based on frequency measures: a navigation path is retrieved from the set of site sessions because it has a high probability of being followed [3][4], a sequence of pages is selected because it appears frequently in the set of site sessions (the WebSPADE algorithm [8], an adaptation of the SPADE algorithm [17]), or a site usage pattern is revealed because it is extracted from a group of sessions that were brought together by a clustering method [11]. Cooley et al. suggest filtering the frequent page sets in order to keep the most relevant ones [6]. They consider a page set as interesting if it contains pages that are not directly connected (there is no link between them and no similarity between their contents). These methods allow catching different precise site usage patterns in terms of the site pages visited, but they capture only common site usage patterns (site usage patterns shared by the greatest number of users). If a particular group of users shows a specific usage pattern of the web site, but is not composed of enough users, their specific usage will not be highlighted. In that case, important information can be missed: particular (and significant) behaviours could be lost among all navigations. One way to overcome this limitation is to rely on some assumptions and methodologies of the user-centric approach that we described above. Firstly, it could be interesting to assume that there are some correlations between the users' personal descriptions and the way they visit a given site. We can then build groups of users based on personal characteristics and apply the site usage pattern extraction on smaller sets of sessions in order to capture navigation patterns specific to subgroups of users. This strategy reveals site usage patterns which are less frequent but associated with a specific group of users who share similar personal descriptions. We can then answer questions such as: "Do young people visit the same pages on a given site?" Secondly, in the same manner as the user-centric approach extrapolates knowledge learned on a panel of Internet users to the entire set of Internet users, we could restrict our search to data coming from a subset of site users and interpret the results obtained on this subset as valid for all the site users.
3 Discovering Rich Navigation Patterns
Our research project addresses the problem of knowledge discovery about a set of web site users and their site uses. We explore the possibility of correlating users' personal characteristics with their site navigation. Our objective is to provide a rich usage analysis of a site, i.e. usage patterns that are associated with personal characteristics and so offer a different, deeper understanding of the site usage. This has the following benefits:
• It provides the site manager with the means to understand his/her site users.
• It lets us envisage applications to personalization, such as navigation assistance to help new visitors.
We want to add meaning to site usage patterns, and find site usage patterns which are specific to a subgroup of site users. We explore the possibility of correlating user descriptions and site usage patterns. Our work relies on the assumption that navigation "behaviours" and users' personal descriptions are correlated. If valid, this assumption has two consequences: (1) two users who are similar in socio-demographic terms have similar navigations on a web site; (2) two users who are similar in their navigations on a web site have similar personal descriptions.
Our approach supposes the availability of data that are richer than classical site logs. They are composed of site sessions and personal descriptions of reference users. Reference users form a subset of users who have accepted to provide a list of personal characteristics (like age, job, etc.) and some navigation sessions on the web site, i.e. they are used as a reference panel for the entire population of the web site. From these data, we wish to obtain knowledge that is specific to the web site from which the data are obtained. In an application, by using this knowledge, we can infer some personal information about a new visitor, and propose page recommendations based on his navigation even if he gives no personal information. We choose to build this knowledge around two distinct elements:
- A personal user characteristic is an element that describes a user in a personal way, for instance: age is between 15 and 25 years old, gender is male, etc.
- A navigation pattern represents a site usage pattern. Navigation patterns are sequences or sets of web pages that occur frequently in users' sessions. For instance, our data show, on the boursorama.com site (a French stock market site), the following frequent sequence of pages: access to a page about quoted shares and, later on, consultation of a page that contains advice for making investments.
We call the association of both elements of knowledge a 'rich navigation pattern', i.e. a navigation pattern associated with personal user characteristics. After describing our way of discovering navigation patterns in the next subsection, we detail the different rich navigation patterns that we want to learn and finally we present a way to extract them from a set of data composed of the reference users' descriptions and their site sessions.

3.1 Discovering Navigation Patterns

Navigation patterns are sequences or sets of pages that occur frequently in session sets. We used an algorithm to retrieve frequent sets of pages that takes into account principles of algorithms such as FreeSpan [10] (PrefixSpan [15], WebSPADE [8] and SPADE [19]) that improve Apriori [1]. These algorithms are based on the following idea: "a frequent set is composed of frequent subsets". Here, a session is considered as a set of pages. We chose to associate with each pageset a list of the session ids in which the pageset occurs, in order to avoid scanning the whole set of sessions each time the support of a pageset has to be calculated [10][15][8][19].
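As an illustration of this id-list idea, the following is a minimal Python sketch on toy data (our own simplification, not the authors' implementation): pagesets are grown level-wise, and the support of a candidate is obtained by intersecting the session-id lists of its two parents.

from itertools import combinations

def frequent_pagesets(sessions, min_occurrence):
    # Level-wise mining of frequent pagesets. Each pageset is stored with the
    # set of session ids in which it occurs, so support counting never rescans
    # the whole session set (a pageset is frequent if it occurs in more than
    # min_occurrence sessions, as in Table 1 below).
    occ = {}
    for sid, pages in sessions.items():
        for pg in set(pages):
            occ.setdefault(frozenset([pg]), set()).add(sid)
    level = {ps: sids for ps, sids in occ.items() if len(sids) > min_occurrence}
    frequent = dict(level)
    while len(level) > 1:
        next_level = {}
        for (p1, s1), (p2, s2) in combinations(level.items(), 2):
            union = p1 | p2
            if len(union) == len(p1) + 1:            # join two k-pagesets sharing k-1 pages
                sids = s1 & s2                        # support via id-list intersection
                if len(sids) > min_occurrence and union not in next_level:
                    next_level[union] = sids
        frequent.update(next_level)
        level = next_level
    return frequent

if __name__ == "__main__":
    sessions = {1: ["home", "quotes", "advice"],
                2: ["home", "quotes"],
                3: ["quotes", "advice"],
                4: ["home", "advice", "quotes"]}
    for ps, sids in frequent_pagesets(sessions, min_occurrence=2).items():
        print(sorted(ps), "->", sorted(sids))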
Table 1. Initialisation phase

for each session s in S do
  for each page pg ∈ s do
    Add s to the session set of the page pg.
L1 = { }
for each page pg do
  if numberUser(pg) > minOccurrence then
    L1 = L1 ∪ {pg}
return L1
An initialisation phase (Table 1) creates the frequent sets composed of one page. The session set S is scanned in order to build, for each page pg, a set that contains all the sessions in which pg occurs. Then, only the web pages that appear in the navigations of more than minOccurrence users are kept in L1 (the set of large 1-pagesets).

Table 2. Building (k+1)-pagesets

// Main loop
k = 1
while (|Lk| > 1) do
  Lk+1 = BuildNext(Lk)
  k++
end_while

// BuildNext(Lk):
Lk+1 = { }
i = 1
while (i

Performance Analysis of a Greedy Algorithm for Inferring Boolean Functions

D. Fukagawa and T. Akutsu

… 1 − 1/n^α for any fixed constant α, where n is the number of attributes) if Boolean functions are limited to AND/OR of literals and examples are generated uniformly at random [4]. They also performed computational experiments on GREEDY SGL and conjectured that GREEDY SGL finds the minimum set of attributes with high probability for most Boolean functions if examples are generated uniformly at random. In this paper, we prove the above conjecture where the functions are limited to a large class of Boolean functions, which we call unbalanced functions. The class of unbalanced functions is so large that, for each positive integer d, we can prove that more than half of the d-input Boolean functions are included in the class. We can also prove that the fraction of d-input unbalanced functions converges to 1 as d grows. As mentioned above, GREEDY SGL has very good average-case performance if the number of relevant attributes is not small. However, the average-case performance is not satisfactory if the number is small (e.g., less than 5). Therefore, we developed a variant of GREEDY SGL (GREEDY DBL, in short). We performed computational experiments on GREEDY DBL using artificially generated data sets. The results show that the average-case performance is considerably improved when the number of relevant attributes is small, though GREEDY DBL takes much longer CPU time.
2 Preliminaries

2.1 Problem of Inferring Boolean Function
We consider the inference problem for n input variables x1, . . . , xn and one output variable y. Let ⟨x1(k), . . . , xn(k), y(k)⟩ be the kth tuple in the table, where xi(k) ∈ {0, 1} and y(k) ∈ {0, 1} for all i, k. Then, we define the problem of inferring a Boolean function in the following way.
Input: ⟨x1(k), . . . , xn(k), y(k)⟩ for k = 1, . . . , m, where xi(k), y(k) ∈ {0, 1} for all i, k.
Output: a set X = {x_{i_1}, . . . , x_{i_d}} with the minimum cardinality (i.e., minimum d) for which there exists a function f(x_{i_1}, . . . , x_{i_d}) such that (∀k).(y(k) = f(x_{i_1}(k), . . . , x_{i_d}(k))) holds.
Clearly, this problem is an NP-optimization problem. To guarantee the existence of at least one f which satisfies the condition above, we suppose that the input data are consistent (i.e., for any k1 ≠ k2, (∀i)(xi(k1) = xi(k2)) implies y(k1) = y(k2)). This can be tested in O(nm log m) time by sorting the input tuples. In this paper, what is to be inferred is not f itself, but the set X of d input variables. If d is bounded by a constant, we can determine f in O(m) time after determining the d input variables.
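As a concrete illustration of this consistency test, here is a small Python sketch (our own helper, with the table given as a list of (x, y) tuples). Sorting the m rows of length n gives the O(nm log m) bound, and equal inputs become adjacent so a single linear scan suffices.

def is_consistent(table):
    # table: list of (x, y) with x a tuple of n bits and y a bit.
    # Equal input vectors become adjacent after sorting, so any contradiction
    # is detected by comparing neighbouring rows.
    rows = sorted(table)
    for (x1, y1), (x2, y2) in zip(rows, rows[1:]):
        if x1 == x2 and y1 != y2:
            return False
    return True

print(is_consistent([((0, 1), 1), ((1, 0), 0), ((0, 1), 1)]))   # True
print(is_consistent([((0, 1), 1), ((0, 1), 0)]))                # False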
2.2 GREEDY SGL: A Simple Greedy Algorithm
The problem of inferring a Boolean function is closely related to the problem of inferring functional dependencies, and both of them are known to be NP-hard [12]. Therefore a simple greedy algorithm, GREEDY SGL, has been proposed [2]. In GREEDY SGL, the original problem is reduced to the set cover problem and a well-known greedy algorithm (see e.g. [16]) for set cover is applied. GREEDY SGL is not only for inferring Boolean functions, but also for other variations of the problem (e.g., the domain can be multivalued, real numbers, and so on). The following is a pseudo-code for GREEDY SGL:

S ← {(k1, k2) | k1 < k2 and y(k1) ≠ y(k2)}
X ← {}
X' ← {x1, . . . , xn}
while S ≠ {} do
  for all xi ∈ X' do
    Si ← {(k1, k2) ∈ S | xi(k1) ≠ xi(k2)}
  i* ← argmax_i |Si|
  S ← S − S_{i*}
  X' ← X' − {x_{i*}}
  X ← X ∪ {x_{i*}}
Output X

The approximation ratio of this algorithm is at most 2 ln m + 1 [2]. Since a lower bound of Ω(log m) on the approximation ratio has been proved [2], GREEDY SGL is optimal up to a constant factor.
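For concreteness, the following is a direct, unoptimised Python transcription of this pseudo-code on a toy table (variable names and data are ours; the efficient implementation discussed in [4] is more involved).

def greedy_sgl(x, y):
    # x: list of m tuples of n binary attributes; y: list of m binary outputs.
    # Greedy set cover over the pairs of examples with different outputs.
    m, n = len(x), len(x[0])
    S = {(k1, k2) for k1 in range(m) for k2 in range(k1 + 1, m) if y[k1] != y[k2]}
    chosen, remaining = set(), set(range(n))
    while S:
        # S_i: still-uncovered pairs that attribute i separates
        cover = {i: {(k1, k2) for (k1, k2) in S if x[k1][i] != x[k2][i]}
                 for i in remaining}
        best = max(cover, key=lambda i: len(cover[i]))
        if not cover[best]:        # inconsistent data: no attribute separates S
            break
        S -= cover[best]
        remaining.discard(best)
        chosen.add(best)
    return chosen

# y = x0 AND x2 on a toy table with one irrelevant attribute x1
x = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1), (1, 1, 0)]
y = [a and c for (a, b, c) in x]
print(sorted(greedy_sgl(x, y)))   # expected: [0, 2]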
2.3 Notations
Let f(x1, . . . , xd) (or simply f) be a d-input Boolean function. Then |f| denotes the number of assignments of the input variables for which f = 1 holds (i.e., |f| = |{(x1, . . . , xd) | f(x1, . . . , xd) = 1}|). We use \bar{f} to represent the negation of f. Hence |\bar{f}| is the number of assignments for which f = 0 holds. For an input variable xi of f, f_{x_i} and f_{\bar{x}_i} denote the positive and negative cofactors w.r.t. xi, respectively. Namely, f_{x_i} is f in which xi is replaced by the constant 1. Similarly, f_{\bar{x}_i} is f in which xi is replaced by the constant 0. For a set of variables X, let Π(X) denote the set of products of literals, i.e., Π(X) = {c = l_{i_1} · · · l_{i_k} | l_{i_j} is either x_{i_j} or \bar{x}_{i_j}}. For a product of literals c = l_{i_1} · · · l_{i_k}, the extended Shannon cofactor f_c is defined as (· · · ((f_{l_{i_1}})_{l_{i_2}}) · · · )_{l_{i_k}}. For a d-input (completely defined) Boolean function f, we call an input variable xi relevant iff there exists an assignment of x1, . . . , x_{i−1}, x_{i+1}, . . . , xd such that f(x1, . . . , x_{i−1}, 0, x_{i+1}, . . . , xd) ≠ f(x1, . . . , x_{i−1}, 1, x_{i+1}, . . . , xd) holds. Note that this definition of relevancy does not incorporate redundancy.
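These quantities are easy to compute from a truth table for small d; the following Python helpers (our own naming, not from the paper) make the notation concrete.

from itertools import product

def ones(f, d):
    # |f|: number of assignments on which the d-input function f evaluates to 1
    return sum(f(*a) for a in product((0, 1), repeat=d))

def cofactor_size(f, d, i, b):
    # size of the positive (b = 1) or negative (b = 0) cofactor w.r.t. x_i:
    # assignments of the remaining d-1 variables on which f is 1 once x_i := b
    return sum(f(*(a[:i] + (b,) + a[i:])) for a in product((0, 1), repeat=d - 1))

def is_relevant(f, d, i):
    # x_i is relevant iff flipping it changes the value of f for some assignment
    return any(f(*(a[:i] + (0,) + a[i:])) != f(*(a[:i] + (1,) + a[i:]))
               for a in product((0, 1), repeat=d - 1))

f = lambda x1, x2, x3: x1 & (x2 | x3)                        # a 3-input example
print(ones(f, 3))                                            # |f| = 3
print(cofactor_size(f, 3, 0, 1), cofactor_size(f, 3, 0, 0))  # 3 and 0
print(is_relevant(f, 3, 0), is_relevant(f, 3, 1))            # True True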
3 Unbalanced Functions
It has been mathematically proved that, for an instance of the inference problem generated uniformly at random, GREEDY SGL can find the optimal solution with high probability if the underlying function is restricted to AND/OR of literals [4]. On the other hand, the results of computational experiments suggest that GREEDY SGL works optimally in more general situations [4]. In this section, we define a class of Boolean functions which we call unbalanced functions. The class of unbalanced functions includes the class of AND/OR of literals and is much larger than AND/OR of literals. We also evaluate the size of the class. More than half of all Boolean functions are included in the class of unbalanced functions. Furthermore, for large d, almost all Boolean functions are included. Note that the definition of the term "unbalanced functions" is not common. In this paper, we define the term according to Tsai et al. [15] with some modification.

Definition 1. Let f be a d-input Boolean function (i.e., f : {0, 1}^d → {0, 1}). We say f is balanced w.r.t. a variable xi iff |f_{x_i}| = |f_{\bar{x}_i}| holds. Otherwise f is unbalanced w.r.t. xi.

Definition 2. We say f is balanced iff f is balanced w.r.t. all the input variables. Similarly, we say f is unbalanced iff f is unbalanced w.r.t. all the input variables.

Let us give an example. For d = 2, there exist sixteen Boolean functions. Eight of them, namely {x1 ∧ x2, \bar{x}_1 ∧ x2, x1 ∧ \bar{x}_2, \bar{x}_1 ∧ \bar{x}_2, x1 ∨ x2, \bar{x}_1 ∨ x2, x1 ∨ \bar{x}_2, \bar{x}_1 ∨ \bar{x}_2}, are members of the unbalanced functions. Four functions, namely {0, 1, x1 ⊕ x2, \overline{x1 ⊕ x2}}, are members of the balanced functions. The remaining four functions, {x1, x2, \bar{x}_1, \bar{x}_2}, are neither balanced nor unbalanced.
Next, we evaluate the size of the class of unbalanced functions and show that it is much larger than AND/OR of literals. Let B_i^{(d)} be the set of d-input Boolean functions for which xi is balanced. For example, in the case of d = 2, we have B_1^{(2)} = {0, x2, x1 ⊕ x2, \overline{x1 ⊕ x2}, \bar{x}_2, 1} and B_2^{(2)} = {0, x1, x1 ⊕ x2, \overline{x1 ⊕ x2}, \bar{x}_1, 1}. Then, we have the following lemmas.

Lemma 1. |B_1^{(d)}| = · · · = |B_d^{(d)}| = \binom{2^d}{2^{d-1}}.
Proof. Since it is clear that |B_1^{(d)}| = · · · = |B_d^{(d)}| holds by symmetry, all we have to prove is |B_1^{(d)}| = \binom{2^d}{2^{d-1}}. Assume that f is balanced w.r.t. a variable x1. Let f = (x1 ∧ f_{x_1}) ∨ (\bar{x}_1 ∧ f_{\bar{x}_1}) be the Shannon decomposition of f. f_{x_1} is a partial function of f which defines one half of the values of f, and f_{\bar{x}_1} independently defines the other half. Assuming that x1 is balanced, we have |f_{x_1}| = |f_{\bar{x}_1}| by the definition. For any (d − 1)-input Boolean functions g and h, consider the d-input function f = (x1 ∧ g(x2, . . . , xd)) ∨ (\bar{x}_1 ∧ h(x2, . . . , xd)). Note that there is a one-to-one correspondence between (g, h) and f. Then f is balanced w.r.t. x1 iff |g| = |h|. Hence, |B_1^{(d)}| is equal to the number of possible pairs (g, h) such that |g| = |h| holds. Since the number of such pairs is

\[ \sum_{k=0}^{2^{d-1}} \binom{2^{d-1}}{k}^2 = \binom{2^d}{2^{d-1}}, \]

the lemma follows.

Lemma 2. The fraction of unbalanced functions among all the Boolean functions converges to 1 as d grows. That is, |\overline{B}_1^{(d)} ∩ \overline{B}_2^{(d)} ∩ · · · ∩ \overline{B}_d^{(d)}| ∼ 2^{2^d}.

Proof. Using the Boole–Bonferroni inequality [13], Stirling's approximation and Lemma 1, we have

\[ \frac{1}{2^{2^d}} \left| \bigcup_{i=1}^{d} B_i^{(d)} \right| \le \frac{1}{2^{2^d}} \sum_{i=1}^{d} |B_i^{(d)}| = \frac{d}{2^{2^d}} \binom{2^d}{2^{d-1}} \sim \frac{d}{\sqrt{\pi 2^{d-1}}} . \]

This converges to 0 as d grows. Using de Morgan's law,

\[ \left| \bigcap_{i=1}^{d} \overline{B}_i^{(d)} \right| = 2^{2^d} - \left| \bigcup_{i=1}^{d} B_i^{(d)} \right| \sim 2^{2^d} \cdot \left( 1 - \frac{d}{\sqrt{\pi 2^{d-1}}} \right) \sim 2^{2^d} . \]
Even for small d, we have the following lemma.

Lemma 3. For any d, the number of d-input unbalanced functions is more than half the number of all the d-input Boolean functions.

Proof. It is easy to prove that a Boolean function f is unbalanced if |f| is odd. In fact, if f is not unbalanced, there exists a variable xi such that |f_{x_i}| = |f_{\bar{x}_i}| holds, for which |f| = |f_{x_i}| + |f_{\bar{x}_i}| must be even. Now, we can show that |{f : |f| is even}| = |{f : |f| is odd}|, and hence the lemma follows.

Fig. 1 shows the fraction of the class of unbalanced functions. The x-axis shows d (the arity of a function) and the y-axis shows the fraction (0 to 1) of the unbalanced functions. The fractions (black circles) are calculated with the exact number of unbalanced functions for each d. Since the exact numbers are hard to compute for large d, we give the upper (stars) and lower (crosses) bounds for them. The approximation of the lower bound (dashed line; see Lemma 2) is also drawn. The two bounds and the approximation almost join together for d > 10. The fraction of unbalanced functions is more than 90% for d > 15 and it converges to 100% as d increases.
Fig. 1. The fraction of unbalanced functions (x-axis: d; y-axis: fraction of unbalanced functions among all d-input Boolean functions; plotted: the exact fraction, upper and lower bounds, and the approximation 1 − d/√(π · 2^{d−1}))
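For small d, the exact counts behind Fig. 1 can be reproduced by brute force over all 2^{2^d} truth tables; the sketch below (our own code, feasible only for very small d) reproduces the d = 2 example given above.

from itertools import product

def classify(d):
    # Count balanced and unbalanced d-input Boolean functions by brute force.
    assignments = list(product((0, 1), repeat=d))
    balanced = unbalanced = 0
    for bits in product((0, 1), repeat=len(assignments)):   # one truth table
        f = dict(zip(assignments, bits))
        # balanced w.r.t. x_i iff the positive and negative cofactors
        # contain the same number of 1-entries
        bal = [sum(v for a, v in f.items() if a[i] == 1) ==
               sum(v for a, v in f.items() if a[i] == 0)
               for i in range(d)]
        balanced += all(bal)
        unbalanced += not any(bal)
    return balanced, unbalanced, 2 ** (2 ** d)

for d in (2, 3):
    b, u, t = classify(d)
    # d = 2 reproduces the example above: 8 unbalanced and 4 balanced functions
    print(f"d={d}: unbalanced {u}/{t} = {u/t:.3f}, balanced {b}")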
4 Analysis for a Special Case
In this section, we consider a special case in which all the possible assignments of the input variables are given. Namely, m = 2^n. Note that such an instance gives the complete description of the underlying function (precisely, it contains 2^n/2^d copies of the truth table of the function). We are interested in whether GREEDY SGL succeeds for such instances or not.

Lemma 4. Consider the instance of the inference problem for a Boolean function f. Assume that all the possible assignments of the input variables are given in the table. (Note that for a given f, we can uniquely determine the instance of this case.) For such an instance, GREEDY SGL can find the optimal solution if f is unbalanced.

Proof. Let us prove that GREEDY SGL chooses a proper variable in each iteration. Let X* be the set of all relevant variables (i.e., the set of input variables of f). In the first iteration, for a relevant variable xi ∈ X*, the number of tuple pairs covered by xi is
\[ |S_i| = \left( |f_{x_i}| \cdot |\bar{f}_{\bar{x}_i}| + |f_{\bar{x}_i}| \cdot |\bar{f}_{x_i}| \right) \cdot \left( \frac{2^n}{2^d} \right)^2 . \]

Using |f_{x_i}| + |f_{\bar{x}_i}| = |f|, |\bar{f}_{x_i}| + |\bar{f}_{\bar{x}_i}| = |\bar{f}| and |f_{x_i}| + |\bar{f}_{x_i}| = |f_{\bar{x}_i}| + |\bar{f}_{\bar{x}_i}|, there exists ∆ such that

\[ |f_{x_i}| = \frac{|f|}{2} + \Delta, \quad |f_{\bar{x}_i}| = \frac{|f|}{2} - \Delta, \quad |\bar{f}_{x_i}| = \frac{|\bar{f}|}{2} - \Delta, \quad |\bar{f}_{\bar{x}_i}| = \frac{|\bar{f}|}{2} + \Delta . \]

So, |S_i| is rewritten as follows:

\[ |S_i| = \left( \frac{|f| \cdot |\bar{f}|}{2} + 2\Delta^2 \right) \cdot \left( \frac{2^n}{2^d} \right)^2 = \left( \frac{|f| \cdot |\bar{f}|}{2} + \frac{(|f_{x_i}| - |f_{\bar{x}_i}|)^2}{2} \right) \cdot \left( \frac{2^n}{2^d} \right)^2 . \tag{1} \]

On the other hand, for an irrelevant variable xj ∉ X*, the number of tuple pairs covered by xj is

\[ |S_j| = \left( \frac{|f|}{2} \cdot \frac{|\bar{f}|}{2} + \frac{|f|}{2} \cdot \frac{|\bar{f}|}{2} \right) \cdot \left( \frac{2^n}{2^d} \right)^2 = \frac{|f| \cdot |\bar{f}|}{2} \cdot \left( \frac{2^n}{2^d} \right)^2 . \tag{2} \]

Subtracting (2) from (1),

\[ |S_i| - |S_j| = \frac{(|f_{x_i}| - |f_{\bar{x}_i}|)^2}{2} \cdot \left( \frac{2^n}{2^d} \right)^2 \ge 0 . \tag{3} \]

If f is unbalanced w.r.t. xi, it holds that |S_i| > |S_j| for any irrelevant variable xj ∉ X*. Hence, the variable chosen in the first iteration must be relevant.
Next, assuming that GREEDY SGL has succeeded up to the rth iteration and has chosen a set of variables X_r = {x_{i_1}, . . . , x_{i_r}}, let us prove the success in the (r + 1)th iteration. The number of tuple pairs covered by xi ∈ X* \ X_r is

\[ |S_i| = \sum_c \left( |(f_c)_{x_i}| \cdot |(\bar{f}_c)_{\bar{x}_i}| + |(f_c)_{\bar{x}_i}| \cdot |(\bar{f}_c)_{x_i}| \right) \cdot \left( \frac{2^n}{2^d} \right)^2 = \sum_c \left( \frac{|f_c| \cdot |\bar{f}_c|}{2} + \frac{(|f_{c x_i}| - |f_{c \bar{x}_i}|)^2}{2} \right) \cdot \left( \frac{2^n}{2^d} \right)^2 , \]

where c denotes a product of literals consisting of the variables in X_r, and the summation Σ_c is taken over all the possible products of the variables in X_r. For xj ∉ X*, it holds that |f_{c x_j}| = |f_{c \bar{x}_j}|. Therefore the requirement for |S_i| > |S_j| is that (∃c).(f_c is unbalanced w.r.t. xi). Let us prove that this requirement is satisfied if f is unbalanced w.r.t. xi. Assuming (∀c).(|f_{c x_i}| = |f_{c \bar{x}_i}|), we can obtain |f_{x_i}| = |f_{\bar{x}_i}| owing to |f_{x_i}| = Σ_c |f_{c x_i}| and |f_{\bar{x}_i}| = Σ_c |f_{c \bar{x}_i}|. Hence, if f is unbalanced w.r.t. xi (i.e., |f_{x_i}| ≠ |f_{\bar{x}_i}|), there exists at least one c for which f_c is unbalanced w.r.t. xi. As a consequence, GREEDY SGL succeeds in choosing a relevant variable in the (r + 1)th iteration.
By induction, the variable chosen by GREEDY SGL is included in X* in each iteration. When all the variables in X* have been chosen, GREEDY SGL outputs the solution (= X*) and stops; that is, GREEDY SGL succeeds in finding the optimal solution for the instance.
5 Analysis for Random Instances

5.1 The Condition for Success
GREEDY SGL can find the optimal solution for some instances even if they do not conform to the special case considered in the previous section. First, let us see what kind of instances those are.
The following lemma is an extension of the known result on the performance of GREEDY SGL for AND/OR functions [4]. This lemma gives a characterization of the sets of examples for which GREEDY SGL succeeds. It helps to discuss the success probability of GREEDY SGL (see Sect. 5.2).

Lemma 5. Consider the problem of inferring a Boolean function for n input variables and one output variable. Let f be a d(< n)-input Boolean function and X be the set {x_{i_1}, . . . , x_{i_d}}. Given an instance which has the optimal solution y = f(X), GREEDY SGL succeeds if

\[ (\forall x_i \in X).\left( |f_{x_i}| \neq |f_{\bar{x}_i}| \right) \tag{4} \]
\[ \left( |f| \cdot |\bar{f}| + 1 \right) \cdot (B_{MIN})^2 > |f| \cdot |\bar{f}| \cdot (2 C_{MAX})^2 \tag{5} \]

where B_MIN and C_MAX are defined as follows:

\[ B_A = \{ k \mid (x_{i_1}(k), \ldots, x_{i_d}(k)) = A \}, \qquad B_{MIN} = \min_A |B_A|, \]
\[ C_A(i, p) = \{ k \mid x_i(k) = p \wedge k \in B_A \}, \qquad C_{MAX} = \max_{A, i, p} |C_A(i, p)| . \]
Proof. Assuming (4) and (5), let us prove that GREEDY SGL chooses a proper variable in each iteration. For each xi ∈ X, the number of tuple pairs covered by xi in the first iteration is

\[ |S_i| = \sum_{p \in \{0,1\}} \; \sum_{A \in f^0(i,p)} \; \sum_{A' \in f^1(i,1-p)} |B_A| \cdot |B_{A'}| \ge \left( |f_{x_i}| \cdot |\bar{f}_{\bar{x}_i}| + |f_{\bar{x}_i}| \cdot |\bar{f}_{x_i}| \right) \cdot (B_{MIN})^2 = \left( \frac{|f| \cdot |\bar{f}|}{2} + \frac{(|f_{x_i}| - |f_{\bar{x}_i}|)^2}{2} \right) \cdot (B_{MIN})^2 , \]

where f^q(i, p) = {A | A_i = p ∧ f(A) = q}, i.e., |f^1(i, 1)| = |f_{x_i}|, |f^1(i, 0)| = |f_{\bar{x}_i}|, |f^0(i, 1)| = |\bar{f}_{x_i}| and |f^0(i, 0)| = |\bar{f}_{\bar{x}_i}|. Applying (4) to the above, we have

\[ |S_i| \ge \frac{|f| \cdot |\bar{f}| + 1}{2} \cdot (B_{MIN})^2 . \]

Similarly, the number of tuple pairs covered by xj ∉ X is

\[ |S_j| = \sum_{p \in \{0,1\}} \; \sum_{A \in f^0} \; \sum_{A' \in f^1} |C_A(j,p)| \cdot |C_{A'}(j,1-p)| \le 2 \cdot |f| \cdot |\bar{f}| \cdot (C_{MAX})^2 , \]

where f^q = {A | f(A) = q}, i.e., |f^1| = |f| and |f^0| = |\bar{f}|. Consequently, |S_i| > |S_j| holds for any xi ∈ X and xj ∉ X if condition (5) is satisfied. Let i1 = argmax_i |S_i|. Thus, GREEDY SGL chooses x_{i_1}, which must be included in X.
Next, assume that GREEDY SGL has chosen proper variables X_r = {x_{i_1}, . . . , x_{i_r}} before the (r + 1)th step of the iteration, where r ≥ 1. Let us prove that GREEDY SGL succeeds in choosing a proper variable at the (r + 1)th step, too. Suppose that f_c denotes an extended Shannon cofactor of f w.r.t. c, where c is a product of variables in X_r. For each xi ∈ X \ X_r, the number of tuple pairs covered by xi is

\[ |S_i| = \sum_c \sum_{p \in \{0,1\}} \; \sum_{A \in f_c^0(i,p)} \; \sum_{A' \in f_c^1(i,1-p)} |B_A| \cdot |B_{A'}| \ge \sum_c \left( \frac{|f_c| \cdot |\bar{f}_c|}{2} + \frac{(|f_{c x_i}| - |f_{c \bar{x}_i}|)^2}{2} \right) \cdot (B_{MIN})^2 \ge \frac{1}{2} \left( \sum_c |f_c| \cdot |\bar{f}_c| + 1 \right) \cdot (B_{MIN})^2 , \]

where f_c^q(i, p) = {A | A_i = p ∧ f_c(A) = q}. The last inequality follows from (∃c).(|f_{c x_i}| ≠ |f_{c \bar{x}_i}|). In fact, assuming (∀c).(|f_{c x_i}| = |f_{c \bar{x}_i}|), we can obtain |f_{x_i}| = |f_{\bar{x}_i}| since Σ_c |f_{c x}| = |f_x|, which contradicts condition (4). On the other hand, for each xj ∉ X, the number of tuple pairs covered by xj at the (r + 1)th step is

\[ |S_j| = \sum_c \sum_{p \in \{0,1\}} \; \sum_{A \in f_c^0} \; \sum_{A' \in f_c^1} |C_A(j,p)| \cdot |C_{A'}(j,1-p)| \le 2 \cdot \left( \sum_c |f_c| \cdot |\bar{f}_c| \right) \cdot (C_{MAX})^2 . \]

Hence, GREEDY SGL will succeed at the (r + 1)th step if

\[ \left( 1 + \frac{1}{\sum_c |f_c| \cdot |\bar{f}_c|} \right) \cdot (B_{MIN})^2 > (2 C_{MAX})^2 , \]

which is obtained from (5) using Σ_c |f_c| · |\bar{f}_c| ≤ (Σ_c |f_c|) · (Σ_c |\bar{f}_c|) = |f| · |\bar{f}|. Thus, GREEDY SGL succeeds in choosing a proper variable at the (r + 1)th step of the iteration. By induction, GREEDY SGL finds a proper variable at each step.
Note that if the number of tuples m is sufficiently large (precisely, m = Ω(log n)), there exists at most one optimal solution with high probability [3]. Thus, the assumption on the uniqueness of the optimal solution is reasonable in this lemma.
We can easily see that the conditions in Lemma 5 are not necessary for the success of GREEDY SGL. Condition (4) can be relaxed as follows: for the optimal solution (i_1, . . . , i_d), there exist a permutation (i'_1, . . . , i'_d) and a list of literals l_{i'_1}, . . . , l_{i'_{d-1}} such that |(f_{l_{i'_1} \cdots l_{i'_{j-1}}})_{x_{i'_j}}| ≠ |(f_{l_{i'_1} \cdots l_{i'_{j-1}}})_{\bar{x}_{i'_j}}| holds for each j = 1, . . . , d.
Theoretical results shown in this paper can be modified for a simpler algorithm which chooses the d input variables corresponding to the d highest |S_i|. However, the success probability of GREEDY SGL is expected to be higher [4] and GREEDY SGL can cover more cases, as mentioned just above.
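The quantities B_MIN and C_MAX of Lemma 5 are straightforward to compute for a concrete instance; the following sketch (our own helper, with X given as a list of attribute indices) illustrates their definitions.

from collections import defaultdict

def lemma5_quantities(x, X):
    # B_A: rows grouped by their projection onto the candidate attribute set X;
    # C_A(i, p): the part of a block B_A in which attribute i takes value p.
    # (Here i ranges only over attributes outside X, which is what the bound
    # on |S_j| in the proof uses; this restriction is our simplification.)
    blocks = defaultdict(list)
    for k, row in enumerate(x):
        blocks[tuple(row[i] for i in X)].append(k)
    b_min = min(len(rows) for rows in blocks.values())
    others = [i for i in range(len(x[0])) if i not in X]
    c_max = 0
    for rows in blocks.values():
        for i in others:
            for p in (0, 1):
                c_max = max(c_max, sum(1 for k in rows if x[k][i] == p))
    return b_min, c_max

x = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1), (0, 1, 1), (1, 0, 1)]
print(lemma5_quantities(x, X=[0]))   # (3, 2) for this toy table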
5.2 The Success Rate for the Random Instances
In Lemma 5, we formulated two success conditions for GREEDY SGL. To estimate the success rate of GREEDY SGL on random instances, we estimate the probability that these conditions hold. The probability that (4) holds is equal to the fraction of unbalanced functions for fixed d, which was already presented (see Sect. 3). It is the probability of (5) that remains to be estimated. Assuming that instances are generated uniformly at random, we can prove the following lemma as in [4].

Lemma 6. For sufficiently large m (m = Ω(α log n)), suppose that an instance of the inference problem is generated uniformly at random. Then, the instance satisfies (5) with high probability (with probability > 1 − 1/n^α).

Combining Lemmas 5 and 6, we have:

Theorem 1. Suppose that functions are restricted to d-input unbalanced functions, where d is a constant. Suppose that an instance of the inference problem is generated uniformly at random. Then, for sufficiently large m (m = Ω(α log n)), GREEDY SGL outputs the correct set of input variables {x1, . . . , xd} with high probability (with probability > 1 − 1/n^α for any fixed constant α).

Note that the number of unbalanced functions is much larger than the number of AND/OR functions for each d, and that their fraction among all Boolean functions converges to 1 as d grows (see Lemma 2 and Fig. 1).
6 Computational Experiments
As proved above, GREEDY SGL has very good average-case performance if d (the number of relevant variables) is sufficiently large. However, the average-case performance is not satisfactory if d is small (e.g., less than 5). Thus, we developed a modified version of GREEDY SGL, which we call GREEDY DBL. It is less efficient, but outperforms GREEDY SGL in the success ratio. GREEDY DBL is almost the same as GREEDY SGL. Both algorithms reduce the inference problem to the set cover problem. Recall that GREEDY SGL chooses the variable xi that maximizes |S_i| at each step of the iteration. GREEDY DBL finds the pair of variables (xi, xj) that maximizes |S_i ∪ S_j| and then chooses the variable xi if |S_i| ≥ |S_j| and xj otherwise. The two algorithms differ only in that respect. While GREEDY SGL takes O(m^2 n g) time, it is known that there exists an efficient implementation that works in O(m n g) time [4], where g is the number of iterations (i.e., the number of variables output by GREEDY SGL) and d is assumed to be bounded by a constant. GREEDY DBL takes O(m^2 n^2 g) time since it examines pairs of variables. As in GREEDY SGL, there exists an implementation of GREEDY DBL which works in O(m n^2 g) time. We implemented efficient versions of these two algorithms and compared them on their success ratios for random instances.
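Since GREEDY DBL differs from GREEDY SGL only in how the next variable is selected, that selection step can be sketched as follows (an unoptimised illustration with our own naming, not the authors' O(m n^2 g) implementation).

from itertools import combinations

def dbl_select(S, x, remaining):
    # One GREEDY DBL selection: find the pair of attributes (i, j) maximising
    # |S_i ∪ S_j| over the still-uncovered pairs S, then return whichever of
    # the two covers more pairs on its own.
    cover = {i: {(k1, k2) for (k1, k2) in S if x[k1][i] != x[k2][i]}
             for i in remaining}
    best_i, best_j = max(combinations(remaining, 2),
                         key=lambda ij: len(cover[ij[0]] | cover[ij[1]]))
    return best_i if len(cover[best_i]) >= len(cover[best_j]) else best_j

# toy data: y = x0 XOR x1, with x2 constant and therefore irrelevant
x = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
y = [a ^ b for (a, b, c) in x]
S = {(k1, k2) for k1 in range(4) for k2 in range(k1 + 1, 4) if y[k1] != y[k2]}
print(dbl_select(S, x, remaining={0, 1, 2}))   # picks one of the XOR inputs (prints 0)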
Table 1. The success ratio of GREEDY SGL and GREEDY DBL (%) (n = 1000, #iter = 100)

             GREEDY SGL                 GREEDY DBL
           d=1    2    3    4        d=1    2    3    4
m = 100    46%  45%  43%  36%        57%  60%  75%  75%
m = 300    47%  55%  69%  79%        48%  70%  88%  98%
m = 500    61%  45%  68%  88%        47%  67%  80%  98%
m = 800    56%  51%  72%  96%        48%  62%  84%  96%
m = 1000   58%  50%  75%  96%        51%  58%  82%  99%

             GREEDY SGL                 GREEDY DBL
           d=5    6    7    8        d=5    6    7    8
m = 100     7%   0%   0%   0%        26%   0%   0%   0%
m = 300    77%  40%   5%   0%        98%  81%  19%   1%
m = 500    87%  73%  34%   5%        99%  95%  73%  23%
m = 800    96%  86%  71%  17%       100%  99%  99%  56%
m = 1000   96%  96%  74%  40%       100% 100% 100%  82%
Here, "success" means that GREEDY SGL (resp. GREEDY DBL) outputs the set of variables which is the same as that of the underlying Boolean function (though the optimal number of variables may be smaller than that of the underlying Boolean function, it was proved that such a case seldom occurs if m is sufficiently large [3]). We set n = 1000 and varied m from 100 up to 1000 and d from 1 up to 8. For each n, m and d, we generated an instance uniformly at random and solved it with both GREEDY SGL and GREEDY DBL. We repeated this process and counted the successful executions over 100 trials. The random instances were generated in the same way as Akutsu et al. [4] did:
– First, we randomly selected d different variables x_{i_1}, . . . , x_{i_d} from x1, . . . , xn.
– We randomly selected a d-input Boolean function (say, f). There are 2^{2^d} possible Boolean functions, each of which has the same probability.
– Then, we let xi(k) = 0 with probability 0.5 and xi(k) = 1 with probability 0.5 for all xi (i = 1, . . . , n) and k (k = 1, . . . , m). Finally, we let y(k) = f(x_{i_1}(k), . . . , x_{i_d}(k)) for all k.
Table 1 shows the success ratios of GREEDY SGL and GREEDY DBL for the random instances. As expected, the success ratios of GREEDY DBL are higher than those of GREEDY SGL. It can be seen that the success ratio increases as d increases for both GREEDY SGL and GREEDY DBL if m is sufficiently large. This agrees with the results on the fraction of unbalanced functions shown in Sect. 3. It can also be seen that the success ratio increases as m increases. This is reasonable because Lemma 6 holds for sufficiently large m. In the case of d = 2, GREEDY SGL is expected to succeed for half of the Boolean functions, because the other half includes the degenerated functions, XOR and its negation. Since GREEDY DBL can find the optimal solution even if the
underlying function is XOR or its negation, the success ratio is expected to be 62.5%. The results of the experiments agree with these expectations. Through all the computational experiments, we used a PC with a 2.8 GHz CPU and 512 KB cache memory. For a case of n = 1000, d = 3, m = 100, approximate CPU time was 0.06 sec. for GREEDY SGL and 30 sec. for GREEDY DBL. Since cases of n > 10000 may be intractable for GREEDY DBL (in fact it is expected to take more than 3000 sec.), we need to improve its time efficiency.
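Random instances of the kind used in these experiments can be generated directly from the description above; the following Python sketch (our own code, with an arbitrary seed for reproducibility) follows the three steps.

import random
from itertools import product

def random_instance(n, m, d, seed=0):
    # d hidden relevant attributes, a uniformly random d-input Boolean
    # function, and uniformly random rows, as in the experimental setup.
    rng = random.Random(seed)
    relevant = rng.sample(range(n), d)
    truth = {a: rng.randint(0, 1) for a in product((0, 1), repeat=d)}
    x = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(m)]
    y = [truth[tuple(row[i] for i in relevant)] for row in x]
    return x, y, sorted(relevant)

x, y, relevant = random_instance(n=1000, m=300, d=3)
print(relevant)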
7 Comparison with Related Work
Since many studies have been done on feature selection and various greedy algorithms have been proposed, we compare the algorithms in this paper with related algorithms. However, it should be noted that the main result of this paper is a theoretical analysis of the average-case behavior of a greedy algorithm and is considerably different from other results. Boros et al. proposed several greedy-like algorithms [6], one of which is almost the same as GREEDY SGL. They present some theoretical results as well as experimental results using real-world data. However, they did not analyze the average-case behavior of the algorithms. Pagallo and Haussler proposed greedy algorithms for learning Boolean functions with a short DNF representation [14], though they used information-theoretic measures instead of the number of covered pairs. A simple strategy known as the greedy set-cover algorithm [5] is almost the same as GREEDY SGL. However, these studies focus on special classes of Boolean functions (e.g., disjunctions of literals), whereas this paper studies average-case behavior for most Boolean functions. The WINNOW algorithm [11] and its variants are known as powerful methods for feature selection [5]. These algorithms do not use the greedy strategy. Instead, feature-weighting methods using multiplicative updating schemes are employed. Gamberger proposed the ILLM algorithm for learning generalized CNF/DNF descriptions [7]. It also uses pairs of positive examples and negative examples, though these are maintained using two tables. The ILLM algorithm consists of several steps, some of which employ greedy-like strategies. One of the important features of the ILLM algorithm is that it can output generalized CNF/DNF descriptions, whereas GREEDY SGL (or GREEDY DBL) outputs only a set of variables. The ILLM algorithm and its variants contain procedures for eliminating noisy examples [7,8], whereas our algorithms do not explicitly handle noisy examples. Another important feature of the ILLM approach is that it can be used with logic programming [10], by which it is possible to generate logic programs from examples. GREEDY SGL (or GREEDY DBL) may be less practical than the ILLM algorithm since it does not output Boolean expressions. However, it seems difficult to make a theoretical analysis of the average-case behavior of the ILLM algorithm because it is more complex than GREEDY SGL. From a practical viewpoint, it might be useful to combine the ideas in the ILLM algorithm with GREEDY SGL.
8 Concluding Remarks
In this paper, we proved that a simple greedy algorithm (GREEDY SGL) can find the minimum set of relevant attributes with high probability for most Boolean functions if examples are generated uniformly at random. The assumption on the distribution of examples is too strong. However, it is expected that GREEDY SGL will work well if the distribution is close to uniform. Even in the worst case (i.e., when no assumption is put on the distribution of examples), it is guaranteed that GREEDY SGL outputs a set of attributes whose size is O(log n) times larger than the optimal [2]. Though a noise-free model is assumed in this paper, previous experimental results suggest that GREEDY SGL is still effective in noisy cases [4]. Experimental results on variants of GREEDY SGL [6] also suggest that the greedy-based approach is useful in practice.

Acknowledgments. We would like to thank Prof. Toshihide Ibaraki of Kyoto University for letting us know of related work [6]. This work is partially supported by a Grant-in-Aid for Scientific Research on Priority Areas (C) "Genome Information Science" from the Ministry of Education, Science, Sports and Culture of Japan.
References

1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining Association Rules between Sets of Items in Large Databases. In: Proc. SIGMOD Conference 1993, Washington, D.C. (1993) 207–216
2. Akutsu, T., Bao, F.: Approximating Minimum Keys and Optimal Substructure Screens. In: Proc. COCOON 1996. Lecture Notes in Computer Science, Vol. 1090. Springer-Verlag, Berlin Heidelberg New York (1996) 290–299
3. Akutsu, T., Miyano, S., Kuhara, S.: Identification of Genetic Networks from a Small Number of Gene Expression Patterns Under the Boolean Network Model. In: Proc. Pacific Symposium on Biocomputing (1999) 17–28
4. Akutsu, T., Miyano, S., Kuhara, S.: A simple greedy algorithm for finding functional relations: efficient implementation and average case analysis. Theoretical Computer Science 292 (2003) 481–495. A preliminary version appeared in DS 2000 (LNCS 1967)
5. Blum, A., Langley, P.: Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence 97 (1997) 245–271
6. Boros, E., Horiyama, T., Ibaraki, T., Makino, K., Yagiura, M.: Finding Essential Attributes from Binary Data. Annals of Mathematics and Artificial Intelligence 39 (2003) 223–257
7. Gamberger, D.: A Minimization Approach to Propositional Inductive Learning. In: Proc. ECML 1995. Lecture Notes in Computer Science, Vol. 912. Springer-Verlag, Berlin Heidelberg New York (1995) 151–160
8. Gamberger, D., Lavrac, N.: Conditions for Occam's Razor Applicability and Noise Elimination. In: Proc. ECML 1997. Lecture Notes in Computer Science, Vol. 1224. Springer-Verlag, Berlin Heidelberg New York (1997) 108–123
9. Kearns, M.J., Vazirani, U.V.: An Introduction to Computational Learning Theory. The MIT Press (1994)
10. Lavrac, N., Gamberger, D., Jovanoski, V.: A Study of Relevance for Learning in Deductive Databases. Journal of Logic Programming 40 (1999) 215–249
11. Littlestone, N.: Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning 2 (1987) 285–318
12. Mannila, H., Raiha, K.-J.: On the Complexity of Inferring Functional Dependencies. Discrete Applied Mathematics 40 (1992) 237–243
13. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge Univ. Press (1995)
14. Pagallo, G., Haussler, D.: Boolean Feature Discovery in Empirical Learning. Machine Learning 5 (1990) 71–99
15. Tsai, C.-C., Marek-Sadowska, M.: Boolean Matching Using Generalized Reed-Muller Forms. In: Proc. Design Automation Conference (1994) 339–344
16. Vazirani, V.V.: Approximation Algorithms. Springer, Berlin (2001)
Performance Evaluation of Decision Tree Graph-Based Induction

Warodom Geamsakul, Takashi Matsuda, Tetsuya Yoshida, Hiroshi Motoda, and Takashi Washio

Institute of Scientific and Industrial Research, Osaka University
8-1 Mihogaoka, Ibaraki, Osaka 567-0047, JAPAN
{warodom,matsuda,yoshida,motoda,washio}@ar.sanken.osaka-u.ac.jp
Abstract. A machine learning technique called Decision tree Graph-Based Induction (DT-GBI) constructs a classifier (decision tree) for graph-structured data, which are usually not explicitly expressed with attribute-value pairs. Substructures (patterns) are extracted at each node of a decision tree by stepwise pair expansion (pairwise chunking) in GBI and they are used as attributes for testing. DT-GBI is efficient since GBI is used to extract patterns by greedy search, and the obtained result (decision tree) is easy to understand. However, experiments on a DNA dataset from the UCI repository revealed that the predictive accuracy of the classifier constructed by DT-GBI was not high enough compared with other approaches. Improvements are made to its predictive accuracy, and the performance evaluation of the improved DT-GBI on the DNA dataset is reported. The predictive accuracy of a decision tree is affected by which attributes (patterns) are used and how it is constructed. To extract sufficiently discriminative patterns, the search capability is enhanced by incorporating a beam search into the pairwise chunking within the greedy search framework. Pessimistic pruning is incorporated to avoid overfitting to the training data. Experiments using a DNA dataset were conducted to see the effect of the beam width, the number of chunking at each node of a decision tree, and the pruning. The results indicate that DT-GBI, which does not use any prior domain knowledge, can construct a decision tree that is comparable to other classifiers constructed using the domain knowledge.
1 Introduction
In recent years a lot of chemical compounds have been newly synthesized and some compounds can be harmful to human bodies. However, the evaluation of compounds by experiments requires a large amount of expenditure and time. Since the characteristics of compounds are highly correlated with their structure, we believe that predicting the characteristics of chemical compounds from their structures is worth attempting and technically feasible. Since structure is represented by proper relations and a graph can easily represent relations, knowledge discovery from graph-structured data poses a general problem for
mining from structured data. Some other examples amenable to graph mining are finding typical web browsing patterns, identifying typical substructures of chemical compounds, finding typical subsequences of DNA and discovering diagnostic rules from patient history records.
Graph-Based Induction (GBI) [10,3], on which DT-GBI is based, discovers typical patterns in general graph-structured data by recursively chunking two adjoining nodes. It can handle graph data having loops (including self-loops) with colored/uncolored nodes and links. There can be more than one link between any two nodes. GBI is very efficient because of its greedy search. GBI does not lose any information of the graph structure after chunking, and it can use various evaluation functions in so far as they are based on frequency. It is not, however, suitable for graph-structured data where many nodes share the same label, because of its greedy recursive chunking without backtracking, but it is still effective in extracting patterns from graph-structured data where each node has a distinct label (e.g., World Wide Web browsing data) or where some typical structures exist even if some nodes share the same labels (e.g., chemical structure data containing benzene rings etc.).
On the other hand, besides extracting patterns from data, the decision tree construction method [6,7] is a widely used technique for data classification and prediction. One of its advantages is that rules, which are easy to understand, can be induced. Nevertheless, to construct decision trees it is usually required that data are represented by or transformed into attribute-value pairs. However, it is not trivial to define proper attributes for graph-structured data beforehand.
We have proposed a method called Decision tree Graph-Based Induction (DT-GBI), which constructs a classifier (decision tree) for graph-structured data while constructing the attributes during the course of tree building using GBI recursively, and did a preliminary performance evaluation [9]. A pair extracted by GBI, consisting of nodes and the links among them (repeated chunking of pairs results in a subgraph structure), is treated as an attribute, and the existence/non-existence of the pair in a graph is treated as its value for the graph. Thus, attributes (pairs) that divide the data effectively are extracted by GBI while a decision tree is being constructed. To classify unseen graph-structured data by the constructed decision tree, attributes that appear in the nodes of the tree are produced from the data before the classification. However, experiments using a DNA dataset from the UCI repository revealed that the predictive accuracy of decision trees constructed by DT-GBI was not high compared with other approaches.
In this paper we first report the improvements made to DT-GBI to increase its predictive accuracy by incorporating 1) a beam search and 2) pessimistic pruning. After that, we report the performance evaluation of the improved DT-GBI through experiments using a DNA dataset from the UCI repository and show that the results are comparable to the results that are obtained by using domain knowledge [8]. Section 2 briefly describes the framework of DT-GBI and Section 3 describes the improvements made to DT-GBI. Evaluation of the improved DT-GBI is reported in Section 4. Section 5 concludes the paper with a summary of the results and the planned future work.
2 Decision Tree Graph-Based Induction

2.1 Graph-Based Induction Revisited
GBI employs the idea of extracting typical patterns by stepwise pair expansion as shown in Fig. 1. In the original GBI an assumption is made that typical patterns represent some concepts/substructure and “typicality” is characterized by the pattern’s frequency or the value of some evaluation function of its frequency. We can use statistical indices as an evaluation function, such as frequency itself, Information Gain [6], Gain Ratio [7] and Gini Index [2], all of which are based on frequency. In Fig. 1 the shaded pattern consisting of nodes 1, 2, and 3 is thought typical because it occurs three times in the graph. GBI first finds the 1→3 pairs based on its frequency, chunks them into a new node 10, then in the next iteration finds the 2→10 pairs, chunks them into a new node 11. The resulting node represents the shaded pattern.
Fig. 1. The basic idea of the GBI method
It is possible to extract typical patterns of various sizes by repeating the above three steps. Note that the search is greedy. No backtracking is made. This means that in enumerating pairs no pattern which has been chunked into one node is restored to the original pattern. Because of this, all the "typical patterns" that exist in the input graph are not necessarily extracted. The problem of extracting all the isomorphic subgraphs is known to be NP-complete. Thus, GBI aims at extracting only meaningful typical patterns of certain sizes. Its objective is not finding all the typical patterns nor finding all the frequent patterns. As described earlier, GBI can use any criterion that is based on the frequency of paired nodes. However, for finding a pattern that is of interest any of its subpatterns must be of interest because of the nature of repeated chunking.
In Fig. 1 the pattern 1→3 must be typical for the pattern 2→10 to be typical. Said differently, unless pattern 1→3 is chunked, there is no way of finding the pattern 2→10. The frequency measure satisfies this monotonicity. However, if the criterion chosen does not satisfy this monotonicity, repeated chunking may not find good patterns even though the best pair based on the criterion is selected at each iteration. To resolve this issue GBI was improved to use two criteria, one a frequency measure for chunking and the other for finding discriminative patterns after chunking. The latter criterion does not necessarily hold the monotonicity property. Any function that is discriminative can be used, such as Information Gain [6], Gain Ratio [7] and Gini Index [2], and some others.

GBI(G)
  Enumerate all the pairs P_all in G
  Select a subset P of pairs from P_all (all the pairs in G) based on typicality criterion
  Select a pair from P_all based on chunking criterion
  Chunk the selected pair into one node c
  G_c := contracted graph of G
  while termination condition not reached
    P := P ∪ GBI(G_c)
  return P

Fig. 2. Algorithm of GBI
The improved stepwise pair expansion algorithm is summarized in Fig. 2. It repeats the following four steps until the chunking threshold is reached (normally a minimum support value is used as the stopping criterion).
Step 1. Extract all the pairs consisting of two connected nodes in the graph.
Step 2a. Select all the typical pairs based on the typicality criterion from among the pairs extracted in Step 1, rank them according to the criterion and register them as typical patterns. If either or both nodes of the selected pairs have already been rewritten (chunked), they are restored to the original patterns before registration.
Step 2b. Select the most frequent pair from among the pairs extracted in Step 1 and register it as the pattern to chunk. If either or both nodes of the selected pair have already been rewritten (chunked), they are restored to the original patterns before registration. Stop when there is no more pattern to chunk.
Step 3. Replace the selected pair in Step 2b with one node and assign a new label to it. Rewrite the graph by replacing all the occurrences of the selected pair with a node with the newly assigned label. Go back to Step 1.
The output of the improved GBI is the set of ranked typical patterns extracted at Step 2a. These patterns are typical in the sense that they are more discriminative than non-selected patterns in terms of the criterion used.
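To make the chunking operation concrete, here is a much simplified Python sketch of a single chunking step on a node-labelled directed graph (our own simplification: edge labels, the typicality ranking of Step 2a and the termination test are omitted).

from collections import Counter

def chunk_most_frequent_pair(edges, labels):
    # One chunking step: the most frequent (label(u), label(v)) pair is merged
    # into a new node with a fresh label; non-overlapping occurrences of that
    # pair are rewritten. labels is updated in place.
    counts = Counter((labels[u], labels[v]) for u, v in edges)
    if not counts:
        return edges, labels, None
    pair, _ = counts.most_common(1)[0]
    new_label = f"<{pair[0]}->{pair[1]}>"
    merged = {}                                   # old node id -> chunk node id
    for u, v in edges:
        if (labels[u], labels[v]) == pair and u not in merged and v not in merged:
            merged[u] = merged[v] = u             # reuse u as the chunk's node id
            labels[u] = new_label
    new_edges = [(merged.get(u, u), merged.get(v, v)) for u, v in edges]
    new_edges = [(u, v) for u, v in new_edges if u != v]   # drop edges inside a chunk
    return new_edges, labels, pair

edges = [(1, 3), (2, 1), (4, 3), (5, 4), (2, 5)]
labels = {1: "a", 2: "b", 3: "c", 4: "a", 5: "b"}
print(chunk_most_frequent_pair(edges, labels))   # the (a, c) pair becomes node '<a->c>'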
DT-GBI(D)
  Create a node DT for D
  if termination condition reached
    return DT
  else
    P := GBI(D) (with the number of chunking specified)
    Select a pair p from P
    Divide D into D_y (with p) and D_n (without p)
    Chunk the pair p into one node c
    D_yc := contracted data of D_y
    for D_i := D_yc, D_n
      DT_i := DT-GBI(D_i)
      Augment DT by attaching DT_i as its child along yes(no) branch
  return DT

Fig. 3. Algorithm of DT-GBI
2.2 Feature Construction by GBI
Since the representation of a decision tree is easy to understand, it is often used as the representation of a classifier for data which are expressed as attribute-value pairs. On the other hand, graph-structured data are usually expressed as nodes and links, and there are no obvious components which correspond to attributes and their values. Thus, it is difficult to construct a decision tree for graph-structured data in a straightforward manner. To cope with this issue, we regard the existence of a subgraph in a graph as an attribute, so that graph-structured data can be represented with attribute-value pairs according to the existence of particular subgraphs. However, it is difficult to identify and extract beforehand those subgraphs which are effective for the classification task. If pairs are extended in a stepwise fashion by GBI and discriminative ones are selected and further extended while constructing a decision tree, discriminative patterns (subgraphs) can be constructed simultaneously during the construction of the decision tree. In our approach, attributes and their values are defined as follows:
– attribute: a pair in the graph-structured data.
– value for an attribute: existence/non-existence of the pair in a graph.
When constructing a decision tree, all the pairs in the data are enumerated and one pair is selected. The data (graphs) are divided into two groups, namely, the one with the pair and the other without the pair. The selected pair is then chunked in the former graphs, and these graphs are rewritten by replacing all the occurrences of the selected pair with a new node. This process is applied recursively at each node of the decision tree, and a decision tree is constructed while the attributes (pairs) for the classification task are created on the fly. The algorithm of DT-GBI is summarized in Fig. 3. Since the value for an attribute is yes (contains the pair) or no (does not contain the pair), the constructed decision tree is represented as a binary tree.
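The following small sketch (our own helper names and toy data) illustrates how the existence of a pair acts as a binary attribute and how a candidate pair can be scored by information gain, the measure used by DT-GBI in the experiments reported later; the subsequent chunking of the pair inside the "yes" group is omitted here.

import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)) if c)

def split_by_pair(graphs, classes, pair):
    # graphs:  list of sets of (label_u, label_v) pairs occurring in each graph
    # classes: class label of each graph
    # The existence of `pair` is the binary attribute; return the two groups
    # and the information gain of the split.
    yes = [c for g, c in zip(graphs, classes) if pair in g]
    no = [c for g, c in zip(graphs, classes) if pair not in g]
    n = len(classes)
    gain = entropy(classes) - (len(yes) / n) * entropy(yes) - (len(no) / n) * entropy(no)
    return yes, no, gain

graphs = [{("a", "c"), ("b", "a")}, {("a", "c")}, {("b", "b")}, {("b", "a")}]
classes = ["active", "active", "inactive", "inactive"]
print(split_by_pair(graphs, classes, ("a", "c"))[2])   # information gain = 1.0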
The proposed method has the characteristic of constructing the attributes (pairs) for the classification task on-line while constructing a decision tree. Each time an attribute (pair) is selected to divide the data, the pair is chunked into a larger node. Thus, although the initial pairs consist of two nodes and the link between them, attributes useful for the classification task are gradually grown into larger pairs (subgraphs) by applying chunking recursively. In this sense the proposed DT-GBI method can be conceived of as a method for feature construction, since features, namely attributes (pairs) useful for the classification task, are constructed during the application of DT-GBI.
3 Enhancement of DT-GBI

3.1 Beam Search for Expanding Search Space
Since the search in GBI is greedy and no backtracking is made, which patterns are extracted by GBI depends on which pair is selected for chunking in Fig. 3. Thus, there can be many patterns which are not extracted by GBI. To relax this problem, a beam search is incorporated into GBI within the framework of greedy search [4] to extract more discriminative patterns. A certain fixed number of pairs ranked from the top are allowed to be chunked individually in parallel. To prevent each branch from growing exponentially, the total number of pairs to chunk is fixed at each level of branching. Thus, at any iteration step, there is always a fixed number of chunkings that are performed in parallel.
An example of state transition with beam search is shown in Fig. 4 for the case where the beam width is 5. The initial condition is the single state cs. All pairs in cs are enumerated and ranked according to both the frequency measure and the typicality measure. The top 5 pairs according to the frequency measure are selected, and each of them is used as a pattern to chunk, branching into 5 children c11, c12, . . . , c15, each rewritten by the chunked pair. All pairs within these 5 states are enumerated and ranked according to the two measures, and again the top 5 ranked pairs according to the frequency measure are selected. The state c11 is split into two states c21 and c22 because two pairs are selected, but the state c12 is deleted because no pair is selected. This is repeated until the stopping condition is satisfied. The increase in the search space improves the pattern extraction capability of GBI and thus that of DT-GBI.

Fig. 4. An example of state transition with beam search when the beam width = 5
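The control structure of this beam search can be sketched generically; the following Python skeleton (our own abstraction, with toy expand/score functions) keeps only the beam_width best successor states over all current states at every level.

def beam_chunk(initial_state, expand, score, beam_width, max_levels):
    # expand(state) yields candidate successor states (one per chunked pair);
    # score(state) is the chunking criterion (e.g. frequency of the chunked pair).
    # At every level only the beam_width best successors over all current
    # states survive, so the total amount of parallel chunking stays fixed.
    states = [initial_state]
    for _ in range(max_levels):
        candidates = [s for state in states for s in expand(state)]
        if not candidates:
            break
        states = sorted(candidates, key=score, reverse=True)[:beam_width]
    return states

# toy usage: states are strings, "chunking" appends a symbol, score prefers 'a'
expand = lambda s: [s + ch for ch in "ab"]
score = lambda s: s.count("a")
print(beam_chunk("", expand, score, beam_width=3, max_levels=4))   # ['aaaa', 'aaab', 'aaba']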
3.2 Pruning Decision Tree
Recursive partitioning of the data until each subset in the partition contains data of a single class often results in overfitting to the training data and thus degrades the predictive accuracy of decision trees. To avoid overfitting, in our previous approach [9] a very naive prepruning method was used by setting the termination condition in DT-GBI in Fig. 3 to whether the number of graphs in D is equal to or less than 10. On the other hand, a more sophisticated postpruning method is used in C4.5 [7] (called "pessimistic pruning"): an overfitted tree is grown first and then pruned to improve the predictive accuracy, based on the confidence interval for the binomial distribution. To improve predictive accuracy, the pessimistic pruning of C4.5 is incorporated into DT-GBI by adding a step for postpruning in Fig. 3.
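C4.5's pessimistic estimate replaces the observed error rate of a leaf by the upper limit of a binomial confidence interval at the chosen confidence level; the sketch below (our own code, computing the limit by bisection rather than C4.5's closed-form approximation) illustrates the idea.

from math import comb

def binom_cdf(e, n, p):
    # P(X <= e) for X ~ Binomial(n, p)
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(e + 1))

def pessimistic_error(e, n, cf=0.25):
    # Upper confidence limit on the true error rate of a leaf that
    # misclassifies e of its n training cases, found by bisection on p
    # such that P(X <= e | n, p) = cf.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return lo

# a leaf with 0 errors out of 6 cases still gets a non-zero pessimistic error
print(round(pessimistic_error(0, 6), 3))   # about 0.206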
4 Performance Evaluation of DT-GBI
The proposed method is tested using a DNA dataset from the UCI Machine Learning Repository [1]. A promoter is a genetic region which initiates the first step in the expression of an adjacent gene (transcription). The promoter dataset consists of strings that represent nucleotides (one of A, G, T, or C). The input features are 57 sequential DNA nucleotides and the total number of instances is 106, including 53 positive instances (sample promoter sequences) and 53 negative instances (non-promoter sequences). This dataset was explained and analyzed in [8]. The data are so prepared that each sequence of nucleotides is aligned at a reference point, which makes it possible to assign the n-th attribute to the n-th nucleotide in the attribute-attribute value representation. In a sense, this dataset is encoded using domain knowledge. This is confirmed by the following experiment. Running C4.5 [7] gives a prediction error of 16.0% by leave-one-out cross validation. Randomly shifting the sequence by 3 elements gives 21.7% and by 5 elements 44.3%. If the data are not properly aligned, standard classifiers such as C4.5 that use the attribute-attribute value representation do not solve this problem, as shown in Fig. 5.
Example sequences before and after random shifting: aacgtcgattagccgat, gtccatggtcaagtccg, tccaggtgcagtcatgc

                                 Prediction error (C4.5, LVO)
Original data                    16.0%
Shift randomly by ≤ 1 element    16.0%
Shift randomly by ≤ 2 elements   21.7%
Shift randomly by ≤ 3 elements   26.4%
Shift randomly by ≤ 5 elements   44.3%
Fig. 5. Change of error rate by shifting the sequence in the promoter dataset
One of the advantages of the graph representation is that it does not require the data to be aligned at a reference point. In our approach, each sequence is converted to a graph representation assuming that an element interacts with up to 10 elements on both sides (see Fig. 6; a small code sketch of this conversion is given after this paragraph). Each sequence thus results in a graph with 57 nodes and 515 lines. Note that a sequence is represented as a directed graph, since it is known from the domain knowledge that the influence between nucleotides is directed.

Fig. 6. Conversion of DNA sequence data to a graph

In the experiment, frequency was used to select a pair to chunk in GBI, and information gain [6] was used in DT-GBI as the typicality measure to select a pair from the pairs returned by GBI. A decision tree was constructed in either of the following two ways: 1) apply chunking nr times only at the root node and only once at the other nodes of the decision tree, or 2) apply chunking ne times at every node of the decision tree. Note that nr and ne are defined along the depth in Fig. 4; thus, more chunking takes place during the search when the beam width is larger. The pair (subgraph) selected for each node of the decision tree is the one which maximizes the information gain among all the enumerated pairs. Pruning of the decision tree was conducted either by prepruning (setting the termination condition of DT-GBI in Fig. 3 to whether the number of graphs in D is equal to or less than 10) or by postpruning (conducting the pessimistic pruning of Subsection 3.2 with the confidence level set to 25%). The beam width was changed from 1 to 15. The prediction error rate of a decision tree constructed by DT-GBI was evaluated by the average of 10 runs of 10-fold cross-validation in both experiments.

The first experiment focused on the effect of the number of chunkings at each node of a decision tree; the beam width was therefore set to 1 and prepruning was used. The parameters nr and ne were changed from 1 to 10 in accordance with 1) and 2) above, respectively. Fig. 7 shows the result of the experiments. In this figure the dotted line indicates the error rate for 1) and the solid line for 2). The best error rate was 8.11% when nr = 5 for 1) and 7.45% when ne = 3 for 2). The corresponding induced decision trees for all 106 instances are shown in Fig. 8 (nr = 5) and Fig. 9 (ne = 3). The decrease of the error rate levels off as the number of chunkings increases for both 1) and 2). The result shows that repeated application of chunking at every node results in a decision tree with better predictive accuracy.
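The conversion just described can be sketched as follows. Only the counts (57 nodes and 515 directed lines for a 57-nucleotide sequence with interactions spanning up to 10 positions) come from the text; the node and edge representations are illustrative choices.

```python
def sequence_to_graph(seq, max_span=10):
    """Convert a nucleotide string into a directed graph: nodes are
    (position, base) and a directed edge labelled with its distance connects
    each position to the positions up to `max_span` places downstream."""
    nodes = [(i, base) for i, base in enumerate(seq)]
    edges = [(i, j, j - i)                            # (from, to, distance label)
             for i in range(len(seq))
             for j in range(i + 1, min(i + max_span, len(seq) - 1) + 1)]
    return nodes, edges

# A 57-nucleotide sequence yields 57 nodes and 515 directed lines.
nodes, edges = sequence_to_graph("a" * 57)
assert len(nodes) == 57 and len(edges) == 515
```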
Fig. 7. Result of experiment (beam width = 1, without pessimistic pruning): error rate (%) versus the number of chunkings at a node, for chunking at the root node only and at every node
Fig. 8. Example of constructed decision tree (chunking applied 5 times only at the root node, beam width = 1, with prepruning)
The second experiment focused on the effect of the beam width, changing its value from 1 to 15 and using pessimistic pruning. The number of chunkings was fixed at the best value determined in the first experiment (Fig. 7), namely nr = 5 for 1) and ne = 3 for 2). The result is summarized in Fig. 12. The best error rate was 4.43% when the beam width was 12 for 1) (nr = 5) and 3.68% when the beam width was 11 for 2) (ne = 3). The corresponding induced decision trees for all 106 instances are shown in Fig. 10 and Fig. 11. Fig. 13 shows yet another result when prepruning was used (beam width = 8, ne = 3). The error rate reported in [8] is 3.8% (they also used 10-fold cross-validation), obtained by the M-of-N expression rules extracted from KBANN (Knowledge Based Artificial Neural Network). The obtained M-of-N rules are
Fig. 9. Example of constructed decision tree (chunking applied 3 times at every node, beam width = 1, with prepruning)
Fig. 10. Example of constructed decision tree (chunking applied 5 times only at the root node, beam width = 12, with pessimistic pruning)
too complicated and not easy to interpret. Since KBANN uses domain knowledge to configure the initial artificial neural network, it is worth mentioning
Fig. 11. Example of constructed decision tree (chunking applied 3 times at every node, beam width = 8, with pessimistic pruning)
Fig. 12. Result of experiment (with pessimistic pruning): error rate (%) versus GBI beam width, for chunking at the root node only (up to 5 times) and at every node (up to 3 times)
that DT-GBI, which does not use any domain knowledge, induced a decision tree with comparable predictive accuracy. Comparing the decision trees in Figs. 10 and 11, the trees are not stable. Both give similar predictive accuracy, but the patterns in the decision nodes are not the same. According to [8], there are many pieces of domain knowledge, and the rule conditions are expressed by various combinations of these pieces. Among these many pieces of knowledge, the pattern (a → a → a → a) in the second node in Fig. 10 and the pattern (a → a → t → t) in the root node in Fig. 11 match their domain knowledge, but the
Fig. 13. Result of experiment (with prepruning): error rate (%) versus GBI beam width, for chunking at the root node only (up to 5 times) and at every node (up to 3 times)
others do not match. We have assumed that two nucleotides that are more than 10 nodes apart are not directly correlated; thus, the extracted patterns have no direct links longer than 9. It is interesting to note that the first node in Fig. 10 relates two pairs (g → a) that are 7 nodes apart as a discriminatory pattern. Indeed, all the sequences having this pattern are concluded to be non-promoters from the data. It is not clear at this stage whether DT-GBI can extract the domain knowledge or not; the data size is too small to make any strong claims. [4,5] report another approach to constructing a decision tree for the promoter dataset: the patterns (subgraphs) extracted by B-GBI, which incorporates beam search into GBI to enhance its search capability, were treated as attributes, and C4.5 was used to construct a decision tree. The best reported error rate with 10-fold cross-validation is 6.3% in [5], using the patterns extracted by B-GBI (beam width = 2) and C4.5 (2.8% with leave-one-out (LVO) is also reported, but the results in [5] indicate that LVO tends to reduce the error rate compared with 10-fold cross-validation). On the other hand, the best prediction error rate of DT-GBI is 3.68% (much better than the 6.3% above), obtained when chunking was applied 3 times at each node, with beam width = 8 and pessimistic pruning. The result is also comparable to the 3.8% obtained by KBANN using the M-of-N expression [8].
5 Conclusion
This paper reports improvements made to DT-GBI, which constructs a classifier (decision tree) for graph-structured data by means of GBI. To classify graph-structured data, attributes, namely substructures useful for the classification task, are constructed by applying chunking in GBI on the fly while the decision tree is being constructed. The predictive accuracy of DT-GBI is improved by incorporating 1) a beam search and 2) pessimistic pruning. The performance evaluation of the improved DT-GBI is reported through experiments on a classification problem of DNA promoter sequences from the UCI repository, and the results show that
DT-GBI is comparable to another method that uses domain knowledge in modeling the classifier. Immediate future work includes incorporating a more sophisticated method for determining the number of cycles in which GBI is called at each node in order to improve prediction accuracy. Utilizing the rate of change of the information gain over successive chunkings is a possible way to determine this number automatically. Another important direction is to explore how partial domain knowledge can be effectively incorporated to constrain the search space. DT-GBI is currently being applied to a much larger medical dataset. Acknowledgment. This work was partially supported by the grant-in-aid for scientific research 1) on priority area "Active Mining" (No. 13131101, No. 13131206) and 2) No. 14780280 funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology.
References
1. C. L. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
2. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
3. T. Matsuda, T. Horiuchi, H. Motoda, and T. Washio. Extension of graph-based induction for general graph structured data. In Knowledge Discovery and Data Mining: Current Issues and New Applications, Springer Verlag, LNAI 1805, pages 420–431, 2000.
4. T. Matsuda, H. Motoda, T. Yoshida, and T. Washio. Knowledge discovery from structured data by beam-wise graph-based induction. In Proc. of the 7th Pacific Rim International Conference on Artificial Intelligence, Springer Verlag, LNAI 2417, pages 255–264, 2002.
5. T. Matsuda, T. Yoshida, H. Motoda, and T. Washio. Mining patterns from structured data by beam-wise graph-based induction. In Proc. of the Fifth International Conference on Discovery Science, pages 422–429, 2002.
6. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
7. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
8. G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71–101, 1993.
9. G. Warodom, T. Matsuda, T. Yoshida, H. Motoda, and T. Washio. Classifier construction by graph-based induction for graph-structured data. In Advances in Knowledge Discovery and Data Mining, Springer Verlag, LNAI 2637, pages 52–62, 2003.
10. K. Yoshida and H. Motoda. CLIP: Concept learning from inference patterns. Journal of Artificial Intelligence, 75(1):63–92, 1995.
Discovering Ecosystem Models from Time-Series Data

Dileep George1, Kazumi Saito2, Pat Langley1, Stephen Bay1, and Kevin R. Arrigo3

1 Computational Learning Laboratory, CSLI, Stanford University, Stanford, California 94305 USA
{dil,langley,sbay}@apres.stanford.edu
2 NTT Communication Science Laboratories, 2-4 Hikaridai, Seika, Soraku, Kyoto 619-0237 Japan
[email protected]
3 Department of Geophysics, Mitchell Building, Stanford University, Stanford, CA 94305 USA
[email protected]

Abstract. Ecosystem models are used to interpret and predict the interactions of species and their environment. In this paper, we address the task of inducing ecosystem models from background knowledge and time-series data, and we review IPM, an algorithm that addresses this problem. We demonstrate the system's ability to construct ecosystem models on two different Earth science data sets. We also compare its behavior with that produced by a more conventional autoregression method. In closing, we discuss related work on model induction and suggest directions for further research on this topic.
1 Introduction and Motivation
Ecosystem models aim to simulate the behavior of biological systems as they respond to environmental factors. Such models typically take the form of algebraic and differential equations that relate continuous variables, often through feedback loops. The qualitative relationships are typically well understood, but there is frequently ambiguity about which functional forms are appropriate and even less certainty about the precise parameters. Moreover, the space of candidate models is too large for human scientists to examine manually in any systematic way. Thus, computational methods that can construct and parameterize ecosystem models should prove useful to Earth scientists in explaining their data. Unfortunately, most existing methods for knowledge discovery and data mining cast their results as decision trees, rules, or some other notation devised by computer scientists. These techniques can often induce models with high predictive accuracy, but they are seldom interpretable by scientists, who are used to different formalisms. Methods for equation discovery produce knowledge in forms that are familiar to Earth scientists, but most generate descriptive models rather than explanatory ones, in that they contain no theoretical terms and make little contact with background knowledge.
In this paper, we present an approach to discovering dynamical ecosystem models from time-series data and background knowledge. We begin by describing IPM, an algorithm for inducing process models that, we maintain, should be interpretable by Earth scientists. After this, we demonstrate IPM’s capabilities on two modeling tasks, one involving data on a simple predator-prey ecosystem and another concerning more complex data from the Antarctic ocean. We close with a discussion of related work on model discovery in scientific domains and prospects for future research on the induction of ecosystem models.
2 An Approach to Inductive Process Modeling
As described above, we are interested in computational methods that can discover explanatory models for the observed behavior of ecosystems. In an earlier paper (Langley et al., in press), we posed the task of inducing process models from time-series data and presented an initial algorithm for addressing this problem. We defined a quantitative process model as a set of processes, each specifying one or more algebraic or differential equations that denote causal relations among variables, along with optional activation conditions. At least two of the variables must be observed, but a process model can also include unobserved theoretical terms. The IPM algorithm generates process models of this sort from training data about observable variables and background knowledge about the domain. This knowledge includes generic processes that have a form much like those in models, in that they relate variables with equations and may include conditions. The key differences are that a generic process does not commit to specific variables, although it constrains their types, and it does not commit to particular parameter values, although it limits their allowed ranges. Generic processes are the building blocks from which the system constructs its specific models. More specifically, the user provides IPM with three inputs that guide its discovery efforts: 1. A set of generic processes, including constraints on variable types and parameter values; 2. A set of specific variables that should appear in the model, including their names and types; 3. A set of observations for two or more of the variables as they vary over time. In addition, the system requires three control parameters: the maximum number of processes allowed in a model, the minimum number of processes, and the number of times each generic process can occur. Given this information, the system first generates all instantiations of generic processes with specific variables that are consistent with the type constraints. After this, it finds all ways to combine these instantiated processes to form instantiated models that have acceptable numbers of processes. The resulting models refer to specific variables, but their parameters are still unknown. Next, IPM uses a nonlinear optimization routine to determine these parameter values. Finally, the system selects and
returns the candidate that produces the smallest squared error on the training data, modulated by a minimum description length criterion. The procedure for generating all acceptable model structures is straightforward, but the method for parameter optimization deserves some discussion. The aim is to find, for each model structure, parameters that minimize the model’s squared predictive error on the observations. We have tried a number of standard optimization algorithms, including Newton’s method and the LevenbergMarquardt method, but we have found these techniques encounter problems with convergence and local optima. In response, we designed and implemented our own parameter-fitting method, which has given us the best results to date. A nonlinear optimization algorithm attempts to find a set of parameters Θ that minimizes an objective function E(Θ). In our case, we define E as the squared error between the observed and predicted time series: E(Θ) =
∑_{t=1}^{T} ∑_{j=1}^{J} ( ln(x_j^o(t)) − ln(x_j(t)) )² ,    (1)
where x_j^o and x_j represent the observed and predicted values of the J observed variables, t denotes time instants, and ln(·) is the natural logarithmic function. Standard least-squares estimation is widely recognized as relatively brittle with respect to outliers in samples that contain gross error. Instead, as shown in Equation (1), we minimize the sum of squared differences between logarithmically transformed variables, which is one approach to robust estimation proposed by Box and Cox (1964). In addition, we maintain positivity constraints on process variables by performing a logarithmic transformation on the differential equations in which they appear. Predicted values for x_j are obtained by solving finite-difference approximations of the differential equations specified in the model. The parameter vector Θ incorporates all unknowns, including any initial conditions for unobserved variables needed to solve the differential equations. In order to minimize our error function, E, defined as a sum of squared errors, we can calculate its gradient vector with respect to a parameter vector. For this purpose, we borrowed the basic idea of error backpropagation through time (Rumelhart, Hinton, & Williams, 1986), frequently used for learning in recurrent neural networks. However, the task of process model induction required us to extend this method to support the many different functional forms that can occur. Our current solution relies on hand-crafted derivatives for each generic process, but it utilizes the additive nature of process models to retain the modularity of backpropagation and its compositional character. These in turn let the method carry out gradient search to find parameters for each model structure. Given a model structure and its corresponding backpropagation equations, our parameter-fitting algorithm carries out a second-order gradient search (Saito & Nakano, 1997). By adopting a quasi-Newton framework (e.g., Luenberger, 1984), this calculates descent direction as a partial Broyden-Fletcher-Goldfarb-Shanno update and then calculates the step length as the minimal point of a second-order approximation. In earlier experiments on a variety of data sets, this algorithm worked quite efficiently as compared to standard gradient search
methods. Of course, this approach does not eliminate all problems with local optima; thus, for each model structure, IPM runs the parameter-fitting algorithm ten times with random initial parameter values, then selects the best result. Using these techniques, IPM overcomes many of the problems with local minima and slow convergence that we encountered in our early efforts, giving reasonable performance according to the squared error criterion. However, we anticipate that solving more complex problems will require the utilization of even more sophisticated algorithms for non-linear minimization. However, reliance on squared error as the sole optimization criterion tends to select overly complex process models that overfit the training data. Instead, IPM computes the description length of each parameterized model as the sum of its complexity and the information content of the data left unexplained by the model. We define complexity as the number of free parameters and variables in a model and the unexplained content as the number of bits needed to encode the squared error of the model. Rather than selecting the model with the lowest error, IPM prefers the candidate with the shortest description length, thus balancing model complexity against fit to the training data.
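As a concrete illustration, the objective of Equation (1) and a description-length-style model score can be written as follows. The error function follows the equation above; the description-length computation is only a rough stand-in, since the paper does not spell out the exact encoding IPM uses.

```python
import numpy as np

def log_squared_error(observed, predicted):
    """Equation (1): sum over time steps and observed variables of the
    squared difference between log-transformed observed and predicted
    values. Both arrays have shape (T, J) and must be strictly positive."""
    return float(np.sum((np.log(observed) - np.log(predicted)) ** 2))

def description_length(n_params, n_vars, squared_error, n_obs):
    """Illustrative MDL-style score: model complexity (free parameters and
    variables) plus bits to encode the residual error under a Gaussian code.
    This particular encoding is an assumption, not IPM's actual formula."""
    complexity = n_params + n_vars
    residual_bits = 0.5 * n_obs * np.log2(squared_error / n_obs + 1e-12)
    return complexity + residual_bits
```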
3 Modeling Predator-Prey Interaction
Now we are ready to consider IPM's operation on an ecosystem modeling task. Within Earth science, models of predator-prey systems are among the simplest in terms of the number of variables and parameters involved, making them good starting points for our evaluation. We focus here on the protozoan system composed of the predator P. aurelia and the prey D. nasutum, which is well known in population ecology. Jost and Adiriti (2000) present time-series data for this system, recovered from an earlier report by Veilleux (1976), that are now available on the World Wide Web. The data set includes measurements for the two species' populations at 12-hour intervals over 35 days, as shown in Figure 1. The data are fairly smooth over the entire period, with observations at regular intervals and several clear cycles. We decided to use these observations as an initial test of IPM's ability to induce an ecosystem model.

3.1 Background Knowledge about Predator-Prey Interaction
A scientist who wants IPM to construct explanatory models of his observations must first provide a set of generic processes that encode his knowledge of the domain. Table 1 presents a set of processes that we extracted from our reading of the Jost and Adiriti article. As illustrated, each generic process specifies a set of generic variables with type constraints (in braces), a set of parameters with ranges for their values (in brackets), and a set of algebraic or differential equations that encode causal relations among the variables (where d[X, t, 1] refers to the first derivative of X with respect to time). Each process can also include one or more conditions, although none appear in this example.
Table 1. A set of generic processes for predator-prey models.

generic process logistic growth;
  variables S{species};
  parameters ψ [0, 10], κ [0, 10];
  equations d[S, t, 1] = ψ ∗ S ∗ (1 − κ ∗ S);

generic process exponential growth;
  variables S{species};
  parameters β [0, 10];
  equations d[S, t, 1] = β ∗ S;

generic process predation volterra;
  variables S1{species}, S2{species};
  parameters π [0, 10], ν [0, 10];
  equations d[S1, t, 1] = −1 ∗ π ∗ S1 ∗ S2;
            d[S2, t, 1] = ν ∗ π ∗ S1 ∗ S2;

generic process exponential decay;
  variables S{species};
  parameters α [0, 1];
  equations d[S, t, 1] = −1 ∗ α ∗ S;

generic process predation holling;
  variables S1{species}, S2{species};
  parameters ρ [0, 1], γ [0, 1], η [0, 1];
  equations d[S1, t, 1] = −1 ∗ γ ∗ S1 ∗ S2/(1 + ρ ∗ γ ∗ S1);
            d[S2, t, 1] = η ∗ γ ∗ S1 ∗ S2/(1 + ρ ∗ γ ∗ S1);
The table shows five such generic processes. Two structures, predation holling and predation volterra, describe alternative forms of feeding; both cause the predator population to increase and the prey population to decrease, but they differ in their precise functional forms. Two additional processes – logistic growth and exponential growth – characterize the manner in which a species’ population increases in an environment with unlimited resources, again differing mainly in the forms of their equations. Finally, the exponential decay process refers to the decrease in a species’ population due to natural death. All five processes are generic in the sense that they do not commit to specific variables. For example, the generic variable S in exponential decay does not state which particular species dies when it is active. IPM must assign variables to these processes before it can utilize them to construct candidate models. Although the generic processes in Table 1 do not completely encode knowledge about predator-prey dynamics, they are adequate for the purpose of evaluating the IPM algorithm on the Veilleux data. If needed, a domain scientist could add more generic processes or remove ones that he considers irrelevant. The user is responsible for specifying an appropriate set of generic processes for a given modeling task. If the processes recruited for a particular task do not represent all the mechanisms that are active in that environment, the induced models may fit the data poorly. Similarly, the inclusion of unnecessary processes can increase computation time and heighten the chances of overfitting the data. Before the user can invoke IPM, he must also provide the system with the variables that the system should consider including in the model, along with their types. This information includes both observable variables, in this case predator and prey, both with type species, and unobservable variables, which do not arise in this modeling task. In addition, he must state the minimum acceptable number of processes (in this case one), the maximum number of processes (four), and the number of times each generic process can occur (two).
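One way to make the notion of a generic process and its instantiation concrete is sketched below. The representation (a dataclass plus a string-rewriting instantiation step) is an assumption of this sketch, not IPM's internal format; only the logistic growth definition itself is taken from Table 1.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class GenericProcess:
    name: str
    variables: Dict[str, str]                    # generic variable -> required type
    parameters: Dict[str, Tuple[float, float]]   # parameter -> allowed range
    equations: List[str]

logistic_growth = GenericProcess(
    name="logistic growth",
    variables={"S": "species"},
    parameters={"psi": (0.0, 10.0), "kappa": (0.0, 10.0)},
    equations=["d[S,t,1] = psi * S * (1 - kappa * S)"],
)

def instantiate(process, binding):
    """Bind generic variables to specific ones (e.g. S -> Prey); parameter
    values stay free until the optimization stage."""
    out = []
    for eq in process.equations:
        for generic, specific in binding.items():
            eq = re.sub(rf"\b{re.escape(generic)}\b", specific, eq)
        out.append(eq)
    return out

# instantiate(logistic_growth, {"S": "Prey"})
#   -> ['d[Prey,t,1] = psi * Prey * (1 - kappa * Prey)']
```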
Table 2. Process model induced for predator-prey interaction.
model Predator Prey;
  variables Predator, Prey;
  observables Predator, Prey;
  process exponential decay;
    equations d[Predator, t, 1] = −1 ∗ 1.1843 ∗ Predator;
  process logistic growth;
    equations d[Prey, t, 1] = 2.3049 ∗ Prey ∗ (1 − 0.0038 ∗ Prey);
  process predation volterra;
    equations d[Prey, t, 1] = −1 ∗ 0.0298 ∗ Prey ∗ Predator;
              d[Predator, t, 1] = 0.4256 ∗ 0.0298 ∗ Prey ∗ Predator;
3.2 Inducing Models for Predator-Prey Interaction
Given this information, IPM uses the generic processes in Table 1 to generate all possible model structures that relate the two species P. aurelia and D. nasutum, both of which are observed. In this case, the system produced 228 candidate structures, for each of which it invoked the parameter-fitting routine described earlier. Table 2 shows the parameterized model that the system selected from this set, which makes general biological sense. It states that, left in isolation, the prey (D. nasutum) population grows logistically, while the predator (P. aurelia) population decreases exponentially. Predation leads to more predators and to fewer prey, controlled by multiplicative equations that add 0.4256 predators for each prey that is consumed. Qualitatively, the model predicts that, when the predator population is high, the prey population is depleted at a faster rate. However, a reduction in the prey population lowers the rate of increase in the predator population, which should produce an oscillation in both populations. Indeed, Figure 1 shows that the model's predicted trajectories produce such an oscillation, with nearly the same period as that found in the data reported by Jost and Adiriti. The model produces a squared error of 18.62 on the training data and a minimum description length score of 286.68. The r² between the predicted and observed values is 0.42 for the prey and 0.41 for the predator, which indicates that the model explains a substantial amount of the observed variation.
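To see how such a parameterized model behaves, the equations of Table 2 can be iterated with a simple finite-difference scheme, as sketched below. The rate constants are those reported in Table 2; the step size and the initial populations are illustrative assumptions, not values from the paper.

```python
def simulate_predator_prey(prey0, predator0, steps, dt=0.1):
    """Finite-difference simulation of the induced model in Table 2."""
    prey, predator = prey0, predator0
    trajectory = [(prey, predator)]
    for _ in range(steps):
        d_prey = (2.3049 * prey * (1 - 0.0038 * prey)      # logistic growth
                  - 0.0298 * prey * predator)              # predation (prey loss)
        d_predator = (-1.1843 * predator                   # exponential decay
                      + 0.4256 * 0.0298 * prey * predator) # predation (predator gain)
        prey += dt * d_prey
        predator += dt * d_predator
        trajectory.append((prey, predator))
    return trajectory

# e.g. simulate_predator_prey(prey0=20.0, predator0=10.0, steps=500)
# returns a list of (prey, predator) values at successive time steps.
```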
3.3 Experimental Comparison with Autoregression
Alternative approaches to induction from time-series data, such as multivariate autoregression, do not yield the explanatory insight of process models. However, they are widely used in practice, so naturally we were interested in how the two methods compare in their predictive abilities. To this end, we ran the Matlab package ARFit (Schneider & Neumaier, 2001) on the Veilleux data to infer the structure and parameters of an autoregressive model. This uses a stepwise least-squares procedure to estimate parameters and a Bayesian criterion to select the
Fig. 1. Predicted and observed log concentrations of protozoan prey (left) and predator (right) over a period of 36 hours.
best model. For the runs reported here, we let ARFit choose the best model order from zero to five. To test the two methods’ abilities to forecast future observations, we divided the time series into successive training and test sets while varying their relative sizes. In particular, we created 35 training sets of size n = 35 . . . 69 by selecting the first n examples of the time series, each with a corresponding test set that contained all successive observations. In addition to using these training sets to induce the IPM and autoregressive models, we also used their final values to initialize simulation with these models. Later predictions were based on predicted values from earlier in the trajectory. For example, to make predictions for t = 40, both the process model and an autoregressive model of order one would utilize their predictions for t = 39, whereas an autoregressive model of order two would draw on predictions for t = 38 and t = 39. Figure 2 plots the resulting curves for the models induced by IPM, ARFit, and a constant approximator. In every run, ARFit selected a model of order one. Both IPM and autoregression have lower error than the straw man, except late in the curve, when few training cases are available. The figure also shows that, for 13 to 21 test instances, the predictive abilities of IPM’s models are roughly equal to or better than those for the autoregressive models. Thus, IPM appears able to infer models which are as accurate as those found by an autoregressive method that is widely used, while providing interpretability that is lacking in the more traditional models.
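ARFit is a Matlab package, so the following numpy sketch is only a minimal stand-in for the order-one case selected here, together with the iterated forecasting protocol described above; it is not derived from ARFit's code.

```python
import numpy as np

def fit_ar1(series):
    """Least-squares fit of x_t = c + x_{t-1} A for a multivariate series
    of shape (T, k); returns the intercept c and the coefficient matrix A."""
    X = np.hstack([np.ones((len(series) - 1, 1)), series[:-1]])
    Y = series[1:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef[0], coef[1:]

def iterated_forecast(c, A, last_obs, horizon):
    """Roll the model forward, feeding each prediction back in as input,
    so that later predictions are based on earlier predicted values."""
    preds, x = [], np.asarray(last_obs, dtype=float)
    for _ in range(horizon):
        x = c + x @ A
        preds.append(x.copy())
    return np.array(preds)
```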
4 Modeling an Aquatic Ecosystem
Although the predator-prey system we used in the previous section was appropriate to demonstrate the capabilities of the IPM algorithm, rarely does one find such simple modeling tasks in Earth science. Many ecosystem models involve interactions not only among the species but also between the species and environmental factors. To further test IPM’s ability, we provided it with knowledge
Fig. 2. Predictive error for induced process models, autoregressive models, and constant models, vs. the number of projected time steps, on the predator-prey data.
and data about the aquatic ecosystem of the Ross Sea in Antarctica (Arrigo et al., in press). The data came from the ROAVERRS program, which involved three cruises in the austral spring and early summers of 1996, 1997, and 1998. The measurements included time-series data for phytoplankton and nitrate concentrations, as shown in Figure 3.

4.1 Background Knowledge about Aquatic Ecosystems
Taking into account knowledge about aquatic ecosystems, we crafted the set of generic processes shown in Table 3. In contrast to the components for predator-prey systems, the exponential decay process now involves not only a reduction in a species' population, but also the generation of residue as a side effect. Formation of this residue is the mechanism by which minerals and nutrients return to the ecosystem. Knowledge about the generation of residue is also reflected in the process predation. The generic process nutrient uptake encodes the knowledge that plants derive their nutrients directly from the environment and do not depend on other species for their survival. Two other processes – remineralization and constant inflow – convey information about how nutrients become available in ecosystems. Finally, the growth process posits that some species can grow in number independently of predation or nutrient uptake. As in the first domain, our approach to process model induction requires the user to specify the variables to be considered, along with their types. In this case, we knew that the Ross Sea ecosystem included two species, phytoplankton and zooplankton, with the concentration of the first being measured in our data set and the second being unobserved. We also knew that the sea contained nitrate, an observable nutrient, and detritus, an unobserved residue generated when members of a species die.
Table 3. Five generic processes for aquatic ecosystems with constraints on their variables and parameters.

generic process exponential decay;
  variables S{species}, D{detritus};
  parameters α [0, 10];
  equations d[S, t, 1] = −1 ∗ α ∗ S;
            d[D, t, 1] = α ∗ S;

generic process constant inflow;
  variables N{nutrient};
  parameters ν [0, 10];
  equations d[N, t, 1] = ν;

generic process nutrient uptake;
  variables S{species}, N{nutrient};
  parameters β [0, 10], µ [0, 10];
  conditions N > τ;
  equations d[S, t, 1] = µ ∗ S;
            d[N, t, 1] = −1 ∗ β ∗ µ ∗ S;

generic process remineralization;
  variables N{nutrient}, D{detritus};
  parameters ψ [0, 10];
  equations d[N, t, 1] = ψ ∗ D;
            d[D, t, 1] = −1 ∗ ψ ∗ D;

generic process predation;
  variables S1{species}, S2{species}, D{detritus};
  parameters ρ [0, 10], γ [0, 10];
  equations d[S1, t, 1] = γ ∗ ρ ∗ S1;
            d[D, t, 1] = (1 − γ) ∗ ρ ∗ S1;
            d[S2, t, 1] = −1 ∗ ρ ∗ S1;
4.2 Inducing Models for an Aquatic Ecosystem
Given this background knowledge about the Ross Sea ecosystem and data from the ROAVERRS cruises, we wanted IPM to find a process model that explained the variations in these data. To make the system’s search tractable, we introduced further constraints by restricting each generic process to occur no more than twice and considering models with no fewer than three processes and no more than six. Using the four variables described above – Phyto{species}, Zoo{species}, Nitrate{nutrient}, and Detritus{residue} – IPM combined these with the available generic processes to generate some 200 model structures. Since Phyto and Nitrate were observable variables, the system considered only those models that included equations with these variables on their left-hand sides. The parameter-fitting routine and the description length criterion selected the model in Table 4, which produced a mean squared error of 23.26 and a description length of 131.88. Figure 3 displays the log values this candidate predicts for phytoplankton and nitrate, along with those observed in the field. The r2 value is 0.51 for Phyto but only 0.27 for Nitrate, which indicates that the model explains substantially less of the variance than in our first domain. Note that the model includes only three processes and that it makes no reference to zooplankton. The first process states that the phytoplankton population dies away at an exponential rate and, in doing so, generates detritus. The second process involves the growth of phytoplankton, which increases its population as it absorbs the nutrient nitrate. This growth happens only when the nitrate concentration is above a threshold, and it causes a decrease in the concentration
Table 4. Induced model for the aquatic ecosystem of the Ross Sea.
model Aquatic Ecosystem;
  variables Phyto, Nitrate, Detritus, Zoo;
  observables Phyto, Nitrate;
  process exponential decay 1;
    equations d[Phyto, t, 1] = −1 ∗ 1.9724 ∗ Phyto;
              d[Detritus, t, 1] = 1.9724 ∗ Phyto;
  process nutrient uptake;
    conditions Nitrate > 3.1874;
    equations d[Phyto, t, 1] = 3.6107 ∗ Phyto;
              d[Nitrate, t, 1] = −1 ∗ 0.3251 ∗ 3.6107 ∗ Phyto;
  process remineralization;
    equations d[Nitrate, t, 1] = 0.032 ∗ Detritus;
              d[Detritus, t, 1] = −1 ∗ 0.032 ∗ Detritus;
of the nutrient. The final process states that the residue is converted to the consumable nitrate at a constant rate. In fact, the model with the lowest squared error included a predation process which stated that zooplankton feeds on phytoplankton, thereby increasing the former population, decreasing the latter, and producing detritus. However, IPM calculated that the improved fit was outweighed by the cost of including an additional process in the model. This decision may well have resulted from a small population of zooplankton, for which no measurements were available but which is consistent with other evidence about the Ross Sea ecosystem. We suspect that, given a more extended time series, IPM would rank this model as best even using its description length, but this is an empirical question that must await further data.
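For completeness, the induced Ross Sea model of Table 4 – including the activation condition on the nutrient uptake process – can be simulated in the same finite-difference style. The rate constants and the threshold come from Table 4; the step size and initial concentrations are illustrative assumptions.

```python
def simulate_ross_sea(phyto0, nitrate0, detritus0, steps, dt=0.01):
    """Finite-difference simulation of the induced model in Table 4."""
    phyto, nitrate, detritus = phyto0, nitrate0, detritus0
    trajectory = [(phyto, nitrate, detritus)]
    for _ in range(steps):
        uptake = nitrate > 3.1874                      # activation condition
        d_phyto = -1.9724 * phyto + (3.6107 * phyto if uptake else 0.0)
        d_nitrate = 0.032 * detritus - (0.3251 * 3.6107 * phyto if uptake else 0.0)
        d_detritus = 1.9724 * phyto - 0.032 * detritus
        phyto += dt * d_phyto
        nitrate += dt * d_nitrate
        detritus += dt * d_detritus
        trajectory.append((phyto, nitrate, detritus))
    return trajectory
```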
5 Discussion
There is a large literature on the subject of ecosystem modeling. For example, many Earth scientists develop their models in STELLA (Richmond et al., 1987), an environment that lets one specify quantitative models and simulate their behavior over time. However, work in this and similar frameworks has focused almost entirely on the manual construction and tuning of models, which involves much trial and error. Recently, increased computing power has led a few Earth scientists to try automating this activity. For instance, Morris (1997) reports a method for fitting a predator-prey model to time-series data, whereas Jost and Adiriti (2000) use computation to determine which functional forms best model similar data. Our approach has a common goal, but IPM can handle more complex models and uses domain knowledge about generic processes to constrain search through a larger model space. On another front, our approach differs from most earlier work on equation discovery (e.g., Washio et al., 2000) by focusing on differential equation models
Fig. 3. Predicted and observed log concentrations of phytoplankton (left) and nitrate (right) in the Ross Sea over 31 days.
of dynamical systems. The most similar research comes from Todorovski and Džeroski (1997), Bradley et al. (1999), and Koza et al. (2001), who also report methods that induce differential equation models by searching for model structures and parameters that fit time-series data. Our framework extends theirs by focusing on processes, which play a central role in many sciences and provide a useful framework for encoding domain knowledge that constrains search and produces more interpretable results. Also, because IPM can construct models that include theoretical terms, it supports aspects of abduction (e.g., Josephson, 2000) as well as induction. Still, however promising our approach to ecosystem modeling, considerable work remains before it will be ready for use by practicing scientists. Some handcrafted models contain tens or hundreds of equations, and we must find ways to constrain search further if we want our system to discover such models. The natural source of constraints is additional background knowledge. Earth scientists often know the qualitative processes that should appear in a model (e.g., that one species preys on another), even when they do not know their functional forms. Moreover, they typically organize large models into modules that are relatively independent, which should further reduce search. Future versions of IPM should take advantage of this knowledge, along with more powerful methods for parameter fitting that will increase its chances of finding the best model. In summary, we believe that inductive process modeling provides a valuable alternative to the manual construction of ecosystem models which combines domain knowledge, heuristic search, and data in a powerful way. The resulting models are cast in a formalism recognizable to Earth scientists and they refer to processes that domain experts will find familiar. Our initial results on two ecosystem modeling tasks are encouraging, but we must still extend the framework in a number of directions before it can serve as a practical scientific aid. Acknowledgements. This work was supported by the NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation. We thank Tasha Reddy and Alessandro Tagliabue for preparing the ROAVERRS data and
for discussions about ecosystem processes. We also thank Sašo Džeroski and Ljupčo Todorovski for useful discussions about approaches to inductive process modeling.
References
Arrigo, K. R., Worthen, D. L., & Robinson, D. H. (in press). A coupled ocean-ecosystem model of the Ross Sea. Part 2: Phytoplankton taxonomic variability and primary production. Journal of Geophysical Research.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211–252.
Bradley, E., Easley, M., & Stolle, R. (2001). Reasoning about nonlinear system identification. Artificial Intelligence, 133, 139–188.
Josephson, J. R. (2000). Smart inductive generalizations are abductions. In P. A. Flach & A. C. Kakas (Eds.), Abduction and induction. Kluwer.
Jost, C., & Adiriti, R. (2000). Identifying predator-prey processes from time-series. Theoretical Population Biology, 57, 325–337.
Koza, J., Mydlowec, W., Lanza, G., Yu, J., & Keane, M. (2001). Reverse engineering and automatic synthesis of metabolic pathways from observed data using genetic programming. Pacific Symposium on Biocomputing, 6, 434–445.
Langley, P., George, D., Bay, S., & Saito, K. (in press). Robust induction of process models from time-series data. Proceedings of the Twentieth International Conference on Machine Learning. Washington, DC: AAAI Press.
Luenberger, D. G. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
Morris, W. F. (1997). Disentangling effects of induced plant defenses and food quantity on herbivores by fitting nonlinear models. American Naturalist, 150, 299–327.
Richmond, B., Peterson, S., & Vescuso, P. (1987). An academic user's guide to STELLA. Lyme, NH: High Performance Systems.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing. Cambridge: MIT Press.
Saito, K., & Nakano, R. (1997). Law discovery using neural networks. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (pp. 1078–1083). Yokohama: Morgan Kaufmann.
Schneider, T., & Neumaier, A. (2001). Algorithm 808: ARFIT – A Matlab package for the estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Transactions on Mathematical Software, 27, 58–65.
Todorovski, L., & Džeroski, S. (1997). Declarative bias in equation discovery. Proceedings of the Fourteenth International Conference on Machine Learning (pp. 376–384). San Francisco: Morgan Kaufmann.
Veilleux, B. G. (1979). An analysis of the predatory interaction between Paramecium and Didinium. Journal of Animal Ecology, 48, 787–803.
Washio, T., Motoda, H., & Niwa, Y. (2000). Enhancing the plausibility of law equation discovery. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1127–1134). Stanford, CA: Morgan Kaufmann.
An Optimal Strategy for Extracting Probabilistic Rules by Combining Rough Sets and Genetic Algorithm

Xiaoshu Hang and Honghua Dai

School of Information Technology, Deakin University, Australia
{xhan,hdai}@deakin.edu.au
Abstract. This paper proposes an optimal strategy for extracting probabilistic rules from databases. Two inductive learning-based statistical measures and their rough set-based definitions, accuracy and coverage, are introduced. The simplicity of a rule, emphasized in this paper, has previously been ignored in the discovery of probabilistic rules. To avoid the high computational complexity of the rough-set approach, some rough-set terminology, rather than the approach itself, is applied to represent the probabilistic rules. A genetic algorithm is exploited to find the optimal probabilistic rules, which have the highest accuracy and coverage and the shortest length. Some heuristic genetic operators are also utilized in order to make the global search and the evolution of rules more efficient. Experimental results have revealed that the method runs more efficiently and generates probabilistic classification rules of the same integrity when compared with traditional classification methods.
1 Introduction

One of the main objectives of database analysis in recent years is to discover interesting patterns hidden in databases. Over the years, much work has been done and many algorithms have been proposed. These algorithms can be mainly classified into two categories: machine learning-based and data mining-based. The knowledge discovered is generally presented as a group of rules which are expected to have high accuracy, coverage and readability. High accuracy means a rule has high reliability, and high coverage implies a rule has strong prediction ability. It has been noted that some data mining algorithms produce large amounts of rules that are potentially useless, and much post-processing work has to be done in order to pick out the interesting patterns. Thus it is very important to have a mining approach that is capable of directly generating useful knowledge without post-processing. Note that genetic algorithms (GAs) have been used in some applications to acquire knowledge from databases, as typified by the GA-based classifier system for concept learning and GA-based fuzzy rule acquisition from data sets [3]. In the latter application, which involves information with uncertainty and vagueness to some extent, a GA is used to evolve the initial fuzzy rules generated by fuzzy clustering approaches. In the last decade, GAs have been used in many diverse areas, such as function optimization,
image processing, pattern recognition, knowledge discovery, etc. In this paper, GA and rough sets are combined in order to mine probabilistic rules from a database. Rough set theory was introduced by Pawlak in 1987. It has been recognized as a powerful mathematical tool for data analysis and knowledge discovery from imprecise and ambiguous data, and has been successfully applied in a wide range of application domains such as machine learning, expert systems, pattern recognition, etc. It classifies all the attributes of an information table into three categories, namely core attributes, reduct attributes and dispensable attributes, according to their contribution to the decision attribute. The drawback of rough set theory is its computational inefficiency, which restricts it from being effectively applied to knowledge discovery in databases [5]. We borrow its ideas and terminology in this paper to represent rules and to design the fitness function of the GA, so that we can efficiently mine probabilistic rules. The rules to be mined are called probabilistic rules since they are characterized by two key parameters: accuracy and coverage. The optimal strategy used in this paper focuses on acquiring rules with high accuracy, high coverage and short length. The rest of this paper is organized as follows: Section 2 introduces some concepts of rough sets and the definition of probabilistic rules. Section 3 shows the genetic algorithm-based strategy for mining probabilistic rules, and Section 4 gives the experimental results. Section 5 is the conclusion.
2 Probabilistic Rules in Rough Sets

2.1 Rough Sets

Let I = (U, A) be an information table, where U is the definite set of instances denoting the universe and A is the attribute set. R is an equivalent relation on U, and AR = (U, R) is called an approximation space. For any two instances x and y in U, they are said to be equivalent if they are indiscernible with respect to the relation R. In general, [x]R is used to represent the equivalent class of the instance x with respect to the relation R in U:

[x]R = {y | y ∈ U and ∀r ∈ R, r(x) = r(y)}

Let X be a certain subset of U. The lower approximation of X, denoted by R_(X) and also known as the positive region of X, denoted by POSR(X), is the greatest collection of equivalent classes in which each instance can be fully classified into X. The upper approximation of X, denoted by R̄(X) and also known as the negative region of X, denoted by NEGR(X), is the smallest collection of equivalent classes that contains some instances that possibly belong to X. In this paper, R is not an ordinary relation but an extended relation represented by a formula, which is normally composed of the conjunction or disjunction of attribute-value pairs. Thus, the equivalent class of the instance x in U with respect to a conjunctive formula like [temperature=high] ∧ [humidity=low] is described as follows:
[x][temperature=high]∧[humidity=low] = {y ∈ U | (temperature(x) = temperature(y) = high) ∧ (humidity(x) = humidity(y) = low)}
The equivalent class of the instance x in U with respect to a disjunctive formula like [temperature=high] ∨ [humidity=low] is represented by:

[x][temperature=high]∨[humidity=low] = {y ∈ U | (temperature(x) = temperature(y) = high) ∨ (humidity(x) = humidity(y) = low)}

An attribute-value pair, e.g. [temperature=high], corresponds to the atom formula in concept learning based on predicate logic programming. The conjunction of attribute-value pairs, e.g. [temperature=high] ∧ [humidity=low], is equivalent to a complex formula in AQ terminology. The length of a complex formula R is defined as the number of attribute-value pairs it contains, and is denoted by |R|. Accuracy and coverage are two important statistical measures in concept learning, and are used to evaluate the reliability and prediction ability of a complex formula acquired by a concept learning program. In this paper, their definitions based on rough sets are given as follows:

Definition 1. Let R be a conjunction of attribute-value pairs and D be the set of objects belonging to the target class d. The accuracy of the rule R → d is defined as:

αR(D) = |[x]R ∩ D| / |[x]R|    (1)
Definition 2. Let R be a conjunction of attribute-value pairs and D be the set of objects belonging to the target class d. The coverage of the rule R → d is defined as:

κR(D) = |[x]R ∩ D| / |D|    (2)
where [x]R is the set of objects that are indiscernible with respect to R. For example, αR(D) = 0.8 indicates that 80% of the instances in the equivalent class [x]R probabilistically belong to class d, and κR(D) = 0.5 means that 50% of the instances belonging to class d are probabilistically covered by the rule R. αR(D) = 1 implies that all the instances in [x]R are fully classified into class d, and κR(D) = 1 means that all the instances in the target class d are covered by the rule R. In this case the rule R is deterministic; otherwise the rule R is probabilistic.

2.2 Probabilistic Rules

We now formally give the definition of a probabilistic rule.

Definition 3. Let U be the definite set of training instances, D be the set whose instances belong to the concept d, and R be a formula of the conjunction of attribute-value pairs. A rule R → d is said to be probabilistic if it satisfies: αR(D) > δα and κR(D) > δκ.
Here δα and δκ are two user-specified positive real values. The rule is then represented as:

R →(α, κ) d
For example, a probabilistic rule induced from Table 1 is:

[prodrome=0] ∧ [nausea=0] ∧ [M1=1] → [class = m.c.h.]

Since [x][prodrome=0]∧[nausea=0]∧[M1=1] = {1, 2, 5} and [x]class=m.c.h. = {1, 2, 5, 6}, therefore:
αR(D) = 1 and κR(D) = 0.75.

It is well known that in concept learning, consistency and completeness are two fundamental concepts. If a concept description covers all the positive examples, it is called complete. If it covers no negative examples, it is called consistent. If both completeness and consistency are retained by a concept description, then the concept description is correct on all examples. In rough set theory, consistency and completeness are also defined in a similar way: a composed set is said to be consistent with respect to a concept if and only if it is a subset of the lower approximation of the concept, and a composed set is said to be complete with respect to a concept if and only if it is a superset of the upper approximation of the concept [7]. Based on this idea we give the following two definitions that describe the consistency and completeness of a probabilistic rule.

Definition 4. A probabilistic rule is said to be consistent if and only if its αR(D) = 1.

Definition 5. A probabilistic rule is said to be complete if and only if its κR(D) = 1.

Now we consider the problem of the simplicity of a concept description. It is an empirical observation rather than a theoretical proof that there is an inherent tradeoff between concept simplicity and completeness. In general, we do not want to trade accuracy for simplicity or efficiency, but still strive to maintain the prediction ability of the resulting concept description. In most applications of concept learning, rules that are consistent with all of the (noisy) training data tend to be overfitted and long, so a relaxed requirement is usually set up in order that short rules can be acquired. The simplicity of a concept description is therefore considered one of the goals we pursue in the process of knowledge acquisition: the simpler a concept description is, the lower the cost of its storage and utilization. The simplicity of a probabilistic rule R in this paper is defined as follows:

ηR(D) = 1 − |R| / |A|    (3)
where |R| represents the number of attribute-value pairs the probabilistic rule contains, and |A| denotes the number of attributes the information table contains. Note that a probabilistic rule may still not be the optimal one even if it has both α = 1 and κ = 1. The reason for this is that it may still contain dispensable attribute-value pairs even when it has full accuracy and coverage. In this paper, such rules will be further
refined through mutation in the process of evolution. Thus, taking into account the constraint of simplicity, we can further represent a probabilistic rule as:

R →(α, κ, η) d
s.t. R = ∧j [xj = vk],  αR(d) > δ1,  κR(d) > δ2  and  ηR(d) > δ3
Table 1. A small database on headache diagnosis[1]
No.  Age    Loc      Nature  Prod  Nau  M1  Class
1    50-59  ocular   pers.   0     0    1   m.c.h
2    40-49  whole    pers.   0     0    1   m.c.h
3    40-49  lateral  throb.  1     1    0   migra
4    40-49  whole    throb.  1     1    0   migra
5    40-49  whole    radia.  0     0    1   m.c.h
6    50-59  whole    pers.   0     1    1   m.c.h

M1: tenderness of M1, m.c.h: muscle contraction headache
The following two probabilistic rules derived from Table 1 are used to illustrate the importance of the simplicity of a rule. Both Rule 1 and Rule 2 have full accuracy and coverage, but Rule 2 is much shorter than Rule 1, and therefore Rule 2 is preferred.

Rule 1: [Age=40-49] ∧ [nature=throbbing] ∧ [prodrome=1] ∧ [nausea=1] ∧ [M1=0] → [class=migraine], with α = 1, κ = 1 and η = 0.143.

Rule 2: [nature=throbbing] ∧ [prodrome=1] ∧ [M1=0] → [class=migraine], with α = 1, κ = 1 and η = 0.429.
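The three measures can be computed directly from a table such as Table 1. The sketch below is an illustration rather than the authors' implementation; in particular it counts the consequent as part of |R|, since that convention reproduces the η values quoted for Rule 1 and Rule 2.

```python
def evaluate_rule(table, premises, target, n_attributes):
    """Accuracy (Definition 1), coverage (Definition 2) and simplicity
    (Equation 3) of a conjunctive rule over a list-of-dicts table."""
    covered = [r for r in table if all(r[a] == v for a, v in premises.items())]   # [x]_R
    positive = [r for r in table if r[target[0]] == target[1]]                    # D
    overlap = [r for r in covered if r[target[0]] == target[1]]
    alpha = len(overlap) / len(covered) if covered else 0.0
    kappa = len(overlap) / len(positive) if positive else 0.0
    eta = 1.0 - (len(premises) + 1) / n_attributes        # consequent counted in |R|
    return alpha, kappa, eta

# On the six instances of Table 1 (seven attributes, ignoring "No."), Rule 2,
# i.e. premises {Nature: "throb.", Prod: 1, M1: 0} with target ("Class", "migra"),
# evaluates to alpha = 1.0, kappa = 1.0 and eta = 1 - 4/7 ≈ 0.429.
```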
3 A GA-Based Acquisition of Probabilistic Rules

It is well known that the rough set-based approach is a multi-concept inductive learning approach and is widely applied to uncertain or incomplete information systems. However, due to its high computational complexity, it is ineffective when applied to acquiring knowledge from large databases. To overcome this problem while still exploiting rough set theory in knowledge discovery with uncertainty, we use rough set technology combined with a genetic algorithm to acquire optimal probabilistic rules from databases.

3.1 Coding

Genetic algorithms, as defined by Holland, work on fixed-length bit strings. Thus, a coding function is needed to represent every solution to the problem by a string in a one-to-one way. Since we do not know at the coding stage how many rules an
information table contains, each individual is initially coded to contain enough rules, which are then refined during the evolution phase. Note that binary-coded (classical) GAs are less efficient when applied to multidimensional, large databases, so we consider real-coded GAs in which variables appear directly in the chromosome and are modified by genetic operators. First, we define a list of attribute-value pairs for the information table of discourse. Its length L is determined by each attribute's domain. Let Dom(Ai) denote the domain of the attribute Ai; then we have:

L = |Dom(A1)| + |Dom(A2)| + ⋅⋅⋅ + |Dom(Am)| = ∑i=1..m |Dom(Ai)|

where |Dom(Ai)| represents the number of different discrete values that attribute Ai contains. Assume the information table contains m attributes; since a rule, in the worst case, contains one attribute-value pair for each attribute, a rule has a maximum length of m attribute-value pairs. We also assume an individual contains n rules and each rule is designed to have the fixed length m, so the total length of an individual is n × m attribute-value pairs. In this paper, an individual is not coded as a string of binary bits but as an array of structure variables. Each structure variable has four fields (see Fig. 1): a structure variable corresponding to the body of a rule, a flag variable indicating whether the rule is superfluous, and two real variables representing the accuracy and the coverage, respectively.
Fig. 1. A real coding scheme: an individual is an array of structure variables (struc.var 1, ..., struc.var n); each holds a flag, the rule's accuracy and coverage, and the rule body, which consists of a consequence and a set of premise attribute-value pairs
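A rough transcription of this coding scheme into Python dataclasses is given below; the field names are mine, since the paper describes the fields only informally.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AttributeValuePair:
    index: int            # position in the global attribute-value list; 0 = unused
    lower: int            # lower bound of the admissible index range
    upper: int            # upper bound of the admissible index range

@dataclass
class Rule:
    consequence: AttributeValuePair
    premises: List[AttributeValuePair] = field(default_factory=list)

@dataclass
class RuleGene:                  # one "structure variable" of an individual
    rule: Rule
    superfluous: bool = False    # flag marking a redundant rule
    accuracy: float = 0.0        # alpha of the rule
    coverage: float = 0.0        # kappa of the rule

@dataclass
class Individual:
    genes: List[RuleGene]        # a fixed number n of rule slots
```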
A rule, which is itself a structured variable, has three integer fields – an index pointing to the list of attribute-value pairs, and an upper bound and a lower bound for the index – and two sub-structured variables. The first sub-structured variable is the consequence of the rule and the rest are the premise items of the rule. During the process of evolution, the index changes between the upper and lower bounds and may possibly become zero, meaning the corresponding premise item is indispensable in the rule.

3.2 Fitness

Assume an individual contains n rules and each rule Ri (i = 1, …, n) has an accuracy αi and a coverage κi. The fitness function is designed to be the average of the fitness of all the rules:

F(α, κ) = (1/n) ∑i=1..n (ω1 + ω2) / (ω1/αi + ω2/κi)
        = (1/n) ∑i=1..n (ω1 + ω2) αi κi / (ω2 αi + ω1 κi)
        = (1/n) ∑i=1..n (ω1 + ω2) |[x]Ri ∩ Di| / (ω1 |[x]Ri| + ω2 |Di|)
        = (1/n) ∑i=1..n f(Ri, Di)
where ω1,ω2 are the weights for αi and κi, respectively, and ω1+ω2=1. Therefore: 0