Handbook of Research on Text and Web Mining Technologies
Min Song, New Jersey Institute of Technology, USA
Yi-fang Brook Wu, New Jersey Institute of Technology, USA
Volume I
Information Science Reference
Hershey • New York
Director of Editorial Content: Kristin Klinger
Director of Production: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Typesetter: Chris Hrobak
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.
Published in the United States of America by Information Science Reference (an imprint of IGI Global) 701 E. Chocolate Avenue, Suite 200 Hershey PA 17033 Tel: 717-533-8845 Fax: 717-533-8661 E-mail:
[email protected] Web site: http://www.igi-global.com

and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanbookstore.com

Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Handbook of research on text and web mining technologies / Min Song and Yi-Fang Wu, editors.
p. cm.
Includes bibliographical references and index.
Summary: "This handbook presents recent advances and surveys of applications in text and web mining of interest to researchers and end users"--Provided by publisher.
ISBN 978-1-59904-990-8 (hardcover) -- ISBN 978-1-59904-991-5 (ebook)
1. Data mining--Handbooks, manuals, etc. 2. Web databases--Handbooks, manuals, etc. I. Song, Min, 1969- II. Wu, Yi-Fang, 1970-
QA76.9.D343H43 2008
005.75'9--dc22
2008013118
British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library. All work contributed to this book set is original material. The views expressed in this book are those of the authors, but not necessarily of the publisher. If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating the library's complimentary electronic access to this publication.
Editorial Advisory Board
Zoran Obradovic Temple University, USA
Hongfang Liu Georgetown University, USA
Alexander Yates Temple University, USA
Illhoi Yoo University of Missouri-Columbia, USA
Il-Yeol Song Drexel University, USA
Jason T.L. Wang New Jersey Institute of Technology, USA
Xiaohua Tony Hu Drexel University, USA
List of Contributors
Agarwal, Nitin / Arizona State University, USA...646 Amine, Abdelmalek / Djillali Liabes University, Algeria & Taher Moulay University Center, Algeria...189 Atkinson, John / Universidad de Concepción, Chile...37 Aufaure, Marie-Aude / INRIA Sophia and Supélec, France...418 Babu, V. Suresh / Indian Institute of Technology Guwahati, India...181 Back, Barbro / Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland...724 Bellatreche, Ladjel / University of Poitiers, France...189 Bennamoun, Mohammed / University of Western Australia, Australia...141, 500 Bhattacharyya, Pushpak / IIT Bombay, India...571 Bothma, Theo J. D. / University of Pretoria, South Africa...288 Buccafurri, Francesco / University "Mediterranea" of Reggio Calabria, Italy...273 Caminiti, Gianluca / University "Mediterranea" of Reggio Calabria, Italy...273 Campos, Luis M. de / University of Granada, Spain...331 Chen, Xin / Microsoft Corporation, USA...301 Couto, Francisco M. / University of Lisbon, Portugal...314 Cui, Xiaohui / Oak Ridge National Laboratory, USA...165 Davis, Neil / The University of Sheffield, UK...822 Demetriou, George / The University of Sheffield, UK...822 Dreweke, Alexander / Friedrich-Alexander University Erlangen-Nuremberg, Germany...626 Elberrichi, Zakaria / Djillali Liabes University, Algeria...189 Faria, Daniel / University of Lisbon, Portugal...314 Fernández-Luna, Juan M. / University of Granada, Spain...331 Ferng, William / The Boeing Phantom Works, USA...546 Fischer, Ingrid / University of Konstanz, Germany...626 Fox, Edward A.
/ Virginia Tech, USA....................................................................................................................61 Gaizauskas, Robert / The University of Sheffield, UK........................................................................................822 Ginter, Filip / Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland..............724 Grego, Tiago / University of Lisbon, Portugal....................................................................................................314 Hagenbuchner, Markus / University of Wollongong, Australia.........................................................................785 Hiissa, Marketta / Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland.............................................................................................................................................................724 Huete, Juan F. / University of Granada, Spain...................................................................................................331 Kao, Anne / The Boeing Phantom Works, USA...................................................................................................546 Karsten, Helena / Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland...........................................................................................................................................................724 Kim, Han-joon / University of Seoul, Korea.......................................................................................................111 Kroeze, Jan H. / University of Pretoria, South Africa.........................................................................................288
Kutty, Sangeetha / Queensland University of Technology, Australia.................................................................227 Lai, Kin Keung / City University of Hong Kong, China.....................................................................................201 Lax, Gianluca / University “Mediterranea” of Reggio Calabria, Italy..............................................................273 Lazarinis, Fotis / University of Sunderland, UK.................................................................................................530 Lechevallier, Yves / INRIA Rocquencourt, France..............................................................................................418 Lee, Ki Jung / Drexel University, USA................................................................................................................758 Lee, Yue-Shi / Ming Chuan University, Taiwan, ROC.........................................................................................448 Le Grand, Bénédicte / Laboratoire d’Informatique de Paris, 6, France............................................................418 Li, Quanzhi / Avaya, Inc., USA......................................................................................................................23, 369 Li, Xiao-Li / Institute for Infocomm Research, Singapore.....................................................................................75 Liberati, Diego / Istituto di Elettronica e Ingegneria dell’Informazione e delle Telecomunicazioni Consiglio Nazionale delle Ricerche Politecnico di Milano, Italy...................................................................................684 Lichtnow, Daniel / Catholic University of Pelotas, Brazil..................................................................................346 Lingras, Pawan / Saint Mary’s University, Canada............................................................................................386 Lingras, Rucha / Saint Mary’s University, Canada.............................................................................................386 Liu, Huan / Arizona State University, USA.........................................................................................................646 Liu, Shuhua / Academy of Finland, Finland & Åbo Akademi University, Finland.............................................724 Liu, Wei / University of Western Australia, Australia..................................................................................141, 500 Liu, Ying / The Hong Kong Polytechnic University Hong Kong SAR, China.........................................................1 Loh, Stanley / Lutheran University of Brazil, Brazil...........................................................................................346 Luo, Xin / Virginia State University, USA............................................................................................................694 Mahalakshmi, G.S. 
/ Anna University, Chennai, India...483 Manco, Giuseppe / Italian National Research Council, Italy...604 Malki, Mimoun / Djillali Liabes University, Algeria...189 Marghescu, Dorina / Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland...724 Masseglia, Florent / INRIA Sophia Antipolis, France...418 Matera, Maristella / Politecnico di Milano, Italy...401 Matthee, Machdel C. / University of Pretoria, South Africa...288 Meo, Pasquale De / Università degli Studi Mediterranea di Reggio Calabria, Italy...670 Meo, Rosa / Università di Torino, Italy...401 Mhamdi, Faouzi / University of Jandouba, Tunisia...128 Murty, M. Narasimha / Indian Institute of Science, India...708 Nayak, Richi / Queensland University of Technology, Australia...227, 249 Oliveira, José Palazzo M. de / Federal University of Rio Grande do Sul, Brazil...346 Oliveira, Stanley R. M. / Embrapa Informática Agropecuária, Brazil...468 Ortale, Riccardo / University of Calabria, Italy...604 Pahikkala, Tapio / Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland...724 Patra, Bidyut Kr. / Indian Institute of Technology Guwahati, India...181 Pecoraro, Marcello / University of Naples Federico II, Italy...359 Pérez-Quiñones, Manuel / Virginia Tech, USA...61 Pesquita, Catia / University of Lisbon, Portugal...314 Poteet, Steve / The Boeing Phantom Works, USA...546 Potok, Thomas E.
/ Oak Ridge National Laboratory, USA.................................................................................165 Pyysalo, Sampo / Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland.........724 Qi, Yanliang / New Jersey Institute of Technology, USA.....................................................................................748 Quach, Lesley / The Boeing Phantom Works, USA.............................................................................................546 Quattrone, Giovanni / Università degli Studi Mediterranea di Reggio Calabria, Italy.....................................670 Rakotomalala, Ricco / University of Lyon 2, France..........................................................................................128 Ramakrishnan, Ganesh / IBM India Research Labs, India...............................................................................571 Roberts, Ian / The University of Sheffield, UK....................................................................................................822
Romero, Alfonso E. / University of Granada, Spain...331 Salakoski, Tapio / Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland...724 Segall, Richard S. / Arkansas State University, USA...766 Sendhilkumar, S. / Anna University, Chennai, India...483 Siciliano, Roberta / University of Naples Federico II, Italy...359 Silva, Mário J. / University of Lisbon, Portugal...314 Simonet, Michel / Joseph Fourier University, France...189 Suominen, Hanna / Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland...724 Tagarelli, Andrea / University of Calabria, Italy...604 Thirumaran, E. / Indian Institute of Science, India...708 Tjoelker, Rod / The Boeing Phantom Works, USA...546 To, Phuong Kim / Tedis P/L, Australia...785 Trousse, Brigitte / INRIA Sophia Antipolis, France...418 Tsoi, Ah Chung / Monash University, Australia...785 Tungare, Manas / Virginia Tech, USA...61 Ursino, Domenico / Università degli Studi Mediterranea di Reggio Calabria, Italy...670 Viswanth, P.
/ Indian Institute of Technology Guwahati, India...181 Wang, Hsiao-Fan / National Tsing Hua University, Taiwan, ROC...807 Wang, Miao-Ling / Minghsin University of Science & Technology, Taiwan, ROC...807 Wang, Shouyang / Chinese Academy of Sciences, China...201 Werth, Tobias / Friedrich-Alexander University Erlangen-Nuremberg, Germany...626 Wives, Leandro Krug / Federal University of Rio Grande do Sul, Brazil...346 Wong, Wilson / University of Western Australia, Australia...141, 500 Wörlein, Marc / Friedrich-Alexander University Erlangen-Nuremberg, Germany...626 Wu, Jason / The Boeing Phantom Works, USA...546 Wu, Yi-fang Brook / New Jersey Institute of Technology, USA...23, 301, 369 Xu, Shuting / Virginia State University, USA...694 Yen, Show-Jane / Ming Chuan University, Taiwan, ROC...448 Guo, Yikun / The University of Sheffield, UK...822 Yu, Lean / Chinese Academy of Sciences, China & City University of Hong Kong, China...201 Yu, Xiaoyan / Virginia Tech, USA...61 Yuan, Weigo / Virginia Tech, USA...61 Yuan, Yubo / Virginia Tech, USA...61 Zaïane, Osmar R. / University of Alberta, Edmonton, Canada...468 Zhang, Jianping / MITRE Corporation, USA...646 Zhang, Qingyu / Arkansas State University, USA...766 Zhang, Yu-Jin / Tsinghua University, Beijing, China...96
Table of Contents
Foreword ... xxxiii
Preface ... xxxiv
Acknowledgment ... xxxv
Volume I Section I Document Preprocessing Chapter I On Document Representation and Term Weights in Text Classification ... 1 Ying Liu, The Hong Kong Polytechnic University, Hong Kong SAR, China Chapter II Deriving Document Keyphrases for Text Mining ... 23 Yi-fang Brook Wu, New Jersey Institute of Technology, USA Quanzhi Li, Avaya, Inc., USA Chapter III Intelligent Text Mining: Putting Evolutionary Methods and Language Technologies Together ... 37 John Atkinson, Universidad de Concepción, Chile Section II Classification and Clustering Chapter IV Automatic Syllabus Classification Using Support Vector Machines ... 61 Xiaoyan Yu, Virginia Tech, USA Manas Tungare, Virginia Tech, USA Weigo Yuan, Virginia Tech, USA Yubo Yuan, Virginia Tech, USA Manuel Pérez-Quiñones, Virginia Tech, USA Edward A. Fox, Virginia Tech, USA
Chapter V Partially Supervised Text Categorization ... 75 Xiao-Li Li, Institute for Infocomm Research, Singapore Chapter VI Image Classification and Retrieval with Mining Technologies ... 96 Yu-Jin Zhang, Tsinghua University, Beijing, China Chapter VII Improving Techniques for Naïve Bayes Text Classifiers ... 111 Han-joon Kim, University of Seoul, Korea Chapter VIII Using the Text Categorization Framework for Protein Classification ... 128 Ricco Rakotomalala, University of Lyon 2, France Faouzi Mhamdi, University of Jandouba, Tunisia Chapter IX Featureless Data Clustering ... 141 Wilson Wong, University of Western Australia, Australia Wei Liu, University of Western Australia, Australia Mohammed Bennamoun, University of Western Australia, Australia Chapter X Swarm Intelligence in Text Document Clustering ... 165 Xiaohui Cui, Oak Ridge National Laboratory, USA Thomas E. Potok, Oak Ridge National Laboratory, USA Chapter XI Some Efficient and Fast Approaches to Document Clustering ... 181 P. Viswanth, Indian Institute of Technology Guwahati, India Bidyut Kr. Patra, Indian Institute of Technology Guwahati, India V. Suresh Babu, Indian Institute of Technology Guwahati, India Chapter XII SOM-Based Clustering of Textual Documents Using WordNet ... 189 Abdelmalek Amine, Djillali Liabes University, Algeria & Taher Moulay University Center, Algeria Zakaria Elberrichi, Djillali Liabes University, Algeria Michel Simonet, Joseph Fourier University, France Ladjel Bellatreche, University of Poitiers, France Mimoun Malki, Djillali Liabes University, Algeria
Chapter XIII A Multi-Agent Neural Network System for Web Text Mining .......................................................... 201 Lean Yu, Chinese Academy of Sciences, China & City University of Hong Kong, China Shouyang Wang, Chinese Academy of Sciences, China Kin Keung Lai, City University of Hong Kong, China Section III Database, Ontology, and the Web Chapter XIV Frequent Mining on XML Documents ............................................................................................... 227 Sangeetha Kutty, Queensland University of Technology, Australia Richi Nayak, Queensland University of Technology, Australia Chapter XV The Process and Application of XML Data Mining ........................................................................... 249 Richi Nayak, Queensland University of Technology, Australia Chapter XVI Approximate Range Querying over Sliding Windows ....................................................................... 273 Francesco Buccafurri, University “Mediterranea” of Reggio Calabria, Italy Gianluca Caminiti, University “Mediterranea” of Reggio Calabria, Italy Gianluca Lax, University “Mediterranea” of Reggio Calabria, Italy Chapter XVII Slicing and Dicing a Linguistic Data Cube ........................................................................................ 288 Jan H. Kroeze, University of Pretoria, South Africa Theo J. D. Bothma, University of Pretoria, South Africa Machdel C. Matthee, University of Pretoria, South Africa Chapter XVIII Discovering Personalized Novel Knowledge from Text .................................................................... 301 Yi-fang Brook Wu, New Jersey Institute of Technology, USA Xin Chen, Microsoft Corporation, USA Chapter XIX Untangling BioOntologies for Mining Biomedical Information ........................................................ 314 Catia Pesquita, University of Lisbon, Portugal Daniel Faria, University of Lisbon, Portugal Tiago Grego, University of Lisbon, Portugal Francisco M. Couto, University of Lisbon, Portugal Mário J. Silva, University of Lisbon, Portugal
Chapter XX Thesaurus-Based Automatic Indexing ................................................................................................ 331 Luis M. de Campos, University of Granada, Spain Juan M. Fernández-Luna, University of Granada, Spain Juan F. Huete, University of Granada, Spain Alfonso E. Romero, University of Granada, Spain Chapter XXI Concept-Based Text Mining ............................................................................................................... 346 Stanley Loh, Lutheran University of Brazil, Brazil Leandro Krug Wives, Federal University of Rio Grande do Sul, Brazil Daniel Lichtnow, Catholic University of Pelotas, Brazil José Palazzo M. de Oliveira, Federal University of Rio Grande do Sul, Brazil Chapter XXII Statistical Methods for User Profiling in Web Usage Mining ............................................................ 359 Marcello Pecoraro, University of Naples Federico II, Italy Roberta Siciliano, University of Naples Federico II, Italy Chapter XXIII Web Mining to Identify People of Similar Background ..................................................................... 369 Quanzhi Li, Avaya, Inc., USA Yi-fang Brook Wu, New Jersey Institute of Technology, USA Chapter XXIV Hyperlink Structure Inspired by Web Usage ...................................................................................... 386 Pawan Lingras, Saint Mary’s University, Canada Rucha Lingras, Saint Mary’s University, Canada Chapter XXV Designing and Mining Web Applications: A Conceptual Modeling Approach .................................. 401 Rosa Meo, Università di Torino, Italy Maristella Matera, Politecnico di Milano, Italy
Volume II Chapter XXVI Web Usage Mining for Ontology Management ... 418 Brigitte Trousse, INRIA Sophia Antipolis, France Marie-Aude Aufaure, INRIA Sophia and Supélec, France Bénédicte Le Grand, Laboratoire d'Informatique de Paris 6, France Yves Lechevallier, INRIA Rocquencourt, France Florent Masseglia, INRIA Sophia Antipolis, France
Chapter XXVII A Lattice-Based Framework for Interactively and Incrementally Mining Web Traversal Patterns ............................................................................................................................................... 448 Yue-Shi Lee, Ming Chuan University, Taiwan, ROC Show-Jane Yen, Ming Chuan University, Taiwan, ROC Chapter XXVIII Privacy-Preserving Data Mining on the Web: Foundations and Techniques . .................................... 468 Stanley R. M. Oliveira, Embrapa Informática Agropecuária, Brazil Osmar R. Zaïane, University of Alberta, Edmonton, Canada Section IV Information Retrieval and Extraction Chapter XXIX Automatic Reference Tracking ........................................................................................................... 483 G.S. Mahalakshmi, Anna University, Chennai, India S. Sendhilkumar, Anna University, Chennai, India Chapter XXX Determination of Unithood and Termhood for Term Recognition ..................................................... 500 Wilson Wong, University of Western Australia, Australia Wei Liu, University of Western Australia, Australia Mohammed Bennamoun, University of Western Australia, Australia Chapter XXXI Retrieving Non-Latin Information in a Latin Web: The Case of Greek ............................................. 530 Fotis Lazarinis, University of Sunderland, UK Chapter XXXII Latent Semantic Analysis and Beyond ............................................................................................... 546 Anne Kao, The Boeing Phantom Works, USA Steve Poteet, The Boeing Phantom Works, USA Jason Wu, The Boeing Phantom Works, USA William Ferng, The Boeing Phantom Works, USA Rod Tjoelker, The Boeing Phantom Works, USA Lesley Quach, The Boeing Phantom Works, USA Chapter XXXIII Question Answering Using Word Associations .................................................................................. 571 Ganesh Ramakrishnan, IBM India Research Labs, India Pushpak Bhattacharyya, IIT Bombay, India
Chapter XXXIV The Scent of a Newsgroup: Providing Personalized Access to Usenet Sites through Web Mining ........................................................................................................................................ 604 Giuseppe Manco, Italian National Research Council, Italy Riccardo Ortale, University of Calabria, Italy Andrea Tagarelli, University of Calabria, Italy Section V Application and Survey Chapter XXXV Text Mining in Program Code ............................................................................................................ 626 Alexander Dreweke, Friedrich-Alexander University Erlangen-Nuremberg, Germany Ingrid Fischer, University of Konstanz, Germany Tobias Werth, Friedrich-Alexander University Erlangen-Nuremberg, Germany Marc Wörlein, Friedrich-Alexander University Erlangen-Nuremberg, Germany Chapter XXXVI A Study of Friendship Networks and Blogosphere ............................................................................ 646 Nitin Agarwal, Arizona State University, USA Huan Liu, Arizona State University, USA Jianping Zhang, MITRE Corporation, USA Chapter XXXVII An HL7-Aware Decision Support System for E-Health . ................................................................... 670 Pasquale De Meo, Università degli Studi Mediterranea di Reggio Calabria, Italy Giovanni Quattrone, Università degli Studi Mediterranea di Reggio Calabria, Italy Domenico Ursino, Università degli Studi Mediterranea di Reggio Calabria, Italy Chapter XXXVIII Multitarget Classifiers for Mining in Bioinformatics ......................................................................... 684 Diego Liberati, Istituto di Elettronica e Ingegneria dell’Informazione e delle Telecomunicazioni Consiglio Nazionale delle Ricerche Politecnico di Milano, Italy Chapter XXXIX Current Issues and Future Analysis in Text Mining for Information Security Applications .............. 694 Shuting Xu, Virginia State University, USA Xin Luo, Virginia State University, USA Chapter XL Collaborative Filtering Based Recommendation Systems . ................................................................ 708 E. Thirumaran, Indian Institute of Science, India M. Narasimha Murty, Indian Institute of Science, India
Chapter XLI Performance Evaluation Measures for Text Mining ........................................................................... 724 Hanna Suominen, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland Sampo Pyysalo, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland Marketta Hiissa, Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland Filip Ginter, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland Shuhua Liu, Academy of Finland, Finland & Åbo Akademi University, Finland Dorina Marghescu, Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland Tapio Pahikkala, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland Barbro Back, Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland Helena Karsten, Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland Tapio Salakoski, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland Chapter XLII Text Mining in Bioinformatics: Research and Application ................................................................ 748 Yanliang Qi, New Jersey Institute of Technology, USA Chapter XLIII Literature Review in Computational Linguistics Issues in the Developing Field of Consumer Informatics: Finding the Right Information for Consumer’s Health Information Need .................... 758 Ki Jung Lee, Drexel University, USA Chapter XLIV A Survey of Selected Software Technologies for Text Mining .......................................................... 766 Richard S. Segall, Arkansas State University, USA Qingyu Zhang, Arkansas State University, USA Chapter XLV Application of Text Mining Methodologies to Health Insurance Schedules ...................................... 785 Ah Chung Tsoi, Monash University, Australia Phuong Kim To, Tedis P/L, Australia Markus Hagenbuchner, University of Wollongong, Australia
Chapter XLVI Web Mining System for Mobile-Phone Marketing ............................................................................ 807 Miao-Ling Wang, Minghsin University of Science & Technology, Taiwan, ROC Hsiao-Fan Wang, National Tsing Hua University, Taiwan, ROC Chapter XLVII Web Service Architectures for Text Mining: An Exploration of the Issues via an E-Science Demonstrator ...................................................................................................................................... 822 Neil Davis, The University of Sheffield, UK George Demetriou, The University of Sheffield, UK Robert Gaizauskas, The University of Sheffield, UK Yikun Guo, The University of Sheffield, UK Ian Roberts, The University of Sheffield, UK
Detailed Table of Contents
Foreword ... xxxiii
Preface ... xxxiv
Acknowledgment ... xxxv
Volume I

Section I
Document Preprocessing

Chapter I
On Document Representation and Term Weights in Text Classification ... 1
Ying Liu, The Hong Kong Polytechnic University, Hong Kong SAR, China

In automated text classification, a bag-of-words representation followed by tfidf weighting is the most popular approach for converting textual documents into numeric vectors for the induction of classifiers. In this chapter, we explore the potential of enriching the document representation with semantic information systematically discovered at the document sentence level. The salient semantic information is identified using a frequent word sequence method. Different from the classic tfidf weighting scheme, a probability-based term weighting scheme which directly reflects a term's strength in representing a specific category has been proposed. The experimental study based on the semantically enriched document representation and the newly proposed probability-based term weighting scheme has shown a significant improvement over the classic approach, i.e., bag-of-words plus tfidf, in terms of F-score. This study encourages us to further investigate the possibility of applying the semantically enriched document representation over a wide range of text-based mining tasks.

Chapter II
Deriving Document Keyphrases for Text Mining ... 23
Yi-fang Brook Wu, New Jersey Institute of Technology, USA
Quanzhi Li, Avaya, Inc., USA
Document keyphrases provide semantic metadata which can characterize documents and produce an overview of the content of a document. This chapter describes a keyphrase identification program (KIP), which extracts document keyphrases by using prior positive samples of human-identified domain keyphrases to assign weights to candidate keyphrases. The logic of our algorithm is: the more keywords a candidate keyphrase contains and the more significant these keywords are, the more likely this candidate phrase is a keyphrase. To obtain human-identified positive inputs, KIP first populates its glossary database using manually identified keyphrases and keywords. It then checks the composition of all noun phrases extracted from a document, looks up the database, and calculates scores for all these noun phrases. The ones having higher scores will be extracted as keyphrases. KIP's learning function can enrich the glossary database by automatically adding newly identified keyphrases to the database.

Chapter III
Intelligent Text Mining: Putting Evolutionary Methods and Language Technologies Together ... 37
John Atkinson, Universidad de Concepción, Chile

This chapter proposes a text mining model to handle shallow text representation and processing for mining purposes in an integrated way. Its aim is to look for interesting explanatory knowledge across text documents. The proposed model involves a mixture of different techniques from evolutionary computation and other kinds of text mining methods.

Section II
Classification and Clustering

Chapter IV
Automatic Syllabus Classification Using Support Vector Machines ... 61
Xiaoyan Yu, Virginia Tech, USA
Manas Tungare, Virginia Tech, USA
Weigo Yuan, Virginia Tech, USA
Yubo Yuan, Virginia Tech, USA
Manuel Pérez-Quiñones, Virginia Tech, USA
Edward A. Fox, Virginia Tech, USA

Syllabi are important educational resources. Gathering syllabi that are freely available and creating useful services on top of the collection presents great value for the educational community. However, searching for a syllabus on the Web using a generic search engine is an error-prone process and often yields too many irrelevant links. In this chapter, we describe our empirical study on automatic syllabus classification using support vector machines (SVMs) to filter noise out of search results. We describe the various steps in the classification process, from training data preparation and feature selection to classifier building using SVMs. Empirical results are provided and discussed. We hope our reported work will also benefit people who are interested in building other genre-specific repositories.

Chapter V
Partially Supervised Text Categorization ... 75
Xiao-Li Li, Institute for Infocomm Research, Singapore
In traditional text categorization, a classifier is built using labeled training documents from a set of predefined classes. This chapter studies a different problem: partially supervised text categorization. Given a set P of positive documents of a particular class and a set U of unlabeled documents (which contains both hidden positive and hidden negative documents), we build a classifier using P and U to classify the data in U as well as future test data. The key feature of this problem is that there is no labeled negative document, which makes traditional text classification techniques inapplicable. In this chapter, we introduce the main techniques, S-EM, PEBL, Roc-SVM and A-EM, for solving the partially supervised problem. In many application domains, partially supervised text categorization is preferred since it saves on the labor-intensive effort of manually labeling negative documents.

Chapter VI
Image Classification and Retrieval with Mining Technologies ... 96
Yu-Jin Zhang, Tsinghua University, Beijing, China

Mining techniques can play an important role in automatic image classification and content-based retrieval. A novel method for image classification based on feature elements and association rule mining is presented in this chapter. The effectiveness of this method comes from two sides: the visual meanings of images can be well captured by discrete feature elements, and the associations between the description features and the image contents can be properly discovered with mining technology. Experiments with real images show that the new approach provides not only lower classification and retrieval error but also higher computational efficiency.

Chapter VII
Improving Techniques for Naïve Bayes Text Classifiers ... 111
Han-joon Kim, University of Seoul, Korea

This chapter introduces two practical techniques for improving Naïve Bayes text classifiers, which are widely used for text classification. Naïve Bayes has been evaluated to be a practical text classification algorithm due to its simple classification model, reasonable classification accuracy, and easy update of the classification model. Thus, many researchers have a strong incentive to improve Naïve Bayes by combining it with other meta-learning approaches such as EM (Expectation Maximization) and Boosting. The EM approach combines Naïve Bayes with the EM algorithm, and the Boosting approach uses Naïve Bayes as a base classifier in the AdaBoost algorithm. For both approaches, a special uncertainty measure fit for Naïve Bayes learning is used. In the Naïve Bayes learning framework, these approaches are expected to be practical solutions to the problem of a lack of training documents in text classification systems.

Chapter VIII
Using the Text Categorization Framework for Protein Classification ... 128
Ricco Rakotomalala, University of Lyon 2, France
Faouzi Mhamdi, University of Jandouba, Tunisia

In this chapter, we are interested in the classification of proteins starting from their primary structures. The goal is to automatically assign protein sequences to their families. The main originality of the approach is that we directly apply the text categorization framework to protein classification with very minor modifications.
The main steps of the task are clearly identified: we must extract features from the unstructured dataset, for which we use fixed-length n-gram descriptors; we select and combine the most relevant ones for the learning phase; and then we select the most promising learning algorithm in order to produce an accurate predictive model. We obtain essentially two main results. First, the approach is credible, giving accurate results with only 2-gram descriptors. Second, in our context where many irrelevant descriptors are automatically generated, we must combine aggressive feature selection algorithms and low-variance classifiers such as SVM (Support Vector Machine).

Chapter IX
Featureless Data Clustering ... 141
Wilson Wong, University of Western Australia, Australia
Wei Liu, University of Western Australia, Australia
Mohammed Bennamoun, University of Western Australia, Australia

Feature-based semantic measurements have played a dominant role in conventional data clustering algorithms for many existing applications. However, the applicability of existing data clustering approaches to a wider range of applications is limited due to issues such as the complexity involved in semantic computation, the long pre-processing time required for feature preparation, and the poor extensibility of semantic measurement due to non-incremental feature sources. This chapter first summarises the commonly used clustering algorithms and feature-based semantic measurements, and then highlights their shortcomings to make way for the proposal of an adaptive clustering approach based on featureless semantic measurements. The chapter concludes with experiments demonstrating the performance and wide applicability of the proposed clustering approach.

Chapter X
Swarm Intelligence in Text Document Clustering ... 165
Xiaohui Cui, Oak Ridge National Laboratory, USA
Thomas E. Potok, Oak Ridge National Laboratory, USA

In this chapter, we introduce three nature-inspired swarm intelligence clustering approaches for document clustering analysis. The major challenge of today's information society is that users are overwhelmed with information on any topic they search for. Fast and high-quality document clustering algorithms play an important role in helping users effectively navigate, summarize, and organize this overwhelming information. The swarm intelligence clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools, and ant food foraging. Compared to traditional clustering algorithms, swarm algorithms are usually flexible, robust, decentralized, and self-organized. These characteristics make swarm algorithms suitable for solving complex problems, such as document clustering.

Chapter XI
Some Efficient and Fast Approaches to Document Clustering ... 181
P. Viswanth, Indian Institute of Technology Guwahati, India
Bidyut Kr. Patra, Indian Institute of Technology Guwahati, India
V. Suresh Babu, Indian Institute of Technology Guwahati, India
Clustering is a process of finding the natural grouping present in a dataset. Various clustering methods have been proposed to work with various types of data. The quality of the solution as well as the time taken to derive it is important when dealing with large datasets like those in a typical document database. Recently, hybrid and ensemble-based clustering methods have been shown to yield better results than conventional methods. The chapter proposes two clustering methods: one based on a hybrid scheme and the other based on an ensemble scheme. Both are experimentally verified and are shown to yield better and faster results.

Chapter XII
SOM-Based Clustering of Textual Documents Using WordNet ... 189
Abdelmalek Amine, Djillali Liabes University, Algeria & Taher Moulay University Center, Algeria
Zakaria Elberrichi, Djillali Liabes University, Algeria
Michel Simonet, Joseph Fourier University, France
Ladjel Bellatreche, University of Poitiers, France
Mimoun Malki, Djillali Liabes University, Algeria

The classification of textual documents has been the subject of many studies. Technologies like the Web and digital libraries have facilitated the exponential growth of available documentation. The classification of textual documents is very important since it allows users to quickly browse and better understand the contents of large corpora. Most classification approaches use the supervised method of training, which is more suitable for small corpora and when human experts are available to generate the best classes of data for the training phase, which is not always feasible. Unsupervised classification, or "clustering", methods make latent (hidden) classes emerge automatically with minimal human intervention. There are many such algorithms, and the self-organizing map (SOM) by Kohonen is one of the algorithms for unsupervised classification that gathers a certain number of similar objects into groups without a priori knowledge. This chapter introduces the concept of unsupervised classification of textual documents and proposes an experiment with a conceptual approach for the representation of texts and the method of Kohonen for clustering.

Chapter XIII
A Multi-Agent Neural Network System for Web Text Mining ... 201
Lean Yu, Chinese Academy of Sciences, China & City University of Hong Kong, China
Shouyang Wang, Chinese Academy of Sciences, China
Kin Keung Lai, City University of Hong Kong, China

This chapter proposes a Web mining system based on a back-propagation neural network to support users in decision making. To handle the scalability issue of Web mining, the proposed system implements a multi-agent based neural network system that operates in parallel.
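
Chapter XIII's system is built around a back-propagation neural network. Purely as an illustrative sketch of that underlying learning rule (not the chapter's multi-agent, parallel Web mining system), the following trains a single one-hidden-layer network with plain gradient descent; the toy data, layer sizes, learning rate and epoch count are arbitrary choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, y, hidden=8, epochs=5000, lr=0.5, seed=0):
    """Train a one-hidden-layer back-propagation network for binary labels."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    for _ in range(epochs):
        # Forward pass.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Backward pass: gradient of mean squared error through each layer.
        err = out - y.reshape(-1, 1)
        d_out = err * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        # Gradient-descent updates.
        W2 -= lr * h.T @ d_out / len(X)
        b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_h / len(X)
        b1 -= lr * d_h.mean(axis=0)
    return W1, b1, W2, b2

# Toy usage: classify 2-D points (stand-ins for, e.g., reduced document features).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)           # XOR-style labels
W1, b1, W2, b2 = train_bp(X, y)
pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(np.round(pred.ravel(), 2))                  # network outputs for the four points
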
Section III
Database, Ontology, and the Web

Chapter XIV
Frequent Mining on XML Documents ... 227
Sangeetha Kutty, Queensland University of Technology, Australia
Richi Nayak, Queensland University of Technology, Australia

With the emergence of XML standardization, XML documents have been widely used and accepted in almost all major industries. As a result of this widespread usage, it has been considered essential not only to store these XML documents but also to mine them to discover useful information. One of the most popular techniques for mining XML documents is frequent pattern mining, which has huge potential in varied domains such as bioinformatics and network analysis. This chapter presents some of the existing techniques for discovering frequent patterns from XML documents. It also covers the applications and addresses the major issues in mining XML documents.

Chapter XV
The Process and Application of XML Data Mining ... 249
Richi Nayak, Queensland University of Technology, Australia

XML has gained popularity for information representation, exchange and retrieval. As XML material becomes more abundant, its heterogeneity and structural irregularity limit the knowledge that can be gained. The utilisation of data mining techniques becomes essential for improvement in XML document handling. This chapter presents the capabilities and benefits of data mining techniques in the XML domain, as well as a conceptualization of the XML mining process. It also discusses the techniques that can be applied to XML document structure and/or content for knowledge discovery.

Chapter XVI
Approximate Range Querying over Sliding Windows ... 273
Francesco Buccafurri, University "Mediterranea" of Reggio Calabria, Italy
Gianluca Caminiti, University "Mediterranea" of Reggio Calabria, Italy
Gianluca Lax, University "Mediterranea" of Reggio Calabria, Italy

In the context of knowledge discovery in databases, data reduction is a pre-processing step delivering succinct yet meaningful data to subsequent stages. If the target of mining is data streams, then it is crucial to reduce them suitably, since analyses on such data often require multiple scans. In this chapter, we propose a histogram-based approach to reducing sliding windows that supports approximate arbitrary (i.e., non-biased) range-sum queries. The histogram is based on a hierarchical structure (as opposed to the flat structure of traditional ones) and is thus suitable for directly supporting hierarchical queries, such as drill-down and roll-up operations. In particular, both sliding-window shifting and quick query answering operations are logarithmic in the sliding window size. Experimental analysis shows the superiority of our method in terms of accuracy w.r.t. the state-of-the-art approaches in the context of histogram-based sliding window reduction techniques.
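
Chapter XVI's central property is that both shifting the sliding window and answering a range-sum query take time logarithmic in the window size. The sketch below is meant only to convey that flavor under stated assumptions: it uses a standard Fenwick (binary indexed) tree over a circular buffer, which gives exact logarithmic-time range sums over a sliding window, rather than the approximate hierarchical histogram the chapter actually proposes; the class name, window size, and example stream are illustrative.

class SlidingWindowSums:
    """Exact range sums over the last `size` stream values, O(log size) per operation.
    Illustrative stand-in, not the chapter's approximate hierarchical histogram."""

    def __init__(self, size):
        self.size = size
        self.tree = [0.0] * (size + 1)   # Fenwick tree, 1-indexed
        self.buf = [0.0] * size          # circular buffer of current window values
        self.count = 0                   # total values seen so far

    def _add(self, i, delta):            # i is a 0-based buffer position
        i += 1
        while i <= self.size:
            self.tree[i] += delta
            i += i & (-i)

    def _prefix(self, i):                # sum of buffer positions 0..i-1
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def shift(self, value):
        """Append a new value; the oldest value drops out once the window is full."""
        pos = self.count % self.size
        self._add(pos, value - self.buf[pos])   # replace outgoing value in one update
        self.buf[pos] = value
        self.count += 1

    def range_sum(self, start, length):
        """Sum of `length` values ending `start` steps back from the newest one
        (assumes the queried span lies inside the current window)."""
        lo = (self.count - 1 - start - (length - 1)) % self.size
        hi = (self.count - 1 - start) % self.size
        if lo <= hi:
            return self._prefix(hi + 1) - self._prefix(lo)
        # The queried span wraps around the circular buffer: combine two pieces.
        return (self._prefix(self.size) - self._prefix(lo)) + self._prefix(hi + 1)

w = SlidingWindowSums(size=8)
for v in range(1, 13):                    # stream 1..12; the window now holds 5..12
    w.shift(float(v))
print(w.range_sum(start=0, length=4))     # 12 + 11 + 10 + 9 = 42.0
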
Chapter XVII
Slicing and Dicing a Linguistic Data Cube ... 288
Jan H. Kroeze, University of Pretoria, South Africa
Theo J. D. Bothma, University of Pretoria, South Africa
Machdel C. Matthee, University of Pretoria, South Africa

This chapter discusses the application of some data warehousing techniques to a data cube of linguistic data. The results of various modules of clausal analysis can be stored in a three-dimensional data cube in order to facilitate on-line analytical processing of the data by means of three-dimensional arrays. Slicing is one such analytical technique, which reveals various dimensions of data and their relationships to other dimensions. By using this data warehousing facility, the clause cube can be viewed or manipulated to reveal, for example, phrases and clauses, syntactic structures, semantic role frames, or a two-dimensional representation of a particular clause's multi-dimensional analysis in table format. These functionalities are illustrated by means of the Hebrew text of Genesis 1:1-2:3. The authors trust that this chapter will contribute towards efficient storage and advanced processing of linguistic data.

Chapter XVIII
Discovering Personalized Novel Knowledge from Text ... 301
Yi-fang Brook Wu, New Jersey Institute of Technology, USA
Xin Chen, Microsoft Corporation, USA

This chapter presents a methodology for personalized knowledge discovery from text. Traditional problems with text mining are that numerous rules are derived and many of them are already known to the user. Our proposed algorithm derives a user's background knowledge from a set of documents provided by the user, and exploits such knowledge in the process of knowledge discovery from text. Keywords are extracted from background documents and clustered into a concept hierarchy that captures the semantic usage of keywords and their relationships in the background documents. Target documents are retrieved by selecting documents that are relevant to the user's background. Association rules are discovered among noun phrases extracted from target documents. The novelty of an association rule is defined as the semantic distance between the antecedent and the consequent of the rule in the background knowledge. The experiment shows that our novelty measure performs better than support and confidence in identifying novel knowledge.

Chapter XIX
Untangling BioOntologies for Mining Biomedical Information ... 314
Catia Pesquita, University of Lisbon, Portugal
Daniel Faria, University of Lisbon, Portugal
Tiago Grego, University of Lisbon, Portugal
Francisco M. Couto, University of Lisbon, Portugal
Mário J. Silva, University of Lisbon, Portugal

Biomedical research generates a vast amount of information that is ultimately stored in scientific publications or in databases. The information in scientific texts is unstructured and thus hard to access, whereas the information in databases, although more accessible, often lacks contextualization.
The integration of information from these two kinds of sources is crucial for managing and extracting knowledge. By structuring and defining the concepts and relationships within a biomedical domain, BioOntologies have taken a key role in this integration. This chapter describes the role of BioOntologies in sharing, integrating and mining biological information, discusses some of the most relevant BioOntologies, and illustrates how they are being used by automatic tools to improve our understanding of life.

Chapter XX
Thesaurus-Based Automatic Indexing ... 331
Luis M. de Campos, University of Granada, Spain
Juan M. Fernández-Luna, University of Granada, Spain
Juan F. Huete, University of Granada, Spain
Alfonso E. Romero, University of Granada, Spain

In this chapter, we present a thesaurus application in the field of text mining, and more specifically automatic indexing on the set of descriptors defined by a thesaurus. We begin by presenting various definitions and a mathematical thesaurus model, and also describe various examples of real-world thesauri which are used in official institutions. We then explore the problem of thesaurus-based automatic indexing by describing its difficulties and distinguishing features and reviewing previous work in this area. Finally, we propose various lines of future research.

Chapter XXI
Concept-Based Text Mining ... 346
Stanley Loh, Lutheran University of Brazil, Brazil
Leandro Krug Wives, Federal University of Rio Grande do Sul, Brazil
Daniel Lichtnow, Catholic University of Pelotas, Brazil
José Palazzo M. de Oliveira, Federal University of Rio Grande do Sul, Brazil

The goal of this chapter is to present an approach to mining texts through the analysis of higher-level characteristics (called "concepts"), minimizing the vocabulary problem and the effort necessary to extract useful information. Instead of applying text mining techniques to terms or keywords that label texts or are extracted from them, the discovery process works over concepts extracted from texts. Concepts represent real-world attributes (events, objects, feelings, actions, etc.) and, as seen in discourse analysis, they help to understand the ideas and ideologies present in texts. A prior classification task is necessary to identify concepts inside the texts. After that, mining techniques are applied over the discovered concepts. The chapter will discuss different concept-based text mining techniques and present results from different applications.

Chapter XXII
Statistical Methods for User Profiling in Web Usage Mining ... 359
Marcello Pecoraro, University of Naples Federico II, Italy
Roberta Siciliano, University of Naples Federico II, Italy

This chapter aims at providing an overview of the use of statistical methods supporting Web usage mining.
Web mining committed to the study of how a Website is used. Then, the data (the object of the analysis) are detailed together with the problems linked to pre-processing. Once the data origin and their treatment for the correct development of a Web usage analysis have been clarified, the focus shifts to the statistical techniques that can be applied in this setting, with reference to binary segmentation methods. The latter allow discrimination through a response variable that determines the affiliation of users to a group by considering characteristics detected for those users. Chapter XXIII Web Mining to Identify People of Similar Background ..................................................................... 369 Quanzhi Li, Avaya, Inc, USA Yi-fang Brook Wu, New Jersey Institute of Technology, USA This chapter presents a new approach to mining the Web to identify people of similar background. To find similar people from the Web for a given person, the two major research issues are person representation and person matching. In this chapter, a person representation method which uses a person's personal Website to represent this person's background is proposed. Based on this person representation method, the main proposed algorithm integrates textual content and hyperlink information of all the Web pages belonging to a personal Website to represent a person and match persons. Other algorithms are also explored and compared to the main proposed algorithm. The evaluation methods and experimental results are presented. Chapter XXIV Hyperlink Structure Inspired by Web Usage ...................................................................................... 386 Pawan Lingras, Saint Mary's University, Canada Rucha Lingras, Saint Mary's University, Canada This chapter describes how Web usage patterns can be used to improve the navigational structure of a Website. The discussion begins with an illustration of visualization tools that study aggregate and individual link traversals. The use of data mining techniques such as classification, association, and sequence analysis to discover knowledge about Web usage, such as navigational patterns, is also discussed. Finally, a graph theoretic algorithm to create an optimal navigational hyperlink structure, based on known navigation patterns, is presented. The discussion is supported by analysis of real-world datasets. Chapter XXV Designing and Mining Web Applications: A Conceptual Modeling Approach .................................. 401 Rosa Meo, Università di Torino, Italy Maristella Matera, Politecnico di Milano, Italy This chapter surveys the usage of a modeling language, WebML, for the design and the management of dynamic Web applications. The chapter also reports a case study of the effectiveness of WebML and its conceptual modeling methods by analyzing Web logs. To analyze Web logs, the chapter utilizes the data mining paradigm of item sets and frequent patterns.
Volume II Chapter XXVI Web Usage Mining for Ontology Management ................................................................................ 418 Brigitte Trousse, INRIA Sophia Antipolis, France Marie-Aude Aufaure, INRIA Sophia and Supélec, France Bénédicte Le Grand, Laboratoire d'Informatique de Paris 6, France Yves Lechevallier, INRIA Rocquencourt, France Florent Masseglia, INRIA Sophia Antipolis, France This chapter proposes a novel approach for applying ontology to Web-based information systems. The technique adopted in the chapter is to discover new relationships among concepts extracted from Web logs by using ontology. The chapter also describes the effective usage of ontology for Web site reorganization. Chapter XXVII A Lattice-Based Framework for Interactively and Incrementally Mining Web Traversal Patterns ............................................................................................................................................... 448 Yue-Shi Lee, Ming Chuan University, Taiwan, ROC Show-Jane Yen, Ming Chuan University, Taiwan, ROC This chapter proposes efficient incremental and interactive data mining algorithms to discover Web traversal patterns and make the mining results satisfy the users' requirements. Incremental and interactive data mining helps to reduce unnecessary processing when the minimum support is changed or Web logs are updated. The chapter also reports that the proposed work is superior to other similar techniques. Chapter XXVIII Privacy-Preserving Data Mining on the Web: Foundations and Techniques ..................................... 468 Stanley R. M. Oliveira, Embrapa Informática Agropecuária, Brazil Osmar R. Zaïane, University of Alberta, Edmonton, Canada This chapter describes the foundations for research in privacy-preserving data mining on the Web. The chapter surveys the research problems, issues, and basic principles associated with privacy-preserving data mining. The chapter also introduces a taxonomy of the existing privacy-preserving data mining techniques and a discussion on how these techniques are applicable to Web-based applications. Section IV Information Retrieval and Extraction Chapter XXIX Automatic Reference Tracking ........................................................................................................... 483 G.S. Mahalakshmi, Anna University, Chennai, India S. Sendhilkumar, Anna University, Chennai, India
Automatic reference tracking involves the systematic tracking of the reference articles listed for a particular research paper by extracting the references of the input seed publication and further analyzing the relevance of each referred paper with respect to the seed paper. This tracking continues recursively, with every reference paper being treated as a seed paper at every track level, until the system finds irrelevant (or only distantly relevant) references deep within the reference tracks which do not help much in the understanding of the input seed research paper at hand. The relevance is analysed based on the keywords collected from the title and abstract of the referred article. The objective of the reference tracking system is to automatically list closely relevant reference articles to aid the understanding of the seed paper, thereby facilitating the literature survey of the aspiring researcher. This chapter proposes the system design and evaluation of an automatic reference tracking system and discusses the observations obtained. Chapter XXX Determination of Unithood and Termhood for Term Recognition ..................................................... 500 Wilson Wong, University of Western Australia, Australia Wei Liu, University of Western Australia, Australia Mohammed Bennamoun, University of Western Australia, Australia As more electronic text becomes readily available, and more applications become knowledge intensive and ontology-enabled, term extraction, also known as automatic term recognition or terminology mining, is increasingly in demand. This chapter first presents a comprehensive review of the existing techniques, discusses several issues and open problems that prevent such techniques from being practical in real-life applications, and then proposes solutions to address these issues. Keeping abreast of recent advances in related areas such as text mining, we propose new measures for the determination of unithood, and a new scoring and ranking scheme for measuring termhood to recognise domain-specific terms. The chapter concludes with experiments that demonstrate the advantages of our new approach. Chapter XXXI Retrieving Non-Latin Information in a Latin Web: The Case of Greek ............................................. 530 Fotis Lazarinis, University of Sunderland, UK Over 60% of the online population are non-English speakers and it is probable that the number of non-English speakers is growing faster than that of English speakers. Most search engines were originally engineered for English. They do not take full account of inflectional semantics nor, for example, diacritics or the use of capitals. The main conclusion from the literature is that searching using non-English and non-Latin based queries results in lower success and requires additional user effort so as to achieve acceptable recall and precision. In this chapter a Greek query log is morphologically and grammatically analyzed and a number of queries are submitted to search engines and their relevance is evaluated with the aid of real users. A Greek meta-searcher redirecting normalized queries to Google.gr is also presented and evaluated. An increase in relevance is reported when stopwords are eliminated and queries are normalized based on their morphology.
Chapter XXXII Latent Semantic Analysis and Beyond ............................................................................................... 546 Anne Kao, The Boeing Phantom Works, USA Steve Poteet, The Boeing Phantom Works, USA Jason Wu, The Boeing Phantom Works, USA William Ferng, The Boeing Phantom Works, USA Rod Tjoelker, The Boeing Phantom Works, USA Lesley Quach, The Boeing Phantom Works, USA Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI), when applied to information retrieval, has been a major analysis approach in text mining. It is an extension of the vector space method in information retrieval, representing documents as numerical vectors but using a more sophisticated mathematical approach to characterize the essential features of the documents and reduce the number of features in the search space. This chapter summarizes several major approaches to this dimensionality reduction, each of which has strengths and weaknesses, and it describes recent breakthroughs and advances. It shows how the constructs and products of LSA applications can be made user-interpretable and reviews applications of LSA beyond information retrieval, in particular, to text information visualization. Chapter XXXIII Question Answering Using Word Associations .................................................................................. 571 Ganesh Ramakrishnan, IBM India Research Labs, India Pushpak Bhattacharyya, IIT Bombay, India First-generation text mining systems such as categorizers and query retrievers hinged largely on word-level statistics and provided a wonderful first-cut approach. However, systems based on simple word-level statistics quickly saturate in performance, despite the best data mining and machine learning algorithms. This problem can be traced to the fact that, typically, naive, word-based feature representations are used in text applications, which prove insufficient in bridging two types of chasms within and across documents, viz. the lexical chasm and the syntactic chasm. The latest wave in text mining technology has been marked by research that makes the extraction of subtleties from the underlying meaning of text a possibility. In the following two chapters, we pose the problem of extracting underlying meaning from text documents, coupled with world knowledge, as a problem of bridging these chasms by exploiting associations between entities. The entities are words or word collocations from documents. We utilize two types of entity associations, viz. paradigmatic (PA) and syntagmatic (SA). We present first-tier algorithms that use these two word associations in bridging the syntactic and lexical chasms. We also propose second-tier algorithms in two sample applications, viz. question answering and text classification, which use the first-tier algorithms. Our contribution lies in the specific methods we introduce for exploiting the entity association information present in WordNet, dictionaries, corpora and parse trees for improved performance in text mining applications.
Chapter XXXIV The Scent of a Newsgroup: Providing Personalized Access to Usenet Sites through Web Mining ........................................................................................................................................ 604 Giuseppe Manco, Italian National Research Council, Italy Riccardo Ortale, University of Calabria, Italy Andrea Tagarelli, University of Calabria, Italy This chapter surveys well-known Web content mining techniques that can be used for addressing the problem of providing personalized access to the contents of a Usenet community. It also discusses how the end results of the knowledge discovery process from Usenet sites are utilized by individual users. Section V Application and Survey Chapter XXXV Text Mining in Program Code ............................................................................................................ 626 Alexander Dreweke, Friedrich-Alexander University Erlangen-Nuremberg, Germany Ingrid Fischer, University of Konstanz, Germany Tobias Werth, Friedrich-Alexander University Erlangen-Nuremberg, Germany Marc Wörlein, Friedrich-Alexander University Erlangen-Nuremberg, Germany Searching for frequent pieces in a database with some sort of text is a well-known problem. A special sort of text is program code, e.g. C++ or machine code for embedded systems. Filtering out duplicates in large software projects leads to more understandable programs and helps avoid mistakes when reengineering the program. On embedded systems the size of the machine code is an important issue. To ensure small programs, duplicates must be avoided. Several different approaches exist for finding code duplicates, based either on the text representation of the code or on graphs representing the data and control flow of the program in combination with graph mining algorithms. Chapter XXXVI A Study of Friendship Networks and Blogosphere ............................................................................ 646 Nitin Agarwal, Arizona State University, USA Huan Liu, Arizona State University, USA Jianping Zhang, MITRE Corporation, USA In Golbeck and Hendler (2006), the authors consider those social friendship networking sites where users explicitly provide trust ratings to other members. However, for large social friendship networks it is infeasible to assign trust ratings to each and every member, so they propose an inference mechanism that assigns binary trust ratings (trustworthy/non-trustworthy) to those who have not been assigned one. They demonstrate the use of these trust values in the e-mail filtering application domain and report encouraging results. The authors also assume three crucial properties of trust for their approach to work: transitivity, asymmetry, and personalization. These trust scores are often transitive, meaning that if
Alice trusts Bob and Bob trusts Charles, then Alice can trust Charles. Asymmetry says that for two people involved in a relationship, trust is not necessarily identical in both directions. This is contrary to what was proposed in Yu and Singh (2003), who assume symmetric trust values in the social friendship network. Social networks allow us to share experiences, thoughts, opinions, and ideas. Members of these networks, in return, experience a sense of community, a feeling of belonging, and a bond in which members matter to one another and their needs will be met through being together. Individuals expand their social networks, convene groups of like-minded individuals and nurture discussions. In recent years, computers and World Wide Web technologies have pushed social networks to a whole new level. This has made it possible for individuals to connect with each other beyond geographical barriers in a "flat" world. The widespread awareness and pervasive usability of social networks can be partially attributed to Web 2.0. Representative interaction Web services of social networks are social friendship networks, the blogosphere, social and collaborative annotation (aka "folksonomies"), and media sharing. In this work, we briefly introduce each of these with a focus on social friendship networks and the blogosphere. We analyze and compare their varied characteristics, research issues, state-of-the-art approaches, and the challenges these social networking services have posed in community formation, evolution and dynamics, the emergence of reputable experts and influential members of the community, information diffusion in social networks, the clustering of communities into meaningful groups, collaboration recommendation, and mining "collective wisdom" or "open source intelligence" from the vast amount of available user-generated content. We present a comparative study, put forth subtle yet essential differences between research in friendship networks and the blogosphere, and shed light on their potential research directions and on the cross-pollination of these two fertile domains of ever-expanding social networks on the Web. Chapter XXXVII An HL7-Aware Decision Support System for E-Health ................................................................... 670 Pasquale De Meo, Università degli Studi Mediterranea di Reggio Calabria, Italy Giovanni Quattrone, Università degli Studi Mediterranea di Reggio Calabria, Italy Domenico Ursino, Università degli Studi Mediterranea di Reggio Calabria, Italy In this chapter we present an information system conceived to support managers of Public Health Care Agencies in deciding which new health care services to propose. Our system is HL7-aware; in fact, it uses the HL7 (Health Level Seven) standard (Health Level Seven [HL7], 2007) to effectively handle the interoperability among different Public Health Care Agencies. HL7 provides several functionalities for the exchange, the management and the integration of data concerning both patients and health care services. Our system appears particularly suited for supporting a rigorous and scientific decision making activity, taking a large variety of factors and a great amount of heterogeneous information into account. Chapter XXXVIII Multitarget Classifiers for Mining in Bioinformatics .........................................................................
684 Diego Liberati, Istituto di Elettronica e Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, Politecnico di Milano, Italy Building effective multi-target classifiers is still an on-going research issue: this chapter proposes the use of the knowledge gleaned from a human expert as a practical way of decomposing and extending the
proposed binary strategy. The core is a greedy feature selection approach that can be used in conjunction with different classification algorithms, leading to a feature selection process that works independently of whichever classifier is then used. The procedure takes advantage of the Minimum Description Length principle for selecting features and promoting the accuracy of multi-target classifiers. Its effectiveness is assessed by experiments with different state-of-the-art classification algorithms, such as Bayesian and Support Vector Machine classifiers, over datasets publicly available on the Web: gene expression data from DNA micro-arrays are selected as a paradigmatic example, containing many redundant features due to the large number of monitored genes and the small cardinality of samples. Therefore, in analysing these data, as in text mining, a major challenge is the definition of a feature selection procedure that highlights the most relevant genes in order to improve automatic diagnostic classification. Chapter XXXIX Current Issues and Future Analysis in Text Mining for Information Security Applications .............. 694 Shuting Xu, Virginia State University, USA Xin Luo, Virginia State University, USA Text mining is an instrumental technology that today's organizations can employ to extract information and further evolve and create valuable knowledge for more effective knowledge management. It is also an important tool in the arena of information systems security (ISS). While a plethora of text mining research has been conducted in search of revamped technological developments, relatively limited attention has been paid to the applicable insights of text mining in ISS. In this chapter, we address a variety of technological applications of text mining to security issues. The techniques are categorized according to the types of knowledge to be discovered and the text formats to be analyzed. Privacy issues of text mining as well as future trends are also discussed. Chapter XL Collaborative Filtering Based Recommendation Systems ................................................................. 708 E. Thirumaran, Indian Institute of Science, India M. Narasimha Murty, Indian Institute of Science, India This chapter introduces collaborative filtering-based recommendation systems, which have become an integral part of e-commerce applications, as can be observed in sites like Amazon.com. It presents several techniques that are reported in the literature to make useful recommendations, and studies their limitations. The chapter also lists the issues that are currently open and the future directions that may be explored to address those issues. Furthermore, the authors hope that an understanding of these limitations and issues will help build recommendation systems that are of high accuracy and have few false positive errors (which are products that are recommended even though the user does not like them).
Chapter XLI Performance Evaluation Measures for Text Mining ........................................................................... 724 Hanna Suominen, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland Sampo Pyysalo, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland Marketta Hiissa, Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland Filip Ginter, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland Shuhua Liu, Academy of Finland, Finland & Åbo Akademi University, Finland Dorina Marghescu, Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland Tapio Pahikkala, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland Barbro Back, Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland Helena Karsten, Turku Centre for Computer Science (TUCS), Finland & Åbo Akademi University, Finland Tapio Salakoski, Turku Centre for Computer Science (TUCS), Finland & University of Turku, Finland The purpose of this chapter is to provide an overview of prevalent measures for evaluating the quality of system output in seven key text mining task domains. For each task domain, a selection of widely used, well applicable measures is presented, and their strengths and weaknesses are discussed. Performance evaluation is essential for text mining system development and comparison, but the selection of a suitable performance evaluation measure is not a straightforward task. Therefore, this chapter also attempts to give guidelines for measure selection. As measures are under constant development in many task domains and it is important to take the task domain characteristics and conventions into account, references to relevant performance evaluation events and literature are provided. Chapter XLII Text Mining in Bioinformatics: Research and Application ................................................................ 748 Yanliang Qi, New Jersey Institute of Technology, USA The biology literature has grown exponentially in recent years. Researchers need an effective tool to help them find the information they need in the databases. Text mining is a powerful tool to solve this problem. In this chapter, we discuss the features of text mining and bioinformatics, text mining applications, research methods in bioinformatics, and open problems and future directions. Chapter XLIII Literature Review in Computational Linguistics Issues in the Developing Field of Consumer Informatics: Finding the Right Information for Consumer's Health Information Need .................... 758 Ki Jung Lee, Drexel University, USA
With the increased use of the Internet, a large number of consumers first consult online resources for their healthcare decisions. The problem of the existing information structure primarily lies in the fact that the vocabulary used in consumer queries is intrinsically different from the vocabulary represented in medical literature. Consequently, medical information retrieval often provides poor search results. Since consumers make medical decisions based on the search results, building an effective information retrieval system becomes an essential issue. By reviewing the foundational concepts and application components of medical information retrieval, this chapter will contribute to a body of research that seeks appropriate answers to a question like "How can we design a medical information retrieval system that can satisfy consumers' information needs?" Chapter XLIV A Survey of Selected Software Technologies for Text Mining .......................................................... 766 Richard S. Segall, Arkansas State University, USA Qingyu Zhang, Arkansas State University, USA This chapter presents background on text mining, and comparisons and summaries of seven selected software packages for text mining. The text mining software packages selected for discussion and comparison in this chapter are: Compare Suite by AKS-Labs, SAS Text Miner, Megaputer Text Analyst, Visual Text by Text Analysis International, Inc. (TextAI), Megaputer PolyAnalyst, WordStat by Provalis Research, and SPSS Clementine. This chapter not only discusses unique features of these text mining software packages but also compares the features offered by each in the following key steps in analyzing unstructured qualitative data: data preparation, data analysis, and result reporting. A brief discussion of Web mining and its software is also presented, along with conclusions and future trends. Chapter XLV Application of Text Mining Methodologies to Health Insurance Schedules ...................................... 785 Ah Chung Tsoi, Monash University, Australia Phuong Kim To, Tedis P/L, Australia Markus Hagenbuchner, University of Wollongong, Australia This chapter describes the application of several text mining techniques to discover patterns in the health insurance schedule with an aim to uncover any inconsistency or ambiguity in the schedule. Based on the survey, this chapter experiments with classification and clustering on full features and reduced features using the latent semantic kernel algorithm. The results show that the LSK algorithm works well on Health Insurance Commission schedules. Chapter XLVI Web Mining System for Mobile-Phone Marketing ............................................................................ 807 Miao-Ling Wang, Minghsin University of Science & Technology, Taiwan, ROC Hsiao-Fan Wang, National Tsing Hua University, Taiwan, ROC This chapter proposes a Web mining system that incorporates both online efficiency and off-line effectiveness to provide the right information based on users' preferences. The proposed system is applied to the Web site marketing of mobile phones. Through this case study, this chapter demonstrates that a
query-response containing a reasonable number of mobile phones best matching a user's preferences can be provided. Chapter XLVII Web Service Architectures for Text Mining: An Exploration of the Issues via an E-Science Demonstrator ...................................................................................................................................... 822 Neil Davis, The University of Sheffield, UK George Demetriou, The University of Sheffield, UK Robert Gaizauskas, The University of Sheffield, UK Yikun Guo, The University of Sheffield, UK Ian Roberts, The University of Sheffield, UK Text mining technology can be used to assist in finding relevant or novel information in large volumes of unstructured data, such as that which is increasingly available in the electronic scientific literature. However, publishers are not text mining specialists, nor typically are the end-user scientists who consume their products. This situation suggests a Web services based solution, where text mining specialists process the literature obtained from publishers and make their results available to remote consumers (research scientists). In this chapter we discuss the integration of Web services and text mining within the domain of scientific publishing and explore the strengths and weaknesses of three generic architectural designs for delivering text mining Web services. We argue for the superiority of one of these and demonstrate its viability by reference to an application designed to provide access to the results of text mining over the PubMed database of scientific abstracts.
Foreword
I am delighted to see this new book on text and Web mining technologies edited by Min Song and Yi-fang Brook Wu. Text and Web mining is a new and exciting area where information systems, computer science, and information science contribute to solving the crisis of information overload by integrating techniques and methods from data mining, machine learning, natural language processing, information retrieval, and knowledge management. This is one of the most active and exciting areas of the database research community. Researchers in areas such as statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp its extraordinary progress over the last few years. Unfortunately, there have been relatively few complete and comprehensive books on text and Web mining technologies. This book presents every important aspect of text and Web mining. One of the most impressive aspects of this book is its broad coverage of challenges and issues in text and Web mining. It presents a comprehensive discussion of the state-of-the-art in text and Web mining. In addition to providing an in-depth examination of core text and Web mining algorithms and operations, the book examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches. Finally, the book explores current real-world, mission-critical applications of text and Web mining in various fields such as bioinformatics, business intelligence, genomics research and counter-terrorism activities. The field is evolving very rapidly, but this book is a quick way to learn the basic ideas and to understand where the field is today. I found it very informative and stimulating, and I expect you will too. Those of us who are interested in text and Web mining will benefit from this handbook. In particular, this book will serve as an excellent reference book for graduate-level courses such as advanced topics in information systems and text/Web mining. With this new book, I believe that many others will get to share the extensive and deep insights of the contributing authors on text and Web mining. Xiaohua Hu
Preface
With an abundance of textual information on the Web, text and Web mining is increasingly important. Although search technologies have matured and getting relevant documents or Web pages is no longer difficult, information overload has never ceased to be a roadblock for users. More advanced text applications are needed to bring out novel and useful information or knowledge hidden in the sea of documents. The purpose of this handbook is to present the most recent advances and surveys of applications in text and Web mining, which should be of interest to researchers and end-users alike. With that in mind, we invited submissions to the Handbook of Research on Text and Web Mining Technologies. Based on the content, we organized this handbook into five sections which represent the major topic areas in text and Web mining. Section titles and their highlights:
Section I, Document Preprocessing, concerns the steps for obtaining key textual elements and their weights before mining occurs. This section covers the various operations that transform text for the next step, including lexical analysis, elimination of functional words, stemming, identification of key terms and phrases, and document representation.
Section II, Classification and Clustering, discusses two popular mining methods and their applications in text and Web mining. In this section, we present state-of-the-art classification and clustering techniques applied to several interesting problem domains such as syllabus, protein, and image classification.
Section III, Database, Ontology and the Web, presents topics relating to three types of objects and their use in the mining process, either as data or as supplemental information to improve mining performance. This section presents a variety of research issues and problems associated with databases, ontologies, and the Web from a text and Web mining perspective.
Section IV, Information Retrieval and Extraction, illustrates how mining techniques can be used to enhance the performance of information retrieval and extraction. This section presents how text and Web mining techniques contribute to resolving difficult problems of information retrieval and extraction.
Section V, Application and Survey, concludes the book with surveys of the latest research and end-user applications.
Each chapter opens with an overview and concludes with references. It will be beneficial for readers of this handbook to have a basic understanding of natural language processing and college statistics, since we consider both subjects the foundation of text and Web mining. The research in this area is developing rapidly. Therefore, this handbook is by no means the ultimate research report on what text and Web mining can achieve. It is hoped that this handbook will serve as a catalyst for innovative ideas and thus make exciting research in this and complementary research areas fruitful in the near future. Min Song & Yi-fang Brook Wu Newark, NJ June, 2008
Acknowledgment
The editors of this book thank Dr. Michael Bieber, the chair of the Information Systems Department, who gave us constant encouragement and great inspiration. The publication of this book would not have been possible without the focused and dedicated efforts of all contributing authors and reviewers. We sincerely thank all of them. The editors would also like to thank Yanliang Qi for managing correspondence with authors and reviewers.
About the Editors
Min Song is an assistant professor in the Department of Information Systems at NJIT. He received his MS from the School of Information Science at Indiana University in 1996 and his PhD in Information Systems from Drexel University in 2005. Dr. Song has a background in text mining, bioinformatics, information retrieval and information visualization. Dr. Song received the Drexel Dissertation Award in 2005. Min's work received an honorable mention award at the 2006 Greater Philadelphia Bioinformatics Symposium. In addition, the paper entitled "Extracting and Mining Protein-Protein Interaction Network from Biomedical Literature" received the best paper award at the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, held in San Diego, USA, October 7-8, 2004. Another paper, entitled "Ontology-based Scalable and Portable Information Extraction System to Extract Biological Knowledge from Huge Collection of Biomedical Web Documents", was nominated for the best paper award at the 2004 IEEE/ACM Web Intelligence Conference, held in Beijing, China, September 20-24, 2004. Yi-fang Brook Wu is an associate professor in the Information Systems Department at New Jersey Institute of Technology. Her current research interests include: text mining, information extraction, knowledge organization, information retrieval, and natural language processing. Her projects have been supported by the National Science Foundation and the Institute of Museum and Library Services. Her research has appeared in journals such as Journal of the American Society for Information Science and Technology, Journal of Biomedical Informatics and Information Retrieval.
Section I
Document Preprocessing
Chapter I
On Document Representation and Term Weights in Text Classification Ying Liu The Hong Kong Polytechnic University Hong Kong SAR, China
Abstract

In automated text classification, a bag-of-words representation followed by tfidf weighting is the most popular approach to convert textual documents into numeric vectors for the induction of classifiers. In this chapter, we explore the potential of enriching the document representation with semantic information systematically discovered at the document sentence level. The salient semantic information is searched for using a frequent word sequence method. Different from the classic tfidf weighting scheme, a probability based term weighting scheme which directly reflects a term's strength in representing a specific category has been proposed. The experimental study based on the semantic enriched document representation and the newly proposed probability based term weighting scheme has shown a significant improvement over the classic approach, i.e., bag-of-words plus tfidf, in terms of F-score. This study encourages us to further investigate the possibility of applying the semantic enriched document representation over a wide range of text based mining tasks.
INTRODUCTION

Text classification (TC) is the task of categorizing documents into predefined thematic categories. In particular, it aims to find the mapping ξ from a set of documents D: {d1, …, di} to a set of thematic categories C: {C1, …, Cj}, i.e. ξ: D → C. In its current study, which is dominated by supervised learning,
the construction of a text classifier is often conducted in two main phases (Debole & Sebastiani, 2003; Sebastiani, 2002):

1. Document indexing: the creation of numeric representations of documents.
• Term selection: to select a subset of terms from all terms occurring in the collection to represent the documents in a sensible way, either to facilitate computing or to achieve the best effectiveness in classification.
• Term weighting: to assign a numeric value to each term in order to weight its contribution, which helps a document stand out from others.
2. Classifier induction: the building of a classifier by learning from the numeric representations of documents.

In the literature, the most popular approach to document indexing is probably the bag-of-words (BoW) approach, i.e. each unique word in the document is considered as the smallest unit to convey information (Manning & Schütze, 1999; Sebastiani, 2002; van_Rijsbergen, 1979), followed by the tfidf weighting scheme, i.e. term frequency (tf) times inverse document frequency (idf) (Salton & Buckley, 1988; Salton & McGill, 1983). Numerous machine learning algorithms have been applied to the task of text classification. These include, but are not limited to, Naïve Bayes (Lewis, 1992a; Rennie, Shih, Teevan, & Karger, 2003a), decision trees and decision rules (Apté, Damerau, & Weiss, 1994, 1998), artificial neural networks (Ng, Goh, & Low, 1997; Ruiz & Srinivasan, 2002), k-nearest neighbor (kNN) (Baoli, Qin, & Shiwen, 2004; Han, Karypis, & Kumar, 2001; Yang & Liu, 1999), Bayesian networks (Friedman, Geiger, & Goldszmidt, 1997; Heckerman, 1997), and, more recently, support vector machines (SVM) (Burges, 1998; Joachims, 1998; Vapnik, 1999). Researchers and professionals from relevant areas, e.g. machine learning, information retrieval and natural language processing, constantly introduce new algorithms, test data sets, benchmark results, etc. A comprehensive review of text classification was given by Sebastiani (2002). While text classification has emerged as a fast growing area, particularly due to the involvement of machine learning based algorithms, some key questions remain. Although it is generally understood that word sequences, e.g. "finite state machine", "machine learning" and "supply chain management", convey more semantic information in representing textual documents than single terms, they have seldom been explored in either text classification or clustering. In fact, a systematic approach to generate such high quality word sequences automatically has been lacking, and the rich semantic information resident in word sequences has been ignored. Although the tfidf weighting scheme performs well, particularly in the task of Web based information search and retrieval, these weights do not directly indicate the closeness of the thematic relation between a term and a specific category. In this chapter, we mainly focus on phase 1 in text classification, i.e. document indexing. We are concerned with two key issues:
• Q1: What should be a term? What should be considered as the basic unit carrying document contents?
• Q2: How to compute term weights effectively to improve the classification performance?
On question 1, we depart from the classic BoW approach by introducing more salient and meaningful word sequences into the document representation. These sequences, named Frequent Word Sequences, are automatically discovered from documents at the sentence level. The strength of this method is that it employs a versatile technique for finding sequential text phrases from full text, allowing, if desired, gaps between the words in a phrase. The key idea is to involve more salient semantic information in the document representation. On question 2, we depart from the classic tfidf approach and propose a probability based term weighting scheme based on observations and inspiration from feature selection in TC. The essential idea is that a feature will gain more weight if it appears more frequently in positive training examples than in negative ones, which directly indicates its closeness or strength of membership with respect to that specific category. For the convenience of reading, this chapter has been organized into three main sections. In the first section, we focus on the status quo of document representation and introduce the concept of Maximal Frequent word Sequence (MFS). We further explain how MFS can be implemented to solicit high quality word sequences in a systematic way. An illustration example follows. In the second section, we briefly review the classic term weighting scheme, i.e. tfidf and some of its variants, which are widely applied in text classification, with full attention on the probability based term weighting that we have proposed. Finally, the last section reports in detail the experimental studies and results of text classification based on the joint approach of the semantic enriched document representation and our new term weighting scheme.
TERMS IN DOCUMENT REPRESENTATION

What Should be a Term?

Borrowed from information retrieval, the most widely accepted document representation model in text classification is probably the vector space model (Baeza-Yates & Ribeiro-Neto, 1999; Jurafsky & Martin, 2000; Manning & Schütze, 1999; Sebastiani, 2002; van_Rijsbergen, 1979), i.e. a document di is represented as a vector of term weights vi = (w1i, w2i, ..., w|G|i), where G is the collection of terms that occur at least once in the document collection D. The vector space model is often referred to as the bag-of-words (BoW) approach to document representation when each single word in the text is considered as a term. Besides the vector space model, some other commonly used approaches are the Boolean model, the probability model (Fuhr, 1985; Robertson, 1977), the inference network model (Turtle & Croft, 1989), and the statistical language model. Essentially, a statistical language model is concerned with the probabilities of word sequences, denoted as P(S: w1, w2, ..., wn) (Manning & Schütze, 1999). These sequences can be phrases, clauses and sentences. Their probabilities are often estimated from a large text corpus. As reported, statistical language modeling has been successfully applied to many domains, such as its original application in speech recognition and part-of-speech tagging (Charniak, 1993), information retrieval (Hiemstra, 1998; Miller, Leek, & Schwartz, 1999; Ponte & Croft, 1998) and spoken language understanding (Zue, 1995). In practice, the most widely used statistical language model is the n-gram model (Manning & Schütze, 1999), such as:
• Unigram: P(S: w1, w2, ..., wn) = P(w1)P(w2)...P(wn)
• Bigram: P(S: w1, w2, ..., wn) = P(w1)P(w2|w1)...P(wn|wn-1)
• Trigram: P(S: w1, w2, ..., wn) = P(w1)P(w2|w1)P(w3|w1,w2)...P(wn|wn-2,wn-1)
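As an illustration only (this code is not from the chapter), the following minimal Python sketch estimates bigram probabilities by maximum-likelihood counting over a toy tokenized corpus; the corpus contents and function name are hypothetical.

```python
from collections import Counter, defaultdict

def bigram_model(sentences):
    """Maximum-likelihood estimate of P(w_n | w_{n-1}) from tokenized sentences.
    Sentence-boundary markers and smoothing are omitted to keep the sketch short."""
    unigram_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for tokens in sentences:
        unigram_counts.update(tokens)
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[prev][curr] += 1

    def prob(prev, curr):
        # P(curr | prev) = count(prev, curr) / count(prev)
        return bigram_counts[prev][curr] / unigram_counts[prev] if unigram_counts[prev] else 0.0

    return prob

# toy corpus: here P("databases" | "knowledge") = 1.0
corpus = [["product", "knowledge", "databases"],
          ["product", "data", "in", "knowledge", "databases"]]
p = bigram_model(corpus)
print(p("knowledge", "databases"))
```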
The n-gram model is formed based on the observation that the occurrence of a specific word may be affected by its immediately preceding words. In the unigram model, words are assumed to be independent. Hence, the probability of a word sequence S is approximated by the product of all individual words' probabilities. In the bigram and trigram models, more context information is taken into consideration, i.e. the probability of a word depends on its previous one or two words, respectively. Built upon the BoW approach, various studies have been reported that refine the document representation. Basically, there are two major streams. One stream mainly works on the reduction of term size. This is largely because in the classic BoW approach each unique word that has occurred at least once in D will be considered as a feature to represent the documents. As a result, the dimension of the term space, even for a small document collection, can easily reach a few thousand. The tradeoff between the computational constraints imposed by the high dimensionality of the input data and the richness of information it provides to maximally identify each individual object is well known in classification. Therefore, term reduction has been introduced to capture the salient information by choosing the most important terms, and hence make the computing tasks tractable. The prevailing approaches include term selection (also known as feature selection in the machine learning community) (Forman, 2003; Galavotti, Sebastiani, & Simi, 2000; Mladenic & Grobelnik, 1998; Ng et al., 1997; Scott & Matwin, 1999; Sebastiani, 2002; Taira & Haruno, 1999; Yang & Pedersen, 1997; Zheng, Wu, & Srihari, 2004), latent semantic indexing/analysis (Cai & Hofmann, 2003; Cristianini, Shawe-Taylor, & Lodhi, 2002; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Dumais, Platt, Heckerman, & Sahami, 1998; Manning & Schütze, 1999; Weigend, Wiener, & Pedersen, 1999; Wiener, Pedersen, & Weigend, 1995) and, lastly, term clustering (Lewis, 1992a; Li & Jain, 1998). Another stream aims to bring more sophisticated representations to text classification. These include research using n-grams, phrases and co-occurring words to enrich the document representation (Apté et al., 1994; Baker & McCallum, 1998; Maria Fernanda Caropreso, Matwin, & Sebastiani, 2001; Maria F Caropreso, Matwin, & Sebastiani, 2002; Dumais et al., 1998; Kongovi, Guzman, & Dasigi, 2002; Lewis, 1992a; Sahlgren & Cöster, 2004; Tan, Wang, & Lee, 2002) and the Darmstadt Indexing Approach (DIA) (Fuhr & Buckley, 1991) which takes into account various different properties of terms, documents and categories. However, the experimental results of adopting more sophisticated representations are not uniformly encouraging when compared to the BoW approach (Sebastiani, 2002). While we observe that single words are not the only units that convey thematic information, and sequences like "finite state machine", "machine learning" and "supply chain management" are surely in a better position to represent the content of sentences and documents, such sequences are rarely integrated into the current study of text classification. As a matter of fact, they are in a superior position to supply the critical information that differentiates a category from others. We have examined the previous work using phrases or bigrams as indexes (Lewis, 1992a, 1992b). We repeated their procedures to generate phrases and bigrams on several datasets. Two major problems are noted.
In the first place, we observed that there are many phrases irrelevant to the themes of the documents. We suspect that using these irrelevant phrases as indexes does not give documents a better representation. Secondly, the number of indexing terms, i.e. the dimension of the document features, has been dramatically expanded.
The well-known problem of data sparsity in document vectors becomes even worse. This alerts us that word sequences should be carefully selected.
Frequent Word Sequences

In chasing better performance in text classification, we are concerned with how to discover a fine set of word sequences which embody the salient content information of each document. In our work, we extend Ahonen's work on finding Maximal Frequent word Sequences (MFS) in a textual dataset (Ahonen-Myka, 1999). An MFS is a sequence of words that is "frequent in the document collection and, moreover, that is not contained in any other longer frequent sequence." A word sequence is frequent if it appears in at least σ documents, where σ is a pre-specified support threshold. The goal of the MFS algorithm is to find all maximal frequent phrases in the textual dataset. The strength of the MFS method is that it employs a versatile technique for finding sequential text phrases from full text, allowing, if desired, gaps between the words in a phrase. For example, the word sequence "product knowledge databases" can be extracted as a frequent phrase even if its occurrence is in the form of:
• "…product management using knowledge databases…"
• "…product data in knowledge databases…"
• "…product specifications, knowledge databases…"
in the supporting documents of the document collection. The maximum gap allowed between the words in a sequence is determined by the maximal word gap parameter g. In its original application, the MFSs discovered acted as content descriptors of the documents in which they occurred (Ahonen-Myka, Heinonen, Klemettinen, & Verkamo, 1999; Yap, Loh, Shen, & Liu, 2006). These descriptors are compact but human-readable, and have the potential to be used in the subsequent analysis of documents. Previously, we have applied MFS to topic detection (Yap et al., 2006) and text summarization (Zhan, Loh, & Liu, 2007). The limitation of the current MFS method and its applications is that the MFSs must be generated and supported based on a group of documents which are assumed to be conceptually similar, e.g. sharing the same category labels or thematic topics. This assumption is often weak in scenarios where documents are not labeled, e.g., text classification based on the semi-supervised learning approach and text clustering. In order to exploit what an individual document really contains, we propose an extended approach that selects frequent single words and generates MFSs simultaneously at the document sentence level. Our method aims to provide a uniform model primarily from the perspective of the document itself. We believe that the MFSs found at the sentence level within a document will provide more clues with respect to its thematic information. Essentially, we are interested in using the discovered MFSs as the salient semantic information that renders the document theme. Through adjusting the support σ and the gap g, our aim is to search for an optimal set of MFSs to represent the most important content of the document. We notice that when we increase the gap g, resulting in longer word sequences, more MFS phrases are generated. While this may increase the coverage of topics, many of them are irrelevant to the category themes. This is alleviated by increasing the support threshold σ.
Figure 1. The difference between the BoW approach and the new representation enriched by MFSs as the salient semantic features
Document Representation with Salient Semantic Features

In the BoW approach, documents are basically defined by all single words which have occurred at least once in the document collection. In our work, documents are represented by a set of single words and the MFSs found, where these MFSs are supposed to capture the salient semantic information of the document. Figure 1 visualizes the difference between these two approaches. The particular question is how to search for these single terms and especially the MFSs, and what should be included. Algorithm 1 gives the details with respect to this question. It starts with a set of sentences Sj = {sj1, sj2, ..., sjn} of document dj. Based on the pre-specified sentence support threshold σ, a set of frequent single words Gj = {tj1, tj2, ..., tjn} is selected. Given the pre-specified word gap g, Gj is extended to a set of ordered word pairs. For instance, suppose g = 1 and the original sentence sjn comprises the words "ABCDEFGHI." After the identification of frequent single words, only the words "ABCEFI" are left for sjn. Therefore, the ordered pairs arising out of sjn would be 'AB', 'AC', 'BC', 'CE', and 'EF'. Pairs like 'BE' and 'CF' are not considered because the number of words between their members exceeds the g parameter. Each ordered pair is then stored in a hash data structure, along with its occurrence information, such as location and frequency. This is repeated for all sentences in dj, with a corresponding update of the hash. Thereafter, each ordered pair in the hash is examined, and the pairs that are supported by at least σ sentences are considered frequent. This set of frequent pairs is named Grams2.

Algorithm 1. Generation of the semantic enriched document representation
Input: Sj = {sj1, sj2, ..., sjn}: a set of n pre-processed sentences in document dj
σ: sentence support
g: maximal word gap
1. for all sentences s ∈ Sj identify the frequent single words t, Gj ← t;
2. expand Gj to Grams2, the frequent ordered word pairs
// Discovery phase
3. k = 2;
4. MFSj = null;
5. while Gramsk not empty
6.     for all seq ∈ Gramsk
7.         if seq is frequent
8.             if seq is not a subsequence of some Seq ∈ MFSj
                   // Expand phase: expand frequent seq
9.                 max = expand(seq);
10.                MFSj = MFSj ∪ max;
11.                if max = seq
12.                    remove seq from Gramsk
13.        else
14.            remove seq from Gramsk
       // Join phase: generate set of (k + 1)-seqs
15.    Gramsk+1 = join(Gramsk);
16.    k = k + 1;
17. return (Gj + MFSj)

The next phase, the Discovery phase, forms the main body of Algorithm 1. It is an iteration of gram expansion for the grams in the current Gramsk, and gram joining, to form Gramsk+1. Only grams that are frequent and not subsequences of the previously discovered MFSs are considered suitable for expansion. The latter condition is in place to avoid a rediscovery of MFSs that have already been found. This Expand-Join iteration continues until an empty gram-set is produced from a Join phase. We further break the Discovery phase into the Expand phase and the Join phase. In the Expand phase, every possibility of expansion of an input word sequence seq from Gramsk is explored. The expansion process continues, for that particular input word sequence seq, until the resulting sequence is no longer frequent. The last frequent sequence achieved in the expansion, seq', will be an MFS by definition, and it will be stored together with its occurrence information. This process of gram expansion and information recording continues for every suitable gram in Gramsk. Subsequently, the Join phase follows, which consists of a simple join operation amongst the grams left in Gramsk, to form Gramsk+1, i.e. the set of grams that are of length (k+1). When an empty gram set is produced from the Join phase, namely, no more grams are left for further expansion, the set of MFSj and Gj is returned for document dj. This process continues for dj+1.
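As a rough illustration only — not the authors' implementation — the following Python sketch covers just the opening steps of Algorithm 1: selecting single words supported by at least σ sentences and building the Grams2 set of gapped ordered pairs. The Expand and Join phases are omitted, and the function name and parameter defaults are hypothetical.

```python
from collections import defaultdict

def frequent_gapped_pairs(sentences, sigma=2, g=1):
    """Return ordered word pairs supported by >= sigma sentences, where at most
    g words of the original sentence may lie between the two members of a pair."""
    # 1. single words supported by at least sigma sentences
    support = defaultdict(set)
    for idx, sent in enumerate(sentences):
        for w in set(sent):
            support[w].add(idx)
    frequent = {w for w, s in support.items() if len(s) >= sigma}

    # 2. ordered pairs of frequent words, obeying the maximal word gap g
    pair_support = defaultdict(set)
    for idx, sent in enumerate(sentences):
        positions = [(i, w) for i, w in enumerate(sent) if w in frequent]
        for a_pos, a in positions:
            for b_pos, b in positions:
                if a_pos < b_pos <= a_pos + g + 1:   # at most g intervening words
                    pair_support[(a, b)].add(idx)
    return {pair for pair, s in pair_support.items() if len(s) >= sigma}

# Toy check against the chapter's "ABCDEFGHI" example (second sentence invented
# so that A, B, C, E, F, I are the frequent words): yields AB, AC, BC, CE, EF.
sentences = [list("ABCDEFGHI"), list("ABCXEFYIZ")]
print(frequent_gapped_pairs(sentences, sigma=2, g=1))
```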
An Illustration Example In this subsection we give an example to illustrate how the proposed document representation can be accomplished. Given two documents dj and dj+1 which are actually labeled under the same category topic, Figure 2 shows their details.
Figure 2. Two original documents, dj and dj+1
Figure 3. dj and dj+1 after preprocessing
Immediately, we preprocess the documents. This includes various standard procedures that are commonly utilized in text processing, such as lexical analysis (also known as tokenization), stop word removal, number and punctuation removal, and word stemming (Porter, 1980). Please note that the sentence segmentation has been retained. Figure 3 shows the details of dj and dj+1 after preprocessing. In the classic BoW approach, document processing, which often involves the aforementioned steps, will generally stop here before the document is sent for term weighting, i.e. to convert the document from its textual form into a numeric vector (Baeza-Yates & Ribeiro-Neto, 1999; Salton & McGill, 1983; Sebastiani, 2002). Given the sentence support σ = 2 and the word gap g = 1, Figure 4 gives the details of the document representation with MFSs for dj and dj+1. After the generation of the semantic enriched document representation, the number of indexing terms has been reduced from 80 in the BoW approach to 29. The details are presented in Figures 5 and 6, respectively.
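For illustration only, here is a minimal sketch of the preprocessing pipeline described above (tokenization, stop word and number/punctuation removal, Porter stemming, with sentence boundaries retained). It assumes the NLTK library with its punkt and stopwords data installed; it is not the authors' own pipeline, and the function name is hypothetical.

```python
import re
from nltk.tokenize import sent_tokenize, word_tokenize   # requires nltk "punkt" data
from nltk.corpus import stopwords                        # requires nltk "stopwords" data
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEM = PorterStemmer()

def preprocess(document):
    """Return one token list per sentence; sentence boundaries are kept because
    MFS support is later counted per sentence."""
    sentences = []
    for sent in sent_tokenize(document.lower()):
        # keep alphabetic tokens only (drops numbers and punctuation)
        tokens = [t for t in word_tokenize(sent) if re.fullmatch(r"[a-z]+", t)]
        tokens = [STEM.stem(t) for t in tokens if t not in STOP]
        if tokens:
            sentences.append(tokens)
    return sentences
```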
HOW TO COMPUTE THE TERM WEIGHTS

The Classic Term Weighting Schemes

Term weighting has long been formulated as term frequency times inverse document frequency, i.e. tfidf (Baeza-Yates & Ribeiro-Neto, 1999; Salton & Buckley, 1988; Salton & McGill, 1983;
Figure 4. Semantic enriched document representation, where the sentence support is two and the word gap is one for the MFS generation
Figure 5. Indexing terms in the semantic enriched document representation for dj and dj+1
Figure 6. Indexing terms in the BoW modeling approach for dj and dj+1
van Rijsbergen, 1979). The more popular "ltc" form (Baeza-Yates & Ribeiro-Neto, 1999; Salton & Buckley, 1988; Salton & McGill, 1983) is given by

$tfidf(t_i, d_j) = tf(t_i, d_j) \times \log \frac{N}{N(t_i)}$    (1)

and its normalized version is

$w_{i,j} = \frac{tfidf(t_i, d_j)}{\sqrt{\sum_{k=1}^{|T|} tfidf(t_k, d_j)^2}}$    (2)
where N and |T| denote the total number of documents and unique terms contained in the collection respectively, N(t_i) represents the number of documents in the collection in which term t_i occurs at least once, and

$tf(t_i, d_j) = \begin{cases} 1 + \log(n(t_i, d_j)), & \text{if } n(t_i, d_j) > 0 \\ 0, & \text{otherwise} \end{cases}$

where n(t_i, d_j) is the number of times that the term t_i occurs in document d_j. In practice, the summation in equation (2) runs only over the terms that occur in document d_j. The significance of the classic term weighting schemes in equations (1) and (2) is that they embody three fundamental assumptions about term frequency distribution in a collection of documents (Debole & Sebastiani, 2003; Sebastiani, 2002). These assumptions are:

• Rare terms are no less important than frequent terms – the idf assumption
• Multiple appearances of a term in a document are no less important than a single appearance – the tf assumption
• For the same quantity of term matching, long documents are no more important than short documents – the normalization assumption
Because of this, the "ltc" scheme and its normalized form have been extensively studied by many researchers and have shown good performance over a number of different data sets (Dumais & Chen, 2000; Forman, 2003; Lewis, Yang, Rose, & Li, 2004; Sebastiani, 2002; Yang & Liu, 1999). Therefore, they have become the default choice in text classification.
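For reference, a small Python sketch of the "ltc" weighting in equations (1) and (2) follows; it assumes the document-frequency table has been computed beforehand and uses our own function names.

```python
import math
from collections import Counter

def ltc_vector(doc_terms, doc_freq, n_docs):
    """Cosine-normalised 'ltc' weights for one document.
    doc_terms: list of (stemmed) terms in the document;
    doc_freq:  term -> number of documents containing it, i.e. N(t_i);
    n_docs:    total number of documents N."""
    counts = Counter(doc_terms)
    raw = {t: (1 + math.log(n)) * math.log(n_docs / doc_freq[t])
           for t, n in counts.items()}                      # equation (1), tf = 1 + log n
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}   # equation (2)
```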
The Probability Based Term Weighting Scheme

In text classification, feature selection serves as a key procedure to reduce the dimensionality of the input data, e.g. the number of index terms, in order to save computational cost. In the literature, numerous feature selection methods, e.g. information gain, mutual information, chi-square, odds ratio and so on, have been intensively studied to distill the important terms while keeping the dimension small (Forman, 2003; Ng et al., 1997; Ruiz & Srinivasan, 2002; Yang & Pedersen, 1997). When these methods are applied to text classification for term selection purposes, they basically utilize the four fundamental information elements shown in Table 1: A denotes the number of documents belonging to category c_i where the term t_k occurs at least once; B denotes the number of documents not belonging to category c_i where the term t_k occurs at least once; C denotes the number of documents belonging to category c_i where the term t_k does not occur; D denotes the number of documents not belonging to category c_i where the term t_k does not occur. While many researchers regard term weighting schemes of the tfidf form as embodying the three aforementioned assumptions, we understand tfidf in a much simpler manner:
• Local weight: The tf term, either normalized or not, specifies the weight of t_k within a specific document, which is basically estimated based on the frequency or relative frequency of t_k within this document.
• Global weight: The idf term, either normalized or not, defines the contribution of t_k to a specific document in a global sense.

Table 1. Fundamental information elements used for feature selection in text classification

              c_i      not c_i
  t_k          A          B
  not t_k      C          D
If we temporarily ignore how tfidf is defined, and focus on the core problem, i.e. whether this document belongs to this category, we realize that a set of terms is needed to represent the documents effectively and a reference framework is also required to make the comparison possible. As previous research shows that tf is very important (Leopold & Kindermann, 2002; Salton & Buckley, 1988; Sebastiani, 2002) and using tf alone can already achieve good performance, we retain the tf term. Now, let us consider idf, i.e. the global weighting of t_k. The conjecture is that if term selection can effectively differentiate a set of terms T_k out from all terms T to represent category c_i, then it is desirable to transform that difference into some sort of numeric value for further processing. Our approach is to replace the idf term with a value that reflects the term's strength in representing a specific category. Since this procedure is performed jointly with the category membership, this basically implies that the weights of T_k are category specific. Therefore, the only problem left is how to compute such values. We decide to compute those term values using the most direct information, e.g. A, B and C, and to combine them in a sensible way which is different from existing feature selection measures. From Table 1, two important ratios which directly indicate a term's relevance with respect to a specific category are noted, i.e. A/B and A/C:

• A/B: it is easy to understand that if term t_k is highly relevant to category c_i only, which basically says that t_k is a good feature to represent category c_i, then the value of A/B tends to be higher.
• A/C: given two terms t_k, t_l and a category c_i, the term with a higher value of A/C will be the better feature to represent c_i, since a larger portion of its occurrences is with category c_i.
In the remainder of this article, we call A/B and A/C relevance indicators, since these two ratios immediately indicate the term's strength in representing a category. In fact, these two indicators are nicely supported by probability estimates. For instance, A/B can be rewritten as (A/N)/(B/N), where N is the total number of documents, A/N is the probability estimate of documents from category c_i where term t_k occurs at least once, and B/N is the probability estimate of documents not from category c_i where term t_k occurs at least once. In this manner, A/B can be interpreted as a relevance indicator of term t_k with respect to category c_i. Clearly, the higher the ratio, the more strongly the term t_k is related to category c_i. A similar analysis can be made with respect to A/C. The ratio reflects the expectation that a term
is deemed more relevant if it occurs in a larger portion of the documents from category c_i than other terms do. Since the computation of both A/B and A/C has an intrinsic connection with the probability estimates of category membership, we propose a new term weighting factor which utilizes the aforementioned two relevance indicators to replace idf in the classic tfidf weighting scheme. Considering the probability implication of A/B and A/C, the most immediate choice is to take the product of these two ratios. Finally, the proposed weighting scheme is formulated as
$tf \cdot \log\!\left(1 + \frac{A}{B} \cdot \frac{A}{C}\right)$

(Liu, Loh, Kamal, & Tor, 2006).
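A compact sketch of this category-specific weight is given below. The counts A, B and C follow Table 1; the small epsilon guard against empty cells is our addition, since the chapter does not say how zero counts are handled.

```python
import math
from collections import Counter

def abc_counts(doc_terms, labels, category):
    """Per-term A, B, C counts for one category, as in Table 1.
    doc_terms: list of term lists, one per document; labels: document categories."""
    A, B = Counter(), Counter()
    n_pos = sum(1 for lab in labels if lab == category)
    for terms, lab in zip(doc_terms, labels):
        for t in set(terms):
            (A if lab == category else B)[t] += 1
    C = {t: n_pos - A[t] for t in set(A) | set(B)}
    return A, B, C

def prob_weight(term, tf, A, B, C, eps=1.0):
    """tf * log(1 + (A/B) * (A/C)), the probability based replacement for tfidf."""
    a, b, c = A[term], B[term], C.get(term, 0)
    return tf * math.log(1 + (a / max(b, eps)) * (a / max(c, eps)))
```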
EXPERIMENTAL STUDIES AND RESULTS

Two data sets were tested in our experiments, i.e. MCV1 and Reuters-21578. MCV1 is an archive of 1434 English-language manufacturing-related engineering papers gathered by courtesy of the Society of Manufacturing Engineers (SME). It combines all engineering technical papers published by SME from 1998 to 2000. All documents were manually classified (Liu, Loh, & Tor, 2004). There are a total of 18 major categories in MCV1. Figure 7 gives the class distribution in MCV1. Reuters-21578 is a widely used benchmarking collection (Sebastiani, 2002). We followed Sun's approach (Sun, Lim, Ng, & Srivastava, 2004) in generating the category information. Figure 8 gives the class distribution of the Reuters dataset used in our experiment.
Figure 7. Category details and class distribution in MCV1
Unlike Sun et al. (2004), we did not randomly sample negative examples from categories not belonging to any of the categories in our data set; instead, we treated examples not from the target category in our dataset as negatives. We compared the proposed semantic enriched document representation with the classic BoW approach on MCV1 and Reuters-21578, using tfidf and the probability based term weighting scheme respectively. Three groups of experimental results are reported:
• BoW plus tfidf: BoW+TFIDF.
• BoW plus probability based term weighting: BoW+Prob.
• Semantic enriched representation plus probability based term weighting: (BoW+MFS)+Prob.
A Bayesian classifier, i.e. Complement Naïve Bayes (CompNB) (Rennie, Shih, Teevan, & Karger, 2003b), and support vector machines (SVM) (Vapnik, 1999) were chosen as the classification algorithms. CompNB has recently been reported to significantly improve the performance of Naïve Bayes over a number of well-known datasets, including Reuters-21578 and 20 Newsgroups. Various correction steps are adopted in CompNB, e.g. data transformation, better handling of word occurrence dependencies and so on. In our experiments, we used the implementation in the Weka 3.5.3 Developer version (Witten & Frank, 2005). For SVM, we chose the well-known implementation SVMlight (Joachims, 1998, 2001). A linear kernel function was adopted, since previous work has shown that the linear function can deliver very good performance without tedious parameter tuning in text classification (Dumais & Chen, 2000; Joachims, 1998). As for the performance measurement, precision, recall and their harmonic combination, i.e. the F1 value (Baeza-Yates & Ribeiro-Neto, 1999; van Rijsbergen, 1979), were calculated. Performance was assessed based on five-fold cross validation. Since we are very concerned about the performance of every category, we report the overall performance in a macro-averaged manner, i.e. macro-averaged F1, to avoid the bias against minor categories in imbalanced data that is associated with micro-averaged scores (Sebastiani, 2002; Yang & Liu, 1999).
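For clarity, macro-averaging simply averages the per-category scores with equal weight, as in the short sketch below (our own helper, independent of Weka or SVMlight).

```python
def macro_scores(per_category):
    """per_category: list of (tp, fp, fn) tuples, one per category.
    Returns macro-averaged precision, recall and F1."""
    ps, rs, fs = [], [], []
    for tp, fp, fn in per_category:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(per_category)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```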
Figure 8. Category details and class distribution in Reuters-21578
Figure 9. F1 scores of BoW plus tfidf weighting, BoW plus probability based term weighting, and semantic enriched document representation plus probability based term weighting using CompNB and SVM respectively over MCV1
Figure 10. F1 scores of BoW plus tfidf weighting, BoW plus probability based term weighting, and semantic enriched document representation plus probability based term weighting using CompNB and SVM respectively over Reuters-21578
Major standard text preprocessing steps were applied in our experiments, including tokenization, stop word and punctuation removal, and stemming. However, feature selection was skipped for the SVM experiments, and all terms left after stop word and punctuation removal and stemming were kept as features. Figures 9 and 10 show the overall performance of the aforementioned three combinations, i.e. BoW+TFIDF, BoW+Prob and (BoW+MFS)+Prob, using CompNB and SVM over the MCV1 and Reuters-21578 data respectively. Our first observation is that the proposed scheme of probability based term weights is able to outperform tfidf over both data sets using both SVM and the Bayesian classifier. The results of BoW+TFIDF on Reuters-21578 are in line with the literature (Sun et al., 2004). Table 2 presents the macro-averaged F1 values of the three combinations tested over the two data sets. We note that using our proposed weighting scheme can significantly improve the overall performance, by 5% to more than 12%, over the classic representation. Surprisingly, we also note that when the probability based term weighting is implemented, CompNB delivers a result on Reuters-21578 which is very close to the best one that SVM can achieve using the classic tfidf scheme. This implies the great potential of using CompNB as a state-of-the-art classifier. As shown in Figure 7 and Figure 8, both MCV1 and Reuters-21578 are actually skewed data sets. In MCV1, there are six categories that own only around 1% of the text population each, and 11 categories fall below the average. The same is true of the Reuters-21578 data set: while it has 13 categories, the two major categories, grain and crude, share around half of the text population, and in total the numbers of supporting documents of eight categories fall below the average. Previous literature did not report success stories over these minor categories (Sun et al., 2004; Yang & Liu, 1999). A close analysis of our experimental studies shows that the probability based scheme generates much better results on minor categories in both MCV1 and Reuters-21578, regardless of the classifier used. For all minor categories shown in both figures, we observed a sharp increase in performance when the system's weighting method switches from tfidf to the probability based one. Table 3 reveals more insights with respect to the system performance. In general, we observe that using the probability based term weighting scheme can greatly enhance the systems' recall. Although it falls slightly below tfidf in terms of precision using SVM, it still improves the precision of CompNB, far beyond the figures that tfidf can deliver.
Table 2. Macro-averaged F1 scores of BoW+TFIDF, BoW+Prob and (BoW+MFS)+Prob using CompNB and SVM over MCV1 and Reuters-21578

MCV1              BoW+TFIDF   BoW+Prob   (BoW+MFS)+Prob
  SVM             0.6729      0.7553     0.7860
  CompNB          0.4517      0.5653     0.6400

Reuters-21578     BoW+TFIDF   BoW+Prob   (BoW+MFS)+Prob
  SVM             0.8381      0.8918     0.9151
  CompNB          0.6940      0.8120     0.8554
Table 3. Macro-averaged precision and recall of BoW+TFIDF, BoW+Prob and (BoW+MFS)+Prob using CompNB and SVM over MCV1 and Reuters-21578

Precision                       BoW+TFIDF   BoW+Prob   (BoW+MFS)+Prob
  MCV1             SVM          0.8355      0.7857     0.8200
                   CompNB       0.4342      0.6765     0.7105
  Reuters-21578    SVM          0.8982      0.8803     0.9162
                   CompNB       0.5671      0.7418     0.7837

Recall                          BoW+TFIDF   BoW+Prob   (BoW+MFS)+Prob
  MCV1             SVM          0.6006      0.7443     0.7648
                   CompNB       0.4788      0.5739     0.6377
  Reuters-21578    SVM          0.7935      0.9080     0.9164
                   CompNB       0.9678      0.9128     0.9470
For SVM, while the averaged precision of tfidf in MCV1 is 0.8355, about 5% higher than the probability based scheme's, its averaged recall is only 0.6006, far less than the probability based scheme's 0.7443. The case with Reuters-21578 is even more impressive: using SVM, while the averaged precision of tfidf is 0.8982, only 1.8% higher than the probability based scheme's, the averaged recall of the probability based scheme reaches 0.9080, compared to tfidf's 0.7935. Overall, the probability based weighting scheme surpasses tfidf in terms of F1 values over both data sets.
Figure 11. The MFS examples of various categories in MCV1
From Figures 9 and 10 and Table 2, we further observe that when the document representation is switched from BoW to the semantic enriched model, i.e. BoW+MFS, there is a further increase in performance, notably from 2% to 7%, using CompNB and SVM over both data sets. We also note that CompNB performs very well if we adopt the (BoW+MFS)+Prob approach. For the Reuters data, its top performance using the (BoW+MFS)+Prob approach has overtaken the best one using the classic model, i.e. BoW+TFIDF. Our examination reveals that the MFSs found are of sound quality for representing the key content of a document. Document vectors possess more effective semantic information and inter-links when these MFSs are incorporated in the document representation. For instance, in the BoW approach documents centered on "Forming & Fabricating", "Assembly & Joining" and "Welding" are usually dominated by a few frequent words, such as "process", "metal" and "manufacturing". These words do not offer an effective means to represent the differences among these documents. In contrast, the integration of MFSs, e.g. "draw process", "assembly process", "laser cut process", "weld process", and "sheet metal forming process", into their document representations carries more clues to link up similar texts. As shown in Table 3, we believe that it is largely for this reason that the precision based on the semantic enriched model is better than that based on BoW+Prob, while the recall has not been jeopardized and is actually often better than its BoW+Prob counterpart. This is particularly true for SVM. Figure 11 shows some examples of MFSs found in various categories of MCV1.
CONCLUSION

Automated text classification is a task of great interest to researchers and professionals. It has witnessed a booming interest in the last decade, largely due to the increased availability of texts in digital form, the ensuing need to organize them, and its generic nature as a classification problem. In this chapter, we have reviewed the status quo of document indexing in text classification. A joint approach which couples a semantic enriched document representation with a probability based term weighting scheme is proposed. The semantic enriched document representation involves a systematic frequent word sequence search at the document sentence level, and the probability based term weighting scheme directly reflects a term's strength in representing different thematic categories. The experimental studies based on three types of document indexing approaches, i.e. BoW+TFIDF, BoW+Prob, and (BoW+MFS)+Prob, have demonstrated the merit of the proposed joint approach.
Acknowledgment

The work described in this chapter was supported by a grant from the Research Grants Council of the Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China (Project No. G-YF59).
References

Ahonen-Myka, H. (1999). Finding All Frequent Maximal Sequences in Text. Paper presented at the Proceedings of the 16th International Conference on Machine Learning ICML-99 Workshop on Machine Learning in Text Data Analysis, J. Stefan Institute, Ljubljana.
Ahonen-Myka, H., Heinonen, O., Klemettinen, M., & Verkamo, A. I. (1999). Finding Co-occurring Text Phrases by Combining Sequence and Frequent Set Discovery. Paper presented at the Proceedings of 16th International Joint Conference on Artificial Intelligence IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications. Apté, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems (TOIS), 12(3), 233-251. Apté, C., Damerau, F., & Weiss, S. M. (1998). Text Mining with Decision Trees and Decision Rules. Paper presented at the Conference on Automated Learning and Discovery, Carnegie-Mellon University. Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Boston, MA, USA: AddisonWesley Longman Publishing Co., Inc. Baker, L. D., & McCallum, A. K. (1998). Distributional clustering of words for text classification. Paper presented at the Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, Melbourne, Australia. Baoli, L., Qin, L., & Shiwen, Y. (2004). An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP), 3(4), 215-226. Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121-167. Cai, L., & Hofmann, T. (2003). Text categorization by boosting automatically extracted concepts. Paper presented at the Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, Toronto, Canada. Caropreso, M. F., Matwin, S., & Sebastiani, F. (2001). A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In A. G. Chin (Ed.), Text Databases and Document Management: Theory and Practice (pp. 78-102): Idea Group Publishing. Caropreso, M. F., Matwin, S., & Sebastiani, F. (2002). Statistical Phrases in Automated Text Categorization (No. 2000-B4-007). Charniak, E. (1993). Statistical Language Learning. Cambridge, Massachusetts: MIT Press. Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2002). Latent Semantic Kernels. Journal of Intelligent Information Systems, 18(2-3), 127-152 Debole, F., & Sebastiani, F. (2003). Supervised term weighting for automated text categorization. Paper presented at the Proceedings of the 2003 ACM symposium on Applied computing, Melbourne, Florida, USA. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Indexing. Journal of the American Society for Information Science, 41(6), 391-407. Dumais, S., & Chen, H. (2000). Hierarchical classification of Web content. Paper presented at the Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR2000), Athens, Greece.
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. Paper presented at the Proceedings of the seventh international conference on Information and knowledge management, Bethesda, Maryland, United States. Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, Special Issue on Variable and Feature Selection, 3, 12891305. Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29(2-3), 131-163. Fuhr, N. (1985). A Probabilistic Model of Dictionary Based Automatic Indexing. Paper presented at the Proceedings of the riao 85 (Recherche d’ Informations Assistee par Ordinateur), Grenoble, France. Fuhr, N., & Buckley, C. (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems (TOIS), 9(3), 223-248. Galavotti, L., Sebastiani, F., & Simi, M. (2000). Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization. Paper presented at the Proceedings of {ECDL}-00, 4th European Conference on Research and Advanced Technology for Digital Libraries, Lisbon, Portugal. Han, E.-H., Karypis, G., & Kumar, V. (2001). Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. Paper presented at the Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2001, Hong Kong, China. Heckerman, D. (1997). Bayesian Networks for Data Mining. Data Mining and Knowledge Discovery, 1, 79-119. Hiemstra, D. (1998). A Linguistically Motivated Probabilistic Model of Information Retrieval. Paper presented at the Research and Advanced Technology for Digital Libraries, Second European Conference, ECDL ‘98, Heraklion, Crete, Greece. Joachims, T. (1998). Text categorization with Support Vector Machines: Learning with many relevant features. Paper presented at the Machine Learning: ECML-98, Tenth European Conference on Machine Learning, Berlin, Germany. Joachims, T. (2001). A Statistical Learning Model of Text Classification with Support Vector Machines. Paper presented at the Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New Orleans, Louisiana, United States. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition: Prentice Hall. Kongovi, M., Guzman, J. C., & Dasigi, V. (2002). Text Categorization: An Experiment Using Phrases. Paper presented at the Proceedings of the 24th BCS-IRSG European Colloquium on IR Research: Advances in Information Retrieval, Glasgow, Scotland. Leopold, E., & Kindermann, J. (2002). Text Categorization with Support Vector Machines - How to Represent Texts in Input Space. Machine Learning, 46(1-3), 423-444.
Lewis, D. D. (1992a). An evaluation of phrasal and clustered representations on a text categorization task. Paper presented at the Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval. Lewis, D. D. (1992b). Representation and learning in information retrieval. University of Massachusetts Amherst, USA. Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361-397. Li, Y. H., & Jain, A. K. (1998). Classification of Text Documents. The Computer Journal, 41(8), 537546. Liu, Y., Loh, H. T., Kamal, Y.-T., & Tor, S. B. (2006). Handling of Imbalanced Data in Text Classification: Category Based Term Weights. In A. Kao & S. Poteet (Eds.), Text Mining and Natural Language Processing (pp. 173-195): Springer London. Liu, Y., Loh, H. T., & Tor, S. B. (2004). Building a Document Corpus for Manufacturing Knowledge Retrieval. Paper presented at the Proceedings of the Singapore MIT Alliance Symposium, Singapore. Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Boston, USA: The MIT Press. Miller, D. R. H., Leek, T., & Schwartz, R. M. (1999). A Hidden Markov Model Information Retrieval System. Paper presented at the Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, Berkeley, US. Mladenic, D., & Grobelnik, M. (1998). Feature selection for classification based on text hierarchy. Paper presented at the Proceedings of the Workshop on Learning from Text and the Web, Conference on Automated Learning and Discovery CONALD-98. Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perception learning, and a usability case study for text categorization. Paper presented at the ACM SIGIR Forum , Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, Philadelphia, Pennsylvania, United States. Ponte, J. M., & Croft, W. B. (1998). A Language Modeling Approach to Information Retrieval. Paper presented at the Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia. Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137. Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003a). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Paper presented at the Proceedings of the Twentieth International Conference on Machine Learning (ICML). Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003b). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Paper presented at the Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, USA.
Robertson, S. E. (1977). The Probability Ranking Principle in IR. Journal of documentation, 33(4), 294-304. Ruiz, M. E., & Srinivasan, P. (2002). Hierarchical Text Categorization Using Neural Networks. Information Retrieval, 5(1), 87-118. Sahlgren, M., & Cöster, R. (2004). Using Bag-of-Concepts to Improve the Performance of Support Vector Machines in Text Categorization. Paper presented at the Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva. Salton, G., & Buckley, C. (1988). Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5), 513-523. Salton, G., & McGill, M. J. (1983). Introduction to Modern Information Retrieval. New York, USA: McGraw-Hill. Scott, S., & Matwin, S. (1999). Feature Engineering for Text Classification. Paper presented at the Proceedings of ICML-99, 16th International Conference on Machine Learning, San Francisco, US. Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR), 34(1), 1-47. Sun, A., Lim, E.-P., Ng, W.-K., & Srivastava, J. (2004). Blocking Reduction Strategies in Hierarchical Text Classification. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(10), 1305-1308. Taira, H., & Haruno, M. (1999). Feature Selection in SVM Text Categorization. Paper presented at the Proceedings of the Sixteenth National Conference on Artificial Intelligence and Eleventh Conference on on Innovative Applications of Artificial Intelligence, AAAI/IAAI’99, Orlando, Florida. Tan, C.-M., Wang, Y.-F., & Lee, C.-D. (2002). The use of bigrams to enhance text categorization. Information Processing & Management, 38(4), 529-546. Turtle, H., & Croft, W. B. (1989). Inference networks for document retrieval. Paper presented at the Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval, Brussels, Belgium. van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). London, UK: Butterworths. Vapnik, V. N. (1999). The Nature of Statistical Learning Theory (2nd ed.). New York: Springer-Verlag. Weigend, A. S., Wiener, E. D., & Pedersen, J. O. (1999). Exploiting hierarchy in text categorization. Information Retrieval, 1(3), 193-216. Wiener, E. D., Pedersen, J. O., & Weigend, A. S. (1995). A neural network approach to topic spotting. Paper presented at the Proceedings of {SDAIR}-95, 4th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US. Witten, I. H., & Frank, E. (2005). Data Mining: Practical machine learning tools and techniques (2nd ed.). San Francisco, Calif. USA: Morgan Kaufmann.
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. Paper presented at the Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, California, United States. Yang, Y., & Pedersen, J. O. (1997). A Comparative Study on Feature Selection in Text Categorization. Paper presented at the Proceedings of ICML-97, 14th International Conference on Machine Learning. Yap, I., Loh, H. T., Shen, L., & Liu, Y. (2006). Topic Detection Using MFSs. Paper presented at the Proceedings of the 19th International Conference on Industrial & Engineering Applications of Artificial Intelligence & Expert Systems (IEA\AIE 2006), LNCS 4031, Annecy France. Zhan, J., Loh, H. T., & Liu, Y. (2007). Automatic Summarization of Online Customer Reviews. Paper presented at the Proceedings of International Conference on Web Information Systems and Technologies (WEBIST) 2007, Barcelona, Spain. Zheng, Z., Wu, X., & Srihari, R. (2004). Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter: Special issue on learning from imbalanced datasets, 6(1), 80-89. Zue, V. W. (1995). Navigating the Information Superhighway Using Spoken Language Interfaces. IEEE Expert: Intelligent Systems and Their Applications, 10(5), 39-43.
Key Terms

Document Representation: Document representation is concerned with how textual documents should be represented in various tasks, e.g. text processing, retrieval, and knowledge discovery and mining. Its prevailing approach is the vector space model, i.e. a document d_i is represented as a vector of term weights $v_i = (w_{1i}, w_{2i}, \ldots, w_{|G|i})$, where G is the collection of terms that occur at least once in the document collection D.

Term Weighting Scheme: Term weighting is a process to compute and assign a numeric value to each term in order to weight its contribution in distinguishing a particular document from others. The most popular approach is the tfidf weighting scheme, i.e. term frequency (tf) times inverse document frequency (idf).

Text Classification: Text classification intends to categorize documents into a series of predefined thematic categories. In particular, it aims to find the mapping ξ from a set of documents D: {d1, ..., di} to a set of thematic categories C: {C1, ..., Cj}, i.e. ξ : D → C.
Chapter II
Deriving Document Keyphrases for Text Mining Yi-fang Brook Wu New Jersey Institute of Technology, USA Quanzhi Li Avaya, Inc., USA
ABSTRACT

Document keyphrases provide semantic metadata which can characterize documents and produce an overview of the content of a document. This chapter describes a Keyphrase Identification Program (KIP), which extracts document keyphrases by using prior positive samples of human identified domain keyphrases to assign weights to candidate keyphrases. The logic of our algorithm is: the more keywords a candidate keyphrase contains and the more significant these keywords are, the more likely this candidate phrase is a keyphrase. To obtain human identified positive inputs, KIP first populates its glossary database using manually identified keyphrases and keywords. It then checks the composition of all noun phrases extracted from a document, looks them up in the database and calculates scores for all these noun phrases. The ones having higher scores will be extracted as keyphrases. KIP's learning function can enrich the glossary database by automatically adding newly identified keyphrases to the database.
INTRODUCTION

As textual information pervades the Web and local information systems, text mining is becoming more and more important for deriving competitive advantages. One critical factor in successful text mining applications is the ability to find significant topical terms for discovering interesting patterns or relationships. Discourse representation theory (Kamp, 1981) shows that a document's primary concepts are mainly carried by noun phrases. We believe that noun phrases (NPs) as textual elements are better
suited for text mining and could provide more discriminating power than single words, which are still used by many text mining applications. Since not all NPs in a document are important, we propose using them as candidates and identifying keyphrases from them. Document keyphrases are the most important topical phrases for a given document. They provide a concise summary of a document's content, offering semantic metadata summarizing a document. Previous studies have shown that document keyphrases can be used in a variety of applications, such as retrieval engines (Li, Wu, Bot & Chen, 2004), browsing interfaces (Gutwin, Paynter, Witten, Nevill-Manning & Frank, 1999), thesaurus construction (Kosovac, Vanier & Froese, 2000), and document classification and clustering (Jonse & Mahoui, 2000). For example, they may be utilized to enrich the metadata of the results returned from a search engine (Li et al., 2004). They may also be used to efficiently classify or cluster documents into different categories (Jonse & Mahoui, 2000). Some documents have a list of author-assigned keyphrases, but most documents do not. The keyphrases assigned by domain experts, such as indexers or authors, may be chosen from a document or a controlled vocabulary. However, manually assigning keyphrases to documents is costly and time-consuming, so it is necessary to develop an algorithm to automatically generate keyphrases for documents. Automatic keyphrase generation can be executed in two ways: keyphrase extraction and keyphrase assignment. Keyphrase extraction methods choose phrases from the document text as keyphrases. Keyphrase assignment methods choose phrases from a controlled vocabulary as document keyphrases. The problem with keyphrase assignment is that some potentially useful keyphrases will be ignored if they appear in the document but not in the controlled vocabulary. In some domains, such controlled vocabularies might not even be available, and most controlled vocabularies are not updated frequently enough. Therefore, automatic keyphrase extraction is more desirable and popular. In this chapter, we describe our Keyphrase Identification Program, KIP, and its learning function. KIP utilizes a keyphrase extraction technique: the keyphrases generated by KIP must appear in the document. The algorithm considers the composition of a noun phrase. To analyze a noun phrase and assign a score to it, KIP uses a glossary database, which contains manually pre-identified domain-specific keyphrases and keywords, to calculate scores of noun phrases in a document. The noun phrases having higher scores will be extracted as keyphrases. KIP's learning function enriches its glossary database by automatically adding new keyphrases extracted from documents to the database. Consequently, the database will grow gradually and the system performance will be improved. The remainder of this chapter is organized in the manner described below. Previous studies are first presented. Then our methodology and system architecture are described, followed by KIP's learning function. Finally, the evaluations of KIP's effectiveness and its learning function are presented.
RELATED WORK

Several automatic keyphrase extraction techniques have been proposed in previous studies. Krulwich and Burkey (1996) use some heuristics to extract significant topical phrases from a document. The heuristics are based on documents' structural features, such as the presence of phrases in document section headers, the use of italics, and the different formatting structures. This approach is not difficult to implement, but the limitation is that not every document has explicit structural features. Zha (2000) proposes a method for keyphrase extraction by modeling documents as weighted undirected and weighted bipartite graphs. Spectral graph clustering algorithms are used for partitioning
sentences of a document into topical groups. Within each topical group, the mutual reinforcement principle is used to compute keyphrase and sentence saliency scores. The keyphrases and sentences are then ranked according to their saliency scores. Keyphrases are then selected for inclusion in the top keyphrase list, and sentences are also selected for inclusion in summaries of the document. This approach basically considers only the frequency of a phrase, and the paper does not give evaluation information on the extracted keyphrases. Statistical language models are used by Tomokiyo and Hurst (2003) to extract keyphrases. Their method uses pointwise KL-divergence (Cover & Thomas, 1991) between multiple language models for scoring both phraseness and informativeness, which are unified into a single score to rank extracted phrases. Phraseness and informativeness are two features of a keyphrase. Phraseness is an abstract notion describing the degree to which a given word sequence is considered to be a phrase. Informativeness means how well a phrase represents the key ideas in a document collection. In the calculations, they use the relationship between foreground and background corpora to formalize the notion of informativeness. The target document collection from which representative keyphrases are extracted is called the foreground corpus. The document collection to which this target collection is compared is called the background corpus. Their approach needs a foreground corpus and a background corpus, and the latter cannot be easily obtained. The phrase list it generates is for the whole corpus, not an individual document. Their paper does not report any evaluation of their approach. Turney (2000) was the first to treat the problem of phrase extraction as supervised learning from examples. Turney uses nine features to score a candidate phrase; some of the features are the location of the first occurrence of the phrase in the document and whether or not the phrase is a proper noun. Keyphrases are extracted from candidate phrases based on the examination of their features. Turney introduces two kinds of algorithms: the C4.5 decision tree induction algorithm and GenEx, which is more successful than C4.5. GenEx has two components: Extractor and Gentor. Extractor processes a document and produces a list of phrases based on the setting of 12 parameters. In the training stage, Gentor is used to tune the parameter setting to get the optimal performance. Once the training process is finished, Gentor is no longer used, and Extractor alone can extract keyphrases using the optimal parameter setting obtained from the training stage. In Extractor's formula, the dominant factors for the calculation of a phrase's score are the frequency of the phrase, the frequencies of the words within it, and the location of its first occurrence. Kea uses a supervised machine learning algorithm which is based on naïve Bayes' decision rules (Frank, Paynter, Witten, Gutwin, & Nevill-Manning, 1999; Witten, Paynter, Frank, Gutwin, & Nevill-Manning, 1999). It treats keyphrase extraction as a classification task, so the problem is to classify a candidate keyphrase into one of two classes, keyphrase and non-keyphrase. Two attributes are used to discriminate between keyphrase and non-keyphrase: the TF.IDF score of a phrase and the distance into the document of the phrase's first appearance. They use documents with author-provided keywords as the training documents.
A model is learned from the training documents and corresponds to a specific corpus containing the training documents. A model can be used to identify keyphrases from other documents once it is learned from the training documents. Each model consists of a Naive Bayes classifier and two supporting files, which contain phrase frequencies and stop words. Both Kea and Extractor use a similar way to identify candidate keyphrases: the input text is split up according to phrase boundaries (numbers, punctuation marks, dashes, brackets); non-alphanumeric characters and all numbers are deleted; then a phrase is defined as a sequence of one, two, or three words that appear consecutively in the text; finally, it eliminates those phrases beginning or ending with
a stop word. The above approach to identifying candidate keyphrases is different from ours. However, Kea and Extractor both use supervised machine learning approaches, and both need training corpora to train their programs. For each document in the corpus, there must be a target set of keyphrases provided by authors or generated by experts. This requires a lot of manual work, and in some applications there might be no appropriate document set that can be used to train the program. Another limitation is that the training corpora are usually not up to date, and retraining the program with new corpora is not easy. Based on the results of prior studies, we are looking for a method which can identify real keyphrases now, and which can also automatically and gradually adapt to new developments and advances in the domain of the documents from which it derives keyphrases. Our algorithm is described in detail in the following section.
KIP: A DOMAIN-SPECIFIC KEYPHRASE EXTRACTION ALGORITHM

KIP is a domain-specific keyphrase extraction program, not a keyphrase assignment program, which means the generated keyphrases must occur in the document text. It is also designed to be able to learn to adapt to new developments in a chosen domain. KIP is designed by mimicking the "learning by example" that humans do when they learn new things. Identifying things in the environment which are already in our minds is easy. However, learning to identify new things needs to build on top of what we already know. Besides documents, from which keyphrases will be extracted, as inputs, KIP requires a database which is similar to the domain knowledge in our mind. When KIP examines a keyphrase candidate, it looks for characteristics (words and sub-phrases) in the phrase which are already in the background knowledge base and assigns weights (how domain-specific a word or a sub-phrase is) accordingly. For the version of KIP with the learning function, once a new phrase is learned, it is inserted into the database and the weights of all affected characteristics are adjusted to reinforce the learning. According to our design, the KIP algorithm is based on the logic that a noun phrase containing domain-specific keywords and/or keyphrases is likely to be a keyphrase in the domain. The more keywords/keyphrases it contains and the more significant the keywords/keyphrases are, the more likely it is that this noun phrase is a keyphrase. The pre-identified domain-specific keywords and keyphrases are stored in the glossary database, which is used to calculate scores of noun phrases. Here a pre-defined domain-specific keyword means a single-term word, and a pre-defined domain-specific keyphrase means a phrase containing one or more words. A keyphrase generated by KIP can be a single-term keyphrase or a multiple-term keyphrase up to 6 words long. KIP operations can be summarized as follows. KIP first extracts a list of keyphrase candidates, which are noun phrases, from the input documents. Then it examines the composition of each keyphrase candidate (a noun phrase) and assigns a score to it. The score of a noun phrase is determined mainly based on three factors: its frequency of occurrence in the document, its composition (what words and sub-phrases it contains), and how specific these words and sub-phrases are in the domain of the document. To calculate scores of noun phrases, readily available human identified domain-specific keyphrases are parsed to form a glossary database with weights. Finally, the weights of noun phrases are looked up in the database and added, and those with higher scores are selected as keyphrases of the document. In this section we will introduce KIP's four main components: the tokenizer, the part-of-speech (POS) tagger, the noun phrase extractor, and the keyphrase extraction tool. In the next section, we will present KIP's special feature: the learning function. To distinguish the two kinds of keyphrases, i.e.
human identified domain-specific keyphrases and KIP identified keyphrases, we refer to the former as manual keyphrases and the latter as automatic keyphrases.
Tokenizer

After documents are loaded into the system, the tokenizer will separate all the words, punctuation marks and other symbols from the document text to obtain the atomic units.
Part-of-Speech Tagger

To identify noun phrases, the system requires knowledge of the part of speech of the words in the text. A part-of-speech tagger is used to assign the most likely part-of-speech tag to each word in the text. Our part-of-speech tagger is a revised version of the widely used Brill tagger (Brill, 1995). The Brill tagger is based on transformation-based error-driven learning. When it was trained on 600,000 words from the Wall Street Journal corpus, an accuracy of 97.2% was achieved on a separate 150,000-word test set from the Wall Street Journal corpus. Our tagger was trained on two corpora, the Penn Treebank tagged Wall Street Journal corpus and the Brown corpus. Tagging is done in two stages. First, every word is assigned its most likely tag. Next, contextual transformations are used to improve accuracy.
Noun Phrase Extractor

After all the words in the document are tagged, the noun phrase extractor will extract noun phrases from the document. KIP's noun phrase extractor (NPE) extracts noun phrases by selecting the sequences of POS tags that are of interest. The current sequence pattern is defined as {[A]} {N}, where A refers to Adjective, N refers to Noun, { } means repetition, and [ ] means optional. A set of optional rules is also used. Phrases satisfying the above sequence pattern or the optional rules will be extracted as noun phrases. Users may choose to obtain noun phrases of different lengths by changing system parameters. Ramshaw and Marcus (1995) introduce a standard data set for the evaluation of noun phrase identification approaches. Precision and recall are used to measure the performance of an algorithm: precision measures the percentage of noun phrases identified by the approach that are accurate, and recall measures the percentage of noun phrases from the testing set that are identified by the approach. For this data set, many evaluation results of noun phrase identification approaches have been published (Sang, 2000; Argamon, Dagan, & Krymolowski, 1999; Cardie & Pierce, 1999; Muñoz, Punyakanok, Roth, & Zimak, 1999). Precision and recall can be combined in one measure, the F measure: F = 2 × precision × recall / (precision + recall). The F value of our noun phrase extractor is 0.91. It is comparable with the other approaches mentioned above, whose F values range from 0.89 to 0.93. At this stage, KIP produces a list of noun phrases, which will be used in the next stage, keyphrase extraction.
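As an illustration only, the {[A]} {N} pattern can be approximated with NLTK's Penn Treebank tagger and a regular-expression chunker, as sketched below. The JJ/NN tag mapping and the grammar string are our reading of the pattern; KIP's additional optional rules and its revised Brill tagger are not reproduced here.

```python
import nltk

# One-time downloads: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

GRAMMAR = "NP: {<JJ>*<NN.*>+}"   # zero or more adjectives followed by one or more nouns

def extract_noun_phrases(sentence, max_len=6):
    """Return lowercased candidate noun phrases of up to max_len words."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)
    phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        words = [w for w, _ in subtree.leaves()]
        if len(words) <= max_len:
            phrases.append(" ".join(words).lower())
    return phrases
```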
Extracting Keyphrases

At this stage, a list of noun phrases produced by the NPE is already available. These noun phrases are the keyphrase candidates. They will be assigned scores and ranked in this stage. Noun phrases with higher scores will be extracted as this document's keyphrases. As previously described, KIP's algorithm examines the composition of a noun phrase to assign a score to it. In order to calculate the scores for noun
phrases, we use a glossary database containing domain-specific manual keyphrases and keywords, which provide initial weights for the keywords and sub-phrases of a candidate keyphrase. In the following sections, we will first describe how to build this database, then how to calculate a noun phrase’s score, and finally how the keyphrases are extracted.
Building Glossary Database

The glossary database has two lists (tables): (a) a manual keyphrase list and (b) a manual keyword list. A manual keyphrase is an entry in the pre-defined keyphrase list, and it could contain one or more words; a manual keyword means a single word parsed from list (a). Before using KIP, users will need a glossary database from a particular domain corresponding to the domain of the documents. When the system is applied to a new domain, the only thing required is to build or change to a new database specific to the domain. We use the Information Systems (IS) domain as an example to illustrate how a domain-specific glossary database is built. For the IS domain, both lists were generated from two main sources: (1) author keyphrases from an IS abstract corpus, and (2) the "Blackwell Encyclopedic Dictionary of Management Information Systems" (1997 edition). The reason for combining the two sources to generate the lists was the need to obtain keyphrases and keywords that would cover both theoretical and technical aspects of the IS literature as much as possible. We believe that if the database contains more comprehensive human identified keyphrases and keywords, the performance of KIP will be better.

Keyphrase List. The keyphrase list was generated as follows. First, 3,000 abstracts from IS related journals were automatically processed, and all keyphrases provided by the original authors were extracted to form an initial list. Second, this list was further augmented with keyphrases extracted from the Blackwell encyclopedic dictionary.

Keyword List. Most of the keyphrases in the keyphrase list are composed of two or more words. To obtain the manual keywords, all manual keyphrases were split into individual words and added as keywords to the keyword list.

The keyphrase table has three columns (keyphrases, weights, and sources) and the keyword table has two columns (keywords and weights). Keyphrases in the keyphrase table may come from up to two sources. Initially, they are all manually identified in the way described above. During KIP's learning process, the system may automatically learn new phrases and add them to the database. We discuss the details of KIP's learning process and how new phrases are added to the glossary database in the next section. The weights of these domain-specific keyphrases and keywords in the glossary database are assigned automatically through the following steps:

1. Assigning weights to keywords. A keyword can be in one of three conditions: (A) the keyword itself alone is a keyphrase and is not part of any keyphrase in the keyphrase table; (B) the keyword itself alone is not a keyphrase but is only part of one or more keyphrases in the keyphrase table; and (C) the keyword itself alone is a keyphrase and also is part of one or more keyphrases in the keyphrase table. Each keyword in the keyword table will be checked against the keyphrase table to see which condition it belongs to. The weights are automatically assigned to keywords differently in each condition. The rationale behind this is that it reflects how domain-specific a keyword
is in the domain. The more specific a keyword is, the higher the weight it has. For each keyword in condition (A), the weight is X (the system default value for X is 10); for each keyword in condition (B), the weight is Y divided by the number of times the keyword appears as part of a keyphrase (the system default value for Y is 5); for each keyword in condition (C), the weight is

$\frac{X + \frac{Y}{N}}{2}$,
where N is the number of times that the keyword appears as part of a keyphrase.

2. Assigning weights to keyphrases. The weight of each word in the manual keyphrase is found from the keyword table, and then all the weights of the words in this manual keyphrase are added together. The sum is the weight for this manual keyphrase.

The weights of manual keyphrases and keywords assigned by the above method will be used to calculate the scores of keyphrase candidates in a document.
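The glossary construction can be sketched as follows. The condition (A) and (B) weights follow the text directly; the condition (C) formula is our reading of the expression above (the average of the condition (A) and condition (B) weights), and treating "part of a keyphrase" as membership in a multi-word keyphrase is likewise our assumption.

```python
def keyword_weights(keyphrases, X=10.0, Y=5.0):
    """keyphrases: iterable of manual keyphrases (strings). Returns word -> weight."""
    phrases = {tuple(p.lower().split()) for p in keyphrases}
    words = {w for p in phrases for w in p}
    # N(w): number of multi-word keyphrases containing w (our reading of "part of a keyphrase")
    part_count = {w: sum(1 for p in phrases if len(p) > 1 and w in p) for w in words}
    weights = {}
    for w in words:
        standalone = (w,) in phrases
        n = part_count[w]
        if standalone and n == 0:        # condition (A)
            weights[w] = X
        elif not standalone:             # condition (B); n > 0 is guaranteed here
            weights[w] = Y / n
        else:                            # condition (C), as reconstructed above
            weights[w] = (X + Y / n) / 2
    return weights

def keyphrase_weights(keyphrases, kw_weights):
    """Keyphrase weight = sum of the weights of the words it contains."""
    return {p.lower(): sum(kw_weights.get(w, 0.0) for w in p.lower().split())
            for p in keyphrases}
```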
Calculating Scores for Keyphrase Candidates

A noun phrase's score is defined by multiplying a factor F by a factor S. F is the frequency of this phrase in the document, and S is the sum of weights of all the individual words and all the possible combinations of adjacent words within a keyphrase candidate (we call a combination of adjacent words a "sub-phrase" of this keyphrase candidate). So we have the following equation: the score of a noun phrase = F × S. The sum of weights S is defined as

$S = \sum_{i=1}^{N} w_i + \sum_{j=1}^{M} p_j$,
where w_i is the weight of the i-th of the N words within this noun phrase, and p_j is the weight of the j-th of the M sub-phrases within it. The method used to obtain the values of w_i and p_j is explained later. Let us use an example to illustrate the above equation. Assume there is a noun phrase "ABC", where A, B and C are three words. The possible combinations of adjacent words are AB, BC, and ABC. The score for the noun phrase "ABC" will be the frequency of "ABC" in this document multiplied by the sum of the weights of A, B, C, AB, BC, and ABC. The motivation for including the weights of all possible sub-phrases in the phrase score, in addition to the weights of individual words, is to find out if a sub-phrase is a manual keyphrase in the glossary database. If it is, this phrase is expected to be more important. KIP will look up the keyphrase table to obtain the weights for all the sub-phrases of a keyphrase candidate. If a sub-phrase is found, the corresponding weight in the keyphrase table is assigned to this sub-phrase; otherwise, a predefined low weight will be assigned. This predefined weight is usually much smaller than the lowest weight of a keyphrase or keyword in the database, because it is for a new and previously unidentified phrase. The user can adjust this value by changing the system parameter.
Figure 1. A screenshot of KIP used to explain the learning function
Similarly, KIP obtains the weight of a word by looking up the keyword table. If it finds the word in the table, the corresponding weight in the keyword table will be the weight of the word. Otherwise, a very low predefined weight will be assigned to it.
Extracting Keyphrases

All the scores of keyphrase candidates are normalized to range from 0 to 1 (all scores are divided by the highest score) after they are calculated. This makes it easier to find out the relative importance of all candidate keyphrases. All candidate keyphrases for a document are then ranked in descending order by their scores. The keyphrases of a document can be extracted from the ranked list. In order to be as flexible as possible, the KIP system has a set of parameters for users to decide the number of keyphrases they want from a document. The number of extracted keyphrases for a document can be defined in three ways: (1) defining a specific number of keyphrases to be extracted; (2) specifying the percentage of noun phrases to be extracted (for example, the top 10% of all the identified noun phrases are to be extracted); and (3) setting a threshold for keyphrases to be extracted (for example, only noun phrases with scores greater than 0.7 are to be extracted). KIP contains all the above basic options, as well as possible combinations of them. For example, the system can extract keyphrases that are in the top 20% of the candidate list and also have scores greater than 0.7. We have mentioned several system parameters that users can adjust. However, most of the parameters do not need user adjustment; their default values are acquired based on our testing and observations. The purpose of providing extra options is that we want to make the system as flexible as possible. Such options are mainly for advanced users, in case they need to use KIP in some special applications. For example, for text mining applications, we might need as many keyphrases as possible from a document, while for document content metadata generation we want selective, high-quality keyphrases. For most users, the only parameter they want or need to change is the desired number of keyphrases.
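Putting the scoring and ranking steps together, a minimal sketch is given below. It reuses the keyword_weights and keyphrase_weights helpers sketched earlier; the default weights for unknown words and sub-phrases are illustrative placeholders, not the system's actual settings.

```python
def phrase_score(phrase, freq, kw_weights, kp_weights, default_w=0.1, default_p=0.2):
    """Score = F * S: freq times the summed weights of every word and every
    contiguous sub-phrase (length >= 2, including the full phrase) of the candidate."""
    words = phrase.lower().split()
    s = sum(kw_weights.get(w, default_w) for w in words)
    for length in range(2, len(words) + 1):
        for start in range(len(words) - length + 1):
            sub = " ".join(words[start:start + length])
            s += kp_weights.get(sub, default_p)
    return freq * s

def rank_candidates(candidates, kw_weights, kp_weights):
    """candidates: dict of noun phrase -> frequency in the document.
    Returns phrases with scores normalised to [0, 1], highest first."""
    scores = {p: phrase_score(p, f, kw_weights, kp_weights) for p, f in candidates.items()}
    top = max(scores.values(), default=1.0) or 1.0
    return dict(sorted(((p, s / top) for p, s in scores.items()),
                       key=lambda kv: kv[1], reverse=True))
```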
KIP'S LEARNING FUNCTION

KIP, like many other keyphrase extraction algorithms, relies on keyphrases predefined by human experts as positive examples. Sometimes such examples are not up to date or not available. An adaptation and learning function is therefore necessary so that KIP can grow as its field of documents grows. KIP's learning function enriches the glossary database by automatically adding newly identified keyphrases to it. This function is optional, and users can enable or disable it. With the learning function enabled, whenever the system identifies a new keyphrase ("new" means the keyphrase is not in the database's keyphrase table and satisfies the inclusion requirements), the keyphrase is automatically added to the keyphrase table and the words it contains are added to the keyword table. The inclusion requirements can be modified by defining how many keyphrases will be extracted from a document. After new keyphrases are added to the glossary database, the weights of the affected keyphrases and keywords in the database are recalculated to reinforce the learning. With this feature enabled, the database grows gradually, which benefits future keyphrase extraction for new documents. The learning function is especially useful when KIP is used in a domain with very few existing domain-specific keyphrases and keywords: applied to such a domain, KIP can automatically learn new keyphrases and eventually build a glossary for that domain.

We use Figure 1 as an example to explain how the learning function works. In this figure, seven documents have been processed, and the file names and the extracted keyphrases are displayed in the left frame. For the document shown in the right frame, KIP extracts and highlights five keyphrases, which are also displayed in decreasing order of importance in the left frame. Four of them are marked with a red cube ("user participation," "user satisfaction," "user participative behavior," and "different contextual situation"), and one is marked with a blue cube ("systems development"). The four keyphrases with a red cube are new to the glossary database, which means they do not exist in the keyphrase table; the one with a blue cube is already in the database. With the automatic learning function enabled, the system adds these four new keyphrases to the database automatically.

To better control the quality of the new keyphrases added to the database, KIP has parameters that allow users to set inclusion requirements for adding new keyphrases, similar to the way the quantity of keyphrases extracted for an individual document is defined in the previous section. The system also has an option that lets a user exclude some new keyphrases from being added to the database if s/he thinks they are not qualified. The learning process can be fully automatic, or it can involve the user. If the automatic option is disabled, the user decides whether a newly identified keyphrase should be added to the database; in this way, the user controls the quality, and only the newly identified keyphrases that satisfy the user are added. Another useful feature is that if the user thinks a phrase is good and should be added to the database, but it is not identified by the system as a keyphrase, the user can highlight this phrase in the document text in the right frame.
Then the system will add this phrase to the database automatically.
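The learning step itself amounts to a small database update, which can be pictured roughly as follows. This is our sketch, not KIP's code; the table structures and the initial weight given to learned entries are assumptions, and the subsequent weight recalculation is omitted.

```python
# Rough sketch of the learning step (assumed table structure and initial weight).

NEW_ENTRY_WEIGHT = 0.1  # assumed starting weight for learned entries

def learn_keyphrase(phrase, keyphrase_weights, keyword_weights):
    """Add a newly identified keyphrase and its words to the glossary tables."""
    if phrase not in keyphrase_weights:
        keyphrase_weights[phrase] = NEW_ENTRY_WEIGHT
        for word in phrase.split():
            keyword_weights.setdefault(word, NEW_ENTRY_WEIGHT)
    # In KIP, weights of affected keyphrases/keywords are then recalculated
    # to reinforce the learning; that recalculation is omitted here.

keyphrases, keywords = {"systems development": 1.2}, {}
learn_keyphrase("user participation", keyphrases, keywords)
print(keyphrases, keywords)
```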
EXPERIMENT

Generally, there are two ways to evaluate a keyphrase extractor's performance. The first is human assessment of the system-extracted keyphrases. Human evaluation requires a lot of effort, because it requires subjects to read the full-text documents from which the keyphrases are extracted. The second evaluates a keyphrase extractor's effectiveness using the standard information retrieval measures, precision and recall. To compute precision and recall, a test document must already have a set of standard keyphrases. Because there are no large test collections with identified keyphrases available for evaluation, the document keyphrases assigned by the original author(s) are usually used as the standard keyphrase set, and the system-generated keyphrases are compared against them. Previous studies have used this measure and found it an appropriate way to measure the effectiveness of a keyphrase extraction system (Jones & Paynter, 2002; Sang, 2000; Frank et al., 1999; Tolle & Chen, 2000). Measuring precision and recall against author keyphrases is easy to carry out, less time-consuming than human evaluation, and allows more precise comparison between different keyphrase extraction systems. We therefore employed this method to evaluate KIP's effectiveness. Recall is the proportion of the keyphrases assigned by a document's author(s) that appear in the set of keyphrases generated by the keyphrase extraction system. Precision is the proportion of the extracted keyphrases that match the keyphrases assigned by a document's author(s).

We used the information systems (IS) domain to perform the experiments. The process of building an IS glossary database containing domain-specific keyphrases and keywords is described in the next section. We also wanted to know how well KIP performs relative to other systems, so we compared KIP to two other keyphrase extraction systems, Kea (Frank et al., 1999; Witten et al., 1999) and Extractor (Turney, 2000); other reported systems were not available to us for comparison. In this experiment, we used Kea 1.1.4 with its built-in model, cstr, which gives the best results among all its models (Jones & Paynter, 2002). For Extractor, we used its commercial evaluation version 7.2. All three systems take an input document and generate a list of keyphrases for it.

Five hundred journal and conference papers were chosen as the test documents. The sources of the 500 documents were the Journal of the Association for Information Systems 2002-2004, the Journal of Information Retrieval, the Proceedings of the Americas Conference on Information Systems 2001-2003, and the Journal of Data Mining and Knowledge Discovery 2002. We chose documents from different sources to make the experimental results more generalizable. These 500 papers were directly within, or closely related to, the information systems domain, so our starting glossary database was appropriate. All 500 papers had author-assigned keywords. (The author-assigned keywords in these 500 papers were not used to populate our glossary database.) Author-assigned keyphrases were removed from the papers before they were processed by the three systems. The length of most of these papers was between 5 and 15 pages, and the average number of author-assigned keyphrases was 4.7.
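The two measures can be made concrete with a small sketch of the standard definitions (ours, not the authors' evaluation code); phrases are compared after simple lower-casing, whereas a real evaluation would typically also stem or otherwise normalise them.

```python
# Precision and recall of extracted keyphrases against author-assigned keyphrases
# (illustrative only; real evaluations usually also stem/normalise the phrases).

def precision_recall(extracted, author_assigned):
    ext = {p.lower() for p in extracted}
    gold = {p.lower() for p in author_assigned}
    matched = ext & gold
    precision = len(matched) / len(ext) if ext else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return precision, recall

extracted = ["text mining", "glossary database", "noun phrase", "keyphrase extraction"]
author = ["keyphrase extraction", "text mining", "digital libraries"]
print(precision_recall(extracted, author))   # (0.5, 0.666...)
```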
We compared the performance of KIP and Kea when the number of keyphrases extracted by each system was 5, 10, 15, and 20. Since the commercial evaluation version of Extractor we used could produce at most eight phrases per document, we compared the performance of Extractor, KIP, and Kea only when the number of extracted keyphrases was 5 and 8. Table 1 shows the results for KIP and Kea. We also tested the statistical significance of the difference between the precisions of the two systems, as well as their recalls, using a paired t-test.
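The significance tests reported in Tables 1-3 are paired t-tests over per-document scores. A minimal way to run such a test is sketched below; the per-document numbers are made up purely for illustration.

```python
# Paired t-test over per-document precision scores of two systems (illustrative numbers).
from scipy import stats

kip_precision = [0.40, 0.20, 0.20, 0.60, 0.00, 0.40, 0.20]
kea_precision = [0.20, 0.20, 0.00, 0.40, 0.00, 0.20, 0.20]

t_stat, p_value = stats.ttest_rel(kip_precision, kea_precision)
print(t_stat, p_value, p_value < 0.05)
```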
Table 1. Precision and Recall for KIP and Kea

Number of extracted keyphrases | Average precision ± SD (KIP) | Average precision ± SD (Kea) | Precision difference significant (p < 0.05)? | Average recall ± SD (KIP) | Average recall ± SD (Kea) | Recall difference significant (p < 0.05)?
5  | 0.27 ± 0.19 | 0.20 ± 0.18 | Yes | 0.31 ± 0.22 | 0.20 ± 0.17 | Yes
10 | 0.19 ± 0.11 | 0.15 ± 0.12 | Yes | 0.44 ± 0.24 | 0.32 ± 0.26 | Yes
15 | 0.15 ± 0.07 | 0.13 ± 0.10 | Yes | 0.50 ± 0.23 | 0.40 ± 0.27 | Yes
20 | 0.12 ± 0.05 | 0.11 ± 0.08 | No  | 0.54 ± 0.23 | 0.44 ± 0.28 | Yes
Table 2. Precision and Recall for KIP and Extractor

Number of extracted keyphrases | Average precision ± SD (KIP) | Average precision ± SD (Extractor) | Precision difference significant (p < 0.05)? | Average recall ± SD (KIP) | Average recall ± SD (Extractor) | Recall difference significant (p < 0.05)?
5 | 0.27 ± 0.19 | 0.24 ± 0.15 | No | 0.31 ± 0.22 | 0.26 ± 0.16 | Yes
8 | 0.22 ± 0.13 | 0.20 ± 0.12 | No | 0.39 ± 0.24 | 0.35 ± 0.22 | Yes
Table 3. Precision and Recall for Kea and Extractor

Number of extracted keyphrases | Average precision ± SD (Kea) | Average precision ± SD (Extractor) | Precision difference significant (p < 0.05)? | Average recall ± SD (Kea) | Average recall ± SD (Extractor) | Recall difference significant (p < 0.05)?
5 | 0.20 ± 0.18 | 0.24 ± 0.15 | Yes | 0.20 ± 0.17 | 0.26 ± 0.16 | Yes
8 | 0.16 ± 0.13 | 0.20 ± 0.12 | Yes | 0.28 ± 0.22 | 0.35 ± 0.22 | Yes
From Table 1, we can see that, with respect to both precision and recall, KIP performs better than Kea. The differences are significant at the 95% confidence level (p < 0.05), except for precision when the number of extracted keyphrases is 20. For the reason described above, we compared the performance of KIP and Extractor at only two data points, 5 and 8 extracted phrases. Table 2 shows that KIP performs better than Extractor, but the differences are significant only for recall; for precision, the differences are not significant at either 5 or 8 extracted keyphrases.
Kea and Extractor are compared in Table 3. The results show that Extractor performs better than Kea, with respect to both precision and recall, when the number of extracted keyphrases is 5 and 8; these differences are significant at the 95% confidence level.

As explained in previous sections, KIP's algorithm considers the composition of a phrase. The score of a candidate keyphrase is determined mainly by three factors: its frequency in the document, its composition (what words and sub-phrases it contains), and how specific those words and sub-phrases are in the domain of the document. To find the importance of the words and sub-phrases within a candidate keyphrase, KIP uses a glossary database. In this experiment, we also wanted to know how many of the correctly identified keyphrases (the extracted keyphrases that matched author-provided keyphrases) were already in the glossary database. By examining the glossary database and the correctly identified keyphrases, we found that on average only 18% of the correctly identified keyphrases had a complete match with a phrase in the glossary database. This means that most of the keyphrases were correctly identified not because they were already in the database, but because of their composition, their frequency, and how specific the words and sub-phrases within them are in the domain of the document.

We used author-provided keyphrases to calculate precision and recall in this experiment. However, we need to point out that some author-provided keyphrases may not occur in the document to which they are assigned. In the experiments reported by Turney (2000), only about 75% of author-provided keyphrases appear somewhere in the document, which suggests that the highest recall a system could achieve is about 0.75.
CONCLUSION

In this chapter we describe a new keyphrase extraction algorithm and its learning function. The experimental results show that the performance of KIP is comparable to, and in some cases better than, other reported keyphrase extraction algorithms. KIP's ability to automatically add new keyphrases to its glossary database makes it easier to apply KIP to different domains. The features and performance of KIP make it useful for a variety of applications, such as retrieval engines, browsing interfaces, and thesaurus/glossary construction. Our future research will focus on evaluating KIP with human subjects and applying it in different domains.
ACKNOWLEDGMENT

Partial support for this research was provided by the United Parcel Service Foundation; the National Science Foundation under grants DUE-0226075, DUE-0434581, and DUE-0434998; and the Institute for Museum and Library Services under grant LG-02-04-0002-04.
REFERENCES

Argamon, S., Dagan, I., & Krymolowski, Y. (1999). A Memory-Based Approach to Learning Shallow Natural Language Patterns. Journal of Experimental and Theoretical Artificial Intelligence (JETAI), 11(3), 369-390.
Brill, E. (1995). Transformation-based Error-driven Learning and Natural Language Processing: A Case Study in Part-of-speech Tagging. Computational Linguistics, 21(4), 543-565.

Cardie, C., & Pierce, D. (1999). The Role of Lexicalization and Pruning for Base Noun Phrase Grammars. Proceedings of the Sixteenth National Conference on Artificial Intelligence, 423-430.

Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. John Wiley.

Frank, E., Paynter, G., Witten, I., Gutwin, C., & Nevill-Manning, C. (1999). Domain-specific keyphrase extraction. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, San Mateo, CA, 668-673.

Gutwin, C., Paynter, G., Witten, I. H., Nevill-Manning, C., & Frank, E. (1999). Improving Browsing in Digital Libraries with Keyphrase Indexes. Decision Support Systems, 27(1-2), 81-104.

Jones, S., & Mahoui, M. (2000). Hierarchical document clustering using automatically extracted keyphrases. In Proceedings of the Third International Asian Conference on Digital Libraries, Seoul, Korea, 113-120.

Jones, S., & Paynter, G. W. (2002). Automatic Extraction of Document Keyphrases for Use in Digital Libraries: Evaluation and Applications. Journal of the American Society for Information Science and Technology, 53(8), 653-677.

Kamp, H. (1981). A Theory of Truth and Semantic Representation. In Groenendijk, J., Janssen, T., & Stokhof, M. (Eds.), Formal Methods in the Study of Language. Mathematisch Centrum, 1, 277-322.

Kosovac, B., Vanier, D. J., & Froese, T. M. (2000). Use of keyphrase extraction software for creation of an AEC/FM thesaurus. Electronic Journal of Information Technology in Construction, 5, 25-36.

Krulwich, B., & Burkey, C. (1996). Learning user information interests through the extraction of semantically significant phrases. In Hearst, M., & Hirsh, H. (Eds.), Proceedings of the AAAI 1996 Spring Symposium on Machine Learning in Information Access, AAAI Press, California, 15-18.

Li, Q., Wu, Y. B., Bot, R. S., & Chen, X. (2004). Incorporating Document Keyphrases in Search Results. Proceedings of the Tenth Americas Conference on Information Systems, New York, NY, 3255-3263.

Muñoz, M., Punyakanok, V., Roth, D., & Zimak, D. (1999). A Learning Approach to Shallow Parsing. Proceedings of EMNLP/WVLC-99, University of Maryland, MD, 168-178.

Ramshaw, L. A., & Marcus, M. P. (1995). Text Chunking Using Transformation-Based Learning. Proceedings of the Third Workshop on Very Large Corpora, Cambridge, MA, 82-94.

Sang, E. F. (2000). Noun Phrase Representation by System Combination. In Proceedings of ANLP-NAACL 2000, Seattle, WA.

Tolle, K. M., & Chen, H. (2000). Comparing noun phrasing techniques for use with medical digital library tools. Journal of the American Society for Information Science, 51(4), 352-370.

Tomokiyo, T., & Hurst, M. (2003). A Language Model Approach to Keyphrase Extraction. Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan.
Turney, P. D. (2000). Learning algorithms for keyphrase extraction. Information Retrieval, 2(4), 303-336.

Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., & Nevill-Manning, C. G. (1999). KEA: Practical Automatic Keyphrase Extraction. Proceedings of the Fourth ACM Conference on Digital Libraries, 254-255.

Zha, H. (2000). Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland.
Key Terms

Content Metadata: A special piece of metadata which describes the content of a document.

Document Keyphrases: Important topical phrases which, when combined, describe the main theme of a document.

Document Metadata: Data about a document, e.g., title, author, source, etc.

Domain-Specific Keyphrase Extraction: Document keyphrase extraction methods used for extracting keyphrases from documents in a specific domain.

Keyphrase Extraction: Methods for extracting important topical phrases from a document.

Keyphrase Assignment: Assigning keyphrases to a document from a predefined list.

Text Mining: Methods for distilling useful information from large bodies of text.
Chapter III
Intelligent Text Mining:
Putting Evolutionary Methods and Language Technologies Together

John Atkinson
Universidad de Concepción, Chile
Abstract

This chapter introduces a novel evolutionary model for intelligent text mining. The model deals with issues concerning shallow text representation and processing for mining purposes in an integrated way. Its aim is to discover interesting explanatory knowledge across text documents. The approach uses natural-language technology and genetic algorithms to produce explanatory, novel, hidden patterns, and involves a mixture of techniques from evolutionary computation and other kinds of text mining methods. Accordingly, new kinds of genetic operations suitable for text mining are proposed. Experiments, results, and their assessment by human experts are discussed, indicating the plausibility of the model for effective knowledge discovery from texts. With this chapter, the authors hope readers will come to understand the principles, theoretical foundations, implications, and challenges of a promising linguistically motivated approach to text mining.
Introduction

Like gold, information is both an object of desire and a medium of exchange. Also like gold, it is rarely found just lying about: it must be mined. A large portion of the world's electronic information exists as numerical data, and data mining technology can be used to extract "nuggets" from the well-structured collections that exist in relational databases and data warehouses. However, 80% of this information exists as text and is rarely looked at: letters from customers, e-mail correspondence, technical documentation, contracts, patents, and so forth.
An important problem is that information in this unstructured form is not readily accessible to computers. It has been written for human readers and requires, where feasible, some natural-language interpretation. Although full processing is still out of reach with current technology, there are tools using basic pattern recognition techniques and heuristics that are capable of extracting valuable information from free text based on the elements contained in it (e.g., keywords). This technology is usually referred to as text mining and aims at discovering unseen and interesting patterns in textual databases. These discoveries are useless unless they contribute valuable knowledge for users who make strategic decisions (i.e., managers, scientists, businessmen). This leads to a more complicated activity referred to as knowledge discovery from texts (KDT) which, like knowledge discovery from databases (KDD), corresponds to "the non-trivial process of identifying valid, novel, useful, and understandable patterns in data."

Despite the large amount of research over the last few years, only a few research efforts worldwide have realised the need for high-level representations (i.e., not just keywords), for taking advantage of linguistic knowledge, and for specific-purpose ways of producing and assessing the unseen knowledge. The rest of the effort has concentrated on doing text mining from an information retrieval (IR) perspective, so both the representation (keyword based) and the data analysis are restricted. The most sophisticated approaches to text mining or KDT are characterised by an intensive use of external electronic resources, including ontologies, thesauri, and so forth, which highly restricts the application of the unseen patterns to be discovered and their domain independence. In addition, the systems so produced have few metrics (or none at all) which allow them to establish whether the patterns are interesting and novel.

In terms of data mining techniques, genetic algorithms (GA) have several promising advantages for mining purposes over the usual learning / analysis methods employed in KDT: the ability to perform global search (traditional approaches deal with predefined patterns and restricted scope), the exploration of solutions in parallel, the robustness to cope with noisy and missing data (something critical in dealing with text information, as partial text analysis techniques may lead to imprecise outcome data), and the ability to assess the goodness of the solutions as they are produced.

In order to deal with these issues, many current KDT approaches show a tendency to use more structured or deeper representations than just keywords to perform further analysis so as to discover informative and (hopefully) unseen patterns. Some of these approaches attempt to provide specific contexts for discovered patterns (e.g., "it is very likely that if X and Y occur then Z happens"), whereas others use external resources (lexicons, ontologies, thesauri) to discover relevant unseen semantic relationships which may "explain" the discovered knowledge, in restricted contexts, and with specific fixed semantic relationships in mind. Sophisticated systems also use these resources as a commonsense knowledge base which, along with reasoning methods, can effectively be applied to answering questions about general concepts.
In this chapter, we describe a new model for intelligent text mining which brings together the benefits of evolutionary computation techniques and language technology to deal with current issues in mining patterns from text databases. In particular, the approach puts together information extraction (IE) technology and multi-objective evolutionary computation techniques. It aims at extracting key underlying linguistic knowledge from text documents (i.e., rhetorical and semantic information) and then hypothesising and assessing interesting and unseen explanatory knowledge. Unlike other approaches to KDT, the model does not use additional electronic resources or domain knowledge beyond the text database.
This chapter develops a new semantically-guided model for evolutionary text mining which is domain-independent but genre-based. Unlike previous research on KDT, the approach does not rely on external resources or descriptions, hence its domain-independence. Instead, it performs the discovery using only information from the original corpus of text documents and from the training data generated from them. In addition, a number of strategies have been developed for automatically evaluating the quality of the hypotheses ("novel" patterns). This is an important contribution on a topic which has been neglected in most KDT research over the last years. The model and the experimental applications provide some support that it is indeed plausible to conceive an effective KDT approach independent of domain resources and to make use of the underlying rhetorical information so as to represent text documents for text mining purposes.

The first specific objective is to achieve a plausible search for novel / interesting knowledge based on an evolutionary knowledge discovery approach which makes effective use of the structure and genre of texts. A second objective consists of showing evaluation strategies which allow the effectiveness to be measured in terms of the quality of the outcome produced by the model and which can correlate with human judgments. Finally, the chapter highlights the way all these strategies are integrated to produce novel knowledge which contributes additional information to help one better understand the nature of the discovered knowledge, compared to bag-of-words text mining approaches.

Text mining or knowledge discovery from texts (KDT) can potentially benefit from successful techniques from data mining or knowledge discovery from databases (KDD) (Han & Kamber, 2001) which have been applied to relational databases. However, data mining techniques cannot be immediately applied to text data for the purposes of text mining (TM), as they assume a structure in the source data which is not present in free text. Hence, new representations for text data have to be used. Also, while the assessment of discovered knowledge in the context of KDD is a key aspect for producing an effective outcome, the evaluation / assessment of the patterns discovered from text has been a neglected topic in the majority of KDT approaches. Consequently, it is not established whether the discoveries are novel, interesting, and useful for decision-makers. Despite the large amount of research over the last few years, only a few research efforts worldwide have realised the need for high-level representations (i.e., not just keywords), for taking advantage of linguistic knowledge, and for specific-purpose ways of producing and assessing the unseen knowledge. The rest of the effort has concentrated on doing text mining from an information retrieval (IR) perspective, so both the representation (keyword based) and the data analysis are restricted.

Using an evolutionary model for KDT allows us to explore integrated approaches to deal with the most important issues in text mining and KDT. In addition, the model's implications for intelligence analysis are highlighted. In particular:
•	Evolutionary computation techniques (Freitas, 2002) of search and optimization are used to search for high-level and structured unseen knowledge in the form of explanatory hypotheses.
•	Different strategies are described so as to automatically evaluate the multiple quality goals to be met by these hypotheses (interestingness, novelty, etc.).
•	Using a prototype system for looking for the best patterns, several experiments with human domain experts are carried out to assess the quality of the model and therefore the discovered patterns.
Overall, the approach puts together information extraction (IE) technology and multi-objective evolutionary computation techniques. It aims at extracting key underlying linguistic knowledge from text
documents (i.e., rhetorical and semantic information) and then hypothesising and assessing interesting and unseen explanatory knowledge. Unlike other approaches to KDT, we do not use additional electronic resources or domain knowledge beyond the text database.
Evolutionary Knowledge Discovery from Texts

Nahm and Mooney (2000) have attempted to bring together general ontologies, IE technology, and traditional machine learning methods to mine interesting patterns. Unlike previous approaches, Mooney deals with a different kind of knowledge (e.g., prediction rules). In addition, an explicit measure of novelty of the mined rules is proposed by establishing semantic distances between rules' antecedents and consequents using the underlying organisation of WordNet. Novelty is then defined as the average (semantic) distance between the words in a rule's antecedent and consequent. A key problem with this is that the method depends highly on WordNet's organisation and idiosyncratic features. As a consequence, since a lot of the information extracted from the documents is not included in WordNet, the predicted rules will lead to misleading decisions about their novelty.

Many approaches to TM / KDT use a variety of different "learning" techniques. Except for cases using machine learning techniques such as neural networks, decision trees, and so on, which have also been used in relational data mining (DM), the real role of "learning" in these systems is not clear. There is no learning which enables the discovery, but instead a set of primitive search strategies which do not necessarily explore the whole search space due to their dependence on the kind of semantic information previously extracted. Despite there being a significant number of successful practical search and optimization techniques (Mitchell, 1996; Deb, 2001), some features make certain techniques more appealing than others for this kind of task, in terms of the representation required, the training sets required, supervision, hypothesis assessment, robustness of the search, and so forth. In particular, the kind of evolutionary computation technique known as genetic algorithms (GA) has proved to be promising for search and optimization purposes. Compared with classical search and optimization algorithms, GAs are much less susceptible to getting stuck in local suboptimal regions of the search space, as they perform global search by exploring solutions in parallel. GAs are robust and able to cope with noisy and missing data, and they can search spaces of hypotheses containing complex interacting parts, where the impact of each part on overall hypothesis fitness may be difficult to model (De Jong, 2006).

In order to use GAs to find optimal values of decision variables, we first need to represent the hypotheses as binary strings (the typical pseudo-chromosomal representation of a hypothesis in traditional GAs). After creating an initial population of strings at random, genetic operations are applied with some probability in order to improve the population. Once a new string is created by the operators, the solution is evaluated in terms of its measure of individual goodness, referred to as fitness. Individuals for the next generation are selected according to their fitness values, which determine those to be chosen for reproduction. If a termination condition is not satisfied, the population is modified by the operators and a new (and hopefully better) population is created. Each iteration of this process is called a generation, and the entire set of generations is called a run. At the end of a run there are often one or more highly fit chromosomes in the population.
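The generic cycle just described — a random initial population of binary strings, fitness evaluation, selection, crossover, mutation, and repeated generations within a run — can be illustrated with a short, self-contained sketch. This is a textbook-style illustration, not the model developed later in this chapter; the parameter values are arbitrary.

```python
import random

# A bare-bones generational GA over binary strings (textbook sketch, not this chapter's model).

def run_ga(fitness, length=20, pop_size=30, generations=50, p_cross=0.8, p_mut=0.01):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):                      # one iteration = one generation
        scored = [(fitness(ind), ind) for ind in pop]
        total = sum(f for f, _ in scored) or 1.0
        def select():                                 # fitness-proportional (roulette) selection
            r, acc = random.uniform(0, total), 0.0
            for f, ind in scored:
                acc += f
                if acc >= r:
                    return ind
            return scored[-1][1]
        nxt = []
        while len(nxt) < pop_size:
            a, b = select()[:], select()[:]
            if random.random() < p_cross:             # one-point crossover
                cut = random.randrange(1, length)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                child[:] = [bit ^ 1 if random.random() < p_mut else bit for bit in child]
                nxt.append(child)
        pop = nxt[:pop_size]
    return max(pop, key=fitness)                      # a highly fit chromosome at the end of the run

print(run_ga(fitness=sum))  # maximise the number of 1-bits ("OneMax")
```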
One of the major contributions of evolutionary algorithms (e.g., GAs) for an important number of DM tasks (e.g., rule discovery, etc.) is that they tend to cope well with attribute interactions.
This is in contrast to the local, greedy search performed by often-used rule induction and decision-tree algorithms (Berthold & Hand, 2000; Han & Kamber, 2001). Most rule induction algorithms generate (prune) a rule by selecting (removing) one rule condition at a time, whereas evolutionary algorithms usually evaluate a rule as a whole via the fitness function rather than evaluating the impact of adding / removing one condition to / from a rule. In addition, operations such as crossover usually swap several rule conditions at a time between two individuals.

One general aspect worth noting in applying GAs to DM tasks is that both the representation used for the discovery and the evaluation carried out assume that the source data are properly represented in a structured form (i.e., a database) in which the attributes and values are easily handled. When dealing with text data, these working assumptions are not always plausible because of the complexity of text information. In particular, mining text data using evolutionary algorithms requires a level of representation which captures knowledge beyond discrete data (i.e., semantics). Thus, there arises the need for new operations to create knowledge from text databases. In addition, fitness evaluation also imposes important challenges in terms of measuring novel and interesting knowledge which might be implicit in the texts or be embedded in the underlying semantics of the extracted data.

We developed a semantically-guided model for evolutionary text mining which is domain-independent but genre-based. Unlike previous approaches to KDT, our approach does not rely on external resources or descriptions, hence its domain-independence. Instead, it performs the discovery using only information from the original corpus of text documents and from the training data generated from them. In addition, a number of strategies have been developed for automatically evaluating the quality of the hypotheses ("novel" patterns). This is an important contribution on a topic which has been neglected in most KDT research over the last years.

In order to deal with issues regarding representation and new genetic operations so as to produce an effective KDT process, our working model has been divided into two phases. The first phase is the preprocessing step, aimed at producing both training information for further evaluation and the initial population of the GA. The second phase constitutes the knowledge discovery itself; in particular, it aims at producing and evaluating explanatory unseen hypotheses. The whole processing starts by performing the IE task (Figure 1), which applies extraction patterns and then generates a rule-like representation for each document of the specific domain corpus.

Figure 1. The evolutionary model for knowledge discovery from texts
After processing a set of n documents, the extraction stage will produce n rules, each one representing the document's content in terms of its conditions and conclusions. Once generated, these rules, along with other training data, become the "model" which will guide the GA-based discovery (see Figure 1). In order to generate an initial set of hypotheses, an initial population is created by building random hypotheses from the initial rules; that is, hypotheses containing predicate and rhetorical information from the rules are constructed. The GA then runs until a fixed number of generations is reached. At the end, a small set of the best hypotheses is obtained. The description of the model is organised as follows: the next section presents the main features of the text preprocessing phase and how the representation for the hypotheses is generated. In addition, the training tasks which generate the initial knowledge (semantic and rhetorical information) to feed the discovery are described. We then describe constrained genetic operations that enable the hypothesis discovery, and propose different evaluation metrics to assess the plausibility of the discovered hypotheses in a multi-objective context.
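To make the rule / hypothesis representation more concrete, the sketch below shows one way such objects could be held in memory. The field names are our assumptions rather than the model's actual data structures, and the sample values echo the abstract analysed in the next section.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative data structures for extracted rules / hypotheses (field names are assumptions).

@dataclass
class Predicate:
    name: str                    # e.g. "analyse"
    arguments: List[str]         # the terms the predicate relates

@dataclass
class RoleFiller:
    role: str                    # rhetorical role, e.g. "goal", "method", "conclusion"
    predicate: Predicate

@dataclass
class Hypothesis:
    conditions: List[RoleFiller] = field(default_factory=list)   # IF part
    conclusions: List[RoleFiller] = field(default_factory=list)  # THEN part

h = Hypothesis(
    conditions=[RoleFiller("goal", Predicate("analyse", ["long-term trends"]))],
    conclusions=[RoleFiller("conclusion", Predicate("improve", ["soil"]))],
)
print(len(h.conditions), len(h.conclusions))
```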
Text Preprocessing and Training

The preprocessing phase has two main goals: to extract important information from the texts and to use that information to generate both training data and the initial population for the GA. In terms of text preprocessing (see the first phase in Figure 1), an underlying principle in our approach is to make good use of the structure of the documents for the discovery process. It is well-known that processing full documents has inherent complexities (Manning & Schutze, 1999), so we have restricted our scope somewhat to consider a scientific genre involving scientific / technical abstracts. These have a well-defined macro-structure (a genre-dependent rhetorical structure) which "summarises" what the author states in the full document (i.e., background information, methods, achievements, conclusions, etc.). Unlike patterns extracted for usual IE purposes, such as in Hearst (1999, 2000) and Jacquemin and Tzoukermann (1999), this macro-structure and its roles are domain-independent but genre-based, so it is relatively easy to translate it into different contexts. As an example, suppose that we are given the abstract of Figure 2, where bold sequences of words indicate the markers triggering the IE patterns. From such a structure, important constituents can be identified:
•	Rhetorical roles (discourse-level knowledge). These indicate important places where the author makes some "assertions" about his or her work (i.e., the author is stating the goals, used methods, achieved conclusions, etc.). In the example, the roles are represented by goal, object, method, and conclusion.
•	Predicate relations. These are represented by actions (predicate and arguments) which are directly connected to the role being identified and state a relation which holds between a set of terms (words which are part of a sentence), a predicate and the role which they are linked to. Thus, for the example, they are as follows: provide('the basic information ...'), analyse('long-term trends ...'), study('lands plot using ...'), improve('soil ...improved after ...').
Figure 2. An abstract and the extracted information
•	Causal relation(s). Although there are no explicit causal relations in the example, we can hypothesise a simple rule of the form:
IF the current goals are G1, G2, ... and the means / methods used are M1, M2, ... (and any other constraint / feature) THEN it is true that we can achieve the conclusions C1, C2, ...
In order to extract this initial key information from the texts, an IE module was built. Essentially, it takes a set of text documents, has them tagged by a previously trained part-of-speech (POS) tagger, and produces an intermediate representation for every document (i.e., a template, in an IE sense) which is then converted into a general rule. A set of hand-crafted, domain-independent extraction patterns was written and coded. In addition, key training data are captured from the corpus of documents itself and from the semantic information contained in the rules. This can guide the discovery process in making further similarity judgments and assessing the plausibility of the produced hypotheses.

•	Training information from the corpus. It has been suggested that huge amounts of text represent a valuable source of semantic knowledge. In particular, in latent semantic analysis (LSA) (Kintsch, 2001), it is claimed that this knowledge is at the word level. LSA is a mathematical technique that generates a high-dimensional semantic space from the analysis of a huge text corpus. It was originally developed in the context of IR (Berry, 2004) and adapted by psycholinguistics for natural-language processing tasks (Landauer, Foltz, & Laham, 1998a).
LSA differs from some statistical approaches for textual data analysis in two significant aspects. First, the input data "associations" from which LSA induces are extracted from unitary expressions of meaningful words and the complete meaningful utterances in which they occur, rather than between successive words (i.e., mutual information, co-occurrence). Second, it has been proposed that LSA constitutes a fundamental computational theory of the acquisition and representation of knowledge, as its underlying mechanism can account for a longstanding and important mystery: the inductive property of learning by which people acquire much more knowledge than appears to be available in experience (Landauer, Laham, & Foltz, 1998b). By keeping track of the patterns of occurrences of words in their corresponding contexts, one might be able to recover the latent structure of the meaning space, that is, the relationship between the meanings of words: the larger and the more consistent their overlap, the closer the meanings.

In order to produce meaning vectors, LSA must be trained with a huge corpus of text documents. The initial data are meaningful passages from these texts and the set of words that each contains. A matrix is then constructed whose rows represent the terms (i.e., keywords) and whose columns represent the documents in which these terms occur. The cells of this matrix are the frequencies with which each word occurs in each document. In order to reduce the effect of words which occur across a wide variety of contexts, these cells are usually multiplied by a global weight based on the frequency of the term in the collection of documents (i.e., a logarithmic entropy term weight) (Berry, 2004). These normalized frequencies are the input to LSA, which transforms them into a high-dimensional semantic space by using a type of principal components analysis called singular value decomposition (SVD), which compresses a large amount of co-occurrence information into a much smaller space. This compression step is somewhat similar to the common feature of neural networks in which a large number of inputs is connected to a fairly small number of hidden-layer nodes: if there are too many nodes, a network will "memorize" the training set, miss the generality of the data, and consequently perform poorly on a test set; otherwise, it will tend to "capture" the underlying features of the input data representation.

LSA has also been successfully applied to an important number of natural-language tasks in which the correlations with human judgments have proved to be promising, including the treatment of synonymy (Landauer et al., 1998b), tutorial dialog management (Graesser, Wiemer-Hastings, & Kreuz, 1999; Wiemer-Hastings & Graesser, 2001), anaphora resolution (Klebanov, 2001), and text coherence measurement (Foltz, Kintsch, & Landauer, 1998). In terms of measuring text coherence, the results have shown that the coherence predictions made by LSA are significantly correlated with other comprehension measures (Dijk & Kintsch, 1983), showing that LSA appears to provide an accurate measure of the comprehension of the texts. In this case, LSA made automatic coherence judgments by computing the similarity (SemSim) between the vectors corresponding to consecutive passages of a text. LSA predicted the comprehension scores of human subjects extremely well, so it provided a characterization of the degree of semantic relatedness between the segments.
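The pipeline just described — a term-document frequency matrix compressed by SVD into a low-dimensional space in which related terms end up close together — can be illustrated with a toy example. This is our own NumPy illustration, not the authors' training setup; the tiny corpus and the choice of k are made up, and the log-entropy weighting is omitted for brevity.

```python
import numpy as np

# Toy LSA: term-document counts -> truncated SVD -> low-dimensional "meaning" vectors.
# (Generic illustration; a real setup would use a large corpus and log-entropy weighting.)

terms = ["soil", "crop", "yield", "retrieval", "index"]
docs = np.array([  # rows = terms, columns = documents (raw frequencies)
    [4, 3, 0, 0],
    [2, 4, 0, 1],
    [3, 2, 0, 0],
    [0, 0, 5, 3],
    [0, 1, 4, 4],
], dtype=float)

U, s, Vt = np.linalg.svd(docs, full_matrices=False)
k = 2                                   # keep only the k largest singular values
term_vectors = U[:, :k] * s[:k]         # compressed semantic space for the terms

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Terms sharing contexts end up close in the reduced space.
print(cosine(term_vectors[0], term_vectors[1]))   # soil vs crop: high
print(cosine(term_vectors[0], term_vectors[3]))   # soil vs retrieval: low
```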
Following work by Kintsch (2001) on LSA incorporating structure, we have designed a semi-structured LSA representation for text data in which we represent predicate information (i.e., verbs) and arguments (i.e., set of terms) separately once they have been properly extracted in the IE phase. For this, the similarity is calculated by computing the closeness between two predicates (and arguments) based on the LSA data (function SemSim(P1(A1), P2(A2))).
We propose a simple strategy for representing the meaning of predicates with arguments, and then a simple method for measuring the similarity between these units. Given a predicate P and its argument A, the vectors representing the meaning of both of them can be directly extracted from the training information provided by the LSA analysis. Representing the argument involves summing all the vectors representing the terms of the argument and then averaging them, as is usually done in semi-structured LSA. Once this is done, the meaning vector of the predicate and the argument is obtained by computing the sum of the two vectors, as in Wiemer-Hastings (2000). If there is more than one argument, the final argument vector is just the sum of the individual arguments' vectors.

Note that training information from the texts is not sufficient, as it only conveys data at the level of word semantics. We claim that basic knowledge at the rhetorical and semantic levels, together with co-occurrence information, can be effectively computed to feed the discovery and to guide the GA. Accordingly, we perform two kinds of tasks: creating the initial population and computing training information from the rules.

1.	Creating the initial population of hypotheses. Once the initial rules have been produced, their components (rhetorical roles, predicate relations, etc.) are isolated and become a separate "database." This information is used both to build the initial hypotheses and to feed the further genetic operations (i.e., mutation of roles will need to randomly pick a role from this database).
2.	Computing training information, in which two kinds of training data are obtained:
	a.	Computing correlations between rhetorical roles and predicate relations. The connection between rhetorical information and the predicate action constitutes key information for producing coherent hypotheses. For example, is, in some domain, the goal of some hypothesis likely to be associated with the construction of some component? In a health context, this connection would be less likely than having "finding a new medicine for ..." as a goal. In order to address this issue, we adopted a Bayesian approach in which we obtain the conditional probability of some predicate p given some attached rhetorical role r, namely Prob(p|r). These probability values are later used to automatically evaluate some of the hypotheses' criteria.
	b.	Computing co-occurrences of rhetorical information. One can think of a hypothesis as an abstract whose text paragraphs are semantically related to each other. Consequently, the meaning of the scientific evidence stated in the abstract may subtly change if the order of the facts is altered. This suggests that in generating valid hypotheses there will be rule structures which are more or less desirable than others. For instance, if every rule contains a "goal" as the first rhetorical role, and the GA has generated a hypothesis starting with some "conclusion" or "method," it will be penalized, and therefore it is very unlikely to survive into the next generation. Since the order matters in terms of affecting the rule's meaning, we can think of the p roles of a rule as a sequence of tags (r1, ..., rp) such that ri precedes ri+1, and so we generate, from the rules, the conditional probabilities Prob(rp|rq) for every pair of roles rp, rq. The probability that rq precedes rp will be used in evaluating new hypotheses, in terms of, for instance, their coherence.
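Read literally, the strategy for SemSim(P1(A1), P2(A2)) can be sketched as follows. This is our illustration; lsa_vector stands for a lookup into the trained LSA space, and the tiny vector space at the bottom exists only to make the example executable.

```python
import numpy as np

# Sketch of SemSim(P1(A1), P2(A2)) as described above (illustrative; lsa_vector is an
# assumed lookup into a previously trained LSA space).

def argument_vector(terms, lsa_vector):
    vecs = [lsa_vector(t) for t in terms]
    return np.mean(vecs, axis=0)                 # average the term vectors of one argument

def predicate_vector(predicate, arguments, lsa_vector):
    arg = np.sum([argument_vector(a, lsa_vector) for a in arguments], axis=0)
    return lsa_vector(predicate) + arg           # sum of predicate vector and argument vector(s)

def sem_sim(p1, args1, p2, args2, lsa_vector):
    v1 = predicate_vector(p1, args1, lsa_vector)
    v2 = predicate_vector(p2, args2, lsa_vector)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Tiny fake LSA space just to make the sketch executable.
space = {"analyse": np.array([0.9, 0.1]), "study": np.array([0.8, 0.3]),
         "trends": np.array([0.7, 0.2]), "plots": np.array([0.6, 0.4])}
lookup = lambda w: space.get(w, np.array([0.1, 0.1]))
print(sem_sim("analyse", [["trends"]], "study", [["plots"]], lookup))
```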
Evaluation of Discovered Patterns

Our approach to KDT is strongly guided by semantic and rhetorical information, and consequently there are some soft constraints to be met before producing the offspring, so as to keep them coherent. The GA starts from an initial population, which in this case is a set of semi-random hypotheses built up from the preprocessing phase. Next, constrained GA operations are applied and the hypotheses are evaluated. In order for every individual to have a fitness assigned, we use an evolutionary multi-objective optimisation strategy based on the SPEA algorithm (Zitzler & Thiele, 1998), in a way which allows incremental construction of a Pareto-optimal set and uses a steady-state strategy for the population update. For semantic constraints, judgments of similarity between hypotheses or components of hypotheses (i.e., predicates, arguments, etc.) are carried out using the LSA training data and the predicate-level information previously discussed in the training step.
Patterns Discovery

Using the semantic measure and additional constraints discussed later on, we propose new operations to allow guided discovery such that unrelated new knowledge is avoided, as follows:
•	Selection: selects a small number of the best parent hypotheses of every generation (generation gap) according to their Pareto-based fitness (Deb, 2001).
•	Crossover: a simple recombination of both hypotheses' conditions and conclusions takes place, in which two individuals swap their conditions to produce new offspring (the conclusions remain). Under normal circumstances, crossover works on random parents and positions where their parts should be exchanged. However, in our case this operation must be restricted to preserve semantic coherence. We use soft semantic constraints to define two kinds of recombination:
	1.	Swanson's crossover. Based on Swanson's hypothesis (Swanson, 1988, 2001), we propose a recombination operation as follows: if there is a hypothesis (AB) such that "IF A THEN B" and another one (BC) such that "IF B' THEN C" (B' being something semantically similar to B), then a new interesting hypothesis "IF A THEN C" can be inferred, but only if the conclusions of AB have high semantic similarity (i.e., via LSA) with the conditions of hypothesis BC. This principle, applied to two learned hypotheses, is illustrated in Figure 3; a small computational sketch of the similarity check appears after this list.

	Figure 3. Semantically guided Swanson crossover

	2.	Default semantic crossover. If the previous transitivity does not apply, then the recombination is performed as long as both hypotheses as a whole have high semantic similarity, which is defined in advance by providing minimum thresholds.
•	Mutation: aims to make small random changes to hypotheses to explore new possibilities in the search space. As in recombination, we have dealt with this operation in a constrained way, so we propose three kinds of mutation to deal with the hypotheses' different objects:
	1.	Role mutation. One rhetorical role (including its contents: relations and arguments) is selected and randomly replaced by a random one from the initial role database.
	2.	Predicate mutation. One inner predicate and argument is selected and randomly replaced by another from the initial predicate database.
	3.	Argument mutation. Since we have no information about the arguments' semantic types, we choose a new argument by following a guided procedure in which we select predicate relations and then arguments at random according to a measure of semantic similarity via LSA (Wiemer-Hastings, 2000).
•	Population update: we use a non-generational GA in which some individuals are replaced by the new offspring in order to preserve the hypotheses' good material from one generation to another, and so to encourage the improvement of the population's quality. We use a steady-state strategy in which each individual from a small number of the worst hypotheses is replaced by an individual from the offspring only if the latter is better than the former.
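As noted above, a minimal sketch of the Swanson-style recombination check might look as follows. This is our illustration, not the prototype's code; sem_sim stands for the LSA-based similarity from the training phase, and the threshold value is an assumption.

```python
# Sketch of the semantically guided Swanson crossover check (threshold value assumed).

SWANSON_THRESHOLD = 0.8  # assumed minimum LSA similarity

def swanson_crossover(hyp_ab, hyp_bc, sem_sim, threshold=SWANSON_THRESHOLD):
    """If the conclusions of "IF A THEN B" are close to the conditions of "IF B' THEN C",
    propose the new hypothesis "IF A THEN C"; otherwise signal that it does not apply."""
    similarity = sem_sim(hyp_ab["conclusions"], hyp_bc["conditions"])
    if similarity < threshold:
        return None
    child = {"conditions": hyp_ab["conditions"],
             "conclusions": hyp_bc["conclusions"],
             "plausibility": similarity}          # kept for the plausibility-of-origin criterion
    return child

# Usage with a stand-in similarity function:
ab = {"conditions": ["A"], "conclusions": ["B"]}
bc = {"conditions": ["B'"], "conclusions": ["C"]}
print(swanson_crossover(ab, bc, sem_sim=lambda x, y: 0.9))
```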
Patterns Evaluation

Since each hypothesis in our model has to be assessed by different criteria, the usual methods for evaluating fitness are not appropriate. Hence, evolutionary multi-objective optimisation (EMOO) techniques which use the multiple criteria defined for the hypotheses are needed. Accordingly, we propose EMOO-based evaluation metrics to assess the hypotheses' fitness in a domain-independent way and, unlike other approaches, without using any external source of domain knowledge. The different metrics are represented by multiple criteria by which the hypotheses are assessed. In order to establish the evaluation criteria, we have taken into account different issues concerning plausibility (Is the hypothesis semantically sound? Are the GA operations producing something coherent in the current hypothesis?) and quality itself (How well is the hypothesis supported by the initial text documents? How interesting is it?). Accordingly, we have defined eight evaluation criteria to assess the hypotheses (i.e., in terms of Pareto dominance, this produces an eight-dimensional vector of objective functions): relevance, structure, cohesion, interestingness, coherence, coverage, simplicity, and plausibility of origin. The current hypothesis to be assessed will be denoted as H, and the training rules as Ri. The evaluation methods (criteria) by which the hypotheses are assessed, and the questions they are trying to address, are as follows (a small computational sketch of some of these criteria follows the list):
•
Relevance. Relevance addresses the issue of how important the hypothesis is to the target concepts. This involves two concepts (i.e., terms), as previously described, related to the question:
What is the best set of hypotheses that explain the relation between and ?
Considering the current hypothesis, it turns into a specific question: How good is the hypothesis in explaining this relation? This can be estimated by determining the semantic closeness between the hypothesis’ predicates (and arguments) and the target concepts1 by using the meaning vectors obtained from the LSA analysis for both terms and predicates. Our method for assessing relevance takes these issues into account along with some ideas of Kintsch’s Predication. Specifically, we use the concept of strength (Kintsch, 2001): strength(A,I) = f (SemSim(A, I), SemSim(P, I))) between a predicate with arguments and surrounding concepts (target terms in our case) as a part of the relevance measure, which basically decides whether the predicate (and argument) is relevant to the target concepts in terms of the similarity between both predicate and argument, and the concepts. In order to account for both target terms, we just take the average of strength (Str) for both terms. So, the overall relevance becomes: |H| Str[Pi,Ai,term1]+Str[Pi,Ai,term2]
relevance(H)= (1/2) ∑
•
in which |H| denotes the length of the hypothesis H, that is, the number of predicates. Structure (How good is the structure of the rhetorical roles? ). This measures how much of the rules’ structure is exhibited in the current hypothesis. Since we have previous pre-processed information for bi-grams of roles, the structure can be computed by following a Markov chain (Manning & Schutze, 1999) as follows:
|H|
i=1
|H|
Structure(H)= Prob(r1)*
where ri represents the i-th role of the hypothesis H, Prob(ri│ri-1) denotes the conditional probability that role ri-1 immediately precedes ri. Prob(ri) denotes the probability that no role precedes ri, that is, it is at the beginning of the structure (i.e., Prob(ri│)). Cohesion (How likely is a predicate action to be associated with some specific rhetorical role? ). This measures the degree of “connection” between rhetorical information (i.e., roles) and predicate actions. The issue here is how likely (according to the rules) some predicate relation P in the current hypothesis is to be associated with role r. Formally, cohesion for hypothesis H is expressed as: Prob[P |r ] i i cohesion(H)= ∑ |H| r ,P ∈H
•
48
i
i=2
Prob(ri|ri-1)
i
where Prob(Pi│ri ) states the conditional probability of the predicate Pi given the rhetorical role ri.
Intelligent Text Mining
•
Interestingness (How interesting is the hypothesis in terms of its antecedent and consequent?). Unlike other approaches to measure “interestingness” which use an external resource (e.g., WordNet) and rely on its organisation, we propose a different view where the criterion can be evaluated from the semi-structured information provided by the LSA analysis. Accordingly, the measure for hypothesis H is defined as a degree of unexpectedness as follows:
interestingness(H)=
That is, the lower the similarity, the more interesting the hypothesis is likely to be. Otherwise, it means the hypothesis involves a correlation between its antecedent and consequent which may be an uninteresting known common fact (Nahm & Mooney, 2002). Coherence. This metrics addresses the question whether the elements of the current hypothesis relate to each other in a semantically coherent way. Unlike rules produced by DM techniques in which the order of the conditions is not an issue, the hypotheses produced in our model rely on pairs of adjacent elements which should be semantically sound, a property which has long been dealt with in the linguistic domain, in the context of text coherence (Foltz et al., 1998). Semantic coherence is calculated by considering the average semantic similarity between consecutive elements of the hypothesis. However, note that this closeness is only computed on the semantic information that the predicates and their arguments convey (i.e., not the roles) as the role structure has been considered in a previous criterion. Accordingly, the criterion can be expressed as follows:
•
•
[|H|-1] SemSim [Pi[Ai],Pi+1 [Ai+1 ]] Coherence(H)= ∑ (|H|-1) i=1 where (|H|-1) denotes the number of adjacent pairs, and SemSim is the LSA-based semantic similarity between two predicates. Coverage. The coverage metric tries to address the question of how much the hypothesis is supported by the model (i.e., rules representing documents and semantic information). Coverage of a hypothesis has usually been measured in KDD approaches by considering some structuring in data (i.e., discrete attributes) which is not present in textual information. Besides, most of the KDD approaches have assumed the use of linguistic or conceptual resources to measure the degree of coverage of the hypotheses (i.e., match against databases, positive examples). In order to deal with the criterion in the context of KDT, we say that a hypothesis H covers an extracted rule Ri only if the predicates of H are roughly (or exactly, in the best case) contained in Ri. Formally, the rules covered are defined as:
RulesCovered(H) = { Ri ∈ RuleSet | ∀Pj ∈ Ri, ∃HPk ∈ HP : SemSim(HPk, Pj) ≥ threshold ∧ predicate(HPk) = predicate(Pj) }

where SemSim(HPk, Pj) represents the LSA-based similarity between hypothesis predicate HPk and rule predicate Pj, threshold denotes a minimum fixed user-defined value, RuleSet denotes the whole set of rules, HP represents the list of predicates (with arguments) of H, and Pj represents a
predicate (with arguments) contained in Ri. Once the set of rules covered is computed, the criterion can finally be computed as:

Coverage(H) = |RulesCovered(H)| / |RuleSet|
where |RulesCovered(H)| and |RuleSet| denote the size of the set of rules covered by H and the size of the initial set of extracted rules, respectively.
• Simplicity (How simple is the hypothesis?). Shorter and/or easier-to-interpret hypotheses are preferred. Since the criterion has to be maximised, the evaluation will depend on the length (number of elements) of the hypothesis.
• Plausibility of origin (How plausible is the hypothesis produced by Swanson's evidence?). If the current hypothesis is an offspring of parents which were recombined by a Swanson's transitivity-like operator, then the higher the semantic similarity between one parent's consequent and the other parent's antecedent, the more precise the evidence is, and consequently the more the offspring is worth exploring as a novel hypothesis. If no better hypothesis is found so far, the current similarity is inherited from one generation to the next. Accordingly, plausibility for a hypothesis H is simply given by:
Plausibility(H) = Sp   if H was created from a Swanson's crossover
Plausibility(H) = 0    if H is in the original population or is the result of another operation

where Sp denotes the semantic similarity between one parent's consequent and the other parent's antecedent.
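To make two of these measures concrete, the following sketch computes Structure(H) and Cohesion(H) from pre-computed probability tables. This is an illustrative rendering, not the authors' implementation: the representation of a hypothesis as a list of (role, predicate) pairs, the table names, and the example numbers are all assumptions.

# Illustrative sketch of Structure(H) and Cohesion(H); data layout and
# probability values are assumptions, not the chapter's implementation.

def structure(hypothesis, start_prob, bigram_prob):
    """Markov-chain score over the sequence of rhetorical roles.
    hypothesis: list of (role, predicate) pairs
    start_prob[r]: Prob(r) of role r starting a structure
    bigram_prob[(prev, cur)]: Prob(cur | prev) for role bigrams
    """
    roles = [role for role, _ in hypothesis]
    score = start_prob.get(roles[0], 0.0)
    for prev, cur in zip(roles, roles[1:]):
        score *= bigram_prob.get((prev, cur), 0.0)
    return score

def cohesion(hypothesis, pred_given_role):
    """Average Prob(P_i | r_i) over the elements of the hypothesis."""
    probs = [pred_given_role.get((pred, role), 0.0)
             for role, pred in hypothesis]
    return sum(probs) / len(hypothesis)

# Example usage with made-up probabilities:
H = [("goal", "relate"), ("method", "apply"), ("conclusion", "improve")]
start = {"goal": 0.5}
bigrams = {("goal", "method"): 0.4, ("method", "conclusion"): 0.6}
pred_role = {("relate", "goal"): 0.3, ("apply", "method"): 0.5,
             ("improve", "conclusion"): 0.2}
print(structure(H, start, bigrams), cohesion(H, pred_role))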
Note that since we are dealing with a multi-objective problem, there is no simple way to get independent fitness values, as the fitness involves a set of objective functions to be assessed for every individual. Therefore the computation is performed by comparing the objectives of one individual with those of others in terms of Pareto dominance (Deb, 2001), in which non-dominated solutions (Pareto individuals) are searched for in every generation. Next, since our model is based on a multi-criteria approach, we have to face three important issues in order to assess every hypothesis' fitness: Pareto dominance, fitness assignment, and the diversity problem (Deb, 2001). Despite an important number of state-of-the-art methods to handle these issues (Deb, 2001), only a small number of them have focused on the problem in an integrated and representation-independent way. In particular, Zitzler and Thiele (1998) propose an interesting method, the strength Pareto evolutionary algorithm (SPEA), which uses a mixture of established methods and new techniques in order to find multiple Pareto-optimal solutions in parallel, and at the same time to keep the population as diverse as possible. We have also adapted the original SPEA algorithm, which uses an elitist strategy, to allow for the incremental updating of the Pareto-optimal set along with our steady-state replacement method.
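Since fitness assignment hinges on Pareto dominance, the following sketch shows a dominance test and a naive extraction of the non-dominated set. It is a simplified illustration, not the SPEA code used by the authors; the objective vectors are assumed to be maximised.

# Pareto dominance and non-dominated filtering (simplified illustration;
# objective vectors are assumed to be maximised).

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (maximisation)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(population):
    """Return the non-dominated objective vectors in the population."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

# Example: hypothetical vectors of (structure, cohesion, coverage) scores.
pop = [(0.6, 0.2, 0.4), (0.5, 0.3, 0.4), (0.4, 0.1, 0.3)]
print(pareto_front(pop))   # the third vector is dominated by the first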
Analysis and Results

In order to assess the quality of the discovered knowledge (hypotheses) produced by the model, a Prolog-based prototype has been built. The IE task has been implemented as a set of modules whose main outcome is the set of rules extracted from the documents. In addition, an intermediate training module is responsible for generating information from the LSA analysis and from the rules just produced. The initial rules are represented by facts containing lists of relations for both the antecedent and the consequent.
Figure 4. GA evaluation for some of the criteria
For the purpose of the experiments, the corpus of documents has been obtained from a database for agricultural and food science. We selected this kind of corpus because it has been properly cleaned up, and because it belongs to a scientific area about which we have no prior knowledge, so as to avoid any possible bias and to make the results more realistic. A set of 1,000 documents was extracted, of which one-third were used for setting parameters and making general adjustments, and the rest were used for the GA itself in the evaluation stage. Next, we tried to provide answers to two basic questions concerning our original aims: How well does the GA for KDT behave? How good are the hypotheses produced, according to human experts, in terms of text mining's ultimate goals: interestingness, novelty, usefulness, and so forth? In order to address these issues, we used a methodology consisting of two phases: the system evaluation and the experts' assessment.
1. System evaluation. This aims at investigating the behavior and the results produced by the GA. We set up the GA by generating an initial population of 100 semi-random hypotheses. In addition, we defined the main global parameters such as the mutation probability (0.2), the crossover probability (0.8), the maximum size of the Pareto set (5%), and so forth. We ran five versions of the GA with the same configuration of parameters but different pairs of terms to address the quest for explanatory novel hypotheses. The results obtained from running the GA in our experiment are shown as a representative behavior in Figure 4, where the number of generations is plotted against the average objective value for some of the eight criteria. Some interesting facts can be noted. Almost all the criteria seem to stabilise after (roughly) generation 700 for all the runs; that is, no further improvement beyond this point is achieved, and so this may give us an approximate indication of the limits of the objective function values.
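For orientation only, the skeleton below wires together the global parameters reported above (population of 100, mutation 0.2, crossover 0.8, Pareto set capped at 5%, runs up to 1,000 generations). The helper functions (init_population, evaluate, spea_step) are placeholders and not the prototype's actual interface; this is a sketch of the experimental loop, not the authors' system.

# Skeleton of the experiment set-up described above; all helper names are
# placeholders standing in for the Prolog-based prototype's components.
POP_SIZE = 100
P_MUTATION = 0.2
P_CROSSOVER = 0.8
MAX_PARETO_FRACTION = 0.05
GENERATIONS = 1000

def run_ga(seed_terms, init_population, evaluate, spea_step):
    population = init_population(POP_SIZE, seed_terms)
    archive = []        # external Pareto-optimal set (elitist strategy)
    history = []
    for _gen in range(GENERATIONS):
        scores = [evaluate(h) for h in population]   # eight criteria per hypothesis
        population, archive = spea_step(population, scores, archive,
                                        P_MUTATION, P_CROSSOVER,
                                        int(MAX_PARETO_FRACTION * POP_SIZE))
        history.append(scores)   # averages reportedly stabilise around generation 700
    return archive, history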
Table 1. Analysis of the behavior of the GA with different parameters

Run   Pm     Pc    AvgFitness  Std.Dev  Min.Fit.  Max.Fit.
1     0.025  0.50  0.0911      0.0790   0.0099    0.2495
2     0.075  0.50  0.0833      0.0746   0.0099    0.2495
3     0.125  0.50  0.0934      0.0746   0.0099    0.2495
4     0.175  0.50  0.0934      0.0740   0.0099    0.2297
5     0.2    0.50  0.0799      0.0701   0.0099    0.2297
6     0.025  0.60  0.0625      0.0601   0.0099    0.2188
7     0.075  0.60  0.0725      0.0600   0.0099    0.2188
8     0.125  0.60  0.0623      0.0602   0.0099    0.2188
9     0.175  0.60  0.0625      0.0600   0.0099    0.2188
10    0.2    0.60  0.0602      0.0583   0.0099    0.2188
11    0.025  0.70  0.0323      0.0617   0         0.2495
12    0.075  0.70  0.0358      0.0622   0         0.2495
13    0.125  0.70  0.0358      0.0619   0         0.2495
14    0.175  0.70  0.0316      0.0619   0         0.2495
15    0.2    0.70  0.0301      0.0958   0         0.4950
16    0.025  0.80  0.0230      0.0556   0         0.2495
17    0.075  0.80  0.0329      0.0553   0         0.2495
18    0.125  0.80  0.0240      0.0567   0         0.2495
19    0.175  0.80  0.0221      0.0543   0         0.2495
20    0.2    0.80  0.0209      0.0470   0         0.1881
Table 2. Pairs of target terms used for the actual experiments

Run  Term 1          Term 2
1    enzyme          zinc
2    glycocide       inhibitor
3    antinutritious  cyanogenics
4    degradation     erosive
5    cyanogenics     inhibitor
Another aspect worth highlighting is that, despite the steady-state strategy used by the model to produce solutions, the individual evaluation criteria behave in unstable ways to accommodate solutions which had to be removed or added. As a consequence, it is not necessarily the case that all the criteria have to increase monotonically. In order to see this behavior, look at the results for the criteria over the same period of time, between generations 200 and 300, for Run 4. For an average hypothesis, the quality of coherence, cohesion, simplicity, and structure gets worse, whereas it improves for coverage, interestingness, and relevance, and shows some variation for plausibility. The quality of the search process can also be analyzed by observing the typical behavior of the GA in terms of the performance of the genetic operators in generating fit solutions, its robustness (i.e., does it always find good solutions?), and the quality of the hypotheses in terms of the objective functions. In terms of genetic operators, the aim was to investigate how sensitive the GA is to different parameter values. Because of the large number of parameter combinations, we concentrated on the probabilities of crossover and mutation only, in terms of the fitness of the produced solutions. Note that because of the nature of the SPEA-based strategy, low fitness values are desired. Test parameter values were established as shown in Table 1 for 20 runs of the GA, each up to 1,000 generations, with an initial population of 100 hypotheses. Here, different probabilities of mutation (m) and crossover (c) are tested, and the resulting average fitness of the population, its standard deviation, and the minimum and maximum values of fitness are shown (the rest of the parameters remain the same). The parameters were systematically tested in steps of approximately 5% (starting from 0.025) for m, and 10% (starting from 0.50) for c. The final range for m is from 0.025 to 0.20, whereas for c it is from 0.50 to 0.80. Thus, the table shows the different settings involved in moving through the range for m while fixing a value for c. For example, the first five runs fix c and test different values of m. Some aspects of the resulting values are worth highlighting:
• Although finding good solutions is no guarantee that the search process is effective, because human judgment is not considered, the GA seems to be able to find good hypotheses, that is, individuals with fitness zero or close to zero.
• Because of the constrained genetic operators, small changes in the parameter values do not have a significant effect on the best obtained fitness.
• Although higher values of m and c might improve the overall performance of the GA by decreasing the population fitness values, sometimes the maximum fitness values tend to increase despite the overall improvement of the population (see Runs 11 to 19, compared to Runs 6 to 10).
• As the parameter values increase, there is a tendency for the minimum fitness to decrease. However, note that because of the multi-objective nature of the model, having low (or zero) fitness values between Runs 11 and 20 does not necessarily imply that there are no changes in individual criteria of the best solutions. Indeed, considering individual objective values, the best solutions may be those with the lowest fitness values.
• Sudden peaks (e.g., the average fitness of Runs 3, 7, 12, etc.) can also be explained by decisions on dominance (e.g., some less fit solutions leaving the Pareto set).
This analysis shows that increases in both mutation and crossover can have a positive effect on the quality of the solutions.
2. Expert assessment. This aims at assessing the quality (and therefore the effectiveness) of the discovered knowledge on different criteria by human domain experts. For this, we designed an experiment in which 20 human experts were involved, each assessing five hypotheses selected from the Pareto set. We then asked the experts to assess the hypotheses from 1 (worst) to 5 (best) in terms of the following criteria: interestingness (INT), novelty (NOV), usefulness (USE), sensibleness (SEN), etc.
In order to select worthwhile terms for the experiment, we asked one domain expert to filter pairs of target terms previously related according to a traditional clustering analysis (see Table 2, which contains the target terms used in the experiments). The pairs which finally deserved attention were used as input in the actual experiments (e.g., degradation and erosive).
Figure 5. Distribution of experts' assessment of hypotheses per criterion
Once the system hypotheses were produced, the experts were asked to score them according to the five subjective criteria. Next, we calculated the scores for every criterion, as seen in the overall results in Figure 5 (for length's sake, only some criteria are shown). The assessment of individual criteria shows that some hypotheses did well, with scores above the average (50%) on the 1-5 scale. Overall, this supports the claim that the model is indeed able to find nuggets in textual information and to provide some basic explanation about the hidden relationships in these discoveries. This is the case for Hypotheses 11, 16, and 19 in terms of INT, Hypotheses 14 and 19 in terms of SEN, Hypotheses 1, 5, 11, 17, and 19 in terms of USE, and Hypothesis 24 in terms of NOV, and so forth. These results and the evaluation produced by the model were used to measure the correlation between the scores of the human subjects and the system's model evaluation. Since both the experts and the system's model evaluated the results considering several criteria, we first performed a normalisation aimed at producing a single "quality" value for each hypothesis. We then calculated the pair of values for every hypothesis and obtained a (Spearman) correlation r = 0.43 (t-test = 23.75, df = 24, p

0 and −1 otherwise. In some cases, it is not easy to find such a hyperplane in the original data space, in which case the original data space has to be transformed into a higher dimensional space by applying kernels. In this work, we focused on SVM without kernels. In order to employ SVM for multi-class classification, we first conducted pairwise classification (Furnkranz, 2001) and then decided a document's class by pairwise coupling (Hastie & Tibshirani, 1998). In addition, sequential minimal optimization (SMO) (Platt, 1999), a fast nonlinear optimization method, was employed during the training process to accelerate training. The implementation of SVM discussed above is based on the Weka package (Witten & Frank, 2005).
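As a rough illustration of the one-vs-one strategy described above (pairwise classifiers combined to decide a document's class), the sketch below uses scikit-learn with a linear SVM rather than the Weka SMO implementation the chapter relies on; it is an analogous setup, not the authors' code, and the toy documents and labels are invented for the example.

# Analogous multi-class setup in scikit-learn (the chapter used Weka's SMO):
# tf-idf features, linear SVM, and explicit one-vs-one pairwise classification.
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsOneClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = ["course syllabus fall schedule exam grading",
        "partial syllabus with link to schedule page",
        "entry page listing links to course syllabi",
        "unrelated noise page about something else"]
labels = ["full", "partial", "entry", "noise"]

clf = make_pipeline(TfidfVectorizer(),
                    OneVsOneClassifier(LinearSVC()))  # one classifier per class pair
clf.fit(docs, labels)
print(clf.predict(["syllabus with schedule, textbook and exam dates"]))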
EVALUATION

Performance Measures

We employed F1 as the main performance measure. F1 is a measure that trades off precision and recall to provide an overall measure of classification performance. For each class on a training set, the definitions of the measures are:
• Precision: the percentage of the correctly classified positive examples among all the examples classified as positive.
• Recall: the percentage of the correctly classified positive examples among all the positive examples.
• F1: 2 * Precision * Recall / (Precision + Recall).
A higher F1 value indicates better classification performance. Since we used thousands of settings in the experiment, we employed several average measures to facilitate our analysis. The uses and formulas of these aggregated average measures are shown in Table 3. These aggregated measures will be used in later graph comparison to illustrate the effects of different experimental settings.
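The sketch below restates the three measures for a single class from predicted and true labels; the label values and the example are invented for illustration.

# Per-class precision, recall and F1 (a direct restatement of the definitions above).
def prf1(true_labels, pred_labels, target):
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if p == target and t == target)
    fp = sum(1 for t, p in pairs if p == target and t != target)
    fn = sum(1 for t, p in pairs if p != target and t == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1(["full", "noise", "full"], ["full", "full", "noise"], "full"))
# precision = 0.5, recall = 0.5, F1 = 0.5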
Results and Discussions

We conducted analyses based on the average metrics shown in Table 3; the results are summarized in five primary findings below.
1. The impact of class distribution on training data: Figure 1 shows the impact of class distribution measured by tridsjfklavgF1. The mean of the results is 0.56 and the standard deviation is 0.03. The best setting is the one with the largest training size, the original class distribution, and the most features, selected by the hybrid feature selection method with DF = 10 (tr9ds0f01avgF1 = 0.63). The worst is the one with the smallest training size, the original class distribution, and the fewest features, selected by the genre feature selection method (tr0ds0f21avgF1 = 0.46). Furthermore, although the performance variation of each setting on average is not significant, 67% of settings perform better
Table 3. Average measures

tridsjfklcpavgF1
  Use: Measure the stable performance of a setting with the ith size, the jth distribution, the klth feature selection method, and the pth class. The details of the settings are in Table 4.
  Formula: N/A

tridsjfklavgF1
  Use: Measure the impact of re-sampling towards uniform class distributions on different training sizes and feature selection methods
  Formula: 1/4 ∑p tridsjfklcpavgF1

dsjcpavgF1
  Use: Measure the impact of re-sampling towards uniform class distributions on different classes
  Formula: 1/10 ∑i 1/7 ∑k tridsjfklcpavgF1

trifkavgF1
  Use: Measure the impact of training sizes on feature selection methods
  Formula: 1/2 ∑j 1/4 ∑p 1/Nk ∑l tridsjfklcpavgF1

fklavgF1
  Use: Measure the impact of different DF thresholds on classification performance
  Formula: 1/10 ∑i 1/2 ∑j 1/4 ∑p tridsjfklcpavgF1

dsjfkcpavgF1
  Use: Measure the impact of different feature selection methods on different classes with a varied or uniform distribution
  Formula: 1/10 ∑i 1/Nk ∑l tridsjfklcpavgF1
Table 4. A variety of settings for training data sets

Setting  Explanation
tri      Training sets are of ten different sizes, increasing with the orderings
dsj      0: original distribution; 1: uniform distribution
fkl      01: hybrid feature selection with DF=10; 02: hybrid feature selection with DF=20; 03: hybrid feature selection with DF=30; 11: general feature selection with DF=10; 12: general feature selection with DF=20; 13: general feature selection with DF=30; 21: genre feature selection
cp       0: full syllabus; 1: partial syllabus; 2: entry page; 3: noise page
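As an example of how the aggregated measures in Table 3 roll up the per-setting scores, the sketch below computes tridsjfklavgF1 by averaging over the four classes; the results dictionary and its F1 values are hypothetical and only serve to illustrate the averaging.

# Hypothetical results store: F1 keyed by (size i, distribution j,
# feature method kl, class p). The averaging mirrors Table 3's 1/4 * sum
# over the four classes.
results = {
    (9, 0, "01", 0): 0.70, (9, 0, "01", 1): 0.55,
    (9, 0, "01", 2): 0.62, (9, 0, "01", 3): 0.65,
}

def tr_ds_f_avg_f1(i, j, kl, results, num_classes=4):
    return sum(results[(i, j, kl, p)] for p in range(num_classes)) / num_classes

print(tr_ds_f_avg_f1(9, 0, "01", results))  # 0.63 for these made-up values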
Figure 1. Classification performance measured by F1 on settings with different training data distribution. hybridi: hybrid feature selection with DF = i × 10; generali: DF as the feature selection method and DF = i × 10; genre: manual selection of features specific to the syllabus genre. 10 data items are in each category, which represent performance with respect to 10 different training data sizes.
after re-sampling towards a uniform class distribution. However, if only the settings with large training sizes such as tr8 and tr9 are considered, 71% of settings perform worse after the re-sampling. In addition, we observed that 60% of settings with the genre feature selection method perform worse after re-sampling. Therefore, the class distribution on a training set has no impact on classification performance when the training data size is large enough to provide sufficient samples for each class, or is far larger than the number of features.
Figure 2. Classification performance varied with classes
2. The performance with respect to each class: Figure 2 summarizes our further investigation of the performance of each class, as measured by dsjcpavgF1. With a uniform class distribution on training sets, a performance of 0.66 can be achieved for full syllabi, which is 65% better than the performance for partial syllabi. Performance figures for the entry pages and the noise pages are close, both at around 0.60. Before re-sampling, the performance pattern on these four categories is the same, but performance on full syllabi and partial syllabi differs even more. Therefore, on average, full syllabi are much easier to classify than partial syllabi by means of our training strategies. Furthermore, our classifiers favored classes with more examples in a training set. Figure 2 shows that uniform re-sampling is beneficial when the sample size is small, as in the case of partial, entry and noise pages. However, its usage hurts performance when training data is large, as in the full syllabi case.
3. The correlation of feature selection methods and training size: We compared settings with different feature selection methods and different training sizes. Figure 3 shows the results measured by trifkavgF1. On average, the settings with the hybrid feature selection methods perform 1.8% better than those with the general feature selection methods and 7.6% better than those with the genre feature selection method. While the results imply that the larger the training size, the better the classification performance in general, the variations of training sizes nevertheless have dissimilar impact on different feature selection methods. As far as the settings with the hybrid and general feature selection methods are concerned, their performances present the larger-size-better-performance trend, especially when the training sizes are small (tr0, tr1, and tr2) or large (tr6, tr7, tr8, and tr9). However, there is not much improvement with respect to the genre feature selection method, especially considering tr4 to tr9. It is important to note that the number of features from the hybrid or general feature selection methods is larger than the maximum size of our training set, and the sizes of training sets tr4 to tr9 are far larger than the number of genre features. Therefore, the settings with hybrid or general features might perform better given more training data. On the other hand, given a small size of training data, the genre feature selection method can achieve similar performance as the general feature selection methods but with fewer computational resources.

Figure 3. Classification performance varied with training sizes and feature selection methods. tri: the ith training set. The sizes of training sets increase when their ordering numbers increase.

Figure 4. Classification performance on different classes with class distributions: (a) original distribution; (b) uniform distribution.

4. The impact of feature selection methods on different classes: We showed the performances of settings with different feature selection methods on each class in Figure 4. We also took into account the original and uniform class distributions separately and showed the results in Figure 4 (a) and (b), respectively. With the original class distributions, the settings with genre features perform 3.9% better than those with general features on entry pages, 1.2% better on full syllabi, and 0.6% better on noise pages. This finding indicates that our genre feature selection method can select important features for the syllabus genre. It also suggests that more genre features should be defined to differentiate partial syllabi from other categories. For example, it would be useful to capture features that differentiate between outgoing links to syllabus components and links to other resources, such as a website with detailed information about a required textbook. It is also interesting to note that uniform re-sampling impacts the performance of settings with genre features more strongly than it impacts other feature types, especially on full syllabi and partial syllabi, with performance differences of −17.8% and +48% as compared to the performance before re-sampling. Overall, the hybrid feature selection method seems to be the best option for all four classes in both sample distributions.
5. The impact of different DF thresholds on feature selection methods: We also conducted an analysis on general and hybrid feature selection methods with respect to different DF thresholds. For the general feature selection methods, the performance with the DF threshold at 10 is 1.8% better than that with 20 and 6.8% better than that with 30. The comparison on the hybrid feature selection methods reveals similar results. This suggests that the additional features, selected when the DF threshold value is set to 10, contribute more value to our syllabus classification performance. However, when more features are selected, more training time and computational resources are also required. More features also require more training data. Thus we can consider setting the DF threshold to 30 to reduce the feature size for further investigation of more settings with the current training data.
RELATED WORK

A few ongoing research studies are involved with collecting and making use of syllabi. A small set of digital library course syllabi was manually collected and carefully analyzed, especially their reading lists, in order to define the digital library curriculum (Pomerantz, Oh, Yang, Fox, & Wildemuth, 2006). MIT OpenCourseWare manually collects and publishes 1,800 MIT course syllabi for public use (Hardy, 2002). However, a lot of effort from experts and faculty is required in such manual collection building approaches, which is what our approach tries to address.
Furthermore, some effort has already been devoted to automating the syllabus collection process. A syllabus acquisition approach similar to ours is described in (Matsunaga, Yamada, Ito, & Hirokaw, 2003), but it differs in the way syllabi are identified. They crawled Web pages from Japanese universities and sifted through them using a thesaurus with common words which occur often in syllabi. A decision tree was used to classify syllabus pages and entry pages (for example, a page containing links to all the syllabi of a particular course over time). In (Thompson, Smarr, Nguyen, & Manning, 2003), a classification approach was described to classify education resources – especially syllabi, assignments, exams, and tutorials. They relied on word features of each document and were able to achieve very good performance (F1 score:0.98). Because we focused on classifying documents with a high probability of being syllabi into more refined categories, our best performance is 0.63 by the F1 measure. Our research also relates to genre classification. Research in genre classification aims to classify data according to genre types by selecting features that distinguish one genre from another, for example, identifying home pages from Web pages (Kennedy & Shepherd, 2005).
CONCLUSION

In this work, we described in detail our empirical study on syllabus classification. Based on observation of search results from the Web for the keyword query 'syllabus', we defined four categories: full syllabus, partial syllabus, entry page, and noise page. We selected features specific to the syllabus genre and tested the effectiveness of such a genre feature selection method as compared to the feature selection method usually used, DF thresholding. On average, DF thresholding performs better than the genre feature selection method. However, further investigation revealed that the genre feature selection method slightly outperforms DF thresholding on all classes except partial syllabi. Further study is needed regarding more features specific to the syllabus genre. For example, it might be worth including a few HTML tags (e.g., font size) as features. Another important finding is that our current training data size might be a factor that limits the performance of our classifier. However, it is always a laborious job to label a large set of data. Our future work will investigate SVM variations on syllabus classification.
ACKNOWLEDGMENT

This work was funded in part by the National Science Foundation under DUE grant #0532825. We'd also like to thank GuoFang Teng for the data preparation and the suggestions on the early work.
References

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on computational learning theory (pp. 144–152). New York, NY, USA: ACM Press.

Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Furnkranz, J. (2001). Round robin rule learning. In Proceedings of the Eighteenth International Conference on Machine Learning (pp. 146–153). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Hardy, I. (2002). Learn for free online. BBC News, from http://news.bbc.co.uk/2/hi/technology/2270648.stm.

Hastie, T. & Tibshirani, R. (1998). Classification by pairwise coupling. In Proceedings of the 1997 conference on Advances in neural information processing systems 10 (pp. 507–513). Cambridge, MA, USA: MIT Press.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (pp. 137–142). Heidelberg, DE: Springer.

Kennedy, A. & Shepherd, M. (2005). Automatic identification of home pages on the Web. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences - Track 4. Washington, DC, USA: IEEE Computer Society.

Kim, S.-B., Han, K.-S., Rim, H.-C., & Myaeng, S. H. (2006). Some effective techniques for naive Bayes text classification. IEEE Transactions on Knowledge and Data Engineering, 18(11), 1457–1466.

Matsunaga, Y., Yamada, S., Ito, E., & Hirokaw, S. (2003). A Web syllabus crawler and its efficiency evaluation. In Proceedings of International Symposium on Information Science and Electrical Engineering (pp. 565–568).

Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. Burges, & A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning (pp. 185–208). Cambridge, MA: MIT Press.

Pomerantz, J., Oh, S., Yang, S., Fox, E. A., & Wildemuth, B. M. (2006). The core: Digital library education in library and information science programs. D-Lib Magazine, 12(11).

Thompson, C. A., Smarr, J., Nguyen, H., & Manning, C. (2003). Finding educational resources on the Web: Exploiting automatic extraction of metadata. In Proceedings of European Conference on Machine Learning Workshop on Adaptive Text Extraction and Mining.

Tungare, M., Yu, X., Cameron, W., Teng, G., Pérez-Quiñones, M., Fox, E., Fan, W., & Cassel, L. (2007). Towards a syllabus repository for computer science courses. In Proceedings of the 38th Technical Symposium on Computer Science Education (pp. 55–59). SIGCSE Bull. 39(1).

Witten, I. H. & Frank, E. (2005). Data Mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.

Yang, Y. & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 412–420). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Yu, X., Tungare, M., Fan, W., Pérez-Quiñones, M., Fox, E. A., Cameron, W., Teng, G., & Cassel, L. (2007). Automatic syllabus classification. In Proceedings of the Seventh ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 440–441). New York, NY, USA: ACM Press.
Key Terms

Feature Selection: Feature selection for text documents is a method to address the high dimensionality of the feature space by selecting more representative features. Usually the feature space consists of the unique terms occurring in the documents.

Full Syllabus: A syllabus without links to other syllabus components.

Model Testing: A procedure performed after model training that applies the trained model to a different data set with known classes and evaluates the performance of the trained model.

Model Training: A procedure in supervised machine learning that estimates parameters for a designed model from a data set with known classes.

Partial Syllabus: A syllabus along with links to more syllabus components at another location.

Support Vector Machines (SVM): A supervised machine learning classification approach with the objective of finding the hyperplane maximizing the minimum distance between the plane and the training data points.

Syllabus Component: One of the following pieces of information: course code, title, class time and location, offering institute, teaching staff, course description, objectives, Web site, prerequisites, textbook, grading policy, schedule, assignments, exams, and resources.

Syllabus Entry Page: A page that contains a link to a syllabus.

Text Classification: The problem of automatically assigning predefined classes to text documents.
Chapter V
Partially Supervised Text Categorization Xiao-Li Li Institute for Infocomm Research, Singapore
ABSTRACT

In traditional text categorization, a classifier is built using labeled training documents from a set of predefined classes. This chapter studies a different problem: partially supervised text categorization. Given a set P of positive documents of a particular class and a set U of unlabeled documents (which contains both hidden positive and hidden negative documents), we build a classifier using P and U to classify the data in U as well as future test data. The key feature of this problem is that there is no labeled negative document, which makes traditional text classification techniques inapplicable. In this chapter, we introduce the main techniques, S-EM, PEBL, Roc-SVM and A-EM, to solve the partially supervised problem. In many application domains, partially supervised text categorization is preferred since it saves on the labor-intensive effort of manual labeling of negative documents.
INTRODUCTION

Text categorization is an important problem and has been studied extensively in machine learning, information retrieval and natural language processing. To build a text classifier, the user first collects a set of training documents, which are labeled with predefined or known classes (labeling is often done manually). A classification algorithm is then applied to the training data to build a classifier which is subsequently employed to assign the predefined classes to documents in a test set (for evaluation) or future instances (in practice). This approach to building classifiers is called supervised learning/classification because the training documents all have pre-labeled classes.
Over the past few years, a special form of text categorization, partially supervised text categorization, has been proposed. This problem can be regarded as a two-class (positive and negative) classification problem, where there are only labeled positive training data, but no labeled negative training data. Due to the lack of negative training data, classifier building is thus only partially supervised. Since traditional classification techniques require both labeled positive and negative documents to build a classifier, they are not suitable for this problem. Although it is possible to manually label some negative documents, doing so is labor-intensive and very time consuming. Partially supervised text categorization studies how to build a classifier using only a set of positive documents and a set of unlabeled documents. Note that partially supervised text categorization is also called PU learning, where P and U represent the "positive set" and the "unlabeled set" respectively. Additionally, although PU learning is commonly studied in the context of text and Web page domains, the general concepts and the algorithms are also applicable to classification tasks in other domains. Next, we discuss the problem definition and some possible applications of PU learning. With the growing volume of text documents on the Web, Internet news feeds, and digital libraries, one often wants to find those documents that are related to one's interest. For instance, one may want to build a repository of machine learning (ML) papers. First, one can start with an initial set of ML papers (e.g., an ICML Proceedings). One can then find those ML papers from related online journals or conference series, e.g., the AI journal, AAAI, IJCAI, SIGIR, KDD, etc. Note that collecting unlabeled documents is normally easy and inexpensive in many text or Web page domains, especially those involving online sources. The ability to build classifiers without negative training data is particularly useful if one needs to find positive documents from many text collections or sources. Given a new collection, the algorithm can be run to find those positive documents. Following the above example, given a collection of AAAI papers (the unlabeled set), one can run the algorithm to identify the ML papers among them. Given a set of SIGIR papers, one can run the algorithm again to find the ML papers. If traditional classification techniques are used to deal with the application above, a user has to label a negative training set for every online source, such as the AAAI, SIGIR and KDD proceedings (which is time consuming). In other words, one cannot use the classifier built using ML papers as the positive set (P) and manually labeled non-ML papers in AAAI as the negative set (N) to classify the SIGIR papers (as a test set), because they are from different domains: the classifier will suffer since the distribution of negative test documents in SIGIR (Information Retrieval papers) is largely different from that of the negative training documents in N (Artificial Intelligence papers). A user would obviously prefer techniques that can provide accurate classification without manually labeling any negative documents. In the application above, the class of documents that one is interested in is called the positive class documents, or simply positive documents (e.g., the ML papers in an ICML Proceedings). The set of positive documents is represented as P. The unlabelled set U (e.g., an AAAI Proceedings) contains two parts of documents.
One part consists of documents of class P, which are the hidden positive documents in U. The rest of the documents in U are called the negative class documents, or simply negative documents (e.g., the non-ML papers in an AAAI Proceedings), since they do not belong to the positive class. Given a positive set P, PU learning aims to identify the particular class P of documents from U. The diagram of PU learning is shown in Figure 1.
Figure 1. The diagram of PU learning
Now, we can state the problem more formally as follows:
Problem Statement: Given a set P of positive documents that we are interested in, and a set U of unlabeled documents (the mixed set), which contains both positive documents and negative documents, we want to build a classifier using P and U that can identify positive documents in U or in a separate test set; in other words, we want to accurately classify positive and negative documents in U or in the test (or future) data set.
This PU learning problem can be seen as a classification problem involving two classes, positive and negative. However, we do not have labeled negative documents, as they are mixed with positive documents in the unlabelled set (in Figure 1, the documents in U do not have any label information). Traditional techniques require both labeled positive and labeled negative documents to build a classifier. PU learning aims to build a classifier using only the positive set and the unlabelled set. Thus, it saves on the labor-intensive effort of manual labeling of negative documents. Partially supervised text categorization can also be very useful in many real-life applications. Similarly, given one's bookmarks (positive documents), one may want to find those documents that are of interest from Internet sources without labeling any negative documents. In these applications, positive documents are usually available because if one has worked on a particular task for some time, one should have accumulated many related documents. Even if no positive document is available initially, finding some such documents from the Web or any other source is relatively simple (e.g., using the Yahoo Web directory). One can then use this set to find the same class of documents from any other sources without manual labeling of negative documents from each source.
BACKGROUND

Many studies on text classification have been conducted in the past. Existing techniques include Rocchio, Naive Bayes (NB), SVM, and many others [e.g., (Lewis, 1995; Rocchio, 1971; Vapnik, 1995; Joachims, 1999; Nigam, McCallum, Thrun, & Mitchell, 2000; Yang & Liu, 1999)]. These existing techniques, however, all require labeled training data of all classes. They are not designed for positive-class-based classification.
A theoretical study of PAC learning from positive and unlabeled examples under the statistical query model was first reported in (Denis, 1998). It basically assumes that the proportion of positive instances in the unlabeled set is known. Letouzey et al. (Letouzey, Denis, & Gilleron, 2000; Denis, Gilleron, & Letouzey, 2005) presented a learning algorithm based on a modified decision tree algorithm in this model. Muggleton (Muggleton, 1997) followed by studying the problem in a Bayesian framework where the distribution of functions and examples are assumed known. (Liu, Lee, Yu, & Li, 2002) reported sample complexity results and provided theoretical elaborations on how the problem may be solved. Subsequently, a number of practical algorithms (Liu et al., 2002; Yu, Han, & Chang, 2002; Li & Liu, 2003) have been proposed. They all conform to the theoretical results in (Liu et al., 2002), following a two-step strategy: (1) identifying a set of reliable negative documents from the unlabeled set; and (2) building a classifier using EM or SVM iteratively. Their specific differences in the two steps are as follows. The S-EM method proposed in (Liu et al., 2002) was based on naïve Bayesian classification and the EM algorithm (Dempster, Laird, & Rubin, 1977). The main idea was to first use a spying technique to identify some reliable negative documents from the unlabeled set, and then to run EM to build the final classifier. The PEBL method (Yu et al., 2002) uses a different method (1-DNF) for identifying reliable negative examples and then runs SVM iteratively for classifier building. More recently, (Li & Liu, 2003) reported a robust technique called Roc-SVM. In this technique, reliable negative documents are extracted by using the information retrieval technique Rocchio (Rocchio, 1971), and SVM is used in the second step. Other related works include the following: Lee and Liu proposed a weighted logistic regression technique (Lee & Liu, 2003), and Liu et al. proposed a biased SVM technique (Liu, Dai, Li, Lee, & Yu, 2003); both require a performance criterion to determine the quality of the classifier. Recently, A-EM was proposed to solve a real-life product page classification problem where the positive documents in the positive set P and the hidden positive documents in the unlabeled set U have different distributions (since they are generated from different Web sites). In (Fung, Yu, Lu, & Yu, 2006), a method called PN-SVM was proposed to deal with the case when the positive set is small. PU learning (the LGN method) can also help to identify unexpected instances in the test set (Li & Liu, 2007), where the negative data are generated by using the positive set P and the unlabelled set U. Other works on PU learning include applications for extracting relations, identifying user preferences, and filtering junk e-mail (Agichtein, 2006; Deng, Chai, Tan, Ng, & Lee, 2004; Schneider, 2004; Zhang & Lee, 2005). Another line of related work is learning from only positive data. In (Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001), a one-class SVM is proposed by Schölkopf. Manevitz and Yousef studied text classification using one-class SVM (Manevitz, Yousef, Critianini, Shawe-Taylor, & Williamson, 2002). One-class SVM uses only positive data to build an SVM classifier, whereas partially supervised text categorization uses both positive and unlabelled data to build a classifier. Li and Liu showed that the accuracies of one-class SVM were poorer than those of PU learning for text classification (Li & Liu, 2003).
Unlabeled data does help classification significantly.
THE MAIN TECHNIQUES

In this section, we will introduce the main techniques in partially supervised text categorization. These techniques include S-EM (Liu et al., 2002), PEBL (Yu et al., 2002), Roc-SVM (Li & Liu, 2003) and A-EM (Li & Liu, 2005), etc.
Figure 2. An illustration of the two-step PU learning approach
Given the positive set P and the unlabelled set U, the core idea of the PU learning techniques is the same as that in (Liu et al., 2002), i.e., (1) identifying a set of reliable negative documents RN from the unlabeled set (called strong negative documents in PEBL), and (2) building a classifier using P, RN and U' (U' = U − RN) by applying an existing learning algorithm once or iteratively. The difference between these techniques is that they use different methods for the two steps. This two-step approach is illustrated in Figure 2. In the first step, a set of reliable negative documents (RN) is extracted from the unlabeled set U (different techniques use different classifiers to extract RN) and the remaining documents of U are stored in the set U'. In the second step, an iterative algorithm, such as EM or SVM, is applied. First, the algorithm builds a classifier C0 using P (pure positive documents) and RN (reliable negative documents). Then, C0 is used to classify U', producing newly generated reliable negatives (RN') and the remaining unlabelled
set U''. Next, RN' is added into the set RN. The set U'' then replaces the set U'. Finally, a new classifier C1 is built with P and the updated RN (RN ∪ RN'). The algorithm is run iteratively (a sequence of classifiers is generated during the process) until a convergence criterion is met, i.e., when RN' becomes an empty set, which means that no new reliable negatives are produced.
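A minimal sketch of this second-step loop is given below: a classifier is built from P and RN, applied to U', and the documents it labels negative are moved from U' into RN until none are produced. The functions build_classifier and classify are placeholders standing in for whichever base learner (NB/EM or SVM) a particular technique uses.

# Generic second step of the two-step PU approach; build_classifier and
# classify are placeholders for the base learner (NB/EM or SVM).
def second_step(P, RN, U_prime, build_classifier, classify):
    RN, U_prime = set(RN), set(U_prime)
    while True:
        clf = build_classifier(positives=P, negatives=RN)
        newly_negative = {d for d in U_prime if classify(clf, d) == -1}
        if not newly_negative:          # RN' is empty: convergence
            return clf, RN, U_prime
        RN |= newly_negative            # add RN' back into RN
        U_prime -= newly_negative       # U'' replaces U'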
S-EM

S-EM (Liu et al., 2002) was proposed to solve the problem in the text domain (S-EM can be downloaded from http://www.cs.uic.edu/~liub/S-EM/S-EM-download.html). It is based on naïve Bayesian categorization (NB) (Lewis, 1995; Nigam et al., 2000) and the EM algorithm (Dempster et al., 1977). The main idea of the method is to first use a spy technique to identify some reliable negative documents from the unlabeled set (first step). It then runs EM iteratively to build the final classifier (second step). Before presenting the proposed method, we give an overview of both NB and EM.
NB Classification and EM Algorithm
∑ Pr(c ) = j
| D| i =1
Pr(c j | di ) |D|
(1)
and with Laplacian smoothing,
1 + ∑ i =1 N ( wt , di )Ρr(c j | di ) | D|
Ρr( wt | c j ) =
| V | + ∑ s =1 ∑ i =1 N ( ws , d i )Ρr(c j | d i ) |V |
| D|
(2)
where N(wt,di) is the count of the number of times that the word wt occurs in document di and Pr(cj|di)∈{0,1} depending on the class label of the document. Finally, assuming that the probabilities of the words are independent given the class, we obtain the NB classifier: Pr( cj | di ) =
80
Pr(c j ) ∏|kd=i |1 Pr( wdi , k | c j )
∑
|C |
r =1
Pr(c j ) ∏|kd=i |1 Pr( wdi ,k | c j )
(3)
Partially Supervised Text Categorization
In the naive Bayesian classifier, the class with the highest Pr(cj|di) is assigned as the class of the document di. The Expectation-Maximization (EM) algorithm (Liu et al., 2002; Nigam et al., 2000) is a popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. It is often used to fill in missing values in the data by computing their expected values from the existing values. The EM algorithm consists of two steps, the Expectation step and the Maximization step. The Expectation step basically fills in the missing data. The parameters are estimated in the Maximization step after the missing data are filled. This leads to the next iteration. For the naive Bayesian classifier, the steps used by EM are identical to those used to build the classifier (equation (3) for the Expectation step, and equations (1) and (2) for the Maximization step). In EM, the probability of the class given a document takes a value in [0, 1] instead of {0, 1}.
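The sketch below implements equations (1) and (2) with soft class labels Pr(c|d) in [0, 1], which is exactly what the EM Maximization step requires. The data structures (token lists, posterior dictionaries) and the example values are illustrative assumptions, not the S-EM system's code.

# Equations (1) and (2) with soft labels Pr(c|d), as used by the EM
# Maximization step (data structures and example values are illustrative).
from collections import Counter

def m_step(docs, posteriors, vocab, classes):
    """docs: list of token lists; posteriors[i][c] = Pr(c | d_i)."""
    prior = {c: sum(p[c] for p in posteriors) / len(docs) for c in classes}      # eq. (1)
    cond = {}
    for c in classes:
        weighted = Counter()
        for tokens, p in zip(docs, posteriors):
            for w in tokens:
                weighted[w] += p[c]                  # sum_i N(w, d_i) * Pr(c | d_i)
        total = sum(weighted.values())
        cond[c] = {w: (1 + weighted[w]) / (len(vocab) + total) for w in vocab}   # eq. (2)
    return prior, cond

docs = [["svm", "margin"], ["gene", "protein"]]
post = [{"+": 1.0, "-": 0.0}, {"+": 0.2, "-": 0.8}]
vocab = {"svm", "margin", "gene", "protein"}
print(m_step(docs, post, vocab, ["+", "-"]))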
The First Step of S-EM

The target of this step is to extract reliable negative documents from the unlabelled set U. S-EM works by sending some "spy" documents from the positive set P to the unlabeled set U. The technique makes an assumption: since spy documents from P and hidden positive documents in U are both positive documents, the spy documents should behave identically to the hidden positive documents in U and hence allow us to reliably infer the behavior of the unknown positive documents. Figure 3 gives the algorithm of the technique, which was used in the S-EM system (Liu et al., 2002). In Figure 3, after initializing RN as an empty set, S-EM randomly samples a set S of positive documents from P and puts them in U (lines 2 and 3). The default sampling ratio s% is 15% in S-EM. The documents in S, acting as "spy" documents, are sent from the positive set P to the unlabeled set U. Next,
Figure 3. The spy technique extracts the reliable negative set RN

Algorithm Spy(P, U, RN, U')
1. RN ← ∅;
2. S ← Sample(P, s%);
3. Us ← U ∪ S;
4. Ps ← P – S;
5. Assign each document in Ps the class label +1;
6. Assign each document in Us the class label −1;
7. NB(Us, Ps); // This produces a NB classifier.
8. Classify each document in Us using the NB classifier;
9. Determine a probability threshold t using S;
10. for each document d ∈ U do
11.   if its probability Pr(+|d) < t then
12.     RN ← RN ∪ {d};
13.   endif
14. endfor
15. U' = U – RN
it runs the naïve Bayesian (NB) algorithm using the set Ps as positive and the set Us as negative (lines 3–7). The NB classifier is then applied to classify each document d in Us, i.e., to assign it a probabilistic class label Pr(+|d), where "+" represents the positive class. Finally, it uses the probabilistic labels of the spies to decide which documents are most likely to be negative. A threshold t is employed to make the decision. Those documents in U with probabilities Pr(+|d) smaller than t are regarded as reliable negatives and stored in the set RN (lines 10–14). Now we discuss how to determine t using spies (line 9). Let the set of spies be S = {s1, s2, …, sk}, and the probabilistic label assigned to each si be Pr(+|si). Intuitively, the minimum probability in S can be set as the threshold value t, i.e., t = min{Pr(+|s1), Pr(+|s2), …, Pr(+|sk)}, which means that S-EM wants to retrieve as many hidden positive documents in U as possible, because the spies should have the same behavior as the hidden positives. In a noiseless case, using the minimum probability is acceptable. However, most real-life document collections have outliers and noise, and using the minimum probability is unreliable. The reason is that the posterior probability Pr(+|si) of an outlier document si in S could be 0, or smaller than that of most (or even all) actual negative documents. However, we do not know the noise level of the data. To be safe, the S-EM system uses a large noise level l = 15% as the default. The final classification result is not very sensitive to l as long as it is not too small. To determine t, we first sort the documents in S according to their Pr(+|si) values. We then use the selected noise level l to decide t: we select t such that l percent of the documents in S have probability less than t.
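A small sketch of the threshold selection just described: sort the spies by their posterior Pr(+|s) and take the value at the chosen noise level l (15% by default). This is a paraphrase of the procedure, not the S-EM code, and the example probabilities are invented.

# Choosing threshold t from the spies' posteriors at noise level l.
def spy_threshold(spy_posteriors, noise_level=0.15):
    ranked = sorted(spy_posteriors)          # Pr(+|s_i), ascending
    cut = int(noise_level * len(ranked))     # l percent of spies treated as noise
    return ranked[cut]                       # t: l% of spies fall below it

spies = [0.01, 0.30, 0.42, 0.55, 0.61, 0.70, 0.88]
t = spy_threshold(spies)   # with l = 15%, the lowest spy (an outlier) is skipped
print(t)                   # 0.30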
The Second Step of S-EM

Given the positive set P, the reliable negative set RN and the new unlabelled set U', the EM algorithm is applied in the second step of S-EM. The Expectation step of EM basically fills in the missing data; in this case, it produces and revises the probabilistic labels of the documents in U'. The parameters are estimated in the Maximization step after the missing data are filled. This leads to the next iteration of the algorithm. EM converges when its parameters stabilize. Using NB in each iteration, EM employs the same equations as those used in building a NB classifier (equation (3) for the Expectation step, and equations (1) and (2) for the Maximization step). The class probability given to each document in U' takes a value in [0, 1] instead of {1, 0}. The algorithm is given in Figure 4. Steps 1 to 3 build a NB classifier using P and RN. Then steps 4 to 9 run the EM algorithm until convergence. Finally, the converged classifier is used to classify the unlabelled set U; the documents classified as positive are regarded as the hidden positives in U. The advantage of S-EM is that it is not sensitive to noise, due to its probabilistic nature. However, EM makes two mixture model assumptions, i.e., (1) the data is generated by a mixture model, and (2) there is a one-to-one correspondence between mixture components and classes. These two assumptions can cause problems when they do not hold. In some real-life applications, they may be violated, since a class may contain documents from completely different topics. On the other hand, if the assumptions are met, EM works very well. S-EM is particularly useful when the labeled set is very small. In this case, running EM iteratively can keep improving the performance of the classifiers.
PEBL

Yu et al. proposed an SVM-based technique (called PEBL: Positive Example Based Learning) to classify Web pages given positive and unlabeled pages. The core idea is the same as that in S-EM. In (Yu et al.,
Figure 4. The EM algorithm with the NB classifier

Algorithm S-EM(P, U, RN, U')
1. Each document in P is assigned the class label 1;
2. Each document in RN is assigned the class label −1;
3. Learn an initial NB classifier f from P and RN, using Equations (1) and (2);
4. repeat
   // E-Step
5.   for each document di in U' do
6.     Use the current classifier f to compute Pr(cj|di) using Equation (3);
7.   endfor
   // M-Step
8.   Learn a new NB classifier f from P, RN and U' by computing Pr(cj) and Pr(wt|cj), using Equations (1) and (2);
9. until the classifier parameters stabilize
10. From the last iteration of EM, a classifier f is obtained;
11. for each document di in U do
12.   if its probability Pr(+|di) >= 0.5 then
13.     Output di as a positive document;
14.   else
15.     Output di as a negative document;
16.   endif
17. endfor
Figure 5. The 1DNF technique extracts the strong negative documents from U

Algorithm 1DNF(P, U, RN, U')
1. Construct a word feature set V = {w1, …, wn}, wi ∈ U ∪ P, i = 1, 2, …, n;
2. Initialize a positive feature set PF as an empty set: PF ← ∅;
3. for each wi ∈ V do
4.   if (freq(wi, P) / |P| > freq(wi, U) / |U|) then
5.     PF ← PF ∪ {wi};
6.   endif
7. endfor;
8. RN ← U;
9. for each document d ∈ U do
10.   if ∃wj ∈ PF, freq(wj, d) > 0 then
11.     RN ← RN – {d}
12.   endif
13. endfor
14. U' = U – RN
2002), reliable negative documents (strong negative documents in PEBL) are those documents that do not contain any positive features. After a set of strong negative documents is identified, SVM is applied iteratively to build a classifier. PEBL is sensitive to the number of positive documents. When the positive data is small, the results are often very poor (Li & Liu, 2003; Fung et al., 2006).
1DNF Technique

The framework of the 1DNF method is shown in Figure 5. Line 1 collects all the words in the unlabelled set U and the positive set P to obtain a vocabulary V. It then constructs a positive feature set PF from V containing the words that occur in the positive set P more frequently than in the unlabeled set U (lines 2–7). In lines 8–13, it tries to filter out possible positive documents from U. A document in U that does not contain any positive feature in PF is regarded as a reliable negative document. Note that this method is only applicable to text documents. 1DNF can usually extract only a small number of reliable negative documents, because of its strict condition that a reliable document must not contain any positive feature. Additionally, the positive feature set may contain noisy features, due to the difference in feature distributions between P and U.
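A compact rendering of the 1DNF step in Figure 5 is given below, with documents represented as token lists and document frequency used as one reading of freq; it is an illustrative translation, not PEBL's implementation, and the example documents are invented.

# 1DNF from Figure 5, with documents as token lists and document frequency
# as the freq measure (illustrative translation, not PEBL's code).
def one_dnf(P, U):
    def df(word, docs):                      # document frequency of a word
        return sum(1 for d in docs if word in d)
    vocab = {w for d in P + U for w in d}
    PF = {w for w in vocab
          if df(w, P) / len(P) > df(w, U) / len(U)}   # positive features
    RN = [d for d in U if not PF & set(d)]   # documents with no positive feature
    U_prime = [d for d in U if PF & set(d)]
    return PF, RN, U_prime

P = [["machine", "learning", "svm"], ["learning", "classifier"]]
U = [["stock", "market"], ["learning", "market"], ["weather", "report"]]
PF, RN, U_prime = one_dnf(P, U)
print(RN)   # documents containing none of the positive features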
Iterative SVM

Figure 6. Running SVM iteratively in PEBL

Algorithm PEBL(P, RN, U')
1. Every document in P is assigned the class label +1;
2. Every document in RN is assigned the class label –1;
3. repeat
4.   Use P and RN to train a SVM classifier f;
5.   Classify U' using f;
6.   Let the set of documents in U' that are classified as negative be W;
7.   if W != ∅
8.     U' ← U' – W;
9.     RN ← RN ∪ W;
10.  endif;
11. until W = ∅
12. From the last iteration of SVM, a classifier f is obtained;
13. for each document di in U do
14.   if SVM outputs a positive value for di then
15.     Output di as a positive document;
16.   else
17.     Output di as a negative document;
18.   endif
19. endfor
Iterative SVM

In this method, SVM is run iteratively using P, RN and U'. The detailed PEBL algorithm is given in Figure 6. The basic idea is as follows: in each iteration, a new SVM classifier f is constructed from P and RN (line 4). Here RN is regarded as the set of negative examples (line 2). The classifier f is then applied to classify the documents in U' (line 5). The set W of documents in U' that are classified as negative (line 6) is removed from U' (line 8) and added to RN (line 9). The iteration stops when no document in U' is classified as negative (line 11). The final classifier is the one built at convergence (W = ∅), and its results are used as the final results. When the positive set is large, the PEBL algorithm can achieve good results. However, when the positive set is small, the first step (1DNF) usually cannot extract reliable negatives, and the second step of PEBL then fails badly to identify any hidden positives in U (Li & Liu, 2003; Fung et al., 2006).
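A minimal sketch of this loop, assuming scikit-learn's LinearSVC as the base classifier and dense, non-empty feature matrices for P, RN, U' and U (this is our illustration, not the PEBL implementation):

import numpy as np
from sklearn.svm import LinearSVC

def pebl_iterative_svm(P, RN, U_prime, U):
    """Iteratively grow RN with documents of U' classified as negative (cf. Figure 6)."""
    P, RN, U_prime = np.asarray(P), np.asarray(RN), np.asarray(U_prime)
    while True:
        X = np.vstack([P, RN])
        y = np.concatenate([np.ones(len(P)), -np.ones(len(RN))])
        clf = LinearSVC().fit(X, y)
        if len(U_prime) == 0:
            break
        pred = clf.predict(U_prime)
        W = U_prime[pred == -1]                   # documents of U' classified as negative
        if len(W) == 0:
            break                                 # convergence: W is empty
        RN = np.vstack([RN, W])                   # RN <- RN ∪ W
        U_prime = U_prime[pred == 1]              # U' <- U' - W
    return clf.predict(np.asarray(U)) == 1        # final labels for the unlabeled set U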
Roc-SVM

This section presents Roc-SVM (Rocchio SVM, which can be downloaded from http://www.cs.uic.edu/~liub/LPU/LPU-download.html) (Li & Liu, 2003). Roc-SVM differs from PEBL in that it performs negative data extraction from the unlabeled set U using the Rocchio method. Although the second step also runs SVM iteratively to build a classifier, there is a key difference from PEBL: it selects a good classifier from the set of classifiers built by SVM. Compared with S-EM and PEBL, Roc-SVM is robust and performs well consistently under a variety of conditions.
Step 1: Detecting Reliable Negative Documents

This subsection introduces the Rocchio method (Rocchio, 1971). The key requirement for this extraction step is that the identified negative documents from the unlabeled set must be reliable or pure, i.e., contain no or very few positive documents, because SVM (used in the second step) is very sensitive to noise. The Rocchio method treats the entire unlabeled set U as negative documents and then uses the positive set P and U as the training data to build a Rocchio classifier. The classifier is then used to classify U. Those documents that are classified as negative are considered reliable negative data. The algorithm is shown in Figure 7.

In Rocchio classification, each document d is represented as a vector (Salton & McGill, 1986), d = (q1, q2, ..., qn). Each element qi in d represents a word wi and is calculated as the combination of term frequency (tf) and inverse document frequency (idf), i.e., qi = tfi * idfi, where tfi is the number of times that word wi occurs in d, and idfi = log(|D|/df(wi)). Here |D| is the total number of documents and df(wi) is the number of documents in which word wi occurs at least once. Building a classifier is achieved by constructing the positive and negative prototype vectors c+ and c− (lines 2 and 3 in Figure 7). The α and β parameters adjust the relative impact of the positive and negative training examples; α = 16 and β = 4 are recommended in (Buckley, Salton, & Allan, 1994). In classification, for each test document d', the method simply uses the cosine measure (Salton & McGill, 1986) to compute the similarity (sim) of d' with each prototype vector. The class whose prototype vector is more similar to d' is assigned to the test document (lines 4-6 in Figure 7). Those documents classified as negative form the negative set RN. The reason that Rocchio works well is as follows: in our positive class based learning, the unlabeled set U typically has the following characteristics:
Figure 7. Rocchio extraction using U as negative
1. Assign the unlabeled set U the negative class, and the positive set P the positive class;
2. Let c+ = α · (1/|P|) Σd∈P d/||d|| − β · (1/|U|) Σd∈U d/||d||;
3. Let c− = α · (1/|U|) Σd∈U d/||d|| − β · (1/|P|) Σd∈P d/||d||;
4. for each document d' ∈ U do
5.    if sim(c+, d') ≤ sim(c−, d') then
6.       RN ← RN ∪ {d'};
7.    endif
8. endfor
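For illustration only, the extraction step of Figure 7 can be written compactly with NumPy, assuming the documents are already encoded as rows of tf-idf matrices P and U; alpha and beta default to the recommended values of 16 and 4:

import numpy as np

def rocchio_reliable_negatives(P, U, alpha=16.0, beta=4.0):
    """Treat U as negative, build prototypes c+ and c-, return indices of RN in U."""
    def normalize(X):
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.maximum(norms, 1e-12)
    Pn, Un = normalize(P), normalize(U)
    c_pos = alpha * Pn.mean(axis=0) - beta * Un.mean(axis=0)
    c_neg = alpha * Un.mean(axis=0) - beta * Pn.mean(axis=0)
    def cos(v, X):                                 # cosine similarity of each row of X with v
        return X @ v / (np.linalg.norm(v) * np.linalg.norm(X, axis=1) + 1e-12)
    sim_pos, sim_neg = cos(c_pos, U), cos(c_neg, U)
    return np.where(sim_neg >= sim_pos)[0]         # documents closer to c- form RN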
Designing and Mining Web Applications
condition that allows selecting an area based on the equality of its OID with the OID of the area previously selected by a user (in a previous page). Such a parameter is transported by the input link of the unit. The second content unit is an index unit. It receives as input a parameter carried by a transport link coming from the data unit; such a parameter represents the OID of the area displayed by the data unit. The index unit thus lists some instances of the entity Research_Topic extracted from the database according to a condition based on the data relationship Research_Area2Research_Field, which associates each research area with a set of correlated research fields. Besides the visual representation, WebML primitives are also provided with an XML-based representation suitable to specify additional detailed properties that would not be conveniently expressed by a graphical notation. Figure 1b reports a simplified XML specification of the Research Area page. For further details on WebML, the reader is referred to Ceri, et al. (2002).
Implementation and Deployment of WebML Applications

The XML representation of WebML schemas enables automatic code generation by means of CASE tools. In particular, WebML is supported by the WebRatio CASE tool (Ceri et al., 2003), which translates the XML specifications into concrete implementations. WebRatio offers a visual environment for drawing the data and hypertext conceptual schemas, and an interface to the data layer that assists designers in automatically mapping the conceptual-level entities, attributes, and relationships to physical data structures in the data sources, where the actual data will be stored. The core of WebRatio is a code generator, based on XML and XSL (eXtensible Stylesheet Language) technologies, that is able to generate automatically the application code to be deployed on the J2EE (Java 2 Enterprise Edition) or .NET platforms. More specifically, the code generator produces the queries for data extraction from the application-data sources, the code for managing the application business logic, and the page templates for the automatic generation of the application front end. The generated applications run in a framework implemented on top of an application server. The runtime framework has a flexible, service-based architecture that allows the customization of components. In particular, the logging service can be extended with user-defined modules so as to log the desired data.
Figure 2. Conceptual logs generation (Fraternali, Lanzi, Matera, & Maurino, 2004; Fraternali et al., 2003). The diagram shows the following components: HTTP requests log, Application Server, WebML runtime log, Run-Time Engine, Log Synchronizer, Log Conceptualizer, Web Application schema, Model-Based Design Tool, DB-To-XML Converter, Database Instance, DBMS, and Analysis Data Warehouse.
As will be clarified in the following section, this service has been used for gathering the conceptual data needed for the enrichment of the conceptual logs.
WebML Conceptual Logs

Conceptual logs (Fraternali et al., 2004; Fraternali et al., 2003) are obtained by enriching standard log files with information available in the WebML conceptual schema of the Web application, and with knowledge of the accessed data. As reported in Figure 2, they are generated by some modules, developed as extensions to the WebML-WebRatio framework, that are responsible for extracting and integrating logs from the application server, the WebML application runtime, and the application conceptual schema. The XML dump of the application-data source is also used for deriving detailed information about the data accessed within pages; the conceptual logs themselves include only indications about the OIDs of data instances. Figure 3 reports an excerpt from a conceptual log referring to an access to the Research Area page previously described. Requests are sorted by user session.1 Each request is then extended with the corresponding events and data coded in the runtime log.

• The identifiers of the units composing the page are delimited by the tag.
• The OIDs of the database objects extracted for populating such units are delimited by the tag.
Figure 3. An extract from a conceptual log that refers to an access to the Research Area page specified in Figure 1

The attribute SchemaRef, defined for pages and units, represents values that univocally identify pages and units within the application's conceptual schema. Therefore, it provides a reference to the definition of
such elements, which permits the retrieval of additional properties not traced by the logging mechanism but represented in the conceptual schema, and the integration of them in the conceptual log if needed.
Main Thrust of the Chapter

The data-mining analysis of Web logs has exploited the iterative and interactive extraction of data-mining patterns from databases typical of the KDD process (Brachman & Anand, 1996). In this study, we want to extract interesting, usable, and actionable patterns from the rich semantic content and structured nature of the conceptual Web logs. In doing this, we apply constraints on the patterns. Constraints are applied to patterns by means of KDD scenarios. A KDD scenario is a set of characteristic data-mining requests on patterns. It is a sort of template to be filled in with specific parameter values. KDD scenarios should be able to solve some frequently asked questions (mining problems) by users and analysts (Web site administrators and/or information-system designers) in order to recover from frequently occurring problems. We have identified three main typologies of mining problems for which patterns on frequently observed events could constitute an aid.

1. The identification of frequent crawling paths by the users (Web structure and Web usage analysis)
2. The identification of user communities (sets of users requesting similar information)
3. The identification of critical situations (anomalies, security attacks, low performance) in which the information system could be placed
The first task enables the customization and construction of adaptive Web sites and recommendation systems, as well as the quality analysis of Web applications (Demiriz, 2004; Fraternali et al., 2003). The analysis of user crawling paths has been used also in Aggarwal (2004) to model the likelihood that a page belongs to a specific topic. This is a relevant problem in the construction of crawlers’ indices and in Web resource discovery. Thus, the mining of collective users’ experiences has been applied successfully to find resources on a certain topic, though this issue is typically related to the second task, that is, the identification of user communities. The discovery and management of user communities is an important aim for customer-relationship management and business applications (e.g., e-commerce). Finally, the identification of critical situations in an information system is essential for the management of an efficient and reliable Web site together with the security of the underlying information-technology system.
Mining Conceptual Logs

The use of conceptual logs introduces many advantages over the approaches usually followed in Web usage mining. First of all, they offer rich information that is not available with most traditional approaches. Also, they eliminate the typical Web usage mining preprocessing phase completely. In fact, we note that according to our approach, the following apply.
• Data cleaning is mainly encapsulated within the procedure that integrates the different log sources (logs from the application server, the WebML application runtime, and the application's conceptual schema).
• The identification of user sessions is done by the WebML runtime through the management of session IDs.
• The retrieval of content and structure information is unnecessary since all this information is available from the WebML conceptual schema.
Finally, since the mining methods are applied specifically to this type of rich log file, it is possible to tailor these methods to improve their effectiveness in this particular context. In the following section, we describe the typology of information contained in the Web logs we processed and analyzed, as well as the KDD scenarios, that is, the templates in a Constraint-Based Mining Language.
DEI Web Application Conceptual Logs

The Web logs of the DEI Web site2 record accesses to a very large application, collecting one fourth of the overall click stream directed to Politecnico di Milano, Italy. The application manages the publication and storage of the Web pages of professors, and research and administration staff. It also publishes the didactic offerings in terms of courses and their materials, the departments and the research centres each person is affiliated with together with their resources, and finally the list of publications, activities, and projects in which the persons are involved. We collected the Web logs for the first three consecutive months of 2003. The original Web log stored by the Web server (Apache) was 60 MB in size and is constituted by a relation with the following information: RequestID, IPcaller, Date, TimeStamp, Operation, Page Url, Protocol, Return Code, Dimension, Browser, and OS. The additional data, deriving from the WebML application design and from the runtime logging module, include the following items: Jsession (identifier of the user crawling session by an enabled Java browser), PageId (generated dynamically by the application server), UnitId (atomic content unit), OID (the object displayed in a page), and Order (in which content units are presented in the page). The Web log contained almost 353,000 user sessions for a total of more than 4.2 million page requests. The total number of pages (dynamic, instantiated by means of OIDs) was 38,554. Each user session was constituted by an average of 12 page requests.
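For concreteness, a single record of such a conceptual log can be thought of as the juxtaposition of the server-log fields and the WebML runtime fields listed above. The following dataclass is only a hypothetical illustration of that record (the grouping of fields and the Python types are ours):

from dataclasses import dataclass

@dataclass
class ConceptualLogEntry:
    # standard fields stored by the Web server log
    request_id: int
    ip_caller: str
    date: str
    timestamp: str
    operation: str
    page_url: str
    protocol: str
    return_code: int
    dimension: int   # the Dimension field of the original log
    browser: str
    os: str
    # fields added by the WebML design and the runtime logging module
    jsession: str    # identifier of the user crawling session
    page_id: str     # page identifier generated dynamically by the application server
    unit_id: str     # atomic content unit displayed in the page
    oid: str         # OID of the database object displayed in the page
    order: int       # order in which content units are presented in the page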
KDD Scenarios

In this section, we describe the KDD scenarios we have designed for the discovery of frequent patterns in the DEI Web logs. A KDD scenario is a template introducing some constraints on data and on patterns to be discovered on that data. A template will be instantiated by the user in a request of frequent patterns by filling the template parameters with specific values. KDD scenarios are described by specifying the following features: grouping features (used to form groups of data in the database from which data-mining patterns will be extracted), pattern features (used to describe any element included in a pattern), mining
features (used to apply mining constraints for increasing the relevance of the result), and an evaluation function, that is, one or more aggregate functions used to evaluate the patterns statistically (e.g., count(), sum(), max(), etc.). The statistics, computed for any pattern, are compared with a parameter (displayed in the template within brackets). A template is instantiated by the user into a mining request in which the template parameters are bound to values. A mining request returns the values of the pattern features observed in groups of data made by the grouping features. The returned patterns satisfy the pattern-feature, mining-feature, and statistical constraints. For instance, the first template we discuss is the following:

Template: UsersRequestingSamePage
grouping features: Page Url
pattern features: {IPcaller}
mining features: IPcaller NOT IN {list-of-IP-of-search-engines}
evaluation function: count(Page Url) > [minValue]
It consists of partitioning the Web log data by the grouping feature, the page URL. Then, in each of the groups, patterns are formed by the construction of sets of IPs of callers; in this case, each set of callers requested the same page. For each pattern, the mining features are checked: here, the set of IP (Internet Protocol) callers is checked to be free of IPs of search engines. Finally, each pattern is evaluated statistically, counting the number of groups in which the pattern is present (i.e., the pages that have been requested by the set of callers). Notice that such a template causes the execution of a process that is essentially a typical frequent-item-set mining algorithm with an evaluation of the constraints on qualifying item-set attributes. Many efficient algorithms for this task exist, dealing with different types of constraints, such as those in Srikant, Vu, and Agrawal (1997), Pei and Han (2002), and Gallo, Esposito, Meo, and Botta (2005).

We implemented the KDD scenarios and related templates as mining requests in a Constraint-Based Mining Language for frequent patterns and association rules. The language is similar to the mine rule (Meo, Psaila, & Ceri, 1998), an SQL-like extension to mine association rules from relational content. The description of the implemented prototype, the relevant algorithms, and the optimizations that exploit the properties of the constraints on patterns are beyond the scope of this chapter; however, a detailed description can be found in Meo, Botta, Esposito, and Gallo (2005). At the end of this section, we show an experimental report on the execution times that were necessary to execute an instantiation of each of the templates with our prototype.

In the following, we report some of the most relevant templates we instantiated in our experimentation on the conceptual logs, and we comment on them. In addition, when possible (for privacy reasons), we report some of the most interesting patterns retrieved and their statistics. As the reader will notice, all of the templates have an immediate and practical impact on the daily activities of the administration and tuning of a Web application and its Web front end. The discovered patterns help the designer to design an adaptive application that is context and user aware, and that adapts its presentation layer according to the information requested and to who is requesting it.

Analysis of Users that Visit the Same Pages. The goal of this analysis is to discover Web communities of users on the basis of the pages they frequently visited. When this template is instantiated, the sets of IPcallers that visited the same, sufficiently large (>minValue), set of pages are returned.
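To make the mechanics of such a template concrete, the sketch below (ours, not the chapter's Constraint-Based Mining Language prototype) instantiates UsersRequestingSamePage directly with pandas. As a simplification, it counts only the complete IP set of each page group; a full frequent-item-set miner (e.g., Apriori) would also count the subsets of those sets:

from collections import Counter
import pandas as pd

def users_requesting_same_page(log: pd.DataFrame, search_engine_ips, min_value):
    """log has columns 'Page Url' and 'IPcaller'; returns IP sets and their support."""
    groups = log.groupby("Page Url")["IPcaller"].apply(set)        # one IP set per page
    counts = Counter()
    for ips in groups:
        ips = frozenset(ips - set(search_engine_ips))              # mining-feature constraint
        if ips:
            counts[ips] += 1                                       # support = number of pages
    return {ips: c for ips, c in counts.items() if c > min_value}  # evaluation function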
Template: UsersVisitingCommonPages
grouping features: Page Url
pattern features: {IPcaller}
mining features: none (or specific predicates qualifying some specific IPcallers)
evaluation function: count(Page Url) > [minValue]
In our experiments, we discovered that the most frequently co-occurring IP addresses belong to Web crawler engines or big entities, such as universities (occurring tens of times). At the immediately lower values of support, we found, as expected, the members of the various research groups of the university. A similar query would be used if we wished to discover user communities that share the same user profile in terms of the usage of network resources. In this case (as we will see in another example below), we would add constraints (in the mining features) on the volume of the data transferred as a consequence of the user request. Both of these requests would be followed by a postprocessing phase in order to obtain a description of the actual commonalities of the users. For instance, the postprocessing step would perform a crossover query between the users and the requested resources. Examples of discovered patterns are the frequent downloads of materials for courses and of the documentation provided in personal home pages. It is an open issue whether the regularities discovered among the IP addresses used by users in their visits occur because these IP addresses have been commonly shared by the same users; indeed, this phenomenon could reveal the existence of different IP addresses dynamically assigned to the same users, for instance, by the information system of a big institution.

Most Frequent Crawling Paths. The goal of this template is to discover sequences of pages (ordered by the date of visit) frequently visited by the users.

Template: FreqPaths
grouping features: IPcaller
pattern features: {Page Url}
mining features: Page Urls are ordered according to the Date feature
evaluation function: count(IPcaller) > [minValue]
You can notice that, in this case, we grouped by user (IPcaller) and searched for sets of pages frequently occurring in the visits of a sufficient number of users (evaluation function). Notice also that we used a condition on the mining features to constrain the temporal ordering between pages, thus ensuring the discovery of sequential patterns. In practice, examples of resulting patterns showed that requests of a research-centre page or research-expertise area were later followed by the home page of a professor. This pattern was later used by Web administrators as a hint for restructuring the Web site access paths. Indeed, the hypothesis is that the indexes for performing people search at the whole institution level were too slow. As a consequence, visitors searching for the personal home page of a person would have preferred to step into that page coming from the research-centre page with which the searched person was affiliated.

Units that Occur Frequently inside User Crawling Sessions. The goal of this request is to discover sets of content units that appeared together in at least a number (minValue) of crawling sessions. Notice that this query is actually possible only by virtue of the conceptual logs. In fact, the presence in the log of the
session identifier is not common and is a semantic notion that usually requires complicated heuristics to be reconstructed.

Template: CommonUnitsInSessions
grouping features: Jsession
pattern features: {UnitId}
mining features: none (or some specific predicate qualifying the particular units)
evaluation function: count(Jsession) > [minValue]
With this template, we discovered patterns that helped Web designers to redesign the Web application. In fact, we found that the units that most frequently co-occur in visits are the structural components of the Web site (indexes, overview pages, and so on). This finding could mean that the hypergraph was designed in such a way that final pages were too distant from the main access point. As a consequence, visitors almost always visited the structural components of the site without reaching their final destination.

Anomalies Detection. The goal of this request is to discover the associations between pages and users who caused authentication errors when accessing those pages. In other words, we wanted to discover those pages that could effectively be used by callers as a way to illegally enter the information system.

Template: Anomalies
grouping features: Date
pattern features: IPcaller, {Page Url}
mining features: IPcaller must have visited in a single Date all the pages in the set {Page Url}, and the request for those pages returned with an error = [bad authentication error]
evaluation function: count(Date) > [minValue]
In this template, we grouped the source data by date, thus identifying patterns (associations of users to page requests) that are frequent in time. Notice that the mining condition ensures that the page requests (Page Url) were effectively originated by the callers associated to them (IPcaller). Examples of the most frequently retrieved patterns are attempts to change passwords or to download some reserved information. Other queries could be instantiated from this template with the error code bound to other values, such as "page not found." These could help in identifying the pages whose links are obsolete or erroneous.

Communication-Channel Congestion Control. The goal is to discover the causes of congested traffic over the network. The aim is to retrieve the user IP addresses (with the volume of data requested in KB) that most frequently caused high traffic (the volume must be larger than a given threshold in KB).

Template: ChannelCongestion
grouping features: Date
pattern features: IPcaller, {KBytes}
mining features: IPcaller must have requested in a single Date the delivery of a volume of data (KBytes) > [thresholdDimension]
evaluation function: count(Date) > [minValue]
Notice that we grouped the input relation by date, thus identifying the users that requested high-volume pages frequently in time. After this query, we needed a crossover query that discovered those
pages causing the high traffic. As examples of discovered patterns, there are the requests for the frequent download of materials for courses and of the documentation provided on user home pages.

Application Tuning. This template aims at helping the designer in the debugging and tuning of the Web application. It searches for the (technological) conditions in which errors frequently occurred. For instance, it might return the associations between the operating system and browser at the client side and the error code frequently returned by the server.

Template: ApplicationTuning
grouping features: Date
pattern features: OS, Browser, {Return Code}
mining features: Return Code = [errorCode]; the requests, originated from a system configured with OS and Browser, ended with an error in {Return Code} (whose typology is errorCode)
evaluation function: count(Date) > [minValue]
Notice that in this template, we selected only the page requests that resulted in some errors, determined by the parameter errorCode. This template is similar to the one aimed at the discovery of anomalies. Both of them are useful to test the reliability and robustness of a new Web application, taking into account the different configuration settings that could occur at the client side.

Users that Frequently Visit Certain Pages. This template aims at discovering whether recurrent requests for a set of pages from a certain IP exist. This puts in evidence the fidelity of the users to the service provided by the Web site.

Template: UserFidelityToPages
grouping features: RequestId
pattern features: IPcaller, {Page Url}
mining features: requests to the pages in {Page Url} are originated from IPcaller
evaluation function: count(requestId) > [minValue]
Examples of patterns we discovered are provided again by the pages that allow the download of material (course slides, research reports).

Table 1. The parameter values used to evaluate the frequent patterns extracted by the instantiated template queries

Template Name                Evaluation-Function Threshold Value
UsersVisitingCommonPages     30 (0.001% of the total number of pages)
FreqPaths                    8 (IP callers)
CommonUnitsInSessions        3,530 (0.05% of the total number of sessions)
Anomalies                    3 (dates)
ChannelCongestion            3 (dates)
ApplicationTuning            1 (date)
Figure 4. Query execution times in experiments on conceptual logs (execution time in seconds, on a logarithmic scale, plotted per template: FreqPath, Anomalies, ChannelCongestion, ApplicationTuning, UserFidelityToPages, and CommonUnitsInSessions)
Notice that this latter template could have been instantiated with UnitId instead of Page Url. In this way, we would have leveraged the presence in the conceptual log of the information of the content unit, and we would have highlighted the association between content and users that is more specific than the usual association between pages and users. One of the main advantages gained by the conceptual Web logs is indeed the knowledge of the information content of the pages. These content units can give us more precise information on the ultimate reasons (e.g., real content) for which certain pages are frequently requested by the users. Patterns resulting from these requests confirm that the most recurrent ones are downloads of materials from the Web site.
Execution Times of Instances of Templates

We conducted a study on the performance and feasibility of instances of the presented templates. The template instances and mining queries were executed by running a prototype of a mining system for the extraction of frequent patterns with constraints. In Figure 4, we show the execution times of the above templates instantiated with the values of the parameters shown in Table 1.
Future Trends

During the last years, several methods and tools for the analysis of Web logs have been proposed, with the two emerging goals of calculating statistics about site activities and mining data about user profiles to support personalization. Several data-mining projects have demonstrated the usefulness of a representation of the structure and content organization of a Web application. However, the description of the application structure is considered a critical input to the pattern-discovery algorithms, providing information about expected user behaviours (Cooley, 2003).

Our approach is promising. It is based both on a conceptual modeling of the Web application and on templates for its analysis. Templates can be immediately instantiated according to the specificity of the
single cases, and can rely on a great amount of information coming both from the conceptual schemas of the application and from the runtime collection of usage data. In the future, this will also make it easier to analyze dynamic applications. This study points out a new trend in the analysis of Web applications. Conceptual modeling and analysis not only allow the improvement of the application's quality and efficiency, but also allow a deeper understanding of the users' experience on the Web and help to improve the offered service. In fact, the knowledge about the users themselves, their choices, and their preferences allows the construction of an adaptive site, possibly integrated with a recommendation system.
Conclusion

In this chapter, we discussed the advantages of adopting conceptual modeling for the design and maintenance of a data-intensive Web application. We presented a case study to testify to the power and versatility of conceptual modeling in consideration of the analysis of Web logs. Our application was the Web application of the DEI department, designed with WebML. The resulting Web log was a conceptual log, obtained by the integration of standard (ECLF) Web server logs with information on the Web application design and the information content of Web pages. In this case study, we also applied and evaluated the usability and flexibility of KDD scenarios for the analysis of Web logs. This proved the possibility of employing these scenarios in practice, as a sort of constraint-based mining templates to be instantiated with parameter values, resulting in patterns that are frequent in the data.

We can now draw some conclusions on the discovered patterns. In order to be useful, discovered patterns must be interesting for the user or analyst and actionable in the sense that immediate actions or decisions can be taken as a consequence of their observation. In constraint-based mining, the first point is immediately fulfilled by the retrieved patterns because, by definition, extracted patterns satisfy the given constraints. Indeed, constraints are provided by the analyst to the system in order to identify the desired patterns and discard all the remaining ones. Desired patterns could be the interesting ones, first of all, because they occur frequently and therefore refer to a statistically relevant number of events that occurred in the Web application, and secondarily because some user constraints can discriminate the properties of the desired pattern class with respect to some contrasting classes.

The second point is more difficult to establish in an absolute sense. Usually, a postprocessing phase following the proper mining phase is necessary to establish whether the retrieved patterns are actionable. Generally, a crossover query that retrieves the original data in which a pattern occurs is sufficient to reveal in which way, or which data representing real-life entities, are involved in the phenomena described by the patterns themselves. For instance, if patterns are found describing users' attempts to change passwords, or if patterns are found putting in evidence which user configuration settings are more frequently correlated to system errors, identifying which users are causing the problem, those patterns are immediately actionable because they suggest the way in which the problem can be solved. Other patterns, such as the users' most frequent crawling paths, are more difficult to translate into actions immediately because they involve a new design of the hypertext and main content areas that organize the Web site. Furthermore, these patterns could also identify two content areas that are contextually correlated but do not require the involvement of a new Web site design. In this case, the solution could consist of providing the users
with some additional suggestions on the pages where the user could find some correlated contents, as it happens in recommendation systems.
References

Aggarwal, C. C. (2004). On leveraging user access patterns for topic specific crawling. Data Mining and Knowledge Discovery, 9(2), 123-145.
Baresi, L., Garzotto, F., & Paolini, P. (2001). Extending UML for modeling Web applications. Proceedings of HICSS'01 (pp. 1285-1294).
Brachman, R. J., & Anand, T. (1996). The process of knowledge discovery in databases: A human-centered approach. In Advances in knowledge discovery and data mining (p. 239). San Francisco: Morgan Kaufmann.
Ceri, S., Fraternali, P., & Bongio, A. (2000). Web Modeling Language (WebML): A modeling language for designing Web sites. Proceedings of WWW9 Conference (pp. 137-157).
Ceri, S., Fraternali, P., Bongio, A., Brambilla, M., Comai, S., & Matera, M. (2002). Designing data-intensive Web applications. San Francisco: Morgan Kaufmann.
Ceri, S., Fraternali, P., Bongio, A., Butti, S., Acerbis, R., Tagliasacchi, M., et al. (2003). Architectural issues and solutions in the development of data-intensive Web applications. Proceedings of CIDR 2003, Asilomar, CA.
Cooley, R. (2000). Web usage mining: Discovery and application of interesting patterns from Web data. Unpublished doctoral dissertation, University of Minnesota, Minneapolis.
Cooley, R. (2003). The use of Web structures and content to identify subjectively interesting Web usage patterns. ACM Transactions on Internet Technology, 3(2), 93-116.
Dai, H., & Mobasher, B. (2002). Using ontologies to discover domain-level Web usage profiles. Proceedings of the 2nd Semantic Web Mining Workshop at ECML/PKDD-2002. Retrieved from http://km.aifb.uni-karlsruhe.de/ws/semwebmine2002/online_html
Demiriz, A. (2004). Enhancing product recommender systems on sparse binary data. Data Mining and Knowledge Discovery, 9(2), 147-170.
Facca, F. M., & Lanzi, P. L. (2005). Mining interesting knowledge from Weblogs: A survey. Data & Knowledge Engineering, 53(3), 225-241.
Fraternali, P. (1999). Tools and approaches for developing data-intensive Web applications: A survey. ACM Computing Surveys, 31(3), 227-263.
Fraternali, P., Lanzi, P. L., Matera, M., & Maurino, A. (2004). Model-driven Web usage analysis for the evaluation of Web application quality. Journal of Web Engineering, 3(2), 124-152.
Fraternali, P., Matera, M., & Maurino, A. (2003). Conceptual-level log analysis for the evaluation of Web application quality. Proceedings of LA-Web'03, Santiago, Chile.
Gallo, A., Esposito, R., Meo, R., & Botta, M. (2005). Optimization of association rules extraction through exploitation of context dependent constraints. In Lecture notes in artificial intelligence (Vol. 3673, p. 258). Berlin; Heidelberg, Germany: Springer Verlag.
Gomez, J., Cachero, C., & Pastor, O. (2001). Conceptual modeling of device-independent Web applications. IEEE MultiMedia, 8(2), 26-39.
Jin, X., Zhou, Y., & Mobasher, B. (2004). Web usage mining based on probabilistic latent semantic analysis. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 197-205).
Kohavi, R., & Parekh, R. (2003). Ten supplementary analyses to improve e-commerce Web sites. Proceedings of the 5th WEBKDD Workshop: Web Mining as a Premise to Effective and Intelligent Web Applications, ACM SIGKDD. Retrieved from http://www.acm.org/sigs/sigkdd/kdd2003/workshops/webkdd03/
Meo, R., Botta, M., Esposito, R., & Gallo, A. (2005). A novel incremental approach to association rules mining in inductive databases. In Constraint-based mining (pp. 267-294). Hinterzarten, Germany: Springer Verlag.
Meo, R., Lanzi, P. L., Matera, M., Careggio, D., & Esposito, R. (2005). Employing inductive databases in concrete applications. In Constraint-based mining (pp. 295-327). Hinterzarten, Germany: Springer Verlag.
Meo, R., Psaila, G., & Ceri, S. (1998). An extension to SQL for mining association rules. Journal of Data Mining and Knowledge Discovery, 2(2), 195-224.
Oberle, D., Berendt, B., Hotho, A., & Gonzales, J. (2003). Conceptual user tracking. In Lecture notes in artificial intelligence: Vol. 2663. Proceedings of the First International Atlantic Web Intelligence Conference, AWIC 2003 (pp. 142-154). Madrid: Springer Verlag.
Pei, J., & Han, J. (2002). Constrained frequent pattern mining: A pattern-growth view. SIGKDD Explorations, 4(1), 31-39.
Rossi, G., Schwabe, D., Esmeraldo, L., & Lyardet, F. (2001). Engineering Web applications for reuse. IEEE Multimedia, 8(1), 20-31.
Srikant, R., Vu, Q., & Agrawal, R. (1997). Mining association rules with item constraints. Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining: KDD'97 (pp. 67-73).
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. N. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23.
This work was previously published in Web Data Management Practices: Emerging Techniques and Technologies, edited by A. Vakali; G. Pallis, pp. 179-198, copyright 2007 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter XXVI
Web Usage Mining for Ontology Management

Brigitte Trousse, INRIA Sophia Antipolis, France
Marie-Aude Aufaure, INRIA Sophia and Supélec, France
Bénédicte Le Grand, Laboratoire d'Informatique de Paris 6, France
Yves Lechevallier, INRIA Rocquencourt, France
Florent Masseglia, INRIA Sophia Antipolis, France
ABSTRACT

This chapter proposes an original approach for ontology management in the context of Web-based information systems. Our approach relies on the usage analysis of the chosen Web site, in addition to the existing approaches based on Web page content analysis. Our methodology is based on knowledge discovery techniques applied mainly to HTTP Web logs, and aims at confronting the discovered knowledge in terms of usage with the existing ontology in order to propose new relations between concepts. We illustrate our approach on a Web site provided by local French tourism authorities (related to Metz city) with the use of clustering and sequential pattern discovery methods. One major contribution of this chapter is, thus, the application of usage analysis to support ontology evolution and/or Web site reorganization.
INTRODUCTION

Finding relevant information on the Web has become a real challenge. This is partly due to the volume of available data and the lack of structure in many Web sites. However, information retrieval may also be difficult in well-structured sites, such as those of tourist offices. This is not only due to the volume of data, but also to the way information is organized, as it does not necessarily meet Internet users' expectations. Mechanisms are necessary to enhance their understanding of visited sites. Local tourism authorities have developed Web sites in order to promote tourism and to offer services to citizens. However, this information is scattered and unstructured and thus does not match tourists' expectations. Our work aims to provide a solution to this problem by:

• Using an existing or semiautomatically built ontology intended to enhance information retrieval.
• Identifying Web users' profiles through an analysis of their visits employing Web usage mining methods, in particular automatic classification and sequential pattern mining techniques.
• Updating the ontology with extracted knowledge. We will study the impact of the visit profiles on the ontology. In particular, we will propose to update Web sites by adapting their structure to a given visit profile. For example, we will propose to add new hyperlinks in order to be consistent with the new ontology.
The first task relies on Web structure and content mining, while the second and the third are deduced from usage analysis. A good structure of information will allow us to extract knowledge from log files (traces of visits on the Web), and the extracted knowledge will help us update Web sites' ontology according to tourists' expectations. Local tourism authorities will thus be able to use these models and check whether their tourism policy matches tourists' behavior. This is essential in the domain of tourism, which is highly competitive. Moreover, the Internet is widely available and tourists may even navigate pages on the Web through wireless connections. It is therefore necessary to develop techniques and tools in order to help them find relevant information easily and quickly. In the future, we will propose to personalize the display in order to adapt it to each individual's preferences according to his profile. This will be achieved through the use of weights on ontology concepts and relations extracted from usage analysis. This chapter is composed of five sections. The first section presents the main definitions needed for comprehension. In the second, we briefly describe the state of the art in ontology management and Web mining. Then, we propose our methodology, mainly based on Web usage mining, for supporting ontology management and Web site reorganization. Finally, before concluding, we illustrate the proposed methodology in the tourism domain.
BACKGROUND

This section presents the context of our work and is divided into two subsections. We first provide the definitions of the main terms we use in this chapter; then we study the state-of-the-art techniques related to ontology management and to Web mining.
Definitions

• Web mining is a knowledge discovery process from databases (called KDD) applied to Web data. Web mining can extract patterns from various types of data; three areas of Web mining are often distinguished: Web content mining, Web structure mining, and Web usage mining (Kosala & Blockeel, 2000).
• Web content mining is a form of text mining applied to Web pages. This process reveals relationships related to a particular domain, for example via co-occurrences of terms in a text.
• Web structure mining is used to examine data related to the structure of a Web site. This process operates on Web pages' hyperlinks. Structure mining can be considered as a specialisation of graph mining.
• Web usage mining is applied to usage data such as those contained in log files. A log file contains information related to the queries executed by users on a particular Web site. Web usage mining can be used to support the modification of the Web site structure or to give some recommendations to the visitors. Personalization can also be enhanced by usage analysis (Trousse, Jaczynski, & Kanawati, 1999).
• A thesaurus is a "controlled vocabulary arranged in a known order and structured so that equivalence, homographic, hierarchical, and associative relationships among terms are displayed clearly and identified by standardized relationship indicators" (ANSI/NISO, 1993). The purpose of a thesaurus is to facilitate document retrieval. The WordNet thesaurus organizes English nouns, verbs, adverbs, and adjectives into sets of synonyms and defines relationships between synonyms.
• An ontology aims at formalizing domain knowledge in a generic way and provides a common agreed understanding of a domain, which may be used and shared by applications and groups. According to Gruber, "an ontology is an explicit specification of a conceptualization" (Gruber, 1993).
In computer science, the word ontology, borrowed from philosophy, represents a set of precisely defined terms (vocabulary) about a specific domain and accepted by this domain's community. Ontologies thus enable people to agree upon the meaning of terms used in a precise domain, knowing that several terms may represent the same concept (synonyms) and several concepts may be described by the same term (ambiguity). Ontologies consist of a hierarchical description of important concepts and of each concept's properties. Ontologies are at the heart of information retrieval from nomadic objects, from the Internet, and from heterogeneous data sources. Ontologies are generally made up of a taxonomy or vocabulary and of inference rules.

• Information retrieval provides answers to precise queries, whereas data mining aims to discover new and potentially useful schemas from data, which can be materialized with metadata (which are data about data).
• A user profile is a set of attributes concerning Web sites' visitors. These attributes provide information that will be used to personalize Web sites according to users' specific needs. Two kinds of information about users—explicit and implicit—can be distinguished. Explicit knowledge about users may come, for example, from the user's connection mode (with or without a login), his status (e.g., subscription), or personal information (address, preferences, etc.). On the other hand, implicit knowledge—provided by log files—is extracted from users' visits on the Web. In this work, we mainly focus on implicit data extracted from Web usage. Users' visits will be analyzed so that the results can be applied online—during the same session—or during future Web visits.
• A user session/visit profile corresponds to the implicit information about users based on user sessions or user visits (cf. the 'Step 2' section of our methodology for precise definitions).
State of the Art of Techniques Related to Ontology Management and to Web Mining

Ontology Management

Regarding ontology management, our state of the art focuses on ontology construction and evolution.
(a) Ontology Construction

Methodologies for ontology building can be classified according to the use or non-use of a priori knowledge (such as thesauri, existing ontologies, etc.) and also according to the learning methods. The first ones were dedicated to enterprise ontology development (Gruber, 1993; Grüninger & Fox, 1995) and were built manually. Then, methodologies for building ontologies from scratch were developed, which do not use a priori knowledge. An example of such a methodology is OntoKnowledge (Staab et al., 2001), which proposes a set of generic techniques, methods, and principles for each process (feasibility study, initialization, refinement, evaluation, and maintenance). Some research work is dedicated to collaborative ontology building, such as CO4 (Euzenat, 1995) and (KA)2 (Decker et al., 1999). Another research area deals with ontology reengineering (Gòmez-Pérez & Rojas, 1999). Learning methodologies can be distinguished according to their input data type: texts, dictionaries (Jannink, 1999), knowledge bases (Suryanto & Compton, 2001), relational data (Rubin et al., 2002; Stojanovic et al., 2002), and semistructured data (Deitel et al., 2001; Papatheodrou et al., 2002; Volz et al., 2003).

In the following subsections, we focus on the general dimensions implied in ontology learning, and describe some approaches for ontology learning from Web pages. Existing methods can be distinguished according to the following criteria: learning sources, type of ontology, techniques used to extract concepts, relationships and axioms, and existing tools. The most recent methodologies generally use a priori knowledge such as a thesaurus, a minimal ontology, other existing ontologies, and so forth. Each one proposes different techniques to extract concepts and relationships, but not axioms. These axioms can represent constraints but also inferential domain knowledge. As for instance extraction, we can find techniques based on first-order logic (Junker et al., 1999), on Bayesian learning (Craven et al., 2000), and so forth. We have to capitalize on the results obtained by the different methods and characterize existing techniques, their properties, and the ways we can combine them. The objective of this section is to describe the characteristics of these methods.

Learning ontologies is a process requiring at least the following development stages:

• Knowledge source preparation (textual compilations, collection of Web documents), potentially using a priori knowledge (ontology with a high-level abstraction, taxonomy, thesaurus, etc.).
• Data source preprocessing.
• Concepts and relationships learning.
• Ontology evaluation and validation (generally done by experts).

The ontology is built according to the following characteristics:

• Input type (data sources, a priori knowledge existence or not, …).
• Tasks involved for preprocessing: simple text linguistic analysis, document classification, text labeling using lexico-syntactic patterns, disambiguating, and so forth.
• Learned elements: concepts, relationships, axioms, instances, thematic roles.
• Learning methods characteristics: supervised or not, classification, clustering, rules, linguistic, hybrid.
• Automation level: manual, semiautomated, automatic, cooperative.
• Characteristics of the ontology to be built: structure, representation language, coverage.
• Usage of the ontology and users' needs (Aussenac-Gilles et al., 2002).
(b) Ontology Evolution

Since the Web is constantly growing and changing, we must expect ontologies to change as well. The reasons to update ontologies are various (Noy et al., 2004): there may have been errors in prior versions, a new terminology or way of modeling the domain may be preferred, or the usage may have changed. Specification changes are due, for example, to the transformation of the representation language, while changes at the domain level are comparable to database schema evolution. Finally, modifications of the conceptualization concern the application or the usage. The W3C distinguishes ontology evolution and ontology extension. In the first case, a complete restructuring of the ontology is possible, while ontology extension does not suppress or modify existing concepts and relations of the original ontology.

Ontology evolution can be achieved in different ways. This evolution may be linked to changes at the application level, which requires dealing with the integration of new data sources and their impact on the ontology. In this context, the challenge consists of specifying the way to link ontology evolution to the evolution of the corpus or of the other resources which justify it—such as ontologies, thesauri, or text collections. Another possible case of evolution is when two versions of an ontology exist: differences are searched for with techniques quite similar to those used for semantic mapping between ontologies. Evolution and version management can fully take advantage of the numerous works achieved in the databases domain (Rahm & Bernstein, 2001). The issue of ontology evolution is essential for perennial applications. The compatibility between various versions can be defined as follows:

• Instance preservation.
• Ontology preservation (any query result obtained with the new version is a superset of the old version's results, or the facts inferred with the old version's axioms can also be inferred with the new one).
• Consistency preservation (the new version does not introduce any inconsistency).
An open research problem in this area is the development of algorithms allowing an automatic detection of differences between various versions.
The area of ontology evolution is very active and a lot of work is being done in this domain (Castano et al., 2006; Flouris et al., 2006). The Boemie ("Bootstrapping Ontology Evolution with Multimedia Information Extraction," http://www.boemie.org/) IST 6th Framework Programme Project (FP6-027538), which started in March 2006 (Spyropoulos et al., 2005), will pave the way towards the automation of knowledge acquisition from multimedia content, by introducing the notion of evolving multimedia ontologies.
Web Mining

Web mining is a KDD process applied to Web data. Vast quantities of information are available on the Web, and Web mining has to cope with its lack of structure. A typical KDD process is made of four main steps (cf. Figure 1):

1. Data selection aims to extract from the database or data warehouse the information needed by the data mining step.
2. Data transformation will then use parsers in order to create data tables which can be used by the data mining algorithms.
3. Data mining techniques range from sequential patterns to association rules or cluster discovery.
4. Finally, the last step allows the reuse of the obtained results in a usage analysis/interpretation process.
In such a KDD process, we insist on the importance of the preprocessing step composed of selection and transformation substeps. The input data usually comes from databases or from a file in a standard (such as XML) or private format. Then various data mining techniques may be used according to the data types in order to extract new patterns (association rules, sequential patterns, clusters). And finally some graphical tools and different quality criteria are used in the interpretation step in order to validate the new extracted patterns as new knowledge to be integrated. In the context of Web usage mining, the data mining methods are applied on usage data relying, if possible, on the notion of user session or user visit. This notion of session enables us to act at the appropriate level during the process of knowledge extraction from log files (Tanasa, 2005; Tanasa & Trousse, 2004). Moreover, a Web site’s structure analysis can make knowledge extraction easier. It may also allow us to compare usage analysis with information available on the site, which can lead to Web site and/or ontology updates.
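Read as a pipeline over usage data, the four steps can be sketched as follows (a schematic illustration only; the column names and the trivial "consecutive page pair" pattern are hypothetical and are not the methods used later in this chapter):

import pandas as pd

def kdd_pipeline(log_path: str):
    # 1. Data selection: keep only the fields needed by the mining step
    raw = pd.read_csv(log_path)
    selected = raw[["session_id", "timestamp", "page_url"]]          # hypothetical columns
    # 2. Data transformation: build one ordered page list per session
    sessions = (selected.sort_values("timestamp")
                        .groupby("session_id")["page_url"].apply(list))
    # 3. Data mining: a trivial pattern extraction (frequent pairs of consecutive pages)
    pairs = pd.Series([tuple(s[i:i + 2]) for s in sessions for i in range(len(s) - 1)])
    patterns = pairs.value_counts()
    # 4. Interpretation: keep only the patterns frequent enough to be reported to the analyst
    return patterns[patterns >= 10]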
WEB USAGE MINING FOR ONTOLOGY MANAGEMENT

This section presents the proposed methodology for supporting the management of Web sites' ontologies based on usage mining. Based on the previous states of the art in the two communities—ontology management and Web mining—we can notice that there is a great synergy between content mining (or Web content mining) and ontology management. Indeed, ontologies could be built or updated through content mining in the context of Web sites. Web mining could be improved through the use of ontologies or Web semantics, also called "Semantic Web mining" by (Berendt et al., 2002; Berendt et al., 2005). On the contrary, the
Figure 1. Steps of the KDD process
use of Web structure mining and Web usage mining has not really been explored for supporting ontology construction and evolution. Below, we propose our methodology for supporting ontology extension and Web site reorganization based on Web usage mining techniques.
Proposed Methodology

Our methodology is divided into four steps:

• The first step consists of building the ontology of the considered Web sites—or of enriching the existing ontology if one is already available. The construction of this basic ontology can be achieved through knowledge extraction from Web sites, based on content and structure mining. In the experiment presented in the section entitled 'Illustration in the Tourism Domain,' we used an existing ontology and modified it according to the structure of the studied Web sites.
• The second step consists of preprocessing the logs (raw data) into a rich usage data warehouse based on various notions such as user session, user visit, and so forth.
• The third step aims to apply data mining techniques to users' visits on these Web sites through the analysis of log files in order to update the current ontology. Web usage mining techniques allow us to define visit profiles (or navigation profiles) and then to facilitate the emergence of new concepts, for example by merging several existing ones. Two different methods will be emphasized and used to analyze usage: clustering and sequential pattern mining.
• Finally, we combine the information obtained from these two methods. One expected result is the ability to update the basic ontology with respect to user visits.

The four steps of our methodology—as shown in Figure 2 (derived from (AxIS, 2005))—are detailed in the following subsections. We first describe ontology construction and preprocessing methods applied to usage data in order to make them usable for analysis. Secondly, we describe two different techniques
Figure 2. Steps of the proposed methodology (Step 1: ontology construction from the Web site's content and/or structure (document, content and structure mining); Steps 2 and 3: preprocessing of the usage data (logs), then usage mining by clustering and frequent sequential pattern extraction, yielding clusters of user visits based on visited elements of the ontology and sequential patterns of such elements; Step 4: updating the ontology by confronting these extracted patterns and clusters with the current ontology, producing recommendations of ontology changes)
employed in Web usage mining: clustering and sequential pattern mining. Finally, we explain how to use Web mining to support ontology evolution.
Step 1: Ontology Construction Methods

Ontology construction can be performed manually or semiautomatically. In the first case, this task is hard and time consuming. This is the reason why many methods and methodologies have been designed to semiautomate this process. The data sources can be text, semistructured data, relational data, and so forth. In the following, we describe some methods dedicated to knowledge extraction from Web pages. A survey on ontology learning methods and tools can be found on the OntoWeb Web site (http://ontoweb.org/Members/ruben/Deliverable%201.5). Many methods or methodologies have been proposed to enrich an existing ontology using Web documents (Agirre et al., 2000; Faatz & Steinmetz, 2002). However, these approaches are not specifically dedicated to Web knowledge extraction. The approach proposed by Navigli and Velardi (2004) attempts to reduce the terminological and conceptual confusion between members of a virtual community. Concepts and relationships are learned from a set of Web sites using the OntoLearn tool. The main steps are: terminology extraction from Web sites and Web document warehouses, semantic interpretation of terms, and identification of taxonomic relationships. Some approaches transform HTML pages into a semantic structured hierarchy encoded in XML, taking into account HTML regularities (Davulcu et al., 2003). Finally, we can also point out some approaches dedicated only to ontology construction from Web pages without using any a priori knowledge. The approach described in Sanchez and Moreno (2004) is based on the following steps: (1) extract some keywords representative of the domain, (2) find a collection of Web sites related to the previous
keywords (using, for example, Google), (3) perform an exhaustive analysis of each Web site, (4) the analyzer searches for the initial keywords in a Web site and finds the preceding and following words; these words are candidates to be concepts, (5) for each selected concept, a statistical analysis is performed based on the number of occurrences of this word in the Web sites, and finally, (6) for each concept extracted using a window around the initial keyword, a new keyword is defined and the algorithm iterates recursively. In Karoui et al. (2004), a method is proposed to extract a domain ontology from Web sites without using a priori knowledge. This approach takes the structure of Web pages into account and defines a contextual hierarchy. The data preprocessing is an important step to define the most relevant terms to be classified. Weights are associated with the terms according to their position in this hierarchy. Then, these terms are automatically classified and concepts are extracted. In Ben Mustapha et al. (2006), the authors define an ontological architecture based on a semantic triplet, namely: semantics of the contents, structure and services of a domain. Their work focuses on domain ontology construction and is based on a metaontology that represents the linguistic structure and helps to extract lexico-syntactic patterns. This approach is a hybrid one, based on statistical and linguistic techniques. A set of candidate concepts, relationships and lexico-syntactic patterns is extracted from a domain corpus and iteratively validated using other Web corpora. Experiments have been carried out in the tourism domain. Many projects also include ontology construction, such as the French projects Picsel (http://www.lri.fr/~picsel/) or WebContent (http://www.Webcontent-project.org).
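To make the keyword-window idea of steps (4) and (5) concrete, the following is a minimal sketch (not the implementation used by the cited authors): it collects the words that immediately precede or follow a given domain keyword in a set of page texts and keeps the most frequent ones as candidate concepts. The function name, the tokenization and the thresholds are illustrative assumptions.

```python
from collections import Counter
import re

def candidate_concepts(pages, keyword, window=1, min_count=2):
    """Collect words that appear within a small window around a domain keyword
    in a collection of page texts, and keep the most frequent ones as
    candidate concepts (a rough analogue of steps 4-5 above)."""
    counts = Counter()
    for text in pages:
        tokens = re.findall(r"[a-zA-Z]+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        counts[tokens[j]] += 1
    return {w: c for w, c in counts.items() if c >= min_count}

# Example: candidate concepts around the keyword "hotel" (toy page texts)
pages = ["Book a hotel room in Metz", "Our hotel restaurant is open daily",
         "The hotel restaurant serves local specialties"]
print(candidate_concepts(pages, "hotel", min_count=2))  # {'restaurant': 2}
```

A real system would add stop-word filtering and the recursive iteration of step (6).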
Step 2: Web Logs Preprocessing Methods

In order to prepare the logs for usage analysis, we used the methodology for multisite log data preprocessing recently proposed by Tanasa (2005) and Tanasa and Trousse (2004).
Definitions

Given below are the main definitions extracted from Tanasa (2005, p. 2), in accordance with the World Wide Web Consortium's work on Web characterization terminology:

• A resource, according to the W3C's uniform resource identifier specification, can be "anything that has identity." Possible examples include an HTML file, an image, or a Web service.
• A URI is a compact string of characters for identifying an abstract or physical resource.
• A Web resource is a resource accessible through any version of the HTTP protocol (for example, HTTP 1.1 or HTTP-NG).
• A Web server is a server that provides access to Web resources.
• A Web page is a set of data constituting one or several Web resources that can be identified by a URI. If the Web page consists of n resources, the first (n-1) are embedded and the nth URI identifies the Web page.
• A page view (also called a hit) occurs at a specific moment in time, when a Web browser displays a Web page.
• A Web browser or Web client is client software that can send Web requests, handle the responses, and display requested URIs.
• A user is a person using a Web browser.
• A Web request is a request a Web client makes for a Web resource. It can be explicit (user initiated) or implicit (Web client initiated). Explicit Web requests (also called clicks) are classified as embedded (the user selected a link from a Web page) or user-input (the user manually initiates the request—for example, by typing the address in the address bar or selecting the address from the bookmarks or history). Implicit Web requests are generated by the Web client that needs the embedded resources from a Web page (images, multimedia files, script files, etc.) in order to display that page.
• A Web server log file contains Web requests made to the Web server, recorded in chronological order. The most popular log file formats are the Common Log Format (www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format) and the Extended CLF. A line in the ECLF contains the client's host name or IP address, the user login (if applicable), the request's date and time, the operation type (GET, POST, HEAD, and so on), the requested resource's name, the request status, the requested page's size, the user agent (the user's browser and operating system), and the referrer. A given Web page's referrer is the URL of the Web page containing the link that the user followed to reach the current page. For example: 192.168.0.1 - [03/Feb/2006:14:05:59 +0200] "GET /francais/geo/map.html HTTP/1.1" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)"
• A user session is a delimited number of a user's explicit Web requests across one or more Web servers. User identification depends on the Web site policies. For Web sites requesting registration, the user identification task is straightforward. In other cases, we use the couple (IP, User Agent) for grouping requests into user sessions: the couple (IP, user agent) associated with the set of requests is given a session ID. For instance, a user session S12 = <(192.168.0.1, "Mozilla/4.0 …"), (['/english/geo/map.html', Fri Feb 3 14:05:59] ['/english/leasure/index.html', Fri Feb 3 14:06:17])> means that a user has requested URL '/english/geo/map.html' followed by URL '/english/leasure/index.html' 18 seconds later.
• A visit (also called a navigation) is a subset of consecutive page views from a user session occurring closely enough (by means of a time threshold or a semantic distance between pages).
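As an illustration of the log formats just described, the Python sketch below parses one combined/extended CLF line into its fields. The regular expression and the sample line are assumptions (the exact field layout depends on the server configuration), not the format of the Metz log files used later.

```python
import re

# One regular expression for a combined/extended CLF line; only a sketch,
# since real servers may omit or reorder the referrer and user-agent fields.
ECLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = ECLF_PATTERN.match(line)
    return m.groupdict() if m else None

line = ('192.168.0.1 - - [03/Feb/2006:14:05:59 +0200] '
        '"GET /francais/geo/map.html HTTP/1.1" 200 1074 '
        '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2)"')
print(parse_line(line)["url"])   # -> /francais/geo/map.html
```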
Figure 3. Four preprocessing steps
Figure 4. Relational model of the log database
(The model comprises a central fact table, Base, with one line per request (IDRequest, IDNavigation, IDSession, IDIP, IDUser, IDUserAgent, IDDateTime, Duration, FileSize, IDStatus, IDURL, IDReferer), linked to dimension tables for sessions, navigations, IP addresses, dates and times, statuses, URLs and referrers.)
Classical Preprocessing and Advanced Preprocessing

The data preprocessing was done in four steps (cf. Figure 3), as proposed in Tanasa and Trousse (2004): data fusion, data cleaning, data structuration, and data summarization. Generally, in the data fusion step, the log files from different Web servers are merged into a single log file. During the data cleaning step, the nonrelevant resources (e.g., jpg, js, gif files) and all requests that have not succeeded (cf. status code) are eliminated. In the data structuration step, requests are grouped by user, user session, and visit. Finally, after identifying each variable in the accessed URLs and their corresponding descriptions, we define a relational database model to use, and the preprocessed data is stored in a relational database during the data summarization step. Such a preprocessing process is supported by the AxISLogMiner toolbox (Tanasa, 2005, pp. 97-114). Our database model for WUM is a star schema implemented with the Microsoft Access software. The "Base" table (cf. Figure 4) is the main or fact table in this model and each line contains information about one request for a page view from the provided log files. The dimension tables used are the tables "URL1" and "Navigation." The "URL1" table contains the list of the pages viewed with the decomposition of the file path associated with each page. The "Navigation" table contains the list of sessions, each session being described by several variables (date, number of requests, etc.). We then extract, through queries on this database, a data table for our analysis (Sauberlich & Huber, 2001). The statistical units are the set of visits (i.e., navigations) which verify some properties. For each visit, we associate the list of the visited pages, which are weighted by the number of requests.
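As a small illustration of the data cleaning step, the sketch below filters out requests for embedded resources and unsuccessful requests. The record layout follows the parsing sketch given earlier, and the extension list and status test are illustrative choices rather than the exact AxISLogMiner rules.

```python
# Illustrative list of extensions treated as embedded (non-relevant) resources.
NON_RELEVANT_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png", ".js", ".css", ".ico")

def clean(requests):
    """Keep only successful page-view requests from a list of parsed log records."""
    kept = []
    for r in requests:
        url = r["url"].lower().split("?")[0]
        if url.endswith(NON_RELEVANT_EXTENSIONS):
            continue                      # embedded resource, not a page view
        if r["status"] != "200":          # status kept as a string by parse_line
            continue                      # failed or redirected request
        kept.append(r)
    return kept
```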
Step 3: Web Usage Mining Methods

In the following subsections, we describe the data mining methods we applied to Web usage data in order to illustrate our methodology.
Clustering Methods

Appropriate use of a clustering algorithm is often a useful first step in extracting knowledge from a database. Clustering, in fact, leads to a classification, that is, the identification of homogeneous and distinct subgroups in data (Bock, 1993; Gordon, 1981), where the definition of homogeneous and distinct depends on the particular algorithm used: this is indeed a simple structure which, in the absence of a priori knowledge about the multidimensional shape of the data, may be a reasonable starting point towards the discovery of richer and more complex structures. In spite of the great wealth of clustering algorithms, the rapid accumulation of large databases of increasing complexity raises a number of new problems that traditional algorithms are not suited to address. One important feature of modern data collection is the ever increasing size of a typical database: it is not unusual to work with databases containing from a few thousand to a few million items and hundreds or thousands of variables. Now, most clustering algorithms of the traditional type are severely limited as to the number of individuals they can comfortably handle (hundreds to thousands).
(a) General Scheme of the Dynamical Clustering Algorithm

Let E be a set of n objects; each object s of E is described by p variables of V. The description space defined by the set of variables is D, and x_s is the description vector of object s in D. A weight \mu_s > 0 can be associated with the object s. The proposed clustering algorithm (Diday, 1975), following the dynamical clustering scheme, looks simultaneously for a partition P of E into k classes and for a vector L of k prototypes (g_1, \dots, g_i, \dots, g_k) associated with the classes (C_1, \dots, C_i, \dots, C_k) of the partition P that minimizes a criterion \Delta:

\Delta(P^*, L^*) = \min \{ \Delta(P, L) \mid P \in P_k, L \in D^k \}

with P_k the set of partitions of E into k non-empty classes. The criterion \Delta expresses the fit between the partition P and the vector of the k prototypes. It is defined as the sum of the weighted squared distances between all the objects s of E and the prototypes g_i of the nearest class C_i:

\Delta(P, L) = \sum_{i=1}^{k} \sum_{s \in C_i} \mu_s \, \psi^2(x_s, g_i), \quad C_i \in P, \; g_i \in D

The algorithm alternates a representation step and an assignment step until the convergence of the criterion \Delta. After N_r runs, the selected partition is the partition which minimizes the criterion \Delta. The convergence of the criterion \Delta to a stationary point is obtained under the following conditions:

• uniqueness of the class of assignment of each object of E;
• existence and uniqueness of the prototype g_C minimizing the value of \sum_{s \in C} \mu_s \, \psi^2(x_s, g) for all classes C of E. If \psi is the Euclidean distance, the prototype g_C is equal to

g_C = \frac{\sum_{s \in C} \mu_s x_s}{\sum_{s \in C} \mu_s}

For more details on the chosen algorithm, see Diday (1975).
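For readers who prefer code to notation, here is a minimal sketch of the alternating assignment/representation loop, assuming a plain Euclidean psi and NumPy arrays (only one of the settings covered by the general scheme); the fixed iteration count stands in for a proper convergence test on Delta.

```python
import numpy as np

def dynamical_clustering(X, k, weights=None, n_runs=5, n_iter=30, seed=0):
    """Alternate an assignment step (nearest prototype) and a representation
    step (weighted mean of each class), and keep the best of n_runs runs."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_runs):
        prototypes = X[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(n_iter):
            # assignment step: each object goes to the class of its nearest prototype
            d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # representation step: the prototype becomes the weighted mean of its class
            for i in range(k):
                members = labels == i
                if members.any():
                    prototypes[i] = np.average(X[members], axis=0, weights=w[members])
        # criterion Delta for the final partition and prototypes
        d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        delta = (w * d2[np.arange(n), labels]).sum()
        if delta < best[0]:
            best = (delta, labels, prototypes)
    return best  # (criterion value, partition, prototypes)
```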
(b) The Crossed Clustering Approach

Our aim is to obtain simultaneously a row partition and a column partition from a contingency table. Some authors (Govaert, 1977; Govaert & Nadif, 2003) proposed the maximization of the chi-squared criterion between the rows and columns of a contingency table. As in the classical clustering algorithm, the optimized criterion is based on the best fit between the classes of objects and their representation. In our analysis of the relations between visits and pages, we propose to represent the classes by prototypes which summarize the whole information of the visits belonging to each of them. Each prototype is itself modeled as an object described by multicategory variables with associated distributions. In this context, several distances and dissimilarity functions could be proposed for the assignment. In particular, if the objects and the prototypes are described by multicategory variables, the dissimilarity measure can be chosen as a classical distance between distributions (e.g., chi-squared). The convergence of the algorithm to a stationary value of the criterion is guaranteed by the best fit between the type of representation of the classes and the properties of the allocation function. Different algorithms have been proposed according to the type of descriptors and to the choice of the allocation function. The crossed dynamic algorithm (Verde & Lechevallier, 2003) has been proposed in different contexts of analysis, for example: to cluster archaeological data described by multicategorical variables, or to compare socio-economic characteristics in different geographical areas with respect to the distributions of some variables (e.g., economic activities, income distributions, hours worked, etc.).
Sequential Pattern Extraction Methods

Sequential pattern mining deals with data represented as sequences (a sequence contains sorted sets of items). Compared to association rule extraction, the study of such data provides «inter-transaction» analysis (Agrawal & Srikant, 1995). Due to the notion of time embedded in the data, applications for sequential pattern extraction are numerous and the problem definition has been slightly modified in several ways. Together with elegant solutions, these problem variants can match real-life timestamped data (when association rules fail) and provide useful results.
(a) Definitions

Let us first provide additional definitions. The item is the basic value for numerous data mining problems; it can be considered as the object bought by a customer or the page requested by the user of a Web site, etc. An itemset is the set of items that are grouped by timestamp (e.g., all the pages requested by the user on June 4, 2004). A data sequence is a sequence of itemsets associated with a customer. Table 1 gives a simple example of four customers and their activity over four days on the Web site of New York City. In Table 1, the data sequence of C2 is the following: «(Met, Subway) (Theater, Liberty) (Restaurant)»,
Table 1. Data sequences of four customers over four days

Cust | June 04, 2004    | June 05, 2004    | June 06, 2004 | June 07, 2004
C1   | Met, Subway      | Digital Camera   | Bridge        | Central Park
C2   | Met, Subway      | Theater, Liberty |               | Restaurant
C3   | Theater, Liberty | Bridge           | Restaurant    | Central Park
C4   | Met, Subway      |                  | Empire State  | Theater, Liberty
which means that the customer searched for information about the Metropolitan Museum and about the subway the same day, followed by Web pages on a theater and the Statue of Liberty the following day, and finally a restaurant two days later. A sequential pattern is included in a data sequence (for instance «(Subway) (Restaurant)» is included in the data sequence of C2, whereas «(Theater) (Met)» is not, because of the timestamps). The minimum support is specified by the user and stands for the minimum number of occurrences of a sequential pattern to be considered as frequent. A maximal frequent sequential pattern is included in at least «minimum support» data sequences and is not included in any other frequent sequential pattern. With a minimum support of 50%, a sequential pattern is considered frequent if it occurs in the data sequences of at least 2 customers (2/4). In this case, a maximal sequential pattern mining process will find three patterns:

• S1: «(Met, Subway) (Theater, Liberty)»
• S2: «(Theater, Liberty) (Restaurant)»
• S3: «(Bridge) (Central Park)»
One can observe that S1 is included in the data sequences of C2 and C4, S2 is included in those of C2 and C3, and S3 in those of C1 and C3. Furthermore, the patterns do not have the same length (S1's length = 4, S2's length = 3 and S3's length = 2).
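The inclusion test used in these definitions can be written down directly. The following small sketch (the function names are ours) checks whether a pattern is included in a data sequence and computes its support; it reproduces the two inclusion examples given for C2 above.

```python
def contains(data_sequence, pattern):
    """True if the pattern (a list of itemsets) is included in the data sequence,
    i.e., its itemsets appear, in order, as subsets of itemsets of the sequence."""
    pos = 0
    for itemset in pattern:
        while pos < len(data_sequence) and not set(itemset) <= set(data_sequence[pos]):
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1
    return True

def support(database, pattern):
    """Fraction of data sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database) / len(database)

c2 = [{"Met", "Subway"}, {"Theater", "Liberty"}, {"Restaurant"}]
print(contains(c2, [{"Subway"}, {"Restaurant"}]))  # True
print(contains(c2, [{"Theater"}, {"Met"}]))        # False (wrong order of timestamps)
```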
(b) Web Usage Mining Based on Sequential Patterns

Various techniques for sequential pattern mining have been applied to access log files (Bonchi et al., 2001; Masseglia et al., 1999; Masseglia et al., 2004; Nakagawa & Mobasher, 2003; Spiliopoulou et al., 1999; Tanasa, 2005; Zhu et al., 2002). The main interest in employing such algorithms for Web usage data is that they take into account the time dimension of the data (Masseglia et al., 1999; Spiliopoulou et al., 1999). The WUM tool (Web Utilisation Miner) proposed by Spiliopoulou et al. (1999) allows the discovery of visit patterns which are interesting either from the statistical point of view or through their structure. The extraction of sequential patterns proposed by WUM is based on the frequency of the considered patterns. Other, subjective criteria that can be specified for the visit patterns are, for example, passing through pages with certain properties or a high confidence level between two or several pages of the visit pattern.
In Masseglia et al. (1999), the authors propose the WebTool platform. The extraction of sequential patterns in WebTool is based on PSP, an algorithm developed by the authors, whose originality is to propose a prefix tree storing both the candidates and the frequent patterns. Unfortunately, the dimensionality of these data (in terms of different items—pages—as well as sequences) is an issue for sequential pattern mining techniques. More precisely, because of the significant number of different items, the number of results obtained is very small. The solution would consist of lowering the minimum support used, but in this case the algorithms would not be able to finish the process and thus to provide results. A proposal for solving this issue was made by the authors of Masseglia et al. (2004), who were interested in extracting sequential patterns with low support, on the basis that high values of support often generate obvious patterns. To overcome the difficulties encountered by sequential pattern mining methods when lowering the value of the minimum support, the authors proposed to address the problem in a recursive way in order to proceed to a phase of data mining on each subproblem. The subproblems correspond to the users' visits with common objectives (i.e., sublogs containing only the users who passed through similar pages). The patterns obtained this way range from visits to a research team's tutorial to a set of hacking attempts which used the same intrusion techniques. Tanasa (2005) included these two proposals in a more general methodology of sequential pattern mining with low support. Another solution is to reduce the number of items by using a generalization of URLs. In Fu et al. (2000), the authors use a syntactic generalization of URLs with a different type of analysis (clustering). Before applying a clustering with the BIRCH algorithm (Zhang et al., 1996), the syntactic topics of a level greater than two are replaced by their syntactic topics of a lower level. For example, instead of http://www-sop.inria.fr/axis/Publications/2005/all.html, they will use http://www-sop.inria.fr/axis/ or http://www-sop.inria.fr/axis/Publications/. However, this syntactic generalization, although automatic, is naive because it is based only on the physical organization given to the Web site's pages. An improper organization will implicitly generate a bad clustering and thus generate results of low quality. In Tanasa and Trousse (2004), a generalization based on semantic topics is made during the preprocessing of Web logs. These topics (or categories) are given a priori by an expert of the field to which the Web site belongs. However, this is a time-consuming task both for the creation and for the update and maintenance of such categories. For traditional Web usage mining methods, the general idea is similar to the principle proposed in Masseglia et al. (1999). Raw data is collected in a Web log file according to the structure described in the section 'Step 2: Definitions.' This data structure can be easily transformed into the one used by sequential pattern mining algorithms. A record in a log file contains, among other data, the client IP, the date and time of the request, and the Web resource requested. To extract frequent behaviors from such a log file, for each user session or visit in the log file, we first have to transform the ID-Session into a client number (ID), the date and time into a time number, and the URL into an item number.
Table 2. File obtained after a preprocessing step

Client | d1 | d2 | d3 | d4 | d5
1      | a  | c  | d  | b  | c
2      | a  | c  | b  | f  | c
3      | a  | g  | c  | b  | c
Table 2 gives an example of the file obtained after that preprocessing. For each client, there is a corresponding series of times along with the URL requested by the client at each time. For instance, client 2 requested the URL "f" at time d4. The goal is thus, by means of a data mining step, to find the sequential patterns in the file that can be considered as frequent. With the file illustrated in Table 2 and a minimum support of 100% given by the user, the result may be, for instance, «(a) (c) (b) (c)». Such a result, once mapped back into URLs, strengthens the discovery of a frequent behavior common to n users (with n the threshold given for the data mining process) and also gives the sequence of events composing those behaviors. Nevertheless, most methods that were designed for mining patterns from access log files cannot be applied to a data stream coming from Web usage data (such as visits). In our context, we consider that large volumes of usage data are arriving at a rapid rate. Sequences of data elements are continuously generated and we aim to identify representative behaviors. We assume that the mapping of URLs and clients as well as the data stream management are performed simultaneously.
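As a sketch of this transformation (with invented record and function names), the snippet below renumbers sessions and URLs and produces, for each client, the time-ordered list of item numbers expected by sequential pattern mining algorithms.

```python
def to_sequences(records):
    """records: iterable of (session_id, timestamp, url) tuples.
    Returns (sequences, url_ids) where sequences maps a client number to the
    list of item numbers ordered by time."""
    url_ids, client_ids, sequences = {}, {}, {}
    for session_id, timestamp, url in sorted(records, key=lambda r: (r[0], r[1])):
        client = client_ids.setdefault(session_id, len(client_ids) + 1)
        item = url_ids.setdefault(url, len(url_ids) + 1)
        sequences.setdefault(client, []).append(item)
    return sequences, url_ids

records = [("S1", 3, "/c"), ("S1", 1, "/a"), ("S2", 2, "/a"), ("S1", 2, "/b")]
seqs, urls = to_sequences(records)
print(seqs)  # {1: [1, 2, 3], 2: [1]}  (items numbered in order of first appearance)
```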
Step 4: Recommendations for Updating the Ontology via Web Mining

For supporting ontology management, we provide recommendations for updating the ontology via Web mining techniques, mainly via Web usage mining. The result of Web mining applied to usage data provides classes of pages and visits. The clustering relies on the site's structure: visit classes (also called visit profiles) are based on clusters of visited pages, and page labeling is achieved syntactically with regard to the directories corresponding to the Web sites' structure. In order to see the impact on the ontology, a matching must be done between these syntactic categories and the ontology's semantic concepts. Our updates mostly concern ontology extension, which does not completely modify the initial ontology:

• Addition of a leaf concept in a hierarchy
• Addition of a subtree of concepts in the hierarchy
• Addition of a relation between two concepts
A new concept may appear as a consequence of the emergence of a class of user visits (from the classification process). In the same way, a new relation between two concepts may be identified through the extraction of sequential patterns. The example chosen in the experimentation described in the following section allows us to illustrate the first two cases.
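To fix ideas, here is a minimal sketch of these extension operations on a toy ontology. The data structure, class and method names are illustrative assumptions, not the representation actually used (the experiment relies on an ontology edited with Protégé).

```python
class Ontology:
    def __init__(self):
        # parent -> children links plus a set of named relations (toy representation)
        self.children = {"LL_DomainConcept": []}
        self.relations = set()

    def add_leaf(self, parent, concept):
        """Addition of a leaf concept under an existing concept."""
        self.children.setdefault(parent, []).append(concept)
        self.children[concept] = []

    def add_subtree(self, parent, subtree):
        """Addition of a subtree of concepts, given as {concept: [children]}."""
        root = next(iter(subtree))
        self.add_leaf(parent, root)
        for concept, kids in subtree.items():
            for kid in kids:
                self.add_leaf(concept, kid)

    def add_relation(self, concept_a, concept_b, name):
        """Addition of a named relation between two concepts."""
        self.relations.add((concept_a, name, concept_b))

onto = Ontology()
onto.add_leaf("LL_DomainConcept", "speciality")           # suggested by the clustering results
onto.add_relation("accomodation", "maps", "viewed-with")  # suggested by pattern 2 below; relation name is illustrative
```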
ILLUSTRATION IN THE TOURISM DOMAIN

The chosen application is dedicated to the tourism domain and is based on Web usage mining to evaluate the impact of the results of usage analysis on the domain ontology.
Figure 5. The chosen tourism Web site (the general home page)
Figure 6. The chosen tourism Web site (the French home page)
Figure 7. The domain ontology (concepts shown include LL_DomainConcept, visit, group, individual, reservation, museums, addresses, language, french, italian, german, english, events, shop, leisure, restaurant, accomodation, bed-and-breakfast, hotel, camping, information, geography and maps)
Web Site Description and Ontology Building

We analyzed the Web site of the French city of Metz (http://tourisme.mairie-metz.fr/), illustrated by Figures 5 and 6. This Web site continues to develop and has been awarded several times, notably in 2003, 2004, and 2005, in the context of the "Internet City @@@@" label. This label enables local authorities to evaluate and to show the deployment of a local citizen Internet available to everyone for the general interest. This Web site contains a set of headings such as events, hotels, specialties, etc. It also contains a heading visit in which many slide shows exist. For this experiment, we used an existing ontology and modified it according to the structure of the studied Web site. This ontology was built according to the World Tourism Organization thesaurus. A thesaurus is a "controlled vocabulary arranged in a known order and structured so that equivalence, homographic, hierarchical, and associative relationships among terms are displayed clearly and identified by standardized relationship indicators" (ANSI/NISO, 1993). The main objective associated with a thesaurus is to facilitate document retrieval. This ontology, edited with Protégé 2000, is represented in Figure 7.
Preprocessing Web Logs

The log dataset used consists of 4 HTTP log files with a total of 552809 requests for page views (91 MB). These requests were made on the tourism office Web site of the city of Metz. Each log file
Table 3. Format of the requests

IP             | Date                         | Requests                               | Status | Size  | User Agent
213.151.91.186 | [25/Sep/2005:07:45:49 +0200] | "GET /francais/frameset.html HTTP/1.0" | 200    | 1074  | Mozilla 4.0
abo.wanadoo.fr | [25/Sep/2005:07:52:23 +0200] | "GET /images/backothtml.gif HTTP/1.1"  | 200    | 11588 | Firefox 1.0
corresponds to one week; together, the four files contain all the requests covering a continuous 28-day period starting at 08:00 AM on the 1st of January 2006 and ending at 06:21 AM on the 29th of January 2006. The log files contain the following six fields (cf. Table 3):

• IP address: The IP address of the computer of the user making the request
• Date: The Unix time of the request
• Request: The requested resource (page, picture, …) on the server
• Status: The HTTP status code returned to the client, such as success (200), failure, redirection, forbidden access, …
• Size: The content-length of the transferred document
• User Agent: The user agent
At the end of the preprocessing step, we grouped the 76133 remaining requests into 9898 sessions (i.e., sets of clicks from the same (IP, User Agent) couple). Each session is divided into several visits. A visit ends when an interval of at least 30 minutes separates two consecutive requests belonging to the same session. The statistical unit for the analysis is the visit, and the number of visits is equal to 11624. The number of distinct URLs requested by users is 723. For our analysis, we did not consider the status code and thus assumed that all requests had succeeded (code 200). Afterwards, we generated from the relational database (cf. Figure 4 for the schema) the dataset used for the visit clustering, illustrated in Table 4. Here we have considered a crossed table where each line corresponds to a visit and the column describes one multicategorical variable, which represents the number of pages requested during each visit. We have limited our clustering analysis to the visits whose duration is greater than 60 seconds and in which more than 10 pages were visited.
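The session-to-visit split described above (a 30-minute threshold between consecutive requests of the same session) can be sketched as follows; the record layout and function name are illustrative.

```python
from datetime import timedelta

THIRTY_MINUTES = timedelta(minutes=30)

def split_into_visits(session_requests, gap=THIRTY_MINUTES):
    """Split the time-ordered requests of one session (same IP and User Agent)
    into visits: a new visit starts whenever two consecutive requests are
    separated by more than the given gap."""
    visits, current = [], []
    previous_time = None
    for timestamp, url in sorted(session_requests):
        if previous_time is not None and timestamp - previous_time > gap:
            visits.append(current)
            current = []
        current.append((timestamp, url))
        previous_time = timestamp
    if current:
        visits.append(current)
    return visits
```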
Table 4. Quantity of pages requested during each visit

Visits    | Pages (number of requests)
V_307(14) | tourisme.mairie-metz.fr(3) francais/frameset.html(2) francais/ot.html(3), ...
…         | …
V_450(15) | anglais/frameset.html(2); anglais/geo/carte.html(1) anglais/manif/manif2006.html(1), ...
Table 5. Linguistic distribution of the visits

Language   | Number of visits | Proportion (%)
English    | 194              | 10.5
German     | 400              | 21.6
Italian    | 28               | 1.5
French     | 1220             | 65.9
Multi Lang | 10               | 0.5
Total      | 1852             | 100
We thus obtain 1852 visits (in different languages, as shown in Table 5), 375 visited pages and 42238 requests on these pages.
Usage Analysis Results

Clustering Results

The confusion table (cf. Table 6) reports the results obtained after applying the crossed clustering method, specifying 11 classes of visits and 10 classes of pages. The language is very important; the set of English pages is split into three classes (PEd_1, PE_2 and PE_3), the set of German pages into two
Figure 8. Page groups and visit classes
Table 6. Confusion table

Partition  | PEd_1 | PE_2 | PE_3 | PGd_1 | PG_2 | PI_1 | PF_1 | PFd_2 | PF_3 | PS_1  | Total
V_1 (10)   | 0     | 28   | 0    | 0     | 32   | 44   | 8    | 0     | 7    | 74    | 193
VE_1 (142) | 76    | 1992 | 34   | 0     | 38   | 1    | 16   | 0     | 7    | 289   | 2453
VE_2 (52)  | 1464  | 606  | 68   | 0     | 1    | 1    | 2    | 4     | 22   | 136   | 2304
VG_1 (37)  | 0     | 0    | 0    | 566   | 1496 | 0    | 14   | 0     | 1    | 68    | 2145
VG_2 (363) | 11    | 48   | 3    | 36    | 7278 | 2    | 114  | 4     | 10   | 669   | 8175
VI_1 (28)  | 0     | 7    | 0    | 0     | 0    | 279  | 0    | 0     | 1    | 33    | 320
VF_1 (205) | 0     | 13   | 0    | 0     | 8    | 11   | 218  | 6178  | 578  | 2195  | 9201
VF_2 (178) | 0     | 25   | 0    | 0     | 15   | 2    | 848  | 100   | 643  | 1803  | 3436
VF_3 (205) | 0     | 17   | 0    | 0     | 31   | 0    | 1553 | 12    | 116  | 1527  | 3256
VF_4 (320) | 0     | 41   | 1    | 0     | 30   | 0    | 154  | 105   | 2122 | 3067  | 5520
VF_5 (312) | 0     | 60   | 1    | 0     | 65   | 7    | 83   | 232   | 529  | 4258  | 5235
Total      | 1551  | 2837 | 107  | 602   | 8994 | 347  | 3010 | 6635  | 4036 | 14119 | 42238
classes (PGd_1 and PG_2), Italian into one (PI_1) and the set of French pages into three classes (PF_1, PFd_2 and PF_3); one class (PS_1) represents the set of organization pages (frame pages and home page) and the set of pages which are used in all languages (addresses of hotels, restaurants). It should be noted that the percentage of the clicks on the pages of English language is 10.6% and on the pages of German language is 22.7%. This proportion is the same for the entire set of visits. It
Figure 9. Factorial analysis between visits and visited pages
Figure 10. Relation between page classes and visit classes
should be noted that the class of the slideshows exists in all the languages. The group of visits VE_2 contains 52 visits, which contain mostly clicks on English pages (2138 out of 2304 clicks during these visits, accounting for 92.8% of the visited pages). We obtain similar results with the groups of visits of German language. For all the languages, the average number of clicks per visit on the slideshows (labeled "d" for "diaporama" in French) for the three corresponding groups of visits (PEd_1, PGd_1 and PFd_2) is identical (approximately 50 clicks per visit) and is much higher than the average calculated for the other groups of visits (between 15 and 20 clicks per visit). 42238 is the total number of clicks made by the set of selected visits. The group V_1 contains 10 visits using multiple languages and 28 clicks on the pages belonging to the page class PE_2. PEd_1 contains all the pages of English language associated with the slideshows of the site, and PE_2 contains the other pages of English language except some special pages, which belong to group PE_3 (these pages represent the activity of the "jewels"). We have a similar result for the German language, but PGd_1 contains only the pages of the slideshows that give a general vision of the town of Metz. PFd_2 contains the French pages of the slideshows. The first group (PF_1) of the pages of French language contains the pages related to accommodation and restaurants. PF_3 contains the pages related to the two concepts leisure (URLs francais/loisirs/(velocation, shop, etc.).html) and cultural or festive events (URLs francais/manif/(manif2006, manif2005).html), and also pages related to the different regional specialties of the city (URLs francais/specialites/(tartes, caramel, luth, ...).html). Group PS_1 contains the structuring pages (banner page, pages related to the frames) and the whole set of pages employed by the visitors (international hotel). The factorial analysis (cf. Figure 9) shows that the visits associated with the pages related to the slideshows ('diaporama' in French) are at the border of the first factorial plane (PGd_1, PFd_2 and PEd_1) and that the three directions represent the three languages (PGd_1 and PG_2 for the set of German pages, PEd_1, PE_2 and PE_3 for the English pages, and PF_1, PFd_2 and PF_3 for the French pages). Figure 10 shows the third axis, related to the direction associated with the Italian pages (PI_1). We can note that the V_1 group includes the visits which contain pages in different languages. Figures 9 and 10 show that the groups of visits are very well separated according to the language. The pages in French can be extracted for a specific analysis, for example to study how the viewing of the slideshows can be connected to other activities (reservation of cultural activities, hotels and restaurants).
Sequential Pattern Extraction Results

After the preprocessing step, the log file of the Metz city Web site contains 723 URLs and 11624 visit sequences. These sequences have an average length of 6.5 itemsets (requested URLs). The extracted sequences reflect the frequent behaviors of the users connected to the site. They have been obtained from the whole log file (including all the resources in all languages) with different minimum supports. We report in this section a few sequential patterns extracted from the log file:

Pattern 1:
http://tourisme.mairie-metz.fr/
http://tourisme.mairie-metz.fr/francais/frameset.html
http://tourisme.mairie-metz.fr/francais/resto/resto.html
http://tourisme.mairie-metz.fr/francais/resto/touresto.html
Support: 0.0238577
This behavior has a support of 2.38%, which means that it corresponds to 277 users of the Web site. These users are likely to be interested in finding a restaurant in Metz.

Pattern 2:
http://tourisme.mairie-metz.fr/francais/frameset.html
http://tourisme.mairie-metz.fr/francais/hebergement/heberg.html
http://tourisme.mairie-metz.fr/francais/geo/carte.html
Support: 0.0200722

This behavior has a support of 2%. Such a behavior corresponds to users interested first in lodging ("hebergement" in French) and afterwards in a map ("carte" in French).
Results for Supporting the Update of the Ontology

The interpretation of the usage analysis results from clustering and sequential pattern mining allows us to make suggestions in order to support ontology management and, more precisely, to update the ontology, as explained in this subsection. Suggestion to add a concept derived from usage analysis (gathering several existing concepts): We observed that pages from the two concepts leisure and events and from regional specialties are grouped
Figure 11. Updated ontology (the domain ontology of Figure 7 with a new concept speciality added; the other concepts shown include visit, group, individual, reservation, museums, addresses, language, french, italian, german, english, events, shop, leisure, restaurant, accomodation, bed-and-breakfast, hotel, camping, information, geography and maps under LL_DomainConcept)
together (PF_3 group). First, the concept of specialty does not exist in our domain ontology and could be added, as shown in Figure 11. Moreover, as these notions seem to be frequently associated during visits, two evolutions of the ontology are possible:

• Relationships could be added between these three ontological concepts.
• These three concepts could become specializations of a new, more general, concept.
Let us note that there are many pages related to events (francais/specialites/fete.html, francais/specialites/galerie.html) which are stored in the directory specialites. Our goal is to suggest possible ontology evolutions consistent with usage analysis, but a validation by an expert is required to make a choice between the different propositions. We also note that the slide shows ("diaporama" in French) represent 50% of the pages from the «Visit» category, and about one third of the pages globally visited on the site. The addition of a Slide show concept in the domain ontology could thus be suggested. Suggestion to add a relation between concepts: The sequential pattern mining suggests that a relation exists between accommodation and geography/maps, due to the extraction of various patterns linking a page of the concept accommodation and a page of the concept maps, such as the frequent sequential pattern 2 previously described. This relationship extracted from usage analysis does not exist as a hyperlink on the Web site.
CONCLUSION

Ontology construction and evolution require the extraction of knowledge from heterogeneous sources. In the case of the Semantic Web, the knowledge extraction is often done from the content of a set of Web pages dedicated to a particular domain. In this chapter, we considered another kind of pattern, as we focused on Web usage mining. Web usage mining extracts visit patterns from Web log files and can also extract information about the Web site structure and visit profiles. Among Web usage mining applications, we can point out personalization, modification and improvement of Web pages, and detailed description of Web site usage. In this chapter, we attempt to show the potential impact of Web usage mining on ontology updating. We illustrate such an impact in the tourism domain by considering the Web site of a French local authority; we start from a domain ontology—obtained through the adaptation of an existing ontology to the structure of the actual Web site. Then, we apply different Web usage mining techniques to the log files generated from this site, in particular clustering and sequential pattern mining methods. Web usage mining provides relevant information to users and it is therefore a very powerful tool for information retrieval. As we mentioned in the previous section, Web usage mining can also be used to support the modification of the Web site structure or to give some recommendations to visitors. Web mining can be useful to add semantic annotations (ontologies) to Web documents and to populate these ontological structures. In the experiment presented in this chapter, the domain ontology was
manually built and we used usage mining to update it. In the future, we will combine Web content and structure mining for a semiautomatic construction of the domain ontology. Regarding the ontology evolution, we will still mainly exploit usage mining but we will also use content mining to confirm suggested updates. This combination of Web content and usage mining could allow us to build ontologies according to Web pages content and to refine them with behavior patterns extracted from log files. Our ultimate goal is to mine all kinds of patterns—content, structure and usage—to perform ontology construction and evolution.
Future Research Directions

Many future research directions related to ontology management are possible. Considering ontology management as a complex and interactive process, we mention here two important directions: Semantic Web mining and Semantic Web visualization. First, Semantic Web mining aims to combine the two areas of Semantic Web and Web mining by using semantics to improve mining and using mining to create semantics (Berendt et al., 2005). More work is needed to realize such a convergence. See Stumme et al. (2006) for an interesting survey and future directions in that area. In some domains consensual knowledge already exists and could be used for improving Web mining. But in many other domains it might be necessary to start from data and design the first version of the ontology via ontology learning/construction. To achieve such learning, it is useful to combine Web usage mining with content and structure analysis in order to give sense to the observed extracted user behavior. New WUM approaches based on structure mining are needed to improve the recommendations supporting the evolution of ontologies. Another way to provide more accurate results is to involve users in the mining process, which is the goal of visual data mining. Ontologies (and other semantic formalisms, which create semantic graphs) are very powerful but they may be complex. Intuitive visual user interfaces may significantly reduce the cognitive load of users when working with these complex structures. Visualization is a promising technique for both enhancing users' perception of structure in large information spaces and providing navigation facilities. According to Gershon and Eick (1995), it also enables people to use a natural tool of observation and processing—their eyes as well as their brain—to extract knowledge more efficiently and find insights. The goal of semantic graph visualization is to help users locate relevant information quickly and explore the structure easily. Thus, there are two kinds of requirements for semantic graph visualization: representation and navigation. A good representation helps users identify interesting spots, whereas an efficient navigation is essential to access information rapidly. We both need to understand the structure of metadata and to locate relevant information easily. A study of representation and navigation metaphors for Semantic Web visualization has been presented by Le Grand and Soto (2005), where the semantic relationships between concepts appear on the display, graphically or textually. Many open research issues remain in the domain of Semantic Web visualization; in particular, evaluation criteria must be defined in order to compare the various existing approaches. Moreover, scalability must be addressed, as most current visualization tools can only represent a limited volume of data. To conclude, we strongly believe in the potential of multi-disciplinary approaches in Web mining for supporting ontology management.
Acknowledgment

The authors want to thank Mr. Rausch, Mr. Hector, and Mr. Hoffmann from the French city of Metz for making their log files available to us. We also thank Mr. Mercier and Mr. Vialle for helping us to make contacts with the French city of Metz. The authors also want to thank Doru Tanasa for his support in the preprocessing step of the tourism logs, and Alex Thibau and Sophie Honnarat for their helpful support.
REFERENCES Agirre, E., Ansa, O., Hovy, E., & Martinez, D. (2000). Enriching very large ontologies using the WWW. In Proceedings of ECAI Workshop on Ontology Learning. Agrawal, R., & Srikant, R. (1995). Mining Sequential Patterns. In Proceedings of the 11th International Conference on Data Engineering (pp. 3-14). ANSI/NISO. (1993). Guidelines for the construction, format, and management of monolingual thesauri. National Information Standards Organization. Aussenac-Gilles, N., Biébow, B., & Szulman, S. (2002). Revisiting ontology design: A methodology based on corpus analysis. In Proceedings of the 12th International Conference in Knowledge Engineering and Knowledge Management (EKAW), Juan-Les-Pins, France. AxIS. (2005). 2005 AxIS research project activity report. Section ‘Overall Objectives.’ http://www. inria.fr/rapportsactivite/RA2005/axis/axis_tf.html Mustapha, N., Aufaure, M-A., & Baazhaoui-Zghal, H. (2006). Towards an architecture of ontological components for the semantic Web. In Proceedings of Wism (Web Information Systems Modeling) Workshop, CAiSE 2006, Luxembourg (pp. 22-35). Berendt, B., Hotho, A., & Stumme. G. (2002). Towards Semantic Web mining. In Proceedings of the First International Semantic Web Conference on the Semantic Web (pp. 264-278). Springer. Berendt, B., Hotho, A., & Stumme. G. (2005, September 15-16). Semantic Web mining and the representation, analysis, and evolution of Web space. In Proceedings of RAWS’2005—Workshop on the Representation and Analysis of Web Space, Prague-Tocna. Bock, H.H. (1993). Classification and clustering: Problems for the future. In E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, & B. Burtschy, (Eds.), New approaches in classification and data analysis (pp. 3-24). Springer, Heidelberg. Bonchi, F., Giannotti, F., Gozzi, C., Manco, G., Nanni, M., Pedreschi, D., et al. (2001). Web log data warehousing and mining for intelligent Web caching. Data Knowledge Engineering, 39(2), 165-189. Castano, S., Ferrara, A., & Montanelli., S. (2006). A matchmaking-based ontology evolution methodology. In Proceedings of the 3rd CAiSE INTEROP Workshop on Enterprise Modelling and Ontologies for Interoperability (EMOI - INTEROP 2006), Luxembourg.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., et al. (2000). Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118(1-2), 69-113. Davulcu, H., Vadrevu, S., & Nagarajan, S. (2003). OntoMiner: Bootstrapping and populating ontologies from domain specific websites. In Proceedings of the First International Workshop on Semantic Web and Databases (SWDB 2003), Berlin. Decker, S., Erdmann, M., Fensel, D., & Studer, R. (1999). Ontobroker: Ontology based access to distributed and semistructured information. In Semantic Issues in Multimedia Systems, Proceedings of DS-8 (pp. 351-369). Boston: Kluwer Academic Publisher. Deitel, A.C., Faron, C. & Dieng, R. (2001). Learning ontologies from RDF annotations. In Proceedings of the IJCAI’01 Workshop on Ontology Learning, Seattle, WA. Diday, E. (1975). La méthode des nuées dynamiques. Revue de Statistique Appliquée, 19(2), 19-34. Euzenat, J. (1995). Building consensual knowledge bases: Context and architecture. In Proceedings of 2nd International Conference on Building and Sharing Very Large-Scale Knowledge Bases. Enschede, Amsterdam: IOS Press. Faatz, A., & Steinmetz, R. (2002). Ontology enrichment with texts from the WWW. Semantic Web Mining 2nd Workshop at ECML/PKDD-2002.Helsinki, Finland. Flouris, G. (2006). On belief change and ontology evolution.Doctoral Dissertation, Department of Computer Science, University of Crete. Flouris, G., & Plexousakis, D.G. (2006). Evolving ontology evolution, Invited Talk. In Proceedings of the 32nd International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 06) (p. 7). Merin, Czech Republic. Fu, Y., Sandhu, K., & Shih, M. (2000). A generalization-based approach to clustering of Web usage sessions. In Proceedings of the 1999 KDD Workshop on Web Mining (Vol. 1836, pp. 21-38). San Diego, CA: Springer-Verlag. Gerson, N., & Eick, S.G. (1995). Visualisation’s new tack: Making sense of information. IEEE Spectrum, 38-56. Gòmez-Pérez, A., & Rojas, M.D. (1999). Ontological reengineering and reuse. In D. Fensel & R. Studer (Ed.), European Workshop on Knowledge Acquisition, Modeling and Management (EKAW). Lecture Notes in Artificial Intelligence LNAI 1621 (pp. 139-156). Springer-Verlag. Gordon, A.D. (1981). Classification: Methods for the exploratory analysis of multivariate data. London: Chapman & Hall. Govaert, G. (1977). Algorithme de classification d’un tableau de contingence. In Proceedings of first international symposium on Data Analysis and Informatics (pp. 487-500). INRIA, Versailles. Govaert, G., & Nadif, M. (2003). Clustering with block mixture models. Pattern recognition. Elservier Science Publishers, 36, 463-473.
Gruber, T. (1993). Toward principles for the design of ontologies used for knowledge sharing. In N. Guarino & R. Poli, (Eds.), International Journal of Human-Computer Studies, special issue on Formal Ontology in Conceptual Analysis and Knowledge Representation, LADSEB-CNR Int. Rep. ACM. Grüninger, M., & Fox, M.S. (1995). Methodology for the design and evaluation of ontologies. IJCAI’95 Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, Canada. Guarino, N. (1998). Formal Ontology in Information Systems. Guarino (Ed.), First International Conference on Formal Ontology in Information Systems (pp. 3-15). Italy. Jannink, J. (1999). Thesaurus entry extraction from an on-line dictionary. In Proceedings of Fusion 99, Sunnyvale CA. Junker, M., Sintek, M., & Rinck, M. (1999). Learning for Text Categorization and Information Extraction with ILP. In J. Cussens (Eds.), Proceedings of the 1st Workshop on Learning Language in Logic (pp. 84-93). Bled: Slovenia. Karoui, L., Aufaure, M.-A., & Bennacer, N. (2004). Ontology discovery from Web pages: Application to tourism. Workshop on Knowledge Discovery and Ontologies (KDO), co-located with ECML/PKDD, Pisa, Italy, pp. 115-120. Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. SIGKDD Explorations: Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2(1), 1-5. Le Grand, B., & Soto, M. (2005). Topic Maps, RDF Graphs and Ontologies Visualization. In V. Geromienko, & C. Chen (Eds.), Visualizing the Semantic Web (2nd ed). Springer. Masseglia, F., Poncelet, P., & Cicchetti, R. (1999). An efficient algorithm for Web usage mining. Networking and Information Systems Journal (NIS), 2(5-6), 571-603. Masseglia, F., Tanasa, D., & Trousse, B. (2004). Web usage mining: Sequential pattern extraction with a very low support. In Advanced Web Technologies and Applications: 6th Asia-Pacific Web Conference, APWeb 2004, vol. 3007 (pp. 513-522). Hangzhou, China: Springer-Verlag. Nakagawa, M., & Mobasher, B. (2003). Impact of site characteristics on recommendation models based on association rules and sequential patterns. In Proceedings of the IJCAI’03 Workshop on Intelligent Techniques for Web Personalization, Acapulco, Mexico. Navigli, R., & Velardi, P. (2004). Learning domain ontologies from document warehouses and dedicated Web sites. Computational Linguistics, 30(2), 151-179. Noy, N.F., & Klein, M. (2004). Ontology evolution: Not the same as schema evolution. Knowledge and Information Systems, 6(4), 428-440. Papatheodrou, C., Vassiliou, A., & Simon, B. (2002). C. Papatheodrou, Discovery of ontologies for learning resources using word-based clustering. In Proceedings of ED MEDIA 2002, Denver. Rahm, E., & Bernstein, P. (2001). A survey of approaches to automatic schema matching, The VLDB Journal, 334-350.
Rubin, D.L., Hewett, M., Oliver, D.E., Klein, T.E., & Altman, R.B. (2002). Automatic data acquisition into ontologies from pharmacogenetics relational data sources using declarative object definitions and XML. In Proceedings of the Pacific Symposium on Biology, Lihue, HI. Sanchez, D., & Moreno, A. (2004). Automatic generation of taxonomies from the WWW. In Proceedings of the 5th International Conference on Practical Aspects of Knowledge Management (PAKM 2004). LNAI, Vol. 3336 (pp. 208-219). Vienna, Austria. Sauberlich, F., & Huber, K.-P. (2001). A framework for Web usage mining on anonymous logfile data. In Schwaiger M. & O. Opitz (Eds.), Exploratory data analysis in empirical research (pp. 309-318). Heidelberg: Springer-Verlag. Spiliopoulou, M., Faulstich, L.C., & Winkler, K. (1999). A data miner analyzing the navigational behaviour of Web users. In Proceedings of the Workshop on Machine Learning in User Modeling of the ACAI’99 Int. Conf., Creta, Greece. Spyropoulos, CD., Paliouras, G., & Karkaletsis, V. (2005, November 30- December 1). BOEMIE: Bootstrapping ontology evolution with multimedia information extraction. 2nd European Workshop on the integration of knowledge, Semantic and Digital Media Technologies, London. Staab, S., Schnurr, H.-P., Studer, R., & Sure, Y. (2001). Knowledge processes and ontologies. IEEE Intelligent Systems Special Issue on Knowledge Management, January/February, 16(1). Stojanovic, L., Stojanovic, N., & Volz, R. (2002). Migrating data-intensive Web sites into the semantic Web. In Proceedings of the 17th ACM symposium on applied computing (SAC). ACM Press. Stumme, G., Hotho, A., & Berendt, B. (2006). Semantic Web mining: State of the art and future directions. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 4(2), 124-143. Suryanto, H., & Compton, P. (2001). Discovery of ontologies from knowledge bases. In Proceedings of the 1st International Conference on Knowledge Capture, the Association for Computing Machinery (pp. 171-178). New York. Tanasa, D. (2005). Web usage mining: Contributions to intersites logs preprocessing and sequential pattern extraction with low support. PhD thesis, University of Nice Sophia Antipolis. Tanasa, D., & Trousse, B. (2004). Advanced data preprocessing for intersites Web usage mining. IEEE Intelligent Systems, 19(2) 59-65. Trousse, B., Jaczynski, M. & Kanawati, R. (1999). Using user behavior similarity for recommandation computation: The broadway approach.In Proceedings of 8th International Conference on Human Computer Interaction (HCI’99) (pp. 85-89). Munich:Lawrence Erlbaum. Verde, R., & Lechevallier, Y. (2003). Crossed Clustering method on Symbolic Data tables. In M. Vichi, P. Monari, S. Migneni, & A. Montanari, (Eds.), New developments in classification, and data analysis (pp. 87-96). Heidelberg: Springer-Verlag. Volz, R., Oberle, D., Staab, S., & Studer, R. (2003). OntoLiFT Prototype. IST Project 2001-33052 WonderWeb Deliverable.
Yuefeng, L., & Ning, Z. (2006). Mining ontology for automatically acquiring Web user information needs. IEEE Trans. Knowl. Data Eng., 18(4), 554-568. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). Birch: An efficient data clustering method for very large databases. In H.V.Jagadish, & I.S. Mumick (Ed.), Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (pp. 103-114). Montreal, Quebec, Canada: ACM Press. Zhu, J., Hong, J., & Hughes, J.G. (2002). Using Markov Chains for link prediction in adaptive Web sites. In Proceedings of Soft-Ware 2002: First International Conferance on Computing in an Imperfect World (pp. 60-73). Belfast, UK.
ADDITIONAL READING Garboni, C., Masseglia, F., & Trousse, B. (2006). A flexible structured-based representation for XML document mining. In Advances in XML information retrieval and evaluation, 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2005, Vol. 3977/2006:458-468 of LNCS. Springer Berlin / Heidelberg, Dagstuhl Cstle, Germany, 28 June. Haase, P., & Stojanovic, L. (2005). Consistent evolution of OWL ontologies. In Proceedings of the Second European Semantic Web Conference (ESWC 05), vol. 3532 of Lecture Notes in Computer Science, pp. 182-197. Mikroyannidis, A., & Theodoulidis, B. (2006). Heraclitus II: A framework for ontology management and evolution. In Proceedings of 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI’06) (pp. 514-521). Plessers, P., & De Troyer, O. (2006) Resolving inconsistencies in evolving ontologies. In Y. Sure & J. Domingue (Eds.), Proceedings of the Third European Semantic Web Conference. ESWC 2006 (pp. 200-214). Stojanovic, L., Maedche, A., Motik, B., & Stojanovic, N. (2004). User-driven ontology evolution management. In Proceedings of the Thirteenth European Conference on Knowledge Engineering and Knowledge Management EKAW (pp. 200-214). Springer Verlag. Stojanovic, L. (2004). Methods and tools for ontology evolution. PhD Thesis, University of Karlsruhe, Germany. Ee-Peng Lim and Aixin Sun (2005). Web mining: The ontology approach. In Proceedings of the International Advanced Digital Library Conference (IADLC), Nagoya, Japan (invited paper).
This work was previously published in An Overview of Knowledge Management, edited by H. O. Nigro; S. Gonzalez Cisaro; D. Xodo, pp. 37-64, copyright 2008 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
447
448
Chapter XXVII
A Lattice-Based Framework for Interactively and Incrementally Mining Web Traversal Patterns
Yue-Shi Lee, Ming Chuan University, Taiwan, ROC
Show-Jane Yen, Ming Chuan University, Taiwan, ROC
Abstract
Web mining applies data mining techniques to large amounts of Web data in order to improve Web services. Web traversal pattern mining discovers most users' access patterns from Web logs. This information can provide navigation suggestions to Web users so that appropriate actions can be taken. However, Web data grow rapidly within a short time, and some of the data may become antiquated; user behavior may change when new Web data are inserted into, and old Web data are deleted from, the Web logs. Besides, it is very difficult to select a perfect minimum support threshold during the mining process to find the interesting rules; even experienced experts cannot determine an appropriate minimum support. Thus, the minimum support must be adjusted repeatedly until satisfactory mining results are found. The essence of incremental and interactive data mining is to reuse previous mining results to avoid unnecessary work when the minimum support is changed or the Web logs are updated. In this chapter, we propose efficient incremental and interactive data mining algorithms to discover Web traversal patterns and make the mining results satisfy users' requirements. The experimental results show that our algorithms are more efficient than other approaches.
Introduction
With the growth of information technology, huge amounts of data are easily produced and collected from the electronic commerce environment every day, which causes the Web data in the database to grow at an amazing speed. Hence, how to obtain useful information and knowledge efficiently from these huge amounts of Web data has become an important issue. Web mining (Chen, Park, & Yu, 1998; Chen, Huang, & Lin, 1999; Cooley, Mobasher, & Srivastava, 1997; EL-Sayed, Ruiz, & Rundensteiner, 2004; Lee, Yen, Tu, & Hsieh, 2003, 2004; Pei, Han, Mortazavi-Asl, & Zhu, 2000; Yen, 2003; Yen & Lee, 2006) refers to extracting useful information and knowledge from Web data; it applies data mining techniques (Chen, 2005; Ngan, 2005; Xiao, 2005) to large amounts of Web data to improve Web services. Mining Web traversal patterns (Lee et al., 2003, 2004; Yen, 2003) discovers most users' access patterns from Web logs. These patterns can not only be used to improve the Web site design (e.g., providing efficient access between highly correlated objects and better authoring design for Web pages), but can also lead to better marketing decisions (e.g., placing advertisements in proper places, better customer classification, and behavior analysis). In the following, we give the definitions related to Web traversal patterns.
Let I = {x1, x2, …, xn} be the set of all Web pages in a Web site. A traversal sequence S = ⟨w1, w2, …, wm⟩ (wi ∈ I, 1 ≤ i ≤ m) is a list of Web pages ordered by traversal time, in which a Web page can appear repeatedly; that is, backward references are also included in a traversal sequence. For example, if a user visits one Web page, then goes to two further Web pages sequentially, comes back to the first page, and finally visits another page, the resulting list of pages is a traversal sequence. The length of a traversal sequence S is the total number of Web pages in S, and a traversal sequence of length l is called an l-traversal sequence. For example, a traversal sequence a that contains six Web pages has length 6 and is called a 6-traversal sequence. Suppose a = ⟨a1, a2, …, am⟩ and b = ⟨b1, b2, …, bn⟩ (m ≤ n) are two traversal sequences. If there exist indices i1 < i2 < … < im such that bi1 = a1, bi2 = a2, …, bim = am, then b contains a, a is a sub-sequence of b, and b is a super-sequence of a. For instance, any sequence obtained by deleting pages from b while preserving the order of the remaining pages is a sub-sequence of b. A traversal sequence database D, as shown in Table 1, contains a set of records; each record includes a traversal identifier (TID) and a user sequence. A user sequence is a traversal sequence that stands for the complete browsing behavior of one user. The support of a traversal sequence a, denoted Support(a), is the ratio of the number of user sequences containing a to the total number of user sequences in D; the support count of a is the number of user sequences that contain a. For a traversal sequence ⟨x1, x2, …, xl⟩, if there is a link from xi to xi+1 (for all i, 1 ≤ i ≤ l-1) in the Web site structure, then the traversal sequence is a qualified traversal sequence. A traversal sequence a is a Web traversal pattern if a is a qualified traversal sequence and Support(a) ≥ min_sup, in which min_sup is the user-specified minimum support threshold. For instance, in Table 1, if we set min_sup to 80%, then Support(⟨A, B⟩) = 4/5 = 80% ≥ min_sup = 80%, and there is a link from "A" to "B" in the Web site structure shown in Figure 1.
Hence, ⟨A, B⟩ is a Web traversal pattern. If the length of a Web traversal pattern is l, it is called an l-Web traversal pattern. However, the user sequences grow rapidly and some of them may become antiquated. The Web traversal patterns may change when new user sequences are inserted into, and old user sequences are deleted from, the traversal sequence database. Therefore, the Web traversal patterns must be re-discovered from the updated database.
Table 1. Traversal sequence database

TID   User sequence
1     ABCED
2     ABCD
3     CDEAD
4     CDEAB
5     CDAB
6     ABDC
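To make the containment and support definitions above concrete, the following short Python sketch (ours, not part of the original chapter; the function names are illustrative) checks sub-sequence containment and computes supports over a small database such as the one in Table 1.

```python
def contains(user_seq, pattern):
    """True if pattern is a sub-sequence of user_seq (order preserved, gaps allowed)."""
    it = iter(user_seq)
    return all(page in it for page in pattern)

def support(db, pattern):
    """Ratio of user sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db) / len(db)

# User sequences from Table 1, with each page written as a single character.
db = ["ABCED", "ABCD", "CDEAD", "CDEAB", "CDAB", "ABDC"]
print(support(db, "AB"))    # support of the traversal sequence ⟨A, B⟩
print(support(db, "CDA"))   # support of ⟨C, D, A⟩
```

A traversal sequence is additionally a Web traversal pattern only when every consecutive pair of its pages is linked in the Web site structure and its support reaches min_sup.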
For example, suppose a new movie "Star Wars" arrives at a Web site that sells DVD movies; the users may rent or buy the new movie from the Web site and may shift their interest toward science-fiction movies. That is, the user behaviors may change. Therefore, if we do not re-discover the Web traversal patterns from the updated database, some of the new information (about science-fiction movies) will be lost. However, re-finding the Web traversal patterns from scratch is very time-consuming, so an incremental mining method is needed to avoid re-mining the entire database. Besides, all the Web traversal patterns are determined by min_sup, so it is very important to set an appropriate value. If min_sup is set too high, not enough information will be found; if it is set too low, unimportant information may be found and a lot of time will be wasted finding it. However, it is very difficult to select a perfect minimum support threshold for finding interesting rules; even experienced experts cannot determine an appropriate value. Therefore, the minimum support must be adjusted repeatedly until satisfactory results are found, and these repeated mining processes are very time-consuming. In order to find an appropriate minimum support threshold, an interactive mining scheme is needed. In this chapter, we use a uniform framework and propose two novel algorithms: the incremental Web traversal pattern mining algorithm IncWTP and the interactive Web traversal pattern mining algorithm IntWTP, which find all the Web traversal patterns when the database is updated and when min_sup is changed, respectively. If the database is updated and the minimum support is changed simultaneously, the two algorithms can be executed successively. Both algorithms utilize the previous mining results to find new Web traversal patterns so that the mining time can be reduced. Therefore, choosing a storage structure for the previous mining results becomes very important; in this chapter, a lattice structure is selected as the storage structure. Besides utilizing the previous mining results, we also use the Web site structure to reduce mining time and storage space.
Figure 1. Web site structure
The rest of this chapter is organized as follows. Section 2 introduces the most recent research related to this work. Section 3 describes the data structure for mining Web traversal patterns incrementally and interactively. Section 4 proposes our incremental Web traversal pattern mining algorithm. The interactive Web traversal pattern mining algorithm is presented in Section 5. Because our approach is the first work on the maintenance of Web traversal patterns, we evaluate our algorithms by comparing them with the Web traversal pattern mining algorithm MFTP (Yen, 2003) in Section 6. Finally, we conclude our work and present some future research directions in Section 7.
Related Work
Path traversal pattern mining (Chen et al., 1998, 1999; Pei et al., 2000; Yen et al., 2006) is the technique that finds the navigation behaviors of most users in the Web environment. The Web site designer can use this information to improve the Web site design (Sato, Ohtaguro, Nakashima, & Ito, 2005; Velásquez, Ríos, Bassi, Yasuda, & Aoki, 2005) and to increase the Web site performance. Many researchers have focused on this field, e.g., the FS (full scan) and SS (selective scan) algorithms (Chen et al., 1998) and the MAFTP (maintenance of frequent traversal patterns) algorithm (Yen et al., 2006). Nevertheless, these algorithms have the limitation that they can only discover simple path traversal patterns, in which no page is repeated; that is, there is no backward reference in the pattern, and the support of the pattern is no less than the minimum support threshold. These algorithms only consider the forward references in the traversal sequence database. Hence, the simple path traversal patterns discovered by the above algorithms do not fit the real Web environment. Besides, the FS and SS algorithms must rediscover the simple path traversal patterns from the entire database when the minimum support is changed or the database is updated. The MAFTP algorithm (Yen et al., 2006) is an incremental updating technique that maintains the discovered path traversal patterns when user sequences are inserted into or deleted from the database. MAFTP partitions the database into segments and scans the database segment by segment; for each segment scan, the candidate traversal sequences that cannot be frequent are pruned, and the frequent traversal sequences can be found earlier. However, MAFTP cannot deal with backward references and needs to re-mine the simple path traversal patterns from the original database when the minimum support is changed. Our approach can discover non-simple path traversal patterns, that is, both forward and backward references are considered. Besides, for our algorithms, only a small number of candidate traversal sequences need to be counted from the original traversal sequence database when the database is updated or the minimum support is changed. A non-simple path traversal pattern (i.e., a Web traversal pattern) contains not only forward references but also backward references, and therefore presents user navigation behaviors completely and correctly. The related works are the MFTP (mining frequent traversal patterns) algorithm (Yen, 2003), the IPA (integrating path traversal patterns and association rules) algorithm (Lee et al., 2003, 2004), and the FS-miner algorithm (EL-Sayed, 2004). The MFTP algorithm discovers Web traversal patterns from a traversal sequence database and considers both forward and backward references. Unfortunately, MFTP must rediscover the Web traversal patterns from the entire database when the minimum support is changed or the database is updated. Our approach discovers Web traversal patterns while considering both database insertion and deletion. Besides, our approach
can use the discovered information to avoid re-mining the entire database when the minimum support is changed. The IPA algorithm can discover not only Web traversal patterns but also user purchase behaviors, and it considers the Web site structure to avoid generating unqualified traversal sequences. Nevertheless, IPA does not consider incremental and interactive situations; it must rediscover the Web traversal patterns from the entire database when the minimum support is changed or the database is updated. The FS-miner algorithm discovers Web traversal patterns from the traversal sequence database. It scans the database twice to build an FS-tree (frequent sequences tree structure) and then discovers Web traversal patterns from the FS-tree. However, the FS-tree may be too large to fit into memory. Besides, FS-miner finds the consecutive reference sequences traversed by a sufficient number of users, that is, it only considers the consecutive reference sub-sequences of the user sequences. However, there may be some noise in a user sequence, that is, some pages in a user sequence may not be pages that the user really wants to visit. If all sub-sequences of a user sequence had to be considered, FS-miner could not work; hence, some important Web traversal patterns may be lost by the FS-miner algorithm. Besides, FS-miner needs a system-defined minimum support, according to which the FS-tree is constructed. For interactive mining, the user-specified minimum support must be no less than the system-defined minimum support; otherwise FS-miner cannot work. If the system-defined minimum support is too small, the constructed FS-tree will be very large and hard to maintain; if it is too large, users cannot set a smaller minimum support than it, and the range for setting the user-specified minimum support is rather restricted. Hence, it is difficult to apply FS-miner to incremental and interactive mining. For our approach, all the sub-sequences of a user sequence are considered, that is, the noise in a user sequence can be ignored. Besides, there is no restriction on setting the user-specified minimum support; users can set any value as the minimum support threshold for our algorithms. Furthermore, because our algorithm discovers the Web traversal patterns level by level in the lattice structure, memory will not be exhausted, since only one level of the lattice structure is loaded into memory at a time. Sequential pattern mining (Cheng, Yan, & Han, 2004; Lin & Lee, 2002; Parthasarathy, Zaki, Ogihara, & Dwarkadas, 1999; Pei et al., 2001; Pei et al., 2004) is also similar to Web traversal pattern mining; it discovers sequential patterns from a customer sequence database. The biggest difference between Web traversal patterns and sequential patterns is that a Web traversal pattern considers the links between Web pages in the Web site structure, that is, there must be a link from each page to the next page in a Web traversal pattern. Parthasarathy et al. (1999) proposed an incremental sequential pattern mining algorithm, ISL (incremental sequence lattice), which updates the lattice structure when the database is updated.
The lattice structure keeps all the sequential patterns and candidate sequences together with their support counts, so that only newly generated candidate sequences need to be counted from the original database and the mining efficiency can be improved. The candidate sequences whose support count is 0 are also kept in the lattice, which may cause the lattice structure to become too large to fit into memory. Another incremental sequential pattern mining algorithm is IncSpan (incremental mining of sequential patterns), proposed by Cheng et al. (2004). This algorithm is based on the PrefixSpan (prefix-projected sequential pattern mining) algorithm (Pei et al., 2001; Pei et al., 2004); IncSpan uses the concept of a projected database to recursively mine the sequential patterns. However, the ISL and IncSpan algorithms cannot deal with the situation in which new user sequences are inserted into the customer
sequence database; they only considered inserting transactions into the original user sequences. Because user sequences grow at any time in the Web environment, our work focuses on mining Web traversal patterns when user sequences are inserted into and deleted from the traversal sequence database. Besides, the ISL and IncSpan algorithms are applied to mining sequential patterns and must re-mine the sequential patterns from the entire database when the minimum support is changed. Our work also needs to consider the Web site structure to avoid finding unqualified traversal sequences. For these reasons, these two algorithms cannot be applied to mining Web traversal patterns. For interactive data mining, the KISP (knowledge base assisted incremental sequential pattern) algorithm (Lin et al., 2002) has been proposed for interactively finding sequential patterns. KISP constructs a KB (knowledge base) structure on disk to minimize the response time of iterative mining. Before discovering the sequential patterns, all the sequences are stored in the KB ordered by sequence length, and for every sequence length, the KB stores the sequences ordered by their supports. KISP uses the previous information in the KB and extends the content of the KB for further mining. Based on the KB, KISP can mine the sequential patterns under different minimum support thresholds without re-mining them from the original database. However, the KB structure simply stores the sequences ordered by length and support; there is no super-sequence/sub-sequence relationship among the sequences in the KB. Hence, some information cannot be obtained from the KB directly, for example, the sequential patterns related to certain items, or the longest sequential patterns that are not sub-sequences of any other sequential patterns. For our algorithms, we use a lattice structure to keep the previous mining results, and the information mentioned above about Web traversal patterns can be obtained easily by traversing the lattice structure. Besides, KISP must re-mine the sequential patterns from the entire database when the database is updated, whereas our algorithms can mine Web traversal patterns interactively and incrementally in one lattice-based framework.
Data Structure for Mining Web Traversal Patterns
In order to mine Web traversal patterns incrementally and interactively, we use the previous mining results to discover new patterns so that the mining time can be reduced. In this chapter, a lattice structure is used to keep the previous mining results. Figure 2 shows the simple lattice structure O for the database described in Table 1 when min_sup is set to 50%; in the lattice structure O, only Web traversal patterns are stored.
Figure 2. Simple lattice structure
Figure 3. Extended lattice structure
To incrementally and interactively mine the Web traversal patterns and to speed up the mining process, we extend the lattice structure O to record more information. The extended lattice structure E is shown in Figure 3. In Figure 3, each node contains a traversal sequence whose support count is at least 1. We append the support information to the upper part of each node and use it to calculate and accumulate supports while the incremental and interactive mining proceeds. Moreover, we also append, to the lower part of each node, the TIDs of the user sequences in which the traversal sequence occurs; this information is used to reduce unnecessary database scans. Different from the simple lattice structure, we put all candidate traversal sequences whose support counts are greater than or equal to one into the lattice structure, and the lattice is stored on disk level by level. The lattice structure lets us quickly find the relationships between patterns. For example, if we want to search for the patterns related to Web page "A", we can simply traverse the lattice structure from the node "A". Moreover, if we want to find the maximal Web traversal patterns, which are not sub-sequences of any other Web traversal pattern, we just need to traverse the lattice structure once and return the patterns in the topmost nodes whose supports are greater than or equal to min_sup; in Figure 3, these topmost patterns are the maximal Web traversal patterns. We utilize the Web site structure shown in Figure 1 to mine Web traversal patterns from the traversal sequence database shown in Table 1; the final results, with min_sup set to 50%, are shown in Figure 3. The reason for using the Web site structure is to avoid generating unqualified traversal sequences in the mining process. For example, assume that our Web site has 300 Web pages and all of them are 1-Web traversal patterns. If we do not refer to the Web site structure, then 299×300 = 89,700 candidate 2-sequences can be generated, and in most situations most of them are unqualified. Assume that the average out-degree of a node is 10; if we refer to the Web site structure, then only about 300×10 = 3,000 candidate 2-sequences are generated. The candidate generation method is like the join method proposed in Cheng et al. (2004): two distinct (k-1)-Web traversal patterns are joined into a candidate k-traversal sequence only if the last k-2 pages of one pattern are exactly the same as the first k-2 pages of the other (i.e., after dropping the first page of one Web traversal pattern and the last page of the other, the remaining two (k-2)-traversal sequences are identical). For example, a candidate 3-traversal sequence can be generated by joining two 2-Web traversal patterns that overlap in this way. For a candidate l-traversal sequence a, if a qualified length (l-1) sub-sequence of
a is not a Web traversal pattern, then a cannot be a Web traversal pattern and a can be pruned. Hence, we also check all the qualified length (l-1) sub-sequences of a candidate l-traversal sequence in order to remove unnecessary candidates. In this example, we need to check whether the qualified length (l-1) sub-sequences of the candidate are Web traversal patterns; if one of them is not, the candidate is not a Web traversal pattern either. We do not need to check an unqualified sub-sequence (in this example, one containing consecutive pages A and C, since there is no link from A to C in the Web site structure).
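As a concrete illustration of the extended lattice node and of the join-based candidate generation described above, here is a simplified Python sketch; the class and function names are ours, and the toy link set merely stands in for the Web site structure of Figure 1, which is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class LatticeNode:
    sequence: tuple                                 # e.g., ('A', 'B', 'C')
    count: int = 0                                  # support count (upper part of the node)
    tids: set = field(default_factory=set)          # TIDs containing the sequence (lower part)
    parents: list = field(default_factory=list)     # nodes holding its (k-1)-sub-sequences

def qualified(seq, links):
    """A traversal sequence is qualified if every consecutive pair is a link in the site."""
    return all((a, b) in links for a, b in zip(seq, seq[1:]))

def join(x, y, links):
    """Join two (k-1)-patterns into a candidate k-sequence when dropping the first page of x
    and the last page of y leaves identical (k-2)-sequences; unqualified candidates are pruned."""
    if x[1:] == y[:-1]:
        cand = x + (y[-1],)
        if qualified(cand, links):
            return cand
    return None

links = {('A', 'B'), ('B', 'C'), ('C', 'D')}        # toy Web site structure (assumed)
print(join(('A', 'B'), ('B', 'C'), links))          # ('A', 'B', 'C')
print(join(('A', 'B'), ('C', 'D'), links))          # None
```

Storing the TID set in each node is what allows the algorithms below to update support counts without rescanning the whole database.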
Algorithm for Incremental Web Traversal Pattern Mining
In this section, we propose the algorithm IncWTP for the maintenance of Web traversal patterns when the database is updated. IncWTP mines the Web traversal patterns from the first level to the last level of the lattice structure; for each level k (k ≥ 1), the k-Web traversal patterns are generated. There are three main steps in each level k. In the first step, the TIDs of the deleted user sequences are removed from each node of the kth level, and the support count of a node is decreased if the node contains the TID of a deleted user sequence. Obtaining the support count of each node in this step is easy, because the lattice structure keeps not only the TID information but also the support count of each node. In the second step, we deal with the inserted user sequences. For each inserted user sequence u, we decompose u into traversal sequences of length k, that is, all the length-k sub-sequences of u are generated. According to the Web site structure, the unqualified traversal sequences are pruned; this pruning avoids searching the candidate sequences for unqualified traversal sequences when counting their supports. For each qualified k-traversal sequence s, if s is already contained in a node of the lattice structure, we simply increase the support count of this node and add the TID of u to the node. Otherwise, if all the qualified length (k-1) sub-sequences of s are Web traversal patterns, a new node ns containing the traversal sequence s and the TID of u is created in the kth level, and links are created between ns and the nodes in the (k-1)th level that contain the qualified length (k-1) sub-sequences of s. Because the lattice structure always maintains the qualified candidate k-traversal sequences and the links between a traversal sequence s and all its length (k-1) sub-sequences, the relationships between super-sequences and sub-sequences can easily be obtained by traversing the lattice structure. After processing the inserted and deleted user sequences, all the k-Web traversal patterns can be generated. If the support count of a node becomes 0, the node and all the links related to it are deleted from the lattice structure. If the support of a node drops below min_sup, all the links between the node and the nodes N in the (k+1)th level are deleted, and the nodes in N are marked, because the traversal sequences in these nodes are no longer candidate traversal sequences.
Table 2. Traversal sequence database after inserting and deleting user sequences from Table 1

TID   User sequence
3     CDEAD
4     CDEAB
5     CDAB
6     ABDC
7     ABCEA
Figure 4. Updated lattice structure after processing level 1
Figure 5. Updated lattice structure after processing level 2
Figure 6. Updated lattice structure after processing level 3
Hence, in the kth level, if a node has been marked, this node and the links between it and the nodes in the (k+1)th level are also deleted. In the last step, the candidate (k+1)-traversal sequences are generated. The new Web traversal patterns in level k are joined with themselves to generate new candidate (k+1)-traversal sequences. Besides, the original Web traversal patterns in level k are also joined with the new Web traversal patterns to generate the other new candidate (k+1)-traversal sequences. The original k-Web traversal patterns need not be joined with each other, because they were joined before. This candidate generation method avoids generating redundant candidate traversal sequences, so the number of candidate traversal sequences can be reduced. After generating the new candidate (k+1)-traversal sequences, the original database is scanned to obtain the original support count and the TID information of each new candidate (k+1)-traversal sequence c. A new node nc containing c is created and inserted into the lattice structure, and links are created between nc and the nodes in the kth level that contain the qualified length-k sub-sequences of c. If no Web traversal pattern is generated, the mining process terminates. Our incremental mining algorithm IncWTP is shown in Algorithm 1 (in C++-like pseudocode), and Algorithm 2 shows the function CandidateGen, which generates and processes the candidate traversal sequences.
Algorithm 1. IncWTP(D, min_sup, W, L, InsTID, DelTID, m)
Input: traversal sequence database D, minimum support min_sup, Web site structure W, lattice structure L, inserted TIDs InsTID, deleted TIDs DelTID, maximum level m of L
Output: all Web traversal patterns

k = 1;
while (k ≤ m or there are new Web traversal patterns generated in level k)
    for each node n in level k
        if (node n is marked)
            delete node n and all the links related to n;
            mark the nodes in level (k+1) which have links with node n;
        if (node n contains any TID in DelTID)
            delete the TIDs contained in DelTID from n and decrease the support count of n;
    for each inserted user sequence u
        decompose u into the qualified traversal sequences of length k;
        for each decomposed traversal sequence s
            if (s is contained in a node n of level k)
                add u's TID to n and increase the support count of n;
            else if (all qualified (k-1)-sub-sequences of s are Web traversal patterns)
                generate a new node ns containing s in level k;
                add u's TID to ns and increase the support count of ns;
    if (the support of a node nm is less than min_sup)
        delete all the links between node nm and the nodes in level (k+1);
        mark the nodes in level (k+1) which have links with node nm;
    if (the support count of a node n0 in level k is equal to 0)
        delete node n0 and all the links related to n0;
    for each traversal sequence ts in level k
        if (the support of ts ≥ min_sup)
            WTPk = WTPk ∪ {ts};   /* WTPk is the set of all k-Web traversal patterns */
    NewWTPk = WTPk − OriWTPk;   /* OriWTPk is the set of original Web traversal patterns and NewWTPk is the set of new Web traversal patterns */
    output all the Web traversal patterns in level k;
    CandidateGen(NewWTPk, OriWTPk);
    k++;
Algorithm 2. CandidateGen(NewWTPk, OriWTPk)
for each new Web traversal pattern x in NewWTPk
    for each new Web traversal pattern y in NewWTPk
        if (x and y can be joined)
            generate a new candidate (k+1)-traversal sequence and store it in set C;
    for each original Web traversal pattern z in OriWTPk
        if (x and z can be joined)
            generate a new candidate (k+1)-traversal sequence and store it in set C;
for each candidate (k+1)-traversal sequence c in C
    count the support of c and record the TIDs of the user sequences in D which contain c;
    create a new node nc which contains c;
    for each node ns in the kth level which contains a qualified k-sub-sequence of c
        create a link between ns and nc;
In Algorithm 1, D denotes the traversal sequence database, W the Web site structure, L the lattice structure, min_sup the minimum support, NewWTP the new Web traversal patterns, OriWTP the original Web traversal patterns, InsTID the TIDs of the inserted user sequences, DelTID the TIDs of the deleted user sequences, k the level of L currently being processed, and m the maximum level of the original lattice structure. For instance, the maximum level of the lattice structure in Figure 3 is 3. All the Web traversal patterns are output as the results. For example, starting from Table 1, we insert one user sequence (7, ABCEA) and delete two user sequences (1, ABCED) and (2, ABCD), as shown in Table 2; min_sup is again set to 50%. At the first level of the lattice structure in Figure 3, TID 1 and TID 2 are deleted and the support count is decreased in each node of level 1 that contains TID 1 or TID 2. Then, the inserted user sequence with TID 7 is decomposed into length-1 traversal sequences, and TID 7 is added and the support count increased in each node that contains one of the decomposed 1-traversal sequences. Because no new 1-Web traversal pattern is generated in level 1, we continue to process level 2 of the lattice structure. The updated lattice structure after processing level 1 is shown in Figure 4. Because no new 1-Web traversal pattern is generated and no node is deleted, the number of nodes and the links between the first and second levels are unchanged. According to the TIDs of the deleted sequences, TID 1 and TID 2 are deleted and the support count is decreased in each node of level 2. Then, the inserted user sequence with TID 7 is decomposed into length-2 traversal sequences, and TID 7 is added and the support count increased in each node that contains one of the decomposed 2-traversal sequences. Finally, we find that one traversal sequence turns out to be a new 2-Web traversal pattern, while three of the original 2-Web traversal patterns are no longer Web traversal patterns after the database update; five traversal sequences are marked. Figure 5 shows the lattice structure after processing the inserted and deleted traversal sequences in level 2; the sequence drawn with a double line is the new Web traversal pattern. After generating the new 2-Web traversal patterns, two new candidate 3-traversal sequences are generated. Similarly, the last level is processed and level 3 of the lattice structure is updated. Figure 6 shows the final result of our example, in which the sequences drawn with solid lines are the Web traversal patterns.
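The second step of IncWTP decomposes each inserted user sequence into its qualified length-k sub-sequences. The following Python sketch (an illustration under our own naming, with an assumed toy link set standing in for Figure 1) shows one way this decomposition and Web-site-structure pruning can be done.

```python
from itertools import combinations

def qualified(seq, links):
    # every consecutive pair in the sequence must be a link in the Web site structure
    return all((a, b) in links for a, b in zip(seq, seq[1:]))

def qualified_subsequences(user_seq, k, links):
    """All distinct qualified length-k sub-sequences of an inserted user sequence."""
    subs = {tuple(user_seq[i] for i in idx) for idx in combinations(range(len(user_seq)), k)}
    return {s for s in subs if qualified(s, links)}

links = {('A', 'B'), ('B', 'C'), ('C', 'E'), ('E', 'A')}   # toy site structure (assumed)
print(sorted(qualified_subsequences("ABCEA", 2, links)))
# [('A', 'B'), ('B', 'C'), ('C', 'E'), ('E', 'A')]
```

Only these qualified sub-sequences are then matched against the existing lattice nodes or turned into new nodes, which is what keeps the incremental update cheap.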
Algorithm for Interactive Web Traversal Pattern Mining
In this section, we propose the algorithm IntWTP for the maintenance of Web traversal patterns when the minimum support is changed. For IntWTP, if the new min_sup is larger than the original min_sup, then all the traversal sequences in the lattice structure whose supports are no less than the new min_sup are Web traversal patterns. If the new min_sup is smaller than the original min_sup, then IntWTP mines the Web traversal patterns from the first level to the last level of the lattice structure; for each level k (k ≥ 1), the k-Web traversal patterns are generated. There are two main steps in each level k. In the first step, the traversal sequences in level k are checked: if the support of a traversal sequence is no less than the new min_sup but less than the original min_sup, the traversal sequence is a new Web traversal pattern. Hence, all the new Web traversal patterns can be generated according to the new and original min_sup, and in this step all the k-Web traversal patterns, including the original and the new ones, are obtained. In the second step, the candidate (k+1)-traversal sequences are generated. The new Web traversal patterns in level k are joined with themselves to generate new candidate (k+1)-traversal sequences, and the original Web traversal patterns in level k are also joined with the new Web traversal patterns to generate the other new candidate (k+1)-traversal sequences. The original k-Web traversal patterns need not be joined with each other, because they were joined before. After generating the new candidate (k+1)-traversal sequences, the database is scanned to obtain the support count and the TID information of each new candidate (k+1)-traversal sequence c. A new node containing c is created and inserted into the (k+1)th level of the lattice structure, and links are created between this node and the nodes in the kth level that contain the qualified length-k sub-sequences of c. If no Web traversal pattern (including the original Web traversal patterns) is generated, the mining process terminates. Our interactive mining algorithm IntWTP is shown in Algorithm 3 (in C++-like pseudocode). In Algorithm 3, Ori_min_sup denotes the original min_sup and New_min_sup denotes the new min_sup; all the Web traversal patterns are output as the results. The following shows an example for IntWTP. For the previous example (see Table 1 and Figure 3), we first increase the min_sup from 50% to 70%.
Algorithm 3. IntWTP(D, New_min_sup, Ori_min_sup, W, L)
Input: traversal sequence database D, new minimum support New_min_sup, original minimum support Ori_min_sup, Web site structure W, lattice structure L
Output: all Web traversal patterns

if (New_min_sup < Ori_min_sup)
    k = 1; C = ∅;
    while (there are Web traversal patterns in level k)
        find the original Web traversal patterns OriWTPk based on Ori_min_sup and the new Web traversal patterns NewWTPk based on New_min_sup;
        output all the Web traversal patterns in level k;
        CandidateGen(NewWTPk, OriWTPk);
        k++;
Ori_min_sup = New_min_sup;
Figure 7. After processing level 2 of the lattice structure in Figure 3
Because the min_sup is increased, we just traverse the lattice structure once and output the traversal sequences whose supports are greater than or equal to 70% (i.e., whose support counts are no less than 4); in this example, three traversal sequences are output as the Web traversal patterns. If we instead decrease the min_sup from 50% to 40% (i.e., the minimum support count is 2), new Web traversal patterns may be generated. First of all, we scan the first (lowest) level of the lattice structure. Because no new 1-Web traversal pattern is generated, we scan the second level of the lattice structure. In this level, we find that three traversal sequences turn out to be new 2-Web traversal patterns; in Figure 7, the sequences in level 2 drawn with double lines are the new Web traversal patterns. After finding the new 2-Web traversal patterns, the new candidate 3-traversal sequences can be generated. In this example, two candidate 3-traversal sequences are generated by joining the new 2-Web traversal patterns with themselves, and six more candidate 3-traversal sequences are generated by joining new 2-Web traversal patterns with original 2-Web traversal patterns. In Figure 8, the sequences in level 3 drawn with double lines are the new Web traversal patterns. After finding the new 3-Web traversal patterns, one candidate 4-traversal sequence is generated. Figure 9 shows the final lattice structure when min_sup is decreased to 40%; the sequences drawn with solid lines are the Web traversal patterns.
Figure 8. After processing level 3 of the lattice structure in Figure 7
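The following Python sketch illustrates the interactive step in simplified form (our own naming and data layout, not the authors' code): when the threshold is raised, the stored lattice is only filtered; when it is lowered, sequences whose stored supports fall between the new and the old threshold become new patterns. Candidate generation and the extra database scan needed in the lowered-threshold case are omitted here.

```python
def interactive_update(lattice_levels, ori_min_sup, new_min_sup, total_seqs):
    """lattice_levels: one dict per level mapping a traversal sequence to its support count."""
    new_count = new_min_sup * total_seqs
    if new_min_sup >= ori_min_sup:
        # one pass over the stored lattice is enough: every pattern is already there
        return [{s for s, c in level.items() if c >= new_count} for level in lattice_levels]
    patterns = []
    for level in lattice_levels:
        ori = {s for s, c in level.items() if c >= ori_min_sup * total_seqs}
        new = {s for s, c in level.items() if c >= new_count} - ori
        patterns.append(ori | new)
        # the sequences in `new` would next be joined with themselves and with `ori`
        # to form candidate (k+1)-sequences (see CandidateGen); omitted in this sketch
    return patterns

levels = [{('A',): 5, ('B',): 4}, {('A', 'B'): 4, ('C', 'D'): 2}]   # toy stored lattice
print(interactive_update(levels, ori_min_sup=0.5, new_min_sup=0.4, total_seqs=5))
# lowering the threshold makes ('C', 'D') a new 2-pattern in level 2
```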
Experimental Results
Because there are currently no other incremental and interactive mining algorithms for finding Web traversal patterns, we use the MFTP algorithm (Yen, 2003), which also finds Web traversal patterns, to compare with our algorithms IncWTP and IntWTP.
Figure 9. The final lattice structure
We implement the algorithms IncWTP and IntWTP in the C language and perform the experiments on a PC with a 1.3GHz Intel Pentium-4 processor, 512 MB of RAM, and the Windows XP Professional platform. The synthetic datasets are generated as follows. First, the Web site structure is generated: the number of Web pages is set to 300 and the average number of out-links per page is set to 15. According to the Web site structure, the potential Web traversal patterns are generated: the average number of Web pages per potential Web traversal pattern is set to 6, the total number of potential Web traversal patterns is set to 2,500, and the maximum size of a potential Web traversal pattern is set to 10. After generating the potential Web traversal patterns, the user sequences are generated by picking potential Web traversal patterns according to a Poisson distribution, with the other pages picked at random. The average size (number of pages) per user sequence is set to 15 and the maximum size of a user sequence is set to 25. We generate four synthetic datasets in which the numbers of user sequences are 30K, 50K, 70K, and 100K, respectively. In the following, we present the experimental results on the performance of our approaches.
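As a rough illustration of the generator described above, the following Python sketch builds a random site structure, draws potential patterns as random walks over it, and embeds one pattern per user sequence before padding with extra random pages. It is only an approximation of the described procedure: in particular, it does not reproduce the Poisson-based selection of potential patterns, and all names and parameter defaults are illustrative.

```python
import random

def generate_site(num_pages=300, avg_out_links=15, seed=0):
    """Random Web site structure: each page links to avg_out_links other pages (assumed)."""
    rng = random.Random(seed)
    pages = [f"P{i}" for i in range(num_pages)]
    return {p: rng.sample([q for q in pages if q != p], avg_out_links) for p in pages}

def random_walk(site, length, rng):
    """A qualified traversal sequence obtained by walking the site structure."""
    page = rng.choice(list(site))
    walk = [page]
    for _ in range(length - 1):
        page = rng.choice(site[page])
        walk.append(page)
    return walk

def generate_dataset(site, num_seqs, num_patterns=2500, pat_len=6, seq_len=15, seed=1):
    rng = random.Random(seed)
    patterns = [random_walk(site, pat_len, rng) for _ in range(num_patterns)]
    db = []
    for _ in range(num_seqs):
        seq = list(rng.choice(patterns))                    # embed one potential pattern
        seq += random_walk(site, seq_len - len(seq), rng)   # pad with extra random pages
        db.append(seq)
    return db

site = generate_site()
db = generate_dataset(site, num_seqs=1000)
```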
Performance Evaluation for Incremental Web Traversal Pattern Mining
In the experiments, the four original datasets are increased by inserting 2K, 4K, 6K, 8K, 10K, 12K, 14K, 16K, 18K, and 20K user sequences. In the first experiment, min_sup is set to 5%. Figure 10 shows the relative execution times of MFTP and IncWTP on the four synthetic datasets. In Figure 10, we can see that our algorithm IncWTP outperforms the MFTP algorithm, since IncWTP uses the lattice structure and the Web site structure to prune a lot of candidate sequences and keeps the previous mining results, so that for most of the candidate sequences only the inserted user sequences need to be scanned. The performance gap increases when the size of the original database increases, because a larger original database makes MFTP worse than IncWTP in terms of the number of candidate traversal sequences and the size of the database that needs to be scanned: MFTP must re-find all the Web traversal patterns from the whole updated database, whereas most of the new Web traversal patterns are generated only from the inserted user sequences for IncWTP.
Figure 10. Relative execution times for MFTP and IncWTP (min_sup = 5%)
Moreover, the fewer the inserted user sequences, the fewer the new candidate sequences generated by our algorithm; hence, the performance gap increases as the number of inserted user sequences decreases. In the second experiment, we use the synthetic dataset with 100K user sequences and set min_sup to 10%, 8%, 5%, 3%, and 1%, respectively. Figure 11 shows the relative execution times of MFTP and IncWTP, in which we can see that IncWTP outperforms MFTP significantly. The lower the min_sup, the more candidate sequences MFTP generates, and MFTP needs to spend a lot of time counting this large number of candidate sequences over the whole updated database. For IncWTP, only a few new candidate sequences are generated for the different minimum supports; hence, the performance gap increases as the minimum support threshold decreases. In the third experiment, min_sup is set to 5% and we again use the four synthetic datasets of the first experiment. These original datasets are decreased by deleting 2K, 4K, 6K, 8K, 10K, 12K, 14K, 16K, 18K, and 20K user sequences. Figure 12 shows the relative execution times of MFTP and IncWTP on the four synthetic datasets, in which we can see that IncWTP is also more efficient than MFTP. The more user sequences are deleted, the smaller the updated database becomes; hence, the performance gap decreases as the number of deleted user sequences increases, since the size of the database to be scanned and the number of candidate sequences decrease for MFTP. For our algorithm, few or no new candidates are generated when user sequences are deleted from the original database, and we just need to update the lattice structure for the deleted user sequences when the number of deleted user sequences is small. Hence, IncWTP still outperforms MFTP.
Figure 11. Relative execution times for MFTP and IncWTP (Dataset = 100K)
Figure 12. Relative execution times for MFTP and IncWTP (min_sup = 5%)
In the fourth experiment, we also use the synthetic dataset with 100K user sequences, and min_sup is set to 10%, 8%, 5%, 3%, and 1%, respectively. Figure 13 shows the relative execution times of MFTP and IncWTP on this dataset; we can see that IncWTP outperforms MFTP significantly. The performance gap increases as the minimum support threshold decreases, since the number of candidate sequences increases for MFTP and the whole updated database needs to be scanned for this large number of candidate sequences. For IncWTP, only the deleted user sequences need to be scanned when the minimum support threshold is large.
Figure 13. Relative execution times for MFTP and IncWTP (Dataset = 100K)
Performance Evaluation for Interactive Web Traversal Pattern Mining
We use real-world user traversal data and five synthetic datasets to evaluate the performance of our interactive mining algorithm IntWTP. The real database is a networked database that stores information for renting DVD movies; there are 82 Web pages in the Web site. We collected the user traversal data from 02/18/2001 to 02/24/2001 (7 days), giving 428,596 log entries in the original database. Before mining the Web traversal patterns, we need to transform these Web logs into a traversal sequence database. The steps are as follows. Because we want to capture meaningful user behaviors, log entries that refer to images are not important; thus, all log entries whose accessed filenames have suffixes such as .JPG, .GIF, .SME, and .CDF are removed. Then, we organize the log entries according to the user's IP address and a time limit. After these processes, we obtain a Web traversal sequence database like Table 1. Following these steps, the original log entries are organized into 12,157 traversal sequences. The execution times of our interactive mining algorithm IntWTP and the MFTP algorithm are shown in Figure 14.
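A minimal Python sketch of the preprocessing just described follows (our own naming; the log field layout and the 30-minute session gap are assumptions, since the chapter only states that entries are grouped by IP address and a time limit).

```python
from collections import defaultdict

IMAGE_SUFFIXES = (".jpg", ".gif", ".sme", ".cdf")
SESSION_GAP = 30 * 60   # assumed session time limit, in seconds

def to_user_sequences(entries):
    """entries: list of (ip, timestamp_in_seconds, url), already sorted by time."""
    sessions = defaultdict(list)          # ip -> list of sessions (each a list of urls)
    last_time = {}
    for ip, ts, url in entries:
        if url.lower().endswith(IMAGE_SUFFIXES):
            continue                      # drop image requests
        if ip not in last_time or ts - last_time[ip] > SESSION_GAP:
            sessions[ip].append([])       # start a new traversal sequence for this user
        sessions[ip][-1].append(url)
        last_time[ip] = ts
    return [seq for user in sessions.values() for seq in user]
```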
Figure 14. Execution times on real database
Figure 15. Relative execution times for MFTP and IntWTP
Table 3. Relative storage space for lattice size and database size (cell values are LatticeSize/DBSize for each lattice level)

min_sup (%)   L1     L2     L3     L4     L5     L6     L7     L8     L9     L10    SUM
20            0.95   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.95
10            0.95   0.16   0.04   0.02   0.00   0.00   0.00   0.00   0.00   0.00   1.17
8             0.95   0.32   0.18   0.09   0.03   0.02   0.00   0.00   0.00   0.00   1.59
5             0.95   0.66   0.43   0.32   0.20   0.10   0.02   0.01   0.00   0.00   2.69
3             0.95   0.84   0.61   0.48   0.30   0.17   0.08   0.02   0.00   0.00   3.45
1             0.96   1.01   0.98   0.86   0.72   0.47   0.28   0.14   0.06   0.02   5.50
0.5           0.96   1.04   1.03   0.92   0.78   0.51   0.29   0.15   0.06   0.02   5.76
0.1           0.96   1.06   1.07   0.95   0.80   0.52   0.30   0.15   0.06   0.02   5.89
0.05          0.96   1.06   1.07   0.95   0.80   0.52   0.30   0.15   0.06   0.02   5.89
0.01          0.96   1.06   1.08   0.97   0.81   0.53   0.30   0.15   0.06   0.02   5.94
In the synthetic datasets, we set the number of Web pages to 300 and generate five datasets with 10K, 30K, 50K, 70K, and 100K user sequences, respectively. The relative execution times of the IntWTP and MFTP algorithms are shown in Figure 15. The initial min_sup is set to 20%, and then we continually decrease the min_sup from 10% down to 0.01%. From Figure 14 and Figure 15, we can see that our algorithm IntWTP outperforms the MFTP algorithm significantly, since IntWTP uses the lattice structure to keep the previous mining results and the Web site structure to prune a lot of candidate sequences, so that only the newly generated candidate sequences need to be counted. Besides, the performance gap increases as the minimum support threshold decreases or the database size increases, because in those cases the number of candidate sequences and the number of database scans increase, which degrades the performance of MFTP. For IntWTP, the original Web traversal patterns can be ignored and only a few new candidate sequences need to be counted; hence, the mining time can be reduced dramatically. Moreover, we also examine the storage space of the lattice structure relative to the database size, using the synthetic dataset with 100K user sequences. Table 3 shows, for each level, the ratio of the space occupied by the lattice structure to the space occupied by the database. In Table 3, the sizes of level 2 and level 3 of the lattice structure are slightly larger than the database size when the minimum support is decreased to 1%; in the other cases, the size of each level of the lattice structure is smaller than the database size. Because IntWTP discovers Web traversal patterns level by level, it will not exhaust memory, since only one level of the lattice structure is loaded into memory at a time.
Conclusion and Future Work
In this chapter, we propose the incremental and interactive data mining algorithms IncWTP and IntWTP for discovering Web traversal patterns when user sequences are inserted into or deleted from the original database and when the minimum support is changed. In order to avoid re-finding the original Web traversal patterns and re-counting the original candidate sequences, our algorithms use a lattice structure to keep the previous mining results, so that only the new candidate sequences need to be computed. Hence,
the Web traversal patterns can be obtained rapidly when the traversal sequence database is updated, and users can adjust the minimum support threshold to obtain the interesting Web traversal patterns quickly. Besides, the Web traversal patterns related to certain pages, or the maximal Web traversal patterns, can also be obtained easily by traversing the lattice structure. However, the Web site structure may change. In the future, we shall investigate how to use the lattice structure to maintain the Web traversal patterns when pages and links in the Web site structure are changed. Besides, the number of Web pages and user sequences grows all the time, and the lattice structure may become too large to fit into memory; hence, we shall also investigate how to reduce the storage space and how to partition the lattice structure so that the information of each partition fits into memory.
Acknowledgment Research on this chapter was partially supported by National Science Council grant NSC93-2213-E130-006 and NSC93-2213-E-030-002.
References
Chen, M. S., Huang, X. M., & Lin, I. Y. (1999). Capturing user access patterns in the Web for data mining. Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (pp. 345-348).
Chen, M. S., Park, J. S., & Yu, P. S. (1998). Efficient data mining for path traversal patterns in a Web environment. IEEE Transactions on Knowledge and Data Engineering, 10(2), 209-221.
Chen, S. Y., & Liu, X. (2005). Data mining from 1994 to 2004: An application-orientated review. International Journal of Business Intelligence and Data Mining, 1(1), 4-21.
Cheng, H., Yan, X., & Han, J. (2004). IncSpan: Incremental mining of sequential patterns in large database. Proceedings of 2004 International Conference on Knowledge Discovery and Data Mining (pp. 527-532).
Cooley, R., Mobasher, B., & Srivastava, J. (1997). Web mining: Information and pattern discovery on the world wide Web. Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (pp. 558-567).
EL-Sayed, M., Ruiz, C., & Rundensteiner, E. A. (2004). FS-miner: Efficient and incremental mining of frequent sequence patterns in Web logs. Proceedings of ACM International Workshop on Web Information and Data Management (pp. 128-135).
Lee, Y. S., Yen, S. J., Tu, G. H., & Hsieh, M. C. (2004). Mining traveling and purchasing behaviors of customers in electronic commerce environment. Proceedings of IEEE International Conference on e-Technology, e-Commerce and e-Service (pp. 227-230).
Lee, Y. S., Yen, S. J., Tu, G. H., & Hsieh, M. C. (2003). Web usage mining: Integrating path traversal patterns and association rules. Proceedings of International Conference on Informatics, Cybernetics, and Systems (pp. 1464-1469).
Lin, M. Y., & Lee, S. Y. (2002). Improving the efficiency of interactive sequential pattern mining by incremental pattern discovery. Proceedings of the Hawaii International Conference on System Sciences (pp. 68-76).
Ngan, S. C., Lam, T., Wong, R. C. W., & Fu, A. W. C. (2005). Mining n-most interesting itemsets without support threshold by the COFI-tree. International Journal of Business Intelligence and Data Mining, 1(1), 88-106.
Parthasarathy, S., Zaki, M. J., Ogihara, M., & Dwarkadas, S. (1999). Incremental and interactive sequence mining. Proceedings of the 8th International Conference on Information and Knowledge Management (pp. 251-258).
Pei, J., Han, J., Mortazavi-Asl, B., & Zhu, H. (2000). Mining access patterns efficiently from Web logs. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 396-407).
Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., & Hsu, M. C. (2001). PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. Proceedings of International Conference on Data Engineering (pp. 215-224).
Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., & Hsu, M. C. (2004). Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering, 16(11), 1424-1440.
Sato, K., Ohtaguro, A., Nakashima, M., & Ito, T. (2005). The effect of a Web site directory when employed in browsing the results of a search engine. International Journal on Web Information System, 1(1), 43-51.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. N. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations (pp. 12-23).
Velásquez, J., Ríos, S., Bassi, A., Yasuda, H., & Aoki, T. (2005). Towards the identification of keywords in the Web site text content: A methodological approach. International Journal on Web Information System, 1(1), 53-57.
Xiao, Y., Yao, J. F., & Yang, G. (2005). Discovering frequent embedded subtree patterns from large databases of unordered labeled trees. International Journal of Data Warehousing and Mining, 1(2), 44-66.
Yen, S. J. (2003). An efficient approach for analyzing user behaviors in a Web-based training environment. International Journal of Distance Education Technologies, 1(4), 55-71.
Yen, S. J., & Lee, Y. S. (2006). An incremental data mining algorithm for discovering Web access patterns. International Journal of Business Intelligence and Data Mining, 1(3), 288-303.
This work was previously published in Data Mining and Knowledge Discovery Technologies, edited by D. Taniar , pp. 72-96, copyright 2008 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter XXVIII
Privacy-Preserving Data Mining on the Web: Foundations and Techniques
Stanley R. M. Oliveira, Embrapa Informática Agropecuária, Brazil
Osmar R. Zaïane, University of Alberta, Edmonton, Canada
Abstract
Privacy-preserving data mining (PPDM) is one of the newest trends in privacy and security research. It is driven by one of the major policy issues of the information era—the right to privacy. This chapter describes the foundations for further research in PPDM on the Web. In particular, we describe the problems we face in defining what information is private in data mining. We then describe the basis of PPDM including the historical roots, a discussion on how privacy can be violated in data mining, and the definition of privacy preservation in data mining based on users' personal information and information concerning their collective activities. Subsequently, we introduce a taxonomy of the existing PPDM techniques and a discussion on how these techniques are applicable to Web-based applications. Finally, we suggest some privacy requirements that are related to industrial initiatives and point to some technical challenges as future research trends in PPDM on the Web.
Introduction
Analyzing what the right to privacy means is fraught with problems, such as whether the exact definition of privacy constitutes a fundamental right and whether people are and should be concerned with it. Several definitions of privacy have been given, and they vary according to context, culture, and environment.
For instance, in a seminal paper, Warren and Brandeis (1890) defined privacy as "the right to be alone". Later on, Westin (1967) defined privacy as "the desire of people to choose freely under what circumstances and to what extent they will expose themselves, their attitude, and their behavior to others". Schoeman (1984) defined privacy as "the right to determine what (personal) information is communicated to others" or "the control an individual has over information about himself or herself". More recently, Garfinkel (2001) stated that "privacy is about self-possession, autonomy, and integrity". On the other hand, Rosenberg (2000) argues that privacy may not be a right after all but a taste: "If privacy is in the end a matter of individual taste, then seeking a moral foundation for it—beyond its role in making social institutions possible that we happen to prize—will be no more fruitful than seeking a moral foundation for the taste for truffles".
The above definitions suggest that, in general, privacy is viewed as a social and cultural concept. However, with the ubiquity of computers and the emergence of the Web, privacy has also become a digital problem. With the Web revolution and the emergence of data mining, privacy concerns have posed technical challenges fundamentally different from those that occurred before the information era. In the information technology era, privacy refers to the right of users to conceal their personal information and have some degree of control over the use of any personal information disclosed to others (Cockcroft & Clutterbuck, 2001). In the context of data mining, the definition of privacy preservation is still unclear, and there is very little literature related to this topic. A notable exception is the work presented in Clifton, Kantarcioglu, and Vaidya (2002), in which PPDM is defined as "getting valid data mining results without learning the underlying data values". However, at this point, each existing PPDM technique has its own privacy definition. Our primary concern about PPDM is that mining algorithms are analyzed for the side effects they incur in data privacy. We define PPDM as the dual goal of meeting privacy requirements and providing valid data mining results.
The Basis of Privacy-Preserving Data Mining

Historical Roots

The debate on PPDM has received special attention as data mining has been widely adopted by public and private organizations. We have witnessed three major landmarks that characterize the progress and success of this new research area: the conceptive landmark, the deployment landmark, and the prospective landmark. We describe these landmarks as follows:

The conceptive landmark characterizes the period in which central figures in the community, such as O’Leary (1991, 1995), Piatetsky-Shapiro (1995), and others (Klösgen, 1995; Clifton & Marks, 1996), investigated the success of knowledge discovery and some of the important areas where it can conflict with privacy concerns. The key finding was that knowledge discovery can open new threats to informational privacy and information security if not done or used properly. The deployment landmark is the current period in which an increasing number of PPDM techniques have been developed and published in refereed conferences. The information available today is spread over countless papers and conference proceedings. The results achieved in recent years are promising and suggest that PPDM will achieve the goals that have been set for it.
The prospective landmark is a new period in which directed efforts toward standardization occur. At this stage, there is no consensus about what privacy preservation means in data mining. In addition, there is no consensus on privacy principles, policies, and requirements as a foundation for the development and deployment of new PPDM techniques. The excessive number of techniques is leading to confusion among developers, practitioners, and others interested in this technology. One of the most important challenges in PPDM now is to establish the groundwork for further research and development in this area.
Privacy Violation in Data Mining

Understanding privacy in data mining requires understanding how privacy can be violated and the possible means for preventing privacy violation. In general, one major factor contributes to privacy violation in data mining: the misuse of data. Users’ privacy can be violated in different ways and with different intentions. Although data mining can be extremely valuable in many applications (e.g., business, medical analysis, etc.), it can also, in the absence of adequate safeguards, violate informational privacy. Privacy can be violated if personal data are used for purposes other than the original transaction between an individual and an organization in which the information was collected (Culnan, 1993). One of the sources of privacy violation is called data magnets (Rezgui, Bouguettaya, & Eltoweissy, 2003). Data magnets are techniques and tools used to collect personal data. Examples of data magnets include explicitly collecting information through online registration, identifying users through IP addresses, software downloads that require registration, and indirectly collecting information for secondary usage. In many cases, users are not aware that information is being collected or do not know how that information is collected. In particular, collected personal data can be used for secondary usage largely beyond the users’ control and the reach of privacy laws. This scenario has led to uncontrolled privacy violations, not because of data mining itself, but fundamentally because of the misuse of data.
Defining Privacy for Data Mining

In general, privacy preservation occurs in two major dimensions: users’ personal information and information concerning their collective activities. We refer to the former as individual privacy preservation and the latter as collective privacy preservation, which is related to corporate privacy in Clifton et al. (2002).
• Individual privacy preservation: The primary goal of data privacy is the protection of personally identifiable information. In general, information is considered personally identifiable if it can be linked, directly or indirectly, to an individual person. Thus, when personal data are subjected to mining, the attribute values associated with individuals are private and must be protected from disclosure. Miners are then able to learn from global models rather than from the characteristics of a particular individual.
• Collective privacy preservation: Protecting personal data may not be enough. Sometimes, we may need to protect against learning sensitive knowledge representing the activities of a group. We refer to the protection of sensitive knowledge as collective privacy preservation. The goal here
is quite similar to the one for statistical databases, in which security control mechanisms provide aggregate information about groups (population) and, at the same time, prevent disclosure of confidential information about individuals. However, unlike in the case for statistical databases, another objective of collective privacy preservation is to protect sensitive knowledge that can provide competitive advantage in the business world. In the case of collective privacy preservation, organizations have to cope with some interesting conflicts. For instance, when personal information undergoes analysis processes that produce new facts about users’ shopping patterns, hobbies, or preferences, these facts could be used in recommender systems to predict or affect their future shopping patterns. In general, this scenario is beneficial to both users and organizations. However, when organizations share data in a collaborative project, the goal is not only to protect personally identifiable information but also sensitive knowledge represented by some strategic patterns.
Characterizing Scenarios of Privacy Preservation on the Web

In this section, we describe two real-life motivating examples in which PPDM poses different constraints:
• Scenario 1: Suppose we have a server and many clients in which each client has a set of sold items (e.g., books, movies, etc.). The clients want the server to gather statistical information about associations among items in order to provide recommendations to the clients. However, the clients do not want the server to know some strategic patterns (also called sensitive association rules). In this context, the clients represent companies, and the server is a recommendation system for an e-commerce application, for example, the fruit of the clients’ collaboration. In the absence of ratings, which are used in collaborative filtering for automatic recommendation building, association rules can be effectively used to build models for online recommendation. When clients send their frequent itemsets or association rules to the server, it must protect the sensitive itemsets according to some specific policies. The server then gathers statistical information from the non-sensitive itemsets and recovers from them the actual associations. How can these companies benefit from such collaboration by sharing association rules while preserving some sensitive association rules?
• Scenario 2: Two organizations, an Internet marketing company and an online retail company, have datasets with different attributes for a common set of individuals. These organizations decide to share their data for clustering to find the optimal customer targets so as to maximize return on investments. How can these organizations learn about their clusters using each other’s data without learning anything about the attribute values of each other?
Note that the above scenarios describe different privacy preservation problems. Each scenario poses a set of challenges. For instance, Scenario 1 is a typical example of collective privacy preservation, while Scenario 2 refers to individuals’ privacy preservation.
Figure 1. A taxonomy of PPDM techniques. The figure groups the existing solutions as follows:
• Data Partitioning: Cryptography-Based Techniques; Generative-Based Techniques
• Data Modification: Noise Addition Techniques (Data Swapping, Data Perturbation, Data Randomization); Space Transformation Techniques (Object Similarity-Based Representation, Dimensionality Reduction Transformation)
• Data Restriction: Blocking-Based Techniques; Sanitization-Based Techniques (Data-Sharing Techniques, Pattern-Sharing Techniques)
• Data Ownership
A Taxonomy of Existing PPDM Techniques

In this section, we classify the existing PPDM techniques in the literature into four major categories: data partitioning, data modification, data restriction, and data ownership, as can be seen in Figure 1.
Data Partitioning Techniques

Data partitioning techniques have been applied to some scenarios in which the databases available for mining are distributed across a number of sites, with each site only willing to share data mining results, not the source data. In these cases, the data are distributed either horizontally or vertically. In a horizontal partition, different entities are described with the same schema in all partitions, while in a vertical partition the attributes of the same entities are split across the partitions. The existing solutions can be classified into Cryptography-Based Techniques and Generative-Based Techniques.
• Cryptography-Based Techniques: In the context of PPDM over distributed data, cryptography-based techniques have been developed to solve problems of the following nature: two or more parties want to conduct a computation based on their private inputs. The issue here is how to conduct such a computation so that no party knows anything except its own input and the results. This problem is referred to as the Secure Multi-Party Computation (SMC) problem (Goldreich, Micali, & Wigderson, 1987). The technique proposed in Lindell and Pinkas (2000) addresses privacy-preserving classification, while the techniques proposed in Kantarcioglu and Clifton (2002) and Vaidya and Clifton (2002) address privacy-preserving association rule mining, and the technique in Vaidya and Clifton (2003) addresses privacy-preserving clustering. (A small illustrative sketch of the SMC idea follows this list.)
• Generative-Based Techniques: These techniques are designed to perform distributed mining tasks. In this approach, each party shares just a small portion of its local model, which is used to construct the global model. The existing solutions are built over horizontally partitioned data. The solution presented in Veloso, Meira, Parthasarathy, and Carvalho (2003) addresses privacy-preserving mining of frequent itemsets in distributed databases, whereas the solution in Meregu and Ghosh (2003) addresses privacy-preserving distributed clustering using generative models.
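To make the SMC formulation above concrete, the following is a minimal sketch of the classic secure-sum protocol that is commonly used to introduce secure multi-party computation. It is an illustration under an honest-but-curious, ring-of-parties assumption and is not one of the protocols cited in this section; the function name and constants are our own. Distributed mining protocols of the kind cited above build on primitives like this to combine local statistics (e.g., support counts) without revealing any site’s individual values.

```python
import random

MODULUS = 2 ** 32  # all arithmetic is done modulo a large constant


def secure_sum(private_values):
    """Parties arranged in a ring learn the total of their private inputs
    without any single party seeing another party's individual value."""
    initiator_mask = random.randrange(MODULUS)

    # The initiator starts the ring with its value hidden by a random mask.
    running_total = (initiator_mask + private_values[0]) % MODULUS

    # Each remaining party adds its value to a masked partial sum,
    # so it never observes any individual input.
    for value in private_values[1:]:
        running_total = (running_total + value) % MODULUS

    # The masked total returns to the initiator, which removes its mask
    # and can announce the true sum to all parties.
    return (running_total - initiator_mask) % MODULUS


if __name__ == "__main__":
    site_counts = [120, 75, 310]    # e.g., local support counts of an itemset
    print(secure_sum(site_counts))  # 505
```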
Data Modification Techniques

Data modification techniques modify the original values of a database that needs to be shared so that privacy is preserved. The transformed database is made available for mining and must meet privacy requirements without losing the benefit of mining. In general, data modification techniques aim at finding an appropriate balance between privacy preservation and knowledge disclosure. Methods for data modification include noise addition techniques and space transformation techniques.
• Noise Addition Techniques: The idea behind noise addition techniques for PPDM is that some noise (e.g., information not present in a particular tuple or transaction) is added to the original data to prevent the identification of confidential information relating to a particular individual. In other cases, noise is added to confidential attributes by randomly shuffling the attribute values to prevent the discovery of some patterns that are not supposed to be discovered. We categorize noise addition techniques into three groups: (1) data swapping techniques that interchange the values of individual records in a database (Estivill-Castro & Brankovic, 1999); (2) data distortion techniques that perturb the data to preserve privacy, while the distorted data maintain the general distribution of the original data (Agrawal & Srikant, 2000); and (3) data randomization techniques, which allow one to perform the discovery of general patterns in a database with an error bound, while protecting individual values. Like data swapping and data distortion techniques, randomization techniques are designed to find a good compromise between privacy protection and knowledge discovery (Evfimievski, Srikant, Agrawal, & Gehrke, 2002; Rizvi & Haritsa, 2002; Zang, Wang, & Zhao, 2004). (A small illustrative sketch of these ideas follows this list.)
• Space Transformation Techniques: These techniques are specifically designed to address privacy-preserving clustering. They are designed to protect the underlying data values subjected to clustering without jeopardizing the similarity between objects under analysis. Thus, a space transformation technique must not only meet privacy requirements but also guarantee valid clustering results. We categorize space transformation techniques into two major groups: (1) object similarity-based representation, which relies on the similarity between objects; that is, a data owner could share some data for clustering analysis by simply computing the dissimilarity matrix (matrix of distances) between the objects and then sharing such a matrix with a third party. Many clustering algorithms in the literature operate on a dissimilarity matrix (Han & Kamber, 2001). This solution is simple to implement and is secure but requires a high communication cost (Oliveira & Zaïane, 2004); (2) dimensionality reduction-based transformation, which can be used to address privacy-preserving clustering when the attributes of objects are available either in a central repository or vertically partitioned across many sites. By reducing the dimensionality of a dataset to a sufficiently small value, one can find a tradeoff between privacy, communication cost, and accuracy. Once the dimensionality of a database is reduced, the released database preserves (or
slightly modifies) the distances between data points. In tandem with the benefit of preserving the similarity between data points, this solution protects individuals’ privacy since the attribute values of the objects in the transformed data are completely different from those in the original data (Oliveira & Zaïane, 2004).
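As a hedged illustration of the two data modification ideas just described, the sketch below shows (a) additive-noise randomization in the spirit of the value-distortion approaches cited above and (b) the object similarity-based representation, in which only a dissimilarity matrix is released for clustering. The function names, noise range, and toy data are our own assumptions; the published techniques additionally involve distribution reconstruction and careful parameter selection that are not shown here.

```python
import math
import random


def randomize(values, noise_range=25.0):
    """Additive-noise randomization sketch: each original value is perturbed
    by independent uniform noise, hiding individual values while the overall
    distribution can still be approximated by the miner."""
    return [v + random.uniform(-noise_range, noise_range) for v in values]


def dissimilarity_matrix(objects):
    """Object similarity-based representation sketch: the owner releases only
    pairwise Euclidean distances, never the attribute values themselves."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(objects[i], objects[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    ages = [23, 35, 47, 52, 61]
    print(randomize(ages))                   # perturbed ages
    customers = [(23, 40000), (35, 52000), (47, 61000)]
    print(dissimilarity_matrix(customers))   # shareable for clustering
```

A third party receiving only the dissimilarity matrix can still run distance-based clustering, while the original attribute values are never disclosed.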
Data Restriction Techniques

Data restriction techniques focus on limiting access to mining results through either generalization or suppression of information (e.g., items in transactions, attributes in relations), or even by blocking access to some patterns that are not supposed to be discovered. Such techniques can be divided into two groups: blocking-based techniques and sanitization-based techniques.
• Blocking-Based Techniques: These techniques aim at hiding some sensitive information when data are shared for mining. The private information includes sensitive association rules and classification rules that must remain private. Before releasing the data for mining, data owners must consider how much information can be inferred or calculated from large databases and must look for ways to minimize the leakage of such information. In general, under blocking-based techniques mined patterns may appear less frequent than they originally were, since sensitive information is either suppressed or replaced with unknowns to preserve privacy. The techniques in Johnsten and Raghavan (2001) address privacy preservation in classification, while the techniques in Johnsten and Raghavan (2002) and Saygin, Verykios, and Clifton (2001) address privacy-preserving association rule mining.
• Sanitization-Based Techniques: Unlike blocking-based techniques, which hide sensitive information by replacing some items or attribute values with unknowns, sanitization-based techniques hide sensitive information by strategically suppressing some items in transactional databases, or even by generalizing information to preserve privacy in classification. These techniques can be categorized into two major groups: (1) data-sharing techniques, in which the sanitization process acts on the data to remove or hide the group of sensitive association rules that contain sensitive knowledge. To do so, a small number of transactions that contain the sensitive rules have to be modified by deleting one or more items from them or even adding some noise, that is, new items not originally present in such transactions (Dasseni, Verykios, Elmagarmid, & Bertino, 2001; Oliveira & Zaïane, 2002, 2003a, 2003b; Verykios et al., 2004); and (2) pattern-sharing techniques, in which the sanitizing algorithm acts on the rules mined from a database, instead of the data itself. The existing solution removes all sensitive rules before the sharing process and blocks some inference channels (Oliveira, Zaïane, & Saygin, 2004). In the context of predictive modeling, a framework was proposed in Iyengar (2002) for preserving the anonymity of individuals or entities when data are shared or made public. (A small illustrative sketch of data-sharing sanitization follows this list.)
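The following is a deliberately simplified sketch of the data-sharing sanitization idea: hiding a sensitive association by suppressing an item from the transactions that support it. Published sanitization algorithms (e.g., Dasseni et al., 2001; Oliveira & Zaïane, 2002) modify only a carefully chosen subset of transactions to limit side effects on non-sensitive patterns; here, for brevity, every supporting transaction is altered, and the function and variable names are illustrative.

```python
def sanitize(transactions, sensitive_itemset, victim_item):
    """Remove one chosen item from every transaction that fully supports the
    sensitive itemset, so the sensitive rule loses its original support."""
    sanitized = []
    for t in transactions:
        if sensitive_itemset.issubset(t):
            sanitized.append(set(t) - {victim_item})  # suppress one item
        else:
            sanitized.append(set(t))
    return sanitized


if __name__ == "__main__":
    db = [{"bread", "milk", "beer"},
          {"bread", "beer"},
          {"milk", "diapers"}]
    # Suppose the pattern {bread, beer} represents a sensitive (strategic) rule.
    print(sanitize(db, {"bread", "beer"}, "beer"))
```

In practice, the victim item and the set of transactions to modify are chosen to minimize the impact on legitimate patterns, which is the hard part of the problem.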
Data Ownership Techniques

Data ownership techniques can be applied to two different scenarios: (1) to protect the ownership of data by people about whom the data were collected (Felty & Matwin, 2002). The idea behind this approach is that a data owner may prevent the data from being used for some purposes and allow them to be used for other purposes. To accomplish that, this solution is based on encoding permissions on the use
of data as theorems about programs that process and mine the data. Theorem-proving techniques are then used to guarantee that these programs comply with the permissions; and (2) to identify the entity that receives confidential data when such data are shared or exchanged (Mucsi-Nagy & Matwin, 2004). When sharing or exchanging confidential data, this approach ensures that no one can read confidential data except the receiver(s). It can be used in different scenarios, such as statistical or research purposes, data mining, and online business-to-business (B2B) interactions.
Are These Techniques Applicable to Web Data?

After describing the existing PPDM techniques, we now move on to analyze which of these techniques are applicable to Web data. To do so, hereinafter we use the following notation:
• WDT: These techniques are designed essentially to support Web usage mining; that is, the techniques address Web data applications only. We refer to these techniques as Web Data Techniques (WDT).
• GPT: These techniques can be used to support both public data release and Web-based applications. We refer to these techniques as General Purpose Techniques (GPT).

a. Cryptography-Based Techniques: These techniques can be used to support business collaboration on the Web. Scenario 2 (in section: The Basis of Privacy-Preserving Data Mining) is a typical example of a Web-based application which can be addressed by cryptography-based techniques. Other applications related to e-commerce can be found in the literature (Srivastava, Cooley, Deshpande, & Tan, 2000; Kou & Yesha, 2000). Therefore, such techniques are classified as WDT.
b. Generative-Based Techniques: These techniques can be applied to scenarios in which the goal is to extract useful knowledge from large, distributed data repositories. In these scenarios, the data cannot be directly centralized or unified as a single file or database due to legal, proprietary, or technical restrictions. In general, generative-based techniques are designed to support distributed Web-based applications.
c. Noise Addition Techniques: These techniques can be categorized as GPT. For instance, data swapping and data distortion techniques are used for public data release, while data randomization could be used to build models for online recommendations (Zang et al., 2004). Scenario 1 (in section: The Basis of Privacy-Preserving Data Mining) is a typical example of an online recommendation system.
d. Space Transformation Techniques: These are general purpose techniques (GPT). These techniques could be used to promote social benefits as well as to address applications on the Web (Oliveira & Zaïane, 2004). An example of social benefit occurs, for instance, when a hospital shares some data for research purposes (e.g., clusters of patients with the same disease). Space transformation techniques can also be used when the data mining process is outsourced or even when the data are distributed across many sites.
e. Blocking-Based Techniques: In general, these techniques are applied to protect sensitive information in databases. They could be used to simulate access control in a database in which some information is hidden from users who do not have the right to access it. However, these techniques can also be used to suppress confidential information before the release of data for mining. We classify such techniques as GPT.
f. Sanitization-Based Techniques: Like blocking-based techniques, sanitization-based techniques can be used by statistical offices that publish sanitized versions of data (e.g., the census problem). In addition, sanitization-based techniques can be used to build models for online recommendations as described in Scenario 1 (in section: The Basis of Privacy-Preserving Data Mining).
g. Data Ownership Techniques: These techniques implement a mechanism enforcing data ownership by the individuals to whom the data belong. When sharing confidential data, these techniques can also be used to ensure that no one can read confidential data except the receiver(s) that are authorized to do so. The most evident applications of such techniques are related to Web mining and online business-to-business (B2B) interactions.

Table 1 shows a summary of the PPDM techniques and their relationship with Web data applications.

Table 1. PPDM techniques and their categories

  PPDM Technique                    Category
  Cryptography-Based Techniques     WDT
  Generative-Based Techniques       WDT
  Noise Addition Techniques         GPT
  Space Transformation Techniques   GPT
  Blocking-Based Techniques         GPT
  Sanitization-Based Techniques     GPT
  Data Ownership Techniques         WDT
Requirements for Technical Solutions

Requirements for the Development of Technical Solutions

Ideally, a technical solution for a PPDM scenario would enable us to enforce privacy safeguards and to control the sharing and use of personal data. However, such a solution raises some crucial questions:

• What levels of effectiveness are in fact technologically possible, and what corresponding regulatory measures are needed to achieve these levels?
• What degrees of privacy and anonymity must be sacrificed to achieve valid data mining results?
These questions cannot have yes-no answers but involve a range of technological possibilities and social choices. The worst response to such questions is to ignore them completely and not pursue the means by which we can eventually provide informed answers. The above questions can be, to some extent, addressed if we provide some key requirements to guide the development of technical solutions. The following keywords are used to specify the extent to which an item is a requirement for the development of technical solutions to address PPDM:
• Must: This word means that the item is an absolute requirement.
• Should: This word means that valid reasons not to treat this item as a requirement may exist, but the full implications should be understood and the case carefully weighed before discarding this item.

a. Independence: A promising solution for the problem of PPDM, for any specific data mining task (e.g., association rules, clustering, and classification), should be independent of the mining task algorithm.
b. Accuracy: When it is possible, an effective solution should do better than a trade-off between privacy and accuracy on the disclosure of data mining results. Sometimes a trade-off must be found, as in Scenario 2 (in section: The Basis of Privacy-Preserving Data Mining).
c. Privacy Level: This is also a fundamental requirement in PPDM. A technical solution must ensure that the mining process does not violate privacy up to a certain degree of security.
d. Attribute Heterogeneity: A technical solution for PPDM should handle heterogeneous attributes (e.g., categorical and numerical).
e. Communication Cost: When addressing data distributed across many sites, a technical solution should carefully consider issues of communication cost.
Requirements to Guide the Deployment of Technical Solutions

Information technology vendors in the near future will offer a variety of products which claim to help protect privacy in data mining. How can we evaluate and decide whether what is offered is useful? Given the non-existence of proper instruments to evaluate the usefulness and feasibility of such solutions, the challenge is to identify the following requirements:

a. Privacy Identification: We should identify what information is private. Is the technical solution aimed at protecting individual privacy or collective privacy?
b. Privacy Standards: Does the technical solution comply with international instruments that state and enforce rules (e.g., principles and/or policies) for the use of automated processing of private information?
c. Privacy Safeguards: Is it possible to record what has been done with private information and to be transparent with the individuals to whom the private information pertains?
d. Disclosure Limitation: Are there metrics to measure how much private information is disclosed? Since privacy has many meanings depending on the context, we may require a set of metrics to do so. What is most important is that we need to measure not only how much private information is disclosed but also the impact of a technical solution on the data and on valid mining results.
e. Update Match: When a new technical solution is launched, two aspects should be considered: (1) the solution should comply with existing privacy principles and policies, and (2) in case of modifications to the privacy principles and/or policies that guide the development of technical solutions, any release should consider these new modifications.
Future Research Trends

Preserving privacy on the Web has an important impact on many Web activities and Web applications. In particular, privacy issues have attracted a lot of attention due to the growth of e-commerce and e-business. These issues are further complicated by the global and self-regulatory nature of the Web. Privacy issues on the Web are based on the fact that most users want to maintain strict anonymity on Web applications and activities. The easy access to information on the Web, coupled with the readily available personal data, also makes it easier and more tempting for interested parties (e.g., businesses and governments) to willingly or inadvertently intrude on individuals’ privacy in unprecedented ways. Clearly, privacy on Web data is an umbrella issue that encompasses many Web applications such as e-commerce, stream data mining, and multimedia mining, among others. In this work, we focus on issues toward a foundation for further research in PPDM on the Web because these issues will certainly play a significant role in the future of this area. In particular, a common framework for PPDM should be conceived, notably in terms of definitions, principles, policies, and requirements. The advantages of a framework of that nature are as follows: (a) a common framework will avoid confusing developers, practitioners, and many others interested in PPDM on the Web; (b) adoption of a common framework will inhibit inconsistent efforts in different directions and will enable vendors and developers to make solid advances in the future of research in PPDM on the Web. The success of a framework of this nature can only be guaranteed if it is backed by a legal framework such as the Platform for Privacy Preferences (P3P) Project (Joseph & Faith, 1999). This project is emerging as an industry standard providing a simple, automated way for users to gain more control over the use of personal information on Web sites they visit. The European Union has taken a lead in setting up a regulatory framework for Internet privacy and has issued a directive that sets guidelines for the processing and transfer of personal data (European Commission, 1998).
Conclusion

In this chapter, we have laid down the foundations for further research in the area of Privacy-Preserving Data Mining (PPDM) on the Web. Although our work described in this chapter is preliminary and conceptual in nature, it is a vital prerequisite for the development and deployment of new techniques. In particular, we described the problems we face in defining what information is private in data mining. We then described the basis of PPDM including the historical roots, a discussion on how privacy can be violated in data mining, and the definition of privacy preservation in data mining based on users’ personal information and information concerning their collective activities. We also introduced a taxonomy of the existing PPDM techniques and a discussion on how these techniques are applicable to Web data. Subsequently, we suggested some desirable privacy requirements that are related to industrial initiatives. These requirements are essential for the development and deployment of technical solutions. Finally, we pointed to standardization issues as a technical challenge for future research trends in PPDM on the Web.
References

Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 439-450), Dallas, TX.

Clifton, C., Kantarcioglu, M., & Vaidya, J. (2002). Defining privacy for data mining. Proceedings of the National Science Foundation Workshop on Next Generation Data Mining (pp. 126-133), Baltimore.

Clifton, C., & Marks, D. (1996). Security and privacy implications of data mining. Proceedings of the Workshop on Data Mining and Knowledge Discovery (pp. 15-19), Montreal, Canada.

Cockcroft, S., & Clutterbuck, P. (2001). Attitudes towards information privacy. Proceedings of the 12th Australasian Conference on Information Systems, Coffs Harbour, NSW, Australia.

Culnan, M.J. (1993). How did they get my name? An exploratory investigation of consumer attitudes toward secondary information. MIS Quarterly, 17(3), 341-363.

Dasseni, E., Verykios, V.S., Elmagarmid, A.K., & Bertino, E. (2001). Hiding association rules by using confidence and support. Proceedings of the 4th Information Hiding Workshop (pp. 369-383), Pittsburgh, Pennsylvania.

Estivill-Castro, V., & Brankovic, L. (1999). Data swapping: Balancing privacy against precision in mining for logic rules. Proceedings of Data Warehousing and Knowledge Discovery DaWaK-99, Florence, Italy (pp. 389-398).

European Commission. (1998). The directive on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Retrieved May 7, 2005, from http://www2.echo.lu

Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 217-228), Edmonton, AB, Canada.

Felty, A.P., & Matwin, S. (2002). Privacy-oriented data mining by proof checking. Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD) (pp. 138-149), Helsinki, Finland.

Garfinkel, S. (2001). Database nation: The death of privacy in the 21st century. Sebastopol, CA: O’Reilly & Associates.

Goldreich, O., Micali, S., & Wigderson, A. (1987). How to play any mental game: A completeness theorem for protocols with honest majority. Proceedings of the 19th Annual ACM Symposium on Theory of Computing (pp. 218-229), New York.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.

Iyengar, V.S. (2002). Transforming data to satisfy privacy constraints. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada (pp. 279-288).
Johnsten, T., & Raghavan, V.V. (2001). Security procedures for classification mining algorithms. Proceedings of the 15th Annual IFIP WG 11.3 Working Conference on Database and Applications Security, Niagara on the Lake, Ontario, Canada (pp. 293-309).

Johnsten, T., & Raghavan, V.V. (2002). A methodology for hiding knowledge in databases. Proceedings of the IEEE ICDM Workshop on Privacy, Security, and Data Mining, Maebashi City, Japan (pp. 9-17).

Joseph, R., & Faith, C.L. (1999). The platform for privacy preferences. Communications of the ACM, 42(2), 48-55.

Kantarcioglu, M., & Clifton, C. (2002). Privacy-preserving distributed mining of association rules on horizontally partitioned data. Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Madison, Wisconsin.

Klösgen, W. (1995). KDD: Public and private concerns. IEEE EXPERT, 10(2), 55-57.

Kou, W., & Yesha, Y. (2000). Electronic commerce technology trends: Challenges and opportunities. IBM Press Alliance Publisher: David Uptmor, IIR Publications, Inc.

Lindell, Y., & Pinkas, B. (2000). Privacy preserving data mining. Crypto 2000, LNCS 1880, 36-54.

Meregu, S., & Ghosh, J. (2003). Privacy-preserving distributed clustering using generative models. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM’03), Melbourne, FL (pp. 211-218).

Mucsi-Nagy, A., & Matwin, S. (2004). Digital fingerprinting for sharing of confidential data. Proceedings of the Workshop on Privacy and Security Issues in Data Mining, Pisa, Italy (pp. 11-26).

O’Leary, D.E. (1991). Knowledge discovery as a threat to database security. In G. Piatetsky-Shapiro, & W.J. Frawley (Eds.), Knowledge discovery in databases (pp. 507-516). Menlo Park, CA: AAAI/MIT Press.

O’Leary, D.E. (1995). Some privacy issues in knowledge discovery: The OECD personal privacy guidelines. IEEE EXPERT, 10(2), 48-52.

Oliveira, S.R.M., & Zaïane, O.R. (2002). Privacy preserving frequent itemset mining. Proceedings of the IEEE ICDM Workshop on Privacy, Security, and Data Mining, Maebashi City, Japan (pp. 43-54).

Oliveira, S.R.M., & Zaïane, O.R. (2003a). Algorithms for balancing privacy and knowledge discovery in association rule mining. Proceedings of the 7th International Database Engineering and Applications Symposium (IDEAS’03), Hong Kong, China (pp. 54-63).

Oliveira, S.R.M., & Zaïane, O.R. (2003b). Protecting sensitive knowledge by data sanitization. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM’03), Melbourne, FL (pp. 613-616).

Oliveira, S.R.M., & Zaïane, O.R. (2004). Privacy-preserving clustering by object similarity-based representation and dimensionality reduction transformation. Proceedings of the Workshop on Privacy and Security Aspects of Data Mining (PSADM’04) in conjunction with the 4th IEEE International Conference on Data Mining (ICDM’04), Brighton, UK (pp. 21-30).
Oliveira, S.R.M., Zaïane, O.R., & Saygin, Y. (2004). Secure association rule sharing. Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’04), Sydney, Australia (pp. 74-85).

Piatetsky-Shapiro, G. (1995). Knowledge discovery in personal data vs. privacy: A mini-symposium. IEEE Expert, 10(2), 46-47.

Rezgui, A., Bouguettaya, A., & Eltoweissy, M.Y. (2003). Privacy on the Web: Facts, challenges, and solutions. IEEE Security & Privacy, 1(6), 40-49.

Rizvi, S.J., & Haritsa, J.R. (2002). Maintaining data privacy in association rule mining. Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China.

Rosenberg, A. (2000). Privacy as a matter of taste and right. In E.F. Paul, F.D. Miller, & J. Paul (Eds.), The right to privacy (pp. 68-90). Cambridge University Press.

Saygin, Y., Verykios, V.S., & Clifton, C. (2001). Using unknowns to prevent discovery of association rules. SIGMOD Record, 30(4), 45-54.

Schoeman, F.D. (1984). Philosophical dimensions of privacy. Cambridge University Press.

Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23.

Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada (pp. 639-644).

Vaidya, J., & Clifton, C. (2003). Privacy-preserving K-means clustering over vertically partitioned data. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC (pp. 206-215).

Veloso, A.A., Meira, Jr., W., Parthasarathy, S., & Carvalho, M.B. (2003). Efficient, accurate and privacy-preserving data mining for frequent itemsets in distributed databases. Proceedings of the 18th Brazilian Symposium on Databases, Manaus, Brazil (pp. 281-292).

Verykios, V.S., Elmagarmid, A.K., Bertino, E., Saygin, Y., & Dasseni, E. (2004). Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 434-447.

Warren, S.D., & Brandeis, L.D. (1890). The right to privacy. Harvard Law Review, 4(5), 193-220.

Westin, A.F. (1967). Privacy and freedom. New York: Atheneum.

Zang, N., Wang, S., & Zhao, W. (2004). A new scheme on privacy preserving association rule mining. Proceedings of the 15th European Conference on Machine Learning and the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy.
This work was previously published in Encyclopedia of Web and Information Security, edited by E. Ferrari & B. Thuraisingham, pp. 282-301, copyright 2006 by Information Science Reference, formerly known as Idea Group Reference (an imprint of IGI Global).
Section IV
Information Retrieval and Extraction
Chapter XXIX
Automatic Reference Tracking

G.S. Mahalakshmi, Anna University, Chennai, India
S. Sendhilkumar, Anna University, Chennai, India
Abstract

Automatic reference tracking involves systematic tracking of reference articles listed for a particular research paper by extracting the references of the input seed publication and further analyzing the relevance of each referred paper with respect to the seed paper. This tracking continues recursively, with every reference paper being assumed as a seed paper at every track level, until the system finds any irrelevant (or far relevant) references deep within the reference tracks which do not help much in the understanding of the input seed research paper at hand. The relevance is analysed based on the keywords collected from the title and abstract of the referred article. The objective of the reference tracking system is to automatically list down closely relevant reference articles to aid the understanding of the seed paper, thereby facilitating the literature survey of the aspiring researcher. This chapter proposes the system design and evaluation of an automatic reference tracking system and discusses the observations obtained.
INTRODUCTION

The World Wide Web (WWW), which is expanding every day, has become a repository of up-to-date information for scientific scholars and researchers. Every day, numerous publications are deposited in a highly distributed fashion across various sites. The collection of online journals is also rapidly increasing. The increasing proportion of online scholarly literature makes it desirable to cite such works as references. The Web’s current navigation model of browsing from site to site does not facilitate retrieving and integrating data from multiple sites.
Linking documents seems to be a natural proclivity of scholars. In search of additional information, or of information presented in simpler terms, the user is generally misled by an abundant pool of publications, not knowing the direction of his/her search in relevance to the research problem, and often forgets the source that motivated the user to initiate such navigation over research publications. Often budding researchers find themselves misled amidst conceptually ambiguous references while exploring a particular seed scholarly literature, either aiming at a clearer understanding of the seed document or trying to find the crux of the concept behind the journal or conference article. Integration of bibliographical information of various publications available on the Web is an important task for researchers to understand the essence of recent advancements and also to avoid misleading the user over distributed publication information across various sites. Automatic reference tracking (ART) appears as a good shepherd for the struggling researcher by listing the relevant reference articles to the user’s purview. The relevant articles identified and downloaded are stored for later use. ART-listed references (which are the actual references across various levels from the seed article) are represented year wise, author wise or relevance wise [Mahalakshmi G.S. and Sendhilkumar Selvaraju, 2006] for better browsing of ART output. ART involves recursive tracking of reference articles listed for a particular research paper by extracting the references of the input seed publication and further analyzing the relevance [Valerie V. Cross, 2001] of the referred paper with respect to the seed paper. This recursion continues with every reference paper being assumed as seed paper at every track level until the system finds any irrelevant (or farthest) references deep within the reference tracks which do not help much in the understanding of the input seed publication at hand. Tracking of references implies locating the references in the Web through commercial search engines and successfully downloading them [Junjie Chen, Lizhen Liu, Hantao Song and Xueli Yu, 2001] to be fed for subsequent recursions. Web pages, however, change constantly in relation to their contents and existence, often without notification [Xiangzhu Gao, San Murugesan and Bruce Lo, 2003]. Therefore, maintaining an up-to-date Web page address for every research reference article is not possible. ART has a dynamic approach for retrieving the articles across the Web. The title of every reference article is extracted from the reference region after successful reference parsing. Later, the title is submitted as a query to search engines and the article is harvested irrespective of the location of the article in the Web. During the tracking of references, pulling down each reference entry and parsing it results in metadata [Day M.Y., Tzong-Hon Tsai, Sung C.L., Lee C.W., Wu S.H., Ong C.S. and Hsu W.L., 2005; Mahalakshmi G.S. and Sendhilkumar Selvaraju, 2006], which is populated into the database to promote further search in a recursive fashion. Depending upon the user’s needs, the data is projected in a suitable representation. The representation is used to track information on a specific attribute specified by the user, such as the author, title, or year of publication, thereby providing complete information about the references for the article cited.
This chapter discusses the design and implementation of a reference tracking system that deals with bringing reference linking value to the scholarly side of the Web and also suitably visualizes the collected document in a more understandable manner. The research described experimentally demonstrates that such systems encourage the viewing and use of information that would not otherwise be viewed, by reducing the cognitive effort required to find, evaluate and access information. In this work, we assume the seed documents are in pdf format and the tracker also looks for only pdf documents from the commercial search engine. Therefore, we ignore reference articles even if they
are present in non-pdf formats across the Web [Mahalakshmi G.S. and Sendhilkumar Selvaraju, 2006]. We also state that only references listed in IEEE format are considered for reference parsing, and the proposed work needs further extension in these directions. With all these assumptions, ART still has a serious performance bottleneck caused by recursively retrieving online reference documents, because there is no limit on the number of reference articles retrieved at each track level. Therefore, for implementation purposes, we restrict ART to retrieving reference articles from only two levels. The results are tabulated and the observations are discussed.
RELATED WORK

Citation systems like CiteSeer [Citeseer, 2007] do not retrieve the entire contents of the file referred to; instead, only the hyperlink to the file in their respective database storage is displayed in the output. Therefore, such systems are not reliable, since successful retrieval is highly dependent on the cached scholarly articles in their own databases (the alert ‘Document not in database’ from CiteSeer). Google Scholar [Google Scholar, 2007] also performs the search for a given query over scholarly articles, but the search results are again a set of abstracts or mere hyperlinks to the locations of the research articles. In a nutshell, these scholarly databases retrieve the indexes to articles which speak about the user’s area of interest (as indicated by the user’s seed during the search). It is for the user to click and download the indexed documents and find the most appropriate research article. But this creates inflexibility and delay for the user who is targeting focused information. There may be articles across domains which have actually cited the particular user-interested research paper. When all these are listed to the user’s purview, it is again a challenge to the user to identify and categorize the desired information. Conceptually, our idea is very much distinguished in that the references of the seed scholarly article are retrieved automatically and listed year wise. In ART we perform a relevance-based search and populate the reference database with the results, to facilitate the scholar’s future survey over scholarly literatures.
MOTIVATION

The active bibliography section of CiteSeer is the main theme around which the proposed system is developed. The citations listed locate the articles in which the seed document has been referred to since it was published. But ART tracks the seed scholarly article chronologically backwards by only analyzing the listed bibliographies. The flexibility of representing the articles year wise, relevance based or author based is the primary reason for which the reference tracker is developed. Another important feature, which differentiates the functioning of ART, is the similarity of documents. The document similarity is analyzed from the paper title, keywords etc., which is the unique approach in improving the performance of the reference tracking system [Shen L., Lim Y.K. and Loh H.T., 2004]. Like JITIR agents [Rhodes B. and Maes P., 2000], ART presents the information in such a way that it can be ignored, but is still easy to access should it be desirable. ART searches the entire Web for the referred scholarly literature and retrieves the reference articles from anywhere on the Web, even when the authorized copy available online is secured. More often, upon submitting the research findings to journals, the authors generally attach the draft article to their home page, and of course without violation
of copyrights of the publisher involved. ART fetches these articles and lists them to the user’s purview, thereby satisfying the only objective of aiding a researcher during the online survey of literatures. By this, we mean there is no security breach in the automatic fetching of online articles; further, upon acquiring the key for protected sites, a thorough search and reference retrieval shall be achieved. In summary, the proposed reference tracking system performs a relevance-based search and populates the reference database with the results, to facilitate the scholar’s future survey over scholarly literatures.
DESIGN OF AUTOMATIC REFERENCE TRACKER

The reference tracking system framework, shown in Figure 1(a), has two sections to deal with, namely, reference document retrieval and reference document representation. Of these, reference document representation is merely a hierarchical representation of the collected documents based on indexes like author, title, year of publication or relevance of key information. Reference document retrieval is the most challenging part of the online reference tracking system. The challenge lies in managing the pool of downloaded reference literatures in the database in some order, which rules out certain issues in reference document retrieval. The entire ART framework is designed such that each reference document retrieved is stored as a unique paper and can be tracked from the seed document. Assigning a unique paper id for each reference document facilitates the accessing of track details. Each document carries a back reference id, which points to the pid of the base document that refers to it. This helps to track down the hierarchy of retrieved documents. The track result is stored as a table containing the starting paper and ending paper for a particular track, as shown in Figure 1(b).

Figure 1(a). Automatic reference tracking system (ART)

Figure 1(b). Track details
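As a minimal sketch of how the track metadata just described (unique paper id, back reference id, and the start and end papers of a track) might be stored, consider the schema below. The table and column names are illustrative assumptions, not the actual ART schema.

```python
import sqlite3

# Illustrative track-metadata schema; not the schema used by ART itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE paper (
    pid          INTEGER PRIMARY KEY,  -- unique paper id
    title        TEXT,
    year         INTEGER,
    back_ref_pid INTEGER,              -- pid of the base document that refers to this paper
    relevance    INTEGER               -- relevance count w.r.t. the seed document
);
CREATE TABLE track (
    track_id  INTEGER PRIMARY KEY,
    start_pid INTEGER,                 -- starting paper of the track
    end_pid   INTEGER                  -- ending paper of the track
);
""")

# Seed paper and one of its references, one track level deep.
conn.execute("INSERT INTO paper VALUES (1, 'Seed paper', 2006, NULL, 100)")
conn.execute("INSERT INTO paper VALUES (2, 'Referred paper', 2001, 1, 60)")
conn.execute("INSERT INTO track VALUES (1, 1, 2)")

# Walking back_ref_pid reconstructs the hierarchy from any retrieved paper.
print(conn.execute("SELECT pid, title, back_ref_pid FROM paper").fetchall())
```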
Reference Document Retrieval

Reference documents are retrieved by a recursive algorithm, which inputs the seed scholarly literature and returns a hierarchy of reference documents that are relevant to the seed scholarly literature. The main tracker framework inputs the seed document and preprocesses it. Reference document retrieval consists of the following steps:

• Reference Extraction
• Reference Parsing
• Reference filtering based on relevance
• Link Extraction and Data Retrieval

From the text extracted, the references are extracted as specified in the seed scholarly literature. Later, each individual reference is parsed and metadata about the reference like author, year, source, title and publisher is obtained. From the metadata, unnecessary references are filtered out of the search results by comparing them with relevant keywords of the seed scholarly literature. After eliminating duplicate and irrelevant links, the extraction of the online URL is initiated [Donna Bergmark and Carl Lagoze, 2001]. If a URL is directly stated in the reference cited, the document is retrieved from the link. If a URL is not found, the system searches the Web using the title of the scholarly literature with the help of a search engine (Yahoo for our work) and the exact reference literature is retrieved. After retrieving the references, the reference scholarly literatures are downloaded and the documents are again preprocessed to track down their references.
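The retrieval loop described above can be summarized by the following sketch. The helper functions are simplified stand-ins for the Reference Extraction, Parsing, Filtering, and Link Extraction/Data Retrieval modules discussed in the subsections that follow; their names, the in-memory "corpus", and the keyword-overlap filter are our own assumptions, and the two-level limit mirrors the implementation restriction stated earlier.

```python
def extract_references(doc):
    # Placeholder for Reference Extraction: in ART this parses the
    # References section of the preprocessed PDF text.
    return doc.get("references", [])


def relevance_count(ref, seed_keywords):
    # Keyword overlap between a reference title and the seed keyword array.
    return len(set(ref["title"].lower().split()) & seed_keywords)


def retrieve_document(ref, corpus):
    # Placeholder for Link Extraction and Data Retrieval: in ART this
    # downloads the PDF located via a commercial search engine.
    return corpus.get(ref["title"])


def track(doc, seed_keywords, corpus, level=0, max_levels=2, threshold=1):
    """Recursively track references: every relevant reference becomes the
    seed of the next level until max_levels (two, as assumed here) is hit."""
    if level >= max_levels:
        return []
    hits = []
    for ref in extract_references(doc):
        if relevance_count(ref, seed_keywords) < threshold:
            continue                               # reference filtering
        child = retrieve_document(ref, corpus)
        if child is None:
            continue
        hits.append((level + 1, ref["title"]))
        hits.extend(track(child, seed_keywords, corpus, level + 1,
                          max_levels, threshold))
    return hits


if __name__ == "__main__":
    corpus = {
        "reference tracking for web documents":
            {"references": [{"title": "mining citation graphs"}]},
        "mining citation graphs": {"references": []},
    }
    seed = {"references": [{"title": "reference tracking for web documents"},
                           {"title": "cooking with truffles"}]}
    keywords = {"reference", "tracking", "web", "mining"}
    print(track(seed, keywords, corpus))
```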
Reference Extraction

Preprocessing converts the input raw document into the required text format. From the output text obtained from the seed document, the reference section is located. Each reference in the reference section is separated. They are extracted and fed as input to the reference-parsing module to be processed individually.

Figure 2. Reference extraction (the processed seed document is split into individual references)
Figure 3. Sample reference information extraction

Input references:
1. Min-Yuh Day, “A Knowledge-Based Approach to Citation Extraction,” Institute of Information Science, Taipei, Taiwan, 2005.
2. Donna Bergmark, “Automatic Extraction of Reference Linking Information from Online Documents,” Cornell Digital Library Research Group, 2000.

Extracted metadata:
1. Author = Min-Yuh Day; Title = A Knowledge-Based Approach to Citation Extraction; Source = Institute of Information Science, Taipei, Taiwan; Year = 2005
2. Author = Donna Bergmark; Title = Automatic Extraction of Reference Linking Information from Online Documents; Source = Cornell Digital Library Research Group; Year = 2000
Steps for Reference Extraction

1. Get the document’s preprocessed text.
2. Get the reference paragraph’s heading position for the document text by comparing the paragraph line spacing.
3. Get all the text under the reference heading by comparing the headings list with the string array containing all possible words for “reference”.
4. Get the initial index of the reference paragraph.
5. Formulate the index format of the references.
6. Separate each reference text by the indexes given.
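A simplified sketch of these steps is shown below. It locates the reference heading in the preprocessed text and splits the trailing text into individual entries, assuming IEEE-style "[n]" indexes; real documents additionally require the line-spacing and index-format analysis described above, so the heading list and patterns here are illustrative.

```python
import re


def extract_references(text):
    """Return the individual reference strings found after the last
    reference-style heading in the preprocessed document text."""
    headings = ("references", "bibliography", "literature cited")
    lower = text.lower()
    best, heading_len = -1, 0
    for h in headings:                      # steps 2-3: find the heading
        pos = lower.rfind(h)
        if pos > best:
            best, heading_len = pos, len(h)
    if best == -1:
        return []
    section = text[best + heading_len:]     # step 4: text after the heading
    # Steps 5-6: assume IEEE-style "[n]" indexes and split entries on them.
    parts = re.split(r"\s*\[\d+\]\s*", section)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    sample = """... body text ...
References
[1] A. Author, "A Title," Some Conference, 2004.
[2] B. Writer, "Another Title," Some Journal, 2006."""
    for ref in extract_references(sample):
        print(ref)
```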
Reference Parsing

All the references listed in a publication’s reference section may not be in the same reference format. In other words, reference formatting for journals, books and conference proceedings (called reference sub-formats) varies in structure and style, and hence identification of the individual formats and sub-formats is essential to parse them accordingly. Metadata like author, publisher, title and year should also be extracted, as they are required for framing a query to search for the respective reference publication online (refer Figure 3). Parsing of reference formats in our implementation is limited to IEEE and other variants of the format. After retrieval, the metadata of relevant references is stored in the database.
Steps for Reference Parsing

1. Formulate the different referencing formats that are commonly used.
2. For each reference string given, repeat steps 3 to 5.
3. Parse it using the reference patterns. If parsing is successful, extract the metadata and store it in the database.
4. Else parse it using delimiters and string patterns. Record the new pattern and store it. If parsing is successful, extract the metadata and store it in the database.
5. Once a referencing pattern is obtained, apply it to all reference strings and extract the metadata.
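The following is a small sketch of the pattern-based parsing step for IEEE-like entries of the form Author(s), "Title," Source, Year. A real parser needs a library of such patterns plus the delimiter-based fallback described above; this single regular expression is an illustrative assumption.

```python
import re

# Matches entries such as:  Author, "Title," Source, Year.
IEEE_PATTERN = re.compile(
    r'^(?P<author>[^"]+?),\s*"(?P<title>[^"]+?),?"\s*'
    r'(?P<source>.*?),?\s*(?P<year>(19|20)\d{2})\.?$'
)


def parse_reference(ref_text):
    """Return metadata (author, title, source, year) for a reference string,
    or None if the entry does not match the known pattern."""
    match = IEEE_PATTERN.match(ref_text.strip())
    if not match:
        return None
    meta = match.groupdict()
    meta["source"] = meta["source"].strip().rstrip(",")
    return meta


if __name__ == "__main__":
    ref = ('Donna Bergmark, "Automatic Extraction of Reference Linking '
           'Information from Online Documents," Cornell Digital Library '
           'Research Group, 2000.')
    print(parse_reference(ref))
```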
Title-based Reference Filtering The reference articles obtained have to be filtered in order to provide relevant documents to the user. This is achieved by obtaining keywords from the seed document. The region of the seed document and the extracted reference papers, from which the keywords are obtained for comparison, has a major impact in filtering out irrelevant references [Minjuan Zhong, Zhiping Chen and Yaping Lin, 2004]. ART differs from the online reference tracker [Mahalakshmi G.S. and Sendhilkumar Selvaraju, 2006] in policies on reference filtering and reference caching. Reference Caching: Reference Caching indirectly influences Reference Filtering. Generally, in automatic reference tracking [Mahalakshmi G.S. and Sendhilkumar Selvaraju, 2006], caching of references is performed with a track database, so that, reference articles harvested are retained across multiple user sessions. This retaining of references across multiple search sessions forms a kind of implicit feedback [Ryen W. White, 2005; Xuehua Shen, Bin Tan and ChengXiang Zhai, 2005], which makes the system unobtrusive. To avoid performance bottlenecks, ART does not maintain integrity across search sessions of a researcher [Xuehua Shen, Bin Tan and ChengXiang Zhai, 2005], i.e. at the end of every search session the track database is emptied. This is based on the assumption that the researcher may be convinced with provision of necessary articles in the current session, which aids the understanding of the concept of the seed article. Implicit feedback information retrieval systems gather information to better represent searcher needs whilst minimizing the burden of directly providing the relevance information [Ryen W. White, 2005]. In ART, implicit feedback [Xuehua Shen, Bin Tan and ChengXiang Zhai, 2005] is achieved by identifying cross-references across various track levels so that, such reference articles though parsed successfully every time, shall be stopped from multiple downloading. Therefore, even though the reference articles in deeper track levels may result in a Hit with the Web, such articles may not be of interest for reference harvesting due to their availability in the track database, and they have to be ignored while reference filtering. Two other innovative and direct ways of filtering the references has been introduced in ART: Titlebased Relevance Filtering and Abstract-based Relevance Filtering. Reference Filtering: In title-based filtering, the title of every reference article is obtained at the stage of reference parsing, decisions regarding whether to include the reference article for reference extraction or not shall be made before submitting the query to the search engine [Xuehua Shen, Bin Tan and ChengXiang Zhai, 2005]. Therefore, title based keyword analysis is much easier for filtering the references. The metadata of each reference is compared with the set of keywords to obtain relevance count, a measure of relevance to the seed document. From the relevance count, the user can segregate documents based on a threshold limit.
Figure 4. Snapshot of parsed references with relevance count
Steps for Relevance Calculation

1. Collect relevant keywords from the Keywords paragraph and add them to the keyword string array.
2. Parse the main document's title to get the relevant keywords; eliminate all the non-relevant words from the title and add the rest to the keyword string array.
3. Set the relevance count for the seed document to 100.
4. For each reference, parse the reference document's title to get the main keywords of the document, eliminating all the non-relevant words.
5. For each reference document, compare the main keywords with the keyword string array.
6. Calculate the relevance count based on the number of matches.
From the relevance count, the user can segregate documents based on a threshold limit. When the threshold is set to zero, every reference article extracted is projected to the user (Figures 4 and 5). As the reference tracking proceeds, zero filtering (relevance filtering with threshold = 0) lets the reference database accumulate more reference articles and thereby enables us to study the performance of the reference tracker in handling a voluminous database.
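The steps above can be sketched roughly as follows. The stopword list, the tokenisation and the normalisation of the match count to a 0-100 scale are our own assumptions; the chapter only specifies that the relevance count is derived from the number of keyword matches and that the seed document itself is fixed at 100.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "with", "to", "based"}

def extract_keywords(text):
    """Lower-case, tokenise and drop non-relevant (stop) words."""
    return {w for w in re.findall(r"[a-z0-9\-]+", text.lower()) if w not in STOPWORDS}

def relevance_count(seed_title, seed_keywords, reference_title):
    """Share of reference-title keywords found in the seed keyword set (0-100)."""
    seed_set = extract_keywords(seed_title) | extract_keywords(seed_keywords)
    ref_set = extract_keywords(reference_title)
    if not ref_set:
        return 0
    return round(100.0 * len(ref_set & seed_set) / len(ref_set))

seed_title = "A Soft Relevance Framework in Content-Based Image Retrieval Systems"
seed_keywords = "relevance feedback, image retrieval, soft computing"
ref_title = "Relevance Feedback Techniques for Image Retrieval"
score = relevance_count(seed_title, seed_keywords, ref_title)
if score >= 50:          # threshold assumed at 50%, as in the experiments
    print("keep reference:", score)
```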
Figure 5. Meta-data of parsed references
Title-based reference filtering helps ART limit the articles that are downloaded. However, there is still a chance of irrelevant articles being downloaded into the track database of ART, because title-based reference filtering harvests only plain keywords from the title of the seed article and not from its other relevant sections. Therefore, to provide better results, abstract-based reference filtering is introduced in reference document representation.
Link Extraction and Data Retrieval

The references are segregated into linkable and non-linkable formats. The linkable ones (HTTP references) are retrieved directly from the WWW (refer to Figure 6). For non-linkable references, keywords are identified and provided to a search engine (Yahoo, for our work) to locate the hyperlink of the reference document.
Steps for Reference Document Retrieval

1. If a link is found in the metadata, get the link. If the link is not present in the specified format, continue to search for the required format.
Figure 6. Reference document retrieval (a parsed reference is split into linkable and non-linkable references; linkable HTTP references are fetched directly from the WWW, while non-linkable references are turned into a search query built from the metadata)
2. If the link contains a Web site address for the paper, add it to the search string.
3. Get the title of the paper and add it to the search string.
4. Append the search string to a search engine URL with the required options.
5. Retrieve the page for the URL.
6. Get all the links in the retrieved Web page.
7. Choose the exact link with the specified format and return the link.
The locations obtained through the search results are analyzed and the particular reference publication is downloaded. The retrieved documents are again pre-processed to extract their reference sections, and they are used for further tracking up to a certain extent based on their relevance.
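A rough sketch of steps 1 to 7 is given below. The search URL, the filetype operator and the helper names are illustrative assumptions (operator syntax varies by engine); the actual implementation submits the query to Yahoo and inspects the links on the returned result page.

```python
from urllib.parse import quote_plus

SEARCH_URL = "https://search.yahoo.com/search?p={query}"   # illustrative only

def build_search_url(metadata):
    """Compose a search query from the parsed reference metadata (steps 1-4)."""
    parts = []
    if metadata.get("link"):                  # step 1-2: reuse any embedded link
        parts.append(metadata["link"])
    parts.append('"%s"' % metadata["title"])  # step 3: quote the title
    parts.append("filetype:pdf")              # step 4: restrict to the required format
    return SEARCH_URL.format(query=quote_plus(" ".join(parts)))

def choose_link(result_links):
    """Steps 6-7: keep the first result link that points to the required format."""
    for url in result_links:
        if url.lower().endswith(".pdf"):
            return url
    return None

meta = {"title": "Context-Sensitive Information Retrieval Using Implicit Feedback"}
print(build_search_url(meta))
print(choose_link(["http://example.org/paper.ps", "http://example.org/paper.pdf"]))
```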
Reference Document Representation

Requirements Projection

Reference document representation involves analyzing the metadata and projecting the references visually [Junjie Chen, Lizhen Liu, Hantao Song and Xueli Yu, 2001] in a hierarchical fashion according to the user's requirements. This requirements projection involves some in-depth segregation of the stored documents based on the metadata available in the database. Abstract-based reference filtering can be used as a second filter before the data is actually projected to the user.
Abstract Based Reference Filtering

Filtering the tracked irrelevant (or not closely relevant) articles out of the user's view is the motivation behind this work. In plain keyword-based relevance filtering [Mahalakshmi G.S. and Sendhilkumar Selvaraju, 2006], elimination of reference articles was based only on the number of keyword matches between the title of the reference article and that of the seed literature. Here, we have extended keyword-based relevance filtering of scholarly articles to involve extraction of words from the title, abstract and highlights of the seed article and the retrieved articles. Based on the frequency of word matches, the threshold to eliminate or include the article in the track database is calculated. This solution seemed obvious, since the keyword region is extended to the abstract and throughout the text of the reference article. However, in practical reference tracking scenarios the tracked articles were found to deviate from the seed scholarly literature, which calls for a serious analysis of the semantics of the retrieved articles.
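A possible sketch of this extension, assuming a simple frequency-of-word-matches score computed over the title and abstract regions (the stopword list, tokenisation and scoring are illustrative, not the chapter's exact formulation):

```python
from collections import Counter
import re

STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "with", "to"}

def word_frequencies(*texts):
    """Frequency of non-stopword tokens over the title, abstract and highlights."""
    tokens = re.findall(r"[a-z0-9\-]+", " ".join(texts).lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def match_frequency(seed_fields, candidate_fields):
    """Total frequency of shared words between the seed and candidate regions."""
    seed, cand = word_frequencies(*seed_fields), word_frequencies(*candidate_fields)
    return sum(min(seed[w], cand[w]) for w in seed.keys() & cand.keys())

seed = ("A Soft Relevance Framework in Content-Based Image Retrieval Systems",
        "We present a soft relevance feedback framework for image retrieval ...")
cand = ("Relevance Feedback Techniques for Image Retrieval",
        "This paper surveys relevance feedback in content-based image retrieval ...")
print(match_frequency(seed, cand))  # include the article if this exceeds the threshold
```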
Data Representation

The representation can be categorized such that, using color legends and other layout schemes, the user is able to identify the exact publication that aids his/her literature search during research. The representation can be author-based, relevance-based or year-based. The reference database has three key tables that store the necessary information regarding the tracked articles.

Paper_ID: Paper_ID is used to identify the document uniquely and is the primary key of the Document table.
Table 1. Document
Paper_ID | Title | Year | Source | Link | BR_ID

Table 2. Document_author
Paper_ID | Author

Table 3. Document_track
TrackID | StartRec | EndRec

Title: Title of the document.
Year: Year in which the document is published.
Author: Author of the paper.
BR_ID: Back-reference ID of the document, i.e., the Paper_ID of the document which cited this document.
Refcount: Reference count; the number of references in the document.
Rel_Count: The relevance of the reference with respect to the seed document.
TrackID: Unique ID for every document tracked; used to retrieve already tracked information.
StartRec: Starting record number of the document track in the Paper_metadata table.
EndRec: Ending record number of the document track in the Paper_metadata table.
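A minimal sketch of this track database, assuming SQLite and simple column types (the chapter only names the fields, not their types):

```python
import sqlite3

# Schema for the three tables described in Tables 1-3; types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS Document (
    Paper_ID INTEGER PRIMARY KEY,   -- unique ID of the tracked document
    Title    TEXT,
    Year     INTEGER,
    Source   TEXT,
    Link     TEXT,
    BR_ID    INTEGER                -- back-reference: Paper_ID of the citing document
);
CREATE TABLE IF NOT EXISTS Document_author (
    Paper_ID INTEGER REFERENCES Document(Paper_ID),
    Author   TEXT
);
CREATE TABLE IF NOT EXISTS Document_track (
    TrackID  INTEGER PRIMARY KEY,   -- unique ID for every document tracked
    StartRec INTEGER,               -- first record of the track in Paper_metadata
    EndRec   INTEGER                -- last record of the track in Paper_metadata
);
"""

conn = sqlite3.connect("art_track.db")
conn.executescript(SCHEMA)
conn.commit()
```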
The output for reference data representation for a sample seed research article is shown in Figure 7 and Figure 8.
RESULTS

ART has been implemented using Python, version 2.4.3. The performance of ART was analysed for three different sets of seed articles, for which the results are summarized below. For implementation purposes and to avoid over-populating the track database, we limit recursive downloading of reference articles to two track levels and set the threshold for relevance calculations to 50%. The seed articles tested with ART are listed in Table 4. Every track of every seed article was found to involve both hits and misses of references, which are given in Table 5. References might be listed in non-IEEE formats (which violate our assumption), for which our reference parser is not trained. Since we do not take semantic similarity into account, the plain keyword analysis that we follow for relevance calculations may end up with low relevance counts, thereby ignoring an article that was spotted positively. Due to technical issues, the server on which an article is supposed to be present may be temporarily unavailable or overloaded at that instant.
Figure 7. Relevance-based representation of references
Figure 8. Author-based representation of references
Table 4. List of seed articles tested with ART

Seed_id | Seed_name | List of Actual References in Level 1
1 | A Soft Relevance Framework in Content-Based Image Retrieval Systems | 34
2 | Direct Kernel Biased Discriminant Analysis: A New Content-Based Image Retrieval Relevance Feedback Algorithm | 36
3 | Design and Implementation of Web-Based Systems for Image Segmentation and CBIR | 13
Table 5. Hit-miss ratio of references

Item name | Seed 1 | Seed 2 | Seed 3
No. of parsed references | 68 | 63 | 67
Hit count in Level 1 | 10 | 10 | 10
Hit count in Level 2 | 45 | 41 | 46
Miss count in Level 1 | 7 | 5 | 3
Miss count in Level 2 | 6 | 7 | 7
Some articles may be posted on the Web in PostScript format, and since we have no built-in facility for converting PostScript articles to PDF, such references, though spotted, have to be ignored. Apart from all of these, a bad Internet connection can also have a great impact on the retrieval of the reference articles. The reasons for missed references may thus be many; prominent among them is the 'format ignored' message. The other messages recorded for missed references are:

1. Low relevance count
2. Server not found
3. Bad URL
4. PostScript ignored
5. Bad request
6. Server busy
7. Unicode error
8. Timed out
Figure 9. Response of ART (number of reference articles per seed: total parsed references, hit count in Level 1, hit count in Level 2, miss count in Level 1 and miss count in Level 2)
Figure 10. Analysis of harvested references for the three seed research articles in Level 1 (performance of ART in Level 1: actual versus harvested references per seed)
It can be observed from Figure 9 that the number of missed references is, on average, comparable across the three different seed research articles. Thus it is clear that, even if the download level is fixed deeper, say at level 5, the miss count tends to converge over subsequent levels and at some finer level comes down to zero (Figure 10). The hit-miss ratio of the seeds can be derived from Figure 9. This indicates that there would be hardly any references actually missed in subsequent levels, which means that every reference from that level (or track) onwards would be included by ART for reference harvesting. In other words, this state of ART warns that there is not going to be any useful filtering of references from that particular point onwards; hence, the tracking would have no meaning if every reference is to be included for harvesting. A researcher who is actually struggling to get an idea about the seed document will surely be left in a confused state if every reference article in the inner levels, too, has to be understood conceptually in order to understand the seed research paper. This should not be the case, since the semantic similarity of the articles harvested by ART within the first two levels is not convincing. Hence, ART strictly demands extension and enhancement in such directions. Since the performance of ART highly depends on reference filtering based on relevancy, new measures of relevance calculation and threshold fixing shall be explored.
LIMITATIONS AND FUTURE DIRECTIONS

Automatic extraction of bibliographic data and reference linking information from the online literature has a serious performance bottleneck caused by recursively retrieving online reference documents. Although our implementation assumed a constant threshold (threshold = 50%), a poor selection of the threshold (for instance, higher limits close to our relevance count of 100) will reduce the quality of the retrieved documents, because the number of scholarly articles retrieved will actually be larger for a more popular research area. Under all circumstances, other publications by the authors of the seed reference article need to be maintained, assuming that such articles will express the continuity of the
author's previous work. Therefore, fixing a threshold in terms of the relevance of documents retrieved for a seed document was an obvious solution, but a poor selection of the threshold resulted in a reduction of the quality of the retrieved documents. Since this scenario highly depends on reference filtering based on relevancy, new measures of relevance calculation and threshold fixing shall be explored [Snášel V., Moravec P., and Pokorný J., 2005]. An alternate notion of threshold is embedded naturally in the retrieval process: the required online reference publication may be missing, or the reference entry in a publication may follow a different format for which the parser is not trained. Besides, we assume that most scholarly literature will be present in PDF format; if it is present in any other format (e.g. PS), the issue needs to be handled in future. Apart from this, the seed document's title-based keyword analysis may also be used [Ibrahim Kushchu, 2005] to prioritize the reference entries in a document so that the least relevant reference entries may be discarded. Table 6 summarizes the reasons for missed references, indicating the percentage of articles missed while reference tracking. However, some of the reasons for missed references (Reasons 3, 4, 7 and 9) are highly dependent on the practical situation of information search over the World Wide Web, which is unpredictable. Also, the result discussions presented above do not always imply that the current version of ART provides relevant documents at every hit. There may be harvested documents which are conceptually surprising when looked at from the researcher's perspective. Therefore, some measure of relevancy should be attempted before claiming the purity of harvested references. Fuzzy IR systems [Valerie V. Cross, 2001] attempt to determine a degree of relevance, which is quite different from the probabilistic model that assumes a document is either totally relevant or totally irrelevant and tries to estimate the likelihood of each outcome. Using fuzzy-IR-style relevance calculations, the quality of the articles harvested by ART can be improved further. Other measures of relevancy, including semantic relevance and personalization, would ultimately be the finer attempts at tuning the accuracy of the Automatic Reference Tracker.
Table 6. Summary of missed references for seed 1, seed 2 and seed 3

S. No. | Miss Messages | Seed 1 | Seed 2 | Seed 3
1 | Format Ignored | 7.692% | 25% | 38.461%
2 | Low Relevance Count | 46.154% | 16.66% | 9.09%
3 | Server Not Found | 7.692% | 16.66% | 9.09%
4 | Bad URL | 7.692% | 16.66% | 18.18%
5 | Postscript Ignored | 15.385% | 8.33% | -
6 | Bad Request | - | 8.33% | -
7 | Server Busy | 7.692% | 8.33% | 9.09%
8 | Unicode Error | 7.692% | - | -
9 | Timed Out | - | - | 9.09%

CONCLUSION

Automatic extraction of bibliographic data and reference linking information from the online literature has certain limitations, but is not impossible. This chapter has explicitly proposed the integrated
framework for retrieval and representation of online documents and listed the issues involved. The most relevant documents are filtered using the frequency of keyword matches in the title and abstract regions. Harvested articles with the highest percentage of relevance are presented to the user first. Upon the user's request, other documents are also included for display. The system can also be extended to produce citation counts. It enhances the literature survey of a budding researcher by providing easy access to logically related articles with fewer clicks, drastically reducing the time involved in accessing the origin of the documents. Another interesting application of ART lies in helping a reviewer investigate the credibility of the references in any publication's 'References' section.
REFERENCES

Citeseer (2007). Scientific Literature Digital Library. Retrieved 26 April, 2007, from http://citeseer.ist.psu.edu/

Day M.Y., Tzong-Hon Tsai, Sung C.L., Lee C.W., Wu S.H., Ong C.S. and Hsu W.L. (2005). A Knowledge-based Approach to Citation Extraction, In Proceedings of IEEE International Conference on Information Reuse and Integration, pp. 50-55.

Donna Bergmark and Carl Lagoze (2001). An Architecture for Automatic Reference Linking. Lecture Notes in Computer Science, Vol. 2163, pp. 115-126.

Google Scholar (2007). Retrieved 26 April, 2007, from http://scholar.google.com/intl/en/scholar/about.html

Ibrahim Kushchu (2005). Web-based Evolutionary and Adaptive Information Retrieval. IEEE Transactions on Evolutionary Computation, 9(2), 117-124.

Junjie Chen, Lizhen Liu, Hantao Song and Xueli Yu (2001). An Intelligent Information Retrieval System Model, In Proceedings of the 4th World Congress on Intelligent Control and Automation, China, pp. 2500-2503.

Mahalakshmi G.S. and Sendhilkumar Selvaraju (2006). Design and Implementation of Online Reference Tracking System, In Proceedings of First IEEE International Conference on Digital Information Management, Bangalore, India, pp. 418-423.

Minjuan Zhong, Zhiping Chen and Yaping Lin (2004). Using Classification and Key Phrase Extraction for Information Retrieval, In Proceedings of the 5th World Congress on Intelligent Control and Automation, China, pp. 3037-3041.

Rhodes B. and Maes P. (2000). Just-In-Time Information Retrieval Agents. IBM Systems Journal, 39(3&4), 685-704.

Ryen W. White (2005). Implicit Feedback for Interactive Information Retrieval, Doctoral Abstract, ACM SIGIR Forum, 39(1), 1-25.

Shen L., Lim Y.K. and Loh H.T. (2004). Domain-specific Concept-based Information Retrieval System, In Proceedings of International Engineering Management Conference, pp. 525-529.
Snášel V., Moravec P., and Pokorný J. (2005). WordNet Ontology based Model for Web Retrieval, In Proceedings of WIRI'05 Workshop, Tokyo, Japan, IEEE Press, pp. 220-225.

Xiangzhu Gao, San Murugesan and Bruce Lo (2003). A Dynamic Information Retrieval System for the Web, In Proceedings of 27th Annual International Computer Software and Applications Conference, IEEE Computer Society, pp. 663-668.

Xuehua Shen, Bin Tan and ChengXiang Zhai (2005). Context-Sensitive Information Retrieval Using Implicit Feedback. ACM-SIGIR, Brazil, pp. 43-50.

Valerie V. Cross (2001). Similarity or Inference for Assessing Relevance in Information Retrieval. In Proceedings of IEEE International Fuzzy Systems Conference, pp. 1287-1290.
Key Terms

Information Filtering System: An information filtering system is a system that removes redundant or unwanted information from an information stream using (semi-)automated or computerized methods prior to presentation to a human user. Its main goal is the management of information overload and the increment of the semantic signal-to-noise ratio.

Information Retrieval: Information retrieval is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually networked databases such as the World Wide Web.

Relevance Feedback: The idea behind relevance feedback is to take the results that are initially returned from a given query and to use information about whether or not those results are relevant to perform a new query.

Text Mining: Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output.

Web Search: Web search means searching for documents located on the Web by supplying queries through search engines.

Web Search Query: A Web search query is a query that the user enters into a Web search engine to satisfy his or her information needs.
Chapter XXX
Determination of Unithood and Termhood for Term Recognition

Wilson Wong, University of Western Australia, Australia
Wei Liu, University of Western Australia, Australia
Mohammed Bennamoun, University of Western Australia, Australia
ABSTRACT

As more electronic text is readily available, and more applications become knowledge-intensive and ontology-enabled, term extraction, also known as automatic term recognition or terminology mining, is increasingly in demand. This chapter first presents a comprehensive review of the existing techniques, discusses several issues and open problems that prevent such techniques from being practical in real-life applications, and then proposes solutions to address these issues. Keeping abreast of the recent advances in related areas such as text mining, we propose new measures for the determination of unithood, and a new scoring and ranking scheme for measuring termhood to recognise domain-specific terms. The chapter concludes with experiments to demonstrate the advantages of our new approach.
INTRODUCTION

In general, terms are considered as words used in domain-specific contexts. More specific and purposeful interpretations of terms do exist for certain applications such as ontology learning. There are two types of terms, namely, simple terms (i.e. single-word terms) and complex terms (i.e. multi-word terms). Collectively, terms constitute what is known as terminology. Terms and the tasks related to their treatment are an integral part of many applications that deal with natural language, such as large-scale search engines,
automatic thesaurus construction, machine translation and ontology learning for purposes ranging from indexing to cluster analysis. With the increasing reliance on large text sources such as the World Wide Web as input, the need to provide automated means for managing domain-specific terms arises. Such relevance and importance of terms has prompted dedicated research interests. Various names such as automatic term recognition, term extraction and terminology mining were given to encompass the tasks related to the treatments of terms. Despite the appearance of ease in handling terms, researchers in term extraction have their share of issues due to the many ambiguities inherent to natural language and differences in the use of words. Some of the general problems include syntactic variations, structural ambiguities and inconsistent use of punctuations. The main aim in term extraction is to determine whether a word or phrase is a term which characterises the target domain. This key question can be further decomposed to reveal two critical notions in this area, namely, unithood and termhood. Unithood concerns with whether or not sequences of words should be combined to form more stable lexical units, while termhood is the extent to which these stable lexical units are relevant to some domains. Formally, unithood refers to “the degree of strength or stability of syntagmatic combinations and collocations” (Kageura and Umino, 1996), and termhood is defined as “the degree that a linguistic unit is related to domain-specific concepts” (Kageura and Umino, 1996). While the former is only relevant to complex terms, the latter concerns both simple terms and complex terms. The research in terms and their treatments can be traced back to the field of Information Retrieval (IR) (Kageura and Umino, 1996). Many of the techniques currently available in term extraction either originated from or are inspired by advances in IR. While term extraction does share some common ground with the much larger field of IR, these two areas have some major differences. Most notably, document retrieval techniques have access to user queries to help determine the relevance of documents, while the evidences for measuring termhood are far less apparent. Based on the type of evidences employed, techniques for term extraction can be broadly categorised as linguistic-oriented and statistic-oriented. Generally, linguistic-oriented techniques rely on linguistic theories, and morphological, syntactical and dependency information obtained from natural language processing. Together with templates and patterns in the form of regular expressions, these techniques attempt to extract and identify term candidates, headmodifier relations and their context. More information is usually required to decide on the unithood and termhood of the candidates. This is when statistical techniques come into play. Evidences in the form of frequencies of occurrence and co-occurrence of the candidates, heads, modifiers and context words are employed to measure dependency, prevalence, tendency and relatedness for scoring and ranking. Consequently, it is extremely difficult for a practical term extractor or term recogniser to achieve quality results without using a combination of techniques. As a relatively small area of research, term extraction, in recent years, enjoys a revival of interests and attentions. This is due to the increasing popularity of large-scale search engines, and the new advances in text mining techniques. 
In particular, the birth of ontology learning, driven by the increasing need for interoperable semantics in knowledge-based applications, has necessitated a revisit of automatic term recognition in order to take advantage of its existing advances and not re-invent the wheel. A survey by Shamsfard and Barforoush (2004) has shown that many existing approaches in ontology learning merely employ isolated and non-comprehensive techniques that were not designed to address the various requirements and peculiarities in terminology. As a result, we reviewed existing automatic term recognition techniques to determine their strengths and applicability to ontology learning. We have uncovered seven issues that affect the performance of term extraction:
1. Lack of dedicated measures for the determination of unithood
2. The difference between prevalence and tendency
3. The role of heads and modifiers
4. Relatedness between terms and their context
5. Detection of context words
6. Importance of simple terms
7. Overemphasis on the role of contextual evidence
To address the issues that we have highlighted, firstly, we propose the separation of unithood measurements from the determination of termhood. From here on, we will consider unithood measurements as an important prerequisite, rather than a subsumption to the measurement of termhood. We present the modifications of two existing measures for the computation of a final Boolean value known as Unithood (UH) for determining the collocational strength of word sequences by employing the Google search engine as a source of statistical evidence. Secondly, we propose a new approach to incorporate a series of measures that leads to a final new score known as Termhood (TH). The main aim of this new approach is to maximise the use of available evidence during the process of scoring and filtering to improve the approximation of termhood. This approach consists of two new base measures for capturing the statistical evidence, and four new derived measures that employ the statistical evidence to quantify the linguistic evidences. We will discuss these seven issues and provide an overview of our solutions in Section 2. In Section 3, we will have a brief introduction about terms and their characteristics, and later, a review on the existing techniques for measuring unithood and termhood. In Section 4 and 5, we will present and discuss our solutions in details, including the measures involved, and the justification behind every aspect of the measures. In Section 6, we will summarize some findings from our experiments. Finally, we conclude this chapter with an outlook to potential extensions of this work and suggested directions in Section 7.
ISSUES AND OVERVIEW OF PROPOSED SOLUTION

In this section, we will elaborate on some important issues that affect the performance and practicality of techniques for term extraction. It is worth noting that some of the problems discussed are not new to this area.
Issue 1: Lack of dedicated measures for the determination of unithood: Most of the research in term extraction has been conducted solely to study and develop techniques for measuring termhood, while only a small number study unithood. Unfortunately, rather than looking at unithood as an important prerequisite, the researchers who work on measurements for unithood merely consider them as part of a larger scoring and filtering mechanism for determining termhood. Such a mentality is clearly reflected in the words of Kit (2002): "...we can see that the unithood is actually subsumed, in general, by the termhood.". Consequently, the significance of unithood measures is often overshadowed by the larger notion of termhood, and the progress and innovation in this small sub-field of term extraction is minimal. Most of the existing techniques for measuring unithood employ conventional measures such as Mutual Information (MI) and Log-Likelihood, and rely simply on the occurrence and co-occurrence frequencies from local corpora as the source of evidence.
Issue 2: The difference between prevalence and tendency: Many frequency-oriented techniques adapted from Term Frequency-Inverse Document Frequency (TF-IDF) and others alike from IR fail to comprehend that terms are properties of domains and not documents. Existing weights do not reflect the tendency of term usage across different domains. They merely measure the prevalence of the term in a particular target domain. Consider the scenario where 999 cases of infection were detected in a population of 1000. This only shows that the disease has high prevalence in that one population. However, the number of cases does not reveal whether that particular population is an exclusive target or is the disease equally prevalent in other populations. Issue 3: The role of heads and modifiers: Many approaches have attempted to utilise the head as the representative of the entire complex term in various occasions. For example, the assumption that “term sense is usually determined by its head” by Basili et al. (2001b) is not entirely true. Such move is an oversimplication of the relation between a complex term and its head based on the head-modifier principle. Modifiers in complex terms are meant to narrow down the many possible interpretations of the inherently ambiguous heads. While different complex terms sharing the same head may or may not belong to the same semantic category, their actual senses are determined by the interaction between their heads and the associated modifiers. Issue 4: Relatedness between terms and their context: For approaches that attempt to use contextual information, they have realised the prior need to examine the relatedness of context words with the associated term candidate. The existing solutions to this requirement have yet to deliver overall satisfactory results. For example, Maynard and Ananiadou (1999) rely on the use of rare and static resources for computation of similarity while others such as Basili et al. (2001b) require large corpora to enable the extraction of features to enable the use of feature-based similarity measures. For the latter, the size of the corpus has to be extremely large and the target domain has to be extremely specific in order to make available the amount of features necessary to support feature-based similarity measures. Issue 5: Detection of context words: A number of existing approaches that employ contextual information use fixed-size windows to select the context words. Such technique faces the issue of proper window size selection. The choice of window size can either be numerical (i.e. size of n) or linguistic-based (e.g. within a sentence or document). However, many context words that have direct relations with the term candidates are not necessarily located immediately next to the candidate. The relation between the proximity of context words with term candidates, and the relatedness between them has yet to be demonstrated. Issue 6: Importance of simple terms: A number of existing approaches focused solely on the identification of complex terms. Even though complex terms are more highly regarded in terminology, one cannot deny the fact that simple terms have a role to play. In fact, besides being potential terms, simple terms have the capability of assisting in the identification of complex terms through the head-modifier principle. For an approach to be considered as practical for real-world applications, simple terms cannot be neglected. 
Issue 7: Overemphasis on the role of contextual evidence: Many researchers have repeatedly stressed the significance of the old cliché "you shall know a word by the company it keeps". Unless one has a mechanism for identifying the companies which are truly in a position to describe a word, the overemphasis on contextual evidence may result in negative effects.
In view of the issues highlighted before and the need for enhancements in term extraction, we present our solutions in two parts. Firstly, we propose a new Boolean measure for determining the unithood of word sequences using two measures inspired by MI (Church and Hanks, 1990) and Cvalue (Frantzi, 1997) which rely on the use of Google search engine as the source of statistical evidence. The use of the World Wide Web to replace the conventional use of static corpora will eliminate issues related to portability to other domains and the size of text required to induce the necessary statistical evidence. Besides, this dedicated approach to determine the unithood of word sequences will prove to be invaluable to other areas in natural language processing such as noun-phrase chunking and named-entity recognition. Secondly, our answers to Issues 2 to 7 are presented in the form of a new approach governed by the following four principles: •
Principle 1: The approach will make use of all statistical (i.e. cross-domain and intra-domain distributional behaviour) and linguistic (i.e. candidate, modifier, context) evidences during scoring and ranking. The notions of prevalence and tendency should play more distinct roles in the measurement of termhood. Modifiers should have direct contributions to the determination of termhood for complex terms, while the role of contextual evidence should be revised. This will address Issues 2, 3 and 7; Principle 2: The modifiers and context words should be identified using their linguistic properties (i.e. part-of-speech and grammatical relations). This is to ensure that the evidence captured truly reflects the actual information and constraints that the authors of the text try to convey. This principle is important to address Issue 5; Principle 3: The inclusion of an existing measure new to the field of term extraction known as Normalised Google Distance (NGD). This measure will overcome the problems introduced by the use of static and restricted semantic resources as stated in Issue 4; and Principle 4: The approach should provide a unified means of handling both simple and complex terms to address Issue 6.
The requirements posed in Principle 1 define the core of the new approach with respect to the types of evidence and how these evidences can be used to measure termhood. In this regard, we propose two new base measures, and four new derived measures, to be incorporated into our new approach:

• The two new base measures are Domain Prevalence (DP) and Domain Tendency (DT) for capturing the statistical evidence; and
• The four new derived measures are Discriminative Weight (DW) for the candidate evidence, Modifier Factor (MF) for the modifier evidence, and Average Contextual Discriminative Weight (ACDW) and Adjusted Contextual Contribution (ACC) for contextual evidence. ACC is important for the re-evaluation of the role and significance of contextual evidence in the overall weight TH.
BACKGROUND AND RELATED WORKS

The main building blocks of domain ontologies are domain-specific concepts. Since concepts are merely mental symbols that we employ to represent the different aspects of a domain, they can never really be computationally captured from written or spoken resources. Therefore, in ontology learning, terms are
regarded as lexical realisations for expressing or representing concepts that characterise various aspects of specialised domains. Attempting to differentiate between words and terms, some researchers such as Frantzi and Ananiadou (1997) consider the latter as part of a special language employed for communication in specialised domains such as law, engineering and medicine. The special language is treated as a derivation which obeys most of the syntactic and grammatical constraints of general language. The differences between the special and general language exist at both the lexical and the semantic level. To illustrate, a general language lexicon is insufficient to meet the requirements of processing text produced using special language (Sager, 1990). In this respect, terms collectively form what is known as terminology, the same way words constitute the vocabulary. Nonetheless, Loukachevitch and Dobrov (2004) argued against the existence of "an impregnable barrier" by stating that a lot of terms initially conceived for use in specific domains eventually become elements of the general language and, similarly, the meanings of general language words are often altered to become elements of a terminology. Some of the ideal characteristics of terms include (Sager, 1990; Loukachevitch and Dobrov, 2004):

Definition 1. Primary characteristics of terms in ideal settings:

1. Terms should not have synonyms. In other words, there should be no different terms implying the same meaning.
2. The meaning of terms is independent of context.
3. The meaning of terms should be precise and related directly to a concept. In other words, a term should not have different meanings or senses.
Besides the ideal characteristics, there are also other hypothetical characteristics of terms. The following list is not a standard and is by no means exhaustive or properly theorised. Nonetheless, as we have pointed out, such a linguistically-motivated list is one of the foundations of term extraction.

Definition 2. Extended characteristics of terms:

1. Terms are properties of a domain, not of a document (Basili et al., 2001a).
2. Terms tend to clump together (Bookstein et al., 1998) the same way content-bearing words do (Zipf, 1949).
3. Terms with longer length are rare in a corpus since the usage of words with shorter length is more predominant (Zipf, 1935).
4. Simple terms are often ambiguous and modifiers are required to reduce the number of possible interpretations.
5. Complex terms are preferred (Frantzi and Ananiadou, 1997) since the specificity of such terms with respect to certain domains is well-defined.
The previous two lists of characteristics define what terms are and their relationships with their environments such as context words. There is also a commonly adopted list of characteristics that define how terms should behave in text for determining their relevance (Kageura and Umino, 1996). They are
Definition 3. Characteristics for determining term relevance:

1. A term candidate is relevant to a domain if it appears relatively more frequently in that domain than in others.
2. A term candidate is relevant to a domain if it appears only in one domain.
3. A term candidate which is relevant to a domain may have biased occurrences in that domain:
   3.1 A term candidate of rare occurrence in a domain. Such candidates are also known as "hapax legomena", which manifest themselves as the long tail in Zipf's law.
   3.2 A term candidate of common or frequent occurrence in a domain.
4. A complex term candidate is relevant to a domain if its head is specific to that domain.

As mentioned earlier, there are two types of terms in terminology, namely, simple terms and complex terms. Simple terms consist of only one word and are also referred to as single-word terms. Complex terms are multi-word terms and have been regarded as "...the preferred units of designation of terminological concept" (Frantzi and Ananiadou, 1997). This is due to the fact that simple terms or heads of complex terms are polysemous, and modification of such terms is required to narrow down their possible interpretations. This is the fundamental requirement in terminology (Hippisley et al., 2005). Moreover, 85% of domain-specific terms are multi-word in nature (Nakagawa and Mori, 2002), further establishing the importance of complex terms. This is the reason why most of the existing approaches (Pantel and Lin, 2001; Wermter and Hahn, 2005) are focused on the identification of complex terms only.
Existing Approaches for Extracting Candidates and Measuring Unithood

Prior to measuring unithood or termhood, term candidates must be extracted. There are two common approaches for extracting term candidates. The first requires the corpus to be tagged or parsed, and a filter is then employed to extract words or phrases satisfying some linguistic patterns. There are two types of filters for extracting from a tagged corpus, namely, open and closed. Closed filters, which rely on a small set of allowable parts-of-speech, will produce high precision but poor recall (Frantzi and Ananiadou, 1997). On the other hand, open filters, which allow parts-of-speech such as prepositions and adjectives, will have the opposite effect. Most of the existing approaches rely on regular expressions and part-of-speech tags to accept or reject sequences of n-grams as term candidates. For example, Frantzi and Ananiadou (1997) employ the Brill tagger to tag the raw corpus with parts-of-speech and later extract n-grams that fulfil the pattern (Noun | Adjective)+ Noun. Bourigault and Jacquemin (1999) utilise SYLEX, a part-of-speech tagger, to tag the raw corpus. The part-of-speech tags are utilised to extract maximal-length noun phrases, which are later recursively decomposed into heads and modifiers. At the other extreme, Dagan and Church (1994) accept only sequences of Noun+.

The second type of extraction approach works on the raw corpus using a set of heuristics. This type of approach, which does not rely on part-of-speech tags, is quite rare. Such an approach has to make use of textual surface constraints to approximate the boundaries of term candidates. One of the constraints includes the use of a stopword list to obtain the boundaries of stopwords for inferring the boundaries of candidates. A selection list of allowable prepositions can also be employed to enforce constraints on the tokens between units.

The filters for extracting term candidates make use of only local, surface-level information, namely, the part-of-speech tags. More evidence is required to establish the dependence between the constituents
of each term candidate to ensure strong unithood. Such evidence will usually be statistical in nature, in the form of co-occurrences of the constituents in the corpus. Accordingly, the unithood of the term candidates can be determined either as a separate step or incorporated as part of the extraction process. Two of the most common measures of unithood are pointwise MI (Church and Hanks, 1990) and the log-likelihood ratio (Dunning, 1994). In MI, the co-occurrence frequencies of the constituents of complex terms are utilised to measure their dependency. The MI for two words a and b is defined as:

$$MI(a,b) = \log_2 \frac{p(a,b)}{p(a)\,p(b)}$$
where p(a) and p(b) are the probability of occurrence of a and b, respectively. Many measures that apply statistical tests assume strict normal distribution. This does not fare well when faced with rare events. For handling extremely uncommon words or small sized corpus, log-likehood “...delivers the best precision” (Kurz and Xu, 2002). Log-likelihood attempts to quantify how much more likely the occurrence of one pair of words is than the other. Despite its potential, “How to apply this statistic measure to quantify structural dependency of a word sequence remains an interesting issue to explore.” (Kit, 2002). Frantzi (1997) proposed a measure known as Cvalue for extracting complex terms. The measure is based upon the claim that a substring of a term candidate is a candidate itself given that it demonstrates adequate independence from the longer version it appears in. For example, “E. coli food poisoning”, “E. coli” and “ food poisoning” are acceptable as valid complex term candidates. However, “E. coli food” is not. Given a potential term candidate a to be examined for unithood, its Cvalue is defined as: if (| a |= g ) log 2 | a | f a ∑ fl Cvalue(a ) = l∈La log 2 | a | ( f a − ) otherwise | La |
(1)
where p(a) and p(b) are the probabilities of occurrence of a and b, respectively. Many measures that apply statistical tests assume a strict normal distribution. This does not fare well when faced with rare events. For handling extremely uncommon words or small-sized corpora, log-likelihood "...delivers the best precision" (Kurz and Xu, 2002). Log-likelihood attempts to quantify how much more likely one pair of words is to occur than another. Despite its potential, "How to apply this statistic measure to quantify structural dependency of a word sequence remains an interesting issue to explore." (Kit, 2002).

Frantzi (1997) proposed a measure known as Cvalue for extracting complex terms. The measure is based upon the claim that a substring of a term candidate is a candidate itself given that it demonstrates adequate independence from the longer version it appears in. For example, "E. coli food poisoning", "E. coli" and "food poisoning" are acceptable as valid complex term candidates. However, "E. coli food" is not. Given a potential term candidate a to be examined for unithood, its Cvalue is defined as:

$$Cvalue(a) = \begin{cases} \log_2 |a| \, f_a & \text{if } |a| = g \\ \log_2 |a| \left( f_a - \frac{1}{|L_a|} \sum_{l \in L_a} f_l \right) & \text{otherwise} \end{cases} \tag{1}$$

where |a| is the number of words in a, L_a is the set of longer term candidates that contain a, g is the longest n-gram considered, and f_a is the frequency of occurrence of candidate a. While certain researchers (Kit, 2002) consider Cvalue a termhood measure, others (Nakagawa and Mori, 2002) accept it as a measure of unithood. Regardless, Cvalue is merely the first step in the derivation of a different measure for termhood known as NCvalue by Frantzi and Ananiadou (1997), to be discussed in the next section.
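A small sketch of Equation 1, with hypothetical frequency counts; the guard for candidates that are not nested in any longer candidate is our own addition to avoid a division by zero.

```python
import math

def cvalue(candidate, freq, nested_in, longest_n):
    """C-value of a candidate term (Equation 1), following Frantzi (1997).

    candidate  -- the term candidate as a tuple of words
    freq       -- dict mapping candidates to corpus frequencies
    nested_in  -- dict mapping a candidate to the longer candidates containing it
    longest_n  -- g, the longest n-gram length considered
    """
    length = len(candidate)
    f_a = freq[candidate]
    longer = nested_in.get(candidate, [])
    if length == longest_n or not longer:     # non-nested case treated like |a| = g
        return math.log(length, 2) * f_a
    nested_freq = sum(freq[l] for l in longer)
    return math.log(length, 2) * (f_a - nested_freq / len(longer))

freq = {
    ("E.", "coli", "food", "poisoning"): 4,
    ("food", "poisoning"): 30,
    ("E.", "coli"): 25,
}
nested = {("food", "poisoning"): [("E.", "coli", "food", "poisoning")]}
print(cvalue(("food", "poisoning"), freq, nested, longest_n=4))   # -> 26.0
```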
Existing Measures for Termhood

While unithood provides proof of the stability of term candidates, termhood assists in isolating unrelated candidates from domain-specific ones. Commonly, the approach for assessing termhood requires a scoring and ranking scheme, similar to relevance ranking in IR, where each term is assigned a score. The scores and ranks assist in the selection of "true" terms from the less likely ones. Most of the existing approaches for measuring termhood make use of the distributional behaviour of terms in documents and domains, and some heuristics related to the dependencies between term candidates or the constituents of complex term candidates.
Common measures for weighting terms employ frequencies of occurrences of terms in the corpus. There are various implementations but the most common one is the classical Term Frequency-Inverse Document Frequency (TF-IDF) and its variants (Salton and Buckley, 1988). In TF-IDF, the weight for term a in document k is defined as:

$$w_{ak} = f_{ak} \log\!\left(\frac{N}{n_a}\right)$$
where f_{ak} is the frequency of term a in document k, N is the total number of documents in the corpus and n_a is the number of documents in the corpus that contain a.

Basili et al. (2001a) proposed a TF-IDF-inspired measure for assigning terms a more accurate weight that reflects their specificity with respect to the target domain. This contrastive analysis is based on the heuristic that general language-dependent phenomena should spread similarly across different domain corpora while special-language phenomena should portray odd behaviours. The Contrastive Weight (CW) for a simple term candidate a in target domain d is defined as:

$$CW(a) = \log f_{ad} \, \log\!\left(\frac{\sum_j \sum_i f_{ij}}{\sum_j f_{aj}}\right) \tag{2}$$
where f_{ad} is the frequency of the simple term candidate a in the target domain d, \sum_j \sum_i f_{ij} is the sum of the frequencies of all term candidates in all domains, and \sum_j f_{aj} is the sum of the frequencies of term candidate a in all domains. For complex term candidates, the frequencies of their heads are utilised to compute their weights. This is necessary because the low frequencies of complex terms make estimations difficult. If a is a complex candidate, its weight in domain d is defined as:

$$CW(a) = f_{ad} \, CW(a_h) \tag{3}$$
where f_{ad} is the frequency of the complex term candidate a in the target domain d, and CW(a_h) is the contrastive weight for the head of the complex term candidate. The use of heads by Basili et al. (2001a) for computing the contrastive weights of complex term candidates reflects the head-modifier principle (Hippisley et al., 2005). The principle suggests that the information being conveyed by complex terms manifests itself in the arrangement of the constituents. The head acts as the key that refers to a general category to which all other modifications of the head belong. The modifiers are responsible for distinguishing the head from other forms in the same category.

In a similar attempt at contrastive analysis, Kurz and Xu (2002) proposed KF-IDF, another modification of TF-IDF. In KF-IDF, the weight for a simple term candidate a in domain d ∈ D is defined as:

$$KFIDF_d(a) = n_{ad} \log\!\left(\alpha \frac{|D|}{|D_a|} + 1\right)$$

where n_{ad} is the number of documents in domain d that contain term a, \alpha is a smoothing factor, |D| is the number of domains, and |D_a| is the number of domains that contain term a. A simple term candidate is considered relevant if it appears more often than other candidates in the target domain, but only occasionally elsewhere.
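A sketch of the contrastive weighting idea of Equation 2, using hypothetical per-domain frequency tables; the base of the logarithm is not specified in the chapter, so the natural logarithm is used here.

```python
import math

def contrastive_weight(term, target_domain, freq):
    """Contrastive Weight (Equation 2) for a simple term candidate.

    freq[d][t] is the frequency of term candidate t in domain d (made-up data).
    """
    f_ad = freq[target_domain].get(term, 0)
    total_all = sum(sum(d.values()) for d in freq.values())   # sum_j sum_i f_ij
    total_term = sum(d.get(term, 0) for d in freq.values())   # sum_j f_aj
    if f_ad == 0 or total_term == 0:
        return 0.0
    return math.log(f_ad) * math.log(total_all / total_term)

freq = {
    "medicine": {"enzyme": 40, "cell": 55, "market": 2},
    "finance":  {"market": 70, "stock": 60, "cell": 3},
}
print(contrastive_weight("enzyme", "medicine", freq))  # higher: appears only in medicine
print(contrastive_weight("cell", "medicine", freq))    # lower: spread across domains
```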
Besides contrastive analysis, the use of contextual evidence to assist in the correct identification of terms has also become popular. There are currently two dominant approaches to extracting contextual information. Most existing researchers, such as Maynard and Ananiadou (1999), employ fixed-size windows for capturing context words for term candidates. The Keyword in Context (KWIC) (Luhn, 1960) index can be employed to identify the appropriate windows of words surrounding the term candidates. Other researchers, such as Basili et al. (2001b) and LeMoigno et al. (2002), employ grammatical relations to identify verb phrases or independent clauses containing the term candidates. One of the works along the line of incorporating contextual information is NCvalue by Frantzi and Ananiadou (1997). Part of the NCvalue measure involves the assignment of weights to context words in the form of nouns, adjectives and verbs located within a fixed-size window from the term candidate. Given that TC is the set of all term candidates and c is a noun, verb or adjective appearing with the term candidates, weight(c) is defined as:

$$weight(c) = 0.5 \left( \frac{|TC_c|}{|TC|} + \frac{\sum_{e \in TC_c} f_e}{f_c} \right) \tag{4}$$
where TC_c is the set of term candidates that have c as a context word, \sum_{e \in TC_c} f_e is the sum of the frequencies of term candidates that appear with c, and f_c is the frequency of c in the corpus. After calculating the weights for all possible context words, the sum of the weights of context words appearing with each term candidate can be obtained. Formally, for each term candidate a that has a set of accompanying context words C_a, the cumulative context weight is defined as:

$$cweight(a) = \sum_{c \in C_a} weight(c) + 1 \tag{5}$$
Eventually, the NCvalue for a term candidate is defined as:

$$NCvalue(a) = \frac{1}{F} \, Cvalue(a) \, cweight(a) \tag{6}$$
where F is the size of the corpus in terms of the number of words.

There has also been an increasing interest in incorporating semantic information for measuring termhood. Maynard and Ananiadou (1999) employ the Unified Medical Language System (UMLS) to compute two weights, namely, positional weight and commonality. Positional weight is obtained based on the combined number of nodes belonging to each word, while commonality is measured by the number of shared common ancestors multiplied by the number of words. Accordingly, the similarity between two term candidates is defined as:

$$sim(a,b) = \frac{com(a,b)}{pos(a,b)} \tag{7}$$
where com(a,b) and pos(a,b) are the commonality and positional weight, respectively, between term candidates a and b. The authors then modified the NCvalue discussed in Equation 6 by incorporating the new similarity measure as part of a Context Factor (CF). The context factor of a term candidate a is defined as:

$$CF(a) = \sum_{c \in C_a} f_c \, weight(c) + \sum_{b \in CT_a} f_b \, sim(a,b)$$
where C_a is the set of context words of a, f_c is the frequency of c as a context word of a, weight(c) is the weight for context word c as defined in Equation 4, CT_a is the set of context words of a which also happen to be term candidates (i.e. context terms), f_b is the frequency of b as a context term of a, and sim(a,b) is the similarity between term candidate a and its context term b as defined in Equation 7. The new NCvalue is defined as:

$$NCvalue(a) = 0.8 \, Cvalue(a) + 0.2 \, CF(a)$$

Basili et al. (2001b) commented that the use of extensive and well-grounded semantic resources by Maynard and Ananiadou (1999) faces the issue of portability to other domains. Instead, Basili et al. (2001b) combine the use of contextual information and the head-modifier principle to capture term candidates and their context words in a feature space for computing similarity. Given the term candidate a, its feature vector is t(a) = (f_1, ..., f_n), where f_i is the value of the feature F_i and n is the number of features. F_i comprises the tuple (T_i, h_i), where T_i is the type of grammatical relation (e.g. subj, obj) and h_i is the head of the relation. For example, the head "poisoning" of the complex term candidate "food poisoning" in the context of "...cause food poisoning" will have a feature F = (dobj, cause). In other words, only the head of a complex term is used as the referent for the entire structure. According to the authors, "the term sense is usually determined by its head.". On the contrary, this statement opposes the fundamental fact, not only in terminology but in general linguistics, that simple terms are polysemous and modification of such terms is necessary to narrow down their possible interpretations (Hippisley et al., 2005). Moreover, the size of the corpus has to be very large, and the specificity and density of domain terms in the corpus have to be very high, to allow for the extraction of adequate features. The authors chose the cosine measure for computing similarity over the syntactic feature space:

$$sim(\tau_i, \tau_j) = \frac{\tau_i \cdot \tau_j}{|\tau_i| \, |\tau_j|}$$
To assist in the ranking of the term candidates using their heads, a controlled terminology TC is employed as the evidence of "correct" terms during the computation of similarity. A centroid \tau(TC) is defined as:

$$\tau(TC) = \sum_{a \in TC} \tau(a)$$
Accordingly, the weight employed for the ranking of the term candidates is defined as:

$$ext(a) = sim(\tau(a), \tau(TC)) \, f_a$$

where f_a is the frequency of candidate a.
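The feature-based ranking of Basili et al. (2001b) can be sketched as follows; the sparse feature vectors and frequencies are invented purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity over sparse feature vectors (dicts of feature -> value)."""
    shared = set(u) & set(v)
    dot = sum(u[f] * v[f] for f in shared)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Sum of the feature vectors of the controlled terminology TC."""
    total = {}
    for vec in vectors:
        for f, x in vec.items():
            total[f] = total.get(f, 0.0) + x
    return total

def ext(candidate_vec, freq, tc_centroid):
    """Ranking weight ext(a) = sim(tau(a), tau(TC)) * f_a."""
    return cosine(candidate_vec, tc_centroid) * freq

tc_vectors = [{("dobj", "cause"): 3, ("subj", "contain"): 1},
              {("dobj", "cause"): 2, ("dobj", "prevent"): 1}]
tau_tc = centroid(tc_vectors)
print(ext({("dobj", "cause"): 1, ("dobj", "report"): 2}, 14, tau_tc))
```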
NEW MEASURE FOR DETERMINING UNITHOOD

In this section, we elaborate on our solution to the first issue highlighted in Section 2. Our new approach for measuring the unithood of word sequences consists of two parts. Firstly, a list of word sequences is extracted using purely linguistic techniques. Secondly, the word sequences are examined and the related statistical evidence is gathered to assist in determining their mergeability and independence.
Extracting Word Sequences

Existing techniques for extracting word sequences have relied on part-of-speech information and pattern matching (e.g. regular expressions). Since the head-modifier principle is important for our techniques, here we employ both part-of-speech information and dependency relations for extracting term candidates. The filter is implemented as a head-driven left-right filter (Wong, 2005) that feeds on the output of the Stanford Parser (Klein and Manning, 2003), which is an implementation of an unlexicalized Probabilistic Context-Free Grammar (PCFG) and a lexical dependency parser. The head-driven filter begins by identifying a list of head nouns from the output of the Stanford Parser. As the name suggests, the filter begins from the head and proceeds to the left and, later, to the right in an attempt to identify maximal-length noun phrases according to the head-modifier information. During the process, the filter appends or prepends any immediate modifier of the current head which is a noun (except possessive nouns), an adjective or a foreign word. Each noun phrase or segment of a noun phrase identified using the head-driven filter is known as a potential term candidate, a_i, where i is the word offset produced by the Stanford Parser (i.e. the "offset" column in Figure 1).
Figure 1. The output from the Stanford Parser. The tokens marked with squares are head nouns and the tokens in the corresponding rows are the modifiers.
Figure 2. An example of the head-driven left-right filter at work. The tokens highlighted with a darker tone are the head nouns. The underlined tokens are the modifiers identified as a result of the left-first, right-later movement of the filter. In the first segment, the head noun is "experience" while the two tokens to its left, namely "more" and "modern", are the corresponding modifiers. In the second segment of the figure, "Institute" is the head noun and "Massachusetts" is the modifier. Note that the result from the second segment is actually part of a longer noun phrase. Because the head-driven filter does not accept prepositions, "of" is omitted from the output of the second segment in this figure.
Figure 1 shows the output of Stanford Parser for the sentence “A new Internet-enabled bus station being developed by researchers at the Massachusetts Institute of Technology promises to make public transportation’s least glamorous mode a more modern experience.”. Note that the words are lemmatised to obtain the root form. The head nouns are marked with rectangles in the figure. For example, the head “Institute” is modified by “the”, “Massachusetts”, and “of”. Since we do not allow for modifiers of the type “article” and “preposition”, we will obtain “Massachusetts Institute” as shown in Figure 2. Similarly, the modifiers of the head noun “station” are “a”, “new”, “Internet-enabled”, “bus” and “developed”. The resulting phrase “new Internet-enabled bus station” is produced since verbs and articles are not accepted by the filter. Figure 2 shows the head-driven filter at work for some of the head nouns identified from the “modifiee” column of the output in Figure 1. After the head-driven filter has identified potential term candidates using the heads, remaining nouns from the “word” column in Figure 1 which are not part of any potential term candidates will be included.
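A simplified sketch of the head-driven filter over hypothetical parser output (token offsets, Penn Treebank tags and modifiee offsets); the real filter also handles nested segments and the later inclusion of leftover nouns.

```python
# Each token: (offset, word, part-of-speech tag, offset of its head / modifiee).
ALLOWED_MODIFIER_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "FW"}  # no possessives, DT or IN

def head_driven_filter(tokens, head_offset):
    """Collect a head noun and its allowed left/right modifiers into one candidate."""
    by_offset = {t[0]: t for t in tokens}
    segment = [head_offset]
    # left-first, right-later: gather modifiers appearing before and after the head
    for offset, word, tag, head in tokens:
        if head == head_offset and tag in ALLOWED_MODIFIER_TAGS:
            segment.append(offset)
    segment.sort()
    return " ".join(by_offset[o][1] for o in segment)

tokens = [
    (1, "the", "DT", 3),
    (2, "Massachusetts", "NNP", 3),
    (3, "Institute", "NNP", 6),
    (4, "of", "IN", 3),
    (5, "Technology", "NNP", 4),
    (6, "promise", "VBZ", 0),
]
print(head_driven_filter(tokens, head_offset=3))   # -> "Massachusetts Institute"
```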
Determining the Mergeability and Independence of Word Sequences
In the following step, we examine the possibility of merging pairs of potential term candidates (ax, ay) ∈ A with ax and ay located immediately next to each other (i.e. x+1=y), or separated by a preposition or a coordinating conjunction limited to "and" (i.e. x+2=y). Obviously, ax has to appear before ay in the same sentence or, in other words, x < y. The independence of ax and ay from the merged sequence s (i.e. s = axbay, where b is the intervening preposition or conjunction, possibly empty) is captured by the measure ID, which is defined only when the frequencies satisfy $n_{a_x} > n_s$ and $n_{a_y} > n_s$; otherwise ID = 0. As the lexical unit ax occurs more often than its longer counterpart s, its independence ID(ax, s) grows. If the occurrences of ax fall below those of s, its independence becomes 0; in other words, we will not be able to encounter ax without encountering s. The same interpretation holds for the independence of ay. Consequently, extremely high independence of ax and ay relative to s will be reflected through high ID(ax, s) and ID(ay, s). Finally, the decision to merge ax and ay to form s depends on both the mergeability between ax and ay, MI(ax, ay), and the independence of ax and ay from s, ID(ax, s) and ID(ay, s). This decision is formulated as a Boolean measure known as Unithood (UH), formally defined as (Wong et al., 2007a):

$$
UH(a_x, a_y) =
\begin{cases}
1 & \text{if } \big[\,MI(a_x, a_y) > MI^{+}\,\big] \,\lor\, \big[\,MI^{+} \ge MI(a_x, a_y) \ge MI^{-} \,\land\, ID(a_x, s) \ge ID_T \,\land\, ID(a_y, s) \ge ID_T \,\land\, IDR^{+} \ge IDR(a_x, a_y) \ge IDR^{-}\,\big] \\
0 & \text{otherwise}
\end{cases}
\tag{12}
$$
where IDR(ax, ay) = ID(ax, s) / ID(ay, s). IDR helps to ensure that, for pairs with mediocre mergeability, ax and ay are not only highly independent but also roughly equally independent before any merging is performed. The UH measure described in Equation 12 summarises the relationship between mutual information and the independence measure. Intuitively, the two units ax and ay can only be merged in two cases:

• If ax and ay have an MI higher than the threshold MI+; or
• If ax and ay achieve a mediocre MI within the range MI+ to MI- but have independence values ID(ax, s) and ID(ay, s) higher than the threshold IDT.
The thresholds for MI(ax, ay), ID(ax, s), ID(ay, s), and IDR(ax, ay) are decided empirically through our evaluations:

• MI+ = 0.90
• MI- = 0.02
• IDT = 6
• IDR+ = 1.35
• IDR- = 0.93
A general guideline for selecting the thresholds will be presented during our experiments. Finally, the word sequence s = axbay will be accepted as a new term candidate if UH(ax, ay)=1.
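As an illustration only, the decision rule in Equation 12 with the thresholds listed above can be written in a few lines of code; the function below assumes that MI, ID and the frequency counts have already been computed by whatever mergeability and independence measures are in use.

```python
# Sketch of the Boolean unithood decision of Equation 12, using the empirically
# chosen thresholds quoted in the text. MI and ID values are supplied by the caller.
MI_PLUS, MI_MINUS = 0.90, 0.02
ID_T = 6
IDR_PLUS, IDR_MINUS = 1.35, 0.93

def unithood(mi_xy, id_x, id_y):
    """Return 1 if a_x and a_y should be merged into s = a_x b a_y, else 0."""
    if mi_xy > MI_PLUS:                      # clearly collocated: merge outright
        return 1
    if MI_PLUS >= mi_xy >= MI_MINUS:         # mediocre mergeability: fall back on independence
        idr = id_x / id_y if id_y else float("inf")
        if id_x >= ID_T and id_y >= ID_T and IDR_MINUS <= idr <= IDR_PLUS:
            return 1
    return 0

# Example: a pair with modest MI but strong, balanced independence is merged.
print(unithood(mi_xy=0.40, id_x=9.0, id_y=8.0))   # -> 1
```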
A NEW SCORING AND RANKING SCHEME FOR TERMHOOD
In this section, we elaborate on our solutions to the remaining issues (i.e. Issues 2 to 7) highlighted in Section 2. We discuss the evidence and the related new measures employed for measuring termhood. This new scoring and ranking scheme is derived from hypothetical characteristics of terms
and their behaviours in corpora. The term characteristics presented in Definitions 1, 2 and 3 earlier are the main guides in formulating our new termhood measure. We are well aware of possible criticisms concerning the ad hoc manner in which the different weights are put together. However, the advantages of this new ranking and scoring scheme, demonstrated in our experiments, compensate for the shortcomings related to the issue of mathematical validity. Moreover, one will notice that the derivation and combination of the different weights are not so "ad hoc" after all, once the motivation and contribution of each weight have been discussed in detail. We propose a new scoring and ranking scheme (Wong et al., 2007c) that employs the distributional behaviour of term candidates within the target domain and across different domains as statistical evidence to quantify the linguistic evidence in the form of candidate, modifier and context. We introduce two base measures for capturing the statistical evidence based on intra-domain and cross-domain distribution:
• The intra-domain distributional behaviour of term candidates and context words is employed to compute the basic Domain Prevalence (DP).
• The cross-domain distributional behaviour of term candidates and context words is employed to compute the Domain Tendency (DT).
The three types of linguistic evidence, which are essential to the estimation of termhood, are captured using new measures derived from the prevalence and tendency measures described above. The linguistic evidence and the corresponding derived measures are:

• Candidate evidence, in the form of the Discriminative Weight (DW), is measured as the product of the domain tendency and the prevalence of term candidates. This evidence constitutes the first step in an attempt to isolate domain-relevant candidates from general ones.
• Modifier evidence, in the form of the Modifier Factor (MF), is obtained by computing DT using the cross-domain cumulative frequency of the modifiers of complex terms.
• Contextual evidence, in the form of the Average Contextual Discriminative Weight (ACDW), is computed using the cumulative DW of context words, scaled according to their semantic relatedness with the corresponding term candidates. ACDW is later adjusted with respect to the DW of the term candidate to obtain the Adjusted Contextual Contribution (ACC), which reflects the reliability of the contextual evidence.
Given a list of term candidates (both simple and complex), TC = {a1,...,an}, we need to identify the m most suitable candidates as terms t ∈ T. This new scheme requires a corpus containing text generated using the special language of the target domain (i.e. the domain corpus), denoted as d, and a set of corpora produced using the special languages of domains other than the target domain (i.e. the contrastive corpora), denoted as d̄. In subsequent usage, domains other than the target domain are collectively referred to as "non-domain". Certain researchers refer to the contrastive corpora as a general corpus; due to the difficulty of justifying what constitutes "general", we settled on the more definable name "contrastive". Each complex term a comprises a head ah and modifiers m ∈ Ma. Each term candidate is assigned a weight depending on its type (i.e. simple or complex). We refer to this initial weight as Domain Prevalence (DP) for its ability to capture the extent to which a candidate occurs in the target domain d. DP attempts to capture the requirement in Definition
3.3.2, namely, that a term may be relevant to a domain if it occurs frequently in that domain. If a is a simple term, its DP is defined as:

$$
DP(a) = \log_{10}(f_{ad} + 10)\,\log_{10}\!\left(\frac{F_{TC}}{f_{ad} + f_{a\bar{d}}} + 10\right)
\tag{13}
$$

where $F_{TC} = \sum_{j} f_{jd} + \sum_{j} f_{j\bar{d}}$ is the sum of the frequencies of occurrence of all a ∈ TC in both the domain and the contrastive corpora, while $f_{ad}$ and $f_{a\bar{d}}$ are the frequencies of occurrence of a in the domain corpus and the contrastive corpora, respectively. The use of log10 and the addition of 10 are for smoothing purposes. If the term candidate is complex, we define its DP as:

$$
DP(a) = \log_{10}(f_{ad} + 10)\, DP(a_h)\, MF(a)
\tag{14}
$$
Note that the DP of the head of a, rather than of a itself, is used in the computation of DP for complex terms. As with the original CW in Equation 3, the rarity of complex term occurrences does not allow proper computation of the weight. Unlike the original CW, however, we take the log of the domain frequency of complex terms: we have noticed that extremely common, general complex terms like "New York" or "Los Angeles" would otherwise distort the weights and give a false impression of their importance in the domain. On the other hand, taking the log of the domain frequency of complex terms results in DP(a) < DP(ah) for extremely rare complex candidates. This contradicts Definitions 2.4 and 2.5, which state that complex terms provide better-quality terminology due to their less ambiguous nature and hence should yield a higher weight than their polysemous heads. Consequently, besides the addition of the constant 10 to fad prior to taking the log, we introduce another new measure called the Modifier Factor (MF) in order to:

• Provide relevant complex terms with higher weights than their heads.
• Penalise potentially deceiving domain-unrelated complex terms whose heads are domain-related. For example, the head "virus" will yield a high weight in the "technology" domain. If we did not take into consideration the fact that the head is modified by "H5N1" to form the complex term "H5N1 virus", the complex term would make its way into the list of terms for the "technology" domain.
• Compensate for the low weight of domain-related complex terms caused by their domain-unrelated heads. For example, upon looking at the head "account", one would immediately rule the complex term out as irrelevant, since occurrences of "account" will be higher in other domains such as "finance" and "business". When we take into consideration the modifiers "Google" and "Gmail", we can safely assign a higher weight to the complex term "Google Gmail account".
The cumulative domain frequency and general frequency of the modifiers provide us with the ability to measure the tendency of the modifiers of a complex term. A high modifier factor MF will assist in raising the DP of complex terms whose modifiers have a collective tendency towards the target domain, hence realising Definition 2.5. Let Ma ∩ TC be the set of modifiers of a which also happen to be term candidates; we define the MF of a complex term a as:

$$
MF(a) = \log_{2}\!\left(\frac{\sum_{m \in M_a \cap TC} f_{md} + 1}{\sum_{m \in M_a \cap TC} f_{m\bar{d}} + 1} + 1\right)
\tag{15}
$$
In addition to MF, we propose to incorporate a powerful discriminating function to help distinguish the specificity of candidates with respect to the target domain d. The original contrastive weight CW does not tell us anything about the inclination of terms to be used for domain versus non-domain purposes; it merely measures the prevalence or significance of a term in the domain, as explained earlier. We propose a new measure, Domain Tendency (DT), which realises Definition 3.1, namely, the tendency of candidate a to occur more frequently in d than in d̄. DT is based on the ratio between the frequency of a in the target domain d and the frequency of a in the contrastive domains d̄. We define DT as:

$$
DT(a) = \log_{2}\!\left(\frac{f_{ad} + 1}{f_{a\bar{d}} + 1} + 1\right)
\tag{16}
$$

where $f_{ad}$ is the frequency of occurrence of a in the domain corpus, while $f_{a\bar{d}}$ is the frequency of occurrence of a in the contrastive corpora. If term candidate a is equally common in both the domain and the non-domain corpora, then DT = 1. If the usage of a is more inclined towards the target domain, $f_{ad} > f_{a\bar{d}}$, then DT > 1, and DT < 1 otherwise.
TIME => {time_period#n#1}
ORGANIZATION => {organization#n#1}
PERSON => {person#n#1}
LOCATION => {location#n#1}
CURRENCY:MEASUREMENT => {monetary_unit#n#1}
3.6 The re-ranking algorithm Instances are generated for each pair of question Q and passage Psg, where Q is a question from TREC and Psg is one of the 100 top-ranked passages returned by the baseline ranking algorithm for question Q. Each instance consists of 3 features. The three features, viz., ASim, MinDist and LuceneRank are generated from the pair (Q,Psg) as described previously.
(Q, Psg) instances used for training the log-linear re-ranker are additionally assigned class labels. A (Q, Psg) instance is assigned the class label C+ if Psg contains the TREC-determined answer for Q; otherwise it is assigned the class label C-. Given an unseen question Q', instances of the form (Q', Psg) are generated for each of the 100 top-ranked passages from the baseline system; the class labels of these instances are, however, unknown. The log-linear classifier is used to compute Pr(C+|Q', Psg) for each instance, and the passages are ranked in decreasing order of their Pr(C+|Q', Psg) values. The higher the rank of a passage, the higher its expected relevance to the question. The complete system is shown in Figure 14.
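A compact sketch of this train-then-re-rank loop is given below. The chapter's system is implemented in Java on top of WEKA and Lucene; the snippet here only illustrates the same idea with a generic logistic-regression implementation (scikit-learn) and made-up values for the three features ASim, MinDist and LuceneRank.

```python
# Illustrative re-ranking step: train a logistic-regression (log-linear) classifier
# on labelled (Q, Psg) instances, then sort unseen passages by Pr(C+ | Q', Psg).
from sklearn.linear_model import LogisticRegression

# Training instances: one row per (Q, Psg) pair with [ASim, MinDist, LuceneRank],
# labelled 1 (C+, passage contains the answer) or 0 (C-). Values are hypothetical.
X_train = [[0.82, 3, 1], [0.10, 25, 40], [0.55, 7, 5], [0.05, 60, 80]]
y_train = [1, 0, 1, 0]

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Unseen question Q': feature vectors for its top-ranked baseline passages
# (only three shown). Rank them by decreasing probability of class C+.
passages = {"psg_17": [0.70, 4, 2], "psg_3": [0.20, 30, 11], "psg_52": [0.40, 9, 6]}
scored = {pid: clf.predict_proba([feats])[0][1] for pid, feats in passages.items()}
reranked = sorted(scored, key=scored.get, reverse=True)
print(reranked)  # passage ids in their new, re-ranked order
```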
3.7 Experimental evaluation
In this section, we present experimental results for the NsySQLQA question answering algorithm. We used Lucene for the baseline ranking algorithm and WEKA's (Witten and Frank, 2000) log-linear classifier for re-ranking. Our code is entirely in Java. For each year of TREC data, our total data preparation and training time on a dual 1.3GHz Xeon server was several hours, most of it spent in tagging the top passages returned by Lucene. WEKA itself took only several minutes. Typically, each query could be answered in under 30 seconds. Once we set up a Logistic Regression-based learner for the re-ranking step, a natural first measurement to make is the accuracy on cross-validation data. Reassuringly, the re-ranking classifier can reject negative (non-answer) passages well, but suffers from low recall owing partly to the large number of negative training instances. Table 7 gives the performance of the log-linear classifier.
Figure 14. The Complete System using our approach
Figure 15. Re-ranking significantly boosts MRR. The x-axis is the rank at which the (first) answer is found; the y-axis is the log of the number of questions for which the first answer was found at the given rank. Here we re-ranked only the top 100 Lucene-returned passages.
3.7.1 System performance
Because our data is highly skewed (having many times more negative instances than positive ones), this accuracy figure is a biased indicator. A closer look at Table 7 shows that the input skew makes us much better at weeding out bad passages than at preserving the good ones (those with the answers). Luckily, in our application we need not throw away any passages4; we simply sort by the score returned by Logistic Regression. Our punch-line is that re-ranking boosts MRR to levels enjoyed by only the best third of recent TREC QA participants. Before we get into specific numbers, we show in Figure 15 a histogram of the number of answer passages found at specific ranks by a baseline system (before re-ranking) and after passage re-ranking. It is immediately obvious that re-ranking eliminates many non-answers, pushing answers to better ranks. Does re-ranking benefit all kinds of queries equally? To study this, we divided queries into categories based on their starting words (what, which, how, ...), which are good indicators of question type, and
Table 7. The re-ranking classifier can reject negative (non-answer) passages well, but suffers from low recall owing partly to the large number of training instances labeled C-. Label C+ means the passage has an answer, C- means it does not. We show representative confusion matrices and mean and standard deviation of recall, precision and F1 scores over 5 runs, each using % of the labeled QA instances sampled u.a.r as training data
Table 8. Mean and standard deviation of MRRs after re-ranking by logistic regression (LR), calculated over five runs, each using a random % sample of labeled data for training. MRR obtained using LR re-ranking exceeds baseline MRR obtained using traditional IR ranking by 86% and 127%

Passage ranked by               TREC 2000       TREC 2002
Baseline ranking MRR (Lucene)   0.377           0.249
Re-ranking MRR                  0.71 ± 0.001    0.565 ± 0.001
Table 9. Recent best MRRs at TREC, showing that our system compares favorably

System                   Corpus       MRR
FALCON                   TREC 2000    0.76
U. Waterloo              TREC 2000    0.46
Queens College, CUNY     TREC 2000    0.46
Webclopedia              TREC 2000    0.31
plotted their group MRRs before and after re-ranking in Figure 16. MRR improvement via re-ranking is widespread across question types, but is particularly high for how long and how many queries, where the strongly typed nature of the answer helped re-ranking a great deal. In the endgame, we concentrated on the following features for re-ranking: ASim, MinDist and LuceneRank. Having set up a Logistic Regression (LR) classifier for the re-ranking step, a natural first measurement to make is its accuracy. Sampling training instances and validating on held-out labeled data shows that the classifier learns quickly, settling near its best accuracy within only 2-4% of the available training data. Our MRR scores are shown in Table 8, along with the MRR for the tf-idf baseline ranking algorithm and the baseline obtained by applying a boost factor of 2 to selectors in the query. For comparison, some of the top MRRs from the TREC 2000 competition are shown in Table 9. Our TREC 2000 score compares favorably. (In later years, TREC replaced MRR with a direct assessment of the pinpointed answer zone, which is beyond our scope.)
Figure 16. Sample MRR improvement via re-ranking separated into question categories
Table 10. Various MRR@in values with different TREC datasets

Trained on           Tested on    Re-ranked MRR@in
2002                 2002         0.565 ± 0.001
2000                 2002         0.539 ± 0.004
2000+2001            2002         0.534 ± 0.003
2000+2001+1999       2002         0.541 ± 0.003
2000                 2000         0.710 ± 0.001
2002                 2000         0.705 ± 0.001
2002+2001            2000         0.627 ± 0.002
2002+2001+1999       2000         0.693 ± 0.001
MRR@in (i.e. MRR-at-infinity) values obtained with various training and test datasets are tabulated in Table 10. We trained and tested on datasets from different years to ensure that the system generalizes well. The MRR values are computed taking into account all passages which contain the answer up to a rank of 100 (ideally this should be infinity). Table 11 gives the MRR@5 values for TREC 2000 and TREC 2002 trained and tested on the same corpus. For computing MRR@5, only passages up to rank 5 are considered; passages that contain the answer but lie below rank 5 are discarded. How does the NsySQLQA re-ranker compare against the BayesQA re-ranker that was described in Section 2? To compare performances, we trained the NsySQLQA re-ranker on TREC2001 question-answer pairs and evaluated its performance on the TREC1999, TREC2000 and TREC2002 data sets. Experiments on re-ranking using BayesQA were also performed on the three data sets. The comparison of MRRs obtained using BayesQA and NsySQLQA is presented in Table 12. It is evident that NsySQLQA shows a significant performance improvement over BayesQA.
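For reference, MRR@k simply truncates the reciprocal-rank computation at rank k, with MRR@in using all available ranks (here capped at the 100 re-ranked passages). A small sketch with hypothetical rank lists is shown below; it uses the rank of the first answer-bearing passage per question.

```python
# Mean Reciprocal Rank truncated at rank k: for each question, take 1/r where r is
# the rank of the first answer-bearing passage, or 0 if no answer appears within k.
def mrr_at_k(first_answer_ranks, k):
    total = 0.0
    for r in first_answer_ranks:       # r is None when no passage contains the answer
        if r is not None and r <= k:
            total += 1.0 / r
    return total / len(first_answer_ranks)

# Hypothetical ranks of the first answer-bearing passage for five questions.
ranks = [1, 2, 7, None, 3]
print(mrr_at_k(ranks, 5))      # MRR@5: the rank-7 answer is discarded
print(mrr_at_k(ranks, 100))    # MRR@in proxy: all ranks up to 100 count
```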
Table 11. MRR@5 values do not differ much from their MRR@in counterparts

Trained on    Tested on    Re-ranked MRR@5
2002          2002         0.553 ± 0.002
2000          2000         0.706 ± 0.002
Table 12. Comparison of NsySQLQA with BayesQA for a trained BBN (Section 2.3) on three data sets

System        TREC1999    TREC2000    TREC2001
BayesQA       0.467       0.411       0.373
NsySQLQA      0.671       0.705       0.561
References

Abe, Naoki, and Li, Hang (1996). Learning word association norms using tree cut pair models. In Proceedings of the 13th International Conference on Machine Learning.

Abney, S., Collins, M., and Singhal, A. (2000). Answer extraction. In ANLP.

Agichtein, Eugene, Lawrence, Steve, and Gravano, Luis (2001). Learning search engine specific query transformations for question answering. In Proceedings of the 10th World Wide Web Conference (WWW10), pages 169-178.

Agirre, E., and Rigau, G. (1996). Word-sense disambiguation using conceptual density. In Proceedings of the International Conference on Computational Linguistics (COLING-96).

Androutsopoulos, I., Ritchie, G. D., and Thanisch, P. (1995). Natural language interfaces to databases: an introduction. Journal of Language Engineering, 1(1):29-81.

Apache Software Group (2002). Jakarta Lucene text search engine. GPL Library. http://jakarta.apache.org/lucene/.

Bhalotia, Gaurav, Nakhe, Charuta, Hulgeri, Arvind, Chakrabarti, Soumen, and Sudarshan, S. (2002). Keyword searching and browsing in databases using BANKS. In ICDE.

Budanitsky, A. (2001). Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the Workshop on WordNet and Other Lexical Resources, North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA.

Clarke, C., Cormack, G., Kisman, D., and Lynam, T. (2000). Question answering by passage selection (MultiText experiments for TREC9). In E. Voorhees and D. Herman, editors, The Ninth Text REtrieval Conference (TREC 9), NIST Special Publication 500-249, pages 673-683. NIST.

Clarke, C. L. A., Cormack, Gordon V., and Lynam, Thomas R. (2001a). Exploiting redundancy in question answering. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 358-365. ACM Press.

Clarke, C. L. A., Cormack, Gordon V., and Lynam, Thomas R. (2001b). Exploiting redundancy in question answering. In Research and Development in Information Retrieval, pages 358-365.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1976). Maximum likelihood from incomplete data via the EM algorithm. Proceedings of the Royal Statistical Society, pages 1-38.

Dill, Stephen, Eiron, Nadav, Gibson, David, Gruhl, Daniel, Guha, R., Jhingran, Anant, Kanungo, Tapas, Rajagopalan, Sridhar, Tomkins, Andrew, Tomlin, John A., and Zien, Jason Y. (2003). SemTag and Seeker: bootstrapping the semantic Web via automated semantic annotation. In The Twelfth International World Wide Web Conference.

Dumais, Susan, Banko, Michele, Brill, Eric, Lin, Jimmy, and Ng, Andrew (2002). Web question answering: Is more always better? In SIGIR, pages 291-298, August.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database, chapter Using WordNet for Text Retrieval, pages 285-303. The MIT Press: Cambridge, MA.
Goldman, Roy, Shivakumar, Narayanan, Venkatasubramanian, Suresh, and Garcia-Molina, Hector (1998). Proximity search in databases. In Ashish Gupta, Oded Shmueli, and Jennifer Widom, editors, VLDB'98, Proceedings of the 24th International Conference on Very Large Data Bases, August 24-27, 1998, New York City, New York, USA, pages 26-37. Morgan Kaufmann.

Harabagiu, S., Moldovan, D., Pasca, M., Mihalcea, R., Surdeanu, M., Bunescu, R., Girju, R., Rus, V., and Morarescu, P. (2000a). FALCON: Boosting knowledge for answer engines. In Proc. of the Ninth Text REtrieval Conference (TREC 9), pages 479-488.

Harabagiu, S., Moldovan, D., Pasca, M., Mihalcea, R., Surdeanu, M., Bunescu, R., Girju, R., Rus, V., and Morarescu, P. (2000b). FALCON: Boosting knowledge for answer engines. In Proc. of the Ninth Text REtrieval Conference (TREC 9), pages 479-488.

Hovy, E., Gerber, L., Hermjakob, U., Junk, M., and Lin, C.-Y. (2000). Question answering in Webclopedia. In Proceedings of the TREC-9 Conference. NIST, Gaithersburg, MD.

Hristidis, Vagelis, Papakonstantinou, Yannis, and Balmin, Andrey (2002). Keyword proximity search on XML graphs. In Proceedings of the 18th International Conference on Data Engineering, San Jose, USA.

Lam, Shyong K., Pennock, David M., Cosley, Dan, and Lawrence, Steve (2003). 1 billion pages = 1 million dollars? Mining the Web to play "Who Wants to Be a Millionaire?". In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, UAI 2003, Acapulco, Mexico, August 8-10.

Lewis, David D., and Sparck Jones, Karen (1996). Natural language processing for information retrieval. Communications of the ACM, 39(1):92-101.

Lin, D. (1998). An information-theoretic definition of similarity. In Proc. 15th International Conf. on Machine Learning, pages 296-304, San Francisco, CA. Morgan Kaufmann.

Mason, Oliver (1998). QTag - a portable POS tagger. Technical report.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. J. (1990). Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4):235-244.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Radev, Dragomir, Fan, Weiguo, Qi, Hong, Wu, Harris, and Grewal, Amardeep (2002). Probabilistic question answering on the Web. In World Wide Web Conference, pages 8-419.

Ramakrishnan, Ganesh, and Bhattacharyya, Pushpak (2003). Text Representation with WordNet Synsets. In Eighth International Conference on Applications of Natural Language to Information Systems (NLDB), pages 214-227.

Ramakrishnan, Ganesh, Jadhav, Apurva, Joshi, Ashutosh, Chakrabarti, Soumen, and Bhattacharyya, Pushpak (2003a). Question Answering via Bayesian Inference on Lexical Relations. In Proceedings of ACL 2003, Workshop on Question Answering and Text Summarization, Sapporo, Japan.

Ramakrishnan, Ganesh, Paranjpe, Deepa, Srinivasan, Sumana, and Chakrabarti, Soumen (2003b). Passage Scoring for Question Answering via Bayesian Inference on Lexical Relations.
Ramakrishnan, Ganesh, Prithviraj B., and Bhattacharyya, Pushpak (2004a). A Gloss-Centered Algorithm for Disambiguation. In Rada Mihalcea and Phil Edmonds, editors, Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 217-221, Barcelona, Spain, July. Association for Computational Linguistics.

Ramakrishnan, Ganesh, Chakrabarti, Soumen, Paranjpe, Deepa, and Bhattacharyya, Pushpak (2004b). Is Question Answering an Acquired Skill? In Proceedings of the 13th Conference on World Wide Web, New York, U.S.A.

Ramakrishnan, Ganesh, Chitrapura, Krishna Prasad, Krishnapuram, Raghuram, and Bhattacharyya, Pushpak (2005a). A Model for Handling Approximate, Noisy or Incomplete Labeling in Text Classification. In The 13th International Conference on Machine Learning.

Ramakrishnan, Ganesh, Paranjpe, Deepa, and Dom, Byron (2005b). A Structure-sensitive Framework for Text Categorization. In ACM Conference on Information and Knowledge Management, Bremen, Germany.

Ramakrishnan, Ganesh (2005). Bridging Chasms in Text Mining Using Word and Entity Associations. PhD thesis, Indian Institute of Technology, Bombay.

Ratnaparkhi, Adwait (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, May 17-18, 1996. University of Pennsylvania.

Sanderson, Mark (1994). Word sense disambiguation and information retrieval. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 49-57, Dublin, IE.

Tellex, Stefanie, Katz, Boris, Lin, Jimmy, Fernandes, Aaron, and Marton, Gregory (2003). Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41-47. ACM Press.

Voorhees, Ellen (2000). Overview of the TREC-9 question answering track. Text REtrieval Conference 9.

Witten, Ian H., and Frank, Eibe (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers Inc.

Yarowsky, D. (1992). Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 454-460, Nantes, France.

Zheng, Zhiping (2002). AnswerBus question answering system. In Human Language Technology Conference (HLT).
Key Terms

BayesQA: Question Answering using Bayesian Inferencing on Lexical relations (Section 1.2) using the Bayesian Network Model BayesWN.

BayesWN: Construction of a Bayesian Network from WordNet lexical relations and training of the Bayesian Network in a semi-supervised manner from raw text.

Focus Word: The word in 'what', 'which' and 'name' type questions that specifies what the type of the answer should be.

NsySQLQA: A question answering technique that recovers from the question fragments of what might have been posed as a structured query (SQL), and extracts answer passages based on these fragments (Section 1.3.1).

Selector: A word in a question that is a ground constant and is expected to appear as-is in an answer passage.
Endnotes

1. http://answers.google.com/answers
2. http://trec.nist.gov/data/qa.html
3. http://www.cyc.com/
4. This is not quite true; some TREC questions test if the system can return an empty set. But in that case we can always do a recall/precision trade-off and cross-validate it.
Chapter XXXIV
The Scent of a Newsgroup:
Providing Personalized Access to Usenet Sites through Web Mining Giuseppe Manco Italian National Research Council, Italy Riccardo Ortale University of Calabria, Italy Andrea Tagarelli University of Calabria, Italy
Abstract Personalization is aimed at adapting content delivery to users’ profiles: namely, their expectations, preferences and requirements. This chapter surveys some well-known Web mining techniques that can be profitably exploited in order to address the problem of providing personalized access to the contents of Usenet communities. We provide a rationale for the inadequacy of current Usenet services, given the actual scenario in which an increasing number of users with heterogeneous interests look for information scattered over different communities. We discuss how the knowledge extracted from Usenet sites (from the content, the structure and the usability viewpoints) can be suitably adapted to the specific needs and expectations of each user.
Introduction
The term knowledge discovery in databases usually refers to the (iterative and interactive) process of extracting valuable patterns from massive volumes of data by exploiting data mining algorithms. In general, data mining algorithms find hidden structures, tendencies, associations and correlations
among data, and mark significant information. An example of a data mining application is the detection of behavioural models on the Web. Typically, when users interact with a Web service (available from a Web server), they provide enough information about their requirements: what they ask for, what experience they gain in using the service, and how they interact with the service itself. Thus, the possibility of tracking users' browsing behaviour offers new perspectives of interaction between service providers and end-users. Such a scenario is one of the several perspectives offered by Web mining techniques, which consist of applying data mining algorithms to discover patterns from Web data. Web mining techniques can be classified into three main categories:

• Structure mining: The intent here is to infer information from the topology of the link structure among Web pages (Dhyani et al., 2002). This kind of information is useful for a number of purposes: categorization of Websites, gaining an insight into the similarity relations among Websites, and developing suitable metrics for the evaluation of the relevance of Web pages.
• Content mining: The main aim is to extract useful information from the content of Web resources (Kosala & Blockeel, 2000). Content mining techniques can be applied to heterogeneous data sources (such as HTML/XML documents, digital libraries, or responses to database queries), and are related to traditional Information Retrieval techniques (Baeza-Yates & Ribeiro-Neto, 1999). However, the application of such techniques to Web resources allows the definition of new challenging application domains (Chakrabarti, 2002): Web query systems, which exploit information about the structure of Web documents to handle complex search queries; and intelligent search agents, which work on behalf of users based both on a description of their profile and on specific domain knowledge for suitably mining the results that search engines provide in response to user queries.
• Usage mining: The focus here is the application of data mining techniques to discover usage patterns from Web data (Srivastava et al., 2000) in order to understand and better serve the needs of Web-based applications and end-users. Web access logs are the main data source for any Web usage mining activity: data mining algorithms can be applied to such logs in order to infer information describing the usage of Web resources. Web usage mining is the basis of a variety of applications (Cooley, 2000; Eirinaki & Vazirgiannis, 2003), such as statistics for the activity of a Website, business decisions, reorganization of the link and/or content structure of a Website, usability studies, traffic analysis and security.
Web-based information systems represent a typical application domain for the above Web mining techniques, since they allow the user to choose contents of interest and browse through such contents. As the number of potential users progressively increases, users exhibit a large heterogeneity in interests and in knowledge of the domain under investigation. Therefore, a Web-based information system must tailor itself to different user requirements, as well as to different technological constraints, with the ultimate aim of personalizing and improving the users' experience in accessing the system. Usenet turns out to be a challenging example of a Web-based information system, as it encompasses a very large community, including government agencies, large universities, high schools, and businesses of all sizes. Here, newsgroups on new topics are continuously generated, new articles are continuously posted, and (new) users continuously access the newsgroups looking for articles of interest. In such a context, the idea of providing personalized access to the contents of Usenet articles is quite attractive, for a number of reasons.
1. First of all, the hierarchy provided by the newsgroups is often inadequate, since too many newsgroups deal with overlapping subjects, and even a single newsgroup dealing with a specific topic may contain rather heterogeneous threads of discussion. As an example, the newsgroup comp. lang.java.programmer (which should deal with programming issues in the Java programming language) contains threads that can be grouped in different subtopics, dealing respectively with “typing,” “networking,” “debugging,” and so forth. As a consequence, when looking for answers to a specific query one can frequently encounter the abundance problem, which happens when too many answers are available and the degree of relevance of each answer has to be suitably weighted. By contrast, one can even encounter the scarcity problem, which usually happens when the query is too specific and many viable answers are missed. Hence, articles in the hierarchy provided by the newsgroups need better (automatic) organization according to their contents, in order to facilitate the search and detection of relevant information. 2. Answers to specific threads, as well as references to specific articles, can be analysed from a structural point of view. For example, the graph structure of the accesses to articles (available from users’ access logs) can be investigated to allow the identification of hub and authoritative articles, as well as users with specific areas of expertise. Thus, the analysis of the graphs devised from both accesses and reactions to specific articles is clearly of greater impact to the purpose of providing personalized access to the Usenet service. 3. Since users can be tracked, their preferences, requirements and experiences in accessing newsgroups can be evaluated directly from the access logs. As a consequence, both available contents and their presentation can be adapted according to the user’s profile, which can be incrementally built as soon as the user provides sufficient information about his/her interaction with the service. Moreover, the experience provided by the interaction of a given user can be adapted to users exhibiting a similar profile, thus enabling a collaborative system in which experiences are shared among users. As a matter of fact, Usenet and the Web share strong similarities. Hence, the application of wellknown Web mining techniques from literature to Usenet should, in principle, allow similar benefits of mining traditional Web contents. Nevertheless, the application of the above techniques must take into account the intrinsic differences between the two contexts. To this purpose, the contribution of this chapter is the analysis and, where necessary, the revising of suitable Web mining techniques with the aim of tailoring them to the Usenet environment. The goal of this chapter is to survey some well-known Web mining techniques that can be profitably exploited to address the problem of providing personalized access to the contents of Usenet communities available from the Web. 
In such a context, the problem of providing personalized access to the Usenet communities can be ideally divided into three main phases: user profiling, the process of gaining an insight about the preferences and tastes of Usenet visitors through an in-depth analysis of their browsing behaviour; content profiling, the process of gaining an insight about the main issues and topics appearing in the content and in the structure of the articles; and personalization, the adoption of ad-hoc strategies to tailor the delivery of Usenet contents to a specific profile. For brevity’s sake, only the most prominent adaptations required to apply these techniques to the Usenet environment are illustrated in detail. We mainly point out the possibility of accessing newsgroups from a Web interface (such as, e.g., in groups.google.com), and envisage a Web-based service capable of providing personalized access by exploiting Web mining techniques. A Web-enabled Usenet
access through a Web server has some major advantages with respect to the traditional access provided by NNTP (Net News Transfer Protocol) servers. In particular, from a user point of view, a Web-based service exhibits the advantage of offering ubiquitous and anytime access through the Web, without the need of having a predefined client other than a Web browser. In addition, a user can fruitfully exploit further consolidated Web-based services (such as, e.g., search engine capabilities). From a service provider point of view, it offers the possibility of tracking a given user in his/her interaction with the service for free, by exploiting traditional Web-based techniques. Throughout this chapter, a particular emphasis is given on clustering techniques tailored for personalization. Although a number of different approaches exist (Deshpande & Karypis, 2001; Mobasher et al., 2001), we believe that those based on clustering are particularly promising since they effectively allow the partition of users on the basis of their browsing behaviours, which reliably depict user profiles. Moreover, since clusters provide objective descriptions for the corresponding profiles, such an approach is also able to act on the process of content delivery in order to address the specificities of each profile: any change in the details of a particular profile is soon captured by clustering and, consequently, exploited in the delivery process.
Information Content of a Usenet Site Usenet operates in a peer-to-peer framework that allows the users to exchange public messages on a wide variety of topics including computers, scientific fields, politics, national cultures, and hobbies. Differing from e-mail messages, Usenet articles (formatted according to the RFC-1036 Usenet news standard) are concerned with public discussions rather than personal communications and are grouped, accord-
Figure 1. Example of a Web interface to Usenet News
ing to their main subject, into newsgroups. Practically speaking, newsgroups are collections of articles sharing the same topics. Newsgroups can be organized into hierarchies of topics. Information about the hierarchy can be found in the newsgroup names themselves, which contain two or more parts separated by periods. The first part of the name indicates the top-level hierarchy to which the newsgroup belongs (the standard “Big Seven” top-level hierarchies are: comp, misc, news, rec, sci, soc, talk). Reading from left to right, the various parts of the name progressively narrow the topic of discussion. For example, the newsgroup comp.lang.java.programmer contains articles discussing programming issues concerning the Java programming language. Notice that, although the newsgroup hierarchy is highly structured, articles can be posted to multiple newsgroups. This usually happens when articles contain more than one topic of discussion: for example, an article concerning the use of an Oracle JDBC Driver can be posted to both comp.lang.java.programmer and comp.databases.oracle. In our scenario, Usenet articles can be accessed by means of a Web server, which allows both the visualization of articles, and the navigation of newsgroup hierarchies. An example of such an interface is provided in Figure 1. Users can navigate the hierarchy of newsgroups by accessing highly structured pages, in which hyperlinks point to specific portions of the hierarchy. A crucial aspect in the information delivery process is the quality of information itself. Information quality is guaranteed neither in a typical Web setting nor in a Usenet scenario. However, while information related to a given topic of interest is scattered on the Web, Usenet simplifies the process of information locating. Indeed, articles are typically grouped in threads: a thread contains an initial article, representing an argument of discussion raised by some user, and a chain of answers to the initial article. Users only have to look within those threads semantically related to the topic under investigation. This allows us to narrow the search space and, as a consequence, to obtain a recall higher than the one typically achievable by traditional information retrieval approaches on the Web. Also, the process of thread selection corresponds to the sense disambiguation of the topic of interest. This, in turn, leads to an increase in the degree of accuracy of the inferred knowledge. Moreover, since Usenet allows the user to focus on a specific set of threads, the possibility of the individual news articles having been written by authoritative people is not less likely than it is on the Web (provided that an effective technique for narrowing the search space, depending on the specific topic of interest, can be devised for the Web itself). Yet, Usenet is a source of information whose inferable knowledge is often complementary with respect to the contents within Web pages. As an example, consider the scenario in which a user is interested in finding detailed information on a specific topic like “using Java Native Interface and Java Servlets running under Apache Tomcat”. Querying and posting to Usenet newsgroups allows the satisfaction of the specific request. This ultimately corresponds to benefit from the depicted collaborative environment, by obtaining both references to dedicated Websites and reports from other users’ experiences. 
Two main information sources can be detected in a Usenet environment: access logs, describing users’ browsing behaviour, and Usenet repositories, representing the articles a user may access. We now analyse such information sources in more detail.
Web Logs: Mechanisms for Collecting User Clickstream Usage data are at the basis of any Web usage mining process. These data can be collected at different levels on an ideal communication channel between a generic user and the Website currently accessed: at client level, proxy-server level, and Website level. Sources at different levels take into account different segments of Web users and, as a consequence, highlight distinct browsing patterns (Cooley,
2000). Precisely, sources at the client level focus on single user/single site (or even single user/multi site) browsing behaviour. Information related to multi user/multi site navigational behaviour is collected at the proxy-server level. Finally, sources at the Website level reveal usage patterns of multi user/single site type. Since the user profiling phase aims to gain an insight into the browsing strategies of visitors, we focus on the sources of usage data at Website level. It is worth noticing that in such a context usage mining techniques are only exploited to silently infer user preferences from their browsing sessions. This finally does not raise privacy violation issues: since users are not tracked at the client side, their identity is unknown and any invasive technology, such as cookies, is not leveraged to try to recognize the user across their repeated transactions. Furthermore, information extracted from browsing data is mediated on a huge number of user sessions, which makes the proposed Web usage mining techniques non-invasive of user privacy. Web logs represent a major approach to tracking user behaviour. By recording an incoming request to a Web server, Web logs explicitly capture visitors’ browsing behaviour in a concurrent and interleaved manner. A variety of ad-hoc formats have been devised for organizing data within Web logs: two popular formats are CLF (Common Log Format) and ECLF (Extended Common Log Format). ECLF, recently introduced by the W3C, improves CLF by adding a number of new fields to each entry that have been revealed as being particularly useful for demographic analysis and log summaries (Eirinaki & Vazirgiannis, 2003). ECLF logs can also include cookies, that is, pieces of information uniquely generated by Web servers to address the challenging task of tracking users during their browsing activities. The main limitation that affects Web logs is that they do not allow for the reliable capture of user behaviour. For instance, the viewing time perceived at the Website level may be much longer than it actually is at the client level. This is typically due to a number of unavoidable reasons, such as the client bandwidth, the transmission time necessary for the Web server to deliver the required resource and the congestion status of the network. Also, Web logs may not record some user requests. Such a loss of information usually happens when a user repeatedly accesses the same page and caching is present. Typically, only the first request is captured; subsequent requests may be served by a cache, which can be either local to the client or part of an intermediate proxy-server. These drawbacks must be taken into account while extracting usage patterns from Web logs: a variety of heuristics have been devised in order to pre-process Web logs in a such a way as to reduce the side effects of the above issues. Data about user behaviour can also be collected through alternative approaches (Cooley, 2000), which allow the overcoming of the traditional limitation affecting the exploitation of Web logs: the impossibility of capturing information other than that in the HTTP header of a Web request. Among them, we mention ad-hoc tracking mechanisms for defining meaningful application-dependent logs describing user browsing activities at the required degree of detail. Indeed, the exploitation of new technologies for developing Web (application) servers (such as, e.g., JSP/PHP/.NET frameworks) leads to two main advantages in processing usage data collections. 
Firstly, it can capture information that is not typically addressed by Web logs, such as request parameters and state variables of an application. Secondly, it guarantees the reliability of the information offered. A format for the generic entry of an application log, tailored to the Usenet environment, can be devised. It may consist of (at least) three fields. User is a field that refers to the identity of the user who made the request: it can either be an IP address or a unique user-id. Time is a time stamp for the request. Request logs the details (such as the level in the newsgroup hierarchy, or the topics) of the request.
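To make the idea concrete, a minimal sketch of such an application-log entry and of how it might be emitted is given below; the field names and the request details are illustrative assumptions, not a prescribed format.

```python
# Sketch of a Usenet-oriented application log entry with the three fields
# discussed above: who made the request, when, and what was requested.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class UsenetLogEntry:
    user: str       # IP address or unique user-id
    time: str       # time stamp of the request
    request: dict   # application-level details of the request

def log_request(user_id, newsgroup, thread=None, article=None):
    entry = UsenetLogEntry(
        user=user_id,
        time=datetime.now(timezone.utc).isoformat(),
        request={
            "newsgroup": newsgroup,                       # e.g. comp.lang.java.programmer
            "hierarchy_level": newsgroup.count(".") + 1,  # depth in the newsgroup hierarchy
            "thread": thread,
            "article": article,
        },
    )
    return json.dumps(asdict(entry))

print(log_request("10.0.0.7", "comp.lang.java.programmer", thread="JNI and servlets"))
```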
Content Data: Finding Structure within News Articles
News articles are the primary information source of Usenet. Unlike generic text documents (to which at first sight they could be assimilated), news articles, like e-mail messages, are characterized by some relevant features:

• Articles are usually rather short. This may look like a sociological observation, since the size of an article depends on a set of factors such as the topic involved, preferred user behaviour, and the degree of interaction required. Nevertheless, newsgroups are conceived as a means of generating discussions, which typically consist of focused questions and (rather) short replies.
• Text documents usually exhibit only unstructured features (words occurring in the text). By contrast, news articles combine important structural properties (such as sender, recipients, date and time, and whether the article represents a new thread or a reaction in a given thread) with unstructured components (the Subject, Content and Keywords headers).
• News articles may contain junk (unsolicited) information. This eventually results in a loss of recall in information delivery and in worse performance due to the waste of bandwidth and server storage space. Although junk information is quite typical in a Web setting, the presence of spam articles is even more problematic in the Usenet environment, mainly due to duplicate copies or conversational threads including many useless articles (e.g., replies). Many techniques and tools for spam filtering have been developed in the last seven years, mainly aimed at classifying messages as either spam or non-spam; among them we mention Androutsopoulos et al. (2000) and Drucker et al. (1999). Clearly, such techniques can be used to improve the quality of the information contained in a Usenet server, thus allowing the concentration on significant data sources only.
We now briefly review how to extract relevant information from a collection of news articles. This requires a study on the representation of both structured and unstructured features of a news article. Concerning text contents, the extraction of relevant features is usually performed by a sequence of well-known text operations (Baeza-Yates & Ribeiro-Neto, 1999; Moens, 2000), such as lexical analysis, removal of stopwords, lemmatisation and stemming. The above text operations associate each article with a set of terms that are assumed to best reflect the textual content of the article. However, such terms have different discriminating power, that is, their relevance in the context where they are used. Many factors may contribute in weighting relevant terms: statistics on the text (e.g., size, number of different terms appearing in), relationships between a term and the document containing it (e.g., location, number of occurrences), and relationships between a term and the overall document collection (e.g., number of occurrences). A commonly used weighting function is based on the finding that the most significant terms are those occurring frequently within a document, but rarely within the remaining documents of the collection. To this purpose, the weight of a term can be described as a combination of its frequency of occurrence within a document (Term Frequency - TF) and its rarity across the whole collection (Inverse Document Frequency - IDF). A widely used model complying with the above notion is the vector-space model (Baeza-Yates & Ribeiro-Neto, 1999), in which each article is represented as a n-dimensional vector w, where n is the number of available terms and each component wj is the (normalized) TF.IDF weight associated with a term j:
$$
w_{ji} = \frac{tf_{ji} \cdot idf_{j}}{\sqrt{\sum_{p} (tf_{pi} \cdot idf_{p})^{2}}}
$$

Table 1. Structured features of news articles

Feature                                                   Type           Source Header
Newsgroup hierarchy (e.g., comp.lang.c)                   categorical    Newsgroups:
Follow-up newsgroup hierarchy                             categorical    Followup-To:
Sender domain (e.g., yahoo.com)                           categorical    From:
Weekday                                                   categorical    Date:
Time period (e.g., early morning, afternoon, evening)     categorical    Date:
Expiration date                                           categorical    Expires:
Geographic distribution (e.g., world)                     categorical    Distribution:
Article length                                            numeric        Lines:
Nr. of levels in newsgroup hierarchy                      numeric        Newsgroups:
It is well known from the literature that this model tends to work quite well in practice despite a number of simplifying assumptions (e.g., term independence, absence of word-sense ambiguity, as well as of phrase structure and word order). From each article, further features can be extracted from both the required headers (such as From, Date, Newsgroups and Path) and the optional headers (such as Followup-To, Summary, References). In particular, discriminating information can be obtained by exploiting the hierarchy of topics inferred from newsgroups, or by analysing the temporal shift of articles for a given topic. Notice also that, in principle, textual contents may contain further information sources, such as, for example, hyperlinks. Therefore a Web interface to a Usenet service may assign each article a unique URL address, thus allowing articles to link to each other directly by means of such URLs. To summarise, an article can be represented by a feature vector x = (y, w) in which the structured component (denoted by y) may comprise, for example, the features reported in Table 1, whereas unstructured information (denoted by w) is mainly obtained from the content and from the Subject header (or from the Summary and Keywords headers as well).
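As a rough illustration of how the structured component y of Table 1 could be derived from an article's headers, the sketch below uses Python's standard email parser on an RFC-822-style message; the chosen time-period boundaries and the handling of missing headers are arbitrary assumptions.

```python
# Extract a few of the structured features from Table 1 out of a raw news article.
from email import message_from_string
from email.utils import parsedate_to_datetime

RAW = """From: alice@example.org
Newsgroups: comp.lang.java.programmer,comp.databases.oracle
Date: Mon, 05 Jan 2004 09:12:00 +0000
Subject: Oracle JDBC driver question
Lines: 42

Which JDBC driver should I use with Oracle 9i?
"""

def structured_features(raw_article):
    msg = message_from_string(raw_article)
    groups = (msg.get("Newsgroups") or "").split(",")
    when = parsedate_to_datetime(msg.get("Date"))
    return {
        "newsgroup_hierarchy": groups[0],
        "sender_domain": (msg.get("From") or "").split("@")[-1],
        "weekday": when.strftime("%A"),
        "time_period": "morning" if when.hour < 12 else "afternoon/evening",
        "article_length": int(msg.get("Lines", 0)),
        "hierarchy_levels": groups[0].count(".") + 1,
    }

print(structured_features(RAW))
```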
Mining Content of News Articles
A straightforward personalization strategy consists of providing search capabilities within Usenet news. As a matter of fact, many Web interfaces to Usenet servers provide keyword-based querying capabilities and ranking mechanisms for the available articles. In general, queries involve relationships between terms and documents (such as, e.g., "find articles dealing with Java Native Interface"), and can be modelled as vectors in the vector-space model. Hence a typical ranking mechanism may consist of computing the score of each article in a collection as the dot product between its representative vector w and the vector q representing the query: rq,w = w · q. Articles with a high rank constitute the answer to the query.
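The sketch below illustrates this keyword-based ranking: articles are turned into normalized TF-IDF vectors following the weighting scheme above, and are then scored against the query by the dot product rq,w = w · q. The tiny corpus, the raw-count query representation and the absence of the text-preprocessing steps (stopword removal, stemming) are simplifications.

```python
# Normalized TF-IDF article vectors plus dot-product scoring against a query.
import math
from collections import Counter

docs = {
    "a1": "java native interface tutorial for servlets",
    "a2": "oracle jdbc driver configuration",
    "a3": "java servlets under apache tomcat",
}

def tfidf_vectors(texts):
    tokenized = {d: t.split() for d, t in texts.items()}
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    n = len(texts)
    vectors = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        w = {t: tf[t] * math.log(n / df[t]) for t in tf}   # tf * idf
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors[d] = {t: v / norm for t, v in w.items()}   # unit-length vector
    return vectors

def rank(query, vectors):
    q = Counter(query.split())
    scores = {d: sum(q[t] * w.get(t, 0.0) for t in q) for d, w in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("java native interface", tfidf_vectors(docs)))
```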
Figure 2. Example of integration of hierarchical and partitional clustering

Input: a set M = {m1, ..., mN} of news articles.
Output: a partition P = {C1, ..., Ck} of M.
Method:
– sample a small subset S = {m1, ..., mh} of news articles randomly chosen from M;
– obtain the feature vectors {x1, ..., xh} from S;
– apply a hierarchical agglomerative algorithm on {x1, ..., xh};
– let c1, ..., ck be the concept vectors representing the clusters of the partition supplied by the hierarchical algorithm:
  • apply the k-Means algorithm on M exploiting c1, ..., ck as initial centers;
  • answer P = {C1, ..., Ck} as the result of the k-Means algorithm.
A different approach to personalization is the identification of topics emerging from the articles, to be proposed to users in a more structured way: this can be accomplished by clustering articles on the basis of their contents. Clustering aims to identify homogeneous groups to be represented as semantically related in the re-organized news collection. Formally, a clustering problem can be stated as follows: given a set M = {m1, ..., mN} of news articles, we aim to find a suitable partition P = {C1, …, Ck} of M into k groups, such that each group contains a homogeneous subset of articles (or threads) with an associated label describing the main topics related to the group. The identification of homogeneous groups relies on the capability of: (i) defining matching criteria for articles according to their contents; (ii) detecting representative descriptions for each cluster; and (iii) exploiting suitable clustering schemas.

Homogeneity can be measured by exploiting the feature vectors defined above. A similarity measure s(xi, xj) can be defined as s(xi, xj) = αs1(yi, yj) + (1-α)s2(wi, wj), where s1 and s2 refer respectively to the similarity of the structured and unstructured components of the articles. In particular, s1 can be defined by resorting to traditional similarity measures, such as Dice, Euclidean or mismatch-count distance (Huang, 1998). Mismatch-count distance can be exploited, for example, by taking into account the hierarchy of subgroups since, in principle, articles posted in newsgroups sharing many levels in the newsgroup hierarchy are more likely to be similar than articles posted in newsgroups sharing few levels. On the other hand, s2 can be chosen among the similarity measures particularly suitable for documents (Baeza-Yates & Ribeiro-Neto, 1999; Strehl et al., 2000), such as the cosine similarity. Finally, α (with 0 ≤ α ≤ 1) represents the weight to be associated with the structured component.

Concerning labels, we have to find a suitable way of describing the main topics associated with a given group of semantically homogeneous articles. The structural part can easily be tackled by exploiting, for example, the mode vector (Huang, 1998) of the structured components y. The unstructured part requires some attention. In a sense, we are looking for a label reflecting the content of the articles within the cluster, and at the same time capable of differentiating two different clusters. To this purpose, a viable strategy is to resort to frequent itemset discovery techniques (Agrawal & Srikant, 1994; Beil et al., 2002), and to associate with each cluster the set of terms (or concepts) that most frequently appear within it.

Many different clustering algorithms can be exploited (Jain et al., 1999) to cluster articles according to the above mentioned matching criteria. Hierarchical methods are widely known to provide clusters of better quality (Baeza-Yates & Ribeiro-Neto, 1999; Steinbach et al., 2000). In the context of newsgroup mining, such approaches are particularly attractive since they also allow the generation of cluster hierarchies (which ultimately represent topics and subtopics). However, hierarchical approaches suffer from serious efficiency drawbacks, since they require quadratic time complexity in the number of articles they deal with. By contrast, efficient centroid-based methods have been proposed in the literature. In particular, Dhillon and Modha (2001) define a suitable partitional technique, namely spherical k-Means, which has the main advantage of requiring a linear number of comparisons while still guaranteeing good quality clusters. As stated above, the feature vectors wi are normalized. In such a case, the computation of the cosine similarity reduces to the computation of the dot product between two vectors. The spherical k-Means algorithm aims to maintain this property during the whole clustering phase. For each cluster Cj containing nj documents, the algorithm computes the cluster center µj = (1/nj) Σw∈Cj w. By normalizing µj, we obtain the concept vector cj of Cj, which is the feature vector that is closest in cosine similarity to all the document vectors in the cluster Cj.

The main drawback of centroid-based techniques is that the quality of their results is strictly related to two main issues: the number of desired clusters, which has to be known a priori, and the choice of a set of suitable initial points. Combinations of hierarchical agglomeration with iterative relocations can be devised here (Manco et al., 2002), by first using a hierarchical agglomerative algorithm over a small arbitrary subset S of articles to seed the initial clusters for k-Means-based methods. Figure 2 shows an example of such an integration. Practically, by choosing a subset S of M with a size h ≪ N as input for a hierarchical clustering scheme, we avoid efficiency issues. The resulting partition provided by such an algorithm can be exploited within the k-Means algorithm, as it provides an optimal choice for both the desired number of clusters and the initial cluster centers (computed starting from the partition provided by the hierarchical approach).
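The combined measure s(xi, xj) = αs1(yi, yj) + (1-α)s2(wi, wj) can be sketched as follows. This is a minimal example, assuming the structured component is a tuple of newsgroup hierarchy levels and the unstructured component a term-weight vector; mismatch-count and cosine are two of the options named above, and the concrete encodings are illustrative.

import numpy as np

def mismatch_count_similarity(yi, yj):
    """Structured part s1: fraction of categorical attributes on which
    the two articles agree (1 minus the normalized mismatch count)."""
    matches = sum(a == b for a, b in zip(yi, yj))
    return matches / len(yi)

def cosine_similarity(wi, wj):
    """Unstructured part s2: cosine of the two term-weight vectors."""
    denom = np.linalg.norm(wi) * np.linalg.norm(wj)
    return float(np.dot(wi, wj) / denom) if denom else 0.0

def article_similarity(yi, wi, yj, wj, alpha=0.5):
    """s(xi, xj) = alpha * s1(yi, yj) + (1 - alpha) * s2(wi, wj)."""
    return (alpha * mismatch_count_similarity(yi, yj)
            + (1 - alpha) * cosine_similarity(wi, wj))

# Example: two articles sharing the first two newsgroup levels but posted in
# different leaf groups, with term-weight vectors wi and wj.
yi, yj = ("comp", "lang", "python"), ("comp", "lang", "c")
wi, wj = np.array([0.6, 0.8, 0.0]), np.array([0.6, 0.0, 0.8])
print(article_similarity(yi, wi, yj, wj, alpha=0.3))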
Mining Usage of News Articles This section is devoted to the problem of exploiting browsing patterns to learn user profiles. A profile is a synthetic description of information requirements, interests and preferences of a set of visitors. Earlier approaches to personalization required explicit user collaboration; that is, visitors had to fill in questionnaires concerning their navigation purposes. However, such an approach suffers from two main limitations. Firstly, questionnaires are subjective, and hence possibly unreliable. Secondly, profiles learned from questionnaires are static; that is, they depict the expectations of a group of users at a given point in time. This eventually requires new questionnaires to be periodically presented to visitors, thus making visitors reluctant to continuously provide information. By contrast, Web usage mining allows inferring user profiles by means of a silent analysis of visitors’ browsing activities. Moreover, profiles learned from browsing patterns are both objective (they are deduced from exhibited navigational behaviours and therefore not affected by subjectiveness) and dynamic (they automatically reflect changes in using a Website). A typical Web usage mining process can be divided into three phases (Srivastava et al., 2000): data preprocessing, the process of turning raw usage data into a meaningful data set (precisely, raw usage data are converted into high-level abstractions such as page views, sessions, transactions, users); pattern discovery, that is, the exploitation of a variety of techniques from different fields (such as machine learning, pattern discovery and statistics) to infer usage patterns potentially of interest; pattern analysis, the step in which the identified patterns are further inspected, aggregated and/or filtered to transform individual patterns into a deep understanding of the usage of the Website under investigation. In the following, data preprocessing and pattern discovery are taken into account.
Usage Data Preprocessing A number of tasks must be accomplished in order to reconstruct a meaningful high-level view of users’ browsing activities from the collection of their individual actions in a Web log. Typically, data cleaning is the first step. It is useful to remove irrelevant entries from Web logs, such as those denoting images or robot accesses. The intuition is that only entries explicitly requested by users should be retained for subsequent computations (Cooley et al., 1999); since Web usage mining aims at highlighting overall browsing patterns, it does not make sense to analyse entries that do not correspond to explicit visitor requests. URI (Uniform Resource Identifier) normalization is conceived to identify as similar the URIs which, though syntactically distinct, refer to the same resource (such as, e.g., www.mysite.com/index.htm and www.mysite.com). User identification is a crucial task for a variety of reasons such as caching policies, firewalls and proxy-servers. Many heuristics exist for identifying users, such as client-side tracking (i.e., the collection of usage data at client level through remote agents), cookies and embedded session IDs (also known as URL rewriting), chains of references and others (Pirolli et al., 1996). It is worth noticing that in spite of a considerable number of (more or less) accurate heuristics, user identification exhibits some intrinsic difficulties, like in the case of users with identical IP addresses, or in the case of an individual user who visits the same pages through two distinct Web browsers executing on a single machine. For each user, session identification aims to divide the resulting set of requests into a number of subsets (i.e., sessions). The requests in each subset share a sort of temporal continuity. As an example, it is reasonable to consider two subsequent requests from a given user exceeding a prefixed amount of time as belonging to separate sessions (Catledge & Pitkow, 1995). Since Web logs track user behaviour for very long time periods, session identification is leveraged to distinguish between repeated visits of the same user. The notion of page view indicates a set of page files (such as frames, graphics and scripts) that contribute to a single browser display. Page view identification is the step in which different session requests are collapsed into page views. At the end of this phase, sessions are turned into time-ordered sequences of page view accesses. Support filtering is typically leveraged in order to eliminate noise from usage data, by removing all those session page views characterized by either very low or extremely high support. These accesses cannot be profitably leveraged to characterize the behaviour of any group of users. Transaction (or episode) identification is an optional preprocessing step that aims at extracting meaningful subsets of page view accesses from any user session. The notions of auxiliary page view (i.e., a page view mainly accessed for navigational purposes) and media page view (namely, an informative page view) contribute to identifying two main kinds of user transactions (Cooley, 2000): auxiliary-content and media-only transactions. Given a generic user session, for each media page view P, an auxiliary-content transaction is a time ordered subsequence consisting of all the auxiliary page views leading to P in the original user session. Browsing patterns emerging from auxiliary-content user transactions reveal the common navigation paths leading to a given media page view. 
Media-only transactions consist of all the media page views in a user session. They are useful to highlight the correlations among the media page views of a Website.
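A minimal sketch of the session identification step described above; the (user, timestamp, page view) log format and the 30-minute timeout are assumptions, the latter being a commonly used heuristic threshold rather than a value fixed by the chapter.

from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; a common heuristic threshold

def sessionize(log, timeout=SESSION_TIMEOUT):
    """Split a cleaned Web log into per-user sessions.
    `log` is an iterable of (user, timestamp, page_view) tuples."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(log, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, page))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > timeout:      # gap too large: new session
                sessions.append((user, [p for _, p in current]))
                current = []
            current.append(cur)
        sessions.append((user, [p for _, p in current]))
    return sessions

def media_only(session_pages, is_media):
    """Media-only transaction: keep only the media page views of a session."""
    return [p for p in session_pages if is_media(p)]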
Pattern Discovery A number of traditional data mining techniques can be applied to the preprocessed usage data in order to discover useful browsing patterns. Here these techniques are analysed in the context of personalization. In the following, the term page is used as a synonym for page view. Association rules capture correlations among distinct items on the basis of their co-occurrence patterns across transactions. In the Web, association rule discovery can be profitably exploited to find relationships among groups of Web pages in a site (Mobasher et al., 2001). This can be accomplished through an analysis of those pages frequently visited together within user sessions. Association rules materialize the actual user judgment about the logical organization of the Web pages in a site: a group of (not necessarily inter-linked) pages often visited together implies some sort of thematic affinity among the pages themselves. Sequential patterns extend association rules by including the notion of time sequence. A sequential pattern indicates that a set of Web pages are chronologically accessed after another set of pages. These patterns allow Websites to be proactive, that is, to automatically predict the next request of the current visitors (Mobasher et al., 2002b). As a consequence, user navigation can be supported by suggestions to those Web pages that should best satisfy user browsing purposes. Classification is a technique that assigns a given item to one of several predefined classes. Profiling is a typical application domain: in this case, the purpose of classification is to choose, for each visitor, a (pre-existing) user profile that best reflects his/her navigational behaviour (Dai et al., 2000). Clustering techniques can be used for grouping either pages or users exhibiting similar characteristics. Page clusters consist of Web pages thematically related according to user judgment. A technique (Mobasher et al., 2002a) for computing these clusters consists of exploiting the association rules behind the co-occurrence patterns of Web pages in such a way as to form a hypergraph. Hypergraph partitioning is then applied to find groups of strongly connected pages. Different approaches (either centroid-based (Giannotti et al., 2002) or hierarchical (Guha et al., 2000)) are aimed at clustering users’ sessions. In general, page clusters summarize similar interests of visitors with different browsing behaviour. Session clusters, on the contrary, depict subsets of users who exhibit the same browsing behaviour. Statistical techniques typically exploit data within user sessions to learn a probabilistic model of the dependencies among the specific variables under investigation. In the field of personalization, an approach consists of building a Markov model (from log data of a Website) that predicts the Web page(s) that users will most likely visit next (Deshpande & Karypis, 2001). Markov models can also be exploited in a probabilistic framework to clustering, based on the EM (Expectation-Maximization) algorithm (Cadez et al., 2000). The main idea here is essentially to learn a mixture of first-order Markov models capable of predicting users’ browsing behaviour.
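The first-order Markov model mentioned above can be sketched as a transition-count table; the session format (time-ordered lists of page views) matches the preprocessing sketch given earlier, and the tie-breaking in the prediction is arbitrary.

from collections import defaultdict, Counter

def train_markov(sessions):
    """Estimate P(next page | current page) from time-ordered sessions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for cur, nxt in zip(pages, pages[1:]):
            counts[cur][nxt] += 1
    model = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        model[cur] = {nxt: c / total for nxt, c in nxt_counts.items()}
    return model

def predict_next(model, current_page, top_n=3):
    """Return the top_n most likely next pages as (page, probability)."""
    dist = model.get(current_page, {})
    return sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

sessions = [["index", "news", "thread12"], ["index", "thread12", "news"]]
model = train_markov(sessions)
print(predict_next(model, "index"))   # e.g. [('news', 0.5), ('thread12', 0.5)]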
Mining Structure of a Newsgroup This section aims at investigating the basic intuitions behind some fundamental techniques in the field of structure mining. Though inspired by the same general principles, such techniques are however applied to a novel setting: the context of Usenet news articles. This allows us to deal with the main concepts of structure mining without considering the complexities that arise in a typical Web environment.
Differences and Similarities between the Web and Usenet Environments The Web can be considered as a hyperlinked media with no logical organization (Gibson et al., 1998). Indeed, it results from a combination of three independent stochastic processes of content creation/deletion/update evolving at various scales (Dill et al., 2001). As a consequence, even if content creators impose order on an extremely local basis, the overall structure of the Web appears chaotic (Kleinberg, 1999). In such an environment, finding information inherent to a topic of interest often becomes a challenging task. Many research efforts in the field of structure mining are conceived to discover some sort of high-level structure to be exploited as a semantic glue for thematically related pages. The idea is that if pages with homogeneous contents exhibit some structure that does not depend on the nature of the content, then some effective strategy can be devised in order to address the chaotic nature of the Web. Such an intuition is supported by evidence that, from a geometric point of view, the Web can be considered as a fractal (Dill et al., 2001): highly-hyperlinked regions of the Web exhibit the same geometrical properties as the Web at large. This suggests exploiting structure mining to address relevant problems such as that of topic distillation, that is, the identification of a set of pages highly relevant to a given topic. The basic idea is qualitatively the following. Given a topic of interest, the focus of the approach is on the search for a known link structure surrounding the pages relevant to that topic: this requires exploiting only an extremely limited portion of the Web as a starting point. However, since hyperlinks encode a considerable amount of latent human judgment (Kleinberg, 1999), the structure among initial pages allows the collection of most of the remaining relevant pages: what the algorithm needs to do is, in a certain sense, choose and follow the structural connections (leading from the initial pages to unexplored regions of the Web) which are indicative of contents as prominent as those within the original pages. By contrast, traditional approaches to information discovery on the Web are negatively affected by the chaotic nature of the Web itself. This characteristic seems to make every attempt at devising an endogenous measure (of the generic Web page) to assess the relevance of a given page to a certain subject fruitless. Usenet benefits from a number of interesting features that contribute to make it an ideal application domain for structure mining techniques. Usenet’s overall structure, in fact, appears as a network of news articles, semantically organized by topic. This is mainly due to the fact that users typically post news articles inherent to a specific (more or less broad) topic. Therefore, every single user is involved in playing a crucial role in the process of keeping the overall structure of any newsgroup ordered: the destination group of a news article is already established since the early stages of the inception of the article itself. Two more elements contribute to impose order over the entire structure of a newsgroup. Firstly, the evolution of any newsgroup is based on the sole process of news article posting: no content deletion and/or update is allowed. 
Secondly, in contrast with what happens on the Web, the process of content management is centralized: many users post their own articles to the same recipient entity, which is only responsible for their availability through the Web. Usenet and the Web present many significant differences. However, some similarities can be highlighted. Newsgroups are available from the Web and this implies that every news article is a Web page, characterized by a proper content and its own structure. Users browse through the news articles within each group in order to find those that are mostly of interest; this is similar to what happens on the Web, though considerable differences characterize the search space in the two cases. Article browsing is made possible by a link topology that exists within each group and among the news articles belonging
to distinct groups. Such a structure consists of three different kinds of links: group links, which belong to the hierarchy of newsgroups and connect a news article to its own group; explicit links, which correspond to links within the content of an article; and implicit links, which can be devised in order to model a number of distinct correlations among the news articles (for instance, implicit links can depict article dependencies such as the reply-to relationship). The peculiarities of newsgroups as well as their affinities with the Web are the basis of the intuition of applying known structure mining techniques to Usenet. If approaches based on link analysis exhibit good performances on the Web, their behaviour should be at least as interesting if applied to a newsgroup environment.
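A minimal sketch of how the three kinds of links could be materialized as a directed article graph; the article record layout ('id', 'group', 'reply_to', 'cited_ids') is an assumed simplification of real Usenet headers and bodies, not a format defined in the chapter.

def build_article_graph(articles):
    """Return (group_links, edges) for a set of news articles.
    Each article is a dict with keys 'id', 'group', 'reply_to' (id or None)
    and 'cited_ids' (ids of articles hyperlinked in the body)."""
    known = {a["id"] for a in articles}
    group_links = {}   # article id -> its newsgroup (group links)
    edges = set()      # directed (source, target) pairs

    for a in articles:
        group_links[a["id"]] = a["group"]
        # implicit link: the reply-to relationship (m_i, m_j),
        # from the questioning article to the reply
        if a["reply_to"] in known:
            edges.add((a["reply_to"], a["id"]))
        # explicit links: hyperlinks inside the article body
        for target in a["cited_ids"]:
            if target in known:
                edges.add((a["id"], target))
    return group_links, edges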
HITS Algorithm Kleinberg (1999) proposed an effective approach, namely HITS, for discovering Web pages relevant to a particular topic of interest. The algorithm aims to discover authoritative pages, that is, pages conveying prominent information, and hub pages, which consist of collections of links pointing to the authorities. The intuition is that if a page is an authority, then there should be a considerable number of hub pages (scattered on the Web) pointing to that authority. A mutually reinforcing relationship keeps authoritative and hub pages together: a good hub links to many good authorities, while a good authority is linked to by many good hubs. As a consequence, given a specific topic, hub pages play a crucial role in the process of identifying relevant information. Since they should point to nearly all the authorities on that topic, hubs can be considered as a sort of glue that keeps authorities together. The notions of hub and authority and the associated mutually reinforcing relationship materialize in a precise link structure, which is independent of the topic under investigation. Figure 3(a) illustrates the geometrical structure lying behind the notions of hub and authority. HITS can be summarized as follows. The algorithm takes as input a focused subgraph G of the Web (in which nodes correspond to pages and edges to hyperlinks among pages), and computes hubs and authorities on the basis of the intuition discussed above. Precisely, authority and hub weights are assigned to each page p in G: respectively, xp and yp . A limited number of iterations are required before the algorithm converges. Each iteration consists of two steps. Firstly, authority and hub weights are updated for each page in G. Then, both kinds of weights are normalized. Formally, given a page p, its associated weights are updated on the basis of the operations below:
Figure 3. Structural patterns behind hubs and authorities
xp = Σ q:(q,p)∈G yq
yp = Σ q:(p,q)∈G xq
At the end of the iterative process, authorities (resp. hubs) correspond to the k pages with the highest authority (resp. hub) weight. HITS is based on the assumption that all links within a given Web page have the same weight in the process of authority conferral. This inevitably causes both pages relevant to a certain topic and junk pages to receive the same amount of authority at each step. Unlike the context of scientific literature, in which the process of blind review guarantees the quality of references, both Web pages and Usenet articles are generally neither critiqued nor revised. In the absence of any quality guarantee, HITS represents the simplest approach: news articles typically referred to by a conspicuous number of distinct messages should be considered as authorities.
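The update and normalization steps can be turned into a short sketch; the edge-set encoding of the focused subgraph G, the fixed number of iterations and the Euclidean normalization are implementation choices, not details fixed by the text above.

from math import sqrt

def hits(nodes, edges, iterations=20, k=5):
    """Compute hub and authority weights on the focused subgraph G.
    `edges` is a set of directed (q, p) pairs meaning q links to p."""
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # x_p = sum of y_q over all q with (q, p) in G
        auth = {p: sum(hub[q] for q, p2 in edges if p2 == p) for p in nodes}
        # y_p = sum of x_q over all q with (p, q) in G
        hub = {p: sum(auth[q] for p2, q in edges if p2 == p) for p in nodes}
        # normalize both weight vectors
        na = sqrt(sum(v * v for v in auth.values())) or 1.0
        nh = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / na for n, v in auth.items()}
        hub = {n: v / nh for n, v in hub.items()}

    top_auth = sorted(nodes, key=lambda n: auth[n], reverse=True)[:k]
    top_hubs = sorted(nodes, key=lambda n: hub[n], reverse=True)[:k]
    return top_auth, top_hubs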
Bringing Hubs and Authorities to the Surface in the Usenet Environment

Since content homogeneity seems to depend mainly on topological properties, we claim that structure mining techniques can be profitably applied even in the restricted context of a newsgroup, provided that the latent structure among news articles is suitably reconstructed. A number of application scenarios are discussed next.

• Finding authoritative articles. The problem of locating articles relevant to a topic of interest can be addressed in a way that is slightly different from the original intuition behind HITS. A root set is formed by choosing (nearly) all those groups that are inherent to the specific topic. In such a scenario, both explicit and implicit links among news articles are exploited to model the conferral of authority from hubs to authorities. For instance, consider a discussion group on a generic topic T. A user ui could post an article mi containing questions about T. Another user uj could reply to ui by posting an article mj with the required answers. mi and mj are tied by a reply-to relationship: (mi, mj) is an implicit directed link of the graph G, which in turn represents the overall network of dependencies among the news articles in the root set. Also, the content of mj could include hyperlinks referring to other news articles. For any such link pointing to a destination article mk, an explicit directed link (mj, mk) is added to G. An in-depth analysis of the resulting network of dependencies can highlight interesting (and often unexpected) structural patterns. Two interesting situations are discussed below. Firstly, a community dedicated to a given topic could include hubs from different groups, all referring to authorities in the same group. This could be due to that fraction of users who send their questioning articles to groups that are not strictly related to the article contents. Typically, replies from more experienced users could refer to the strongest authorities in the group that is the closest to the contents of the above news articles. Secondly, authorities in distinct groups could be pulled together by hubs in the same group. This could happen with topics having different meanings. If one of these meanings is overly popular with respect to the others, then its corresponding group is likely to collect even those articles concerning less popular facets of the general topic. Again, replies could show the proper authorities for such news articles.
• Finding authoritative sites: articles as hubs. This application aims to discover hybrid communities, that is, highly-cohesive sets of entities, where hubs correspond to news articles and authorities to Websites. Given a topic of interest, a root set is constructed as in the above case. However, only explicit links are taken into account during the process of reconstructing the link structure of the newsgroup region under investigation. The linkage patterns behind hybrid communities show strong correlations between the Usenet and the Web: precisely, contents on the latter either complement or detail information within the former. The geometrical characterization of the notion of hubs and authorities in terms of a bipartite graph structure in Figure 3(a) also applies to the Usenet environment, as far as the above two applications are concerned.
• Authoritative people: articles as hubs. The network of implicit links among articles is rich in information useful for finding authoritative people. Conceptually, since replies determine the introduction of implicit links into the network of news articles, any such link is indicative of an endorsement of the authority of the associated replying author. Precisely, an author who answers a considerable number of user questions on a certain theme can be considered as an authority on that topic. Authoritative people emerge from an in-depth analysis of heterogeneous, bipartite graph structures that can be located within the overall article network. Heterogeneity depends on the intrinsic nature of these structures, where a hub is a questioning article, and an authority, instead, corresponds to a collection of replies grouped by their author (substantially, an authority is a set of nodes in the original article graph collapsed into a single entity). Figure 3(b) shows the geometrical characterization of the notion of hubs and authorities in this context. Authoritative people can be discovered to different extents: either for any subset of specified topics in the newsgroup, or taking into account all of the topics in the newsgroup itself. Both cases are useful in highlighting interesting structural patterns that are not obvious at first sight, such as those individuals who are authorities on a number of topics.
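For the last scenario, collapsing replies by author reduces to a simple count over the implicit link network. The sketch below assumes the implicit links are already available as (questioning article, replying author) pairs, which is an illustrative simplification.

from collections import defaultdict

def authoritative_people(replies, top_n=5):
    """Rank authors by the number of distinct questioning articles they
    answered. `replies` is an iterable of (question_id, reply_author)
    pairs, one per implicit reply-to link."""
    answered = defaultdict(set)        # author -> set of questions answered
    for question_id, author in replies:
        answered[author].add(question_id)
    ranking = sorted(answered.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(author, len(questions)) for author, questions in ranking[:top_n]]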
PERSONALIZATION IN Usenet Communities Personalization can be defined as the process of tailoring the delivery of contents (or services) in a Website to address individual users’ features such as their requirements, preferences, expectations, background and knowledge. It is a wide research and industrial area, which comprises notions such as recommender systems and adaptive Websites. Recommender systems are conceived to the automatic delivery of suggestions to visitors: suggested items can be information, commercial products or services. By contrast, adaptive Websites address visitors’ features mainly by means of two adaptation methods (Brusilovsky, 1996): adaptive presentation, namely, adapting the contents in a Web page to the profile of a browsing visitor, and adaptive navigation support, that is, suggesting the right navigation path to each individual user on the basis of his/her browsing purposes. From a functional point of view, Websites with personalization capabilities can be designed around the notions of customization and optimization (Perkowitz & Etzioni, 2000). Customization is adapting the content delivery of a Website to reflect the specific features of individual users; precisely, changes to the site organization (i.e., to both contents and structure) are brought about on the basis of the features of each single visitor. Optimization, on the contrary, is conceived to modify the site structure in order to improve usability. Regarding the technologies behind personalization systems, four main categories can be identified (Mobasher et al., 2000): decision rules, content-based filtering, collaborative filtering and Web usage mining. Decision rules allow the explicit specification of how the process of content delivery can be affected by the profile of visitors. Such rules can either be inferred from user interactions with the Website or manually specified by a site administrator. Systems based on content-filtering (Lieberman, 1995)
learn a model of user interests in the contents of the site by observing user navigation activities for a period of time. Then, they try to estimate visitors’ interest in documents not yet viewed on the basis of some similarities between these documents and the profile to which visitors themselves belong to. Collaborative filtering systems explicitly ask for users’ ratings or preferences. A correlation technique is then leveraged to match current users’ data with the information already collected from previous users. A set of visitors with similar ratings is chosen and, finally, suggestions that are predicted to best adapt to current users’ requirements (Konstan et al., 1997) are automatically returned. Recently, an increasing focus has been addressed to techniques for pattern discovery from usage data. Not only does Web usage mining prove to be a fertile area for entirely new personalization techniques, but it also allows for the improvement of the overall performances of traditional approaches. For example, content-based personalization systems may fail at capturing latent correlations among distinct items in a Website. Such a limitation can be avoided by taking into account evidence from user browsing behaviour. Also, collaborative filtering can be revised in order to avoid the drawbacks due to the collection of static profiles.
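A minimal user-based collaborative filtering sketch, assuming explicit ratings on a common scale and cosine similarity between rating vectors as the correlation technique; deployed systems such as GroupLens use more elaborate weighting, so this is only meant to make the idea concrete.

import numpy as np

def recommend(ratings, user, top_n=3):
    """ratings: dict user -> dict item -> rating. Suggest items the target
    user has not rated, weighted by the similarity of the other users."""
    items = sorted({i for r in ratings.values() for i in r})

    def vec(u):
        return np.array([ratings[u].get(i, 0.0) for i in items])

    target = vec(user)
    scores = {}
    for other in ratings:
        if other == user:
            continue
        v = vec(other)
        denom = np.linalg.norm(target) * np.linalg.norm(v)
        sim = float(target @ v / denom) if denom else 0.0
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]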
PageGather Algorithm

In the following we discuss how Web usage mining techniques can be applied to the problem of providing personalized access to Usenet communities. To this purpose, we analyse PageGather (Perkowitz & Etzioni, 2000), an optimization approach that achieves the goal of supporting user navigation through a given Website. PageGather is conceived to automatically create indexes for the pages in a Website. This allows visitors to directly access all the pages inherent to a given topic of interest, without having to locate them within the link structure of a Website. A visit-coherence assumption is the basis of the algorithm: the pages visited by an individual user during a navigation session tend to be conceptually related. The algorithm can be divided into three main steps.

• Evaluation of similarities among Web pages. For each pair of pages pi and pj in the Website, both the probabilities Pr(pi | pj) and Pr(pj | pi) of a user accessing a page pi (resp. pj), provided that he/she visited pj (resp. pi) in the same session, are computed. Then, the co-occurrence frequency between pi and pj is chosen to be the minimum of the two probabilities, in order to avoid mistaking asymmetrical relationships for true cases of similarity. A similarity matrix is finally formed, where the (i, j)-th entry is the co-occurrence frequency between pi and pj if there is no link between the two pages; otherwise the entry is 0. In order to reduce the effect of noise, all entries with values below a given threshold are set to 0.
• Cluster mining. A graph is formed from the similarity matrix. Here nodes correspond to Web pages and edges to the nonzero entries of the matrix. Two alternatives are now possible: finding either cliques or connected components. The former approach allows for the discovery of highly cohesive clusters of thematically related pages, whereas the latter is computationally faster and leads to the discovery of clusters made up of a higher number of pages.
• Automatic creation of indexes for each group of related pages. For each discovered cluster, an index consisting of links to each Web page in that cluster is generated. Indexes become preferred entry points for the Website: visitors only have to choose an index dealing with the topic of interest.

Figure 4. Revised version of PageGather for Usenet news articles

Input: a set S = {s1, …, sN} of user sessions; a set T = {t1, …, tM} of threads; a threshold t.
Output: a set I = {b1, …, bK} of newsgroup indexes.
Method:
– for each s ∈ S:
  • extract s0, i.e. a chronologically ordered sequence of accesses to distinct threads;
– for each pair of threads ti, tj ∈ T:
  • compute the transition probabilities Pr(ti | tj) and Pr(tj | ti);
– compute a similarity matrix M such that M(i, j) = min{ Pr(ti | tj), Pr(tj | ti) };
– form a graph G from all the entries of M whose value is greater than t;
– let C = {c1, …, cK} be the set of either the cliques or the connected components found in G;
– for each index page b ∈ I:
  • for each thread t ∈ c (where c is the cluster to which b is associated):
    º add a new link referencing t to b;
Usenet consists of a set of newsgroups hierarchically classified by topic. As a consequence, the process of finding interesting news articles necessarily implies a two-phase search. First, any newsgroup that deals with the required topic has to be identified. Second, trivial or uninteresting threads within each such newsgroup need to be filtered out. Such a laborious task may disorientate visitors, whose search activities may be biased by the huge number of both newsgroups and threads. In such a context, PageGather can be profitably applied to optimize the access to Usenet news. The idea consists of providing users with a new logical organization of the threads, which captures their actual perception of the inter-thread correlations. Here, the visit-coherence assumption is made with respect to the threads. Statistical evidence, aggregated over a huge number of browsing patterns, allows the finding of clusters of related threads according to the visitors' perception of logical correlations among the threads themselves. Each cluster deals with all the facets of a specific topic: it represents an overall view of the requirements, preferences and expectations of a subset of visitors. Finally, an index page could be associated with each individual cluster in order to summarize its contents and quickly access the specific facet of the corresponding topic. Such an approach benefits from two main advantages. Firstly, a more effective categorization of Usenet contents is provided to visitors. Secondly, the technique is an optimization approach to the delivery of Usenet contents: it does not require that strategic changes are brought about to the original structure of Usenet newsgroups. Figure 4 depicts the revised version of PageGather.
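The three PageGather steps, revised for threads as in Figure 4, can be sketched as follows; connected components are used for the cluster-mining step, the probabilities are simple co-occurrence frequencies, and the step that zeroes out entries for already-linked pages is omitted because threads are assumed not to link to each other directly.

from collections import defaultdict
from itertools import combinations

def page_gather(sessions, threshold):
    """sessions: list of sets of visited threads. Returns clusters of threads
    (connected components of the thresholded similarity graph); each cluster
    becomes one index page of links."""
    visits = defaultdict(int)    # thread -> number of sessions containing it
    pairs = defaultdict(int)     # unordered pair -> co-occurrence count
    for s in sessions:
        for t in s:
            visits[t] += 1
        for a, b in combinations(sorted(s), 2):
            pairs[(a, b)] += 1

    # similarity = min( Pr(a|b), Pr(b|a) ), then threshold into a graph
    graph = defaultdict(set)
    for (a, b), co in pairs.items():
        sim = min(co / visits[b], co / visits[a])
        if sim > threshold:
            graph[a].add(b)
            graph[b].add(a)

    # connected components via depth-first search
    clusters, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters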
Conclusion We analysed three main lines of investigation that may contribute to providing personalized access to Usenet articles available via a Web-based interface. Current content mining techniques were applied on the management of news articles. To this purpose, we mainly studied the problem of classifying articles according to their contents. We provided an insight into techniques for tracking and profiling users in order to infer knowledge about their preferences and requirements. Our interest focused on techniques and tools for the analysis of application/server logs, comprising data cleaning and pre-processing techniques for log data and log mining techniques. Also, we modelled the events that typically happen in a Usenet scenario adopting a graph-based model, analysing the most prominent structure mining techniques, such as discovery of hubs and authorities and discovery of Web communities. We
finally discussed personalization methodologies, mainly divided into optimization and customization approaches that can benefit from a synthesis of the above described techniques. The devised Usenet scenario reveals a significant application of Web mining techniques. Also, it suggests devoting further attention and research efforts to the combination of the various Web mining techniques. Indeed, it is clear from the depicted context that the techniques proposed can influence each other, thus improving their effectiveness to the purpose of adaptive Websites.
REFERENCES Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of 20th International Conference Very Large Data Bases (VLDB’94), 487-499. Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., & Spyropoulos, C.V. (2000). An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. Proceedings of 23rd ACM International Conference on Research and Development in Information Retrieval (SIGIR’00), 160-167. Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. ACM Press Books. Addison Wesley. Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. Proceedings of 8th ACM Conference on Knowledge Discovery and Data Mining (KDD’02), 436-442. Brusilovsky, P. (1996). Methods and techniques of adaptive hypermedia. User Modeling and User Adapted Interaction, 6(2-3), 87-129. Cadez, I., Gaffney, S., & Smyth, P. (2000). A general probabilistic framework for clustering individuals and objects. Proceedings of 6th ACM Conference on Knowledge Discovery and Data Mining (KDD’00), 140-149. Catledge, L., & Pitkow, J. (1995). Characterizing browsing behaviors on the World Wide Web. Computer Networks and ISDN Systems, 27(6), 1065-1073. Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from hypertext data. MorganKaufmann. Cooley, R. (2000). Web usage mining: Discovery and application of interesting patterns from Web data. PhD thesis, University of Minnesota. Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems, 1(1), 5-32. Dai, H., Luo, T., Mobasher, B., Sung, Y., & Zhu, J. (2000). Integrating Web usage and content mining for more effective personalization. Proceedings of International Conference on E-Commerce and Web Technologies (ECWeb’00), 1875LNCS, 165-176. Deshpande, M., & Karypis, G. (2001). Selective Markov models for predicting Web-page accesses. Proceedings of SIAM International Conference on Data Mining (SDM’01).
Dhillon, I., & Modha, D. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1/2), 143-175. Dhyani, D., Ng, W., & Bhowmick, S. (2002). A survey of Web metrics. ACM Computing Surveys, 34(4), 469-503. Dill, S., Kumar, S., McCurley, K., Rajagopalan, S., Sivakumar, D., & Tomkins, A. (2001). Self-similarity in the Web. The VLDB Journal, 69-78. Drucker, H., Wu, D., & Vapnik, V.N.(1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), 1048-1054. Eirinaki, M., & Vazirgiannis, M. (2003). Web mining for personalization. ACM Transactions on Internet Technology, 3(1), 1-27. Giannotti, F., Gozzi, C., & Manco, G. (2002). Clustering transactional data. Proceedings of 6th European Conference on Principles and Practices of Knowledge Discovery in Databases (PKDD’02), 2431LNCS, 175–187. Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring Web communities from link topology. Proceedings of 9th ACM Conference on Hypertext and Hypermedia, 225-234. Guha, S., Rastogi, R., & Shim, K. (2000). ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5), 345-366. Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283-304. Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323. Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632. Konstan, J. et al. (1997). Grouplens: Applying collaborative filtering to usenet news. Communications of the ACM, 40(3), 77-87. Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. SIGKDD Explorations, 2(1), 1-15. Lieberman, H. (1995). Letizia: An agent that assists Web browsing. Proceedings of 14th International Joint Conference on Artificial Intelligence (IJCAI’95), 924–929. Manco, G., Masciari, E., & Tagarelli, A. (2002). A framework for adaptive mail classification. Proceedings of 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’02), 387-392. Mobasher, B., Cooley, R., & Srivastava, J. (2000). Automatic personalization based on Web usage mining. Communications of the ACM, 43, 142-151. Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2001). Effective personalization based on association rule discovery from Web usage data. Proceedings of 3rd International Workshop on Web Information and Data Management (WIDM’01), 9-15.
Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2002a). Discovery and evaluation of aggregate usage profiles for Web personalization. Data Mining and Knowledge Discovery, 6(1), 61-82. Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2002b). Using sequential and nonsequential patterns for predictive Web usage mining tasks. Proceedings of IEEE International Conference on Data Mining (ICDM’02), 669-672. Moens, M. (2000). Automatic indexing and abstracting of document texts. Kluwer Academic Publishers. Perkowitz, M., & Etzioni, O. (2000). Towards adaptive Web sites: Conceptual framework and case study. Artificial Intelligence, 118(1-2), 245-275. Pirolli, P., Pitkow, J., & Rao, R. (1996). Silk from a sow’s ear: Extracting usable structures from the Web. Proceedings of ACM Conference Human Factors in Computing Systems (CHI’96), 118-125. Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of Web usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23. Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. Proceedings of ACM SIGKDD Workshop on Text Mining. Strehl, A., Ghosh, J., & Mooney, R. (2000). Impact of similarity measures on Web-page clustering. Proceedings of AAAI Workshop on Artificial Intelligence for Web Search, 58-64.
This work was previously published in Web Mining: Applications and Techniques, edited by A. Scime, pp. 393-414 , copyright 2005 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Section V
Application and Survey
Chapter XXXV
Text Mining in Program Code

Alexander Dreweke, Friedrich-Alexander University Erlangen-Nuremberg, Germany
Ingrid Fischer, University of Konstanz, Germany
Tobias Werth, Friedrich-Alexander University Erlangen-Nuremberg, Germany
Marc Wörlein, Friedrich-Alexander University Erlangen-Nuremberg, Germany
ABSTRACT

Searching for frequent pieces in a database with some sort of text is a well-known problem. A special sort of text is program code, such as C++ or machine code for embedded systems. Filtering out duplicates in large software projects leads to more understandable programs and helps avoid mistakes when reengineering the program. On embedded systems the size of the machine code is an important issue. To ensure small programs, duplicates must be avoided. Several different approaches for finding code duplicates are presented, based either on the text representation of the code or on graphs representing the data and control flow of the program together with graph mining algorithms.
INTRODUCTION

Computer programs are a special form of text. Words of a programming language are combined to form correct sentences in this programming language. There exists a wide variety of programming languages, ranging from high-level object-oriented languages like Java or C++ to machine code, the language a processor can actually "understand". Programming languages are usually translated with the help of compilers from high- to low-level. To produce this kind of "text" - the computer programs
- is the daily work of many programmers; billions of lines of code have been written. Mostly, this code is not well documented and not really understood by anybody after the original programmer stopped working. Typically, many programmers are working on one project and often old code from former versions or other projects is used. Duplicated code fragments are a special problem in big amounts of program code. These duplicated fragments can occur because of excessive use of “copy & paste”, because something was simply reprogrammed or also because of the compiler. When translating from the high-level to intermediate or low-level languages, new duplicates can be introduced, e.g. by using code templates for instructions and instruction sequences. Finding these duplicates has been in the focus of interest for many years. Code duplicates are called clones and clone detection has produced many different algorithms. If program code is simply viewed as text, clone detection is nothing else than mining in this text with the goal of finding the duplicate or similar code. Merging application areas and algorithms from the data mining community on the one hand and clone detection leads to fruitful new insights and results. Finding duplicated code in programs can have different goals. First these duplicates can be visualized as a hint for programmers that something has to be done about this specific piece of code. Second, the redundant code can be replaced automatically by subroutine calls, in-lined procedure calls, and macros etc. that produce the same result. This leads to smaller code that is easier to understand or to maintain. Third, methods to detect and replace duplicated code can be integrated into compilers. Finally, finding duplicated code can lead to special hardware for the duplicates in the area of embedded systems. In the case of program code, duplicates are not always “totally equivalent”. It is not only the one-toone duplicate from a piece of code that is interesting. Also near duplicates or even pieces of code, that are syntactically different, but semantically equivalent must be found. E.g. in two fragments only two independent pieces of code having no side effect onto each other can be exchanged. Variable names can be different or registers in machine code can vary. The application of clone detection ranges from high-level languages to machine code for embedded systems. The latter is the main topic in this chapter. The clone detection algorithms especially for embedded systems are described in detail.
Clone Detection and Embedded Systems

In recent years, many small programmable devices appeared on the market. PDAs, mobile phones, and navigation systems are examples of such systems. Typical software on these gadgets deals with speech or image processing, encryption, calendars, or address books. Therefore, the devices contain small processors handling the different tasks. But also in larger technical devices, such as cars, the number of small processors exploded in the last years. They offer, e.g., efficient engine control, driver assistance functions such as the "Electronic Stability Program" or "Lane Departure Warning", which warns the driver when he/she is leaving the right lane, or comfort functions such as storing the different seat settings for different persons. In modern high-end cars a network of up to sixty communicating processors is used. These processors, also called embedded systems, are much more specialized in nature than general computing systems (e.g. desktop computers or laptops). Typically, they are produced at low cost, have low energy consumption, are small, and meet rigid time constraints for their tasks. Normally, they are manufactured in high quantities. These requirements place constraints on the underlying architecture. An embedded system consists of a processor core, a Read Only Memory (ROM), Random Access Memory (RAM), and an Application Specific Integrated Circuit (ASIC). The cost of developing an integrated circuit is linked to the size of the system. The largest portion of the integrated circuit is often devoted to the RAM for storing the application. Nevertheless, the ROM is much smaller than storage offered on desktop computers. Developing techniques that reduce code size is extremely important in terms of reducing the cost of producing such systems. Programmers writing for embedded systems must be aware of the memory restrictions and write small programs in the sense of code size. Fewer lines of code can make a huge difference on embedded systems. As the complexity of embedded systems programs grows, programming machine code directly gets more complicated. Today it is common to use high-level languages such as C or C++ instead. However, programming in these languages results in larger machine code compared to machine code written directly by the programmers. One reason is that compiler optimization classically focuses on execution speed and not so much on code size. Introducing code size optimizations on embedded systems often leads to speed trade-offs, but when execution speed is not critical, minimizing the code size is usually profitable. Different automated techniques have been proposed over the years to achieve this goal of small code (Beszdes, Ferenc, Gyimthy, Dolenc, Karsisto, 2003; Debray, Evans, Muth & Sutter, 2000). These techniques can be separated into two groups. In code compression the code is somehow compressed and must be decompressed again at runtime. This technique results in very small code, but an additional step is necessary to decompress the necessary parts. Decompression itself does not take too much time (depending on the compression algorithm), but there must be enough memory to store the decompressed code parts again. Another method is code compaction: the code is shrunk, but remains executable. It is also typical for embedded systems that they are inflexible. Once the mobile phone is fabricated, the functionality can not be changed. If a new communication standard is proposed, the old mobile phone is thrown away and a new phone is built. For that reason, more flexible embedded systems are researched. The more applications they can handle, the more they will be used. The problem is that the flexibility of these general purpose processors is paid with performance trade-offs. To handle these performance trade-offs, sets of instructions that are frequently executed are implemented as fixed logic blocks. Such blocks perform better than the softer, reconfigurable parts of the device. If they are well chosen, the performance of the system is enhanced. The goal is to have the frequent parts in hard logic implementation, while simultaneously affording flexibility to other components. Recapitulating, three problems in programming embedded systems are based on code clones: code compression, code compactification, and finding candidates for hard logic implementations. A possible example of code clones in machine code is given in the next section.
An Example

Figure 1 on the left hand side shows a sequence of 18 ARM assembler instructions. The ARM architecture is a popular embedded architecture with a common 32-bit RISC instruction set. Three fragments of the code are highlighted. These three parts contain the same operations but in different orders. It is easy to see that most of these operations are independent of each other, so reordering the operations does not change the semantics of the code piece. E.g. the result of ADD r1, r1, #1 is not of interest for any other operation of the gray shaded area, as register r1 containing the result of this operation is not used anywhere else. The instruction sequence on the left of Figure 1 can be maximally compacted to 12 instructions (see Figure 1 right hand side) by reordering the instructions according to their data dependencies. After doing this one can remove the instructions by a call (operation BL) to a single code instance labeled "extracted". At the end of this new code the program counter (pc) must be restored to the link register (lr) (containing the return address of a function) in order for the program to execute the next instruction after the call to the extracted code. We use this example to compare the various text mining approaches and to explain their advantages and limitations.

Figure 1. Example instruction sequence and corresponding minimum code sequence

In the remainder of this chapter an overview of mining in program code for embedded systems is given. As a basis, we take several research areas from embedded system design and compiler construction, all having the same goal of finding repeated code fragments. In the next section, the different application areas in which shrinking program code is of interest are described. In the third section the main focus is set on the different methods to find frequent fragments in program code.
BACKGROUND

Procedural Abstraction and Cross Jumping

In the seventies and eighties of the last century code size moved into the focus of embedded system research. Until then only speed was of interest; the size of the ROM changed this view. Procedural abstraction and cross jumping are two techniques that can be applied at different stages of compilation, ranging from source code at the beginning over the intermediate code of the compiler to machine code at the end. Both methods start with finding repeated code fragments. To find these fragments, several possibilities exist which are explained in the next section "Main Focus" of this chapter. The repeated code fragments are candidates to be extracted into a separate procedure as shown in Figure 1. This is the basic idea of procedural abstraction (Fraser, Myers, & Wendt, 1984; Dreweke, Wörlein, Fischer, Schell, Meinl & Philippsen, 2007). But before extraction the fragments found must be examined to determine whether they are really suitable. The main problem is that frequent fragments might overlap in the code. In this case only one fragment can be extracted while the second fragment is ignored. It must be calculated which group of frequent fragments has the most savings at the end. Procedural abstraction then turns repeated code fragments into separate procedures. The fragments are deleted in their original positions and procedure calls are inserted instead. Our running example in Figure 1 shows procedural abstraction applied to machine code after compilation.

Cross jumping is also known as tail merging. It is applicable if the identified fragments end with a branch instruction to the same destination. One fragment is left as it is and the other occurrences are replaced by branch instructions to the start of this fragment. This branch instruction at the end of the frequent fragment ensures that program execution continues at the correct position even after the extraction of the frequent fragment. Both strategies have their benefits and costs in code space and execution time. While the size of the original program shrinks, execution might take longer for procedural abstraction as procedure calls take their time. Therefore, it is not advisable to apply procedural abstraction to code that is often used during runtime (so called hot code). Procedural abstraction and cross jumping require no hardware support at all. There are several approaches and enhancements that do not check for completely equivalent fragments. E.g. fragments of machine code can differ in the registers used. It might be possible to extract two code sequences that only differ in their registers into one procedure and map the registers onto each other. If the register mapping does not introduce more code than the extraction into a procedure saves, this approach is worth it (Cooper & McIntosh, 1999). Procedural abstraction is applied at different stages of compilation. The problem of having different registers in fragments can be avoided when procedural abstraction is applied on the intermediate representation of the compiler before the register mapping takes place. There are also several approaches where procedural abstraction is applied before or after link time. If applied after library programs etc. are linked to the main code, more duplicates can be found. Another difference is what portions of the code are taken into account. Sometimes the whole program is taken into consideration. This is computationally more complicated. In other cases the program is split into smaller parts. E.g. for machine code, basic blocks are considered quite often. A basic block is code that has one entry point, one exit point, and no jump instructions contained within it. Procedural abstraction can also be applied to source code, but due to the rich syntax of high level programming languages, it is more difficult to find real code duplicates. For source code, semantically equivalent but syntactically different code pieces are more interesting. Tools successfully applying procedural abstraction are aipop (Angew. Informatik GmbH, 2006) and Diablo (University of Gent, 2007).
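The register-mapping idea can be made concrete with a small canonicalization sketch: registers are renamed in the order of their first appearance before fragments are compared. The regular expression for ARM-style register names and the exact-sequence comparison are assumptions for illustration; the sketch does not cover the instruction reordering discussed for Figure 1.

import re

def canonicalize(fragment):
    """Rename registers in order of first appearance so that two fragments
    that differ only in register allocation compare as equal."""
    mapping = {}
    out = []
    for instr in fragment:
        def rename(match):
            reg = match.group(0)
            mapping.setdefault(reg, "R%d" % len(mapping))
            return mapping[reg]
        out.append(re.sub(r"\br\d+\b", rename, instr))
    return tuple(out)

a = ["ADD r1, r1, #1", "STR r1, [r2]"]
b = ["ADD r5, r5, #1", "STR r5, [r7]"]
print(canonicalize(a) == canonicalize(b))   # True: same shape, different registers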
Dictionary Compression

What procedural abstraction is in code compaction, dictionary compression (Brisk, Macbeth, Nahapetian & Sarrafzadeh, 2005; Liao, Devadas & Keutzer, 1999) is in code compression. Redundant sequences of instructions are identified in the program as for procedural abstraction. But these sequences are then extracted into a dictionary and replaced with a codeword acting as an index into the dictionary. At runtime the program is decompressed again based on the codewords and the dictionary. When a codeword is reached, control is transferred to the dictionary and the corresponding sequence is executed. Afterwards control is transferred back to the original program. Compiler construction for dictionary compression has focused on identifying redundant code sequences in a post-compilation pass, often at link time. A well-known variant is the Call Dictionary (CALD) instruction, an early hardware-supported implementation of dictionary compression. The dictionary consists simply of a list of instructions. Sequences of instructions are extracted from the original program and replaced with a CALD instruction. The CALD instruction has the form CALD(Addr, N), where CALD is an operation code and Addr is the address of the beginning of the code sequence. Addr is given as the offset from the beginning of the dictionary. To execute a sequence of commands in the dictionary, control is transferred to the dictionary address Addr when a CALD instruction is reached. The next N instructions in the dictionary are executed. Finally control is returned to the first instruction after the CALD instruction. Echo instructions are similar to CALD instructions. Instead of storing the instruction sequences externally, one instance is left inline in the program. All other instances of the sequence refer to this special one. Using echo instructions the dictionary resides inside the program, as opposed to CALD instructions where the dictionary is external to the program. As for basic procedural abstraction, the code sequences must be completely identical for the basic versions of CALD and echo instructions. But variations that allow fragments containing syntactical differences have been developed.
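A toy sketch of dictionary compression with a CALD-style reference; the string encoding of instructions, the fixed sequence length and the greedy choice of a single dictionary entry are simplifications, and no attempt is made to model the hardware semantics of the real CALD instruction.

def cald_compress(code, length=3):
    """Greedily extract one repeated instruction sequence of the given length
    into a dictionary and replace its occurrences with ('CALD', addr, N)."""
    best, best_positions = None, []
    for i in range(len(code) - length + 1):
        candidate = tuple(code[i:i + length])
        positions, j = [], 0
        while j <= len(code) - length:               # non-overlapping matches
            if tuple(code[j:j + length]) == candidate:
                positions.append(j)
                j += length
            else:
                j += 1
        if len(positions) > len(best_positions):
            best, best_positions = candidate, positions

    if len(best_positions) < 2:
        return code, []                              # nothing worth extracting

    dictionary = list(best)
    compressed, i = [], 0
    while i < len(code):
        if i in best_positions:
            compressed.append(("CALD", 0, length))   # addr 0 = start of dict
            i += length
        else:
            compressed.append(code[i])
            i += 1
    return compressed, dictionary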
Regularity Extraction and Template Generation

As mentioned in the introduction, a lot of embedded systems are extremely inflexible as they cannot be reused after program changes; especially ASICs cannot be reused. Instead, reconfigurable devices that can easily be re-programmed are developed (Amit, Sudhakar, Phani, Naresh & Rajesh, 1998; Brisk, Kaplan, Kastner & Sarrafzadeh, 2002; Zaretsky, Mittal, Dick & Banerjee, 2006; Liao, Devadas & Keutzer, 1999). They have been proposed as the new design platform for specialized embedded processors. Therefore, new reconfigurable devices built upon programmable logic technology appeared. To ensure reasonable runtime, functions that have to be executed frequently are implemented as hard logic blocks. If these functions are well chosen, flexibility and runtime can be combined. In the area of embedded systems and hard logic implementations, finding these frequent functions is called regularity extraction. These regularities are nothing else than frequent code fragments. It is common in this area to search for frequent fragments on a graph representing the control and data flow of the corresponding program in the intermediate representation of the compiler. Possible frequent fragments are called templates. The process of finding these templates is named template generation. Despite this name, template generation is nothing else than mining for frequent fragments. Working on the templates, i.e. putting them into hard logic, is called template extraction. In the beginning, regularity extraction started with a given hand-tuned database of templates that were just combined to look for templates in the given code. There are also many algorithms looking for templates of a special form. As program code is represented as a graph, templates are subgraphs. Restrictions, e.g., limit the search to subgraphs with only one entry and/or one exit node. Other algorithms work on directed acyclic graphs or restrict the size of templates. Due to the high complexity of graph mining, many algorithms work with the help of heuristics. Applications of regularity extraction can also be found in the reduction of data-path complexity and in power reduction.
MAIN FOCUS OF THE CHAPTER

In this section an introduction to the standard techniques to decrease code size is given.
Searching on Strings with Suffix Trees

Fraser, Myers & Wendt (1984) were the first to try to detect clones in machine code. They treated the machine code like a string of letters and used suffix trees to identify repeated sequences within the sequence of assembly instructions. The fragments were then abstracted out into procedures. Applied to a range of Unix utilities on a Vax processor, this technique managed to reduce code size by about 7% on average. A suffix tree is based on a trie, a common search structure to detect strings in a text. The term trie comes from "retrieval". Due to this etymology it is pronounced "tree", although some encourage the use of "try" in order to distinguish it from the more general tree. A trie is a rooted tree whose edges are labelled with one character of the underlying character set. It stores all words of the text. Therefore, all words are inserted based on their character sequences. For each word there exists a path from the root to one leaf so that the corresponding labels of the edge sequence express the stored word. Each node has just one outgoing edge for each existing character, so that common word prefixes share the same part of the tree. When searching for an existing word, the algorithm has to walk through the trie beginning at the root and follow the edges according to the character sequence of the word. The word is stored if there is a path from the root to a leaf and the edges are labelled with the characters of the word in the correct order. Otherwise the traversal stops at some node and the word remains incomplete. At this node the missing suffix sequence of the word has to be added if the word should be stored in the trie. A patricia trie is a more compact representation of a trie. Sequential parts of the trie where each node has just one incoming and at most one outgoing edge are collapsed into single edges labelled with the whole character sequence of that part. A suffix trie of a string is a trie representing all the suffixes of the string. When a suffix trie for a string A has been constructed, it can be used to determine whether another string B is a substring of A. If there exists a path from the root to a node so that the edge sequence expresses the searched sequence B, then B is a substring. Each suffix is terminated with a special "string end" character ($) that never occurs inside a sequence. Therefore, there is one leaf for each suffix. In this node the starting position of the suffix is stored. A suffix tree of a string is a patricia trie containing all the suffixes of the string. This suffix tree can be constructed in linear time and requires linear space (Ukkonen, 1995). It is an ideal data structure for detecting substrings. A suffix trie and a suffix tree express the same (with different space requirements), so further on just suffix trees are used. To store program code in suffix trees, the instruction sequence(s) must be mapped to a corresponding alphabet. Textually or semantically identical instructions have to be mapped to the same character to be detected as equal. For the example program the first letter of the operational code (if it is non-ambiguous) is chosen. So the whole example is represented by the string "XAOSRBYOSRBAZSOBRA$". In Figure 2 the suffix tree for this string is presented. Each path from the root to a node in the suffix tree represents a subsequence of the given program.
To decide which patterns are interesting for extraction, the size (number of instructions) and the number of occurrences are relevant. The number of instructions is equal to the number of characters of the subsequence and can be counted during the insertion process. The number of occurrences can also be read out of the suffix tree. Each path from an inner node to a leaf represents a possibility to extend the subsequence to a suffix of the whole code.
Figure 2. Suffix tree of the running example
Because of the "string end" character, each leaf represents one unique suffix, so the number of reachable leaves represents the number of occurrences of the substring. Each suffix stored in the reachable leaves starts with the substring. Not only the number of occurrences but also their locations are stored in the suffix tree. Each leaf contains the start of the suffix and therefore points to one occurrence of this substring. No search in the code is necessary after the creation of the suffix tree. Subsequences may overlap in the code sequence. This can be detected by comparing the distance between the start points and the length of the substring. Out of this information the "interesting" candidates can be selected. In the example the only fragment that leads to smaller code is the sequence OSRB with its two occurrences marked in Figure 2. Extracting this sequence into a function results in a code sequence with one instruction saved. All other subsequences are too short or occur too seldom. Compared to Figure 1 this leads to a smaller gain in code size. As suffix trees can only handle syntactically equivalent code pieces, the order variations of the extracted fragment in Figure 1 cannot be detected. Other methods are needed to get the maximal size gain for this example.
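Instead of a linear-time suffix tree, the following Python sketch enumerates the substrings of the encoded instruction string naively and scores them with a rough size-saving estimate. It is meant only to illustrate how repeated fragments and their positions are obtained; the scoring formula is an assumption for the example, not part of the original approach.

```python
from collections import defaultdict

def repeated_fragments(s, min_len=2):
    """Map every substring of length >= min_len to its start positions and
    keep only those occurring at least twice."""
    occurrences = defaultdict(list)
    for i in range(len(s)):
        for j in range(i + min_len, len(s) + 1):
            occurrences[s[i:j]].append(i)
    return {sub: pos for sub, pos in occurrences.items() if len(pos) >= 2}

def saving(sub, positions):
    # rough estimate: each occurrence shrinks to one call, body emitted once
    return len(positions) * (len(sub) - 1) - (len(sub) + 1)

# one letter per instruction; the '$' end marker from the text is only
# needed when a real suffix tree is built
code = "XAOSRBYOSRBAZSOBRA"
candidates = repeated_fragments(code)
best = max(candidates.items(), key=lambda kv: saving(*kv))
print(best, saving(*best))   # ('OSRB', [2, 7]) with a saving of one instruction
```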
Slicing

Slicing-based detection of duplicated code works on graphs representing the control and data flow of a program, the so-called control data flow graphs (CDFGs). Program slicing is commonly used when specializing an existing program at some point of interest of the source code, the slicing criterion. A program slice consists of all instructions and values that might affect this slicing criterion.
These parts of the program can be extracted into a new program without changing the semantics for the slicing criterion (Komondoor & Horwitz, 2003). Therefore, the CDFG is built up by inserting a node for each instruction, edges of one type for each data dependency, and edges of another type for each control flow dependency. Program slicing is done from the slicing criterion backwards along the edges, until no new node is marked as visited. Komondoor & Horwitz (2003) search for isomorphic subgraphs in several CDFGs in order to extract the associated code into a new procedure and replace the occurrences by calls to this newly created procedure. The nodes of a CDFG represent the statements of the program, whereas the edges reflect a control flow or data dependency between the instructions. The nodes, and therefore the instructions, are partitioned into equivalence classes by means of their instructions. Within an equivalence class, pairs of matching nodes are the start point for growing isomorphic subgraphs. Slicing backwards (respectively forwards) is done if and only if the matching predecessor (respectively successor) of the currently investigated node is also a matching node. In Figure 3 an example from the source code of the GNU utility bison is illustrated. As a starting point the two matching nodes 4a and 4b from the different code sequences are selected because they are in the same equivalence class. From now on, further nodes are added to this initial code clone using backward (or forward) slicing. Slicing backwards on the control dependency edges of the nodes 4 leads to the two while-nodes 1 and 8, which are not in the same equivalence class and, therefore, do not match.

Figure 3. Control data flow graphs of bison, its source code, and extracted function
Figure 4. Isomorphic subgraphs of program dependence graph in Figure 3
The self-edge of node 4 does not lead to new matches, of course. But slicing backwards to nodes 5a/5b and 6a/6b adds two matching nodes to the isomorphic graphs. Node 7 is also reached through a data dependency edge but is not considered because there is no corresponding match in the other program dependence graph. Slicing backwards now from nodes 5a/b leads to the while-nodes 1/8 again. But from the edges of the nodes 6a/6b the matching nodes 2a/2b can be reached by backward slicing, so they are also added to the isomorphic graphs. No further nodes can be added using slicing in this example. The resulting graphs are shown in Figure 4. When using program slicing for specialization, typically solely backward slicing is used. However, using slicing for text mining also requires forward slicing to obtain "good" code clones. This can be explained by conditional statements (like if), which have multiple successors that cannot be reached by backward slicing from any other conditional branch. The isomorphic subgraphs are expanded by utilizing the transitive closure. Therefore, if the duplicated code pairs (C1, C2) and (C2, C3) have already been found, (C1, C3) are also duplicates and would be combined into the group (C1, C2, C3). In contrast to other approaches that deal with source code, this one is able to find non-sequential or reordered instructions, which can nevertheless be extracted without changing the semantics of the program. Additionally, it is able to find similar, not only identical, program sequences when the equivalence classes are defined without respecting the literals. The extraction must then rename the variables or insert temporaries. The isomorphic subgraphs must not cross loops, since this may change the meaning of the program. Furthermore, the duplicated code is checked for extractability.
Using backward slicing leads to isomorphic subgraphs, but does not lead to the whole set of possible subgraphs. This way, not only is the computational complexity of subgraph isomorphism avoided, but the number of code clones is also reduced, leading to more meaningful code clones on average.
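The slicing step itself is easy to sketch. The following Python fragment computes a backward slice on a small, made-up dependence graph given as a predecessor map; it illustrates only the slicing part, not the pairwise growing of isomorphic subgraphs described by Komondoor & Horwitz (2003).

```python
def backward_slice(preds, criterion):
    """Collect all nodes that may affect the slicing criterion by following
    control/data dependency edges backwards."""
    slice_nodes, stack = set(), [criterion]
    while stack:
        node = stack.pop()
        if node in slice_nodes:
            continue
        slice_nodes.add(node)
        stack.extend(preds.get(node, ()))
    return slice_nodes

# predecessor map of a tiny, made-up dependence graph
preds = {
    "print(x)": ["x = a + b"],
    "x = a + b": ["a = 1", "b = read()"],
    "a = 1": [],
    "b = read()": [],
    "unused = 42": [],
}
print(backward_slice(preds, "print(x)"))
# {'print(x)', 'x = a + b', 'a = 1', 'b = read()'} -- 'unused = 42' is not part of the slice
```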
Fingerprinting

In clone detection, the comparison of code fragments to find frequent fragments is expensive. If the number of code fragments is very large, many single comparisons have to be done. In order to reduce the amount of expensive code comparisons, often a hash code or fingerprint of the code fragments is used. If two code blocks are identical, they have the same fingerprints, but not vice versa: if two code blocks have identical fingerprints, these blocks do not necessarily have to be identical. Nevertheless, different fingerprints for two code blocks always mean that these blocks are also different. The calculation of the fingerprint depends on application demands, especially on the interpretation of "identical". Some applications require that code blocks are identical in the instructions and in the instruction sequence. Others require just semantic, not structural, identity, so a more flexible fingerprint is necessary. A very open fingerprint is to count each occurrence of each type of operational code regardless of their parameters or order. This scheme is resistant to code reordering, register mapping or variable renaming, but results in false positives. Code blocks with identical fingerprints have to be compared to decide if they are really identical. Fewer false positives can be achieved by storing the code order in the fingerprint. For example, Debray, Evans, Muth & Sutter (2000) encode only the first sixteen instructions of a basic block in a 64-bit value, for the reason that most basic blocks are short. In this way, each of the sixteen instructions is encoded as a 4-bit value, representing the operational code of the instruction. With four bits a maximum of sixteen codes can be distinguished, even though most systems have more operational codes. In that case all operational codes are sorted by frequency, an encoding is determined for the fifteen most frequent operational codes, and the last encoding is used for all other operational codes. Additionally, the fingerprints are hashed in order to reduce the number of fingerprint comparisons. Fingerprints in different hash buckets are different and need not be compared. Table 1 shows the hash codes of the fingerprints as described. Other approaches calculate the fingerprint by using the Rabin-Karp string matching algorithm. Here, each instruction is encoded as an integer, reflecting its operational code and operands. The fingerprint of a code block is computed out of these integers: each instruction number is multiplied with increasing powers of a special base b, which is bigger than the largest encoded integer. The terms for all instructions of the code block are summed up and the remainder of the division by a fixed integer p (mostly prime) is the final fingerprint f for this code sequence:

f_i = (c_i · b^{n-1} + c_{i+1} · b^{n-2} + … + c_{i+n-1} · b^0) mod p

Those fingerprints are calculated for a specified length n in the whole code. Like a sliding window of size n, the fingerprints for all subsequences of size n can be calculated out of the first fingerprint. Therefore the highest summand, representing the no longer needed first instruction, has to be subtracted from the fingerprint and the new instruction has to be added. All other instruction parts just have to be multiplied by the base b:

f_i = ((f_{i-1} · b) + c_{i+n-1} - c_{i-1} · (b^n mod p)) mod p
Table 1. Hashing the first 16 instructions of the code shown in Figure 1

Instruction: LDM  ADD  ORR  SUB  RSC  BIC  LDM  ORR  SUB  RSC  BIC  ADD  LDM  SUB  ORR  BIC
4-bit code:  1001 1000 0011 0100 1110 1100 1001 0011 0100 1110 1100 1000 1001 0100 0011 1100
As you can see, the calculation of the next fingerprint is done with a constant number of mathematical instructions, independent of the length of the subsequences. The term b^n mod p is a constant that can be precomputed. As can be seen in Figure 5, this approach is able to find the matching instructions with hash value twelve. Of course, these matches must be compared instruction by instruction afterwards, since a matching hash value is no guarantee for matching instruction sequences.
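The two formulas above translate almost directly into code. The following Python sketch computes rolling fingerprints over integer-encoded instructions and groups window positions by fingerprint. The base b and modulus p are arbitrary example values, and the encoded opcodes are simply the 4-bit values of Table 1 read as integers.

```python
def rolling_fingerprints(c, n, b=257, p=1000003):
    """Yield (start_index, fingerprint) for every window of n encoded
    instructions, using the rolling update from the formulas above."""
    if len(c) < n:
        return
    high = pow(b, n, p)                 # b^n mod p, precomputed once
    f = 0
    for x in c[:n]:                     # fingerprint of the first window
        f = (f * b + x) % p
    yield 0, f
    for i in range(1, len(c) - n + 1):
        # drop instruction c[i-1], append instruction c[i+n-1]
        f = (f * b + c[i + n - 1] - c[i - 1] * high) % p
        yield i, f

# the 4-bit encodings of Table 1 read as integers
code = [9, 8, 3, 4, 14, 12, 9, 3, 4, 14, 12, 8, 9, 4, 3, 12]
buckets = {}
for i, fp in rolling_fingerprints(code, n=4):
    buckets.setdefault(fp, []).append(i)
# windows sharing a fingerprint are clone candidates and must still be
# compared instruction by instruction to rule out hash collisions
print({fp: pos for fp, pos in buckets.items() if len(pos) > 1})
```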
Figure 5. Rabin-Karp method

Graph Mining

The main advantage of using graph mining for clone detection is its ability to handle reordered instructions in program code. The fragment extracted in Figure 1 contains order variations that have to be detected.
The general idea of graph mining is to detect all frequent subgraphs, i.e. the fragments, in a database of general graphs (Meinl & Fischer, 2005; Washio & Motoda, 2003). A fragment is frequent if it can be embedded in at least a given number of database graphs. This number is called the support. The support can also be given as a fraction of the database size (normally in per cent) and is then called the frequency. Support and frequency express the same attribute. To determine whether a fragment is frequent, one has to detect all database graphs the fragment is a subgraph of. A brute-force attempt for one fragment is to iterate over all database graphs and explicitly check and count subgraph isomorphism. But the subgraph isomorphism problem is known to be NP-complete, so it is in general hard to decide if one graph is a subgraph of another. Because graph mining implicitly solves the subgraph isomorphism problem, all general graph mining approaches are also NP-complete. A better approach is to implicitly store and reuse the results of previous subgraph tests. If B is a subgraph of C and A a subgraph of B, A is also a subgraph of C. If A is not a subgraph of C, the supergraph B of A also cannot be a subgraph of C. These dependencies can be used to minimize the number of subgraph tests. To find all frequent subgraphs, the already found ones are stepwise extended to new, bigger ones. This way, out of an initial set of subgraphs and a correct set of extension rules, all subgraphs are traversed. Common initial subgraphs are all frequent graphs with just one node or with just one edge. Normally, the graphs are extended by adding a new edge and simultaneously a new node, if it is required. Otherwise a new edge between existing nodes is inserted, which leads to cyclic fragments. So it is assured that the old graph is a subgraph of the new one without any explicit subgraph isomorphism test. The whole search space for an algorithm can be arranged as a lattice which expresses the extensions of each fragment to the bigger ones. For general graphs, a fragment normally can result from different extensions of different parent fragments. Because of the huge number of subgraphs, partly shown in Figure 6, it is necessary to traverse this search lattice as efficiently as possible.
Figure 6. Extract of the whole search lattice for the example in Figure 1.
That means that all frequent fragments and as few infrequent fragments as possible have to be checked for frequency. Thanks to the construction of the search lattice, the children and grandchildren of a fragment have a frequency equal to or less than that of the fragment, as shown with the small indices in Figure 6. As said previously, an extended fragment may only be a subgraph of those database graphs the parent fragment also occurs in, because the extended fragment is a supergraph of its parent. Therefore, it might just exist in the same or a smaller number of database graphs, and so cannot have a higher support. The bigger the fragment, the smaller its support, and thanks to this antimonotonicity, an extension of infrequent fragments in the lattice is not necessary. Just infrequent ones emerge out of these fragments. This is similar to the frequency antimonotonicity principle in frequent item set mining and removes most infrequent fragments. In addition to this frequency pruning, each fragment can have different children and different parents in the search lattice, and therefore different paths lead to the same fragment. Graph miners are interested in all fragments, but in each fragment just once, so during the traversal of the search lattice those fragments reachable via multiple paths have to be detected. If a fragment is reached a second time, all its children have already been expanded before, so an additional expansion is not necessary and would create just more duplicates. To detect those duplicates, different methods are used in different mining algorithms. On the one hand, a newly found fragment can be compared with all previously found frequent fragments. These tests are done just for those fragments that are frequent, because for the required graph isomorphism test no polynomial algorithm is known and the complexity of graph isomorphism is not yet settled; it seems to lie between polynomial and NP-complete. With the help of several graph properties (like node/edge count, node/edge label distribution, etc.) two fragments can quickly be tested for non-isomorphism, so the number of complete graph isomorphism tests can be reduced to a bearable level. A problem with that approach is that with an increasing number of found frequent fragments, the required tests become more and more frequent and so the search slows down. Another approach without this drawback is to build a unique representation for each fragment and just compare these representations (Yan & Han, 2002). Different codes are used, e.g. a row-wise concatenation of the adjacency matrix of the graph. Such a code is normally not unique for one graph, as there are different possible adjacency matrices for one graph. Therefore one special representation, normally the lexicographically smallest or biggest one, is defined as the canonical form for a grown fragment. The complexity of the graph isomorphism test is hidden in the test for being canonical, but this test is independent of the number of found graphs. An efficient method for canonical forms is to create an ordered list of comparable tuples representing the edges in the fragment. This edge list can easily be generated during the expansion process out of the order in which the edges are inserted into the graph. As before, one list (e.g. the lexicographically smallest) is used as the canonical representation. This sequence represents one path through the search lattice.
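For very small fragments the canonical-form idea can be illustrated by brute force: enumerate every numbering of the fragment's nodes, derive the edge list under each, and keep the lexicographically smallest one. The Python sketch below does exactly that; real miners such as gSpan derive cheaper canonical DFS codes during the expansion process instead, so this is an illustration of the concept, not of an actual mining algorithm.

```python
from itertools import permutations

def canonical_code(nodes, edges):
    """nodes: {node_id: label}, edges: [(src, dst, edge_label)].
    Returns a representation that is identical for isomorphic fragments."""
    labels = tuple(sorted(nodes.values()))
    best = None
    for perm in permutations(nodes):
        number = {v: i for i, v in enumerate(perm)}   # one possible numbering
        code = tuple(sorted((number[u], number[v], nodes[u], nodes[v], lab)
                            for u, v, lab in edges))
        if best is None or code < best:
            best = code                               # keep the smallest code
    return labels, best

# two embeddings of the same fragment using different node ids
frag_a = ({1: "ADD", 2: "ORR", 3: "SUB"}, [(1, 2, "d"), (2, 3, "d")])
frag_b = ({7: "SUB", 8: "ADD", 9: "ORR"}, [(8, 9, "d"), (9, 7, "d")])
assert canonical_code(*frag_a) == canonical_code(*frag_b)
```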
If the algorithm generates an edge sequence which is not the canonical representation for the corresponding fragment, it has reached this fragment via a way other than the preferred one, so the search can be pruned at this point. Depending on the canonical form, the expansion rules can be adjusted to generate fewer duplicates. Not all possible extensions, but just those with the possibility to generate a new canonical representation need to be followed. In addition to those structural pruning rules, specialized pruning rules for different application areas are possible. E.g. if just trees or graphs with at most ten nodes are relevant, then the search can be pruned if a fragment becomes cyclic or has too many nodes.
For a comparison of the most popular graph mining algorithms see Wörlein, Meinl, Fischer & Philippsen (2005). The main adaptation needed to use graph mining algorithms for clone detection is to build the graph database out of program code. The reorderability of instructions, which the graph mining approach relies on, is not represented in the pure code sequence. Therefore, well-known structures from compiler techniques are used. The so-called data flow graphs (DFG) or control/data flow graphs (CDFG) express concurrently executable instructions that have a fixed order in the code sequence. A set of DFGs built out of the basic blocks of an ARM binary, as in the beginning of this chapter, is the basis for the following approaches. For each basic block (code with one entry and one exit point and without jump instructions inside) a separate graph is generated. A standard graph mining run on these graphs results in all fragments occurring in a certain number of basic blocks. For clone detection it is more important to know how often a fragment occurs in the database. The number of basic blocks it occurs in is irrelevant. If a fragment appears twice in a basic block, both appearances should be counted. An occurrence of a fragment in a basic block is called an embedding of this fragment. The number of extractable embeddings is equal to the number of independent, non-overlapping embeddings of a fragment. Two embeddings of the same fragment overlap if they have some or all nodes in common. In this case, after extracting one of the embeddings the second occurrence will be destroyed and cannot be extracted additionally. The best set of non-overlapping embeddings is calculated out of a collision graph containing all overlapping embeddings. Each node of the collision graph represents one embedding of the fragment, and there is an edge between two nodes if the corresponding embeddings overlap. An independent set of such a graph is a subset of nodes so that between two nodes of the subset there is no edge in the original graph. A maximal independent set of this collision graph represents the maximal number of embeddings that can be extracted together. To detect such subsets, general algorithms are available (Kumlander, 2004). The number of extractable embeddings for a fragment is antimonotone just like the original graph count, so frequency pruning is still applicable. For clone detection, fragments that just occur twice but are quite big are also interesting, in contrast to small ones with higher frequency. So frequency pruning is possible, but the resulting search space is still quite big. A more extensive description is given in Dreweke, Wörlein, Fischer, Schell, Meinl & Philippsen (2007). To keep the search bearable, some special properties of clone detection can be used. The created DFGs are directed acyclic graphs (DAGs), so the fragments found are also DAGs. Not every fragment is extractable. The resulting graph (without the extracted fragment) still has to be acyclic, so embeddings/fragments leading to cyclic structures are irrelevant. For some fragments it is provable that each child and grandchild also leads to cyclic structures, so the corresponding search branch can be pruned. To simplify the required search, most graph-based approaches reduce their search space to single-rooted fragments. The graph for a single-rooted fragment has just one entry point.
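The embedding counting described above can be sketched as follows. Each embedding is represented by its set of node ids, overlapping embeddings become neighbours in a collision graph, and a greedy independent-set heuristic (standing in for an exact maximum independent set algorithm such as Kumlander, 2004) selects embeddings that can be extracted together. The data and the greedy strategy are illustrative assumptions.

```python
def extractable_embeddings(embeddings):
    """embeddings: list of node-id sets, one per occurrence of a fragment.
    Returns a set of pairwise non-overlapping embeddings (greedy heuristic)."""
    n = len(embeddings)
    overlaps = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if embeddings[i] & embeddings[j]:     # share at least one node
                overlaps[i].add(j)
                overlaps[j].add(i)

    chosen, blocked = [], set()
    for i in sorted(range(n), key=lambda k: len(overlaps[k])):
        if i not in blocked:                      # independent so far
            chosen.append(i)
            blocked |= overlaps[i]
    return [embeddings[i] for i in chosen]

emb = [{1, 2, 3}, {3, 4, 5}, {6, 7, 8}]           # first two embeddings collide
print(extractable_embeddings(emb))                # two of them can be extracted
```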
A search for single-rooted fragments on the DFG for the example of Figure 1 results in the most frequent fragment consisting of the shaded nodes in Figure 7 and the resulting code (without the extracted procedure) on the bottom left. A search for general connected graph fragments additionally detects that this fragment can be extended by the white ORR node to the shaded fragment in Figure 7, so a connected search results in the code on the bottom right. Connected graph mining does not detect the optimum presented in Figure 1. By extending the search to unconnected fragments, the optimal fragment for a single extraction is found, but unconnected mining increases the number of subgraphs and the size of the search lattice dramatically.
Figure 7. Detected single-rooted, connected, and unconnected code clones and their extraction
Figure 7 shows the corresponding optimal pattern as the dashed fragment and the resulting code (without the extracted procedure) in the first line.
FUTURE TRENDS/CONCLUSION

Text mining in program code is commonly known as clone detection, but is used in a wide variety of application areas. In this chapter we mainly focused on software engineering and embedded systems. Much research was done twice because of the different names for clone detection in the different areas. There are also many overlaps with research in graph mining. This chapter described the similarities and differences between the manifold approaches with varying goals and showed that graph mining algorithms can be applied very successfully in clone detection. Large software systems often produce "legacy" code, evolving over several versions. Studies showed that up to thirty percent of the code in large software systems, such as the JDK, consists of code clones. The more code clones there are within a software system, the more complex it is to maintain the code. Therefore the demand for tools supporting the programmer while writing and maintaining the code is growing more and more. This demand is satisfied by searching for similarities, mostly by using graph mining related approaches. Code clones need to be similar, not necessarily identical, in order to be interesting. Therefore future research work will probably use other approaches that have a smaller computational complexity. One of these concepts could, e.g., be the representation of program text as a vector and clustering code clones with respect to their Euclidean distance. This should reduce the complexity significantly. Since software systems are not always written in one single language, tools that support more than one language or can easily be extended are preferable. In the area of embedded systems, the programs need to be as small as possible in order to reduce manufacturing costs for the large number of pieces. The code clones are extracted by various techniques. Duplicate code means semantically identical code, which need not be contiguous, but can be reordered. Additionally, semantics must not change if variables have been renamed. Instructions, furthermore, do not even have to be connected in some dependence graph (reflecting data or control flow dependencies), as long as they stay extractable afterwards. Studies showed that up to five percent of program code for embedded systems can be removed by extraction techniques. Both application areas have to deal with large, computationally difficult problems, handling many lines of code and, therefore, an enormous search space. Besides, not every code clone or duplicate is interesting for each application. "Interesting" in the area of embedded systems mostly means extractable, preferably saving as many instructions as possible. In the area of software engineering, "interesting" code clones may be "copy & paste" clones that are similar to a certain degree. The needs for the "interesting" duplicates and for a reduced complexity can be satisfied simultaneously. Not the whole search space has to be searched, but some branches of the search tree can be pruned without reducing the set of interesting clones too much. This is possible by the development and application of appropriate heuristics. Another common way to reduce the complexity of text mining is to restrict the maximal size of the code clones; typically the limitation is set to twenty instructions. In software engineering applications, this restriction is also reasonable from another point of view, because programmers cannot review larger code blocks easily.
In embedded systems, the code clone size restriction and the associated decrease of runtime have to be carefully weighed against the instruction savings. New areas of application of text mining in program code or clone detection are emerging. For example, for teaching or software patents the original author of the code is relevant. Therefore, automatic checks to detect plagiarism are useful. Two programs or files can be compared by their sequential code and semantic structures to check whether one evolved into the other or whether they were written completely independently.
REFERENCES

Amit, C., Sudhakar, K., Phani, S., Naresh, S., & Rajesh, G. (1998). A general approach for regularity extraction in datapath circuits, IEEE/ACM International Conference on Computer-Aided Design (pp. 332-339). San Jose, CA, USA: ACM Press.

Angew. Informatik GmbH (2006). aipop. Retrieved 2007, from http://www.AbsInt.com/aipop

Beszédes, Á., Ferenc, R., Gyimóthy, T., Dolenc, A., & Karsisto, K. (2003). Survey of code-size reduction methods. ACM Computing Surveys, 35(3), 223-267.

Brisk, P., Kaplan, A., Kastner, R., & Sarrafzadeh, M. (2002). Instruction generation and regularity extraction for reconfigurable processors, International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (pp. 262-269). New York, NY, USA: ACM Press.

Brisk, P., Macbeth, J., Nahapetian, A., & Sarrafzadeh, M. (2005). A dictionary construction technique for code compression systems with echo instructions, ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (pp. 105-114). Chicago, IL, USA: ACM Press.

Cooper, K., & McIntosh, N. (1999). Enhanced code compression for embedded RISC processors, ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation (pp. 139-149). Atlanta, GA, USA: ACM Press.

Debray, S., Evans, W., Muth, R., & Sutter, B. (2000). Compiler techniques for code compaction. ACM Transactions on Programming Languages and Systems, 22(2), 378-415.

Dreweke, A., Wörlein, M., Fischer, I., Schell, D., Meinl, T., & Philippsen, M. (2007). Graph-Based Procedural Abstraction, Fifth International Symposium on Code Generation and Optimization (pp. 259-270). San Jose, CA, USA: IEEE Computer Society.

Fraser, C., Myers, E., & Wendt, A. (1984). Analyzing and compressing assembly code, SIGPLAN Symposium on Compiler Construction (pp. 117-121). Montreal, Canada: ACM Press.

Furber, S. (1996). ARM System Architecture. Boston, MA: Addison-Wesley Longman Publishing Co., Inc.

Kastner, R., Kaplan, A., Memik, S. O., & Bozorgzadeh, E. (2002). Instruction generation for hybrid reconfigurable systems. ACM Transactions on Design Automation of Electronic Systems, 7(4), 605-627.
Komondoor, R., & Horwitz, S. (2003). Effective, Automatic Procedure Extraction, IEEE International Workshop on Program Comprehension (pp. 33-43). IEEE Computer Society.

Kumlander, D. (2004). A new exact algorithm for the maximum-weight clique problem based on a heuristic vertex-coloring and a backtrack search, 5th International Conference on Modelling, Computation and Optimization in Information Systems and Management Sciences: MCO 2004 (pp. 202-208). Metz, France: Hermes Science Publishing Ltd.

Liao, S., Devadas, S., & Keutzer, K. (1999). A text-compression-based method for code size minimization in embedded systems. ACM Transactions on Design Automation of Electronic Systems, 4(1), 12-38.

Meinl, T., & Fischer, I. (2005). Subgraph Mining. In J. Wang (Ed.), Encyclopedia of Data Warehousing and Mining (pp. 1059-1063). Hershey, PA, USA: Idea Group.

Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14(3), 249-260.

University of Gent (2007). Diablo. Retrieved 2007, from http://www.elis.ugent.be/diablo

Washio, T., & Motoda, H. (2003). State of the art of graph-based data mining. SIGKDD Explorations Newsletter, 5(1), 59-68.

Wörlein, M., Meinl, T., Fischer, I., & Philippsen, M. (2005). A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, & J. Gama (Eds.), Knowledge Discovery in Databases: PKDD 2005 (Lecture Notes in Computer Science, pp. 392-403). Porto, Portugal: Springer.

Yan, X., & Han, J. (2002). gSpan: Graph-Based Substructure Pattern Mining, IEEE International Conference on Data Mining (ICDM'02) (pp. 721-724). IEEE Computer Society.

Zaretsky, D., Mittal, G., Dick, R., & Banerjee, P. (2006). Dynamic Template Generation for Resource Sharing in Control and Data Flow Graphs, 19th International Conference on VLSI Design held jointly with 5th International Conference on Embedded Systems Design (pp. 465-468). IEEE Computer Society.
KEY TERMS

Code Compaction: The code size reduction of binaries in order to save manufacturing costs, or of source code in order to increase maintainability.

Clone Detection: Methods to identify code clones, i.e., a sequence or set of instructions that is similar or even identical to another one, in different kinds of programming code.

Control Data Flow Graph (CDFG): Represents the control flow and the data dependencies in a program.

Embedded System: Unlike general purpose systems, an embedded system is used and built for a special purpose with special requirements (e.g. real-time operation) and is produced in a large number of units. Therefore, tiny cost savings per piece often pay off.

Graph Mining: Methods to identify frequent subgraphs in a given graph database.
Fingerprinting: Code fragments are associated with numerical (hash) codes to speed up the detection of code clones. If two code blocks are identical, they have the same fingerprints. If two code blocks have identical fingerprints, these blocks do not necessarily have to be identical.

Procedural Abstraction: The extraction of code clones into functions and the replacement of their occurrences by calls to these new functions.

Regularity Extraction: The realization of code clones in hard logic on an embedded system.

Slicing: A method to identify code clones based on a CDFG of the program. Based on instructions or their operational code, isomorphic subgraphs are grown; similar to graph mining.

Suffix Tree: A data structure used for the efficient detection of duplicated strings in a text.
Chapter XXXVI
A Study of Friendship Networks and Blogosphere

Nitin Agarwal, Arizona State University, USA
Huan Liu, Arizona State University, USA
Jianping Zhang, MITRE Corporation, USA
Abstract

In Golbeck and Hendler (2006), the authors consider those social friendship networking sites where users explicitly provide trust ratings to other members. However, for large social friendship networks it is infeasible to assign trust ratings to each and every member, so they propose an inference mechanism which assigns binary trust ratings (trustworthy/non-trustworthy) to those who have not been assigned one. They demonstrate the use of these trust values in the e-mail filtering application domain and report encouraging results. The authors also assume three crucial properties of trust for their approach to work: transitivity, asymmetry, and personalization. These trust scores are often transitive, meaning that if Alice trusts Bob and Bob trusts Charles, then Alice can trust Charles. Asymmetry says that for two people involved in a relationship, trust is not necessarily identical in both directions. This is contrary to what was proposed in Yu and Singh (2003), who assume symmetric trust values in the social friendship network. Social networks allow us to share experiences, thoughts, opinions, and ideas. Members of these networks, in return, experience a sense of community, a feeling of belonging, a bonding that members matter to one another and that their needs will be met through being together. Individuals expand their social networks, convene groups of like-minded individuals and nurture discussions. In recent years, computers and World Wide Web technologies have pushed social networks to a whole new level. They have made it possible for individuals to connect with each other beyond geographical barriers in a "flat" world.
The widespread awareness and pervasive usability of social networks can be partially attributed to Web 2.0. Representative interaction Web services of social networks are social friendship networks, the blogosphere, social and collaborative annotation (aka "folksonomies"), and media sharing. In this work, we briefly introduce each of these with a focus on social friendship networks and the blogosphere. We analyze and compare their varied characteristics, research issues, state-of-the-art approaches, and the challenges these social networking services have posed in community formation, evolution and dynamics, identifying reputable experts and influential members of the community, information diffusion in social networks, clustering communities into meaningful groups, collaboration recommendation, and mining "collective wisdom" or "open source intelligence" from the abundantly available user-generated content. We present a comparative study, put forth subtle yet essential differences between research in friendship networks and the blogosphere, and shed light on their potential research directions and on cross-pollination of the two fertile domains of ever expanding social networks on the Web.
Introduction

For many years psychologists, anthropologists and behavioral scientists have studied the societal capabilities of humans. They present several studies and results that substantiate the fact that humans like engaging themselves in complex social relationships and admire being part of social groups. People form communities and groups for the same reasons, to quench the thirst for social interaction. Often these groups have like-minded members or people with similar interests who discuss various issues including politics, economics, technology, life style, entertainment and what not. These discussions could be between two members of the group or involve several members. These social interactions also led researchers to hypothesize the "Small World Phenomenon" (also known as the "Small World Effect"), which states that everyone in this world can be contacted via a short chain of social acquaintances. A renowned experiment conducted by psychologist Stanley Milgram in 1967 to find the length of this short chain resulted in the discovery of very interesting observations. This finding gave rise to the famous concept of "six degrees of separation". Milgram asked his subjects to send mails through the US Post and keep passing them until they reached the destination. A more recent experiment conducted in 2001 by Duncan Watts, a professor at Columbia University, also concluded with similar results, although on a worldwide scale including 157 countries using e-mails and the internet as the medium for message passing. This "connectedness" aspect of social interactions between people has fascinated several researchers and the results have been applied to fields as varied as genealogy studies. Several sociologists have pointed out subtle differences between society and community, community being a more cohesive entity that promotes a sense of security and freedom among its members. With continued communication, members develop emotional bonds, intellectual pathways, enhanced linguistic abilities, critical thinking and a knack for problem solving. Researchers in the field of psycho-analysis have studied how these interactions within a community proceed and how a group evolves over time. This line of research deals more with the group dynamics and social behavior of communities as a whole with respect to each individual. Several anthropologists are also interested in groups that are bound by cultural ties and try to study their differences from traditional groups in aspects like communication styles, evolution patterns, participation and involvement, etc. For the past 15 years computers and the Internet have revolutionized the way people communicate. The Internet has made it possible for people to connect with each other beyond all geographical barriers.
This has tremendously affected social interactions between people and communities. People not only participate in regional issues but also in global issues. They can connect to people sitting on exactly the other side of the globe and discuss whatever they like, i.e., they live in a flat world. Communities can be spread across several time zones. This humongous mesh of social interactions is termed a social network. Social networks encompass interactions between different people, members of a community or members across different communities. Each person in this social network is represented as a node and the communications represent the links or edges among these nodes. A social network comprises several focussed groups or communities that can be treated as subgraphs. These social networks and subgraphs are highly dynamic in nature, which has fascinated several researchers to study the structural and temporal characteristics of social networks. These social interactions could take one of the following forms: friendship networks, the blogosphere, media sharing, and social and collaborative annotation or "folksonomy". Next we explain each of these in detail. However, in this chapter we will focus on two special types of social networking phenomena: social friendship networks and the blogosphere. Social friendship networks are friendship oriented social networks which are predominantly used to connect and stay in touch with colleagues. Members can search for other people, connect with them and grow their networks. They can send messages to each other and share experiences and opinions with their friends. Communities may evolve gradually over a long time period. A very first social friendship networking Web site was Classmates.com, which helped people find and connect to their college or school peers. The advent of Web 2.0 has produced a flood of social networking sites since then. Web 2.0, a term coined by Tim O'Reilly, refers to a perceived second generation of Web based services which is a business revolution in the computer industry using the internet as a new business platform. Increased collaboration and a desktop-like experience were some key features of Web 2.0. It simplified and enriched the user experience on the Web and brought more people to join the wave of social networking through the Internet. Within a short span of 3 years, there were many social friendship networking Web sites like Orkut.com, Facebook.com, Myspace.com and more. All of them are very similar to the FOAF network, which creates machine-readable profiles of its members based on an RDF schema (a W3C specification originally designed as a metadata model but often used as a method for modeling information) that helps in defining relationships between people, and various attributes such as name, gender, and interests. These services make it easier to share and use information about people and their activities. Social friendship networking Web sites make it relatively simple to analyze complex structural and temporal behavior compared to the traditional social networking experiments conducted by Stanley Milgram, from a data accessibility point of view. Researchers name this type of analysis social network analysis (SNA), which is an interdisciplinary study including sociology, anthropology, sociolinguistics, geography, social psychology, information science and organizational studies. They study interpersonal and personal relations and how they are affected by the structure and composition of these ties.
The shape and evolution of a social network helps in studying its usefulness to an individual or a community. Social network analysis focusses more on the relationships of the actors as compared to their attributes, which were studied in traditional social experiments. SNA has been used to examine interactions not only between different actors but also between Webpages, organizations and their employees. Typical applications of social networks used by mathematicians include the Erdos Number, which is the co-author index with Paul Erdos. People also treat this as a collaboration index. Web 2.0 services, especially blogs, have encouraged participatory journalism. Former information consumers are now the new producers. Web 2.0 has allowed the mass to contribute and edit articles through wikis and blogs. Giving the mass access to contribute or edit has also increased collaboration among people, unlike Web 1.0, where there was no collaboration as the access to the content was limited to a chosen few.
Increased collaboration has developed enormous open source intelligence or collective wisdom on the internet which was not there in Web 1.0. "We the media" (Gillmor, 2006), a phenomenon named by Dan Gillmor, describes a world in which "the former audience", not a few people in the back room, decides what is important, transforming the lecture style of information consumption to conversation based assimilation. Such an interactive information delivery medium hosts a perfect breeding ground for virtual communities, or communities that originate over the Internet. There has been a lot of ongoing research to mine knowledge from this pool of collective wisdom. Other forms of social interaction offered under the umbrella of Web 2.0 are wikis, social and collaborative annotations like del.icio.us that constitute "folksonomy", and media sharing including online photo and video sharing like Flickr and Youtube. Web 2.0 could also be considered as a "semantic Web" which uses human knowledge to assign tags (metadata) to various resources including Webpages, images, videos etc. These tags create a human generated taxonomy that can make information increasingly easy to search, discover and navigate over time. This human generated taxonomy is termed "folksonomy". This has also created a plethora of open source intelligence or collective wisdom on the Internet. People have liked the idea of Web 2.0 and social networking so much that a new social Web browser named "Flock" was also released, which combines most of these social interaction services offered under the hat of Web 2.0. The top 20 most visited Web sites around the globe released by the Alexa search engine, a network monitoring company owned by Amazon.com, clearly show an increasing trend towards social networking Web sites. Out of these 20 Web sites, 12 are social networking Web sites including social friendship networks, blogs, wikis, social and collaborative annotations and media sharing. The percentage of social networking Web sites among the top 20 most visited Web sites has been increasing over the years, as can be easily observed from the statistics available at Alexa. These social networking structures have been studied in great deal in terms of information propagation and influential nodes. People have applied network and graph analysis to detect the information brokers and bellwethers of communities. Researchers have studied it from the infectious disease propagation domain and its applications have been anticipated in viral marketing. Sociologists have studied the effects of influence in physical communities (Keller and Berry, 2003). Studying the propagation of influence and its effects has become an extremely prominent research area. In this work we will discuss various ongoing research in the field of social networks with a special focus on social friendship networks and the blogosphere. We also discuss the impact and applications of this research and the latest trends. We compare and contrast and point out subtle differences between social friendship networks and internet publishing services like blogs. These differences act as the foundations for basic differences between various on-going research initiatives in social friendship networking sites and blog sites.
Background

In this section we discuss necessary background concepts that will help in understanding the details of the ongoing research work in social friendship networking sites and the online publishing medium, blogs. These concepts will also make it easier to understand the subtleties between these two domains. Social friendship network sites like Orkut, Facebook, Myspace, LiveJournal, etc. provide an interface for people to connect online. People can add others to their list of friends and visit their profiles.
Table 1. Comparing individual and community blog sites
They can get details of other members of the social network, write messages to each other, join communities, take part in discussions and various other activities. The key concept of a social friendship network is the underlying social structure. All the members or actors in the social friendship network are treated as nodes and the connections, also known as relations, are treated as the links or edges. Combining these nodes and edges gives us the representation of a graph. Please note that this is an undirected graph and could contain cycles. Social friendship networks are much like network topologies; however, they are much more dynamic in nature. Information flow across a social friendship network may not follow the traditional network topology algorithms like shortest route. Information/messages might actually travel through totally different and much longer paths before reaching the other nodes or members of the social friendship network. This requires a new field of study and research known as social network analysis (SNA) to model the complex relationships between actors of social friendship networks and the information flow across the network. Social network analysis is a cross disciplinary science that tries to map and measure the relationships between different information processing entities like humans, groups, computers, organizations etc. and model the flow of this information across each of these entities represented by a social network. These entities act as nodes and the relationships between them act as the links of the network. Social network analysis has been used widely outside the social network domain in organizations, where it is called Organizational Network Analysis (ONA). Businesses use this methodology to analyze their clients and their interactions within themselves and with their business. Social network analysis is used to identify leaders, mavens, brokers, groups, connectors (bridges between groups), mavericks etc. Researchers have used several centrality measures to gauge the information flow across a social network, which could help in identifying the different roles of the aforementioned nodes and groupings within them. Centrality measures help in studying the structural attributes of nodes in a network. They help in studying the structural location of a node in the network, which could decide the importance, influence or prominence of a node in the network. Centrality measures help in estimating the extent to which the network revolves around a node. Different centrality measures include degree centrality, closeness centrality, and betweenness centrality. Degree centrality refers to the total connections or ties a node has in the network. This could be imagined as a "hubness" value of that node. Row or column sums of an adjacency matrix would give the degree centrality for that node. Closeness centrality refers to the sum of all the geodesic distances of a node to all other nodes in the network. Betweenness centrality refers to the extent a node is directly connected to nodes that are not directly connected to each other, or the number of geodesic paths that pass through this node. This evaluates how well a node can act as a "bridge" or intermediary between different sub-networks. A node with high betweenness centrality can become a "broker" between different sub-networks. Eigenvector centrality defines a node to be central if it is connected to those who are central. It is given by the principal eigenvector of the adjacency matrix of the network. Other SNA measures used for analyzing social networks are the clustering coefficient (to measure the likelihood that associates of a node are associates among themselves to ensure greater cliquishness), cohesion (extent to which the actors are connected directly to each other), density (proportion of ties of a node to the total number of ties this node's friends have), radiality (extent to which an individual's network reaches out into the network and provides novel information), and reach (extent to which any member of a network can reach other members of the network).
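The centrality measures introduced above can be computed directly with standard graph libraries. The following sketch uses the networkx package on a small, made-up friendship graph; the member names are purely illustrative.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([("Alice", "Bob"), ("Bob", "Charles"), ("Bob", "Dana"),
                  ("Charles", "Dana"), ("Dana", "Eve")])

print(nx.degree_centrality(g))       # normalized number of ties per member
print(nx.closeness_centrality(g))    # based on summed geodesic distances
print(nx.betweenness_centrality(g))  # share of geodesic paths through a node
print(nx.eigenvector_centrality(g))  # central if connected to central nodes
print(nx.clustering(g))              # clustering coefficient per member
```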
The term "blog" is derived from the word "Web-log", which means a Web site that displays in reverse chronological order the entries by one or more individuals and usually has links to comments on specific postings. Each of these entries is called a blog post. The blogosphere can be considered as the universe that contains all these blog sites. Blog sites often provide opinions, commentaries or news on a particular subject, such as food, politics, or local news; some function more like personal online diaries. A typical blog can combine text, images, and links to other blogs, Web pages, and other media related to its topic. The ability for readers to leave comments in an interactive format is an important part of many blog sites. All these are places where a group of users, known as bloggers, get together and share their opinions, views or personal experiences and form part of a wider network of social media. Blogs have become a means through which many new ideas and much information flow across the Web very rapidly. Blog posts are often associated with a permalink, which is a static link to that particular blog post. There are several measures associated with blog posts that can be utilized besides network analysis to detect the various roles of bloggers, who could be treated as nodes. A blog post can be characterized by inlinks (number of blog posts or Webpages that cite this particular blog post), outlinks (number of blog posts or Webpages this particular blog post refers to), number of comments, length of the blog post, and rate of comments (rate at which people submit comments on this particular blog post). Besides these, often there is some metadata associated with the blog posts like date and time of posting, category or annotation of the blog post and blogger ID. All these measures are blog post level measures, but there could be some measures which are blogger level. Examples of such measures are the number of blog posts submitted during a particular time window and the number of comments submitted by a blogger. Blog sites can be categorized into individual blog sites or community blog sites. Individual blog sites are the ones owned and maintained by an individual. Usually individual blog sites do not have a lot of discussion. Examples of individual blogs could be Sifry's Alerts: David Sifry's musings (Founder & CEO, Technorati), Ratcliffe Blog–Mitch's Open Notebook, The Webquarters, etc. On the other hand, community blog sites are owned and maintained by a group of like-minded users. As a result, more discussion and interaction could be observed in such blog sites.
Examples of community blogs are Google's Official Blog [18], The Unofficial Apple Weblog [19], Engadget [20], Boing Boing: A Directory of Wonderful Things [21], etc. We summarize the differences in Table 1. Here we focus on community blogs, as they provide abundant collective wisdom and open source intelligence. Henceforth, community blogs are referred to as blogs.
Social Friendship Networks and Blogosphere

In this section we study various ongoing research in social networks. We primarily focus on social friendship networks and the blogosphere. Social networks provide a means of visualizing existing and potential relationships and interactions in an organizational setting. We try to categorize research based on the application domain and point out the latest techniques, validation methods and datasets utilized by these papers. This way we study the challenges involved in research in social friendship networks and the blogosphere.
Social Friendship Networks

Here we study research with respect to social friendship networks. We try to compare and contrast different studies in, or using, social friendship networks in various domains. We already mentioned that social friendship networks are the best way to model interactions and relationships in an organization. Researchers have carried the concepts of social friendship networks over to various collaborative research initiatives. People have studied social friendship networks in great detail and sought their impact in modeling interactions in a community. Since social friendship networks have far-reaching advantages in studying interactions and relations, various collaborative recommendation researchers have exploited their concepts and reported encouraging results.

David McDonald proposed a social friendship network based collaboration recommendation approach in an organization (McDonald, 2003). He studied the problem of recommending an appropriate collaboration or an expert to an employee of an organization who is in need of one. In his approach he created two social friendship networks in the organization. One was called the Work Group Graph (WGG), which was a context-sensitive social friendship network. People working on similar things were part of the same community within this social friendship network. The basic assumption behind forming such a social friendship network is that people usually have doubts about the things they are working on, and the best expert would be someone who is also working on similar things. Nodes within a community represent logical work similarity. This social friendship network was collected using a qualitative approach. A main disadvantage of this social friendship network is that it does not effectively identify individuals who span contexts. The second social friendship network, Successive Pile Sort (SPS), was collected based on prior knowledge of "who hangs out together". The basic assumption of such a social friendship network based collaboration approach is that individuals who socialize frequently are more likely to collaborate. An important factor in information sharing in an organization is the sociability of various individuals. This social friendship network was collected using a quantitative approach, and it could possibly identify individuals who span contexts. A significant disadvantage of SPS-based collaboration recommendation, however, is that people would like the person who knows the most, not a "friend". Considering the pros and cons of each approach, they developed an expertise recommender (ER) engine which combines the knowledge of both these social friendship networks on an on-demand basis.

Bonhard et al. (Bonhard et al., 2006) proposed a movie recommendation system incorporating social friendship network data, demographics and rating overlap. Since people inherently know which of their friends to trust for a particular recommendation, recommendation systems should include such data. The basic assumption in embedding social friendship networking data in recommendation generation is that decision makers tend to seek advice from familiar advisors because they know where their tastes overlap, they might have received good advice previously (hence the risk of receiving bad advice is reduced), and they can simply rely on it. Social psychology has shown that people like others who are, among other things, familiar and similar to themselves and with whom they have a history of interaction.
Netflix [22] has also given their customers an option to establish a social friendship network and collaborate online to select and recommend movies to their friends.
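A highly simplified sketch of the intuition behind such friendship-aware recommendation: a friend's ratings are weighted by familiarity and profile overlap before being aggregated. The data structures, names, and weighting scheme below are illustrative assumptions, not the actual system of Bonhard et al.

```python
# Toy sketch: score candidate movies for a user by combining friends' ratings
# with familiarity (friendship strength) and profile (interest) overlap.

def jaccard(a, b):
    """Overlap between two interest sets (a crude profile similarity)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def recommend(user, friends, interests, ratings, familiarity, top_n=3):
    scores = {}
    for friend in friends[user]:
        sim = jaccard(interests[user], interests[friend])
        trust = familiarity[(user, friend)]
        for movie, rating in ratings[friend].items():
            if movie in ratings[user]:
                continue  # skip movies the user has already rated
            scores[movie] = scores.get(movie, 0.0) + trust * sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical data: one user, two friends with different familiarity levels.
friends = {"ann": ["bob", "cat"]}
interests = {"ann": {"sci-fi", "drama"}, "bob": {"sci-fi"}, "cat": {"comedy"}}
ratings = {"ann": {"Solaris": 5}, "bob": {"Alien": 5, "Solaris": 4}, "cat": {"Up": 4}}
familiarity = {("ann", "bob"): 0.9, ("ann", "cat"): 0.4}
print(recommend("ann", friends, interests, ratings, familiarity))
```

The design choice mirrors the argument in the text: advice from familiar, similar advisors is weighted more heavily than advice from distant or dissimilar contacts.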
Demographics also play an important role in recommendation. Bonhard et al. suggest, and verify experimentally, that matching people according to their profiles in terms of hobbies and interests, rather than just item ratings, and explaining recommendations would make it easier for users to judge the appropriateness of a recommendation. People prefer to cooperate with others with a similar demographic background. They conducted experiments with three independent variables, viz., familiarity (social friendship networks), profile similarity (demographics-based similarity), and rating overlap, and performed a 2 × 2 × 2 ANOVA analysis. The authors conducted a 100-user study and reported that familiarity and profile similarity both have an overwhelming effect on recommendation performance.

McNee et al. (McNee et al., 2002) studied recommending related work and citations for a research paper. They treated different papers as the "users" of a user-item matrix and the other papers a paper cites as the "items" in this matrix. They create a citation Web, similar to the social friendship network formed between different people. This constructs paper-paper relationships, as opposed to the author-paper or author-author relations in traditional citation recommender systems. This helps in solving the cold-start problem, as every paper cites a good number of relevant papers. Since these are genuine papers, there is no problem of "rogue" users as in traditional recommender systems. But the "social friendship network" of a paper is pretty much fixed, as its citations are fixed once it is published. This is contrary to social friendship network research, as members of a social friendship network always evolve their networks. Although a lot of researchers have promoted the use of social friendship networks in recommender systems, since both exploit the principles of social collaboration, Terveen and McDonald (2005) pointed out some significant differences between the two domains. Recommender systems assign a single rating to each item, like movies or books. For instance, a movie could be assigned a rating between 1 and 5. On the contrary, a person could be judged in several different ways depending on the context in which the person is judged. A person can be highly talented but not necessarily a good team worker, so assigning a single rating to a person might not be easily justifiable. These ratings therefore depend a lot more on context in the social friendship network domain than in the recommender systems domain.

Social networks play a fundamental role as a medium to spread information, ideas and influence among their members. An excellent application of information flow through a social network lies in studies of the adoption of new ideas and innovations within the underlying social networks. This includes the extent to which people's decisions are affected by their colleagues, family, etc. Such network diffusion studies have a long history in the social sciences in terms of "word-of-mouth" and "viral marketing". Subramani and Rajagopalan related the effectiveness of viral marketing to the role of the influencer, that is, whether the attempt to influence is passive or actively persuasive (Subramani and Rajagopalan, 2003).
They also attributed the success of viral marketing to network externalities: the advantages that come along when a community adopts a new idea or product, such as help and support and resource sharing, which again reinforces the need for good influential members. In social friendship networks, the problem of influence spread is formulated as finding the subset of nodes/actors/members of the social friendship network which can maximize the cascade of influence in the network. These nodes are called the "influential" members, and they could be used as the initial target for viral marketing. Several models have been studied to find such a subset of influential members of a social friendship network. The optimal solution is NP-hard for most models (Domingos and Richardson, 2001). On the other hand, Richardson and Domingos (2002) proposed a framework based on a simple linear model for identifying influential members, in which the optimization problem can be solved by solving a system of linear equations. Kempe et al. (Kempe et al., 2003) proposed a framework that lies somewhere in between Domingos and Richardson (2001) and Richardson and Domingos (2002) in terms of model complexity, by approximating the optimization problem within guaranteed bounds.
Two diffusion models are generally used in modeling influence flow across a social friendship network: the linear threshold model and the independent cascade model. The linear threshold model (Granovetter, 1978; Schelling, 1978) assumes a linear relation between influencing or active nodes and influenced or non-active nodes. It defines the influencing capacity and influence tolerance limits of each node. If the sum of the influencing capacities of a node's neighbors exceeds its tolerance, then the node becomes influenced and active. The independent cascade model (Durrett, 1988; Liggett, 1985) views the process of influence flow as a cascade of events, where an event represents a node being influenced or "activated". A system parameter is assigned to each node v, which is its probability of influencing a neighbor w. If v succeeds in influencing w, then w gets activated at time t + 1; otherwise no further attempts are made by v to influence w. This random process continues until no further activations are possible. Kempe et al. (2003) proposed approximations to both the linear threshold model and the independent cascade model within guaranteed bounds and compared their approximated model with other social friendship network analysis based heuristics.

In social friendship networks it is important not only to detect the influential members, or experts in the case of knowledge sharing communities, but also to assess to what extent some of the members are recognized as experts by their colleagues in the community. This leads to the estimation of the trust and reputation of these experts. Some social friendship networks, like Orkut, allow users to assign trust ratings, implying a more explicit notion of trust, whereas some Web sites have an implicit notion of trust, where creating a link to a person on a Web page implies some amount of business trust in the person. In other cases, the trust and reputation of experts can typically be assessed as a function of the quality of their responses to other members' knowledge solicitations. Pujol et al. (Pujol et al., 2002) proposed a NodeMatching algorithm to compute the authority or reputation of a node based on its location in the social friendship network. A node's authority depends upon the authority of the nodes that relate to it and also on the other nodes that it relates to. The basic idea is to propagate the reputation of nodes in the social friendship network. This is very similar to the PageRank and HITS algorithms, well known from traditional Web search. However, the authors point out the differences between their algorithm and PageRank and HITS. For PageRank and HITS, the transition probability matrix and the variance-covariance matrix, respectively, have to be known in advance, unlike for the NodeMatching algorithm; this becomes infeasible for very large graphs. Moreover, PageRank assumes a fixed graph topology by stratifying the range of transition probabilities, whereas NodeMatching can automatically adapt to the topology, since it depends upon the authority of the related nodes.
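Returning to the independent cascade model described earlier in this section, the following compact sketch simulates the cascade and adds the greedy seed-selection idea in the spirit of Kempe et al. (2003). The graph, the uniform propagation probability, and the Monte Carlo settings are illustrative assumptions.

```python
import random

def independent_cascade(neighbors, p, seeds, rng):
    """One cascade: each newly activated node gets a single chance to
    activate each inactive neighbor w with probability p[(v, w)]."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for v in frontier:
            for w in neighbors.get(v, []):
                if w not in active and rng.random() < p.get((v, w), 0.0):
                    active.add(w)
                    nxt.append(w)
        frontier = nxt
    return len(active)

def expected_spread(neighbors, p, seeds, runs=500, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    return sum(independent_cascade(neighbors, p, seeds, rng) for _ in range(runs)) / runs

def greedy_seeds(neighbors, p, k):
    """Greedy hill climbing: repeatedly add the node with the largest
    marginal gain in estimated expected spread."""
    chosen = []
    nodes = set(neighbors) | {w for ws in neighbors.values() for w in ws}
    for _ in range(k):
        best = max((n for n in nodes if n not in chosen),
                   key=lambda n: expected_spread(neighbors, p, chosen + [n]))
        chosen.append(best)
    return chosen

# Hypothetical directed influence graph with a uniform probability of 0.2.
neighbors = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["f"], "e": ["f"]}
p = {(v, w): 0.2 for v, ws in neighbors.items() for w in ws}
print(greedy_seeds(neighbors, p, k=2))
```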
While Pujol et al. (2002) proposed an approach to establish reputation based on the position of each member in the social friendship network, Yu and Singh (2003) developed a model for reputation management based on the Dempster-Shafer theory of evidence, in the wake of spurious testimonies provided by malicious members of the social friendship network. Each member of a social friendship network is called an agent. Each agent has a set of acquaintances, a subset of which forms its neighbors. Each agent builds a model of its acquaintances to quantify their expertise and sociability. These models are dynamic and change based on the agent's direct interactions with the given acquaintance, interactions with agents referred to by the acquaintance, and the ratings this acquaintance received from other agents. The authors point out a significant problem with this approach, which arises if some acquaintances or other agents generate spurious ratings, exaggerate positive or negative ratings, or offer testimonies that are outright false. Yu and Singh (2003) study this problem of deception using Dempster-Shafer belief functions so as to capture the uncertainty in the ratings caused by malicious agents.
A variant of a weighted majority function is applied to the belief functions, and simple deception models were studied to detect deception in the ratings. Sabater and Sierra (2002) propose a combination of reputation scores along three different dimensions. They combined reputation scores not only through social relations governed by a social friendship network, termed the social dimension, but also through past experiences based on individual interactions, termed the individual dimension, and through reputation scores based on other dimensions, termed the ontological dimension. For large social friendship networks it is not always possible to obtain reputation scores based on just the individual dimension, so the social dimension can be used instead, and the ontological dimension can enhance the reputation estimation by considering different contexts. The ontological dimension is very similar to the work proposed in Terveen and McDonald (2005), where the authors recommend collaboration in social friendship networks based on several factors and explain the importance of context in recommending a member of a social friendship network for collaboration.

Golbeck and Hendler (2006) studied the problem of inferring trust relationships between two members. Personalization of trust means that a member could have different trust values with respect to different members; the trust of a member is entirely a personal opinion. Consolidating the trust scores for a member might not give a reasonable estimation, so the authors propose a trust propagation mechanism. The authors define the source as the node which is seeking the trust value of another node, called the sink. If there is a direct edge between the source and the sink, then the value is directly transferred; otherwise the trust value is inferred based on the source's neighbors. The source polls each of the neighbors to whom it has given a positive trust rating. The neighbors also use this procedure to compute the trust rating of the sink. Hence the sink's trust scores gradually propagate to the source. They demonstrate the use of trust ratings in filtering e-mails with the help of a prototype, TrustMail, using the Enron e-mail dataset [23]. Guha et al. (Guha et al., 2004) proposed another trust propagation scheme for social friendship networks, but they included the element of distrust along with the trust scores.

Another ongoing line of research in the social friendship network domain tries to study the characteristics of a community in a social friendship network. Several research papers study the characteristics of the social friendship network in terms of how communities can be inferred from the graph structure of social friendship networks. A lot of work has already been done on inferring communities of Web pages using link structures (Flake et al., 2000; Gibson et al., 1998; Kumar et al., 1999). However, some papers also discuss, once these communities are inferred or explicitly established, how they evolve, what influences a member of the social friendship network to join or leave a community, and how topic drift influences the change in community, or vice versa. The problem of inferring community structures using unsupervised graph clustering has been presented in several research works, including Flake et al. (2002, 2004), Girvan and Newman (2002), Hopcroft et al. (2003), and Newman (2004). Newman (Newman, 2004) summarizes different approaches for inferring communities from the computer science and sociology perspectives.
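The source-polls-its-trusted-neighbors procedure described above can be sketched as a simple recursive inference. The averaging rule, the depth cap, and the rating values below are illustrative assumptions rather than the exact algorithm behind TrustMail.

```python
# Sketch of recursive trust inference: if the source has no direct rating for
# the sink, ask positively rated neighbors and average their inferred ratings.
def infer_trust(trust, source, sink, visited=None, max_depth=4):
    if visited is None:
        visited = set()
    direct = trust.get(source, {})
    if sink in direct:
        return direct[sink]          # direct edge: transfer the value as-is
    if max_depth == 0:
        return None
    visited = visited | {source}     # avoid cycles in the friendship graph
    estimates = []
    for neighbor, rating in direct.items():
        if rating > 0 and neighbor not in visited:
            est = infer_trust(trust, neighbor, sink, visited, max_depth - 1)
            if est is not None:
                estimates.append(est)
    return sum(estimates) / len(estimates) if estimates else None

# Hypothetical trust ratings on a 0-10 scale.
trust = {
    "source": {"a": 8, "b": 6},
    "a": {"sink": 9},
    "b": {"c": 7},
    "c": {"sink": 4},
}
print(infer_trust(trust, "source", "sink"))  # neighbor a reports 9, b -> c reports 4, average 6.5
```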
Newman explains traditional graph clustering algorithms such as spectral bisection and the Kernighan-Lin algorithm. The principal disadvantage of these two methods is that they split the graph into two subgraphs, and repeated bisection does not always give satisfactory results; moreover, they do not provide a stopping condition. Specifically, the Kernighan-Lin algorithm expects the user to specify the size of each of the subgraphs, which is often not pragmatic in real-world social friendship networks. Newman also mentions hierarchical clustering as another alternative for graph clustering, by assigning similarity scores to pairs of vertices and then using either single linkage or complete linkage to obtain a dendrogram. Although complete linkage has more desirable properties than single linkage, it still has some disadvantages: finding cliques in a graph is a hard problem, and cliques are not unique.
More recent algorithms use a divisive method, as opposed to the agglomerative method of hierarchical clustering. Girvan and Newman (Girvan and Newman, 2002) proposed a divisive graph clustering method based on edge removal. They use betweenness, a social friendship network analysis centrality measure, to identify which edge to remove, and they repeat the procedure until there is no edge left. At each iteration they form a dendrogram, just like in the hierarchical clustering approach. Horizontal cross-sections of the dendrogram represent possible community divisions, with a larger or smaller number of communities depending on the position of the cut.

Community identification research has also been used in other application domains, like identifying terrorist groups or communities on the Internet; this is also called the Dark Web [24]. Coffman and Marcus (Coffman and Marcus, 2004) studied the characteristics of such groups and extracted some commonly discovered patterns. They propose a "spoke-and-hub" model to simulate the interaction patterns of terrorist groups. They identified certain states in these interaction patterns and trained an HMM to predict the possibility of terrorist groups based on the interaction patterns and how they evolve in a test group. Backstrom et al. (Backstrom et al., 2006) proposed a decision tree based approach to model the membership, growth and change of an established community. They reported that a member of a social friendship network would join a new community depending on the number of friends (s)he already has in that community and also on the underlying social friendship network structure of his/her friends in the community. They also pointed out an interesting observation about community growth: the number of triads (closed triangles of individuals) a community has decreases the community's chances to grow. By closed triangles of individuals they mean that the conversation/interaction is limited within this clique. If there is a lot of "cliquishness", it makes a community less interesting for new members to join and ultimately for that community to grow. They conducted some experiments to study movement within a community, studying whether topic changes induce author movement or author movement induces topic changes in overlapping communities. They concluded that topic changes induce much faster author movement than topic changes induced by author movement, especially for conference data obtained from DBLP.

Most of the research done so far in the social friendship network domain assumes homogeneous relations among members; however, Cai et al. (2005) study the effect of heterogeneous relations in a social friendship network. Based on the different relations inferred with the help of user queries on the data, the proposed method is able to identify some "hidden" communities that cannot be found by simply looking at the explicitly given relationships in the network. Social friendship network research has been applied to several other application domains, like detecting conflicts of interest (Aleman-Meza et al., 2006) and Web appearance disambiguation (Bekkerman and McCallum, 2005). Online Web sites like Kaboodle [25] allow users to comment on their friends' wishlists. People can discuss different products, events, collections, even favorites with other users.
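Referring back to the Girvan-Newman edge-betweenness procedure described above, here is a minimal sketch using the NetworkX library (a toolkit assumption, not part of the original work).

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Hypothetical friendship graph: two tight groups joined by one bridging edge.
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3),
                  (4, 5), (4, 6), (5, 6),
                  (3, 4)])

# girvan_newman yields successively finer partitions as the edges with the
# highest betweenness are removed; take the first split into two communities.
first_split = next(girvan_newman(G))
print([sorted(c) for c in first_split])  # expected: [[1, 2, 3], [4, 5, 6]]
```

The single bridging edge (3, 4) carries the highest edge betweenness, so it is removed first, which is exactly the behavior the divisive method relies on.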
Blogosphere

Until now we have focused on one type of social network, i.e., friendship networks, which have explicit relationship information between individuals. Here we focus on a special class of social networks, i.e., the blogosphere, where we do not have explicit relationship information between individuals. This forms a foundational difference in the research carried out in the two domains. There are other subtle differences between social friendship networks and the blogosphere. Friendship networks are predominantly used to make friends and to remain in touch with each other. Blogs are used in a different context.
They are usually used for sharing ideas and opinions within a community. The blogosphere is not as focused as friendship networks, and research on the blogosphere is relatively fledgling compared to that on social friendship networks. Several researchers have tried to transform the problem domain of the blogosphere into that of social friendship networks and then apply research from the social friendship network domain; we will discuss these initiatives later in this section. We now discuss various ongoing research in the blogosphere domain, characterized by application areas.

The blogosphere is a storehouse of several publicly regulated media. Technorati [26] reported that 175,000 blog posts were created daily, which is about two blog posts per second. This explosive growth makes it beyond human capability to look for interesting and relevant blog posts. Therefore, a lot of research is going on to automatically characterize different blogs into meaningful groups, so readers can focus on interesting categories rather than filtering out relevant blogs from the whole jungle. Blog sites often allow their users to provide tags for blog posts. This human-labeled tag information forms the so-called "folksonomy". Brooks and Montanez (Brooks and Montanez, 2006) presented a study showing that human-labeled tags are good for classifying blog posts into broad categories, while they are less effective in indicating the particular content of a blog post. They used the tf-idf measure to pick the top three most salient words in every blog post, computed the pairwise similarity among all the blog posts, and clustered them. They compared the results with the clustering obtained using the human-labeled tags and reported significant improvement. In other work, Li et al. (2007) tried to cluster blog posts by assigning different weights to the title, body and comments of a blog post. Clustering different blog posts would also help blog search engines like Technorati to focus and narrow the search space once the query context is clear. Web sites like Blogcatalog [27] organize several blog sites into a taxonomic structure that helps in focused browsing of blog sites.

Another important line of research that branched out from blog-site clustering is determining and inferring communities. Members of these communities have a sense of community and experience a feeling of belonging, a feeling that members matter to one another and that their needs will be met through their commitment to be together. Several studies have looked into identifying communities in blogs. One method that researchers commonly use is content analysis and text analysis of the blog posts to identify communities in the blogosphere (Blanchard, 2004; Efimova and Hendrick, 2005; Kumar et al., 2003). Kleinberg (Kleinberg, 1998) used an alternative, hubs-and-authorities-based approach to identifying such communities, clustering all the expert communities together by identifying them as authorities. Kumar et al. (Kumar et al., 1999) extended the idea of hubs and authorities and included co-citation as a way to extract all communities on the Web, using graph theory algorithms to identify all instances of graph structures that reflect community characteristics. Chin and Chignell (Chin and Chignell, 2006) proposed a model for finding communities that takes the blogging behavior of bloggers into account, aligning behavioral approaches to studying community with the network and link analysis approaches. They used a case study to first calibrate a measure for evaluating a community based on behavioral aspects, using a behavioral survey that could be generalized later on, removing the need for such surveys.
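The tf-idf-based clustering of blog posts described above can be sketched as follows. This is a simplified illustration using scikit-learn (a toolkit assumption) that clusters whole tf-idf vectors with k-means rather than reproducing the exact top-three-terms procedure of Brooks and Montanez; the example posts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical blog posts: two about gadgets, two about politics.
posts = [
    "new phone review battery screen camera",
    "camera lens review for travel photography",
    "election results and government policy debate",
    "senate votes on new policy bill",
]

vectors = TfidfVectorizer().fit_transform(posts)   # tf-idf weighting of terms
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for label, post in zip(labels, posts):
    print(label, post[:40])
```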
Several researchers have also studied community extraction and social network formation using newsboards and discussion boards. Although these differ from the blogosphere, we include this research here because discussion boards and newsboards are very similar to blogs in the sense that they also do not have an explicit link structure, and the communication is not "person-to-person" but rather "person-to-group". Blanchard and Markus (Blanchard and Markus, 2004) studied a virtual settlement, the Multiple Sport Newsgroup, and analyzed the possibility of emerging virtual communities in it.
They studied the characteristics of the newsgroup by conducting interviews with three different kinds of members: leaders (active and well respected), participants (occasionally active, around events like triathlons) and lurkers (readers only). They reported that different virtual communities emerge between athletes and those who join the community to keep themselves informed of the latest developments.

As communities evolve over time, so do the bellwethers or leaders of the communities, who possess the power to influence the mainstream. According to studies in Keller and Berry (2003), 83% of people prefer consulting family, friends or an expert over traditional advertising before trying a new restaurant, 71% of people prefer to do so before buying a prescription drug or visiting a place, and 61% prefer to do so before watching a movie. This style of marketing is known as "word-of-mouth". "Word-of-mouth" has been found to be more effective than traditional advertising in physical communities. Studies from Keller and Berry (2003) show that before people buy, they talk, and they listen. These experts can influence the decisions of people. If we consider "word-of-mouth" as a broadcast radio signal, then these experts are the transmitters that amplify the signal and make it heard by more people. For this reason these experts are aptly termed The Influentials. Influential bloggers tend to submit influential blog posts that affect other members' decisions and opinions. They accrue respect in the community over time, and other members tend to listen to what the influentials say before making decisions. Identification of these influential bloggers (Agarwal et al., 2008) could lead to several interesting applications. The influentials can act as market-movers: since they can influence the buying decisions of the mainstream, companies can promote them as latent brand ambassadors for their products. Being such a highly interactive medium, blogs tend to host several vivid discussions on various issues, including new products, services, marketing strategies and their comparative studies. Often this discussion also acts as "word-of-mouth" advertising for several products and services. A lot of advertising companies, approximately 64% (Elkin), have acknowledged this fact and are shifting their focus towards blog advertising and identifying these influentials. The influentials could also sway opinions in political campaigns, elections and reactions to government policies (Drezner and Farrell, 2004). Because they know many people and soak up a large amount of information, influentials stand out as smart, informed sources of advice and insight. Approximately 84% of influentials in physical communities are interested in politics and are sought out by others for their perspectives on politics and government, 55% on a regular basis. The influentials could also help in customer support and troubleshooting. A lot of companies these days host their own customer blogs, where people can discuss issues related to a product. Often influentials on these blogs troubleshoot the problems peer consumers are having, and they can be trusted because of the sense of authority these influentials possess. Often influentials offer suggestions to improve products. These invaluable comments could be really helpful for companies and customers. Instead of going through each member's blog posts, companies can focus on the influentials' blog posts.
For instance, Macromedia [28] aggregates, categorizes and searches the blog posts of 500 people who write about Macromedia's technology. This could also change the design of future market research surveys conducted by many companies. Instead of printing out survey forms and passing them to regular and potential consumers, companies just need to release these forms in the blog communities and act on the feedback of influential bloggers. Recently, Apple Inc. announced its iPhone, to be released in June 2007. They gave a full demonstration in January, which triggered a surge of blog posts across several blog sites. A lot of blog posts talk about its features and compliment it. Some blog posts point out potential limitations, which could be helpful for Apple Inc. to consider before releasing the product in the market. This kind of market research is also called "use the views".
Some recent numbers from Technorati show a 100% increase in the size of the blogosphere every six months. It has grown over 60 times during the past three years, and approximately two new blog posts appear every second [29]. With new blog posts being generated at such a blazing rate, it is impossible to keep track of what is going on in the blogosphere. Many blog readers/subscribers just want to know the most insightful and authoritative stories. Blog posts from influential bloggers would serve exactly this purpose by standing out as representative articles of a blog site. The influentials can be the showcases of a group on the blogosphere. These interesting applications have attracted a surge of research in identifying influential blog sites as well as influential bloggers. Some try to find influential blog sites (both individual and community blog sites) in the entire blogosphere and study how they influence the external world and the rest of the blogosphere (Gill, 2004). Ranking a blog site seems similar to ranking a Web site. However, as pointed out in Kritikopoulos et al. (2006), blog sites in the blogosphere are very sparsely linked, and it is not suitable to rank blog sites using Web ranking algorithms like PageRank (Page et al., 1998) and HITS (Kleinberg, 1998). Thus, the authors in Kritikopoulos et al. (2006) suggest adding implicit links to increase the density of link information based on topics: if two blogs are talking about the same topic, an edge can be added between these two blogs based on the topic similarity. A similar strategy, adopted by Adar et al. (Adar et al., 2004), is to consider the implicit link structure of blog posts. In their iRank algorithm, a classifier is built to predict whether or not two blogs should be linked. The objective in this work is to find the path of infection (how one piece of information is propagated); iRank tries to find the blogs which initiate the epidemics. Note that an initiator might not be influential, as it might affect only a limited number of blogs. Influentials should be those which play a key role in the information epidemics. Given the nature of the blogosphere, influential blog sites are few with respect to the sheer number of blog sites. Non-influential sites belong to the long tail (Anderson, 2006), where abundant new business, marketing, and development opportunities can be explored. Gruhl et al. (Gruhl et al., 2004) study the information diffusion of various topics in the blogosphere from individual to individual, drawing on the theory of infectious diseases. A general cascade model (Goldenberg et al., 2001) is adopted. They derived their model from the independent cascade model and generalized it to the general cascade model by relaxing the independence assumption. They associate a 'read' probability and a 'copy' probability with each edge of the blogger graph, indicating the tendency to read one's blog post and to copy it, respectively. They also parameterize the stickiness of a topic, which is analogous to the virulence of a disease. An interesting problem related to viral marketing (Richardson and Domingos, 2002; Kempe et al., 2003) is how to maximize the total influence among the nodes (blog sites) by selecting a fixed number of nodes in the network. A greedy approach can be adopted to select the most influential node in each iteration after removing the selected nodes. This greedy approach outperforms PageRank, HITS and ranking by number of citations, and is robust in filtering splogs (spam blogs) (Java et al., 2006).
Agarwal et al. (Agarwal et al., 2007) studied and modeled the influence of a blogger on a community blog site. They modeled the blog site as a graph, using the inherent link structure, including inlinks and outlinks, as edges and treating different bloggers as nodes. Using the link structure, the influence flow across different bloggers is observed recursively. Other blog post level statistics, like blog post quality and comment information, were also used to achieve better results. The model used different weights to regulate the contribution of the different statistics, and these weights could be tuned to obtain different breeds of influential bloggers. Lists of top bloggers or top blog posts in some time frame (e.g., monthly) are usually based on some traffic information (e.g., how many posts a blogger posted, or how many comments a blog post received) (Gill, 2004).
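The weighted combination of post-level statistics described above can be sketched as follows. The particular weights, the linear scoring form, and the data are illustrative assumptions for exposition, not the exact model of Agarwal et al.

```python
# Toy influence score for a blog post: a weighted combination of inlinks,
# outlinks, comments, and post length; a blogger's influence is taken as the
# maximum over his or her posts. All weights and records are purely illustrative.
WEIGHTS = {"inlinks": 3.0, "comments": 2.0, "length": 0.01, "outlinks": -1.0}

def post_influence(post):
    return (WEIGHTS["inlinks"] * post["inlinks"]
            + WEIGHTS["comments"] * post["comments"]
            + WEIGHTS["length"] * post["length"]
            + WEIGHTS["outlinks"] * post["outlinks"])

def blogger_influence(posts):
    return max(post_influence(p) for p in posts)

bloggers = {
    "alice": [{"inlinks": 12, "comments": 30, "length": 800, "outlinks": 2}],
    "bob":   [{"inlinks": 1,  "comments": 2,  "length": 300, "outlinks": 9}],
}
ranked = sorted(bloggers, key=lambda b: blogger_influence(bloggers[b]), reverse=True)
print(ranked)  # ['alice', 'bob'] under these toy weights
```

Tuning the weight dictionary corresponds to the point made in the text: different weightings surface different "breeds" of influential bloggers.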
Table 2. Summarizing the research domains, representative approaches and challenges in social friendship networks
With the speedy growth of the blogosphere, it is increasingly difficult, if at all possible, to manually track the developments and happenings in the blogosphere, in particular at the many blog sites where many bloggers enthusiastically participate in discussions, getting information, inquiring and seeking answers, and voicing their complaints and needs. Since millions of people share their views and opinions on blogs, the blogosphere has become the perfect breeding ground for "participatory journalism". This has made the blogosphere a much more dynamic environment than traditional Web pages. An announcement of a new product by a company may trigger several discussions around the world. These temporal trends are often very useful for businesses to track and observe customer opinions. In one such study, Chi et al. (2006) use singular value decomposition to identify trends in the topics of blogs. They use higher-order singular value decomposition (HOSVD) to observe the structural trends in blog sites.
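A simplified, two-dimensional analogue of the SVD-based trend analysis mentioned above: the actual work applies higher-order SVD to a tensor, while this sketch only factorizes a term-by-time matrix with plain SVD, and the counts are made up.

```python
import numpy as np

# Rows are terms, columns are weekly counts of each term across blog posts.
terms = ["iphone", "battery", "election"]
counts = np.array([
    [1, 2, 8, 20, 35],   # "iphone" surges after an announcement
    [0, 1, 4, 10, 18],   # "battery" follows the same surge
    [9, 8, 9, 8, 9],     # "election" stays flat
], dtype=float)

U, S, Vt = np.linalg.svd(counts, full_matrices=False)
# The leading right singular vector is the dominant temporal trend; the leading
# left singular vector shows how strongly each term follows it (up to sign).
print("dominant trend over weeks:", np.round(Vt[0], 2))
for term, weight in zip(terms, np.round(U[:, 0], 2)):
    print(term, weight)
```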
Social Friendship Networks vis-a-vis Blogosphere

Social friendship networks and the blogosphere come under the umbrella of social networks, and they share some commonalities like social collaboration, a sense of community and experience sharing, yet there are some subtle differences. These nuances are worth pointing out here, as they shed light on different ongoing research activities. We summarize the research work on social friendship networks and the blogosphere in Tables 2 and 3, respectively. We list the research domains in social friendship networks and the blogosphere and some representative approaches, along with the challenges of each domain.
Table 3. Summarizing the research domains, representative approaches and challenges in the blogosphere
Clearly, there are some research areas that are specific to social friendship networks, like collaborative recommendation and trust and reputation, because they assume an explicit graph structure in the interactions among the different members of the network. Similarly, some research areas are specific to the blogosphere, like blog post/blog site classification and spam blog identification, because of the highly textual nature of these articles. Unlike social friendship networks, the blogosphere does not have explicit links or edges between the nodes. These nodes could be friends in a social friendship network and bloggers in the blogosphere. We could still construct a graph structure in the blogosphere by assuming an edge from one blogger to another if a blogger has commented on another blogger's blog post. This way we can represent the blogosphere with an equivalent directed graph. Social friendship networks already have predefined links or edges between the members in the form of a FOAF network [30], an undirected graph. This link/edge inference also poses major challenges in the majority of research efforts in the blogosphere, like identifying communities and influential bloggers. We list these and other challenges specific to various research domains in the blogosphere in Table 3. Although Adar et al. (Adar et al., 2004) proposed a model to infer links between different blog posts based on the propagation of content in the blog posts, the applicability of such techniques is limited if little information epidemic activity is found.

Another significant difference between social friendship networks and the blogosphere lies in the way influential members are perceived. Bloggers submit blog posts, which are the main source of their influence. An influence score could be computed from blog posts through several measures like inlinks, outlinks, comments, and blog post length. This could give us an actually influential node based on the historical data of who influenced whom. Members of a social friendship network, in contrast, do not have such a medium through which they can assert their influence. The link information available on a social friendship network and other network centrality measures will just tell us the connectedness of a node, which could be used to gauge the spread of influence rather than to identify the influential node itself. Hence, there are works (Coffman and Marcus, 2004; Kempe et al., 2003; Richardson and Domingos, 2002) that measure the spread of influence through a node in social friendship networks. A node that maximizes the spread of influence or that has a higher degree of connectivity is chosen for viral marketing. It is entirely possible that this node is connected to a lot of people but may not be the one who could influence other members.
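The comment-based link inference mentioned above (an edge from the commenter to the author of the post being commented on) can be sketched as follows; the record format and names are made-up examples.

```python
from collections import defaultdict

# Hypothetical comment records: (commenter, author_of_post_commented_on).
comments = [
    ("bob", "alice"), ("carol", "alice"), ("alice", "bob"), ("dave", "alice"),
]

# Build a directed blogger graph: edges point from commenter to post author.
out_edges = defaultdict(set)
in_degree = defaultdict(int)
for commenter, author in comments:
    if author not in out_edges[commenter]:
        out_edges[commenter].add(author)
        in_degree[author] += 1

# In-degree here is a rough proxy for how many distinct bloggers engage with
# an author, one of several signals that could feed an influence measure.
print(sorted(in_degree.items(), key=lambda kv: kv[1], reverse=True))
```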
Table 4. Differences between social friendship networks and blogosphere.
Bloggers spread their influence through blog posts. This information source could be tapped to compute the influence of a blogger using several measures like inlinks, outlinks, comments, and blog post length. These differences are summarized in Table 4. In a broader sense, influential nodes identified through the blogosphere are the ones who have "been influencing" fellow bloggers, whereas influential nodes identified through social friendship networks are the ones who "could influence" fellow members. The reason is simple: in the blogosphere we have the history of who influenced whom through their blog posts, but in social friendship networks we do not have such information. We only know who is linked to whom, and the most linked member could be used to spread the influence; but we do not know whether that member is the right person for the job. There have been works like Java et al. (2006), where the authors model the blogosphere as a social friendship network and then apply existing work on mining influence in social friendship networks, but they lose essential statistics about the blog posts like inlinks, outlinks, comments, blog post quality, etc. A graph structure is strictly defined in social friendship networks, whereas it is loosely defined in the blogosphere. Nodes are members or actors in a social friendship network, whereas in the blogosphere they could be bloggers, blog posts or blog sites. Social friendship networks are predominantly used for keeping in touch or making friends in society, while the main purpose of the blogosphere is to share ideas and opinions with other members of the community or other bloggers. This gives the blogosphere more of a community experience, as compared to the more friendship-oriented environment of social friendship networks.
Figure 1. Social Friendship Networks and Blogosphere constitute part of Social Networks.
We could observe person-to-community interactions in the blogosphere, whereas one would observe person-to-person interactions in social friendship networks. Another significant difference between the blogosphere and social friendship networks is in estimating the reputation/trust of members. A member's reputation/trust in the blogosphere is based on the responses to other members' knowledge solicitations. A member's reputation/trust in a social friendship network is based on the network connections and/or the member's location in the network. We illustrate the differences between social friendship networks and the blogosphere in Figure 1. The complete set of social interacting services can be represented by social networks, of which social friendship networks and the blogosphere are focal parts. There are certain social friendship networking Web sites, like Orkut, Facebook, LinkedIn [31] and Classmates.com, that strictly provide friendship networks. People join these networks to expand their social networks and keep in touch with colleagues. However, these Web sites enforce strict interaction patterns among friends and do not support flexible community structures as supported by blog sites. On the other hand, blog sites like TUAW, Blogger [32] and Windows Live Spaces [33] allow members to express themselves and share ideas and opinions with other community members. However, these Web sites do not facilitate private friendship networks; two members have to use other communication channels to communicate privately. Web sites like LiveJournal and MySpace provide both social friendship networks and blogging capabilities to their members. Clearly, based on the characteristics of social interactions, one could observe overlap between social friendship networks and the blogosphere, as depicted in Figure 1.
Looking Ahead

Humans have a gregarious tendency to form groups and communities for sharing experiences, thoughts, ideas, and opinions with others. Through these social interactions, individuals convene communities of like-minded people driven by similar interests and needs. Technologies like computers and the World Wide Web have made it possible for people to do so beyond geographical barriers, and facilitate and encourage more and more individuals to take part in these Web activities. The pervasive participation of increasing numbers of users has created a plethora of user-generated content, collective wisdom, and open-source intelligence. Not only do individuals generate new content, they also enrich the existing content by providing semantically meaningful labels or tags for easier discovery and retrieval of information and for better navigation over time. In this work we study and characterize these social interacting services offered under the umbrella of social networks, based on their underlying network structures, into four broad categories, viz., social friendship networks, the blogosphere, social and collaborative annotation (or folksonomies), and media sharing. We discuss each of them briefly and focus on social friendship networks and the blogosphere, which are the two most widely used social networks. The area of social networks has been extensively studied by researchers in different domains, including psychology, social science, anthropology, and information and computer science. Studied aspects of social networks include the formulation, evolution and dynamics of communities and groups in these social networks; emerging experts, leaders, bellwethers and influentials in these communities; information diffusion and epidemics in social networks; community clustering; recommending collaboration to individuals and organizations; mining user-generated content for collective wisdom and open-source intelligence; and other equally significant research opportunities.
We comparatively study state-of-the-art research initiatives in social friendship networks and the blogosphere, and delineate subtle yet essential differences between the two domains. Recent statistics from Technorati have shown that the size of social networks, especially the blogosphere, doubles every six months. Such enthusiastic involvement in social networks, and the sense of connectedness to one another among members, has accelerated the evolution of ubiquitous social networks and presented new opportunities for research and the development of new services. For example, researchers experiment with and analyze the feasibility of the current state of social networks on mobile devices. There is huge potential in designing more intuitive and efficient interfaces to social networking services for constrained screen real estate. Social interaction models could also be used in other domains, like computer networks and physical systems of interacting particles, for modeling the communication or interaction within the network. This can help in analyzing information flow, crucial or expert nodes, and trustworthy nodes. Potential research opportunities for social networks also lie in improving search engines. Social annotation systems facilitate the manual tagging of various resources, including Web pages, images and videos, enriching the content by adding metadata to these resources. This would ultimately help search engines to index these resources better and consequently provide more precise search results. Besides using human-labeled tags, we could also leverage human intelligence in providing relevance feedback for search results. Such collaborative search engines rely on social networks to enhance search results for idiosyncratic user queries. Identifying influential bloggers can expand the ways of group representation, such as using these influential bloggers to profile a group. Since these influential bloggers could be treated as representative members of a community's blog site, their blog posts could be considered an entry point to other representative posts of the blog site. Intelligent processing of such group profiles could help search engines not only to search blog posts but also to enable group search, resulting in the promotion of group collaboration. By incorporating social network information coupled with demographic information, social network research could also help in customer relationship management, to provide compact and reliable recommendations. Community evolution studies can serve as an effective avenue for modern defense and homeland security (e.g., in subject-based data mining).
References

Adar, E., Zhang, L., Adamic, L., and Lukose, R. (2004). Implicit structure and the dynamics of blogspace. In Proceedings of the 13th International World Wide Web Conference.

Agarwal, N., Liu, H., Tang, L., and Yu, P. S. (2008). Identifying the influential bloggers in a community. In Proceedings of the 1st International Conference on Web Search and Data Mining (WSDM08), Stanford, California.

Aleman-Meza, B., Nagarajan, M., Ramakrishnan, C., Ding, L., Kolari, P., Sheth, A. P., Arpinar, I. B., Joshi, A., and Finin, T. (2006). Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 407–416, New York, NY, USA. ACM Press.

Anderson, C. (2006). The long tail: why the future of business is selling less of more. New York: Hyperion.
Backstrom, L., Huttenlocher, D., Kleinberg, J., and Lan, X. (2006). Group formation in large social networks: membership, growth, and evolution. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 44–54, New York, NY, USA. ACM Press.

Bekkerman, R. and McCallum, A. (2005). Disambiguating Web appearances of people in a social network. In WWW '05: Proceedings of the 14th international conference on World Wide Web, pages 463–470, New York, NY, USA. ACM Press.

Blanchard, A. (2004). Blogs as Virtual Communities: Identifying a sense of community in the Julie/Julia project. Into the Blogosphere: Rhetoric, Community and Culture. Retrieved from http://blog.lib.umn.edu/blogosphere.

Blanchard, A. and Markus, M. (2004). The experienced sense of a virtual community: Characteristics and processes. The DATA BASE for Advances in Information Systems, 35(1).

Bonhard, P., Harries, C., McCarthy, J., and Sasse, M. A. (2006). Accounting for taste: using profile similarity to improve recommender systems. In CHI '06: Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 1057–1066, New York, NY, USA. ACM Press.

Brooks, C. H. and Montanez, N. (2006). Improved annotation of the blogosphere via autotagging and hierarchical clustering. In WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 625–632, New York, NY, USA. ACM Press.

Cai, D., Shao, Z., He, X., Yan, X., and Han, J. (2005). Mining hidden community in heterogeneous social networks. In LinkKDD '05: Proceedings of the 3rd international workshop on Link discovery, pages 58–65, New York, NY, USA. ACM Press.

Chi, Y., Tseng, B. L., and Tatemura, J. (2006). Eigen-trend: trend analysis in the blogosphere based on singular value decompositions. In CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management, pages 68–77, New York, NY, USA. ACM Press.

Chin, A. and Chignell, M. (2006). A social hypertext model for finding community in blogs. In HYPERTEXT '06: Proceedings of the seventeenth conference on Hypertext and hypermedia, pages 11–22, New York, NY, USA. ACM Press.

Coffman, T. and Marcus, S. (2004). Dynamic classification of groups through social network analysis and HMMs. In Proceedings of the IEEE Aerospace Conference.

Domingos, P. and Richardson, M. (2001). Mining the network value of customers. In KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 57–66, New York, NY, USA. ACM Press.

Drezner, D. and Farrell, H. (2004). The power and politics of blogs. In American Political Science Association Annual Conference.

Durrett, R. (1988). Lecture Notes on Particle Systems and Percolation. Wadsworth Publishing.

Efimova, L. and Hendrick, S. (2005). In search for a virtual settlement: An exploration of Weblog community boundaries.
Elkin, T. Just an online minute... online forecast. Retrieved from http://publications.mediapost.com/index.cfm?fuseaction=Articles.showArticle&art_aid=29803.

Flake, G., Lawrence, S., Giles, C. L., and Coetzee, F. (2002). Self-organization and identification of Web communities. IEEE Computer, 35(3).

Flake, G. W., Lawrence, S., and Giles, C. L. (2000). Efficient identification of Web communities. In 6th International Conference on Knowledge Discovery and Data Mining.

Flake, G. W., Tarjan, R. E., and Tsioutsiouliklis, K. (2004). Graph clustering and minimum cut trees. Internet Mathematics, 1.

Gibson, D., Kleinberg, J., and Raghavan, P. (1998). Inferring Web communities from link topology. In 9th ACM Conference on Hypertext and Hypermedia.

Gill, K. E. (2004). How can we measure the influence of the blogosphere? In WWW '04: Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics.

Gillmor, D. (2006). We the Media: Grassroots Journalism by the People, for the People. O'Reilly.

Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. In Proceedings of the National Academy of Sciences.

Golbeck, J. and Hendler, J. (2006). Inferring binary trust relationships in Web-based social networks. ACM Trans. Inter. Tech., 6(4):497–529.

Goldenberg, J., Libai, B., and Muller, E. (2001). Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12, 211–223.

Granovetter, M. (1978). Threshold models of collective behavior. The American Journal of Sociology, 83, 1420–1443.

Gruhl, D., Liben-Nowell, D., Guha, R., and Tomkins, A. (2004). Information diffusion through blogspace. SIGKDD Explor. Newsl., 6(2):43–52.

Guha, R., Kumar, R., Raghavan, P., and Tomkins, A. (2004). Propagation of trust and distrust. In WWW '04: Proceedings of the 13th international conference on World Wide Web, pages 403–412, New York, NY, USA. ACM Press.

Hopcroft, J., Khan, O., Kulis, B., and Selman, B. (2003). Natural communities in large linked networks. In 9th Intl. Conf. on Knowledge Discovery and Data Mining.

Java, A., Kolari, P., Finin, T., and Oates, T. (2006). Modeling the spread of influence on the blogosphere. In Proceedings of the 15th International World Wide Web Conference.

Keller, E. and Berry, J. (2003). One American in ten tells the other nine how to vote, where to eat, and what to buy. They are The Influentials. The Free Press.

Kempe, D., Kleinberg, J., and Tardos, E. (2003). Maximizing the spread of influence through a social network. In Proceedings of KDD, pages 137–146, New York, NY, USA. ACM Press.
Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. In 9th ACM-SIAM Symposium on Discrete Algorithms.

Kritikopoulos, A., Sideri, M., and Varlamis, I. (2006). BlogRank: ranking Weblogs based on connectivity and similarity features. In AAA-IDEA '06: Proceedings of the 2nd international workshop on Advanced architectures and algorithms for internet delivery and applications, page 8, New York, NY, USA. ACM Press.

Kumar, R., Novak, J., Raghavan, P., and Tomkins, A. (2003). On the bursty evolution of blogspace. In WWW '03: Proceedings of the 12th international conference on World Wide Web, pages 568–576, New York, NY, USA. ACM Press.

Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. (1999). Trawling the Web for emerging cyber communities. In The 8th International World Wide Web Conference.

Li, B., Xu, S., and Zhang, J. (2007). Enhancing clustering blog documents by utilizing author/reader comments. In ACM-SE 45: Proceedings of the 45th annual southeast regional conference, pages 94–99, New York, NY, USA. ACM Press.

Liggett, T. (1985). Interacting Particle Systems. Springer.

McDonald, D. W. (2003). Recommending collaboration with social networks: a comparative evaluation. In CHI '03: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 593–600, New York, NY, USA. ACM Press.

McNee, S. M., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S. K., Rashid, A. M., Konstan, J. A., and Riedl, J. (2002). On the recommending of citations for research papers. In CSCW '02: Proceedings of the 2002 ACM conference on Computer supported cooperative work, pages 116–125, New York, NY, USA. ACM Press.

Newman, M. E. J. (2004). Detecting community structure in networks. European Physical Journal B, 38:321–330.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project.

Pujol, J. M., Sangesa, R., and Delgado, J. (2002). Extracting reputation in multi agent systems by means of social network topology. In AAMAS '02: Proceedings of the first international joint conference on Autonomous agents and multiagent systems, pages 467–474, New York, NY, USA. ACM Press.

Richardson, M. and Domingos, P. (2002). Mining knowledge-sharing sites for viral marketing. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 61–70, New York, NY, USA. ACM Press.

Sabater, J. and Sierra, C. (2002). Reputation and social network analysis in multi-agent systems. In AAMAS '02: Proceedings of the first international joint conference on Autonomous agents and multiagent systems, pages 475–482, New York, NY, USA. ACM Press.

Schelling, T. (1978). Micromotives and Macrobehavior. Norton.
Subramani, M. R. and Rajagopalan, B. (2003). Knowledge-sharing and influence in online social networks via viral marketing. Commun. ACM, 46(12):300–307. Terveen, L. and McDonald, D. W. (2005). Social matching: A framework and research agenda. ACM Trans. Comput.-Hum. Interact., 12(3):401–434. Yu, B. and Singh, M. P. (2003). Detecting deception in reputation management. In AAMAS ’03: Proceedings of the second international joint conference on Autonomous agents and multiagent systems, pages 73–80, New York, NY, USA. ACM Press.
Key Terms Blog: The term “blog” is derived from the word “Web-log”, which means a Web site that displays in reverse chronological order the entries by one or more individuals and usually has links to comments on specific postings. Blog Post: Web entries that are published on a blog site are called blog posts. Blogosphere: A special class of social networks that exhibit a flexible graph structure among members of the network, supporting public discussion and interaction among community members. These are person-to-group interaction structures. There is no concept of private interaction. These social networks are predominantly used for sharing opinions and ideas with a community rather than a single individual. It is also defined as the universe of all blog sites. Folksonomy: It is a collaboratively generated taxonomic structure of Web pages, media like hyperlinks, images and movies using open-ended labels called tags. Folksonomies make information increasingly easy to search, discover and navigate over time. The descriptive content of such a tagging process is considered better than automatic tagging because of the “collective wisdom” and better context handling capabilities of humans as compared to computing algorithms. Social Network: An association of entities like people, organizations drawn together by one or more specific types of relations, such as friendship, kinship, like or dislike, financial exchange, etc. Such a social structure is often modeled using graphs, where members or actors of social networks act as the nodes and their interactions or relationships form the edges. Social networks encompass interactions between different people, members of a community or members across different communities. Social Network Analysis: An interdisciplinary study that involves sociology, anthropology, psychology, computer and information science to analyze complex relationships and model them in a social network. Social “Friendship” Network: A special class of social networks that enforces a more strict graph structure among members of the network, supporting more private social networks for each member. These are person-to-person(s) interaction structures. Examples include Orkut, Facebook, LinkedIn, Classmates.com.
Endnotes
1 http://www.sixdegrees.org/
2 http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-Web-20.html
3 http://www.foaf-project.org/
4 http://www.w3.org/RDF/
5 http://www.oakland.edu/enp/index.html
6 http://www.flickr.com/
7 http://www.youtube.com/
8 http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-Web-20.html
9 http://www.flock.com/
10 http://www.alexa.com/
11 http://www.orkut.com/
12 http://www.facebook.com/
13 http://www.myspace.com/
14 http://www.livejournal.com/
15 http://www.sifry.com/alerts/
16 http://www.ratcliffeblog.com/
17 http://webquarters.blogspot.com/
18 http://googleblog.blogspot.com/
19 http://www.tuaw.com/
20 http://www.engadget.com/
21 http://boingboing.net/
22 http://www.netflix.com/
23 http://www.cs.cmu.edu/~enron/
24 http://ai.arizona.edu/research/terror/index.htm
25 http://www.kaboodle.com/
26 http://www.technorati.com/
27 http://www.blogcatalog.com/directory
28 http://weblogs.macromedia.com/
29 http://www.sifry.com/alerts/archives/000436.html
30 Friends of a friend
31 http://www.linkedin.com/
32 http://www2.blogger.com/home
33 http://spaces.live.com/
Chapter XXXVII
An HL7-Aware Decision Support System for E-Health
Pasquale De Meo, Università degli Studi Mediterranea di Reggio Calabria, Italy
Giovanni Quattrone, Università degli Studi Mediterranea di Reggio Calabria, Italy
Domenico Ursino, Università degli Studi Mediterranea di Reggio Calabria, Italy
ABSTRACT In this chapter we present an information system conceived for supporting managers of Public Health Care Agencies in deciding which new health care services to propose. Our system is HL7-aware: it uses the HL7 (Health Level Seven) standard (Health Level Seven [HL7], 2007) to effectively handle the interoperability among different Public Health Care Agencies. HL7 provides several functionalities for the exchange, the management and the integration of data concerning both patients and health care services. Our system appears particularly suited for supporting a rigorous and scientific decision making activity, taking a large variety of factors and a great amount of heterogeneous information into account.
BACKGROUND In recent years, health care expenditure has risen significantly. As an example, a report of the Organisation for Economic Cooperation and Development shows that, in the last years, US health expenditure grew 2.3 times faster than Gross Domestic Product (GDP), rising from 13% of GDP in 1997 to 14.6% in 2002 (Organisation for Economic Cooperation and Development [OECD], 2005); an analogous trend was observed in Western Europe, where health spending outpaced economic growth by 1.7 times on average.
Despite this large amount of funding, Public Health Care Agencies (hereafter PHCAs) have partially failed to keep pace with rapid changes in present societies. As an example, in the last decades, in many developed countries, both population size and the corresponding age distribution underwent relevant changes due to low fertility rates and the continuous increase of life expectancy. Population ageing has deep social and budgetary effects that compel PHCA managers to carefully plan resource and service allocation. Disharmonious and unbalanced funding policies might imply redundancies/duplications of some services and lacks/deficiencies of other ones. As a consequence, patients might perceive a low quality in supplied services and unacceptable disparities in their delivery. In this scenario, the real challenge consists of the efficient usage of available resources for providing patients with real benefits and for supporting health care managers to modernize PHCAs in such a way as to meet the expectations of a broad audience of users (e.g., patients, physicians). From a patient standpoint, this implies the possibility: (i) to access a comprehensive range of services through a friendly and uniform infrastructure; (ii) to access services tailored around his needs and preferences. From a PHCA manager point of view, this implies the possibility: (i) to successfully face severe financial constraints aiming at reducing the pressure of health care expenditure on the public budget; (ii) to provide services for all in such a way as to avoid discrimination on the grounds of yearly income and social status. A large body of evidence shows that, in order to effectively plan resource allocation in a PHCA, it is necessary to manage a large variety of socioeconomic and lifestyle data. This information is the core component of any decision support system operating at national, regional and local level (OECD, 2005). A further problem to face concerns the representation and the storage of health-related data. Generally, medical information systems store their data in proprietary formats; this implies the lack of interoperability and coordination among independent PHCAs and, ultimately, undermines their effectiveness (Eichelberg, Aden, Riesmeier, Dogac, and Laleci 2005). In the past, various approaches have been proposed for supporting PHCA managers in their decision making activities; many of them are agent-based. As an example:
• In (Flessa, 2000) a system aiming at allocating health care resources for reducing morbidity/mortality in a community is presented. This system receives a budget constraint and solves a suitable linear programming problem to decide the services to be delivered for reducing the number of deaths. Community health conditions are described by means of several parameters.
• In (Homer, and Milstein 2004) a decision support system operating in the health sector is proposed. This system is based on a graphical model representing causal relationships existing between social and health conditions; this model can be used by an expert to determine the community's internal capability of addressing its health and social problems and to assess the effectiveness of outside assistance. In order to perform its activity, the system of (Homer, and Milstein 2004) defines and solves a suitable linear programming problem.
• In (Walczak, Pofahl, and Scorpio 2003) a system supporting a PHCA manager to make his decisions about the bed allocation in a hospital is described. For each patient, this system examines the corresponding record and uses a neural network to estimate his Length of Stay (LOS), i.e., the expected number of days he should spend in the hospital. LOS knowledge makes resource planning activities easier and, ultimately, improves patient care. This system is capable of producing accurate results with a relative paucity of medical information about patients; however, due to the necessity of estimating LOS, it is strictly dependent on the application context.
• In (Harper, and Shahani 2003) a system capable of modelling the progress of the AIDS epidemic in India is presented. This system uses a statistical model to describe the transition of a patient from one state of HIV infection to the next one and to estimate the yearly growth of the population affected by HIV. This allows HIV-related costs to be precisely computed and supports PHCA managers to make their decisions.
• In (Berndt, Hevner, and Studnicki 2003) the authors present CATCH, a system accessing multiple data sources to generate a set of social and health indicators allowing the identification of the health care goals of a community. In order to carry out its activities, CATCH applies OLAP operators on a suitable Data Warehouse.
MAIN THRUST OF THE CHAPTER Overview This chapter focuses on the research problems outlined previously and aims at providing a contribution in this setting; in fact, it proposes an information system capable of supporting PHCA managers to decide the new health care services to activate in such a way as to both comply with budget constraints and maximize patient satisfaction. Our system is HL7-aware; in fact, it uses HL7 to effectively handle the interoperability among different PHCAs. HL7 provides several functionalities for the exchange, the management and the integration of data concerning both patients and health care services. It is strictly related with XML since HL7-based documents can be easily coded in this language. Interestingly enough, it is a widely accepted standard in the marketplace; specifically, a large number of commercial products implement and support it, and several research projects adopt it as the reference standard for representing clinical documents (Eichelberg, Aden, Riesmeier, Dogac, and Laleci, 2005). Our system consists of five main components, namely: (i) a Public Health Care Agency Handler, that supports PHCA managers in their decisional processes; (ii) a Patient Handler, that supports patients to submit queries for detecting services of their interest; (iii) a Patient Query Handler, that processes queries submitted by patients; (iv) a Health Care Service Database, that stores and manages information about the various services provided by PHCAs; (v) a Patient Profile Database, that stores and handles information about patients. Given a pair ⟨Pi, Sj⟩, such that Pi is a patient and Sj is a health care service, our system computes two coefficients; the former states the benefits that Pi would gain from the activation of Sj; the latter denotes the costs that a PHCA should sustain to provide Pi with Sj. These coefficients are used to define a suitable linear programming problem whose solution allows a PHCA manager to determine the most promising services to activate in such a way as to comply with budget constraints and to maximize patient satisfaction. We have designed two ad-hoc algorithms that solve this linear programming problem taking into account the specificities of the reference context: the former is based on a Branch and Bound technique (Bertsimas, and Tsitsiklis 1997) whereas the latter relies on a Simulated Annealing framework (Russell, and Norvig 2002). Our system is characterized by the following, interesting, properties: (i) it allows the uniform management of potentially heterogeneous information about patients and health care services; such a feature is obtained by defining a precise HL7-based representation of patient and service characteristics; (ii) it can easily cooperate with other health care systems supporting the HL7 standard; (iii) its results strongly consider patient needs, preferences and characteristics, as well as service features; (iv) it is reactive, since a change in patient or service features influences the computation of benefits and costs and, consequently, its suggestions; (v) it is scalable, since it is possible to define new instances of the various handlers for facing the presence of new patients or PHCAs.
Figure 1. Architecture of our system
System Description The general architecture of our system is depicted in Figure 1. This figure shows that our system is characterized by three types of modules, namely: (i) a Public Health Care Agency Handler (hereafter, AH), that supports a PHCA manager in his decision making activities; (ii) a Patient Query Handler (hereafter, QH), that processes queries submitted by the various patients; (iii) a Patient Handler (hereafter, PH), that allows a patient to submit queries to search the list of services appearing the closest to his profile and exigencies, as well as to update his profile; PH has been thought as a module to be directly accessed by the patient; however, in case this last is not provided with the technology and/or the know how necessary to exploit it, it is possible to think that the Public Health Care System is provided with counters where patients are assisted by suitable human operators.
Our system is also provided with a Health Care Service Database (hereafter SD), storing and managing information about the various services provided by PHCAs, and with a Patient Profile Database (hereafter PD), storing and handling information about patient profiles. The behaviour of our system can be analyzed from both the PHCA managers and the patients points of view. Specifically, our system can assist a PHCA manager to verify what services (among a set of new ones he has in mind to propose) appear to be the most beneficial for patients, according to some budget constraints. For each of these services it specifies also the set of patients who might obtain the highest benefits from its activation. In order to carry out such a task, AH loads from PD the set of available Patient Profiles. After this, for each pair , such that Pi is a patient and Sj is a service to be examined, it computes a pair , where bij is a real value representing the benefit that Pi would gain from the activation of Sj, and cij is a real value denoting the cost the PHCA should pay to provide Pi with Sj; the computation of bij and cij takes the profiles of Pi and Sj into account. Finally, after all benefits and costs have been determined, AH constructs a suitable linear programming problem. In addition, our system can support a patient Pi to determine the most interesting services for him. Specifically, each time Pi wants to know if there exist new services interesting for him, he activates the corresponding Patient Handler PHi and submits a query Qik. PHi loads the profile PPi of Pi from PD and forwards the pair to QH that processes Qik and retrieves from SD the list SLik of services best matching Qik and PPi. After this, QH sends SLik to PHi that proposes it to Pi. Finally, our system supports a patient Pi to update his profile. This activity is carried out as follows: Pi activates PHi that presents him a graphical interface visualizing his current profile and allowing him to modify it in a friendly and guided fashion. Once all updates have been performed, PHi stores the updated profile in PD. Observe that this asynchronous and partially obtrusive way to update patient profiles is acceptable in our reference context since profiles change quite infrequently over time and since (differently from other application contexts, such as e-commerce and e-learning) errors occurring in profile update could lead to extreme dangers. In the following sub-sections we provide a detailed description of the various components of our system.
Patient Profile Database (PD) PD stores the profiles of the patients accessing our system. A Patient Profile PPi, associated with a patient Pi, consists of a triplet , where: • •
DSi stores both the demographic and the social data of Pi. MPi = {Ki1, Ki2, … , Kim} represents the Medical Profile of Pi; it consists of a set of keywords denoting both his pathologies and his possible needs in health care domain. It is worth pointing out that MPi is much more than a simple clinical document recording patient diseases. In fact, due to its mission, our system should not restrict itself to record patient diseases but it should offer tools for enlarging the number and the variety of services delivered to patients, as well as for raising both the quality and the effectiveness of the Public Health Care System. These goals can be achieved by managing patient information at a broader level than that guaranteed by a simple patient pathology list. This choice agrees with the ideas outlined in the literature (see, for example, (Vasilyeva,
Pechenizkiy, and Puuronen 2005)), where it is pointed out that, besides medical information, the profile of a patient should include, for example, his tasks and goals, his cognitive and psychological specificities, and so on. PCSi = {PCi1, PCi2, …, PCil} is a set of constraints associated with Pi. A constraint PCih is represented by a pair , where PCNameih denotes its name and PCValuesih indicates the corresponding admissible values. The exploitation of constraints allows a wide range of health-related episodes to be correctly managed. As an example, the constraint , associated with Pi, specifies that penicillin is an allergen for him. The relevance of this constraint is clear: it might prevent Pi from undergoing a clinical treatment involving penicillin.
DSi is coded according to the rules specified for representing the component “Record Target” of the Header of Clinical Document Architecture (called CDA), release 2. CDA is the HL7-based standard for representing clinical documents. The fields of “Record Target” taken into account in DSi are: “Id”, “Name”, “Addr”, “Telecom”, “Administrative Gender Code”, “Birth Time”, “Marital Status” and “Living Arrangement”. According to HL7 philosophy, whenever necessary, we have defined and exploited other fields not already defined in HL7 standard (think, for example, to the field “patient income”). The structures of MPi and PCSi conform to the rules specified to represent the component “entry” of the various sections of the Body of CDA, release 2. These rules specify that section “entries” are represented with the support of specific dictionaries. CDA sections of interest for MPi are “Problem list”, “History of past illness”, “History of medication use”, “History of present illness”, “Family history”, “Social history”, “Immunization”, “Past surgical history”. CDA sections of interest for PCSi are: “Problem list”, “History of allergies”, “Past surgical history”, “Family history”, “Social history”, “Immunization”, “History of past illness”, “History of medication use”, “History of present illness”. All these sections refer to the LOINC (Logical Observation Identifiers Names and Codes [LOINC], 2007) dictionary for their definition and to the SNOMED CT (Systematized NOmenclature of MEDicine Clinical Terms [SNOMED-CT], 2007) dictionary for their content; however, if necessary, HL7 allows other dictionaries to be exploited. According to HL7 suggestions, for those elements of MPi and PCSi that cannot be coded by means of the already defined sections and the already existing dictionaries, we have defined new specific sections and a new specific dictionary to be exploited in our system. PD is constructed and handled by means of the XML technology.
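To make the XML encoding of PD more concrete, the following Python sketch serializes a patient profile ⟨DSi, MPi, PCSi⟩ as a CDA-flavoured XML fragment. The element names loosely follow the "Record Target" header fields and the coded "entry" idea described above, but the fragment is purely illustrative and is not a schema-valid CDA document; the helper name, the custom patientIncome field placement and the sample data are assumptions.

```python
import xml.etree.ElementTree as ET

def patient_profile_to_xml(ds, mp, pcs):
    """Serialize a patient profile <DSi, MPi, PCSi> as a CDA-flavoured XML fragment.

    ds:  dict of demographic/social fields (named after the CDA "Record Target" fields)
    mp:  list of keywords forming the Medical Profile MPi
    pcs: list of (name, values) constraint pairs forming PCSi
    """
    record = ET.Element("recordTarget")            # CDA header component holding patient data
    role = ET.SubElement(record, "patientRole")
    for tag in ("id", "addr", "telecom"):
        if tag in ds:
            ET.SubElement(role, tag).text = str(ds[tag])
    patient = ET.SubElement(role, "patient")
    for tag in ("name", "administrativeGenderCode", "birthTime",
                "maritalStatusCode", "livingArrangement", "patientIncome"):
        if tag in ds:                              # patientIncome: the custom field mentioned in the text
            ET.SubElement(patient, tag).text = str(ds[tag])
    body = ET.SubElement(record, "medicalProfile")
    for keyword in mp:                             # one coded entry per keyword of MPi
        ET.SubElement(body, "entry", code=keyword)
    constraints = ET.SubElement(record, "constraints")
    for name, values in pcs:                       # one entry per constraint of PCSi
        ET.SubElement(constraints, "entry", name=name, values=";".join(values))
    return ET.tostring(record, encoding="unicode")

print(patient_profile_to_xml(
    {"id": "P001", "name": "Mario Rossi", "administrativeGenderCode": "M",
     "birthTime": "1950-03-12", "patientIncome": "18000"},
    ["diabetes", "hypertension"],
    [("Allergy", ["penicillin"])]))
```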
Health Care Service Database (SD) SD stores information about the services delivered by PHCAs. The profile SPj of a health care service Sj consists of a tuple , where: • •
SIdj, SNamej, SDescrj represent the identifier, the name and the description of Sj. SFSj = {Fj1, Fj2, … , Fjp} represents a set of features associated with Sj. Each feature of SFSj is a keyword describing a specificity of Sj (e.g., an illness faced by it). As an example, consider the service “chest radiography”; a possible set of features describing it is “pneumonia”, “heart failure”, “emphysema”, “lung cancer”, “check”, “radiography”. SCSj = {SCj1, SCj2, … , SCjt} is a set of constraints associated with Sj. A constraint SCjl consists of a pair , where SCNamejl represents its name and SCValuesjl indicates
the corresponding admissible values. A constraint could represent the requisites a user must have for accessing a certain service; another constraint could define the activation or the expiration date of a service, and so on. For instance, the constraint , associated with the service “Coronary Angiography”, might specify that if a patient has coronary artery diseases then he is eligible to receive a Coronary Angiography test for free. SFSj and SCSj are coded according to the rules specified to represent the component “entry” of the various sections of the Body of CDA, release 2. Reference sections and dictionaries are the same as those considered for MPi and PCSi. For those features and constraints that cannot be coded by means of the already defined sections and the already existing dictionaries, we have defined new specific sections and we have used the same new specific dictionary adopted for MPi and PCSi. Analogously to PD, also SD is constructed and managed by means of the XML technology.
Patient Handler (PH) A Patient Handler PHi is associated with a patient Pi. Its support data structure stores the profile PPi of Pi. PHi can be activated by Pi when he wants to know if there exist new services interesting for him and, in the affirmative case, for accessing them. First it retrieves PPi from PD and stores it in its support data structure. Then, it allows Pi to submit a query Qik specifying the service type that he presently desires. After this, it sends the pair to QH that processes Qik, constructs the list SLik of services possibly satisfying Pi and sends it to PHi; PHi proposes SLik to Pi who can select the most relevant services, according to his needs. PHi can be activated by Pi also when he wants to update his profile PPi. In this case, PHi retrieves the current PPi from PD and shows it to Pi by means of a graphical interface allowing Pi to modify his profile in a friendly and guided fashion. At the end of this activity, PHi stores the updated PPi into PD.
Patient Query Handler (QH) The Patient Query Handler QH is in charge of answering queries submitted by patients. Its support data structure consists of a list of services whose description is analogous to the service description that we have considered for SD. QH is called by a Patient Handler PHi each time a patient Pi wants to access available services. Its input consists of a pair . It constructs the list SLik of services best matching Qik and PPi; to this purpose it initially applies Information Retrieval techniques (Baeza-Yates, and Ribeiro-Neto 1999) to select from SD the list SLiktemp of services best matching Qik and being compatible with both DSi and PCSi. Then, for each service Sj ∈ SLiktemp, it computes a score stating its relevance for Pi; score computation takes both SFSj and MPi into account. After this, it constructs SLik by selecting from SLiktemp those services having the highest score, and sends it to PHi. Since the focus of this chapter is on the support of decision making activity, due to space constraints, we do not report here a detailed description of the behaviour of QH. However, we point out that its activities are analogous to the corresponding ones described in (De Meo, Quattrone, Terracina, and Ursino 2005), even if referred to another application scenario.
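The following sketch illustrates the kind of filter-and-score step QH performs. It is a deliberately simplified stand-in (plain keyword overlap) for the Information Retrieval techniques the chapter refers to; the dictionary-based data layout, the function name and the top_k default are assumptions.

```python
def answer_query(query_terms, patient_profile, services, top_k=5):
    """Sketch of QH: discard incompatible services, score the rest, return the best matches.

    patient_profile: dict with keys "MP" (set of keywords) and "PCS" (dict name -> allowed values)
    services: list of dicts with keys "name", "SFS" (set of features) and "SCS" (dict of constraints)
    """
    def compatible(service):
        # a service is discarded if one of its constraints clashes with a patient constraint
        for name, allowed in service["SCS"].items():
            if name in patient_profile["PCS"] and \
               not set(allowed) & set(patient_profile["PCS"][name]):
                return False
        return True

    def score(service):
        # keyword overlap between the service features and the query plus the medical profile
        target = set(query_terms) | patient_profile["MP"]
        return len(service["SFS"] & target) / (len(service["SFS"] | target) or 1)

    candidates = [s for s in services if compatible(s)]
    return sorted(candidates, key=score, reverse=True)[:top_k]
```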
Public Health Care Agency Handler (AH) A Public Health Care Agency Handler AH supports a PHCA manager in his decision making activity. Specifically, given that a PHCA manager has a budget B at his disposal to activate new services, and he has in mind a set of potentially useful services to propose, AH supports him in identifying, for each new service, the set of patients who could gain the maximum benefit from its activation. The knowledge of this information plays a key role in the manager's decision making process; in fact, if the set of patients who could gain a high benefit from a certain service is small, he might decide not to activate it and to re-direct the saved budget to other services. AH is provided with a suitable data structure storing: (i) a tuple for each service Sj the manager wants to examine; (ii) the maximum budget B available for the activation of all new services. First, for each pair ⟨Pi, Sj⟩, such that Pi is a patient and Sj is a service to examine, AH computes a real coefficient bij ∈ [0,1], denoting the benefit that Pi would gain from the activation of Sj, and a real coefficient cij ∈ [0,1], representing the cost that the PHCA should pay to provide Pi with Sj. The computation of bij is performed on the basis of the following reasoning:
• The more the set of features SFSj of Sj is similar to MPi, the more Sj can be considered helpful for Pi. In order to quantify the similarity between SFSj and MPi it is possible to adopt the Jaccard coefficient:

$\varphi(P_i, S_j) = \frac{|MP_i \cap SFS_j|}{|MP_i \cup SFS_j|}$

Observe that the values of φ belong to the real interval [0,1].
• The benefit of Sj for Pi depends also on his economic and demographic conditions. In fact, in real life, the same service might have a different impact on patients having different incomes. In order to formalize such an observation, it is possible to construct a function ψ that receives information about the profiles of Pi and Sj and returns a coefficient, in the real interval [0,1], specifying the impact of Sj on Pi. The definition of this function strictly depends on the welfare guidelines adopted by a government and, consequently, it might be different in the various countries and/or regions. A possible, quite general, formula is the following:
ψ (Pi, Sj) = KijD · KijI
where: (i) KijD is a coefficient, in the real interval [0,1], considering the disabilities/diseases of Pi and the relevance of Sj for facing them; (ii) KijI is a coefficient, in the real interval [0,1], depending on both the yearly income of Pi and the impact that the presence/absence of Sj might have on him. In our tests, in order to determine the tables stating the values of these coefficients, we have considered the Italian welfare system and we have required the support of a domain expert. Some of these tables are reported at the address http://www.ing.unirc.it/ursino/handbook/tables.html.
Clearly, a more complex definition of ψ, that considers also other parameters about Pi, as well as the role of Sj on each of them, might be taken into account. In addition, both the previous definition and other, more complex, ones could be adapted to other, possibly not Italian, contexts.
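A minimal Python sketch of how the two ingredients just introduced could be combined, anticipating the definition bij = φ(Pi, Sj) · ψ(Pi, Sj) given right below. The sample keyword sets and the KijD, KijI values are illustrative placeholders for the welfare tables mentioned above.

```python
def jaccard(mp, sfs):
    """phi(Pi, Sj): overlap between the patient's medical profile and the service features."""
    mp, sfs = set(mp), set(sfs)
    return len(mp & sfs) / len(mp | sfs) if mp | sfs else 0.0

def impact(k_d, k_i):
    """psi(Pi, Sj) = KijD * KijI; both coefficients are assumed to come from welfare tables."""
    return k_d * k_i

def benefit(mp, sfs, k_d, k_i):
    """bij = phi(Pi, Sj) * psi(Pi, Sj)."""
    return jaccard(mp, sfs) * impact(k_d, k_i)

# toy example with hypothetical keyword sets and table values
print(benefit(["pneumonia", "check"], ["pneumonia", "heart failure", "check"], k_d=0.8, k_i=0.6))
```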
On the basis of the previous observations, bij can be defined as a function of φ and ψ, i.e., bij = φ(Pi, Sj) · ψ(Pi, Sj). Let us now consider cij, i.e., the cost that a PHCA should pay to provide Pi with Sj. cij is strictly dependent on both the welfare system and the service type; as an example, in the Italian welfare system, in most cases, a patient must partially contribute to a provided service and, generally, his contribution is directly proportional to his yearly income. As a consequence, the cost a PHCA must pay to deliver a service is higher for patients having low incomes1. In our experiments we have considered the Italian welfare system and we have asked an expert to define tables specifying the cost of the services under examination for various income ranges. Some of these tables can be found at the address http://www.ing.unirc.it/ursino/handbook/tables.html. The decision making process must also consider the constraints that a service specifies for patients potentially benefiting from it (they are reported in the field SCSj of the corresponding service profile), as well as the constraints associated with the current condition of a patient (they are reported in the field PCSi of the corresponding patient profile). If we indicate by NewServSet the set of new services a PHCA manager is considering for a possible activation and by PSet the set of patients registered in PD, the activity that must be performed by AH can be formalized by means of the following integer linear programming problem:

1. $\max \sum_{i=1}^{|PSet|} \sum_{j=1}^{|NewServSet|} b_{ij} \cdot x_{ij}$
2. $\text{s.t.} \sum_{i=1}^{|PSet|} \sum_{j=1}^{|NewServSet|} c_{ij} \cdot x_{ij} \le B$
3. $x_{ij} = 0$ for each ⟨Pi, Sj⟩ such that i = 1 .. |PSet|, j = 1 .. |NewServSet|, γij = true
4. $x_{ij} \in \{0, 1\}$ for each ⟨Pi, Sj⟩ such that i = 1 .. |PSet|, j = 1 .. |NewServSet|
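For readers who want to experiment with this formulation (the symbols are explained just below) before turning to the chapter's own heuristics, the following sketch encodes equations (1)-(4) with the off-the-shelf PuLP modelling library and its bundled CBC solver. This is an assumption of convenience, not the authors' Branch and Bound or Simulated Annealing adaptations, and the toy benefit/cost values are hypothetical.

```python
# pip install pulp
import pulp

def select_services(benefit, cost, budget, forbidden=frozenset()):
    """Solve the 0/1 programme above: maximise total benefit subject to the budget.

    benefit, cost: dicts mapping (patient, service) pairs to bij and cij
    forbidden: pairs whose constraints clash (gamma_ij = true), forced to x_ij = 0
    """
    prob = pulp.LpProblem("service_activation", pulp.LpMaximize)
    x = {p: pulp.LpVariable(f"x_{p[0]}_{p[1]}", cat="Binary") for p in benefit}
    prob += pulp.lpSum(benefit[p] * x[p] for p in benefit)            # objective (1)
    prob += pulp.lpSum(cost[p] * x[p] for p in benefit) <= budget     # budget constraint (2)
    for p in forbidden:                                               # constraint (3)
        if p in x:
            prob += x[p] == 0
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {p for p in benefit if x[p].value() == 1}

pairs = {("P1", "S1"): 0.9, ("P1", "S2"): 0.4, ("P2", "S1"): 0.7}
costs = {("P1", "S1"): 0.5, ("P1", "S2"): 0.2, ("P2", "S1"): 0.6}
print(select_services(pairs, costs, budget=1.0))
```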
Here: (i) xij indicates if Sj appears to be interesting for Pi; in the affirmative case it is equal to 1; otherwise it is equal to 0; all variables xij can be grouped into a matrix; (ii) bij, cij and B have been previously defined; (iii) γij = true if the constraints in PCSi cannot be satisfied by Sj or the constraints in SCSj cannot be satisfied by Pi; otherwise, it is equal to false. Equations (2), (3) and (4) define the feasible region R of the programming problem; a matrix satisfying these equations is said feasible. It is possible to prove that the zero matrix, i.e., a matrix whose elements are all equal to 0, is a feasible solution for our programming problem. Integer linear programming problem is a NP-Hard problem; it has been extensively studied in the literature and a large variety of both exact and heuristic approaches for solving it have been already proposed. In the present research we have considered two very common heuristics, namely Branch and Bound (Bertsimas, and Tsitsiklis 1997) and Simulated Annealing (Russell, and Norvig 2002), and we have adapted them to our reference context. In the following subsections we provide a detailed description of these adaptations. A Branch and Bound Based Technique for Solving Our Programming Problem: The Branch and Bound based technique (hereafter BB) that we have defined for solving our programming problem
receives: (i) the integer linear programming problem ILPP to solve; (ii) a matrix CurrOptSol, storing the current optimal solution; (iii) a real parameter CurrObjFunc, storing the value of the objective function corresponding to CurrOptSol. It returns a pair representing the optimal solution of ILPP and the corresponding value of the objective function. Initially ILPP coincides with the original problem to solve, CurrOptSol coincides with the zero matrix and CurrObjFunc is equal to 0. First BB activates a function Solve that receives ILPP, relaxes it by disregarding the constraints of Equation (4) in such a way that variables can range in the real interval [0,1], and solves the relaxed ILPP by applying a suitable algorithm (e.g., the Karmarkar or the Ellipsoid algorithms (Spielman, and Teng 2004)). If the relaxed ILPP is unfeasible (i.e., if it does not admit any feasible solution), Solve returns the triplet . On the contrary, if the relaxed ILPP is feasible, Solve returns the triplet , where CandOptSol represents the optimal solution of the relaxed ILPP and CandObjFunc denotes the corresponding value of the objective function; since the relaxed ILPP has been obtained from ILPP by removing some constraints, it is possible to state that CandObjFunc is greater than or equal to CurrObjFunc. At this point, if the relaxed ILPP (and, consequently, ILPP) is unfeasible, then BB returns the pair . If the relaxed ILPP is feasible and all values of CandOptSol are integer, then BB returns the pair . Finally, if the relaxed ILPP is feasible but there exists at least one non-integer value in CandOptSol, then BB must search another solution having only integer values. In order to perform this activity, it carries out the following steps: • •
• It activates a function Bound on ILPP in such a way as to compute an upper bound UpperBoundObjFunc for its objective function (as follows).
• If UpperBoundObjFunc is greater than CurrObjFunc then it is necessary to re-compute the value of the objective function and to compare it with CurrObjFunc. To this purpose:
o BB activates a function Branch that partitions ILPP in a set ILPPSet of integer linear programming sub-problems (as follows).
o BB is recursively activated on each of these sub-problems and the corresponding optimal solution is computed. Let ILPPi be one of these sub-problems and let ⟨CandOptSoli, CandObjFunci⟩ be the result obtained by applying BB on it. If CandObjFunci is greater than CurrObjFunc then a better optimal solution has been found; as a consequence, CurrOptSol and CurrObjFunc are set equal to CandOptSoli and CandObjFunci, respectively.
• If UpperBoundObjFunc is equal to CurrObjFunc then it is possible to conclude that the current optimal solution of ILPP cannot be improved.
• At the end of these steps CurrOptSol represents the optimal solution of ILPP and CurrObjFunc denotes the corresponding value of the objective function; as a consequence, BB returns the pair ⟨CurrOptSol, CurrObjFunc⟩ and terminates.
Now, it is necessary to illustrate functions Branch and Bound in detail; their behaviour is strictly related to the characteristics of our integer linear programming problem. Branch splits the feasible region R of ILPP into a set of smaller sub-regions. This split is performed according to a branching rule. This rule is defined taking the characteristics of ILPP into account. Now,
the possible values of the variables associated with ILPP are 0 or 1. If we consider the optimal solution CandOptSol of the relaxed ILPP we can observe that some of its variables have been set equal to 0 or to 1; the other ones have been set to real values2. In this context a reasonable branching rule might decide to choose one of these last variables, say x’, and to split ILPP into two sub-problems ILPP0 and ILPP1, by adding to it the constraints x’=0 and x’=1, respectively. Bound receives ILPP and computes an upper bound UpperBoundObjFunc for its objective function. A Simulated Annealing Based Technique for Solving Our Programming Problem: Simulated Annealing (hereafter SA) is a popular technique to minimize a generic function defined over an arbitrary (and, generally, large) search space (Russell, and Norvig 2002). In the classical formulation, the parameter to minimize is called Energy; if we define a parameter Co-Energy = - Energy, SA can be formalized as a maximization problem and can be directly applied to our reference context; in this case Co-Energy is exactly our objective function. The SA based technique that we have defined for solving our programming problem receives: (i) an integer linear programming problem ILPP; (ii) a matrix CurrOptSol, storing the current optimal solution; (iii) a real parameter CurrCoEnergy, storing the value of the objective function corresponding to CurrOptSol; (iv) a parameter MaxIter, representing the maximum number of iterations that can be performed; (v) a parameter CoEnergyBound, stating the minimum value of Co-Energy that can be associated with an acceptable solution; (vi) a parameter Temperature, necessary to decide if the current solution can be further improved (as follows). Initially CurrOptSol coincides with the zero matrix and CurrCoEnergy is equal to 0. The core of SA is a loop; during each iteration of this loop SA calls the function GenerateCandidate to generate a candidate optimal solution CandOptSol (as follows). After this, it calls a function VerifyCandidate to verify if CandOptSol can improve the current optimal solution CurrOptSol (as follows); in the affirmative case CandOptSol becomes the new current optimal solution. SA terminates when a maximum number of iterations MaxIter has been performed or the Co-Energy associated with the current optimal solution is greater than or equal to CoEnergyBound. Now, it is necessary to illustrate the functions GenerateCandidate and VerifyCandidate in detail; their behaviour is strictly related to the characteristics of our programming problem. GenerateCandidate receives ILPP and its current optimal solution CurrOptSol and generates a candidate optimal solution CandOptSol. Initially, it sets CandOptSol equal to CurrOptSol; then, it constructs the set VS0 (resp., VS1) consisting of all variables of CandOptSol whose value is equal to 0 (resp., 1). After this, it calls a function ExtractMaximum that determines the variable CandOptSolimax,jmax such that the value bimax,jmax/cimax,jmax is higher than or equal to all values bij/cij associated with variables of VS0. This choice is justified by considering that the Co-Energy of ILPP can be increased only by selecting, among all variables CandOptSolij equal to 0, that producing the highest increment of Co-Energy per a unitary increment of cost, or, in other words, that having the highest value of the benefit-cost ratio. After CandOptSolimax,jmax has been determined, it is set equal to 1 and is transferred from VS0 to VS1. 
This last activity might produce a solution of ILPP whose overall cost is greater than B, i.e., an unfeasible solution. If this happens, GenerateCandidate performs a loop; during each iteration, it activates a function ExtractMinimum that determines the variable CandOptSolimin,jmin such that the value bimin,jmin/cimin,jmin is less than or equal to all values bij/cij associated with variables of VS1; CandOptSolimin,jmin is set equal to 0 and transferred from VS1 to VS0. This loop is repeated until CandOptSol becomes feasible.
GenerateCandidate returns the feasible CandOptSol and the associated CandCoEnergy, computed by determining the value of the objective function of ILPP associated with CandOptSol. VerifyCandidate receives CandCoEnergy, CurrCoEnergy and Temperature and determines if CandOptSol can replace CurrOptSol. It behaves as follows: it generates a random value η belonging to the real interval [0,1); then, it computes the probability that CandOptSol can improve, either directly or after a series of steps, CurrOptSol by applying the Maxwell-Boltzmann probability distribution (see (Russell, and Norvig 2002) for all details and the corresponding explanation about it):

$\Pr = \begin{cases} 1 & \text{if } CandCoEnergy > CurrCoEnergy \\ e^{\frac{CandCoEnergy - CurrCoEnergy}{Temperature}} & \text{otherwise} \end{cases}$
If η < Pr then VerifyCandidate returns true; otherwise, it returns false.
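The Python sketch below mirrors the SA adaptation just described: GenerateCandidate flips the zero variable with the best benefit/cost ratio and repairs budget violations, and acceptance follows the Maxwell-Boltzmann rule above. The dictionary-based encoding of the solution matrix and the default parameter values are illustrative assumptions.

```python
import math, random

def simulated_annealing(benefit, cost, budget, max_iter=1000, temperature=0.05,
                        coenergy_bound=float("inf"), seed=0):
    """Sketch of SA for the 0/1 programme: benefit, cost map (patient, service) pairs
    to bij and cij (cij > 0 assumed); returns a feasible solution and its Co-Energy."""
    rng = random.Random(seed)
    pairs = list(benefit)
    current = {p: 0 for p in pairs}                    # the zero matrix is always feasible
    curr_coenergy = 0.0

    def coenergy(sol):
        return sum(benefit[p] for p in pairs if sol[p])

    def generate_candidate(sol):
        cand = dict(sol)
        zeros = [p for p in pairs if cand[p] == 0]
        if not zeros:
            return cand
        best = max(zeros, key=lambda p: benefit[p] / cost[p])   # ExtractMaximum
        cand[best] = 1
        while sum(cost[p] for p in pairs if cand[p]) > budget:  # repair loop (ExtractMinimum)
            ones = [p for p in pairs if cand[p] == 1]
            worst = min(ones, key=lambda p: benefit[p] / cost[p])
            cand[worst] = 0
        return cand

    for _ in range(max_iter):
        cand = generate_candidate(current)
        cand_coenergy = coenergy(cand)
        if cand_coenergy > curr_coenergy:
            accept = True                                        # Pr = 1
        else:                                                    # Boltzmann acceptance
            accept = rng.random() < math.exp((cand_coenergy - curr_coenergy) / temperature)
        if accept:
            current, curr_coenergy = cand, cand_coenergy
        if curr_coenergy >= coenergy_bound:
            break
    return current, curr_coenergy
```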
FUTURE TRENDS As for our future research efforts in this application context, we plan to provide our system with a further module implementing Knowledge Discovery techniques on data concerning the patients’ usage of previously delivered services in such a way as to determine the most appreciated ones. This information could be combined with that presently provided by our system in order to improve the quality of its results. Moreover, we would like to study the possibility to integrate our system with a task assignment one in such a way that, after a PHCA manager has decided the new services to deliver, he can define the internal planning that optimizes the assignment of the PHCA’s available resources in such a way as to reach the highest internal efficiency.
CONCLUSION In this chapter we have presented an HL7-aware information system to support PHCA managers in their decision making activities. We have seen that our system is capable of supporting a rigorous and scientific decision making activity, taking a large variety of factors into account. One of the most relevant features of our system is its capability of representing both patient and service information according to the HL7 directives; in our opinion, this is particularly interesting because HL7 is a widely accepted standard in the marketplace and also several research projects adopt it as the reference standard for representing clinical documents.
REFERENCES Baeza-Yates R., & Ribeiro-Neto B. (1999). Modern Information Retrieval. Addison Wesley Longman.
Berndt D.J., Hevner A.R., & Studnicki J. (2003). The CATCH Data Warehouse: support for community health care decision-making. Decision Support Systems, 35(3), 367-384. Bertsimas D., & Tsitsiklis J. N. (1997). Introduction to Linear Optimization. Athena Scientific, Belmont, Massachusets, USA. De Meo P., Quattrone G., Terracina G., & Ursino D. (2005). A Multi-Agent System for the management of E-Government Services. In Proceedings of 2005 IEEE/WIC/ACM International Conference on Intelligent Agent Technology (pp 718-724), Compiegne University of Technology, France. IEEE Computer Society Press. Eichelberg M., Aden T., Riesmeier J., Dogac A., & Laleci G.B. (2005). A survey and analysis of electronic healthcare record standards. ACM Computing Surveys, 37(4), 277-315. Flessa S. (2000). Where efficiency saves lives: A linear programme for the optimal allocation of health care resources in developing countries. Health Care Management Science, 3(3), 249-267. Health Level Seven, HL7 (2007). Retrieved from http://www.hl7.org. Harper P.R., & Shahani A.K. (2003). A decision support system for the care of HIV and AIDS patients in India. European Journal of Operational Research, 147(1), 187-197. Homer J., & Milstein B. (2004). Optimal decision making in a dynamic model of community health. In Proc. of the International Conference on System Sciences, Big Island, Hawaii, USA. IEEE Computer Society Press. Logical Observation Identifiers Names and Codes, LOINC. (2007). Retrieved from http://www.regenstrief.org/loinc/. Organisation for Economic Cooperation and Development (2005). Health at a Glance - OECD Indicators 2005. Russell S., & Norvig P. (2002). Artificial Intelligence: A Modern Approach-Second Edition. Prentice Hall, Inc., Upper Saddle River, New Jersey, USA. Spielman D.A., & Teng S. (2004). Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3), 385-463. Systematized NOmenclature of MEDicine Clinical Terms, SNOMED CT. (2007). From http://www. snomed.org. Vasilyeva E., Pechenizkiy M., & Puuronen S. (2005). Towards the framework of adaptive user interfaces for eHealth. In Proc. of the IEEE Symposium on Computer-Based Medical Systems (pp. 139-144). Dublin, Ireland. IEEE Computer Society. Walczak S., Pofahl W.E., & Scorpio R.J. (2003). A decision support tool for allocating hospital bed resources and determining required acuity of care. Decision Support Systems, 34(4), 445-456.
KEY TERMS
Clinical Document Architecture (CDA): An XML-based markup standard conceived to specify the encoding, structure and semantics of clinical documents in such a way as to make their exchange easy.
E-Health: An emerging field in the intersection of medical informatics, public health and business, referring to health services and information delivered or enhanced through the Internet and related technologies. In a broader sense, the term characterizes not only a technical development, but also a state of mind, a way of thinking, an attitude, and a commitment for networked, global thinking, to improve health care locally, regionally, and worldwide by using information and communication technology.
eXtensible Markup Language (XML): The novel language, standardized by the World Wide Web Consortium, for representing, handling and exchanging information on the Web.
Health Level Seven (HL7): A standard series of predefined logical formats for packaging healthcare data in the form of messages to be transmitted among computer systems.
Linear Programming Problem: An optimization problem in which the objective function and the constraints are all linear.
Logical Observation Identifiers Names and Codes (LOINC): A universal standard for allowing the electronic transmission of clinical data from medical laboratories to hospitals and surgeries. Each LOINC record has a code that can be used in HL7 messages. LOINC codes allow all sections of a CDA document to be codified; as a consequence, the usage of LOINC allows the production of CDA documents characterized by universally acknowledged codes.
Systematized Nomenclature of Medicine (SNOMED): The largest structured vocabulary used in medicine. All substantives, adjectives, eponyms, etc., concerning the medical language are stored in this vocabulary. The SNOMED project was started in 1965 by the Committee on Nomenclature and Classification of Disease of the College of American Pathologists. After various modifications, the current version of this vocabulary, called SNOMED CT (Clinical Terms), was produced.
User Modeling: The process of gathering information specific to each user either explicitly or implicitly. This information is exploited to customize the content and the structure of a service to the user's specific and individual needs.
User Profile: A model of a user representing both his preferences and his behaviour.
Endnotes
1 It is worth pointing out that, generally, the benefit of a service is much higher for patients having low incomes.
2 This is possible because CandOptSol is the solution of the relaxed ILPP and not of ILPP.
Chapter XXXVIII
Multitarget Classifiers for Mining in Bioinformatics
Diego Liberati, Istituto di Elettronica e Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, Politecnico di Milano, Italy
Abstract Building effective multitarget classifiers is still an on-going research issue: this chapter proposes the use of the knowledge gleaned from a human expert as a practical way to decompose the problem and extend the proposed binary strategy. The core is a greedy feature selection approach that can be used in conjunction with different classification algorithms, leading to a feature selection process that works independently of any classifier that could then be used. The procedure takes advantage of the Minimum Description Length principle for selecting features and promoting the accuracy of multitarget classifiers. Its effectiveness is assessed by experiments, with different state-of-the-art classification algorithms such as Bayesian and Support Vector Machine classifiers, over datasets publicly available on the Web: gene expression data from DNA micro-arrays are selected as a paradigmatic example, containing a lot of redundant features due to the large number of monitored genes and the small cardinality of samples. Therefore, in analysing these data, as in text mining, a major challenge is the definition of a feature selection procedure that highlights the most relevant genes in order to improve automatic diagnostic classification.
Introduction As stressed by recent literature (Mukherjee, 2003; Statnikov et al., 2005) many classifiers have their own limitations when used for multitarget classification, i.e. predicting more than two different classes. The prior knowledge gleaned from a human expert is employed as a support to overcome the problems faced by multitarget classification: a divide-and-conquer approach allows to decompose the original multitarget classification problem into a set of binary classification problems. Such decomposition is
performed by using a decision tree that is based on previous knowledge, experience and observation (Yeoh et al., 2002) and provides an inference schema reflecting the human decision making process. This is of paramount importance when dealing with patho-physiological problems, like the process of selecting the most important genes, i.e. the minimal set of genes that allows building efficient classifiers, from micro-array data (Golub et al., 1999; Guyon et al., 2002; Blum and Langley, 1997), as well as when selecting the most interesting features in text mining. In order to improve the performance of learning algorithms (Khan et al., 2001; Golub et al., 1999) and avoid over-fitting, it is of paramount importance to reduce the dimensionality of the data by deleting unsuitable attributes (Witten and Frank, 2005).
Background The common practice in this increasingly important bioinformatics field is to employ a range of accessible methodologies that can be broadly classified into three categories:
• Classification methods based on global gene expression analysis (Golub et al., 1999; Alizadeh et al., 2000; Ross et al., 2000), specifically aimed at applying a single technique to a specific gene expression dataset;
• Traditional statistical approaches such as Principal Component Analysis (Liberati et al., 2005; Garatti et al., 2007), discriminant analysis (Nguyen and Rocke, 2002) or Bayesian decision theory (Bosin et al., 2006);
• Machine learning techniques such as neural (Tung and Quek, 2005; Khan et al., 2001) and logical (Muselli and Liberati, 2002) networks, decision trees and Support Vector Machines (SVM) (Guyon et al., 2002; Furey et al., 2001; Valentini, 2002).
Nonetheless, (Statnikov et al., 2005) reported that such results lack a consistent and systematic approach as they validate their methods differently, on different public datasets and on different limited sets of features. Dudoit (2002) and colleagues have compared the performance of various micro-array data classification methods, and a recent extensive comparison (Lee et al., 2005) provides some additional insights. The relevance of good feature selection methods has been discussed by Guyon (2002) and colleagues with special emphasis on over-fitting, but the recommendations in literature do not give evidence for a single best method for either the classification of micro-array data, or at least their feature selection. It has also been pointed out (Tung and Quek, 2005) that often classifiers work as black boxes, the decision making process being not intuitive to the human cognitive process and, more importantly, the knowledge extracted by these classifiers from the numerical training not being easy to be understood and then assessed. In order to overcome such drawbacks, an approach more driven by data in feature selection, not neglecting the available domain knowledge, makes it possible to emulate the human style of reasoning and decision making when solving complex problems. The adoption of the Minimum Description Length (MDL) principle (Barron et al., 1998) is proposed for both selecting features and comparing classifiers. Any regularity in the data can in fact be used to compress the data themselves. Being data compression equivalent to a kind of probabilistic prediction, MDL methods can be interpreted as searching for a model with good predictive performance on unseen
data. But finding regularity is a kind of learning: therefore the more we are able to compress the data without loosing information, the more we have learned about them. According to MDL principle, the best theory to infer from data is the one that minimizes both the length (i.e. the complexity) of the theory itself and the length of the data encoded with respect to it. MDL provides a criterion to judge the quality of a classification model and, by considering each attribute as a simple predictive model of the target class (Kononenko, 1995; Yarmus, 2003; Bosin et al., 2006), it can also be employed to address the problem of feature selection. On such grounds, it is here explored how a feature selection heuristic based on MDL is effective in building the following classifiers: Naïve Bayes (NB) (Friedman et al., 1997; Cheng and Grenier, 1999), Adaptive Bayesian Network (ABN) (Yarmus, 2003), Support Vector Machines (SVM) (Vapnik, 1998) and k-Nearest Neighbour (Witten and Frank, 2005). In order to compare the resulting strategy to the approaches more commonly used in statistics or machine learning, some of the considered methods both for ranking and selecting features and for classification are outlined.
Ranking Statistical Ranking Univariate statistical tests, such as χ2 and t-statistics, are widely employed for gene ranking (Yeoh et al., 2002; Liu et al., 2002). The χ2 method, evaluated also within the present comparative study, weights each individual feature by measuring its chi-square statistic with respect to the classes. The χ2 value of each feature is defined as:
$\chi^2 = \sum_{i=1}^{l} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}} \qquad (1)$
where l is the number of intervals, k the number of classes, Aij the number of samples in the i-th interval, j-th class, and Eij the expected frequency of Aij; specifically, Eij = Ri⋅Cj /N, where Ri is the number of samples in the i-th interval, Cj the number of samples in the j-th class, N the total number of samples. The larger is the χ2 value, the more discriminant is the corresponding feature. For a continuous feature, as a gene expression value, such approach does first require to make its range of values into a finite number of intervals (Perner and Trautzsch, 1998). Both unsupervised and supervised approaches have been tested, the best results having been obtained using the entropy-based method proposed in (Fayyad and Irani, 1993).
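A small Python sketch of the χ2 score of Equation (1). It assumes the feature has already been discretized into intervals (e.g., with the entropy-based Fayyad-Irani method mentioned above), and the toy example at the end is purely illustrative.

```python
from collections import Counter

def chi_square_score(feature_bins, labels):
    """Chi-square score of one (already discretized) feature, per Equation (1).

    feature_bins: list of interval indices, one per sample
    labels:       list of class labels, one per sample
    """
    n = len(labels)
    row = Counter(feature_bins)                     # Ri: samples per interval
    col = Counter(labels)                           # Cj: samples per class
    cell = Counter(zip(feature_bins, labels))       # Aij: samples per (interval, class)
    score = 0.0
    for i in row:
        for j in col:
            expected = row[i] * col[j] / n          # Eij = Ri * Cj / N
            observed = cell[(i, j)]
            score += (observed - expected) ** 2 / expected
    return score

# toy example: a perfectly discriminating binary feature gets the largest score
print(chi_square_score([0, 0, 1, 1], ["ALL", "ALL", "OTHERS", "OTHERS"]))
```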
MDL-Based Ranking Minimum Description Length principle (Barron et al., 1998) can be used to achieve the ranking of attributes. The idea underlying the MDL method is to find the most compact encoding of the training data (u1, …, uN). To this end, the following MDL measure (Friedman et al., 1997) is adopted:
$MDL = \log_2(m) - \sum_{i=1}^{N} \log_2\big(p(u_i)\big) \qquad (2)$
where m is the number of potential candidate models and p(ui) is the model predicted probability assigned to the training instance ui. The first term in Equation 2 represents how many bits one needs to encode the specific model (i.e. its length), and the second term measures how many bits are needed to describe the data on the basis of the probability distribution associated to the model. By considering each feature as a simple predictive model of the target class, as described in (Kononenko, 1995), features can indeed be ranked according to their description length, reflecting the strength of their correlation with the target class. The measure of the description length given in (Yarmus, 2003) is here used:
$MDL = \sum_{j} \log_2 \binom{N_j + C - 1}{C - 1} - \sum_{j} \sum_{i=1}^{N_j} \log_2(p_{ji}) \qquad (3)$
where Nj is the number of training instances with the j-th value of the given feature, C is the number of target class values, and pji is the probability of the value of the target class taken by the i-th training instance with the j-th value of the given feature (estimated from the distribution of target values in the training data). As in the general case (Equation 2), the first term expresses the encoding length, where we have one sub-model for each value of the feature, while the second gives the number of bits needed to describe the data, based on the probability distribution of the target value associated to each sub-model.
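The following sketch computes the description length of Equation (3) for one feature and uses it to rank features, a shorter description length meaning a stronger correlation with the class. It is a plain-Python reading of the formula under the stated probability estimates, not the Weka or ABN implementation used in the experiments; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def mdl_description_length(feature_values, labels):
    """Description length of one feature as a predictor of the class, per Equation (3).
    Lower values indicate a stronger (more compressive) feature."""
    c = len(set(labels))
    by_value = defaultdict(list)
    for v, y in zip(feature_values, labels):
        by_value[v].append(y)
    length = 0.0
    for ys in by_value.values():
        n_j = len(ys)
        length += math.log2(math.comb(n_j + c - 1, c - 1))   # model-encoding term
        counts = Counter(ys)
        for y in ys:                                          # data-encoding term
            p_ji = counts[y] / n_j                            # estimated from the training data
            length -= math.log2(p_ji)
    return length

def rank_features(matrix, labels):
    """Rank feature indices from most to least relevant (shortest description first)."""
    scores = [mdl_description_length([row[k] for row in matrix], labels)
              for k in range(len(matrix[0]))]
    return sorted(range(len(scores)), key=lambda k: scores[k])
```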
Classification Methods Bayesian approaches Recently, a lot of research has focused on improving Naïve Bayes (NB) classifiers (Friedman et al., 1997; Cheng and Greiner, 1999; Keogh and Pazzani, 2002), by relaxing their full independence assumption. One of the most interesting approaches is based on the idea of adding correlation arcs between the attributes of a NB classifier. On these “augmenting arcs” specific structural constraints (Friedman; Keogh and Pazzani, 2002] are imposed, in order to maintain computational simplicity on learning. A method for augmenting the NB network according to a tree topology is proposed in (Friedman et al., 1997), while Keogh and Pazzani (2002) define augmented Bayesian networks whose “augmenting arcs” form a collection of trees. The algorithm here used, the Adaptive Bayesian Network (ABN) (Yarmus, 2003), is a greedy variant of the approach proposed in (Keogh and Pazzani, 2002). In brief, the steps needed to build an ABN classifier are the following. First, the attributes (predictors) are ranked according to their MDL importance. Then, the network is initialized to NB on the top k ranked predictors (X1, X2, …, Xk), that are treated as conditionally independent. Next, the algorithm attempts to extend NB by constructing a set of tree-like multidimensional features. Feature construction proceeds as follows. The top ranked predictor is stated as a seed feature, and the predictor that most improves feature predictive accuracy, if any, is added to the seed. Further predictors are added in such a way to form a tree structure, until the accuracy does not improve. Using the
next available top ranked predictor as a seed, the algorithm attempts to construct additional features in the same manner. The process is interrupted when the overall predictive accuracy cannot be further improved or after some pre-selected number of steps. The resulting network structure consists of a set of conditionally independent multiattribute features, and the target class probabilities are estimated by the product of feature probabilities. Interestingly, each multidimensional feature can be expressed in terms of a set of if-then rules, enabling users to easily understand the basis of model predictions. Support Vector Machines (SVM) are a popular, though more recently introduced, classification approach due to Vapnik (1998), to whom the reader is referred for a deeper treatment of the method. The k-nearest neighbor (k-NN) approach, less recent but even more popular, is also well described in the review literature, for instance in (Witten and Frank, 2005).
Main Focus of the Chapter
Our strategy is essentially based on the assumption that the target (class) is mostly determined by its correlation with single attributes (variables, in our example genes). This may appear quite reductive, but it will be shown that, even with such a restriction, powerful results are obtainable, at a very much reduced cost with respect to considering multiple correlations. Moreover, unsupervised techniques (Liberati et al., 2005; Garatti et al., 2007) show how even multivariate approaches often need just a very reduced set of attributes. A learning strategy has thus been developed and tuned, combining a scheme-independent ranking with a scheme-specific selection of attributes, seeking the optimal sub-set of attributes to employ in the actual classifiers. The scheme-independent ranking is based on general characteristics of the data, their attributes being ranked according to a chosen measure of relevance. A measure built upon the Minimum Description Length principle (Barron et al., 1998) of information theory is adopted here, and compared with the more popular statistical measure based on χ2. In order to find an optimal sub-set of attributes, a scheme-specific selection is then used, evaluating the candidate sub-sets with the learning algorithm that will ultimately be employed for training the classifier. In particular, a greedy algorithm is employed here, whose initialization is the previously defined ranking of all the attributes present in the training dataset; it then works with the sub-set of the N top-ranked attributes (all other features being ignored). As the next step, a classifier is built with only these N attributes, its accuracy being evaluated over an independent test dataset. Then the algorithm sets N = N + k and extends the sub-set by adding the next k top-ranked attributes. A new classifier is built on this extended sub-set, and its accuracy is in turn evaluated on the independent test set. The algorithm iterates this procedure, recording each time the accuracy of the classifier, and stops if the accuracy has not increased over the last J iterations (or when all the original attributes are included in the sub-set). It is worth noticing that J should be chosen greater than 1 if one wants to take into account possible fluctuations that could cause the procedure to stop prematurely before an optimal set of attributes has been reached.
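As a concrete illustration of the scheme-specific step just described, here is a minimal sketch of the greedy wrapper, written against a scikit-learn-style estimator (an assumption of the sketch, since the chapter's experiments use Weka); the parameter names, the default step size and the patience value are illustrative, and the data are assumed to be NumPy arrays.

```python
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def greedy_subset_search(ranked_idx, clf, X_train, y_train, X_test, y_test,
                         start_n=20, step_k=20, patience_j=3):
    """Grow the attribute subset along the ranking; stop when accuracy has not
    improved for patience_j iterations or all attributes have been used."""
    best_acc, best_n, stale, n = -1.0, start_n, 0, start_n
    while n <= len(ranked_idx):
        cols = list(ranked_idx[:n])
        model = clone(clf).fit(X_train[:, cols], y_train)
        acc = accuracy_score(y_test, model.predict(X_test[:, cols]))
        if acc > best_acc:
            best_acc, best_n, stale = acc, n, 0
        else:
            stale += 1
            if stale >= patience_j:   # J > 1 tolerates accuracy fluctuations
                break
        n += step_k
    return best_n, best_acc
```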
In order to validate the discussed greedy approach, this chapter presents a case study, namely a multiclass dataset (http://www.stjuderesearch.org/data/ALL1/) containing samples of all the known types of Acute Pediatric Lymphoblastic Leukemia (ALL) at St. Jude Children's Research Hospital. ALL data provide an interesting test-bed for multitarget classification. Indeed, ALL is a heterogeneous disease consisting of various leukemia sub-types that remarkably differ in their response to chemotherapy (Yeoh et al., 2002). All known ALL sub-types (including T-ALL, E2A-PBX1, TEL-AML1, BCR-ABL, MLL, Hyperdip > 50) are included in the examined dataset, which consists of 327 samples (specifically, 215 training and 112 test samples), each one described by the expression level of 12558 genes. All the experiments have been carried out using a standard, publicly and freely available data mining tool: Weka 3 (http://www.cs.waikato.ac.nz/ml/weka). In a first attempt the multiclass problem is tackled in one shot: the learning strategy is applied to the whole dataset, involving 7 classes (the 6 recalled ALL sub-types plus an OTHERS class collecting samples not directly classified in any of those 6). The accuracy of the recalled NB, ABN, SVM and k-NN classifiers increases with an increasing number of features N, k-NN appearing as the best classifier over the whole range. When N ≤ 200 the Bayesian classifiers appear to perform less poorly than SVM, while the accuracy of the 4 approaches becomes similar, and steady, only when N reaches about 400, the resulting classifiers still implying a quite high level of misclassifications (5-10%). Such results show that good classification accuracy and small sets of predictive features are very difficult to achieve together with a multitarget classification, as highlighted in recent literature (Mukherjee, 2003). However, the knowledge of a domain expert makes it possible to decompose the multitarget classification into a structured set of binary classification problems, one for each target (in our example, ALL sub-type). Specifically, the ALL diagnosis can be approached with a divide-and-conquer methodology, hierarchically singling out each sub-type at a time (Yeoh et al., 2002). Indeed, when analyzing a biological sample, medical doctors first look for possible evidence of the most common or unambiguous sub-type (in our example T-ALL) against all the others (here referred to as OTHERS1). If there is no T-ALL evidence, they look for the next most common or unambiguous sub-type (i.e. E2A-PBX1) against the remaining ones (OTHERS2). Then the process steps through TEL-AML1 vs. OTHERS3, BCR-ABL vs. OTHERS4, MLL vs. OTHERS5 and finally Hyperdip > 50 vs. OTHERS (grouping all samples not belonging to any of the previous sub-types). Such a methodology can be reproduced in a machine learning process that relies on a tree of six binary sub-models, one for each target. Each binary sub-model is responsible for a single ALL sub-type, i.e. it is in charge of differentiating the samples belonging to a specific sub-type from the rest of the samples not yet classified. At each level of such a decision tree, the binary classification process gains information from the previous level because the ALL sub-type related to the latter is left out when the former is trained and tested. Thus, the original training and test sets are progressively reduced from one level to the next. Such a tree of six binary models is much more effective than a single multiclass model.
Indeed, the number of misclassified samples is higher for multiclass models, going from 6 to 11 with up to 400 features, while the diagnostic tree models achieve a number of errors from 4 to 8 with only 120 features (20 features for each of the 6 binary models). Interestingly, the k-NN algorithm (best performing in training multiclass models) generates the worst tree of binary models. Six different heuristics for feature selection, based on entropy, χ2 and t-statistics, are explored by learning NB, SVM and k-NN classifiers on our same ALL dataset in (Liu et al., 2002): there, the best feature selection heuristic ranks attributes according to their entropy, and selects all the features, if any, whose entropy value is less than 0.1, or the 20 features with the lowest entropy values otherwise.
The resulting NB, SVM, k-NN models misclassify 7, 5 and 4 samples respectively, while the NB, SVM, k-NN classifiers trained with the 20 MDL top-ranked features result in 4, 5 and 8 misclassifications respectively: our approach thus appears to favor the Bayesian classifier, while the entropy-based selection of Liu et al. favors k-NN, with SVM indifferent to the choice and nearly the best in both cases. In order to investigate the effectiveness of our feature selection strategy, the commonalities have been analyzed between the sets of attributes resulting from the previous multiclass and diagnostic tree experiments. In the multiclass experiment NB, ABN, SVM, k-NN classifiers have been built for an increasing number N (= 20, 40, 60, 100, 200, 400) of features, up to the value of 400. In the diagnostic tree experiment NB, ABN, SVM, k-NN binary classifiers have been built for each ALL sub-type (tree level), the best accuracy occurring with a set of 20 features at each single level. For each ALL sub-type in the diagnostic tree, many features are in common for T-ALL, E2A-PBX1, TEL-AML1, but only a few are in common for BCR-ABL, MLL, Hyperdip > 50. This seems to suggest that genes relevant to T-ALL, E2A-PBX1, TEL-AML1 are promptly selected in both experiments, but genes relevant to BCR-ABL, MLL, Hyperdip > 50 are hard to select in the multiclass experiment. A possible explanation is that the strength of the correlation of genes with a given ALL sub-type can be very different for different sub-types. Hence, if a single sub-type is assumed as target (tree experiment), the feature selection process top-ranks only those genes that are relevant for the target and discards all the others (even if they are strongly correlated with another sub-type). Conversely, in the multiclass experiment, all the genes most strongly correlated with any one of the sub-types are top-ranked, but the target they are mostly correlated to is not evident at all. The optimal sets of features relevant to targets T-ALL, E2A-PBX1, TEL-AML1 in the tree experiment are top-ranked in the multiclass experiment, too, meaning that the correlation of each set with its target is similar in strength. This assertion is partially true for MLL and Hyperdip > 50, but definitely not true for BCR-ABL, meaning that the correlation is less strong. Another interesting point emerging from the diagnostic tree experiment is that almost no overlap exists between the groups of the best 20 genes selected with MDL at each level of the tree. Specifically, only one overlapping gene has been found between levels 2-5, 3-4, 5-6. Finally, it is of interest to compare feature sets derived from different feature selection heuristics. For each ALL sub-type (corresponding to the binary models in the diagnostic tree), the number of attributes common to the two sets of the best 20 features selected by the MDL and χ2 heuristics is lower for those ALL sub-types that are "hard" to classify, i.e. BCR-ABL and Hyperdip > 50.
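A minimal sketch of the diagnostic tree described above is given below. It assumes scikit-learn-style estimators, NumPy arrays, and a caller-supplied feature selection routine (select_features is a hypothetical helper, e.g. one returning the 20 MDL top-ranked columns); the sub-type order mirrors the clinical order discussed in the text.

```python
from sklearn.base import clone

SUBTYPE_ORDER = ["T-ALL", "E2A-PBX1", "TEL-AML1", "BCR-ABL", "MLL", "Hyperdip>50"]

def train_diagnostic_tree(base_clf, select_features, X, y):
    """One binary sub-model per level: sub-type vs. the samples not yet classified."""
    tree, X_rem, y_rem = [], X, y
    for subtype in SUBTYPE_ORDER:
        target = (y_rem == subtype).astype(int)        # sub-type vs. OTHERS_k
        cols = select_features(X_rem, target, n=20)    # e.g. 20 top-ranked features
        clf = clone(base_clf).fit(X_rem[:, cols], target)
        tree.append((subtype, cols, clf))
        keep = y_rem != subtype                        # drop this sub-type before the next level
        X_rem, y_rem = X_rem[keep], y_rem[keep]
    return tree

def predict_diagnostic_tree(tree, x):
    for subtype, cols, clf in tree:
        if clf.predict(x[cols].reshape(1, -1))[0] == 1:
            return subtype
    return "OTHERS"
```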
Future trends/Conclusion
In this chapter, a greedy approach has been presented for non-binary classification of high-dimensional micro-array data when only very few samples are available. The MDL principle has been chosen as an effective and simple heuristic for feature selection. A gene-expression dataset, available on the Web, has been used to validate and assess the performance of the proposed method. The multitarget classification problem has been decomposed into a tree of binary classification problems. Each of them is solved by a classifier entrusted with the task of differentiating the samples of a specific cancer subtype from the rest of the samples. A decomposition based on priors (in our case clinical knowledge) is shown to perform better than multitarget blind classifiers, as the latter can easily get stuck in trying to improve their accuracy. This is consistent with the general principle that,
when some knowledge is already easily available a priori, it is not just useless, but often dangerous, to try to recover it again from the data in which it is embedded: it is better to use it in order to simplify the rest of the automatic procedure. The results show an improvement in terms of the reduced number of features needed by the proposed method as compared to other more traditional techniques: better feature selection can lead to improved classification. The fact that the overlap is less than perfect is likely due to the redundancy of the features in this dataset. It is worth noting, however, that an overall comparison suggests that different feature selection methods can lead to different subsets of genes that can be used to build classifiers of similar accuracy. The work presented in this chapter can be extended in at least a few complementary directions. Future work might consider a comparison of this method with other feature selection and classification methods. The biological significance of the specific identified features can also be discussed. Extension to text mining is of interest not only within bioinformatics, for both genetic code and literature mining, but also in the wider context of Web mining, for both data and text.
Acknowledgment
Andrea Bosin, Nicoletta Dessì and Barbara Pes, from Università degli Studi di Cagliari, Dipartimento di Matematica e Informatica, were instrumental in the early stages of this study.
References
Alizadeh A.A., Eisen M.B., Davis R.E., Ma C., Lossos I.S., Rosenwald A., Boldrick J.C., Sabet H., Tran T., Yu X., Powell J.I., Yang L., Marti G.E., Moore T., Hudson J., Lu L., Lewis D.B., Tibshirani R., Sherlock G., Chan W.C., Greiner T.C., Weisenburger D.D., Armitage J.O., Warnke R., Levy R., Wilson W., Grever M.R., Byrd J.C., Botstein D., Brown P.O. and Staudt L.M. (2000). Distinct Types of Diffuse Large B-cell Lymphoma identified by gene expression profiling. Nature, 403, 503-511.
Barron A., Rissanen J. and Yu B. (1998). The minimum description length principle in coding and modelling. IEEE Transactions on Information Theory, 44, 2743-2760.
Blum A. and Langley P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245-271.
Bosin A., Dessì N., Liberati D. and Pes B. (2006). Learning Bayesian Classifiers from Gene-Expression MicroArray Data. Lecture Notes in Computer Science, Volume 3849, 297-304. Springer-Verlag.
Cheng G. and Greiner R. (1999). Comparing Bayesian Network Classifiers. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, Inc., San Francisco.
Dudoit S., Fridlyand J. and Speed T.P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77-87.
Fayyad U.M. and Irani K.B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 1022-1027. San Francisco, CA: Morgan Kaufmann.
Friedman N., Geiger D. and Goldszmidt M. (1997). Bayesian Network Classifiers. Machine Learning, 29, 131-161.
Furey T., Cristianini N., Duffy N., Bednarski D.W., Schummer M. and Haussler D. (2001). Support vector machine classification and validation of cancer tissue samples using micro-array expression data. Bioinformatics, 16, 906-914.
Garatti S., Bittanti S., Liberati D. and Maffezzoli P. (2007). An unsupervised clustering approach for leukemia classification based on DNA micro-arrays data. Intelligent Data Analysis, 11(2), 175-188.
Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M., Downing J.R., Caligiuri M.A., Bloomfield C.D. and Lander E.S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.
Guyon I., Weston J., Barnhill S. and Vapnik V. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46(1-3), 389-422.
Khan J., Wei J.S., Ringnér M., Saal L.H., Ladanyi M., Westermann F., Berthold F., Schwab M., Antonescu C.R., Peterson C. and Meltzer P.S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673-679.
Keogh E. and Pazzani M.J. (2002). Learning the structure of augmented Bayesian classifiers. International Journal on Artificial Intelligence Tools, 11(4), 587-601.
Kononenko I. (1995). On biases in estimating multivalued attributes. Proceedings of the 14th International Joint Conference on Artificial Intelligence, 95, 1034-1040, Montreal, Canada.
Lee J.W., Lee J.B., Park M. and Song S.H. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis, 48, 869-885.
Liberati D., Bittanti S. and Garatti S. (2005). Unsupervised Mining of Genes Classifying Leukaemia. In Wang J. (Ed.), Encyclopaedia of Data Warehousing and Mining, pp. 1155-1159. IGI Global, Hershey, PA, USA.
Liu H., Li J. and Wong L. (2002). A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns. Genome Informatics, 13, 51-60.
Mukherjee S. (2003). Classifying Microarray Data Using Support Vector Machines. In Understanding and Using Microarray Analysis Techniques: A Practical Guide. Kluwer Academic Publishers, Boston, MA.
Muselli M. and Liberati D. (2002). Binary rule generation via Hamming Clustering. IEEE Transactions on Knowledge and Data Engineering, 14(6), 1258-1268.
Nguyen D.V. and Rocke D.M. (2002). Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data. Bioinformatics, 18(1), 39-50.
Perner P. and Trautzsch S. (1998). Multi-interval Discretization Methods for Decision Tree Learning. LNCS 1451, 475-482, Springer-Verlag.
Ross D.T., Scherf U., Eisen M.B., Perou C.M., Rees C., Spellman P., Iyer V., Jeffrey S.S., Van de Rijn M., Waltham M., Pergamenschikov A., Lee J.C., Lashkari D., Shalon D., Myers T.G., Weinstein J.N., Botstein D. and Brown P.O. (2000). Systematic Variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24, 227-234.
Statnikov A., Aliferis C.F., Tsamardinos I., Hardin D. and Levy S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5).
Tung W.L. and Quek C. (2005). GenSo-FDSS: a neural-fuzzy decision support system for pediatric ALL cancer subtype identification using gene expression data. Artificial Intelligence in Medicine, 33, 61-88.
Valentini G. (2002). Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artificial Intelligence in Medicine, 26, 281-304.
Vapnik V. (1998). Statistical Learning Theory. Wiley-Interscience: New York, NY, USA.
Witten I.H. and Frank E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, second edition. Elsevier: San Francisco.
Yarmus J.S. (2003). ABN: A Fast, Greedy Bayesian Network Classifier. Retrieved from http://otn.oracle.com/products/bi/pdf/adaptive_bayes_net.pdf
Yeoh E.J., Ross M.E., Shurtleff S.A., Williams W., Patel D., Mahfouz R., Behm F., Raimondi S., Relling M. and Patel A. (2002). Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1, 133-143.
Key Terms
Bio-informatics: The processing of the huge amount of information pertaining to biology.
Gene: A sentence in the genetic alphabet codifying a cell instruction.
Lymphoblastic Leukemia: A class of blood cancers quite common in children.
Micro-Array: A bio-assay technology that allows measuring the expression of thousands of genes from a sample on a single chip.
Minimum Description Length: An information theory principle claiming optimality for the most economical description of both the model and the coding that fully describe the process from which the data samples are extracted.
Multitarget Classification: Partition of the set of samples into more than two classes.
Principal Component Analysis: Rearrangement of the data matrix into new orthogonal transformed variables ordered in decreasing order of variance.
Chapter XXXIX
Current Issues and Future Analysis in Text Mining for Information Security Applications
Shuting Xu, Virginia State University, USA
Xin Luo, Virginia State University, USA
Abstract
Text mining is an instrumental technology that today's organizations can employ to extract information and further evolve and create valuable knowledge for more effective knowledge management. It is also an important tool in the arena of information systems security (ISS). While a plethora of text mining research has been conducted in search of revamped technological developments, relatively limited attention has been paid to the applicable insights of text mining in ISS. In this chapter, we address a variety of technological applications of text mining in security issues. The techniques are categorized according to the types of knowledge to be discovered and the text formats to be analyzed. Privacy issues of text mining as well as future trends are also discussed.
Introduction
Text mining is an instrumental technology that today's organizations can employ to extract information and further evolve and create valuable knowledge for more effective knowledge management. The deployment of text mining technology is also of vital importance in the arena of information systems
security (ISS), especially after the 9/11 tragedy, which galvanized the government to increasingly spend resources pursuing hardened homeland security. Furthermore, providing security to computer information systems and communications infrastructures is now one of the national priorities. While a plethora of text mining or data mining research has been conducted in search of revamped technological development, relatively limited attention has been paid to the applicable insights of text mining in ISS. In an effort to fill the void, this article intends to shed light on the correlations between text mining and ISS and to address a variety of technological applications of text mining in ISS, as well as privacy issues and future trends related to security. As such, this article is organized as follows: after an introduction section, the second section presents the background of text mining and its application in security; the third section addresses different categories of techniques used for text mining in ISS, including social network analysis, abnormal detection, topic discovery, and identity detection; section four discusses social perspectives of text mining versus privacy violation; and section five presents future analysis of text mining for ISS research.
Background
Defined as "the discovery by computer of new, previously unknown, information by automatically extracting information from different written resources" (Fan et al. 2006), text mining is an emerging technology characterized by a set of technological tools which allow for the extraction of unstructured information from text. With the exponential growth of the internet, it is literally cumbersome for individuals as well as companies to process all the overwhelming information. Unlike some data mining techniques that discover knowledge only from structured data, such as numeric data, text mining is related to finding knowledge from unstructured textual data including e-mails, Web pages, business reports, and articles, etc. Leaping from old-fashioned information retrieval to information and knowledge discovery, text mining applies the same analytical functions of data mining to the domain of textual information and relies on sophisticated text analysis techniques that distill information from free-text documents (Dörre et al. 1999). As voluminous corporate information must be merged and managed and the dynamic business environment pushes decision makers to promptly and effectively locate, read, and analyze relevant documents to produce the most informative decisions, discovering hidden patterns from the structured data plays an important role in business where patterns are paramount for strategic decision making. Text mining pursues knowledge discovery from textual databases by isolating key bits of information
from large amounts of text, by identifying relationships among documents, and by inferring new knowledge from them (Durfee 2006). Furthermore, (Fan et al. 2006) indicated that the key to text mining is creating technology that combines a human's linguistic capabilities with the speed and accuracy of a computer. Combining the generic process model for text-mining applications proposed by (Fan et al. 2006) and the general text mining framework suggested by (Durfee 2006), we think that the following model can capture the processes involved in text mining, from text collection and distillation to knowledge representation (see Figure 1).

Figure 1. Processes involved in text mining (adapted from Durfee 2006; Fan et al. 2006)
Text Mining in Security Applications
There is an increasing demand to apply text mining to information system security issues: for example, corporations wish to survey large collections of e-mails to identify illicit activities; counterterrorism analysts want to review large amounts of news articles to identify information relevant to potential threats, etc. One such application is ECHELON (European 2001), a world-wide signal intelligence and analysis network run by the UKUSA Community. It can capture radio and satellite communications, telephone calls, faxes, e-mails and other data streams nearly anywhere in the world. Text mining is used by ECHELON to search for hints of terrorist plots, drug-dealers' plans, and political and diplomatic intelligence. Various text mining techniques have been proposed in the literature for different security applications. The text formats to be mined may include:
• E-mail
• Web page
• News article
• Web Forum Message / Instant Message
The types of knowledge to be discovered in security applications may include:
• Connections among people, groups, and objects
• Abnormal behavior/interest of people under surveillance
• Topic/theme of text
• Author of text
The knowledge to be discovered above, like the topic/theme of text, also constitutes one of the main goals of mainstream text mining research. However, compared with other general text mining application goals, like clustering or classifying related texts, or finding the texts matching given key words or key sentences, the security applications focus more on discovering the hidden relationships of objects mentioned in the texts. A text corpus, in any format, can be used to discover any of the aforementioned types of knowledge. However, different kinds of formats may facilitate the discovery of different types of knowledge. For example, the sender and receiver of an e-mail may reveal the relationship among people. By inspecting the Web page browsing history, we may find out a person's change of interests. Many techniques have been proposed to discover different types of knowledge, which we introduce in detail in the following
four categories. Techniques used in one category can be used in another, and the functions in each category may overlap.
Social Network Analysis
Social Network Analysis (SNA) is the study of mathematical models for interactions among people, groups, and objects (McCallum, 2005). It has recently been recognized as a promising technology for studying criminal and terrorist networks to enhance public safety and national security. It can be applied to the investigation of organized crimes by integrating information from multiple crime incidents or multiple sources and discovering regular patterns about the structure, organization, operation, and information flow in criminal networks (Xu et al. 2005). It can also be used to study the change and evolution of the organizational structure in a corporation. It is currently one of the most important techniques applied in text mining for security applications.
Social Network Characteristics
In Social Network Analysis, some sophisticated structural analysis tools are needed to discover useful knowledge about the structure and organization of the networks. The following social network characteristics are usually calculated and analyzed: the overall structure of the network, the subgroups and their interactions, and the roles of network members (nodes).
Overall Structure
Generally the following graph metrics are considered to compare the characteristics of social networks (Chapanond et al., 2005); a computational sketch of several of them follows the list:
• Degree distribution is the histogram of the degree of vertices in the graph.
• Diameter is the longest of the shortest paths between any pair of vertices in a connected graph. It reflects how far apart two vertices are (from each other) in the graph.
• Average Distance is the average length of the shortest path between each pair of vertices in the graph.
• Average Distance Ratio is defined as (total number of vertices in a graph – Average Distance) / total number of vertices in a graph.
• Compactness is the ratio between the number of existing edges and the number of all possible edges.
• Clustering Coefficient of a vertex is defined as the percentage of the connections between the neighbors of the vertex. The Clustering Coefficient of a graph is the average value of the Clustering Coefficient over all vertices.
• Betweenness of an edge is defined as the number of shortest paths that traverse it.
• Relative Interconnectivity between two clusters is defined as the normalized absolute interconnectivity between the two clusters.
• Relative Closeness between two clusters is the normalized absolute closeness between the two clusters.
Relative interconnectivity and relative closeness are metrics used to determine the similarity in graph structure between two clusters.
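A minimal computational sketch of several of the metrics listed above, assuming the networkx library (not mentioned in the chapter) and a connected undirected graph; note that networkx's edge betweenness is a normalized centrality rather than a raw shortest-path count.

```python
import networkx as nx
from collections import Counter

def overall_structure(G: nx.Graph):
    """Compute a subset of the overall-structure metrics for a connected graph."""
    n = G.number_of_nodes()
    degree_hist = Counter(d for _, d in G.degree())        # degree distribution
    avg_dist = nx.average_shortest_path_length(G)          # Average Distance
    return {
        "degree_distribution": degree_hist,
        "diameter": nx.diameter(G),                        # longest shortest path
        "average_distance": avg_dist,
        "average_distance_ratio": (n - avg_dist) / n,
        "compactness": nx.density(G),                      # existing / possible edges
        "clustering_coefficient": nx.average_clustering(G),
        "edge_betweenness": nx.edge_betweenness_centrality(G),
    }
```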
The overall structure of one social network can be compared with another to show the similarities and differences between them. For example, Chapanond et al. compared the graph properties of the Enron e-mail graph with the RPI e-mail graph and found that degree distributions in both graphs obeyed the power law but the density of connectivity among communities of practice was different (Chapanond et al., 2005). The overall structure of a social network usually changes over time. Snapshots of the network at different times may be compared to find the changes or trends in the structure of an organization. Diesner et al. compared a network from a month during the Enron crisis with a network from a month in which no major negative happenings were reported and where the organization seemed to be on a successful path (Diesner et al. 2005). The results indicated that during the Enron crisis the network had been denser, more centralized and more connected than during normal times. Such knowledge is of potential benefit for modeling the development of crisis scenarios in organizations and the investigation of indicators of failure.
Subgroups
A social network usually consists of subgroups of individuals who closely interact with each other, such as cliques whose members are fully or almost fully connected. Some data mining techniques like clustering can be used to find out such subgroups (Han et al. 2006).
Pattern of Interaction
Blockmodeling (Wasserman et al. 1994) can be used to discover patterns of interaction between subgroups in SNA. It can reveal patterns of between-group interactions and associations and can help reveal the overall structure of the social network under investigation. Given a partitioned network, blockmodel analysis determines the presence or absence of an association between a pair of subgroups based on a link density measure. If the density of the links between the two subgroups is greater than a threshold value, which means the two subgroups interact with each other constantly, then the two subgroups have a strong association.
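A small sketch of the blockmodel density test just described; the graph library (networkx) and the threshold value are assumptions of the sketch, not of the chapter.

```python
import networkx as nx

def strong_association(G: nx.Graph, group_a, group_b, threshold=0.3):
    """Two subgroups are strongly associated when the density of the links
    between them exceeds the chosen threshold value."""
    links = sum(1 for u in group_a for v in group_b if G.has_edge(u, v))
    possible = len(group_a) * len(group_b)
    density = links / possible if possible else 0.0
    return density > threshold
```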
Role of Node
Several graph metrics such as degree, betweenness, and closeness can denote the importance of a node in a social network. A node with high degree may imply the individual's leadership, whereas a node with high betweenness may be a gatekeeper in the network (Xu et al. 2005). We can find those particular nodes with inordinately high degree, or with connections to a particularly well-connected subset of the network (McCallum, 2005). As an example, Krebs employed these metrics to find the central individuals in the network consisting of the 19 hijackers in the 9/11 attacks (Krebs 2002).
Application Examples
Based on different text formats, the methods adopted to construct social networks and conduct analysis may be quite different. Next we will use some examples to explain how Social Network Analysis is applied to various text formats.
E-Mail
An e-mail usually contains the following parts: Sender, Recipient, Date and Time, Subject and Content. A directed simple graph can be constructed in which vertices represent individuals and directed edges are added between individuals who correspond through e-mail. An undirected graph can also be constructed as described in (Chapanond et al., 2005): if two individuals have exchanged at least threshold1 e-mails with each other and each member of the pair has sent at least threshold2 e-mails to the other, then the two individuals are connected by an edge in the graph. Here threshold1 and threshold2 are predefined threshold values. Changing the value used for each threshold will change the structure of the graph. Sometimes nodes and edges can have multiple attributes such as the position and location of an employee or the types of relationships between two communication partners. An Author-Recipient-Topic (ART) model was proposed in (McCallum, 2005) for social network analysis. The ART model captures topics and the directed social network of senders and receivers by conditioning the multinomial distribution over topics distinctly on both the author and one recipient of a message. It takes into consideration both author and recipients distinctly, and models the e-mail content as a mixture of topics. The ART model is a Bayesian network that simultaneously models message content, as well as the directed social network in which the messages are sent. In its generative process, for each message an author, ad, and a set of recipients, rd, are observed. To generate each word, a recipient, x, is chosen uniformly at random from rd, and then a topic z is chosen from a multinomial topic distribution φad,x, where the distribution is specific to the author-recipient pair (ad, x). Finally, the word w is generated by sampling from a topic-specific multinomial distribution θz. The result is that the discovery of topics is guided by the social network in which the collection of message text was generated.
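The undirected construction of (Chapanond et al., 2005) can be sketched as follows; the (sender, recipient) input format and the default threshold values are assumptions of the sketch, and the networkx library is assumed.

```python
import networkx as nx
from collections import Counter

def build_email_graph(messages, threshold1=30, threshold2=5):
    """messages: iterable of (sender, recipient) pairs, one per e-mail.
    Connect two individuals iff they exchanged at least threshold1 e-mails
    in total and each sent at least threshold2 e-mails to the other."""
    sent = Counter(messages)                  # (sender, recipient) -> e-mail count
    G = nx.Graph()
    G.add_nodes_from({p for pair in sent for p in pair})
    for (a, b), n_ab in sent.items():
        if a >= b:                            # visit each unordered pair once
            continue
        n_ba = sent.get((b, a), 0)
        if n_ab + n_ba >= threshold1 and min(n_ab, n_ba) >= threshold2:
            G.add_edge(a, b, exchanged=n_ab + n_ba)
    return G
```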
Web Page
To apply social network analysis to Web site data like Web pages, the first step is to harvest related Web sites (Zhou et al. 2005, Reid et al. 2005). A focused Web crawler can be employed to automatically discover and download relevant Web sites by following the HTML links from a starting set of pages (Albertsen 2003). Metadata can then be extracted and used to rank the Web sites in terms of relevance. Once the Web sites are harvested, Web link and content analysis are used to study how a group is using the Web. Web link analysis is based on hyperlink structure and is used to discover hidden relationships among communities (Gibson et al. 1998). There are two classes of Web link analysis studies: relational and evaluative (Borgman et al. 2002). Relational analysis gives insight into the strength of relations between Web entities in particular Web sites, while evaluative analysis reveals a Web entity's popularity or quality level. According to (Zhou et al. 2005), which analyses the links and content of Web sites owned by US domestic extremist groups, relational analysis works well for terrorism research because it illuminates the relations between extremist Web sites and organizations. Analyzing hyperlink structures sheds light on extremist and hate site infrastructures. It also reveals hidden communities in the relationships among different Web sites for the same group and the interactions with other extremist group sites. In addition, hyperlinks between sites constitute an important cue for estimating the content similarity of any Web site pair in a collection. Web content analysis is the systematic study of site content. Demchak et al. proposed a well-defined methodology for analyzing communicative content in government Web sites (Demchak 2000).
They developed a Web site attribute system tool to support their work, which basically consists of a set of high-level attributes, such as transparency and interactivity. Each high-level attribute is associated with a second layer of more refined low-level attributes. To better understand domestic extremists’ uses and goals for the Web, Zhou et al. developed a similar attribute-based coding scheme for methodically capturing the content of Web pages (Zhou et al. 2005).
News Article
Bradford provided an example of identifying relationships among terrorist-related entities from news articles in (Bradford 2006). Entities may include people, organizations, and locations. Relationships may include communications, financial transactions, and physical proximity. Items of interest may consist of complex combinations of entities and relations, such as events. The entities analyzed in (Bradford 2006) are individual terrorists, terrorist groups, targets, and weapons. To identify information of interest in large collections of news articles, entity extraction and the matrix decomposition technique of latent semantic indexing (LSI) can be combined to allow direct analysis of entity-entity relationships. LSI is a well-established method for extracting relationship information from large collections of text (Berry 1999, Deerwester 1988). For any two entities present in the text collection of interest, the proximity of their representation vectors in the LSI space provides a direct measure of their degree of contextual association in the collection. This feature can be exploited to provide rapid overviews of the aggregate implications of the relations present in large collections of unstructured information. In (Bradford 2006), two-dimensional cross-correlation matrices are used to highlight potentially interesting associations between entities. Identification of such associations allows users to focus their attention on information in the collection that may be of particular interest. This is of great utility in facilitating rapid overview of large quantities of text.
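A hedged sketch of this kind of association analysis: it uses scikit-learn's TruncatedSVD as a stand-in for LSI (a choice of this sketch, not of the cited work) and assumes an entity-by-document co-occurrence matrix has already been built by an entity extraction step; k must be smaller than the number of documents.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def entity_associations(entity_doc_matrix, entity_names, k=100):
    """Project entities into a k-dimensional LSI-style space; vector proximity
    approximates the degree of contextual association in the collection."""
    svd = TruncatedSVD(n_components=k, random_state=0)
    vectors = svd.fit_transform(entity_doc_matrix)     # one row per entity
    sim = cosine_similarity(vectors)                   # entity-entity association matrix
    return {(entity_names[i], entity_names[j]): sim[i, j]
            for i in range(len(entity_names))
            for j in range(i + 1, len(entity_names))}
```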
Abnormal Detection
Abnormal detection tries to find objects which appear to be inconsistent with the remainder of the object set. It can be used to detect illicit activities or false identities to assist police and intelligence investigations. The many algorithms for abnormal detection can be classified into three categories (Han et al. 2006): the first category is analogous to unsupervised clustering, like k-NN, k-means, CLARANS, unsupervised neural networks, etc.; the second category is analogous to supervised classification, which requires a priori data knowledge, like regression methods, PCA, SVM, supervised neural networks, etc.; and the third category is analogous to semi-supervised recognition or detection, which is usually a combination of the first two types of methods. An example of applying text mining to abnormal detection is described next.
Web Page Access Analysis
By monitoring and analyzing the Web pages accessed by Web surfers, it is possible to infer a surfer's areas of interest. Thus terrorists may be identified by a real-time Web traffic monitor when they access terrorist-related information on the internet. Behavior-based anomaly detection models (Elovici et al. 2004, Elovici et al. 2005, Last et al. 2003) were proposed to use the content of Web pages browsed by a specific group of users as an input
for detecting abnormal activities. The basic idea is to maintain information about the interests of "normal" users in a certain environment (such as a campus), and detect users that are dissimilar to the normal group under a defined threshold of similarity. The models have two phases, the learning phase and the detection phase. During the learning phase, the Web traffic of a group of users is recorded and transformed into an efficient representation for further analysis. The learning phase is applied in the same environment where the detection would later be applied, in order to learn the "normal" content of users in the environment. The collected data is used to derive and represent the group's areas of interest by applying a clustering technique. During the detection phase, users dissimilar to the normal users are detected. The detection is performed by transforming the content of each page accessed by a user into a vector representation that can be compared against the representation of the groups for dissimilarity. The minimum number of suspicious accesses required in order to issue an alarm about a user is defined by the detection algorithm. The detection is performed on-line and should therefore be efficient and scalable.
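A minimal two-phase sketch of such a behavior-based model, using TF-IDF vectors and k-means from scikit-learn (a choice of this sketch, not of the cited models); the number of clusters and the similarity threshold are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def learn_normal_profile(normal_pages, n_clusters=10):
    """Learning phase: cluster the pages browsed by 'normal' users and keep
    the cluster centroids as the group's areas of interest."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(normal_pages)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return vectorizer, km.cluster_centers_

def suspicious_access_count(pages, vectorizer, centroids, sim_threshold=0.1):
    """Detection phase: count page accesses dissimilar to every centroid;
    the caller raises an alarm after a minimum number of such accesses."""
    sims = cosine_similarity(vectorizer.transform(pages), centroids)
    return int((sims.max(axis=1) < sim_threshold).sum())
```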
Topic Discovery
The text corpus to be mined is usually very large, containing thousands to billions of documents. To prevent security incidents or alleviate their severity, topic discovery may identify the theme of documents so that investigators can acquire the most desired documents without literally reading them. Topic discovery techniques include summarization, classification, clustering, and information retrieval. Summarization can greatly reduce the length of a document while keeping its main points. Classification methods (e.g. decision trees, Bayes, SVM, etc.) assign a document to one of the predefined categories according to its word frequencies. Clustering methods group similar documents without any a priori knowledge about the documents. Typical clustering methods include k-means, BIRCH, DBSCAN, CLIQUE, etc. Readers may refer to (Han et al. 2006) for details of the aforementioned classification and clustering methods. Information retrieval accesses the documents based on user queries. Latent Semantic Indexing (Berry 1999, Deerwester 1988) is one of the widely used information retrieval methods. We use two examples to illustrate techniques used in security applications.
E-Mail Surveillance
Berry et al. applied a non-negative matrix factorization approach for the extraction and detection of concepts or topics from e-mail messages (Berry 2005). Given an e-mail collection, Berry et al. encoded sparse term-by-message matrices and used a low rank non-negative matrix factorization algorithm to preserve natural data non-negativity and avoid the subtractive basis vector and encoding interactions present in techniques such as principal component analysis. Non-negative matrix factorization (NMF) has recently been shown to be a very useful technique in approximating high dimensional data where the data are comprised of non-negative components (Lee and Seung 1999, Paatero and Tapper 1994). Given a collection of e-mail messages expressed as an m×n term-by-message matrix X, where each column is an m-dimensional non-negative vector of the original collection (n vectors), the standard NMF problem is to find two new reduced-dimensional matrices W and H, in order to approximate the original matrix X by the product WH in terms of some
metric. Each column of W contains a basis vector while each column of H contains the weights needed to approximate the corresponding column in X using the basis from W. The dimensions of matrices W and H are m×r and r×n, respectively. Usually, the number of columns in the new (basis) matrix W is chosen so that r is much smaller than both m and n.

sign(c) = \begin{cases} +1, & c > 0 \\ -1, & c \le 0 \end{cases}    (7)
While the definition of a single threshold value allows the use of the standard measures defined previously, more informative measures of overall performance can be obtained by studying the performance with multiple threshold values. The precision-recall curve is the result of plotting the precision levels at recall levels obtained by varying the values of θ, and the curve illustrates the trade-off between precision and recall (see, e.g., Manning & Schütze, 2000, pp. 536, 537). To summarize performance over the precision-recall curve, eleven-point average precision can be used. This measure is defined as the average of the precisions obtained with the threshold values θ at which recall has the values of 0.0, 0.1, 0.2, ... , 1.0. All the previously considered performance evaluation measures are sensitive to class distribution, that is, the relative number of positive and negative instances. This can be problematic if, for example, the distribution is different with the gold standard than with real data. The area under ROC curve (AUC) is a measure that is invariant to class distribution (see, e.g., Fawcett & Flach, 2005). Here, ROC refers to the receiver operating characteristic curve which is a result of plotting the recall, also known as the true positive rate, at certain levels of the false positive rate FP/(FP +TN) obtained by varying the values of θ. Thus, unlike the precision-recall curve, ROC incorporates the number of true negatives. Further, AUC is equivalent to the Wilcoxon-Mann-Whitney statistic which is the probability that, given a randomly chosen positive example and a randomly chosen negative example, the classifier will correctly distinguish them (Cortes & Mohri, 2004). This gives a simple method to calculate AUC:
AUC(Y, f'(X)) = \frac{1}{m_+ m_-} \sum_{y_i = +1} \sum_{y_j = -1} \delta\big(f'(x_i) - f'(x_j) > 0\big)    (8)
where m+ and m– are the numbers of positive and negative instances, respectively. AUC has gained popularity especially in machine learning where classification tasks are often considered. Further discussion about this topic can be found, for example, in the Proceedings of the Workshop on ROC Analysis in Machine Learning (see, e.g., Lachiche, Ferri, & Macskassy, 2006). The appropriate choice of a performance measure for text classification depends on the task and data in question. When the task requires specific classification decisions, accuracy as well as precision and recall are common choices; here the availability of TN is one deciding factor. When using precision and recall, it is important to consider both values or use a combination such as the F measure. With algorithms producing real-value outputs, measures that take performance into account over the entire curve, such as precision-recall curve analysis and AUC, can be more sensitive than precision and recall. Further information regarding performance evaluation measures for text classifiers can be found, for example, in Baldi, Brunak, Chauvin, Andersen, and Nielsen (2000) and Caruana and Niculescu-Mizil (2004).
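A direct, minimal implementation of Equation 8 follows; labels are assumed to be +1/−1 and, as in the equation, tied scores receive no credit (some implementations instead credit ties with 0.5).

```python
def auc_wmw(y_true, scores):
    """AUC as the Wilcoxon-Mann-Whitney statistic (Equation 8): the fraction of
    positive/negative pairs whose scores are ordered correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == +1]
    neg = [s for y, s in zip(y_true, scores) if y == -1]
    correct = sum(1 for p in pos for n in neg if p > n)   # delta(f'(x_i) - f'(x_j) > 0)
    return correct / (len(pos) * len(neg))

# example: y = [+1, +1, -1, -1], f'(x) = [0.9, 0.4, 0.6, 0.1]  ->  AUC = 0.75
```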
Information Extraction
The aim in information extraction is to identify specific types of events and relations from natural language text and to represent them in a structured, machine-readable form. From the performance measurement point of view, information extraction is similar to classification tasks, and hence many of the measures defined previously are also relevant in this task domain. However, for the measure definitions to apply, the interpretation of TP, TN, FP, and FN must be slightly revised, as explained in the following. The extracted information is typically captured in a template containing slots for the various participants in the events or relationships of interest: for example, in an information extraction task on relationships of people with organizations, an employee_of template would contain slots at least for the employed person and the employing organization. An information extraction system would then aim to correctly fill one such template for each statement of interest found in the source text. Fills then take the role of system output and correct fills are interpreted as TP, missing fills as FN, and incorrect or spurious (extra) fills as FP. Unlike in pure classification tasks, in information extraction it is typically not possible to clearly define TN since any part of the input is a potential fill. Thus, calculating evaluation measures incorporating TN is not possible. By contrast, TP, FP, and FN can be calculated, and therefore precision, recall and derived measures are well defined. Measures based on precision and recall have been widely used in information extraction system evaluations such as in the MUC (Message Understanding Conferences), a major series of competitive information extraction evaluations carried out between 1987–97 (Grishman & Sundheim, 1996; Chinchor, 1997). However, because a fill can contain multiple words from the input, simple classification into TP and FP does not necessarily give a sensitive evaluation of system performance. To address this issue, the precision and recall measures can be extended to include the notion of partially correct cases as follows
P' = \frac{TP + 0.5\,PP}{TP + FP + PP}    (9)
and
R' = \frac{TP + 0.5\,PP}{TP + FN + PP}    (10)
where PP is the number of partially correct positive cases (Chinchor & Sundheim, 1993) – note that this definition of the measures departs from the traditional contingency matrix model. An augmented F measure can be calculated from these precision and recall measures as defined in the section on text classification. For the MUC evaluation, the identification of partially correct cases requires human judging, but in some cases near matches can be determined fully automatically, for example in cases where the recognized boundaries of a fill are either partially correct or close to correct. For an example of an evaluation incorporating automatically determined partially correct outputs, see Hirschman, Yeh, Blaschke, and Valencia (2005). Information extraction templates commonly contain many slots for various aspects of the event, and often the information for all slots cannot be found in the input. In this case both the correct answers and the information extraction systems should leave the appropriate slots unfilled. In addition to lacking the notion of partial matches, precision and related measures do not distinguish between template slots that are filled with incorrect information and slots that are filled but should have been left blank, a distinction that can be of interest in information extraction system evaluation. To address this issue, measures explicitly measuring undergeneration, overgeneration and substitution can be used. For definitions of these and related measures, see Chinchor and Sundheim (1993). In addition to MUC, competitive evaluations of key information extraction subtasks such as named entity recognition have also been performed, for example, as a shared task in CoNLL (Conference on Computational Natural Language Learning) conferences (see, e.g., Tjong Kim Sang & De Meulder, 2003) and multiple tracks in the TREC (Text REtrieval Conference) conferences (see, e.g., Voorhees & Buckland, 2006). The appropriate choice of performance measure for information extraction depends on the task and data, with specific competitive evaluations, specialized datasets and information extraction subtasks often defining their own variants of the standard measures. However, the precision and recall measures remain popular and, when properly applied, are good intuitive indicators of information extraction system performance. When precision and recall are used, the F measure is a good default choice for distilling performance measurement into a single number. Further information on the evaluation of information extraction systems can be found, for example, in the studies of Grishman and Sundheim (1996) and Chinchor (1997).
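Equations 9 and 10 reduce to a few lines of arithmetic; the sketch below also reports a balanced F measure computed from P' and R', which is one common way (an assumption of the sketch) of combining them.

```python
def ie_scores(tp, fp, fn, pp):
    """Precision and recall extended with partially correct fills (Equations 9-10)."""
    p = (tp + 0.5 * pp) / (tp + fp + pp)
    r = (tp + 0.5 * pp) / (tp + fn + pp)
    f = 2 * p * r / (p + r) if (p + r) else 0.0   # balanced F measure
    return p, r, f

# example: 8 correct, 2 spurious, 3 missing and 2 partially correct fills
# give P' = 9/12 = 0.75 and R' = 9/13 ≈ 0.69
```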
Text Segmentation
In text segmentation, the aim is to place boundaries into the input so that it is divided into smaller semantically coherent units. This can be viewed as a classification task where the choice is whether or not to place a boundary between two elements of the input X = (x1,...,xm). Here the elements xi are, for example, individual words, sentences or paragraphs. The elements xi define segmentation units since a boundary can be placed only between two elements.
Because text segmentation can be treated as a classification task, many performance evaluation measures used in classification can also be used in segmentation. The use of precision and recall measures is especially common (Hirschman & Mani, 2003, p. 416). Each segment boundary produced by the system is interpreted as correctly placed (TP) or incorrectly placed (FP). FN corresponds to the number of undetected boundaries, and TN to the potential segment breaks that are correctly left without a boundary. Precision, recall and related measures can thus be defined as previously. However, these measures are not sensitive to situations where the segment boundary is placed close to the position of the correct boundary but not exactly to the right place. The problem here is analogous to the situation with partial matches described in the section on information extraction. As a result, text segmentation specific measures have been developed.
The Pk measure proposed by Beeferman, Berger, and Lafferty (1999) is a popular text segmentation specific performance evaluation measure (see, e.g., Pevzner & Hearst, 2002). Originally, Pk was derived from another measure called PD, which is the probability that two elements drawn randomly from the input are correctly identified as belonging or not to the same segment in the system output with respect to the gold standard. Before proceeding to Pk, PD is briefly described. The PD measure is a probabilistic measure, and hence its values are between 0 and 1, with larger values indicating better performance. The formal definition of PD is

P_D(Y, f(X)) = \sum_{1 \le i \le j \le m} D(i, j)\, \delta\big(\delta_Y(i, j) = \delta_{f(X)}(i, j)\big)    (11)

where m is the number of segmentation units, δ(e) is 1 when the expression e is true and 0 otherwise, and δ_S(i, j) is 1 if xi and xj belong to the same segment in the segmentation S and 0 otherwise. As Beeferman et al. (1999) explain, the function D(i, j) is a distance probability distribution over the set of possible distances between randomly chosen xi and xj, and there are several plausible distributions that could be used for D(i, j). The distance probability distribution used in Pk is one of these options, and it is described in the following.
In the Pk measure, D(i, j) is fixed so that the computation of the measure can be viewed as sweeping a window with a fixed size k across the input (Figure 2).

Figure 2. A window of size k sweeping across the input x1, ..., xm. System output (resp. gold standard) segment boundaries are marked with a dotted (resp. solid) line between the elements xi.

In practice, the window size k is set to be half of the average segment size in the gold standard. Following the explanation given by Pevzner and Hearst (2002), Pk is calculated by determining for each window location whether the outermost elements assigned by the window are incorrectly assigned to the same segment or to different segments in the system output. When an inconsistency is found, the value of Pk is increased. Finally, the value of Pk is normalized between 0 and 1 by dividing by the number of comparisons taken. For Pk, smaller values
indicate better performance – note the difference to the interpretation of PD. Pk can be interpreted as the probability that a randomly chosen pair of elements that are k segmentation units apart is assigned to the same segment in either the system output or in the gold standard but not in both. Several properties of the Pk measure have been, however, criticized. Firstly, a text segmentation performance evaluation measure should take into account the distance between the system output boundary and the actual boundary. However, Pk achieves the goal in most cases only with false positives, which causes false negatives to be often penalized more heavily than false positives. In addition, Pk allows some segmentation errors to go unpenalized, it penalizes slightly erroneous boundary placements more than pure false positives of equal magnitude and it is affected by variation in segment size distribution. For details of the criticism, see the study of Pevzner and Hearst (2002). Another commonly used text segmentation performance evaluation measure called the WindowDiff measure (Pevzner and Hearst, 2002) modifies Pk in order to remedy the mentioned problems. As Pk, WindowDiff also moves a fixed-sized window across the segmented text, but this measure compares for each window position the number of boundaries within the window in the system output with the respective number in the gold standard. The measure penalizes the segmentation system whenever these numbers differ. Finally, the measure is normalized by dividing by the number of comparisons taken, that is, m – k. More formally,

WindowDiff(Y, f(X)) = \frac{1}{m - k} \sum_{i=1}^{m-k} \big( |b_Y(i, i+k) - b_{f(X)}(i, i+k)| > 0 \big)    (12)
where b_S(i, j) is the number of boundaries between positions i and j in the segmentation S. The WindowDiff measure has been criticized for weighting with the same normalization constant both the cases where the number of segment boundaries within the window is larger in the gold standard than in the system output and the cases where it is smaller. As Georgescul, Clark, and Armstrong (2006) explain in depth, this weighting causes false negative boundaries to be penalized less than false positive boundaries, even though the two error types should be weighted equally. While precision, recall and related measures can also be applied to text segmentation, the segmentation-specific measures Pk and WindowDiff are both better suited to the task and widely used. For system performance comparison purposes, it may be appropriate to present results separately for both Pk and WindowDiff. However, the choice of WindowDiff can be supported on the grounds that it is a refined version of Pk: for example, if the system output segmentation often contains boundaries near each other, WindowDiff may be a more appropriate choice because Pk often misses or under-penalizes mistakes in small segments (Pevzner and Hearst, 2002). Further information regarding performance evaluation measures for text segmentation systems can be found, for example, in Beeferman et al. (1999), Pevzner and Hearst (2002) and Georgescul et al. (2006).
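To make the window-based computation concrete, the following Python sketch implements Pk and WindowDiff for segmentations represented as sequences of segment labels, one label per segmentation unit. The representation, the helper functions and the default choice of k (half the average gold-standard segment size) are illustrative assumptions, not part of the published definitions.

```python
# A minimal sketch, assuming each segment carries a distinct label and both
# sequences have one label per segmentation unit (e.g. per sentence).

def default_k(gold):
    # half of the average segment size in the gold standard
    return max(1, round(len(gold) / (2 * len(set(gold)))))

def p_k(gold, output, k=None):
    """Pk of Beeferman et al. (1999); smaller values indicate better performance."""
    k = k or default_k(gold)
    m = len(gold)
    errors = sum(
        (gold[i] == gold[i + k]) != (output[i] == output[i + k])
        for i in range(m - k)
    )
    return errors / (m - k)

def window_diff(gold, output, k=None):
    """WindowDiff of Pevzner and Hearst (2002); smaller values are better."""
    def boundaries(seg, start, end):
        # number of segment boundaries between positions start and end
        return sum(seg[t] != seg[t + 1] for t in range(start, end))
    k = k or default_k(gold)
    m = len(gold)
    errors = sum(
        boundaries(gold, i, i + k) != boundaries(output, i, i + k)
        for i in range(m - k)
    )
    return errors / (m - k)
```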
Ranking measures

This section considers measures appropriate for determining the performance of systems that produce a ranked, or ordered, output. A typical example is a Web search engine, which ranks the documents by their relevance. First, general ranking measures are discussed, and then an overview of domain-specific performance evaluation measures for information retrieval is given.
Text Ranking

In text ranking, the aim is to produce an ordering of the input instances with respect to some criterion. In the case considered here, input instances are assigned values that specify this ordering: these output values are typically the ordinal numbers 1, 2, …, m, where m is the number of instances. Such an ordering is related to the order implied by the real-valued outputs discussed in the section on text classification. However, in classification, the final system output labels, even when numeric (e.g., –1 and +1), do not imply a ranking. Formally, the ranking system output for an input X is f(X) = (f(x_1), ..., f(x_m)) ∈ N^m and the gold standard for X is Y = (y_1, ..., y_m) ∈ N^m, where N is the set of natural numbers.

It is essential that measures used to evaluate ranking systems capture the ranking characteristic. Common ranking measures are based on determining the extent to which the ranking produced by the system agrees with the gold standard, typically by determining their correlation. General measures of rank correlation are thus applicable to evaluating the performance of ranking systems. Two common measures of rank correlation are Spearman's ρ and Kendall's τ. However, the original definitions of these measures do not take into account the possibility of tied ranks, where two or more instances have the same rank in either the system output or the gold standard. In text mining, where the number of instances to be ranked is often large, it can be difficult in practice to produce rankings where no ties (or only a small number of ties) are present. When ties are present, ρ and τ are not appropriate measures of agreement (Siegel & Castellan, 1988, pp. 241, 249; Kendall & Gibbons, 1990, pp. 40-42). In order to address the issue of tied ranks, tie-corrected versions of both measures have been defined. The definitions of these measures, Spearman's ρ_b and Kendall's τ_b, are presented in the following.

Spearman's rank correlation coefficient ρ_b is defined as

\rho_b = \frac{(m^3 - m) - \frac{1}{2}(T_{f(X)} + T_Y) - 6 \sum_{i=1}^{m} d_i^2}{\sqrt{(m^3 - m)^2 - (T_{f(X)} + T_Y)(m^3 - m) + T_{f(X)} T_Y}},    (13)

where m is the number of instances,

d_i = f(x_i) - y_i,    (14)

T_r = \sum_{i=1}^{g} (t_i^3 - t_i),    (15)

and g is the number of groupings of different tied ranks and t_i is the number of tied ranks in the ith grouping in ranking r (Siegel & Castellan, 1988, p. 239). When two rankings are identical, ρ_b = 1; when one ranking is the reverse of the other, ρ_b = –1; otherwise, ρ_b ∈ (–1, 1).

Kendall's rank correlation coefficient τ_b is defined as

\tau_b = \frac{Z(Y, f(X))}{\sqrt{Z(Y, Y)\, Z(f(X), f(X))}},    (16)

where

Z(Y, f(X)) = \sum_{i,j=1}^{m} \operatorname{sign}(y_i - y_j)\, \operatorname{sign}(f(x_i) - f(x_j))    (17)

and

\operatorname{sign}(c) = \begin{cases} 1, & c > 0 \\ 0, & c = 0 \\ -1, & c < 0 \end{cases}    (18)
Note that the definition of the function sign(c) differs from the one given in the section on text classification due to the presence of ties. For the original formulation of the τ_b measure, see, for example, Kendall and Gibbons (1990, p. 40). Similarly to ρ_b, τ_b = 1 when two rankings are identical, τ_b = –1 when one ranking is the reverse of the other, and τ_b ∈ (–1, 1) otherwise. However, the values of ρ_b and τ_b are not directly comparable, although a relationship between the measures exists (Siegel & Castellan, 1988, p. 251). There is no generally applicable reason to choose one of the ρ and τ measures over the other, and the decision between them should be based on the properties of the task, system, and data in question. For example, Kendall and Gibbons (1990, p. 28) explain that in measure selection one should take into account that τ is not sensitive to how near or far two untied instances are in the ranking, whereas ρ gives greater weight to differences between the ranks of instances if they are separated by more intervening members of the ranking. While the tie-corrected version of either measure is generally appropriate for the evaluation of text ranking systems, the choice should follow the standard practice in the ranking task at hand. Further information about the rank correlation coefficients can be found, for example, in Siegel and Castellan (1988, pp. 235-254) and Kendall and Gibbons (1990).
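For illustration, Equations (13)-(18) can be implemented directly. The Python sketch below assumes that tied instances share average (mid) ranks, following the convention of Siegel and Castellan (1988), and that both rankings cover the same m instances; the function and variable names are illustrative.

```python
from collections import Counter
from math import sqrt

def tie_term(ranks):
    """T_r = sum over tied-rank groups of (t^3 - t), Equation (15)."""
    return sum(t ** 3 - t for t in Counter(ranks).values() if t > 1)

def spearman_rho_b(gold, output):
    """Spearman's rho_b, Equation (13); assumes mid-ranks for ties."""
    m = len(gold)
    t_out, t_gold = tie_term(output), tie_term(gold)
    d2 = sum((o - y) ** 2 for o, y in zip(output, gold))   # Equation (14)
    num = (m ** 3 - m) - (t_out + t_gold) / 2 - 6 * d2
    den = sqrt((m ** 3 - m) ** 2 - (t_out + t_gold) * (m ** 3 - m) + t_out * t_gold)
    return num / den

def sign(c):
    # Equation (18)
    return (c > 0) - (c < 0)

def kendall_tau_b(gold, output):
    """Kendall's tau_b, Equations (16)-(17)."""
    def z(a, b):
        return sum(sign(a[i] - a[j]) * sign(b[i] - b[j])
                   for i in range(len(a)) for j in range(len(a)))
    return z(gold, output) / sqrt(z(gold, gold) * z(output, output))
```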
Information Retrieval

In information retrieval, the goal is to retrieve from a large collection of instances those that are relevant to a particular user information need. In this context, the instances, here referred to as documents, are characterized by the document relevance value and the document retrieval value. The document relevance value corresponds to the gold standard and reflects the actual relevance of the document. Similarly, the document retrieval value corresponds to the system output and expresses the degree to which the document is deemed as relevant by the information retrieval system. In order to evaluate an information retrieval system, the retrieval values of the documents are compared to their relevance values. The various performance measures for information retrieval can be categorized based on whether they consider these values to be binary, discrete (also referred to as ranking), or continuous (Demartini & Mizzaro, 2006). Accuracy, precision, recall, and derived measures, which were introduced in the section on text classification, can also be applied to the evaluation of information retrieval systems. These measures fall into the binary-binary category: both relevance and retrieval have two possible values so that each document is treated as either fully relevant or fully irrelevant to the given information need and either retrieved or not retrieved by the information retrieval system. As binary-binary measures thus do not take into account any ordering of the documents, they do not reflect the fact that the retrieved documents must be presented to the user in a particular order. To address this limitation and the ranking nature of the task, measures with discrete or continuous retrieval values can be applied.
The rank correlation coefficients introduced in the section on text ranking are of the discrete-discrete type, that is, both the relevance and retrieval values are treated as discrete ranks. While these measures are more sensitive than binary-binary measures, they do not take into account where in the ranking the system output deviates from the gold standard: misordering of the last documents affects the measure as much as misordering of the first documents. In practical information retrieval applications, however, deviations at the top of the output ranking are more serious than elsewhere, as the user of the system usually inspects only a small number of the highest ranked documents. To address this issue, rank correlation coefficients can be modified to compare only the top n items of the rankings. Such modified coefficients have been applied in information retrieval, in particular to compare the outputs of different information retrieval systems (see, e.g., Fagin, Kumar, & Sivakumar, 2003). The application of rank correlation measures to the evaluation against a gold standard is also complicated by the requirement that all documents must be given a rank, but finding a well-justified ranking for irrelevant documents is typically difficult.

Measures of the binary-discrete type are in many cases most appropriate for the evaluation of information retrieval systems. For binary-discrete measures, the relevance values of the documents are binary, but the performance measures do take into account the order of the documents retrieved by the information retrieval systems. In the following, a number of measures of this type are briefly presented. The average precision over all relevant documents is the average of the precisions obtained at each point when a relevant document has been retrieved. This measure is closely related to the eleven-point average precision, defined in the section on text classification. P@N is the precision at a fixed cut-off point of N retrieved documents. A particular instance of P@N is the R-precision, defined as P@N where N is the number of relevant documents. Finally, the precision-recall curve is used as a means for graphical comparison by superimposing the precision-recall curves of several systems. These measures have been used in the context of the TREC series of events, which have played a central role in the development and standardization of information retrieval system performance measures and protocols since their inception in 1992. For evaluating participating information retrieval systems, the TREC events provide shared datasets, retrieval tasks and evaluation protocols (Voorhees & Buckland, 2006).

The measures discussed so far are applicable to traditional text retrieval, where the documents are retrieved in their entirety. Recently, there has been increasing interest in the retrieval of structured documents, in particular documents in XML format. XML retrieval is characterized by the ability to retrieve only certain document components, represented by branches in the document XML tree. The user's information need can then be specified as a combination of content and structural requirements, focusing the retrieval only on some document components. Relevance judgments therefore need to be given separately for the various components of the documents, taking into account the document structure.
To capture the specifics of XML retrieval, such as the retrieval overlap occurring when the information retrieval system retrieves both a document component and its parent in the XML tree, a set of discrete-discrete type measures called XCG (Extended Cumulated Gain) has been designed. The XCG measures are based on the notion of cumulated gain, a sum of relevance values up to a given rank. Here, the relevance value of a document component is defined as a function of a discrete exhaustivity value and a continuous specificity value which together capture the gold standard relevance judgment: exhaustivity represents the extent to which a given document component covers the topic of request and specificity represents the degree of focus on the given topic in the component. By defining specialized relevance functions, the XCG measures can be adapted to take into account the particular properties of individual XML retrieval tasks (Kazai & Lalmas, 2006) as well as generalized to non-XML structured document
retrieval. The XCG measures have been designed in the context of the INEX (Initiative for the Evaluation of XML retrieval) series of events that have been organized since 2002 (Malik, Kazai, Lalmas, & Fuhr, 2006). INEX plays a similar role in the competitive evaluation of XML retrieval systems as TREC does in traditional text retrieval, offering a standard for structured document retrieval evaluation. In summary, a great number of measures have been proposed for the evaluation of information retrieval systems. Demartini and Mizzaro (2006) have compared 44 measures and give a broad view of the trends in information retrieval performance evaluation. In particular, they illustrate that the earlier binary-binary type measures have been surpassed by measures that consider discrete or continuous retrieval values; these constitute the majority of currently used measures in information retrieval. For further information on information retrieval performance measures, see, for example, Demartini and Mizzaro (2006), Voorhees and Buckland (2006), and Kazai and Lalmas (2006).
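As an illustration of the binary-discrete measures discussed above, the sketch below computes P@N, R-precision and average precision from a ranked list of document identifiers and a set of relevant documents, plus a simple cumulated gain over graded relevance values in the spirit of (though not identical to) the XCG measures. The data representations are assumptions made for the example.

```python
def precision_at(ranked, relevant, n):
    """P@N: precision over the top n retrieved documents."""
    return sum(doc in relevant for doc in ranked[:n]) / n

def r_precision(ranked, relevant):
    """R-precision: P@N with N set to the number of relevant documents."""
    return precision_at(ranked, relevant, len(relevant))

def average_precision(ranked, relevant):
    """Average of the precisions obtained at each rank where a relevant
    document is retrieved; relevant documents never retrieved contribute 0."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def cumulated_gain(ranked, relevance, k):
    """Sum of (graded) relevance values over the top k retrieved items."""
    return sum(relevance.get(doc, 0) for doc in ranked[:k])
```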
Text clustering

The aim of clustering is to organize the input instances into groups, or clusters, so that instances that are similar according to some criterion cluster together. By contrast to classification, in clustering the set of possible classes is not known. An example is news aggregation, where news texts from multiple sources are grouped according to the event they describe: in this task, the set of events cannot be known in advance. This section focuses on measures for the more common case of hard clustering, where each instance belongs to exactly one cluster. The evaluation of soft clustering, where one instance can belong to more than one cluster with different degrees of membership, is not discussed further in this section. Other key issues falling outside the scope of the section include defining the optimal number of clusters, evaluating the clustering tendency of the data, as well as statistical hypothesis testing related to the performance evaluation measures. For more information on these, see, for example, Theodoridis and Koutroumbas (1999, pp. 544-548, 557-570) and Milligan (1996).

Clustering systems are divided into two main types: partitional and hierarchical. Partitional systems produce a single clustering or partition, whereas hierarchical systems produce a tree structure defining a nested hierarchy of clusters. In the following, all the outputs are considered partitions unless otherwise mentioned. Formally, the input of the clustering system is denoted X, and the output is f(X) = C = {c_1, c_2, ..., c_{N_f}}, where c_i, i ∈ {1, 2, ..., N_f}, are the output clusters. Similarly, a gold standard clustering is denoted Y = {y_1, y_2, ..., y_{N_Y}}. Note that the numbers of clusters N_f and N_Y do not have to be equal.

Performance evaluation measures for clustering are based on measuring either the degree of agreement between the clustering output and a gold standard (external measures), or the extent to which the output reflects the similarities in the input (internal measures). In clustering, performance evaluation is often called cluster validity assessment, and many of the measures are called indices. External measures are based on comparing clustering system output against a gold standard clustering. A number of measures can be defined through a contingency matrix analogous to that used to define classification measures (Table 1). By contrast to the classification contingency matrix, here the matrix is defined for a pair of inputs. For each pair it is determined whether the two inputs belong to the same cluster (S) or to different clusters (D) in a given clustering. Then, SS is the number of pairs that belong to the same cluster y_i in the gold standard and to the same cluster c_j in the system output. Similarly, SD is the number of pairs that belong to the same cluster y_i in the gold standard clustering but to different clusters in the system output. DS and DD are defined similarly, as summarized in a contingency matrix for clustering

\begin{pmatrix} SS & SD \\ DS & DD \end{pmatrix},    (19)
where the columns (resp. rows) represent the system output (resp. gold standard). Note that in the absence of labels, cj and yi do not need to correspond to each other in any other sense than containing the same two inputs. Given the previously defined quantities, the Rand index, a simple clustering performance measure analogous to accuracy, is defined as
RI = \frac{SS + DD}{SS + SD + DS + DD},    (20)
that is, the fraction of all pairs that are similarly clustered. As the number of inputs grows large, DD may overwhelm the other quantities, causing RI to approach one (Milligan, 1981). Measures that exclude DD do not have this property: two such measures are the Jaccard coefficient, defined as
JC = \frac{SS}{SS + SD + DS},    (21)
and the Fowlkes and Mallows index, defined as
FMI = \frac{SS}{\sqrt{(SS + SD)(SS + DS)}}.    (22)
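The following small sketch shows how the pair counts SS, SD, DS and DD, and thereby the indices in Equations (20)-(22), can be computed when clusterings are represented as label sequences over the same instances; this representation is an assumption made for the example.

```python
from itertools import combinations
from math import sqrt

def pair_counts(gold, output):
    """Count instance pairs by same/different cluster membership in each clustering."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_out = output[i] == output[j]
        if same_gold and same_out:
            ss += 1
        elif same_gold:
            sd += 1
        elif same_out:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(gold, output):
    ss, sd, ds, dd = pair_counts(gold, output)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_coefficient(gold, output):
    ss, sd, ds, _ = pair_counts(gold, output)
    return ss / (ss + sd + ds)

def fowlkes_mallows(gold, output):
    ss, sd, ds, _ = pair_counts(gold, output)
    return ss / sqrt((ss + sd) * (ss + ds))
```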
For further discussion of measures derived from the contingency table and their variants, see Hubert and Arabie (1985) and Theodoridis and Koutroumbas (1999, pp. 548-551). The measures RI, JC, and FMI are symmetric, that is, they make no distinction between the gold standard and the system output clusterings. They can thus be applied also to compare two different system output clusterings. In the following, a number of asymmetric measures are defined in which one of the clusterings is taken to be the gold standard Y. The standard precision and recall measures are defined for each pair of clusters (y_i, c_j) as follows:

P(y_i, c_j) = \frac{m_{c_j}^{y_i}}{m_{c_j}},    (23)

R(y_i, c_j) = \frac{m_{c_j}^{y_i}}{m_{y_i}},    (24)
Figure 3. An illustration of the purity and completeness of the cluster c_j
where m_{c_j}^{y_i} is the number of instances in cluster c_j that belong also to cluster y_i, and m_c is the total number of instances in cluster c. For each output cluster c_j, purity and completeness are the precision and recall values for the predominant gold standard cluster y_i, that is, the cluster for which m_{c_j}^{y_i} is maximized. For example, in Figure 3, the cluster c_j overlaps two gold standard clusters, y_i and y_k, of which y_i is predominant; the purity of c_j is m_{c_j}^{y_i} / m_{c_j} = 4/5 and the completeness of c_j is m_{c_j}^{y_i} / m_{y_i} = 4/6. Overall purity and completeness for all output clusters are then defined as the sum of the per-cluster measures, weighted by cluster size (Sedding & Kazakov, 2004). Note that as for precision and recall, there is a tradeoff between purity and completeness, and both individual measures can be trivially optimized, purity by defining single-instance clusters and completeness by having a single cluster containing all instances. Thus, these measures must always be considered together. The F measure combines precision and recall into a single value representing performance. The F measure F(y_i, c_j) of each pair (y_i, c_j) can be calculated with the definitions of precision and recall using the formula (5) given in the section on text classification, and the overall F measure of the entire clustering is then defined as the weighted sum of the F measures for the best matching output cluster, that is,

F(C) = \sum_{i=1}^{N_Y} \frac{m_{y_i}}{m} \max_{j=1,2,\ldots,N_f} F(y_i, c_j),    (25)
where m is the number of input instances. Internal measures are based on measuring the extent to which the obtained clustering matches the information inherent in the input data, represented by the proximity matrix defining the similarity (or dissimilarity) of the inputs. These measures can thus be defined without reference to a gold standard, a property that is particularly important in clustering, where knowledge of the correct structure is often not available. Compactness and separation are two central but opposing concepts in internal measures: compactness refers to the distance of instances within a cluster, and separation to the distance between clusters. Compactness tends to decrease and separation increase with increasing cluster size; representative measures of overall clustering performance must thus incorporate both aspects. The Dunn index is a simple, intuitive measure with this property: it is defined as the ratio of the minimum distance between two instances in different clusters to the maximum distance between two instances in a single cluster
(Theodoridis and Koutroumbas, 1999, pp. 562-563). However, this measure is very sensitive to variation in the input instances, and a number of variant Dunn-like measures addressing this issue have been defined (Bezdek & Pal, 1998). Numerous other measures based on measuring ratios of separation and compactness have also been introduced; popular measures include the Davies-Bouldin index (Theodoridis and Koutroumbas, 1999, pp. 563-564) and Silhouette width (Rousseeuw, 1987). In addition, correlation coefficients can be applied as internal measures to define the correlation between the proximity matrix and a similarity matrix defined by the clustering. This approach is particularly suitable for the evaluation of hierarchical clusterings, where the similarity matrix is taken to be the cophenetic matrix, which contains the levels in the hierarchy at which two instances are merged in a cluster for the first time. Common choices of correlation coefficients for this purpose include the cophenetic correlation coefficient as well as Hubert's Γ and its normalized variant (Theodoridis and Koutroumbas, 1999, pp. 550, 551, 554, 556). Further, in a comparative evaluation of 30 internal measures, the Goodman-Kruskal γ and point-biserial coefficient have been found to be highly effective for clustering evaluation (Milligan, 1981). The main deciding factor in the choice of a performance evaluation measure for clustering is the availability of a gold standard. If a gold standard is available, precision and recall-based measures can be applied, including a version of the F measure to combine the two. In the absence of a gold standard, a large number of internal measures can be applied; here, it is crucial to consider the aspects of compactness and separation together. Further information about clustering performance measures can be found in Milligan (1996) and Theodoridis and Koutroumbas (1999).
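To illustrate the external measures defined above, the following sketch computes per-cluster purity and completeness and the overall clustering F measure of Equation (25), again assuming that both clusterings are given as label sequences over the same m instances; the function names are illustrative.

```python
from collections import Counter

def clustering_f_measure(gold, output):
    """Overall F measure of Equation (25); gold and output are label sequences."""
    m = len(gold)
    gold_sizes, out_sizes = Counter(gold), Counter(output)
    overlap = Counter(zip(gold, output))   # overlap[(y, c)] = instances shared by y and c
    def f(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0
    return sum(
        (m_y / m) * max(f(overlap[(y, c)] / m_c, overlap[(y, c)] / m_y)
                        for c, m_c in out_sizes.items())
        for y, m_y in gold_sizes.items()
    )

def purity_and_completeness(gold, output, cluster):
    """Purity and completeness of one output cluster with respect to its
    predominant gold standard cluster."""
    members = [g for g, c in zip(gold, output) if c == cluster]
    predominant, count = Counter(members).most_common(1)[0]
    purity = count / len(members)
    completeness = count / sum(g == predominant for g in gold)
    return purity, completeness
```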
Text summarization

Text summarization aims at distilling the most important information from an input text and then formulating a concise textual output to serve a particular user or task. For example, the task may be to generate a text of 100 words containing the key content of a given newspaper article of 1000 words. Performance evaluation of text summarization systems is usually done by comparing an output summary with other automatic summaries (peers) or human-written summaries (often called reference summaries, model summaries or gold standards). At the core of evaluating the content of the summary is the selection of an appropriate content measurement unit (e.g., individual words, phrases or sentences) and measurements of the importance and similarity of the content units (Hovy, Lin, & Zhou, 2005). Similarity measurements can be broadly divided into lexical and semantic measurement types. Lexical similarity measures take into consideration the actual words used without considering the meaning of the content, while semantic similarity attempts to measure content similarity in terms of meaning. The earliest and simplest summary evaluation methods are word- and sentence-based. They measure the word or small n-gram (i.e., a sequence of n consecutive words) overlap between two summary texts. Examples include cosine similarity with a binary count of word overlap, cosine similarity with TF-IDF (term frequency – inverse document frequency) weighted word overlap (Salton, Wong, & Yang, 1975), measurement of the longest common subsequence (Radev, Jing, Stys, & Tam, 2004), as well as n-gram based measurements such as BLEU (BiLingual Evaluation Understudy) (Papineni, Roukos, Ward, & Zhu, 2002). BLEU, originally developed for the evaluation of machine translation, is based on a modified n-gram precision measure and does not incorporate recall (see the section on text classification).
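For concreteness, the following sketch computes a simple clipped n-gram precision and recall between a candidate summary and a single reference, in the spirit of the n-gram based measures mentioned above; it omits the brevity penalty, multi-reference handling and weighting schemes of the published BLEU and ROUGE measures, and assumes the texts are already tokenized.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision_recall(candidate, reference, n=2):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())          # clipped common n-gram count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)  # ROUGE-N style recall
    return precision, recall
```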
Direct application of the BLEU method in evaluating summaries does not always give representative evaluation results. The ROUGE method (Recall-Oriented Understudy for Gisting Evaluation) (Lin, 2004), inspired by the BLEU method, is geared towards evaluating summaries. It is based on matching the n-grams in the evaluated summary to the n-grams in the reference summaries. Different levels of n-gram matches are allowed; examples of these are exact matches or word root-form matches. The overall value of the measure is derived from the weighted combination of the n-gram matches. ROUGE has been found to produce evaluation results that correlate reasonably with human judgments (Lin, 2004).

The evaluation of summaries by comparing content at word- and sentence-level granularity has been criticized as insufficiently sensitive (Hovy et al., 2005). One solution to overcome this problem is to manually identify and annotate text units of different sizes that are considered to contain only the important content of the summary texts. Summaries can then be compared based on these units, which are usually longer chunks of words or phrases. This approach is incorporated, for example, in the pyramid method (Nenkova & Passonneau, 2004; Passonneau, Nenkova, McKeown, & Sigelman, 2005). One basic assumption in the pyramid method is that no single best model summary exists. The method attempts to address the content variation across human summaries of the same source text, and hence, multiple model summaries for the same source text are needed. Important text units that serve as a basis for the evaluation are manually identified and annotated. Summarization Content Units (SCUs) are approximately clause-length semantic units, such as continuous or discontinuous sequences of words, shared by the reference summaries. Thus, the content is identified based on shared meaning instead of shared words or n-grams. A pyramid (Figure 4) is built based on SCUs identified from model summaries as follows (Nenkova & Passonneau, 2004): Each SCU is given a weight corresponding to the number of summaries in which it appears, and each tier in the pyramid contains all and only the SCUs with the same weight, the SCUs with the highest weight being on the top tier. Hence, each tier Ti contains those SCUs that were found in i model summaries, and in the lower tiers there are SCUs that emerge in only some of the model summaries and that can thus be considered to include less important information. A new summary is evaluated by comparing its contents to the SCUs in the pyramid. The values of the evaluation measure are between zero and one: the higher the value, the more extensive the overlap with the optimal summary.

Figure 4. A pyramid with n tiers
The value of the evaluation measure is calculated in the following way. If n is the number of reference summaries, then the pyramid has n tiers denoted as T_i, i = 1, ..., n, where T_1 is the bottom tier and T_n the top tier. Let |T_i| be the number of SCUs in tier T_i, and let the weight of the SCUs in tier T_i be i. If the number of SCUs appearing in tier T_i in the summary to be evaluated is W_i, then the total SCU weight for the summary to be evaluated is

W = \sum_{i=1}^{n} i \times W_i.    (26)

In an optimal summary, an SCU from tier (n – 1) should not be expressed if in tier n there are SCUs that have not been expressed. Thus, the optimal content value for a summary with K SCUs is

Opt = \sum_{i=j+1}^{n} i \times |T_i| + j \times \left( K - \sum_{i=j+1}^{n} |T_i| \right),    (27)

where

j = \max_{i} \left( \sum_{t=i}^{n} |T_t| \ge K \right).    (28)
Here j is the maximum tier index such that the total number of SCUs counted from the top tier down to tier j equals or exceeds the number of allowed SCUs K. In the formula for the optimal content value, the first term covers the SCUs on the tiers above tier j, and the second term covers those SCUs from tier j that are needed to generate a summary of K SCUs. Finally, the overall value of the evaluation measure is W/Opt. For a more detailed description of the measure and its modified version, see Nenkova and Passonneau (2004) and Passonneau et al. (2005).

Pyramid evaluation has been found capable of differentiating summarization systems well (Passonneau, McKeown, Sigelman, & Goodkind, 2006). However, creating the pyramid and evaluating the peer summaries are very resource-demanding tasks with heavy manual annotation work. As this evaluation approach is based on manually identified content units, it also has an aspect of subjectivity that may complicate consistent evaluation. Basic Elements evaluation tries to overcome the subjectivity and variability problems resulting from manual identification of content units by automatically generating semantic content measurement units called Basic Elements (BEs). These units are shorter and more formalized than SCUs, and can be extracted automatically using syntactic parsing and applying a set of rules to the resulting parse tree. For a detailed description of BEs, see Hovy et al. (2005). A summary is evaluated by comparing the BEs in the evaluated summary to the BEs in the reference summaries. Following Hovy et al. (2005), each BE in the reference summaries is given a score, that is, the number of reference summaries in which the BE appears, and the BEs in the evaluated summary are matched to the BEs found in the reference summaries. Matching can be performed as, for example, lexical, word root-form, or synonym matching. In order to obtain the overall value of the measure for the evaluated summary, the scores of the BEs common to the reference summaries and the evaluated summary are combined. This can be done, for example, by weighting the scores according to the completeness of the match between the BE in the evaluated summary and the BE in the reference summaries, and adding the weighted scores.
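The pyramid score of Equations (26)-(28) can be sketched as follows, assuming the pyramid is given as a list of SCU counts per tier (bottom tier first), the evaluated summary's matched SCUs are counted per tier, and the SCU budget K does not exceed the total number of SCUs in the pyramid; the data layout is an assumption made for the example.

```python
def pyramid_score(tiers, observed, k):
    """tiers = [|T_1|, ..., |T_n|], observed = [W_1, ..., W_n], k = SCU budget K."""
    n = len(tiers)
    # Equation (26): total SCU weight of the evaluated summary
    w = sum(i * w_i for i, w_i in enumerate(observed, start=1))
    # Equation (28): lowest tier j still needed to collect K SCUs from the top down
    j = max(i for i in range(1, n + 1) if sum(tiers[i - 1:]) >= k)
    # Equation (27): optimal content value for a summary of K SCUs
    count_above_j = sum(tiers[i - 1] for i in range(j + 1, n + 1))
    opt = sum(i * tiers[i - 1] for i in range(j + 1, n + 1)) + j * (k - count_above_j)
    return w / opt
```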
A good performance evaluation measure for text summarization systems should show consistent correlation with the overall human evaluation result across different summarization tasks. While the earliest performance evaluation measures are word- and sentence-based, the more recently developed ones are based on identifying semantic units that aim to capture the most important content of the text. Performance measures for text summarization systems are still undergoing constant development and improvement. More information about performance evaluation of summarization systems can be found, for example, in the Proceedings of the Document Understanding Conference (DUC) (DUC, 2001–2006).
Conclusion

This chapter discussed the performance evaluation of text mining systems and provided an overview of prevalent measures in the task domains of text classification, information extraction, text segmentation, text ranking, information retrieval, text clustering and text summarization. In addition to discussing the application of basic measures in the various tasks, domain-specific measures were presented when the applicability of the general measures was limited. Instead of attempting an exhaustive listing of the hundreds of proposed measures, the key measures and their characteristics with respect to the specific tasks were emphasized, and further references to the literature were provided. Guidelines for measure selection were given, and the adoption of standard evaluation protocols and their associated measures was recommended. The chapter reflects the current state of the art in performance evaluation, as applied by practitioners and represented in large-scale competitive evaluations. These are likely to remain authoritative sources for the most recent developments in the constantly evolving field of performance evaluation.
Acknowledgment

We gratefully acknowledge the financial support of the Academy of Finland; Tekes, the Finnish Funding Agency for Technology and Innovation; and the Nokia Foundation.
References Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A. F., &, Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5), 412-424. Beeferman, D., Berger, A., & Lafferty, J. (1999). Statistical models for text segmentation. Machine Learning, 34(1-3), 177-210. Bezdek, J. C., & Pal, N. R. (1998). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics, 28(3), 301-315. Caruana, R., & Niculescu-Mizil, A. (2004). Data mining in metric space: An empirical analysis of supervised learning performance criteria. In R. Kohavi, J. Gehrke, W. DuMouchel, & J. Ghosh, (Eds.),
Proceedings of the Tenth International Conference on Knowledge Discovery and Data Mining (KDD’04) (pp. 69-78). New York, NY: ACM Press. Chinchor, N. (1997). Overview of MUC-7. In Proceedings of the Seventh Message Understanding Contest. Retrieved March 27, 2007, from http://www-nlpir.nist.gov/related_projects/muc/proceedings/ muc_7_proceedings/overview.html Chinchor, N., & Sundheim, B. (1993). MUC-5 evaluation metrics. In MUC5 ‘93: Proceedings of the Fifth Conference on Message Understanding (pp. 69-78). Baltimore, MD: Association for Computational Linguistics. Cortes, C., & Mohri, M. (2004). AUC optimization vs. error rate minimization. In S. Thrun, L. Saul & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16 (pp. 313-320). Cambridge, MA: MIT Press. Demartini, G., & Mizzaro, S. (2006). A classification of IR effectiveness metrics. In M. Lalmas, A. MacFarlane, S. Rüger, A. Tombros, T. Tsikrika, & A. Yavlinsky (Eds.), Advances in Information Retrieval. Lecture Notes in Computer Science, 3936 (pp. 488-491). Heidelberg, Germany: Springer-Verlag. Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Dec 2006), 1-30. DUC (2001–2006). The Proceedings of the Document Understanding Conference. Retrieved April 29, 2007, from http://www-nlpir.nist.gov/projects/duc/pubs.html Fagin, R., Kumar, R., & Sivakumar, D. (2003). Comparing top k lists. In M. Farach-Colton (Ed.), Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 28-36). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. Fawcett, T., & Flach, P. (2005). A response to Webb and Ting’s On the application of ROC analysis to predict classification performance under varying class distributions. Machine Learning, 58(1), 33-38. Georgescul, M., Clark, A., & Armstrong. S. (2006). An analysis of quantitative aspects in the evaluation of thematic segmentation algorithms. In Proceedings of the Seventh SIGdial Workshop on Discourse and Dialogue (pp. 144-151). Burwood, Victoria, Australia: BPA Digital for the Association for Computational Linguistics. Grishman, R., & Sundheim, B. (1996). Message Understanding Conference ‑ 6: a brief history. In Proceedings of the 16th International Conference on Computational Linguistics (pp. 466-471). Morristown, NJ: Association for Computational Linguistics Hirschman, L., & Mani, I. (2003). Evaluation. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (pp. 414-429). Oxford, NY: Oxford University Press. Hirschman, L., & Thompson, H. S. (1997). Overview of evaluation in speech and natural language processing. In R. Cole (Ed.), Survey of the State of the Art in Human Language Technology (pp. 409414). New York, NY: Cambridge University Press. Hirschman, L., Yeh, A., Blaschke, C., & Valencia, A. (2005). Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6(Suppl 1):S1.
Hovy, E., Lin, C-Y., & Zhou, L. (2005). Evaluating DUC 2005 using Basic Elements. Proceedings of the Fifth Document Understanding Conference (DUC ’05). Retrieved March 20, 2007, from http://duc. nist.gov/pubs/2005papers/usc-isi-zhou2.pdf Hubert, L., & Arabie, B. (1985). Comparing partitions. Journal of Classification, 2(1), 193-218. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall. Kazai, G., & Lalmas, M. (2006). INEX 2005 evaluation measures. In N. Fuhr, M. Lalmas, S. Malik, & G. Kazai (Eds.), Advances in XML Information Retrieval and Evaluation. Lecture Notes in Computer Science, 3977 (pp. 16-29). Heidelberg, Germany: Springer-Verlag. Kendall, M., & Gibbons, J. D. (1990). Rank Correlation Methods (5th ed.). London, UK: Edward Arnold. Lachiche, N., Ferri, C., & Macskassy, S. (Eds.) (2006). Proceedings of the Third Workshop on ROC Analysis in Machine Learning (ROCML’06). Retrieved April 30, 2007, from http://www.dsic.upv. es/~flip/ROCML2006/Papers/Proc_ROCML2006.pdf Lewis, D. D., (1991) Evaluating Text Categorization. In Proceedings of Speech and Natural Language Workshop (pp. 312-318). San Mateo, CA, USA: Morgan Kaufmann. Lin, C-Y. (2004).ROUGE: A package for automatic evaluation of summaries. In M-F. Moens, & S. Szpakowicz (Eds.), Text Summarization Branches Out: Proceedings of the ACL-04 Workshop 2004 (pp. 74-81). Barcelona, Spain: Association for Computational Linguistics. Malik, S., Kazai, G., Lalmas, M., & Fuhr, N. (2006). Overview of INEX 2005. In N. Fuhr, M. Lalmas, S. Malik, & G. Kazai (Eds.), Advances in XML Information Retrieval and Evaluation. Lecture Notes in Computer Science, 3977 (pp. 1-15). Heidelberg, Germany: Springer-Verlag. Manning, C. D., & Schütze, H. (2000). Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press. Milligan, G. W. (1981). A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrica, 46 (2), 187-199. Milligan, G. W. (1996). Clustering validation: results and implications for applied analyses. In P. Arabie, L. J. Hubert, & G. De Soete (Eds.), Clustering and Classification (pp. 341-375). River Edge, NJ: World Scientific Publishing. Nenkova, A., & Passonneau R. (2004). Evaluating content selection in summarization: the pyramid method. In S. Dumais, D. Marcu, & S. Roukos (Eds.), HLT-NAACL 2004: Main Proceedings (pp.145152). Association for Computational Linguistics. Papineni, K., Roukos, S., Ward T., & Zhu. W-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318). Morristown, NJ, USA: Association for Computational Linguistics.
Passonneau R.J., McKeown, K., Sigelman, S., & Goodkind A. (2006). Applying the pyramid method in the 2006 Document Understanding Conference. Proceedings of the Sixth Document Understanding Conference (DUC ’06). Retrieved March 20, 2007, from http://duc.nist.gov/pubs/2006papers/06pyramideval.paper.pdf Passonneau R.J., Nenkova, A., McKeown, K., & Sigelman, S. (2005). Applying the pyramid method in DUC 2005. Proceedings of the Fifth Document Understanding Conference (DUC ’05). Retrieved March 20, 2007, from http://duc.nist.gov/pubs/2005papers/columbiau.passonneau2.pdf Pevzner, L., & Hearst, M. (2002). A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1), 19-36. Radev, D., Jing, H., Stys, M., & Tam, D. (2004). Centroid-based summarization of multiple documents. Information Processing and Management 40(6), 919-938. Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53-65. Salton, G., Wong, A., &. Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM 18(11), 613-620. Sedding, J., & Kazakov, D. (2004). WordNet-based text document clustering. In Proceedings of the Third Workshop on Robust Methods in Analysis of Natural Language Data (ROMAND) (pp. 104-113). Siegel, S., & Castellan, N. J., Jr. (1988). Nonparametric Statistics for the Behavioral Sciences (2nd ed.). New York, NY: McGraw-Hill. Spärck Jones, K., & Galliers, J. R. (1996). Evaluating natural language processing systems: an analysis and review. Lecture Notes in Computer Science, 1083(Whole). Berlin, Germany: Springer-Verlag. Theodoridis, S., & Koutroumbas, K. (1999). Pattern Recognition. San Diego, CA: Academic Press. Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In W. Daelemans & M. Osborne (Eds.), Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (pp.142-147). Morristown, NJ: Association for Computational Linguistics. Voorhees, E. M., & Buckland, L. P. (Eds.) (2006). Proceedings of the Fifteenth Text Retrieval Conference (TREC 2006). Retrieved April 3, 2007, from http://trec.nist.gov/pubs/trec15/t15_proceedings.html
KEY TERMS

Content Similarity Measure: At the core of text summarization evaluation is the measuring of content similarity between two summaries. The content similarity measure can be lexical (based on word or sentence units, e.g. cosine similarity measures) or semantic (based on semantic content units, e.g. Summarization Content Units in the pyramid method).
External Measures: For clustering evaluation, external measures are based on comparing the clustering system output against a gold standard clustering. Measures such as the Rand index, the Jaccard coefficient and the F measure can be used as external measures. See also internal measures.

Extrinsic Evaluation: Assesses the performance of a text mining system component from the perspective of its effects on the performance of the whole system. See also intrinsic evaluation.

Gold Standard: Is a dataset defining the correct outputs, often determined by human experts. The output of the evaluated system is compared against the gold standard in order to measure the performance.

Internal Measures: For clustering evaluation, internal measures are based on measuring the extent to which the obtained clustering matches the information inherent in the data. Measures such as the Dunn index and the Davies-Bouldin index, as well as correlation coefficients like Hubert's Γ, can be used as internal measures.

Intrinsic Evaluation: Assesses the performance of a text mining system component as an isolated unit unconnected to the other system components. See also extrinsic evaluation.

Performance Evaluation Measure: Is a real-valued function assessing the quality of the text mining system output. The measure could be, for example, the number of fully correct outputs or the number of errors per input instance.

Rank Correlation Coefficient: Is a measure of the association between two rankings produced from the same instances. Examples of rank correlation coefficients are Spearman's ρb and Kendall's τb.

2 × 2 Contingency Matrix: Is a 2 × 2 matrix whose columns and rows represent the output of the system and the gold standard, respectively. The elements of the matrix are the numbers of true positives, false negatives, false positives and true negatives, upon which performance measures such as accuracy, precision and recall are based.
Chapter XLII
Text Mining in Bioinformatics: Research and Application

Yanliang Qi
New Jersey Institute of Technology, USA
Abstract

The biological literature has grown exponentially in recent years, and researchers need effective tools to help them find the information they need in large databases. Text mining is a powerful tool for solving this problem. In this chapter, we discuss the features of text mining and bioinformatics, text mining applications, research methods in bioinformatics, and open problems and future directions.
Introduction

In recent years, there has been an exponential increase in research in the biological area. Biological studies have been transformed from an "information-poor" to an "information-overload" environment. For example, GENBANK release 122 (2/01) contains 11,720,120,326 bases in 10,896,781 sequences. There is also a wealth of online information. The MEDLINE 2004 database contains over 12.5 million records, and the database is currently growing at the rate of 500,000 new citations each year (Aaron M. Cohen & William R. Hersh, 2005). Figure 1 shows the exploding number of articles available from Medline over the past 65 years (data retrieved from the SRS server at the European Bioinformatics Institute; www.ebi.ac.uk/) (Dietrich et al., 2005). It is thus obvious that the problem faced by biological researchers is how to effectively find the useful and needed documents in such an information-overload environment. Traditional manual retrieval methods are impractical. Furthermore, online biological information exists in a combination of structured, semi-structured and unstructured forms (M. Ghanem et al., 2005), and it is impossible to keep abreast of all developments. Computational methodologies have become increasingly important in research (G. Black & P. Stephan, 2004). Text mining techniques, which involve the processes of information retrieval, information extraction and data mining, provide a means of solving this problem (Ananiadou et al., 2006).
Figure 1. Medline increase
The interaction of computational methodologies and the life sciences has formed a new research area: bioinformatics. Bioinformatics is where the information sciences meet the life sciences; it is the application of information technologies to biological structures and processes, and the information generated by this application (Pharmabiz.com, 2002). The goal of text mining in bioinformatics is to help researchers identify needed information more efficiently and uncover relationships hidden in the vast amount of available information. In this chapter, we discuss the role of text mining in bioinformatics. First, we describe the features of text mining and bioinformatics. Then, we discuss the applications of text mining in the bioinformatics area. The third part discusses research methods in this area, and a discussion of open problems and future directions concludes the chapter.
Features of TM and Bioinformatics

Text mining is a technology that makes it possible to discover patterns and trends semi-automatically from huge collections of unstructured text. It is based on technologies such as Natural Language Processing (NLP), Information Retrieval (IR), Information Extraction (IE), and Data Mining (DM) (N. Uramoto et al., 2004). Although IR, IE and DM look very similar to text mining, Text Mining (TM) differs from all three; the main difference is whether anything novel is produced in the process (M. Hearst, 1999). Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata that describe documents, or searching within databases, whether relational stand-alone databases or hyper-textually networked databases such as the World Wide Web (Wikipedia.org, 1999). The process of IR is to find needed information in a database that already exists, so nothing novel is produced in this process. Information Extraction, a type of Information Retrieval whose goal is to automatically extract structured information, likewise does not produce novelty in its process. Data mining mainly focuses on structured data, while text mining mainly focuses on semi-structured and unstructured data. Following Hearst's argument, data mining is not "mining" but simply a (semi-)automated discovery of patterns and trends across large databases, and no new facts are created in this discovery process. Table 1 compares these four techniques.

Table 1. Comparison between TM, IR, IE and DM
Text Mining: Discovering heretofore unknown information from a text source
Information Retrieval: Finding the required information
Information Extraction: Extracting facts that are already present in unrestricted text sources
Data Mining: (Semi-)automated discovery of patterns/trends across large databases

Natural Language Processing is the study of automated generation and understanding of natural human languages. Text mining is also different from NLP. In their survey of biomedical text mining, Aaron M. Cohen and William R. Hersh (2005) discuss the difference in detail: NLP attempts to understand the meaning of text as a whole, while text mining and knowledge extraction concentrate on solving a specific problem in a specific domain identified a priori (possibly using some NLP techniques in the process). For example, text mining can aid database curators by selecting the articles most likely to contain information of interest, or potential new treatments for migraine may be determined by looking for pharmacological substances that are associated with biological processes associated with migraine.

Bioinformatics is a highly interdisciplinary research area that merges many domains, such as biology, computer science, medicine, artificial intelligence, applied mathematics and statistics. Because much of the biological literature is stored in semi-structured and unstructured form, text mining can play a crucial role in helping researchers find the needed information and relationships in biological data. The main research areas of bioinformatics include sequence analysis, genome annotation, analysis of gene expression, analysis of protein expression, prediction of protein structure, and modeling of biological systems. Bioinformatics has been applied in many areas, the most noticeable of which is the Human Genome Project (HGP): the effort to identify the 80,000 genes in human DNA. The goals of the original HGP were not only to determine the more than 3 billion base pairs in the human genome with a minimal error rate, but also to identify all the genes in this vast amount of data. Another goal of the HGP was to develop faster, more efficient methods for DNA sequencing and sequence analysis and to transfer these technologies to industry. The sequence of human DNA is stored in databases available to anyone on the Internet. The process of identifying the boundaries between genes and other features in raw DNA sequence is called genome annotation and is a domain of bioinformatics (Wikipedia.org, 2007).
Figure 2. Text mining applications in biology
Main Application

Text mining has been applied in many bioinformatics domains, including Named Entity Recognition (NER), automatic annotation, micro-array analysis, gene expression, gene annotation, systems biology, etc. Figure 2 indicates the applications of text mining in biology. In this chapter, we discuss the applications in gene expression, genome annotation, systems biology and "BioTeKS" ("Biological Text Knowledge Services"), the first major application of the UIMA (Unstructured Information Management Architecture).
Gene Expression

Gene expression is the process by which a gene's coded information is translated into the proteins present and operating in the cell. If a biology researcher uses traditional data clustering analysis, the result could be a group of co-regulated genes (i.e., genes that exhibit similar experimental behavior) or groups of differentially expressed genes. The researcher would then investigate and validate the significance of his or her findings by (a) seeking background information on why such genes are co-regulated or differentially expressed, and (b) identifying the diseases that are associated with the different isolated gene groupings. Much of the required information is available in online genomic databases and in scientific publications. What the user requires are interactive methods that enable him or her to access such information dynamically, summarize it, and re-integrate it into the analysis (M. Ghanem et al., 2004). The first phase in this process is gene expression analysis, and it has stimulated many applications. In this chapter, we discuss several of them.
The first tool we introduce is GO::TermFinder, software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. GO::TermFinder comprises a set of object-oriented Perl modules for accessing Gene Ontology (GO) information and evaluating and visualizing the collective annotation of a list of genes to GO terms. It can be used to draw conclusions from micro-array and other biological data, calculating the statistical significance of each annotation. GO::TermFinder can be used on any system on which Perl can be run, either as a command-line application, in single or batch mode, or as a Web-based CGI script (Boyle et al., 2004). The example of a GO::TermFinder analysis of micro-array data shows that this software is flexible, extensible, and easy to reuse and incorporate into analysis pipelines. The second one, Gene Information Systems (GIS) (Chiang JH et al., 2004), a biomedical text-mining system for gene information discovery, focuses on four types of gene-related information: biological functions, associated diseases, related genes and gene–gene relations. The aim of this system is to provide researchers with an easy-to-use bio-information service that rapidly surveys the burgeoning biomedical literature. The third one, MedMeSH Summarizer, is a system that uses text mining for gene clusters. It can summarize a group of genes by filtering the biomedical literature and assigning relevant keywords describing the functionality of the group. The system aims to summarize literature information about a group of genes in a concise and coherent manner (P. Kankar et al., 2002). The MedMeSH Summarizer extracts MeSH (Medical Subject Headings) terms that can summarize the nature of a cluster of gene names obtained from DNA micro-arrays (also called DNA chips) (N. Uramoto et al., 2004). The last one is MedMiner, an Internet text-mining tool for biomedical information with application to gene expression profiling (L. Tanabe et al., 1999). It was developed incrementally, with constant feedback from biologist-users to meet their needs, and it incorporates several key computational components to achieve the twin goals of automated filtering and data organization. The first of these is Internet-based querying of multiple databases; the system is designed for easy integration of additional databases. The second key component of MedMiner's procedure is text filtering, and the third is a carefully designed user interface. The output is organized according to the relevance rule triggered, rather than being ordered arbitrarily or by date (L. Tanabe et al., 1999).
Genome Annotation

Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:

1. Identifying elements on the genome, a process called Gene Finding, and
2. Attaching biological information to these elements (Wikipedia.org, 2007).
MyWEST is a Web extraction software tool for effective mining of annotations from Web-based databanks. The tool, aimed at researchers without extensive informatics skills or resources, exploits user-defined templates to easily mine selected annotations from different Web-interfaced databanks, and aggregates and structures the results in an automatically updated database. MyWEST can effectively gather relevant annotations from various biomolecular databanks, highlight significant biological characteristics, and support a global approach to the understanding of complex cellular
The main characteristics of MyWEST are: (1) a graphical user interface with intuitive windows that is easy for biologists and physicians to use; (2) a module for template creation from any reference HTML page; (3) a module for automatic extraction of data from different HTML pages; (4) parametric functioning, which adapts extraction behavior by modifying parameter values inside extraction configuration text files; (5) log files that record the data extractions performed and allow quick evaluation of results; (6) aggregation and storage of all extracted data, either in tab-delimited text files or in a relational database; and (7) a software agent module for updating the extracted data stored in the database (Masseroli et al., 2004).
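The template idea behind MyWEST can be illustrated with a small sketch: a user-defined pattern for each annotation field is applied to a retrieved databank page, and the extracted values are stored for later aggregation. The field names, patterns, and database layout below are hypothetical and are not MyWEST's actual template format:

```python
import re
import sqlite3

# Hypothetical user-defined template: annotation field -> pattern with one capture group.
TEMPLATE = {
    "gene_symbol": r"Gene symbol:\s*</b>\s*([A-Za-z0-9-]+)",
    "description": r"Description:\s*</b>\s*([^<]+)",
}

def extract_annotations(page_html, template=TEMPLATE):
    """Apply each field pattern to the page text and keep the first match."""
    results = {}
    for field, pattern in template.items():
        match = re.search(pattern, page_html)
        results[field] = match.group(1).strip() if match else None
    return results

def store_annotations(db_path, entry_id, annotations):
    """Aggregate extracted annotations into a small relational table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS annotation (entry_id TEXT, field TEXT, value TEXT)")
    con.executemany("INSERT INTO annotation VALUES (?, ?, ?)",
                    [(entry_id, f, v) for f, v in annotations.items() if v is not None])
    con.commit()
    con.close()

page = "<b>Gene symbol:</b> TP53 <b>Description:</b> tumor protein p53"
print(extract_annotations(page))  # {'gene_symbol': 'TP53', 'description': 'tumor protein p53'}
```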
Systems Biology
Systems biology is the study of the interactions between the components of biological systems, and of how these interactions give rise to the function and behavior of the system (Ananiadou et al., 2006). Systems biology is a key example of a field where the mode of scientific knowledge discovery is shifting from a hypothesis-driven mindset to an integrated, holistic mode that combines hypotheses with data. The amount of unstructured textual data is increasing at such a pace that it is difficult to discover knowledge and generate scientific hypotheses without the use of knowledge extraction techniques, which are largely based on TM. Text mining techniques can be applied in a variety of areas of systems biology (Ananiadou et al., 2006).
BioTeKS
The last main application we want to discuss is BioTeKS ("Biological Text Knowledge Services"), developed by IBM Research and the first major application of UIMA (Unstructured Information Management Architecture). UIMA (and BioTeKS) address the challenges of analyzing text by standardizing and improving the process of developing, deploying, and integrating the operation of multiple text-analysis methods that operate on different levels of linguistic analysis (D. Ferrucci and A. Lally, 2003). In this way UIMA improves text-analysis processing in general, while BioTeKS, which focuses on methods for analyzing biomedical text, an especially complicated kind of scientific text, improves text-analysis processing for the life sciences. The goal of the BioTeKS text-analysis methods is to convert initially unstructured text information into structured text data, commensurate with structured data derived from sources other than text (e.g., gene names in micro-array experiments, or drug, treatment, and disease references in clinical records). BioTeKS is a significant technical initiative within the IBM Research laboratories to integrate and customize a broad suite of text-analysis projects and technologies targeting problems in the domain of biomedical text analysis (R. Mack et al., 2004).
Research Method
Before discussing the approaches in bioinformatics, I would like to introduce the Discovery Net project, because many of the approaches are extensions of it. Discovery Net is a £2.08 million project funded by the EPSRC under the UK Research Councils' e-Science Programme. The goal of the project is to build the world's first e-Science platform for scientific discovery from the data generated by a wide variety of high-throughput devices at Imperial College London.
It is a multidisciplinary project serving application scientists from various fields including biology, combinatorial chemistry, renewable energy, and geology. It provides a service-oriented computing model for knowledge discovery, allowing users to connect to and use data analysis software as well as data sources that are made available online by third parties (source: http://www.discovery-on-the.net/).
The first approach extended from Discovery Net was a grid infrastructure for mixed bioinformatics data and text mining, developed by Moustafa Ghanem, Yike Guo, Anthony Rowe, Alexandros Chortaras, and Jon Ratcliffe. In their paper, the authors presented a number of text mining examples conducted over biological data to highlight the advantages of their system. At the core of the Discovery Net text mining is an extensible document representation model. The model is based on the Tipster Document Architecture (Grishman, 1998), which uses the notion of a document annotation, defined as "the primary means by which information about a document is recorded and transmitted between components within the system". Following this model, a single document is represented by two entities: the document text, which corresponds to the plain document text, and the annotation set structure. The annotation set structure provides a flexible mechanism for associating extra-textual information with certain text segments. Each such text segment is called an annotation, and an annotation set consists of the full set of annotations that make up a document. A single annotation is uniquely defined by its span, i.e., by its starting and ending position in the document, and has associated with it a set of attributes. The role of the attributes is to hold additional information, e.g., about the function, the semantics, or other types of user-defined information related to the corresponding text segment. Depending on the particular application, different attributes will be assigned to the annotations (M. Ghanem et al., 2005). After the documents are collected, an indexing scheme for efficient processing is applied (the document indexing phase). For the vector representation, the system uses a sparse feature vector type as the basic type for representing documents in the feature vector space. The authors also provided three case studies (Interpreting Gene Expression, Gene Expression-Metabolite Mapping, and Scientific Document Categorization) to show that their system is valid and effective.
The second approach is from M. Ghanem, Y. Guo, and A. S. Rowe. In their paper, they provide a powerful workbench for the dynamic analysis and interpretation of bioinformatics data. The aim of this workbench is to investigate and develop methods whereby information integration methods and text mining methods can be used together to validate and interpret the results of a data mining procedure. This mode of scientific discovery is supported by the Discovery Net workflow, which is divided into three logical phases: (1) Gene Expression Analysis, corresponding to the traditional data mining phase, where the biologist conducts analysis over gene expression data using a data clustering component to find co-regulated or differentially expressed genes; (2) Find Relevant Genes from Online Databases, where the user uses the InfoGrid integration framework to obtain further information about the isolated genes from online databases;
and (3) Find Associations between Frequent Terms, where the user uses a dictionary of disease terms obtained from the MeSH (Medical Subject Headings) dictionary to isolate the key disease terms appearing in the retrieved articles (M. Ghanem et al., 2004). This approach can be applied to various case studies.
Text mining can also be used to extract useful knowledge (such as genes, proteins, and small molecules) automatically from unstructured text sources such as the literature. In their paper, Moustafa Ghanem, Alexandros Chortaras, and Yike Guo presented a tool for conducting distributed text mining over bioinformatics data. Their approach is based on the use of visual programming and Web services technologies; the computational framework allows users to carry out mixed data and text mining over distributed data and computational resources through a workflow co-ordination paradigm (M. Ghanem et al., 2004).
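The document representation described above pairs the raw text with a set of stand-off annotations (each a span plus attributes) and, for mining, a sparse feature vector. A minimal sketch of these two structures follows; the class and field names are illustrative and are not Discovery Net's actual API:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int                                        # span start offset in the text
    end: int                                          # span end offset
    attributes: dict = field(default_factory=dict)    # e.g. {"type": "GeneName"}

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

def sparse_vector(doc, vocabulary):
    """Represent the document as a sparse {term_id: count} bag-of-words vector."""
    tokens = doc.text.lower().split()
    counts = Counter(t for t in tokens if t in vocabulary)
    return {vocabulary[t]: c for t, c in counts.items()}

# Illustrative use
doc = Document("BRCA1 is associated with breast cancer")
doc.annotations.append(Annotation(0, 5, {"type": "GeneName", "symbol": "BRCA1"}))
vocab = {"brca1": 0, "breast": 1, "cancer": 2}
print(sparse_vector(doc, vocab))   # {0: 1, 1: 1, 2: 1}
```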
In the study of text mining in micro-array analysis, researchers typically know only a few genes well, and wading through search results is tedious and time consuming. There are two approaches to cope with this problem:
1. Cluster on numeric data, and then interpret textually.
2. Cluster on textual data, and then interpret numerically.
MedMiner is an application of the first approach: it identifies groups of genes based on experimental data and then creates lists of relevant contexts. PubGene is an application of the second approach: it compiles a list of all genes, computes the co-occurrence of genes in Medline articles, then displays networks of selected genes, color-coding the nodes to indicate the degree of up- or down-regulation. Both approaches successfully incorporate prior knowledge; however, only shallow information is used (Altman & Schütze, 2003).
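The PubGene-style second approach can be sketched very simply: count how often pairs of gene names co-occur in the same abstract and keep the pairs that co-occur often enough to draw as edges in a network. The gene list and abstracts below are placeholders, not real Medline data:

```python
from itertools import combinations
from collections import Counter

GENES = {"BRCA1", "BRCA2", "TP53", "EGFR"}   # hypothetical gene symbol list

def cooccurrence_network(abstracts, genes=GENES, min_count=2):
    """Count gene-pair co-occurrences per abstract and keep frequent pairs as edges."""
    pair_counts = Counter()
    for text in abstracts:
        present = sorted(g for g in genes if g in text)
        pair_counts.update(combinations(present, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

abstracts = [
    "BRCA1 and BRCA2 mutations increase breast cancer risk ...",
    "TP53 interacts with BRCA1 in DNA damage response ...",
    "BRCA1 and BRCA2 carriers were screened ...",
]
print(cooccurrence_network(abstracts))   # {('BRCA1', 'BRCA2'): 2}
```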
Problems and Future Directions
Bioinformatics has made considerable progress in recent years. Traditional information technology companies such as IBM, Motorola, Agilent Technologies (a Hewlett-Packard spin-off), and Sun Microsystems anticipate the potential for profit. Drug companies also invest heavily in bioinformatics because they believe it offers the prospect of finding better drug targets earlier in the development process. This efficiency can reduce the number of potential therapeutics moving through a company's clinical testing pipeline, significantly decreasing overall costs. Extra profits can also be expected from shortening research and development periods for new drugs and from lengthening the time a drug is on the market before its patent expires (Pharmabiz.com, 2002). From all of the foregoing, it is clear that biomedical text mining has great potential; however, that potential is as yet unrealized. In the coming years, text mining should be able to evaluate and validate the results of analytical gene expression methods in identifying significant gene groupings (M. Ghanem et al., 2004). Text mining researchers should cooperate with biology researchers in this interdisciplinary area. Some of the potential "new frontiers" in biomedical text mining are question answering, summarization, mining data from full text (including figures and tables), user-driven systems, and evaluation (Zweigenbaum et al., 2007). This is an exciting time in biomedical text mining, full of promise.
References

Aaron M. Cohen & William R. Hersh (2005). A survey of current work in biomedical text mining. Briefings in Bioinformatics, 6(1), 57-71.

Ananiadou, Sophia, Kell, Douglas B., & Tsujii, Jun-ichi (2006). Text mining and its potential applications in systems biology. Trends in Biotechnology, 24(12), 571-579.

Altman & Schütze (2004). Introduction to text mining for bioinformatics. Retrieved April 24, 2007, from http://www.ims.uni-stuttgart.de/~schuetze/ws2004ir/20041112/vorlesung4.pdf
Chiang, J. H., Yu, H. C., & Hsu, H. J. (2004). GIS: A biomedical text-mining system for gene information discovery. Bioinformatics, 20, 120-121.

D. Ferrucci & A. Lally (2003). Accelerating corporate research in the development, application and deployment of human language technologies. In Proceedings of the Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS), Edmonton, Canada, May 31, 2003.

Dietrich Rebholz-Schuhmann, Harald Kirsch & Francisco Couto (2005). Facts from text: Is text mining ready to deliver? Retrieved April 24, 2007, from http://biology.plosjournals.org/perlserv/?request=get-document&doi=10.1371%2Fjournal.pbio.0030065

Elizabeth I. Boyle, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J. Michael Cherry & Gavin Sherlock (2004). GO::TermFinder: Open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20, 3710-3715.

G. Black & P. Stephan (2004). Bioinformatics: Recent trends in programs, placements and job opportunities. Report to the Alfred P. Sloan Foundation, New York, NY, USA.

L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter & J. N. Weinstein (1999). MedMiner: An Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques, 27, 1210-1214, 1216-1217.

M. Hearst (1999). Untangling text data mining. In Proceedings of ACL'99: The 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, USA.

M. Ghanem, Y. Guo & A. S. Rowe (2004). Integrated data mining and text mining in support of bioinformatics. Poster at the UK e-Science All Hands Meeting 2004, Nottingham, UK.

Marco Masseroli, Andrea Stella, Natalia Meani, Myriam Alcalay & Francesco Pinciroli (2004). MyWEST: My Web Extraction Software Tool for effective mining of annotations from Web-based databanks. Bioinformatics, 20(18), 3326-3335.

Moustafa Ghanem, Alexandros Chortaras & Yike Guo (2004). Web service programming for biological text mining. In Search and Discovery in Bioinformatics, SIGIR 2004 Workshop, Sheffield, UK.

Moustafa Ghanem, Yike Guo, Anthony Rowe, Alexandros Chortaras & Jon Ratcliffe (2005). A grid infrastructure for mixed bioinformatics data and text mining. In Proceedings of the ACS/IEEE 2005 International Conference on Computer Systems and Applications. IEEE Computer Society, Washington, DC, USA.

N. Uramoto, H. Matsuzawa, T. Nagano, A. Murakami, H. Takeuchi & K. Takeda (2004). A text-mining system for knowledge discovery from biomedical documents. IBM Systems Journal, 43(3), 516-533.

P. Kankar, S. Adak, A. Sarkar, K. Murari & G. Sharma (2002). MedMeSH Summarizer: Text mining for gene clusters. In Proceedings of the Second SIAM International Conference on Data Mining, Arlington, VA.

Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu & K. Bretonnel Cohen (2007). New frontiers in biomedical text mining. Pacific Symposium on Biocomputing, 12, 205-208.
Pharmabiz.com (2002). Bioinformatics: The next wave in life sciences. Retrieved May 7, 2002, from http://www.pharmabiz.com/article/detnews.asp?articleid=11402&sectionid=46

R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa & L. V. Subramaniam (2004). Text analytics for life science using the Unstructured Information Management Architecture. IBM Systems Journal, 43(3), 490-515.

Wikipedia.org (2007). Genome annotation. Retrieved April 24, 2007, from http://en.wikipedia.org/wiki/Genome_annotation

Wikipedia.org (2007). Information retrieval. Retrieved April 24, 2007, from http://en.wikipedia.org/wiki/Information_retrieval

Wikipedia.org (2007). Human Genome Project. Retrieved April 24, 2007, from http://en.wikipedia.org/wiki/Human_genome_project
KEY TERMS

Bioinformatics: A highly interdisciplinary research area that emerged from many domains, such as biology, computer science, medicine, artificial intelligence, applied mathematics, and statistics.

Genome Annotation: The process of attaching biological information to sequences.

Gene Expression: The process by which a gene's coded information is translated into the proteins present and operating in the cell.

Information Retrieval (IR): The science of searching for information in documents, searching for documents themselves, searching for metadata that describe documents, or searching within databases.

Natural Language Processing (NLP): A field of study that focuses on the automated generation and understanding of natural human languages.

Systems Biology: The study of the interactions between the components of biological systems, and of how these interactions give rise to the function and behavior of the system.

Text Mining: A technology that makes it possible to discover patterns and trends semi-automatically from huge collections of unstructured text.
Chapter XLIII
Literature Review in Computational Linguistics Issues in the Developing Field of Consumer Informatics: Finding the Right Information for Consumer's Health Information Need

Ki Jung Lee
Drexel University, USA
ABSTRACT

With the increased use of the Internet, a large number of consumers first consult online resources for their healthcare decisions. The problem with the existing information structure primarily lies in the fact that the vocabulary used in consumer queries is intrinsically different from the vocabulary represented in the medical literature. Consequently, medical information retrieval often provides poor search results. Since consumers make medical decisions based on these search results, building an effective information retrieval system becomes an essential issue. By reviewing the foundational concepts and application components of medical information retrieval, this paper contributes to a body of research that seeks appropriate answers to the question "How can we design a medical information retrieval system that can satisfy consumers' information needs?"
INTRODUCTION

The Internet functions as a family physician as more and more people seek medical information online and make subsequent healthcare decisions based on the acquired information. Consequently,
one of the perspectives in biomedical informatics, supported by statistics showing (1) a significant increase in the prevalence of the Internet, (2) increased concern about healthcare among the general population, and (3) sheer growth in the amount of medical literature publicly available, concentrates on the use of biomedical information by consumers. Denoted consumer informatics, this emerging perspective focuses on consumers' information needs, their access to information, and the modeling of their information needs for system integration (Haux, 1997; Lange, 1996; Logana & Tse, 2007; Nelson & Ball, 2004; Soergel, Tse, & Slaughter, 2004; Warner, 1995). However, such statistics alone do not secure it as a robust academic discipline, since the significance of healthcare information is determined by what is in the information and how the information can be used, in addition to how users acquire it.
The core problem in addressing this matter is the discrepancy between the user vocabulary of the "lay person" and the professional vocabulary used by clinical professionals. While consumers search for medical information on the Internet using their everyday terms, medical literature on the Internet is represented in professional terms. Consequently, key words in consumers' queries do not match the medical key words in medical documents (or match the wrong ones). This can cause a significant problem, since a misinformed healthcare decision directly affects the consumer's health. In the field of consumer informatics, therefore, a trend of research has burgeoned around designing maps that can translate consumer vocabulary to professional vocabulary (in effect "mapping" one vocabulary to another).
The objective of this paper is to discuss concepts of consumer health vocabulary and to review research endeavors to reduce the gap between consumer vocabulary and professional vocabulary. Primarily, it aims at examining current methodological approaches for mapping consumer vocabulary to professional vocabulary, critically reviewing them, and suggesting an expanded perspective. With this literature review, researchers in the field of consumer informatics will be able to set a basis for further exploration of healthcare information system design. In addition, the concepts and methods regarding vocabulary mapping reviewed here will provide researchers with critical viewpoints on methodological problems and help them identify and gain insights into system designs that reduce the discrepancy between consumer information needs and medical domain knowledge.
The rest of the paper is structured as follows. The background concept section discusses the conceptual background of consumer health vocabulary and how it can be explored and developed to better facilitate links between consumer information needs and medical domain knowledge. In the related work section, research studies that contribute to practical solutions for mapping consumer terms to professional terms are reviewed and analyzed. In the discussion section, along with critiques of current studies, different perspectives on the problem are discussed, including a brief review of a semantic approach to designing consumer vocabularies. Lastly, the conclusion section concludes the paper.
BACKGROUND CONCEPT: CONSUMER VOCABULARY

In this section, the concept of consumer health vocabulary is discussed. The definition of consumer health vocabulary, current issues surrounding it, and its problems with regard to information retrieval systems are the main topics of discussion. Consumer health vocabulary is the set of terms used by the general population when they refer to specific healthcare information needs. In a more practical sense, Zeng and Tse (2006) conceptualize it as a "combination of everyday language, technical terms (with or without knowledge of the underlying concepts), and various explanatory models, all influenced by psychosocial and cultural variations,
in discourses about health topics" (p. 24). This definition implies a few major points for understanding the concept of consumer health vocabulary. First, there is a gap between consumer vocabulary and technical vocabulary, not only in the lexical collections but also in the conceptualization of the overlapping vocabulary; second, interpretation of the same term can vary among consumers with different experiences and cultural backgrounds.
A series of research studies shows examples of consumer vocabulary issues in particular healthcare situations. Chapman, Abraham, Jenkins, and Fallowfield (2003), in a cancer consultation situation, report that about half the participants in their survey study understood expressions for the "metastatic spread of cancer", e.g., 'seedlings' and 'spots in the liver'. While 63 percent correctly identified that the term 'metastasis' meant that the cancer was spreading, only 52 percent understood that the phrase 'the tumor is progressing' has a negative implication. Lerner and colleagues (2000), in the emergency department setting, report that patients have a limited understanding of medical terms. In their study, participants were asked whether each of 6 pairs of terms had the same or different meaning, and individual scores were measured by the number of right answers. The mean number of correct responses was 2.8 (SD = 1.2). Among the incorrect answers, the proportion of patients who did not recognize analogous terms was 79 percent for bleeding versus hemorrhage, 78 percent for broken versus fractured bone, 74 percent for heart attack versus myocardial infarction, and 38 percent for stitches versus sutures. Nonanalogous terms were also not always correctly distinguished: 37 percent for diarrhea versus loose stools, and 10 percent for cast versus splint. Although the situations varied, these studies share a common theme: communication between patients and doctors is not well facilitated because of the language barrier. The problem is more serious in situations where a patient does not have a medical professional to interact with, i.e., online medical search. Making medical decisions without proper information can pose a serious danger to patients. Therefore, a system that can virtually aid patients in searching for what they really want is required.
A translation mechanism can be embedded in current medical information retrieval systems so that consumers can search for what they need using their own vocabulary. Since the conventional mechanism of best-match information retrieval depends heavily on exact matching between the user query (i.e., the representation of the user's information need) and the pre-designed information structure of the system (i.e., the representation of the stored texts), mapping consumer vocabulary to professional vocabulary (i.e., the system vocabulary) provides a way to reduce the communication gap. Therefore, the "development of a consumer vocabulary should be based on research that includes consumer information needs and consumers' ways of talking about and expressing those needs" (Lewis, Brennan, McCray, Tuttle, & Bachman, 2001, p. 1530), i.e., a representation of consumer needs reflecting cultural context or specification.
RELATED WORK: DESIGNING A COMMUNICATION MAP BETWEEN CONSUMERS AND PROFESSIONALS

A growing number of studies identify that communication between consumers and healthcare professionals suffers from a serious language barrier in terms of understanding terms and concepts. In attempting to overcome this roadblock, researchers have designed methods to offer links between consumer vocabulary and professional vocabulary. In this section, research studies that contribute to methodological approaches for mapping consumer terms to professional terms are reviewed and analyzed.
McCray and colleagues (1999) used the UMLS (Unified Medical Language System) Knowledge Source Server to map user queries submitted to the National Library of Medicine Web site. They report that 41 percent of the 225,164 unique, normalized queries mapped successfully to the UMLS Metathesaurus. The mapped terms mostly corresponded to concepts for diseases, other types of disorders, and drugs. According to their analysis, term mismatch was caused mainly by user query formulation: user queries were often at a level of specificity that their database may not contain or may cover only indirectly. Moreover, user queries with partial words or abbreviations not represented in the UMLS were also matched inefficiently. The authors also identify that query length was not sufficient to fully represent consumers' information needs.
Zeng and colleagues (2002) also used the UMLS for conceptual analysis. They collected free-text query terms entered into the "clinical interests" section of the Find-a-Doctor function of the Brigham and Women's Hospital Web site and free-text query terms entered in the MEDLINEplus site. The collected consumer terms were mapped to the UMLS 2000 edition, which contains a set of concepts from various vocabularies such as MeSH, ICD-9, and SNOMED. The exact matching method resulted in a mapping rate of 49 percent for the unique Find-a-Doctor query terms and 45 percent for the unique MEDLINEplus query terms to UMLS concepts. They identified unmapped terms and classified them into 12 different categories (Table 1): spelling error, morphology, concatenation, sequence, abbreviations, synonym, redundancy, generalization, other semantic relationships, valid term not in UMLS, invalid term not in UMLS, and unclear meaning of consumer term. They also report that, among these, lexical mismatch and conceptual mismatch constituted over 90 percent.

Table 1. Classification of mismatches between consumer terms and UMLS (Q. T. Zeng et al., 2002)

Smith, Stavri, and Chapman (2002) collected features and findings identified in 139 e-mail messages submitted to the University of Pittsburgh Cancer Institute's Cancer Information and Referral Service. The e-mail messages were coded and mapped to the 2001 UMLS Metathesaurus. Among the 504 unique terms identified, 36 percent were exact matches to concepts in the 2001 UMLS Metathesaurus; 35 percent were partial string matches, consisting of 24 percent known synonyms for Metathesaurus concepts and 1 percent lexical variants. Four percent of the total terms were not mapped to the UMLS.
Tse and Soergel (2003) collected 1,936 postings from 12 Web-based health discussion forums for a consumer vocabulary corpus. They also collected 208 documents from magazines, newspapers, commercial ads, government publications, and patient pamphlets for a mediator corpus. In order to control for various consumer types, 14 "lay persons" identified medical expressions from the documents. Following the provided guideline, they selected terms on the basis of their personal experience, knowledge, judgment, and the context in the document. The extracted terms were spelling-corrected, expanded, normalized, and then mapped to concepts in the 2000-2001 UMLS Metathesaurus using MetaMap and the Knowledge Source Server. Terms that were not mapped automatically were manually mapped to the UMLS with assistance from a physician consultant. In total, 36 percent of the consumer vocabulary terms and 43 percent of the mediator vocabulary terms were mapped to the UMLS.
As reviewed, the present technology for translating consumer vocabulary to professional vocabulary works by fitting existing consumer vocabulary into a formal infrastructure (e.g., the UMLS). With this technology, consumers will be able to see familiar language and behavior at the system interface level while requesting complex medical information to satisfy their needs.
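The basic exact-matching step used in these studies can be sketched as follows: a consumer term is normalized and looked up in a mapping from lay terms to professional concepts. The tiny dictionary below is illustrative only and is not drawn from the UMLS:

```python
# Hypothetical lay-term to professional-concept mapping (not actual UMLS content).
CONSUMER_TO_PROFESSIONAL = {
    "heart attack": "myocardial infarction",
    "broken bone": "fracture",
    "stitches": "sutures",
    "loose stools": "diarrhea",
}

def normalize(term):
    """Very light normalization: lowercase and collapse whitespace."""
    return " ".join(term.lower().split())

def map_query(query):
    """Exact-match the normalized query phrase against the consumer vocabulary."""
    return CONSUMER_TO_PROFESSIONAL.get(normalize(query))   # None means 'unmapped'

print(map_query("Heart Attack"))   # -> 'myocardial infarction'
print(map_query("chest pain"))     # -> None (would fall into an 'unmapped' category)
```

Tools such as MetaMap add spelling correction, morphological normalization, and abbreviation expansion before matching, which is exactly where many of the mismatch categories in Table 1 arise.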
DISCUSSION: TOWARDS SEMANTICS

Some studies have examined aspects of controlled medical vocabularies and identified the discrepancy between consumer medical terms and professional medical terms. There are also endeavors to make proper connections between those two groups of terms so that consumers can search for medical information without learning the professional terms. However, it may also be the case that consumers cannot distinguish useful information in the search results without a proper conceptual understanding. Therefore, in addition to a solid mechanism that can translate consumer terms to professional terms, a repository of semantic definitions, and of relationships among those definitions, is required to support reliable consumer search. In this section, the need for semantic applications in medical information retrieval systems is discussed.
Queries formulated by healthcare consumers usually contain multiple concepts, resulting in a deeper level of information structure in the search results. A pivotal point in organizing information from the search results is to recognize equivalent or semantically close concepts and make a selection accordingly. As Lorence and Spink (2004) argue, one-to-one mapping of a term in the consumer vocabulary to a term in the professional one is likely to be problematic, since it does not identify complex relationships among terms, e.g., relationships through a specialization hierarchy. As a consequence, a consumer seeking medical information online can be overwhelmed by the cognitive challenge of estimating the semantic distances between a term and its various possible meanings. In other words, search results obtained through a one-to-one mapping "translator" do not provide contextual cues that can help satisfy the consumer's medical information need and hence enrich the search result.
Biomedical vocabularies and ontologies can play a critical role in the process of integrating healthcare information with appropriate semantics. Not only in academia but also in commercial domains, efforts are underway to integrate biomedical information with the proper context of consumer needs. For example, healthcare applications in this area are being designed and analyzed, e.g., the Semantic Knowledge Representation Project (National Library of Medicine, n.d.), which provides scalable semantic representation of biomedical free text.
Internet-based medical search technologies for consumers, in conjunction with consumers' specific levels of healthcare inquiry, are evolving to fulfill an increasing need to handle language use less strictly and with fewer cultural differences. Therefore, defining and storing semantic relationships among consumer and professional vocabularies can be the next step of system design in medical information retrieval.
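One way to picture this step is a small concept graph in which lay terms, professional terms, and specialization (is-a) links are stored together, so that a consumer query can be expanded to semantically related professional concepts rather than matched one-to-one. The concepts and relations below are illustrative and are not drawn from the UMLS Semantic Network:

```python
# Hypothetical concept graph: synonyms map lay terms to concepts,
# and BROADER records is-a (specialization) relationships.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "chest pain": "angina pectoris",
}
BROADER = {
    "myocardial infarction": "ischemic heart disease",
    "angina pectoris": "ischemic heart disease",
    "ischemic heart disease": "cardiovascular disease",
}

def expand_query(lay_term, levels=2):
    """Translate a lay term and walk up the hierarchy to add broader concepts."""
    concept = SYNONYMS.get(lay_term.lower())
    if concept is None:
        return []
    expansion = [concept]
    for _ in range(levels):
        concept = BROADER.get(concept)
        if concept is None:
            break
        expansion.append(concept)
    return expansion

print(expand_query("heart attack"))
# ['myocardial infarction', 'ischemic heart disease', 'cardiovascular disease']
```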
CONCLUSION

Designing consumer information retrieval systems in the domain of biomedical informatics poses a tremendous challenge, in that many variables have to be accounted for in the functional logic of such systems. This paper has presented an overview of consumer informatics and consumer vocabulary mapping in relation to consumers' medical information searches on the Internet. The studies mentioned in the earlier sections examined the mismatch between consumer vocabulary and professional vocabulary, while the studies reviewed in later sections investigated term mapping. The growing opportunity for consumers to gain knowledge about specific health conditions on the Internet has significant implications for the design of medical information retrieval systems. This paper contended that defining and storing relationships among consumer and professional terms is required to better facilitate the retrieval of medical information.
REFERENCES

Chapman, K., Abraham, C., Jenkins, V., & Fallowfield, L. (2003). Lay understanding of terms used in cancer consultations. Psycho-Oncology, 12, 557-566.

Haux, R. (1997). Aims and tasks of medical informatics. International Journal of Medical Informatics, 44, 9-20.

Lange, L. L. (1996). Representation of everyday clinical nursing language in UMLS and SNOMED. Paper presented at the AMIA Annual Fall Symposium.

Lerner, E. B., Jehle, D. V. K., Janicke, D. M., & Moscati, R. M. (2000). Medical communication: Do our patients understand? American Journal of Emergency Medicine, 18(7), 764-766.

Lewis, D., Brennan, P. F., McCray, A. T., Tuttle, M., & Bachman, J. (2001). If we build it, they will come: Standardized consumer vocabularies. Paper presented at Medinfo.

Logana, R. A., & Tse, T. (2007). A multidiscipline conceptual framework for consumer health informatics. Medinfo, 12(2), 1169-1173.

Lorence, D. P., & Spink, A. (2004). Semantics and the medical Web: A review of barriers and breakthroughs in effective healthcare query. Health Information and Libraries Journal, 21, 109-116.

McCray, A. T., Loane, R. F., Browne, A. C., & Bangalore, A. K. (1999). Terminology issues in user access to Web-based medical information. Paper presented at the American Medical Informatics Association Annual Symposium.

National Library of Medicine, U.S. (n.d.). Semantic knowledge representation. Retrieved September 1, 2006, from http://skr.nlm.nih.gov/

Nelson, R., & Ball, M. J. (Eds.). (2004). Consumer informatics: Applications and strategies in cyber health care. New York: Springer.

Smith, C. A., Stavri, P. Z., & Chapman, W. W. (2002). In their own words? A terminological analysis of e-mail to a cancer information service. Paper presented at the American Medical Informatics Association Annual Symposium.

Soergel, D., Tse, T., & Slaughter, L. (2004). Helping healthcare consumers understand: An "interpretive layer" for finding and making sense of medical information. Medinfo, 11(2), 931-935.

Tse, T., & Soergel, D. (2003). Exploring medical expressions used by consumers and the media: An emerging view of consumer health vocabularies. Paper presented at the American Medical Informatics Association 2003 Annual Symposium.

Warner, H. R. (1995). Medical informatics: A real discipline? Journal of the American Medical Informatics Association, 2(4), 207-214.

Zeng, Q. T., Kogan, S., Ash, N., Greenes, R. A., & Boxwala, A. A. (2002). Characteristics of consumer terminology for health information retrieval. Methods of Information in Medicine, 41, 289-298.

Zeng, Q. T., & Tse, T. (2006). Exploring and developing consumer health vocabularies. Journal of the American Medical Informatics Association, 13, 24-29.
KEY TERMS

Consumer Informatics: A branch of medical informatics that concerns healthcare consumers' information needs, concentrating on their access to medical information by studying methods and models of healthcare consumers' use of information systems.

Information Retrieval: A discipline that concerns the effective transfer of information between the creator of information and the user of information. From the system perspective, the representation, storage, organization, access, and distribution of information are studied, while from the user perspective, various information-seeking models and how to satisfy users' information needs are studied.

Medical Literature Analysis and Retrieval System Online (MEDLINE): A database of literature in life science and medical informatics.

Medical Subject Headings (MeSH): A controlled vocabulary and/or metadata system of medical subjects for the purpose of indexing life science literature.

Ontology: A structural model that represents concepts, and relations among the concepts, within a domain in order to reason about objects in the domain.
Unified Medical Language System (UMLS): A controlled medical vocabulary with defined terms and structured mappings among the terms.

Vocabulary Mapping: A translation mechanism that links one language to another to promote communication between the users of different languages.
Chapter XLIV
A Survey of Selected Software Technologies for Text Mining

Richard S. Segall
Arkansas State University, USA

Qingyu Zhang
Arkansas State University, USA
Abstract

This chapter presents background on text mining, along with comparisons and summaries of seven selected software packages for text mining. The text mining software selected for discussion and comparison in this chapter are: Compare Suite by AKS-Labs, SAS Text Miner, Megaputer Text Analyst, Visual Text by Text Analysis International, Inc. (TextAI), Megaputer PolyAnalyst, WordStat by Provalis Research, and SPSS Clementine. This chapter not only discusses unique features of these text mining software packages but also compares the features offered by each in the following key steps of analyzing unstructured qualitative data: data preparation, data analysis, and result reporting. A brief discussion of Web mining and its software is also presented, as well as conclusions and future trends.
Introduction

The growing accessibility of textual knowledge applications and online textual sources has caused a boost in text mining and Web mining research. This chapter presents comparisons and summaries of selected software for text mining, reviewing the features offered by each package in the following key steps of analyzing unstructured qualitative data: data preparation, including importing, parsing, and cleaning; data analysis, including association and clustering; and result presentation/reporting, including plots and graphs.
Background of Text Mining

Hearst (2003) defines text mining (TM) as "the discovery of new, previously unknown information, by automatically extracting information from different written sources." Simply put, text mining is the discovery of useful and previously unknown "gems" of information from textual document repositories. Hearst (2003) also distinguishes text mining from data mining by noting that with "text mining the patterns are extracted from natural language rather than from structured database of facts." A more technical definition of text mining is given by Woodfield (2004), author of the SAS notes for Text Miner, as a process that employs a set of algorithms for converting unstructured text into structured data objects, together with the quantitative methods used to analyze these data objects.
Text mining (TM) or text data mining (TDM) has been discussed by numerous investigators, including Hearst (1999), Cerrito (2003) for its application to coded information, Hayes et al. (2005) for software engineering, Leon (2007) for identifying drug, compound, and disease literature, and McCallum (1998) for statistical language modeling. Firestone (2005) emphasizes the importance of text mining in future knowledge work. Romero and Ventura (2007) survey text mining applications in the educational setting. Kloptchenko et al. (2004) use data and text mining techniques for analyzing financial reports. Mack et al. (2004) describe the value of text analysis in biomedical research for life science. Baker and Witte (2006) discuss mutation mining to support the activities of protein engineers. Uramoto et al. (2004) utilized a text-mining system developed by IBM and named TAKMI (Text Analysis and Knowledge Mining) for use with very large collections of biomedical text documents. An extension of TAKMI, named MedTAKMI, was capable of mining the entire MEDLINE collection of 11 million biomedical journal abstracts. The TAKMI system allows deeper relationships among biomedical concepts to be extracted through the use of natural language techniques. Scherf et al. (2005) discuss applications of text mining in literature search to improve accuracy and relevance. Kostoff et al. (2001) combine data mining and citation mining to identify a user community and its characteristics by categorizing articles.
There is a Text Mining Research Group (TMRG) (2002) at the University of Waikato in New Zealand that maintains a Web page of related publications, links, and software. Similarly, there is the National Centre for Text Mining (NaCTeM) at the University of Manchester in the United Kingdom (UK). The aims and objectives of NaCTeM are described in an article by Ananiadou et al. (2005), which extensively discusses the need for text mining in biology. According to their Web site of 2002, "text mining uses recall and precision (borrowed from the information retrieval research community) to measure the effectiveness of different information extraction techniques, allowing quantitative comparisons to be made." A text mining workshop was held in 2007 in conjunction with the Seventh Society for Industrial and Applied Mathematics (SIAM) Conference on Data Mining (SDM 2007). Textbooks in text mining include applications to biology and biomedicine by Ananiadou and McNaught (2006). Figure 1, from Liang (2003), shows the text mining process from text preprocessing to analyzing results.
Saravanan et al. (2003) discuss how to automatically clean data, i.e., summarize domain-specific information tailored to users' needs, by discovering classes of similar items that can be grouped into prescribed domains. Hersh (2005) evaluates different text-mining systems for information retrieval. Turmo et al. (2006) describe and compare different approaches to adaptive information extraction from textual documents and different machine learning techniques. Amir et al. (2005) describe a new tool called maximal associations, which allows the discovery of interesting associations often lost by regular association rules. Spasic et al. (2005) discuss the use of ontologies and text mining to automatically extract information and facts, discover hidden associations, and generate hypotheses germane to user needs.
Figure 1. Text mining process (source: Liang, 2003). The figure depicts the stages of the text mining process: text preprocessing (syntactic/semantic text analysis), features generation (bag of words), features selection (simple counting, statistics), text/data mining (classification or supervised learning, clustering or unsupervised learning), and analyzing results.
Ontologies specify the interpretations of terms, echo the structure of the domain, and thus can be used to support automatic semantic analysis of textual information. Seewald et al. (2006) describe an application for relevance assessment in multi-document summarization; to characterize certain document collections by a list of pertinent terms, they propose a term utility function, which allows a user to define parameters for a continuous trade-off between precision and recall. Visa et al. (2002) develop a new methodology based on prototype matching to extract document contents. Hirsch et al. (2005) describe a novel approach that uses genetic programming to create classification rules and provide a basis for text mining applications. Yang and Lee (2005) develop an approach to automatically generate category themes and reveal their hierarchical structure; category themes and their hierarchical structures are mostly determined by human experts, but with this approach text documents can be categorized and classified automatically. Fan et al. (2005) describe a method using genetic programming to discover new ranking functions in the information-seeking task for better precision and recall. Wu et al. (2006) describe a key phrase identification program that extracts document key phrases for effective document clustering, automatic text summarization, development of search engines, and document classification. Cody et al. (2004) discuss the integration of business intelligence and knowledge management based on an OLAP model enhanced with text analysis. Trumbach (2006) uses text mining to narrow the gap between the information needs of technology managers and analysts' derived knowledge by analyzing databases for trends, recognizing emerging activity, and monitoring competitors. Srinivasan (2006) develops an algorithm to generate interesting hypotheses from a set of text collections using the Medline database; this is a fruitful path to ranking new terms representing novel relationships and making scientific discoveries by text mining.
Metz (2003) indicates that text mining "applications are clever enough to run conceptual searches, locating, say, all the phone numbers and place names buried in a collection of intelligence communiqués. More impressive, the software can identify relationships, patterns, and trends involving words, phrases, numbers, and other data." Guernsey (2003), in an article that appeared in The New York Times, stated that text-mining programs
go further than Google and other Web search engines by "categorizing information, making links between otherwise unconnected documents and providing visual maps (some look like tree branches or spokes on a wheel) to lead users down new pathways that they might not have been aware of."
Main Focus of the Chapter: Background of Text Mining Software

A comprehensive list of text mining, text analysis, and information retrieval software is available on the Web page of KDnuggets (2007a), and similarly for Web mining and Web usage mining software from KDnuggets (2007b). Selected software from these and other resources are discussed in this chapter. Some of the popular software packages currently available for text mining include Compare Suite, SAS Text Miner, Megaputer Text Analyst, Visual Text by Text Analysis International, Inc. (TextAI), Megaputer PolyAnalyst, WordStat, and SPSS Clementine for text mining. These packages provide a variety of graphical views and analysis tools with powerful capabilities for discovering knowledge from text databases. The main focus of this chapter is to compare and discuss them and to provide sample output from each as a visual comparison.
Table 1. Text mining software

Software compared (columns): Compare Suite, SAS Text Miner, Text Analyst, Visual Text, Megaputer PolyAnalyst, WordStat, SPSS Clementine.

Features compared (rows), marked in the table as present, absent, or available as an add-on for each package:
Data Preparation: text parsing and extraction; define dictionary; automatic text cleaning.
Data Analysis: categorization; filtering; concept linking; text clustering; dimension reduction techniques; natural language query.
Results Reporting: interactive results window; support for multiple languages.
Unique features noted: report generation and a compare-two-folders feature; export of any table to Excel; a linguistic rather than statistics-based approach; a multi-path multi-paradigm analyzer.
Figure 2. Compare Suite with two animal text files
As a visual comparison of the features of these seven selected text mining software packages, the authors of this chapter constructed Table 1, where essential functions are indicated as being either present or absent with regard to data preparation, data analysis, results reporting, and unique features. As Table 1 shows, Compare Suite and Text Analyst have minimal text mining capabilities, while Megaputer PolyAnalyst, SAS Text Miner, WordStat, and SPSS Clementine have extensive text mining capabilities. Visual Text has add-ons that make the software more versatile.
Results

1. Compare Suite

Compare Suite is text mining software developed by AKS-Labs, whose headquarters are located in Raleigh, USA. The software allows comparing any file to any other file, including formats such as text files, MS Word, MS Excel, PDF, Web pages, zip archives, and binary files. It allows comparing two files character by character, word by word, or by key words. Two folders can also be compared to find the changes made and the files they contain. After a comparison, a report can be created that includes detailed comparison information. Documents can also be compared online through server-side comparison.
Compare Suite provides the user with a window of options for comparing text files. As illustrated in Figure 2, two animal text files are compared word by word and the results are reported. In the results, words that appear in both files are highlighted in green, and words that appear in only one of the two files are highlighted in purple.
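The word-by-word comparison that Compare Suite reports can be approximated in a few lines of code: split each text into word sets and separate the shared words from those unique to each file. This is only an approximation of the idea, not AKS-Labs' actual algorithm:

```python
import re

def words(text):
    """Lower-cased word set of a piece of text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def compare_texts(text_a, text_b):
    a, b = words(text_a), words(text_b)
    return {
        "common": sorted(a & b),      # Compare Suite highlights these in green
        "only_in_a": sorted(a - b),   # and the words unique to each file in purple
        "only_in_b": sorted(b - a),
    }

# In practice the two texts would be read from the files being compared.
doc1 = "The lion is a large cat that lives in Africa."
doc2 = "The tiger is a large cat that lives in Asia."
print(compare_texts(doc1, doc2))
```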
2. SAS Text Miner

SAS Text Miner is actually an "add-on" to SAS Enterprise Miner, adding an extra icon in the "Explore" section of the tool bar. SAS Text Miner performs simple statistical analysis, exploratory analysis of textual data, clustering, and predictive modeling of textual data. It uses the "drag-and-drop" principle: the selected icon is dragged from the tool set and dropped into the workspace. The SAS Text Miner workspace was constructed with a data icon for selected animal data provided by SAS in their Instructor's Trainer Kit. Figure 3 shows the results of using SAS Text Miner, with individual plots for "role by frequency", "number of documents by frequency", "frequency by weight", "attribute by frequency", and a "number of documents by frequency" scatter plot. Figure 4 shows the interactive mode of SAS Text Miner, which includes the text of each document, the respective weight of each term, and the concept linking figure using the SASPDFSYNONYMS text file.
Figure 3. Results of SAS Text Miner for animal text
Figure 4. Interactive window of SAS Text Miner for animal text
Figure 5. Regression results of SAS Text Miner using Federalists Papers
Figure 6. Text Analyst semantic network window of document databasing in the 90’s
Figure 5 shows the regression results window in SAS Text Miner using data for the Federalists' Papers from the SAS instructor's trainer kit.
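The term weights shown in the interactive window (Figure 4) reflect the kind of frequency-based weighting that text mining tools compute before clustering or predictive modeling; tf-idf is the classic example. The sketch below shows a generic tf-idf calculation, not SAS Text Miner's specific weighting options:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute tf-idf weights for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))          # document frequency of each term
    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append({term: tf[term] * math.log(n_docs / doc_freq[term])
                        for term in tf})
    return weights

docs = [["lion", "big", "cat"], ["tiger", "big", "cat"], ["whale", "big", "mammal"]]
for w in tf_idf(docs):
    print(w)   # terms appearing in every document (e.g. 'big') get weight 0
```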
3. Megaputer Text Analyst

TextAnalyst has an ActiveX suite for dealing with text and semantic analysis. According to the Megaputer (2007c) Web page, TextAnalyst uses a semantic network similar to a molecular structure, determines the relative importance of a text concept solely by analyzing its connections to other concepts in the text, and implements algorithms similar to those used for text analysis in the human brain. Figure 6 shows a representative screen shot of Megaputer TextAnalyst, which consists of a view pane in the top left, a results pane in the top right, and a text pane in the bottom part of the window. The view pane shows each of the nodes in the semantic tree, each of which can be expanded. Megaputer Text Analyst uses a semantic search window where a query can be entered as full sentences or questions instead of having to determine key words or phrases. A summary file in the top left pane of the Megaputer Text Analyst window can list the most important sentences in the context of the original document. The summary chooses the sentences on the basis of the concepts, and the relationships between concepts, in the full text.
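The underlying idea of ranking a concept by its connections to other concepts can be sketched with a simple co-occurrence graph: concepts that co-occur in a sentence are linked, and a concept's importance is scored from the strength of its links. This is only an illustration of the general principle, not TextAnalyst's proprietary algorithm:

```python
import re
from collections import defaultdict
from itertools import combinations

def concept_graph(text, concepts):
    """Link concepts that co-occur in the same sentence; edge weight = co-occurrence count."""
    edges = defaultdict(int)
    for sentence in re.split(r"[.!?]", text.lower()):
        present = sorted(c for c in concepts if c in sentence)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges

def rank_concepts(edges):
    """Score each concept by the total weight of its connections."""
    score = defaultdict(int)
    for (a, b), weight in edges.items():
        score[a] += weight
        score[b] += weight
    return sorted(score.items(), key=lambda kv: -kv[1])

text = ("Databases grew quickly in the 90's. Relational databases dominated the market. "
        "Object databases targeted niche markets.")
concepts = {"databases", "relational", "object", "market"}
print(rank_concepts(concept_graph(text, concepts)))
```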
Figure 7. Screen-shot of Visual Text window
Figure 8. Parsing trees in Visual Text
4. VisualText by TextAI

VisualText by TextAI (Text Analysis International, Inc.) uses natural language processing, including "analyzers" for extracting information from text repositories. Some applications include databases of resumes, Web pages, and even e-mail or Web chat databases. VisualText allows users to create their own text analyzers and also includes TAI Parse for tagging parts of speech and chunking. Voice data needs to be converted to text before text processing can be performed. According to the VisualText Web page (2005), it can be used for combating terrorism, narcotics espionage, and nuclear proliferation, and for filtering documents, test grading, and automatic coding. VisualText also allows natural language query, which is the ability to ask a computer questions using plain language. Figures 7 and 8 provide screen shots of VisualText. Figure 7 shows the analyzer on the left and the root of the text zone on the right. Figure 8 shows parsing trees. A window in VisualText can show the dictionary, with an expansion of the root of the text zone in the right part of the window.
5. Megaputer PolyAnalyst

Previous work by the authors, Segall and Zhang (2009), utilized Megaputer PolyAnalyst for data mining. The new release of PolyAnalyst, version 6.0, includes text mining and, specifically, new features for Text OLAP (on-line analytical processing) and taxonomy-based categorization, which is useful when dealing with large collections of unstructured documents, as discussed in Megaputer Intelligence Inc. (2007).
Figure 9. Workspace for text mining in Megaputer PolyAnalyst
Figure 10. Keyword extraction window of text analysis in Megaputer PolyAnalyst
Figure 11. Initialization of link term report in Megaputer PolyAnalyst
The latter notes that taxonomy-based classifications are useful when dealing with large collections of unstructured documents, for example for tracking the number of known issues in product repair notes and customer support letters. According to Megaputer Intelligence Inc. (2007), PolyAnalyst "provides simple means for creating, importing, and managing taxonomies, and carries out automated categorization of text records against existing taxonomies." Megaputer Intelligence Inc. (2007) provides examples of applications for executives, customer support specialists, and analysts; for example, "executives are able to make better business decisions upon viewing a concise report on the distribution of tracked issues during the latest observation period." This chapter provides several actual screen shots of Megaputer PolyAnalyst version 6.0 for text mining: Figure 9 shows the workspace for text mining in Megaputer PolyAnalyst, Figure 10 shows the keyword extraction window for the word "breakfast" from written customer comments, and Figure 11 shows the initialization of a link term report. Megaputer PolyAnalyst can also provide screens with drill-down text analysis and a histogram plot of text analysis.
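Taxonomy-based categorization of the kind PolyAnalyst automates can be sketched as matching each text record against keyword lists attached to taxonomy nodes and tallying the distribution of issues. The taxonomy and comments below are invented for illustration and do not come from Megaputer:

```python
from collections import Counter

# Hypothetical issue taxonomy: category -> trigger keywords.
TAXONOMY = {
    "billing": ["invoice", "charged", "refund"],
    "breakfast quality": ["breakfast", "coffee", "buffet"],
    "room cleanliness": ["dirty", "clean", "housekeeping"],
}

def categorize(record, taxonomy=TAXONOMY):
    """Assign every taxonomy category whose keywords appear in the record."""
    text = record.lower()
    return [cat for cat, keywords in taxonomy.items()
            if any(k in text for k in keywords)]

def issue_distribution(records):
    """Concise report: how many records fall into each tracked category."""
    dist = Counter()
    for r in records:
        dist.update(categorize(r))
    return dist

comments = [
    "The breakfast buffet was cold and the coffee was weak.",
    "I was charged twice and still waiting for a refund.",
    "Room was dirty on arrival; housekeeping fixed it quickly.",
]
print(issue_distribution(comments))
```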
6. WordStat

WordStat is developed by Provalis Research. It is a text analysis software module that runs on a base product, SimStat or QDA Miner. It can be used to study textual information such as interviews, answers to open-ended questions, journal articles, electronic communications, and so on. WordStat may also be used for categorizing text automatically using a dictionary approach, or for developing and validating new categorization dictionaries or taxonomies.
Figure 12. Screen shot of categorization dictionary window using WordStat (source: http://www.provalisresearch.com/WordStat/WordStatFlashDemo.html)
Figure 13. 3-D View of correspondence analysis results using WordStat (source: http://www.provalisresearch.com/WordStat/WordStatFlashDemo.html)
WordStat incorporates many data analysis and graphical tools that can be used to explore relationships between document contents and information stored in categorical or numeric variables. Hierarchical clustering and multidimensional scaling analysis can be used to identify relationships among categories and document similarity. Correspondence analysis and plots can be used to explore relationships between keywords and different groups. An input file (e.g., an Excel file) can be imported into the software for analysis. An important preliminary to WordStat analysis is to create a categorization dictionary (which requires domain knowledge). WordStat analysis then consists of many tabulations or cross-tabulations of the different categories. Figure 12 shows a screen shot of a categorization dictionary window. Correspondence analysis results and 3-D plots produced by WordStat are illustrated in Figure 13.
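The dictionary-plus-cross-tabulation workflow can be illustrated briefly: each response is scored against a categorization dictionary, and the category counts are cross-tabulated against a grouping variable. The dictionary and survey data here are invented for illustration and are not WordStat's bundled content:

```python
from collections import defaultdict

# Hypothetical categorization dictionary: category -> words counted toward it.
DICTIONARY = {
    "price": {"price", "expensive", "cheap", "cost"},
    "service": {"staff", "service", "friendly", "rude"},
}

def crosstab(responses, dictionary=DICTIONARY):
    """Cross-tabulate dictionary categories against a grouping variable."""
    table = defaultdict(lambda: defaultdict(int))
    for group, text in responses:                 # (group, open-ended answer)
        words = set(text.lower().split())
        for category, terms in dictionary.items():
            if words & terms:
                table[group][category] += 1
    return {g: dict(cats) for g, cats in table.items()}

responses = [
    ("new customers", "the staff were friendly but the price was high"),
    ("new customers", "too expensive for what you get"),
    ("returning customers", "service is always quick and friendly"),
]
print(crosstab(responses))
```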
7. SPSS Clementine Text mining for Clementine is text mining software developed by SPSS Inc. It can be used to extract concepts and relationships from textual data. It can also be used to convert an unstructured format to a structured one for creating predictive models. Text mining for Clementine can be accessed directly from the interface of SPSS Inc.’s leading data mining software “Clementine”. Text mining for Clementine can process many types of unstructured data, including texts in MS office files, survey text responses, call center notes, Web forms, and Web logs, blogs, Web feeds, streams,
Figure 14. Screen shot of SPSS Clementine for text analysis (Source: http://www.spss.com/textanalysis_surveys/demo.htm)
and so on. It uses a natural language processing (NLP) linguistic extraction process. It has a graphical interface and is easy to use. Dictionaries can be customized for specific domain areas by using the built-in Resource Editor. It allows extracting text from multiple languages such as Dutch, English, French, German, Italian, Portuguese, or Spanish. It can also process text translated into English from 14 languages including Arabic, Chinese, Persian, Japanese, and Russian. Figure 14 shows a screen shot of category analysis using SPSS Clementine.
Future Trends/Conclusion This chapter has shown that software for text mining is a new and expanding area of technology in which numerous vendors compete with each other to provide both unique and common features. Users needing text mining are fortunate to have such resources available for needs that were unthinkable a decade ago. Future trends are that text mining software will continue to grow in the dimensionality of its features and in the range of available software. The applications of software for text mining will be extremely diverse, ranging from uses in customer survey responses to drill-downs in medical records or credit or bank reports. A future direction of this work is to pursue the area of Web mining software and contrast it with that of text mining. According to the TMRG (Text Mining Research Group) Web site, Web mining is defined to be "the slightly more general case of looking for patterns in hypertext and often applies graph theoretical approaches to detect and utilize the structure of Web sites." Web mining entails mining of content, structure, and usage. Web content mining entails both Web page content mining and search
result mining. Web structure mining uses interconnections between Web pages to give weight to pages. Web usage mining is the application that uses data mining to analyze and discover interesting patterns in users' usage of data on the Web. Some of the popular software that could be selected for Web mining for future comparison includes Megaputer WebAnalyst, SAS Web Analytics, SPSS Web Mining for Clementine, WebLog Expert 2.0, and Visitator.
Acknowledgment The authors would like to acknowledge funding in support of this research from the Summer Faculty Research Grant awarded to both authors by the College of Business at Arkansas State University. The authors would also like to express their gratitude to SAS Inc. for the Mini-Grant provided for SAS Text Miner, and to TextAI and Megaputer Inc. for their extremely helpful technical support.
References Allan, J., Kumar, V., and Thompson, P. (2000). Institute for Mathematics and Its Applications (IMA). IMA Hot Topics Workshop: Text Mining. April 17-18, 2000, University of Minnesota, Minneapolis, MN. Amir, A., Aumann, Y., Feldman, R. and Fresko, M., (2005). Maximal association rules: a tool for mining associations in text. Journal of Intelligent Information Systems, 25(3), 333–345. Ananiadou, S. and McNaught, J. (eds) (2006). Text Mining for Biology and Biomedicine, Artech House Publishers, Boston, MA. ISBN 1-58053-984-X. Ananiadou, S., Chruszcz, J., Keane, J., McNight, J., Watry, P. (2005). Ariadne, 42. Retrieved January 2005, from http://www.ariadne.ac.uk/issue42/ananiadou Baker, C. and Witte, R., (2006). Mutation mining - A prospector’s tale. Information Systems Frontier, 8, 47–57. Cerrito, P.B., Badia, A., and Cox, J., (2003). The Application of Text Mining Software to Examine Coded Information. Proceedings of SIAM International Conference on Data Mining, San Francisco, CA, May 1-3. Cody, W., Kreulen, J., Krishna, V., and Spangler, W., (2002). The integration of business intelligence and knowledge management. IBM Systems Journal, 41(4), 697-713. Fan, W. P. (2006). Text Mining, Web Mining, Information Retrieval and Extraction from the WWW References. Retrieved from http://filebox.vt.edu/users/wfan/text_mining.html Fan, W., Gordon, M., and Pathak, P., (2005). Genetic programming-based discovery of ranking functions for effective Web search. Journal of Management Information Systems, 21(4), 37-56. Firestone, J., (2005). Mining for information gold. Information Management Journal, 47-52.
Grobelnik, M. and Mladenic, D. (n.d.) Text-Garden — Text-Mining Software Tools. Retrieved from http://kt.ijs.si/Dunja/textgarden/ Guernsey, L. (2003). Digging for Nuggets of Wisdom. The New York Times, October 16, 2003. Hayes, J. H., Dekhtyar, Sundaram, S., (2005). Text Mining for Software Engineering: How Analyst Feedback Impacts Final Results. International Conference on Software Engineering: Proceedings of the 2005 International Workshop on Mining Software Repositories, St Louis, MO, May. Hearst, M. A. (1999). Untangling Text Data Mining, School of Information Management & Systems, University of California at Berkeley. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), June 20-26. Hearst, M. A. (2003). What is Data Mining? Retrieved from http://www.ischool.berkeley.edu/~hearstr/ text_mining.html Hersh, W., (2005). Evaluation of biomedical text-mining systems: lessons learned from information retrieval. Briefings in Bioinformatics, 6(4), 344-356. Hirsch, L., Saeedi, M., and Hirsch, R. (2005). Evolving text classification rules with genetic programming. Applied Artificial Intelligence, 19, 659–676 Jin, X., Zhou. Y., and Mobasher, B. (2004). Web usage mining based on probabilistic latent semantic analysis, Conference on Knowledge Discovery in Data. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 197-205. KDNuggets (2005). Polls: Text Mining Tools. Retrieved January, 2005 from http://www.kdnuggets,com/ polls/2005/text_mining_tools.htm KDNuggets (2007). Text Analysis, Text Mining, and Information Retrieval Software. Retrieved from http:///www.kdnuggets.com/software/text.html Kloptchenko, A., Eklund, T., Karlsson, J., Back, B. and Vanhar, H., (2004). Combining data and text mining techniques for analysing financial reports. Intelligent Systems in Accounting, Finance and Management, 12(1), 29-41. Kostoff, R., Rio, J., Humenik, J., Garcia, E., and Ramirez, A., (2001). Citation mining: Integrating text mining and bibliometrics for research user. Journal of the American Society for Information Science and Technology, 52(13), 1148-1156. Leon, D.A., (2007). Using text mining software to identify drug, compound, and disease relationships in the literature. In Proceedings of 233rd ACS National Meeting, Chicago, IL, March 25-29. Retrieved from http://acsinf.org/docs/meetings/233nm/abs/CINF/073.htm Liang, J.W.(2003). Introduction to Text and Web Mining, Seminar at North Carolina Technical University. Retrieved from http://www.database.cis.nctu.edu.tw/seminars/2003F/TWM/slides/p.ppt Lieberman, H.(2000). Text Mining in Real Time, Talk Abstract. Retrieved from http://www.ima.umn. edu/reactive/abstract/lieberman1.html
Lynn, A. (2004). Mellon grant to fund project to develop data-mining software for libraries. Retrieved from http://www.news.uiuc.edu/news/04/1025mellon.html Mack, R., Mukherjea, S., Soffer, A., and Uramoto, N., et al., (2004). Text analytics for life science using the unstructured information management. IBM Systems Journal, 43(3), 490-515. McCallum, A. (1998). Bow: A Toolkit for statistical language modeling, Text Retreival, Classification and Clustering. Retrieved from http;//www.cs.cmu.edu/~mccallum/bow/ Megaputer Intelligence Inc. (2000). Tutorial: TextAnalyst Introduction. Retrieved from http://www. megaputer.com/products/ta/tutorial/textanalyst_tutorial_1.html Megaputer Intelligence Inc. (2007). Data Mining, Text Mining, and Web Mining Software. Retrieved from http:///www.megaputer.com Megaputer Intelligence Inc. (2007). Text OLAP. Retrieved from http://www.megaputer.com/products/ pa/algorithms/text_oplap.php3 Megaputer Intelligence Inc. (2007). TextAnalyst. Retrieved from http://www.megaputer.com/products/ ta/index.php3 Megaputer Intelligence Inc. (2007). WebAnalyst, Benefits of Web Data Mining. Retrieved from http:// www.megaputer.com/products/wa/benefits/php3 Megaputer Intelligence Inc. (2007). WebAnalyst, Introduction to Web Data Mining. Retrieved http://www. megaputer.com/products/wa/intro.php3 Metz. C. (2003). Software: Text mining. PC Magazine. Retrieved from http://www.pcmag.com/print_article2/0,1217.a=43573,00.asp Miller, T. M. (2005). Data and Text Mining: A Business Applications Approach. Pearson Prentice Hall, Upper Saddle River, NJ. Rajman, M. and Besann, R. (1997). Text Mining: Natural Language Techniques and Text mining Applications. Proceedings of the 7th IFIP2.6 Working Conference on Database Semantics (DS-7), Leysin, Switerland, October 1997. Robb, D., (2004). Text Mining tools take on unstructured data. Computerworld, June 21 Romero, C. and Ventura, S., (2007). Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications, 33, 135–146. Saravanan, M.,Reghuraj, C., and Raman, S. (2003). Summarization and categorization of text data in high-level data cleaning for information retrieval. Applied Artificial Intelligence, 17, 461–474. SAS Raises Bar on Data, Text Mining (2005). GRID Today. Retrieved August 29, 2005 from www. gridtoday.com/grid/460386.html Scherf, M., Epple, A., and Werner, T. (2005). The next generation of literature analysis: integration of genomic analysis into text mining. Briefings in Bioinformatics, 6(3), 287-297.
Seewald, A., Holzbaur, C., and Widmer, G., (2006). Evaluation of term utility functions for very short multidocument summaries. Applied Artificial Intelligence, 20, 57–77. Segall, R.S. and Zhang, Q. (2009). Comparing four-selected data mining software. Encyclopedia of Data Warehousing and Mining, Chapter XLV, Edited by Jon Wang. IGI Global, Inc., 269-277. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2000, Boston, MA. Slynko, Y. and Ananyam, S. (2007). WebAnalyst Server – universal platform for intelligent e-business. Retrieved from http://www.megaputer.com/tech/wp/wm.php3 Spasic, I., Ananiadou, S., McNaught, J., and Kumar, A., (2005). Text mining and ontologies in biomedicine: making sense of raw text. Briefings in Bioinformatics, 6(3), 239-251. Srinivasan, P. (2004). Text mining: Generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology, 55(5), 396-413. Survey of Text mining: Clustering, Classification, and Retrieval, Springer-Verlag. Text mining (2007). Held in conjunction with Seventh SIAM International Conference on Data Mining, Minneapolis, MN, April 28, 2007. Text mining Research Group at the University of Waikato (2002). Text mining, Computer Science Department. University of Waikato, Hamilton, New Zealand. Retrieved from http://www.cs.waikato. ac/nz/~nzdl/textmining The Lemur Project (2007). The Lemur Toolkit for Language Modeling and Information Retrieval. Retrieved from http://www.lemurproject.org/index.php?version=print Trumbach, C., (2006). Addressing the information needs of technology managers: making derived information usable. Technology Analysis & Strategic Management, 18(2), 221–243. Turmo, J., Ageno, A., and Catala, N., (2006). Adaptive Information Extraction. ACM Computing Surveys, 38(2), 1-47. Uramoto, N., Matsuzawa, H, Nagano, T., Murakami, A., Takeuchi, H., and Takeda, K., (2004). A textmining system for knowledge discovery from biomedical documents. IBM Systems Journal, 43(3), 516-533. Uramoto, N., Matsuzawa, H., Nagano, T., Murakami, A., Takeuchi, and Ta, K, (2004). Unstructured Information Management, 43(3). Visa, A., Toivonen, J., Vanharanta, H., and Back, B., (2002). Contents matching defined by prototypes: Methodology verification with books of the bible. Journal of Management Information Systems, 18(4), 87-100. Weiss, S. M., Indurkhya, N., Zhang,T, and Damerau, F. (2005). Text mining: Predictive Methods for Analyzing Unstructured Information, Springer Press
Woodfield, Terry (2004). Mining Textual Data Using SAS Text Miner for SAS9 Course Notes, SAS Institute, Inc., Cary, NC. Wu, Y., Li, Q., Bot, R., and Chen, X., (2006). Finding Nuggets in Documents: A Machine Learning Approach, Journal of the American Society For Information Science And Technology, 57(6), 740–752. Yang, H. and Lee, C., (2005). Automatic Category Theme Identification and Hierarchy Generation for Chinese Text Categorization, Journal of Intelligent Information Systems, 25(1), 47–67.
Key Terms

Compare Suite: AKS Labs software that compares texts by keywords and highlights common and unique keywords.

Megaputer TextAnalyst: Software that offers semantic analysis of free-form texts, summarization, clustering, navigation, and natural language retrieval.

Natural Language Processing: Also known as computational linguistics; that is, using natural language, e.g., English, to do query or search.

SAS Text Miner: Software by SAS Inc. that provides a suite of text processing and analytical tools.

SPSS Text Mining for Clementine: Enables you to extract key concepts, sentiments, and relationships from call center notes, blogs, e-mails and other unstructured data, and convert them to a structured format for predictive modeling.

Text Mining: Discovery by computer of new, previously unknown information by automatically extracting information from different written resources.

Visual Text: Software manufactured by TextAI that uses a comprehensive GUI development environment (www.textanalysis.com).

WordStat: Analysis module for textual information such as responses to open-ended questions, interviews, etc.
Chapter XLV
Application of Text Mining Methodologies to Health Insurance Schedules
Ah Chung Tsoi, Monash University, Australia
Phuong Kim To, Tedis P/L, Australia
Markus Hagenbuchner, University of Wollongong, Australia
Abstract This chapter describes the application of a number of text mining techniques to discover patterns in the health insurance schedule, with the aim of uncovering any inconsistency or ambiguity in the schedule. In particular, we first apply a simple "bag of words" technique to study the text data and to evaluate the hypothesis: is there any inconsistency in the text descriptions of the medical procedures used? It is found that the hypothesis is not valid, and hence the investigation continues on how best to cluster the text. This work would have significance to health insurers by assisting them to differentiate descriptions of the medical procedures. Secondly, it would also assist the health insurer to describe medical procedures in an unambiguous manner.
Australian Health Insurance System In Australia, there is a universal health insurance system for her citizens and permanent residents. This publicly-funded health insurance scheme is administered by a federal government department called
the Health Insurance Commission (HIC). In addition, the Australian Department of Health and Ageing (DoHA), after consultation with the medical fraternity, publishes a manual called the Medicare Benefit Schedule (MBS) in which it details each medical treatment procedure and its associated rebate to the medical service providers who provide such services. When a patient visits a medical service provider, the HIC will refund or pay the medical service provider at the rate published in the MBS1 (the MBS is publicly available online from http://www.health.gov.au/pubs/mbs/mbs/css/index.htm). Therefore, the description of medical treatment procedures in the MBS should be clear and unambiguous in its interpretation by a reasonable medical service provider, as ambiguities would lead to the wrong medical treatment procedure being used to invoice the patient or the HIC. However, the MBS has developed over the years, and is derived through extensive consultations with medical service providers over a lengthy period. Consequently, there may exist inconsistencies or ambiguities within the schedule. In this chapter, we propose to use text mining methodologies to discover if there are any ambiguities in the MBS. The MBS is divided into seven categories, each of which describes a collection of treatments related to a particular type, such as diagnostic treatments, therapeutic treatments, oral treatments, and so on. Each category is further divided into groups. For example, in category 1, there are 15 groups, A1, A2, ..., A15. Within each group, there are a number of medical procedures which are denoted by unique item numbers. In other words, the MBS is arranged in a hierarchical tree manner, designed so that it is easy for medical service providers to find the appropriate items which represent the medical procedures provided to the patient.2 This underlying MBS structure is outlined in Figure 1. This chapter evaluates the following:
Figure 1. An overview of the MBS structure in the year of 1999 (the hierarchy of categories, the groups within each category, and the items within each group)
• Hypothesis — Given the arrangement of the items in the way they are organised in the MBS (Figure 1), are there any ambiguities within this classification? Here, ambiguity is measured in terms of a confusion table comparing the classification given by the application of text mining techniques and the classification given in the MBS. Ideally, if the items are arranged without any ambiguities at all (as measured by text mining techniques), the confusion table should be diagonal with zero off-diagonal terms.
• Optimal grouping — Assuming that the classification given in the MBS is ambiguous (as revealed in our subsequent investigation of the hypothesis), what is the "optimal" arrangement of the item descriptions using text mining techniques (here "optimal" is measured with respect to text mining techniques)? In other words, we wish to find an "optimal" grouping of the item descriptions such that there will be a minimum of misclassifications.
The benefits of this work are as follows:
• From the DoHA point of view, it will allow the discovery of any existing ambiguities in the MBS. In order to make procedures described in the MBS as distinct as possible, the described methodology can be employed in evaluating the hypothesis in designing the MBS such that there would not be any ambiguities from a text mining point of view. This will lead to a better description of the procedures so that there will be little misinterpretation by medical service providers.
• From a service provider's point of view, the removal of ambiguities would allow efficient computer-assisted searching. This will limit misinterpretation, and allow the implementation of a semi-automatic process for the generation of claims and receipts.
• While the "optimal grouping" process is mainly derived from a curiosity point of view, this may assist the HIC in re-grouping some of their existing descriptions of items in the MBS, so that there will be less opportunities for misinterpretation.
Obviously, the validity of the described method lies in the validity of text mining techniques in unambiguously classifying a set of documents. Unfortunately, this may not be the case, as new text mining techniques are constantly being developed. However, the value of the work presented in this chapter lies in the ability to use existing text mining techniques and to discover, as far as possible, any ambiguities within the MBS. This is bound to be a conservative measure, as we can only discover ambiguities as far as possible given the existing tools. There will be other ambiguities which remain uncovered by current text mining techniques. But at least, using our approach will clear up some of the existing ambiguities. In other words, the text mining techniques do not claim to be exhaustive. Instead, they will indicate ambiguities as far as possible, given their limitations. The structure of this chapter is as follows: In the next section, we describe what text mining is, and how our proposed techniques fall into the general fabric of text mining research. In the following section, we will describe the “bag of words” approach to text mining. This is the simplest method in that it does not take any cognizance of semantics among the words; each word is treated in isolation. In addition, this will give an answer to the hypothesis as stated above. If ambiguities are discovered by using such a simple text mining technique, then there must exist ambiguities in the set of documents describing the medical procedures. This will give us a repository of results to compare with those when we use other text mining techniques. In the next section, we describe briefly the latent semantic kernel (LSK)
technique to pre-process the feature vectors representing the text. In this technique, the intention is that it is possible to manipulate the original feature vectors representing the documents and to shorten them so that they can better represent the “hidden” message in the documents. We show results which do not assume the categories as given in the MBS.
text mining In text mining, there are two main issues: retrieval and classification (Berry, 2004).
• Retrieval techniques — used to retrieve the particular document:
  ° Keyword-based search — this is the simplest method in that it will retrieve a document or documents which match a particular set of key words provided by the user. This is often called "queries".
  ° Vector space-based retrieval method — this is often called a "bag of words" approach. It represents the document in terms of a set of feature vectors. Then, the vectors can be manipulated so as to show patterns, for example, by grouping similar vectors into clusters (Nigam, McCallum, Thrun, & Mitchell, 2000; Salton, 1983).
  ° Latent semantic analysis — this is to study the latent or hidden structure of the set of documents with respect to "semantics". Here "semantics" is taken to mean "correlation" within the set of documents; it does not mean that the technique will discover the "semantic" relationships between words in the sense of linguistics (Salton, 1983).
  ° Probabilistic latent semantic analysis — this is to consider the correlation within the set of documents within a probabilistic setting (Hofmann, 1999a).
• Classification techniques — used to assign data to classes:
  ° Manual classification — a set of documents is classified manually into a set of classes or subclasses.
  ° Rule-based classification — a set of rules as determined by experts is used to classify a set of documents.
  ° Naïve Bayes classification — this uses Bayes' theorem to classify a set of documents, with some additional assumptions (Duda, 2001).
  ° Probabilistic latent semantic analysis classification — this uses the probabilistic latent semantic analysis technique to classify the set of documents (Hofmann, 1999b).
  ° Support vector machine classification — this is to use support vector machine techniques to classify the set of documents (Scholkopf, Burges, & Smola, 1999).
This chapter explores the “bag of words” technique to classify the set of documents into clusters and compare them with those given in the MBS. The chapter also employs the latent semantic kernel technique, a technique from kernel machine methods (based on support vector machine techniques) to manipulate the features of the set of documents before subjecting them to clustering techniques.
Bag of Words

If we are given a set of m documents D = [d1, d2, ..., dm], it is quite natural to represent them in terms of a vector space representation. From this set of documents it is simple to find out the set of vocabularies used. In order that the set of vocabularies would be meaningful, care is taken by using the stemmisation technique, which regards words of the same stem as one word. For example, the words "representation" and "represent" are considered as one word, rather than two distinct words, as they have the same stem. Secondly, in order that the set of vocabularies would be useful to distinguish documents, we eliminate common words, like "the", "a", and "is", from the set of vocabularies. Thus, after these two steps, it is possible to have a set of vocabularies w1, w2, ..., wn which represents the words used in the set of documents D. Then, each document di can be represented as an n-vector whose elements denote the frequency of occurrence of each vocabulary word in the document di, with 0 if the word does not occur in the document. Thus, from a representation point of view, the set of documents D can be equivalently represented by a set of vectors V = [v1, v2, ..., vm], where vi is an n-vector. Note that this set of vectors V may be sparse, as not every word in the vocabulary occurs in every document (Nigam et al., 2000). The set of vectors V can be clustered together to form clusters using standard techniques (Duda, 2001). In our case, we consider each description of an MBS item as a document. We have a total of 4030 documents; each document may be of varying length, dependent on the description of the particular medical procedure. Table 1 gives a summary of the number of documents in each category. After taking out commonly occurring words, words with the same stem count, and so on, we find that there are a total of 4569 distinct words in the vocabulary. We will use 50% of the total number of items as the training data set, while the other 50% will be used as a testing data set to evaluate the generalisability of the techniques used. In other words, we have 2015 documents in the training data set, and 2015 in the testing data set. The content of the training data set is obtained by randomly choosing items from a particular group so as to ensure that the training data set is sufficiently rich and representative of the underlying data set. Once we represent the set of data in this manner, we can then classify the documents using a simple technique, such as the naïve Bayes classification method (Duda, 2001). The results of this classification are shown in Table 2. The percentage accuracy is, on average, 91.61%, with 1846 documents out of 2015 correctly classified. It is further noted that some of the categories are badly classified, for example, category-2 and category-4.

Table 1. An overview over the seven categories in the MBS

Category   Number of items
1          158
2          108
3          2734
4          162
5          504
6          302
7          62
Total      4030
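Before turning to the classification results in Table 2, the following is a minimal sketch of the bag-of-words and naïve Bayes pipeline just described. It is our own illustration, not the authors' code: the variables `documents` and `categories` are hypothetical placeholders for the MBS item descriptions and their category labels, and scikit-learn's built-in English stop-word list stands in for the stemming and common-word removal described above.

```python
# Minimal bag-of-words + naive Bayes sketch (illustrative, not the authors' code).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder item descriptions and MBS category labels.
documents = [
    "Professional attendance at consulting rooms by a general practitioner",
    "Professional attendance at a hospital by a general practitioner",
    "Skin and subcutaneous tissue, repair of recent wound, superficial",
    "Skin and subcutaneous tissue, repair of recent wound, involving deeper tissue",
]
categories = [1, 1, 3, 3]

# Term-frequency vectors; common words ("the", "a", "is", ...) are dropped.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# 50/50 split, mirroring the chapter's training/testing arrangement.
X_train, X_test, y_train, y_test = train_test_split(
    X, categories, test_size=0.5, random_state=0, stratify=categories)

clf = MultinomialNB().fit(X_train, y_train)
pred = clf.predict(X_test)

print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))   # rows: actual category, columns: predicted
```

The confusion matrix printed at the end has the same layout as Table 2: rows are the actual MBS categories and columns are the predicted ones.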
Table 2. A confusion table showing the classification of documents (the actual classifications as indicated in the MBS are given horizontally; classifications as obtained by the naïve Bayes method are presented vertically)

Category   1    2    3     4    5    6    7    Total   % Accuracy
1          79   0    0     0    0    0    0    79      100.00
2          1    25   9     0    12   7    0    54      46.30
3          12   3    1323  15   10   3    1    1367    96.78
4          1    0    62    18   0    0    0    81      22.22
5          0    3    18    0    229  1    1    252     90.87
6          0    2    0     0    1    148  0    151     98.01
7          3    0    1     1    2    0    24   31      77.42
Table 3. Category-4 Items 52000, 52003, 52006, and 52009 misclassified by the naïve Bayes method as Category-3 items (each item number is followed by its description)
52000
Skin and subcutaneous tissue or mucous membrane, repair of recent wound of, on face or neck, small (not more than 7 cm long), superficial
52003
Skin and subcutaneous tissue or mucous membrane, repair of recent wound of, on face or neck, small (not more than 7 cm long), involving deeper tissue
52006
Skin and subcutaneous tissue or mucous membrane, repair of recent wound of, on face or neck, large (more than 7 cm long), superficial
52009
Skin and subcutaneous tissue or mucous membrane, repair of recent wound of, on face or neck, large (more than 7 cm long), involving deeper tissue
Table 4. Some items in Category 3 which are similar to items 52000, 52003, 52006, and 52009 (each item number is followed by its description)
30026
Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, not on face or neck, small (not more than 7cm long), superficial, not being a service to which another item in Group T4 applies
30035
Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, on face or neck, small (not more than 7cm long), involving deeper tissue
30038
Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, not on face or neck, large (more than 7cm long), superficial, not being a service to which another item in Group T4 applies
30041
Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, not on face or neck, large (more than 7cm long), involving deeper tissue, not being a service to which another item in Group T4 applies
30045
Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, on face or neck, large (more than 7cm long), superficial
30048
Skin and subcutaneous tissue or mucous membrane, repair of wound of, other than wound closure at time of surgery, on face or neck, large (more than 7cm long), involving deeper tissue
Table 5. Some correctly classified Category-1 items (each item number is followed by its description)
3
Professional attendance at consulting rooms (not being a service to which any other item applies) by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management - each attendance
4
Professional attendance, other than a service to which any other item applies, and not being an attendance at consulting rooms, an institution, a hospital, or a nursing home by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management -- an attendance on 1 or more patients on 1 occasion -- each patient
13
Professional attendance at an institution (not being a service to which any other item applies) by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management -- an attendance on 1 or more patients at 1 institution on 1 occasion -- each patient
19
Professional attendance at a hospital (not being a service to which any other item applies) by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management -- an attendance on 1 or more patients at 1 hospital on 1 occasion -- each patient
20
Professional attendance (not being a service to which any other item applies) at a nursing home including aged persons’ accommodation attached to a nursing home or aged persons’ accommodation situated within a complex that includes a nursing home (other than a professional attendance at a self contained unit) or professional attendance at consulting rooms situated within such a complex where the patient is accommodated in a nursing home or aged persons’ accommodation (not being accommodation in a self contained unit) by a general practitioner for an obvious problem characterised by the straightforward nature of the task that requires a short patient history and, if required, limited examination and management -- an attendance on 1 or more patients at 1 nursing home on 1 occasion -- each patient
Table 6. Some correctly classified Category-5 items (each item number is followed by its description)
55028
Head, ultrasound scan of, performed by, or on behalf of, a medical practitioner where: (a) the patient is referred by a medical practitioner for ultrasonic examination not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies; and (b) the referring medical practitioner is not a member of a group of practitioners of which the first mentioned practitioner is a member (R)
55029
Head, ultrasound scan of, where the patient is not referred by a medical practitioner, not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies (NR)
55030
Orbital contents, ultrasound scan of, performed by, or on behalf of, a medical practitioner where: (a) the patient is referred by a medical practitioner for ultrasonic examination not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies; and (b) the referring medical practitioner is not a member of a group of practitioners of which the first mentioned practitioner is a member (R)
55031
Orbital contents, ultrasound scan of, where the patient is not referred by a medical practitioner, not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies (NR)
55033
Neck, 1 or more structures of, ultrasound scan of, where the patient is not referred by a medical practitioner, not being a service associated with a service to which an item in Subgroups 2 or 3 of this Group applies (NR)
Indeed, it is found that 62 out of 81 category-4 items are misclassified as category-3. Similarly, 12 out of 54 category-2 items are misclassified as category-5 items. This result indicates that the hypothesis is not valid; there are ambiguities in the descriptions of the items in each category, apart from category-1, which could be confused with those in other categories. In particular, there is a high risk of confusing those items in category-4 with those in category-3. A close examination of the list of the 62 category-4 items which are misclassified as category-3 items by the naïve Bayes classification method indicates that they are indeed very similar to those in
category-3. For simplicity, when we say items in category-3, we mean that those items are also correctly classified into category-3 by the classification method. Tables 3 and 4 give an illustration of the misclassified items. It is noted that misclassified items 52000, 52003, 52006, and 52009 in Table 3 are very similar to the category-3 items listed in Table 4. It is observed that the way items 5200X are described is very similar to those represented in items 300YY. For example, item 52000 describes a medical procedure to repair small superficial cuts on the face or neck. On the other hand, item 30026 describes the same medical procedure except that it indicates that the wounds are not on the face or neck, with the distinguishing feature that this is not a service to which another item in Group T4 applies. It is noted that the description of item 30026 uses the word “not” to distinguish this from that of item 52000, as well as appending an extra phrase “not being a service to which another item in Group T4 applies”. From a vector space point of view, the vector representing item 52000 is very close3 to item 30026, closer than other items in category-4, due to the few extra distinguishing words between the two. Hence, item 52000 is classified as “one” in category3, instead of “one” in category-4. Similar observations can be made for other items shown in Table 3, when compared to those shown in Table 4. On the other hand, Tables 5 and 6 show items which are correctly classified in category-1 and category-5 respectively. It is observed that items shown in Table 5 are distinct from those shown in Table 6 in their descriptions. A careful examination of correctly-classified category-1 items, together with a comparison of their descriptions with those correctly-classified category-5 items confirms the observations shown in Tables 5 and 6. In other words, the vectors representing correctly-classified category-1 items are closer to other vectors in the same category than other vectors representing other categories.
Support Vector Machine and Kernel Machine Methodologies In this section, we will briefly describe the support vector machine and the kernel machine techniques.
Support Vector Machine and Kernel Machine Methodology In recent years, there has been increasing interest in a method called support vector machines (Cristianni & Shawe-Taylor, 2000; Guermeur, 2002; Joachims, 1999; Vapnik, 1995). In brief, this can be explained quite easily as follows. Assume a set of (n-dimensional) vectors x1, x2, ..., xn drawn from two classes, labelled 1 and -1. If these classes are linearly separable, then there exists a straight line dividing these two classes, as shown for the first pair of classes in Figure 2. In Figure 2, it is observed that these vectors are well separated. Now if the two classes cannot be separated by a straight line, the situation becomes more interesting. Traditionally, in this case we use a non-linear classifier to separate the classes, as shown for the second pair of classes in Figure 2. In general terms, any two collections of n-dimensional vectors are said to be linearly separable if there exists an (n-1)-dimensional hyper-plane that separates the two collections. One intuition is inspired by the following example: In the exclusive-OR case, we know that it is not possible to separate the two classes using a straight line, when the problem is represented in two dimensions. However, we know that if we increase the dimension of the exclusive-OR example by one, then in three dimensions one can find a hyper-plane which will separate the two classes. This can be observed in Tables 7 and 8, respectively.
Figure 2. Illustration of the linear separability of classes (the two classes at top are separable by a single line, as indicated; for the lower two classes there is no line that can separate them)
Here it is observed that the two classes are easily separated when we simply add one extra dimension. The support vector machine uses this insight: in the case when it is not possible to separate the two classes by a hyper-plane in the original space, if we augment the dimension of the problem sufficiently, it becomes possible to separate the two classes by a hyper-plane of the form f(x) = w^T φ(x) + b, where w is a set of weights and b is a constant in this high-dimensional space.
Table 7. Exclusive-OR example

x   y   class
0   0   1
1   1   1
0   1   0
1   0   0
Table 8. Extended exclusive-OR example

x   y   z   class
0   0   0   1
1   1   1   1
0   1   0   0
1   0   0   0
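To make the exclusive-OR illustration concrete, the short sketch below (our own, with z = x·y as one mapping consistent with Table 8) checks that a linear SVM cannot separate the two-dimensional data of Table 7 but separates the three-dimensional data of Table 8 perfectly.

```python
# Sketch: the XOR-type data of Table 7 is not linearly separable in 2-D,
# but after adding the coordinate z = x*y (as in Table 8) a linear
# separator exists in 3-D.
import numpy as np
from sklearn.svm import SVC

X2 = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])   # Table 7 inputs
y = np.array([1, 1, 0, 0])                        # Table 7 classes

X3 = np.column_stack([X2, X2[:, 0] * X2[:, 1]])   # add z = x*y (Table 8)

for name, X in [("2-D (Table 7)", X2), ("3-D (Table 8)", X3)]:
    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # nearly hard-margin linear SVM
    print(name, "training accuracy:", clf.score(X, y))  # only the 3-D case reaches 1.0
```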
The embedding of the vectors x in the high-dimensional space consists of transforming them equivalently into φ(x), where φ(⋅) is a coordinate transformation. The question then becomes: how to find such a transformation φ(⋅)? Let us define a kernel function as follows:

K(x, z) = <φ(x), φ(z)> ≡ φ(x)^T φ(z)    (1)

where φ is a mapping from the input space X to an inner product feature space F. It is noted that the kernel thus defined is symmetric, in other words K(x, z) = K(z, x). Now let us define the matrix X = [x1 x2 ... xn]. It is possible to define the symmetric matrix

X^T X = [x1 x2 ... xn]^T [x1 x2 ... xn], whose (i, j)-th entry is xi^T xj.    (2)

In a similar manner, it is possible to define the kernel matrix

K = [φ(x1) φ(x2) ... φ(xn)]^T [φ(x1) φ(x2) ... φ(xn)]    (3)

Note that the kernel matrix K is symmetric. Hence, it is possible to find an orthogonal matrix V such that K = VΛV^T, where Λ is a diagonal matrix containing the eigenvalues of K. It is convenient to sort the diagonal values of Λ such that λ1 ≥ λ2 ≥ ... ≥ λn. It turns out that one necessary requirement for K to be a valid kernel matrix is that the eigenvalue matrix Λ must contain only non-negative entries, in other words, λi ≥ 0. This implies that, in general, for φ(⋅) to be a valid transformation, the kernel function it induces must be symmetric and positive semi-definite. This is known as the Mercer condition (Cristianni & Shawe-Taylor, 2000). There are many possible such kernels; some common ones (Cristianni & Shawe-Taylor, 2000) being the polynomial (power) kernel, K(x, z) = (x^T z + c)^p with p = 2, 4, ..., and the Gaussian kernel, K(x, z) = exp(-||x - z||^2 / (2σ^2)). There exist quite efficient algorithms using optimisation theory which will obtain a set of support vectors and the corresponding weights of the hyper-plane for a particular problem (Cristianni & Shawe-Taylor, 2000; Joachims, 1999). This is based on re-formulating the problem as a quadratic programming
problem with linear constraints. Once it is thus re-formulated, the solutions can be obtained very efficiently. It was also discovered that the idea of a kernel is quite general (Scholkopf, Burges, & Smola, 1999). Indeed, instead of working with the original vectors x, it is possible to work with the transformed vectors φ(x) in the feature space, and most classic algorithms, for example, principal component analysis, canonical correlation analysis, and Fisher's discriminant analysis, all have equivalent algorithms in the kernel space. The advantage of working with the kernel matrix is that its dimension is normally much lower than that of the original feature space.
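As a small numerical illustration of equations (1) to (3) and the Mercer requirement, the sketch below (our own assumption, not taken from the chapter) builds a Gaussian-kernel Gram matrix for a handful of vectors and checks that it is symmetric with non-negative eigenvalues.

```python
# Sketch: build a Gram (kernel) matrix K with the Gaussian kernel and verify
# the Mercer requirement that all eigenvalues are non-negative.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                  # five 3-dimensional vectors

K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigenvalues = np.linalg.eigvalsh(K)          # K is symmetric, so eigvalsh applies
print(np.allclose(K, K.T))                   # symmetry: K(x, z) = K(z, x)
print(np.all(eigenvalues >= -1e-10))         # non-negative up to round-off
```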
Latent Semantic Kernel Technique

The latent semantic kernel method follows the same trend as the kernel machine methodology (Cristianini, Lodhi, & Shawe-Taylor, 2002). The latent semantic kernel is the kernel machine counterpart of the latent semantic technique, except that it operates in a lower-dimensional feature space, and hence is more efficient. In latent semantic analysis, we have a set of documents represented by D, in terms of V = [v1, v2, ..., vm]. This set of vectors can be concatenated into a matrix D, and we may then apply a singular value decomposition to the matrix D as follows:

D = UΣV^T    (4)

where D is an n × m matrix, U is an orthonormal n × n matrix such that UU^T = I, V is an orthonormal m × m matrix such that V^T V = I, and Σ is an n × m matrix whose diagonal entries are the singular values σ1, σ2, ..., σp, with p = min(n, m). Often the singular values are arranged so that σ1 ≥ σ2 ≥ ... ≥ σp. Thus, the singular values give some information on the "energy" of each dimension. It is possible that some of the σ may be small or negligible; in this case, there are only a few significant singular values. For example, if σ1 ≥ σ2 ≥ ... ≥ σi >> σi+1 ≥ ... ≥ σp, then it is possible to approximate the decomposition using only the first i singular values σ1, σ2, ..., σi. In this case, it is possible to perform a dimension reduction on the original data so that it conforms to the reduced-order data. By reducing the order of the representation, we are "compressing" the data, thus inducing it to have a "semantic" representation. The idea behind the latent semantic kernel is, instead of considering the original document matrix D, to consider the kernel matrix K = D^T D. In this case, the dimension is smaller, as we normally assume that there are more words in the vocabulary than there are documents, in other words, n » m. Thus, the m × m matrix D^T D defines a smaller space than the original n-dimensional space. Once it is recognised that the kernel is K, we can then operate on this kernel, for example, performing a singular value decomposition of the matrix K and finding the corresponding singular values. One particular aspect of performing a singular value decomposition is to find a reduced-order model such that, in the reduced-order space, it approximates the model in the original space, in the sense that the approximated model contains most of the "energies" of the original model. This concept can also be applied to the latent semantic kernel technique: it is possible to find a reduced-order representation of the original features in such a manner that the reduced-order representation contains most of the "energies" of the original representation.
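The following sketch illustrates the reduced-order idea of equation (4) on a small random term-by-document matrix; the sizes and the Poisson-generated counts are illustrative assumptions only.

```python
# Sketch: rank-T approximation of a term-by-document matrix D via SVD,
# keeping only the singular values that carry most of the "energy".
import numpy as np

n_words, m_docs, T = 50, 8, 3                 # illustrative sizes only
rng = np.random.default_rng(1)
D = rng.poisson(0.3, size=(n_words, m_docs)).astype(float)   # sparse-ish counts

U, s, Vt = np.linalg.svd(D, full_matrices=False)   # D = U diag(s) V^T
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
print(energy[:T])                                   # share of "energy" retained

# T-dimensional reduced features for each document (columns of D)
reduced = np.diag(s[:T]) @ Vt[:T, :]                # shape (T, m_docs)
print(reduced.shape)
```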
The latent semantic kernel algorithm (Cristianini, Lodhi, & Shawe-Taylor, 2002) can be described as follows (written here in its standard incomplete Gram-Schmidt form):

Given a kernel K, a training set d1, ..., dm, and a number T:
    for i = 1 to m do:
        norm2[i] = K(di, di);
    end;
    for j = 1 to T do:
        ij = arg max_i (norm2[i]);
        index[j] = ij;
        size[j] = sqrt(norm2[ij]);
        for i = 1 to m do:
            feat[i, j] = ( K(di, d(ij)) - sum over t < j of feat[i, t] * feat[ij, t] ) / size[j];
            norm2[i] = norm2[i] - feat[i, j] * feat[i, j];
        end;
    end;
    return feat[i, j] as the j-th feature of input i;

To classify a new example d:
    for j = 1 to T do:
        newfeat[j] = ( K(d, d(index[j])) - sum over t < j of newfeat[t] * feat[index[j], t] ) / size[j];
    end;
    return newfeat[j] as the j-th feature of the example d.
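A compact Python rendering of this procedure is sketched below. It is our own illustrative implementation of the Gram-Schmidt-style feature extraction, not the authors' code, and it uses a plain linear kernel over random term-frequency vectors for concreteness.

```python
# Sketch of latent semantic kernel feature extraction (incomplete Gram-Schmidt):
# project documents onto T kernel-defined basis directions.
import numpy as np

def lsk_features(K, T):
    """K: m x m kernel matrix of the training documents; returns (feat, index, size)."""
    m = K.shape[0]
    norm2 = np.diag(K).astype(float).copy()
    feat = np.zeros((m, T))
    index = np.zeros(T, dtype=int)
    size = np.zeros(T)
    for j in range(T):
        ij = int(np.argmax(norm2))
        index[j] = ij
        size[j] = np.sqrt(norm2[ij])
        feat[:, j] = (K[:, ij] - feat[:, :j] @ feat[ij, :j]) / size[j]
        norm2 -= feat[:, j] ** 2
    return feat, index, size

def lsk_new_features(k_new, feat, index, size):
    """k_new[i] = K(d, d_i) for a new document d; returns its T reduced features."""
    T = len(size)
    newfeat = np.zeros(T)
    for j in range(T):
        newfeat[j] = (k_new[index[j]] - newfeat[:j] @ feat[index[j], :j]) / size[j]
    return newfeat

# Toy usage with a linear kernel K = D^T D over random term-frequency vectors.
rng = np.random.default_rng(2)
D = rng.poisson(0.5, size=(30, 10)).astype(float)    # 30 words x 10 documents
K = D.T @ D
feat, index, size = lsk_features(K, T=4)
print(lsk_new_features(K[:, 0], feat, index, size))  # reduced features of document 0
```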
In our work, we use the latent semantic kernel method as a pre-processing technique in that we have a set of documents represented by the matrix D, and as mentioned previously, the matrix is quite sparse in that there are many null values within the matrix. Hence, our aim is to project this representation onto a representation in a reduced order space so that most of the “energies” are still retained. Once we obtain this set of reduced order representation, we can then manipulate them in the same way as we manipulate the full feature vectors.
Application of Latent Semantic Kernel Methodology to the Medicare Benefit Schedule

We wish to apply the latent semantic kernel technique to the set of descriptions of items as contained in the MBS. The vector representation as described in the Bag of Words section is used in the latent semantic kernel (LSK) approach, and it is known as the full feature vectors or, in short, full features. We apply the LSK algorithm to the full features and produce reduced features which have a dimension determined by a variable T (T ≤ n, the number of words in the corpus). In this chapter, we will use the term LSK reduced features to represent such features. We use the following procedures in our experiments with the latent semantic kernel:

• Run the bag of words method to produce full features.
• Run the LSK algorithm to produce LSK reduced features.
• Experiment with both the LSK reduced features and the full features, including:
  ° Binary classification — this will allow only binary classification using support vector machine techniques.
  ° Multi-classification — this will allow multi-class classification using support vector machine techniques.
  ° Clustering the items in the MBS using both full features and reduced features.
  ° Comparing the clustering results with the multi-classification results.
experiments using binary classification In this section we report the results of using a binary classification support vector machine technique for the classification of the items (Joachims, 1999). This is interesting in that it shows us the results of assuming one class, and the other items are assumed to be in a different class. Originally, items in the MBS are classified into seven categories: 1, 2,..., 6 and 7. We have trained a binary classifier for both the full features and for the reduced features regarding each category versus the others. For example, we use category-1 and assume all the other categories are grouped as another class. We run experiments on reduced features for each category where the dimension T of the reduced features was chosen from a set of values within the range [16; 2048]. We show the results of category-1 versus other categories first, and then summarise other situations and draw some general conclusions concerning the experiments. For the experiment about category-1 versus others (see Table 9), we observe that the accuracy climbs from T = 16 rapidly to a stable position of a maximum at T= 32. Note that even though the accuracies oscillate about this maximum value for later values of T, this can be observed to be minor, and can be attributed to noise. From this, we can conclude that if we classify category-1 versus all the other categories, then using T= 32 is sufficient to capture most of the gain. What this implies is that most of the energies in the document matrix are already captured with 32 reduced order features. This shows that the LSK algorithm is very efficient in compressing the features from 4569 words in the vocabulary to only requiring 32 reduced features. Note that it is not possible to interpret the reduced features, as by nature they consist of a transformed set of features.
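The sketch below illustrates one such binary experiment ("category 1 versus the rest") with a linear SVM; the feature matrices and category labels are random placeholders standing in for the T-dimensional LSK reduced features of the MBS items.

```python
# Sketch: "category-1 versus the rest" binary classification on T-dimensional
# reduced features (placeholder data shapes, not the actual MBS data).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

T = 128
rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, T))          # stand-in for LSK reduced features
X_test = rng.normal(size=(200, T))
cat_train = rng.integers(1, 8, size=200)     # MBS categories 1..7 (random here)
cat_test = rng.integers(1, 8, size=200)

y_train = (cat_train == 1).astype(int)       # category 1 vs. all other categories
y_test = (cat_test == 1).astype(int)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```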
Table 9. Accuracy on the training data and testing data set for Category 1 with various values of T and full features

T               Train Accuracy   Train Correct (out of 2015)   Test Accuracy   Test Correct (out of 2015)
16              99.60            2007                          99.31           2001
32              100              2015                          99.75           2010
64              100              2015                          99.75           2010
128             100              2015                          99.80           2011
256             100              2015                          99.80           2011
400             100              2015                          99.70           2009
512             100              2015                          99.75           2010
800             100              2015                          99.75           2010
1024            100              2015                          99.75           2010
1200            100              2015                          99.80           2011
1400            100              2015                          99.75           2010
1600            100              2015                          99.75           2010
1800            100              2015                          99.75           2010
2000            100              2015                          99.75           2010
2048            100              2015                          99.75           2010
full features   100              2015                          99.75           2010
For experiments involving other sets, for example, using category-2 versus all the other categories, and so on, similar conclusions to that shown for category-1 are observed. It is found that in general a small value of T is sufficient to capture most of the energies in the document matrix. Each category versus the rest peaks at a particular value of T. On average it is found that T = 128 would capture most of the gain possible using the reduced features. These experiments show us that the LSK algorithm is a very efficient pre-processing unit. The observed efficiency is likely due to the sparseness of the full feature space. It can capture most of the energies contained in the document matrix using a small reduced feature set, in other words, with a value of T = 128.
Experiments Using Multiple Classifications

In this section, we report on experiments with multi-classification using support vector machine (SVM) methodology. We first discuss the generalisation of SVMs to multi-class classification, and then we describe the experimental results. The SVM, as a method proposed in Vapnik (1995), is suitable for two-class classification. There are a number of extensions of this method to multi-class classification problems (Crammer & Singer, 2001; Guermeur, 2002; Lee, Lin, & Wahba, 2002). It turns out that, instead of weighing the cost function as an equal cost (indicating that both classes are equally weighed in a two-class classification problem), one can modify the cost function and weigh the misclassification cost as well (Lee, Lin, & Wahba, 2002).
Table 10. A confusion table showing the classification of documents (the row gives the actual classification as indicated in the MBS, while the column shows figures which are obtained by using the support vector machine)

Category   1    2    3     4    5    6    7    total   % accuracy
1          74   0    0     0    0    4    1    79      93.67
2          0    28   13    0    8    5    0    54      51.85
3          1    6    1310  16   20   2    12   1367    95.83
4          0    0    71    9    0    0    1    81      11.11
5          0    11   25    0    211  4    1    252     83.73
Once formulated in this manner, the usual formulation of SVMs can be applied. In the Experiments Using Binary Classification section, we showed that by using a small reduced feature dimension such as T = 128, we can capture most of the energies in the document matrix. The following result is obtained by running multi-class classification using a support vector machine on the reduced features, with the same training and testing data sets as previously. The average accuracy is 88.98%, with 1793 correctly classified out of 2015 (Table 10). Note that once again some of the HIC categories are poorly classified. For example, out of 81 HIC category-4 items, 71 are classified as category-3, while only 9 are classified as category-4. This further confirms the results obtained in the Bag of Words section, namely that there are ambiguities in the HIC classifications of item descriptions.
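A sketch of the multi-class experiment is given below. It relies on scikit-learn's standard multi-class handling rather than the cost-weighted formulation of Lee, Lin, and Wahba (2002), and the data are placeholders; the confusion matrix it prints has the same layout as Table 10.

```python
# Sketch: multi-class SVM on reduced features, summarised as a confusion table.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

T = 128
rng = np.random.default_rng(4)
X_train = rng.normal(size=(300, T))          # stand-in for reduced features
X_test = rng.normal(size=(300, T))
y_train = rng.integers(1, 8, size=300)       # categories 1..7 (placeholders)
y_test = rng.integers(1, 8, size=300)

clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
pred = clf.predict(X_test)

print(accuracy_score(y_test, pred))
# Rows: actual MBS category, columns: predicted category (as in Table 10).
print(confusion_matrix(y_test, pred, labels=[1, 2, 3, 4, 5, 6, 7]))
```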
Experiments with Clustering Algorithms So far, we have experimented on MBS items which are classified by categories as contained in the MBS. We have shown that the MBS contained ambiguities using the bag of words approach. In this section, we ask a different question: If we ignore the grouping of the items into categories as contained in the MBS, but instead we apply clustering algorithms to the item descriptions, how many clusters would we find? Secondly, how efficient would these clusters be in classifying the item descriptions? Efficiency is measured in terms of the confusion matrix in classifying unseen data. The methodology which we used is as follows:
• Use a clustering algorithm to cluster the document matrix into clusters, and label the clusters accordingly.
• Evaluate the efficiency of the clustering algorithm using a support vector machine. The efficiency is observed by examining the resulting confusion matrix.
The main reason why we need to take a two-step process is that the clustering algorithm is an unsupervised learning algorithm and that we do not have any a priori information concerning which item should fall into which cluster. Hence, it is very difficult to evaluate the efficiency of the clusters produced. In our methodology, we evaluate the efficiency of the clusters from the clustering algorithm
799
Application of Text Mining Methodologies to Health Insurance Schedules
by assuming that the clusters formed are “ideal”, label them accordingly, and use the SVM (a supervised training algorithm) to evaluate the clusters formed.
Clustering Using Full Features: Choice of Clustering Method

The first experiment was performed on the full features (in other words, using the original document matrix). Different clustering methods were evaluated using various criteria; we found that the repeated bisection method of clustering gives the best results. Hence, we choose to use this clustering method for all future experiments. In the clustering algorithm (Karypis, 2003; Zhao & Karypis, 2002), the document matrix is first clustered into two groups. One of the groups is selected and bisected further. This process continues until the desired number of clusters is obtained. During each step, the cluster is bisected so that the resulting two-way clustering solution optimises a particular clustering criterion. At the end of the algorithm, the overall optimisation function is minimised. There are a number of optimising functions which can be used. We use a simple criterion which measures the pair-wise similarities between the documents of two clusters Si and Sj as follows:
∑_{dq ∈ Di, dr ∈ Dj} cos(dq, dr) = ∑_{dq ∈ Di, dr ∈ Dj} dq^T dr    (5)
For a particular cluster Sr of size nr, the entropy is defined as:

E(Sr) = − (1 / log q) ∑_{i=1}^{q} (nr^i / nr) log(nr^i / nr)    (6)

where q is the number of classes in the data set, and nr^i is the number of documents of the i-th class that were assigned to the r-th cluster. The entropy of the entire clustering solution is defined as:

E = ∑_{r=1}^{k} (nr / n) E(Sr)    (7)

where k is the number of clusters and n is the total number of documents.
Perfect clustering means that each cluster will contain only one type of document, that is, all documents in a cluster belong to the same class. In this case, the entropy will be zero. In general, this is impossible, and the clustering result which gives the lowest entropy is the best clustering result. In this chapter, we made use of the Cluto software (Karypis, 2003) for performing the clustering.
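The entropy measures in Equations (6) and (7) can be computed directly from the cluster assignments and the known class labels. The following is a minimal sketch in Python/NumPy; the variable names (labels, clusters) are illustrative and are not taken from the chapter.

```python
import numpy as np

def cluster_entropy(labels, clusters):
    """Entropy of a clustering solution, per Equations (6) and (7).

    labels   -- array of true class indices (0..q-1), one per document
    clusters -- array of cluster indices, one per document
    """
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    n = len(labels)
    q = len(np.unique(labels))
    total = 0.0
    for r in np.unique(clusters):
        members = labels[clusters == r]
        n_r = len(members)
        # proportions n_r^i / n_r of each class inside cluster r
        props = np.bincount(members, minlength=q) / n_r
        props = props[props > 0]
        e_r = -(props * np.log(props)).sum() / np.log(q)   # Equation (6)
        total += (n_r / n) * e_r                            # Equation (7)
    return total

# A perfectly pure clustering gives zero entropy:
print(cluster_entropy([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0
```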
Clustering Using Reduced Features into k-Clusters and Use of Support Vector Machine to Evaluate Cluster Accuracy

For experiments in this section, we used reduced features with T = 128, together with the repeated bisection clustering method. Our aim was to use a clustering algorithm with the repeated bisection
method to group item descriptions into clusters irrespective of their original classification in the MBS. We divided the item descriptions into k clusters. The detailed experimental procedure is as follows:

• Use the reduced features of all MBS items, obtained using the latent semantic kernel method, as inputs to the clustering algorithm in order to group the items into k clusters, where k is a variable determined by the user. The output from the clustering algorithm gives k clusters.
• After clustering, perform a classification of the MBS items using the reduced features and the SVM methodology; in this case, however, the class label of each item is the cluster assigned by the clustering method, not the MBS category.
• Display the classification output in a confusion table, which informs us how well the clustering algorithm has performed (a sketch of this two-step procedure is given after this list).
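A minimal sketch of this cluster-then-validate procedure is given below. It uses k-means from scikit-learn as a stand-in for the repeated bisection method and a linear SVM for the validation step; the k-means choice and the variable names are assumptions for illustration only, not the chapter's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

def cluster_then_validate(X, k=7, seed=0):
    """X: reduced-feature document matrix (items x T features)."""
    # Step 1: unsupervised grouping of the item descriptions into k clusters
    cluster_labels = KMeans(n_clusters=k, random_state=seed).fit_predict(X)

    # Step 2: treat the cluster labels as classes and validate with an SVM
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, cluster_labels, test_size=0.5, stratify=cluster_labels, random_state=seed)
    svm = SVC(kernel="linear").fit(X_tr, y_tr)
    y_pred = svm.predict(X_te)
    return confusion_matrix(y_te, y_pred), accuracy_score(y_te, y_pred)

# Example with random data standing in for the reduced MBS features:
X = np.random.rand(400, 128)
cm, acc = cluster_then_validate(X, k=7)
print(acc)
print(cm)
```

A diagonally dominant confusion matrix, as in Tables 12 and 14 below, indicates that the clusters are internally consistent enough for a supervised classifier to recover them.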
In this section, we use the following notation:

• Cluster category — the cluster obtained using the clustering algorithm.
• HIC category — the category provided by the MBS.
Clusters (k = 7) First, we ran a clustering algorithm to create seven cluster categories. From Table 11, we observe how items in HIC categories are distributed into cluster categories. From this it is observed that the clusters as obtained by the clustering algorithm are quite different from those indicated in the HIC category. We then validated the classification results from the clustering algorithm using SVM’s. This informed us of the quality of the clustering algorithm in grouping the item descriptions. A classification accuracy of 93.70% on the testing data set (that is, 1888 correct out of 2015) was found. Note that for these experiments, we used the 50% training and 50% testing data sets as in all previous experiments. The distribution of items in the training and testing data sets for each cluster category was selected using a random sampling scheme within each identified category.
Table 11. Distribution of MBS items into cluster categories (rows: cluster categories; columns: HIC categories)

Cluster category   HIC 1   HIC 2   HIC 3   HIC 4   HIC 5   HIC 6   HIC 7
1                      0       3      10       0     230       0       0
2                     10      39     442      16     107      69       6
3                      0       0    1269      62       9       1       3
4                    148      13      65       2       7      17      18
5                      0       4     297      34      20       0      12
6                      0      12     505      47      18      35      21
7                      0      37     146       1     113     180       2
Table 12. A confusion table obtained using the support vector machine method

Class      1      2      3      4      5      6      7   total   % accuracy
1        112      0      1      0      1      0      6     120        93.33
2          0    312      2      4     11      5      4     338        92.31
3          0      3    661      2      8      6      3     683        96.78
4          2      4      0    119      3      2      1     131        90.84
5          0      1      1      5    173      1      4     185        93.51
6          1      1      6      2      9    282      4     305        92.46
7          0     13      0      4      6      1    229     253        90.51
Table 13. Distribution of MBS items into eight cluster categories (rows: cluster categories; columns: HIC categories)

Cluster category   HIC 1   HIC 2   HIC 3   HIC 4   HIC 5   HIC 6   HIC 7
1                      0       4      10       0     244       0       0
2                     10      39     326      17     104      70       6
3                      0       0    1264      61       8       1       3
4                      0       2     223       3       9       1       0
5                    148      12      53       2       7      16      18
6                      0       4     290      32      20       0      12
7                      0      13     501      46      15      30      21
8                      0      34      67       1      97     184       2
Table 14. A confusion table using support vector machine to validate the quality of cluster categories

Class      1      2      3      4      5      6      7      8   total   % accuracy
1        127      0      1      0      0      1      0      3     132        96.21
2          0    259      2      0      3      5      1      5     275        94.18
3          0      3    657      1      1     11      3      4     680        96.62
4          0      0      0    106      0      7      0      2     115        92.17
5          2      2      0      2    108      4      5      2     125        86.40
6          0      2      1      3      2    170      2      0     180        94.44
7          0      3      5      4      2      8    280      1     303        92.41
8          0      3      0      4      2      5      1    190     205        92.68
Clusters (k = 8)

In this section, we experimented with eight clusters instead of seven. This provided the clustering algorithm with more freedom in how to cluster the item descriptions. When we chose seven clusters, we implicitly told the clustering algorithm that, no matter what the underlying data look like, we only allow seven clusters to be found. In this manner, even though the underlying data may be more
conveniently clustered into a higher number of clusters, by choosing seven clusters we force the clustering algorithm to merge the underlying clusters into seven. On the other hand, if we choose eight clusters, this provides more freedom for the clustering algorithm to cluster the underlying data: if the data truly form seven clusters, then the algorithm will report seven clusters, whereas if the data are more conveniently clustered into eight clusters, allowing for the possibility of eight clusters lets the algorithm find them. First, we ran a clustering algorithm to create eight cluster categories (Table 13). From Table 13, it is observed that the clusters as obtained by the clustering algorithm are quite different from those indicated in the HIC category. We then validated the classification results from the clustering algorithm using SVM's. This informed us of the quality of the clustering algorithm in grouping the item descriptions. A classification accuracy of 94.14% (in other words, 1897 correct out of 2015) was found.
Summary of Observations

It was observed that:

1. The categories as given by the clustering algorithm are good in grouping the medical item descriptions together. The evaluation using the SVM method shows that it is accurate in grouping them together, as the confusion table has dominant diagonal elements, with few misclassified items. The SVM is a supervised method: once it is trained, it can be used to evaluate the generalisation capability of the model on testing data with known classifications. Thus, by examining the confusion table produced, it is possible to evaluate how well the model classifies unseen examples. If the confusion table is diagonally dominant, this implies that there are few misclassifications; if it is not diagonally dominant, this implies a high number of misclassifications. In our experiment, we found that the confusion table is diagonally dominant, and hence we can conclude that the grouping obtained by the clustering algorithm is accurate. In other words, the clustering algorithm was able to cluster the underlying data into reasonably homogeneous groupings.
2. The classification accuracy given by the clustering algorithm increases with the number of clusters, in other words, with the degree of freedom the clustering algorithm is provided. Obviously there will be an upper limit to a reasonable number of clusters, beyond which there will not be any further noticeable increase in the classification accuracy. It is observed that the accuracy when assuming seven clusters (93.70%) and the accuracy when using eight clusters (94.14%) are very close to one another. Hence we can conclude that the "optimum" number of clusters is around seven or eight.
3. It is noted that the items in the HIC categories are spread across the cluster categories. This confirms our finding in the Bag of Words section that the HIC categories are not "ideal" categories from a similarity point of view, in that they can induce confusion due to their similarity.
Conclusion

We have experimented with classification and clustering on full features and reduced features using the latent semantic kernel algorithm. The results show that the LSK algorithm works well on Health
Insurance Commission schedules. It has been demonstrated that the HIC categories are ambiguous in the sense that some item descriptions in one category are close to those of another. This ambiguity may be the cause of misinterpretation of the Medicare Benefits Schedule by medical service providers, leading to wrong charges being sent to the HIC for refund. It is shown that by using clustering algorithms, the item descriptions can be grouped into a number of clusters, and moreover, it is found that seven or eight clusters would be sufficient. It is noted, however, that the item descriptions as grouped using the clustering algorithm are quite different to those of the HIC categories. This implies that if the HIC wishes to re-group item descriptions, it would be beneficial to consider the clusters as grouped by using clustering algorithms. Note that one may say that our methodology is biased towards the clustering algorithm or classification methods because we only use the simple top HIC categories — categories 1 through 7 as shown in Figure 1. This is only a very coarse classification. In actual fact, the items are classified in the MBS according to a three-tiered hierarchical tree as indicated in Figure 1. For example, an item belongs to category-x group-y item number z, where x ranges from 1 to 7, y ranges from 1 to yi (where i indicates the category that it is in), and z ranges from 1 to zj (where j depends on which group the item is located). This is a valid criticism in that the top HIC categories may be too coarse to classify the item descriptions. However, at the current stage of development of text mining methods, it is quite difficult to consider the hierarchical coding of the algorithms, especially when there are insufficient numbers of training samples. In the MBS case, a group may contain only a few items. Thus there is an insufficient number of data to train either the clustering algorithm or SVM. Hence our approach in considering the top HIC categories may be one possible way to detect if there are any inconsistencies in the MBS given the limitations of the methodology. Even though in this chapter we have concentrated on the medical procedure descriptions of a health insurer, the techniques can be equally applicable to many other situations. For example, our developed methodology can be applied to tax legislation, in identifying whether there are any ambiguities in the description of taxable or tax-exempted items. Other applications of our methodology include: description of the social benefit schedule in a welfare state, and the description of degree rules offered by a university.
ACKNOWLEDGMENT Financial support is gratefully acknowledged from both the Australian Research Council and the Health Insurance Commission through an ARC Linkage Project grant.
References

Australian Department of Health and Ageing. The Medicare Benefits Schedule Online. Retrieved from http://www.health.gov.au/pubs/mbs/mbs/css/index.htm

Berry, M. W. (Ed.). (2004). Survey of text mining: Clustering, classification, and retrieval. New York: Springer Verlag.
Cristianini, N., & Shawe-Taylor, J. (2000). Support vector machines. Cambridge, UK: Cambridge University Press.

Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multi-class kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292.

Cristianini, N., Lodhi, H., & Shawe-Taylor, J. (2002). Latent semantic kernels. Journal of Intelligent Information Systems, 18(2), March, 127-152.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: John Wiley & Sons.

Guermeur, Y. (2002). A simple unifying theory of multi-class support vector machines (Research Rep. No. RR-4669). Lorraine, France: Institut National de Recherche en Informatique et en Automatique (INRIA), December.

Hofmann, T. (1999a). Probabilistic latent semantic analysis. Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, July 30-August 1 (pp. 289-296).

Hofmann, T. (1999b). Probabilistic latent semantic indexing. Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR), Berkeley, CA, August 15-19 (pp. 50-57).

Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods. Boston: MIT Press.

Karypis, G. (2003). Cluto, a clustering toolkit (Tech. Rep. No. 02-017). Minneapolis, MN: Department of Computer Science and Engineering, University of Minnesota, November.

Lee, Y., Lin, Y., & Wahba, G. (2002). Multi-category support vector machines, theory and application to the classification of microarray data and satellite radiance data (Tech. Rep. No. 1064). Madison, WI: Department of Statistics, University of Wisconsin.

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labelled and unlabelled documents using EM. Machine Learning, 39, 103-134.

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw Hill.

Scholkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods. Boston: MIT Press.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer Verlag.

Zhao, Y., & Karypis, G. (2002). Criterion function for document clustering, experiments and analysis (Tech. Rep. No. 01-40). Minneapolis, MN: Department of Computer Science and Engineering, University of Minnesota, February.
Endnotes

1. Note that this scheduled cost is somewhat different from those set by the medical service provider associations. A medical service provider can elect to charge at the scheduled cost set by the medical service provider associations, or a fraction of it, and any gap between the charge and the refund by the HIC will need to be met either by the patient, or through additional medical insurance cover specifically designed to cover the gap payment.
2. Note that the Medicare Benefits Schedule is a living document in that the schedules are revised once every few years, with additional supplements once every three months. The supplements contain minor modifications to particular sections of the schedule, while the major revisions may contain re-classification of the items, deletion or addition of items, mainly for newly introduced medical services, due to technological advances, or clarification of the intent of existing items. The version of the MBS used in this chapter is based on the November, 1998, edition with supplements up to and including June 1, 2000.
3. Close here means the cosine of the angle between the two vectors is close to 1.
This work was previously published in Advances in Applied Artificial Intelligence, edited by J. Fulcher, pp. 29-51 , copyright 2006 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter XLVI
Web Mining System for Mobile-Phone Marketing Miao-Ling Wang Minghsin University of Science & Technology, Taiwan, ROC Hsiao-Fan Wang National Tsing Hua University, Taiwan, ROC
Abstract With the ever-increasing and ever-changing flow of information available on the Web, information analysis has never been more important. Web text mining, which includes text categorization, text clustering, association analysis and prediction of trends, can assist us in discovering useful information in an effective and efficient manner. In this chapter, we have proposed a Web mining system that incorporates both online efficiency and off-line effectiveness to provide the “right” information based on users’ preferences. A Bi-Objective Fuzzy c-Means algorithm and information retrieval technique, for text categorization, clustering and integration, was employed for analysis. The proposed system is illustrated via a case involving the Web site marketing of mobile phones. A variety of Web sites exist on the Internet and a common type involves the trading of goods. In this type of Web site, the question to ask is: If we want to establish a Web site that provides information about products, how can we respond quickly and accurately to queries? This is equivalent to asking: How can we design a flexible search engine according to users’ preferences? In this study, we have applied data mining techniques to cope with such problems, by proposing, as an example, a Web site providing information on mobile phones in Taiwan. In order to efficiently provide useful information, two tasks were considered during the Web design phase. One related to off-line analysis: this was done by first carrying out a survey of frequent Web users, students between 15 and 40 years of age, regarding their preferences, so that Web customers’ behavior could be characterized. Then the survey data, as well as the products offered, were classified into different demand and preference groups. The other task was related to online query: this was done through the application of an information retrieval technique that responded to users’ queries. Based
on the ideas above, the remainder of the chapter is organized as follows: first, we present a literature review, introducing some concepts and reviewing existing methods relevant to our study; then, the proposed Web mining system is presented; a case study of a mobile-phone marketing Web site is illustrated; and finally, a summary and conclusions are offered.
Literature Review

Over 150 million people, worldwide, have become Internet users since 1994. The rapid development of information technology and the Internet has changed the traditional business environment. The Internet has enabled the development of Electronic Commerce (e-commerce), which can be defined as selling, buying, conducting logistics, or other organization-management activities, via the Web (Schneider, 2004). Companies are finding that using the Web makes it easier for their business to communicate effectively with customers. For example, Amazon.com, an online bookstore that started up in 1998, reached an annual sales volume of over $1 billion in 2003 (Schneider, 2004). Much research has focused on the impact and mechanisms of e-commerce (Angelides, 1997; Hanson, 2000; Janal, 1995; Mohammed, Fisher, Jaworski, & Paddison, 2004; Rayport & Jaworski, 2002; Schneider, 2004). Although many people challenge the future of e-commerce, Web site managers must take advantage of Internet specialties which potentially enable their companies to make higher profits and their customers to make better decisions. Given that the amount of information available on the Web is large and rapidly increasing, determining an effective way to help users find useful information has become critical. Existing document retrieval systems are mostly based on the Boolean Logic model. Such systems' applications can be rather limited because they cannot handle ambiguous requests. Chen and Wang (1995) proposed a knowledge-based fuzzy information retrieval method, using the concept of fuzzy sets to represent the categories or features of documents. Fuzzy Set Theory was introduced by Zadeh (1965), and is different from traditional Set Theory, as it uses the concept of membership functions to deal with questions that cannot be solved by two-valued logic. Fuzzy Set Theory concepts have been applied to solve special dynamic processes, especially those observations concerned with linguistic values. Because the Fuzzy concept has been shown to be applicable when coping with linguistic and vague queries, Chen and Wang's method is discussed below. Their method is based on a concept matrix for knowledge representation and is defined by a symmetric relation matrix as follows:

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \qquad 0 \le a_{ij} \le 1, \quad 1 \le i \le n, \quad 1 \le j \le n \tag{1}$$
where n is the number of concepts, and ai j represents the relevant values between concepts Ai and Aj with ai i = 1, ∀ i. It can be seen that this concept matrix can reveal the relationship between properties used to describe objects, which has benefits for product identification, query solving, and online sales development. For effective analysis, these properties, determined as the attributes of an object, should
be independent of each other; however this may not always be so. Therefore a transitive closure matrix A* must be obtained from the following definition.

Definition 1: Let A be a concept matrix as shown in Equation (1), and define:

$$A^2 = A \otimes A = \begin{bmatrix}
\bigvee_{i=1,\dots,n} (a_{1i} \wedge a_{i1}) & \bigvee_{i=1,\dots,n} (a_{1i} \wedge a_{i2}) & \cdots & \bigvee_{i=1,\dots,n} (a_{1i} \wedge a_{in}) \\
\bigvee_{i=1,\dots,n} (a_{2i} \wedge a_{i1}) & \bigvee_{i=1,\dots,n} (a_{2i} \wedge a_{i2}) & \cdots & \bigvee_{i=1,\dots,n} (a_{2i} \wedge a_{in}) \\
\vdots & \vdots & & \vdots \\
\bigvee_{i=1,\dots,n} (a_{ni} \wedge a_{i1}) & \bigvee_{i=1,\dots,n} (a_{ni} \wedge a_{i2}) & \cdots & \bigvee_{i=1,\dots,n} (a_{ni} \wedge a_{in})
\end{bmatrix} \tag{2}$$
where ⊗ is the max-min composite operation, with "∨" being the maximum operation and "∧" being the minimum operation. If there exists an integer p ≤ n − 1 such that $A^p = A^{p+1} = A^{p+2} = \cdots$, then $A^* = A^p$ is called the Transitive Closure of the concept matrix A. Matrix A* is an equivalent matrix which satisfies the reflexive, symmetric and transitive properties. To identify each object by its properties, a document descriptor matrix D is constructed in the following form:

$$D = \begin{bmatrix}
d_{11} & d_{12} & \cdots & d_{1n} \\
d_{21} & d_{22} & \cdots & d_{2n} \\
\vdots & \vdots & & \vdots \\
d_{m1} & d_{m2} & \cdots & d_{mn}
\end{bmatrix}, \qquad 0 \le d_{ij} \le 1 \tag{3}$$

(rows: documents $D_1, \dots, D_m$; columns: concepts $A_1, \dots, A_n$)
where di j represents the degree of relevance of document Di with respect to concept Aj and m is the number of documents in general terms. By applying the max-min composite operation ⊗ to D and A* , we have matrix B = D ⊗ A* = [bi j]m ×n where bi j represents the relevance of each document Di with respect to a particular concept Aj . The implication of this approach for Web mining is: when we classify objects by their properties, if we can also cluster people according to their properties and preferences, then when a query is made, matching a user’s properties to retrieval of the corresponding concept matrices of each cluster will speed up online response. Clustering is fundamental to data mining. Clustering algorithms are used extensively, not only to organize and categorize data, but also for data compression and model construction. There are two major types of clustering algorithms: hierarchical and partitioning. A hierarchical algorithm produces a nested series of patterns with similarity levels at which groupings change. A partitioning algorithm produces only one partition by optimizing an objective function, for example, squared-error criterion (Chen, 2001). Using clustering methods, a data set can be partitioned into several groups, such that the degree of similarity within a group is high, and similarity between the groups is low. There are various kinds of clustering methods (Chen, 2001; Jang, Sun, & Mizutani, 1997; Wang, Wang, & Wu, 1994). In
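As a concrete illustration of the max-min operations above, the following NumPy sketch computes A ⊗ A, iterates it to the transitive closure A*, and then forms B = D ⊗ A*. It is a minimal reading of Equations (1)–(3), not code from the chapter, and the toy values are illustrative only.

```python
import numpy as np

def max_min_compose(X, Y):
    """Max-min composition (X ⊗ Y): entry (i, j) = max_k min(X[i, k], Y[k, j])."""
    # Broadcast to shape (rows_X, cols_Y, inner) and reduce over the inner axis.
    return np.max(np.minimum(X[:, None, :], Y.T[None, :, :]), axis=2)

def transitive_closure(A, max_iter=100):
    """Iterate A -> A ⊗ A until the matrix stops changing (A*)."""
    closure = A.copy()
    for _ in range(max_iter):
        nxt = max_min_compose(closure, closure)
        if np.allclose(nxt, closure):
            break
        closure = nxt
    return closure

# Toy 3-concept example.
A = np.array([[1.00, 0.12, 0.15],
              [0.12, 1.00, 0.11],
              [0.15, 0.11, 1.00]])
D = np.array([[0.00, 0.14, 0.29],     # one row per document, one column per concept
              [0.13, 0.38, 0.25]])
A_star = transitive_closure(A)
B = max_min_compose(D, A_star)
print(B)
```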
this study, we applied the forward off-line method in order to group people according to their properties and preferences. The c-Means algorithm (Tamura, Higuchi, & Tanaka, 1971), also called Hard c-Means (HCM), is a commonly used objective-clustering method, which finds the center of each cluster and minimizes the total spread around these cluster centers. By defining the distance from each datum to the center (a measure of Euclidean distance), the model ensures that each datum is assigned to exactly one cluster. However, in this case, in contrast to the HCM, there is vague data and elements may belong to several clusters, with different degrees of belonging. For such situations, Bezdek (1973) developed an algorithm called the Fuzzy c-Means (FCM) algorithm for fuzzy partitioning, such that one datum can belong to several groups with degrees of belonging, specified by membership values between 0 and 1. Obviously, the FCM is more flexible than the HCM, when determining data related to degrees of belonging. Because of the vague boundaries of fuzzy clusters, Wang et al. (1994) showed that it is not sufficient to classify a fuzzy system simply by minimizing the within-group variance. Instead, the maximal between-group variance also had to be taken into account. This led to a Bi-objective Fuzzy c-Means Method (BOFCM), as shown below, in which the performance of clustering can be seen to be improved:

$$\text{(BOFCM):}\qquad \begin{aligned} \min\ & Z(U;V) = \sum_{i=1}^{c}\sum_{k=1}^{K} (\mu_{ik})^{2}\,\lVert x_k - v_i \rVert^{2} \\ \max\ & L(U;V) = \sum_{i=1}^{c}\sum_{j=1}^{c} \lVert v_i - v_j \rVert^{2} \end{aligned} \tag{4}$$

In the associated iterative algorithm, the membership update at iteration l + 1 requires choosing a value lying between $\max_i \alpha_i$ and $\min_i \beta_i$, where $\alpha_i$ and $\beta_i$ are each normalised by $\sum_{k=1}^{K}\bigl(\mu_{ik}^{(l+1)}\bigr)^{2} + c - 1$.
Step 6. Calculate $\Delta = \lVert U^{(l+1)} - U^{(l)} \rVert$. If $\Delta > \varepsilon$, set $l = l + 1$ and go to Step 3. If $\Delta \le \varepsilon$, stop.
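Since only the convergence test of the iterative procedure is shown above, the following sketch illustrates the general fuzzy c-means iteration pattern (membership update, centre update, convergence check) that the BOFCM extends with its between-centre objective. It is standard FCM with fuzzifier 2, offered as an illustration only, not the BOFCM update itself.

```python
import numpy as np

def fcm(X, c, eps=1e-4, max_iter=200, seed=0):
    """Standard fuzzy c-means (fuzzifier 2), illustrating the
    iterate-until-||U_new - U_old|| <= eps pattern of Step 6 above."""
    rng = np.random.default_rng(seed)
    K = len(X)
    U = rng.random((c, K))
    U /= U.sum(axis=0)                      # memberships sum to 1 per datum
    for _ in range(max_iter):
        W = U ** 2
        V = (W @ X) / W.sum(axis=1, keepdims=True)           # cluster centres
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U_new = (1.0 / d**2) / (1.0 / d**2).sum(axis=0)       # membership update
        if np.linalg.norm(U_new - U) <= eps:                  # convergence test
            return U_new, V
        U = U_new
    return U, V

# Example: 100 two-dimensional points grouped into c = 4 fuzzy clusters.
X = np.random.default_rng(1).random((100, 2))
U, V = fcm(X, c=4)
print(V)
```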
From the above analysis, we can obtain the clustered data within each center. To speed up the process, the documents can also be grouped according to their degrees of similarity, as defined by Jaccard's coefficient as follows:

$$r_{ij} = \frac{\displaystyle\sum_{s=1}^{n} \min(b_{is}, b_{js})}{\displaystyle\sum_{s=1}^{n} \max(b_{is}, b_{js})}, \qquad 0 \le b_{is}, b_{js} \le 1, \tag{7}$$

where $r_{ij}$ is the similarity between document $D_i$ and document $D_j$, and $b_{is}$, $b_{js}$ are the entries of matrix B giving the relevance of documents $D_i$ and $D_j$ with respect to concept $A_s$. So we can obtain the document fuzzy relationship matrix R:

$$R = \begin{bmatrix}
r_{11} & r_{12} & \cdots & r_{1m} \\
r_{21} & r_{22} & \cdots & r_{2m} \\
\vdots & \vdots & & \vdots \\
r_{m1} & r_{m2} & \cdots & r_{mm}
\end{bmatrix} \tag{8}$$
Again a transitive closure R* of R must be obtained. Then, by defining an acceptable level λ as the mean of the upper triangular part of R*, i.e.,

$$\lambda = \frac{\displaystyle\sum_{i=1}^{m-1} \sum_{j>i} r^{*}_{ij}}{m(m-1)/2},$$

we have a λ-threshold partition of documents into clusters. Based on the document descriptor of each document, we can obtain a cluster-concept matrix B′.

Figure 1. Framework of the Web mining system
$$B' = \begin{bmatrix}
b'_{11} & b'_{12} & \cdots & b'_{1n} \\
b'_{21} & b'_{22} & \cdots & b'_{2n} \\
\vdots & \vdots & & \vdots \\
b'_{u1} & b'_{u2} & \cdots & b'_{un}
\end{bmatrix}, \tag{9}$$

(rows: Group 1, …, Group u; columns: concepts $A_1, \dots, A_n$)
where u is the number of clusters of documents.
With the results of above off-line analysis, a user can take advantage of the clustered documents to improve response time when making an online query. By comparing the matrix B’ with the user’s query vector, the most relevant cluster(s) are selected. Then, by searching the documents within the selected cluster(s), the documents may be retrieved more efficiently. The framework of the proposed Web mining system (Lin, 2002), with both online and off-line operations, is shown in Figure 1.
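A minimal sketch of this online step follows. The chapter does not fix a particular matching measure between the query vector and the rows of B′, so cosine similarity is used here purely for illustration, and the function and variable names are placeholders.

```python
import numpy as np

def best_groups(query, B_prime, top=1):
    """Rank the rows of the cluster-concept matrix B' against a query vector.

    query    -- 1-D array of concept scores supplied by the user
    B_prime  -- (u groups x n concepts) cluster-concept matrix
    """
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    rows = B_prime / np.linalg.norm(B_prime, axis=1, keepdims=True)
    scores = rows @ q                      # cosine similarity to each group
    order = np.argsort(scores)[::-1]
    return order[:top], scores[order[:top]]
```

The selected group then restricts the document search to a single cluster, which is what reduces the online response time.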
A Case Study of the Web Mining System In order to demonstrate the proposed system, a Web site, called veryMobile-Phone (http://203.68.224.196/ verymobile/), was constructed in Chinese, in order to catch the behavior of local customers in Taiwan.
Figure 2. Flow diagram of verymobile-phone system
The online operation procedure, based on the information provided from an off-line established database, is shown in Figure 2. This initial database was established based on a survey of 800 individuals. The respondents were full- and part-time students, ranging from 15 to 35 years of age, at the Minghsin University of Science and Technology and full-time students, ranging from 20 to 30 years of age, at the Tsing Hua University. A total of 638 questionnaires were returned. After deleting invalid questionnaires, we had 562 valid responses. In this questionnaire, personal data, such as Sex — male or female; Age — under 15, 16~18, 19~22, 23~30 or over 30; Education — senior high school, college, university, masters, PhD or others; Average Income — none, under NTS 18,000, 18,000~30,000, 30,000~45,000 or over 45,000 were recorded, along with their preferences in purchasing a mobile phone, with features made up of A1 :brand, A2 :external, A3 :price, A4 :service, A5 :function, A6 :ease of use, A7 :special offer, etc. Via the BOFCM, users were classified into c=4 groups. The mobile phones, in stock, were also grouped by their features, according to the concepts defined for information retrieval. Below, we demonstrate how the proposed mechanism can be used to suggest the appropriate mobile phone brand for each user, by responding to his or her query, based on his or her features and preferences.
Off-Line Phase

The off-line analysis is used mainly to establish the initial database features, including user categories and preferences, as well as mobile phone clusters. The users' data were grouped by applying the BOFCM. Four clusters were obtained and stored. For each group of users, the concept matrix was calculated, as shown below, to describe the preference relationships obtained among the mobile phone features (rows and columns ordered: brand, external, price, service, function, ease of use, special offer):

A2 =
    1.00  0.12  0.15  0.10  0.09  0.15  0.00
    0.12  1.00  0.11  0.08  0.07  0.10  0.00
    0.15  0.11  1.00  0.10  0.08  0.12  0.00
    0.10  0.08  0.10  1.00  0.06  0.09  0.00
    0.09  0.07  0.08  0.06  1.00  0.07  0.00
    0.15  0.10  0.12  0.09  0.07  1.00  0.00
    0.00  0.00  0.00  0.00  0.00  0.00  1.00

A3 =
    1.00  0.13  0.15  0.12  0.10  0.13  0.05
    0.13  1.00  0.11  0.10  0.08  0.10  0.04
    0.15  0.11  1.00  0.12  0.09  0.12  0.05
    0.12  0.10  0.12  1.00  0.08  0.10  0.04
    0.10  0.08  0.09  0.08  1.00  0.08  0.03
    0.13  0.10  0.12  0.10  0.08  1.00  0.04
    0.05  0.04  0.05  0.04  0.03  0.04  1.00

A4 =
    1.00  0.12  0.14  0.12  0.08  0.13  0.04
    0.12  1.00  0.11  0.10  0.07  0.11  0.03
    0.14  0.11  1.00  0.11  0.06  0.12  0.04
    0.12  0.10  0.11  1.00  0.07  0.11  0.04
    0.08  0.07  0.08  0.07  1.00  0.08  0.03
    0.13  0.11  0.12  0.11  0.08  1.00  0.04
    0.04  0.03  0.04  0.04  0.03  0.04  1.00
Taking Cluster 2 as an example, the transitive closure of the concept matrix A2 is shown in the following analysis (same ordering of rows and columns as above):

A2* =
    1.00  0.12  0.15  0.10  0.09  0.15  0.00
    0.12  1.00  0.12  0.10  0.09  0.12  0.00
    0.15  0.12  1.00  0.10  0.09  0.15  0.00
    0.10  0.10  0.10  1.00  0.09  0.10  0.00
    0.09  0.09  0.09  0.09  1.00  0.09  0.00
    0.15  0.12  0.15  0.10  0.09  1.00  0.00
    0.00  0.00  0.00  0.00  0.00  0.00  1.00
In the meantime, the document descriptor matrix was generated by 14 mobile-phone brands versus 7 concepts (columns ordered: brand, external, price, service, function, ease of use, special offer):

D =
    BenQ            0.00  0.14  0.29  0.00  0.14  0.00  0.43
    ALCATEL         0.13  0.38  0.25  0.00  0.06  0.00  0.19
    Sony ERICSSON   0.17  0.27  0.08  0.06  0.27  0.04  0.12
    Kyocera         0.00  0.80  0.07  0.00  0.07  0.00  0.07
    Mitsubishi      0.20  0.40  0.00  0.00  0.20  0.00  0.20
    MOTOROLA        0.17  0.30  0.11  0.03  0.21  0.06  0.12
    NEC             0.17  0.39  0.22  0.04  0.17  0.00  0.00
    NOKIA           0.22  0.27  0.19  0.02  0.21  0.03  0.06
    Panasonic       0.10  0.54  0.10  0.00  0.17  0.00  0.08
    PHILIPS         0.00  0.33  0.50  0.00  0.00  0.00  0.17
    SAGEM           0.14  0.21  0.14  0.07  0.11  0.00  0.32
    SIEMENS         0.04  0.17  0.25  0.00  0.33  0.00  0.21
    BOSCH           0.00  0.00  0.67  0.00  0.00  0.00  0.33
    Others          0.00  0.00  1.00  0.00  0.00  0.00  0.00
To obtain the document-concept matrix, D was composed with A2* (columns ordered: brand, external, price, service, function, ease of use, special offer):

B2 = D ⊗ A2* =
    BenQ            0.15  0.14  0.29  0.10  0.14  0.15  0.43
    ALCATEL         0.15  0.38  0.25  0.10  0.09  0.15  0.19
    Sony ERICSSON   0.17  0.27  0.15  0.10  0.27  0.15  0.12
    Kyocera         0.12  0.80  0.12  0.10  0.09  0.12  0.07
    Mitsubishi      0.20  0.40  0.15  0.10  0.20  0.15  0.20
    MOTOROLA        0.17  0.30  0.15  0.10  0.21  0.15  0.12
    NEC             0.17  0.39  0.22  0.10  0.17  0.15  0.00
    NOKIA           0.22  0.27  0.19  0.10  0.21  0.15  0.06
    Panasonic       0.12  0.54  0.12  0.10  0.17  0.12  0.08
    PHILIPS         0.15  0.33  0.50  0.10  0.09  0.15  0.17
    SAGEM           0.14  0.21  0.14  0.10  0.11  0.14  0.32
    SIEMENS         0.15  0.17  0.25  0.10  0.33  0.15  0.21
    BOSCH           0.15  0.12  0.67  0.10  0.09  0.15  0.33
    Others          0.15  0.12  1.00  0.10  0.09  0.15  0.00
From the matrix B2, the relationship between each mobile phone is calculated (rows and columns ordered: BenQ, ALCATEL, Sony ERICSSON, Kyocera, Mitsubishi, MOTOROLA, NEC, NOKIA, Panasonic, PHILIPS, SAGEM, SIEMENS, BOSCH, Others):

R2 =
    BenQ            1.00 0.65 0.57 0.37 0.58 0.58 0.53 0.56 0.45 0.61 0.74 0.70 0.69 0.43
    ALCATEL         0.65 1.00 0.68 0.58 0.81 0.73 0.77 0.67 0.65 0.79 0.69 0.70 0.56 0.42
    Sony ERICSSON   0.57 0.68 1.00 0.51 0.79 0.93 0.71 0.84 0.65 0.61 0.67 0.75 0.45 0.37
    Kyocera         0.37 0.58 0.51 1.00 0.57 0.54 0.56 0.51 0.77 0.48 0.47 0.40 0.32 0.28
    Mitsubishi      0.58 0.81 0.79 0.57 1.00 0.84 0.77 0.77 0.72 0.65 0.68 0.68 0.47 0.34
    MOTOROLA        0.58 0.73 0.93 0.54 0.84 1.00 0.76 0.86 0.70 0.65 0.69 0.70 0.46 0.37
    NEC             0.53 0.77 0.71 0.56 0.77 0.76 1.00 0.78 0.71 0.63 0.55 0.60 0.42 0.42
    NOKIA           0.56 0.67 0.84 0.51 0.77 0.86 0.78 1.00 0.64 0.60 0.62 0.67 0.44 0.40
    Panasonic       0.45 0.65 0.65 0.77 0.72 0.70 0.71 0.64 1.00 0.54 0.55 0.51 0.36 0.31
    PHILIPS         0.61 0.79 0.61 0.48 0.65 0.65 0.63 0.60 0.54 1.00 0.60 0.61 0.70 0.56
    SAGEM           0.74 0.69 0.67 0.47 0.68 0.69 0.55 0.62 0.55 0.60 1.00 0.67 0.61 0.36
    SIEMENS         0.70 0.70 0.75 0.40 0.68 0.70 0.60 0.67 0.51 0.61 0.67 1.00 0.56 0.41
    BOSCH           0.69 0.56 0.45 0.32 0.47 0.46 0.42 0.44 0.36 0.70 0.61 0.56 1.00 0.66
    Others          0.43 0.42 0.37 0.28 0.34 0.37 0.42 0.40 0.31 0.56 0.36 0.41 0.66 1.00
Then the transitive closure R2* of the matrix R2 can be obtained, as shown below, which is an equivalent matrix that can be used for clustering according to the desired level of similarity, λ (rows and columns in the same brand order as R2):

R2* =
    BenQ            1.00 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.74 0.70 0.70 0.66
    ALCATEL         0.70 1.00 0.81 0.72 0.81 0.81 0.78 0.81 0.72 0.79 0.70 0.75 0.70 0.66
    Sony ERICSSON   0.70 0.81 1.00 0.72 0.84 0.93 0.78 0.86 0.72 0.79 0.70 0.75 0.70 0.66
    Kyocera         0.70 0.72 0.72 1.00 0.72 0.72 0.72 0.72 0.77 0.72 0.70 0.72 0.70 0.66
    Mitsubishi      0.70 0.81 0.84 0.72 1.00 0.84 0.78 0.84 0.72 0.79 0.70 0.75 0.70 0.66
    MOTOROLA        0.70 0.81 0.93 0.72 0.84 1.00 0.78 0.86 0.72 0.79 0.70 0.75 0.70 0.66
    NEC             0.70 0.78 0.78 0.72 0.78 0.78 1.00 0.78 0.72 0.78 0.70 0.75 0.70 0.66
    NOKIA           0.70 0.81 0.86 0.72 0.84 0.86 0.78 1.00 0.72 0.79 0.70 0.75 0.70 0.66
    Panasonic       0.70 0.72 0.72 0.77 0.72 0.72 0.72 0.72 1.00 0.72 0.70 0.72 0.70 0.66
    PHILIPS         0.70 0.79 0.79 0.72 0.79 0.79 0.78 0.79 0.72 1.00 0.70 0.75 0.70 0.66
    SAGEM           0.74 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 1.00 0.70 0.70 0.66
    SIEMENS         0.70 0.75 0.75 0.72 0.75 0.75 0.75 0.75 0.72 0.75 0.70 1.00 0.70 0.66
    BOSCH           0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 0.70 1.00 0.66
    Others          0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 0.66 1.00
In our system, a default value of λ is defined by taking the mean value of all elements of the upper triangle. That is, λ = 0.73 is the clustering threshold of R2* and with such λ-cut operation, R2* can be transformed into a 0/1 matrix.
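The λ-cut and the reading-off of groups from the resulting 0/1 matrix can be sketched as follows; it is a small, self-contained illustration with placeholder names, not the system's actual code. With R2* above and the default λ it reproduces the five groups listed below.

```python
import numpy as np

def lambda_cut_groups(R_star, lam=None):
    """Threshold an equivalence (transitive-closure) matrix and read off groups.

    R_star -- (m x m) fuzzy equivalence matrix
    lam    -- threshold; if None, use the mean of the upper triangle
              (the chapter's default, which gives 0.73 for R2*)
    """
    m = R_star.shape[0]
    if lam is None:
        iu = np.triu_indices(m, k=1)
        lam = R_star[iu].mean()
    cut = (R_star >= lam).astype(int)          # the 0/1 matrix
    # Because R* is an equivalence matrix, identical rows of the cut matrix
    # identify members of the same group.
    groups = {}
    for i in range(m):
        groups.setdefault(tuple(cut[i]), []).append(i)
    return lam, cut, list(groups.values())
```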
R2* with λ = 0.73 applied (entries ≥ 0.73 set to 1, others to 0; same brand order):

    BenQ            1 0 0 0 0 0 0 0 0 0 1 0 0 0
    ALCATEL         0 1 1 0 1 1 1 1 0 1 0 1 0 0
    Sony ERICSSON   0 1 1 0 1 1 1 1 0 1 0 1 0 0
    Kyocera         0 0 0 1 0 0 0 0 1 0 0 0 0 0
    Mitsubishi      0 1 1 0 1 1 1 1 0 1 0 1 0 0
    MOTOROLA        0 1 1 0 1 1 1 1 0 1 0 1 0 0
    NEC             0 1 1 0 1 1 1 1 0 1 0 1 0 0
    NOKIA           0 1 1 0 1 1 1 1 0 1 0 1 0 0
    Panasonic       0 0 0 1 0 0 0 0 1 0 0 0 0 0
    PHILIPS         0 1 1 0 1 1 1 1 0 1 0 1 0 0
    SAGEM           1 0 0 0 0 0 0 0 0 0 1 0 0 0
    SIEMENS         0 1 1 0 1 1 1 1 0 1 0 1 0 0
    BOSCH           0 0 0 0 0 0 0 0 0 0 0 0 1 0
    Others          0 0 0 0 0 0 0 0 0 0 0 0 0 1
In consequence, 5(u=5) groups of mobile-phone types can be obtained from the 14 available brands, as follows:
Group 1 = {BenQ, SAGEM}, Group 2 = {ALCATEL, Sony ERICSSON, Mitsubishi, MOTOROLA, NEC, NOKIA, PHILIPS, SIEMENS}, Group 3 = {Kyocera, Panasonic}, Group 4 = {BOSCH}, Group 5 = {Others}
Figure 3. Result of the case study
Based on the document descriptor of each document, we obtained the group-concept matrix B2’, which extensively reduces the data dimension and thus speeds up the information retrieval process.
B2′ (columns ordered: brand, external, price, service, function, ease of use, special offer) =

    Group 1   0.15  0.17  0.21  0.10  0.13  0.15  0.38
    Group 2   0.17  0.31  0.23  0.10  0.20  0.15  0.13
    Group 3   0.12  0.67  0.12  0.10  0.13  0.12  0.08
    Group 4   0.15  0.12  0.67  0.10  0.09  0.15  0.33
    Group 5   0.15  0.12  1.00  0.10  0.09  0.15  0.00
With the same procedure, we can calculate the document-concept matrices B1, B3, B4 for each set of clustered users, respectively; this clustering information is also stored in the database. This completes the off-line phase.
Online Phase If a Web user wants to buy a mobile phone, and signs into the Web site, he or she is asked to provide basic data. If, say, the user is female, 22 years old, university educated and earns NTS 30,000~45,000 Income, and emphasized external as her top preference in purchasing a mobile phone, then this information will allow the system to classify the user into user-cluster 2, and with lexicographic ordering of the components, corresponding to the concept “external” of B2’, the system will provide { Kyocera, Panasonic } of Group 3 with the scores of each concept in percentages of (9,50,9,7,10,9,6). The corresponding scores come up with brand: 12, external: 67, price: 12, service: 10, function: 13, ease of use: 12, and special offer: 8 for reference. If she is not satisfied, the system will ask for her preference structure with reference to the above scores. If she replies with the scores of (23, 27, 19, 12, 11, 4, 4), comparing vector Q with the matrix B’, we can find that the most compatible group of mobile phone is the second one, Group 2 = {ALCATEL, Sony ERICSSON, Mitsubishi, MOTOROLA, NEC, NOKIA, PHILIPS, SIEMENS} and then suggest that this user purchase the most relevant mobile phone. The result, shown below, has been translated into English for ease of understanding (see Figure 3). Different types of users map into different users’ clusters and the system provides the most appropriate information corresponding to different clusters of documents. For example, if a male user, 18 years old, college educated, with no Income, and an emphasis on function, he would be classified into Cluster 4. The documents would be grouped as Group 1: {BenQ, SAGEM}; Group 2: {ALCATEL, Sony ERICSSON, Kyocera, Mitsubishi, MOTOROLA, NEC, NOKIA, PHILIPS, SIEMENS}, Group 3: {Panasonic}, Group 4: {BOSCH} and Group 5: {Others}. The system will provide {Panasonic} with the scores of each concept in percentages of (11, 13, 18, 8, 25, 9, 15). Furthermore, if he is not satisfied, after entering a new set of scores, the system will provide a new suggestion. If the users referred to above purchased the mobile phones recommended, their data would be used to update the database, otherwise the database will not be changed. Due to the billing system, such updating processes would be carried out once a month.
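The second-stage suggestion can be reproduced numerically. The snippet below scores the user's revised preference vector (23, 27, 19, 12, 11, 4, 4) against the five rows of B2′ using cosine similarity; the measure is an assumption for illustration (the chapter does not name one), but it does rank Group 2 highest, in agreement with the recommendation above.

```python
import numpy as np

B2_prime = np.array([
    [0.15, 0.17, 0.21, 0.10, 0.13, 0.15, 0.38],   # Group 1
    [0.17, 0.31, 0.23, 0.10, 0.20, 0.15, 0.13],   # Group 2
    [0.12, 0.67, 0.12, 0.10, 0.13, 0.12, 0.08],   # Group 3
    [0.15, 0.12, 0.67, 0.10, 0.09, 0.15, 0.33],   # Group 4
    [0.15, 0.12, 1.00, 0.10, 0.09, 0.15, 0.00],   # Group 5
])
Q = np.array([23, 27, 19, 12, 11, 4, 4], dtype=float)   # user's revised scores

cos = (B2_prime @ Q) / (np.linalg.norm(B2_prime, axis=1) * np.linalg.norm(Q))
best = int(np.argmax(cos)) + 1
print(cos.round(3), "-> most compatible: Group", best)   # Group 2
```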
Summary and Discussion Internet technology has developed rapidly in recent years and one of the primary current issues is how to effectively provide information to users. In this study, we utilized a data mining information retrieval technique to create a Web mining system. Since existing retrieval methods do not consider user preferences and hence do not effectively provide appropriate information, we used an off-line process to cluster users, according to their features and preferences, using a bi-criteria BOFCM algorithm. By doing so, the online response time was reduced in a practical test case when a user sent a query to the Web site. The case study in this chapter (a service Web site selling mobile phones) demonstrated that by using the proposed information retrieval technique, a query-response containing a reasonable number, rather than a huge number, of mobile phones could be provided which best matched a users’ preferences. Furthermore, it was shown that a single criterion for choosing the most favorable mobile-phone brand was not sufficient. Thus, the scores provided for the suggested group could be used as a reference for overall consideration. This not only speeds up the query process, but can also effectively support purchase decisions. In system maintenance, counterfeit information causes aggravation for Web site owners. Our proposed system updates the database only if the purchase action is actually carried out, which reduces the risk of false data. Further study into how a linguistic query may be transformed into a numerical query is necessary to allow a greater practical application of this proposal.
References

Angelides, M. C. (1997). Implementing the Internet for business: A global marketing opportunity. International Journal of Information Management, 17(6), 405-419.

Bezdek, J. C. (1973). Fuzzy mathematics in pattern classification. Unpublished doctoral dissertation, Applied Mathematics Center, Cornell University, Ithaca.

Chen, S. M., & Wang, J. Y. (1995). Document retrieval using knowledge-based fuzzy information retrieval techniques. IEEE Transactions on Systems, Man and Cybernetics, 25(5), 793-802.

Chen, Z. (2001). Data mining and uncertain reasoning. New York: John Wiley.

Hanson, W. (2000). Principles of Internet marketing. Sydney: South-Western College.

Janal, D. S. (1995). Online marketing handbook: How to sell, advertise, publicize and promote your product and services on Internet and commercial online systems. New York: Van Nostrand Reinhold.

Jang, J. S., Sun, C. T., & Mizutani, E. (1997). Neuro-fuzzy and soft computing: A computational approach to learning and machine intelligence. Upper Saddle River, NJ: Prentice-Hall.

Lin, C. L. (2002). Web mining based on fuzzy means for service Web site. Unpublished master's dissertation, Tsing Hua University, Taiwan.
Mohammed, R. A., Fisher, R. J., Jaworski, B. J., & Paddison, G. J. (2004). Internet marketing: Building advantage in a networked economy. Boston: McGraw Hill/Irwin.

Schneider, G. P. (2004). Electronic commerce: The second wave. Australia: Thomson Learning.

Rayport, C., & Jaworski, H. (2002). E-commerce marketing: Introduction to e-commerce. Boston: McGraw Hill/Irwin.

Tamura, S., Higuchi, K., & Tanaka, K. (1971). Pattern classification based on fuzzy relations. IEEE Transactions on Systems, Man and Cybernetics, 1, 61-66.

Wang, H. F., Wang, C., & Wu, G. Y. (1994). Multicriteria fuzzy C-means analysis. Fuzzy Sets and Systems, 64, 311-319.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.
Endnote

1. This study is supported by National Science Council, Taiwan, ROC, with project number NSC 91-2213-E-007-075.
This work was previously published in Business Applications and Computational Intelligence , edited by K. E. Voges; N. K. L. Pope, pp. 113-130, copyright 2006 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
Chapter XLVII
Web Service Architectures for Text Mining: An Exploration of the Issues via an E-Science Demonstrator Neil Davis The University of Sheffield, UK George Demetriou The University of Sheffield, UK Robert Gaizauskas The University of Sheffield, UK Yikun Guo The University of Sheffield, UK Ian Roberts The University of Sheffield, UK
Abstract Text mining technology can be used to assist in finding relevant or novel information in large volumes of unstructured data, such as that which is increasingly available in the electronic scientific literature. However, publishers are not text mining specialists, nor typically are the end-user scientists who consume their products. This situation suggests a Web services based solution, where text mining specialists process the literature obtained from publishers and make their results available to remote consumers (research scientists). In this chapter we discuss the integration of Web services and text mining within the domain of scientific publishing and explore the strengths and weaknesses of three generic architectural designs for delivering text mining Web services. We argue for the superiority of one of these and demonstrate
its viability by reference to an application designed to provide access to the results of text mining over the PubMed database of scientific abstracts.
INTRODUCTION With the explosion of scientific publications it has become increasingly difficult for researchers to keep abreast of advances in their own field, let alone trying to comprehend advances in related fields. Due to this rapid increase in the quantity of available electronic textual data both by publishers and third party providers, automatic text mining is of increasing interest to extract and collate information in order to make the scientific researcher’s job easier. Some publishers are already beginning to make textual data available via Web services and this trend seems likely to increase as new uses for data provided in this manner are discovered. Not only does the internet provide a means to accelerate the publishing cycle, it also offers opportunities for new services to be provided to readers, such as search and content-based information access over huge text collections. It is not envisioned that publishers themselves will provide technically complex text mining functionality, but that such functionality will be supplied by specialist text processors via “value added” services layered on top of the basic Web services supplied by the publishers. These specialist text processors will need domain expertise in the scientific area for which they are producing text mining applications. However they are unlikely to be the research scientists using the information, because of the specialised knowledge required to build text mining applications. Starting with the presumption of three interacting entities: publishers, text mining application providers and consumers of published material and text mining results, we discuss in this chapter a variety of architectural designs for delivering text mining using Web services and describe a prototype application based on one of them. In the rest of this section we review some of the context and related work pertaining to this project.
Text Mining Text Mining is a term, which is currently being used to mean various things by various people. In its broadest sense it may be used to refer to any process of revealing information, regularities, patterns or trends, in textual data. Text Mining can be seen as an umbrella term covering a number of established research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD), and so on. In a narrower sense it requires the discovery of new information, not just the provision of access to information existing already in a text or to vague trends in text (Hearst, 1999). In the context of this paper, we shall use the term in its broadest sense. We believe that, while the end goal may be the discovery of new information from text, the provision of services which accomplish more modest tasks are essential components for more sophisticated systems. These components are therefore part of the text mining enterprise, and lend themselves more freely to being used in Web services architecture. Text mining is particularly relevant to bioinformatics applications, where the explosive growth of the biomedical literature over the last few years has made the process of searching for information in this literature an increasingly difficult task for biologists. For example the 2004 baseline release of Medline contains 12,421,396 abstracts, published between the years of 1902 and 2004, of which 4,391,392 (around 35 percent) were published between 1994 and 2004.
Depending on the complexity of the task, text mining systems may have to employ a range of text processing techniques, from simple information retrieval to sophisticated natural language analysis, or any combination of these techniques. Text mining systems tend to be constructed from pipelines of components, such as tokenisers, lemmatisers, part-of-speech taggers, parsers, n-gram analysers, and so on. New applications may require modification of one or more of these components, or the addition of new bespoke components; however different applications can often re-use existing components. The exploration of the potential of text mining systems has so far been hindered by non-standardised data representations, the diversity of processing resources across different platforms at different sites and the fact that linguistic expertise for developing or integrating natural language processing components is still not widely available. All this suggests that, in the current era of information sharing across networks, an approach based on Web services may be better suited to rapid system development and deployment.
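As a small illustration of such a component pipeline, the snippet below chains tokenisation and part-of-speech tagging with NLTK; it merely stands in for the more elaborate, domain-tuned pipelines described here and is not part of the chapter's own system.

```python
import nltk

# One-off downloads of the tokeniser and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def tag_abstract(text):
    """Tokenise an abstract and attach part-of-speech tags,
    a typical early stage of a text mining pipeline."""
    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens)

print(tag_abstract("The protein binds to the receptor and inhibits cell growth."))
```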
Web Services The World Wide Web Consortium (W3C) defines Web services as “a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP-messages, typically conveyed using HTTP with an XML serialisation in conjunction with other Web-related standards” (http://www.w3.org/TR/wsgloss/). This distributed, interoperable environment is already proving to be useful in bioinformatics research where the typical cycle of literature search, experimental work, and article publishing usually requires access to data which is heterogeneous in nature. Typical data resources may include structured database records, numerical results, natural language texts, and so forth. This data is also typically distributed across the Internet in research institutions around the world and made available on a variety of platforms and via non-uniform interfaces. Web services are likely to prove useful for text mining because they allow for the flexible reuse of components across multiple applications, which is common practice. Furthermore, they are a good fit to the distributed effort available in the research community who will build, at least in the first instance, the text mining systems designed to benefit those who use the scientific literature.
Workflows Web service operations perform single processing tasks that are likely to be of use in a variety of contexts. While a complex processing application may be wrapped as a single Web service, requirements for modularity and reusability within a distributed context mean that it is frequently useful to build complex processing applications out of a set of more elementary Web services operations. There is considerable work underway currently to coordinate Web services in a controlled and repeatable way by using workflows (Addis et al., 2003). Numerous workflows, coordinating Web services into overall applications, have been developed to support scientific research. By exposing text-processing Web services it is possible to coordinate text retrieval and text processing functions in a workflow environment, either as a standalone text mining application or as part of a larger scientific workflow. Existing scientific workflows typically involve operations on structured or numerical data, with all the data interpretation being done by the user, in the context of the related work in the field as reported in the scientific literature. Text mining
824
Web Service Architectures for Text Mining
technology can assist data interpretation by automatically building helpful pathways into the relevant literature as part of a workflow in order to support the scientific discovery process.
Related Work Compared to the amount of research and development in text mining, Web service technologies are still relatively new. It is therefore not surprising that work on integrating text mining with Web services is still limited. Some examples of generic language components available as Web services can be found at the Remote Methods Web site (http://www.remotemethods.com). Some text mining specific Web applications have been proposed by Kubo, Uramoto, Grell, and Matsuzawa (2002), who describe a text-mining middleware architecture based on Web services. The middleware proposed by Kubo is implemented as a two-layer architecture, with the first layer providing the API to the text mining components and the second layer providing the Web services. The text mining subsystem is based on the text analysis system by Nasukawa, Kawano, and Arimura (2001). The system we have developed differs from the one by Kubo et al. (2002) in that we use a set of atomic Web services designed to be run from within a biomedical workflow, rather than being tightly coupled to a piece of proprietary middleware. In this way, we allow the external workflow to specify the arrangement of inputs and outputs of the individual Web services rather than use a prearranged sequence of services in the middleware. Ghanem, Guo, Rowe, Chortaras, and Ratcliffe (2004) describe a grid infrastructure for text mining in bioinformatics which uses workflows or pipelines that represent transformations of data from one form to another. Their system provides a visual interface for integrating local components and Web services as well as a set of mechanisms for allowing the integration of data from remote sources. Applications range from finding associations between gene expressions and disease terms to gene-expression metabolite mapping and document categorisation. The text processing tasks include entity recognition (gene names) and entity relationship extraction (gene-gene, protein-protein). Their text mining system is embedded within a larger data mining system which is targeted at supporting end users engaged in the scientific discovery process, rather than supporting the flexible interaction of publishers, text mining specialists and end users that we explore here. Grover et al. (2004) also propose a Web services based architecture for the deployment of text mining services to application builders who are not text mining experts. The proposed framework makes use of individual text mining components which are exposed as Web services for use in a workflow environment. This architecture supports the construction of distributed text mining pipelines and the integration of either entire pipelines or selected elements into user defined workflows. They also propose an extension to the OWL-S hierarchy to support discovery of text mining services. However, their proposal does not address the wider issues we discuss below of integrating text mining using Web services into a distributed framework that supports the requirements of publishers, text mining service providers, and end users of text mining applications.
POSSIBLE WEB SERVICES ARCHITECTURES In this section we describe abstractions of alternative system designs based on Web services that may be adopted for utilising text mining results and we discuss their relative advantages and disadvantages. These architectures are based on the premise that publishers will provide simple Web service search and
Figure 1. Architecture overview
Figure 2. Data pre-processing and storage architecture
retrieval access to their data, such as those provided by the National Library of Medicine (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/eutils.wsdl). It is envisioned that more sophisticated text mining Web services will build on top of the text access Web services provided by the publisher, and provide "value added textual data resources". Thus an end user will have access to both the plain data directly from the publishers and to the mined data provided by specialist text processors via Web services. By using Web services to distribute the textual resources, data searches can be incorporated into scientific research workflows or federated to produce an integrated data presentation application.
Figure 1 illustrates the scenario for distributed text mining we investigate in the rest of this chapter. In the remainder of this section we consider the advantages and disadvantages of three possible architectures for implementing this scenario.
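All three architectures assume that each publisher exposes a search-and-retrieval access layer as a Web service. As a purely illustrative sketch of how a workflow step or text processor might call such a layer, the fragment below uses a generic SOAP client; the WSDL location and the fetchAbstract operation are invented placeholders, and a real publisher interface (such as the NLM eUtils WSDL cited above) defines its own operations.

from zeep import Client  # third-party SOAP client, used here purely for illustration

def fetch_abstracts(pmids, wsdl_url="https://publisher.example.org/access-layer.wsdl"):
    """Retrieve plain abstracts for a list of PubMed identifiers from a
    publisher's (hypothetical) access-layer Web service."""
    publisher = Client(wsdl_url)  # parses the publisher's WSDL
    # fetchAbstract is an invented operation name; the operations actually
    # available are dictated by the publisher's own WSDL.
    return {pmid: publisher.service.fetchAbstract(id=pmid) for pmid in pmids}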
Data Pre-Processing and Storage

In the data pre-processing and storage architecture, shown in Figure 2, the specialist text processors hold a replicated copy of the publisher's data on site, which is maintained by a regular Web service feed from the publisher. The results of any text extraction algorithms can then be merged with the publisher's data and stored locally, where they can be retrieved by a single Web service call. In addition to any annotation or markup results from text processing, the pre-processing and local storage approach allows for the indexing of both the text and the results of the text processing algorithms. This allows the client to search over both the publisher's text and the text mining results, in order to retrieve data on the basis of both simple free text searches and indices of the extracted data. This pre-processing and local storage approach has the advantage that both the publisher's text and the results of the text processing algorithms are stored in one place, removing the possibility of receiving partial data and speeding up the search and retrieval process by implementing it as a single Web service call.

However, this implementation of the proposed architecture has problems associated with it. The most obvious problem is one of resource management. Textual resources tend to be large (for example, the XML 2004 baseline release of PubMed biomedical abstracts is 44 GB), and storing numerous publishers' data resources can quickly lead to storage and access problems. In addition to the potential storage problem, text resources need to be regularly updated as new papers are published. To retain end user utility, the text processors must keep their locally stored copy of the textual data current.

One of the recurrent challenges with the overall architecture we propose, though of particular relevance to the pre-processing and local storage model, is that of access control. Publishers will, in general, charge for access to their resources via some form of licensing agreement. This raises two potential problems: first, some form of access control must be implemented so that end users can only access the data their license agreements allow them to; and second, there may be cases where the end user has access to some elements of the stored and merged data but not to all of it. To counter these potential problems, some form of access control will need to be implemented on a per-source basis, and data requests sorted not only according to the query but also according to the resources available to the client. This may lead to cases where some of the merged data and associated indices may need to be dynamically removed before the value-added text resources can be returned to the client.
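As a rough illustration of the combined local store this architecture implies, the sketch below keeps both the replicated abstracts and the extracted entity annotations in one relational database, so that a free-text query and an entity-index lookup are answered by a single local call. The schema, table names and use of SQLite are illustrative assumptions only; they are not the schema of the system described later, which uses MySQL.

import sqlite3  # stand-in for the relational store used by the text processor

conn = sqlite3.connect("mined_corpus.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS articles (
    pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT);
CREATE TABLE IF NOT EXISTS entities (
    pmid TEXT, start_offset INTEGER, end_offset INTEGER,
    entity_type TEXT, surface TEXT);
CREATE INDEX IF NOT EXISTS idx_entity_surface ON entities(surface);
""")

def search(term):
    # A single local call covers both a free-text search over the replicated
    # abstracts and a lookup against the index of extracted entities.
    return conn.execute(
        """SELECT DISTINCT a.pmid, a.title
             FROM articles a LEFT JOIN entities e ON e.pmid = a.pmid
            WHERE a.abstract LIKE ? OR e.surface = ?""",
        (f"%{term}%", term)).fetchall()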
Data Pre-Processing and Merging

Figure 3. Data pre-processing and merging architecture (the client requests enriched data through the client-visible Web service; the text processing specialist retrieves its stored annotations from a database, requests the original text from the appropriate publisher(s) on the client's behalf through the publishers' access-layer Web services, merges the two, and returns the merged data to the client)

In the data pre-processing and merging architecture, as shown in Figure 3, the specialist text processors hold the results of any text extraction algorithms as a set of stand-off annotations, which can then be merged with the publisher's data as it is requested. Requests made by the client are proxied through to the publisher and, if the client is permitted to retrieve the data, any annotations can be merged with the raw text and returned to the user as value-added text. Access control would be via some kind of token-based authentication mechanism. For example, the client could acquire an authentication token for the session from the publisher (e.g., a Kerberos ticket), and pass this to the text processing Web service. The text processing service would then include this token in its requests to the publisher to fetch only those texts to which the client has access, perhaps using the OASIS WS-Security mechanisms (http://www.oasis-open.org/specs/). By pre-processing the data it is still possible to build indices over the publisher's entire set of data resources, or potentially over multiple sets of data from multiple publishers.

This pre-processing and storage of stand-off annotations reduces the potential data storage problems that may have been encountered in the pre-processing and storage architecture, and moves the access control problem to the publisher. However, an architecture based on pre-processing the data and storing the extracted annotations to be merged upon request is not without its own problems. The primary problem is that, as stand-off annotations are typically stored as byte offsets, even a slight change in the format of the publisher's text can potentially throw all the annotations out of alignment. This means that the format of the publisher's data must be regularly checked against the stored annotations to make sure that the annotations remain valid.
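A minimal sketch of the request path in this architecture is given below, assuming a publisher that accepts the client's token over an HTTP-based access layer; the URL, the token handling and the in-memory annotation store are all hypothetical stand-ins (the implementation described later uses SOAP services and a MySQL annotation store). The merge step also makes the fragility of offset-based stand-off annotation concrete: any change to the retrieved text invalidates every subsequent offset.

import requests  # stand-in HTTP transport for the publisher's access layer
from xml.sax.saxutils import escape

ANNOTATIONS = {}  # pmid -> list of (start_offset, end_offset, entity_type), computed off-line

def serve_enriched(pmid, client_token):
    # Proxy the request to the publisher, forwarding the client's token so that
    # the publisher enforces its own licence checks; an unlicensed client is
    # simply refused at this point.
    resp = requests.get(f"https://publisher.example.org/abstracts/{pmid}",
                        headers={"Authorization": f"Bearer {client_token}"})
    resp.raise_for_status()
    return merge(resp.text, ANNOTATIONS.get(pmid, []))

def merge(text, annotations):
    # Turn offset-based stand-off annotations into inline XML markup. The
    # offsets are only valid against the exact text they were computed over.
    out, cursor = [], 0
    for start, end, etype in sorted(annotations):
        out.append(escape(text[cursor:start]))
        out.append(f'<entity type="{etype}">{escape(text[start:end])}</entity>')
        cursor = end
    out.append(escape(text[cursor:]))
    return "<abstract>" + "".join(out) + "</abstract>"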
On Demand Processing

Figure 4. On demand data processing architecture (the client requests enriched data through the client-visible Web service; the specialist text processing logic requests the original text from the appropriate publisher(s) on the client's behalf through their access-layer Web services, processes it, and returns the processed data to the client)

The on demand processing architecture, as shown in Figure 4, does away with any storage of either the publisher's data or any annotations produced by the specialist text processors, and instead processes the publishers' data as it is requested by the client. Each request to the specialist text processor is proxied through to a publisher and, assuming the publisher fulfils the request, the text is retrieved, processed, and returned to the client. This architecture removes any problems of storage resource management from the specialist text processor. However, it does introduce some serious limitations on the applications that the specialist text processor can provide. In particular, if applications are to run in real time, in other words while the user waits, the sophistication of the text processing will be severely limited. Even the amount of processing within a single text will be limited, and the processing of collections of texts that may be relevant to a query, to build, for example, inter-document annotations or indices, will lead to unacceptable time delays for real-time applications. Such problems do not arise for the pre-processing models, which can do off-line text analysis and index construction in advance of any specific user request.
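The latency concern can be seen in outline below: everything, including the linguistic analysis itself, happens inside the user's request. The fetch URL and the trivial dictionary lookup are placeholders standing in for a publisher access layer and for a real NLP pipeline such as the one described in the implementation section; they are not part of any system described here.

import time
import requests  # stand-in transport for the publisher's access layer

GAZETTEER = {"aspirin": "drug", "asthma": "disease"}  # toy stand-in for a real entity recogniser

def on_demand(pmid):
    started = time.time()
    text = requests.get(f"https://publisher.example.org/abstracts/{pmid}").text
    # All analysis is done while the user waits, so only shallow processing
    # (here, a naive dictionary lookup) is affordable per request.
    entities = [(w, GAZETTEER[w.lower()]) for w in text.split() if w.lower() in GAZETTEER]
    return {"pmid": pmid, "entities": entities, "seconds": time.time() - started}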
Client Versus Publisher Web Services

All of the described architectures contain two distinct types of Web service: those that communicate with the literature publishers and those that communicate with the end user (though there is also the possibility that the specialist text processing logic is itself a federation of Web services, as described in Grover et al. [2004]). Although both of these operations are implemented by Web services, they have very different requirements and functions. The publisher Web services are used for the bulk transport of quantities of text. The client Web services respond to interactive requests from the end user and so must be far more flexible and responsive than the publisher Web services. The publisher Web services are used to retrieve blocks of text, either in response to processed client requests (on demand processing) or to locally process the text to produce a set of stand-off annotations (data pre-processing and storage or data pre-processing and merging). The client services, on the other hand, rely on an initial request from the end user, which must be processed and responded to by the text processor.

The design of the client Web services must address the question of where the processing of the client's request is to take place. There is always going to be some element of processing by the specialist text processor, at the very least to deliver the papers and annotations relevant to the client's request. However, this collection of annotated texts (or corpus) requires some further work, such as indexing the texts and rendering for display, in order to support search and navigation functionality. This functionality may be localised within the text processor, on a client application, or distributed between the two. The distribution of the search and navigation processing depends to an extent on the functionality to be delivered to the end user.
CORE TEXT SERVICES IMPLEMENTATION

We have implemented a hybrid system as part of the myGrid project (Stevens, Robinson, & Goble, 2003). This implementation is architecturally similar to the data pre-processing and merging model. However, at this time we are using a local copy of the publisher's data sources rather than making a Web service call to retrieve the data. This was a pragmatic choice, made both because we were originally unsure of the speed and efficiency of Web services and because the underlying architecture of the myGrid middleware had not crystallised. The original intention was to implement a "data pre-processing and storage" model; however, it was later decided that the "data pre-processing and merging" model offered more flexibility.

The core text services consist of three main elements: (1) the text processing logic, (2) the annotation storage database, and (3) the text/annotation merging and delivery Web service. The design of the text/annotation merging and delivery Web service depends to a large extent on the functionality to be delivered to the end user and any client interface that is to be used. The text processing logic is the AMBIT information extraction engine (Gaizauskas et al., 2003), which utilises the GATE (General Architecture for Text Engineering) system (Cunningham, Maynard, Bontcheva, & Tablan, 2002) for document and annotation management. The AMBIT system uses a variety of NLP-based algorithms to identify biomedical entities, and the complex relationships between them, in the biomedical literature. The system comprises three major stages: lexical and terminological processing, syntactic and semantic processing, and discourse processing. The first stage identifies and classifies biomedical entities that occur in a given text, such as diseases, drugs, genes, and so forth, by first using Termino (Harkema et al., 2004), and then a term parser to combine different features and to give the final result. The second stage produces a (partial) syntactic and semantic analysis for each sentence in the text. The third stage integrates these results into a discourse model that resolves inter-sentence co-references and produces a final representation of the semantic content of the text.

Each biomedical entity is stored as a pair of byte offsets, representing the start and end of the entity in the appropriate paper. These byte offsets are stored in a set of relational database tables, implemented in MySQL (http://www.mysql.com). The final step of the core text services set-up is to merge the stand-off annotations with the appropriate text and return these to the client application. This is carried out by a Web service called generateCorpus, which takes a list of one or more PubMed identifiers and generates an XML representation of the combined data from the annotations database and the original abstract corresponding to each supplied PubMed identifier. These XML representations are returned to the caller. Currently the plain-text abstracts are retrieved from a local copy of the Medline database. However, as Web services are used to retrieve and merge the data, the actual location of the Medline database is immaterial and could just as easily be the database at the National Library of Medicine, or one of the many mirrors around the world. One of the functions to be delivered to the end user is the ability to search within the corpus delivered in response to the client query.
To supply this functionality without regenerating the corpus with each subsequent query, a copy of the corpus will have to be stored either on the client or on the server. Storing the corpus on the server requires the implementation of some form of session management to save user state, and also produces a possible storage problem if multiple corpora are to be held on the server at any one time. To avoid this potential problem, it was decided to split the localisation of the navigation between client and server, with the server generating the corpus and navigation index, which are transferred via Web service to the client for presentation and navigation. Further searches can either be carried out within the corpus already delivered to the client, or a new query can be submitted to the text processor.
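The split described above, with the server producing both the corpus and a navigation index that the client then presents, can be sketched roughly as follows; the index structure (MeSH term to PubMed identifiers), the document format and the function name are illustrative assumptions rather than the actual generateCorpus interface.

from collections import defaultdict

def build_corpus_and_index(documents):
    """documents: list of dicts such as
    {"pmid": "12345", "xml": "<abstract>...</abstract>", "mesh": ["Asthma"]}.
    Returns the marked-up corpus plus a navigation index to ship to the client."""
    corpus = {doc["pmid"]: doc["xml"] for doc in documents}
    nav_index = defaultdict(list)  # MeSH term -> PubMed identifiers
    for doc in documents:
        for term in doc["mesh"]:
            nav_index[term].append(doc["pmid"])
    return corpus, dict(nav_index)

# The client keeps both structures locally, so further searches within the
# delivered corpus need no round trip back to the text processor.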
A Sample Text Mining Application

The nature of bioinformatics research is such that it requires the repetition of a series of complex analytical tasks under known and constant conditions. This requirement can be satisfied by the use of a workflow model, which coordinates a series of atomic Web services into an experimental sequence under computer control. The use of a computer-mediated workflow allows experiments to be repeated and simplifies the storage and tracking of results and provenance. Already a number of workflow systems for the bioinformatics researcher have been produced, both by academic and commercial organisations. As a partner in the myGrid project we have explored the use of text mining Web services in the workflow environment developed for the project. The workflow tools developed within the myGrid project include Scufl, a language for specifying workflows (Oinn et al., 2004), Taverna, a tool for designing workflows (http://taverna.sourceforge.net/), and Freefluo, a workflow enactment engine (http://freefluo.sourceforge.net/).

Figure 5 illustrates a sample text mining application, shown as a coordinating workflow linking a number of Web services we have developed. Although the workflow is shown as a standalone entity, it was designed to fit into a larger bioinformatics workflow, one of whose steps is the production of a SwissProt (http://us.expasy.org/sprot/) or a BLAST report (http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html) resulting from a gene or protein sequence analysis step1. Such reports list genes or proteins whose structure is similar to a gene or protein of interest. They usually contain links to the PubMed abstract database where the related gene or protein is described. The text mining application assumes that the cited PubMed abstract, and other abstracts related to it, may be of interest to the researcher and automatically retrieves and clusters them. The abstracts will have previously been analysed off-line and significant biological entities in them identified and indexed. The output of the workflow is a clustered set of hyper-linked abstracts, which can be browsed by a client with the appropriate applet (as described below) and from which links into the PubMed database can be followed.

The workflow starts from a SwissProt report, which is parsed to retrieve any PubMed identifiers it may contain (ParseSwissProt/ParseBLAST). All the unique MeSH terms associated with these PubMed identifiers are retrieved (GetMeSH). The list of MeSH terms is then filtered to remove any non-discriminatory terms to increase the information-to-noise ratio (FilterMeSH). The PubMed identifiers of any paper that has one or more members of the filtered list of MeSH terms as the major topic header are retrieved (MeSHtoPMID). This new list of PubMed identifiers is passed to two Web services, the first being the corpus generation Web service outlined above, and the second being a service which clusters the PubMed identifiers using the MeSH tree as the organising principle (Cluster). These two Web services provide the output of the workflow, an XML representation of the marked-up corpus and the papers clustered along the MeSH tree, which can then be explored and viewed by the end user. Each Web service has been implemented using the Perl SOAP::Lite package (http://www.soaplite.com/), and the WSDL (http://www.w3.org/TR/wsdl) documents were hand-coded. The Web services are explained in greater detail in the following.
Data Retrieval Web Services

GetMeSH: The GetMeSH service takes a list of one or more PubMed identifiers and retrieves a list of the MeSH terms associated with the papers that the PubMed identifiers refer to. The list of retrieved MeSH terms is compiled into a unique list and returned to the caller.
Figure 5. A sample text mining workflow developed in Taverna
MeSHtoPMID: The MeSHtoPMID service takes a list of one or more MeSH terms and retrieves the PubMed identifiers of all the papers that have the MeSH term in question as a major topic header. The list of retrieved PubMed identifiers is compiled into a list with duplicates removed and returned to the caller.
Data Clustering Web Services

Cluster: The Cluster service takes a list of one or more PubMed identifiers and orders them using the MeSH tree as an organisational hierarchy. An XML representation of the MeSH tree with the supplied PubMed identifiers inserted into the appropriate nodes in the tree is returned to the caller.
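As an illustration of the kind of structure the Cluster service returns, the sketch below hangs a set of PubMed identifiers off the nodes of a toy fragment of the MeSH hierarchy and serialises the result as XML. The tree fragment, the identifier-to-heading mapping and the element names are all invented for the example and are not the service's actual schema.

import xml.etree.ElementTree as ET

# Toy fragment of the MeSH hierarchy and a toy PMID-to-heading mapping.
MESH_TREE = {"Diseases": {"Respiratory Tract Diseases": {"Asthma": {}}}}
PMID_TO_MESH = {"111": ["Asthma"], "222": ["Respiratory Tract Diseases"]}

def cluster(pmids):
    def build(name, children, parent):
        node = ET.SubElement(parent, "meshNode", term=name)
        for pmid in pmids:
            if name in PMID_TO_MESH.get(pmid, []):
                ET.SubElement(node, "document", pmid=pmid)
        for child, grandchildren in children.items():
            build(child, grandchildren, node)
    root = ET.Element("meshTree")
    for name, children in MESH_TREE.items():
        build(name, children, root)
    return ET.tostring(root, encoding="unicode")

print(cluster(["111", "222"]))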
Parsing and Helper Web Services

ParseBLAST: The ParseBLAST service takes a single BLAST report as input, from which all the listed gene ID records are extracted. For each extracted gene ID, a call is made to the National Library of Medicine (NLM) servers and the appropriate gene record is retrieved. Each gene record is parsed as it is retrieved and any PubMed identifiers (zero or more) are mined. The mined PubMed identifiers are compiled into a list with duplicates removed and returned to the caller.

ParseSwissProt: The ParseSwissProt service takes a single SwissProt record as input. All of the PubMed identifiers (zero or more) listed in the SwissProt record are mined and returned to the caller.

FilterMeSH: The FilterMeSH service takes a list of one or more MeSH terms and removes from that list any MeSH terms that are deemed to be non-discriminatory. This is done by calculating the number of papers for which the MeSH term in question is a major topic header. If the number of papers exceeds a pre-specified threshold, the MeSH term is removed from the working list. The original list of MeSH terms, with any non-discriminatory values removed, is returned to the caller.
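Taken together, the workflow in Figure 5 amounts to chaining these services: parse a report for PubMed identifiers, expand and filter MeSH terms, and then generate the corpus and the MeSH-based clustering. A rough, non-SOAP sketch of that coordination logic is given below; the function names mirror the services described above, but the `services` object and its operations are placeholders standing in for the SOAP calls actually made by the Taverna/Freefluo enactment of the Scufl workflow.

def run_text_mining_workflow(swissprot_record, services):
    """`services` is any object exposing operations analogous to those
    described above; in the real system each is a separate SOAP Web service."""
    seed_pmids = services.parse_swissprot(swissprot_record)  # ParseSwissProt
    mesh_terms = services.get_mesh(seed_pmids)               # GetMeSH
    useful_terms = services.filter_mesh(mesh_terms)          # FilterMeSH
    related_pmids = services.mesh_to_pmid(useful_terms)      # MeSHtoPMID
    corpus_xml = services.generate_corpus(related_pmids)     # generateCorpus
    clusters_xml = services.cluster(related_pmids)           # Cluster
    return corpus_xml, clusters_xml                          # workflow outputs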
Visualisation Support

In our hybrid system we have combined the novelty of text mining with the power of relational databases and the flexibility of Web services. But enriched content also requires specialised presentation and visualisation support. While a DBMS can be used as a bottom-up approach to store structure related to unstructured data, visualisation requirements imply that a top-down approach may be needed to provide efficient browsing and navigation through the volume of retrieved data. We have therefore developed a visualisation client that allows renditions to be generated automatically from content abstractions, using the clustering capability of the Medical Subject Headings (MeSH; http://www.nlm.nih.gov/mesh/meshhome.html) and the navigating capabilities of hypertext. Visualisation is based on the hierarchical clustering of documents according to the MeSH taxonomy, which is very familiar to bioinformaticians. However, the client could easily be adapted to use other taxonomies, such as the Gene Ontology, or data-driven text clusters derivable using clustering algorithms such as k-means or agglomerative clustering.

A tree-like visualisation of the hierarchical clusters allows for inter-document navigation through the results of a particular in-silico experiment by expanding or collapsing term nodes that link to subsets of the retrieved document set. Each document subset is represented as a list of titles in a text pane, with each title being a hyperlink that displays the Medline abstract for that particular document. This simple but effective way of organising the information is supplemented with inline navigation by hyperlinking the domain entities, such as the names of diseases, genes, species, and so forth. The annotations of such entities are rendered in different colors corresponding to the different types of biomedical classes and are used as anchors for intra-document navigation. For example, starting with a specific term, the user has the capability to navigate through all documents in the collection that contain this term, or even start a new search from Medline and compare results. We believe that this visual form of the mined data provides value-added information compared to services such as PubMed, as it should be easier for the user to quickly browse and judge the relevance of the text content. It may also provide the basis for associative learning algorithms that will aim to discover relationships or patterns between information items in the hierarchy and in the documents. A screen shot of the visualisation client implementation in our system is shown in Figure 6.

Our hybrid architecture has many similarities with the Model-View-Controller (MVC) design pattern (Gamma, Helm, Johnson, & Vlissides, 1995; Buschmann, Meunier, Rohnert, & Sommerlad, 1996). In our system, the database for storing the stand-off annotations could be viewed as the Model, the workflow component as the Controller and the client viewer (a Java applet in our system) as the View. When the user performs an action, such as, for example, entering a search term for Medline, the controller maps this action to a service request that is transformed to a MySQL query at the Web service end. After the database transaction has taken place, the text classification data and annotations are merged with the original documents and the final result is passed back to the viewer for rendering and visualisation.
This architecture supports incremental and parallel component development and the ability to provide custom views over the same set of application data.
Figure 6. Text extraction results viewer
DISCUSSION

The development work we have carried out has found three main areas of interest and concern within the Web services architecture. Firstly, there were implementation problems, centred on incompatibilities between different implementations of SOAP and WSDL; secondly, there were re-use and flexibility issues centred on the building of atomic, modular Web services rather than producing a single large application; and finally, there was the problem of producing meaningful and usable information from the data provided by the Web services. Each of these areas will be explored in some detail below.
Implementation Issues

A number of different SOAP/WSDL implementations were explored, and while Web service communication using any given package to develop both Web service and client proved to be comparatively straightforward, attempting to integrate Web services and clients developed using different packages was found to be rather more problematic. The problems were particularly acute when using a SOAP/WSDL implementation in a strongly typed language, such as the Java Axis implementation (http://ws.apache.org/axis/), to communicate with a SOAP/WSDL implementation in a weakly or untyped language, such as the Perl SOAP::Lite package (http://www.soaplite.com/) or the PHP NuSOAP package (http://dietrich.ganx4.com/nusoap/index.php). For example, the WSDL generated to describe a Perl-based Web service, while valid in XML terms and perfectly usable from a Perl-based client, may cause problems when used by a Java-based client. The simplest approach to this problem would have been to mandate the use of a single implementation of SOAP/WSDL. However, this approach would have run contrary to the Web services principle of abstracting away from specific operating systems and programming languages, and would have introduced the communications overhead of Web services with none of the interoperability advantages. It was discovered during the development of the text mining Web services that the differences between the WSDL implementations of different packages can be very subtle. For example, while the WSDL interface for a Web service in two different languages may be almost identical and in both cases syntactically valid according to the WSDL schema (http://www.w3.org/TR/wsdl.html#A4.1), this is no guarantee of interoperability. The practical upshot of this is that even when using well-formed SOAP and WSDL, interoperability between packages is not a given, and significant testing, and possibly manual recoding of automatically generated WSDL, is required by end user application and Web service developers.
Re-Use and Flexibility Issues

The second area of concern is a design issue. Although it is quicker and easier to build a single large application and expose a Web service API to that application, in the long term it is beneficial to build atomic Web services that can be reused in multiple applications. This parallels the standard programming model of producing a number of re-usable component functions with a separate process control structure, which coordinates the component functions to produce an overall composite functional process. In the Web services environment each Web service performs the role of a function, with the client application taking the role of the control structure. Each new workflow or application is a new coordinating control structure calling a set (or subset) of the Web services to perform an overall functional process. This type of re-usability is only possible if an architecture supporting atomic Web services has been implemented. Significant discussion of the merits of a modular versus monolithic design is available in any text on software engineering; however, it is still an issue for Web services-based design due to the communications overhead of encapsulating data in SOAP envelopes and transmitting it over HTTP. It was found that the time taken to encode data for transmission was greater than the time taken to process or transfer that same data. This was particularly true for large bodies of textual data or XML data, which must be escaped/unescaped or wrapped in CDATA tags for transmission.
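To make the last point concrete, the sketch below shows the two common ways of preparing a block of extracted text for inclusion in an XML/SOAP payload: entity escaping and CDATA wrapping. For large documents this encoding step is precisely the overhead referred to above. The helper names and the enclosing element are illustrative only and are not part of any of the packages mentioned.

from xml.sax.saxutils import escape

def as_escaped_element(text):
    # Entity-escape the text so it can sit directly inside an XML element.
    return "<document>" + escape(text) + "</document>"

def as_cdata_element(text):
    # Wrap the text in a CDATA section instead; any literal "]]>" sequence
    # must still be split so that it cannot terminate the section early.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<document><![CDATA[" + safe + "]]></document>"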
End User Functionality Issues

The final, and potentially most important, area of concern highlighted by the development process is that the typical Web service produces streams of data which require formatting in some way to produce a view over that data which is meaningful and useful for the end user. The end user presentation should form an integral part of the overall Web service application design, with the type of data provided by the Web services, the domain in which the data is to be used, and the end user requirements all being taken into account. In the example e-science application discussed above, Web services were used to produce data in a way that can be parsed and used by a visualisation applet. Particularly when dealing with text, it is very easy to overload the end user with very large amounts of data, so some way of organising and presenting the data in a consistent and comprehensible way is required. The data presentation system chosen was to produce a corpus, consisting of the selected subset of publishers' data merged with the stand-off annotations extracted by the text processing logic, along with an index to allow organised navigation of the corpus. The index used in the initial implementation was a representation of the MeSH tree with the selected subset of papers clustered on the nodes of the tree as dictated by their associated MeSH terms. This produced an index which was both comparatively easy to navigate (each node had relatively few papers, so the end user would not be overloaded with data) and gave a semantic overview of the corpus that had been generated and returned. Using the pre-defined MeSH hierarchy is only one possible approach to clustering, and one that is dependent on the publishers' data being annotated with MeSH terms; other forms of clustering, such as subsumption algorithms (Sanderson & Croft, 1999), are also being investigated for cases where MeSH terms are not available.
CONCLUSION

In this chapter we have explored the issues involved in integrating the text mining and text delivery process into a wider Web service-based e-science environment. In particular, we addressed Web service-based architectures which support the roles and interests of the three main interacting parties in this area: publishers, text mining application providers, and end users of published material and text mining results. We have analysed three high-level architectural design alternatives and exemplified the discussion with reference to a fully implemented Web service-based e-science demonstrator.

Realistic scientific text mining applications appear to require some form of pre-processing. Given text volumes and current text mining algorithms, on demand text processing is at present too slow to provide much more than very limited semantic content extraction (such as entity annotation, if even that) of individual papers. In order to index a large number of papers and potentially reveal novel cross-paper, and even cross-discipline, information, an off-line system of bulk processing and cross-referencing needs to be implemented. In terms of the proposed text mining architectures discussed above, this effectively limits the choices to the "data pre-processing and storage" and "data pre-processing and merging" models. The data pre-processing and storage model offers improved speed and reliability, as the data to be served is held in a single place, so potential problems of partial retrieval or mismatches during merging are not an issue. Further, data is retrieved by a single Web service call. The data pre-processing and merging model, on the other hand, offers greater flexibility, as annotations can be merged as and when they are required and can be applied to papers returned from mirrors of the publishers' data. Since only the annotations are stored, there is a significantly reduced resource overhead for the specialist text processor. Furthermore, this approach can circumvent copyright/data ownership issues, as the specialist text processor need not store copies of the original texts.

Integrating text mining into the Web services environment does present new challenges. However, the information available within the body of scientific literature is so valuable that the effort involved will be more than compensated for by the utility to the e-science community. Furthermore, Web services appear to offer perhaps the best solution to the difficult real-world problem of deploying text mining at the scale necessary to make it truly effective for the end user, a solution which is feasible both technically and in terms of plausible business models.
REFERENCES

Addis, M., Ferris, J., Greenwood, M., Marvin, D., Li, P., Oinn, T., et al. (2003). Experiences with e-Science workflow specification and enactment in bioinformatics. In Proceedings of UK e-Science All Hands Meeting 2003 (pp. 459-467).

Axis. (n.d.). Retrieved from http://ws.apache.org/axis/

Buschmann, F., Meunier, R., Rohnert, H., & Sommerlad, P. (1996). Pattern-oriented software architecture: A system of patterns, Vol. 1. Wiley.

Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002). GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02).

Freefluo. (n.d.). Retrieved from http://freefluo.sourceforge.net/

Gaizauskas, R., Hepple, M., Davis, M., Guo, Y., Harkema, H., Roberts, A., et al. (2003). AMBIT: Acquiring medical and biological information from text. In Proceedings of UK e-Science All Hands Meeting 2003.

Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1995). Design patterns: Elements of reusable object-oriented software. Addison-Wesley.

Ghanem, M., Guo, Y., Rowe, A., Chortaras, A., & Ratcliffe, M. (2004). A grid infrastructure for mixed bioinformatics data and text mining (Tech. Rep.). Imperial College, Department of Computing.

Grover, S., Halpin, H., Klein, E., Leidner, J. L., Potter, S., Riedel, S., et al. (2004). A framework for text mining services. In Proceedings of UK e-Science All Hands Meeting 2004. Retrieved from http://www.allhands.org.uk/2004/proceedings/papers/281.pdf

Harkema, H., Gaizauskas, R., Hepple, M., Roberts, A., Roberts, I., Davis, N., et al. (2004). A large scale terminology resource for biomedical text processing. In Proceedings of Linking Biological Literature, Ontologies and Databases: Tools for Users, Workshop in Conjunction with NAACL/HLT 2004.

Hearst, M. (1999). Untangling text data mining. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL99).

IBM WebSphere Software. (n.d.). Retrieved from http://www-306.ibm.com/software/Websphere/

Kubo, H., Uramoto, N., Grell, S., & Matsuzawa, H. (2002). A text mining middleware for life science. Genome Informatics, 13, 574-575.

Medical Subject Headings. (n.d.). Retrieved from http://www.nlm.nih.gov/mesh/meshhome.html

MySQL. (n.d.). Retrieved from http://www.mysql.com

Nasukawa, T., Kawano, H., & Arimura, H. (2001). Base technology for text mining. Journal of Japanese Society for Artificial Intelligence, 16(2), 201-211.

NLM Web Services API. (n.d.). Retrieved from http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/eutils.wsdl

NuSOAP. (n.d.). Retrieved from http://dietrich.ganx4.com/nusoap/index.php

Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M., Greenwood, M., et al. (2004, June 16). Taverna: A tool for the composition and enactment of bioinformatics workflows. Bioinformatics Journal. doi:10.1093/bioinformatics/bth361

Remote Methods. (n.d.). Retrieved from http://www.remotemethods.com/

Sanderson, M., & Croft, W. B. (1999). Deriving concept hierarchies from text. In Proceedings of SIGIR99, the 22nd ACM Conference on Research and Development in Information Retrieval (pp. 206-213).

SOAP::Lite. (n.d.). Retrieved from http://www.soaplite.com/

Stevens, R., Robinson, A., & Goble, C. A. (2003). myGrid: Personalised bioinformatics on the information grid. In Proceedings of the 11th International Conference on Intelligent Systems in Molecular Biology.

SwissProt. (n.d.). Retrieved from http://us.expasy.org/sprot/

Taverna. (n.d.). Retrieved from http://taverna.sourceforge.net/

Web Services Description Language specifications. (n.d.). Retrieved from http://www.w3.org/TR/wsdl

Web Services specifications. (n.d.). Retrieved from http://www.w3.org/2002/ws/

WSDL schema reference. (n.d.). Retrieved from http://www.w3.org/TR/wsdl.html#A4.1
Endnote

1. A short glossary of relevant bioinformatics terms is included in the Appendix.
APPENDIX

The system used several database tables to store information about Medline abstracts, the annotations obtained from the output of the AMBIT system, and the MeSH terms associated with those abstracts. The articles table, for example, stores various information about an abstract, such as its PMID, journal, date of publication, authors, title and abstract, whilst the EntityIndex table stores, for each entity name the AMBIT system found in the abstract, its start and end offset, type and actual string. Other tables (not listed) store various information about MeSH terms so that one can, for example, retrieve the corresponding abstracts based on given MeSH terms.
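A rough, hypothetical reconstruction of the two tables described above is sketched below; the column names and types are illustrative guesses based only on the description, not the schema actually used (the real system was implemented in MySQL).

import sqlite3

# Hypothetical schema consistent with the description above, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (
    pmid TEXT PRIMARY KEY, journal TEXT, pub_date TEXT,
    authors TEXT, title TEXT, abstract TEXT);
CREATE TABLE EntityIndex (
    pmid TEXT, entity_name TEXT, start_offset INTEGER,
    end_offset INTEGER, entity_type TEXT, surface_string TEXT);
""")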
This work was previously published in International Journal of Web Services Research, Vol. 3, Issue 4, edited by L. J. Zhang, pp. 95-112, copyright 2006 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).
About the Contributors
Nitin Agarwal (ASU) is a doctoral candidate in computer science and engineering at Arizona State University. His primary interest lies in knowledge extraction in social media especially Blogosphere. This includes studying, developing computational models for phenomena like influence, trust, familiar strangers in Blogosphere and evaluation of such models. His research interests also includes clustering high dimensional data with its applications ranging from recommender systems to web categorization. He has proposed a novel subspace clustering based framework (SCuBA) for research paper recommender system. He is the author of two book chapters and many publications in prestigious forums on social networks and blogosphere, high-dimensional subspace clustering, webpage categorization. He has reviewed several journal and conference papers for AAAI, PAKDD, EC-WEB, WAIM, International Journal on Information Science, CIKM, IJCSA, APWeb, SDM and JCISE. He was involved in framing an NSF grant proposal about sensor data integration in collaboration with Arizona Department of Environmental Quality (AZDEQ) and Department of Mechanical and Aerospace Engineering at Arizona State University. He has been involved in projects from NSF at Hispanic Research Center in Arizona State University and has experience in database programming for developing applications to help Hispanic students in applying to US universities. Abdelmalek Amine received an engineering degree in computer science from the Computer science department - Djillali Liabes University of Sidi-Belabbes (Algeria) in 1989. He worked in industry for many years. He received the magister diploma in computational science from the University Center of Saida (Algeria) in 2003. He is currently working towards his doctorate degree (PhD) at Djillali Liabes University. He is member of Evolutionary Engineering and Distributed Information Systems laboratory at U.D.L, associate master at Computer science Department of University Center of Saida and lecturer at Computer science Department of Djillali Liabes University. His research interests include data mining, text mining, ontology, classification, clustering, neural networks, genetic programming, biomimetic optimization method. V. Suresh Babu received his ME (computer science) from Anna University, Chennai in 2004. Since January 2005, he is working for his PhD at the Indian Institute of Technology-Guwahati, India. His research interests include pattern recognition, data mining and algorithms. Barbro Back is a professor of accounting information systems at the Åbo Akademi University in Turku, Finland. Her research interests are in the areas of data and text mining, neural networks, financial benchmarking and enterprise resource planning systems. She has published in journals such as
Journal of Management Information Systems, Accounting, Management and Information Technologies, International Journal of Accounting Information Systems, Information & Management, and European Journal of Operational Research. Ladjel Bellatreche received his PhD in computer science from the Clermont Ferrand University France in 2000. He is an assistant professor at Poitiers University France since 2002. Before joining Poitiers University, he has been a visiting researcher at Hong Kong University of Science and Technology (HKUST) from 1997 to 1999 and also been a visiting researcher in Computer Science Department at Purdue University - USA during summer 2001. He has worked extensively in the areas: distributed databases, data warehousing, heterogeneous data integration using conceptual ontologies, designing databases using ontologies, schema and ontologies versioning. He has published more than 60 research papers in these areas in leading international journal and conferences. Ladjel has been associated with many conferences as program committee members. Mohammed Bennamoun received his MSc from Queen’s University, Canada in the area of control theory, and his PhD from Queen’s/QUT in Brisbane, Australia in the area of Computer Vision. He lectured at QUT from 1993 to 1998. In 2003, he joined the School of Computer Science and Software Engineering (CSSE) at The University of Western Australia (UWA) as an associate professor. He has been the Head of CSSE at UWA since February 2007. He published over 100 journal and conference publications, and served as a guest editor for a couple of special issues in international journals, such as IJPRAI. His areas of interest include control theory, robotics, obstacle avoidance, object recognition, artificial neural networks, signal/image processing and computer vision. Pushpak Bhattacharyya. Dr. Pushpak Bhattacharyya is a professor of computer science and Engineering at IIT Bombay. He did his bachelor's (1984), Master's (1986) and PhD (1993) from IIT Kharagpur, IIT Kanpur and IIT Bombay respectively. He was a visiting research fellow in MIT, USA in 1990, a visiting professor in Stanford University in 2004 and a visiting Profesor in Joseph Fourier University in Grenoble, France in 2005 and 2007. Dr. Bhattacharyya’s research focus is on natural language processing, machine learning, machine translation and information extraction, and has published over 100 research articles in leading conferences and journals. He has been program committee members and area chairs for MT Summit, ACL, COLING and many such top research fora. In 2007, Prof. Bhattacharyya received research grant awards in language technology from Microsoft and IBM. Theo J. D. Bothma (BA, MA, DLitt et Phil). Theo Bothma is professor and head of the Department of Information Science at the University of Pretoria. His teaching and research focus on the electronic organization and retrieval of information (including cross-language information retrieval and databases for indigenous knowledge), web development and electronic publishing, as well as on curriculum development. He has been a consultant for a number of e-projects at the library of the university and for web development at various private companies. Francesco Buccafurri is a professor of computer science at the University “Mediterranea” of Reggio Calabria, Italy. In 1995 he took the PhD degree in computer science at University of Calabria. 
In 1995 he was visiting researcher at the Information System Department of the Vienna University of Technology,
Austria, where he joined the data base and AI group. His research interests include deductive-databases, knowledge-representation and nonmonotonic-reasoning, model-checking, information-security, data-compression, histograms, data-streams, agents, P2P-systems. He has published several papers in top-level international journals and conference proceedings. He served as a referee for international journals and he was member of conference PCs. Will Cameron is completing his master's degree in computer science at Villanova University in December 2007. His work as an undergraduate student in computer science and the honors program included client side personalization and a singe signon for digital libraries. As a graduate research assistant, he has participated in porting the CITIDEL digital library to DSpace and has contributed to the representation of course syllabi and the automatic generation of a library of syllabi. Gianluca Caminiti holds a PhD degree, received in March 2006 from the University “Mediterranea” of Reggio Calabria, Italy. In 2002 he took the “Laurea” degree in Electronics Engineering. His research interests cover the field of artificial intelligence, including multi-agent systems, logic programming, knowledge representation and non-monotonic reasoning. He has published a number of papers in toplevel international conference proceedings. He has been involved in several national and international research projects. Luis M. de Campos, born in 1961, reveived his MS degree in mathematics in 1984. He completed his PhD thesis in 1988 and became lecturer in computer science in 1991 at the University of Granada, Spain. He is currently full professor at the Department of Computer science and Artificial Intelligence at the University of Granada. His current areas of research interest include probabilistic graphical models, machine learning and information retrieval. Lillian Cassel. Dr. Lillian Cassel is professor of computing sciences at Villanova University where she supervises several digital library related projects, including the Computing and Information Technology Interactive Digital Education Library (CITIDEL). She serves on the ACM Education Board and heads efforts in generating an ontology of all of the computing disciplines and also leads the effort to summarize characteristics and requirements of master's level programs in the computing disciplines. Xin Chen. Dr. Chen is affiliated with Microsoft. His research interests are information retrieval, text mining and digital libraries. His research has been published in journals such as Journal of the American Society for Information Science and Technology, and Information Retrieval, and also in proceedings of Knowledge Discovery and Data Mining (ACM KDD), and Conference on Information and Knowledge Management (ACM CIKM). Francisco M. Couto graduated (2000) and has a master's (2001) in informatics and computer engineering from the Instituto Superior Técnico, Portugal. He obtained a PhD (2006) in informatics, specialization bioinformatics, from the Universidade de Lisboa. He was an invited researcher at several European institutes during his PhD studies. He is currently an assistant professor with the Informatics Department at Faculdade de Ciências da Universidade de Lisboa. He teaches courses in information systems and biomedical informatics. He is currently a member of the LASIGE research group. His research interests include BioInformatics, text mining, data mining, data management and information systems.
Xiaohui Cui. Dr. Cui is a research associate staff in the applied software engineering research Group of ORNL. Dr. Cui’s research interests include swarm intelligence, intelligent agent design, agent based modeling and simulation, emergent behavior engineering, human/social cognitive modeling, sensor network, distributed computing, information retrieval and knowledge discovery. He has authored numerous publications in these areas. He is a member of ACM, North American Association for Computational Social and Organizational Sciences and IEEE computer society. He served as program committee member and special session chair for several international conferences. Alexander Dreweke studied computation engineering (BSc), mathematics (BSc), and computer science at the Friedrich-Alexander University of Erlangen-Nuremberg in Germany. Currently, he is working as PhD student on the usage of heuristic mining algorithms in code size reduction for embedded systems at the same university. In his spare time he is working on automatically identifying genetic diseases and mutations in the human genome in association with the Institute of Human Genetics Hospital Erlangen at the same university. Zakaria Elberrichi received his master's degree in computer science from the California State University in addition to PGCert in higher education. Currently he is a lecturer in computer science and a researcher at Evolutionary Engineering and Distributed Information Systems Laboratory, EEDIS at the university Djillali Liabes, Sidi-Belabbes, Algeria. He has more than 18 years of experience in teaching both BSc and MSc levels in computer science and planning and leading data mining related projects. The last one called “new methodologies for knowledge acquisition”. He supervises many master students in e-learning, text mining, web services, and workflow. Weiguo Fan. Dr. Weiguo Fan is an associate professor of information systems and computer science at Virginia Tech (VPI&SU). He received his PhD in information systems from the Ross School of Business, University of Michigan, Ann Arbor, in July 2002. He has published more than 80 refereed journal and conference papers on data mining, information integration, text mining, digital libraries, information retrieval and online communities. His text mining research on ranking function discovery for search engine optimization, question answering on the Web, and text summarization has been cited more than 400 times according to Google Scholar. His research is currently funded by four NSF grants and one PWC grant. Daniel Faria is doing an informatics-bioinformatics PhD study on protein automated functional classification, in FCUL (Faculty of Sciences – University of Lisbon). He is a member of the LASIGE research group. His work is being supervised by Dr. André Falcão and co-supervised by Dr. António Ferreira. He has a degree in biological engineering, from IST (Lisbon Technical University) and a post-graduate course in bioinformatics, from FCUL. He is interested in using informatics to help understand and model biological systems, and he is also curious about the influence of biological systems in informatics. Juan M. Fernández-Luna got his computer science degree in 1994 at the University of Granada, Spain. In 2001 he got his PhD at the same institution, working on a thesis in which several retrieval models based on Bayesian networks for information retrieval where designed. This subject is still his current research area, although moving to structured documents. 
He has got experience organising international conferences and has been guest editor of several special issues about Information Retrieval.
He is currently associate professor at the Department of Computer science and Artificial Intelligence at the University of Granada. Ingrid Fischer studied computer science and Linguistics at the Friedrich-Alexander University of Erlangen-Nuremberg in Germany. A PhD at the same university dealt with the graph-based programming of neural networks. She worked at the Language Technology Institute at Carnegie Mellon University, Pittsburgh, and the International Computer Science Institute, Berkeley, for two years. Now she is lecturing at the University of Konstanz in Germany. Her current research deals with graphs and graph transformations, computational linguistics and data mining. Edward A. Fox. Dr. Edward A. Fox holds a PhD and MS in computer science from Cornell University, and a BS from M.I.T. Since 1983 he has been at Virginia Tech (VPI&SU), where he serves as professor of computer science. He directs the Digital Library Research Laboratory and the Networked Digital Library of Theses and Dissertations. He has been (co)PI on 100 research and development projects. In addition to his courses at Virginia Tech, Dr. Fox has taught 70 tutorials in over 25 countries. He has given 60 keynote/banquet/international invited/distinguished speaker presentations, over 140 refereed conference/workshop papers, and over 250 additional presentations. Filip Ginter holds a PhD in computer science. He is a lecturer at the Department of Information Technology at the University of Turku, and a researcher in the Natural Language Processing Group within the Bioinformatics Laboratory at the Department of Information Technology at the University of Turku in Finland. His main research interests include natural language processing with a particular focus on information extraction in the biomedical domain. Tiago Grego graduated in biochemistry (2006) from the Faculty of Sciences of the University of Lisbon (FCUL). He is a member of the XLDB research group at the LASIGE laboratory of the Informatics Department at the University of Lisbon, where he does research under the supervision of Dr. Francisco M. Couto. His main research interests include bioinformatics, data mining, machine learning and BioOntologies. Marketta Hiissa holds a master’s degree in applied mathematics. She is currently working as a PhD student and a graduate research assistant at the Department of Information Technologies at the Åbo Akademi University, and at Turku Centre for Computer Science (TUCS) in Finland. Her research interests include natural language processing and its applications in health care. Juan F. Huete born in 1967, received his MS degree in Computer science in 1990. He read his PhD Thesis in 1995 in ``Learning Belief Networks: Non-Probabilistic Models’’ and became Lecturer in Computer science in 1997 at the University of Granada, Spain. His current areas of research interest are probabilistic graphical models and its application to information retrieval, recommender systems and text classification.
Helena Karsten is a research director in information systems at the Åbo Akademi University in Finland and the head of the Zeta Emerging Technologies Laboratory at the Turku Centre for Computer science (TUCS). Her research interests include the interweaving of work and computers, the use of IT to support collaboration and communication, and social theories informing theorizing in IS. She is an associate editor for The Information Society and an editorial board member for Information Technology and People. Han-joon Kim received the BS and MS degrees in computer science and statistics from Seoul National University, Seoul, Korea in 1996 and 1998 and the PhD degree in computer science and engineering from Seoul National University, Seoul, Korea in 2002, respectively. He is currently a assistant professor at the University of Seoul’s Department of Electrical and Computer Engineering, Korea. His current research interests include databases, data/text mining, intelligent information retrieval and business intelligence. Jan H. Kroeze (PhD, M.IT). Jan H. Kroeze (born in 1958) is the author of more than twenty articles published in scientific journals or refereed books. He has a National Research Foundation (South Africa) rating as an established researcher and focuses his research on linguistic issues in information systems. Currently he is working on a second PhD and does research on the use of multi-dimensional databanks to store and process linguistic data of biblical Hebrew clauses. Sangeetha Kutty is a PhD student working under the supervision of Dr.Richi Nayak from Queensland University of Technology. She has completed her master's in information technolgoy from Auckland University of Technology(AUT), Auckland, New Zealand. Her Master’s thesis is on “Embedding constraints into Association Rules Mining”. Her current research is mainly focused in the area of XML mining. Other research interests include tree mining, clustering and ontology generation. Gianluca Lax is an assistant professor of computer science at the University “Mediterranea” of Reggio Calabria, Italy. In 2005 he took the PhD degree in computer science at University of Calabria. Since 2002 he has been involved in several research projects in the areas of Web Services, data reduction, quality of service and P2P systems. His research interests include also data warehousing and OLAP, data streams, user modelling, e-commerce and digital signature. He is also author of a number of papers published in top-level international journals and conference proceedings. Fotis Lazarinis is a part-time lecturer at Greek Technological and Educational Institute and has recently submitted his PhD thesis. He is the main author of more than 30 papers in refereed journals, conferences and workshops, 2 book chapters in research handbooks and more than 15 papers in national journals and conferences. He has participated as an organizer or reviewer to a number of international conferences and workshops related to Web retrieval. He has also single authored 18 computer science educational books in Greek. Ki Jung Lee is a doctoral student in the College of Information Science and Technology, iSchool at Drexel University, Philadelphia, Pennsylvania. He received M.A. from the Department of Telecommunication, Information Studies and Media, Michigan State University in 2002. His current research
concerns conceptual data modeling, information retrieval, and ontology learning in information forensics and bioinformatics.

Quanzhi Li. Dr. Quanzhi Li is a research scientist at Avaya Labs Research, Avaya, Inc. His current research interests are in the areas of information retrieval, data mining, text mining, natural language processing, social networking and event processing. His research has appeared in journals such as the Journal of the American Society for Information Science and Technology, the Journal of Biomedical Informatics, and Information Retrieval, and in conference proceedings such as CIKM, SIGIR and AMCIS.

Xiao-Li Li is currently a principal investigator at the Data Mining Department, Institute for Infocomm Research, Singapore. He also holds an appointment as an adjunct assistant professor in SCE, NTU. He received his PhD degree in computer science from ICT, CAS, in 2001. He was then with the National University of Singapore as a research fellow from 2001 to 2004. His research interests include data mining, machine learning and bioinformatics. He has served as a member of the technical program committees of numerous conferences. In 2005, he received the best paper award at the 16th International Conference on Genome Informatics.

Diego Liberati, PhD in electronic and biomedical engineering, Milano Institute of Technology. Director of Research, Italian National Research Council. Author of 50 papers in ISI journals, editor and author of books and chapters. Secretary of the Biomedical Engineering Society of the Italian Electronic Engineering Association (and Milano prize laureate in 1987), he has chaired scientific committees for conferences and grants. A visiting scientist at Rockefeller University, New York University, the University of California and the International Computer Science Institute, he has directed joint projects granted by both private and public institutions and mentored dozens of pupils toward and beyond their doctorates.

Daniel Lichtnow is a professor at the Catholic University of Pelotas, Brazil, where he also works as a system analyst. He obtained an MSc degree in computer science at the Federal University of Santa Catarina in 2001. His interests include database systems, text mining and knowledge management. Contact information: E-mail: [email protected], Av. Fernando Osório, 1459 Pelotas - RS - Brazil, Cep 96055-00.

Pawan Lingras is a professor in the Department of Mathematics and Computing Science at Saint Mary's University, Halifax, Canada. He has authored more than 125 research papers in various international journals and conferences. Recently, he published a textbook entitled Building an Intelligent Web: Theory and Practice. His areas of interest include artificial intelligence, information retrieval, data mining, web intelligence, and intelligent transportation systems. He has served as review committee chair, program committee member, and reviewer for various international conferences on artificial intelligence and data mining. He has also served as general, workshop, and tutorial chair for the IEEE Conference on Data Mining (ICDM) and the ACM/IEEE/WIC Conference on Web Intelligence and Intelligent Agent Technologies (WI/IAT). He is an associate editor of Web Intelligence: An International Journal.

Rucha Lingras is a full-time student of mathematics. She is also a part-time Web developer for Haika Enterprises Inc. Her research interests include multimedia and the interactive Web.
Huan Liu (ASU) is on the faculty of computer science and engineering at Arizona State University. He received his PhD and MS in computer science from the University of Southern California, and his BEng in computer science and electrical engineering from Shanghai Jiaotong University. His research interests include machine learning, data and web mining, social computing, and artificial intelligence, investigating search and optimization problems that arise in many real-world applications with high-dimensional data, such as text categorization, biomarker identification, group profiling, searching for influential bloggers, and information integration. He publishes extensively in his research areas, including books, book chapters and encyclopedia entries, as well as conference and journal papers.

Shuhua Liu. Dr. Shuhua Liu is a research fellow of the Academy of Finland. She is currently a visiting scholar at the Center for the Study of Language and Information, Stanford University. She is also affiliated with the Institute of Advanced Management Systems Research of Åbo Akademi University, Finland. Her research interests include text summarization and information extraction for business intelligence and health care applications, crisis early warning systems, and fuzzy clustering analysis in the empirical study of financial crises.

Wei Liu received her MSc from Huazhong University of Science and Technology (HUST), China, in the area of intelligent control systems, and her PhD from the University of Newcastle, Australia, in the area of multi-agent belief revision. She worked as a lecturer for three years at RMIT University, Australia, before joining the School of Computer Science and Software Engineering (CSSE) at the University of Western Australia (UWA) in 2003. She is involved in several national competitive grants, as well as international collaborations. She has over 30 publications, and has served on the program committees of major international conferences such as AAMAS and IJCAI for many years. Her research interests include ontology engineering, agent collaboration, protocol reasoning, Web service composition, trust evaluation, truth maintenance and belief revision.

Ying Liu. Dr. Ying Liu is presently a lecturer in the Department of Industrial and Systems Engineering at the Hong Kong Polytechnic University. He obtained his PhD from the Singapore-MIT Alliance (SMA) at the National University of Singapore in 2006. His current research interests focus on knowledge mining, namely data, text, Web and multimedia mining, intelligent information processing and management, machine learning, design informatics, and their joint applications in engineering design, manufacturing, and the biomedical and healthcare professions for knowledge management purposes. He is a professional member of ACM, IEEE, ASME and the Design Society.

Stanley Loh is a professor at the Catholic University of Pelotas and at the Lutheran University of Brazil. He has a PhD degree in computer science, obtained in 2001 at the Federal University of Rio Grande do Sul. His research interests include recommender systems, data-text-web mining and technology applied to knowledge management and business intelligence. Contact information: E-mail: [email protected]; R. Ari Marinho, 157, Porto Alegre, RS, Brazil, 90520-300; Phone: 55 51 3337-8220.

Xin Luo. Dr. Xin Luo is currently an assistant professor of computer information systems in the School of Business at Virginia State University. He received his PhD in information systems from Mississippi State University. His research interests center around information security, e-commerce/m-commerce, IS
adoption, and cross-cultural IT management. His research articles have appeared in journals including the Journal of the Association for Information Systems (JAIS), Communications of the ACM, Communications of the Association for Information Systems (CAIS), Journal of Organizational and End User Computing, International Journal of Information Security & Privacy, Information Systems Security, and Journal of Internet Banking and Commerce. He can be reached at [email protected].

Mimoun Malki received his PhD in computer science from Oran University (Algeria) in 2003. He is an assistant professor at Djillali Liabes University of Sidi-Belabbes (Algeria) and director of the Evolutionary Engineering and Distributed Information Systems Laboratory (EEDIS). He has been associated with many conferences as a program committee member. He supervises many master's and PhD students.

Dorina Marghescu holds a master's degree in quantitative economics. She is a PhD student and a graduate research assistant at the Department of Information Technologies at Åbo Akademi University, and at the Turku Centre for Computer Science (TUCS) in Finland. Her research interests include visual data mining, usability evaluation, and information visualization.

Machdel C. Matthee (MSc, DCom). Machdel Matthee completed her DCom (Information Systems) in 1999, after which she joined the Department of Informatics at the University of Pretoria. Before that she was a mathematics lecturer at Vista University, South Africa. Her research focus includes information systems education, data and text mining, and enterprise architecture. On these topics she has presented several papers at international conferences and published journal articles.

Pasquale De Meo received the laurea degree in electrical engineering from the University of Reggio Calabria in May 2002. He holds a PhD in system engineering and computer science from the University of Reggio Calabria. His research interests include user modeling, intelligent agents, e-commerce, e-government, e-health, machine learning, knowledge extraction and representation, scheme integration, XML, and cooperative information systems. He has published more than 30 papers in top journals and conferences.

Faouzi Mhamdi received the licence degree in computer science in 1999 from the Faculty of Sciences of Tunis, Tunisia. He then received a master's degree in computer science from the National School of Computer Science, Tunis, Tunisia. He is now preparing a PhD degree in computer science at the Faculty of Sciences of Tunis. His main research area is knowledge discovery in databases and its application in bioinformatics.

M. Narasimha Murty received his BE, ME and PhD degrees from the Indian Institute of Science, Bangalore, India. He is currently a professor in the Department of Computer Science and Automation. His research interests are in pattern clustering and data mining.

Richi Nayak received the PhD degree in computer science in 2001 from the Queensland University of Technology, Brisbane, Australia. She is a senior lecturer at the School of Information Systems, Queensland University of Technology. Her current research interests include knowledge discovery and data mining, and XML-based systems.
José Palazzo Moreira de Oliveira is a full professor of computer science at the Federal University of Rio Grande do Sul - UFRGS. He has a doctoral degree in computer science from the Institut National Polytechnique - IMAG (1984), Grenoble, France, an MSc degree in computer science from PPGC-UFRGS (1976), and graduated in electronic engineering (1968). His research interests include information systems, e-learning, database systems and applications, conceptual modeling and ontologies, applications of database technology, and distributed systems. He has published about 160 papers and has been advisor of 11 PhD and 49 MSc students. Contact information: E-mail: [email protected]; Instituto de Informática da UFRGS; Caixa postal 15.064, CEP 91501-970, Porto Alegre, RS, Brasil.

Tapio Pahikkala holds a master's degree in computer science. He is currently working as a PhD student and a graduate research assistant in the Natural Language Processing Group within the Bioinformatics Laboratory at the Department of Information Technology at the University of Turku, and at the Turku Centre for Computer Science (TUCS) in Finland. His thesis work is on machine learning methods in bioinformatics and text processing. He has authored more than 20 peer-reviewed publications dealing with this topic.

Bidyut Kr. Patra obtained his MTech degree in computer science & engineering from Calcutta University, Kolkata, India, in 2001. Since July 2005, he has been working toward his PhD at the Indian Institute of Technology Guwahati, India. His research interests include pattern recognition, data mining and operating systems.

Marcello Pecoraro has a degree in business economics from the Federico II University of Naples. He is a PhD student in computational statistics at the Department of Mathematics and Statistics of the University Federico II of Naples and is a tutor for several university statistics courses. His main research areas are web mining, data mining, and classification and regression trees.

Manuel A. Pérez-Quiñones. Dr. Manuel A. Pérez-Quiñones is an associate professor in the Department of Computer Science at Virginia Polytechnic Institute and State University. His research interests include human-computer interaction, personal information management, multiplatform user interfaces, user interface software, and educational uses of computers. Pérez-Quiñones received a DSc in computer science from The George Washington University. He is a member of the ACM and the IEEE Computer Society. Professionally, he serves as a member of the Coalition to Diversify Computing and as a member of the editorial board of the Journal on Educational Resources in Computing. Contact him at [email protected].

Catia Pesquita graduated in cell biology and biotechnology (2005) at the Faculty of Sciences of the University of Lisbon (FCUL). She also has a postgraduate degree in bioinformatics from FCUL (2006) and is currently doing her master's in bioinformatics at FCUL's Department of Informatics, under the supervision of Francisco Couto. She is a member of the XLDB group at the LASIGE laboratory of the Informatics Department at the University of Lisbon. Her research interests include bioontologies, knowledge management, and data and text mining.

Thomas E. Potok. Dr. Potok has been the Applied Software Engineering Research Group Leader at ORNL and an adjunct professor at the University of Tennessee since June 1997. Dr. Potok had a long career in programming at IBM before earning his PhD in computer engineering in 1996 and moving to ORNL
to lead a number of software agent research projects. He has published 50+ journal articles, received 4 issued (approved) patents, currently serves on the editorial boards of the International Journal of Web Services Research and the International Journal of Computational Intelligence Theory and Practice, and also serves on conference organizing or program committees for several international conferences.

Sampo Pyysalo holds a master's degree in computer science. He is currently working as a PhD student and a graduate research assistant in the Natural Language Processing Group within the Bioinformatics Laboratory at the Department of Information Technology at the University of Turku, and at the Turku Centre for Computer Science (TUCS) in Finland. His thesis work focuses on the application of full parsing and machine learning methods to information extraction from biomedical texts.

Yanliang Qi is a PhD student in the Department of Information Systems, New Jersey Institute of Technology. He holds a bachelor's degree in automation from Beijing University of Aeronautics & Astronautics, China. He also has three years of working experience in the media industry. His research interests include information retrieval, text mining, business intelligence, etc. Contact him at [email protected].

Giovanni Quattrone received the laurea degree in electrical engineering from the University of Reggio Calabria in July 2003. He holds a PhD in computer science, biomedical and telecommunications engineering from the University of Reggio Calabria. His research interests include user modeling, intelligent agents, e-commerce, machine learning, knowledge extraction and representation, scheme integration, XML, and cooperative information systems. He has published extensively in top journals and conferences.

Ricco Rakotomalala has been an associate professor in computer science at the University Lumière (Lyon 2) since 1998. He is a member of the ERIC Laboratory. His main research area is knowledge discovery in databases, especially supervised machine learning methods. He designed the software TANAGRA, which implements numerous data mining algorithms and is freely available on the web. It is the successor of SIPINA, a popular free decision tree software for classification.

Ganesh Ramakrishnan. Dr. Ganesh Ramakrishnan is a research staff member at IBM India Research Labs. He also currently holds an adjunct position with the Computer Science and Engineering Department at IIT Bombay. Dr. Ramakrishnan did his bachelor's (2000) and PhD (2006) at IIT Bombay. His main area of research is the application of machine learning in text mining. He has co-authored over 20 papers in leading conferences, filed three patents, presented tutorials at two conferences and served as a program committee member at two international workshops. Dr. Ramakrishnan is co-authoring a book entitled A Handbook of Inductive Logic Programming with Ashwin Srinivasan, which is due to be published in mid-2008 by CRC Press.

Alfonso E. Romero (Granada, Spain) received his MS in computer science from the University of Granada, Spain, in 2005. Since then, he has been a PhD student and grant holder at the Department of Computer Science and A.I. at the University of Granada. His main research interests are probabilistic graphical models and their applications to document categorization.
Tapio Salakoski received his PhD in computer science at the University of Turku in 1997 and is a professor of computer science at the University of Turku and vice director of the Turku Centre for Computer Science (TUCS). He heads a large research group studying machine intelligence methods and interdisciplinary applications, especially natural language processing in the biomedical domain. His current work focuses on information retrieval and extraction from free text. He has 100 international scientific publications, has chaired international conferences, and has led several research projects involving both academia and industry.

Richard S. Segall is an associate professor of computer & information technology at Arkansas State University. He holds a BS and MS in mathematics and an MS in operations research and statistics from Rensselaer Polytechnic Institute, and a PhD in operations research from the University of Massachusetts at Amherst. He has served on the faculty of the University of Louisville, the University of New Hampshire and Texas Tech University. His publications have appeared in Applied Mathematical Modelling, Kybernetes: International Journal of Systems and Cybernetics, Journal of the Operational Research Society, and Encyclopedia of Data Warehousing and Mining. His research interests include data mining, database management, and mathematical modeling.

Sendhilkumar Selvaraju graduated from the erstwhile Indian Institute of Information Technology – Pondicherry, India, in 2002, with a postgraduate degree in information technology. His main topics of interest are Web search personalization, the Semantic Web, Web mining and data mining. He is currently a PhD candidate in Computer Science & Engineering at Anna University, Chennai, India, where he is also a lecturer in the Department of Computer Science & Engineering. He has published his contributions in international journals and conferences. He can be accessed at http://cs.annauniv.edu/people/faculty/senthil.html.

Roberta Siciliano is a full professor of statistics at the University of Naples Federico II. After her degree in economics she took a PhD in computational statistics and data analysis at the University of Naples Federico II, during which time she spent a period of study and research at the Department of Research and Methods of Leiden University. Her main research interests concern computational statistics and multivariate data analysis; more specifically, she has provided methodological and computational contributions on the following topics: binary segmentation, classification and regression trees, validation procedures for decision rules, incremental algorithms for missing data imputation and data validation, semiparametric and nonlinear regression models, and strategies for data mining and predictive learning. Her scientific production consists of over 50 papers, mainly in international journals and in monographs with revised contributions and international editors.

Mário J. Silva holds a PhD in electrical engineering and computer science from the University of California, Berkeley (1994). He held several industrial positions both in Portugal and the USA. He joined the Faculdade de Ciências da Universidade de Lisboa, Portugal, in 1996, where he now leads a research line in data management at the Large-Scale Information Systems Laboratory (LASIGE). His research interests are in information integration, information retrieval and bioinformatics.

Michel Simonet, PhD, born in 1945, is the head of the knowledge base and database team of the TIMC laboratory at the Joseph Fourier University of Grenoble. His group works on the design and the
implementation of knowledge bases and databases using an ontology-based approach, and currently on the integration of information systems using the tools and methodologies they have developed. They work on two main projects: a database and knowledge base management system, named OSIRIS, and a system for database conception and reverse engineering based on original concepts and a new methodology. In recent years Michel Simonet has managed the European ASIA-ITC project GENNERE with China and has been responsible for ontology enrichment in the European IP project Noesis, a platform for wide-scale integration and visual representation of medical intelligence.

Hanna Suominen holds a master's degree in applied mathematics. She is currently working as a PhD student and a graduate research assistant in the Natural Language Processing Group within the Bioinformatics Laboratory at the Department of Information Technology at the University of Turku, and at the Turku Centre for Computer Science (TUCS) in Finland. Her research interests include machine learning, natural language processing, mathematical modeling, and health care information systems.

Mahalakshmi G. Suryanarayanan obtained her Master of Engineering in computer science from Anna University, Chennai, India, in January 2003. She is currently a researcher in the field of argumentative reasoning, pursuing her PhD at the Department of Computer Science and Engineering, Anna University, Chennai, India. Her research interests include reasoning, knowledge representation, data mining, swarm intelligence and robotics, natural language generation, question answering, cognitive poetics and discourse analysis. She has published her contributions in international journals and conferences. She can be accessed at http://cs.annauniv.edu/people/faculty/maha.html.

E. Thirumaran received his MCA degree from the National Institute of Technology at Warangal, India, and is currently pursuing his PhD degree at the Indian Institute of Science, Bangalore, India. His research interests are in collaborative filtering, diagnostics & prognostics, and data mining, with special emphasis on temporal data mining.

Manas Tungare is a third-year PhD candidate in the Department of Computer Science. He is working on making personal information available on multiple platforms, and exploring seamless task migration across devices. He is currently supported by an NSF grant to explore personalization of digital library content to bring it closer to its users.

Domenico Ursino received the laurea degree in computer engineering from the University of Calabria in July 1995. From September 1995 to October 2000 he was a member of the Knowledge Engineering group at DEIS, University of Calabria. He received the PhD in system engineering and computer science from the University of Calabria in January 2000. From November 2000 to December 2004 he was an assistant professor at the University Mediterranea of Reggio Calabria, and since January 2005 he has been an associate professor there. His research interests include user modeling, intelligent agents, e-commerce, knowledge extraction and representation, scheme integration and abstraction, semi-structured data and XML, and cooperative information systems.

P. Viswanath received his MTech (computer science) from the Indian Institute of Technology, Madras, India, in 1996. From 1996 to 2001, he worked as a faculty member at BITS-Pilani, India, and Jawaharlal Nehru Technological University, Hyderabad, India. He received his PhD from the Indian
Institute of Science, Bangalore, India, in 2005. At present he is working as an assistant professor in the Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, India. His areas of interest include pattern recognition, data mining and algorithms.

Tobias Werth studied computer science at the Friedrich-Alexander University of Erlangen-Nuremberg in Germany. He presented a new canonical form for mining in directed acyclic graph databases in his master's thesis. Currently, he is working as a PhD student on parallelizing compilers for homogeneous and heterogeneous multi-core architectures at the same university.

Leandro Krug Wives is an associate professor of computer science at the Federal University of Rio Grande do Sul – UFRGS. He has a doctoral degree in computer science from PPGC-UFRGS (2004), an MSc degree in computer science from PPGC-UFRGS (1999), and graduated in computer science in 1996 at the Catholic University of Pelotas – UCPEL. His research interests include text mining, clustering, recommender systems, information retrieval, information extraction, and digital libraries. Contact information: E-mail: [email protected]; Fernando Machado, 561, ap. 501, Centro – Porto Alegre – RS - Brasil, CEP 90010-321.

Wilson Wong is currently a PhD candidate at the University of Western Australia in the field of ontology learning, under the sponsorship of the Endeavour International Postgraduate Research Scholarships and the UWA Postgraduate Award for International Students. He received his BIT (Hons), majoring in data communication, and his MSc in the field of natural language processing from Malaysia. He has over 20 internationally refereed publications over a period of three years at reputable conferences such as IJCNLP and in journals like Data Mining and Knowledge Discovery. His other areas of interest include natural language processing and text mining.

Marc Wörlein studied computer science at the Friedrich-Alexander University of Erlangen-Nuremberg in Germany. In parallel, he worked for Siemens AG Erlangen in the development of magnetic resonance tomography. Currently, he is working as a PhD student at the same university on parallelizing mining algorithms for multi-core and cluster architectures and on automatic Java code generation for such architectures.

Shuting Xu. Dr. Shuting Xu is presently an assistant professor in the Department of Computer Information Systems at Virginia State University. Her research interests include data mining and information retrieval, information security, database systems, and parallel and distributed computing.

Xiaoyan Yu is a PhD candidate in the Department of Computer Science at Virginia Tech. Her research interests include information retrieval and data mining. She is working on mining evidence for relationships derived from heterogeneous object spaces on the Web.

Yubo Yuan. Dr. Yubo Yuan is an associate professor of mathematics at the University of Electronic Science and Technology of China. He received his PhD from Xi'an Jiaotong University of China. His research interests include data mining, computational finance, and support vector machines.
Qingyu Zhang is an associate professor of decision sciences and MIS at Arkansas State University. He holds a PhD from the University of Toledo. He has published in the Journal of Operations Management, European Journal of Operational Research, International Journal of Production Research, International Journal of Operations and Production Management, International Journal of Production Economics, International Journal of Logistics Management, Kybernetes: International Journal of Systems and Cybernetics, and Industrial Management & Data Systems. He has edited a book, E-Supply Chain Technologies and Management, published by IGI Global. His research interests are data mining, e-commerce, supply chain management, flexibility, and product development.

Jianping Zhang (MITRE) is a principal artificial intelligence scientist and data mining group lead at the MITRE Corporation. He is working on several data mining applications for different US government agencies. Prior to MITRE, he was a chief architect for content categorization at AOL and worked on several web and text mining projects. From 1990 to 2000, he was a professor at Utah State University. He obtained his PhD in computer science from the University of Illinois at Urbana-Champaign in 1990 and his BS from Wuhan University in 1982.

Yu-Jin Zhang, PhD, University of Liège, Belgium. Professor, Tsinghua University, Beijing, China. He was previously with the Delft University of Technology, the Netherlands. His research interests are in image engineering, including image processing, analysis and understanding. He has published more than 300 research papers and a dozen books. He is vice president of the China Society of Image and Graphics and the director of the academic committee of the society. He is deputy editor-in-chief of the Journal of Image and Graphics and on the editorial board of several scientific journals. He was the program co-chair of ICIG'2000, ICIG'2002 and ICIG'2007. He is a senior member of the IEEE.
Index
Symbols 2 × 2 contingency matrix 747
A abnormal detection 700, 707 AdaBoost 111, 115, 116, 119, 122, 123, 124, 125, 126 anaphora resolution 44 anomalies detection 412 ant colony optimization 169, 171, 174, 179 application tuning 413 association rule mining 306, 313, 368, 715 association rules 98, 242, 244, 246, 262, 267, 312, 313, 357, 363, 368 Atype 585, 594, 595 Australian Department of Health and Ageing (DoHA) 786 authorship characterization 707 authorship identification 707 automated text categorization 331 automatic annotation 322 automatic term recognition 529
B back-propagation neural networks (BPNNs) 201–218 agents 214–215 Web text mining system 203–212 feature vector conversion 209 framework 203–204 learning mechanism 210–211 limitations 211–212 main processes 204–208
bag-of-words representation 1 bag of words 789
Bayesian 574, 575, 576, 577, 578, 582, 583, 601, 603 learning 421
BayesQA 574, 575, 582, 599, 603 BayesWN 574, 577, 583, 603 BioLiterature 323, 328, 329 biomedical databases 329 biomedical research 314 BioOntologies 314–330 BioOntology 316, 322, 323, 326, 327, 329 blocking-based techniques 474 blogosphere 646, 647, 651, 656, 657, 660, 662, 663, 665, 668 boosting 111, 112, 113, 115, 118, 119, 122, 123, 124, 125, 126, 127, 133, 187, 338, 601 BoW approach 4 bucket 274, 285, 287 bucket-based histogram 287
C CASE tool 402, 404, 406 classification, naïve Bayes 788 classification error 109 classification rule mining 110 classifications, multiple 798 classification uncertainty 127 clause cube 299 clinical document architecture (CDA) 683 clone detection 627, 644 Cluster category 801 clustering 424 crossed clustering approach 430 methods 429
clustering of a dataset 188 cluster prototypes 723
clusters 801 co-clustering 723 code compaction 644 communication-channel congestion control 412 compare suite 766, 769, 770, 771, 784 complex terms 529 concept-based text mining x, xxii, 346, 358 conceptive landmark 469 Conceptual logs 407 conceptual logs 408 consensus function 188 consumer informatics xiii, xxx, 758, 764 content-based image retrieval (CBIR) 110 content metadata 36 Content mining 605 content personalization 402 content similarity 374, 375, 385, 746 content similarity measure 746 contents in XML document 247, 271 content unit 414 control data flow graph (CDFG) 644 course syllabus 61 crossed dynamic algorithm 430
D data cube ix, xx, 288, 299 data mining, incremental or interactive 448 data mining techniques 249, 250, 252, 260, 266, 267 data modification 473 data partitioning 472 data pre-processing techniques 273 data restriction techniques 474 data stream pre-processing 287 data warehousing 245, 285, 299, 367, 467, 479, 644, 692, 722 deployment landmark 469 descriptor 344 designing Web applications x, xxiii, 401 dicing ix, xx, 288, 294, 299 dimensionality reduction 567, 568, 569, 711 document vii, viii, xv, xviii, 2, 6, 17, 18, 20, 21, 22, 24, 35, 178, 181, 182, 35, 36, 93, 146, 166, 167, 172, 177, 178, 180, 179, 180, 23, 180, 187, 188, 1, 23, 165, 181, 187, 188, 204, 210, 231, 245, 247, 251, 267, 268, 271, 272, 304, 306, 327, 344, 485, 487, 491, 492, 493, 503, 508, 544, 549, 550, 560, 569, 610, 675, 683, 706, 743, 744, 745, 746, 754, 820 document clustering 181, 182, 187 document indexing 2, 344 document keyphrases 23–36, 32 document metadata 36 document representation 3, 13 DoHA (Australian Department of Health and Ageing) 786 domain-specific keyphrase extraction 36
E e-health xii, xxviii, 670, 683 EBIMed 324 electronic commerce 449 EM algorithm 80, 92, 114, 117, 125, 127 embedded system 644 EMOO 47 ensemble scheme 181, 184 entity-relationship [E/R] 404 EUROVOC 334 evolutionary multi-objective optimisation (EMOO) 47 extensible markup language (XML) 205 external measures 747 extrinsic evaluation 747
F feature selection 19, 21, 22, 74, 206, 305, 692 fingerprinting 636, 644 flocking model 180 focus word 603 folksonomy 668 frequent pattern mining 227, 228, 229, 235, 239, 241, 243 frequent patterns 247, 271, 272 frequent patterns mining 247, 271, 272 frequent word sequence 5 full syllabus 74
G Gaussian kernel 794 Gene 317, 319, 321, 323, 324, 325, 328, 330, 691, 692, 693, 751, 752, 754, 756, 757, 833 gene expression 319, 330, 691, 692, 751, 752, 754, 757 gene ontology 317 General Purpose Techniques (GPT) 475 genetic algorithms (GA) 38, 40 genome annotation 752, 757 gold standard 747 GoPubMed 323
H Health Insurance Commission (HIC) 786 health level seven (HL7) 683 HIC (Health Insurance Commission) 786 HIC category 801 hierarchical clustering 180 hierarchical structure 273, 274 histogram 107, 286, 287 HITS Algorithm 617 homogeneity 612 hybrid scheme 181, 185 hypergraph 412 hypertext markup language (HTML) 205 hypertexts 404
I IE 38, 39 Imbalance in Data 723 incremental Web traversal pattern mining (IncWTP) 455 index 325, 336, 544, 747 inflection 545 information extraction (IE) 38 information filtering system 499 information retrieval 19, 20, 21, 32, 36, 76, 93, 109, 126, 138, 147, 164, 180, 187, 199, 270, 303, 312, 313, 322, 329, 342, 344, 347, 349, 357, 384, 420, 498, 499, 501, 543, 544, 545, 547, 559, 560, 562, 567, 568, 569, 572, 573, 600, 602, 605, 622, 676, 681,
705, 721, 735, 744, 745, 749, 750, 757, 764, 780, 781, 783, 805, 838 information retrieval (IR) 38, 39, 545, 749, 757 inlink and outlink 385 interactive Web traversal pattern mining (IntWTP) 459 interestingness 48, 312, 313 internal measures 747 intrinsic evaluation 747
K KDD process 408 KDD scenario 408 KDD scenarios 402, 409 KDT 38, 39 kernel, Gaussian 794 kernel, latent semantic 795 kernel, power 794 keyphrase assignment 24 keyphrase extraction 24 keyphrase identification program (KIP) 23 Kintsch’s Predication 48 knowledge discovery 418
knowledge base assisted incremental sequential pattern (KISP) 453 Knowledge Discovery 273, 275, 285 knowledge discovery 249, 250, 264, 266, 271, 604 knowledge discovery from databases (KDD) 38 knowledge discovery from texts (KDT) 38, 39
L latent semantic analysis (LSA) 43 Latent Semantic Kernel 795 latent semantic kernel (LSK) 787 latent semantic space 569 lemmatization 534, 535, 541, 544, 545 lexico-syntactic patterns 422 linear programming problem 683 linguistic data 288, 289, 291, 292, 295, 298, 299 link similarity 374, 376, 385 logical observation identifiers names and codes
(LOINC) 683 lymphoblastic leukemia 689, 693
M maximal Frequent word sequence (MFS) 3 medical literature analysis and retrieval system online (MEDLINE) 764 medical subject headings (MeSH) 764 Medicare Benefit Schedule (MBS) 786 Megaputer TextAnalyst 773, 784 MGED ontology 319 micro-array 693 minimal spanning tree 400 minimum description length 684, 685, 686, 688, 693 mining Web applications x, xxiii, 401 mining XML 227, 228, 235, 243 model-based methods 715, 723 model testing 74 model training 74 molecular biology 138, 330, 838 MRR 580, 581, 582, 598, 599 multi-agent-based Web text mining system 212–214 implementation 214 structure 212–213
multi-resolution analysis 110 multiple classifications 798 multitarget classification 693
N naïve bayes viii, xvii, 2, 13, 111, 112, 113, 114, 115, 111, 115, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 137, 339, 686, 687, 788 naïve Bayes (NB) algorithm 202 Naïve Bayes classification 788 NAL Agricultural Thesaurus 335 natural-language processing 43 natural language processing 19, 20, 35, 312, 322, 344, 527, 528, 602, 745, 749, 750, 757, 784 natural language processing (NLP) 204, 322, 749, 757 Naïve Bayes 66
Newsgroup mining structure 615 bio-informatics 693 noise addition techniques 473 noun phrase 26 noun phrases 23 noun phrases extractor 27 novelty 40, 301, 306, 308, 310, 311, 312, 313 NsySQLQA 574, 575, 582, 583, 596, 599, 603 nuggets 56
O Off-Line Phase 814 Online Phase 819 ontologies 403 ontology x, xxiii, 163, 199, 244, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 327, 328, 329, 330, 402, 418, 421, 422, 423, 425, 433, 435, 440, 441, 443, 444, 445, 499, 528, 752, 756, 764, 833 building and Web site description 435 construction 421 method 425 evolution 422 management 418–447 Web usage mining 423
open biomedical ontologies 321
P page annotation 402, 404 Pareto dominance 50 part-of-speech (POS) 43 part-of-speech tagger 27 partial syllabus 74 particle swarm optimization 166, 169, 170, 179, 180 partitioning clustering 180 partition of a dataset 188 path traversal pattern mining 451 pattern 410 pattern detection 110 pattern discovery 615 pattern recognition 110, 125, 177, 188, 746
patterns 38, 402, 409, 415 patterns statistically 410 people search 385 performance evaluation measure 747 Personalization 604 personalization 92, 313, 365, 366, 367, 399, 420, 445, 604, 619, 655 person search 385 positive set 94 power kernel 794 pre-processing 403 precision 32 predictive model 139 principal component analysis 556, 568, 685, 693 principal components analysis (PCA) 552, 570 privacy violation 470 probabilistic latent semantic analysis (PLSA) 403 probability based term weighting 13 procedural abstraction 629, 643, 644 prospective landmark 470 protein classification 139, 140 prototype 188, 370, 383, 446 PU learning 94
Q query expansion 356, 545 query term/keyword 545
R range query 287 rank correlation coefficient 747 recall 32 regularity extraction 631, 644 reliable multicast transport protocol (RMTP) 220 Reuters-21578 12, 13, 14, 15, 16, 120, 200, 568 Roc-SVM 75, 78, 85, 87, 92, 94 rotation 292, 293, 299
S S-EM 75, 78, 80, 81, 82, 83, 85, 87, 90, 92, 94
sanitization-based techniques 474 SAS text miner 202, 222, 766, 769, 770, 771, 773, 780, 784 secure multi-party computation (SMC) 472 selective sampling 115, 119, 127 selector 603 self-organizing map (SOM) 203 self organizing maps of Kohonen 200 semantic expansion 358 semantic heterogeneity 723 semantic measure 164 semantic similarity 47 Semantic Web 442 mining 442 visualisation 442
SemSim 44 sequential pattern mining 424, 452 similarity transformation 110 singular vector decomposition (SVD) 44 slicing ix, xx, 288, 291, 293, 294, 299 sliding window 282, 287 social “friendship” network 668 social network analysis 668, 697, 698, 707 space transformation techniques 473 SPEA 50 SPSS mining for clementine 784 statistical language models 25 stemming 385, 540, 545 stemmisation technique 789 stopwords 535, 545 strength Pareto evolutionary algorithm (SPEA) 50 structure mining 605 structures in XML document 247 subgraph 234, 247, 271, 644 subtree 236, 247, 268, 272 suffix tree 645 supervised classification 93, 344 supervised learning 64 support vector machin 788 support vector machine methodology 792 support vector machines (SVM) 61–74 swarm intelligence 165, 169, 178, 180 syllabi 61 syllabi, defined classes 62 syllabus component 74
syllabus entry page 74 synonymy 44, 256, 723 systematized nomenclature of medicine (SNOMED) 683 systems biology 757
T template 408, 409 temporal analysis 353, 358 termhood xi, xxv, 500, 502, 507, 521, 522, 526, 529 termhood evidences 529 term weight 385 term weighting 12 term weighting schemes 10 text categorization viii, xvii, 18, 19, 21, 22, 128, 140, 199, 336, 337, 342, 343, 344, 445, 602, 745, 784 text classification 1–22 text classifiers 331 text coherence 49 text mining (TM) 39 thesauri, definition of 334 thesaurus 332 thesaurus, indexing 337 thesaurus, text categorization 336 thesaurus-based automatic indexing 331–345 thesaurus formalization 333 three-dimensional array 291, 292, 293, 294, 299 TM 39 tokenizer 27 tourism 433 traffic analyzers 403 traversal sequence 449 two-step strategy of PU learning 94
U Unified Medical Language System 320 uniform resource identifier (URI) 164 unithood xi, xxv, 500, 501, 502, 506, 514, 521, 526, 529 unlabeled set 95 unsupervised classification 344 usage-mining preprocessing phase 408 Usage mining 605
Usenet Site 607 user
profile 420 session 421, 427
user modeling 446, 622, 683 user profile 683 user profiling x, xxii, 359, 360, 368
V vector space methods 570 vector space model 147, 166, 179, 180, 304, 340, 344, 569 vector space model (VSM) 202 visual text 766, 769, 770, 780, 784 vocabulary mapping 765 vocabulary problem 358
W wavelets 287 Web 408 -based information systems 418 content mining 420 log 426, 441 preprocessing 435 mining 420, 423, 442 structure mining 420 text mining 201–218 usage 441 mining x, xxiii, 418, 420, 441
Web-log 402, 403 Web application development 404 Web content analysis 707 Web Data Techniques (WDT) 475 Web graph 400 Web image search engine 110 Web link analysis 707 Web logs 448 Web mining 448–467 WebML 406 WebML method 404 Web modeling language (WebML) 402 Web site 819 Web structure mining 360, 400 Web traversal x–xiv, xxiv–xxxii, 448–467
Web traversal patterns, mining of 453 Web usage-mining 403 Web usage mining x, xxii, xxiii, 270, 359, 360, 363, 361, 359, 364, 363, 364, 365, 366, 368, 400, 402, 418, 428 WordNet 40 Wordnet 194, 199, 200, 349, 528, 601 WordStat 766, 769, 770, 777, 778, 784 word stemming 385
X XML 407 XML document handling 249, 253 XML frequent content mining 231, 247, 272 XML frequent patterns mining 247, 272 XML frequent structures mining 247, 272 XML standardization 227